Artificial Intelligence (AI) is quickly becoming part of everyday life, and the demand for custom large language models (LLMs) is growing with it. One important step in training an LLM is gathering a significant amount of text data.
In this article, we will show how to automate text collection from an entire website using LLM web scraping techniques. We will build a custom Python script to extract, parse, and save website text so it can be used as a foundation for LLM training datasets.
The source code can be found on GitHub.
TL;DR
To scrape a website for LLM training, start by discovering all URLs (typically via a sitemap), then use a Python script with BeautifulSoup to fetch and parse each page. Extract only the meaningful text and clean it before saving. If you're scraping at scale, use multithreading and route requests through a proxy API (like ScrapingBee) to avoid blocks and rate limits.
Key takeaways
- Training an LLM is time- and resource-intensive, often requiring GPUs or TPUs.
- The dataset size depends on the model: from millions of words (small models) to billions (large foundational models).
- Common frameworks for training include TensorFlow, PyTorch, and Hugging Face Transformers.
- Raw HTML is noisy—clean text extraction is essential to avoid poor training results ("garbage in, garbage out").
Preparing to write a custom script to load training data
So, where do you get the training data for an LLM? One option is to extract text from websites related to the task you're solving, and that's exactly what our custom Python script will do.
Prerequisites
Before we dive into the code, let's quickly cover the prerequisites. I expect that you have:
- Basic knowledge of Python and a general understanding of Python web scraping.
- Already installed Python 3 on your machine.
- Installed your favorite code editor or IDE.
That's basically it!
Setting up your environment
When you're ready to proceed, create a new folder for your Python project. Inside, add two files:
- find_links.py: we'll use this to find all website URLs.
- extract_data.py: this will contain the main script to download website data.
Next, initialize the project with uv (it creates the virtual environment automatically the first time you add a dependency or run a script):
uv init
This command might also create a main.py file, but we won't be using it.
Now install the required dependencies:
uv add beautifulsoup4 requests lxml scrapingbee pandas
Let me briefly cover the tools we're going to use:
- requests — library to make HTTP requests easily.
- beautifulsoup4 — enables us to perform data parsing of HTML content.
- lxml — fast XML/HTML parser used by BeautifulSoup.
- pandas — data analysis and manipulation tool.
- scrapingbee — ScrapingBee Python client used to route requests through proxies and avoid getting blocked, with extra features like JS rendering and screenshots.
Alright, at this point we're ready to go!
Respecting robots.txt and scraping policies
Before scraping any website, check its robots.txt file. This file defines which parts of the site are allowed or disallowed for automated access. While it is not a legal document by itself, it reflects the site owner's preferences and should be respected.
When collecting large amounts of data for LLM training, always review the website's Terms of Service and scraping policies. Make sure your use case is allowed and complies with applicable laws.
Avoid scraping Personally Identifiable Information (PII), such as names, emails, or user-generated private data. Also, be mindful of server load: send requests at a reasonable rate and avoid overwhelming the target site.
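As a rough illustration of the robots.txt check described above, here is a minimal sketch built on Python's standard urllib.robotparser module. The user agent name and the sample robots.txt content are made up for this example, and the helper names build_robot_parser() and is_allowed() are hypothetical, not part of the article's scripts.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent for this sketch; use one that identifies your crawler.
USER_AGENT = "llm-dataset-bot"


def build_robot_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the raw text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if robots.txt permits `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt content, used only to demonstrate the calls.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://example.com/sitemap.xml
"""

parser = build_robot_parser(SAMPLE_ROBOTS_TXT)
blog_allowed = is_allowed(parser, "https://example.com/blog/post")
private_allowed = is_allowed(parser, "https://example.com/private/data")
# site_maps() needs Python 3.8+ and returns None when no Sitemap line exists.
sitemaps = parser.site_maps()
```

For a live site you would typically call parser.set_url() with the site's robots.txt address and then parser.read() instead of passing the text in by hand. The Sitemap entries reported by site_maps() can also be a convenient starting point for the link discovery step.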
Finding all website pages
Now, before we can download any data, there's another important task to solve: we need to understand what pages the target website actually contains. This can be a challenge on its own, so I've prepared a separate article showing a few ways to find all URLs on a domain.
Today, we're going to use a solid but simple approach: scanning the website's sitemap and extracting links from it. Let's open the find_links.py file and import the necessary dependencies:
import csv
from pathlib import Path
import requests
from bs4 import BeautifulSoup as Soup
We'll save the URLs into a CSV file containing the actual link, last modification date, and priority, in case you need that extra data later:
from typing import Final
# Constants for the attributes to be extracted from the sitemap.
ATTRS: Final[tuple[str, ...]] = ("loc", "lastmod", "priority")
Now, let's code the main function:
def parse_sitemap(
    url: str,
    csv_filename: str = "urls.csv",
    visited: set[str] | None = None,
) -> bool:
    """Parse the sitemap at the given URL and append the data to a CSV file."""
    if not url:
        print("No sitemap URL provided.")
        return False

    if visited is None:
        visited = set()

    url = url.strip()

    # Avoid processing the same sitemap more than once.
    if url in visited:
        return True
    visited.add(url)

    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Failed to fetch sitemap {url}: {e}")
        return False

    soup = Soup(response.content, "xml")
    success = True

    # Recursively parse nested sitemaps.
    for sitemap in soup.find_all("sitemap"):
        loc = sitemap.find("loc")
        if loc and loc.text:
            success = parse_sitemap(
                loc.text.strip(),
                csv_filename,
                visited,
            ) and success

    # Find all URL entries in the sitemap.
    urls = soup.find_all("url")
    rows: list[list[str]] = []
    for url_entry in urls:
        row = []
        for attr in ATTRS:
            found_attr = url_entry.find(attr)
            row.append(found_attr.text.strip() if found_attr else "n/a")
        rows.append(row)

    if not rows:
        return success

    # Save the CSV file in the same directory as the script.
    csv_path = Path(__file__).resolve().parent / csv_filename
    file_exists = csv_path.exists()

    try:
        with csv_path.open("a", newline="", encoding="utf-8") as csvfile:
            writer = csv.writer(csvfile)
            if not file_exists:
                writer.writerow(ATTRS)
            writer.writerows(rows)
    except OSError as e:
        print(f"Failed to write sitemap data to {csv_path}: {e}")
        return False

    return success


if __name__ == "__main__":
    parse_sitemap("https://example.com/sitemap.xml")
This code is pretty straightforward:
- We fetch the given sitemap URL.
- Parse the response with Beautiful Soup.
- Look for any nested sitemaps and process them recursively.
- Find every url tag in the sitemap.
- Extract the required attributes from each found URL.
- Save the data to a CSV file.
That's it! Some websites protect themselves from automated scraping, so if your sitemap request is being blocked, you can route it through a proxy API such as the ScrapingBee client, as I'll show in a later section.
To run the script with uv, use:
uv run python find_links.py
Fetching website data from every page
At this point, you should have a urls.csv file with all website links ready for data extraction. Open the extract_data.py file, and let's get down to business.
Simple script to load website data
Let's code the first version of our script to extract data. Start by importing the necessary libraries:
from pathlib import Path
import pandas as pd
import requests
from bs4 import BeautifulSoup
Next, define a few constants that we'll use throughout the script:
INPUT_CSV = Path("urls.csv")
OUTPUT_FILE = Path("extracted_texts.txt")
URL_COLUMN = "loc"
REQUEST_TIMEOUT = 30
Now, let's add a helper function to read URLs from the CSV file:
def load_urls(csv_path: Path) -> list[str]:
    """Load URLs from the sitemap CSV file."""
    try:
        df = pd.read_csv(csv_path)
    except FileNotFoundError:
        print(f"Input file not found: {csv_path}")
        return []
    except pd.errors.EmptyDataError:
        print(f"Input file is empty: {csv_path}")
        return []
    except pd.errors.ParserError as e:
        print(f"Failed to parse CSV file {csv_path}: {e}")
        return []

    if URL_COLUMN not in df.columns:
        print(f"Missing required column: {URL_COLUMN}")
        return []

    return [
        str(url).strip()
        for url in df[URL_COLUMN].dropna()
        if str(url).strip() and str(url).strip().lower() != "n/a"
    ]
Then, add a function to fetch and extract text from a single page:
def fetch_page_text(url: str) -> str | None:
    """Fetch a page and extract readable text from its HTML."""
    try:
        response = requests.get(url, timeout=REQUEST_TIMEOUT)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

    content_type = response.headers.get("Content-Type", "")
    if content_type and "text/html" not in content_type:
        print(f"Skipping non-HTML page: {url}")
        return None

    soup = BeautifulSoup(response.text, "html.parser")

    # Remove elements that usually do not contain useful training text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    return soup.get_text(separator="\n", strip=True)
Finally, save all scraped data to a UTF-8 encoded text file:
def save_texts(texts: list[str], output_path: Path) -> bool:
    """Save extracted texts to a UTF-8 encoded text file."""
    try:
        with output_path.open("w", encoding="utf-8") as file:
            for text in texts:
                file.write(text + "\n\n")
    except OSError as e:
        print(f"Failed to write output file {output_path}: {e}")
        return False

    return True
Here's the first version of the full script:
from pathlib import Path

import pandas as pd
import requests
from bs4 import BeautifulSoup

INPUT_CSV = Path("urls.csv")
OUTPUT_FILE = Path("extracted_texts.txt")
URL_COLUMN = "loc"
REQUEST_TIMEOUT = 30


def load_urls(csv_path: Path) -> list[str]:
    """Load URLs from the sitemap CSV file."""
    try:
        df = pd.read_csv(csv_path)
    except FileNotFoundError:
        print(f"Input file not found: {csv_path}")
        return []
    except pd.errors.EmptyDataError:
        print(f"Input file is empty: {csv_path}")
        return []
    except pd.errors.ParserError as e:
        print(f"Failed to parse CSV file {csv_path}: {e}")
        return []

    if URL_COLUMN not in df.columns:
        print(f"Missing required column: {URL_COLUMN}")
        return []

    return [
        str(url).strip()
        for url in df[URL_COLUMN].dropna()
        if str(url).strip() and str(url).strip().lower() != "n/a"
    ]


def fetch_page_text(url: str) -> str | None:
    """Fetch a page and extract readable text from its HTML."""
    try:
        response = requests.get(url, timeout=REQUEST_TIMEOUT)
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

    content_type = response.headers.get("Content-Type", "")
    if content_type and "text/html" not in content_type:
        print(f"Skipping non-HTML page: {url}")
        return None

    soup = BeautifulSoup(response.text, "html.parser")

    # Remove elements that usually do not contain useful training text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    return soup.get_text(separator="\n", strip=True)


def save_texts(texts: list[str], output_path: Path) -> bool:
    """Save extracted texts to a UTF-8 encoded text file."""
    try:
        with output_path.open("w", encoding="utf-8") as file:
            for text in texts:
                file.write(text + "\n\n")
    except OSError as e:
        print(f"Failed to write output file {output_path}: {e}")
        return False

    return True


def main() -> None:
    urls = load_urls(INPUT_CSV)
    if not urls:
        print("No URLs found. Nothing to extract.")
        return

    all_texts: list[str] = []
    for url in urls:
        text = fetch_page_text(url)
        if text:
            all_texts.append(text)

    if save_texts(all_texts, OUTPUT_FILE):
        print("Text extraction completed successfully!")


if __name__ == "__main__":
    main()
This version already works nicely for a basic web scraping workflow. However, it has two weaknesses: the extracted text is noisy, and your requests may get blocked because many websites try to protect themselves from automated crawlers. The next two sections address each in turn.
Targeting specific elements for cleaner data
Using soup.get_text() on the entire HTML document will extract almost everything: navigation menus, footers, sidebars, ads, and other repeated page elements. This creates noisy data, which is a problem for LLM training. Garbage in = garbage out.
Instead, target the specific content containers that hold the main text. Common options include <article>, <main>, or well-defined <div> elements with content-related classes, such as content, post-body, or article-text. By focusing on these sections, you get cleaner, more relevant text for your dataset.
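The idea can be sketched roughly as follows. This is only an illustration: the selector list is an assumption that you would adapt to the markup of the site you are scraping, and extract_main_text() is a hypothetical helper rather than a function from the scripts above. It tries each candidate container in order and falls back to the whole document when none of them matches.

```python
from bs4 import BeautifulSoup

# Hypothetical selectors; adjust them to the structure of your target site.
CONTENT_SELECTORS = ["article", "main", "div.post-body", "div.article-text", "div.content"]

# Tags that rarely carry useful training text.
NOISE_TAGS = ["script", "style", "noscript", "nav", "footer", "aside"]


def extract_main_text(html: str) -> str:
    """Return text from the first matching content container, or the whole page."""
    soup = BeautifulSoup(html, "html.parser")

    # Drop noisy elements before extracting any text.
    for tag in soup(NOISE_TAGS):
        tag.decompose()

    # Use the first selector that matches something on the page.
    for selector in CONTENT_SELECTORS:
        container = soup.select_one(selector)
        if container is not None:
            return container.get_text(separator="\n", strip=True)

    # No known container found: fall back to the full document.
    return soup.get_text(separator="\n", strip=True)


sample_html = (
    "<html><body><nav>Menu</nav>"
    "<article><h1>Title</h1><p>Body text</p></article>"
    "<footer>Copyright</footer></body></html>"
)
main_text = extract_main_text(sample_html)
```

In the extraction script, a helper like this could replace the final soup.get_text() call inside fetch_page_text(). The order of the selectors matters, so list the most specific container for your site first.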
Using proxies to avoid getting blocked
As we've already discussed in one of the previous articles, scraping without getting blocked can be tricky. You can find the detailed explanation there, but here I'll show you a simple way to use premium proxies without managing them manually.
First, register on ScrapingBee for free. You'll receive 1000 free credits to test the API. After logging in, open your dashboard and copy your API token.
The safest option is to store the token in an environment variable instead of hardcoding it in your script:
export SCRAPINGBEE_API_KEY="YOUR_TOKEN"
On Windows PowerShell, use:
$env:SCRAPINGBEE_API_KEY="YOUR_TOKEN"
Now, return to your Python script and set up a ScrapingBee client:
import os
from scrapingbee import ScrapingBeeClient
api_key = os.getenv("SCRAPINGBEE_API_KEY")
if not api_key:
    raise RuntimeError("Missing SCRAPINGBEE_API_KEY environment variable.")
client = ScrapingBeeClient(api_key=api_key)
Then, find the line where you send a regular request:
response = requests.get(url, timeout=REQUEST_TIMEOUT)
Replace it with the following:
response = client.get(
    url,
    params={
        # Use premium proxies for tougher websites.
        "premium_proxy": True,
        "country_code": "gb",
        # Block unnecessary resources to speed up loading.
        "block_resources": True,
        "device": "desktop",
    },
)
If you're using the fetch_page_text() function from the previous section, it will look like this:
def fetch_page_text(url: str) -> str | None:
    """Fetch a page through ScrapingBee and extract readable text from its HTML."""
    try:
        response = client.get(
            url,
            params={
                "premium_proxy": True,
                "country_code": "gb",
                "block_resources": True,
                "device": "desktop",
            },
        )
        response.raise_for_status()
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

    content_type = response.headers.get("Content-Type", "")
    if content_type and "text/html" not in content_type:
        print(f"Skipping non-HTML page: {url}")
        return None

    soup = BeautifulSoup(response.text, "html.parser")

    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    return soup.get_text(separator="\n", strip=True)
This is it! ScrapingBee will handle the proxy setup for you, so you can focus on extracting the data instead of fighting IP blocks. To learn more about other features, refer to the ScrapingBee Python client documentation.
Using multiple threads
Visiting one page after another is not very efficient, especially when you have hundreds or thousands of pages to process. Since most of the time is spent waiting for HTTP responses, we can speed things up by using multiple threads.
Let's add ThreadPoolExecutor and as_completed to the imports:
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import pandas as pd
import requests
from bs4 import BeautifulSoup
Next, add a constant to control how many pages we process at the same time:
MAX_WORKERS = 5
Now update the main() function:
def main() -> None:
    urls = load_urls(INPUT_CSV)
    if not urls:
        print("No URLs found. Nothing to extract.")
        return

    all_texts: list[str] = []

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = {
            executor.submit(fetch_page_text, url): url
            for url in urls
        }

        for future in as_completed(futures):
            url = futures[future]
            try:
                text = future.result()
            except Exception as e:
                print(f"Unexpected error processing {url}: {e}")
                continue

            if text:
                all_texts.append(text)

    if save_texts(all_texts, OUTPUT_FILE):
        print("Text extraction completed successfully!")
Here we set up five workers to process pages concurrently. Each worker calls the existing fetch_page_text() function, so we don't have to rewrite our scraping logic.
Keep MAX_WORKERS reasonable. Setting it too high may overload the target website or trigger rate limits.
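Limiting the number of workers caps concurrency, but it does not by itself cap how many requests are sent per second. One possible way to add that, shown here as a minimal sketch with a hypothetical RateLimiter class and an arbitrary interval, is a small thread-safe limiter that every worker calls before sending a request.

```python
import threading
import time


class RateLimiter:
    """Let callers through at most once per `min_interval` seconds, across threads."""

    def __init__(self, min_interval: float) -> None:
        self._min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0

    def wait(self) -> None:
        """Block until this caller's reserved time slot arrives."""
        with self._lock:
            now = time.monotonic()
            # Reserve the next free slot while holding the lock.
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self._min_interval
        # Sleep outside the lock so other threads can reserve their slots.
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


# Arbitrary example value: at most one request every 0.05 seconds.
limiter = RateLimiter(min_interval=0.05)

start = time.monotonic()
for _ in range(3):
    limiter.wait()
elapsed = time.monotonic() - start
```

To use it in the script, you could create one shared limiter at module level and call limiter.wait() as the first line of fetch_page_text(). A sensible interval depends on the target site, so treat the value above as a placeholder.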
Implementing retries for additional robustness
When scraping a website, you may run into temporary server errors, timeouts, or network hiccups. Instead of failing immediately, we can retry the request a few times before giving up.
We'll use the tenacity library for this. Install it with uv:
uv add tenacity
Then import it in your Python script:
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential
Now extract the request logic into a separate function and wrap it with a retry decorator:
@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type(requests.RequestException),
    reraise=True,
)
def fetch_with_retry(url: str) -> requests.Response:
    """Fetch a URL with retries."""
    response = requests.get(url, timeout=REQUEST_TIMEOUT)
    response.raise_for_status()
    return response
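This tells tenacity to make up to three attempts in total (the initial request plus two retries), waiting with exponential backoff between attempts, and to re-raise the last exception if all of them fail. Note that raise_for_status() raises requests.HTTPError, which is a subclass of RequestException, so permanent errors such as 404 are retried as well.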
Now update fetch_page_text() to use this new helper:
def fetch_page_text(url: str) -> str | None:
    """Fetch a page and extract readable text from its HTML."""
    try:
        response = fetch_with_retry(url)
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

    content_type = response.headers.get("Content-Type", "")
    if content_type and "text/html" not in content_type:
        print(f"Skipping non-HTML page: {url}")
        return None

    soup = BeautifulSoup(response.text, "html.parser")

    # Remove elements that usually do not contain useful training text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    return soup.get_text(separator="\n", strip=True)
All other helper functions stay the same. The full script below uses the basic requests version; if you've already switched to ScrapingBee, you can apply the same retry pattern to the client.get() call. Here's the updated version of the full script:
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

INPUT_CSV = Path("urls.csv")
OUTPUT_FILE = Path("extracted_texts.txt")
URL_COLUMN = "loc"
REQUEST_TIMEOUT = 30
MAX_WORKERS = 5


def load_urls(csv_path: Path) -> list[str]:
    """Load URLs from the sitemap CSV file."""
    try:
        df = pd.read_csv(csv_path)
    except FileNotFoundError:
        print(f"Input file not found: {csv_path}")
        return []
    except pd.errors.EmptyDataError:
        print(f"Input file is empty: {csv_path}")
        return []
    except pd.errors.ParserError as e:
        print(f"Failed to parse CSV file {csv_path}: {e}")
        return []

    if URL_COLUMN not in df.columns:
        print(f"Missing required column: {URL_COLUMN}")
        return []

    return [
        str(url).strip()
        for url in df[URL_COLUMN].dropna()
        if str(url).strip() and str(url).strip().lower() != "n/a"
    ]


@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type(requests.RequestException),
    reraise=True,
)
def fetch_with_retry(url: str) -> requests.Response:
    """Fetch a URL with retries."""
    response = requests.get(url, timeout=REQUEST_TIMEOUT)
    response.raise_for_status()
    return response


def fetch_page_text(url: str) -> str | None:
    """Fetch a page and extract readable text from its HTML."""
    try:
        response = fetch_with_retry(url)
    except requests.RequestException as e:
        print(f"Error fetching {url}: {e}")
        return None

    content_type = response.headers.get("Content-Type", "")
    if content_type and "text/html" not in content_type:
        print(f"Skipping non-HTML page: {url}")
        return None

    soup = BeautifulSoup(response.text, "html.parser")

    # Remove elements that usually do not contain useful training text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    return soup.get_text(separator="\n", strip=True)


def save_texts(texts: list[str], output_path: Path) -> bool:
    """Save extracted texts to a UTF-8 encoded text file."""
    try:
        with output_path.open("w", encoding="utf-8") as file:
            for text in texts:
                file.write(text + "\n\n")
    except OSError as e:
        print(f"Failed to write output file {output_path}: {e}")
        return False

    return True


def main() -> None:
    urls = load_urls(INPUT_CSV)
    if not urls:
        print("No URLs found. Nothing to extract.")
        return

    all_texts: list[str] = []

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = {
            executor.submit(fetch_page_text, url): url
            for url in urls
        }

        for future in as_completed(futures):
            url = futures[future]
            try:
                text = future.result()
            except Exception as e:
                print(f"Unexpected error processing {url}: {e}")
                continue

            if text:
                all_texts.append(text)

    if save_texts(all_texts, OUTPUT_FILE):
        print("Text extraction completed successfully!")


if __name__ == "__main__":
    main()
Now the script can process multiple pages concurrently and handle temporary request failures more gracefully.
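The retry decorator in the script treats every requests.RequestException the same way, which means a permanent error is retried just like a temporary one. If you want to be more selective, one option is to decide per status code. The sketch below uses only the standard library to illustrate that decision and a capped exponential delay; the set of status codes and both helper names are assumptions for this example, and the delay formula is a generic illustration rather than tenacity's internal calculation.

```python
# Status codes that usually indicate a temporary problem (an assumption for
# this sketch; tune the set for the sites you scrape).
RETRYABLE_STATUS_CODES = frozenset({429, 500, 502, 503, 504})


def is_retryable_status(status_code: int) -> bool:
    """Return True if a response with this status code is worth retrying."""
    return status_code in RETRYABLE_STATUS_CODES


def backoff_delay(attempt: int, base: float = 2.0, cap: float = 10.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based), doubling each time."""
    return min(cap, base * 2 ** (attempt - 1))


# Delays before the first four retries with the default settings.
delays = [backoff_delay(attempt) for attempt in range(1, 5)]
```

With tenacity, a check like is_retryable_status() could be wired in through a predicate passed to retry_if_exception, so that a 404 fails immediately while a 503 is retried.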
Why train your own model?
One question to answer is: why would someone want to train their own model in the first place? Here are a few common reasons:
- Customization: Pre-made models might not meet your specific needs. Training or fine-tuning your own model lets you adapt it to particular tasks, industries, or datasets.
- Privacy: For sensitive applications, a custom model can help keep data within your own environment. There have already been data leaks related to AI usage, so you should be careful when sending potentially sensitive data to third-party tools.
- Performance: Custom models can be optimized for specific types of data or queries, which can help them outperform general-purpose models in narrow use cases.
How long does it take to train a model?
Training large language models can be time-consuming and resource-intensive. The duration depends on several factors, including the model size, the amount of data, and the available compute resources.
For example, training a small or medium-sized model can take several days or even weeks on high-performance GPUs. Frontier-scale models, such as GPT-4-class systems, require massive distributed computing setups, and their exact training time and hardware details are usually not publicly disclosed.
How much data is needed?
The amount of data required depends on the size and complexity of the model. Here are some rough estimates:
- Small models may need around 1–5 million words. For comparison, the entire Harry Potter series contains a little over 1 million words.
- Medium-sized models typically need tens of millions of words.
- Large models may require hundreds of millions to billions of words, while frontier-scale large language models can be trained on billions or even trillions of words.
To put that into perspective, 1 billion words is roughly 900 times the length of the full Harry Potter series.
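To see where your own dataset stands against these estimates, you can count the words in the extracted text. The sketch below is a rough illustration: it splits on whitespace, which is only an approximation of how training pipelines count tokens, and the series word count is a commonly cited figure that should be treated as approximate. Both function names are hypothetical.

```python
# Commonly cited word count for the full Harry Potter series (approximate;
# used here only as a yardstick, as in the comparison above).
HARRY_POTTER_SERIES_WORDS = 1_084_170


def count_words(text: str) -> int:
    """Count whitespace-separated words in a string."""
    return len(text.split())


def series_equivalents(word_count: int) -> float:
    """Express a word count as a multiple of the series length."""
    return word_count / HARRY_POTTER_SERIES_WORDS


sample_words = count_words("Scraped text\n\nfrom two pages")
billion_in_series = series_equivalents(1_000_000_000)
```

For the real dataset, you could read extracted_texts.txt with Path.read_text(encoding="utf-8") and pass the result to count_words().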
Tools for training a model
Training your own language model requires not only powerful hardware but also specialized software. The exact stack depends on whether you're training from scratch, fine-tuning an existing model, or preparing data for later training. Here are some common tools you might use:
PyTorch: An open-source deep learning framework widely used for training and fine-tuning modern language models.
TensorFlow: An end-to-end open-source machine learning platform that can be used for building, training, and deploying neural networks.
Hugging Face Transformers: A library that provides pre-trained models and tools for training, fine-tuning, and inference. It works with frameworks like PyTorch and TensorFlow.
Hugging Face Datasets: A library for loading, processing, and sharing datasets. This is especially useful when preparing scraped text for training or fine-tuning.
PEFT: A library for parameter-efficient fine-tuning. It helps adapt large pre-trained models without updating all model parameters, which can reduce compute and storage costs.
DeepSpeed: A deep learning optimization library used for distributed and large-scale model training.
Vertex AI: A cloud platform for training, deploying, and customizing ML models and AI applications.
CUDA: NVIDIA's toolkit for GPU-accelerated computing. It is essential when training large models on NVIDIA GPUs.
TPUs (Tensor Processing Units): Google's custom accelerators for machine learning workloads. TPUs can speed up training for large models.
So, as you can see, there's a lot to pick from. For most practical projects, you will probably start with PyTorch, Hugging Face Transformers, Hugging Face Datasets, and either PEFT or a managed platform like Vertex AI.
Learn about the best AI web scrapers in our tutorial.
Conclusion
So, in this article we've covered how to scrape text from all pages of a website to collect data for training a language model. We went through finding website pages with sitemaps, data extraction, parsing HTML content, and using proxies and multithreading to make the process more efficient.
Once your scraped data is cleaned and formatted, you can move on to training your model. Remember, though, that freshly scraped text is still raw input: data quality is crucial for building a high-performing model, so deduplicate, clean, and review the dataset before using it.
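As one small example of such cleanup, here is a minimal sketch that drops empty and exactly duplicated texts after normalizing whitespace and letter case. It only catches identical pages, not near-duplicates, and the helper names normalize() and deduplicate() are hypothetical.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase the text for comparison purposes."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct text, skipping empty ones."""
    seen: set[str] = set()
    unique: list[str] = []
    for text in texts:
        normalized = normalize(text)
        if not normalized:
            continue
        # Hash the normalized form so the set stays small for long pages.
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


cleaned = deduplicate(["Hello  world", "hello world", "", "Another page"])
```

In the extraction script, a call like this could be applied to all_texts just before save_texts().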
Feel free to refer to the source code on GitHub for the final version of the scripts. You might also be interested in learning how to employ AI to efficiently scrape website data.
Thanks for staying with me. Happy scraping, and good luck with your model training journey!
Before you go, check out these related reads:
- Scrapegraph AI Tutorial: Scrape websites easily with LLaMA AI
- How to use asyncio to scrape websites with Python
How to scrape text from a website for LLM training: FAQ
Why scrape website text for LLM training?
Scraping website text helps you collect domain-specific training data at scale. For example, if you want a model to understand a specific industry, product, documentation site, or knowledge base, extracting clean text from relevant pages gives you the raw material needed for training or fine-tuning.
Can I scrape any website for LLM training?
Not always. Before scraping a website, check its Terms of Service and robots.txt file. You should also avoid collecting private data, copyrighted content you are not allowed to use, or anything that may violate applicable laws.
How much text do I need to train an LLM?
It depends on the model size and your goal. Small models may work with millions of words, while large language models usually need much larger datasets. In many cases, fine-tuning an existing model requires far less data than training one from scratch.
Is raw scraped data ready for training?
No. Raw scraped data usually contains navigation menus, ads, duplicate text, scripts, footers, and other noise. You should clean, deduplicate, and review the data before using it for training.
Why use proxies for web scraping?
Websites may block repeated requests from the same IP address. Proxies, or a proxy API like ScrapingBee, help reduce blocks and make large-scale web scraping more reliable.
Should I use multithreading when scraping websites?
Yes, but carefully. Multithreading can speed up data extraction by processing multiple pages at once, but too many workers can overload the target website or trigger rate limits.

Ilya is an IT tutor and author, web developer, and ex-Microsoft/Cisco specialist. His primary programming languages are Ruby, JavaScript, Python, and Elixir. He enjoys coding, teaching people, and learning new things. In his free time he writes educational posts, participates in open-source projects, tweets, goes in for sports, and plays music.
