If you need to extract text from HTML with Python, this guide walks you through it step by step, without overcomplicating things. You'll learn what text extraction actually means, which Python libraries make it easy, and how to deal with real-world HTML that's messy, noisy, and inconsistent.
We'll start simple with the basics, then move into practical examples, cleanup strategies, and a small end-to-end pipeline. By the end, you'll know how to turn raw HTML into clean, usable text you can store, analyze, or feed into other systems.

Quick answer (TL;DR)
To extract text from HTML in Python, use BeautifulSoup and its get_text() method. Parse the HTML, remove scripts and styles, then extract the readable content. This covers most use cases and is easy to maintain.
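In code, that boils down to a few lines (a condensed version of the full example later in this guide):
from bs4 import BeautifulSoup

html_source = "<html><body><h1>Hello</h1><script>ignored()</script></body></html>"

soup = BeautifulSoup(html_source, "html.parser")
for tag in soup(["script", "style"]):  # drop non-visible noise
    tag.decompose()

print(soup.get_text(separator=" ", strip=True))  # -> "Hello"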
The sections below go deeper, with practical code samples, cleanup techniques, and real-world examples for more complex scenarios.
You can also check this tutorial for more information: How to turn HTML to text in Python.
What does it mean to extract text from HTML in Python
HTML is the source code behind a web page. It mixes real content with a lot of extra stuff that browsers need to display the page correctly. When you extract text from HTML in Python, you're stripping all that noise away and keeping only what a human would actually read.
- Raw HTML is full of tags like <div> and <span>, links, attributes, inline styles, scripts, and metadata. That's great for rendering a page, but it's not so great for analysis.
- Extracted text is the visible content: headings, paragraphs, list items, and labels. Nothing more.
This matters because most backend tasks don't care about layout. They care about words. Once you reduce HTML to clean text, it becomes much easier to store, search, index, or feed into other systems.
A few basic terms help here:
- The DOM, or Document Object Model, is how browsers and libraries represent an HTML page internally. It's basically a tree.
- Tags are the individual HTML markers like <p> or <h1>.
- Elements are those tags plus whatever is inside them.
Text extraction works by walking the DOM and collecting only the text nodes while skipping things like scripts and styles.
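To make that concrete, here's a tiny BeautifulSoup sketch that walks a snippet's tree and keeps only the text nodes whose parents aren't scripts or styles:
from bs4 import BeautifulSoup

snippet = "<p>Hello <b>world</b></p><script>ignored()</script>"
soup = BeautifulSoup(snippet, "html.parser")

# find_all(string=True) yields every text node in the DOM;
# we keep the ones that aren't inside <script> or <style>.
visible = [
    node.strip()
    for node in soup.find_all(string=True)
    if node.parent.name not in ("script", "style") and node.strip()
]
print(visible)  # ['Hello', 'world']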
If this sounds similar to other parsing problems, that's because it is. Text extraction is a common case of data parsing, and many of the same ideas apply. This collection of real-world parsing discussions is a good reference point: Data parsing questions and answers.
When you should extract text instead of raw HTML
You should extract text when the structure of the page doesn't matter, but the content does. Saving raw HTML makes sense if you plan to re-render pages later or need the full markup. In many projects, that's unnecessary overhead.
Text extraction is usually the better option for search indexing, NLP pipelines, analytics, reporting, and logs. Search engines don't need JavaScript. Language models don't care about CSS. Reports are easier to build when you're working with plain text instead of nested tags.
Think about real-world cases. You might want the main content of blog posts, product descriptions from ecommerce pages, or text from documentation sites. In all of these examples, raw HTML just gets in the way. Clean text is smaller, easier to debug, and faster to process.
Main ways to extract text from HTML with Python
There are a few solid ways to extract text from HTML in Python. Which one you pick depends on how messy the HTML is and how much control you need.
- The most common option is BeautifulSoup. It's beginner-friendly, forgiving with broken HTML, and good enough for most scraping and parsing tasks. If you're just trying to get readable text out of a page, this is usually where you start.
- Another option is lxml and related helpers like html-text. These are faster and more strict. They work great when you're dealing with large volumes of HTML or need precise control over the DOM. The tradeoff is a steeper learning curve.
- You may also see Parsel mentioned in scraping discussions. Parsel is designed for structured data extraction using CSS and XPath selectors, not for general text cleanup or readability-focused extraction. It hasn't seen much active development in the past couple of years, and it's usually a better fit for extracting specific fields (like titles or prices) rather than full-page text.
- Regex-based helpers exist, but they should be used very carefully. Simple patterns can work for tiny, predictable snippets, but regex breaks fast when HTML gets nested or inconsistent. For full pages, regex should be a last resort, not the main tool.
If you're new, start with BeautifulSoup. If performance becomes a problem later, you can switch.
HTML example used in the following sections
To keep things consistent, the next examples will all use the same HTML snippet. This avoids repeating markup and makes it easier to compare different extraction approaches.
This example is closer to a real web page. It includes common elements like navigation links, nested content, styles, and scripts. Some text is visible, some is not, and some should be ignored during extraction.
<!DOCTYPE html>
<html>
<head>
<title>Sample documentation page</title>
<meta charset="utf-8">
<style>
body { font-family: Arial; }
.hidden { display: none; }
</style>
<script>
console.log("analytics code");
</script>
</head>
<body>
<nav>
<a href="/">Home</a>
<a href="/docs">Docs</a>
<a href="/contact">Contact</a>
</nav>
<main>
<h1>Getting started with HTML parsing</h1>
<p>This page explains how to extract text from HTML using Python.</p>
<p class="hidden">This paragraph should not appear in extracted text.</p>
<section>
<h2>Installation</h2>
<p>Install the required libraries using pip.</p>
</section>
</main>
<footer>
<p>© 2025 Example Corp</p>
</footer>
</body>
</html>
Key things to notice in this HTML:
- The <script> and <style> blocks should never end up in extracted text.
- Elements with classes like hidden may or may not be relevant depending on your use case.
- Navigation and footer text is visible, but sometimes you'll want to exclude it later.
Use this HTML as the input for the examples in the next sections. We'll focus on how different libraries handle parsing and text extraction rather than rewriting the markup each time.
Extract text from HTML with BeautifulSoup
BeautifulSoup is popular for a reason. It's easy to install, easy to read, and handles messy real-world HTML without complaining.
First, install the dependencies:
pip install beautifulsoup4
For larger projects, consider dependency tools like uv or Poetry, but they're not needed for this tutorial.
The usual workflow looks like this:
- Fetch the HTML, either from a URL or a string
- Parse it into a DOM
- Remove elements you don't want, like scripts and styles
- Extract clean text
Here's a simple, real-world example using BeautifulSoup:
from bs4 import BeautifulSoup
# Paste sample HTML from the section above
html_source = "..."
# Parse HTML into a DOM-like structure
soup = BeautifulSoup(html_source, "html.parser")
# Remove script and style tags to avoid JavaScript and CSS leaking into output
for tag in soup(["script", "style"]):
tag.decompose()
# Optionally remove elements that are hidden via CSS classes
for tag in soup.select(".hidden"):
tag.decompose()
# Extract clean, human-readable text
# separator keeps words from sticking together
# strip removes extra whitespace
text = soup.get_text(separator=" ", strip=True)
print(text)
Key points:
- The HTML includes <script> and <style> tags on purpose. These are common in real-world pages and are a major source of noise. If you don't remove them, JavaScript and CSS often end up mixed into your extracted text.
- BeautifulSoup parses the HTML into a DOM-like structure using the built-in html.parser. This parser is fast enough for most projects and handles imperfect or messy HTML well.
- The loop that targets ["script", "style"] removes those elements entirely from the document tree. Using decompose() matters because it deletes both the tag and its contents, not just the wrapper.
- The optional step that removes elements with the .hidden class shows how you can filter out content that is technically in the HTML but not meant to be visible.
- The get_text() call collects all remaining text nodes. The separator argument inserts spaces between text blocks so words don't stick together, and strip cleans up extra whitespace.
- This pattern is a strong default when you need to extract text from HTML with Python. It works for static HTML, scraped pages, and HTML loaded from files, and it's easy to extend when you need more control.
Extract text using html-text or lxml.html
If you want cleaner text output with less manual post-processing, html-text and lxml.html are solid options. The main advantage is better handling of non-text elements, invisible content, and whitespace normalization compared to naïve extraction.
Install the package:
pip install html-text lxml
If lxml fails to build, install your OS's libxml2/libxslt dev packages.
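On Debian or Ubuntu, for example, that usually looks something like this (package names vary by distribution):
sudo apt-get install libxml2-dev libxslt1-dev python3-dev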
Below is the same sample HTML from earlier. We'll parse it and extract text using an out-of-the-box helper.
from lxml import html
import html_text
html_source = "..."
# Parse once with lxml
doc = html.fromstring(html_source)
# html-text can accept an lxml HtmlElement directly
text = html_text.extract_text(doc)
print(text)
Key points:
- lxml.html parses HTML into a real DOM tree, not just a string wrapper. This gives you a fast, strict structure you can query later with XPath or CSS selectors when you need precision or performance.
- html-text is responsible only for turning HTML into readable plain text. When given an already-parsed lxml tree, it walks the DOM and extracts human-facing text while ignoring non-text elements and obvious noise like scripts and styles.
- Since HTML parsers don't evaluate CSS or layout, handling of "hidden" content is heuristic. Libraries rely on signals like tag types, common class names, aria-hidden, or inline styles such as display: none. For truly accurate visibility detection, a rendering engine (for example, a headless browser) is required.
- In this setup, parsing and extraction are clearly separated. lxml handles structure, html-text handles text output. This avoids double parsing and keeps the mental model simple.
- This pattern works well when you want clean text quickly, but still want the option to add DOM-level filtering later without changing the extraction step.
Quick comparison with BeautifulSoup:
- Pick BeautifulSoup when you want the most beginner-friendly option and are fine doing explicit cleanup (scripts, styles, layout blocks).
- Pick lxml.html when performance matters or when you need strict parsing and fine-grained DOM control via XPath or CSS selectors.
- Pick html-text when your goal is cleaner text extraction and built-in normalization, not automatic content selection.
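To give a feel for that fine-grained control, here's a small lxml.html sketch that pulls text only from the <main> element of the sample HTML via XPath (html_source is the sample snippet from earlier):
from lxml import html

html_source = "..."  # paste the sample HTML from above

doc = html.fromstring(html_source)

# XPath string() concatenates every text node under <main>,
# so nav, footer, and the head's scripts and styles never enter the result.
main_text = doc.xpath("string(//main)")
print(" ".join(main_text.split()))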
Using regex to clean or extract text from HTML
Regex and HTML have a long, messy history. Short version: you should not use regex to parse full HTML documents. HTML is nested, inconsistent, and often broken in ways regex can't handle safely. That said, regex can still be useful in very narrow cases. The key rule is this: only use regex after you've already parsed HTML with a real parser like BeautifulSoup or lxml.
Safe use cases include cleaning up leftover tags, collapsing whitespace, or removing very specific patterns from already-extracted text.
⚠️ Warning: Do not try to extract full page content or navigate HTML structure using regex alone. It will break sooner or later. If you want a deeper explanation of why this is a bad idea, this article covers it well: Parsing HTML with regex.
Here's a small example. We start with text that already came from an HTML parser, then use regex only for cleanup.
import re
from bs4 import BeautifulSoup
# Paste sample HTML from the section above
html_source = "..."
# 1) Parse HTML with a real HTML parser (don't regex full HTML)
soup = BeautifulSoup(html_source, "html.parser")
# 2) Drop obvious noise first (scripts/styles)
for tag in soup(["script", "style"]):
tag.decompose()
# 3) Extract text safely from the parsed DOM
text = soup.get_text(separator=" ", strip=True)
# 4) Use regex only for small cleanup on plain text
# - Collapse repeated whitespace into a single space
# - Optional: normalize weird non-breaking spaces
clean_text = re.sub(r"\s+", " ", text).replace("\u00a0", " ").strip()
print(clean_text)
Key points:
- HTML is parsed first using BeautifulSoup. This is the critical step. A real HTML parser understands nesting, broken markup, and edge cases that regex cannot handle safely.
- If you want faster parsing, install lxml and use BeautifulSoup(html, "lxml").
- Script and style tags are removed before text extraction. This prevents JavaScript and CSS from leaking into the output and keeps the text readable.
- get_text() is used to extract human-readable content from the parsed DOM. At this point, the output is already safe and structured as plain text.
- Regex is applied only after the HTML has been converted to text. This is the safe boundary where regex makes sense.
- The regex pattern \s+ collapses multiple whitespace characters into a single space. This helps clean up text that comes from multiple tags, line breaks, or inconsistent spacing.
- The optional replacement of non-breaking spaces handles a common edge case seen in copied or scraped content.
- This approach keeps responsibilities clear. The HTML parser handles structure, and regex handles small, predictable text cleanup tasks.
Comparing the main HTML text extraction methods
Each approach we covered solves a slightly different problem. This quick comparison should help you decide where to start and when to switch tools.
| Method | Best for | Pros | Cons |
|---|---|---|---|
| BeautifulSoup | General-purpose text extraction | Easy to learn, very forgiving, flexible cleanup | Requires manual removal of boilerplate |
| lxml.html | Performance and precision | Fast, strict parsing, XPath support | Less beginner-friendly |
| html-text | Clean main content | Reduces non-content noise and normalizes text with minimal setup | Less control over what gets removed |
| Regex (cleanup only) | Post-processing text | Simple, fast, predictable | Unsafe for parsing HTML structure |
Rule of thumb: start with BeautifulSoup for most HTML text extraction tasks in Python. Use html-text for cleaner text extraction and normalization (not automatic content selection), and switch to lxml.html when performance or precise DOM control matters. Regex should stay in the cleanup lane, not in charge of parsing.
Handling real world HTML content
Real sites are messy. You'll run into broken markup, tons of scripts, inline styles, cookie banners, ad containers, hidden text, and random layout junk like headers and sidebars. If you just call get_text() on the whole page, you usually get a wall of noise.
The good news is that Python HTML parsers handle most of this fine. The trick is what you do before extraction (remove junk) and what you do after extraction (normalize the text).
For the next examples, here's a more "real-world-ish" HTML snippet with common problems baked in:
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<title>Product page</title>
<style>
.sidebar { width: 300px; }
.hidden { display: none; }
</style>
<script>
window.__tracking = { session: "abc123" };
</script>
</head>
<body>
<header class="site-header">
<nav>
<a href="/">Home</a>
<a href="/shop">Shop</a>
<a href="/support">Support</a>
</nav>
<div class="cookie-banner">We use cookies. Accept?</div>
</header>
<div class="layout">
<aside class="sidebar">
<h3>Recommended</h3>
<ul>
<li><a href="/p/1">Thing 1</a></li>
<li><a href="/p/2">Thing 2</a></li>
</ul>
<div class="ad">BUY NOW!!!</div>
</aside>
<main id="content">
<h1>SuperWidget 3000</h1>
<p>The SuperWidget 3000 is built for devs who hate flaky tools.</p>
<p class="hidden">Internal SKU: SW-3000-SECRET</p>
<section>
<h2>Key features</h2>
<ul>
<li>Fast setup</li>
<li>Clean output</li>
<li>Works on messy HTML</li>
</ul>
</section>
<section>
<h2>Notes</h2>
<p>Ships worldwide. Returns within 30 days.</p>
</section>
</main>
</div>
<footer class="site-footer">
<p>© 2025 Example Corp</p>
<a href="/legal">Legal</a>
</footer>
</body>
</html>
Remove unwanted parts: scripts, styles, and layout blocks
Most of the time, you don't want navigation, cookie banners, sidebars, or footers in your extracted text. You want the main content. So the workflow is: parse → remove junk → extract.
Here's a short before/after that shows why this matters.
Before (no cleanup):
Home Shop Support We use cookies. Accept? Recommended Thing 1 Thing 2 BUY NOW!!! SuperWidget 3000 The SuperWidget 3000 is built for devs who hate flaky tools. Internal SKU: SW-3000-SECRET Key features Fast setup Clean output Works on messy HTML Notes Ships worldwide. Returns within 30 days. © 2025 Example Corp Legal
After (remove layout + scripts/styles + hidden):
Product page
SuperWidget 3000
The SuperWidget 3000 is built for devs who hate flaky tools.
Key features
Fast setup
Clean output
Works on messy HTML
Notes
Ships worldwide. Returns within 30 days.
Code example (BeautifulSoup + CSS selectors):
from bs4 import BeautifulSoup
html_source = """...paste the sample HTML from above..."""
soup = BeautifulSoup(html_source, "html.parser")
# 1) Always drop script/style first
for tag in soup(["script", "style", "noscript"]):
tag.decompose()
# 2) Drop common layout blocks (site chrome)
for tag in soup.select("header, footer, nav, aside, .cookie-banner, .ad"):
tag.decompose()
# 3) Drop hidden content you know is irrelevant (common patterns)
for tag in soup.select(".hidden, [aria-hidden='true']"):
tag.decompose()
# 4) Extract text from what remains
text = soup.get_text(separator="\n", strip=True)
print(text)
Explanations:
- Scripts and styles are pure noise for text extraction. Kill them early.
- Layout blocks are where most boilerplate lives, like navigation, footers, sidebars, and banners.
- Hidden elements can leak internal junk such as SKUs, tracking labels, or A/B test copy.
- Using separator="\n" makes the output more readable and much easier to post-process later.
Clean and normalize extracted text
Even after solid cleanup, extracted text is often still messy. You'll see extra blank lines, weird spacing, non-breaking spaces, and inconsistent line breaks. A small normalization step makes the output feel finished and easier to work with for search, NLP, or reporting.
Below are two small helper functions you can reuse across projects.
from bs4 import BeautifulSoup
import re
def normalize_text(text: str) -> str:
"""Normalize whitespace while keeping meaningful line breaks."""
# Normalize non-breaking spaces
text = text.replace("\u00a0", " ")
# Normalize Windows/Mac line endings
text = text.replace("\r\n", "\n").replace("\r", "\n")
# Trim whitespace on each line
lines = [line.strip() for line in text.split("\n")]
# Drop empty lines and collapse runs of empty lines
cleaned_lines = []
blank_run = 0
for line in lines:
if not line:
blank_run += 1
if blank_run <= 1:
cleaned_lines.append("")
continue
blank_run = 0
# Collapse internal whitespace in a line
cleaned_lines.append(re.sub(r"\s+", " ", line))
return "\n".join(cleaned_lines).strip()
def extract_blocks(soup: BeautifulSoup) -> list[str]:
blocks = []
for tag in soup.select("h1, h2, h3, p, li"):
txt = tag.get_text(" ", strip=True)
if txt:
blocks.append(txt)
return blocks
# Example usage after extraction:
# text = soup.get_text(separator="\n", strip=True)
# clean = normalize_text(text)
# print(clean)
# paragraphs = extract_blocks(soup)
# print(paragraphs)
- normalize_text() is responsible for cleaning up plain text after HTML parsing and extraction. It assumes the structure is already handled and focuses only on readability.
- The first replacement handles non-breaking spaces (\u00a0). These show up a lot in scraped content and often cause weird spacing issues if you don't normalize them.
- Line endings are normalized next. Real-world text can contain Windows (\r\n) or old Mac (\r) line breaks. Converting everything to \n keeps behavior consistent across systems.
- The loop that tracks blank_run collapses multiple empty lines into a single empty line. This prevents giant vertical gaps while still preserving paragraph separation.
- Inside non-empty lines, internal whitespace is collapsed using regex. This avoids double or triple spaces caused by inline tags or broken formatting.
- extract_blocks() takes a different approach. Instead of guessing paragraph boundaries from text, it uses HTML structure directly.
- The selector h1, h2, h3, p, li targets common content blocks like headings, paragraphs, and list items. You can extend or shrink this list depending on your page type.
- The result of extract_blocks() is a list of clean text blocks that already represent meaningful sections. This is often easier to store, index, or analyze than one giant string.
- Together, these helpers cover both sides of the problem: text cleanup when you already have flat text, and structure-aware extraction when you want real paragraphs.
From text extraction to full data pipelines in Python
Text extraction is just one stage in a bigger workflow. In real projects, you usually end up with a pipeline that looks like this:
- Fetch pages (HTTP or files)
- Parse HTML into a structure you can query
- Remove junk (scripts, styles, headers, footers, sidebars, ads)
- Extract the text you actually want
- Normalize the text (whitespace, line breaks, weird characters)
- Store results somewhere useful (CSV, JSON, plain text, or a database)
If you want a broader "big picture" view of pulling structured info out of pages (not just text), this guide is a good companion: Data extraction in Python.
Here are some common output formats:
- JSON / JSONL: best when records have multiple fields and you want to keep structure
- CSV: best for quick analysis in spreadsheets or pandas
- Plain text files: best when you only care about readable content (search indexing, NLP input, archiving)
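For the CSV option, the standard library is enough. A minimal sketch, assuming you already have (url, title, text) records from the extraction step:
import csv

# Hypothetical records produced by your extraction step
records = [
    ("https://example.com/a", "Page A", "Clean text for page A."),
    ("https://example.com/b", "Page B", "Clean text for page B."),
]

with open("pages.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title", "text"])
    writer.writerows(records)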
Example mini project: Crawl a few pages and save clean text
This mini project crawls a small set of pages starting from a single URL, follows "next" pagination links, extracts clean text, and stores results on disk. What this script will do:
- Starts from one start_url and crawls up to max_pages by following the page’s Next link (ul.pager li.next a)
- Downloads each page using a shared requests.Session and parses HTML with BeautifulSoup
- Removes obvious noise (script, style, noscript) plus a handful of common "site chrome" selectors (header/footer/nav/aside, cookie/ad/sidebar patterns)
- Extracts text from what remains using paragraph-ish separators (\n\n)
- Normalizes the extracted text (NBSP cleanup, consistent newlines, trimmed lines, collapsed whitespace, and collapsed blank-line runs)
- Writes results to disk as:
  - a single JSONL file (pages.jsonl) with one record per page (url, title, text)
  - one .txt file per page in a folder, using a title-based filename plus a short URL hash for uniqueness
Here's the code:
import json
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable
from urllib.parse import urljoin
import unicodedata
import hashlib
# Install with: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup
@dataclass(frozen=True)
class PageText:
"""One extracted page of clean text."""
url: str
title: str
text: str
def normalize_text(text: str) -> str:
"""Normalize whitespace while keeping meaningful line breaks."""
text = text.replace("\u00a0", " ") # non-breaking spaces are common in HTML
text = text.replace("\r\n", "\n").replace("\r", "\n")
lines = [line.strip() for line in text.split("\n")]
cleaned_lines: list[str] = []
blank_run = 0
for line in lines:
if not line:
blank_run += 1
if blank_run <= 1:
cleaned_lines.append("")
continue
blank_run = 0
cleaned_lines.append(re.sub(r"\s+", " ", line))
return "\n".join(cleaned_lines).strip()
def extract_clean_text(soup: BeautifulSoup) -> tuple[str, str]:
"""
Remove common noise from a parsed page and return (title, clean_text).
"""
# Drop obvious noise first
for tag in soup(["script", "style", "noscript"]):
tag.decompose()
# Optional: drop common layout blocks on typical sites.
for tag in soup.select("header, footer, nav, aside, .cookie, .cookie-banner, .ad, .ads, .sidebar"):
tag.decompose()
# Hidden content patterns (varies by site)
for tag in soup.select(".hidden, [aria-hidden='true']"):
tag.decompose()
# Extract text with paragraph-ish breaks
raw_text = soup.get_text(separator="\n\n", strip=True)
clean_text = normalize_text(raw_text)
title = soup.title.get_text(" ", strip=True) if soup.title else ""
return title, clean_text
def fetch_soup(url: str, session: requests.Session, *, timeout_s: float = 20.0) -> BeautifulSoup:
"""Fetch a URL and return a BeautifulSoup object."""
headers = {
"User-Agent": "Mozilla/5.0 (compatible; TextExtractor/1.0)"
}
resp = session.get(url, headers=headers, timeout=timeout_s)
resp.raise_for_status()
# Prefer requests' detected encoding, fallback to UTF-8
encoding = resp.encoding or "utf-8"
html_text = resp.content.decode(encoding, errors="replace")
return BeautifulSoup(html_text, "html.parser")
def find_next_page_url(soup: BeautifulSoup, current_url: str) -> str | None:
"""
books.toscrape.com pagination:
<ul class="pager">
<li class="next"><a href="catalogue/page-2.html">next</a></li>
</ul>
"""
next_link = soup.select_one("ul.pager li.next a[href]")
if not next_link:
return None
return urljoin(current_url, next_link["href"])
def crawl_pages(start_url: str, *, max_pages: int = 3) -> list[PageText]:
"""Follow 'next' links and extract clean text from each page."""
results: list[PageText] = []
url = start_url
with requests.Session() as session:
for _ in range(max_pages):
soup = fetch_soup(url, session)
title, text = extract_clean_text(soup)
results.append(PageText(url=url, title=title, text=text))
next_url = find_next_page_url(soup, url)
if not next_url:
break
url = next_url
return results
def save_jsonl(records: Iterable[PageText], path: Path) -> None:
"""Write one JSON object per line (easy to stream and append)."""
path.parent.mkdir(parents=True, exist_ok=True)
with path.open("w", encoding="utf-8") as f:
for r in records:
f.write(json.dumps({"url": r.url, "title": r.title, "text": r.text}, ensure_ascii=False))
f.write("\n")
def slugify_filename(title: str, *, max_len: int = 80) -> str:
"""
Make a filesystem-friendly filename stem.
- Keeps non-ASCII where possible (modern filesystems handle UTF-8 fine).
- Removes path separators and control chars.
- Collapses whitespace.
"""
title = title.strip()
# Normalize unicode (avoids weird lookalike / combining sequences)
title = unicodedata.normalize("NFKC", title)
# Replace path separators and other unsafe chars with underscores
# (keep letters/numbers from any language)
title = re.sub(r"[\\/:*?\"<>|\x00-\x1f]+", "_", title)
# Collapse whitespace
title = re.sub(r"\s+", " ", title).strip()
# Shorten
if len(title) > max_len:
title = title[:max_len].rstrip()
# Avoid empty
return title or "page"
def short_url_hash(url: str, n: int = 8) -> str:
return hashlib.blake2b(url.encode("utf-8"), digest_size=16).hexdigest()[:n]
def save_txt(records: Iterable[PageText], folder: Path) -> None:
"""Save one .txt file per page (good for quick inspection)."""
folder.mkdir(parents=True, exist_ok=True)
for i, r in enumerate(records, start=1):
# Title can be non-ASCII; that's fine on most systems.
# Add a short URL hash so filenames are unique + stable.
stem = slugify_filename(r.title or "", max_len=80)
h = short_url_hash(r.url, n=8)
out_path = folder / f"{i:02d}_{stem}_{h}.txt"
out_path.write_text(r.text, encoding="utf-8")
def main() -> None:
# Start on the first listing page, then follow "next" links.
start_url = "https://books.toscrape.com/"
# Crawl a few pages (tweak this number when you want more)
pages = crawl_pages(start_url, max_pages=3)
# Store results
out_dir = Path("output")
save_jsonl(pages, out_dir / "pages.jsonl")
save_txt(pages, out_dir / "pages_txt")
print(f"Saved {len(pages)} pages to: {out_dir.resolve()}")
if __name__ == "__main__":
main()
Key points:
- This script turns HTML text extraction into a small but realistic pipeline. It fetches pages, follows pagination, cleans the DOM, extracts readable text, and saves the results.
- Instead of hard-coding page URLs, the crawler starts from a single entry point and discovers subsequent pages via the site's pagination markup. This reflects how real sites are structured and scales cleanly.
- A single requests.Session is reused across requests. This reduces connection overhead and keeps headers and cookies consistent.
- Each page is parsed once into a BeautifulSoup object. All cleanup and extraction operate on that parsed structure, not on raw HTML strings.
- Noise removal happens before extraction. Scripts, styles, hidden elements, and common layout blocks are removed so they never leak into the text layer.
- Text is extracted using double newlines as separators. This preserves paragraph boundaries and makes later normalization or block-based processing easier.
normalize_text()handles the unglamorous cleanup work: non-breaking spaces, inconsistent line endings, excess blank lines, and repeated whitespace.- Pagination is handled by detecting the
nextlink and resolving it to an absolute URL. When no next link exists, crawling stops automatically. - Results are saved in two formats. JSONL preserves structured records for pipelines and analysis, while plain text files are convenient for inspection, indexing, or NLP input.
- The pipeline is intentionally modular. You can change selectors, target a different site, or swap the output format without rewriting the overall flow.
HTML text extraction in other languages
The ideas behind HTML text extraction are not Python-specific. No matter the language, the workflow looks almost the same: fetch a page, parse HTML into a DOM, remove junk, extract readable text, then clean it up.
If you work in a mixed stack or switch between languages, this is good news. Once you understand the concepts in Python, you can transfer them almost directly to other ecosystems like Ruby, JS, or C#. The main differences are library names and syntax. The mental model stays the same.
Ruby and HTML text extraction
Ruby has a strong ecosystem for HTML parsing and web scraping. Just like in Python, Ruby developers typically rely on real HTML parsers instead of string manipulation.
The flow in Ruby mirrors what you've seen so far:
- Fetch HTML (Net::HTTP, Faraday, or similar)
- Parse it into a DOM
- Remove scripts, styles, and layout blocks
- Extract text nodes
- Normalize the result
Here's a very simple Ruby example that follows the same flow you've seen in Python. It fetches a page, parses HTML, removes scripts and styles, and extracts readable text:
require "net/http"
require "uri"
require "nokogiri"
url = URI("https://example.com")
html_source = Net::HTTP.get(url)
# Parse HTML into a DOM
doc = Nokogiri::HTML(html_source)
# Remove script and style tags
doc.search("script, style").each(&:remove)
# Extract readable text
text = doc.text
# Basic normalization
clean_text = text
.gsub("\u00a0", " ")
.lines
.map(&:strip)
.reject(&:empty?)
.join("\n")
puts clean_text
If you're writing Ruby and want a language-specific walkthrough, this guide is a good starting point: Ruby HTML parser guide.
Extracting text with Ruby and Nokogiri
Nokogiri is the most popular HTML parsing library in the Ruby world. Conceptually, it plays the same role as BeautifulSoup or lxml in Python.
With Nokogiri, you:
- Load HTML into a document object
- Use CSS selectors or XPath to find elements
- Remove unwanted nodes like scripts and styles
- Call .text to extract readable content
If you're comfortable with BeautifulSoup, Nokogiri will feel familiar. CSS selectors work the same way. XPath is available when you need more precision. And, just like in Python, the key is to clean the DOM before you extract text.
Here's a Nokogiri example that shows just the core idea: select nodes, remove junk, then extract text:
require "nokogiri"
html_source = <<~HTML
<html>
<head>
<style>.hidden { display: none }</style>
<script>console.log("tracking")</script>
</head>
<body>
<h1>Sample page</h1>
<p>This is visible text.</p>
<p class="hidden">This should not be extracted.</p>
</body>
</html>
HTML
# Load HTML into a document object
doc = Nokogiri::HTML(html_source)
# Remove unwanted nodes
doc.search("script, style, .hidden").each(&:remove)
# Extract readable content
text = doc.text.strip
puts text
This article dives deeper into practical Nokogiri usage and patterns: Parse HTML with Nokogiri.
HTML parsing in JavaScript
JavaScript follows the same extraction flow, whether you're running in Node.js or in a browser-like environment. You fetch HTML, parse it into a DOM, remove junk nodes, extract text, then clean it up.
In Node.js, this is usually done with libraries like jsdom or cheerio. They give you a DOM API that feels very similar to what browsers provide, so the mental model stays the same.
The flow in JavaScript looks like this:
- Fetch HTML (fetch, axios, or similar)
- Parse it into a DOM
- Remove scripts, styles, and layout elements
- Extract text content
- Normalize whitespace
Here's a very simple Node.js example using jsdom.
import { JSDOM } from "jsdom";
// Example HTML (could also come from fetch)
const html_source = `
<!doctype html>
<html>
<head>
<style>.hidden { display: none }</style>
<script>console.log("tracking")</script>
</head>
<body>
<header>Navigation</header>
<main>
<h1>Sample page</h1>
<p>This is visible text.</p>
<p class="hidden">This should not appear.</p>
</main>
<footer>Footer content</footer>
</body>
</html>
`;
// Load HTML into a DOM
const dom = new JSDOM(html_source);
const document = dom.window.document;
// Remove unwanted elements
document.querySelectorAll("script, style, header, footer, .hidden")
.forEach(el => el.remove());
// Extract readable text
let text = document.body.textContent || "";
// Basic normalization
text = text
.replace(/\u00a0/g, " ")
.replace(/\s+/g, " ")
.trim();
console.log(text);
If you're comfortable with browser DOM APIs, this should feel natural. CSS selectors work the same way, and the overall extraction logic matches what you've already seen in Python, Ruby, or C#.
Learn more about JS web scraping in our dedicated tutorial.
HTML parsing in C#
C# and .NET follow the same overall approach you've seen in Python and Ruby. You fetch HTML, parse it into a DOM-like structure, remove elements you don't care about, then extract and clean the text.
In the .NET ecosystem, developers typically rely on dedicated HTML parsing libraries rather than string operations. These libraries understand malformed markup, support CSS selectors or XPath, and make it easy to remove scripts, styles, and layout blocks before extracting text.
The flow in C# looks very familiar:
- Download HTML using HttpClient
- Load it into an HTML document object
- Select and remove unwanted nodes like scripts, styles, headers, and footers
- Extract text nodes and normalize the result
Here's a C# example that follows the same extraction flow using a common .NET HTML parser.
using System;
using System.Net.Http;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
var url = "https://example.com";
using var http = new HttpClient();
var html_source = await http.GetStringAsync(url);
// Load HTML into a document object
var doc = new HtmlDocument();
doc.LoadHtml(html_source);
// Remove unwanted nodes
var nodesToRemove = doc.DocumentNode.SelectNodes("//script|//style");
if (nodesToRemove != null)
{
foreach (var node in nodesToRemove)
{
node.Remove();
}
}
// Extract readable text
var text = doc.DocumentNode.InnerText;
// Basic normalization
text = Regex.Replace(text, @"\s+", " ").Trim();
Console.WriteLine(text);
If you're working in C# and want a concrete, language-specific walkthrough, this guide covers the available libraries and common patterns in more detail: C# HTML parser guide.
The takeaway is simple: languages change, libraries change, but the extraction mindset doesn't. Once you understand how to safely extract text from HTML in one language, picking it up in another becomes much easier.
Let an API handle HTML fetching for you
Everything we've covered so far focuses on parsing, cleaning, and extracting text. That's the part you actually want to control. What usually slows projects down is everything before that.
Fetching pages reliably is hard. Real sites block requests, require JavaScript rendering, rotate content, or behave differently based on headers and IPs. You end up spending more time fighting networking issues than working on extraction logic. One way to simplify this is to offload the fetching part to a scraping API and keep text extraction in Python.
The idea is simple: let an external service deal with JavaScript rendering, anti-bot defenses, and retries, then give you clean HTML that's ready to parse.
With a tool like Web Scraping API by ScrapingBee, you can request a page and get back fully rendered HTML. From there, nothing changes in your code. You still pass the HTML into BeautifulSoup, lxml, or html-text and apply the same extraction and cleanup steps you've already seen.
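For illustration, here's a sketch using ScrapingBee's Python client (pip install scrapingbee); treat the parameter names as an assumption and check the current SDK docs:
from scrapingbee import ScrapingBeeClient
from bs4 import BeautifulSoup

client = ScrapingBeeClient(api_key="YOUR_API_KEY")

# The API handles fetching and JavaScript rendering
response = client.get("https://example.com", params={"render_js": True})

# From here, nothing changes: parse and extract as before
soup = BeautifulSoup(response.text, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()
print(soup.get_text(separator=" ", strip=True))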
This setup keeps responsibilities clear:
- The API handles fetching, rendering, and blocking issues
- Your Python code handles parsing, text extraction, and analysis
If you're working on production pipelines or crawling at scale, this approach saves time, reduces edge cases, and lets you focus on the part that actually matters: turning HTML into useful data.
Conclusion
Extracting text from HTML is less about one library and more about having a clear process. Fetch the page, parse it with a real HTML parser, remove the junk, extract readable content, then clean it up. Python makes this easy with tools like BeautifulSoup, lxml, and html-text, but the same ideas apply in Ruby, C#, and JavaScript. Once you understand the flow, switching languages or libraries is mostly mechanical.
Start simple. Use real parsers. Avoid regex for structure. And when fetching pages becomes the bottleneck, offload that part so you can focus on extraction and analysis.
Before you go, check out these related reads:
- Pyppeteer: the Puppeteer for Python developers
- Web crawling with Python
- How to parse HTML in Python: A step-by-step guide for beginners
Frequently asked questions (FAQs)
Which Python library is best to extract text from HTML?
- BeautifulSoup is usually the best starting point. It's easy to learn, tolerant of messy HTML, and flexible when you need custom cleanup.
- lxml is faster and more strict, which makes it a good choice at scale or when you need precise DOM control with XPath or CSS selectors.
- html-text works well when you mainly want clean, normalized plain text with minimal post-processing, but it doesn't reliably isolate the "main article" on its own.
How do I ignore scripts, styles, and navigation when I extract text?
Parse the HTML first, then remove unwanted elements like <script>, <style>, headers, footers, and sidebars using tag filters or CSS selectors. After that, extract text. This cleanup step dramatically improves readability and prevents JavaScript or layout junk from leaking into results.
Can I extract only part of a page instead of all text?
Yes. Instead of extracting from the whole document, select a specific container like <main> or an article section using CSS selectors. Then extract text only from that subtree. This keeps results focused and avoids pulling in navigation, ads, or unrelated page content.
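For example, with BeautifulSoup (a minimal sketch; the selector depends on the site):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_source, "html.parser")  # html_source: the page HTML
main = soup.select_one("main")  # or "article", or a site-specific selector
text = main.get_text(" ", strip=True) if main else ""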
Is Python the only option for HTML text extraction?
No. The same approach works in many languages. Ruby, C#, and JavaScript all have solid HTML parsers. The syntax changes, but the flow stays the same: parse HTML, remove noise, extract text, then clean it. Choose based on your stack.
