
Python extract text from HTML: Library guide for developers

25 December 2025 | 28 min read

If you need to extract text from HTML in Python, this guide walks you through it step by step, without overcomplicating things. You'll learn what text extraction actually means, which Python libraries make it easy, and how to deal with real-world HTML that's messy, noisy, and inconsistent.

We'll start simple with the basics, then move into practical examples, cleanup strategies, and a small end-to-end pipeline. By the end, you'll know how to turn raw HTML into clean, usable text you can store, analyze, or feed into other systems.


Quick answer (TL;DR)

To extract text from HTML in Python, use BeautifulSoup and its get_text() method. Parse the HTML, remove scripts and styles, then extract the readable content. This covers most use cases and is easy to maintain.
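
In its shortest form, that default looks something like this (with a tiny inline HTML string standing in for your page):

from bs4 import BeautifulSoup

html_source = "<html><body><h1>Hi</h1><script>track()</script><p>Readable text.</p></body></html>"

soup = BeautifulSoup(html_source, "html.parser")

# Drop non-content tags so JavaScript doesn't leak into the output
for tag in soup(["script", "style"]):
    tag.decompose()

print(soup.get_text(separator=" ", strip=True))  # -> "Hi Readable text."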

The sections below go deeper, with practical code samples, cleanup techniques, and real-world examples for more complex scenarios.

You can also check this tutorial for more information: How to turn HTML to text in Python.

What does it mean to extract text from HTML in Python

HTML is the source code behind a web page. It mixes real content with a lot of extra stuff that browsers need to display the page correctly. When you extract text from HTML in Python, you're stripping all that noise away and keeping only what a human would actually read.

  • Raw HTML is full of tags like <div> and <span>, links, attributes, inline styles, scripts, and metadata. That's great for rendering a page, but it's not so great for analysis.
  • Extracted text is the visible content: headings, paragraphs, list items, and labels. Nothing more.

This matters because most backend tasks don't care about layout. They care about words. Once you reduce HTML to clean text, it becomes much easier to store, search, index, or feed into other systems.

A few basic terms help here:

  • The DOM, or Document Object Model, is how browsers and libraries represent an HTML page internally. It's basically a tree.
  • Tags are the individual HTML markers like <p> or <h1>.
  • Elements are those tags plus whatever is inside them.

Text extraction works by walking the DOM and collecting only the text nodes while skipping things like scripts and styles, as the short sketch below shows.
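
That walk is essentially what every library in this guide does under the hood. As a rough illustration (not how you'd write it day to day), here's a manual DOM walk with BeautifulSoup that keeps only the text nodes outside <script> and <style>:

from bs4 import BeautifulSoup, NavigableString

html_source = "<main><h1>Title</h1><script>track()</script><p>Hello <b>world</b></p></main>"
soup = BeautifulSoup(html_source, "html.parser")

visible_text = []
for node in soup.descendants:                        # walk every node in the tree
    if isinstance(node, NavigableString):            # keep only text nodes
        if node.parent.name in ("script", "style"):  # skip non-visible containers
            continue
        stripped = node.strip()
        if stripped:
            visible_text.append(stripped)

print(visible_text)  # ['Title', 'Hello', 'world']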

If this sounds similar to other parsing problems, that's because it is. Text extraction is a common case of data parsing, and many of the same ideas apply. This collection of real-world parsing discussions is a good reference point: Data parsing questions and answers.

When you should extract text instead of raw HTML

You should extract text when the structure of the page doesn't matter, but the content does. Saving raw HTML makes sense if you plan to re-render pages later or need the full markup. In many projects, that's unnecessary overhead.

Text extraction is usually the better option for search indexing, NLP pipelines, analytics, reporting, and logs. Search engines don't need JavaScript. Language models don't care about CSS. Reports are easier to build when you're working with plain text instead of nested tags.

Think about real-world cases. You might want the main content of blog posts, product descriptions from ecommerce pages, or text from documentation sites. In all of these examples, raw HTML just gets in the way. Clean text is smaller, easier to debug, and faster to process.

Main ways to extract text from HTML with Python

There are a few solid ways to extract text from HTML in Python. Which one you pick depends on how messy the HTML is and how much control you need.

  • The most common option is BeautifulSoup. It's beginner-friendly, forgiving with broken HTML, and good enough for most scraping and parsing tasks. If you're just trying to get readable text out of a page, this is usually where you start.
  • Another option is lxml and related helpers like html-text. These are faster and more strict. They work great when you're dealing with large volumes of HTML or need precise control over the DOM. The tradeoff is a steeper learning curve.
  • You may also see Parsel mentioned in scraping discussions. Parsel is designed for structured data extraction using CSS and XPath selectors, not for general text cleanup or readability-focused extraction. It hasn't seen much active development in the past couple of years, and it's usually a better fit for extracting specific fields (like titles or prices) rather than full-page text (see the short sketch after this list).
  • Regex-based helpers exist, but they should be used very carefully. Simple patterns can work for tiny, predictable snippets, but regex breaks fast when HTML gets nested or inconsistent. For full pages, regex should be a last resort, not the main tool.
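
For contrast, here's a quick Parsel sketch showing the field-extraction style it's built for (assuming pip install parsel; the HTML snippet is made up for illustration):

from parsel import Selector

html_source = "<html><body><h1>SuperWidget 3000</h1><p class='price'>$49.99</p></body></html>"

sel = Selector(text=html_source)

# Parsel shines at pulling specific fields with CSS or XPath selectors...
title = sel.css("h1::text").get()
price = sel.css("p.price::text").get()

# ...not at producing clean full-page text
print(title, price)  # SuperWidget 3000 $49.99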

If you're new, start with BeautifulSoup. If performance becomes a problem later, you can switch.

HTML example used in the following sections

To keep things consistent, the next examples will all use the same HTML snippet. This avoids repeating markup and makes it easier to compare different extraction approaches.

This example is closer to a real web page. It includes common elements like navigation links, nested content, styles, and scripts. Some text is visible, some is not, and some should be ignored during extraction.

<!DOCTYPE html>
<html>
<head>
    <title>Sample documentation page</title>
    <meta charset="utf-8">
    <style>
        body { font-family: Arial; }
        .hidden { display: none; }
    </style>
    <script>
        console.log("analytics code");
    </script>
</head>
<body>
    <nav>
        <a href="/">Home</a>
        <a href="/docs">Docs</a>
        <a href="/contact">Contact</a>
    </nav>

    <main>
        <h1>Getting started with HTML parsing</h1>
        <p>This page explains how to extract text from HTML using Python.</p>
        <p class="hidden">This paragraph should not appear in extracted text.</p>

        <section>
            <h2>Installation</h2>
            <p>Install the required libraries using pip.</p>
        </section>
    </main>

    <footer>
        <p>© 2025 Example Corp</p>
    </footer>
</body>
</html>

Key things to notice in this HTML:

  • The <script> and <style> blocks should never end up in extracted text.
  • Elements with classes like hidden may or may not be relevant depending on your use case.
  • Navigation and footer text is visible, but sometimes you'll want to exclude it later.

Use this HTML as the input for the examples in the next sections. We'll focus on how different libraries handle parsing and text extraction rather than rewriting the markup each time.

Extract text from HTML with BeautifulSoup

BeautifulSoup is popular for a reason. It's easy to install, easy to read, and handles messy real-world HTML without complaining.

First, install the dependencies:

pip install beautifulsoup4

For larger projects, consider dependency managers like uv or Poetry, but they're not needed for this tutorial.

The usual workflow looks like this:

  • Fetch the HTML, either from a URL or a string
  • Parse it into a DOM
  • Remove elements you don't want, like scripts and styles
  • Extract clean text

Here's a simple, real-world example using BeautifulSoup:

from bs4 import BeautifulSoup

# Paste sample HTML from the section above
html_source = "..."

# Parse HTML into a DOM-like structure
soup = BeautifulSoup(html_source, "html.parser")

# Remove script and style tags to avoid JavaScript and CSS leaking into output
for tag in soup(["script", "style"]):
    tag.decompose()

# Optionally remove elements that are hidden via CSS classes
for tag in soup.select(".hidden"):
    tag.decompose()

# Extract clean, human-readable text
# separator keeps words from sticking together
# strip removes extra whitespace
text = soup.get_text(separator=" ", strip=True)

print(text)

Key points:

  • The HTML includes <script> and <style> tags on purpose. These are common in real-world pages and are a major source of noise. If you don't remove them, JavaScript and CSS often end up mixed into your extracted text.
  • BeautifulSoup parses the HTML into a DOM-like structure using the built-in html.parser. This parser is fast enough for most projects and handles imperfect or messy HTML well.
  • The loop that targets ["script", "style"] removes those elements entirely from the document tree. Using decompose() matters because it deletes both the tag and its contents, not just the wrapper.
  • The optional step that removes elements with the .hidden class shows how you can filter out content that is technically in the HTML but not meant to be visible.
  • The get_text() call collects all remaining text nodes. The separator argument inserts spaces between text blocks so words don't stick together, and strip cleans up extra whitespace.
  • This pattern is a strong default when you need to extract text from HTML in Python. It works for static HTML, scraped pages, and HTML loaded from files, and it's easy to extend when you need more control.

Extract text using html-text or lxml.html

If you want cleaner text output with less manual post-processing, html-text and lxml.html are solid options. The main advantage is better handling of non-text elements, invisible content, and whitespace normalization compared to naïve extraction.

Install the package:

pip install html-text lxml

If lxml fails to build, install your OS's libxml2/libxslt dev packages.

Below is the same sample HTML from earlier. We'll parse it and extract text using an out-of-the-box helper.

from lxml import html
import html_text

html_source = "..."

# Parse once with lxml
doc = html.fromstring(html_source)

# html-text can accept an lxml HtmlElement directly
text = html_text.extract_text(doc)

print(text)

Key points:

  • lxml.html parses HTML into a real DOM tree, not just a string wrapper. This gives you a fast, strict structure you can query later with XPath or CSS selectors when you need precision or performance.
  • html-text is responsible only for turning HTML into readable plain text. When given an already-parsed lxml tree, it walks the DOM and extracts human-facing text while ignoring non-text elements and obvious noise like scripts and styles.
    • Since HTML parsers don't evaluate CSS or layout, handling of "hidden" content is heuristic. Libraries rely on signals like tag types, common class names, aria-hidden, or inline styles such as display: none. For truly accurate visibility detection, a rendering engine (for example, a headless browser) is required.
  • In this setup, parsing and extraction are clearly separated. lxml handles structure, html-text handles text output. This avoids double parsing and keeps the mental model simple.
  • This pattern works well when you want clean text quickly, but still want the option to add DOM-level filtering later without changing the extraction step.

Quick comparison with BeautifulSoup:

  • Pick BeautifulSoup when you want the most beginner-friendly option and are fine doing explicit cleanup (scripts, styles, layout blocks).
  • Pick lxml.html when performance matters or when you need strict parsing and fine-grained DOM control via XPath or CSS selectors (see the sketch after this list).
  • Pick html-text when your goal is cleaner text extraction and built-in normalization, not automatic content selection.
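
If you go the lxml route and want that fine-grained control, here's a small sketch of DOM-level filtering with XPath before extraction. It assumes you've pasted the sample HTML from earlier and only care about the content inside <main>:

from lxml import html, etree

# Paste sample HTML from the section above
html_source = "..."

doc = html.fromstring(html_source)

# Remove scripts and styles at the DOM level before extracting text
etree.strip_elements(doc, "script", "style", with_tail=False)

# XPath gives precise control: keep headings and paragraphs inside <main>,
# skipping anything marked with the hidden class
blocks = [
    el.text_content().strip()
    for el in doc.xpath("//main//h1 | //main//h2 | //main//p[not(contains(@class, 'hidden'))]")
]
print(blocks)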

Using regex to clean or extract text from HTML

Regex and HTML have a long, messy history. Short version: you should not use regex to parse full HTML documents. HTML is nested, inconsistent, and often broken in ways regex can't handle safely. That said, regex can still be useful in very narrow cases. The key rule is this: only use regex after you've already parsed HTML with a real parser like BeautifulSoup or lxml.

Safe use cases include cleaning up leftover tags, collapsing whitespace, or removing very specific patterns from already-extracted text.

⚠️ Warning: Do not try to extract full page content or navigate HTML structure using regex alone. It will break sooner or later. If you want a deeper explanation of why this is a bad idea, this article covers it well: Parsing HTML with regex.

Here's a small example. We start with text that already came from an HTML parser, then use regex only for cleanup.

import re
from bs4 import BeautifulSoup

# Paste sample HTML from the section above
html_source = "..."

# 1) Parse HTML with a real HTML parser (don't regex full HTML)
soup = BeautifulSoup(html_source, "html.parser")

# 2) Drop obvious noise first (scripts/styles)
for tag in soup(["script", "style"]):
    tag.decompose()

# 3) Extract text safely from the parsed DOM
text = soup.get_text(separator=" ", strip=True)

# 4) Use regex only for small cleanup on plain text
#    - Collapse repeated whitespace into a single space
#    - Optional: normalize weird non-breaking spaces
clean_text = re.sub(r"\s+", " ", text).replace("\u00a0", " ").strip()

print(clean_text)

Key points:

  • HTML is parsed first using BeautifulSoup. This is the critical step. A real HTML parser understands nesting, broken markup, and edge cases that regex cannot handle safely.
    • If you want faster parsing, install lxml and use BeautifulSoup(html, "lxml").
  • Script and style tags are removed before text extraction. This prevents JavaScript and CSS from leaking into the output and keeps the text readable.
  • get_text() is used to extract human-readable content from the parsed DOM. At this point, the output is already safe and structured as plain text.
  • Regex is applied only after the HTML has been converted to text. This is the safe boundary where regex makes sense.
  • The regex pattern \s+ collapses multiple whitespace characters into a single space. This helps clean up text that comes from multiple tags, line breaks, or inconsistent spacing.
  • The optional replacement of non-breaking spaces handles a common edge case seen in copied or scraped content.
  • This approach keeps responsibilities clear. The HTML parser handles structure, and regex handles small, predictable text cleanup tasks.

Comparing the main HTML text extraction methods

Each approach we covered solves a slightly different problem. This quick comparison should help you decide where to start and when to switch tools.

Method | Best for | Pros | Cons
BeautifulSoup | General-purpose text extraction | Easy to learn, very forgiving, flexible cleanup | Requires manual removal of boilerplate
lxml.html | Performance and precision | Fast, strict parsing, XPath support | Less beginner-friendly
html-text | Clean, normalized text output | Reduces non-content noise and normalizes text with minimal setup | Less control over what gets removed
Regex (cleanup only) | Post-processing text | Simple, fast, predictable | Unsafe for parsing HTML structure

Rule of thumb: start with BeautifulSoup for most HTML-to-text tasks in Python. Use html-text for cleaner text extraction and normalization (not automatic content selection), and switch to lxml.html when performance or precise DOM control matters. Regex should stay in the cleanup lane, not in charge of parsing.

Handling real world HTML content

Real sites are messy. You'll run into broken markup, tons of scripts, inline styles, cookie banners, ad containers, hidden text, and random layout junk like headers and sidebars. If you just call get_text() on the whole page, you usually get a wall of noise.

The good news is: Python HTML parsers handle most of this fine. The trick is what you do before extraction (remove junk) and what you do after extraction (normalize the text).

For the next examples, here's a more "real-world-ish" HTML snippet with common problems baked in:

<!doctype html>
<html>
<head>
  <meta charset="utf-8">
  <title>Product page</title>
  <style>
    .sidebar { width: 300px; }
    .hidden { display: none; }
  </style>
  <script>
    window.__tracking = { session: "abc123" };
  </script>
</head>
<body>
  <header class="site-header">
    <nav>
      <a href="/">Home</a>
      <a href="/shop">Shop</a>
      <a href="/support">Support</a>
    </nav>
    <div class="cookie-banner">We use cookies. Accept?</div>
  </header>

  <div class="layout">
    <aside class="sidebar">
      <h3>Recommended</h3>
      <ul>
        <li><a href="/p/1">Thing 1</a></li>
        <li><a href="/p/2">Thing 2</a></li>
      </ul>
      <div class="ad">BUY NOW!!!</div>
    </aside>

    <main id="content">
      <h1>SuperWidget 3000</h1>
      <p>The SuperWidget 3000 is built for devs who hate flaky tools.</p>
      <p class="hidden">Internal SKU: SW-3000-SECRET</p>

      <section>
        <h2>Key features</h2>
        <ul>
          <li>Fast setup</li>
          <li>Clean output</li>
          <li>Works on messy HTML</li>
        </ul>
      </section>

      <section>
        <h2>Notes</h2>
        <p>Ships worldwide.&nbsp;&nbsp;Returns within 30 days.</p>
      </section>
    </main>
  </div>

  <footer class="site-footer">
    <p>© 2025 Example Corp</p>
    <a href="/legal">Legal</a>
  </footer>
</body>
</html>

Remove unwanted parts: scripts, styles, and layout blocks

Most of the time, you don't want navigation, cookie banners, sidebars, or footers in your extracted text. You want the main content. So the workflow is: parse → remove junk → extract.

Here's a short before/after that shows why this matters.

Before (no cleanup):

Home Shop Support We use cookies. Accept? Recommended Thing 1 Thing 2 BUY NOW!!! SuperWidget 3000 The SuperWidget 3000 is built for devs who hate flaky tools. Internal SKU: SW-3000-SECRET Key features Fast setup Clean output Works on messy HTML Notes Ships worldwide. Returns within 30 days. © 2025 Example Corp Legal

After (remove layout + scripts/styles + hidden):

Product page
SuperWidget 3000
The SuperWidget 3000 is built for devs who hate flaky tools.
Key features
Fast setup
Clean output
Works on messy HTML
Notes
Ships worldwide. Returns within 30 days.

Code example (BeautifulSoup + CSS selectors):

from bs4 import BeautifulSoup

html_source = """...paste the sample HTML from above..."""

soup = BeautifulSoup(html_source, "html.parser")

# 1) Always drop script/style first
for tag in soup(["script", "style", "noscript"]):
    tag.decompose()

# 2) Drop common layout blocks (site chrome)
for tag in soup.select("header, footer, nav, aside, .cookie-banner, .ad"):
    tag.decompose()

# 3) Drop hidden content you know is irrelevant (common patterns)
for tag in soup.select(".hidden, [aria-hidden='true']"):
    tag.decompose()

# 4) Extract text from what remains
text = soup.get_text(separator="\n", strip=True)
print(text)

Explanations:

  • Scripts and styles are pure noise for text extraction. Kill them early.
  • Layout blocks are where most boilerplate lives, like navigation, footers, sidebars, and banners.
  • Hidden elements can leak internal junk such as SKUs, tracking labels, or A/B test copy.
  • Using separator="\n" makes the output more readable and much easier to post-process later.

Clean and normalize extracted text

Even after solid cleanup, extracted text is often still messy. You'll see extra blank lines, weird spacing, non-breaking spaces, and inconsistent line breaks. A small normalization step makes the output feel finished and easier to work with for search, NLP, or reporting.

Below are two small helper functions you can reuse across projects.

from bs4 import BeautifulSoup
import re

def normalize_text(text: str) -> str:
    """Normalize whitespace while keeping meaningful line breaks."""
    # Normalize non-breaking spaces
    text = text.replace("\u00a0", " ")

    # Normalize Windows/Mac line endings
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Trim whitespace on each line
    lines = [line.strip() for line in text.split("\n")]

    # Drop empty lines and collapse runs of empty lines
    cleaned_lines = []
    blank_run = 0
    for line in lines:
        if not line:
            blank_run += 1
            if blank_run <= 1:
                cleaned_lines.append("")
            continue
        blank_run = 0
        # Collapse internal whitespace in a line
        cleaned_lines.append(re.sub(r"\s+", " ", line))

    return "\n".join(cleaned_lines).strip()


def extract_blocks(soup: BeautifulSoup) -> list[str]:
    blocks = []
    for tag in soup.select("h1, h2, h3, p, li"):
        txt = tag.get_text(" ", strip=True)
        if txt:
            blocks.append(txt)
    return blocks


# Example usage after extraction:
# text = soup.get_text(separator="\n", strip=True)
# clean = normalize_text(text)
# print(clean)

# paragraphs = extract_blocks(soup)
# print(paragraphs)

Key points:

  • normalize_text() is responsible for cleaning up plain text after HTML parsing and extraction. It assumes the structure is already handled and focuses only on readability.
  • The first replacement handles non-breaking spaces (\u00a0). These show up a lot in scraped content and often cause weird spacing issues if you don't normalize them.
  • Line endings are normalized next. Real-world text can contain Windows (\r\n) or old Mac (\r) line breaks. Converting everything to \n keeps behavior consistent across systems.
  • The loop that tracks blank_run collapses multiple empty lines into a single empty line. This prevents giant vertical gaps while still preserving paragraph separation.
  • Inside non-empty lines, internal whitespace is collapsed using regex. This avoids double or triple spaces caused by inline tags or broken formatting.
  • extract_blocks() takes a different approach. Instead of guessing paragraph boundaries from text, it uses HTML structure directly.
  • The selector h1, h2, h3, p, li targets common content blocks like headings, paragraphs, and list items. You can extend or shrink this list depending on your page type.
  • The result of extract_blocks() is a list of clean text blocks that already represent meaningful sections. This is often easier to store, index, or analyze than one giant string.
  • Together, these helpers cover both sides of the problem: text cleanup when you already have flat text, and structure-aware extraction when you want real paragraphs.

From text extraction to full data pipelines in Python

Text extraction is just one stage in a bigger workflow. In real projects, you usually end up with a pipeline that looks like this:

  1. Fetch pages (HTTP or files)
  2. Parse HTML into a structure you can query
  3. Remove junk (scripts, styles, headers, footers, sidebars, ads)
  4. Extract the text you actually want
  5. Normalize the text (whitespace, line breaks, weird characters)
  6. Store results somewhere useful (CSV, JSON, plain text, or a database)

If you want a broader "big picture" view of pulling structured info out of pages (not just text), this guide is a good companion: Data extraction in Python.

Here are some common output formats:

  • JSON / JSONL: best when records have multiple fields and you want to keep structure
  • CSV: best for quick analysis in spreadsheets or pandas (see the sketch after this list)
  • Plain text files: best when you only care about readable content (search indexing, NLP input, archiving)
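
For example, if each page ends up as a small record with url, title, and text fields (the same shape the mini project below uses), writing CSV takes only the standard library. The records here are made up for illustration:

import csv
from pathlib import Path

# Hypothetical records; in practice these come from your extraction step
records = [
    {"url": "https://example.com/a", "title": "Page A", "text": "Clean text for page A."},
    {"url": "https://example.com/b", "title": "Page B", "text": "Clean text for page B."},
]

out_path = Path("output/pages.csv")
out_path.parent.mkdir(parents=True, exist_ok=True)

with out_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["url", "title", "text"])
    writer.writeheader()
    writer.writerows(records)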

Example mini project: Crawl a few pages and save clean text

This mini project crawls a small set of pages starting from a single URL, follows "next" pagination links, extracts clean text, and stores results on disk. What this script does:

  1. Starts from one start_url and crawls up to max_pages by following the page’s Next link (ul.pager li.next a)
  2. Downloads each page using a shared requests.Session and parses HTML with BeautifulSoup
  3. Removes obvious noise (script, style, noscript) plus a handful of common "site chrome" selectors (header/footer/nav/aside, cookie/ad/sidebar patterns)
  4. Extracts text from what remains using paragraph-ish separators (\n\n)
  5. Normalizes the extracted text (NBSP cleanup, consistent newlines, trimmed lines, collapsed whitespace, and collapsed blank-line runs)
  6. Writes results to disk as:
    • a single JSONL file (pages.jsonl) with one record per page (url, title, text)
    • one .txt file per page in a folder, using a title-based filename plus a short URL hash for uniqueness

Here's the code:

import json
import re
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable
from urllib.parse import urljoin
import unicodedata
import hashlib

# Install with: pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup


@dataclass(frozen=True)
class PageText:
    """One extracted page of clean text."""
    url: str
    title: str
    text: str


def normalize_text(text: str) -> str:
    """Normalize whitespace while keeping meaningful line breaks."""
    text = text.replace("\u00a0", " ")  # non-breaking spaces are common in HTML
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    lines = [line.strip() for line in text.split("\n")]

    cleaned_lines: list[str] = []
    blank_run = 0
    for line in lines:
        if not line:
            blank_run += 1
            if blank_run <= 1:
                cleaned_lines.append("")
            continue

        blank_run = 0
        cleaned_lines.append(re.sub(r"\s+", " ", line))

    return "\n".join(cleaned_lines).strip()


def extract_clean_text(soup: BeautifulSoup) -> tuple[str, str]:
    """
    Remove common noise from a parsed page and return (title, clean_text).
    """
    # Drop obvious noise first
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Optional: drop common layout blocks on typical sites.
    for tag in soup.select("header, footer, nav, aside, .cookie, .cookie-banner, .ad, .ads, .sidebar"):
        tag.decompose()

    # Hidden content patterns (varies by site)
    for tag in soup.select(".hidden, [aria-hidden='true']"):
        tag.decompose()

    # Extract text with paragraph-ish breaks
    raw_text = soup.get_text(separator="\n\n", strip=True)
    clean_text = normalize_text(raw_text)

    title = soup.title.get_text(" ", strip=True) if soup.title else ""
    return title, clean_text


def fetch_soup(url: str, session: requests.Session, *, timeout_s: float = 20.0) -> BeautifulSoup:
    """Fetch a URL and return a BeautifulSoup object."""
    headers = {
        "User-Agent": "Mozilla/5.0 (compatible; TextExtractor/1.0)"
    }
    resp = session.get(url, headers=headers, timeout=timeout_s)
    resp.raise_for_status()

    # Prefer requests' detected encoding, fallback to UTF-8
    encoding = resp.encoding or "utf-8"
    html_text = resp.content.decode(encoding, errors="replace")

    return BeautifulSoup(html_text, "html.parser")


def find_next_page_url(soup: BeautifulSoup, current_url: str) -> str | None:
    """
    books.toscrape.com pagination:
    <ul class="pager">
      <li class="next"><a href="catalogue/page-2.html">next</a></li>
    </ul>
    """
    next_link = soup.select_one("ul.pager li.next a[href]")
    if not next_link:
        return None
    return urljoin(current_url, next_link["href"])


def crawl_pages(start_url: str, *, max_pages: int = 3) -> list[PageText]:
    """Follow 'next' links and extract clean text from each page."""
    results: list[PageText] = []
    url = start_url

    with requests.Session() as session:
        for _ in range(max_pages):
            soup = fetch_soup(url, session)

            title, text = extract_clean_text(soup)
            results.append(PageText(url=url, title=title, text=text))

            next_url = find_next_page_url(soup, url)
            if not next_url:
                break
            url = next_url

    return results


def save_jsonl(records: Iterable[PageText], path: Path) -> None:
    """Write one JSON object per line (easy to stream and append)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps({"url": r.url, "title": r.title, "text": r.text}, ensure_ascii=False))
            f.write("\n")


def slugify_filename(title: str, *, max_len: int = 80) -> str:
    """
    Make a filesystem-friendly filename stem.
    - Keeps non-ASCII where possible (modern filesystems handle UTF-8 fine).
    - Removes path separators and control chars.
    - Collapses whitespace.
    """
    title = title.strip()

    # Normalize unicode (avoids weird lookalike / combining sequences)
    title = unicodedata.normalize("NFKC", title)

    # Replace path separators and other unsafe chars with underscores
    # (keep letters/numbers from any language)
    title = re.sub(r"[\\/:*?\"<>|\x00-\x1f]+", "_", title)

    # Collapse whitespace
    title = re.sub(r"\s+", " ", title).strip()

    # Shorten
    if len(title) > max_len:
        title = title[:max_len].rstrip()

    # Avoid empty
    return title or "page"


def short_url_hash(url: str, n: int = 8) -> str:
    return hashlib.blake2b(url.encode("utf-8"), digest_size=16).hexdigest()[:n]


def save_txt(records: Iterable[PageText], folder: Path) -> None:
    """Save one .txt file per page (good for quick inspection)."""
    folder.mkdir(parents=True, exist_ok=True)

    for i, r in enumerate(records, start=1):
        # Title can be non-ASCII; that's fine on most systems.
        # Add a short URL hash so filenames are unique + stable.
        stem = slugify_filename(r.title or "", max_len=80)
        h = short_url_hash(r.url, n=8)

        out_path = folder / f"{i:02d}_{stem}_{h}.txt"
        out_path.write_text(r.text, encoding="utf-8")

def main() -> None:
    # Start on the first listing page, then follow "next" links.
    start_url = "https://books.toscrape.com/"

    # Crawl a few pages (tweak this number when you want more)
    pages = crawl_pages(start_url, max_pages=3)

    # Store results
    out_dir = Path("output")
    save_jsonl(pages, out_dir / "pages.jsonl")
    save_txt(pages, out_dir / "pages_txt")

    print(f"Saved {len(pages)} pages to: {out_dir.resolve()}")


if __name__ == "__main__":
    main()

Key points:

  • This script turns HTML text extraction into a small but realistic pipeline. It fetches pages, follows pagination, cleans the DOM, extracts readable text, and saves the results.
  • Instead of hard-coding page URLs, the crawler starts from a single entry point and discovers subsequent pages via the site's pagination markup. This reflects how real sites are structured and scales cleanly.
  • A single requests.Session is reused across requests. This reduces connection overhead and keeps headers and cookies consistent.
  • Each page is parsed once into a BeautifulSoup object. All cleanup and extraction operate on that parsed structure, not on raw HTML strings.
  • Noise removal happens before extraction. Scripts, styles, hidden elements, and common layout blocks are removed so they never leak into the text layer.
  • Text is extracted using double newlines as separators. This preserves paragraph boundaries and makes later normalization or block-based processing easier.
  • normalize_text() handles the unglamorous cleanup work: non-breaking spaces, inconsistent line endings, excess blank lines, and repeated whitespace.
  • Pagination is handled by detecting the next link and resolving it to an absolute URL. When no next link exists, crawling stops automatically.
  • Results are saved in two formats. JSONL preserves structured records for pipelines and analysis, while plain text files are convenient for inspection, indexing, or NLP input.
  • The pipeline is intentionally modular. You can change selectors, target a different site, or swap the output format without rewriting the overall flow.

HTML text extraction in other languages

The ideas behind HTML text extraction are not Python-specific. No matter the language, the workflow looks almost the same: fetch a page, parse HTML into a DOM, remove junk, extract readable text, then clean it up.

If you work in a mixed stack or switch between languages, this is good news. Once you understand the concepts in Python, you can transfer them almost directly to other ecosystems like Ruby, JS, or C#. The main differences are library names and syntax. The mental model stays the same.

Ruby and HTML text extraction

Ruby has a strong ecosystem for HTML parsing and web scraping. Just like in Python, Ruby developers typically rely on real HTML parsers instead of string manipulation.

The flow in Ruby mirrors what you've seen so far:

  • Fetch HTML (Net::HTTP, Faraday, or similar)
  • Parse it into a DOM
  • Remove scripts, styles, and layout blocks
  • Extract text nodes
  • Normalize the result

Here's a very simple Ruby example that follows the same flow you've seen in Python. It fetches a page, parses HTML, removes scripts and styles, and extracts readable text:

require "net/http"
require "uri"
require "nokogiri"

url = URI("https://example.com")
html_source = Net::HTTP.get(url)

# Parse HTML into a DOM
doc = Nokogiri::HTML(html_source)

# Remove script and style tags
doc.search("script, style").each(&:remove)

# Extract readable text
text = doc.text

# Basic normalization
clean_text = text
  .gsub("\u00a0", " ")
  .lines
  .map(&:strip)
  .reject(&:empty?)
  .join("\n")

puts clean_text

If you're writing Ruby and want a language-specific walkthrough, this guide is a good starting point: Ruby HTML parser guide.

Extracting text with Ruby and Nokogiri

Nokogiri is the most popular HTML parsing library in the Ruby world. Conceptually, it plays the same role as BeautifulSoup or lxml in Python.

With Nokogiri, you:

  • Load HTML into a document object
  • Use CSS selectors or XPath to find elements
  • Remove unwanted nodes like scripts and styles
  • Call .text to extract readable content

If you're comfortable with BeautifulSoup, Nokogiri will feel familiar. CSS selectors work the same way. XPath is available when you need more precision. And, just like in Python, the key is to clean the DOM before you extract text.

Here's a Nokogiri example that shows just the core idea: select nodes, remove junk, then extract text:

require "nokogiri"

html_source = <<~HTML
  <html>
    <head>
      <style>.hidden { display: none }</style>
      <script>console.log("tracking")</script>
    </head>
    <body>
      <h1>Sample page</h1>
      <p>This is visible text.</p>
      <p class="hidden">This should not be extracted.</p>
    </body>
  </html>
HTML

# Load HTML into a document object
doc = Nokogiri::HTML(html_source)

# Remove unwanted nodes
doc.search("script, style, .hidden").each(&:remove)

# Extract readable content
text = doc.text.strip

puts text

This article dives deeper into practical Nokogiri usage and patterns: Parse HTML with Nokogiri.

HTML parsing in JavaScript

JavaScript follows the same extraction flow, whether you're running in Node.js or in a browser-like environment. You fetch HTML, parse it into a DOM, remove junk nodes, extract text, then clean it up.

In Node.js, this is usually done with libraries like jsdom or cheerio. They give you a DOM API that feels very similar to what browsers provide, so the mental model stays the same.

The flow in JavaScript looks like this:

  • Fetch HTML (fetch, axios, or similar)
  • Parse it into a DOM
  • Remove scripts, styles, and layout elements
  • Extract text content
  • Normalize whitespace

Here's a very simple Node.js example using jsdom.

import { JSDOM } from "jsdom";

// Example HTML (could also come from fetch)
const html_source = `
<!doctype html>
<html>
  <head>
    <style>.hidden { display: none }</style>
    <script>console.log("tracking")</script>
  </head>
  <body>
    <header>Navigation</header>
    <main>
      <h1>Sample page</h1>
      <p>This is visible text.</p>
      <p class="hidden">This should not appear.</p>
    </main>
    <footer>Footer content</footer>
  </body>
</html>
`;

// Load HTML into a DOM
const dom = new JSDOM(html_source);
const document = dom.window.document;

// Remove unwanted elements
document.querySelectorAll("script, style, header, footer, .hidden")
  .forEach(el => el.remove());

// Extract readable text
let text = document.body.textContent || "";

// Basic normalization
text = text
  .replace(/\u00a0/g, " ")
  .replace(/\s+/g, " ")
  .trim();

console.log(text);

If you're comfortable with browser DOM APIs, this should feel natural. CSS selectors work the same way, and the overall extraction logic matches what you've already seen in Python, Ruby, or C#.

Learn more about JS web scraping in our dedicated tutorial.

HTML parsing in C#

C# and .NET follow the same overall approach you've seen in Python and Ruby. You fetch HTML, parse it into a DOM-like structure, remove elements you don't care about, then extract and clean the text.

In the .NET ecosystem, developers typically rely on dedicated HTML parsing libraries rather than string operations. These libraries understand malformed markup, support CSS selectors or XPath, and make it easy to remove scripts, styles, and layout blocks before extracting text.

The flow in C# looks very familiar:

  • Download HTML using HttpClient
  • Load it into an HTML document object
  • Select and remove unwanted nodes like scripts, styles, headers, and footers
  • Extract text nodes and normalize the result

Here's a C# example that follows the same extraction flow using a common .NET HTML parser.

using System;
using System.Net.Http;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

var url = "https://example.com";

using var http = new HttpClient();
var html_source = await http.GetStringAsync(url);

// Load HTML into a document object
var doc = new HtmlDocument();
doc.LoadHtml(html_source);

// Remove unwanted nodes
var nodesToRemove = doc.DocumentNode.SelectNodes("//script|//style");
if (nodesToRemove != null)
{
    foreach (var node in nodesToRemove)
    {
        node.Remove();
    }
}

// Extract readable text
var text = doc.DocumentNode.InnerText;

// Basic normalization
text = Regex.Replace(text, @"\s+", " ").Trim();

Console.WriteLine(text);

If you're working in C# and want a concrete, language-specific walkthrough, this guide covers the available libraries and common patterns in more detail: C# HTML parser guide.

The takeaway is simple: languages change, libraries change, but the extraction mindset doesn't. Once you understand how to safely extract text from HTML in one language, picking it up in another becomes much easier.

Let an API handle HTML fetching for you

Everything we've covered so far focuses on parsing, cleaning, and extracting text. That's the part you actually want to control. What usually slows projects down is everything before that.

Fetching pages reliably is hard. Real sites block requests, require JavaScript rendering, rotate content, or behave differently based on headers and IPs. You end up spending more time fighting networking issues than working on extraction logic. One way to simplify this is to offload the fetching part to a scraping API and keep text extraction in Python.

The idea is simple: let an external service deal with JavaScript rendering, anti-bot defenses, and retries, then give you clean HTML that's ready to parse.

With a tool like Web Scraping API by ScrapingBee, you can request a page and get back fully rendered HTML. From there, nothing changes in your code. You still pass the HTML into BeautifulSoup, lxml, or html-text and apply the same extraction and cleanup steps you've already seen.
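
As a rough sketch, the fetch-then-parse handoff looks like this (the endpoint and parameter names follow ScrapingBee's HTTP API; check the current documentation for the exact options available on your plan):

import requests
from bs4 import BeautifulSoup

# Assumption: ScrapingBee's HTTP API with api_key, url, and render_js parameters
resp = requests.get(
    "https://app.scrapingbee.com/api/v1/",
    params={
        "api_key": "YOUR_API_KEY",
        "url": "https://example.com",
        "render_js": "true",
    },
    timeout=60,
)
resp.raise_for_status()

# From here, nothing changes: parse and extract exactly as before
soup = BeautifulSoup(resp.text, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()
print(soup.get_text(" ", strip=True))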

This setup keeps responsibilities clear:

  • The API handles fetching, rendering, and blocking issues
  • Your Python code handles parsing, text extraction, and analysis

If you're working on production pipelines or crawling at scale, this approach saves time, reduces edge cases, and lets you focus on the part that actually matters: turning HTML into useful data.

Try ScrapingBee today for free!

Conclusion

Extracting text from HTML is less about one library and more about having a clear process. Fetch the page, parse it with a real HTML parser, remove the junk, extract readable content, then clean it up. Python makes this easy with tools like BeautifulSoup, lxml, and html-text, but the same ideas apply in Ruby, C#, and JavaScript. Once you understand the flow, switching languages or libraries is mostly mechanical.

Start simple. Use real parsers. Avoid regex for structure. And when fetching pages becomes the bottleneck, offload that part so you can focus on extraction and analysis.


Frequently asked questions (FAQs)

Which Python library is best to extract text from HTML?

  • BeautifulSoup is usually the best starting point. It's easy to learn, tolerant of messy HTML, and flexible when you need custom cleanup.
  • lxml is faster and more strict, which makes it a good choice at scale or when you need precise DOM control with XPath or CSS selectors.
  • html-text works well when you mainly want clean, normalized plain text with minimal post-processing, but it doesn't reliably isolate the "main article" on its own.

How do I ignore scripts, styles, and navigation when I extract text?

Parse the HTML first, then remove unwanted elements like <script>, <style>, headers, footers, and sidebars using tag filters or CSS selectors. After that, extract text. This cleanup step dramatically improves readability and prevents JavaScript or layout junk from leaking into results.

Can I extract only part of a page instead of all text?

Yes. Instead of extracting from the whole document, select a specific container like <main> or an article section using CSS selectors. Then extract text only from that subtree. This keeps results focused and avoids pulling in navigation, ads, or unrelated page content.
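
A minimal sketch of that idea, assuming the page has a <main> element (swap in whatever selector matches your page):

from bs4 import BeautifulSoup

# Paste or fetch the HTML you want to process
html_source = "..."

soup = BeautifulSoup(html_source, "html.parser")

# Extract text only from the main content container
main = soup.select_one("main")  # could also be "article", "#content", ".post-body", etc.
text = main.get_text(" ", strip=True) if main else ""
print(text)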

Is Python the only option for HTML text extraction?

No. The same approach works in many languages. Ruby, C#, and JavaScript all have solid HTML parsers. The syntax changes, but the flow stays the same: parse HTML, remove noise, extract text, then clean it. Choose based on your stack.

Vivek Singh

I am a Python/Django Developer always ready to learn and teach new things to fellow developers.