Python HTML Parsers

Karthik Devan | 26 May 2025 | 19 min read


The internet is full of useful information and data. In 2025 it's forecast that an astonishing 496 quintillion bytes of data will be created daily. This data can be analyzed and used to make better business decisions. However, most of the data is not structured and isn’t readily available for processing. That’s where web scraping comes in.

Web scraping enables you to retrieve data from a web page and store it in a format useful for further processing. But, as you probably know, web pages are written in a markup language called HTML. So, in order for you to extract the data from a web page, you must first parse its HTML. In doing so, you transform the HTML into a tree of objects.

There are numerous HTML parsers on the market, and choosing which one to go with can be confusing. In this roundup, you’ll review some of the best Python HTML parsers out there. Python is one of the most popular languages when it comes to scraping data, so it’s not surprising that there are quite a few options to consider.

Here's a quick comparison table of the parsers we'll cover in this guide:

Parser/Library	Performance & Speed	Robustness (Malformed HTML)	Ease of Use & API Design	Extraction Features (Selectors etc.)
BeautifulSoup (API)	Varies (depends on parser used)	Varies (depends on parser used)	Very High (Pythonic, intuitive)	Rich navigation, CSS Selectors
`html.parser`	Moderate (Pure Python)	Moderate	Via BS4: High; Direct: Low	Via BS4: CSS; Direct: Basic
`html5lib`	Slow	Very High (Browser-like)	Via BS4: High; Direct: Moderate	Via BS4: CSS; Direct: Basic
`lxml`	Very Fast (C-based)	High	Moderate-High (Powerful)	XPath, CSS Selectors
`Selectolax`	Exceptionally Fast (C-based)	High (HTML5 compliant)	Moderate (CSS-centric)	CSS Selectors
`pyquery`	Fast (uses `lxml`)	High (uses `lxml`)	High (jQuery-like)	CSS & XPath (jQuery-style), DOM manip.
`jusText`	Fast (Specialized task)	N/A (Text extraction only)	Very High (Specific task)	Boilerplate removal, no general selectors
`Scrapy` (Framework)	High (Async, uses `lxml` via Parsel)	High (via Parsel/`lxml`)	Moderate (Framework); High (Selectors)	XPath, CSS (via Parsel); Full crawling features

Skip the Parser Decision: Use our Web Scraping API, which makes it easy to scrape any website: just give our API the selectors and we fetch your data. Sign up and get 1,000 free Web Scraping credits and easily build a Python web scraper with only a few clicks in our request builder. We even have an AI-powered Web Scraping feature where you can just describe the data you want in plain English and we extract it, no selectors needed.

BeautifulSoup

BeautifulSoup is often described as an HTML parser, but strictly speaking it is a lightweight and versatile library that wraps a parser of your choice and lets you extract data from both HTML and XML files. You can use it for web scraping, but also for modifying HTML files. While it's relatively simple and easy to learn, it's a very powerful library. You can complete even fairly complex web scraping projects using BeautifulSoup as the only parsing library.

The BeautifulSoup Python package is not built-in, so you need to install it before using it. Fortunately, it is very user-friendly and easy to set up. Simply install it by running pip install beautifulsoup4. Once that is done, you only need to input the HTML file you want to parse. You can then use the numerous functions of BeautifulSoup to find and scrape the data you want, all in just a few lines of code.

Let’s take a look at an example based on this simple HTML from BeautifulSoup’s documentation. Note that if you wanted to scrape a real web page instead, you’d first need to use something like the Requests module in order to access the web page.

Here’s the HTML:

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
"""

Before parsing this, you’ll need to import the BeautifulSoup library using the following code:

from bs4 import BeautifulSoup

Finally, you can parse the HTML with this simple line of code:

soup = BeautifulSoup(html_doc, 'html.parser')

With this line, you pass the document you want to parse (in this case, html_doc) into the BeautifulSoup constructor, which then takes care of the parsing. You've probably noticed that, apart from the document, the constructor takes a second argument. That's where you name the parser you want to use. If you prefer lxml to html.parser (and have installed it with pip install lxml), you can run the following instead:

soup = BeautifulSoup(html_doc, 'lxml')

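Keep in mind that different parsers can build different trees from the same document, especially when the HTML is malformed, so the same query may return different results depending on the parser you pick.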
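To make the parser differences concrete, here is a minimal sketch that feeds the same broken fragment to each parser and prints what comes back. The fragment and the variable names are illustrative assumptions; lxml and html5lib are optional installs, so the loop skips whichever one is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Illustrative broken markup: an unclosed <a> followed by a stray </p>
broken_html = "<a></p>"

results = {}
for parser_name in ("html.parser", "lxml", "html5lib"):
    try:
        # str(soup) serializes the tree that this parser built
        results[parser_name] = str(BeautifulSoup(broken_html, parser_name))
    except FeatureNotFound:
        # Raised by BeautifulSoup when the requested parser is not installed
        results[parser_name] = None

for parser_name, markup in results.items():
    print(f"{parser_name:12} -> {markup}")
```

With html.parser the fragment stays a bare fragment, while lxml and html5lib, if installed, wrap it in a full document structure, each in its own way.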

Now that the HTML is parsed, you can find and extract any elements that you're looking for. For example, finding the title element is as easy as running soup.title. You can get the actual title with soup.title.string. Some popular functions BeautifulSoup offers include find() and especially find_all(). These make it extremely easy to find any elements you want. For instance, finding all the paragraphs or links in an HTML document is as simple as running soup.find_all('p') or soup.find_all('a').
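Put together, those calls might look like the sketch below. It uses a shortened copy of the sample document so that it runs on its own; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample document used above
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string            # text inside <title>
first_link = soup.find("a")               # first matching element only
link_urls = [a["href"] for a in soup.find_all("a")]  # every <a>, in order
paragraph_count = len(soup.find_all("p"))

print(title_text)
print(first_link.get_text())
print(link_urls)
print(paragraph_count)
```

Note that find() returns a single element (or None when nothing matches), while find_all() always returns a list.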

In a nutshell, BeautifulSoup is a powerful, flexible, and easy-to-use web scraping library. It’s great for everyone, but thanks to its extensive documentation and being straightforward to use, it’s especially suitable for beginners. To learn more about all the options BeautifulSoup can provide, check out the documentation.

BeautifulSoup Pros

Extremely Easy to Learn and Use: Its API is very Pythonic and intuitive, making it a popular choice for beginners and quick tasks.
Flexible Parser Integration: Can work with different underlying parsers (lxml, html.parser, html5lib), allowing you to choose based on speed or leniency needs.
Excellent for Navigating and Searching: Provides a rich set of methods for traversing the parse tree (find(), find_all(), select() for CSS selectors, parent/sibling/child navigation).
Handles Messy HTML Gracefully (via underlying parsers): When paired with lxml or html5lib, it can make sense of poorly structured HTML.
Large Community and Extensive Documentation: Abundant resources, tutorials, and community support are available.
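As a rough illustration of the navigation and searching methods listed above, the following sketch combines CSS selection with parent and sibling lookups. The markup and names are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to demonstrate navigation
snippet = """
<div id="team">
  <h2>Sisters</h2>
  <ul>
    <li class="sister"><a href="/elsie">Elsie</a></li>
    <li class="sister"><a href="/lacie">Lacie</a></li>
    <li class="sister"><a href="/tillie">Tillie</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(snippet, "html.parser")

# select() takes a CSS selector and returns every match
names = [a.get_text() for a in soup.select("li.sister > a")]

# select_one() returns only the first match
first_item = soup.select_one("li.sister")

# Sibling and parent lookups start from an element you already hold
next_item = first_item.find_next_sibling("li")
container = first_item.find_parent("div")

print(names)
print(next_item.a["href"])
print(container["id"])
```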

BeautifulSoup Cons

Not a Parser Itself: BeautifulSoup is an interface; the actual parsing speed and robustness depend on the underlying parser (lxml, html.parser, html5lib) you choose.
Slower than Direct Parser Usage: There's a slight overhead compared to using lxml or Selectolax directly for raw parsing, as BeautifulSoup adds its own layer of abstraction.
No Native XPath Support: While it supports CSS selectors, full XPath querying usually requires integrating with lxml more directly or using third-party additions.

💡 Love BeautifulSoup? Check out our awesome guide to improving scraping speed performance with BS4. Also our comprehensive guide on Python Web Scraping is a must read.

html.parser

The html.parser is Python's built-in HTML parsing library, meaning it requires no external installation beyond Python itself. When you use it with BeautifulSoup, you get BeautifulSoup's convenient API powered by this standard library parser.

To illustrate, let's fetch the ScrapingBee homepage and extract all H2 headings using html.parser as the engine for BeautifulSoup:

import requests
from bs4 import BeautifulSoup
url = "https://www.scrapingbee.com/"
try:
    response = requests.get(url)
    response.raise_for_status() # Ensure we got a successful response
    html_content = response.text

    # Parse using html.parser
    soup = BeautifulSoup(html_content, 'html.parser')

    h2_tags = soup.find_all('h2')
    print(f"Found {len(h2_tags)} H2 headings using html.parser:")
    for h2 in h2_tags:
        print(f"- {h2.get_text(strip=True)}")

except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")

This approach leverages the simplicity of Python's built-in parser with the powerful features of BeautifulSoup for navigating and searching the HTML tree. While other parsers might offer more speed or specialized features, html.parser provides a reliable and accessible starting point.
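For comparison, html.parser can also be used directly, without BeautifulSoup. The sketch below shows why the comparison table rates direct use as low on ease of use: you subclass HTMLParser and track state yourself. The class name H2Collector and the sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class H2Collector(HTMLParser):
    """Collects the text of every <h2> element fed to the parser."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headings = []
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
            self._parts = []

    def handle_data(self, data):
        # Text arrives in chunks, including text inside nested tags
        if self.in_h2:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self.in_h2:
            self.headings.append("".join(self._parts).strip())
            self.in_h2 = False


sample_html = (
    "<h1>Title</h1><h2>First</h2><p>text</p>"
    "<h2>Second <em>heading</em></h2>"
)
collector = H2Collector()
collector.feed(sample_html)
collector.close()
print(collector.headings)
```

The event-driven style works, but every new extraction rule means more hand-written state tracking, which is the part BeautifulSoup takes off your hands.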

html.parser Pros

Built-in: No extra pip install needed for the parser itself.
Good Default: A solid choice for many common web scraping tasks, especially if you prefer to minimize external dependencies or are working with relatively well-formed HTML.

html.parser Cons

No Built-in XPath Support: Unlike lxml, html.parser does not have any built-in support for XPath expressions.
Not as Lenient: It's reasonably tolerant of minor HTML errors, though not as robust for severely broken HTML as lxml or html5lib.

html5lib

html5lib is an excellent parser you can use with BeautifulSoup, especially when dealing with very messy or invalid HTML. This parser is designed to parse documents in the same way a web browser does, following the HTML5 parsing algorithm. This makes it extremely robust and capable of handling even badly broken markup, aiming to produce a parse tree that mirrors what a user would see in their browser.

To use html5lib with BeautifulSoup, you first need to install it:

pip install html5lib

Then, you can specify it when creating your BeautifulSoup object:

soup = BeautifulSoup(html_doc, 'html5lib')

html5lib Pros

Extremely Lenient: Its biggest strength. It handles malformed HTML exceptionally well, mimicking browser behavior.
Accuracy: Aims for a parse tree consistent with modern web browser rendering.

html5lib Cons

Slower Performance: Generally slower than lxml or html.parser due to its more complex parsing algorithm.
Dependency: Requires an external installation.

lxml

Another high-performance HTML parser is lxml. You've already seen it in the BeautifulSoup section: BeautifulSoup can use lxml as its parser if you pass 'lxml' as the second argument to the constructor.

With lxml, you can extract data from both XML and broken HTML. It's very fast, safe, and easy to use thanks to its Pythonic API, and it requires no manual memory management. It also comes with a dedicated package for parsing HTML, lxml.html.

As with BeautifulSoup, to use lxml you first need to install it, which is easily done with pip install lxml. Once you install it, there are a couple of different functions for parsing HTML, such as parse() and fromstring(). For example, the same html_doc from the BeautifulSoup example can be parsed with the following piece of code:

from lxml import etree
from io import StringIO

parser = etree.HTMLParser()

tree = etree.parse(StringIO(html_doc), parser)

Now the parsed document is held in tree. From there, you can extract the data you need using a variety of methods, such as XPath or CSS selectors (the latter through the separate cssselect package). You can learn how to do this in detail from lxml's extensive documentation.
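As a small sketch of the XPath route, the example below parses a shortened sample document with lxml.html and pulls out the title, the link URLs, and the link texts. The sample markup and variable names are illustrative.

```python
from lxml import html as lxml_html

# Shortened sample document so the example runs on its own
sample = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>"""

tree = lxml_html.fromstring(sample)

# text() selects text nodes; xpath() always returns a list
page_title = tree.xpath("//title/text()")[0]

# @href selects the attribute values directly
hrefs = tree.xpath('//a[@class="sister"]/@href')

# Selecting elements instead lets you call element methods on each match
names = [a.text_content() for a in tree.xpath('//a[@class="sister"]')]

print(page_title)
print(hrefs)
print(names)
```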

lxml is a lightweight, actively maintained parser with great documentation. It’s similar to BeautifulSoup in many aspects, so if you know one of them, picking up the other should be pretty straightforward.

lxml Pros

High Performance: Being a C-based library, lxml is exceptionally fast for both parsing and serializing HTML and XML, making it ideal for speed-sensitive applications.
Robust Parsing & XML Support: It's highly effective at parsing broken HTML, offers comprehensive XML support (including XSLT, RelaxNG, XML Schema), and is very standards-compliant.
Powerful Selection (XPath & CSS): Provides full support for XPath 1.0 expressions and also supports CSS selectors (via cssselect), offering flexible and powerful ways to query documents.

lxml Cons

External C Dependency: Relies on C libraries (libxml2 and libxslt), which can sometimes make installation more complex on certain systems or in restricted environments compared to pure Python libraries.
Steeper Learning Curve for Advanced Features: While basic parsing is straightforward, mastering XPath or its more advanced XML features can require more effort than simpler APIs like BeautifulSoup for basic tasks.
Less "Pythonic" API for Simple Traversal (for some): Some users find its API for simple tree navigation less intuitive or "Pythonic" at first glance compared to BeautifulSoup's direct attribute access style.

Selectolax

When raw parsing speed is a top priority, Selectolax emerges as a compelling option. It's a fast, modern HTML5 parser for Python that wraps C parsing engines: the original Modest engine and its successor, Lexbor. Selectolax is designed for efficiency and primarily uses CSS selectors for data extraction, making it straightforward for those familiar with web development or other scraping tools that use CSS selectors.

To get started with Selectolax, you'll first need to install it:

pip install selectolax

In the example below, HTMLParser(html_content) creates a parse tree, and tree.css('h2') selects all h2 elements. The .text(strip=True) method then returns the text content of each node with surrounding whitespace removed.

import requests
from selectolax.parser import HTMLParser

# Ensure you have selectolax installed: pip install selectolax

url = "https://www.scrapingbee.com/"
try:
    response = requests.get(url)
    response.raise_for_status() # Ensure we got a successful response
    html_content = response.text

    # Parse the HTML content
    tree = HTMLParser(html_content)

    # Find all <h2> elements using a CSS selector
    # .css() returns a list of nodes
    h2_nodes = tree.css('h2')

    print(f"Found {len(h2_nodes)} H2 headings using Selectolax:")
    for node in h2_nodes:
        # .text(strip=True) extracts the text content and trims whitespace
        print(f"- {node.text(strip=True)}")

except requests.exceptions.RequestException as e:
    print(f"Error fetching URL: {e}")
except Exception as e:
    print(f"An error occurred: {e}")

Selectolax Pros

Exceptional Speed: This is Selectolax's main advantage. It's often significantly faster than BeautifulSoup (with any of its underlying parsers) and can even outperform lxml in many benchmarks for pure parsing and CSS selection tasks.
Strict HTML5 Compliance: Being based on the Modest/Lexbor engine, it aims for high compliance with HTML5 parsing specifications, meaning it generally parses HTML "like a browser."
Efficient CSS Selectors: Provides a fast and efficient implementation for querying the document using CSS selectors.
Lightweight (Performance): Due to its C-based backend and focus on speed, it tends to be resource-efficient.
Simple and Focused API: The API is generally concise and geared towards parsing and selecting, making it relatively easy to pick up if you're comfortable with CSS selectors.

Selectolax Cons

No XPath Support: If XPath is your preferred method for navigating and selecting elements, Selectolax doesn't offer it.
Less Feature-Rich for Tree Manipulation/Navigation: Compared to BeautifulSoup, which offers a wide array of methods for navigating the parse tree (e.g., find_parent, find_next_sibling, children), Selectolax is more focused on direct selection via CSS. Tree manipulation features are also minimal.

pyquery

If you enjoy the jQuery API and would like it in Python, then pyquery is for you. It’s a library that offers both XML and HTML parsing, with high speeds and an API very similar to jQuery.

Just like the other HTML parsers in this roundup, pyquery allows you to traverse and scrape information from the XML or HTML file. It also allows you to manipulate the HTML file, including adding, inserting, or removing elements. You can create document trees from simple HTML and then manipulate and extract data from them. Also, since pyquery includes many helper functions, it helps you save time by not having to write as much code yourself.

In order to use pyquery, you first need to install it with pip install pyquery and then import it with from pyquery import PyQuery as pq. Once that’s done, you can load documents from strings, URLs, or even lxml. Here’s the code that will enable you to do so:

from pyquery import PyQuery as pq

d = pq(html_doc) # loads the html_doc we introduced previously
d = pq(url="https://www.scrapingbee.com/") # fetches and loads the page at the given URL

pyquery does essentially the same thing as lxml and BeautifulSoup. The main distinction is in its syntax, which is intentionally very similar to that of jQuery. For instance, in the code above, the d is like the $ in jQuery. If you’re familiar with jQuery, this might be a particularly helpful option for you.

However, pyquery is not as popular as BeautifulSoup or lxml, so its community support is comparatively lacking. Still, it's lightweight, actively maintained, and has great documentation.

pyquery Pros

jQuery-like API: Its syntax closely mimics jQuery, making it very intuitive and easy to pick up for web developers already familiar with jQuery's selection and manipulation patterns.
Combines Parsing with DOM Manipulation: Allows not only for querying documents but also for modifying them (adding, removing, changing elements and attributes) using a familiar API.
Leverages lxml: Built on top of lxml, so it inherits lxml's speed and robustness for the underlying parsing tasks, providing a good performance base.

pyquery Cons

Smaller Community & Resources: Compared to BeautifulSoup or lxml standalone, pyquery has a smaller user base, leading to fewer online tutorials, examples, and community support threads.
Niche Appeal: Its primary advantage (jQuery syntax) is most beneficial to those already comfortable with jQuery; others might find its API less universally intuitive than BeautifulSoup.
Less Active Development (Historically): While still functional, it has historically seen less frequent updates or feature additions compared to more mainstream libraries like BeautifulSoup or lxml. (It's always good to check the latest repository activity.)

jusText

While not as powerful as the other parsers discussed in this roundup, jusText is a tool that can be very useful in specific instances, such as when you want to keep only the full sentences on a given web page. It has one main goal: to remove all boilerplate content from a web page, and leave only the main content. So, if you were to pass a web page to jusText, it would remove the contents from the header, main menu, footer, sidebar, and everything else deemed not to be important, leaving you with only the text on the main part of the page.

You can give it a try using this jusText demo. Simply input the URL of any web page, and it will strip out all unnecessary content, leaving only what it considers important. If this is something that you would find useful, you can learn more about jusText here.

In order to use jusText, you’d first need to install it using pip install justext. You would also need to have the Requests module installed so that you can access the web page from which you want to remove the boilerplate content. Then, you can use the following piece of code to get the main content of any web page:

import requests
import justext

url = "https://www.scrapingbee.com/"

response = requests.get(url)
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))

for paragraph in paragraphs:
    # Keep only the paragraphs jusText did not classify as boilerplate
    if not paragraph.is_boilerplate:
        print(paragraph.text)

This piece of code fetches the ScrapingBee homepage and prints only its main content. To use it on another page, change the URL. Note that the example uses the English stoplist; for pages in another language, pass a different stoplist name to get_stoplist.

Similar to pyquery, jusText is another package that's not as popular as BeautifulSoup, and its main use is limited to one particular task. Compared to lxml and BeautifulSoup, it doesn't come with extensive documentation, nor is it as actively maintained. Still, it's a lightweight parser that is great for the task of removing boilerplate content.

jusText Pros

Excellent for Boilerplate Removal: Specifically designed and highly effective at identifying and stripping away common boilerplate content (headers, footers, ads, navigation) to isolate the main textual content of a webpage.
Simple to Use for its Core Task: For its primary purpose of extracting clean article text, it requires minimal code and is very straightforward to implement.
Language-Aware Cleaning: Utilizes stoplists for various languages to help distinguish between boilerplate and substantive content, improving accuracy across different language websites.

jusText Cons

Highly Specialized Niche: It's not a general-purpose HTML parser; its utility is almost exclusively for extracting main textual content and it lacks features for general DOM traversal or data extraction.
Limited Control & Potential Inaccuracies: The heuristic-based approach means you have less fine-grained control, and it might occasionally misclassify content, either removing important text or leaving some boilerplate.
Less Active Maintenance & Documentation: Compared to major parsing libraries, it generally has less extensive documentation and has seen periods of lower maintenance activity.

Scrapy

Finally, BeautifulSoup may be great for parsing HTML and extracting data from it, but when it comes to crawling and extracting a whole web site, there are better choices available. If you’re working on a highly complex web scraping project, you might be better off with a framework like Scrapy. It’s much more complicated and requires a steeper learning curve than BeautifulSoup, but it’s also much more powerful.

To say that Scrapy is an HTML parser is a huge understatement, since parsing HTML is a minuscule part of what Scrapy is capable of. Scrapy is a complete Python web scraping framework. It has all the features that you may require for web scraping, such as crawling an entire website from a single URL, exporting and storing the data in various formats and databases, limiting the crawling rate, and more. It's very powerful, efficient, and customizable.

Among its many features, Scrapy offers methods for parsing HTML. First, you need to request the URL you want parsed, which you can do by listing it in start_urls or by implementing the start_requests method. Once the response arrives, Scrapy passes it to the parse method, which extracts the data. The parse method can return or yield different kinds of objects, such as Item objects, Request objects, and dictionaries. Yielded items are handed to the Item Pipeline, while yielded requests are scheduled for crawling.

The main downside to Scrapy is its steep learning curve; unlike the other items in this roundup, Scrapy is not suitable for beginners. Learning it can be a challenge, since there are so many features you'll need to be familiar with.

However, by learning enough Scrapy, you can extract any piece of data available on a website. If you really want to step up your web scraping abilities, there’s no better tool. Because of its many features, Scrapy is not as lightweight as the other mentioned parsers. Still, it’s actively maintained and offers very extensive documentation and great community support—it’s one of the most popular Python packages available. If you want to learn how to build your first Scrapy spider, check out this Scrapy beginner’s guide.

Scrapy Pros

Comprehensive Framework: Provides an all-in-one solution for the entire web scraping process, from making requests to processing and storing data.
Asynchronous & Fast: Built on Twisted, allowing for concurrent requests, making it very efficient for crawling multiple pages.
Powerful Selectors: Leverages Parsel for robust data extraction using CSS selectors and XPath expressions.
Built-in Features: Offers out-of-the-box support for many common scraping needs, like auto-throttling, cookie handling, configurable user agents and proxies, and data export in formats like JSON, CSV, and XML.
Scalability: Well-suited for large-scale, complex scraping projects that require crawling entire websites or many different sites.
Organized Project Structure: Enforces a clear project layout, which helps in maintaining larger codebases.

Scrapy Cons

Steep Learning Curve: More complex to learn than standalone parsing libraries, especially for beginners. Requires understanding its architecture (Spiders, Items, Pipelines, Middlewares).
Overkill for Simple Tasks: Using Scrapy for extracting data from a single page or a very small number of pages can be excessive.
Heavier Setup: Involves setting up a Scrapy project structure, which is more involved than writing a simple script with requests and BeautifulSoup.
Configuration: Can require more configuration for specific behaviors compared to simpler libraries.

💡 Love web scraping in Python? Check out our expert list of the Best Python web scraping libraries.

Conclusion

The amount of data on the internet increases by the day, and the need to handle this data and turn it into something useful grows in turn. In order to gather the data efficiently, you must first parse a web page’s HTML and extract the data from it. To parse the HTML, you need an HTML parser.

This article discussed several Python HTML parsers, reviewed based on whether they're open source, lightweight, and actively maintained, as well as whether they offer good performance and community support.

For your next web scraping project, consider using one of these parsers. For most use cases, BeautifulSoup is likely the best place to start, with lxml being a viable alternative. If you prefer the jQuery API, then you can opt for pyquery. In cases when all you need is to get the main content from a web page, jusText is a great option. Finally, if you're scraping an entire web site with thousands of pages, Scrapy is probably your best bet.

Karthik Devan

I work freelance on full-stack development of apps and websites, and I'm also trying to work on a SaaS product. When I'm not working, I like to travel, play board games, hike and climb rocks.