Python HTML Parsers
The internet is full of useful information and data. In 2021, an astonishing 2.5 quintillion bytes of data was created daily. This data can be analyzed and used to make better business decisions. However, most of the data is not structured and isn’t readily available for processing. That’s where web scraping comes in.
Web scraping enables you to retrieve data from a web page and store it in a format useful for further processing. But, as you probably know, web pages are written in a markup language called HTML. So, in order for you to extract the data from a web page, you must first parse its HTML. In doing so, you transform the HTML into a tree of objects.
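To make the idea of "a tree of objects" concrete, here is a minimal sketch that builds a toy tree with the standard library's html.parser module. The Node and TreeBuilder names are made up for illustration, and the sketch deliberately ignores details that real parsers handle (void elements such as br, broken nesting, character encodings).

```python
from html.parser import HTMLParser


class Node:
    """A toy tree node (hypothetical helper, not part of any library)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Turns start tags, end tags, and text into nested Node objects."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        # Every start tag becomes a child of the element that is currently open.
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Walk up to the matching open element; stray end tags are ignored.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        # Text is attached to whichever element is currently open.
        self.current.text += data


builder = TreeBuilder()
builder.feed("<html><body><p class='title'>Hello, <b>world</b></p></body></html>")

html_node = builder.root.children[0]
paragraph = html_node.children[0].children[0]
print(paragraph.tag, paragraph.attrs, [child.tag for child in paragraph.children])
```

The libraries in this roundup do the same kind of work far more robustly and add convenient ways to search the resulting tree.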
There are numerous HTML parsers on the market, and choosing which one to go with can be confusing. In this roundup, you’ll review some of the best Python HTML parsers out there. Python is one of the most popular languages when it comes to scraping data, so it’s not surprising that there are quite a few options to consider. The parsers in this roundup were chosen for discussion based on the following factors:
- Open source
- Actively maintained
- Good community support
- High performance
Beautiful Soup is an HTML parser, but it’s also much more than that—it’s a lightweight and versatile library that allows you to extract data and information from both HTML and XML files. You can use it for web scraping, but also for modifying HTML files. While it’s relatively simple and easy to learn, it’s a very powerful framework. You can complete even more complex web scraping projects using Beautiful Soup as the only web scraping library.
The Beautiful Soup Python package is not built-in, so you need to install it before using it. Fortunately, it is very user-friendly and easy to set up. Simply install it by running pip install beautifulsoup4. Once that is done, you only need to input the HTML file you want to parse. You can then use the numerous functions of Beautiful Soup to find and scrape the data you want, all in just a few lines of code.
Let’s take a look at an example based on this simple HTML from Beautiful Soup’s documentation. Note that if you wanted to scrape a real web page instead, you’d first need to use something like the Requests module in order to access the web page.
Here’s the HTML:
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
Before parsing this, you’ll need to import the Beautiful Soup library using the following code:
from bs4 import BeautifulSoup
Finally, you can parse the HTML with this simple line of code:
soup = BeautifulSoup(html_doc, 'html.parser')
With this line, you simply pass the document you want to parse (in this case, html_doc) into the BeautifulSoup constructor, which then takes care of the parsing. You’ve probably noticed that apart from the document you want to parse, the BeautifulSoup constructor takes a second argument. That’s where you pass the name of the parser you want to use. If you prefer lxml to html.parser (lxml has to be installed separately), you can simply run the following:
soup = BeautifulSoup(html_doc, 'lxml')
Keep in mind that different parsers can produce different trees from the same input, especially when the HTML is invalid or incomplete.
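A small sketch of how parser choice can change the result: the snippet below feeds the same invalid fragment to two parsers and prints both trees. It assumes both beautifulsoup4 and lxml are installed. The fragment is the one the Beautiful Soup documentation uses to illustrate parser differences.

```python
from bs4 import BeautifulSoup

# An invalid fragment: an unclosed <a> followed by a stray </p>.
broken_html = "<a></p>"

# Parse the same input with two different parsers and serialize the trees.
builtin_result = str(BeautifulSoup(broken_html, "html.parser"))
lxml_result = str(BeautifulSoup(broken_html, "lxml"))

print("html.parser:", builtin_result)
print("lxml:       ", lxml_result)
```

According to the Beautiful Soup documentation, html.parser keeps only the repaired a element, while lxml additionally wraps it in html and body elements. If your code depends on the exact shape of the tree, it is worth naming the parser explicitly instead of relying on whichever one happens to be installed.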
Now that the HTML is parsed, you can find and extract any elements that you’re looking for. For example, finding the title element is as easy as running soup.title. You can get the actual title with soup.title.string. Some popular functions Beautiful Soup offers include find() and especially find_all(). These make it extremely easy to find any elements you want. For instance, finding all the paragraphs or links in an HTML document is as simple as running soup.find_all('p') or soup.find_all('a').
In a nutshell, Beautiful Soup is a powerful, flexible, and easy-to-use web scraping library. It’s great for everyone, but thanks to its extensive documentation and being straightforward to use, it’s especially suitable for beginners. To learn more about all the options Beautiful Soup can provide, check out the documentation.
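The lookups described above (soup.title, find(), and find_all()) can be sketched as follows. The html_doc string is a shortened copy of the sample document, with closing body and html tags added so the snippet stands on its own. The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>"""

soup = BeautifulSoup(html_doc, "html.parser")

# The text inside the <title> element.
title_text = soup.title.string

# find() returns the first match, find_all() returns every match.
first_link = soup.find("a")
link_urls = [link["href"] for link in soup.find_all("a")]
paragraph_count = len(soup.find_all("p"))

print(title_text)
print(first_link["id"], link_urls)
print(paragraph_count)
```

Attribute values are read with dictionary-style access (link["href"]); link.get("href") is the safer choice when an attribute may be missing.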
Another high-performance HTML parser is lxml. You’ve already seen it in the previous section: Beautiful Soup supports using the lxml parser by simply passing lxml as the second argument to the BeautifulSoup constructor. Previously, lxml was known for its speed, while Beautiful Soup was known for its ability to handle messy HTML. However, now that they support each other, you get both the speed and the ability to handle messy HTML in a single package.
With lxml, you can extract data from both XML and broken HTML. It’s very fast, safe, and easy to use—thanks to the Pythonic API—and it requires no manual memory management. It also comes with a dedicated package for parsing HTML.
As with Beautiful Soup, to use lxml you first need to install it, which is easily done with pip install lxml. Once you install it, there are a couple of different functions for parsing HTML, such as fromstring(). For example, the same html_doc from the Beautiful Soup example can be parsed with the following piece of code:
from lxml import etree
from io import StringIO

parser = etree.HTMLParser()
tree = etree.parse(StringIO(html_doc), parser)
Now the parsed HTML document will be contained in tree. From there, you can extract the data you need using a variety of methods, such as XPath or CSS selectors (through the separate cssselect package). You can learn how to do this in detail from lxml’s extensive documentation.
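As a rough illustration of extracting data with XPath, the sketch below parses a shortened copy of the sample document with lxml.html.fromstring() and runs two queries against it. Only XPath is shown, since CSS selectors need the extra cssselect package. The variable names are illustrative.

```python
from lxml import html as lxml_html

html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
</body></html>"""

# fromstring() returns the root element of the parsed document.
root = lxml_html.fromstring(html_doc)

# findtext() returns the text of the first element matching the path.
title = root.findtext(".//title")

# xpath() can return attribute values directly, here every href of a.sister.
hrefs = root.xpath('//a[@class="sister"]/@href')

print(title)
print(hrefs)
```

The same xpath() calls also work on the tree object produced by etree.parse() in the earlier example.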
lxml is a lightweight, actively maintained parser with great documentation. It’s similar to Beautiful Soup in many aspects, so if you know one of them, picking up the other should be pretty straightforward.
If you enjoy the jQuery API and would like it in Python, then pyquery is for you. It’s a library that offers both XML and HTML parsing, with high speeds and an API very similar to jQuery.
Just like the other HTML parsers in this roundup, pyquery allows you to traverse and scrape information from the XML or HTML file. It also allows you to manipulate the HTML file, including adding, inserting, or removing elements. You can create document trees from simple HTML and then manipulate and extract data from them. Also, since pyquery includes many helper functions, it helps you save time by not having to write as much code yourself.
In order to use pyquery, you first need to install it with pip install pyquery and then import it with from pyquery import PyQuery as pq. Once that’s done, you can load documents from strings, URLs, or even lxml. Here’s the code that will enable you to do so:
d = pq(html_doc)  # loads the html_doc we introduced previously
d = pq(url="https://www.scrapingbee.com/")  # loads from an inputted url
pyquery does essentially the same thing as lxml and Beautiful Soup. The main distinction is in its syntax, which is intentionally very similar to that of jQuery. For instance, in the code above, the d is like the $ in jQuery. If you’re familiar with jQuery, this might be a particularly helpful option for you.
However, pyquery is not as popular as Beautiful Soup or lxml, so its community support is comparatively lacking. Still, it’s lightweight, actively maintained, and has great documentation.
While not as powerful as the other parsers discussed in this roundup, jusText is a tool that can be very useful in specific instances, such as when you want to keep only the full sentences on a given web page. It has one main goal: to remove all boilerplate content from a web page, and leave only the main content. So, if you were to pass a web page to jusText, it would remove the contents from the header, main menu, footer, sidebar, and everything else deemed not to be important, leaving you with only the text on the main part of the page.
You can give it a try using this jusText demo. Simply input the URL of any web page, and it will strip out all unnecessary content, leaving only what it considers important. If this is something that you would find useful, you can learn more about jusText here.
In order to use jusText, you’d first need to install it using pip install justext. You would also need to have the Requests module installed so that you can access the web page from which you want to remove the boilerplate content. Then, you can use the following piece of code to get the main content of any web page:
import requests
import justext

url = "https://www.scrapingbee.com/"
response = requests.get(url)
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    if not paragraph.is_boilerplate:
        print(paragraph.text)
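This piece of code accesses the ScrapingBee homepage and prints only its main content. To use it on another page, all you need to change is the URL, though keep in mind that this example loads the English stoplist, so it is only suited to English-language pages. For pages in another language, you would pass that language’s stoplist instead.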
Similar to pyquery, jusText is another package that’s not as popular as Beautiful Soup, and its main use is limited to one particular task. Compared to lxml and Beautiful Soup, it doesn’t come with extensive documentation, nor is it as actively maintained. Still, it’s a lightweight parser that is great for the task of removing boilerplate content.
Finally, Beautiful Soup may be great for parsing HTML and extracting data from it, but when it comes to crawling and extracting a whole web site, there are better choices available. If you’re working on a highly complex web scraping project, you might be better off with a framework like Scrapy. It’s much more complicated and has a steeper learning curve than Beautiful Soup, but it’s also much more powerful.
To say that Scrapy is an HTML parser is a huge understatement, since parsing HTML is a minuscule part of what Scrapy is capable of. Scrapy is a complete Python web scraping framework. It has all the features that you may require for web scraping, such as crawling an entire website from a single URL, exporting and storing the data in various formats and databases, limiting the crawling rate, and more. It’s very powerful, efficient, and customizable.
Among its many features, Scrapy offers methods for parsing HTML. First, you need to perform a request for the URL you need parsed, which you can do using the start_requests method. Once that’s done, the web page you get as a response is passed to the parse method, which extracts the data and returns it. The parse method allows for the extracted information to be returned as different kinds of objects, such as Item objects, Request objects, and dictionaries. Each result is handed back to Scrapy with the yield keyword, and Scrapy then sends extracted items through any configured Item Pipeline.
The only downside to Scrapy is its steep learning curve; unlike the other items in this roundup, Scrapy is not suitable for beginners. Learning it can be a challenge, since there are so many features you’ll need to be familiar with.
However, by learning enough Scrapy, you can extract any piece of data available on a website. If you really want to step up your web scraping abilities, there’s no better tool. Because of its many features, Scrapy is not as lightweight as the other mentioned parsers. Still, it’s actively maintained and offers very extensive documentation and great community support—it’s one of the most popular Python packages available. If you want to learn how to build your first Scrapy spider, check out this Scrapy beginner’s guide.
The amount of data on the internet increases by the day, and the need to handle this data and turn it into something useful grows in turn. In order to gather the data efficiently, you must first parse a web page’s HTML and extract the data from it. To parse the HTML, you need an HTML parser.
This article discussed several Python HTML parsers, reviewed based on whether they’re open source, lightweight, and actively maintained, as well as whether they offer good performance and community support.
For your next web scraping project, consider using one of these parsers. For most use cases, Beautiful Soup is likely the best place to start, with lxml being a viable alternative. If you prefer the jQuery API, then you can opt for pyquery. In cases when all you need is to get the main content from a web page, jusText is a great option. Finally, if you’re scraping an entire web site with thousands of pages, Scrapy is probably your best bet.
Alen is a data scientist working in finance. He also freelances and writes about data science and machine learning.