BeautifulSoup tutorial: Scraping web pages with Python

07 July 2022 | 9 min read

In this article, we will see how to extract structured information from web pages using BeautifulSoup and CSS selectors.

Getting the HTML

BeautifulSoup is not a web scraping library per se. It is a library that allows you to efficiently and easily pull out information from HTML. In the real world, it is often used for web scraping projects.

So, for starters, we need an HTML document. For that purpose, we will be using Python's Requests package to fetch the main page of Hacker News.

import sys
import requests

response = requests.get("https://news.ycombinator.com/")
if response.status_code != 200:
	print("Error fetching page")
	sys.exit(1)
content = response.content
print(content)

> b'<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" 
> content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css? ...

Parsing the HTML with BeautifulSoup

Now that the HTML is accessible we will use BeautifulSoup to parse it. If you haven't already, you can install the package by doing a simple pip install beautifulsoup4. In the rest of this article, we will refer to BeautifulSoup4 as "BS4".

We now need to parse the HTML and load it into a BS4 structure.

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

This soup object is very handy and allows us to easily access many useful pieces of information such as:

# The title tag of the page
print(soup.title)
> <title>Hacker News</title>

# The title of the page as string
print(soup.title.string)
> Hacker News

# All links in the page
nb_links = len(soup.find_all('a'))
print(f"There are {nb_links} links in this page")
> There are 231 links in this page

# Text from the page
print(soup.get_text())
> Hacker News
> Hacker News
> new | past | comments | ask | show | jobs | submit
> login
> ...

Targeting DOM elements

You might begin to see a pattern in how to use this library. It allows you to quickly and elegantly target the DOM elements you need.

If you need to select a DOM element by its tag (<p>, <a>, <span>, ...), you can simply access soup.<tag>. The caveat is that this only selects the first HTML element with that tag.

For example, if I want the first link, I just have to access the a field of my BeautifulSoup object:

 first_link = soup.a
 print(first_link)
 ><a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a>

That element is a full representation of that tag and comes with quite a few HTML-specific methods

# The text of the link
print(first_link.text)
# Empty because first link only contains an <img> tag
>""

# The href of the link
print(first_link.get('href'))
> https://news.ycombinator.com

This is a simple example. If you want to select the first element based on its id or class attributes, it is not much more difficult:

pagespace = soup.find(id="pagespace")
print(pagespace)
> <tr id="pagespace" style="height:10px" title=""></tr>

# class is a reserved keyword in Python, hence the '_'
athing = soup.find(class_="athing")
print(athing)
> <tr class="athing" id="22115671">
> ...

And if you don't want the first matching element but instead all matching elements, just replace find with find_all.
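As a self-contained sketch of that difference (the HTML fragment below is made up for illustration rather than fetched from Hacker News):

```python
from bs4 import BeautifulSoup

# A small, made-up HTML fragment mimicking the HN row structure
html = """
<table>
    <tr class="athing" id="1"><td>First story</td></tr>
    <tr class="athing" id="2"><td>Second story</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find(class_="athing")          # only the first match
all_rows = soup.find_all(class_="athing")   # every match, as a list

print(first["id"])      # 1
print(len(all_rows))    # 2
```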

This simple and elegant interface allows you to quickly write short and powerful Python snippets. For example, let's say I want to extract all links in this page and find the top three links that appear the most on the page. All I have to do is this:

from collections import Counter
all_hrefs = [a.get('href') for a in soup.find_all('a')]
top_3_links = Counter(all_hrefs).most_common(3)
print(top_3_links)
> [('from?site=github.com', 3), ('item?id=22115671', 2), ('item?id=22113827', 2)]

Dynamic element selection

So far we've always passed a static tag type. However, find_all is more versatile and supports dynamic selections as well. For example, we can pass a function reference, and find_all will invoke that function for each element, including an element only if the function returns True.

In the following code sample, we define a function my_tag_selector which takes a tag parameter and returns True only if it gets an <a> tag with the HTML class titlelink. Essentially, we extract only the article links from the main page.

import sys
import requests
from bs4 import BeautifulSoup

def my_tag_selector(tag):
	# We only accept "a" tags with a titlelink class
	return tag.name == "a" and tag.has_attr("class") and "titlelink" in tag.get("class")

response = requests.get("https://news.ycombinator.com/")
if response.status_code != 200:
	print("Error fetching page")
	sys.exit(1)

soup = BeautifulSoup(response.content, 'html.parser')

print(soup.find_all(my_tag_selector))

>[<a class="titlelink" href="....

find_all does not only support static strings as a filter; rather, it follows a generic "truthiness" approach, where you can pass different types of expressions and they just need to evaluate to True. Apart from tag strings and functions, there is also support for regular expressions and lists. In addition to find_all, there are other functions for navigating the DOM tree, for example find_next_sibling for an element's following siblings or parent for its parent element.
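A minimal sketch of those other filter types and navigation helpers, using a made-up inline snippet instead of the live page:

```python
import re
from bs4 import BeautifulSoup

html = "<h1>Title</h1><h2>Subtitle</h2><p>Body</p><b>Bold</b>"
soup = BeautifulSoup(html, "html.parser")

# A regular expression matches every tag whose name fits the pattern
headers = soup.find_all(re.compile(r"^h[1-6]$"))
print([t.name for t in headers])    # ['h1', 'h2']

# A list matches any of the given tag names
tags = soup.find_all(["p", "b"])
print([t.name for t in tags])       # ['p', 'b']

# Navigating the tree: the element following <h2> on the same level
print(soup.find("h2").find_next_sibling().name)    # p
```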

BeautifulSoup is a great example of a library that is both easy to use and powerful. We have mostly talked about selecting and finding elements so far, but you can also modify and update the whole DOM tree. We won't cover those parts in this article, however, because it's now time for CSS selectors.

CSS selectors

Why learn about CSS selectors if BeautifulSoup already has a way to select elements based on their attributes?

Well, you'll soon understand.

Querying the DOM

Often, DOM elements do not have proper IDs or class names. While selecting elements is still perfectly possible in that case (see our previous examples), it can be rather verbose and require lots of manual steps.

For example, let's say that you want to extract the score of a post on the HN homepage, but you can't use class name or id in your code. Here is how you could do it:

results = []
all_tr = soup.find_all('tr')
for tr in all_tr:
	if len(tr.contents) == 2:
		if len(tr.contents[0].contents) == 0 and len(tr.contents[1].contents) == 13:
			points = tr.contents[1].text.split(' ')[0].strip()
			results.append(points)
print(results)
>['168', '80', '95', '344', '7', '84', '76', '2827', '247', '185', '91', '2025', '482', '68', '47', '37', '6', '89', '32', '17', '47', '1449', '25', '73', '35', '463', '44', '329', '738', '17']

As promised, rather verbose, isn't it?

This is exactly where CSS selectors shine. They allow you to break down your loop and ifs into one expression.

all_results = soup.select('td:nth-child(2) > span:nth-child(1)')
results = [r.text.split(' ')[0].strip() for r in all_results]
print(results)

>['168', '80', '95', '344', '7', '84', '76', '2827', '247', '185', '91', '2025', '482', '68', '47', '37', '6', '89', '32', '17', '47', '1449', '25', '73', '35', '463', '44', '329', '738', '17']

The key here is td:nth-child(2) > span:nth-child(1). It selects the first <span> which is an immediate child of a <td>, which itself has to be the second child element of its parent (a <tr>). The following HTML illustrates a valid DOM excerpt for our selector.

<tr>
    <td>not the second child, are we?</td>
    <td>
        <span>HERE WE GO</span>
        <span>this time not the first span</span>
    </td>
</tr>

This is much clearer and simpler, right? Of course, this example artificially highlights the usefulness of the CSS selector. But after playing a while with the DOM, you will fairly quickly realise how powerful CSS selectors are, especially when you cannot only rely on IDs or class names.
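You can also verify the selector against that excerpt directly with BS4 (wrapping it in a <table> so the fragment is complete):

```python
from bs4 import BeautifulSoup

# The illustrative excerpt from above, wrapped in a <table>
html = """
<table><tr>
    <td>not the second child, are we?</td>
    <td>
        <span>HERE WE GO</span>
        <span>this time not the first span</span>
    </td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

matches = soup.select("td:nth-child(2) > span:nth-child(1)")
print(matches[0].text)    # HERE WE GO
```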

Easily debuggable

Another thing that makes CSS selectors great for web scraping is that they are easily debuggable. Let's check it out.

Open the developer tools (F12) in Chrome or Firefox, select the Elements tab (Inspector in Firefox), and press Ctrl + F (Cmd + F on macOS) to open the search bar. Now enter any CSS expression (e.g. html body) and the browser will highlight the first matching element. Pressing Enter will iterate over the matches.

Hacker News's HTML

What is great is that it works the other way around too. Right-click any element in the DOM inspector and choose Copy → Copy selector from the context menu.

Voilà, you have the right selector in your clipboard.

Chrome Dev Tools XPath selector

Advanced expressions

CSS selectors provide a comprehensive syntax to select elements in a wide variety of settings.

This includes child and descendant combinators, attribute selectors, and more.

Child and descendants

Child and descendant selectors allow you to select elements which are either immediate or indirect children of a given parent element.

/* all <p> directly inside of an <a> */
a > p

/* all <p> descendants of an <a> */
a p

And you can mix them together:

a > p > .test .example > span

That selector will work perfectly fine with this HTML snippet.

<a>
	<p>
		<div class="test">
			<div class="some other classes">
				<div class="example">
					<span>HERE WE GO</span>
				</div>
			</div>
		</div>
	</p>
</a>

Siblings

This one is one of my favorites: it allows you to select elements based on other elements at the same level of the DOM hierarchy, hence the name sibling selectors.

<!-- HTML example -->
<p>...</p>
<section>
	<p>...</p>
	<h2>...</h2>
	<p>This paragraph will be selected</p> <!-- matches h2 ~ p and h2 + p -->
	<div>
		<p>...</p>
	</div>
	<p>This paragraph will be selected</p> <!-- matches h2 ~ p -->
</section>

To select all p coming after an h2 you can use the h2 ~ p selector (it will match two <p>s).

You can also use h2 + p if you only want to select the <p> immediately following our <h2> (it will match only one <p>).
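Both sibling combinators can be tried out with BS4's select on a small variant of the snippet above (the paragraph texts are made up so the matches are visible):

```python
from bs4 import BeautifulSoup

html = """
<section>
    <p>one</p>
    <h2>heading</h2>
    <p>after</p>
    <div><p>nested</p></div>
    <p>later</p>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

# ~ matches all following sibling <p>s, + only the immediately following one
general = [p.text for p in soup.select("h2 ~ p")]
adjacent = [p.text for p in soup.select("h2 + p")]
print(general)     # ['after', 'later']
print(adjacent)    # ['after']
```

Note that the nested paragraph is not matched: it is a child of the <div>, not a sibling of the <h2>.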

Attribute selectors

Attribute selectors allow you to select elements with particular attribute values. So, p[data-test="foo"] will match

<p data-test="foo"></p>
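A quick sketch with BS4 (the data-test values here are made up):

```python
from bs4 import BeautifulSoup

html = '<p data-test="foo">yes</p><p data-test="bar">no</p><p>none</p>'
soup = BeautifulSoup(html, "html.parser")

# [attr="value"] matches an exact attribute value
foo = soup.select('p[data-test="foo"]')
print(foo[0].text)    # yes

# [attr] alone matches any element that has the attribute at all
with_attr = soup.select("p[data-test]")
print(len(with_attr))    # 2
```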

Pseudo-classes

Let's assume we have this HTML document.

<section>
	<p>Paragraph 1</p>
	<p>Paragraph 2</p>
	<p>Paragraph 3</p>
	<p>Paragraph 4</p>
</section>

Furthermore, let's assume we only want to select a particular <p> element. Welcome to pseudo-classes!

Pseudo-classes, such as :first-child, :last-child, and :nth-child, allow you to select specific elements by their position within the DOM tree.

/* Selects "Paragraph 4" */
section > p:last-child

/* Selects "Paragraph 2" */
section > p:nth-child(2)
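Both selectors can be checked with BS4's select_one, a sketch against the document above:

```python
from bs4 import BeautifulSoup

html = """
<section>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
    <p>Paragraph 3</p>
    <p>Paragraph 4</p>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

last = soup.select_one("section > p:last-child")
second = soup.select_one("section > p:nth-child(2)")
print(last.text)      # Paragraph 4
print(second.text)    # Paragraph 2
```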

There are plenty of other pseudo-classes (e.g. input[type="checkbox"]:checked will select all checked checkboxes) and you can find a full list here. If you'd like to learn more about CSS selectors, you may also find this article interesting.

Maintainable code

I also think that CSS expressions are easier to maintain. For example, at ScrapingBee, when we do custom web scraping tasks, all of our scripts begin like this:

TITLE_SELECTOR = "title"
SCORE_SELECTOR = "td:nth-child(2) > span:nth-child(1)"
...

This makes it easy to fix scripts when changes to the DOM are made.
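For instance, a script built around such constants might look like this sketch (the HTML is a stripped-down stand-in for the real HN page):

```python
from bs4 import BeautifulSoup

# Page-specific selectors, kept in one place at the top of the script
TITLE_SELECTOR = "title"
SCORE_SELECTOR = "td:nth-child(2) > span:nth-child(1)"

html = """
<html><head><title>Hacker News</title></head>
<body><table><tr>
    <td></td>
    <td><span>42 points</span></td>
</tr></table></body></html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select_one(TITLE_SELECTOR).text)                  # Hacker News
print(soup.select_one(SCORE_SELECTOR).text.split(" ")[0])    # 42
```

If the page layout changes, only the constants at the top of the file need updating.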

Certainly, a rather easy way to determine the right CSS selector is to simply copy/paste what Chrome gave you when you right-click an element. However, you ought to be careful, as these selector paths tend to be very "absolute" in nature and are often neither the most efficient nor very resilient to DOM changes. In general it's best to verify such selectors manually before you use them in your script.

💡 We released a new feature that makes this whole process way simpler. You can now extract data from HTML with one simple API call. Feel free to check out the documentation. If you'd like to give ScrapingBee a try, we are happy to provide the first 1,000 API calls for free.

Conclusion

BeautifulSoup and CSS selectors offer a very elegant and lightweight approach to running web scraping jobs from a Python script. In particular, CSS selectors are a technology that is also used beyond the realm of Python and something that's definitely worth adding to one's list of tools.

I hope you liked this article about web scraping in Python and that it will make your life easier. If you'd like to read more about web scraping in Python, do not hesitate to check out our extensive Python web scraping guide. You might also be interested in our XPath tutorial.

Happy Scraping,

Pierre de Wulf

Pierre de Wulf

Pierre is a data engineer who worked in several high-growth startups before co-founding ScrapingBee. He is an expert in data processing and web scraping.