BeautifulSoup tutorial: Scraping web pages with Python

26 July 2024 (updated) | 20 min read

The internet is an endless source of data, and for many data-driven tasks, accessing this information is critical. Thus, the demand for web scraping has risen exponentially in recent years, becoming an important tool for data analysts, machine learning developers, and businesses alike. Also, Python has become the most popular programming language for this purpose.

In this detailed tutorial, you'll learn how to access the data using popular libraries such as Requests and Beautiful Soup with CSS selectors.

Without further ado, let’s get started!

Getting the HTML

When you're web scraping, the first step is to retrieve the contents of a web page from a URL using an HTTP GET request. One popular and easy-to-use option is the Python Requests library. It allows you to send HTTP requests easily and offers various useful features.

Install the Requests library with the following command:

pip install requests==2.32.3

Once you have extracted the HTML data, you can use the BeautifulSoup library to parse it and pull out the desired information efficiently.

To install the library, launch the below command in the terminal:

pip install beautifulsoup4==4.12.3

We will build a Hacker News scraper using Requests and BeautifulSoup to extract the rank, URL, and title from all articles posted on HN.

hackernews homepage

Create a Python script named main.py and import requests. In main.py, use Requests to make a GET request to your target website, storing the retrieved HTML content in the html_content variable, and then log it to the console.

import requests

# Send a GET request to the specified URL
response = requests.get("https://news.ycombinator.com/")

# Check if the request was successful (status code 200)
if response.status_code == 200:
    html_content = response.content
    print(html_content)
else:
	print(response)

In the code, the requests.get() sends an HTTP GET request to a URL and retrieves the website's HTML content. It then checks the server's response status code using response.status_code. A 200 code indicates success, allowing the code to proceed with the HTML. Any other code (like a 404 "Not Found") is considered an error.

Note: The GET request to the server may fail due to a temporary outage, incorrect URL, or blocked IP. Error handling logic (if/else) prevents crashes and ensures the script only continues on successful ( 2xx ) responses.

Here’s the result after running the code:

getting html from website

Great! Now, it's time to use Beautiful Soup to parse the page's HTML code and extract the specific data we want.

Parsing the HTML with Beautiful Soup

In the previous section, we retrieved an HTML document from the server. It appears as a complex structure of HTML tags and elements, and the only way to make sense of it is by extracting the desired data through HTML parsing.

Beautiful Soup is a Python library that allows you to select HTML elements and easily extract data from them

Let’s use BS4 to parse the content:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://news.ycombinator.com/")
html_content = response.content

# Use Beautiful Soup to parse the HTML
soup = BeautifulSoup(html_content, "html.parser")

The BeautifulSoup() constructor takes HTML content and a string specifying the parser. Here "html.parser" tells Beautiful Soup to use the built-in HTML parser.

Note: When using BeautifulSoup, it's better to use a response.content instead of response.text. This is because response.content holds the raw HTML data in bytes, which is easier for BeautifulSoup to decode than the text representation in response.text. Using response.content helps avoid potential issues with character encoding.

Now, move on to the code. This soup object is very handy and allows us to access many useful pieces of information such as:

# The title tag of the page
print(soup.title)
# Output: <title>Hacker News</title>

# The title of the page as string
print(soup.title.string)
# Output: Hacker News

# All links in the page
nb_links = len(soup.find_all("a"))
print(f"There are {nb_links} links in this page")
# Output: There are 231 links on this page

# Text from the page
print(soup.get_text())
# Output:
# Hacker News
# Hacker News
# new | past | comments | ask | show | jobs | submit
# login
# ...

Targeting DOM elements

When using BeautifulSoup for web scraping, one of the most important tasks is targeting and extracting specific DOM (Document Object Model) elements. The DOM is a programming interface for web documents. Imagine the HTML code of a webpage as an upside-down tree. Each HTML element (headings, paragraphs, and links) is a node in this tree.

Dom tree of html - Image from W3 Schools

BS4 allows you to quickly and elegantly target the DOM elements you need. Here are the different ways Beautiful Soup provides to target these elements within the DOM:

Finding by Tag

To find elements by their tag name in Beautiful Soup, you have two main options: the find method and the find_all method.

  • find : This method searches the parsed HTML document from top to bottom and returns the first occurrence of the tag you specify.
  • find_all : This method searches the entire parsed HTML document and returns a list containing all instances of the specified tag or set of tags.

For instance, to find the first <a> tag in a document, you can use:

soup.find("a")

anchor tag

If you want to find all <a> tags, you can use the find_all method, which will return a list of all <a> tags in the document.

soup.find_all("a")

all anchor tags of webpage

The .find('span') method will return the first <span> tag it finds in the parsed HTML:

soup.find("span")

span tag of webpage

.find_all('span') returns a list containing all <span> tags found in the HTML. This allows you to iterate through the list and access the content or attributes of each tag.

soup.find_all("span")

all span tags of webpage

Finding by Class or ID

BS4 provides methods to locate elements by their class or ID attributes. For instance, to find an element with a specific class:

soup.find(class_="athing")

finding first article having athing class

You can also find all elements with a specific class using:

soup.find_all(class_="athing")

finding all articles having athing class

Note: For classes, use class_ instead of class because the class is a reserved keyword in Python.

To find an element by its ID:

soup.find(id="pagespace")

finding by pagespace id

CSS selectors

Why learn about CSS selectors if BeautifulSoup already has a way to select elements based on their attributes?

Well, you'll soon understand.

Querying the DOM

Often, DOM elements do not have proper IDs or class names. While perfectly possible, selecting elements in that case can be rather verbose and require lots of manual steps.

For example, let's say that you want to extract the score of a post on the HN homepage, but you can't use class name or id in your code. Here is how you could do it:

results = []
all_tr = soup.find_all('tr')
for tr in all_tr:
	if len(tr.contents) == 2:
		print(len(tr.contents[1]))
		if len(tr.contents[0].contents) == 0 and len(tr.contents[1].contents) == 13:
			points = tr.contents[1].text.split(' ')[0].strip()
			results.append(points)
print(results)
>['168', '80', '95', '344', '7', '84', '76', '2827', '247', '185', '91', '2025', '482', '68', '47', '37', '6', '89', '32', '17', '47', '1449', '25', '73', '35', '463', '44', '329', '738', '17']

As promised, rather verbose, isn't it?

This is exactly where CSS selectors shine. They allow you to break down your loop and ifs into one expression.

all_results = soup.select('td:nth-child(2) > span:nth-child(1)')
results = [r.text.split(' ')[0].strip() for r in all_results]
print(results)

>['168', '80', '95', '344', '7', '84', '76', '2827', '247', '185', '91', '2025', '482', '68', '47', '37', '6', '89', '32', '17', '47', '1449', '25', '73', '35', '463', '44', '329', '738', '17']

The key here is td:nth-child(2) > span:nth-child(1). This selects for us the first <span> which is an immediate child of a <td>, which itself has to be the second element of its parent (<tr>). The following HTML illustrates a valid DOM excerpt for our selector.

<tr>
    <td>not the second child, are we?</td>
    <td>
        <span>HERE WE GO</span>
        <span>this time not the first span</span>
    </td>
</tr>

This is much clearer and simpler, right? Of course, this example artificially highlights the usefulness of the CSS selector. But after playing a while with the DOM, you will fairly quickly realise how powerful CSS selectors are, especially when you cannot only rely on IDs or class names.

BeautifulSoup allows you to use CSS selectors to target elements more flexibly. For instance, to find all elements with a specific class within a <tr>:

soup.select("tr.athing")

The select method finds all elements in the parsed HTML that match a given CSS selector. It returns a list of all matching elements.

finding all articles having athing class

Here’s how you can select the first <tr> element in the HTML document that has the class athing.

soup.select_one("tr.athing")

The select_one method selects the first matching element based on the given CSS selector.

finding first article having athing class

Or to find an element with a specific ID:

soup.select_one("#pagespace")

finding by pagespace id

Easily debuggable

Another thing that makes CSS selectors great for web scraping is that they are easily debuggable. Let's check it out.

Open the developer tools (F12) in Chrome or Firefox, select the document tab, and use Ctrl/ + F to open the search bar. Now enter any CSS expression (e.g. html body) and the browser will find the first matching element. Pressing Enter will iterate over the elements.

hackernews html

What is great is that it works the other way around too. Right-click any element in the DOM inspector and choose Copy - Copy Selector from the context menu.

Voilà, you have the right selector in your clipboard.

Chrome Dev Tools XPath selector

Advanced expressions

CSS selectors provide a comprehensive syntax to select elements in a wide variety of settings.

This includes child and descendant combinators, attribute selectors, and more.

Child and descendants: Child and descendant selectors allow you to select elements which are either immediate or indirect children of a given parent element.

# all <p> directly inside of an <a>
a > p

# all <p> descendants of an <a>
a p

And you can mix them together:

a > p > .test .example > span

That selector will work perfectly fine with this HTML snippet.

<a>
	<p>
		<div class="test">
			<div class="some other classes">
				<div class="example">
					<span>HERE WE GO</span>
				</div>
			</div>
		</div>
	</p>
</a>

Siblings:

This one is one of my favorites because it allows you to select elements based on the elements on the same level in the DOM hierarchy, hence the sibling expression.

#html example
<p>...</p>
<section>
	<p>...</p>
	<h2>...</h2>
	<p>This paragraph will be selected</p> (match h2 ~ p / h2 + p)
	<div>
		<p>...</p>
	</div>
	<p>This paragraph will be selected</p> (match h2 ~ p)
</section>

To select all p coming after an h2 you can use the h2 ~ p selector (it will match two <p>s).

You can also use h2 + p if you only want to select the <p> immediately following the our <h2> (it will match only one <p>).

Attribute selectors Attribute selectors allow you to select elements with particular attributes values. So, p[data-test="foo"] will match

<p data-test="foo"></p>

Pseudo-classes:

Let's assume we have this HTML document.

<section>
	<p>Paragraph 1</p>
	<p>Paragraph 2</p>
	<p>Paragraph 3</p>
	<p>Paragraph 4</p>
</section>

Furthermore, let's assume we only want to select a particular <p> element. Welcome to pseudo-classes!

Pseudo-classes, such as :first-child , :last-child , and :nth-child , for example allow you to select specific elements by their position within the DOM tree.

// Selects "Paragraph 4"
section > p:last-child()

// Selects "Paragraph 2"
section > p:nth-child(2)

There are plenty of other pseudo-classes (e.g. input[type="checkbox"]:checked will select all checked checkboxes) and you can find a full list here . If you like to learn more about CSS selectors, you may also find this article interesting.

Maintainable code

I also think that CSS expressions are easier to maintain. For example, at ScrapingBee, when we do custom web scraping tasks all of our scripts begins like this:

TITLE_SELECTOR = "title"
SCORE_SELECTOR = "td:nth-child(2) > span:nth-child(1)"
...

This makes it easy to fix scripts when changes to the DOM are made.

Certainly, a rather easy way to determine the right CSS selector is to simply copy/paste what Chrome gave you when you right-click an element. However, you ought to be careful, as these selector paths tend to be very "absolute" in nature and are often neither the most efficient nor very resilient to DOM changes. In general it's best to verify such selectors manually before you use them in your script.

Traversing the DOM

Next, let's use Beautiful Soup to parse the HTML data and scrape the contents from all the articles on the first page of Hacker News. Before we select an element, let's use our developer tools to inspect the page and find what selectors we need to use to target the data we want to extract.

The content we are looking for is inside the tr tag with the class 'athing'.

selecting articles having athing class

Now, let's use the Beautiful Soup find_all method to select all elements containing the athing class.

import requests
from bs4 import BeautifulSoup

# Send a GET request to the website
response = requests.get("https://news.ycombinator.com/")
html_content = response.content

# Use BeautifulSoup to parse the HTML content
soup = BeautifulSoup(html_content, "html.parser")

# Find all articles with the class 'athing'
articles = soup.find_all(class_="athing")

A common mistake you can make is failing to check for None.

Methods like find() and select_one() return None if they can't find the desired element. Since web pages change over time, the element you're trying to find might be located differently or may no longer exist.

To avoid errors, always check if the method result is not None before using it in your code.

#...

if articles is not None:
	# ...

Next, let's loop through each article and print its text contents to the console, just to verify we have selected the correct elements.

import requests
from bs4 import BeautifulSoup

# Fetch the content from the URL
response = requests.get("https://news.ycombinator.com/")
html_content = response.content

# Use Beautiful Soup to parse the HTML
soup = BeautifulSoup(html_content, "html.parser")
articles = soup.find_all(class_="athing")

# Check if articles were found
if articles:
    # Loop through the selected elements
    for article in articles:
        # Print each article's text content to the console
        print(article.text)

The result is:

articles scraped from first page

Great! We’ve successfully accessed the rank and title of each element.

In the next step, we will use the Beautiful Soup find method to grab the specific values we want to extract from each article and organize the obtained data in a Python dictionary.

For each article, we will extract the URL, rank, and title.

  • The rank is located within the span tag having the class rank.
  • The title and URL of the article are located within the span tag having the class titleline.

Take a look at the below image:

tags within athing class

Here’s how our code looks when using the find method to get each article’s URL, title, and rank:

import requests
from bs4 import BeautifulSoup

# Fetch the content from the URL
response = requests.get("https://news.ycombinator.com/")
html_content = response.content

# Use Beautiful Soup to parse the HTML
soup = BeautifulSoup(html_content, "html.parser")
articles = soup.find_all(class_="athing")

scraped_data = []

# Check if articles are found
if articles is not None:
    for article in articles:
        data = {
            "URL": article.find(class_="titleline").find("a").get("href"),
            "title": article.find(class_="titleline").getText(),
            "rank": article.find(class_="rank").getText().replace(".", ""),
        }
        scraped_data.append(data)

# Print the output list
print(scraped_data)

The result is:

articles stored in a list

Great! We've scraped information from the first page of Hacker News using Requests and Beautiful Soup. But wouldn't it be even better to get article data from all the pages? We can achieve this by applying the same logic to the remaining pages.

However, before applying the logic, you need to navigate to all the remaining pages. So, in the next section, we'll explore techniques for handling website pagination.

Handling pagination

Handling pagination in web scraping is a straightforward concept. You need to make your scraper repeat its scraping logic for each page visited until no more pages are left. To achieve this, you have to identify the last page to stop scraping and save the extracted data.

To handle pagination, we'll use two variables: "is_scraping" (a boolean tracking if the last page is reached) and "current_page" (an integer keeping track of the page being scraped).

is_scraping = True
current_page = 1

Next, let's create a while loop that continues scraping until the scraper reaches the last page. Within the loop, we will send a GET request to the current page of Hacker News to extract the URL, title, and rank.

import requests
from bs4 import BeautifulSoup

is_scraping = True
current_page = 1
scraped_data = []

while is_scraping:
    # Fetch the content from the URL
    response = requests.get(f"https://news.ycombinator.com/?p={current_page}")
    html_content = response.content

    print(f"Scraping page {current_page}: {response.url}")

    # Use Beautiful Soup to parse the HTML
    soup = BeautifulSoup(html_content, "html.parser")
    articles = soup.find_all(class_="athing")

    # ...

Within the loop, we will check if there is a "More" button with the class morelink on the page to navigate to the next page. Take a look at the below image:

handling pagination

The "More" button is present on all pages, except the last one. So, if the morelink class is present on the page, the script increments the current_page variable and continues scraping the next page. If there is no morelink class, the script sets is_scraping to False and exits the loop.

# Check if there is a link to the next page
next_page_link = soup.find(class_="morelink")

if next_page_link:
    current_page += 1
else:
    is_scraping = False
    print(f"Finished scraping! Total pages scraped: {current_page}")
    print(f"Total articles scraped: {len(scraped_data)}")

Putting it all together:

import requests
from bs4 import BeautifulSoup

is_scraping = True
current_page = 1
scraped_data = []

while is_scraping:
    # Fetch the content from the URL
    response = requests.get(f"https://news.ycombinator.com/?p={current_page}")
    html_content = response.content

    print(f"Scraping page {current_page}: {response.url}")

    # Use Beautiful Soup to parse the HTML
    soup = BeautifulSoup(html_content, "html.parser")
    articles = soup.find_all(class_="athing")

    # Check if articles are found
    if articles is not None:
        for article in articles:
            data = {
                "URL": article.find(class_="titleline").find("a").get("href"),
                "title": article.find(class_="titleline").getText(),
                "rank": article.find(class_="rank").getText().replace(".", ""),
            }
            scraped_data.append(data)
    
    # Check if there is a link to the next page
    next_page_link = soup.find(class_="morelink")

    if next_page_link:
        current_page += 1
    else:
        is_scraping = False
        print(f"Finished scraping! Total pages scraped: {current_page}")
        print(f"Total articles scraped: {len(scraped_data)}")

The result is:

all pages scraped

Export the Scraped Data

The scraped information can be used for various needs and purposes. Therefore, it's crucial to convert it into a format that makes it easy to read and explore, like CSV or JSON. We currently have the article's information stored in a list named scraped_data. Now, let's explore how to store the data in CSV and JSON files.

Export to CSV

CSV is a popular format for data exchange, storage, and analysis, especially when dealing with large datasets. A CSV file stores information in a tabular form with comma-separated values.

import csv

# Scraping logic...

with open("hn_articles.csv", "a", newline="", encoding="utf-8-sig") as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=data.keys())

    # Write headers only on the first iteration
    if csvfile.tell() == 0:
        writer.writeheader()
    writer.writerows(scraped_data)

The code snippet performs the following tasks:

  • It first imports the csv module, which provides the tools needed to work with CSV files.
  • Then, it opens a file named "hn_articles.csv" in append mode.
  • It creates a writer object called DictWriter . This object allows us to write dictionaries, where each key becomes a column header and the corresponding value goes into that column.
  • The code checks if the file is empty. If it is (meaning the file pointer is at position 0), it writes the column headers (based on the fieldnames) as the first row of the CSV file.
  • Finally, the snippet uses another method called writerows() to write each entry from your scraped_data variable, line by line, into the CSV file.

The result is:

csv file

Great! The data is stored in a CSV file.

Export to JSON

JSON is a text-based, human-readable format for representing structured data. It is commonly used for transmitting data between a server and a web application.

import json

# Scraping logic...

with open("hn_articles.json", "w", encoding="utf-8") as jsonfile:
    json.dump(scraped_data, jsonfile, indent=4)

The logic above uses the json.dump() function from the standard json module. This function allows you to write a Python object to a JSON file.

json.dump() takes two arguments:

  • The Python object to convert to JSON format.
  • A file object created with open().

The result is:

json file

Great! The data is stored in a JSON file.

Overcoming Web Scraping Challenges

Websites often use anti-bot systems like CAPTCHAs and fingerprint challenges, which can be difficult to bypass. Web scraping also requires significant computational resources, bandwidth, and ongoing maintenance and updates.

For instance, Hacker News is a simple website with minimal anti-bot protections, allowing us to scrape it relatively easily. Conversely, complex websites utilize sophisticated techniques to detect and block bots.

The difficulty of avoiding these blocks varies depending on the target website and the scale of your scraping operation. However, here are some techniques you can consider to mitigate them.

Rotate User Agents

The User-Agent header tells the server about the client's operating system, vendor, and version. If a server receives a large number of requests from the same User-Agent, it might block your requests. To avoid this, you can rotate User-Agents by generating a new one for each request.

Implement Rate Limiting

Anti-scraping measures often monitor the frequency of requests from the client. If a single IP sends too many requests too quickly, it might be flagged and blocked. A common way to bypass this is by introducing delays between requests. This ensures the server has enough time to process each request before receiving the next.

Use a Proxy Server

A rotating proxy server acts as a middleman between your scraping script and target websites. It routes your requests and assigns you a different IP address from its pool, automatically changing it periodically or after a set number of requests. Learn about How To Use A Proxy With Python Requests .

Add Retry Logic

Retry logic for network calls is essential in web scraping scripts. It improves reliability by handling temporary issues like server downtime or disconnections. Retrying failed calls before giving up, thus increasing the chance of successful data retrieval.

Robots.txt

To ensure ethical scraping, always check the website's terms of service and tailor your scripts accordingly. You should follow the rules outlined in the robots.txt file of the target website. Additionally, avoid scraping personal information without consent, as this violates privacy regulations.

Web scraping challenges require a combination of technical strategies, ethical considerations, and adherence to best practices. Some of these best practices include respecting the robots.txt file, rotating proxies and user agents, and so on.

But even with best practices, obstacles can arise that block you from getting the data you need. That's where tools like ScrapingBee come in. It handles all anti-bot challenges, you can integrate in your code in just a matter of minutes.

Final code

We've come to the end of this detailed tutorial. Here's the final code you can use to scrape Hacker News. This code implements retry logic, which helps in retrying failed requests until the maximum attempts are reached.

import requests
from bs4 import BeautifulSoup
import time
import json

# Initialize variables
is_scraping = True
current_page = 1
scraped_data = []
max_retries = 3  # Maximum retries for failed requests

print("Hacker News Scraper started...")

# Continue scraping until there are no more pages to scrape
while is_scraping:
    try:
        # Make the request
        response = requests.get(
            f"https://news.ycombinator.com/?p={current_page}")
        html_content = response.content

        print(f"Scraping page {current_page}: {response.url}")

        # Use BeautifulSoup to parse the HTML content
        soup = BeautifulSoup(html_content, "html.parser")
        articles = soup.find_all(class_="athing")

        if articles:
            # Extract data from each article on the page
            for article in articles:
                article_data = {
                    "URL": article.find(class_="titleline").find("a").get("href"),
                    "title": article.find(class_="titleline").getText(),
                    "rank": article.find(class_="rank").getText().replace(".", ""),
                }
                scraped_data.append(article_data)
            print(f"Page {current_page} scraped successfully, found {len(articles)} articles.")

        # Check if there's a link to the next page
        next_page_link = soup.find(class_="morelink")

        if next_page_link:
            current_page += 1
        else:
            is_scraping = False
            print(f"Finished scraping! Total pages scraped: {current_page}")
            print(f"Total articles scraped: {len(scraped_data)}")

            # Save scraped data to JSON file
            with open("hn_articles.json", "w", encoding="utf-8") as jsonfile:
                json.dump(scraped_data, jsonfile, indent=4)

    except requests.exceptions.RequestException as e:
        print(f"Error occurred while scraping page {current_page}: {e}")

        # Implement retry logic with delay (up to max_retries attempts)
        for attempt in range(1, max_retries + 1):
            print(f"Retrying request for page {current_page} (attempt {attempt}/{max_retries})...")
            time.sleep(2)  # Adjust delay as needed
            try:
                response = requests.get(
                    f"https://news.ycombinator.com/?p={current_page}")
                break  # Successful request, exit retry loop
            except requests.exceptions.RequestException:
                pass  # Continue retrying on failure
        else:
            print(f"Giving up after {max_retries} retries for page {current_page}")

    # Always add a delay between requests
    time.sleep(2)

Conclusion

This detailed tutorial covered everything you need to know about getting started with Beautiful Soup and Requests in Python. You learned how to use them to build a web scraper that effectively extracts data from websites. However, web scraping has become increasingly challenging due to the rise of anti-bot and anti-scraping technologies. This is where tools like ScrapingBee come in.

I hope you liked this article about web scraping in Python and that it will make your life easier. For further reading, do not hesitate to check out our extensive Python web scraping guide . You might also be interested in following tutorials that can help you in your web scraping journey.

  1. How To Set Up A Rotating Proxy In Selenium With Python
  2. How To Set Up A Rotating Proxy In Puppeteer
  3. The Best Web Scraping Tools & Software In 2024
  4. How To Bypass Cloudflare Antibot Protection At Scale In 2024
image description
Satyam Tripathi

Satyam is a senior technical writer who is passionate about web scraping, automation, and data engineering. He has delivered over 130 blog posts since 2021.