Python Web Scraping: Full Tutorial With Examples (2024)

28 May 2024 (updated) | 37 min read

Have you ever wondered how to gather data from websites automatically? Or how some websites and web applications can extract and display data so seamlessly from other sites in real-time? Whether you want to collect and track prices from e-commerce sites, gather news articles and research data, or monitor social media trends, web scraping is the tool you need.

In this tutorial, we'll explore the world of web scraping with Python, guiding you from the basics to advanced techniques. In my experience, Python is one of the most powerful and versatile languages for web scraping, thanks to its vast array of libraries and frameworks.

Consider this a follow-up to our previous guide on how to scrape the web without getting blocked. This time, we'll equip you with the knowledge to pick the perfect tool for the job, complete with the pros and cons of each method, real-world examples, and a sprinkle of hard-earned wisdom from yours truly.

By the end of this tutorial, you will have a solid understanding of Python web scraping and be ready to scrape the web like a pro. Let's get started!

Just a heads-up, we'll be assuming you're using Python 3 throughout this code-filled odyssey.

Logos of python web scraping tools

0. Web Scraping Process

Web scraping can seem daunting at first, but a structured approach simplifies it significantly. Whether you're a beginner or an experienced developer, these steps will ensure a smooth and efficient scraping process.

Step 1: Understanding the Website's Structure

Before we start scraping, let's get to know the website's structure. First, we need to inspect the HTML source code of the web page to identify the elements we want to scrape.

Once we find these elements, we need to identify the HTML tags and attributes that hold our treasures.

Step 2: Setting Up Our Python Playground

Let's make sure we have Python 3 installed on our machine. If not, we can grab it from the official Python website.

Now that Python's ready to go, we should create a virtual environment to keep things organized. This way, our scraping project won't mess with other projects on our machine. Think of it as a designated sandbox for our web-scraping adventures!

Here's how to create one:

python -m venv scraping-env
source scraping-env/bin/activate  # On Windows use `scraping-env\Scripts\activate`

Step 3: Choosing Your Web Scraping Tool

If you're a web scraping newbie, then I highly recommend starting with the Requests and BeautifulSoup libraries. They're super easy to use and understand, kind of like training wheels for our web-scraping bike.

You can learn more about these awesome tools in the Requests & BeautifulSoup section.

Step 4: Handling Pagination and Dynamic Content

Websites can be tricky sometimes. They might have multiple pages of data we need, or the content might change and flicker like a firefly (we call this dynamic content). Not to worry!

We'll employ tools like Selenium in the Headless Browsing section to handle pagination and scrape websites that use JavaScript.

Step 5: Checking Robots.txt and Legal Guidelines

Every website has rules, and web scraping is no exception. Before we start scraping, it's important to check the website's robots.txt file. This file tells us which parts of the website are okay to scrape and which ones are off-limits.

Think of it as a treasure map that shows us where to dig and where not to! We also always want to make sure our scraping follows the website's terms of service and legal guidelines. It's all about being a good web scraping citizen.
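Python's standard library even ships with a parser for these rules, urllib.robotparser. Here's a minimal sketch; the User-agent and Disallow rules below are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Parse some example robots.txt rules; in practice, set_url() + read()
# would fetch and parse the site's real robots.txt instead.
rp = RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

# Ask whether our scraper may fetch a given URL
print(rp.can_fetch('MyScraper', 'https://example.com/public/page'))   # True
print(rp.can_fetch('MyScraper', 'https://example.com/private/page'))  # False
```

To check a live site, call rp.set_url('https://example.com/robots.txt') followed by rp.read() before asking can_fetch().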

Step 6: Optimizing and Scaling Your Scraper

As you become more comfortable with web scraping, you can take your scraper to the next level! We can optimize it to run faster and scrape even larger amounts of data. Frameworks like Scrapy and Asyncio can help us with these complex tasks.

Pro Tip: For web scraping beginners, Requests and BeautifulSoup are your best buddies. They're easy to use and will set you on the right path to web scraping mastery. You can learn more about these tools in the Requests & BeautifulSoup section, so be sure to check it out!

1. Manually Opening a Socket and Sending the HTTP Request

Socket

In the early days of my web scraping journey, I learned the most basic way to perform an HTTP request in Python: manually opening a TCP socket and then sending the HTTP request. It's a bit like crafting things from scratch – sure, you get a deep appreciation for the nuts and bolts, but let's be honest, it can be a bit…well…socket-work.

Here’s how you can do it:

import socket

HOST = 'www.google.com'  # Server hostname or IP address
PORT = 80                # The standard port for HTTP is 80, for HTTPS it is 443

client_socket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server_address = (HOST, PORT)
client_socket.connect(server_address)

request_header = b'GET / HTTP/1.0\r\nHost: www.google.com\r\n\r\n'
client_socket.sendall(request_header)

response = ''
while True:
    recv = client_socket.recv(1024)
    if not recv:
        break
    response += recv.decode('utf-8')

print(response)
client_socket.close()

Pro Tip: While wrangling sockets and parsing raw HTTP responses by hand is a fantastic learning experience (and a real eye-opener into how web requests tick under the hood!), it can also get cumbersome pretty quickly. For most web scraping tasks, libraries like Requests are our knight in shining armor, simplifying the process by leaps and bounds.

Now that we've built our connection to the server, sent our HTTP request, and received the response, it's time to wrangle some data! In the good ol' days, regular expressions (regex) were my trusty companions for this quest.

Regular Expressions

When I first started parsing HTML responses manually, regular expressions (regex) were invaluable for searching, parsing, manipulating, and handling text. They allow us to define search patterns and are extremely useful for extracting specific data from text, such as prices, dates, numbers, or names. For example, we could quickly identify all phone numbers on a web page.

Combined with classic search and replace, regular expressions also allow us to perform string substitution on dynamic strings in a relatively straightforward fashion. The easiest example, in a web scraping context, may be to replace uppercase tags in a poorly formatted HTML document with the proper lowercase counterparts.
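As a quick sketch of that search-and-replace idea (the HTML snippet here is invented for the example; real-world HTML cleanup is usually better left to a parser):

```python
import re

html = '<P>Hello</P><DIV>World</DIV>'

# Lowercase every opening and closing tag name via a function substitution
fixed = re.sub(r'</?[A-Z]+', lambda m: m.group(0).lower(), html)
print(fixed)  # <p>Hello</p><div>World</div>
```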

But hey, you might be thinking: "With all these fancy Python modules for parsing HTML with XPath and CSS selectors, why even bother with regex?" Honestly, that's a fair question!

In a perfect world, data would be neatly tucked away inside HTML elements with clear labels. But the web is rarely perfect. Sometimes, we'll find mountains of text crammed into basic <p> elements. To extract specific data (like a price, date, or name) from this messy landscape, we'll need to wield the mighty regex.

Note: Beyond the basics that we'll discuss, regular expressions can be more complex, but tools like regex101.com can help you test and debug your patterns. Additionally, RexEgg is an excellent resource to learn more about regex.

For example, regular expressions can be useful when we have an HTML snippet with this kind of data:

<p>Price : 19.99$</p>

We could select this text node with an XPath expression and then use this kind of regex to extract the price:

Price\s*:\s*(\d+\.\d{2})\$

However, if we only have the HTML snippet, fear not! It's not much trickier than catching a well-fed cat napping. We can simply specify the HTML tag in our expression and use a capturing group for the text:

import re

html_content = '<p>Price : 19.99$</p>'

# We match the surrounding tag and capture just the numeric price
pattern = r'<p>Price\s*:\s*(\d+\.\d{2})\$</p>'

match = re.search(pattern, html_content)
if match:
    print(match.group(1))  # Output: 19.99

As you can see, building HTTP requests with sockets and parsing responses with regex is a fundamental skill that unlocks a deeper understanding of web scraping. However, regex isn't a magic solution. It can get tangled and tricky to maintain, especially when dealing with complex or nested HTML structures.

In my professional experience, it's best to reach for dedicated HTML parsing libraries like BeautifulSoup or lxml whenever possible. These libraries offer robust and flexible tools for navigating and extracting data from even the most unruly HTML documents.

2. Using Urllib3 & LXML

Urllib3

Note: Working with HTTP requests in Python can sometimes be confusing due to the various libraries available. Python 2 shipped with both urllib and urllib2, while in Python 3 the standard library offers urllib. Despite the name, urllib3 is a separate third-party package, and it stands out for its ease of use and flexibility. It's widely adopted in the Python community, powering popular packages like pip and Requests.

Want to send HTTP requests and receive responses? urllib3 is our genie. And the best part? It does all this with way fewer lines of code than, say, wrestling with sockets directly.

Remember that convoluted socket code from before? urllib3 lets us achieve the same thing with way less hassle:

import urllib3
http = urllib3.PoolManager()
r = http.request('GET', 'http://www.google.com')
print(r.data)

Just look at this! Isn't that so much cleaner than that socket business we talked about earlier? urllib3 boasts a super clean API, making it a breeze to not only send requests but also add fancy HTTP headers, use proxies, and even send those tricky POST forms.

For instance, had we decided to set some headers and use a proxy, we would only have to do the following:

import urllib3
user_agent_header = urllib3.make_headers(user_agent="<USER AGENT>")
pool = urllib3.ProxyManager('<PROXY IP>', headers=user_agent_header)
r = pool.request('GET', 'https://www.google.com/')

See what I mean? Same number of lines, but way more functionality!

Now, don't get me wrong, urllib3 isn't perfect. There are some things it doesn't handle quite as smoothly. Adding cookies, for instance, requires a bit more manual work, crafting those headers just right. But hey, on the flip side, urllib3 shines in areas where Requests might struggle. Managing connection pools, proxy pools, and even retry strategies? urllib3 is our champion.
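For instance, attaching a cookie in urllib3 means building the Cookie header ourselves (the cookie values and the httpbin.org URL below are just placeholders for illustration):

```python
import urllib3

# urllib3 has no built-in cookie jar, so we craft the Cookie header by hand
headers = urllib3.make_headers(user_agent='my-scraper/1.0')
headers['Cookie'] = 'session=1234; theme=dark'

http = urllib3.PoolManager()
# Sending the request would then look like this (kept offline here):
# r = http.request('GET', 'https://httpbin.org/cookies', headers=headers)
print(headers)
```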

In a nutshell, urllib3 is more advanced than raw sockets but is still a tad simpler than Requests.

Pro Tip: If you're new to web scraping with Python, then Requests might be your best bet. Its user-friendly API is perfect for beginners. But once you're ready to level up your HTTP game, urllib3 is there to welcome you with open arms (and fewer lines of code).

Next, to parse the response, we're going to use the lxml library and XPath expressions.

XPath

We've all heard of CSS selectors, right? XPath is like its super-powered cousin. It uses path expressions to navigate and snag the exact data we need from an XML or HTML document.

Note: Like the Document Object Model, XPath has been a W3C standard since 1999. Although XPath is not a programming language per se, it allows us to write expressions that can directly access a specific node, or a specific node-set, without having to traverse the entire HTML or XML tree.

Here are 3 things we need to extract data from an HTML document with XPath:

  • An HTML document
  • XPath expressions
  • An XPath engine (like lxml) to run those expressions

To begin, we'll use the HTML we got from urllib3.

And now, let's extract all the links from the Google homepage. We'll use a simple XPath expression: //a. This tells our engine to find all the anchor tags (<a>) on the page, which is where those sweet, sweet links live.

Installing LXML

Now, to make this magic happen, we need to install a library called lxml. It's a fast, easy-to-use XML and HTML processing library that supports XPath.

Therefore, let's install lxml first:

pip install lxml

Running XPath Expressions

Since we're parsing the response from our previous output, we can continue the code from where we stopped:

# ... Previous snippet here
from lxml import html

# We reuse the response from urllib3
data_string = r.data.decode('utf-8', errors='ignore')

# We instantiate a tree object from the HTML
tree = html.fromstring(data_string)

# We run the XPath against this HTML
# This returns an array of elements
links = tree.xpath('//a')

for link in links:
    # For each element we can easily get back the URL
    print(link.get('href'))

And the output should look like this:

https://books.google.fr/bkshp?hl=fr&tab=wp
https://www.google.fr/shopping?hl=fr&source=og&tab=wf
https://www.blogger.com/?tab=wj
https://photos.google.com/?tab=wq&pageId=none
http://video.google.fr/?hl=fr&tab=wv
https://docs.google.com/document/?usp=docs_alc
...
https://www.google.fr/intl/fr/about/products?tab=wh

This is a super basic example. XPath can get pretty darn complex, but that just means it's even more powerful!

Note: We could have also used //a/@href to point straight to the href attribute. You can get up to speed about XPath with this helpful introduction from MDN Web Docs. The lxml documentation is also well-written and is a good starting point.
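Here's that attribute shortcut in action on a tiny hand-made snippet: //a/@href returns the attribute values directly as strings, skipping the elements entirely:

```python
from lxml import html

tree = html.fromstring('<div><a href="/home">Home</a><a href="/about">About</a></div>')

# //a/@href selects the href attributes themselves, not the <a> elements
print(tree.xpath('//a/@href'))  # ['/home', '/about']
```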

Copying Our Target XPath from Chrome Dev Tools

  1. Open Chrome Dev Tools (press F12 key or right-click on the webpage and select "Inspect")
  2. Use the element selector tool to highlight the element you want to scrape
  3. Right-click the highlighted element in the Dev Tools panel
  4. Select "Copy" and then "Copy XPath"
  5. Paste the XPath expression into the code

Using Chrome developer tools to copy Target XPath

Pro Tip: In my experience, XPath expressions, like regular expressions, are powerful and one of the fastest ways to extract information from HTML. However, like regular expressions, XPath can also quickly become messy, hard to read, and hard to maintain. So, keep your expressions clean and well-documented to avoid future headaches.

Learn more about XPath for web scraping in our separate blog post.

3. Using Requests & BeautifulSoup

Requests

When I started building web scrapers in Python, Requests quickly became my go-to library. It's the undisputed king of making HTTP requests and one of the most-downloaded Python packages of all time. Think of it as "HTTP for Humans" – scraping has never been so user-friendly!

Installing Requests

To get started with Requests, first, we have to install it:

pip install requests

Now we're ready to make requests with Requests (see what I did there?). Here's a simple example:

import requests

r = requests.get('https://www.scrapingninja.co')
print(r.text)

With Requests, it's easy to perform POST requests, handle cookies, and manage query parameters.
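Here's a quick sketch of those features. To keep the example offline, we build the requests with Request.prepare() instead of sending them; the URLs and values are made up:

```python
import requests

# Query parameters and cookies are passed as plain dicts;
# Requests encodes them and builds the headers for us.
get_req = requests.Request(
    'GET', 'https://example.com/search',
    params={'q': 'web scraping'},
    cookies={'session': '1234'},
).prepare()
print(get_req.url)                # https://example.com/search?q=web+scraping
print(get_req.headers['Cookie'])  # session=1234

# Form data for a POST request is URL-encoded into the body
post_req = requests.Request(
    'POST', 'https://example.com/login',
    data={'user': 'jane', 'pw': 'secret'},
).prepare()
print(post_req.body)              # user=jane&pw=secret
```

In everyday code, you'd simply call requests.get(url, params=..., cookies=...) or requests.post(url, data=...) and let Requests send the request too.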

Furthermore, don't be surprised that we can even download images with Requests:

import requests

url = 'https://www.google.com/images/branding/googlelogo/1x/googlelogo_light_color_272x92dp.png'
response = requests.get(url)
with open('image.png', 'wb') as file:
    file.write(response.content)

That's the power of Requests in a nutshell. Need to scrape the web at scale? Check out our guide on Python Requests With Proxies – it's a game-changer!

Authentication to Hacker News

Let's say we want to build a scraper that submits our blog posts to Hacker News (or any other forum). To do that, we need to log in. Here's where Requests and BeautifulSoup come in handy.

To start, let's take a quick look at the Hacker News login form and the associated DOM:

Using developer tools to inspect the input fields of Hacker News login form

We're looking for those special <input> tags with the name attribute – they're the key to sending our login information.

Here, there are three <input> tags with a name attribute (other input elements are not sent) on this form. The first one has a type hidden with a name goto, and the two others are the username and password.

When we submit the form in our browser, cookies are sent back and forth, keeping the server informed that we're logged in.

Requests handles these cookies beautifully (pun intended) with its Session object.
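A minimal sketch of that behavior (the URL and cookie value are invented; Session.prepare_request lets us inspect the outgoing request without touching the network):

```python
import requests

s = requests.Session()
# Pretend the server set this cookie on a previous response
s.cookies.set('session', '1234')

# Any request prepared through the session carries the cookie automatically
prepared = s.prepare_request(requests.Request('GET', 'https://example.com/profile'))
print(prepared.headers['Cookie'])  # session=1234
```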

BeautifulSoup

The next thing we need is BeautifulSoup. It's a Python library that helps us parse HTML and XML documents to extract data.

Installing BeautifulSoup

Just like Requests, getting BeautifulSoup is a snap:

pip install beautifulsoup4

Now we can use BeautifulSoup to dissect the HTML returned by the server and see if we've successfully logged in. All we have to do is POST our three inputs with our credentials to the /login endpoint and sniff around for an element that only appears after logging in:

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://news.ycombinator.com'
USERNAME = ""
PASSWORD = ""

s = requests.Session()

data = {"goto": "news", "acct": USERNAME, "pw": PASSWORD}
r = s.post(f'{BASE_URL}/login', data=data)

soup = BeautifulSoup(r.text, 'html.parser')
if soup.find(id='logout') is not None:
    print('Successfully logged in')
else:
    print('Authentication Error')

Fantastic! With only a few lines of Python code, we've logged in to Hacker News and checked if the login was successful. Feel free to try this with any other site.

Now, on to the next challenge: getting all the links on the homepage.

Scraping the Hacker News Homepage

Hacker News boasts a powerful API, but for this example, we'll use scraping to showcase the process.

First, let's examine the Hacker News homepage to understand its structure and identify the CSS classes we need to target:

Using developer tools to inspect the CSS classes of Hacker News homepage

We see that each post is wrapped in a <tr> tag with the class athing. Easy enough! Let's grab all these tags in one fell swoop:

links = soup.findAll('tr', class_='athing')

Then, for each link, we'll extract its ID, title, URL, and rank:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://news.ycombinator.com')
soup = BeautifulSoup(r.text, 'html.parser')
links = soup.findAll('tr', class_='athing')

formatted_links = []

for link in links:
    data = {
        'id': link['id'],
        'title': link.find_all('td')[2].a.text,
        "url": link.find_all('td')[2].a['href'],
        "rank": int(link.find_all('td')[0].span.text.replace('.', ''))
    }
    formatted_links.append(data)

print(formatted_links)

Voila! We've conquered the Hacker News homepage and retrieved details about all the posts.

Note: We've been scraping by with beautiful BeautifulSoup, and it's been a delicious experience! But what if we crave a bit more, a turbo boost for our scraping toolkit? Enter MechanicalSoup, the perfect blend of Requests' simplicity and BeautifulSoup's parsing power. Check out our guide on Getting started with MechanicalSoup.

But wait, there's more! Let's not just print this data and watch it disappear faster than a squirrel on a sugar rush.

Storing our data in CSV

To save the scraped data to a CSV file, we can use Python's csv module. Here's how that looks with some sample data in the same shape as our scraped links:

import csv

# Sample data
data = [
    {'id': '1', 'title': 'Post 1', 'url': 'http://example.com/1', 'rank': 1},
    {'id': '2', 'title': 'Post 2', 'url': 'http://example.com/2', 'rank': 2}
]

# Define the CSV file path
csv_file = 'hacker_news_posts.csv'

# Write data to CSV
with open(csv_file, 'w', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=['id', 'title', 'url', 'rank'])
    writer.writeheader()
    for row in data:
        writer.writerow(row)

Pro Tip: In my experience, this combination of Requests, BeautifulSoup and the csv module is perfect for beginners to build powerful web scrapers with minimal code. Once you're comfortable with these tools as a beginner, you can explore more advanced options like Scrapy and Selenium.

But on our journey to the land of big data, our trusty CSV file might start to spin out of control. Fortunately, we have a secret weapon: databases! Let's level up our Python scraper and make it as robust as a knight in shining armor.

Storing Our Data in PostgreSQL

We chose a good ol' relational database for our example here - PostgreSQL! With a database like PostgreSQL, we can make our data storage as strong as a dragon's hoard.

Step 1: Installing PostgreSQL

For starters, we'll need a functioning database instance. Check out PostgreSQL Download Page for that, and pick the appropriate package for your operating system, and follow its installation instructions.

Step 2: Creating a Database Table

After installation, we'll need to set up a database (let's name it scrape_demo) and add a table for our Hacker News links to it (let's name that one hn_links) with the following schema:

CREATE TABLE "hn_links" (
    "id" INTEGER NOT NULL,
    "title" VARCHAR NOT NULL,
    "url" VARCHAR NOT NULL,
    "rank" INTEGER NOT NULL
);

Note: To manage the database, we can either use PostgreSQL's own command line client (psql) or one of the available UI interfaces (PostgreSQL Clients).

All right, the database should be ready, and we can turn to our code again.

Step 3: Installing Psycopg2 to Connect to PostgreSQL

First, we need something that lets us talk to PostgreSQL, and Psycopg2 is a great library for that. As always, we can quickly install it with pip:

pip install psycopg2

The rest is relatively easy and straightforward. We just need to establish a connection to our PostgreSQL database:

con = psycopg2.connect(host="127.0.0.1", port="5432", user="postgres", password="", database="scrape_demo")

After setting up the connection, we can insert data into the database.

Step 4: Inserting Data into PostgreSQL

Once connected, we get a database cursor to execute SQL commands and insert data into the database:

cur = con.cursor()

And once we've the cursor, we can use the method execute to run our SQL command:

cur.execute("INSERT INTO table [HERE-GOES-OUR-DATA]")

Perfect! We have stored everything in our database!

Step 5: Committing the Data and Closing the Connection

Hold your horses, please. Before you ride off into the sunset, don't forget to commit your (implicit) database transaction 😉. One more con.commit() (and a couple of closes) and we're really good to go:

# Commit the data
con.commit()

# Close our database connections
cur.close()
con.close()

Now, let’s take a sneak peek at our data:

Viewing the stored PostgreSQL data from the hn_links table

And for the grand finale, here’s the complete code, including the scraping logic from before and the database storage:

import psycopg2
import requests
from bs4 import BeautifulSoup

# Establish database connection
con = psycopg2.connect(
    host="127.0.0.1",
    port="5432",
    user="postgres",
    password="",
    database="scrape_demo"
)

# Get a database cursor
cur = con.cursor()

r = requests.get('https://news.ycombinator.com')
soup = BeautifulSoup(r.text, 'html.parser')
links = soup.findAll('tr', class_='athing')

for link in links:
    cur.execute("""
        INSERT INTO hn_links (id, title, url, rank)
        VALUES (%s, %s, %s, %s)
        """,
        (
            link['id'],
            link.find_all('td')[2].a.text,
            link.find_all('td')[2].a['href'],
            int(link.find_all('td')[0].span.text.replace('.', ''))
        )
    )

# Commit the data
con.commit()

# Close our database connections
cur.close()
con.close()

With that, we’re done! Our data is safely tucked away in the database, ready for any analysis or processing we might have in mind. Nice work!

Summary

Whew! We've accomplished quite a bit. As we've seen, Requests and BeautifulSoup are fantastic libraries for extracting data and automating various tasks, like posting forms.

However, if we're planning to run large-scale web scraping projects, we could still use Requests, but we'll need to handle many components ourselves.

💡 Did you know about ScrapingBee's Data Extraction tools? Not only do they provide a complete no-code environment for your projects, but they also scale effortlessly and manage all advanced features, like JavaScript and proxy round-robin, right out of the box. Check it out - the first 1,000 requests are on the house!

If you want to dive deeper into Python, BeautifulSoup, POST requests, and particularly CSS selectors, our related blog posts are well worth a read.

The scraping world is full of opportunities for improvement. Here are some ways to make our scraper truly shine:

  • Parallelizing our code and making it faster by running multiple scraping tasks concurrently
  • Handling errors by making our scraper robust with exception handling and retrying failed requests
  • Filtering results to extract only the data we need
  • Throttling requests to avoid overloading the server by adding delays between requests

Fortunately, tools exist that can handle these improvements for us. For large-scale projects, consider using web crawling frameworks such as Scrapy.

Asyncio

Requests is fantastic, but for hundreds of pages, it might feel a bit sluggish. By default, Requests handles synchronous requests, meaning that each request is sent one by one.

For instance, if we have 25 URLs to scrape, each taking 10 seconds, it will take over four minutes to scrape all pages sequentially:

import requests

# An array with 25 urls
urls = [...]

for url in urls:
    result = requests.get(url)

Asyncio to the rescue! This asynchronous I/O library in Python, along with aiohttp for asynchronous HTTP requests, allows us to send requests concurrently.

Instead of waiting for each request to finish before sending the next, we can send them all (or many at once) and handle the responses asynchronously.

Step 1: Installing the Aiohttp Library

First, we'll install the aiohttp library, which works well with asyncio for making HTTP requests:

pip install aiohttp

Step 2: Using Asyncio for Concurrent Requests

After installation, we can now use asyncio and aiohttp to scrape multiple pages in parallel.

For example, we can scrape Hacker News list pages asynchronously:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

# Asynchronous function to fetch the HTML content of the URL
async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

# Asynchronous function to fetch the HTML content of multiple URLs
async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)

# Main function to fetch and parse the HTML content
async def main():
    urls = ['https://news.ycombinator.com/news?p=1'] * 25  # Example: same URL for demonstration
    html_pages = await fetch_all(urls)
    
    all_links = []
    for html in html_pages:
        soup = BeautifulSoup(html, 'html.parser')
        links = soup.findAll('tr', class_='athing')
        for link in links:
            data = {
                'id': link['id'],
                'title': link.find_all('td')[2].a.text,
                'url': link.find_all('td')[2].a['href'],
                'rank': int(link.find_all('td')[0].span.text.replace('.', ''))
            }
            all_links.append(data)
    
    for link in all_links:
        print(f"ID: {link['id']}, Title: {link['title']}, URL: {link['url']}, Rank: {link['rank']}")

# Run the main function
asyncio.run(main())

Pro Tip: In my experience, asyncio can dramatically reduce scraping time for multiple pages. It's a superhero for large scraping projects!

Asyncio's secret weapon is its single-threaded, single-process design with event loops. This makes it super efficient, allowing us to handle hundreds of requests concurrently without overloading the system, making it highly scalable.
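To see that event-loop efficiency without hitting a real website, here's a small self-contained sketch: an asyncio.Semaphore caps how many "requests" are in flight at once, with asyncio.sleep standing in for network I/O (the names and the limit of 5 are purely illustrative):

```python
import asyncio

async def fake_fetch(url, sem, active, peaks):
    async with sem:                  # wait for a free slot
        active[0] += 1
        peaks.append(active[0])      # record how many are in flight
        await asyncio.sleep(0.01)    # stand-in for the HTTP round-trip
        active[0] -= 1
        return url

async def main():
    sem = asyncio.Semaphore(5)       # at most 5 concurrent "requests"
    active, peaks = [0], []
    urls = [f'page-{i}' for i in range(25)]
    results = await asyncio.gather(*(fake_fetch(u, sem, active, peaks) for u in urls))
    return len(results), max(peaks)

fetched, max_in_flight = asyncio.run(main())
print(f'fetched {fetched} pages, never more than {max_in_flight} at once')
```

With aiohttp, the same pattern applies: wrap session.get() inside the semaphore's async with block to throttle real requests.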

For features like JavaScript rendering or fancy proxy rotation, consider exploring scraping frameworks like Scrapy or services like ScrapingBee. They'll help us conquer even the most complex scraping challenges!

4. Using Web Crawling Frameworks

Scrapy

Scrapy logo, a python web scraper

Scrapy is like a Swiss Army knife for web scraping and crawling, armed with Python power. I’ve had my share of adventures with it, and trust me, it's got quite the arsenal.

From downloading web pages asynchronously to managing and saving the content in various formats, Scrapy’s got us covered. It supports multithreading, crawling (yep, the process of hopping from link to link to discover all URLs on a website like a digital spider), sitemaps, and a lot more.

Key Features of Scrapy

  • Asynchronous Requests: Handles multiple requests simultaneously, speeding up the scraping process
  • Built-In Crawler: Automatically follows links and discovers new pages
  • Data Export: Exports data in various formats such as JSON, CSV, and XML
  • Middleware Support: Customize and extend Scrapy's functionality using middlewares

And let's not forget the Scrapy Shell, my secret weapon for testing code. With Scrapy Shell, we can quickly test our scraping code and ensure our XPath expressions or CSS selectors work flawlessly.

But hold on to your hats, folks, because Scrapy's learning curve is steeper than a rollercoaster drop. There is a lot to learn!

To continue our example with Hacker News, we'll create a Scrapy Spider that scrapes the first 15 pages of Hacker News and save the data in a CSV file.

Step 1: Installing Scrapy

First things first, let’s install Scrapy. It’s a piece of cake with pip:

pip install Scrapy

Step 2: Generating Project Boilerplate

After installation, we use the Scrapy CLI to generate the boilerplate code for our project:

scrapy startproject hacker_news_scraper

Step 3: Creating the Scrapy Spider

Inside the hacker_news_scraper/spiders directory, we'll create a new Python file with our spider's code:

from bs4 import BeautifulSoup
import scrapy

class HnSpider(scrapy.Spider):
    name = "hacker-news"
    allowed_domains = ["news.ycombinator.com"]
    start_urls = [f'https://news.ycombinator.com/news?p={i}' for i in range(1, 16)]

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        links = soup.findAll('tr', class_='athing')

        for link in links:
            yield {
                'id': link['id'],
                'title': link.find_all('td')[2].a.text,
                'url': link.find_all('td')[2].a['href'],
                'rank': int(link.td.span.text.replace('.', ''))
            }

Scrapy uses conventions extensively. Here, the start_urls list contains all the desired URLs. Scrapy then fetches each URL and calls the parse method for each response, where we use custom code to parse the HTML.

Step 4: Configuring Scrapy Settings

We need to tweak Scrapy a bit to ensure our spider behaves politely with the target website. To do this, we should enable and configure the AutoThrottle extension in the Scrapy settings:

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True

# The initial download delay
AUTOTHROTTLE_START_DELAY = 5

Note: We should always turn this on. This feature is like having a friendly spider that doesn’t flood the site with requests. It automatically adjusts the request rate and the number of concurrent threads based on response times. We wouldn’t want to be that annoying guest at a party, right?

Step 5: Running the Spider

Now, let’s run the spider code with the Scrapy CLI and save the output our desired formats (CSV, JSON, or XML). Here’s how to save the data in a JSON file:

scrapy crawl hacker-news -o links.json

Voila! We now have all our links neatly packed in a JSON file. Scrapy does all the heavy lifting while we sit back and sip our coffee.

There’s so much more to explore with Scrapy. If you’re hungry for more knowledge, check out our dedicated blog post about web scraping with Scrapy. It’s a treasure trove of information!

PySpider

PySpider, an alternative to Scrapy, might feel like a hidden gem in the world of web crawling frameworks. Although its last update was in 2018, it still holds relevance today due to its unique features that Scrapy doesn’t handle out of the box.

Why PySpider?

PySpider shines when it comes to handling JavaScript-heavy pages (think SPAs and Ajax calls) because it ships with PhantomJS, a headless browsing library. In contrast, Scrapy requires us to install additional middleware to tackle JavaScript content.

On top of that, PySpider has a user-friendly UI that lets me keep an eye on all my crawling jobs:

PySpider

If we decide to use PySpider, here is how to get it up and running.

Step 1: Installing PySpider

We can install PySpider using pip:

pip install pyspider

It’s as easy as pie!

Step 2: Starting the PySpider Components

After installation, we need to start the necessary components (webui, scheduler, and fetcher). We can launch them all at once:

pyspider all

Step 3: Accessing the Web UI

Once everything is up and running, we navigate to http://localhost:5000 to access the PySpider interface. We’ll find it quite intuitive and user-friendly.

PySpider vs. Scrapy: A Quick Comparison

Although PySpider has some cool features, there are several reasons why we might still lean towards Scrapy. Here’s a quick comparison:

Feature | PySpider | Scrapy
JavaScript Handling | Built-in with PhantomJS | Requires additional middlewares
User Interface | Yes | No
Documentation | Limited | Extensive with easy-to-understand guides
HTTP Cache System | No | Yes, built-in
HTTP Authentication | No | Automatic
Redirection Support | Basic | Full support for 3XX and HTML meta refresh tags

To learn more about PySpider, check out the official documentation. Dive in and start crawling!

5. Using Headless Browsing

Selenium & Chrome

Scrapy is excellent for large-scale web scraping tasks. However, it struggles with sites that heavily use JavaScript or are implemented as Single Page Applications (SPAs). Scrapy only retrieves static HTML code and can't handle JavaScript on its own. So, all that fancy JavaScript goes to waste.

These SPAs can be tricky to scrape because of all the AJAX calls and WebSocket connections they use. If I'm worried about performance, I can use my browser's developer tools to look at all the network calls and copy the AJAX calls that have the data I'm after. But if there are just too many HTTP calls involved, it's easier to use a headless browser to render the page.
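If we go the AJAX-replay route, it helps to rebuild the captured call in Python before firing it off. Here is a sketch using a hypothetical endpoint and parameters (the URL and headers are placeholders, not from a real site):

```python
import requests

# Hypothetical endpoint and parameters copied from the browser's Network tab
url = "https://example.com/api/items"
headers = {
    "X-Requested-With": "XMLHttpRequest",  # many sites tag their AJAX calls this way
    "Accept": "application/json",
}
params = {"page": 1}

# Build the request without sending it, so we can inspect what goes over the wire
prepared = requests.Request("GET", url, headers=headers, params=params).prepare()
print(prepared.url)  # https://example.com/api/items?page=1

# To actually fetch the data: requests.Session().send(prepared, timeout=10)
```

Inspecting the prepared request first makes it easy to compare byte-for-byte against what the browser sent.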

Headless browsers are perfect for taking screenshots of websites, and that's exactly what we're going to do with the Hacker News homepage (because, hey, who doesn't love Hacker News?). We'll use Selenium to lend us a hand.

When to Use Selenium

Scrapy is like that reliable old friend who’s great for most things, but sometimes we need some extra muscle, especially when the website has tons of JavaScript code.

Here are the three most common cases when we should summon Selenium:

  • Delayed Content: When the data we need doesn't appear until a few seconds after the page loads
  • JavaScript Everywhere: The website is a JavaScript jungle
  • JavaScript Blockades: The site uses JavaScript checks to block "classic" HTTP clients and regular web scrapers

Setting Up Selenium

Step 1: Installing Selenium

We can install the Selenium package with pip:

pip install selenium

Step 2: Getting ChromeDriver

We also need ChromeDriver. For macOS, we can use brew for that:

brew install chromedriver

On other platforms, we can download it from the official ChromeDriver site.

Taking a Screenshot With Selenium

After getting our ChromeDriver, we just have to import the webdriver from the selenium package, configure Chrome to run in headless mode, set a window size (otherwise it's really small), start Chrome, load the page, and finally take our beautiful screenshot:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument("--headless=new")  # replaces the deprecated options.headless flag
options.add_argument("--window-size=1920,1200")

# In Selenium 4, the driver path is passed through a Service object
driver = webdriver.Chrome(service=Service(executable_path='/usr/local/bin/chromedriver'), options=options)
driver.get("https://news.ycombinator.com/")
driver.save_screenshot('hn_homepage.png')
driver.quit()

And, being good netizens, we quit() the webdriver instance when we’re done. Now, we should have a nice screenshot of the homepage:

Hacker News homepage screenshot taken using Selenium

As we can see, we have a screenshot of the Hacker News homepage saved as hn_homepage.png.

Let's see one more example.

Scraping Titles With Selenium

Selenium is not just about taking screenshots. It's a full-fledged browser at our command that can scrape data rendered by JavaScript.

Let's now scrape the titles of posts (class titleline) on the Hacker News homepage:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup
import logging

# Set up logging to troubleshoot if anything goes wrong
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Set up headless Chrome options
options = Options()
options.add_argument("--headless=new")  # replaces the deprecated options.headless flag
options.add_argument("--window-size=1920,1200")

# Initialize the WebDriver (Selenium 4 takes the driver path via a Service object)
logging.info("Initializing WebDriver")
driver = webdriver.Chrome(service=Service(executable_path='/usr/local/bin/chromedriver'), options=options)

# Load the Hacker News homepage
logging.info("Loading Hacker News homepage")
driver.get("https://news.ycombinator.com/")

# Get page source and parse it with BeautifulSoup
logging.info("Parsing page source with BeautifulSoup")
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Find all titles on the page (class - titleline)
logging.info("Finding all story titles on the page")
titles = soup.find_all('span', class_='titleline')

if titles:
    logging.info(f"Found {len(titles)} titles. Printing titles:")
    # Print each title
    for title in titles:
        title_link = title.find('a')
        if title_link:
            print(title_link.text)
else:
    logging.warning("No titles found on the page.")

# Close the driver
logging.info("Closing WebDriver")
driver.quit()

Here, we launch a headless Chrome browser, load the Hacker News homepage, and retrieve the page source. Then, we use BeautifulSoup to parse and extract the titles.
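The BeautifulSoup step works independently of Selenium, so we can exercise it against a static HTML fragment. The snippet below uses a simplified stand-in for the Hacker News markup, not the real page:

```python
from bs4 import BeautifulSoup

# A simplified stand-in for the Hacker News markup (the real page has more structure)
html = """
<table>
  <tr class="athing"><td><span class="titleline"><a href="https://example.com/a">First story</a></span></td></tr>
  <tr class="athing"><td><span class="titleline"><a href="https://example.com/b">Second story</a></span></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
titles = [span.find("a").text for span in soup.find_all("span", class_="titleline")]
print(titles)  # ['First story', 'Second story']
```

Testing the parsing logic on canned HTML like this is a handy way to debug selectors without launching a browser every time.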

Advanced Selenium Usage

Naturally, there's a lot more we can do with the Selenium API and Chrome. After all, it's a full-blown browser instance:

  • Running JavaScript
  • Filling forms
  • Clicking on elements
  • Extracting elements with CSS selectors or XPath expressions

For more in-depth knowledge, don't hesitate to check out our detailed guide on Selenium and Python.

Selenium and Chrome in headless mode are the dynamic duo for scraping anything we can dream of. However, with great power comes great responsibility (and resource usage):

  • Memory & CPU Usage: Chrome, bless its heart, can gobble up memory before we realize it. While some fine-tuning can shrink its footprint to a manageable 300-400MB per instance, each one still needs a dedicated CPU core.

If we need to run several instances concurrently, this will require a machine with an adequate hardware setup and enough memory to serve all our browser instances.

For a more lightweight solution or to avoid the complexity of managing multiple browser instances, consider using ScrapingBee's site crawler SaaS platform. It takes care of a lot of the heavy lifting for you.

RoboBrowser

When web scraping calls for simplicity and elegance, RoboBrowser steps in.

This Python library combines the forces of Requests and BeautifulSoup into one, easy-to-use package. We can craft custom scripts to control the browsing workflow, making it perfect for those light-duty scraping tasks, like handling forms.

However, RoboBrowser is not a true headless browser and shares the same limitations as Requests and BeautifulSoup when dealing with JavaScript-heavy sites.

Installing RoboBrowser

We can install RoboBrowser with pip:

pip install robobrowser

Handling Forms With RoboBrowser

Logging into a website? No sweat with RoboBrowser! Instead of wrestling with Requests to craft a perfect request, we can use RoboBrowser to fill out the form and hit submit with ease.

For example, if we want to login to Hacker News, instead of manually crafting a request with Requests, we can write a script that will populate the form and click the login button:

from robobrowser import RoboBrowser

# Initialize RoboBrowser with a specific parser
browser = RoboBrowser(parser='html.parser')

# Open the login page
browser.open('https://news.ycombinator.com/login')

# Get the login form by action name
signin_form = browser.get_form(action='login')

# Fill out the form with username and password
signin_form['acct'].value = 'your_username'
signin_form['password'].value = 'your_password'

# Submit the form
browser.submit_form(signin_form)

# Verify login by checking for a specific element
if browser.find(id='logout'):
    print('Successfully logged in')
else:
    print('Login failed')

RoboBrowser makes interacting with web forms a breeze. It's like having a virtual assistant filling out forms for us, without the hassle of a full-fledged headless browser.

And the best part? RoboBrowser's lightweight design makes it easy to parallelize tasks on our computer. So, we can scrape away with multiple instances without breaking a sweat (or the bank).

Limitations of RoboBrowser

Like all things light and breezy, RoboBrowser has its limitations:

  • No JavaScript Here, Folks: Since RoboBrowser is not using a real browser, it cannot handle JavaScript-heavy pages, AJAX calls, or Single Page Applications.
  • Minimal Documentation: The documentation is a bit, well, slim. In my experience, I wouldn't recommend it for beginners or for those not already familiar with the BeautifulSoup or Requests APIs.

RoboBrowser is a fantastic tool for those simple scraping tasks where JavaScript isn't invited to the party. But for more complex adventures that involve JavaScript, we might want to consider enlisting the help of Selenium or other more robust solutions.

6. Using a Website’s API

Sometimes, scraping a website’s content can be as easy as interacting with its API. No wrestling with fussy HTTP clients and website markup, and no need for a fancy browser in disguise.

We just need to find the API the website offers and then use it to grab the data we want.

Scraping a Website’s Content API

Let's see the steps to find and scrape a website’s content API.

Step 1: Identifying the API Endpoint

Many websites provide official APIs. To find them, we can look for links to API documentation in the website’s footer, developer section, or simply use search engines with queries like "site:example.com API".

Another trick up our sleeve is using browser developer tools to inspect network requests while interacting with the website. We should look for API endpoints returning JSON or XML data, which often signals an API.

Step 2: Understanding the API Documentation

API documentation is our treasure map. We need to read through it to understand the available endpoints, required authentication, rate limits, and data formats.

Step 3: Generating API Keys

To unlock the API's full potential, we need API keys. Sign up or log in to the website’s developer portal, create an application, and specify a redirect URL to generate these keys.

Step 4: Using cURL to Test API Requests

cURL is our trusty command-line tool for making HTTP requests. We can use it to test API endpoints and understand the required parameters. Here is an example cURL command:

curl -X GET "https://api.example.com/data?param=value" -H "Authorization: Bearer your_token"

Step 5: Converting cURL to Python Requests

After pinpointing the necessary API request, it’s time to convert the cURL command to Python code using the Requests library. Let’s translate the above sample cURL command:

import requests

url = "https://api.example.com/data"
headers = {
    "Authorization": "Bearer your_token"
}
params = {
    "param": "value"
}

response = requests.get(url, headers=headers, params=params, timeout=10)
response.raise_for_status()  # surface HTTP errors early
data = response.json()
print(data)

Pro Tip: From my experience, using APIs is often more efficient than scraping HTML. We get the structured data straight from the source, all neat and organized, and it's less likely to break if the website changes its layout.

For more insights on converting cURL commands to Python requests, check out our detailed guide.

Scraping Reddit Data

Let's set sail for Reddit and plunder some data with the amazing PRAW library. It's a fantastic Python package that wraps the Reddit API in a nice, user-friendly way.

Step 1: Installing PRAW

First things first, we need to invite praw to our Python party. Let's send out the invitation:

pip install praw

It's as easy as pie. Or should we say, as easy as PRAW?

Step 2: Getting Reddit API Credentials

To use the Reddit API, we need to create an application and obtain the necessary credentials. Go to Reddit Apps and scroll to the bottom to create an application.

According to PRAW's documentation, set http://localhost:8080 as the "redirect URL".

After clicking create app, the screen with the API details and credentials will load. For our example, we'll need the client ID, the secret, and the user agent.

Reddit API application details for setting up PRAW, the Reddit API wrapper

Step 3: Scraping Data With PRAW

Now, let's get our hands on the top 1,000 posts from the /r/Entrepreneur subreddit. We're going to scoop up this data and export it into a shiny CSV file:

import praw
import csv

# Setup Reddit API credentials
reddit = praw.Reddit(client_id='your_client_id', client_secret='your_secret', user_agent='top-1000-posts')

# Get the top 1,000 posts from the subreddit
top_posts = reddit.subreddit('Entrepreneur').top(limit=1000)

# Open a CSV file to write the data
with open('top_1000.csv', 'w', newline='') as csvfile:
    fieldnames = ['title', 'score', 'num_comments', 'author']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    for post in top_posts:
        writer.writerow({
            'title': post.title,
            'score': post.score,
            'num_comments': post.num_comments,
            'author': str(post.author)
        })

The actual extraction part is a piece of cake (or maybe a piece of code). We're running top on the subreddit and storing the posts in top_posts. Voilà!

There are many other mystical and wonderful things we can do with PRAW. From analyzing subreddits in real-time with sentiment analysis libraries to predicting the next $GME, the possibilities are as endless as the scroll on our Reddit feed.

💡 Want to take the hassle out of scraping? Learn how to screen scrape with no infrastructure maintenance via our scraping API

7. Avoiding Anti-Bot Technology

One of the biggest challenges we face on our scraping adventures is getting blocked by websites. They're getting better at spotting those pesky scraper bots.

But don't worry! Here are some tried-and-true methods to keep our scraping efforts under the radar. For a deep dive, check out our detailed article on how to not get blocked while web scraping.

Using Proxies

Websites often block IP addresses that make too many requests in a short period. But imagine having a whole fleet of ships (IP addresses) at our command. That's what proxies are!

Instead of just one ship sending out requests, we can spread the load across many, making it harder for websites to detect us.

  • Rotating Proxies: This is like sending our ships from different ports, making it look like a bunch of different users are accessing the website. This is especially useful for high-volume scraping.
  • Residential Proxies: These proxies use IP addresses provided by ISPs to homeowners, making them appear as real users.

Pro Tip: I've found that a mix of both proxy types provides a good balance between cost and effectiveness.

Setting or Rotating User Agents

Imagine showing up to a party in a disguise - each time a different one. That's what rotating user agents does. By changing the User-Agent header in our HTTP requests, we can mimic requests from different browsers and devices.

  • Random User Agents: We can have a whole chest full of different hats (user agents) and pick a new one for each request.
  • Custom User Agents: We can even create our own special eyepatches (custom user agents) to look super unique, with details like browser versions, operating systems, and other details to further disguise our scraper.
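A minimal rotation sketch with Requests-style headers might look like this (the User-Agent strings are real-world examples, but keep yours current, since ancient browser versions stand out):

```python
import random

# A small pool of real-world User-Agent strings; refresh these periodically
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0",
]

def random_headers():
    """Return request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = random_headers()
# To use it: requests.get("https://news.ycombinator.com/", headers=headers, timeout=10)
print(headers["User-Agent"])
```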

Undetected ChromeDriver

Some websites are really smart and can spot headless browsers like Selenium. But there is a secret weapon: the undetected ChromeDriver!

It helps in running Chrome in a headless mode while avoiding detection by websites that block scrapers.

  • Installation: Use a library like undetected-chromedriver, which automatically configures ChromeDriver to evade detection
  • Customization: Modify Chrome options to mimic real user behavior, such as setting a realistic screen resolution and enabling JavaScript

NoDriver

NoDriver is an asynchronous tool that replaces traditional components such as Selenium or webdriver binaries, communicating with the browser directly. Its standout feature, and what sets it apart from similar packages, is that it is optimized to evade most anti-bot solutions; cutting out the webdriver layer also significantly improves performance.

Check out our tutorial on How to scrape websites with Nodriver.

Balancing Act

Web scraping is like a delicate dance - too aggressive, and we'll be blocked; too timid, and we won't get the data we need. By using proxies, rotating user agents, leveraging undetected ChromeDriver, and avoiding drivers when possible, we can strike the perfect balance.

Remember, it's not just about getting the data but doing so efficiently and ethically.

For more tips and tricks, don't forget to check out our article on web scraping without getting blocked.

Conclusion

Here's a quick recap table of every technology we discussed in this tutorial.

Name | Ease of Use | Flexibility | Speed of Execution | Common Use Case | Learn More
Socket | - - - | + + + | + + + | Writing low-level programming interfaces | Official documentation
Urllib3 | + + | + + + | + + | High-level applications needing fine control over HTTP (pip, aws client, requests, streaming) | Official documentation
Requests | + + + | + + | + + | Calling APIs, Simple applications (in terms of HTTP needs) | Official documentation
Asyncio | + + | + + | + + + | Asynchronous scraping, high-speed concurrent requests | Official documentation
Scrapy | + + | + + + | + + + | Crawling numerous websites, Filtering, extracting, and loading scraped data | Official documentation
Selenium | + | + + + | + | JS rendering, Scraping SPAs, Automated testing, Programmatic screenshots | Official documentation
Web Content API | + + + | + + + | + + + | Directly accessing structured data from APIs | Praw documentation

We’ve covered quite a lot of ground in this blog post, from the basics of socket programming to advanced web scraping with Selenium and everything in between. Here are some other key points to remember:

  • Sockets and urllib3 provide a foundation for understanding low-level and high-level HTTP requests.
  • Requests simplifies HTTP requests, making it an excellent starting point for beginners.
  • Asyncio allows for high-speed concurrent requests, significantly speeding up scraping tasks.
  • BeautifulSoup helps parse HTML, while Scrapy offers a robust framework for large-scale scraping tasks.
  • Selenium is our go-to for scraping JavaScript-heavy websites, allowing us to automate browser actions.
  • Using APIs can sometimes be the most efficient method to get structured data directly.

Further Reading

Here are some valuable resources to dive deeper into web scraping and related topics:

Ready to Level Up Your Scraping Game? Check Out ScrapingBee!

If you're interested in scraping data and building products around it, then ScrapingBee's API is here to make your scraping life a whole lot easier.

Speaking of making life easier, did I mention ScrapingBee throws you a cool 1,000 FREE scraping credits when you sign up? That's right, no credit card needed! It's the perfect chance to test drive this awesome tool and see what web scraping magic you can conjure.

So, why wait? Sign up for ScrapingBee today and supercharge your web scraping output! Let's turn those websites into data goldmines!

If you have any questions or suggestions for additional resources, feel free to reach out via email or live chat on our website. We love hearing from our readers!

Happy Scraping!

image description
Kevin Sahin

Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.