Web scraping is a powerful tool for gathering data from websites, and Playwright is one of the best tools out there to get the job done. In this tutorial, I'll walk you through how to scrape with Playwright for Python. We'll start with the basics and gradually move to more advanced techniques, ensuring you have a solid grasp of the entire process. Whether you're new to web scraping or looking to refine your skills, this guide will help you use Playwright for Python effectively to extract data from the web.
We'll cover setting up your environment, writing simple scripts, handling multiple pages, and adding a retry mechanism to make your scraper more robust. By the end, you'll be ready to tackle any web scraping project with confidence.
So, let's dive in and learn how to scrape with Playwright for Python! Playwright is also available for JavaScript if that's your flavour.
What is Playwright for Python?
Playwright is a powerful tool for automating web browsing tasks. It's designed to provide a fast and reliable way to scrape websites, test web apps, and perform any task that requires interacting with a browser. It's created and maintained by Microsoft and is available for Python, JavaScript/TypeScript, Java, and .NET.
Headless browsing
One of Playwright's key features is its support for headless browsing. This means you can run browser automation without a graphical user interface (GUI), making your scripts faster and more efficient. Headless mode is perfect for scraping tasks where you don't need to see the browser window.
Why Playwright over Selenium?
You might have heard of Selenium, another popular tool for browser automation. While both Playwright and Selenium serve similar purposes, Playwright offers some advantages:
- Multi-browser support: Playwright supports Chromium, Firefox, and WebKit, giving you more flexibility in testing across different browsers.
- Modern web features: Playwright is designed with modern web technologies in mind, offering better support for features like shadow DOM, web components, and more.
- Consistent API: Playwright provides a more consistent and intuitive API compared to Selenium, making it easier to write and maintain your scripts.
- Built-in waiting mechanisms: Playwright includes smarter waiting mechanisms, which helps in dealing with dynamic content and avoiding flakiness in your tests.
Getting started with Playwright
Prerequisites
Before diving into using Playwright for scraping, make sure you've got the basics covered:
- A basic understanding of Python
- Python 3 installed on your computer
- An operating system that's either Mac, Linux, or Windows 10+
- Access to a terminal and a code editor
You can find the source code for this tutorial on GitHub.
Setting up your environment
- Use Poetry to initialize the project and its virtual environment:
poetry init
- Add Playwright to the dependencies in your pyproject.toml file:
playwright = "^1.45.1"
- Install the package using Poetry:
poetry install --no-root
- Create a new main.py file in the project root. To run your script, use:
poetry run python3 main.py
Alternatively, if you prefer a more straightforward method, install Playwright directly with:
pip install playwright
You'll also need to download the browser binaries that Playwright controls (if you installed Playwright with pip, run playwright install without the poetry run prefix):
poetry run playwright install
That's it!
Full example: Playwright Python script
If you are eager to start scraping as soon as possible, here's the full version of the script that we're going to build today.
import sys
import re
import csv
import time
from concurrent.futures import ThreadPoolExecutor
from playwright.sync_api import sync_playwright


def run(playwright, url, take_screenshot):
    browser = playwright.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    if take_screenshot:
        __capture_screenshot(page, url)
    else:
        __save_page_text(page, "main", url)
    browser.close()


def process_url(url, take_screenshot, retries=2, backoff_factor=1):
    for attempt in range(retries + 1):
        try:
            # Each call creates its own Playwright instance
            with sync_playwright() as playwright:
                run(playwright, url, take_screenshot)
            break
        except Exception as e:
            if attempt < retries:
                sleep_time = backoff_factor * (2 ** attempt)
                print(f"Error processing URL {url}, retrying in {sleep_time} seconds... ({attempt + 1}/{retries})")
                time.sleep(sleep_time)  # Exponential backoff
            else:
                print(f"Failed to process URL {url} after {retries + 1} attempts: {e}")


def __save_page_text(page, selector, url):
    title = page.title()
    main_content = page.query_selector(selector)
    main_text = (
        main_content.inner_text() if main_content else "No requested selector found"
    )
    filename = __safe_filename_from(title)
    with open(filename, "w", encoding="utf-8") as f:
        f.write(f"URL: {url}\n")
        f.write(f"Title: {title}\n\n")
        f.write(main_text)
    print(f"Data saved as {filename}")


def __safe_filename_from(title, extension="txt"):
    safe_title = re.sub(r"[^\w\s-]", "", title).strip().replace(" ", "_")
    return f"{safe_title}.{extension}"


def __capture_screenshot(page, url):
    # Pass the extension explicitly so the file isn't named "<title>.txt.png"
    filename = __safe_filename_from(page.title(), "png")
    page.screenshot(path=filename, full_page=True)
    print(f"Screenshot saved as {filename}")


def read_urls_from_csv(file_path):
    urls = []
    with open(file_path, newline="", encoding="utf-8") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            urls.append(row["loc"])
    return urls


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python main.py <csv_file> [--screenshot]")
        sys.exit(1)
    csv_file = sys.argv[1]
    take_screenshot = "--screenshot" in sys.argv
    urls = read_urls_from_csv(csv_file)
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(process_url, url, take_screenshot) for url in urls]
        for future in futures:
            try:
                future.result()
            except Exception as e:
                print(f"Error processing URL: {e}")
In this example, we read a list of links from a CSV file and scrape them in multiple threads, retrying requests as necessary. If the --screenshot command-line argument is set, we take a screenshot of each page instead of saving its text.
Avoiding getting blocked while scraping
You probably know that websites aren't too fond of scrapers and might block you if they detect unusual or undesired activity. Here are some methods to help you avoid this:
Use proxies
Using proxies can help mask your IP address, making it look like your requests are coming from different locations. This helps avoid getting flagged for sending too many requests from a single IP.
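As a rough sketch of how proxy rotation could look, the helper below cycles through a list of proxies and builds the proxy argument that Playwright's launch() accepts. The proxy addresses and the proxy_settings() and launch_with_proxy() helpers are placeholders invented for illustration; substitute the details from your own provider.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace them with real ones from your provider.
PROXIES = [
    "http://proxy-one.example.com:8080",
    "http://proxy-two.example.com:8080",
]

# cycle() yields the proxies in order and starts over when the list ends.
_proxy_pool = cycle(PROXIES)


def proxy_settings():
    """Return launch() proxy settings for the next proxy in the rotation."""
    return {"server": next(_proxy_pool)}


def launch_with_proxy(playwright):
    """Launch Chromium through the next proxy (requires Playwright)."""
    return playwright.chromium.launch(proxy=proxy_settings())
```

Each call to launch_with_proxy() picks the next address, so consecutive browser launches leave from different IPs.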
Rotate user agents
Websites often check the user agent string to detect bots. Rotating user agents can make your requests appear to come from different browsers and devices, reducing the chances of getting blocked.
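One possible way to rotate user agents is to pick a random string for every new browser context. The user agent strings and the helper names below are illustrative assumptions, not a vetted list; keep your own list up to date.

```python
import random

# Example user agent strings; treat them as placeholders.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:127.0) Gecko/20100101 Firefox/127.0",
]


def pick_user_agent(rng=random):
    """Return a randomly chosen user agent string."""
    return rng.choice(USER_AGENTS)


def new_page_with_random_agent(browser):
    """Open a page in a fresh context that reports a random user agent."""
    context = browser.new_context(user_agent=pick_user_agent())
    return context.new_page()
```

In a scraper like the one in this tutorial, new_page_with_random_agent(browser) could stand in for browser.new_page().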
Avoid honeypots
Honeypots are traps set by websites to detect bots. These are often hidden elements that normal users won't interact with. Make sure your scraper ignores these elements to avoid detection.
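A simple precaution, sketched below, is to follow only links that are actually visible. The visible_link_urls() helper is a hypothetical name, and visibility checks alone won't catch every trap, so treat this as a starting point rather than a complete defence.

```python
def visible_link_urls(page):
    """Collect href values only from links a human visitor could see."""
    urls = []
    for link in page.locator("a[href]").all():
        # is_visible() is False for elements without a bounding box
        # (e.g. display: none) or with visibility: hidden.
        if link.is_visible():
            urls.append(link.get_attribute("href"))
    return urls
```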
Respect robots.txt
Always check the robots.txt file of the website you're scraping. This file contains rules about which parts of the site crawlers may access and which they should stay away from.
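Python's standard library can evaluate these rules for you. The sketch below parses a made-up robots.txt from a string so it runs offline; the sample rules, the build_parser() helper, and the "my-scraper" user agent name are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt used purely for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Create a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(parser.can_fetch("my-scraper", "https://example.com/blog/post"))     # True
print(parser.can_fetch("my-scraper", "https://example.com/private/data"))  # False
```

For a live site, you can instead call set_url() with the address of the site's robots.txt followed by read(), and then check every URL with can_fetch() before visiting it.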
ScrapingBee
If you want a hassle-free solution, consider using ScrapingBee. ScrapingBee provides:
- Headless browsers: Allows you to scrape websites without worrying about rendering issues.
- Proxy rotation: Automatically rotates proxies to avoid IP bans.
- Premium proxies: Access to high-quality proxies to ensure your requests are less likely to be blocked.
- Google search scraping: Easily scrape Google search results, which can be tricky due to Google's strict anti-scraping measures.
ScrapingBee handles the heavy lifting for you, so you can focus on extracting the data you need without getting blocked.
Scraping a single webpage with Playwright
Let's start with something simple. We'll write a script that opens a webpage and takes a screenshot if a command-line argument is provided. This will help you get a feel for how Playwright works.
Screenshotting a single webpage
Open main.py in your code editor and add the following code:
import sys  # 1
from playwright.sync_api import sync_playwright


def run(playwright, url, take_screenshot):
    browser = playwright.chromium.launch()  # 2
    page = browser.new_page()
    page.goto(url)  # 3
    if take_screenshot:  # 4
        page.screenshot(path="screenshot.png", full_page=True)
        print("Screenshot saved as screenshot.png")
    browser.close()  # 5


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python main.py <url> [--screenshot]")
        sys.exit(1)
    url = sys.argv[1]
    take_screenshot = "--screenshot" in sys.argv
    with sync_playwright() as playwright:
        run(playwright, url, take_screenshot)
So, here are the key points:
1. Import all the necessary modules.
2. Launch a headless Chromium browser.
3. Navigate to the requested page.
4. Save a screenshot of the page as a PNG file if requested by the user. The full_page option is set to True to capture long pages that require scrolling.
5. Close the browser.
Enhancing the script to scrape webpage data
Now that we've got the basics down, let's enhance the script. If the screenshot option is disabled, we'll scrape the page title and all text within the <body> tag, then save this data to a file named after the page title.
Update your main.py file with the following code:
import sys
import re
from playwright.sync_api import sync_playwright


def run(playwright, url, take_screenshot):
    browser = playwright.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    if take_screenshot:
        __capture_screenshot(page)
    else:
        __save_page_text(page, "body")
    browser.close()


def __save_page_text(page, selector):
    title = page.title()  # 1
    content = page.query_selector(selector)  # 2
    text = (  # 3
        content.inner_text() if content else "No requested selector found"
    )
    filename = __safe_filename_from(title)  # 4
    with open(filename, "w", encoding="utf-8") as f:  # 5
        f.write(f"Title: {title}\n\n")
        f.write(text)
    print(f"Data saved as {filename}")


def __safe_filename_from(title):
    safe_title = re.sub(r"[^\w\s-]", "", title).strip().replace(" ", "_")
    return f"{safe_title}.txt"


def __capture_screenshot(page):
    page.screenshot(path="screenshot.png", full_page=True)
    print("Screenshot saved as screenshot.png")


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python main.py <url> [--screenshot]")
        sys.exit(1)
    url = sys.argv[1]
    take_screenshot = "--screenshot" in sys.argv
    with sync_playwright() as playwright:
        run(playwright, url, take_screenshot)
What we added:
1. We use the page.title() method to get the page title.
2. We use the page.query_selector() method to get the requested element.
3. We extract the element's text content using inner_text().
4. We use a regex to create a safe filename from the title by removing any invalid characters and replacing spaces with underscores.
5. We write the page title and the extracted text to a file named after the page title.
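To see what the filename regex does in practice, here is a small standalone version of the same substitution applied to two made-up page titles. The function is renamed to safe_filename_from() only for this demo.

```python
import re


def safe_filename_from(title):
    # Drop everything except word characters, whitespace, and hyphens,
    # then trim the ends and turn each space into an underscore.
    safe_title = re.sub(r"[^\w\s-]", "", title).strip().replace(" ", "_")
    return f"{safe_title}.txt"


print(safe_filename_from("Hello, World!"))                     # Hello_World.txt
print(safe_filename_from("Pricing & Plans | Example: 2024!"))  # Pricing__Plans__Example_2024.txt
```

Note the double underscores in the second result: removing "&" and "|" leaves two adjacent spaces behind, and each one becomes an underscore.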
Running your script
To run the script, use the following command:
poetry run python3 main.py <url> [--screenshot]
Replace <url> with the webpage you want to open. If you want to take a screenshot, add the --screenshot flag. For large-scale screenshot operations, consider using a service like ScrapingBee's dedicated screenshot API. This can simplify the process by handling the complexities of web rendering and scaling.
This is a simple start, but it sets the foundation for more complex scraping tasks. Next, we'll look at how to scrape multiple pages.
Scraping multiple web pages
In this section, we're going to see how to easily scrape all web pages of a site.
Finding all website pages
First of all, we need to fetch all the URLs that a website contains. There are a few methods to find all website pages, but we're going to simply scan the sitemap and prepare a CSV file with all the links we find.
Create a new file find_links.py and paste the following code inside:
import requests
from bs4 import BeautifulSoup as Soup
import os
import csv

# Constants for the attributes to be extracted from the sitemap.
ATTRS = ["loc", "lastmod", "priority"]


def parse_sitemap(url, csv_filename="urls.csv"):
    """Parse the sitemap at the given URL and append the data to a CSV file."""
    # Return False if the URL is not provided.
    if not url:
        return False

    # Attempt to get the content from the URL.
    response = requests.get(url)
    # Return False if the response status code is not 200 (OK).
    if response.status_code != 200:
        return False

    # Parse the XML content of the response.
    soup = Soup(response.content, "xml")

    # Recursively parse nested sitemaps.
    for sitemap in soup.find_all("sitemap"):
        loc = sitemap.find("loc").text
        parse_sitemap(loc, csv_filename)

    # Define the root directory for saving the CSV file.
    root = os.path.dirname(os.path.abspath(__file__))

    # Find all URL entries in the sitemap.
    url_entries = soup.find_all("url")
    rows = []
    for url_entry in url_entries:
        row = []
        for attr in ATTRS:
            found_attr = url_entry.find(attr)
            # Use "n/a" if the attribute is not found, otherwise get its text.
            row.append(found_attr.text if found_attr else "n/a")
        rows.append(row)

    # Check if the file already exists
    file_exists = os.path.isfile(os.path.join(root, csv_filename))

    # Append the data to the CSV file (UTF-8, matching how main.py reads it).
    with open(os.path.join(root, csv_filename), "a+", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        # Write the header only if the file doesn't exist
        if not file_exists:
            writer.writerow(ATTRS)
        writer.writerows(rows)


parse_sitemap("https://example.com/sitemap.xml")
This script requires BeautifulSoup and Requests, so add these dependencies to the pyproject.toml file:
requests = "^2.31"
beautifulsoup4 = "^4.12"
lxml = "^5.1" # provides the "xml" parser that BeautifulSoup uses in this script
Don't forget to install them with poetry install. Alternatively, if you need to take advantage of proxying, you can use the ScrapingBee client instead:
scrapingbee = "^2.0"
Then grab your free trial, copy-paste the API key from your personal profile, and initialize the client in the following way:
from scrapingbee import ScrapingBeeClient
client = ScrapingBeeClient(
api_key="YOUR_API_KEY"
)
Then use the client in the same way as before (of course, you can further configure it by introducing proxying and taking advantage of other goodies).
Processing every website page
Now that we've collected all the website pages, we need to process them one by one and either take a screenshot or save the text of the page's main element:
import sys
import re
import csv
from playwright.sync_api import sync_playwright


def run(playwright, urls, take_screenshot):
    browser = playwright.chromium.launch()
    page = browser.new_page()
    for url in urls:
        page.goto(url)
        if take_screenshot:
            __capture_screenshot(page, url)
        else:
            __save_page_text(page, "main", url)
    browser.close()


def __save_page_text(page, selector, url):
    title = page.title()
    main_content = page.query_selector(selector)
    main_text = (
        main_content.inner_text() if main_content else "No requested selector found"
    )
    filename = __safe_filename_from(title)
    with open(filename, "w", encoding="utf-8") as f:
        f.write(f"URL: {url}\n")
        f.write(f"Title: {title}\n\n")
        f.write(main_text)
    print(f"Data saved as {filename}")


def __safe_filename_from(title, extension="txt"):
    safe_title = re.sub(r"[^\w\s-]", "", title).strip().replace(" ", "_")
    return f"{safe_title}.{extension}"


def __capture_screenshot(page, url):
    # Pass the extension explicitly so the file isn't named "<title>.txt.png"
    filename = __safe_filename_from(page.title(), "png")
    page.screenshot(path=filename, full_page=True)
    print(f"Screenshot saved as {filename}")


def __read_urls_from_csv(file_path):
    urls = []
    with open(file_path, newline="", encoding="utf-8") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            urls.append(row["loc"])
    return urls


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python main.py <csv_file> [--screenshot]")  # 1
        sys.exit(1)
    csv_file = sys.argv[1]
    take_screenshot = "--screenshot" in sys.argv
    urls = __read_urls_from_csv(csv_file)  # 2
    with sync_playwright() as playwright:
        run(playwright, urls, take_screenshot)  # 3
Key points:
1. The script usage has changed. Instead of passing the website URL as the first argument, the user should now provide the path to the CSV file containing URLs.
2. We read the provided CSV and construct a list of links to visit.
3. Then we simply process those URLs in a loop and perform the same actions as before.
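To make the CSV step concrete, here is a small offline sketch that reads the loc column from an in-memory CSV with the same three columns the sitemap script writes. The sample rows and the read_urls() helper are made up for the demo.

```python
import csv
import io

# A tiny stand-in for urls.csv with the loc, lastmod, and priority columns.
SAMPLE_CSV = (
    "loc,lastmod,priority\n"
    "https://example.com/,2024-01-01,1.0\n"
    "https://example.com/about,n/a,0.5\n"
)


def read_urls(csvfile):
    """Return the value of the loc column for every row."""
    reader = csv.DictReader(csvfile)  # the first row supplies the column names
    return [row["loc"] for row in reader]


urls = read_urls(io.StringIO(SAMPLE_CSV))
print(urls)  # ['https://example.com/', 'https://example.com/about']
```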
Great job!
Making it parallel
Processing URLs sequentially is far from optimal if you have many links to visit. It therefore makes sense to send requests in parallel, which is very easy with Python. However, make sure not to send too many requests at once, as you might get blocked. For example, let's send 5 parallel requests:
import sys
import re
import csv
from concurrent.futures import ThreadPoolExecutor
from playwright.sync_api import sync_playwright


def run(playwright, url, take_screenshot):
    browser = playwright.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    if take_screenshot:
        __capture_screenshot(page, url)
    else:
        __save_page_text(page, "main", url)
    browser.close()


def process_url(url, take_screenshot):
    # Each call creates its own Playwright instance
    with sync_playwright() as playwright:
        run(playwright, url, take_screenshot)


def __save_page_text(page, selector, url):
    title = page.title()
    main_content = page.query_selector(selector)
    main_text = (
        main_content.inner_text() if main_content else "No requested selector found"
    )
    filename = __safe_filename_from(title)
    with open(filename, "w", encoding="utf-8") as f:
        f.write(f"URL: {url}\n")
        f.write(f"Title: {title}\n\n")
        f.write(main_text)
    print(f"Data saved as {filename}")


def __safe_filename_from(title, extension="txt"):
    safe_title = re.sub(r"[^\w\s-]", "", title).strip().replace(" ", "_")
    return f"{safe_title}.{extension}"


def __capture_screenshot(page, url):
    # Pass the extension explicitly so the file isn't named "<title>.txt.png"
    filename = __safe_filename_from(page.title(), "png")
    page.screenshot(path=filename, full_page=True)
    print(f"Screenshot saved as {filename}")


def read_urls_from_csv(file_path):
    urls = []
    with open(file_path, newline="", encoding="utf-8") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            urls.append(row["loc"])
    return urls


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python main.py <csv_file> [--screenshot]")
        sys.exit(1)
    csv_file = sys.argv[1]
    take_screenshot = "--screenshot" in sys.argv
    urls = read_urls_from_csv(csv_file)
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(process_url, url, take_screenshot) for url in urls]
        for future in futures:
            try:
                future.result()
            except Exception as e:
                print(f"Error processing URL: {e}")
This code is more involved because, unfortunately, Playwright is not thread-safe: objects created by one Playwright instance can't be shared between threads. You either have to give every thread its own instance or switch to asyncio. We're taking the first path, so process_url() starts a separate Playwright instance and browser for each URL. Also, we limit the pool to 5 worker threads to avoid spamming the website with requests. Other than that, process_url() performs the same actions as before.
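For comparison, the asyncio route might look roughly like the sketch below. It is an assumption-laden outline rather than part of the tutorial's script: fetch_title() and run_limited() are hypothetical helpers, and a semaphore plays the role that max_workers plays in the thread pool.

```python
import asyncio


async def fetch_title(url):
    """Open the URL in headless Chromium and return the page title."""
    # Imported here so the concurrency helper below works without Playwright.
    from playwright.async_api import async_playwright

    async with async_playwright() as playwright:
        browser = await playwright.chromium.launch()
        page = await browser.new_page()
        await page.goto(url)
        title = await page.title()
        await browser.close()
        return title


async def run_limited(worker, urls, limit=5):
    """Run worker(url) for every URL with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(url):
        async with semaphore:
            return await worker(url)

    # return_exceptions=True keeps one failed page from cancelling the rest.
    return await asyncio.gather(
        *(guarded(url) for url in urls), return_exceptions=True
    )


# Usage (requires Playwright and network access):
# titles = asyncio.run(run_limited(fetch_title, ["https://example.com"]))
```

Results come back in the same order as the input URLs, and a failed page shows up as an exception object in the list instead of stopping the whole run.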
Adding retry mechanism
Sometimes, the server might have a bad day and not respond properly when you request a page. To handle this, let's add a simple retry mechanism using exponential backoff.
So, what's exponential backoff? It's a way to manage retries by waiting a bit longer each time a request fails. If the first retry waits for 1 second, the next will wait for 2 seconds, then 4 seconds, and so on. This way, we don't hammer the server with requests too quickly and give it time to recover.
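The waiting times are easy to work out by hand. This small sketch uses the formula backoff_factor * (2 ** attempt), the same one this tutorial's retry code relies on; the backoff_delays() helper itself is just for illustration.

```python
def backoff_delays(retries, backoff_factor=1):
    """Return the wait in seconds before each retry: factor * 2^attempt."""
    return [backoff_factor * (2 ** attempt) for attempt in range(retries)]


print(backoff_delays(3))                      # [1, 2, 4]
print(backoff_delays(2, backoff_factor=0.5))  # [0.5, 1.0]
```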
We'll try each request up to 2 more times before giving up. You can do this with some third-party tools or just code it yourself:
import sys
import re
import csv
import time
from concurrent.futures import ThreadPoolExecutor
from playwright.sync_api import sync_playwright


def run(playwright, url, take_screenshot):
    browser = playwright.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    if take_screenshot:
        __capture_screenshot(page, url)
    else:
        __save_page_text(page, "main", url)
    browser.close()


def process_url(url, take_screenshot, retries=2, backoff_factor=1):
    for attempt in range(retries + 1):
        try:
            # Each call creates its own Playwright instance
            with sync_playwright() as playwright:
                run(playwright, url, take_screenshot)
            break
        except Exception as e:
            if attempt < retries:
                sleep_time = backoff_factor * (2 ** attempt)
                print(f"Error processing URL {url}, retrying in {sleep_time} seconds... ({attempt + 1}/{retries})")
                time.sleep(sleep_time)  # Exponential backoff
            else:
                print(f"Failed to process URL {url} after {retries + 1} attempts: {e}")


def __save_page_text(page, selector, url):
    title = page.title()
    main_content = page.query_selector(selector)
    main_text = (
        main_content.inner_text() if main_content else "No requested selector found"
    )
    filename = __safe_filename_from(title)
    with open(filename, "w", encoding="utf-8") as f:
        f.write(f"URL: {url}\n")
        f.write(f"Title: {title}\n\n")
        f.write(main_text)
    print(f"Data saved as {filename}")


def __safe_filename_from(title, extension="txt"):
    safe_title = re.sub(r"[^\w\s-]", "", title).strip().replace(" ", "_")
    return f"{safe_title}.{extension}"


def __capture_screenshot(page, url):
    # Pass the extension explicitly so the file isn't named "<title>.txt.png"
    filename = __safe_filename_from(page.title(), "png")
    page.screenshot(path=filename, full_page=True)
    print(f"Screenshot saved as {filename}")


def read_urls_from_csv(file_path):
    urls = []
    with open(file_path, newline="", encoding="utf-8") as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            urls.append(row["loc"])
    return urls


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python main.py <csv_file> [--screenshot]")
        sys.exit(1)
    csv_file = sys.argv[1]
    take_screenshot = "--screenshot" in sys.argv
    urls = read_urls_from_csv(csv_file)
    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(process_url, url, take_screenshot) for url in urls]
        for future in futures:
            try:
                future.result()
            except Exception as e:
                print(f"Error processing URL: {e}")
The main change is in the process_url() function, where we track the number of attempts and re-run the request if it fails.
Handling dynamic content
Web scraping often involves dealing with dynamic content that isn't immediately available when the page loads. With Playwright, you can handle these scenarios by waiting for specific elements to appear and interacting with them. Let's see how to do this with a small demo.
Waiting for elements
Sometimes, you need to wait for certain elements to load before you can scrape them. Playwright provides several ways to wait for elements. Here's an example:
from playwright.sync_api import sync_playwright


def run(playwright):
    browser = playwright.chromium.launch(headless=False)
    page = browser.new_page()

    # Wait for a network response. The listener is registered before the
    # navigation starts, so the response can't be missed.
    with page.expect_response(
        lambda response: response.url == "https://example.com/api/data"
        and response.status == 200
    ):
        page.goto("https://example.com")
    print("Data loaded")

    # Wait for a specific element to appear
    page.wait_for_selector("h1")
    print("Page title:", page.title())

    browser.close()


with sync_playwright() as playwright:
    run(playwright)
In this example, we wait for an h1 element to appear and for a network request to complete.
Interacting with elements
Playwright allows you to interact with elements, such as clicking buttons, filling out forms, and more. Here's an example that demonstrates how to interact with elements:
from playwright.sync_api import sync_playwright


def run(playwright):
    browser = playwright.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com/login")

    # Fill out a login form
    page.fill("#username", "myusername")
    page.fill("#password", "mypassword")
    page.click("button[type=submit]")

    # Wait for navigation after login (assumes a successful login
    # redirects away from the /login URL)
    page.wait_for_url(lambda url: "/login" not in url)
    print("Logged in and navigated to:", page.url)  # url is a property

    browser.close()


with sync_playwright() as playwright:
    run(playwright)
In this example, we navigate to a login page, fill out the username and password fields, click the login button, and wait for the page to navigate to the next page.
These techniques allow you to handle dynamic content effectively, ensuring that your scraper can interact with pages just like a real user would.
Handling JavaScript
Many modern websites use JavaScript to load content dynamically, which can make scraping a bit tricky. Playwright can help you handle these scenarios by waiting for JavaScript to execute and interacting with the resulting content. Let's look at how to handle JavaScript on web pages.
Waiting for JavaScript execution
To wait for JavaScript to load dynamic content, you can use Playwright's waiting functions. Here's an example of how to wait for network requests that load data:
from playwright.sync_api import sync_playwright


def run(playwright):
    browser = playwright.chromium.launch(headless=False)
    page = browser.new_page()

    # Wait for a specific network response triggered by the page load
    with page.expect_response(
        lambda response: "api/data" in response.url and response.status == 200
    ):
        page.goto("https://example.com")
    print("Data loaded")

    # Extract content after JavaScript execution
    content = page.content()
    print("Page content:", content)

    browser.close()


with sync_playwright() as playwright:
    run(playwright)
In this example, we wait for a network response that contains the data we need, ensuring that JavaScript has executed and the content is available.
Interacting with JavaScript-rendered content
Sometimes, you need to interact with elements that are rendered by JavaScript. Here's an example of how to do that:
from playwright.sync_api import sync_playwright


def run(playwright):
    browser = playwright.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com")

    # Wait for a specific element to appear after JavaScript execution
    page.wait_for_selector("#dynamic-content")
    print("Dynamic content loaded")

    # Interact with the JavaScript-rendered content
    dynamic_text = page.inner_text("#dynamic-content")
    print("Dynamic content text:", dynamic_text)

    browser.close()


with sync_playwright() as playwright:
    run(playwright)
In this example, we take advantage of the wait_for_selector() function that has already been discussed. Specifically, we wait for a dynamically loaded element to appear and then extract its text content.
Handling infinite scroll
Many websites use infinite scroll to load content as you scroll down the page. Playwright can help you handle this by simulating scroll actions and waiting for new content to load:
from playwright.sync_api import sync_playwright


def run(playwright):
    browser = playwright.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com/infinite-scroll")

    # Scroll and wait for new content to load
    for _ in range(5):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(2000)  # Wait for 2 seconds to load new content

    # Extract content after scrolling
    content = page.content()
    print("Page content after scrolling:", content)

    browser.close()


with sync_playwright() as playwright:
    run(playwright)
In this example, we simulate scrolling down the page multiple times, waiting for new content to load each time.
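A fixed number of scrolls can stop too early or keep going after the content has run out. One possible refinement, sketched here under the assumption that the page grows taller whenever new items load, is to stop as soon as the page height no longer changes. The scroll_to_bottom() helper and its parameters are hypothetical.

```python
def scroll_to_bottom(page, max_scrolls=20, pause_ms=1000):
    """Scroll until the page height stops growing or max_scrolls is reached.

    Returns the number of scrolls performed.
    """
    previous_height = page.evaluate("document.body.scrollHeight")
    for scroll_count in range(1, max_scrolls + 1):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        page.wait_for_timeout(pause_ms)  # give new content time to load
        current_height = page.evaluate("document.body.scrollHeight")
        if current_height == previous_height:
            return scroll_count  # nothing new was loaded
        previous_height = current_height
    return max_scrolls
```

You could call scroll_to_bottom(page) right after page.goto() and then read page.content() as before.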
Conclusion
And there you have it! We started with the basics of using Playwright to scrape a single webpage and take screenshots. Then, we made it more complex by scraping multiple pages from a sitemap. Finally, we made our script more robust by adding a retry mechanism with exponential backoff to handle those occasional server hiccups.
By now, you should have a solid understanding of how to use Playwright for web scraping, handle multiple URLs, and make your script more reliable with retries.
Hopefully you found this article interesting and useful. As always, thank you for staying with me and happy scraping!
Ilya is an IT tutor and author, web developer, and ex-Microsoft/Cisco specialist. His primary programming languages are Ruby, JavaScript, Python, and Elixir. He enjoys coding, teaching people and learning new things. In his free time he writes educational posts, participates in OpenSource projects, tweets, goes in for sports and plays music.