Scraping single page applications with Python.
These frontend frameworks are complicated to deal with because they often use the newest features of the HTML5 API.
Headless Chrome with Python
You will need to install the selenium package:
pip install selenium
You also need the Chrome browser and ChromeDriver installed on your system.
On macOS, you can simply use brew:
brew install chromedriver
Taking a screenshot
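# chrome.py
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window
options.add_argument("--window-size=1920,1200")

# Selenium 4 takes the driver path through a Service object
service = Service(executable_path='/usr/local/bin/chromedriver')
driver = webdriver.Chrome(service=service, options=options)
driver.get("https://www.nintendo.com/")
driver.save_screenshot('screenshot.png')
driver.quit()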
The code is straightforward. I only added the --window-size parameter because the default size was too small.
You should now have a screenshot of Nintendo's home page saved as screenshot.png.
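To check that the --window-size setting had the intended effect, one option is to read the image dimensions straight from the PNG header. The sketch below uses only the standard library. The helper name png_dimensions is hypothetical, and the sample header is built in memory for illustration rather than taken from a real screenshot. Depending on the Chrome version, the screenshot may not match the requested window size exactly.

```python
import struct

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data):
    """Return (width, height) read from the IHDR chunk of PNG bytes.

    Only the first 24 bytes of the file are needed.
    """
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # Width and height are big-endian unsigned 32-bit integers.
    width, height = struct.unpack(">II", data[16:24])
    return width, height


# Illustrative header only: signature, IHDR chunk length (13), chunk type,
# then a width of 1920 and a height of 1200.
header = (
    PNG_SIGNATURE
    + struct.pack(">I", 13)
    + b"IHDR"
    + struct.pack(">II", 1920, 1200)
)
print(png_dimensions(header))  # (1920, 1200)
```

For a real file, read the first 24 bytes of screenshot.png in binary mode and pass them to the helper.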
Waiting for the page load
Most of the time, a page triggers lots of AJAX calls, and you will have to wait for these calls to complete to get the fully rendered page.
A simple solution is to call time.sleep() for an arbitrary amount of time. The problem with this method is that you end up waiting either too long or not long enough, depending on your latency and internet connection speed.
The other solution is to use the WebDriverWait object from the Selenium API:
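from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

delay = 10  # maximum wait in seconds (example value, adjust as needed)
try:
    elem = WebDriverWait(driver, delay).until(
        EC.presence_of_element_located((By.NAME, 'chart')))
    print("Page is ready!")
except TimeoutException:
    print("Timeout")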
This is a better solution because it returns as soon as the element is present in the DOM, and raises a TimeoutException only if that takes longer than delay seconds.
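To make the difference with a fixed sleep concrete, here is a minimal sketch of the polling idea behind an explicit wait, written in plain Python so it runs without a browser. It is not Selenium's implementation: wait_until and page_is_ready are hypothetical names, the 0.2 second "load time" is simulated, and the sketch raises Python's built-in TimeoutError rather than Selenium's TimeoutException.

```python
import time


def wait_until(condition, timeout, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Hypothetical stand-in for a page that finishes loading after about 0.2 seconds.
ready_at = time.monotonic() + 0.2


def page_is_ready():
    return time.monotonic() >= ready_at


start = time.monotonic()
wait_until(page_is_ready, timeout=5, poll_interval=0.05)
elapsed = time.monotonic() - start
print(f"Ready after {elapsed:.2f}s instead of a fixed 5s sleep")
```

The loop stops at the first successful check, so the time spent waiting tracks the real load time (plus at most one polling interval) instead of a worst-case guess.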
As you can see, setting up Chrome in headless mode is really easy in Python. The most challenging part is to manage it in production. If you scrape lots of different websites, the resource usage will be volatile.
This is one of the reasons we started ScrapingBee, a web scraping API, so that developers can focus on extracting the data they want, not managing headless browsers and proxies!
I recently wrote "A guide to Web scraping without getting blocked", do not hesitate to check it out.
If you want to know more about the Python web scraping ecosystem, don't hesitate to look at our Python web scraping tutorial.
And here is a recent article about the best web scraping tools on the market.
Happy Web Scraping :)
Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.