Web Scraping using Selenium and Python

Kevin Sahin

Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.

In the last tutorial, we saw how to leverage the Scrapy framework to solve lots of common web scraping problems. Today we are going to take a look at Selenium (with Python ❤️) with a step-by-step tutorial.

Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python.

The Selenium API uses the WebDriver protocol to control a web browser, like Chrome, Firefox or Safari. The browser can run either locally or remotely.

At the beginning of the project (almost 20 years ago!) it was mostly used for cross-browser end-to-end testing (acceptance tests).

Now it is still used for testing, but also as a general browser automation platform and of course, web scraping!

Selenium is really useful when you have to perform actions on a website such as:

  • clicking on buttons
  • filling forms
  • scrolling
  • taking a screenshot

It is also very useful for executing Javascript code. Let's say that you want to scrape a Single Page Application, and you can't find an easy way to call the underlying APIs directly; then Selenium might be what you need.

Installation

We will use Chrome in our example, so make sure you have it installed on your local machine, along with the matching version of ChromeDriver.

In order to install the Selenium package, as always, I recommend that you create a virtual environment, using virtualenv for example, and then:

pip install selenium

Quickstart

Once you have downloaded both Chrome and Chromedriver, and installed the selenium package you should be ready to start the browser:

from selenium import webdriver

DRIVER_PATH = '/path/to/chromedriver'
driver = webdriver.Chrome(executable_path=DRIVER_PATH)
driver.get('https://google.com')

This will launch Chrome in headful mode (like a regular Chrome, which is controlled by your Python code). You should see a message stating that the browser is controlled by automated software.

To run Chrome in headless mode (without any graphical user interface), on a server for example, set the headless option:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options, executable_path=DRIVER_PATH)
driver.get("https://www.nintendo.com/")
print(driver.page_source)
driver.quit()

The driver.page_source property will return the full HTML code of the page.

Here are two other interesting webdriver properties (both shown in the short snippet after this list):

  • driver.title to get the page's title
  • driver.current_url to get the current URL (useful when there are redirections on the website and you need the final URL)
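
For example, reusing the driver from the Quickstart snippet above:

print(driver.title)        # the page's <title>
print(driver.current_url)  # the final URL, after any redirects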

Locating elements

Locating data on a website is one of the main use cases for Selenium, either for a test suite (making sure that a specific element is present/absent on the page) or to extract the data and save it for further analysis (web scraping).

There are many methods available in the Selenium API to select elements on the page. You can use:

  • Tag name
  • Class name
  • IDs
  • XPath
  • CSS selectors

We recently published an article explaining XPath, don't hesitate to take a look if you aren't familiar with XPath.

As usual, the easiest way to locate an element is to open your Chrome dev tools and inspect the element that you need. A cool shortcut for this is to highlight the element you want with your mouse and then press Ctrl + Shift + C (Cmd + Shift + C on macOS), instead of having to right-click and choose Inspect each time:

Document Object Model

find_element

There are many ways to locate an element in Selenium. Let's say that we want to locate the h1 tag in this HTML:

<html>
    <head>
        ... some stuff
    </head>
    <body>
        <h1 class="someclass" id="greatID">Super title</h1>
    </body>
</html>

h1 = driver.find_element_by_tag_name('h1')
h1 = driver.find_element_by_class_name('someclass')
h1 = driver.find_element_by_xpath('//h1')
h1 = driver.find_element_by_id('greatID')

All these methods also have a find_elements_by_* counterpart (note the plural) that returns a list of matching elements.

For example, to get all anchors on a page:

all_links = driver.find_elements_by_tag_name('a')
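
Each item in that list is a WebElement (covered just below), so you could, for instance, collect every link URL with something along these lines:

# href of each anchor found above
urls = [link.get_attribute('href') for link in all_links]
print(urls)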

Some elements aren't easily accessible with an ID or a simple class, and that's when you need an XPath expression. You also might have multiple elements with the same class (an ID is supposed to be unique).

XPath is my favorite way of locating elements on a web page. It's very powerful to extract any element on a page, based on its absolute position in the DOM, or relative to another element.
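
As a quick illustration, here are a few equivalent ways to reach the h1 from the HTML snippet above with XPath (the class name comes from that snippet):

# absolute path from the document root
h1 = driver.find_element_by_xpath('/html/body/h1')

# anywhere in the document
h1 = driver.find_element_by_xpath('//h1')

# based on an attribute of the element itself
h1 = driver.find_element_by_xpath("//h1[@class='someclass']")

# relative to another element (here, its parent <body>)
h1 = driver.find_element_by_xpath('//body/h1')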

WebElement

A WebElement is a Selenium object representing an HTML element.

There are many actions that you can perform on those elements, here are the most useful:

  • Accessing the text of the element with the property element.text
  • Clicking on the element with element.click()
  • Accessing an attribute with element.get_attribute('class')
  • Sending text to an input: element.send_keys('mypassword')

There are some other interesting methods, like is_displayed(), which returns True if an element is visible to the user.

This can be useful to avoid honeypots (like filling hidden inputs).
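
Here is a short sketch combining these, reusing the h1 from the earlier example and a hypothetical username input:

# reusing the <h1 id="greatID"> from the earlier example
h1 = driver.find_element_by_id('greatID')
print(h1.text)                    # "Super title"
print(h1.get_attribute('class'))  # "someclass"

# only type into the input if it is actually visible (avoid honeypots)
username_input = driver.find_element_by_id('username')  # hypothetical id
if username_input.is_displayed():
    username_input.send_keys('myusername')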

Full example

Here is a full example using the different Selenium API methods we just saw.

We are going to log into Hacker News:

Hacker News login page

In our example, authenticating to Hacker News is not really useful in itself, but you could imagine creating a bot that automatically posts a link to your latest blog post, for example.

In order to authenticate we need to:

  • Go to the login page using driver.get()
  • Select the username input using driver.find_element_by_* and then element.send_keys() to send text to the input
  • Same process with the password input
  • Click on the login button using element.click()

Should be easy right? Let's see the code:

driver.get("https://news.ycombinator.com/login")

login = driver.find_element_by_xpath("//input").send_keys(USERNAME)
password = driver.find_element_by_xpath("//input[@type='password']").send_keys(PASSWORD)
submit = driver.find_element_by_xpath("//input[@value='login']").click()

Easy right? Now there is one important thing missing here: how do we know if we are logged in?

We could do several things:

  • Check for an error message (like “Wrong password”)
  • Check for one element on the page that is only displayed once logged in.

We're going to check for the logout button. The logout button has the id “logout”, easy!

We can't just check if the element is None, because all the find_element_by_* methods raise an exception if the element is not found in the DOM. So we have to use a try/except block and catch the NoSuchElementException exception:

from selenium.common.exceptions import NoSuchElementException

try:
    logout_button = driver.find_element_by_id("logout")
    print('Successfully logged in')
except NoSuchElementException:
    print('Incorrect login/password')

Taking a screenshot

We could easily take a screenshot using:

driver.save_screenshot('screenshot.png')

Hacker News login page

Note that lots of things can go wrong when you take a screenshot with Selenium. First, you have to make sure that the window size is set correctly. Then you need to make sure that every asynchronous HTTP call made by the frontend Javascript code has finished, and that the page is fully rendered.
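
For the window size, you can either pass --window-size as shown in the headless options above, or set it on a running driver, roughly like this:

# make sure the viewport is large enough before capturing
driver.set_window_size(1920, 1200)
driver.save_screenshot('screenshot.png')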

In our Hacker News case it's really simple and we don't have to worry about these issues.

Waiting for an element to be present

Dealing with a website that uses lots of Javascript to render its content can be tricky. These days, more and more sites are using frameworks like Angular, React and Vue.js for their frontend. These frontend frameworks are complicated to deal with because they fire lots of AJAX calls.

If we had to worry about one (or many) asynchronous HTTP calls to an API, there are two ways to solve this:

  • Use a time.sleep(ARBITRARY_TIME) before taking the screenshot.
  • Use a WebDriverWait object.

If you use time.sleep(), you will probably use an arbitrary value. The problem is that you're either waiting too long or not long enough. Also, the website can load slowly on your local WiFi connection, but will be 10 times faster on your cloud server. With the WebDriverWait method, you will wait the exact amount of time necessary for your element / data to be loaded.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.ID, "mySuperId"))
    )
finally:
    driver.quit()

This will wait up to 5 seconds for an element located by the id “mySuperId” to appear in the DOM. There are many other interesting expected conditions like:

  • element_to_be_clickable
  • text_to_be_present_in_element
  • visibility_of_element_located

You can find more information about this in the Selenium documentation.
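
For example, element_to_be_clickable pairs naturally with click(); a minimal sketch, assuming a button with the (hypothetical) id "mySuperButton":

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 5 seconds for the button to become clickable, then click it
button = WebDriverWait(driver, 5).until(
    EC.element_to_be_clickable((By.ID, "mySuperButton"))
)
button.click()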

Executing Javascript

Sometimes, you may need to execute some Javascript on the page. For example, let's say you want to take a screenshot of some information, but you first need to scroll a little bit to see it. You can easily do this with Selenium:

javaScript = "window.scrollBy(0,1000);"
driver.execute_script(javaScript)
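
execute_script can also return a value to your Python code, which can be handy for reading something that isn't exposed through the regular API, for example:

# total height of the page as computed by the browser
scroll_height = driver.execute_script("return document.body.scrollHeight")
print(scroll_height)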

Conclusion

I hope you enjoyed this blog post! You should now have a good understanding of how the Selenium API works in Python. If you want to know more about the different ways to scrape the web with Python don't hesitate to take a look at our general python web scraping guide.

Selenium is often necessary to extract data from websites that use lots of Javascript. The problem is that running lots of Selenium/Headless Chrome instances at scale is hard, and this is one of the things we solve with ScrapingBee, our web scraping API.

Selenium is also an excellent tool to automate almost anything on the web.

If you perform repetitive tasks, like filling forms or checking information behind a login form where the website doesn't have an API, then it might be a good idea to automate them with Selenium, just don't forget this xkcd:

xkcd comic

Tired of getting blocked while scraping the web? Our API handles headless browsers and rotates proxies for you.