How to execute JavaScript with Scrapy?

Ari Bajo

Ari is an expert Data Engineer and a talented technical writer. He wrote the entire Scrapy integration for ScrapingBee and this awesome article.


Most modern websites use a client-side JavaScript framework such as React, Vue or Angular. Scraping data from a dynamic website without server-side rendering often requires executing JavaScript code.

I’ve scraped hundreds of sites, and I always use Scrapy. Scrapy is a popular Python web scraping framework. Compared to other Python scraping libraries, such as Beautiful Soup, Scrapy forces you to structure your code based on some best practices. In exchange, Scrapy takes care of concurrency, stats collection, caching, retry logic and many other features.

If you're new to Scrapy, you should probably begin by reading this great tutorial that will teach you all the basics of Scrapy.

In this article, I compare the most popular solutions for executing JavaScript with Scrapy, explain how to scale headless browsers, and introduce an open-source integration with the ScrapingBee API for JavaScript support and proxy rotation.

Scraping dynamic websites with Scrapy

Scraping client-side rendered websites with Scrapy used to be painful. I’ve often found myself inspecting API requests in the browser network tools and extracting data from JavaScript variables. While these hacks may work on some websites, I find the code harder to understand and maintain than traditional XPath selectors. But to scrape client-side data directly from the HTML, you first need to execute the JavaScript code.
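
To illustrate the second hack, extracting data from a JavaScript variable usually means matching the raw page source with a regular expression and parsing the capture as JSON. This is a toy sketch: the `window.__INITIAL_STATE__` name and the HTML are made up for illustration.

```python
import json
import re

# hypothetical page source with state embedded in a JavaScript variable
html = '<script>window.__INITIAL_STATE__ = {"product": {"price": 42}};</script>'

# capture the object literal assigned to the variable and parse it as JSON
match = re.search(r'window\.__INITIAL_STATE__\s*=\s*(\{.*\});', html)
state = json.loads(match.group(1))
print(state['product']['price'])  # → 42
```

It works until the site renames the variable or embeds something that isn't valid JSON, which is exactly why this kind of code is fragile.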

Scrapy middlewares for headless browsers

A headless browser is a web browser without a graphical user interface. I’ve used three libraries to execute JavaScript with Scrapy: scrapy-selenium, scrapy-splash and scrapy-scrapingbee.

All three libraries are integrated as a Scrapy downloader middleware. Once configured in your project settings, instead of yielding a normal Scrapy Request from your spiders, you yield a SeleniumRequest, SplashRequest or ScrapingBeeRequest.

Executing JavaScript in Scrapy with Selenium

Locally, you can interact with a headless browser from Scrapy using the scrapy-selenium middleware. Selenium is a framework for interacting with browsers, commonly used for testing applications, web scraping and taking screenshots.
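
Like the other middlewares covered in this article, it can be installed with pip (assuming the scrapy-selenium package name on PyPI):

```shell
pip install scrapy-selenium
```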

Selenium needs a web driver to interact with a browser. For example, Firefox requires you to install geckodriver. You can then configure Selenium on your Scrapy project settings.


from shutil import which

SELENIUM_DRIVER_NAME = 'firefox'
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
SELENIUM_DRIVER_ARGUMENTS = ['-headless']

DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800
}

In your spiders, you can then yield a SeleniumRequest.

from scrapy_selenium import SeleniumRequest

yield SeleniumRequest(url, callback=self.parse)

Selenium allows you to interact with the browser in Python and JavaScript. The driver object is accessible from the Scrapy response. Sometimes it can be useful to inspect the HTML code after you click on a button. Locally, you can set up a breakpoint with an ipdb debugger to inspect the HTML response.

def parse(self, response):
    driver = response.request.meta['driver']
    driver.find_element_by_id('show-price').click()

    # pause here to inspect the rendered HTML after the click
    import ipdb; ipdb.set_trace()
    print(driver.page_source)

Otherwise, Scrapy's XPath and CSS selectors are accessible from the response object to select data from the HTML.

def parse(self, response):
    title = response.selector.xpath(
        '//title/text()'
    ).extract_first()

SeleniumRequest takes some additional arguments such as wait_time to wait before returning the response, wait_until to wait for an HTML element, screenshot to take a screenshot and script for executing a custom JavaScript script.
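
As a sketch, combining those arguments might look like the snippet below. It assumes wait_until takes a Selenium expected condition, as in the scrapy-selenium documentation; the show-price element id is hypothetical.

```python
from scrapy_selenium import SeleniumRequest
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

yield SeleniumRequest(
    url=url,
    callback=self.parse,
    wait_time=10,  # maximum seconds to wait for the condition below
    wait_until=EC.element_to_be_clickable((By.ID, 'show-price')),
    screenshot=True,  # PNG bytes become available in response.meta['screenshot']
    script='window.scrollTo(0, document.body.scrollHeight);',
)
```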

In production, the main issue with scrapy-selenium is that there is no trivial way to set up a Selenium Grid and run multiple browser instances on remote machines. Next, I will compare two solutions for executing JavaScript with Scrapy at scale.

Executing JavaScript in Scrapy with Splash

Splash is a web browser as a service with an API. It’s maintained by Scrapinghub, the main contributor to Scrapy, and integrated with Scrapy through the scrapy-splash middleware. It can also be hosted by Scrapinghub.

Splash was created in 2013, before headless Chrome and other major headless browsers were released in 2017. Since then, other popular projects such as PhantomJS have been discontinued in favour of Firefox, Chrome and Safari headless browsers.

You can run an instance of Splash locally with Docker.

docker run -p 8050:8050 scrapinghub/splash

Configuring Splash middleware requires adding multiple middlewares and changing the default priority of HttpCompressionMiddleware in your project settings.

SPLASH_URL = 'http://localhost:8050'


DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}


SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}


DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Then you can yield a SplashRequest with optional arguments wait and lua_source.

from scrapy_splash import SplashRequest

yield SplashRequest(url, callback=self.parse, args={
    'wait': 0.5,
    'lua_source': script,
})
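
The lua_source argument expects a Splash Lua script. A minimal script using the standard Splash API (splash:go, splash:wait, splash:html) could look like this:

```python
# a minimal Splash Lua script: load the page, wait, return the rendered HTML
script = """
function main(splash, args)
    assert(splash:go(args.url))
    splash:wait(0.5)
    return splash:html()
end
"""
```

Anything beyond loading and waiting, such as clicking buttons or filling forms, also has to be written in Lua against the Splash API.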

Splash is a popular solution because it has been out for a long time, but it has two major issues: it uses a custom headless browser and requires coding in Lua to interact with a website. Because of those two issues, for my last scraping project, I decided to create a middleware for the ScrapingBee API.

Executing JavaScript in Scrapy with ScrapingBee

ScrapingBee is a web scraping API that handles headless browsers and proxies for you. ScrapingBee uses the latest headless Chrome version and supports JavaScript scripts.

Like the other two middlewares, you can simply install the scrapy-scrapingbee middleware with pip.

pip install scrapy-scrapingbee

First, you need to create a ScrapingBee account to get an API key. Then you can add the downloader middleware and set concurrency according to your ScrapingBee plan in your project settings.

SCRAPINGBEE_API_KEY = 'REPLACE-WITH-YOUR-API-KEY'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_scrapingbee.ScrapingBeeMiddleware': 725,
}

CONCURRENT_REQUESTS = 1

You can then inherit your spiders from ScrapingBeeSpider and yield a ScrapingBeeRequest.

from scrapy_scrapingbee import ScrapingBeeSpider, ScrapingBeeRequest

class HttpbinSpider(ScrapingBeeSpider):
    name = 'httpbin'
    start_urls = [
        'https://httpbin.org',
    ]


    def start_requests(self):
        for url in self.start_urls:
            yield ScrapingBeeRequest(url)


    def parse(self, response):
        title = response.css('title::text').get()
        yield {'title': title}

ScrapingBeeRequest takes an optional params argument to execute a js_snippet, set up a custom wait before the response is returned, or wait for a CSS or XPath selector in the HTML code with wait_for.

On some websites, the HTML is loaded asynchronously as you scroll through the page. You can use the JavaScript snippet below to scroll to the end of the page.

JS_SNIPPET = 'window.scrollTo(0, document.body.scrollHeight);'

yield ScrapingBeeRequest(url, params={
    'js_snippet': JS_SNIPPET,
    # 'wait': 3000,
    # 'wait_for': '#swagger-ui',
})

ScrapingBee has gathered other common JavaScript snippets to interact with a website on the ScrapingBee documentation.

Behind the scenes, the scrapy-scrapingbee middleware transforms the original request into a request forwarded to the ScrapingBee API and encodes each argument in the URL query string. The API endpoint is logged in your Scrapy logs and the api_key is hidden by the ScrapingBeeSpider.

2020-06-22 12:32:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://app.scrapingbee.com/api/v1/?api_key=HIDDEN&url=https://httpbin.org&js_snippet=d2luZG93LnNjcm9sbFRvKDAsIGRvY3VtZW50LmJvZHkuc2Nyb2xsSGVpZ2h0KTs=&wait=3000> (referer: None)
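
You can reproduce the shape of that logged URL in plain Python: the js_snippet is base64-encoded (matching the log line above) and the parameters are URL-encoded into the query string. This is an approximation of what the middleware does internally, not its actual code.

```python
import base64
from urllib.parse import urlencode

API_ENDPOINT = 'https://app.scrapingbee.com/api/v1/'
js_snippet = 'window.scrollTo(0, document.body.scrollHeight);'

# the middleware base64-encodes the JavaScript snippet before
# adding it to the query string
encoded_snippet = base64.b64encode(js_snippet.encode()).decode()

params = {
    'api_key': 'REPLACE-WITH-YOUR-API-KEY',
    'url': 'https://httpbin.org',
    'js_snippet': encoded_snippet,
    'wait': 3000,
}
api_url = API_ENDPOINT + '?' + urlencode(params)
print(api_url)
```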

In your spider’s parse method, the response.url is resolved by the middleware to the original URL passed to ScrapingBeeRequest.

def parse(self, response):
    assert response.url == 'https://httpbin.org'

Another advantage of using ScrapingBee is that you get access to residential proxies in different countries and proxy rotation out of the box with the following arguments.

yield ScrapingBeeRequest(url, params={
   'premium_proxy': True,
   'country_code': 'fr',
})

Using Scrapy cache and concurrency to scrape faster

Scrapy is built on Twisted, an asynchronous networking framework that makes Scrapy fast and able to scrape multiple pages concurrently. However, to execute JavaScript code you need to render requests with a real browser or a headless browser. Headless browsers raise two challenges: they are slower and harder to scale.

Executing JavaScript in a headless browser and waiting for all network calls can take several seconds per page. When scraping multiple pages, it makes the scraper significantly slower. Fortunately, Scrapy provides caching to speed up development and concurrent requests for production runs.

Locally, while developing a scraper you can use Scrapy's built-in cache system. It will make subsequent runs faster as the responses are stored on your computer in a hidden folder .scrapy/httpcache. You can activate the HttpCacheMiddleware in your project settings:

HTTPCACHE_ENABLED = True
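
By default cached responses never expire. If the site you are developing against changes often, you can tune the cache with the related HTTPCACHE_* settings (the values here are illustrative):

```python
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600  # invalidate cached responses after an hour (0 = never)
HTTPCACHE_DIR = 'httpcache'       # stored under the hidden .scrapy folder
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503]  # don't cache server errors
```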

Another issue with headless browsers is that they consume memory for each request. In production, you need an environment that can handle multiple browsers. To make several requests concurrently, you can raise the concurrency in your project settings:

CONCURRENT_REQUESTS = 16

When using ScrapingBee, remember to set concurrency according to your ScrapingBee plan.

Conclusion

I compared three Scrapy middlewares to render and execute JavaScript with Scrapy. Selenium lets you interact with all major browsers from Python, but can be hard to scale. Splash can be run locally with Docker or deployed to Scrapinghub, but relies on a custom browser implementation and requires you to write scripts in Lua. ScrapingBee uses the latest headless Chrome, allows you to execute custom JavaScript scripts, and also provides proxy rotation for the hardest-to-scrape websites.

|                    | scrapy-selenium               | scrapy-splash                         | scrapy-scrapingbee                   |
|--------------------|-------------------------------|---------------------------------------|--------------------------------------|
| Run locally        | Yes                           | Yes, with Docker                      | No                                   |
| Scaling remotely   | With Selenium Grid            | With Scrapinghub                      | With ScrapingBee                     |
| Scripting language | JavaScript, Python            | Lua                                   | JavaScript                           |
| Browser support    | Chrome, Firefox, Edge, Safari | Splash                                | Latest headless Chrome               |
| Proxy rotation     | No                            | Provided by another service, Crawlera | Yes, provided by the same middleware |

Get started with the scrapy-scrapingbee middleware and get 1000 credits on ScrapingBee API.
