Web Scraping without getting blocked

28 June 2022 (updated) | 17 min read

Introduction

Web scraping or crawling is the process of fetching data from a third-party website by downloading and parsing the HTML code to extract the data you want.

"But why don't you use the API for this?"

Well, not every website offers an API, and APIs don't always expose every piece of information you need. So, scraping is often the only solution to extract website data.

There are many use cases for web scraping:

  • E-commerce price monitoring
  • News aggregation
  • Lead generation
  • SEO (search engine result page monitoring)
  • Bank account aggregation (Mint in the US, Bankin' in Europe)
  • Individuals and researchers building datasets otherwise not available

The main problem is that most websites do not want to be scraped. They only want to serve content to real users using real web browsers (except when it comes to Google - they all want to be scraped by Google).

So, when you scrape, you do not want to be recognized as a robot. There are two main ways to seem human: use human tools and emulate human behavior.

This post will guide you through all the tools websites use to block you and all the ways you can successfully overcome these obstacles.

1. Imitate The Tool: Headless Chrome

Why Headless Browsing?

When you open your browser and go to a webpage, it almost always means that you ask an HTTP server for some content. One of the easiest ways to pull content from an HTTP server is to use a classic command-line tool such as cURL.

The thing is, if you just run curl www.google.com, Google has many ways to know that you are not a human (for example, by looking at the headers). Headers are small pieces of information that travel with every HTTP request that hits a server. One of those pieces of information precisely describes the client making the request: the infamous "User-Agent" header. Just by looking at the "User-Agent" header, Google knows that you are using cURL. If you want to learn more about headers, the Wikipedia page on them is great, and there are websites that simply echo back the header information of your request so you can inspect exactly what you are sending.

Headers are easy to alter with cURL, and providing the User-Agent header of a proper browser could do the trick. In the real world, you'd need to set more than one header, but it is not difficult to artificially forge an HTTP request with cURL or any HTTP library to make it look exactly like a request made with a browser. Everybody knows this. So, to determine whether you are using a real browser, websites will check for something that cURL and HTTP libraries cannot do: executing JavaScript code.
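As a sketch of what such forging looks like with Python's standard library (the header values below are plausible examples of what a desktop Chrome sends, not magic values):

```python
import urllib.request

# Example header set a real desktop Chrome might send — the exact values
# matter less than being consistent with each other.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def browser_like_request(url: str) -> urllib.request.Request:
    """Build a request carrying browser-like headers instead of the
    default "Python-urllib/3.x" User-Agent."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

req = browser_like_request("https://example.com/")
# urllib.request.urlopen(req) would now send the forged headers.
```

The cURL equivalent is simply passing each header with -H, e.g. curl -H "User-Agent: Mozilla/5.0 ..." https://example.com.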

Do You Speak JavaScript?

The concept is simple, the website embeds a JavaScript snippet in its content that, once executed, will "unlock" the webpage. If you're using a real browser, you won't notice the difference. If you're not, you'll receive an HTML page with some obscure JavaScript code in it:

[Image: an example of a JavaScript snippet used to unlock a webpage]

Once again, this solution is not completely bulletproof either, mainly because it is now very easy to execute JavaScript outside of a browser with Node.js. However, the web has evolved and there are other tricks to determine if you are using a real browser.

How Do Headless Browsers Work?

Trying to execute JavaScript snippets on the side with Node.js is difficult and not robust. More importantly, as soon as the website has a more complicated check system or is a big single-page application, cURL and pseudo-JS execution with Node.js become useless. So the best way to look like a real browser is to actually use one.

Headless browsers will behave like a real browser, because they are real browsers, except that you will easily be able to programmatically use them. The most popular is Chrome Headless, a Chrome option that behaves like Chrome without all of the user interface wrapping it.

The easiest way to use Headless Chrome is by calling a driver that wraps all functionality into an easy API. Selenium, Playwright, and Puppeteer are the three most famous solutions. However, often even headless browsers won't do the trick, as there are now also ways to detect those, and that arms race has been (and will be) going on for quite a while.
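For instance, fetching the fully rendered HTML of a page with Selenium-driven headless Chrome might look like this sketch (it assumes pip install selenium; Selenium 4.6+ can download a matching chromedriver by itself):

```python
def fetch_with_headless_chrome(url: str) -> str:
    """Fetch the fully rendered HTML of a page with headless Chrome."""
    # Imported lazily so the sketch can be loaded without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    opts.add_argument("--headless=new")        # headless mode (Chrome 109+ syntax)
    opts.add_argument("--window-size=1920,1080")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)            # JavaScript executes as in a normal browser
        return driver.page_source  # the DOM *after* JS ran, not the raw HTML
    finally:
        driver.quit()
```

Playwright and Puppeteer scripts look very similar; in all three cases you get back the page as a real browser rendered it.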

While these solutions can be easy to run on your local computer, it will be trickier to make this work at scale. Among many other things, this is what ScrapingBee is focusing on with its services, providing scraping tools which offer smooth scalability and natural browsing behavior.

Browser Fingerprinting

It is a rather well-known fact these days, particularly among web developers, that browsers can behave very differently. Sometimes it's about rendering CSS, sometimes how they execute JavaScript, and sometimes just superficial details. Most of these differences are well documented, and it is now possible to detect whether a browser is actually who it pretends to be. In other words, the website asks: "do all of this browser's properties and behaviors match what I know about the User-Agent it sent?".

This is why there is an everlasting arms race between web scrapers who want to pass as a real browser and websites who want to distinguish headless browsers from the rest. However, web scrapers tend to have a big advantage and here is why:

[Image: screenshot of a Chrome malware alert]

Most of the time, when JavaScript code tries to detect whether it's being run in headless mode, it is malware trying to evade behavioral fingerprinting: the JavaScript will behave nicely inside a scanning environment but pursue its real goal in the context of a real browser. This is why the team behind Chrome's headless mode is trying to make it indistinguishable from a real user's web browser, in order to stop malware from doing that. Web scrapers can profit from this effort as well.

Another thing to know is that, while running 20 cURL processes in parallel is trivial, Chrome Headless, although relatively easy to use for small use cases, can be tricky to put to work at scale. Because it uses lots of RAM, managing more than 20 instances of it is a challenge.


That's it, folks. At least as far as posing as a real browser is concerned. Coming up next, let's make our scraper behave like a real human.

TLS Fingerprinting

What is it?

TLS stands for Transport Layer Security and is the successor of SSL, which was basically what the "S" of HTTPS stood for.

This protocol ensures data confidentiality and integrity between two communicating computer applications (in our case, a web browser or script and an HTTP server). Similar to browser fingerprinting, the goal of TLS fingerprinting is to uniquely identify browsers based on the way they use TLS. How this protocol works can be split into two big parts.

  1. First, when the client connects to the server, a TLS handshake is taking place. During this handshake, the two parties send information back and forth to ensure that everyone is actually who they claim to be and to set up basic connection parameters.

  2. Then, if the handshake has been successful, the protocol defines how the client and the server should encrypt and decrypt the data in a secure way. If you want a detailed explanation, check out this great introduction by Cloudflare.

Most of the data points used to build the fingerprint come from the TLS handshake. If you want to see what a TLS fingerprint looks like, you can visit an online TLS-fingerprint database. Such databases contain statistics on different TLS fingerprints, and - as of the time of writing this article - the most commonly seen one there is 133e933dd1dfea90, which seemingly belongs to Apple's Safari browser on iOS and was used by one out of five requests in the past week.

[Image: the TLS fingerprint of Safari on iOS]

That's quite a figure and at least two orders of magnitude higher than the most common browser fingerprint. It actually makes sense as a TLS fingerprint is computed using way fewer parameters than a browser fingerprint.

Those parameters are, amongst others:

  • TLS version
  • Handshake version
  • Cipher suites supported
  • TLS extensions

If you'd like to learn more about your browser's TLS fingerprint, SSL Labs will provide you with further insight and details on the subject.

How do I change it?

Ideally, in order to increase your stealth when scraping the web, you should be changing your TLS parameters. However, this is harder than it looks.

Firstly, because there are not that many TLS fingerprints out there, simply randomizing those parameters won't work: your fingerprint will be so rare that it will be instantly flagged as fake. Secondly, TLS parameters are low-level stuff that relies heavily on system dependencies, so changing them is not straightforward. For example, the famous Python requests module doesn't support changing the TLS fingerprint out of the box.

That said, most popular languages have ways or dedicated libraries to change the TLS version and cipher suite your client offers.

💡 Keep in mind, most of these libraries rely on the SSL and TLS implementation of your system. OpenSSL is the most widely used, and you might need to change its version in order to completely alter your fingerprint.
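As a minimal sketch of what tweaking these parameters looks like in Python, the standard ssl module lets you restrict the cipher suites your client offers (the OpenSSL cipher string below is just an example, not a recommendation):

```python
import ssl

ctx = ssl.create_default_context()
default_count = len(ctx.get_ciphers())

# Restrict the offered TLS 1.2 cipher suites to ECDHE key exchange with
# AES-GCM. This changes the cipher-suite list in the ClientHello — and
# therefore the TLS fingerprint. (TLS 1.3 suites are configured separately
# in OpenSSL and remain enabled.)
ctx.set_ciphers("ECDHE+AESGCM")

print(f"{default_count} cipher suites offered by default,"
      f" {len(ctx.get_ciphers())} after restricting")
```

Keep in mind the cipher-suite list is only one input to the fingerprint; TLS extensions and their order matter too, so this alone won't fully mimic a given browser.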

2. Imitate User Behaviour: Proxies, Solving CAPTCHAs, and Request Patterns

A human using a real browser will rarely request 20 pages per second from the same website. So if you want to request a lot of pages from the same website, you should make the site believe all those requests come from different places, i.e. different IP addresses, ideally from different locations/ISPs around the world. This is where proxies come in handy.

Proxies are not very expensive: roughly USD 1 per IP. However, if you need to do more than ~10k requests per day to the same website, costs can go up quickly, with hundreds of addresses needed. One thing to consider is that proxies often stop working, so you should constantly monitor them in order to discard and replace dead ones.
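A minimal rotation sketch with Python's standard library (the proxy addresses below are placeholders for whatever your provider gives you):

```python
import random
import urllib.request

# Hypothetical proxy pool — replace with the addresses from your provider.
PROXY_POOL = [
    "http://user:pass@203.0.113.10:8080",
    "http://user:pass@203.0.113.11:8080",
    "http://user:pass@203.0.113.12:8080",
]

def opener_with_random_proxy() -> urllib.request.OpenerDirector:
    """Build an opener that routes the next request through a randomly
    chosen proxy, so consecutive requests come from different IPs."""
    proxy = random.choice(PROXY_POOL)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)

opener = opener_with_random_proxy()
# opener.open("https://example.com/", timeout=30) would go through the proxy.
```

Real setups add health checks so dead proxies are dropped from the pool automatically.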

Proxy Lists

There are several proxy providers on the market, the most commonly used ones with rotating web proxies are Luminati Network, Blazing SEO, and SmartProxy.

There are also a lot of free proxy lists, though I don't recommend using those: they are often slow and unreliable, and websites offering these lists are not always transparent about where the proxies are located. Free proxy lists are usually public, and therefore it's a lot easier for websites to block these proxy addresses straight away.

That's exactly what many anti-crawling services do: they maintain lists of proxies and allow site owners to automatically block traffic from such addresses. Hence, the "quality" of a proxy is important.

In the context of proxy lists, paid almost always trumps free, which is why I recommend using a paid proxy network or creating your own proxy setup.

Running Your Own Proxy Network

CloudProxy is a great open-source solution for running your very own proxy network. It's a Docker image that lets you maintain a list of cloud-hosted server instances and manages and provisions them automatically.

Currently, CloudProxy supports DigitalOcean, AWS, Google, and Hetzner as service providers. All you need to do is specify your account API token and CloudProxy will take it from there and provision and scale proxy instances according to your configuration settings.

ISP & Mobile Proxies

One thing to consider with proxy lists (and even your own proxies) is they will most of the time run in the context of a data center, and data centers may be an immediate red flag for the site you are trying to crawl.

Ask yourself, how many "regular" users will browse your site from the basement of a data center?

This is where ISP proxies can be interesting. Such proxy setups are still hosted in a data center environment; however, they are located with classic ISPs instead of hosting providers and may blend a lot better into standard traffic behaviour than a request coming straight from Google's data centers. Plus, being run in a data center, they of course still come with data center connectivity and reliability - the best of both worlds!

We have another great article dedicated on that very subject, so please check out ISP proxies for more details on that.

Another world of networking not to miss would be the mobile world. Mobile 3G and 4G proxies run off of IP address blocks assigned to mobile network operators and offer a beautiful native-like approach for any sites and services which cater predominantly to mobile-first or mobile-only users.

Tor

Another option is the Tor network, aka The Onion Router. It is a worldwide computer network designed to route traffic through many different servers to hide its origin. Tor usage makes network surveillance and traffic analysis very difficult. There are a lot of use cases for Tor, such as privacy, freedom of speech, journalists working under dictatorial regimes, and of course, illegal activities.

In the context of web scraping, Tor works very similarly to proxies: it hides your IP address and changes your bot's exit IP address every 10 minutes. However, the IP addresses of Tor exit nodes are public, and some websites block Tor traffic using a simple rule: if the server receives a request from one of the public Tor exit nodes, it blocks it. That's why, in many cases, Tor won't help you compared to classic proxies. It's also worth noting that traffic through Tor is inherently much slower because of the multiple layers of re-routing.
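Routing a scraper through Tor is mostly a proxy-configuration exercise; here is a sketch assuming a Tor daemon running locally on its default SOCKS port (9050) and the requests library with SOCKS support (pip install requests[socks]):

```python
# Tor exposes a SOCKS5 proxy on localhost:9050 by default. The "socks5h"
# scheme makes DNS resolution happen through Tor as well, so lookups
# don't leak your real location.
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_via_tor(url: str) -> str:
    """Fetch a URL through the local Tor SOCKS proxy."""
    # Deferred import so the sketch loads even without requests installed.
    import requests
    return requests.get(url, proxies=TOR_PROXIES, timeout=60).text
```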

CAPTCHAs

Often, changing your IP address alone won't do it. It is becoming more and more common for websites to request their visitors to complete CAPTCHAs. A CAPTCHA is a challenge-response test, which is ideally easy to solve for humans but difficult to impossible for machines.

Most of the time, sites will display a CAPTCHA only to requests from suspicious IP addresses, in which case switching the proxy might just do the trick. Should that not be the case, then you might actually need to solve the CAPTCHA as well.

[Image: reCAPTCHA 1.0]

While many CAPTCHAs can still be solved programmatically (e.g. with OCR) - contrary to what the idea actually promises - there are also implementations which do require the human element.

[Image: reCAPTCHA 2.0]

In these cases, 2Captcha and Death by Captcha are two examples of services which offer paid APIs to solve CAPTCHAs automatically from within your scraper. These APIs are backed by humans on the other end and promise a timely CAPTCHA resolution for a fraction of a dollar.

But still, even if you handle CAPTCHAs and use proxies, sites can often still determine that you are not a regular user, based on how you send your requests.

Request Patterns

Another method sites use to detect scraping is looking for patterns in the requests you are sending.

1. Request rate

For example, say you have the URL template https://www.myshop.com/product/[ID] and would like to scrape the IDs 1 to 10,000. If you did this sequentially and at a constant request rate, it would be pretty easy to tell that you are likely a scraper. Instead of looping from 1 to 10,000, keep a list of IDs and randomly pick one on each crawler iteration. Also, make sure your request rate is random (e.g. anything between a couple of seconds and a minute).
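Both ideas - random order and a jittered request rate - fit in a few lines of Python (a sketch; fetch stands for whatever function performs the actual request):

```python
import random
import time

def crawl_product_ids(fetch, ids, min_delay=2.0, max_delay=60.0):
    """Visit every ID exactly once, in random order, with a random
    delay between requests to avoid a constant, machine-like rate."""
    for product_id in random.sample(ids, k=len(ids)):
        fetch(product_id)
        time.sleep(random.uniform(min_delay, max_delay))

# Demo with a no-op "fetch" and no delay:
seen = []
crawl_product_ids(seen.append, list(range(10)), min_delay=0, max_delay=0)
```

In production you would call it as crawl_product_ids(my_fetch, list(range(1, 10_001))).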

2. Fingerprinting

Some sites also evaluate your browser fingerprint and block requests if you hit them with the standard configuration of a headless browser. Please refer to what we discussed earlier on fingerprinting in this article.

3. Location

Another giveaway may be your location. Some sites have a very narrow geographical use-case. A Brazilian food delivery service will not be too useful outside of Brazil, and if you scrape that site via proxies from the US or Vietnam, that may quickly raise flags.

I know of an example where a website blocked all IP ranges of an entire country, only because it was being scraped by a single company based in that country.

Long story short: stay under the radar

What I can tell you from experience is that, when it comes to request patterns, the most important factor is the rate limit. The more self-restraint your bot employs and the slower it crawls a site, the higher the chance that you will fly under the radar and appear like a regular visitor.

3. Imitate Code Behaviour: API Reverse Engineering

More and more often sites do not just serve plain HTML output, but actually provide proper API endpoints, even if these may be unofficial and undocumented.

Even if these endpoints are meant for internal use only, you should still be careful, and most of the rules mentioned so far still apply. But at least an API provides a more standardised interface, and interactive elements such as JavaScript may be less of a factor.

Reverse Engineering of an API

What it comes down to here is primarily:

  1. Analyzing a web page's behaviour to find interesting and relevant API calls
  2. Forging those API calls from within your code

For example, let's say that I want to get all the comments of a famous social network. When I click the "Load more comments" button, I can observe the following request in the Network tab of my browser's developer tools.

[Image: request being made when clicking "Load more comments"]

In this example we also specifically filtered for XHR requests, in order to focus on the relevant requests.

Now, if we check the response we received, we notice we got a JSON object with all the important data - bingo!

[Image: request response]

Together with the data from the Headers tab, we should now have everything we need to replay the request and get an understanding of which parameters are expected and their meaning. This should provide us with the opportunity to craft such requests from a script as well.
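A sketch of forging such a call from a script, using Python's standard library; the endpoint URL, parameter names, and header values here are all hypothetical stand-ins for whatever you observed in the developer tools:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint, as observed in the Network tab.
API_URL = "https://socialnetwork.example/api/comments"

def build_comments_request(post_id: int, cursor: str) -> urllib.request.Request:
    """Forge the XHR call that the "Load more comments" button triggers."""
    query = urllib.parse.urlencode({"post_id": post_id, "cursor": cursor})
    return urllib.request.Request(
        f"{API_URL}?{query}",
        headers={
            # Copied from the Headers tab so the call looks like the browser's.
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Accept": "application/json",
            "X-Requested-With": "XMLHttpRequest",
        },
    )

def load_more_comments(post_id: int, cursor: str) -> dict:
    """Replay the call and parse the JSON payload."""
    req = build_comments_request(post_id, cursor)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

# Hypothetical demo values:
req = build_comments_request(42, "abc123")
```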

[Image: HTTP client response]

This analysis of the request parameters is arguably the hardest part of the task, as there can be a certain ambiguity, and you may need a few sample requests to be able to tweak your requests and correctly interpret the responses. Plus, sometimes there are built-in scraping-protection layers to take into consideration as well, such as single-use tokens to prevent simple request replays.

In any case, your browser's developer tools will greatly assist you in this endeavour and you can also export sets of requests as HAR file and use them for further analysis in your favorite HTTP tool (I love Paw and Postman ❤️).

[Image: the previous request imported in Paw]

Reverse Engineering of Mobile Apps

Similar principles as with API debugging apply when it comes to reverse engineering mobile apps: you will want to intercept the requests the mobile app makes to the server and replay them with your code.

Doing this may be hard for two reasons:

  • To intercept requests, you will need a Man-In-The-Middle proxy, Charles proxy for example
  • Mobile apps can tag and obfuscate their requests more easily than a web app

For example, when Pokemon Go was released a few years ago, lots of people were using tools which allowed them to manipulate their scores. What they, and the developers of these tools, did not know was that Niantic had added additional parameters which the cheat scripts did not take into account. Based on that, it was a piece of cake for Niantic to identify and ban these players. A few weeks after the game's release, a massive number of players were banned for cheating.

Another interesting example is Starbucks and its unofficial API. Someone actually took the time to reverse-engineer the Starbucks API and document their findings. Starbucks seemingly employs a lot of the techniques we mentioned in this article: device fingerprinting, encryption, single-use requests, and more.

Conclusion

Here is a recap of all the anti-bot techniques we saw in this article:

Anti-bot technique                Countermeasure                        Supported by ScrapingBee
Browser fingerprinting            Headless browsers                     ✓
IP-rate limiting                  Rotating proxies                      ✓
Banning data center IPs           Residential IPs                       ✓
TLS fingerprinting                Forge and rotate TLS fingerprints     ✓
CAPTCHAs on suspicious activity   All of the above                      ✓
Always-on CAPTCHAs                CAPTCHA-solving tools and services    ✓

I hope this overview helped you to understand better what difficulties a web-scraper may encounter and how to counter or avoid them altogether.

At ScrapingBee, we leverage and combine all of the mentioned techniques, which is why our web scraping API is able to handle thousands of requests per second without the risk of being blocked. If you don’t want to lose too much time setting everything up, make sure to try ScrapingBee. The first 1k API calls are on us of course :).

We also recently published a guide about the best web scraping tools on the market, please don't hesitate to take a look!

Pierre de Wulf

Pierre is a data engineer who worked in several high-growth startups before co-founding ScrapingBee. He is an expert in data processing and web scraping.