Web Scraping without getting blocked
Web scraping or crawling is the process of fetching data from a third-party website by downloading and parsing the HTML code to extract the data you want.
"But why don't you use the API for this?"
Well, not every website offers an API, and APIs don't always expose every piece of information you need. So, scraping is often the only solution to extract website data.
There are many use cases for web scraping:
- E-commerce price monitoring
- News aggregation
- Lead generation
- SEO (search engine result page monitoring)
- Bank account aggregation (Mint in the US, Bankin' in Europe)
- Individuals and researchers building datasets otherwise not available
The main problem is that most websites do not want to be scraped. They only want to serve content to real users using real web browsers (except when it comes to Google - they all want to be scraped by Google).
So, when you scrape, you do not want to be recognized as a robot. There are two main ways to seem human: use human tools and emulate human behavior.
This post will guide you through all the tools websites use to block you and all the ways you can successfully overcome these obstacles.
1. Imitate The Tool: Headless Chrome
Why Headless Browsing?
When you open your browser and go to a webpage, it almost always means that you ask an HTTP server for some content. One of the easiest ways to pull content from an HTTP server is to use a classic command-line tool such as cURL.
The thing is, if you just run curl www.google.com, Google has many ways to know that you are not a human (for example, by looking at the headers). Headers are small pieces of information that accompany every HTTP request that hits the servers. One of those pieces of information precisely describes the client making the request: the infamous "User-Agent" header. Just by looking at the "User-Agent" header, Google knows that you are using cURL. If you want to learn more about headers, the Wikipedia page is great. As an experiment, just go over here. This webpage simply displays the header information of your request.
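To make the header point concrete, here is a minimal Python sketch using only the standard library. By default, urllib announces itself with a "Python-urllib/3.x" User-Agent, which is just as revealing as cURL's. The header values below are assumptions chosen for illustration; in practice you would copy them from a browser you actually want to resemble.

```python
import urllib.request

# Hypothetical browser-like header set (illustrative values, not a recommendation).
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def build_request(url, headers=None):
    """Return a Request that carries browser-like headers instead of the defaults."""
    return urllib.request.Request(url, headers=dict(headers or BROWSER_HEADERS))
```

Passing the result to urllib.request.urlopen() would send the request. Keep in mind that headers are only the first of many signals a site can check.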
How Do Headless Browsers Work?
Headless browsers will behave like a real browser, because they are real browsers, except that you will easily be able to programmatically use them. The most popular is Chrome Headless, a Chrome option that behaves like Chrome without all of the user interface wrapping it.
The easiest way to use Headless Chrome is by calling a driver that wraps all of its functionality into an easy API. Selenium, Playwright, and Puppeteer are the three most famous solutions. However, often even headless browsers won't do the trick, as there are now also ways to detect them, and that arms race has been (and will be) going on for quite a while.
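Drivers such as Selenium ultimately start a Chrome process for you. As a rough sketch of what happens underneath, the snippet below launches Chrome directly with its command-line flags and reads the rendered DOM from standard output. The binary name "google-chrome" is an assumption; on your machine it might be "chromium" or a full path.

```python
import subprocess


def headless_chrome_command(url, binary="google-chrome"):
    """Build the argument list for a one-shot headless Chrome run."""
    return [
        binary,
        "--headless",     # run Chrome without any visible window
        "--disable-gpu",  # commonly added on machines without a GPU
        "--dump-dom",     # print the DOM (after JavaScript ran) to stdout
        url,
    ]


def fetch_rendered_html(url, binary="google-chrome", timeout=30):
    """Run headless Chrome once and return the rendered HTML as text."""
    result = subprocess.run(
        headless_chrome_command(url, binary),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if Chrome exits with an error
    )
    return result.stdout
```

For anything beyond a single page load (clicking, scrolling, waiting for elements), a proper driver is the more practical choice.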
While these solutions can be easy to run on your local computer, it will be trickier to make this work at scale. Among many other things, this is what ScrapingBee is focusing on with its services, providing scraping tools which offer smooth scalability and natural browsing behavior.
This is why there is an everlasting arms race between web scrapers who want to pass as a real browser and websites who want to distinguish headless browsers from the rest. However, web scrapers tend to have a big advantage and here is why:
Another thing to know is that while running 20 cURL requests in parallel is trivial and Chrome Headless is relatively easy to use for small use cases, it can be tricky to run at scale. Because it uses lots of RAM, managing more than 20 instances of it is a challenge.
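One simple way to keep memory usage in check is to cap how many browser instances run at the same time. The sketch below assumes you already have some render(url) function (hypothetical here) that starts a headless browser, loads the page, and returns the HTML.

```python
from concurrent.futures import ThreadPoolExecutor


def render_all(urls, render, max_browsers=5):
    """Render every URL while keeping at most `max_browsers` renders in flight."""
    with ThreadPoolExecutor(max_workers=max_browsers) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(render, urls))
```

The right value for max_browsers depends on how much RAM each instance needs on your machine, so treat the default as a placeholder.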
The following two resources might also be of interest if you want to know more:
- Our article How to use a proxy with cURL, with details on how to set up and use a proxy configuration with cURL
- Antoine Vastel's blog, which is entirely dedicated to the subject of browser fingerprinting and bot detection
That's it, folks. At least as far as posing as a real browser is concerned. Coming up next, let's make our scraper behave like a real human.
What is TLS fingerprinting?
TLS stands for Transport Layer Security and is the successor of SSL, which was basically what the "S" of HTTPS stood for.
This protocol ensures data confidentiality and integrity between two communicating computer applications (in our case, a web browser or script and an HTTP server). Similar to browser fingerprinting, the goal of TLS fingerprinting is to uniquely identify browsers based on the way they use TLS. How this protocol works can be split into two big parts.
First, when the client connects to the server, a TLS handshake takes place. During this handshake, the two parties send information back and forth to ensure that everyone is actually who they claim to be and to set up basic connection parameters.
Then, if the handshake has been successful, the protocol defines how the client and the server should encrypt and decrypt the data in a secure way. If you want a detailed explanation, check out this great introduction by Cloudflare.
Most of the data points used to build the fingerprint come from the TLS handshake. If you want to see what a TLS fingerprint looks like, you can visit this awesome online database. That website contains statistics on different TLS fingerprints and, at the time of writing, the most commonly used one on that website is 133e933dd1dfea90, which seemingly comes from Apple's Safari browser on iOS and was used by one out of five requests in the past week.
That's quite a figure and at least two orders of magnitude higher than the most common browser fingerprint. It actually makes sense as a TLS fingerprint is computed using way fewer parameters than a browser fingerprint.
Those parameters are, amongst others:
- TLS version
- Handshake version
- Cipher suites supported
- TLS extensions
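To get a feel for why so few parameters lead to so few distinct fingerprints, here is a toy sketch that hashes the four parameters listed above into a short identifier. This is not the algorithm used by the database mentioned earlier or by any real fingerprinting product. The function name, the input format, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def toy_tls_fingerprint(tls_version, handshake_version, cipher_suites, extensions):
    """Hash a few handshake parameters into a short hex identifier."""
    parts = [
        tls_version,
        handshake_version,
        "-".join(cipher_suites),  # order matters: clients list ciphers by preference
        "-".join(extensions),
    ]
    digest = hashlib.sha256(",".join(parts).encode("ascii")).hexdigest()
    return digest[:16]
```

Two clients built on the same TLS library with the same defaults produce the same value, while merely reordering the cipher suites already yields a different one.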
If you would like to learn more about your browser's TLS fingerprint, SSL Labs will provide you with further insight and details on that subject.
How do I change it?
Ideally, in order to increase your stealth when scraping the web, you should be changing your TLS parameters. However, this is harder than it looks.
Firstly, because there are not that many TLS fingerprints out there, simply randomizing those parameters won't work. Your fingerprint will be so rare that it will be instantly flagged as fake. Secondly, TLS parameters are low-level settings that rely heavily on system dependencies, so changing them is not straightforward. For example, the famous Python requests module doesn't support changing the TLS fingerprint out of the box.
Here are a few resources to change your TLS version and cipher suite in your favorite language:
- Python with HTTPAdapter and requests
- NodeJS with the TLS package
- Ruby with OpenSSL
💡 Keep in mind, most of these libraries rely on the SSL and TLS implementation of your system. OpenSSL is the most widely used, and you might need to change its version in order to completely alter your fingerprint.
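As a minimal sketch of what changing the TLS version and cipher suite can look like, the snippet below uses only Python's standard ssl and urllib modules rather than requests. The cipher string "ECDHE+AESGCM" is an arbitrary example, not a recommended or browser-like value, and the resulting handshake still depends on the OpenSSL build on your system.

```python
import ssl
import urllib.request


def build_tls_context(ciphers="ECDHE+AESGCM", minimum=ssl.TLSVersion.TLSv1_2):
    """Create a client TLS context with a custom cipher list and minimum version."""
    context = ssl.create_default_context()  # keeps certificate verification on
    context.minimum_version = minimum
    # set_ciphers() controls the TLS 1.2 (and older) suites;
    # OpenSSL handles the TLS 1.3 suites separately.
    context.set_ciphers(ciphers)
    return context


def build_opener(context):
    """Return a urllib opener whose HTTPS connections use the given context."""
    handler = urllib.request.HTTPSHandler(context=context)
    return urllib.request.build_opener(handler)
```

Calling build_opener(build_tls_context()).open(url) would then fetch a page with these settings. This changes only part of what a fingerprint is built from, so do not expect it to imitate a real browser on its own.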
2. Imitate User Behaviour: Proxies, Solving CAPTCHAs, and Request Patterns
A human using a real browser will rarely request 20 pages per second from the same website. So if you want to request a lot of pages from the same website, you should make the site believe all those requests come from different places, i.e. different IP addresses, ideally from different locations/ISPs around the world. This is where proxies come in handy.
Proxies are not very expensive: ~USD 1 per IP. However, if you need to do more than ~10k requests per day to the same website, costs can go up quickly, with hundreds of addresses needed. One thing to consider is that proxies often stop working, so you should monitor them constantly in order to discard and replace the ones that fail.
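The monitoring part can start out very simple. The sketch below is a hypothetical ProxyPool class (the name and the failure threshold are my own assumptions) that hands out a random proxy and discards any proxy that fails several times in a row.

```python
import random


class ProxyPool:
    """Rotate through proxies and drop the ones that keep failing."""

    def __init__(self, proxies, max_failures=3):
        self._failures = {proxy: 0 for proxy in proxies}
        self._max_failures = max_failures

    def __len__(self):
        return len(self._failures)

    def pick(self):
        """Return a random proxy that is still considered healthy."""
        if not self._failures:
            raise RuntimeError("no working proxies left; add fresh ones")
        return random.choice(list(self._failures))

    def report_success(self, proxy):
        """Reset the failure counter after a successful request."""
        if proxy in self._failures:
            self._failures[proxy] = 0

    def report_failure(self, proxy):
        """Count a failure and discard the proxy once it hits the limit."""
        if proxy not in self._failures:
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self._max_failures:
            del self._failures[proxy]
```

Your scraper would call pick() before each request and report the outcome afterwards. Replacing discarded proxies with fresh ones is left out here.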
There are several proxy providers on the market. The most commonly used ones with rotating web proxies are Luminati Network, Blazing SEO, and SmartProxy.
There are also a lot of free proxy lists, but I don’t recommend using those because they are often slow and unreliable, and websites offering these lists are not always transparent about where these proxies are located. Free proxy lists are usually public, and therefore, it's a lot easier for websites to block these proxy addresses straight away.
That's exactly what many anti-crawling services do: they maintain lists of proxies and allow site owners to automatically block traffic from such addresses. Hence, the "quality" of a proxy is important.
In the context of proxy lists, paid almost always trumps free, which is why I recommend using a paid proxy network or creating your own proxy setup.
Running Your Own Proxy Network
CloudProxy is a great open-source solution to run your very own proxy network. It's a Docker image which allows you to maintain a list of cloud-hosted server instances and have it manage and provision them automatically.
Currently, CloudProxy supports DigitalOcean, AWS, Google, and Hetzner as service providers. All you need to do is specify your account API token and CloudProxy will take it from there and provision and scale proxy instances according to your configuration settings.
ISP & Mobile Proxies
One thing to consider with proxy lists (and even your own proxies) is that they will, most of the time, run in the context of a data center, and data centers may be an immediate red flag for the site you are trying to crawl.
Ask yourself, how many "regular" users will browse your site from the basement of a data center?
This is where ISP proxies can be interesting. Such proxy setups are still hosted in a data center environment. However, they are located with classic ISPs instead of hosting providers, and may blend a lot better into standard traffic behaviour than a request coming straight from Google's data centers. Plus, since they still run in a data center, they of course still come with data center connectivity and reliability - the best of both worlds!
We have another great article dedicated to that very subject, so please check out ISP proxies for more details on that.
Another world of networking not to miss would be the mobile world. Mobile 3G and 4G proxies run off of IP address blocks assigned to mobile network operators and offer a beautiful native-like approach for any sites and services which cater predominantly to mobile-first or mobile-only users.
Another option is the Tor network, aka The Onion Router. It is a worldwide computer network designed to route traffic through many different servers to hide its origin. Tor usage makes network surveillance and traffic analysis very difficult. There are a lot of use cases for Tor, such as privacy, freedom of speech, journalism under dictatorial regimes, and of course, illegal activities.
In the context of web scraping, Tor works very similarly to proxies: it will hide your IP address and change your bot’s IP address every 10 minutes. However, the IP addresses of the Tor exit nodes are public. Some websites block Tor traffic using a simple rule: if the server receives a request from one of the public Tor exit nodes, it will block it. That’s why, in many cases, Tor won’t help you compared to classic proxies. It's worth noting that traffic through Tor is also inherently much slower because of the multiple layers of re-routing.
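Seen from the website's side, that blocking rule takes only a few lines. The sketch below assumes the site has already downloaded a list of exit node addresses, one per line. Where that list comes from and how often it is refreshed are left out, and the function names are my own.

```python
import ipaddress


def load_exit_nodes(lines):
    """Parse one IP address per line, skipping blank lines and comments."""
    return {
        ipaddress.ip_address(line.strip())
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    }


def is_tor_exit(client_ip, exit_nodes):
    """Return True when the client's address is a known Tor exit node."""
    return ipaddress.ip_address(client_ip) in exit_nodes
```

Because the check is this cheap, you should assume that any site that cares about bots applies it.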
Often, changing your IP address alone won't do it. It is becoming more and more common for websites to request their visitors to complete CAPTCHAs. A CAPTCHA is a challenge-response test, which is ideally easy to solve for humans but difficult to impossible for machines.
Most of the time, sites will display a CAPTCHA only to requests from suspicious IP addresses, in which case switching the proxy might just do the trick. Should that not be the case, then you might actually need to solve the CAPTCHA as well.
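Here is a rough sketch of that "switch the proxy first" approach. It assumes a fetch(url, proxy) function of your own (hypothetical here) that returns the page HTML. The CAPTCHA markers are guesses for illustration; you would need to check what the challenge page of your target site actually contains.

```python
# Hypothetical markers; inspect the real challenge page of the site you scrape.
CAPTCHA_MARKERS = ("captcha", "are you a robot")


def looks_like_captcha(html):
    """Heuristically decide whether a response is a CAPTCHA page."""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


def fetch_avoiding_captcha(url, proxies, fetch):
    """Try each proxy in turn until a response without a CAPTCHA comes back."""
    for proxy in proxies:
        html = fetch(url, proxy)
        if not looks_like_captcha(html):
            return html, proxy
    # Every proxy was challenged, so the CAPTCHA has to be solved after all.
    return None, None
```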
While many CAPTCHAs can still be solved programmatically (e.g. with OCR) - contrary to what the idea actually promises - there are also implementations which do require the human element.
In these cases, 2Captcha and Death by Captcha are two examples of services which offer paid APIs to solve CAPTCHAs automatically from within your scraper. These APIs are backed by humans on the other end and promise a timely CAPTCHA resolution for a fraction of a dollar.
But still, even if you handle CAPTCHAs and use proxies, sites can often still determine that you are not a regular user, based on how you send your requests.
Another method, which sites use to detect scraping, is trying to find patterns in the requests you are sending.
1. Request rate
For example, say you have the URL template https://www.myshop.com/product/[ID] and would like to scrape the IDs 1 to 10,000. If you do this sequentially and at a constant request rate, it will be pretty easy to tell that you are likely a scraper. Instead of looping from 1 to 10,000, have a list of IDs and randomly pick one on each crawler iteration. Also, make sure your request rate is random (e.g. anything from a couple of seconds to a minute).
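A minimal sketch of that idea could look like this. The fetch callable is a placeholder for whatever actually downloads the page, and the 2 to 60 second range simply mirrors the "couple of seconds to a minute" suggestion.

```python
import random
import time


def shuffled_ids(first=1, last=10_000):
    """Return every ID in the range exactly once, in random order."""
    ids = list(range(first, last + 1))
    random.shuffle(ids)
    return ids


def random_delay(min_seconds=2.0, max_seconds=60.0):
    """Pick a random pause length so the request rate is not constant."""
    return random.uniform(min_seconds, max_seconds)


def crawl(fetch, first=1, last=10_000, sleep=time.sleep):
    """Visit all product pages in random order with random pauses in between."""
    for product_id in shuffled_ids(first, last):
        fetch(f"https://www.myshop.com/product/{product_id}")
        sleep(random_delay())
```

The sleep parameter is only there so that the waiting can be swapped out, for example in tests.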
Some sites also evaluate your browser fingerprint and block requests if you hit them with the standard configuration of a headless browser. Please refer to what we discussed earlier on fingerprinting in this article.
Another giveaway may be your location. Some sites have a very narrow geographical use-case. A Brazilian food delivery service will not be too useful outside of Brazil, and if you scrape that site via proxies from the US or Vietnam, that may quickly raise flags.
I know of one example where a website blocked all IP ranges of another country, only because it was being scraped by a single company in that country.
Short story, stay under the radar
From experience, I can tell you that the most important factor when it comes to finding patterns in requests is the rate of your requests. The more self-restraint your bot employs and the slower it crawls a site, the higher the chance that you will fly under the radar and appear like a regular visitor to the site.
3. Imitate Code Behaviour: API Reverse Engineering
More and more often sites do not just serve plain HTML output, but actually provide proper API endpoints, even if these may be unofficial and undocumented.
Reverse Engineering of an API
What it comes down to here is primarily:
- Analyzing a web page's behaviour to find interesting and relevant API calls
- Forging those API calls from within your code
For example, let's say that I want to get all the comments of a famous social network. Now, when I click the "Load more comments" button, I can observe the following request in the Network tab of my browser's developer tools.
In this example we also specifically filtered for XHR requests, in order to focus on the relevant requests.
Now, if we check the response we received, we notice we got a JSON object with all the important data - bingo!
Together with the data from the Headers tab, we should now have everything we need to replay the request and get an understanding of which parameters are expected and their meaning. This should provide us with the opportunity to craft such requests from a script as well.
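Replaying such a request from a script could look roughly like the sketch below. Everything specific in it is hypothetical: the endpoint URL, the post_id and cursor parameters, and the comments and next_cursor fields in the response all stand in for whatever you actually find in your developer tools.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint; replace it with the URL observed in the developer tools.
API_URL = "https://social.example.com/api/comments"


def build_comments_request(post_id, cursor=None, headers=None):
    """Rebuild the 'load more comments' call with the observed parameters."""
    params = {"post_id": post_id}
    if cursor is not None:
        params["cursor"] = cursor
    url = f"{API_URL}?{urllib.parse.urlencode(params)}"
    # Copy the relevant headers from the Headers tab; these two are examples.
    request_headers = {
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
    }
    request_headers.update(headers or {})
    return urllib.request.Request(url, headers=request_headers)


def parse_comments(body):
    """Extract the comments and the cursor for the next page from the JSON body."""
    payload = json.loads(body)
    return payload.get("comments", []), payload.get("next_cursor")
```

Sending the request with urllib.request.urlopen() and feeding the response body into parse_comments() would then give you one page of data, plus the cursor to ask for the next one.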
This analysis of the request parameters is often the hardest part of the task, as there can be a certain ambiguity and you may need a few sample requests to be able to tweak your requests and correctly interpret the responses. Plus, sometimes there are built-in scraping protection layers to take into consideration as well, such as single-use tokens that prevent simple request replays.
In any case, your browser's developer tools will greatly assist you in this endeavour, and you can also export sets of requests as a HAR file and use them for further analysis in your favorite HTTP tool (I love Paw and Postman ❤️).
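A HAR file is plain JSON, so you can also sift through it with a few lines of Python. The sketch below lists the requests whose responses were JSON, which is usually where the interesting API calls hide. It relies on the standard HAR layout (log, entries, request, response, content), and the function names are my own.

```python
import json


def load_har(path):
    """Read a HAR export from disk (utf-8-sig also tolerates a leading BOM)."""
    with open(path, encoding="utf-8-sig") as handle:
        return json.load(handle)


def json_requests_from_har(har):
    """List (method, url) pairs for entries whose response body is JSON."""
    found = []
    for entry in har.get("log", {}).get("entries", []):
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        if "json" in mime.lower():
            request = entry["request"]
            found.append((request["method"], request["url"]))
    return found
```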
Reverse Engineering of Mobile Apps
Similar principles, as with API debugging, apply when it comes to reverse engineering mobile apps. You will want to intercept the request your mobile app makes to the server and replay it with your code.
Doing this may be hard for two reasons:
- To intercept requests, you will need a Man-In-The-Middle proxy, Charles proxy for example
- Mobile apps can tag and obfuscate their requests more easily than a web app
For example, when Pokemon Go was released a few years ago, lots of people were using tools which allowed them to manipulate their scores. What they, and the developers of these tools, did not know was that Niantic had added additional parameters which the cheat scripts did not take into account. Based on that, it was a piece of cake for Niantic to identify and ban these players. A few weeks into the game's release, a massive number of players were banned for cheating.
Another interesting example is Starbucks and its unofficial API. Someone actually took the time to reverse-engineer the Starbucks API and document his findings. They seemingly employ a lot of the techniques we mentioned in this article: device fingerprinting, encryption, single-use requests, and more.
Here is a recap of all the anti-bot techniques we saw in this article:
| Anti-bot technique | Countermeasure | Supported by ScrapingBee |
|---|---|---|
| Browser fingerprinting | Headless browsers | ✅ |
| IP-rate limiting | Rotating proxies | ✅ |
| Banning data center IPs | Residential IPs | ✅ |
| TLS fingerprinting | Forge and rotate TLS fingerprints | ✅ |
| CAPTCHAs on suspicious activity | All of the above | ✅ |
| Always-on CAPTCHAs | CAPTCHA-solving tools and services | ❌ |
I hope this overview helped you to understand better what difficulties a web-scraper may encounter and how to counter or avoid them altogether.
At ScrapingBee, we leverage and combine all of the mentioned techniques, which is why our web scraping API is able to handle thousands of requests per second without the risk of being blocked. If you don’t want to lose too much time setting everything up, make sure to try ScrapingBee. The first 1k API calls are on us of course :).
We also recently published a guide about the best web scraping tools on the market, please don't hesitate to take a look!
Pierre is a data engineer who worked in several high-growth startups before co-founding ScrapingBee. He is an expert in data processing and web scraping.