Web scraping, or crawling, is the act of fetching data from a third-party website by downloading and parsing its HTML code to extract the data you want.
But you should use an API for this!
Not every website offers an API, and APIs don't always expose every piece of information you need. So scraping is often the only way to extract a website's data.
There are many use cases for web scraping.
So, what is the problem?
The main problem is that most websites do not want to be scraped. They only want to serve content to real users using real web browsers (except Google: they all want to be scraped by Google).
So, when you scrape, you have to be careful not to be recognized as a robot, which essentially comes down to two things: using human tools and behaving like a human. This post will guide you through everything you can use to cover yourself and all the tools websites use to block you.
When you open your browser and go to a webpage, it almost always means that you are asking an HTTP server for some content. And one of the easiest ways to pull content from an HTTP server is to use a classic command-line tool such as cURL.
The thing is, if you just run `curl www.google.com`, Google has many ways to know that you are not a human, just by looking at the headers, for example. Headers are small pieces of information that travel with every HTTP request that hits a server, and one of them precisely describes the client making the request: the infamous "User-Agent" header. Just by looking at the "User-Agent" header, Google knows that you are using cURL. If you want to learn more about headers, the Wikipedia page is great, and to experiment, just go over here, a webpage that simply displays the header information of your request.
Headers are really easy to alter with cURL, and copying the User-Agent header of a legitimate browser could do the trick. In the real world, you'd need to set more than just one header, but more generally it is not very difficult to artificially craft an HTTP request, with cURL or any library, that looks exactly like a request made by a browser. Everybody knows that, so to determine whether you are using a real browser, websites will check for something that cURL and HTTP libraries cannot do: JS execution.
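To make this concrete, here is a minimal Python sketch (standard library only) of crafting a request with browser-like headers. The header values are copied from a typical desktop Chrome and are illustrative, not a complete or up-to-date set:

```python
import urllib.request

# Headers mimicking a desktop Chrome browser (illustrative values).
BROWSER_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}

def browser_like_request(url: str) -> urllib.request.Request:
    """Build a request whose headers look like a real browser's."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)

req = browser_like_request("https://www.google.com")
```

The cURL equivalent is simply passing `-H "User-Agent: Mozilla/5.0 ..."`. But as explained above, matching a few headers is rarely enough on its own.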
The concept is very simple: the website embeds a small snippet of JS in its webpage that, once executed, will "unlock" the page. If you are using a real browser, you won't notice the difference, but if you're not, all you'll receive is an HTML page with some obscure JS in it.
But once again, this solution is not completely bulletproof, mainly because, since Node.js, it is now very easy to execute JS outside of a browser. Still, the web has evolved, and there are other tricks to determine whether you are using a real browser.
Trying to execute JS snippets on the side with Node.js is really difficult and not robust at all. More importantly, as soon as the website has a more complicated check system, or is a big single-page application, cURL and pseudo-JS execution with Node.js become useless. So the best way to look like a real browser is to actually use one.
Headless browsers behave "exactly" like a real browser, except that you can easily drive them programmatically. The most popular is Headless Chrome, a Chrome mode that behaves like Chrome without all the UI wrapping it.
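As a sketch, here is how driving Headless Chrome typically looks with Selenium, assuming Selenium and a local Chrome/chromedriver install (the import is guarded so the snippet degrades gracefully without them, and the `--headless=new` flag assumes a recent Chrome):

```python
# Optional dependency: guard the import so the sketch still loads
# in environments without Selenium installed.
try:
    from selenium import webdriver
    HAVE_SELENIUM = True
except ImportError:
    HAVE_SELENIUM = False

def fetch_page(url: str) -> str:
    """Load a page in Headless Chrome and return the rendered HTML."""
    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")       # run without any UI
    options.add_argument("--window-size=1920,1080")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source  # HTML *after* JS has executed
    finally:
        driver.quit()
```

Unlike cURL, the HTML you get back here is the page after JavaScript has run, which is exactly what defeats the JS-execution checks described above.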
However, this will not be enough, as websites now have tools that allow them to detect headless browsers. This arms race has been going on for a long time.
While these solutions can be easy to get working on your computer, doing this at scale is trickier. And this problem of managing lots of Headless Chrome instances is one of the many we solve at ScrapingBee.
Everyone, and front-end devs most of all, knows how differently every browser behaves. Sometimes it's about rendering CSS, sometimes JS, sometimes just internal properties. Most of those differences are well known, and it is now possible to detect whether a browser is actually who it pretends to be. In other words, the website asks itself: "do all of this browser's properties and behaviors match what I know about the User-Agent it sent?"
This is why there is an everlasting arms race between scrapers who want to pass themselves as a real browser and websites who want to distinguish headless from the rest.
However, in this arms race, web scrapers tend to have a big advantage and here is why.
Another thing to know is that, while running 20 cURL commands in parallel is trivial, Headless Chrome, though relatively easy to use for small use cases, can be tricky to run at scale. Because it uses lots of RAM, managing more than 20 instances of it is a challenge.
If you want to learn more about browser fingerprinting, I suggest you take a look at Antoine Vastel's blog, which is entirely dedicated to this subject.
That's about all you need to know about pretending to use a real browser. Let's now take a look at how to behave like a real human.
TLS stands for Transport Layer Security; it is the successor of SSL and is what the "S" in HTTPS stands for.
This protocol ensures privacy and data integrity between two or more communicating computer applications, in our case, a web browser or a script and an HTTP server.
Similarly to browser fingerprinting the goal of TLS fingerprinting is to uniquely identify users based on the way they use TLS.
How this protocol works can be split into two big parts.
First, when the client connects to the server, a TLS handshake happens. During this handshake, many messages are exchanged between the two to ensure that everyone is actually who they claim to be.
Then, if the handshake has been successful, the protocol describes how the client and the server should encrypt and decrypt the data in a secure way. If you want a detailed explanation, check out this great introduction by Cloudflare.
Most of the data points used to build the fingerprint come from the TLS handshake, and if you want to see what a TLS fingerprint looks like, you can visit this awesome online database.
On this website, we see that the most used fingerprint was used 22.19% of the time last week (at the time of writing this article).
This number is very big, at least two orders of magnitude higher than the share of the most common browser fingerprint. It's actually logical, as a TLS fingerprint is computed using far fewer parameters than a browser fingerprint.
Those parameters are, amongst others: the TLS version, the list of supported cipher suites, the list of extensions, and the supported elliptic curves and point formats.
If you wish to know your own TLS fingerprint, I suggest you visit this website.
Ideally, in order to increase your stealth, you should be changing your TLS parameters when doing web scraping. However, this is harder than it looks.
Firstly, because there are not that many TLS fingerprints out there, simply randomizing those parameters won't work: your fingerprint will be so rare that it will be instantly flagged as fake.
Secondly, TLS parameters are low-level stuff that relies heavily on system dependencies, and changing them is not as straightforward as it seems.
For example, the famous Python `requests` module doesn't support it out of the box. Here are a few resources for doing this kind of thing in your favorite language:
Keep in mind that most of these libraries rely on your system's SSL/TLS implementation (OpenSSL is the most widely used), and you might need to change its version in order to completely alter your fingerprint.
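To get a feel for one of these parameters from Python, the standard-library `ssl` module lets you inspect, and partially restrict, the cipher suites your client offers during the handshake. This is only a sketch of one fingerprint ingredient, not a way to fully control your fingerprint:

```python
import ssl

# Default client-side context: the ciphers it would advertise in the
# ClientHello are one ingredient of a TLS fingerprint.
ctx = ssl.create_default_context()
default_ciphers = [c["name"] for c in ctx.get_ciphers()]

# Restricting the list changes what is advertised. Note that this
# OpenSSL cipher string only affects TLS 1.2 and below; TLS 1.3
# suites are configured separately.
ctx.set_ciphers("ECDHE+AESGCM")
restricted = [c["name"] for c in ctx.get_ciphers()]
print(len(default_ciphers), "->", len(restricted))
```

The exact lists depend on the OpenSSL build behind your Python, which is precisely why changing your fingerprint can require changing system dependencies.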
A human using a real browser will rarely request 20 pages per second from the same website. So, if you want to request a lot of pages from the same website, you have to trick it into thinking that all those requests come from different places in the world, i.e., different IP addresses. In other words, you need to use proxies.
Proxies are not very expensive these days: about $1 per IP. However, if you need to make more than ~10k requests per day on the same website, costs can go up quickly, as hundreds of addresses may be needed. One thing to consider is that proxy IPs need to be constantly monitored in order to discard the ones that stop working and replace them.
There are also a lot of free proxy lists. I don't recommend using these, because the proxies are often slow and unreliable, and the websites offering them are not always transparent about where the proxies are located. Free proxy lists are usually public, and therefore their IPs will be automatically banned by most websites. Proxy quality is very important: anti-crawling services are known to maintain internal lists of proxy IPs, so any traffic coming from those IPs will be blocked. Be careful to choose a proxy with a good reputation. This is why I recommend using a paid proxy network or building your own.
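The monitoring part can be as simple as a rotating pool that drops proxies as soon as they fail. A toy sketch (the proxy URLs and the notion of a "dead" proxy are placeholders, not a specific provider's API):

```python
class ProxyPool:
    """Round-robin proxy rotation with removal of dead proxies."""

    def __init__(self, proxies):
        self.proxies = list(proxies)
        self._i = 0  # rotation cursor

    def next_proxy(self) -> str:
        """Return the next working proxy, cycling through the pool."""
        if not self.proxies:
            raise RuntimeError("no working proxies left")
        proxy = self.proxies[self._i % len(self.proxies)]
        self._i += 1
        return proxy

    def mark_dead(self, proxy: str) -> None:
        """Discard a proxy that failed a request or a health check."""
        if proxy in self.proxies:
            self.proxies.remove(proxy)

pool = ProxyPool(["http://1.2.3.4:8080", "http://5.6.7.8:8080"])
p = pool.next_proxy()
pool.mark_dead("http://1.2.3.4:8080")
```

In a real scraper, `mark_dead` would be called whenever a request through that proxy times out or comes back blocked, and new IPs would be fed into the pool to replace them.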
Another proxy type to look into is mobile (3G and 4G) proxies. They are very interesting for scraping hard-to-scrape, mobile-first websites, like social media.
To build your own, you could take a look at Scrapoxy, a great open-source API that allows you to build a proxy API on top of different cloud providers. Scrapoxy creates a proxy pool by spinning up instances on various cloud providers (AWS, OVH, DigitalOcean). You then configure your client to use the Scrapoxy URL as its main proxy, and Scrapoxy automatically assigns a proxy from the pool. Scrapoxy is easily customizable to fit your needs (rate limiting, blacklists, etc.) but can be a little tedious to set up.
You could also use the TOR network, aka The Onion Router. It is a worldwide computer network designed to route traffic through many different servers to hide its origin, which makes network surveillance and traffic analysis very difficult. There are many use cases for TOR, such as privacy, freedom of speech, journalism under dictatorial regimes, and, of course, illegal activities. In the context of web scraping, TOR can hide your IP address and change your bot's IP address every 10 minutes. However, the IP addresses of TOR exit nodes are public, and some websites block TOR traffic with a simple rule: if the server receives a request from one of TOR's public exit nodes, it blocks it. That's why, in many cases, TOR won't help you compared to classic proxies. It's also worth noting that traffic through TOR is inherently much slower because of all that routing.
But sometimes proxies will not be enough: some websites systematically ask you to confirm that you are a human with so-called CAPTCHAs. Most of the time, CAPTCHAs are only displayed to suspicious IPs, so switching proxies will work in those cases. For the others, you'll need a CAPTCHA-solving service (2Captcha and Death by Captcha come to mind).
You have to know that, while some CAPTCHAs can be automatically solved with optical character recognition (OCR), the most recent ones have to be solved by hand.
What this means is that if you use those aforementioned services, on the other side of the API call you'll have hundreds of people solving CAPTCHAs for as little as 20 cents an hour.
But then again, even if you solve CAPTCHAs or switch proxies as soon as you see one, websites can still detect your little scraping job.
A last advanced tool websites use to detect scraping is pattern recognition. So if you plan to scrape every ID from 1 to 10,000 for the URL www.example.com/product/, try not to do it sequentially.
This is one simple example; some websites also compile statistics on browser fingerprints per endpoint. This means that if you don't change some parameters in your headless browser while targeting a single endpoint, they might block you anyway.
Websites also tend to monitor the origin of traffic, so if you want to scrape a website in Brazil, try not to do it with proxies in Vietnam, for example.
But from experience, what I can tell is that rate is the most important factor in request pattern recognition, so the slower you scrape, the lower your chances of being discovered.
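In practice, that means randomizing both the order of your targets and the delay between requests. A minimal sketch (the delay bounds are arbitrary; tune them to the site):

```python
import random

def scraping_plan(ids, min_delay=2.0, max_delay=10.0, seed=None):
    """Return (id, delay) pairs: shuffled order, jittered pacing.

    Avoids the two most obvious patterns: sequential IDs and a
    fixed interval between requests.
    """
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    return [(i, rng.uniform(min_delay, max_delay)) for i in shuffled]

plan = scraping_plan(range(1, 101), seed=42)
```

A scraper would then `time.sleep(delay)` before each request instead of hammering the endpoint at a constant rate.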
Sometimes, the server expects the client to be a machine. In those cases, hiding yourself is way easier.
Basically, this “trick” comes down to two things:
For example, let's say that I want to retrieve all the comments on a famous social network. I notice that when I click on the "load more comments" button, this happens in my inspector:
Notice that we filter out every request except "XHR" ones to avoid noise.
And when we look at what request is being made and what response we get, bingo!
Now if we look at the "Headers" tab, we should have everything we need to replay this request and understand the value of each parameter. This will allow us to make this request from a simple HTTP client.
The hardest part of this process is understanding the role of each parameter in the request. Note that you can right-click on any request in the Chrome DevTools inspector, export it in HAR format, and then import it into your favorite HTTP client (I love Paw and Postman).
This will allow you to have all the parameters of a working request laid out and will make your experimentation much faster and fun.
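Once exported, a HAR file is just JSON, so pulling out what you need to replay a request is straightforward. A minimal sketch (the sample entry is hypothetical, shaped like what Chrome DevTools exports):

```python
def har_entry_to_request(entry: dict) -> dict:
    """Extract the pieces needed to replay one HAR request entry."""
    req = entry["request"]
    return {
        "method": req["method"],
        "url": req["url"],
        "headers": {h["name"]: h["value"] for h in req["headers"]},
        "params": {p["name"]: p["value"] for p in req.get("queryString", [])},
    }

# A hypothetical HAR entry, shaped like a Chrome DevTools export.
sample = {
    "request": {
        "method": "GET",
        "url": "https://example.com/api/comments",
        "headers": [{"name": "User-Agent", "value": "Mozilla/5.0"}],
        "queryString": [{"name": "offset", "value": "20"}],
    }
}
replay = har_entry_to_request(sample)
```

From there, you can feed the method, URL, headers, and parameters straight into any HTTP client and start experimenting with each parameter's role.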
The same principles apply when it comes to reverse engineering a mobile app: you will want to intercept the requests your mobile app makes to the server and replay them with your code.
Doing this is harder for two reasons:
For example, when Pokemon Go was released a few years ago, tons of people cheated the game after reverse engineering the requests the mobile app made.
What they did not know was that the mobile app was sending a "secret" parameter that the cheating scripts did not send. This made it very easy for Niantic to identify the cheaters, and a few weeks later, a massive number of players were banned.
Also, here is an interesting example about someone who reverse engineered the Starbucks API.
I hope this overview helped you better understand web scraping and that you learned a few things reading this post.
Everything I talked about in this post is something we leverage at ScrapingBee, a web scraping API, to handle thousands of requests per second without ever being blocked. Don't hesitate to test our solution if you don't want to lose too much time setting everything up; the first 1k API calls are on us :).
We recently published a guide about the best web scraping tools on the market, don't hesitate to take a look!