What is Web Scraping?
Web scraping has many names: web crawling, data extraction, web harvesting, and a few more.
While there are subtle nuances between these terms, the overall idea is the same: to gather data from a website, transform that data to a custom format, and persist it for later use.
Search engines are a great example of both web crawling and web scraping. They continuously scout the web with the aim of creating a "library" of sites and their content, so that when a user searches for a particular query, they can quickly provide a list of all sites on that topic. Just imagine a web without search engines 😨.
Another famous example is price and product scraping. Here, you would have a list of e-commerce shops, which you regularly scrape for existing and new products. You'd then aggregate all that data locally and normalize it into a standardized data model, applicable to your use case. You could then use that data to provide a price comparison service across all these merchants.
So, web scraping really comes down to:
- Loading the content of a website
- Looking for certain information/data in that content
- Using that information according to your business case (e.g. transforming it, persisting it for later use, passing it to another service)
Of course, it's not just search engines and price comparison. There are plenty of other examples, which we will discuss in more detail under Web Scraping Use Cases in just a second. Stay tuned, please.
But for starters, a brief journey through time back to the early '90s.
History of Web Scraping
The idea of web scraping is pretty much intertwined with the web itself. The first scraper, the Wanderer, appeared as early as 1993 and was intended to build an index of the then brand-new WWW. Only a couple of months later, JumpStation went online and served as the first search engine as we'd define it these days.
For that reason, one could use a traditional HTTP client or library to send a single HTTP request and get the desired data in the response. Especially with the advent of the DOM, XPath, and CSS selectors, it became really easy to parse HTML and extract exactly the data one was after.
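To illustrate how selectors pick out exactly the data one is after, here is a minimal sketch using the limited XPath subset of Python's built-in `xml.etree.ElementTree`. The markup in `SAMPLE`, its class names, and the `extract_products` helper are made up for this example. Note the assumption that the markup is well-formed; real-world HTML rarely is, so in practice a lenient HTML parser would usually take the place of `ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (an assumption for this sketch).
SAMPLE = """
<html>
  <body>
    <div class="product">
      <h2>Blue Mug</h2>
      <span class="price">7.50</span>
    </div>
    <div class="product">
      <h2>Red Mug</h2>
      <span class="price">8.25</span>
    </div>
  </body>
</html>
"""


def extract_products(markup):
    """Return (name, price) tuples for every product block in the markup."""
    root = ET.fromstring(markup)
    products = []
    # XPath: every <div> with class="product", anywhere below the root.
    for node in root.findall(".//div[@class='product']"):
        name = node.findtext("h2")
        price = node.findtext("span[@class='price']")
        products.append((name, float(price)))
    return products


if __name__ == "__main__":
    for name, price in extract_products(SAMPLE):
        print(name, price)
```

The same idea carries over to CSS selectors, where an expression such as `div.product span.price` would address the same elements.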
💡 We have a lovely article on how to manage your headless Chrome browser from Java. Please check it out whenever you have time.
Web Scraping of APIs 🤖
One of the key aspects of web scraping is that you are dealing with data whose structure is not necessarily well-defined and which can change at any time, depending on the mood of the site's designer. This is what web APIs (e.g. REST or SOAP) are supposed to address, by providing a well-defined and unified interface to access data.
Because of this defined and dedicated background, we are typically not talking about scraping in the context of accessing APIs, with one exception however: unofficial and undocumented APIs.
Sites and services do not necessarily make all their APIs public and one can often find a true data treasure, simply by observing from where a website (particularly SPAs) or a mobile application fetches its data.
As always, your browser's developer tools (the magic F12) can be of great assistance in this context and the network tab can quickly uncover potential URLs and parameters of any undocumented API.
With that information it can be easy to reverse-engineer such calls and incorporate that API in one's venture to get data from a particular service and actually receive already well-structured data, instead of fetching and parsing HTML manually.
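As a rough sketch of what such a reverse-engineered call might look like, the following snippet rebuilds a request as one could have observed it in the network tab and parses the JSON response. Everything specific here is an assumption for illustration: the endpoint `API_URL`, the `page` and `per_page` parameters, and the `items`, `name`, and `price` fields of the response.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint, as one might spot it in the browser's network tab.
API_URL = "https://shop.example.com/api/v2/products"


def build_request(page, per_page=50):
    """Recreate the request the site itself sends (parameters are assumed)."""
    query = urlencode({"page": page, "per_page": per_page})
    return Request(f"{API_URL}?{query}", headers={"Accept": "application/json"})


def parse_products(payload):
    """Extract (name, price) pairs from the assumed JSON structure."""
    data = json.loads(payload)
    return [(item["name"], item["price"]) for item in data["items"]]


def fetch_products(page):
    """Send the request and return the parsed products of one result page."""
    with urlopen(build_request(page), timeout=10) as response:
        return parse_products(response.read().decode("utf-8"))
```

Compared to parsing HTML, there is no selector logic at all; the data arrives already structured.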
💡 Rule of thumb: whenever you have the choice between an API and parsing HTML, the former will almost always be the simpler and more stable approach.
Monitoring HTTP connections with a local proxy
Sometimes the developer tools may not be able to provide you with the full picture, and this is where another approach can prove to be quite handy: man-in-the-middle'ing the connection with a local proxy.
There are quite a few solutions in that area out there, some free, some paid, most notably mitmproxy and Charles. For the latter, we actually have a dedicated article on how to use Charles proxy for web scraping.
The Difference Between Web Crawling and Web Scraping
The terms crawling and scraping are often used interchangeably, and people typically mean the same thing: getting data from a site. While crawling and scraping are, in fact, quite similar, there still is one major conceptual difference.
So, just to get our terminology straight, let's check out what is what.
What is web crawling?
While it is technically perfectly feasible to do everything in one go, experience has shown that it typically is best to approach any data extraction project in two steps.
First, you need to know the pages, where you are going to extract data from. Once you have that list, you can actually go about extracting the desired data from the pages on that very list.
Web crawling is exactly that very first step, where you build a list of pages or links, which you will later use as the foundation to get your data. A web crawler typically won't extract data on its own, but it will mostly start at an initial URL, collect links, and then - depending on its configuration - follow these links once again and try to get more links. At this point, it really is rinse and repeat.
The crawler will save each link it found in some form of database, which the scraper will subsequently use for the actual data extraction. Wonderful, this already gave us our cue, the scraper.
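The collect-and-follow loop described above can be sketched in a few lines of Python. This is a simplified illustration, not a production crawler: the `fetch` argument is an assumed callable that returns the HTML for a URL (so the network layer can be swapped out), the returned list stands in for the database, and the sketch neither restricts itself to one domain nor honours `robots.txt`.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(base_url, html):
    """Return absolute URLs (without #fragments) for all links in the page."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl: collect links, follow them, rinse and repeat."""
    seen = {start_url}
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        for link in extract_links(url, fetch(url)):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)
```

The `seen` set is what keeps the crawler from visiting the same page twice, and `max_pages` puts an upper bound on how many pages are fetched.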
And web scraping does what?
You have already guessed it, haven't you? Yeah, the scraper is the second piece in our data puzzle.
Based on the list of pages/URLs our crawler kindly provided, the scraper will handle the actual data extraction. It will iterate over our set of links, request each page, extract the data in question according to the configured logic, possibly transform data, and finally store everything in a database.
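A minimal sketch of that second step might look as follows. It assumes the list of URLs already exists, takes an assumed `fetch` callable that returns the HTML for a URL, extracts the page's `<h1>` text as a stand-in for "the data in question", and stores the result in SQLite. The table name `pages` and its columns are made up for this example.

```python
import sqlite3
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Collects the text found inside <h1> elements."""

    def __init__(self):
        super().__init__()
        self._in_h1 = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.parts.append(data)


def extract_heading(html):
    """Extract the heading text and normalise its whitespace."""
    parser = HeadingParser()
    parser.feed(html)
    return " ".join("".join(parser.parts).split())


def scrape(urls, fetch, db):
    """Request each page, extract the heading, and persist it."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, heading TEXT)"
    )
    for url in urls:
        heading = extract_heading(fetch(url))
        db.execute(
            "INSERT OR REPLACE INTO pages (url, heading) VALUES (?, ?)",
            (url, heading),
        )
    db.commit()
```

Calling `scrape(urls, fetch, sqlite3.connect("pages.db"))` with the crawler's output would then fill the database, one row per page.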
Crawler and Scraper
Very much like Paul McCartney and Stevie Wonder sang in Ebony & Ivory, a crawler and a scraper also live together in perfect harmony and complement one another.
The crawler fetches the URLs, which the scraper will subsequently visit to get the data in question. And that's essentially the difference between web crawling and web scraping.
💡 The API of ScrapingBee covers both areas. The data extraction API can be easily customised to either crawl pages for links or to scrape an existing set of links for data.
All right, that was a lot of theory, let's check out actual use cases, where web scraping is typically employed, shall we?
Web Scraping Use Cases
There are more use cases for web scraping than we could possibly mention in this article, so please bear with us and do not consider the following list exhaustive. Still, we tried to compile a list of demonstrative examples of how web scraping is commonly used.
Let's start with the elephant in the room, search engines.
As we already briefly mentioned in the introduction, search engines were the very first implementation of a crawler/scraper. Their very goal is to index the entire web and make its content searchable for end-users.
For that purpose, they implement the exact crawler-scraper approach, where the crawler is responsible for finding new sites and links and providing them to the scraper, which will then extract the content of each page, index it, and make it available on the search engine just the way you are used to. Pretty common knowledge, right? Let's take a quick break by using Brave to find a few recipes for delicious crepes 🥞
Search Engines pt. II
If we are already on the topic of search engines, let's take the opportunity to check out another use case in their context.
Search engines do not only scrape the web themselves, they are naturally also an excellent choice to find new sites with your own scraper.
Maybe you'd like to find all (new) sites for programming tutorials. Here we go, https://search.brave.com/search?q=programming+tutorial. You can easily scrape that list for sites on that specific subject and then build on that list and, for example, scrape these individual pages.
It's equally easy to scrape a particular website. Use https://search.brave.com/search?q=site%3Aslashdot.org to get a list of all indexed Slashdot pages. You won't need to crawl Slashdot yourself, but the search engine will already provide you with all the pages it found itself.
Using search engines as a source for sites and links can prove to be extremely useful. You just need to make sure to properly tweak your search phrase.
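Building such search URLs programmatically is mostly a matter of correct URL encoding. The following sketch reproduces the two URLs from above; the `search_url` helper and its `site` parameter are just names chosen for this example.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://search.brave.com/search"


def search_url(phrase, site=None):
    """Build a search URL, optionally limited to one site via the site: operator."""
    query = f"site:{site} {phrase}".strip() if site else phrase
    return f"{SEARCH_URL}?{urlencode({'q': query})}"


if __name__ == "__main__":
    print(search_url("programming tutorial"))
    print(search_url("", site="slashdot.org"))
```

`urlencode` takes care of escaping, turning the space into `+` and the colon of `site:` into `%3A`.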
Another area where web scraping is commonly used is anything related to e-commerce and online shops. For example, all the large price comparison sites either have merchants provide the data upfront, or they actively go out and crawl popular online stores for products.
Here, they will simply crawl the product pages and collect data points like current price and availability and aggregate everything in their own database. Once they have a complete set of data from different merchants, they can use the data to provide said price comparison service and let users sort and filter by price, vicinity, availability, and other criteria.
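To make the aggregation step a bit more tangible, here is a small sketch of how scraped data points from different merchants could be normalized into one data model and then compared. The `Offer` fields, the raw keys (`name`, `price`, `availability`), and the price format are assumptions for this example; real shops will each need their own normalization rules.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Offer:
    merchant: str
    product: str
    price: Decimal
    in_stock: bool


def normalize(merchant, raw):
    """Turn one merchant's raw scraped fields into a standardized Offer."""
    price = Decimal(raw["price"].replace("$", "").replace(",", "").strip())
    return Offer(
        merchant=merchant,
        product=raw["name"].strip().lower(),
        price=price,
        in_stock=raw["availability"].strip().lower() == "in stock",
    )


def cheapest(offers, product):
    """Return the cheapest available offer for a product, or None."""
    candidates = [o for o in offers if o.product == product and o.in_stock]
    return min(candidates, key=lambda o: o.price, default=None)
```

Using `Decimal` rather than `float` avoids rounding surprises when prices are compared or summed.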
In addition to consumer-oriented price comparison, there's also the more B2B-focused price monitoring, which is particularly important for businesses that would like to make informed pricing decisions and keep tabs on their competition. Services such as Price2Spy and Prisync will scrape your competitors' sites and provide you with detailed reports on any changes.
Web scraping has also become an essential tool for financial organizations. They scrape the web, news outlets, social media, etc. for information on current live events, as well as for general trends, in order to support workflows regarding trading decisions and their core business.
While a single tweet on Twitter will probably not decide whether they pick up an investment or drop it, an overall trend very well might serve as an indicator. Particularly here, additional technology such as sentiment analysis often plays a key role.
On top of the scraping performed by financial organizations, there are also account aggregator services, which consolidate your financial data and accounts in one place.
While not scrapers by strict definition, these services still collect and gather information from different sources, so they are still performing scraping in a broader sense.
Notable examples here would be Intuit's Mint and Plaid.
Similar to our previous example in e-commerce, classified ads are also often used as a good source of information for market trends or for aggregator services.
Especially aggregators love services such as craigslist, as they can quickly collect lots of relevant ads.
Equally, classified ads can provide good insight into current market trends. Is there an uptick in prices for second-hand models of a certain manufacturer? Is the market being flooded with brand-new gaming consoles? What's the average price of a particular car model?
Just like with classified ads, the sites of news outlets and agencies are an endless all-you-can-eat buffet for aggregator services as well.
It is fairly easy to scrape hundreds of different outlets, use either simple keyword matching or natural language processing to filter duplicates, and voilà, you have a proper world-wide news feed.
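As an illustration of the simple keyword-matching variant, the following sketch treats two headlines as duplicates when their word sets overlap strongly (Jaccard similarity). The `threshold` of 0.6 is an arbitrary assumption and would need tuning on real data; an NLP-based approach would replace the `tokens` and `jaccard` helpers.

```python
import re


def tokens(title):
    """Lower-case a headline and split it into a set of words."""
    return frozenset(re.findall(r"[a-z0-9]+", title.lower()))


def jaccard(a, b):
    """Share of words two headlines have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(titles, threshold=0.6):
    """Keep a headline only if it is not too similar to one already kept."""
    kept = []
    for title in titles:
        words = tokens(title)
        if all(jaccard(words, tokens(other)) < threshold for other in kept):
            kept.append(title)
    return kept
```

This compares every new headline against all kept ones, which is fine for a sketch but would call for a smarter index once the feed grows large.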
As mentioned before, this is not necessarily only used by aggregators, but also by other parties who base decisions on current events. For example, Mention.com specifically monitors news sites for articles where their customers are mentioned (not a bad domain name for that, right? 😎).
Scraping also plays a major role for many journalists. They typically use scrapers to get additional data for articles. That may include legal documents, scientific information, statistical data, as well as open-source data from the government.
If you'd like to read more about their first-hand experience and personal stories, it's definitely worth checking out https://datajournalism.com/read/newsletters/data-scraping-for-stories.
Jobs & Employment
Job listings and anything career related are also one of the main domains of scraping. There are dozens of job aggregator sites (e.g. WeWorkRemotely.com), which scout the web 24/7 for new vacancies.
Although these sites typically support manual job submission as well, a large share of their job listings come from crawling the web, for example the career pages of companies, and adding these vacancies to their own database. Interestingly enough, it's not only those dedicated aggregation services, but even large sites seem to follow that approach.
Another common use case in the context of job sites does actually not even focus that much on the jobs themselves, but is more about lead generation. More on that in just a second under Marketing, Sales, & Leads.
Lots of tourism services (e.g. travel agencies or online reservation platforms) heavily rely on data aggregation as well.
For starters, there is the whole scope of price comparison, which overlaps in many areas with the e-commerce aspect. Additionally, scraping is also used to gather information directly from service providers, for example hotel websites, event locations, and tourist attractions.
With social media being one of the central communication channels these days and billions of users using it every single day, it probably is not too surprising that social media platforms, such as Twitter, Facebook, or YouTube, have also become a vital source of information in the scraping world.
Covering almost all of the topics we have mentioned so far, social media really is almost an ecosystem of its own and most of the examples will apply here as well, to a certain extent.
For example, brand monitoring services regularly scout Twitter hashtags for brand mentions and use sentiment analysis to provide their customers with insight into how their brand might currently be perceived by the public.
Another very common use case is marketing agencies tracking popular social media influencers and ranking them by engagement rate, followers, and outreach. This allows them to find the most appropriate accounts and match them best to their own customer base of advertisers.
Marketing, Sales, & Leads
Data collection is naturally also at the core of marketing and sales campaigns.
While mass collection of email addresses really is considered nothing but spam these days, targeted and professional email campaigns can still be beneficial for your sales figures.
But marketing-related scraping goes way beyond just the collection of email addresses. For example, professional networks such as LinkedIn may also be a great source to generate appropriate leads. This may also be combined with the search engine approach we mentioned earlier.
Scraping, in general, plays a huge role in finding and generating leads. For example, as we already briefly mentioned under Jobs & Employment, many companies even use advertised job openings for that purpose.
If a company's job ad makes mention of Python, .NET, MySQL, or any other technology, one can probably safely assume they are actively using that very technology. Now, if one is providing services around the Python or .NET stack or is offering a managed MySQL cloud service, one may have just found a new lead.
Of course, intelligence services are also using the web to gather information for their purposes. In fact, there's a specialized term for that, OSINT.
While these agencies are probably less interested in the pricing of the latest iPhone model, they will scout the web on various political and social topics, which could be relevant to their governments.
With social media being a major factor in human interaction these days, there will certainly be a very strong focus on these platforms, and data will be collected across groups, channels, hashtags, as well as individual users. However, data will likely also be aggregated from other, more traditional areas, such as websites and forum comments.
Particularly in this context, sentiment analysis plays a crucial role.
While sentiment analysis is not directly related to scraping, it still is a technology which you will regularly come across in the context of web scraping and so it's worth quickly checking out what sentiment analysis is and what it does.
Sentiment analysis essentially attempts to parse and "understand" a given phrase (or even an entire paragraph), in order to rate its polarity. Polarity here means where on the positive-neutral-negative spectrum the given text falls.
Let's have a quick look at the following examples.
| Phrase | Desired sentiment evaluation |
|---|---|
| I really like the pair of boots | ✅ |
| It was a lovely evening and all the dishes were of excellent quality | ✅ |
| There's nothing bad I could say about the place | ✅ |
| The app just works flawlessly. 5 stars | ✅ |
| The rooms had no air condition and breakfast was mediocre | ❌ |
| I wouldn't buy at that shop again | ❌ |
| Flight delayed by four hours | ❌ |
| The food was everything but good | ❌ |
The first four examples had a positive sentiment, whereas the last four had a negative one.
While "I really like" may still seem easy to recognize as a positive statement, it will be more difficult with more complex sentences, especially when negation is used. Even though the "bad" in "nothing bad" may fire a negative rating, it definitely has a positive meaning in this context. On the other hand, "everything but good" decisively expresses the user's disappointment.
Of course, there are also other factors in human speech which may have an impact. For example, sarcasm. One would hardly rate "I really liked how the waiter made us wait for half an hour" as positive, right?
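To show why negation is tricky, here is a deliberately naive, lexicon-based sketch. The word lists and the two-word negation `window` are assumptions picked to fit the example phrases above; real sentiment analysis relies on far larger lexicons or trained models.

```python
import re

# Tiny, hand-picked word lists (assumptions for this sketch only).
POSITIVE = {"like", "lovely", "excellent", "good", "flawlessly"}
NEGATIVE = {"bad", "mediocre", "delayed"}
NEGATORS = {"no", "not", "nothing", "never", "wouldn't"}


def polarity(text, window=2):
    """Return a score above 0 for positive, below 0 for negative text.

    A negator flips the sentiment of the next `window` words.
    """
    score = 0
    remaining = 0  # how many upcoming words are still negated
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in NEGATORS:
            remaining = window
            continue
        value = (word in POSITIVE) - (word in NEGATIVE)
        if remaining > 0:
            value = -value
            remaining -= 1
        score += value
    return score
```

With the negation window, "nothing bad" is correctly rated as positive. "The food was everything but good", however, still comes out positive, because "everything but" is not a simple negator, and sarcasm is entirely out of reach for an approach like this.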
Sentiment analysis is particularly important in the business field (e.g. brand recognition) and is a vital technology for intelligence services as well.
It, admittedly, is a rather complex topic and an in-depth discussion would very much warrant its own article (or even series of articles), but should you be interested in more details, https://en.wikipedia.org/wiki/Sentiment_analysis may serve well as a starting point.
Common Scraping Techniques & Technologies
Of course, the general idea of scraping is always the same.
Your scraper sends a request to a website, receives the content with the response, and extracts the information it is looking for based on a defined logic.
So far, so good. But as always in life, there's not just one way to approach that. You could:
- have a tiny scraper Bash script, which you run manually from the command line
- install a no-code browser extension in your browser
- run a locally installed scraper application
- use a full-fledged SaaS platform and control your scraping via API calls
Sounds like a lot of options, doesn't it? That's true, so let's have a detailed look at each of them.
Local No-Code Scraping
There are plenty of browser-based, no-code scraper implementations which do not require you to write code but allow you to generate a scraping script via point-and-click. Examples for such browser extensions (they usually support Chrome and Firefox) would be
What's nice about these tools is that they do not require an extensive setup or a particular technical background, as you can simply run them in your browser, click through the page, select the desired elements with the mouse, and run the scraper.
Of course, there are also some drawbacks, namely performance and scalability.
Particularly when it comes to the ability to scale, SaaS solutions really shine. They are typically designed to run hundreds and thousands of scraper instances at the same time. They do one thing and they do it well: scraping the web.
SaaS scrapers usually follow the traditional API approach, where you have, for example, a REST interface which allows you to communicate with the platform and launch and monitor scraping tasks. The data found is either returned directly in the response or stored on the platform for further data tasks.
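The launch-and-monitor pattern could look roughly like the following sketch. Please note that the platform, the base URL `API_BASE`, the `/tasks` endpoints, the JSON fields, and the bearer-token authentication are all hypothetical and do not describe any particular service's API; check your provider's documentation for the actual interface.

```python
import json
from urllib.request import Request

# Hypothetical platform URL, for illustration only.
API_BASE = "https://api.scraper-platform.example/v1"


def launch_request(api_key, target_url):
    """Build the request that would launch a scraping task (assumed schema)."""
    body = json.dumps({"url": target_url, "render_js": False}).encode("utf-8")
    return Request(
        f"{API_BASE}/tasks",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def status_request(api_key, task_id):
    """Build the request that would check on a previously launched task."""
    return Request(
        f"{API_BASE}/tasks/{task_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
```

Either request would then be sent with `urllib.request.urlopen`, and the JSON response parsed with `json.loads`.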
ℹ️ Did you know the first 1,000 API requests are totally on us? Promised! Check it out.
Of course, you can also go fully-customized and write your own crawler and scraper from scratch, in the language of your choice.
There's virtually no limit here and you can implement a full scraper even in languages whose main purpose is anything but network scraping. For example, check out our lovely guide on web scraping with R - and R really is a language mostly for statistical computing. What's next, scraping in COBOL? 🤪
But even if you stay with more traditional languages, the choice of libraries, tools, and frameworks is almost endless.
Web Scraping with Unix and Bash
You like Unix and Bash?
Web Scraping with PowerShell
You still like the shell but you are more of a Microsoft guy?
Web Scraping in Java
Well, you like Java, but not so much the Script part?
Yep, you guessed absolutely right. There's plenty to pick from in Sun's, er, Oracle's toolbox here as well, and - of course - we do have an article on that 😀. Welcome to the introduction to web scraping with Java.
Web Scraping in Python
Python, in particular, is quite a popular choice for web scraping. This is due to its gentle learning curve, as well as the fact that it has a rich ecosystem dedicated to scraping.
At ScrapingBee, we actually like Python a lot, which is why we have quite a few tutorials on scraping with Python:
- Web Scraping with Python: Everything you need to know
- BeautifulSoup tutorial: Scraping web pages with Python
- The best Python HTTP clients
- Pyppeteer: the Puppeteer for Python Developers
Web Scraping Frameworks
Staying with Python, there are actually entire scraping frameworks, so that you do not always need to first write all the underlying network boilerplate code, but can focus solely on the data extraction bit.
Avoiding Anti-Scraping Measures
While there is nothing wrong with moderate web scraping per se (please do not DDoS the server), many site owners are nonetheless not too keen on having their content scraped, and they often employ different anti-scraping technologies to make a scraper's life more difficult or to prevent scraping altogether.
Some of those measures include
- request throttling, where your requests get blocked when you exceed a certain number of requests a second or minute
- user agent verification, which tries to analyse certain parameters of the request (user-agent, connection fingerprint) to make sure the request was sent by a regular browser
- CAPTCHAs, which are supposed to be solvable only by people - at least in theory 😊
There are different ways to handle and approach each of them, but the most important factor is trying to lay low and fly under the radar in the first place.
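One simple way to keep a low profile is to limit your own request rate before the site has to do it for you. Here is a minimal sketch of such a throttle; the `Throttle` class is a made-up helper for this example, and a sensible `min_interval` depends entirely on the site you are scraping.

```python
import time


class Throttle:
    """Ensures that at least `min_interval` seconds pass between requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable, which makes the class easy to test
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In a scraper loop, you would create one `Throttle(2.0)` per site and call `wait()` right before each request.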
If you'd like to know more about this subject, I would highly recommend the article Web Scraping without getting blocked.
Web scraping is a very useful way to collect vast amounts of information from the web in an automated fashion, with little supervision.
It is a technology which is used across markets for a wide variety of use cases and, in fact, has proven to be of indispensable value for many businesses, whose core services heavily rely on such data sets.
Scraping can be performed in a large number of ways, be it with locally installed applications, dedicated online services, or even custom-written scraper applications.
One important thing is to tweak your scraper so that it scrapes sites in a reasonable way, does not cause service interruptions, and does not subsequently get blocked.
💡 ScrapingBee provides you with all the tools necessary to successfully scrape sites.
It easily scales from a single-page landing site to entire e-commerce shops and supports rotating client addresses, global proxy networks, browser-engine scraping (headless Chrome), as well as network throttle management.
Check it out at www.scrapingbee.com. We offer a trial with the first 1,000 requests being totally free of charge.
We hope this article gave you a good first overview of web scraping. If you have any questions or would like to know how ScrapingBee can help you to complete your scraping tasks successfully, then please do not hesitate a second to reach out to us. We are happy to help.
Alexander is a software engineer and technical writer with a passion for everything network related.