
20 best web scraping tools for 2020

Kevin Sahin, 19 February 2020 (10 min read)

In this post, we are going to look at the different web scraping tools available, both commercial and open-source.

There are many tools on the market, and depending on your needs it can be hard to make a choice.

In this article, I'm going to briefly explain what each tool does and which one you should use depending on your needs.



ScrapingBee

ScrapingBee is a web scraping API that allows you to scrape the web without getting blocked. We offer both classic (data-center) and premium (residential) proxies, so you will never get blocked again while scraping the web. We also give you the option to render each page inside a real browser (Chrome), which lets us support websites that rely heavily on JavaScript.
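
For example, fetching a JavaScript-heavy page boils down to a single HTTP call. Here is a minimal sketch in Python using the requests library (the API key is a placeholder):

import requests

# Fetch a page through the ScrapingBee API; proxies and the headless
# browser are handled on ScrapingBee's side.
response = requests.get(
    "https://app.scrapingbee.com/api/v1/",
    params={
        "api_key": "YOUR_API_KEY",    # placeholder: your ScrapingBee key
        "url": "https://example.com",
        "render_js": "true",          # render the page in a real Chrome
    },
)
print(response.status_code)
print(response.text[:500])  # first 500 characters of the rendered HTML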


Who should use this web scraping tool?

Developers and tech companies who want to handle the scraping pipeline themselves without having to take care of proxies and headless browsers.

Pro:

  • Easy integration
  • Great documentation
  • Great JavaScript rendering
  • Cheaper than buying proxies, even for large amounts of requests per month

Cons:

  • Cannot be used without developers

DiffBot


DiffBot offers multiple structured APIs that return structured data from product, article and discussion web pages. Their solution is quite expensive, with the lowest plan starting at $299 per month.

Who should use this web scraping tool?

Developers and tech companies.

Developing in-house web scrapers is painful because websites are constantly changing. Let's say you are scraping ten news websites. You need ten different rules (XPath, CSS selectors…) to handle the different cases.

Diffbot can take care of this with their automatic extraction API.
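
As a sketch of how little code this requires, here is a call to Diffbot's v3 Article API from Python (endpoint and response shape taken from Diffbot's public documentation; the token is a placeholder):

import requests

# Ask Diffbot to automatically extract structured data from an article,
# with no XPath or CSS selectors to maintain.
response = requests.get(
    "https://api.diffbot.com/v3/article",
    params={
        "token": "YOUR_DIFFBOT_TOKEN",  # placeholder
        "url": "https://example.com/some-news-article",
    },
)
data = response.json()
article = data["objects"][0]  # extracted fields: title, text, author…
print(article["title"])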

Pro:

  • Easy integration

Cons:

  • Doesn't work on every website
  • Expensive

ScrapeBox


ScrapeBox is desktop software that lets you do many things related to web scraping. From email scraping to keyword scraping, they claim to be the Swiss army knife of SEO.

Who should use this web scraping tool?

SEO professionals and agencies.

Pro:

  • Runs on your local machine
  • Low cost (one-time payment)
  • Feature-rich

Cons:

  • Slow for large scale scraping

ScreamingFrog


ScreamingFrog is a website crawler for Windows, macOS and Ubuntu. It allows you to crawl websites' URLs to perform technical audits and on-site SEO analysis. It can crawl both small and very large websites efficiently, while letting you analyse the results in real time.

Who should use this web scraping tool?

SEO professionals and agencies.

Pro:

  • Runs on your local machine
  • Low cost (one-time payment)
  • Feature-rich

Cons:

  • Slow for large scale scraping

Scrapy


Scrapy is a free and open-source web-crawling framework written in Python. Originally designed for web scraping, it can also be used to extract data using APIs or as a general-purpose web crawler.

Who should use this web scraping tool?

Developers and tech companies with Python knowledge.

Scrapy is great for large-scale web scraping with repetitive tasks:

  • Extracting e-commerce product data
  • Extracting articles from news websites
  • Crawling an entire domain to get every URL
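
To give you an idea, here is a minimal sketch of a Scrapy spider that extracts article titles and follows pagination (the domain and selectors are placeholders, not a real site):

import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    start_urls = ["https://example.com/news"]  # placeholder start page

    def parse(self, response):
        # Yield one item per article block (placeholder selectors).
        for article in response.css("article"):
            yield {"title": article.css("h2::text").get()}
        # Follow the "next page" link, if any, and repeat.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Save it as articles.py and run scrapy runspider articles.py -o articles.json to get the results as a JSON file.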

Pro:

  • Lots of features to solve the most common web scraping problems
  • Actively maintained
  • Great documentation

Cons:

  • None

Goutte


Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.

It also integrates nicely with Guzzle, the popular PHP HTTP client, which allows you to customize the framework for more advanced use cases.

Who should use this web scraping tool?

Developers and tech companies with PHP knowledge.

Pro:

  • Open source
  • Free
  • Actively maintained

Cons:

  • Less popular than Scrapy
  • Fewer integrations than Scrapy

Frontera


Frontera is another web crawling tool.

It is an open-source framework developed to facilitate building a crawl frontier. A crawl frontier is the system in charge of the logic and policies to follow when crawling websites; it plays a key role in more sophisticated crawling systems. It sets rules about which pages should be crawled next, visiting priorities and ordering, how often pages are revisited, and any behaviour you may want to build into the crawl.

It can be used with Scrapy or any other web crawling framework.
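
With Scrapy, the integration is mostly a matter of settings. Here is a sketch of the relevant lines, with the scheduler and middleware paths as given in Frontera's documentation:

# settings.py of your Scrapy project (sketch based on Frontera's docs)
SCHEDULER = "frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler"
SPIDER_MIDDLEWARES = {
    "frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware": 1000,
}
DOWNLOADER_MIDDLEWARES = {
    "frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware": 1000,
}
# Module holding the Frontera-specific settings (backend, crawl policy…);
# "myproject.frontera_settings" is a placeholder name.
FRONTERA_SETTINGS = "myproject.frontera_settings"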

Who should use this web scraping tool?

Developers building large-scale crawlers, typically on top of Scrapy.


PySpider


PySpider is another open-source web crawling tool. It has a web UI that allows you to monitor tasks, edit scripts and view your results.
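
Scripts follow a simple handler pattern. Here is the quickstart-style example from PySpider's documentation, lightly commented (example.com is a placeholder):

from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=24 * 60)  # re-run the seed URL once a day
    def on_start(self):
        self.crawl("https://example.com/", callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)  # consider pages fresh for ten days
    def index_page(self, response):
        # Queue every outgoing link for detailed scraping.
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {"url": response.url, "title": response.doc("title").text()}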

Who should use this web scraping tool?

Developers and tech companies with Python knowledge.

Pro:

  • Open-source
  • Very popular (14k GitHub stars) and an active project
  • Solves lots of common web scraping problems
  • Powerful web UI

Cons:

  • Steep learning curve
  • Uses PhantomJS to render JavaScript pages, which is inferior to Headless Chrome

Mozenda


Mozenda is enterprise web scraping software designed for all kinds of data extraction needs. They claim to work with 30% of the Fortune 500, for use cases like large-scale price monitoring, market research and competitor monitoring.

They can build and host the scraper for you.

Who should use this web scraping tool?

Enterprises with large data extraction projects.

Pro:

  • Great for big companies
  • Can be integrated with any system
  • Can even scrape PDFs

Cons:

  • Expensive

ScrapingHub


ScrapingHub is one of the most well-known web scraping companies. They have a lot of products around web scraping, both open-source and commercial. They are the company behind the Scrapy framework and Portia. They offer Scrapy hosting, meaning you can easily deploy your Scrapy spiders to their cloud.

Who should use this web scraping tool?

ScrapingHub offers lots of developer tools for web scraping. It is aimed at tech companies and individual developers.

Pro:

  • Lots of different products for different use cases
  • Best hosting for Scrapy projects

Cons:

  • Pricing is tricky and can quickly become expensive compared to other options
  • Support seems slow to respond

Import.io


Import.io is an enterprise web scraping platform. Historically, they had a self-serve visual web scraping tool.

Who should use this web scraping tool?

Large companies who want a no-code / low-code web scraping tool to easily extract data from websites.

Pro:

  • One of the best UI
  • Very easy to use

Cons:

  • The tool is self-serve, meaning you won't get much help if you have problems with it
  • Like lots of other visual web scraping tools, it is expensive

Dexi.io


Dexi.io is a visual web scraping platform. One of its most interesting features is the built-in data flows: not only can you scrape data from external websites, you can also transform the data and use external APIs (like Clearbit, Google Sheets…).

Who should use this web scraping tool?

Teams without developers that want to quickly scrape websites and transform the data.

Pro:

  • Intuitive interface
  • Data pipeline
  • Lots of integration

Cons:

  • Pricey
  • Not very flexible

Webscraper.io

Web Scraper is one of the most popular Chrome extensions; it allows you to scrape any website without writing a single line of code, directly inside Chrome!

Here is a screenshot of the interface (accessible within the Chrome dev tools):

Webscraper.io


If the scraping tasks you want to perform need proxies or have to run on a daily basis, they also have a cloud option, where you can run your scraping tasks directly on their servers for a monthly fee.

Who should use this web scraping tool?

Anyone who wants to extract data from websites without writing a single line of code.

Pro:

  • Simple to use

Con:

  • Can't handle complex web scraping scenarios

Parsehub


Parsehub is a web scraping desktop application that allows you to scrape the web, even complicated and dynamic websites and scenarios.

The scraping itself happens on Parsehub's servers; you only have to create the instructions within the app.

Lots of visual web scraping tools are very limited when it comes to scraping dynamic websites, but not Parsehub. For example, you can:

  • Scroll
  • Wait for an element to be displayed on the page
  • Fill inputs and submit forms
  • Scrape data behind a login form
  • Download files and images

Pro:

  • API access (see the sketch after this list)
  • Export to JSON / CSV files
  • Scheduler (you can choose to execute your scraping tasks hourly, daily or weekly)
  • Can be cheaper than buying proxies
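
About that API access: once a run has finished, you can pull its results programmatically. A hedged sketch in Python, with the endpoint as described in ParseHub's public API docs (the project token and API key are placeholders):

import requests

# Download the data extracted by the last completed run of a project.
response = requests.get(
    "https://www.parsehub.com/api/v2/projects/PROJECT_TOKEN/last_ready_run/data",
    params={"api_key": "YOUR_API_KEY", "format": "json"},  # placeholders
)
print(response.json())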

Cons:

  • Steep learning curve
  • Expensive

Octoparse


Octoparse is another web scraping tool with a desktop application (Windows only, sorry macOS users 🤷‍♂️).

It is very similar to Parsehub.

The pricing is cheaper than Parsehub's, but we found the tool more complicated to use.

You can do both cloud extraction and local extraction.

Pro:

  • Great pricing

Cons:

  • Steep learning curve
  • Windows only

Simplescraper.io

Simplescraper is a very easy-to-use Chrome extension for quickly extracting data from a website.

You just have to point and click on an element, name it, and voilà.

Pros:

  • Very simple to use
  • Website to data to API in 30 seconds

Cons:

  • Much more limited than Octoparse or Parsehub
  • Expensive for high volume

Dataminer


Dataminer is one of the most famous Chrome extensions for web scraping (186k installations and counting). What is very unique about Dataminer is that it has a lot of features compared to other extensions.

Generally, Chrome extensions are easier to use than desktop applications like Octoparse or Parsehub, but they lack lots of features.

Dataminer fits right in the middle: it can handle infinite scroll, pagination and custom JavaScript execution, all inside your browser.

One of the great things about Dataminer is its public recipe list, which you can search to speed up your scraping. A recipe is a list of steps and rules for scraping a website.

For big websites like Amazon or eBay, you can scrape the search results with a single click, without having to manually click and select the elements you want.

Cons:

  • It is by far the most expensive tool on our list ($200/mo for 9,000 pages scraped per month)

Portia

Portia is another great open source project from ScrapingHub. It's a visual abstraction layer on top of the great Scrapy framework.

Meaning it allows you to create Scrapy spiders without writing a single line of code, using a visual tool.

Portia itself is a web application written in Python. You can run it easily thanks to the Docker image.

Simply run:

docker run -v ~/portia_projects:/app/data/projects:rw -p 9001:9001 scrapinghub/portia

Lots of things can be automated with Portia, but when things get too complicated and custom code/logic needs to be implemented, you can use portia2code (https://github.com/scrapinghub/portia2code) to convert a Portia project into a Scrapy project, in order to add custom logic.

One of the biggest problems with Portia is that it uses the Splash engine to render JavaScript-heavy websites. It works great in many cases, but has severe limitations compared to Headless Chrome: websites built with React.js aren't supported, for example!

Pros:

  • Great “low-code” tool for teams already using Scrapy
  • Open-source

Cons:

  • Limited JavaScript rendering support

WebHarvy


WebHarvy is a desktop application that scrapes websites locally (it runs on your computer, not on a cloud server).

Its visual scraping feature allows you to define extraction rules, just like Octoparse and Parsehub. The difference here is that you only pay for the software once; there isn't any monthly billing.

WebHarvy is good software for fast and simple scraping tasks.

However, there are serious limitations. If you want to perform a large-scale scraping task, it can take a really long time because you are limited by the number of CPU cores on your local computer.

It's also complicated to implement complex logic compared to software like Parsehub or Octoparse.

Pros:

  • One time payment
  • Great for simple scraping tasks

Cons:

  • Limited features compared to the competition
  • User interface isn't as good as Parsehub's or Octoparse's
  • Doesn't support CAPTCHA solving

FMiner

FMiner is another piece of software very similar to WebHarvy.

There are three major differences with WebHarvy:

  • You can record a complete sequence in your browser and replay it with the tool
  • It can solve CAPTCHAs
  • You can use custom Python code to handle complex logic

Overall FMiner is a really good visual web scraping software.

The only con we see is the price: $249 for the pro version.

Prowebscraper


Prowebscraper is a new online visual web scraping tool.

It has many useful features. As usual, you can select elements with an easy point-and-click interface, and you can export the data in many formats: CSV, JSON and even via a REST API.

For a fee, they can also set up the scraper for you if this is too complicated.

Pro:

  • Easy setup
  • Runs in the cloud

Cons:

  • Expensive ($385/mo for 100k pages scraped per month)

Conclusion

This was a long list!

Web scraping can be done by people with various degrees of experience and knowledge. From developers wanting to perform large-scale data extraction on lots of websites, to growth hackers wanting to extract email addresses from directory websites, there are many options!

I hope this blog post will help you choose the right tool for the job :)

Happy Web Scraping!

Tired of getting blocked while scraping the web? Our API handles headless browsers and rotates proxies for you.