Web Scraping with node-fetch

Shenesh Perera

The introduction of the Fetch API changed how JavaScript developers make HTTP calls: they no longer have to download third-party packages just to make an HTTP request. That was great news for frontend developers, but because fetch could only be used in the browser, backend developers still had to rely on various third-party packages. Then node-fetch came along, aiming to provide the same fetch API that browsers support. In this article, we will take a look at how node-fetch can be used to help you scrape the web!

Prerequisites

To get the full benefit of this article, you should have:

  • Some experience with writing ES6 JavaScript.
  • A proper understanding of promises and some experience with async/await.

What is the Fetch API?

Fetch is a specification that aims to standardize what a request, a response, and everything in between look like, a process the standard calls fetching (hence the name fetch). The browser fetch API and node-fetch are implementations of this specification. The biggest and most important difference between fetch and its predecessor XHR is that fetch is built around Promises. This means that developers no longer have to fear callback hell, messy code, and the extremely verbose API that XHR has.

There are a few more technical differences. For example, when a request returns with an HTTP status code 404, the promise returned from the fetch call doesn't get rejected; it resolves to a Response whose ok property is false.
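
Because an error status such as 404 does not reject the promise, the status has to be checked explicitly. A minimal sketch, assuming a hypothetical helper named ensureOk (it is not part of node-fetch) that receives the Response that fetch resolves to:

```javascript
// Hypothetical helper: turns an unsuccessful HTTP status into a thrown error.
// `response` is expected to be the Response object that fetch resolves to.
const ensureOk = (response) => {
  // response.ok is true only for status codes in the 200-299 range
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response;
};

// Usage, inside an async function with fetch in scope:
//   const response = ensureOk(await fetch('https://reddit.com/'));
```

With a helper like this, HTTP errors can be handled in the same try/catch block as network errors.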

node-fetch brings all of this to the server-side. This means that developers no longer have to learn different APIs, their various terminologies, or how fetching actually happens behind the scenes to perform HTTP requests from the server-side. It's as simple as running npm install node-fetch and writing HTTP requests almost the same way you would in a browser.

Scraping the web with node-fetch and cheerio

To get the ball rolling, you must first install cheerio alongside node-fetch. node-fetch lets us get the HTML of any page, but the result is just a bunch of text, so you will need some tooling to extract what you need from it. cheerio helps with that: it provides a very intuitive jQuery-like API that allows you to extract data from the HTML you received with node-fetch.

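Make sure you have a package.json. If not:

  • Generate one by running npm init.
  • Then install cheerio and node-fetch by running the following command: npm install cheerio node-fetch

Note that the examples in this article load node-fetch with require(), which works with node-fetch v2 (npm install node-fetch@2). node-fetch v3 is published as an ES module only and has to be loaded with import instead.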

For the purpose of this article, we will scrape Reddit:

const fetch = require('node-fetch');

const getReddit = async () => {
	const response = await fetch('https://reddit.com/');
	const body = await response.text();
	console.log(body); // prints the raw HTML of the page
	return body;
};

fetch has a single mandatory argument, which is the resource URL. When fetch is called, it returns a promise that resolves to a Response object as soon as the server responds with the headers. At this point, the body is not yet available. The returned promise resolves regardless of whether the request succeeded at the HTTP level: it is only rejected on network errors such as connectivity issues, which means that it still resolves even if the server responds with a 500 Internal Server Error. The Response class implements the Body interface, which wraps the underlying stream of response data and offers a convenient set of promise-based methods for consuming that stream.

Body.text() is one of them, and since Response implements Body, all the methods that Body has can be used by a Response instance. Calling any of these methods returns a promise that eventually resolves to the data.
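
As an illustration of these methods, the sketch below picks between json() and text() based on the Content-Type header. readBody is a hypothetical helper name introduced for this example; response.headers.get(), response.json(), and response.text() are part of the fetch Response interface.

```javascript
// Hypothetical helper: reads the body as JSON when the server says it is JSON,
// and as plain text (for example HTML) otherwise.
const readBody = async (response) => {
  const contentType = response.headers.get('content-type') || '';
  if (contentType.includes('application/json')) {
    return response.json(); // resolves to the parsed value
  }
  return response.text(); // resolves to a string
};

// Usage, inside an async function with fetch in scope:
//   const data = await readBody(await fetch('https://reddit.com/'));
```

Keep in mind that a body can only be consumed once, so call a single one of these methods per response.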

With this data, in this case HTML text, we can use cheerio to create a DOM and then query it to extract what interests you. For example, if you want a list of all the posts in the feed, you could get the selector for the post list (using your browser's dev tools) and then use cheerio like this:

const fetch = require('node-fetch');
const cheerio = require('cheerio');

const getReddit = async () => {
  // get html text from reddit
  const response = await fetch('https://reddit.com/');
  // using await to ensure that the promise resolves
  const body = await response.text();

  // parse the html text and extract titles
  const $ = cheerio.load(body);
  const titleList = [];
    
  // using CSS selector  
  $('._eYtD2XCVieq6emjKBH3m').each((i, title) => {
    const titleNode = $(title);
    const titleText = titleNode.text();
    
    titleList.push(titleText);
  });

  console.log(titleList);
};

getReddit();

cheerio.load() allows you to parse any HTML text into a queryable DOM. cheerio provides various methods to extract components out of the constructed DOM, one of which is each(), which allows you to iterate over a list of nodes. How do we know that we get a list? We're looking for the titles of the posts on Reddit's home page, and the selector matches every element that carries the title class name. At the time of writing that class name is _eYtD2XCVieq6emjKBH3m, but it may change in the future.

By iterating over the list using each(), you get each HTML element, which you can pass back to cheerio to extract the text out of each title.

This process is fairly intuitive and can be done with any website, as long as the website in question does not have anti-scraping mechanisms to throttle, limit, or prevent you from scraping. While this can be worked around, the effort and dev time required to do so may simply just be unaffordable. This guide can help you out in such cases!

Using the options parameter in node-fetch

fetch has a single mandatory argument and one optional argument, the options object. The options object allows you to customize the HTTP request to suit your needs. Whether you want to send a cookie along with your request or make a POST request (fetch makes GET requests by default), you'll need to define the options argument.

The most common properties that you will make use of are:

  • method - The HTTP method of the request. It is set to GET by default.
  • headers - The headers that you want to pass along with the request.
  • body - The body of your request. You would use the body property if you were making, for example, a POST request.

You can find out about the other properties available to you to customize your HTTP request here. Now to put it all together, let's send a POST request with some cookies and a few query parameters:

const fetch = require('node-fetch');
const { URL, URLSearchParams } = require('url');

(async () => {
	const url = new URL('https://some-url.com');
	const params = { param: 'test'};
	const queryParams = new URLSearchParams(params).toString();
	url.search = queryParams;
	
	const fetchOptions = {
		method: 'POST',
		headers: {
			'cookie': '<cookie>',
		},
		body: JSON.stringify({ hello: 'world' }), // serialize the payload to a JSON string
	};

	await fetch(url, fetchOptions);
})();

Using the URL module, it is very easy to attach query parameters to the URL of the website that you wish to scrape. The URLSearchParams class in particular is useful for this.
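
The same idea can be wrapped in a small function. The sketch below uses the searchParams property of a URL object, which is itself a URLSearchParams instance; buildUrl is a hypothetical helper name and the URL is a placeholder.

```javascript
const { URL } = require('url');

// Hypothetical helper: returns `baseUrl` with `params` appended as a query string.
const buildUrl = (baseUrl, params) => {
  const url = new URL(baseUrl);
  for (const [key, value] of Object.entries(params)) {
    // searchParams takes care of encoding special characters
    url.searchParams.set(key, value);
  }
  return url.toString();
};

console.log(buildUrl('https://some-url.com/search', { q: 'node fetch', page: 2 }));
// → https://some-url.com/search?q=node+fetch&page=2
```

The resulting string can be passed straight to fetch as the resource URL.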

To send an HTTP POST request, you must simply set the method property to POST. You would do the same for any other HTTP request method like PUT or DELETE. To send any cookies alongside the request, you have to make use of the cookie header.
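
To make the shape of the options object concrete, here is a rough sketch of a function that assembles it for a JSON POST request. buildPostOptions and the cookie names are hypothetical; the method, headers, and body properties are the ones described above.

```javascript
// Hypothetical helper: builds the options object for a JSON POST request.
// `cookies` is a plain object such as { session: 'abc' }.
const buildPostOptions = (payload, cookies = {}) => {
  // The Cookie header is a single string of name=value pairs separated by "; "
  const cookieHeader = Object.entries(cookies)
    .map(([name, value]) => `${name}=${value}`)
    .join('; ');

  // The content type tells the server how to parse the body
  const headers = { 'content-type': 'application/json' };
  if (cookieHeader) {
    headers.cookie = cookieHeader;
  }

  return { method: 'POST', headers, body: JSON.stringify(payload) };
};

// Usage, inside an async function with fetch in scope:
//   await fetch('https://some-url.com', buildPostOptions({ hello: 'world' }, { session: 'abc' }));
```

Changing the method value to PUT or DELETE is all it takes to reuse the same structure for other request types.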

Making fetch requests in parallel

At times you may want to make multiple different fetch calls to different URLs at the same time. Doing them one after the other will ultimately lead to bad performance and hence long wait times for your end-users.

To solve this problem, you should parallelize your code. Sending an HTTP request consumes very few of your computer's resources; it takes time only because your computer is waiting, idle, for the server to respond. We call this kind of task “I/O-bound”, as opposed to tasks that are slow because they consume a lot of computing power, which are “CPU-bound”.

“I/O-bound” tasks can be efficiently parallelized with promises. And since fetch is promise-based, you can make use of Promise.all to make multiple fetch calls at the same time like this:

const newProductsPagePromise = fetch('https://some-website.com/new-products');
const recommendedProductsPagePromise = fetch('https://some-website.com/recommended-products');

// Returns a promise that resolves to a list of the results
Promise.all([newProductsPagePromise, recommendedProductsPagePromise]); 
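
To actually use the responses, the result of Promise.all has to be awaited and each body read. A rough sketch, assuming a hypothetical helper named fetchAllPages that receives the fetch function as a parameter, so that it works with node-fetch or any compatible implementation:

```javascript
// Hypothetical helper: downloads every URL at the same time and resolves to
// an array of page bodies, in the same order as `urls`.
const fetchAllPages = async (urls, fetchFn) => {
  // Start all requests before awaiting any of them
  const responses = await Promise.all(urls.map((url) => fetchFn(url)));
  // Reading the bodies is asynchronous too, so do that in parallel as well
  return Promise.all(responses.map((response) => response.text()));
};

// Usage, inside an async function:
//   const fetch = require('node-fetch');
//   const [newProducts, recommendedProducts] = await fetchAllPages(
//     ['https://some-website.com/new-products', 'https://some-website.com/recommended-products'],
//     fetch
//   );
```

Promise.all rejects as soon as any one of its promises rejects, so a single network failure fails the whole batch. When partial results are acceptable, Promise.allSettled is an alternative worth considering.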

Conclusion

With that, you've just mastered node-fetch for web scraping. Although fetch is great for simple use cases, it can get a tad difficult to get right when you have to deal with Single Page Applications that use JavaScript to render most of their pages. Challenging tasks such as scraping concurrently have to be done by hand, as node-fetch is simply an HTTP request client like any other.

Since you've mastered node-fetch, give ScrapingBee a try: you get the first 1000 requests for free to try it out. Check out the getting started guide here!

Scraping the web is challenging given the fact that anti-scraping mechanisms get smarter day by day. Even if you manage to do it, getting it done right can be quite a tedious task. ScrapingBee allows you to skip the noise and focus only on what matters the most: the data.
