Web Scraping with JavaScript and NodeJS

02 August 2022 (updated) | 23 min read

JavaScript has become one of the most popular and widely used languages due to the massive improvements it has seen and the introduction of the runtime known as NodeJS. Whether it's a web or mobile application, JavaScript now has the right tools. This article will explain how the vibrant ecosystem of NodeJS allows you to efficiently scrape the web to meet most of your requirements.

Prerequisites

This post is primarily aimed at developers who have some level of experience with JavaScript. However, if you have a firm understanding of web scraping but no experience with JavaScript, it may still serve as a light introduction to JavaScript. Still, having experience in the following fields will certainly help:

  • ✅ Experience with JavaScript
  • ✅ Experience using the browser's DevTools to extract selectors of elements
  • ✅ Some experience with ES6 JavaScript (Optional)

⭐ Make sure to check out the resources at the end of this article for more details on the subject!

Outcomes

After reading this post, you will be able to:

  • Have a functional understanding of NodeJS
  • Use multiple HTTP clients to assist in the web scraping process
  • Use multiple modern and battle-tested libraries to scrape the web

Understanding NodeJS: A brief introduction

JavaScript was originally meant to add rudimentary scripting abilities to browsers, in order to allow websites to support more custom ways of interactivity with the user, like showing a dialog box or creating additional HTML content on-the-fly.

For this purpose, browsers provide a runtime environment (with global objects such as document and window) to enable your code to interact with the browser instance and the page itself. And for more than a decade, JavaScript was mostly confined to that use case and to the browser. However, that changed when Ryan Dahl introduced NodeJS in 2009.

NodeJS took Chrome's JavaScript engine and brought it to the server (or rather, the command line). Contrary to the browser environment, it no longer had access to a browser window or cookie storage, but what it got instead was full access to the system resources. Now, it could easily open network connections, store records in databases, or even just read and write files on your hard drive.

Essentially, Node.js introduced JavaScript as a server-side language and provides a regular JavaScript engine, freed from the usual browser sandbox shackles and, instead, pumped up with a standard system library for networking and file access.

The JavaScript Event Loop

What it kept was the Event Loop. As opposed to how many languages handle concurrency, with multi-threading, JavaScript has always used only a single thread and performed blocking operations in an asynchronous fashion, relying primarily on callback functions (or function pointers, as C developers may call them).

Let's quickly check that out with a simple web server example:

const http = require('http');
const PORT = 3000;

const server = http.createServer((req, res) => {
  res.statusCode = 200;
  res.setHeader('Content-Type', 'text/plain');
  res.end('Hello World');
});

server.listen(PORT, () => {
  console.log(`Server running at http://localhost:${PORT}/`);
});

Here, we import the HTTP standard library with require, then create a server object with createServer and pass it an anonymous handler function, which the library will invoke for each incoming HTTP request. Finally, we listen on the specified port - and that's actually it.

There are two interesting bits here and both already hint at our event loop and JavaScript's asynchronicity:

  1. The handler function we pass to createServer
  2. The fact that listen is not a blocking call, but returns immediately

In most other languages, we'd usually have an accept function/method, which would block our thread and return the connection socket of the connecting client. At this point at the latest, we'd have to switch to multi-threading, as otherwise we could handle exactly one connection at a time. In this case, however, we don't have to deal with thread management and we always stay with one thread, thanks to callbacks and the event loop.

As mentioned, listen will return immediately, but - although there's no code following our listen call - the application won't exit immediately. That is because the server is still listening and our callback (the function we passed to createServer) is still registered, which keeps the event loop alive.

Whenever a client sends a request, Node.js will parse it in the background, call our anonymous function, and pass it the request object. The only thing we have to pay attention to here is to return swiftly and not block inside the function itself. Blocking by accident is hard, though, as almost all standard calls are asynchronous (either via callbacks or Promises) - just make sure you don't run while (true); 😀
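
To make the ordering a bit more tangible, here is a minimal sketch (separate from the server above) of how a single thread runs synchronous code first, then Promise callbacks, and only then timer callbacks. The names runDemo and order are made up for this illustration.

```javascript
// Records the order in which the single JavaScript thread runs each piece of code.
function runDemo() {
  const order = [];

  return new Promise((resolve) => {
    order.push('sync: start');

    // Timer callbacks are queued and only run once the current code has finished.
    setTimeout(() => {
      order.push('timer callback');
      resolve(order);
    }, 0);

    // Promise callbacks (microtasks) run right after the current synchronous code.
    Promise.resolve().then(() => order.push('promise callback'));

    order.push('sync: end');
  });
}

const demoResult = runDemo();
demoResult.then((order) => console.log(order.join(' -> ')));
// sync: start -> sync: end -> promise callback -> timer callback
```

Neither callback can interrupt the synchronous code, which is exactly why a long-running loop inside a handler would stall every other request.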

But enough of theory, let's check it out, shall we?

If you have Node.js installed, all you need to do is save the code to the file MyServer.js and run it in your shell with node MyServer.js. Now, just open your browser and load http://localhost:3000 - voilà, you should get a lovely "Hello World" greeting. That was easy, wasn't it?

One could assume the single-threaded approach may come with performance issues, because it only has one thread, but it's actually quite the opposite and that's the beauty of asynchronous programming. Single-threaded, asynchronous programming can have, especially for I/O intensive work, quite a few performance advantages, because one does not need to pre-allocate resources (e.g. threads).
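
As a rough illustration of that point, the following sketch starts four simulated I/O operations on one thread and waits for all of them together. The sleep helper is an assumption standing in for a slow network call, and the timing is approximate.

```javascript
// Resolves after `ms` milliseconds; stands in for a slow I/O call such as an HTTP request.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function waitForAllConcurrently() {
  const started = Date.now();

  // All four "requests" are started before any of them is awaited,
  // so they wait at the same time on a single thread.
  await Promise.all([sleep(100), sleep(100), sleep(100), sleep(100)]);

  return Date.now() - started;
}

const elapsedPromise = waitForAllConcurrently();
elapsedPromise.then((elapsed) => {
  console.log(`4 x 100 ms waits finished after roughly ${elapsed} ms`);
});
```

Run one after another, the four waits would take about 400 ms. Started together, they should finish in roughly the time of a single one, without any extra thread being created.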

All right, that was a very nice example of how we easily create a web server in Node.js, but we are in the business of scraping, aren't we? So let's take a look at the JavaScript HTTP client libraries.

HTTP clients: querying the web

HTTP clients are tools capable of sending a request to a server and then receiving a response from it. Almost every tool that will be discussed in this article uses an HTTP client under the hood to query the server of the website that you will attempt to scrape.

1. Built-In HTTP Client

As mentioned in our server example, Node.js does ship by default with an HTTP library. That library also has a built-in HTTP client.

const http = require('http');

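const req = http.request('http://example.com', res => {
	const chunks = [];

	// The body arrives in chunks (Buffers); collect them and stitch them together at the end
	res.on('data', chunk => chunks.push(chunk));
	res.on('end', () => console.log(Buffer.concat(chunks).toString()));
});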

req.end();

It's rather easy to get started, as there are zero third-party dependencies to install or manage. However - as you can see from our example - the library does require a bit of boilerplate, as it provides the response only in chunks and you eventually need to stitch them together manually. You'll also need to use a separate built-in module (https) for HTTPS URLs.
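
One way to tame that boilerplate is to hide it behind a small Promise-based helper. The sketch below is only one possible shape: getText and demo are hypothetical names, and the throwaway local server exists purely so the example runs without touching the network.

```javascript
const http = require('http');

// Wraps http.get in a Promise and stitches the response chunks together.
function getText(url) {
  return new Promise((resolve, reject) => {
    http
      .get(url, (res) => {
        const chunks = [];
        res.on('data', (chunk) => chunks.push(chunk));
        // Buffer.concat avoids breaking multi-byte characters split across chunks.
        res.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
      })
      .on('error', reject);
  });
}

// Starts a throwaway local server, requests it once, and shuts it down again.
function demo() {
  return new Promise((resolve, reject) => {
    const server = http.createServer((req, res) => {
      res.setHeader('Connection', 'close'); // let the process exit promptly
      res.end('Hello World');
    });

    server.listen(0, '127.0.0.1', async () => {
      try {
        const { port } = server.address();
        resolve(await getText(`http://127.0.0.1:${port}/`));
      } catch (error) {
        reject(error);
      } finally {
        server.close();
      }
    });
  });
}

const bodyPromise = demo();
bodyPromise.then((body) => console.log(body)); // Hello World
```

For an HTTPS URL, the same helper would need require('https') instead of require('http').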

In short, it's convenient because it comes out-of-the-box, but it may require you to write more code than you may want. Hence, let's take a look at the other HTTP libraries. Shall we?

2. Fetch API

Another built-in method would be the Fetch API.

While browsers have supported it for a while already, it took Node.js a bit longer, but as of version 18, Node.js does support fetch(). To be fair, for the time being, it still is considered an experimental feature, so if you prefer to play it safe, you can also opt for the polyfill/wrapper library node-fetch, which provides the same functionality.

While at it, also check out our dedicated article on node-fetch.

The Fetch API heavily uses Promises and coupled with await, that can really provide you with lean and legible code.

async function fetch_demo()
{
	const resp = await fetch('https://www.reddit.com/r/programming.json');

	console.log(await resp.json());
}

fetch_demo();

The only workaround we had to employ was to wrap our code in a function, as top-level await is not available in CommonJS scripts (it is in ES modules). Apart from that, we really just called fetch() with our URL, awaited the response (Promise-magic happening in the background, of course), and used the json() function of our Response object (awaiting again) to get the response. Mind you, an already JSON-parsed response 😲.

Not bad, two lines of code, no manual handling of data, no distinction between HTTP and HTTPS, and a native JSON object.

fetch optionally accepts an additional options argument, where you can fine-tune your request with a specific request method (e.g. POST), additional HTTP headers, or pass authentication credentials.
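
As a hedged sketch of that options argument, the example below sends a POST request with custom headers and a JSON body. It assumes Node.js 18 or later (for the global fetch). The local echo server, the X-Demo-Token header, and the payload are all invented for this illustration, so the example stays offline.

```javascript
const http = require('http');

// Local echo server (an assumption for this sketch) that reports what it received.
const server = http.createServer((req, res) => {
  let body = '';
  req.on('data', (chunk) => (body += chunk));
  req.on('end', () => {
    res.setHeader('Content-Type', 'application/json');
    res.setHeader('Connection', 'close'); // let the process exit promptly
    res.end(JSON.stringify({
      method: req.method,
      token: req.headers['x-demo-token'],
      body: JSON.parse(body),
    }));
  });
});

async function postDemo() {
  await new Promise((resolve) => server.listen(0, '127.0.0.1', resolve));
  const { port } = server.address();

  try {
    const resp = await fetch(`http://127.0.0.1:${port}/`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'X-Demo-Token': 'abc123', // hypothetical credential header
      },
      body: JSON.stringify({ query: 'nodejs' }),
    });

    // fetch only rejects on network errors, so check the status explicitly.
    if (!resp.ok) throw new Error(`Unexpected status ${resp.status}`);
    return await resp.json();
  } finally {
    server.close();
  }
}

const echoPromise = postDemo();
echoPromise.then((echo) => console.log(echo));
```

Note the explicit resp.ok check: a 404 or 500 response still resolves the Promise, so status handling is up to you.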

3. Axios

Axios is pretty similar to Fetch. It's also a Promise-based HTTP client and it runs in both browsers and Node.js. Users of TypeScript will also love its built-in type support.

One drawback, however: contrary to the libraries we mentioned so far, we do have to install it first.

npm install axios

Perfect, let's check out a first plain-Promise example:

const axios = require('axios')

axios
	.get('https://www.reddit.com/r/programming.json')
	.then((response) => {
		console.log(response)
	})
	.catch((error) => {
		console.error(error)
	});

Pretty straightforward. Relying on Promises, we can certainly also use await again and make the whole thing a bit less verbose. So let's wrap it into a function one more time:

async function getForum() {
	try {
		const response = await axios.get(
			'https://www.reddit.com/r/programming.json'
		)
		console.log(response)
	} catch (error) {
		console.error(error)
	}
}

All you have to do is call getForum! You can find the Axios library on GitHub.

4. SuperAgent

Much like Axios, SuperAgent is another robust HTTP client that has support for promises and the async/await syntax sugar. It has a fairly straightforward API like Axios, but SuperAgent has more dependencies and is less popular.

Regardless, making an HTTP request with SuperAgent using promises, async/await, and callbacks looks like this:

const superagent = require("superagent")
const forumURL = "https://www.reddit.com/r/programming.json"

// callbacks
superagent
	.get(forumURL)
	.end((error, response) => {
		if (error) return console.error(error)
		console.log(response)
	})

// promises
superagent
	.get(forumURL)
	.then((response) => {
		console.log(response)
	})
	.catch((error) => {
		console.error(error)
	})

// promises with async/await
async function getForum() {
	try {
		const response = await superagent.get(forumURL)
		console.log(response)
	} catch (error) {
		console.error(error)
	}
}

You can find the SuperAgent library at GitHub and installing SuperAgent is as simple as npm install superagent.

SuperAgent plugins

One feature, that sets SuperAgent apart from the other libraries here, is its extensibility. It features quite a list of plugins which allow for the tweaking of a request or response. For example, the superagent-throttle plugin would allow you to define throttling rules for your requests.

5. Request

Even though it is not actively maintained any more, Request still is a popular and widely used HTTP client in the JavaScript ecosystem.

It is fairly simple to make an HTTP request with Request:

const request = require('request')
request('https://www.reddit.com/r/programming.json', function (
  error,
  response,
  body
) {
  console.error('error:', error)
  console.log('body:', body)
})

What you will definitely have noticed here is that we were using neither plain Promises nor await. That is because Request still employs the traditional callback approach. However, there are a couple of wrapper libraries to support await as well.
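
If you would rather not pull in a wrapper library, a callback with the (error, response, body) shape can also be wrapped by hand. The following is a minimal sketch of that idea. To keep it self-contained, legacyGet is a hypothetical stand-in for a callback-style client and does not perform a real request.

```javascript
// Hypothetical stand-in for a callback-style client such as Request:
// it calls back with (error, response, body) on a later tick.
function legacyGet(url, callback) {
  setImmediate(() => {
    if (!url.startsWith('http')) {
      callback(new Error(`Unsupported URL: ${url}`));
      return;
    }
    callback(null, { statusCode: 200 }, `content of ${url}`);
  });
}

// Turns the (error, response, body) callback into a Promise usable with await.
function getAsync(url) {
  return new Promise((resolve, reject) => {
    legacyGet(url, (error, response, body) => {
      if (error) reject(error);
      else resolve({ response, body });
    });
  });
}

async function main() {
  const { response, body } = await getAsync('https://example.com');
  return `${response.statusCode}: ${body}`;
}

const wrappedResult = main();
wrappedResult.then((line) => console.log(line));
// 200: content of https://example.com
```

Resolving with an object keeps both the response and the body available, which a plain single-value Promise would otherwise lose.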

You can find the Request library at GitHub, and installing it is as simple as running npm install request.

Should you use Request? We included Request in this list because it still is a popular choice. Nonetheless, development has officially stopped and it is not being actively maintained any more. Of course, that does not mean it is unusable, and there are still lots of libraries using it, but that fact alone may make us think twice before we use it for a brand-new project, especially with quite a list of viable alternatives and native fetch support.

Data Extraction in JavaScript

Fetching the content of a site is, undoubtedly, an important step in any scraping project, but it's only the first step and we actually need to locate and extract the data as well. This is what we are going to check out next, how we can handle an HTML document in JavaScript and how to locate and select information for data extraction.

First off, regular expressions 🙂

Regular expressions: the hard way

The simplest way to get started with web scraping without any dependencies, is to use a bunch of regular expressions on the HTML content you received from your HTTP client. But there is a big tradeoff.

While absolutely great in their domain, regular expressions are not ideal for parsing document structures like HTML. Plus, newcomers often struggle with getting them right ("do I need a look-ahead or a look-behind?"). For complex web scraping, regular expressions can also get out of hand. With that said, let's give it a go nonetheless.

Say there's a label with some username in it and we want the username. This is similar to what you'd have to do if you relied on regular expressions:

const htmlString = '<label>Username: John Doe</label>'
const result = htmlString.match(/<label>Username: (.+)<\/label>/)

console.log(result[1])
// John Doe

We are using String.match() here, which will provide us with an array containing the data of the evaluation of our regular expression. As we used a capturing group ((.+)), the second array element (result[1]) will contain whatever that group managed to capture.

While this certainly worked in our example, anything more complex will either not work or will require a way more complex expression. Just imagine you have a couple of <label> elements in your HTML document.
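
To see how quickly this gets tricky, here is a small sketch with two <label> elements (the second username is made up). The original expression now captures far too much, and we need a lazy quantifier plus the global flag to get each value.

```javascript
const html =
  '<label>Username: John Doe</label><label>Username: Jane Roe</label>';

// Greedy (.+) runs on to the LAST closing tag and swallows the markup in between.
const greedy = html.match(/<label>Username: (.+)<\/label>/)[1];

// A lazy quantifier (.+?) plus the global flag captures each label separately.
const usernames = Array.from(
  html.matchAll(/<label>Username: (.+?)<\/label>/g),
  (match) => match[1]
);

console.log(greedy);
// John Doe</label><label>Username: Jane Roe
console.log(usernames);
// [ 'John Doe', 'Jane Roe' ]
```

And this still assumes that the labels carry no attributes, no nested tags, and no line breaks, each of which would call for yet another tweak to the expression.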

Don't get us wrong, regular expressions are an unimaginably great tool, just not for HTML 😊 - so let us introduce you to the world of CSS selectors and the DOM.

Cheerio: Core jQuery for traversing the DOM

Cheerio is an efficient and light library that allows you to use the rich and powerful API of jQuery on the server-side. If you have used jQuery before, you will feel right at home with Cheerio. It provides you with an incredibly easy way to parse an HTML string into a DOM tree, which you can then access via the elegant interface you may be familiar with from jQuery (including function-chaining).

const cheerio = require('cheerio')
const $ = cheerio.load('<h2 class="title">Hello world</h2>')

$('h2.title').text('Hello there!')
$('h2').addClass('welcome')

$.html()
// <html><head></head><body><h2 class="title welcome">Hello there!</h2></body></html>

As you can see, using Cheerio really is almost identical to how you'd use jQuery.

Keep in mind, Cheerio really focuses on DOM-manipulation and you won't be able to directly "port" jQuery functionality, such as XHR/AJAX requests or mouse handling (e.g. onClick), one-to-one in Cheerio.

Cheerio is a great tool for most use cases when you need to handle the DOM yourself. Of course, if you want to crawl a JavaScript-heavy site (e.g. typical Single-page applications) you may need something closer to a full browser engine. We'll be talking about that in just a second, under Headless Browsers in JavaScript.

Time for a quick Cheerio example, wouldn't you agree? To demonstrate the power of Cheerio, we will attempt to crawl the r/programming forum in Reddit and get a list of post names.

First, install Cheerio and Axios by running the following command: npm install cheerio axios.

Then create a new file called crawler.js and copy/paste the following code:

const axios = require('axios');
const cheerio = require('cheerio');

const getPostTitles = async () => {
	try {
		const { data } = await axios.get(
			'https://old.reddit.com/r/programming/'
		);
		const $ = cheerio.load(data);
		const postTitles = [];

		$('div > p.title > a').each((_idx, el) => {
			const postTitle = $(el).text()
			postTitles.push(postTitle)
		});

		return postTitles;
	} catch (error) {
		throw error;
	}
};

getPostTitles()
    .then((postTitles) => console.log(postTitles));

getPostTitles() is an asynchronous function that will crawl the r/programming subreddit. First, the HTML of the website is obtained using a simple HTTP GET request with the Axios HTTP client library. Then, the HTML data is fed into Cheerio using the cheerio.load() function.

Wonderful, we now have a fully parsed HTML document as a DOM tree in $, in good old-fashioned jQuery manner. What's next? Well, it might not be a bad idea to know where to get our post titles from. So, let's right-click one of the titles and pick Inspect. That should take us straight to the right element in the browser's developer tools.

[Screenshot: inspecting the Reddit DOM]

Excellent, equipped with our knowledge of XPath or CSS selectors, we can now easily compose the expression we need for that element. For our example, we chose CSS selectors and the following one just works beautifully.

div > p.title > a

If you used jQuery, you probably know what we are up to, right? 😏

$('div > p.title > a')

You were absolutely right. The Cheerio call is identical to jQuery (there was a reason why we used $ for our DOM variable before) and using Cheerio with our CSS selector will give us the very list of elements matching our selector.

Now, we just need to iterate with each() over all elements and call their text() function to get their text content. 💯 jQuery, isn't it?

So much about the explanation. Time to run our code.

Open up your shell and run node crawler.js. You'll then see an array of about 25 or 26 different post titles (it'll be quite long). While this is a simple use case, it demonstrates the simple nature of the API provided by Cheerio.

If your use case requires the execution of JavaScript and loading of external sources, the following few options will be helpful.

jsdom: the DOM for Node

Similarly to how Cheerio replicates jQuery on the server-side, jsdom does the same for the browser's native DOM functionality.

Unlike Cheerio, however, jsdom does not only parse HTML into a DOM tree, it can also handle embedded JavaScript code and it allows you to "interact" with page elements.

Instantiating a jsdom object is rather easy:

const { JSDOM } = require('jsdom')
const { document } = new JSDOM(
	'<h2 class="title">Hello world</h2>'
).window

const heading = document.querySelector('.title')
heading.textContent = 'Hello there!'
heading.classList.add('welcome')

heading.outerHTML
// <h2 class="title welcome">Hello there!</h2>

Here, we imported the library with require and created a new jsdom instance using the constructor and passed our HTML snippet. Then, we simply used querySelector() (as we know it from front-end development) to select our element and tweaked its attributes a bit. Fairly standard and we could have done that with Cheerio as well, of course.

What sets jsdom apart, however, is the aforementioned support for embedded JavaScript code, and that is what we are going to check out now.

The following example uses a simple local HTML page, with one button adding a <div> with an ID.

const { JSDOM } = require("jsdom")

const HTML = `
	<html>
		<body>
			<button onclick="const e = document.createElement('div'); e.id = 'myid'; this.parentNode.appendChild(e);">Click me</button>
		</body>
	</html>`;

const dom = new JSDOM(HTML, {
	runScripts: "dangerously",
	resources: "usable"
});

const document = dom.window.document;

const button = document.querySelector('button');

console.log("Element before click: " + document.querySelector('div#myid'));
button.click();
console.log("Element after click: " + document.querySelector('div#myid'));

Nothing too complicated here:

  • we require() jsdom
  • set up our HTML document
  • pass HTML to our jsdom constructor (important, we need to enable runScripts)
  • select the button with a querySelector() call
  • and click() it

Voilà, that should give us this output:

Element before click: null
Element after click: [object HTMLDivElement]

Fairly straightforward, and the example showcased how we can use jsdom to actually execute the page's JavaScript code. When we loaded the document, there was initially no <div>. Only once we clicked the button was it added, and by the site's code, not our crawler's code.

In this context, the important details are runScripts and resources. These flags instruct jsdom to run the page's code, as well as fetch any relevant JavaScript files. As jsdom's documentation points out, that could potentially allow any site to escape the sandbox and get access to your local system, just by crawling it. Proceed with caution please.

jsdom is a great library to handle most typical browser tasks within your local Node.js instance, but it still has some limitations and that's where headless browsers really come to shine.

💡 We released a new feature that makes this whole process way simpler. You can now extract data from HTML with one simple API call. Feel free to check the documentation here.

Headless Browsers in JavaScript

Sites are becoming more and more complex, and often regular HTTP crawling won't suffice any more. Instead, one needs a full-fledged browser engine to get the necessary information from a site.

This is particularly true for SPAs which heavily rely on JavaScript and dynamic and asynchronous resources.

Browser automation and headless browsers come to the rescue here. Let's check out how they can help us to easily crawl Single-page Applications and other sites making use of JavaScript.

1. Puppeteer: the headless browser

Puppeteer, as the name implies, allows you to manipulate the browser programmatically, just like how a puppet would be manipulated by its puppeteer. It achieves this by providing a developer with a high-level API to control a headless version of Chrome by default and can be configured to run non-headless.

[Figure: Puppeteer hierarchy, taken from the Puppeteer docs]

Puppeteer is particularly more useful than the aforementioned tools because it allows you to crawl the web as if a real person were interacting with a browser. This opens up a few possibilities that weren't there before:

  • You can get screenshots or generate PDFs of pages.
  • You can crawl a Single Page Application and generate pre-rendered content.
  • You can automate many different user interactions, like keyboard inputs, form submissions, navigation, etc.

It could also play a big role in many other tasks outside the scope of web crawling, like UI testing, assisting with performance optimization, etc.

Quite often, you will probably want to take screenshots of websites or get to know a competitor's product catalog. Puppeteer can be used to do this. To start, install Puppeteer by running the following command: npm install puppeteer

This will download a bundled version of Chromium which takes up about 180 to 300 MB, depending on your operating system. You can avoid that step, and use an already installed setup, by specifying a couple of Puppeteer environment variables, such as PUPPETEER_SKIP_CHROMIUM_DOWNLOAD. Generally, though, Puppeteer does recommend using the bundled version and does not support custom setups.

Let's attempt to get a screenshot and PDF of the r/programming forum on Reddit. Create a new file called crawler.js and copy/paste the following code:

const puppeteer = require('puppeteer')

async function getVisual() {
	try {
		const URL = 'https://www.reddit.com/r/programming/'
		const browser = await puppeteer.launch()

		const page = await browser.newPage()
		await page.goto(URL)

		await page.screenshot({ path: 'screenshot.png' })
		await page.pdf({ path: 'page.pdf' })

		await browser.close()
	} catch (error) {
		console.error(error)
	}
}

getVisual()

getVisual() is an asynchronous function that will take a screenshot of our page, as well as export it as a PDF document.

To start, an instance of the browser is created by running puppeteer.launch(). Next, we create a new browser tab/page with newPage(). Now, we just need to call goto() on our page instance and pass it our URL.

All these functions are of asynchronous nature and will return immediately, but as they are returning a JavaScript Promise, and we are using await, the flow still appears to be synchronous and, hence, once goto "returned", our website should have loaded.

Excellent, we are ready to get pretty pictures. Let's just call screenshot() on our page instance and pass it a path to our image file. We do the same with pdf() and voilà, we should have two new files at the specified locations. Because we are responsible netizens, we also call close() on our browser object, to clean up behind ourselves. That's it.

One thing to keep in mind: when goto() returns, the page has loaded, but it might not be done with all its asynchronous loading. So depending on your site, you may want to add additional logic in a production crawler, to wait for certain JavaScript events or DOM elements.
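
A minimal sketch of such additional logic could look like the helper below. loadAndCollect is a hypothetical name, the 10-second timeout is an arbitrary choice, and the selector you pass in depends entirely on the site you are crawling. The helper expects a Puppeteer page object, such as the one returned by browser.newPage().

```javascript
// `page` is expected to be a Puppeteer Page, e.g. from `await browser.newPage()`.
// `selector` should match an element that signals your content has rendered.
async function loadAndCollect(page, url, selector) {
  // 'networkidle2' resolves once there have been no more than two open
  // network connections for at least 500 ms.
  await page.goto(url, { waitUntil: 'networkidle2' });

  // Rejects with a timeout error if the element does not show up in time.
  await page.waitForSelector(selector, { timeout: 10000 });

  // Runs in the browser context and returns the trimmed text of every match.
  return page.$$eval(selector, (elements) =>
    elements.map((element) => element.textContent.trim())
  );
}
```

Inside a function like getVisual(), you could then call await loadAndCollect(page, URL, 'some-selector') in place of the plain goto() call and take the screenshot afterwards.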

But let's run the code. Pop up a shell window, type node crawler.js, and after a few moments, you should have exactly the two mentioned files in your directory.

It's a great tool and if you are really keen on it now, please also check out our other guides on Puppeteer.

2. Nightmare: an alternative to Puppeteer

Nightmare is another high-level browser automation library like Puppeteer. It uses Electron, and web scraping benchmarks indicate it shows significantly better performance than its predecessor PhantomJS. If Puppeteer is too complex for your use case or there are issues with the default Chromium bundle, Nightmare - despite its name 😨 - may just be the right thing for you.

As so often, our journey starts with NPM: npm install nightmare

Once Nightmare is available on your system, we will use it to find ScrapingBee's website through a Brave search. To do so, create a file called crawler.js and copy/paste the following code into it:

const Nightmare = require('nightmare')
const nightmare = Nightmare()

nightmare
	.goto('https://search.brave.com/')
	.type('#searchbox', 'ScrapingBee')
	.click('#submit-button')
	.wait('#results a')
	.evaluate(
		() => document.querySelector('#results a').href
	)
	.end()
	.then((link) => {
		console.log('ScrapingBee Web Link:', link)
	})
	.catch((error) => {
		console.error('Search failed:', error)
	})

After the usual library import with require, we first create a new instance of Nightmare and save that in nightmare. After that, we are going to have lots of fun with function-chaining and Promises 🥳

  1. We use goto() to load Brave from https://search.brave.com
  2. We type our search term "ScrapingBee" in Brave's search input, with the CSS selector #searchbox (Brave's quite straightforward with its naming, isn't it?)
  3. We click the submit button to start our search. Again, that's with the CSS selector #submit-button (Brave's really straightforward, we love that❣️)
  4. Let's take a quick break until Brave returns the search list. wait(), with the right selector, works wonders here. wait() also accepts a time value in milliseconds, if you need to wait for a specific period of time.
  5. Once Nightmare got the link list from Brave, we simply use evaluate() to run our custom code on the page (in this case querySelector()) and get the first <a> element matching our selector, and return its href attribute.
  6. Last but not least, we call end() to run and complete our task queue.

That's it, folks. end() returns a standard Promise with the value from our call to evaluate(). Of course, you could also use await here.

That was pretty easy, wasn't it? And if everything went all right 🤞, we should now have the link to ScrapingBee's website at https://www.scrapingbee.com

ScrapingBee Web Link: https://www.scrapingbee.com/

Wanna try it yourself? Just run node crawler.js in your shell 👍

3. Playwright, the new web scraping framework

Playwright is the new cross-language, cross-platform headless framework supported by Microsoft.

Its main advantage over Puppeteer is that it works across browsers (Chromium, Firefox, and WebKit) as well as platforms, and that it is very easy to use.

Here is how to simply scrape a page with it:

const playwright = require('playwright');
async function main() {
    const browser = await playwright.chromium.launch({
        headless: false // setting this to true will not run the UI
    });

    const page = await browser.newPage();
    await page.goto('https://finance.yahoo.com/world-indices');
    await page.waitForTimeout(5000); // wait for 5 seconds
    await browser.close();
}

main();
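
The example above only opens the page and waits. As a rough sketch of how you might actually pull data out, the hypothetical helper below waits for an element rather than a fixed timeout and then reads the text of all matches. scrapeTexts is a made-up name and the selector is up to you. The helper expects a Playwright page object, such as the one returned by browser.newPage().

```javascript
// `page` is expected to be a Playwright Page, e.g. from `await browser.newPage()`.
async function scrapeTexts(page, url, selector) {
  await page.goto(url);

  // Wait for the element itself instead of sleeping for a fixed number of seconds.
  await page.waitForSelector(selector);

  const title = await page.title();
  // Collects the text content of every element matching the selector.
  const texts = await page.locator(selector).allTextContents();

  return { title, texts };
}
```

Inside main(), this could replace the goto() and waitForTimeout() calls, for example with const result = await scrapeTexts(page, url, 'some-selector').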

Feel free to check out our Playwright tutorial if you want to learn more.

Summary

Phew, that was a long read! But we hope our examples managed to give you a first glimpse into the world of web scraping with JavaScript and which libraries you can use to crawl the web and scrape the information you need.

Let's give it a quick recap, what we learned today was:

  • ✅ NodeJS is a JavaScript runtime that allows JavaScript to be run server-side. It has a non-blocking nature thanks to the Event Loop.
  • ✅ HTTP clients, such as the native libraries and fetch, as well as Axios, SuperAgent, node-fetch, and Request, are used to send HTTP requests to a server and receive a response.
  • ✅ Cheerio abstracts the best out of jQuery for the sole purpose of running it server-side for web crawling but does not execute JavaScript code.
  • ✅ JSDOM creates a DOM per the standard JavaScript specification out of an HTML string and allows you to perform DOM manipulations on it.
  • ✅ Puppeteer and Nightmare are high-level browser automation libraries that allow you to programmatically manipulate web applications as if a real person were interacting with them.

This article focused on JavaScript's scraping ecosystem and its tools. However, there are certainly also other aspects to scraping, which we could not cover in this context.

For example, sites often employ techniques to recognize and block crawlers. You'll want to avoid these and blend in as a normal visitor. On this subject, and more, we have an excellent, dedicated guide on how not to get blocked as a crawler. Check it out please.

💡 Should you love scraping, but the usual time-constraints for your project don't allow you to tweak your crawlers to perfection, then please have a look at our scraping API platform. ScrapingBee was built with all these things in mind and has got your back in all crawling tasks.

Happy Scraping!

Resources

Would you like to read more? Check these links out:

Kevin Sahin

Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.