Node-unblocker for Web Scraping

15 February 2022 | 7 min read

Web proxies help you keep your privacy and get around various restrictions while browsing the web. They hide your details, such as the request origin or IP address, and with additional software can even bypass things like rate limits.

node-unblocker is one such web proxy, distributed as a Node.js library. You can use it for web scraping and accessing geo-restricted content, among other uses.

In this article, you’ll learn how to implement and use node-unblocker. You’ll also see its pros, cons, and limitations as compared to a managed service like ScrapingBee.


What Is node-unblocker?

node-unblocker advertises itself as a “Web proxy for evading internet censorship.” It’s a Node.js library with an Express-compatible API, allowing you to get your proxy up and running quickly. Because of its JS interface, it’s highly versatile and can be used in many ways.

One use case of a programmable proxy like node-unblocker is web scraping. Proxying web requests allows you to bypass geographic restrictions and hide your IP; with multiple proxy instances, you can also avoid rate limiting. Overall, a proxy drastically limits the chance of your bots getting blocked.

Implementing node-unblocker

To set up node-unblocker, make sure you have Node.js and npm installed on your system. You can do that by following the official guide from the Node.js website or with a version management tool like nvm.

Creating the Script

First, create a new folder, initialize an npm project, and install the necessary dependencies.

mkdir proxy
cd proxy
npm init -y
npm install unblocker express

Express will allow you to create a web server quickly, while unblocker is the name of the npm package that houses node-unblocker.

With the necessary packages installed, you can start implementing your proxy in a new index.js file.

Start by require()-ing your dependencies.

const express = require("express");
const Unblocker = require("unblocker");

Next, create an Express app and a new Unblocker instance.

// ...
const app = express();
const unblocker = new Unblocker({ prefix: "/proxy/" });

node-unblocker accepts a wide range of options through its config object. You can configure pretty much all aspects of the library, from request details to custom middleware. In fact, because most of the proxy’s functionality is implemented as middleware, you can also selectively enable its features as you see fit.

In the above snippet, only the prefix property is set. It determines the path at which the proxy will be accessible, in this case /proxy/.
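To give you an idea of what a richer configuration could look like, here is a short, illustrative sketch that adds a request middleware logging every proxied URL. It is separate from the main script of this tutorial, and while requestMiddleware is a documented option, treat the exact shape of the data object (data.url here) as an assumption to verify against the unblocker version you install.

// Illustrative configuration sketch, not part of the main script.
// Verify the middleware data fields against your unblocker version.
const unblocker = new Unblocker({
  prefix: "/proxy/",
  // Request middleware runs before each outgoing request;
  // this one simply logs the target URL.
  requestMiddleware: [
    (data) => {
      console.log("Proxying:", data.url);
    },
  ],
});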

Because of the Express-compatible API, all you need to do to connect the proxy instance with your Express server is to call the use() method.

// ...
app.use(unblocker);

Finally, start your Express server using the listen() method.

// ...
app.listen(process.env.PORT || 8080).on("upgrade", unblocker.onUpgrade);
console.log("Proxy is running on port:", process.env.PORT || 8080)

The server will now run on the port set by the PORT environment variable, while defaulting to 8080. Additionally, the upgrade event handler (onUpgrade method) has been attached to the server. This informs the proxy when the connection protocol has been upgraded (or changed) from the established HTTP to, say, WebSocket, enabling proper handling of such connections.

This is how your script should look:

const express = require("express");
const Unblocker = require("unblocker");
const app = express();
const unblocker = new Unblocker({ prefix: "/proxy/" });

app.use(unblocker);
app.listen(process.env.PORT || 8080).on("upgrade", unblocker.onUpgrade);
console.log("Proxy is running on port:", process.env.PORT || 8080);

Testing the Proxy

Test the proxy implementation by running the script using Node.js.

node index.js

If everything works correctly, you should see the console.log() message in your terminal.

To verify the proxy is working, take a URL and prefix it with localhost:[PORT]/proxy/, for example: http://localhost:8080/proxy/https://www.scrapingbee.com/. In your DevTools, you should see that all requests are going through the proxy.
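You can also check the proxy from a script. The following is a small, hypothetical helper (not part of the proxy itself) that uses Node's built-in http module to fetch a page through the local proxy and print the response status and the first part of the HTML.

// check-proxy.js - hypothetical sanity check for the locally running proxy.
const http = require("http");

const target = "https://www.scrapingbee.com/";
const proxied = `http://localhost:${process.env.PORT || 8080}/proxy/${target}`;

http.get(proxied, (res) => {
  console.log("Status through proxy:", res.statusCode);

  let body = "";
  res.on("data", (chunk) => (body += chunk));
  res.on("end", () => console.log(body.slice(0, 200)));
});

Run it with node check-proxy.js while the proxy is running in another terminal.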

In case any issues arise with the proxy, set the DEBUG environment variable to see detailed information on every request:

DEBUG=unblocker:* node index.js

Deploying to Heroku

Now that you have a functioning proxy, you can deploy it to a remote server like Heroku.

Acceptable Use Policy

Before you deploy any web scraping or proxy application to a remote server, you should review the provider's Acceptable Use Policy. Not all providers allow this kind of application to be hosted on their servers, and many allow it only under strict conditions.

In the case of Heroku, its policy doesn’t allow hosting proxies for public use or web scraping without respecting robot exclusion standards (like the robots.txt file) and providing a unique user-agent string. Keep this in mind while working with Heroku.
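If you do deploy node-unblocker to Heroku, one way to address the user-agent requirement is another request middleware, added to the requestMiddleware array shown earlier, that stamps an identifying User-Agent on every proxied request. The header name is standard HTTP; the data.headers shape is the same assumption as before.

// Hypothetical middleware: overwrite the browser-supplied user-agent
// with an identifying string (add it to the requestMiddleware array).
function identifyingUserAgent(data) {
  data.headers["user-agent"] = "my-proxy/1.0 (contact: you@example.com)";
}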

Preparing the Script

To deploy your app to Heroku, first adjust your package.json file.

{
  "name": "proxy",
  "version": "1.0.0",
  "main": "index.js",
  "private": true,
  "engines": {
    "node": "16.x"
  },
  "dependencies": {
    "express": "^4.17.1",
    "unblocker": "^2.3.0"
  },
  "scripts": {
    "start": "node index.js"
  }
}

Add a start script to let Heroku know how to run your app, and an engines section to define which Node.js version to use. For this example, use Node.js v16 (the latest LTS release at the time of writing) and node index.js as the start command.

Deploying with Heroku CLI

Deploying a Node.js app to Heroku is simple, thanks to the Heroku CLI. Create a Heroku account and install the Heroku CLI on your system.

Authenticate with Heroku from the CLI using the login command:

heroku login

Then create a new Heroku app:

heroku apps:create

You should see your app’s ID, URL, and Git URL displayed in the terminal. Use the ID to set a remote origin for the newly created Git repo.

git init
heroku git:remote -a [APP_ID]

All you need now is to commit your code and deploy it to Heroku.

git add .
git commit -am "Initial commit"
git push heroku master

You should now be able to access your proxy under your Heroku app URL. Test it out as you did locally by going to a URL like https://[APP_ID].herokuapp.com/proxy/https://www.scrapingbee.com/.

Congrats, your proxy is up and running! You can use it as a separate service or pair it directly with a headless browser library like Puppeteer for web scraping.
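As a rough sketch of the latter approach, the snippet below drives Puppeteer (installed separately with npm install puppeteer) at the deployed proxy by prefixing the target URL, exactly as you did in the browser; the URLs are placeholders.

// scrape.js - illustrative only: scrape a page through the deployed proxy with Puppeteer.
const puppeteer = require("puppeteer");

const PROXY_BASE = "https://[APP_ID].herokuapp.com/proxy/";
const TARGET = "https://www.scrapingbee.com/";

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Reach the target through the proxy by prefixing its URL.
  await page.goto(PROXY_BASE + TARGET, { waitUntil: "networkidle2" });
  console.log("Page title:", await page.title());

  await browser.close();
})();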

Limitations of node-unblocker

While the implementation and deployment of node-unblocker are relatively straightforward, the proxy comes with a set of limitations that are hard, if not impossible, to overcome. Additionally, the maintenance effort and other issues you might encounter make running a self-managed proxy a hassle.

On the other hand, a service like ScrapingBee is fully managed, well-supported, and already battle-tested in production environments.

To give you a better look at how these two compare, here are some of node-unblocker’s limitations that you should be aware of.

OAuth Issues

The proxy is unlikely to work well with websites using OAuth forms. In fact, this applies to anything that relies on the postMessage() method. The issue isn't significant and might be fixed in the future; for now, only standard login forms and most AJAX content work correctly.

Issues with Complex Websites

Popular but complex websites like Discord, Twitter, or YouTube (a semi-working example is available) won't work correctly. The content, or part of it, might not show up, or requests might fail, among other issues. Currently, there's no timeline as to when (if ever) this will be fixed.

Maintenance Effort

Like any other complex service, proxies and web scraping apps require a lot of effort to run and maintain. You have to comply with the policies of cloud providers and fully manage your proxy instances, among other things. All these factors add up to significant overhead, especially when running large proxy clusters.

Conclusion

You should now have a good understanding of how to implement a node-unblocker proxy. While it offers a number of benefits, you’ve seen that it carries some limitations as well.

In contrast to node-unblocker, ScrapingBee requires no effort on your part to keep the proxy running. It automates scraping on all kinds of sites and offers a large pool of rotating proxies, so you get the benefits of a web proxy without the drawbacks. ScrapingBee hides the complexity of a web proxy behind a simple, easy-to-use API. For more details, see the documentation.
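For comparison, a request through ScrapingBee's HTML API looks roughly like the sketch below; the endpoint and query parameters come from ScrapingBee's public documentation, and the API key is a placeholder.

// Illustrative ScrapingBee request using Node's built-in https module.
const https = require("https");

const apiKey = "YOUR_API_KEY"; // placeholder
const target = encodeURIComponent("https://www.scrapingbee.com/");
const apiUrl = `https://app.scrapingbee.com/api/v1/?api_key=${apiKey}&url=${target}`;

https.get(apiUrl, (res) => {
  let html = "";
  res.on("data", (chunk) => (html += chunk));
  res.on("end", () => console.log(html.slice(0, 200)));
});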

Arek Nawo

Arek Nawo is a web developer, freelancer, and creator of CodeWrite.