A JavaScript Developer's Guide to curl

11 August 2023 | 7 min read


curl, short for client URL, is a command line tool for transferring data over various protocols, including HTTP and HTTPS. It's available on many platforms, where it's often installed by default, making it a popular tool for testing network requests, scraping the web, and downloading resources.

curl's powerful and versatile feature set, combined with a simple CLI, makes it a go-to choice for many developers, and countless guides and API docs include curl-based examples. It's no wonder developers want to use curl alongside a scripting language like JavaScript for demanding workflows like web scraping.

This article shows you all the ways you can use curl for web scraping in JavaScript. It also explains why it's not necessary to do it this way and how other solutions can better serve you.

Make sure you have the following installed to follow along:

  • Node.js—v16 or newer
  • curl—v7.86.0 or newer recommended

Using curl with JavaScript

Before getting into web scraping with curl and JavaScript, you'll first have to know how to integrate these two tools.

As a command line tool, curl doesn't have a JavaScript API. With that said, there are a few ways to integrate it with Node.js. Let's explore them.

Creating a Dedicated Child Process

Thanks to the Node.js child_process module, you can spawn new subprocesses, for example to execute a shell command like curl. You can do so with the exec() function:

const { exec } = require('child_process')
const command =
    'curl -s -w "\n\n%{json}" "https://httpbin.org/json"'

exec(command, { encoding: 'utf-8' }, (error, stdout, stderr) => {
    if (error !== null) {
        console.log('Error', error, stderr)

        return
    }

    // curl prints the response body first, followed by the -w write-out
    const [response, responseMetadata] = stdout.split('\n\n')

    console.log('Metadata', JSON.parse(responseMetadata))
    console.log('Response', JSON.parse(response))
})

The curl command makes a request to https://httpbin.org/json, with the -s option making curl silent—that is, suppressing the progress bar and error messages—and the -w option appending details about the transfer itself as a JSON object after the response body.

The response body and the transfer metadata can then be split out of stdout and parsed as JSON.

This approach allows you to use curl commands directly in your Node.js app with all their options. However, it complicates output parsing and adds the overhead of spawning a new subprocess and executing a command for each request, which makes it hard to scale and ill-suited for production.

Converting curl Commands

Often, developers who are more experienced with curl than with JavaScript will reach for curl even when all they actually need is to make an HTTP request. For this, you can use Node.js's built-in http/https modules, fetch() in the browser (or in Node.js 18+), or one of the many libraries available on npm, like Axios.

Here's an example of a request, similar to the curl command from before, using the built-in https module:

const https = require('https')

const options = {
    method: 'GET',
    hostname: 'httpbin.org',
    port: 443,
    path: '/json',
    headers: {},
}

const req = https.request(options, function (res) {
    const chunks = []

    res.on('data', function (chunk) {
        chunks.push(chunk)
    })

    res.on('end', function () {
        const body = Buffer.concat(chunks)
        console.log(body.toString())
    })
})

req.end()

If you want to achieve the same with even less code, both in Node.js and in the browser, you can use the Axios library. You can install it with npm install axios and use it in Node.js as follows:

const axios = require('axios')

axios.get('https://httpbin.org/json').then((response) => {
    console.log(response.data)
})
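
Alternatively, the same request can be made with the built-in fetch() API, which is available globally in Node.js 18+ and in browsers; here's a minimal example:

// fetch() is global in Node.js 18+ and in browsers, no import needed
fetch('https://httpbin.org/json')
    .then((response) => response.json())
    .then((data) => console.log(data))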

If you ever find yourself in need of converting a more complex curl command to a language like JavaScript, a curl converter can automate this process.

Converting the curl command to a request in your language or runtime of choice should be your go-to approach. However, if you need to use some of the more advanced features of curl, like proxies or support for different transfer protocols, you'll have to look somewhere else.

Using libcurl

While curl itself is a command line tool, it's actually built on top of a library called libcurl. libcurl is part of the same project and powers all of the curl command's features. In fact, the curl command itself provides a helpful --libcurl option to output libcurl-based C source code that executes the same operation as the command would.
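
For example, appending --libcurl with a file name of your choice (request.c here is just an example name) writes the equivalent C source to that file:

curl -s "https://httpbin.org/json" --libcurl request.c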

libcurl is a C library, but thanks to the node-libcurl module, it has native bindings for Node.js. You can install this module with npm install node-libcurl and use it like so:

const { Curl } = require('node-libcurl')

const curl = new Curl()

curl.setOpt('URL', 'https://httpbin.org/json')
curl.on('end', (_statusCode, body) => {
    console.info(body)
    curl.close()
})
curl.on('error', () => curl.close())
curl.perform()

While node-libcurl's interface isn't as convenient as the command line, it's the best way to access curl's full feature set from Node.js while maintaining great performance.

Web Scraping with node-libcurl

Having established that node-libcurl is a great solution for using curl's underlying library (libcurl) in Node.js, it's time to explore how to use it to scrape the web.

Making GET Requests

For the most basic web scraping, a simple GET request will be enough. This can be done with curl like so:

curl -s https://httpbin.org/forms/post

The -s option again suppresses the progress bar and error messages, while the command itself prints out the page's HTML on a successful request.

Here's the equivalent in Node.js:

const { Curl } = require('node-libcurl')

const curl = new Curl()

curl.setOpt('URL', 'https://httpbin.org/forms/post')
curl.on('end', (_statusCode, body) => {
    console.info(body)
    curl.close()
})
curl.on('error', () => curl.close())
curl.perform()

Since the -s option is meant to modify the console output of the curl command, it has no use when working directly with libcurl. As such, setting the URL option using the setOpt() method is sufficient.

For more advanced web scraping needs, though, you'll likely have to look into other available options to work around issues like rate limiting, geo-blocking, and fingerprinting; a short sketch follows this list. These options include the following:

  • USERAGENT—for setting the user agent string sent with the request
  • CERTINFO—for gathering information about the server's TLS certificate chain on SSL-based protocols (like HTTPS)
  • PROXY—for routing requests through a specified proxy
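
Here's a minimal sketch of how such options can be set with setOpt(), just like URL; the user agent string and proxy address below are placeholders, not real values:

const { Curl } = require('node-libcurl')

const curl = new Curl()

curl.setOpt('URL', 'https://httpbin.org/forms/post')
// Present a browser-like user agent string (placeholder value)
curl.setOpt('USERAGENT', 'Mozilla/5.0 (X11; Linux x86_64)')
// Route the request through a proxy (placeholder address)
curl.setOpt('PROXY', 'http://proxy.example.com:8080')
curl.on('end', (_statusCode, body) => {
    console.info(body)
    curl.close()
})
curl.on('error', () => curl.close())
curl.perform()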

Making POST Requests

POST requests are a bit more complicated than GET as they require providing both the data and the Content-Type header:

curl -X POST https://httpbin.org/anything -H 'Content-Type: application/json' -d '{"key":"value"}'

Translating this to node-libcurl results in the following code:

const { Curl } = require('node-libcurl')

const curl = new Curl()

curl.setOpt('URL', 'https://httpbin.org/anything')
curl.setOpt('CUSTOMREQUEST', 'POST')
curl.setOpt('HTTPHEADER', ['Content-Type: application/json'])
curl.setOpt('POSTFIELDS', JSON.stringify({ key: 'value' }))
curl.on('end', (_statusCode, body) => {
    console.info(body)
    curl.close()
})
curl.on('error', () => curl.close())
curl.perform()

The CUSTOMREQUEST option sets the HTTP method to POST, while HTTPHEADER takes a string array for the HTTP headers (here, the content type). Finally, the payload is set through the POSTFIELDS option to a (serialized) JSON object, in line with the provided Content-Type header.

Extracting Data

Once you have the HTML fetched through libcurl, you need a good tool to extract valuable data from it. When the data you need is contained directly in the HTML, all you need is a good HTML parser like cheerio. Install it with npm install cheerio and use it to parse the HTML like this:

const { Curl } = require('node-libcurl')
const cheerio = require('cheerio')

const curl = new Curl()

curl.setOpt('URL', 'https://httpbin.org/forms/post')
curl.on('end', (_statusCode, body) => {
    curl.close()
    const $ = cheerio.load(body)
    const labels = $('label')
        .map((i, el) => $(el).text().trim().replace(':', '').toLowerCase())
        .toArray()
    const submitButton = $('p')
        .filter((i, el) => {
            return $(el).text().toLowerCase().trim().includes('submit')
        })
        // Grab the <button> inside the matching paragraph and read its label
        .find('button')
        .text()
        .trim()

    console.log(body, labels, submitButton)
})
curl.on('error', () => curl.close())
curl.perform()

Cheerio's API combined with knowledge of the HTML structure makes it relatively easy to query for different elements and traverse the tree. You can use CSS-like selectors to retrieve particular elements and then use other methods to filter or process them, as demonstrated above.

However, if you need to scrape more complex websites or single-page applications (SPAs) that require JavaScript to render or if you need to go through various flows to obtain the data you need, you should consider libraries beyond Cheerio, like jsdom, Puppeteer, or Playwright.
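
For example, here's a rough sketch (assuming Puppeteer is installed with npm install puppeteer) of rendering a page in a headless browser before handing the resulting HTML to Cheerio:

const puppeteer = require('puppeteer')
const cheerio = require('cheerio')

;(async () => {
    const browser = await puppeteer.launch()
    const page = await browser.newPage()
    // Wait until network activity settles so client-side rendering can finish
    await page.goto('https://httpbin.org/forms/post', { waitUntil: 'networkidle0' })
    const html = await page.content()
    await browser.close()

    const $ = cheerio.load(html)
    console.log($('form').length)
})()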

Conclusion

You now know about all kinds of ways to integrate curl with JavaScript—from creating subprocesses to using native bindings through node-libcurl.

However, remember that making a network request is almost always just a small part of the scraping process. There's a lot more you have to do to build a web scraper.

If you prefer not to have to deal with rate limits, proxies, user agents, and browser fingerprints, please check out our no-code web scraping API. Did you know that the first 1,000 calls are on us?

Arek Nawo

Arek Nawo is a web developer, freelancer, and creator of CodeWrite.