Web Scraping in C++ with Gumbo Parser and Libcurl

21 September 2022 | 9 min read

Web scraping is a common technique for harvesting data online: an HTTP client fetches the pages a user asks for, and an HTML parser combs through the returned markup to extract the information of interest. It helps programmers more easily get at the data they need for their projects.

There are a number of use cases for web scraping. It allows you to access data that might not be available from APIs, as well as data from several disparate sources. It can also help you aggregate and analyze product-related user opinions, and it can provide insights into market conditions such as pricing volatility or distribution issues. However, scraping that data or integrating it into your projects hasn’t always been easy.

Fortunately, web scraping has become more advanced and a number of programming languages support it, including C++. The ever-popular systems programming language also offers a number of features that make it useful for web scraping, such as speed, strict static typing, and language and standard-library facilities including type inference, templates for generic programming, concurrency primitives, and lambda functions.

In this tutorial, you’ll learn how to use C++ to implement web scraping with the libcurl and gumbo libraries. You can follow along on GitHub.


Prerequisites

For this tutorial, you’ll need the following:

  • a basic understanding of HTTP
  • C++11 or newer available on your machine
  • g++ 4.8.1 or newer
  • the libcurl and gumbo C libraries
  • a resource with data for scraping (you’ll use the Merriam-Webster website)

About Web Scraping

For every HTTP request made by a client (such as a browser), a server issues a response. Both requests and responses are accompanied by headers: request headers describe the client and the kind of data it expects, while response headers describe the data the server sends back.

For instance, say you made a request to Merriam-Webster’s website for the definitions of the word “esoteric,” using cURL as a client:

GET /dictionary/esoteric HTTP/2
Host: www.merriam-webster.com
user-agent: curl/7.68.0
accept: */*
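
You can reproduce a request like this yourself with cURL's verbose flag, which prints both the outgoing request headers and the incoming response headers:

$ curl -v https://www.merriam-webster.com/dictionary/esoteric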

The Merriam-Webster site would respond with headers to identify itself as the server, an HTTP response code to signify success (200), the format of the response data—HTML in this case—in the content-type header, caching directives, and additional CDN metadata. It might look like this:

HTTP/2 200
content-type: text/html; charset=UTF-8
date: Wed, 11 May 2022 11:16:20 GMT
server: Apache
cache-control: max-age=14400, public
pragma: cache
access-control-allow-origin: *
vary: Accept-Encoding
x-cache: Hit from cloudfront
via: 1.1 5af4fdb44166a881c2f1b1a2415ddaf2.cloudfront.net (CloudFront)
x-amz-cf-pop: NBO50-C1
x-amz-cf-id: HCbuiqXSALY6XbCvL8JhKErZFRBulZVhXAqusLqtfn-Jyq6ZoNHdrQ==
age: 5787
 
<!DOCTYPE html>
  <html lang="en">
  <head>
  <!--rest of it goes here-->

You should get similar results after you build your scraper. One of the two libraries you’ll use in this tutorial is libcurl, which cURL is written on top of.

Building the Web Scraper

The scraper you’re going to build in C++ will source definitions of words from the Merriam-Webster site, while eliminating much of the typing associated with conventional word searches. Instead, you’ll reduce the process to a single set of keystrokes.

For this tutorial, you will be working in a directory labeled scraper and a single C++ file of the same name: scraper.cc.

Setting up the Libraries

The two C libraries you’re going to use, libcurl and gumbo, work here because C++ interoperates well with C. libcurl is a client-side URL transfer library that supports HTTP (among other protocols) and powers the cURL client used in the previous section, while gumbo is a lightweight HTML5 parser written in C, with bindings for several other languages.

Using vcpkg

Developed by Microsoft, vcpkg is a cross-platform package manager for C/C++ projects. Follow this guide to set up vcpkg on your machine. You can install libcurl and gumbo by typing the following in your console:

$ vcpkg install curl
$ vcpkg install gumbo

If you are working in an IDE, such as Visual Studio, run the following once to enable user-wide integration so that the installed packages are picked up automatically:

$ vcpkg integrate install

To minimize errors in your installations, consider adding the vcpkg directory to your PATH environment variable.
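
If you build your project with CMake instead of invoking g++ directly, you can point CMake at the vcpkg toolchain file so that the installed packages are found automatically. As a rough sketch, replacing <vcpkg-root> with the directory where you cloned vcpkg:

$ cmake -B build -S . -DCMAKE_TOOLCHAIN_FILE=<vcpkg-root>/scripts/buildsystems/vcpkg.cmake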

Using apt

If you’ve used Linux, you should be familiar with apt, which enables you to conveniently source and manage libraries installed on the platform. To install libcurl and gumbo with apt, type the following in your console:

$ sudo apt install libcurl4-openssl-dev libgumbo-dev

Building from Source

If you prefer not to use a package manager, you can build and install both libraries from source, as shown below.

First, clone the curl repository, then build and install it globally. Recent curl releases require you to choose a TLS backend at configure time, hence the --with-openssl flag:

$ git clone https://github.com/curl/curl.git <directory>
$ cd <directory>
$ autoreconf -fi
$ ./configure --with-openssl
$ make
$ sudo make install

Next, clone the gumbo repository and install the package:

$ sudo apt install libtool
$ git clone https://github.com/google/gumbo-parser.git <directory>
$ cd <directory>
$ ./autogen.sh
$ ./configure
$ make && sudo make install
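
Both builds install under /usr/local by default. If the compiler or loader later complains that it cannot find libcurl or libgumbo, refreshing the dynamic linker cache usually resolves it:

$ sudo ldconfig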

Coding the Scraper

The first step in coding the scraper is creating a facility for making an HTTP request. The artifact, a function named request, will allow the dictionary scraping tool to fetch markup from the Merriam-Webster site.

The request function in your scraper.cc file, shown in the snippet below, defines the pieces the scraper needs for a request: a client name passed in the user-agent header to identify the scraper, and a write callback that copies the server's response markup into memory. Its sole parameter is the word whose definitions the scraper fetches; the word forms part of the URL path.

typedef size_t (*curl_write)(char *, size_t, size_t, std::string *);

std::string request(std::string word) {
  CURLcode res_code = CURLE_FAILED_INIT;
  std::string result;
  std::string url = "https://www.merriam-webster.com/dictionary/" + word;

  curl_global_init(CURL_GLOBAL_ALL);
  CURL *curl = curl_easy_init();

  if (curl) {
    // Write callback: append each chunk of the response body to the result string.
    curl_easy_setopt(curl,
      CURLOPT_WRITEFUNCTION,
      static_cast<curl_write>([](char *contents, size_t size,
        size_t nmemb, std::string *data) -> size_t {
        size_t new_size = size * nmemb;
        if (data == NULL) {
          return 0;
        }
        data->append(contents, new_size);
        return new_size;
      }));
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &result);
    curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
    curl_easy_setopt(curl, CURLOPT_USERAGENT, "simple scraper");

    res_code = curl_easy_perform(curl);
    curl_easy_cleanup(curl);

    if (res_code != CURLE_OK) {
      curl_global_cleanup();
      return curl_easy_strerror(res_code);
    }
  }

  curl_global_cleanup();

  return result;
}
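
By default, libcurl does not follow HTTP redirects, so a redirected request would hand back the redirect response rather than the definition page. If you want the scraper to chase redirects as well, you could set one additional option inside the if (curl) block; this is an optional tweak rather than part of the flow above:

curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);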

Remember to include the appropriate headers for the curl library and the C++ string library in the preamble of your .cc or .cpp file; without them, the code above won't compile.

#include <curl/curl.h>
#include <string>

The next step, parsing the markup, requires four functions: scrape, find_definitions, extract_text, and str_replace. Since gumbo is central to all markup parsing, add the appropriate library header as follows:

#include <gumbo.h>

The scrape function parses the markup returned by request with gumbo and feeds the resulting DOM tree to find_definitions for a selective, recursive traversal. It returns a string containing the list of word definitions:

std::string scrape(std::string markup)
{
  std::string res = "";
  GumboOutput *output = gumbo_parse_with_options(&kGumboDefaultOptions, markup.data(), markup.length());
 
  res += find_definitions(output->root);
 
  gumbo_destroy_output(&kGumboDefaultOptions, output);
 
  return res;
}

The find_definitions function below recursively harvests definitions from the span HTML elements whose class attribute contains the identifier "dtText". For each matching node, it extracts the definition text with the extract_text function and appends it to the result.

std::string find_definitions(GumboNode *node)
{
  std::string res = "";
  GumboAttribute *attr;
  if (node->type != GUMBO_NODE_ELEMENT)
  {
    return res;
  }
 
  if ((attr = gumbo_get_attribute(&node->v.element.attributes, "class")) &&
      strstr(attr->value, "dtText") != NULL)
  {
    res += extract_text(node);
    res += "\n";
  }
 
  GumboVector *children = &node->v.element.children;
  for (unsigned int i = 0; i < children->length; ++i)
  {
    res += find_definitions(static_cast<GumboNode *>(children->data[i]));
  }
 
  return res;
}

Next, the extract_text function below extracts the text from every node that is not a script or style tag. The function funnels each piece of text through the str_replace routine, which replaces the leading colon with the > symbol.

std::string extract_text(GumboNode *node)
{
  if (node->type == GUMBO_NODE_TEXT)
  {
    return std::string(node->v.text.text);
  }
  else if (node->type == GUMBO_NODE_ELEMENT &&
           node->v.element.tag != GUMBO_TAG_SCRIPT &&
           node->v.element.tag != GUMBO_TAG_STYLE)
  {
    std::string contents = "";
    GumboVector *children = &node->v.element.children;
    for (unsigned int i = 0; i < children->length; ++i)
    {
      std::string text = extract_text((GumboNode *)children->data[i]);
      // Separate text from sibling nodes with a space.
      if (i != 0 && !text.empty())
      {
        contents.append(" ");
      }
 
      contents.append(str_replace(":", ">", text));
    }
 
    return contents;
  }
  else
  {
    return "";
  }
}

The str_replace function (inspired by a PHP function of the same name) replaces every instance of a specified search string in a larger string with another string. It appears as follows:

std::string str_replace(std::string search, std::string replace, std::string &subject)
{
  for (std::string::size_type pos{};
       subject.npos != (pos = subject.find(search.data(), pos, search.length()));
       pos += replace.length())
  {
    subject.replace(pos, search.length(), replace.data(), replace.length());
  }
 
  return subject;
}
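
To illustrate what str_replace does, the hypothetical snippet below runs a made-up definition fragment through it; the leading colon comes back as a > instead:

std::string sample = ": difficult to understand";
str_replace(":", ">", sample);
// sample now holds "> difficult to understand"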

The strstr call in find_definitions comes from the C string header, and the std::transform and ::tolower calls you'll add shortly come from the algorithm and cctype headers, so include all three:

#include <cstring>
#include <algorithm>
#include <cctype>

Next, you’ll add dynamism to the scraper, enabling it to return definitions for a word supplied as a command-line argument. To do this, you’ll define a function that converts the command-line argument to its lowercase equivalent, which minimizes the likelihood of request errors caused by redirects, and you’ll restrict input to a single command-line argument.

Add the function to convert string inputs to their lowercase equivalents:

std::string strtolower(std::string str)
{
  std::transform(str.begin(), str.end(), str.begin(), ::tolower);
 
  return str;
}

Next is the branching logic that selectively parses a single command-line argument:

if (argc != 2)
{
  std::cout << "Please provide a valid English word" << std::endl;
  exit(EXIT_FAILURE);
}

The primary function in your scraper should appear as shown below:

int main(int argc, char **argv)
{
  if (argc != 2)
  {
    std::cout << "Please provide a valid English word" << std::endl;
    exit(EXIT_FAILURE);
  }
 
  std::string arg = strtolower(argv[1]);
 
  std::string res = request(arg);
  std::cout << scrape(res) << std::endl;
 
  return EXIT_SUCCESS;
}

You should include C++’s iostream library so that the input/output (I/O) primitives used in the main function work as expected:

#include <iostream>

To run your scraper, compile it with g++. Type the following in your console to compile and run your scraper. It should pull the six listed definitions of the word “esoteric”:

$ g++ scraper.cc -lcurl -lgumbo -std=c++11 -o scraper
$ ./scraper esoteric

You should see the definitions of “esoteric” printed to your console, each on its own line and prefixed with the > character that str_replace substitutes for the leading colon.


Conclusion

As you saw in this tutorial, C++, which is normally used for systems programming, also works well for web scraping because its ecosystem offers mature libraries for making HTTP requests and parsing HTML. This added capability can help you expand your knowledge of C++.

You’ll note that this example was relatively simple and did not address how scraping would work for a more JavaScript-heavy website, for instance a single-page application that renders its content in the browser. To scrape a more dynamically rendered site, you could use a headless browser driven through a Selenium WebDriver client library for C++. This topic will be discussed in a future article.

To check your work on this tutorial, consult this GitHub gist.

Bruno Michael Lochemem

Lochemem Bruno Michael is a software engineer from Uganda. Author of the book Functional Programming in PHP and various open source libraries.