Web Scraping with Perl

18 July 2022 | 8 min read

Web scraping is a technique for retrieving data from web pages. It can be done manually but is normally done programmatically. There are many reasons someone might scrape a website:

  • Generating leads for marketing
  • Monitoring prices on a page (and purchasing when the price drops)
  • Academic research
  • Arbitrage betting

Because scraping requires only the ability to make HTTP(S) requests and parse the markup on web pages, almost any programming language can do it. Perl is one such language. Known in several corners of the web as the “Swiss Army knife of programming,” Perl is used for many tasks, such as database programming, game development, web development, and scraping.

This article provides:

  • A brief introduction to web scraping
  • A discussion of the benefits of scraping
  • A demonstration of how to build a simple scraper in Perl

Applications of Web Scraping

We’ve already touched on some reasons to scrape, but that list was far from exhaustive.

Web scrapers make it possible to plug gaps in API response data and retrieve data the API maintainer doesn’t include. For example, the Genius API doesn’t send lyrics as part of its responses; this article will show you how to fill that gap with scraping.

Information collection is a huge part of how scraping fills these gaps. Some companies scrape their competitors' websites to make sure their own prices are competitive and they aren't being undercut. Others scrape the various sites that review their products and aggregate the reviews into one shared document for easier viewing.

The applications of scraping are almost endless, as the insights and value a company can derive from retrieving data from (almost) anywhere on the internet are enormous.

cURL

The simplest demonstration of scraping is a cURL request to a website. (If you would like to learn more about cURL, check out How to follow redirect using cURL? or How to send a POST request using cURL?)
```sh
$ curl -i -v -X GET https://genius.com
```

The server response, essentially composed of the markup on the Genius landing page and the accompanying response headers, is as follows:

```
HTTP/2 200
date: Tue, 03 May 2022 20:06:48 GMT
content-type: text/html; charset=utf-8
cf-ray: 705b9e87f9aabc6c-DUR
accept-ranges: bytes
cache-control: public, s-maxage=180
etag: W/"1bae1849f0a7e6803d98f06c9c43e007"
set-cookie: _genius_ab_test_cohort=98; Max-Age=2147483647; Path=/
vary: X-Requested-With, Accept-Encoding
via: 1.1 vegur
cf-cache-status: HIT
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
set-cookie: _genius_ab_test_primis_mobile=control; Max-Age=2147483647; Path=/
status: 200 OK
x-frame-options: SAMEORIGIN
x-runtime: 1242
server: cloudflare

<!doctype html>
<html>
  <head>
    <title>Genius | Song Lyrics &amp; Knowledge</title>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

<!-- The rest of it goes here -->
```

Alongside the HTML, we also have the HTTP headers that genius.com responded with: things like the content type, cookies, and caching directives.

The content type, alternatively referred to as MIME type, is the response data format—which is HTML in this example.

The cache-control header, in turn, is a set of directives governing how the response may be cached. In the response above, the s-maxage=180 directive indicates that shared caches may store the HTML for 180 seconds.

Cookies are short strings of data sent from the server to the client in the Set-Cookie header. The server header identifies the server software, much as the user-agent header identifies the client. If you want to know more about the headers not discussed in detail here, you can visit MDN.

Web Scraping with Perl

The goal of the scraper you are about to build is to fetch the song lyrics for a specified song available on Genius. This is useful because the song resource in the Genius REST API does not include lyrics. To achieve this, you will need to install Perl’s HTML::TreeBuilder module and use it alongside the Library for the World Wide Web in Perl (LWP) module.

LWP

The Library for WWW in Perl (LWP) is a suite of classes and functions for writing HTTP clients that access data from the web. The library ships out of the box with most Perl distributions, supports a wide range of HTTP functionality, such as multimethod HTTP requests and document and file downloads, and even powers language package managers like CPAN.
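As a minimal sketch of what working with LWP looks like (the target URL here is just a placeholder), you can fetch a page and inspect a few of the response headers discussed earlier:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(timeout => 10);
my $response = $ua->get("https://example.com");

if ($response->is_success) {
  # individual response headers are available by name
  print "content-type: ", $response->header("content-type"), "\n";
  # decoded_content honors the charset the server declared
  print length($response->decoded_content), " bytes received\n";
} else {
  die "Request failed: ", $response->status_line, "\n";
}
```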

You can find the full LWP API specification on the Comprehensive Perl Archive Network (CPAN) or browse through it locally with perldoc by typing the following in a console of your choosing:

```sh
$ perldoc LWP
```

Parsing with TreeBuilder

HTML::TreeBuilder is a Perl module hosted on CPAN whose main responsibility is constructing HTML trees for further selective parsing. Its methods for building and traversing HTML documents and markup-interspersed strings derive from the HTML::Parser and HTML::Element packages it uses under the hood.

You can install TreeBuilder with cpan or cpanm, both of which are installable on most major operating systems per the guide on the official CPAN website. In the cpan REPL (which might require some one-time configuration before first use, such as enabling readline support to ease console input and specifying a base directory for modules installed via the REPL), the following directive should suffice:

```sh
$ cpan
cpan[1]> install HTML::TreeBuilder
```

Alternatively, with cpanm, a minimal CPAN client in the spirit of JavaScript’s npm and PHP’s Composer, type the following to install TreeBuilder:

```sh
$ cpanm HTML::TreeBuilder
```
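With TreeBuilder installed, here is a small sketch (using an inline string of markup rather than a live page) of how it builds a tree and queries it selectively:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# build a tree directly from a string of markup
my $tree = HTML::TreeBuilder->new_from_content(
  '<html><body><div id="song"><p>Six Days</p></div></body></html>'
);

# look_down returns the first element matching all the given criteria
my $div = $tree->look_down(_tag => "div", id => "song");
print $div->as_text, "\n";   # prints "Six Days"

$tree->delete;   # free the tree's memory when done
```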

Coding the Scraper

For this example, we’re going to retrieve the song lyrics for “Six Days” by the American producer DJ Shadow.

The first step is to set up the LWP and TreeBuilder libraries to, respectively, make an HTTP GET request (effectively pulling the lyrics page from Genius) and prime the scraper to parse the resultant HTML:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTML::FormatText;

my $ua = LWP::UserAgent->new;
$ua->agent("Genius Scraper");

my $url  = "https://genius.com/DJ-Shadow-Six-Days-lyrics";
my $root = HTML::TreeBuilder->new();

# perform HTTP GET request (failures are handled via is_success below)
my $request = $ua->get($url);
```

The next step is parsing the markup returned by the servers hosting the Genius app. Parsing, in this case, means encoding the markup as a traversable tree structure. To preempt any Perl encoding errors, if the request succeeds you should supply the parser’s parse method with a UTF-8-decoded version of the returned markup (via decode_utf8 from the Encode module imported earlier):

```perl
if ($request->is_success) {
  $root->parse(decode_utf8 $request->content);
} else {
  # abort: without the page markup there is nothing to scrape
  die "Cannot display the lyrics.\n";
}
```

Upon completing the parsing, invoke the look_down method defined in the TreeBuilder API to traverse the resultant tree and extract the lyrics. Lyrics posted on the Genius platform reside in a div element with the id lyrics-root. Translating that requirement into code produces the following snippet:

```perl
my $data = $root->look_down(
  _tag => "div",
  id   => "lyrics-root"
);
```

Though very useful for debugging tree traversals, TreeBuilder’s dumping methods (inherited from the HTML::Element module) do not offer the best interface for presenting markup. The primitives in the HTML::FormatText module, on the other hand, can tidily display the markup extracted from the HTTP response.
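During development, you can lean on those dumping methods to see exactly what the traversal matched:

```perl
# prints an indented outline of the matched subtree to STDERR (debugging only)
$data->dump;
```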

To print the resulting HTML subtree as a neat string, instantiate HTML::FormatText and print the value returned by its format method, invoked with $data as the only argument.

```perl
my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);

# format returns a string; print it to display the lyrics
print $formatter->format($data);
```

You can now run the scraper by typing the following in a console:

```sh
$ chmod a+x scraper.pl && ./scraper.pl
```

The console output should appear as shown in the snippet below:

```
Six Days Lyrics
---------------

[Verse 1]
At the starting of the week
At summit talks you'll hear them speak
It's only Monday
Negotiations breaking down
See those leaders start to frown
It's sword and gun day

[Hook]
Tomorrow never comes until it's too late

... etc. etc.
```

At this point, the scraper retrieves the lyrics for “Six Days,” but only for that one hard-coded URL.

To add some polish and let it handle any song hosted on the Genius platform, you can parameterize each invocation of the script with a single command-line argument: the relevant song page’s URL slug.

```perl
# expect exactly one argument: the song page's URL slug
if (@ARGV != 1) {
  die "Please provide song input\n";
}
```

A corresponding change to the URL completes the scraper:

```perl
my $url = "https://genius.com/$ARGV[0]";
```

Now that you have done most of the hard work, you can invoke the Genius scraper and extract the lyrics to the song “Six Days” by typing the following:

```sh
$ ./scraper.pl DJ-Shadow-Six-Days-lyrics
```

The code for the scraper is available in a GitHub Gist. Feel free to customize it however you see fit.
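For reference, here is the complete script assembled from the snippets above. Note that the lyrics-root selector reflects Genius’s markup at the time of writing and may change; the guard after look_down is an addition to fail gracefully if it does.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use LWP::UserAgent;
use HTML::TreeBuilder;
use HTML::FormatText;

# expect exactly one argument: the song page's URL slug
if (@ARGV != 1) {
  die "Please provide song input\n";
}

my $ua = LWP::UserAgent->new;
$ua->agent("Genius Scraper");

my $url  = "https://genius.com/$ARGV[0]";
my $root = HTML::TreeBuilder->new();

# perform HTTP GET request
my $request = $ua->get($url);

if ($request->is_success) {
  $root->parse(decode_utf8 $request->content);
  $root->eof;   # signal that the whole document has been fed to the parser
} else {
  die "Cannot display the lyrics.\n";
}

# the lyrics live in a div with the id "lyrics-root"
my $data = $root->look_down(
  _tag => "div",
  id   => "lyrics-root"
);

die "Could not find lyrics on the page\n" unless $data;

my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
print $formatter->format($data);
```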

Conclusion

Scraping the web involves retrieving the contents of a web resource, typically a web page. It has numerous uses, both for individual developers and large-scale corporate projects, ranging from plugging gaps in API response data to enhancing business intelligence. As scraping relies heavily on markup shuttled via HTTP, any language that can make HTTP client requests and parse the resultant HTML can use this technique to harvest data, and Perl is one of the most robust of these languages.

This article first introduced web scraping and its potential uses before explaining how you could use Perl to build a simple Genius scraper to fill a gap in the service’s API. This tutorial should provide a solid grounding for deliberately applying scraping and harvesting data from the web for your own purposes.

Bruno Michael Lochemem

Lochemem Bruno Michael is a software engineer from Uganda. Author of the book Functional Programming in PHP and various open source libraries.