Haskell Web Scraping

20 November 2023 | 9 min read


Even though web scraping is commonly done with languages like Python and JavaScript, a statically typed functional programming language like Haskell can provide extra benefits. Types make sure that your scripts do what you want them to do and that the data scraped conforms to your requirements.

In this article, you'll learn how to do web scraping in Haskell with libraries such as Scalpel and webdriver.


Basic Scraping

Scraping a static website can be done in any language that has libraries for an HTTP client and HTML parsing. Haskell is no different. It even has a dedicated high-level scraping library called Scalpel, which gives it an edge over comparable languages such as Rust.

In the first part of this tutorial, you'll use Scalpel to scrape the list of largest cities from Wikipedia.

Setup

To use Scalpel, you need GHC (the Haskell compiler) and Stack (a build and project management tool for Haskell) on your machine. If you don't have them yet, install them via GHCup, which is the currently recommended installation method.

Next, create a new Haskell project using the following command:

stack new scraper 

The command creates a new folder with everything you need for a Haskell project that uses Stack. Move into that folder:

cd scraper

After that, install Scalpel using the following command:

stack install scalpel

Next, add Scalpel and the "text" library as dependencies in the package.yaml file:

dependencies:
- base >= 4.7 && < 5
- scalpel
- text 

In addition to Scalpel, the "text" library will be used to handle strings: it provides the Text type that the scrapers in this tutorial work with.

Then, open app/Main.hs in a code editor. Paste in the following code:

{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import           Data.Text
import           Text.HTML.Scalpel

main :: IO ()
main = putStrLn "Nothing to scrape yet" -- placeholder; you'll replace this shortly

This code imports the libraries you'll use and enables the OverloadedStrings language extension, which lets string literals stand in for the Text values and selectors that Scalpel expects. The main function is just a placeholder for now; you'll replace it in the next section.

During the tutorial, you'll continue working in Main.hs to build out your program and understand the opportunities that the library offers.

Scraping Elements

At the heart of Scalpel are scrapers. They find elements using selectors and return the matching DOM elements, text content, or HTML attributes.

Here's a simple example of a scraper:

heading :: Scraper Text Text 
heading = text "h1" 

It takes HTML as input and returns the text of the first H1 element as output.
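You don't have to hit a live URL to try a scraper out. Scalpel also provides the scrapeStringLike function, which runs a scraper on an in-memory string and is handy for quick experiments (for example in GHCi). Here's a minimal sketch using the same heading scraper:

-- Run the heading scraper on a hard-coded HTML snippet instead of a URL.
-- Like scrapeURL, scrapeStringLike returns a Maybe.
testHeading :: Maybe Text
testHeading = scrapeStringLike "<html><h1>Hello</h1></html>" heading
  where
    heading :: Scraper Text Text
    heading = text "h1"

Evaluating testHeading should give you Just "Hello".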

Scrapers are run against live pages with the scrapeURL function. It takes a URL and a scraper and returns an IO action that downloads the page, runs the scraper on it, and yields the result wrapped in Maybe. That action can then be executed from the main function.

Here's how you can use scrapeURL to create a function for scraping the heading of a page:

scraper :: IO (Maybe Text)
scraper = scrapeURL "https://en.wikipedia.org/wiki/List_of_largest_cities" heading 
  
  where 
    heading :: Scraper Text Text 
    heading = text "h1" 

To run it, you need to execute it in main:

main :: IO ()
main = do
  result <- scraper
  case result of 
    Just x -> print x
    Nothing -> print "Didn't find the necessary items."

Here's what the full code for scraping the heading of the page looks like:

{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import           Data.Text
import           Text.HTML.Scalpel

scraper :: IO (Maybe Text)
scraper = scrapeURL "https://en.wikipedia.org/wiki/List_of_largest_cities" heading

  where
    heading :: Scraper Text Text
    heading = text "h1"

main :: IO ()
main = do
  result <- scraper
  case result of
    Just x  -> print x
    Nothing -> print "Didn't find the necessary items."

You can run it with stack run. It should print out the name of the page, which is List of largest cities.

Scrapers and Selectors in Scalpel

Scalpel offers many types of scrapers and selectors for advanced use.

If you would like to get an attribute of an element, you can use the attr scraper. It takes the name of the attribute you want to access and a selector for an element and returns the contents of that attribute. For example, if you want to access the class name of the h1 element, you can do it with the following code:

headingClass :: Scraper Text Text 
headingClass = attr "class" "h1"  
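Most scrapers also come in plural variants (texts, attrs, chroots, and so on) that return every match instead of only the first one. For example, here's a small sketch that collects the href attribute of every link on a page:

-- Collect the href attribute of every <a> element on the page.
allLinks :: Scraper Text [Text]
allLinks = attrs "href" "a"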

Scalpel selectors can not only target elements of one tag but also match nested elements by using the // operator. For example, the following combination of selectors would work the same way as the div h1 CSS selector:

headingInDiv :: Scraper Text Text 
headingInDiv = text ("div" // "h1")  

It's also possible to use different types of attribute predicates. For example, if you need a div with a specific class, you can use the hasClass predicate:

navbarDiv :: Scraper Text Text 
navbarDiv = text ("div" @: [hasClass "navbar"])

This is identical to having a div.navbar CSS selector.
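hasClass isn't the only attribute predicate. You can also match exact attribute values with the @= operator. As an illustration (the id value below is made up for the example), this targets a div by its id, much like a div#content CSS selector:

-- Equivalent to the div#content CSS selector; the id is only illustrative.
contentDiv :: Scraper Text Text
contentDiv = text ("div" @: ["id" @= "content"])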

If you're used to gradually narrowing down to the element you need through a series of selections, you'll find the chroot scraper useful. As shown below, it selects part of the HTML using the selector you provide and then runs another scraper inside that part, letting you chain selections together.

scraper :: IO (Maybe Text)
scraper = scrapeURL "https://en.wikipedia.org/wiki/List_of_largest_cities" sidebar
  
  where 
    sidebar :: Scraper Text Text 
    sidebar = chroot ("table" @: [hasClass "sidebar"]) sidebarTitle 

    sidebarTitle :: Scraper Text Text
    sidebarTitle = text "th"

Scraping Tables

Now you're ready to scrape the table containing data about the largest cities in the world. Here's how you can do it.

First, create a record to hold data about cities. It will have three fields: name, country, and population:

data City = City
  { name :: Text
  , country :: Text
  , population :: Text
  } deriving (Show, Eq)

Next, create a function for scraping the information by executing a series of scrapers:

allCities :: IO (Maybe [City])
allCities = scrapeURL "https://en.wikipedia.org/wiki/List_of_largest_cities" table

  where

    table :: Scraper Text [City]
    table = chroot ("table" @: [hasClass "static-row-numbers"]) cities

    cities :: Scraper Text [City]
    cities = chroots "tr" city

    city :: Scraper Text City
    city = do
      name <- text "th"
      rows <- texts "td"
      let country = getCountry rows
      let population = getPopulation rows
      return $ City (strip name) country population

    getCountry (x : _) = strip x
    getCountry _       = "Not available"

    getPopulation (_ : y : _) = strip y
    getPopulation _           = "Not available"

Here's what the scrapers do:

  • table finds the table containing the information on the page and uses chroot to execute the cities scraper on that table.
  • cities finds each row of the table and executes the city scraper on each of those rows.
  • city extracts information from a row by scraping the row's header cell (the city name) and its first and second data cells (country and population). Then it returns a City record containing that information.

Run allCities using the following main function:

main :: IO ()
main = do
  cities <- allCities
  print cities

Here's how the final code should look:

{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import           Data.Text
import           Text.HTML.Scalpel


data City = City
  { name       :: Text
  , country    :: Text
  , population :: Text
  } deriving (Show, Eq)

allCities :: IO (Maybe [City])
allCities = scrapeURL "https://en.wikipedia.org/wiki/List_of_largest_cities" table

  where

    table :: Scraper Text [City]
    table = chroot ("table" @: [hasClass "static-row-numbers"]) cities

    cities :: Scraper Text [City]
    cities = chroots "tr" city

    city :: Scraper Text City
    city = do
      name <- text "th"
      rows <- texts "td"
      let country = getCountry rows
      let population = getPopulation rows
      return $ City (strip name) country population

    getCountry (x : _) = strip x
    getCountry _       = "Not available"

    getPopulation (_ : y : _) = strip y
    getPopulation _           = "Not available"


main :: IO ()
main = do
  result <- allCities
  case result of
    Just x  -> print x
    Nothing -> print "Didn't find the necessary items."

If you run it with the stack run command, it should print out a list of records corresponding to the cities in the Wikipedia list.

Scraping with Selenium in Haskell

Since Scalpel only parses HTML, you cannot use it to scrape pages that rely on JavaScript to generate their content dynamically. For these types of pages, you can use webdriver, a library of Selenium bindings that lets you programmatically control a web browser.
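If you want to see the limitation for yourself, here's a minimal sketch using the Scalpel setup from earlier, pointed at the JavaScript version of Quotes to Scrape (the same demo site used in the rest of this section). Because the quotes are injected client-side, the scraper should come back without any quotes:

-- The initial HTML of the JS version contains no quote elements,
-- so this scraper should return an empty result.
staticQuotes :: IO (Maybe [Text])
staticQuotes = scrapeURL "http://quotes.toscrape.com/js/" quotes
  where
    quotes :: Scraper Text [Text]
    quotes = texts ("span" @: [hasClass "text"])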

This section shows you how to use the webdriver library to scrape dynamic websites.

Setup

webdriver is quite an old library and needs a version 2 Selenium server to work. You can download the standalone server JAR from the Selenium project's website (look for the 2.x standalone server releases).

In addition, you'll need a driver for your browser. This tutorial assumes that you use Chrome, so download the ChromeDriver version that matches your version of Chrome from the ChromeDriver downloads page. Extract the file and put it in the same location as your code and the server.

To use the library, add it to the dependencies in your package.yaml:

dependencies:
- base >= 4.7 && < 5
- scalpel
- text
- webdriver

Open a new terminal window and start the Selenium server using the following command:

java -jar .\selenium-server-standalone-2.53.1.jar

Now you're ready to connect to it and drive a browser.

Interacting with Dynamic Elements

The webdriver library enables you to drive a browser.

You can test it out by trying to scrape quotes from the JS-generated version of Quotes to Scrape. Since the quotes on the page are generated by JavaScript, using a simple HTTP client library to scrape them will fail, but you can easily get the quotes by using Selenium to drive a browser.

Replace the code in Main.hs with the following:

{-# LANGUAGE OverloadedStrings #-}

module Main (main) where

import           Data.Text
import           Test.WebDriver


chromeConfig :: WDConfig
chromeConfig = useBrowser chrome defaultConfig

main :: IO ()
main = do
  quotes <- getQuotes
  mapM_ print quotes


getQuotes :: IO [Text]
getQuotes = runSession chromeConfig $ do
    openPage "http://quotes.toscrape.com/js/"
    quoteElems <- findElems (ByCSS "span.text")
    quotes <- traverse getText quoteElems
    closeSession
    return quotes

In the code example above:

  • chromeConfig provides the configuration to connect to a Selenium server running ChromeDriver;
  • getQuotes fetches the quotes from the page; and
  • main executes getQuotes.

Running the code with stack run should result in a Chrome browser starting up, opening the website, and then closing. Quotes from the website should be printed out in your console.
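Reading text is only part of what the library can do. Within the same session, you can also interact with the page: click elements, type into inputs, and so on. As a sketch, reusing the chromeConfig defined above, here's how you could click through to the second page of quotes before scraping them, assuming the site keeps its pagination link at li.next > a (in practice you may also need a short wait for the new content to render):

-- A sketch of interacting with the page: click the "Next" pagination link
-- (assumed to live at li.next > a) and scrape the quotes on the second page.
getSecondPageQuotes :: IO [Text]
getSecondPageQuotes = runSession chromeConfig $ do
    openPage "http://quotes.toscrape.com/js/"
    nextLink <- findElem (ByCSS "li.next > a")
    click nextLink
    quoteElems <- findElems (ByCSS "span.text")
    quotes <- traverse getText quoteElems
    closeSession
    return quotes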

Conclusion

In this article, you learned how to do basic scraping of static websites in Haskell and explored advanced techniques for scraping dynamic websites using Selenium.

Web scraping with Haskell is possible, especially if you're a passionate Haskeller who wants to use it to solve everyday tasks.

However, if you prefer a hassle-free web scraping experience without dealing with rate limits, proxies, user agents, and browser fingerprints, you can check out ScrapingBee's no-code web scraping API. Did you know the first 1,000 calls are on us? Give it a try!

Gints Dreimanis

Gints is a writer and software developer who is excited about making computer science and math concepts accessible to all.