Web Scraping with Ruby

11 June 2022 | 19 min read

Introduction

This post covers the main tools and techniques for web scraping in Ruby. We start by building a web scraper with common Ruby HTTP clients and then move on to parsing HTML documents in Ruby.

This approach to web scraping does have its limitations, however, and can come with a fair dose of frustration. Particularly in the context of single-page applications, we will quickly come across major obstacles due to their heavy use of JavaScript. We will take a closer look at how to address this, using web scraping frameworks, in the second part of this article.

Note: This article assumes that the reader is familiar with the Ruby platform. While there is a multitude of gems, we will focus on the most popular ones and use their GitHub metrics (usage, stars, and forks) as indicators. While we won't be able to cover all of their use cases, we will provide good grounds for you to get started and explore more on your own.



Part I: Static pages

0. Setup

In order to be able to code along with this part, you may need to install the following gems:

gem install pry      # debugging tool
gem install nokogiri # HTML parsing gem
gem install httparty # HTTP request gem

Moreover, we will use open-uri, net/http, and csv, which are part of the standard Ruby library, so there's no need for a separate installation. As for Ruby itself, we are using version 3 for our examples, and our main playground will be the file scraper.rb.

1. Make a request with HTTP clients in Ruby

In this section, we will cover how to scrape a Wikipedia page with Ruby.

Imagine you want to build the ultimate Douglas Adams fan wiki. You would for sure start with getting data from Wikipedia. In order to send a request to any website or web app, you need an HTTP client. Let's take a look at our three main options: net/http, open-uri, and HTTParty. You can use whichever of the clients below you like the most and it will work with step 2.

Net::HTTP

Ruby's standard library comes with an HTTP client of its own, namely, the net-http gem. In order to make a request to Douglas Adams' Wikipedia page, we first need to convert our URL string into a URI object, using the uri module from the standard library. Once we have our URI, we can pass it to get_response, which will provide us with a Net::HTTPResponse object, whose body method will give us the HTML document.

require 'uri'
require 'net/http'

url = "https://en.wikipedia.org/wiki/Douglas_Adams"
uri = URI.parse(url)

response = Net::HTTP.get_response(uri)
html = response.body

puts html
#=> "\n<!DOCTYPE html>\n<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n<head>\n<meta charset=\"UTF-8\"/>\n<title>Douglas Adams - Wikipedia</title>..."

Pro tip: Should you use Net::HTTP with a REST interface and need to handle JSON, simply require 'json' and parse the response with JSON.parse(response.body).
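
Here's a minimal sketch of that approach (the endpoint is just a placeholder, not a real API):

require 'net/http'
require 'uri'
require 'json'

# hypothetical JSON endpoint, used purely for illustration
uri = URI.parse("https://api.example.com/books.json")
response = Net::HTTP.get_response(uri)

if response.is_a?(Net::HTTPSuccess)
  data = JSON.parse(response.body) # => plain Ruby hashes/arrays
  puts data.inspect
end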

That's it - it works! However, the syntax of net/http may be a bit clunky and less intuitive than that of HTTParty or open-uri, which are, in fact, just elegant wrappers for net/http.

HTTParty

The HTTParty gem was created to make HTTP fun. Indeed, with its intuitive and straightforward syntax, the gem has become widely popular in recent years. The following two lines are all we need to make a successful GET request:

require "HTTParty"

response = HTTParty.get("https://en.wikipedia.org/wiki/Douglas_Adams")
html = response.body

puts html
# => "<!DOCTYPE html>\n" + "<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n" + "<head>\n" + "<meta charset=\"UTF-8\"/>\n" + "<title>Douglas Adams - Wikipedia</title>\n" + ...

get returns an HTTParty::Response object which, again, provides us with the details on the response and, of course, the content of the page. If the server provided a content type of application/json, HTTParty will automatically parse the response as JSON and return appropriate Ruby objects.
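
For example, against a JSON endpoint (again, just a placeholder URL), the parsed response can be used directly as Ruby data:

require 'httparty'

# hypothetical JSON API, for illustration only
response = HTTParty.get("https://api.example.com/books.json")

puts response.code                    # HTTP status code, e.g. 200
puts response.headers["content-type"] # response headers
puts response.parsed_response.inspect # already parsed into hashes/arrays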

OpenURI

The simplest solution, however, is making a request with the open-uri gem, which also is a part of the standard Ruby library:

require 'open-uri'

html = URI.open("https://en.wikipedia.org/wiki/Douglas_Adams")
##<File:/var/folders/zl/8zprgb3d6yn_466ghws8sbmh0000gq/T/open-uri20200525-33247-1ctgjgo>

This provides us with an IO-like object (a StringIO or Tempfile, depending on the size of the response) and allows us to read from the URL as if it were a file, either line by line or all at once with read.

The simplicity of OpenURI is already in its name: it only sends GET requests, but it does that well, with sensible HTTP defaults for SSL and redirects.
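
The returned object also exposes a few response details via OpenURI::Meta, and calling read on it gives us the full HTML as a string, for example:

require 'open-uri'

page = URI.open("https://en.wikipedia.org/wiki/Douglas_Adams")

puts page.content_type   # => "text/html"
puts page.charset        # => e.g. "utf-8"
puts page.status.inspect # => ["200", "OK"]

html = page.read # the HTML document as a plain String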

While we have covered the three most straightforward ways to load the content of a URL, there are of course quite a few other HTTP clients as well. So please feel free to also check out our article on the best Ruby HTTP clients.


2. Parsing HTML with Nokogiri

Once we have the HTML, we need to extract the parts that are of our interest. As you probably noticed, each of the previous examples had declared an html variable. We will use it now as an argument for the Nokogiri::HTML method. Don't forget require "nokogiri", though 🙂.

doc = Nokogiri::HTML(html)
# => #(Document:0x3fe41d89a238 {
#  name = "document",
#  children = [
#    #(DTD:0x3fe41d92bdc8 { name = "html" }),
#    #(Element:0x3fe41d89a10c {
#      name = "html",
#      attributes = [
#        #(Attr:0x3fe41d92fe00 { name = "class", value = "client-nojs" }),
#        #(Attr:0x3fe41d92fdec { name = "lang", value = "en" }),
#        #(Attr:0x3fe41d92fdd8 { name = "dir", value = "ltr" })],
#      children = [
#        #(Text "\n"),
#        #(Element:0x3fe41d93e7fc {
#          name = "head",
#          children = [ ...

Wonderful! We've now got a Nokogiri::HTML::Document object, which is essentially the DOM representation of our document and will allow us to query the document with both CSS selectors and XPath expressions.

💡 If you'd like to find out more about the DOM and XPath, we have another lovely article on that: Practical XPath for Web Scraping
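
As a quick illustration, the same element can usually be reached either way; here is the document's <title> with a CSS selector and with XPath:

doc.at_css("title").text     # => "Douglas Adams - Wikipedia"
doc.at_xpath("//title").text # => "Douglas Adams - Wikipedia"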

In order to select the right DOM elements, we need to do a bit of detective work with the browser's developer tools. In the example below, we are using Chrome to inspect whether a desired element has any attached class:

Screenshot of using Google DevTools

As we can see, Wikipedia does not exactly make extensive use of HTML classes - what a shame. Still, we can select elements by their tag. For instance, if we wanted to get all the paragraphs, we'd approach it by selecting all <p> elements and then fetching their text content:

description = doc.css("p").text
# => "\n\nDouglas Noel Adams (11 March 1952 – 11 May 2001) was an English author, screenwriter, essayist, humorist, satirist and dramatist. Adams was author of The Hitchhiker's Guide to the Galaxy, which originated in 1978 as a BBC  radio comedy before developing into a \"trilogy\" of five books that sold more than 15 million copies in his lifetime and generated a television series, several stage plays, comics, a video game, and in 2005 a feature film. Adams's contribution to UK radio is commemorated in The Radio Academy's Hall of Fame.[1]\nAdams also wrote Dirk Gently's...

This approach resulted in a 4,336-word-long string. However, imagine you would like to get only the first introductory paragraph and the picture. You could either use a regular expression or let Ruby do this for you with the .split method.

In our example, we can notice the delimiters for paragraphs (\n) have been preserved, so we can simply split by newlines and get the first non-empty paragraph:

description = doc.css("p").text.split("\n").find{|e| e.length > 0}

Another way would be to trim all whitespace with .strip and select the first element from our string array:

description = doc.css("p").text.strip.split("\n")[0]

Alternatively, and depending on how the HTML is structured, sometimes an easier way could be to directly access the selector elements:

description = doc.css("p")[1]
#=> #(Element:0x3fe41d89fb84 {
#  name = "p",
#  children = [
#    #(Element:0x3fe41e43d6e4 { name = "b", children = [ #(Text "Douglas Noel Adams")] }),
#    #(Text " (11 March 1952 – 11 May 2001) was an English "),
#    #(Element:0x3fe41e837560 {
#      name = "a",
#      attributes = [
#        #(Attr:0x3fe41e833104 { name = "href", value = "/wiki/Author" }),
#        #(Attr:0x3fe41e8330dc { name = "title", value = "Author" })],
#      children = [ #(Text "author")]
#      }),
#    #(Text ", "),
#    #(Element:0x3fe41e406928 {
#      name = "a",
#      attributes = [
#        #(Attr:0x3fe41e41949c { name = "href", value = "/wiki/Screenwriter" }),
#        #(Attr:0x3fe41e4191cc { name = "title", value = "Screenwriter" })],
#      children = [ #(Text "screenwriter")]
#      }),

Once we have found the element we are interested in, we need to call the .children method on it, which will return -- you've guessed it -- more nested DOM elements. We could iterate over them to get the text we need. Here's an example of the return values of two such nodes:

doc.css("p")[1].children[0]
#=> #(Element:0x3fe41e43d6e4 { name = "b", children = [ #(Text "Douglas Noel Adams")] })

doc.css("p")[1].children[1]
#=> #(Text " (11 March 1952 – 11 May 2001) was an English ")

Now, let's find the article's main image. That should be easy, right? The most straightforward approach is to select the <img> tag, isn't it?

doc.css("img").count
#=> 16

Not quite, there are quite a few images on that page. 😳

Well, we could filter for some image specific data, couldn't we?

doc.css("img").find{|picture| picture.attributes["alt"].value.include?("Douglas adams portrait cropped.jpg")}
#=> <img alt....

Perfect, that gave us the right image, right? 🥺

Sort of, but do hold your horses just for a second. The moment there's even a slight change to that attribute value, our find() call won't find anything any more.

All right, all right, what about using an XPath expression?

doc.xpath("/html/body/div[3]/div[3]/div[5]/div/table[1]/tbody/tr[2]/td/a/img")
#=> <img alt....

True, we got the image here as well and did not filter on an arbitrary attribute value but, as always with absolute paths, this can also quickly break if there is just a slight change to the DOM hierarchy.

So what now?

As mentioned before, Wikipedia isn't exactly generous when it comes to IDs, but there are some unique HTML classes which seem to be pretty stable across Wikipedia.

doc.css(".infobox-image img")
#=> <img alt....

As you can see, getting the right (and stable) DOM path can be a bit tricky and does take some experience and analysis of the DOM tree, but it's also quite rewarding when you have found the right CSS selectors or XPath expressions and they withstand the test of time and do not break with DOM changes. As so often, your browser's developer tools will be your best friend in this endeavor.

If you're doing web scraping, you will often have to use proxies during your endeavors. Check out our guide on how to use a proxy with Ruby and Faraday to learn how to do so.

3. Exporting Scraped Data to CSV

All right, before we move on to the full-fledged web scraping framework we mentioned earlier, let's see how to actually use the data we just got from our website.

Once you've successfully scraped the website, you probably want to persist that data for later use. A convenient and interoperable way to do that is to save it as a CSV file. CSVs can not only be easily managed with Excel, but they are also a standard exchange format for many third-party platforms (e.g. mailing frameworks). Naturally, Ruby has got you covered with the csv gem.

require "nokogiri"
require "csv"
require "open-uri"

html = URI.open("https://en.wikipedia.org/wiki/Douglas_Adams")
doc = Nokogiri::HTML(html)

description = doc.css("p").text.split("\n").find{|e| e.length > 0}
picture = doc.css("td a img").find{|picture| picture.attributes["alt"].value.include?("Douglas adams portrait cropped.jpg")}.attributes["src"].value

data_arr = []
data_arr.push(description, picture)

CSV.open('data.csv', "w") do |csv|
  csv << data_arr
end

So, what were we doing here? Let's quickly recap.

  1. We imported the libraries we are going to use.
  2. We used OpenURI to load the content of the URL and provided it to Nokogiri.
  3. Once Nokogiri had the DOM, we politely asked it for the description and the picture URL. CSS selectors are truly elegant, aren't they? 🤩
  4. We added the data to our data_arr array.
  5. We used CSV.open to write the data to our CSV file.
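
If you'd like the file to be a bit more self-describing, you could also write a header row and read the data back with the same gem, for instance:

CSV.open('data.csv', "w") do |csv|
  csv << ["description", "picture"] # header row
  csv << data_arr
end

table = CSV.read('data.csv', headers: true)
puts table.first["description"]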

💡 We released a new feature that makes this whole process way simpler. You can now extract data from HTML with one simple API call. Feel free to check the documentation here.


Part II: Kimurai - a complete Ruby web scraping framework

So far we have focused on how to load the content of a URL, how to parse its HTML document into a DOM tree, and how to select specific document elements using CSS selectors and XPath expressions. While that all worked pretty well, there is still one major limitation: JavaScript.

More and more sites rely on JavaScript to render their content (in particular, of course, single-page applications or sites which use infinite scroll for their data), in which case our Nokogiri implementation will only get the initial HTML bootstrap document without the actual data. Not getting the actual data is, let's say, less than ideal for a scraper, right?

In these cases, we can use tools which specifically support JavaScript-powered sites. One of them is Kimurai, a Ruby framework specifically designed for web scraping. Like our previous examples, it uses Nokogiri to access DOM elements, along with Capybara to execute interactive actions typically performed by users (e.g. mouse clicks). On top of that, it also fully integrates with headless browsers (i.e. Headless Chrome and Headless Firefox) and PhantomJS.

In this part of the article, we will scrape a job listing web app. First, we will do it statically by just visiting different URL addresses and then, we will introduce some JS action.

Kimurai Setup

In order to scrape dynamic pages, you need to install a couple of tools -- below you will find the list with the macOS installation commands:

  • Chrome and Firefox: brew install --cask google-chrome firefox
  • ChromeDriver: brew install --cask chromedriver
  • geckodriver: brew install geckodriver
  • PhantomJS: brew install phantomjs
  • Kimurai gem: gem install kimurai

In this tutorial, we will use a simple Ruby file but you could also create a Rails app that would scrape a site and save the data to a database.

Static page scraping

Let's start with what Kimurai considers the bare minimum: a class with options for the scraper and a parse method:

class JobScraper < Kimurai::Base
  @name= 'eng_job_scraper'
  @start_urls = ["https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY"]
  @engine = :selenium_chrome

  def parse(response, url:, data: {})
  end
end

In our class, we defined the following three fields:

  • @name: you can name your scraper whatever you wish or omit it altogether if your scraper consists of just one file;
  • @start_urls: this is an array of start URLs, which will be processed one by one inside the parse method;
  • @engine: the engine used for scraping; Kimurai supports four default engines. For our examples here, we'll be using Selenium with Headless Chrome

Let's talk about the parse method now. It is the default entry method for the scraper and accepts the following arguments:

  • response: the Nokogiri::HTML object, which we know from the prior part of this post;
  • url: a string, which can be either passed to the method manually or otherwise will be taken from the @start_urls array;
  • data: a storage for passing data between requests;

Just like when we used Nokogiri, you can also use CSS selectors and XPath expressions here to select the document elements you want to extract.

The browser object

Every Kimurai class also has a default browser field, which provides access to the underlying Capybara session object and allows you to interact with its browser instance (e.g. fill in forms or perform mouse actions).
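
For example, a form could be filled in and submitted through the standard Capybara API (the URL, field name, and button label below are purely illustrative):

def search_jobs(term)
  browser.visit("https://www.example.com/search") # placeholder URL
  browser.fill_in("q", with: term)                # hypothetical field name
  browser.click_button("Search")                  # hypothetical button label
  browser.current_response                        # Nokogiri document of the rendered page
end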

All right, let's dive into the page structure of our job site.

As we are interested in the job entries, we should first check if there's a common parent element (ideally with its own HTML ID, right? 🤞). And we are in luck: the jobs are all contained within a single <td> with the ID resultsCol.

td#resultsCol

Now, we just need to find the tag for the individual entry elements and we can scrape the data. Fortunately, that's relatively straightforward as well, a <div> with a job_seen_beacon class.

div.job_seen_beacon

Following is our previous base class with a rough implementation and a @@jobs array to keep track of all the job entries we found.

require 'kimurai'

class JobScraper < Kimurai::Base
  @name= 'eng_job_scraper'
  @start_urls = ["https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY"]
  @engine = :selenium_chrome

  @@jobs = []

  def scrape_page
    doc = browser.current_response
    returned_jobs = doc.css('td#resultsCol')
    returned_jobs.css('div.job_seen_beacon').each do |char_element|
        #code to get only the listings
    end
  end

  def parse(response, url:, data: {})
    scrape_page
    @@jobs
  end
end

JobScraper.crawl!

Time to get the actual data!

Let's check out the page once more and try to find the selectors for the data points we are after.

Data point    Selector
Page URL      h2.jobTitle > a
Title         h2.jobTitle > a > span
Description   div.job-snippet
Company       div.companyInfo > span.companyName
Location      div.companyInfo > div.companyLocation
Salary        span.estimated-salary > span

With that information, we should now be able to extract the data from all the job entries on our page.

  def scrape_page
    doc = browser.current_response
    returned_jobs = doc.css('td#resultsCol')
    returned_jobs.css('div.job_seen_beacon').each do |char_element|
      # scraping individual listings
      title = char_element.css('h2.jobTitle > a > span').text.gsub(/\n/, "")
      link = "https://indeed.com" + char_element.css('h2.jobTitle > a').attributes["href"].value.gsub(/\n/, "")
      description = char_element.css('div.job-snippet').text.gsub(/\n/, "")
      company = char_element.css('div.companyInfo > span.companyName').text.gsub(/\n/, "")
      location = char_element.css('div.companyInfo > div.companyLocation').text.gsub(/\n/, "")
      salary = char_element.css('span.estimated-salary > span').text.gsub(/\n/, "")

      # creating a job object
      job = {title: title, link: link, description: description, company: company, location: location, salary: salary}

      # adding the object if it is unique
      @@jobs << job if !@@jobs.include?(job)
    end
  end

Instead of creating an object, we could also create an array, depending on what data structure we'd need later:

  job = [title, link, description, company, location, salary]

As the code currently is, we only get the first 15 results, or just the first page. In order to get data from the next pages, we can visit subsequent URLs:

  def parse(response, url:, data: {})
    # scrape first page
    scrape_page

    # next page link starts with 20 so the counter will be initially set to 2
    num = 2

    #visit next page and scrape it
    10.times do
        browser.visit("https://www.indeed.com/jobs?q=software+engineer&l=New+York,+NY&start=#{num}0")
        scrape_page
        num += 1
    end

    @@jobs
  end

Last but not least, we could store our data in a CSV or JSON file by adding one of the following snippets to our parse method (just make sure to also require 'csv' or 'json' at the top of the file):

    CSV.open('jobs.csv', "w") do |csv|
      csv << @@jobs.first.keys unless @@jobs.empty? # header row
      @@jobs.each { |job| csv << job.values }
    end

or:

    File.open("jobs.json","w") do |f|
        f.write(JSON.pretty_generate(@@jobs))
    end

Dynamic page scraping with Selenium and Headless Chrome

So far, our Kimurai code was not all that different from our previous example. Admittedly, we were now loading everything through a real browser, but we still simply loaded the page, extracted the desired data items, and loaded the next page based on our URL template.

Real users wouldn't do that last step, would they? No, they wouldn't, they simply click the "next" button. That's exactly what we are going to check out now.

def parse(response, url:, data: {})

  10.times do
    # scrape the current page
    scrape_page
    puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count}🔹 🔹 🔹"

    # find the "next" button + click to move to the next page
    browser.find(:xpath, '/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click
    puts "🔺 🔺 🔺 🔺 🔺  CLICKED THE NEXT BUTTON 🔺 🔺 🔺 🔺 "

  end

  @@jobs
end

We still call our scrape_page() method from before, but now we also use Capybara's browser object to find() (using an XPath expression) and click() the "next" button.

We added two puts statements to see whether our scraper actually moves forward:

error logs

As you see, we successfully scraped the first page but then we encountered an error:

element click intercepted: Element <span class="pn">...</span> is not clickable at point (329, 300). Other element would receive the click: <input autofocus="" name="email" type="email" id="popover-email" class="popover-input-locationtst"> (Selenium::WebDriver::Error::ElementClickInterceptedError)
(Session info: headless chrome=83.0.4103.61)

We now have two options:

  1. check what we received in response and whether the DOM tree is any different (spoiler, it is 🙂)
  2. have the browser snap a screenshot and check if the page looks any different

While option #1 is, certainly, the more thorough one, option #2 often is a shortcut to point out any obvious changes. So, let's try that first.

As so often, it's quite straightforward to take a screenshot with Capybara. Simply call save_screenshot() on the browser object, and a screenshot (with a random name) will be saved in your working directory.

def parse(response, url:, data: {})

  10.times do
    # scrape first page
    scrape_page

    # take a screenshot of the page
    browser.save_screenshot

    # find the "next" button + click to move to the next page
    browser.find(:xpath, '/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click
    puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count}🔹 🔹 🔹"
    puts "🔺 🔺 🔺 🔺 🔺  CLICKED THE NEXT BUTTON 🔺 🔺 🔺 🔺 "

  end

  @@jobs
end

Voilà, Ruby will now save a screenshot of each page. This is the first page:

first screenshot showing the page as we know it

Lovely, just what we were looking for. Here page number two:

second screenshot which features a popup

Aha! A popup! After running this test a couple of times, and inspecting errors closely, we know that it comes in two versions and that our "next" button is not clickable when the popup is displayed. Fortunately, a simple browser.refresh takes care of that.

def parse(response, url:, data: {})

  10.times do
    scrape_page

    # if there's the popup, escape it
    if browser.current_response.css('div#popover-background').any? || browser.current_response.css('div#popover-input-locationtst').any?
        browser.refresh
    end

    # find the "next" button + click to move to the next page
    browser.find(:xpath, '/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click
    puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count}🔹 🔹 🔹"
    puts "🔺 🔺 🔺 🔺 🔺  CLICKED THE NEXT BUTTON 🔺 🔺 🔺 🔺 "
  end

  @@jobs
end

Finally, our scraper works without a problem and after ten rounds, we end up with 155 job listings.

Here's the full code of our dynamic scraper:

require 'kimurai'
require 'csv'
require 'json'

class JobScraper < Kimurai::Base
    @name= 'eng_job_scraper'
    @start_urls = ["https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY"]
    @engine = :selenium_chrome

    @@jobs = []

    def scrape_page
        doc = browser.current_response
        returned_jobs = doc.css('td#resultsCol')
        returned_jobs.css('div.job_seen_beacon').each do |char_element|
            # scraping individual listings
            title = char_element.css('h2.jobTitle > a > span').text.gsub(/\n/, "")
            link = "https://indeed.com" + char_element.css('h2.jobTitle > a').attributes["href"].value.gsub(/\n/, "")
            description = char_element.css('div.job-snippet').text.gsub(/\n/, "")
            company = char_element.css('div.companyInfo > span.companyName').text.gsub(/\n/, "")
            location = char_element.css('div.companyInfo > div.companyLocation').text.gsub(/\n/, "")
            salary = char_element.css('span.estimated-salary > span').text.gsub(/\n/, "")

            # creating a job object
            job = {title: title, link: link, description: description, company: company, location: location, salary: salary}

            @@jobs << job if !@@jobs.include?(job)
        end
    end

    def parse(response, url:, data: {})

        10.times do
            scrape_page

            if browser.current_response.css('div#popover-background').any? || browser.current_response.css('div#popover-input-locationtst').any?
                browser.refresh
            end

            browser.find(:xpath, '/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click
            puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count}🔹 🔹 🔹"
            puts "🔺 🔺 🔺 🔺 🔺  CLICKED NEXT BUTTON 🔺 🔺 🔺 🔺 "
        end

        CSV.open('jobs.csv', "w") do |csv|
            csv << @@jobs.first.keys unless @@jobs.empty? # header row
            @@jobs.each { |job| csv << job.values }
        end

        File.open("jobs.json","w") do |f|
            f.write(JSON.pretty_generate(@@jobs))
        end

        @@jobs
    end
end

JobScraper.crawl!

Alternatively, you could also replace the crawl! method with parse!, which returns the value of our parse method and thus allows us to capture the @@jobs array and print it out:

jobs = JobScraper.parse!(:parse, url: "https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY")

pp jobs

Conclusion

Web scraping is most definitely a very powerful tool when you need to access and analyze a large amount of (semi-structured) data from a number of different sources. While it allows you to quickly access, aggregate, and process that data, it can also be a challenging and daunting task, depending on the tools you are using and the data you want to handle. Nothing is more disappointing than believing you have found the one perfect CSS selector, only to realize on page 500 that it won't work because of one small inconsistency - back to the drawing board.

What is important is that you use the right tools and the right approach to crawl a site. As we learned throughout this article, Ruby is a great choice and comes with many ready-to-use libraries for this purpose.

One important aspect to remember is to plan your crawler strategy in a way to avoid being rate limited by the site. We have another excellent article on that subject and how to make sure your web crawler does not get blocked.

💡 If you prefer not to have to deal with rate limits, proxies, user agents, and browser fingerprints, please check out our no-code web scraping API. Did you know, the first 1,000 calls are on us?

Happy Scraping.

Sylwia Vargas

Sylwia is a talented full-stack web developer and technical writer. She is also a lead instructor at Flatiron School and a teacher at Yale University.