Ruby, Tutorial, Web scraping

Web Scraping with Ruby

(18 min) Sylwia Vargas, 18 June, 2020

Introduction

This post covers the main tools and techniques for web scraping in Ruby. We start by building a web scraper with common Ruby HTTP clients and parsing the response. This approach has its limitations, however, and can come with a fair dose of frustration: as manageable as it is to scrape static pages, these tools fail when it comes to Single Page Applications, whose content is built with JavaScript. As an answer to that, we will propose a complete web scraping framework. This article assumes that the reader is familiar with the fundamentals of Ruby and of how the Internet works.

Note: Although there is a multitude of gems, we will focus on the most popular ones, as indicated by their GitHub “used by”, “star” and “fork” counts. While we won't be able to cover all the use cases for these tools, we will provide good grounds for you to get started and explore more on your own.


Part I: Static pages

Setup

In order to be able to code along with this part, you may need to install the following gems:

gem install pry      # debugging tool
gem install nokogiri # parsing gem
gem install httparty # HTTP request gem

Moreover, we will use open-uri, net/http and csv, which are part of the standard Ruby library so there's no need for a separate installation.

I will place all my code in a file called scraper.rb.

Note: My Ruby version is 2.6.1.

Make a request with HTTP clients in Ruby

In this section, we will cover how to scrape Wikipedia with Ruby.

Imagine you want to build the ultimate Douglas Adams fan wiki. You would surely start by getting data from Wikipedia. In order to send a request to any website or web app, you need an HTTP client. Let's take a look at our three main options: net/http, open-uri and HTTParty. You can use whichever of the clients below you like the most, and it will work with step 2.

Net/HTTP

Ruby's standard library comes with an HTTP client of its own: the net/http module. In order to make a request to the Douglas Adams Wikipedia page, we first need to turn the URL string into a URI object. To do so, we will use URI.parse from the uri module, which is also part of the standard library.

# these ship with the standard library, but require them just in case:
# require 'uri'
# require 'net/http'

url = "https://en.wikipedia.org/wiki/Douglas_Adams"
uri = URI.parse(url)
response = Net::HTTP.get_response(uri)
puts response.body
#=> "\n<!DOCTYPE html>\n<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n<head>\n<meta charset=\"UTF-8\"/>\n<title>Douglas Adams - Wikipedia</title>..."

We are given a string with all the HTML from the page.

Note: If the response comes in JSON format, you can parse it by adding these two lines:

require 'json'
JSON.parse(response.body)

That's it – it works! However, the syntax of net/http is a bit clunky and less intuitive than that of HTTParty or open-uri, which are, in fact, just elegant wrappers around net/http.

HTTParty

The HTTParty gem was created to ‘make http fun’. Indeed, with its intuitive and straightforward syntax, the gem has become widely popular in recent years. The following two lines are all we need to make a successful GET request:

require "httparty"

html = HTTParty.get("https://en.wikipedia.org/wiki/Douglas_Adams")
# => "<!DOCTYPE html>\n" + "<html class=\"client-nojs\" lang=\"en\" dir=\"ltr\">\n" + "<head>\n" + "<meta charset=\"UTF-8\"/>\n" + "<title>Douglas Adams - Wikipedia</title>\n" + ...

What is returned is an HTTParty::Response, an object wrapping the HTML of the page along with the status code and headers.

Note: It is much easier to work with objects than with raw strings. If the response Content-Type is application/json, HTTParty parses the body automatically and returns Ruby hashes with string keys. (We can learn about the content type by running response.headers["content-type"].) If you prefer symbol keys, parse the body yourself:

JSON.parse(response.body, symbolize_names: true)

We can't do this with Wikipedia, however, as we get text/html back.

Open URI

The simplest solution, however, is making a request with open-uri, which is also part of the standard Ruby library:

require 'open-uri'

html = URI.open("https://en.wikipedia.org/wiki/Douglas_Adams")
# (plain open also works on Ruby < 2.7, but is deprecated for URLs since 2.7)
# => #<File:/var/folders/zl/8zprgb3d6yn_466ghws8sbmh0000gq/T/open-uri20200525-33247-1ctgjgo>

The return value is a Tempfile containing the HTML (call html.read if you want it as a string), and that's all we need for the next step – in just one line.

The simplicity of open-uri is right there in its name. It only sends one type of request, and does it very well: with SSL by default and redirects followed automatically.


Parsing HTML with Nokogiri

Once we have the HTML, we need to extract only the parts that interest us. As you probably noticed, each of the routes in the previous section ended with the page's HTML in hand (response.body for net/http, html for the others). We will now pass it as an argument to the Nokogiri::HTML method.

require 'nokogiri'

doc = Nokogiri::HTML(html)
# => #(Document:0x3fe41d89a238 {
#  name = "document",
#  children = [
#    #(DTD:0x3fe41d92bdc8 { name = "html" }),
#    #(Element:0x3fe41d89a10c {
#      name = "html",
#      attributes = [
#        #(Attr:0x3fe41d92fe00 { name = "class", value = "client-nojs" }),
#        #(Attr:0x3fe41d92fdec { name = "lang", value = "en" }),
#        #(Attr:0x3fe41d92fdd8 { name = "dir", value = "ltr" })],
#      children = [
#        #(Text "\n"),
#        #(Element:0x3fe41d93e7fc {
#          name = "head",
#          children = [ ...

Here the return value is a Nokogiri::HTML::Document: a snapshot of the HTML converted into a tree of nested nodes.

The good news is that Nokogiri allows us to target the desired elements with CSS selectors or XPath expressions. We will use both.

In order to find our way around the document, we need to do a bit of detective work with the browser's DevTools. In the example below, we are using Chrome to inspect whether a desired element has a class attached:

Screenshot of using Google DevTools

As we can see, the elements on Wikipedia do not have classes. Still, we can target them by tag. For instance, if we wanted to get all the paragraphs, we'd approach it by first selecting all p elements and then converting them to text:

description = doc.css("p").text
# => "\n\nDouglas Noel Adams (11 March 1952 – 11 May 2001) was an English author, screenwriter, essayist, humorist, satirist and dramatist. Adams was author of The Hitchhiker's Guide to the Galaxy, which originated in 1978 as a BBC  radio comedy before developing into a \"trilogy\" of five books that sold more than 15 million copies in his lifetime and generated a television series, several stage plays, comics, a video game, and in 2005 a feature film. Adams's contribution to UK radio is commemorated in The Radio Academy's Hall of Fame.[1]\nAdams also wrote Dirk Gently's...

This approach resulted in a 4,336-word-long string. However, imagine you would like to get only the first introductory paragraph and the picture. You could use a regex, or you could let Ruby do the work with the .split method. Here we see that the paragraph separators (\n) have been preserved, so we can ask Ruby to extract the first non-empty paragraph:

description = doc.css("p").text.split("\n").find{|e| e.length > 0}

Alternatively, we can strip the leading newlines with the .strip method and then just select the first item:

description = doc.css("p").text.strip.split("\n")[0]

Alternatively, and depending on how the HTML is structured, it is sometimes easier to traverse the document tree directly – Nokogiri::HTML::Document is, after all, a structure of nested XML/HTML nodes. To do that, we'd select one of the nodes and dive as deep as necessary:

description = doc.css("p")[1]
#=> #(Element:0x3fe41d89fb84 {
#  name = "p",
#  children = [
#    #(Element:0x3fe41e43d6e4 { name = "b", children = [ #(Text "Douglas Noel Adams")] }),
#    #(Text " (11 March 1952 – 11 May 2001) was an English "),
#    #(Element:0x3fe41e837560 {
#      name = "a",
#      attributes = [
#        #(Attr:0x3fe41e833104 { name = "href", value = "/wiki/Author" }),
#        #(Attr:0x3fe41e8330dc { name = "title", value = "Author" })],
#      children = [ #(Text "author")]
#      }),
#    #(Text ", "),
#    #(Element:0x3fe41e406928 {
#      name = "a",
#      attributes = [
#        #(Attr:0x3fe41e41949c { name = "href", value = "/wiki/Screenwriter" }),
#        #(Attr:0x3fe41e4191cc { name = "title", value = "Screenwriter" })],
#      children = [ #(Text "screenwriter")]
#      }),

Once we have found the node of our interest, we can call the .children method on it, which will return – you've guessed it – more nested nodes. We can iterate over them to get the text we need. Here's an example of the return values from two of the nodes:

doc.css("p")[1].children[0]
#=> #(Element:0x3fe41e43d6e4 { name = "b", children = [ #(Text "Douglas Noel Adams")] })

doc.css("p")[1].children[1]
#=> #(Text " (11 March 1952 – 11 May 2001) was an English ")

Now, let's locate the picture. That should be easy, right? Well, since there are no classes or ids on this Wikipedia page, calling doc.css('img') returns 16 elements, and even increasing the selector specificity to doc.css('td a img') does not single out the main image. We can, however, look for the image by its alt text and then save its URL:

picture = doc.css("td a img").find{|picture| picture.attributes["alt"].value.include?("Douglas adams portrait cropped.jpg")}.attributes["src"].value

Or we could locate the image using XPath, which also returns eight objects, so we still need to find the correct one:

picture = doc.xpath("/html/body/div[3]/div[3]/div[4]/div/table[1]/tbody/tr[2]/td/a/img").find{|picture| picture.attributes["alt"].value.include?("Douglas adams portrait cropped.jpg")}.attributes["src"].value

While all this is possible to achieve, it is really time-consuming, and a small change in the page's HTML can break our code. Hunting for a specific piece of text, with or without a regex, often feels like looking for a few needles in a haystack. To add to that, the websites themselves are often not structured in a logical way and do not follow a clear design. That not only prolongs the time a developer spends with DevTools but also results in many exceptions.

Fortunately, the developer experience improves considerably with a web scraping framework, which not only makes the code cleaner but also comes with ready-made tools for all occasions.


Exporting Scraped Data to CSV

Before we move on to covering the complete web scraping framework, let's just see how to actually use the data we get from a website.

Once you've successfully scraped the website, you can save the data as a CSV file, which can be used in Excel or integrated into, e.g., a mailing platform – a popular use case for web scraping. In order to implement this feature, we will use the csv module from the standard library.

  1. In the same folder, create a separate data.csv file.
  2. csv works best with arrays, so create a data_arr variable and define it as an empty array.
  3. Push the data to the array.
  4. Add the array to the csv file.
  5. Run the scraper and check your data.csv file.

The code:

require 'open-uri'
require 'nokogiri'
require 'csv'

html = URI.open("https://en.wikipedia.org/wiki/Douglas_Adams")
doc = Nokogiri::HTML(html)

data_arr = []
description = doc.css("p").text.split("\n").find{|e| e.length > 0}
picture = doc.css("td a img").find{|picture| picture.attributes["alt"].value.include?("Douglas adams portrait cropped.jpg")}.attributes["src"].value

data_arr.push([description, picture])

# each element of data_arr becomes one row of the CSV file
CSV.open('data.csv', "w") do |csv|
  data_arr.each { |row| csv << row }
end
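A small refinement: a header row makes the resulting file easier to read in Excel. CSV.generate (which builds the CSV as a string instead of writing a file) accepts headers options; a sketch with made-up rows:

```ruby
require 'csv'

# Hypothetical rows scraped earlier: [description, picture URL]
data_arr = [
  ["English author Douglas Adams", "//upload.wikimedia.org/douglas.jpg"],
  ["English author Terry Pratchett", "//upload.wikimedia.org/terry.jpg"]
]

# write_headers: true emits the headers row before the data rows
csv_string = CSV.generate(write_headers: true, headers: ["description", "picture"]) do |csv|
  data_arr.each { |row| csv << row }
end
puts csv_string
```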


Part II: A complete Ruby web scraping framework

We have covered scraping static pages with basic tools, which forced us to spend a bit too much time trying to locate specific elements. While these approaches more or less work, they do have their limitations. For instance, what happens when a website depends on JavaScript, as in the case of Single Page Applications or infinite-scroll pages? These web apps usually ship with very limited initial HTML, so scraping them with Nokogiri alone would not bring the desired results.

In this case, we can use a framework that works with JS-rendered sites. The friendliest and best-documented one is by far Kimurai, which builds on Nokogiri for parsing and Capybara for imitating user interaction with the website. Apart from a plethora of helper methods that make web scraping easy and pleasant, it works out of the box with Headless Chrome, Firefox and PhantomJS.

In this part of the article, we will scrape a job listing web app. First, we will do it statically by just visiting different URL addresses and then, we will introduce some JS action.

Kimurai Setup

In order to scrape dynamic pages, you need to install a couple of tools – below you will find the list with the macOS installation commands (on newer Homebrew versions, replace brew cask install with brew install --cask):

  • Chrome and Firefox: brew cask install google-chrome firefox
  • chromedriver: brew cask install chromedriver
  • geckodriver: brew install geckodriver
  • PhantomJS: brew install phantomjs
  • Kimurai gem: gem install kimurai

In this tutorial, we will use a simple Ruby file but you could also create a Rails app that would scrape a site and save the data to the database.

Static page scraping

Let's start with what Kimurai considers a bare minimum: a class with options for the scraper and a parse method:

require 'kimurai'

class JobScraper < Kimurai::Base
  @name= 'eng_job_scraper'
  @start_urls = ["https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY"]
  @engine = :selenium_chrome

  def parse(response, url:, data: {})
  end
end

As you see above, we use the following options:

  • @name: you can name your scraper whatever you wish or omit it altogether if your scraper consists of just one file;
  • @start_urls: this is an array of start urls, which will be processed one by one inside the parse method;
  • @engine: the engine used for scraping; in this tutorial, we are using Selenium with Headless Chrome; if you don't know which engine to choose, check this description of each one.

Let's talk about the parse method now. It is the default start method for the scraper, and it accepts the following arguments:

  • response: the Nokogiri::HTML object, which we know from the prior part of this post;
  • url: a string, which can be either passed to the method manually or otherwise will be taken from the @start_urls array;
  • data: a storage for passing data between requests;

Just like when we used Nokogiri directly, here you can also parse the response using CSS selectors or XPath. If you're not very familiar with XPath, here is a practical guide to XPath for web scraping. In this part of the tutorial we will use both.

Before we move on, we need to introduce the browser object. Every instance method of the scraper has access to the Capybara::Session object. Although it is usually not necessary to use it (because response already contains the whole page), it is what allows you to interact with the website if you ultimately want to click buttons or fill out forms.

Now would be a good time to have a look at the page structure:

Since we are only interested in the job listings, it is convenient to see whether they are grouped within a component – in fact, they are all nested in td#resultsCol. After locating that, we do the same with each of the listings. Below you will also see a helper method scrape_page and a @@jobs = [] class variable, which will be our storage for all the jobs we scrape.

require 'kimurai'

class JobScraper < Kimurai::Base
  @name= 'eng_job_scraper'
  @start_urls = ["https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY"]
  @engine = :selenium_chrome

  @@jobs = []

  def scrape_page
    doc = browser.current_response
    returned_jobs = doc.css('td#resultsCol')
    returned_jobs.css('div.jobsearch-SerpJobCard').each do |char_element|
        #code to get only the listings
    end 
  end

  def parse(response, url:, data: {})
    scrape_page
    @@jobs
  end
end

JobScraper.crawl!

Let's inspect the page again to check the selectors for title, description, company, location, salary, requirements and the slug for the listing:

With this knowledge, we can scrape an individual listing.

  def scrape_page
    doc = browser.current_response
    returned_jobs = doc.css('td#resultsCol')
    returned_jobs.css('div.jobsearch-SerpJobCard').each do |char_element|
      # scraping individual listings 
      title = char_element.css('h2 a')[0].attributes["title"].value.gsub(/\n/, "")
      link = "https://indeed.com" + char_element.css('h2 a')[0].attributes["href"].value.gsub(/\n/, "")
      description = char_element.css('div.summary').text.gsub(/\n/, "")
      company = char_element.css('span.company').text.gsub(/\n/, "")
      location = char_element.css('div.location').text.gsub(/\n/, "")
      salary = char_element.css('div.salarySnippet').text.gsub(/\n/, "")
      requirements = char_element.css('div.jobCardReqContainer').text.gsub(/\n/, "")

      # creating a job object
      job = {title: title, link: link, description: description, company: company, location: location, salary: salary, requirements: requirements}

      # adding the object if it is unique
      @@jobs << job if !@@jobs.include?(job)
    end
  end

Instead of creating an object, we could also create an array, depending on what data structure we'd need later:

  job = [title, link, description, company, location, salary, requirements]

As the code currently stands, we only get the first 15 results – just the first page. In order to get data from the next pages, we can visit subsequent URLs:

  def parse(response, url:, data: {})
    # scrape first page
    scrape_page

    # the second results page has start=20, so the counter starts at 2
    num = 2

    # visit each subsequent page and scrape it
    10.times do
        browser.visit("https://www.indeed.com/jobs?q=software+engineer&l=New+York,+NY&start=#{num}0")
        scrape_page
        num += 1
    end

    @@jobs
  end
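Laid out on its own, the URL arithmetic in the loop above looks like this (the base URL is the one from the snippet; the interpolated "#{num}0" appends a zero to the counter):

```ruby
base = "https://www.indeed.com/jobs?q=software+engineer&l=New+York,+NY&start="

# num runs from 2 through 11, so "#{num}0" yields start=20, 30, ..., 110
urls = (2..11).map { |num| base + "#{num}0" }

puts urls.first # ...&start=20
puts urls.last  # ...&start=110
```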

Last but not least, we could create JSON or CSV files to store the scraped data by adding these snippets to the parse method:

    CSV.open('jobs.csv', "w") do |csv|
      # write one row per job; csv << @@jobs would cram everything into a single row
      @@jobs.each { |job| csv << job.values }
    end

or:

    File.open("jobs.json","w") do |f|
        f.write(JSON.pretty_generate(@@jobs))
    end
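For a sense of what ends up in jobs.json, here is a small, self-contained sketch; the two listings are made up for illustration:

```ruby
require 'json'

# A couple of hypothetical scraped listings
jobs = [
  { title: "Software Engineer", company: "Acme", location: "New York, NY" },
  { title: "Backend Engineer", company: "Initech", location: "Brooklyn, NY" }
]

# pretty_generate adds indentation and newlines, unlike JSON.generate
puts JSON.pretty_generate(jobs)
```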

Dynamic page scraping with Selenium and Headless Chrome

To bring in JavaScript interaction, we actually won't change much about our current code – except that instead of visiting different URLs, we will use Selenium with Headless Chrome to imitate a user interaction and click the button that takes us to the next page.

def parse(response, url:, data: {})

  10.times do
    # scrape the current page
    scrape_page
    puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count} 🔹 🔹 🔹"

    # find the "next" button and click it to move to the next page
    browser.find('/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click
    puts "🔺 🔺 🔺 🔺 🔺  CLICKED THE NEXT BUTTON 🔺 🔺 🔺 🔺 "
  end

  @@jobs
end

To this end, we use two methods:

  • find(): finds an element in the current session by its XPath;
  • click: simulates a user's click on the found element.

We added two puts statements to see whether our scraper actually moves forward:

error logs

As you see, we successfully scraped the first page but then we encountered an error:

element click intercepted: Element <span class="pn">...</span> is not clickable at point (329, 300). Other element would receive the click: <input autofocus="" name="email" type="email" id="popover-email" class="popover-input-locationtst"> (Selenium::WebDriver::Error::ElementClickInterceptedError)
(Session info: headless chrome=83.0.4103.61)

We could either investigate the response to read the HTML and try to understand why the page looks different, or we could use a more telling tool: a screenshot of the page:

def parse(response, url:, data: {})

  10.times do
    # scrape the current page
    scrape_page

    # take a screenshot of the page
    browser.save_screenshot

    # find the "next" button and click it to move to the next page
    browser.find('/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click
    puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count} 🔹 🔹 🔹"
    puts "🔺 🔺 🔺 🔺 🔺  CLICKED THE NEXT BUTTON 🔺 🔺 🔺 🔺 "
  end

  @@jobs
end

Now, as the code runs, we get screenshots of every page it encounters. This is the first page:

first screenshot showing the page as we know it

And here is the second page:

second screenshot which features a popup

Aha! A popup! After running this test a couple of times and inspecting the errors closely, we know that it comes in two versions and that, sadly, it is not clickable. However, we can always refresh the page, and the fact that we have already seen the popup will be saved in the session. Let's add a safeguard, then:

def parse(response, url:, data: {})

  10.times do
    scrape_page

    # if the popup is present, refresh the page to dismiss it
    # (css returns a NodeSet, which is truthy even when empty, hence .any?)
    if browser.current_response.css('div#popover-background').any? || browser.current_response.css('div#popover-input-locationtst').any?
      browser.refresh
    end

    # find the "next" button and click it to move to the next page
    browser.find('/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click
    puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count} 🔹 🔹 🔹"
    puts "🔺 🔺 🔺 🔺 🔺  CLICKED THE NEXT BUTTON 🔺 🔺 🔺 🔺 "
  end

  @@jobs
end

Finally, our scraper works without a problem and after ten rounds, we end up with 155 job listings:

Here's the full code of our dynamic scraper:

require 'kimurai'
require 'csv'
require 'json'

class JobScraper < Kimurai::Base
  @name = 'eng_job_scraper'
  @start_urls = ["https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY"]
  @engine = :selenium_chrome

  @@jobs = []

  def scrape_page
    doc = browser.current_response
    returned_jobs = doc.css('td#resultsCol')
    returned_jobs.css('div.jobsearch-SerpJobCard').each do |char_element|
      title = char_element.css('h2 a')[0].attributes["title"].value.gsub(/\n/, "")
      link = "https://indeed.com" + char_element.css('h2 a')[0].attributes["href"].value.gsub(/\n/, "")
      description = char_element.css('div.summary').text.gsub(/\n/, "")
      company = char_element.css('span.company').text.gsub(/\n/, "")
      location = char_element.css('div.location').text.gsub(/\n/, "")
      salary = char_element.css('div.salarySnippet').text.gsub(/\n/, "")
      requirements = char_element.css('div.jobCardReqContainer').text.gsub(/\n/, "")

      # job = [title, link, description, company, location, salary, requirements]
      job = {title: title, link: link, description: description, company: company, location: location, salary: salary, requirements: requirements}

      @@jobs << job if !@@jobs.include?(job)
    end
  end

  def parse(response, url:, data: {})
    10.times do
      scrape_page

      # refresh to dismiss the popup; .any? because an empty NodeSet is still truthy
      if browser.current_response.css('div#popover-background').any? || browser.current_response.css('div#popover-input-locationtst').any?
        browser.refresh
      end

      browser.find('/html/body/table[2]/tbody/tr/td/table/tbody/tr/td[1]/nav/div/ul/li[6]/a/span').click
      puts "🔹 🔹 🔹 CURRENT NUMBER OF JOBS: #{@@jobs.count} 🔹 🔹 🔹"
      puts "🔺 🔺 🔺 🔺 🔺  CLICKED NEXT BUTTON 🔺 🔺 🔺 🔺 "
    end

    # write one row per job
    CSV.open('jobs.csv', "w") do |csv|
      @@jobs.each { |job| csv << job.values }
    end

    File.open("jobs.json", "w") do |f|
      f.write(JSON.pretty_generate(@@jobs))
    end

    @@jobs
  end
end

jobs = JobScraper.crawl!

jobs = JobScraper.crawl!

Alternatively, you could replace the crawl! method with parse!, which allows you to use the return value and print out the @@jobs array:

jobs = JobScraper.parse!(:parse, url: "https://www.indeed.com/jobs?q=software+engineer&l=New+York%2C+NY")

pp jobs 

Conclusion

Web scraping is definitely one of the most powerful activities a developer can engage in, as it helps you quickly access, aggregate and process data coming from various sources. It can feel satisfying or daunting depending on the tools you are using: some can see through poor web design decisions and deliver only the data you need in a matter of seconds. I definitely do not wish anyone to spend hours trying to scrape a page just to learn that its developers left small inconsistencies in how they approached web development. Look for tools that make your life easier!

While this article tackles the main aspects of web scraping with Ruby, it does not talk about web scraping without getting blocked.

If you want to learn how to do that, we have written this complete guide, and if you don't want to take care of it yourself, you can always use our web scraping API.

Happy Scraping.

Read more

Tired of getting blocked while scraping the web? Our API handles headless browsers and rotates proxies for you.