Data extraction in Ruby

One of the most important features of ScrapingBee is the ability to extract exact data without needing to post-process the response’s content using external libraries.

We can use this feature by specifying an additional parameter named extract_rules. We give a label to each element we want to extract, along with its CSS selector, and ScrapingBee will do the rest!

Let’s say that we want to extract the title and the subtitle of the data extraction documentation page. Their CSS selectors are h1 and span.text-20 respectively. To make sure that they’re the correct ones, you can run the JavaScript function document.querySelector("CSS_SELECTOR") on that page, in the console of your browser’s developer tools.
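Before the full request, here is a minimal sketch of how the rules for these two selectors could be built and encoded, using only Ruby's standard library. It assumes the rules are passed as a JSON string in the extract_rules query parameter, as in the full code. URI.encode_www_form is used here purely for illustration in place of the addressable gem, and the API key is a placeholder.

```ruby
require 'json'
require 'uri'

# Each label maps to a CSS selector; the labels become the keys of the JSON result.
rules = {
  'title'    => 'h1',
  'subtitle' => 'span.text-20'
}

# The rules are sent as a JSON string.
rules_json = rules.to_json

# URL-encode every parameter into one query string (the API key is a placeholder).
query = URI.encode_www_form(
  'api_key'       => 'YOUR-API-KEY',
  'url'           => 'https://www.scrapingbee.com/documentation/data-extraction/',
  'extract_rules' => rules_json
)

puts rules_json
puts query
```

Printing the query string is a quick way to check that the JSON braces and quotes were percent-encoded before the request is sent.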

The full code will look like this:

require 'net/http'
require 'net/https'
require 'addressable/uri'
require 'json'

# Send a GET request to the ScrapingBee API with the given extraction rules (a JSON string)
def extract_rules(user_url, rules)
    uri = Addressable::URI.parse("https://app.scrapingbee.com/api/v1/")
    api_key = "YOUR-API-KEY"
    uri.query_values = {
      'api_key'  => api_key,
      'url' => user_url,
      'extract_rules' => rules
    }
    uri = URI(uri)

    # Create client
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER

    # Create Request
    req =  Net::HTTP::Get.new(uri)

    # Fetch Request
    res = http.request(req)

    # Return the response object; the caller reads its body
    res
rescue StandardError => e
    puts "HTTP Request failed (#{ e.message })"
    nil
end

url = "https://www.scrapingbee.com/documentation/data-extraction/"
rules = {
    "title": "h1",
    "subtitle": "span.text-20"
}
rules = rules.to_json # Convert the hash object into JSON format
response = extract_rules(url, rules)

# extract_rules returns nil when the request fails, so guard before reading the body
puts response.body if response

And as you can see, the result is:

{"title": "Documentation - Data Extraction", "subtitle": "Extract data with CSS selector"}
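Since the result is JSON, it can be turned into a Ruby hash with the standard json library. The sketch below uses the sample result above as a literal string; in a real script the string would come from the response body. The helper name parse_extracted is hypothetical and exists only for this illustration.

```ruby
require 'json'

# Sample body copied from the result shown above; in practice this
# would be the string read from the HTTP response.
body = '{"title": "Documentation - Data Extraction", "subtitle": "Extract data with CSS selector"}'

# Hypothetical helper: parse the body, falling back to an empty hash
# if it is not valid JSON (for example, an error page).
def parse_extracted(body)
  JSON.parse(body)
rescue JSON::ParserError => e
  warn "Could not parse response as JSON (#{e.message})"
  {}
end

data = parse_extracted(body)
puts data['title']
puts data['subtitle']
```

The keys of the resulting hash are the labels chosen in the extraction rules, so each value can be read by the same name it was given in the request.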

You can find more about this feature in our documentation: Data Extraction. You can also learn more about CSS selectors on the W3Schools CSS Selectors page.
