Documentation - Data Extraction

Extract data with CSS selectors

Basic usage

If you want to extract data from pages and don't want to parse the HTML on your side, you can add extraction rules to your API call.

The simplest way to use extraction rules is the following format:

{"key_name" : "css_selector"} 

For example, if you wish to extract the title and subtitle of our blog, you will need to use these rules:

{
    "title" : "h1",
    "subtitle" : "#subtitle"
}

And this will be the JSON response:

{
    "title" : "The ScrapingBee Blog",
    "subtitle" : "We help you get better at web-scraping: detailed tutorial, case studies and writing by industry experts"
}

Important: extraction rules are JSON formatted, and in order to pass them to a GET request, you need to stringify them.
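For example, in Python you can build the rules as a dict and stringify them with json.dumps before sending the request. Here is a minimal sketch, assuming you call the API directly with the requests library rather than through one of our clients:

import json

import requests

extract_rules = {"title": "h1", "subtitle": "#subtitle"}

response = requests.get(
    'https://app.scrapingbee.com/api/v1/',
    params={
        'api_key': 'YOUR-API-KEY',
        'url': 'https://www.scrapingbee.com/blog',
        # json.dumps turns the dict into a JSON string;
        # requests then URL-encodes it in the query string
        'extract_rules': json.dumps(extract_rules),
    },
)
print(response.text)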

Here is how to extract the above information in your favorite language.

# Install the Python ScrapingBee library:
# pip install scrapingbee

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR-API-KEY')
response = client.get(
    'https://www.scrapingbee.com/blog',
    params={
        'extract_rules':{"title": "h1", "subtitle": "#subtitle"},
    },
)
print('Response HTTP Status Code: ', response.status_code)
print('Response HTTP Response Body: ', response.content)

// request Classic
const https = require('https')
const extract_rules = encodeURIComponent('{"title":"h1","subtitle":"#subtitle"}')

const options = {
    hostname: 'app.scrapingbee.com',
    port: '443',
    path: '/api/v1?api_key=YOUR-API-KEY&url=https://www.scrapingbee.com/blog&extract_rules=' + extract_rules,
    method: 'GET',
}

const req = https.request(options, res => {
    console.log(`statusCode: ${ res.statusCode }`)
    res.on('data', d => {
        process.stdout.write(d)
    })
})

req.on('error', error => {
    console.error(error)
})

req.end()

require 'net/http'
require 'net/https'
require 'uri'
require 'cgi'

# Classic (GET)
def send_request
    # URI::encode was removed in Ruby 3.0; use CGI.escape instead
    extract_rules = CGI.escape('{"title": "h1", "subtitle": "#subtitle"}')
    uri = URI('https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=https://www.scrapingbee.com/blog&extract_rules=' + extract_rules)

    # Create client
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER

    # Create Request
    req = Net::HTTP::Get.new(uri)

    # Fetch Request
    res = http.request(req)
    puts "Response HTTP Status Code: #{ res.code }"
    puts "Response HTTP Response Body: #{ res.body }"
rescue StandardError => e
    puts "HTTP Request failed (#{ e.message })"
end

send_request()

<?php

// get cURL resource
$ch = curl_init();

// stringify and URL-encode the extraction rules
$extract_rules = urlencode('{"title": "h1", "subtitle": "#subtitle"}');

// set url
curl_setopt($ch, CURLOPT_URL, 'https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=https://www.scrapingbee.com/blog&extract_rules=' . $extract_rules);

// set method
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');

// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// send the request and save response to $response
$response = curl_exec($ch);

// stop if fails
if (!$response) {
  die('Error: "' . curl_error($ch) . '" - Code: ' . curl_errno($ch));
}

echo 'HTTP Status Code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
echo 'Response Body: ' . $response . PHP_EOL;

// close curl resource to free up system resources
curl_close($ch);

?>

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func sendClassic() {
	// Create client
	client := &http.Client{}

	// Stringify and URL-encode the rules
	extractRules := url.QueryEscape(`{"title": "h1", "subtitle": "#subtitle"}`)

	// Create request
	req, err := http.NewRequest("GET", "https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=https://www.scrapingbee.com/blog&extract_rules="+extractRules, nil)
	if err != nil {
		fmt.Println("Failure : ", err)
		return
	}

	// Fetch Request
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("Failure : ", err)
		return
	}
	defer resp.Body.Close()

	// Read Response Body
	respBody, _ := io.ReadAll(resp.Body)

	// Display Results
	fmt.Println("response Status : ", resp.Status)
	fmt.Println("response Headers : ", resp.Header)
	fmt.Println("response Body : ", string(respBody))
}

func main() {
	sendClassic()
}

Please note that using:

{
    "title" : "h1",
}

is the same as using:

{
    "title" : {
        "selector": "h1",
        "output": "text",
        "type": "item"
    }
}

Below are more details about these different options.

Output format

For a given selector, you can extract different kinds of data using the output option:

  • text: text content of selector (default)
  • html: HTML content of selector
  • @...: attribute of selector (prefixed by @)

Below is an example of different output options used with the same selector.

{
    "title_text" : {
        "selector": "h1",
        "output": "text"
    },
    "title_html" : {
        "selector": "h1",
        "output": "html"
    },
    "title_id" : {
        "selector": "h1",
        "output": "@id"
    },
}

The information extracted by the above rules on ScrapingBee's blog page would be

{
    "title_text": "The ScrapingBee Blog",
    "title_html": "<h1 id=\"the-scrapingbee-blog\"<The <a href=\"https://www.scrapingbee.com/\"<ScrapingBee</a< Blog</h1<",
    "title_id": "the-scrapingbee-blog"
}

Single element or list

By default, we will return the first HTML element that matches the selector. If you want to get all the elements matching the selector, use the type option. type can be:

  • item: return the first element matching the selector (default)
  • list: return a list of all elements matching the selector

Here is an example extracting post titles from our blog:

{
    "first_post_title" : {
        "selector": ".post-title",
        "type": "item"
    },
    "all_post_title" : {
        "selector": ".post-title",
        "type": "list"
    },
}

The information extracted by the above rules on ScrapingBee's blog page would be

{
  "first_post_title": "  Block ressources with Puppeteer - (5min)",
  "all_post_title": [
    "  Block ressources with Puppeteer - (5min)",
    "  Web Scraping vs Web Crawling: Ultimate Guide - (10min)",
    ...
    "  Scraping E-Commerce Product Data - (6min)",
    "  Introduction to Chrome Headless with Java - (4min)"
  ]
}
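When a rule uses "type": "list", the matching field in the JSON response is an array that you can iterate over directly. Here is a minimal Python sketch, following the first example:

import json

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR-API-KEY')
response = client.get(
    'https://www.scrapingbee.com/blog',
    params={
        'extract_rules': {
            'all_post_title': {'selector': '.post-title', 'type': 'list'},
        },
    },
)

# The API returns a JSON document when extraction rules are used
data = json.loads(response.content)
for title in data['all_post_title']:
    print(title.strip())  # titles come back with leading whitespace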

Extract nested items

It is also possible to add extraction rules inside the output option in order to create powerful extractors.

Here are the rules that would extract general information and all blog post details from ScrapingBee's blog.

{
    "title" : "h1",
    "subtitle" : "#subtitle",
    "articles": {
        "selector": ".card",
        "type": "list",
        "output": {
            "title": ".post-title",
            "link": {
                "selector": ".post-title",
                "output": "@href"
            },
            "description": ".post-description"
        }
    }
}

The information extracted by the above rules on ScrapingBee's blog page would be

{
  "title": "The ScrapingBee Blog",
  "subtitle": " We help you get better at web-scraping: detailed tutorial, case studies and \n                        writing by industry experts",
  "articles": [
    {
      "title": "  Block ressources with Puppeteer - (5min)",
      "link": "https://www.scrapingbee.com/blog/block-requests-puppeteer/",
      "description": "This article will show you how to intercept and block requests with Puppeteer using the request interception API and the puppeteer extra plugin."
    },
    ...
    {
      "title": "  Web Scraping vs Web Crawling: Ultimate Guide - (10min)",
      "link": "https://www.scrapingbee.com/blog/scraping-vs-crawling/",
      "description": "What is the difference between web scraping and web crawling? That's exactly what we will discover in this article, and the different tools you can use."
    }
  ]
}
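For instance, here is how you could send a trimmed-down version of the nested rules above with the Python client and loop over the extracted articles (a sketch following the first example):

import json

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR-API-KEY')
response = client.get(
    'https://www.scrapingbee.com/blog',
    params={
        'extract_rules': {
            'articles': {
                'selector': '.card',
                'type': 'list',
                'output': {
                    'title': '.post-title',
                    'link': {'selector': '.post-title', 'output': '@href'},
                },
            },
        },
    },
)

# Each entry in "articles" is an object with the keys defined in "output"
for article in json.loads(response.content)['articles']:
    print(article['title'].strip(), '->', article['link'])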

Common use cases

Below you will find common extraction rules often used by our users.

Extract all links from a page

For SEO purposes, lead generation, or simply data harvesting, it can be useful to quickly extract all links from a single page.

The following extract_rules will allow you to do that with one simple API call:

{
    "all_links" : {
        "selector": "a",
        "type": "list",
        "output": "@href"
    }
}

The JSON response will be as follows:

{
    "all_links": [
        "https://www.scrapingbee.com/",
        ...,
        "https://www.scrapingbee.com/api-store/"
    ]
}
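As a quick sketch with the Python client (same setup as the first example), collecting every URL on the page looks like this:

import json

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR-API-KEY')
response = client.get(
    'https://www.scrapingbee.com/blog',
    params={
        'extract_rules': {
            'all_links': {'selector': 'a', 'type': 'list', 'output': '@href'},
        },
    },
)

links = json.loads(response.content)['all_links']
print(f'{len(links)} links found')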

If you wish to extract both the `href` and the anchor text of links, you can use these rules instead:

{
    "all_links" : {
        "selector": "a",
        "type": "list",
        "output": {
            "anchor": "a",
            "href": {
                "selector": "a",
                "output": "@href"
            }
        }
    }
}

The JSON response will be as follows:

{
   "all_links":[
      {
         "anchor":"Blog",
         "href":"https://www.scrapingbee.com/blog/"
      },
      ...
      {
         "anchor":" Linkedin ",
         "href":"https://www.linkedin.com/company/26175275/admin/"
      }
   ]
}
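Each entry in the response is then a small object, so pairing every anchor with its target URL is straightforward. A minimal Python sketch, reusing the setup from the first example:

import json

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR-API-KEY')
response = client.get(
    'https://www.scrapingbee.com/blog',
    params={
        'extract_rules': {
            'all_links': {
                'selector': 'a',
                'type': 'list',
                'output': {
                    'anchor': 'a',
                    'href': {'selector': 'a', 'output': '@href'},
                },
            },
        },
    },
)

for link in json.loads(response.content)['all_links']:
    print(f"{link['anchor'].strip()} -> {link['href']}")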