Documentation - Data Extraction

Extract data with CSS or XPATH selectors

You can also discover this feature using our Postman collection, which covers all of ScrapingBee's features.

💡 Important:
This page explains how to use a specific feature of our main web scraping API!
If you are not yet familiar with the ScrapingBee web scraping API, you can read the documentation here.

Basic usage

If you want to extract data from pages and don't want to parse the HTML on your side, you can add extraction rules to your API call.

The simplest way to write extraction rules is to use the following format:

{"key_name" : "css_or_xpath_selector"}

For example, if you wish to extract the title and subtitle of our blog, you can use these rules:

{
    "title" : "h1",
    "subtitle" : "#subtitle"
}

And this will be the JSON response:

{
    "title" : "The ScrapingBee Blog",
    "subtitle" : "We help you get better at web-scraping: detailed tutorial, case studies and writing by industry experts"
}

You can also extract an HTML attribute by using the @ prefix.

This means that if you want to extract a link from the page, you can use the following rule:

{"link" : "@href"}

Important: extraction rules are JSON formatted, and in order to pass them to a GET request, you need to stringify them.
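As a minimal sketch of what "stringify" means here, the Python snippet below serializes the rules with json.dumps and then URL-encodes them into a GET query string. The endpoint and parameter names are the ones used in the examples on this page; build_request_url is a hypothetical helper name introduced only for illustration.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the request examples on this page.
API_ENDPOINT = "https://app.scrapingbee.com/api/v1/"


def build_request_url(api_key, target_url, extract_rules):
    """Hypothetical helper: return a GET URL with stringified, URL-encoded rules."""
    params = {
        "api_key": api_key,
        "url": target_url,
        # json.dumps turns the rules dict into a JSON string.
        "extract_rules": json.dumps(extract_rules),
    }
    # urlencode percent-encodes every value, including the JSON string.
    return API_ENDPOINT + "?" + urlencode(params)


request_url = build_request_url(
    "YOUR-API-KEY",
    "https://www.scrapingbee.com/blog",
    {"title": "h1", "subtitle": "#subtitle"},
)
print(request_url)
```

Encoding the target URL as well avoids ambiguity when that URL has its own query string.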

Here is how to extract the above information in your favorite language.

# Install the Python ScrapingBee library:
# pip install scrapingbee

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR-API-KEY')
response = client.get(
    'https://www.scrapingbee.com/blog',
    params={
        'extract_rules':{"title": "h1", "subtitle": "#subtitle"},
    },
)
print('Response HTTP Status Code: ', response.status_code)
print('Response HTTP Response Body: ', response.content)

// request Axios
const axios = require('axios');

axios.get('https://app.scrapingbee.com/api/v1', {
    params: {
        'api_key': 'YOUR-API-KEY',
        'url': 'https://www.scrapingbee.com/blog',
        'extract_rules': '{"title":"h1","subtitle":"#subtitle"}',
    }
}).then(function (response) {
    // handle success
    console.log(response);
})

String encoded_url = URLEncoder.encode("YOUR URL", "UTF-8");

require 'net/http'
require 'net/https'
require 'uri'

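# Classic (GET)
def send_request
    # URI::encode was removed in Ruby 3.0; encode_www_form_component escapes the JSON for use in a query string
    extract_rules = URI.encode_www_form_component('{"title": "h1", "subtitle": "#subtitle"}')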
    uri = URI('https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=https://www.scrapingbee.com/blog&extract_rules=' + extract_rules)

    # Create client
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER

    # Create Request
    req =  Net::HTTP::Get.new(uri)

    # Fetch Request
    res = http.request(req)
    puts "Response HTTP Status Code: #{ res.code }"
    puts "Response HTTP Response Body: #{ res.body }"
rescue StandardError => e
    puts "HTTP Request failed (#{ e.message })"
end

send_request()

<?php

// get cURL resource
$ch = curl_init();

// set url
$extract_rules = urlencode('{"title": "h1", "subtitle": "#subtitle"}');

curl_setopt($ch, CURLOPT_URL, 'https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=https://www.scrapingbee.com/blog&extract_rules=' . $extract_rules);

// set method
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');

// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);



// send the request and save response to $response
$response = curl_exec($ch);

// stop if fails
if (!$response) {
  die('Error: "' . curl_error($ch) . '" - Code: ' . curl_errno($ch));
}

echo 'HTTP Status Code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
echo 'Response Body: ' . $response . PHP_EOL;

// close curl resource to free up system resources
curl_close($ch);

?>

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

func sendClassic() {
	// Create client
	client := &http.Client{}

	// Stringify rules
	extractRules := url.QueryEscape(`{"title": "h1", "subtitle": "#subtitle"}`)

	// Create request
	req, err := http.NewRequest("GET", "https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=https://www.scrapingbee.com/blog&extract_rules="+extractRules, nil)
	if err != nil {
		fmt.Println("Failure : ", err)
		return
	}

	// Fetch Request
	resp, err := client.Do(req)
	if err != nil {
		// resp is nil on failure, so stop here instead of reading the body
		fmt.Println("Failure : ", err)
		return
	}
	defer resp.Body.Close()

	// Read Response Body
	respBody, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("Failure : ", err)
		return
	}

	// Display Results
	fmt.Println("response Status : ", resp.Status)
	fmt.Println("response Headers : ", resp.Header)
	fmt.Println("response Body : ", string(respBody))
}

func main() {
	sendClassic()
}

Please note that using:

{
    "title" : "h1",
    "link": "a@href"
}

Is the same as using:

{
    "title" : {
        "selector": "h1",
        "output": "text",
        "type": "item"
    },
    "link": {
        "selector": "a",
        "output": "@href",
        "type": "item"
    }
}

Below are more details about all those different options.



CSS or XPATH selector

selector_type [ auto | css | xpath ] (default= auto)

You can use extract rules with CSS or XPATH selectors. By default, the rules work without the need to specify the kind of selector you are using.

Any selector beginning with a / is treated as an XPATH selector; everything else is treated as a CSS selector.

{"extract_rules": {"title": "#title"}} # CSS selector
{"extract_rules": {"title": "//h1[@id=\"title\"]"}} # XPATH selector
{"extract_rules": {"title": "/html/body/h1[@id=\"title\"]"}}  # XPATH selector
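The detection rule described above can be sketched in a few lines of Python. This only mirrors the rule as stated on this page (a leading / means XPATH); guess_selector_type is a hypothetical name, and the service's actual implementation is not shown here.

```python
def guess_selector_type(selector: str) -> str:
    """Mirror the documented 'auto' rule: a leading '/' means XPATH, anything else CSS."""
    return "xpath" if selector.startswith("/") else "css"


for selector in ["#title", '//h1[@id="title"]', './html/body/h1[@id="title"]']:
    print(selector, "->", guess_selector_type(selector))
```

The last selector is a relative XPATH expression, yet the rule classifies it as CSS because it starts with a dot. That is the kind of case where setting selector_type explicitly helps.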

Sometimes you might want to set the selector type explicitly, for example if:

  • you use an XPATH selector which doesn't begin with /
  • you use a CSS selector which begins with /
  • you simply want to make your code clearer

Then you can use the selector_type property.

{"extract_rules": {"title": {"selector": "#title", "selector_type": "css"}}} # CSS selector
{"extract_rules": {"title": {"selector": "./html/body/h1[@id=\"title\"]", "selector_type": "xpath"}}} # XPATH selector


Output Format

output [ text | text_relevant | html | table_array | table_json | @...] (default= text)

For a given selector, you can extract different kinds of data using the output option:

  • text: text content of selector (default)
  • text_relevant: text content of selector, but trimmed of scripts, css, header, footer in order to only keep "content". Very useful for AI training (beta)
  • html: HTML content of selector
  • @...: attribute of selector (prefixed by @)
  • table_json: JSON representation of a <table> (more details here)
  • table_array: Array representation of a <table> (more details here)

Below is an example of the different output options.

{
    "title_text" : {
        "selector": "h1",
        "output": "text"
    },
    "title_text_relevant" : {
        "selector": "h1",
        "output": "text_relevant"
    },
    "title_html" : {
        "selector": "h1",
        "output": "html"
    },
    "title_id" : {
        "selector": "h1",
        "output": "@id"
    },
    "table_array" : {
        "selector": "table",
        "output": "table_array"
    },
    "table_json" : {
        "selector": "table",
        "output": "table_json"
    }
}

The information extracted by the above rules on ScrapingBee's documentation page will be:

{
    "title_text": "Documentation - HTML API",
    "title_text_relevant": "Documentation - HTML API", # No particular effect here. Use it on "body" to see the difference with "text"
    "title_html": "<h1 id=\"the-scrapingbee-documentation\"> Documentation - HTML API </h1>",
    "title_id": "the-scrapingbee-documentation",
    "table_array": [
            ["Rotating Proxy without JavaScript rendering", "1"],
            ["Rotating Proxy with JavaScript rendering  (default)", "5"],
            ["Premium Proxy without JavaScript rendering", "10"],
            ["Premium Proxy with JavaScript rendering", "25"]
        ],
    "table_json": [
            {"Feature used": "Rotating Proxy without JavaScript rendering", "API credit cost": "1"},
            {"Feature used": "Rotating Proxy with JavaScript rendering  (default)", "API credit cost": "5"},
            {"Feature used": "Premium Proxy without JavaScript rendering", "API credit cost": "10"},
            {"Feature used": "Premium Proxy with JavaScript rendering", "API credit cost": "25"}
        ]
}

Shortcuts

To make extract rules easier to write and maintain, you can use a simpler syntax to extract text and @attribute.

Meaning that using:

{
    "title" : "h1",
    "link": "a@href"
}

Is the same as using:

{
    "title" : {
        "selector": "h1",
        "output": "text",
        "type": "item"
    },
    "link": {
        "selector": "a",
        "output": "@href",
        "type": "item"
    }
}
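To make the equivalence concrete, here is a small Python sketch that expands the shortcut form into the full form shown above. It is an illustration only: expand_shortcut is a hypothetical helper, and the assumption that a trailing @attribute is split off only for CSS selectors is ours, not a description of how the API parses rules.

```python
def expand_shortcut(rule: str) -> dict:
    """Expand a shortcut such as "h1" or "a@href" into the full rule form."""
    selector, output = rule, "text"
    # Assumption: only CSS shortcuts carry a trailing @attribute. Selectors that
    # start with "/" are left untouched because "@" is part of XPATH syntax.
    if "@" in rule and not rule.startswith("/"):
        selector, _, attribute = rule.rpartition("@")
        output = "@" + attribute
    return {"selector": selector, "output": output, "type": "item"}


shortcuts = {"title": "h1", "link": "a@href"}
expanded = {key: expand_shortcut(rule) for key, rule in shortcuts.items()}
print(expanded)
```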

Extracting information from tables

ScrapingBee allows you to easily get formatted information from HTML tables.

We offer two modes to do it: table_array and table_json.

Let's say you want to extract this table from the HTML page.

Feature used                                          API credit cost
Rotating Proxy without JavaScript rendering           1
Rotating Proxy with JavaScript rendering (default)    5
Premium Proxy without JavaScript rendering            10
Premium Proxy with JavaScript rendering               25

And let's say that this table has its id set to pricing_table.

JSON representation

If you use these extract rules:

{
    "table_json" : {
        "selector": "#pricing_table",
        "output": "table_json"
    }
}

You will get this result:

{
    "table_json": [
        {"Feature used": "Rotating Proxy without JavaScript rendering", "API credit cost": "1"},
        {"Feature used": "Rotating Proxy with JavaScript rendering  (default)", "API credit cost": "5"},
        {"Feature used": "Premium Proxy without JavaScript rendering", "API credit cost": "10"},
        {"Feature used": "Premium Proxy with JavaScript rendering", "API credit cost": "25"}
    ]
}

Each row of the table is turned into a JSON object whose keys are the column names and whose values are the cell contents.

We advise using this mode if the table is correctly formatted and has a header row (a first row containing the column names).

Array representation

If you use these extract rules:

{
    "table_array" : {
        "selector": "#pricing_table",
        "output": "table_array"
    }
}

You will get this result:

{
    "table_array": [
        ["Rotating Proxy without JavaScript rendering", "1"],
        ["Rotating Proxy with JavaScript rendering  (default)", "5"],
        ["Premium Proxy without JavaScript rendering", "10"],
        ["Premium Proxy with JavaScript rendering", "25"]
    ]
}

Each row of the table is turned into an array of N elements, where N is the number of columns in the table.

We advise using this mode if the table is not correctly formatted or doesn't have a header row (a first row containing the column names).
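If you fetch a table with table_array and know the column names from elsewhere, you can rebuild records shaped like the table_json output on your side. The sketch below assumes you supply the header yourself; rows_to_records is a hypothetical helper, and the rows are copied from the example above.

```python
def rows_to_records(header, rows):
    """Pair each cell with its column name, one dict per table row."""
    return [dict(zip(header, row)) for row in rows]


header = ["Feature used", "API credit cost"]
table_array = [
    ["Rotating Proxy without JavaScript rendering", "1"],
    ["Premium Proxy with JavaScript rendering", "25"],
]

records = rows_to_records(header, table_array)
print(records)
```

Note that zip stops at the shorter sequence, so a row with missing cells yields a record with fewer keys rather than an error.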



Single element or list

type [ item | list ] (default= item)

By default, we return the first HTML element that matches the selector. If you want to get all elements matching the selector, you should use the type option. type can be:

  • item: returns the first element matching the selector (default)
  • list: returns a list of all elements matching the selector

Here is an example that extracts post titles from our blog.

{
    "first_post_title" : {
        "selector": ".post-title",
        "type": "item"
    },
    "all_post_title" : {
        "selector": ".post-title",
        "type": "list"
    }
}

The information extracted by the above rules on ScrapingBee's blog page would be:

{
  "first_post_title": "  Block ressources with Puppeteer - (5min)",
  "all_post_title": [
    "  Block ressources with Puppeteer - (5min)",
    "  Web Scraping vs Web Crawling: Ultimate Guide - (10min)",
    ...
    "  Scraping E-Commerce Product Data - (6min)",
    "  Introduction to Chrome Headless with Java - (4min)"
  ]
}


Clean Text

clean [ true | false ] (default= true)

By default, ScrapingBee returns clean content: it removes trailing spaces and whitespace characters ('\n', '\t', etc.) from the results. If you don't want this behavior, you can disable it by setting "clean": false in your data extraction rule.

Here is an example that extracts a post description from our blog using "clean": true.

{
    "first_post_description" : {
        "selector": ".card > div",
        "clean": true #default
    }
}

The information extracted by the above rules on ScrapingBee's blog page would be:

{
    "first_post_description": "How to Use a Proxy with Python Requests? - (7min) By Maxine Meurer 13 October 2021 In this tutorial we will see how to use a proxy with the Requests package. We will also discuss on how to choose the right proxy provider.read more"
}

If you use "clean": false instead:

{
    "first_post_description" : {
        "selector": ".card > div",
        "clean": false
    }
}

You would get this result instead:

{
    "first_post_description": "\n                How to Use a Proxy with Python Requests? - (7min)\n        \n            \n            \n            By Maxine Meurer\n            \n            \n            13 October 2021\n            \n        \n        In this tutorial we will see how to use a proxy with the Requests package. We will also discuss on how to choose the right proxy provider.\n        read more\n        "
}
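If you request raw text with "clean": false and later want something close to the cleaned form, a client-side approximation is to collapse every run of whitespace into a single space. This is only a sketch and not the exact algorithm ScrapingBee applies, so the result may differ slightly from the API's own cleaned output; collapse_whitespace is a hypothetical helper name.

```python
def collapse_whitespace(text: str) -> str:
    """Approximate cleaning: trim the ends and squeeze whitespace runs to one space."""
    # str.split() with no argument splits on any whitespace, including '\n' and '\t'.
    return " ".join(text.split())


raw = "\n                How to Use a Proxy with Python Requests? - (7min)\n        \n            By Maxine Meurer\n        "
print(collapse_whitespace(raw))
```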

Extract nested items

It is also possible to add extraction rules inside the output option in order to create powerful extractors.

Here are the rules that would extract general information and all blog post details from ScrapingBee's blog.

{
    "title" : "h1",
    "subtitle" : "#subtitle",
    "articles": {
        "selector": ".card",
        "type": "list",
        "output": {
            "title": ".post-title",
            "link": {
                "selector": ".post-title",
                "output": "@href"
            },
            "description": ".post-description"
        }
    }
}

The information extracted by the above rules on ScrapingBee's blog page would be:

{
  "title": "The ScrapingBee Blog",
  "subtitle": " We help you get better at web-scraping: detailed tutorial, case studies and \n                        writing by industry experts",
  "articles": [
    {
      "title": "  Block ressources with Puppeteer - (5min)",
      "link": "https://www.scrapingbee.com/blog/block-requests-puppeteer/",
      "description": "This article will show you how to intercept and block requests with Puppeteer using the request interception API and the puppeteer extra plugin."
    },
    ...
    {
      "title": "  Web Scraping vs Web Crawling: Ultimate Guide - (10min)",
      "link": "https://www.scrapingbee.com/blog/scraping-vs-crawling/",
      "description": "What is the difference between web scraping and web crawling? That's exactly what we will discover in this article, and the different tools you can use."
    }
  ]
}
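Once you have a nested response like this one, reading it on your side is ordinary JSON handling. The Python sketch below parses a shortened copy of the sample response and maps each article title to its link; the variable names are illustrative assumptions.

```python
import json

# Shortened copy of the sample response above (descriptions omitted).
sample_response = """
{
  "title": "The ScrapingBee Blog",
  "articles": [
    {"title": "  Block ressources with Puppeteer - (5min)",
     "link": "https://www.scrapingbee.com/blog/block-requests-puppeteer/"},
    {"title": "  Web Scraping vs Web Crawling: Ultimate Guide - (10min)",
     "link": "https://www.scrapingbee.com/blog/scraping-vs-crawling/"}
  ]
}
"""

data = json.loads(sample_response)

# Each entry of the "articles" list is a dict with the keys of the nested rules.
links_by_title = {
    article["title"].strip(): article["link"] for article in data["articles"]
}

for title, link in links_by_title.items():
    print(title, "->", link)
```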

Common use cases

Below you will find common extraction rules often used by our users.

Extract all links from a page

For SEO purposes, lead generation, or simply data harvesting it can be useful to quickly extract all links from a single page.

The following extract_rules will allow you to do that with one simple API call:

{
    "all_links" : {
        "selector": "a",
        "type": "list",
        "output": "@href"
    }
}

The JSON response will be as follows:

{
    "all_links": [
        "https://www.scrapingbee.com/",
        ...,
        "https://www.scrapingbee.com/api-store/"
    ]
}

If you wish to extract both the href and the anchor text of links, you can use these rules instead:

{
    "all_links" : {
        "selector": "a",
        "type": "list",
        "output": {
            "anchor": "a",
            "href": {
                "selector": "a",
                "output": "@href"
            }
        }
    }
}

The JSON response will be as follows:

{
   "all_links":[
      {
         "anchor":"Blog",
         "href":"https://www.scrapingbee.com/blog/"
      },
      ...
      {
         "anchor":" Linkedin ",
         "href":"https://www.linkedin.com/company/26175275/admin/"
      }
   ]
}

Extract all text from a page

If you need to get all the text of a web page, and only the text, meaning no HTML tags or attributes, you can use these rules:

{
    "text": "body"
}

For example, using these rules with the ScrapingBee landing page returns this result:

{
    "text": "Login Sign Up Pricing FAQ Blog Other Features Screenshots Google search API Data extraction JavaScript scenario No code scraping with Integromat Documentation Tired of getting blocked while scraping the web? ScrapingBee API handles headless browsers and rotates proxies for you. Try ScrapingBee for Free based on 25+ reviews. Render your web page as if it were a real browser. We manage thousands of headless instances using the latest Chrome version. Focus on extracting the data you need, and not dealing with concurrent headless browsers that will eat up all your RAM and CPU. Latest Chrome version Fast, no matter what! ScrapingBee simplified our day-to-day marketing and engineering operations a lot . We no longer have to worry about managing our own fleet of headless browsers, and we no longer have to spend days sourcing the right proxy provider Mike Ritchie CEO @ SeekWell Javascript Rendering We render Javascript with a simple parameter so you can scrape every website, even Single Page Applications using React, AngularJS, Vue.js or any other libraries. Execute custom JS snippet Custom wait for all JS to be executed ScrapingBee is helping us scrape many job boards and company websites without having to deal with proxies or chrome browsers. It drastically simplified our data pipeline Russel Taylor CEO @ HelloOutbound Rotating Proxies Thanks to our large proxy pool, you can bypass rate limiting website, lower the chance to get blocked and hide your bots! Large proxy pool Geotargeting Automatic proxy rotation ScrapingBee clear documentation, easy-to-use API, and great success rate made it a no-brainer. Dominic Phillips Co-Founder @ CodeSubmit Three specific ways to use ScrapingBee How our customers use our API: 1. ..."
}

Extract all email addresses from a page

If you need to get all the email addresses of a web page, you can use these rules:

{
    "email_addresses": {
        "selector": "a[href^='mailto']",
        "output": "@href",
        "type": "list"
    }
}

Using these rules with the ScrapingBee landing page returns this result:

{
    "email_addresses": [
        "mailto:contact@scrapingbee.com"
    ]
}

How does this work?

First, we target all anchors (a tags) that have an href attribute starting with the string mailto, then we extract only the href attribute. Since we want all email addresses on the page and not just one, we use the type list (on the ScrapingBee landing page there is just one email address anyway).

Limitation

These rules will only work for links whose href attribute starts with mailto. If the email addresses on the page are just plain text or simple anchors, you should either extract all the text on the page and run a regular expression on it, or extract all links on the page and filter for email addresses on your side.
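Both fallbacks can be sketched in Python. The regular expression below is a deliberately simple pattern that catches common address shapes, not a full validator, and the function names are hypothetical. The first function works on text extracted with the "body" rule, the second on a list of hrefs extracted with the all-links rule.

```python
import re

# Simple pattern for common address shapes; it is not a complete email validator.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def emails_from_text(text):
    """Return the unique email-like strings found in plain page text, sorted."""
    return sorted(set(EMAIL_PATTERN.findall(text)))


def emails_from_links(hrefs):
    """Keep only mailto: links and strip the scheme and any ?subject=... part."""
    emails = []
    for href in hrefs:
        if href.lower().startswith("mailto:"):
            address = href[len("mailto:"):].split("?", 1)[0]
            if address:
                emails.append(address)
    return emails


page_text = "Write to contact@scrapingbee.com or sales@example.org today"
page_links = ["mailto:contact@scrapingbee.com", "https://www.scrapingbee.com/"]

print(emails_from_text(page_text))
print(emails_from_links(page_links))
```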