Documentation - Data Extraction
Extract data with CSS or XPath selectors
You can also discover this feature using our Postman collection, which covers all of ScrapingBee's features.
💡 Important:
This page explains how to use a specific feature of our main web scraping API.
If you are not yet familiar with the ScrapingBee web scraping API, you can read the documentation here.
Basic usage
If you want to extract data from pages and don't want to parse the HTML on your side, you can add extraction rules to your API call.
The simplest way to use extraction rules is the following format:
{"key_name" : "css_or_xpath_selector"}
For example, if you wish to extract the title and subtitle of our blog, you will need to use these rules:
{
"title" : "h1",
"subtitle" : "#subtitle"
}
And this will be the JSON response
{
"title" : "The ScrapingBee Blog",
"subtitle" : "We help you get better at web-scraping: detailed tutorial, case studies and writing by industry experts"
}
You can also extract HTML attributes by using the @ prefix.
For example, if you want to extract a link from the page, you can use the following rule.
{"link" : "@href"}
Important: extraction rules are JSON-formatted, and in order to pass them in a GET request, you need to stringify them.
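In Python, for instance, stringifying comes down to a single json.dumps call. Here is a minimal sketch using only the standard library (the endpoint and parameter names are the ones used in the examples on this page):

```python
import json

# The example rules from above
extract_rules = {"title": "h1", "subtitle": "#subtitle"}

# Stringify the rules so they can travel as a single GET query parameter
params = {
    "api_key": "YOUR-API-KEY",
    "url": "https://www.scrapingbee.com/blog",
    "extract_rules": json.dumps(extract_rules),
}

print(params["extract_rules"])  # {"title": "h1", "subtitle": "#subtitle"}
```

Your HTTP client then URL-encodes this string like any other query parameter.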
Here is how to extract the above information in your favorite language.
# Install the Python ScrapingBee library:
# pip install scrapingbee
from scrapingbee import ScrapingBeeClient
client = ScrapingBeeClient(api_key='YOUR-API-KEY')
response = client.get(
    'https://www.scrapingbee.com/blog',
    params={
        'extract_rules': {"title": "h1", "subtitle": "#subtitle"},
    },
)
print('Response HTTP Status Code: ', response.status_code)
print('Response HTTP Response Body: ', response.content)
// Using the Axios HTTP client
const axios = require('axios');
axios.get('https://app.scrapingbee.com/api/v1', {
    params: {
        'api_key': 'YOUR-API-KEY',
        'url': 'https://www.scrapingbee.com/blog',
        'extract_rules': '{"title":"h1","subtitle":"#subtitle"}',
    }
}).then(function (response) {
    // handle success
    console.log(response);
})
require 'net/http'
require 'net/https'
require 'uri'
require 'cgi'

# Classic (GET)
def send_request
  extract_rules = CGI.escape('{"title": "h1", "subtitle": "#subtitle"}')
  uri = URI('https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=https://www.scrapingbee.com/blog&extract_rules=' + extract_rules)

  # Create client
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = true
  http.verify_mode = OpenSSL::SSL::VERIFY_PEER

  # Create request
  req = Net::HTTP::Get.new(uri)

  # Fetch request
  res = http.request(req)
  puts "Response HTTP Status Code: #{res.code}"
  puts "Response HTTP Response Body: #{res.body}"
rescue StandardError => e
  puts "HTTP Request failed (#{e.message})"
end

send_request()
<?php
// get cURL resource
$ch = curl_init();
// set url
$extract_rules = urlencode('{"title": "h1", "subtitle": "#subtitle"}');
curl_setopt($ch, CURLOPT_URL, 'https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=https://www.scrapingbee.com/blog&extract_rules=' . $extract_rules);
// set method
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// send the request and save response to $response
$response = curl_exec($ch);
// stop if fails
if (!$response) {
die('Error: "' . curl_error($ch) . '" - Code: ' . curl_errno($ch));
}
echo 'HTTP Status Code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
echo 'Response Body: ' . $response . PHP_EOL;
// close curl resource to free up system resources
curl_close($ch);
?>
package main
import (
"fmt"
"io/ioutil"
"net/http"
"net/url"
)
func sendClassic() {
    // Create client
    client := &http.Client{}
    // Stringify rules
    extractRules := url.QueryEscape(`{"title": "h1", "subtitle": "#subtitle"}`)
    // Create request
    req, err := http.NewRequest("GET", "https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=https://www.scrapingbee.com/blog&extract_rules="+extractRules, nil)
    if err != nil {
        fmt.Println("Failure : ", err)
        return
    }
    // Fetch request
    resp, err := client.Do(req)
    if err != nil {
        fmt.Println("Failure : ", err)
        return
    }
    defer resp.Body.Close()
    // Read response body
    respBody, _ := ioutil.ReadAll(resp.Body)
    // Display results
    fmt.Println("response Status : ", resp.Status)
    fmt.Println("response Headers : ", resp.Header)
    fmt.Println("response Body : ", string(respBody))
}
func main() {
sendClassic()
}
Please note that using:
{
"title" : "h1",
"link": "a@href"
}
is the same as using:
{
"title" : {
"selector": "h1",
"output": "text",
"type": "item"
},
"link": {
"selector": "a",
"output": "@href",
"type": "item"
}
}
Below are more details about all those different options.
CSS or XPath selector
selector_type [ auto | css | xpath ] (default = auto)
You can use extraction rules with CSS or XPath selectors. By default, you don't need to specify which kind of selector you are using: any selector beginning with a / is treated as an XPath selector, and everything else is treated as a CSS selector.
{"extract_rules": {"title": "#title"}} # CSS selector
{"extract_rules": {"title": "//h1[@id=\"title\"]"}} # XPATH selector
{"extract_rules": {"title": "/html/body/h1[@id=\"title\"]"}} # XPATH selector
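The auto-detection rule described above can be sketched in a few lines of Python. This is a hypothetical helper for illustration, not part of the API client:

```python
def guess_selector_type(selector: str) -> str:
    """Mimic the documented default: selectors starting with "/" are XPath,
    everything else is treated as CSS."""
    return "xpath" if selector.startswith("/") else "css"

print(guess_selector_type("#title"))                      # css
print(guess_selector_type('//h1[@id="title"]'))           # xpath
print(guess_selector_type('/html/body/h1[@id="title"]'))  # xpath
```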
Sometimes, you might want to override this behavior:
- if you use an XPath selector that doesn't begin with /
- if you use a CSS selector that begins with /
- or if you simply want to make your code clearer
In those cases, you can use the selector_type property.
{"extract_rules": {"title": {"selector": "#title", "selector_type": "css"}}} # CSS selector
{"extract_rules": {"title": {"selector": "./html/body/h1[@id=\"title\"]", "selector_type": "xpath"}}} # XPATH selector
Output Format
output [ text | html | table_array | table_json | @... ] (default = text)
For a given selector, you can extract different kinds of data using the output option:
- text: the text content of the selector (default)
- text_relevant: the text content of the selector, trimmed of scripts, css, header and footer in order to only keep "content". Very useful for AI training (beta)
- markdown_relevant: the markdown content of the selector, trimmed of scripts, css, header and footer in order to only keep "content". Very useful for AI training (beta)
- html: the HTML content of the selector
- @...: an attribute of the selector (prefixed by @)
- table_json: a JSON representation of a <table> (more details here)
- table_array: an array representation of a <table> (more details here)
Below is an example of different output options using the same selector.
{
"title_text" : {
"selector": "h1",
"output": "text"
},
"title_text_relevant" : {
"selector": "h1",
"output": "text_relevant"
},
"title_mardown_relevant" : {
"selector": "h1",
"output": "markdown_relevant"
},
"title_html" : {
"selector": "h1",
"output": "html"
},
"title_id" : {
"selector": "h1",
"output": "@id"
},
"table_array" : {
"selector": "table",
"output": "table_array"
},
"table_json" : {
"selector": "table",
"output": "table_json"
}
}
The information extracted by the above rules on ScrapingBee's documentation page will be
{
"title_text": "Documentation - HTML API",
"title_text_relevant": "Documentation - HTML API", # No particular effect here. Use it on "body" to see the difference with "text"
"title_mardown_relevant": "# Documentation - HTML API",
"title_html": "<h1 id=\"the-scrapingbee-documentation\"> Documentation - HTML API </h1>",
"title_id": "the-scrapingbee-documentation",
"table_array": [
["Rotating Proxy without JavaScript rendering", "1"],
["Rotating Proxy with JavaScript rendering (default)", "5"],
["Premium Proxy without JavaScript rendering", "10"],
["Premium Proxy with JavaScript rendering", "25"]
],
"table_json": [
{"Feature used": "Rotating Proxy without JavaScript rendering", "API credit cost": "1"},
{"Feature used": "Rotating Proxy with JavaScript rendering (default)", "API credit cost": "5"},
{"Feature used": "Premium Proxy without JavaScript rendering", "API credit cost": "10"},
{"Feature used": "Premium Proxy with JavaScript rendering", "API credit cost": "25"}
]
}
Shortcuts
To make extract rules easier to write and maintain, you can use a simpler syntax to extract text and @attribute values.
Meaning that using:
{
"title" : "h1",
"link": "a@href"
}
is the same as using:
{
"title" : {
"selector": "h1",
"output": "text",
"type": "item"
},
"link": {
"selector": "a",
"output": "@href",
"type": "item"
}
}
Extracting information from tables
ScrapingBee allows you to easily get formatted information from HTML tables.
We offer two modes to do it: table_array and table_json.
Let's say you want to extract this table from the HTML page.
| Feature used | API credit cost |
|---|---|
| Rotating Proxy without JavaScript rendering | 1 |
| Rotating Proxy with JavaScript rendering (default) | 5 |
| Premium Proxy without JavaScript rendering | 10 |
| Premium Proxy with JavaScript rendering | 25 |
And let's say that this table has its id set to pricing_table.
JSON representation
If you use those extract rules:
{
"table_json" : {
"selector": "#pricing_table",
"output": "table_json"
}
}
You will get this result:
{
"table_json": [
{"Feature used": "Rotating Proxy without JavaScript rendering", "API credit cost": "1"},
{"Feature used": "Rotating Proxy with JavaScript rendering (default)", "API credit cost": "5"},
{"Feature used": "Premium Proxy without JavaScript rendering", "API credit cost": "10"},
{"Feature used": "Premium Proxy with JavaScript rendering", "API credit cost": "25"}
]
}
Each row of the table is turned into a JSON object whose keys are the column names and whose values are the cell contents.
We advise using this mode if the table is correctly formatted and has a header row (a first row with column names).
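Because each row comes back as a JSON object, the result is straightforward to post-process. A sketch, using the table_json payload shown above (note that cell values are returned as strings):

```python
# The table_json payload from the example response above
table_json = [
    {"Feature used": "Rotating Proxy without JavaScript rendering", "API credit cost": "1"},
    {"Feature used": "Rotating Proxy with JavaScript rendering (default)", "API credit cost": "5"},
    {"Feature used": "Premium Proxy without JavaScript rendering", "API credit cost": "10"},
    {"Feature used": "Premium Proxy with JavaScript rendering", "API credit cost": "25"},
]

# Build a feature -> credit-cost lookup, converting the string cells to ints
costs = {row["Feature used"]: int(row["API credit cost"]) for row in table_json}
print(costs["Premium Proxy with JavaScript rendering"])  # 25
```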
Array representation
If you use those extract rules:
{
"table_array" : {
"selector": "#pricing_table",
"output": "table_array"
}
}
You will get this result:
{
"table_array": [
["Rotating Proxy without JavaScript rendering", "1"],
["Rotating Proxy with JavaScript rendering (default)", "5"],
["Premium Proxy without JavaScript rendering", "10"],
["Premium Proxy with JavaScript rendering", "25"]
]
}
Each row of the table is turned into an array of N elements, where N is the number of columns in the table.
We advise using this mode if the table is not correctly formatted or doesn't have a header row (a first row with column names).
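A table_array result maps directly onto CSV rows. Here is a sketch that dumps the payload above to CSV, supplying a header ourselves since table_array does not include one:

```python
import csv
import io

# The table_array payload from the example response above
table_array = [
    ["Rotating Proxy without JavaScript rendering", "1"],
    ["Rotating Proxy with JavaScript rendering (default)", "5"],
    ["Premium Proxy without JavaScript rendering", "10"],
    ["Premium Proxy with JavaScript rendering", "25"],
]

# Write a header row, then the extracted rows
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Feature used", "API credit cost"])
writer.writerows(table_array)
print(buf.getvalue())
```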
Single element or list
type [ item | list ] (default = item)
By default, we will return the first HTML element that matches the selector. If you want to get all elements matching the selector, use the type option. type can be:
- item: return the first element matching the selector (default)
- list: return a list of all elements matching the selector
Here is an example for extracting post titles from our blog.
{
"first_post_title" : {
"selector": ".post-title",
"type": "item"
},
"all_post_title" : {
"selector": ".post-title",
"type": "list"
}
}
The information extracted by the above rules on ScrapingBee's blog page would be
{
"first_post_title": " Block ressources with Puppeteer - (5min)",
"all_post_title": [
" Block ressources with Puppeteer - (5min)",
" Web Scraping vs Web Crawling: Ultimate Guide - (10min)",
...
" Scraping E-Commerce Product Data - (6min)",
" Introduction to Chrome Headless with Java - (4min)"
]
}
Clean Text
clean [ true | false ] (default = true)
By default, ScrapingBee returns clean content, meaning it removes trailing spaces and empty characters ('\n', '\t', etc.) from the results. If you don't want this behavior, disable it by setting clean: false in your data extraction rule.
Here is an example for extracting the post description from our blog using "clean": true.
{
"first_post_description" : {
"selector": ".card > div",
"clean": true #default
}
}
The information extracted by the above rules on ScrapingBee's blog page would be
{
"first_post_description": "How to Use a Proxy with Python Requests? - (7min) By Maxine Meurer 13 October 2021 In this tutorial we will see how to use a proxy with the Requests package. We will also discuss on how to choose the right proxy provider.read more",
}
If you use "clean": false:
{
"first_post_description" : {
"selector": ".card > div",
"clean": false
}
}
You would get this result instead:
{
"first_post_description": "\n How to Use a Proxy with Python Requests? - (7min)\n \n \n \n By Maxine Meurer\n \n \n 13 October 2021\n \n \n In this tutorial we will see how to use a proxy with the Requests package. We will also discuss on how to choose the right proxy provider.\n read more\n ",
}
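If you scrape with clean: false and later change your mind, you can approximate the cleaning client-side. A sketch (a hypothetical helper that collapses whitespace runs, roughly what the server-side clean option does):

```python
def clean_text(raw: str) -> str:
    # Collapse runs of whitespace ('\n', '\t', repeated spaces) and trim,
    # roughly approximating the server-side clean: true behavior
    return " ".join(raw.split())

raw = "\n How to Use a Proxy with Python Requests? - (7min)\n \n"
print(clean_text(raw))  # How to Use a Proxy with Python Requests? - (7min)
```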
Extract nested items
It is also possible to add extraction rules inside the output
option in order to create powerful extractors.
Here are the rules that would extract general information and all blog post details from ScrapingBee's blog.
{
"title" : "h1",
"subtitle" : "#subtitle",
"articles": {
"selector": ".card",
"type": "list",
"output": {
"title": ".post-title",
"link": {
"selector": ".post-title",
"output": "@href"
},
"description": ".post-description"
}
}
}
The information extracted by the above rules on ScrapingBee's blog page would be
{
"title": "The ScrapingBee Blog",
"subtitle": " We help you get better at web-scraping: detailed tutorial, case studies and \n writing by industry experts",
"articles": [
{
"title": " Block ressources with Puppeteer - (5min)",
"link": "https://www.scrapingbee.com/blog/block-requests-puppeteer/",
"description": "This article will show you how to intercept and block requests with Puppeteer using the request interception API and the puppeteer extra plugin."
},
...
{
"title": " Web Scraping vs Web Crawling: Ultimate Guide - (10min)",
"link": "https://www.scrapingbee.com/blog/scraping-vs-crawling/",
"description": "What is the difference between web scraping and web crawling? That's exactly what we will discover in this article, and the different tools you can use."
},
]
}
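Since the nested rules return structured JSON, iterating over the extracted articles is a one-liner. A sketch, using a trimmed version of the response above:

```python
import json

# A trimmed version of the nested response shown above
response_body = """
{
  "title": "The ScrapingBee Blog",
  "articles": [
    {"title": "Block ressources with Puppeteer - (5min)",
     "link": "https://www.scrapingbee.com/blog/block-requests-puppeteer/"},
    {"title": "Web Scraping vs Web Crawling: Ultimate Guide - (10min)",
     "link": "https://www.scrapingbee.com/blog/scraping-vs-crawling/"}
  ]
}
"""

# Parse the API response and index the articles by title
data = json.loads(response_body)
links_by_title = {a["title"]: a["link"] for a in data["articles"]}
print(len(links_by_title))  # 2
```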
Common use cases
Below you will find common extraction rules often used by our users.
Extract all links from a page
For SEO purposes, lead generation, or simply data harvesting, it can be useful to quickly extract all links from a single page.
The following extract_rules
will allow you to do that with one simple API call:
{
"all_links" : {
"selector": "a",
"type": "list",
"output": "@href"
}
}
The JSON response will be as follows:
{
"all_links": [
"https://www.scrapingbee.com/",
...,
"https://www.scrapingbee.com/api-store/"
]
}
If you wish to extract both the href and the anchor text of links, you can use these rules instead:
{
"all_links" : {
"selector": "a",
"type": "list",
"output": {
"anchor": "a",
"href": {
"selector": "a",
"output": "@href"
}
}
}
}
The JSON response will be as follows:
{
"all_links":[
{
"anchor":"Blog",
"href":"https://www.scrapingbee.com/blog/"
},
...
{
"anchor":" Linkedin ",
"href":"https://www.linkedin.com/company/26175275/admin/"
}
]
}
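Note that href attributes can be relative rather than absolute, depending on how the page is written. A sketch of normalizing them client-side with the standard library (the sample href values are hypothetical):

```python
from urllib.parse import urljoin

# Hypothetical href values extracted from a page: some relative, some absolute
page_url = "https://www.scrapingbee.com/"
all_links = ["/blog/", "https://www.scrapingbee.com/api-store/", "#pricing"]

# urljoin resolves relative hrefs against the page URL
# and leaves absolute URLs untouched
absolute_links = [urljoin(page_url, href) for href in all_links]
print(absolute_links[0])  # https://www.scrapingbee.com/blog/
```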
Extract all text from a page
If you need to get all the text of a web page, and only the text, meaning no HTML tags or attributes, you can use those rules:
{
"text": "body"
}
For example, using those rules with this ScrapingBee landing page returns this result:
{
"text": "Login Sign Up Pricing FAQ Blog Other Features Screenshots Google search API Data extraction JavaScript scenario No code scraping with Integromat Documentation Tired of getting blocked while scraping the web? ScrapingBee API handles headless browsers and rotates proxies for you. Try ScrapingBee for Free based on 25+ reviews. Render your web page as if it were a real browser. We manage thousands of headless instances using the latest Chrome version. Focus on extracting the data you need, and not dealing with concurrent headless browsers that will eat up all your RAM and CPU. Latest Chrome version Fast, no matter what! ScrapingBee simplified our day-to-day marketing and engineering operations a lot . We no longer have to worry about managing our own fleet of headless browsers, and we no longer have to spend days sourcing the right proxy provider Mike Ritchie CEO @ SeekWell Javascript Rendering We render Javascript with a simple parameter so you can scrape every website, even Single Page Applications using React, AngularJS, Vue.js or any other libraries. Execute custom JS snippet Custom wait for all JS to be executed ScrapingBee is helping us scrape many job boards and company websites without having to deal with proxies or chrome browsers. It drastically simplified our data pipeline Russel Taylor CEO @ HelloOutbound Rotating Proxies Thanks to our large proxy pool, you can bypass rate limiting website, lower the chance to get blocked and hide your bots! Large proxy pool Geotargeting Automatic proxy rotation ScrapingBee clear documentation, easy-to-use API, and great success rate made it a no-brainer. Dominic Phillips Co-Founder @ CodeSubmit Three specific ways to use ScrapingBee How our customers use our API: 1. ..."
}
Extract all email addresses from a page
If you need to get all the email addresses of a web page you can use those rules:
{
"email_addresses": {
"selector": "a[href^='mailto']",
"output": "@href",
"type": "list"
}
}
Using those rules with this ScrapingBee landing page returns this result:
{
"email_addresses": [
"mailto:contact@scrapingbee.com"
]
}
How does this work?
First, we target all anchors (a tags) that have an href attribute starting with the string mailto, then we extract only that href attribute. And since we want all email addresses on the page and not just one, we use the type list (on the ScrapingBee landing page there is just one email address anyway).
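Once you have the href values, you may want to strip the mailto: scheme client-side to keep only the addresses. A sketch (Python 3.9+ for str.removeprefix; the input list is hypothetical):

```python
# Hypothetical hrefs as returned by the extraction rule above
email_hrefs = ["mailto:contact@scrapingbee.com"]

# Drop the "mailto:" scheme to keep only the bare addresses
emails = [href.removeprefix("mailto:") for href in email_hrefs]
print(emails)  # ['contact@scrapingbee.com']
```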
Limitation
Those rules will only work for links whose href attributes contain mailto. If the email addresses on the page are just plain text or plain anchors, you should either extract all the text on the page and run a regular expression over it, or extract all links on the page and filter for email addresses on your side.
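For the plain-text case, here is a sketch of the regular-expression approach (a deliberately simple pattern and a hypothetical page text; real-world email matching is considerably messier):

```python
import re

# A simple email pattern: local part, "@", domain ending on a word character
# so a sentence-final period is not captured
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]*\w")

# Hypothetical text extracted from a page
page_text = "Questions? Write to contact@scrapingbee.com or sales@example.org."
print(EMAIL_RE.findall(page_text))  # ['contact@scrapingbee.com', 'sales@example.org']
```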