Knowledge Base

How to use ScrapingBee?

Overview

Scraping the web can be hard. More and more websites use Single Page Application frameworks such as Angular, Vue.js, or React.

JavaScript rendering is one of the problems we solve with ScrapingBee: you won't have to worry about running headless Chrome on a server.

The other problem we solve is proxy management. Some websites have very aggressive rate limits, meaning they will block you after as few as 10-25 requests per day from the same IP address. If you want to scrape hundreds of thousands or even millions of pages, proxies can cost you a lot of money.

Another problem with proxies is that you need to rotate them when they get blocked, and you have to choose a proxy provider carefully.

Getting Started

ScrapingBee is a simple API that allows you to extract HTML from any website in a single API call.

If you need custom options such as JavaScript rendering or our premium proxies, take a look at our full documentation.

To get your API Key, you just need to create an account here.

Of course, don't forget to replace "YOUR-API-KEY" and "YOUR-URL" with your API key and the URL of the page you want to scrape.


curl "https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=YOUR-URL"
         

#  Install the Python Requests library:
# `pip install requests`
import requests

def send_request():
    response = requests.get(
        url="https://app.scrapingbee.com/api/v1/",
        params={
            "api_key": "YOUR-API-KEY",
            "url": "YOUR-URL",
        },
    )
    print('Response HTTP Status Code: ', response.status_code)
    print('Response HTTP Response Body: ', response.content)

send_request()

// Classic (GET)
const https = require('https')

const options = {
    hostname: 'app.scrapingbee.com',
    port: 443,
    path: '/api/v1/?api_key=YOUR-API-KEY&url=YOUR-URL',
    method: 'GET',
}

const req = https.request(options, res => {
    console.log(`statusCode: ${ res.statusCode }`)
    res.on('data', d => {
        process.stdout.write(d)
    })
})

req.on('error', error => {
    console.error(error)
})

req.end()

import java.io.IOException;
import org.apache.http.client.fluent.*;

public class SendRequest
{
  public static void main(String[] args) {
    sendRequest();
  }

  private static void sendRequest() {

    // Classic (GET)

    try {

      // Create request, execute it and return the content
      Content content = Request.Get("https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=YOUR-URL")
          .execute().returnContent();

      // Print content
      System.out.println(content);
    }
    catch (IOException e) { System.out.println(e); }
  }
}

require 'net/http'
require 'net/https'

# Classic (GET)
def send_request
    uri = URI('https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=YOUR-URL')

    # Create client
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER

    # Create Request
    req = Net::HTTP::Get.new(uri)

    # Fetch Request
    res = http.request(req)
    puts "Response HTTP Status Code: #{ res.code }"
    puts "Response HTTP Response Body: #{ res.body }"
rescue StandardError => e
    puts "HTTP Request failed (#{ e.message })"
end

send_request()

<?php

// get cURL resource
$ch = curl_init();

// set url
curl_setopt($ch, CURLOPT_URL, 'https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=YOUR-URL');

// set method
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');

// return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

// send the request and save response to $response
$response = curl_exec($ch);

// stop if fails
if (!$response) {
  die('Error: "' . curl_error($ch) . '" - Code: ' . curl_errno($ch));
}

echo 'HTTP Status Code: ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . PHP_EOL;
echo 'Response Body: ' . $response . PHP_EOL;

// close curl resource to free up system resources
curl_close($ch);

?>

package main

import (
	"fmt"
	"io/ioutil"
	"net/http"
)

func sendClassic() {
	// Create client
	client := &http.Client{}

	// Create request
	req, err := http.NewRequest("GET", "https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=YOUR-URL", nil)
	if err != nil {
		fmt.Println("Failure : ", err)
		return
	}

	// Fetch Request
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("Failure : ", err)
		return
	}
	defer resp.Body.Close()

	// Read Response Body
	respBody, _ := ioutil.ReadAll(resp.Body)

	// Display Results
	fmt.Println("response Status : ", resp.Status)
	fmt.Println("response Headers : ", resp.Header)
	fmt.Println("response Body : ", string(respBody))
}

func main() {
	sendClassic()
}

Scraping an E-commerce product page

The following guide will help you scrape an E-commerce product page with the ScrapingBee API. We will use Python, Requests, and BeautifulSoup in our examples.

Don't hesitate to take a look at our Python web scraping 101 to read a detailed introduction to these libraries. We also have a lot of tutorials in different languages in our web scraping blog.

We are going to extract the price, image URL, and product name for this product: https://clever-lichterman-044f16.netlify.app/products/taba-cream.1/

E-commerce product page with Chrome dev tools opened.

We are going to use some basic CSS selectors to extract the price, title, and image URL, then simply print the result to the console.

import requests
from bs4 import BeautifulSoup

# Extract product data from a (dummy) E-commerce product page
# https://www.scrapingbee.com/

# Replace with your ScrapingBee API key
SCRAPINGBEE_API_KEY = ""
endpoint = "https://app.scrapingbee.com/api/v1"

params = {
    'api_key': SCRAPINGBEE_API_KEY,
    'url': 'https://clever-lichterman-044f16.netlify.app/products/taba-cream.1/',
}

response = requests.get(endpoint, params=params)
if response.status_code != 200:
    print('Error with your request: ' + str(response.status_code))
    print(response.content)
else:
    soup = BeautifulSoup(response.content, 'html.parser')
    product = {
        'price': soup.select('.my-4 s')[0].text,
        'image_url': soup.select('.slick-track img')[0]['src'],
        'product_title': soup.select('.col-12 h2')[0].text
    }
    print(product)

URL encoding

When you are making an API call to ScrapingBee, you can pass different parameters as query strings. Here is an example:

https://app.scrapingbee.com/api/v1/?url=https://example.com/&premium_proxy=True&api_key=YOUR-API-KEY

Now, what happens when the URL you want to scrape itself contains query parameters, like param1 and param2 below?

curl 'https://app.scrapingbee.com/api/v1/?url=https://example.com/?param1=value1&param2=value2&premium_proxy=True&api_key=YOUR-API-KEY'

In this case the API will return an error: everything after the first & of your target URL is parsed as an API query parameter, and since param2 is not a parameter ScrapingBee knows, you will get this error:

{"message": "Unknown arguments: param2"}

In order to pass your URL, including its query parameters, you have to encode it. Most HTTP clients in most programming languages will do this for you.
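
For example, the failing call above should go through once the target URL is percent-encoded (assuming a valid API key):

curl 'https://app.scrapingbee.com/api/v1/?url=https%3A%2F%2Fexample.com%2F%3Fparam1%3Dvalue1%26param2%3Dvalue2&premium_proxy=True&api_key=YOUR-API-KEY'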

If your client doesn't (cURL, for example), you can use a third-party tool like this one: URL encoder

There are also many ways to encode a URL directly in your code:


sudo apt-get install gridsite-clients
urlencode "YOUR URL"
         

import urllib.parse
# safe="" makes quote() encode "/" characters too
encoded_url = urllib.parse.quote("YOUR URL", safe="")

encoded_url = encodeURIComponent("YOUR URL")

String encoded_url = URLEncoder.encode("YOUR URL", "UTF-8");

# URI::encode was removed in Ruby 3; use CGI.escape instead
require 'cgi'
encoded_url = CGI.escape("YOUR URL")

<?php

$url_encoded = urlencode("YOUR URL");

?>

package main

import (
	"fmt"
	"net/url"
)

func main() {
	encoded_url := url.QueryEscape("YOUR URL")
	fmt.Println(encoded_url)
}

Understanding JavaScript rendering

Some websites make heavy use of JavaScript and AJAX calls.

Typically, when you visit those websites with your web browser, the page returned by the server is an empty HTML skeleton.

Once the HTML hits your browser, the JavaScript framework runs its internal code and fires one or many HTTP requests to backend APIs in order to populate the page with useful data.

This is basically what most Single Page Application frameworks do, and the process can take a few seconds when loading the first page.

This is usually what is going on when you visit a web page that displays a loader before showing useful information.

Single Page Application Diagram

And this is exactly why trying to scrape those websites without a headless browser will not work well.

By default, our API scrapes pages through a real web browser (documentation).

Waiting for JavaScript code to execute

By default, ScrapingBee waits 2000 milliseconds before returning the HTML. You can increase this value by using the wait parameter (documentation).
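
For example, a call like this one should make the browser wait 5 seconds before returning the HTML (same placeholders as in the Getting Started section):

curl "https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=YOUR-URL&wait=5000"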

How to scroll a page?

You will sometimes find pages that trigger an AJAX call to load more elements when you scroll to the bottom. In these cases, you may want to scroll a few times with our headless browser in order to load those extra elements.

This is possible with the js_snippet parameter. Here is a code sample that scrolls to the bottom of a page 5 times, with 500 milliseconds between each scroll. We also need a wait of 2500 ms so that all our JavaScript code can finish executing.

let counter = 5;
function scroll(){
    if(counter == 0){
        clearInterval(id);
        // return so we don't scroll an extra time after stopping
        return;
    }
    window.scrollTo(0, document.body.scrollHeight);
    counter = counter - 1;
}
let id = setInterval(scroll, 500);

Don't hesitate to test this in your browser's JavaScript console. In order to make ScrapingBee execute this code, you have to encode it in base64.

You can find below how to do it in your favorite language.


# -n prevents echo from appending a newline to the encoded snippet
echo -n 'YOUR JS SNIPPET' | base64
         

import base64
base64_snippet = base64.b64encode("YOUR JS SNIPPET".encode()).decode()

'use strict';

const js_snippet = 'YOUR JS SNIPPET';
// Buffer.from() replaces the deprecated new Buffer() constructor
const base64_snippet = Buffer.from(js_snippet).toString('base64');

import org.apache.commons.codec.binary.Base64;

byte[] encodedBytes = Base64.encodeBase64("YOUR JS SNIPPET".getBytes());
String base64_snippet = new String(encodedBytes);

require "base64"

base64_snippet = Base64.encode64('YOUR JS SNIPPET')


<?php

$str = 'YOUR JS SNIPPET';
$base64_snippet = base64_encode($str);

?>

package main

import (
	b64 "encoding/base64"
	"fmt"
)

func main() {
	base64_snippet := b64.StdEncoding.EncodeToString([]byte("YOUR JS SNIPPET"))
	fmt.Println(base64_snippet)
}

Here is what such a snippet, encoded in base64 and combined with a wait parameter set to 2500 milliseconds, looks like in the query string:

wait=2500&js_snippet=bGV0IGNvdW50ZXIgPSAxNTsKZnVuY3Rpb24gc2Nyb2xsKCl7CmlmKGNvdW50ZXIgPT0gMCl7CmNsZWFySW50ZXJ2YWwoaWQpOwp9CndpbmRvdy5zY3JvbGxUbygwLCBkb2N1bWVudC5ib2R5LnNjcm9sbEhlaWdodCk7CmNvdW50ZXIgPSBjb3VudGVyIC0gMSA7Cn0KbGV0IGlkID0gc2V0SW50ZXJ2YWwoc2Nyb2xsLCA1MDApOw==
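
Putting it all together, the full API call might look like this, where YOUR_BASE64_SNIPPET stands for the encoded snippet above. Note that base64 output can contain characters such as + and = that must themselves be URL-encoded:

curl "https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=YOUR-URL&wait=2500&js_snippet=YOUR_BASE64_SNIPPET"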

How to use concurrency?

Depending on the plan you chose, you will have access to a specific number of concurrent requests, which means you can only send that many requests at the same time.

For example, if you need to make 100 requests and have an allowed concurrency of 5, you can send 5 requests at the same time. The simplest way to take advantage of this concurrency is to set up 5 workers / threads and have each of them send 20 requests.

Below you'll find some resources that can help you do that.


import requests
from multiprocessing.dummy import Pool as ThreadPool

def request_scrapingbee(url):
    r = requests.get(
      url="https://app.scrapingbee.com/api/v1/",
      params={
        "api_key": "",  # Your ScrapingBee API key
        "url": url,
      },
    )
    response = {
        "statusCode": r.status_code,
        "body": r.text,
        "url": url,
    }
    return response

# Number of workers, up to your plan's concurrency limit
concurrency = 2
pool = ThreadPool(concurrency)

# URLs to scrape
urls = ["", ""]
results = pool.map(request_scrapingbee, urls)
pool.close()
pool.join()

for result in results:
  print(result)
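
If you prefer the standard library, here is an equivalent sketch using concurrent.futures (same placeholders as above):

import requests
from concurrent.futures import ThreadPoolExecutor

def request_scrapingbee(url):
    r = requests.get(
        url="https://app.scrapingbee.com/api/v1/",
        params={
            "api_key": "",  # Your ScrapingBee API key
            "url": url,
        },
    )
    return {"statusCode": r.status_code, "body": r.text, "url": url}

urls = ["", ""]

# map() returns results in input order, like ThreadPool.map above
with ThreadPoolExecutor(max_workers=2) as executor:
    for result in executor.map(request_scrapingbee, urls):
        print(result)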


What to do if my request fails?

Please find below the most common ways to fix your API call.

Check that your URL is correctly encoded.

Most of the errors you will encounter will be because you haven't correctly encoded your URL.

To do this quickly, you can go to this website and click the encode button.

If you need to do it programmatically, see the URL encoding section above.

Wait for the JavaScript to render

Some websites use a lot of JavaScript. For those websites, you will have to use the wait parameter with the number of milliseconds you want to wait (documentation).

Example: wait=2000 if you want to wait 2 seconds (2000 milliseconds).

Use premium proxy

For very difficult websites, you should use the premium_proxy=True parameter. Every call using premium proxies costs 25 API credits with JavaScript rendering, or 5 API credits without it (documentation).
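
For example (same placeholders as in the Getting Started section):

curl "https://app.scrapingbee.com/api/v1/?api_key=YOUR-API-KEY&url=YOUR-URL&premium_proxy=True"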

If you still have some problems, do not hesitate to send us an email.

Happy Scraping.