
How to Scrape Baidu: Step-by-Step Guide

02 October 2025 | 13 min read

Want to learn how to scrape Baidu? As China's largest search engine, Baidu is an attractive target for web scraping: it is similar to Google in function but tailored to local regulations. For anyone looking to tap into China's digital ecosystem, it is a rich source of public data on relevant, location-based search trends, plus everything you need to conduct market research.

This guide will teach you how to extract information from Baidu's HTML with the most beginner-friendly solution – our Scraping API and Python SDK. Baidu's result pages load their structured data dynamically through JavaScript, while rate limiting and bot detection try to block automated data extraction on the platform.

Stick around, and we will go through the basic steps of extracting and storing knowledge with your first Baidu scraper built from scratch!

Quick Answer (TL;DR)

Use our Python SDK to focus on the extracted data itself while we handle JavaScript rendering, proxy rotation, and header management for you! With the help of our tools, you can enjoy the benefits of a Baidu Scraping API. The example script in this tutorial loads dynamic Baidu pages just like a real browser, letting you extract titles, descriptions, and links without CAPTCHAs or manual setup.
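
For a quick taste, here is a minimal sketch using our Python SDK; the query keyword and the printed preview length are just placeholders for illustration:

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

# JavaScript rendering is enabled by default; premium proxies help avoid blocks
response = client.get(
    "https://www.baidu.com/s?wd=scrapingbee",
    params={'premium_proxy': 'True'}
)
print(response.status_code)
print(response.text[:500])  # first 500 characters of the rendered HTML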

Scrape Baidu Search Results with ScrapingBee

To begin collecting structured data from Baidu, let's take care of the tools and parameters needed to access the search engine through our script. First, make sure you have Python 3.6 or newer installed on your device.

[Screenshot: installing Python]

Note: For Windows users, Python can also be installed via the Microsoft Store app.

Python is one of the most popular programming languages in 2025, and it is the most common choice for writing web scraping scripts because it easily integrates with the external tools and libraries we will use in this tutorial. After installation, you can check your Python version by entering this line in your Command Prompt or Terminal window:

python --version

Installing Python also installs pip, its package manager, which downloads external libraries from the Python Package Index (PyPI) via the "pip" command. To keep the process simple for beginners, we only need two external libraries:

  • pandas – a Python library for organizing and analyzing data. It converts raw HTML or JSON from Baidu into clean tables (DataFrames) for easier filtering, sorting, and export to CSV files.

  • scrapingbee – our own Python SDK that sends web requests through the ScrapingBee API. Through it, we will teach you to customize JavaScript rendering, proxy rotation, and bot detection parameters to reliably fetch Baidu search results.

For more tips on using our SDK, check out the ScrapingBee Documentation page. You can install both libraries with one line in your command prompt or terminal:

pip install scrapingbee pandas

Get Your API Key and Set Up

To begin scraping Baidu, create a free ScrapingBee account and get your API key from the dashboard. Newly registered users get a free one-week trial with 1,000 credits, which is more than enough to test our system and its intuitive features.

[Screenshot: ScrapingBee registration page]

After signing up, head over to the dashboard. You will see your API key in the top-right section. Copy it, as we will later integrate it into our script.

[Screenshot: ScrapingBee dashboard with the API key]

Before we write the script (which consumes credits depending on the tools and parameters used), you can test a request with curl. Replace "YOUR_API_KEY" with your actual key:

curl "https://app.scrapingbee.com/api/v1?api_key=YOUR_API_KEY&url=https://www.baidu.com/s?wd=scrapingbee&render_js=true"

In our Python script, our client will not execute the web scraping logic if it cannot access the API:

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

Python script for Baidu API requests

Now it's time to start working on your script. Create a project folder of your choice. In it, make a text file ending with .py, for example, "baidu_scraper.py".

Note: While you can write your script in a text editor, we recommend using Notepad++ or Visual Studio Code for syntax highlighting plus additional tools that make coding more intuitive.

Start your script by importing the installed external libraries and the built-in Python library "urllib" for easier encoding of Baidu search result URLs:

#Importing our HTML API
from scrapingbee import ScrapingBeeClient 
# pandas dataframes for better data formatting
import pandas as pd 
# An internal Python library to integrate input parameters into a URL string
from urllib.parse import urlencode 

Before we start working on the main function, add the client variable we defined earlier right after the import statements, and create a base Baidu URL that will later be combined with our encoded parameters:

client = ScrapingBeeClient(api_key='YOUR_API_KEY')
base = "https://www.baidu.com/s"

Pagination for Baidu query results

Now, let's take care of the pagination logic so the script does not stop at the first page of Baidu search results. The "search_term" variable asks for user input, simulating a query on the search engine. Then, the "Pagination" variable asks the user for the number of pages to extract from Baidu.

search_term = input("Search_term: ").strip()
Pagination = int(input("How many pages to scrape? "))

Then we place the "search_term" value into a dictionary so it can be URL-encoded:

params = {"wd": search_term}

The following section builds the "url" variable from our base URL and the encoded parameters. After that, we create a list variable "url_list" that will hold all links, depending on the number of pages. This way our Baidu scraper stays flexible, accepting various keywords and sending requests to multiple pages.

url = f"{base}?{urlencode(params)}"
url_list=[url]

Baidu structures its URLs in a way that adds "&pn=10" if you want to see the second page, and increments the value by +10 for each additional page. For example, if you want to scrape 3 pages with the keyword "iPhone", your URL list should look like this:

https://www.baidu.com/s?wd=iphone - page 1
https://www.baidu.com/s?wd=iphone&pn=10 - page 2
https://www.baidu.com/s?wd=iphone&pn=20 - page 3

When we create the "url_list" variable, the page 1 URL is immediately assigned as its first element. Now, with the help of a for loop, we can append links for the additional pages.

if Pagination<=1: 
   pass
# for each additional page, append a URL with &pn=10, 20, ...
else:
    for i in range(1, Pagination):  # Pagination = 1 -> only the base URL
        next_url= f"{url}&pn={i*10}"
        url_list.append(next_url)
    print(url_list)

As we can see, the if statement ensures that the for loop runs only when the user requests more than one page. The loop counts upward from 1 to one less than the requested number of pages, so the second page corresponds to i = 1; multiplying i by 10 produces the "pn" offset Baidu expects, and each new URL is appended to the list.

Send API Request to Baidu

Now we can start working on the main function. Add the following line to begin its definition. We named it "scrape_Baidu". The lines that follow it will be indented because Python defines functions and conditional blocks through indentation instead of braces {}.

def scrape_Baidu():

Now we can start working with tools within our SDK. First, create a "js_scenario" variable that will instruct our API how to interact with JavaScript elements on the platform. For now, let's add a "wait" instruction to let the page load before extracting the HTML code.

# Start of the function definition
def scrape_Baidu():

    # Rules for JavaScript Rendering
    js_scenario = {
    # List of instructions
        "instructions": [
        # Instructs the headless browser to wait for 4 seconds    
            {"wait": 4000}
        ]
    }

Note: Headless browser use is enabled by default through the render_js=true parameter, which is implicitly applied to every GET API call.
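
If you ever need to fetch a static page and save credits, rendering can be switched off explicitly. The snippet below is a standalone illustration (not part of the tutorial script), using the same client and a hard-coded example URL:

# Standalone illustration: disable JavaScript rendering for a static page
response = client.get(
    "https://www.baidu.com/s?wd=scrapingbee",
    params={'render_js': 'False'}
)
print(response.status_code)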

Then, the "response" variable will store the data from the GET API call. For now, we only send the request to the first URL in the "url_list" variable to test whether our script can access Baidu and extract data:

    response = client.get(
# url from the list
        url_list[0],
        params={
# Adding additional parameters: routing via proxy servers and JS rendering rules
        "js_scenario": js_scenario,
        #"extract_rules": extract_rules,
        'premium_proxy': 'True'
                }
        )

Then, we create a "result" variable to store the data extracted from the response with the .text attribute. To see the outcome, we add two print calls: one for the HTML content and the other for the HTTP status code, which will output "200" if the connection is successful:

    result=(response.text)
    print(result)
    print("HTTP STATUS CODE: ",response.status_code)

Now we can test the script's raw HTML code extraction. For example, if you run the code with the "iPhone" query, the result should look something like this:

[Screenshot: Command Prompt output with the raw HTML]

Note: By default, your Command Prompt will not display Chinese characters. To see them, follow these steps:

  1. Type "chcp 65001" in your console to switch to UTF-8 encoding.

  2. Right-click the CMD title bar, go to "Properties", then select "Font" and choose a Unicode-capable font. For example, our tutorial uses "NSimSun".
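
Alternatively, you can handle the encoding on the Python side. The line below is an optional tweak (assuming Python 3.7 or newer); note that the console font still needs to support Chinese glyphs:

import sys

# Optional: force UTF-8 output so print() does not fail on Chinese characters
sys.stdout.reconfigure(encoding="utf-8")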

Parse Baidu Results in Python

To deconstruct raw code and get the actual information in a neatly organized data set, we need to create another dictionary variable in our function definition, "extract_rules". It will define parsing rules to only extract valuable insights from the page.

From each search result, we will extract three fields: the title, the description, and the link. Start defining the variable like this:

    #Extract rules definition
    extract_rules = {
        "Search_Result": {
            "selector": "",

The "Search_Result" section will define the logic for CSS-based parsing. First, we need to find the selector – a CSS element that encapsulates the data within each search result. You can find it by visiting the page in your browser and opening Developer Tools (press F12, or right-click and select Inspect).

As we scroll through the displayed HTML code, the browser highlights the element that corresponds to our selection on the page. Developer Tools also have a search bar that you can open by pressing Ctrl+F. After inspecting the code, we can see that "div._content_1ml43_4" is our main selector. If we enter it into the search bar, it shows every result on the page:

[Screenshot: Developer Tools highlighting the main result selector]

After putting it as our main selector, the next step is to define that the output data will be a list of other CSS selectors, our key data variables. At this point, your extract_rules should look like this:

    extract_rules = {
        "Search_Result": {
            "selector": "div._content_1ml43_4",
            "type": "list",
            "output": {
                "Title": "",
                "Description": "",
                "Link": "",
            }
        },
    }

Then, following the same logic as we did with the main selector, find CSS selectors that correspond to the title, description, and search query link within each section:

[Screenshot: Developer Tools showing the title, description, and link selectors]

After picking all CSS selectors, here is our finished set of rules for extracting desired search parameters:

    extract_rules = {
        "Search_Result": {
            "selector": "div._content_1ml43_4",
            "type": "list",
            "output": {
                "Title": "div.title-box_4YBsj",
                "Description": "div.summary-gap_3Jb4I",
                "Link": "h3 > a@href",
            }
        },
    }

Note: Baidu's HTML pages use dynamic class attributes, so these CSS selectors need to be updated regularly for extractions to keep working. Keep that in mind if your Baidu search API stops returning data despite using the same search query. To avoid additional hurdles, check out our BeautifulSoup Scraping Guide.
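
If you want an early warning when that happens, a sketch like the hypothetical helper below (not part of the final script; "parsed" stands for the dictionary returned by response.json()) can check whether a parsed response actually contains results before you build a DataFrame from it:

# Hypothetical helper: warn when the extract rules no longer match anything
def has_results(parsed: dict) -> bool:
    rows = parsed.get("Search_Result", [])
    if not rows:
        print("Warning: no results extracted; Baidu's CSS classes may have changed.")
        return False
    return True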

Now, to connect information from all target pages, we create an empty list variable "Pages_list" that will collect the organized data from each page so we can later combine everything into one pandas DataFrame. Then, move the "response" GET API call into a for loop. Don't forget to update its parameters by adding our "extract_rules" instructions:

    #empty pages list to collect all extracted pages into one list
    Pages_list=[]
    for urls in url_list:
        response = client.get(
            #iterating through pages within the URL list
            urls,
            params={
                "extract_rules": extract_rules,
                "js_scenario": js_scenario,
                'premium_proxy': 'True',
                }
            )
        # get extracted content in JSON format
        result=(response.json())
        # df variable creates a DataFrame from extracted content
        df=pd.DataFrame(result['Search_Result'])
        #appending the list of pages with a DataFrame for each page
        Pages_list.append(df)

Comment lines starting with the hash sign (#) explain what each step does. After filling our "Pages_list" variable with a DataFrame for each page, we can close the loop, combine them into one continuous DataFrame, and output the result in two ways: printing it in the Command Prompt and exporting it to a CSV file, ready for analysis. The last line of code invokes the scraping function.

    #uses concatenation to merge all DataFrames from the Pages_List
    Pages_DataFrame= pd.concat(Pages_list, ignore_index=True)
    
    print(Pages_DataFrame)
    Pages_DataFrame.to_csv("Baidu_extraction.csv", index=False)
    print("HTTP STATUS CODE: ",response.status_code)
scrape_Baidu()

After running the code to scrape 2 pages, our Command Prompt looks like this:

[Screenshot: Command Prompt output with the extracted DataFrame]

Meanwhile, this is how the generated CSV looks in Google Sheets:

[Screenshot: the generated CSV opened in Google Sheets]

And that's it! Feel free to copy the full code example below and test the scraper with different queries and numbers of pages.

Handle Common Issues

When scraping Baidu, you may encounter CAPTCHA or blank responses from JavaScript-heavy pages. These may occur because Baidu actively filters automated traffic and delays content rendering. To solve this, our API has JavaScript rendering enabled by default, while parameters like "premium_proxy" or "stealth_proxy" are used to avoid blocks and ensure access to the platform with high-quality IP addresses.

If 403 errors or CAPTCHA pages appear, use ScrapingBee's premium proxy network and send custom request headers to mimic a normal browser. You can also randomize user agents or rotate IPs automatically through the API. These settings let you access search results reliably without triggering Baidu's security filters. For more insights, check out our blog on Web Scraping Without Getting Blocked.
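
As a rough illustration (a hypothetical helper built on the client and parameters from this tutorial, not an official recipe), you could retry a blocked request with the stealth_proxy parameter before giving up:

# Hypothetical fallback: escalate to stealth proxies if a request gets blocked
def fetch_with_fallback(url, params):
    response = client.get(url, params=params)
    if response.status_code != 200:
        retry_params = dict(params, stealth_proxy='True')
        response = client.get(url, params=retry_params)
    return response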

Complete ScrapingBee Code Example

Below is a full example of a beginner-friendly Baidu scraper that is customizable and scalable to accommodate your use cases. Don't forget to add your API key before running the script!

#Importing our HTML API
from scrapingbee import ScrapingBeeClient 
# pandas dataframes for better data formatting
import pandas as pd 
# An internal Python library to integrate input parameters into a URL string
from urllib.parse import urlencode 

# Initializing our API client in the "client" variable
client = ScrapingBeeClient(api_key='YOUR_API_KEY')

base = "https://www.baidu.com/s"

search_term = input("Search_term: ").strip()
Pagination = int(input("How many pages to scrape? "))

params = {"wd": search_term}
url = f"{base}?{urlencode(params)}"
url_list=[url]

if Pagination<=1: 
   pass
# for each additional page, append a URL with &pn=10, 20, ...
else:
    for i in range(1, Pagination):  # Pagination = 1 -> only the base URL
        next_url= f"{url}&pn={i*10}"
        url_list.append(next_url)
    print(url_list)

def scrape_Baidu():
    # Rules for JavaScript Rendering
    js_scenario = {
    # List of instructions
        "instructions": [
        # Instructs the headless browser to wait for 4 seconds    
            {"wait": 4000}
        ]
    }

    #Extract rules definition
    extract_rules = {
        "Search_Result": {
            "selector": "div._content_1ml43_4",
            "type": "list",
            "output": {
                "Title": "div.title-box_4YBsj",
                "Description": "div.summary-gap_3Jb4I",
                "Link": "h3 > a@href",
            }
        },
    }
    #empty pages list to collect all extracted pages into one list
    Pages_list=[]
    for urls in url_list:
        response = client.get(
            #iterating through pages within the URL list
            urls,
            params={
                "extract_rules": extract_rules,
                "js_scenario": js_scenario,
                'premium_proxy': 'True',
                }
            )
        # get extracted content in JSON format
        result=(response.json())
        # df variable creates a DataFrame from extracted content
        df=pd.DataFrame(result['Search_Result'])
        #appending the list of pages with a DataFrame for each page
        Pages_list.append(df)
    #uses concatenation to merge all DataFrames from the Pages_List
    Pages_DataFrame= pd.concat(Pages_list, ignore_index=True)
    
    print(Pages_DataFrame)
    Pages_DataFrame.to_csv("Baidu_extraction.csv", index=False)
    print("HTTP STATUS CODE: ",response.status_code)
scrape_Baidu()

Start Scraping Baidu with ScrapingBee

Our Python SDK is one of the best tools for building your own Baidu search API. With built-in JavaScript rendering, automatic proxy rotation, and reliable API-based access, there's no need to juggle request libraries or fight the platform's restrictions with manual header tweaks. Get started today with your first 1,000 credits and enjoy an effortless introduction to data extraction with your first Baidu web scraper!

Frequently Asked Questions (FAQs)

Can ScrapingBee scrape Baidu without getting blocked?

Yes. Our API handles IP rotation and simulates browser activity to avoid rate-limiting restrictions from the recipient server. To learn more, check out our Avoid Bot Detection Guide.

Do I need proxies to scrape Baidu?

Yes, if you plan to use the script for large-scale scraping. However, with our HTML API, you don't need to manage them yourself. We automatically handle proxy rotation and IP management for every request, so there is no need for a separate proxy provider.

How do I scrape Baidu using Python?

There are a few different ways to build a Baidu scraper with Python. We recommend using our Python SDK to enjoy automated proxy rotation and header management. Feel free to use our copy-paste-ready example code or check out our Python Web Scraping Guide.

Does Baidu support English queries?

Yes. You can search using English terms, but Baidu primarily indexes Chinese-language content, so most results and snippets will appear in Chinese. For broader or English-only data, you may need to filter or translate the extracted text after scraping.
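
As a rough post-processing sketch (working on the CSV produced in this tutorial; treating any Latin letter in the title as a sign of English content is just an assumption for illustration), you could filter the exported results like this:

import pandas as pd

# Rough filter: keep only rows whose Title contains at least one Latin letter
df = pd.read_csv("Baidu_extraction.csv")
english_rows = df[df["Title"].astype(str).str.contains(r"[A-Za-z]", regex=True)]
print(english_rows)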

Kevin Sahin

Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.