How to Scrape Wikipedia with ScrapingBee

27 August 2025 | 11 min read

Ever wanted to extract valuable insights and data from one of the largest encyclopedias online? Then it is time to learn how to scrape Wikipedia pages! As one of the biggest treasuries of structured content, Wikipedia is constantly reviewed and fact-checked by its community, and at the very least it provides valuable insights and links to sources.

Wikipedia has structured content, but scraping it can be tricky due to rate limiting, which restricts repeated connection requests to the site. Fortunately, our powerful tools can overcome these hurdles, ensuring efficient data extraction in a clean HTML or JSON format.

In this guide, we will teach you how to scrape content and data from Wikipedia, covering page titles, summaries, and infoboxes. Tag along, and with minimal coding knowledge, you will be able to gather information and send your first automated GET request!

Quick Answer (TL;DR)

With the help of our Python SDK, you can scrape Wikipedia quickly and reliably – with just one API call! Extract titles, summaries, infobox content, and internal links while bypassing IP bans, rate limits, and user-agent blocks. No need for headless browsers: just install the required pip packages and plug in your API key to use our Wikipedia Scraping API.
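
Here is a minimal sketch of what a single call looks like (the extract_rules values below are illustrative, and YOUR_API_KEY stands in for the key from your dashboard):

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')
response = client.get(
    "https://en.wikipedia.org/wiki/HTTP",
    params={
        # Grab the page title and its first paragraph in a single call
        "extract_rules": {
            "title": "h1",
            "intro": "div.mw-parser-output > p:not(.mw-empty-elt)"
        },
        "premium_proxy": "True"
    }
)
print(response.json())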

Scraping Wikipedia Using ScrapingBee

With our scraping API, following this tutorial is more than enough to start your scraping journey with minimal coding knowledge. We handle rate limits, redirects, and messy HTML content with proxy servers, smart redirection to the appropriate URLs, and JavaScript rendering to ensure all the data you target gets loaded.

If you want to learn how to scrape Wikipedia, there is no better starting point than Python. As the most popular language for web scraping, it makes it easy to practice extraction techniques and retrieve structured data. To learn more about web data extraction, check out our blog article – What Is Web Scraping?

Set Up Your Environment

First, if you do not have it yet, install Python (version 3.6 or newer) on your system. Python can be installed on Windows by downloading the installer from the official Python website (python.org) or directly from the Microsoft Store by searching for Python and clicking "Get" to install the latest version.


The biggest strength of Python is its integration of libraries, both internal and external. Its package manager, pip, allows users to install and manage additional coding tools.

Thanks to our HTML API, the process is even simpler. Here is what you need:

  • scrapingbee – our Python library that provides a web scraping API, handling headless browsers, proxy rotation, and JavaScript rendering.

  • pandas – package for data analysis and manipulation, offering structures such as DataFrames to efficiently handle extracted information.

To install these packages, open your Terminal (Command Prompt on Windows) and run the following command:

pip install scrapingbee pandas

After that, log in to your ScrapingBee account to copy your API key. If you don’t have one, don’t worry! After registering an account, you will receive 1,000 free credits for a week to test the code described in this guide.


Now, create a dedicated folder and, inside it, a file with a .py extension, which will contain our Wikipedia scraping script. First, we import the installed libraries to make sure the script can use their tools.

from scrapingbee import ScrapingBeeClient 
import pandas as pd

Now we are ready to start dissecting the Wikipedia HTML code and send the first request. If you want more details on how our API works, check out our extensive ScrapingBee Documentation.

Let’s get straight to business with a simple, raw extraction of a Wikipedia page. First, we create a “client” variable and initialize our API client with your API key.

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

Then, add the URL of the Wikipedia main page and start defining a function that will execute the API call.

url = "https://en.wikipedia.org/wiki/Main_Page"
def scrape_wikipedia(url):

Note: Python uses indentation to encompass the executed steps within the function.

The following section is a js_scenario dictionary, which tells our API how to handle JavaScript rendering. In this case, it is just an instruction to wait 2 seconds for the page to load before extracting data.

    js_scenario = {
        "instructions": [
            {"wait": 2000},
        ]
    }

And now we get to the main part. The “response” variable will hold the data returned by the GET API call. We invoke it by calling the “.get” method on the client variable with the following arguments: the URL and a params dictionary, which passes in the js_scenario definitions. Let’s also add a premium proxy connection and a “retries” parameter to ensure consistent access to Wikipedia.

    response = client.get(
        url,
        params={
            "js_scenario": js_scenario,
            "premium_proxy": "True"
        },
        retries=2
    )

Now, all that is left is to print out the extracted HTML code and close the function. After that, the last line will call the function, executing the previously defined steps:

    print(response.content)

# Call the function
scrape_wikipedia(url)

After a successful execution, your result should look like this:

[Image: terminal output with the raw HTML response]

Extract Wikipedia Data: Title, Infobox, Introduction

To target specific Wikipedia data, let’s use the “HTTP” page as an example. Here we will target specific CSS selectors. Press F12 or Ctrl+Shift+I (Windows) / Cmd+Option+I (Mac) to open Developer Tools in your browser. Then right-click the webpage element you want to inspect, choose Inspect, right-click the highlighted code in the Elements panel, and select Copy > Copy selector to copy its CSS selector.

First, we will create an additional dictionary for data parsing instructions – extract_rules. Here we will name the first rule “infobox”, followed by a selector that focuses on this specific section instead of going all over the place.

[Image: the infobox table and its class shown in Developer Tools]

The image above shows the infobox table and its class via Developer Tools. After copying its selector, we get a string of text that looks like this:

#mw-content-text > div.mw-content-ltr.mw-parser-output > table.infobox.vevent

To remove excess data and specify which areas to target, we can make this selector more precise by removing the first two steps and appending “tr” so that it only targets the table rows: "selector": "table.infobox.vevent tr"

Note: While using Developer Tools, press Ctrl+F to open the search bar. Here, you can paste selectors, modify them, and immediately see which elements the scraper will match with that CSS selector.

Next, we choose a type of extraction: either text or a list. Because we are working with multiple elements, we will show how our API extracts multiple data points at once with the “list” type. After that, the output section in “extract_rules” lets us assign names to the targeted text and pair them with CSS selectors (or XPath).

Wikipedia infoboxes have two columns: one for the label and one for the actual information. In this tutorial, we named them “label” and “infobox data”. After all these steps, your “extract_rules” section should look like this:

    extract_rules = {
        "infobox": {
            "selector": "table.infobox.vevent tr",
            "type": "list",
            "output": {
                "label": "th.infobox-label",
                "infobox data": "td.infobox-data"
            }
        }
    }

Note: After attempting an initial extraction, we noticed that the “td.infobox-data” selector attaches <style> HTML tags to our content. To get a clean result, we adjusted the JS rendering instructions to exclude the clutter from our data. Pay close attention to how your code reacts to CSS elements, and make adjustments to get clean extractions.

    js_scenario = {
        "instructions": [
            {"wait": 2000},
            {"evaluate": "document.querySelectorAll('style').forEach(e => e.remove());"}
        ]
    }

Note: Keep in mind that this solution may not work for all Wikipedia <table> tags. For more in-depth info, check out our blog: How to Scrape Tables.

After that, we can use the same principles to extract additional elements. In this tutorial, we targeted page titles and their first paragraph. Here is a full version of the “extract_rules” definition:

    extract_rules = {
        "infobox": {
            "selector": "table.infobox.vevent tr",
            "type": "list",
            "output": {
                "label": "th.infobox-label",
                "infobox data": "td.infobox-data"
            }
        },
        "Heading": {
            "selector": "h1"
        },
        "intro": {
            "selector": "div.mw-parser-output > p:not(.mw-empty-elt)"
        }
    }

After these changes, the result is much better:

[Image: terminal output with the parsed JSON result]

Now all that is left is to use pandas and export the result to a CSV file.

    result = response.json()
    df = pd.DataFrame(result)
    df.to_csv("wiki_extraction.csv", index=False)
    
scrape_wikipedia(url)
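
Optionally, if you would rather have each infobox entry split into its own columns instead of raw dictionaries in a single cell, you can flatten the list before exporting. Here is a small sketch of that variation, assuming the JSON shape produced by the extract_rules above; it replaces the export lines inside the function:

    # Optional variation: flatten the infobox list into a row-per-entry table
    result = response.json()
    infobox_df = pd.DataFrame(result.get("infobox", []))
    infobox_df.to_csv("wiki_infobox.csv", index=False)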

And here is our result, fully automated and easy to read, with plenty of room for organizing data in the most comfortable and understandable format for you!

[Image: the exported CSV opened as a table]

Note: While this tutorial uses our tools to target CSS selectors, you can use other parsing libraries. For example, here is a BeautifulSoup Tutorial that you can use to collect data from Wikipedia with their flexible and beginner-friendly parsing library.
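
For instance, here is a short sketch of that approach, assuming you have installed the library with pip install beautifulsoup4. It fetches the page through our API and parses the raw HTML locally instead of using extract_rules:

from scrapingbee import ScrapingBeeClient
from bs4 import BeautifulSoup

client = ScrapingBeeClient(api_key='YOUR_API_KEY')
response = client.get("https://en.wikipedia.org/wiki/HTTP")

# Parse the returned HTML locally
soup = BeautifulSoup(response.content, "html.parser")
print(soup.find("h1").get_text(strip=True))

# Walk the infobox rows with the same selectors used earlier in this tutorial
for row in soup.select("table.infobox.vevent tr"):
    label = row.select_one("th.infobox-label")
    value = row.select_one("td.infobox-data")
    if label and value:
        print(label.get_text(strip=True), ":", value.get_text(strip=True))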

Optional: Use Wikipedia's API with ScrapingBee

If you want to use Wikipedia’s API, our tools can forward HTTP headers for handling JSON APIs, allowing you to access the platform for more structured data extraction. By making requests with parameters like action=query combined with prop=extracts or prop=pageprops, you retrieve specific parts of the page’s content or metadata in JSON format rather than raw HTML.

This approach simplifies data parsing because you get clean, organized content directly from Wikipedia’s backend, which is faster and does not require parsing raw HTML.
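
Here is a rough sketch of that approach, sending a request to Wikipedia's public api.php endpoint through our client. The query parameters follow Wikipedia's API conventions; adjust the titles value to the page you need:

from scrapingbee import ScrapingBeeClient

client = ScrapingBeeClient(api_key='YOUR_API_KEY')

# action=query + prop=extracts returns the page introduction as JSON
api_url = (
    "https://en.wikipedia.org/w/api.php"
    "?action=query&prop=extracts&exintro=1&explaintext=1"
    "&format=json&titles=HTTP"
)

response = client.get(
    api_url,
    params={"render_js": "False"}  # the endpoint returns JSON, so no rendering is needed
)
print(response.json())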

Bypass Rate Limits and Stay Undetected

Web scraping is a common practice, but beginners who start firing off repeated GET API requests often get banned or rate-limited. Here is how to avoid getting blocked (a combined example follows after the list):

  • premium_proxy: true — Uses a managed pool of premium residential proxies with automatic rotation to avoid IP bans.

  • render_js: true (enabled by default) — Enables or disables JavaScript rendering using headless Chrome for dynamic pages.

  • custom user-agent headers — Allows customizing user-agent strings to mimic different browsers and devices.

  • auto IP rotation — Automatically rotates IP addresses between requests to distribute load.

  • geotargeting with country_code — Select proxies from specific regions to access geo-restricted content.

  • CAPTCHA solving — Automatically handles CAPTCHA challenges encountered during scraping.

These measures help respect Wikipedia's rate limits and polite scraping policies by distributing requests, simulating human-like browsing, and avoiding detection/blocking. Need more info? Check out our blog on How to Avoid Getting Blocked!
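
As a rough example, here is how several of these options can be combined in one request, reusing the client created earlier in the tutorial (the values are illustrative assumptions, not required settings):

response = client.get(
    "https://en.wikipedia.org/wiki/HTTP",
    params={
        "premium_proxy": "True",   # managed residential proxies with rotation
        "render_js": "False",      # Wikipedia pages are mostly static HTML
        "country_code": "us"       # geotargeting
    },
    headers={
        # A custom user-agent forwarded with the request
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    },
    retries=2
)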

Final Code: Full Wikipedia Scraper With ScrapingBee

Below is a ready and field-tested scraping script that covers all the steps mentioned in this tutorial. Feel free to copy-paste it and add additional features and data points that fit your use cases.

# Importing pip libraries
from scrapingbee import ScrapingBeeClient
import pandas as pd

# Initializing ScrapingBeeClient (replace YOUR_API_KEY with the key from your account!)
client = ScrapingBeeClient(api_key='YOUR_API_KEY')

# Targeting Wikipedia's page about HTTP
url = "https://en.wikipedia.org/wiki/HTTP"

# Function definition begins; all of its contents are indented
def scrape_wikipedia(url):
    # extract_rules dictionary holds instructions for CSS selectors
    extract_rules = {
        "infobox": {
            "selector": "table.infobox.vevent tr",
            "type": "list",
            "output": {
                "label": "th.infobox-label",
                "infobox data": "td.infobox-data"
            }
        },
        "Heading": {
            "selector": "h1",
            "output": "text"
        },
        "intro": {
            "selector": "div.mw-parser-output > p:not(.mw-empty-elt)",
            "output": "text"
        }
    }

    # Instructions for JS rendering: removes <style> tags before parsing
    js_scenario = {
        "instructions": [
            {"wait": 2000},
            {"evaluate": "document.querySelectorAll('style').forEach(e => e.remove());"}
        ]
    }

    # Sending the GET API call with the extract_rules and js_scenario described above,
    # plus premium residential proxies to avoid IP restrictions
    response = client.get(
        url,
        params={
            "extract_rules": extract_rules,
            "js_scenario": js_scenario,
            "premium_proxy": "True"
        },
        retries=2
    )

    # Parsing the JSON result and exporting it to a CSV file in the same directory
    result = response.json()
    df = pd.DataFrame(result)
    df.to_csv("wiki_extraction.csv", index=False)

# Indentation ends; calling the defined function with the url variable
scrape_wikipedia(url)

Call to Action: Start Scraping Wikipedia With ScrapingBee

Start scraping Wikipedia effortlessly with ScrapingBee, the API that handles rate limits and IP rotation automatically to keep your requests smooth and uninterrupted. Whether you’re a data scientist, journalist, or researcher, our API simplifies HTML parsing, letting you focus on extracted knowledge and its significance without worrying about proxies and browser automation.

Our system is easy to set up and delivers reliable performance, allowing you to spend less time troubleshooting and more time extracting valuable insights. With built-in protections against bans and CAPTCHAs, your Wikipedia scraping project is in safe hands. Try your first scrape now, and our API will keep things straightforward and efficient!

Frequently Asked Questions (FAQs)

How do I avoid getting blocked when scraping Wikipedia?

To avoid detection during Wikipedia scraping, use IP rotation and vary user-agent headers to mimic real users. Our API automates these steps by default, ensuring that your GET API calls reach the platform from different access points and are not tripped up by rate limiting.

Can I use ScrapingBee to scrape other language versions of Wikipedia?

Yes, our Python SDK can scrape any Wikipedia language by specifying the URL of the desired language version. It handles proxies and rate limits to ensure consistent access to your pages.
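
For example, pointing the scrape_wikipedia function from this tutorial at the German edition only requires changing the URL (selectors may need minor adjustments per language edition):

url = "https://de.wikipedia.org/wiki/HTTP"
scrape_wikipedia(url)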

Is it better to use Wikipedia’s API or scrape the HTML?

Using Wikipedia’s API is preferred for structured, clean data and fewer blocks. Scraping HTML can be necessary for data not available via the API, but it requires more handling of site structure changes.

Can ScrapingBee help with large Wikipedia scrapes (thousands of pages)?

Yes, our IP rotation, CAPTCHA solving, and rate limit handling tools make it ideal for large-scale Wikipedia scraping projects. It distributes requests and manages challenges to keep scraping uninterrupted, and can even support multiple parallel scraping requests.

Kevin Sahin

Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.