Over the last two decades, HTML scraping has transformed how we approach market research. As the web keeps changing how information is published, there are many different ways to scrape HTML, each varying in approach and complexity.
In this tutorial, we will show how to combine the basics of traditional HTML data collection with the powerful extraction capabilities of our scraping API. This approach will help you create a clear and consistent method for automated data extractions. Let's dive in!
Quick Answer
HTML web scraping focuses on creating automated data collection bots that access a webpage's HTML content to extract information. Our Python SDK builds on libraries commonly used in DIY scrapers, letting you write a simple yet effective scraper while our scraping API ensures consistent access to target platforms.
What is HTML Web Scraping?
To extract data from HTML, basic scrapers download a webpage's raw HTML and use CSS selectors to locate and parse the elements they need, keeping only data points like product names, prices, or links. Since most pages on the web deliver content as structured HTML, scraping lets you access this data directly, without a formal API.
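To make this concrete, here is a minimal DIY sketch using the requests and BeautifulSoup libraries (both installed separately); the URL and selectors are placeholders rather than a real target:
# A minimal DIY sketch: fetch a page and pull data points with CSS selectors.
# The URL and selectors below are hypothetical placeholders.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/products").text
soup = BeautifulSoup(html, "html.parser")

for card in soup.select(".product-card"):            # one block per product
    name = card.select_one(".product-name").text     # product name
    price = card.select_one(".product-price").text   # price
    link = card.select_one("a")["href"]               # link
    print(name, price, link)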
Setting Up for HTML Scraping
In this tutorial, we will use Python, a leading programming language, to build our HTML scraper. Make sure you have Python 3.6+ installed on your device. You can download it from the official website, the Microsoft Store, or your Linux package manager.

Once the installation wizard finishes, you will have access to Python's package manager, pip. Open your Terminal or Command Prompt to install the following packages:
scrapingbee – our Python SDK, which uses the "requests" library (a staple of DIY scripts) under the hood and adds data parsing capabilities via its "extract_rules" parameter
pandas – an external Python library for transforming JSON data into DataFrames, useful for structuring and exporting extracted data.
You can install these packages all at once by entering the following line:
pip install scrapingbee pandas
How to Extract Data from HTML Step by Step
To make sure that our HTML extraction is easy to follow, we will test our scraper on Wikipedia, one of the best platforms for trying out parsing rules and JavaScript scenarios. Make a new project folder and create a text file with a .py extension so the Python interpreter can run it.
Now we can start working on the HTML scraper and its data extraction workflow!
Step 1 - Get an API Key and Make Your First HTML Request
To interact with our scraping API, you will need a key. Log in to your account or sign up to ScrapingBee to test our tools with a 1-week free trial with 1,000 credits.

Once you're in, copy the API key from the top-right section of your Dashboard:

Start your script by importing the installed Python libraries. Then create a "client" variable, which will be the gateway to our API; this is where you pass your API key as a keyword argument:
# Importing pip libraries
from scrapingbee import ScrapingBeeClient
import pandas as pd
# Initializing ScrapingBeeClient (replace YOUR_API_KEY with the key from your account!)
client = ScrapingBeeClient(api_key='YOUR_API_KEY')
Then, to keep things simple, let's focus on Wikipedia's HTML page, assigning its link to the "url" variable. After that, all we need for a raw extraction is to send the GET API call; the "response" variable will hold the result. The last line prints the result as text via the .text attribute:
url = "https://en.wikipedia.org/wiki/HTML"
response = client.get(
url,
params={
},
)
print(response.text)
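Before parsing the body, it's good practice to confirm the request actually succeeded. Assuming the SDK returns a requests-style response object, a minimal check could look like this:
# Optional sanity check before using the response body
if not response.ok:
    raise RuntimeError(f"Request failed with status code {response.status_code}")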
Note: Once you're comfortable extracting data from a fixed URL, you can create input variables to encode the URL and target a specific page based on user input (see the sketch after the cURL example below).
After running the code, we can see that our first raw HTML extraction has worked:

For simple extractions, you can also get the same results by running a cURL command in your Terminal or Command Prompt:
curl "https://app.scrapingbee.com/api/v1/?api_key=YOUR_API_KEY&url=https://en.wikipedia.org/wiki/HTML"
Step 2 - Control JavaScript Rendering
While Wikipedia did not cause any trouble in our extraction, many retailer websites do not include valuable public data in their HTML until the user interacts with the page's scripted elements. Fortunately, most modern scraping APIs handle JavaScript rendering through an integrated headless browser.
If you're working with our tools, you don't need to change anything: the parameter that enables dynamic scraping, "render_js", is set to true by default. If you want to disable it, add "render_js": "False" to the "params" dictionary that supplies the query parameters for the GET API call:
response = client.get(
    url,
    params={
        "render_js": "False",
    },
)
With the headless browser turned off, the extraction finished about 5 seconds faster, but JavaScript rendering remains necessary for most scraping targets that render parts of their content through JavaScript.
Step 3 - Wait for Content Reliably
As you take on bigger extraction targets, not every page will load all of its public data instantly. To make the connection more reliable, we can create a "js_scenario" variable containing a sequence of steps the headless browser must perform before extracting content from the targeted platform.
# Preparing JavaScript rendering instructions
js_scenario = {
    "instructions": [
        # put your instructions here
    ]
}
With our scraping API, there are three ways to add delays to your automated scraper:
wait — delays a fixed time (ms)
wait_for — waits for an element by CSS or XPath
wait_browser — waits for lifecycle events like domcontentloaded
For example, here is a JS scenario sequence that waits for 2 seconds, and then the next instruction tells the scraper to wait until the main content container is loaded and identifiable by its CSS selector:
js_scenario = {
    "instructions": [
        {"wait": 2000},
        {"wait_for": "#content"}
    ]
}
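The third option, wait_browser, takes a browser lifecycle event name; for example, an instruction like the following could be added to the same list (a small sketch based on the event names listed above):
{"wait_browser": "domcontentloaded"}  # wait until the DOM has been parsed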
After that, we have to add our variable to the GET API "params" dictionary:
response = client.get(
    url,
    params={
        "js_scenario": js_scenario,
    },
)
Step 4 - Interact with the Page Using a JavaScript Scenario
Let's add a few more instructions. The following steps tell our HTML web scraper to scroll through the entire page and finish by clicking a link at the end of the document:
js_scenario = {
    "instructions": [
        {"wait": 2000},
        {"wait_for": "#content"},
        {
            "infinite_scroll": {            # Scroll the page until the end
                "max_count": 0,             # Maximum number of scrolls, 0 for infinite
                "delay": 250,               # Delay between each scroll, in ms
                "end_click": {
                    "selector": "#Web_browsers9587 > a",
                    "selector_type": "css"  # Click when the end of the page is reached
                }
            }
        }
    ]
}
Step 5 - Extract Structured Data with CSS/XPath
Just like with the js_scenario dictionary, we can add an "extract_rules" dictionary to parse the raw HTML into usable data points, returning the extraction in clean JSON format. The following rules limit the script to the first 5 non-empty paragraphs and then extract all links from each of them:
extract_rules = {
    "links_in_first_5_paragraphs": {
        "selector": "(//p[normalize-space() != ''])[position() <= 5]",
        "selector_type": "xpath",
        "type": "list",
        "output": {
            "link_name": {
                "selector": "a@href",
                "type": "list"
            }
        }
    }
}
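These rules then need to be attached to the GET request, just like the JS scenario. Here is a minimal sketch of how that can look, following the same json.dumps pattern used later for the AI rules (depending on your SDK version, passing the dictionary directly may also work):
import json

response = client.get(
    url,
    params={
        "js_scenario": js_scenario,                  # the browser instructions from the previous steps
        "extract_rules": json.dumps(extract_rules),  # the parsing rules defined above
    },
)

result = response.json()  # the API returns structured JSON when extract_rules is set
print(result)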
After testing the code, we can see that our extraction is working:

Note: For messy pages, we highly recommend trying out AI web scraping with the "ai_extract_rules" parameter.
Step 6 - Avoid Blocks and Improve Success Rate
Depending on your target, scraping HTML is not always easy because some sites actively block scrapers. Fortunately, our API can bypass most obstacles with these features:
premium_proxy=true for geolocation and cleaner IP pools
stealth_proxy=true for stealthy fingerprints (infinite_scroll instruction of the JavaScript scenario is not supported with this option)
country_code=us to pick location
session_id to reuse cookies/IP
For example, Bakersfield.com is a website that is only accessible from US IP addresses. Let's enable the "premium_proxy" feature with a European country code:
# Importing pip libraries
from scrapingbee import ScrapingBeeClient
import pandas as pd

# Initializing ScrapingBeeClient (replace YOUR_API_KEY with the key from your account!)
client = ScrapingBeeClient(api_key='YOUR_API_KEY')

url = "https://bakersfield.com"

js_scenario = {
    "instructions": [
        {"wait": 2000}
    ]
}

response = client.get(
    url,
    params={
        "js_scenario": js_scenario,
        "premium_proxy": "True",
        "country_code": "de"
    },
)

result = response.text
print(result)
After running the code, we can see that the console outputs HTTP status error 451 – "Unavailable For Legal Reasons". This means the website does not comply with the General Data Protection Regulation (GDPR), which applies to all websites that collect or process personal data from EU residents, and therefore blocks connections from EU IP addresses.
However, if we change the country code to "us", our HTML scraper successfully extracts data from Bakersfield.com:
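If you plan to send several requests to the same site, the session_id parameter mentioned above lets you reuse the same proxy IP and cookies across calls. A hedged sketch (the session value here is arbitrary):
response = client.get(
    url,
    params={
        "premium_proxy": "True",
        "country_code": "us",
        "session_id": "1234",  # arbitrary example value; reuse it to keep the same IP and cookies
    },
)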
Step 7 - Return Screenshots or JSON
After setting up geolocation parameters that restore access to the platform, you can use our screenshot API to return images instead of text. Add the following parameter to your GET API call:
"screenshot_full_page": True,
Then, instead of assigning "response.content" to a string variable, write it directly to a file in binary mode so the raw image bytes are saved as a valid PNG:
response = client.get(
    url,
    params={
        "screenshot_full_page": True,
        "premium_proxy": "True",
        "country_code": "us"
    },
)

with open("bakersfield_fullpage.png", "wb") as f:
    f.write(response.content)

print("Saved full-page screenshot as bakersfield_fullpage.png")
After running the code, we can see that the image has been created in our project folder:
Step 8 - Save, Validate, and Schedule Your Scrapes
You can also save the extracted data and export it in a structured format. Let's return to our Wikipedia example, load the parsed results into a pandas DataFrame, and export it to a CSV file:
# "result" holds the parsed JSON from the extract_rules request (result = response.json())
df = pd.DataFrame(result['links_in_first_5_paragraphs'])

# Exporting the DataFrame to a CSV file:
df.to_csv("Wikipedia_links.csv", index=False)
print(df)
After running the code, a new file appears in our project folder where rows represent links in the first 5 paragraphs:
Note: Don't forget to validate your data. Make sure the keys and types match your expectations when working with extract_rules parsing parameters. A missing field, an empty list, or a malformed record can silently break your workflows if there are no alerts about unsuccessful extractions. Add reasonable delays and retries to avoid rate limiting, IP bans, and unnecessary stress on target servers. Also, retries will ensure that your scraper can recover from temporary issues like slow page loads or intermittent server errors. Together, they make your scrapes more stable and less detectable.
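As a starting point, here is a hedged sketch of a small retry-and-validate wrapper around the GET call; the function name, delay values, and retry count are illustrative, not part of the SDK:
import time

def fetch_with_retries(client, url, params, retries=3, delay=5):
    """Illustrative helper: retry the request a few times and validate the expected key."""
    for attempt in range(1, retries + 1):
        response = client.get(url, params=params)
        if response.ok:
            data = response.json()
            # Basic validation: make sure the expected key is present and non-empty
            if data.get("links_in_first_5_paragraphs"):
                return data
        time.sleep(delay)  # back off before retrying to avoid hammering the server
    raise RuntimeError(f"Failed to get valid data from {url} after {retries} attempts")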
Common Challenges in HTML Scraping
HTML scraping is much harder than it looks once you start targeting big retailer platforms. JavaScript-heavy pages may hide content until there is some proof of a real user navigating the browser. On top of that, if a site suspects that your connection is automated, it can trigger CAPTCHA or block IPs after repeated access.
Even well-built DIY scrapers can struggle when website structures change frequently, breaking hardcoded selectors. Without the services provided by our API, which are covered extensively in our scraping blog, the biggest challenge is not parsing data but guaranteeing consistent access to the largest public data sources.
Scalable HTML Web Scraping with ScrapingBee
An API-based HTML web scraper addresses the problems that arise when people hit stubborn targets or try to scale up their data collection efforts. Here is a breakdown of its biggest strengths:
JavaScript Rendering and headless browsers
Rotating proxies and randomized device fingerprints
AI and CSS/XPath extraction
Geolocation control and session persistence
Advanced HTML Scraping with AI Extraction
For exceptionally difficult targets, our AI web scraping feature can handle HTML parsing through plain-language queries instead of CSS selectors. Let's simplify the code so we don't get lost and can focus strictly on extracting data. But first, import the built-in Python "json" module:
import json
Replace the previous "extract_rules" with this simple "ai_extract_rules" dictionary, which contains the queries for the AI:
ai_extract_rules = {
    "title": "title of the page",
    "summary": "a 5 sentence summary of the page content"
}
Then we convert the Python dictionary into a JSON string and assign it to a new variable so it can be passed along as the set of queries.
ai_extract_rules_str = json.dumps(ai_extract_rules)
Send the API request with this parameter. Add the last line of code to print out the result. Your code should look something like this:
# Importing pip libraries
from scrapingbee import ScrapingBeeClient
import json

# Initializing ScrapingBeeClient (replace YOUR_API_KEY with the key from your account!)
client = ScrapingBeeClient(api_key='YOUR_API_KEY')

# Targeting Wikipedia's page about HTML
url = "https://en.wikipedia.org/wiki/HTML"

ai_extract_rules = {
    "title": "title of the page",
    "summary": "a 5 sentence summary of the page content"
}
ai_extract_rules_str = json.dumps(ai_extract_rules)

response = client.get(
    url,
    params={
        "ai_extract_rules": ai_extract_rules_str
    },
)

print(response.content)
After running the code, we can see that it returns AI-structured data about the page:
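Tip: if you prefer to work with the result as a Python dictionary instead of raw bytes, you can parse the JSON body; the keys match the ones defined in ai_extract_rules:
data = response.json()  # parse the JSON body into a Python dictionary
print(data["title"])
print(data["summary"])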
Comparing DIY HTML Scrapers vs. ScrapingBee
After covering all these features, let's break down why users prefer API-based solutions over DIY web scrapers:
| Features | DIY HTML Scraper | ScrapingBee API |
|---|---|---|
| Setup & Integration | Manual coding, libraries, proxy setup, needs technical profficiency | One-liner install with built-in API client, beginner-friendly |
| JavaScript Handling | Requires headless browser setup | Built-in JavaScript rendering |
| Proxy & Location | Manual rotation & geo-targeting, needs a proxy provider | Customizable premium proxies and geolocation control |
| Scalability | Complex orchestration required | Auto-scaled infrastructure |
| Extraction Logic | Regex/CSS/XPath + manual parsing | Simple CSS/XPath extractions + AI parsing solutions |
| Anti-Bot Evasion | Prone to IP blocks, CAPTCHA issues | High-quality proxies, fingerprint masking |
| Maintenance & Updates | Can break after web structure changes | Easier to update and maintain |
Pricing and Getting Started
If you decide to go the API-based scraper route, we have flexible options designed to fit all use cases. Instead of burdening you with complex proxy setups or unreliable browser infrastructure, ScrapingBee pricing only charges for successful requests and the advanced features you actually use, so there is no need to manage proxy subscriptions or maintain custom infrastructure.
Ready to Start HTML Web Scraping at Scale?
It’s time to simplify your web scraping workflow. We warmly welcome you to sign up for ScrapingBee and enjoy our beginner-friendly API with a 1-week free trial of 1,000 credits. With it, you will have plenty of resources to practice simple but efficient and scalable web scraping tasks. Let's get to work!
HTML Web Scraping FAQs
What is the difference between web scraping and HTML scraping?
HTML scraping is a subcategory of web scraping that focuses specifically on extracting data from a page’s HTML structure. Web scraping can also include API data extraction, JSON endpoints, or PDF parsing, while HTML scraping deals directly with the raw markup of web pages.
Can I scrape a website without coding?
Yes, there are no-code tools and AI-based scrapers that can collect data without requiring you to learn coding basics. However, we believe that anyone can build an effective and scalable data scraper with our Python SDK.
What are the best tools for HTML web scraping?
HTML web scrapers often rely on Python libraries like BeautifulSoup or Playwright for DIY setups, but these don't handle proxy rotation or anti-bot protections, making automated extraction less reliable than API-based solutions.
Is HTML scraping legal?
Scraping public data is legal, but it's important to understand the potential risks and legal liabilities, and to always respect the terms of service of the websites you visit. Being mindful of the platform's rules will help you build clean and compliant scraping practices, resulting in smooth extractions.

Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.
