
How to Collect Data for Machine Learning: A Step-by-Step Guide

14 May 2026 | 10 min read

In artificial intelligence, there is an ironclad rule that every engineer learns early: "Garbage in, garbage out." Your machine learning (ML) model is only as good as the data you feed it. You can deploy the most advanced neural network architecture in existence, but if the training data is noisy, biased, or incomplete, the predictions will be worthless.

While algorithms get most of the hype, industry surveys have long estimated that data scientists spend roughly 80% of their time on data collection and preparation. Learning how to collect data for machine learning is therefore the foundation of the entire pipeline. This guide walks you through the best data sources, modern extraction methods, and essential preparation steps to ensure your models perform at their peak.


Quick Answer (TL;DR)

Data for machine learning is generally collected via internal company databases, open-source repositories (like Kaggle), or web scraping. For projects requiring high-volume, real-time, or highly specific external data, web scraping for machine learning is the most powerful method for creating a proprietary edge. However, to avoid the technical debt of managing proxies and anti-bot systems, using a managed API is the most efficient route.

You can fetch structured data for your pipeline using a single API call with the following Python snippet:

# main.py

import json
import os

import requests
from dotenv import load_dotenv

# Load environment variables from .env
load_dotenv()

API_URL = "https://app.scrapingbee.com/api/v1/"
TARGET_URL = "https://quotes.toscrape.com/"

# Read the API key from the environment
api_key = os.getenv("SCRAPINGBEE_API_KEY")

if not api_key:
    raise RuntimeError(
        "Missing SCRAPINGBEE_API_KEY. Add it to your .env file."
    )

# Extraction rules define the structured data
# we want ScrapingBee to return as JSON
extract_rules = {
    "quotes": {
        "selector": ".quote",
        "type": "list",
        "output": {
            "text": ".text",
            "author": ".author",
            "tags": {
                "selector": ".tag",
                "type": "list",
            },
        },
    },
}

response = requests.get(
    API_URL,
    params={
        "api_key": api_key,
        "url": TARGET_URL,
        # Enable JS rendering only for dynamic websites
        # because it is slower and more expensive
        "render_js": "false",
        # ScrapingBee expects extract_rules as a JSON string
        "extract_rules": json.dumps(extract_rules),
    },
    # Prevent hanging forever on slow responses
    timeout=30,
)

# Raise an exception for HTTP errors (4xx / 5xx)
response.raise_for_status()

data = response.json()

# Print the first 3 extracted quotes
for quote in data.get("quotes", [])[:3]:
    print(f'{quote["text"]}{quote["author"]}')

Store your API key in the .env file:

SCRAPINGBEE_API_KEY=your_api_key_here

Install the necessary dependencies and run with:

pip install requests python-dotenv

python main.py

Step 1: Define Your ML Objectives and Data Requirements

Before writing a single line of code, you must define what you are trying to predict. Your objective dictates your data requirements.

  • Numerical/Regression Models: Require structured data like tabular CSVs or SQL exports (e.g., house prices, stock movements).
  • NLP and LLMs: Require massive amounts of unstructured data such as raw text, social media comments, or news articles.
  • Computer Vision: Require image or video files, often with associated metadata or labels.

Understanding whether you need structured data (neatly organized rows and columns) or unstructured data ("messy" information such as raw HTML or images) will determine which collection tools you choose.
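Here is a minimal sketch of that difference with made-up values: structured data loads straight into a tabular tool like Pandas, while unstructured data arrives as free-form text that still needs parsing and labeling.

# main.py

import io

import pandas as pd

# Structured data: rows and columns with a fixed schema
csv_data = io.StringIO(
    "sqft,bedrooms,price\n"
    "1400,3,250000\n"
    "2000,4,410000\n"
)
df = pd.read_csv(csv_data)
print(df.dtypes)

# Unstructured data: free-form text that needs
# parsing and labeling before a model can use it
raw_text = "Breaking: the local housing market cools as rates climb."
print(raw_text)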

Step 2: Identify the Best Data Sources

Where does the data actually live? Most ML projects draw from a combination of these three categories:

1. Internal Sources

This includes data your company already owns, such as CRM data, user activity logs, and transactional histories. This data is proprietary and often provides the highest competitive advantage.

2. Open-Source Datasets & Data Providers

For benchmarking or general training, public repositories are invaluable. Kaggle remains the gold standard, while Google Dataset Search and AWS Public Datasets provide access to massive cloud-hosted indices.
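Many public datasets can be pulled straight into a DataFrame from a hosted CSV. The snippet below uses the classic iris dataset from the seaborn-data GitHub mirror purely as an illustration:

# main.py

import pandas as pd

# Publicly hosted demo dataset (the classic iris flowers CSV)
URL = (
    "https://raw.githubusercontent.com/"
    "mwaskom/seaborn-data/master/iris.csv"
)

df = pd.read_csv(URL)

print(df.head())
print(df.shape)

Install Pandas and run it with:

pip install pandas

python main.py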

3. The Open Web

The web is the largest unstructured database in existence. If you need to build a sentiment analysis tool for a specific niche or train an LLM on specialized industry news, you need to know how to scrape all text from a website for LLM AI training to create custom datasets that don't exist in any public repository.
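As a minimal sketch of that idea, the snippet below dumps the visible text of a single demo page (quotes.toscrape.com, a sandbox site also used later in this guide) using requests and BeautifulSoup. A real corpus pipeline would add crawling, deduplication, and boilerplate removal on top:

# main.py

import requests
from bs4 import BeautifulSoup

URL = "https://quotes.toscrape.com/"

response = requests.get(URL, timeout=30)
# Raise an exception for HTTP errors (4xx / 5xx)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Strip script and style tags so only human-visible text remains
for tag in soup(["script", "style"]):
    tag.decompose()

# Collapse the page into newline-separated plain text
text = soup.get_text(separator="\n", strip=True)
print(text[:500])

Install dependencies and run the script:

pip install requests beautifulsoup4

python main.py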

Step 3: Methods for Collecting Machine Learning Data

Method A: Using Public APIs

Platforms like Reddit or the New York Times provide clean JSON, though they often have strict rate limits.

Here is a simple example that fetches top posts from a subreddit and prints their titles and scores:

# main.py

import requests

URL = "https://www.reddit.com/r/MachineLearning/top.json"

headers = {
    # Reddit recommends using a unique and descriptive User-Agent
    "User-Agent": "python:ml-data-collection-demo:v1.0"
}

response = requests.get(
    URL,
    headers=headers,
    params={
        # Fetch the top 5 posts from the past week
        "limit": 5,
        "t": "week",
    },

    # Prevent hanging forever on slow responses
    timeout=30,
)
# Raise an exception for HTTP errors (4xx / 5xx)
response.raise_for_status()

data = response.json()

# Safely extract the list of posts from the API response
posts = data.get("data", {}).get("children", [])

for post in posts:
    title = post["data"]["title"]
    score = post["data"]["score"]

    print(f"{score} points — {title}")

This example uses Reddit's public JSON endpoint for simplicity. For production applications, Reddit recommends authenticated OAuth API access with a registered application and a unique User-Agent. Authentication also provides more reliable rate limits and API stability.

Note: Reddit's API Terms place restrictions on using user-generated content for machine learning or AI training. Always review the platform's terms before collecting or processing public data at scale.

Install dependencies and run the script:

pip install requests

python main.py
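If you want the authenticated route described above, here is a minimal application-only OAuth sketch. It assumes you have registered an app at reddit.com/prefs/apps and exported REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET as environment variables; only requests is needed, so the install step above covers it:

# main.py

import os

import requests

CLIENT_ID = os.environ["REDDIT_CLIENT_ID"]
CLIENT_SECRET = os.environ["REDDIT_CLIENT_SECRET"]
USER_AGENT = "python:ml-data-collection-demo:v1.0"

# Exchange the app credentials for a short-lived access token
token_response = requests.post(
    "https://www.reddit.com/api/v1/access_token",
    auth=(CLIENT_ID, CLIENT_SECRET),
    data={"grant_type": "client_credentials"},
    headers={"User-Agent": USER_AGENT},
    timeout=30,
)
token_response.raise_for_status()
token = token_response.json()["access_token"]

# Authenticated requests go to oauth.reddit.com
response = requests.get(
    "https://oauth.reddit.com/r/MachineLearning/top",
    headers={
        "Authorization": f"bearer {token}",
        "User-Agent": USER_AGENT,
    },
    params={"limit": 5, "t": "week"},
    timeout=30,
)
response.raise_for_status()

for post in response.json()["data"]["children"]:
    print(post["data"]["title"])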

Method B: Custom Web Scraping with Python (BeautifulSoup / Playwright)

If no API exists, you can build a custom scraper using Python. This is often the first step for developers learning web scraping 101 with Python.
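As a starting point, here is a minimal BeautifulSoup sketch that replicates the TL;DR extraction against the same quotes.toscrape.com sandbox:

# main.py

import requests
from bs4 import BeautifulSoup

URL = "https://quotes.toscrape.com/"

response = requests.get(URL, timeout=30)
# Raise an exception for HTTP errors (4xx / 5xx)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

rows = []

# Each quote card on the page is a <div class="quote">
for quote in soup.select(".quote"):
    rows.append(
        {
            "text": quote.select_one(".text").get_text(strip=True),
            "author": quote.select_one(".author").get_text(strip=True),
            "tags": [t.get_text(strip=True) for t in quote.select(".tag")],
        }
    )

# Print the first 3 extracted quotes
for row in rows[:3]:
    print(row)

Install dependencies and run the script:

pip install requests beautifulsoup4

python main.py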

Cons: You must manage your own proxies, handle CAPTCHAs, and run headless browsers for JavaScript-heavy sites. This infrastructure management quickly becomes a bottleneck when scaling ML datasets.

Method C: Using a Managed Web Scraping API

For production-level ML pipelines, custom scripts often fail due to sophisticated bot detection. A managed API like ScrapingBee handles the "infrastructure" of scraping (proxy rotation, CAPTCHA solving, and JavaScript rendering), letting you focus on the data.

Using extract_rules, you can turn any website into a structured JSON object ready for a Pandas DataFrame:

# main.py

import json
import os

import requests
from dotenv import load_dotenv

# Load environment variables from .env
load_dotenv()

API_URL = "https://app.scrapingbee.com/api/v1/"
TARGET_URL = "https://books.toscrape.com/"

# Read the API key from the environment
api_key = os.getenv("SCRAPINGBEE_API_KEY")

if not api_key:
    raise RuntimeError(
        "Missing SCRAPINGBEE_API_KEY. Add it to your .env file."
    )

# Define the structured data we want to extract
extract_rules = {
    "products": {
        "selector": ".product_pod",
        "type": "list",
        "output": {
            "name": {
                "selector": "h3 a",

                # Extract the value of the `title` attribute
                "output": "@title",
            },
            "price": ".price_color",
            "availability": ".availability",
            "relative_url": {
                "selector": "h3 a",

                # Extract the product link
                "output": "@href",
            },
        },
    }
}

response = requests.get(
    API_URL,
    params={
        "api_key": api_key,
        "url": TARGET_URL,

        # Enable JS rendering only for dynamic websites
        # because it is slower and more expensive
        "render_js": "false",

        # ScrapingBee expects extract_rules as a JSON string
        "extract_rules": json.dumps(extract_rules),
    },

    # Prevent hanging forever on slow responses
    timeout=30,
)

# Raise an exception for HTTP errors (4xx / 5xx)
response.raise_for_status()

data = response.json()

# Print the first 5 extracted products
for product in data.get("products", [])[:5]:
    print(product)

Create an .env file with your API token:

SCRAPINGBEE_API_KEY=your_api_key_here

Install dependencies and run the script:

pip install requests python-dotenv

python main.py
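As promised, the extracted JSON drops straight into a DataFrame. The sketch below is self-contained, with one sample row shaped like the script's output (the values mirror the first product on books.toscrape.com):

# main.py

import pandas as pd

# `data` stands in for the parsed JSON returned by the script above
data = {
    "products": [
        {
            "name": "A Light in the Attic",
            "price": "£51.77",
            "availability": "In stock",
            "relative_url": "catalogue/a-light-in-the-attic_1000/index.html",
        },
    ],
}

df = pd.DataFrame(data["products"])
print(df.head())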

Step 4: Data Preparation, Cleaning, and Formatting

Extraction is only half the battle. Once you've successfully navigated the complexities of web data mining, you must clean the raw output.

  • Handling Missing Values: Dropping empty rows or using "imputation" (filling gaps with means or medians).
  • Removing Duplicates: Ensuring the model doesn't overfit on repeated data points.
  • Normalization: Converting text to lowercase or scaling numerical values to a 0-to-1 range (see the min-max sketch after the cleaning snippet below).
  • Formatting: Storing your final dataset as a CSV, JSONL, or Parquet file.

Use this Pandas snippet to clean your scraped JSON data, remove duplicates, and export it as a Parquet file:

# main.py

import pandas as pd

# Example raw dataset collected from an API or scraper
data = [
    {"text": "Great product!", "label": "positive"},
    {"text": "Awful service.", "label": "negative"},
    {"text": None, "label": "neutral"},
    {"text": "Great product!", "label": "positive"},
]

# Load the raw data into a DataFrame
df = pd.DataFrame(data)

# Clean and normalize the dataset
clean_df = (
    df

    # Remove rows where the text field is missing
    .dropna(subset=["text"])

    # Normalize text:
    # - remove leading/trailing whitespace
    # - convert to lowercase
    .assign(text=lambda x: x["text"].str.strip().str.lower())

    # Remove empty text rows after cleaning
    .query("text != ''")

    # Remove duplicate text entries
    .drop_duplicates(subset=["text"])

    # Reset row indexes after filtering
    .reset_index(drop=True)
)

# Save the cleaned dataset as a Parquet file
clean_df.to_parquet(
    "cleaned_ml_dataset.parquet",
    engine="pyarrow",
    index=False,
)

print(clean_df)

Install the necessary dependencies and run the script:

pip install pandas pyarrow

python main.py
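The normalization bullet above also mentions scaling numerical values to a 0-to-1 range. Here is a minimal min-max sketch with made-up prices:

# main.py

import pandas as pd

df = pd.DataFrame({"price": [10.0, 25.0, 40.0]})

# Min-max normalization: rescale values into the [0, 1] range
price_min = df["price"].min()
price_max = df["price"].max()
df["price_scaled"] = (df["price"] - price_min) / (price_max - price_min)

print(df)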

Step 5: Legal and Ethical Compliance

Collecting data is a legal and ethical responsibility. With the EU AI Act's provisions being phased in through 2027, requirements around data provenance and transparency are tightening for machine learning teams operating in Europe.

To maintain a compliant pipeline, follow these four pillars:

  • Respect robots.txt and Public Access: Always check the /robots.txt file of a target site. While the Ninth Circuit's hiQ v. LinkedIn ruling found that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act (CFAA), this does not mean scraping is universally protected. Sites can still pursue other legal claims, and accessing data behind login credentials or paywalls remains legally risky.
  • Ethical Rate Limiting: High-frequency requests can strain a website's infrastructure. Use concurrency limits and "sleep" intervals to mimic human browsing (see the sketch after this list). This protects the target server and helps you avoid getting blocked.
  • Data Minimization and PII: Under GDPR and CCPA, you should only collect the specific data your model requires. If you are scraping reviews for sentiment analysis, strip out names or IP addresses. In 2026, it is standard practice to use anonymization or masking on Personally Identifiable Information (PII) immediately after collection.
  • AI Opt-Outs: Many site owners now use machine-readable tags (like noai) to opt out of generative AI training. Respecting these signals is essential for ethical data sourcing and reduces your exposure to copyright disputes over fair use.
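Here is the rate-limiting sketch referenced above. The page URLs and the two-second delay are illustrative; tune the delay to the target site's capacity and any crawl-delay hint in its robots.txt:

# main.py

import time

import requests

URLS = [
    "https://quotes.toscrape.com/page/1/",
    "https://quotes.toscrape.com/page/2/",
    "https://quotes.toscrape.com/page/3/",
]

# Delay between requests, in seconds
REQUEST_DELAY = 2.0

for url in URLS:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    print(f"{url}: fetched {len(response.text)} bytes")

    # Pause between requests to avoid straining the server
    time.sleep(REQUEST_DELAY)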

Frequently Asked Questions (FAQs)

Can I use web scraped data for machine learning?

Yes. Web scraping is one of the most common ways to build custom datasets, especially for NLP and training LLMs. As long as you respect ethical guidelines and avoid PII, the open web remains the most robust source for real-time training data.

How much data is needed for machine learning?

It depends on the model's complexity. A simple regression might only need a few thousand rows, while deep learning and LLMs require millions. However, remember that quality (clean, accurately labeled data) is more important than sheer volume.

What is the best data format for machine learning?

For large tabular datasets, Parquet is often a strong choice because it is compressed and fast to read. For text datasets, JSONL is a common format because each record can be stored on a separate line, making it convenient for preprocessing and ML pipelines. ScrapingBee allows you to extract web data directly into structured JSON, which is easily convertible to any of these formats.
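If you settle on JSONL, writing it takes a few lines with the standard library (the records below are made up):

# main.py

import json

records = [
    {"text": "great product!", "label": "positive"},
    {"text": "awful service.", "label": "negative"},
]

# JSONL: one JSON object per line
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")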

Are public datasets enough for real-world ML projects?

Public datasets (like Kaggle) are great for benchmarking. However, they rarely provide a competitive edge because everyone has access to them. To build a unique, production-ready model, teams usually rely on proprietary data collection via web scraping.

Scaling Your Data Pipelines

While open datasets are excellent for practice, real-world machine learning requires custom data collection. Mastering the transition from raw web pages to structured, clean datasets is the key to building models that actually deliver value.

Don't let proxy management and CAPTCHA headaches slow down your research. Sign up for ScrapingBee's free trial today and start turning the web into a structured database for your machine learning models.

Karolis Stasiulevičius

Karolis is Head of Growth at ScrapingBee. Previously built and scaled technology products in data and e-commerce verticals.