
Web scraping Java: From setup to production scrapers

23 December 2025 (updated) | 35 min read

Web scraping in Java is (probably) harder than it should be. Making one HTTP request is easy. Building a scraper that survives pagination, JavaScript rendering, parallel requests, and blocking is where most Java projects fall apart.

In this tutorial, you'll build a reliable scraper with Java 21, Jsoup, and ScrapingBee. We'll cover static scraping, pagination, parallel crawling, and the cases where Selenium still makes sense. And you'll do it without managing your own proxies, CAPTCHA solving, or headless browsers.

Let's get rolling!


Quick answer (TL;DR)

If you just want something that works, here you go. This is a simple Java web scraping example that:

  • fetches HTML via ScrapingBee (no direct requests to the target site)
  • parses with Jsoup
  • discovers a few pagination pages
  • fetches those pages in parallel
  • stores results in Java objects
  • prints debug output

💡 Before you go deeper, this list is a good bookmark: Best 10 Java web scraping libraries.

Full example

Here's the copy-pasteable code:

package org.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class Main {
    // Simple model (we'll reuse this later)
    public record Product(String title, String price, String url) {}

    /**
     * Holds the pagination URLs we discovered + the HTML we already fetched while discovering.
     * Key idea: discovery requires fetching pages to find "next", so keep that HTML and reuse it.
     */
    private record DiscoveryResult(List<String> pageUrls, Map<String, String> htmlCache) {}

    public static void main(String[] args) throws Exception {
        // Set this in your shell: SCRAPINGBEE_API_KEY=...
        String apiKey = System.getenv("SCRAPINGBEE_API_KEY");
        if (apiKey == null || apiKey.isBlank()) {
            throw new IllegalStateException("Missing SCRAPINGBEE_API_KEY env var.");
        }

        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(20))
                .build();

        String startUrl = "https://books.toscrape.com/";

        // 1) Discover a few pages by following the "next" link
        // IMPORTANT: discovery has to fetch HTML to find the next page.
        // We store fetched HTML in a cache so we DON'T fetch those pages again later.
        int pagesToScrape = 3;
        DiscoveryResult discovery = discoverPagesAndCacheHtml(client, apiKey, startUrl, pagesToScrape);

        List<String> pageUrls = discovery.pageUrls();
        Map<String, String> htmlCache = discovery.htmlCache();

        System.out.println("Pages discovered:");
        for (String u : pageUrls) System.out.println(" - " + u);
        System.out.println("HTML cached from discovery: " + htmlCache.size() + " pages");

        // 2) Parse pages in parallel.
        // We only call ScrapingBee if HTML is not in the cache.
        int threads = 5;
        ExecutorService pool = Executors.newFixedThreadPool(threads);

        // Thread-safe results + visited tracking
        Queue<Product> results = new ConcurrentLinkedQueue<>();
        Set<String> visited = ConcurrentHashMap.newKeySet();

        AtomicInteger extraFetches = new AtomicInteger(0);

        List<Future<?>> futures = new ArrayList<>();

        for (String pageUrl : pageUrls) {
            futures.add(pool.submit(() -> {
                try {
                    // Avoid duplicates (sometimes pagination gets weird)
                    if (!visited.add(pageUrl)) return;

                    // Reuse HTML from discovery if available, otherwise fetch now.
                    // This is the "no double fetch" fix.
                    String html = htmlCache.get(pageUrl);
                    if (html == null) {
                        html = fetchViaScrapingBee(client, apiKey, pageUrl);
                        extraFetches.incrementAndGet();
                    }

                    Document doc = Jsoup.parse(html, pageUrl);

                    List<Product> items = extractBooks(doc);
                    results.addAll(items);

                    System.out.println("Done: " + pageUrl + " | items: " + items.size());
                } catch (Exception e) {
                    System.out.println("Failed: " + pageUrl + " | " + e.getMessage());
                }
            }));
        }

        // Wait for all tasks
        for (Future<?> f : futures) f.get();

        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);

        // Debug output
        System.out.println("\nExtra ScrapingBee fetches after discovery: " + extraFetches.get());
        System.out.println("Total products scraped: " + results.size());

        int limit = Math.min(results.size(), 10);
        System.out.println("Top " + limit + ":");

        // ConcurrentLinkedQueue doesn't support get(i), so just iterate.
        int i = 0;
        for (Product p : results) {
            if (i >= limit) break;
            System.out.println((i + 1) + ") " + p.title() + " | " + p.price() + " | " + p.url());
            i++;
        }
    }

    // Fetch HTML through ScrapingBee
    private static String fetchViaScrapingBee(HttpClient client, String apiKey, String targetUrl) throws Exception {
        String apiUrl = "https://app.scrapingbee.com/api/v1/?" +
                "api_key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8) +
                "&url=" + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8) +
                "&render_js=false";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(apiUrl))
                .timeout(Duration.ofSeconds(60))
                .header("User-Agent", "java-scraper-tldr/1.0")
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() != 200) {
            String body = response.body();
            if (body == null) body = "";
            System.out.println("ScrapingBee HTTP " + response.statusCode() + " for " + targetUrl);
            System.out.println(body.substring(0, Math.min(body.length(), 500)));
            throw new RuntimeException("ScrapingBee request failed: " + response.statusCode());
        }

        String body = response.body();
        return body == null ? "" : body;
    }

    /**
     * Discover first N pages by following pagination AND cache HTML while doing it.
     * This prevents the classic "discover pages -> fetch them again" double-fetch.
     *
     * books.toscrape.com uses: ul.pager li.next a
     */
    private static DiscoveryResult discoverPagesAndCacheHtml(
            HttpClient client,
            String apiKey,
            String startUrl,
            int maxPages
    ) throws Exception {

        List<String> urls = new ArrayList<>();
        Map<String, String> htmlCache = new ConcurrentHashMap<>();

        // O(1) loop detection
        Set<String> seen = new HashSet<>();

        String current = startUrl;

        for (int i = 1; i <= maxPages; i++) {
            // Stop if we loop back (defensive; some sites do weird things)
            if (!seen.add(current)) break;

            urls.add(current);

            // Fetch once here (required to find "next")
            String html = fetchViaScrapingBee(client, apiKey, current);

            // Cache it so the parallel phase can reuse it without another API call
            htmlCache.put(current, html);

            Document doc = Jsoup.parse(html, current);

            Element next = doc.selectFirst("ul.pager li.next a");
            if (next == null) break;

            String nextUrl = next.absUrl("href");
            if (nextUrl == null || nextUrl.isBlank()) break;

            current = nextUrl;
        }

        return new DiscoveryResult(urls, htmlCache);
    }

    // Extract book data from the DOM
    private static List<Product> extractBooks(Document doc) {
        List<Product> products = new ArrayList<>();
        Elements cards = doc.select("ol.row article.product_pod");

        for (Element card : cards) {
            Element a = card.selectFirst("h3 a");
            if (a == null) continue;

            // Full title is stored in the title attribute on this site
            String title = a.attr("title").trim();

            Element priceEl = card.selectFirst(".price_color");
            String price = priceEl != null ? priceEl.text().trim() : "";

            // absUrl works because we parse with the page URL as baseUri
            String url = a.absUrl("href").trim();

            products.add(new Product(title, price, url));
        }

        return products;
    }
}

More detailed explanations and instructions come later in the post.

Setting up your Java scraping environment

You want a modern Java web scraping setup that feels boring, but in a good way. Java 21 LTS, one build tool, and a code editor that won't fight you.

One important thing upfront. Your network calls go through ScrapingBee's API. That's your "get me the page" layer. In Java, you still do the work: parse HTML with Jsoup and manage data with collections. And if you need a real browser, you run Selenium in Java too.

Installing Java 21 LTS and verifying setup

Install Java 21 LTS. It keeps your stack current and stable. It also gives you nice modern stuff like:

  • a solid built-in HttpClient
  • record types for clean data models
  • virtual threads for scaling lots of requests

After install, make sure both java and javac are on your PATH:

java -version
javac -version

You want to see 21 in the output. If javac is missing, you likely installed a JRE-only package. You want a full JDK.

Choosing between Maven and Gradle for dependency management

You need a build tool so your dependencies don't turn into manual chaos. For web scraping Java projects, this is mainly about clean installs for:

  • jsoup for HTML parsing
  • Selenium if you use a browser
  • test libraries like JUnit for sanity checks

Pick one and stick to it. Mixing Maven and Gradle in the same repo is how confusion spreads.

Go Maven if:

  • you want the most common "works everywhere" setup
  • you like explicit XML config
  • you expect other Java devs to jump in fast

Go Gradle if:

  • you want a shorter config file
  • you like scripting-style config
  • you plan to tweak the build later

Maven example (pom.xml)

Here's an example of the pom.xml file with JSoup added:

<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                             https://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>com.example.scraper</groupId>
  <artifactId>java-scraper</artifactId>
  <version>1.0.0</version>

  <properties>
    <maven.compiler.release>21</maven.compiler.release>
  </properties>

  <dependencies>
    <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.21.2</version>
    </dependency>
  </dependencies>
</project>

Gradle example (build.gradle)

plugins {
  id 'application'
}

java {
  toolchain {
    languageVersion = JavaLanguageVersion.of(21)
  }
}

repositories {
  mavenCentral()
}

dependencies {
  implementation 'org.jsoup:jsoup:1.21.2'
}

application {
  mainClass = 'org.example.Main'
}

If you plan to use Selenium later, you add it here too. Same story for JSON helpers.
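For instance, here's a sketch of what that might look like in build.gradle. The Selenium version matches the one used later in this post; Gson is just one example of a JSON helper, and both added versions are illustrative:

dependencies {
  implementation 'org.jsoup:jsoup:1.21.2'

  // Only needed if you go the Selenium route later in this tutorial
  implementation 'org.seleniumhq.selenium:selenium-java:4.39.0'

  // Optional JSON helper (Gson shown as an example; Jackson works just as well)
  implementation 'com.google.code.gson:gson:2.11.0'
}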

Picking an IDE or editor

You don't need a fancy IDE. You need one that helps you move fast when scraping gets weird.

  • IntelliJ IDEA — best "just works" experience. Great debugger, easy Maven and Gradle sync. Nice when you're stepping through HTTP responses and parsing logic.
  • VS Code with Java extensions — lightweight and solid. Good if you like a simple setup. Works fine for running tests and managing Maven or Gradle. Also comfy for checking JSON.
  • Eclipse — still a classic. Works well with Maven projects, solid debugger. Good choice if that's what your team already uses.

What matters for Java web scraping work: quick debugging, easy test runs, and smooth Maven or Gradle support.

Setting up project for Java web scraping

In this tutorial I'll stick with Gradle and VS Code, so let's create a new project by running:

mkdir java-web-scraping
cd java-web-scraping
gradle init

Alternatively, you can use the gradlew wrapper.

Choose:

  • Type: application
  • Implementation language: Java
  • Java version: 21
  • Project name: leave default
  • Application structure: Single application project
  • Build script DSL: Groovy
  • Test framework: doesn't matter as we won't write tests

Next, edit app/build.gradle and paste the content as shown in the section above.

The generated app might contain sample unit tests, so let's remove that folder:

rm -rf app/src/test

Download dependencies and build:

gradle build

Open the folder in VS Code:

code .

That's it. One folder, one build tool, jsoup ready to go. You're set up for web scraping Java without touching a heavy IDE.

Static web scraping with Jsoup

In this section we build a simple static scraper, doing web scraping in Java the right way:

  • ScrapingBee handles the hard stuff: HTTP requests, proxies, retries, and anti-bot tricks.
  • Jsoup does one job only: parse HTML into a DOM and let you extract data cleanly.

So, an important rule for this guide: all real requests go through ScrapingBee's API. Jsoup never hits target websites directly in production.

If you want a deeper dive into parsing, this article pairs well with this section: HTML parsing in Java with JSoup.

Before you start: ScrapingBee API key

To follow along, you need a free ScrapingBee account.

Register here: app.scrapingbee.com/account/register!

After signup, grab your API key from the dashboard. You get 1,000 free credits, which is more than enough for testing and learning. We'll use that API key in all examples below.

What we're scraping

We'll scrape a demo site made for practice: books.toscrape.com. Each book there lives inside an ol.row list. Every item looks like this in the HTML:

  • ol.row
    • li
      • article.product_pod
        • title in h3 a
        • price in .price_color
        • link in a[href]

That structure is great for learning CSS selectors.

Connecting to a web page with Jsoup.connect()

Jsoup has two common usage patterns. Both are valid, but they serve different goals.

Option 1: Jsoup.connect() (local or quick tests)

This is fine for fast experiments or learning selectors.

Document doc = Jsoup.connect("https://books.toscrape.com/")
        .get();

This hits the site directly. That's okay for demos, but not how production scraping should work.

Option 2: ScrapingBee fetch + Jsoup.parse() (production)

In real Java scraping setups, ScrapingBee fetches the page, and you then feed the HTML into Jsoup.

Example flow:

  • Call ScrapingBee's API with your API key
  • Get the HTML as a string
  • Parse it with Jsoup

String html = scrapingBeeResponseBody;

Document doc = Jsoup.parse(
        html,
        "https://books.toscrape.com/"
);

Say it once more because it matters: production scraping should call ScrapingBee and feed the HTML into Jsoup. That's how you avoid blocks and rate limits.

Using CSS selectors to extract HTML elements

Jsoup uses CSS selectors, the same ones you see in Chrome DevTools. Here are the patterns you'll use all the time:

  • .class, e.g. .product_pod
  • #id, e.g. #main
  • tag[attr=value], e.g. a[href]
  • parent child, e.g. article.product_pod h3 a
  • :nth-of-type(n), e.g. li:nth-of-type(1)

If you can select it in DevTools, you can select it in Jsoup.

Example: select all book cards.

Elements cards = doc.select("article.product_pod");

From there, you drill down:

for (Element card : cards) {
    String title = card.select("h3 a").attr("title");
    String price = card.select(".price_color").text();
    String url = card.select("h3 a").attr("href");
}

That mental mapping from browser → selector → document.select() is the core scraping skill.

Storing scraped data in Java objects

Raw strings get messy fast, so map scraped fields into a simple Java object. Java 21 records are perfect for this:

public record Product(
        String title,
        String price,
        String url
) {}

Now store results in a list:

List<Product> products = new ArrayList<>();

for (Element card : cards) {
    Product product = new Product(
            card.select("h3 a").attr("title"),
            card.select(".price_color").text(),
            card.select("h3 a").attr("href")
    );

    products.add(product);
}

We'll reuse the same Product record when we add pagination and parallel scraping. So, that's static web scraping in Java in its simplest form: ScrapingBee fetches, Jsoup parses, Java stores the data.

Static web scraping: Full Java example

Open the app/src/main/java/org/example/Main.java file and paste the following code inside:

package org.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

/**
 * Static scraper demo:
 * - ScrapingBee does the HTTP/proxy/anti-bot work
 * - Jsoup parses the returned HTML and extracts book data
 *
 * Requires:
 *   1) Add jsoup dependency (Maven/Gradle)
 *   2) Set env var SCRAPINGBEE_API_KEY to your ScrapingBee API key
 */
public class Main {
    // Simple data model we will reuse later (pagination, parallel scraping, etc.)
    public record Product(String title, String price, String url) {}

    public static void main(String[] args) throws Exception {
        // 1) Read API key from environment variables (don't hardcode secrets!)
        String apiKey = System.getenv("SCRAPINGBEE_API_KEY");
        if (apiKey == null || apiKey.isBlank()) {
            throw new IllegalStateException(
                    "Missing SCRAPINGBEE_API_KEY env var. " +
                    "Set it and run again."
            );
        }

        // 2) Page we want to scrape (we do NOT call it directly)
        String targetUrl = "https://books.toscrape.com/";

        // 3) Build ScrapingBee API URL
        // Tip: keep render_js=false for static pages (faster, cheaper)
        String scrapingBeeUrl =
                "https://app.scrapingbee.com/api/v1/?" +
                "api_key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8) +
                "&url=" + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8) +
                "&render_js=false";

        // 4) Send request via Java 21 HttpClient
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(20))
                .build();

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(scrapingBeeUrl))
                .timeout(Duration.ofSeconds(60))
                // A basic UA helps keep logs tidy and avoids weird defaults
                // Note: this user-agent goes to ScrapingBee
                // To forward headers to the *target* website, these must be prefixed with `Spb-`
                // and `forward_headers` must be set to `true`
                // Learn more in https://help.scrapingbee.com/en/article/how-to-forward-headers-fh6fqo/
                .header("User-Agent", "java-scraper-demo/1.0")
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // 5) Quick debug info for the HTTP layer
        System.out.println("HTTP status: " + response.statusCode());
        System.out.println("Response size: " + response.body().length() + " chars");

        if (response.statusCode() != 200) {
            // Print a small snippet to understand what went wrong (avoid dumping huge HTML)
            String body = response.body();
            if (body == null) body = "";
            System.out.println("Non-200 response. Body preview:");
            System.out.println(body.substring(0, Math.min(body.length(), 600)));
            return;
        }

        String html = response.body();

        // 6) Parse the HTML with Jsoup
        // baseUri is important so relative links become absolute with absUrl()
        Document doc = Jsoup.parse(html, targetUrl);

        // 7) Extract books
        // Books are inside ol.row and each book is an article.product_pod
        Elements cards = doc.select("ol.row article.product_pod");

        System.out.println("Found book cards: " + cards.size());

        List<Product> products = new ArrayList<>();

        // 8) Pull out a few fields per card
        for (Element card : cards) {
            Element linkEl = card.selectFirst("h3 a");
            if (linkEl == null) continue;

            String title = linkEl.attr("title").trim();
            String price = card.selectFirst(".price_color") != null
                    ? card.selectFirst(".price_color").text().trim()
                    : "";

            // Use absUrl to convert relative href into a full URL
            // Jsoup can do this because we provided baseUri in Jsoup.parse(...)
            String url = linkEl.absUrl("href").trim();

            products.add(new Product(title, price, url));
        }

        // 9) Print scraped results (debug output)
        int limit = Math.min(products.size(), 10);
        System.out.println("\nTop " + limit + " scraped books:");
        for (int i = 0; i < limit; i++) {
            Product p = products.get(i);
            System.out.println((i + 1) + ") " + p.title());
            System.out.println("   Price: " + p.price());
            System.out.println("   URL:   " + p.url());
        }
    }
}

This code:

  • Uses one entry point (Main) and a simple Product record to keep scraped data clean and reusable.
  • Reads SCRAPINGBEE_API_KEY from env vars, so you don't hardcode secrets in code.
  • Calls ScrapingBee's API endpoint to fetch HTML, instead of hitting the target site directly.
  • Builds the ScrapingBee request URL with render_js=false since this is a static page (faster + cheaper).
  • Uses Java 21 HttpClient with timeouts to make the request and avoid hanging forever.
  • Prints basic HTTP debug info (status + body size) and shows a short preview on non-200 responses.
  • Parses the returned HTML with Jsoup.parse(html, targetUrl) so relative links can be resolved with absUrl().
  • Extracts book cards using a CSS selector (ol.row article.product_pod) and then pulls title, price, and href.
  • Stores results in List<Product> and prints a short preview (top 10) to verify the scraper works.

Make sure to set the SCRAPINGBEE_API_KEY environment variable, and then run the app:

gradle run
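If the variable isn't set in your shell yet, here's one way to do it before running (replace the placeholder with your real key):

# macOS / Linux (bash, zsh)
export SCRAPINGBEE_API_KEY=your_api_key_here

# Windows PowerShell
$env:SCRAPINGBEE_API_KEY = "your_api_key_here"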

Here's the sample output:

Top 10 scraped books:
1) A Light in the Attic
   Price: £51.77
   URL:   https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
2) Tipping the Velvet
   Price: £53.74
   URL:   https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html

...

Great job!

Hidden techniques for crawling multiple pages

Pagination is where scraping starts to feel like crawling. But the general rules can stay simple:

  • Each page fetch goes through ScrapingBee.
  • Jsoup only parses HTML and extracts links and items.
  • You follow the "Next"-like link until it's gone.
  • You track visited URLs so you don't loop forever.
  • You keep it polite. Low page counts for demos. Small delays if you scale.

We'll use the "Books to Scrape" site again. It has a classic pager:

<ul class="pager">
  <li class="current">Page 1 of 50</li>
  <li class="next"><a href="catalogue/page-2.html">next</a></li>
</ul>

Detecting pagination with a.next and a.page-numbers

Navigate to the website you're planning to scrape, open DevTools, inspect the "next" button, then copy the selector idea into Jsoup.

Common patterns you'll see on real sites:

  • a[rel=next]
  • a.next
  • a.page-numbers.next
  • ul.pager li.next a (this is our "books" site)

In Jsoup, you grab the link element and read href:

Element nextLink = doc.selectFirst("ul.pager li.next a");
String href = nextLink.attr("href");

Relative vs absolute URLs

Let's do a quick refresher:

  • Relative URL: catalogue/page-2.html
  • Absolute URL: https://books.toscrape.com/catalogue/page-2.html

Jsoup can convert relative to absolute for you if you parse with a base URL:

Document doc = Jsoup.parse(html, baseUrl);
String nextUrl = nextLink.absUrl("href");

That absUrl("href") is gold: use it!

Recursive crawling with Jsoup and URL updates

The crawler loop is always the same:

  • Fetch page HTML via ScrapingBee
  • Parse with Jsoup
  • Extract products
  • Find "next" link
  • Move to the next URL
  • Stop when no next link exists

You can write it recursively, but in real projects you usually convert it to a loop. Reason? Recursion can grow the stack if you crawl many pages. So we'll show a loop-based crawler and keep it safe by limiting to 2–3 pages.

Avoiding duplicate requests with URL tracking

Pagination sometimes gets weird.

Some sites can link back or repeat pages, so we keep a Set<String> of visited URLs. The idea is simple: if the URL is already in the set, skip it and stop. This prevents infinite loops.

Full example: Scrape pages via ScrapingBee + Jsoup

This example:

  • fetches each page via ScrapingBee
  • scrapes book title, price, and product URL
  • follows the "next" link up to 3 pages
  • uses a Set<String> to avoid duplicates
  • prints debug info so you see what it's doing

package org.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class Main {
    public record Product(String title, String price, String url) {}

    public static void main(String[] args) throws Exception {
        // Read API key from env var (don't hardcode secrets)
        String apiKey = System.getenv("SCRAPINGBEE_API_KEY");
        if (apiKey == null || apiKey.isBlank()) {
            throw new IllegalStateException("Missing SCRAPINGBEE_API_KEY env var.");
        }

        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(20))
                .build();

        // Start URL (page 1)
        String currentUrl = "https://books.toscrape.com/";
        int maxPages = 3; // keep it polite and simple for the demo

        // Track visited URLs so we don't loop forever
        Set<String> visited = new HashSet<>();

        // Store all results here
        List<Product> allProducts = new ArrayList<>();

        for (int pageNum = 1; pageNum <= maxPages; pageNum++) {
            // Stop if we already scraped this URL (prevents loops)
            // Important: this runs BEFORE fetching, so loop detection doesn't waste requests
            if (!visited.add(currentUrl)) {
                System.out.println("Already visited (loop detected): " + currentUrl);
                break;
            }

            System.out.println("\n--- Page " + pageNum + " ---");
            System.out.println("Fetching: " + currentUrl);

            // 1) Fetch HTML via ScrapingBee (not directly from the target site)
            String html = fetchHtmlViaScrapingBee(client, apiKey, currentUrl);

            // 2) Parse HTML with base URL so absUrl() works
            Document doc = Jsoup.parse(html, currentUrl);

            // 3) Extract products on this page
            List<Product> pageProducts = extractBooks(doc);
            System.out.println("Products found: " + pageProducts.size());
            allProducts.addAll(pageProducts);

            // Debug print a few items per page
            int preview = Math.min(pageProducts.size(), 3);
            for (int i = 0; i < preview; i++) {
                Product p = pageProducts.get(i);
                System.out.println((i + 1) + ") " + p.title() + " | " + p.price() + " | " + p.url());
            }

            // 4) Find the next page link (books site uses: ul.pager li.next a)
            Element nextLink = doc.selectFirst("ul.pager li.next a");
            if (nextLink == null) {
                System.out.println("No next link. Stopping.");
                break;
            }

            // 5) Convert relative href into absolute URL
            String nextUrl = nextLink.absUrl("href");
            if (nextUrl == null || nextUrl.isBlank()) {
                System.out.println("Next link exists, but URL is empty. Stopping.");
                break;
            }

            // Polite crawling: if you expand this later, add delays
            // Thread.sleep(500);

            currentUrl = nextUrl;
        }

        System.out.println("\n=== Done ===");
        System.out.println("Total pages scraped (max): " + maxPages);
        System.out.println("Total products scraped: " + allProducts.size());

        // Print a final preview of the first 10 total
        int limit = Math.min(allProducts.size(), 10);
        System.out.println("\nTop " + limit + " total products:");
        for (int i = 0; i < limit; i++) {
            Product p = allProducts.get(i);
            System.out.println((i + 1) + ") " + p.title());
            System.out.println("   Price: " + p.price());
            System.out.println("   URL:   " + p.url());
        }
    }

    private static String fetchHtmlViaScrapingBee(HttpClient client, String apiKey, String targetUrl) throws Exception {
        String endpoint = "https://app.scrapingbee.com/api/v1/";

        String apiUrl = endpoint +
                "?api_key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8) +
                "&url=" + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8) +
                "&render_js=false";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(apiUrl))
                .timeout(Duration.ofSeconds(60))
                .header("User-Agent", "java-scraper-demo/1.0")
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        System.out.println("HTTP status: " + response.statusCode());

        if (response.statusCode() != 200) {
            String body = response.body();
            if (body == null) body = "";
            System.out.println("Non-200 response. Body preview:");
            System.out.println(body.substring(0, Math.min(body.length(), 600)));
            throw new RuntimeException("ScrapingBee request failed: " + response.statusCode());
        }

        String body = response.body();
        return body == null ? "" : body;
    }

    // Extract books from the current page DOM
    private static List<Product> extractBooks(Document doc) {
        List<Product> products = new ArrayList<>();

        // Cards live here on the books site
        Elements cards = doc.select("ol.row article.product_pod");

        for (Element card : cards) {
            Element linkEl = card.selectFirst("h3 a");
            if (linkEl == null) continue;

            String title = linkEl.attr("title").trim();

            Element priceEl = card.selectFirst(".price_color");
            String price = priceEl != null ? priceEl.text().trim() : "";

            // absUrl works because we parsed with a base URL
            String url = linkEl.absUrl("href").trim();

            products.add(new Product(title, price, url));
        }

        return products;
    }
}

If you want this to scrape only 2 pages, set maxPages = 2. If you want "crawl until the end", keep a max cap anyway to be polite.

Parallel web scraping in Java

Once pagination works, speed is the next pain point. Parallel scraping in Java is how you speed things up without turning your scraper into a chaos machine.

Here's the goal:

  • use a fixed thread pool
  • submit tasks (each task fetches via ScrapingBee, then parses with Jsoup)
  • wait for completion
  • shut down cleanly
  • store results in thread-safe collections

This pairs well with this guide for background and ideas: An automatic bill downloader in Java.

Using ExecutorService for concurrent page fetching

ExecutorService is a built-in Java tool for running work in a pool of threads. You submit jobs to it, and it runs them in parallel.

In our case, each job does the same thing:

  • call ScrapingBee
  • parse HTML with Jsoup
  • extract items
  • store results

Here's the basic pattern:

ExecutorService pool = Executors.newFixedThreadPool(5);
List<Future<?>> futures = new ArrayList<>();

for (String url : urlsToFetch) {
    futures.add(pool.submit(() -> {
        String html = fetchHtmlViaScrapingBee(client, apiKey, url);
        Document doc = Jsoup.parse(html, url);
        List<Product> items = extractBooks(doc);
        results.addAll(items);
    }));
}

// wait for all tasks to finish
for (Future<?> f : futures) {
    f.get();
}

pool.shutdown();

This is the core loop. Everything else is just guard rails.

Thread-safe collections: CopyOnWriteArrayList and ConcurrentSkipListSet

In concurrent code, normal collections can bite you:

  • ArrayList is not safe when many threads write at the same time.
  • HashSet is not safe when many threads add/check at the same time.

So for a simple tutorial setup, use thread-safe options:

  • CopyOnWriteArrayList for scraped items
  • ConcurrentSkipListSet for visited URLs

Practical rule: if a URL is already in the set, skip it. If it's new, scrape it.

TL;DR:

CopyOnWriteArrayList<Product> results = new CopyOnWriteArrayList<>();
ConcurrentSkipListSet<String> visited = new ConcurrentSkipListSet<>();

if (!visited.add(url)) {
    return; // already seen
}

results.addAll(items); // safe from multiple threads

When this combo makes sense:

  • small demos
  • low volume scraping (a few pages, a few hundred items)
  • you want code that's easy to read and hard to break

Important nuance:

  • CopyOnWriteArrayList is optimized for "many reads, few writes".
  • Each write can copy the underlying array, which becomes expensive if you add lots of items.

So it's great for a tutorial and small workloads, but for bigger runs you usually switch to other concurrent collections.

Thread-safe collections (scale better): ConcurrentLinkedQueue and ConcurrentHashMap

If you move from "tutorial demo" to "scrape a lot of pages", two changes usually improve performance without making the code complicated:

  • ConcurrentLinkedQueue<Product> for results
  • ConcurrentHashMap.newKeySet() for visited URLs

Why these are often a better fit for scrapers:

  • ConcurrentLinkedQueue is built for lots of concurrent add() operations without copying arrays.
  • ConcurrentHashMap.newKeySet() gives you a concurrent set with fast membership checks, without maintaining sort order.

Example:

Queue<Product> results = new ConcurrentLinkedQueue<>();
Set<String> visited = ConcurrentHashMap.newKeySet();

if (!visited.add(url)) {
    return; // already seen
}

results.addAll(items); // many threads can add safely

A quick mental model:

  • Use CopyOnWriteArrayList when you mostly read and rarely write.
  • Use ConcurrentLinkedQueue when you mostly write and read later (common in scraping).
  • Use ConcurrentSkipListSet when you need ordered keys.
  • Use ConcurrentHashMap.newKeySet() when you just need "seen / not seen" tracking.

Limiting thread pool size to avoid server overload

More threads does not always mean faster. You hit limits instead:

  • target sites throttle you
  • ScrapingBee can rate limit if you spam too hard
  • your own machine can bottleneck on CPU or memory

For small demos, 5 to 10 threads is a sweet spot. Watch response codes, then scale slowly. ScrapingBee handles lots of anti-bot and proxy details, but you still have responsibilities:

  • respect robots.txt where it applies
  • respect local laws and site terms
  • don't hammer random sites
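As a simple guard rail, keep the pool small and add a short, jittered pause inside each task. Here's a minimal sketch that reuses the client, apiKey, and fetchHtmlViaScrapingBee pieces from the examples in this post; the delay values are just illustrative:

ExecutorService pool = Executors.newFixedThreadPool(5); // small pool = gentle, predictable load

for (String url : pageUrls) {
    pool.submit(() -> {
        try {
            // Jittered pause so requests don't fire in lockstep
            Thread.sleep(200 + ThreadLocalRandom.current().nextInt(300));

            String html = fetchHtmlViaScrapingBee(client, apiKey, url);
            // ... parse with Jsoup and store results, as in the full example below ...
        } catch (Exception e) {
            System.out.println("Failed: " + url + " | " + e.getMessage());
        }
    });
}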

Full example: Parallel scraping pages

This builds directly on the previous paginator idea, but runs page fetches in parallel.

What it does:

  • first, it discovers the first 3 page URLs by following "next" links (one by one)
  • then, it fetches those pages concurrently via ScrapingBee
  • it parses each page with Jsoup and stores results thread-safely

package org.example;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Parallel web scraping Java demo:
 * - Discovery follows "next" and CACHES HTML (so we don't fetch those pages twice)
 * - Parallel phase reuses cached HTML and only fetches if cache-miss happens
 */
public class Main {

    public record Product(String title, String price, String url) {}

    /**
     * Discovery returns both:
     * 1) the list of page URLs we want
     * 2) htmlCache: HTML that was already fetched during discovery
     *
     * Key idea: discovery MUST fetch HTML to find "next", so keep it.
     */
    private record DiscoveryResult(List<String> pageUrls, Map<String, String> htmlCache) {}

    public static void main(String[] args) throws Exception {
        String apiKey = System.getenv("SCRAPINGBEE_API_KEY");
        if (apiKey == null || apiKey.isBlank()) {
            throw new IllegalStateException("Missing SCRAPINGBEE_API_KEY env var.");
        }

        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(20))
                .build();

        // 1) Discover which pages to scrape (sequential, simple)
        int maxPages = 3;
        DiscoveryResult discovery = discoverPagesAndCacheHtml(client, apiKey, "https://books.toscrape.com/", maxPages);

        List<String> pageUrls = discovery.pageUrls();
        Map<String, String> htmlCache = discovery.htmlCache();

        System.out.println("Discovered pages:");
        for (String u : pageUrls) System.out.println(" - " + u);
        System.out.println("Cached HTML pages from discovery: " + htmlCache.size());

        // 2) Parallel scrape phase
        // Use a write-friendly concurrent collection (CopyOnWriteArrayList is expensive for many writes)
        Queue<Product> allProducts = new ConcurrentLinkedQueue<>();

        // A simple concurrent visited set (we don't need ordering)
        Set<String> visitedPages = ConcurrentHashMap.newKeySet();

        // Debug: count how many extra fetches happened AFTER discovery
        AtomicInteger cacheMissFetches = new AtomicInteger(0);

        int threads = 5;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<>();

        for (String pageUrl : pageUrls) {
            futures.add(pool.submit(() -> {
                try {
                    // Avoid duplicate requests (defensive)
                    if (!visitedPages.add(pageUrl)) {
                        System.out.println("Skip duplicate: " + pageUrl);
                        return;
                    }

                    // IMPORTANT: reuse HTML if we already downloaded it during discovery
                    String html = htmlCache.get(pageUrl);
                    if (html == null) {
                        // Cache-miss: only then call ScrapingBee
                        html = fetchHtmlViaScrapingBee(client, apiKey, pageUrl);
                        cacheMissFetches.incrementAndGet();
                    }

                    Document doc = Jsoup.parse(html, pageUrl);
                    List<Product> items = extractBooks(doc);

                    allProducts.addAll(items);

                    System.out.println("Done: " + pageUrl + " | items: " + items.size());
                } catch (Exception e) {
                    System.out.println("Failed: " + pageUrl + " | " + e.getMessage());
                }
            }));
        }

        // Wait for all tasks
        for (Future<?> f : futures) f.get();

        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);

        System.out.println("\n=== Done ===");
        System.out.println("Total pages scraped: " + visitedPages.size());
        System.out.println("Total products scraped: " + allProducts.size());
        System.out.println("Extra ScrapingBee fetches after discovery (cache misses): " + cacheMissFetches.get());

        int limit = Math.min(allProducts.size(), 10);
        System.out.println("\n" + limit + " products:");

        int i = 0;
        for (Product p : allProducts) {
            if (i >= limit) break;
            System.out.println((i + 1) + ") " + p.title());
            System.out.println("   Price: " + p.price());
            System.out.println("   URL:   " + p.url());
            i++;
        }
    }

    /**
     * Discover first N pages by following pagination AND cache HTML as we go.
     * books.toscrape.com uses: ul.pager li.next a
     */
    private static DiscoveryResult discoverPagesAndCacheHtml(
            HttpClient client,
            String apiKey,
            String startUrl,
            int maxPages
    ) throws Exception {

        List<String> urls = new ArrayList<>();
        Map<String, String> htmlCache = new ConcurrentHashMap<>();

        // O(1) loop detection (instead of urls.contains(...) which is O(n))
        Set<String> seen = new HashSet<>();

        String currentUrl = startUrl;

        for (int i = 1; i <= maxPages; i++) {
            // Defensive: stop if pagination loops back to a page we've already seen
            if (!seen.add(currentUrl)) break;

            // Add URL we're about to fetch
            urls.add(currentUrl);

            // Fetch once (required to discover next link)
            String html = fetchHtmlViaScrapingBee(client, apiKey, currentUrl);

            // Cache HTML for reuse in parallel phase
            htmlCache.put(currentUrl, html);

            Document doc = Jsoup.parse(html, currentUrl);

            Element nextLink = doc.selectFirst("ul.pager li.next a");
            if (nextLink == null) break;

            String nextUrl = nextLink.absUrl("href");
            if (nextUrl == null || nextUrl.isBlank()) break;

            // We don't need urls.contains(nextUrl) anymore:
            // seen.add(currentUrl) at the top of the loop handles loop detection in O(1).
            currentUrl = nextUrl;
        }

        return new DiscoveryResult(urls, htmlCache);
    }

    /**
     * Fetch HTML through ScrapingBee's API endpoint.
     * (no retries/backoff here as this is a tutorial after all.)
     */
    private static String fetchHtmlViaScrapingBee(HttpClient client, String apiKey, String targetUrl) throws Exception {
        String endpoint = "https://app.scrapingbee.com/api/v1/";

        String apiUrl = endpoint +
                "?api_key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8) +
                "&url=" + URLEncoder.encode(targetUrl, StandardCharsets.UTF_8) +
                "&render_js=false";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(apiUrl))
                .timeout(Duration.ofSeconds(60))
                .header("User-Agent", "java-scraper-demo/1.0")
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() != 200) {
            String body = response.body();
            if (body == null) body = "";
            System.out.println("ScrapingBee HTTP " + response.statusCode() + " for " + targetUrl);
            System.out.println(body.substring(0, Math.min(body.length(), 600)));
            throw new RuntimeException("ScrapingBee request failed: " + response.statusCode());
        }

        String body = response.body();
        return body == null ? "" : body;
    }

    /**
     * Extract book data from the DOM.
     * Books are inside: ol.row > li > article.product_pod
     */
    private static List<Product> extractBooks(Document doc) {
        List<Product> products = new ArrayList<>();

        Elements cards = doc.select("ol.row article.product_pod");

        for (Element card : cards) {
            Element linkEl = card.selectFirst("h3 a");
            if (linkEl == null) continue;

            String title = linkEl.attr("title").trim();

            Element priceEl = card.selectFirst(".price_color");
            String price = priceEl != null ? priceEl.text().trim() : "";

            String url = linkEl.absUrl("href").trim();

            products.add(new Product(title, price, url));
        }

        return products;
    }
}

Note on retries and error handling

Real scraping isn't "status == 200 or die". In production you'll see timeouts, random 5xx, and 429 rate limits. If you don't handle those, your scraper will be flaky even on "easy" sites.

Here's the core idea.

What to retry (and what not to)

Retry (usually safe):

  • 429 (rate limited) → wait and retry (respect Retry-After if present)
  • 500 / 502 / 503 / 504 → temporary server/proxy issues
  • timeouts / connection resets → network hiccups

Don't retry (usually):

  • 401 / 403 → auth / blocking / config problem (retrying won't fix it)
  • 404 → page is gone (unless you expect it to appear later)
  • 400 → your request is broken

Backoff: Don't spam retries

When you retry, don't hammer. Use exponential backoff plus a bit of randomness (jitter):

  • attempt 1: ~500ms
  • attempt 2: ~1s
  • attempt 3: ~2s
  • attempt 4: ~4s

(cap it, and stop after a few attempts)

If you get 429 and the response includes Retry-After, prefer that over your own timer.

Fail "usefully"

When a request fails, you want enough context to debug without dumping giant HTML:

Log at least:

  • URL
  • HTTP status code (or exception type)
  • a short body preview (first ~200–600 chars)
  • attempt number

And keep track of totals:

  • pages attempted / succeeded / failed
  • consider failing the whole run if failure rate is too high (ex: >20%)

Here's the minimal pattern (pseudo-ish), with a Java sketch right after this list:

  • wrap your fetch in a loop maxAttempts
  • retry only on "retryable" status/exception
  • sleep with backoff between tries
  • on final failure: throw or mark page as failed and continue (depending on your pipeline)
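And here's one way that pattern can look in Java. Treat it as a sketch, not a drop-in: fetchOnce(url) is a placeholder for whatever single-attempt method you already have (like fetchHtmlViaScrapingBee above, but returning the raw HttpResponse), and the status codes and delays follow the rules from this section:

// Sketch: retry wrapper with exponential backoff + jitter.
// fetchOnce(url) is a hypothetical single-attempt fetch returning HttpResponse<String>.
static String fetchWithRetry(String url, int maxAttempts) throws Exception {
    long delayMs = 500; // attempt 1 waits ~500ms, then ~1s, ~2s, ~4s...

    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            HttpResponse<String> res = fetchOnce(url);
            int status = res.statusCode();

            if (status == 200) return res.body();

            boolean retryable = status == 429 || status == 500 || status == 502
                    || status == 503 || status == 504;
            if (!retryable || attempt == maxAttempts) {
                String body = res.body() == null ? "" : res.body();
                throw new RuntimeException("HTTP " + status + " for " + url
                        + " (attempt " + attempt + "): "
                        + body.substring(0, Math.min(body.length(), 200)));
            }

            // Backoff + jitter, but prefer a numeric Retry-After header (seconds) if present
            long waitMs = delayMs + ThreadLocalRandom.current().nextInt(250);
            String retryAfter = res.headers().firstValue("Retry-After").orElse("");
            if (retryAfter.matches("\\d+")) waitMs = Long.parseLong(retryAfter) * 1000;

            System.out.println("Retrying " + url + " in " + waitMs + "ms (HTTP " + status + ", attempt " + attempt + ")");
            Thread.sleep(waitMs);
        } catch (java.io.IOException e) {
            // Timeouts and connection resets count as retryable network hiccups
            if (attempt == maxAttempts) throw e;
            Thread.sleep(delayMs + ThreadLocalRandom.current().nextInt(250));
        }

        delayMs = Math.min(delayMs * 2, 8000); // cap the backoff
    }

    throw new IllegalStateException("unreachable");
}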

Scraping dynamic content with Selenium WebDriver

Selenium still matters in web scraping Java projects, but not as your default scraper. Use Selenium when you need:

  • local debugging (see what the page actually renders)
  • tricky interactions (clicks, scrolling, weird UI states)
  • generating "what does the final HTML look like?" so you can later scrape it via ScrapingBee using render_js or a js_scenario

For production scraping at scale, ScrapingBee is usually simpler and more stable. Selenium is heavy, it breaks more often, and it's usually slower.

Check out this tutorial to learn about HtmlUnit: Getting started with HtmlUnit.

Installing selenium-java via Maven or Gradle

Selenium WebDriver lets Java control a real browser, so let's install it.

Maven

<dependency>
  <groupId>org.seleniumhq.selenium</groupId>
  <artifactId>selenium-java</artifactId>
  <version>4.39.0</version>
</dependency>

Gradle

dependencies {
  implementation 'org.seleniumhq.selenium:selenium-java:4.39.0'
}

Running Chrome in headless mode with ChromeOptions

This is the "hello browser" snippet. It's useful to understand what ScrapingBee's render_js is doing for you on the API side.

package org.example;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

/**
 * Selenium headless demo.
 *
 * Purpose:
 * - show how a real browser renders the page
 * - help debug selectors and JS behavior locally
 *
 * This is NOT meant for production scraping at scale.
 * In production, ScrapingBee with render_js is simpler and more stable.
 */
public class Main {

    public static void main(String[] args) {
        // Configure Chrome to run headless
        ChromeOptions options = new ChromeOptions();
        // In Docker/Linux CI you may need --no-sandbox and --disable-dev-shm-usage.
        options.addArguments("--headless=new");
        options.addArguments("--window-size=1280,800");

        WebDriver driver = new ChromeDriver(options);
        // If ChromeDriver can't be found/downloaded automatically,
        // install Chrome + let Selenium Manager handle it, or provide a driver path / use WebDriverManager.

        try {
            // Load the page like a real browser
            driver.get("https://books.toscrape.com/");

            // Basic debug output
            System.out.println("Page title: " + driver.getTitle());
            System.out.println("Current URL: " + driver.getCurrentUrl());
        } finally {
            // Always shut down the browser
            driver.quit();
        }
    }
}

Output:

Page title: All products | Books to Scrape - Sandbox
Current URL: https://books.toscrape.com/

Learn more at: Introduction to Chrome headless with Java.

Interacting with JavaScript-rendered elements

On JS-heavy sites, the DOM you want is not there instantly. You usually need to:

  • wait for elements to appear
  • read text after render
  • scroll to trigger lazy loading

Here's an example:

package org.example;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.time.Duration;

/**
 * Selenium wait demo.
 *
 * In production scraping, this kind of waiting is usually
 * replaced by ScrapingBee with render_js=true.
 */
public class Main {

    public static void main(String[] args) {
        // Run Chrome in headless mode (no visible browser window)
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");
        options.addArguments("--window-size=1280,800");

        WebDriver driver = new ChromeDriver(options);

        try {
            // Load the page like a real browser
            driver.get("https://books.toscrape.com/");

            // Explicit wait: pause until a specific element exists in the DOM
            // This is common for JS-heavy pages
            WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(10));

            WebElement firstCard = wait.until(
                    ExpectedConditions.presenceOfElementLocated(
                            By.cssSelector("article.product_pod")
                    )
            );

            // Debug output: show that the element is really there
            System.out.println("First card text preview:");
            System.out.println(firstCard.getText());
        } finally {
            // Always close the browser to free system resources
            driver.quit();
        }
    }
}
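The wait demo covers the first two bullets. For lazy loading (the third one), you typically scroll with JavaScript and then re-read the DOM. A minimal sketch of that, reusing the same driver (books.toscrape.com doesn't actually lazy-load, so treat this as the general pattern; it also needs import org.openqa.selenium.JavascriptExecutor;):

// Scroll to the bottom of the page to trigger lazy-loaded content
JavascriptExecutor js = (JavascriptExecutor) driver;
js.executeScript("window.scrollTo(0, document.body.scrollHeight);");

// Give the page a moment to load new items (in real code, prefer an explicit wait)
Thread.sleep(1000);

// Re-read the DOM after scrolling
int cardsAfterScroll = driver.findElements(By.cssSelector("article.product_pod")).size();
System.out.println("Cards after scroll: " + cardsAfterScroll);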

Production idea: Do the same thing via ScrapingBee

Instead of running ChromeDriver yourself, you can ask ScrapingBee to render JavaScript:

  • render_js=true makes the page render like a browser
  • extract_rules lets you pull fields with CSS selectors

Conceptually, it looks like this (pseudo-ish JSON payload style):

{
  "products": {
    "selector": "article.product_pod",
    "type": "list",
    "output": {
      "title": {
        "selector": "h3 a",
        "output": "text"
      },
      "price": {
        "selector": ".price_color",
        "output": "text"
      },
      "url": {
        "selector": "h3 a",
        "output": "@href"
      }
    }
  }
}

That's the vibe: browser rendering on the API side, selectors on your side. Way fewer moving parts than managing ChromeDriver on a server.
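In Java, that boils down to two extra query parameters on the same API call you've been making: render_js=true and a URL-encoded extract_rules JSON string. A rough sketch, reusing the client and apiKey from the earlier examples (the response body comes back as JSON matching the rules, ready for a JSON library instead of Jsoup):

String extractRules = """
        {"products": {"selector": "article.product_pod", "type": "list",
          "output": {"title": {"selector": "h3 a", "output": "text"},
                     "price": {"selector": ".price_color", "output": "text"},
                     "url": {"selector": "h3 a", "output": "@href"}}}}
        """;

String apiUrl = "https://app.scrapingbee.com/api/v1/?" +
        "api_key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8) +
        "&url=" + URLEncoder.encode("https://books.toscrape.com/", StandardCharsets.UTF_8) +
        "&render_js=true" +
        "&extract_rules=" + URLEncoder.encode(extractRules, StandardCharsets.UTF_8);

HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(apiUrl))
        .timeout(Duration.ofSeconds(90)) // JS rendering takes longer than a static fetch
        .GET()
        .build();

HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
System.out.println(response.body()); // JSON keyed by "products"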

Clicking pagination buttons with WebDriver.findElement()

Sometimes pagination is a JS button, not a normal link. Selenium helps you learn the flow.

Here's an example:

package org.example;

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

import java.util.List;

/**
 * Selenium pagination demo.
 * This example is for learning and debugging only.
 * In production scraping, pagination is usually replaced
 * by ScrapingBee + render_js + CSS selectors.
 */
public class Main {

    public static void main(String[] args) {
        // Run Chrome in headless mode
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");
        options.addArguments("--window-size=1280,800");

        WebDriver driver = new ChromeDriver(options);

        try {
            // Open the first page
            driver.get("https://books.toscrape.com/");

            // Click through a few pages to understand the flow
            for (int page = 1; page <= 3; page++) {
                // Read currently visible book cards
                List<WebElement> cards = driver.findElements(
                        By.cssSelector("article.product_pod")
                );

                System.out.println("Page " + page + " | cards found: " + cards.size());

                // Find the "Next" button
                List<WebElement> nextButtons = driver.findElements(
                        By.cssSelector("ul.pager li.next a")
                );

                // Stop if there is no next page
                if (nextButtons.isEmpty()) {
                    System.out.println("No next page. Stopping.");
                    break;
                }

                // Click "Next" to trigger navigation
                nextButtons.get(0).click();

                // In real projects, you would usually wait here
                // for the next page to load before continuing
            }
        } finally {
            // Always close the browser
            driver.quit();
        }
    }
}

Once you understand what "next" does, you usually replace this with ScrapingBee:

  • use render_js=true if the site needs JS
  • keep using CSS selectors to extract
  • avoid running a browser farm yourself

Virtual threads: The "IO cheat code" for scrapers

Virtual threads are super lightweight threads managed by the JVM. The point is simple:

  • A platform thread maps 1:1 to an OS thread, so it's heavier and you can't spin up tens of thousands of them.
  • A virtual thread is cheap, so you can run tons of IO-bound tasks (HTTP calls) without drowning your JVM in OS threads.

They're perfect for scraping because scraping is mostly: wait on network → parse → wait → parse. So:

  • You write code that looks blocking (client.send(...)), but it scales like async.
  • You still must cap concurrency (don't spam targets / don't hit ScrapingBee rate limits / don't DDoS yourself).

Example: Fetch a bunch of pages with virtual threads + a hard cap

This example:

  • uses newVirtualThreadPerTaskExecutor() (1 virtual thread per task)
  • uses a Semaphore to cap in-flight requests
  • fetches HTML (placeholder fetcher)
  • prints result sizes

package org.example;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

public class Main {

    // One shared client is fine (HttpClient is thread-safe)
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(20))
            .build();

    public static void main(String[] args) throws Exception {
        List<String> urls = List.of(
                "https://books.toscrape.com/",
                "https://books.toscrape.com/catalogue/page-2.html",
                "https://books.toscrape.com/catalogue/page-3.html"
        );

        // Cap concurrent in-flight HTTP calls (important even with virtual threads)
        int maxInFlight = 5;
        Semaphore cap = new Semaphore(maxInFlight);

        AtomicInteger ok = new AtomicInteger(0);
        AtomicInteger fail = new AtomicInteger(0);

        // Virtual threads: one virtual thread per task (cheap)
        try (ExecutorService vt = Executors.newVirtualThreadPerTaskExecutor()) {

            List<Callable<Void>> tasks = urls.stream()
                    .map(url -> (Callable<Void>) () -> {
                        boolean acquired = false;
                        try {
                            cap.acquire();
                            acquired = true;

                            String html = fetch(url);
                            ok.incrementAndGet();
                            System.out.println("OK  " + url + " | bytes=" + html.length());
                        } catch (Exception e) {
                            fail.incrementAndGet();
                            System.out.println("FAIL " + url + " | " + e.getMessage());
                        } finally {
                            if (acquired) cap.release();
                        }
                        return null; // Callable<Void> contract
                    })
                    .toList();

            // Blocks until all tasks are done
            vt.invokeAll(tasks);
        }

        System.out.println("\nDone. OK=" + ok.get() + " FAIL=" + fail.get());
    }

    private static String fetch(String url) throws Exception {
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .header("User-Agent", "vt-demo/1.0")
                .GET()
                .build();

        HttpResponse<String> res = CLIENT.send(req, HttpResponse.BodyHandlers.ofString());

        if (res.statusCode() != 200) {
            String body = res.body();
            if (body == null) body = "";
            int previewLen = Math.min(body.length(), 200);
            throw new RuntimeException("HTTP " + res.statusCode() + " | " + body.substring(0, previewLen));
        }

        String body = res.body();
        return body == null ? "" : body;
    }
}

Turn your Java scraper into a reliable data pipeline

At this point, you've seen the full picture.

  • Java 21 gives you a modern, fast runtime.
  • Jsoup gives you clean, predictable HTML parsing.
  • Parallel crawling speeds things up without losing control.
  • Selenium helps you understand JavaScript-heavy flows when you need them.
  • ScrapingBee takes care of the ugly parts: proxies, CAPTCHAs, retries, browser quirks.

Put together, this is not "just a scraper". It's the foundation of a reliable data pipeline. The key mindset shift is simple: Java owns the logic. ScrapingBee owns the web. You write Java code that:

  • decides what pages to fetch
  • parses data into real objects
  • runs in parallel when it makes sense
  • stays readable and testable

ScrapingBee sits underneath and:

  • fetches pages safely
  • renders JavaScript when needed
  • keeps you away from IP bans and CAPTCHA hell
  • lets you scale without running a browser farm

If you haven't already, the next step is easy. Sign up for a free ScrapingBee account and grab your API key. You get free credits, which is plenty to test everything you saw in this guide. That's how you move from "it works on my machine" to a scraper you can actually rely on.

Conclusion

You don't need a massive framework or a fragile browser setup to scrape the web with Java. With Java 21, Jsoup, and a few clean concurrency patterns, you can build scrapers that are fast, readable, and easy to maintain.

Add ScrapingBee on top, and you stop worrying about proxies, CAPTCHAs, and headless browser maintenance. You focus on data, not infrastructure. That's the difference between a quick script and a scraper you can actually trust.


Frequently asked questions (FAQs)

How do I start a web scraping Java project if I am a beginner?

Start simple. Use Java 21, one build tool (Maven or Gradle), and Jsoup for parsing HTML. First scrape a static site locally, then add ScrapingBee for fetching pages. This way you learn selectors, data modeling, and flow before dealing with scale.

Why should I use ScrapingBee instead of doing all HTTP requests directly in Java?

Direct HTTP requests work for demos, but they break fast in the real world. ScrapingBee handles proxies, CAPTCHAs, retries, and JavaScript rendering for you. Your Java code stays clean and focused on parsing and logic, not constant anti-bot fixes.

Can I combine ScrapingBee with Selenium in the same Java project?

Yes, and it's a common pattern. Use Selenium locally to understand page behavior, selectors, and JavaScript flows. Once you know what you need, switch production scraping to ScrapingBee with render_js or extract rules. You get stability without running browsers in production.

How do I avoid blocking and respect websites when scraping with Java?

Limit request rates, crawl only what you need, and avoid infinite pagination loops. Keep thread pools small and monitor response codes. Even with ScrapingBee handling infrastructure, you're still responsible for polite usage, respecting robots.txt where applicable, and following local laws.

Kevin Sahin

Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.