Introduction to Web Scraping With Java

31 March 2022 (updated) | 12 min read

Is there a website from which you'd like to regularly obtain data in a structured fashion, but which does not yet offer a standardised API, such as a JSON REST interface? Don't fret, web scraping comes to the rescue.

Welcome to the world of web scraping

Web scraping, or web crawling, refers to the process of fetching and extracting arbitrary data from a website. This involves downloading the site's HTML code, parsing that HTML code, and extracting the desired data from it.

If the aforementioned REST API is not available, scraping is typically the only solution when it comes to collecting information from a site. It is a common business practice for obtaining data in an automated fashion and can be applied to almost any subject of your choice: for example, analysing changes in a competitor's pricing scheme, aggregating the latest stories from different news agencies, or collecting address information for your latest marketing campaign.

Because a scraper essentially does what a standard web browser does, there are barely any limits to what information you can collect; the trickiest part is usually extracting information from multimedia content (i.e. images, audio, video).

💡 Check out the advanced data extraction features of ScrapingBee and how they can help you to handle even more complex site setups.

In this post, we will walk you through how to set up a basic web crawler in Java: fetching a site, parsing and extracting the data, and storing everything in a JSON structure.

Prerequisites

As we are going to use Java for our demo project, please make sure you have the following prerequisites in place, before proceeding.

  • The Java 8 SDK
  • A suitable Java IDE for development (e.g. Eclipse)
  • If not part of your IDE, Maven for dependency management

Of course, having a basic understanding of Java and the concept of XPath will also speed up things.

Java dependencies

Please make sure you have added HtmlUnit as a dependency to your pom.xml file

<dependency>
	<groupId>net.sourceforge.htmlunit</groupId>
	<artifactId>htmlunit</artifactId>
	<version>2.60.0</version>
</dependency>

as well as Jackson's FasterXML

<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.13.2.1</version>
</dependency>

💡 If you are using Eclipse, it is recommended to increase the maximum output length of the detail pane (shown when you click an entry in the Variables tab), so that you see the entire HTML code of the page.

Detail Pane Eclipse IDE

Let's scrape Craigslist

For our example here, we are going to focus on Craigslist and would like to get a list of all classified ads in New York selling an iPhone 13.

As Craigslist does not offer an API, our only option is to go the scraping path and extract the data straight from the site. For that, we will fetch the site, collect the names, prices, and images of all items and export it all as a JSON structure.

Finding the right search URL

First, let's take a look at what happens when you search for something on Craigslist.

Craigslist index page

You'll be immediately redirected to the search page with all the found products. For the purpose of this example, the URL in the address bar is what interests us most.

https://newyork.craigslist.org/search/moa?query=iphone%2013

At this point, we have established what the search URL for this particular query is, and what parameters (i.e. query) it requires.

You can now open your favourite IDE, it is time to code.

Fetching the page

To make a request to a site, you'll first need an HTTP client to send that request. It just so happens that HtmlUnit comes with a default class for that task, appropriately called WebClient.

There are quite a few parameters you can tweak to customise its behaviour (e.g. proxy settings, CSS support, and more), but for our example we will use the bare configuration without CSS and JavaScript support.

// Define the search term
String searchQuery = "iphone 13";

// Instantiate the client
WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);

// Set up the URL with the search term and send the request
String searchUrl = "https://newyork.craigslist.org/search/moa?query=" + URLEncoder.encode(searchQuery, "UTF-8");
HtmlPage page = client.getPage(searchUrl);

At this point, we have the site's content in the page variable and could access the entire document with the asXml() method. However, we are more interested in particular data items of the HTML document.

For this, we are checking out the site's structure in the browser, using the Inspector feature of the developer tools (F12).

Craigslist's HTML

Based on this, we now know that all items will be <li> tags beneath a <ul> container tag with the ID search-results. Furthermore, each <li> tag will have the HTML class result-row assigned.

Extracting data

With this knowledge, we can now use XPath to access the returned products and their item properties. HtmlUnit provides a number of convenience methods for this purpose (e.g. getHtmlElementById, getFirstByXPath, getByXPath), which allow you to use an XPath expression to precisely fetch data from the document. Please refer to the JavaDoc of HtmlUnit for more information on the supported methods.
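If you'd like to experiment with such expressions before wiring them into HtmlUnit, you can run the very same XPath queries against a simplified stand-in of Craigslist's result markup using Java's built-in javax.xml.xpath package. This is only a minimal sketch: the class name, method names, and the sample markup below are ours, not Craigslist's.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathDemo {

    // A simplified, hypothetical stand-in for the Craigslist result markup
    static final String SAMPLE =
        "<ul id=\"search-results\">"
      + "<li class=\"result-row\"><p class=\"result-info\">"
      + "<a href=\"/item/1\">iPhone 13 128GB</a></p></li>"
      + "<li class=\"result-row\"><p class=\"result-info\">"
      + "<a href=\"/item/2\">iPhone 13 Pro</a></p></li>"
      + "</ul>";

    static Document parse(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(markup)));
    }

    // Counterpart of page.getByXPath("//li[@class='result-row']")
    public static int countRows(String markup) throws Exception {
        NodeList rows = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate("//li[@class='result-row']", parse(markup), XPathConstants.NODESET);
        return rows.getLength();
    }

    // Counterpart of getFirstByXPath(".//p[@class='result-info']/a")
    public static String firstTitle(String markup) throws Exception {
        return XPathFactory.newInstance().newXPath()
            .evaluate("string(//li[@class='result-row']//p[@class='result-info']/a)", parse(markup));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countRows(SAMPLE));   // 2
        System.out.println(firstTitle(SAMPLE));  // iPhone 13 128GB
    }
}
```

The expressions are identical to the ones we pass to HtmlUnit below; only the evaluation API differs.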

Let's go through the following code step-by-step:

  1. We are fetching all aforementioned <li> tags with the class result-row and store them in the variable items.
  2. We are iterating over items and store each entry as item.
  3. For each item, we are going to look
    1. for the product details, under an <a> tag (contained within a <p> tag with the class result-info)
    2. for the product price, under a <span> tag (contained within an <a> tag) with the class result-price
  4. Once we have the details and the price, we are printing them on the screen.

// Retrieve all <li> elements
List<HtmlElement> items = (List<HtmlElement>) page.getByXPath("//li[@class='result-row']");
if (!items.isEmpty()) {
  // Iterate over all elements
  for (HtmlElement htmlItem : items) {

    // Get the details from <p class="result-info"><a href=""></a></p>
    HtmlAnchor itemAnchor = ((HtmlAnchor) htmlItem.getFirstByXPath(".//p[@class='result-info']/a"));

    // Get the price from <a><span class="result-price"></span></a>
    HtmlElement spanPrice = ((HtmlElement) htmlItem.getFirstByXPath(".//a/span[@class='result-price']"));

    String itemName = itemAnchor.asText();
    String itemUrl = itemAnchor.getHrefAttribute();

    // It is possible that an item doesn't have any price
    String itemPrice = spanPrice == null ? "0.0" : spanPrice.asText();

    System.out.println(String.format("Name : %s Price : %s Url : %s", itemName, itemPrice, itemUrl));
  }
} else {
  System.out.println("No items found!");
}

Voilà, we have parsed the whole page and managed to extract the individual product items.

💡 We released a new feature that makes this whole process way simpler. You can now extract data from HTML with one simple API call. Please check out the documentation here for more information.

Converting to JSON

While the previous example provided an excellent overview of how to quickly scrape a website, we can take this a step further and convert the data into a structured, machine-readable format, such as JSON.

For that, we just need to make small changes to our code.

  1. Add a POJO data class
  2. Map the data to our class instead of printing it directly

POJO

We add an additional POJO (Plain Old Java Object) class, which will represent the JSON object and hold our data:

import java.math.BigDecimal;

public class Item {
    private String title;
    private BigDecimal price;
    private String url;

    public String getTitle()
    {
        return title;
    }
    
    public void setTitle(String title)
    {
        this.title = title;
    }

    public BigDecimal getPrice()
    {
        return price;
    }
    
    public void setPrice(BigDecimal price)
    {
        this.price = price;
    }

    public String getUrl()
    {
        return url;
    }
    
    public void setUrl(String url)
    {
        this.url = url;
    }
}

Mapping

Now, we can extend our previous for loop to create an Item instance for each found item and map that to a JSON object.

// Base URL to turn the relative href into an absolute link
String baseUrl = "https://newyork.craigslist.org";
ObjectMapper mapper = new ObjectMapper();

for (HtmlElement htmlItem : items) {
   HtmlAnchor itemAnchor = ((HtmlAnchor) htmlItem.getFirstByXPath(".//p[@class='result-info']/a"));
   HtmlElement spanPrice = ((HtmlElement) htmlItem.getFirstByXPath(".//a/span[@class='result-price']"));

   // It is possible that an item doesn't have any
   // price, we set the price to 0.0 in this case
   String itemPrice = spanPrice == null ? "0.0" : spanPrice.asText();

   Item item = new Item();

   item.setTitle(itemAnchor.asText());
   item.setUrl(baseUrl + itemAnchor.getHrefAttribute());
   item.setPrice(new BigDecimal(itemPrice.replace("$", "")));

   // writeValueAsString() throws a checked JsonProcessingException,
   // so the enclosing method needs to declare or handle it
   String jsonString = mapper.writeValueAsString(item);

   System.out.println(jsonString);
}

Code sample

You can find the full source code of this example in our GitHub repository.

Let's take it a step further

Our project has so far provided us with a quick overview of what web scraping is, its fundamental concepts, and how to set up our own crawler using Java and XPath.

For now, it's a relatively simple example, taking a defined search term and returning as JSON all the products sold in the area of New York City. What if we wanted to get data from more than one city? Let's check it out.

Multi-city support

If you look closely at the URL we previously used for the search, you'll notice that Craigslist catalogues its ads by city and keeps that information in the hostname of the URL. For example, our ads for New York City all live behind the following URL

https://newyork.craigslist.org

If we wanted to fetch the ads relevant to Boston, we'd be using https://boston.craigslist.org instead.

Now, let's say, we'd like to retrieve all iPhone 13 ads for the East Coast and, specifically, for New York, Boston, and Washington D.C. In that case, we'd simply revisit our code from Fetching the page and extend it a bit, to support the other cities as well.

// Define the search term
String searchQuery = "iphone 13";
String[] cities = new String[]{"newyork", "boston", "washingtondc"};

// Instantiate the client
WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);

for (String city : cities)
{
    // Set up the URL with the search term and send the request
    String searchUrl = "https://" + city + ".craigslist.org/search/moa?query=" + URLEncoder.encode(searchQuery, "UTF-8");
    HtmlPage page = client.getPage(searchUrl);
    
    // Here goes the rest of the code handling the content in "page"
}

What we did was add

  • an additional string array for the cities and
  • a for-loop, iterating over said array to fetch the ads for each of the defined cities

Voilà, we now run the request for each city individually.

Passing of parameters

So far, we were pretty content with the search listing for the iPhone, but what if we wanted to narrow down the search a bit?

Fortunately, Craigslist does offer you the ability to filter the search by specific criteria. For example, you can tell it only to return ads which have pictures. Or, you might only be interested in ads which were posted today.

Craigslist index page

Once you check these two boxes, you'll notice the URL in the address bar changes to

https://newyork.craigslist.org/search/moa?hasPic=1&postedToday=1&query=iphone%2013

This URL is pretty similar to what we used before, but we now have two other parameters in the query string:

  • hasPic with a value of 1, indicating that only ads with pictures should be returned
  • postedToday with a value of 1, indicating that only today's ads should be returned

With that URL, we'll only get listings which were posted today and which include a picture. Not bad, is it?

But, wait, there's more. In addition to the two parameters just mentioned, you can also specify the following to narrow down your search even further.

  • srchType with a value of T, to only search the ads' titles
  • bundleDuplicates with a value of 1, to bundle ads by the same seller
  • searchNearby with a value of 1, to include nearby areas of the city in question

The following URL will return all ads posted today and limit the text search to the title.

https://newyork.craigslist.org/search/moa?query=iphone%2013&postedToday=1&srchType=T
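Rather than concatenating these parameters by hand, you can assemble the query string programmatically. The following is a minimal sketch (the class and method names are ours) that collects the parameters in a map and encodes each value with the same URLEncoder we used earlier; note that URLEncoder encodes a space as "+", which is equivalent to %20 in a query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class SearchUrlBuilder {

    // Assembles a Craigslist search URL from a city and a parameter map,
    // URL-encoding each parameter value
    public static String buildSearchUrl(String city, Map<String, String> params)
            throws UnsupportedEncodingException {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            query.add(entry.getKey() + "=" + URLEncoder.encode(entry.getValue(), "UTF-8"));
        }
        return "https://" + city + ".craigslist.org/search/moa?" + query;
    }

    public static void main(String[] args) throws Exception {
        // LinkedHashMap keeps the parameters in insertion order
        Map<String, String> params = new LinkedHashMap<>();
        params.put("query", "iphone 13");
        params.put("postedToday", "1");
        params.put("srchType", "T");

        // https://newyork.craigslist.org/search/moa?query=iphone+13&postedToday=1&srchType=T
        System.out.println(buildSearchUrl("newyork", params));
    }
}
```

The returned string can be passed straight to client.getPage(), just like the hand-built URL in our earlier snippets.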

Output customisation

You could encounter the situation where your crawler may have to support different output formats. For example, you might have to support JSON and CSV. In that case you could simply add a switch to your code, which changes the output format depending on its value.

// The output format is passed as the first command-line argument
String outputType = argv.length == 1 ? argv[0] : "";

ObjectMapper mapper = new ObjectMapper();

for (HtmlElement htmlItem : items) {
   HtmlAnchor itemAnchor = ((HtmlAnchor) htmlItem.getFirstByXPath(".//p[@class='result-info']/a"));
   HtmlElement spanPrice = ((HtmlElement) htmlItem.getFirstByXPath(".//a/span[@class='result-price']"));

   // It is possible that an item doesn't have any
   // price, we set the price to 0.0 in this case
   String itemPrice = spanPrice == null ? "0.0" : spanPrice.asText();

   switch (outputType)
   {
       case "json": {
           Item item = new Item();

           item.setTitle(itemAnchor.asText());
           item.setUrl(baseUrl + itemAnchor.getHrefAttribute());
           item.setPrice(new BigDecimal(itemPrice.replace("$", "")));

           String jsonString = mapper.writeValueAsString(item);

           System.out.println(jsonString);
           break;
       }

       case "csv":
           // TODO: CSV-escaping
           System.out.println(String.format("%s,%s,%s", itemAnchor.asText(), itemPrice, baseUrl + itemAnchor.getHrefAttribute()));
           break;

       default:
           System.out.println("Error: no format specified");
           break;
   }
}

If you now pass json as the first argument to your crawler, it will print a JSON object for each entry (just as we originally showed under Mapping). If you pass csv, it will print a comma-separated line for each entry instead.

Next steps

The examples mentioned so far provided a bit of insight on how to scrape Craigslist, but there are certainly still a few areas which could be improved.

  • Properly handling pagination
  • Support for more than one criterion
  • and more
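Pagination, for example, usually boils down to requesting the same search URL with an increasing result offset. The following is a rough sketch under two assumptions you should verify against the URLs your browser shows when you click through result pages: that Craigslist uses an s offset parameter, and that a page holds 120 rows.

```java
public class PaginationSketch {

    // Assumed page size -- verify against the URLs your browser produces
    static final int PAGE_SIZE = 120;

    // Returns the search URL for the given zero-based result page,
    // assuming an "s" offset parameter
    public static String pageUrl(String baseSearchUrl, int pageIndex) {
        int offset = pageIndex * PAGE_SIZE;
        return offset == 0 ? baseSearchUrl : baseSearchUrl + "&s=" + offset;
    }

    public static void main(String[] args) {
        String base = "https://newyork.craigslist.org/search/moa?query=iphone%2013";
        // Print the URLs for the first three result pages
        for (int page = 0; page < 3; page++) {
            System.out.println(pageUrl(base, page));
        }
    }
}
```

In a real crawler you wouldn't hard-code the page count: you'd keep fetching pages until one no longer contains any result-row elements.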

Of course, there's a lot more to scraping than just fetching a single HTML page and running a few XPath expressions. Especially when it comes to distributed scraping, fully handling JavaScript, and CAPTCHAs, the topic can quickly become very complex. If you'd like to have these things handled automatically, then please check out our web scraping API. The first 1,000 API calls are on us!

Even more

We are almost at the end of this post, so thanks for staying with us until now, but we still have a couple of recommended articles for you.

Don't get blocked

Also check out our recent blog post on Web Scraping without getting blocked, which goes into detail on how to optimise your scraping approach to avoid being blocked by anti-scraping measures.

Scraping with Chrome and full JavaScript support

While HtmlUnit is a wonderful headless browser, you may still want to check out our other article on the Introduction to Headless Chrome, as this will provide you with additional insight on how to use Chrome's headless mode, which features full JavaScript support, just as you'd expect it from your daily driver browser.

One CSS selector, please

CSS selectors are used for much more these days than just applying colours and spacing. They are often used in the very same contexts as XPath expressions, and if you happen to prefer CSS selectors, you should definitely check out our tutorial on HTML parsing with Java using jsoup.

Python maybe?

Python has been one of the most popular languages for years at this point and is, in fact, commonly used for web scraping as well. If Python is your choice of language, you might just like our other guide on using Python for scraping web pages.

Kevin Sahin

Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.