Is there a website from which you'd like to regularly obtain data in a structured fashion, but which does not yet offer a standardised API, such as a JSON REST interface? Don't fret, web scraping comes to the rescue.
💡 Interested in web scraping with Java? Check out our guide to the best Java web scraping libraries
Welcome to the world of web scraping
Web scraping, or web crawling, refers to the process of fetching and extracting arbitrary data from a website. This involves downloading the site's HTML code, parsing that HTML code, and extracting the desired data from it.
If the aforementioned REST API is not available, scraping is typically the only solution for collecting information from a site. It is a common business practice to obtain data in an automated fashion, and it can be applied to any subject of your choice: for example, to analyse changes in your competitor's pricing scheme, to aggregate the latest stories from different news agencies, or to collect address information for your latest marketing campaign.
As a scraper does essentially what a standard web browser does, there are barely any limits to the information you can collect; the trickiest part is typically extracting information from multimedia content (i.e. images, audio, video).
💡 Check out the advanced data extraction features of ScrapingBee and how they can help you handle even more complex site setups.
In this post, we will walk you through how to set up a basic web crawler in Java, fetch a site, parse and extract the data, and store everything in a JSON structure.
Prerequisites
As we are going to use Java for our demo project, please make sure you have the following prerequisites in place, before proceeding.
- The Java 8 SDK
- A suitable Java IDE for development (e.g. Eclipse)
- If not part of your IDE, Maven for dependency management
Of course, having a basic understanding of Java and the concept of XPath will also speed up things.
Java dependencies
Please make sure you have added HtmlUnit as a dependency to your pom.xml file:
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.60.0</version>
</dependency>
as well as FasterXML's Jackson:
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.13.2.1</version>
</dependency>
💡 If you are using Eclipse, it is recommended to increase the maximum output length in the detail pane of the Variables tab, so that you will see the entire HTML code of the page.
Let's scrape Craigslist
For our example here, we are going to focus on Craigslist and get a list of all classified ads in New York selling an iPhone 13.
As Craigslist does not offer an API, our only option is to go the scraping path and extract the data straight from the site. For that, we will fetch the site, collect the names, prices, and images of all items and export it all as a JSON structure.
Finding the right search URL
First, let's take a look at what happens when you search for something on Craigslist.
- Go to https://newyork.craigslist.org
- Enter `iphone 13` in the search box on the left
- Press Enter
You'll be immediately redirected to the search page with all the found products. For the purpose of this example, the URL in the address bar will be the most interesting thing for now.
https://newyork.craigslist.org/search/moa?query=iphone%2013
At this point, we have established what the search URL for this particular query is and which parameter (i.e. `query`) it requires.

You can now open your favourite IDE; it is time to code.
Fetching the page
To make a request to a site, you'll first need an HTTP client to send that request. It just so happens that HtmlUnit comes with a default class for that task, appropriately called `WebClient`.
There are quite a few parameters you can tweak to customise its behaviour (e.g. proxy settings, CSS support, and more), but for our example we will use the bare configuration without CSS and JavaScript support.
// Define the search term
String searchQuery = "iphone 13";
// Instantiate the client
WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);
// Set up the URL with the search term and send the request
String searchUrl = "https://newyork.craigslist.org/search/moa?query=" + URLEncoder.encode(searchQuery, "UTF-8");
HtmlPage page = client.getPage(searchUrl);
At this point, we have the site's content in the `page` variable. We could access the entire document with the `asXml()` method, but we are more interested in particular data items of the HTML document.

For this, we check out the site's structure in the browser, using the Inspector feature of the developer tools (`F12`).
Based on this, we now know that all items will be `<li>` tags beneath a `<ul>` container tag with the ID `search-results`. Furthermore, each `<li>` tag will have the HTML class `result-row` assigned.
Extracting data
With this knowledge, we can now use XPath to access the returned products and their item properties. HtmlUnit provides a number of convenience methods for this purpose (e.g. `getHtmlElementById`, `getFirstByXPath`, `getByXPath`), which allow you to work with an XPath expression to precisely fetch data from the document. Please refer to the JavaDoc of HtmlUnit for more information on the supported methods.
Let's go through the following code step-by-step:

- We fetch all aforementioned `<li>` tags with the class `result-row` and store them in the variable `items`.
- We iterate over `items` and store each entry as `item`.
- For each `item`, we look
  - for the product details, under an `<a>` tag (contained within a `<p>` tag with the class `result-info`)
  - for the product price, under a `<span>` tag (contained within an `<a>` tag) with the class `result-price`
- Once we have the details and the price, we print them on the screen.
// Retrieve all <li> elements
List<HtmlElement> items = (List<HtmlElement>) page.getByXPath("//li[@class='result-row']");
if (!items.isEmpty()) {
    // Iterate over all elements
    for (HtmlElement item : items) {
        // Get the details from <p class="result-info"><a href=""></a></p>
        HtmlAnchor itemAnchor = ((HtmlAnchor) item.getFirstByXPath(".//p[@class='result-info']/a"));
        // Get the price from <a><span class="result-price"></span></a>
        HtmlElement spanPrice = ((HtmlElement) item.getFirstByXPath(".//a/span[@class='result-price']"));
        String itemName = itemAnchor.asText();
        String itemUrl = itemAnchor.getHrefAttribute();
        // It is possible that an item doesn't have any price
        String itemPrice = spanPrice == null ? "0.0" : spanPrice.asText();
        System.out.println(String.format("Name : %s Url : %s Price : %s", itemName, itemUrl, itemPrice));
    }
} else {
    System.out.println("No items found!");
}
Voilà, we have parsed the whole page and managed to extract the individual product items.
💡 We released a new feature that makes this whole process way simpler. You can now extract data from HTML with one simple API call. Please check out the documentation here for more information.
Converting to JSON
While the previous example provided an excellent overview of how to quickly scrape a website, we can take this a step further and convert the data into a structured, machine-readable format, such as JSON.
For that, we just need to make two small changes to our code:
- A POJO data class
- Mapping the data to our class instead of directly printing it
POJO
We add an additional POJO (Plain Old Java Object) class, which will represent the JSON object and hold our data:
public class Item {
    private String title;
    private BigDecimal price;
    private String url;

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public BigDecimal getPrice() {
        return price;
    }

    public void setPrice(BigDecimal price) {
        this.price = price;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }
}
Mapping
Now, we can extend our previous `for` loop to create an `Item` instance for each found item and map it to a JSON object.
// baseUrl is the host we searched on, e.g. "https://newyork.craigslist.org"
for (HtmlElement htmlItem : items) {
    HtmlAnchor itemAnchor = ((HtmlAnchor) htmlItem.getFirstByXPath(".//p[@class='result-info']/a"));
    HtmlElement spanPrice = ((HtmlElement) htmlItem.getFirstByXPath(".//a/span[@class='result-price']"));
    // It is possible that an item doesn't have any
    // price, we set the price to 0.0 in this case
    String itemPrice = spanPrice == null ? "0.0" : spanPrice.asText();
    Item item = new Item();
    item.setTitle(itemAnchor.asText());
    item.setUrl(baseUrl + itemAnchor.getHrefAttribute());
    item.setPrice(new BigDecimal(itemPrice.replace("$", "")));
    ObjectMapper mapper = new ObjectMapper();
    String jsonString = mapper.writeValueAsString(item);
    System.out.println(jsonString);
}
Code sample
You can find the full source code of this example in our GitHub repository.
Let's take it a step further
So far, our project has given us a quick overview of what web scraping is, its fundamental concepts, and how to set up our own crawler using Java and XPath.
For now, it's a relatively simple example, taking a defined search term and returning as JSON all the products sold in the area of New York City. What if we wanted to get data from more than one city? Let's check it out.
Multi-city support
If you look closely at the URL we previously used for the search, you'll notice that Craigslist catalogues its ads by city and keeps that information in the hostname of the URL. For example, our ads for New York City are all behind the following URL:

https://newyork.craigslist.org

If we wanted to fetch the ads relevant to Boston, we'd use `https://boston.craigslist.org` instead.
Now, let's say, we'd like to retrieve all iPhone 13 ads for the East Coast and, specifically, for New York, Boston, and Washington D.C. In that case, we'd simply revisit our code from Fetching the page and extend it a bit, to support the other cities as well.
// Define the search term
String searchQuery = "iphone 13";
String[] cities = new String[]{"newyork", "boston", "washingtondc"};
// Instantiate the client
WebClient client = new WebClient();
client.getOptions().setCssEnabled(false);
client.getOptions().setJavaScriptEnabled(false);
for (String city : cities)
{
// Set up the URL with the search term and send the request
String searchUrl = "https://" + city + ".craigslist.org/search/moa?query=" + URLEncoder.encode(searchQuery, "UTF-8");
HtmlPage page = client.getPage(searchUrl);
// Here goes the rest of the code handling the content in "page"
}
What we did was add

- an additional string array for the cities and
- a for-loop, iterating over said array to fetch the ads for each of the defined cities

Voilà, we now run the request for each city individually.
Passing of parameters
So far, we were pretty content with the search listing for the iPhone, but what if we wanted to narrow down the search a bit?
Fortunately, Craigslist does offer you the ability to filter the search by specific criteria. For example, you can tell it only to return ads which have pictures. Or, you might only be interested in ads which were posted today.
Once you check these two boxes, you'll notice the URL in the address bar changes to
https://newyork.craigslist.org/search/moa?hasPic=1&postedToday=1&query=iphone%2013
This URL is pretty similar to what we used before, but the query string now contains two additional parameters:

- `hasPic` with a value of `1`, indicating that only ads with pictures should be returned
- `postedToday` with a value of `1`, indicating that only today's ads should be returned
With that URL we'll only get listings which were posted today and were uploaded with a picture. Not bad, is it?
But, wait, there's more. In addition to the two parameters just mentioned, you can also specify the following to narrow down your search even further:

- `srchType` with a value of `T`, to only search the ads' titles
- `bundleDuplicates` with a value of `1`, to bundle ads by the same seller
- `searchNearby` with a value of `1`, to include nearby areas of the city in question
The following URL will return all ads posted today and limit the text search to the title.
https://newyork.craigslist.org/search/moa?query=iphone%2013&postedToday=1&srchType=T
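Rather than concatenating these parameters by hand each time, you can assemble the query string from a map. The `SearchUrlBuilder` class below is a hypothetical helper of our own (not part of HtmlUnit or Craigslist); note that `URLEncoder` encodes spaces as `+`, which is equivalent to `%20` in a query string.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class SearchUrlBuilder {

    // Hypothetical helper: builds a Craigslist search URL for a city
    // from a map of query-string parameters.
    public static String build(String city, Map<String, String> params) {
        StringBuilder url = new StringBuilder("https://" + city + ".craigslist.org/search/moa?");
        boolean first = true;
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (!first) {
                url.append('&');
            }
            url.append(entry.getKey()).append('=').append(encode(entry.getValue()));
            first = false;
        }
        return url.toString();
    }

    // URL-encodes a single value; spaces become "+", which is
    // equivalent to "%20" in a query string
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("query", "iphone 13");
        params.put("postedToday", "1");
        params.put("srchType", "T");
        System.out.println(build("newyork", params));
    }
}
```

Using a `LinkedHashMap` keeps the parameters in insertion order, so the generated URL stays predictable.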
Output customisation
You could encounter the situation where your crawler may have to support different output formats. For example, you might have to support JSON and CSV. In that case you could simply add a switch to your code, which changes the output format depending on its value.
String outputType = argv.length == 1 ? argv[0] : "";
for (HtmlElement htmlItem : items) {
    HtmlAnchor itemAnchor = ((HtmlAnchor) htmlItem.getFirstByXPath(".//p[@class='result-info']/a"));
    HtmlElement spanPrice = ((HtmlElement) htmlItem.getFirstByXPath(".//a/span[@class='result-price']"));
    // It is possible that an item doesn't have any
    // price, we set the price to 0.0 in this case
    String itemPrice = spanPrice == null ? "0.0" : spanPrice.asText();
    switch (outputType) {
        case "json":
            Item item = new Item();
            item.setTitle(itemAnchor.asText());
            item.setUrl(baseUrl + itemAnchor.getHrefAttribute());
            item.setPrice(new BigDecimal(itemPrice.replace("$", "")));
            ObjectMapper mapper = new ObjectMapper();
            String jsonString = mapper.writeValueAsString(item);
            System.out.println(jsonString);
            break;
        case "csv":
            // TODO: CSV-escaping
            System.out.println(String.format("%s,%s,%s", itemAnchor.asText(), itemPrice, baseUrl + itemAnchor.getHrefAttribute()));
            break;
        default:
            System.out.println("Error: no format specified");
            break;
    }
}
If you now pass `json` as the first argument to your crawler call, it will return a JSON object for each entry (just as we originally showed under Mapping). If you pass `csv`, it will print a comma-separated line for each entry instead.
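The `TODO` in the CSV branch hints at proper escaping: a field containing a comma, a quote, or a line break would otherwise corrupt the output. A minimal, hypothetical helper along the lines of RFC 4180 could look like this:

```java
public class CsvEscaper {

    // Quotes a field if it contains a comma, a double quote, or a
    // line break, doubling any embedded quotes (RFC 4180 style).
    public static String escape(String field) {
        if (field.contains(",") || field.contains("\"") || field.contains("\n")) {
            return "\"" + field.replace("\"", "\"\"") + "\"";
        }
        return field;
    }

    public static void main(String[] args) {
        System.out.println(escape("iPhone 13, unlocked")); // "iPhone 13, unlocked"
        System.out.println(escape("plain title"));         // plain title
    }
}
```

You would then wrap each `%s` argument of the CSV branch in `CsvEscaper.escape(...)`.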
Next steps
The examples mentioned so far provided a bit of insight on how to scrape Craigslist, but there are certainly still a few areas which could be improved.
- Properly handling pagination
- Support for more than one criterion
- and more
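To sketch the pagination item from the list above: Craigslist pages its results via an offset query parameter. The snippet below assumes this parameter is called `s` and that a page holds 120 results; both are assumptions you should verify in your browser before relying on them.

```java
public class PaginatedUrls {

    // Assumption: Craigslist paginates via an offset parameter "s"
    // and serves 120 results per page; verify both in your browser.
    static final int PAGE_SIZE = 120;

    // Appends the offset for the given zero-based page index
    // to an existing search URL.
    public static String pageUrl(String baseSearchUrl, int pageIndex) {
        return baseSearchUrl + "&s=" + (pageIndex * PAGE_SIZE);
    }

    public static void main(String[] args) {
        String base = "https://newyork.craigslist.org/search/moa?query=iphone+13";
        // Generate the URLs for the first three result pages
        for (int pageIndex = 0; pageIndex < 3; pageIndex++) {
            System.out.println(pageUrl(base, pageIndex));
        }
    }
}
```

In a real crawler you would fetch page after page until a request returns fewer than `PAGE_SIZE` items.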
Of course, there's a lot more to scraping than just fetching a single HTML page and running a few XPath expressions. Especially when it comes to distributed scraping, fully handling JavaScript, and CAPTCHAs, the topic can quickly become very complex. If you'd like to have these things handled automatically, then simply check out our web scraping API. The first 1,000 API calls are on us!
Even more
We are almost at the end of this post, so thanks for staying with us until now, but we still have a couple of recommended articles for you.
Don't get blocked
Also check out our recent blog post on Web Scraping without getting blocked, which goes into detail on how to optimise your scraping approach to avoid being blocked by anti-scraping measures.
Scraping with Chrome and full JavaScript support
While HtmlUnit is a wonderful headless browser, you may still want to check out our other article on the Introduction to Headless Chrome , as this will provide you with additional insight on how to use Chrome's headless mode, which features full JavaScript support, just as you'd expect it from your daily driver browser.
One CSS selector, please
CSS selectors are used for much more these days than just applying colours and spacing. Very often they are used in the very same context as XPath expressions, and if you happen to prefer CSS selectors, you should definitely check out our tutorial on HTML parsing with Java using jsoup.
Python maybe?
Python has been one of the most popular languages for years at this point and is, in fact, commonly used for web scraping as well. If Python is your choice of language, you might just like our other guide on using Python for scraping web pages .
Or Groovy?
If you like Java, you're going to LOVE Groovy. Check out our guide to web scraping with Groovy. You may also like our guide to web scraping with Kotlin.
What about Scala?
Of course, we didn't forget about web scraping with Scala, so you should check it out as well!
Kevin worked in the web scraping industry for 10 years before co-founding ScrapingBee. He is also the author of the Java Web Scraping Handbook.