Getting Started with Apache Nutch
Getting Started with Apache Nutch
Typically, a crawler first makes requests to a list of known web addresses and, from their content, identifies other relevant links. It adds these new URLs to a queue, iteratively takes them out, and repeats the process until the queue is empty. The crawler stores the extracted data—like web page content, meta tags, and links—in a database.
Apache Nutch is an open source web crawler that lets you collect relevant data from web pages, store it, and create an index for easy searching and querying. Nutch also supports incremental crawling, which means it revisits previously crawled websites and looks for updates to ensure that your indexed data remains fresh. It's flexible and modular, and maintenance is easy—you can check for broken links and duplicates with simple commands.
In this tutorial, let's take a look at how to set up and use Nutch to operate a crawler and then integrate it with Apache Solr to analyze data.
Why Apache Nutch?
Apache Nutch is a feature-rich framework, and one of its most important features is its highly extensible architecture. Nutch uses a plugin-based architecture, which allows you to extend its base functionalities to better suit your use cases. You might benefit from integrating, say, custom content parsers, URL filters, data formats, metadata extractors, and storage solutions.
Nutch also uses Apache Hadoop to support distributed file processing and storage. This gives you the ability to efficiently process large data volumes across multiple machines.
While Nutch is primarily a crawler, it does include its own search engine. Once your crawler has collected and indexed web content, you can carry out basic keyword-based querying through the data. You can extend its search capability during indexing by configuring it to index specific metadata or content fields.
Its segment-based storage means you have flexible data storage options. Each segment is a collection of web pages fetched in a particular crawling operation, which makes it easier to handle large volumes of data. For those large-scale crawls, you can use Hadoop's Distributed File System (HDFS). Its native file system storage is more useful for smaller crawls. If you need to extend Nutch's data storage capabilities, you can make use of platforms like Apache Solr or Elasticsearch.
Getting Started with Apache Nutch
To demonstrate how Apache Nutch can be used to crawl websites, let's configure a crawler to fetch data from ScrapingBee. You'll analyze that data by integrating Nutch with Apache Solr to make queries.
To get started, ensure that you have the following set up:
- Unix environment (WSL 2 on Windows)
- Java Development or Runtime Environment version 11
- A code editor like Visual Studio Code
Install Apache Nutch
Once you've got all the prerequisites set up, it's time to create a project directory using the
mkdir <dir_name> command and navigate into it via your terminal. Then, run the command below to download and unzip Apache Nutch version 1.19:
Run the following command to change the directory into the unzipped folder and test your installation using the
This should show you the options available to use with the
nutch command, as shown here:
If you get a
JAVA_HOME is not set error, run the following command to set it for this terminal session:
Configure Apache Nutch with Data Storage and Indexing Options
For this tutorial, you'll want to integrate Nutch with Apache Solr to store your indexed data and analyze it. To install Solr version 8.11.2, navigate back to the project directory and run:
tar xzf solr-8.11.2.tgz
Change the directory to the
solr-8.11.2/bin folder and start the Solr server:
This opens a browser tab at
http://localhost:8983/solr/#//. You'll use this server to analyze your crawled data later.
To create a core in Solr named
nutch, which is essentially an index where your data will be stored, run the following:
./solr create -c nutch
This returns the following output:
Created new core 'nutch'
With Solr set up, it's time to integrate it with Apache Nutch. Open the
apache-nutch-1.19/conf/nutch-site.xml file in your code editor.
In this file, enter the configuration property below between the
</configuration> tags to specify that Solr will be used for indexing:
Enter the following property just below the previous one to specify the Solr server URL:
Finally, enter the property below to set your crawler name to a specific value—in this case,
MyNutchCrawler. This ensures that you obey any rules set by
robots.txt files on the parts of a website that your crawler is allowed to access. With a default crawler name, you might unintentionally violate these directives.
nutch-site.xml file is displayed below:
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
Set Up the Tool for Scraping
To get your scraping tool set up, you need to define the URL(s) that will be used as a starting point. These are called seed URLs.
Navigate back into the
apache-nutch-1.19 directory and run the commands below to create a
urls directory and a
seed.txt file in it with a link to ScrapingBee:
mkdir -p urls
echo "https://www.scrapingbee.com" >> urls/seed.txt
To ensure that you only include URLs that match ScrapingBee's domain, edit the
conf/regex-urlfilter.txt file. Locate the following code, which accepts any URL that makes it through the other filters in the file:
# accept anything else
Then replace the previous code with a regular expression matching the ScrapingBee domain only, as shown here:
# ScrapingBee URLs only
The edited portion is shown in the image below.
Scrape Data from ScrapingBee
To set the stage for crawling, inject the seed URL(s) from the
urls directory into Nutch's crawl database,
crawldb, by running the following command:
bin/nutch inject crawl/crawldb urls
A portion of the terminal output is displayed below:
Run the following command to scan
crawldb, select the next set of URLs that need to be crawled, and schedule them for fetching by creating a new segment in the
bin/nutch generate crawl/crawldb crawl/segments
Save and display the name of the newly created segment in the shell variable
s1 for easy reference later by running the following:
s1=$(ls -d crawl/segments/2* | tail -1)
You can fetch web pages in the
$s1 segment of the Apache Nutch web crawler by running the following:
bin/nutch fetch $s1
A portion of the terminal output is displayed here:
Run the following command to process and extract useful data from the fetched web pages in the
$s1 segment of the crawler:
bin/nutch parse $s1
You should see the output below at the end of the operation:
Store the Scraped Data
Now, you can update the central
crawldb database with the result from the
$s1 segment by running the following:
bin/nutch updatedb crawl/crawldb $s1
This generates the terminal output shown here:
Repeat the process for a new segment containing the top-scoring twenty pages by running the following:
bin/nutch generate crawl/crawldb crawl/segments -topN 20
s2=$(ls -d crawl/segments/2* | tail -1)
bin/nutch fetch $s2
bin/nutch parse $s2
crawldb as before with the result from segment
$s2 by running the following:
bin/nutch updatedb crawl/crawldb $s2
Your database now reflects the latest fetched and parsed results from both segments!
However, to prepare and store the data in a format that makes it efficient for search operations, you have to index it. Before doing that, run the following command to invert the links that have been crawled so far:
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
This command converts outlinks, or links from one website to another, to inlinks, which link from one web page to another within the same site. This information is then saved to a link database,
linkdb, and provides additional context about a web page's relevance and content.
Finally, index the data on the configured Apache Solr server by running the following:
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ -dir crawl/segments/ -filter -normalize -deleteGone
The command above includes the following:
-filteroption to filter URLs based on already specified criteria
-normalizeoption to convert the URLs into a standard format
-deleteGoneoption to remove broken URLs from the index
A portion of the terminal output is displayed in the image below. Note that on the penultimate line, it says that
20 documents are indexed.
Now that you understand these processes, you can use the crawl script to automate them.
Analyze the Indexed Data
To analyze your indexed data, you'll use the running Solr server. Navigate to
http://localhost:8983/solr/#/nutch/core-overview to view the
nutch core created earlier, as shown in the image below.
Note: The number of documents matches the terminal output.
To analyze the crawled data by making queries, click Query on the lower left. Enter
title:amazon into the
q (query) field to search for a scraped web page that has Amazon in its title.
Scroll down and click Execute Query to run the query.
The result is generated on the right-hand side, as shown in the image below. Notice the blog post that corresponds to the query with all its scraped data.
Next, replace the term in the
q field with
content:scraping to search for web pages with
scraping in their
content field. Click Execute Query.
title:alternative into the
fq (filter query) field to limit the results to those that have the word
alternative in their title. The result is shown below:
Note: Sure enough, the number of extracted content reduces from
To select the fields that get displayed in the result, enter the field names in the
fl (field list) field as a comma-separated list, like so:
title, url, content.
Click Execute Query to see the generated result, as shown below.
Finally, to search for the SVG images you scraped, clear all the fields and enter
url:*.svg into the
q (query) field. Sort the result from most relevant to least relevant by entering
boost desc into the sort field.
Click Execute Query to see the generated result, as shown below:
You can also query the indexed data via HTTP requests in a language of your choice. Then, you can process it and perform a more nuanced analysis.
Note: You can find the configured Nutch crawler in this GitHub repository.
If you followed this tutorial, you should have a solid understanding of Apache Nutch as an open source web crawler as well as its key features. You also know how to configure and run an Apache Nutch crawler, how to index the crawled data, and how to analyze the indexed data using the search platform Apache Solr.
If you prefer not to have to deal with rate limits, proxies, user agents, and browser fingerprints, check out ScrapingBee's no-code web scraping API today. Your first 1,000 API calls are free!
You might also like:
Getting Started with MechanicalSoup
MechanicalSoup is an excellent tool that can be used to scrape websites in Python. In this guide, you'll give a high-level overview of how to use it and explore some of its advanced capabilities.
Guide to Scraping E-commerce Websites
In this guide, you will share high-level tips, best practices, and tools for scraping e-commerce websites. Towards the end, you will also show a quick demo on how to scrape an open source e-commerce website.