Web Scraping with R

18 October 2022 | 16 min read

Want to scrape the web with R? You’re at the right place!

We will teach you, from the ground up, how to scrape the web with R, and will take you through the fundamentals of web scraping (with examples in R).

Throughout this article, we won’t just take you through prominent R libraries like rvest and Rcrawler, but will also walk you through how to scrape information with barebones code.

Overall, here’s what you are going to learn:

  1. R web scraping fundamentals
  2. Handling different web scraping scenarios with R
  3. Leveraging rvest and Rcrawler to carry out web scraping

Let’s start the journey!

Introduction

The first step towards scraping the web with R requires you to understand HTML and web scraping fundamentals. You’ll first learn how to access the HTML code in your browser, then we will check out the underlying concepts of markup languages and HTML, which will set you on course to scrape that information. And, above all, you’ll master the vocabulary you need to scrape data with R.

We will be looking at the following key items, which will help you in your R scraping endeavour:

  1. HTML basics
  2. HTML elements and tags
  3. Parsing HTML data in R

So, let’s get into it, shall we?

1. HTML basics

Ever since Tim Berners-Lee proposed, in the late 80s, the idea of a platform of documents (the World Wide Web) linking to each other, HTML has been the very foundation of the web and every website you are using.

So, whenever you type a site address in your browser, your browser will download and render the page for you. For example, here’s what https://ScrapingBee.com looks like when you view it in a browser.

ScrapingBee Home page

Beautiful, isn't it? But how would we scrape that with R? Well, before we can do any of that, we first need to understand how a webpage is structured and what it is composed of. While the page above has all those beautiful colors and images, the underlying document is a lot more textual in nature, and that's where HTML comes in.

HTML is the technical representation of that webpage and tells your browser which elements to display and how to display them. That HTML is what we need to understand and analyse, in order to be able to successfully scrape a webpage.

So, ladies and gentlemen, lo and behold, here is the HTML code of our ScrapingBee homepage.

<!DOCTYPE html>
<html lang="en">
	<head>
		<!-- lots of meta tags, trust me -->

		<!-- ah, and one title tag, true -->
		<title>ScrapingBee, the best web scraping API.</title>
	</head>

	<body>
		<div id="wrapper">
			<header class="bg-yellow-100 py-20 md:py-38 absolute left-0 right-0 top-0 z-9">
				<div class="container">
					<div class="flex items-center">
						<div class="w-160 md:mr-60 lg:mr-90">
							<a href="/">
								<img src="/images/logo.svg" alt="ScrapingBee logo" height="26" width="160">
							</a>
						</div>
						<span class="nav-opener md:hidden cursor-pointer absolute top-0 right-0 mt-19 mr-20"><i class="icon-menu"></i></span>
						<div class="navbar-wrap overflow-hidden md:overflow-visible md:flex-1">
							<nav class="navbar px-20 py-20 md:p-0 md:flex md:items-center md:justify-between text-16 leading-a20 bg-black-100 md:bg-transparent text-white md:text-black-100">
								<ul class="flex items-center -mx-21 justify-between md:justify-start border-b border-blue-200 md:border-transparent pb-20 md:pb-0 mb-30 md:mb-0">
									<li class="px-15 lg:px-21"><a href="https://app.scrapingbee.com/account/login" class="block hover:underline">Login</a></li>
									<li class="px-15 lg:px-21"><a href="https://app.scrapingbee.com/account/register" class="btn btn-black-o text-16 px-21 h-40 md:h-48 border-white md:border-black-100 text-white md:text-black-100 hover:bg-white md:hover:bg-black-100 hover:text-black-100 md:hover:text-white transition-all">Sign Up</a></li>
								</ul>
							</nav>
						</div>
					</div>
				</div>
			</header>

			<!-- plenty of more content, plenty -->

		</div>
	</body>
</html>

All right, that was a lot of angle brackets. Where did our pretty page go?

If you are not familiar with HTML yet, that may have been a bit overwhelming to read, let alone scrape.

But don’t worry, the next section will show you exactly how to interpret all of that. Promised.

2. HTML elements and tags

If you carefully check the HTML code, you will notice something like <title>...</title>, <body>...</body> etc. These are called tags, which are special markers in every HTML document. Each tag serves a special purpose and is interpreted differently by your browser. For example, <title> provides the browser with the - yes, you guessed right - title of that page. Similarly, <body> contains the main content of the page.

Tags are typically either a pair of an opening and a closing marker (e.g. <title> and </title>), with content in-between, or they are self-closing tags on their own (e.g. <br />). Which style they follow usually depends on the tag type and its use case.

In either case, tags can also have attributes, which provide additional data and information relevant to the tag they belong to. In our example above, you can notice such an attribute in the very first tag <html lang="en">, where the lang attribute specifies that this document uses English as its primary language.

Once you understand the main concepts of HTML, its document tree, and tags, an HTML document will suddenly make more sense and you will be able to identify the parts you are interested in. The main takeaway here is that an HTML page is a structured document with a tag hierarchy, which your crawler will use to extract the desired information.
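To make this a bit more tangible, here is a tiny sketch in R (using the rvest package, which we'll cover in depth later in this article) that parses a made-up snippet and pulls out a tag's name, one of its attributes, and its text content:

library(rvest)

# A made-up, minimal HTML snippet, just for illustration
snippet <- read_html('<html lang="en"><body><p class="intro">Hello, world!</p></body></html>')

p_node <- html_node(snippet, "p")   # grab the first <p> element

html_name(p_node)            # "p"              -> the tag name
html_attr(p_node, "class")   # "intro"          -> the value of its class attribute
html_text(p_node)            # "Hello, world!"  -> the content between the opening and closing tag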

3. Parsing a webpage using R

So, with the information we've learned so far, let's try and use our favorite language R to scrape a webpage. Please keep in mind, we've only - pun fully intended - scraped the surface of HTML so far, so for our first example, we won't extract data, but only print the plain HTML code.

I want to scrape the HTML code of ScrapingBee.com and see how it looks. We will use readLines() to read the HTML document line by line and build a flat representation of it.

scrape_url <- "https://www.scrapingbee.com/"

flat_html <- readLines(con = scrape_url)

Now, if we print flat_html, we should get something like this in our R console:

[1] "<!DOCTYPE html>"
[2] "<html lang=\"en\">"
[3] "<head>"
[4] "    <meta name=\"generator\" content=\"Hugo 0.60.1\"/>"
[6] "    <meta http-equiv=\"x-ua-compatible\" content=\"ie=edge\"/>"
[7] "    <title>ScrapingBee - Web Scraping API</title>"
[8] "    <meta name=\"description\""
[9] "          content=\"ScrapingBee is a Web Scraping API that handles proxies and Headless browser for you, so you can focus on extracting the data you want, and nothing else.\"/>"
[10] "    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1, shrink-to-fit=no\"/>"
[11] "    <meta name=\"twitter:title\" content=\"ScrapingBee - Web Scraping API\"/>" 
[12] "    <meta name=\"twitter:description\""
[13] "          content=\"ScrapingBee is a Web Scraping API that handles proxies and Headless browser for you, so you can focus on extracting the data you want, and nothing else.\"/>"
[14] "    <meta name=\"twitter:card\" content=\"summary_large_image\"/>"
[15] "    <meta property=\"og:title\" content=\"ScrapingBee - Web Scraping API\"/>"
[16] "    <meta property=\"og:url\" content=\"https://www.scrapingbee.com/\" />"
[17] "    <meta property=\"og:type\" content=\"website\"/>" 
[18] "    <meta property=\"og:image\""
[19] "          content=\"https://www.scrapingbee.com/images/cover_image.png\"/>"
[20] "    <meta property=\"og:description\" content=\"ScrapingBee is a Web Scraping API that handles proxies and Headless browser for you, so you can focus on extracting the data you want, and nothing else.\"/>"
[21] "    <meta property=\"og:image:width\" content=\"1200\"/>"
[22] "    <meta property=\"og:image:height\" content=\"630\"/>"
[23] "    <meta name=\"twitter:image\""
[24] "          content=\"https://www.scrapingbee.com/images/terminal.png\"/>" 
[25] "    <link rel=\"canonical\" href=\"https://www.scrapingbee.com/\"/>"
[26] "    <meta name=\"p:domain_verify\" content=\"7a00b589e716d42c938d6d16b022123f\"/>"

Quite similar to our previous HTML example, of course. The whole output would be quite a few lines, so I took the liberty of trimming it for this example. But here’s something you can do to have some fun before I take you further towards scraping the web with R:

  1. Scrape www.google.com and try to make sense of the information you receive
  2. Scrape a very simple web page like https://www.york.ac.uk/teaching/cws/wws/webpage1.html and see what you get (there's a small starter sketch right after this list)
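If you want a head start on the second exercise, a minimal sketch (nothing more than what we did above, pointed at a different URL) could look like this:

simple_url <- "https://www.york.ac.uk/teaching/cws/wws/webpage1.html"

simple_html <- readLines(con = simple_url)

head(simple_html, 10)  # peek at the first ten lines of the document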

Remember, scraping is only fun if you experiment with it. So, as we move forward with the blog post, I’d love it if you try out each and every example as you go through them and put your own twist on them.

While our output above looks great, it still isn't strictly an HTML document, because in HTML we have a document hierarchy of tags, which looks like this:

<!DOCTYPE html>
<html>
<head>
    <title>Page Title</title>
</head>
<body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
</body>
</html>

This is because readLines() reads the document line by line and does not take the overall document structure into account. Still, given that I just wanted to give you a barebones look at scraping, this code serves as a good illustration.

Real-world code, of course, will be a lot more complex. But, fortunately, we have a lot of libraries that simplify web scraping in R for us. We will go through four of these libraries in later sections.
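As a quick teaser of what such a library buys us, here's a minimal sketch (using the XML package, which we will meet properly in the Wikipedia example further down) that collapses our flat, line-by-line vector back into a single string and hands it to a real HTML parser, which rebuilds the tag hierarchy for us:

library(XML)

# Collapse the line-by-line character vector into a single string ...
flat_html_string <- paste(flat_html, collapse = "\n")

# ... and let a proper HTML parser rebuild the document tree from it
parsed_html <- htmlParse(flat_html_string, asText = TRUE)

parsed_html now holds a proper document tree rather than a bag of text lines.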

First, we need to go through different scraping situations that you’ll frequently encounter when you scrape data with R.

Common web scraping scenarios with R

1. Using R to download files over FTP

Even though FTP is used less and less these days, it is still often a fast way to exchange files.

In this example, we will use the CRAN FTP server to first get the list of files for a given directory, filter the list for HTML files, and download each of them. Let's get going.

Directory listing

The URL that we are trying to get data from is ftp://cran.r-project.org/pub/R/web/packages/BayesMixSurv/.

library(RCurl)

ftp_url <- "ftp://cran.r-project.org/pub/R/web/packages/BayesMixSurv/"

get_files <- getURL(ftp_url, dirlistonly = TRUE)

Excellent, we got the list of files in get_files:

> get_files

"BayesMixSurv.pdf\r\nChangeLog\r\nDESCRIPTION\r\nNAMESPACE\r\naliases.rds\r\nindex.html\r\nrdxrefs.rds\r\n"

Looking at the string above, can you see what the file names are?

This browser screenshot shows them in a slightly more user-friendly way

Files and directory inside an FTP server

Turns out, we got the file list line-by-line, with DOS line endings (carriage return/line feed). It's pretty easy to parse that with R; we simply use str_split() and str_extract_all() from the stringr package.

library(stringr)

extracted_filenames <- str_split(get_files, "\r\n")[[1]]

extracted_html_filenames <- unlist(str_extract_all(extracted_filenames, ".+(.html)"))

Let’s print the file names to see what we have now:

> extracted_html_filenames

[1] "index.html"

Great! So, we now have a list of HTML files that we want to access. In our case, it was only one HTML file.

File download

Now, all we have to do is to create a function FTPDownloader, which downloads our file (with getURL() once more) and saves it to a local folder.

FTPDownloader <- function(filename, folder, handle) {

  # Create the local target folder if it does not exist yet
  dir.create(folder, showWarnings = FALSE)

  fileurl <- str_c(ftp_url, filename)

  # Only download the file if we do not already have a local copy
  if (!file.exists(str_c(folder, "/", filename))) {

    file_contents <- try(getURL(fileurl, curl = handle))

    write(file_contents, str_c(folder, "/", filename))

    # Be polite and wait a second between downloads
    Sys.sleep(1)

  }

}

We are almost there now! We only need a cURL handle for the actual network communication.

Curlhandle <- getCurlHandle(ftp.use.epsv = FALSE)

Now, we just call l_ply() from the plyr package and pass it our list of files (extracted_html_filenames), our download function (FTPDownloader), the local directory, and our cURL handle.

library(plyr)

l_ply(extracted_html_filenames, FTPDownloader, folder = "scrapingbee_html", handle = Curlhandle)

And, we are done! We should now have a directory named scrapingbee_html with one index.html.
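If you want to double-check from within R that the download really happened, list.files() will tell you what landed on disk (the expected output, given our single HTML file, is shown below):

> list.files("scrapingbee_html")

[1] "index.html"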

That was FTP, but what about the web's main protocol, HTTP? That's what we are going to check out next.

If you would like to learn more about cURL outside of R, you can check out: How to send a GET request using cURL?, How to send a DELETE request using cURL?, or How to send a POST request using cURL?

2. Scraping information from Wikipedia using R

In this section, I’ll show you how to retrieve information from Leonardo Da Vinci’s Wikipedia page https://en.wikipedia.org/wiki/Leonardo_da_Vinci.

Let’s take a look at the basic steps to parse information:

library(XML)

wiki_url <- "https://en.wikipedia.org/wiki/Leonardo_da_Vinci"

wiki_read <- readLines(wiki_url, encoding = "UTF-8")

parsed_wiki <- htmlParse(wiki_read, encoding = "UTF-8")

  1. We load the XML package and save our URL in wiki_url
  2. We use readLines to fetch the HTML content of the URL and save it in wiki_read
  3. We use htmlParse() to parse the HTML code into a DOM tree and save that as parsed_wiki

What's a DOM tree, you ask? That's a fair question.

DOM tree

DOM is an abbreviation for Document Object Model and is essentially a typed, in-memory representation of an HTML document.

As the DOM already has all elements properly parsed, we can easily access the document's elements via the object returned by htmlParse. For example, to get a list of all <p> elements, we can simply use the following code.

wiki_intro_text <- parsed_wiki["//p"]

wiki_intro_text will now contain a list of all paragraphs. With the following code, we'd access the fourth element.

wiki_intro_text[[4]]

<p>Born <a href="/wiki/Legitimacy_(family_law)" title="Legitimacy (family law)">out of wedlock</a> to a notary, Piero da Vinci, and a peasant woman, Caterina, in <a href="/wiki/Vinci,_Tuscany" title="Vinci, Tuscany">Vinci</a>, in the region of <a href="/wiki/Florence" title="Florence">Florence</a>, <a href="/wiki/Italy" title="Italy">Italy</a>, Leonardo was educated in the studio of the renowned Italian painter <a href="/wiki/Andrea_del_Verrocchio" title="Andrea del Verrocchio">Andrea del Verrocchio</a>. Much of his earlier working life was spent in the service of <a href="/wiki/Ludovico_il_Moro" class="mw-redirect" title="Ludovico il Moro">Ludovico il Moro</a> in Milan, and he later worked in Rome, Bologna and Venice. He spent his last three years in France, where he died in 1519.
</p> 
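By the way, if you only care about the text and not the markup, the XML package's xmlValue() strips the tags for you (output truncated here):

xmlValue(wiki_intro_text[[4]])

[1] "Born out of wedlock to a notary, Piero da Vinci, and a peasant woman, Caterina, in Vinci, in the region of Florence, Italy, Leonardo was educated in the studio of the renowned Italian painter Andrea del Verrocchio. ..."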

The DOM is extremely powerful, but there are also other functions available to handle our plain HTML string.

getHTMLLinks(), for example, will provide us with a list of all links in our page.

getHTMLLinks(wiki_read)

[1] "/wiki/Wikipedia:Good_articles"                                                       

[2] "/wiki/Wikipedia:Protection_policy#semi"                                              

[3] "/wiki/Da_Vinci_(disambiguation)"                                                     

[4] "/wiki/Leonardo_da_Vinci_(disambiguation)"                                            

[5] "/wiki/Republic_of_Florence"                                                          

[6] "/wiki/Surname"                                                                       

[7] "/wiki/Given_name"                                                                    

[8] "/wiki/File:Francesco_Melzi_-_Portrait_of_Leonardo.png"                               

[9] "/wiki/Francesco_Melzi"

You can also see the total number of links on this page by using the length() function:

length(getHTMLLinks(wiki_read))

[1] 1566

I’ll throw in one more use case here: scraping tables off HTML pages, something you’ll encounter quite frequently in your web scraping work. The XML package in R offers a function named readHTMLTable(), which makes our life a lot easier when it comes to scraping tables from HTML pages.

Leonardo’s Wikipedia page has no HTML tables though, so I will use a different page to show how we can scrape HTML tables from a webpage using R.

Here's the new URL: https://en.wikipedia.org/wiki/Help:Table

So, let's load the page and check how many HTML tables we've got.

wiki_url1 <- "https://en.wikipedia.org/wiki/Help:Table"

wiki_read1 <- readLines(wiki_url1, encoding = "UTF-8")

length((readHTMLTable(wiki_read1)))

[1] 108

108 tables is a lot, but then that page is about tables, isn't it?

Fair enough, let's choose one table. For that, we can use the names() function.

names(readHTMLTable(wiki_read1))

[1] "NULL"                                                                             

[2] "NULL"                                                                             

[3] "NULL"                                                                             

[4] "NULL"                                                                             

[5] "NULL"                                                                             

[6] "The table's caption\n"

Quite a few NULLs here, but also a named one, "The table's caption", so let's check out that one.

readHTMLTable(wiki_read1)$"The table's caption\n"

               V1              V2              V3

1 Column header 1 Column header 2 Column header 3

2    Row header 1          Cell 2          Cell 3

3    Row header A          Cell B          Cell C

Here’s how this table looks in HTML

Html Table

Awesome, isn’t it? Now, imagine accessing and scraping real-world data and information. For example, you could try to fetch all the historical data of the US census at https://en.wikipedia.org/wiki/United_States_Census. A pretty good use case for a table.
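If you'd like to give that a go, here is a hedged sketch (assuming the census page serves its data as regular HTML tables) that follows the exact same recipe we just used:

census_url <- "https://en.wikipedia.org/wiki/United_States_Census"

census_read <- readLines(census_url, encoding = "UTF-8")

census_tables <- readHTMLTable(census_read)

length(census_tables)  # how many tables did the page give us?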

That being said, it's not always that straightforward. Often, pages have HTML forms and authentication requirements, which can block your R code from scraping. And that’s exactly what we are going to learn to handle next.

3. Handling HTML forms while scraping with R

Often we come across pages that aren’t that easy to scrape. Let's take the meteorological service of Singapore, for example - http://www.weather.gov.sg/climate-historical-daily.

Screenshot weather.gov website

Pay close attention to the dropdowns and imagine you want to scrape information that you can only get by clicking through them. What would you do in that case?

Well, I’ll be jumping a few steps ahead and will show you a preview of the rvest package while scraping this page. Our goal here is to scrape data from 2016 to 2020.

library(rvest)

html_form_page <- 'http://www.weather.gov.sg/climate-historical-daily' %>% read_html()

weatherstation_identity <- html_form_page %>% 
  html_nodes('button#cityname + ul a') %>% 
  html_attr('onclick') %>%  
  sub(".*'(.*)'.*", '\\1', .)

weatherdf <- expand.grid(weatherstation_identity, 
                         month = sprintf('%02d', 1:12),
                         year = 2016:2020)

Let’s check what type of data we have been able to scrape. Here’s what our data frame looks like:

str(weatherdf)

> 'data.frame':	3780 obs. of  3 variables:

 $ Var1 : Factor w/ 63 levels "S104","S105",..: 1 2 3 4 5 6 7 8 9 10 ...

 $ month: Factor w/ 12 levels "01","02","03",..: 1 1 1 1 1 1 1 1 1 1 ...

 $ year : int  2016 2016 2016 2016 2016 2016 2016 2016 2016 2016 ...

 - attr(*, "out.attrs")=List of 2

  ..$ dim     : Named num  63 12 5

  .. ..- attr(*, "names")= chr  "" "month" "year"

  ..$ dimnames:List of 3

  .. ..$ Var1 : chr  "Var1=S104" "Var1=S105" "Var1=S109" "Var1=S86" ...

  .. ..$ month: chr  "month=01" "month=02" "month=03" "month=04" ...

  .. ..$ year : chr  "year=2016" "year=2017" "year=2018" "year=2019" ...

From the data frame above, we can now easily generate URLs that provide direct access to data of our interest.

urlPages <- paste0('http://www.weather.gov.sg/files/dailydata/DAILYDATA_', 
weatherdf$Var1, '_', weatherdf$year, weatherdf$month, '.csv')

Now, we can download those files at scale using lapply().

lapply(urlPages, function(url){download.file(url, basename(url), method = 'curl')})

Note: This is going to download a ton of data once you execute it.
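If you'd rather run a quick smoke test before committing to the full download, you could restrict the call to the first few URLs:

# Download only the first three files to verify the pipeline works
lapply(head(urlPages, 3), function(url){download.file(url, basename(url), method = 'curl')})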

Web scraping using rvest

Inspired by libraries like BeautifulSoup, rvest is probably one of the most popular packages for R that we use to scrape the web. While it is simple enough to make scraping with R look effortless, it is also powerful enough to support complex scraping operations.

So, let’s see rvest in action now. For this example, we'll crawl IMDB, and because we are cinema aficionados, we'll pick a particular cinematic masterpiece for our first scraping task.

The Last Sharknado: It's About Time 😎

library(rvest)

sharknado <- read_html("https://www.imdb.com/title/tt8031422/")

All right, we've got everything we need in sharknado at this point. Let's figure out who the cast is.

sharknado %>%

  html_nodes("table") %>%

  .[[1]] %>%

  html_table()

X1                                X2

1  Cast overview, first billed only: Cast overview, first billed only:

2                                                           Ian Ziering

3                                                            Tara Reid

4                                                     Cassandra Scerbo

5                                                    Judah Friedlander

6                                                        Vivica A. Fox

7                                                     Brendan Petrizzo

8                                                      M. Steven Felty

9                                                         Matie Moncea

10                                                            Todd Rex

11                                                        Debra Wilson

12                                                  Alaska Thunderfuck

13                                                 Neil deGrasse Tyson

14                                                       Marina Sirtis

15                                                         Audrey Latt

16                                              Ana Maria Varty Mihail

Ian Ziering, isn't that the dude from Beverly Hills, 90210? And Vicky from American Pie was in it as well. Not to forget Deanna Troi from Star Trek.

Still, there are skeptics of Sharknado. I guess the rating would prove them wrong? So, here's how you extract the movie's rating from IMDB.

sharknado %>%

  html_node("strong span") %>%

  html_text() %>%

  as.numeric()

[1] 3.5

I still stand by my words. But I hope you get the point, right? See how easy it is to scrape information using rvest, whereas we were writing 10+ lines of code for much simpler scraping scenarios earlier.

Next on our list is Rcrawler.

Web Scraping using Rcrawler

Rcrawler is another R package that helps us harvest information from the web. But unlike rvest, Rcrawler is geared much more towards crawling whole sites and network-graph-related scraping tasks. For example, if you wish to scrape a very large website, you might want to explore Rcrawler in a bit more depth.

Note: Rcrawler is more about crawling than scraping.
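Just to give you a taste of that crawling side, here's a minimal sketch (the parameter values are only an example; adjust them to the site you are targeting) that asks Rcrawler to walk the ScrapingBee site one level deep:

library(Rcrawler)

# Crawl one level deep, with two parallel connections and a one-second delay between requests
Rcrawler(Website = "https://www.scrapingbee.com/",
         no_cores = 2,
         no_conn = 2,
         MaxDepth = 1,
         RequestsDelay = 1)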

For our example, we will go back to Wikipedia and try to find the dates of birth and death, and other details, of a few famous scientists.

library(Rcrawler)

list_of_scientists <- c("Niels Bohr", "Max Born", "Albert Einstein", "Enrico Fermi")

target_pages = paste0('https://en.wikipedia.org/wiki/Special:Search/', gsub(" ", "_", list_of_scientists))

scientist_data <- ContentScraper(Url = target_pages , 
        XpathPatterns = c("//th","//tr[(((count(preceding-sibling::*) + 1) = 5) and parent::*)]//td","//tr[(((count(preceding-sibling::*) + 1) = 6) and parent::*)]//td"),
        PatternsName = c("scientist", "dob", "dod"), 
        asDataFrame = TRUE)

So what did we do here?

  1. We first stored the people's names in list_of_scientists.
  2. Based on our list, we then created our Wikipedia URLs in target_pages.
  3. Now, we use Rcrawler's ContentScraper() function to crawl our pages. Here we pass the URLs, as well as XPath expressions and pattern names for the data items we are interested in.

Voilà, scientist_data now contains the following information set.


#   Scientist          dob                                  dod

1   Niels Bohr         7 October 1885Copenhagen, Denmark    18 November 1962 (aged 77)Copenhagen, Denmark

2   Max Born           11 December 1882                     5 January 1970 (aged 87)

3   Albert Einstein    14 March 1879                        18 April 1955

4   Enrico Fermi       29 September 1901                    28 November 1954

And that’s it. You pretty much know everything you need to get started with web scraping in R.

Try challenging yourself with interesting use cases and uncover new challenges along the way. Scraping the web with R can be really fun!

One important aspect to remember is to plan your crawler strategy in a way that avoids being rate limited by the site. We have another excellent article on that subject and how to make sure your web crawler does not get blocked.

💡 If you prefer not to have to deal with rate limits, proxies, user agents, and browser fingerprints, please check out our no-code web scraping API. Did you know, the first 1,000 calls are on us?

Happy scraping.

Parikshit Joshi

Parikshit is a marketer with a deep passion for data. He spends his free time learning how to make better use of data to make marketing decisions.