Web scraping with R is a practical way to collect data from websites when APIs are missing, incomplete, or locked behind login pages. With the right tools, you can move far beyond fragile one-off scripts and build scrapers that are reliable, readable, and production-ready.
This guide walks through a modern approach to web scraping with R, using rvest and httr2 for parsing and requests, and ScrapingBee to handle the hard parts like JavaScript rendering, proxies, retries, and bot protection. You'll learn how to scrape static pages, work with JSON APIs, deal with pagination and logins, and handle JavaScript-heavy sites without guessing.
The focus is on patterns that actually hold up: clear separation between fetching and parsing, polite request behavior, and workflows you can reuse in real projects—not just demos.

Quick answer (TL;DR)
Want a fast win before diving into the full guide? Here's a copy-paste script that shows what web scraping with R looks like in practice:
- fetches a page through ScrapingBee (HTTP + rendering layer)
- parses the HTML with rvest
- extracts book titles, URLs, and star ratings into a clean data frame
If you ever hit the classic problem where your scraper "doesn't see the data you see in the browser", it's usually because the content is loaded by JavaScript or requires a different request context. Our guide explains those cases.
Full code
# Install dependencies if needed
# install.packages(c("httr2", "rvest", "stringr", "tibble"))
library(httr2)
library(rvest)
# ScrapingBee API key should be set as an environment variable
# export SCRAPINGBEE_API_KEY="your api key here"
api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")
if (api_key == "") stop("SCRAPINGBEE_API_KEY is not set")
# Send the request to ScrapingBee
# ScrapingBee fetches the target page and returns the final HTML
resp <- request("https://app.scrapingbee.com/api/v1/") |>
req_url_query(
api_key = api_key,
url = "https://books.toscrape.com/"
) |>
req_perform()
# Parse HTML directly from the HTTP response
doc <- resp_body_html(resp)
# Each book is stored inside an <article class="product_pod">
cards <- doc |>
html_elements("ol.row article.product_pod")
# Quick sanity check: how many books did we grab?
print(length(cards))
# Extract book titles (stored in the "title" attribute)
titles <- cards |>
html_element("h3 a") |>
html_attr("title")
# Extract relative URLs to book pages
urls <- cards |>
html_element("h3 a") |>
html_attr("href")
# Extract rating classes like "star-rating Three"
rating_class <- cards |>
html_element("p.star-rating") |>
html_attr("class")
# Clean up ratings: "star-rating Three" -> "Three"
ratings <- stringr::str_remove(rating_class, "^star-rating\\s+")
# Combine everything into a tidy data frame
books_df <- tibble::tibble(
title = titles,
url = urls,
rating = ratings
)
books_df
This script is the shortest path from zero to results. ScrapingBee handles the HTTP side and returns ready-to-parse HTML, and rvest does what it's best at: selecting elements and extracting data.
In the next sections, we'll slow things down and look at other ways to do web scraping with R, including different request setups, pagination, and patterns that scale better when you move toward production.
Setting up your R web scraping environment
Before we scrape anything useful, we want a setup that doesn't fight back. The goal here is simple: install R, install a few core packages, and run a tiny script that already talks to ScrapingBee. If that works, you've got a solid base you can reuse for everything later in this guide, from quick experiments to production jobs.
If you hit common scraping issues later on, this page answers many of the usual questions people ask when working with scrapers in production: Web scraping questions.
Install R and RStudio
To get started, you only need two things: R and RStudio. R is the language itself, and RStudio is the place where you write code, run scripts, and inspect results. They're separate tools, but they're designed to be used together.
- First, install R. Download it from the Comprehensive R Archive Network (CRAN), pick your operating system, and run the installer. Stick with the defaults. No need to tweak paths or enable fancy options.
- Next, install RStudio. It's provided by Posit (the company behind the RStudio IDE). Download the free RStudio Desktop version for your OS and run the installer, again keeping the default settings.
The order matters: install R first, then RStudio. RStudio needs R to already be there, and doing it the other way around can lead to confusing startup issues.
Once both are installed, open RStudio. Create a new project via File → New Project and choose an empty directory for your scraping work. This folder will hold your scripts, configs, and outputs. When the project opens, your environment is ready, and you can move on to installing scraping packages.
Install essential packages like rvest, httr2, xml2
These packages cover most of what you need for web scraping with R, from simple scripts to more advanced setups:
- httr2 handles HTTP requests and is what we use to call the ScrapingBee API.
- rvest extracts data from HTML using CSS selectors or XPath.
- xml2 parses HTML and XML documents and sits underneath rvest.
- chromote is optional and lets you control a real Chrome browser if you ever need it.
- Rcrawler is optional and can help when crawling many pages, but note that it's outdated and no longer installs cleanly on recent versions of R.
You can install everything in one go:
install.packages(c(
"httr2",
"rvest",
"xml2",
# Note that this might fail in your setup,
# see below for a custom-made solution:
# "Rcrawler",
"chromote"
))
After installation, load the core packages to make sure everything works:
library(httr2)
library(rvest)
library(xml2)
If RStudio runs this without errors, you're good to go.
If you are used to building requests with cURL, this tool can help later when converting requests into R code: Convert cURL commands to R.
Get a ScrapingBee API key
To run the examples in this guide, you'll need a ScrapingBee API key. ScrapingBee sits between your R code and the target website and takes care of things like headers, rendering, and anti-bot protections.
You can register for free at app.scrapingbee.com/account/register. The free plan gives you 1,000 credits, which is more than enough to run all the code samples in this tutorial and do some testing on your own.
After signing up, you'll find your API key in the ScrapingBee dashboard. We'll store it as an environment variable in the next step so you don't have to hard-code it in your scripts.
Create your first R script
Now let's make sure everything actually works together. We'll create a tiny script that loads a library, reads your ScrapingBee API key, sends a test request, and prints the HTTP status code. If this passes, your setup is solid.
Create a new file in RStudio and name it something like test_scrapingbee.R.
First, set your ScrapingBee API key as an environment variable so it's not hard-coded into the script.
On macOS or Linux:
export SCRAPINGBEE_API_KEY="your_api_key_here"
On Windows (PowerShell):
setx SCRAPINGBEE_API_KEY "your_api_key_here"
Note that setx only applies to new sessions, so restart your terminal (and RStudio) before running the script.
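If you'd rather keep everything inside R, another common option is to put the key in your user-level .Renviron file, which R reads at startup. The usethis helper below isn't used anywhere else in this guide; it's optional and simply opens that file for you:
# Optional: store the key in ~/.Renviron so every R session picks it up
# install.packages("usethis")
usethis::edit_r_environ()
# Add this line to the file, save it, then restart R/RStudio:
# SCRAPINGBEE_API_KEY=your_api_key_here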
Now add this code to the script:
library(httr2)
# Read the API key from the environment
api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")
# Fail early if the key is missing
if (api_key == "") {
stop("SCRAPINGBEE_API_KEY is not set")
}
# Build a request to the ScrapingBee API
# The 'url' parameter is the page ScrapingBee will fetch for us
req <- request("https://app.scrapingbee.com/api/v1/") |>
req_url_query(
api_key = api_key,
url = "https://example.com"
)
# Send the HTTP request
resp <- req_perform(req)
# Print the HTTP status code (200 means success)
resp_status(resp)
Run the script from your terminal:
Rscript test_scrapingbee.R
If the output is 200, your environment is set up correctly. R, the scraping packages, and ScrapingBee are all talking to each other. We'll reuse and extend this setup in the next sections when we start parsing real pages and dealing with more production-style scenarios.
Scraping static websites with rvest
This section walks through a workflow for web scraping with R where ScrapingBee does the fetching, and rvest does the parsing. Keeping those jobs separate makes your code easier to read, easier to debug, and way more reusable.
It also makes your scraper more resilient. If a site rate-limits you, blocks your IP, or later starts needing rendering, you usually don't have to rewrite your whole scraper. The fetching layer stays the same (you just adjust the request) while your parsing code stays focused on HTML.
If you get stuck on parsing or hit weird edge cases, this is a solid reference: Common questions about web scraping and data parsing.
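To make that separation concrete, here's a minimal sketch of the structure we'll keep coming back to. The function names fetch_html() and parse_books() are just illustrative, not from any package: one function knows how to fetch, the other only knows about the HTML.
library(httr2)
library(rvest)

api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")

# Fetching layer: the only part that knows about ScrapingBee
fetch_html <- function(target_url) {
  request("https://app.scrapingbee.com/api/v1/") |>
    req_url_query(api_key = api_key, url = target_url) |>
    req_perform() |>
    resp_body_html()
}

# Parsing layer: only knows about the page structure
parse_books <- function(doc) {
  doc |>
    html_elements("article.product_pod h3 a") |>
    html_attr("title")
}

# If the fetch strategy changes later, parse_books() stays untouched
parse_books(fetch_html("https://books.toscrape.com/"))
If you later need JavaScript rendering or different request options, only fetch_html() changes; every parser built on top of it keeps working.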
Read HTML content using read_html()
The core idea is simple: call ScrapingBee, grab the response body, then pass it into read_html() so rvest can work with a parsed document.
Why fetch with ScrapingBee instead of calling the target website directly from R? You get more reliability out of the box. IP rotation, anti-bot handling, and optional JavaScript rendering live on the ScrapingBee side. Your R code stays the same and focuses only on parsing HTML.
Below is a minimal pattern using resp_body_string() followed by read_html().
library(httr2)
library(rvest)
library(xml2)
# Read API key from the environment
api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")
if (api_key == "") stop("SCRAPINGBEE_API_KEY is not set")
target_url <- "https://books.toscrape.com/"
# Build a ScrapingBee API request
req <- request("https://app.scrapingbee.com/api/v1/") |>
req_url_query(
api_key = api_key,
url = target_url
)
# Send the request
resp <- req_perform(req)
# Get the raw HTML as a string
html_str <- resp_body_string(resp)
# Parse the HTML into an rvest/xml2 document
doc <- read_html(html_str)
doc
This two-step approach makes it explicit what you're working with: first a raw HTTP response, then a parsed HTML document.
If you prefer a shorter version, httr2 can return an already-parsed HTML object directly. This skips one step and is great for quick scripts.
library(httr2)
library(rvest)
# Read API key from the environment
api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")
if (api_key == "") stop("SCRAPINGBEE_API_KEY is not set")
resp <- request("https://app.scrapingbee.com/api/v1/") |>
req_url_query(
api_key = api_key,
url = "https://books.toscrape.com/"
) |>
req_perform()
# Parse HTML directly from the response
doc <- resp_body_html(resp)
doc
Both versions produce the same doc object. The choice mostly comes down to whether you want a very explicit flow (string → HTML) or the most compact code possible.
Select elements using CSS selectors
Once you have doc, scraping turns into a simple loop: pick nodes, then pull values. You select nodes using CSS selectors and a couple of rvest helpers:
- html_elements() returns all matching nodes.
- html_element() returns only the first match.
CSS selectors are the same ones you use in your browser's DevTools. The basics you'll use constantly are:
- tag selects by tag name, like h3 or a
- .class selects by class name, like .product_pod
- #id selects by id, like #content
When you're unsure which selector to use, open DevTools in your browser, inspect the element, and look for stable tags or class names that describe what you want.
Here's a compact example using Books to Scrape. Each book card lives inside article.product_pod, so that's our anchor.
library(rvest)
# 'doc' is the parsed HTML document from earlier
books <- doc |>
# Select all book cards on the page
html_elements("ol.row article.product_pod")
# How many books did we find?
length(books)
At this point, books is a list of HTML nodes. In the next steps, we'll drill into each node to extract titles, links, prices, or ratings.
Extract text and attributes
After selecting nodes, the next step is usually pulling text and attributes.
- html_text2() extracts readable text and trims whitespace for you.
- html_attr() extracts attributes like href, src, title, or anything else defined on the element.
Below is a small example that builds a clean tibble with a title, URL, and rating. On this site, the rating isn't plain text — it's encoded in a class name like star-rating Three, so we grab the class and clean it up.
library(httr2)
library(rvest)
# 'doc' is the parsed HTML document from earlier
books <- doc |>
html_elements("ol.row article.product_pod")
# Quick check: how many book cards are on the page?
length(books)
# Store the book cards (same selection, clearer variable name)
cards <- doc |>
html_elements("ol.row article.product_pod")
# Extract book titles from the "title" attribute
titles <- cards |>
html_element("h3 a") |>
html_attr("title")
# Extract relative URLs to the book detail pages
urls <- cards |>
html_element("h3 a") |>
html_attr("href")
# Extract the full rating class (e.g. "star-rating Three")
rating_class <- cards |>
html_element("p.star-rating") |>
html_attr("class")
# Turn "star-rating Three" into just "Three"
ratings <- stringr::str_remove(rating_class, "^star-rating\\s+")
# Combine everything into a tidy data frame
books_df <- tibble::tibble(
title = titles,
url = urls,
rating = ratings
)
books_df
You'll end up with a clean table like this:
title url rating
<chr> <chr> <chr>
1 A Light in the Attic cata… Three
2 Tipping the Velvet cata… One
3 Soumission cata… One
This pattern is the workhorse for most static pages: select a list of nodes, extract a few fields from each one, and bind everything into a data frame you can analyze or store.
Handle pagination manually
Static sites often paginate with a simple pattern like page-2.html, page-3.html, or a ?page= query parameter. Manual pagination is just: loop pages → fetch via ScrapingBee → parse → extract rows → bind into one table.
A clean way to do this in R is purrr::map_dfr(), which runs a function for each page and row-binds the results as it goes. Adding a small Sys.sleep() keeps your scraper polite and lowers the odds of hitting rate limits.
library(httr2)
library(rvest)
library(purrr)
api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")
if (api_key == "") stop("SCRAPINGBEE_API_KEY is not set")
# Limit how many pages you visit (keep this small while testing)
max_pages <- 3
pages <- 1:max_pages
# Books to Scrape uses a predictable URL pattern for listing pages
base <- "https://books.toscrape.com/catalogue/page-%d.html"
base_site <- "https://books.toscrape.com/"
fetch_page <- function(page_num) {
# Build the target page URL (page-1.html, page-2.html, ...)
page_url <- sprintf(base, page_num)
# Fetch the page through ScrapingBee
resp <- request("https://app.scrapingbee.com/api/v1/") |>
req_url_query(
api_key = api_key,
url = page_url
) |>
req_perform()
# Parse HTML directly from the response
doc <- resp_body_html(resp)
# Each book card is an <article class="product_pod">
cards <- doc |>
html_elements("ol.row article.product_pod")
# Full title is stored in the `title` attribute
titles <- cards |>
html_element("h3 a") |>
html_attr("title")
# Links are relative on this site, so we convert them to absolute URLs
rel_urls <- cards |>
html_element("h3 a") |>
html_attr("href")
urls <- paste0(base_site, rel_urls)
# Rating is encoded in the class name (e.g. "star-rating Three")
rating_class <- cards |>
html_element("p.star-rating") |>
html_attr("class")
ratings <- stringr::str_remove(rating_class, "^star-rating\\s+")
# Return one tibble per page; map_dfr() will bind them together
tibble::tibble(
page = page_num,
title = titles,
url = urls,
rating = ratings
)
}
all_books <- map_dfr(pages, function(p) {
# Small pause to keep traffic smooth and reduce rate-limit risk
Sys.sleep(0.5)
fetch_page(p)
})
all_books
This is "manual pagination" because you decide which page numbers to request. Next, we'll also paginate by following the "next" link in the HTML, which is usually more robust when the total number of pages can change.
Make sure you've got the packages installed before running the code:
install.packages("purrr")
install.packages("dplyr")
You should see another tidy table as output (just bigger, because it includes multiple pages).
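Here's the "follow the next link" variant as a rough sketch. On Books to Scrape, the next page link lives in ul.pager li.next a; we keep following it until it disappears, with a small page cap as a safety net while testing. The fetch_doc() helper name is just illustrative.
library(httr2)
library(rvest)

api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")

fetch_doc <- function(page_url) {
  request("https://app.scrapingbee.com/api/v1/") |>
    req_url_query(api_key = api_key, url = page_url) |>
    req_perform() |>
    resp_body_html()
}

current_url <- "https://books.toscrape.com/catalogue/page-1.html"
max_pages <- 5  # safety cap while testing
pages <- list()

while (!is.null(current_url) && length(pages) < max_pages) {
  Sys.sleep(0.5)  # stay polite
  doc <- fetch_doc(current_url)
  # Collect titles from the current listing page
  pages[[length(pages) + 1]] <- tibble::tibble(
    title = doc |>
      html_elements("article.product_pod h3 a") |>
      html_attr("title")
  )
  # The "next" link is relative; NA means we're on the last page
  next_rel <- doc |>
    html_element("ul.pager li.next a") |>
    html_attr("href")
  current_url <- if (is.na(next_rel)) NULL else xml2::url_absolute(next_rel, current_url)
}

all_titles <- dplyr::bind_rows(pages)
all_titles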
Scrape tables and lists
Static pages often hide the best stuff in either tables or lists. Tables are great because the structure is already there. Lists are everywhere in the wild: product grids, search results, feature bullets, you name it.
Either way, the workflow doesn't change: select the right nodes, extract values, then clean things into something you can actually work with.
Scrape an HTML table with html_table()
If the page uses a real <table>, rvest::html_table() is the quickest way to get a data frame out of it. It reads rows and cells for you, and you can clean the result after.
Here's an example that grabs the demographics table from the APA sample page. That table includes multi-row headers and section rows, so expect a bit of cleanup. The point here is the basic move: select the table node, convert it to a data frame, then clean it.
# Fetch the page through ScrapingBee, then parse the HTML
resp <- httr2::request("https://app.scrapingbee.com/api/v1/") |>
httr2::req_url_query(
api_key = api_key,
url = "https://apastyle.apa.org/style-grammar-guidelines/tables-figures/sample-tables#demographic"
) |>
httr2::req_perform()
# Parse HTML directly from the response
doc <- httr2::resp_body_html(resp)
# Grab the first table on the page (you can make this selector stricter if needed)
tbl_node <- rvest::html_element(doc, "table")
# Convert the HTML <table> to a data frame
raw_tbl <- rvest::html_table(tbl_node)  # fill = TRUE is deprecated; modern rvest fills ragged rows automatically
raw_tbl
This gives you a real data frame fast. The tradeoff is that messy tables often need a little cleanup right after.
Common cleanup steps include trimming whitespace (so " Female" and "Female" don't become two different values) and converting "numbers as strings" like "250,000" into numeric values.
# Simple cleanup pass: trim whitespace in all character columns
clean_tbl <- raw_tbl
for (col in names(clean_tbl)) {
if (is.character(clean_tbl[[col]])) {
clean_tbl[[col]] <- trimws(clean_tbl[[col]])
}
}
clean_tbl
That's usually step one. Next up is converting numeric-looking columns that contain commas, currency symbols, or %.
to_number <- function(x) {
# Convert "250,000" -> 250000, "£51.77" -> 51.77, "45%" -> 45
x <- gsub("[,%£$]", "", x)
x <- gsub("\\s+", "", x)
suppressWarnings(as.numeric(x))
}
maybe_numeric <- function(x) {
# Only try converting character columns
if (!is.character(x)) return(x)
out <- to_number(x)
# Keep the original if conversion produces mostly NA
if (mean(is.na(out)) > 0.8) x else out
}
# Convert columns that look numeric while leaving text columns untouched
clean_tbl2 <- as.data.frame(
lapply(clean_tbl, maybe_numeric),
stringsAsFactors = FALSE
)
clean_tbl2
This is a solid "starter cleaning pass" you can reuse a lot: trim first, then gently convert numeric-ish columns without wrecking the text ones.
Scrape list-like content by selecting items and mapping rows
Not everything is a <table>. A lot of "tables" on modern sites are really just cards or div-based grids that look tabular. In those cases, the move is: select one repeatable "item" node, then extract fields inside it.
"Books to Scrape" is a perfect example. Each book is one card. You grab all cards, then pull title, link, and rating into rows.
# Fetch the page through ScrapingBee
resp <- httr2::request("https://app.scrapingbee.com/api/v1/") |>
httr2::req_url_query(
api_key = api_key,
url = "https://books.toscrape.com/"
) |>
httr2::req_perform()
# Parse HTML
doc <- httr2::resp_body_html(resp)
# Select all book cards on the page
cards <- rvest::html_elements(doc, "ol.row article.product_pod")
# Turn each card into one row, then bind them together
rows <- lapply(cards, function(card) {
# Title lives in the "title" attribute
title <- rvest::html_element(card, "h3 a") |>
rvest::html_attr("title")
# Link is relative, so we'll prefix it later
rel_url <- rvest::html_element(card, "h3 a") |>
rvest::html_attr("href")
# Rating is encoded in a class like "star-rating Three"
rating_class <- rvest::html_element(card, "p.star-rating") |>
rvest::html_attr("class")
rating <- stringr::str_remove(rating_class, "^star-rating\\s+")
# Return a one-row data frame so we can rbind() later
data.frame(
title = trimws(title),
url = paste0("https://books.toscrape.com/", rel_url),
rating = trimws(rating),
stringsAsFactors = FALSE
)
})
# Bind all one-row data frames into one table
books_df <- do.call(rbind, rows)
books_df
This is the core "card grid" pattern: find the repeatable block, extract a few fields from each block, and bind the results into a data frame.
The same idea works for real HTML lists. If the page uses <ul> / <ol>, select the li nodes and pull the text:
# Example pattern for bullet lists
items <- rvest::html_elements(doc, "ul li")
list_df <- data.frame(
# html_text2() gives you cleaner text (trimmed + more readable)
item = trimws(rvest::html_text2(items)),
stringsAsFactors = FALSE
)
list_df
Main takeaway: card grids and lists are just "repeatable blocks." Find the block selector, extract the fields you need, clean whitespace, and convert strings to numbers where it makes sense. After that, it's just another dataset in R.
Handling APIs and HTTP requests with httr2
Once you move past simple HTML pages, you'll usually hit two kinds of targets: JSON APIs and "real" web endpoints that expect browser-ish requests. httr2 is the clean way to handle both in R, and ScrapingBee can sit in front of those requests when you want extra reliability.
A useful mental model is: httr2 builds requests and parses responses, while ScrapingBee handles scraping headaches like IP rotation, blocking, and optional rendering. The nice part is that your httr2 code barely changes, which makes this approach scale from quick scripts to production jobs.
Here's a short, realistic example that fetches a public JSON endpoint through ScrapingBee and parses the response with jsonlite.
# Read API key from the environment
api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")
if (api_key == "") stop("SCRAPINGBEE_API_KEY is not set")
# Public JSON API endpoint we'll fetch through ScrapingBee
target_url <- "https://httpbingo.org/json"
# Build and send the request
resp <- httr2::request("https://app.scrapingbee.com/api/v1/") |>
httr2::req_url_query(
api_key = api_key,
url = target_url
) |>
httr2::req_perform()
# Get the raw JSON as text
json_txt <- httr2::resp_body_string(resp)
# Parse JSON into an R list
data <- jsonlite::fromJSON(json_txt, simplifyVector = TRUE)
# Quick sanity check: what keys did we get back?
names(data)
This gives you a regular R list you can inspect, transform, or turn into a data frame.
You could wrangle JSON as raw text with base R string functions, but it gets clunky fast. For scraping and API work, jsonlite is usually the right tradeoff between simplicity and control.
Set headers and cookies
With httr2, you can add query parameters, headers, and cookies just like in any other HTTP client. The only extra thing to keep in mind with ScrapingBee is that there are two "layers" of headers:
- Headers sent to ScrapingBee itself.
- Headers that ScrapingBee forwards to the target website.
By default, headers set with req_headers() apply to the ScrapingBee API request. If you actually want headers to reach the target site, you need to enable header forwarding and prefix those headers with Spb-. ScrapingBee strips the prefix and forwards them to the destination.
Here's a compact example that mimics a browser user agent, forwards a header, and sends a cookie to the target website.
# Read API key from the environment
api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")
if (api_key == "") stop("SCRAPINGBEE_API_KEY is not set")
target_url <- "https://httpbingo.org/cookies"
resp <- httr2::request("https://app.scrapingbee.com/api/v1/") |>
httr2::req_url_query(
api_key = api_key,
url = target_url,
# ScrapingBee-specific way to send cookies to the target site
cookies = "demo_cookie=hello_from_r"
) |>
httr2::req_headers(
# Headers prefixed with Spb- will be forwarded to the target site
"Spb-User-Agent" =
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36"
) |>
httr2::req_url_query(
# Enable forwarding of Spb-* headers
forward_headers = TRUE
) |>
httr2::req_perform()
# Check the HTTP status
httr2::resp_status(resp)
# Parse the JSON response
json_txt <- httr2::resp_body_string(resp)
data <- jsonlite::fromJSON(json_txt)
data
data$cookies
If everything worked, you should see:
"hello_from_r"
That confirms the cookie was received by httpbingo.org. This is the basic pattern you'll reuse whenever a site expects specific headers or session cookies.
Authenticate with login forms
Sometimes you need to be logged in to see the page you want. For simple, old-school login forms, the pattern is pretty straightforward:
- send a POST request with the form fields
- grab the session cookie from the response headers
- reuse that cookie on the next requests
This works best on sites that still use classic server-side sessions. It gets messy fast when the login flow is multi-step, heavily JavaScript-driven, or protected by CAPTCHAs, MFA, device fingerprinting, or short-lived one-time tokens. In those cases, you'll usually need a different approach (an official API, browser automation, or you just won't be able to scrape it reliably).
Below is a practical example using Hacker News. The login form posts to /login and uses fields like acct and pw. We send a POST request through ScrapingBee, extract the session cookie (often user=...) from the response headers, then request the profile page to prove we're authenticated.
A few ScrapingBee-specific details matter here:
- ScrapingBee supports POST and PUT by sending a real POST/PUT request to its main endpoint (/api/v1/) with api_key and url. The request body is forwarded to the target site.
- Response headers coming from the target site are returned with a Spb- prefix in normal mode, so you can tell them apart from ScrapingBee's own headers.
- If you want to forward your own headers (cookies, User-Agent, etc.) to the target site, set forward_headers=true and prefix those headers with Spb-.
Example: Log in to Hacker News and fetch your profile page
This script reads credentials from environment variables:
- HN_USERNAME
- HN_PASSWORD
Then it:
- POSTs the login form through ScrapingBee
- Extracts the user session cookie from the response headers
- Requests the profile page with that cookie and scrapes a safe "proof" field (karma)
The goal is to keep the flow simple and explicit so you can adapt it to other form-based logins that work the same way.
# --- Env vars (keep secrets out of code) --------------------------------------
api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")
if (api_key == "") stop("SCRAPINGBEE_API_KEY is not set")
hn_user <- Sys.getenv("HN_USERNAME")
hn_pass <- Sys.getenv("HN_PASSWORD")
if (hn_user == "" || hn_pass == "") stop("HN_USERNAME / HN_PASSWORD are not set")
# Toggle verbose debugging output
debug <- FALSE
# --- Target URLs --------------------------------------------------------------
login_url <- "https://news.ycombinator.com/login"
profile_url <- paste0("https://news.ycombinator.com/user?id=", hn_user)
# Use a realistic UA when forwarding headers to HN (reduces bot-y responses)
browser_ua <- "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120 Safari/537.36"
# --- Helper: build a ScrapingBee request consistently -------------------------
# Important: when forward_headers=TRUE, any header prefixed with "Spb-" is forwarded to the target site.
sb_request <- function(target_url) {
httr2::request("https://app.scrapingbee.com/api/v1/") |>
httr2::req_url_query(
api_key = api_key,
url = target_url,
forward_headers = TRUE
) |>
httr2::req_headers(
# Forwarded to Hacker News
"Spb-User-Agent" = browser_ua,
# Only seen by ScrapingBee (not forwarded)
"User-Agent" = "httr2 + ScrapingBee HN demo"
)
}
# --- Helper: extract "name=value" for a cookie from Set-Cookie header(s) -------
extract_cookie_pair <- function(set_cookie_headers, cookie_name) {
if (is.null(set_cookie_headers) || length(set_cookie_headers) == 0) return(NA_character_)
sc <- as.character(set_cookie_headers)
# Some tooling smushes multiple Set-Cookie lines together; split them safely.
sc <- unlist(strsplit(sc, "\r\n|\n", perl = TRUE))
# Find any chunk that contains the cookie (not necessarily at the start)
pat <- paste0("(^|[;\\s,])", cookie_name, "=")
idx <- which(grepl(pat, sc, perl = TRUE))
if (length(idx) == 0) return(NA_character_)
# Pull just "cookie=value" (strip attributes like Path/Expires/Secure/HttpOnly)
line <- sc[idx[1]]
m <- regexpr(paste0(cookie_name, "=[^;]+"), line, perl = TRUE)
if (m[1] == -1) return(NA_character_)
regmatches(line, m)
}
# --- Helper: collect *all* cookie pairs (optional but robust) ------------------
# Some sites require multiple cookies; easiest is to forward everything we got.
collect_cookie_header <- function(set_cookie_headers) {
if (is.null(set_cookie_headers) || length(set_cookie_headers) == 0) return(NA_character_)
sc <- as.character(set_cookie_headers)
sc <- unlist(strsplit(sc, "\r\n|\n", perl = TRUE))
# Keep only the "name=value" part from each Set-Cookie line
pairs <- gsub(";.*$", "", sc)
pairs <- pairs[nzchar(pairs)]
if (!length(pairs)) return(NA_character_)
paste(unique(pairs), collapse = "; ")
}
# ==============================================================================
# 1) Login (POST form via ScrapingBee)
# ==============================================================================
resp_login <- sb_request(login_url) |>
httr2::req_method("POST") |>
httr2::req_body_form(
# These field names come from HN's login form inputs
acct = hn_user,
pw = hn_pass,
goto = "news"
) |>
httr2::req_perform()
# Throws a clean error if not 2xx
httr2::resp_check_status(resp_login)
# Some quick "did it work?" checks (useful while debugging)
body_login <- httr2::resp_body_string(resp_login)
if (debug) {
cat("\nLogged-in signal (logout present?): ",
grepl("logout", body_login, ignore.case = TRUE), "\n")
cat("Bad login present?: ",
grepl("Bad login", body_login, ignore.case = TRUE), "\n")
}
# ==============================================================================
# 2) Pull cookies from the response headers
# ==============================================================================
hdrs <- httr2::resp_headers(resp_login)
# ScrapingBee may prefix target headers with "Spb-". We don't assume the exact name.
# Instead, grab any header key that contains "set-cookie" (case-insensitive).
set_cookie_keys <- grep("set-cookie", names(hdrs), ignore.case = TRUE, value = TRUE)
set_cookie <- unlist(hdrs[set_cookie_keys], use.names = FALSE)
if (debug) {
cat("\n--- header names ---\n")
print(sort(names(hdrs)))
cat("\n--- set-cookie-ish headers ---\n")
print(hdrs[set_cookie_keys])
cat("\n--- first 500 chars of body ---\n")
cat(substr(body_login, 1, 500), "\n")
}
# Option A (simple): only forward the HN "user" cookie
user_cookie_pair <- extract_cookie_pair(set_cookie, "user")
if (is.na(user_cookie_pair)) {
stop("Could not find the session cookie in the response (expected a 'user=...' cookie).")
}
# Option B (more robust): forward all cookie pairs received
# cookie_header <- collect_cookie_header(set_cookie)
# ==============================================================================
# 3) Fetch profile using forwarded cookie (prove login worked)
# ==============================================================================
resp_profile <- sb_request(profile_url) |>
httr2::req_headers(
# To send cookies to the target site through ScrapingBee,
# forward them using the Spb-Cookie header.
"Spb-Cookie" = user_cookie_pair
# If using Option B above, swap to: "Spb-Cookie" = cookie_header
) |>
httr2::req_perform()
httr2::resp_check_status(resp_profile)
# Parse the profile page HTML
doc <- httr2::resp_body_html(resp_profile)
# Safe proof we're authenticated: extract the "karma" value from the profile page
karma_text <- rvest::html_element(
doc,
xpath = "//td[normalize-space()='karma:']/following-sibling::td[1]"
) |>
rvest::html_text2()
cat("Logged in as:", hn_user, "\n")
cat("Karma:", karma_text, "\n")
Short recap of what's happening:
- The login step returns one (or more) session cookies in Set-Cookie
- We extract user=... (or forward all cookies for robustness)
- Then we reuse that cookie to fetch a page that normally requires a logged-in session
After running the script, you should see something like:
Logged in as: YOUR_USERNAME
Karma: 1
Work with JSON APIs
When the thing you want is already exposed as JSON, life gets way easier: you just GET the endpoint, parse the JSON, and turn whatever nested structure you got back into a tidy data frame.
Even though ScrapingBee really shines for HTML pages, it's also useful as a "one way to request stuff" layer. Same request style, same headers, same proxying options if you need them later. The main difference is simple: instead of resp_body_html(), you use resp_body_json().
Below we'll hit a demo endpoint (https://httpbingo.org/json), parse the response, and flatten the nested slideshow$slides list into a data frame. Then we'll do a tiny bit of filtering in R.
httpbingo is basically a testing playground like httpbin — great for examples because the responses are predictable.
Example: Call a JSON endpoint and tidy nested data
library(httr2)
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
# Small helper: replace NULL / empty values with a default
`%||%` <- function(x, y) if (is.null(x) || length(x) == 0) y else x
# Read API key from the environment
api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")
if (api_key == "") stop("SCRAPINGBEE_API_KEY is not set")
json_url <- "https://httpbingo.org/json"
# Fetch the JSON endpoint through ScrapingBee
resp <- request("https://app.scrapingbee.com/api/v1/") |>
req_url_query(
api_key = api_key,
url = json_url
) |>
req_perform()
# Fail fast if the request didn't succeed
resp_check_status(resp)
# Parse JSON directly into nested R lists
payload <- resp_body_json(resp)
# Pull the top-level slideshow object
slideshow <- payload$slideshow
# Start by creating a 1-row tibble with a list-column for slides
slides_df <- tibble(
title = slideshow$title %||% NA_character_,
date = slideshow$date %||% NA_character_,
author = slideshow$author %||% NA_character_,
slides = list(slideshow$slides %||% list())
) |>
# Expand to one row per slide
unnest_longer(slides) |>
# Extract fields from each slide object
mutate(
slide_type = map_chr(slides, ~ .x$type %||% NA_character_),
slide_title = map_chr(slides, ~ .x$title %||% NA_character_),
# Keep items as a list-column (some slides don't have it)
items = map(slides, ~ .x$items %||% list())
) |>
# Drop the raw slide object once fields are extracted
select(-slides)
# Example filtering after parsing
slides_all <- slides_df |>
filter(slide_type == "all")
print(slides_all)
What's happening here:
- ScrapingBee fetches the endpoint and returns clean JSON
- resp_body_json() gives you nested R lists (this is normal)
- You turn nested arrays into rows using unnest_longer()
- Deep fields are pulled out with map_*() helpers
- Any slicing or filtering is done after parsing, in R
A few practical notes:
- Install missing packages before running this: install.packages("tidyr")
- Keeping nested data (like items) as list-columns is fine — unnest later if needed
- For JSON APIs, fetch once and do most of your logic locally; it's simpler and faster
Implement retries and rate limiting
Even when you're routing requests through ScrapingBee (proxies, IP rotation, less bot-looking traffic), you still need polite client-side behavior:
- Servers can have temporary hiccups (timeouts, flaky upstreams, random 5xx).
- You can get throttled (429 Too Many Requests) if you go too fast.
- ScrapingBee itself can return transient errors that usually disappear if you retry a moment later.
So the move is simple: retry when it makes sense, and pace your requests so you don't hammer anything.
A practical pattern
Below is a small helper that does a few useful things:
- Sends a request
- Retries on transient failures (429, 5xx, network errors)
- Uses exponential backoff so retries don't pile up instantly
- Adds a small delay between successful requests to stay polite
library(httr2)
api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")
if (api_key == "") stop("SCRAPINGBEE_API_KEY is not set")
# Build a ScrapingBee request for a target URL
# Keeping this as a helper avoids copy-paste everywhere
sb_request <- function(target_url) {
request("https://app.scrapingbee.com/api/v1/") |>
req_url_query(
api_key = api_key,
url = target_url
) |>
req_headers(
"User-Agent" = "httr2 + ScrapingBee demo"
)
}
# Retry + rate limiting helper
perform_with_retries <- function(
req,
retries = 5,
retry_on = c(429, 500, 502, 503, 504),
base_delay = 1, # seconds
max_delay = 15, # cap the wait time
jitter = 0.25 # random noise to avoid synchronized retries
) {
attempt <- 1
while (TRUE) {
# Catch low-level network errors (no HTTP response)
resp <- tryCatch(req_perform(req), error = function(e) e)
if (inherits(resp, "error")) {
if (attempt > retries) stop(resp)
delay <- min(max_delay, base_delay * (2 ^ (attempt - 1)))
delay <- delay + runif(1, 0, jitter)
message(sprintf(
"Network error, retry %d/%d in %.2fs",
attempt, retries, delay
))
Sys.sleep(delay)
attempt <- attempt + 1
next
}
status <- resp_status(resp)
# Success: return immediately
if (status >= 200 && status < 300) {
return(resp)
}
# Retry only for transient status codes
if (!(status %in% retry_on) || attempt > retries) {
resp_check_status(resp) # throws a useful error
stop("Request failed")
}
# Exponential backoff + jitter
delay <- min(max_delay, base_delay * (2 ^ (attempt - 1)))
delay <- delay + runif(1, 0, jitter)
message(sprintf(
"Got HTTP %d, retry %d/%d in %.2fs",
status, attempt, retries, delay
))
Sys.sleep(delay)
attempt <- attempt + 1
}
}
# ------------------------------------------------------------------------------
# Example usage: scrape multiple pages politely
urls <- c(
"https://httpbingo.org/json",
"https://httpbingo.org/headers",
"https://httpbingo.org/ip"
)
# Pause between successful requests
polite_delay <- 1.0
responses <- vector("list", length(urls))
for (i in seq_along(urls)) {
req <- sb_request(urls[[i]])
resp <- perform_with_retries(req, retries = 5)
responses[[i]] <- resp
# Even on success, pause between requests
if (i < length(urls)) Sys.sleep(polite_delay)
}
# Show one result
cat(resp_body_string(responses[[1]]), "\n")
Rules of thumb to keep in mind:
- Retry only when there's a good chance the error is temporary (429, 5xx, network issues).
- Don't retry forever; always cap retries and delays.
- Add a small delay between requests even when everything works.
- For large jobs, batch and cache results so you're not re-fetching the same URLs over and over (a simple caching sketch follows below).
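Caching can be as simple as writing each response body to disk and reading it back on later runs. Here's a rough sketch that reuses the sb_request() and perform_with_retries() helpers from above; the cache directory and file-naming scheme are just examples:
cache_dir <- "cache"
dir.create(cache_dir, showWarnings = FALSE)

# Turn a URL into a safe file name inside the cache directory
cache_path <- function(url) {
  file.path(cache_dir, paste0(gsub("[^A-Za-z0-9]", "_", url), ".html"))
}

fetch_cached <- function(url) {
  path <- cache_path(url)
  # Reuse the cached copy if we've already fetched this URL
  if (file.exists(path)) {
    return(paste(readLines(path, warn = FALSE), collapse = "\n"))
  }
  resp <- perform_with_retries(sb_request(url))
  body <- resp_body_string(resp)
  writeLines(body, path)
  body
}

# First call hits the network; the second one comes straight from disk
html_1 <- fetch_cached("https://httpbingo.org/json")
html_2 <- fetch_cached("https://httpbingo.org/json")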
Crawling and navigating multiple pages
"Crawling" just means following links from one page to another in a controlled way. You don't need a special crawler package for that; you need a loop, a queue, and some discipline.
In this section, we'll build a small, polite crawler in R using:
- httr2 for requests
- rvest for parsing
- a few simple limits so we don't hammer the site
As an example, we'll use books.toscrape.com again.
What we'll do:
- Start from the homepage
- Extract a few book links
- Visit each book's detail page
- Collect basic info (title, price, availability)
- Optionally visualize the crawl structure as a small network graph
This is not a full search-engine crawler, and that's intentional.
You could use Rcrawler for this kind of task, but it's basically abandoned and often doesn't install cleanly on recent R versions. A small custom crawler like the one below is usually simpler and easier to control.
A simple, polite crawl pattern
Before writing any crawling code, it helps to be clear about the rules you're following. This keeps things predictable and prevents accidental over-scraping.
The rules we'll stick to here:
- Limit scope: only touch a small set of pages
- Limit depth: homepage → individual book pages
- Rate limit: pause between requests
- Deduplicate URLs: never visit the same page twice
We'll start by loading the packages and defining a few crawl-wide settings.
library(httr2)
library(rvest)
library(dplyr)
library(stringr)
library(tibble)
These give us HTTP requests, HTML parsing, and some light data wrangling. Nothing fancy.
base_url <- "https://books.toscrape.com/"
start_url <- base_url
# Be polite: wait between requests
polite_delay <- 1 # seconds
# Hard limit so the crawl stays small and predictable
max_books <- 5
# ScrapingBee API key
api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")
if (api_key == "") stop("SCRAPINGBEE_API_KEY is not set")
At this point, we've defined the crawl boundaries. In the next step, we'll fetch the start page, extract book links, and begin walking the site one page at a time.
Step 1: Fetch the listing page and extract book links
First, we fetch the homepage through ScrapingBee and parse the HTML. This gives us the listing page that contains all the book cards.
resp <- request("https://app.scrapingbee.com/api/v1/") |>
req_url_query(
api_key = api_key,
url = start_url
) |>
req_perform()
# Fail fast if the request didn't succeed
resp_check_status(resp)
# Parse the listing page HTML
doc <- resp_body_html(resp)
Now we extract links to individual book pages. Each book card contains an a tag under h3, so we target that selector and pull the href attribute.
book_links <- doc |>
html_elements("ol.row li article h3 a") |>
html_attr("href") |>
head(max_books)
Those links are relative, so we convert them into absolute URLs before crawling them.
library(xml2)
book_urls <- xml2::url_absolute(book_links, base_url)
At this point, we've built a small crawl queue: a limited list of book pages we're allowed to visit next.
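With only a handful of links from a single listing page, duplicates aren't really a concern yet, but the "never visit the same page twice" rule matters as soon as you crawl deeper. A minimal way to enforce it is a running record of visited URLs; the should_visit() helper below is just one possible sketch:
# Drop duplicate links up front
book_urls <- unique(book_urls)

# Keep track of everything we've already fetched
visited <- character(0)

should_visit <- function(url) {
  if (url %in% visited) return(FALSE)
  visited <<- c(visited, url)
  TRUE
}

# Filter the queue so each URL is only crawled once
book_urls <- Filter(should_visit, book_urls)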
Step 2: Visit each book page and extract data
Now we loop over the book URLs, pause between requests, and scrape the details from each page. This is the "crawler" part: walk the queue, extract fields, store rows.
books <- vector("list", length(book_urls))
for (i in seq_along(book_urls)) {
# Be polite: pause before each request
Sys.sleep(polite_delay)
url <- book_urls[[i]]
message("Scraping: ", url)
# Fetch the detail page (through ScrapingBee)
resp <- request("https://app.scrapingbee.com/api/v1/") |>
req_url_query(
api_key = api_key,
url = url
) |>
req_perform()
# Fail fast if the request didn't succeed
resp_check_status(resp)
# Parse HTML
doc <- resp_body_html(resp)
# Extract a few fields into a one-row tibble
books[[i]] <- tibble(
url = url,
title = doc |> html_element("h1") |> html_text2(),
price = doc |> html_element(".price_color") |> html_text2(),
availability = doc |>
html_element(".instock.availability") |>
html_text2() |>
str_squish()
)
}
# Bind all one-row tibbles into a single data frame
books_df <- bind_rows(books)
print(books_df)
This is a complete multi-page crawl:
- one listing page to collect links
- multiple detail pages to extract data
- structured output as a tidy table
Optional: Visualize the crawl as a network graph
Sometimes it's useful to see the crawl: what page linked to what, and how big your queue got.
Here's a tiny visualization showing:
- the homepage
- one edge from the homepage to each book page we visited
# Install with:
# install.packages("igraph")
library(igraph)
Now let's write the code:
# ... previous code for the steps above ...
# Wrap long titles so labels don't explode the chart
wrap_label <- function(x, width = 18) {
stringr::str_wrap(stringr::str_squish(x), width = width)
}
# Build labels: one for the homepage, then one per scraped book
vertex_labels <- c(
"Homepage",
wrap_label(books_df$title)
)
# Push multi-line labels down a bit (plot text is centered by default)
vertex_labels <- paste0("\n", vertex_labels)
# Map labels to vertex names (graph vertices are identified by URL)
names(vertex_labels) <- c(start_url, books_df$url)
# Create edges: homepage -> each book URL
edges <- tibble(from = start_url, to = book_urls)
g <- graph_from_data_frame(edges, directed = TRUE)
# Attach labels to vertices
V(g)$label <- vertex_labels[V(g)$name]
# Layout for nicer spacing (force-directed)
set.seed(1)
lay <- igraph::layout_with_fr(g)
# Render to a PNG file
png("crawl-graph.png", width = 1600, height = 1000, res = 150)
plot(
g,
layout = lay,
vertex.size = 28,
vertex.label.cex = 0.6,
vertex.label.color = "black",
vertex.label.dist = 1.0,
vertex.label.degree = -pi/2,
edge.arrow.size = 0.4,
main = "Mini crawl graph: listing → book pages"
)
dev.off()
This saves a crawl-graph.png image in your working directory.

It won't win any design awards, but it's a clean sanity check, and a nice base if you want to visualize bigger crawls later (categories, pagination, depth > 1, etc.).
Scraping JavaScript-heavy sites with Chromote
Some sites don't put any useful data in the initial HTML. The page loads, JavaScript runs, API calls fire in the background, and then the content appears. If you just GET the page, you'll only see an empty shell.
In those cases, you have two main options:
- Run a real browser from R using
chromote, and let JavaScript execute locally. - Delegate rendering to ScrapingBee, which can render JavaScript server-side and handle a lot of anti-bot work for you.
Both approaches are valid. The key is knowing when to use which.
Chromote vs ScrapingBee (quick intuition)
Chromote:
- Controls a real Chrome / Chromium instance.
- Great for debugging, exploration, and understanding how a page works.
- Lets you inspect the DOM after JavaScript has finished running.
- Not ideal for large-scale scraping (heavier, slower, more fragile).
ScrapingBee (with JS rendering):
- JavaScript runs remotely; you fetch the already-rendered result.
- Better suited for scale, reliability, and anti-bot protection.
- Less interactive, but usually "just works" once configured.
- Strong default choice for production scraping.
A very common workflow looks like this:
Use chromote to explore and debug the page → switch to ScrapingBee for the actual scraping.
That way you get the best of both worlds: visibility while figuring things out, and stability when it's time to run the scraper for real.
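On the ScrapingBee side, switching rendering on is typically just an extra query parameter on the same request you've been sending all along. At the time of writing the parameter is render_js (check the current API docs to confirm names and defaults); everything after the fetch is the usual rvest workflow:
library(httr2)
library(rvest)

api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")

# Ask ScrapingBee to execute JavaScript before returning the HTML
resp <- request("https://app.scrapingbee.com/api/v1/") |>
  req_url_query(
    api_key = api_key,
    url = "https://www.scrapingbee.com/pricing/",
    render_js = "true"
  ) |>
  req_perform()

doc <- resp_body_html(resp)

# Same selectors as in the chromote pricing example later in this section
doc |>
  html_elements("section#pricing .price-plan strong") |>
  html_text2()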
Install and configure chromote
chromote lets R talk to Chrome through the Chrome DevTools Protocol. You need a local Chrome or Chromium installed — no Selenium, no drivers to babysit.
Install chromote:
install.packages("chromote")
On first use, chromote will try to find a compatible Chrome / Chromium binary automatically. On most machines, this just works.
On CI or Docker you may need extra flags and dependencies; if chromote can't find Chrome, point the CHROMOTE_CHROME environment variable at the binary or install Chromium.
Start a Chrome session:
library(chromote)
# Start a new Chrome session
b <- ChromoteSession$new()
At this point:
- Chrome is running in the background
- You can navigate pages, run JavaScript, and inspect the DOM
- You're looking at the page after JS execution, not the raw HTML
Navigate to a page and wait for JavaScript:
b$Page$navigate("https://example.com")
# Give the page time to load and execute JS
Sys.sleep(3)
Sys.sleep() is crude but fine for demos. When you're actually debugging pages, you'll usually:
- wait for a specific DOM node to appear
- or poll until a JS condition becomes true
Extract the rendered HTML:
html <- b$Runtime$evaluate(
"document.documentElement.outerHTML"
)$result$value
At this point, html contains the fully rendered DOM — exactly what Chrome sees after JavaScript has run.
You can now feed this into rvest like any other HTML page:
library(rvest)
doc <- read_html(html)
doc
From here on, it's normal rvest: CSS selectors, html_elements(), html_text2(), and so on.
Clean up when done:
b$close()
Always close your sessions. Chrome instances pile up fast if you forget, and suddenly your machine sounds like it's about to take off.
When chromote is most useful
chromote shines when things don't add up: you see data in the browser, but your scraper gets an empty page.
It's especially useful when you want to figure out:
- which API calls the page makes behind the scenes
- when certain elements appear in the DOM
- whether data is injected by JavaScript or fetched separately
It's also a great fit when you're prototyping or debugging a single site and want full visibility into what's going on.
Once you understand the flow (for example: "data comes from /api/search after page load"), you can usually skip browser automation entirely. At that point, either call the API directly or let ScrapingBee render the page for you at scale.
Navigate dynamic pages
With chromote, you're driving a real Chrome tab. The basic loop for JS-heavy pages is always the same:
- navigate
- wait for a selector that proves the page is "ready"
- optionally click something (links, buttons)
- wait again for the new page state
Below we'll open ScrapingBee's pricing page, wait for the pricing section to appear, click the first "Try now" button, and then sanity-check that we landed on the registration page.
library(chromote)
library(jsonlite)
pricing_url <- "https://www.scrapingbee.com/pricing/"
register_url <- "https://app.scrapingbee.com/account/register"
# Start a Chrome session
b <- ChromoteSession$new()
# Simple helper: wait until a CSS selector exists in the DOM
wait_for_selector <- function(b, selector, timeout_sec = 15, poll_sec = 0.25) {
deadline <- Sys.time() + timeout_sec
while (Sys.time() < deadline) {
ok <- b$Runtime$evaluate(
sprintf(
"document.querySelector(%s) !== null",
jsonlite::toJSON(selector)
)
)$result$value
if (isTRUE(ok)) return(invisible(TRUE))
Sys.sleep(poll_sec)
}
stop("Timed out waiting for selector: ", selector)
}
# 1) Navigate to the pricing page and wait until JS-rendered content appears
b$Page$navigate(pricing_url)
wait_for_selector(b, "section#pricing")
# 2) Click the first "Try now" button inside the pricing section
# Using JS click keeps this example simple
b$Runtime$evaluate("
(() => {
const root = document.querySelector('section#pricing');
const link = root?.querySelector('a.btn');
link?.click();
return !!link;
})()
")$result$value
# 3) Wait for a selector that exists on the registration page
wait_for_selector(b, "#wrapper h3")
At this point, you've confirmed two important things:
- the pricing page actually rendered (the #pricing section exists)
- navigation worked (you reached a page where #wrapper h3 is present)
That "wait for selector" pattern is the single most useful trick when dealing with dynamic pages. It's far more reliable than sleeping for a fixed amount of time.
Once the page is fully rendered in the browser, extraction is easy:
- grab the rendered HTML
- pass it into read_html()
- reuse the same rvest selectors you already know
Parsing logic doesn't change at all, only how you obtain the HTML does.
Example: Verify the register page content
Once we know we've landed on the register page, we can extract content just like we would on any static page. The only difference is that the HTML comes from the browser, not a direct HTTP request.
library(rvest)
# Grab the fully rendered DOM from Chrome
html_register <- b$Runtime$evaluate(
"document.documentElement.outerHTML"
)$result$value
# Parse it with rvest
doc_register <- read_html(html_register)
# Extract a couple of safe proof elements
join_title <- doc_register |>
html_element("#wrapper h3") |>
html_text2()
join_subtitle <- doc_register |>
html_element("#wrapper p") |>
html_text2()
cat("Title:", join_title, "\n")
cat("Subtitle:", join_subtitle, "\n")
If everything worked, you should see something like:
Title: Join us!
Subtitle: Get Started with 1000 free API calls. No credit card required.
This confirms that JavaScript navigation succeeded and that you're scraping the rendered page, not an empty shell. From here on, it's regular rvest again.
Example: Extract plan names + monthly prices from the pricing page
If you want to scrape the pricing grid itself, the flow is exactly the same: navigate, wait for JS to finish, grab the rendered HTML, then parse it with rvest.
library(dplyr)
library(tibble)
# Go back to the pricing page
b$Page$navigate(pricing_url)
# Wait until the pricing section is present
wait_for_selector(b, "section#pricing")
# Grab the fully rendered HTML
html_pricing <- b$Runtime$evaluate(
"document.documentElement.outerHTML"
)$result$value
# Parse with rvest
doc_pricing <- read_html(html_pricing)
# Extract plan cards and map them into rows
plans <- doc_pricing |>
html_elements("section#pricing .price-plan") |>
lapply(\(node) {
# Plan name
name <- node |>
html_element("strong") |>
html_text2()
# Monthly price
# XPath avoids CSS issues with brackets in class names
price <- node |>
html_element(xpath = ".//span[contains(@class, 'text-[32px]')]") |>
html_text2()
tibble(
plan = name,
price_month = price
)
}) |>
bind_rows()
print(plans)
You should get something like:
# A tibble: 4 × 2
plan price_month
<chr> <chr>
1 Freelance $49/mo
2 Startup $99/mo
3 Business $249/mo
4 Business + $599/mo
The big takeaway here is simple: chromote is just your HTML getter for JS-rendered pages. Once you have the rendered HTML, everything else is the same rvest workflow you already used for static pages.
And when you're done, don't forget:
b$close()
Leaving Chrome sessions open is the fastest way to slowly melt your machine.
Bypass anti-scraping measures (the boring, reliable way)
"Bypass" here does not mean fighting the site or trying to break protections. In practice, the most effective approach is the opposite: blend in, slow down, and use the right tools.
Modern sites block scrapers mostly because of bad behavior patterns, not because scraping itself is impossible.
Use ScrapingBee for the heavy lifting
ScrapingBee already takes care of a lot of the annoying parts:
- Proxies and IP rotation reduce obvious scraping fingerprints.
- JavaScript rendering makes pages behave as if a real browser loaded them.
- Header forwarding lets you send realistic User-Agents, cookies, and session data.
- Session continuity avoids triggering bot checks on every single request.
When a site is JS-heavy or sensitive, prefer:
- rendering via ScrapingBee
- fewer, higher-quality requests
- reusing cookies instead of starting from scratch every time
This almost always beats running your own headless browsers at scale.
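For session continuity specifically, ScrapingBee documents a session_id parameter that routes multiple requests through the same IP. The sketch below assumes that parameter, so confirm the exact name and limits in the current docs before relying on it:
library(httr2)

api_key <- Sys.getenv("SCRAPINGBEE_API_KEY")

# One id per logical browsing session; reuse it across related requests
session_id <- sample(1e6, 1)

fetch_in_session <- function(target_url) {
  request("https://app.scrapingbee.com/api/v1/") |>
    req_url_query(
      api_key = api_key,
      url = target_url,
      session_id = session_id
    ) |>
    req_perform()
}

# Both requests should now come from the same visitor's IP
resp_1 <- fetch_in_session("https://httpbingo.org/ip")
resp_2 <- fetch_in_session("https://httpbingo.org/ip")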
Keep your R-side behavior polite
Even with ScrapingBee in front, your R code still matters:
- Add delays between requests.
- Retry only when errors are likely temporary (429, 5xx).
- Avoid downloading the same page over and over — cache when you can.
- Don't parallelize aggressively unless the site clearly tolerates it.
Most blocks come from "too fast, too often", not from a single request.
Respect robots.txt and site terms
Before scraping:
- Check robots.txt to see what the site explicitly disallows.
- Read the site's terms if you're scraping more than public metadata.
- If an official API exists, use it.
Scraping responsibly isn't just about ethics — it also makes your scrapers more reliable.
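If you'd rather check this from R than by eyeballing the file, the robotstxt package (not used elsewhere in this guide, so treat it as an optional extra) can tell you whether a path is allowed:
# install.packages("robotstxt")
library(robotstxt)

# TRUE means the path is not disallowed for the default "*" user agent
paths_allowed(
  paths = "/catalogue/page-2.html",
  domain = "books.toscrape.com"
)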
When you hit a block, adjust patterns (don't escalate)
If you start seeing:
- empty responses
- CAPTCHA pages
- repeated 403 or 429 errors
Resist the urge to "fight" the site.
Instead:
- slow down
- reuse sessions
- reduce concurrency
- enable JS rendering
- inspect responses manually (often you're being redirected, not hard-blocked)
Most scraping failures are configuration issues, not adversarial ones.
A good rule of thumb
If your scraper behaves like a calm human with a slow mouse and a stable browser, most sites will leave you alone.
ScrapingBee helps you look like that human. Your R code should help you act like one.
Turn your R web scrapers into reliable pipelines
At this point, you've got the full toolbox:
- environment-based config (API keys, credentials, clean setup)
- HTML scraping with rvest
- JSON APIs with httr2
- retries and rate limiting for stability
- JavaScript-heavy pages with chromote
- ScrapingBee handling proxies, rendering, headers, and sessions
That's not toy scraping anymore — that's a production-ready flow.
You can now:
- scrape real sites, not just static demos
- handle logins, cookies, and pagination
- extract structured data from HTML and APIs
- debug JS-heavy pages instead of guessing
- run scrapers politely, reliably, and at scale
The next step is simple: build something real. Pick a small project:
- track pricing pages
- monitor competitors or marketplaces
- collect datasets for analysis
- automate data collection you currently do by hand
Then wire it up with ScrapingBee, grab your API key, and let it run.
ScrapingBee takes care of the boring-but-hard parts (proxies, rendering, anti-bot friction) so you can focus on the data and logic in R.
👉 Sign up, get your API key, and turn your scripts into jobs that actually hold up.
Start small. Run it for real.
If you're curious how the same ideas translate to other ecosystems, check out our guide on web scraping in C++. The tools change, the principles don't.
Conclusion
Web scraping with R doesn't have to be fragile or hacky. With the right setup, clear patterns, and a bit of discipline, you can build scrapers that actually hold up in the real world.
The key ideas are simple: separate fetching from parsing, be polite with requests, understand how pages load data, and use tools like ScrapingBee and chromote when the site demands it. Once those pieces are in place, scraping becomes just another data pipeline — not a guessing game.
Frequently asked questions (FAQs)
Do I always need ScrapingBee when doing web scraping with R?
No. If the site is simple and the data is already in the HTML response, R alone is enough. ScrapingBee becomes useful when you hit blocks, JavaScript-rendered pages, unstable IPs, or bot checks. It also makes production scraping more predictable by handling retries, proxies, and rendering for you.
How do I convert browser cURL requests into R scraping code?
Copy the request as cURL from your browser's DevTools, then translate the URL, headers, cookies, and method into an httr2 request. If you want a shortcut, ScrapingBee's converter tool can generate an R version that you can clean up and reuse.
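httr2 also ships a helper for exactly this: curl_translate() takes the copied cURL command as a string and prints equivalent httr2 code you can paste into your script. A quick sketch (the printed output shown in the comments is approximate):
library(httr2)

# Paste the command copied from DevTools ("Copy as cURL") as one string
curl_translate(
  'curl "https://httpbingo.org/headers" -H "Accept: application/json"'
)
# Prints something like:
# request("https://httpbingo.org/headers") |>
#   req_headers(Accept = "application/json") |>
#   req_perform()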
What should I do when my R scraper does not see any data?
First, inspect the raw HTML you're parsing and confirm the content is actually there. Many sites load data with JavaScript, so the initial HTML is basically empty. In that case, enable JS rendering with ScrapingBee or scrape the underlying JSON endpoint. Also double-check your selectors.
How can I scale my R scrapers beyond a single script?
Treat your scraper like a small project, not a one-off script. Split logic into functions, add logging, and keep inputs configurable with environment variables or config files. Use retries and rate limits. For higher volume, parallelize carefully or schedule runs with cron or a worker.

