A Complete Guide to Web Scraping in R

Written by Mantas Kemėšius

If you're already using R for data analysis, you have a powerful secret weapon: you can scrape, clean, analyze, and visualize data all in the same environment. No context switching, no exporting CSV files between tools—just a seamless workflow from raw HTML to publication-ready insights.

Web scraping in R has evolved dramatically. What started with simple HTML parsing has grown into a sophisticated ecosystem capable of handling everything from static pages to JavaScript-heavy modern web applications. Whether you're collecting research data, monitoring prices, or building datasets that don't exist yet, R has the tools you need.

This guide will take you on a complete journey through web scraping in R:

  • Setting up your R scraping environment
  • Understanding the key packages and when to use each one
  • Scraping simple sites with rvest
  • Handling complex HTTP requests, authentication, and APIs with httr2
  • Conquering modern JavaScript-heavy sites with chromote
  • A complete real-world project: scraping book data from Books to Scrape
  • Analyzing your scraped data to extract insights

By the end, you'll have the skills to scrape virtually any website, and the wisdom to know which tool to reach for.

    1. The R Web Scraping Toolkit: Key Packages

    Choosing the right tool is half the battle. R's scraping ecosystem offers packages for every scenario, from simple HTML parsing to driving a full browser. Here's your guide to the essential tools:

    The "Go-To" Packages (Static Scraping)

    rvest: Your primary HTML parsing tool. Part of the tidyverse, rvest was inspired by Python's Beautiful Soup and makes extracting data from HTML incredibly intuitive. It's perfect for static pages where all the content is in the initial HTML response.

    httr2: The modern package for making HTTP requests. While rvest's read_html() works for simple cases, httr2 gives you complete control over headers, authentication, cookies, sessions, and rate limiting—essential for professional scraping.

    The "Expert Tier" (Dynamic/JavaScript Scraping)

    chromote: The modern, lightweight solution for controlling a headless Chrome browser. When websites load content with JavaScript (React, Vue, Angular apps), chromote lets you interact with the page just like a real user, waiting for content to load before scraping it.

    RSelenium: The older, more heavyweight option for browser automation. While it requires running a separate Selenium server and Java, it offers more features for complex interactions. For most projects, though, chromote is simpler and sufficient.

    The "Helper" Packages

    jsonlite: Essential for working with JSON data from web APIs. Many modern sites use APIs that return JSON—this package makes parsing it effortless.

    xml2: The low-level XML/HTML parsing engine that powers rvest. Most users won't interact with it directly, but it's the foundation of the ecosystem.

    The "Specialist"

    Rcrawler: When you need to systematically crawl an entire website—not just scrape a few pages—this package provides the infrastructure for link discovery, depth control, and parallel crawling.
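
    Here's a minimal sketch of a crawl with Rcrawler, using only a handful of its arguments (see ?Rcrawler for the full argument list and defaults; the values below are illustrative):

    R
    library(Rcrawler)

    # Crawl up to two levels deep with a couple of parallel connections,
    # respecting robots.txt and pausing between requests; crawled pages are
    # saved to a local folder and summarized in the INDEX data frame
    Rcrawler(
      Website = "https://books.toscrape.com/",
      MaxDepth = 2,
      no_cores = 2,
      no_conn = 2,
      Obeyrobots = TRUE,
      RequestsDelay = 1
    )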

    Comparison Table: Which Package When?

    Scenario                     | Recommended Package    | Why?
    Simple HTML page             | rvest                  | Clean, tidy syntax for parsing static HTML
    Need custom headers/auth     | httr2 + rvest          | Full HTTP control, then parse with rvest
    JavaScript-rendered content  | chromote + rvest       | Browser renders the JS, then scrape the result
    JSON API                     | httr2 + jsonlite       | Fetch JSON, parse into R data structures
    Crawl entire website         | Rcrawler               | Built-in link discovery and crawling logic
    Complex form interactions    | chromote or RSelenium  | Simulate user clicking, typing, navigating

    2. Getting Started: Setting Up Your Scraping Lab

    Before we write our first scraper, let's get your environment ready.

    Install R and RStudio

    If you're new to R:

  • Download R from CRAN
  • Download RStudio (the IDE) from Posit

    Install the Core Packages

    Open RStudio and run this command to install everything you'll need:

    R
    install.packages(c("rvest", "httr2", "chromote", "tidyverse", "jsonlite"))

    This will install:

  • rvest for HTML parsing
  • httr2 for advanced HTTP requests
  • chromote for browser automation
  • tidyverse for data manipulation (includes dplyr, ggplot2, and more)
  • jsonlite for working with JSON

    System Dependencies (Linux Users)

    If you're on Linux, you may need to install system libraries before the R packages will work:

    BASH
    # Ubuntu/Debian
    sudo apt-get install libcurl4-openssl-dev libxml2-dev libssl-dev

    # Fedora/CentOS
    sudo dnf install libcurl-devel libxml2-devel openssl-devel

    Mac and Windows users typically don't need to worry about this.

    Load the Libraries

    At the start of each scraping script, load your tools:

    R
    library(rvest)
    library(httr2)
    library(chromote)
    library(tidyverse)
    library(jsonlite)

    Now you're ready to scrape!

    3. The Basics: Scraping Static Sites with rvest

    Let's start with the fundamentals. Most web scraping follows this workflow:

  • Fetch the HTML
  • Select the elements you want
  • Extract the data
  • Structure it into a data frame

    Step 1: Read the HTML

    Use read_html() to fetch a webpage:

    R
    url <- "https://www.imdb.com/title/tt0111161/"
    page <- read_html(url)

    The page object now contains the entire HTML document.

    Step 2: Select Elements with CSS Selectors

    This is the heart of scraping. You need to tell rvest which parts of the page you want.

    CSS Selectors are the standard way to target HTML elements. They're the same selectors used in web development:

  • h1 selects all <h1> headings
  • .class-name selects elements with that class
  • #id-name selects the element with that ID
  • div.class p selects <p> tags inside <div class="class">

    Use your browser's Developer Tools (F12) to inspect elements and find their selectors. Right-click an element → Inspect → right-click in the HTML → Copy → Copy selector.

    In rvest:

  • html_elements() (plural) returns all matching elements
  • html_element() (singular) returns the first match

    R
    # Get the first h1
    title <- page |>
      html_element("h1")

    💡 Pro tip: Check out this CSS Selectors Cheat Sheet for quick reference.

    Step 3: Extract the Data

    Once you've selected elements, extract their content:

    Text content: Use html_text2() (note the "2"—it handles whitespace better than the older html_text()):

    R
    title_text <- page |>
      html_element("h1") |>
      html_text2()

    print(title_text)
    # "The Shawshank Redemption"

    Attributes: Use html_attr() to get things like links or image sources:

    R
    # Get all links
    links <- page |>
      html_elements("a") |>
      html_attr("href")

    # Get image sources
    images <- page |>
      html_elements("img") |>
      html_attr("src")

    Step 4: Handle Tables Automatically

    HTML tables are so common that rvest has a magic function for them:

    R
    tables <- page |>
      html_table()

    # Returns a list of data frames, one for each <table> on the page
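
    Because html_table() returns a list, you typically pull out the table you want by position, for example:

    R
    first_table <- tables[[1]]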

    Complete Example: Scraping IMDB

    Let's scrape The Shawshank Redemption's IMDB page for its title, rating, and main cast:

    R
    library(rvest)
    library(tibble)  # for tibble()

    url <- "https://www.imdb.com/title/tt0111161/"
    page <- read_html(url)

    # Get the movie title
    title <- page |>
      html_element("h1") |>
      html_text2()

    # Get the rating (you'll need to inspect the page to find the right selector)
    rating <- page |>
      html_element("[data-testid='hero-rating-bar__aggregate-rating__score'] span") |>
      html_text2()

    # Get the cast list (first 5 actors)
    cast <- page |>
      html_elements("[data-testid='title-cast-item'] a") |>
      html_text2() |>
      head(5)

    # Combine into a data frame
    movie_data <- tibble(
      title = title,
      rating = rating,
      cast = paste(cast, collapse = ", ")
    )

    print(movie_data)

    That's the basic rvest workflow! For many static sites, this is all you need.

    4. Level Up: Advanced HTTP Control with httr2

    What happens when read_html() isn't enough? Many websites require:

  • Custom headers (like a User-Agent to identify yourself)
  • Authentication (login credentials)
  • Session management (maintaining cookies across requests)
  • Rate limiting (being polite and not hammering the server)

    This is where httr2 shines. It gives you complete control over HTTP requests.

    Setting Custom Headers

    Some sites block requests that don't include a User-Agent (they look like bots). Always identify yourself:

    R
    library(httr2)

    req <- request("https://example.com") |>
      req_headers(
        "User-Agent" = "MyResearchScraper/1.0 (your.email@example.com)"
      )

    resp <- req_perform(req)
    html <- resp |> resp_body_html()

    Now you can pass html to rvest functions.
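
    For example, to grab the page title from that parsed response:

    R
    page_title <- html |>
      html_element("title") |>
      html_text2()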

    Handling Authentication

    For APIs or sites requiring login:

    R
    # Basic authentication (common for APIs)
    req <- request("https://api.example.com/data") |>
      req_auth_basic(username = "user", password = "pass")

    # Form-based login (like a website login page)
    req <- request("https://example.com/login") |>
      req_body_form(
        username = "user",
        password = "pass"
      )

    Managing Sessions and Cookies

    When you log in to a site, the server gives you a session cookie. To maintain that session across multiple requests:

    R
    # httr2 does not persist cookies across separate requests by default;
    # use req_cookie_preserve() to store them in a cookie file
    cookie_file <- tempfile("cookies")

    session <- request("https://example.com") |>
      req_cookie_preserve(cookie_file)

    # Login
    login_resp <- session |>
      req_url_path_append("login") |>
      req_body_form(username = "user", password = "pass") |>
      req_perform()

    # Now make authenticated requests (cookies persist via the cookie file)
    data_resp <- session |>
      req_url_path_append("data") |>
      req_perform()

    Rate Limiting: Be a Good Citizen

    Scraping too fast can overload servers and get you blocked. Always add delays:

    R
    req <- request("https://example.com") |>
      req_throttle(rate = 10)  # Max 10 requests per second

    # Also add retry logic for temporary failures
    req <- req |>
      req_retry(max_tries = 3, backoff = ~5)

    Scraping JSON APIs

    Many modern sites use APIs that return JSON. Here's how to fetch and parse it:

    R
    library(jsonlite)

    req <- request("https://api.example.com/data")
    resp <- req_perform(req)

    # Parse the JSON response into nested R lists
    data <- resp |> resp_body_json()

    # For a JSON array of records, let jsonlite simplify straight to a data frame
    df <- resp |>
      resp_body_string() |>
      fromJSON() |>
      as_tibble()

    httr2 is your Swiss Army knife for any HTTP complexity. Pair it with rvest for HTML parsing, and you can handle almost any static site.

    5. The Expert Tier: Scraping JavaScript Sites with chromote

    Here's the challenge: modern websites are often built with frameworks like React, Vue, or Angular. When you load these sites, the initial HTML is mostly empty; the content is rendered afterwards by JavaScript.

    If you try read_html() on these sites, you'll get an empty shell. rvest can't run JavaScript.

    The Solution: Headless Browsers

    chromote controls a real Chrome browser (in "headless" mode—no visible window). The browser loads the page, runs all the JavaScript, and then you scrape the rendered result.

    Installation

    You need:

  • The R package: install.packages("chromote")
  • Google Chrome installed on your computer

    chromote will automatically find and use your Chrome installation.
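
    If you want to confirm which Chrome binary chromote found (assuming a standard installation), you can check:

    R
    # Path to the Chrome executable chromote will launch
    chromote::find_chrome()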

    The chromote Workflow

    Here's the basic pattern:

    R
    library(chromote)
    library(rvest)

    # 1. Start a new browser session
    b <- ChromoteSession$new()

    # 2. Navigate to the page
    b$Page$navigate("https://example.com")

    # 3. WAIT for the page to load (critical!)
    b$Page$loadEventFired()

    # 4. Get the rendered HTML
    html <- b$Runtime$evaluate("document.documentElement.outerHTML")$result$value

    # 5. Parse with rvest
    page <- read_html(html)

    # Now use normal rvest functions
    data <- page |>
      html_elements(".dynamic-content") |>
      html_text2()

    # 6. Close the browser
    b$close()

    Advanced: Waiting for Specific Elements

    Sometimes loadEventFired() isn't enough—you need to wait for a specific element to appear:

    R
    b <- ChromoteSession$new()
    b$Page$navigate("https://example.com")

    # Wait for a specific element using JavaScript; awaitPromise = TRUE makes
    # the call block until the promise resolves
    b$Runtime$evaluate('
      new Promise(resolve => {
        const check = () => {
          if (document.querySelector(".target-element")) {
            resolve(true);
          } else {
            setTimeout(check, 100);
          }
        };
        check();
      })
    ', awaitPromise = TRUE)

    # Now scrape

    Interacting with the Page

    You can also click buttons, fill forms, and scroll—anything a real user can do:

    R
    # Click a button
    b$Runtime$evaluate('document.querySelector("#load-more").click()')

    # Wait for new content to load
    Sys.sleep(2)

    # Fill a form field
    b$Runtime$evaluate('document.querySelector("#search-box").value = "query"')

    # Submit the form
    b$Runtime$evaluate('document.querySelector("#search-form").submit()')

    chromote is powerful but comes with overhead—it's slower and more resource-intensive than rvest. Use it only when you need it.

    6. Full Project: Scraping Book Data from “Books to Scrape”

    Let’s put everything together with a real-world project that’s friendly for learning and explicitly designed for scraping: Books to Scrape. We’ll collect book titles, prices, availability, ratings, and category.

    The Challenge

    Books to Scrape is mostly static HTML and paginated, which makes it perfect for rvest without a headless browser. We’ll need to:

  • Traverse categories
  • Handle multi-page pagination
  • Extract structured data from product cards

    The Plan

    We’ll build two functions:

  • Get all book page URLs for a given category, following pagination
  • Scrape an individual book page for detailed fields

    Then we'll iterate over all categories and combine the results into a single data frame.

    Helpers

    R
    library(rvest)
    library(xml2)
    library(stringr)
    library(purrr)
    library(dplyr)
    library(readr)

    BASE <- "https://books.toscrape.com/"

    rel_to_abs <- function(href, base) {
      xml2::url_absolute(href, base)
    }

    Step 1: Get categories

    R
    get_categories <- function(base = BASE) {
      page <- read_html(base)
      a <- page |>
        html_elements(".side_categories ul li ul li a")
      tibble(
        category = a |> html_text2() |> str_squish(),
        url = a |> html_attr("href") |> rel_to_abs(base)
      )
    }

    cats <- get_categories()
    head(cats)

    Step 2: Get all book URLs in a category (with pagination)

    R
    get_book_urls_in_category <- function(cat_url) {
      urls <- character()
      next_url <- cat_url
      repeat {
        page <- read_html(next_url)

        page_urls <- page |>
          html_elements("section div ol li article.product_pod h3 a") |>
          html_attr("href") |>
          rel_to_abs(next_url)

        urls <- c(urls, page_urls)

        next_rel <- page |>
          html_element("li.next a") |>
          html_attr("href")

        if (is.null(next_rel) || is.na(next_rel)) break

        next_url <- rel_to_abs(next_rel, next_url)
      }
      unique(urls)
    }

    Step 3: Scrape a single book page

    R
    scrape_book_page <- function(book_url) {
      page <- read_html(book_url)

      title <- page |> html_element(".product_main h1") |> html_text2()
      price <- page |> html_element(".product_main .price_color") |> html_text2() |>
        stringr::str_extract("[0-9]+\\.[0-9]{2}") |> as.numeric()
      availability <- page |> html_element(".product_main .availability") |> html_text2() |> str_squish()
      rating_class <- page |> html_element(".product_main .star-rating") |> html_attr("class")
      category <- page |> html_elements(".breadcrumb li a") |> html_text2() |> dplyr::last()

      tibble(
        title = title,
        price = price,
        availability = availability,
        rating_class = rating_class,
        category = category,
        url = book_url
      )
    }

    Step 4: Scrape all categories

    R
    all_books <- map_dfr(seq_len(nrow(cats)), function(i) {
      cat("Category:", cats$category[i], "\n")
      book_urls <- get_book_urls_in_category(cats$url[i])
      map_dfr(book_urls, function(u) {
        Sys.sleep(0.2)
        scrape_book_page(u)
      })
    })

    # Save
    write_csv(all_books, "books_to_scrape.csv")
    saveRDS(all_books, "books_to_scrape.rds")
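
    A quick sanity check on what came back; Books to Scrape lists 1,000 titles, so the row count should be close to that if every category was crawled:

    R
    nrow(all_books)
    dplyr::glimpse(all_books)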

    Schema: Expected columns in books_to_scrape.csv

    Column        | Type    | Description
    title         | text    | Book title
    price         | number  | Price extracted from the page (numeric)
    availability  | text    | Availability string, e.g., "In stock (22 available)"
    rating_class  | text    | CSS class containing the star rating label
    category      | text    | Book category from the breadcrumb
    url           | text    | Absolute URL of the product page

    Optional: Robust scraping with error handling

    R
    scrape_book_page_safe <- function(u) {
      tryCatch(scrape_book_page(u), error = function(e) {
        message("Error: ", e$message)
        NULL
      })
    }

    all_books <- map_dfr(seq_len(nrow(cats)), function(i) {
      book_urls <- get_book_urls_in_category(cats$url[i])
      map_dfr(book_urls, scrape_book_page_safe)
    })

    🎉 You now have a clean dataset of books, their prices, availability, ratings, and categories from Books to Scrape.

    7. The Payoff: Analyzing Your Books-to-Scrape Data

    Now that you’ve scraped book data, let’s analyze it directly in R.

    Load and tidy the data

    R
    library(tidyverse)

    books <- read_csv("books_to_scrape.csv")

    # Example normalization helper: price is already numeric in our CSV,
    # but coerce defensively in case it was read in as text
    normalize_price <- function(x) {
      if (is.character(x)) readr::parse_number(x) else as.numeric(x)
    }

    books <- books |>
      mutate(
        price_num = normalize_price(price),
        availability_n = readr::parse_number(availability),
        rating = case_when(
          str_detect(rating_class, "One") ~ 1,
          str_detect(rating_class, "Two") ~ 2,
          str_detect(rating_class, "Three") ~ 3,
          str_detect(rating_class, "Four") ~ 4,
          str_detect(rating_class, "Five") ~ 5,
          TRUE ~ NA_real_
        )
      )

    Question 1: What’s the price distribution overall and by category?

    R
    library(ggplot2)

    # Overall
    ggplot(books, aes(price_num)) +
      geom_histogram(binwidth = 5, fill = "#4C78A8", color = "white") +
      labs(title = "Book Price Distribution", x = "Price", y = "Count") +
      theme_minimal()

    # By category (top 8 by count)
    top_cats <- books |> count(category, sort = TRUE) |> slice_head(n = 8) |> pull(category)

    ggplot(filter(books, category %in% top_cats), aes(price_num, fill = category)) +
      geom_histogram(binwidth = 5, color = "white", alpha = 0.85) +
      facet_wrap(~ category, scales = "free_y") +
      guides(fill = "none") +
      labs(title = "Price Distribution by Category", x = "Price", y = "Count") +
      theme_minimal()

    Question 2: Do higher-rated books cost more?
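
    A minimal way to eyeball this is to compare price distributions across the star ratings derived above:

    R
    # Price distribution by star rating
    ggplot(filter(books, !is.na(rating)), aes(x = factor(rating), y = price_num)) +
      geom_boxplot(fill = "#F58518") +
      labs(title = "Price by Star Rating", x = "Star rating", y = "Price") +
      theme_minimal()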

    Teaser: Price vs. Availability (quick plot)

    R
    # Simple teaser scatter: price vs. in-stock count
    library(ggplot2)

    p_teaser <- ggplot(books, aes(x = availability_n, y = price_num)) +
      geom_point(alpha = 0.6, color = "#3E7CB1") +
      labs(
        title = "Books: Price vs. Availability",
        x = "Copies in stock",
        y = "Price"
      ) +
      theme_minimal()

    p_teaser

    # Save if you want to embed the static image
    # ggsave("books_price_vs_availability.png", p_teaser, width = 6, height = 4, dpi = 150)

    Question 3: Availability patterns by category

    R
    books |>
      group_by(category) |>
      summarise(
        avg_in_stock = mean(availability_n, na.rm = TRUE),
        n = n()
      ) |>
      arrange(desc(avg_in_stock)) |>
      slice_head(n = 15) |>
      ggplot(aes(x = reorder(category, avg_in_stock), y = avg_in_stock)) +
      geom_col(fill = "#54A24B") +
      coord_flip() +
      labs(title = "Average Copies In Stock by Category (Top 15)", x = "Category", y = "Avg copies in stock") +
      theme_minimal()

    Question 4: Top 10 most expensive titles

    R
    books |>
      arrange(desc(price_num)) |>
      transmute(title, category, rating, price = price_num) |>
      slice_head(n = 10) |>
      knitr::kable()

    This is the power of web scraping in R: you can go from raw HTML to actionable insights without ever leaving your IDE.

    Conclusion: From Beginner to Expert Scraper

    We've covered a lot of ground. You now know:

  • When to use rvest (static HTML), httr2 (advanced HTTP), or chromote (JavaScript sites)
  • How to select and extract data with CSS selectors
  • How to handle authentication, sessions, and rate limiting
  • How to scrape modern JavaScript-heavy websites
  • How to build a complete scraping pipeline from collection to analysis

    Ethical Scraping: Be a Good Citizen

    Before you scrape, remember:

  • Check robots.txt: Visit example.com/robots.txt to see what the site allows (see the sketch after this list)
  • Respect rate limits: Use Sys.sleep() or req_throttle() to avoid overloading servers
  • Identify yourself: Use a descriptive User-Agent with contact info
  • Honor terms of service: Don't scrape data you're not supposed to access
  • Consider the API: Many sites offer APIs—use them when available
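
    For the robots.txt check, one convenient option is the robotstxt package (not used elsewhere in this guide, so treat this as an optional extra):

    R
    # install.packages("robotstxt")
    library(robotstxt)

    # TRUE/FALSE: is crawling this path allowed for generic bots?
    paths_allowed(paths = "/catalogue/", domain = "books.toscrape.com")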

    When Scraping Gets Hard: Let FoxScrape Handle It

    Building and maintaining production scrapers is challenging. You have to:

  • Manage rotating proxies to avoid IP bans
  • Handle CAPTCHAs and anti-bot systems
  • Keep scrapers updated when websites change
  • Scale infrastructure for large-scale data collection
  • Monitor for failures and retries

    If you need reliable, large-scale data extraction without the maintenance burden, check out FoxScrape. It's a web scraping API that handles all the complexity (proxies, browser fingerprinting, JavaScript rendering, and CAPTCHA solving) so you can focus on analyzing data, not maintaining scrapers.

    Whether you're collecting market research, monitoring competitors, or building datasets for research, FoxScrape lets you scale from hundreds to millions of pages without managing infrastructure.

    Happy Scraping!

    Web scraping opens up a world of data that doesn't exist in neat CSV files. With R's powerful ecosystem, you have everything you need to collect, clean, and analyze that data—all in one place.

    Start small with rvest, level up with httr2, and when you hit JavaScript sites, bring out chromote. Before you know it, you'll be building scrapers that would have seemed impossible when you started.

    Now go forth and scrape responsibly! 🚀