A Complete Guide to Web Scraping in R

If you're already using R for data analysis, you have a powerful secret weapon: you can scrape, clean, analyze, and visualize data all in the same environment. No context switching, no exporting CSV files between tools—just a seamless workflow from raw HTML to publication-ready insights.
Web scraping in R has evolved dramatically. What started with simple HTML parsing has grown into a sophisticated ecosystem capable of handling everything from static pages to JavaScript-heavy modern web applications. Whether you're collecting research data, monitoring prices, or building datasets that don't exist yet, R has the tools you need.
This guide will take you on a complete journey through web scraping in R:
- rvest for HTML parsing
- httr2 for advanced HTTP requests
- chromote for browser automation on JavaScript-heavy sites

By the end, you'll have the skills to scrape virtually any website—and the wisdom to know which tool to reach for.
1. The R Web Scraping Toolkit: Key Packages
Choosing the right tool is half the battle. R's scraping ecosystem offers packages for every scenario, from simple HTML parsing to driving a full browser. Here's your guide to the essential tools:
The "Go-To" Packages (Static Scraping)
rvest: Your primary HTML parsing tool. Part of the tidyverse, rvest was inspired by Python's Beautiful Soup and makes extracting data from HTML incredibly intuitive. It's perfect for static pages where all the content is in the initial HTML response.
httr2: The modern package for making HTTP requests. While rvest's read_html() works for simple cases, httr2 gives you complete control over headers, authentication, cookies, sessions, and rate limiting—essential for professional scraping.
The "Expert Tier" (Dynamic/JavaScript Scraping)
chromote: The modern, lightweight solution for controlling a headless Chrome browser. When websites load content with JavaScript (React, Vue, Angular apps), chromote lets you interact with the page just like a real user, waiting for content to load before scraping it.
RSelenium: The older, more heavyweight option for browser automation. While it requires running a separate Selenium server and Java, it offers more features for complex interactions. For most projects, though, chromote is simpler and sufficient.
The "Helper" Packages
jsonlite: Essential for working with JSON data from web APIs. Many modern sites use APIs that return JSON—this package makes parsing it effortless.
xml2: The low-level XML/HTML parsing engine that powers rvest. Most users won't interact with it directly, but it's the foundation of the ecosystem.
The "Specialist"
Rcrawler: When you need to systematically crawl an entire website—not just scrape a few pages—this package provides the infrastructure for link discovery, depth control, and parallel crawling.
Comparison Table: Which Package When?
| Scenario | Recommended Package | Why? |
|---|---|---|
| Simple HTML page | rvest | Clean, tidy syntax for parsing static HTML |
| Need custom headers/auth | httr2 + rvest | Full HTTP control, then parse with rvest |
| JavaScript-rendered content | chromote + rvest | Browser renders JS, then scrape the result |
| JSON API | httr2 + jsonlite | Fetch JSON, parse into R data structures |
| Crawl entire website | Rcrawler | Built-in link discovery and crawling logic |
| Complex form interactions | chromote or RSelenium | Simulate user clicking, typing, navigating |
2. Getting Started: Setting Up Your Scraping Lab
Before we write our first scraper, let's get your environment ready.
Install R and RStudio
If you're new to R, install two things first: R itself from CRAN (https://cran.r-project.org) and the free RStudio Desktop IDE from Posit (https://posit.co). Install R first, then RStudio.
Install the Core Packages
Open RStudio and run this command to install everything you'll need:
```r
install.packages(c("rvest", "httr2", "chromote", "tidyverse", "jsonlite"))
```
This will install:
- rvest for HTML parsing
- httr2 for advanced HTTP requests
- chromote for browser automation
- tidyverse for data manipulation (includes dplyr, ggplot2, and more)
- jsonlite for working with JSON

System Dependencies (Linux Users)
If you're on Linux, you may need to install system libraries before the R packages will work:
```bash
# Ubuntu/Debian
sudo apt-get install libcurl4-openssl-dev libxml2-dev libssl-dev

# Fedora/CentOS
sudo dnf install libcurl-devel libxml2-devel openssl-devel
```
Mac and Windows users typically don't need to worry about this.
Load the Libraries
At the start of each scraping script, load your tools:
```r
library(rvest)
library(httr2)
library(chromote)
library(tidyverse)
library(jsonlite)
```
Now you're ready to scrape!
3. The Basics: Scraping Static Sites with rvest
Let's start with the fundamentals. Most web scraping follows the same basic workflow, which we'll walk through step by step.
Step 1: Read the HTML
Use read_html() to fetch a webpage:
```r
url <- "https://www.imdb.com/title/tt0111161/"
page <- read_html(url)
```
The page object now contains the entire HTML document.
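If you want to confirm what you got back, check the object's class: read_html() returns an xml_document, which is what every rvest function below operates on.

```r
class(page)
#> [1] "xml_document" "xml_node"
```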
Step 2: Select Elements with CSS Selectors
This is the heart of scraping. You need to tell rvest which parts of the page you want.
CSS Selectors are the standard way to target HTML elements. They're the same selectors used in web development:
- h1 selects all <h1> headings
- .class-name selects elements with that class
- #id-name selects the element with that ID
- div.class p selects <p> tags inside <div class="class">

Use your browser's Developer Tools (F12) to inspect elements and find their selectors. Right-click an element → Inspect → right-click in the HTML → Copy → Copy selector.
In rvest:
- html_elements() (plural) returns all matching elements
- html_element() (singular) returns the first match

```r
# Get the first h1
title <- page |>
  html_element("h1")
```
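To get a feel for these selectors without hitting a live site, here's a small self-contained sketch using rvest's minimal_html() on a made-up snippet of HTML (the element names and classes are invented for illustration):

```r
library(rvest)

# A tiny made-up document to practice selectors on
doc <- minimal_html('
  <h1>Site title</h1>
  <div class="review">
    <p id="summary">Great read.</p>
    <p>Second paragraph.</p>
  </div>
')

doc |> html_element("h1") |> html_text2()            # tag selector
doc |> html_elements(".review") |> length()          # class selector
doc |> html_element("#summary") |> html_text2()      # ID selector
doc |> html_elements("div.review p") |> html_text2() # descendant selector
```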
💡 Pro tip: Check out this CSS Selectors Cheat Sheet for quick reference.
Step 3: Extract the Data
Once you've selected elements, extract their content:
Text content: Use html_text2() (note the "2"—it handles whitespace better than the older html_text()):
```r
title_text <- page |>
  html_element("h1") |>
  html_text2()

print(title_text)
# "The Shawshank Redemption"
```
Attributes: Use html_attr() to get things like links or image sources:
```r
# Get all links
links <- page |>
  html_elements("a") |>
  html_attr("href")

# Get image sources
images <- page |>
  html_elements("img") |>
  html_attr("src")
```
Step 4: Handle Tables Automatically
HTML tables are so common that rvest has a magic function for them:
```r
tables <- page |>
  html_table()

# Returns a list of data frames, one for each <table> on the page
```
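If you only need one specific table rather than all of them, you can select that node first and parse just it; a small sketch (the selector here is illustrative):

```r
# Parse only the first <table>; swap in a more specific selector
# for the table you actually want
first_table <- page |>
  html_element("table") |>
  html_table()
```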
Complete Example: Scraping IMDB
Let's scrape The Shawshank Redemption's IMDB page for its title, rating, and main cast:
```r
library(rvest)
library(tibble)  # for tibble()

url <- "https://www.imdb.com/title/tt0111161/"
page <- read_html(url)

# Get the movie title
title <- page |>
  html_element("h1") |>
  html_text2()

# Get the rating (you'll need to inspect the page to find the right selector)
rating <- page |>
  html_element("[data-testid='hero-rating-bar__aggregate-rating__score'] span") |>
  html_text2()

# Get the cast list (first 5 actors)
cast <- page |>
  html_elements("[data-testid='title-cast-item'] a") |>
  html_text2() |>
  head(5)

# Combine into a data frame
movie_data <- tibble(
  title = title,
  rating = rating,
  cast = paste(cast, collapse = ", ")
)

print(movie_data)
```
That's the basic rvest workflow! For many static sites, this is all you need.
4. Level Up: Advanced HTTP Control with httr2
What happens when read_html() isn't enough? Many websites require:

- Custom headers (like a User-Agent) before they'll respond
- Authentication or a login step
- Cookies and sessions that persist across requests
- Polite rate limiting and retries

This is where httr2 shines. It gives you complete control over HTTP requests.
Setting Custom Headers
Some sites block requests that don't include a User-Agent (they look like bots). Always identify yourself:
```r
library(httr2)

req <- request("https://example.com") |>
  req_headers(
    "User-Agent" = "MyResearchScraper/1.0 (your.email@example.com)"
  )

resp <- req_perform(req)
html <- resp |> resp_body_html()
```
Now you can pass html to rvest functions.
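For example, continuing from the block above (assuming rvest is loaded and the page has an <h1>):

```r
# `html` is an xml_document, so the usual rvest verbs work on it
page_title <- html |>
  html_element("h1") |>
  html_text2()
```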
Handling Authentication
For APIs or sites requiring login:
```r
# Basic authentication (common for APIs)
req <- request("https://api.example.com/data") |>
  req_auth_basic(username = "user", password = "pass")

# Form-based login (like a website login page)
req <- request("https://example.com/login") |>
  req_body_form(
    username = "user",
    password = "pass"
  )
```
Managing Sessions and Cookies
When you log in to a site, the server gives you a session cookie. To maintain that session across multiple requests:
```r
# Persist cookies across requests with req_cookie_preserve(),
# which stores them in a file and sends them back automatically
cookie_jar <- tempfile(fileext = ".cookies")

base <- request("https://example.com") |>
  req_cookie_preserve(cookie_jar)

# Log in: the session cookie gets saved to the cookie jar
login_resp <- base |>
  req_url_path_append("login") |>
  req_body_form(username = "user", password = "pass") |>
  req_perform()

# Now make authenticated requests (the stored cookie is reused)
data_resp <- base |>
  req_url_path_append("data") |>
  req_perform()
```
Rate Limiting: Be a Good Citizen
Scraping too fast can overload servers and get you blocked. Always add delays:
```r
req <- request("https://example.com") |>
  req_throttle(rate = 10)  # Max 10 requests per second

# Also add retry logic for temporary failures
req <- req |>
  req_retry(max_tries = 3, backoff = ~5)
```
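Here's a minimal sketch of how throttling and retries fit into a multi-page loop (the URLs are placeholders; assumes httr2 is loaded). Because req_throttle() applies its limit per host by default, the delay carries across all of these requests:

```r
urls <- paste0("https://example.com/page/", 1:5)

pages <- lapply(urls, function(u) {
  request(u) |>
    req_throttle(rate = 2) |>   # at most ~2 requests per second to this host
    req_retry(max_tries = 3) |>
    req_perform() |>
    resp_body_html()
})
```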
Scraping JSON APIs
Many modern sites use APIs that return JSON. Here's how to fetch and parse it:
```r
library(jsonlite)
library(tibble)

req <- request("https://api.example.com/data")
resp <- req_perform(req)

# Parse JSON response; simplifyVector = TRUE collapses simple
# structures into vectors and data frames where possible
data <- resp |> resp_body_json(simplifyVector = TRUE)

# Depending on the API's structure, this may already be tabular;
# deeply nested responses may need flattening first
df <- as_tibble(data)
```
httr2 is your Swiss Army knife for any HTTP complexity. Pair it with rvest for HTML parsing, and you can handle almost any static site.
5. The Expert Tier: Scraping JavaScript Sites with chromote
Here's the challenge: modern websites are often built with frameworks like React, Vue, or Angular. When you load these sites, the initial HTML is mostly empty—the content is filled in afterwards by JavaScript.
If you try read_html() on these sites, you'll get an empty shell. rvest can't run JavaScript.
The Solution: Headless Browsers
chromote controls a real Chrome browser (in "headless" mode—no visible window). The browser loads the page, runs all the JavaScript, and then you scrape the rendered result.
Installation
You need:

- Google Chrome (or Chromium) installed on your machine
- The chromote package: install.packages("chromote")

chromote will automatically find and use your Chrome installation.
The chromote Workflow
Here's the basic pattern:
```r
library(chromote)
library(rvest)

# 1. Start a new browser session
b <- ChromoteSession$new()

# 2. Navigate to the page
b$Page$navigate("https://example.com")

# 3. WAIT for the page to load (critical!)
b$Page$loadEventFired()

# 4. Get the rendered HTML
html <- b$Runtime$evaluate("document.documentElement.outerHTML")$result$value

# 5. Parse with rvest
page <- read_html(html)

# Now use normal rvest functions
data <- page |>
  html_elements(".dynamic-content") |>
  html_text2()

# 6. Close the browser
b$close()
```
Advanced: Waiting for Specific Elements
Sometimes loadEventFired() isn't enough—you need to wait for a specific element to appear:
```r
b <- ChromoteSession$new()
b$Page$navigate("https://example.com")

# Wait for a specific element using JavaScript.
# awaitPromise = TRUE makes evaluate() block until the Promise resolves.
b$Runtime$evaluate('
  new Promise(resolve => {
    const check = () => {
      if (document.querySelector(".target-element")) {
        resolve(true);
      } else {
        setTimeout(check, 100);
      }
    };
    check();
  })
', awaitPromise = TRUE)

# Now scrape
```
Interacting with the Page
You can also click buttons, fill forms, and scroll—anything a real user can do:
```r
# Click a button
b$Runtime$evaluate('document.querySelector("#load-more").click()')

# Wait for new content to load
Sys.sleep(2)

# Fill a form field
b$Runtime$evaluate('document.querySelector("#search-box").value = "query"')

# Submit the form
b$Runtime$evaluate('document.querySelector("#search-form").submit()')
```
chromote is powerful but comes with overhead—it's slower and more resource-intensive than rvest. Use it only when you need it.
6. Full Project: Scraping Book Data from “Books to Scrape”
Let’s put everything together with a real-world project that’s friendly for learning and explicitly designed for scraping: Books to Scrape. We’ll collect book titles, prices, availability, ratings, and category.
The Challenge
Books to Scrape is mostly static HTML and paginated, which makes it perfect for rvest without a headless browser. We'll need to:

- Discover all the category pages from the sidebar
- Follow the "next" links to handle pagination within each category
- Visit every book's product page and extract its details
The Plan
We'll build two workhorse functions (plus a small helper to list the categories):

- get_book_urls_in_category(): collects every book URL in a category, following pagination
- scrape_book_page(): extracts the details from a single product page
Then we’ll iterate over all categories and combine results into a single data frame.
Helpers
```r
library(rvest)
library(xml2)
library(stringr)
library(purrr)
library(dplyr)
library(readr)

BASE <- "https://books.toscrape.com/"

rel_to_abs <- function(href, base) {
  xml2::url_absolute(href, base)
}
```
Step 1: Get categories
```r
get_categories <- function(base = BASE) {
  page <- read_html(base)
  a <- page |>
    html_elements(".side_categories ul li ul li a")
  tibble(
    category = a |> html_text2() |> str_squish(),
    url = a |> html_attr("href") |> rel_to_abs(base)
  )
}

cats <- get_categories()
head(cats)
```
Step 2: Get all book URLs in a category (with pagination)
```r
get_book_urls_in_category <- function(cat_url) {
  urls <- character()
  next_url <- cat_url
  repeat {
    page <- read_html(next_url)

    page_urls <- page |>
      html_elements("section div ol li article.product_pod h3 a") |>
      html_attr("href") |>
      rel_to_abs(next_url)

    urls <- c(urls, page_urls)

    next_rel <- page |>
      html_element("li.next a") |>
      html_attr("href")

    # html_attr() returns NA when there is no "next" link
    if (is.null(next_rel) || is.na(next_rel)) break

    next_url <- rel_to_abs(next_rel, next_url)
  }
  unique(urls)
}
```
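A quick usage check, assuming the cats table from Step 1 is in your session (the index 1 is arbitrary):

```r
# Collect every book URL in the first category and peek at the result
first_cat_urls <- get_book_urls_in_category(cats$url[1])
length(first_cat_urls)
head(first_cat_urls, 3)
```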
Step 3: Scrape a single book page
```r
scrape_book_page <- function(book_url) {
  page <- read_html(book_url)

  title <- page |> html_element(".product_main h1") |> html_text2()
  price <- page |> html_element(".product_main .price_color") |> html_text2() |>
    stringr::str_extract("[0-9]+\\.[0-9]{2}") |> as.numeric()
  availability <- page |> html_element(".product_main .availability") |> html_text2() |> str_squish()
  rating_class <- page |> html_element(".product_main .star-rating") |> html_attr("class")
  category <- page |> html_elements(".breadcrumb li a") |> html_text2() |> dplyr::last()

  tibble(
    title = title,
    price = price,
    availability = availability,
    rating_class = rating_class,
    category = category,
    url = book_url
  )
}
```
Step 4: Scrape all categories
```r
all_books <- map_dfr(seq_len(nrow(cats)), function(i) {
  cat("Category:", cats$category[i], "\n")
  book_urls <- get_book_urls_in_category(cats$url[i])
  map_dfr(book_urls, function(u) {
    Sys.sleep(0.2)
    scrape_book_page(u)
  })
})

# Save
write_csv(all_books, "books_to_scrape.csv")
saveRDS(all_books, "books_to_scrape.rds")
```
Schema: Expected columns in books_to_scrape.csv
| Column | Type | Description |
|---|---|---|
| title | text | Book title |
| price | number | Price extracted from page (numeric) |
| availability | text | Availability string, e.g., "In stock (22 available)" |
| rating_class | text | CSS class containing star rating label |
| category | text | Book category from breadcrumb |
| url | text | Absolute URL of the product page |
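As a quick guard against silent scraping failures, you might validate the result against this schema right after the scrape; a minimal sketch using the all_books data frame from above:

```r
# Fail loudly if the scraped data doesn't match the expected schema
stopifnot(
  all(c("title", "price", "availability", "rating_class",
        "category", "url") %in% names(all_books)),
  is.numeric(all_books$price),
  nrow(all_books) > 0
)
```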
Optional: Robust scraping with error handling
```r
scrape_book_page_safe <- function(u) {
  tryCatch(scrape_book_page(u), error = function(e) {
    message("Error: ", e$message)
    NULL
  })
}

all_books <- map_dfr(seq_len(nrow(cats)), function(i) {
  book_urls <- get_book_urls_in_category(cats$url[i])
  map_dfr(book_urls, scrape_book_page_safe)
})
```
🎉 You now have a clean dataset of books, their prices, availability, ratings, and categories from Books to Scrape.
7. The Payoff: Analyzing Your Books-to-Scrape Data
Now that you’ve scraped book data, let’s analyze it directly in R.
Load and tidy the data
```r
library(tidyverse)

books <- read_csv("books_to_scrape.csv")

books <- books |>
  mutate(
    # price was already parsed to a number during scraping
    price_num = as.numeric(price),
    # pull the count out of strings like "In stock (22 available)"
    availability_n = readr::parse_number(availability),
    # translate the star-rating CSS class into a numeric rating
    rating = case_when(
      str_detect(rating_class, "One") ~ 1,
      str_detect(rating_class, "Two") ~ 2,
      str_detect(rating_class, "Three") ~ 3,
      str_detect(rating_class, "Four") ~ 4,
      str_detect(rating_class, "Five") ~ 5,
      TRUE ~ NA_real_
    )
  )
```
Question 1: What’s the price distribution overall and by category?
```r
library(ggplot2)

# Overall
ggplot(books, aes(price_num)) +
  geom_histogram(binwidth = 5, fill = "#4C78A8", color = "white") +
  labs(title = "Book Price Distribution", x = "Price", y = "Count") +
  theme_minimal()

# By category (top 8 by count)
top_cats <- books |> count(category, sort = TRUE) |> slice_head(n = 8) |> pull(category)

ggplot(filter(books, category %in% top_cats), aes(price_num, fill = category)) +
  geom_histogram(binwidth = 5, color = "white", alpha = 0.85) +
  facet_wrap(~ category, scales = "free_y") +
  guides(fill = "none") +
  labs(title = "Price Distribution by Category", x = "Price", y = "Count") +
  theme_minimal()
```
Question 2: Do higher-rated books cost more?
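One way to answer this is to compare price distributions across star ratings; here's a minimal sketch using the rating and price_num columns derived above:

```r
# Price distribution by star rating (1-5)
books |>
  filter(!is.na(rating)) |>
  ggplot(aes(x = factor(rating), y = price_num)) +
  geom_boxplot(fill = "#E45756", alpha = 0.7) +
  labs(
    title = "Price by Star Rating",
    x = "Star rating",
    y = "Price"
  ) +
  theme_minimal()

# A correlation gives a quick numeric summary of the same question
cor(books$rating, books$price_num, use = "complete.obs")
```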
Teaser: Price vs. Availability (quick plot)
```r
# Simple teaser scatter: price vs. in-stock count
library(ggplot2)

p_teaser <- ggplot(books, aes(x = availability_n, y = price_num)) +
  geom_point(alpha = 0.6, color = "#3E7CB1") +
  labs(
    title = "Books: Price vs. Availability",
    x = "Copies in stock",
    y = "Price"
  ) +
  theme_minimal()

p_teaser

# Save if you want to embed the static image
# ggsave("books_price_vs_availability.png", p_teaser, width = 6, height = 4, dpi = 150)
```
Question 3: Availability patterns by category
```r
books |>
  group_by(category) |>
  summarise(
    avg_in_stock = mean(availability_n, na.rm = TRUE),
    n = n()
  ) |>
  arrange(desc(avg_in_stock)) |>
  slice_head(n = 15) |>
  ggplot(aes(x = reorder(category, avg_in_stock), y = avg_in_stock)) +
  geom_col(fill = "#54A24B") +
  coord_flip() +
  labs(title = "Average Copies In Stock by Category (Top 15)", x = "Category", y = "Avg copies in stock") +
  theme_minimal()
```
Question 4: Top 10 most expensive titles
```r
books |>
  arrange(desc(price_num)) |>
  transmute(title, category, rating, price = price_num) |>
  slice_head(n = 10) |>
  knitr::kable()
```
This is the power of web scraping in R: you can go from raw HTML to actionable insights without ever leaving your IDE.
Conclusion: From Beginner to Expert Scraper
We've covered a lot of ground. You now know:
- Which tool to reach for: rvest (static HTML), httr2 (advanced HTTP), or chromote (JavaScript sites)

Ethical Scraping: Be a Good Citizen
Before you scrape, remember:
- Check example.com/robots.txt to see what the site allows
- Use Sys.sleep() or req_throttle() to avoid overloading servers

When Scraping Gets Hard: Let FoxScrape Handle It
Building and maintaining production scrapers is challenging. You have to:

- Rotate proxies and work around IP blocks
- Keep up with anti-bot measures and browser fingerprinting
- Render JavaScript at scale
- Deal with CAPTCHAs and ever-changing page structures
If you need reliable, large-scale data extraction without the maintenance burden, check out FoxScrape. It's a web scraping API that handles all the complexity—proxies, browser fingerprinting, JavaScript rendering, and CAPTCHA solving—so you can focus on analyzing data, not maintaining scrapers.
Whether you're collecting market research, monitoring competitors, or building datasets for research, FoxScrape lets you scale from hundreds to millions of pages without managing infrastructure.
Happy Scraping!
Web scraping opens up a world of data that doesn't exist in neat CSV files. With R's powerful ecosystem, you have everything you need to collect, clean, and analyze that data—all in one place.
Start small with rvest, level up with httr2, and when you hit JavaScript sites, bring out chromote. Before you know it, you'll be building scrapers that would have seemed impossible when you started.
Now go forth and scrape responsibly! 🚀
Further Reading

- Web Scraping with Scala
- Web Scraping with C#
- No Code Web Scraping