A Complete Guide to Web Scraping in R

Written by Mantas Kemėšius

If you're already using R for data analysis, you have a powerful secret weapon: you can scrape, clean, analyze, and visualize data all in the same environment. No context switching, no exporting CSV files between tools—just a seamless workflow from raw HTML to publication-ready insights.

Web scraping in R has evolved dramatically. What started with simple HTML parsing has grown into a sophisticated ecosystem capable of handling everything from static pages to JavaScript-heavy modern web applications. Whether you're collecting research data, monitoring prices, or building datasets that don't exist yet, R has the tools you need.

This guide will take you on a complete journey through web scraping in R:

  • Setting up your R scraping environment
  • Understanding the key packages and when to use each one
  • Scraping simple sites with rvest
  • Handling complex HTTP requests, authentication, and APIs with httr2
  • Conquering modern JavaScript-heavy sites with chromote
  • A complete real-world project: scraping book data from Books to Scrape
  • Analyzing your scraped data to extract insights

By the end, you'll have the skills to scrape virtually any website, and the wisdom to know which tool to reach for.

    1. The R Web Scraping Toolkit: Key Packages

    Choosing the right tool is half the battle. R's scraping ecosystem offers packages for every scenario, from simple HTML parsing to driving a full browser. Here's your guide to the essential tools:

    The "Go-To" Packages (Static Scraping)

    rvest: Your primary HTML parsing tool. Part of the tidyverse, rvest was inspired by Python's Beautiful Soup and makes extracting data from HTML incredibly intuitive. It's perfect for static pages where all the content is in the initial HTML response.

    httr2: The modern package for making HTTP requests. While rvest's read_html() works for simple cases, httr2 gives you complete control over headers, authentication, cookies, sessions, and rate limiting—essential for professional scraping.

    The "Expert Tier" (Dynamic/JavaScript Scraping)

    chromote: The modern, lightweight solution for controlling a headless Chrome browser. When websites load content with JavaScript (React, Vue, Angular apps), chromote lets you interact with the page just like a real user, waiting for content to load before scraping it.

    RSelenium: The older, more heavyweight option for browser automation. While it requires running a separate Selenium server and Java, it offers more features for complex interactions. For most projects, though, chromote is simpler and sufficient.

    The "Helper" Packages

    jsonlite: Essential for working with JSON data from web APIs. Many modern sites use APIs that return JSON—this package makes parsing it effortless.

    xml2: The low-level XML/HTML parsing engine that powers rvest. Most users won't interact with it directly, but it's the foundation of the ecosystem.

    The "Specialist"

    Rcrawler: When you need to systematically crawl an entire website—not just scrape a few pages—this package provides the infrastructure for link discovery, depth control, and parallel crawling.
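
    Here's a minimal sketch of a crawl with Rcrawler, using only a handful of its arguments (see ?Rcrawler for the full argument list and defaults; the values below are illustrative):

    R
    library(Rcrawler)

    # Crawl up to two levels deep with a couple of parallel connections,
    # respecting robots.txt and pausing between requests; crawled pages are
    # saved to a local folder and summarized in the INDEX data frame
    Rcrawler(
      Website = "https://books.toscrape.com/",
      MaxDepth = 2,
      no_cores = 2,
      no_conn = 2,
      Obeyrobots = TRUE,
      RequestsDelay = 1
    )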

    Comparison Table: Which Package When?

    Scenario                     | Recommended Package    | Why?
    Simple HTML page             | rvest                  | Clean, tidy syntax for parsing static HTML
    Need custom headers/auth     | httr2 + rvest          | Full HTTP control, then parse with rvest
    JavaScript-rendered content  | chromote + rvest       | Browser renders the JS, then scrape the result
    JSON API                     | httr2 + jsonlite       | Fetch JSON, parse into R data structures
    Crawl entire website         | Rcrawler               | Built-in link discovery and crawling logic
    Complex form interactions    | chromote or RSelenium  | Simulate user clicking, typing, navigating

    2. Getting Started: Setting Up Your Scraping Lab

    Before we write our first scraper, let's get your environment ready.

    Install R and RStudio

    If you're new to R:

  • Download R from CRAN
  • Download RStudio (the IDE) from Posit

    Install the Core Packages

    Open RStudio and run this command to install everything you'll need:

    R
    install.packages(c("rvest", "httr2", "chromote", "tidyverse", "jsonlite"))

    This will install:

  • rvest for HTML parsing
  • httr2 for advanced HTTP requests
  • chromote for browser automation
  • tidyverse for data manipulation (includes dplyr, ggplot2, and more)
  • jsonlite for working with JSON

    System Dependencies (Linux Users)

    If you're on Linux, you may need to install system libraries before the R packages will work:

    BASH
    # Ubuntu/Debian
    sudo apt-get install libcurl4-openssl-dev libxml2-dev libssl-dev

    # Fedora/CentOS
    sudo dnf install libcurl-devel libxml2-devel openssl-devel

    Mac and Windows users typically don't need to worry about this.

    Load the Libraries

    At the start of each scraping script, load your tools:

    R
    library(rvest)
    library(httr2)
    library(chromote)
    library(tidyverse)
    library(jsonlite)

    Now you're ready to scrape!

    3. The Basics: Scraping Static Sites with rvest

    Let's start with the fundamentals. Most web scraping follows this workflow:

  • Fetch the HTML
  • Select the elements you want
  • Extract the data
  • Structure it into a data frame

    Step 1: Read the HTML

    Use read_html() to fetch a webpage:

    R
    url <- "https://www.imdb.com/title/tt0111161/"
    page <- read_html(url)

    The page object now contains the entire HTML document.

    Step 2: Select Elements with CSS Selectors

    This is the heart of scraping. You need to tell rvest which parts of the page you want.

    CSS Selectors are the standard way to target HTML elements. They're the same selectors used in web development:

  • h1 selects all <h1> headings
  • .class-name selects elements with that class
  • #id-name selects the element with that ID
  • div.class p selects <p> tags inside <div class="class">

    Use your browser's Developer Tools (F12) to inspect elements and find their selectors. Right-click an element → Inspect → right-click in the HTML → Copy → Copy selector.

    In rvest:

  • html_elements() (plural) returns all matching elements
  • html_element() (singular) returns the first match

    R
    # Get the first h1
    title <- page |>
      html_element("h1")

    💡 Pro tip: Check out this CSS Selectors Cheat Sheet for quick reference.

    Step 3: Extract the Data

    Once you've selected elements, extract their content:

    Text content: Use html_text2() (note the "2"—it handles whitespace better than the older html_text()):

    R
    title_text <- page |>
      html_element("h1") |>
      html_text2()

    print(title_text)
    # "The Shawshank Redemption"

    Attributes: Use html_attr() to get things like links or image sources:

    R
    # Get all links
    links <- page |>
      html_elements("a") |>
      html_attr("href")

    # Get image sources
    images <- page |>
      html_elements("img") |>
      html_attr("src")

    Step 4: Handle Tables Automatically

    HTML tables are so common that rvest has a magic function for them:

    R
    tables <- page |>
      html_table()

    # Returns a list of data frames, one for each <table> on the page
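
    Because html_table() returns a list, you typically pull out the table you want by position, for example:

    R
    first_table <- tables[[1]]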

    Complete Example: Scraping IMDB

    Let's scrape The Shawshank Redemption's IMDB page for its title, rating, and main cast:

    R
    library(rvest)
    library(tibble)  # for tibble()

    url <- "https://www.imdb.com/title/tt0111161/"
    page <- read_html(url)

    # Get the movie title
    title <- page |>
      html_element("h1") |>
      html_text2()

    # Get the rating (you'll need to inspect the page to find the right selector)
    rating <- page |>
      html_element("[data-testid='hero-rating-bar__aggregate-rating__score'] span") |>
      html_text2()

    # Get the cast list (first 5 actors)
    cast <- page |>
      html_elements("[data-testid='title-cast-item'] a") |>
      html_text2() |>
      head(5)

    # Combine into a data frame
    movie_data <- tibble(
      title = title,
      rating = rating,
      cast = paste(cast, collapse = ", ")
    )

    print(movie_data)

    That's the basic rvest workflow! For many static sites, this is all you need.

    4. Level Up: Advanced HTTP Control with httr2

    What happens when read_html() isn't enough? Many websites require:

  • Custom headers (like a User-Agent to identify yourself)
  • Authentication (login credentials)
  • Session management (maintaining cookies across requests)
  • Rate limiting (being polite and not hammering the server)

    This is where httr2 shines. It gives you complete control over HTTP requests.

    Setting Custom Headers

    Some sites block requests that don't include a User-Agent (they look like bots). Always identify yourself:

    R
    library(httr2)

    req <- request("https://example.com") |>
      req_headers(
        "User-Agent" = "MyResearchScraper/1.0 (your.email@example.com)"
      )

    resp <- req_perform(req)
    html <- resp |> resp_body_html()

    Now you can pass html to rvest functions.
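
    For example, to grab the page title from that parsed response:

    R
    page_title <- html |>
      html_element("title") |>
      html_text2()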

    Handling Authentication

    For APIs or sites requiring login:

    R
    # Basic authentication (common for APIs)
    req <- request("https://api.example.com/data") |>
      req_auth_basic(username = "user", password = "pass")

    # Form-based login (like a website login page)
    req <- request("https://example.com/login") |>
      req_body_form(
        username = "user",
        password = "pass"
      )

    Managing Sessions and Cookies

    When you log in to a site, the server gives you a session cookie. To maintain that session across multiple requests:

    R
    # httr2 does not persist cookies across separate requests by default;
    # use req_cookie_preserve() to store them in a cookie file
    cookie_file <- tempfile("cookies")

    session <- request("https://example.com") |>
      req_cookie_preserve(cookie_file)

    # Login
    login_resp <- session |>
      req_url_path_append("login") |>
      req_body_form(username = "user", password = "pass") |>
      req_perform()

    # Now make authenticated requests (cookies persist via the cookie file)
    data_resp <- session |>
      req_url_path_append("data") |>
      req_perform()

    Rate Limiting: Be a Good Citizen

    Scraping too fast can overload servers and get you blocked. Always add delays:

    R
    req <- request("https://example.com") |>
      req_throttle(rate = 10)  # Max 10 requests per second

    # Also add retry logic for temporary failures
    req <- req |>
      req_retry(max_tries = 3, backoff = ~5)

    Scraping JSON APIs

    Many modern sites use APIs that return JSON. Here's how to fetch and parse it:

    R
    library(jsonlite)

    req <- request("https://api.example.com/data")
    resp <- req_perform(req)

    # Parse the JSON response into nested R lists
    data <- resp |> resp_body_json()

    # For a JSON array of records, let jsonlite simplify straight to a data frame
    df <- resp |>
      resp_body_string() |>
      fromJSON() |>
      as_tibble()

    httr2 is your Swiss Army knife for any HTTP complexity. Pair it with rvest for HTML parsing, and you can handle almost any static site.

    5. The Expert Tier: Scraping JavaScript Sites with chromote

    Here's the challenge: modern websites are often built with frameworks like React, Vue, or Angular. When you load these sites, the initial HTML is mostly empty; the content is rendered afterwards by JavaScript.

    If you try read_html() on these sites, you'll get an empty shell. rvest can't run JavaScript.

    The Solution: Headless Browsers

    chromote controls a real Chrome browser (in "headless" mode—no visible window). The browser loads the page, runs all the JavaScript, and then you scrape the rendered result.

    Installation

    You need:

  • The R package: install.packages("chromote")
  • Google Chrome installed on your computer

    chromote will automatically find and use your Chrome installation.
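
    If you want to confirm which Chrome binary chromote found (assuming a standard installation), you can check:

    R
    # Path to the Chrome executable chromote will launch
    chromote::find_chrome()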

    The chromote Workflow

    Here's the basic pattern:

    R
    library(chromote)
    library(rvest)

    # 1. Start a new browser session
    b <- ChromoteSession$new()

    # 2. Navigate to the page
    b$Page$navigate("https://example.com")

    # 3. WAIT for the page to load (critical!)
    b$Page$loadEventFired()

    # 4. Get the rendered HTML
    html <- b$Runtime$evaluate("document.documentElement.outerHTML")$result$value

    # 5. Parse with rvest
    page <- read_html(html)

    # Now use normal rvest functions
    data <- page |>
      html_elements(".dynamic-content") |>
      html_text2()

    # 6. Close the browser
    b$close()

    Advanced: Waiting for Specific Elements

    Sometimes loadEventFired() isn't enough—you need to wait for a specific element to appear:

    R
    b <- ChromoteSession$new()
    b$Page$navigate("https://example.com")

    # Wait for a specific element using JavaScript; awaitPromise = TRUE makes
    # the call block until the promise resolves
    b$Runtime$evaluate('
      new Promise(resolve => {
        const check = () => {
          if (document.querySelector(".target-element")) {
            resolve(true);
          } else {
            setTimeout(check, 100);
          }
        };
        check();
      })
    ', awaitPromise = TRUE)

    # Now scrape

    Interacting with the Page

    You can also click buttons, fill forms, and scroll—anything a real user can do:

    R
    # Click a button
    b$Runtime$evaluate('document.querySelector("#load-more").click()')

    # Wait for new content to load
    Sys.sleep(2)

    # Fill a form field
    b$Runtime$evaluate('document.querySelector("#search-box").value = "query"')

    # Submit the form
    b$Runtime$evaluate('document.querySelector("#search-form").submit()')

    chromote is powerful but comes with overhead—it's slower and more resource-intensive than rvest. Use it only when you need it.

    6. Full Project: Scraping Book Data from “Books to Scrape”

    Let’s put everything together with a real-world project that’s friendly for learning and explicitly designed for scraping: Books to Scrape. We’ll collect book titles, prices, availability, ratings, and category.

    The Challenge

    Books to Scrape is mostly static HTML and paginated, which makes it perfect for rvest without a headless browser. We’ll need to:

  • Traverse categories
  • Handle multi-page pagination
  • Extract structured data from product cards

    The Plan

    We’ll build two functions:

  • Get all book page URLs for a given category, following pagination
  • Scrape an individual book page for detailed fields

    Then we'll iterate over all categories and combine the results into a single data frame.

    Helpers

    R
    library(rvest)
    library(xml2)
    library(stringr)
    library(purrr)
    library(dplyr)
    library(readr)

    BASE <- "https://books.toscrape.com/"

    rel_to_abs <- function(href, base) {
      xml2::url_absolute(href, base)
    }

    Step 1: Get categories

    R
    get_categories <- function(base = BASE) {
      page <- read_html(base)
      a <- page |>
        html_elements(".side_categories ul li ul li a")
      tibble(
        category = a |> html_text2() |> str_squish(),
        url = a |> html_attr("href") |> rel_to_abs(base)
      )
    }

    cats <- get_categories()
    head(cats)

    Step 2: Get all book URLs in a category (with pagination)

    R
    get_book_urls_in_category <- function(cat_url) {
      urls <- character()
      next_url <- cat_url
      repeat {
        page <- read_html(next_url)

        page_urls <- page |>
          html_elements("section div ol li article.product_pod h3 a") |>
          html_attr("href") |>
          rel_to_abs(next_url)

        urls <- c(urls, page_urls)

        next_rel <- page |>
          html_element("li.next a") |>
          html_attr("href")

        if (is.null(next_rel) || is.na(next_rel)) break

        next_url <- rel_to_abs(next_rel, next_url)
      }
      unique(urls)
    }

    Step 3: Scrape a single book page

    R
    scrape_book_page <- function(book_url) {
      page <- read_html(book_url)

      title <- page |> html_element(".product_main h1") |> html_text2()
      price <- page |> html_element(".product_main .price_color") |> html_text2() |>
        stringr::str_extract("[0-9]+\\.[0-9]{2}") |> as.numeric()
      availability <- page |> html_element(".product_main .availability") |> html_text2() |> str_squish()
      rating_class <- page |> html_element(".product_main .star-rating") |> html_attr("class")
      category <- page |> html_elements(".breadcrumb li a") |> html_text2() |> dplyr::last()

      tibble(
        title = title,
        price = price,
        availability = availability,
        rating_class = rating_class,
        category = category,
        url = book_url
      )
    }

    Step 4: Scrape all categories

    R
    all_books <- map_dfr(seq_len(nrow(cats)), function(i) {
      cat("Category:", cats$category[i], "\n")
      book_urls <- get_book_urls_in_category(cats$url[i])
      map_dfr(book_urls, function(u) {
        Sys.sleep(0.2)
        scrape_book_page(u)
      })
    })

    # Save
    write_csv(all_books, "books_to_scrape.csv")
    saveRDS(all_books, "books_to_scrape.rds")
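
    A quick sanity check on what came back; Books to Scrape lists 1,000 titles, so the row count should be close to that if every category was crawled:

    R
    nrow(all_books)
    dplyr::glimpse(all_books)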

    Schema: Expected columns in books_to_scrape.csv

    Column        | Type    | Description
    title         | text    | Book title
    price         | number  | Price extracted from the page (numeric)
    availability  | text    | Availability string, e.g., "In stock (22 available)"
    rating_class  | text    | CSS class containing the star rating label
    category      | text    | Book category from the breadcrumb
    url           | text    | Absolute URL of the product page

    Optional: Robust scraping with error handling

    R
    scrape_book_page_safe <- function(u) {
      tryCatch(scrape_book_page(u), error = function(e) {
        message("Error: ", e$message)
        NULL
      })
    }

    all_books <- map_dfr(seq_len(nrow(cats)), function(i) {
      book_urls <- get_book_urls_in_category(cats$url[i])
      map_dfr(book_urls, scrape_book_page_safe)
    })

    🎉 You now have a clean dataset of books, their prices, availability, ratings, and categories from Books to Scrape.

    7. The Payoff: Analyzing Your Books-to-Scrape Data

    Now that you’ve scraped book data, let’s analyze it directly in R.

    Load and tidy the data

    R
    library(tidyverse)

    books <- read_csv("books_to_scrape.csv")

    # Example normalization helper: price is already numeric in our CSV,
    # but coerce defensively in case it was read in as text
    normalize_price <- function(x) {
      if (is.character(x)) readr::parse_number(x) else as.numeric(x)
    }

    books <- books |>
      mutate(
        price_num = normalize_price(price),
        availability_n = readr::parse_number(availability),
        rating = case_when(
          str_detect(rating_class, "One") ~ 1,
          str_detect(rating_class, "Two") ~ 2,
          str_detect(rating_class, "Three") ~ 3,
          str_detect(rating_class, "Four") ~ 4,
          str_detect(rating_class, "Five") ~ 5,
          TRUE ~ NA_real_
        )
      )

    Question 1: What’s the price distribution overall and by category?

    R
    library(ggplot2)

    # Overall
    ggplot(books, aes(price_num)) +
      geom_histogram(binwidth = 5, fill = "#4C78A8", color = "white") +
      labs(title = "Book Price Distribution", x = "Price", y = "Count") +
      theme_minimal()

    # By category (top 8 by count)
    top_cats <- books |> count(category, sort = TRUE) |> slice_head(n = 8) |> pull(category)

    ggplot(filter(books, category %in% top_cats), aes(price_num, fill = category)) +
      geom_histogram(binwidth = 5, color = "white", alpha = 0.85) +
      facet_wrap(~ category, scales = "free_y") +
      guides(fill = "none") +
      labs(title = "Price Distribution by Category", x = "Price", y = "Count") +
      theme_minimal()

    Question 2: Do higher-rated books cost more?
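
    A minimal way to eyeball this is to compare price distributions across the star ratings derived above:

    R
    # Price distribution by star rating
    ggplot(filter(books, !is.na(rating)), aes(x = factor(rating), y = price_num)) +
      geom_boxplot(fill = "#F58518") +
      labs(title = "Price by Star Rating", x = "Star rating", y = "Price") +
      theme_minimal()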

    Teaser: Price vs. Availability (quick plot)

    R
    # Simple teaser scatter: price vs. in-stock count
    library(ggplot2)

    p_teaser <- ggplot(books, aes(x = availability_n, y = price_num)) +
      geom_point(alpha = 0.6, color = "#3E7CB1") +
      labs(
        title = "Books: Price vs. Availability",
        x = "Copies in stock",
        y = "Price"
      ) +
      theme_minimal()

    p_teaser

    # Save if you want to embed the static image
    # ggsave("books_price_vs_availability.png", p_teaser, width = 6, height = 4, dpi = 150)

    Question 3: Availability patterns by category

    R
    books |>
      group_by(category) |>
      summarise(
        avg_in_stock = mean(availability_n, na.rm = TRUE),
        n = n()
      ) |>
      arrange(desc(avg_in_stock)) |>
      slice_head(n = 15) |>
      ggplot(aes(x = reorder(category, avg_in_stock), y = avg_in_stock)) +
      geom_col(fill = "#54A24B") +
      coord_flip() +
      labs(title = "Average Copies In Stock by Category (Top 15)", x = "Category", y = "Avg copies in stock") +
      theme_minimal()

    Question 4: Top 10 most expensive titles

    R
    books |>
      arrange(desc(price_num)) |>
      transmute(title, category, rating, price = price_num) |>
      slice_head(n = 10) |>
      knitr::kable()

    This is the power of web scraping in R: you can go from raw HTML to actionable insights without ever leaving your IDE.

    Conclusion: From Beginner to Expert Scraper

    We've covered a lot of ground. You now know:

  • When to use rvest (static HTML), httr2 (advanced HTTP), or chromote (JavaScript sites)
  • How to select and extract data with CSS selectors
  • How to handle authentication, sessions, and rate limiting
  • How to scrape modern JavaScript-heavy websites
  • How to build a complete scraping pipeline from collection to analysis

    Ethical Scraping: Be a Good Citizen

    Before you scrape, remember:

  • Check robots.txt: Visit example.com/robots.txt to see what the site allows (see the sketch after this list)
  • Respect rate limits: Use Sys.sleep() or req_throttle() to avoid overloading servers
  • Identify yourself: Use a descriptive User-Agent with contact info
  • Honor terms of service: Don't scrape data you're not supposed to access
  • Consider the API: Many sites offer APIs—use them when available
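
    For the robots.txt check, one convenient option is the robotstxt package (not used elsewhere in this guide, so treat this as an optional extra):

    R
    # install.packages("robotstxt")
    library(robotstxt)

    # TRUE/FALSE: is crawling this path allowed for generic bots?
    paths_allowed(paths = "/catalogue/", domain = "books.toscrape.com")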

    When Scraping Gets Hard: Let FoxScrape Handle It

    Building and maintaining production scrapers is challenging. You have to:

  • Manage rotating proxies to avoid IP bans
  • Handle CAPTCHAs and anti-bot systems
  • Keep scrapers updated when websites change
  • Scale infrastructure for large-scale data collection
  • Monitor for failures and retries

    If you need reliable, large-scale data extraction without the maintenance burden, check out FoxScrape. It's a web scraping API that handles all the complexity (proxies, browser fingerprinting, JavaScript rendering, and CAPTCHA solving) so you can focus on analyzing data, not maintaining scrapers.

    Whether you're collecting market research, monitoring competitors, or building datasets for research, FoxScrape lets you scale from hundreds to millions of pages without managing infrastructure.

    Happy Scraping!

    Web scraping opens up a world of data that doesn't exist in neat CSV files. With R's powerful ecosystem, you have everything you need to collect, clean, and analyze that data—all in one place.

    Start small with rvest, level up with httr2, and when you hit JavaScript sites, bring out chromote. Before you know it, you'll be building scrapers that would have seemed impossible when you started.

    Now go forth and scrape responsibly! 🚀