Web Scraping with Ruby

Written by Mantas Kemėšius

Web scraping is one of those quiet superpowers every developer eventually picks up. Whether you’re building a price tracker, collecting research data, or just automating boring work, scraping lets you read the web like a machine — and Ruby happens to make that surprisingly elegant.

In this tutorial, we'll build your skills step by step: from making a single HTTP request and parsing HTML, to automating entire browsers, and finally to simplifying it all with FoxScrape, a fast, API-based scraping solution for sites that fight back.

You’ll walk away knowing:

  • How to scrape static and dynamic sites in Ruby
  • How to parse data cleanly using Nokogiri
  • How to avoid common anti-scraping pitfalls
  • When (and how) to switch to FoxScrape for effortless data extraction

Let’s start from the beginning.

    🧰 1. Why Ruby for Web Scraping?

    Ruby’s not just for Rails — it’s also great for scripting, data parsing, and automation.

    Its expressive syntax and ecosystem of gems make it ideal for building readable, maintainable scrapers.

    Here’s what makes Ruby shine for scraping:

    Feature                | Why It Matters
    Nokogiri               | Fast, reliable HTML/XML parsing
    Faraday                | Modern, flexible HTTP client
    Capybara + Selenium    | Automate browsers like Chrome or Firefox
    CSV & JSON             | Built-in data export
    Threads / Parallel gem | Simple concurrency for multiple pages

    Throughout this guide, we’ll use these gems to demonstrate the full scraping workflow — from raw HTML to structured CSV output.

    🧑‍💻 2. Setting Up Your Ruby Environment

    You’ll need:

  • Ruby 3.x or newer
  • Bundler (gem install bundler)
  • A text editor (VS Code works great)

    Create a new folder for your scraper:

    BASH
    mkdir ruby-scraper && cd ruby-scraper
    bundle init

    Now open your Gemfile and add the following gems:

    RUBY
    gem "faraday"
    gem "faraday-retry" # provides the retry middleware used later in this guide
    gem "nokogiri"
    gem "selenium-webdriver"
    gem "capybara"
    gem "csv"
    gem "parallel"

    Then run:

    BASH
    bundle install

    That’s it — you’re ready to start scraping.

    🌐 3. Making Your First HTTP Request with Faraday

    Let’s warm up by fetching a webpage.

    RUBY
    require 'faraday'

    response = Faraday.get("https://example.com")
    puts response.status
    puts response.body[0..200]

    This performs a simple GET request, prints the status code, and shows roughly the first 200 characters of the response body.

    What’s happening here:

  • Faraday.get() fetches the page.
  • response.status tells you if the request succeeded (200 = OK).
  • response.body is the raw HTML you’ll parse next.

    🧩 4. Parsing HTML with Nokogiri

    Nokogiri is the Swiss Army knife of Ruby scraping. It lets you navigate HTML like a tree — select tags, extract text, and manipulate content easily.

    Let’s extract links from example.com:

    RUBY
    require 'nokogiri'
    require 'faraday'

    html = Faraday.get("https://example.com").body
    doc = Nokogiri::HTML(html)

    links = doc.css("a").map { |a| a['href'] }.compact
    puts links

    What’s happening:

  • We use CSS selectors (a) to find all <a> tags.
  • .map collects their href attributes.
  • compact removes nil values.

    💡 Pro tip: You can also use doc.at_css("h1").text to grab single elements, like titles or headers.
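
    Here's a quick illustration of the difference: css returns every match as a NodeSet, while at_css returns only the first matching node (or nil).

    RUBY
    require 'nokogiri'

    doc = Nokogiri::HTML("<h1>Hello</h1><p>First</p><p>Second</p>")

    # at_css: first matching node only
    puts doc.at_css("h1").text              # => Hello

    # css: every matching node
    puts doc.css("p").map(&:text).inspect   # => ["First", "Second"]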

    📚 5. Real-World Example: Scraping Book Titles

    Let's make it real by scraping book data from https://books.toscrape.com — a safe practice site built for scraping.

    RUBY
    require 'nokogiri'
    require 'faraday'
    require 'csv'

    url = "https://books.toscrape.com"
    html = Faraday.get(url).body
    doc = Nokogiri::HTML(html)

    books = doc.css(".product_pod").map do |book|
      title = book.at_css("h3 a")["title"]
      price = book.at_css(".price_color").text
      { title: title, price: price }
    end

    CSV.open("books.csv", "w") do |csv|
      csv << ["Title", "Price"]
      books.each { |b| csv << [b[:title], b[:price]] }
    end

    puts "✅ Saved #{books.size} books to books.csv"

    Now you’ve scraped, parsed, and exported structured data — the full basic cycle.

    But as you’ll see next, real websites rarely make it this easy.

    🧱 6. When the Web Fights Back: Anti-Scraping Measures

    Sooner or later, you’ll hit problems like:

  • 403 or 429 errors (blocked or rate-limited)
  • Blank pages (because content is loaded with JavaScript)
  • CAPTCHA challenges
  • IP bans after multiple requests

    Static scraping tools like Faraday + Nokogiri don’t handle these cases well — they only fetch raw HTML, not JavaScript-rendered pages or protected endpoints.
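
    Before reaching for heavier tools, you can at least detect throttling and back off. A minimal sketch, assuming the site signals blocks with plain 403/429 status codes (the attempt count and delays here are arbitrary):

    RUBY
    require 'faraday'

    # Retry with exponential backoff when the server answers
    # 429 (rate-limited) or 403 (blocked); give up after `attempts` tries
    def fetch_with_backoff(url, attempts: 3)
      attempts.times do |i|
        response = Faraday.get(url)
        return response if response.success?

        if [403, 429].include?(response.status)
          wait = 2**i # 1s, 2s, 4s, ...
          puts "Got #{response.status}, backing off #{wait}s"
          sleep wait
        end
      end
      nil
    end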

    That’s where you have two main options:

  • Run a full browser (via Selenium or Capybara)
  • Offload the heavy lifting to a scraping API like FoxScrape

    Let’s explore both paths.

    🧭 7. Scraping Dynamic Sites with Selenium & Capybara

    When JavaScript gets in the way, you can use browser automation.

    Install ChromeDriver or GeckoDriver first (recent versions of selenium-webdriver can fetch a matching driver for you automatically), then try this:

    RUBY
    require 'selenium-webdriver'

    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument("--headless")
    driver = Selenium::WebDriver.for(:chrome, options: options)

    driver.navigate.to "https://quotes.toscrape.com/js/"
    sleep 2 # wait for JS to load
    puts driver.title
    puts driver.page_source[0..300]

    driver.quit

    This launches a headless Chrome browser, loads a JS-heavy site, and prints its HTML.

    Pros: Works anywhere, real browser context

    Cons: Slow, memory-hungry, and sometimes brittle
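
    Part of that brittleness comes from the fixed sleep 2 above: if the page renders slowly, you scrape an empty DOM. Selenium's explicit waits are sturdier. A minimal sketch, assuming the .quote selector used by quotes.toscrape.com:

    RUBY
    require 'selenium-webdriver'

    options = Selenium::WebDriver::Chrome::Options.new
    options.add_argument("--headless")
    driver = Selenium::WebDriver.for(:chrome, options: options)

    driver.navigate.to "https://quotes.toscrape.com/js/"

    # Poll (up to 10 s) until at least one quote element exists,
    # instead of guessing with a fixed sleep
    wait = Selenium::WebDriver::Wait.new(timeout: 10)
    wait.until { driver.find_element(css: ".quote") }

    puts driver.title
    driver.quit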

    Wouldn’t it be nice if you could get the same results without running a full browser?

    🦊 8. Simplifying It All with FoxScrape

    Here’s where FoxScrape comes in — a powerful API that fetches fully-rendered pages from any URL, so you don’t have to deal with:

  • Proxy rotation
  • Headless browsers
  • CAPTCHA walls
  • JavaScript rendering

    With a single HTTP call, you get clean, ready-to-parse HTML.

    Let’s adapt your previous Nokogiri scraper to use FoxScrape:

    RUBY
    require 'faraday'
    require 'nokogiri'
    require 'cgi'

    FOX_API_KEY = "YOUR_API_KEY"
    target_url = "https://books.toscrape.com"
    # URL-encode the target so its own special characters don't break the query string
    fox_url = "https://www.foxscrape.com/api/v1?api_key=#{FOX_API_KEY}&url=#{CGI.escape(target_url)}"

    response = Faraday.get(fox_url)
    doc = Nokogiri::HTML(response.body)

    books = doc.css(".product_pod h3 a").map { |a| a["title"] }
    puts "Found #{books.size} books!"

    💡 You can even enable JS rendering:

    RUBY
    fox_url = "https://www.foxscrape.com/api/v1?api_key=#{FOX_API_KEY}&url=#{CGI.escape(target_url)}&render_js=true"

    The beauty?

    You still use your familiar parsing code — FoxScrape only replaces the network layer.

    Your Nokogiri logic stays exactly the same.
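
    One way to make that swap explicit is to hide fetching behind a single helper, so parsers never know where the HTML came from. A sketch, using a hypothetical fetch_html helper of our own (not part of any gem):

    RUBY
    require 'faraday'
    require 'cgi'

    FOX_API_KEY = "YOUR_API_KEY"

    # Hypothetical helper: the only code that knows whether HTML
    # comes straight from Faraday or through FoxScrape
    def fetch_html(url, use_fox: true, render_js: false)
      return Faraday.get(url).body unless use_fox

      fox_url = "https://www.foxscrape.com/api/v1" \
                "?api_key=#{FOX_API_KEY}&url=#{CGI.escape(url)}" \
                "#{'&render_js=true' if render_js}"
      Faraday.get(fox_url).body
    end

    # Downstream Nokogiri code is unchanged:
    # doc = Nokogiri::HTML(fetch_html("https://books.toscrape.com"))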

    💾 9. Handling Retries and Saving Data

    Sometimes requests fail — networks drop, sites throttle. Add retries and CSV output for resilience:

    RUBY
    require 'faraday'
    require 'faraday/retry' # provided by the faraday-retry gem

    conn = Faraday.new do |f|
      f.request :retry, max: 3, interval: 1
      f.adapter Faraday.default_adapter
    end

    # Params passed as a hash are URL-encoded by Faraday
    response = conn.get("https://www.foxscrape.com/api/v1", {
      api_key: FOX_API_KEY,
      url: "https://example.com"
    })

    if response.success?
      File.write("page.html", response.body)
      puts "Saved HTML snapshot."
    else
      puts "Error: #{response.status}"
    end

    ⚙️ 10. Putting It All Together — A Mini Project

    Let’s combine everything into a small, practical scraper that:

  • Uses FoxScrape to fetch pages
  • Parses data with Nokogiri
  • Writes to CSV
  • Runs in parallel

    RUBY
    require 'faraday'
    require 'nokogiri'
    require 'csv'
    require 'parallel'
    require 'cgi'

    FOX_API_KEY = "YOUR_API_KEY"
    base_url = "https://books.toscrape.com/catalogue/page-"

    pages = (1..5).map { |i| "#{base_url}#{i}.html" }

    results = Parallel.map(pages, in_threads: 4) do |url|
      fox_url = "https://www.foxscrape.com/api/v1?api_key=#{FOX_API_KEY}&url=#{CGI.escape(url)}"
      html = Faraday.get(fox_url).body
      doc = Nokogiri::HTML(html)

      doc.css(".product_pod").map do |b|
        {
          title: b.at_css("h3 a")["title"],
          price: b.at_css(".price_color").text
        }
      end
    end.flatten

    CSV.open("books_combined.csv", "w") do |csv|
      csv << ["Title", "Price"]
      results.each { |r| csv << [r[:title], r[:price]] }
    end

    puts "✅ Scraped #{results.size} books across 5 pages!"

    That’s a concise, multi-threaded scraper running through a robust API. Add the retry middleware from section 9 and it’s close to production-ready; a sketch of that last step is below.
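
    A minimal sketch of wiring the retrying connection into the parallel fetch (one connection per worker, to keep things unambiguously thread-safe):

    RUBY
    require 'faraday'
    require 'faraday/retry'
    require 'parallel'

    FOX_API_KEY = "YOUR_API_KEY"

    pages = (1..5).map { |i| "https://books.toscrape.com/catalogue/page-#{i}.html" }

    htmls = Parallel.map(pages, in_threads: 4) do |url|
      # Each worker builds its own connection with retry middleware
      conn = Faraday.new do |f|
        f.request :retry, max: 3, interval: 1
        f.adapter Faraday.default_adapter
      end
      # Params passed as a hash are URL-encoded by Faraday
      conn.get("https://www.foxscrape.com/api/v1",
               api_key: FOX_API_KEY, url: url).body
    end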

    ⚖️ 11. Choosing the Right Approach

    Here’s a quick reference comparing the main scraping methods:

    Method              | Handles JS? | Handles Blocks? | Speed    | Complexity
    Nokogiri + Faraday  | ❌          | ❌              | ⚡ Fast   | 🟢 Simple
    Selenium / Capybara | ✅          | ⚠️ Partial      | 🐢 Slow  | 🔴 Complex
    FoxScrape API       | ✅          | ✅              | ⚡⚡ Fast | 🟢 Simple

    If your target site is static — stick with Nokogiri.

    If it’s dynamic or protected — FoxScrape is your easiest route.

    🧠 12. Best Practices & Ethics

    A few golden rules of scraping:

  • Respect robots.txt and rate limits.
  • Cache responses when possible.
  • Don’t overload servers — use sleep or throttling between requests (see the sketch after this list).
  • Always credit data sources when publishing results.
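
    A minimal sketch of both habits, pairing a fixed delay with a naive on-disk cache (the cache directory and one-second delay are arbitrary choices):

    RUBY
    require 'faraday'
    require 'digest'
    require 'fileutils'

    CACHE_DIR = "cache"
    FileUtils.mkdir_p(CACHE_DIR)

    # Serve repeat requests from disk; pause before each live request
    def polite_get(url, delay: 1)
      path = File.join(CACHE_DIR, Digest::SHA256.hexdigest(url))
      return File.read(path) if File.exist?(path)

      sleep delay
      body = Faraday.get(url).body
      File.write(path, body)
      body
    end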

    FoxScrape helps here too — its backend automatically throttles and retries responsibly, so your IPs (and conscience) stay clean.

    🎯 13. Conclusion

    You’ve now built a complete scraping toolkit in Ruby — from raw HTML fetching to full-scale, parallel data collection.

    You learned how to:

  • Fetch and parse pages with Faraday + Nokogiri
  • Handle dynamic sites with Selenium
  • Simplify everything using FoxScrape’s API
  • Export and structure your data cleanly

    When scraping goes from “fun experiment” to “daily pipeline,” that’s when FoxScrape truly shines — because you’ll spend less time fighting blocks and more time working with your data.

    So go ahead:

    Run your first FoxScrape request, grab your results, and watch how easy scraping can be when the hard parts are already handled.

    🦊 Try it yourself:

    RUBY
    require 'faraday'
    require 'cgi'

    FOX_API_KEY = "YOUR_API_KEY"
    url = "https://en.wikipedia.org/wiki/Ruby_(programming_language)"
    fox_url = "https://www.foxscrape.com/api/v1?api_key=#{FOX_API_KEY}&url=#{CGI.escape(url)}"
    puts Faraday.get(fox_url).body[0..500]

    Happy scraping — ethically, efficiently, and effortlessly.