Web Scraping with Scala

Written by Mantas Kemėšius

Web scraping sits at the intersection of curiosity and automation. It’s what happens when developers stop copying data manually and start thinking: couldn’t I just write code for this?

Whether you’re tracking prices, gathering research data, or analyzing trends across multiple websites, scraping is often the most direct way to collect public information at scale. But the web you scrape today is not the same as the web of a decade ago.

Modern sites use JavaScript frameworks, lazy loading, and aggressive bot protection. So to extract data reliably, we need to combine the right techniques and libraries — and, occasionally, a little infrastructure help.

In this article, we’ll explore how to scrape the web effectively using Scala, comparing three major methods:

  • jsoup – for lightweight, static HTML parsing
  • Scala Scraper – a Scala-idiomatic wrapper around jsoup
  • Selenium – for interacting with JavaScript-heavy pages

And finally, we’ll look at when it makes sense to offload complexity using an external API — introducing FoxScrape, a scraping API that handles browser rendering and proxy management for you.

    But let’s not get ahead of ourselves. We’ll start from scratch — building your first scraping project in Scala, one line of code at a time.

    ⚙️ 1. Preparing Your Scala Environment

    Before we scrape anything, we need a clean development setup.

    🧩 Installing Scala and sbt

    Ensure you have:

  • Scala 3.6.3
  • sbt 1.10.7

    Check your versions:

    BASH
    scala -version
    sbt sbtVersion

    If you’re missing either, install via scala-lang.org or use Coursier (cs setup) for an all-in-one toolchain.

    🏗️ Project Structure

    Let’s create a new project:

    BASH
    sbt new scala/scala3.g8
    cd my-scraper

    When prompted, name the project my-scraper. This gives you a minimal template with src/main/scala/Main.scala.

    Now add our dependencies.

    📦 build.sbt

    SCALA
    ThisBuild / scalaVersion := "3.6.3"

    libraryDependencies ++= Seq(
      "org.jsoup" % "jsoup" % "1.18.1",
      "net.ruippeixotog" %% "scala-scraper" % "3.1.0",
      "org.seleniumhq.selenium" % "selenium-java" % "4.25.0",
      "com.lihaoyi" %% "requests" % "0.8.0" // for HTTP calls later
    )

    With this, we can experiment with all three approaches — from static HTML parsing to full browser automation — without switching environments.

    🌐 2. Your First Scrape with jsoup

    The simplest way to scrape is to fetch a page and parse it directly.

    🧩 What jsoup Does

    jsoup is a Java library (fully compatible with Scala) that:

  • Downloads and parses HTML into a traversable DOM.
  • Lets you select elements using familiar CSS selectors.
  • Cleans and normalizes HTML for easier data extraction.

    It’s perfect for static pages — like Wikipedia, blogs, or documentation sites.

    ✏️ Example: Scrape Wikipedia’s Title

    SCALA
    import org.jsoup.Jsoup

    @main def scrapeWikipedia(): Unit =
      val doc = Jsoup.connect("https://en.wikipedia.org/").get()
      println(doc.title())

    Run it:

    PLAIN TEXT
    sbt run

    Output:

    PLAIN TEXT
    Wikipedia, the free encyclopedia

    That’s jsoup in a nutshell — simple, fast, and excellent for HTML that’s already rendered server-side.

    🎯 3. Selecting Elements with jsoup

    Once you have a document, you’ll want to extract specific sections — headlines, links, or data fields.

    🔍 CSS Selectors in jsoup

    jsoup uses CSS-like selectors:

    Selector     Meaning
    #id          element with id
    .class       element with class
    div p        <p> inside <div>
    a[href]      <a> with href attribute

    🧠 Example: Wikipedia’s “In the news” section

    SCALA
    import org.jsoup.Jsoup
    import scala.jdk.CollectionConverters.*

    @main def wikipediaNews(): Unit =
      val doc = Jsoup.connect("https://en.wikipedia.org/").get()

      val news = doc.select("#mp-itn b a")
      for n <- news.asScala do
        println(s"${n.text()} → ${n.absUrl("href")}")

    We grab the bolded links in the “In the news” section using the #mp-itn ID and print their text and URLs.

    This is where scraping starts to feel magical — the ability to programmatically traverse the same structure you inspect in your browser’s “View Source.”


    🧩 4. Extracting Text and Attributes

    jsoup makes it trivial to extract values:

    Method               Description
    .text()              Returns the text content
    .attr("href")        Returns the attribute value
    .children()          Gets child elements
    .first() / .last()   Returns the first / last matching element

    Example:

    SCALA
    // requires: import scala.jdk.CollectionConverters.*
    val headlines = doc.select("#mp-itn li")

    for item <- headlines.asScala do
      Option(item.select("a").first()).foreach { link =>
        val title = link.text()
        val href = link.absUrl("href")
        println(s"$title → $href")
      }

    This gives you structured pairs of text and URLs — perfect for later saving into JSON, CSV, or a database.
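
    For example, here’s a minimal sketch of the CSV step, assuming you first collect the pairs into a rows sequence (the sample row below is just a placeholder):

    SCALA
    import java.nio.file.{Files, Paths}
    import scala.jdk.CollectionConverters.*

    // Placeholder data; in practice, build this Seq inside the scraping loop above
    val rows = Seq(("Example headline", "https://en.wikipedia.org/wiki/Example"))

    // Naive quoting (doesn't escape embedded quotes), fine for quick exports
    val lines = "title,url" +: rows.map((title, href) => s"\"$title\",\"$href\"")
    Files.write(Paths.get("headlines.csv"), lines.asJava)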

    🧰 5. Scraping with Scala Scraper

    While jsoup is powerful, its Java-style syntax isn’t very Scala-idiomatic.

    That’s where the Scala Scraper library comes in — it wraps jsoup in a more functional, expressive API.

    📦 Setup

    We already added it to our build.sbt.

    ✏️ Example

    SCALA
    import net.ruippeixotog.scalascraper.browser.JsoupBrowser
    import net.ruippeixotog.scalascraper.dsl.DSL.*
    import net.ruippeixotog.scalascraper.dsl.DSL.Extract.*

    @main def scrapeWithScalaScraper(): Unit =
      val browser = JsoupBrowser()
      val doc = browser.get("https://en.wikipedia.org/")

      val titles = doc >> elementList("#mp-itn b a")
      titles.foreach { el =>
        println(el.text + " → " + el.attr("href"))
      }

    Notice how concise it feels. The >> operator is part of the DSL — a nice abstraction that lets you extract elements declaratively.

    Scala Scraper is especially helpful when you want to map extracted nodes directly into Scala collections.

    🧮 6. Extracting Structured Data

    Let’s get a little more structured — turning HTML into Scala data.

    SCALA
    import net.ruippeixotog.scalascraper.browser.JsoupBrowser
    import net.ruippeixotog.scalascraper.dsl.DSL.*
    import net.ruippeixotog.scalascraper.dsl.DSL.Extract.*

    case class Headline(title: String, link: String)

    @main def scrapeNewsStructured(): Unit =
      val browser = JsoupBrowser()
      val doc = browser.get("https://en.wikipedia.org/")

      val headlines =
        for el <- doc >> elementList("#mp-itn li a")
        yield Headline(el.text, "https://en.wikipedia.org" + el.attr("href")) // hrefs are site-relative

      headlines.take(5).foreach(println)

    Output:

    PLAIN TEXT
    Headline(Election results announced, https://en.wikipedia.org/wiki/Election_results)
    Headline(New species discovered, https://en.wikipedia.org/wiki/New_species)
    ...

    Clean, typed, and immediately usable for further analysis or storage.

    ⚠️ 7. When Static Scraping Isn’t Enough

    At some point, you’ll try to scrape a modern site — say, a single-page React app or an infinite-scroll product list — and your jsoup code will suddenly return… nothing.

    Why?

    Because jsoup only sees HTML the server sends, not what JavaScript renders afterward.

    Modern websites use client-side frameworks (React, Vue, Angular) to populate content after load. That means your scraper receives an empty skeleton, and your selectors find zero elements.
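
    You can see this for yourself by comparing what jsoup receives with what your browser renders. A minimal sketch, where the URL and selector are hypothetical placeholders:

    SCALA
    import org.jsoup.Jsoup

    @main def inspectRawHtml(): Unit =
      // Hypothetical JS-rendered page and selector, purely for illustration
      val doc = Jsoup.connect("https://example.com/spa-products").get()
      val cards = doc.select(".product-card")
      println(s"Elements found in raw HTML: ${cards.size()}") // typically 0 on an SPA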

    Here’s a quick rule of thumb:

    Page Type    Works with jsoup?   Why
    Wikipedia    ✅                   Static HTML
    Amazon       ⚠️                   Some static, some dynamic
    LinkedIn     ❌                   JS-rendered + login required
    Twitter      ❌                   Fully dynamic SPA

    So when jsoup or Scala Scraper fail, you need to simulate a real browser.


    🧭 8. Enter Selenium: A Full Browser at Your Command

    Selenium automates browsers — allowing you to control Chrome, Firefox, or Edge through code. It’s heavier than jsoup but essential when pages require JavaScript execution.

    ⚙️ Setup

    You’ll need:

  • A browser (Chrome or Firefox)
  • Its matching WebDriver binary (chromedriver, geckodriver)

    Make sure they’re on your PATH. Recent Selenium releases (4.6+) also ship with Selenium Manager, which can download a matching driver automatically if it doesn’t find one.

    ✏️ Example: Wikipedia Again (Headless Mode)

    SCALA
    import org.openqa.selenium.chrome.ChromeDriver
    import org.openqa.selenium.chrome.ChromeOptions

    @main def seleniumWikipedia(): Unit =
      val options = ChromeOptions()
      options.addArguments("--headless=new", "--disable-gpu")
      val driver = ChromeDriver(options)

      driver.get("https://en.wikipedia.org/")
      println(driver.getTitle())

      driver.quit()

    Selenium runs an actual browser instance, loads all scripts, and renders the page. You can then query the DOM (via findElements and By.cssSelector) or even scroll and click.

    For example:

    SCALA
    import org.openqa.selenium.By

    val headlines = driver.findElements(By.cssSelector("#mp-itn b a"))
    headlines.forEach(h => println(h.getText))
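
    On JavaScript-heavy pages the content often isn’t in the DOM the instant the page loads, so you typically wait for it explicitly. A minimal sketch using Selenium’s WebDriverWait (the selector here is a hypothetical placeholder):

    SCALA
    import org.openqa.selenium.By
    import org.openqa.selenium.support.ui.{ExpectedConditions, WebDriverWait}
    import java.time.Duration

    // Wait up to 10 seconds for a dynamically rendered element to appear
    val pageWait = WebDriverWait(driver, Duration.ofSeconds(10))
    val firstCard = pageWait.until(
      ExpectedConditions.presenceOfElementLocated(By.cssSelector(".dynamic-content"))
    )
    println(firstCard.getText)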

    Powerful, but heavier. A dozen Selenium sessions consume gigabytes of memory. That’s fine for testing, but scraping at scale becomes impractical.

    💡 9. Static vs Dynamic vs API Scraping

    At this point, we’ve seen three clear approaches. Each has trade-offs.

    Method          Strengths            Weaknesses             Use When
    jsoup           Lightweight, fast    Fails on JS pages      Simple static sites
    Scala Scraper   Idiomatic, concise   Same JS limits         Data mapping in Scala
    Selenium        Handles dynamic JS   Slow, complex setup    Sites needing rendering

    Most production pipelines combine these:

  • jsoup for simple pages
  • Selenium for the stubborn ones
  • And increasingly, API-based scraping for everything else

    Let’s explore that last option.

    ☁️ 10. Simplifying Scraping with an API (Example: FoxScrape)

    When scraping grows beyond one or two pages, it’s not code complexity that gets you — it’s infrastructure.

    You start juggling:

  • Proxy rotation to avoid IP blocks
  • User-Agent spoofing
  • CAPTCHA solving
  • JavaScript rendering
  • Rate limiting

    Each adds overhead, cost, and maintenance.

    Wouldn’t it be better if you could just say:

    “Here’s the URL I want — please give me the final rendered HTML.”

    That’s the idea behind FoxScrape — a developer-friendly web scraping API.

    Instead of running browsers yourself, you delegate that work to FoxScrape’s infrastructure. It fetches, renders (if needed), and returns clean HTML to your code.

    ⚙️ Integration Example (with Scala)

    Let’s fetch and parse Wikipedia using FoxScrape’s API.

    SCALA
    import requests.*
    import org.jsoup.Jsoup

    @main def foxscrapeExample(): Unit =
      val api = "https://www.foxscrape.com/api/v1"
      val response = requests.get(api, params = Map(
        "url" -> "https://en.wikipedia.org/",
        "render_js" -> "false"
      ))

      val html = response.text()
      val doc = Jsoup.parse(html)
      println("Page title: " + doc.title())

    That’s it — no proxies, no headless Chrome, no extra dependencies.

    Want JavaScript content rendered? Just set "render_js" -> "true".

    You can still use all your jsoup or Scala Scraper logic to parse the returned HTML exactly as before — because FoxScrape’s output is just clean, ready-to-parse markup.
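
    For instance, here’s a minimal sketch that feeds the returned markup into Scala Scraper instead of plain jsoup (reusing the same endpoint and parameters as above):

    SCALA
    import requests.*
    import net.ruippeixotog.scalascraper.browser.JsoupBrowser
    import net.ruippeixotog.scalascraper.dsl.DSL.*
    import net.ruippeixotog.scalascraper.dsl.DSL.Extract.*

    @main def foxscrapeWithScalaScraper(): Unit =
      // Fetch the page through the API, then parse the returned HTML locally
      val response = requests.get("https://www.foxscrape.com/api/v1", params = Map(
        "url" -> "https://en.wikipedia.org/",
        "render_js" -> "false"
      ))

      val doc = JsoupBrowser().parseString(response.text())
      val links = doc >> elementList("#mp-itn b a")
      links.foreach(el => println(el.text))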

    🧱 11. When (and Why) to Use a Scraping API

    FoxScrape and similar APIs shine when you need to scale or when sites are hostile to automated access.

    Typical triggers to switch:

  • Frequent IP bans or 403 errors
  • JavaScript rendering required
  • Pages changing HTML layout frequently
  • High concurrency requirements

    🧮 Comparison

    Feature              jsoup / Scala Scraper   Selenium    FoxScrape API
    JavaScript support   ❌                       ✅           ✅
    Setup                Easy                    Complex     None
    Speed                Fast                    Slow        Fast
    Proxy handling       Manual                  Manual      Automatic
    Scale                High                    Low         Very high
    Cost                 Free                    High CPU    Pay-as-you-go

    FoxScrape essentially gives you Selenium-level scraping with jsoup-level simplicity — an elegant hybrid.

    🧩 12. Best Practices for Robust Scrapers

    Regardless of which tool you use, a few universal best practices keep your scrapers ethical and reliable.

    ⏳ Respect rate limits

    Insert delays between requests or use FoxScrape’s built-in throttling.

    🧱 Handle pagination gracefully

    Scrape multiple pages with small pauses and consistent logic.

    SCALA
    for i <- 1 to 5 do
      val page = s"https://example.com/page/$i"
      // scrape each page here, then pause briefly between requests
      Thread.sleep(1000)

    💾 Cache HTML locally

    During testing, save raw pages to files so you can debug parsing logic offline.
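
    A minimal sketch of that idea: download the page once, save the raw markup next to your project, and parse from disk on later runs (the file name is just an example; Files.writeString/readString need JDK 11+):

    SCALA
    import org.jsoup.Jsoup
    import java.nio.file.{Files, Paths}

    @main def cachedScrape(): Unit =
      val cache = Paths.get("wikipedia-home.html")

      // Download once, then reuse the saved copy on later runs
      if !Files.exists(cache) then
        val html = Jsoup.connect("https://en.wikipedia.org/").get().outerHtml()
        Files.writeString(cache, html)

      val doc = Jsoup.parse(Files.readString(cache), "https://en.wikipedia.org/")
      println(doc.title())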

    ⚖️ Follow site policies

    Always read robots.txt, avoid personal data, and use scraping responsibly.
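
    If you want to check robots.txt programmatically, here’s a deliberately naive sketch that just lists the Disallow rules (a real crawler would use a proper parser and match its own user-agent):

    SCALA
    import requests.*

    // Print the first few Disallow rules so you can compare them with your target paths
    val robots = requests.get("https://en.wikipedia.org/robots.txt").text()
    robots.linesIterator
      .filter(_.trim.startsWith("Disallow:"))
      .take(10)
      .foreach(println)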

    🧭 13. Conclusion: Building Smarter Scrapers

    We’ve traveled from simple static scraping with jsoup to full browser automation with Selenium — and finally to API-driven scraping that abstracts away the headaches.

    To recap:

    Scenario                       Best Tool
    Simple, static HTML            jsoup
    Functional Scala syntax        Scala Scraper
    Dynamic JavaScript content     Selenium or FoxScrape (render_js)
    Scalable, anti-bot scraping    FoxScrape API

    Ultimately, the best approach depends on your use case and scale.

    For quick one-offs, jsoup is perfect. For production-grade scraping that needs reliability, using an API like FoxScrape saves hours of maintenance.

    It’s not about writing more scraping code — it’s about writing less infrastructure code.

    🦊 Learn more and try it at FoxScrape.com — the simplest way to fetch any page, static or dynamic, straight into your Scala project.