Web Scraping with Scala

Web scraping sits at the intersection of curiosity and automation. It’s what happens when developers stop copying data manually and start thinking: couldn’t I just write code for this?
Whether you’re tracking prices, gathering research data, or analyzing trends across multiple websites, scraping is often the most direct way to collect public information at scale. But the web you scrape today is not the same as the web of a decade ago.
Modern sites use JavaScript frameworks, lazy loading, and aggressive bot protection. So to extract data reliably, we need to combine the right techniques, libraries, and occasionally — a little bit of infrastructure help.
In this article, we'll explore how to scrape the web effectively using Scala, comparing three major methods:

- jsoup, for fetching and parsing static HTML
- Scala Scraper, a more idiomatic, functional wrapper around jsoup
- Selenium, full browser automation for pages that need JavaScript
And finally, we’ll look at when it makes sense to offload complexity using an external API — introducing FoxScrape, a scraping API that handles browser rendering and proxy management for you.
But let’s not get ahead of ourselves. We’ll start from scratch — building your first scraping project in Scala, one line of code at a time.
⚙️ 1. Preparing Your Scala Environment
Before we scrape anything, we need a clean development setup.
🧩 Installing Scala and sbt
Ensure you have:

- a recent JDK (11 or later)
- Scala 3.x
- sbt (the Scala build tool)
Check your versions:
```bash
scala -version
sbt sbtVersion
```

If you're missing either, install via scala-lang.org or use Coursier (`cs setup`) for an all-in-one toolchain.
🏗️ Project Structure
Let’s create a new project:
```bash
sbt new scala/scala3.g8
cd my-scraper
```

This gives you a minimal template with `/src/main/scala/Main.scala`.
Now add our dependencies.
📦 build.sbt
```scala
ThisBuild / scalaVersion := "3.6.3"

libraryDependencies ++= Seq(
  "org.jsoup"               %  "jsoup"         % "1.18.1",
  "net.ruippeixotog"        %% "scala-scraper" % "3.1.0",
  "org.seleniumhq.selenium" %  "selenium-java" % "4.25.0",
  "com.lihaoyi"             %% "requests"      % "0.8.0" // for HTTP calls later
)
```

With this, we can experiment with all three approaches — from static HTML parsing to full browser automation — without switching environments.
🌐 2. Your First Scrape with jsoup
The simplest way to scrape is to fetch a page and parse it directly.
🧩 What jsoup Does
jsoup is a Java library (fully compatible with Scala) that:

- fetches pages over HTTP
- parses the HTML into a queryable document tree
- lets you select elements with CSS-style selectors
Perfect for static pages — like Wikipedia, blogs, or documentation sites.
✏️ Example: Scrape Wikipedia’s Title
```scala
import org.jsoup.Jsoup

@main def scrapeWikipedia(): Unit =
  val doc = Jsoup.connect("https://en.wikipedia.org/").get()
  println(doc.title())
```

Run it:

```bash
sbt run
```

Output:

```
Wikipedia, the free encyclopedia
```

That's jsoup in a nutshell — simple, fast, and excellent for HTML that's already rendered server-side.
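In practice, you'll usually want to configure the connection before calling `.get()`: jsoup lets you set a user agent, a timeout, and more. Here's a minimal sketch of a reusable fetch helper; the user-agent string and the timeout value are just illustrative choices:

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// A small fetch helper: identify ourselves and fail fast instead of hanging.
def fetch(url: String): Document =
  Jsoup.connect(url)
    .userAgent("Mozilla/5.0 (compatible; MyScalaScraper/1.0)") // example UA string
    .timeout(10000)                                            // 10 seconds, in milliseconds
    .get()

@main def fetchExample(): Unit =
  val doc = fetch("https://en.wikipedia.org/")
  println(doc.title())
```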
🎯 3. Selecting Elements with jsoup
Once you have a document, you’ll want to extract specific sections — headlines, links, or data fields.
🔍 CSS Selectors in jsoup
jsoup uses CSS-like selectors:
| Selector | Meaning |
|---|---|
| `#id` | element with that id |
| `.class` | elements with that class |
| `div p` | `<p>` inside a `<div>` |
| `a[href]` | `<a>` with an `href` attribute |
🧠 Example: Wikipedia’s “In the news” section
```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters.*

@main def wikipediaNews(): Unit =
  val doc = Jsoup.connect("https://en.wikipedia.org/").get()

  val news = doc.select("#mp-itn b a")
  for n <- news.asScala do
    println(s"${n.text()} → ${n.absUrl("href")}")
```

We grab the bolded links in the "In the news" section using the #mp-itn ID and print their text and URLs. (The `asScala` conversion comes from `scala.jdk.CollectionConverters`, which we import above.)
This is where scraping starts to feel magical — the ability to programmatically traverse the same structure you inspect in your browser’s “View Source.”
🧩 4. Extracting Text and Attributes
jsoup makes it trivial to extract values:
| Method | Description |
|---|---|
| `.text()` | Returns the text content |
| `.attr("href")` | Returns the attribute value |
| `.children()` | Gets child elements |
| `.first()` / `.last()` | Return the first/last element in a selection |
Example:
```scala
import scala.jdk.CollectionConverters.*

val headlines = doc.select("#mp-itn li")

for item <- headlines.asScala do
  val link  = item.select("a").first() // assumes each list item contains a link
  val title = link.text()
  val href  = link.absUrl("href")
  println(s"$title → $href")
```

This gives you structured pairs of text and URLs — perfect for later saving into JSON, CSV, or a database.
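Saving those pairs takes only a few more lines of standard library code. Here's a minimal sketch that writes them to a CSV file; the file name and the simple quoting scheme are just illustrative:

```scala
import org.jsoup.Jsoup
import scala.jdk.CollectionConverters.*
import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets

@main def saveHeadlinesCsv(): Unit =
  val doc = Jsoup.connect("https://en.wikipedia.org/").get()

  // One CSV row per link; double any quotes in the title so commas stay safe.
  val rows = doc.select("#mp-itn li a").asScala.map { link =>
    val title = link.text().replace("\"", "\"\"")
    s""""$title",${link.absUrl("href")}"""
  }

  val csv = ("title,url" +: rows.toSeq).mkString("\n")
  Files.write(Paths.get("headlines.csv"), csv.getBytes(StandardCharsets.UTF_8))
  println(s"Wrote ${rows.size} rows to headlines.csv")
```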
🧰 5. Scraping with Scala Scraper
While jsoup is powerful, its Java-style syntax isn’t very Scala-idiomatic.
That’s where the Scala Scraper library comes in — it wraps jsoup in a more functional, expressive API.
📦 Setup
We already added it to our build.sbt.
✏️ Example
```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL.*
import net.ruippeixotog.scalascraper.dsl.DSL.Extract.*

@main def scrapeWithScalaScraper(): Unit =
  val browser = JsoupBrowser()
  val doc = browser.get("https://en.wikipedia.org/")

  val titles = doc >> elementList("#mp-itn b a")
  titles.foreach { el =>
    println(el.text + " → " + el.attr("href"))
  }
```

Notice how concise it feels. The >> operator is part of the DSL — a nice abstraction that lets you extract elements declaratively. (Extractors such as elementList live in DSL.Extract, hence the extra import.)
Scala Scraper is especially helpful when you want to map extracted nodes directly into Scala collections.
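The DSL offers more than elementList: there are extractors such as text and attr, and a >?> operator that returns an Option instead of throwing when an element is missing. A minimal sketch, reusing the same Wikipedia selectors as above:

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL.*
import net.ruippeixotog.scalascraper.dsl.DSL.Extract.*

@main def dslExtractors(): Unit =
  val browser = JsoupBrowser()
  val doc = browser.get("https://en.wikipedia.org/")

  // text(...) extracts the text of the first matching element.
  val pageTitle: String = doc >> text("title")

  // >?> wraps the result in an Option, so a missing element yields None.
  val firstNewsItem: Option[String] = doc >?> text("#mp-itn b a")

  // attr(...) pulls a single attribute from the first match.
  val firstNewsLink: Option[String] = doc >?> attr("href")("#mp-itn b a")

  println(pageTitle)
  println(firstNewsItem.getOrElse("no news section found"))
  println(firstNewsLink.getOrElse("no link found"))
```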
🧮 6. Extracting Structured Data
Let’s get a little more structured — turning HTML into Scala data.
```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL.*
import net.ruippeixotog.scalascraper.dsl.DSL.Extract.*

case class Headline(title: String, link: String)

@main def scrapeNewsStructured(): Unit =
  val browser = JsoupBrowser()
  val doc = browser.get("https://en.wikipedia.org/")

  // attr gives the raw href, which is relative on Wikipedia, so prepend the host
  val headlines =
    for el <- doc >> elementList("#mp-itn li a")
    yield Headline(el.text, "https://en.wikipedia.org" + el.attr("href"))

  headlines.take(5).foreach(println)
```

Output:

```
Headline(Election results announced, https://en.wikipedia.org/wiki/Election_results)
Headline(New species discovered, https://en.wikipedia.org/wiki/New_species)
...
```

Clean, typed, and immediately usable for further analysis or storage.
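If you want that data as JSON, one option is Li Haoyi's upickle library. It isn't in the build.sbt above, so treat this as a sketch that assumes you add it as an extra dependency:

```scala
// Assumes an extra dependency, e.g. "com.lihaoyi" %% "upickle" % "3.1.0"
import upickle.default.*

case class Headline(title: String, link: String)
object Headline:
  given ReadWriter[Headline] = macroRW

@main def headlinesToJson(): Unit =
  // In a real scraper these would come from the extraction step above.
  val headlines = List(
    Headline("Example story", "https://en.wikipedia.org/wiki/Example"),
    Headline("Another story", "https://en.wikipedia.org/wiki/Another")
  )
  println(write(headlines, indent = 2)) // a pretty-printed JSON array
```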
⚠️ 7. When Static Scraping Isn’t Enough
At some point, you’ll try to scrape a modern site — say, a single-page React app or an infinite-scroll product list — and your jsoup code will suddenly return… nothing.
Why?
Because jsoup only sees HTML the server sends, not what JavaScript renders afterward.
Modern websites use client-side frameworks (React, Vue, Angular) to populate content after load. That means your scraper receives an empty skeleton, and your selectors find zero elements.
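You can see the symptom with a few lines of jsoup. The URL and selector below are placeholders for any JavaScript-rendered single-page app you're allowed to scrape:

```scala
import org.jsoup.Jsoup

@main def emptySkeleton(): Unit =
  // Hypothetical SPA URL; the server typically returns only a mount point like <div id="app"></div>.
  val doc = Jsoup.connect("https://spa.example.com/").get()

  val products = doc.select(".product-card") // selector for content rendered client-side
  println(s"Found ${products.size()} product cards") // typically prints 0
```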
Here’s a quick rule of thumb:
| Page Type | Works with jsoup? | Why |
|---|---|---|
| Wikipedia | ✅ | Static HTML |
| Amazon | ⚠️ | Some static, some dynamic |
| Login-gated social feeds | ❌ | JS-rendered + login required |
| Single-page apps (SPAs) | ❌ | Fully dynamic SPA |
So when jsoup or Scala Scraper comes up empty, you need to simulate a real browser.
🧭 8. Enter Selenium: A Full Browser at Your Command
Selenium automates browsers — allowing you to control Chrome, Firefox, or Edge through code. It’s heavier than jsoup but essential when pages require JavaScript execution.
⚙️ Setup
You'll need:

- a real browser installed (Chrome, Firefox, or Edge)
- the matching driver binary (chromedriver, geckodriver)

Make sure they're on your PATH.
✏️ Example: Wikipedia Again (Headless Mode)
```scala
import org.openqa.selenium.chrome.ChromeDriver
import org.openqa.selenium.chrome.ChromeOptions

@main def seleniumWikipedia(): Unit =
  val options = ChromeOptions()
  options.addArguments("--headless=new", "--disable-gpu")
  val driver = ChromeDriver(options)

  driver.get("https://en.wikipedia.org/")
  println(driver.getTitle())

  driver.quit()
```

Selenium runs an actual browser instance, loads all scripts, and renders the page. You can then query the DOM (via findElements and By.cssSelector) or even scroll and click.
For example:
```scala
import org.openqa.selenium.By

val headlines = driver.findElements(By.cssSelector("#mp-itn b a"))
headlines.forEach(h => println(h.getText))
```

Powerful, but heavier. A dozen Selenium sessions consume gigabytes of memory. That's fine for testing, but scraping at scale becomes impractical.
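One practical detail: on truly dynamic pages the content often finishes rendering after driver.get returns, so you normally wait for a specific element before reading the DOM. Here's a minimal sketch using Selenium's explicit waits; the ten-second timeout is an arbitrary example:

```scala
import java.time.Duration
import org.openqa.selenium.By
import org.openqa.selenium.chrome.{ChromeDriver, ChromeOptions}
import org.openqa.selenium.support.ui.{ExpectedConditions, WebDriverWait}

@main def seleniumWithWait(): Unit =
  val options = ChromeOptions()
  options.addArguments("--headless=new")
  val driver = ChromeDriver(options)
  try
    driver.get("https://en.wikipedia.org/")

    // Block (up to 10 seconds) until the "In the news" links are present in the DOM.
    val waiter = WebDriverWait(driver, Duration.ofSeconds(10))
    waiter.until(ExpectedConditions.presenceOfAllElementsLocatedBy(By.cssSelector("#mp-itn b a")))

    driver.findElements(By.cssSelector("#mp-itn b a")).forEach(h => println(h.getText))
  finally
    driver.quit()
```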
💡 9. Static vs Dynamic vs API Scraping
At this point, we’ve seen three clear approaches. Each has trade-offs.
| Method | Strengths | Weaknesses | Use When |
|---|---|---|---|
| jsoup | Lightweight, fast | Fails on JS pages | Simple static sites |
| Scala Scraper | Idiomatic, concise | Same JS limits | Data mapping in Scala |
| Selenium | Handles dynamic JS | Slow, complex setup | Sites needing rendering |
Most production pipelines combine these:

- jsoup or Scala Scraper for the bulk of static pages
- Selenium for the few pages that genuinely need JavaScript rendering
- a hosted scraping API when scale, proxies, or anti-bot measures become the bottleneck

A common pattern for the first two is "static first, rendered fallback", sketched below.
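This is a hypothetical helper, not a library function: it tries a cheap jsoup fetch first and only spins up a headless browser when the selector we care about comes back empty.

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Document
import org.openqa.selenium.chrome.{ChromeDriver, ChromeOptions}

// Hypothetical "static first, rendered fallback" fetch.
def fetchSmart(url: String, mustMatch: String): Document =
  val static = Jsoup.connect(url).get()
  if !static.select(mustMatch).isEmpty then static
  else
    val options = ChromeOptions()
    options.addArguments("--headless=new")
    val driver = ChromeDriver(options)
    try
      driver.get(url)
      Jsoup.parse(driver.getPageSource, url) // re-parse the rendered HTML with jsoup
    finally
      driver.quit()
```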
Let's explore that last option: offloading the work to a scraping API.
☁️ 10. Simplifying Scraping with an API (Example: FoxScrape)
When scraping grows beyond one or two pages, it’s not code complexity that gets you — it’s infrastructure.
You start juggling:

- headless browser processes and their memory footprint
- proxy pools and IP rotation
- CAPTCHAs and bot-detection countermeasures
- retries, timeouts, and rate limits
Each adds overhead, cost, and maintenance.
Wouldn’t it be better if you could just say:
“Here’s the URL I want — please give me the final rendered HTML.”
That’s the idea behind FoxScrape — a developer-friendly web scraping API.
Instead of running browsers yourself, you delegate that work to FoxScrape’s infrastructure. It fetches, renders (if needed), and returns clean HTML to your code.
⚙️ Integration Example (with Scala)
Let’s fetch and parse Wikipedia using FoxScrape’s API.
```scala
import requests.*
import org.jsoup.Jsoup

@main def foxscrapeExample(): Unit =
  val api = "https://www.foxscrape.com/api/v1"
  val response = requests.get(api, params = Map(
    "url" -> "https://en.wikipedia.org/",
    "render_js" -> "false"
  ))

  val html = response.text()
  val doc = Jsoup.parse(html)
  println("Page title: " + doc.title())
```

That's it — no proxies, no headless Chrome, no extra dependencies.
Want JavaScript content rendered? Just set "render_js" -> "true".
You can still use all your jsoup or Scala Scraper logic to parse the returned HTML exactly as before — because FoxScrape’s output is just clean, ready-to-parse markup.
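For instance, if you prefer the Scala Scraper DSL, you can hand the returned HTML to JsoupBrowser's parseString instead of raw jsoup. A small sketch:

```scala
import requests.*
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL.*
import net.ruippeixotog.scalascraper.dsl.DSL.Extract.*

@main def foxscrapeWithScalaScraper(): Unit =
  val response = requests.get("https://www.foxscrape.com/api/v1", params = Map(
    "url" -> "https://en.wikipedia.org/",
    "render_js" -> "false"
  ))

  // parseString builds a document from HTML you already have in memory.
  val doc = JsoupBrowser().parseString(response.text())
  val headlines = doc >> elementList("#mp-itn b a")
  headlines.foreach(el => println(el.text))
```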
🧱 11. When (and Why) to Use a Scraping API
FoxScrape and similar APIs shine when you need to scale or when sites are hostile to automated access.
Typical triggers to switch:

- you start getting blocked, rate-limited, or served CAPTCHAs
- the pages you need are JavaScript-rendered and running Selenium at volume is too expensive
- you spend more time maintaining proxies and browsers than writing parsing logic
🧮 Comparison
| Feature | jsoup / Scala Scraper | Selenium | FoxScrape API |
|---|---|---|---|
| JavaScript support | ❌ | ✅ | ✅ |
| Setup | Easy | Complex | None |
| Speed | Fast | Slow | Fast |
| Proxy handling | Manual | Manual | Automatic |
| Scale | High | Low | Very high |
| Cost | Free | High CPU | Pay-as-you-go |
FoxScrape essentially gives you Selenium-level scraping with jsoup-level simplicity — an elegant hybrid.
🧩 12. Best Practices for Robust Scrapers
Regardless of which tool you use, a few universal best practices keep your scrapers ethical and reliable.
⏳ Respect rate limits
Insert delays between requests or use FoxScrape’s built-in throttling.
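The do-it-yourself version is a tiny helper that sleeps before each request and retries on failure. This is a sketch; the delay, retry count, and backoff factor are arbitrary:

```scala
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Hypothetical polite-fetch helper: fixed delay, simple retries with backoff.
def politeFetch(url: String, retries: Int = 3, delayMs: Long = 1500): Document =
  Thread.sleep(delayMs)
  try Jsoup.connect(url).get()
  catch
    case e: java.io.IOException if retries > 0 =>
      println(s"Retrying $url (${e.getMessage})")
      politeFetch(url, retries - 1, delayMs * 2) // back off a little each time
```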
🧱 Handle pagination gracefully
Scrape multiple pages with small pauses and consistent logic.
```scala
for i <- 1 to 5 do
  val page = s"https://example.com/page/$i"
  // fetch and parse each page here
  Thread.sleep(1000) // small pause between requests
```

💾 Cache HTML locally
During testing, save raw pages to files so you can debug parsing logic offline.
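One way to do that with the standard library; the cache directory name is arbitrary, and the URL is hashed so it makes a safe file name:

```scala
import java.nio.file.{Files, Paths}
import java.nio.charset.StandardCharsets
import java.security.MessageDigest
import org.jsoup.Jsoup
import org.jsoup.nodes.Document

// Hypothetical helper: fetch a URL once, then reuse the HTML saved on disk.
def cachedFetch(url: String): Document =
  val digest = MessageDigest.getInstance("SHA-256")
    .digest(url.getBytes(StandardCharsets.UTF_8))
    .map("%02x".format(_)).mkString
  val path = Paths.get("cache", s"$digest.html")

  val html =
    if Files.exists(path) then Files.readString(path)
    else
      val fetched = Jsoup.connect(url).get().outerHtml()
      Files.createDirectories(path.getParent)
      Files.writeString(path, fetched)
      fetched

  Jsoup.parse(html, url)
```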
⚖️ Follow site policies
Always read robots.txt, avoid personal data, and use scraping responsibly.
🧭 13. Conclusion: Building Smarter Scrapers
We’ve traveled from simple static scraping with jsoup to full browser automation with Selenium — and finally to API-driven scraping that abstracts away the headaches.
To recap:
| Scenario | Best Tool |
|---|---|
| Simple, static HTML | jsoup |
| Functional Scala syntax | Scala Scraper |
| Dynamic JavaScript content | Selenium or FoxScrape (render_js) |
| Scalable, anti-bot scraping | FoxScrape API |
Ultimately, the best approach depends on your use case and scale.
For quick one-offs, jsoup is perfect. For production-grade scraping that needs reliability, using an API like FoxScrape saves hours of maintenance.
It’s not about writing more scraping code — it’s about writing less infrastructure code.
🦊 Learn more and try it at FoxScrape.com — the simplest way to fetch any page, static or dynamic, straight into your Scala project.
Further Reading

- A Complete Guide to Web Scraping in R
- Web Scraping with PHP
- Web Scraping with Java Made Easy