Web Scraping with Java Made Easy

Written by Mantas Kemėšius

Web scraping is one of those essential developer skills that sits somewhere between art and engineering. Whether you’re collecting product data, monitoring competitors, or automating a data feed — understanding how to extract information from the web efficiently can give you a serious edge.

In this guide, we’ll explore how to build web scrapers in Java, step-by-step, using the most popular libraries available — from the simple and elegant Jsoup, to HtmlUnit and Selenium for more dynamic scenarios.

Along the way, we’ll also look at a simpler alternative for those who want to avoid complex setups and anti-bot headaches: using a hosted scraping API like FoxScrape.

🧠 What Is Web Scraping, Really?

At its core, web scraping means programmatically loading a web page and extracting specific data — such as product names, prices, or links — so that it can be reused or analyzed.

Example use cases:

  • Aggregating listings from e-commerce sites
  • Monitoring real-time market data
  • Collecting news headlines
  • Extracting SEO or keyword data
  • Powering AI models with structured datasets

⚖️ Always remember: scraping should be done ethically. Respect robots.txt, obey site terms, and don’t overload servers with unnecessary requests.
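
As a taste of what "polite" looks like in code, here is a minimal sketch using Jsoup (which we introduce below): it identifies itself with a descriptive User-Agent and pauses between requests. The URLs and the delay value are placeholders.

JAVA
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PoliteScraper {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs — replace with the pages you are allowed to scrape
        String[] urls = {"https://example.com/page1", "https://example.com/page2"};
        for (String url : urls) {
            Document doc = Jsoup.connect(url)
                    .userAgent("my-research-bot/1.0 (contact@example.com)") // identify yourself honestly
                    .timeout(10_000)                                         // fail fast on slow responses
                    .get();
            System.out.println(doc.title());
            Thread.sleep(2000); // pause between requests so you don't overload the server
        }
    }
}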

⚙️ Choosing the Right Tools for Java Web Scraping

Java offers a rich set of tools for different scraping needs. Each tool serves a purpose depending on whether a website is static, dynamic, or heavily reliant on JavaScript.

Library    Ideal Use Case                                  Key Strength
Jsoup      Static pages with structured HTML               Lightweight and elegant HTML parser
HtmlUnit   Simulating form interactions or logins          Acts like a lightweight headless browser
Selenium   Full JavaScript rendering and browser control   Ideal for dynamic, JS-heavy websites

We’ll explore all three, starting with the most straightforward: Jsoup.

    🧾 Scraping Static Websites with Jsoup

    For static HTML pages (sites where the content is available in the HTML itself), Jsoup is the gold standard. It’s fast, simple, and reads almost like natural language.

    🧩 Example: Extracting Product Titles

    Let’s scrape all product titles from a sample store page.

JAVA
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupExample {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com/products").get();
        Elements titles = doc.select(".product-title");
        titles.forEach(t -> System.out.println(t.text()));
    }
}
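
The same selector-based style extends to attributes and nested elements. Here is a small sketch, assuming a hypothetical listing where each .product card contains a title, a .product-price element, and a link:

JAVA
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupAttributesExample {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com/products").get();
        for (Element product : doc.select(".product")) {               // hypothetical card selector
            String name  = product.select(".product-title").text();
            String price = product.select(".product-price").text();    // hypothetical price element
            String link  = product.select("a").attr("abs:href");       // absolute URL of the product link
            System.out.println(name + " | " + price + " | " + link);
        }
    }
}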

    This works beautifully — but only if the website is static.

    If the content is rendered by JavaScript, you’ll end up with empty results because Jsoup never executes client-side scripts.

    💻 Scraping Forms and Simulating Actions with HtmlUnit

    Some sites require interaction — like filling out a search form or logging in before you can access data.

    That’s where HtmlUnit comes in handy.

    It’s a headless browser written in Java, capable of managing sessions, cookies, and form submissions.

    Example: Submitting a Search Form

JAVA
import com.gargoylesoftware.htmlunit.*;
import com.gargoylesoftware.htmlunit.html.*;

public class HtmlUnitExample {
    public static void main(String[] args) throws Exception {
        try (final WebClient client = new WebClient(BrowserVersion.CHROME)) {
            HtmlPage page = client.getPage("https://example.com/search");
            HtmlForm form = page.getForms().get(0);
            HtmlTextInput input = form.getInputByName("query");
            input.setValueAttribute("laptops");
            HtmlSubmitInput submit = form.getInputByName("submit");
            HtmlPage result = submit.click();
            System.out.println(result.asText());
        }
    }
}

    This code performs a real search — just like a browser would — and prints the result.
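
In practice, you will usually want to tune the WebClient before fetching. The lines below are a hedged sketch of settings that commonly help; they drop in right after the client is created inside the try-with-resources block above, and you should adjust them to the site you are scraping:

JAVA
// Suggested WebClient tuning (inside the try block, right after creating the client)
client.getOptions().setJavaScriptEnabled(true);             // execute the page's scripts
client.getOptions().setThrowExceptionOnScriptError(false);  // don't abort on broken third-party JS
client.getOptions().setCssEnabled(false);                   // skip CSS processing for speed
client.getOptions().setTimeout(10_000);                     // network timeout in milliseconds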

    It’s a great approach for sites with basic interactivity, but it can’t handle modern, JavaScript-heavy frontends.

    ⚡ Dealing with JavaScript-Heavy Websites

    And here’s where many Java developers hit the wall.

    Modern websites rely heavily on frameworks like React, Vue, or Angular. These sites load data dynamically, meaning the content doesn’t exist in the raw HTML source — it’s generated later in the browser.

    In these cases, Jsoup and HtmlUnit can’t help much.

    The Traditional Fix: Selenium

    Selenium allows Java to control a real browser — load the page, wait for JS to execute, and then extract the rendered HTML.

JAVA
import org.openqa.selenium.*;
import org.openqa.selenium.chrome.ChromeDriver;

public class SeleniumExample {
    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver();
        driver.get("https://example.com/dynamic");
        String html = driver.getPageSource();
        System.out.println(html);
        driver.quit();
    }
}

    This works, but it’s heavy. You’ll need:

  • A browser driver installed (like ChromeDriver)
  • System dependencies and updates
  • Proper headless mode configuration for servers (see the sketch after this list)

If you only need to retrieve data — not control the browser — this setup can be excessive.
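
When you do need Selenium on a server, a minimal headless setup with an explicit wait looks roughly like this — a sketch, assuming Selenium 4 and a hypothetical .product-title element that appears once the JavaScript has run:

JAVA
import java.time.Duration;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class HeadlessSeleniumExample {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless=new");   // run Chrome without a visible window
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com/dynamic");
            // Wait until the JavaScript-rendered content is actually present
            new WebDriverWait(driver, Duration.ofSeconds(10))
                    .until(ExpectedConditions.presenceOfElementLocated(By.cssSelector(".product-title")));
            System.out.println(driver.getPageSource());
        } finally {
            driver.quit();
        }
    }
}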

    🦊 The Smarter Alternative: Using FoxScrape API

    Let’s pause here and think practically.

    What if you could:

  • Fetch fully rendered HTML, even from JS-heavy sites
  • Skip setting up Selenium or proxies
  • Handle authentication and headers automatically
  • Get results in seconds, from a single endpoint

That’s what FoxScrape is built for.

    FoxScrape acts as a cloud-based scraping layer — you send a URL, and it returns the rendered HTML or API response, ready to parse with Jsoup or Jackson.

    Here’s how the same Selenium task looks with FoxScrape:

JAVA
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FoxScrapeExample {
    public static void main(String[] args) throws Exception {
        String foxUrl = "https://www.foxscrape.com/api/v1?url=https://example.com/dynamic&render_js=true";
        Document doc = Jsoup.connect(foxUrl).get();
        System.out.println(doc.title());
    }
}

    That’s it — one call, one response.

No browser drivers. No proxies. No manual waiting for rendering.

    FoxScrape takes care of:

  • JavaScript execution (via headless browsers)
  • IP rotation
  • Captcha bypass
  • Custom headers and authentication

It returns the final rendered HTML, which you can parse using the same Jsoup logic as before.

    This approach is perfect for production-scale scraping or cloud deployments where simplicity and reliability matter more than controlling a local browser.

    🔁 Handling Infinite Scroll and AJAX Requests

    Infinite scroll pages are another tricky scenario.

    When you scroll, the site sends background (AJAX) requests to load new data.

    You can handle this in two ways:

  • Use Selenium to scroll:

JAVA
JavascriptExecutor js = (JavascriptExecutor) driver;
for (int i = 0; i < 5; i++) {
    js.executeScript("window.scrollTo(0, document.body.scrollHeight)");
    Thread.sleep(2000);
}

  • Inspect the network requests in your browser’s DevTools to find the real data source.

You’ll often find a JSON endpoint like:

PLAIN TEXT
https://api.example.com/products?page=3

    You can then call this directly:

JAVA
String jsonUrl = "https://api.example.com/products?page=3";
String response = Jsoup.connect(jsonUrl).ignoreContentType(true).execute().body();
System.out.println(response);

    If the site hides or dynamically generates this API, you can use FoxScrape to render and extract the full scrolled content without writing scrolling logic:

PLAIN TEXT
https://www.foxscrape.com/api/v1?url=https://example.com/products&render_js=true

    🧮 Parsing JSON Data with Jackson

    When your scraped data is in JSON format, use a library like Jackson to process it.

JAVA
import com.fasterxml.jackson.databind.*;

public class JsonParseExample {
    public static void main(String[] args) throws Exception {
        String json = "{\"product\": \"Laptop\", \"price\": 1200}";
        ObjectMapper mapper = new ObjectMapper();
        JsonNode node = mapper.readTree(json);
        System.out.println(node.get("product").asText());
    }
}

    You can chain this with any request — including one from FoxScrape — to directly parse structured data.
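
For example, here is a hedged sketch that fetches a JSON endpoint through FoxScrape (reusing the URL pattern from the example above) and hands the response straight to Jackson. The target API and its fields are hypothetical:

JAVA
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.jsoup.Jsoup;

public class FoxScrapeJsonExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical JSON API from the infinite-scroll section, passed through FoxScrape
        String target = URLEncoder.encode("https://api.example.com/products?page=3", StandardCharsets.UTF_8);
        String foxUrl = "https://www.foxscrape.com/api/v1?url=" + target;

        String json = Jsoup.connect(foxUrl).ignoreContentType(true).execute().body();
        JsonNode root = new ObjectMapper().readTree(json);
        root.forEach(item -> System.out.println(item.get("product"))); // assumes an array of objects with a "product" field
    }
}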

    💾 Saving and Structuring Your Data

    Once you have your parsed data, store it in a format that suits your workflow.

    Example: Writing to CSV

JAVA
import java.io.FileWriter;
import com.opencsv.CSVWriter;

public class CsvWriterExample {
    public static void main(String[] args) throws Exception {
        try (CSVWriter writer = new CSVWriter(new FileWriter("data.csv"))) {
            String[] header = {"Name", "Price"};
            writer.writeNext(header);
            writer.writeNext(new String[]{"Laptop", "1200"});
        }
    }
}

    For larger projects, consider:

  • Batching results to avoid memory overload (see the sketch after this list)
  • Rate limiting to respect server load
  • Using databases (PostgreSQL, MongoDB) for structured storage
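
As a rough sketch of the first two points, here is how batched writes plus a simple per-page delay might look. scrapePage is a hypothetical stand-in for your real per-page scraping logic:

JAVA
import java.io.FileWriter;
import java.util.Collections;
import java.util.List;
import com.opencsv.CSVWriter;

public class BatchedScrapeExample {
    // Hypothetical helper standing in for your real per-page scraping logic
    static List<String[]> scrapePage(int page) {
        if (page > 3) {
            return Collections.emptyList();    // pretend the site has only 3 pages
        }
        return Collections.singletonList(new String[]{"Laptop " + page, "1200"});
    }

    public static void main(String[] args) throws Exception {
        try (CSVWriter writer = new CSVWriter(new FileWriter("data.csv"))) {
            writer.writeNext(new String[]{"Name", "Price"});
            for (int page = 1; ; page++) {
                List<String[]> rows = scrapePage(page);
                if (rows.isEmpty()) break;          // stop when a page comes back empty
                rows.forEach(writer::writeNext);    // write each batch as soon as it arrives
                writer.flush();                     // keep memory usage flat between batches
                Thread.sleep(1000);                 // simple rate limit between pages
            }
        }
    }
}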

🧭 Best Practices for Web Scraping

    Building a good scraper isn’t just about code — it’s about being efficient and ethical.

    Do:

  • Cache your results where possible
  • Limit request frequency
  • Rotate IPs when scaling
  • Use proper user-agent headers

Don’t:

  • Scrape login-protected or private data
  • Violate terms of service
  • Hit APIs aggressively without throttling

🦊 Pro tip: FoxScrape automatically manages rate limiting, IP rotation, and JavaScript rendering, so you can scale safely without managing infrastructure yourself.

    🏁 Wrapping It Up

    By now, you’ve seen the full range of Java’s web scraping capabilities:

Use Case                      Recommended Tool
Static HTML                   Jsoup
Form Submissions / Light JS   HtmlUnit
Full JS Rendering             Selenium
Automated Managed Scraping    FoxScrape

    If you enjoy building scrapers manually — Jsoup, HtmlUnit, and Selenium give you full control.

    But if your goal is speed, simplicity, and reliability, FoxScrape provides a powerful shortcut: an all-in-one scraping API that handles browsers, proxies, and rendering for you.

    In short, use your code for logic, not logistics.

    Happy scraping, responsibly and efficiently.