Web Scraping with Elixir

Written by Mantas Kemėšius

Web scraping in Elixir is a bit like having a high-performance data engine at your fingertips. Thanks to Elixir’s concurrency and Crawly, a dedicated web scraping framework, you can build scalable crawlers that fetch, parse, and store data efficiently.

In this tutorial, we’ll go from setting up a simple project to building a multi-page spider, extracting product prices, and finally leveraging FoxScrape to handle tricky pages or anti-scraping protections.

By the end, you’ll know how to:

  • Build spiders in Elixir with Crawly
  • Parse HTML with Floki
  • Handle multi-page crawling and structured data
  • Use FoxScrape to simplify scraping of protected or JS-heavy pages

    🛠️ 1. Why Scrape with Elixir?

    Elixir is built on the Erlang VM, which provides lightweight processes, fault tolerance, and concurrency out of the box. For scraping, this means:

    Feature            Benefit
    Concurrency        Crawl multiple pages at once efficiently
    Fault tolerance    A crashed worker doesn't bring down the whole crawl
    OTP support        Makes building supervised scrapers easier
    Erlang VM speed    Handle thousands of requests with minimal overhead
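
    As a quick illustration of the concurrency benefit above, here is a standalone sketch (not part of the project we build below) that fetches a few pages in parallel with Task.async_stream and HTTPoison:

    ELIXIR
    # Fetch three pages concurrently and print status code + body size for each
    urls = [
      "https://books.toscrape.com/catalogue/page-1.html",
      "https://books.toscrape.com/catalogue/page-2.html",
      "https://books.toscrape.com/catalogue/page-3.html"
    ]

    urls
    |> Task.async_stream(fn url -> HTTPoison.get!(url) end, max_concurrency: 3, timeout: 15_000)
    |> Enum.each(fn {:ok, %HTTPoison.Response{status_code: status, body: body}} ->
      IO.puts("#{status}: #{byte_size(body)} bytes")
    end)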

    Popular scraping libraries in Elixir:

  • Crawly: Full-featured web scraping framework
  • Floki: HTML parser and selector library
  • HTTPoison / Finch: HTTP clients
  • Jason: JSON serialization

    ⚙️ 2. Setting Up Your Project

    Create a new Elixir project with OTP support:

    BASH
    mix new price_spider --sup
    cd price_spider

    Add dependencies to mix.exs:

    ELIXIR
    defp deps do
      [
        {:crawly, "~> 0.13"},
        {:floki, "~> 0.33"},
        {:httpoison, "~> 2.1"},
        {:jason, "~> 1.4"}
      ]
    end

    Install them:

    BASH
    mix deps.get

    This gives you Crawly for crawling, Floki for parsing, and HTTP clients for fetching pages manually or via APIs.

    🕷️ 3. Creating a Spider

    Crawly spiders are just modules implementing the Crawly.Spider behavior. A minimal spider looks like:

    ELIXIR
    defmodule PriceSpider.BasicSpider do
      use Crawly.Spider

      @impl Crawly.Spider
      def base_url(), do: "https://books.toscrape.com"

      @impl Crawly.Spider
      def init() do
        [start_urls: [base_url()]]
      end

      @impl Crawly.Spider
      def parse_item(_response) do
        # No items or follow-up requests yet
        %Crawly.ParsedItem{items: [], requests: []}
      end
    end
  • base_url/0 defines the domain to crawl
  • init/0 sets starting URLs
  • parse_item/1 processes the HTTP response and returns extracted items (and follow-up requests) as a Crawly.ParsedItem

    At this point, running the spider yields no items. Before adding extraction, you can give it a quick test run.
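
    A minimal way to start and stop the spider from IEx (iex -S mix), assuming the default Crawly setup:

    ELIXIR
    # Start the (still empty) spider...
    Crawly.Engine.start_spider(PriceSpider.BasicSpider)

    # ...and stop it again when you're done
    Crawly.Engine.stop_spider(PriceSpider.BasicSpider)

    Now let's extract some data.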

    📄 4. Extracting Data with Floki

    Floki lets you parse HTML and extract elements with CSS selectors.

    Example: scraping book titles and prices:

    ELIXIR
    def parse_item(response) do
      {:ok, document} = Floki.parse_document(response.body)

      items =
        document
        |> Floki.find(".product_pod")
        |> Enum.map(fn product ->
          title = product |> Floki.find("h3 a") |> Floki.attribute("title") |> List.first()
          price = product |> Floki.find(".price_color") |> Floki.text()
          %{title: title, price: price}
        end)

      # Return the items to Crawly; its pipelines take care of storage
      %Crawly.ParsedItem{items: items, requests: []}
    end

    What’s happening:

  • Floki.parse_document/1 parses HTML into a queryable structure
  • Floki.find/2 selects elements by CSS selector
  • We map over the nodes to extract structured data (see the standalone sketch below)
  • The returned %Crawly.ParsedItem{} hands the items and any follow-up requests to Crawly for processing
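
    To see what the individual Floki calls return on their own, here is a small standalone sketch (the HTML fragment is made up for illustration):

    ELIXIR
    html = """
    <article class="product_pod">
      <h3><a title="A Light in the Attic" href="#">A Light in the ...</a></h3>
      <p class="price_color">£51.77</p>
    </article>
    """

    {:ok, document} = Floki.parse_document(html)

    document |> Floki.find("h3 a") |> Floki.attribute("title")
    # => ["A Light in the Attic"]

    document |> Floki.find(".price_color") |> Floki.text()
    # => "£51.77"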

    🔍 5. Handling Multi-Page Crawling

    Crawly makes it easy to follow links and paginate:

    ELIXIR
    def parse_item(response) do
      {:ok, document} = Floki.parse_document(response.body)

      items =
        document
        |> Floki.find(".product_pod")
        |> Enum.map(fn product -> ... end)  # same extraction as in section 4

      # Find the "next" pagination link, if there is one
      next_page =
        document
        |> Floki.find("li.next a")
        |> Floki.attribute("href")
        |> List.first()

      # Turn it into a follow-up request for Crawly to schedule
      requests =
        if next_page do
          [Crawly.Utils.request_from_url("#{base_url()}/#{next_page}")]
        else
          []
        end

      %Crawly.ParsedItem{items: items, requests: requests}
    end

    This setup ensures your spider automatically follows pagination and collects data across multiple pages.

    🦊 6. Introducing FoxScrape for Anti-Bot & JS Pages

    Some websites implement anti-scraping measures:

  • Require JavaScript rendering
  • Block repeated requests from the same IP
  • Return partial or empty HTML

    Manually handling these in Crawly is possible but cumbersome. Instead, FoxScrape can fetch fully rendered pages for you, letting you continue using Crawly and Floki without additional browser automation.

    🔧 Example: Fetching via FoxScrape

    ELIXIR
    defmodule PriceSpider.FoxSpider do
      use Crawly.Spider

      @api_key "YOUR_API_KEY"
      @target_url "https://books.toscrape.com"

      @impl Crawly.Spider
      def base_url(), do: @target_url

      @impl Crawly.Spider
      def init(), do: [start_urls: [@target_url]]

      @impl Crawly.Spider
      def parse_item(_response) do
        # Fetch the page through FoxScrape (URL-encoded so query strings survive)
        fox_url =
          "https://www.foxscrape.com/api/v1?api_key=#{@api_key}&url=#{URI.encode_www_form(@target_url)}"

        {:ok, resp} = HTTPoison.get(fox_url)
        {:ok, document} = Floki.parse_document(resp.body)

        items =
          document
          |> Floki.find(".product_pod")
          |> Enum.map(fn product ->
            title = product |> Floki.find("h3 a") |> Floki.attribute("title") |> List.first()
            price = product |> Floki.find(".price_color") |> Floki.text()
            %{title: title, price: price}
          end)

        # Hand the items back to Crawly's pipelines instead of saving them manually
        %Crawly.ParsedItem{items: items, requests: []}
      end
    end

    You can also enable JS rendering for dynamic pages:

    ELIXIR
    fox_url = "https://www.foxscrape.com/api/v1?api_key=#{@api_key}&url=#{URI.encode_www_form(@target_url)}&render_js=true"

    Why use FoxScrape here:

  • No manual proxy rotation or headless browser setup
  • Automatically retries failed requests
  • Returns clean HTML ready for Floki parsing

    💾 7. Exporting Data

    Crawly supports multiple output formats. For simplicity, save items as JSON Lines through Crawly's item pipelines:

    ELIXIR
    Crawly.Engine.start_spider(PriceSpider.FoxSpider)
    # Items are written by the pipelines configured for Crawly (see the config sketch below)
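
    Note that items are only written if Crawly's item pipelines are configured. A minimal config/config.exs sketch (the middleware list, folder, and extension are assumptions; adjust them to your project):

    ELIXIR
    import Config

    config :crawly,
      middlewares: [
        Crawly.Middlewares.DomainFilter,
        Crawly.Middlewares.UniqueRequest,
        {Crawly.Middlewares.UserAgent, user_agents: ["PriceSpider/1.0"]}
      ],
      pipelines: [
        # Encode each item as a JSON line and append it to a file
        Crawly.Pipelines.JSONEncoder,
        {Crawly.Pipelines.WriteToFile, folder: "./output", extension: "jl"}
      ]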

    Alternatively, you can post-process the JSON or write to CSV using Elixir’s CSV library.
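
    For instance, a rough post-processing sketch that reads the JSON Lines output and writes a naive CSV (no quoting or escaping, and the file paths are assumptions):

    ELIXIR
    # Read each JSON line, pull out the stored fields, and join them as CSV rows
    rows =
      "output/PriceSpider.FoxSpider.jl"   # adjust to wherever your pipeline writes its output
      |> File.stream!()
      |> Stream.map(&Jason.decode!/1)
      |> Enum.map(fn %{"title" => title, "price" => price} -> "#{title},#{price}" end)

    File.write!("prices.csv", Enum.join(["title,price" | rows], "\n"))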


    ⚖️ 8. Comparison: Crawly vs FoxScrape

    Method             JS Support    Anti-Bot Handling    Concurrency    Complexity
    Crawly + Floki     ❌            ⚠️ Partial           ✅ High        🟡 Medium
    FoxScrape API      ✅            ✅ Automatic         ✅ High        🟢 Simple

    FoxScrape essentially offloads the network and JS rendering layer while letting Crawly remain the parsing engine.

    🧠 9. Best Practices

  • Respect robots.txt and site rate limits
  • Don’t overwhelm servers; use Crawly’s built-in throttling (see the config sketch after this list)
  • Use FoxScrape for protected or JS-heavy pages
  • Always validate extracted items before storage
  • Parallelize crawls responsibly with OTP supervision
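
    For throttling, a sketch of the relevant Crawly settings (the values are illustrative):

    ELIXIR
    config :crawly,
      # Limit parallel requests per domain (Crawly's built-in throttling)
      concurrent_requests_per_domain: 2,
      # Stop a spider automatically once it has collected this many items
      closespider_itemcount: 500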

    🎯 10. Conclusion

    In this tutorial, you learned:

  • How to set up Elixir and Crawly for web scraping
  • How to create spiders and parse pages with Floki
  • Techniques for multi-page crawling and structured data extraction
  • How to integrate FoxScrape to handle anti-scraping and dynamic content

    Elixir + Crawly is already a high-performance scraping solution. Adding FoxScrape makes it easier to scale, handle JS-heavy sites, and bypass anti-bot protections, all while keeping your parsing code clean and familiar.

    🦊 Try FoxScrape in Elixir:

    ELIXIR
    api_key = "YOUR_API_KEY"
    url = "https://www.amazon.com/s?k=graphics+cards"

    # URL-encode the target so its own query string survives inside fox_url
    fox_url = "https://www.foxscrape.com/api/v1?api_key=#{api_key}&url=#{URI.encode_www_form(url)}"

    {:ok, resp} = HTTPoison.get(fox_url)
    IO.puts(String.slice(resp.body, 0..500))

    Happy scraping — fast, scalable, and ethical.