Web Scraping with C++

Written by Mantas Kemėšius

Web scraping is one of those timeless developer tasks — equal parts fascinating and frustrating. The ability to automate data extraction from websites has shaped everything from SEO analytics to AI training data.

Most tutorials use Python because it’s quick and expressive. But if you’ve ever needed to scrape large-scale datasets, handle millions of pages, or simply prefer lower-level control, C++ becomes a serious contender.

In this guide, we’ll build a simple C++ web scraper that fetches dictionary definitions from Merriam-Webster, using:

  • libcurl — for making HTTP requests
  • libxml2 — for parsing HTML and XML

You’ll learn the essentials of how scraping works under the hood — and how to go beyond it by offloading all the heavy lifting to FoxScrape, a high-performance scraping API that abstracts away networking, JavaScript rendering, proxy rotation, and anti-bot handling.

    ⚙️ 1. Prerequisites

    Before we start coding, you’ll need:

  • Familiarity with basic HTTP requests and responses.
  • C++11 or later, plus a compiler like g++ 4.8.1+.
  • Installed libraries: libcurl (for HTTP) and libxml2 (for HTML parsing).

    On Linux, you can install both easily:

    BASH
    sudo apt install libcurl4-openssl-dev libxml2-dev

    Or via vcpkg (cross-platform):

    BASH
    vcpkg install curl libxml2
    vcpkg integrate install

    Traditionally, that’s your full setup. But with a scraping API like FoxScrape, you can skip these dependencies entirely.

    Instead of manually managing HTTP, parsing, proxies, and retries, FoxScrape provides a single REST endpoint — https://www.foxscrape.com/api/v1 — that does the scraping for you.

    Still, we’ll start by understanding the foundations first.

    🌐 2. HTTP 101 — What Your Scraper Actually Does

    Every web scraper ultimately speaks HTTP, the language of the web.

    When you fetch a page like https://www.merriam-webster.com/dictionary/esoteric, your browser (or scraper) sends a GET request:

    PLAIN TEXT
    GET /dictionary/esoteric HTTP/1.1
    Host: www.merriam-webster.com
    User-Agent: curl/8.9.1
    Accept: */*

    The server replies with an HTTP response:

    PLAIN TEXT
    HTTP/1.1 200 OK
    Content-Type: text/html; charset=UTF-8
    Content-Length: 58762
    Server: cloudflare
    Cache-Control: no-cache

    Then comes the page’s HTML body.
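
    If you want to watch this exchange from your own code, libcurl's verbose mode prints every header it sends and receives. A minimal sketch (the word page is just an example):

    C++
    #include <curl/curl.h>

    int main() {
        CURL* curl = curl_easy_init();
        if (curl) {
            curl_easy_setopt(curl, CURLOPT_URL, "https://www.merriam-webster.com/dictionary/esoteric");
            curl_easy_setopt(curl, CURLOPT_VERBOSE, 1L);  // dump request/response headers to stderr
            curl_easy_perform(curl);                      // the body itself goes to stdout by default
            curl_easy_cleanup(curl);
        }
        return 0;
    }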

    Understanding this cycle helps debug scraper issues. But when you use FoxScrape, these low-level interactions happen transparently.

    It automatically manages headers, cookies, redirects, and even rate limits — so your focus stays on data, not protocol.

    🏗️ 3. Building the C++ Web Scraper (Step by Step)

    Now, let’s code the classic way — then see how FoxScrape simplifies it.

    Our goal:

    Input a word → fetch its Merriam-Webster page → extract its definitions.

    We’ll break this down into five small pieces:

  • strtolower() — lowercase normalization (sketched at the end of section 3.1)
  • request() — fetches HTML with libcurl
  • scrape() — parses HTML with libxml2 and extracts definitions
  • main() — coordinates the workflow
  • Utility setup for compilation and running

    🧩 3.1. Setting Up the Libraries

    Nothing fancy yet. We include and link libcurl and libxml2:

    C++
    #include <curl/curl.h>
    #include <libxml/HTMLparser.h>
    #include <libxml/xpath.h>
    #include <iostream>
    #include <algorithm>
    #include <string>
    #include <vector>

    Then, we’ll define a write callback for libcurl that stores the HTTP response in a string buffer.
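
    While we're at it, the strtolower() helper from the list above takes only a few lines. Here's a minimal sketch (the implementation is our own choice, not prescribed by any library; dictionary URLs use lowercase words, hence the normalization):

    C++
    #include <cctype>  // std::tolower

    // Normalize a word to lowercase so it matches the dictionary's URL scheme
    std::string strtolower(std::string s) {
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c) { return (char)std::tolower(c); });
        return s;
    }

    Calling it on argv[1] in main() before building the URL keeps lookups consistent.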

    🔹 3.2. The request() Function

    Let’s make a simple GET request to Merriam-Webster:

    C++
    static size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
        size_t totalSize = size * nmemb;
        output->append((char*)contents, totalSize);
        return totalSize;
    }

    std::string request(const std::string& word) {
        CURL* curl;
        CURLcode res;
        std::string readBuffer;
        std::string url = "https://www.merriam-webster.com/dictionary/" + word;

        curl = curl_easy_init();
        if (curl) {
            curl_easy_setopt(curl, CURLOPT_URL, url.c_str());
            curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);
            res = curl_easy_perform(curl);
            if (res != CURLE_OK)
                std::cerr << "Request failed: " << curl_easy_strerror(res) << std::endl;
            curl_easy_cleanup(curl);
        }

        return readBuffer;
    }

    This fetches raw HTML for any given dictionary word.

    If you print readBuffer, you’ll see the page’s full HTML markup.
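
    One addition worth making here (our suggestion, not part of the original listing): check the HTTP status code after curl_easy_perform(), so a 404 or a rate-limit response never reaches the parser disguised as a definition page.

    C++
    // Inside request(), after curl_easy_perform(curl) and before cleanup:
    long status = 0;
    curl_easy_getinfo(curl, CURLINFO_RESPONSE_CODE, &status);
    if (status != 200) {
        std::cerr << "Unexpected HTTP status " << status << " for " << url << std::endl;
        readBuffer.clear();  // don't hand error pages to the parser
    }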

    However, if you’ve ever tried scraping modern sites, you know what’s next:

    JavaScript-rendered content, rate limits, CAPTCHA walls...

    That’s where FoxScrape shines — replacing this entire function with a single HTTP call:

    C++
    // Simplified example using FoxScrape API
    std::string request(const std::string& targetUrl) {
        CURL* curl = curl_easy_init();
        std::string response;

        if (curl) {
            // Percent-encode the target URL so it survives as a query parameter
            char* escaped = curl_easy_escape(curl, targetUrl.c_str(), 0);
            std::string foxUrl = std::string("https://www.foxscrape.com/api/v1?url=") + escaped;
            curl_free(escaped);

            curl_easy_setopt(curl, CURLOPT_URL, foxUrl.c_str());
            curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);
            CURLcode res = curl_easy_perform(curl);
            if (res != CURLE_OK)
                std::cerr << "Request failed: " << curl_easy_strerror(res) << std::endl;
            curl_easy_cleanup(curl);
        }
        return response;
    }

    FoxScrape automatically handles rendering, proxies, and all anti-bot logic behind the scenes.

    You just call the API, and you get clean, ready-to-parse HTML or JSON.

    🔹 3.3. The scrape() Function

    Now, let’s parse the HTML using XPath with libxml2:

    C++
    std::vector<std::string> scrape(const std::string& html) {
        std::vector<std::string> results;
        htmlDocPtr doc = htmlReadMemory(html.c_str(), (int)html.size(), NULL, NULL, HTML_PARSE_NOERROR);
        if (!doc) return results;

        xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
        xmlXPathObjectPtr xpathObj = xmlXPathEvalExpression(
            (const xmlChar*)"//div[contains(@class,'vg-sseq-entry-item')]//span[contains(@class,'dtText')]",
            ctx
        );

        if (xpathObj && xpathObj->nodesetval) {
            xmlNodeSetPtr nodes = xpathObj->nodesetval;
            for (int i = 0; i < nodes->nodeNr; i++) {
                xmlNodePtr node = nodes->nodeTab[i];
                xmlChar* text = xmlNodeGetContent(node);
                if (text) {  // guard against nodes with no text content
                    results.push_back((char*)text);
                    xmlFree(text);
                }
            }
        }

        xmlXPathFreeObject(xpathObj);
        xmlXPathFreeContext(ctx);
        xmlFreeDoc(doc);

        return results;
    }

    This finds all dictionary definition spans and extracts their text.

    You could print them line by line in your main function.
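
    In practice, the extracted strings carry stray whitespace, and the dtText spans usually open with a ": " marker. A small cleanup helper makes the output readable; this sketch's trim rules are our own assumption about the page's formatting:

    C++
    // Strip a leading ": " marker and surrounding whitespace from a definition
    std::string clean_definition(const std::string& s) {
        size_t start = s.find_first_not_of(" \t\n\r:");
        size_t end = s.find_last_not_of(" \t\n\r");
        if (start == std::string::npos) return "";
        return s.substr(start, end - start + 1);
    }

    Applying it inside the loop, results.push_back(clean_definition((char*)text)), leaves the rest of the code unchanged.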

    🔹 3.4. The main() Function

    Putting it together:

    C++
    int main(int argc, char* argv[]) {
        if (argc < 2) {
            std::cerr << "Usage: ./scraper <word>" << std::endl;
            return 1;
        }

        std::string word = argv[1];  // optionally normalize first: word = strtolower(word);
        std::string html = request(word);  // request() builds the dictionary URL itself
        auto defs = scrape(html);

        if (defs.empty()) {
            std::cout << "No definitions found." << std::endl;
        } else {
            for (auto& d : defs)
                std::cout << "- " << d << std::endl;
        }

        return 0;
    }

    💻 4. Compile and Run

    Compile your scraper:

    BASH
    g++ scraper.cc -lcurl -lxml2 -std=c++11 -o scraper -I/usr/include/libxml2/

    Then run it:

    BASH
    ./scraper esoteric

    Output (truncated):

    PLAIN TEXT
    - Intended for or likely to be understood by only a small number of people with specialized knowledge
    - Of special, rare, or secret meaning

    That’s your working C++ scraper!

    It’s fast, memory-efficient, and great for controlled environments.

    But for production-scale scraping — where you’re handling rotating IPs, solving JavaScript-heavy pages, or needing concurrency across thousands of URLs — maintaining this code quickly becomes a headache.
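
    If you do stay hands-on, modest concurrency is within reach: libcurl easy handles are thread-safe as long as each thread uses its own handle and curl_global_init() runs once before any threads start. A sketch reusing request() from section 3.2 (the word list is illustrative; compile with -pthread on Linux):

    C++
    #include <future>
    #include <vector>

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);  // once, before any threads

        std::vector<std::string> words = {"esoteric", "ephemeral", "laconic"};
        std::vector<std::future<std::string>> jobs;
        for (const auto& w : words)
            jobs.push_back(std::async(std::launch::async, request, w));  // one easy handle per call

        for (size_t i = 0; i < words.size(); i++)
            std::cout << words[i] << ": " << jobs[i].get().size() << " bytes" << std::endl;

        curl_global_cleanup();
        return 0;
    }

    Past a few dozen parallel fetches, though, you are into curl_multi territory, connection pooling, and per-site throttling, which is exactly the maintenance burden at issue.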

    That’s exactly the pain FoxScrape was built to remove.

    🦊 5. Simplifying the Stack with FoxScrape

    Here’s how the same scraper looks using the FoxScrape API directly — no libcurl setup, no HTML parsing headaches.

    C++
    #include <curl/curl.h>
    #include <iostream>
    #include <string>

    static size_t WriteCallback(void* contents, size_t size, size_t nmemb, std::string* output) {
        size_t totalSize = size * nmemb;
        output->append((char*)contents, totalSize);
        return totalSize;
    }

    int main() {
        std::string output;
        std::string target = "https://www.merriam-webster.com/dictionary/esoteric";

        CURL* curl = curl_easy_init();
        if (curl) {
            // Percent-encode the target so it survives as a query parameter
            char* escaped = curl_easy_escape(curl, target.c_str(), 0);
            std::string apiUrl = std::string("https://www.foxscrape.com/api/v1?url=") + escaped;
            curl_free(escaped);

            curl_easy_setopt(curl, CURLOPT_URL, apiUrl.c_str());
            curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
            curl_easy_setopt(curl, CURLOPT_WRITEDATA, &output);
            CURLcode res = curl_easy_perform(curl);
            if (res != CURLE_OK)
                std::cerr << "Request failed: " << curl_easy_strerror(res) << std::endl;
            curl_easy_cleanup(curl);
        }

        std::cout << output << std::endl;
        return 0;
    }

    FoxScrape returns a clean, normalized response.

    You can choose the format (HTML, JSON, or text) by adding parameters like &format=json.
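
    For instance (assuming the format parameter behaves as described), the only change from the example above is the query string:

    C++
    // Same call as above, but asking FoxScrape for JSON output
    std::string apiUrl = std::string("https://www.foxscrape.com/api/v1?url=") + escaped + "&format=json";
    // ...then pass apiUrl to CURLOPT_URL exactly as before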

    So instead of building and maintaining scrapers, you just make a single REST call and get structured data back.

    ⚙️ 6. Why FoxScrape Fits Into Your C++ Workflow

    Even though C++ isn’t the first language people think of for scraping, many high-performance systems rely on it — from financial data aggregation to large-scale crawling infrastructure.

    FoxScrape complements C++ perfectly because:

    Challenge               | Traditional C++ Scraper        | With FoxScrape
    ------------------------|--------------------------------|---------------------------
    JavaScript Rendering    | Manual headless browser setup  | Built-in rendering
    Proxy Rotation          | Handle pools manually          | Automatic rotation
    CAPTCHA / Bot Detection | Error-prone                    | Bypassed intelligently
    Rate Limiting           | Custom throttling logic        | Managed globally
    Output Format           | Manual parsing                 | JSON or HTML ready-to-use

    Using FoxScrape doesn’t replace your C++ logic — it extends it. You still control how you parse, store, or analyze the data, but you don’t have to fight the web to get it.

    🧩 7. Wrapping Up

    We’ve come full circle — from building a manual, low-level scraper with libcurl and libxml2 to discovering how FoxScrape collapses the entire workflow into one clean API call.

    If your use case involves:

  • Scaling to thousands of pages per minute,
  • Handling dynamic content (React, Vue, etc.), or
  • Integrating scraping into backend pipelines,

    FoxScrape gives you the infrastructure you’d otherwise spend weeks building.

    It’s designed for developers who want scraping without the scraping code — a single endpoint that works seamlessly with C++, Python, Node, or Rust.

    Check out FoxScrape.com for API docs and examples, and see how quickly you can replace hundreds of lines of scraper code with one reliable HTTP call.