Web Scraping with C#

Written by Mantas Kemėšius

If you’ve ever tried scraping a modern website, you’ve probably experienced a full emotional arc: excitement, frustration, triumph, and then despair when the site suddenly changes structure overnight.

Web scraping used to be simple.

Grab HttpClient, download HTML, parse it with HtmlAgilityPack, export to CSV — done. That was the era of clean HTML and predictable markup.

Fast-forward to today’s web, and everything is dynamic, JavaScript-rendered, geo-targeted, and wrapped in bot protection.

And yet, developers still need data.

We scrape not to annoy, but to understand — to collect prices, compare news sentiment, analyze public records, or monitor competition. Scraping remains one of the most pragmatic ways to automate access to information.

This article is a developer’s guide to building sane, maintainable scrapers in C#, step by step — while understanding why modern scraping is difficult and how to architect your code so it doesn’t crumble under real-world complexity.

We’ll go from:

  • Setting up your first HTML fetcher
  • Handling parsing, data modeling, and export
  • Managing pagination and resilience
  • Understanding where the real bottlenecks are (JS rendering, anti-bot walls)
  • And finally — how to integrate a scraping API that removes the pain entirely

Grab some coffee. Let’s scrape smarter.

    🧱 1. Why Scraping is (Still) a Developer’s Superpower

    At its core, web scraping is a form of automation.

    You’re not “hacking” — you’re structuring what’s already public, making it digestible for analysis.

    A few use cases that show up in real engineering teams:

    Use Case | Example | Benefit
    Price Monitoring | Track competitors on e-commerce sites | Dynamic pricing and alerts
    Market Research | Extract product metadata or reviews | Sentiment analysis
    Content Aggregation | Collect articles, job listings, or forum posts | Build dashboards or newsletters
    SEO/Marketing | Audit structured data or meta tags | Improve visibility
    Public Data | Scrape government or NGO datasets | Research or compliance

    These are all legitimate automation patterns — but they rely on the ability to fetch and parse web data reliably.

    The problem? The web keeps fighting back.

    ⚔️ 2. The New Reality of Web Scraping

    When you run this simple C# snippet:

    C#
    using var http = new HttpClient();
    var html = await http.GetStringAsync("https://example.com");
    Console.WriteLine(html);

    You’d expect HTML.

    But today, you might get:

  • A blank document because the content is rendered client-side with React or Vue.
  • A CAPTCHA page asking if you’re human.
  • A 403 Forbidden because your IP is flagged as a bot.
  • Or just minified chaos that looks nothing like what you saw in your browser.

    Let’s unpack why.

    🧩 The Obstacles

    Challenge | Description | Why It Matters
    JavaScript Rendering | Most sites generate content dynamically after load. | You can’t see the data in the raw HTML.
    Anti-Bot Systems | Services like Cloudflare or PerimeterX detect automation. | Requests get blocked or challenged.
    Rate Limiting | Too many requests from one IP trigger throttling. | Data stops after a few pages.
    Geolocation Walls | Region-specific content or pricing. | Wrong or missing data.
    HTML Variability | Different layouts for mobile, A/B tests, etc. | Your XPath breaks constantly.

    Developers often try to fix these with:

  • Selenium or Playwright (slow, complex)
  • Proxy pools (expensive, unreliable)
  • Custom retry logic (fragile)
  • “Stealth” headers and delays (tedious; a sketch follows below)

    This all works… until it doesn’t.
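
    The “stealth headers” route, for example, mostly amounts to dressing HttpClient up to look like a browser and pausing between requests. A minimal sketch, with illustrative header values and delays that offer no guarantee against blocking:

    C#
    // Sketch: browser-like headers plus a randomized pause. Values are illustrative.
    using var http = new HttpClient();
    http.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36");
    http.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Language", "en-US,en;q=0.9");

    var html = await http.GetStringAsync("https://example.com");
    await Task.Delay(Random.Shared.Next(1000, 3000)); // breathe before the next request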

    Scraping at scale isn’t a code problem — it’s an infrastructure problem.

    🧰 3. Your C# Toolset: The Essentials

    Before solving infrastructure, let’s build a good scraper foundation.

    We’ll use C# — a fantastic language for web tasks because of its async capabilities, ecosystem, and type safety.

    Here’s the minimal stack you need:

    Library | Purpose
    HtmlAgilityPack | Parse HTML using XPath or CSS-like queries
    CsvHelper | Export structured data easily
    HttpClient | Make web requests asynchronously
    System.Text.Json | Handle JSON APIs (bonus for hybrid scraping)

    Create the project and install the external packages via:

    BASH
    dotnet new console -n WebScraperDemo
    cd WebScraperDemo
    dotnet add package HtmlAgilityPack
    dotnet add package CsvHelper

    We’ll use Books to Scrape (https://books.toscrape.com/) — a static, educational site — as our target dataset.

    🧩 4. Building the Base: Fetch and Parse HTML

    A minimal scraper looks like this:

    C#
    using HtmlAgilityPack;

    var url = "https://books.toscrape.com/";
    using var http = new HttpClient();
    var html = await http.GetStringAsync(url);

    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    var titles = doc.DocumentNode.SelectNodes("//article[@class='product_pod']//h3/a");

    foreach (var t in titles)
    {
        Console.WriteLine(t.InnerText.Trim());
    }

    Output:

    PLAIN TEXT
    A Light in the Attic
    Tipping the Velvet
    Soumission
    Sharp Objects

    Success! You’ve scraped your first data.

    Now, let’s turn that text into something useful.

    🧩 5. Structuring Your Data

    Instead of dumping everything to console, define a model:

    C#
    public sealed class Product
    {
        public string Title { get; set; } = "";
        public decimal Price { get; set; }
        public string Url { get; set; } = "";
    }

    Then extract clean values:

    C#
    var products = new List<Product>();

    var nodes = doc.DocumentNode.SelectNodes("//article[@class='product_pod']");
    foreach (var n in nodes)
    {
        var a = n.SelectSingleNode(".//h3/a");
        var title = HtmlEntity.DeEntitize(a?.GetAttributeValue("title", "") ?? "").Trim();

        var priceText = n.SelectSingleNode(".//p[@class='price_color']")?.InnerText ?? "£0.00";
        decimal.TryParse(priceText.Replace("£", "").Trim(),
            NumberStyles.Any, CultureInfo.InvariantCulture, out var price); // needs using System.Globalization;

        var href = a?.GetAttributeValue("href", "") ?? "";
        var productUrl = new Uri(new Uri(url), href).ToString();

        products.Add(new Product { Title = title, Price = price, Url = productUrl });
    }

    You now have a strongly typed dataset ready for export.

    💾 6. Exporting to CSV

    C#
    using CsvHelper;
    using CsvHelper.Configuration;
    using System.Globalization;
    using System.Text;

    using var writer = new StreamWriter("products.csv", false, new UTF8Encoding(true));
    using var csv = new CsvWriter(writer, new CsvConfiguration(CultureInfo.InvariantCulture));
    csv.WriteRecords(products);

    Console.WriteLine("Saved products.csv");

    Running this gives you:

    PLAIN TEXT
    Title,Price,Url
    A Light in the Attic,51.77,https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
    Tipping the Velvet,53.74,https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html

    Simple, clean, portable data.

    But again — this works only because Books to Scrape is static HTML.

    Try techcrunch.com or twitter.com, and you’ll see the limitations instantly.

    🧭 7. Dealing with Pagination and Scale

    For multi-page scraping:

    C#
    var allProducts = new List<Product>();
    using var http = new HttpClient();

    for (int page = 1; page <= 5; page++)
    {
        var pagedUrl = $"https://books.toscrape.com/catalogue/page-{page}.html";
        var html = await http.GetStringAsync(pagedUrl);

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // parse like before and add to allProducts
    }

    This works — but what happens if one page times out?

    Or if the site throttles your IP mid-loop?

    You’ll either lose data or crash your scraper. That’s why serious scraping systems include:

  • Retry logic (e.g. exponential backoff)
  • Parallelization (async batches)
  • Error logging
  • Proxy rotation

    You can implement these manually — but each adds complexity. A sketch of the retry and batching pieces follows below.
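
    Here is a minimal sketch of the retry and batching pieces, reusing the HttpClient from the loop above. The helper name and the numbers are illustrative, not part of any library:

    C#
    // Hypothetical helper: retry with exponential backoff and basic error logging.
    async Task<string?> FetchPageWithRetryAsync(HttpClient http, string url, int maxAttempts = 3)
    {
        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                return await http.GetStringAsync(url);
            }
            catch (HttpRequestException ex)
            {
                Console.Error.WriteLine($"Attempt {attempt} failed for {url}: {ex.Message}");
                if (attempt < maxAttempts)
                    await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt))); // 2s, 4s, ...
            }
        }
        return null; // caller decides whether a missing page is fatal
    }

    // Fetch a batch of pages in parallel instead of strictly one by one.
    var pageUrls = Enumerable.Range(1, 5)
        .Select(p => $"https://books.toscrape.com/catalogue/page-{p}.html");
    var pages = await Task.WhenAll(pageUrls.Select(u => FetchPageWithRetryAsync(http, u)));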

    🧹 8. Cleaning and Normalizing Data

    Scraped HTML often contains messy characters, line breaks, and entities.

    C#
    var clean = HtmlEntity.DeEntitize(rawText).Replace("\n", "").Trim();

    Normalize URLs and numbers early, not after export.

    A small investment in cleanup logic saves hours of data repair later.
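
    For example, a pair of tiny helpers keeps that cleanup in one place. The names are illustrative, and the price logic assumes the £-prefixed format used on Books to Scrape:

    C#
    using System.Globalization;
    using HtmlAgilityPack;

    // Illustrative helpers; adapt them to the formats your target site uses.
    static decimal ParsePrice(string raw)
    {
        var cleaned = HtmlEntity.DeEntitize(raw).Replace("£", "").Trim();
        return decimal.TryParse(cleaned, NumberStyles.Any, CultureInfo.InvariantCulture, out var p) ? p : 0m;
    }

    static string NormalizeUrl(string baseUrl, string href) =>
        new Uri(new Uri(baseUrl), href).ToString(); // resolve relative links to absolute URLs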

    💡 9. The Moment You Hit a Wall

    Eventually, every scraper hits that website.

    You’ve added retries. You’ve used user-agent headers. You’ve throttled requests.

    And yet, half your responses are empty or blocked.

    You spend an afternoon debugging network traces and realize:

    the page only renders after JavaScript executes.

    At that point, you reach for Selenium or Playwright — spinning up full browsers, waiting for page load, grabbing page.Content(), and closing tabs. It works, but it’s heavy, slow, and painful to scale.
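
    For reference, that browser route looks roughly like this with Microsoft.Playwright. This is only a sketch; you still need to install the package and its browser binaries separately:

    C#
    using Microsoft.Playwright;

    // A real browser: powerful, but heavy compared to a single HTTP request.
    using var playwright = await Playwright.CreateAsync();
    await using var browser = await playwright.Chromium.LaunchAsync(new() { Headless = true });
    var page = await browser.NewPageAsync();

    await page.GotoAsync("https://example.com", new() { WaitUntil = WaitUntilState.NetworkIdle });
    var html = await page.ContentAsync(); // HTML after JavaScript has run

    await browser.CloseAsync();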

    What you really need isn’t more code — it’s a way to delegate infrastructure.

    ☁️ 10. When to Use a Scraping API

    Scraping APIs emerged to solve precisely this:

    They run the browser, rotate proxies, spoof headers, and return the HTML you wish HttpClient could.

    They’re not magic — they’re specialized infrastructure as a service.

    A good scraping API should:

  • Accept a simple url parameter
  • Optionally render JavaScript
  • Handle CAPTCHAs, redirects, and blocks
  • Scale to hundreds of requests per second
  • Integrate easily into your existing code

    In other words, it lets you keep your scraper logic while offloading the plumbing.

    🦊 11. Example: Using FoxScrape to Simplify Everything

    Let’s replace all the messy parts of our scraper with a single, reliable API call.

    FoxScrape is a developer-friendly scraping API built for exactly this:

    you give it a URL (and your API key), and it returns clean, optionally rendered HTML — no proxy lists, no CAPTCHA handling, no JS engines on your side.

    Same parameters as typical scraping APIs — so you don’t need to rewrite your scraper at all.

    Here’s how our improved scraper looks:

    C#
    using HtmlAgilityPack;
    using CsvHelper;
    using CsvHelper.Configuration;
    using System.Globalization;
    using System.Linq;
    using System.Text;

    var apiKey = "YOUR_API_KEY"; // get from https://www.foxscrape.com
    var baseUrl = "https://books.toscrape.com/";

    var requestUrl = $"https://www.foxscrape.com/api/v1?api_key={apiKey}&url={Uri.EscapeDataString(baseUrl)}";

    using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(20) };
    var html = await http.GetStringAsync(requestUrl);

    var doc = new HtmlDocument();
    doc.LoadHtml(html);

    var products = new List<Product>();
    var cards = doc.DocumentNode.SelectNodes("//article[@class='product_pod']")
                ?? Enumerable.Empty<HtmlNode>();

    foreach (var card in cards)
    {
        var a = card.SelectSingleNode(".//h3/a");
        var title = HtmlEntity.DeEntitize(a?.GetAttributeValue("title", "") ?? "").Trim();
        var href = a?.GetAttributeValue("href", "") ?? "";
        var url = new Uri(new Uri(baseUrl), href).ToString();

        var priceText = HtmlEntity.DeEntitize(card.SelectSingleNode(".//p[@class='price_color']")?.InnerText ?? "£0.00");
        decimal.TryParse(priceText.Replace("£", "").Trim(), NumberStyles.Any, CultureInfo.InvariantCulture, out var price);

        products.Add(new Product { Title = title, Price = price, Url = url });
    }

    var csvPath = Path.Combine(AppContext.BaseDirectory, "products.csv");
    using var writer = new StreamWriter(csvPath, false, new UTF8Encoding(true));
    using var csv = new CsvWriter(writer, new CsvConfiguration(CultureInfo.InvariantCulture));
    csv.WriteRecords(products);

    Console.WriteLine($"Saved {products.Count} products to {csvPath}");

    // Top-level statements must come before type declarations, so Product sits at the bottom.
    public sealed class Product
    {
        public string Title { get; set; } = "";
        public decimal Price { get; set; }
        public string Url { get; set; } = "";
    }

    That’s it.

    One request → full HTML → parsed → exported.

    Want to render JavaScript?

    Just add &render_js=true.
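
    In code, that is just one more query-string parameter on the same endpoint. A sketch, reusing the apiKey from above; the target URL and timeout are illustrative:

    C#
    // Same endpoint as above, with JavaScript rendering enabled via render_js.
    var target = "https://example.com/spa-page"; // illustrative JS-heavy page
    var renderedRequestUrl =
        $"https://www.foxscrape.com/api/v1?api_key={apiKey}&url={Uri.EscapeDataString(target)}&render_js=true";

    using var renderClient = new HttpClient { Timeout = TimeSpan.FromSeconds(60) }; // rendering takes longer
    var renderedHtml = await renderClient.GetStringAsync(renderedRequestUrl);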

    Need to forward headers? Add them to HttpClient.

    Need to scale? FoxScrape handles concurrency and rate limits server-side.

    No Selenium. No proxies. No tears.

    🧭 12. Testing, Scaling, and Keeping Your Scrapers Alive

    Once your scraper works, the next challenge is keeping it working.

    Sites evolve, selectors break, and structures shift subtly.

    Some field-tested practices:

    🧱 Use Semantic Selectors

    Prefer class names or attribute markers over absolute XPath chains.

    C#
    "//div[contains(@class,'product')]//a"

    is more robust than

    C#
    "/html/body/div[2]/div[1]/div/a"

    🕓 Add Retry Logic and Backoff

    Even with a scraping API, transient errors happen.

    C#
    string? html = null;

    for (int attempt = 1; attempt <= 3; attempt++)
    {
        try
        {
            html = await http.GetStringAsync(requestUrl);
            break;
        }
        catch (HttpRequestException) when (attempt < 3)
        {
            await Task.Delay(1000 * attempt); // wait a bit longer after each failure
        }
    }

    🧮 Track Selectors in Config Files

    Store your XPath expressions in JSON or a config class, so you can tweak them without rebuilding.
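
    One lightweight shape for this, deserialized with System.Text.Json, looks like the sketch below; the file name and keys are illustrative:

    C#
    using System.Text.Json;

    // selectors.json (illustrative):
    // { "productCard": "//article[@class='product_pod']", "title": ".//h3/a", "price": ".//p[@class='price_color']" }
    var selectors = JsonSerializer.Deserialize<Dictionary<string, string>>(
        File.ReadAllText("selectors.json")) ?? new();

    var cards = doc.DocumentNode.SelectNodes(selectors["productCard"]);
    var firstTitle = cards?[0].SelectSingleNode(selectors["title"])?.GetAttributeValue("title", "");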

    📦 Cache Raw HTML

    Save copies of fetched HTML during development to debug parsing logic offline.

    It also helps when you want to test changes without burning API calls.
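
    A crude but effective version of that cache writes each response to disk and reads it back on later runs. The paths and naming below are illustrative:

    C#
    // Sketch: cache fetched HTML on disk so parsing can be reworked offline.
    var cacheDir = Path.Combine(AppContext.BaseDirectory, "html-cache");
    Directory.CreateDirectory(cacheDir);

    // File name derived from the URL; switch to a hash if your URLs get long.
    var cacheFile = Path.Combine(cacheDir, Uri.EscapeDataString(url) + ".html");

    string html;
    if (File.Exists(cacheFile))
    {
        html = await File.ReadAllTextAsync(cacheFile);   // offline: no request, no API credits
    }
    else
    {
        html = await http.GetStringAsync(url);
        await File.WriteAllTextAsync(cacheFile, html);   // save for the next run
    }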

    🧩 13. Ethical and Legal Notes

    Scraping is powerful — but with great power comes… yes, you know.

    Always follow these principles:

    Rule | Description
    Respect robots.txt | Some sites explicitly disallow automated access.
    Use rate limiting | Don’t hammer servers — throttle requests.
    Scrape public data only | Never collect private or copyrighted material.
    Credit your sources | Especially for academic or journalistic use.
    Use APIs when available | They’re faster, safer, and more stable.
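
    On the rate-limiting rule, even a short pause between requests goes a long way, and a SemaphoreSlim caps concurrency when you fetch in parallel. A sketch with a hypothetical PoliteGetAsync helper; the limits are arbitrary:

    C#
    // Sketch: at most two concurrent requests, each followed by a short pause.
    var gate = new SemaphoreSlim(2);

    async Task<string> PoliteGetAsync(HttpClient http, string url)
    {
        await gate.WaitAsync();
        try
        {
            var html = await http.GetStringAsync(url);
            await Task.Delay(1500); // breathing room for the target server
            return html;
        }
        finally
        {
            gate.Release();
        }
    }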

    FoxScrape helps here too — by rate-limiting requests, managing concurrency, and keeping your traffic “browser-like.”

    🧠 14. A Smarter Way to Scrape

    After you’ve written a few scrapers, a realization hits:

    the scraping logic itself isn’t the hard part — it’s keeping the pipeline alive.

    You can spend weeks perfecting selectors, proxies, and error handling — or you can let a dedicated service manage that, while you stay focused on why you’re scraping in the first place.

    That’s why tools like FoxScrape exist — not to replace developers, but to remove friction from data acquisition.

    So, instead of spending nights debugging 403 Forbidden, you can spend them building something valuable with the data.

    🦊 Learn more at FoxScrape.com.

    🧩 15. Final Thoughts

    Scraping in 2025 is both an art and a systems problem.

    Static HTML scraping still has its place, but the modern web demands browser-grade tooling, clean architecture, and reliable infrastructure.

    If you’re serious about scraping:

  • Learn to parse smartly
  • Keep selectors flexible
  • Respect sites
  • Automate ethically
  • And whenever possible, delegate the boring parts

    Your goal isn’t to fight websites — it’s to build insight pipelines.

    And once you’re free from the mechanical pain, you can focus on the creative work: what you’ll do with all that beautiful, structured data.