A C# HTML parser is a library that turns raw HTML into a structured DOM you can query. If you're scraping websites, monitoring content, or building internal tools, parsing HTML is unavoidable. The real question is which parser to use and how to use it without turning your setup into a mess.
In this guide, we'll walk through the most common C# HTML parsers, explain where each one fits, and show how they work in a practical scraping workflow. The focus is on real-world usage, not theory. You'll see when a lightweight parser is enough, when a more browser-like DOM helps, and when full browser automation is overkill.

Quick answer (TL;DR)
If you just want to get started, use HtmlAgilityPack. It's the most common C# HTML parser, and it's easy to use, fast, and forgiving with real-world HTML. Pair it with ScrapingBee to fetch clean, rendered, unblocked HTML, then parse it locally in C#. Switch to AngleSharp only if you strongly prefer CSS selectors or need browser-like DOM behavior.
Below is a complete, minimal example you can plug into a project right now.
If you want more background on the ScrapingBee side, this tutorial walks through the setup step by step: Getting started with ScrapingBee and C#.
Full example: ScrapingBee + HtmlAgilityPack
This example:
- Fetches a page with ScrapingBee
- Loads the returned HTML into HtmlAgilityPack
- Extracts titles and links from the DOM
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;
public static class Program
{
// In real apps: reuse a single HttpClient + handler for the lifetime of the process (DI/singleton).
// This sample keeps everything local so you can paste it into a console app and run it.
public static async Task<int> Main(string[] args)
{
// 1) Put your ScrapingBee key in an env var:
// Windows (PowerShell): setx SCRAPINGBEE_API_KEY "your_key"
// macOS/Linux (bash): export SCRAPINGBEE_API_KEY="your_key"
var apiKey = Environment.GetEnvironmentVariable("SCRAPINGBEE_API_KEY");
if (string.IsNullOrWhiteSpace(apiKey))
{
Console.Error.WriteLine("Missing SCRAPINGBEE_API_KEY env var.");
return 2;
}
// The page you actually want to scrape (not the ScrapingBee endpoint).
var targetUrl = args.Length > 0 ? args[0] : "https://example.com";
// Cancellation support (Ctrl+C) + a hard timeout for the whole run.
// Using the token for timeouts keeps retries/backoff inside the same overall time budget.
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(60));
// For long-running apps, unregister the handler on exit.
Console.CancelKeyPress += (_, e) => { e.Cancel = true; cts.Cancel(); };
using var handler = new SocketsHttpHandler
{
// Many servers (and ScrapingBee) return compressed bodies. Let .NET decompress automatically.
// "All" includes Brotli (common on modern sites), gzip, and deflate.
AutomaticDecompression = DecompressionMethods.All
};
using var http = new HttpClient(handler)
{
// We intentionally disable HttpClient's built-in timeout and use CancellationToken instead,
// so retries/backoff share the same deadline (and you don't get per-attempt 100s timeouts).
Timeout = Timeout.InfiniteTimeSpan
};
// Build the ScrapingBee API URL.
// IMPORTANT: don't log the full URL because it contains your API key in the query string.
// Flip renderJs to true only if the data is injected by JS / requires rendering.
var requestUri = BuildScrapingBeeUri(apiKey, targetUrl, renderJs: false);
// HttpRequestMessage can only be sent once.
// So we build a *fresh* request for each retry attempt (no cloning, no sync-over-async).
Func<HttpRequestMessage> makeRequest = () =>
{
var req = new HttpRequestMessage(HttpMethod.Get, requestUri);
return req;
};
using var response = await SendWithRetriesAsync(http, makeRequest, cts.Token);
// Read the body even on non-2xx so you can surface helpful error payloads.
// NOTE: this buffers the full response into memory (fine for small pages).
var body = await response.Content.ReadAsStringAsync(cts.Token);
if (!response.IsSuccessStatusCode)
{
throw new HttpRequestException(
$"ScrapingBee request failed: {(int)response.StatusCode} {response.ReasonPhrase}\n" +
Truncate(body, 4_000)
);
}
// This sample expects HTML.
// ScrapingBee can also return JSON depending on params (e.g. extract_rules / ai_query / ai_extract_rules)
// or if you enable JSON wrapping (json_response=true).
// Check Content-Type first; fall back to a cheap heuristic if Content-Type is missing/incorrect.
var mediaType = response.Content.Headers.ContentType?.MediaType;
if (!string.IsNullOrWhiteSpace(mediaType) &&
mediaType.Contains("json", StringComparison.OrdinalIgnoreCase))
{
throw new InvalidOperationException(
$"Response is JSON ({mediaType}), not HTML. " +
"If you used extract_rules / ai_query / ai_extract_rules / json_response, parse the JSON output instead."
);
}
if (LooksLikeJson(body))
{
throw new InvalidOperationException(
"Response looks like JSON, not HTML. " +
"If you used extract_rules / ai_query / ai_extract_rules / json_response, parse JSON instead of HTML."
);
}
var links = ExtractLinks(body);
foreach (var link in links)
Console.WriteLine($"{link.Text} -> {link.Href}");
return 0;
}
private static Uri BuildScrapingBeeUri(string apiKey, string targetUrl, bool renderJs)
{
// Keep params explicit and easy to extend (e.g., premium_proxy, country_code, json_response, etc.).
var qb = new List<string>
{
$"api_key={Uri.EscapeDataString(apiKey)}",
$"url={Uri.EscapeDataString(targetUrl)}",
$"render_js={(renderJs ? "true" : "false")}"
};
return new Uri($"https://app.scrapingbee.com/api/v1/?{string.Join("&", qb)}");
}
private static List<(string Text, string Href)> ExtractLinks(string html)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Grab all anchors. For real scraping, you usually want a narrower XPath/CSS selector.
var nodes = doc.DocumentNode.SelectNodes("//a");
if (nodes == null) return new List<(string, string)>();
return nodes
.Select(a => (
Text: WebUtility.HtmlDecode(a.InnerText).Trim(),
Href: a.GetAttributeValue("href", "").Trim()
))
// Keep it simple for the tutorial: only anchors with visible text + non-empty hrefs.
// (This will drop icon-only links; remove the Text filter if you care about those.)
.Where(x => !string.IsNullOrWhiteSpace(x.Text))
.Where(x => !string.IsNullOrWhiteSpace(x.Href))
.Where(x => x.Href != "#")
.Where(x => !x.Href.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase))
.ToList();
}
private static async Task<HttpResponseMessage> SendWithRetriesAsync(
HttpClient http,
Func<HttpRequestMessage> makeRequest,
CancellationToken ct)
{
const int maxAttempts = 5;
var rng = new Random(); // good enough for single-run console demo
for (int attempt = 1; attempt <= maxAttempts; attempt++)
{
// Create a fresh request each attempt (HttpRequestMessage is single-use).
using var req = makeRequest();
try
{
// ResponseHeadersRead returns as soon as headers are available.
// This lets us decide whether to retry based on status code without pre-buffering the whole body.
var resp = await http.SendAsync(req, HttpCompletionOption.ResponseHeadersRead, ct);
// Success (or non-retryable failure): return the response to the caller.
// The caller is responsible for disposing it.
if (!ShouldRetry(resp.StatusCode))
return resp;
// Retryable status codes: dispose the response and delay before retrying.
// Retry-After can be either:
// - a delta (seconds), or
// - an absolute HTTP date
TimeSpan? retryAfter = resp.Headers.RetryAfter?.Delta;
if (retryAfter is null && resp.Headers.RetryAfter?.Date is DateTimeOffset dt)
{
var diff = dt - DateTimeOffset.UtcNow;
if (diff > TimeSpan.Zero)
retryAfter = diff;
}
// Safety cap: don't sleep forever on weird Retry-After values.
// Keep it within your overall timeout budget (cts).
if (retryAfter is { } ra && ra > TimeSpan.FromSeconds(30))
retryAfter = TimeSpan.FromSeconds(30);
resp.Dispose();
var delay = retryAfter ?? ComputeBackoff(attempt, rng);
await Task.Delay(delay, ct);
}
catch (OperationCanceledException) when (ct.IsCancellationRequested)
{
// Propagate the caller's cancellation/timeout as-is.
throw;
}
catch (HttpRequestException) when (attempt < maxAttempts)
{
// Network flake: exponential backoff and retry.
await Task.Delay(ComputeBackoff(attempt, rng), ct);
}
}
throw new HttpRequestException("Request failed after retries.");
}
private static bool ShouldRetry(HttpStatusCode status)
{
var code = (int)status;
// Retry on:
// - 429 Too Many Requests (rate limit)
// - 408 Request Timeout
// - 5xx (transient server errors)
return status == HttpStatusCode.TooManyRequests // 429
|| status == HttpStatusCode.RequestTimeout // 408
|| (code >= 500 && code <= 599);
}
private static TimeSpan ComputeBackoff(int attempt, Random rng)
{
// Exponential backoff with jitter, capped.
// attempt: 1 => 250ms, 2 => 500ms, 3 => 1000ms, 4 => 2000ms, ...
var baseMs = Math.Min(10_000, (int)(250 * Math.Pow(2, attempt - 1)));
var jitter = rng.Next(0, 250);
return TimeSpan.FromMilliseconds(baseMs + jitter);
}
private static bool LooksLikeJson(string s)
{
// Last-resort heuristic if Content-Type is missing/incorrect.
if (string.IsNullOrWhiteSpace(s)) return false;
var trimmed = s.TrimStart();
// If it looks like HTML, it's not JSON.
if (trimmed.StartsWith("<", StringComparison.Ordinal)) return false;
return trimmed.StartsWith("{", StringComparison.Ordinal)
|| trimmed.StartsWith("[", StringComparison.Ordinal);
}
private static string Truncate(string s, int maxChars)
=> s.Length <= maxChars ? s : s.Substring(0, maxChars) + "\n...[truncated]...";
}
This setup already covers a lot of real-world cases. ScrapingBee deals with JavaScript rendering, blocks, and often reduces captcha pain. HtmlAgilityPack focuses purely on parsing and extraction.
If you later find yourself writing complex XPath or wishing you could use CSS selectors everywhere, that's a good signal to try AngleSharp. If you only want CSS selectors (without a browser-like DOM), HtmlAgilityPack + Fizzler is the lightest switch. Until then, this combo is more than enough to ship.
Understanding HTML parsing in C#
What is HTML parsing and why it matters
HTML parsing is what you do once you already have the page HTML. At that point you're done with networking and just want to work with the content. A C# HTML parser takes a raw HTML string and turns it into something structured and predictable instead of a big blob of text. The parser builds a DOM tree, which is basically how browsers see a page. Tags become nodes, attributes become properties, and text is just text. That lets you ask simple questions like "give me this element" or "read this attribute" without worrying about where it appears in the string or how much whitespace is around it.
In a typical setup, ScrapingBee handles fetching the page and dealing with JavaScript, blocks, and captchas. Your C# code receives the HTML and passes it to a C# HTML parser. From there you select the nodes you care about and map them into C# objects or DTOs. Each part does one job, which keeps the whole flow easier to debug and extend later.
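To make that last step concrete, here's a minimal sketch. The markup it expects (a div.product-card with an h2 title and a span.price) is made up for illustration, but the shape is typical: load the HTML, select nodes, map them into a record.
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using HtmlAgilityPack;
public sealed record Product(string Name, decimal? Price);
public static class ProductMapper
{
// html is the page source you already fetched (e.g. via ScrapingBee).
public static List<Product> Map(string html)
{
var doc = new HtmlDocument();
doc.LoadHtml(html);
var cards = doc.DocumentNode.SelectNodes("//div[contains(@class,'product-card')]");
if (cards == null) return new List<Product>();
return cards.Select(card =>
{
var name = card.SelectSingleNode(".//h2")?.InnerText.Trim() ?? "";
var priceText = card.SelectSingleNode(".//span[contains(@class,'price')]")?.InnerText;
decimal? price = decimal.TryParse(
priceText?.Replace("$", "").Trim(),
NumberStyles.Number,
CultureInfo.InvariantCulture,
out var parsed) ? (decimal?)parsed : null;
return new Product(name, price);
}).ToList();
}
}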
Common use cases: Scraping, automation, and data extraction
A C# HTML parser shows up any time you need data from a web page that isn't exposed as an API. Product scraping is a classic example, where you extract prices, names, stock status, or reviews from e-commerce pages. Once the HTML is loaded, pulling those values from the DOM is straightforward. SEO and content monitoring is another common case. You might scan pages for title tags, meta descriptions, heading structure, or broken links. Price tracking works the same way but usually runs on a schedule and needs to survive small markup changes without breaking your code.
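For the SEO case, the checks are mostly a handful of node lookups. A small sketch, assuming the page source is already sitting in an html string:
using System;
using System.Net;
using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Title and meta description are the usual starting points.
var title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText.Trim();
var metaDescription = doc.DocumentNode
.SelectSingleNode("//meta[@name='description']")
?.GetAttributeValue("content", "");
// Heading structure: a page with zero or multiple H1s is worth flagging.
var h1Count = doc.DocumentNode.SelectNodes("//h1")?.Count ?? 0;
Console.WriteLine($"Title: {WebUtility.HtmlDecode(title ?? "")}");
Console.WriteLine($"Description: {metaDescription}");
Console.WriteLine($"H1 count: {h1Count}");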
Parsers are also useful for internal tools and QA automation. Teams often need to read data from legacy pages, internal dashboards, or generated reports. Combined with ScrapingBee, you avoid running headless browsers yourself and don't have to fight proxy setups or anti-bot systems just to get usable HTML.
Limitations of regex for HTML parsing
Regex tends to fall apart fast when used on HTML. Pages are nested, messy, and inconsistent, and regex has no real understanding of structure. A pattern that works today can break tomorrow because a tag moved or an extra wrapper div appeared.
HTML in the wild is often imperfect, with missing tags, weird spacing, or reordered attributes. Regex usually depends on things being exact, so small, harmless changes can cause big failures. Debugging those patterns quickly becomes frustrating. Libraries like HtmlAgilityPack and AngleSharp don't rely on brittle text matching. They build a DOM from the HTML and handle broken markup much more gracefully. That's why for any serious C# HTML parser work, regex is usually the wrong tool.
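A small, contrived example shows the difference. The regex assumes the class attribute comes first; the parser doesn't care:
using System;
using System.Text.RegularExpressions;
using HtmlAgilityPack;
var html = "<span data-sku=\"42\" class=\"price\">19.99</span>"; // attribute order changed
// Brittle: the pattern expects class to be the first attribute, so it finds nothing.
var match = Regex.Match(html, "<span class=\"price\"[^>]*>(.*?)</span>");
Console.WriteLine(match.Success); // False
// Structure-aware: the DOM query matches regardless of attribute order or spacing.
var doc = new HtmlDocument();
doc.LoadHtml(html);
var price = doc.DocumentNode
.SelectSingleNode("//span[contains(concat(' ', normalize-space(@class), ' '), ' price ')]")
?.InnerText;
Console.WriteLine(price); // 19.99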
Top C# HTML parser libraries compared
If you're doing anything with web pages in C#, you'll end up needing an HTML parser sooner or later. There are a few solid options, and they all shine in slightly different situations. The big difference is usually how strict they are, how much memory they use, and whether they expect "browser-like" HTML or just a plain HTML string.
In a ScrapingBee workflow, this choice gets simpler. ScrapingBee already gives you rendered, final HTML, so most of the time you don't need browser-level APIs or JavaScript execution on the parsing side. You just need a parser that can take an HTML string, understand the DOM, and let you extract data cleanly.
That's where libraries like HtmlAgilityPack and AngleSharp fit best. Tools that are tightly coupled to Selenium or a real browser usually only make sense if you still need to interact with the page after it loads. For pure extraction from HTML returned by an API, lighter parsers are usually the better call.
HtmlAgilityPack: XPath-based parsing with low memory usage
HtmlAgilityPack is the default choice for a lot of C# developers, and for good reason. If someone says "C# HTML parser", this is usually what they mean. It's been around for a long time, it's stable, and it does one job very well.
Getting started is easy. You install it via NuGet, load an HTML string into a document, and start querying nodes. It supports XPath out of the box and also works nicely with LINQ, which makes extraction logic readable and compact. One of its biggest strengths is how tolerant it is of broken or messy HTML, which is exactly what you get from real websites.
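A quick sketch of both styles on the same document (assuming the page source is already in an html string):
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html);
// XPath style:
var byXPath = doc.DocumentNode
.SelectNodes("//a[@href]")
?.Select(a => a.GetAttributeValue("href", ""))
.ToList() ?? new List<string>();
// LINQ style over the same DOM:
var byLinq = doc.DocumentNode
.Descendants("a")
.Where(a => a.Attributes.Contains("href"))
.Select(a => a.GetAttributeValue("href", ""))
.ToList();
Console.WriteLine($"XPath: {byXPath.Count}, LINQ: {byLinq.Count}");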
HtmlAgilityPack works especially well with HTML returned by ScrapingBee. Since ScrapingBee already handles JavaScript rendering and anti-bot issues, you usually receive clean, complete HTML. HtmlAgilityPack can load that HTML without complaint and let you focus on selecting the elements you need instead of fixing markup issues.
For most scraping projects, this is the best place to start. It's fast, memory-efficient, and flexible enough for the majority of extraction tasks. The TL;DR example in this guide uses HtmlAgilityPack for exactly that reason.
If you want a deeper walkthrough focused specifically on this library, check out this guide: Web scraping with Html Agility Pack.
AngleSharp: CSS selector support and HTML5 compliance
AngleSharp is a more modern take on an HTML parser for C#. It is fully HTML5 aware and models the DOM much closer to how a real browser does it. If you're used to working with CSS selectors in frontend code, AngleSharp will feel very natural.
Instead of XPath, you can use methods like QuerySelector and QuerySelectorAll to grab elements using familiar CSS syntax. That's a big win when you're dealing with complex layouts, deeply nested elements, or class-heavy markup. For many developers, reading CSS selectors is simply easier than reading XPath expressions.
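A minimal sketch (the table.results markup is made up, and html is assumed to hold the page source):
using System;
using AngleSharp.Html.Parser;
var parser = new HtmlParser();
var document = parser.ParseDocument(html);
foreach (var row in document.QuerySelectorAll("table.results > tbody > tr"))
{
var name = row.QuerySelector("td.name")?.TextContent.Trim();
var score = row.QuerySelector("td.score")?.TextContent.Trim();
Console.WriteLine($"{name}: {score}");
}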
AngleSharp works just fine with the same HTML body returned by ScrapingBee. You fetch the page through the API, pass the HTML string into AngleSharp, and then query the DOM as if it came from a browser. There's no requirement to run a real browser unless you actually need to interact with the page after parsing.
The trade-off is that AngleSharp is a bit more involved to set up, especially for beginners. It has more concepts and configuration options, and it's slightly heavier than HtmlAgilityPack. If you're just starting out or doing straightforward scraping, HtmlAgilityPack is usually simpler. If you already think in CSS and want a browser-like DOM model, AngleSharp is a solid choice.
Fizzler: CSS selectors on top of HtmlAgilityPack
Fizzler is not a full HTML parser by itself. It's a small add-on that sits on top of HtmlAgilityPack and adds CSS selector support. If you like HtmlAgilityPack but don't want to deal with XPath, Fizzler fills that gap.
The model stays simple. HtmlAgilityPack loads and parses the HTML, and Fizzler lets you query the DOM using familiar CSS-style selectors. Expressions like .product-card a.title or div.price span.value are easier to read and often easier to maintain than complex XPath. This works nicely in a ScrapingBee setup. ScrapingBee returns the HTML, HtmlAgilityPack builds the DOM, and Fizzler handles selector logic. You get HtmlAgilityPack's tolerance for messy markup without switching to a heavier, browser-like parser.
The HtmlAgilityPack adapter (Fizzler.Systems.HtmlAgilityPack) hasn't been updated since 2020, even though the core Fizzler selector engine has newer releases. It still works fine in many existing HAP-based scrapers, but for new projects that want a more actively evolving DOM + selector stack, AngleSharp is usually the cleaner pick.
Selenium WebDriver: Full browser automation for dynamic content
Selenium WebDriver is not really an HTML parser. It's a full browser automation tool. Instead of working with an HTML string, you control a real browser, load pages, click buttons, fill forms, and wait for JavaScript to run.
In C#, Selenium still makes sense in a few cases. Very complex single-page apps, multi-step flows, or pages that require logins and user interaction can be hard to deal with using plain HTTP requests. If the only way to reach the data is by clicking through the UI or triggering client-side logic, Selenium can handle that.
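If you do go that route, a bare-bones sketch looks like this (the login URL and field names are made up; you'll also need the Selenium.WebDriver NuGet package and a matching browser driver):
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using var driver = new ChromeDriver();
driver.Navigate().GoToUrl("https://example.com/login");
// Interact with the page: type, click, let client-side logic run.
driver.FindElement(By.Name("username")).SendKeys("demo");
driver.FindElement(By.Name("password")).SendKeys("demo");
driver.FindElement(By.CssSelector("button[type='submit']")).Click();
// Read elements directly from the live page...
var heading = driver.FindElement(By.TagName("h1")).Text;
Console.WriteLine(heading);
// ...or hand the rendered HTML to a regular parser for extraction.
var renderedHtml = driver.PageSource;
driver.Quit();
A real flow would also add explicit waits for elements to appear; this is just to show the shape of the API.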
The downside is weight and speed. Running a browser is slow, resource-heavy, and harder to maintain. You need drivers, browser versions, timeouts, and a lot of defensive code just to keep things stable. For simple data extraction, this is usually overkill. This is where ScrapingBee changes the equation.
In many cases, ScrapingBee can render the page, execute the JavaScript, and return the final HTML directly. Once you have that HTML, a lightweight C# HTML parser like HtmlAgilityPack or AngleSharp is enough. You get the result without managing browsers, which is almost always faster and easier.
CsQuery: jQuery-like syntax for DOM traversal
CsQuery is a very old library that brings a jQuery-style API to C#. It was popular years ago because the selector syntax felt familiar to frontend developers, and DOM traversal looked similar to jQuery code.
Today, CsQuery is effectively unmaintained and hasn't seen meaningful updates since 2013. It may still appear in legacy codebases, and in that context it's fine to understand how it works. For new projects, it's not recommended. Modern C# scraping setups are better served by actively maintained libraries like HtmlAgilityPack or AngleSharp.
Majestic-12: High-performance legacy parser
Majestic-12 is a low-level HTML parser that focused heavily on performance and raw speed. It dates back to the mid-2010s and shows up mostly in older tutorials or long-lived scraping systems that were built around it at the time.
While it can still function, Majestic-12 is effectively a legacy option today. It hasn't kept pace with modern .NET development or developer expectations. For new projects, higher-level and actively maintained libraries like HtmlAgilityPack or AngleSharp are almost always the better choice.
Summary
| Library | What it is | Best used for | Works well with ScrapingBee | Notes |
|---|---|---|---|---|
| HtmlAgilityPack | Lightweight HTML DOM parser | Most scraping and data extraction tasks | Yes | Default choice for a C# HTML parser, tolerant to broken HTML |
| AngleSharp | HTML5-compliant DOM implementation | Complex layouts, CSS-heavy pages, browser-like DOM needs | Yes | Supports CSS selectors, slightly heavier to set up |
| Fizzler | CSS selector layer for HAP | Using CSS selectors without switching parsers | Yes | Built on top of HtmlAgilityPack |
| Selenium WebDriver | Full browser automation | SPAs, logins, multi-step user interactions | n/a (different approach) | Heavy and slow compared to API-based fetching |
| CsQuery | jQuery-style DOM traversal | Legacy projects with jQuery-like parsing logic | Technically yes, but legacy | Less common in new projects |
| Majestic-12 | Low-level high-performance parser | Older performance-focused or legacy systems | Technically yes, but legacy | Rarely used in modern codebases |
Hidden tricks for efficient HTML parsing
Once you've picked a C# HTML parser, the real gains come from how you use it. A few small tricks can save a lot of code and make your scrapers easier to read and maintain. This is especially true when you're working with HTML returned by ScrapingBee, where the markup is already rendered and ready to parse.
These tips are about doing more with less: using CSS selectors instead of long XPath expressions, handling imperfect HTML without extra cleanup, and navigating around the DOM to extract related data like product cards, tables, or lists.
Combining Fizzler with HtmlAgilityPack for CSS selector support
HtmlAgilityPack is powerful, but XPath can get ugly fast. That's where Fizzler helps. It adds CSS selector support on top of HtmlAgilityPack, so you can keep using the same parser while writing more readable queries.
The setup is straightforward. You load the HTML as usual, then query the document using CSS selectors instead of XPath. This is especially handy for common scraping tasks like extracting product cards or list items.
A simple pattern looks like this:
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html);
foreach (var card in doc.DocumentNode.QuerySelectorAll(".product-card"))
{
var title = card.QuerySelector(".title")?.InnerText?.Trim();
var price = card.QuerySelector(".price")?.InnerText?.Trim();
var href = card.QuerySelector("a")?.GetAttributeValue("href", null);
}
Selectors like .product-card a.title are often much easier to understand than long XPath expressions, especially when the markup changes slightly over time. If you already use HtmlAgilityPack, Fizzler is an easy upgrade that can simplify a lot of scraping logic.
Using AngleSharp's QuerySelectorAll() for nested element extraction
AngleSharp really shines when you're extracting structured data from repeated blocks, like cards or rows in a table. If you think in CSS, the code tends to read almost like frontend logic. Assume you already have the HTML string from ScrapingBee. You parse it once, then use nested selectors to pull out related fields from each card.
A typical extraction flow might look like this:
using System;
using System.Threading.Tasks;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;
public static class AngleSharpExample
{
public static async Task RunAsync(string html, string pageUrl)
{
var baseUri = new Uri(pageUrl, UriKind.Absolute);
var parser = new HtmlParser();
IDocument document = await parser.ParseDocumentAsync(html);
foreach (var card in document.QuerySelectorAll(".product-card"))
{
var title = card.QuerySelector(".title")?.TextContent?.Trim();
var price = card.QuerySelector(".price")?.TextContent?.Trim();
var href = card.QuerySelector("a")?.GetAttribute("href")?.Trim();
// Resolve relative URLs like "/p/123" against the real page URL
string? absoluteUrl = null;
if (!string.IsNullOrWhiteSpace(href))
{
absoluteUrl = Uri.TryCreate(baseUri, href, out var resolved)
? resolved.ToString()
: null;
}
Console.WriteLine($"{title} | {price} | {absoluteUrl}");
}
}
}
This style feels natural if you're used to CSS selectors. You start from a parent element and drill down to children and attributes without switching mental models. For complex layouts with nested elements, this can be easier to read and maintain than XPath-heavy approaches.
Both of these techniques work well with HTML returned by ScrapingBee and help keep your parsing code focused on structure, not string manipulation.
Handling malformed HTML with HtmlAgilityPack's error-tolerant parser
Real-world HTML is often messy. Old CMS templates, half-broken markup, missing closing tags, weird nesting, all of that shows up very quickly once you start scraping at scale. This is one of the reasons HtmlAgilityPack is so widely used as a C# HTML parser. HtmlAgilityPack is forgiving by default. It doesn't expect perfect HTML and will try to build a usable DOM even when the markup is clearly wrong. That means your scraper keeps working even if a page forgets to close a <div> or nests elements in a questionable way.
There are also a few options that help in these situations. One useful setting is enabling fixes for broken nesting:
var doc = new HtmlDocument();
doc.OptionFixNestedTags = true;
doc.LoadHtml(html);
This won't magically fix every problem, but it helps stabilize the DOM on older or CMS-heavy sites where the HTML was never meant to be machine-friendly. Instead of crashing or returning null everywhere, you still get a structure you can query.
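If you want to see what the parser had to tolerate, HtmlAgilityPack also records the problems it recovered from, which is handy for logging pages that are starting to degrade. A short sketch:
using System;
using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.OptionFixNestedTags = true;
doc.LoadHtml(html);
// ParseErrors lists the issues HAP worked around (unclosed tags, bad nesting, ...).
foreach (var error in doc.ParseErrors)
{
Console.WriteLine($"Line {error.Line}: {error.Code} - {error.Reason}");
}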
If you scrape anything beyond modern frontend stacks, this tolerance alone is often enough reason to stick with HtmlAgilityPack.
Using XPath axes for sibling and parent node selection
Sometimes the data you want isn't inside the node you matched. A common example is a product title in one element and the price sitting right next to it, maybe in a sibling node or wrapped in a parent container. XPath axes let you move around the DOM instead of only going downwards. You can jump to parents or siblings without re-querying the whole document.
Here's a compact example that starts from a list of title nodes and navigates to related elements:
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html);
IEnumerable<HtmlNode> titles =
doc.DocumentNode.SelectNodes("//h2[@class='title']")
?? Enumerable.Empty<HtmlNode>();
foreach (var title in titles)
{
var price = title.SelectSingleNode("following-sibling::span[@class='price']");
var container = title.SelectSingleNode("parent::div");
}
Axes like following-sibling::, preceding-sibling::, and parent:: are extremely useful when scraping tables, lists, or card layouts where related data is split across nearby elements. Once you get used to them, you can solve a lot of "where is this value?" problems without rewriting your selectors.
Parsing inline styles and attributes with AngleSharp
Not all data lives in text nodes. Many sites hide useful values inside attributes like data-* or even inline styles. AngleSharp makes reading these straightforward. With AngleSharp, attributes are first-class citizens. You can read them directly without extra parsing layers. This is handy for things like internal IDs, tracking values, or flags that aren't visible on the page.
A minimal example looks like this:
var cards = document.QuerySelectorAll(".product-card");
foreach (var card in cards)
{
var id = card.GetAttribute("data-id");
var style = card.GetAttribute("style");
}
You can also inspect specific style values if needed, but even just grabbing the raw attribute is often enough. This approach works well when scraping modern pages where important data is stored in attributes instead of visible text.
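If you do need individual declarations out of an inline style, a plain string split is usually enough. This fragment is meant to sit inside the same foreach as above, reusing its card variable, and assumes a hypothetical background-image declaration:
// e.g. style="color: red; background-image: url('/img/1.jpg')"
// Needs using System.Linq; naive split, so it won't handle every edge case.
var declarations = (card.GetAttribute("style") ?? "")
.Split(';', StringSplitOptions.RemoveEmptyEntries)
.Select(d => d.Split(':', 2))
.Where(parts => parts.Length == 2)
.ToDictionary(parts => parts[0].Trim(), parts => parts[1].Trim());
declarations.TryGetValue("background-image", out var backgroundImage);
Console.WriteLine(backgroundImage);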
AngleSharp's browser-like DOM model makes this feel natural, especially if you're already used to reading attributes in frontend code.
Performance benchmarks and use case scenarios
When you're choosing a C# HTML parser, performance can matter a lot, especially if you're parsing thousands of pages in a batch job. In .NET, BenchmarkDotNet is the standard way to measure runtime and allocations without fooling yourself with Debug builds or stopwatch noise.
One thing to keep in mind in a ScrapingBee workflow: the slowest part is usually upstream (network + rendering + anti-bot). Your parser is doing in-memory work: turning an HTML string into a DOM and running queries. So the benchmark here is intentionally narrow: it helps you compare "DOM build + query" cost across parsers, not total scraping speed.
Benchmark setup: HtmlAgilityPack vs AngleSharp vs Fizzler
If you benchmark parsers, the easiest mistake is benchmarking different work. AngleSharp CSS selectors, HAP XPath, and Fizzler CSS-on-HAP aren't identical engines, so the goal is same intent, not perfect theoretical equivalence.
This benchmark splits the work into two parts:
- ParseOnly: parse HTML into a DOM and run a query (this is closer to a simple scraper that parses once per page)
- SelectOnly: run a query on a pre-parsed DOM (this isolates selector/query overhead and avoids measuring parsing each time)
It also uses two HTML fixtures so you don't accidentally benchmark only "clean textbook HTML":
- WellFormed: consistent markup (good baseline)
- Messy: deliberately imperfect markup (closer to what you see in the wild)
And two selector workloads:
- SimpleLinks: cheap selector (a[href])
- CardLinksWithAttrFilter: more realistic nested selector + attribute filter
Install packages (NuGet):
dotnet add package BenchmarkDotNet
dotnet add package HtmlAgilityPack
dotnet add package AngleSharp
dotnet add package Fizzler.Systems.HtmlAgilityPack
Write the script:
using System;
using System.Linq;
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
using AngleSharp.Dom;
using AngleSharp.Html.Parser;
public class Program
{
public static void Main(string[] args)
=> BenchmarkRunner.Run<HtmlParserBenchmarks>();
}
public enum HtmlFixture
{
WellFormed,
Messy
}
public enum SelectorWorkload
{
// Cheap: many matches, simple filter
SimpleLinks,
// More realistic: nested structure + attribute contains filter
CardLinksWithAttrFilter
}
[MemoryDiagnoser]
public class HtmlParserBenchmarks
{
// Tweak size to match your real pages.
[Params(5_000)]
public int LinkCount { get; set; }
[Params(HtmlFixture.WellFormed, HtmlFixture.Messy)]
public HtmlFixture Fixture { get; set; }
[Params(SelectorWorkload.SimpleLinks, SelectorWorkload.CardLinksWithAttrFilter)]
public SelectorWorkload Workload { get; set; }
private string _html = "";
private HtmlParser _angleParser = null!;
// Pre-parsed docs for SelectOnly benchmarks
private HtmlDocument _hapParsed = null!;
private HtmlDocument _hapParsedForFizzler = null!;
private IDocument _angleParsed = null!;
[GlobalSetup]
public void Setup()
{
_html = Fixture switch
{
HtmlFixture.WellFormed => BuildWellFormedHtml(LinkCount),
HtmlFixture.Messy => BuildMessyHtml(LinkCount),
_ => throw new ArgumentOutOfRangeException()
};
_angleParser = new HtmlParser();
// Pre-parse once so SelectOnly benchmarks don't include parsing time.
_hapParsed = new HtmlDocument();
_hapParsed.LoadHtml(_html);
_hapParsedForFizzler = new HtmlDocument();
_hapParsedForFizzler.LoadHtml(_html);
// Use sync parse to avoid async overhead in microbenchmarks.
_angleParsed = _angleParser.ParseDocument(_html);
}
// ----------------------------
// ParseOnly: DOM build + query
// ----------------------------
[Benchmark]
public int ParseOnly_HtmlAgilityPack_XPath()
{
var doc = new HtmlDocument();
doc.LoadHtml(_html);
var nodes = doc.DocumentNode.SelectNodes(GetHapXPath(Workload));
return nodes?.Count ?? 0;
}
[Benchmark]
public int ParseOnly_AngleSharp_Css()
{
var doc = _angleParser.ParseDocument(_html);
return doc.QuerySelectorAll(GetCssSelector(Workload)).Length;
}
[Benchmark]
public int ParseOnly_Fizzler_CssOnHap()
{
var doc = new HtmlDocument();
doc.LoadHtml(_html);
return doc.DocumentNode.QuerySelectorAll(GetCssSelector(Workload)).Count();
}
// ----------------------------
// SelectOnly: query only
// ----------------------------
[Benchmark]
public int SelectOnly_HtmlAgilityPack_XPath()
{
var nodes = _hapParsed.DocumentNode.SelectNodes(GetHapXPath(Workload));
return nodes?.Count ?? 0;
}
[Benchmark]
public int SelectOnly_AngleSharp_Css()
{
return _angleParsed.QuerySelectorAll(GetCssSelector(Workload)).Length;
}
[Benchmark]
public int SelectOnly_Fizzler_CssOnHap()
{
return _hapParsedForFizzler.DocumentNode.QuerySelectorAll(GetCssSelector(Workload)).Count();
}
// ----------------------------
// Workload mapping
// ----------------------------
private static string GetCssSelector(SelectorWorkload workload) => workload switch
{
// All <a> with href
SelectorWorkload.SimpleLinks => "a[href]",
// Nested: ".card > .title a" + attribute contains filter
SelectorWorkload.CardLinksWithAttrFilter => ".card > .title a[href*='ref=bench']",
_ => throw new ArgumentOutOfRangeException(nameof(workload), workload, null)
};
private static string GetHapXPath(SelectorWorkload workload) => workload switch
{
// All <a> with href
SelectorWorkload.SimpleLinks => "//a[@href]",
// Equivalent intent to the CSS selector above:
// Find div.card, then direct child div.title, then descendant a with href containing marker.
// (This XPath is "workload comparable", not "perfectly equivalent".)
SelectorWorkload.CardLinksWithAttrFilter =>
"//div[contains(concat(' ', normalize-space(@class), ' '), ' card ')]" +
"/div[contains(concat(' ', normalize-space(@class), ' '), ' title ')]" +
"//a[@href and contains(@href, 'ref=bench')]",
_ => throw new ArgumentOutOfRangeException(nameof(workload), workload, null)
};
// ----------------------------
// HTML fixtures
// ----------------------------
private static string BuildWellFormedHtml(int linkCount)
{
var sb = new StringBuilder(capacity: linkCount * 90);
sb.Append("<!doctype html><html><head><title>x</title></head><body>");
sb.Append("<div class='container'>");
for (int i = 0; i < linkCount; i++)
{
// Half the links include a marker so the "attr contains" workload has predictable matches.
var marker = (i % 2 == 0) ? "?ref=bench" : "";
sb.Append("<div class='card'>");
sb.Append("<div class='title'>");
sb.Append("<a class='item' href='/p/");
sb.Append(i);
sb.Append(marker);
sb.Append("'>Item ");
sb.Append(i);
sb.Append("</a>");
sb.Append("</div>");
sb.Append("</div>");
}
sb.Append("</div></body></html>");
return sb.ToString();
}
private static string BuildMessyHtml(int linkCount)
{
// Intentionally malformed-ish:
// - missing closing tags
// - weird nesting
// - extra wrappers
// Goal: simulate "real internet HTML" that isn't perfectly balanced.
var sb = new StringBuilder(capacity: linkCount * 95);
sb.Append("<html><head><title>x</title></head><body>");
sb.Append("<div class='container'>");
for (int i = 0; i < linkCount; i++)
{
var marker = (i % 2 == 0) ? "?ref=bench" : "";
sb.Append("<div class='card'>");
sb.Append("<div class='title'>");
// Missing </span>, missing </div> sometimes, inconsistent quoting.
sb.Append("<span class=meta>meta");
sb.Append(i);
sb.Append("<a class=item href='/p/");
sb.Append(i);
sb.Append(marker);
sb.Append("'>Item ");
sb.Append(i);
sb.Append("</a>");
// Close only some tags to keep it "messy".
if (i % 3 == 0) sb.Append("</div>"); // close title
if (i % 5 == 0) sb.Append("</div>"); // close card
}
sb.Append("</div></body></html>");
return sb.ToString();
}
}
Run benchmarks in Release (debug builds might lie):
dotnet run -c Release
For consistent numbers, pin the target framework (e.g. -f net8.0 or -f net9.0) and run on the same machine.
BenchmarkDotNet will run in a dedicated process, do warmup iterations, and then print a table with timings and allocations. If you change LinkCount (or add a third fixture from a real page snapshot), you'll see how each library behaves as HTML size and "messiness" changes.
How to read the results without overthinking it
This benchmark is measuring two separate costs:
- DOM build cost: how expensive it is to parse HTML into objects
- Query cost: how expensive it is to run selectors/XPath on those objects
Don't treat the numbers as universal. Parser performance depends heavily on:
- the shape of your HTML (tiny vs huge, well-formed vs messy)
- the shape of your selectors (simple vs nested + attribute filters)
- your runtime and hardware (.NET version, CPU, GC, etc.)
What you actually want from the benchmark is a directionally useful answer to questions like:
- "Does AngleSharp's richer DOM cost me meaningfully on my workloads?"
- "If I stick with HtmlAgilityPack, am I saving allocations in big batches?"
- "Is Fizzler's selector layer overhead noticeable at my scale?"
What this does and doesn't tell you in a ScrapingBee pipeline
In production scraping, your bottleneck is often outside parsing (rendering, rate limits, retries, target variability). ScrapingBee handles most of that upstream, which is why it makes sense to benchmark parsers in isolation. It keeps the test focused and repeatable.
But this is still not an end-to-end scraping benchmark. It won't tell you:
- how long a job takes including network + rendering
- whether a selector is stable against real website changes
- how much time you spend on retries, captchas, or timeouts
So treat this section as a way to pick a sane default and understand trade-offs, not as a leaderboard that "proves" one parser wins everywhere.
Best parser for large HTML documents
Large pages are where parser choice (and your extraction habits) start to matter. Think category pages with hundreds of products, long tables, or CMS pages full of nested blocks. Most libraries can parse big HTML; the real difference is how much memory/time they burn doing it.
HtmlAgilityPack usually feels comfy on large documents because it's lightweight and forgiving with real-world markup. AngleSharp can handle big pages too, but it builds a richer, more browser-like (HTML5) DOM, so as a rule of thumb you'll often pay a bit more in memory and parse time.
No matter which one you pick, the biggest wins come from how you query the DOM after parsing. You'll still parse the whole HTML string into a DOM, but you can avoid expensive "grab everything and filter later" queries. Start from a tight container (like a product card or a table) and work inward. Skip broad stuff like "all divs" unless you really need it.
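In HtmlAgilityPack terms, that usually means anchoring on a container first (the product-grid id here is a made-up example) and keeping every follow-up query relative to it:
using System.Linq;
using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html);
// Broad: "//div" walks the whole document and allocates a huge node list you mostly discard.
// Narrow: anchor on the container you care about, then query inside it with relative XPath.
var grid = doc.DocumentNode.SelectSingleNode("//div[@id='product-grid']");
var cards = grid?.SelectNodes(".//div[contains(@class,'product-card')]");
foreach (var card in cards ?? Enumerable.Empty<HtmlNode>())
{
var title = card.SelectSingleNode(".//h2")?.InnerText.Trim();
// ...map into your model here
}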
If you want a bigger performance jump, reduce the work before parsing even starts. ScrapingBee can return structured JSON instead of full HTML:
- Use extract_rules (passed as stringified JSON in the query string) to get back only the fields you care about.
- Or use the AI options like ai_query / ai_extract_rules for JSON extraction.
- If you set json_response=true, ScrapingBee wraps the response in a JSON envelope (you'll typically see fields like type and body, where the HTML ends up inside body when type is "html").
In any of these JSON modes, don't feed the response into an HTML parser. Parse JSON first (and/or check Content-Type) and only build a DOM when you actually have raw HTML. Less markup → fewer nodes → fewer allocations → faster runs, especially when you're scraping big pages at scale.
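Here's a rough sketch of the extract_rules route with System.Text.Json on the client side. The rule syntax and the exact response shape are simplified, so check the ScrapingBee extract_rules docs before relying on it:
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;
static async Task RunExtractRulesAsync(HttpClient http, string apiKey, string targetUrl)
{
// Ask ScrapingBee to extract fields server-side and return JSON instead of full HTML.
var rules = "{\"title\": \"h1\", \"prices\": {\"selector\": \".price\", \"type\": \"list\"}}";
var requestUri =
"https://app.scrapingbee.com/api/v1/" +
$"?api_key={Uri.EscapeDataString(apiKey)}" +
$"&url={Uri.EscapeDataString(targetUrl)}" +
$"&extract_rules={Uri.EscapeDataString(rules)}";
var json = await http.GetStringAsync(requestUri);
// Parse JSON, not HTML: the result is keyed by your rule names.
using var doc = JsonDocument.Parse(json);
if (doc.RootElement.TryGetProperty("title", out var title))
Console.WriteLine(title.GetString());
}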
Best parser for dynamic JavaScript-rendered pages
Dynamic pages are where many teams instinctively reach for Selenium. In C#, that means running a browser, managing drivers, waiting for elements, and dealing with flaky timing issues. It works, but it's heavy and adds a lot of moving parts. A simpler option in many cases is letting ScrapingBee render the JavaScript for you. ScrapingBee loads the page, executes client-side code, and returns the final HTML. From there, you can parse the result with HtmlAgilityPack, AngleSharp, or any other C# HTML parser you like.
This approach usually wins on maintenance. You don't manage browsers, you don't tune waits, and you don't fight version mismatches. You still get access to fully rendered content, but your parsing logic stays clean and testable. Selenium remains useful for very interactive flows, but for pure data extraction, rendered HTML plus a parser is often enough.
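In code, that's just a parameter change on the request you already send. A sketch reusing the variable names from the TL;DR example (the wait value is optional; check the ScrapingBee docs for the exact rendering options):
// render_js=true asks ScrapingBee to execute the page's JavaScript before returning HTML.
// wait (milliseconds) gives slow client-side rendering a bit of extra time.
var requestUri =
"https://app.scrapingbee.com/api/v1/" +
$"?api_key={Uri.EscapeDataString(apiKey)}" +
$"&url={Uri.EscapeDataString(targetUrl)}" +
"&render_js=true" +
"&wait=2000";
var html = await http.GetStringAsync(requestUri, cts.Token);
// From here, parsing is identical: HtmlAgilityPack, AngleSharp, whatever you prefer.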
Memory usage comparison across libraries
Memory usage mostly comes down to how big the DOM is and how long you keep it around.
- AngleSharp builds a detailed DOM with more objects, which costs more memory but gives you a richer API.
- HtmlAgilityPack uses a simpler internal model and is usually lighter for the same document.
- Fizzler adds a small overhead on top of HtmlAgilityPack due to selector parsing, but it's rarely a deal breaker.
A few safe rules of thumb help regardless of the library. Don't keep documents alive longer than needed. Parse, extract, and discard. If you're processing many pages, batch them instead of loading everything into memory at once. When possible, parse only the fragments you care about instead of the entire page.
Again, ScrapingBee helps here by reducing how much HTML you need to deal with in the first place. The less markup you feed into your parser, the less memory it needs to do its job.
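A simple way to enforce the "parse, extract, discard" habit is to stream results instead of holding parsed documents in a list. A sketch with HtmlAgilityPack:
using System.Collections.Generic;
using HtmlAgilityPack;
static IEnumerable<(string Url, string? Title)> ExtractTitles(
IEnumerable<(string Url, string Html)> pages)
{
foreach (var (url, html) in pages)
{
// Parse, extract, yield: the HtmlDocument goes out of scope immediately,
// instead of thousands of parsed DOMs piling up in memory at once.
var doc = new HtmlDocument();
doc.LoadHtml(html);
yield return (url, doc.DocumentNode.SelectSingleNode("//title")?.InnerText.Trim());
}
}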
Choosing the right parser for your project
By now you've seen that there's no single "perfect" C# HTML parser. The good news is that you usually don't need one. Most projects work best with a simple default, plus a clear idea of when it's time to switch tools.
In a ScrapingBee-based setup, the architecture stays clean. ScrapingBee fetches the page, handles rendering, and deals with blocking. Your parser's job is much smaller. It takes HTML and turns it into structured data that your app or database can use. Once you think of it that way, choosing a parser becomes mostly about developer experience and performance trade-offs.
If you want a broader look at how this fits into a full extraction flow, this guide walks through the C# side in more detail: Data extraction in C#.
When to use HtmlAgilityPack vs AngleSharp
- For most scraping tasks, start with HtmlAgilityPack. It's fast, forgiving, easy to install, and works well with the HTML ScrapingBee returns. If your selectors are simple and XPath doesn't bother you, this will cover a lot of ground.
- AngleSharp makes sense when you want browser-like behavior or rely heavily on CSS selectors. If you're dealing with complex layouts, deeply nested elements, or you already think in CSS, AngleSharp often feels more natural.
Both libraries work well with rendered HTML from ScrapingBee, so switching between them doesn't affect the rest of your pipeline. And if needed, you can mix tools. It's fine to use HtmlAgilityPack for most pages and AngleSharp for the tricky ones.
Trade-offs between speed, ease of use, and feature set
Most parser decisions come down to priorities. There's no free lunch, just different balances.
A simple way to think about it:
- Speed first: HtmlAgilityPack. Good for batch jobs and large volumes.
- Ease of reading selectors: AngleSharp or HtmlAgilityPack + Fizzler. Better if you prefer CSS over XPath.
- Rich DOM features: AngleSharp. Closer to how browsers behave.
Fast batch jobs that parse thousands of pages usually benefit from simpler parsers. Interactive tools, dashboards, or complex extraction logic often benefit from richer APIs, even if they cost a bit more in memory.
Compatibility with .NET Core and .NET framework
If you're targeting modern .NET, both HtmlAgilityPack and AngleSharp are safe choices. They work well with .NET Core and newer .NET versions and are actively used in current projects.
Older libraries tend to show up in legacy codebases. They may still work, but they often lack updates, documentation, or good support for newer runtimes. For new projects, it's usually better to stick with libraries that are clearly maintained and tested against modern .NET.
Community support and maintenance status
Community support matters more than it sounds. Good documentation, active GitHub issues, and real answers on StackOverflow save time when things go wrong. They also make it easier to learn best practices instead of reinventing them.
HtmlAgilityPack and AngleSharp both have active communities and plenty of examples online. That makes them safer long-term choices than older or niche libraries like Majestic-12. ScrapingBee tutorials also fill in the gaps by showing how these parsers fit into real scraping workflows.
If you're curious how similar decisions look in other ecosystems, this comparison on the Ruby side is a good reference point: Ruby HTML and XML Parsers.
Turn your C# HTML parser into a production scraper
At this point, the pattern should be pretty clear: you don't need a complicated stack to build a solid C# scraper. Let ScrapingBee handle the messy parts of the web, and let your C# HTML parser do what it's best at: turning HTML into clean, structured data.
The flow stays simple. ScrapingBee fetches the page, deals with JavaScript, blocks, and captchas, and hands you back usable HTML. Your parser (whether that's HtmlAgilityPack, AngleSharp, or a mix of tools) loads that HTML and extracts exactly what your app needs. No browsers to manage. No fragile regex. Just data.
If you've been following along, you can take the TL;DR example from earlier, drop it into your project, and point it at a real site today. Start small. Extract a list of products, links, or prices. Once that works, scaling up is mostly incremental. Add concurrency to process pages faster. Add retries and basic error handling for long-running jobs. Use extract_rules in ScrapingBee to return only the fields you care about and cut down how much HTML your parser even has to touch.
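For the concurrency part, a small gate around your existing fetch-and-parse function is usually all you need. A sketch (ScrapeManyAsync and fetchAndParseAsync are placeholder names; pick a maxConcurrency that matches your ScrapingBee plan's limit):
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
static async Task<List<T>> ScrapeManyAsync<T>(
IEnumerable<string> urls,
Func<string, Task<T>> fetchAndParseAsync,
int maxConcurrency = 5)
{
// SemaphoreSlim caps how many requests run at once.
using var gate = new SemaphoreSlim(maxConcurrency);
var tasks = urls.Select(async url =>
{
await gate.WaitAsync();
try
{
return await fetchAndParseAsync(url);
}
finally
{
gate.Release();
}
}).ToList();
return (await Task.WhenAll(tasks)).ToList();
}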
If your next C# scraping project needs to run reliably in production, the fastest way there is to stop fighting the network layer and focus on extraction. ScrapingBee is built to take that pain off your plate.
You can get started here and try it against a real target site: Best Web Scraping API.
Conclusion
Parsing HTML in C# doesn't have to be complicated. Once you separate fetching from parsing, the whole problem gets easier to reason about and easier to scale. ScrapingBee handles access, rendering, and blocking, and your C# HTML parser turns the result into structured data you can actually use.
For most projects, starting with HtmlAgilityPack is the safest move. AngleSharp is there when you need richer DOM behavior or prefer CSS selectors. The key is that you can switch or mix tools without changing your overall architecture. Keep the pipeline simple, pick the parser that fits your needs, and build from there. That's usually all it takes to go from a quick script to a scraper that holds up in production.
Frequently asked questions (FAQs)
Which C# HTML parser should I start with for basic web scraping?
For most basic scraping tasks, HtmlAgilityPack is the best starting point. It's easy to install, fast, tolerant of broken HTML, and works well with real-world pages. It also fits perfectly with HTML returned by ScrapingBee, so you can focus on extraction instead of setup.
How do I combine ScrapingBee with a C# HTML parser?
You use ScrapingBee to fetch and render the page, then pass the returned HTML string directly into your parser. ScrapingBee handles JavaScript, blocks, and captchas, while your C# HTML parser focuses only on DOM traversal and data extraction. The two tools stay cleanly separated.
Can I parse HTML in C# without downloading the full page content?
Yes, in many cases. ScrapingBee supports extract_rules, which let you return only the fields you care about instead of the full HTML. That reduces network payload size and parsing work, making your C# code faster and more memory-efficient, especially for large or repetitive pages.
How can I speed up large-scale HTML parsing jobs in C#?
Reduce work before parsing starts. Run requests concurrently in your app within ScrapingBee's concurrency limits. Use ScrapingBee to get final rendered HTML when needed. Keep selectors narrow so you don't traverse the whole DOM. Batch pages, dispose documents fast, and don't keep parsed trees in memory.



