How to parse HTML with C#

At work, I sometimes write C# code that downloads HTML content from the network and parses it. So how do you parse HTML with C#?

HTML parsing is a vital part of web scraping, because it turns web page content into meaningful, structured data. HTML is a tree-structured format, so it needs a proper parser; it can't be reliably traversed with regular expressions.
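
To see why a regular expression struggles with nested markup, here is a small sketch. The input string is a made-up example and the pattern is just one plausible naive attempt, not a recommended technique.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical input: one <div> nested inside another.
string nestedHtml = "<div id=\"outer\"><div id=\"inner\">text</div> tail</div>";

// Naive attempt: capture everything between an opening and a closing <div> tag.
Match match = Regex.Match(nestedHtml, "<div[^>]*>(.*?)</div>");

// The lazy quantifier stops at the first </div>, which closes the inner
// element, so the captured "content" of the outer <div> is cut short.
string captured = match.Groups[1].Value;
Console.WriteLine(captured); // prints: <div id="inner">text
```

Switching to a greedy `.*` happens to fix this input, but it then swallows sibling elements such as `<div>a</div><div>b</div>`, which is why a real parser is the safer tool.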

If you are writing a crawler, keep in mind that some websites restrict automated programs from fetching their content. A common workaround is to route requests through a network proxy, either free or paid.
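
As a rough sketch of what routing requests through a proxy can look like with the standard HttpClient, see below. The proxy address is a placeholder I made up, and the actual download is left as a comment so the snippet does not touch the network.

```csharp
using System;
using System.Net;
using System.Net.Http;

// Hypothetical proxy address; replace it with a proxy you are allowed to use.
var proxy = new WebProxy("http://proxy.example.com:8080");

// Every request sent through this handler goes via the proxy.
var handler = new HttpClientHandler
{
    Proxy = proxy,
    UseProxy = true
};

using var client = new HttpClient(handler);

// Downloading a page would then look like this (not executed here):
// string html = await client.GetStringAsync("https://example.com/");
Console.WriteLine(proxy.Address); // the proxy URI the handler will use
```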

When parsing HTML in C#, we can bring in a library to simplify the work. Some common choices are:

  1. HtmlAgilityPack
  2. AngleSharp
  3. CsQuery
  4. Fizzler

HtmlAgilityPack

HtmlAgilityPack is one of the most (if not the most) famous HTML parsing libraries in the .NET world. As a result, many articles have been written about it. In short, it is a fast, relatively handy library for working with HTML (assuming XPath queries are simple).

MIT License.

This parsing library will be convenient if the task is typical and well described by an XPath expression. For example, to get all the links from a page, we need very little code:

public IEnumerable<string> HtmlAgilityPackParse()
{
    HtmlDocument htmlSnippet = new HtmlDocument();
    htmlSnippet.LoadHtml(Html);

    List<string> hrefTags = new List<string>();

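    // SelectNodes returns null (not an empty collection) when nothing matches
    HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");
    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            HtmlAttribute att = link.Attributes["href"];
            hrefTags.Add(att.Value);
        }
    }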

    return hrefTags;
}
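
If you prefer LINQ to XPath, HtmlAgilityPack also exposes Descendants, which yields an empty sequence rather than null when nothing matches (by default SelectNodes returns null in that case). The following is a minimal sketch; the sample markup and variable names are my own assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical sample markup; the third anchor has no href attribute.
string sampleHtml =
    "<p><a href=\"/first\">one</a> <a href=\"/second\">two</a> <a>none</a></p>";

var doc = new HtmlDocument();
doc.LoadHtml(sampleHtml);

// Walk all <a> elements and keep only those that carry an href attribute.
List<string> hrefs = doc.DocumentNode
    .Descendants("a")
    .Where(a => a.Attributes.Contains("href"))
    .Select(a => a.Attributes["href"].Value)
    .ToList();

Console.WriteLine(string.Join(", ", hrefs)); // prints: /first, /second
```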

Still, selecting by CSS class is inconvenient with this library and requires more complex expressions:

public IEnumerable<string> HtmlAgilityPackParse()
{
    HtmlDocument hap = new HtmlDocument();
    hap.LoadHtml(Html);
    HtmlNodeCollection nodes = hap
        .DocumentNode
        .SelectNodes("//h3[contains(concat(' ', @class, ' '), ' r ')]/a");
    
    List<string> hrefTags = new List<string>();

    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            hrefTags.Add(node.GetAttributeValue("href", null));
        }
    }

    return hrefTags;
}

As for oddities: the API is idiosyncratic and at times confusing. However, the library is no longer abandoned, which is encouraging and makes it a real alternative to AngleSharp.

AngleSharp

AngleSharp is written from scratch using C#. The API is based on the official JavaScript HTML DOM specification. There are quirks in some places that are unusual for .NET developers (e.g., accessing an invalid index in a collection will return null instead of throwing an exception; there is a separate URL class; namespaces are very granular), but generally nothing critical.

MIT License.

The library code is clean, neat, and user-friendly. For example, extracting links from a page is almost no different from the JavaScript and Python alternatives:

public IEnumerable<string> AngleSharpParse()
{
    List<string> hrefTags = new List<string>();

    var parser = new HtmlParser();
    // AngleSharp 0.10 and later use ParseDocument; on 0.9.x call Parse instead
    var document = parser.ParseDocument(Html);
    foreach (IElement element in document.QuerySelectorAll("a"))
    {
        hrefTags.Add(element.GetAttribute("href"));
    }

    return hrefTags;
}
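
Selecting by CSS class is where AngleSharp is noticeably shorter than the XPath version shown for HtmlAgilityPack. Here is a minimal sketch of the same "links inside h3 with class r" query. The sample markup is made up, and the code assumes AngleSharp 0.10 or later, where HtmlParser lives in AngleSharp.Html.Parser and exposes ParseDocument.

```csharp
using System;
using System.Linq;
using AngleSharp.Html.Parser;

// Hypothetical sample markup: only the first heading carries the "r" class.
string sampleHtml =
    "<h3 class=\"r\"><a href=\"/match\">hit</a></h3>" +
    "<h3><a href=\"/skip\">miss</a></h3>";

var parser = new HtmlParser();
var document = parser.ParseDocument(sampleHtml);

// The CSS selector replaces //h3[contains(concat(' ', @class, ' '), ' r ')]/a
var hrefs = document
    .QuerySelectorAll("h3.r > a")
    .Select(a => a.GetAttribute("href"))
    .ToList();

Console.WriteLine(string.Join(", ", hrefs)); // prints: /match
```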

CsQuery

CsQuery is a jQuery port for .NET. It implements all CSS2 & CSS3 selectors, all the DOM manipulation methods of jQuery, and some of the utility methods.

It was one of the more modern HTML parsers for .NET. The library is based on the validator.nu parser for Java, the same parser from which the HTML5 parser in the Gecko (Firefox) engine is derived.

MIT License.

Unfortunately, the author has abandoned the project. The recommended alternative is AngleSharp. The code for getting links from a page looks nice and familiar to anyone who has used jQuery:

public IEnumerable<string> CsQueryParse()
{
    List<string> hrefTags = new List<string>();

    CQ cq = CQ.Create(Html);
    foreach (IDomObject obj in cq.Find("a"))
    {
        hrefTags.Add(obj.GetAttribute("href"));
    }

    return hrefTags;
}

Fizzler

Fizzler is an add-on built on top of HtmlAgilityPack that lets you use CSS selectors.

GNU LGPL license.

Let’s see what problem Fizzler solves, using the sample from the documentation:

// Requires: using HtmlAgilityPack; using Fizzler.Systems.HtmlAgilityPack;
// Load the document using HTMLAgilityPack as normal
var html = new HtmlDocument();
html.LoadHtml(@"
  <html>
      <head></head>
      <body>
        <div>
          <p class='content'>Fizzler</p>
          <p>CSS Selector Engine</p></div>
      </body>
  </html>");

// Fizzler for HtmlAgilityPack is implemented as the
// QuerySelectorAll extension method on HtmlNode

var document = html.DocumentNode;

// yields: [<p class="content">Fizzler</p>]
document.QuerySelectorAll(".content");

// yields: [<p class="content">Fizzler</p>,<p>CSS Selector Engine</p>]
document.QuerySelectorAll("p");

// yields empty sequence
document.QuerySelectorAll("body>p");

// yields [<p class="content">Fizzler</p>,<p>CSS Selector Engine</p>]
document.QuerySelectorAll("body p");

// yields [<p class="content">Fizzler</p>]
document.QuerySelectorAll("p:first-child");

Fizzler runs at almost the same speed as HtmlAgilityPack, but is more convenient thanks to the CSS selectors.
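
For comparison with the other libraries, link extraction by CSS class with Fizzler might look like the sketch below. The sample markup is hypothetical; QuerySelectorAll is the Fizzler extension method on HtmlNode from the documentation sample above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

// Hypothetical sample markup: only the first heading carries the "r" class.
string sampleHtml =
    "<h3 class=\"r\"><a href=\"/match\">hit</a></h3>" +
    "<h3><a href=\"/skip\">miss</a></h3>";

var doc = new HtmlDocument();
doc.LoadHtml(sampleHtml);

// The [href] part of the selector guarantees the attribute exists,
// so reading Attributes["href"].Value is safe here.
List<string> hrefs = doc.DocumentNode
    .QuerySelectorAll("h3.r > a[href]")
    .Select(a => a.Attributes["href"].Value)
    .ToList();

Console.WriteLine(string.Join(", ", hrefs)); // prints: /match
```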

Conclusion

You have probably drawn your own conclusions by now. Still, I’d add that the best choice for now is AngleSharp, because it’s under active development, has an intuitive API, and shows good processing times.

Does it make sense to switch to AngleSharp from HtmlAgilityPack? Probably not – you can use Fizzler and enjoy a speedy and convenient library.

The benchmark code can be found here.

https://github.com/kami4ka/dot-net-html-parsers-benchmark
