Parse HTML with .NET and the HtmlAgilityPack

Parsing HTML with .NET can be a useful skill for any developer working on web applications or data analysis projects. One tool that can be particularly helpful in these scenarios is HtmlAgilityPack, an open-source HTML parser library for .NET. In this blog post, we'll go over how to use HtmlAgilityPack to parse HTML and extract data from it in a .NET application.

What is HtmlAgilityPack?

HtmlAgilityPack is a .NET library that provides a simple, flexible, and efficient way to parse and traverse HTML documents. It can be used to extract data from HTML documents, modify the HTML structure, and even clean up poorly formatted HTML.

HtmlAgilityPack is available as a NuGet package, which makes it easy to add to your .NET projects. Search for "HtmlAgilityPack" in the NuGet Package Manager, or run one of the following commands: the first in the Package Manager Console, the second from the .NET CLI:

Install-Package HtmlAgilityPack

# or

dotnet add package HtmlAgilityPack

Parsing HTML with HtmlAgilityPack

To parse an HTML document with HtmlAgilityPack, you first need to create an HtmlDocument object and load the HTML content into it.

using HtmlAgilityPack;

// Parse an in-memory HTML string into a document tree.
string html = "<html><body><h1>Hello, World!</h1></body></html>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);
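LoadHtml expects markup that is already in memory. The library also offers a Load method for files and an HtmlWeb class for fetching a page by URL. The following is a minimal sketch of those options, not a definitive recipe: the file name page.html and the URL https://example.com/ are placeholders chosen for illustration, and the network call is left commented out so the example runs offline.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

class LoadExamples
{
    static void Main()
    {
        // 1. From a string that is already in memory.
        var fromString = new HtmlDocument();
        fromString.LoadHtml("<html><body><h1>Hello, World!</h1></body></html>");

        // 2. From a file on disk. The path is hypothetical; the file is
        //    written here only so that the example is self-contained.
        string path = Path.Combine(Path.GetTempPath(), "page.html");
        File.WriteAllText(path, "<html><body><h1>From a file</h1></body></html>");
        var fromFile = new HtmlDocument();
        fromFile.Load(path);

        // 3. From a URL (requires network access, so it is commented out).
        // var web = new HtmlWeb();
        // HtmlDocument fromWeb = web.Load("https://example.com/");

        Console.WriteLine(fromString.DocumentNode.SelectSingleNode("//h1").InnerText);
        Console.WriteLine(fromFile.DocumentNode.SelectSingleNode("//h1").InnerText);
    }
}
```

Whichever way the markup is loaded, the result is the same kind of HtmlDocument object, so the rest of this post applies unchanged.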

Once you have loaded the HTML document into an HtmlDocument object, you can start traversing and extracting data from it.

Traversing HTML with HtmlAgilityPack

HtmlAgilityPack provides a number of methods and properties for traversing the HTML document tree. The most important one is the DocumentNode property, which returns the root node of the HTML document. You can then use the ChildNodes property to access the child nodes of the root node, and the NextSibling and PreviousSibling properties to move between siblings.
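As a rough illustration of the properties just described, the sketch below walks the tree recursively through ChildNodes and then steps through siblings with NextSibling. The sample markup and the PrintTree helper are invented for this example and are not part of the library.

```csharp
using System;
using HtmlAgilityPack;

class TraversalExample
{
    // Depth-first walk that prints each element's name, indented by depth.
    // Text and comment nodes are visited but not printed.
    static void PrintTree(HtmlNode node, int depth)
    {
        if (node.NodeType == HtmlNodeType.Element)
        {
            Console.WriteLine(new string(' ', depth * 2) + node.Name);
        }

        foreach (HtmlNode child in node.ChildNodes)
        {
            PrintTree(child, depth + 1);
        }
    }

    static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml(
            "<html><body><div><h1>Title</h1><span>First</span><span>Second</span></div></body></html>");

        // DocumentNode is the root; it is not itself an element.
        PrintTree(document.DocumentNode, 0);

        // Move sideways from the <h1> to the nodes that follow it.
        HtmlNode h1 = document.DocumentNode.SelectSingleNode("//h1");
        for (HtmlNode sibling = h1.NextSibling; sibling != null; sibling = sibling.NextSibling)
        {
            Console.WriteLine("follows h1: " + sibling.Name);
        }
    }
}
```

Note that whitespace between tags is kept as text nodes, so on real pages NextSibling often lands on a "#text" node before it reaches the next element.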

Walking the tree by hand is rarely necessary, though. The SelectNodes method accepts an XPath expression and returns every node that matches it. Here's an example that uses it to print the text of all the <a> elements:

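// SelectNodes returns null (not an empty collection) when nothing matches.
HtmlNodeCollection anchors = document.DocumentNode.SelectNodes("//a");
if (anchors != null)
{
    foreach (HtmlNode a in anchors)
        Console.WriteLine(a.InnerText);
}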

HtmlAgilityPack also provides a SelectSingleNode method, which returns the first node matching a given XPath expression, or null if no node matches. This can be useful if you only need to extract a single element from the HTML document.

// SelectSingleNode returns null when no node matches the XPath expression.
HtmlNode a = document.DocumentNode.SelectSingleNode("//a");
if (a != null)
    Console.WriteLine(a.InnerText);

Extracting data from HTML with HtmlAgilityPack

HtmlAgilityPack provides a number of methods and properties for extracting data from HTML documents. The most important ones are:

  • InnerText: returns the text content of a node
  • InnerHtml: returns the HTML content of a node
  • Attributes: returns the collection of a node's attributes (an HtmlAttributeCollection), which can be indexed by attribute name
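  • Name: returns the name of a node

For example:

HtmlNode a = document.DocumentNode.SelectSingleNode("//a");
if (a != null)
{
    string text = a.InnerText;       // text content of the element
    string innerHtml = a.InnerHtml;  // markup between the opening and closing tags
    string name = a.Name;            // element name, here "a"
}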
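The Attributes collection deserves a closer look, since it is where values such as href and class live. The sketch below, built on made-up markup, enumerates a node's attributes, reads an optional attribute with GetAttributeValue (which takes a fallback value), and decodes character entities with HtmlEntity.DeEntitize. It assumes, as is typically the case, that InnerText leaves entities such as &amp; encoded.

```csharp
using System;
using HtmlAgilityPack;

class AttributeExample
{
    static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml("<a href=\"/docs\" class=\"nav\">Docs &amp; guides</a>");

        HtmlNode a = document.DocumentNode.SelectSingleNode("//a");

        // Enumerate every attribute on the node.
        foreach (HtmlAttribute attribute in a.Attributes)
        {
            Console.WriteLine(attribute.Name + " = " + attribute.Value);
        }

        // Indexing Attributes["title"] would yield null here, because the
        // attribute is absent. GetAttributeValue returns the fallback instead.
        string title = a.GetAttributeValue("title", "(none)");
        Console.WriteLine(title);

        // Decode character entities in the text content.
        Console.WriteLine(HtmlEntity.DeEntitize(a.InnerText));
    }
}
```

Preferring GetAttributeValue over the indexer avoids a NullReferenceException when an attribute might be missing.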

SelectNodes combines naturally with attribute access. For example, to extract all the links from an HTML document, select every <a> element that has an href attribute and read its value:

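// SelectNodes returns null when no <a> element has an href attribute.
HtmlNodeCollection links = document.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        string href = link.Attributes["href"].Value;
        Console.WriteLine(href);
    }
}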
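Putting the pieces together, here is one possible shape for a small, complete link extractor. It is a sketch under stated assumptions rather than a definitive implementation: the ExtractLinks helper, the sample markup, and the base address https://example.com/ are all invented for illustration, and relative links are resolved with the standard Uri.TryCreate method rather than with anything from HtmlAgilityPack.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class LinkExtractor
{
    // Returns (text, absolute URL) pairs for every <a> element with an href.
    // ExtractLinks is a hypothetical helper, not a library method.
    static List<(string Text, Uri Url)> ExtractLinks(string html, Uri baseUri)
    {
        var results = new List<(string Text, Uri Url)>();

        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNodeCollection links = document.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return results; // nothing matched
        }

        foreach (HtmlNode link in links)
        {
            string href = link.GetAttributeValue("href", "");

            // Resolve relative links against the base address and
            // skip values that cannot be parsed as a URI.
            if (Uri.TryCreate(baseUri, href, out Uri absolute))
            {
                results.Add((link.InnerText.Trim(), absolute));
            }
        }

        return results;
    }

    static void Main()
    {
        string html =
            "<ul><li><a href=\"/about\">About</a></li>" +
            "<li><a href=\"https://example.org/\">External</a></li>" +
            "<li><a>No href</a></li></ul>";

        foreach (var (text, url) in ExtractLinks(html, new Uri("https://example.com/")))
        {
            Console.WriteLine(text + " -> " + url);
        }
    }
}
```

With this sample markup, the third anchor is ignored because it has no href attribute, so only the first two links should be printed.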

Conclusion

In this blog post, we've covered how to use HtmlAgilityPack to extract data from HTML in a .NET application. HtmlAgilityPack is a powerful and versatile library that can be used in a variety of scenarios where you need to work with HTML data.