Parse HTML with .NET and the HtmlAgilityPack

Parsing HTML with .NET can be a useful skill for any developer working on web applications or data analysis projects. One tool that can be particularly helpful in these scenarios is HtmlAgilityPack, an open-source HTML parser library for .NET. In this blog post, we'll go over how to use HtmlAgilityPack to parse HTML and extract data from it in a .NET application.

What is HtmlAgilityPack?

HtmlAgilityPack is a .NET library that provides a simple, flexible, and efficient way to parse and traverse HTML documents. It can be used to extract data from HTML documents, modify the HTML structure, and even clean up poorly formatted HTML.

HtmlAgilityPack is available as a NuGet package, which makes it easy to install and use in your .NET projects. Simply search for "HtmlAgilityPack" in the NuGet Package Manager, or install it from the command line. The first command below is for the Visual Studio Package Manager Console, the second for the .NET CLI:

Install-Package HtmlAgilityPack

# or

dotnet add package HtmlAgilityPack

Parsing HTML with HtmlAgilityPack

To parse an HTML document with HtmlAgilityPack, you first need to create an HtmlDocument object and load the HTML content into it.

using HtmlAgilityPack;

string html = "<html><body><h1>Hello, World!</h1></body></html>";
HtmlDocument document = new HtmlDocument();
document.LoadHtml(html);

Once you have loaded the HTML document into an HtmlDocument object, you can start traversing and extracting data from it.
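LoadHtml is not the only entry point. As a minimal sketch, assuming the HTML arrives through a reader rather than as a string, the Load method also accepts a TextReader. The StringReader and the variable names below are illustrative stand-ins, not part of the original example:

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

// The StringReader stands in for any TextReader, such as a StreamReader over a file.
using var reader = new StringReader("<html><body><h1>Hello, World!</h1></body></html>");

var streamedDocument = new HtmlDocument();
streamedDocument.Load(reader);

// DocumentNode is the root of the parsed tree, just as it is after LoadHtml.
string heading = streamedDocument.DocumentNode.SelectSingleNode("//h1").InnerText;
Console.WriteLine(heading);
```

Whichever way the content is loaded, the resulting HtmlDocument is traversed in the same way.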

Traversing HTML with HtmlAgilityPack

HtmlAgilityPack provides a number of methods and properties for traversing the HTML document tree. The most important one is the DocumentNode property, which returns the root node of the HTML document. You can then use the ChildNodes property to access the child nodes of the root node, and the NextSibling and PreviousSibling properties to move between siblings.

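Here's an example that uses the SelectNodes method with the XPath expression //a to print the text of all the <a> elements. SelectNodes returns null rather than an empty collection when nothing matches, so check the result before iterating:

HtmlNodeCollection anchors = document.DocumentNode.SelectNodes("//a");
if (anchors != null)
{
    foreach (HtmlNode a in anchors)
    {
        Console.WriteLine(a.InnerText);
    }
}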
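The ChildNodes, FirstChild and NextSibling properties described in this section also work without any XPath. The following is a small sketch under the assumption of a hand-written HTML fragment; the fragment and the variable names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<div><h1>Title</h1><a href=\"/docs\">Docs</a><span>End</span></div>");

// The first child of the document root is the <div> in this fragment.
HtmlNode container = doc.DocumentNode.FirstChild;

var visited = new List<string>();

// Follow NextSibling from the first child until it returns null.
for (HtmlNode node = container.FirstChild; node != null; node = node.NextSibling)
{
    // Text and comment nodes are siblings too, so keep element nodes only.
    if (node.NodeType == HtmlNodeType.Element)
    {
        visited.Add($"{node.Name}: {node.InnerText}");
    }
}

// ChildNodes exposes the same children as a collection.
Console.WriteLine($"{container.ChildNodes.Count} child nodes, {visited.Count} elements");
foreach (string entry in visited)
{
    Console.WriteLine(entry);
}
```

Filtering on NodeType matters for real pages, where whitespace between tags shows up as text nodes in the sibling chain.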

HtmlAgilityPack also provides a SelectSingleNode method that selects the first node matching a given XPath expression, or returns null if nothing matches. This can be useful if you only need to extract a single element from the HTML document.

HtmlNode a = document.DocumentNode.SelectSingleNode("//a");
// SelectSingleNode returns null when no node matches, hence the ?. operator.
Console.WriteLine(a?.InnerText ?? "(no <a> element found)");
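One detail worth knowing when calling SelectNodes or SelectSingleNode on a node other than DocumentNode: following standard XPath rules, an expression that starts with // searches the whole document, while .// restricts the search to the current node's descendants. A hedged sketch with a made-up two-section page:

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<div id=\"nav\"><a href=\"/home\">Home</a></div>" +
    "<div id=\"main\"><a href=\"/post\">Post</a></div>");

HtmlNode main = doc.DocumentNode.SelectSingleNode("//div[@id='main']");

// "//a" starts at the document root, even though it is called on the main node.
HtmlNodeCollection everywhere = main.SelectNodes("//a");

// ".//a" only looks at descendants of the main node.
HtmlNodeCollection insideMain = main.SelectNodes(".//a");

Console.WriteLine($"//a found {everywhere.Count} links, .//a found {insideMain.Count}");
```

When scraping one section of a page, the leading dot is usually what you want.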

Extracting data from HTML with HtmlAgilityPack

HtmlAgilityPack exposes several properties on HtmlNode for extracting data from HTML documents. The most important ones are:

InnerText: returns the text content of a node
InnerHtml: returns the HTML content of a node
Attributes: returns the collection of a node's attributes (an HtmlAttributeCollection), which can be indexed by attribute name
Name: returns the name of a node

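HtmlNode a = document.DocumentNode.SelectSingleNode("//a");
if (a != null)
{
    string text = a.InnerText;       // text content, without tags
    string innerHtml = a.InnerHtml;  // markup inside the element
    string name = a.Name;            // "a"
    string href = a.GetAttributeValue("href", ""); // "" if the attribute is missing
}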
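To make the properties listed above more concrete, here is a small sketch that reads attributes and decodes HTML entities. The sample markup and variable names are invented for illustration. To my understanding, InnerText returns entities such as &amp; as they appear in the source, which is why the sketch passes the text through HtmlEntity.DeEntitize:

```csharp
using System;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml("<a href=\"/menu\" title=\"Find\">Fish &amp; Chips</a>");

HtmlNode link = doc.DocumentNode.SelectSingleNode("//a");

// GetAttributeValue returns the supplied default instead of throwing
// when the attribute does not exist.
string title = link.GetAttributeValue("title", "(no title)");
string rel = link.GetAttributeValue("rel", "(no rel)");

// DeEntitize turns entities such as &amp; into the characters they represent.
string decodedText = HtmlEntity.DeEntitize(link.InnerText);

// Attributes can also be enumerated one by one.
foreach (HtmlAttribute attribute in link.Attributes)
{
    Console.WriteLine($"{attribute.Name} = {attribute.Value}");
}

Console.WriteLine($"{title} / {rel} / {decodedText}");
```

Compared with indexing into Attributes directly, GetAttributeValue avoids a null reference when an attribute is absent.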

SelectNodes combines well with attribute access. For example, to extract all the links from an HTML document, select every <a> element that has an href attribute:

HtmlNodeCollection links = document.DocumentNode.SelectNodes("//a[@href]");
if (links != null) // null when the document contains no matching elements
{
    foreach (HtmlNode link in links)
    {
        // The XPath filter guarantees that href exists on every selected node.
        string href = link.Attributes["href"].Value;
        Console.WriteLine(href);
    }
}
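Extracted href values are often relative. As a sketch that goes one step beyond the example above, the links can be resolved against the address the page was loaded from. The base address and the sample markup here are hypothetical, and Uri.TryCreate comes from the .NET base class library rather than from HtmlAgilityPack:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.LoadHtml(
    "<div><a href=\"/docs\">Docs</a><a>No link</a>" +
    "<a href=\"https://example.org/\">External</a></div>");

// Hypothetical address of the page the HTML came from.
var baseUri = new Uri("https://example.com/start/");

var absoluteLinks = new List<Uri>();
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//a[@href]");
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        string href = node.GetAttributeValue("href", "");

        // TryCreate resolves relative references and leaves absolute ones untouched.
        if (Uri.TryCreate(baseUri, href, out var absolute))
        {
            absoluteLinks.Add(absolute);
        }
    }
}

foreach (Uri uri in absoluteLinks)
{
    Console.WriteLine(uri.AbsoluteUri);
}
```

The <a> element without an href is skipped by the XPath filter, so only two links end up in the list.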

Conclusion

In this blog post, we've covered how to use HtmlAgilityPack to extract data from HTML in a .NET application. HtmlAgilityPack is a powerful and versatile library that can be used in a variety of scenarios where you need to work with HTML data.

Author

Benjamin Abt

Ben is a passionate developer and software architect with a particular focus on .NET, cloud and IoT. In his professional life he works on highly scalable platforms for IoT and Industry 4.0, focused on the next generation of connected industry based on Azure and .NET. He runs the largest German-speaking C# forum myCSharp.de, is the founder of the Azure UserGroup Stuttgart, a co-organizer of the AzureSaturday, runs his blog, participates in open source projects, speaks at various conferences and user groups and also has a bit of free time. He has been a Microsoft MVP for .NET and Azure since 2015.