Parsing HTML Documents in .NET Using HTML Agility Pack (C#)

After downloading an HTML document from a URL, I’m guessing you will want to make sense of it: you may want the actual contents or the nested text without having to mess with the tags. Personally, when parsing HTML documents in .NET, I recommend the HTML Agility Pack.

The HTML Agility Pack library can be downloaded from here, or you can use NuGet and install it straight from Visual Studio. If you go with the former option, you have to add a reference to HtmlAgilityPack.dll in your project. With that done, let’s move on.

Here are two major reasons why I LOVE HTML Agility Pack:

It supports plain XPath. If you don’t know what XPath is, I’d suggest you quickly go through this tutorial on w3schools. It’s really simple and straightforward.
The library works extremely well with malformed HTML documents, and you will encounter a lot of those. Trust me 😉
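To make the second point concrete, here is a minimal sketch of loading deliberately broken markup. The input string and the class name MalformedHtmlDemo are made up for illustration; LoadHtml, SelectNodes and ParseErrors are members of the library.

```csharp
using System;
using HtmlAgilityPack;

class MalformedHtmlDemo
{
    static void Main()
    {
        // Hypothetical input: the <li> and <p> tags are never closed.
        string broken = "<ul><li>One<li>Two</ul><p>Unclosed paragraph";

        var doc = new HtmlDocument();
        doc.LoadHtml(broken); // does not throw on malformed markup

        // XPath still works against the tree the parser managed to build.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//li");
        Console.WriteLine(items == null ? 0 : items.Count);

        // ParseErrors lists whatever problems the parser recorded, if any.
        foreach (HtmlParseError error in doc.ParseErrors)
        {
            Console.WriteLine($"Line {error.Line}: {error.Reason}");
        }
    }
}
```

The parser keeps going and records what it found odd, so your code can decide whether those errors matter.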
To demonstrate the use of HTML Agility Pack, I will be using the HTML document below (don’t mind the contents ;)):



	
		
Testing

Ahmed A Opeyemi

This example demonstrates the use of HTML Agility

Designed by Ahmed
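If you want to follow along in code, a document with the shape that the queries later in this post expect could be kept in a string like the one below. Treat it as a sketch: the wrapper class, the CompanyName and firstFooter ids and the visible text come from this post, while the class name SampleHtml, the exact nesting and the two footer links are assumptions.

```csharp
using System;

static class SampleHtml
{
    // Assumed markup: only the tag names, class, ids and visible text used
    // by the queries in this post are given; the footer links are invented.
    public const string Contents = @"<html>
    <head>
        <title>Testing</title>
    </head>
    <body>
        <div class=""wrapper"">
            <header>
                <h1 id=""CompanyName"">Ahmed A Opeyemi</h1>
            </header>
            <p>This example demonstrates the use of HTML Agility</p>
            <footer id=""firstFooter"">
                <ul>
                    <li><a href=""https://example.com/home"">Home</a></li>
                    <li><a href=""https://example.com/about"">About</a></li>
                </ul>
                <p>Designed by Ahmed</p>
            </footer>
        </div>
    </body>
</html>";

    static void Main()
    {
        // Print the markup so you can check it before parsing it.
        Console.WriteLine(Contents);
    }
}
```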

After adding a reference to HtmlAgilityPack.dll in your project, import the HtmlAgilityPack namespace by adding

using HtmlAgilityPack;

to the top of your class file or code-behind file.

Remember, our HTML document has already been downloaded into a string; I only wrote it out above for the sake of this post. The next step is to load the string into an HtmlDocument by writing

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(contents);

The first line creates a new HtmlDocument instance, while the second loads the document from the string “contents”. In this library, we will mainly be using two classes:

HtmlNode (a single HTML Node)
HtmlNodeCollection (a combined list and collection of HTML Nodes)
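Here is a small sketch of where each of the two classes shows up once a string has been loaded. The markup and the class name LoadDemo are stand-ins for illustration, not the sample page from this post.

```csharp
using System;
using HtmlAgilityPack;

class LoadDemo
{
    static void Main()
    {
        // Stand-in for the downloaded page; any HTML string works here.
        string contents = "<div class='wrapper'><span>Hello</span><span>World</span></div>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(contents);

        // A single node: the first <div> in the document.
        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div");

        // A collection of nodes: the <span> elements directly inside that div.
        HtmlNodeCollection spans = div.SelectNodes("./span");

        Console.WriteLine(div.Name);     // prints "div"
        Console.WriteLine(spans.Count);  // prints 2
    }
}
```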
The HtmlNode class represents a single node in our document, for example a div node, an li node, or even the root node of the document. It has several properties and methods; together with the document’s DocumentNode property, we will be using the following in this tutorial:

DocumentNode (a property of HtmlDocument that gets the root node of the document)
SelectSingleNode (selects the first node that matches the specified XPath expression, or null if nothing matches)
SelectNodes (selects all the nodes that match the XPath expression; returns null when nothing matches)
Attributes (gets the collection of attributes of a particular node)
ChildNodes (gets all the children of the node)
Name (gets or sets the node’s name)
InnerHtml (gets or sets the HTML between the opening and closing tags of the node)
InnerText (gets the text between the opening and closing tags of the node)
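To see what these members return, here is a minimal sketch run against a tiny made-up fragment. The markup and the class name NodePropertiesDemo are assumptions for illustration; GetAttributeValue is an extra HtmlNode method that is not in the list above, and I find it handy for attributes that may be missing.

```csharp
using System;
using HtmlAgilityPack;

class NodePropertiesDemo
{
    static void Main()
    {
        // Hypothetical fragment used only to exercise the members listed above.
        string contents = "<div id='box' class='wrapper'><span>Hi</span> there</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(contents);

        HtmlNode div = doc.DocumentNode.SelectSingleNode("//div");

        Console.WriteLine(div.Name);                       // div
        Console.WriteLine(div.Attributes["class"].Value);  // wrapper
        Console.WriteLine(div.ChildNodes.Count);           // 2: the <span> and a text node
        Console.WriteLine(div.InnerHtml);                  // <span>Hi</span> there
        Console.WriteLine(div.InnerText);                  // Hi there

        // GetAttributeValue returns the fallback instead of throwing
        // when the attribute is missing.
        Console.WriteLine(div.GetAttributeValue("title", "(none)")); // (none)
    }
}
```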
Having said that, let’s get to work using the HTML document written above.

HtmlNode RootNode = null, FirstDivNode = null, HeaderNode = null, footerNode = null;
//declares the HtmlNode variables we need and sets them all to null

RootNode = doc.DocumentNode; //gets the root node of the document and assigns it to RootNode

//select the first div in the root node or document
FirstDivNode = RootNode.SelectSingleNode("//div");

//But if you want to specify class while selecting, you can do this
FirstDivNode = RootNode.SelectSingleNode("//div[@class='wrapper']");

//To select the header tag inside the FirstDivNode,
HeaderNode = FirstDivNode.SelectSingleNode(".//header");

/**Take note of the dot (.): it tells the parser to look inside the current node and not the whole document. You also might want to check that a node is not null before you dive into it, like this**/
if (FirstDivNode != null)
    HeaderNode = FirstDivNode.SelectSingleNode(".//header");

//To get the inner text of the h1 tag in the header, we write
string H1Text = HeaderNode.SelectSingleNode(".//h1[@id='CompanyName']").InnerText;

//What if you don't know which tag it is, but you are sure of an attribute, e.g. id, name, class?
//(H1Text is already declared above, so we just assign to it again)
H1Text = HeaderNode.SelectSingleNode(".//*[@id='CompanyName']").InnerText;

//Notice I changed the h1 to *; that tells the parser I'm not sure what kind of element it is,
//so just select any element that has the specified attribute

//Let's try selecting multiple nodes

//using the footer with id=firstFooter in the HTML document, we can select all the li tags in it

//using the SelectNodes method
footerNode = FirstDivNode.SelectSingleNode(".//footer[@id='firstFooter']"); //selects the footer node

HtmlNodeCollection LiNodes = null; //declares the node collection and sets it to null
if (footerNode != null)
    LiNodes = footerNode.SelectNodes(".//li"); //there you have all the li nodes selected

//you can use a foreach loop to access all elements of the collection
//(SelectNodes returns null when nothing matches, so check first), for example:
if (LiNodes != null)
{
    foreach (var li in LiNodes)
    {
        string LiInnerText = li.InnerText;
        //or you can fetch the value of the href attribute of the a node in each li
        //(the ?. and ?? guard against an li that has no a node)
        string HrefValue = li.SelectSingleNode(".//a")?.GetAttributeValue("href", "") ?? "";
    }
}
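Putting the pieces together, here is a self-contained sketch of the whole walk-through with the null checks in place. The markup is an assumed stand-in shaped like the sample page (the links and the third list item are invented), and the class name FooterLinksDemo is just for illustration.

```csharp
using System;
using HtmlAgilityPack;

class FooterLinksDemo
{
    static void Main()
    {
        // Assumed markup shaped like the sample page; the links are invented.
        string contents = @"
<div class=""wrapper"">
  <header><h1 id=""CompanyName"">Ahmed A Opeyemi</h1></header>
  <footer id=""firstFooter"">
    <ul>
      <li><a href=""/home"">Home</a></li>
      <li><a href=""/about"">About</a></li>
      <li>No link here</li>
    </ul>
  </footer>
</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(contents);

        HtmlNode wrapper = doc.DocumentNode.SelectSingleNode("//div[@class='wrapper']");
        if (wrapper == null)
        {
            Console.WriteLine("wrapper div not found");
            return;
        }

        // ?. keeps a missing node from turning into a NullReferenceException.
        string companyName = wrapper.SelectSingleNode(".//*[@id='CompanyName']")?.InnerText;
        Console.WriteLine(companyName ?? "(no company name)");

        // By default SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection liNodes = wrapper.SelectNodes(".//footer[@id='firstFooter']//li");
        if (liNodes == null) return;

        foreach (HtmlNode li in liNodes)
        {
            HtmlNode anchor = li.SelectSingleNode(".//a");
            string href = anchor?.GetAttributeValue("href", "") ?? "(no link)";
            Console.WriteLine($"{li.InnerText.Trim()} -> {href}");
        }
    }
}
```

I went with early returns and the ?. operator here because on real pages a missing node is far more common than you would expect, and one unchecked SelectSingleNode is enough to crash the whole scrape.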

I really hope this “short post” shows us how to effectively parse HTML documents in C#.
Big ups to the creator(s) of the HTML Agility Pack.
Have a wonderful day y’all. 😀