Extracting Element & Attribute Values in XHTML

Published on February 17, 2007

Introduction

Here's a scenario: you have an XHTML document and need to get the contents of its elements and attributes. Perhaps you are indexing web pages for a search application. Or perhaps you need to analyze a web page for search engine optimization (SEO) purposes. In either case, you will most likely want to deal only with the content of the web page and exclude any markup.

For the purposes of this article, we define “content” as the values of elements and attributes, in the order that they appear in the document. Preserving that order makes the content suitable for keyword prominence and proximity analysis, both of which are relevant to search and search engine optimization applications.

Take the following sample XML document:


<?xml version="1.0" encoding="utf-8" ?>
<!-- Sample XHTML document -->
<html>
   <head>
      <title>This is my title.</title>
   </head>

   <body>
      <p>This is a paragraph with a link to <a 
	  href="www.google.com" title="Google Search Engine">
	  Google</a>.</p>
      <p>Google is a great search engine.</p>
   </body>
</html>
						

The content of this document is: “This is my title. This is a paragraph with a link to www.google.com Google Search Engine Google. Google is a great search engine.”

NOTE: The “www.google.com” and “Google Search Engine” part of the content may appear out of place, but remember that we are preserving the order that they appear in the document.

Background on Document Object Model

So, how do we extract the element and attribute values of an XML document? One approach is to leverage the functionalities of the Document Object Model (DOM).

The DOM is a platform- and language-neutral interface for accessing and manipulating XML, HTML, and related document formats. It represents these documents as a tree data structure, so it is aware of hierarchical relationships (e.g. parent, child, sibling, etc.).

When the DOM represents an XML document, it considers everything in the document as a node: the document itself, the XML declaration, processing instructions, comments, elements, the text value of the element, attributes, etc.
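To make the "everything is a node" idea concrete, here is a minimal sketch that prints the type and name of every node in a small document. The module name NodeTypeDemo, the helper PrintNodes, and the sample markup are assumptions made for illustration only.

```vbnet
Imports System
Imports System.Xml

Module NodeTypeDemo
    ' Prints each node's type and name, indented by its depth in the tree.
    Sub PrintNodes(ByVal node As XmlNode, ByVal depth As Integer)
        Dim indent As New String(" "c, depth * 2)
        Console.WriteLine(indent & node.NodeType.ToString() & ": " & node.Name)

        ' Attributes are not child nodes, so they are listed separately.
        ' The Attributes property is Nothing for non-element nodes.
        If node.Attributes IsNot Nothing Then
            For Each attr As XmlAttribute In node.Attributes
                Console.WriteLine(indent & "  Attribute: " & attr.Name)
            Next
        End If

        For Each child As XmlNode In node.ChildNodes
            PrintNodes(child, depth + 1)
        Next
    End Sub

    Sub Main()
        Dim doc As New XmlDocument()
        doc.LoadXml("<?xml version='1.0'?><!-- note --><p title='t'>Hello</p>")
        PrintNodes(doc, 0)
    End Sub
End Module
```

The output should show the document, the XML declaration, the comment, the element, its attribute, and the text each reported as a separate node.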

Implementation

Since the DOM is a tree data structure, it can be traversed. Traversing a tree means visiting each node exactly one time. This is also called “walking the tree.”

The following is a recursive function for traversing the DOM tree. It returns the content of elements and attributes as a string:

1  Function GetContents(ByVal xmlNode As Xml.XmlNode) As String
2
3     Dim Contents As String = ""
4
5     Select Case xmlNode.NodeType
6        Case Xml.XmlNodeType.Element
7           ' Include attribute values
8           If xmlNode.Attributes.Count > 0 Then
9              For Each AttributeNode As Xml.XmlAttribute In xmlNode.Attributes
10                Contents = Contents & AttributeNode.Value & " "
11             Next
12          End If
13       Case Xml.XmlNodeType.Text
14          Contents = xmlNode.Value & " "
15       Case Else
16          ' Document, XmlDeclaration, ProcessingInstruction, etc.
17          ' Do nothing
18    End Select
19
20    ' Call routine recursively for each child node until child node is Nothing
21    Dim xmlChild As Xml.XmlNode = xmlNode.FirstChild
22    Do Until xmlChild Is Nothing
23       ' Recursive call
24       Contents = Contents & GetContents(xmlChild)
25       ' Move to the next sibling
26       xmlChild = xmlChild.NextSibling
27    Loop
28
29    Return Contents
30
31 End Function
						

Using the Code

Here's a demonstration of the GetContents function using the sample XML document:


Dim XmlDocument As New Xml.XmlDocument
' The markup is concatenated from single-line literals so that it compiles
' in any VB.NET version; attribute values use single quotes.
XmlDocument.LoadXml( _
   "<?xml version='1.0' encoding='utf-8' ?>" & _
   "<!-- Sample XHTML document -->" & _
   "<html>" & _
   "<head>" & _
   "<title>This is my title.</title>" & _
   "</head>" & _
   "<body>" & _
   "<p>This is a paragraph with a link to " & _
   "<a href='www.google.com' title='Google Search Engine'>Google</a>.</p>" & _
   "<p>Google is a great search engine.</p>" & _
   "</body>" & _
   "</html>")
' Apart from extra spaces, GetContents returns:
' "This is my title. This is a paragraph with a link to
' www.google.com Google Search Engine Google.
' Google is a great search engine."
Dim MyContents As String = GetContents(XmlDocument)
						

Explanation

During the first call, the node type of the argument xmlNode is Xml.XmlNodeType.Document. This matches the Case Else (line 15), which essentially ignores the Document node type.

The function then checks the root node for any child nodes and recursively calls itself until there are no more child nodes (lines 21-22). Each call passes the child node as the new root node (line 24). This enables our function to visit all the descendants down a tree path.

At each recursive call, if the current node is an element (line 6) that has attributes (line 8), the value of each attribute is appended to the string variable Contents (line 10).

Notice that there is no attempt to add the “value” of the element after adding the values of the attributes to the variable Contents (between lines 12 and 13). This is because the Value property of an XmlElement is always Nothing.

The “value” of the element is actually taken from the Value property of its child text nodes. In the simplest case, the element's first child is a text node, which is visited immediately after the element node.
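The difference between an element and its text node can be checked with a short sketch. The module name ElementValueDemo and the one-element document are assumptions chosen for illustration.

```vbnet
Imports System
Imports System.Xml

Module ElementValueDemo
    Sub Main()
        Dim doc As New XmlDocument()
        doc.LoadXml("<title>This is my title.</title>")

        Dim titleElement As XmlElement = doc.DocumentElement

        ' An element node has no value of its own, so this prints True.
        Console.WriteLine(titleElement.Value Is Nothing)

        ' The text is held by the child node, whose type is Text.
        Console.WriteLine(titleElement.FirstChild.NodeType.ToString())
        Console.WriteLine(titleElement.FirstChild.Value)
    End Sub
End Module
```

This is why GetContents collects text in the Text case rather than in the Element case.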

When the function reaches a text node, xmlChild is Nothing (line 21) because text nodes have no children. In a tree data structure, such a node is known as a leaf node.

When there are no more child nodes to visit, the recursive call is completed (line 24) and the node's next sibling is visited (line 26). The recursion is then applied to that node and its descendants. After all the siblings and their descendants are visited, the function visits the parent's siblings and so on.

Eventually, all paths of the tree are visited and the Contents variable contains the concatenated values of all elements and attributes.

Preparation for Analysis

The results of the GetContents function may not be ready for immediate analysis. The following discusses some steps you can take to prepare the results for analysis.

Removing Stop Words

When search engines analyze web pages, they usually remove words that carry little meaning on their own. These are called “stop words.” For the purposes of the code below, we define our stop words as “the”, “a”, and “an.”

' Remove "the", "a", and "an"
MyContents = System.Text.RegularExpressions.Regex.Replace(MyContents, _
   "(\bthe\b)|(\ba\b)|(\ban\b)", "", _
   System.Text.RegularExpressions.RegexOptions.IgnoreCase)
						

We surrounded the words with the word boundary character sequence in regular expressions: \b. This prevents matching “the” in, say, “theory” when attempting to match “the” only. The IgnoreCase option is also applied for case-insensitive matches.
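A real stop-word list is usually longer than three words. The following sketch, which is only one possible approach, builds the same kind of word-boundary pattern from an array. The function name RemoveStopWords and the sample word list are assumptions, not part of the article's code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StopWordDemo
    ' Builds one alternation pattern, e.g. \b(?:the|a|an)\b, from a word list
    ' and removes every case-insensitive match from the source string.
    Function RemoveStopWords(ByVal source As String, ByVal stopWords As String()) As String
        Dim escaped(stopWords.Length - 1) As String
        For i As Integer = 0 To stopWords.Length - 1
            ' Escape each word in case it contains regex metacharacters.
            escaped(i) = Regex.Escape(stopWords(i))
        Next

        Dim pattern As String = "\b(?:" & String.Join("|", escaped) & ")\b"
        Return Regex.Replace(source, pattern, "", RegexOptions.IgnoreCase)
    End Function

    Sub Main()
        Dim words As String() = {"the", "a", "an", "is", "to"}
        Console.WriteLine(RemoveStopWords("This is a link to the theory page", words))
    End Sub
End Module
```

Removing words this way leaves the surrounding spaces in place, so the result still needs the whitespace cleanup described later.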

Removing Punctuation Marks

Punctuation marks may also appear in our results and may not be needed for analysis. In our example, there was a period after the words “title” and “engine.” Additionally, there was a period surrounded by spaces (between the two consecutive occurrences of “Google”).

For the purposes of the code below, we treat the period as the only punctuation mark.

' Remove period
MyContents = System.Text.RegularExpressions.Regex.Replace(MyContents, "\.", "")
						

The period in the regular expression is escaped by the backslash. This is because the period (or “dot” as commonly referred to in Regular Expressions) is a metacharacter.
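If more than the period needs to go, one option is the Unicode punctuation category \p{P}, which .NET regular expressions support. The sketch below is a variation on the article's approach, not the same code: it replaces each mark with a space instead of deleting it, so that neighbouring words do not merge. The function name RemovePunctuation is an assumption.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PunctuationDemo
    ' Replaces every Unicode punctuation character (\p{P}) with a space.
    Function RemovePunctuation(ByVal source As String) As String
        Return Regex.Replace(source, "\p{P}", " ")
    End Function

    Sub Main()
        Console.WriteLine(RemovePunctuation("Hello, world! Is this (really) it?"))
    End Sub
End Module
```

Note that this also splits values such as www.google.com into separate words, which may or may not suit the analysis.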

Removing Extra Space and Whitespace

We purposely added a space at the end of each concatenation to prevent values from merging into one string (lines 10 and 14). This can result in a return value containing extraneous spaces. At the very least, the result will contain a trailing space. Additionally, there may be other whitespace characters (e.g. carriage returns, tabs, etc.) in the result if the input contained them.

The following code shows how to remove extra spaces and other whitespace characters:

' Replace each whitespace character with a single space
MyContents = System.Text.RegularExpressions.Regex.Replace( _
   MyContents, "\s", " ")
' Replace two or more spaces with a single space
MyContents = System.Text.RegularExpressions.Regex.Replace( _
   MyContents, " {2,}", " ")
' Remove leading and trailing space
MyContents = System.Text.RegularExpressions.Regex.Replace( _
   MyContents, "(^ )|( $)", "")
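The three steps can also be collapsed into one pass. This is a minimal sketch of an equivalent cleanup, assuming a helper named NormalizeWhitespace (a name introduced here for illustration): the pattern \s+ matches any run of whitespace, and String.Trim removes what is left at either end.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module WhitespaceDemo
    Function NormalizeWhitespace(ByVal source As String) As String
        ' \s+ matches a run of one or more whitespace characters.
        Return Regex.Replace(source, "\s+", " ").Trim()
    End Function

    Sub Main()
        Dim raw As String = "  This is" & vbTab & "my   title. " & vbCrLf
        ' Prints [This is my title.]
        Console.WriteLine("[" & NormalizeWhitespace(raw) & "]")
    End Sub
End Module
```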
						

Handling Malformed Documents

The implementation discussed here only works with documents that are well-formed. If you have to handle malformed documents, as many web documents are, you will need to convert them first. One option is HTML Tidy, a project hosted on SourceForge, which converts malformed (X)HTML documents into well-formed ones.

Alternative Approaches

Extensible Stylesheet Language Transformation (XSLT)

An obvious choice for extracting the contents of an XHTML document is to use XSLT. The following stylesheet extracts all element and attribute values while preserving their order:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   xmlns:fo="http://www.w3.org/1999/XSL/Format">

   <xsl:output method="text"/>

   <xsl:template match="/">
      <xsl:apply-templates/>
   </xsl:template>

   <xsl:template match="*">
      <xsl:text> </xsl:text>
      <xsl:for-each select="@*">
         <xsl:value-of select="."/>
         <xsl:text> </xsl:text>
      </xsl:for-each>
      <xsl:apply-templates/>
   </xsl:template>
</xsl:stylesheet>
						

The disadvantage of this approach is that you do not have full programmatic capabilities, such as the capability to throw exceptions or raise events during the contents extraction process.
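For readers who want to try the stylesheet from code, here is a minimal sketch that applies a stylesheet to an XML string with XslCompiledTransform. The helper name TransformToText, the condensed stylesheet string, and the sample markup are assumptions for illustration; in practice the stylesheet would more likely be loaded from a file.

```vbnet
Imports System
Imports System.IO
Imports System.Xml
Imports System.Xml.Xsl

Module XsltDemo
    ' Applies the stylesheet to the XML text and returns the text output.
    Function TransformToText(ByVal xmlText As String, ByVal stylesheetText As String) As String
        Dim transform As New XslCompiledTransform()
        Using stylesheetReader As XmlReader = XmlReader.Create(New StringReader(stylesheetText))
            transform.Load(stylesheetReader)
        End Using

        Using inputReader As XmlReader = XmlReader.Create(New StringReader(xmlText))
            Using output As New StringWriter()
                ' Nothing is passed because the stylesheet takes no parameters.
                transform.Transform(inputReader, Nothing, output)
                Return output.ToString()
            End Using
        End Using
    End Function

    Sub Main()
        ' A condensed copy of the stylesheet, using single-quoted attributes.
        Dim stylesheetText As String = _
            "<xsl:stylesheet version='1.0' " & _
            "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" & _
            "<xsl:output method='text'/>" & _
            "<xsl:template match='*'>" & _
            "<xsl:text> </xsl:text>" & _
            "<xsl:for-each select='@*'>" & _
            "<xsl:value-of select='.'/><xsl:text> </xsl:text>" & _
            "</xsl:for-each>" & _
            "<xsl:apply-templates/>" & _
            "</xsl:template>" & _
            "</xsl:stylesheet>"

        Dim xmlText As String = _
            "<p>Link to <a href='www.google.com' " & _
            "title='Google Search Engine'>Google</a>.</p>"

        Console.WriteLine(TransformToText(xmlText, stylesheetText))
    End Sub
End Module
```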

Regular Expressions

Using regular expressions is another option for solving this problem. It also has the added benefit of being able to extract the contents of malformed web documents.

The following pattern matches a tag, including its attributes:

<("[^"]*"|'[^']*'|[^'">])*>
						

Note: additional logic would be needed to handle nested elements.
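As a rough sketch of how a tag-matching pattern might be used, the code below strips tags from a fragment and keeps the text between them. It writes the pattern with straight quotes, doubled where VB string syntax requires it. The names TagPattern and StripTags are assumptions, and unlike the DOM approach this sketch discards attribute values along with the tags.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module TagStripDemo
    ' The tag pattern, with each double quote doubled for a VB string literal.
    Const TagPattern As String = "<(""[^""]*""|'[^']*'|[^'"">])*>"

    Function StripTags(ByVal markup As String) As String
        ' Replace each tag with a space so adjacent text does not merge.
        Return Regex.Replace(markup, TagPattern, " ")
    End Function

    Sub Main()
        Console.WriteLine(StripTags("<p>Google is a <b>great</b> search engine.</p>"))
    End Sub
End Module
```

Keeping attribute values would require capturing them from each matched tag before it is removed.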

Conclusion

This article explained how to extract the element and attribute values of an XHTML document using the DOM. An application of this would be for the analysis of the content, perhaps for use in a text search indexing process or SEO analysis tool.

The sample code can also be modified for custom requirements. For example, it can be modified to extract only the values of certain elements, attributes, or element-attribute combinations, depending on what the analysis process considers content.
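One such modification might look like the following sketch, which keeps all text but only the attributes named in a list (for example, title but not href). The function name GetSelectedContents and its attributeNames parameter are hypothetical and are not part of the original sample code.

```vbnet
Imports System
Imports System.Text
Imports System.Xml

Module SelectiveDemo
    ' Hypothetical variant of GetContents: keeps text nodes, but only the
    ' attributes whose names appear in attributeNames (case-sensitive).
    Function GetSelectedContents(ByVal node As XmlNode, ByVal attributeNames As String()) As String
        Dim contents As New StringBuilder()

        Select Case node.NodeType
            Case XmlNodeType.Element
                For Each attr As XmlAttribute In node.Attributes
                    If Array.IndexOf(attributeNames, attr.Name) >= 0 Then
                        contents.Append(attr.Value).Append(" ")
                    End If
                Next
            Case XmlNodeType.Text
                contents.Append(node.Value).Append(" ")
        End Select

        ' Visit the children in document order, as GetContents does.
        For Each child As XmlNode In node.ChildNodes
            contents.Append(GetSelectedContents(child, attributeNames))
        Next

        Return contents.ToString()
    End Function

    Sub Main()
        Dim doc As New XmlDocument()
        doc.LoadXml("<p>Link to <a href='www.google.com' " & _
                    "title='Google Search Engine'>Google</a>.</p>")
        Console.WriteLine(GetSelectedContents(doc, New String() {"title"}))
    End Sub
End Module
```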

Finally, two alternative approaches to solving the same problem were provided. One approach used XSLT and the other used regular expressions.
