OK, I'd be the first to post a link to Parsing Html The Cthulhu Way if anyone suggests Regular Expressions, but I have a problem using an XmlDocument (and therefore XPath) with an HTML file I'm downloading.
The page is a list of files to download -- I need to extract the hrefs from the <a> tags; obviously I'd prefer to use XPath to do that.
0) The file doesn't contain an opening <HTML> tag (it does have a closing </HTML> tag) -- I can tack one on, that's not a big deal.
1) It contains at least one entity (and possibly other entities), and the XmlDocument doesn't like that.
So I need options, people!
I can summon Cthulhu.
I can use Regular Expressions to replace any offending entities and then feed the result to an XmlDocument.
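For what it's worth, here is a minimal sketch of that second option, not a tested solution for the real page. It assumes the download is otherwise well-formed XML (every tag closed, attributes quoted), that the anchors are written in lowercase as a/href (XML and XPath are case-sensitive), and that replacing unknown named entities with a space is acceptable. HrefExtractor and ExtractHrefs are made-up names for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

static class HrefExtractor
{
    // Hypothetical helper, not part of any library.
    public static List<string> ExtractHrefs(string html)
    {
        // Replace named entities that XML does not predefine with a space.
        // The five predefined ones (amp, lt, gt, quot, apos) and numeric
        // references such as &#160; are left untouched.
        string cleaned = Regex.Replace(
            html,
            @"&(?!(?:amp|lt|gt|quot|apos);)[A-Za-z][A-Za-z0-9]*;",
            " ");

        // Tack on the missing opening tag. Assumption: the closing tag is
        // spelled </HTML> in upper case, so the opening tag must match it.
        if (!Regex.IsMatch(cleaned, @"<html[\s>]", RegexOptions.IgnoreCase))
        {
            cleaned = "<HTML>" + cleaned;
        }

        var doc = new XmlDocument();
        doc.LoadXml(cleaned); // throws XmlException if the markup is still not well-formed

        // Collect the value of every href attribute on an <a> element.
        var hrefs = new List<string>();
        foreach (XmlNode attr in doc.SelectNodes("//a/@href"))
        {
            hrefs.Add(attr.Value);
        }
        return hrefs;
    }
}
```

Calling HrefExtractor.ExtractHrefs(pageText) on the downloaded string would then give the list of links, provided the assumptions above hold. Anything else that isn't well-formed (an unclosed tag, an unquoted attribute, a bare ampersand) would still make LoadXml throw.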
What other options might there be?