2009-06-18

How to write a Java-based HTML-Crawler?

It's actually a task from my test work. The first idea that came to me was: use an instance of XMLReader to fetch a page, parse it with regular expressions, and save the results with the help of an XMLWriter. Such an approach is not only naive, it is almost unusable! So let me explain the difficulties you will run into.
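
Just to illustrate, a rough sketch of that naive idea could look like the following (the URL and the regular expression are only placeholders, not part of the original task):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveCrawler {
    public static void main(String[] args) throws Exception {
        // Fetch the raw HTML of a single page (the URL is just an example).
        URL url = new URL("http://www.example.com/");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        StringBuilder html = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            html.append(line).append('\n');
        }
        in.close();

        // Pull out link targets with a regular expression.
        // This already fails on single-quoted or unquoted attributes,
        // links built by JavaScript, and broken markup.
        Pattern href = Pattern.compile("<a[^>]+href=\"([^\"]*)\"");
        Matcher m = href.matcher(html);
        while (m.find()) {
            System.out.println(m.group(1));
        }
    }
}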


1. The Internet is a mess.


Far too many web pages out there pay no attention to the (X)HTML standards. A single page can contain deeply nested tags, unclosed tags and, even worse, badly written JavaScript.


2. Ads are confusing.


Then there are advertisements in all kinds of formats: text, Flash, images and even video(!). Most of them come with a lot of JavaScript to increase their impact, and some even rely on tags such as <noscript> or browser-specific extensions.


3. Ajax is not crawler-friendly.

We all love to use Ajax. A typical Ajax use case is to load some text dynamically, and this causes a severe problem for an HTML parser: if the parser simply saves the fetched HTML page in a cache, the cached file contains only the JavaScript that would load those elements later. From such a file we normally cannot get the content we actually want.


What can we do about all this?


The currently popular solution is to use an HTML or web-page testing library! Among the available libraries, HTMLUnit fits our requirements exactly. The project describes itself as a "GUI-Less browser for Java programs", so we can handle a web page just as we would see it in our browser, and the disadvantages listed above can be avoided.
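
A minimal sketch of such a crawler, assuming the com.gargoylesoftware.htmlunit API of the current HTMLUnit releases (the URL is again just a placeholder), could look like this:

import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitCrawler {
    public static void main(String[] args) throws Exception {
        // The WebClient plays the role of the browser.
        WebClient webClient = new WebClient();
        try {
            // Load the page; JavaScript is executed and
            // broken markup is repaired by the parser.
            HtmlPage page = webClient.getPage("http://www.example.com/");
            System.out.println("Title: " + page.getTitleText());

            // Walk over all anchors of the rendered page.
            List<HtmlAnchor> anchors = page.getAnchors();
            for (HtmlAnchor anchor : anchors) {
                System.out.println(anchor.getHrefAttribute());
            }
        } finally {
            webClient.closeAllWindows();
        }
    }
}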


Interestingly, HTMLUnit was originally a Java unit-testing framework for web-based applications. It is similar in concept to HTTPUnit but very different in implementation.


For beginners, HTMLUnit provides a short "Get Started" guide. With the examples on the website it is now quite straightforward to write an HTML-Crawler.
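
And for the Ajax problem from point 3: HTMLUnit can be told to wait until background JavaScript has finished before we read the page. A rough sketch, assuming the current API (the timeout value is arbitrary):

import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class AjaxAwareFetch {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        // Re-synchronize Ajax calls that are triggered while loading.
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());

        HtmlPage page = webClient.getPage("http://www.example.com/");
        // Give background JavaScript up to 10 seconds to finish.
        webClient.waitForBackgroundJavaScript(10 * 1000);

        // Now the page tree contains the dynamically loaded content.
        System.out.println(page.asXml());

        webClient.closeAllWindows();
    }
}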