This is actually a task from a test assignment I was given. The first idea that came to me was: use an instance of XMLReader to fetch a page, parse it with regular expressions, and save the results with the help of XMLWriter. Such an approach is not just naive, it is almost unusable in practice. Let me explain the difficulties you will run into.
1. The Internet is a mess: real-world HTML is often malformed and breaks strict XML parsers.
2. Ads and other boilerplate clutter the markup and confuse simple pattern matching.
3. Content loaded via Ajax never appears in the raw HTML response, so a plain HTTP fetch misses it.
What can we do about all of this?
The currently popular solution is to use an HTML-parsing or headless-browser library instead. Among the available libraries, HtmlUnit fits our requirements exactly. The project describes itself as a "GUI-Less browser for Java programs": it lets us work with a web page much as we would see it in a real browser, with JavaScript executed and the DOM built, so the disadvantages listed above can be avoided.
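As a rough sketch of how this addresses the problems above (the URL and the 5-second timeout are placeholders; the com.gargoylesoftware.htmlunit package names are from HtmlUnit 2.x, newer 3.x releases renamed them to org.htmlunit): HtmlUnit tolerates broken markup, executes the page's JavaScript, and can wait for background Ajax calls before we read the DOM.

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class FetchExample {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // Be tolerant: real pages often contain broken scripts and error status codes.
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
            webClient.getOptions().setCssEnabled(false);

            // Fetch the page; JavaScript is executed just like in a normal browser.
            HtmlPage page = webClient.getPage("https://example.com/");

            // Give pending Ajax requests up to 5 seconds to finish (placeholder timeout).
            webClient.waitForBackgroundJavaScript(5000);

            // Now the DOM reflects what a user would actually see.
            System.out.println(page.getTitleText());
            System.out.println(page.asXml());
        }
    }
}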
Interestingly, HtmlUnit started out as a Java unit-testing framework for web-based applications. It is similar in concept to HttpUnit but very different in implementation.
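Those testing roots still show in the typical usage pattern. A minimal JUnit-style sketch (the URL and expected title are placeholders) would assert against the rendered page rather than the raw HTML source:

import static org.junit.Assert.assertEquals;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.junit.Test;

public class HomePageTest {
    @Test
    public void homePageHasExpectedTitle() throws Exception {
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage("https://example.com/");
            // The title is read from the fully built DOM, not from the raw response text.
            assertEquals("Example Domain", page.getTitleText());
        }
    }
}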
For beginners, HtmlUnit provides very simple "Getting Started" documentation. With the examples provided on the website, it is straightforward to write an HTML crawler now.
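To illustrate, here is a small sketch of such a crawler (the start URL and depth limit are hypothetical; the HtmlUnit calls used are the standard ones). It loads a page, prints its title, collects its anchors, and follows absolute links up to a fixed depth:

import java.util.HashSet;
import java.util.Set;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class TinyCrawler {
    private final WebClient webClient = new WebClient();
    private final Set<String> visited = new HashSet<>();

    public TinyCrawler() {
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    }

    // Visit a page, print its title, and recurse into its links up to maxDepth.
    public void crawl(String url, int maxDepth) throws Exception {
        if (maxDepth < 0 || !visited.add(url)) {
            return;                          // too deep, or already seen
        }
        HtmlPage page = webClient.getPage(url);
        System.out.println(url + " -> " + page.getTitleText());

        for (HtmlAnchor anchor : page.getAnchors()) {
            String href = anchor.getHrefAttribute();
            if (href.startsWith("http")) {   // skip mailto:, javascript:, fragments, relative links
                crawl(href, maxDepth - 1);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        new TinyCrawler().crawl("https://example.com/", 1);
    }
}

A real crawler would of course add error handling per page, politeness delays, and persistence of the results, but the core loop stays this simple because HtmlUnit hands us a browser-like DOM to query.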