2009-06-18

How to write a Java-based HTML-Crawler?

This is actually a task from my test assignment. The first idea that came to me was: use an instance of XMLReader to fetch a page, parse it with regular expressions, and save the results with the help of XMLWriter. That approach seems simple, but it is almost unusable! So let me explain the difficulties we can run into.


1. The Internet is a mess.



Too many web pages out there pay NO attention to the (X)HTML web standards. So we see a lot of badly nested tags, unpaired tags and, even worse, badly programmed JavaScript, sometimes all in a single page.



2. Ads are confusing.


Next come advertisements in different formats, e.g. text, Flash, images and videos(!). Most of them rely on a lot of JavaScript to improve their impact. Some even depend on <noscript> fallbacks or browser-specific tags.


3. Ajax is not so user-friendly.

We all love to use Ajax. A typical Ajax use case is to load some text dynamically, which causes a severe problem for an HTML parser. If the parser saves the fetched HTML page in a cache, the cache file contains a lot of JavaScript that would load the elements later. So from the cache file alone we normally cannot get what we want.
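To make the first difficulty a bit more concrete, here is a minimal sketch of the naive regular-expression idea. The class name NaiveLinkExtractor, the pattern and the sample markup are all my own assumptions for illustration; the point is only that a pattern written for tidy markup silently skips links that a browser would accept without complaint.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveLinkExtractor {
    // Naive pattern (an assumption for this sketch): expects a lower-case tag
    // with a double-quoted href as its only attribute.
    private static final Pattern LINK = Pattern.compile("<a href=\"([^\"]*)\">");

    // Returns every href value that the naive pattern manages to match.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Three links a browser would render, written in three different styles.
        String messy = "<a href=\"/ok.html\">ok</a>"
                + "<A HREF='/single.html'>upper case, single quotes</A>"
                + "<a class=x href=/bare.html>extra attribute, no quotes</a>";
        // Only the first link is found; the other two are missed.
        System.out.println(extractLinks(messy)); // prints [/ok.html]
    }
}
```

One could keep patching the pattern for every new variation, but with real-world pages that quickly becomes a losing game.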


What can we do about all this?


A currently popular solution is to use HTML or web page validation tools! Among the available libraries, HtmlUnit fits our requirements best. The project describes itself as a "GUI-Less browser for Java programs". So we can handle a web page just as we see it in our browser, and the difficulties above can be avoided.


Interestingly, HtmlUnit was originally a Java unit testing framework for testing web-based applications. It is similar in concept to HttpUnit but very different in implementation.


For beginners, HtmlUnit provides very simple "Get Started" documentation. With the examples provided on the website, writing an HTML crawler is now quite manageable.


2009-05-05

blue-chip company

Today I got an offer from a headhunter for contract-based work. Interestingly, I came across an unfamiliar term: "Blue Chip Company". So I did a little research to find out what it really means. Here it is! ;)

1. finance definition:

A company that is very strong financially, with a solid track record of producing earnings and only a moderate amount of debt. A blue-chip company also has a strong name in its industry with dominant products or services. Typically, blue-chip companies are large corporations that have been in business for many years and are considered to be very stable. However, there is no formal requirement for being a blue chip. Often, blue-chip companies are found in the Dow Jones Industrial Average.

2. Better explanation

Blue chip (stock market)

A blue chip stock is the stock of a well-established company having stable earnings and no extensive liabilities. The term derives from casinos, where blue chips stand for counters of the highest value. Most blue chip stocks pay regular dividends, even when business is faring worse than usual.

The phrase was coined by Oliver Gingold of Dow Jones sometime in 1923 or 1924. Company folklore recounts that the term apparently got its start when Gingold was standing by the stock ticker at the brokerage firm that later became Merrill Lynch. Noticing several trades at USD$200 or USD$250 a share or more, he said to Lucien Hooper of W.E. Hutton & Co. that he intended to return to the office to “write about these blue chip stocks.” Thus the phrase was born. It has been in use ever since, originally in reference to high-priced stocks, more commonly used today to refer to high-quality stocks. In contemporary media, Blue Chips and their daily performances are frequently mentioned alongside other economic averages like the Dow Jones Industrial Average.

/////////////////////

2009-02-13

The notorious Problem with letterSpacing in Flash

Recently I ran into a problem in Flash with the letterSpacing property of a dynamic text field.

Consider the example provided by Adobe:

this.createTextField("mytext", this.getNextHighestDepth(), 10, 10, 200, 100);
mytext.multiline = true;
mytext.wordWrap = true;
mytext.border = true;

var format1:TextFormat = new TextFormat();
format1.letterSpacing = -1;

var format2:TextFormat = new TextFormat();
format2.letterSpacing = 10;

mytext.text = "Eat at \nJOE'S.";
mytext.setTextFormat(0, 7, format1);
mytext.setTextFormat(8, 12, format2);

The problem is that this works only for a text field generated dynamically in code, not for a dynamic text field placed on the stage!

There are a lot of solutions on the internet.

Solution 1:

var styling:TextFormat = new TextFormat();
styling.font = "Blackadder";
styling.color = 0xBA1424;
styling.letterSpacing = 15;

tf.setNewTextFormat(styling);
tf.text = "mööööp";


This one doesn't call setTextFormat() to set the letterSpacing; it uses setNewTextFormat() instead, before the text is assigned.

Solution 2:

From the blog "Summit Projects":

function setTextFormatting() {
    var fmt:TextFormat = club_name.name.getTextFormat();
    club_name.name.setTextFormat(fmt);
    club_name.name.setNewTextFormat(fmt);
    club_name.name.autoSize = "left";
}


He said, "It's silly, but it works." Unfortunately, it didn't work in my case.

Solution 3:

Bob Walton suggests the following in his blog post "Flash: yourMom.getTextFormat(); //is the key to letterSpacing":

var fmt:TextFormat = myTextField.getTextFormat();
myTextField.setTextFormat(fmt);
myTextField.setNewTextFormat(fmt);


This also didn't work for me.

I spent hours figuring out how to fix it. Here is my own solution:

txtCtrl.html = true;
txtCtrl.htmlText = "<font letterspacing='-3'>Now it is the right one!</font>";


So as you can see, I had no success setting the format of a dynamic text field through TextFormat. In the end, an attribute of the font tag achieves it so easily!

2008-12-19

What can we learn from the online advertising industry? (Part One)

I must admit, maybe the topic is a bit too big for me, but my intention is to show the tips and tricks that we can learn from the online advertising industry.

First, the online advertising industry mostly uses Flash-based banners nowadays. So there are some things that must be taken into consideration.

1. How does a Flash file communicate with the browser?
2. What kind of effects can we achieve with the combination of Flash, JavaScript and CSS?

Here is the answer.

Before Flash version 7, a Flash file could pass commands to the browser with the ActionScript 1.0 function fscommand(command, parameters).

Let's assume that the Flash file is called myMovie and has a size of 728px x 600px. Then the corresponding part in JavaScript should be:


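// Called by the Flash player when the movie "myMovie" issues an fscommand
function myMovie_DoFSCommand(command, args) {
    var ad = document.getElementById("flashad");
    if (command == "adcollapse") {
        // show only the rightmost 120px strip
        ad.style.clip = "rect(0px, 728px, 600px, 608px)";
    }
    if (command == "adexpand") {
        // show the full 728px x 600px movie
        ad.style.clip = "rect(0px, 728px, 600px, 0px)";
    }
}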


The Flash file can use one button to call fscommand("adexpand") and a close button to call fscommand("adcollapse"). The ActionScript looks like this:


btnClose.onRelease = function() {
    fscommand("adcollapse");
};


So what happens in the JavaScript function? When someone clicks the close button, the JavaScript function is called. The function sees that the Flash file wants to execute the command "adcollapse", so it modifies the style of the layer that contains the Flash object.


<div id="flashad">
<object ...>
<embed ...>
</div>


document.getElementById("flashad") finds the HTML element that contains the Flash object. Then the clip property of its style is modified. The syntax of clip is: clip: rect(top, right, bottom, left). Note that clip only takes effect on absolutely positioned elements.

See? When the command "adcollapse" is called, the Flash is cropped to its rightmost (728-608) = 120px, leaving a visible area of 120px x 600px; when the command "adexpand" is executed, the full size of the Flash file is restored.
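The arithmetic behind the two clip rectangles can be checked with a small sketch. The class ClipRect and its method names are hypothetical helpers of mine and not part of any browser or Flash API; they just turn rect(top, right, bottom, left) into a visible width and height.

```java
public class ClipRect {
    // Edges in pixels, in the same order as CSS clip: rect(top, right, bottom, left).
    private final int top, right, bottom, left;

    public ClipRect(int top, int right, int bottom, int left) {
        this.top = top;
        this.right = right;
        this.bottom = bottom;
        this.left = left;
    }

    // Width of the region that stays visible.
    public int visibleWidth() {
        return right - left;
    }

    // Height of the region that stays visible.
    public int visibleHeight() {
        return bottom - top;
    }

    public static void main(String[] args) {
        ClipRect collapsed = new ClipRect(0, 728, 600, 608); // "adcollapse"
        ClipRect expanded = new ClipRect(0, 728, 600, 0);    // "adexpand"
        System.out.println(collapsed.visibleWidth() + " x " + collapsed.visibleHeight()); // prints 120 x 600
        System.out.println(expanded.visibleWidth() + " x " + expanded.visibleHeight());   // prints 728 x 600
    }
}
```

So the collapsed banner is nothing more than the same 728px wide movie with everything left of x = 608px hidden.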

The end effect can be seen here.

2008-11-09

Time is too slow for those who wait

I would say I am not the kind of person who is easily touched after all these years, but this little poem really hit me from the first moment.


Time is too slow for those who wait 
by Henry Van Dyke


Time is too slow for those who wait,
too swift for those who fear,
too long for those who grieve, 
too short for those who rejoice,
but for those who love, 
time is eternity.

2008-02-28

COMPUTE clause in MSSQL

I am totally frustrated with SQL recently. The culprit is the COMPUTE clause.

Since I am working on a migration project from MSSQL Server 2000 to Oracle 10g, a lot of SQL queries in the applications have to be modified to meet Oracle's SQL requirements. Here is what I found on the internet:

---

> The Compute by clause of MSSQL basically allows you to get a running
> total at the bottom (end) of the report.
> In a way it is similar then using ".. group by .." with aggregate
> functions (sum) but in this case I am not trying to "... group by .."
> does not make sense in the context of the query, just want to get a
> summary (sum and count) of some columns at the end of the record.

The "standard" way to do this is to make a second query to compute the aggregates. However, it is possible to combine the two if you really need the aggregates in the same result set.

> > > select A.ProdID, A.Description, A.Qty, A.Price
> > > from SoldItems as A
> > > where A.ListID = 15
> > > order by A.ProdID
> > > compute count(A.ProdID),sum(A.Price),sum(A.Qty)

SELECT ProdID, Description, Qty, Price
FROM
  (SELECT A.ProdID, A.Description, A.Qty, A.Price, 1 AS Kind
   FROM SoldItems AS A
   WHERE A.ListID = 15
   UNION ALL
   -- summary row: aggregates in the same column order (Qty, then Price) as the detail rows
   SELECT count(B.ProdID), NULL AS Description, sum(B.Qty), sum(B.Price),
          2 AS Kind
   FROM SoldItems AS B
   WHERE B.ListID = 15
  ) AS C
ORDER BY Kind, ProdID
;

---

After that I still have to fight with the SHAPE clause... What a life!

2008-02-18

Strange problem with file copying in Windows Explorer

Since I sometimes need to use Chinese in Windows, I set up my Windows XP under "Control Panel" -> "Regional and Language Options" -> "Advanced" -> "non-Unicode Program" Encoding -> "Simplified Chinese".

Now I have a big problem. I copied a zip file from a Linux box to my local computer. The copy went fine and the zip file could be extracted too. Then I tried to copy that zip file to a directory on a machine running Windows Server 2003 Enterprise Edition using Copy/Paste. The copy seemed to go normally, but when I opened the file directly on the target machine and tried to extract it, WinRAR told me "the zip file is corrupt!".

If I change the non-Unicode programs' encoding to something like "German", the problem does not occur. It seems that Windows Explorer is a non-Unicode program?!

My Windows is already updated to SP2. The IE version is still version 6.

So far I haven't found any solution to this problem.