03 Aug 08

Forum Mining

I found a nasty bug in the Google Hack.  I was going to call it “interesting” but I’ve been overusing the hell out of that word and I’m trying to strike it from my vocabulary.

I have been using links2 to strip out the HTML markup. It works fine until you redirect its output to a file. Then all sorts of crazy, subtle things happen.

It will translate “?” to %3f, “=” to %3d, etc., which is fine until you subsequently feed that back to wget, which does not translate it back. So if you have a URL like…

index.php%3fshowtopic%3d54476%26st%3d50

which should be

index.php?showtopic=54476&st=50

… wget then sends it verbatim and the remote site chokes with a 404 Not Found.

This behavior in links2 is not observed when it displays in a terminal, only when it’s piped to a file.
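
One way to work around this is to undo the percent-encoding before the links go back to wget. What follows is only a minimal sketch in Python, not what the Google Hack actually does: the function name decode_links is made up for illustration, and it assumes the input is one link per line. Note that unquote decodes every escape it finds, so a link that legitimately needs something like %20 to stay encoded would come out wrong.

```python
from urllib.parse import unquote


def decode_links(lines):
    """Undo percent-encoding on each non-empty line (hypothetical helper)."""
    # unquote turns %3f back into "?", %3d into "=", %26 into "&", and so on.
    return [unquote(line.strip()) for line in lines if line.strip()]


if __name__ == "__main__":
    import sys

    # Read the mangled links from stdin and print the repaired ones.
    for url in decode_links(sys.stdin):
        print(url)
```

Saved under a name of your choosing, this could sit as a filter between the file links2 wrote and whatever hands the URLs to wget.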

OK, nice catch. It means there is a little life left in the Google Hack, because it has not pulled in any forum data since it was hatched. And there is tons of data in proxy forums (in fact the operators of such forums hate being mined, and you usually need to be registered; sucking pages out of Google’s cache can sometimes get around registration).

I was sure we were going to top out here pretty soon, but the database may make it to 400,000 rows yet.

I have my doubts about half a million, though.

BTW, it’s Sunday.  Will Bahrain make another appearance or has that particular pooch been screwed?
