05 Jul 08

Websense Got Me!

I was talking with my kid, Rinky Dink, yesterday. He’s off at college, working like a slave and taking summer classes.

He told me that he can’t see The List because Websense has categorized it as “Proxy Avoidance”.

HA! Those people have been pissed off at me ever since I published my Websense Policy Bypass vulnerability notice (and the accompanying video).

I guess they’re keeping an eye on The Dinkster! Less than two weeks online and already I’m on their Shit List.

Anyway, I was distracted from my Anti Google Anti-Bot activities for most of yesterday. I decided to refine my Universal Proxy Harvester in preparation for it.

All of my site-specific harvesters are just that: one-time hacks for pulling info off a page (or pages) from a single site. My Java and GIF hacks are still required for those freakishly paranoid sites that can’t be html-scraped. For 99.9% of other Proxy Lists you don’t need to jump through all those hoops.

I have finally generalized a sed/grep/tr/cut routine that can scrape any site in any language, eliminating the tedious cut & paste of my ad hoc searches.

It is a First Class Hack, boys and girls.

I did a Google search of BlogSpot to find all the proxy lists there and it worked perfectly. I found over 50 live proxies (out of over 15,000 addr:port combinations, which bumped the database up to nearly 225,000 proxy entries).

From there I decided to get all the proxy lists in China (just use site:.cn in your Google search for that).

And things sort of fell apart from there.

For one thing, the Chinese don’t like colons. Most of the sites use “addr port” instead of “addr:port”. Simple enough, even for multiple spaces.
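For anyone who wants to play along, here is a rough sketch of the "colon or spaces" matching in Python. It is not the actual sed/grep/tr/cut routine, just a stand-in for the same idea. The names PROXY_RE and extract_proxies are made up for illustration, and the sketch assumes plain dotted-quad IPv4 addresses with the address and port on the same line.

```python
import re

# Hypothetical pattern: a dotted-quad address, then either a colon or a run
# of spaces/tabs, then a port of up to five digits.
PROXY_RE = re.compile(r"\b((?:\d{1,3}\.){3}\d{1,3})(?::|[ \t]+)(\d{1,5})\b")


def extract_proxies(text):
    """Return unique 'addr:port' strings in order of first appearance."""
    found = []
    for addr, port in PROXY_RE.findall(text):
        # The regex alone accepts 999.1.1.1 or port 0, so range-check here.
        octets_ok = all(int(octet) <= 255 for octet in addr.split("."))
        port_ok = 0 < int(port) <= 65535
        entry = f"{addr}:{port}"
        if octets_ok and port_ok and entry not in found:
            found.append(entry)
    return found


sample = "10.0.0.1:8080 foo\n192.168.1.5    3128\n999.1.1.1 80\n10.0.0.1 8080"
print(extract_proxies(sample))
```

Both separators get normalized to addr:port, so the colon-free Chinese lists and the ordinary ones end up in the same format before they go into the database.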

The next issue was with html2text, which, as the name would imply, translates html to text. For some reason (documented somewhere I can’t find), the -ascii switch doesn’t work worth a diddly damn on Chinese Web pages.

This forced me to make a gawd awful, inefficient sed script to kill every character greater than ascii 127.
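The same "kill everything above 127" trick can be sketched in a few lines of Python. Again, this is an illustration and not the sed script itself. The function name strip_high_bytes is hypothetical, and the sketch assumes the page arrives as raw bytes in an encoding where non-ASCII characters use only high bytes (UTF-8 or GB2312, for example). In GBK's extended range the second byte of a character can fall below 128, so a filter this simple could leave stray ASCII characters behind.

```python
def strip_high_bytes(raw: bytes) -> str:
    """Drop every byte above 127 and return what is left as ASCII text."""
    # Iterating over bytes yields ints, so the comparison is on byte values.
    return bytes(b for b in raw if b < 128).decode("ascii")


# Hypothetical sample line: two Chinese characters, then an "addr port" pair.
page = "代理 61.135.1.1 8080".encode("utf-8")
print(strip_high_bytes(page))
```

What survives is plain ASCII, so whatever address matching comes next never sees the multibyte characters.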

But after all that was taken care of, it works against any Web page, html table, or forum posting you can throw at it.

I’m currently running my first Chinese Web scrape and it’s showing great results. They sure know how to find proxies.

The next step is merging this with automatic Google queries.

