De-Obfuscation Revisited

Consider, if you will, this Russian proxy list.

This fellow went to a lot of trouble to prevent people like me from harvesting his data.

All the address:port combinations are GIFs, which is bad enough, but that’s not all!

The GIF file names are based on your session cookie and are unique for every visit. So first you get your cookie, then you request the page (it helps if you strip the Accept-Encoding: gzip,deflate header and use HTTP/1.0 instead of 1.1) to get the unique GIF names. Then you download the GIFs and throw them at your OCR program.
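
For the curious, the cookie-then-page dance can be sketched roughly as below. This is a minimal illustration, not the script I actually ran: the helper names (build_request, fetch, session_cookie, gif_names) are made up for this example, and the regex assumes the GIFs show up as ordinary <img src="...gif"> tags, which may not match the real page markup.

```python
import re
import socket

def build_request(host, path, cookie=None):
    """Build a bare HTTP/1.0 GET with no Accept-Encoding header."""
    lines = [f"GET {path} HTTP/1.0", f"Host: {host}"]
    if cookie:
        lines.append(f"Cookie: {cookie}")
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

def fetch(host, path, cookie=None, port=80):
    """Send the request and return the raw response (headers and body)."""
    with socket.create_connection((host, port), timeout=30) as sock:
        sock.sendall(build_request(host, path, cookie))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # HTTP/1.0: the server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks)

def session_cookie(raw_response):
    """Return the first Set-Cookie name=value pair, or None if there is none."""
    head = raw_response.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    m = re.search(r"^Set-Cookie:\s*([^;\r\n]+)", head, re.I | re.M)
    return m.group(1) if m else None

def gif_names(html):
    """Return the src of every <img> tag that points at a .gif, in page order."""
    return re.findall(r'<img[^>]+src=["\']?([^"\'\s>]+\.gif)', html, re.I)
```

The idea is to call fetch once to pick up the cookie, call it again with that cookie to get the page, and then fetch each name that gif_names returns using the same cookie. Sticking to HTTP/1.0 with no Accept-Encoding header means the body comes back uncompressed and unchunked, so the raw bytes can be searched directly.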

GNU OCR (or simply “gocr”) couldn’t handle the 7’s in these GIFs, but I piped them through a utility called “gifsicle” and scaled them up by a factor of 10. After that, it only had a problem with colons, but that was taken care of with a quick sed script.
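
The colon cleanup is the only part with any logic in it. My sed script isn't reproduced here, so the following is a guess at the same kind of fix written in Python: it assumes the OCR output still has the four octets and the port intact and that only the separator between them came out wrong. The repair function and its pattern are hypothetical, not what gocr or sed give you out of the box.

```python
import re

# Four dot-separated octets, up to three non-digit characters where the
# colon should have been, then the port digits.
PROXY_RE = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})\D{1,3}(\d{2,5})")

def repair(ocr_line):
    """Return 'address:port' recovered from one line of OCR output, or None."""
    m = PROXY_RE.search(ocr_line.strip())
    if not m:
        return None
    addr, port = m.group(1), int(m.group(2))
    # Throw away anything that cannot be a real IPv4 address or TCP port.
    if any(int(octet) > 255 for octet in addr.split(".")):
        return None
    if not 0 < port < 65536:
        return None
    return f"{addr}:{port}"
```

Lines that come back as None are better dropped than guessed at, since a wrong digit just means a dead proxy in the database.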

Most of the proxies were already in my database, but I got about 10 out of the 100 or so he had listed. A 10% hit rate is pretty damned good (almost unheard of, in my experience), so this site is going into the permanent rotation.

