1.75M Proxies

Another milestone. Well, whoop dee fucking doo!

It’s going to be a long march to the second million, but we’re well under way, boys and girls!

Since most of the proxy harvest happens over Google these days, I added a URL tracking table to the proxy database. A lot of URLs get hit over and over again. I don’t really care about proxy lists from 2003 (those were the days), but there are hundreds of those out there. 99.9% of the address/port pairs are already in the database, so scouring those old lists is simply a waste of resources.

This table puts an end to that. All it has is three columns: the URL, the SHA-1 hash of the URL, and a count of how many times the URL has been seen.
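
For the curious, here is roughly what a table like that could look like. This is a minimal sketch in Python with SQLite, not the actual harvester code: the table name `url_seen`, the column names, and the helpers `open_tracker` and `record_url` are all made up for illustration, and the real database may well be something else entirely.

```python
import hashlib
import sqlite3


def open_tracker(path=":memory:"):
    """Open (or create) the tracking database. Names here are hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS url_seen (
            url_sha1   TEXT PRIMARY KEY,           -- hex SHA-1 of the URL
            url        TEXT NOT NULL,              -- the URL itself
            seen_count INTEGER NOT NULL DEFAULT 0  -- times the URL has turned up
        )
        """
    )
    return conn


def record_url(conn, url):
    """Bump the counter for url and return the new count (1 = first sighting)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            "UPDATE url_seen SET seen_count = seen_count + 1 WHERE url_sha1 = ?",
            (digest,),
        )
        if cur.rowcount == 0:
            # No existing row was updated, so this URL is new.
            conn.execute(
                "INSERT INTO url_seen (url_sha1, url, seen_count) VALUES (?, ?, 1)",
                (digest, url),
            )
    row = conn.execute(
        "SELECT seen_count FROM url_seen WHERE url_sha1 = ?", (digest,)
    ).fetchone()
    return row[0]
```

A harvester built on this would call `record_url` for every search hit and only fetch the page when it returns 1. Keying on the hash keeps the index at a fixed width no matter how long the URL is.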

This helps a lot for harvesting proxy forums with long histories (clean that old shit up, guys!). It doesn’t help so much when the current list is at the top level of the site (such as, for example, “http://www.niceproxy.com” – which is a parked domain so don’t bother with it).

As those top-level URLs pile up in the table, I can chop them out and do some dedicated runs on them.

The first time through on this I finally discovered why the “.net” TLD (top-level domain) was always the most fruitful. I found a CGI proxy page that is updated hourly! No graphics, no crap, just an ASCII list of addresses and ports!
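
A bare list like that is about as easy to harvest as it gets. Here is a rough sketch of pulling the pairs out in Python. It assumes one IPv4 address and port per line, separated by a colon or whitespace; I'm guessing at that format for the sake of the example, and `parse_proxy_list` is a made-up helper name.

```python
import re

# Assumed line format: "1.2.3.4:8080" or "1.2.3.4 8080" (an illustration only).
_PROXY_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})[:\s]+(\d{1,5})\b")


def parse_proxy_list(text):
    """Return a sorted, de-duplicated list of (address, port) tuples."""
    found = set()
    for line in text.splitlines():
        match = _PROXY_RE.search(line)
        if not match:
            continue  # blank line, comment, or anything else that isn't a proxy
        host, port = match.group(1), int(match.group(2))
        # Throw out impossible octets and ports.
        if all(int(octet) <= 255 for octet in host.split(".")) and 1 <= port <= 65535:
            found.add((host, port))
    return sorted(found)
```

Anything that survives the parse can then be checked against the database, the same way as proxies from any other source.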

Nice. That site is now in the daily 4AM schedule.


