The System

At the present time, the System is an Xubuntu 6.10 VM (Virtual Machine) running on top of Windows XP Professional.  It’s a vintage 2005 hacked-together box, the pieces/parts mostly coming from Tiger Direct.

P4 3.2 GHz, 3 GB of RAM, a total of about 160 GB of disk.  Seven fans to keep the darn thing cool.  Loud as fuck.  The Xubu VM has 1 GB of RAM dedicated to it.  Originally it had 256 MB, but the system spawns so many processes that it would use up all the RAM and all the virtual memory and crash about once a week until I got things under control.

As you can see from the screen capture at the right (click for a larger view), it uses up most of the CPU power available (it was running a Google Hack run when the capture was made).

 

The Database

The database is MySQL 5 dot something, the standard package that Ubuntu ships.  The database has all of two tables, proxies and gold.  The proxies table holds more than a quarter of a million addresses and ports, along with the date/time discovered, the last state, plus the country, city, region, and latitude/longitude of the IP address, as determined by MaxMind’s GeoCityLite database.  (The geoip database shown in the illustration is no longer used since the MaxMind C API is much faster than the brute-force table lookup that’s required if you use the database method.)

The gold table was created for the Web page.  It contains all the proxies that have ever been listed online, with the same information as the proxies table but also including the date/time of the last check that was made.  Everything that is in gold is in proxies, so there is quite a bit of duplication.
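For the curious, the two tables look roughly like this.  The column names, types, and the proxydb database name below are a reconstruction from the description above, not a dump of the real schema, so treat it as a sketch:

    mysql proxydb <<'EOF'
    -- Illustrative schema only; the real column names and types differ.
    CREATE TABLE proxies (
        ip          VARCHAR(15)  NOT NULL,   -- dotted-quad address
        port        INT          NOT NULL,
        discovered  DATETIME     NOT NULL,   -- date/time discovered
        state       TINYINT      NOT NULL,   -- last state
        country     VARCHAR(64),             -- GeoCityLite lookups
        region      VARCHAR(64),
        city        VARCHAR(64),
        latitude    FLOAT,
        longitude   FLOAT
    );

    -- gold: same information, plus the date/time of the last check
    CREATE TABLE gold LIKE proxies;
    ALTER TABLE gold ADD COLUMN last_checked DATETIME;
    EOF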

The database is the single most important part of the system.  I should back it up some day.

 

The Web Site

It costs $4/month from GoDaddy.  Lots of disk space, lots of bandwidth.  You couldn’t ask for much more.  Mostly it redirects people to my BlogSpot page, but it also holds UT mods for my UT99 server (its original purpose) and hosts my Google Maps application that shows where my UT99 players come from (another use for the GeoCityLite database).

The only fault I have found with GoDaddy is an intermittent ftp issue that kills my page updates.  Whenever they “fix” it, it eventually comes back.  Their frontline tech support lies.  At least twice they have told me they “escalated” the issue, only for me to find out later that the second-level techs had never heard of the problem.  Anyway… did I mention it’s FOUR BUCKS A MONTH?  Every time I call GoDaddy’s tech support line they lose money on me.

The pages for The List are all static pages pushed from the system.  Since the list has been known to grow to a maximum of 17 pages, there are actually 20 pages, most of which redirect back to Page 1.  I had to do this because Google likes to send users to pages that used to exist when the list was longer.  This is a hack and I don’t like it, but it’s better than a 404 page.
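Generating the redirect stubs is trivial.  Here’s a sketch of the idea; the page count, file names, and redirect HTML are placeholders, not the real thing:

    #!/bin/bash
    # Sketch: write redirect stubs for any page number beyond what the
    # current run produced, so stale Google results land on Page 1.
    MAXPAGES=20
    CURRENT=$(ls page*.html | wc -l)    # pages produced by this run

    for n in $(seq $((CURRENT + 1)) $MAXPAGES); do
        printf '<html><head><meta http-equiv="refresh" content="0; url=page1.html"></head>\n' >  "page${n}.html"
        printf '<body>Back to <a href="page1.html">Page 1</a></body></html>\n'               >> "page${n}.html"
    done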

I suppose I could create a database at the Web site and add all sorts of functionality like sorting by port or country, but who really cares about that crap anyway?  Besides, SQL injection is a real issue, and keeping the database off the Web server has always appealed to my sense of security.

The Web site is the least important piece of the puzzle.  I’m not even sure why I mentioned it.

For the record, the cruft you’re reading now is on googlepages.com.  I love Google Pages and recommend it to anyone who wants to put a quick & dirty page (like this one) online fast.  It’s a little flaky and it’s not always WYSIWYG, but you get used to it.  Sometimes it crashes IE, sometimes it crashes Firefox, but on the whole it’s been very dependable.  More so than BlogSpot, which flakes out a LOT.  (This content was moved to WordPress 05/23/2009)

 

The Scripts

The scripts are all written in bash, with a few support routines written in C.

Why bash?  Because it’s there.  And because I have a good handle on bash these days.  And because you can do almost anything in bash.

The following utilities are all stock Xubu stuff:

  • wget – command line page sucker
  • curl  – same as wget.  Used sparingly.
  • html2text – used for text dumps of Web pages.  Has a few issues with introducing garbage.
  • links2 – a text based Web browser that can dump Web pages to text.
  • NetCat – for the cases wget and the others can’t handle.  I use NetCat whenever a Web site demands a cookie because there is no good way to do that otherwise.
  • nmap – used for scanning ports.
  • grep – used for its regex abilities to find IP:port combinations on Web pages.
  • sed  – used to slice and dice Web pages
  • get1line – written by myself (originally in 1991), pulls one line out of a text file randomly.  Used to pick a User-Agent string (Google and others do not like wget’s default User-Agent), and to pull IP:port data out of text files that have been sorted.
  • rhino – used to execute JavaScript (many proxy lists use JavaScript tricks to obfuscate the page).  The trick is to replace all occurrences of “document.write()” with “print()” – see the sketch after this list.
  • giftopnm – used to translate a GIF to a PNM file in order to feed it to…
  • gocr – GNU Optical Character Recognition – to translate GIF’d IP:port data to text.
  • mysql – to put stuff into and take stuff out of the database from the command line.
  • ftp – for sending the finished pages to the Web server.
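To show how a couple of these get used, here’s a quick sketch of the rhino trick and the gocr pipeline mentioned in the list.  The file names are made up for illustration:

    # rhino trick: turn the obfuscated document.write() calls into plain
    # print() calls so rhino dumps the generated HTML to stdout.
    sed 's/document\.write(/print(/g' obfuscated.js > deobfuscated.js
    rhino deobfuscated.js > decoded.html

    # gocr pipeline: a proxy list that hides its IP:port data in a GIF
    # gets converted to PNM and run through OCR.
    giftopnm ipports.gif | gocr - >> scraped.txt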

The scripts are second only to the database in importance.

Some scripts are used as page scrapers (a typical page scraper is shown at the right – click for a larger view).  They put IP:port information into text files that are fed to the scripts that check the proxy and add it to the database (or ignore it if it’s already there).  Other scripts check for active proxies in the proxies table and move those into the gold table, after which other scripts pull them out, double check the status, and put them on the page if they are still alive.
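A stripped-down scraper looks something like this.  The target URL and file names are placeholders, the exact get1line invocation here is illustrative, and each real scraper has its own sed/grep gymnastics for the particular site it chews on:

    #!/bin/bash
    # Sketch of a typical page scraper: grab the page with a random
    # User-Agent, flatten it to text, and grep out anything that looks
    # like IP:port.  The output feeds the tester that loads the proxies table.
    UA=$(get1line user-agents.txt)                 # random User-Agent string
    URL="http://some-proxy-list.example.com/1"     # placeholder target

    wget -q -U "$UA" -O - "$URL" \
        | html2text \
        | grep -Eo '([0-9]{1,3}\.){3}[0-9]{1,3}:[0-9]{1,5}' \
        | sort -u >> scraped-proxies.txt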

A lot of proxies die before they ever make it from the proxies table to the gold table. 

 

The Method – Adding to the Proxies Table

After a list of IPs and ports is made, it is sent to the tester.  First, the IP:port is checked with nmap.  There are three possible outcomes:

  • The port is CLOSED.
  • The port is OPEN.
  • nmap TIMES OUT.

If the port is CLOSED, that means there is a system on the other end but the port is not listening for connections.  These may be re-investigated later but all processing stops and the IP:port is added to the database and marked CLOSED.

If the port is OPEN, further testing takes place.  A proxy judge page is randomly selected and requested via wget using the IP:port as a proxy.  There are TWO possible outcomes:

  • The proxy judge page will be returned.
  • wget TIMES OUT.

If the page is returned, the IP:port is marked ACTIVE and stored in the database.

Unfortunately, the way the database was designed in the first place, it is not possible to determine whether nmap or wget returned TIMEOUT.  Unfortunate because the “TIMED OUT” state is by far the most common (followed by CLOSED, followed by ACTIVE) and an initial nmap value of OPEN could be followed by a wget result of TIMED OUT.  The default timeout for both nmap and wget is 30 seconds, so there may be salvageable proxies among the 200,000+ proxies that timed out if the timeout value were changed to 45 or 60 seconds.

This really wouldn’t be that hard to fix (the result codes are CLOSED=0, OPEN=1, and TIMEOUT=2 – so the wget timeout, which reflects an initial nmap finding of “open”, could have a value of 3), but it would take a very long time to redo the 218,000+ entries – representing four months of work – already marked as TIMEOUT.  At best it would turn up some additional slow, borderline-useful proxies, but that is from my perspective.  A proxy in Zimbabwe that looks painfully slow from the USA may be fast for users in, say, Uganda or Nigeria.  This whole issue bothers me so much I’ll probably do something about it eventually.
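Boiled down to its essentials, the test looks something like this.  It’s a sketch, not the real script: the column names are the same guesses as in the schema above, the judges.txt file name is made up, and the hypothetical state 3 appears only as a comment since it isn’t in the database yet:

    #!/bin/bash
    # Sketch of the proxy tester.  States: CLOSED=0, OPEN=1 (ACTIVE), TIMEOUT=2.
    IP=$1 ; PORT=$2
    JUDGE=$(get1line judges.txt)        # random proxy judge URL

    # Step 1: does anything answer on the port?
    PORTSTATE=$(nmap -p "$PORT" "$IP" | grep "^$PORT/tcp" | awk '{print $2}')

    case "$PORTSTATE" in
        closed) STATE=0 ;;              # a system is there, port not listening
        open)
            # Step 2: can a judge page actually be fetched through it?
            if http_proxy="http://$IP:$PORT" \
               wget -q --timeout=30 --tries=1 -O /dev/null "$JUDGE"
            then STATE=1                # ACTIVE: the judge page came back
            else STATE=2                # would be 3 if the schema were fixed
            fi ;;
        *)      STATE=2 ;;              # nmap timed out (or port is filtered)
    esac

    mysql proxydb -e "INSERT INTO proxies (ip, port, discovered, state)
                      VALUES ('$IP', $PORT, NOW(), $STATE);"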

 

The Method – Adding to the Gold Table

Once every even hour (2, 4, 6, etc.) the process that creates The List wakes up.  It performs the following tasks:

  • It checks the gold table for the date/time of the last proxy added.
  • It checks the proxies table for ACTIVE proxies discovered after the date/time of the newest proxy in the gold table and makes a list of address/ports if it finds any.
  • It requests a random proxy judge page through each address/port in the list made from the proxies table.  nmap is not used, but the three possible results (CLOSED/OPEN/TIMEOUT) are the same.
  • The results are added to the gold table and the “Last Checked” date is updated in the proxies table.
  • The Web pages are generated and moved to the Web server.
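In cron terms it looks something like this.  The paths, file names, and column names are the same guesses as before, and only the first two steps are shown; the judge-page check works just like the tester sketched earlier:

    # crontab entry: wake up on the even hours
    0 */2 * * *  /home/proxy/bin/make-the-list.sh

    # inside make-the-list.sh, the gist of the first two steps:
    NEWEST=$(mysql -N proxydb -e "SELECT MAX(discovered) FROM gold;")
    mysql -N proxydb -e "SELECT CONCAT(ip, ':', port) FROM proxies
                         WHERE state = 1 AND discovered > '$NEWEST';" > candidates.txt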

 

The Method – Clearing Out Dead Proxies

Every weekday at 8AM EDT a process wakes up and starts checking the “ACTIVE” proxies in the gold table.  Dead proxies are marked as such, and the next push of the Web pages will reflect the changes.

At this time both the proxies and gold tables are searched for duplicates.  If any are found they are removed.  Duplicate rows are generally a pain in the ass in SQL.  Usually it requires moving unique rows into a temporary table, dropping the original table, and recreating the original table from the temporary table.  But since each row usually has a unique “last checked” date/time stamp, I just delete every duplicate except the one with the oldest timestamp.
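The dedup itself can be a single statement in MySQL 5, no temporary table needed.  Column names are the same guesses as above; the self-join deletes every row that has an older twin, which leaves only the oldest timestamp standing:

    mysql proxydb <<'EOF'
    -- Keep only the oldest last_checked row for each ip:port pair;
    -- the proxies table gets the same treatment.
    DELETE g1 FROM gold g1
      JOIN gold g2
        ON  g1.ip = g2.ip
        AND g1.port = g2.port
        AND g1.last_checked > g2.last_checked;
    EOF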

Lately this has been making for a very short (less than 200 proxies) list.  But that is the way it is if you want a dependable proxy list.

I recently got an email from an admirer (at least I think he was an admirer) who said only about 5% of the proxies worked.  This might have been true at the time, but show me any public proxy list with more than 1% active proxies!  You can’t do it!  And you will die trying to find one!  And you won’t find any (besides mine) with 5%!  That is just the way it is.  Proxies come and go, but mostly they go.  A very large percentage of these proxies belong to clueless end-users who eventually figure out they’ve been PWN3D and need a firewall, anti-virus, or both.  It’s not like the old days, when you could tap into any old Internet cafe in Zagreb or an Oracle Application Server that had mod_proxy installed and listening on port 8000 on every interface.  Times have changed.  Businesses have dumped millions… if not billions… into security products, and cable and DSL users are older and wiser (although a sucker is still born every minute – no problem there).

 

Proxy Judges

Proxy judge Web pages are an important part of the process and are remarkably easy to find.  They are everywhere and the people who serve them up seem to be blissfully unaware that they’re being used as such.  Generally they’re nothing more than sample diagnostic code that some Web “developer” cut & pasted into a page and forgot to take offline. 

They’re incredibly simple to find through a Google search.  I have about 50 stable ones that I use randomly, and I add new ones constantly.  By the time I find them, some have never been touched by a browser (other than a Google SpiderBot) since they were put on the Web.  The increased traffic from IP addresses (proxies) all over the world must send up a red flag, because several have dropped off the ‘Net since I started this.  I suspect the site owners eventually notice that the page is getting more traffic than the rest of their site and decide to pull it.

Every judge page that is used to verify a proxy is put into the gold table so that statistics on their performance can be taken regularly.  Dead proxy judges are removed as soon as possible, since a dead judge can make a live proxy look dead as well.
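Checking the judges themselves is the easy part: hit each one directly (no proxy) and see if it still answers.  Something like this, with the judge list file name made up:

    #!/bin/bash
    # Sketch: flag dead proxy judges so they stop making live proxies look dead.
    while read JUDGE; do
        if wget -q --timeout=30 --tries=1 -O /dev/null "$JUDGE"; then
            echo "OK   $JUDGE"
        else
            echo "DEAD $JUDGE"
        fi
    done < judges.txt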

 

The Madness

The madness is an end in and of itself.  This is a research project that has an actual goal, but I wouldn’t be doing it if it wasn’t fun.

I do get some kicks out of out-listing the listers.  I don’t like them much and their feeble attempts at protecting their “data” are entertaining.  Listers, listen up!  If you don’t want people hijacking your data, don’t put it online!  It’s that simple!

duh.

Eventually it will all be published, especially after I get around to the SOCKS proxies.  They’re already in the database, waiting to be shown the light of day.  That is when the real lulz begin, boys and girls.

Stay tuned.
