I thought it would be neat to write a program to spider the Web autonomously... so I wrote a Perl script that does just that!
Computer science folks would call the spidering a breadth-first search that uses a queue data structure to hold the collection of hostnames. That sounds like a mouthful, but the idea is simple (I promise!).
The program basically works by looping through the following phases:
- do an HTTP dump on a host (port 80) in the queue
- scrape unique hostnames from the output
- add the found hostnames to the crawl queue
- do something fun with the output
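To make the queue idea concrete, here is a minimal sketch of the loop above. It is not the script itself: the `%links` map is a made-up stand-in for real HTTP dumps, and `crawl_order` is a hypothetical helper name, so the example runs without touching the network.

```perl
use strict;
use warnings;

# Hypothetical link graph standing in for real HTTP dumps:
# each host maps to the hostnames "found" on its page.
my %links = (
    'a.example' => [ 'b.example', 'c.example' ],
    'b.example' => [ 'c.example', 'd.example' ],
    'c.example' => [ 'a.example' ],
    'd.example' => [],
);

# Breadth-first crawl: returns hosts in the order they were visited.
sub crawl_order {
    my ($start, $graph) = @_;
    my @queue = ($start);
    my %seen  = ($start => 1);    # hosts already queued or crawled
    my @order;

    while (defined(my $host = shift @queue)) {    # FIFO: take from the front
        push @order, $host;
        for my $found (@{ $graph->{$host} || [] }) {
            next if $seen{$found}++;    # skip anything already known
            push @queue, $found;        # new hosts go to the back
        }
    }
    return @order;
}

print join(' ', crawl_order('a.example', \%links)), "\n";
# prints: a.example b.example c.example d.example
```

Taking hosts from the front of the array and adding new ones to the back is what makes the search breadth-first: everything found on the first page is crawled before anything found on the second.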
The link to the code is here if you want to check that out.
Init Process
The program starts out with a single hostname, either passed in as an argument or defaulting to slashdot.org. That host lets the spider do its first scrape of an HTTP dump and build the initial collection of hostnames needed to start the crawling process (we will talk about these steps in detail later on). Once the program has a few hosts in the queue, it enters the main loop, which repeats the process again and again (seek, dump, extract, queue hosts, etc). That being said, if you choose to specify an initial host yourself, make sure it is a site with lots of links that belong to other websites! A great example is a news site that links out to many other news sites (slashdot.org, digg.com, buy.com). The point I am trying to get at is that a starting site with no references to other sites leaves the queue empty, and the crawl stops right away.
Create HTTP Dump (port 80)
We want to make an HTTP connection (port 80) to a web host and get a dump of the HTML code that makes up the web page. This is where we can find more links to queue so the program will have more hosts to crawl. In this program, the Perl script opens the connection and sends the request itself, just as if I were typing the request by hand over telnet or netcat. If you have not seen what that output looks like, run either of the programs as suggested below.
To use netcat instead, replace the telnet command with netcat.
- $ telnet www.google.com 80
- Trying 74.125.127.100...
- Connected to google.com.
- Escape character is '^]'.
- GET / HTTP/1.0
- <enter>
- <enter>
- ... ... snip ... snip ... ...
- <lots of HTML and JavaScript scrolls by>
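The same exchange can be scripted. The sketch below is an illustration rather than the script's own code: `build_request` and `fetch_dump` are hypothetical names, the header is cut down to the bare minimum, and the 10 second timeout is an arbitrary choice. It only relies on `IO::Socket::INET` from the Perl core.

```perl
use strict;
use warnings;
use IO::Socket::INET;

my $CRLF = "\015\012";    # HTTP line ending

# Build a minimal HTTP/1.0 request for the root page of $host.
sub build_request {
    my ($host) = @_;
    return "GET / HTTP/1.0$CRLF"
         . "Host: $host$CRLF"
         . "Connection: close$CRLF"
         . $CRLF;                       # blank line ends the header
}

# Connect to port 80, send the request, and return everything the
# server sends back (headers and body). Returns '' on failure.
sub fetch_dump {
    my ($host) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
        Timeout  => 10,                 # assumed value, tune to taste
    ) or return '';

    print {$sock} build_request($host);
    local $/ = undef;                   # slurp mode: read until EOF
    my $dump = <$sock>;
    close $sock;
    return defined $dump ? $dump : '';
}

# Usage (needs network access):
#   print fetch_dump('www.example.com');
```

Returning an empty string on failure keeps the caller simple: a plain `length()` check is enough to decide whether there is anything to scrape.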
Scrape Unique Hostnames from the Output
The program is autonomous because it finds its own hosts to crawl. As previously mentioned, it uses an HTTP connection (port 80) to get a dump of a Web page, extracts the links from that dump, and adds their hostnames to the crawl queue.
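As a rough sketch of the extraction step, the function below pulls every hostname that follows `http://` out of a dump and keeps each one once. The name `extract_hosts` and the regular expression are my assumptions for illustration; the script's own matching rules may differ.

```perl
use strict;
use warnings;

# Return a reference to a list of unique, lowercased hostnames that
# appear after "http://" in $dump, in order of first appearance.
sub extract_hosts {
    my ($dump) = @_;
    my (%seen, @hosts);
    while ($dump =~ m{http://([A-Za-z0-9][A-Za-z0-9.-]*)}g) {
        my $host = lc $1;
        $host =~ s/\.+$//;            # drop a trailing dot, e.g. "example.com."
        next unless $host =~ /\./;    # ignore bare names like "localhost"
        push @hosts, $host unless $seen{$host}++;
    }
    return \@hosts;
}

my $html = '<a href="http://Example.com/a">x</a> '
         . '<a href="http://news.example.org:8080/">y</a> '
         . '<img src="http://example.com/logo.png">';
print join("\n", @{ extract_hosts($html) }), "\n";
# prints:
# example.com
# news.example.org
```

The `%seen` hash does the "unique" part: a hostname is pushed only the first time it shows up, no matter how many links point at it.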
Add the Found Hostnames to the Crawl Queue
Do Something Fun with the Output
Code
    #!/usr/bin/env perl
    #######################################################################
    ## lyle@digitalfoo.net    http://digitalfoo.net
    ##
    ## ABOUT
    ##  This program basically crawls the Web, endlessly looking for more
    ##  hosts to web scrape for even more hosts. It can easily be
    ##  extended with more code to actually make use of the hosts or even
    ##  the HTTP dump... just imagine something neat and make it!
    ##
    ## LICENSE
    ##  "THE BEER-WARE LICENSE" (Revision 42):
    ##  lyle@digitalfoo.net wrote this file. As long as you retain this
    ##  notice you can do whatever you want with this stuff. If we meet
    ##  some day, and you think this stuff is worth it, you can buy me a
    ##  beer in return. Lyle Scott
    ##
    ## FEEDBACK
    ##  If you have a bug, comment, or suggestion for a feature, contact
    ##  me via my websites or the contact posted above.
    ##
    #######################################################################

    use strict;
    use warnings;
    use IO::Socket::INET;

    ### harvest the web by doing http dumps on port 80
    ### and scrape hostnames prefixed with http:// to
    ### gain more hosts to crawl.
    ### [0] = initial host to scrape
    sub harvest ($) {
        my $start_host = shift;
        my $extracted_hosts;
        my @queue;
        my @crawled_hosts;
        my $host;
        my $h;
        my $iteration_count = 0;
        my $found_count = 0;

        # do initial dump of start host
        my $dump = http_dump($start_host);

        # get a few links to start crawling
        if (length($dump)) {
            $extracted_hosts = extract_http_hosts($dump);
            while ($h = shift(@$extracted_hosts)) {
                # exact string comparison, so one hostname cannot
                # match inside another
                next if grep { $_ eq $h } @queue;
                push(@queue, $h);
            }
            print "\nfound ", scalar(@queue),
                  " hosts on the starting host ($start_host)!\n\n";
            push(@crawled_hosts, $start_host);
        }
        else {
            print "ERROR, http_dump($start_host)\n";
            exit;
        }

        # while we have hosts to crawl
        while ($host = shift(@queue)) {
            # total count of loops... good for stats
            $iteration_count++;
            push(@crawled_hosts, $host);

            # print some statistics every 30 hosts
            if (($iteration_count % 30) == 0) {
                print "\nSTATS:> iterations: $iteration_count found: $found_count queue: "
                      . scalar(@queue) . "\n\n";
            }

            # get HTTP dump of port 80 on the host
            $dump = http_dump($host);

            # if we did not get output, skip it
            next unless length($dump);

            # extract http://(hostnames) from HTTP dump on port 80
            $extracted_hosts = extract_http_hosts($dump);

            # put what we find on queue for later processing
            my $c = 0;
            while ($h = shift(@$extracted_hosts)) {
                next if grep { $_ eq $h } @crawled_hosts;
                next if grep { $_ eq $h } @queue;
                push(@queue, $h);
                $c++;
            }
            printf("added %4d new hosts to crawl off of %s\n", $c, $host);
            $found_count++;
        }

        # TODO: change to query google for more links
        die "\n\n... all out of hosts!\n\n";
    }

    ### Utility function to trim white spaces off either end
    ### [0] string to trim
    sub trim ($) {
        my $str = $_[0];
        $str =~ s/^\s+//;
        $str =~ s/\s+$//;
        return $str;
    }

    ### NOTE: this copy of the listing is damaged at this point. The
    ### definition of extract_http_hosts() and the opening of http_dump()
    ### (socket setup, the $EOL and $EOF definitions, and the first lines
    ### of the request header) are missing. Only the tail of http_dump()
    ### survives below; see the linked source for the complete script.

            # surviving end of the User-Agent header line:
            #   ...(X11; U; FreeBSD i386; en-US; rv:1.9.0.10) Gecko...$EOL" .
            "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8$EOL" .
            "Connection: close$EOF";

            # send the header
            print $sock $header;

            # read in all the result lines into one variable
            my $http_req_results;
            {
                # slurp up all lines in the output
                local $/ = undef;
                # assign entire output to variable
                $http_req_results = <$sock>;
            }

            # return entire http dump of page source
            return $http_req_results;
        }
        else {
            warn "WARN:> http_dump: Can't connect to $_[0] : $@\n";
            warn "WARN:> Skipping it and moving on...\n";
            # empty string, so the length() checks in harvest() skip this host
            return '';
        }
    }

    #
    # make sure to start off with somewhere with a lot of links
    #
    if ($#ARGV == 0) {
        harvest($ARGV[0]);
    }
    else {
        harvest("slashdot.org");
    }

