Coding a Web Spider in Perl
  Aug 14, 2009

I thought it would be neat to write a program to spider the Web autonomously... so I wrote a Perl script that does just that!

The computer science folks would consider the spidering aspect a breadth-first search that uses a queue data structure to hold the collection of hostnames. That sounds like a mouthful, but the idea is simple (I promise!).

The program basically works by looping through the following phases:

  1. do an HTTP dump on a host (port 80) in the queue
  2. scrape unique hostnames from the output
  3. add the found hostnames to the crawl queue
  4. do something fun with the output
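
The loop above can be sketched without any networking. This is only a minimal sketch: the %links table is a made-up stand-in for the dump and scrape phases, and crawl_order is an illustrative name, not a function from the script.

```perl
use strict;
use warnings;

# Hypothetical link table standing in for phases 1 and 2:
# each host maps to the hostnames "found" on its page.
my %links = (
    'a.example' => [ 'b.example', 'c.example' ],
    'b.example' => [ 'd.example', 'a.example' ],
    'c.example' => [ 'd.example' ],
    'd.example' => [],
);

# Breadth-first crawl: returns hosts in the order they were visited.
sub crawl_order {
    my ($start, $links) = @_;
    my @queue = ($start);        # FIFO queue of hosts still to visit
    my %seen  = ($start => 1);   # hosts already queued or visited
    my @order;

    while (defined(my $host = shift @queue)) {
        push @order, $host;                           # phase 4 would go here
        for my $found (@{ $links->{$host} || [] }) {  # phases 1 and 2
            next if $seen{$found}++;                  # skip known hosts
            push @queue, $found;                      # phase 3
        }
    }
    return @order;
}

# prints: a.example b.example c.example d.example
print join(' ', crawl_order('a.example', \%links)), "\n";
```

Because new hosts go on the back of the queue and the next host comes off the front, every host found on the starting page is visited before any host found one hop further out. That ordering is what makes it breadth first.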

The link to the code is here if you want to check that out.

Init Process

The program starts out with a single hostname, either passed in as an argument or defaulting to slashdot.org. That host lets the spider do its first scrape of an HTTP dump and build the initial collection of hostnames needed to start the crawling process (we will talk about these steps in detail later on). Once the program has at least a few hosts in the queue, it enters the main loop, which repeats the process again and again (seek, dump, extract, queue hosts, etc.). That being said, if you choose to specify an initial host yourself, make sure it is a site with lots of links to other websites! A great example is a news site that links out to many other sites (slashdot.org, digg.com, buy.com). The point I am trying to get at is that we do not want a site that has no references to other sites.

Create HTTP Dump (port 80)

We want to make an HTTP connection (port 80) to a web host and get a dump of the HTML code that makes up the web page. This is where we find more links to queue, so the program will have more hosts to crawl. In this program, the Perl script initiates the dump by hand, as if I were typing an HTTP request into telnet or netcat. If you are unfamiliar with their output, run either program as shown below.

To use netcat instead, replace the telnet command with netcat.

  $ telnet www.google.com 80
  Trying 74.125.127.100...
  Connected to google.com.
  Escape character is '^]'.
  GET / HTTP/1.0
  <enter>
  <enter>
  ... ... snip ... snip ... ...
  <lots of HTML and JavaScript scrolls by>
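
The same exchange can be done from Perl with IO::Socket::INET. What follows is a minimal sketch, assuming an HTTP/1.0 request for the front page and a 10-second timeout. The names build_request and fetch_dump are illustrative and are not the functions from the script.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Build a bare-bones HTTP/1.0 request for the front page of $host.
sub build_request {
    my ($host) = @_;
    my $crlf = "\015\012";       # HTTP lines end in CR LF
    return "GET / HTTP/1.0$crlf"
         . "Host: $host$crlf"
         . "Connection: close$crlf$crlf";
}

# Connect to port 80, send the request, and return everything the
# server sends back (headers and HTML). Returns '' on failure.
sub fetch_dump {
    my ($host) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
        Timeout  => 10,
    ) or return '';

    print {$sock} build_request($host);
    local $/;                    # slurp mode: read until the server closes
    my $dump = <$sock>;
    close $sock;
    return defined $dump ? $dump : '';
}

# Only touch the network when a host is given on the command line.
print fetch_dump($ARGV[0]) if @ARGV;
```

The blank line at the end of the request is the Perl equivalent of pressing <enter> twice in the telnet session: it tells the server the header is finished.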

Scrape Unique Hostnames from the Output

The program is autonomous because it finds its own hosts to crawl. As mentioned above, it uses an HTTP connection (port 80) to get a dump of a Web page, extracts the links from the dump, and adds those hosts to the crawl queue.
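
The extraction step can be sketched with one regular expression. This is an assumption about the approach, not the script's own extract_http_hosts(): it pulls every hostname that follows http:// out of a dump, lowercases it, and drops duplicates. The name extract_hosts and the sample HTML are made up for illustration.

```perl
use strict;
use warnings;

# Return an array ref of unique hostnames that appear after "http://"
# in $dump, in order of first appearance.
sub extract_hosts {
    my ($dump) = @_;
    my (%seen, @hosts);

    # A hostname here is letters, digits, dots and hyphens; the match
    # stops at the first character outside that set (/, :, ", etc.).
    while ($dump =~ m{http://([A-Za-z0-9.-]+)}g) {
        my $host = lc $1;
        $host =~ s/\.+\z//;              # drop a trailing dot, if any
        next unless $host =~ /\./;       # require at least one dot
        push @hosts, $host unless $seen{$host}++;
    }
    return \@hosts;
}

my $html = '<a href="http://Example.org/a">a</a> '
         . '<a href="http://news.example.com:8080/">b</a> '
         . '<img src="http://example.org/logo.png">';

# prints example.org and news.example.com, one per line
print "$_\n" for @{ extract_hosts($html) };
```

A regex like this is crude (it will happily pick up hostnames inside comments or scripts), but for a spider that only wants more places to visit, that is usually good enough.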

Add the Found Hostnames to the Crawl Queue
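
One way to do this bookkeeping is a hash keyed on hostname, so that each host enters the queue at most once. This is a minimal sketch rather than the script's own code; enqueue_new, @queue, and %seen are illustrative names.

```perl
use strict;
use warnings;

my @queue;   # hosts waiting to be crawled, oldest first
my %seen;    # every host that has ever been queued

# Append hosts that have not been seen before; return how many were added.
sub enqueue_new {
    my ($queue, $seen, @hosts) = @_;
    my $added = 0;
    for my $host (@hosts) {
        next if $seen->{$host}++;   # exact match, constant-time lookup
        push @$queue, $host;
        $added++;
    }
    return $added;
}

my $first  = enqueue_new(\@queue, \%seen, 'a.example', 'b.example', 'a.example');
my $second = enqueue_new(\@queue, \%seen, 'b.example', 'c.example');

# prints: added 2, then 1; queue: a.example b.example c.example
print "added $first, then $second; queue: @queue\n";
```

Since %seen is never emptied, a host that has already been crawled and shifted off the queue is still rejected if it turns up again later.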

Do Something Fun with the Output

Code

#!/usr/bin/env perl
 
#######################################################################
## lyle@digitalfoo.net    http://digitalfoo.net
##
## ABOUT
##   This program basically crawls the Web, endlessly looking for more
##   hosts to web scrape for even more hosts. It can easily be
##   extended with more code to actually make use of the hosts or even
##   the HTTP dump... just imagine something neat and make it!
##
## LICENSE
##   "THE BEER-WARE LICENSE" (Revision 42):
##   lyle@digitalfoo.net wrote this file. As long as you retain this 
##   notice you can do whatever you want with this stuff. If we meet
##   some day, and you think this stuff is worth it, you can buy me a
##   beer in return.  Lyle Scott
##
## FEEDBACK
##   If you have a bug, comment, or suggestion for a feature, contact
##   me via my websites or the contact posted above.
##
#######################################################################
 
use strict;
use warnings;
 
use IO::Socket::INET;
 
### harvest the web by doing http dumps on port 80 
### and scrape hostnames prefixed with http:// to 
### gain more hosts to crawl.
### [0] = initial host to scrape
sub harvest ($)
{
    my $start_host = shift;
    my $extracted_hosts;
    my @queue;
    my @crawled_hosts;
    my $host;
    my $h;
    
    my $iteration_count = 0;
    my $found_count = 0;
    
 
    # do initial dump of start host
    my $dump = http_dump($start_host); 
    
    # get a few links to start crawling
    if (length($dump)) {
        $extracted_hosts = extract_http_hosts($dump);
 
        while ($h = shift(@$extracted_hosts)) {
            # compare exactly; /$h/ would treat the dots as regex wildcards
            next if grep { $_ eq $h } @queue;
            push(@queue, $h);
        }
        
        print "\nfound ", scalar(@queue), " hosts on the starting host ($start_host)!\n\n";
        push(@crawled_hosts, $start_host);
    }
    else {
        print "ERROR, http_dump($start_host)\n";
        exit;
    }
 
 
    # while we have hosts to crawl
    while ($host = shift(@queue)) {
 
        # total count of loops... good for stats 
        $iteration_count++;
        
        push (@crawled_hosts, $host);
        
        # print some statistics every 30 hosts
        if (($iteration_count % 30) == 0) {
            print "\nSTATS:> iterations: $iteration_count     found: $found_count    queue: " . scalar(@queue) . "\n\n";
        }   
        
        # get HTTP dump of port 80 on the host
        $dump = http_dump($host);
 
        # if we did not get output, skip it
        next unless length($dump);
        
        # extract http://(hostnames) from HTTP dump on port 80
        $extracted_hosts = extract_http_hosts($dump);
 
        # put what we find on queue for later processing
        my $c = 0;
        my $bail = 0;
 
        while ($h = shift(@$extracted_hosts)) {
            # exact comparison, so "a.com" does not match "aa.com.au"
            next if grep { $_ eq $h } @crawled_hosts;
            next if grep { $_ eq $h } @queue;
            push(@queue, $h);
            $c++;
        }
 
        printf ("added %4d new hosts to crawl off of  %s\n", $c, $host);
        $found_count++;
    }
    
    # TODO: change to query google for more links
    die "\n\n... all out of hosts!\n\n";
}
 
 
 
 
### Utility function to trim white spaces off either end
### [0] string to trim
sub trim ($) 
{
    my $str = $_[0];
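    $str =~ s/^\s+//;
    $str =~ s/\s+$//;
    return $str;
}
 
 
 
### NOTE: the listing is damaged at this point. The extract_http_hosts()
### sub that harvest() calls is missing, and only the tail of
### http_dump() survives. The opening of http_dump() below (socket
### options, $EOL/$EOF values, request line, Host header) is a
### reconstruction and should be read as an assumption.
 
### Do an HTTP dump of port 80 on a host
### [0] = host to dump
sub http_dump ($)
{
    my $host = $_[0];
    my $EOL  = "\015\012";   # CRLF line ending (assumed)
    my $EOF  = $EOL x 2;     # blank line that ends the header (assumed)
 
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => 80,
        Proto    => 'tcp',
        Timeout  => 10,
    );
 
    if ($sock) {
        # the Gecko build string is elided in the surviving listing
        my $header = "GET / HTTP/1.0$EOL" .
                     "Host: $host$EOL" .
                     "User-Agent: Mozilla/5.0 (X11; U; FreeBSD i386; en-US; rv:1.9.0.10) Gecko/... Firefox/3.0.10$EOL" .
                     "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8$EOL" .
                     "Connection: close$EOF";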
 
        # send the header             
        print $sock $header;
        
        # read in all the result lines into one variable
        my $http_req_results;
 
        {
            # slurp up all lines in the output;
            local $/ = undef; 
        
            # assign entire output to variable
            $http_req_results = <$sock>;
        }
        
        # return entire http dump of page source
        return $http_req_results;
    }
    else {
        warn "WARN:> http_dump: Can't connect to $_[0] : $@\n";
        warn "WARN:> Skipping it and moving on...\n";
        return '';   # empty string, so the callers' length($dump) checks skip this host
    }
}
 
 
 
#
# make sure to start off with somewhere with a lot of links
#
if ($#ARGV == 0) {
    harvest($ARGV[0]);
}
else {
    harvest("slashdot.org");
}


