created: Aug 14, 2009

I thought it would be neat to write a program to spider the Web autonomously... so I wrote a Perl script that does just that!

The computer science folks would call the spidering aspect a breadth-first search that uses a queue data structure to hold the collection of hostnames. That sounds like a mouthful, but the idea is simple (I promise!).

The program basically works by looping through the following phases:

  1. do an HTTP dump on a host (port 80) from the queue
  2. scrape unique hostnames from the output
  3. add the found hostnames to the crawl queue
  4. do something fun with the output
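
To make the loop above concrete, here is a rough sketch of the four phases. It is written in Python, not Perl, and it is not the linked script. The names http_dump, scrape_hosts, crawl, the hostname regex, and the max_hosts cap are all made up for illustration. It assumes plain HTTP on port 80 and only picks up hostnames that appear in absolute http:// links.

```python
import re
import socket
from collections import deque

# Assumed pattern: grabs the host part of absolute http:// links only.
HOST_RE = re.compile(r"http://([A-Za-z0-9.-]+)", re.IGNORECASE)


def http_dump(host, timeout=5.0):
    """Phase 1: send a bare HTTP/1.0 GET to port 80, return the raw response."""
    request = f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    with socket.create_connection((host, 80), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("latin-1")


def scrape_hosts(text):
    """Phase 2: unique hostnames found in text, in order of first appearance."""
    found = []
    for host in HOST_RE.findall(text):
        host = host.lower().strip(".")
        if host and host not in found:
            found.append(host)
    return found


def crawl(seed, fetch=http_dump, handle=None, max_hosts=10):
    """Breadth-first crawl from seed; returns hosts in the order fetched."""
    queue = deque([seed])  # FIFO queue is what makes this breadth-first
    queued = {seed}        # every host ever queued, so none is visited twice
    fetched = []
    while queue and len(fetched) < max_hosts:
        host = queue.popleft()
        try:
            output = fetch(host)
        except (OSError, UnicodeError):
            continue  # unreachable or unusable host: skip it
        fetched.append(host)
        for name in scrape_hosts(output):  # Phase 3: grow the queue
            if name not in queued:
                queued.add(name)
                queue.append(name)
        if handle is not None:
            handle(host, output)  # Phase 4: do something fun
    return fetched
```

Calling crawl("example.com") would hit the live network, so for experiments it is easier to pass your own fetch function that returns canned text. The max_hosts cap is there because an uncapped crawl has no natural stopping point.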

The link to the code is here if you want to check that out.