The Continued Struggle With Spiders

I have been trying to stop spiders from downloading files from my blog.  Not the pages themselves, just the files that I have made available for anyone interested.  I outlined some of this effort in my previous posts on the Robots.txt file.  The problem is that even with all of these precautions, some spiders do not follow the rules and skip your Robots.txt instructions entirely.  Very annoying indeed.  So how can I ensure that the list of downloads stays accurate?  Well, one additional step I have recently taken, albeit a drastic one, is to create an extension that blocks spiders in the FileHandler.Serving event handler.

I start with a list of known spiders that have been hitting my server.  Currently it’s in my web.config, but it could just as easily be part of the Extension Manager settings:

  <appSettings>
    <add key="crawlers" value="spider, rulinki, Sogou, Baidu, baiduspider, googlebot, msnbot, Rambler, slurp, AbachoBOT, Accoona, AcoiRobot, ASPSeek, CrocCrawler, Dumbot, FAST-WebCrawler, GeonaBot, Gigabot, Lycos, MSRBOT, Scooter, AltaVista, IDBot, eStyle, Scrubby, Yahoo" />
  </appSettings>

Now I need to hook the extension into the FileHandler.Serving event:

FileHandler.Serving += new EventHandler<EventArgs>(File_Downloading);

Next I load the list from the web.config file into a string array:

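// Load list of crawlers (fall back to an empty string if the key is missing)
string raw = ConfigurationManager.AppSettings["crawlers"] ?? string.Empty;

// Convert comma delimited list into a string array, skipping empty entries
string[] crawlers = raw.Split(new char[] { ',' }, StringSplitOptions.RemoveEmptyEntries);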
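One thing to watch with this list: the entries are separated by a comma followed by a space, so a plain Split leaves a leading space on most of the names.  Here is a minimal sketch of a parsing helper that trims each entry.  The class name CrawlerList and the method Parse are made up for illustration and are not part of the extension:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original extension.
public static class CrawlerList
{
    // Splits a comma-delimited string into trimmed, non-empty entries.
    public static List<string> Parse(string raw)
    {
        List<string> result = new List<string>();
        if (string.IsNullOrEmpty(raw))
        {
            // Missing or empty setting: nothing to block.
            return result;
        }

        foreach (string entry in raw.Split(new char[] { ',' }))
        {
            // Remove the whitespace that follows each comma.
            string name = entry.Trim();
            if (name.Length > 0)
            {
                result.Add(name);
            }
        }
        return result;
    }
}
```

In the extension this could be called as CrawlerList.Parse(ConfigurationManager.AppSettings["crawlers"]), assuming the web.config key shown earlier.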

Then I loop through each entry in the string array and see if the value appears in the Request.UserAgent:

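foreach (string crawler in crawlers)
{
    // Drop the space left after each comma and treat the name as literal text
    string pattern = Regex.Escape(crawler.Trim());
    if (pattern.Length == 0) continue;

    // UserAgent is null when the header is missing, and Regex.IsMatch rejects null input
    string userAgent = context.Request.UserAgent ?? string.Empty;

    // Case-insensitive match against the crawler name
    bool isMatch = Regex.IsMatch(userAgent, pattern, RegexOptions.IgnoreCase);
    if (isMatch)
    {
        context.Response.StatusCode = 403;
        context.Response.End();
        break;
    }
}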
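Since the crawler names are plain text rather than patterns, the same check can be done without Regex at all.  The following is only a sketch of that alternative, using String.IndexOf with a case-insensitive comparison instead of the Regex call in the loop above.  CrawlerMatcher and IsCrawler are hypothetical names, not something that ships with the extension:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: a plain substring check in place of Regex.IsMatch.
public static class CrawlerMatcher
{
    // Returns true when the user agent contains any of the crawler names,
    // ignoring case. A missing user agent never matches.
    public static bool IsCrawler(string userAgent, IEnumerable<string> crawlers)
    {
        if (string.IsNullOrEmpty(userAgent) || crawlers == null)
        {
            return false;
        }

        foreach (string crawler in crawlers)
        {
            if (crawler == null)
            {
                continue;
            }

            // Trim the whitespace left over from the comma-delimited list.
            string name = crawler.Trim();
            if (name.Length > 0 &&
                userAgent.IndexOf(name, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return true;
            }
        }
        return false;
    }
}
```

Inside the event handler the call would be something like CrawlerMatcher.IsCrawler(context.Request.UserAgent, crawlers), followed by the same 403 response when it returns true.  Keeping the check in its own method also makes it easy to try against user agent strings copied from the IIS logs.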

And that’s all there is to it.  Now when a spider attempts to download a file through the file handler, the extension checks the Request.UserAgent against my list.  Whenever there is a match I see the following in my IIS logs:

2008-06-27 18:16:01 XXX.XXX.XXX.XXX GET /post/2007/12/file.axd file=makecert.zip 80 - 220.181.32.56 Baiduspider+(+http://www.baidu.com/search/spider.htm) - 403 0 0 300

Again, this is not completely foolproof, since some of the nasty spiders don’t populate their UserAgent at all.  Still working on blocking those guys…

CrawlerBlockingExtension.rar (1.08 kb)

