
Creepy Crawlers and Files Downloaded

 

A little while ago I started to suspect that my file download stats weren't all that accurate.  It seemed odd to me that every file was downloaded every day.  So I did a little research and uncovered a problem with the current strategy.

First I took a look at the IIS logs and filtered my downloads using the tail command:

tail -100 c:\logfiles\w3svc1\u_ex080505 | find "/file.axd"

This produced a list of all the files downloaded through the file handler.  Next I looked at a specific file (e.g. Encrypt_Web_Config.zip) to see who was downloading it.  Several download entries looked like the following:

[Image: IIS log entries for Encrypt_Web_Config.zip]

If you look carefully you can see that three downloads occurred for the Encrypt_Web_Config.zip file.  You should also notice that one of them lists www.metadatalabs.com/mlbot as its source.  I'm just guessing here, but it appears that the client named mlbot is a web crawler and not an actual web user.

So it seems safe to conclude that the Count Files Downloaded extension does not accurately reflect the actual download count of my files, since crawler requests are counted along with real users.
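To get a rough feel for how much of the count comes from crawlers, the log can be split by user agent. The following is a minimal Python sketch, not part of the extension. It assumes the log is in the W3C extended format with a #Fields: line that includes cs-uri-stem and cs(User-Agent), and it guesses at crawlers with a short list of substrings. The sample entries, IP addresses, and user agent strings are made up for illustration.

```python
from collections import Counter

# Substrings that often appear in crawler user agents.
# This is a heuristic and an assumption, not a complete list.
BOT_HINTS = ("bot", "crawler", "spider", "slurp")

def count_downloads(log_lines, handler="/file.axd"):
    """Count requests to the handler, split into 'bot' and 'human'."""
    fields = []
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The directive names the columns of the entries that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        # IIS writes spaces inside a value as '+', so splitting on spaces is safe.
        entry = dict(zip(fields, line.split(" ")))
        if entry.get("cs-uri-stem", "").lower() != handler:
            continue
        agent = entry.get("cs(User-Agent)", "").lower()
        kind = "bot" if any(hint in agent for hint in BOT_HINTS) else "human"
        counts[kind] += 1
    return counts

# Hypothetical log excerpt: the field layout and values are illustrative only.
sample = [
    "#Fields: date time cs-method cs-uri-stem cs-uri-query c-ip cs(User-Agent) sc-status",
    "2008-05-05 10:00:00 GET /file.axd file=Encrypt_Web_Config.zip 192.0.2.10 Mozilla/5.0+(Windows;+U) 200",
    "2008-05-05 10:05:00 GET /file.axd file=Encrypt_Web_Config.zip 192.0.2.20 MLBot+(www.metadatalabs.com/mlbot) 200",
    "2008-05-05 10:06:00 GET /default.aspx - 192.0.2.10 Mozilla/5.0+(Windows;+U) 200",
]
print(count_downloads(sample))
```

A substring check like this will miss crawlers that pretend to be browsers, so the "human" number is best read as an upper bound.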

Great.  Now what?  I started to think about how best to correct this issue, and went looking at some sites that discuss the robots.txt file and dynamic content.  After reading several pages I decided to make an addition to my robots.txt file.  Notice line 6, which uses the /*file= syntax:

User-agent: *
Disallow: /login.aspx
Disallow: /search.aspx
Disallow: /error404.aspx
Disallow: /archive.aspx
Disallow: /*file=

User-agent: Slurp
Disallow: /*?file=
Disallow: /*.zip$
Disallow: /*.rar$

sitemap: http://www.dscoduc.com/sitemap.axd

To validate this I went to the Google Webmaster Tools page and tested it out.  According to Google this should block their web crawler from crawling the file downloads.  I don't know how it will affect other web crawlers, so we will have to watch the logs and see if it makes a difference.
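For a quick local sanity check of the wildcard rules, the matching can be approximated in a few lines of Python. This is only a sketch under the assumption that a crawler treats * as "any run of characters" and a trailing $ as "end of URL", which is how Google documents it. It ignores Allow lines and rule precedence, and the test paths are hypothetical examples rather than entries taken from my logs.

```python
import re

def rule_to_regex(rule):
    """Translate a robots.txt path rule with * and $ into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with "match anything".
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(path, disallow_rules):
    """Return True if any Disallow rule matches from the start of the path."""
    return any(rule_to_regex(rule).match(path) for rule in disallow_rules)

# The Disallow rules from the "User-agent: *" group above.
STAR_RULES = ["/login.aspx", "/search.aspx", "/error404.aspx",
              "/archive.aspx", "/*file="]

# Hypothetical request paths used only to exercise the rules.
for path in ["/file.axd?file=Encrypt_Web_Config.zip",
             "/default.aspx",
             "/search.aspx?q=iis"]:
    print(path, is_blocked(path, STAR_RULES))
```

Crawlers that do not understand wildcards may read /*file= as a literal path and ignore it, which is another reason to keep watching the logs.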

Is this a good thing?  I'm not really sure, and I would love your input and opinion.
