Posted June 17, 2008 and filed under Technology

If you haven’t paid much attention to your Robots.txt file, you aren’t alone.  When I come across a blog or website I will often, out of curiosity, have a look at its Robots.txt file to see whether the owner has updated it.  For example, a quick look at www.LearnMSNet.com shows that the Robots.txt file is the one that ships with BlogEngine.NET.  In another example, http://www.ckurl.com/techblog/robots.txt does have a robots.txt file, albeit the one that ships with BlogEngine.NET, but only at the blog sub-level of the domain.  At the root of the domain, www.ckurl.com, where the Robots.txt file should be placed, there doesn’t seem to be one at all.
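Crawlers look for the file at the root of the host, not in a sub-folder.  As a minimal sketch (the helper name root_robots_url is my own, purely for illustration), this is how you could work out where a crawler will look for the file, given any URL on the site:

```python
from urllib.parse import urljoin


def root_robots_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the host serving page_url."""
    # The leading slash makes urljoin drop the path of page_url
    # and keep only its scheme and host.
    return urljoin(page_url, "/robots.txt")


print(root_robots_url("http://www.ckurl.com/techblog/robots.txt"))
```

For the ckurl.com example this points at the domain root rather than the /techblog/ folder, which is exactly the location where no file was found.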

This isn’t a huge deal if you aren’t too interested in your blog showing up in a search result.  But I suspect that many people do want their blogs to be searchable but just don’t pay any attention to this aspect of configuration.

Today I wanted to verify that crawlers are excluded from the parts of my blog that I don’t want them to access, for example download files, the contact page, and OPML links.  Pretty much all of the HTTP handlers (.axd) should be excluded.  After doing some reading I completed a Robots.txt file that appears to satisfy my needs.

UPDATE (6/18/08):  I have revised my Robots.txt from my original post to include some specific crawlers.

User-agent: Slurp
Disallow: /*.rar$
Disallow: /*.zip$
Disallow: /*.exe$
Disallow: /*.txt$

User-agent: msnbot
Disallow: /*.rar$
Disallow: /*.zip$
Disallow: /*.exe$
Disallow: /*.txt$

User-agent: ia_archiver
User-agent: Sosospider
User-agent: sogou
User-agent: BecomeBot
Disallow: /

User-agent: *
Disallow: /login.aspx
Disallow: /search.aspx
Disallow: /error404.aspx
Disallow: /archive.aspx
Disallow: /contact.aspx
Disallow: /file.axd
Disallow: /js.axd
Disallow: /image.axd
Disallow: /opml.axd
Disallow: /css.axd

sitemap: http://www.dscoduc.com/sitemap.axd
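The rules of the form /*.rar$ rely on wildcard matching, which the original robots.txt convention did not define and which only some crawlers understand.  The sketch below shows how I understand those crawlers to read such a pattern: * matches any run of characters and a trailing $ anchors the pattern to the end of the URL.  The function rule_matches is a hypothetical helper written for this post, not part of any crawler or library.

```python
import re


def rule_matches(pattern: str, url_path: str) -> bool:
    """Check a robots.txt path pattern against a URL path (including query)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then turn each escaped '*' into '.*'.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if anchored:
        # A trailing '$' means the whole URL path must match the pattern.
        return re.fullmatch(regex, url_path) is not None
    # Without '$' a rule is a prefix match from the start of the path.
    return re.match(regex, url_path) is not None


print(rule_matches("/*.rar$", "/file.axd?file=Manage_Files.rar"))
print(rule_matches("/*.zip$", "/file.axd?file=Manage_Files.rar"))
```

Under these assumptions the .rar rule matches the download link because the URL ends in .rar, while the .zip rule does not.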

 

How do I know this works correctly?  Well, there are a couple of tools out there that can assist you.  For example, have a look at the Robots Tester site, which will tell you whether there are any formatting errors.  The output will look something like the following:

[Image: Robots Tester output]

After you have confirmed the syntax of your file, it is a good idea to see whether Google interprets the instructions correctly.  Head over to Google Webmaster Tools and open the Tools section.  There you will find the option “Analyze robots.txt”, which lets you enter a URL to verify.  In my case I wanted to validate that downloadable files were being disallowed, so I entered the address http://www.dscoduc.com/file.axd?file=Manage_Files.rar.  Here were the results:

[Image: Google Webmaster Tools “Analyze robots.txt” results]

According to the results, Google will skip any links to my download files…  Goodness!
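If you would rather check from a script, Python’s standard library includes a robots.txt parser.  The sketch below feeds it a shortened copy of my file (only a few of the rules, chosen for illustration) and asks the same question I asked Google.  One caveat: as far as I can tell this parser does plain prefix matching and does not interpret the * and $ wildcards, so I only test the plain rules here.  The helper name is_allowed is my own.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the file above; only plain (non-wildcard) rules.
ROBOTS_TXT = """\
User-agent: ia_archiver
User-agent: Sosospider
User-agent: sogou
User-agent: BecomeBot
Disallow: /

User-agent: *
Disallow: /login.aspx
Disallow: /contact.aspx
Disallow: /file.axd
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if ROBOTS_TXT lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed("Googlebot", "http://www.dscoduc.com/file.axd?file=Manage_Files.rar"))
print(is_allowed("Googlebot", "http://www.dscoduc.com/"))
print(is_allowed("ia_archiver", "http://www.dscoduc.com/"))
```

The first and third checks should come back disallowed and the second allowed.  This is only a sanity check of how one parser reads the file; each crawler has its own implementation.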

NOTE: Not all web crawlers respect the Robots.txt file; some will access the links in the Disallow list anyway.  This really can’t be helped without explicitly blocking those crawlers at the address level (does anyone know of a better way?).

What are your thoughts?  Does it make sense to spend so much time worrying about the Robots.txt file?
