The necessary evil. Web crawling has become pervasive across the Internet. There seem to be hundreds, if not thousands, of web crawlers and bots out there combing the web for anything and everything. Not a big deal, right? Some crawling is welcome as a necessary way to get your content into the search engines. Fortunately, the robots.txt file lets you tell crawlers what content they may crawl and, for crawlers that honor the non-standard Crawl-delay directive, how often they may come back. But do you really have control? Looking over my IIS logs, it is obvious that some web crawlers do not follow robots.txt. If you take a moment to look over your own IIS logs, you can see repeated connections by web crawlers that seemingly go over the same content again and again. A couple of examples:
Baiduspider+(+http://www.baidu.com/search/spider.htm)
Sosospider+(+http://help.soso.com/webspider.htm)
Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html)
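To see which crawlers hit a site most often, the log entries can be tallied by user agent. What follows is a minimal sketch, assuming the log is in the W3C extended format and that the cs(User-Agent) field is being logged. The names LogStats and CountByUserAgent are made up for illustration and are not part of any library.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the extension described in this post.
public static class LogStats
{
    // Counts requests per user agent in W3C extended log lines.
    public static Dictionary<string, int> CountByUserAgent(IEnumerable<string> lines)
    {
        var counts = new Dictionary<string, int>();
        int uaIndex = -1; // column of cs(User-Agent), taken from the #Fields directive

        foreach (string line in lines)
        {
            if (line.StartsWith("#Fields:", StringComparison.Ordinal))
            {
                // The directive lists the column names for the entries that follow it.
                string[] names = line.Substring("#Fields:".Length)
                    .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
                uaIndex = Array.IndexOf(names, "cs(User-Agent)");
                continue;
            }
            if (line.Length == 0 || line[0] == '#' || uaIndex < 0)
                continue; // other directives, blank lines, or no field list seen yet

            string[] fields = line.Split(' ');
            if (fields.Length <= uaIndex)
                continue; // malformed or truncated entry

            string ua = fields[uaIndex];
            counts.TryGetValue(ua, out int n); // n is 0 when the key is new
            counts[ua] = n + 1;
        }
        return counts;
    }
}
```

Passing it the lines of a log file, for example via File.ReadLines, and sorting the result by count should surface the busiest crawlers quickly.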
Is this really necessary? I decided to write about it because the abusive behavior causes unnecessary traffic and increased load on my server. This really started to bug me, so I decided to try a little experiment. What if I looked at the incoming UserAgent and sent an Access Denied response (HTTP 403) to those that I didn’t want crawling my site?
I assumed it wouldn’t be too hard to write a BlogEngine Extension that would let me control the entries using the Extension Manager. So I wrote an extension and have been testing it on my site with much success. I use the Regular Expression engine to compare the UserAgent Filters against the incoming UserAgent. I took advantage of an article that Mads posted regarding Regex.Escape to handle special characters, such as the + in Baiduspider+, that would otherwise cause problems with the regular expression, so thanks Mads!
You have to be careful with the UserAgent Filters you specify because you could end up locking yourself (and everyone else) out of the website. For example, if you added “Mozilla” to the UserAgent Filter list, then anyone connecting with IE or Firefox would be denied the connection. So I started with the following UserAgent Filters:
Sogou+web+spider
Baiduspider+
Sosospider+
larbin_2.6.3+
twiceler-0.9
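The + at the end of several of these filters is why escaping matters. As a rough illustration, the sketch below compares matching a filter as a raw pattern with matching it after Regex.Escape. The FilterDemo class and its method names are hypothetical and are not the extension's code; only Regex.Escape and Regex.IsMatch are framework calls.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class for comparing raw and escaped filters.
public static class FilterDemo
{
    // Treats the filter as a regular expression as-is.
    public static bool MatchesRaw(string userAgent, string filter)
    {
        return Regex.IsMatch(userAgent, filter, RegexOptions.IgnoreCase);
    }

    // Escapes the filter first so every character is matched literally.
    public static bool MatchesEscaped(string userAgent, string filter)
    {
        return Regex.IsMatch(userAgent, Regex.Escape(filter), RegexOptions.IgnoreCase);
    }
}
```

Unescaped, Baiduspider+ means "Baiduspide" followed by one or more "r", so MatchesRaw also accepts a string like Baiduspider-image, while MatchesEscaped requires the literal + character. One caveat worth checking: IIS writes spaces in the user agent as + in its logs, so a filter copied from a log line, such as Sogou+web+spider, may need real spaces to match the value in Request.UserAgent.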
Here is the meat of the code:
HttpContext context = HttpContext.Current;
if (context.Request.UserAgent != null)
{
    DataTable table = _settings.GetDataTable();
    // Compare the current UserAgent against the UserAgent Filter list for a match
    foreach (DataRow row in table.Rows)
    {
        // Escape the filter so characters such as '+' are matched literally
        string escExpression = Regex.Escape((string)row["UserAgentFilter"]);
        // Combine the flags with | (bitwise OR); & would yield RegexOptions.None
        if (Regex.IsMatch(context.Request.UserAgent, escExpression, RegexOptions.Compiled | RegexOptions.IgnoreCase))
        {
            // Deny the request and stop processing it
            context.Response.StatusCode = 403;
            context.Response.End();
            break;
        }
    }
}
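Because the filter list rarely changes, one alternative to escaping and matching each filter on every request is to combine the escaped filters into a single case-insensitive pattern once and reuse it. This is a different arrangement from the loop above, offered only as a sketch. UserAgentBlocker and IsBlocked are hypothetical names, and the filters are passed in directly rather than read from the extension settings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper; not the code shipped in the extension.
public sealed class UserAgentBlocker
{
    private readonly Regex _pattern; // stays null when there are no usable filters

    public UserAgentBlocker(IEnumerable<string> filters)
    {
        // Skip blank filters (an empty pattern matches every user agent),
        // escape the rest so they are matched literally, and join them
        // with | (regex alternation).
        string[] escaped = filters
            .Where(f => !string.IsNullOrWhiteSpace(f))
            .Select(f => Regex.Escape(f.Trim()))
            .ToArray();

        if (escaped.Length > 0)
        {
            _pattern = new Regex(string.Join("|", escaped),
                RegexOptions.Compiled | RegexOptions.IgnoreCase);
        }
    }

    // Returns true when the user agent contains any of the filters.
    public bool IsBlocked(string userAgent)
    {
        return _pattern != null && userAgent != null && _pattern.IsMatch(userAgent);
    }
}
```

Skipping blank filters also guards against the lockout scenario: an empty filter would otherwise match every visitor. In a request handler, a true result from IsBlocked would lead to the same 403 response as in the loop above.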
An excellent place to look up UserAgent strings is the BotsVsBrowsers website. If you have any questions or concerns, please feel free to drop me a note.
UserAgentBlocking.rar