I have been experimenting with several ways to stop hits to my blog from spiders. Not all spiders, just the creepy crawlers that don’t honor the robots.txt file. Initially I wrote an extension that ran while a page or post was being served and compared a list of strings against the UserAgent supplied with the request. As with most things in life, there is always room for improvement, so now I want to share the latest and greatest way to manage access to your site.
The latest version, which I have named GateKeeper, is an HTTP module (it is registered under <httpModules>) that hooks into each request at the earliest stage of processing. It also checks both the UserAgent and the IPAddress of the request against lists of filters.
To get started, copy the GateKeeper.cs file into the ~/App_Code folder. GateKeeper is registered as an HTTP module rather than loaded as an extension, so it doesn’t need to go in the Extensions folder; the App_Code root is fine.
Next, update your web.config file to register the new module. Go to the <httpModules> section and add the following entry:
<httpModules>
...
<add name="GateKeeper" type="GateKeeper"/>
...
</httpModules>
Next you need to add some entries in the <appSettings> section:
<!-- Determines if a log entry should be added to the ~/app_data/gatekeeper.xml log file -->
<add key="EnableLogging" value="True"/>
<!-- Case-insensitive string used to match against the browser UserAgent -->
<add key="UserAgentFilter" value="Sogou, Baiduspider, Sosospider, twiceler, larbin"/>
<!-- Examples of valid IPAddress: 127.0.0.1, 127.0.0.0/24, 127.0.0.0-127.0.0.255 -->
<add key="IPAddressFilter" value="127.0.0.2-127.0.0.3"/>
Finally, configure the EnableLogging, UserAgentFilter, and IPAddressFilter settings, each described below:
EnableLogging Option
The first setting worth noting is EnableLogging, which turns logging of each blocked attempt on or off. The file ~/app_data/gatekeeper.xml is generated automatically if it doesn’t already exist, and blocked requests are grouped into either the useragent or the ipaddress section of the XML file:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<gatekeeper>
<useragent>
<entry date="9/8/2008 9:02:50 PM" url="http://localhost/default.aspx" useragent="Opera/9.25 (Windows NT 6.0; U; en)" ipaddress="127.0.0.1" />
</useragent>
<ipaddress>
<entry date="9/8/2008 8:49:48 PM" url="http://localhost/default.aspx" useragent="Mozilla/5.0 ...)" ipaddress="127.0.0.1" source="IPAddress" />
</ipaddress>
</gatekeeper>
It’s worth pointing out that the gatekeeper.xml file can grow very large, depending on your filters and the number of blocked connections. So keep an eye on it!
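One way to keep an eye on the log is to count the entries in each section from time to time. The following is a minimal sketch, written in Python purely for illustration (it is not part of GateKeeper). It assumes the log has the layout shown above, and the function name count_blocked is hypothetical.

```python
import xml.etree.ElementTree as ET


def count_blocked(xml_text):
    """Return the number of <entry> elements in each section of the log."""
    root = ET.fromstring(xml_text)
    # Each child of <gatekeeper> is a section (useragent or ipaddress in the sample).
    return {section.tag: len(section.findall("entry")) for section in root}


# Sample data copied from the log excerpt above.
sample = """<gatekeeper>
  <useragent>
    <entry date="9/8/2008 9:02:50 PM" url="http://localhost/default.aspx" useragent="Opera/9.25 (Windows NT 6.0; U; en)" ipaddress="127.0.0.1" />
  </useragent>
  <ipaddress>
    <entry date="9/8/2008 8:49:48 PM" url="http://localhost/default.aspx" useragent="Mozilla/5.0 ...)" ipaddress="127.0.0.1" source="IPAddress" />
  </ipaddress>
</gatekeeper>"""

print(count_blocked(sample))
```

If the counts climb quickly, that is a hint to archive or trim the file, or to switch EnableLogging off.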
UserAgentFilter Option
The UserAgentFilter is a comma-separated list of strings, each of which is compared against the Request.UserAgent string; matching is case-insensitive. Be careful with the strings you enter here, as an overly broad one could accidentally block legitimate browsers.
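To make the matching behaviour concrete, here is a rough sketch in Python (for illustration only; it is not the code in GateKeeper.cs). It assumes each filter string is matched as a case-insensitive substring of the user agent and that an empty user agent is simply not matched. Both are assumptions, and the function names are hypothetical.

```python
def parse_filter(value):
    """Split a comma-separated filter value into trimmed, non-empty strings."""
    return [token.strip() for token in value.split(",") if token.strip()]


def is_blocked_user_agent(user_agent, filter_value):
    """Return True if any filter string occurs in the user agent, ignoring case."""
    # Assumption: a missing or empty user agent is not matched by this filter.
    if not user_agent:
        return False
    lowered = user_agent.lower()
    return any(token.lower() in lowered for token in parse_filter(filter_value))


filters = "Sogou, Baiduspider, Sosospider, twiceler, larbin"
print(is_blocked_user_agent("Mozilla/5.0 (compatible; Baiduspider/2.0)", filters))
print(is_blocked_user_agent("Opera/9.25 (Windows NT 6.0; U; en)", filters))
# A filter that is too broad blocks real browsers as well:
print(is_blocked_user_agent("Opera/9.25 (Windows NT 6.0; U; en)", "opera"))
```

The last call shows the risk mentioned above: under substring matching, a short or generic filter string such as "opera" or "mozilla" would catch ordinary visitors too.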
IPAddressFilter Option
The IPAddressFilter option is the best part of this utility. I integrated some excellent code from Bo Norgaard along with some fancy regular expressions, so you can enter IP addresses in three different formats:
- Single IP Address (ex. 127.0.0.1)
- IP Address with Mask (ex. 127.0.0.0/24)
- IP Address Range (ex. 127.0.0.0-127.0.0.255)
You can mix and match these formats in your filter list by separating the entries with commas.
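The three formats can be sketched as follows. This is an illustration in Python using the standard ipaddress module, not the regular-expression approach used in GateKeeper.cs. It assumes that ranges are inclusive at both ends and that the mask format is a CIDR prefix length; the function names are hypothetical.

```python
import ipaddress


def matches_entry(ip, entry):
    """Check one address against a single filter entry in any of the three formats."""
    addr = ipaddress.ip_address(ip)
    entry = entry.strip()
    if "/" in entry:
        # Address with mask, e.g. 127.0.0.0/24
        return addr in ipaddress.ip_network(entry, strict=False)
    if "-" in entry:
        # Address range, e.g. 127.0.0.0-127.0.0.255 (assumed inclusive)
        start, end = (ipaddress.ip_address(part.strip()) for part in entry.split("-", 1))
        return start <= addr <= end
    # Single address, e.g. 127.0.0.1
    return addr == ipaddress.ip_address(entry)


def is_blocked_ip(ip, filter_value):
    """Return True if the address matches any entry in the comma-separated list."""
    entries = [e for e in filter_value.split(",") if e.strip()]
    return any(matches_entry(ip, e) for e in entries)


mixed = "10.0.0.5, 192.168.1.0/24, 127.0.0.2-127.0.0.3"
print(is_blocked_ip("127.0.0.2", mixed))
print(is_blocked_ip("192.168.1.77", mixed))
print(is_blocked_ip("127.0.0.1", mixed))
```

With the mixed list above, the first two addresses match (one by range, one by mask) and 127.0.0.1 does not, so a local test request would still get through.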
And that’s about it. Have a look at the code if you like and feel free to make suggestions and recommendations.
GateKeeper.zip
UPDATE October 23rd, 2008: I noticed some exceptions showing up in my event log and traced them back to a bug related to empty user agents. These were semi-harmless, as the end result was the same: the request was unable to complete. But to keep the event log clean and avoid future exceptions, I have corrected the bug and updated the code here. If you downloaded GateKeeper previously, download it again and replace your current version.