I have been working hard on the next version of GateKeeper and have been running the latest build on my blog. Part of the testing requires me to watch my logs carefully to ensure innocent viewers don’t accidentally end up in the honeypot portion of the solution. While watching the logs the other day, I noticed a suspected violator who had followed my honeypot URL. Here are the log entries as they appeared in my log file:
/robots.txt – 80 – 88.75.152.118 Java/1.6.0_07 – 200 0 0 820 207 406
The requests from 88.75.152.118 start at the robots.txt file. That’s a great start, as I would expect a spider to check that file to learn which paths are allowed. The following entries are also fine, displaying the usual behavior of a curious spider:
/page/MyToolbox.aspx – 80 – 88.75.152.118 Java/1.6.0_07 – 200 0 0 39466 216 203
/page/Password-Maker.aspx – 80 – 88.75.152.118 Java/1.6.0_07 – 200 0 0 35955 221 390
/category/Personal.aspx – 80 – 88.75.152.118 Java/1.6.0_07 – 200 0 0 32472 219 796
/category/Security.aspx – 80 – 88.75.152.118 Java/1.6.0_07 – 200 0 0 32838 219 515
/category/Technology.aspx – 80 – 88.75.152.118 Java/1.6.0_07 – 200 0 0 33310 221 390
All looks great at this point, until all of a sudden the spider makes the fateful decision to peek into the /honey/ folder:
/honey/gotcha.aspx – 80 – 88.75.152.118 Java/1.6.0_07 – 403 0 0 570 215 296
Oops. That was a bad decision by the spider, because after that last request the GateKeeper module automatically added this IP address to my blacklist and sent me an email to let me know it happened. Now all future requests from this IP address will be denied with a 403 error. You can see the immediate results of this lapse in judgment by the spider:
/page/BlogRoll.aspx – 80 – 88.75.152.118 Java/1.6.0_07 – 403 0 0 570 215 296
/page/Code-Archive.aspx – 80 – 88.75.152.118 Java/1.6.0_07 – 403 0 0 570 219 249
/page/FAQ-Plugin.aspx – 80 – 88.75.152.118 Java/1.6.0_07 – 403 0 0 570 217 343
All 403s back to that creepy crawler. Now, I don’t really know who owns this IP address (dslb-088-075-152-118.pools.arcor-ip.net) at this point, nor do I recognize the user agent (Java/1.6.0_07) being passed to me. But who cares? I pay the web hosting company based on bandwidth used and see no reason to pay for bandwidth used by a spider that isn’t going to follow my directions.
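The sequence in the log (normal responses, one honeypot hit, then nothing but 403s) can be sketched in a few lines of Python. This is only an illustration of the idea, not GateKeeper's actual code: the `Gate` class, its `handle` method, and the in-memory `notifications` list standing in for the email are all hypothetical names invented for this sketch, and the `/honey/` prefix is taken from the log entries above.

```python
from dataclasses import dataclass, field

# Assumption: the honeypot lives under this folder, as seen in the log above.
HONEYPOT_PREFIX = "/honey/"


@dataclass
class Gate:
    """Hypothetical stand-in for a honeypot-driven blacklist."""
    blacklist: set = field(default_factory=set)
    notifications: list = field(default_factory=list)  # stands in for the email

    def handle(self, ip: str, path: str) -> int:
        """Return the HTTP status code this request would receive."""
        if ip in self.blacklist:
            # Already caught earlier: every request is refused.
            return 403
        if path.startswith(HONEYPOT_PREFIX):
            # First honeypot hit: remember the IP and record a notification.
            self.blacklist.add(ip)
            self.notifications.append(f"Blacklisted {ip} after request for {path}")
            return 403
        return 200


gate = Gate()
for path in ["/robots.txt", "/honey/gotcha.aspx", "/page/BlogRoll.aspx"]:
    print(path, gate.handle("88.75.152.118", path))
```

A real module would also need to persist the blacklist and expire or review entries, which this sketch leaves out.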
What’s the big deal, you may ask? Well, I post downloadable attachments that can sometimes be big. I instruct spiders not to download anything with the following extensions: *.exe, *.zip, *.rar, *.txt, because a spider isn’t going to use my attachments but its downloads will use my bandwidth. I have found in my logs that spiders sometimes ignore those instructions. GateKeeper solves this problem.
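For comparison, here is roughly what a well-behaved spider is expected to do before requesting a page. The sketch uses Python's standard `urllib.robotparser` and assumes the honeypot folder is listed under a `Disallow` rule. The post does not show the actual robots.txt, so the file contents below are made up for illustration, and the sketch sticks to a plain path-prefix rule rather than the extension patterns.

```python
from urllib.robotparser import RobotFileParser

# Assumption: a robots.txt along these lines; the real file is not shown in the post.
ROBOTS_TXT = """\
User-agent: *
Disallow: /honey/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(user_agent: str, path: str) -> bool:
    """Return True if robots.txt permits this user agent to request the path."""
    return parser.can_fetch(user_agent, path)


# A polite crawler would skip any path for which this returns False.
for path in ["/page/MyToolbox.aspx", "/honey/gotcha.aspx"]:
    print(path, may_fetch("Java/1.6.0_07", path))
```

A spider that performs this check never touches the honeypot, so only clients that ignore the rules end up on the blacklist.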
Are you interested in getting the same level of protection? Jump over to the GateKeeper project page and look for the latest released version of this tool.




Sat, Jan 10, 2009
Technology