Apache and PHP vs. the Spambots Continued

robots.txt

When a robot visits a website, it is supposed to check for a "robots.txt" file before
doing anything else (at http://www.ahref.com/robots.txt for www.ahref.com,
for example). The robots.txt file tells the robot what pages on the
site it can access. If a robot disobeys the directives in robots.txt, either by
not checking it in the first place or checking it then ignoring what it says,
it's a bad robot.


To tell robots not to visit any pages on your site, robots.txt should say:

User-agent: *
Disallow: /



That is, all user-agents (*) are disallowed from visiting anything on the site
(anything under /).
But you probably want some robots - for example, search engine robots - to traverse your
site. To set a trap for a bad robot, put something like the following in your
robots.txt file:


User-agent: *
Disallow: /int/
Disallow: /inttoo/

This tells robots not to go into the /int/ or /inttoo/ sections on your website.
(Choose another word if you actually have valid content in such a directory on your
site.) So no good robots will go there.
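To see what "no good robots will go there" means in practice, here is a minimal sketch of the check a polite robot might run before fetching a URL. It is an illustration only: the function names are made up for this example, and it handles just User-agent and Disallow lines with simple prefix matching, which is less than real crawlers implement.

```php
<?php
// Minimal robots.txt check. Assumes only "User-agent" and "Disallow" lines
// and plain prefix matching (no wildcards, no Allow lines).

/** Collect the Disallow prefixes that apply to every robot ("User-agent: *"). */
function disallowed_prefixes(string $robotsTxt): array
{
    $prefixes = [];
    $applies = false;   // does the current record apply to "*"?
    $inRules = false;   // has the current record reached its rule lines?
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line));   // drop comments
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            if ($inRules) {   // a User-agent line after rules starts a new record
                $applies = false;
                $inRules = false;
            }
            $applies = $applies || $value === '*';
        } elseif ($field === 'disallow') {
            $inRules = true;
            if ($applies && $value !== '') {
                $prefixes[] = $value;
            }
        }
    }
    return $prefixes;
}

/** True if a polite robot may fetch $path. */
function may_fetch(string $robotsTxt, string $path): bool
{
    foreach (disallowed_prefixes($robotsTxt) as $prefix) {
        if (strncmp($path, $prefix, strlen($prefix)) === 0) {
            return false;
        }
    }
    return true;
}

$robotsTxt = "User-agent: *\nDisallow: /int/\nDisallow: /inttoo/\n";
var_dump(may_fetch($robotsTxt, '/int/index.html'));   // bool(false)
var_dump(may_fetch($robotsTxt, '/index.html'));       // bool(true)
```

A robot that requests /int/index.html anyway has either skipped this check or ignored its result, which is exactly what the trap relies on.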

You don't want normal users to go there, either; so don't put any obvious links to that
directory on your web pages. But to lure in bad robots, put an invisible link
on your front page (and possibly elsewhere), around a single-pixel transparent
gif, leading to a page in the first disallowed directory:
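A minimal sketch of such a link, written as a small PHP helper for a PHP-parsed page. The helper name and the /images/clear.gif path are assumptions; point it at your own transparent GIF and at a page in your first disallowed directory.

```php
<?php
// Sketch of the invisible trap link. Both default paths are assumptions:
// /int/index.html is a page in the first disallowed directory, and
// /images/clear.gif stands in for your own 1x1 transparent GIF.
function trap_link(string $trapUrl = '/int/index.html', string $pixelUrl = '/images/clear.gif'): string
{
    return sprintf(
        '<a href="%s"><img src="%s" width="1" height="1" border="0" alt=""></a>',
        htmlspecialchars($trapUrl, ENT_QUOTES),
        htmlspecialchars($pixelUrl, ENT_QUOTES)
    );
}

echo trap_link(), "\n";
```

The empty alt text keeps the link quiet for text-mode browsers and screen readers as well.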

Normal users shouldn't go there, because the link is invisible; and good robots won't
go, because it's disallowed. So anything that does follow the link will be a
bad robot. Make sure that the page you link to in the disallowed directory is
PHP-parsed (my server is set to parse .html files with PHP), because it's supposed to
notify you of unwelcome visitors.


Getting Robot Alerts From PHP

To get PHP to alert you when the trap page is hit, just put the following code
in the page (substitute your own domain name for ???.com):
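A minimal sketch of what such a trap page could contain, assuming PHP's standard mail() and readfile() functions. The subject line, the message wording, and the robot_alert_body() helper are illustrative rather than a fixed recipe.

```php
<?php
// Sketch of a trap page: mail the webmaster, then hand the robot some text.
// The subject, the wording, and the PHP_SAPI guard are assumptions.

/** Build the alert text from the request details PHP exposes in $_SERVER. */
function robot_alert_body(array $server): string
{
    $ip    = $server['REMOTE_ADDR'] ?? 'unknown';
    $agent = $server['HTTP_USER_AGENT'] ?? 'unknown';
    $uri   = $server['REQUEST_URI'] ?? 'unknown';
    return "A robot followed the trap link.\n"
         . "IP address: $ip\n"
         . "User agent: $agent\n"
         . "Requested:  $uri\n";
}

if (PHP_SAPI !== 'cli') {   // only act when served by the web server
    mail('webmaster@???.com', 'Bad robot alert', robot_alert_body($_SERVER));
    // Fetching a URL needs allow_url_fopen; a local file path works as well.
    readfile('http://www.???.com/inttoo/index.html');
}
```

Keeping the message text in its own function makes it easy to try from the command line without sending any mail.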



As you can imagine, whenever someone visits the page, an email message will go to
webmaster@???.com, with details on the IP address and user agent used. (Keep in mind
that the spambot-writer determines what the name of the user agent is; often, they'll
claim to be a "normal" browser, like "Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)",
when they're not.) The readfile line gives the robot some text to chew on and starts the
neutralization process. Notice that it references a URL in the second directory that you
declared disallowed in the robots.txt file. (If you used another directory name
in robots.txt, use that directory name here.) Create the index.html file for
that directory - you can put any text you want in there (maybe just another copy
of your front page, minus its links).

So now you're getting notified whenever bad robots - including spambots - come to your
site. Next, we neutralize them.


You have three choices at this point, two good, one bad:

1. just keep the robot away from your content
2. keep the robot away from your content, but keep it coming back for more
3. feed it fake email addresses

Keeping the robot away from your content will take less of your machine's resources.
Keeping it coming back for more pages - email-address-less pages - will use up your
resources, but it will also use the spambot user's resources, which might make you feel
good.

Feeding the robot fake email addresses is a bad idea, for two reasons. First - if you
randomly generate fake email addresses, you might end up creating a real one by
accident, which will be bad news for whoever owns that address. Second - spammers
often use someone else's email address when sending spam; if you give the
spammer 500 fake email addresses, someone else will probably get the bounces.
Even if the spammer doesn't use a "valid" email address to spam from,
someone's mail server will have to deal with all the bounces. Which isn't fun.

So please - don't feed the spambot.

Keeping the Spambot Away

Here's where mod_rewrite comes in. Once you get the email message that a bad robot is at
your site, look at the user agent that the robot is identifying itself as. If it
has a distinctive user agent (something like "EmailWolf" or "WebBandit" rather
than "Mozilla"), put the following lines in Apache's srm.conf file (or
whichever Apache config file you like to keep such things in). This will send any users
with that same user agent to /inttoo/index.html, no matter what URL they try to
access on your site:

RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^BADUSERAGENT
RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]


Be sure to substitute the bad robot's user agent for BADUSERAGENT and your website's
document root for DOCROOT. A quirk I've noticed with mod_rewrite: it treats a blank
space as the end of the value that RewriteCond looks for (the documentation for
mod_rewrite doesn't use any blank spaces in example values, either). So if the
user agent is "EmailWolf 4.0", don't paste the full user agent into the RewriteCond
line as-is; either use everything before the blank space - "EmailWolf" - or put a
backslash in front of the space ("EmailWolf\ 4.0").
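Before a pattern goes into the config, it can help to try it against the user agents from your alert emails. The sketch below does that with PHP's preg_match(); it assumes the pattern behaves the same way under Apache's regex engine, which should hold for simple anchored prefixes like these. The pattern_matches() helper is hypothetical.

```php
<?php
// Try a candidate RewriteCond pattern against sample user agents.
// Assumes PCRE-style matching; the sample strings come from the article.
function pattern_matches(string $pattern, string $userAgent): bool
{
    // "~" is the delimiter so that "/" in user agents needs no escaping.
    $regex = '~' . str_replace('~', '\~', $pattern) . '~';
    return preg_match($regex, $userAgent) === 1;
}

$pattern = '^EmailWolf';
var_dump(pattern_matches($pattern, 'EmailWolf 4.0'));                                    // bool(true)
var_dump(pattern_matches($pattern, 'Mozilla/4.0 (compatible; MSIE 4.01; Windows 98)'));  // bool(false)
```

Because the pattern is anchored only at the start, the short prefix already matches every version of the robot while leaving ordinary browsers alone.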

If the bad 'bot doesn't have a distinctive user agent, put this instead, to block the IP
address the robot is coming from:

RewriteEngine on
RewriteCond %{REMOTE_ADDR} ^BADIPADDRESS
RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]


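Substitute the robot's IP address for BADIPADDRESS. (The value is a regular expression,
so write each dot as \. and end the address with $ if you want to match that one address
only.) If you have more than one type of robot you want to fool, your config file will
look something like this: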


RewriteEngine on
RewriteCond %{REMOTE_ADDR} ^BADIPADDRESS1 [OR]
RewriteCond %{REMOTE_ADDR} ^BADIPADDRESS2 [OR]
RewriteCond %{HTTP_USER_AGENT} ^BADUSERAGENT1 [OR]
RewriteCond %{HTTP_USER_AGENT} ^BADUSERAGENT2
RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]

This will send any requests from the IP addresses BADIPADDRESS1 or BADIPADDRESS2, or
user agents BADUSERAGENT1 or BADUSERAGENT2, to /inttoo/index.html. (By the way: you'll
need to restart Apache for the changes to the configuration files to "take.")
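If the list of bad robots keeps growing, the block above can be generated rather than typed. This is a rough sketch, with a made-up function name, a sample address from the documentation range, and /var/www standing in for DOCROOT. It escapes regex characters and, as a deliberate choice, anchors each IP address with $ so that only the exact address matches.

```php
<?php
// Sketch: build the RewriteCond/RewriteRule block from lists of bad IP
// addresses and user-agent prefixes. Names and sample values are hypothetical.
function rewrite_block(array $ips, array $agents, string $target): string
{
    $conds = [];
    foreach ($ips as $ip) {
        // Escape the dots and anchor both ends so 10.0.0.1 cannot match 10.0.0.10.
        $conds[] = '%{REMOTE_ADDR} ^' . preg_quote($ip) . '$';
    }
    foreach ($agents as $agent) {
        // Escape regex metacharacters, then backslash-escape blank spaces.
        $conds[] = '%{HTTP_USER_AGENT} ^' . str_replace(' ', '\\ ', preg_quote($agent));
    }
    if ($conds === []) {
        return '';   // never emit an unconditional RewriteRule
    }
    $lines = ['RewriteEngine on'];
    $last = count($conds) - 1;
    foreach ($conds as $i => $cond) {
        // Every condition except the last one carries [OR].
        $lines[] = 'RewriteCond ' . $cond . ($i < $last ? ' [OR]' : '');
    }
    $lines[] = 'RewriteRule ^.*$ ' . $target . ' [L]';
    return implode("\n", $lines) . "\n";
}

echo rewrite_block(['203.0.113.7'], ['EmailWolf', 'WebBandit'], '/var/www/inttoo/index.html');
```

Leaving [OR] off the final condition matters: without any flag, conditions are combined with AND, so a stray [OR] or a missing one changes which requests get redirected.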

If you want to use up the spambot user's resources (and, unfortunately, your own),
read on.

Keep That Spambot Comin'

If you don't mind getting a few hundred or thousand extra hits from bad robots, then
instead of creating the inttoo directory and an index.html file for it, create a file
called inttoo in your top document directory (so it's accessible at
http://www.???.com/inttoo) and put the following text in it:

howdy
<?php
/* This program generates a random series of URLs to waste bad robots' time */

/* prep for random number generation; number of links to generate is 6 to 10;
   we'll force the robot to wait 10-20 seconds; we'll have 30 random words on
   the page, too */
srand (time ());
$maxer = getrandmax ();
$numlinks = 6 + (1.0 * rand () / $maxer) * 4;
$numwords = 30;
$sleep_delay = (int) (10 + (1.0 * rand () / $maxer) * 10);

/* Set the dictionary file to a file with a line-delimited series of words,
   each on one line. My /usr/dict/words file is 45,000 words long; you should
   probably copy just a thousand words into another file and use that file. */
$dictionary_file = "/usr/dict/words";
$wlist = file ($dictionary_file);

/* generates some random non-linked words, so not everything on the page is
   a link, which is something bots might look out for */
for ($wcount = 0; $wcount < $numwords; $wcount++) {
    /* the cast and the "- 1" keep the index a whole number inside the list */
    $rcount = (int) ((1.0 * rand () / $maxer) * (sizeof ($wlist) - 1));
    $word = trim ($wlist[$rcount]);
    print "$word ";
}

sleep ($sleep_delay);

/* base_url is the directory which was disallowed in robots.txt. this
   generates a bunch of random links, all into that disallowed directory */
$base_url = "/inttoo/";
for ($wcount = 0; $wcount < $numlinks; $wcount++) {
    $rcount = (int) ((1.0 * rand () / $maxer) * (sizeof ($wlist) - 1));
    $word = trim ($wlist[$rcount]);
    print "<a href=\"$base_url$word\">$word</a>\n ";
}
?>

If you have multiple virtual hosts on one server, you'll need to have a copy in each
document tree, or to create one copy and link to it from each document tree.

Last, but not least, add the following lines to your Apache httpd.conf file (I'm still
using PHP3; change x-httpd-php3 if you're not):


<Location /inttoo>
ForceType application/x-httpd-php3
</Location>



and change the line:

RewriteRule ^.*$ DOCROOT/inttoo/index.html [L]


to:

RewriteRule ^.*$ DOCROOT/inttoo [L,T=application/x-httpd-php3]

in the srm.conf file (or wherever you put it). That should all be one line, by the way.
Again, if you're using PHP4, change the line appropriately.

This will force any calls to URLs under http://www.???.com/inttoo/ to just call the
program inttoo. Any robot following links in that URL-space will just keep getting
randomly-generated pages, each taking 10-20 seconds to load, without any email addresses
on the pages.
