Chapter 6 - Authentication and Authorization / Access Control with mod_perl

Web-crawling robots are an increasing problem for webmasters. Robots are supposed to abide by an informal agreement known as the robot exclusion standard (RES), in which the robot checks a file named robots.txt that tells it what parts of the site it is allowed to crawl. Many rude robots, however, ignore the RES or, worse, exploit robots.txt to guide them to the "interesting" parts of the site.

The next example (Example 6-3) gives the outline of a robot exclusion module called Apache::BlockAgent. With it you can block access by certain web clients based on their User-Agent field (which frequently, although not invariably, identifies robots).

The module is configured with a "bad agents" text file. This file contains a series of pattern matches, one per line. The incoming request's User-Agent field is compared to each of these patterns in a case-insensitive manner, and if any of the patterns hit, the request is refused. Here's a small sample file containing patterns for a few robots that have been reported to behave rudely:

    ^teleport pro\/1\.28
    ^nicerspro
    ^mozilla\/3\.0 \(http engine\)
    ^netattache
    ^crescent internet toolpak http ole control v\.1\.0
    ^go-ahead-got-it
    ^wget
    ^devsoft's http component v1\.0
    ^www\.pl
    ^digout4uagent

Rather than hardcode the location of the bad agents file, we set its path with a configuration variable named BlockAgentFile. A directory configuration section like this sample perl.conf entry applies the Apache::BlockAgent handler to the entire site:

    <Location />
      PerlAccessHandler Apache::BlockAgent
      PerlSetVar BlockAgentFile conf/bad_agents.txt
    </Location>

Apache::BlockAgent is a long module, so we'll step through the code a section at a time.

    package Apache::BlockAgent;

    use strict;
    use Apache::Constants qw(:common);
    use Apache::File ();
    use Apache::Log ();
    use Safe ();

    my $Safe = Safe->new;
    my %MATCH_CACHE;
The module brings in the common Apache constants and loads file-handling code from Apache::File. It also brings in the Apache::Log module, which makes the logging API available. The standard Safe module is pulled in next, and a new compartment is created in which code will be compiled. We'll see later how this compartment is used to compile the pattern-matching subroutine in a restricted environment. The handler() routine comes next:

    sub handler {
        my $r = shift;
        my($patfile, $agent, $sub);
        return DECLINED unless $patfile = $r->dir_config('BlockAgentFile');
        return FORBIDDEN unless $agent = $r->header_in('User-Agent');
        return SERVER_ERROR unless $sub = get_match_sub($r, $patfile);
        return OK if $sub->($agent);
        $r->log_reason("Access forbidden to agent $agent", $r->filename);
        return FORBIDDEN;
    }
The code first checks that the BlockAgentFile configuration variable is present. If not, it declines to handle the transaction. It then attempts to fetch the User-Agent field from the HTTP header by calling the request object's header_in() method. If no value is returned by this call (which might happen if a sneaky robot declines to identify itself), we return FORBIDDEN.

Otherwise, we call an internal function named get_match_sub() with the request object and the path to the bad agents file. get_match_sub() uses the information contained in the file to compile an anonymous subroutine which, when called with the user agent identification, returns a true value if the client is acceptable or false if it matches one of the forbidden patterns. If get_match_sub() returns an undefined value, it indicates that one or more of the patterns didn't compile correctly, and we return a server error. Otherwise, we call the returned subroutine with the agent name and return OK if it succeeds or FORBIDDEN if it does not.
The remainder of the module is taken up by the definition of get_match_sub(). This subroutine is interesting because it illustrates the advantage of a persistent module over a transient CGI script:

    sub get_match_sub {
        my($r, $filename) = @_;
        $filename = $r->server_root_relative($filename);
        my $mtime = (stat $filename)[9];

        # try to return the sub from cache
        return $MATCH_CACHE{$filename}->{'sub'} if
            $MATCH_CACHE{$filename} &&
            $MATCH_CACHE{$filename}->{'mod'} >= $mtime;
Rather than tediously read in the bad agents file each time we're called, compile each of the patterns, and test them, we compile the pattern match tests into an anonymous subroutine and store it in the %MATCH_CACHE package variable, along with the pattern file's modification time. On subsequent calls the cached subroutine is reused, as long as the file hasn't changed since it was compiled. Next we open the bad agents file, fetch the patterns, and build up a subroutine line by line using a series of string concatenations:

    my($fh, @pats);
    return undef unless $fh = Apache::File->new($filename);
    chomp(@pats = <$fh>);   # get the patterns into an array
    my $code = "sub { local \$_ = shift;\n";
    foreach (@pats) {
        next if /^#/;
        $code .= "return if /$_/i;\n";
    }
    $code .= "1; }\n";
    $r->server->log->debug("compiled $filename into:\n $code");
Note the use of $r->server->log->debug() to send a debugging message to the server log file. This message will appear in the error log only if LogLevel is set to debug. If all goes well, the synthesized subroutine, stored in the string $code, will end up looking something like this:

    sub {
        local $_ = shift;
        return if /^teleport pro\/1\.28/i;
        return if /^nicerspro/i;
        return if /^mozilla\/3\.0 \(http engine\)/i;
        ...
        1;
    }
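As an aside, synthesizing source code is not the only way to get a fast matcher. A hedged alternative sketch (not what Apache::BlockAgent does) precompiles each pattern with qr// and loops over the resulting list, which avoids evaluating file contents as code at the cost of a loop per request:

    # Alternative sketch: precompiled patterns instead of generated code.
    # In the real module @pats would come from the bad agents file.
    my @pats = ('^wget', '^nicerspro');
    my @compiled = map { qr/$_/i } grep { !/^#/ } @pats;
    my $sub = sub {
        my $agent = shift;
        foreach my $pat (@compiled) {
            return if $agent =~ $pat;   # forbidden pattern matched: block
        }
        1;                              # nothing matched: allow
    };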
After building up the subroutine, we run a match-all regular expression over the code in order to untaint what was read in from disk. In most cases blindly untainting data like this is a bad idea, rendering the taint-check mechanism useless. To mitigate this, we compile the string inside a Safe compartment using its reval() method, which disables potentially dangerous operations such as system():

    # create the sub, cache and return it
    ($code) = $code =~ /^(.*)$/s;  # untaint
    my $sub = $Safe->reval($code);
    unless ($sub) {
        $r->log_error($r->uri, ": ", $@);
        return;
    }

The untainting step is required only if taint checks are turned on with the PerlTaintCheck On directive (see Appendix A, Standard Noncore Modules). The result of reval()ing the string is a code reference to the newly compiled subroutine, or undef if the string failed to compile, in which case we log the error and return.
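To make the compartment's role concrete, here is a small standalone sketch (again independent of the module) showing that reval() hands back a code reference for benign code but traps masked operations:

    #!/usr/bin/perl
    use strict;
    use Safe ();

    my $cpt = Safe->new;

    # Benign code compiles to a code reference.
    my $ok = $cpt->reval('sub { local $_ = shift; return if /^wget/i; 1 }');
    print "compiled ok\n" if ref($ok) eq 'CODE';

    # Code containing a masked operator fails to compile;
    # $@ explains which op was trapped.
    my $bad = $cpt->reval('sub { system("echo gotcha") }');
    print "trapped: $@" unless defined $bad;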
The final step is to store the compiled subroutine and the bad agents file's modification time into %MATCH_CACHE and return the subroutine to the caller:

        @{ $MATCH_CACHE{$filename} }{'sub','mod'} = ($sub, $mtime);
        return $MATCH_CACHE{$filename}->{'sub'};
    }
Because there may be several pattern files applicable to different parts of the site, we key %MATCH_CACHE by the pattern file's path. As we saw in Chapter 4, Content Handlers, this technique of compiling and caching a dynamically evaluated subroutine is a powerful optimization that allows Apache::BlockAgent to keep up with even very busy sites.

Going one step further, the Apache::BlockAgent module could avoid parsing the pattern file entirely by defining its own custom configuration directives. The technique for doing this is described in Chapter 7, Other Request Phases.

Example 6-3. Blocking Rude Robots with Apache::BlockAgent

    package Apache::BlockAgent;

    use strict;
    use Apache::Constants qw(:common);
    use Apache::File ();
    use Apache::Log ();
    use Safe ();

    my $Safe = Safe->new;
    my %MATCH_CACHE;

    sub handler {
        my $r = shift;
        my($patfile, $agent, $sub);
        return DECLINED unless $patfile = $r->dir_config('BlockAgentFile');
        return FORBIDDEN unless $agent = $r->header_in('User-Agent');
        return SERVER_ERROR unless $sub = get_match_sub($r, $patfile);
        return OK if $sub->($agent);
        $r->log_reason("Access forbidden to agent $agent", $r->filename);
        return FORBIDDEN;
    }

    # This routine creates a pattern-matching subroutine from a
    # list of pattern matches stored in a file.
    sub get_match_sub {
        my($r, $filename) = @_;
        $filename = $r->server_root_relative($filename);
        my $mtime = (stat $filename)[9];

        # try to return the sub from cache
        return $MATCH_CACHE{$filename}->{'sub'} if
            $MATCH_CACHE{$filename} &&
            $MATCH_CACHE{$filename}->{'mod'} >= $mtime;

        # if we get here, then we need to create the sub
        my($fh, @pats);
        return undef unless $fh = Apache::File->new($filename);
        chomp(@pats = <$fh>);   # get the patterns into an array
        my $code = "sub { local \$_ = shift;\n";
        foreach (@pats) {
            next if /^#/;
            $code .= "return if /$_/i;\n";
        }
        $code .= "1; }\n";
        $r->server->log->debug("compiled $filename into:\n $code");

        # create the sub, cache and return it
        ($code) = $code =~ /^(.*)$/s;  # untaint
        my $sub = $Safe->reval($code);
        unless ($sub) {
            $r->log_error($r->uri, ": ", $@);
            return;
        }
        @{ $MATCH_CACHE{$filename} }{'sub','mod'} = ($sub, $mtime);
        return $MATCH_CACHE{$filename}->{'sub'};
    }

    1;
    __END__
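A closing configuration note: because %MATCH_CACHE is keyed by the pattern file's path, different parts of the site can be screened against different pattern files simply by varying BlockAgentFile. A hypothetical entry (the location and file name are illustrative):

    <Location /members-only>
      PerlAccessHandler Apache::BlockAgent
      PerlSetVar BlockAgentFile conf/strict_agents.txt
    </Location>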