Writing Apache Modules with Perl and C
By: Lincoln Stein and Doug MacEachern
Published: O'Reilly & Associates, Inc., March 1999

Copyright © 1999 by O'Reilly & Associates, Inc.

Chapter 6 - Authentication and Authorization / Access Control with mod_perl
Browser-Based Access Control

Web-crawling robots are an increasing problem for webmasters. Robots are supposed to abide by an informal agreement known as the robot exclusion standard (RES), in which the robot checks a file named robots.txt that tells it what parts of the site it is allowed to crawl through. Many rude robots, however, ignore the RES or, worse, exploit robots.txt to guide them to the "interesting" parts. The next example (Example 6-3) gives the outline of a robot exclusion module called Apache::BlockAgent. With it you can block the access of certain web clients based on their User-Agent field (which frequently, although not invariably, identifies robots).
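
For reference, a robots.txt file is just a plain-text list of rules fetched from the site's document root. A minimal example (illustrative only, not part of this module) looks like this:

# robots.txt -- well-behaved robots fetch this before crawling
# These rules apply to all robots and exclude two subtrees.
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/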

The module is configured with a "bad agents" text file. This file contains a series of pattern matches, one per line. The incoming request's User-Agent field will be compared to each of these patterns in a case-insensitive manner. If any of the patterns hit, the request will be refused. Here's a small sample file that contains pattern matches for a few robots that have been reported to behave rudely:

^teleport pro\/1\.28
^nicerspro
^mozilla\/3\.0 \(http engine\)
^netattache
^crescent internet toolpak http ole control v\.1\.0
^go-ahead-got-it
^wget
^devsoft's http component v1\.0
^www\.pl
^digout4uagent

Rather than hardcode the location of the bad agents file, we set its path using a configuration variable named BlockAgentFile. A directory configuration section like this sample perl.conf entry will apply the Apache::BlockAgent handler to the entire site:

<Location />
 PerlAccessHandler Apache::BlockAgent
 PerlSetVar BlockAgentFile conf/bad_agents.txt
</Location>

Apache::BlockAgent is a long module, so we'll step through the code a section at a time.

package Apache::BlockAgent;
use strict;
use Apache::Constants qw(:common);
use Apache::File ();
use Apache::Log ();
use Safe ();
my $Safe = Safe->new;
my %MATCH_CACHE;

The module brings in the common Apache constants and loads file-handling code from Apache::File. It also brings in the Apache::Log module, which makes the logging API available. The standard Safe module is pulled in next, and a new compartment is created where code will be compiled. We'll see later how the %MATCH_CACHE package variable is used to cache the code routines that detect undesirable user agents. Most of Apache::BlockAgent's logic is contained in the short handler() subroutine:
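
To see what the compartment buys us, here is a short standalone sketch, not part of the module, demonstrating Safe's default behavior: harmless code compiles and runs, while masked operations such as system() are trapped at compile time:

use Safe ();
my $cpt = Safe->new;
print $cpt->reval('2 + 2'), "\n";       # harmless expression: prints 4
my $ret = $cpt->reval('system("ls")');  # system() is masked out...
print "trapped: $@" if $@;              # ...so reval() returns undef and sets $@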

sub handler {
   my $r = shift;
   my($patfile, $agent, $sub);
   return DECLINED unless $patfile = $r->dir_config('BlockAgentFile');
   return FORBIDDEN unless $agent = $r->header_in('User-Agent');
   return SERVER_ERROR unless $sub = get_match_sub($r, $patfile);
   return OK if $sub->($agent);
   $r->log_reason("Access forbidden to agent $agent", $r->filename);
   return FORBIDDEN;
}

The code first checks that the BlockAgentFile configuration variable is present. If not, it declines to handle the transaction. It then attempts to fetch the User-Agent field from the HTTP header by calling the request object's header_in() method. If no value is returned by this call (which might happen if a sneaky robot declines to identify itself), we return FORBIDDEN from the subroutine, blocking access.

Otherwise, we call an internal function named get_match_sub() with the request object and the path to the bad agent file. get_match_sub() uses the information contained within the file to compile an anonymous subroutine which, when called with the user agent identification, returns a true value if the client is accepted, or false if it matches one of the forbidden patterns. If get_match_sub() returns an undefined value, it indicates that one or more of the patterns didn't compile correctly and we return a server error. Otherwise, we call the returned subroutine with the agent name and return OK or FORBIDDEN, depending on the outcome.
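
In other words, the calling contract looks like this (a hypothetical snippet using made-up agent strings, assuming the sample pattern file shown earlier):

my $check = get_match_sub($r, 'conf/bad_agents.txt');
die "pattern file failed to compile" unless $check;
print "allowed\n" if     $check->('Mozilla/4.5 (Macintosh; I; PPC)');
print "blocked\n" unless $check->('Wget/1.5.3');   # hits the ^wget pattern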

The remainder of the module is taken up by the definition of get_match_sub(). This subroutine is interesting because it illustrates the advantage of a persistent module over a transient CGI script:

sub get_match_sub {
   my($r, $filename) = @_;
   $filename = $r->server_root_relative($filename);
   my $mtime = (stat $filename)[9];
    # try to return the sub from cache
   return $MATCH_CACHE{$filename}->{'sub'} if
      $MATCH_CACHE{$filename} &&
          $MATCH_CACHE{$filename}->{'mod'} >= $mtime;

Rather than tediously reading in the bad agents file, compiling each of the patterns, and testing them every time we're called, we compile the pattern match tests into an anonymous subroutine and store it in the %MATCH_CACHE package variable, along with the name of the pattern file and its modification date. Each time it is called, get_match_sub() checks %MATCH_CACHE to see whether this particular pattern file has been processed before. If it has, the routine compares the file's modification time against the date stored in the cache. If the file is not more recent than the cached version, we return the cached subroutine. Otherwise, we recompile it.
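
Concretely, each cache entry is a small hash holding the compiled subroutine and the file modification time it was compiled from. After a successful compile, the structure looks something like this (the path and timestamp are invented for illustration):

%MATCH_CACHE = (
    '/usr/local/apache/conf/bad_agents.txt' => {
        'sub' => sub { 1 },    # the compiled pattern-match subroutine
        'mod' => 921456789,    # the file's mtime when the sub was compiled
    },
);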

Next we open up the bad agents file, fetch the patterns, and build up a subroutine line by line using a series of string concatenations:

    my($fh, @pats);
   return undef unless $fh = Apache::File->new($filename);
   chomp(@pats = <$fh>); # get the patterns into an array
   my $code = "sub { local \$_ = shift;\n";
   foreach (@pats) {
      next if /^#/;
      $code .= "return if /$_/i;\n";
   }
   $code .= "1; }\n";
   $r->server->log->debug("compiled $filename into:\n $code");

Note the use of $r->server->log->debug() to send a debugging message to the server log file. This message will only appear in the error log if the LogLevel is set to debug. If all goes well, the synthesized subroutine stored in $code will end up looking something like this:

sub {
   local $_ = shift;
   return if /^teleport pro\/1\.28/i;
   return if /^nicerspro/i;
   return if /^mozilla\/3\.0 \(http engine\)/i;
   ...
   1;
}
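
LogLevel is a core Apache configuration directive; to see the debug message during development, set it in httpd.conf:

LogLevel debug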

After building up the subroutine, we run a match-all regular expression over the code in order to untaint what was read from disk. In most cases, blindly untainting data like this is a bad idea, since it renders the taint check mechanism useless. To mitigate the risk, we compile the code inside a Safe compartment using its reval() method, which disables potentially dangerous operations such as system().

    # create the sub, cache and return it
   ($code) = $code =~ /^(.*)$/s; #untaint
   my $sub = $Safe->reval($code);
   unless ($sub) {
      $r->log_error($r->uri, ": ", $@);
      return;
   }

The untainting step is required only if taint checks are turned on with the PerlTaintCheck On directive (see Appendix A, Standard Noncore Modules). The result of reval()ing the string is a CODE reference to an anonymous subroutine, or undef if something went wrong during compilation. In the latter case, we log the error and return.
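
If you'd like to see the untainting idiom in isolation, here is a minimal standalone sketch, unrelated to Apache, that you can run with perl -T:

#!/usr/bin/perl -T
my $code = <STDIN>;            # data from outside the program is tainted
# eval $code;                  # with -T enabled, this would die:
                               # "Insecure dependency in eval"
($code) = $code =~ /^(.*)$/s;  # a regex capture is considered clean by Perl
eval $code;                    # now permitted (but still risky for untrusted input)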

The final step is to store the compiled subroutine and the bad agent file's modification time into %MATCH_CACHE:

    @{ $MATCH_CACHE{$filename} }{'sub','mod'} = ($sub, $mtime);
   return $MATCH_CACHE{$filename}->{'sub'};
}

Because there may be several pattern files applicable to different parts of the site, we key %MATCH_CACHE by the path to the file. We then return the compiled subroutine to the caller.
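
Because the cache is keyed by file path, different areas of the site can each use their own pattern file. A hypothetical configuration might look like this (the file names are invented for illustration):

<Location /docs>
 PerlAccessHandler Apache::BlockAgent
 PerlSetVar BlockAgentFile conf/bad_agents.txt
</Location>

<Location /private>
 PerlAccessHandler Apache::BlockAgent
 PerlSetVar BlockAgentFile conf/very_bad_agents.txt
</Location>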

As we saw in Chapter 4, Content Handlers, this technique of compiling and caching a dynamically evaluated subroutine is a powerful optimization that allows Apache::BlockAgent to keep up with even very busy sites. Going one step further, the Apache::BlockAgent module could avoid parsing the pattern file entirely by defining its own custom configuration directives. The technique for doing this is described in Chapter 7, Other Request Phases.

Example 6-3. Blocking Rude Robots with Apache::BlockAgent

package Apache::BlockAgent;
use strict;
use Apache::Constants qw(:common);
use Apache::File ();
use Apache::Log ();
use Safe ();

my $Safe = Safe->new;
my %MATCH_CACHE;

sub handler {
    my $r = shift;
    my($patfile, $agent, $sub);
    return DECLINED unless $patfile = $r->dir_config('BlockAgentFile');
    return FORBIDDEN unless $agent = $r->header_in('User-Agent');
    return SERVER_ERROR unless $sub = get_match_sub($r, $patfile);
    return OK if $sub->($agent);
    $r->log_reason("Access forbidden to agent $agent", $r->filename);
    return FORBIDDEN;
}

# This routine creates a pattern matching subroutine from a
# list of pattern matches stored in a file.
sub get_match_sub {
    my($r, $filename) = @_;
    $filename = $r->server_root_relative($filename);
    my $mtime = (stat $filename)[9];

    # try to return the sub from cache
    return $MATCH_CACHE{$filename}->{'sub'} if
        $MATCH_CACHE{$filename} &&
            $MATCH_CACHE{$filename}->{'mod'} >= $mtime;

    # if we get here, then we need to create the sub
    my($fh, @pats);
    return undef unless $fh = Apache::File->new($filename);
    chomp(@pats = <$fh>); # get the patterns into an array
    my $code = "sub { local \$_ = shift;\n";
    foreach (@pats) {
        next if /^#/;
        $code .= "return if /$_/i;\n";
    }
    $code .= "1; }\n";
    $r->server->log->debug("compiled $filename into:\n $code");

    # create the sub, cache and return it
    ($code) = $code =~ /^(.*)$/s; #untaint
    my $sub = $Safe->reval($code);
    unless ($sub) {
        $r->log_error($r->uri, ": ", $@);
        return;
    }
    @{ $MATCH_CACHE{$filename} }{'sub','mod'} = ($sub, $mtime);
    return $MATCH_CACHE{$filename}->{'sub'};
}

1;
__END__
