Blog Entry - 22nd May 2009 - Programming - Web Server Configuration

URL Voodoo - Apache


Introduction

My experience with comment-spam bots compelled me to teach myself about server configuration directives, for IP address blocking.

At the same time, I dipped my toe into the turbulent waters of URL re-writing.

Set out below are my rough notes, representing things I have gathered from other web entries, and my best guess as to what is going on; and limited to aspects I have a direct interest in and/or have a hope of understanding.

I have only scratched the surface here of course.

I assume that you are already very familiar with Perl regular expressions, and make no attempt to explain these.

I am focusing on Apache, but will follow up with a post on Zeus and PHP only solutions.

The whole article is one rambling mess, and gradually got bigger and bigger, so my apologies for this.

Basic Lesson

I understand that Apache supports (through a number of modules) configuration directives which, among other things, permit visitor blocking, URL-rewriting and other processing of HTTP requests coming into your server.

These commands are usually included in one of two places:-

httpd.conf

This is the main server configuration file, applied to every request that comes into the server.

You have to re-start the server (Run As Administrator in Vista) each time you change it.

.htaccess

These are text files you create, known as per-directory configuration files, which you can put in the root directory or a sub-directory of your web site.

They are applied to every request for a resource that is in the directory the .htaccess file is saved in, and any of its sub-directories.

The .htaccess file is loaded each time a request is made, so you don't need to re-start the server; but at the same time this re-loading does involve a performance penalty. Indeed I have seen recommendations that .htaccess files be avoided where possible, because they do create an additional workload for the server.

They are mainly used where you do not have control over the httpd.conf file, through which you can achieve the equivalent in a <Directory> section. I.e. if I wanted to apply directives to http://www.baconbutty.com/info, I could either put a .htaccess file in that directory, or put the commands in a <Directory "C:/WAMP/Apache/httpd/info"> section.

You can change the standard name for these files from .htaccess to something else with AccessFileName .mynewname in the httpd.conf file.
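As a minimal sketch of that rename, assuming the directives go in httpd.conf and keeping .mynewname as the purely illustrative file name from above: the stock httpd.conf normally stops visitors from downloading files whose names begin with .ht, and a renamed file would no longer be caught by that, so it seems sensible to protect the new name too.

```apache
# httpd.conf (server configuration, not a .htaccess file)

# Look for per-directory files called .mynewname instead of .htaccess
AccessFileName .mynewname

# Stop visitors from fetching the renamed file over HTTP
<Files ".mynewname">
    Order Allow,Deny
    Deny from all
</Files>
```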

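You will also need to include Options FollowSymLinks and AllowOverride All in a <Directory> section for your DOCUMENT_ROOT in your httpd.conf file for these .htaccess files (and the re-write rules in them) to work. Note that neither directive takes True as a value: FollowSymLinks is an argument to Options, and AllowOverride takes All, None or a list of directive types.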

E.g.


# Specify the document root
DocumentRoot "E:/Server/www"

# Set the permissions for the document root
<Directory "E:/Server/www">

    Options Indexes FollowSymLinks
    AllowOverride All
    Order allow,deny
    Allow from all

</Directory>

If you are using a shared server, then this may require some negotiation with your web hosting company, or a change to another web hosting company.

Note that if you use Filezilla for your FTP client, you will need to select Server > Force showing hidden files to see the .htaccess files, and then refresh the directory listing.

Order of precedence

Now what happens if you have configurations in more than one file?

E.g.

httpd.conf
C:/Apache/www/.htaccess
C:/Apache/www/a/.htaccess

What happens if you call up /a/somefile.htm? Which of the above files will run?

The answer, as far as I can tell, depends on what is happening in those files.

Generally, for re-write directives at least, C:/Apache/www/a/.htaccess will take precedence over C:/Apache/www/.htaccess, unless C:/Apache/www/a/.htaccess has nothing to say at all. I.e. if C:/Apache/www/a/.htaccess has something to say on the matter, then the file in the parent directory will not run. Even if that something is nothing. (Rules placed directly in httpd.conf, outside any <Directory> section, appear to be the exception: as far as I can tell they are applied before any .htaccess file is consulted.)

E.g. if I simply insert RewriteEngine on in C:/Apache/www/a/.htaccess and do nothing else, then this will effectively block any rewrite directives in C:/Apache/www/.htaccess.

See:-

#C:/Apache/www/a/.htaccess
RewriteEngine on

#C:/Apache/www/.htaccess
RewriteEngine on
RewriteRule ^a/(\d+)/$ /b/$1/

If I type in "http://localhost/a/1/" I will get a 404 error for URL /a/1/, and the C:/Apache/www/.htaccess will not run.

However, if I comment out the RewriteEngine on line (# RewriteEngine on) in www/a/.htaccess, then www/.htaccess will run, and I will get a 404 error for URL /b/1/ (after the substitution).
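If the intention is for the parent directory's rules to be applied as well, mod_rewrite does have a RewriteOptions inherit setting for this. The following is only a sketch, assuming the same two files as above. It is also an assumption on my part how an inherited pattern such as ^a/(\d+)/$ is matched once it is applied from inside the a directory (it may well be matched against the shorter, relative path), so treat it as a starting point for experiment.

```apache
# C:/Apache/www/a/.htaccess
RewriteEngine on

# Ask mod_rewrite to also apply the rules of the parent configuration
# (here C:/Apache/www/.htaccess) instead of discarding them.
RewriteOptions inherit
```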

Equally, if a URL substitution does occur, then we start all over again:-

#C:/Apache/www/a/.htaccess
RewriteEngine on
RewriteRule ^(\d+)/$ /b/$1/

#C:/Apache/www/.htaccess
RewriteEngine on
RewriteRule ^b/(\d+)/$ /c/$1/

In this case, after C:/Apache/www/a/.htaccess has changed a/1/ to /b/1/ (its pattern is matched against the path relative to the a directory, i.e. 1/), Apache treats this as a brand new request, and as there is nothing in the /b/ directory, it will run the C:/Apache/www/.htaccess file and I will get a 404 error for URL /c/1/ (after the substitution).

Got that!?

Environment Variables

Within httpd.conf and .htaccess files a range of variables are generally available, and addressed in the form %{VARIABLE_NAME}.

These are the CGI environment variables, and some additions. See Apache Docs and Zytrax and NCSA.

The notes below are just a rough sub-set of a wider range of variables available.

URL Components

   
http://www.baconbutty.com:8080/a/b/index.php?entryid=24#inpagelocation
\__/   \________________/ \__/\____________/ \________/ \____________/
 |            |            |         |           |           |
 scheme     host           port     path        query      fragment
              |            |         |            |           \
              |            |         |            |            \
         SERVER_NAME  SERVER_PORT  REQUEST_URI   QUERY_STRING   [not accessible]
         HTTP_HOST

Client

REMOTE_ADDR

The IP address of the client making the request

E.g. 127.0.0.1

REMOTE_HOST

The hostname of the client making the request.

This is the hostname if it can be resolved from REMOTE_ADDR; if not, it is normally the same as REMOTE_ADDR.

HTTP_REFERER

The URL of the page that made this request.

If linked from e-mail or manually entered, this value is NULL.

HTTP_ACCEPT

The MIME types the requestor will accept as defined in the HTTP header.

E.g. text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1

HTTP_ACCEPT_ENCODING

The encoding types the requestor will accept as defined in the HTTP header.

E.g. HTTP_ACCEPT_ENCODING=gzip, deflate.

E.g. deflate, gzip, x-gzip, identity, *;q=0.

HTTP_ACCEPT_LANGUAGE

E.g. en-GB,en;q=0.9

HTTP_ACCEPT_CHARSET

E.g. iso-8859-1, utf-8, utf-16, *;q=0.1

HTTP_COOKIE
HTTP_COOKIE2

The requestor's cookie, if one is set.

E.g. PHPSESSID=j79rmobpf62mgo9hdjl4jgus85.

HTTP_USER_AGENT

The browser type of the requestor.

E.g. Opera/9.50 (Windows NT 6.0; U; en).

Request

REQUEST_METHOD

The name of the method being used (GET, POST, etc.)

E.g. GET

REQUEST_PROTOCOL
SERVER_PROTOCOL

The name and version of the protocol with which the request was made.

E.g. HTTP/1.0, HTTP/1.1, etc.

HTTP_HOST

The hostname of the URL requested

E.g. www.baconbutty.com.

SERVER_PORT

The port number to which the request was sent

E.g. 80

REQUEST_URI

The resource requested on the HTTP request line. The portion of the URL following the scheme and host portion, without the query string, but including the first /.

E.g. /a/b/index.php

QUERY_STRING

The information (if any) following the "?" in the URL for this request, but not including the fragment.

E.g. entryid=24

REDIRECT_URL

If the request was a redirect from somewhere else, the URL is accessible through this variable.

E.g. /entry/24/

REDIRECT_QUERY_STRING

The query string supplied as part of the re-direct.

E.g. entryid=24

Server Internal

THE_REQUEST

The full HTTP request line sent by the browser to the server (e.g., "GET /index.html HTTP/1.1").

Use this if you want to match against both the REQUEST_URI + QUERY_STRING at the same time, in a RewriteCond.
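As a rough illustration, assuming a .htaccess file in the DOCUMENT_ROOT and a hypothetical dynamic URL /blog.php?id=N that should be redirected to /blog/N/, a condition on THE_REQUEST can test the path and the query string in a single pattern:

```apache
RewriteEngine on

# THE_REQUEST looks like: GET /blog.php?id=12 HTTP/1.1
# Spaces inside the pattern are escaped with a backslash.
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /blog\.php\?id=(\d+)\ HTTP/

# %1 is the id captured by the RewriteCond above.
# The trailing ? drops the old query string from the new URL.
RewriteRule ^blog\.php$ /blog/%1/? [R=301,L]
```

Because THE_REQUEST is the line the browser originally sent, it should stay the same when Apache re-runs the rules after an internal re-write, which makes it handy for telling a genuine request apart from a re-written one.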

HTTPS

Will contain the text "on" if the connection is using SSL/TLS, or "off" otherwise.
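A small sketch of the variable in use, assuming a .htaccess file in the DOCUMENT_ROOT and a server that already answers on https: any plain http request is sent to the same host and path over https.

```apache
RewriteEngine on

# Only act when the connection is NOT using SSL/TLS
RewriteCond %{HTTPS} off

# $1 is the requested path without its leading /
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]
```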

SERVER_ADDR

The IP address of the server.

E.g. 127.0.0.1

SERVER_NAME

The server's host name, DNS alias or IP address.

E.g. www.baconbutty.com.

SERVER_PORT

The port for the request.

E.g. 80.

DOCUMENT_ROOT

The root directory of your server.

E.g. C:/WAMP/Apache/htdocs

REQUEST_FILENAME
SCRIPT_FILENAME

The full server path to the file or directory

E.g. C:/WAMP/Apache/htdocs/tgis/tgis.php

SCRIPT_NAME

The relative path (to your DOCUMENT_ROOT).

E.g. /tgis/tgis.php

REQUEST_TIME

Unix time for the request.

E.g. 1240783050

Directives and Limits

By default every directive you include applies to every type of request method (GET, HEAD, POST, PUT etc).

It is possible to limit which request methods your directive applies to, in two ways.

Limit

See: Limit

# Deny if not expressly allowed
Order Allow,Deny

<Limit GET POST>
Allow from 127.0.0.1
</Limit>

The request will be denied unless it is a GET or POST method and it originates from IP address 127.0.0.1.

Note that GET also implicitly includes HEAD requests.

LimitExcept

See: LimitExcept

# Allow if not expressly denied
Order Deny,Allow

<LimitExcept GET POST>
Deny from all
</LimitExcept>

This example denies all requests that don't use the GET or POST methods.

Allowing and Denying Requests

Allow and Deny Directives

The Allow and Deny directives let you selectively block and allow requests to your server, or particular directories in your server (where included in a .htaccess file).

All Allow directives are processed as a group, and all Deny directives are processed as a group, irrespective of where they appear.

Order of Precedence

For any two Allow directives, the later takes precedence in the event of conflict.

For any two Deny directives, the later takes precedence in the event of conflict.

The order of precedence between Allow and Deny is determined by the Order directive, which should normally appear first.

See Order

E.g.

Order Deny,Allow

Allow takes precedence over Deny.

First, all Deny directives are evaluated; if any match, the request is denied unless it also matches an Allow directive. Any requests which do not match any Allow or Deny directives are permitted.

In other words:-

  • Allow directives take precedence over Deny directives in the event of conflict.
  • Default is to Allow if no directive applies.

E.g.

Order Allow,Deny 

Deny takes precedence over Allow.

Firstly, all Allow directives are evaluated; at least one must match, or the request is rejected. Next, all Deny directives are evaluated; if any matches, the request is rejected. Lastly, any requests which do not match an Allow or a Deny directive are denied by default.

In other words:-

  • Default is to Deny if no directive applies.
  • All requests must accordingly be covered by an Allow, otherwise Deny.
  • Of those Allow requests, block any that are covered by a specific Deny.
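The two orderings can be sketched side by side as follows. The directory paths are hypothetical sub-directories of the DocumentRoot used earlier, 192.0.2.15 is a reserved documentation address standing in for a spam bot, and spam.baconbutty.com is an invented host name.

```apache
# Default is to allow: everybody gets in except the listed address.
<Directory "E:/Server/www/blog">
    Order Deny,Allow
    Deny from 192.0.2.15
</Directory>

# Default is to deny: only baconbutty.com hosts get in,
# and of those, one sub-domain is still refused.
<Directory "E:/Server/www/private">
    Order Allow,Deny
    Allow from baconbutty.com
    Deny from spam.baconbutty.com
</Directory>
```

Note that there is no space after the comma in the Order line; Apache treats Order Deny, Allow (with a space) as a syntax error.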

Allow from

See : Allow

Allow from all
Allow from 127.0.0.1
Allow from baconbutty.com

You can specify a full or partial IP address or host name.

Deny from

See : Deny

Deny from all
Deny from 127.0.0.1
Deny from baconbutty.com

You can specify a full or partial IP address or host name.
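For reference, here is a sketch of the address forms I understand Deny from (and equally Allow from) to accept. All the addresses are reserved documentation ranges and example.net is a placeholder, so substitute your own.

```apache
Order Deny,Allow

# Full IP address
Deny from 192.0.2.15

# Partial IP address: the leading bytes only, matching the whole range
Deny from 198.51.100

# Network/netmask pair, and the equivalent CIDR form
Deny from 203.0.113.0/255.255.255.0
Deny from 203.0.113.0/24

# Partial host name: hosts whose names match, or end in, this string
Deny from example.net
```

Host name matching means Apache has to do a reverse DNS lookup on the client address, so IP addresses are likely to be the cheaper option.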

Selective Files

It is possible to use SetEnvIf to exempt certain files from the Allow and Deny.

# Sets allow-it to true if Request_URI matches
SetEnvIf Request_URI "^/(robots\.txt|my_custom_403_page\.html)$" allow-it

# Allow takes precedence
Order Deny,Allow

# env=allow-it is true if `allow-it` exists
# (method names in <Limit> are case-sensitive, hence GET not get)
<Limit GET>
Allow from env=allow-it
</Limit>
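The same mechanism can be turned around to refuse requests rather than exempt them. This is a sketch only, assuming mod_setenvif is loaded; "badbot" and the variable name block-it are invented for illustration.

```apache
# Flag requests whose User-Agent header contains "badbot" (any case)
SetEnvIfNoCase User-Agent "badbot" block-it

# Allow everybody, then refuse anything flagged above
# (Deny wins over Allow under this Order)
Order Allow,Deny
Allow from all
Deny from env=block-it
```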

URL Rewriting

Introduction

This has been described as voodoo. In fact, there are so many points you can prick yourself on, I started to feel like a voodoo doll.

The main documentation looks deceptively simple; but it harbours many subtle and confusing points.

I am not sure the information below is correct, but it is my best guess at understanding what is going on, at a superficial level, based on the documentation, trial and error, and a little reverse engineering.

Two definitions:-

  • dynamic url - /blog.php?id=1
  • static url - /blog/1/ or /image.gif

Docs

The key starting place is: Apache 2.0 - Mod ReWrite Docs

Is it on?

You must explicitly switch the rewrite engine on.

A simple test like this in a .htaccess file in the DOCUMENT_ROOT :-

RewriteEngine on
RewriteRule ^1$ 2

will produce an error that it cannot find /2 if you type "www.mysite.com/1" in the address bar.

I.e. 1 is replaced by 2, and 2 does not exist, producing a 404 error.

Logging

If you need some help following what is going on behind the scenes then insert the following in your httpd.conf file:-

RewriteLog 	"logs/rewrite.log"
RewriteLogLevel	9

This records all of the re-write actions in a log file in the logs directory of the Apache installation directory.

Level 9 is the most verbose, so expect long files.

Key Points

Options +FollowSymLinks

Some forums seem to recommend including this in each .htaccess file, to be on the safe side (in case it is not set in the httpd.conf).

Browser Address Bar

When you re-write a URL, the browser address bar will not normally be updated. The re-write will happen internally within Apache.

If you want the browser address bar to be updated, you must append an [R] flag (external redirect). Normally you also append an [L] flag (stop here), giving [R,L], as you will not want further re-writing rules in the .htaccess file to apply. But see below: the .htaccess will run again from the top.

How many times is .htaccess run?

If the URL is changed (i.e. one or more substitutions are made) by your re-write code, then the .htaccess or httpd.conf re-write rules will be re-run from the beginning using the changed URL (an INTERNAL REDIRECT).

In other words, the .htaccess file will keep re-running over and over until there are no further changes to the URL.

For this reason, be very careful if any later RewriteRule produces a URL which is then capable of being matched by an earlier RewriteRule on a re-run.

Also watch out for infinite loops.

You can avoid this with :-

RewriteOptions MaxRedirects=10

Which will stop INTERNAL REDIRECTS after they reach 10 (i.e. 10 substitutions).

For example:-

RewriteEngine on

RewriteRule ^3$ /4
RewriteRule ^2$ /3 
RewriteRule ^1$ /2 

With the above, if I type in http://www.baconbutty.com/1, the message will be cannot find file /4.

I.e. the last rule is matched, then .htaccess starts from the top (INTERNAL REDIRECT) and the second to last is matched, then .htaccess starts again from the top and the first is matched!

See this log:-


This can produce confusing effects.

For instance, say I want to have a canonical URL in the form http://www.baconbutty.com/blog/1/ which is then redirected to /blog-entry.php?id=1. The user types in http://www.baconbutty.com/blog/1. I first need to add the trailing / and then re-write to blog-entry.php?id=1.

You could do this:-

RewriteEngine on

# Add the trailing /
RewriteRule ^(.*[^/])$ /$1/ [R=301,L]

# Re-write to blog-entry
RewriteRule ^blog/(\d+)/$ blog-entry.php?id=$1

But this results in http://www.baconbutty.com/blog-entry.php/?id=1 - there is a / after php. Why? Because:-

  • the second rule creates http://www.baconbutty.com/blog-entry.php?id=1
  • the .htaccess is run again,
  • the first rule, which uses .*, applies again to insert / after php.

So, what to do?

The only answer seems to be to make the regular expression for the RewriteRule for the trailing / more specific.

E.g.

RewriteRule ^blog/(\d+)$ /blog/$1/ [R=301,L]
RewriteRule ^blog/(\d+)/$ blog-entry.php?id=$1

RewriteEngine

To get up and running, you set RewriteEngine on, assuming that the appropriate module has been loaded and settings made as detailed further above.

RewriteRule

The first thing to learn is the basic (and yet complex!) directive RewriteRule.

It is difficult to know what order to present the information in, so I provide the basic elements of the grammar, and then a procedure to show how it seems to work in practice. I suggest you flip between both.

Grammar

The basic grammar is as follows

RewriteRule <pattern> <substitution> <flags>

If pattern matches something (see below) then replace that something with substitution applying flags.

Order of Execution

Each RewriteRule is executed in order of appearance. As noted above, if one or more substitutions occur, then once the current .htaccess file completes, the whole .htaccess file will run again from the start as an INTERNAL REDIRECT; or potentially (as noted in the Basic Lesson above) another .htaccess file or the httpd.conf file will run, if relevant to the substituted URL.

pattern

The pattern is a regular expression which is matched against something.

The something is a substring of the %{REQUEST_URI} depending on which file your RewriteRule is contained in.

Given the following URL:-

   
http://www.baconbutty.com/a/b/index.php?entryid=24#inpagelocation
\_______________________/\____________/ \________/ \____________/
              |                 |           |           |
            host               path        query      fragment
                                |
                           REQUEST_URI

The REQUEST_URI is /a/b/index.php. Note that this includes the initial /.

The initial / is ignored for pattern matching, at least for .htaccess files, but not I think for httpd.conf, but I have not tested that.

Accordingly the path substring is:-

  • If my RewriteRule is in the httpd.conf file (but not in a Directory section), /a/b/index.php. (Note the leading /)
  • If my RewriteRule is in a .htaccess file in the root directory, a/b/index.php.
  • If my RewriteRule is in a .htaccess file in the sub-directory a, b/index.php.
  • If my RewriteRule is in a .htaccess file in the sub-directory a/b, then index.php.

If you want to match the complete path substring, then the pattern should begin with ^ and end with $.

If you prefix the pattern with ! the rule is applied if the pattern does NOT match.

Note that the pattern does not match against the QUERY_STRING: you have to use RewriteCond for that.
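A minimal sketch of that division of labour, assuming a .htaccess file in the DOCUMENT_ROOT and re-using the example URL above; the target /entry/N/ is just an illustrative static URL. The RewriteRule pattern sees only the path, and the RewriteCond inspects the query string.

```apache
RewriteEngine on

# Only act when the query string is exactly entryid=<digits>
RewriteCond %{QUERY_STRING} ^entryid=(\d+)$

# %1 comes from the RewriteCond above.
# The trailing ? removes the original query string from the result.
RewriteRule ^a/b/index\.php$ /entry/%1/? [R=301,L]
```

With this in place a request for /a/b/index.php?entryid=24 should be redirected to /entry/24/.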

Finally note that you should try to make your patterns as unambiguous as possible. Liberal use of patterns such as .* will adversely affect performance.

substitution and flags

Now we get on to the most complex part of the whole process.

The substitution is the new path the URL is to point to.

Set out below are some of the more basic rules:-

General Rule

In the general case the process results in a replacement of the REQUEST_URI with your substitution string, so that you get a new URL in the form http://host/substituted-uri?query-string#fragment.

It is not actually quite that simple, but this is a good starting point.

Replacing the host

Normally your substitution string is not concerned with the host part of the URL.

If I call up http://www.baconbutty.com/blog/12/ and replace /blog/12 with /blog.php?id=12, the host http://www.baconbutty.com is simply re-attached at the end of the processing.

If I want to replace the host, I simply start the substitution with the new host. E.g. RewriteRule ^blog/(\d+)/$ http://www.newhost.com/blog.php?id=$1.

If I want to access the value of the current host (without knowing it in advance), I use the server variable HTTP_HOST thus: http://%{HTTP_HOST}/blog.php?id=$1.
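Put together as a complete rule, and assuming a .htaccess file in the DOCUMENT_ROOT, that might look like the sketch below. Note the % sign: server variables are written %{NAME}, whereas the ${...} form is reserved for rewrite maps.

```apache
RewriteEngine on

# Redirect /blog/12/ to /blog.php?id=12 on whatever host was requested
RewriteRule ^blog/(\d+)/$ http://%{HTTP_HOST}/blog.php?id=$1 [R=302,L]
```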

Begin with the / character.

This section applies if you are not replacing the host above - i.e. your substitution is a path to append to the existing host.

You will normally wish to begin your substitution string with a / character. This signifies that the substitution is a path originating in the DOCUMENT_ROOT.

Thus if I have a rule in directory http://www.baconbutty.com/a/ that reads RewriteRule ^(\d+)/$ /page.php?id=$1, a request for http://www.baconbutty.com/a/12/ will result in the URL http://www.baconbutty.com/page.php?id=12. (And, if there is a .htaccess file in the DOCUMENT_ROOT, it will then be triggered).

Now, if you omit the leading / you are telling Apache that your substitution is relative to some other directory within the DOCUMENT_ROOT - but which?

It is relative to the RewriteBase. I.e. the URL is built as follows http://host/RewriteBase/substituted-uri?query-string#fragment.

According to the Apache documents, the default setting is: RewriteBase physical-directory-path. I.e. if my DOCUMENT_ROOT is C:/Apache/www then this is the RewriteBase, which produces the following strange and invalid URL: http://www.baconbutty.com/C:/Apache/www/page.php?id=$1.

So if you are going to omit the / you need to say where the relative path is using RewriteBase.

E.g.

RewriteEngine on
RewriteBase /c/
RewriteCond %{REQUEST_URI} ^/a/b/(\d+)/\d+/$
RewriteRule ^b/\d+/(\d+)/$ page.php?id=%1&name=$1 [QSA]

Will rewrite http://www.baconbutty.com/a/b/12/13/?val=24#bookmark to http://www.baconbutty.com/c/page.php?id=12&name=13&val=24#bookmark.

but the following (only difference is a / in front of page):-

RewriteEngine on
RewriteBase /c/
RewriteCond %{REQUEST_URI} ^/a/b/(\d+)/\d+/$
RewriteRule ^b/\d+/(\d+)/$ /page.php?id=%1&name=$1 [QSA]

Will rewrite http://www.baconbutty.com/a/b/12/13/?val=24#bookmark to http://www.baconbutty.com/page.php?id=12&name=13&val=24#bookmark. The / overrides the RewriteBase.

Note RewriteBase / will rewrite all relative URLs to the DOCUMENT_ROOT.

Back References

If you have capturing parentheses in your pattern (e.g. /a/(\d+)/), then the captured values can be referenced in the substitution with $1..9.

In addition, if you use RewriteCond and the last condition that matches has capturing parentheses, then the captured values can be referenced in the substitution with %1..9. (Note the % sign). But note it is only the last matching RewriteCond that counts here. E.g. if we have the following:-


RewriteEngine on

RewriteCond %{REQUEST_URI} ^/blog/(\d+)/(\d+)/(\d+)/$
RewriteCond %{REQUEST_URI} ^/blog/(\d+)/(\d+)/(\d+)/$
RewriteCond %{REQUEST_URI} ^/blog/\d+/\d+/(\d+)/$
RewriteRule ^blog/\d+/\d+/\d+/$ /show.php?id=%1-%2-%3 [R]

And enter http://localhost/blog/2009/05/02/ we get http://localhost/show.php?id=02 because the last RewriteCond has only one set of parentheses.

Note these backreferences are not available if you use the ! (NOT) operator prefix in any pattern or RewriteCond, as there is no captured value because it did not match!

Server Variables

You can insert server variables in your substitution with %{VARIABLE_NAME}.

Query String and [QSA] flag.

If the substitution contains no QUERY_STRING, then the original QUERY_STRING is retained and re-attached at the end.

If the substitution ends with just ? then this will erase any QUERY_STRING.

If the substitution contains a QUERY_STRING, but no [QSA] flag, then this will override (replace) the original QUERY_STRING, which will be lost.

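If the substitution contains a QUERY_STRING, and a [QSA] flag, then both the original and substitution QUERY_STRINGS will be retained and concatenated, with the substitution's values first and the original appended after them. If value names are the same, both copies are kept, and which one wins depends on the script that reads them.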
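The four cases can be compared with a sketch like the following, assuming a .htaccess file in the DOCUMENT_ROOT. The paths one, two, three and four and the blog.php script are invented so that each rule can be tried separately with ?val=24 on the end of the request.

```apache
RewriteEngine on

# 1. No query string in the substitution: the original is kept
#    /one/12/?val=24   ->  /blog.php?val=24
RewriteRule ^one/(\d+)/$ /blog.php

# 2. Substitution ends with a bare ?: the original is erased
#    /two/12/?val=24   ->  /blog.php
RewriteRule ^two/(\d+)/$ /blog.php?

# 3. New query string, no [QSA]: the original is replaced
#    /three/12/?val=24 ->  /blog.php?id=12
RewriteRule ^three/(\d+)/$ /blog.php?id=$1

# 4. New query string with [QSA]: both are kept
#    /four/12/?val=24  ->  /blog.php?id=12&val=24
RewriteRule ^four/(\d+)/$ /blog.php?id=$1 [QSA]
```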

No Substitution

If your string is simply - that means no substitution. You often use this with other chained rules.
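As a sketch of the - substitution, assuming a .htaccess file in the DOCUMENT_ROOT and a hypothetical blog.php script: the first rule matches image requests, changes nothing, and stops, so the later rule never gets a chance to touch them.

```apache
RewriteEngine on

# Leave image requests untouched and stop processing further rules
RewriteRule \.(gif|png|jpe?g)$ - [NC,L]

# Anything that looks like /blog/N/ is re-written as usual
RewriteRule ^blog/(\d+)/$ /blog.php?id=$1 [L]
```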

flags

Flags appear in square brackets. The only flag I was looking for at this point was the white flag. Or if you prefer, I was flagging when I got to this point, and certainly wasn't Jolly Roger, more George Cross.

You can combine them with commas, e.g. [R,L].

There is a large range of flags, which provide a powerful range of functionality, and are well worth studying.

The simple ones I am interested in are:-

[L]

This stops the processing of any subsequent RewriteRules and the URL is returned at that point.

As noted above, if one or more substitutions have occurred, then the httpd.conf or .htaccess file will be re-run, and re-run, until no further substitutions occur.

[NC]

Make the pattern matching not case-sensitive.

[QSA]

Preserve the current query string, and add any new items from the current substitution.

[R] or [R=code]

This forces an external re-direction, and changes the browser address bar.

If no code is used, it is 302.

Others are:-

300 : Multiple Choices

301 : Moved Permanently

This is used to tell search engines to drop your old dynamic URLs.

302 : Moved Temporarily (HTTP/1.0)

302 : Found (HTTP/1.1)

303 : See Other (HTTP/1.1)

304 : Not Modified

305 : Use Proxy

307 : Temporary Redirect

You use this if you want the client browser's address bar to update with the value of the changed URL.

Normally you should also apply [L] at this point, as [R=301,L] to stop further substitutions.

This combination is often used if you want your URLs to always end with a /. E.g.

RewriteEngine on

# Force a / to be added and then restart from the top. Address bar is updated.
RewriteRule ^b/(\d+)$ /b/$1/ [R=301,L]

# Hidden re-write to the php file.  Address bar is not updated.
RewriteRule ^b/(\d+)/$ /page.php?id=$1 [QSA]

This will replace http://www.baconbutty.com/b/12 with http://www.baconbutty.com/b/12/ and update the address bar, and then convert http://www.baconbutty.com/b/12/ to http://www.baconbutty.com/page.php?id=12.

Process

I have tried to map out how Apache processes an incoming URL, and this is my best guess.

Assume you start with the following typed URL:-

   
http://www.baconbutty.com/a/b/12/13/?val=24#bookmark
\_______________________/\_________/ \____/ \______/
              |            |           |       |
              |         path         query     fragment
              |           |            |         
         HTTP_HOST  REQUEST_URI    QUERY_STRING  

http://localhost/a/b/12/13/?val=24#bookmark           
http://localhost/page.php?id=12&name=13&val=24#bookmark 

Assume that you want to re-write this to "http://www.baconbutty.com/page.php?id=12&name=13&val=24#bookmark".

Assume also that your DOCUMENT_ROOT (the physical path on the disk where your web pages are stored) is C:/www (a Windows system).

Assume you have a .htaccess file in folder a that reads:-

RewriteEngine on
RewriteCond %{REQUEST_URI} ^/a/b/(\d+)/\d+/$
RewriteRule ^b/\d+/(\d+)/$ /page.php?id=%1&name=$1 [QSA]

Assume you have a page.php as follows:-

<?php 
function Debug_Dump($obj /*: Any*/) /*: String*/
{
	ob_start();
	var_dump($obj);
	$a = ob_get_contents();
	ob_end_clean();	
	return "<pre class='dump'>" . htmlspecialchars($a, ENT_QUOTES) . "</pre>";
}

echo Debug_Dump($_GET);
echo Debug_Dump($_SERVER);
?>

The process seems to be:-

  • HTTP_HOST (http://www.baconbutty.com) is stripped off
  • ? is discarded
  • QUERY_STRING (val=24) is stripped off
  • # is discarded
  • fragment (bookmark) stripped off
  • This leaves you with the REQUEST_URI (/a/b/12/13/).
  • The leading /a/ is removed, because the .htaccess file is in the a directory.
  • This leaves you with the relative path b/12/13/, which is what the pattern is matched against.
  • If the pattern matches then start creating the substitution.
  • Because the substitution begins with /, this means that we are building a path that starts in the root directory, not in a.
  • Because we have %1 this is the 12 matched in the last RewriteCond.
  • Because we have $1 this is the 13 matched in the RewriteRule pattern.
  • The substitution is therefore /page.php?id=12&name=13
  • Because we have the flag [QSA] this means retain val=24 and add it to the new URL.
  • The final URL is http://www.baconbutty.com/page.php?id=12&name=13 + &val=24 + #bookmark.
  • If there is a .htaccess in the directory http://www.baconbutty.com/ then this will be applied to the new URL.
  • If the URL were rewritten back to a, i.e. http://www.baconbutty.com/a/page.php then the .htaccess file in folder a would be triggered again.

RewriteCond

So now I have a rough idea about RewriteRules, I turn my attention briefly to the RewriteCond.

In brief, a RewriteRule can have one or more RewriteConds preceding it.

If the pattern for the RewriteRule matches (this is tested first), then before the substitution occurs, each of the RewriteConds is evaluated in order of appearance (collecting back-references as they go), and if, and only if, they are all true, is the substitution effected.

E.g. if we have the following:-

RewriteEngine on

RewriteCond %{REQUEST_URI} (\d+)/(\d+)/(\d+)/$
RewriteRule ^blog/\d+/\d+/\d+/$ /show.php?id=%1-%2-%3 [R]

This matches a URL which ends with a date, e.g. 2009/05/02/. So, if we type in http://localhost/blog/2009/05/02/ we get http://localhost/show.php?id=2009-05-02.

Why use RewriteCond? Because it can match a wider range of strings than RewriteRule. RewriteRule matches only against a substring of REQUEST_URI. RewriteCond can match against whatever string you specify, such as QUERY_STRING.

Grammar

The basic grammar is as follows

RewriteCond <TestString> <pattern> [flags]

This is different to RewriteRule. RewriteRule always matches against all or part of the REQUEST_URI.

RewriteCond matches against whatever TestString you care to give it.

Most of the time the TestString will be a server variable, and most often %{REQUEST_URI}

The key lesson to note is that the REQUEST_URI is always going to begin with a / whereas the string that the RewriteRule matches does not, because it matches against a substring of the full REQUEST_URI, relative to the directory in which the .htaccess is located.

So inspecting the following .htaccess file in the DOCUMENT_ROOT:-

RewriteEngine on

RewriteCond %{REQUEST_URI} ^/blog/(\d+)/(\d+)/(\d+)/$
RewriteRule ^blog/\d+/\d+/\d+/$ /show.php?id=%1-%2-%3 [R]

If you look closely, the pattern for RewriteCond begins with a / but does not for the pattern in the RewriteRule. This was a considerable source of confusion for me at first.

TestString

The main things you can include in your TestString (as well as plain text) are:-

Server Variables

E.g. REQUEST_URI or QUERY_STRING

RewriteRule pattern backreferences

The processing order for a given RewriteRule is:-

1. Process RewriteRule pattern (against the REQUEST_URI path) and collect $1..$N backreferences

2. If it matches, process each of the RewriteConds, from first appearing to last, and capture %1...%N backreferences from the last.

3. If all RewriteConds match, then process the RewriteRule substitution.

Given 1. occurs first, the $1..$N backreferences are in principle available to the RewriteCond TestString.

Here is an example. Assume the following in the a directory off the DOCUMENT_ROOT:-

RewriteEngine on

RewriteCond $1 ^(2009)$
RewriteRule ^blog/(\d+)/(\d+)/(\d+)/$ /show2009.php?id=%1-$2-$3 [R,L]

RewriteCond $1 !^(2009)$
RewriteRule ^blog/(\d+)/(\d+)/(\d+)/$ /showOther.php?id=$1-$2-$3 [R]

#http://localhost/a/blog/2009/05/02/
#http://localhost/a/blog/2008/05/02/

For the above URLs, this will rewrite to show2009 or showOther depending on the year.

RewriteCond pattern backreferences

If you have more than one RewriteCond, you can use the %1...%N backreferences of the previous in the next.

Here is an example. Assume the following in the a directory off the DOCUMENT_ROOT:-

RewriteEngine on

RewriteCond $1 ^(\d)\d{0,3}$
RewriteCond %1 ^2$
RewriteRule ^blog/(\d+)/(\d+)/(\d+)/$ /show2000.php?id=$1-$2-$3 [R,L]

RewriteCond $1 ^(\d)\d{0,3}$
RewriteCond %1 !^2$
RewriteRule ^blog/(\d+)/(\d+)/(\d+)/$ /showOther.php?id=$1-$2-$3 [R]

#http://localhost/a/blog/2001/05/02/
#http://localhost/a/blog/1999/05/02/

For the above URLs, this will rewrite to show2000 or showOther depending on whether the year is in the range 2000-2999.

pattern

This is a Perl compatible regular expression, which you can prefix with ! if you want the pattern to be successful if and only if it does NOT match.

Some of the special items are:-

<CondPattern

Is the TestString alphabetically (lexically) before the CondPattern? I.e. TestString < CondPattern.

>CondPattern

Is the TestString alphabetically (lexically) after the CondPattern? I.e. TestString > CondPattern.

=CondPattern

Is the pattern identical to the TestString?

This gives better performance than using a regular expression for the same purpose.
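
A minimal sketch of the = comparison, assuming a hypothetical comment.php script that I want to protect from POST requests:-

```apache
RewriteEngine on

# Plain string comparison: is the request method exactly "POST"?
RewriteCond %{REQUEST_METHOD} =POST

# "-" means no substitution; [F] sends back 403 Forbidden
RewriteRule ^comment\.php$ - [F]
```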

=""

Is the TestString the empty string?

-d
-f

Is the TestString an absolute path to an existing directory (-d) or file (-f)?

This is useful if your RewriteRule is likely to match actual resources, where you want a straight (no substitution) pass through.

Remember that each check has to touch the file system, so this can cause performance issues.

You must supply an absolute file system path as the TestString, not a relative one.

Typically you might use the %{REQUEST_FILENAME} environment variable as the TestString.

Or you might construct something using the %{DOCUMENT_ROOT} path (remembering to add an extra / after %{DOCUMENT_ROOT}).
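
Here is a minimal sketch of building a path from %{DOCUMENT_ROOT}. It assumes the directives sit in the DOCUMENT_ROOT itself, and that a hypothetical cache directory holds pre-built copies of blog pages named by id (e.g. cache/12.html):-

```apache
RewriteEngine on

# Does a pre-built copy exist on disk? Note the extra / after %{DOCUMENT_ROOT}
RewriteCond %{DOCUMENT_ROOT}/cache/$1.html -f

# If so, serve it in place of the dynamic page
RewriteRule ^blog/(\d+)/$ /cache/$1.html [L]
```

If the file does not exist, the condition fails and the request is left for any later rules to deal with.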

Here is an unrealistic example. Assume the DOCUMENT_ROOT contains the directory structure a/b/, and that b contains test.txt. Assume the following in the directory named a:-

RewriteEngine on

RewriteCond $1 >a
RewriteCond $1 <c
RewriteCond %{QUERY_STRING} =""
RewriteCond %{DOCUMENT_ROOT}/a/$1 -d
RewriteCond %{DOCUMENT_ROOT}/a/$1/test.txt -f

RewriteRule ^blog/([a-z])/(\d+)/$ /show.php?id=$2 [R,L]

#http://localhost/a/blog/b/1/

For the above URL, this will redirect to show.php if:-

  • ([a-z]) contains a character which is >a and <c
  • The QUERY_STRING is empty
  • a contains a directory matched by ([a-z])
  • that directory contains a file called test.txt

flags

[NC]

This makes the test case-insensitive.

[OR]

Normally all conditions must be met (an implicit AND). This flag joins a condition to the next one with OR instead, so that either may be met.
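
A minimal sketch combining the two flags. The bot names are made up for the example:-

```apache
RewriteEngine on

# Either condition may match; [NC] ignores case in the User-Agent string
RewriteCond %{HTTP_USER_AGENT} ^BadBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^EvilScraper [NC]

# "-" means no substitution; [F] sends back 403 Forbidden
RewriteRule .* - [F]
```

Note that the last condition in the chain does not carry [OR], since there is no further condition for it to be joined to.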

Final Examples

Basic URL rewriting

If I want to:-

  • add a / to the end of http://localhost/blog/12
  • update the browser address bar
  • re-write to http://localhost/show.php?id=12

RewriteEngine on

RewriteRule ^blog/(\d+)$ /blog/$1/ [R=301,L]
RewriteRule ^blog/(\d+)/$ /show.php?id=$1

Guessable URLs

I want http://localhost/images/ to point to http://localhost/pictures/

RewriteEngine on

RewriteRule ^images/?$ /pictures/ [R=301]

Rewrite the whole URL to a QUERY_STRING

This is something Drupal does:-

RewriteEngine on

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]

Rewriting Dynamic to Static

If you switch to static URLs, you may have to rewrite old dynamic URLs to the new static ones.

E.g. with my web site I have http://www.baconbutty.com/blog-entry.php?id=24, and I want to rewrite it to http://www.baconbutty.com/blog/entry/24/.

RewriteEngine on

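# Capture the id here; the RewriteRule pattern only sees the path, not the query string
RewriteCond %{QUERY_STRING} id=(\d+)

# The trailing ? drops the original query string from the redirect target
RewriteRule ^blog-entry\.php$ /blog/entry/%1/? [R=301]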

#http://localhost/blog-entry.php?id=24
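
One thing to watch for: if the new static URL is itself rewritten internally back to blog-entry.php, then in a per-directory file the redirect rule can fire again on the rewritten request, and you may end up in a redirect loop. A minimal sketch of one common way round this, assuming both rules live in the DOCUMENT_ROOT, is to test %{THE_REQUEST} (the original request line sent by the browser, which internal rewrites do not change):-

```apache
RewriteEngine on

# Redirect only when the browser itself asked for blog-entry.php?id=N
RewriteCond %{THE_REQUEST} ^[A-Z]+\s/blog-entry\.php\?id=(\d+)[\s&]
RewriteRule ^blog-entry\.php$ /blog/entry/%1/? [R=301,L]

# Internally map the static URL back onto the real script
RewriteRule ^blog/entry/(\d+)/$ /blog-entry.php?id=$1 [L]
```

This assumes id is the first parameter in the query string; a different parameter order would need a looser pattern.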

Don't Forget

Relative Links

If you are going to use static URLs, such as http://www.baconbutty.com/blog/12/ then you need to remember two things:-

  • Firstly, all links to other pages will need to be switched to static.
  • Secondly, this will potentially screw up your links to other static resources / bookmarks in the document itself.

In relation to the second point, let's consider this further.

There are two aspects to consider:-

  • Bookmark links to the same page, such as href="#toc804580". These should continue to work as is: http://www.baconbutty.com/blog/12/#toc804580 should be just fine (as noted, the fragment is dealt with separately).
  • Relative links to other static resources will fail. E.g. if I import a style with @import url(common.css); then common.css is relative and will be seen as http://www.baconbutty.com/blog/12/common.css which does not exist.

So how to solve the second bullet point?

  • Well the easiest is to convert all relative links to full URLs, e.g. @import url(http://www.baconbutty.com/common.css); or the shorter version of prepending a /, e.g. @import url(/common.css);
  • There is a shortcut, which is to use the base element. E.g. <base href="/"> or <base href="http://www.baconbutty.com/">. BUT WATCH OUT - this will also affect your in-page bookmarks. So it is not really a solution.

Accordingly, if you are going to move to static URLs, you need to ensure that you adequately deal with relative links and bookmarks separately.

