Introduction
My experience with comment-spam bots compelled me to teach myself about server configuration directives, for IP address blocking.
At the same time, I dipped my toe into the turbulent waters of URL re-writing.
Set out below are my rough notes, representing things I have gathered from other web entries, and my best guess as to what is going on; and limited to aspects I have a direct interest in and/or have a hope of understanding.
I have only scratched the surface here of course.
I assume that you are already very familiar with Perl regular expressions, and make no attempt to explain these.
I am focusing on Apache, but will follow up with a post on Zeus and PHP only solutions.
The whole article is one rambling mess, and gradually got bigger and bigger, so my apologies for this.
Basic Lesson
I understand that Apache supports (through a number of modules) configuration directives which, among other things, permit visitor blocking, URL-rewriting and other processing of HTTP requests coming into your server.
These commands are usually included in one of two places:-
httpd.conf
This is the main server configuration file, applied to every request that comes into the server.
You have to re-start the server (Run As Administrator in Vista) each time you change it.
.htaccess
These are text files you create, known as per-directory configuration files, which you can put in the root directory or a sub-directory of your web site.
They are applied to every request for a resource that is in the directory the .htaccess file is saved in, or in any of its sub-directories.
The .htaccess file is loaded each time a request is made, so you don't need to re-start the server; but this re-loading does involve a performance penalty. Indeed I have seen recommendations that .htaccess files be avoided where possible, because they create an additional workload for the server.
They are mainly used where you do not have control over the httpd.conf file (through which you can achieve the equivalent in a <Directory> section). I.e. if I wanted to apply directives to http://www.baconbutty.com/info, I could either put a .htaccess file in that directory, or put the commands in a <Directory "C:/WAMP/Apache/httpd/info"> section.
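To make the equivalence concrete, here is a minimal sketch of the same setting written both ways. The paths are the ones used above; DirectoryIndex is only a placeholder directive picked for illustration, and any directive permitted in both contexts would do.

```apache
# Option 1: C:/WAMP/Apache/httpd/info/.htaccess (re-read on each request)
DirectoryIndex index.php

# Option 2: httpd.conf (read once at start-up; needs a server re-start)
<Directory "C:/WAMP/Apache/httpd/info">
    DirectoryIndex index.php
</Directory>
```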
You can change the standard name for these files from .htaccess to something else with AccessFileName .mynewname in the httpd.conf file.
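If you do rename the file, bear in mind that the stock httpd.conf normally only hides files whose names begin with .ht from visitors. A hedged sketch, assuming the new name .mynewname from above and the Apache 2.0/2.2 access-control syntax used elsewhere in these notes:

```apache
# httpd.conf
AccessFileName .mynewname

# Stop the renamed per-directory files being served to browsers
<Files ".mynewname">
    Order Allow,Deny
    Deny from all
</Files>
```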
You will also need to include Options FollowSymLinks and AllowOverride All in a <Directory> section for your DOCUMENT_ROOT in your httpd.conf file for these .htaccess files to work. (AllowOverride takes All, None or a list of directive types such as FileInfo and Limit; there is no True value.)
E.g.
# Specify the document root
DocumentRoot "E:/Server/www"
# Set the permissions for the document root
<Directory "E:/Server/www">
Options Indexes FollowSymLinks
AllowOverride All
Order allow,deny
Allow from all
</Directory>
If you are using a shared server, then this may require some negotiation with your web hosting company, or a change to another web hosting company.
Note that if you use FileZilla as your FTP client, you will need to select Server > Force showing hidden files to see the .htaccess files, and then refresh the directory listing.
Order of precedence
Now what happens if you have configurations in more than one file?
E.g.
httpd.conf
C:/Apache/www/.htaccess
C:/Apache/www/a/.htaccess
What happens if you call up /a/somefile.htm? Which of the above files will run?
The answer, as far as I can tell, depends on what is happening in those files.
For rewrite directives, C:/Apache/www/a/.htaccess will generally take precedence over the others, unless it has nothing to say at all. I.e. if C:/Apache/www/a/.htaccess has something to say on the matter, then the rewrite rules set for the parent directories will not run. Even if that something is nothing.
E.g. if I simply insert RewriteEngine on in C:/Apache/www/a/.htaccess and do nothing else, then this will effectively block any rewrite directives in C:/Apache/www/.htaccess and in a matching <Directory> section of httpd.conf.
(Rules placed at server level in httpd.conf, outside any <Directory> section, are a different matter: as far as I can tell they are applied earlier, before any .htaccess file is consulted.)
See:-
#C:/Apache/www/a/.htaccess
RewriteEngine on
#C:/Apache/www/.htaccess
RewriteEngine on
RewriteRule ^a/(\d+)/$ /b/$1/
If I type in "http://localhost/a/1/" I will get a 404 error for URL /a/1/, and the rule in C:/Apache/www/.htaccess will not run.
However, if I comment out the RewriteEngine on line in www/a/.htaccess, then www/.htaccess will run, and I will get a 404 error for URL /b/1/ (after the substitution).
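If you do want the parent directory's rules to apply as well, mod_rewrite documents a RewriteOptions inherit setting for this. It has not been tested for these notes, so treat the following as a sketch of what the Apache documentation describes rather than a recipe:

```apache
# C:/Apache/www/a/.htaccess
RewriteEngine on
# Ask mod_rewrite to also apply the rewrite rules configured for the parent directory
RewriteOptions inherit
```

As far as I understand it, the inherited rules are then evaluated as if they had been written in the a directory, so a parent pattern that expects the a/ prefix may still fail to match.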
Equally, if a URL substitution does occur, then we start all over again:-
#C:/Apache/www/a/.htaccess
RewriteEngine on
# Patterns in a/.htaccess are matched against the path relative to /a/
RewriteRule ^(\d+)/$ /b/$1/
#C:/Apache/www/.htaccess
RewriteEngine on
RewriteRule ^b/(\d+)/$ /c/$1/
In this case, after C:/Apache/www/a/.htaccess has changed /a/1/ to /b/1/, Apache treats this as a brand new request, and as there is nothing in the /b/ directory, it will run the C:/Apache/www/.htaccess file and I will get a 404 error for URL /c/1/ (after the substitution).
Got that!?
Environment Variables
Within httpd.conf and .htaccess files a range of variables is generally available, addressed in the form %{VARIABLE_NAME}.
These are the CGI environment variables, and some additions. See Apache Docs and Zytrax and NCSA.
The notes below are just a rough sub-set of a wider range of variables available.
URL Components
http://www.baconbutty.com:8080/a/b/index.php?entryid=24#inpagelocaton
\__/   \________________/ \__/\____________/ \________/ \___________/
 |             |           |        |            |            |
scheme        host        port     path        query       fragment
               |           |        |            |
          SERVER_NAME SERVER_PORT REQUEST_URI QUERY_STRING [not accessible]
          HTTP_HOST
Client
REMOTE_ADDR
The IP address of the client making the request
E.g. 127.0.0.1
REMOTE_HOST
The hostname of the client making the request, if this can be resolved from REMOTE_ADDR; if not, normally the same as REMOTE_ADDR.
HTTP_REFERER
The URL of the page that made this request.
If the request came from a link in an e-mail or the URL was entered manually, this value is empty.
HTTP_ACCEPT
The MIME types the requestor will accept as defined in the HTTP header.
E.g. text/html, application/xml;q=0.9, application/xhtml+xml, image/png, image/jpeg, image/gif, image/x-xbitmap, */*;q=0.1
HTTP_ACCEPT_ENCODING
The encoding types the requestor will accept as defined in the HTTP header.
E.g. gzip, deflate
E.g. deflate, gzip, x-gzip, identity, *;q=0
HTTP_ACCEPT_LANGUAGE
E.g. en-GB,en;q=0.9
HTTP_ACCEPT_CHARSET
E.g. iso-8859-1, utf-8, utf-16, *;q=0.1
HTTP_COOKIE
HTTP_COOKIE2
The requestor's cookie, if one is set.
E.g. PHPSESSID=j79rmobpf62mgo9hdjl4jgus85
HTTP_USER_AGENT
The browser type of the requestor.
E.g. Opera/9.50 (Windows NT 6.0; U; en)
Request
REQUEST_METHOD
The name of the method being used (GET, POST, etc.)
E.g. GET
REQUEST_PROTOCOL
SERVER_PROTOCOL
The name and version of the protocol with which the request was made.
E.g. HTTP/1.0, HTTP/1.1, etc.
HTTP_HOST
The hostname of the URL requested
E.g. www.baconbutty.com
SERVER_PORT
The port number to which the request was sent
E.g. 80
REQUEST_URI
The resource requested on the HTTP request line. The portion of the URL following the scheme and host portion, without the query string, but including the first /.
E.g. /a/b/index.php
QUERY_STRING
The information (if any) following the "?" in the URL for this request, but not including the fragment.
E.g. entryid=24
REDIRECT_URL
If the request was a redirect from somewhere else, the URL is accessible through this variable.
E.g. /entry/24/
REDIRECT_QUERY_STRING
The query string supplied as part of the re-direct.
E.g. entryid=24
Server Internal
THE_REQUEST
The full HTTP request line sent by the browser to the server (e.g., "GET /index.html HTTP/1.1").
Use this if you want to match against both the REQUEST_URI + QUERY_STRING at the same time, in a RewriteCond.
HTTPS
Will contain the text "on" if the connection is using SSL/TLS, or "off" otherwise.
SERVER_ADDR
The IP address of the server.
E.g. 127.0.0.1
SERVER_NAME
The server's host name, DNS alias or IP address.
E.g. www.baconbutty.com
SERVER_PORT
The port for the request.
E.g. 80
DOCUMENT_ROOT
The root directory of your server.
E.g. C:/WAMP/Apache/htdocs
REQUEST_FILENAME
SCRIPT_FILENAME
The full server path to the file or directory
E.g. C:/WAMP/Apache/htdocs/tgis/tgis.php
SCRIPT_NAME
The relative path (to your DOCUMENT_ROOT).
E.g. /tgis/tgis.php
REQUEST_TIME
Unix time for the request.
E.g. 1240783050
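As an illustration of how these variables get used, here is a minimal sketch that refuses image requests whose referring page is on another site. It assumes a .htaccess file in the DOCUMENT_ROOT and uses the [F] (forbidden) flag, which is not otherwise covered in these notes; the file extensions are only examples.

```apache
RewriteEngine on
# Let through requests with no referer at all (typed URLs, some proxies)
RewriteCond %{HTTP_REFERER} !^$
# Let through requests coming from pages on this site
RewriteCond %{HTTP_REFERER} !^http://(www\.)?baconbutty\.com/ [NC]
# Anything else asking for an image gets a 403
RewriteRule \.(gif|jpe?g|png)$ - [F,NC]
```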
Directives and Limits
By default every directive you include applies to every type of request method (GET, HEAD, POST, PUT etc).
It is possible to limit which request methods your directive applies to, in two ways.
Limit
See: Limit
# For GET and POST only: deny everyone, then let 127.0.0.1 back in
<Limit GET POST>
Order Deny,Allow
Deny from all
Allow from 127.0.0.1
</Limit>
A GET or POST request will be denied unless it originates from IP address 127.0.0.1. Requests using other methods are not affected by this section.
Note that GET also implicitly includes HEAD requests.
LimitExcept
See: LimitExcept
# Only the methods NOT listed are denied
<LimitExcept GET POST>
Order Deny,Allow
Deny from all
</LimitExcept>
This example denies all requests that don't use the GET or POST methods.
Allowing and Denying Requests
Allow and Deny Directives
The Allow and Deny directives let you selectively block and allow requests to your server, or particular directories in your server (where included in a .htaccess file).
All Allow directives are processed as a group, and all Deny directives are processed as a group, irrespective of where they appear.
Order of Precedence
Within each group the order of the individual directives does not matter: a request that matches any one Allow directive counts as allowed, and a request that matches any one Deny directive counts as denied.
The order of precedence between Allow and Deny is determined by the Order directive, which should normally appear first.
See Order
E.g.
Order Deny,Allow
Allow takes precedence over Deny.
First, all Deny directives are evaluated; if any match, the request is denied unless it also matches an Allow directive. Any requests which do not match any Allow or Deny directives are permitted.
In other words:-
- Allow directives take precedence over Deny directives in the event of conflict.
- Default is to Allow if no directive applies.
E.g.
Order Allow,Deny
Deny takes precedence over Allow.
Firstly, all Allow directives are evaluated; at least one must match, or the request is rejected. Next, all Deny directives are evaluated; if any matches, the request is rejected. Lastly, any requests which do not match an Allow or a Deny directive are denied by default.
In other words:-
- Default is to Deny if no directive applies.
- All requests must accordingly be covered by an Allow, otherwise Deny.
- Of those Allow requests, block any that are covered by a specific Deny.
Note that the two keywords are separated by a comma only; Apache does not accept a space after the comma.
Allow from
See : Allow
Allow from all
Allow from 127.0.0.1
Allow from baconbutty.com
You can specify a full or partial IP address or host name.
Deny from
See : Deny
Deny from all
Deny from 127.0.0.1
Deny from baconbutty.com
You can specify a full or partial IP address or host name.
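Putting Order, Allow and Deny together for the comment-spam case that started all this, a minimal sketch might look like the following. The addresses are placeholders taken from the ranges reserved for documentation, not real offenders.

```apache
# Everyone is allowed in, except the addresses listed under Deny
Order Allow,Deny
Allow from all
# A single address
Deny from 192.0.2.15
# A partial address: everything beginning 198.51.100.
Deny from 198.51.100
```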
Selective Files
It is possible to use SetEnvIf to exempt certain files from the Allow and Deny rules.
# Sets allow-it if Request_URI matches
SetEnvIf Request_URI "^/(robots\.txt|my_custom_403_page\.html)$" allow-it
# Allow takes precedence
Order Deny,Allow
# env=allow-it matches if the variable `allow-it` is set
# (method names are case-sensitive, so GET not get)
<Limit GET>
Allow from env=allow-it
</Limit>
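The exemption only does something useful alongside a Deny rule. A sketch of the whole arrangement, assuming a placeholder address and the custom error page named in the SetEnvIf line:

```apache
# Mark the files that must stay reachable even for blocked visitors
SetEnvIf Request_URI "^/(robots\.txt|my_custom_403_page\.html)$" allow-it
# Show the custom page when a request is refused
ErrorDocument 403 /my_custom_403_page.html
# Deny is evaluated first, then Allow can override it
Order Deny,Allow
Deny from 192.0.2.15
Allow from env=allow-it
```

Without the exemption, a blocked visitor would also be refused the custom 403 page itself.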
URL Rewriting
Introduction
This has been described as voodoo. In fact, there are so many points you can prick yourself on, I started to feel like a voodoo doll.
The main documentation looks deceptively simple; but it harbours many subtle and confusing points.
I am not sure the information below is correct, but it is my best guess at understanding what is going on, at a superficial level, based on the documentation, trial and error, and a little reverse engineering.
Two definitions:-
- dynamic url - /blog.php?id=1
- static url - /blog/1/ or /image.gif
Docs
The key starting place is: Apache 2.0 - Mod ReWrite Docs
Is it on?
You must explicitly switch the rewrite engine on.
A simple test like this in a .htaccess file in the DOCUMENT_ROOT:-
RewriteEngine on
RewriteRule ^1$ 2
will produce an error that it cannot find /2 if you type "www.mysite.com/1" in the address bar.
I.e. 1 is replaced by 2, and 2 does not exist, producing a 404 error.
Logging
If you need some help following what is going on behind the scenes then insert the following in your httpd.conf file:-
RewriteLog "logs/rewrite.log"
RewriteLogLevel 9
This records all of the re-write actions in a log file in the logs directory of the Apache installation directory.
Level 9 is the most verbose, so expect long files.
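One caveat: RewriteLog and RewriteLogLevel belong to Apache 2.2 and earlier. In Apache 2.4 they were removed, and rewrite tracing goes to the ordinary error log through LogLevel instead. A sketch of the rough equivalent, where trace3 is just an arbitrary middle level:

```apache
# httpd.conf, Apache 2.4: keep other modules quiet, trace mod_rewrite
LogLevel alert rewrite:trace3
```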
Key Points
- Options +FollowSymLinks
Some forums seem to recommend including this in each .htaccess file, to be on the safe side (in case it is not set in the httpd.conf).
- Browser Address Bar
When you re-write a URL, the browser address bar will not normally be updated. The re-write will happen internally within Apache.
If you want the browser address bar to be updated, you must append an [R] flag (external redirect). Normally you also append an [L] flag (stop here), giving [R,L], as you will not want further re-writing rules in the .htaccess file to apply. But see below: the .htaccess will run again from the top.
- How many times is .htaccess run?
If the URL is changed (i.e. one or more substitutions are made) by your re-write code, then the .htaccess or httpd.conf re-write rules will be re-run from the beginning using the changed URL (an INTERNAL REDIRECT).
In other words, the .htaccess file will keep re-running over and over until there are no further changes to the URL.
For this reason, be very careful if any later RewriteRule produces a URL which is then capable of being matched by an earlier RewriteRule on a re-run.
Also watch out for infinite loops.
You can avoid this with :-
RewriteOptions MaxRedirects=10
Which will stop INTERNAL REDIRECTS after they reach 10 (i.e. 10 substitutions).
For example:-
RewriteEngine on
RewriteRule ^3$ /4
RewriteRule ^2$ /3
RewriteRule ^1$ /2
With the above, if I type in http://www.baconbutty.com/1, the message will be cannot find file /4.
I.e. the last rule is matched, then .htaccess starts from the top (INTERNAL REDIRECT) and the second to last is matched, then .htaccess starts again from the top and the first is matched!
This can produce confusing effects.
For instance, say I want to have a canonical URL in the form http://www.baconbutty.com/blog/1/ which is then re-written to /blog-entry.php?id=1. The user types in http://www.baconbutty.com/blog/1. I first need to add the trailing / and then re-write to blog-entry.php?id=1.
You could do this:-
RewriteEngine on
# Add the trailing /
RewriteRule ^(.*[^/])$ /$1/ [R=301,L]
# Re-write to blog-entry (no leading / in the pattern, as this is a .htaccess file)
RewriteRule ^blog/(\d+)/$ blog-entry.php?id=$1
But this results in http://www.baconbutty.com/blog-entry.php/?id=1 - there is a / after php. Why? Because:-
- the second rule creates http://www.baconbutty.com/blog-entry.php?id=1
- the .htaccess is run again,
- the first rule, which uses .*, applies again to insert / after php.
So, what to do?
The only answer seems to be to make the regular expression for the RewriteRule for the trailing / more specific.
E.g.
RewriteRule ^blog/(\d+)$ /blog/$1/ [R=301,L]
RewriteRule ^blog/(\d+)/$ blog-entry.php?id=$1
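Another approach that is often suggested for the re-run problem is to skip all rules once an internal redirect has already happened. The sketch below relies on the REDIRECT_STATUS environment variable, which Apache sets on internally redirected requests. Whether it suits a given set-up is an assumption that needs testing before you rely on it.

```apache
RewriteEngine on
# If this pass is the result of an earlier internal redirect, leave the URL alone
RewriteCond %{ENV:REDIRECT_STATUS} 200
RewriteRule .* - [L]

# Add the trailing / (external redirect, so the browser asks again)
RewriteRule ^blog/(\d+)$ /blog/$1/ [R=301,L]
# Re-write to blog-entry (internal, so the next pass is stopped by the guard above)
RewriteRule ^blog/(\d+)/$ /blog-entry.php?id=$1 [L]
```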
RewriteEngine
To get up and running, you set RewriteEngine on, assuming that the appropriate module has been loaded and settings made as detailed further above.
RewriteRule
The first thing to learn is the basic (and yet complex!) directive RewriteRule.
It is difficult to know what order to present the information in, so I provide the basic elements of the grammar, and then a procedure to show how it seems to work in practice. I suggest you flip between both.
- Grammar
The basic grammar is as follows
RewriteRule <pattern> <substitution> <flags>
If pattern matches something (see below) then replace that something with substitution, applying flags.
- Order of Execution
Each RewriteRule is executed in order of appearance and, as noted above, if one or more substitutions occur, the whole .htaccess file (or potentially, as noted in the Basic Lesson above, another .htaccess file or the httpd.conf file, if relevant to the substituted URL) will run again from the start, as an INTERNAL REDIRECT, once the current .htaccess file completes.
- pattern
The pattern is a regular expression which is matched against something.
The something is a substring of the %{REQUEST_URI}, depending on which file your RewriteRule is contained in.
Given the following URL:-
http://www.baconbutty.com/a/b/index.php?entryid=24#inpagelocaton
\_______________________/\____________/ \________/ \___________/
            |                  |            |            |
          host                path        query       fragment
                               |
                          REQUEST_URI
The REQUEST_URI is /a/b/index.php. Note that this includes the initial /.
The initial / is ignored for pattern matching, at least for .htaccess files, but not I think for httpd.conf, but I have not tested that.
Accordingly the path substring is:-
- If my RewriteRule is in the httpd.conf file (but not in a Directory section), /a/b/index.php. (Note the leading /.)
- If my RewriteRule is in a .htaccess file in the root directory, a/b/index.php.
- If my RewriteRule is in a .htaccess file in the sub-directory a, b/index.php.
- If my RewriteRule is in a .htaccess file in the sub-directory a/b, then index.php.
If you want to match the complete path substring, then the pattern should begin with ^ and end with $.
If you prefix the pattern with ! the rule is applied if the pattern does NOT match.
Note that the pattern does not match against the QUERY_STRING: you have to use RewriteCond for that.
Finally note that you should try to make your patterns as unambiguous as possible. Liberal use of patterns such as .* will adversely affect performance.
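The effect of the location on the pattern can be seen by writing the same rule for each of the places listed above. This is a sketch only; /x.php is a made-up target.

```apache
# httpd.conf, outside any <Directory> section: leading / is part of the match
RewriteRule ^/a/b/index\.php$ /x.php

# DOCUMENT_ROOT/.htaccess: no leading /
RewriteRule ^a/b/index\.php$ /x.php

# DOCUMENT_ROOT/a/.htaccess: the a/ prefix has been stripped as well
RewriteRule ^b/index\.php$ /x.php

# DOCUMENT_ROOT/a/b/.htaccess: only the file name is left
RewriteRule ^index\.php$ /x.php
```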
- substitution and flags
Now we get on to the most complex part of the whole process.
The substitution is the new path the URL is to point to.
Set out below are some of the more basic rules:-
General Rule
In the general case the process results in a replacement of the REQUEST_URI with your substitution string, so that you get a new URL in the form http://host/substituted-uri?query-string#fragment.
It is not actually quite that simple, but this is a good starting point.
Replacing the host
Normally your substitution string is not concerned with the host part of the URL.
If I call up http://www.baconbutty.com/blog/12/ and replace /blog/12/ with /blog.php?id=12, the host http://www.baconbutty.com is simply re-attached at the end of the processing.
If I want to replace the host, I simply start the substitution with the new host. E.g. RewriteRule ^blog/(\d+)/$ http://www.newhost.com/blog.php?id=$1.
If I want to access the value of the current host (without knowing it in advance) I use the server variable HTTP_HOST thus: http://%{HTTP_HOST}/blog.php?id=$1.
Begin with the / character.
This section applies if you are not replacing the host above - i.e. your substitution is a path to append to the existing host.
You will normally wish to begin your substitution string with a / character. This signifies that the substitution is a path originating in the DOCUMENT_ROOT.
Thus if I have a rule in directory http://www.baconbutty.com/a/ that reads RewriteRule ^(\d+)/$ /page.php?id=$1, the resulting URL will read http://www.baconbutty.com/page.php?id=$1 (with $1 replaced by the captured number). (And, if there is a .htaccess file in the DOCUMENT_ROOT, it will then be triggered.)
Now, if you omit the leading / you are telling Apache that your substitution is relative to some other directory within the DOCUMENT_ROOT - but which?
It is relative to the RewriteBase. I.e. the URL is built as follows: http://host/RewriteBase/substituted-uri?query-string#fragment.
According to the Apache documents "The default setting is: RewriteBase physical-directory-path". I.e. if my DOCUMENT_ROOT is in C:/Apache/www then this is the RewriteBase, which produces the following strange and invalid URL: http://www.baconbutty.com/C:/Apache/www/page.php?id=$1.
So if you are going to omit the / you need to say where the relative path is using RewriteBase.
E.g.
RewriteEngine on
RewriteBase /c/
RewriteCond %{REQUEST_URI} ^/a/b/(\d+)/\d+/$
RewriteRule ^b/\d+/(\d+)/$ page.php?id=%1&name=$1 [QSA]
Will rewrite http://www.baconbutty.com/a/b/12/13/?val=24#bookmark to http://www.baconbutty.com/c/page.php?id=12&name=13&val=24#bookmark.
But the following (only difference is a / in front of page):-
RewriteEngine on
RewriteBase /c/
RewriteCond %{REQUEST_URI} ^/a/b/(\d+)/\d+/$
RewriteRule ^b/\d+/(\d+)/$ /page.php?id=%1&name=$1 [QSA]
Will rewrite http://www.baconbutty.com/a/b/12/13/?val=24#bookmark to http://www.baconbutty.com/page.php?id=12&name=13&val=24#bookmark. The / overrides the RewriteBase.
Note RewriteBase / will rewrite all relative URLs to the DOCUMENT_ROOT.
Back References
If you have capturing parentheses in your pattern (e.g. /a/(\d+)/), then the captured values can be referenced in the substitution with $1..9.
In addition, if you use RewriteCond and the last condition that matches has capturing parentheses, then the captured values can be referenced in the substitution with %1..9. (Note the % sign). But note it is only the last matching RewriteCond that counts here. E.g. if we have the following:-
RewriteEngine on
RewriteCond %{REQUEST_URI} ^/blog/(\d+)/(\d+)/(\d+)/$
RewriteCond %{REQUEST_URI} ^/blog/(\d+)/(\d+)/(\d+)/$
RewriteCond %{REQUEST_URI} ^/blog/\d+/\d+/(\d+)/$
RewriteRule ^blog/\d+/\d+/\d+/$ /show.php?id=%1-%2-%3 [R]
And enter http://localhost/blog/2009/05/02/ we get http://localhost/show.php?id=02-- because the last RewriteCond has only one set of parentheses, so only %1 is filled and %2 and %3 are empty.
Note these backreferences are not available if you use the ! (NOT) operator prefix in any pattern or RewriteCond as there is no captured value because it did not match!
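To see both kinds of back reference side by side, here is a small sketch for a .htaccess file in the DOCUMENT_ROOT. The lang parameter and show.php are assumptions made up for the example.

```apache
RewriteEngine on
# %1 comes from the RewriteCond: the two-letter value of ?lang=
RewriteCond %{QUERY_STRING} ^lang=([a-z]{2})$
# $1 comes from the RewriteRule pattern: the entry number
RewriteRule ^blog/(\d+)/$ /show.php?id=$1&lang=%1 [L]
# /blog/12/?lang=en  is served by  /show.php?id=12&lang=en
```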
Server Variables
You can insert server variables in your substitution with %{VARIABLE_NAME}.
Query String and [QSA] flag.
If the substitution contains no QUERY_STRING, then the original QUERY_STRING is retained and re-attached at the end.
If the substitution ends with just ? then this will erase any QUERY_STRING.
If the substitution contains a QUERY_STRING, but no [QSA] flag, then this will override (replace) the original QUERY_STRING, which will be lost.
If the substitution contains a QUERY_STRING, and a [QSA] flag, then both the original and substitution QUERY_STRINGS will be retained and concatenated, with the original appended after the substitution's. If the same value name appears in both, which one wins depends on how the receiving script reads the query string (PHP, for example, keeps the last occurrence).
No Substitution
If your string is simply - that means no substitution. You often use this with other chained rules.
- flags
Flags appear in square brackets. The only flag I was looking for at this point was the white flag. Or if you prefer, I was flagging when I got to this point, and certainly wasn't Jolly Roger, more George Cross.
You can combine them with commas, e.g. [R,L].
There is a large range of flags, which provide a powerful range of functionality, and are well worth studying.
The simple ones I am interested in are:-
[L]
This stops the processing of any subsequent RewriteRules and the URL is returned at that point.
As noted above, if one or more substitutions have occurred, then the httpd.conf or .htaccess file will be re-run, and re-run, until no further substitutions occur.
[NC]
Make the pattern matching not case-sensitive.
[QSA]
Preserve the current query string, and add any new items from the current substitution.
[R] or [R=code]
This forces an external re-direction, and changes the browser address bar.
If no code is used, it is 302.
Others are:-
300 : Multiple Choices
301 : Moved Permanently
This is used to tell search engines to drop your old dynamic URLs.
302 : Moved Temporarily (HTTP/1.0)
302 : Found (HTTP/1.1)
303 : See Other (HTTP/1.1)
304 : Not Modified
305 : Use Proxy
307 : Temporary Redirect
You use this if you want the client browser's address bar to update with the value of the changed URL.
Normally you should also apply [L] at this point, as [R=301,L], to stop further substitutions.
This combination is often used if you want your URLs to always end with a /. E.g.
RewriteEngine on
# Force a / to be added and then restart from the top. Address bar is updated.
RewriteRule ^b/(\d+)$ /b/$1/ [R=301,L]
# Hidden re-write to the php file. Address bar is not updated.
RewriteRule ^b/(\d+)/$ /page.php?id=$1 [QSA]
This will replace http://www.baconbutty.com/b/12 with http://www.baconbutty.com/b/12/ and update the address bar, and then convert http://www.baconbutty.com/b/12/ to http://www.baconbutty.com/page.php?id=12.
- Process
I have tried to map out how Apache processes an incoming URL, and this is my best guess.
Assume you start with the following typed URL:-
http://www.baconbutty.com/a/b/12/13/?val=24#bookmark
\_______________________/\_________/ \____/ \______/
            |                 |         |       |
            |               path      query  fragment
            |                 |         |
        HTTP_HOST        REQUEST_URI QUERY_STRING
http://localhost/a/b/12/13/?val=24#bookmark
http://localhost/page.php?id=12&name=13&val=24#bookmark
Assume that you want to re-write this to "http://www.baconbutty.com/page.php?id=12&name=13&val=24#bookmark".
Assume also that your DOCUMENT_ROOT (the physical path on the disk where your web pages are stored) is C:/www (a Windows system).
Assume you have a .htaccess file in folder a that reads:-
RewriteEngine on
RewriteCond %{REQUEST_URI} ^/a/b/(\d+)/\d+/$
RewriteRule ^b/\d+/(\d+)/$ /page.php?id=%1&name=$1 [QSA]
Assume you have a page.php as follows:-
<?php
// Capture the var_dump output of a value and return it as escaped HTML
function Debug_Dump($obj /*: Any*/) /*: String*/
{
    ob_start();
    var_dump($obj);
    $a = ob_get_contents();
    ob_end_clean();
    return "<pre class='dump'>" . htmlspecialchars($a, ENT_QUOTES) . "</pre>";
}
echo Debug_Dump($_GET);
echo Debug_Dump($_SERVER);
?>
The process seems to be:-
- HTTP_HOST (http://www.baconbutty.com) is stripped off
- ? is discarded
- QUERY_STRING (val=24) is stripped off
- # is discarded
- fragment (bookmark) stripped off
- This leaves you with the REQUEST_URI (/a/b/12/13/).
- The leading /a/ is removed, because the .htaccess file is in the a directory.
- This leaves you with the relative path b/12/13/, which is what the pattern is matched against.
- If the pattern matches then start creating the substitution.
- Because the substitution begins with /, this means that we are building a path that starts in the root directory, not in a.
- Because we have %1 this is the 12 matched in the last RewriteCond.
- Because we have $1 this is the 13 matched in the RewriteRule pattern.
- The substitution is therefore /page.php?id=12&name=13
- Because we have the flag [QSA] this means retain val=24 and add it to the new URL.
- The final URL is http://www.baconbutty.com/page.php?id=12&name=13 + &val=24 + #bookmark.
- If there is a .htaccess in the directory http://www.baconbutty.com/ then this will be applied to the new URL.
- If the URL were rewritten back to a, i.e. http://www.baconbutty.com/a/page.php then the .htaccess file in folder a would be triggered again.
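The four query-string cases described earlier, under the substitution heading, can be laid out as separate rules. This is a sketch for a .htaccess file in the DOCUMENT_ROOT; the one/, two/, three/ and four/ paths are invented for the example, and each comment assumes the visitor asked for the path with ?val=24.

```apache
RewriteEngine on
# 1. No query string in the substitution: original is kept
#    /one/?val=24   -> /page.php?val=24
RewriteRule ^one/$ /page.php [L]
# 2. Substitution ends in a bare ?: original is erased
#    /two/?val=24   -> /page.php
RewriteRule ^two/$ /page.php? [L]
# 3. Query string without [QSA]: original is replaced
#    /three/?val=24 -> /page.php?id=1
RewriteRule ^three/$ /page.php?id=1 [L]
# 4. Query string with [QSA]: new string first, original appended
#    /four/?val=24  -> /page.php?id=1&val=24
RewriteRule ^four/$ /page.php?id=1 [L,QSA]
```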
RewriteCond
So now I have a rough idea about RewriteRules, I turn my attention briefly to the RewriteCond.
In brief, a RewriteRule can have one or more RewriteConds preceding it.
If the pattern for the RewriteRule matches (this is tested first), then before the substitution occurs, each of the RewriteConds (first come first served) is evaluated (collecting back-references as they go), and if, and only if, they are all true, is the substitution effected.
E.g. if we have the following:-
RewriteEngine on
RewriteCond %{REQUEST_URI} (\d+)/(\d+)/(\d+)/$
RewriteRule ^blog/\d+/\d+/\d+/$ /show.php?id=%1-%2-%3 [R]
This matches a URL which ends with a date, e.g. 2009/05/02/. So, if we type in http://localhost/blog/2009/05/02/ we get http://localhost/show.php?id=2009-05-02.
Why use RewriteCond? Because it can match a wider range of strings than RewriteRule. RewriteRule matches only against a substring of REQUEST_URI. RewriteCond can match against whatever string you specify, such as QUERY_STRING.
- Grammar
The basic grammar is as follows
RewriteCond <TestString> <pattern> [flags]
This is different to RewriteRule. RewriteRule always matches against all or part of the REQUEST_URI.
RewriteCond matches against whatever TestString you care to give it.
Most of the time the TestString will be a server variable, and most often %{REQUEST_URI}.
The key lesson to note is that the request URI is always going to begin with a / whereas the string that the RewriteRule matches does not, because it matches against a substring of the full REQUEST_URI, relative to the directory in which the .htaccess is located.
So inspecting the following .htaccess file in the DOCUMENT_ROOT:-
RewriteEngine on
RewriteCond %{REQUEST_URI} ^/blog/(\d+)/(\d+)/(\d+)/$
RewriteRule ^blog/\d+/\d+/\d+/$ /show.php?id=%1-%2-%3 [R]
If you look closely, the pattern for RewriteCond begins with a / but the pattern in the RewriteRule does not. This was a considerable source of confusion for me at first.
- TestString
The main things you can include in your TestString (as well as plain text) are:-
Server Variables
E.g. %{REQUEST_URI} or %{QUERY_STRING}
RewriteRule pattern backreferences
The processing order for a given RewriteRule is:-
1. Process RewriteRule pattern (against the REQUEST_URI path) and collect $1..$N backreferences
2. If it matches, process each of the RewriteConds, from first appearing to last, and capture %1...%N backreferences from the last.
3. If all RewriteConds match, then process the RewriteRule substitution.
Given 1. occurs first, the $1..$N backreferences are in principle available to the RewriteCond TestString.
Here is an example. Assume the following in the a directory off the DOCUMENT_ROOT:-
RewriteEngine on
RewriteCond $1 ^(2009)$
RewriteRule ^blog/(\d+)/(\d+)/(\d+)/$ /show2009.php?id=%1-$2-$3 [R,L]
RewriteCond $1 !^(2009)$
RewriteRule ^blog/(\d+)/(\d+)/(\d+)/$ /showOther.php?id=$1-$2-$3 [R]
#http://localhost/a/blog/2009/05/02/
#http://localhost/a/blog/2008/05/02/
For the above URLs, this will rewrite to show2009 or showOther depending on the year.
RewriteCond pattern backreferences
If you have more than one RewriteCond, you can use the %1...%N backreferences of the previous in the next.
Here is an example. Assume the following in the a directory off the DOCUMENT_ROOT:-
RewriteEngine on
RewriteCond $1 ^(\d)\d{0,3}$
RewriteCond %1 ^2$
RewriteRule ^blog/(\d+)/(\d+)/(\d+)/$ /show2000.php?id=$1-$2-$3 [R,L]
RewriteCond $1 ^(\d)\d{0,3}$
RewriteCond %1 !^2$
RewriteRule ^blog/(\d+)/(\d+)/(\d+)/$ /showOther.php?id=$1-$2-$3 [R]
#http://localhost/a/blog/2001/05/02/
#http://localhost/a/blog/1999/05/02/
For the above URLs, this will rewrite to show2000 or showOther depending on whether the year is in the range 2000-2999.
- pattern
This is a Perl compatible regular expression, which you can prefix with ! if you want the condition to be successful if and only if it does NOT match.
Some of the special items are:-
<CondPattern
Treats the pattern as a plain string and compares it with the test string for alphabetical order. True if the TestString sorts before the pattern.
>CondPattern
Treats the pattern as a plain string and compares it with the test string for alphabetical order. True if the TestString sorts after the pattern.
=CondPattern
Is the pattern identical to the TestString?
This gives better performance than using a regular expression for the same purpose.
=""
Is the TestString the empty string?
-d
-f
Is the TestString an absolute path to an existing directory (-d) or file (-f)?
This is useful if your RewriteRule is likely to match actual resources, where you want a straight (no substitution) pass through.
Remember this can cause performance issues.
You must supply an absolute file system path as the TestString, not a relative one.
Typically you might use the %{REQUEST_FILENAME} environment variable as the TestString.
Or you might construct something using the %{DOCUMENT_ROOT} path (remembering to add an extra / after %{DOCUMENT_ROOT}).
Here is an unrealistic example. Assume we have the directory structure a/b/ off the root, and assume b contains test.txt. Assume the following in the a directory off the DOCUMENT_ROOT:-
RewriteEngine on
RewriteCond $1 >a
RewriteCond $1 <c
RewriteCond %{QUERY_STRING} =""
RewriteCond %{DOCUMENT_ROOT}/a/$1 -d
RewriteCond %{DOCUMENT_ROOT}/a/$1/test.txt -f
RewriteRule ^blog/([a-z])/(\d+)/$ /show.php?id=$2 [R,L]
#http://localhost/a/blog/b/1/
For the above URL, this will rewrite to show.php if:-
- ([a-z]) contains a character which is >a and <c
- The QUERY_STRING is empty
- a contains a directory matched by ([a-z])
- that directory contains a file called test.txt
- flags
[NC]
This makes the test case-insensitive.
[OR]
Normally all conditions must be met. This lets you specify that conditions are either/or.
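A short sketch of [OR] and [NC] working together, for a .htaccess file. The bot names are placeholders, and [F] (send a 403) is a flag not otherwise covered in these notes.

```apache
RewriteEngine on
# Condition 1 OR condition 2 OR condition 3 ...
RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
RewriteCond %{HTTP_USER_AGENT} examplebot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} spamcrawler [NC]
# ... then refuse the request, whatever the path
RewriteRule .* - [F]
```

Note that the last condition in the group carries no [OR], otherwise it would be joined to nothing.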
Final Examples
- Basic URL rewriting
If I want to:-
- add a / to the end of http://localhost/blog/12
- update the browser address bar
- re-write to http://localhost/show.php?id=12
RewriteEngine on
RewriteRule ^blog/(\d+)$ /blog/$1/ [R=301,L]
RewriteRule ^blog/(\d+)/$ /show.php?id=$1
- Guessable URLs
I want http://localhost/images/ to point to http://localhost/pictures/
RewriteEngine on
RewriteRule ^images/?$ /pictures/ [R=301]
- Rewrite the whole URL to a QUERY_STRING
This is something Drupal does:-
RewriteEngine on
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
- Rewriting Dynamic to Static
If you switch to static URLs, you may have to rewrite old dynamic URLs to the new static ones.
E.g. with my web site I have http://www.baconbutty.com/blog-entry.php?id=24, and I want to rewrite it to http://www.baconbutty.com/blog/entry/24/
RewriteEngine on
# Capture the id, you cannot do it in the RewriteRule
RewriteCond %{QUERY_STRING} id=(\d+)
RewriteRule ^blog-entry\.php$ /blog/entry/%1/? [R=301]
#http://localhost/blog-entry.php?id=24
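One thing to watch when this redirect is combined with a static-to-dynamic rewrite like the one in the first example: the internal rewrite produces blog-entry.php?id=24 again, which the redirect rule would match on the next pass. A common way round this, sketched below on the assumption that both rules live in the DOCUMENT_ROOT .htaccess, is to test THE_REQUEST, which always holds what the browser originally asked for.

```apache
RewriteEngine on
# 1. The visitor asked for the old dynamic URL: send them to the static one.
#    THE_REQUEST looks like "GET /blog-entry.php?id=24 HTTP/1.1"
RewriteCond %{THE_REQUEST} ^[A-Z]+\ /blog-entry\.php\?id=(\d+)
RewriteRule ^blog-entry\.php$ /blog/entry/%1/? [R=301,L]

# 2. The static URL is quietly served by the script. THE_REQUEST still shows
#    /blog/entry/24/, so rule 1 does not fire again on the next pass.
RewriteRule ^blog/entry/(\d+)/$ /blog-entry.php?id=$1 [L]
```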
Don't Forget
- Relative Links
If you are going to use static URLs, such as http://www.baconbutty.com/blog/12/ then you need to remember two things:-
- Firstly, all links to other pages will need to be switched to static.
- Secondly, this will potentially screw up your links to other static resources / bookmarks in the document itself.
In relation to the second point, let's consider this further.
There are two aspects to consider:-
- Bookmark links to the same page, such as href="#toc804580". These should continue to work as is; http://www.baconbutty.com/blog/12/#toc804580 should be just fine (as noted, the fragment is dealt with separately).
- Relative links to other static resources will fail. E.g. if I import a style with @import url(common.css); then common.css is relative and will be seen as http://www.baconbutty.com/blog/12/common.css which does not exist.
So how to solve the second bullet point?
- Well the easiest is to convert all relative links to full URLs. E.g. @import url(http://www.baconbutty.com/common.css); or the shorter version of prepending a /, e.g. @import url(/common.css);
- There is a shortcut, which is to use the base element. E.g. <base href="/"> or <base href="http://www.baconbutty.com/">. BUT WATCH OUT - this will also affect your in-page bookmarks. So it is not really a solution.
Accordingly, if you are going to move to static URLs, you need to ensure that you adequately deal with relative links and bookmarks separately.