Detecting Apache ErrorDocument redirection in PHP
OK, this took me some time to figure out.
Search engine spiders keep requesting resources that were removed from my site long ago. Since they keep coming back, they don't seem to process the repeated 404's they've been receiving. So to let them know those resources will not return, I want to send out HTTP response code 410 (Gone) instead of 404 (Not Found).
The Apache documentation describes how this can be done using the RewriteRule directive combined with a [G] flag, like this:
RewriteRule ^news/politics.* - [G]
Together with the 410 response code I also configured Apache to send an error page explaining the error, using an ErrorDocument directive:
ErrorDocument 410 /error410
404 instead of 410?
Surprisingly, these changes in configuration result in 404's, together with the normal 404 error page, when requesting one of the removed resources.
Checking the error page /error410 by directly requesting it in the browser returned the 410 page, so that seems to be OK.
Rewriting requests
One thing you need to know is that my website uses a PHP framework I wrote myself. This has a single entry point for nearly all requests. This script examines the incoming request and executes the corresponding script to render the page.
The Apache configuration to send the requests to this script is also a RewriteRule:
RewriteRule ^(.*)$ framework.php
This rule is the last rewrite rule in the configuration, so it catches all requests not handled by any other specific rule. The main script uses the REQUEST_URI server variable to determine which page to render.
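To make the later 404 easier to follow, here is a minimal sketch of how a single entry point might pick a page from REQUEST_URI. It is not the actual framework code: the function name selectPage, the $pages table and the script paths are all hypothetical and only illustrate the idea.

```php
<?php
declare(strict_types=1);

/**
 * Pick the page script for a request URI.
 * Hypothetical sketch: $pages maps a URL path to a script name.
 * Returns [HTTP status, script to run].
 */
function selectPage(string $requestUri, array $pages): array
{
    // REQUEST_URI can carry a query string; only the path selects the page.
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path)) {
        $path = '/';
    }

    if (isset($pages[$path])) {
        return [200, $pages[$path]];
    }

    // Unknown path: fall through to the normal 404 page (assumed name).
    return [404, 'pages/error404.php'];
}
```

With a dispatcher like this, any URI that is not in the table ends up at the 404 page, whatever Apache decided about the request earlier.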
To examine what happens I looked at the value of the $_SERVER['REQUEST_URI'] parameter during script execution for a normal request like http://kwebble.com/blog and a gone URL like http://kwebble.com/news/politics/no-longer-here.
For the first one the value is /blog, as expected. For the other URL I expected /error410, the URI of the configured ErrorDocument. To my surprise it was /news/politics/no-longer-here, the original URI.
I expected the error page because I thought the configured URI would be executed as a separate request by Apache. Here I was wrong: Apache internally redirects to the configured error document instead of making a separate request, so REQUEST_URI keeps its original value. This explains the 404, because my script still sees the original URL, which no longer points to a valid resource. But how to detect the error?
Detecting redirects
Looking at the server variables I noticed some differences between the two requests:
- The values of REDIRECT_STATUS differ, with 200 for the normal URL and 410 for the incorrect URL.
- A parameter called REDIRECT_REDIRECT_STATUS is only present for the incorrect URL. The value is 410.
- The values of REDIRECT_URL differ, with /kwebble-site/blog for the normal URL and /kwebble-site/error410 for the incorrect URL.
These REDIRECT_STATUS and REDIRECT_REDIRECT_STATUS server variables are created by Apache when doing an internal redirect. On an error this occurs twice: first when the error is detected and then again to create the error document.
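A comparison like the one above is easier with a small helper that keeps only the relevant entries. The following is a sketch, not the code I used at the time; the function name redirectVariables is made up, and it takes an array shaped like $_SERVER so it can be tried without a web server.

```php
<?php
declare(strict_types=1);

/**
 * Return REQUEST_URI plus every REDIRECT_* entry from a $_SERVER-like array,
 * sorted by name so two requests can be compared side by side.
 */
function redirectVariables(array $server): array
{
    $result = [];
    foreach ($server as $name => $value) {
        $name = (string) $name;
        if ($name === 'REQUEST_URI' || strpos($name, 'REDIRECT_') === 0) {
            $result[$name] = $value;
        }
    }
    ksort($result);

    return $result;
}

// Possible use in the entry script, writing to the PHP error log:
// error_log(print_r(redirectVariables($_SERVER), true));
```

Requesting a normal URL and a gone URL and comparing the two log entries should show which REDIRECT_ variables your own setup provides, which may differ from the list above.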
The solution
To render the 410 page I changed the code to look at REDIRECT_STATUS. If it's 200 the rendered page is based on the value of REQUEST_URI, otherwise on the value of REDIRECT_URL.
Instead of always using the value from REDIRECT_URL, an additional check on REDIRECT_STATUS is done. I added this because in my search for information I found several pages suggesting REDIRECT_URL is not always present.
This extra check makes sure an error page is always generated, and if possible reported as 410. Otherwise it is reported as a 404, as it always was, which is the next best alternative.
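The decision described above could be sketched roughly like this. It is a simplified illustration, not the framework's real code: the function name resolveRenderUri is invented, a missing REDIRECT_STATUS is assumed to mean a normal request, and a missing REDIRECT_URL is assumed to fall back to REQUEST_URI so the usual 404 handling still applies.

```php
<?php
declare(strict_types=1);

/**
 * Decide which URI the entry script should render.
 * $server is an array shaped like $_SERVER.
 */
function resolveRenderUri(array $server): string
{
    $requestUri = (string) ($server['REQUEST_URI'] ?? '/');

    // Assumption: no REDIRECT_STATUS at all is treated like a normal request.
    $status = (int) ($server['REDIRECT_STATUS'] ?? 200);
    if ($status === 200) {
        return $requestUri;
    }

    // Error redirect: use the error document's URL when Apache provides it.
    if (isset($server['REDIRECT_URL']) && $server['REDIRECT_URL'] !== '') {
        return (string) $server['REDIRECT_URL'];
    }

    // Assumed fallback: render the original URI, which ends in the 404 page.
    return $requestUri;
}

// Possible use in the entry script:
// $uri = resolveRenderUri($_SERVER);
```

The sketch returns REDIRECT_URL unchanged. If the value carries a base path, as /kwebble-site/error410 does in the list above, that prefix still has to be removed before the URI is matched against the site's pages.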