1. Detecting Apache ErrorDocument redirection in PHP

    OK, this took me some time to figure out.

    Search engine spiders keep requesting resources that were removed from my site long ago. Since they keep coming back, they don't seem to process the repeated 404s they've been receiving. So to let them know those resources will not return, I want to send HTTP response code 410 (Gone) instead of 404 (Not Found).

    The Apache documentation describes how this can be done using the RewriteRule directive combined with the [G] flag, like this:

    RewriteRule ^news/politics.* - [G]

    Together with the 410 response code I also configured Apache to send an error page explaining the error, using an ErrorDocument directive:

    ErrorDocument 410 /error410

    404 instead of 410?

    Surprisingly, these changes in configuration result in 404's, together with the normal 404 error page, when requesting one of the removed resources.

    Checking the error page /error410 by directly requesting it in the browser returned the 410 page, so that seems to be OK.

    Rewriting requests

    One thing you need to know is that my website uses a PHP framework I wrote myself. This has a single entry point for nearly all requests. This script examines the incoming request and executes the corresponding script to render the page.

    The Apache configuration to send the requests to this script is also a RewriteRule:

    RewriteRule ^(.*)$ framework.php

    This rule is the last rewrite rule in the configuration, so it catches all requests not handled by any other specific rule. The main script uses the REQUEST_URI server variable to determine which page to render.

    To examine what happens I looked at the value of the $_SERVER['REQUEST_URI'] parameter during script execution of a normal request like http://kwebble.com/blog and a gone URL like http://kwebble.com/news/politics/no-longer-here.

    For the first one the value is /blog, as expected. For the other URL I expected /error410, the URI of the configured ErrorDocument. To my surprise it was /news/politics/no-longer-here, the original URI.

    I expected the error page because I thought Apache would execute the configured URI as a separate request. Here I was wrong: Apache internally redirects to the configured error document instead of making a separate request, and REQUEST_URI keeps the value of the original request. This explains the 404: my script still looked at the original URI, which no longer points to a valid resource. But how to detect the error?

    Detecting redirects

    Looking at the server variables I noticed some differences between the two requests:

    • The values of REDIRECT_STATUS differ, with 200 for the normal URL and 410 for the incorrect URL.
    • A parameter called REDIRECT_REDIRECT_STATUS is only present for the incorrect URL. The value is 410.
    • The values of REDIRECT_URL differ, with /kwebble-site/blog for the normal URL and /kwebble-site/error410 for the incorrect URL.

    These REDIRECT_STATUS and REDIRECT_REDIRECT_STATUS server variables are created by Apache when doing an internal redirect: on each redirect the existing variables are carried over with a REDIRECT_ prefix added. On an error this occurs twice: first when the error is detected and then again to create the error document.
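
    To compare the two requests yourself, a small debugging helper along these lines can list the redirect-related variables. This is only a sketch: the function name redirectVariables is made up for this example and is not part of my framework.

    ```php
    <?php
    /**
     * Return only the server variables whose name starts with "REDIRECT_".
     * Pass in $_SERVER; taking it as a parameter keeps the function easy to test.
     * (Hypothetical helper, for illustration only.)
     */
    function redirectVariables(array $server): array
    {
        return array_filter(
            $server,
            function ($name) {
                // Keep keys such as REDIRECT_STATUS or REDIRECT_REDIRECT_STATUS.
                return strncmp((string) $name, 'REDIRECT_', 9) === 0;
            },
            ARRAY_FILTER_USE_KEY
        );
    }

    // Dump the redirect variables of the current request.
    print_r(redirectVariables($_SERVER));
    ```

    Calling this temporarily from the entry script for both a normal URL and a removed URL should show the differences listed above.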

    The solution

    To render the 410 page I changed the code to look at REDIRECT_STATUS. If it's 200, the rendered page is based on the value of REQUEST_URI; otherwise it is based on the value of REDIRECT_URL.

    Rather than always using the value of REDIRECT_URL, the code first checks REDIRECT_STATUS. I added this check because, in my search for information, I found several pages suggesting that REDIRECT_URL is not always present.

    This extra check makes sure an error page is always generated and, if possible, reported as a 410. Otherwise it is reported as a 404, as it always has been, which is the next best alternative.
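
    The decision described above could look roughly like this in PHP. It is a minimal sketch, not the actual framework code: the function name resolvePageUri is invented for this example, treating a missing REDIRECT_STATUS as 200 is an assumption, and any path prefix in REDIRECT_URL (such as /kwebble-site in my setup) would still have to be handled by the caller.

    ```php
    <?php
    /**
     * Decide which URI the entry script should render.
     * Pass in $_SERVER. (Hypothetical helper, for illustration only.)
     */
    function resolvePageUri(array $server): string
    {
        // Assumption: without REDIRECT_STATUS the request is treated as normal.
        $status = isset($server['REDIRECT_STATUS'])
            ? (string) $server['REDIRECT_STATUS']
            : '200';

        // Normal request: render the page that was asked for.
        if ($status === '200') {
            return $server['REQUEST_URI'] ?? '/';
        }

        // Error redirect: render the configured ErrorDocument when Apache
        // supplied its URL.
        if (!empty($server['REDIRECT_URL'])) {
            return $server['REDIRECT_URL'];
        }

        // REDIRECT_URL is missing: fall back to the original URI, which the
        // framework will then report as a 404.
        return $server['REQUEST_URI'] ?? '/';
    }
    ```

    The entry script would call resolvePageUri($_SERVER) once and hand the result to the existing page lookup, so the rest of the framework does not need to know about redirects.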