Tag Archives: gzip

Cache PHP to gzipped static html pages using htaccess redirect

No matter the server performance, the fastest kind of website for a visitor is the one with static HTML pages; this way the server just has to upload existing data to the browser instead of starting the PHP compiler, open a connection to MySQL, fetch data, format the page and only after that send it. Less CPU, less memory, less processing time, more users served in a smaller amount of time.

Avoiding PHP and MySQL execution altogether is the key, and my project was to create something similar to WPSuperCache to be implemented in a generic PHP website, so my thanks go to said plugin’s developers for the source code that gave me precious tips.

DISCLAIMER: this guide presumes you are “fluent” with PHP and .htaccess coding, and is meant to just give directions as of how to obtain a certain result. This guide will not give a pre-cooked solution to just copy/paste into your website, you need to change/add code to adapt it to your need; I am not the person who can help you if you need a tailored solution, simply because I have no time to give away free personalized help!

Here’s our logical scheme:

  1. Visitor comes to website and asks for page X;
  2. Webserver checks via .htaccess (avoiding PHP execution) if there is a static cached version of page X already, and in this case either serves the gzipped file (if the client supports it), or falls back to the uncompressed HTML version… at which point our scheme ends; otherwise, if no cached version exists, continue to next step;
  3. PHP builds the page and serves it to the visitor as “fresh”, but at the same time saves the HTML output as both gzipped and uncompressed version so the next visitor will directly download that one.

This is a very bare procedure, which needs to be perfected by the following conditions:

  • Not all pages need to be cached (those that by their very nature change very often, like real time statistics, a poll, whatever)… in fact, some pages must NOT be cached (those that only moderators can view, both for secutiry reasons and consistency); PHP must avoid caching those pages, so that htaccess will always serve the “fresh” PHP page.
  • Pages that do get cached still need to be refreshed, because sometimes they can change (articles or posts that are modified/updated with time, or where comments get posted).
  • Some pages, even if the corresponding cached version exists, must NOT be served from cache, for instance when POST data is submitted (sometimes, also with GET, you as the webmaster should know if this is the case), that requires PHP to be handled.

Define “caching behaviour” in PHP

This section is where you set the variables that the PHP caching routine will check against to know when and how to do its job.

$cachepath=$_SERVER["DOCUMENT_ROOT"]."/cache/";
$cachesuffix="-static";
$nocachelist=array(
    "search"=>1,
    "statistics"=>1,
    "secretcodes"=>1,
    "..."=>1,
 );

And explained:

  • $cachepath is the folder where you want the static (HTML and gzip) files to be stored (the website I use this caching routine on, has all the pages accessible from root folder with rewrite URLs, in other words the only slash in the URL is the one after the domain name, and I have no need to reproduce any folder structure; if you do, you’re on your own);
  • $cachesuffix is a string I need to add to the URL string (for example if address is domainname.com/pagename then in this case the cachefile will be named pagename-static), this is useful if you want to cache the homepage which is not named index.php but is just domainname.com/ because in that case, the cachefile name would be empty and .htaccess won’t find it;
  • $nocachelist is an associative array where you have to add as many keys (pointing to a value of 1) as the pages you don’t want (for whatever reason) to cache; in the key name you have to put the string the user would write in the URL bar after the domain name slash to get to the page, for example if you don’t want to cache domainname.com/statistics you would be using “statistics”=>1 in there, as already is.

Have PHP actually save to disk the cached pages

     if (
        !isset($nocachelist[$_GET["page"]]) &&
        !$_SESSION["admin"] &&
        !count($_POST) &&
        !$_SERVER["QUERY_STRING"]
    ) {
        //build the uncompressed cache
        if (!file_exists($cachepath.$url.$cachesuffix.".html")) {
            $cached=str_replace("~creationtime","cached",$html);
            $fp=fopen($cachepath.$url.$cachesuffix.".html","w");
            fwrite($fp,$cached);
            fclose($fp);
        }
        //build the compressed cache
        if (!file_exists($cachepath.$url.$cachesuffix.".html.gz")) {
            $cachedz=str_replace("~creationtime","cached&gzipped",$html);
            $cachedz=gzencode($cachedz,9);
            $cachedzsize=strlen($cachedz);
            $fp=fopen($cachepath.$url.$cachesuffix.".html.gz","w");
            fwrite($fp,$cachedz);
            fclose($fp);
        }
    }

Pretty much self explanatory, isn’t it.
No seriously, do you really want me to elaborate?

Ok, you got it.

  • The $_GET[“page”] is a GET value I set early in the code to know where we are in the website, you can use any variable here as long as you can check it against the $nocachelist array;
  • The other conditions in the first if should be clear, they avoid building the page’s cache if the CMS’s admin is logged (security) or if POST data is submitted or if there is a query string appended to the URL (consistency/stability);
  • $url is a variable that I define early in the code, and contains the string after the domain name slash and before the query string question mark, basically the kind of string you fill the $nocachelist array with (if you were paying attention, you may now think I have a redundant variable since $url and $_GET[“page”] should be the same, but this is not the case for other reasons);
  • $html is the string variable that, across the whole CMS, defines the raw HTML code to echo at the end of the PHP execution; you can either do like I do and define such string, or use an output buffer to obtain HTML if you instead print the HTML directly to screen during PHP execution;
  • ~creationtime is a “hotkey” I use in my template to plug in the time in seconds that was needed to create the page in PHP; since I am creating a cached version now, the creation time of the page is zero, because it’s already there to be downloaded from the browser instead of having to be compiled by the server, so in there I print either “cached&gzipped” for clients that support gzip, or only “cached” when the browser doesn’t not; you can safely strip out this part, as this is more of a eyecandy/nerdy/debug thing.

Let .htaccess send the cached files before starting the PHP compiler

AddEncoding x-gzip .gz
<FilesMatch "\.html\.gz$">
    ForceType text/html
</FilesMatch>

#GZIP CMS
RewriteCond %{REQUEST_METHOD} !POST
RewriteCond %{REQUEST_URI} !/forum/
RewriteCond %{QUERY_STRING} !.*=.*
RewriteCond %{HTTP:Cookie} !^.*(isadmin|nocache).*$
RewriteCond %{HTTP:X-Wap-Profile} !^[a-z0-9\"]+ [NC]
RewriteCond %{HTTP:Profile} !^[a-z0-9\"]+ [NC]
RewriteCond %{HTTP:Accept-Encoding} gzip
RewriteCond /full/path/to/your/htdocs/cache/$1-static.html.gz -f
RewriteRule ^(.*) /cache/$1-static.html.gz [L,T=text/html]

#UNCOMPRESSED CMS
RewriteCond %{REQUEST_METHOD} !POST
RewriteCond %{REQUEST_URI} !/forum/
RewriteCond %{QUERY_STRING} !.*=.*
RewriteCond %{HTTP:Cookie} !^.*(isadmin|nocache).*$
RewriteCond %{HTTP:X-Wap-Profile} !^[a-z0-9\"]+ [NC]
RewriteCond %{HTTP:Profile} !^[a-z0-9\"]+ [NC]
RewriteCond /full/path/to/your/htdocs/cache/$1-static.html -f
RewriteRule ^(.*) /cache/$1-static.html [L]

This is the code you need to plug in .htaccess, preferably after everything else, but before defining the custom error pages; anyway, since you should be .htaccess-fluent, you shouldn’t need to be told where this fits best.

Some detailing for the curious:

  • First bit is needed (at least I needed it) to serve the gzipped files in a way that the browser knows to handle, otherwise I just got gibberish (the gzip was being sent to output without being uncompressed by the browser first);
  • if the cookies “isadmin” or  “nocache” exist, the cache version of the page will not be served even if it exists; easy explanation, if an admin is logged, and there is special content on a page that only admins can see, you don’t want them to see the “vanilla” cached version of the pages instead; so it’s your duty in this case to set a “isadmin” cookie when an admin logs in, and remove it when the admin logs out;
  • choosing the correct full path on the webserver was a bit tricky in my case, I can’t quite remember where the issue was, but depending on the method I used to get it I had different paths; I kind of remember I had to choose between $_SERVER[“DOCUMENT_ROOT”] and dirname(__FILE__) because only one was working with .htaccess;
  • you don’t really have to change much in this code snippet unless you have particular needs; the /forum/ path exclusion could be irrelevant in your case;
  • thanks go to WPSuperCache developers for their .htaccess code that I stole to build this snippet!

Last but not least

The cache is there to help you, not to give an handicap to your website. You choose what the cache performance should be, and how many times it should get served instead of the dynamic page, before considering it paied off, but you have to clear the cache from time to time to make sure you’re not serving outdated stuff to your visitors.

In my case I use an external cronjob provider (setcronjob.com) to trigger a PHP routine every night which includes the following:

$handle=opendir("cache");
while (($file = readdir($handle))!==false) @unlink("cache/".$file);
closedir($handle);

so that everyday the website starts off with a fresh cache. Not less important, you should clear the cache of a single page as soon as you know that page changed, unless you’re ok with the changes being visibile only the next day after all the cache is cleared anyway. Example: you edit a page while logged as admin, or a user posts a comment, or anything happens that you have control over that alters in any way the page’s HTML: simply use unlink() to delete both the gzipped and uncompressed caches and the website will recreate them with the updated content.

 

Have fun with pimping up this draft!