Category Archives: php-mysql

Inverted search in MySQL database: check if keywords in field match some text

Usually, “searching a database” means looking for rows that contain a “document” in a field that matches a certain keywords set (mostly using the MATCH AGAINST construct); let’s say you want to do the opposite instead, that is finding those rows that have a field containing keywords, or a phrase, that matches to a text you already know. This is what I call an inverted search, or reverse search.

Real-life scenario: you build a website for job offers, candidates come in searching for listings that suit them, but find none, or find just a few that are not satisfactory. Either they keep coming to the website doing a search (let’s say they use the keywords “veterinar* cat” because they want to take care of animals, and like cats above all), or they subscribe to the RSS feed for that search (if your site offers one)… or they “save” this preference, and have the site send them an email as soon as a new offer pops up that matches their search, and this is where this article comes in.

Now, it’s easy to save the keywords for each user in a MySQL table, but how easy is it to check if the text of new messages matches said keywords? In PHP, you would go about building an array of keywords, and using strpos() on each of them to see if they are all contained in the text.

In MySQL you have the INSTR() which is the equivalent of PHP’s strpos(), but you cannot store arrays in a field (I mean real arrays, not serialized ones), and storing as many rows as the keywords are is going to be troublesome, as normalization isn’t needed: a “set of keywords” is mostly unique to each user subscribing to a search, so it makes sense to store them in a text field in the form of “keyword1,keyword2,keyword3,…”; again, if you were in PHP, you would do a explode(“,”,$keywords) to obtain the array of keywords, but in MySQL there is no such function, so you pretty much have to create one.

Ideally, you need a function where you pass the known document/text to match against (the concatenated subject and the body of the job offer, in this example), the string in the form “keyword1,…,keywordN“, and a parameter which tells the function that the delimiter (“glue” in PHP’s explode/implode) is a comma, so:

somefunction(document, keywords, delimiter)

This function would then return a positive integer if all of the keywords are present in the text, or 0 otherwise.

The function I put together by scraping info around was:

DELIMITER ;;

DROP FUNCTION IF EXISTS `matchkwdarray`;;
CREATE FUNCTION `matchkwdarray`(`str` text, `arr` text, `delimit` tinytext) RETURNS char(1) CHARSET utf8
 DETERMINISTIC
BEGIN
 DECLARE i INT DEFAULT 1;
 DECLARE matches INT DEFAULT 1;
 DECLARE token TINYTEXT;
 label:WHILE(matches>0) DO
 SET token=replace(substring(substring_index(arr, delimit, i), length(substring_index(arr, delimit, i - 1)) + 1), delimit, '');
 IF token='' THEN LEAVE label; END IF;

 SET matches=INSTR(str, token);
 SET i=i+1;
 END WHILE;
 RETURN matches;
END;;

DELIMITER ;

To add this to your MySQL database, you can either connect via CLI, or, as most people with a website on a shared hosting, not having command-line access, using a database frontend; phpMyAdmin, bloated as it is, doesn’t support functions/routines, so you need another one, and I found Adminer to be a massively better alternative, performance and flexibility-wise, and it’s only a 150kb-something single PHP file, which makes it all the better.

So, when you test the function, you have the following:

SELECT matchkwdarray('this is just a very simple test', 'just,test', ',')
Returns: 2

SELECT matchkwdarray('this is just a very simple test', 'some,stuff', ',')
Returns: 0

SELECT matchkwdarray('this is just a very simple test', 'this,very,stuff', ',')
Returns: 0

The function is resource efficient: as soon as a keyword is not in the text, it stops processing the keywords array. Surely, if you don’t want a 100% match but are fine with lower than that, you can easily modify the function to “stream” the whole keywords array, even removing the last character off each keywords to allow some variations, and return a percentage match which you can then evaluate later on.

MySQL SELECT query with more WHERE conditions runs faster

For some reason I wanted to add more redundant conditions to a WHERE clause in a MySQL SELECT query, but I feared that, being redundant conditions, I may end up slowing up the query.

Shortly, there is this classifieds site, where each item’s table has the town_id, province_id and region_id rows. If you know the town_id you’re looking for already, you can query:

SELECT * FROM items WHERE town_id='12'

instead of:

SELECT * FROM items WHERE town_id='12' AND province_id='34' AND region_id='56'

as the results will be the same… but should you?

I prepared a quick test panel:

$creationstart=strtok(microtime()," ")+strtok(" ");
for($i=0;$i<=999;$i++)
 $test=doquery("SELECT * FROM items WHERE town_id='58003'");
$creationend=strtok(microtime()," ")+strtok(" ");
$creationtime=number_format($creationend-$creationstart,4);
echo "town only: $creationtime<br />";

$creationstart=strtok(microtime()," ")+strtok(" ");
for($i=0;$i<=999;$i++)
 $test=doquery("SELECT * FROM items WHERE town_id='58003' AND province_id='58' AND region_id='12'");
$creationend=strtok(microtime()," ")+strtok(" ");
$creationtime=number_format($creationend-$creationstart,4);
echo "all conditions: $creationtime<br />";

Surprisingly, it took anywhere from 2.4s to 4.5s for the “all conditions” routine to complete, while the “town only” query ended up being about 0.4s slower. At first I thought it might be the MySQL cache in action, but even inverting the position of the cycles inside the code, results were the same.

So well, even if it looks redundant to you, adding more WHERE ... AND ... clauses in your query speeds up the fetching of the results.

  This article has been Digiproved

PHP search string functions performance is “needle” dependent

I was playing with some templating benchmarks in my post before this one, and bumped into an interesting find regarding the way strpos works, and consequently all similar “search (and replace)” functions like str_replace, strrpos, stripos, str_ireplace, substr_replace, even preg_match and preg_replace (I tested only str_replace and preg_match out of these, but I assume they will all bear the same results).

  1. From here on  am using the php.net convention, and will call “needle” the string you’re searching for, and “haystack” the string inside of which you’re executing the search.
  2. Knowing these results will only be useful if you can choose which “needles” to put in the “haystack” to be searched for in a later time, hence only if you’re building a template.
  3. I serched if someone else found this already, but this doesn’t appear to be the case, so if I am selling for new already known things just don’t bash me.

So let’s say you’re creating a template scheme, where you insert several tags that need to be replaced by the dynamically generated content of your website. You can call them whatever you want as long as they are unique strings inside the template, so for example %tag% or {tag} or <!–tag–> or ~tag~ (or just whatever). Maybe you think that choosing which delimiting chars to use for the tag name is only up to your personal tastes, and I did too before discovering this by accident, but I am going to guess that if the template contains well formed HTML code, then using ~tag~ is going to be much faster than using <!–tag–>.

The general principle is, searching for a needle beginning with a “rare” character is much faster than searching for a needle starting with a commonly used character inside the haystack.

For example, if your haystack is an excerpt from an ebook, searching for “%hello” (if present) will be way faster than searching for “hello”. The reason for this? The function in C that searches for the string, starts by searching for its first character, if found checks if the following one matches, and so on; so if you’re searching for “hello” the function will pause at every “h” to see if after that there’s an “e”, and if yes then checks if there’s and “l” and then another “l”, yet if the word is “hellish”, the function will not find the ending “o”, and will have to discard the work and time spent and go on with the search. The “%” character on the other hand is pretty rare inside of a “normal text”, if not unique, so the function will have to “pause” way less times before hitting a full match.

Let’s put it to the test, this is the routine:

$creationstart=strtok(microtime()," ")+strtok(" ");
for ($i=0;$i<100000;$i++) $testpos=strpos($test,"malesuarda");
$creationend=strtok(microtime()," ")+strtok(" ");
$creationtime=number_format($creationend-$creationstart,4);
echo "malesuarda $testpos: ".$creationtime."<br />";

$creationstart=strtok(microtime()," ")+strtok(" ");
for ($i=0;$i<100000;$i++) $testpos=strpos($test,"%malesuada");
$creationend=strtok(microtime()," ")+strtok(" ");
$creationtime=number_format($creationend-$creationstart,4);
echo "%malesuada $testpos: ".$creationtime."<br />";

Let’s do some explaining: $test is a fairly long string that I previously defined inside the code (it is made of a lorem ipsum kind of text, several paragraphs amounting to almost 13kb), inside of which I took a random word, “malesuada“, which is repeated several times, and I made two occurrences of this word slightly different, to render them unique; they were both towards the end of the string, I changed one into malesuarda adding a”r”, and another one (further away in the string) into %malesuada, then just loaded the PHP script; I echoed the value of $testpos as well, to confirm that the strings were actually found.

As expected, here are the results:

malesuarda 10970: 3.5609
%malesuada 11514: 0.7632

Replacing strpos with any other functions listed at the beginning of this article will deal similar results.

PHP templating, preg_match, file_get_contents and strpos, the fastest

While working on the website of reecycle.it, il frecycling italiano per il riciclo degli oggetti, I am spending my resources also on reducing at most the load on the Apache server of my italian shared hosting, Tophost, so i decided to make some benchmarks to see which was the best speed I could load the templates at, using different methods (a template is a “model” of a page -or part of-  written in HTML code, inside which the dynamic values loaded by the website engine are inserted).

All the templates used by reecycle.it were initially created as several .html files (one per each template) with several %tags% substituted by the str_replace() function before the output, but I thought that maybe there were different, faster ways to obtain the same result; a page can contain up to 5 or more different templates (general layout, login panel, simple search widget, search results table, and single row of the search results table), so each page could tell the server to access and load 5 different files on the hard disk (we are ignoring the disk cache for simplicity); maybe reading a single, bigger file containing all of the templates, loading it into memory, and later extracting the needed parts, is faster? This path can be taken in two ways, either by writing two “clean” lines of code using preg_match() and a regular expression, or by using a rather less “elegant” but strictly more performant combo of strpos() and substr() which instead needs several more lines of code.

In other words, I needed to know which one of the three (separate template files, single big template with regexp extraction, and single big template with strpos/substr extraction) was faster. I already knew preg_match was heaps slower than strpos/substr, yet I included it in the test for the sake of completeness.

This is the routine I used:

<?php
$creationstart=strtok(microtime()," ")+strtok(" ");

for ($i=0;$i<1000;$i++) {
 $text=file_get_contents("full.html");
 for ($n=1;$n<=8;$n++) {
 preg_match("/<!--{$n}\(-->(.*)<!--\){$n}-->/s",$text,$matches);
 $html[$n]=$matches[1];
 }
}

$creationend=strtok(microtime()," ")+strtok(" ");
$creationtime=number_format($creationend-$creationstart,4);
echo "preg_match: ".$creationtime."<br />";

/////////////////
$creationstart=strtok(microtime()," ")+strtok(" ");

for ($i=0;$i<1000;$i++) {
 $text=file_get_contents("full.html");
 for ($n=1;$n<=8;$n++) {
 $start=strpos($text,"<!--$n(-->")+strlen("<!--$n(-->");
 $ending=strpos($text,"<!--)$n-->");
 $html[$n]=substr($text,$start,($ending-$start));
 }
}

$creationend=strtok(microtime()," ")+strtok(" ");
$creationtime=number_format($creationend-$creationstart,4);
echo "strpos/substr: ".$creationtime."<br />";

////////////////////
$creationstart=strtok(microtime()," ")+strtok(" ");

for ($i=0;$i<1000;$i++) {
 for ($n=1;$n<=8;$n++) {
 $html[$n]=file_get_contents($n.".html");
 }
}

$creationend=strtok(microtime()," ")+strtok(" ");
$creationtime=number_format($creationend-$creationstart,4);
echo "file_get_contents: ".$creationtime."<br />";

where full.html is the single HTML file containing all the templates (a total of 8, consisting in paragraphs in lorem ipsum style, of different length), identified by <!--templatenumber(--> and <!--)templatenumber--> between which was the code to extract, while the single template files were named from 1.html to 8.html.

What the code does is, for each method it repeats 1000 iterations in which every single template is loaded, from 1 to 8, and measures the needed time to complete. This usage is not very realistic, as the template code is actually several lines of HTML code instead of few long lines of text, and templates are never loaded all together, but only those which are actually needed to draw the page; anyway, better performance in this test means better performance in real-life use (WRONG! check bottom for more details).

So, this was the result:

preg_match: 1.8984
strpos/substr: 0.0681
file_get_contents: 0.1352

Final times were obviously different at each page refresh, from a minimum (for preg_match) of 1.4s up to a maximum of 3s, anyway the relationship between them remained the same, that is the strpos/substr combination was two times faster than file_get_contents called for each file, yet what surprised me is how preg_match method is almost 30 times slower than strpos/substr, and hence 15 times slower than asking the server to read several different files together (I suppose this was due to the disk cache in action).

On a side note, the tech support of my hosting, inquired about this, suggested me to drop reading separate files in favour of reading a single file and using preg_match… go figure.

UPDATE:

I just went and tested this benchmark with real templates off reecycle.it… oh how much I was wrong.
I made a function on reecycle.it to fetch the requested template, that when the template inside the big file is missing, loads the single file template, returns it, and adds the missing template to the big single file, so after a while I got a 37kb supertemplate.tpl containing all of the templates of the website. I just changed the routines in the example above, to use the real files of the templates… and behold, the results were inverted! Using file_get_contents() on several files was two times faster than using strpos/substr on a single big file. No matter how I changed it, several, separate small files were still much faster than a single big file.

I blame it on two things: the file is actually big, so string functions have to deal with a big chunk of data to process, and especially the tag formats, since the template delimiters, practically in HTML comment format, begin with the “less than” symbol which is bound to be repeated lots of times inside HTML code, maybe confusing the strpos function.

In fact, in the templates archive file I modified the delimiting tags so they were like {tag(} and {)tag} instead of using the html comment tags <!–…–>, and the results of the benchmark went back to normal, being faster for the strpos/substr combo on the single file archive than with file_get_contents on several separate files, and the more the template requests, the faster the strpos/strsub method if compared to file_get_contents… see results of my next post.

Se vuoi cambiare la tua password, o non la ricordi, inserisci qui l’indirizzo di posta elettronica con cui ti sei registrato<br />
<form method=”post” action=”index.php”>
<table>
<tr>
<td class=”dida”>
email
</td>
<td class=”dati”>
<input type=”text” name=”resetmail” />
</td>
</tr>
<tr>
<td colspan=”2″>
<input type=”hidden” value=”resetpassword” name=”action” />
<input type=”submit” value=”Invia mail di reset” name=”sendresetpassword” />
</td>
</tr>
</table>
</form>

  This article has been Digiproved

Tophost e variabile di sessione “user” riservata in PHP

Stavo allegramente imprecando stamattina contro il server FTP del nodo8 di tophost su cui è ospitato un progetto a cui lavoro, siccome non era utilizzabile a causa di impressionante lentezza nelle risposte.

Ho colto l’occasione per rivedere alcuni aspetti del codice PHP e riordinare la gestione delle sessioni utente per renderle più funzionali, al che ho deciso di caricare in un’unica variabile di sessione, $_SESSION["user"], un array contenente tutti i dati rilevanti dell’utente per potervi accedere rapidamente senza richiedere costantemente un accesso al database.

Ebbene, non sembrava che riuscissi a far funzionare nulla, eppure è tecnicamente possibile caricare un array all’interno di una variabile di sessione. L’array veniva caricato subito dopo il login (confermato da var_dump) ma alla pagina successiva era sostituito da una stringa, corrispondente allo username MySQL del sito. Ho messo in discussione il mio codice, cercando dove fosse il problema, e modificando alcune assegnazioni (il tutto perdendo minuti ad aspettare che il server FTP accettasse gli upload), fino a quando non ho provato a cambiare il nome della variabile a $_SESSION["visitor"], e magia, ha funzionato tutto correttamente. A quanto pare, siccome non ho trovato traccia di questo in manuali “ufficiali”, i server di tophost sono impostati per accedere al nome utente MySQL tramite la variable di sessione “user”… non so cosa pensare. Qual è la variabile di sessione contenente la password MySQL?

  This article has been Digiproved

PunBB e-mail notification on each new post extension

On another site I manage, I just switched from the previous forum script to PunBB; one feature I missed after the switch is the ability to activate an email notification for each and every post the users add into the forum, so I had to write my own.
Following is the source code of the extension, save it as manifest.xml, upload it into /extensions/newpost_mail_notify/ PunBB folder (or another name of your liking), and install from PunBB admin panel with the usual method. There is nothing to set, as long as the extension is active it just does its job.

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE extension SYSTEM "ext-1.0.dtd">

<extension engine="1.0">
<id>pun_newpost_mail_notify</id>
<title>Send email to admin for each new post</title>
<version>1.1</version>
<description>This extension simply sends an email to the configured admin email address everytime a new post is added by someone who's not the configured admin</description>
<author>Ephestione</author>

<minversion>1.3.2</minversion>
<maxtestedon>1.3.4</maxtestedon>

<hooks>
<hook id="po_pre_redirect"><![CDATA[
if ($forum_user['group_id']!=1) { //default admin user id
$notification_to=$forum_config['o_admin_email'];
$notification_subject='['.$forum_config['o_board_title'].'] New topic/post notification';
$notification_message='New post/thread at URL: '.$base_url.'/viewtopic.php?pid='.$new_pid.'#p'.$new_pid;
mail($notification_to,$notification_subject,$notification_message);
}
]]></hook>

</hooks>

</extension>

This extension is very spartan, it will send a mail to the preconfigured admin email address, only if the new post is by someone else than the admin (obviously), with a simple link pointing to the post in the board.

I just updated the code with the suggestion by Grez from PunBB forums, correcting the check on the user which now uses the group_id and not the user id, thus being more solid.

Disclaimer: not my fault if your server explodes, this extension has been made/tested for PunBB 1.3.4, and if you’re not using that version you should definitely upgrade; may be a bad idea to use it if you get a lot of posts each day; I needed it because my forum is not so active, so I prefer getting notified if there’s something new, rather than go checking everyday.

  This article has been Digiproved

How to fix Fatal error: Allowed memory size exhausted in WordPress 3

This error is nothing new, per se, as far as I know it was present also in previous versions of wordpress (I’m a fresh new user and I’ve been using v3+ for more than I’ve been with 2.9.7 (first one I installed).
When this error occurred in previous versions, all you needed to do was open the [WORDPRESS FOLDER] > wp-settings.php file, and change the maximum allowed allocable ram from there. And that’s what most guides still tell you to do now, leaving you stranded as that line you need to change is not there anymore.
What you see though, is that at the beginning of the wp-settings.php file there is a line that says:

require( ABSPATH . WPINC . '/default-constants.php' );

which pretty much says it all, unless you are not the least bit PHP savvy (hence, this guide is for you).
All you need to do is open the [WORDPRESS FOLDER] > wp-includes > default-constants.php file, and change the line:

define('WP_MEMORY_LIMIT', '32M');

into

define('WP_MEMORY_LIMIT', '40M');

or anything of your liking (may be 36M, 48M, or whatever depending on your setup); obviously, the lower the better for the performance of your server, especially if you’re on a shared hosting.

  This article has been Digiproved

Resize, enlarge or scale an html image map with a PHP script

I am creating a portal for an italian website which will sport a nice region selector with an imagemap, and region highlighting with a javascript. I found a free and detailed image map of italy along with a combined png, but it was too small to be really useable, so I needed to put a bigger image; with that, I also needed to alter the imagemap coordinates to they matched the enlarged image… noway I was going to do that by hand!

So, after searching for a pre-made solution (which I obviously didn’t find)  I devised a very simple PHP script to do eaxctly that. The script puts the entire code for the image map inside a string variable, and then processes that string with a regular expression search and replace to change the values accordingly to my needs. I only needed to make the image two times bigger, mantaining the aspect ratio, but since I was going to publish this inside a guide, I said to myself “why not making it so you can change the aspect ratio as well”. So, if you want, you can make the imagemap two times larger, and 1.5 times taller.

Here’s the sample script (the $html variable is defined with a sample imagemap for the region of Lazio alone, for space purposes, but you can put whatever you want inside, I used it with the whole map of Italy, together with <map> tags and lines breaks/indentation). Be careful and ESCAPE the double quotes inside the HTML before pasting the code inside the string. In other words, simply put a backslash (the \ character) before each occurence of a double quote (the $quot; character) inside the HTML, I used the search and replace function of Notepad++.

<?php
$html="<area href=\"#\" alt=\"state\" title=\"lazio\" shape=\"poly\" coords=\"74.513,86.938,75.667,87.365,75.667,88.007,74.744,89.077,75.436,90.467,76.359,90.039,77.857,90.039,78.319,90.039,79.127,90.788,79.588,91.857,79.588,92.606,80.049,93.034,80.51,93.034,81.317,94.103,81.779,94.852,82.24,94.959,83.74,94.852,84.201,94.959,85.123,94.959,86.392,94.103,87.43,93.141,88.122,93.141,89.39,93.141,89.967,92.713,91.351,90.895,91.813,90.895,92.274,91.216,93.196,90.895,94.349,90.788,94.926,90.467,96.31,89.825,96.886,90.467,96.656,90.895,95.849,91.323,95.387,92.072,94.234,92.072,92.965,92.713,92.505,93.676,92.505,94.317,92.734,94.959,91.928,95.28,91.813,95.922,91.467,96.778,92.505,98.382,92.505,99.023,92.505,99.986,91.928,101.804,91.928,103.194,92.734,103.837,94.234,103.623,96.31,104.264,97.579,105.013,99.309,106.51,102.191,108.543,103.229,108.543,104.728,109.077,106.113,110.361,106.574,111.965,106.804,113.035,106.574,113.783,106.574,114.425,105.882,114.853,105.305,115.067,104.844,115.067,104.728,116.029,104.728,117.099,104.152,118.061,103.46,118.703,102.999,119.345,102.999,120.093,101.961,120.308,100.23,120.735,99.539,120.308,98.271,119.345,96.656,118.489,95.156,118.275,92.965,118.489,91.005,118.703,89.39,116.885,89.506,116.029,88.122,114.639,85.931,113.997,83.97,112.607,81.548,110.574,78.55,107.687,77.627,105.869,76.128,104.692,74.975,102.874,73.706,101.056,71.745,99.023,70.131,97.098,67.594,94.959,69.093,93.676,70.131,92.606,70.592,91.216,70.592,90.039,71.745,89.611,72.553,88.649,73.014,88.221,72.553,86.938,73.245,86.189,74.513,86.938\" />";
echo preg_replace("/([0-9.]{2,}),([0-9.]{2,})/e","round(\\1*2,3).','.round(\\2*1.5,3)",htmlentities($html));
?>

All you need to do is change the parameters inside the regular expression (numbers in red), first is for horizontal proportion, second for vertical; in this example the imagemap will be resized to 2 times its length, and 1.5 times its height, if you want to make the size three times bigger, change it to

echo  preg_replace("/([0-9.]{2,}),([0-9.]{2,})/e","round(\\1*3,3).','.round(\\2*3,3)",htmlentities($html));

Then, copy the php script to your server, open it in a web browser, and copy/paste the result. Note: according the the way the imagemap is originally formatted, you may need to edit the regular expression to accomodate for spaces, tabs, linebreaks or whatever; in this case, since the coordinates were listed with just commas inbetween it was not needed. If you are stuck, write in the comments an excerpt of your imagemap code and I’ll try and help you.

  This article has been Digiproved