Tag Archives: strings

PHP search string functions performance is “needle” dependent

I was playing with some templating benchmarks in my post before this one, and bumped into an interesting find regarding the way strpos works, and consequently all similar “search (and replace)” functions like str_replace, strrpos, stripos, str_ireplace, substr_replace, even preg_match and preg_replace (I tested only str_replace and preg_match out of these, but I assume they will all bear the same results).

  1. From here onĀ  am using the php.net convention, and will call “needle” the string you’re searching for, and “haystack” the string inside of which you’re executing the search.
  2. Knowing these results will only be useful if you can choose which “needles” to put in the “haystack” to be searched for in a later time, hence only if you’re building a template.
  3. I serched if someone else found this already, but this doesn’t appear to be the case, so if I am selling for new already known things just don’t bash me.

So let’s say you’re creating a template scheme, where you insert several tags that need to be replaced by the dynamically generated content of your website. You can call them whatever you want as long as they are unique strings inside the template, so for example %tag% or {tag} or <!–tag–> or ~tag~ (or just whatever). Maybe you think that choosing which delimiting chars to use for the tag name is only up to your personal tastes, and I did too before discovering this by accident, but I am going to guess that if the template contains well formed HTML code, then using ~tag~ is going to be much faster than using <!–tag–>.

The general principle is, searching for a needle beginning with a “rare” character is much faster than searching for a needle starting with a commonly used character inside the haystack.

For example, if your haystack is an excerpt from an ebook, searching for “%hello” (if present) will be way faster than searching for “hello”. The reason for this? The function in C that searches for the string, starts by searching for its first character, if found checks if the following one matches, and so on; so if you’re searching for “hello” the function will pause at every “h” to see if after that there’s an “e”, and if yes then checks if there’s and “l” and then another “l”, yet if the word is “hellish”, the function will not find the ending “o”, and will have to discard the work and time spent and go on with the search. The “%” character on the other hand is pretty rare inside of a “normal text”, if not unique, so the function will have to “pause” way less times before hitting a full match.

Let’s put it to the test, this is the routine:

$creationstart=strtok(microtime()," ")+strtok(" ");
for ($i=0;$i<100000;$i++) $testpos=strpos($test,"malesuarda");
$creationend=strtok(microtime()," ")+strtok(" ");
echo "malesuarda $testpos: ".$creationtime."<br />";

$creationstart=strtok(microtime()," ")+strtok(" ");
for ($i=0;$i<100000;$i++) $testpos=strpos($test,"%malesuada");
$creationend=strtok(microtime()," ")+strtok(" ");
echo "%malesuada $testpos: ".$creationtime."<br />";

Let’s do some explaining: $test is a fairly long string that I previously defined inside the code (it is made of a lorem ipsum kind of text, several paragraphs amounting to almost 13kb), inside of which I took a random word, “malesuada“, which is repeated several times, and I made two occurrences of this word slightly different, to render them unique; they were both towards the end of the string, I changed one into malesuarda adding a”r”, and another one (further away in the string) into %malesuada, then just loaded the PHP script; I echoed the value of $testpos as well, to confirm that the strings were actually found.

As expected, here are the results:

malesuarda 10970: 3.5609
%malesuada 11514: 0.7632

Replacing strpos with any other functions listed at the beginning of this article will deal similar results.