Tag Archives: MySQL

MySQL query with ORDER BY,GROUP BY to remove duplicate rows after ordering

Apparently this is a widely discussed issue with MySQL; many newbies (like me) wanting to SELECT rows from a table, where they know there are duplicates of some field ‘a’ which they want unique, but making sure that the selected row has the field ‘b’ with the lower, highest, longest or shortest (whatever) value among the duplicates… well, these newbies first try and make a query like:

SELECT a, b, c FROM table WHERE <conditions> ORDER BY b DESC GROUP BY a

and they get errors and realize it’s not possible, because MySQL syntax wants the GROUP BY clause to come before the ORDER BY one, like:

SELECT a, b, c FROM table WHERE <conditions> GROUP BY a ORDER BY b DESC

In this case though, rows first get grouped, and only after duplicates are removed, they get ordered (pretty much useless for what we need).

How to go about first ordering duplicates the way we need, and then remove the rest?

To cut to the chase, this is how:

SELECT t.a, t.b, t.c 
  FROM table t
  JOIN (SELECT MIN(b) as b
    FROM table
    GROUP BY a) t2 ON t2.b = t.b
  WHERE <conditions>
  GROUP BY a
  ORDER BY NULL

What this “query skeleton” does, is ordering the rows by the field ‘b’ (ASC order, since MIN() function is used, otherwise you would be using MAX(), or anyway the function that better suits your needs) inside the JOIN operator, and then grouping these rows by field ‘a’ in the outer query; the ORDER BY NULL clause at the end is to avoid MySQL doing unnecessary ordering steps as the rows have already been ordered by the GROUP BY clause. Effective, sure, but incredibly slow query.

I thought I could go on by just using this, yet I wasn’t satisfied, as I became a performance optimization junkie during my PHP/MySQL learning curve. So I thought, what if I just get the raw rows, and order them later as I need in a PHP array?

So, let’s examine this approach instead:

$results=doquery("
  SELECT a, b, c
    FROM table
    WHERE <conditions>
  ");
$datagrid=array();
while ($slot=mysql_fetch_array($results)) $datagrid[]=$slot;
sort($datagrid);
$compare="";
foreach ($datagrid as $k=>$row)
  if ($row[0]==$compare)
    unset($datagrid[$k]);
  else
    $compare=$row[0];

After this, the multidimensional array $datagrid contains the data that you need. Some explaining is in order:

  • the SQL query returns the unprocessed rows corresponding to the WHERE conditions you specified, so you will get duplicate rows, in an arbitrary order
  • after inserting the rows data inside a multidimensional array, the sort() PHP function takes care of sorting this array depending on the sub-arrays in it, first by subarray[0], then by subarray[1] and so on (let’s open a parenthesis here: if you search for “php sort multidimensional array” on the internet, you will find all sorts of hand-made custom scripts to do this, as if sort() alone wasn’t enough; well, I have PHP5 on my hosting, and I threw a multidimensional array at it… it got splendidly sorted just as I needed, so it worked for me; in case it doesn’t for you, look on php.net function_sort user comments, there are all kinds of custom multi-sorting scripts in there, even if nothing is as fast as just sort() which is a builtin function)
  • a $compare string is initialized, and used on the following foreach loop to unset() all the subarrays where the element [0] (our ‘a’ field) is the same as the previous subarray’s [0], thus effectively pruning duplicates after the first occurrence

After this you can pass the resulting array to another routine which will process it as needed.

To take performance even further, in case you need to do a simple processing on the resulting ordered&grouped array, you can change the last foreach of the previous routine into:

foreach ($datagrid as $row)
  if ($row[0]!=$compare) {
    $compare=$row[0];
    <processing routine in here>
  }

Again, to explain, this foreach loop doesn’t care about unsetting the duplicated subarrays, but just runs your processing code on the first occurence of a subarray, ignoring the others; with this approach, you are cutting several times on the excecution time:

  1. you presumably save time as you don’t need another foreach loop to process the pruned multidimensional array, since you do everything in the first and only foreach
  2. you avoid issuing as many unset()‘s as the duplicate rows are
  3. you save on cycles and memory needed to define a new array to store the pruned data, as you only work on the very first array you built on the SQL results
  4. you simplify the foreach because you are not requesting an associative result anymore ($k=>$row) but only request the element value

With my benchmark (on a VERY reduced sets of rows, I must admit), I measured a performance increase of a whopping one hundred times (yes, exactly 100X faster) for the PHP ordering/pruning, if compared to MySQL “double-GROUP BY” approach (0.08 seconds for MySQL/PHP, against 8 seconds for MySQL only, for 100 iterations). Your results may vary on very large, or of a different kind, data sets, anyway testing will not hurt. From my 100x speed experience, there’s still plenty of leeway space.

Inverted search in MySQL database: check if keywords in field match some text

Usually, “searching a database” means looking for rows that contain a “document” in a field that matches a certain keywords set (mostly using the MATCH AGAINST construct); let’s say you want to do the opposite instead, that is finding those rows that have a field containing keywords, or a phrase, that matches to a text you already know. This is what I call an inverted search, or reverse search.

Real-life scenario: you build a website for job offers, candidates come in searching for listings that suit them, but find none, or find just a few that are not satisfactory. Either they keep coming to the website doing a search (let’s say they use the keywords “veterinar* cat” because they want to take care of animals, and like cats above all), or they subscribe to the RSS feed for that search (if your site offers one)… or they “save” this preference, and have the site send them an email as soon as a new offer pops up that matches their search, and this is where this article comes in.

Now, it’s easy to save the keywords for each user in a MySQL table, but how easy is it to check if the text of new messages matches said keywords? In PHP, you would go about building an array of keywords, and using strpos() on each of them to see if they are all contained in the text.

In MySQL you have the INSTR() which is the equivalent of PHP’s strpos(), but you cannot store arrays in a field (I mean real arrays, not serialized ones), and storing as many rows as the keywords are is going to be troublesome, as normalization isn’t needed: a “set of keywords” is mostly unique to each user subscribing to a search, so it makes sense to store them in a text field in the form of “keyword1,keyword2,keyword3,…”; again, if you were in PHP, you would do a explode(“,”,$keywords) to obtain the array of keywords, but in MySQL there is no such function, so you pretty much have to create one.

Ideally, you need a function where you pass the known document/text to match against (the concatenated subject and the body of the job offer, in this example), the string in the form “keyword1,…,keywordN“, and a parameter which tells the function that the delimiter (“glue” in PHP’s explode/implode) is a comma, so:

somefunction(document, keywords, delimiter)

This function would then return a positive integer if all of the keywords are present in the text, or 0 otherwise.

The function I put together by scraping info around was:

DELIMITER ;;

DROP FUNCTION IF EXISTS `matchkwdarray`;;
CREATE FUNCTION `matchkwdarray`(`str` text, `arr` text, `delimit` tinytext) RETURNS char(1) CHARSET utf8
 DETERMINISTIC
BEGIN
 DECLARE i INT DEFAULT 1;
 DECLARE matches INT DEFAULT 1;
 DECLARE token TINYTEXT;
 label:WHILE(matches>0) DO
 SET token=replace(substring(substring_index(arr, delimit, i), length(substring_index(arr, delimit, i - 1)) + 1), delimit, '');
 IF token='' THEN LEAVE label; END IF;

 SET matches=INSTR(str, token);
 SET i=i+1;
 END WHILE;
 RETURN matches;
END;;

DELIMITER ;

To add this to your MySQL database, you can either connect via CLI, or, as most people with a website on a shared hosting, not having command-line access, using a database frontend; phpMyAdmin, bloated as it is, doesn’t support functions/routines, so you need another one, and I found Adminer to be a massively better alternative, performance and flexibility-wise, and it’s only a 150kb-something single PHP file, which makes it all the better.

So, when you test the function, you have the following:

SELECT matchkwdarray('this is just a very simple test', 'just,test', ',')
Returns: 2

SELECT matchkwdarray('this is just a very simple test', 'some,stuff', ',')
Returns: 0

SELECT matchkwdarray('this is just a very simple test', 'this,very,stuff', ',')
Returns: 0

The function is resource efficient: as soon as a keyword is not in the text, it stops processing the keywords array. Surely, if you don’t want a 100% match but are fine with lower than that, you can easily modify the function to “stream” the whole keywords array, even removing the last character off each keywords to allow some variations, and return a percentage match which you can then evaluate later on.

MySQL SELECT query with more WHERE conditions runs faster

For some reason I wanted to add more redundant conditions to a WHERE clause in a MySQL SELECT query, but I feared that, being redundant conditions, I may end up slowing up the query.

Shortly, there is this classifieds site, where each item’s table has the town_id, province_id and region_id rows. If you know the town_id you’re looking for already, you can query:

SELECT * FROM items WHERE town_id='12'

instead of:

SELECT * FROM items WHERE town_id='12' AND province_id='34' AND region_id='56'

as the results will be the same… but should you?

I prepared a quick test panel:

$creationstart=strtok(microtime()," ")+strtok(" ");
for($i=0;$i<=999;$i++)
 $test=doquery("SELECT * FROM items WHERE town_id='58003'");
$creationend=strtok(microtime()," ")+strtok(" ");
$creationtime=number_format($creationend-$creationstart,4);
echo "town only: $creationtime<br />";

$creationstart=strtok(microtime()," ")+strtok(" ");
for($i=0;$i<=999;$i++)
 $test=doquery("SELECT * FROM items WHERE town_id='58003' AND province_id='58' AND region_id='12'");
$creationend=strtok(microtime()," ")+strtok(" ");
$creationtime=number_format($creationend-$creationstart,4);
echo "all conditions: $creationtime<br />";

Surprisingly, it took anywhere from 2.4s to 4.5s for the “all conditions” routine to complete, while the “town only” query ended up being about 0.4s slower. At first I thought it might be the MySQL cache in action, but even inverting the position of the cycles inside the code, results were the same.

So well, even if it looks redundant to you, adding more WHERE ... AND ... clauses in your query speeds up the fetching of the results.

  This article has been Digiproved