
MySQL query with ORDER BY, GROUP BY to remove duplicate rows after ordering

Apparently this is a widely discussed issue with MySQL: many newbies (like me) want to SELECT rows from a table where they know there are duplicates of some field ‘a’, which they want unique, while making sure that the selected row has the lowest, highest, longest or shortest (whatever) value of field ‘b’ among the duplicates… well, these newbies first try a query like:

SELECT a, b, c FROM table WHERE <conditions> ORDER BY b DESC GROUP BY a

and they get errors and realize it’s not possible, because MySQL syntax wants the GROUP BY clause to come before the ORDER BY one, like:

SELECT a, b, c FROM table WHERE <conditions> GROUP BY a ORDER BY b DESC

In this case, though, rows first get grouped, and only after duplicates are removed do they get ordered; pretty much useless for what we need, since the row that survives each group is an arbitrary one, so there is no way to control which ‘b’ gets kept.

How do we go about first ordering the duplicates the way we need, and then removing the rest?

To cut to the chase, this is how:

SELECT t.a, t.b, t.c
  FROM table t
  JOIN (SELECT a, MIN(b) AS b
    FROM table
    GROUP BY a) t2 ON t2.a = t.a AND t2.b = t.b
  WHERE <conditions>
  GROUP BY t.a
  ORDER BY NULL

What this “query skeleton” does is compute, inside the JOIN, the minimum value of field ‘b’ for each value of ‘a’ (ASC order comes from using the MIN() function; otherwise you would be using MAX(), or anyway the function that better suits your needs), and then match each outer row against its own group’s minimum (note the join on both ‘a’ and ‘b’: joining on ‘b’ alone could pair a row with another group’s minimum), while the final GROUP BY collapses any ties; the ORDER BY NULL clause at the end keeps MySQL from doing an unnecessary ordering pass, as the rows have already been ordered by the GROUP BY clause. One caveat: if your WHERE conditions filter rows out, repeat them inside the subquery too, or the minimum may be computed on rows the outer query discards. Effective, sure, but an incredibly slow query.
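Just to make the skeleton concrete, here is a hypothetical instance (the offers table and its product/price/shop columns are made up for illustration, and doquery() is the same little query wrapper used in the snippet further down): picking, for each product, the row with the cheapest price.

$results=doquery("
  SELECT t.product, t.price, t.shop
    FROM offers t
    JOIN (SELECT product, MIN(price) AS price
      FROM offers
      GROUP BY product) t2 ON t2.product = t.product AND t2.price = t.price
    GROUP BY t.product
    ORDER BY NULL
  ");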

I thought I could just go on using this, yet I wasn’t satisfied, as I became a performance-optimization junkie during my PHP/MySQL learning curve. So I thought: what if I just fetch the raw rows, and order them later as I need in a PHP array?

So, let’s examine this approach instead:

$results=doquery("
  SELECT a, b, c
    FROM table
    WHERE <conditions>
  ");
$datagrid=array();
while ($slot=mysql_fetch_array($results)) $datagrid[]=$slot;
sort($datagrid);
$compare="";
foreach ($datagrid as $k=>$row)
  if ($row[0]==$compare)
    unset($datagrid[$k]);
  else
    $compare=$row[0];

After this, the multidimensional array $datagrid contains the data that you need. Some explaining is in order:

  • the SQL query returns the unprocessed rows corresponding to the WHERE conditions you specified, so you will get duplicate rows, in an arbitrary order
  • after inserting the row data into a multidimensional array, the sort() PHP function takes care of sorting it by comparing the sub-arrays in it, first by subarray[0], then by subarray[1] and so on (a parenthesis here: if you search for “php sort multidimensional array” on the internet, you will find all sorts of hand-made custom scripts for this, as if sort() alone wasn’t enough; well, I have PHP5 on my hosting, I threw a multidimensional array at it, and it got splendidly sorted just as I needed, so it worked for me; in case it doesn’t for you, the user comments on php.net’s sort() page hold all kinds of custom multi-sorting scripts, even if nothing is as fast as the builtin sort(); see also the usort() sketch right after this list)
  • a $compare string is initialized, and used in the following foreach loop to unset() every sub-array whose element [0] (our ‘a’ field) is the same as the previous sub-array’s [0], thus effectively pruning duplicates after the first occurrence
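For reference, here is a minimal usort() sketch of such a custom comparator (cmp_rows is a made-up name; it assumes, like the code above, that field ‘a’ is element [0] and field ‘b’ is element [1] of each row):

// explicit comparator: order rows by element [0], then by element [1]
function cmp_rows($x, $y) {
  if ($x[0]!=$y[0]) return ($x[0]<$y[0]) ? -1 : 1;
  if ($x[1]!=$y[1]) return ($x[1]<$y[1]) ? -1 : 1;
  return 0;
}
usort($datagrid, 'cmp_rows');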

After this you can pass the resulting array to another routine which will process it as needed.

To take performance even further, in case you only need some simple processing on the resulting ordered and grouped array, you can change the last foreach of the previous routine into:

foreach ($datagrid as $row)
  if ($row[0]!=$compare) {    // first occurrence of this value of 'a'
    $compare=$row[0];
    <processing routine in here>
  }
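For instance, with a hypothetical processing routine that just prints the one surviving row per value of ‘a’, the loop would read:

$compare="";
foreach ($datagrid as $row)
  if ($row[0]!=$compare) {
    $compare=$row[0];
    // hypothetical processing: output the kept row's 'a' and 'b' fields
    echo $row[0].": ".$row[1]."<br>\n";
  }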

Again, to explain: this foreach loop doesn’t bother unsetting the duplicated sub-arrays, it just runs your processing code on the first occurrence of each sub-array and ignores the others; with this approach, you are cutting execution time in several ways:

  1. you presumably save time, as you don’t need a second foreach loop to process the pruned multidimensional array: you do everything in the first and only foreach
  2. you avoid issuing one unset() per duplicate row
  3. you save the cycles and memory needed to build a new array for the pruned data, as you only ever work on the very first array you built from the SQL results
  4. you simplify the foreach itself, because you no longer request the key=>value pair ($k=>$row) but only the element value

With my benchmark (on a VERY reduced set of rows, I must admit), I measured a whopping hundredfold performance increase (yes, 100X faster) for the PHP ordering/pruning compared to the MySQL-only “double-GROUP BY” approach: 0.08 seconds for MySQL+PHP against 8 seconds for MySQL alone, over 100 iterations. Your results may vary on very large, or simply different, data sets; testing will not hurt anyway. Going by my 100x experience, though, there’s plenty of leeway.
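If you want to try the comparison on your own data, a minimal timing sketch could look like this (the loop bodies are placeholders for the two routines above; microtime(true) needs PHP5):

// time 100 iterations of the MySQL-only approach
$start=microtime(true);
for ($i=0; $i<100; $i++) {
  // run the "double-GROUP BY" query from the first skeleton here
}
echo "MySQL only: ".(microtime(true)-$start)." s\n";

// time 100 iterations of the MySQL+PHP approach
$start=microtime(true);
for ($i=0; $i<100; $i++) {
  // run the plain SELECT plus the sort()/prune PHP loop here
}
echo "MySQL+PHP: ".(microtime(true)-$start)." s\n";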