MySQL query with ORDER BY,GROUP BY to remove duplicate rows after ordering

Apparently this is a widely discussed issue with MySQL; many newbies (like me) wanting to SELECT rows from a table, where they know there are duplicates of some field ‘a’ which they want unique, but making sure that the selected row has the field ‘b’ with the lower, highest, longest or shortest (whatever) value among the duplicates… well, these newbies first try and make a query like:

SELECT a, b, c FROM table WHERE <conditions> ORDER BY b DESC GROUP BY a

and they get errors and realize it’s not possible, because MySQL syntax wants the GROUP BY clause to come before the ORDER BY one, like:

SELECT a, b, c FROM table WHERE <conditions> GROUP BY a ORDER BY b DESC

In this case though, rows first get grouped, and only after duplicates are removed, they get ordered (pretty much useless for what we need).

How to go about first ordering duplicates the way we need, and then remove the rest?

To cut to the chase, this is how:

SELECT t.a, t.b, t.c 
  FROM table t
  JOIN (SELECT MIN(b) as b
    FROM table
    GROUP BY a) t2 ON t2.b = t.b
  WHERE <conditions>
  GROUP BY a
  ORDER BY NULL

What this “query skeleton” does, is ordering the rows by the field ‘b’ (ASC order, since MIN() function is used, otherwise you would be using MAX(), or anyway the function that better suits your needs) inside the JOIN operator, and then grouping these rows by field ‘a’ in the outer query; the ORDER BY NULL clause at the end is to avoid MySQL doing unnecessary ordering steps as the rows have already been ordered by the GROUP BY clause. Effective, sure, but incredibly slow query.

I thought I could go on by just using this, yet I wasn’t satisfied, as I became a performance optimization junkie during my PHP/MySQL learning curve. So I thought, what if I just get the raw rows, and order them later as I need in a PHP array?

So, let’s examine this approach instead:

$results=doquery("
  SELECT a, b, c
    FROM table
    WHERE <conditions>
  ");
$datagrid=array();
while ($slot=mysql_fetch_array($results)) $datagrid[]=$slot;
sort($datagrid);
$compare="";
foreach ($datagrid as $k=>$row)
  if ($row[0]==$compare)
    unset($datagrid[$k]);
  else
    $compare=$row[0];

After this, the multidimensional array $datagrid contains the data that you need. Some explaining is in order:

  • the SQL query returns the unprocessed rows corresponding to the WHERE conditions you specified, so you will get duplicate rows, in an arbitrary order
  • after inserting the rows data inside a multidimensional array, the sort() PHP function takes care of sorting this array depending on the sub-arrays in it, first by subarray[0], then by subarray[1] and so on (let’s open a parenthesis here: if you search for “php sort multidimensional array” on the internet, you will find all sorts of hand-made custom scripts to do this, as if sort() alone wasn’t enough; well, I have PHP5 on my hosting, and I threw a multidimensional array at it… it got splendidly sorted just as I needed, so it worked for me; in case it doesn’t for you, look on php.net function_sort user comments, there are all kinds of custom multi-sorting scripts in there, even if nothing is as fast as just sort() which is a builtin function)
  • a $compare string is initialized, and used on the following foreach loop to unset() all the subarrays where the element [0] (our ‘a’ field) is the same as the previous subarray’s [0], thus effectively pruning duplicates after the first occurrence

After this you can pass the resulting array to another routine which will process it as needed.

To take performance even further, in case you need to do a simple processing on the resulting ordered&grouped array, you can change the last foreach of the previous routine into:

foreach ($datagrid as $row)
  if ($row[0]!=$compare) {
    $compare=$row[0];
    <processing routine in here>
  }

Again, to explain, this foreach loop doesn’t care about unsetting the duplicated subarrays, but just runs your processing code on the first occurence of a subarray, ignoring the others; with this approach, you are cutting several times on the excecution time:

  1. you presumably save time as you don’t need another foreach loop to process the pruned multidimensional array, since you do everything in the first and only foreach
  2. you avoid issuing as many unset()‘s as the duplicate rows are
  3. you save on cycles and memory needed to define a new array to store the pruned data, as you only work on the very first array you built on the SQL results
  4. you simplify the foreach because you are not requesting an associative result anymore ($k=>$row) but only request the element value

With my benchmark (on a VERY reduced sets of rows, I must admit), I measured a performance increase of a whopping one hundred times (yes, exactly 100X faster) for the PHP ordering/pruning, if compared to MySQL “double-GROUP BY” approach (0.08 seconds for MySQL/PHP, against 8 seconds for MySQL only, for 100 iterations). Your results may vary on very large, or of a different kind, data sets, anyway testing will not hurt. From my 100x speed experience, there’s still plenty of leeway space.

4 thoughts on “MySQL query with ORDER BY,GROUP BY to remove duplicate rows after ordering”

  1. Ciao,
    Con la seguente variante si ottiene la stessa cosa vero?
    (ho verificato di persona in mysql ma vorrei conferma)

    SELECT t.a, min(t.b), t.c
    FROM dbesercizi.t
    GROUP BY a

    1. Vorrei tanto darti una risposta definitiva, ma non lo so!
      Sono pur sempre un niubbo autodidatta, però di logica il min(t.b) viene eseguito nel SELECT solo dopo il GROUP BY, quindi prima GROUP BY esegue il suo raggruppamento secondo come sta comodo, senza riguardo per il valore del campo b, e poi il min(t.b) una volta applicato ai risultati sarà sempre uguale a t.b.
      Sempre secondo me eh!
      Poi siccome tu hai una tabella, prova a fare nella stessa query il SELECT sia del valore minimo che del campo nudo e crudo, e se sono sempre uguali tra di loro allora è come dico io 😉

  2. Allora… la tabella è semplice:

    INSERT INTO dbesercizi.t(x,a,b,c)
    values(
    a,5,3,1
    ),
    (
    b,5,6,2
    )

    se faccio

    SELECT t.a, min(t.b), t.c
    FROM dbesercizi.t
    GROUP BY a;

    allora ottengo

    a=5 min(t.b)=3 c=1

    se faccio

    SELECT t.a, max(t.b), t.c
    FROM dbesercizi.t
    GROUP BY a;

    allora ottengo
    a=5 max(t.b)=6 c=1

    invece (se ho capito bene quello che intendi dire tu)
    senza alcun aggregato
    SELECT t.a, t.b, t.c
    FROM dbesercizi.t
    GROUP BY a;

    ottengo
    a=5 b=3 c=1

    Da qui ho dedotto che funziona….
    Fammi sapere la tua opinione a riguardo e se sei disponibile magari se mi potresti spiegare meglio il meccanismo del GROUP by, nel senso che viene prima fatto il GROUP BY e dopo la SELECT o il contrario?

    1. Non sono così esperto, ma se per te ha funzionato nel modo in cui riporti, l’unico altro test da fare sarebbe espandere la tabella di prova per vedere se i risultati sono coerenti.
      E infine fare un benchmark per vedere se GROUP BY è più o meno veloce di min()/max()

Leave a Reply to mikelmanto Cancel reply

Your email address will not be published. Required fields are marked *