MySQL - poor performance with simple order by

advertisements

Tables (InnoDB):

  • STUDENT - 400.000 lines (35.000 are inactive)
  • CLASS - 40.000 lines
  • CLASS_STUDENT - 460.000 lines

Query:

SELECT S.*
FROM STUDENT S
JOIN CLASS_STUDENT CS ON CS.STUDENT_ID = S.STUDENT_ID
JOIN CLASS C ON C.CLASS_ID = CS.CLASS_ID
WHERE S.ACTIVE = TRUE
GROUP BY S.STUDENT_ID --this suffices to fetch students only once
ORDER BY C.CLASS_DATE DESC --datetime field
LIMIT 0,5

Execution time with ORDER BY: 3.2297 sec

Execution time without ORDER BY: 0.0015 sec

3 seconds in my system is a lot of time that will result in a bad experience to my users. Isn't there any way to speedup this query with ORDER BY? The LIMIT is for pagination. I'm ordering by CLASS_DATE desc because I want to see in my paginated results firstly the students attending to recent classes.

I cannot remove the joins I'm using.

Thanks!

EDIT: EXPLAIN for both queries:

EDIT2: innodb_buffer_pool_size = 4GB and I have 16 GB in my server


If this was my project, I would only list each student once. Not multiple times, repeating the student for each row in the enrollment (CLASS_STUDENT) table.

I would seriously consider denormalizing the database implementation, by adding a derived column to the STUDENT table, e.g.

 ALTER TABLE STUDENT ADD latest_class_date DATE DEFAULT NULL;

And populate that column:

 UPDATE STUDENT t
   LEFT
   JOIN ( SELECT cs.student_id
               , MAX(c.class_date) AS latest_class_date
            FROM CLASS_STUDENT cs
            JOIN CLASS c
              ON c.class_id = cs.class_id
           GROUP BY cs.student_id
        ) s
     ON t.student_id = s.student_id
    SET t.latest_class_date = s.latest_class_date

With that new column added, I could create an appropriate index, eg.

... ON STUDENT (active, latest_class_date, student_id)

Then my query is much simpler:

SELECT s.*
  FROM student s
 WHERE s.active = 1
 ORDER BY s.active DESC, s.latest_class_date DESC, student_id DESC
 LIMIT 5

I'd include the student_id in the ORDER BY to make the result deterministic. (Without that, MySQL is free to return any rows with the same latest_class_date in any order.)

I'd also implement pagination a little more efficiently, retaining the values from the last previously retrieved row, and supplying those values in the query. To get the "next 5 rows":


(EDIT: The query pattern for pagination in the original was glaringly wrong. This has been corrected, both here and in the last query in the followup.)

SELECT s.*
  FROM student s
 WHERE s.active = 1
   AND s.latest_class_date <= ?
   AND (s.latest_class_date < ? OR s.student_id < ? )
 ORDER BY s.active DESC, s.latest_class_date DESC, student_id DESC
 LIMIT 5

The contents of the new latest_class_date column will need to be maintained. A couple of choices there.

If I can live with a period of time that the contents of that column might be out of sync, then

  • schedule execution of the update statement to periodically refresh the contents of that column.

If I have a requirement to keep the contents of that column synchronized, then either:

  • modify the application that stewards the student, class and student_class tables to ensure the latest_class_date columns is populated whenever rows are added/changed/removed from CLASS and STUDENT_CLASS tables, or

  • add triggers on the tables to keep that column populated

    • BEFORE INSERT/UPDATE triggers on STUDENT
    • AFTER INSERT/UPDATE/DELETE triggers on CLASS_STUDENT
    • AFTER INSERT/UPDATE/AFTER DELETE triggers on CLASS

(I need to remember that triggers are not fired by foreign key actions. For example, if rows are deleted from CLASS_STUDENT as the result of a CASCADE action of a FOREIGN KEY when a row in deleted from CLASS, then only the triggers for the CLASS table will be fired. That means I'll have to handle the required actions in the delete trigger on the CLASS table.)


FOLLOWUP

If you "need all the joins" because of other information in those tables you "might need one day", the suggestion I made above isn't going to help much. In terms of performance, dollars to donuts it's the "Using filesort" operation on a huge set that is eating your lunch. Adding joins to the query I suggested isn't going to avoid that "Using filesort" operation.

If there's reasonable performance on the query I suggested, then we might be able to make use of that query as an inline view, to limit the number of rows returned, before we do the joins.

But before we do that, we have to first address the duplicated student rows returned when there are multiple rows in the enrollment (CLASS_STUDENT) table. Do we want the same student returned multiple times? Or do we want to return the student row multiple times, once for each row in CLASS_STUDENT for a class wit the same class_date. Or, do we want to return the student row just once, with information from just one CLASS? If we are returning multiple rows for a student, is the intent to list five students per page, or to list five student_class per page?

Assuming "pagination" of five rows per page, are we expecting a resultset like this?

rows 1 thru 5

student    class   class_date
-------    -----   ----------
Sam        phys    2016-02-12
Sam        calc    2016-02-12
Mary       lit     2016-02-12
Mahatma    art     2016-02-12
Paul       music   2016-02-11

rows 6 thru 10

Paul       engl    2016-02-11
Sam        art     2016-02-10
...

The query we write is going to be informed by the specification.

We can write a myriad possible queries. But without a specification, every one of those queries is just a guess. Without knowing what resultset we want returned (under a variety of possible conditions), we have no way of verifying that a query we write is correct.


I looked at your query again, and notice that you do have a GROUP BY student_id. (We're going to assume that student_id is unique in the students table.)

If there are multiple rows in the CLASS_STUDENT table for a student, and the CLASS rows that those are related to have different value for class_date, the value of class_date returned in the original query is indeterminate. MySQL is free to choose any of the possible class_date values. (It's not just the class_date column... the values from the rows from CLASS and CLASS_STUDENT are indeterminate.)

With the original query, there is no guarantee that students with the "latest" class_date are going to be listed before other students. For example, with this set:

student   class_date
-------   ----------
Sam       2016-02-22
Sam       2015-07-17
Paul      2016-01-11

The query above could order the rows with Sam before Paul, or with Sam after Paul. Either result is valid with the original query. And you aren't guaranteed that every time you run the query the result will be the same. The result is indeterminate.

Other databases would throw an error with the original query, along the lines of "non-aggregate in SELECT list not appearing in GROUP BY clause". A MySQL specific extension allows the query to execute. It's possible to get MySQL to behave the same way as other databases, and throwing an error, by including ONLY_FULL_GROUP_BY in the sql_mode.


SELECT d.*
  FROM ( SELECT s.student_id
              , s.latest_class_date
           FROM student s
          WHERE s.active = 1
            AND s.latest_class_date <= ?
            AND ( s.latest_class_date < ? OR s.student_id < ? )
          ORDER BY s.active DESC, s.latest_class_date DESC, student_id DESC
          LIMIT 5
       ) r
  JOIN student d
    ON d.student_id = r.student_id
  JOIN class_student e
    ON e.student_id = d.student_id
  JOIN class c
    ON c.class_id = e.class_id
   AND c.class_date = r.latest_class_date
 GROUP BY d.student_id