- STUDENT - 400.000 lines (35.000 are inactive)
- CLASS - 40.000 lines
- CLASS_STUDENT - 460.000 lines
SELECT S.* FROM STUDENT S JOIN CLASS_STUDENT CS ON CS.STUDENT_ID = S.STUDENT_ID JOIN CLASS C ON C.CLASS_ID = CS.CLASS_ID WHERE S.ACTIVE = TRUE GROUP BY S.STUDENT_ID --this suffices to fetch students only once ORDER BY C.CLASS_DATE DESC --datetime field LIMIT 0,5
Execution time with
ORDER BY: 3.2297 sec
Execution time without
ORDER BY: 0.0015 sec
3 seconds in my system is a lot of time that will result in a bad experience to my users. Isn't there any way to speedup this query with
ORDER BY? The
LIMIT is for pagination. I'm ordering by
CLASS_DATE desc because I want to see in my paginated results firstly the students attending to recent classes.
I cannot remove the joins I'm using.
EXPLAIN for both queries:
innodb_buffer_pool_size = 4GB and I have 16 GB in my server
If this was my project, I would only list each student once. Not multiple times, repeating the student for each row in the enrollment (
I would seriously consider denormalizing the database implementation, by adding a derived column to the
STUDENT table, e.g.
ALTER TABLE STUDENT ADD latest_class_date DATE DEFAULT NULL;
And populate that column:
UPDATE STUDENT t LEFT JOIN ( SELECT cs.student_id , MAX(c.class_date) AS latest_class_date FROM CLASS_STUDENT cs JOIN CLASS c ON c.class_id = cs.class_id GROUP BY cs.student_id ) s ON t.student_id = s.student_id SET t.latest_class_date = s.latest_class_date
With that new column added, I could create an appropriate index, eg.
... ON STUDENT (active, latest_class_date, student_id)
Then my query is much simpler:
SELECT s.* FROM student s WHERE s.active = 1 ORDER BY s.active DESC, s.latest_class_date DESC, student_id DESC LIMIT 5
I'd include the
student_id in the
ORDER BY to make the result deterministic. (Without that, MySQL is free to return any rows with the same
latest_class_date in any order.)
I'd also implement pagination a little more efficiently, retaining the values from the last previously retrieved row, and supplying those values in the query. To get the "next 5 rows":
(EDIT: The query pattern for pagination in the original was glaringly wrong. This has been corrected, both here and in the last query in the followup.)
SELECT s.* FROM student s WHERE s.active = 1 AND s.latest_class_date <= ? AND (s.latest_class_date < ? OR s.student_id < ? ) ORDER BY s.active DESC, s.latest_class_date DESC, student_id DESC LIMIT 5
The contents of the new
latest_class_date column will need to be maintained. A couple of choices there.
If I can live with a period of time that the contents of that column might be out of sync, then
- schedule execution of the update statement to periodically refresh the contents of that column.
If I have a requirement to keep the contents of that column synchronized, then either:
modify the application that stewards the student, class and student_class tables to ensure the
latest_class_datecolumns is populated whenever rows are added/changed/removed from
add triggers on the tables to keep that column populated
BEFORE INSERT/UPDATEtriggers on
AFTER INSERT/UPDATE/DELETEtriggers on
AFTER INSERT/UPDATE/AFTER DELETEtriggers on
(I need to remember that triggers are not fired by foreign key actions. For example, if rows are deleted from
CLASS_STUDENT as the result of a CASCADE action of a FOREIGN KEY when a row in deleted from
CLASS, then only the triggers for the
CLASS table will be fired. That means I'll have to handle the required actions in the delete trigger on the
If you "need all the joins" because of other information in those tables you "might need one day", the suggestion I made above isn't going to help much. In terms of performance, dollars to donuts it's the "Using filesort" operation on a huge set that is eating your lunch. Adding joins to the query I suggested isn't going to avoid that "Using filesort" operation.
If there's reasonable performance on the query I suggested, then we might be able to make use of that query as an inline view, to limit the number of rows returned, before we do the joins.
But before we do that, we have to first address the duplicated student rows returned when there are multiple rows in the enrollment (
CLASS_STUDENT) table. Do we want the same student returned multiple times? Or do we want to return the student row multiple times, once for each row in
CLASS_STUDENT for a class wit the same
class_date. Or, do we want to return the student row just once, with information from just one
CLASS? If we are returning multiple rows for a student, is the intent to list five students per page, or to list five student_class per page?
Assuming "pagination" of five rows per page, are we expecting a resultset like this?
rows 1 thru 5
student class class_date ------- ----- ---------- Sam phys 2016-02-12 Sam calc 2016-02-12 Mary lit 2016-02-12 Mahatma art 2016-02-12 Paul music 2016-02-11
rows 6 thru 10
Paul engl 2016-02-11 Sam art 2016-02-10 ...
The query we write is going to be informed by the specification.
We can write a myriad possible queries. But without a specification, every one of those queries is just a guess. Without knowing what resultset we want returned (under a variety of possible conditions), we have no way of verifying that a query we write is correct.
I looked at your query again, and notice that you do have a
GROUP BY student_id. (We're going to assume that
student_id is unique in the
If there are multiple rows in the
CLASS_STUDENT table for a student, and the
CLASS rows that those are related to have different value for
class_date, the value of
class_date returned in the original query is indeterminate. MySQL is free to choose any of the possible
class_date values. (It's not just the
class_date column... the values from the rows from
CLASS_STUDENT are indeterminate.)
With the original query, there is no guarantee that students with the "latest" class_date are going to be listed before other students. For example, with this set:
student class_date ------- ---------- Sam 2016-02-22 Sam 2015-07-17 Paul 2016-01-11
The query above could order the rows with
Paul, or with
Paul. Either result is valid with the original query. And you aren't guaranteed that every time you run the query the result will be the same. The result is indeterminate.
Other databases would throw an error with the original query, along the lines of "non-aggregate in SELECT list not appearing in GROUP BY clause". A MySQL specific extension allows the query to execute. It's possible to get MySQL to behave the same way as other databases, and throwing an error, by including
ONLY_FULL_GROUP_BY in the
SELECT d.* FROM ( SELECT s.student_id , s.latest_class_date FROM student s WHERE s.active = 1 AND s.latest_class_date <= ? AND ( s.latest_class_date < ? OR s.student_id < ? ) ORDER BY s.active DESC, s.latest_class_date DESC, student_id DESC LIMIT 5 ) r JOIN student d ON d.student_id = r.student_id JOIN class_student e ON e.student_id = d.student_id JOIN class c ON c.class_id = e.class_id AND c.class_date = r.latest_class_date GROUP BY d.student_id