MySQL and more

Thursday, February 9, 2012

Filesorts, Secondary Indexes and the Importance of Covering Indexes


Here's a question that was driving me crazy: Why do these two explain plans look different?

explain select a from test order by b;
+----+-------------+-------+-------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | test  | index | NULL          | b    | 5       | NULL | 3263769 | Using index |
+----+-------------+-------+-------+---------------+------+---------+------+---------+-------------+

and this

explain select a,c from test order by b;
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra          |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
|  1 | SIMPLE      | test  | ALL  | NULL          | NULL | NULL    | NULL | 3263769 | Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+


The second query does a filesort, but the only change is adding another column to the SELECT clause! For reference, here is the table structure. As you can see, there's a primary key with auto increment, and a secondary key on b.

CREATE TABLE `test` (
  `a` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `b` int(11) DEFAULT NULL,
  `c` varchar(32) DEFAULT NULL,
  PRIMARY KEY (`a`),
  KEY `b` (`b`)
) ENGINE=InnoDB AUTO_INCREMENT=3278870 DEFAULT CHARSET=latin1


You don't get the same behavior if you're ordering by the primary key value. So it seems there's a big difference when sorting if you use a secondary index.

 The explanation seems to be that (with innodb tables) if you order by a secondary index, and you need to read row data, then mysql has to go back to read the clustered index anyway, so it just ignores the secondary index and does a filesort.

Wow, that makes optimization of sorts much more difficult! What's the solution? I've heard that Multi-Range-Read in MySQL 5.6 will fix this, but I haven't tested that myself yet. For now, a query like this highlights the importance of covering indexes.

alter table test add index (b,c);

explain select a,c from test order by b;
+----+-------------+-------+-------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | test  | index | NULL          | b_2  | 40      | NULL | 3263685 | Using index |
+----+-------------+-------+-------+---------------+------+---------+------+---------+-------------+

3 comments:

Robert Hodges said...

Hi Gavin, I think this variation is the difference between having a sorted covering index vs. not having a covering index. The first query is a special case because you are asking for the primary keys sorted in index column order, here 'b'. So MySQL just scans the index, which contains exactly those values.

In the second case you are selecting an additional field that is not in the index. This requires a table scan, because while reading the index on 'b' would get you the sort order it would generate a bunch of random i/o which the query optimizer rightly thinks is less efficient than just reading the pages and sorting them.

If you were using a DBMS other than MySQL with InnoDB, the index in the first ID would likely not be covering as the index entries would have pointers to pages rather than primary keys. In that case you would end up with a table scan and sort as well. InnoDB stores the index values + primary keys in its secondary indexes. This is design choice that works to your benefit here.

gtowey said...

Hey Robert,

Yes, your explanation makes perfect sense. The result was surprising at first, but becomes clear when thinking about how the indexes are structured.

I think I'm mostly trying to point out that just reading the mysql manual on order-by-optimizations doesn't give you the entire picture. None of these considerations are discussed there, and they even give an example almost exactly like I showed, which they claim will use an index to resolve the order by clause.

I also noticed that use of LIMIT changes this equation too -- it seems the optimizer will switch to preferring the index when your limit is about 10% of the rows or less. This is *also* not discussed in the manual =)

Robert Hodges said...

Query optimization is definitely one of the dark arts. There's a trade-off point between an index range scan and full table scan but it's not something you would normally find without looking at code.

Personally I like the MySQL optimizer because it's simple enough you can convince yourself you understand what it's doing. That's not the case with some of the competition.

Followers