Information is Contagious

MySQL and more

Monday, March 26, 2012

Introducing Anemometer - MySQL Slow Query Visualization


The DBA team at Box.com has just open sourced our in-house tool for tracking sql queries.  It's really been a huge help to us in terms of growing our company while maintaining consistent database performance.

This uses the pt-query-digest tool for collection -- you can set it up on cron to sample slow logs at regular intervals from all your db servers and insert the results into a database.  Anemometer makes it easier to dig through that information and see your query stats.

Here's a quick sample of what it looks like:

Enjoy!




Thursday, February 23, 2012

Multi-Insert and LAST_INSERT_ID()



Edit:

Thanks to a comment by Maxim Krizhanovsky,  this is all actually documented in quite a bit of detail.  It's a long read, but well worth it: http://dev.mysql.com/doc/refman/5.1/en/innodb-auto-increment-handling.html

The really important part to note, is that the behavior of auto increment in some cases can be modified with the server setting innodb_autoinc_lock_mode including when you use INSERT ... SELECT and LOAD DATA INFILE.  Some settings are not safe to use with statement based replication. The default setting is safe, but doesn't have the best performance.

Original article follows:


Using a multi-insert to save rounds trips to the database can be a great time saver, but what if you want to also retrieve all the auto-increment values from those rows you just inserted?

CREATE TABLE test ( a int unsigned auto_increment primary key );

INSERT INTO test VALUES (), (), ();

SELECT LAST_INSERT_ID();


The value returned is 1, which is the first auto increment value of the multi-insert statement.  This is clearly documented in the manual.

What isn't documented, is if the next two values can safely be assumed to be 2 and 3.  In this simple example, with no concurrent inserts, that will be true.  But will it always be true?

Luckily, the answer is yes -- at least for innodb (I haven't checked other storage engines.)  Auto increment ids for a multi-insert are reserved all at once in a block.  The only extra thing to be aware of, are your server's settings for auto_increment_increment.

The source code shows this:
storage/innobase/handler/ha_innodb.cc

/*********************************************************************//**
This function initializes the auto-inc counter if it has not been
initialized yet. This function does not change the value of the auto-inc
counter if it already has been initialized. Returns the value of the
auto-inc counter in *first_value, and ULONGLONG_MAX in *nb_reserved_values (as
we have a table-level lock). offset, increment, nb_desired_values are ignored.
*first_value is set to -1 if error (deadlock or lock wait timeout) */
UNIV_INTERN
void
ha_innobase::get_auto_increment(
/*============================*/
        ulonglong offset,              /*!< in: table autoinc offset */
        ulonglong increment,           /*!< in: table autoinc increment */
        ulonglong nb_desired_values,   /*!< in: number of values reqd */
        ulonglong *first_value,        /*!< out: the autoinc value */
        ulonglong *nb_reserved_values) /*!< out: count of reserved values */
{

Thursday, February 9, 2012

Filesorts, Secondary Indexes and the Importance of Covering Indexes


Here's a question that was driving me crazy: Why do these two explain plans look different?

explain select a from test order by b;
+----+-------------+-------+-------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | test  | index | NULL          | b    | 5       | NULL | 3263769 | Using index |
+----+-------------+-------+-------+---------------+------+---------+------+---------+-------------+

and this

explain select a,c from test order by b;
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows    | Extra          |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+
|  1 | SIMPLE      | test  | ALL  | NULL          | NULL | NULL    | NULL | 3263769 | Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+---------+----------------+


The second query does a filesort, but the only change is adding another column to the SELECT clause! For reference, here is the table structure. As you can see, there's a primary key with auto increment, and a secondary key on b.

CREATE TABLE `test` (
  `a` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `b` int(11) DEFAULT NULL,
  `c` varchar(32) DEFAULT NULL,
  PRIMARY KEY (`a`),
  KEY `b` (`b`)
) ENGINE=InnoDB AUTO_INCREMENT=3278870 DEFAULT CHARSET=latin1


You don't get the same behavior if you're ordering by the primary key value. So it seems there's a big difference when sorting if you use a secondary index.

 The explanation seems to be that (with innodb tables) if you order by a secondary index, and you need to read row data, then mysql has to go back to read the clustered index anyway, so it just ignores the secondary index and does a filesort.

Wow, that makes optimization of sorts much more difficult! What's the solution? I've heard that Multi-Range-Read in MySQL 5.6 will fix this, but I haven't tested that myself yet. For now, a query like this highlights the importance of covering indexes.

alter table test add index (b,c);

explain select a,c from test order by b;
+----+-------------+-------+-------+---------------+------+---------+------+---------+-------------+
| id | select_type | table | type  | possible_keys | key  | key_len | ref  | rows    | Extra       |
+----+-------------+-------+-------+---------------+------+---------+------+---------+-------------+
|  1 | SIMPLE      | test  | index | NULL          | b_2  | 40      | NULL | 3263685 | Using index |
+----+-------------+-------+-------+---------------+------+---------+------+---------+-------------+

Monday, April 5, 2010

IPv6 Functions for MySQL

MySQL has inet_aton and inet_ntoa functions for converting IPv4 addresses to an unsigned 32 bit integer value, and back into dotted decimal format. Since IPv6 is becoming more popular, I decided to create a couple functions for handling IPv6 addresses as well. In this case, we want to convert the friendly ::0:53D6:0001 style addresses to a 16 byte binary value that is easier to store. But we also want to extract them in a human readable format as well. To do that we now have inet_aton6() and inet_ntoa6().

Examples:

mysql> SELECT inet_aton6('0:0:0:0:0:0:af00:b001');
+------------------------------------------+
| inet_aton6('0:0:0:0:0:0:af00:b001') |
+------------------------------------------+
| 000000000000000000000000AF00B001 |
+------------------------------------------+

mysql> SELECT inet_ntoa6(unhex('000000000000000000000000AF00B001'));
+-------------------------------------------------------+
| inet_ntoa6(unhex('000000000000000000000000AF00B001')) |
+-------------------------------------------------------+
| 0:0:0:0:0:0:af00:b001 |
+-------------------------------------------------------+

When storing in a table, it's recommended to use BINARY(16), and use UNHEX() to convert the hexidecimal string representation into a binary string. inet_aton6 should do that itself, but returning binary strings from a udf causes errors.

CREATE TABLE `address` (
`ipv6` binary(16) DEFAULT NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1

mysql> INSERT INTO address (ipv6) VALUES (unhex(inet_aton6('0:0:0:0:0:0:af00:b001')));

mysql> SELECT inet_ntoa6(ipv6) FROM address;
+-----------------------+
| inet_ntoa6(ipv6) |
+-----------------------+
| 0:0:0:0:0:0:af00:b001 |
+-----------------------+


Enjoy!

Monday, March 8, 2010

Qsh.pl: distributed query tool

I've written quite a few tools over time to connect to many mysql servers and run queries. Most of these have been pretty specific to a small set of tasks such as running an alter across many servers. Any sysadmin that is in charge of many servers is probably familiar with dsh, and as I was using recently I realized how all those specific tools I've written for mysql could be generalized into a dsh like tool. Thus, Qsh.pl was born! (download at launchpad)

Usage should be familiar to anyone who has used dsh before, it even will read group files made for dsh in /etc/dsh/group/or /usr/local/etc/group/.

Here's an example where this tool was quite useful. I was getting a query error for SHOW GLOBAL STATUS. This was a curious result since we're running mysql 5.0 everywhere. So what better way to find out which machines are complaining than just run it everywhere:

# qsh.pl -Mcg all_servers --user root --ask-pass --db=test -e 'SHOW GLOBAL STATUS' 2>error.log
{snip ... lots of output}
Done. Total time 2.919

My group file for all_servers includes 120 mysql servers, executing that query on all of them took a total of 2.9 seconds, not bad. I also redirected stderr to a file, so any query errors are easy to find:
cat error.log
myserver1: Query Error (1064) You have an error in your SQL syntax ...
myserver2: Query Error (1064) You have an error in your SQL syntax ...

Ok, we found all the servers that return an error, why do they complain?
# qsh.pl -Mcm myserver1,myserver2,myserver3 --user root --ask-pass --db=test -e 'SELECT VERSION()'
Password:
myserver1: +-----------------+
myserver1: | VERSION() |
myserver1: +-----------------+
myserver1: | 4.1.21-standard |
myserver1: +-----------------+
myserver3: +----------------------------+
myserver3: | VERSION() |
myserver3: +----------------------------+
myserver3: | 5.0.66a-enterprise-gpl-log |
myserver3: +----------------------------+
myserver2: +-----------------+
myserver2: | VERSION() |
myserver2: +-----------------+
myserver2: | 4.1.21-standard |
myserver2: +-----------------+
Done. Total time 0.063
Ooops! That's right, still a few old versions for legacy reasons.

That's just one example of how I used it. There are probably lots of use cases out there, but since it's new I'm still learning to rely on it. It certainly makes things faster when I can think about querying many servers at once, and is a more efficient way to work when dealing with many machines. It might be useful for:

+ comparing explain plan between many machines
+ altering large tables across many slaves, before promoting one to master.
+ grabbing status output from many machines to feed into awk or sed

Monday, February 1, 2010

Three Types of Sets: Performance Comparrisson

In a previous post I talked about how to take a classic SQL anti-pattern (storing lists as CSV in a single field), normalize it, and how to write queries for it.

There are two other ways to handle sets of values: The built in SET column type, and using a INT/BIGINT column as a bitfield.

The mysql set data type provides a convenient way of working with a list of values, where each field could represent multiple simultaneous values. It's really just a wrapper around bitmask operations -- it lest you work with them using text as well as numerical values.

But how do they perform? My quick test showed that aside from a couple edge cases, they all preformed about the same speed. In fact it was surprising to me to see that the bitmask type queries were really not that much faster than the more normalised table structure with multiple rows. I would have guessed that the extra grouping operations would have added significant overhead, but they didn't.










QueryAvg. Time
SELECT * from bitmask WHERE val&3 LIMIT 10000.0023512
SELECT * from bitmask WHERE val&3=3 LIMIT 10000.0149182
select * from settable where s&3=3 LIMIT 10000.0155686
select id FROM normalset
group by id
HAVING sum(val='yellow') and sum(val='cyan')
LIMIT 1000
0.016244
select id,
BIT_OR(CASE val WHEN 'cyan' THEN 1 WHEN 'yellow' THEN 2 ELSE 0 END) as bitvals,
count(*)
FROM normalset
WHERE val in ('cyan','yellow')
GROUP BY id
HAVING bitvals=3
LIMIT 1000
0.0163958
select * from settable
where s like '%cyan%yellow%' LIMIT 1000
0.0333536



Notice that that the top query in the list is an order of magnitude faster than the others, however it doesn't actually answer the same question as the other queries. I included it anyway because it was a curious result -- apparently bit manipulation operations alone are very very fast in mysql, but just adding an equality test slows the query down considerably. The cost of the equality test is much much higher than I would have anticipated. It would be an interesting project to track down exactly why that is.

You can see that working with the SET datatype using string values has the worst performance. When you use it like an ordinary bitmask, then the performance is identical to the other, faster methods (no surprise there.) Working with it using text values involves some internal conversions. IMO this means ordinary bitmasks are the clear winner over the SET datatype -- the SET datatype requires an ALTER TABLE to add new values, and the the convenience of using text values should be avoided because they're slow. If you need maximum flexibility, the more normalized structure seems to preform just as well.


To generate the test data, I created about 200k unique items and randomly assigned any of a set of 4 values to each item. This ended up with about 400k associations, so each item had on average two values out of the set. Statements follow:


-- creating SET table, generating values
create table settable (id serial, s set('cyan','yellow','magenta','black'), key (s) );
insert into settable (s) values ( 'cyan,yellow' ), ( 'yellow,magenta'), ( 'cyan,black,yellow'), ( 'magenta' );

-- repeat until desired # of rows reached:
insert into settable (s) select truncate(rand()*15+1,0) from settable;

-- creating normalized set, copy values from settable
create table normalset (id int not null default 0, val char(8), key(id) );
insert into normalset SELECT id, 'cyan' FROM settable WHERE s like '%cyan%';
insert into normalset SELECT id, 'yellow' FROM settable WHERE s like '%yellow%';
insert into normalset SELECT id, 'magenta' FROM settable WHERE s like '%magenta%';
insert into normalset SELECT id, 'black' FROM settable WHERE s like '%black%';

-- create bitmask, copy values from settable
CREATE TABLE bitmask ( id serial, val int unsigned NOT NULL default '0', UNIQUE KEY (id), KEY (val) );
insert into bitmask SELECT id, s+0 FROM settable;

Tuesday, December 29, 2009

How to Fix the Comma Separated List of Doom

This is a solution that I've seen a few people have trouble with now, so I want to make more people aware of it:

Say you have some items that can have any of a set of associated values. These could be a list flags, it could be keywords or tags. This is often seen as the anti-pattern known as "Comma Separated List of Doom" (CSLoD). Naive developers will often create something like the following:


create table cslod ( id serial, vals varchar(255) );
insert into cslod values (NULL, "cyan,yellow" );
select * from cslod;
+----+-------------+
| id | vals |
+----+-------------+
| 1 | cyan,yellow |
+----+-------------+


I know that most people reading this already understand this is bad but let me re-iterate: Please please please don't do this! It will be very very difficult to write meaningful queries later.

The natural solution is to create separate rows for each value in the list:


create table normalset (id int unsigned not null default 0, val varchar(12) not null default '', primary key (id, val) );
insert into normalset VALUES (1, "cyan"), (1, "yellow" );
select * from normalset;
+----+--------+
| id | val |
+----+--------+
| 1 | cyan |
| 1 | yellow |
+----+--------+


Ok, to be truly normalised, val should be an id with a foreign key to another unique list of values, but it's easier to illustrate with this not-quite-perfect structure.


Now, using the improved structure, how would we write a query that retrieves all ids that have *both* cyan and yellow as values?


insert into normalset VALUES (2, "cyan"), (2, "black" ), (3, "magenta"), (4,"cyan"), (4, "magenta"), (4, "yellow");

select * from normalset;
+----+---------+
| id | val |
+----+---------+
| 1 | cyan |
| 1 | yellow |
| 2 | cyan |
| 2 | black |
| 3 | magenta |
| 4 | cyan |
| 4 | magenta |
| 4 | yellow |
+----+---------+


We want to return 1 and 4:

SELECT id FROM normalset GROUP BY id HAVING SUM(val='cyan') AND SUM(val='yellow');
+----+
| id |
+----+
| 1 |
| 4 |
+----+


Simple! The trick is to use GROUP BY, and then you can write conditions based on aggregates and filter them in the HAVING clause. There are a few other ways to obtain the same result:


SELECT id, GROUP_CONCAT(val order by val) as theset, count(*) FROM normalset GROUP BY id HAVING theset like '%cyan%yellow%'

SELECT id, BIT_OR(CASE val WHEN 'cyan' THEN 1 WHEN 'yellow' THEN 2 ELSE 0 END) as bitvals, count(*) FROM normalset GROUP BY id HAVING bitvals=3


The first form can be easily used to find other conditions, such as all rows that have cyan as a value but NOT yellow:

SELECT id FROM normalset GROUP BY id HAVING SUM(val='cyan') AND NOT SUM(val='yellow');
+----+
| id |
+----+
| 2 |
+----+


I have also been comparing these methods to using the SET datatype, and plain ol' bitmasks. I'll add a performance comparison of various methods in another post.

Followers