Information is Contagious

MySQL and more

Monday, February 11, 2013

Efficient Partial Table Scans

There's a pretty common MySQL recipe for performance that if you want to efficiently scan through lots of rows in small chunks that LIMIT with OFFSET is right out.  Using OFFSET, MySQL will have to scan all the rows until it finds the starting position before it starts reading results to return.   Just to be clear, these statements look like:

SELECT id FROM foo ORDER BY id LIMIT 10 OFFSET 1000;


If you were trying to read all rows in table then this would  be a very slow and expensive way to do that (in terms of MySQL resources.) The most common optimization is to switch to an algorithm where you remember the last highest id value for each chunk of rows, and then add that to the WHERE clause.

SELECT id FROM foo WHERE id > 100000  ORDER BY id LIMIT 10;


 This would be a much more efficient way select all values from a table in small chunks because it will use the PRIMARY KEY and be able to quickly find the first row to read with no extra reads. But what if we add a twist: you only want to scan through some part of your table in this way, something like:

SELECT id FROM foo WHERE b=1 AND id > 100000  ORDER BY id LIMIT 10;


In this case, we want to select all rows where b=1.  Now if we try to use the primary key we have to skip over rows that don't match the b=1 condition.  If those rows are not close together in the table, then this can be a very inefficient method.  You may have to read millions of rows just to find even the next 1 row to return.

The natural conclusion is that you would apply an index to column b.  Since indexes in innodb are clustered, MySQL can use the PK value hidden at the end to order the rows and scan them in order.  Problem solved, right?

Not so fast.

There's a hidden gotcha here, so lets take a look at the MySQL status counters to reveal what's going on internally.


mysql > explain SELECT id FROM foo WHERE b=1 AND id > 100000 ORDER BY ID LIMIT 1;
+----+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref   | rows | Extra                    |
+----+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | foo   | ref  | PRIMARY,b     | b    | 2       | const | 4729 | Using where; Using index |
+----+-------------+-------+------+---------------+------+---------+-------+------+--------------------------+

Do you see the problem yet?  It's subtle.  Let's look at the handler status variables after we run the query.

mysql> SELECT id FROM foo WHERE b=1 AND id > 100000 ORDER BY ID LIMIT 1;
+--------+
| id     |
+--------+
| 100136 |
+--------+
1 row in set (0.01 sec)

mysql> show status like 'Handler_read_next';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Handler_read_next | 911   |
+-------------------+-------+

The value or the Handler_read_next show it had to read 911 rows before it found the 1 row requested and could stop executing and return the result.  That's not good.  In fact, it's going to get worse as you progress through the table and the queries will become progressively slower.

What's going on is that MySQL apparently cannot use the hidden value in the clustered index to resolve the "id > 1000" part of the WHERE clause.  This is quite disappointing.  It seems like it should be filed as a bug, but so far I couldn't find any existing bug that describes this behavior.  If anyone knows of one please let me know.  Otherwise I'll probably file something.

The workaround for now is to add an index like the following:

ALTER TABLE foo ADD INDEX bplus (b,id);


I can feel all of you out there cringing right now.  Seems gross doesn't it?  Now we have the primary key value on the end of each index record *twice*!  However, it's hard to argue with results:

mysql> explain SELECT id FROM foo WHERE b=1 AND id > 100000 ORDER BY ID LIMIT 1;
+----+-------------+-------+-------+-----------------+-------+---------+------+------+--------------------------+
| id | select_type | table | type  | possible_keys   | key   | key_len | ref  | rows | Extra                    |
+----+-------------+-------+-------+-----------------+-------+---------+------+------+--------------------------+
|  1 | SIMPLE      | foo   | range | PRIMARY,b,bplus | bplus | 6       | NULL | 3818 | Using where; Using index |
+----+-------------+-------+-------+-----------------+-------+---------+------+------+--------------------------+
1 row in set (0.00 sec)

mysql> SELECT id FROM foo WHERE b=1 AND id > 100000 ORDER BY ID LIMIT 1;
+--------+
| id     |
+--------+
| 100136 |
+--------+
1 row in set (0.00 sec)

mysql> show status like 'Handler_read_next';
+-------------------+-------+
| Variable_name     | Value |
+-------------------+-------+
| Handler_read_next | 0     |
+-------------------+-------+
1 row in set (0.00 sec)

mysql> show status like 'Handler_read%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Handler_read_first    | 0     |
| Handler_read_key      | 1     |
| Handler_read_last     | 0     |
| Handler_read_next     | 0     |
| Handler_read_prev     | 0     |
| Handler_read_rnd      | 0     |
| Handler_read_rnd_next | 0     |
+-----------------------+-------+
7 rows in set (0.01 sec)

This result shows that key_len is now 6 instead of 2 -- it's using both key parts, the Handler_read_next is zero, which shows it's scanning no extra rows. There is just one key read for this query.

I've tested this result in both 5.5 and 5.6;  I was hoping that this would be among the many great query optimizer enhancements in the new version.  Sadly, that's not the case.  If anyone wants to reproduce this, here's the table structure and data:

CREATE TABLE `foo` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `b` tinyint(4) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `b` (`b`)
) ENGINE=InnoDB 

insert into foo values (NULL, RAND()*100);
insert into foo select NULL, RAND()*100 FROM foo; -- repeat until approx 200k rows are inserted.

Edit: Several readers have pointed out that this has been fixed in current versions of 5.6 -- I was testing on one that was released several months ago (5.6.5)  Things can change fast!  MariaDB also supports this properly as well.  Please read the comments for the full picture.

Thursday, February 7, 2013

Serious XFS Performance Regression in Linux Kernel 2.6.32-279

I'm not the only one to have noticed this, but I spent a sufficient amount of time banging my head against a wall finding this out that I thought it important to make more people aware of this.

While trying to validate new database hardware we were seeing some serious performance issues in production.  Most MySQL benchmarks using sysbench or pt-playback couldn't reproduce it, but a simple sysbench 16 threaded filio test on the mysql partition showed about 1/3 the throughput we would expect.   The fact that much of the hardware was new as well as the OS we were using made tracking down the cause difficult (changing from CentOS 5.5 to Scientific Linux 6.)

Finally some of our ops people working on different systems started noticing similar issues, and they uncovered the XFS issue.  Sure enough -- when took existing hardware, upgraded to SL6 and ran the same sysbench filio test we immediately saw a severe throughput drop.

There is a fix in the beta version of the next RedHat release that includes kernel 2.6.32-343.  Installing this kernel version on currently affected machines shows almost a complete recovery.  Unfortunately that's not a fix we're willing to use in production.

Alternative fixes are to either use ext4, or try newer versions of CentOS 5 -- neither of which are appealing to us, so we're currently just going to wait for the new kernel to be released.  Percona has blogged about this issue advising that ext4 may be the way to go, and that has sparked quite a controversy.  Read the comments there for more information on the regression.

The Curious Case of the Missing Binlogs

When you enable binlogs in the my.cnf file you can either set the log-bin flag to true, or you can set it to a path and file name prefix such as this:

[mysqld]
log-bin=/path/to/binlogs/mysql-binlog
This changes the default location where binlogs are stored.  The problem is that when you connect to mysql there is currently no way to query the server to find out if that path has been changed, and what it currently is.  This means you can't be sure where any server's binlogs are actually stored.

Ok, so they're not really missing, but it's a known issue that mysql doesn't make them easy to find.  The server obviously knows the path internally, but it doesn't make this information available. Bug #35231 has been open on this issue since 2008 and is currently being ignored.

Why?

This is such a trivial change to expose that info somewhere.  Why neglect such basic information?

The relay_log variable which behaves the same in the conf file, shows the behavior you would expect in the global variables on the server -- the same as the value you set in the conf.

If MySQL is worried about backwards compatibility then please just add a new variable name.

Saturday, November 17, 2012

Tiny PHP MVC

I'm not the first, but this is what I've been using to kickstart php based projects.

1. Save this code as index.php
2. Create a views directory
3. Call the script with index.php?action=method
4. Create new actions by adding public methods to the Controller class

<?php
error_reporting(E_ALL);

// default action; use example.com/script.php?action=blah to control which function to call
$action = 'index';
if (isset($_GET['action']))
{
 $action = $_GET['action'];
}

// create the controller and call the requested action if possible
$controller = new Controller();
if (is_callable(array($controller, $action )))
{
 $controller->$action();
}
else
{
 print "Invalid action ($action)";
}
exit; // and we're done

// helper function
function site_url()
{
 return $_SERVER['SCRIPT_NAME'];
}

/**
 * basic controller class, any public method will be accessible via script.php?action=method
 */
class Controller
{
 private $model;
 
 function __construct()
 {
  // create an instance of any models used
  $this->model = new Model();
 }
 
 public function index()
 {
  include "views/header.php";
  
  // get data from our model
  $message = $this->model->get_message();
  
  // the template will have to know we create the $message variable here
  require "views/index.php";
 }
}


/**
 * simple model class, handles getting data for the controller
 * This could use database connections, or other methods
 */
class Model
{
 public function get_message()
 {
  return "Hello World";
 }
}
?>

Sunday, November 11, 2012

Multi-threaded Consistent Reads

A common task in sharded database environments is to copy some chunk of production data from one physical server to another in a way which is transparent for users who may be actively making changes.  One pattern is to use replication to keep the target data up to date with any changes until the moment at which you change your application configuration and your users start writing their data to the new shard's location.  However, to use replication, you need to copy data from one system to another in a manner which is consistent to a specific point in time.  This lets you record the binary log position from which you begin replicating, ensuring that no data has been either lost or duplicated.

There are many ways to do this, and many of the existing tools rely on copying the entire source database.   Here's a few choices:

1. mysqldump --single-transaction can make a dump without locking your database; the downside is that it's single threaded and can take a very long time on large datasets.
2. xtrabackup -- this can be a very fast way of making a point in time backup, but again, if your dataset is busy then the apply-logs phase needed to prepare the backup for use can take a while.  It also copies either the entire database, or at least entire tables.
3. mydumper -- a fast multi-threaded tool for dumping data, but it only dumps whole tables.

Using the above tools, you might make a point in time backup of your entire database, and build a new slave in the usual fashion.  Then you can throw away unneeded records on the slave before pushing it into production.  This is generally a very efficient way to do this, and tools like Tumblr's Jetpants use this technique.

But what if you really need to export the data because you're merging it into an existing database?  Or the set of data you are moving is small percentage of the original data set.  In those cases having a fast method to consistently export only some records is very helpful.

Mysqldump does this by using innodb's MVCC to establish a snapshot.  But first it has to take a brief lock on the database.


FLUSH /*!40101 LOCAL */ TABLES;
FLUSH TABLES WITH READ LOCK;
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION /*!40100 WITH CONSISTENT SNAPSHOT */;
SHOW MASTER STATUS;
UNLOCK TABLES;

This sets up the read snapshot, and records the binary log position while the master is locked.  The snapshot is valid only for the current connection and only while it's not interrupted.  While the transaction is open MySQL will keep all the old records around which have changed, and make sure we read from the old versions.  This is pretty cool, but if you have a lot of data to export then this is going to be very slow to only have one connection reading data --especially if it's not sequential records (meaning you have to do a lot of random reads.)

To parallelize this task, we need to get several threads all synchronized to read from the same point in time.  The trick to doing this is the moment in which the database is locked.  At that moment no more changes are being made, so many threads can all connect, establish a snapshot and be guaranteed to read from the same point in time.  This can be easily done but there aren't many cases when you'll actually want to do this by running these command in a mysql CLI session.  It's much more typical to write this process in a script. You'll need a master process, and your "worker" threads.  Here's the algorithm:

Master Process:
FLUSH /*!40101 LOCAL */ TABLES;
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;
Thread 1:
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION /*!40100 WITH CONSISTENT SNAPSHOT */;
Thread 2:
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION /*!40100 WITH CONSISTENT SNAPSHOT */;
Master Process:
UNLOCK TABLES;

That's it!  Now you have two connections that can read from the same point in time.  This can speed up large read operations a lot.

Wednesday, November 7, 2012

Virtual DBA team at #dbhangops

Have you ever wondered how MySQL DBAs at other companies manage their day to day operations? Then join us at the #dbhangops google hangout.  My co-worker is trying to get a mix of DBAs running mid to large scale deployments to provide an open forum for people to talk about their experience and learn from others.   We've had several successful hangouts so far, and we're interested in opening it up to a wider audience.

Our next meetup is TBD, but we're planning on the following agenda:

Query killing: protecting your DBs
Alerting and Monitoring: what do you monitor and why?
The MySQL Utilities Python Library: use cases, building tools and automation
My.cnf config tuning: what parameters are important for you and why?

For more info see Geoff Anderson's blog.

For news about upcoming events, check out this Twitter stream.




Thursday, October 25, 2012

How can MySQL Replication Break?

How many different ways can you get into trouble with replication?  Most of them are obvious once they happen to you, but until they do you're often not watching out for them.  This is a list of all the ones I can think of, and most of them have happened to me at one point.

Don't get me wrong, I think MySQL Replication is very robust and easy to use, but when you have lots of servers replicating, you're likely to hit those edge-cases more often; you will see problems.  It's better to plan on making things resilient than expecting they will never break.

1. Relay Log corruption

This happens either because the master's binlog is corrupted, or because of network issues. Or aliens.  The causes can seem pretty random, and it's usually just easier to fix it and move on.  By issuing a CHANGE MASTER TO ... ; statement which includes MASTER_HOST and sets the binlog file and position to the current execution position shown in SHOW SLAVE STATUS, then you can force the slave to clear out its relay logs and download them from the master again.

2. Long slave connection timeout 

In 5.1 it was the case that if the master disappeared unexpectedly due to network problems, then the slave could wait as long as an hour before reporting a problem.  Checking the error log would show its attempts to reconnect.  5.5 looks like it has more sane behavior: option_mysqld_master-retry-count

3. Long running update statements causing slave to lag

This is one of the most frequent problems that everyone encounters with replication.  Because replication is currently single threaded, a single slow update/insert/delete will cause replication to be delayed.  This restriction changes how basically every application uses the database -- large updates are often broken down into many small chunks;  operations that logically should be in a single transaction are often broken up as well, making consistency harder to achieve.  If you're reading data from a slave, it may be stale so now you have to introduce additional checks.  The rate of changes your database can handle depends on how fast your slaves can keep up.  Everyone approches this in different ways.

4. Data drift

The idea behind having replicated slaves is that they are exact copies of your data -- you hope!  In reality it's possible for the same query to produce different results on the master and slave.  You either have to avoid non-deterministic queries, or use row-based replication (which has its own set of pros and cons)  Checking for consistency is time-consuming and resource intensive.  Not checking means you're playing with fire.  Everyone should have a process for refreshing their data, or simply throwing away a slave and starting fresh.

5. Bad server_id values  

You have to take care to make sure server_id is unique. If you have two servers with the same server_id value, then you'll mysteriously lose data on the slave.  It's also possible to get replication events in a master/master replication set up that was not generated by either server.  Since it's never discarded, then that event is played on each server in a loop forever.

6. Master not writing binlog

This one is obvious, if the master doesn't write events to the binlog, then you can't replicate anything.  It should be hard to disable the binlog on a master that is currently replicating -- but it's not: SET GLOBAL SQL_LOG_BIN=0;  Ooops.  None of the servers will complain.

7. Replicate_ignore/do_db rules

This is perhaps one of the single largest points of confusion.  I've seen countless cases where someone says "help my replication isn't working" and it's because they have these options set, but don't understand how they work. Numerous blog posts exist to explain the dangers.

8.  RESET MASTER / RESET SLAVE

These two commands will break your replication set up unless you're very careful about how you use them.  In particular the behavior of RESET SLAVE which is used to forget its replication position, but *not* the connection info can be weird.  MySQL will let you try to start replication again, and most often you just end up with an error, because it's trying to replicate from a binlog position which doesn't exist on the master.  RESET MASTER is perhaps something you should never do in production.

9. Manually deleting binlogs or relay logs

Both of these files are managed through mysql; there should never be a need to manipulate them at the filesystem level.

10.  Purging master logs

You're going to need to clean out old binlogs, but be careful -- if you have a slave that was still trying to use an older binlog file then you'll break replication and have to rebuild that slave.  This can easily happen if a slave is stopped for a while and you're not aware of it.

11.  Bad CHANGE MASTER TO statement

This statement is very powerful -- as mentioned in #1 it can be used to fix relay log corruption, but be sure you use the correct binlog file and position, or else you risk skipping replication statements, or running them twice.


There we go!  I don't expect that this list contains all possibilities -- what are some other ways replication has caused you pain?

Followers