A common task in sharded database environments is to copy some chunk of production data from one physical server to another in a way which is transparent for users who may be actively making changes. One pattern is to use replication to keep the target data up to date with any changes until the moment at which you change your application configuration and your users start writing their data to the new shard's location. However, to use replication, you need to copy data from one system to another in a manner which is consistent to a specific point in time. This lets you record the binary log position from which you begin replicating, ensuring that no data has been either lost or duplicated.
There are many ways to do this, and many of the existing tools rely on copying the entire source database. Here's a few choices:
1. mysqldump --single-transaction can make a dump without locking your database; the downside is that it's single threaded and can take a very long time on large datasets.
2. xtrabackup -- this can be a very fast way of making a point in time backup, but again, if your dataset is busy then the apply-logs phase needed to prepare the backup for use can take a while. It also copies either the entire database, or at least entire tables.
3. mydumper -- a fast multi-threaded tool for dumping data, but it only dumps whole tables.
Using the above tools, you might make a point in time backup of your entire database, and build a new slave in the usual fashion. Then you can throw away unneeded records on the slave before pushing it into production. This is generally a very efficient way to do this, and tools like Tumblr's Jetpants use this technique.
But what if you really need to export the data because you're merging it into an existing database? Or the set of data you are moving is small percentage of the original data set. In those cases having a fast method to consistently export only some records is very helpful.
Mysqldump does this by using innodb's MVCC to establish a snapshot. But first it has to take a brief lock on the database.
This sets up the read snapshot, and records the binary log position while the master is locked. The snapshot is valid only for the current connection and only while it's not interrupted. While the transaction is open MySQL will keep all the old records around which have changed, and make sure we read from the old versions. This is pretty cool, but if you have a lot of data to export then this is going to be very slow to only have one connection reading data --especially if it's not sequential records (meaning you have to do a lot of random reads.)
To parallelize this task, we need to get several threads all synchronized to read from the same point in time. The trick to doing this is the moment in which the database is locked. At that moment no more changes are being made, so many threads can all connect, establish a snapshot and be guaranteed to read from the same point in time. This can be easily done but there aren't many cases when you'll actually want to do this by running these command in a mysql CLI session. It's much more typical to write this process in a script. You'll need a master process, and your "worker" threads. Here's the algorithm:
Master Process:
That's it! Now you have two connections that can read from the same point in time. This can speed up large read operations a lot.
There are many ways to do this, and many of the existing tools rely on copying the entire source database. Here's a few choices:
1. mysqldump --single-transaction can make a dump without locking your database; the downside is that it's single threaded and can take a very long time on large datasets.
2. xtrabackup -- this can be a very fast way of making a point in time backup, but again, if your dataset is busy then the apply-logs phase needed to prepare the backup for use can take a while. It also copies either the entire database, or at least entire tables.
3. mydumper -- a fast multi-threaded tool for dumping data, but it only dumps whole tables.
Using the above tools, you might make a point in time backup of your entire database, and build a new slave in the usual fashion. Then you can throw away unneeded records on the slave before pushing it into production. This is generally a very efficient way to do this, and tools like Tumblr's Jetpants use this technique.
But what if you really need to export the data because you're merging it into an existing database? Or the set of data you are moving is small percentage of the original data set. In those cases having a fast method to consistently export only some records is very helpful.
Mysqldump does this by using innodb's MVCC to establish a snapshot. But first it has to take a brief lock on the database.
FLUSH /*!40101 LOCAL */ TABLES; FLUSH TABLES WITH READ LOCK; SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ; START TRANSACTION /*!40100 WITH CONSISTENT SNAPSHOT */; SHOW MASTER STATUS; UNLOCK TABLES;
This sets up the read snapshot, and records the binary log position while the master is locked. The snapshot is valid only for the current connection and only while it's not interrupted. While the transaction is open MySQL will keep all the old records around which have changed, and make sure we read from the old versions. This is pretty cool, but if you have a lot of data to export then this is going to be very slow to only have one connection reading data --especially if it's not sequential records (meaning you have to do a lot of random reads.)
To parallelize this task, we need to get several threads all synchronized to read from the same point in time. The trick to doing this is the moment in which the database is locked. At that moment no more changes are being made, so many threads can all connect, establish a snapshot and be guaranteed to read from the same point in time. This can be easily done but there aren't many cases when you'll actually want to do this by running these command in a mysql CLI session. It's much more typical to write this process in a script. You'll need a master process, and your "worker" threads. Here's the algorithm:
Master Process:
FLUSH /*!40101 LOCAL */ TABLES;
FLUSH TABLES WITH READ LOCK;
SHOW MASTER STATUS;
Thread 1:
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ; START TRANSACTION /*!40100 WITH CONSISTENT SNAPSHOT */;
Thread 2:
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ; START TRANSACTION /*!40100 WITH CONSISTENT SNAPSHOT */;Master Process:
UNLOCK TABLES;
That's it! Now you have two connections that can read from the same point in time. This can speed up large read operations a lot.
4 comments:
That's a good trick. Are there any potential race conditions due to the fact that you are assuming that all work is blocked by the table locks?
Hi,
The only race is that the session holding the lock will release the lock if that session disconnects prematurely. This causes a race with the threads waiting to create a read view, since not all will create a view at the same point if the lock is released prematurely.
FTWRL will acquire a read lock on all tables. All outstanding reads and writes must complete before the FTWRL is obtained.
Since all writes are blocked all new read snapshots (consistent views) will be created with an identical view of the data.
Obviously you don't want to do this on a regular basis, but if you want a multiread consistent snapshot this is currently the only solution.
I've looked at InnoDB source a bit, and there is a function for cloning an existing read snaphost. I'd like to see a UDF (or SQL syntax) for starting a repeatable read-only transaction from a specific InnoDB transaction id, or the ability to clone the read view from another session using this ability.
I think Robert you're asking if innodb's background flushing is going to have an affect on this process, and as Justin said that it's not a problem here.
We're stopping all new changes with the FTWRL, which is what the read view is synchronizing on. The background IO is only affecting what is the oldest change available.
Justin: that's a really awesome discovery! I wonder why that functionality isn't exposed yet. Sounds like it could be used internally for parallel alter table and things like that.
Post a Comment