I had been chasing this problem for quite a while now and I'm really excited to have finally found the root cause. This is a rather long post as I think the narrative of the problem solving process here is just as important as the problem itself, so bear with me.
From those graphs the behavior seems pretty random. But keep in mind that Threads_running only shows the current state, if you're not looking at it at the exact right moment, then you miss it. It's not like counter status variables like questions, or com_select, which you can take deltas to find out how many events happened since the last time you looked.
Those graphs capture data every 30 seconds, but that's still not enough to find out every time this is happening, because the spike only lasts 2-5 seconds.
But using what data we had tried everything we could think of to find a correlation: was it log file purging in innodb? how about waiting read or write requests? Mutex contention? Binlog rotation? Memory allocation? Slow log collection? Stats gathering? Replication? Nothing seemed to match up.
This is actually one of the factors that drove me to write Rain Gauge. I started trying to capture more info on these events via pt-stalk, and thought about how I wanted to see when this was happening even if I wasn't paying attention -- also by monitoring all our other servers as well, I could look for correlation across hosts -- maybe implying some kind of network or application level issue.
The collections looked like this:
Are you seeing it yet? It took me a long time to see -- in fact it was my coworkers who began to say it before I even realized what was in front of my eyes: It seems like it's happening every five minutes.
innodb_buffer_pool_restore_at_startup=300
I set that to zero, and BINGO ... the spikes disappeared. What was going on?
This is from a feature added to Percona MySQL for Dumping and Restoring the Buffer Pool Contents. This is actually quite a clever thing -- by writing the memory addresses of the buffer pool to disk before shutdown, you can tell the server to load everything you had in the buffer pool at that point in time again at startup. This dramatically helps the "cold cache" problem often seen when restarting mysql with a large buffer pool. Instead of taking a huge performance hit until the server warms up its caches, you can efficiently and quickly repopulate it.
The option above lets you configure the server to automatically dump its buffer pool at a given interval, so that you always have a recent buffer pool dump. The value is the number of seconds between writing out the file. 300 seconds is 5 minutes.
But the collections aren't exactly 300 seconds apart, each one is a little bit later -- so you can even see how long it took each time to do this buffer pool dump -- about 4-5 seconds.
When this feature was first created, I don't think servers contained as much RAM as they do now -- and the way it works is it has to take a mutex lock on the buffer pool LRU mutex while it writes the file. Our servers have a very large buffer pool, so this process took quite a bit of time and the mutex locking was interfering with running transactions which is what causes Threads_running to back up.
The result of finding and fixing this issue was huge for us. This was greatly impacting our query times, pushing the average time up almost 10x. The same graph of threads_running shows the dramatic change:
You can clearly see the moment when this feature was disabled. There are still a few jumps, but they're now much more sporadic, and less severe. Best of all the correlate clearly to another event -- which will be the subject for another blog post.
The problem
The symptom was that we'd see frequent spikes in the number of Threads_running on our busy database servers. This was happing day and night for as long as we'd been collecting data, but worse during peak traffic periods. They would look like this:From those graphs the behavior seems pretty random. But keep in mind that Threads_running only shows the current state, if you're not looking at it at the exact right moment, then you miss it. It's not like counter status variables like questions, or com_select, which you can take deltas to find out how many events happened since the last time you looked.
Those graphs capture data every 30 seconds, but that's still not enough to find out every time this is happening, because the spike only lasts 2-5 seconds.
But using what data we had tried everything we could think of to find a correlation: was it log file purging in innodb? how about waiting read or write requests? Mutex contention? Binlog rotation? Memory allocation? Slow log collection? Stats gathering? Replication? Nothing seemed to match up.
This is actually one of the factors that drove me to write Rain Gauge. I started trying to capture more info on these events via pt-stalk, and thought about how I wanted to see when this was happening even if I wasn't paying attention -- also by monitoring all our other servers as well, I could look for correlation across hosts -- maybe implying some kind of network or application level issue.
The collections looked like this:
Are you seeing it yet? It took me a long time to see -- in fact it was my coworkers who began to say it before I even realized what was in front of my eyes: It seems like it's happening every five minutes.
That Eureka Moment!
Then while going through lists of innodb variables to find some connection (yet again) I happened to look at this one:innodb_buffer_pool_restore_at_startup=300
I set that to zero, and BINGO ... the spikes disappeared. What was going on?
This is from a feature added to Percona MySQL for Dumping and Restoring the Buffer Pool Contents. This is actually quite a clever thing -- by writing the memory addresses of the buffer pool to disk before shutdown, you can tell the server to load everything you had in the buffer pool at that point in time again at startup. This dramatically helps the "cold cache" problem often seen when restarting mysql with a large buffer pool. Instead of taking a huge performance hit until the server warms up its caches, you can efficiently and quickly repopulate it.
The option above lets you configure the server to automatically dump its buffer pool at a given interval, so that you always have a recent buffer pool dump. The value is the number of seconds between writing out the file. 300 seconds is 5 minutes.
But the collections aren't exactly 300 seconds apart, each one is a little bit later -- so you can even see how long it took each time to do this buffer pool dump -- about 4-5 seconds.
When this feature was first created, I don't think servers contained as much RAM as they do now -- and the way it works is it has to take a mutex lock on the buffer pool LRU mutex while it writes the file. Our servers have a very large buffer pool, so this process took quite a bit of time and the mutex locking was interfering with running transactions which is what causes Threads_running to back up.
The Solution
The solution is easy. We really don't need that feature anymore; or at the least we can force the buffer pool flush manually just before restart.The result of finding and fixing this issue was huge for us. This was greatly impacting our query times, pushing the average time up almost 10x. The same graph of threads_running shows the dramatic change:
You can clearly see the moment when this feature was disabled. There are still a few jumps, but they're now much more sporadic, and less severe. Best of all the correlate clearly to another event -- which will be the subject for another blog post.
How did we miss this for so long?
This is where I talk about monitoring and just how important it is. When we enabled this feature a year ago, we didn't have nearly the kind of monitoring in place that we do now. System load was lower, so the spikes were less severe and the only monitoring we had were the cacti graphs which didn't have a great resolution. The consequence is that the change in behavior on the databases wasn't noticed immediately when this change was made.
It wasn't until several months later when I started seeing this behavior and started digging into it. By that time the chance to recognize the correlation between the configuration change and the spikes in threads_running were lost.
This is why it's so important to have really comprehensive monitoring -- because you can never be sure what unintended side effects result from seemingly innocent changes.
If we had a system like Rain Gauge and Anemometer running at the time we made this change, then the difference in behavior would have been immediate and obvious. We could have saved ourselves a year of degraded performance, and the countless hours it took to track this down. But we can't change the past. Fixing this now is a big win and a good lesson learned.
.png)

.png)
4 comments:
Hello, Oracle's version of the buffer pool dump/load feature does not hold the BP mutex while writing to the file on disk and, what is more important - does not do any automatic dumps. One can configure to dump automatically at shutdown time (buffer_pool_dump_at_shutdown=ON) or to trigger an immediate dump at runtime by setting buffer_pool_dump_now=ON. Its up to you if you want to setup an automated periodic dumps (e.g. via cron job or something else), but its certainly not going to happen without your knowledge.
We consider holding the LRU list mutex for the duration of the LRU dump a bug: https://bugs.launchpad.net/percona-server/+bug/686534. We have already released the fix for Percona Server 5.1, and 5.5 will follow.
Thanks for an interesting analysis. As Laurynas mentioned earlier, the fix is in works for next milestone or so.
@Zhivka,
Manual dumps are possible in the Percona version as well -- http://www.percona.com/doc/percona-server/5.5/management/innodb_lru_dump_restore.html?id=percona-server:features:innodb_lru_dump_restore#manual-operation
So, this can be used in init scripts for instance to load / dump at startup / shutdown respectively.
We described a similar behavior in
http://www.palominodb.com/blog/2012/06/20/mystery-solved-replication-lag-innodb
Post a Comment