How many different ways can you get into trouble with replication? Most of them are obvious once they happen to you, but until they do you're often not watching out for them. This is a list of all the ones I can think of, and most of them have happened to me at one point.
Don't get me wrong, I think MySQL Replication is very robust and easy to use, but when you have lots of servers replicating, you're likely to hit those edge-cases more often; you will see problems. It's better to plan on making things resilient than expecting they will never break.
1. Relay Log corruption
This happens either because the master's binlog is corrupted, or because of network issues. Or aliens. The causes can seem pretty random, and it's usually just easier to fix it and move on. By issuing a CHANGE MASTER TO ... ; statement which includes MASTER_HOST and sets the binlog file and position to the current execution position shown in SHOW SLAVE STATUS, then you can force the slave to clear out its relay logs and download them from the master again.
2. Long slave connection timeout
In 5.1 it was the case that if the master disappeared unexpectedly due to network problems, then the slave could wait as long as an hour before reporting a problem. Checking the error log would show its attempts to reconnect. 5.5 looks like it has more sane behavior: option_mysqld_master-retry-count
3. Long running update statements causing slave to lag
This is one of the most frequent problems that everyone encounters with replication. Because replication is currently single threaded, a single slow update/insert/delete will cause replication to be delayed. This restriction changes how basically every application uses the database -- large updates are often broken down into many small chunks; operations that logically should be in a single transaction are often broken up as well, making consistency harder to achieve. If you're reading data from a slave, it may be stale so now you have to introduce additional checks. The rate of changes your database can handle depends on how fast your slaves can keep up. Everyone approches this in different ways.
4. Data drift
The idea behind having replicated slaves is that they are exact copies of your data -- you hope! In reality it's possible for the same query to produce different results on the master and slave. You either have to avoid non-deterministic queries, or use row-based replication (which has its own set of pros and cons) Checking for consistency is time-consuming and resource intensive. Not checking means you're playing with fire. Everyone should have a process for refreshing their data, or simply throwing away a slave and starting fresh.
5. Bad server_id values
You have to take care to make sure server_id is unique. If you have two servers with the same server_id value, then you'll mysteriously lose data on the slave. It's also possible to get replication events in a master/master replication set up that was not generated by either server. Since it's never discarded, then that event is played on each server in a loop forever.
6. Master not writing binlog
This one is obvious, if the master doesn't write events to the binlog, then you can't replicate anything. It should be hard to disable the binlog on a master that is currently replicating -- but it's not: SET GLOBAL SQL_LOG_BIN=0; Ooops. None of the servers will complain.
7. Replicate_ignore/do_db rules
This is perhaps one of the single largest points of confusion. I've seen countless cases where someone says "help my replication isn't working" and it's because they have these options set, but don't understand how they work. Numerous blog posts exist to explain the dangers.
8. RESET MASTER / RESET SLAVE
These two commands will break your replication set up unless you're very careful about how you use them. In particular the behavior of RESET SLAVE which is used to forget its replication position, but *not* the connection info can be weird. MySQL will let you try to start replication again, and most often you just end up with an error, because it's trying to replicate from a binlog position which doesn't exist on the master. RESET MASTER is perhaps something you should never do in production.
9. Manually deleting binlogs or relay logs
Both of these files are managed through mysql; there should never be a need to manipulate them at the filesystem level.
10. Purging master logs
You're going to need to clean out old binlogs, but be careful -- if you have a slave that was still trying to use an older binlog file then you'll break replication and have to rebuild that slave. This can easily happen if a slave is stopped for a while and you're not aware of it.
11. Bad CHANGE MASTER TO statement
This statement is very powerful -- as mentioned in #1 it can be used to fix relay log corruption, but be sure you use the correct binlog file and position, or else you risk skipping replication statements, or running them twice.
There we go! I don't expect that this list contains all possibilities -- what are some other ways replication has caused you pain?
Don't get me wrong, I think MySQL Replication is very robust and easy to use, but when you have lots of servers replicating, you're likely to hit those edge-cases more often; you will see problems. It's better to plan on making things resilient than expecting they will never break.
1. Relay Log corruption
This happens either because the master's binlog is corrupted, or because of network issues. Or aliens. The causes can seem pretty random, and it's usually just easier to fix it and move on. By issuing a CHANGE MASTER TO ... ; statement which includes MASTER_HOST and sets the binlog file and position to the current execution position shown in SHOW SLAVE STATUS, then you can force the slave to clear out its relay logs and download them from the master again.
2. Long slave connection timeout
In 5.1 it was the case that if the master disappeared unexpectedly due to network problems, then the slave could wait as long as an hour before reporting a problem. Checking the error log would show its attempts to reconnect. 5.5 looks like it has more sane behavior: option_mysqld_master-retry-count
3. Long running update statements causing slave to lag
This is one of the most frequent problems that everyone encounters with replication. Because replication is currently single threaded, a single slow update/insert/delete will cause replication to be delayed. This restriction changes how basically every application uses the database -- large updates are often broken down into many small chunks; operations that logically should be in a single transaction are often broken up as well, making consistency harder to achieve. If you're reading data from a slave, it may be stale so now you have to introduce additional checks. The rate of changes your database can handle depends on how fast your slaves can keep up. Everyone approches this in different ways.
4. Data drift
The idea behind having replicated slaves is that they are exact copies of your data -- you hope! In reality it's possible for the same query to produce different results on the master and slave. You either have to avoid non-deterministic queries, or use row-based replication (which has its own set of pros and cons) Checking for consistency is time-consuming and resource intensive. Not checking means you're playing with fire. Everyone should have a process for refreshing their data, or simply throwing away a slave and starting fresh.
5. Bad server_id values
You have to take care to make sure server_id is unique. If you have two servers with the same server_id value, then you'll mysteriously lose data on the slave. It's also possible to get replication events in a master/master replication set up that was not generated by either server. Since it's never discarded, then that event is played on each server in a loop forever.
6. Master not writing binlog
This one is obvious, if the master doesn't write events to the binlog, then you can't replicate anything. It should be hard to disable the binlog on a master that is currently replicating -- but it's not: SET GLOBAL SQL_LOG_BIN=0; Ooops. None of the servers will complain.
7. Replicate_ignore/do_db rules
This is perhaps one of the single largest points of confusion. I've seen countless cases where someone says "help my replication isn't working" and it's because they have these options set, but don't understand how they work. Numerous blog posts exist to explain the dangers.
8. RESET MASTER / RESET SLAVE
These two commands will break your replication set up unless you're very careful about how you use them. In particular the behavior of RESET SLAVE which is used to forget its replication position, but *not* the connection info can be weird. MySQL will let you try to start replication again, and most often you just end up with an error, because it's trying to replicate from a binlog position which doesn't exist on the master. RESET MASTER is perhaps something you should never do in production.
9. Manually deleting binlogs or relay logs
Both of these files are managed through mysql; there should never be a need to manipulate them at the filesystem level.
10. Purging master logs
You're going to need to clean out old binlogs, but be careful -- if you have a slave that was still trying to use an older binlog file then you'll break replication and have to rebuild that slave. This can easily happen if a slave is stopped for a while and you're not aware of it.
11. Bad CHANGE MASTER TO statement
This statement is very powerful -- as mentioned in #1 it can be used to fix relay log corruption, but be sure you use the correct binlog file and position, or else you risk skipping replication statements, or running them twice.
There we go! I don't expect that this list contains all possibilities -- what are some other ways replication has caused you pain?
4 comments:
Some other 'fun' ways to break replication:
- Remove binlogs on the master with rm in a cron job. (instead of expire-logs-days or PURGE)
- Break DNS and use hostnames for replication.
- Master binlog corruption (RESET MASTER can be useful!)
- Restore a slave or relay host from vm snapshot.
I suffered from #3 this week. The slave was over 22000 seconds behind and not catching up. The cause was a REPLACE FROM(SELECT). The table being replaced was a running average of the data in the SELECT. I healed this my turning the REPLACE table into a BLACKHOLE table. That allowed everything else to catch up. I could then switch it back and it rebuilt it's averages. This was much better then rebuilding the slave.
1) Primary / unique key conflict: If the slave is being used, it is possible that some rows might be inserted. Slave halted due to duplicate key error is one of the most common reasons. Skipping that statement or removing the offending row are the only 2 choices in such cases. Making the slave read only is another good option. log-slave-updates will help you find the statements run directly on the slave.
2) If max_allowed_packet size is set too low (default is low!) slave may have issues.
3) User privileges: Slave needs a special privilege to read master binary file. If you mess up with the master users, slave might complain.
4) In case of slave server crash, it may loose the track of the master. Open the error log to get the binary file name and it's position.
INSERT SELECT statements without ORDER BY.
This has caused us MANY duplicate key errors and gives nasty inconsistencies that are hard to check for. The issue is that SELECT without ORDER BY can return the rows in whatever order is quickest, which is often determined by what is in the cache on the local server! Even if the ORDER BY is there but isn't complete you wont always get the same set of data being inserted on the slave.
Post a Comment