I have a number mysql servers running version 5.1.63 and whilst running some queries against the slave earlier this week, I noticed some data on the slave that should have been removed using an update statement on the master.
My initial thoughts were:
- someone on the team was updating the slave, which I have since disproved
- that the column being updated had changed
So, I investigated by running a mysql show status "table" query. This was run against a test database on each of the servers to see what the data length was, in a lot of cases it was showing me the data length differed between servers, but on an eyeball look at the data I could see the data was the same, so I couldn't use this method to see if there were any differences as it appears to be prone to error.
Next I ran a simple (across all dbs) row count for each table to confirm the row count was the same - it was.
I then started looking in the bin logs for replication. I could see the update statements that should have run clearly visible in the logs, but the update never ran.
What I need to know is:
- is replication broken? I'm assuming it is
- if I create new slave servers, will I encounter the same issue?
- how do I find out the extent of the issue on my servers?
Any help is appreciated.
If you are using statement based replication then it is easily possible to end up with different results on master and slave due to badly constructed INSERT statements.
INSERT SELECT without ORDER BY, or where the ORDER BY can leave non deterministic results will cause the slaves to diverge from master.
From the MySQL site http://dev.mysql.com/doc/refman/5.1/en/insert-select.html
The order in which rows are returned by a SELECT statement with no ORDER BY clause is not determined. This means that, when using replication, there is no guarantee that such a SELECT returns rows in the same order on the master and the slave; this can lead to inconsistencies between them. To prevent this from occurring, you should always write INSERT ... SELECT statements that are to be replicated as INSERT ... SELECT ... ORDER BY column. The choice of column does not matter as long as the same order for returning the rows is enforced on both the master and the slave. See also Section 16.4.1.15, “Replication and LIMIT”.
If this has happened then your replicas have diverged and the only safe way to bring them back in line is to rebuild them from a recent backup of the master DB. The worst part of this is the error may never cause the replication to fail, yet the results are inconsistent. Normally replication fails when an UPDATE or DELETE statement affects a different number of rows than on master, this is confusing as it was not the UPDATE that actually caused the error and the only way I know to fix the issue is to inspect every INSERT query in the code base!