
Testing Replication Over the Pond - Part 1: Non-Secure

Testing Non-Secure Replication

A series of experiments was conducted to determine whether MySQL replication would prove reliable with SSL enabled. Please note that all tests were conducted using the MyDNS schema, which includes the SOA and RR tables, on MySQL 5.1.

The first set of experiments focused on replication operations, not on a predetermined mix of Insert, Update, or Delete patterns, so Inserts were used, since they are the easiest to tag and verify. Again, the focus is on replication channel fault recovery.

Later iterations used simultaneous Inserts, Deletes, and Updates against both the SOA and RR tables, from production snapshots of the mydns database, with request sets greater than 100,000 rows each, over an SSL replication channel.

The first test was done by closing the Slave server's outbound SQL port to the Master with iptables. The Slave remained unaware that it was no longer connected to the Master until several minutes after the port was closed.
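
A minimal sketch of the kind of iptables rules used to break and later restore the Slave's outbound connection; the Master address and port here are assumptions (a stock installation on 3306):

# On the Slave: silently drop outbound packets to the Master's SQL port
iptables -A OUTPUT -p tcp -d 10.0.0.1 --dport 3306 -j DROP
# Later, remove the rule to restore connectivity
iptables -D OUTPUT -p tcp -d 10.0.0.1 --dport 3306 -j DROP

A DROP target discards packets silently rather than rejecting them, which is consistent with the Slave not noticing the break for several minutes.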

After re-enabling the SQL port, replication was still broken and the Slave did not automatically reconnect. Only by manually issuing a "stop slave, start slave" did replication begin working again. At that point, a MySQL event was created that executed "stop slave, start slave" every minute (just as a hack) to see if replication would continue to work while replicating 1000s of records as connections to the Master were opened and closed. With the periodic restart events in place, replication never failed.
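A sketch of such an event (the event name is hypothetical, and the event scheduler must be enabled, e.g. with SET GLOBAL event_scheduler = ON):

delimiter //
CREATE EVENT restart_slave_hack
  ON SCHEDULE EVERY 1 MINUTE
DO
BEGIN
  -- Bounce both replication threads once a minute. STOP/START SLAVE
  -- perform an implicit commit, which bars them from stored functions
  -- and triggers but not from events.
  STOP SLAVE;
  START SLAVE;
END//
delimiter ;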

Two config options were used to control the reconnection behavior automatically: slave-net-timeout and master-connect-retry. The “slave-net-timeout” variable, which controls how long the Slave waits for data from the Master before considering the connection broken, has a default value of 3600 seconds (1 hour); that explains the explicit need to set this value. The interval at which the Slave retries to re-establish the connection to the Master is controlled by “master-connect-retry,” which has a default value of 60 seconds. Both of these values were set to 10 seconds for the next test.
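In my.cnf terms, a minimal sketch of the Slave-side settings used for the next test:

[mysqld]
# Declare the connection broken after 10 seconds without data from the Master
slave-net-timeout    = 10
# Retry the connection to the Master every 10 seconds
master-connect-retry = 10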

- Inserted 1,000,000 records (at a rate of 55 per second) into the Master SOA table; crons were used to disconnect the Europe and Asia Slaves from the Phoenix Master four times per hour for 3 minute intervals (see the crontab sketch after this list).
100% recovery and replication

- Inserted 200,000 records using three 10 minute disconnection intervals per hour, causing the Slaves to fall behind by over 30,000 records during each 10 minute interval.
100% recovery and replication

- Disconnected both Slaves from the Master, inserted 100,000 records, and then reconnected the Slaves.
100% recovery and replication
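
A hypothetical crontab of the kind used for the first test above: block the Master link four times per hour and unblock it 3 minutes later (the Master address and port are assumptions):

# /etc/crontab on each Slave
# Block the Master's SQL port at :00, :15, :30, :45
0,15,30,45 * * * * root iptables -A OUTPUT -p tcp -d 10.0.0.1 --dport 3306 -j DROP
# Unblock 3 minutes later
3,18,33,48 * * * * root iptables -D OUTPUT -p tcp -d 10.0.0.1 --dport 3306 -j DROP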

The following shows how the replication performed at a given failure point:
I/O Thread Replication Recovery (Phoenix -> Europe)
I/O Transfer Rate (Non-SSL) = ~240 Records/Second
Master SQL Insert Rate = 55 Records/Second

-- Legend --
Slv-D = Slave Disconnect | Slv-C = Slave Connect | ◄ = Sync Point

Master: --- SOA Replication Table Row Count = 16565 Slv-D
Slave: --- SOA Replication Table Row Count = 16338
Slave: --- SOA Replication Table Row Count = 16534 Slv-D
Slave: --- SOA Replication Table Row Count = 16534
Slave: --- SOA Replication Table Row Count = 16534
Slave: --- SOA Replication Table Row Count = 16534
Slave: --- SOA Replication Table Row Count = 16534
Slave: --- SOA Replication Table Row Count = 16534
Slave: --- SOA Replication Table Row Count = 16534
Slave: --- SOA Replication Table Row Count = 16534
Slave: --- SOA Replication Table Row Count = 16534
Slave: --- SOA Replication Table Row Count = 16534
Slave: --- SOA Replication Table Row Count = 18087 Slv-C
Slave: --- SOA Replication Table Row Count = 20889
Slave: --- SOA Replication Table Row Count = 23713
Slave: --- SOA Replication Table Row Count = 26524
Slave: --- SOA Replication Table Row Count = 29320
Slave: --- SOA Replication Table Row Count = 32138
Slave: --- SOA Replication Table Row Count = 34977
Slave: --- SOA Replication Table Row Count = 37805
Slave: --- SOA Replication Table Row Count = 40630
Slave: --- SOA Replication Table Row Count = 43460
Slave: --- SOA Replication Table Row Count = 46273
Slave: --- SOA Replication Table Row Count = 49105
Slave: --- SOA Replication Table Row Count = 51970
Slave: --- SOA Replication Table Row Count = 54757
Slave: --- SOA Replication Table Row Count = 57599
Slave: --- SOA Replication Table Row Count = 60453
Slave: --- SOA Replication Table Row Count = 63229 ◄

Master: --- SOA Replication Table Row Count = 52670 Slv-C
Master: --- SOA Replication Table Row Count = 53298
Master: --- SOA Replication Table Row Count = 53926
Master: --- SOA Replication Table Row Count = 62040
Master: --- SOA Replication Table Row Count = 63291 ◄

Another experiment was run by disconnecting the Slaves for a little over 5 hours while 1,000,000 records were inserted into the Master SOA table. Once the Slaves were reconnected, the 1,000,000 records were transferred to both Slaves in approximately 1.25 hours.

A final experiment type was performed by monitoring and recording the Slave’s Slave_IO_Running and Slave_SQL_Running thread states at 10 second intervals as the Slave’s outbound SQL ports were blocked and unblocked in 30 minute windows. At approximately 4 hours into the experiment, the Asia Slave failed to recover after the SQL port was unblocked. After a “start slave io_thread” command was issued, the Asia Slave reconnected to the Master.
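A minimal sketch of such a monitoring loop (credentials are assumed to be supplied via ~/.my.cnf or similar):

# Record both replication thread states every 10 seconds
while true; do
    date
    mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_IO_Running|Slave_SQL_Running'
    sleep 10
done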

The experiment was repeated with the Asia Slave; this time it ran for 9 hours before the I/O thread failed to recover from the communication failure. The Europe Slave functioned properly for 13 hours before going into the same failure state as the Asia Slave.

The above experiments continued to use the timeout and retry values set earlier:
slave-net-timeout = 10
master-connect-retry = 10


Next, the timeout and retry values were set to:
slave-net-timeout = 60
master-connect-retry = 30
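
As a side note, these values can also be applied without a full server restart; a sketch, assuming MySQL 5.1 semantics:

-- slave_net_timeout is a dynamic global variable
SET GLOBAL slave_net_timeout = 60;
-- the retry interval can be changed through CHANGE MASTER TO,
-- which requires the slave threads to be stopped
STOP SLAVE;
CHANGE MASTER TO MASTER_CONNECT_RETRY = 30;
START SLAVE;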


After restarting the Slave with the new values set, the following experiments were run:
- The Slave_IO_Running and Slave_SQL_Running thread states were recorded at 10 second intervals as the Slave’s outbound SQL ports were blocked and unblocked in 2 minute windows. The run time was 8 hrs. and 56 mins. Both the Europe and Asia Slaves continued to recover after each of the 15 state switches per hour without failure.
- The original experiment using 30 minute blocking/unblocking windows was run for 9 hrs. and 38 mins. Both Slaves functioned without failure.
- The last experiment reverted back to the 2 minute blocking/unblocking windows and ran for 36 hrs. Again, both remote Slaves responded properly without failure.

It is safe to conclude that the “slave-net-timeout” and “master-connect-retry” configuration options are critical to the reliable behavior of MySQL replication over the pond.
