We have adopted replication methods for our database hosting that add several layers of redundancy.
In the CatN replication stack used to host Clear Books, we run one Master server and two Slaves. The Master and one Slave are physical servers; the second Slave is an Amazon EC2 instance backed by EBS. Using EC2 provides added redundancy: if the CatN data centre were destroyed, or we suffered a catastrophic hardware failure, we could switch to a Slave that lives entirely in the cloud and is unaffected by any physical hardware problems.
MySQL replication ensures that the same data is present on every server in the replication stack, so if the Master server fails we manually switch all web requests to one of the Slave servers until the Master can be repaired.
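The manual switch-over described above boils down to picking the next healthy node to serve requests. The sketch below is purely illustrative, not our actual tooling; the node names, priority order, and health map are assumptions for the example.

```python
# Hypothetical sketch of the manual failover decision: given the health of
# each node in the stack, pick the server that should receive web requests.
# Node names and priority order are illustrative assumptions.

PRIORITY = ["master", "slave1", "slave2-ec2"]  # prefer physical hosts first

def failover_target(healthy):
    """Return the first healthy node in priority order, or None if all are down."""
    for node in PRIORITY:
        if healthy.get(node):
            return node
    return None

# Normal operation: the Master serves all requests.
print(failover_target({"master": True, "slave1": True, "slave2-ec2": True}))   # master
# Master down: requests are switched to the first available Slave.
print(failover_target({"master": False, "slave1": True, "slave2-ec2": True}))  # slave1
```

In practice the switch also involves repointing the application at the promoted server, which is the step we are working with Percona to automate.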
Working with MySQL replication experts at Percona, we plan, and are working towards, making this failover automatic and instant. This will allow us to switch easily to another host if the Master fails, with no loss of service or data.
The EC2 Slave server, which runs in Ireland, is dedicated to feeding the replicated data to our backup script. Backing up from a Slave prevents service disruption while the data is being backed up, as all live requests are served by the Master. The EC2 instance also pushes encrypted SQL dump backups to S3 storage in Amazon’s North America data centre. Amazon claim that their S3 service provides 99.999999999% durability for any object over a given year, where each object is an individual customer database backup. We store up to 10 backups for each customer: one for each of the last 7 days, and one from the 1st of the month for each of the last 3 months.
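The retention policy above can be sketched as a small function. This is an illustrative model of the policy, not our actual backup script; the function name and date handling are assumptions for the example.

```python
# Illustrative sketch of the retention policy: keep one backup per day for
# the last 7 days, plus the 1st-of-the-month backup for the last 3 months.
# This models the policy only; it is not our real backup script.

import datetime as dt

def backups_to_keep(today):
    """Return the set of backup dates retained as of `today`."""
    keep = {today - dt.timedelta(days=n) for n in range(7)}  # last 7 daily backups
    year, month = today.year, today.month
    for _ in range(3):  # 1st of the month for the last 3 months
        keep.add(dt.date(year, month, 1))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return keep
```

Note that when a monthly backup falls inside the 7-day daily window the two coincide, which is why the count is "up to" 10 rather than always exactly 10.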
The physical servers in the replication stack each contain 4 hard disks in a RAID10 configuration, and the EC2 Slave uses EBS logical volumes. RAID10 stripes data across disks, allowing data to be accessed concurrently. The striped data is then mirrored to a second set of disks for redundancy, allowing a disk in each mirror to fail without any data loss.
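To make the redundancy arithmetic concrete, here is a minimal sketch for a 4-disk RAID10 array. The per-disk capacity is an assumed example figure, as the post does not state disk sizes.

```python
# RAID10 arithmetic for a 4-disk array. The per-disk size is a
# hypothetical example figure, not our actual hardware spec.

disks = 4
disk_size_gb = 500              # assumed per-disk capacity for illustration
mirrors = disks // 2            # 2 mirrored pairs, striped together

usable_gb = mirrors * disk_size_gb   # half the raw capacity is usable
copies_per_server = 2                # each block lives on both disks of its mirror
servers = 3                          # Master + 2 Slaves in the replication stack
copies_in_stack = copies_per_server * servers

print(usable_gb)        # 1000
print(copies_in_stack)  # 6
```

The trade-off of RAID10 is that only half the raw disk capacity is usable, in exchange for surviving a disk failure in any mirror without downtime.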
Combining RAID10 with the Master and Slave replication stack ensures that any critical data is present on six disks, two in each server. Data is easily recovered from any failure across the stack, and in the case of a large hardware failure we can simply switch the load to a remaining node. Eventually we will have automatic failover, ensuring there is no disruption in service while the failure is repaired. The whole stack is monitored with the same Zabbix monitoring tool we use across our network.