Jun 19, 2017
Amazon had a load balancer failure in 2012. The analysis of the event showed that missing data in the devices caused the issues. Restoring data to these devices is complex, far more complex and with less mature tools than on most database platforms. The result was a period of nearly 10 hours during which some customers experienced problems.
In 2016, Gliffy had three days of downtime from a database error. In this case, an admin updating a replicated system failed to sever the link with the primary node. Forgetting this step caused data to be removed on that node, and the deletion replicated to the secondary nodes. They then discovered that restoring from backup and replaying the logs would take many days because of the data's size. They hadn't practiced a disaster recovery scenario in some time and were not prepared for the delays.
Read the rest of "DevOps Can Help"