Earlier today we experienced an unexpected outage as a result of a routine upgrade and database patch. We perform these patches around four times a week on average, and while the scope of this patch was greater than usual, we'd never before come across a situation where a patch had implications beyond the specific client account and table being patched.
The persistent relational storage we use at Accelo is MySQL, a robust and well-proven database engine from Oracle. To maximize reliability, security, and portability, we've architected AffinityLive so every client account (deployment) uses its own database instance. The effect of this in a shared multi-tenant environment is that while one client might be thrashing their database, the impact on other clients is negligible because table locks and write contention are quarantined to that client's own database.
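To give a rough picture of that isolation (the deployment and table names below are hypothetical, not our actual schema), it looks something like this:

```sql
-- A minimal sketch of per-deployment isolation; the deployment
-- and table names are hypothetical, not our actual schema.
-- Each client account gets its own database with its own copy of every table.
CREATE DATABASE deployment_acme;
CREATE DATABASE deployment_globex;

CREATE TABLE deployment_acme.tasks   (id INT PRIMARY KEY, status VARCHAR(32));
CREATE TABLE deployment_globex.tasks (id INT PRIMARY KEY, status VARCHAR(32));

-- A heavy write inside one deployment takes locks only on that
-- deployment's own table...
UPDATE deployment_acme.tasks SET status = 'archived' WHERE status = 'closed';

-- ...so queries against another deployment are unaffected, because
-- they touch an entirely separate database and table.
SELECT COUNT(*) FROM deployment_globex.tasks;
```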
However, in today's upgrade we ran into an unexpected situation we hadn't seen before, which caused a large number of our clients to experience periods where their Accelo instance was inaccessible even though their own database content wasn't being modified or updated at all.
The situation that occurred was the modification (a series of ALTER statements, in SQL terms) of a number of tables across each of our client deployments. These are processes we undertake frequently: as an agile and continuously improving software team, we push code at least daily, and this often involves improvements and enhancements to database structures. So, undertaking another database update/improvement - one we'd already tested and verified in our development process - wasn't something we expected to be problematic.
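For a concrete (and purely illustrative) picture of what such a patch involves, it's essentially the same schema change applied to each deployment's database in turn; the table and column names here are the hypothetical ones from the sketch above, not today's actual change:

```sql
-- Hypothetical example of a rolled-out schema change; the column and
-- table names are illustrative, not the actual patch we shipped today.
ALTER TABLE deployment_acme.tasks
  ADD COLUMN external_reference VARCHAR(255) NULL;

ALTER TABLE deployment_globex.tasks
  ADD COLUMN external_reference VARCHAR(255) NULL;

-- ...and so on, one deployment at a time, across every client database.
```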
For a sense of scope, we have many thousands of clients, each with hundreds of database tables, some of which contain close to a million rows. Today's update affected almost 10 million records, and while that might sound like a lot, it really isn't. This is why we didn't undertake this update in a period of scheduled maintenance or structured downtime like last Friday/Saturday - it shouldn't have resulted in downtime, just as the updates we run almost every day don't result in downtime. But today, something different happened.
As I write this, our engineering team is still managing the situation and working to understand the specifics of what happened. Our hunch so far - as to why this update caused performance impacts for all of our clients at the same time, even though each database was being updated one table at a time - is that this is actually a hangover from last week's database performance issues: the broad sweep across many millions of records caused "recovery" or index repair processes to kick in automatically, which dramatically increased load and thus impaired the performance of our MySQL databases.
The short-term resolution is to undertake structured and planned maintenance this weekend, systematically running check processes across all of the tables in all of the databases we run for clients.
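As a rough sketch of what those checks look like at the table level, this is the kind of work involved, using MySQL's built-in CHECK TABLE and ANALYZE TABLE statements (the table names are again the hypothetical ones from the examples above):

```sql
-- A sketch of per-table verification using MySQL's built-in statements;
-- the deployment and table names are hypothetical.
CHECK TABLE deployment_acme.tasks EXTENDED;
ANALYZE TABLE deployment_acme.tasks;

CHECK TABLE deployment_globex.tasks EXTENDED;
ANALYZE TABLE deployment_globex.tasks;

-- Repeated for every table in every client deployment, which is why this
-- needs a planned maintenance window rather than a normal deploy.
```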
The medium-term resolution is to move our underlying technology to a new and more scalable infrastructure provider. We already have a project underway to provide significant (and potentially unlimited) headroom, replication, and geographic scalability for Accelo over the coming months, and I'll have more to share about this in the weeks and months ahead.
If you've got any questions about this or other issues, please don't hesitate to email us at [email protected] and we'll get back to you as soon as we can.