About 10 hours ago, we suffered a 20-minute unscheduled outage (between 23:08 and 23:28 UTC). On this occasion, there wasn't an issue with infrastructure (in the server sense); instead, a code push encountered a problem. Our systems administration team was able to quickly identify, revert and patch the problem, bringing all systems back online.
With the problem fixed, we then turned our focus to identifying how the problem occurred in the first place, and found a very unexpected cause - unexpected because of the focus, time and financial resources we invest in our build and testing systems at Accelo. Since we've never really shared this information before, there's no better time to explain those systems than now, as the team is revamping them.
At Accelo, we have dozens of developers working hard every day to improve and enhance our product so that our clients can run their service operations more easily and prosperously, and get back to doing the work they love. Our focus on testing begins before code is actually written, in two ways:
First, before our developers begin writing new code to implement a new feature or fix a bug, they write tests which define what a successful result from their code will look like. Once the code is complete, those tests are run to confirm that it performs the function it was originally meant to.
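To make this concrete, here's a simplified sketch (in Python, using a hypothetical billing function rather than our actual code) of a test written before the feature it describes exists:

```python
# A minimal test-first sketch (hypothetical example, not Accelo's real code).
# The test is written before calculate_invoice_total exists; it defines the
# behavior the new code must satisfy and fails until the implementation is done.
import pytest

from billing import calculate_invoice_total  # hypothetical module under test


def test_total_includes_tax():
    # 100.00 of work at a 10% tax rate should come to 110.00
    assert calculate_invoice_total(line_items=[100.00], tax_rate=0.10) == pytest.approx(110.00)


def test_empty_invoice_is_zero():
    # An invoice with no line items should total zero, not raise an error
    assert calculate_invoice_total(line_items=[], tax_rate=0.10) == 0.0
```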
Second, our senior development team has compiled a detailed list of rules, called "linter" rules. These rules act as best-practice guides, automatically flagging potential issues and improvements for our developers before they even finish their code.
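A linter rule can be as simple as a small script that walks the code and flags a risky pattern. The example below is a toy illustration (not one of our actual rules) that flags bare "except:" clauses, which swallow every error and make bugs harder to diagnose:

```python
# Toy illustration of a linter-style rule (not one of Accelo's actual rules):
# walk a file's syntax tree and flag bare "except:" clauses.
import ast
import sys


def find_bare_excepts(source: str, filename: str = "<string>"):
    """Return (line, message) warnings for every bare except clause."""
    warnings = []
    tree = ast.parse(source, filename=filename)
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            warnings.append((node.lineno, "bare 'except:' hides the real error"))
    return warnings


if __name__ == "__main__":
    path = sys.argv[1]
    with open(path) as f:
        for line, message in find_bare_excepts(f.read(), path):
            print(f"{path}:{line}: {message}")
```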
Like most modern technology teams, we use git extensively. When our developers have completed work on their code, they push it to a special staging server where our QA team can review how the code actually behaves. These staging servers serve a special function, running a copy of Accelo in a quarantined, internal environment where we can safely review changes before they're applied to our clients' live accounts. While the QA team reviews the behavior of the code in Accelo, the developer who created the code simultaneously creates a pull request - a fancy git term for requesting final review of their code before it's applied to the live Accelo system.
As soon as the pull request is created, our automated testing systems come to life. When Accelo first began, these automated testing systems were pretty simple, but with every lesson learned - whether from a bug slipping through, an error occurring or simply reevaluating our process for better ways to work - these tests have grown in depth and breadth. Today, every single pull request is tested automatically by dozens of servers running almost 1,000 different test plans, with some individual test plans containing thousands of unique tests. In a whirlwind of testing, these servers exercise every area of Accelo with every single pull request, and it is really a sight to behold.
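To give a very simplified picture of what that fan-out looks like, the sketch below runs a handful of placeholder test plans in parallel and only passes the pull request if every plan passes. In reality the work is dispatched to dedicated test servers rather than local processes, and the plan names and commands here are made up:

```python
# Simplified sketch of fanning a pull request's test plans out to parallel
# workers (placeholder plan names and commands; the real system dispatches
# work to dedicated test servers rather than local subprocesses).
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical test plans; in practice there are close to 1,000 of these.
TEST_PLANS = ["tickets", "invoicing", "scheduling", "email", "reporting"]


def run_plan(plan: str) -> tuple[str, bool]:
    """Run one test plan and report whether it passed."""
    result = subprocess.run(
        ["pytest", f"tests/{plan}", "-q"],  # placeholder command per plan
        capture_output=True,
        text=True,
    )
    return plan, result.returncode == 0


def test_pull_request() -> bool:
    """Run every plan in parallel; the pull request passes only if all do."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = dict(pool.map(run_plan, TEST_PLANS))
    for plan, passed in results.items():
        print(f"{plan}: {'PASS' if passed else 'FAIL'}")
    return all(results.values())
```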
When the automated systems identify a problem, they report it loudly, adding notes directly to the code being reviewed. Any time a note like this is made, the developer receives a real-time notification and must address the problem before testing can proceed. All code has to pass these tests before it is applied to the live Accelo system (via the deployment process described below).

In parallel to this automated approach, we also have a squad of senior developers who manually review the content of these pull requests. Line by line, they check the code to ensure that it both works and works efficiently, adding comments to help the team improve how their code is constructed. If our senior team encounters something that needs attention, they assign specific tasks to the original developer, with details on how the issue could best be addressed. That developer is then responsible for completing all tasks and revising their code to address the comments before running the full test plan again.
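Conceptually, the gate a pull request has to clear before it can be queued looks something like the sketch below (the field names are hypothetical, not our internal schema): green tests, no open review tasks, and sign-off from a senior reviewer.

```python
# Sketch of the "gate" a pull request must clear before it can be queued for
# deployment. Field names are hypothetical, not Accelo's internal schema.
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    title: str
    tests_passed: bool = False
    open_review_tasks: int = 0
    approved_by: list[str] = field(default_factory=list)


def ready_to_queue(pr: PullRequest) -> bool:
    """A pull request is deployable only when every check is satisfied."""
    return (
        pr.tests_passed
        and pr.open_review_tasks == 0
        and len(pr.approved_by) >= 1
    )


# Usage: this pull request still has an unresolved review task, so it stays blocked.
pr = PullRequest("Fix ticket filters", tests_passed=True, open_review_tasks=1)
assert not ready_to_queue(pr)
```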
Finally, once all tasks have been completed, all comments have been addressed and all tests have passed, our senior developers queue the approved pull requests for deployment to the live Accelo system.
The deployment process takes the pull requests which have been approved and queued by our senior development team and applies them to the live Accelo system. This automated process synchronizes the new code across dozens of servers, orchestrating the rollout so that all servers receive and begin running the new code at the same time: it updates the database schema, resets daemons and services, and updates our CDN (content delivery network) across all of the different regions which Accelo supports. The process is even dynamic enough to account for automated scaling, so that if additional servers are brought online after the rollout starts - for example, to handle an influx of new emails - they are also included in the update.
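In highly simplified form, the shape of that process looks like the sketch below: re-discover the fleet at deploy time so auto-scaled servers are included, apply schema changes, stage the new build everywhere, and only then switch every server over together. Every helper here is a placeholder, not our actual tooling, and CDN updates are omitted for brevity:

```python
# Highly simplified deployment sketch. Every helper below is a placeholder;
# the point is the shape of the process, not the real implementation.
def discover_servers() -> list[str]:
    """Placeholder: return the current fleet, including auto-scaled servers."""
    return ["app-01", "app-02", "worker-01"]


def stage_build(server: str, build_id: str) -> bool:
    """Placeholder: copy the new build to a server without activating it."""
    print(f"staging {build_id} on {server}")
    return True


def activate_build(server: str, build_id: str) -> None:
    """Placeholder: switch the server to the new build and restart services."""
    print(f"activating {build_id} on {server}")


def migrate_schema(build_id: str) -> None:
    """Placeholder: apply database schema changes for the new build."""
    print(f"migrating schema for {build_id}")


def deploy(build_id: str) -> None:
    servers = discover_servers()   # re-read the fleet so auto-scaled servers are included
    migrate_schema(build_id)       # schema first, so the new code finds what it expects
    staged = [s for s in servers if stage_build(s, build_id)]
    if len(staged) != len(servers):
        raise RuntimeError("not every server received the build; aborting")
    for server in servers:         # flip all servers over together
        activate_build(server, build_id)


deploy("build-1234")
```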
Once the deployment process is complete, our team is notified directly via Slack. While it is a risk to trust this critical update of live systems to automation rather than manual changes (in case something goes wrong and the team needs to take action), the sheer volume of updates required across Accelo's servers, their services and database, and the requirement that all of these servers receive the same update at the same time, mean that automation is actually significantly more reliable in our experience.
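The notification itself is the simple part - Slack's incoming webhooks accept a small JSON payload, along the lines of the sketch below (the webhook URL is a placeholder):

```python
# Posting a deployment notice to Slack via an incoming webhook.
# The webhook URL is a placeholder; real URLs are issued per Slack workspace.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def notify_deploy_complete(build_id: str, duration_seconds: int) -> None:
    message = f"Deploy of {build_id} completed in {duration_seconds}s."
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    response.raise_for_status()  # surface failures instead of silently dropping them


# Usage (with a hypothetical build id):
# notify_deploy_complete("build-1234", 312)
```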
Today's error was unfortunately the exception that proves the rule. During a deployment, some of our front-end servers didn't receive the updated code, causing them to fail when compiling and syncing up with the other servers. This resulted in an outage and required the development team to manually intervene in the deployment process.
To prevent this type of error from occurring again, the team is working diligently to refine the deployment process so that it more proactively checks for and prevents this kind of failure, and, if one does occur, recovers automatically and more seamlessly. This work will be completed this week.
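One way to catch this failure mode - not necessarily how our final implementation will look - is a post-deploy verification step that asks every front-end server which build it is actually running and rolls back automatically if any server reports the wrong one. The version endpoint and rollback helper below are hypothetical:

```python
# Sketch of a post-deploy verification step: confirm every front-end server
# is actually running the new build, and roll back automatically if not.
# The /version endpoint and rollback() helper are hypothetical.
import requests


def running_version(server: str) -> str:
    """Ask a server which build it is serving (hypothetical endpoint)."""
    response = requests.get(f"https://{server}/version", timeout=5)
    response.raise_for_status()
    return response.text.strip()


def rollback(servers: list[str], previous_build: str) -> None:
    """Placeholder: re-activate the previous build on every server."""
    print(f"rolling back {len(servers)} servers to {previous_build}")


def verify_deploy(servers: list[str], expected_build: str, previous_build: str) -> bool:
    stale = [s for s in servers if running_version(s) != expected_build]
    if stale:
        # Some servers never picked up the new code - the failure we saw today.
        rollback(servers, previous_build)
        return False
    return True
```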
We apologize for any inconvenience this outage caused, and will learn from it to ensure that it can't happen again.