Yesterday, many Accelo users experienced an extended outage between 13:35 and 15:20 (San Francisco time, UTC-8). This outage was the longest we've experienced in years and was caused by an extreme load condition on our primary database infrastructure.
We apologize for the inconvenience this outage caused to our clients and have taken short-term steps, as well as accelerated our medium-term efforts, to reduce the chance that an outage of this magnitude can happen again.
In short, we suffered an inadvertent denial-of-service attack. A specific type of traffic/request arrived at roughly 10,000 times the normal rate in a very short period, and the effect was the equivalent of someone yelling "fire" in a crowded room with just one door - the rush of people all trying to squeeze through a narrow doorway at once causes the problem, not the fire itself.
More specifically, one of our clients had their own email systems (within their office, not related to Accelo) compromised by unknown persons. As a result, the compromised system sent massive numbers of requests to our email delivery vendor, SendGrid, in effect pretending to open many emails tens of thousands of times. SendGrid then relayed these requests to us (as it should) so we could record that each specific email had indeed been opened - but what we hadn't counted on, and had never seen before, was an avalanche of this size.
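For readers curious about the mechanics: SendGrid's Event Webhook delivers these notifications by POSTing batches of JSON events (delivered, open, click and so on) to an endpoint the customer hosts. The sketch below is a simplified illustration of that shape, not our actual code - the endpoint path and the record_email_open helper are assumptions made purely for the example.

```python
# Illustrative only: a minimal Flask endpoint receiving SendGrid's Event Webhook.
# SendGrid POSTs a JSON array of events; "open" events are what flooded us.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/callbacks/sendgrid", methods=["POST"])  # path is hypothetical
def sendgrid_events():
    events = request.get_json(force=True) or []
    for event in events:
        if event.get("event") == "open":
            # In the original (pre-incident) design, this work happened
            # synchronously, inside the web request, against the main database.
            record_email_open(
                message_id=event.get("sg_message_id"),
                recipient=event.get("email"),
                opened_at=event.get("timestamp"),
            )
    return jsonify({"received": len(events)}), 200

def record_email_open(message_id, recipient, opened_at):
    """Hypothetical helper: look up the activity, store the open, notify the user."""
    ...
```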
The reason this hurt so much is that the bandwidth/resources involved in loading an image (which is all the "attacker" did) are a lot less than the resources we use to record that an email was opened (including raising notifications in the Accelo product so our users know the email has been opened). In normal use, this is fine - how fast can you really open an email, close it, and open it again, after all? But when you have over 20,000 email opens happening in the course of 15 minutes, it starts to hurt.
In effect, write traffic far above normal, combined with the read load of finding the right activity and the person who opened it on our end, overloaded the CPU of our database systems. Requests backed up, and because each new "open" notification took longer to process than the one before it, the continuing flood made the problem progressively worse.
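To make the asymmetry concrete, here is a rough expansion of the hypothetical record_email_open helper from the sketch above. The table and column names are invented and the database handle is generic; the point is simply that one tiny image request from the "attacker" fanned out into several reads and writes on our side.

```python
# Illustrative only: roughly the shape of work each "open" callback triggered,
# versus the single cheap image request that caused it. Schema is invented.
def record_email_open(db, message_id, recipient, opened_at):
    # Reads: find the activity (email) and the person who opened it.
    activity = db.query_one(
        "SELECT id, staff_id FROM activities WHERE sendgrid_message_id = %s",
        (message_id,),
    )
    contact = db.query_one(
        "SELECT id FROM contacts WHERE email = %s", (recipient,)
    )
    # Write: store the open so it shows in the activity stream.
    db.execute(
        "INSERT INTO email_opens (activity_id, contact_id, opened_at) VALUES (%s, %s, %s)",
        (activity["id"], contact["id"], opened_at),
    )
    # Write: raise an in-app notification for the email's sender.
    db.execute(
        "INSERT INTO notifications (staff_id, activity_id, type) VALUES (%s, %s, 'email_open')",
        (activity["staff_id"], activity["id"]),
    )
    # Two reads + two writes per event: at ~20,000 events in 15 minutes, that's
    # on the order of 80,000 extra queries on top of normal load.
```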
As part of our scalable infrastructure, we were able to provision additional server resources through AWS. However, the nature of a database as a large, persistent, always-running and constantly changing system, combined with the extremely high CPU load it was under, meant the resizing/upgrading process took much longer than usual to take effect. This is part of the reason the outage lasted longer than any we've experienced in quite a long time.
The second part of the reason for the outage was the large amount of backed-up load and requests: when we brought the systems back online, our load balancers started to return 503 errors. So even though we got the database systems online at 14:50, the problem moved to a different part of the stack for another 15 minutes.
After getting systems back online and stable (which involved disabling the delivery, open and click notification system across the platform for a few hours), we moved quickly to completely change how we process these SendGrid callbacks. SendGrid was just doing its job, and there's no way to stop a malicious actor from doing something like this again in the future, so the fix had to come from how we handle these callbacks on our side.
SendGrid was one of our first callback integrations (we implemented it, along with the delivery/open/click notifications, over 4 years ago), and its notifications were being processed in real time by our main web server stack. We've had plans in place for a number of months to move these requests onto our new queuing systems (for the technically minded, the AWS Lambda and SQS services), but that work had taken a back seat to other priorities.
After yesterday's outage, our engineering team rightly reprioritized this work and, in a pretty impressive effort by any measure, completely re-architected the processing pipeline to direct all SendGrid events through these asynchronous queues. As a result, this specific situation can't cause an outage again: if we get flooded with this sort of traffic, it will simply build up in a queue, and the worst case should be a delay between when someone opens an email and when you see the blue outline in your Accelo stream telling you it was opened.
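For the technically minded, the general shape of the new pipeline looks something like the sketch below: a small, stateless function accepts the SendGrid callback and immediately drops the events onto a queue, and the heavier database work happens later, at a pace the database can sustain. This is an illustration of the pattern using boto3; the queue URL, environment variable and payload handling are assumptions, not our production code.

```python
# Illustrative sketch of the new ingestion path: an AWS Lambda function behind
# an HTTP endpoint that enqueues SendGrid events onto SQS instead of touching
# the database directly. Names and the queue URL are placeholders.
import json
import os

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["SENDGRID_EVENTS_QUEUE_URL"]  # hypothetical env var

def handler(event, context):
    # With an API Gateway (proxy) trigger, the POST body arrives as a string.
    events = json.loads(event.get("body") or "[]")

    # SQS accepts at most 10 messages per batch call, so chunk the events.
    for i in range(0, len(events), 10):
        batch = events[i:i + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(i + j), "MessageBody": json.dumps(e)}
                for j, e in enumerate(batch)
            ],
        )

    # Acknowledge quickly; the expensive work happens in a separate consumer.
    return {"statusCode": 200, "body": json.dumps({"queued": len(events)})}
```

A worst-case flood now fills a queue rather than the database, and the consumer that writes the opens and notifications drains it at a controlled rate.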
While the flood of unexpected data and the lack of a queuing pipeline were the immediate cause yesterday, the core problem is that our database could be overloaded at all.
To address this, we're completely re-architecting our database storage and operations layer. For the technically minded, this means moving to AWS Aurora (which has a different scaling model for handling load increases). It also means running a more finely tuned set of database servers, with dedicated read instances and connections tuned to return data fast, alongside a more limited and focused model for the less-frequent operations (which are expensive and involve table locking, but which are also critical to our users).
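Aurora makes the "dedicated read instances" part straightforward, because each cluster exposes a writer (cluster) endpoint and a separate reader endpoint that balances across the read replicas. The sketch below shows the general routing idea using SQLAlchemy, assuming a MySQL-compatible cluster for the example; the hostnames, credentials, schema and functions are placeholders, not our configuration.

```python
# Illustrative only: route reads to Aurora's reader endpoint and writes to the
# writer (cluster) endpoint. Hostnames, credentials and schema are placeholders.
from sqlalchemy import create_engine, text

WRITER_URL = "mysql+pymysql://app:secret@mycluster.cluster-abc123.us-west-2.rds.amazonaws.com/appdb"
READER_URL = "mysql+pymysql://app:secret@mycluster.cluster-ro-abc123.us-west-2.rds.amazonaws.com/appdb"

writer = create_engine(WRITER_URL, pool_pre_ping=True)
reader = create_engine(READER_URL, pool_pre_ping=True)

def fetch_activity(activity_id):
    # Frequent, latency-sensitive reads go to the read replicas.
    with reader.connect() as conn:
        return conn.execute(
            text("SELECT * FROM activities WHERE id = :id"), {"id": activity_id}
        ).mappings().first()

def record_open(activity_id, contact_id):
    # Writes (and the rarer, expensive operations) stay on the writer.
    with writer.begin() as conn:
        conn.execute(
            text("INSERT INTO email_opens (activity_id, contact_id) VALUES (:a, :c)"),
            {"a": activity_id, "c": contact_id},
        )
```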
We've also introduced a new caching infrastructure, which reduces the need to touch the database at all for frequently requested data (we already have it in place for permissions and user accounts).
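The caching layer follows the standard cache-aside pattern: check the cache first, fall back to the database on a miss, and store the result with a short expiry. The sketch below illustrates the idea for a permissions lookup; Redis is assumed as the cache store purely for the example, and the key format, TTL and database helper are made up.

```python
# Illustrative cache-aside lookup for frequently requested data (permissions).
# Redis is assumed here purely for the example; key format and TTL are made up.
import json

import redis

cache = redis.Redis(host="cache.internal", port=6379)  # placeholder host
PERMISSIONS_TTL_SECONDS = 300

def get_permissions(db, user_id):
    key = f"permissions:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: no database work at all

    # Cache miss: read from the database once, then keep the result warm.
    permissions = db.query_all(
        "SELECT object_type, level FROM permissions WHERE user_id = %s", (user_id,)
    )
    cache.setex(key, PERMISSIONS_TTL_SECONDS, json.dumps(permissions))
    return permissions
```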
These changes are significant and need to be done extremely carefully. While yesterday's outage was clearly frustrating, no client data was lost and no client data was compromised. Every step we take with these improvements has security and resiliency at the top of our mind, with performance being an important but tertiary consideration. We know you want your data fast, but we also know Accelo is a system that you rely on to run your business, and we take our responsibilities in this respect incredibly seriously.
Our whole team would like to apologize for yesterday's outage. While I know the work they did to get our systems back online - even while the "attack" was continuing at full speed - and to make sure this specific issue can't hurt us in the future was incredible, what matters to you, our clients, is that we had an outage that inconvenienced you and your teams. For this, we're sorry, and we will continue to work hard to make sure these things don't happen again.