The past week has been a very busy time at Accelo, and as a result we experienced a number of growing pains that led to periods of degraded end user experience. We apologize for the inconvenience these issues caused and want to take a moment to explain what happened, what we're doing to improve performance, and how we will better predict and catch load surges like this in the future.
In a more general sense, Accelo has been growing rapidly over the last few months. Our users are spending record amounts of time in the platform and doing record amounts of work - logging tens of thousands of work activities a day across what amounts to millions of sales, projects, tickets, retainers, tasks, invoices and more.
As part of our approach to handling scale - both within specific client accounts and across the platform generally - we are constantly improving our entire stack, from storage layers to processing algorithms and business logic to content and data delivery systems. As an example of this continual improvement, our Systems & Security team has averaged 6 structural system configuration changes a week over the last year, and that doesn't include the thousands of automated scaling and optimization processes we have orchestrated within Amazon Web Services directly.
One of the projects our team is currently working on to improve the speed of Accelo relates to how we use indexes and caches to bring together the data around activities and tasks. Successful caching means being very sensitive to when a cached object is changed in the database - if it gets modified in any way, you need to invalidate the cached version and refresh it. We've already successfully applied this approach to caching staff accounts and permissions (seeing a 20x performance increase there), and combined with other system upgrades, it has delivered improved performance over the last couple of weeks. However, when we made the change to start caching activities and tasks, we hit an unintended consequence - that sensitivity to change meant we were also telling our search indexing system that it needed to update far more frequently.
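To make that more concrete, here is a simplified sketch of the pattern (not our actual code - the names and structure are illustrative only): every write invalidates the cached copy and also emits a reindex event, so making the cache more sensitive to change multiplies the traffic heading to the search indexing queue.

```python
# Illustrative sketch only - hypothetical names, not Accelo's implementation.
from collections import deque

class ActivityCache:
    def __init__(self):
        self._store = {}             # cached, denormalized activity records
        self.search_queue = deque()  # pending reindex events

    def get(self, activity_id, loader):
        # Serve from cache, or rebuild the expensive denormalized view.
        if activity_id not in self._store:
            self._store[activity_id] = loader(activity_id)
        return self._store[activity_id]

    def invalidate(self, activity_id):
        # Drop the stale copy so the next read rebuilds it...
        self._store.pop(activity_id, None)
        # ...and tell the search index the object changed. This is the kind
        # of event that started firing far more often than intended.
        self.search_queue.append({"type": "activity", "id": activity_id})
```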
Normally, the queue processes these changes quite quickly, but because the change tracking and reindexing had become much more sensitive, we ended up queuing more than 100 million items to be reindexed in a relatively short period of time. While our systems performed admirably in scaling up to process this onslaught, the sheer size of the backlog meant that our users experienced delays between a new activity or task being created and it appearing in places like the inbox or activity stream, as well as delays in seeing new or changed companies, contacts, sales, projects, tickets, retainers and other objects in the "search" input at the top of the Accelo screen.
Unfortunately, that kind of knock-on effect wasn't picked up in testing because it only became apparent under a full production load, but steps are being put in place to detect this kind of behavior sooner in the future - we now have alarms on this piece of telemetry, including the backup/overflow queue, which wasn't as clearly reported on previously.
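As an illustration of the kind of alarm we mean - and assuming, purely for the example, that the reindex backlog lives in an Amazon SQS queue - the sketch below creates a CloudWatch alarm that pages us when queue depth stays high for a sustained period. The queue name, threshold and notification topic are placeholders, not our real configuration.

```python
# Illustrative only: alarm on sustained queue depth (placeholder values).
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="search-reindex-backlog-too-deep",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "search-reindex-overflow"}],
    Statistic="Maximum",
    Period=300,                  # look at 5-minute windows
    EvaluationPeriods=3,         # sustained growth, not a brief spike
    Threshold=100_000,           # queued items before we want to be paged
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```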
While our team was able to stem the incoming load relatively quickly through configuration changes, we still had that backlog of 100 million items to process!
Dropping the queue wasn't an option: alongside the noise from the over-sensitive change tracking, there was also important data we needed to process, and there was no way to tell which queue entry was which.
That backlog put a lot of pressure on several of our backend databases, both from the sheer size of the queue and from all the activity needed to process it - things were working very hard! This created some intermittent slowdowns in the user experience around page loads and email processing, but the more persistent issue was the processing delays. New (non-noisy) events were still being added to the front of the queue, but the sheer volume of events coming in from the backlog distorted that behavior, and search updates were not fast enough.
It has been a fine balancing act this week to make good headway on the backlog without further degrading the broader system too much - running warm, but not too hot. While results may not have shown up in the activity stream in a timely manner, rest assured the data was all stored safely; it just hadn't propagated to the search database as quickly as we would prefer.
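In practice, "warm but not too hot" means draining the backlog in batches and backing off whenever the databases are busy serving live traffic. The sketch below shows the general idea; the helper functions and thresholds are hypothetical, not our production code.

```python
# Simplified sketch of throttled backlog draining - hypothetical helpers.
import time

MAX_DB_LOAD = 0.75       # back off above this fraction of capacity
BATCH_SIZE = 500
BACKOFF_SECONDS = 30

def drain_backlog(fetch_batch, reindex, current_db_load):
    while True:
        if current_db_load() > MAX_DB_LOAD:
            # Databases are busy with live traffic; pause the backlog rather
            # than degrade page loads and email processing.
            time.sleep(BACKOFF_SECONDS)
            continue

        batch = fetch_batch(BATCH_SIZE)
        if not batch:
            break            # backlog fully drained
        reindex(batch)
```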
An improvement has also been put in place to split the search queue: rather than relying on user updates being pushed to the front of a single shared queue, user updates now have a separate queue to themselves. This has proven to give us more reliable queueing behavior under a high volume of events, so should a similar situation occur, we are more confident it won't result in the same behavior we saw this past week.
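To illustrate the shape of that change (again, a sketch with made-up names rather than the actual implementation): live user updates are always served first, and bulk reindexing work only runs when the live queue is empty, so user-visible changes never wait behind a 100-million-item backlog.

```python
# Illustrative split-queue sketch - names are made up for the example.
from collections import deque

live_updates = deque()   # changes made directly by users
bulk_reindex = deque()   # background/backlog reindexing work

def next_search_job():
    # User-visible changes are indexed before any bulk backlog work.
    if live_updates:
        return live_updates.popleft()
    if bulk_reindex:
        return bulk_reindex.popleft()
    return None
```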
We have taken advantage of the weekend lull to process the backlog and reindex all customer data. Combined with the code improvements made during the week, we feel we are in good shape for the week ahead, and we will be keeping a close eye on the search queues' behavior.