Some of our users - particularly in Europe, and to a lesser extent on the East Coast of the US - would have noticed some sluggish performance of Accelo over the last two days.
Our engineering teams in Australia and California have also been burning the midnight oil getting to the bottom of some performance and load issues that have slowed down our systems. Rather than just shrugging this off and hoping you forget about it, we'd like to open up and give you some insights to hopefully make up for any inconvenience and frustration experienced.
As many growing web services have found over the years, things don't move in a straight line - there are ups and downs, and growth comes in bursts and from unexpected sources.
Great web services like Facebook managed to grow very quickly with very few hiccups, while other services like Twitter didn't do so well and suffered significant downtime as they grew. We'd much rather be Facebook than Twitter, but unfortunately the unexpected can (and does) occur.
On this particular occasion - one of the first periods of system instability we've suffered in over 9 months - the problem was actually caused by some code written over two years ago. This code related to the way we fetched contacts and address records against a given company, and displayed them on the company view screen.
Literally tens of millions of times over many months this screen loaded perfectly. And we're not exaggerating - this screen is the most frequently used in the whole of Accelo, and since the redesign earlier in 2012, performance has been very good. However, we recently had a new client come on board whose data pushed us past the limit.
Without getting into the boring stuff, most of our users have only a handful of contacts against a client. However, this Accelo user has hundreds of individual contacts against a company. They also have the same sort of number of addresses or offices, and they diligently went in and loaded them all under this same master client just recently.
Unfortunately, and unexpectedly, this caused significant performance bottlenecks. Accelo deals with many millions of data points on a daily basis, so having a few hundred contacts mess things up was completely unexpected. It also meant that our initial investigations to find the source of the problem weren't as effective as we'd like; we were looking deep down at super complex things and interplays between systems, not something simple like *this*.
The bottlenecks then had the unfortunate effect of causing even further pain, compounding together to make things hundreds of times worse than they would be on their own. Servers already under performance load were then hit even more frequently by users thinking they needed to refresh, and like a highway with a slow truck on it taking up too many lanes, traffic banked up. Getting things going fast again meant building new highway, trying to keep drivers moving on the road while we did it, and putting wings and jet engines on the truck so it wasn't on the highway anymore - all at the same time.
Lots of fun. And unfortunately, for you, our users, it wasn't any fun to be stuck on the highway either while all of this was going on.
We've now found and solved this problem by making some quick upgrades to the company view screen and a bunch of other technical tricks. You might now notice that this screen will only load the first 20 or so contacts against a client account - this probably isn't something you'll see as a difference most of the time, but for those larger accounts you work, you'll now see this screen load faster.
In terms of finding someone other than the first 20 contacts, you've got two options: one is to click on the button under the contact list on the left to load the next 20, and the other is to use the (now much faster) search bar in the top left of the company screen to find contacts whose name matches (or partially matches) what you're searching for.
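For the technically curious, here is a rough sketch of the idea behind the change: load contacts a page at a time, and match names on a partial, case-insensitive search. This is only an illustration and not our production code; the names `PAGE_SIZE`, `load_contacts_page` and `search_contacts` are made up for this example, and the page size of 20 is taken from the "first 20 or so" described above.

```python
from typing import List, Tuple

# Hypothetical page size, based on the "first 20 or so" contacts mentioned above.
PAGE_SIZE = 20


def load_contacts_page(
    contacts: List[str], offset: int = 0, limit: int = PAGE_SIZE
) -> Tuple[List[str], bool]:
    """Return one page of contacts and whether more contacts remain."""
    page = contacts[offset:offset + limit]
    has_more = offset + limit < len(contacts)
    return page, has_more


def search_contacts(contacts: List[str], query: str) -> List[str]:
    """Return contacts whose name contains the query (case-insensitive)."""
    needle = query.strip().lower()
    return [name for name in contacts if needle in name.lower()]


if __name__ == "__main__":
    # A made-up company with 45 contacts.
    everyone = [f"Contact {i}" for i in range(45)]
    first_page, more = load_contacts_page(everyone)
    print(len(first_page), more)
    print(search_contacts(["Alice Smith", "Bob Jones"], "smi"))
```

The point of the design is that the screen only ever does work proportional to the page size, rather than to the total number of contacts held against a company.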
Now, admittedly, this was a very fast find, solve and deploy project by software engineering standards. There are still some rough edges with this patch (the more button is ugly, for example). We'll also be monitoring the situation closely to identify any future bottlenecks that come up. We've instituted a higher level of logging to identify these high level problem children, which will complement our extensive systems monitoring, alerting and platform integrity management.
Like you, we get frustrated when technology doesn't work, and like you, we also know that it happens from time to time. And while we can accept that stuff breaks from time to time, often the most frustrating thing is being in the dark.
To help solve this and better communicate during these frustrating periods in the future, we've implemented a new Twitter account called @ALstatus.
If you follow @ALstatus, or check out its page on Twitter at www.twitter.com/ALstatus, you'll be able to see notifications of when things are changing or not quite working right. Since we push code almost every day, we're going to start letting you know via this account when we're pushing new code. We'll of course use it to tell you when there are problems that we're aware of and working on, and we're also going to experiment with pushing some of our automatic monitoring and alerts to this Twitter account so that we're as transparent with you, our users, as we possibly can be.
If you've got any questions about this, please don't hesitate to email firstname.lastname@example.org. We understand how critical Accelo is to running your business, and if you've got questions or concerns we'd only be too happy to address them.