Some of our users have been in contact recently to ask about delays they're seeing in emails from clients being automatically captured into Accelo.
Glenn from our team (pictured right, looking every bit the dashing engineer) has recently been doing some pretty hardcore geekery to make our email capturing processes a lot more scalable and robust. We'll be implementing a range of changes over the next week, both to increase reliability for our existing clients and to support what we anticipate will be a significant increase in load from some changes we've got coming very soon.
A couple of things we just wanted to reiterate for anyone seeing delays in email processing:
No emails in the capture process are being lost - if anything, you'll just be experiencing delays.
The creation time on captured emails will reflect when they were sent, not when they were processed. So, from a record-keeping perspective, Accelo will be accurate.
When a message is delayed, other messages that come in after it may be processed first. Aside from the confusion caused by the delay, this creates the only potential for a persistent problem: follow-up emails might not be captured in the correct order.
We expect to have transitioned all of our users to a new email capture processing engine by the end of next week (3rd of August), which will allow us to handle much more email traffic in a more reliable way (so if bad emails get captured from bad actors, we'll deal with them more gracefully).
For those interested, the technical details explaining why are below.
A more technical explanation
We've discovered more and more users sending malformed emails, which have caused issues for the mail parsing library we use.
When the parsing library finds a corrupt email, instead of gracefully returning and saying "sorry, not going to process this crazy bad boy", it dies and shuts down the processing engine.
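To illustrate what "gracefully returning" could look like, here is a minimal sketch, assuming Python's standard `email` package stands in for the parsing library. The `safe_parse` helper and the `quarantine` list are hypothetical names used only for this example. The idea is simply that a parse failure gets caught and the offending message is set aside, so the engine stays up.

```python
import email
import email.policy
from email.message import Message
from typing import Callable, List, Optional


def strict_parse(raw: bytes) -> Message:
    """Parse raw bytes; the strict policy raises on defects instead of recording them."""
    return email.message_from_bytes(raw, policy=email.policy.strict)


def safe_parse(
    raw: bytes,
    quarantine: List[bytes],
    parser: Callable[[bytes], Message] = strict_parse,
) -> Optional[Message]:
    """Return the parsed message, or None if parsing failed.

    A bad message is put aside in `quarantine` (a hypothetical holding area)
    so that one corrupt email cannot take down the whole engine.
    """
    try:
        return parser(raw)
    except Exception:  # deliberately broad: nothing here should kill the engine
        quarantine.append(raw)
        return None
```

A caller would check for `None` and move on to the next message, leaving the quarantined bytes for someone to inspect later.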
The death of the processing engine closes the socket to the mail server, which then responds the way all good mail servers do: by queuing or deferring the email and trying to deliver it again later.
We have a special process which keeps an eye on the processing engine, and when it sees that the engine has died or stopped working, it restarts it within a few seconds.
Unfortunately, the mail server, having seen a mail transport/destination disappear or be unavailable, doesn't check immediately to see if it has become available again (this is the standard behavior in a mail server - if a link or network on the internet goes down, instead of pounding away at a closed door every second, the mail server says "I'll just leave you alone for a bit and try again later").
As a consequence, the mail server keeps deferring incoming mail destined for our processing engine for a variable period (depending on load, this is often greater than 10 minutes). Of course, the processing engine has only been gone for a few seconds, but the mail server doesn't know that until it tries again in its own good time.
Eventually, the mail server tries to deliver again, finds the service socket open and available, and processing resumes.
Queued/deferred mail isn't delivered immediately; instead, it is processed on a decaying time algorithm, so the oldest (longest-deferred) mail is retried the *least* frequently.
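The exact retry schedule depends on the mail server and how it is configured, which isn't spelled out here. As a rough sketch of what a decaying schedule looks like, assuming a simple exponential backoff with made-up values (a 5 minute first retry, doubling each time, capped at 4 hours), the wait between attempts grows like this:

```python
from typing import List


def retry_delays(attempts: int, first: int = 300, factor: int = 2, cap: int = 14400) -> List[int]:
    """Seconds to wait before each retry; all defaults are illustrative assumptions."""
    delays = []
    delay = first
    for _ in range(attempts):
        delays.append(min(delay, cap))  # never wait longer than the cap
        delay *= factor
    return delays


print(retry_delays(6))  # [300, 600, 1200, 2400, 4800, 9600]
```

Under these assumed numbers, the sixth retry comes 9600 seconds (2 hours 40 minutes) after the fifth, even though the engine may have been back up within seconds.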
While we could hack at the way the mail server works, we're instead transitioning to a new model where the email processing engine isn't called at all at the point where we accept an email from a server.
In preparation for the new versions of Accelo, we're transitioning to a new architecture with similar properties to a map-reduce system for asynchronous and delegated processing (which is where the crunching goes on). This should mean a single bad email can no longer kill the engine, and these delays/deferrals no longer occur.
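The new architecture isn't described in detail here, so the following is only a sketch of the general idea of asynchronous, delegated processing, using Python's standard `queue` and `threading` modules. The names `accept`, `worker` and `process` are hypothetical. The receiving side only places raw mail on a queue and returns immediately; a separate worker does the crunching, and a failure on one message is recorded without stopping the worker.

```python
import queue
import threading
from typing import List, Tuple

inbox: "queue.Queue[bytes]" = queue.Queue()
results: List[Tuple[bytes, str]] = []


def accept(raw: bytes) -> None:
    """Called on receipt: just enqueue, so the mail server is never kept waiting."""
    inbox.put(raw)


def process(raw: bytes) -> str:
    """Stand-in for the real crunching; rejects empty messages as an example failure."""
    if not raw:
        raise ValueError("empty message")
    return raw.decode("utf-8", errors="replace").upper()


def worker() -> None:
    """Drain the queue; a bad message is recorded and skipped, never fatal."""
    while True:
        raw = inbox.get()
        try:
            results.append((raw, process(raw)))
        except Exception as exc:
            results.append((raw, f"failed: {exc}"))
        finally:
            inbox.task_done()


threading.Thread(target=worker, daemon=True).start()

for message in (b"hello", b"", b"world"):
    accept(message)
inbox.join()  # wait until the worker has handled everything queued so far
```

Because accepting mail and processing it are decoupled, the empty message in the middle is logged as a failure while the messages before and after it are still processed.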
PS - Glenn is a little shy and likes cats; he would have preferred this blog post look a bit more like this.