Twice as hungry, ten times as busy

Two weeks ago, the Exceptional API suffered some serious growing pains. It hung on some requests and responded with 500 errors on others. It even went offline at one point, much to our regret.

Most of this pain was caused by some customers' over-active apps; we worked with those customers to resolve the problem.

But we know better than most that these sorts of emergencies are not freak occurrences; they happen, and we'll have to deal with an influx of exceptions from a single app again. (Just last week, one app sent us over a million exceptions in one day!) So we need to be able to handle situations like this gracefully and keep the Exceptional service fast, rock-solid and reliable.

I'm happy to report that last week we rolled out three changes. As a result, requests to our API are answered quickly, our exception queue is practically empty and, in most cases, exceptions show up in Exceptional in less than a second.

Here are the changes we've made:

  1. We set a relatively short timeout on our API so that our customers' apps will never hang if we suffer high load on it again.
  2. We added a new API instance, which means all API requests are load balanced across two large servers that eat all the requests thrown at them and respond to your app quickly.
  3. We increased the number of exception-processing workers tenfold; we now have 30 daemons crunching through the queue.
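
To illustrate the idea behind the first change, here is a minimal Python sketch of bounding how long a caller waits on a slow call. It is not our actual implementation: the `report_with_timeout` helper, the `send` callable and the 0.5 second default are all hypothetical names and values chosen for the example.

```python
import concurrent.futures


def report_with_timeout(send, payload, timeout_s=0.5):
    """Call send(payload), but stop waiting after timeout_s seconds.

    `send` stands in for whatever function delivers an exception report;
    it is a hypothetical placeholder, not part of any real client library.
    Returns send's result, or None if it did not finish in time.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(send, payload)
    try:
        # Block for at most timeout_s seconds waiting for the result.
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # The call is still running in its worker thread, but the caller
        # gets control back instead of hanging on it.
        return None
    finally:
        # Do not wait for a slow call to finish before returning.
        pool.shutdown(wait=False)
```

The point of the design is that a slow or overloaded service costs the caller a bounded delay at most; the trade-off is that a report which times out is simply dropped from the caller's point of view.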

This is just the start of a list of improvements that we're getting through slowly but surely. We hope you'll stick around long enough to see what's in the pipeline.

Thanks so much again for your support and patience.

3 Responses to "Twice as hungry, ten times as busy"

  1. Wes Winham says:

    Have you considered rate-limiting on a per-customer basis? It seems like you could set a very very high limit (like 50k per 24 hour period) and come out way ahead when one customer has a problem and hits you 1 million times. I doubt they'd find the 950k duplicate exceptions that they didn't get to be a big loss.

  2. Ryan Tomayko says:

    Yeah. Check out this rate limiting with memcache technique:

    http://simonwillison.net/2009/Jan/7/ratelimitcache/

    We're planning on using it to check for a variety of DOS attacks but I imagine it could be useful in situations like this as well. You might even rate limit using the exception name instead of the user id or IP -- a sort of dupe limiter.

  3. Jared Dobson says:

    Nice job! Thanks for the improvements!!!