Exceptional: our infrastructure, built to scale
While we've been in beta, we've had a few scaling issues that are atypical of such a young app. Given how Exceptional works, if several large apps are busy at once and they're all throwing exceptions, that's a lot of data for our API to process.
We've addressed these issues on several fronts, and while we're not done yet, we're well on the way to providing a solid service that you can rely on.
First up, our hardware. We're using Amazon's EC2 service. It just works. It's fast, powerful and cost-effective. We're able to add instances as and when we need them to cope with demand.
Next, our API. This is something that we've put a lot of work into. We accept requests and write them to disk as soon as they come in, so as not to tie up any web-facing processes. All requests are then queued up and processed in the background, and we'll be able to scale this out as we need to. Sometimes exceptions occur in the data processing queue, caused by malformed XML or by bugs on our side. The queue rescues these, flags the offending requests so that we can follow them up, and moves on to the next request in the queue. We're currently servicing hundreds of thousands of API requests per day.
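To make the accept, queue and rescue flow a little more concrete, here is a minimal sketch in Python. It is not our actual code: the function names, the spool and flagged directories, and the `<message>` element are all assumptions made up for illustration.

```python
import time
import uuid
import xml.etree.ElementTree as ET
from pathlib import Path


def accept_request(spool_dir: Path, body: str) -> Path:
    """Write the raw request body to disk and return straight away.

    Hypothetical helper: the web-facing process does no parsing at all,
    it only spools the payload for the background worker.
    """
    # Timestamp prefix keeps files roughly in arrival order; the uuid
    # suffix avoids collisions between concurrent requests.
    name = f"{time.time_ns():020d}-{uuid.uuid4().hex}.xml"
    path = spool_dir / name
    path.write_text(body, encoding="utf-8")
    return path


def process_queue(spool_dir: Path, flagged_dir: Path) -> list:
    """Process every spooled request; flag failures and keep going."""
    processed = []
    for path in sorted(spool_dir.glob("*.xml")):
        try:
            root = ET.fromstring(path.read_text(encoding="utf-8"))
            # Assumed payload shape: <exception><message>...</message></exception>
            processed.append(root.findtext("message", default=""))
            path.unlink()
        except Exception:
            # Malformed XML or a bug in the processing code: move the
            # request aside for follow-up, then carry on with the next one.
            path.rename(flagged_dir / path.name)
    return processed
```

The point of the design is that a bad request never blocks the queue: it is set aside where a human can look at it, and the worker moves straight on.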
We're very concerned about keeping our web interface snappy. We've optimised it for speed at the app level, and in the coming weeks we'll be working on speed at the HTTP level. There is a slew of enhancements that we'll be making to improve app speed, but I'm already quite proud of how fast the app is without any HTTP-level caching. See Jeremy Kemper's RailsConf keynote for more info on this.
Our database lives on its own EC2 instance. Like most web apps, our database bears most of the pain of our busy API. We've worked hard to keep it really efficient. All 404 data is checksummed and we use counters wherever we can. Every exception coming through is also checksummed. After the first 100 or so occurrences of an exception, we stop storing full data for every occurrence: we just increment a counter and keep samples of the new occurrences. This way, we can give you a good overview of the active exceptions in your app while keeping our database size manageable.
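The checksum, count and sample idea can be sketched roughly as follows, in Python. Treat it as an illustration under stated assumptions, not a description of our schema: which fields go into the checksum, the choice of SHA-1, the exact limit of 100 and the sampling rate of one in 50 are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field

FULL_CAPTURE_LIMIT = 100  # "100 or so" occurrences; exact figure assumed
SAMPLE_EVERY = 50         # hypothetical sampling rate after the limit


@dataclass
class ExceptionGroup:
    count: int = 0
    samples: list = field(default_factory=list)


class ExceptionStore:
    """In-memory stand-in for the database, for illustration only."""

    def __init__(self):
        self.groups = {}

    @staticmethod
    def checksum(exception_class: str, message: str, top_frame: str) -> str:
        # Assumed grouping key: identical class, message and top stack
        # frame are treated as the same exception.
        key = "\n".join([exception_class, message, top_frame])
        return hashlib.sha1(key.encode("utf-8")).hexdigest()

    def record(self, exception_class, message, top_frame, payload):
        digest = self.checksum(exception_class, message, top_frame)
        group = self.groups.setdefault(digest, ExceptionGroup())
        group.count += 1  # every occurrence is counted
        # Full data is kept for the first occurrences, then only for
        # an occasional sample.
        if group.count <= FULL_CAPTURE_LIMIT or group.count % SAMPLE_EVERY == 0:
            group.samples.append(payload)
        return group
```

With these assumed numbers, an exception that fires 200 times costs one counter plus 102 stored payloads rather than 200, and the saving grows the noisier the exception gets.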
We're using Amazon's Elastic Block Store for data persistence and nightly backups. EBS is an awesome service that lets us take backups quickly and efficiently, and restore data fast in the event of failure.
There are a great many improvements that we plan to make over the coming weeks and months. We've built the architecture to scale and to handle large apps as well as smaller ones. We're keeping a close eye on CouchDB, a new(ish) database that is built from the ground up to be highly reliable and distributed. We already have a working implementation of the app built around storing data in CouchDB, which we'll be testing and improving in the coming months.
Along with new features and improvements to the app in general, we're committed to making sure that it lives on solid, stable architecture so that ours is a service that we're proud of, and that you can trust.






September 23rd, 2008 at 06:57 PM
I like technical details about what problems you faced and how you fixed them. Can you post more about your experiments with CouchDB? Good app, btw.
September 24th, 2008 at 11:10 AM
Hey,
thanks for the comment, I'll definitely blog a bit more about CouchDB in a few weeks, when I've had a bit more time to get down and dirty with it!