Servers on fire

by Andy Boyle.

Well, that was fun. Last night the server for our 2010 midterm elections application was crying under the weight of tons of SQL queries that apparently weren’t caching. We run a medium Amazon instance for all 14 of our papers, and boy oh boy did we learn a thing or two from this.

After a few hours, my colleagues started optimizing the tables, which helped quite a bit. Then by 11 p.m., when California results were coming in, most of our papers on the eastern side of the U.S. must’ve started watching the news and quit slamming our site. CPU usage dropped from 99 percent to incredibly reasonable levels after that.

So, what have I learned from all of this? We used Siege to test the server, and it was easily handling 2,500 hits a minute without feeling much pain. But when we went live last night, we included some widgets for the websites to use on their homepages. We didn’t factor those into the test, which is something I should’ve done. MySQL died at one point and we ended up rebooting the server. Traffic on our app pages slowed to a crawl, and I felt pretty embarrassed that there wasn’t more I could do.
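Here’s a minimal sketch of how a test URL list could mix app pages with widget requests, so the load test looks more like launch-night traffic. The URLs, the 80 percent widget share and the function name are all made up for illustration; a real split would have to come from the server logs.

```python
import random


def build_url_mix(app_urls, widget_urls, widget_share, total, seed=0):
    """Return `total` URLs, with `widget_share` of them drawn from widget_urls.

    All names and numbers here are hypothetical; tune them to real traffic.
    """
    if not 0 <= widget_share <= 1:
        raise ValueError("widget_share must be between 0 and 1")
    if not app_urls or not widget_urls:
        raise ValueError("both URL lists must be non-empty")

    rng = random.Random(seed)  # seeded so the same mix can be replayed
    n_widget = round(total * widget_share)

    # Pick widget and app URLs separately, then shuffle them together.
    urls = [rng.choice(widget_urls) for _ in range(n_widget)]
    urls += [rng.choice(app_urls) for _ in range(total - n_widget)]
    rng.shuffle(urls)
    return urls


if __name__ == "__main__":
    mix = build_url_mix(
        app_urls=["http://example.com/elections/"],        # placeholder URL
        widget_urls=["http://example.com/widgets/home/"],  # placeholder URL
        widget_share=0.8,  # assumed share of hits coming from homepage widgets
        total=10,
    )
    print("\n".join(mix))  # one URL per line
```

Saved one URL per line, a list like this can be fed to Siege with its `-f` option (for example, `siege -f urls.txt`), so the widgets get hammered along with the app pages.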

It seems that a few other large news organizations had issues with their sites, too. So, if any of you folks had any major server meltdowns, please post about them in the comments and explain what you’re going to do to fix them.

That leads to the main reason I’m writing this today: I’m charged with spending the rest of the week figuring out how we can make sure this never happens again. That could mean learning how to set up our servers differently (Varnish, anyone?) or rethinking our use of PHP and switching to Ruby or Django.

So, fine friends, offer your suggestions in the comments on what you think we should do, and, if possible, any links that may help me on my wonderful journey.

***edit at 3:10 p.m.***

Looks like we did have caching on. It just appears the SQL queries were incredibly CPU-intensive and locked up the server’s processes, so the complex queries monopolized the CPU even with the cache enabled. Just thought I’d clarify that after talking with my comrades.