Servers on fire

by Andy Boyle.

Well, that was fun. Last night the server for our 2010 midterm elections application was buckling under the weight of tons of SQL queries that apparently weren’t caching. We run a medium Amazon instance for all 14 of our papers, and boy oh boy did we learn a thing or two from this.

After a few hours, my colleagues started optimizing the tables, which helped quite a bit. Then by 11 p.m., when California results were coming in, most of our readers on the eastern side of the U.S. must’ve started watching the news and quit slamming our site. CPU usage dropped from 99 percent to incredibly reasonable levels after that.

So, what have I learned from all of this? We used Siege to test the server, and it was easily handling 2,500 hits a minute without feeling much pain. But when we went live last night, we also included some widgets for the websites to use on their homepages, and we never factored those into the test, which is something I should’ve done. MySQL died at one point and we ended up rebooting the server. Traffic on our app pages slowed to a crawl, and I felt pretty embarrassed that there wasn’t more I could do.
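For what it’s worth, the kind of concurrent hammering Siege does can be sketched in plain Python. This is only an illustration (the `hammer` helper and all its numbers are made up for the example, not our actual test rig), but next time I’d point something like it at the widget endpoints too, not just the app pages:

```python
import threading
import time
import urllib.request

def hammer(url, workers=10, requests_per_worker=50):
    """Fire many concurrent GETs at `url` and report how many
    succeeded and how long the whole run took."""
    successes = []
    lock = threading.Lock()

    def worker():
        for _ in range(requests_per_worker):
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    if resp.status == 200:
                        with lock:
                            successes.append(1)
            except OSError:
                # Timeouts and connection errors count as failures.
                pass

    start = time.time()
    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(successes), time.time() - start
```

The lesson either way: load-test every URL you actually ship, widgets included, not just the pages you think of as “the app.”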

It seems that a few other large news organizations had issues with their sites, too. So, if any of you folks had any major server meltdowns, please post them in the comments and explain what you’re going to do to fix them.

That leads to the main reason I’m writing this today: I’m charged with spending the rest of the week figuring out how we can make sure this never happens again, whether that means learning how to set up our servers differently (Varnish, anyone?) or rethinking PHP and switching to Ruby or Django.

So, fine friends, offer your suggestions in the comments on what you think we should do, and, if possible, any links that may help me on my wonderful journey.

***edit at 3:10 p.m.***

Looks like we did have caching on. It appears the SQL queries were so CPU-intensive that they locked up the server’s processes, which caused MySQL to bypass the query cache, and then the complex queries monopolized the CPU. Just thought I’d clarify that after talking with my comrades.
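If we end up caching at the application layer instead of leaning on MySQL’s query cache, the idea is roughly this. A hypothetical sketch — `cached_query` and `run_query` are made-up names standing in for whatever actually talks to the database, and the same pattern is what Memcached or a framework’s cache layer gives you for real:

```python
import time

# In-process cache mapping a SQL string to (timestamp, result).
# In production this would live in Memcached or similar, shared
# across web processes; a plain dict only illustrates the pattern.
_cache = {}

def cached_query(sql, run_query, ttl=30):
    """Return a cached result for `sql` if it's fresher than `ttl`
    seconds; otherwise execute `run_query(sql)` and cache the result."""
    now = time.time()
    hit = _cache.get(sql)
    if hit and now - hit[0] < ttl:
        return hit[1]          # fresh enough, skip the database
    result = run_query(sql)    # expensive: actually hits the database
    _cache[sql] = (now, result)
    return result
```

On an election night, even a 30-second TTL means an expensive results query runs twice a minute instead of once per page load, which is the difference between a bored server and a melted one.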

7 comments on ‘Servers on fire’

  1. jkokenge says:

    Varnish your Rails man. Ask @thejefflarson about Rails scalability re: the Dollars For Docs app. Varnish your Rails.

  2. MatthewJuhl says:

    It sounds like most of your problems can be solved by getting your cache working. If your DB is getting that many hits, it’s bound to fail. Memcached works like a charm if you use it well.

  3. thejefflarson says:

    Things I like in this world:
    1. Varnish (our config: https://gist.github.com/58d4ceb0b5cd54500b04)
    2. Google Refine
    Things I dislike:
    1. Feudalism

    You’re fighting the good fight man.

  4. Aron says:

    Ouch. Been there, done that… This isn’t a middleware problem; it’s a caching problem, as noted above. Running live database queries under load is a recipe for disaster. And MySQL will cache queries BUT only in succession. You can see this yourself.

    Try an expensive query, run it again. The second time it will run super fast. Then try running the expensive query, run a completely different query, and then go back to the expensive query… you’ll see the difference.

    The bottom line is to cache. This is where a framework (Rails/Django) would come in handy, because it makes this very very easy.

  5. Andy Boyle says:

    Thanks for all the help, you fine folks.

  6. [...] a caching plugin. If something I write here goes viral, I’d like to avoid watching my server fry. One way to head this off is to set up caching so that every page load isn’t executing a [...]
