Brad Fitzpatrick (bradfitz) wrote in lj_backend,

megaraid explosion

As many of you may already know, the weakest link in LiveJournal's architecture is our "global master" database. We separate our databases into "globals" and "users". We have far more user databases, and they're generally set up master-master, so there is no single master that can fail and kill us.

But the "global" databases aren't set up like that. They're master-slave, with about 5 slaves doing various things. If the global master fails, we're screwed.
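To make that failure mode concrete, here's a rough Python sketch of a master-slave cluster. It is not our actual code: the class name, the host names, and the rule that reads fall back to the master when no slave is up are all assumptions for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class GlobalCluster:
    """Hypothetical model of a master-slave "global" cluster (illustration only)."""
    master: str
    slaves: list = field(default_factory=list)
    master_up: bool = True
    live_slaves: set = field(default_factory=set)

    def pick_writer(self) -> str:
        # Every write goes to the one master, so it is a single point of failure.
        if not self.master_up:
            raise RuntimeError("global master is down: no writes possible")
        return self.master

    def pick_reader(self) -> str:
        # Prefer any slave that is currently up.
        candidates = [s for s in self.slaves if s in self.live_slaves]
        if candidates:
            return random.choice(candidates)
        # Assumed fallback: with no slaves available, read from the master.
        return self.pick_writer()
```

Losing slaves only costs read capacity, since reads can land on the master. Losing the master stops writes entirely, and in this sketch reads too once no slave is left.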

That's why the global master is on really nice hardware... we don't want it to fail.

Now, we're moving to putting the entire global database on MySQL Cluster so it's spread between a bunch of machines and entirely in memory, but we're not there yet.

Last night at about 2:25 am, the megaraid2 driver in Linux 2.4.28 bit it, spewing errors all over. It was a bitch and a half to recover from, but I think we finally finished up about 8 am this morning. (lisa did most of the work.) Luckily, once the global master came back up we could run on that without any slaves for a while, since it was low-traffic time. Getting the slaves back up was tedious, but easy.

This, folks, is a perfect example of why I'm still not happy with our architecture. Our global master needs to be on MySQL Cluster. We could even do shared disks and two identical global masters, but the failover between them, and the possibility of either or both corrupting the filesystem and tablespace, isn't comforting...

In the meantime I'm going to be studying the changes in the megaraid2 driver between Linux 2.4 and Linux 2.6 and seeing who else has seen this sort of problem.

Fun fun fun....