persistent connections

Perlbal supports HTTP persistent connections now, so persistent connections you get.

LiveJournal's felt damn fast today as a result (except when a DB exploded).

Next up: HTTP/1.1 chunked responses, when needed. (That's harder, because once we speak HTTP/1.1 we also have to understand 1.1 requests, and I'm not sure we're quite ready for that....)

For now, though, almost all of our responses have Content-Length headers, so chunked responses aren't really needed.
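
For the curious, the difference is just in how the body gets framed on the wire. Here's a rough sketch of the two styles in Perl (illustration only, not Perlbal internals; $sock is assumed to be an already-connected client socket):

use strict;
use warnings;

# Style 1: we know the length up front, so we send a Content-Length header
# and the connection can stay open for the next request.
sub send_with_content_length {
    my ($sock, $body) = @_;
    printf $sock "HTTP/1.1 200 OK\r\nContent-Length: %d\r\n\r\n%s",
                 length($body), $body;
}

# Style 2: length unknown up front (e.g. a streamed response), so HTTP/1.1
# chunked encoding frames each piece as <hex length>\r\n<data>\r\n.
sub send_chunked {
    my ($sock, @pieces) = @_;
    print $sock "HTTP/1.1 200 OK\r\nTransfer-Encoding: chunked\r\n\r\n";
    for my $chunk (@pieces) {
        printf $sock "%x\r\n%s\r\n", length($chunk), $chunk;
    }
    print $sock "0\r\n\r\n";   # zero-length chunk ends the body (no trailers)
}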

Props to marksmith for the persistent connections support.

MogileFS transition

As of tonight, all userpics, phoneposts, and captchas are now stored on our MogileFS file storage system.

Our old system, while well-intentioned, was pretty cheesy and lame technically. It was never meant to be used for long ... it was mostly just a crutch until we figured out what we really wanted to do.

Here's a snapshot of our MogileFS installation at present. We have 6.14 TB free. And if that's not enough, we have 10 machines on hand that could store 1TB each; we'd just need to throw 4 hot-swap SATA disks into each of them.
lj@grimace:~$ mogcheck.pl
Checking mogilefsd availability...
        10.0.0.81:7001 ... responding.
        10.0.0.82:7001 ... responding.

Device information...
  hostname     device   age    size(G)       used       free    use%  delay
      sto1       dev1   56s    224.319     15.022    209.297   6.70% 0.004s
      sto1       dev2   56s    229.161      9.337    219.823   4.07% 0.004s
      sto1       dev3   56s    229.161      9.273    219.888   4.05% 0.005s
      sto1       dev4   56s    229.161      9.308    219.853   4.06% 0.004s
      sto1       dev5   56s    229.161      9.271    219.890   4.05% 0.013s
      sto1       dev6   56s    229.161      9.409    219.752   4.11% 0.009s
      sto1       dev7   56s    229.161      9.305    219.856   4.06% 0.005s
      sto1       dev8   56s    229.161      9.342    219.819   4.08% 0.004s
      sto1       dev9   56s    229.161      9.298    219.862   4.06% 0.007s
      sto1      dev10   56s    229.161      9.245    219.916   4.03% 0.008s
      sto1      dev11   56s    229.161      9.334    219.826   4.07% 0.004s
      sto1      dev12   56s    229.161      9.281    219.879   4.05% 0.005s
      sto1      dev13   56s    229.161      9.364    219.797   4.09% 0.006s
      sto1      dev14   56s    229.161      9.295    219.865   4.06% 0.008s
      sto2      dev15   10s    224.319      9.342    214.977   4.16% 0.004s
      sto2      dev16   10s    229.161      9.317    219.843   4.07% 0.006s
      sto2      dev17   10s    229.161      9.394    219.767   4.10% 0.005s
      sto2      dev18   10s    229.161      9.387    219.774   4.10% 0.005s
      sto2      dev19   10s    229.161      9.236    219.925   4.03% 0.004s
      sto2      dev20   10s    229.161      9.312    219.849   4.06% 0.006s
      sto2      dev21   10s    229.161      9.211    219.949   4.02% 0.005s
      sto2      dev22   10s    229.161      9.312    219.849   4.06% 0.010s
      sto2      dev23   10s    229.161      9.231    219.930   4.03% 0.004s
      sto2      dev24   10s    229.161      9.370    219.791   4.09% 0.006s
      sto2      dev25   10s    229.161      9.305    219.856   4.06% 0.008s
      sto2      dev26   10s    229.161      9.243    219.917   4.03% 0.013s
      sto2      dev27   10s    229.161      9.264    219.896   4.04% 0.009s
      sto2      dev28   10s    229.161      9.326    219.834   4.07% 0.004s
                total         6406.817    266.336   6140.481   4.16% 0.173s

Those top two lines are checking on the mogilefsd trackers... they're the servers that keep track of where all the files are at. They're actually just a protocol translator in front of the same MySQL database. And if that database goes down? Well, then we'd be screwed. That's why the database is currently on really nice hardware. But the real plan going forward is to use MySQL Cluster, which we'll be using for our global master DB as well. Then there'd be no single point of failure at all.

Oh, and the MogileFS info shown above is for all of livejournal.com, pics.livejournal.com, and picpix.com.... When you make your MogileFS client object, you just specify what domain you're using. For instance, "danga.com::fb" (for FotoBilder) or "danga.com::lj" (LiveJournal). Then you can have identically named files in different namespaces and they don't conflict.
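
If you're curious what that looks like from the application side, it's roughly this (a sketch using the Perl client; the key, class, and filename below are made up for illustration, so check the client docs for the exact API):

use strict;
use warnings;
use MogileFS;   # the Perl client (newer releases call it MogileFS::Client)

my $mogfs = MogileFS->new(
    domain => 'danga.com::lj',
    hosts  => [ '10.0.0.81:7001', '10.0.0.82:7001' ],   # the trackers
);

my $key = "userpic:42:7";                 # made-up key
my $image_data = do {
    open my $in, '<', 'userpic.jpg' or die "can't read userpic.jpg: $!";
    binmode $in;
    local $/;
    <$in>;
};

# new_file() hands back a filehandle; writing to it and closing it is what
# actually stores the file ('userpics' here is a made-up class name).
my $fh = $mogfs->new_file($key, 'userpics')
    or die "trackers wouldn't accept $key";
print $fh $image_data;
$fh->close or die "couldn't finish storing $key";

# Later: ask the trackers where the replicas of that key live.
my @paths = $mogfs->get_paths($key);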

If anybody's interested in using MogileFS, we'd love to help you set it up. Join the list and ask away.

megaraid explosion

As many of you may already know, the weakest link in LiveJournal's architecture is our "global master" database. We separate our databases into "globals" and "users". We have far more user databases... and they're generally set up master-master, so there is no single master that can fail and kill us.

But the "global" databases aren't set up like that. They're master-slave, with about 5 slaves doing various things. If the global master fails, we're screwed.

That's why the global master is on really nice hardware... we don't want it to fail.

Now, we're moving toward putting the entire global database on MySQL Cluster so it's spread across a bunch of machines and entirely in memory, but we're not there yet.

Last night at about 2:25 am, the megaraid2 driver in Linux 2.4.28 bit it, spewing errors all over. It was a bitch and a half to recover from, but I think we finally finished up around 8 am this morning. (Lisa did most of the work.) Luckily, once the global master came back up we could run on it without any slaves for a while, since it was a low-traffic time. Getting the slaves back up was tedious, but easy.

This, folks, is a perfect example of why I'm still not happy with our architecture. Our global master needs to be on MySQL Cluster. We could even do shared disks and two identical global masters, but the failover between them, and the possibility of either or both corrupting the filesystem and tablespace, isn't comforting...

In the meantime I'm going to be studying the changes in the megaraid2 driver between Linux 2.4 and Linux 2.6 and finding out who else has seen this sort of problem.

Fun fun fun....

database update -- new machines, 64-bit, innodb

After a month and a half of vendor and motherboard hell, we now have six new 64-bit database machines on their way:

Two dual 64-bit Intel Xeons (EM64T)
Two dual 64-bit AMD Opteron 246s (2.0 GHz)
Two dual 64-bit Intel Itanium 2s (1.4 GHz, 1.5 MB cache)

That's a total of 12 new 64-bit processors... the first 12 we've had.

Why is this notable? Because now our user clusters can run InnoDB well. We already run InnoDB on our global machines, and it kicks ass, but we've stuck with MyISAM (which is lame, but has its benefits) on the user clusters because 32-bit machines don't give a single process enough memory to run InnoDB the way we would've liked.

See, InnoDB maintains all its own caches in-process, whereas MyISAM only caches indexes in-process and lets the kernel cache data pages. On a 4GB or 8GB box running MyISAM, you can get 2GB of indexes cached in-process (because you only have 3GB of user address space on a 32-bit machine) and the rest of the memory on the box is used by the kernel to cache data pages.

But with InnoDB you only have that 2GB for everything... data and indexes. Sure, the kernel can still help out, but they step on each other's toes.
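
To make that concrete, here's the shape of the tuning difference as a my.cnf sketch (numbers made up for illustration, not our actual configs):

# MyISAM on a 32-bit box: keep mysqld smallish and let the kernel's page
# cache handle data pages for free.
key_buffer_size         = 2048M    # in-process index cache, near the 32-bit ceiling

# InnoDB on a 64-bit box with 8GB+ of RAM: give most of the box to InnoDB,
# since it caches data *and* indexes itself.
innodb_buffer_pool_size = 6144M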

Plus InnoDB uses twice as much disk space as MyISAM, so MyISAM won there. And MyISAM is easier to sysadmin. But MyISAM has table-level locking, which sucks, though it can be mitigated by having multiple databases per machine. (And with memcached it's not a big deal... only sometimes.) In a nutshell, MyISAM has worked well enough for us so far, and we've come to be able to deal with (or tolerate) its deficiencies. But that's not to say we've been happy with it.

Anyway, we've been holding out for 64-bit for a while now, waiting to run InnoDB effectively. Soon we'll be able to.

Also, the new machines feature:

-- 8, 12, or 16 GB of memory
-- twice as many (or more) disks in the RAID 10 as we've done in any other machine (instead of RAID 10 on 4 disks, we'll have RAID 10 on 8 or 10 disks). And that's in addition to the RAID 1 for the operating system and DB logs volume.

So it's all very exciting. Can't wait to get all the users moved to this new hardware.

Copy of lj_maintenance post....

Sorry all... site's slow. :-(

It's not from me doing work earlier. It's been slow the past few Sunday nights because that's our peak point of the week (people ending their weekend in the US, and people in Europe/Russia getting into work on Monday, bored).

There are two main reasons something can be slow:

-- not enough CPU (your Pentium or AMD or G5 or whatever is over-worked)

-- disks not fast enough (like when you open a program and hear the hard drive grinding away for a few seconds)

Our current problem is the first one: we don't have enough CPU ("CPU-bound"), as opposed to being "IO-bound".

Anyway, we ordered 8 new webservers, each of which should be able to do more than our current fastest machines (which do 134,000 requests/hour). In the last hour we did 2.8M requests while limited by CPU. So 8 new guys helping out should bring our capacity up by at least another 1,072,000 requests/hour, and probably quite a bit more. That's some good breathing room for now.

Lisa's on vacation for the next couple days, but then we can start installing the new servers, assuming they're ready then.

We also go on crazy profiling/optimization binges whenever this happens, but we've kinda tuned everything we can for now. I have a few more ideas, but they're not things that can happen before the new servers come in.

Fun stuff lately in server land...

Presentation I did at MySQL conference in Orlando:

http://www.danga.com/words/2004_mysqlcon/

Building a distributed filesystem for Fotobilder/LiveJournal (will be open source):

http://www.livejournal.com/~brad/2009886.html
http://www.livejournal.com/~brad/2010534.html
http://www.livejournal.com/~brad/2010997.html

We just bought 2 machines with 16 250GB disks each, so we'll soon have 8TB of raw storage. I imagine we'll get about 6TB of real storage out of that after redundancy. (Thumbnails and scaled versions will probably only be on disk once, since they can be recreated easily....)

Building a new load balancer for FotoBilder/LiveJournal, with special support for mixing efficient buffering of mod_perl requests with efficiently serving large files (using sendfile(2)) from disk, so mod_perl doesn't have to do it:

http://www.livejournal.com/~brad/2007943.html

The proxy already works with FotoBilder. Haven't put it into production yet, but we rebooted all our LiveJournal proxies into Debian testing with epoll.h headers so we could build IO::Epoll (which is a requirement for Perlbal). They were already running Linux 2.6 (for epoll).

more machines arriving soon

As an update to our earlier CPU problems, we're picking up our four new web nodes tomorrow. They're burning in tonight.

A few days ago we also tracked down waves of global blocking (CPUs going idle, no processes working, a backlog of HTTP connections building up) to a misconfiguration of sorts in how we have Akamai requesting userpics from us. Things got a lot nicer after changing that, but we're still hitting the CPU limits during our busy times.

We're looking forward to getting the new machines online. It won't take long at all once we get 'em... they just netboot and start working.

good news, bad news, and more good news

Been busy/stressed, haven't had time to post....

Good news: database load is pretty well evened out. We haven't been disk-bound for weeks now. We have two of the clusters running master-master, and a 3rd master-master cluster on its way (ordered last week or so). Then we'll move two of the existing DB clusters onto the new (much more powerful) machines and upgrade the old guys.

Bad news: we're CPU-bound again. (It comes in waves, alternating between CPU-bound and disk-bound.)

but...

Good news: we have 4 new web nodes on the way (they do 6x the web requests/hour of our oldest ones), and we've also been profiling code and rewriting/fixing stuff to be faster.

We're now logging, for each web request, the CPU time used, as well as memory growth and shared-memory decline. So we can do queries against our access logs like "What were the 50 most CPU-heavy requests in the past hour?" or "What are the top 20 CPU-heavy codepaths on average?" We've been having tons of fun with that.
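
The access-log schema below is illustrative (our real table and column names differ), but the queries are roughly this shape:

use strict;
use warnings;
use DBI;

# Illustrative only: the DSN, table, and columns are stand-ins.
my $dbh = DBI->connect("DBI:mysql:database=logs;host=127.0.0.1",
                       "lj", "secret", { RaiseError => 1 });

# "What were the 50 most CPU-heavy requests in the past hour?"
my $heavy = $dbh->selectall_arrayref(q{
    SELECT uri, cpu_user + cpu_sys AS cpu, mem_growth
      FROM access_log
     WHERE req_time > UNIX_TIMESTAMP() - 3600
     ORDER BY cpu DESC
     LIMIT 50
});

# "What are the top 20 CPU-heavy codepaths on average?"
my $paths = $dbh->selectall_arrayref(q{
    SELECT codepath, AVG(cpu_user + cpu_sys) AS avg_cpu, COUNT(*) AS hits
      FROM access_log
     WHERE req_time > UNIX_TIMESTAMP() - 3600
     GROUP BY codepath
     ORDER BY avg_cpu DESC
     LIMIT 20
});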

Our logging has gotten a lot better lately. We're making a tool (which we'll release to the other LJ sites when we're done) that does all the common queries to check for:

-- evil/dumb spiders
-- attacks
-- anon comment spammers (though this doesn't matter so much once we flip on the anon-comment human-test code....)
-- CPU/memory outliers
-- slow/popular codepaths
-- etc...

And while it does that, it caches subresults for ranges of time, so incremental real-time queries of the above become possible, and we can be automatically paged whenever the next brain-dead spider hits.
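
The subresult caching is nothing fancy: aggregate per hour (or whatever bucket size), cache every bucket that's fully in the past, and only recompute the one still in progress. A sketch of the idea (not the actual tool):

use strict;
use warnings;

my %bucket_cache;   # finished hourly aggregates, keyed by hour start

# compute_bucket is a coderef that runs the real query for one time range;
# it's a placeholder here.
sub hourly_totals {
    my ($from, $to, $compute_bucket) = @_;
    my @totals;
    for (my $hr = $from - $from % 3600; $hr < $to; $hr += 3600) {
        my $finished = ($hr + 3600) <= time();   # only finished hours are safe to cache
        if ($finished && exists $bucket_cache{$hr}) {
            push @totals, $bucket_cache{$hr};
            next;
        }
        my $agg = $compute_bucket->($hr, $hr + 3600);
        $bucket_cache{$hr} = $agg if $finished;
        push @totals, $agg;
    }
    return @totals;
}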

As for the anon comment spam: Mahlon wrote code to require the image/audio human tests whenever any IP posts more than 1 anonymous comment in 'n' minutes. So any comment spammer would only be able to get in 1 anonymous comment spam for viagra or indian porn before they'd have to start proving they're human and not a script, slowing them down and probably making them go to another site. We'll be turning this on once we're around to watch it... probably tomorrow? It's been ready for a week now, but we've been letting other new code cool, making sure there were no problems, and it all seems to be going fine. (Except for the CPU shortage.)
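
That's not the actual code, but the general shape is just a per-IP counter in memcached; a sketch (the key name, window, and memcached address are made up here):

use strict;
use warnings;
use Cache::Memcached;

my $memc = Cache::Memcached->new({ servers => [ '10.0.0.2:11211' ] });

# Returns true if this IP should get the image/audio human test before
# its anonymous comment is accepted.
sub needs_human_test {
    my ($ip, $window_secs) = @_;
    my $key = "anoncomment:$ip";

    # add() only succeeds if the key doesn't exist yet, so the first
    # comment in the window sails through and starts the counter.
    return 0 if $memc->add($key, 1, $window_secs);

    # Key already existed: this is comment #2 (or later) within the window.
    $memc->incr($key);
    return 1;
}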

Anyway, my apologies for the CPU problem... at least once we get the new machines they can go online immediately... no warm-up or transition period like with new database servers.