Log in

No account? Create an account
LiveJournal's Backend's Journal
[Most Recent Entries] [Calendar View] [Friends]

Below are the 19 most recent journal entries recorded in LiveJournal's Backend's LiveJournal:

Monday, November 22nd, 2004
3:58 pm
LISA slides
I almost forgot to post these.... My slides from the LISA talk I did last week:


(LISA = Large Installation Systems Administration)
Friday, October 29th, 2004
3:22 pm
persistent connections
Perlbal supports HTTP persistent connections now, so persistent connections you get.

LiveJournal's felt damn fast today as a result (except when a DB exploded).

Next up: HTTP/1.1 chunked responses, when needed. (which is harder, because once we speak HTTP/1.1 we have to understand 1.1 requests, which I'm not sure we're quite ready to do....)

For now almost all of our responses have content-lengths, though, so chunked responses aren't really needed.

Props to marksmith for the persistent connections support.
Monday, October 18th, 2004
1:14 am
MogileFS transition
As of tonight, all userpics, phoneposts, and captchas are now stored on our MogileFS file storage system.

Our old system, while well-intentioned, was pretty cheesy and lame technically. It was never meant to be used for long ... it was mostly just a crutch until we figured out what we really wanted to do.

Here's a snapshot of our MogileFS installation at present. We have 6.14 TB free. And if that's not enough, we have 10 machines on-hand that could store 1TB each if we run out of room. We'd just need to throw 4 hot-swap SATA disks in them.
lj@grimace:~$ mogcheck.pl
Checking mogilefsd availability... ... responding. ... responding.

Device information...
  hostname     device   age    size(G)       used       free    use%  delay
      sto1       dev1   56s    224.319     15.022    209.297   6.70% 0.004s
      sto1       dev2   56s    229.161      9.337    219.823   4.07% 0.004s
      sto1       dev3   56s    229.161      9.273    219.888   4.05% 0.005s
      sto1       dev4   56s    229.161      9.308    219.853   4.06% 0.004s
      sto1       dev5   56s    229.161      9.271    219.890   4.05% 0.013s
      sto1       dev6   56s    229.161      9.409    219.752   4.11% 0.009s
      sto1       dev7   56s    229.161      9.305    219.856   4.06% 0.005s
      sto1       dev8   56s    229.161      9.342    219.819   4.08% 0.004s
      sto1       dev9   56s    229.161      9.298    219.862   4.06% 0.007s
      sto1      dev10   56s    229.161      9.245    219.916   4.03% 0.008s
      sto1      dev11   56s    229.161      9.334    219.826   4.07% 0.004s
      sto1      dev12   56s    229.161      9.281    219.879   4.05% 0.005s
      sto1      dev13   56s    229.161      9.364    219.797   4.09% 0.006s
      sto1      dev14   56s    229.161      9.295    219.865   4.06% 0.008s
      sto2      dev15   10s    224.319      9.342    214.977   4.16% 0.004s
      sto2      dev16   10s    229.161      9.317    219.843   4.07% 0.006s
      sto2      dev17   10s    229.161      9.394    219.767   4.10% 0.005s
      sto2      dev18   10s    229.161      9.387    219.774   4.10% 0.005s
      sto2      dev19   10s    229.161      9.236    219.925   4.03% 0.004s
      sto2      dev20   10s    229.161      9.312    219.849   4.06% 0.006s
      sto2      dev21   10s    229.161      9.211    219.949   4.02% 0.005s
      sto2      dev22   10s    229.161      9.312    219.849   4.06% 0.010s
      sto2      dev23   10s    229.161      9.231    219.930   4.03% 0.004s
      sto2      dev24   10s    229.161      9.370    219.791   4.09% 0.006s
      sto2      dev25   10s    229.161      9.305    219.856   4.06% 0.008s
      sto2      dev26   10s    229.161      9.243    219.917   4.03% 0.013s
      sto2      dev27   10s    229.161      9.264    219.896   4.04% 0.009s
      sto2      dev28   10s    229.161      9.326    219.834   4.07% 0.004s
                total         6406.817    266.336   6140.481   4.16% 0.173s

Those top two lines are checking on the mogilefsd trackers... they're the servers that keep track of where all the files are at. They're actually just a protocol translator in front of the same MySQL database. And if that database goes down? Well, then we'd be screwed. That's why the database is currently on really nice hardware. But the real plan going forward is to use MySQL Cluster, which we'll be using for our global master DB as well. Then there'd be no single point of failure at all.

Oh, and the MogileFS info shown above is for all of livejournal.com, pics.livejournal.com, and picpix.com.... when you make your MogileFS client object, you just specify what domain you're using. For instance, "danga.com::fb" (for fotobilder) or "danga.com::lj" (livejournal). Then you can have identically named files in all namespaces that don't conflict.

If anybody's interested in using MogileFS, we'd love to help you set it up. Join the list and ask away.
Thursday, October 7th, 2004
11:52 am
megaraid explosion
As many of you may already know, the weakest link in LiveJournal's architecture is our "global master" database. We separate our databases into "globals" and "users". We have by far tons more user databases... and they're generally setup master-master, so there is no single master that can fail and kill us.

But the "global" databases aren't setup like that. They're master-slave, with about 5 slaves doing various things. If the global master fails, we're screwed.

That's why the global master is on really nice hardware... we don't want it to fail.

Now, we're moving to putting the entire global database on MySQL Cluster so it's spread between a bunch of machines and entirely in memory, but we're not there yet.

Last night at about 2:25 am, the megaraid2 driver in Linux 2.4.28 bit it, spewing errors all over. It was a bitch and a half to recover from, but I think we finally finished up about 8 am this morning. (lisa did most the work) Luckily once the global master came back up we could run on that without any slaves for a while since it was low-traffic time. Getting the slaves back up was tedious, but easy.

This, folks, is a perfect example of why I'm still not happy with our architecture. Our global master needs to be on MySQL cluster. We could even do shared disks and two identical global masters, but the failover between them, and the possibility of either or both corrupting the filesystem and tablespace isn't comforting...

In the meantime I'm going to be studying the changes in the megaraid2 driver between Linux 2.4 and Linux 2.6 and seeing who else has seen this sort of problem.

Fun fun fun....
Friday, October 1st, 2004
3:56 pm
database update -- new machines, 64-bit, innodb
After a month and a half of vendor and motherboard hell, we now have six new 64-bit database machines on their way:

Two dual 64-bit Intel Xeons (EM64T)
Two dual 64-bit AMD Opteron 246s (2.0 Ghz)
Two dual 64-bit Intel Itanium2 (1.4Ghz 1.5MB cache)

That's a total of 12 new 64-bit processors... the first 12 we've had.

Why is this notable? Because now our user clusters can run InnoDB well. We already run InnoDB on our global machines, and it kicks ass, but we've stuck with MyISAM (which is lame, but has its benefits) on the user clusters because 32-bit machines don't give a single process enough memory to run InnoDB the way we would've liked.

See, InnoDB maintains all its own caches in-process, whereas MyISAM only caches indexes, and the kernel caches data pages. On a 4GB or 8GB box on MyISAM, you can get 2GB of indexed cached in-process (because you only have 3GB of user address space on a 32-bit machine) and the rest of the memory on the box is used by the kernel to cache data pages.

But with InnoDB you only have that 2GB for everything... data and indexed. Sure, the kernel can still help out, but they step on each other's toes.

Plus InnoDB uses twice as much disk space as MyISAM, so MyISAM won there. And MyISAM is easier to sysadmin. But MyISAM has table-level locking, which sucks, but can be mitigated by having multiple databases per machine. (and with memcached it's not a big deal... only sometimes) In a nutshell, MyISAM's worked well enough for us so far, and we've come to be able to deal with (or tolerate) its deficiencies so far. But that's not to say we've been happy with MyISAM.

Anyway, we've been holding out for 64-bit for awhile now, waiting to run InnoDB effectively. Soon we'll be able to.

Also, the new machines feature:

-- 8, 12, or 16 GB of memory
-- twice as many disks in the RAID 10 as we've done in any other machine (instead of RAID 10 on 4 disks we'll have RAID 10 on 8 disks or 10 disks). and that's in addition to the RAID 1 for the operating system and DB logs volume

So it's all very exciting. Can't wait to get all the users moved to this new hardware.
Sunday, May 2nd, 2004
8:19 pm
Copy of lj_maintenance post....
Sorry all... site's slow. :-(

It's not from me doing work earlier. It's been slow the past few Sunday nights because that's our peak point of the weak (people ending their weekend in the US, and people in Europe/Russia getting into work on Monday bored)

There are two main reasons something can be slow:

-- not enough CPU (your Pentium or AMD or G5 or whatever is over-worked)

-- disks not fast enough (like you open a program and hear the harddrive grinding away for a few seconds)

Our current problem is not enough CPU ("CPU-bound") as opposed to the latter, "IO-bound".

Anyway, we ordered 8 new webservers, which should all be able to do more than our current fastest machines (which do 134,000 requests/hour). In the last hour we did 2.8M requests, while we were limited by CPU. So 8 new guys helping out should bring our capacity up at least another 1,072,000 requests/hour, but probably quite a bit more. That's some good breathing room for now.

Lisa's on vacation for the next couple days, but then we can start installing the new servers, assuming they're ready then.

We also go on crazy profiling/optimization binges whenever this happens, but we've kinda tuned everything we can for now. I have a few more ideas, but they're not things that can happen before the new servers come in.
Tuesday, April 27th, 2004
3:27 pm
Fun stuff lately in server land...
Presentation I did at MySQL conference in Orlando:


Building a distributed filesystem for Fotobilder/LiveJournal (will be open source):


We just bought 2 machines with 16 250GB disks, so we'll soon have 8TB of storage. I imagine we'll get about 6TB of real storage out of that after redundancy. (thumbnails and scaled versions will only be on disk once, probably, since they can be recreated easily....)

Building a new load balancer for FotoBilder/LiveJournal, with special support for mixing efficient buffer of mod_perl requests and for efficiently serving large files (using sendfile(2)) from disk, so mod_perl doesn't have to do it:


The proxy works already w/ FotoBilder. Haven't put it into production yet, but we rebooted all our LiveJournal proxies into Debian testing w/ epoll.h headers so we could build IO::Epoll (which is a requirement for Perlbal). They were already running Linux 2.6 (for epoll)
Wednesday, March 3rd, 2004
11:48 pm
more machines arriving soon
As an update to our earlier CPU problems, we're picking up our four new web nodes tomorrow. They're burning in tonight.

A few days ago we also traced down waves of global blocking (CPUs going to idle, no processes working, backlog of HTTP connections building up) to a misconfiguration of sorts related to how we have Akamai requesting userpics from us. Things got a lot nicer after changing that but we're still hitting the CPU limits during our busy times.

We're looking forward to getting the new machines online. It won't take long at all once we get 'em... they just netboot and start working.
Sunday, February 29th, 2004
11:02 pm
good news, bad news, and more good news
Been busy/stressed, haven't had time to post....

Good news: database load is all pretty evened out. We haven't been disk-bound for weeks now. We have two of the clusters running master-master, and a 3rd master-master cluster on its way (ordered last week or so). Then we'll clean off two of the existing DB clusters to the new ones (much more powerful) and ugprade the old guys.

Bad news: we're CPU bound again. (it comes in waves between CPU and disk bound)


Good news: we have 4 new web nodes on the way (which do 6x more web requests/hour than our oldest ones) and we've also been profiling code and rewriting/fixing stuff up to be faster.

We're now logging for each web request the CPU time used, as well as memory growth and shared memory decline. So we can do queries against our access logs like, "What were the 50 most CPU-heavy requests in the past hour?" or "What are the top 20 CPU-heavy codepaths on average?". We've been having tons of fun with that.

Our logging has got a lot better lately. We're making a tool (which we'll release to the other LJ sites when we're done) which will do all the common queries to check for:

-- evil/dumb spiders
-- attacks
-- anon comment spammers (though this doesn't matter so much once we flip on the anon-comment human-test code....)
-- CPU/memory outliers
-- slow/popular codepaths
-- etc...

And while it does that, caches subresults for ranges of time, so incremental real-time queries of the above become possible, and we can be automatically paged whenever the next brain-dead spider hits.

As for the anon comment spam: Mahlon wrote code to do the image/audio human tests when any IP does more than 1 anonymous comment in 'n' minutes. So any comment spammer would only be able to get in 1 anonymous comment spam for viagra or indian porn before they'd have to start proving they're human and not a script, slowing them down, probably making them go to another site. We'll be turning this one once we're around to watch it... probably tomorrow? It's been ready for a week now, but we've been letting other new code cool, making sure there were no problems, and it all seems to be going fine. (except for the CPU shortage)

Anyway, my apologies for the CPU problem... at least once we get them they can go online immediately... no warm-up or transition period like with new database servers.
Tuesday, January 6th, 2004
5:29 pm
Networking Advice Needed
If you're a networking guru, please read and help us:


Comments disabled here. Please reply in the lj_dev post. Thanks!
Monday, January 5th, 2004
8:35 pm
Upgrade plans
I thought I'd post a brief update on tonight's upgrade, and plans for the next couple weeks and months....

Tonight: We finally did our big global master swap. "gm" (for lack of a better name) was our global master. It was pretty beefy at the time, but it was starting to suck. We bought a new machine, "gm2" (again, because we're original) to take over its job. But because we've had good hardware go bad within the first month in the past, we no longer make important hardware go live on the real site for its first month or so. Instead, we break it in doing a non-important but similar function to what it will be doing. So, for gm2 we made it be a global slave. It totally kicked ass: twice the memory, raid 10 instead of 5, faster disks, huge raid cache card. Over time, gm continued to suck more, and gm2 continued to be idle. So over today we did the swap. We started by re-parenting all the other machines to have gm2 as their master (the email databases, the directories, the other global slaves). So until an hour ago we had:

gm -> gm2 -> (everything else)

Then after we posted to lj_maintenance we told gm to block all new writes, then switched the active master in the ljconfig to be gm2. Then we stopped gm, backed it up, and copied over 9.1GB from another global slave (all the mysql files and master.info, etc) and started it up, parenting it as well.

So now:

gm2 -> (everything else, including gm)

Now the plan is to upgrade gm to be beefy like gm2, as well as c01m (the cluster 1 master), since they're identical hardware. They'll be switched to the fancy raid card, more memory, faster disks, raid 10, etc. Then they'll be paired up in a master-master config. (we're in the slow process of moving users from cluster 1 to pork/chop (cluster 8)) Once 1 is empty we can upgrade it all.

Then we'll be emptying all the other users of old-style clusters to new-style clusters and upgrading their hardware and db configs.

And along the way we'll get gm2 a buddy, and make that master-master too, in case gm2 were to die. (the other global slaves might cut it, but it's questionalbe.)

We're also always moving queries/data off the global cluster onto per-user clusters, which makes the site scale easier.

That's it for now.
Monday, December 15th, 2003
5:29 pm
raid over memcache
By now I assume most people in this community are familiar with memcached which we use to accelerate our databases.

We have 41 GB of memcache memory in use, over about a dozen hosts. Unfortunately, one of those machines is 10 GB of that. We lost that guy this morning, so 25% of our memory cache went offline and the site took a performance hit. Any other machine going down (512MB - 4GB) probably wouldn't have made a big enough blip to notice.

So we're thinking about letting memcache be configured in pairs of machines, so any machine going down can't hurt the overall memcache hit rate. Of course this'll waste memory, but memory is cheap and easy to admin.
Wednesday, December 10th, 2003
4:00 pm
New cluster type: master-master
If you'll recall from the pretty diagram, user database clusters are composed of a master and one or more slaves. The slaves are read-only, and just replicate writes from the master. So if the master goes down, the entire cluster is read-only until the master is fixed. That sucks.

And with memcache, we hardly do reads anyway, so the slaves are useless.

So, we're going to be slowly converting all our database clusters to be master-master. Each DB cluster will have two identical machines (or nearly identical) and they'll replicate from each other. But to avoid mid-air collisions, only one will be the "active" master at a time.... either "a" or "b".

Advantages to new system:

-- if a machine dies, we can dynamically fail over to the other machine in the pair. nobody goes read-only. (and no more single points of failure!)

-- every night (or whenever) we can switch the active database and do maintenance: DB optimizations, backups, kernel/mysql ugprades, etc. most imporantly, this means that when we move users off a cluster, we can do full deletes/packs (optimize table) of the data on the inactive db, without worrying about fragmentation hurting over time. (which we discovered the hard way a year ago or so)

-- no half-powered slave machines. all DB machines are beefy and reliable. (currently the slave machines are RAID 0.... twice as likely to die. all new DB clusters will be RAID 5 or 10, with redundant power, etc)

-- no idle slaves due to memcache. sure, one machine at a time will be idle, but idle with a purpose. looking at it over a month period, both will be equally idle as they take turns.

We've just checked in support for the master-master setup. Very few code changes were necessary. We changed the DBI::Role role names from "clusterN" and "clusterNslave" to be instead "clusterNa" and "clusterNb", then another config option says whether "a" or "b" is active for N. The automatic failover part will come later.

To prevent against problems with transient active swaps the failover daemon will take care to only do it at a certain rate, but just in case it goes too fast, or we go too fast, the only potential scary conflict case is LJ::alloc_user_counter(), which allocates a unique ID for a given (userid, domain), where domain is like "journal entries" or "comments". Now, that function uses memcache as a guard against potential race conditions between two masters when one is lagging during a flip. (but in practice, mysql replication hardly lags, especially when it gets no traffic whatsoever, and the current code won't be giving it any traffic unless it's active)

We're going to be putting the new code live and testing our first master-master cluster with some fake test users, then we'll start moving tons of people over to it. The plan then is to empty all the other clusters, one at a time, then upgrade their hardware when they're empty, then reconfig them to be master-master, and empty another old-style cluster over to the new empty master-master one, and so on until everything's converted.

Then Lisa, Nick and I will be able to sleep soundly, knowing the site will fix itself when a DB cluster master fails. (tho so far cluster masters have been pretty good, since we only get top-quality hardware for them.... but it's inconvenient not being able to maintenance on them without disturbing users).

Anyway, I'm excited. I hope you are too. :-)
Tuesday, November 25th, 2003
2:40 am
Anti-spam efforts
We're very much aware of all the blogspam people have been getting lately. I guess this is just the next wave of the net falling apart. Oh well. We'll all live.

We've started efforts to combat it, though. Patches are here and here.

So far we're doing all blocking by hand, but as of a couple hours ago we have the automatic spam detection emailing us now, whenever it thinks it found a spammer. If it doesn't find any false positives in the next day or so, we'll turn it on fully automatic mode, and spammers won't be able to do much damage. (around a dozen spams, instead of 10,000+ as we've been seeing sometimes before we get to it....)

If the spammers get smarter and really want to infiltrate LiveJournal then we'll continue with the arms race and fight them. We don't want blogspam any more than you.
Wednesday, November 19th, 2003
11:41 am
Performance / reliability work
Enjoying the site speed today?

We've been doing a lot of work this week on performance and reliability:

Slow queries
We've been doing a ton of query analysis and removing/rewriting ancient database queries which aren't efficient and dragging down the DB servers and thus holding up the rest of the site. Back in the ol' days we didn't have to worry about performance much, so a lot of stuff worked then when we were small, but doesn't work today. We fixed it all by either temporarily disabling it, replacing the queries with APIs we have now but didn't have them (which are all memcached-aware), rewriting the queries, or breaking it up into multiple efficient queries, often with a new table schema.

Anyway, we've got all the really bad stuff. There's a few somewhat inefficient things left that need to move from the global cluster to user clusters (and in the process have their schema changed), but they're not hurting much now.

New webservers
We put 4 new diskless web nodes online, each a Dual Xeon 3.0Ghz 1MB cache with 2GB memory, all running Linux 2.6.0-test, and showing 4 virtual processors each (because of hyper threading). We also "fixed" our previous newest 4 machines (Dual Xeon 2.4 Ghz, 512k cache, 1GB memory) by putting 2.6 on them which made hyper threading work. (for some reason 2.4 wasn't auto-detected HT on the Xeons... some APIC issue which is fixed in 2.6 but not 2.4)

New load balancing
We're now completely using our home-grown load balancing system (CVS/ljcom/src/rewrite-balancer/) which quickly adapts to the availability of all the web nodes. Before we were only running it on one of the proxies. But avva fixed the last remaining issue with it, so now we can use it everywhere.

Coming up....
We have even more hardware coming in: a new insanely powerful user cluster master, and an equivalent machine with slightly less disk space for the new global master.

We also want to convert half our internal network to gigabit to lower db/memcached latency.

And we're working on a rewrite of the memcached perl client module which is quicker.

And because of our new load balancing system, we'll soon be able to put new code live atomically without users getting errors for 5-10 seconds while everything restarts. Basically all web nodes will be in one of two partitions, and the partitions will go half down, upgrade to all newest code, restart, and we'll switch the active parition, fix the other half, then bring all up. But no BML/library version mismatches or proxy connect errors.... and we won't have to think about dependencies when putting code live incrementally. It'll be nice.

And just to let you know, we always prioritize speed and reliability over new fun features. At least, those who are in a position to do so, only work on speed/reliability. Some developers continue working on fun stuff in the background, but it doesn't go live until the site's in a stable, fast state and everybody's available to review the code and help get it live.

Thursday, November 13th, 2003
9:57 pm
How PhonePost works
So here's how PhonePost works:

-- We have an interesting deal with a really cool VoIP ("Voice over IP") phone company (VoicePulse) to route calls in to us. VoicePulse is largely on the east coast now, hence the phone number distribution, but they have ambitious growth plans, so we'll get new numbers as they grow.

-- VoicePulse normally does the complete opposite business.... letting people use regular phones over VoIP to other regular phones.... nothing about Asterisk to end-users, but they're Asterisk masters, so we approached them, and found they were really cool and wanted to help.

-- We pay them 'n' cents per minute to route calls from real phones in all their markets to our VoIP server sitting amongst LiveJournal's other servers. (We could also get an 800 number for twice the cost per minute, but then the economics of the whole pricing structure change a bit...)

-- Our VoIP server is just a normal box (3GB memory, dual P3 1Ghz, scsi raid array) running Asterisk, The Open Source PBX.

-- Asterisk is really cool.

-- Asterisk defines "AGI", the Asterisk Gateway Interface, in the spirit of CGI. It can invoke an external program to talk to the main asterisk server via stdin/stdout. We launch a little bitty Perl script which guides people through the menu system. The Perl script uses the trivial Asterisk::AGI module to make AGI even easier.

-- When people hang up during a call or press # and choose security, we drop a "qinfo" (queue info) file alongside their .wav file that was recorded, tell the user the message was queued, and hang up.

-- Another process running on our PBX box scans the spool directory every few seconds, looking for .wav files with .qinfo files. The phonepostd (phonepost daemon) then:

-- encodes them to .ogg (or potentially .mp3)

-- allocates a blob ID for that user, registered in the "phonepost" domain.

-- puts the file on the NetApp

-- makes an LJ post containing the phone post tag and posts it

-- deals with failures at any stage, logs errors, and retries later if there's a problem.

All this is totally isolated on its own box, so it doesn't interfere with other LiveJournal features.

Let me anticipate some questions:

-- 800 number: perhaps in the future. We'll see how many other numbers we can get first.

-- MP3: we'll see. we'd feel better making .ogg popular, though. we'd have to charge more for mp3 support (like $1 or $2/month) because mp3 is evil and encumbered with patents and licensing issues.

-- international support: it works. Caller ID step is a little ambiguous though. You might just have to enter your whole number (whatever you entered on the phonepost manage page.... even 88888 would work.... it's just a login)

-- international numbers: we might get UK numbers.

-- direct SIP connections to our Asterisk box: considering it.

-- How login works: A unique tuple (phonenumber, PIN) maps to an account. People can have the same phonenumber (or arbitrary string of numbbers) but the pair can't be shared by any user. If you have 5 accounts, just use a different PIN for each.

Um, think that's it.

Friday, November 7th, 2003
11:49 am
Pretty network diagram
xevinx, our new LiveJournal designer, has put together a graphical representation of LiveJournal's backend. (This was no easy task given he had to take all the information Brad and I threw at him and try to make it visually comprehensible.) Its pretty self explanatory if you pay attention to the key, but feel free to ask questions or make comments. Also, please share with those not reading this community that might be interested.

We'll probably refer back to this in the future when we're explaining some specific part of our network. Going forward we may have more diagrams for specific elements of our system when we want more detail, if enough users find this helpful.

Our network, in pastels...Collapse )
Thursday, November 6th, 2003
10:08 am
Current bottleneck: global master
Now that users are equally loaded between database clusters, the user databases are nice and zippy. And with userpics off, and posts/comments compressed, they won't be hurting for diskspace so quick anymore either.

The current bottleneck is the "global master". Global refers to non-user-specific data, like the global registry of usernames and userids, friends edges, payment data, and things which we haven't "clustered" yet. (we use the verb "clustered" to mean, "moved to the user clusters") The "master" part means it's the primary database in the cluster, where all writes go. Because it's the master, it's always guaranteed to be the most up-to-date (the slaves can lag by a second in the common case, or even a minute or so if things are really screwed). Because it's always caught up, we use it to populate memcached. If there's a memcached miss, we go to a master (user master or global master) to find the latest results and populate. (with locking, where necessary)

The global master's database is pretty small. It's like less than 10GB of active data. As such, it's never been a problem.

Recently, though, we finally started to hit the machine's I/O limit. (it's just on a SCSI 10k RAID5)

So, we need to do three things:

-- remove things from it, moving them to user clusters. these things include polls, friends edges (store two copies of the edge... one for the source user on the source user's cluster, and the other edge on the destination user's edge), S1 styles (which is acutally done for S1, but not all users are migrated from dataversion 4 to 5), S2 styles (which isn't even memcached!), etc

-- memcached more. (pretty much just S2 metadata)

-- faster I/O for gm. we've been looking at a whole spectrum of SSD products (solid state disks). since the database is so small, we could put the whole damn thing in memory. problem is, all SSD vendors overprice their stuff, since SSD stuff is so cool. 64GB fibre-channel drives cost around $102k. in general they're about $2k per GB. We'd only need 16 GB, probably, once we clean up some stuff. (we're only at 13 GB total now). we're also benchmarking different filesystems. ext3 data=journal with the journal on NVRAM (like a umem.com PCI card) looks promising, but benchmarks are inconclusive so far. i'm having trouble finding a benchmark that simulates our workload well.

So, that's the plan.

The easy solution is to throw money at the problem, which we'll probably do to some extent (we totally hate disks... so damn slow), but we'll also be moving all the remaining stuff from the global cluster to user clusters and memcaching more, to spread load out. (we originally meant to do everything, but it's a pain, so we only do bits at a time....)
Wednesday, November 5th, 2003
7:25 pm
Hello users. If you don't already know me, I'm your friendly Livejournal Systems Administrator. I'm going to start this community with some more detailed information about userpics, as its been a hot topic of late. I'll be answering specific questions about userpics, CDNs, Akamai or other related items in the comments, so if there is something I leave out or needs further explanation please ask. Going forward I see this community as an area where more technical (and less policy) type of conversations are had, so I, as well as our other sysadmin, nbarkas, will try to keep up with questions here.

Userpics live in the databases much like any other text or user data. At the most basic level, when pics are uploaded they go in to the database, and when a request is made to view a userpic the webserver pulls it out of the database. However, making a database call every single time a userpic is requested is a lot of load on the databases, and that affects all the other db requests. As userpics aren't dynamic, we use caches to minimize the number of times a userpic request hits a database.

The Past:

Prior to about a month ago, we were off loading bandwidth from our Seattle facility by having 2 caches at an east coast facility (voxel.net) serve all userpic traffic. Requests for files on userpic.livejournal.com would go to one of these servers, and they would first try to serve the file from their local disk cache of images. If the file wasn't there, the request was simultaneously redirected to our servers in Seattle where the image could be served from the database as well as stored locally on disk cache, so the next request for that image could be served locally. This worked pretty well, but we were frequently running in to problems of disk space and network problems with the provider. This is when we began looking in to using a CDN (content distribution network) and purchasing hardware to allow us to have userpics stored on an independent disk array rather than the databases.


As of a month ago, userpics are now served over a large network of caches maintained by a 3rd party company. Originally we were using Speedera for this, but are now using Akamai. What essentially happens with Akamai is similar to the setup we had previously: userpic.livejournal resolves to Akamai's network, and they handle sending the request to the "closest" cache they have on their network of thousands. This may not necessarily be the closest cache geographically, but rather the one determined to give you the
fastest response time measured separately through a number of variables. If the cache you hit doesn't have the image stored locally, it will try to find it on other predetermined caches on their network, and if its still not found at that time the request is then made to us. This cuts down the amount of times userpic requests hit us at all, which gives us more room to handle all the other data requested of us.

Going forward:

Now that we have a Network Appliance disk filer, we are moving userpics out of the database entirely and instead storing them on the filer. This work has started already and several sub clusters now have their userpics stored there. This almost totally eliminates database overhead for userpics. When new userpics are uploaded, they're put on the filer. When requests from Akamai are sent to us for userpics, they're served off the filer.

I wanted to document this in part because there have been a number of people who have thought that our selling extra userpics this week has negatively affected site performance in general. As you can see, with the measures put in place over the last couple of months, userpics are no longer a strain on us as far as the databases go. Databases are typically the main cause for site slow-down. By moving userpics out of the database, and having Akamai handle serving the requests, we are accomplishing both increased and more reliable availability and performance of userpics for you, as well as decreasing the number of operations our database servers are doing.
About LiveJournal.com