LISA: So can I do a quick mic check? Is that good? Can you hear me? What a setup! I have to make you laugh now. How awful. I'm VP of engineering at Stride, which is a consultancy here in New York, and before Stride I was at The Guardian newspaper for many years. Pretty much all of my disasters have been caching disasters, and I'm going to tell you about the worst ones.

Pretty soon after I started at The Guardian, the website began going down at lunchtime. Traffic wasn't particularly high -- no higher than the week before -- so there was no particular reason it should be going down every day. A bunch of us got together and looked at the cache stats, and what we found was that the cache was being completely cleared right before lunchtime, right before the site went down. At that point The Guardian was serving about 2,000 pages per second on 12 servers, and the only way you can do that is by relying very heavily on caching. If you take the cache away, the servers simply can't cope with the load.

So great, we knew the caches were being cleared -- but why? We started going through the logs, and we noticed that right before the caches were cleared, someone had submitted a poll. You know, on a burning issue like who's got the best hair, Jennifer Aniston or Justin Bieber. The poll showed the results as numbers and percentages, and there was a bug: 25 people have voted for Justin Bieber, I'm the 26th, I click Justin Bieber, and the number doesn't go up by one -- because of caching. So the bug fix that got developed was: we're going to clear the cache, and the numbers are going to go up. Which meant that at peak time, someone submitted a poll and the website went down.

We learned two very important lessons from that. The first is that caching is important -- monitor it. Spending a week to work out that it was the caching is kind of embarrassing. The second is: never build a clear-the-cache system. Partly because what you're really building is a big red button that somebody can push to take down your website, and if you have that button, someone is going to push it. And also because caching is complicated, and you don't want a complicated cache. If caching is really important to you and really hard to test, then you want it to be as simple as possible.

The other very important lesson is to cache for the smallest amount of time you can get away with. Quite often when you ask people about caching they say, "I can cache this for a long time -- the longer the better." But you have to think about how long it's okay to serve stale data: if you cache for a day, you're potentially serving bad data for a day.
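To make that "smallest amount of time you can get away with" idea concrete, here is a minimal sketch of the kind of cache being described: a short, fixed TTL and no clear-everything button. The class name and the 60-second TTL are illustrative assumptions, not The Guardian's actual code.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// A deliberately tiny cache: every entry expires after a short, fixed TTL,
// and there is no "clear everything" method to lean on, so stale data fixes
// itself within one TTL. (Names and the TTL below are assumptions for
// illustration, not The Guardian's actual code.)
final class ShortTtlCache<K, V> {
    private record Entry<V>(V value, Instant expiresAt) {}

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final Duration ttl;

    ShortTtlCache(Duration ttl) {
        this.ttl = ttl; // keep this as small as you can get away with
    }

    V get(K key, Supplier<V> loader) {
        Entry<V> cached = entries.get(key);
        if (cached != null && cached.expiresAt().isAfter(Instant.now())) {
            return cached.value();                    // still fresh: serve from cache
        }
        V value = loader.get();                       // expired or missing: recompute
        entries.put(key, new Entry<>(value, Instant.now().plus(ttl)));
        return value;
    }
}

// Usage (hypothetical): poll results can be at most a minute stale, and nobody
// needs a big red "clear the cache" button to make the numbers go up.
//   ShortTtlCache<String, String> pollResults = new ShortTtlCache<>(Duration.ofSeconds(60));
//   String html = pollResults.get("poll-42", () -> renderPollResults(42));
```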
And so, of course, we built a three-day cache. This was a good idea, right? We had some truly terrible queries -- queries that built up lists of articles on the website, and that went looking for really old articles that didn't change very much -- and the three-day cache was there to protect us from them. Every now and then, one of those articles would get taken down from the website, deleted for legal reasons, and when that happened a message would go out to the servers saying: just remove that one article from the cache.

Then came a time when that message got missed. An article got deleted, but the three-day cache didn't know it had been deleted. This was a Hibernate cache, so what it hands you back is objects -- and if you think Hibernate can fail gracefully when you ask it to load something that's been deleted, you're wrong. Our whole website needed Hibernate to load, so at this point half the website was showing "sorry" pages. That would have been bad enough with a one-minute or a five-minute cache -- we'd still have been showing "sorry" pages for five minutes -- but it would have fixed itself automatically. This was a three-day cache. We couldn't show "sorry" for three days.

Luckily it was 4:00 in the afternoon, so just clearing the cache didn't take the website down -- that was a good start. But all those truly horrible queries we'd been trying to protect ourselves from with the three-day cache then ran all at the same time, and that was terrible for the website. Cache poisoning. The lesson here is: fix your SQL. Don't try to paper over terrible SQL with caching, because at some point it's going to go horribly wrong, and the cache is not going to save you.

Anybody know this guy? This is the creator of WikiLeaks, and in 2010 we had a live Q&A with him on the website. By then our caching was much simpler: a one-minute, really simple HTML fragment cache. So we were feeling good about ourselves -- everything is great. The live Q&A kicks off, and pretty soon response times are terrible and getting worse and worse, and weirdly there's a pile-up of database queries. It was really strange. In fact it got so bad that the rest of the website was in danger, because the comment system was synchronous: if the comment system is slow, the rest of the website slows down. Ordinarily the comment system is the least important part of the site, so you could just turn it off. But we'd been told by editorial that for this Q&A, the comment system was the most important thing on the website. So we turned everything else off to keep the commenting system going. The rest of the website went stale, but we stayed up, and at the end of it we were like: what the hell? Why didn't caching save us? Why was that so terrible?

We went to look at what was actually being served from the comment cache, and what we discovered was that for every comment, a database query was still being run as part of serving it -- so effectively we were doing one database call per comment. That is a terrible idea. If you're serving from cache, you can serve very, very quickly; but any processing you do per piece of cached content is going to slow you down, and it's not going to work. If you're doing a database call for every bit of content you serve, you're done -- your cache is not going to save you. Really: only cache static content.
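To make the "only cache static content" point concrete, here is a rough contrast sketch. The names and the little CommentStore interface are assumptions, not the actual Guardian comment code: the broken path still makes one database call per comment even when the page looks cached, while the fixed path caches the finished HTML fragment so a cache hit does zero database work.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Contrast sketch with assumed names (not The Guardian's actual code).
// A real fragment cache would also expire entries after a short TTL,
// as in the earlier sketch.
final class CommentFragments {
    /** Minimal stand-in for the comment database. */
    interface CommentStore {
        List<Long> commentIdsFor(long articleId);
        String loadCommentHtml(long commentId);          // one DB round trip per comment
        List<String> allCommentHtmlFor(long articleId);  // one DB round trip total
    }

    private final Map<Long, String> fragmentCache = new ConcurrentHashMap<>();
    private final CommentStore store;

    CommentFragments(CommentStore store) {
        this.store = store;
    }

    // Broken: looks cached, but every request still hits the database per comment.
    String renderPerCommentLookup(long articleId) {
        StringBuilder html = new StringBuilder("<ul>");
        for (long id : store.commentIdsFor(articleId)) {
            html.append(store.loadCommentHtml(id));      // DB call for every comment served
        }
        return html.append("</ul>").toString();
    }

    // Better: build the static fragment once, then serve it straight from memory.
    String renderStaticFragment(long articleId) {
        return fragmentCache.computeIfAbsent(articleId, id -> {
            StringBuilder html = new StringBuilder("<ul>");
            for (String comment : store.allCommentHtmlFor(id)) {
                html.append(comment);                    // DB touched only on a cache miss
            }
            return html.append("</ul>").toString();
        });
    }
}
```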
And I'm going to finish with a woolly rat. You see, everybody has a woolly rat -- at BuzzFeed it was the dress. It's a piece of content that suddenly goes crazy viral and drives crazy traffic. It could be days or months after the content was initially published, but suddenly it's everywhere. This was ours: a story about a species of rat, and it went crazy. The good thing about this kind of content, though, is that yes, it's crazy traffic, but it's crazy traffic to one piece of content, so it's actually a brilliant test of your caching system.

Your cache should eat this for breakfast. In fact, if you get this kind of crazy traffic and your servers aren't dealing with it -- hey, check your cache, something's gone wrong. For us it didn't work with the woolly rat, because our cache was manual: someone had to turn it on when we knew we were about to get high traffic. There was no one around to flip the switch, and suddenly everyone was being paged to come and turn it on. So after that, we made sure the cache was automatic: once response times slowed past a certain threshold, the cache turned itself on.

You may have noticed that all the dates in this talk are a few years old. That's partly because I don't work for The Guardian newspaper anymore, and it's also partly because we learned our lesson. We made our caching much simpler and cached for shorter times, and so far we haven't had any more caching disasters. So in conclusion: do these things, and you too will avoid Great Caching Disasters. Thank you. [ Applause ]