Monitoring

2014-06-27

I have some datastructures exerpiments I really want to finish, but I've been busy starting a new job (and moving) and haven't a had a chance.

So, in the meantime, lets talk about distributed system or cluster system monitoring.

The goal of "monitoring" is:

So, how does one accomplish this? I have found through operating services myself, and watching other folks to do, that you basically always want the same things. Front-end, back-end, large or small. We have two tools at our disposal to acheive these goals. Whitebox and blackbox monitoring. Whitebox monitoring is when we get the data directly from the system, we're treating the system as an open box that we understand, and asking it to tell us what it thinks is going on. Logs are a good example of whitebox monitoring (though not one I like to use much). Whitebox monitoring is in contrast to blackbox monitoring, where you have no idea about the internals of the system. Instead you have a system completely external to your systems "probing" that system. Often this means it's acting like a user or client, pinging your servers, and watching things like latency and error rates.

So, Given our 2 tools lets go through the 4 goals above and talk about how to accomplish each.

1) Insight into the health of the system.

This is very straightforward. Your goal is to have 3 to 5 metrics that tell you if your system is working. If none of those metrics is out of threshold, then (in lieu of a known issue), you can generally assume the system is stable. I'm going to call these metrics your "key metrics". Some groups use the term "SLOs", though that term conflates this concept with client communication which I'd rather not do here.

You want these key metrics to cover as much of the system's behavior as possible, but you have a competing goal of them being be simple and easy to understand. The first goal is important because you don't want to make sure something on that front-page of graphs gets wonky when your system gets wonky, if not you'll miss problems. The second is equally important though because a single red light saying "IT'S BAD!" doesn't tell you much, and thresholds are hard to set, if you can't reason about what that metric means. When that metric is out of threshold you should be able to easily understand the impact it has, so you can decide how to proceed.

Note that getting these statistics is complex. Basically all systems have 2 relevant properties that you want to know about. 1) is it doing it's job 2) is it doing it fast enough. We can get each of these as whitebox or blackbox. Well, whitebox is better, since it reflects what the user sees right? So lets just use whitebox for both! Well, no. Here's why that's a bad idea. In general your probing is going to be a very small portion of your traffic. You usually can't afford to probe every interesting query all the time, so your users may be making different queries than your prober is. Between these two this means whitebox monitoring is usually more noisy, less granular, and likely to miss special cases. Blackbox of course suffers from not representing your users, or networking and such connecting you to your users. As a result you usually want both whitebox AND blackbox monitoring... optimally you probably want both for BOTH metrics.

For all your key metrics you want to avoid choices that cause them to change dramatically as your system scales, or as load scales. For example saying "we want to make sure users are always hitting our site" and alerting if your hits per second drops below a constant is guaranteed to alert every new years eve, and every world cup. You may need something like that, but keep these to a minimum, instead if you can use things like error rates as a fraction of total queries. Ratios are great.

Latency is weird. If you have 10 million queries per second flowing through your system, you don't want to alert because one was too slow. On the other hand you really care if some are extremely slow, or most are kindof slow. Because of this you often want to pick a couple of percentiles, maybe 99'th and 50'th (unfortunately this does depend on the scale of your system, but only very loosely), and alert on their latency being high. I'm going to cheat a bit and count those as one metric, since you can easily put them on one graph. And I'm making the rules anyway :).

Think about what your system does, and make sure your metrics are representative. If 90% of your queries are of one type that's super-cheap and fast, and 5% are expensive and incredibly slow, maybe you want to break those out into seperate metrics. You really can't do this process blind, you have to look at your system, what metrics you can get reasonably, and what they will tell you.

2) Insight into what's impacting the health of your system

Key metrics tell us whether our system is healthy, and give a 10,000 foot view of how it's unhealthy, but not a clue at all as to why or where to look. This is what the rest of your metrics are for. Here is where you go crazy, the more metrics the merrier. That said, piles and piles of metrics don't really help you if you can't find them. Think about what you want to know while debugging a given problem, and what metric you would want to help dig down and see what's wrong, or prove that something is or is not the issue.

For example. I wake up at 2am to a page saying that the 99'th percentile latency is high on my webservice. This webservice is backed by OpenTSDB sitting on HSpace on HFS and the whole thing is running on EBS backed EC2 instances. What do I do? If I've set up measurement of latency and compute percentiles *per component*. I click on a link by my latency graph and it takes me to a breakdown of latency per component and per query type. I look and I see that operations with *'s on the first parameter are crazy slow, that is whole-table scans at the HSPace layer. Every layer is saying things are stupid slow, so the other breakdown is useless today. Well, I think maybe it's HSpace so I look click on a link for that and I see that one tablet is slow. Huh, now I log into the EC2 instance backing that tablet and find that the EC2 instance takes 30 seconds to authenticate my ssh connection... well shit, my EC2 instance is probably getting hosed by a competing workload, I can dig around on the machine and maybe I'll find the competing workload from someone else is blowing all my cachelines... so I go buy a larger nicer machine, and tomorrow I'll see if I can get dedicated machines or something. I file a bug to get on that and go back to bed. (To be clear I've never run OpenTSDB, HSpace, or HFS, I just wanted an example with a relatively deep stack behind it)

That's ideally how you want debugging to look. It takes a LOT of work and a LOT of metrics to make things that smooth - and most of the time it won't be nearly that nice, but the closer you get the nicer it'll be.

3) To have insight into how these compare to historical values.

It's third quarter. My manager comes to me and asks what I need for budget next year. I ask around and find out that we're taking on a new large client that's half again the size of our largest client. After getting numbers it turns out that it's actually about half the size of our largest client... but that still means their query load is 20% of our total on average. Additionally I dig into their use-case and find out that it matches that of another client... okay.

So, I dig up the client that their use-case matches and look at their historical query load on various parts of our system. I take the peaks over the last 2 months and compare that to the average to get an idea of spikiness. Peaks are about 300% of normal, and occur at 12 noon. I compare that to the system as a whole and find that the spike is the same as that in the system overall. Damn, that sucks.

So I take our system as a whole, match it to a growth function, and based on that predict our traffic 4'th quarter next year due to organic growth. Then I check that function against our representative client and find that it matches. So I scale the represenative client up to the scale of our new client, add that to the other curve, and I've got our required capacity. From there we backsolve to how much of each resource we need, maybe adding a few percent slop here and there for systems that don't scale linearly, a little extra headroom, and the like. In short, you spitball it, but not until you get some backed numbers.

This sort of solving requires history, and it requires being able to query and analyze that history. I've repeatedly tried to build a generalized tool for capacity planning and have yet to succeed, in fact I've yet to succeed in even building a specialized tool for a specific task. If anyone knows of any I'd be very interested, for now I do it with ad-hoc queries similar to the process described above. Again the above is not a real scenario, but it maps closely to the process I have used, and will use again, when capacity planning.

4) To receive notification if your system is, or is about to be, unhealthy.

And now to the last bullet point. The one that all operations engineers hold near and dear to their hearts, and yet hate with a passion... Alerting.

We have out key metrics, so obviously we want to get notified when those are out of whack. One could argue that the key metrics are all we need for monitoring, and ideally this is the case. That said, the world we live in is never ideal, and the reality is that our key metrics are almost certainly going to miss some cases. Also, key metrics tend to be designed to tell you about user impacting issues. What about issues that you know are going to be user impacting? You can surface these sooner if you alert on them directly, rather than waiting for them to impact the key metrics enough. Examples of these are, part or all of the service is simply absent to our monitoring, our monitoring itself is noticably broken somehow, we are missing capacity that we're supposed to have and are just lucky that we're not in a peak load. All of these are clearly interesting pagable events, even though the key metrics are looking healthy.

There's another interesting set of cases as well. Our key metrics going out of whack are almost certainly pagable events. What about little niggling things? Things that are wrong, but we really don't want to get paged for. Things that may not show up in the key metrics until several things go wrong at once. For example, lets say that once we lose over 10% of our machines things start going south because we'll be out of capacity if we lose a few more. Or our system is supposed to be N+2 but we're at 95% of capacity before we become only N+1. Systems are constantly segfaulting, but never enough to actually cause a user-impacting problem. These are *interesting* events, and we want to hear about them, but don't want to get woken up in the middle of the night. For these events you want some sort of notification, or some kind.

Summary

So, in my view, that's what monitoring is for. That's what we're trying to accomplish. With that in mind my next post is going to be about tools. Since starting at Meteor I've been researching all of the tools available outside of Google, and I have to say that I'm a bit disappointed. I expected awkward kludgy tools, but I expected the tools to be able to do the things I needed. I'll go into first describing what we need to accomplish the goals listed here, and then talk about some of the extant systems and how they do or do not fall short.

BTree arity continued

2014-06-06

Remember our graph from last time. This is the time taken for 20 million random elements to be inserted and removed at each arity, in a tree of simple integers.

I got curious and wanted to see what this looked like if you use the tree as an actual dictionary structure. This stores a std::pair as each element. Note that again this is in terms of arity. In case you are curious, I was pretty lazy when I ran the first test and my computer was doing all sorts of things. I was having enough trouble interpreting the data on the second test that I ran it overnight with not much else in the background. That's probably why it's so much smoother.


This test was done on the same machine as before, a Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz, ubuntu desktop, gcc 4-8-1-10ubuntu9. Linux kernel 3-11.0-19-generic. Note further that it has 8GB of ram, 64 byte cachelines, and 128k of L1 cache.

So what conclusions can we draw from this? Honestly, I'm really not sure. But, I will make some educated guesses

First of all lets talk about what we're changing. As we walk down the tree we are in effect doing a binary search. Similarly as we search for an element within the node we are also doing a binary search. So, changing arity doesn't change the search at all.

There are two properties that should change. first is caching effects: We get better locality when we don't have to switch nodes as often. Second is the linear moving of data, as nodes get bigger inserts and removals into a node, as well as splits and joins, require more and more shifting of the elements. So our graph should be a graph of these two competing properties, caching gets better as nodes increase in size, but that linear copy cost increases as well. Note that when we do shifts and splits it is not only the array of data, but also the array of pointers to child nodes that has to be shifted. I actually just realized I could probably speed up my tree by not doing this move on leaf nodes, but we do it in the tests shown here.

So, on the left side where the latency drops precipitiously we're getting very clear caching wins. Within a cacheline or two the shifting has basically no cost. One or two cycles to move an element is nothing compared to a cache miss. Then we kindof hit the bottom. Note that it starts to level out at about 25 elements. For the first graph it levels out at about 55 elements. Both occur where the node is about 200 bytes in size, or about 4 cachelines. It's no-where near L1 cache-size when it levels out, so it can't be that it levels out due to L1 cache-size. It must be something else. The additional cost of shifting alone (ignoring caching) is perfectly linear, so it can't explain an elbow like that. My guess is actually that it's related to the non-locality of binary search, as we make it larger at this point we're not getting appreciably fewer cache-misses because since we split off an average of a quarter of the array each time, chances our that past this point our first cache-miss (and cache-line load) will never get another hit inside of it. I'm sure there's some complex math that explains why this is an inflection point, but you can see why something is qualitatively different as you cross that size boundary.

The right side starts to climb again at about 90 elements in one case and 180 in the other... or ~720 bytes. This is around 12 cachelines in size. I have no idea why this inflection point occurs. It's a much weaker inflection so it seems likely that it really is just the linear effects finally overriding caching benifits. That's the best I can guess right now.

In any case... I thought it was neat that storing pairs didn't cost us *that* much in speed compared to storing sets, and that due to the inflection points being in nearly the same places when measured by bytes per node we can demonstrate that what we're seeing are almost certainly caching related effects.

Again all tests are on github.

Sound multiplexing in Linux

2014-05-31

Linux's default sound system is pretty bare-bones. These days the go-to solution for sound drivers is Alsa. Alsa is just enough to get you the ability to get sound in and out of the machine, and adjust whatever the hardware supports. On thing it does not do by default is multiplexing. That is, only one application can make sound at a time.

This is a problem, and a problem many people have solved many ways. The result has been a plethora of different sound daemons and standards to go with them, jack, esd, pulseaudio, etc. The idea is you run a daemon that talks to the hardware (via the kernel) and everyone else sends it's sound stream through the daemon, which mixes it for you.

Now, when I set up my laptop with ubuntu-server, what I got by default was pulse-audio. My laptop has an interesting sound card feature, the sound-card can play out of both the headphone jack, and the speakers, at the same time. Often plugging in headphones cuts out the speaker, but not on this machine. Instead there are seperate volume controls for the two outputs.

That's all well and good, but pulse-audio is designed to need minimal configuration, and similar to so much software of it's ilk, that means it's not configurable. As a result it has different ideas about what should happen when I plug in my headphones than I do (in particular, it likes to crank the volume WAY up). It does mute the speakers, but I find the changing of my volume to ear-bleeding to me unacceptable. Since I couldn't change this, I removed pulse-audio.

I use my laptop to listen to music, as well as sometimes to receive phone-calls via gmail, skype, etc. These applications require sound, and without multiplexing I can't hear the phone ring if I'm listening to music. So, I needed a new multiplexer.

In comes dmix. dmix is alsa's own multiplexing solution. It's built right in, no need to use some other protocol, or some daemon that's too smart for it's own good. It's not shiny, it's not featureful, but it's simple and works. To make it the default you just edit the file "/etc/asound.conf". I'm not going to go into details on how, there's plenty of pages out there on how, but should you want this basic feature, without some heavyweight solution, give it a try.

Update

Someone requested some references, I hadn't bother as I had little trouble finding them but I may have gotten lucky. So here we go

Complexity vs. Constant Factors

2014-05-30

If you're formally trained, you've seen lots and lots of asymptotic analysis and the like in Computer Science.

In an earlier post about trees I noted one way you can be led astray by asymptotic analysis when I compared AVL trees and Red-Black trees. The question there is whether the part of the work people analyze is the part that makes up the most work.

There's another simpler way it often goes wrong though. Constant factors *matter*. A computer only has so much memory, and so much hardware. Every problem that can be solved on a given computer can be solved in 2**number_of_bits where number_of_bits is all storage on the machine. After that the machine must cycle to an already met state. That number is *extremely* large, (and in point of fact likely takes all the energy in the universe to cycle through) but it is finite.

It turns out that small local operations on arrays can still be VERY fast, particularly because the general locality helps with caching. A BTree is a way of taking advantage of these speedups, but within the small finite space of your node size.. that is, your arity. This makes BTrees MUCH faster than either AVL or RedBlack trees, despite the asymptotic analysis being identical to RedBlack trees.

I was curious about exactly where this tradeoff lies for BTrees. A btree stores many elements in each node, and usually keeps the elements sorted within that node in an array. This means inserts and removals from a node require shifting everything over in the node - a linear time operation in the size of the array, but B-tree nodes are of finite size, so still O(1). Anyway the number of elements in a BTree corrosponds to the number of children it has (-1 actually), so we can call it the "arity". So, I tried running my benchmark (same one used in the earlier tree experiments) at varying arities to see where our efficiency peaked. Note that these results are for 32 bit integers. The hardware and other dependencies are: Dell XPS 12 convertable laptop/tablet with: Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz, ubuntu desktop, gcc 4-8-1-10ubuntu9. Linux kernel 3-11.0-19-generic.

It appears to peak around 70 elements. The peak is pretty wide though so it's hard to be sure given the noise in the data. This was done with a single trial on a laptop that wasn't necessarilly perfectly isolated. For details on the exact experiement, and ability to run it yourself see



This isn't a surprising result at all, it's about what I expected. My guess had been between 100 and 200, which isn't far off.

Now if you read my post about real-time http://www.blog.computersarehard.net/2014/05/worst-case-vs-average-case.html You probably just stopped to wonder about this algorithm.... Yup, you can bet that the worst-case goes up a bit as we add more elements even while the average might be dropping. So, it's not clear what the best ARITY to use would be. Given no information about use-case there's a pretty clear cliff where we get a huge win by increasing the arity. So, not knowing I'd be tempted to set it at maybe 55. This will get us basically all of the gains of increasing arity, but keep the worst-case and variance comparitively small.

Consider the minimum there, ~25 seconds or so. The best run I've seen out of AVL is ~49 seconds. That's nearly a 2x difference. Even at Arity 5 the btree destroys the AVL tree at only ~39 seconds.

Conclusion: Constant factors matter, and can matter a lot. What's neat is that we still keep our asymptotic bounds TOO, so on both big and small problems we can be fairly sure BTree will keep beating the AVL tree by the same factors. I think that's pretty neat.


Worst case vs. Average case

2014-05-28

For years and years everyone's been focused on the average case and amortized analysis. Quicksort is most programmers favorite sort. Splay trees are popular in some groups due to their amortized performance, etc.
I'd like to propose an alternative. I would like to propose that for most problems this view is incorrect, and that we should be focused on the worst case performance rather than the average. The reason for this is that ALL code is real-time.

The definition of realtime is:

A program is realtime if it must complete it's task within a specified real world time bound T.

If I'm playing a videogame and 1 out of every 100 frames takes longer than a 100'th of a second to render I can visually see a flicker/stall in the video. As a result the videogame looks crappy and I hate playing it. This is a clear case where our software is realtime. If it takes longer than a certain amount of time, then it's not performing correctly.

Now lets say I go to visit a website, I sit there for 5 minutes waiting for it to load and it never does. So I give up on the website close the browser and never visit it again. Clearly this websites behavior was incorrect, and as a result it caused the website to lose a user. So, part of the correctness criteria for that website is a realtime bound. That websites server code is real-time!

Worst case vs. Average case

If I need my code to meet a real-time criteria, always, then I need to make sure that it's worst-case is less than that bound. Thus, we want to choose algorithms with low variability in time, and possibly a worst average case, in exchange for ensuring that the worst-case falls within our limitations.

This concept is especially true in distributed systems. In a normal system if our worst-case is highly unlikely it won't impact us often at all. On the other hand if rendering the website the user wanted requires hitting 100,000 servers, then it's pretty likely that that worst-case that only happens 0.0001% of the time will trigger on ONE of those servers. Examples of this include search-engines which fan out the request to servers each of which is responsible for part of the space of all data being searched.

Surprisingly this even comes up on back-end systems. Lets say I'm building an enormous dataset on a large cluster. I do this once a day.  If I'm doing a mapreduce and 1 out of every 1000 shards of my mapreduce is running really slowly, then my total mapreduce time is bound on those worst case shards, not the average. And I have a realtime bound of 1 day (probably less, since it'd be nice if I had a little leeway).

On top of that, while waiting for those couple of shards to finish their work we're probably waisting the rest of the cluster's computational ability - it's just idle. We could try and cram in other workloads, but this gets really complicated. It's much easier if you can flatten every workload as much as possible.

Lies and Deceit

I've just told a lie. In practice on really large systems it's often actually the 99.99% latency (or some number of 9's anyway), and you can may be able to just throw-out/time-out on the other cases. You probably can't do this at every layer though! So, as you distribute and add more layers to a system you still need to worry more and more about the wost-case performance.

Note for wankers, I'm using O not big theta because it's easier to type, and while the distinction would be formally useful here it's not widely used.

The point

When we start thinking of algorithms this way, the cost of mis-sizing a hashtable is an O(1) operation suddenly jumping to an O(n) operation so we can resize. Our quicksort looks pretty bad at O(n). Balanced trees, and the extra memory needed for mergesort or the extra moves for heapsort start to look a lot more tempting.

That is not to say average case, or even better amortized analysis, is not useful. If you do 100,000 operations on your dict, and all you care about is the time bound of that total set of operations, but this is only true if no-one else depends on the intermediate results.

So think about the real bounds of the code you are writing, and whenever someone says they want something to be "faster" stop for a second ask yourself if worst case or average is really what needs to improve.