Monitoring, post2, what we want

2014-07-04

In my last post I talked about what we're trying to accomplish with monitoring http://www.blog.computersarehard.net/2014/06/monitoring.html
I think it's hard to see what we need in a tool without some hard examples of what we want to compute. So, here are some examples of the types of things we want to compute. The discussion in that previous post should be sufficient to motivate why each of these metrics would be interesting and useful.

Percent errors returned to users, overall and broken down by error type, across the entire service and per server
99'th 95'th and 50'th percentiles, of total system latency, per component latencies, and per server latency.
Average and 95'th percentile CPU/disk/memory utilization, over the last quarter.

Rates and sums

Lets start with percent errors returned to users:

This is basically error_rate / responses_rate. What we have though are discrete query response events that we're counting. Chances are we can't afford to send or record metrics for every response, so instead they are going to get bundled somehow on a periodic basis, another discrete event. This means we don't really have a continuous function that we can simply take a differential of, instead we need to take a period of time that includes several samples and compute a differential over that - so we have a few points to work with.

This means our rate isn't just a rate it's a rate with a period parameter. In general you want to make as many decisions as possible in your monitoring system, rather than your application, so you can easily change them without re-releasing your software. So we really want to set this period parameter in our monitoring.

This means that we don't want to export a rate, instead we want to export a constantly incrementing counter. We can then compute a differential over an arbitrary period post-facto in the monitoring system. Thus getting a 1 minute, 5 minute or 20 minute rate as we prefer. This period acts like a low-pass filter, the larger it is the more it "smoothes" the jumps in your rates. For the sake of example lets say we have a pretty large system with high query rates, and we want fairly low resolution and high sensitivity to quick changes, so lets go with a 5 minute rate. So now we have:


 percent_errors = rate(errors, 10m) / rate(responses, 10m)

Now we want to compute this over our entire service, which looks like:


 rate(sum_over_servers(errors), 10m) / rate(sum_over_servers(responses), 10m).

Dimensionality of data

We also want


 percent_errors = rate(errors, 10m) / rate(responses, 10m)

for each server as well. That way once we see that the error rate shot up, we can tell if it's a particular machine causing our problems.

And we want


 percent_errors = rate(specific_error, 10m) / rate(responses, 10m)

So we can break down what problem is being passed back to the user.

So, there's several interesting things going on here. We basically have 2 dimensions. We have servers and error types. We *could* write out every one of these equations across both dimenisions, but that would be a LOT of equations, one for every error type, and one for every server... and wait! we probably want one for every error type and server combination! Even worse, if we add or remove a server our rules change. This doesn't sound at all like how a properly lazy software engineer approaches a problem.

Instead of describing every calculation we do, we want to describe each category of calculation. To compute the error rates for each I basically want to say "do this calculation over every error type". In haskell terms this is something like a list monad. If you're used to matlab it's like operating on matrices. I'm going to describe a bit of a formalism here, not because it's the only one that would work, but to try and clarify the problem. To accommodate this new "parallel computations" model we can think of every timeseries as being described by an unordered list of labels. That is a dictionary, struct, or record, depending on your favorite terminology. So for example 404 errors on server 10 might look like this:


 {response_type: error_http404, server: myhost10, property: response}

Using this model we can now drop a key... say


 {server: myhost10, property: response}

To request everything that matches the two keys we do supply. This is like an array of timeseries, a 1 dimensional matrix. Thus this gets us all of the response_types for myhost10. If we drop 2 keys we'll get a 2 dimensional matrix, etc. Great, so now we can do something like


 percent_errors = rate({server: myhost10, property: response}, 10m) / rate(sum({server: myhost10, property: response)}, 10m)

But we still have to write this for every server. If we try and write:


 rate({property: response}, 10m) / rate(sum({property: response)}, 10m))

It all falls apart. We end up summing over all servers and all errors for our divisor, while our quotiant is calculated per server. Dividing these makes no sense at all! To solve this, we need to tell "sum" what it should sum over. As it turns out, our result is now going to be arrays on both sides. So, we also probably need to give "/" some clue how to match up those two arrays, so it knows what to do if they aren't identically sized or something.

Group By

I've actually never used SQL, but it seems this is by far the best terminology out there for I'm describing here. I finalized realized the connection a couple of days ago while talking with Jess, my girlfriend, about the problem I was trying to find a monitoring system to solve. It turns out the solution to the problem describe above is something SQL folks call a "Group By" clause. The idea is to say "retain this set of keys", while you collapse over all others. So for example:


 sum({property: response}) Group By server

Would calculate the sum of all response_types, but wouldn't sum across servers and would thus return a 1 dimensional matrix, an array, or results, one for each server.

Group by isn't usually used in this context, but we might as well use it in our pretend formalism since we already have it as an operator. For binary operators lets just say that it uses the key you define as the variable it matches on the two sides. So to fix our calculation above we get:


 rate({property: response}, 10m) / Group By server rate(sum Group By server ({property: response)}, 10m))

Obviously this is a bit messy with both infix and prefix operators, and our Group By clause as an addendum to each, but I didn't want to change from the initial syntax to much, and wanted to leverage people's understanding of the SQL concept...

So, what have we found so far?

We've noticed so far a few things that we really need for our monitoring system

We need multiple levels of highely flexible aggregations... in other words, we need full mathematical formula support
We need to compute on sets of data all at the same time, so we don't write hundreds of rules that change when we turn up new servers
We need this "group by" concept, somehow, built into our language again so we don't have to write rules over and over again

I'd like to add one more note which is that our dictionary syntax is cute, but misses one a point. What if we wrote down these tags in an ordered list:


 property:response.response_type:error_http404.server:myhost10

Then, we could use regular expressions to parse out our tags. A query for all response types on server 10 would look like this:


 property:response.response_type:\.*.server:myhost10

Note that this syntax is actually *more* general than our previous syntax, since we can also match on just parts of labels, so for example we could do this:


 property:response.response_type:error_\.*.server:myhost10

Now we're selecting only for response_types with an error code. Before we would've had to change the export of our variables to get this data into a separate dimension, now we can get new dimensions on the fly whenever we want them. This isn't great of course because the syntax is obnoxious for general use (and probably for the implementation as well), but it's definitely a useful property of a system to be able to pick out pieces of a name add new dimensions on the fly like this.

histograms and percentiles

Percentiles and percentiles.

Most systems give you the ability to compute a percentile over a set of variables, so for example:


 percentile({property:response, response_type:error_http404})

That's great and all, but it's not usually what you want. This is a percentile of variables, but frequently you want a percentile of *events*. That is something like a latency percentile. You can't represent every query as a variable or everything goes heywire and your index space explodes far too large to store. In fact, frequently you can't even afford to write data down for every query. Instead you're probably going to bucket your query latencies into a histogram. This only gives you an approximation, but it can be a good one if you're careful about your histogram selection. E.g. even sized buckets are probably not what you want, since latency is theoretically unbounded upwards. Instead you probably want exponential bucket sizes, so your resolution is relative to magnitude.

Bucketing into a histogram gives you another advantage. Given a percentile latency for each machine, you can't aggregate these into a percentile latency for all the machines combined. Percentiles just don't work that way. To be able to do this calculation you need a lot more information about the distribution. With histograms you can sum your histograms across all your servers, then compute an approximate percentile for all your servers. Simple if histograms are reasonable to work with in your system.

There's one more trick about percentiles. This doesn't relate directly to monitoring tools, but I would be remiss not to mention it while on the topic. For a two stage pipeline ->A->B->, you cannot use a histogram of the latency of A and a histogram of the latency of B to compute the histogram of the latency of A->B. The reason is that the latencies of these systems aren't guaranteed to be uncorrelated. In fact, since latencies frequently depend on the exact query, they are far more likely to be correlated than not to be. This fact itself is the type of thing your trying to see with your monitoring. To properly measure the latency of this system, and be able to break it down and look at each piece, you need to measure the latency of A, the latency of B, and the latench of A->B seperately, export *each* as a histogram, and compute your percentiles across each. It sucks, but it's mathematical reality.

Calculations on history

If I want to know how many resources my system is using there are a lot of things I want to look at. But I'm probably interested in our peak (or some percentile thereof), our average, and how much it varies. Maybe if I spend some engineering effort I could flatten out my usage. Is it worth it? How much money would that save in resources (or rather, how much growth-space would that give us, without having to buy more).

Another great example of this is computing quarterly SLO rollups. An SLO is frequently measured in "9's" of uptime. Well, systems aren't really either up or down. They can be in-between. In fact this is what our metric we were discussing earlier measures, the error rate of our system when it is serving users. Given this we probably have an SLO that looks like "5 9's 4 9's of the time". Meaning that we should have 99.999% availability 99.99% of the time.

At the end of a quarter we want to see how well we've been doing. So we're going to take our original key-metric, threshold it at 99.999, and then ask how often it was above or below that threshold.

There are almost certainly other ways to do this, but by far the most obvious is to compute on historical data. We want to be able to graph history, but we also want to be able to look at history numerically to help us pull out trends, and examine the past to help predict the future. There's a whole chunk of monitoring that fundamentally is all about modeling, and modeling is all about looking at the past.

Alerts come from same system as graphs

This is just common sense. When I get paged, I want to go look at the data that paged me. I want to see a history of it and look for the event that tripped the alert. Remember that by the time I'm looking the event is likely already over, so history is all I've got. It may happen again, but I want to fix it *before* that happens, after all, that's my job.

Data that is kindof similar to the data that paged me doesn't cut it. I want exactly the data that paged me, so I can be absolutely certain of what's going on.

When debugging a system you often have hunches, and one of the major purposes of monitoring is to give hard data to either confirm or deny those hunches. Monitoring systems are complex and often need debugging in their own right. Keep it simple and easy to examine.

Configs are stored in config files

I would've thought this was obvious, but looking at the extant monitoring systems, it apparently isn't. This is a general principle, but I'm going to bring it up here anyway.

When trying to build stable systems, the simplest solution that does the job is the best. Config files are simple. If something goes wrong in production and data gets lost, I've got the config file right here in a versioning system. If I notice wonky behavior and want to look for recent changes, again I have a versioning system. If I want to generate a config, generating a flat-file is easy and automatically idempotent.

Could you get these properties from a database? Sure you could, but now things are complex. Your synchronization could go wrong, and then your monitoring is not doing what you think it is.

Realtime alert and graph data latency

We need soft realtime constraints around say... two minutes at worst, for data to make it to alerting. Data making it to graphs can be delayed by another couple of minutes without causing issues, I'd probably put that maximum acceptable delay for hitting graphs at around 5 minutes. Smaller would clearly be nicer.

Conclusion

Lets review the properties discussed we've said we need:

Multiple levels of highly flexible aggregations... in other words, we need full mathematical formula support
Computation on sets of data all at the same time, so we don't write hundreds of rules that change when we turn up new server. This also requires the GroupBy concept, or an analog.
Usable percentile and histogram support (or ability to write it)
Alerts coming from the same system as graphs
Configs stored in config files
Realtime alert and graph data latency

I thought I would get to the tools this time... but describing what we're looking for took an entire post. So in the 3'rd post in this series I'll cover tools.

I'm still trying to understand some of the computational models (specifically graphite's) sufficiently to write that post, but hopefully I'll lock it down relatively soon.

Computers Are Hard

Cluster Operations to Type Theory, Algorithms to Security