In defense of phablets

2015-03-29

Convergence
A while ago I wrote a "manifesto" on mobile devices. My idea at the time was that a mobile device should be a real computer; anything less and we'd never really be happy with it. Further, I theorized that as computers got faster, eventually we would each carry a device that would serve all our (local) computational needs. This "phone", for lack of a better term, would tether to your display and keyboard at your desk, to a folding keyboard elsewhere for lack of other input devices, to a heads-up display if that was desirable, and so on. This is the ultimate convergence device. One device to rule them all. It's simple, it's small, you can carry it anywhere and interact with it directly, or tether up to a screen at work to get down to business.

At the time what I didn't understand was that very few users care whether their data is local or not. To me, the ubiquity of wireless connectivity was nowhere near sufficient to warrant putting all my data in the cloud. I often couldn't get a connection at all! Some of my data, sure, but not all of it. It still isn't anywhere near that level, but what gets built is decided by Silicon Valley, not by people living deep in the Rockies or in the rural hills of the east. If data is stored "in the cloud" (ugh, I hate that phrase), then there is no reason you *need* all these devices to be the same device. And if you don't need them to be, it's both easier and more lucrative to sell you lots of devices.

All of this is to say, I got that one wrong. Then again, at the time I made this prediction desktops were still commonplace, and it is now common to use a laptop that tethers wirelessly to an external display and keyboard, so I was at least partly right. Anyway, I don't think my vision was a bad idea, it's just not where industry is going to take us if we sit back and watch.

I hate dealing with 15 devices. I don't like buying them, charging them, storing them, setting them up, etc. The last thing I want is 8 devices to do what one device can do. As a result, I've resisted first smartphones, then netbooks, then E-readers, and then tablets. A laptop can do these things, why push around another device? But I want to do mobile computing. I want to find something that fills those useful niches, but doesn't just add a device.

Mobile computing
I have been privileged enough to get the opportunity to try a lot of different mobile devices over the years. Here, I'm taking laptops for granted, and focusing on things smaller than laptops.

In high school I had a Newton 2000 with a keyboard. That was a really great device. I carried it to school and typed up homework on it. I used it a lot. Back then most of the useful work I did on a computer was word processing, and this actually filled that need quite well. I often wrote papers and printed them off without ever using another device. In this sense the Newton was a "real computer", in that it allowed full content creation without use of another device (for the needs of the time).

Later on I inherited a Jornada 720. The clamshell form-factor was wonderful, and the 0.75-pitch keyboard was a good size to still be typeable, but not good enough for general word processing, especially with the heavy screen tipping the device over all the time. Also, Windows CE was useless. I eventually got Linux running on it, but it didn't have enough RAM to ever really work well with the Linux UI at the time, even for someone like me. It was a great idea, and came very close to being really useful, but missed.

While working at Google they gave me a G1, the first Google smartphone. It was very cool, but I immediately bemoaned that it was only kind of a computer. It was a single-user system. Word processing on it was almost impossible. It was a phone, and could do maps and such, but it was massively frustrating to know I had that computing power in there but couldn't really get to it. It was neat to own, but I never would've purchased it.

Next they gave me a Nexus 1. This device had a large enough screen to conceivably do real stuff on, but had no physical keyboard. Eventually they released a version of Android that could handle an external bluetooth keyboard and I got one. This gave me a device kind of like a tiny Newton. Android still felt like it was making itself a second-class citizen, more so than the Newton did. I never did "real" work on it because the OS was too hobbled. I eventually got a real GNU userland working on it, but even then the screen was just too small. Close, but this still didn't fill the niche I'd been looking for since I was little, either in form factor or software.

Eventually they gave me a Galaxy Nexus. Now that was a real change. By this time Android had reached its tween years. It had started to realize that eventually it was going to be a real OS, but had no idea how to get there yet. While I had this device I quit my job at Google and went on the road. Living out of a car, the combination of a bluetooth keyboard and the Galaxy Nexus was incredibly powerful. Plopping down a laptop in a cafe is a statement: "I'm going to be here a while." Setting a cellphone and a little keyboard down? Not so much. I was able to write emails in internet cafes when out of cell range up near Yosemite. I didn't need an inverter to charge it. I could charge it over and over off the car battery without worrying about killing the battery, or I could use a solar charger. I wrote a lot of email on that device. It felt kind of like I had a Newton again, but with too small a screen. It had grown up enough that I could do real work on it, finally. Still though, it felt hobbled. Posting to a blog wasn't really feasible.

Around then I gave Jess an Asus Transformer. It could run Linux dual-boot. Sadly, while a friend of ours had it working, she never *quite* got it there. This device was a really interesting form-factor: a small and light tablet, and a keyboard. She used it as her primary computing device for a year. This made it painfully clear that Android was indeed not a real OS yet. For example, every browser available for it was highly unstable when you *really* used it (not just poked at it like you would on a phone). There were websites that simply didn't work. Many websites have phone apps you can use instead on Android, but these were almost always missing critical features. Gmail, for example, to this day is still missing half the features that are available on the website. For certain things she still needed a "real" computer.

Somewhere in the middle of all that I also got a Kindle 3G. This device let me carry around a ton of books with me, yet still read them in full sun, and almost never recharge. It also gave me a backup way to check my email from almost anywhere for free. I rooted it of course, like everything else, but even with a console it's just too slow and clunky to use for anything real. It's an amazing device and I love it, but fundamentally it's not in this class of devices at all, so we'll toss it aside and discuss it no further.

One other thing I discovered on the road with the Galaxy Nexus... why is it a phone? Eventually I realized how much gas I was burning trying to find wifi and purchased a Verizon hotspot to save money. This got me the coverage of Verizon at $50 a month for 5GB, far cheaper than I could get a Verizon phone plan (I needed Verizon for the coverage). I had a $2-a-day plan (only for days I used it) with T-Mobile on my Galaxy Nexus. After a while though I got data calling working on the hotspot and it was more reliable. At some point, I realized I didn't need my cellphone to be a cellphone. It was just a tiny tablet.

My latest laptop was yet another attempt to hit the middle-ground, but from the top. I got a Dell XPS 12, a flip-screen convertible laptop that turns into a tablet. It's a neat idea, but for the most part I haven't used it as a tablet. Only recently did the drivers finally work for gestures on the touchscreen under Linux. I now have a tablet but... I'm not entirely sure what to use a large heavy tablet for. A light tablet is comfortable to hold in one hand and read on, to hold over your head lying on your back, etc. With a large tablet the only use is while standing, but... unless you need a large screen, a small light tablet is still better for that use too.

Phablets
Recently, I went to purchase a smartphone. My Galaxy Nexus is bordering on ancient at this point at 4 years old. The screen is shattered, and it's just not working that well. It makes me sad that this is how technology works, but it is. So, I went looking for a new phone. I realized that what I wanted was a 6" tablet... but, those don't exist. 8" is too big, I can't carry it around in a pocket, it would have to be in a backpack. 5" is too small, I can't edit real text and use it like a real device.

So, I got a 6" phone. I actually do have the T-Mobile plan on it, for emergencies and getting paged for work. I'd prefer not to have the radio as it uses power. I'd also prefer not to have it since the radio makes the device accessible to the NSA for domestic spying (no, it's not a conspiracy theory, this one is real). But... oh well, 911 support is nice. Overall, paired with a folding keyboard, it's getting really close to what I've been looking for since I was a kid. Android has made it to its teen years now; it's growing into a real adult operating system, with multi-user support and all the basics, but it's not actually there yet. It's gotten easier to run a GNU userland as well now, though I'm running a nightly build of a bleeding-edge rooted open source image right now (CyanogenMod) so I haven't gotten it working yet.

So, I still have a Kindle, a camera (for low-light pictures), a HAM radio, a car, and a Verizon hotspot, but for my computing needs I have 2 devices: my Nexus 6 phablet, and a Dell XPS 12 (the convertible laptop).

Both the phablet and the laptop can make phone calls. Both can do text editing, run a terminal, run an SSH client and server, run a decent web browser, etc. Two devices that can basically do everything is getting pretty close. I find myself grabbing the phablet frequently to use as a reading device, even inside the house. I don't have to plug it in and can just carry it around my 750-square-foot house.

The next step is to finish making Android into a real OS. Apps either shouldn't exist, with the web versions tested in a browser instead, or they should support the full set of features. Alternatively, we could just get really good at getting GNU/Linux to run on these devices, so we can actually use the full suite of software available.

Exciting times. I have a folding keyboard coming in the mail, and I'm excited to try writing blog-posts on the Nexus as soon as it does.



Monitoring: update

2015-03-13

Prometheus appears to actually exist now, and is coming along pretty well.
https://github.com/prometheus/prometheus

Glancing through it, it appears to hit the bulk of my points. It's still lacking a permanent DB for storage, so it's probably not deployable for most folks yet, but in general it looks like they are doing things basically right.

So, there's no answer out there that I think solves the problem *yet*, but at least someone is getting close. Check out the query language, particularly operators:
http://prometheus.io/docs/querying/basics/

A friend of mine also works at Splunk, and they showed me some of Splunk's features. As a backing store it appears to fit much of what I was describing as well. It has reasonable group-by semantics, for example. Fundamentally Splunk is basically a timeseries DB; it just happens to most often be used like ES is, for processing logs.

So, with any luck my old monitoring posts will soon become defunct. Here's hoping!


Original AVL paper

2015-01-05

This came up the other day; it wasn't on my blog and I wanted to find a link to it. So here it is:

An Algorithm for the Organization of Information
http://monet.skku.ac.kr/course_materials/undergraduate/al/lecture/2006/avl.pdf

This is an amazing read because you suddenly realize just how *differently* we thought about algorithms and programming back then. This paper doesn't even touch on the API; it's all about the layout of the tree in memory. When's the last time you read an algorithm description that didn't mention the API?

Anyway, it's a neat read. Give it a try!


Monitoring, post3, tools

2014-08-22

Alright, it's time to write this. I had intended to write this much, much sooner, but then the "Group By" realization hit me, and I got stuck for a while trying to prove out some features in the system I'm actually using.

Our goal is a system that gathers whitebox and blackbox data from applications, and whitebox data from machines, and lets us easily query, view, graph, and alert on that data with the feature set from my previous post.

There are also some soft requirements in practice for really using a DB:


Nagios et al.

Nagios is the industry standard solution. Nagios is what many, many systems are trying to emulate, and as a result its flaws are endemic to the world of monitoring systems.

First, let me say that I have not used Nagios, but I have used some of its competitors.

The main problem is that fundamentally these systems mostly are not meant to do the type of things I discussed in my previous post. They're meant mostly to alert on directly monitored values (e.g. is memory usage too high?). They then give you some small finite set of "aggregations" so you can see, for example, how much memory *all* your machines are using together. This works fine until it doesn't. Values like that tend to be scale-dependent and take constant tweaking.

A popular meme right now is "automated thresholding": your system figures out what "normal" is, and alerts when a value is some statistically significant distance outside that "normal". This is an attempt to solve the problem of scale independence, but it has problems. One is that sometimes your "normal" drifts, and that's indicative of e.g. your users slowly outgrowing how far your system can scale. After a year of slow drift "normal" could be 5% of queries being dropped... that's not okay.

Note that I said 5% of queries being dropped. Ratios are a much simpler solution to this problem. In practice this does not entirely solve scale independence. If you don't already know why, Google for the "law of large numbers". That said, while it's not perfect, it gets you 90% of the way there, and I would argue that the fallibility and complexity of AI-type approaches are not worth the tiny bit of scale independence they buy you.

On top of this it's designed to be "easy to use", which means everything is based on clicky-button interfaces: no backup or versioning for your configs, no idempotent setup, etc. It's a nightmare from the perspective of a reliability engineer as soon as you view that system as something you have to manage and not just use, despite the fact that you're paying someone else a ton of money to run it for you.

Basically, don't believe a monitoring tool will ever solve a problem for you, or allow you to think less about your system. I'm just going to leave it there, and let you generalize out to all of the other systems in this same family.

Collection -> DB -> alerting solutions

Okay, so if you get frustrated with the tools in the above model and start poking around, what you quickly find is that folks in the know are using systems with 3 separate components: data collection, a timeseries database, and an alerting system that queries the DB.

This has some nice advantages. You can use lots of different collection engines for collecting different types of data (e.g. statsd for application level, various services for whitebox probing, etc.) Yet all the data ends up in the same DB, making building UIs and integrating data across those collection engines a breeze.

There is one notable downside, which is that your alerting is fundamentally polling based. It has to do expensive queries against the DB every N seconds so it can alert you if something goes bad. If you're doing polling you can usually assume you're doing something wrong. The *right* answer would let alerting trigger at the collection level. BUT we still want to incorporate old data in some cases, so it also needs to go back to the DB to get data, or keep a local store, or something.

There is one system out there that strives to do this a better way, called Prometheus. Unfortunately, it's not ready for production yet, but keep an eye on it.

Collection

This part of the equation is boring, honestly. We need something to gather stats from machines and applications and feed those stats in to the database. There are a hundred ways to do this that are fine. As long as it doesn't block the application, we can get application and machine level data, and the data gets to its destination, it'll work.

DBs

Before we dive too deep we need to pare down the field. So, looking at my earlier requirements list it's pretty clear that we need a rich query interface with some understanding of timeseries. Based on that I'm tossing out options like Postgres, MySQL, etc.

There are also time-series plugins, modes, etc. for some databases, but all of the databases like this that I found store time-series in the wrong way. To do the calculations I've discussed earlier we need to be able to query for a timeseries, and then compute on a subset of that timeseries. The "plugin" approach tends to store a *set* of timeseries as something you can query for, which really doesn't help us much.

Here's a nice list of open source timeseries databases.
https://en.wikipedia.org/wiki/Time_series_database

Here are the ones I've looked at.
Now, I'm certain that I'm going to get something wrong in this post. There's simply too many details. My goal is to let others share some of the realizations and research I've had and save a little time. So, please bear with me, and if you find mistakes drop a comment and I'll try and fix it.

Druid:
Looks like a promising system, but everyone says it's a total bear to actually run. It's based on ZooKeeper, and it uses a MySQL instance for its metadata. That's already some interesting requirements, but not terrible. Then you start looking at its pieces: it has a controller node, a broker node, and a historical node. A minimal system is just very complex... too much for me.

Graphite:
Seems to be the established "common-sense" answer among the options. It's open source. It has a relatively rich query language. As of version 0.10.0 it has "map" and "reduce" which give a generalized Group By semantic like I talked about in previous posts. It's not terrible to run yourself, though it does have some scaling limitations. There are hosted options where they've already worked out the scaling issues for the most part - sadly hostedgraphite doesn't support 0.10.0 yet, but they are working on it (I've been talking with them :D).

The biggest win of graphite though is that it's just got tons of community support. Everything integrates with it, all the collection tools, and many of the front-end alerting tools out there.

InfluxDB:
InfluxDB is the shiny new guy in town. I was really excited to read through their website. It's open source, and the developers are also doing a hosted option. The language is very rich, and unlike graphite, features like "group by" were built in from the beginning, so they're supported properly. Unfortunately it's rather new, still a bit too young for my blood. If you want the new hotness though, give it a try.

KairosDB:
From what I understand this is basically a re-write of OpenTSDB. Like OpenTSDB it runs on HBase (translation: it's impossible to administer unless you're already a Hadoop guru). Unlike OpenTSDB it can also run on Cassandra. Cassandra is the open-source Bigtable written in Java, except that unlike Bigtable it's also a storage stack. The problem is, it's not that reliable. They still haven't hammered out a lot of bugs. Is it usable? Yes. Do you want to deal with it if you don't have to? No. It also has nice query semantics supporting most of the features you might want.
This might be a decent option if you can find a good hosting provider. Non-hosted though, it's probably a no-go for a small organization. For a big organization it might be just the ticket.

OpenTSDB:
See KairosDB. OpenTSDB is a true standard, it's all open source, all that great stuff. But, like KairosDB, it's nasty to actually run. Unlike KairosDB it only runs on HBase, that is, the Hadoop stack. If you already run a Hadoop stack that's all well and good, no big deal. If you don't, it's a heck of an undertaking just for your monitoring database. That said, I understand that it scales quite well. Again it has support for rich queries and all the shinies.

Informix:
From IBM, and closed source. It's been around a pretty long time, so it's really optimized for a different set of use-cases, from what I gather. From my perspective I'm putting a lot of time and energy into a system, and closed source scares me because I can't jump ship if I want to. Systems like graphite have standard interfaces that are supported by lots of systems. Informix is the exact inverse of this, being 100% proprietary, and no one wants to go within 100 yards of something owned by IBM (for good reason).

Of these, I ended up picking graphite. I state that here to explain the next section.

Alerting

This part is harder. After scouring the field for systems that alert based on data in graphite, I found that they are all seriously flawed.

The biggest flaw is that they all use configuration backed by databases. This means that my data about what to alert on, which is fundamentally configuration, is instead live data in production. If I lose that database in production I'm going to be stuck rebuilding all of my monitoring from scratch. If I make a mistake and bump the wrong button, or someone changes something and we decide it was wrong, we have no rollback, no tracking. Any features related to versioning have to be built in. Also, I can't do code reviews on changes, branch them, and do everything else source control lets me do.

For me, this shot down every single system I could find. My conclusion? Build one myself.



Monitoring, post2, what we want

2014-07-04

In my last post I talked about what we're trying to accomplish with monitoring: http://www.blog.computersarehard.net/2014/06/monitoring.html
I think it's hard to see what we need in a tool without some hard examples of what we want to compute. So, here are some examples of the types of things we want to compute. The discussion in that previous post should be sufficient to motivate why each of these metrics would be interesting and useful.

Rates and sums

Let's start with percent errors returned to users:

This is basically error_rate / responses_rate. What we have though are discrete query response events that we're counting. Chances are we can't afford to send or record metrics for every response, so instead they are going to get bundled somehow on a periodic basis, another discrete event. This means we don't really have a continuous function that we can simply take a differential of, instead we need to take a period of time that includes several samples and compute a differential over that - so we have a few points to work with.

This means our rate isn't just a rate; it's a rate with a period parameter. In general you want to make as many decisions as possible in your monitoring system, rather than your application, so you can easily change them without re-releasing your software. So we really want to set this period parameter in our monitoring.

This means that we don't want to export a rate, instead we want to export a constantly incrementing counter. We can then compute a differential over an arbitrary period post-facto in the monitoring system, thus getting a 1 minute, 5 minute or 20 minute rate as we prefer. This period acts like a low-pass filter: the larger it is, the more it "smoothes" the jumps in your rates. For the sake of example let's say we have a pretty large system with high query rates, and we want fairly low resolution and high sensitivity to quick changes, so let's go with a 10 minute rate. So now we have:

percent_errors = rate(errors, 10m) / rate(responses, 10m)
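To make that concrete, here's a minimal sketch in Python of computing a rate from counter samples post-facto. The sample format and the `rate` helper are my own invention for illustration, not any particular system's API:

```python
def rate(samples, period_s):
    """Differential of a monotonically increasing counter.

    `samples` is a list of (timestamp_s, counter_value) pairs, oldest
    first. Returns the per-second rate over the trailing `period_s`
    seconds, using every sample that falls inside the window.
    """
    end_ts, end_val = samples[-1]
    start_ts, start_val = end_ts, end_val
    for ts, val in reversed(samples):
        if end_ts - ts > period_s:
            break  # sample is older than the window
        start_ts, start_val = ts, val
    if end_ts == start_ts:
        return 0.0
    return (end_val - start_val) / (end_ts - start_ts)

# Counters sampled every 5 minutes; a 10 minute (600s) window spans them.
errors    = [(0, 0), (300, 30),   (600, 90)]
responses = [(0, 0), (300, 3000), (600, 9000)]

percent_errors = rate(errors, 600) / rate(responses, 600)  # ~0.01, i.e. 1%
```

A larger window would simply pull in more samples and smooth the result, which is the low-pass-filter behavior described above.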

Now we want to compute this over our entire service, which looks like:

rate(sum_over_servers(errors), 10m) / rate(sum_over_servers(responses), 10m).

Dimensionality of data

We also want

percent_errors = rate(errors, 10m) / rate(responses, 10m)

for each server as well. That way once we see that the error rate shot up, we can tell if it's a particular machine causing our problems.

And we want

percent_errors = rate(specific_error, 10m) / rate(responses, 10m)

So we can break down what problem is being passed back to the user.

So, there's several interesting things going on here. We basically have 2 dimensions: servers and error types. We *could* write out every one of these equations across both dimensions, but that would be a LOT of equations, one for every error type, and one for every server... and wait, we probably want one for every error type and server combination! Even worse, if we add or remove a server our rules change. This doesn't sound at all like how a properly lazy software engineer approaches a problem.

Instead of describing every calculation we do, we want to describe each category of calculation. To compute the error rates for each, I basically want to say "do this calculation over every error type". In Haskell terms this is something like a list monad. If you're used to MATLAB it's like operating on matrices. I'm going to describe a bit of a formalism here, not because it's the only one that would work, but to try and clarify the problem. To accommodate this new "parallel computations" model we can think of every timeseries as being described by an unordered list of labels. That is, a dictionary, struct, or record, depending on your favorite terminology. So for example 404 errors on server 10 might look like this:

{response_type: error_http404, server: myhost10, property: response}

Using this model we can now drop a key... say

{server: myhost10, property: response}

to request everything that matches the two keys we do supply. This is like an array of timeseries, a 1-dimensional matrix. Thus this gets us all of the response_types for myhost10. If we drop 2 keys we'll get a 2-dimensional matrix, etc. Great, so now we can do something like

percent_errors = rate({server: myhost10, property: response}, 10m) / rate(sum({server: myhost10, property: response}), 10m)
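As a sketch of this label-matching idea (a toy model of my own, not any real system's API), each series is keyed by a set of label pairs, and a query that supplies fewer keys matches more series:

```python
def labels(**kv):
    """A label dictionary, frozen so it can key a dict."""
    return frozenset(kv.items())

# A toy store of timeseries, keyed by their label sets.
store = {
    labels(property="response", response_type="error_http404", server="myhost10"): [1, 2, 3],
    labels(property="response", response_type="error_http500", server="myhost10"): [0, 0, 1],
    labels(property="response", response_type="error_http404", server="myhost11"): [4, 4, 4],
}

def select(store, **query):
    """All series whose labels include every supplied key/value pair.

    Dropping a key from the query widens the match: fewer keys supplied,
    more dimensions in the result.
    """
    q = set(query.items())
    return {k: v for k, v in store.items() if q <= k}

# Dropping response_type gives a 1-dimensional result: one series per
# response_type on myhost10.
matches = select(store, server="myhost10", property="response")
```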


But we still have to write this for every server. If we try and write:

rate({property: response}, 10m) / rate(sum({property: response}), 10m)

It all falls apart. We end up summing over all servers and all errors for our divisor, while our numerator is calculated per server. Dividing these makes no sense at all! To solve this, we need to tell "sum" what it should sum over. As it turns out, our result is now going to be arrays on both sides. So, we also probably need to give "/" some clue how to match up those two arrays, so it knows what to do if they aren't identically sized or something.

Group By

I've actually never used SQL, but it seems this is by far the best terminology out there for what I'm describing here. I finally realized the connection a couple of days ago while talking with Jess, my girlfriend, about the problem I was trying to find a monitoring system to solve. It turns out the solution to the problem described above is something SQL folks call a "Group By" clause. The idea is to say "retain this set of keys", while you collapse over all others. So for example:

sum({property: response}) Group By server

Would calculate the sum of all response_types, but wouldn't sum across servers, and would thus return a 1-dimensional matrix (an array) of results, one for each server.
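A hypothetical sketch of that semantic in Python, representing each series' labels as a frozen set of pairs; `group_by_sum` and the sample data are illustrative, not any real system's API:

```python
def group_by_sum(series, keep):
    """sum(...) Group By `keep`: collapse every label not in `keep`,
    summing the matching series together point-by-point."""
    out = {}
    for lbls, points in series.items():
        key = frozenset((k, v) for k, v in lbls if k in keep)
        if key not in out:
            out[key] = list(points)
        else:
            out[key] = [a + b for a, b in zip(out[key], points)]
    return out

series = {
    frozenset({("server", "myhost10"), ("response_type", "ok")}): [10, 20],
    frozenset({("server", "myhost10"), ("response_type", "error")}): [1, 2],
    frozenset({("server", "myhost11"), ("response_type", "ok")}): [30, 40],
}

# sum({property: response}) Group By server: one summed series per server,
# with response_type collapsed away.
by_server = group_by_sum(series, keep={"server"})
```

Dividing two results grouped by the same key then lines up naturally: both sides are arrays indexed by server.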

Group by isn't usually used in this context, but we might as well use it in our pretend formalism since we already have it as an operator. For binary operators lets just say that it uses the key you define as the variable it matches on the two sides. So to fix our calculation above we get:

rate({property: response}, 10m) / Group By server rate(sum Group By server ({property: response}), 10m)

Obviously this is a bit messy with both infix and prefix operators, and our Group By clause as an addendum to each, but I didn't want to change from the initial syntax too much, and wanted to leverage people's understanding of the SQL concept...

So, what have we found so far?

We've noticed so far a few things that we really need for our monitoring system.

I'd like to add one more note, which is that our dictionary syntax is cute, but misses a point. What if we wrote down these tags in an ordered list:

property:response.response_type:error_http404.server:myhost10

Then, we could use regular expressions to parse out our tags. A query for all response types on server 10 would look like this:

property:response.response_type:\.*.server:myhost10

Note that this syntax is actually *more* general than our previous syntax, since we can also match on just parts of labels, so for example we could do this:

property:response.response_type:error_\.*.server:myhost10

Now we're selecting only for response_types with an error code. Before, we would've had to change the export of our variables to get this data into a separate dimension; now we can get new dimensions on the fly whenever we want them. This isn't great of course, because the syntax is obnoxious for general use (and probably for the implementation as well), but it's definitely a useful property of a system to be able to pick out pieces of a name and add new dimensions on the fly like this.
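Here's the flattened-name idea in standard Python regex syntax. I've written the wildcard as `[^.]*` rather than the `\.*` above, so a match can't accidentally cross the `.` separators; the names themselves are just the examples from this section:

```python
import re

name = "property:response.response_type:error_http404.server:myhost10"

# All response_types on myhost10: wildcard one whole label.
any_type = re.compile(r"property:response\.response_type:[^.]*\.server:myhost10")

# Matching *part* of a label carves out a new dimension on the fly:
# only error response_types, without re-exporting anything.
errors_only = re.compile(r"property:response\.response_type:error_[^.]*\.server:myhost10")

assert any_type.fullmatch(name)
assert errors_only.fullmatch(name)
```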

Histograms and percentiles

There are percentiles, and then there are percentiles.

Most systems give you the ability to compute a percentile over a set of variables, so for example:

percentile({property:response, response_type:error_http404})

That's great and all, but it's not usually what you want. This is a percentile of variables, but frequently you want a percentile of *events*. That is, something like a latency percentile. You can't represent every query as a variable or everything goes haywire and your index space explodes far too large to store. In fact, frequently you can't even afford to write data down for every query. Instead you're probably going to bucket your query latencies into a histogram. This only gives you an approximation, but it can be a good one if you're careful about your histogram selection. E.g. even-sized buckets are probably not what you want, since latency is theoretically unbounded upwards. Instead you probably want exponential bucket sizes, so your resolution is relative to magnitude.

Bucketing into a histogram gives you another advantage. Given a percentile latency for each machine, you can't aggregate these into a percentile latency for all the machines combined. Percentiles just don't work that way. To be able to do this calculation you need a lot more information about the distribution. With histograms you can sum your histograms across all your servers, then compute an approximate percentile for all your servers. Simple if histograms are reasonable to work with in your system.
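A minimal sketch of both points, with made-up exponential bucket bounds; the helpers are illustrative, not any particular library's API:

```python
import bisect

# Exponential bucket upper bounds (ms): resolution scales with magnitude.
BOUNDS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

def observe(hist, latency_ms):
    """Count one event in the first bucket whose bound covers it."""
    i = min(bisect.bisect_left(BOUNDS, latency_ms), len(BOUNDS) - 1)
    hist[i] += 1

def merge(a, b):
    """Histograms sum across servers; percentiles don't."""
    return [x + y for x, y in zip(a, b)]

def percentile(hist, p):
    """Approximate p-th percentile: the bound of the bucket holding it."""
    target = p / 100.0 * sum(hist)
    seen = 0
    for i, count in enumerate(hist):
        seen += count
        if count and seen >= target:
            return BOUNDS[i]
    return BOUNDS[-1]

server_a = [0] * len(BOUNDS)
server_b = [0] * len(BOUNDS)
for ms in (3, 5, 7, 200):
    observe(server_a, ms)
for ms in (4, 6, 300, 400):
    observe(server_b, ms)

# Fleet-wide median from the summed histograms.
fleet = merge(server_a, server_b)
p50 = percentile(fleet, 50)
```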

There's one more trick about percentiles. This doesn't relate directly to monitoring tools, but I would be remiss not to mention it while on the topic. For a two-stage pipeline ->A->B->, you cannot use a histogram of the latency of A and a histogram of the latency of B to compute the histogram of the latency of A->B. The reason is that the latencies of these systems aren't guaranteed to be uncorrelated. In fact, since latencies frequently depend on the exact query, they are far more likely to be correlated than not. This fact itself is the type of thing you're trying to see with your monitoring. To properly measure the latency of this system, and be able to break it down and look at each piece, you need to measure the latency of A, the latency of B, and the latency of A->B separately, export *each* as a histogram, and compute your percentiles across each. It sucks, but it's mathematical reality.

Calculations on history

If I want to know how many resources my system is using, there are a lot of things I want to look at. I'm probably interested in our peak (or some percentile thereof), our average, and how much it varies. Maybe if I spent some engineering effort I could flatten out our usage. Is it worth it? How much money would that save in resources (or rather, how much growth-space would that give us without having to buy more)?

Another great example of this is computing quarterly SLO rollups. An SLO is frequently measured in "9's" of uptime. Well, systems aren't really either up or down; they can be in between. In fact, this is what the metric we were discussing earlier measures: the error rate of our system while it is serving users. Given this, we probably have an SLO that looks like "5 9's, 4 9's of the time", meaning that we should have 99.999% availability 99.99% of the time.

At the end of a quarter we want to see how well we've been doing. So we're going to take our original key metric, threshold it at 99.999%, and then ask how often it was above or below that threshold.
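A minimal sketch of that rollup, assuming we have a list of per-interval availability percentages for the quarter (the constants, names, and sample data are hypothetical):

```python
SLO_AVAILABILITY = 99.999   # "five nines" while serving
SLO_TIME_FRACTION = 0.9999  # ...met "four nines" of the time

def slo_rollup(availability_samples, threshold=SLO_AVAILABILITY):
    """Fraction of intervals whose availability met the threshold."""
    good = sum(1 for pct in availability_samples if pct >= threshold)
    return good / len(availability_samples)

# Hypothetical quarter: one bad interval out of 10,000.
quarter = [100.0] * 9_999 + [99.9]
met_slo = slo_rollup(quarter) >= SLO_TIME_FRACTION
```

The key point is that this is a computation over the full quarter of historical data, not something you can answer from the live value of the metric.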

There are almost certainly other ways to do this, but by far the most obvious is to compute on historical data. We want to be able to graph history, but we also want to be able to look at history numerically to help us pull out trends, and examine the past to help predict the future. There's a whole chunk of monitoring that fundamentally is all about modeling, and modeling is all about looking at the past.

Alerts come from the same system as graphs

This is just common sense. When I get paged, I want to go look at the data that paged me. I want to see its history and find the event that tripped the alert. Remember that by the time I'm looking, the event is likely already over, so history is all I've got. It may happen again, but I want to fix it *before* that happens; after all, that's my job.

Data that is kind of similar to the data that paged me doesn't cut it. I want exactly the data that paged me, so I can be absolutely certain of what's going on.

When debugging a system you often have hunches, and one of the major purposes of monitoring is to give hard data to either confirm or deny those hunches. Monitoring systems are complex and often need debugging in their own right. Keep it simple and easy to examine.

Configs are stored in config files

I would've thought this was obvious, but looking at the extant monitoring systems, it apparently isn't. This is a general principle, but I'm going to bring it up here anyway.

When trying to build stable systems, the simplest solution that does the job is the best. Config files are simple. If something goes wrong in production and data gets lost, I've got the config file right here in version control. If I notice wonky behavior and want to look for recent changes, again I have version control. If I want to generate a config, generating a flat file is easy and automatically idempotent.

Could you get these properties from a database? Sure you could, but now things are complex. Your synchronization could go wrong, and then your monitoring is not doing what you think it is.

Realtime alert and graph data latency

We need soft realtime constraints of, say, two minutes at worst for data to make it to alerting. Data making it to graphs can be delayed by another couple of minutes without causing issues; I'd put the maximum acceptable delay for hitting graphs at around five minutes. Smaller would clearly be nicer.

Conclusion

Let's review the properties we've said we need:

- Percentiles computed from histograms, so they can be aggregated across machines
- Calculations on historical data
- Alerts that come from the same system as graphs
- Configs stored in config files
- Soft realtime latency for alert and graph data

I thought I would get to the tools this time... but describing what we're looking for took an entire post. So in the third post in this series I'll cover tools.

I'm still trying to understand some of the computational models (specifically Graphite's) sufficiently to write that post, but hopefully I'll lock it down relatively soon.