
AVL trees or RedBlack trees

2014-05-28

Before launching into this: I'm talking about balancing algorithms for sorted binary trees (binary search trees). If you don't know what those are, you'll want to learn about them before bothering to read onward.

Background

For this post I want to focus on AVL vs RedBlack trees, but first a little background. In a paper called "An Algorithm for the Organization of Information", G. M. Adel'son-Vel'skii and E. M. Landis laid out what is generally considered the first self-balancing binary search tree. It's a great paper too, very approachable; give it a read. You may want to skim Wikipedia's explanation as well, though. Papers back then were written a bit differently, but more on that later.

Anyway. BTrees were invented later. RedBlack trees were then invented as an isomorph of BTrees in which data never has to be shifted: one node holds one piece of data. This can prove useful, for example, if you're keeping pointers to the data stored in the tree.

There are tons of other balancing algorithms, but RedBlack trees and AVL trees have the distinction of being O(log(n)) for all operations without amortization. This means that EVERY operation is O(log(n)); it doesn't just average out to that. That property is great if you have realtime bounds for your code, rather than just wanting it to run "fast". They achieve this by guaranteeing that they stay approximately "balanced", meaning every subtree has about the same number of children (transitively) to its left and to its right.
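
To make the balance property concrete, here's a minimal sketch of the AVL version of the invariant. The Node struct and function names are hypothetical illustrations, not the repo's actual code; the checker just verifies that the left and right subtree heights differ by at most 1 at every node.

    #include <algorithm>
    #include <cstdlib>

    // Hypothetical minimal node layout, for illustration only.
    struct Node {
      int value;
      Node *left;
      Node *right;
    };

    // Height of a subtree (an empty tree has height 0). This is O(n),
    // which is fine for a checker; real AVL nodes cache this instead.
    int height(const Node *n) {
      if (!n) return 0;
      return 1 + std::max(height(n->left), height(n->right));
    }

    // The AVL invariant: at every node, the left and right subtree
    // heights differ by at most 1. This is what keeps the depth, and
    // therefore every operation, O(log(n)) in the worst case.
    bool is_avl_balanced(const Node *n) {
      if (!n) return true;
      if (std::abs(height(n->left) - height(n->right)) > 1) return false;
      return is_avl_balanced(n->left) && is_avl_balanced(n->right);
    }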

AVL trees have a tighter bound on how balanced they are than RedBlack trees do. This makes them a touch faster for lookup, but it means more balancing operations on average as data goes in and out. In particular, RedBlack trees are guaranteed O(1) rotations per operation, whereas AVL trees are not.
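
For reference, a rotation is a constant-time pointer shuffle. Here's a sketch of a single right rotation, reusing the hypothetical Node struct from the sketch above; a real implementation also handles the mirror-image left rotation and updates any stored heights, balance factors, or colors.

    // A single right rotation on the hypothetical Node above:
    //
    //         y                x
    //        / \              / \
    //       x   C    -->     A   y
    //      / \                  / \
    //     A   B                B   C
    //
    Node *rotate_right(Node *y) {
      Node *x = y->left;
      y->left = x->right;  // subtree B moves across
      x->right = y;
      return x;            // new subtree root; the caller re-links it
    }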

The Experiment

For fun I wrote myself a small forest of trees. The experiments aren't complete, but I got some slightly surprising results. I wrote them up on Google+, but I wanted to put them somewhere more permanent, so here it is.

All of the code for what I'm talking about is here:
https://github.com/multilinear/datastructures_C--11
Check PERFORMANCE for details on the experiments.
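
To give a sense of the shape of such a benchmark (this is not the repo's harness; see PERFORMANCE for the real methodology), here's a minimal timing sketch. std::map stands in for whichever tree is under test, and the element count is arbitrary:

    #include <chrono>
    #include <cstdio>
    #include <map>
    #include <random>

    // Minimal benchmark sketch: random inserts with removal on key
    // collision, timed end to end. std::map is a stand-in for the
    // tree under test; the numbers here are arbitrary.
    int main() {
      std::mt19937 rng(42);  // fixed seed so runs are repeatable
      std::uniform_int_distribution<int> dist(0, 1 << 20);
      std::map<int, int> tree;

      auto start = std::chrono::steady_clock::now();
      for (int i = 0; i < 1000000; i++) {
        int k = dist(rng);
        if (!tree.insert({k, i}).second) {  // key already present:
          tree.erase(k);                    // churn with a removal
        }
      }
      auto stop = std::chrono::steady_clock::now();
      std::chrono::duration<double> secs = stop - start;
      std::printf("%f seconds\n", secs.count());
      return 0;
    }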

What surprised me is that AVL outperformed the RedBlack tree. It's not by a lot, but it's by some. I ran several further experiments to make sure it wasn't the depth difference impacting lookups. I ran a ton of extra lookups to see what would happen, and the gap didn't widen, which implies that lookups are too small a factor in my workload to matter.

I then spent quite a bit of time trying to figure out what else it could be. I already have 2 RedBlack implementations, and I'm running against the faster of the two, so I'm fairly confident that implementation is doing the right thing. I also have extensive consistency checking, which is turned off for the benchmark but used in the unit tests, so I'm fairly sure the algorithm is correct.

After much testing, my best theory is that hardware has changed since RedBlack trees were first benchmarked and declared faster. See, when inserting into or removing from an AVL tree, you can tell whether to rotate by looking only at nodes on the path to the one you modified. In a RedBlack tree you also need to look at some neighbors off that path. Note, too, that as machines have sped up, the gap between cache layers has widened at every level: L1 is many times faster relative to L2 than it used to be, and the same goes for L2 to L3 and L3 to main memory.
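
To illustrate the locality difference, here's a sketch using hypothetical node layouts (again, not the repo's code). The AVL decision reads a balance factor stored on a node already on the search path, while the RedBlack insert fixup has to chase down the uncle, a node off the path that's likely on a cold cache line:

    // AVL: the rotate decision reads a balance factor stored on a node
    // we already touched on the way down the search path.
    struct AvlNode {
      int value;
      int balance;  // height(right) - height(left), normally in {-1,0,1}
      AvlNode *left, *right;
    };

    bool avl_needs_rotation(const AvlNode *n) {
      return n->balance < -1 || n->balance > 1;  // no off-path reads
    }

    // RedBlack: the insert fixup has to check the uncle's color, and
    // the uncle is off the search path.
    struct RbNode {
      int value;
      bool red;
      RbNode *left, *right, *parent;
    };

    bool rb_uncle_is_red(const RbNode *n) {
      RbNode *g = n->parent ? n->parent->parent : nullptr;  // grandparent
      if (!g) return false;
      RbNode *u = (g->left == n->parent) ? g->right : g->left;  // uncle
      return u && u->red;  // this read is the extra cache-line pull
    }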

My theory is that pulling in the cache lines to look at the neighbors is in fact so expensive that it outweighs having to do more rotations. Thus, while in theory RedBlack trees do only O(1) rotations per insert, it doesn't matter, because the main expense is actually *looking* at nodes.

If anyone else comes up with another theory, I'd be very interested. That's what I've got for now.

First Post

2014-05-28

I've been keeping a blog about my outdoor adventures for some time over at blog.smalladventures.net.

I worked at Google for over 5 years as an SRE. In that time, most of the computer work I was doing wasn't public. Also, working 5 days a week, and some weekends, I really didn't want to spend much time outside of that playing with computers, certainly not enough to sustain a blog.

I quit last year, though, and have been doing some interesting side projects. I actually just accepted a job offer with Meteor that I'll be starting soon, but in contrast that work is open source, and I convinced them to let me work only 4 days a week. As a result, hopefully I'll want to do at least some computer work for fun, and I'll probably want to post more as I learn about open source projects and the like.

So, here's my blog.

I did a lot of robotics in high school and college, and went to Carnegie Mellon with the intention of doing AI. I then got into operating systems, where I TA'd a fairly intense operating systems class. I realized the languages I was working in were awful, and thus got interested in type theory and programming language design, taking several classes down that track, such as category theory and a class on building a compiler for an SML-like language. Somewhere along the line I also got interested in algorithms and took a graduate course in complexity theory.

Then, as mentioned earlier, I was an SRE for 5 years. I also run Linux on all my own machines. My preferences and needs for personal machines are strongly driven by my desire to be outdoors.

Suffice it to say, my interests are rather eclectic, and this blog is likely to follow the same pattern.
