
IM protocols and clients and FOSS


I'm publicly announcing my move off of the Signal chat protocol controlled by Whisper Systems. I'll still be on there occasionally for now, but I'm almost always available on matrix

I've been considering various options. I'm unwilling to consider anything that doesn't support end-to-end encryption, or anything with more central control than Signal. But these days there are actually options in this space! Signal is not the most FOSS/libre player anymore, and their insistence on only supporting the Android/iPhone ecosystems is a major downer. The other two options I found are Jami and Matrix. Here's a "feature" matrix for matrix, signal, and Jami.


System | Functional | e2ee | Open source | FOSS/libre | Centralized? | Clients available | Required OS
Signal | yes | yes | mostly | no | yes | Whisper Systems' client | Android or iPhone
Matrix | yes | yes | yes | yes | federated | dozens available | Android, iPhone, Linux, OSX, Windows, ARM and x86; easily ported to other OSes/architectures
Jami | no | yes | yes | yes | fully decentralized | one FOSS client | Android, iPhone, Linux, OSX, Windows

For me Matrix seems like the clear winner. Jami *sounds* good, but when I tried it I couldn't actually get it to work. So Matrix it is. More details below.

Problems with Signal

Whisper Systems has talked about a first-class desktop client for years, but nothing of the sort has materialized. They have a desktop client, but for mere mortals it can only be used secondarily to having a phone. A few people have managed to run it as a primary client, but you have to do a lot of hacking around the system design. As a result, to really use Signal you must own an iPhone or an Android phone with a valid phone number, and be willing to register that number with Signal. For me this is a practical matter: I don't own an Android phone and don't intend to any time soon. Many folks have also pointed out that in many countries you can't get a phone without essentially registering your identity with the government as attached to that phone. That raises privacy-related safety issues for those dissenting from their government.

While the Signal protocol is public, and some code is open source, it is neither FOSS/libre nor an open system. Whisper Systems has singular control of the server side of the system. My understanding, at least, is that the server code they publish is out of date. But most importantly, they change the protocol whenever they feel like it. This means that in practical terms only they can provide consistently functional clients. This has been demonstrated by attempts like axlotl to fill the application gap for first-class desktop IM clients. axlotl does work for some people some of the time, but it's flaky at best. In reading about axlotl it appears this is due to the constantly moving target of the protocol, as represented by the server-side software run by Whisper Systems. When I tried axlotl I got stuck in a perpetual auth loop.

Let me be clear. I laud Whisper Systems and the developers there for their efforts. They have changed the entire conversation around e2e encryption and chat protocols, making e2ee available to normal users, and doing it well. For a centralized chat ecosystem they are doing a better job than anyone else out there, and I'm NOT condemning that effort. They take security seriously, they open source the important parts of their software so anyone can audit it. From a pure security perspective they are doing everything right, and that is not only rare but basically unheard of in the software industry.

What I am saying is that I demand more. There is no reason we should be locked into a single provider for our chat services. While they provide a valuable service, they do so on their terms, with heavy centralized control of the overall ecosystem. Sure, you can run a third-party client, but they don't announce ahead of time when server support is changing, so while they *allow* it, they don't *support* it. I'm not actually a FOSS/libre purist. I just want to be able to use my software in certain ways, and control my own systems, and this is just one more case where those desires are in practice incompatible with more closed software ecosystems.

Problems with Jami

I attempted to use Jami for some time. Jami sounds ideal in that it basically drops the server side entirely, using only a very thin server to aid in NAT traversal. The problem? It doesn't work (yet, at least). While I did manage to chat with my wife occasionally with this software, it was so flaky as to be useless. I couldn't predict when it would or wouldn't work, and I had to restart clients frequently. Whether this is the protocol or the client I'm not positive, but whatever the reason, it's not currently a usable option. After a couple of months of experimentation I gave up. Besides these problems, Jami is harder to use for normal users. For example: with no server there is no such thing as account recovery. The model is confusing, and the chance of lockout for normal users is high.

Is matrix really better than these?

While I think Jami's vision is absolutely the ideal, in practice Matrix seems like a better balance between usability and encryption, striking a similar balance to Signal, and most importantly, Matrix actually works. Matrix is a fully open federated protocol that supports e2ee, has dozens of fully FOSS/libre clients, and has multiple FOSS/libre server-side implementations. The default homeserver provides an easy option for the normal user, much like Signal, but unlike Signal, anyone dissatisfied with it can go off and run their own server and still chat with folks on other servers. And because the protocol spec is a well-defined standard, there are many clients available for basically anything, many of which work consistently. And if you have an extremely weird use-case (you're running OpenBSD on a VAX) you can always port a client yourself.

I've run into one significant flaw in Matrix: e2ee support in the client is *optional*. This is sensible, as Matrix can also be used as an IRC replacement (open chatrooms), where encryption is likely not that valuable. As a result, though, many clients don't support e2ee yet. I recommend "fluffychat", as I've had the most luck with that client. My wife runs it on Android. I run it on an ARM64 linux phone and an x86_64 linux tablet. There was an issue with it getting stuck in right-to-left language mode on linux, caused by an underlying library, but it's been resolved. Element would be my next recommendation. It worked well for a while, then broke on me last week. To be fair, though, I am using debian unstable, so it may not be the application's fault.


As someone who has opted out of the iPhone/Android duopoly, Signal is basically not an option anymore. So I'm using Matrix as my primary chat platform. I hope I can drag a few other folks along with me.

librem5 configuration


I got a librem5 about a year ago now. It was quite useless until recently, when I realized that to get newer software I had to flash it myself once; THEN I would get updates. I've been playing with it since, and it's somewhat usable. I wouldn't say it's ready for primetime, but while the stability is so-so, I can actually do useful things with it now. Given that, I wanted to document what I've done to set it up.

First, I reflashed with byzantium. Then during setup I connected it to the Nextcloud instance I'm already running... easy enough. As a tip, since I hate running sudo over and over: "sudo su" doesn't work on PureOS, but "sudo /bin/bash" does.

Now for software

Lastly, I removed all the apps I don't actually use (e.g. the text editor). Unfortunately I didn't record the names of all of the apps, but you can always use "apt list --installed" and grep to work it out.

Finding software: I believe the librem5 is basically mobian/phosh. If you take a dpkg for mobian/phosh and try to install it, it won't just work (I tried it), because the library versions aren't identical. But if you're searching for software that will run well on the librem5, software that works on mobian/phosh probably will, and is most likely already available in the PureOS repos if you use the command-line apt. The PureOS store is incredibly limited and I found it largely useless; I'm sure it'll become more useful in the future. A surprising amount of software is available via flatpak as well, though I'm not a fan of it.

Overall I think PureOS is getting there in terms of functionality, and I'm excited to really use my device now. Battery life is just about long enough to last a work day. What it needs most now is stability. For example, the bluetooth functionality is pretty sketchy. I'm unsurprised by this, as the Gnome bluetooth stuff is no *more* functional on laptops, last I tried it. I gave up and use a command-line tool for bluetooth on my Microsoft Surface Go linux tablet (my other, and primary, machine). Unfortunately this has been common for Gnome. Many of my friends still call "NetworkManager" "NetworkMangler" due to its long history of stability problems. I love the idea of linux phones, but the Gnome development philosophy has, sadly, historically failed to deliver reliable and functional tooling. It's quite possible that librem and pinephone and others will actually help fix those longstanding problems. I'm crossing my fingers.

The best thing about the Librem, though, is that if they fail that's not the end of the story. Folks over in the pinephone world are playing with several other, even more linuxy/gnu'ish, phone GUIs. Running that stuff on the 100% open and free software librem is comparatively easy. I'm hoping purism can really deliver themselves, but it's nice to have a backup plan :). This is a HUGE difference from the days of Nokia and maemo. Although it was open then, the ecosystem wasn't there; it's finally here.

One last note... my device has a problem with the SD card at least, and likely a problem with the SIM slot as well. I identified this the day the device arrived, but Purism dropped the ball on customer support and never replied after their first response. I've emailed them back now and am hoping they'll get me a fully functioning device. I gather online that this isn't a terribly unusual story. They are trying hard but seem to be perpetually overloaded. In a way that's a good thing, as it means people want what they are selling, but it means the customer service experience might not be the best.


I got my phone back from purism with a working microSD slot... WOOT! I have a 1TB card arriving tomorrow so I can sync everything from nextcloud and have more than enough spare space for storing maps of the U.S. and Canada for puremaps, a copy of wikipedia, and whatever else I feel like.

I need to fix up some issues with my nextcloud server (I just moved from Hughesnet to Starlink and it's way better, but I lost my whacky ipv6 hacks), but once that's properly online again most of my use-cases are met by this suite of apps. Some could use some cleanup but purism is getting there.


Although I like my device, I wouldn't encourage others to get it. The price-tag is just too high for a device with poor battery life and still-unreliable software. I *really* like the idea of running Linux on a phone, but I think the PinePhone project is a more economical way to get there. Sure, the performance is a bit lower, but the devices are cheaper and the ecosystem is far better. In the end that's going to make for a more usable device. Ecosystems matter; that's why Linux won and BSD is now a barely used OS.

String Sorts


You'll recall from my recent post about Fast Sort that Radix Sort is significantly faster at sorting 32-bit integers than normal comparison-based sorts, particularly as the array gets large. That's a result every computer scientist knows theoretically, yet with median-find we saw that theoretical bounds don't always translate into practice.

Well, a friend asked me: what about sorting variable-length strings? Does radix sort still beat comparison-based sorts? It's a fair question, so I decided to take a shot at it. Here are my results:

In my mind that graph is pretty clear: above 10 elements radix sort is clearly winning. Note that this is currently a 128-bucket radix sort that only handles ASCII... though I'm actually only feeding it uppercase strings currently. So, let's talk about how this algorithm works, because it's not an entirely trivial conversion of radix sort.

String Radix Sort Algorithm

This is a little bit interesting. You may recall that there are two types of radix sort: Least Significant Digit first and Most Significant Digit first, referred to as LSD and MSD. My binary radix sort from earlier benchmarks was an example of an MSD sort, and the one I just referred to as "radix sort" is an LSD sort. LSD sorts are generally preferred because they are stable, simpler to implement, require less temp space, AND are generally more performant.

There's just one problem: with variable-length strings, LSD sorts don't work very well. We'd have to spend a lot of time scanning over the array just looking for the longest string so we can compute what counts as the least significant digit. Remember that in lexicographic ordering it's like all the strings are left-justified. The left-most character in each string is equivalent in precedence, not the rightmost.

MSD sorts must be recursive in nature. That is, they need to work only on the sublist we've sorted into a certain bucket so far. I'm going to call this sublist a "slice". To keep our temporary space vaguely in order I'm using a total of 5 lists.

Here's the algorithm. Start by looking at the first bytes of the strings. Look in slice list A and get the next slice. Bucket everything in this slice. Each of these buckets (if non-empty) becomes a new slice, so write the strings back out to string list B, and write the end index of each new slice into slice list B. Swap the A and B lists, move to the next byte, and do it again. We terminate when every slice is either of length 1 or we run out of bytes. To see the full implementation take a look at string_sort.h in my github repo.
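To make the slice machinery concrete, here's a minimal sketch of the algorithm above. This is my own simplified illustration, not the code in string_sort.h: for brevity it copies each slice back after scattering rather than ping-ponging between full A and B lists, and all of the names are mine.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

void msd_string_sort(std::vector<std::string>& v) {
    std::vector<std::string> scratch(v.size());
    // A slice is a [begin, end) range of strings still tied at this byte.
    std::vector<std::pair<size_t, size_t>> slices{{0, v.size()}}, next;

    for (size_t byte = 0; !slices.empty(); ++byte) {
        next.clear();
        for (auto [begin, end] : slices) {
            // Bucket 0 = "string ended before this byte" (sorts first);
            // buckets 1..128 = the 7-bit ASCII value at this byte.
            auto bucket = [&](const std::string& s) -> size_t {
                return s.size() > byte ? static_cast<size_t>(s[byte]) + 1 : 0;
            };
            std::array<size_t, 129> count{};
            for (size_t i = begin; i < end; ++i) ++count[bucket(v[i])];
            // Prefix sums turn counts into bucket start offsets in the slice.
            std::array<size_t, 129> offset;
            size_t sum = begin;
            for (size_t c = 0; c < 129; ++c) { offset[c] = sum; sum += count[c]; }
            // Stable scatter into scratch, then copy the slice back.
            for (size_t i = begin; i < end; ++i) {
                size_t b = bucket(v[i]);
                scratch[offset[b]++] = std::move(v[i]);
            }
            std::move(scratch.begin() + begin, scratch.begin() + end,
                      v.begin() + begin);
            // Character buckets holding 2+ strings become the next slices.
            size_t start = begin;
            for (size_t c = 0; c < 129; ++c) {
                if (c > 0 && count[c] > 1) next.push_back({start, start + count[c]});
                start += count[c];
            }
        }
        slices.swap(next);
    }
}
```

Each pass walks the strings in order and bins them in order, which is what makes the whole sort stable.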

Conveniently, the way my algorithm works, it is in fact stable. We walk the items in order, bin them in order, then put them in the new list still in order. If two items are equal there is no point where they'd get swapped.

It's a LOT of temporary space, which is kind of ugly, but it's pretty performant as you saw above. Another optimization I haven't tried is short-circuiting slices of length 1. We should be able to trivially copy these over and skip all the bucketing work. Testing would be required to see if the extra conditional was worth it... but it seems likely.

Data tested on

To test this I'm generating random strings, using a simple algorithm where, with probability 9/10, I add another random uppercase letter, always stopping at 50 characters. I mention this because the distribution of the data can obviously impact the overall performance of a given algorithm. Note that this means we're functionally only using 26 of our 128 buckets. On the other hand, real strings are usually NOT evenly distributed, since languages carry heavy biases towards certain letters. This means my test is not exactly representative, but I haven't given it a clear advantage either.
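A generator matching that description might look like the following sketch (my own reconstruction; the function name and choice of RNG are assumptions, not the benchmark code):

```cpp
#include <random>
#include <string>

// Build a random uppercase string: after each letter there is a 9/10 chance
// of appending another, with a hard cap of 50 characters.
std::string random_test_string(std::mt19937& rng) {
    std::uniform_int_distribution<int> letter('A', 'Z');
    std::bernoulli_distribution another(0.9);  // 9/10 chance of one more letter
    std::string s;
    do {
        s.push_back(static_cast<char>(letter(rng)));
    } while (s.size() < 50 && another(rng));
    return s;
}
```

The geometric length distribution means most strings are short, so the early bytes do most of the sorting work.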


I can't say that this is a clear win for Radix Sort for sorting mixed-length strings. The temporary-space issue can be non-trivial, and time certainly isn't always worth trading for space; we're using about 3N additional space for this sort. That said, there are some obvious ways to reduce the space overhead if you need to: radix-sort smaller chunks of the array and then merge them, use 32-bit instead of 64-bit pointers, or come up with a cuter radix sort.

Note that my radix sort was a morning's work: figure out the algorithm, write and validate an implementation, find a couple of optimizations, and benchmark it. I wrote this post later. Oddly, "inline" made a huge difference to gcc's runtime (it's needed for the loop unrolling that handles the A/B list cases). In any case, I've little doubt someone can beat my implementation, and maybe find something using a bit less space. I just wanted to prove it was viable, and more than competitive with comparison-based sorts.

Median Find


Similar to Radix Sort, I thought it might be interesting to see how worst-case linear-time medianfind actually performs. Since the algorithm is identical to expected-case linear-time medianfind (usually called quickselect) except for the pivot selection, I elected to add a boolean to the template to switch between them (since it's a template parameter it gets compiled out anyway). Before we get to the results, here's a quick refresher on these algorithms:

Problem Statement

Imagine you have a sorted array. If you want the K'th largest element, you can simply look at location K in the array. But comparison-based sorting takes O(N log(N)) time (which is also the theoretical lower bound for comparison sorts). What if we want the K'th element without sorting the array first?

Quick Select

Quickselect chooses a pivot and runs through the array, throwing elements into two buckets: smaller and larger than the pivot. Then it looks at the number of elements in each bucket to tell which one contains the k'th element, and recurses. We can prove we usually choose a good pivot, making this expected O(N) time. But its worst case is much worse.
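As a sketch, the two-bucket quickselect described above might look like this (my own illustration; a real implementation would partition in place rather than allocating buckets each level):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Return the k'th smallest element (0-indexed). Pivot choice here is just
// the middle element, so this is the expected-O(N) variant with a bad
// worst case; the worst-case-linear version swaps in a smarter pivot.
uint32_t quickselect(std::vector<uint32_t> v, size_t k) {
    uint32_t pivot = v[v.size() / 2];
    std::vector<uint32_t> smaller, larger;
    size_t pivot_count = 0;
    for (uint32_t x : v) {
        if (x < pivot) smaller.push_back(x);
        else if (x > pivot) larger.push_back(x);
        else ++pivot_count;  // elements equal to the pivot need no recursion
    }
    // Bucket sizes tell us which side holds index k.
    if (k < smaller.size()) return quickselect(std::move(smaller), k);
    if (k < smaller.size() + pivot_count) return pivot;
    return quickselect(std::move(larger), k - smaller.size() - pivot_count);
}
```

Only one side is ever recursed into, which is where the expected N + N/2 + N/4 + ... = O(N) bound comes from.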

Worst-case-linear Quick Select

What if we always chose the right pivot? Or at least... a good *enough* pivot. This is how we build our worst-case-linear quickselect algorithm. It's a really cool trick, but it's been covered in many places, so if you want to know how it works you can check wikipedia, or this nice explanation.
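For reference, the classic "good enough" pivot is the median-of-medians: sort groups of 5, collect each group's median, then take the median of those. The sketch below is my own simplified version — the textbook algorithm finds the true median of the medians by calling the full select routine recursively, which is what earns the worst-case guarantee; this sketch just recurses on the medians list directly, which is easier to show but gives weaker guarantees.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pick a "good enough" pivot: medians of groups of 5, repeated until
// only one candidate remains.
uint32_t median_of_medians(std::vector<uint32_t> v) {
    if (v.size() <= 5) {
        std::sort(v.begin(), v.end());
        return v[v.size() / 2];
    }
    std::vector<uint32_t> medians;
    for (size_t i = 0; i < v.size(); i += 5) {
        size_t end = std::min(i + 5, v.size());
        std::sort(v.begin() + i, v.begin() + end);  // sorting 5 items is O(1)
        medians.push_back(v[i + (end - i) / 2]);
    }
    return median_of_medians(std::move(medians));
}
```

The point of the group-of-5 construction is that a guaranteed fraction of elements sits on each side of the pivot, so the recursion can never degenerate the way plain quickselect can.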

Real World performance

All of that is great in theory, but what is the *actual* performance of these things? Well, in a benchmark at least, but on a real computer.

As usual I'm working on arrays of test_size elements, k/test_size times, so we work over the same number of array cells at every point on the graph: small arrays many times on the left, and large arrays fewer times on the right.

For a while I was pretty upset about these results. The runtimes for "linear-time" quickselect look more like quicksort (the algorithm displayed as the "sort" line) than they do like basic quickselect. In short... that doesn't look linear at all. What the heck?

I must have a bug in my code, right? This algorithm was proved linear by people much smarter than me. So, my first step was to double-check my code and make sure it was working (it wasn't, but the graph above is from after I fixed it). I double, triple, and quadruple checked it. I wrote an extra unittest for the helper function that finds the pivot, to make sure it was returning a good pivot. Still, as you see above, the graph looked wrong.

I finally mentioned this to some friends and showed them the graph. Someone suggested I count the "operations" to see if they looked linear. I implemented my medianfind algorithm using a separate index array, so that I could output the index of the k'th element in the *original* array. From there everything is done "in place" in that one index array. As a result, swapping two elements is my most common operation, which seemed like a pretty accurate representation of "operations". So, here's what that graph looks like.

Now THAT looks a bit more linear! It's not exactly a flat line, but it looks asymptotic to a flat line, and thus classifies as O(N). Cool... so why doesn't the first graph look like this?

Real machines are weird. That index array I'm storing is HUGE. In fact, it's twice the size of the original array, because the original is uint32_t's and my index array is size_t's for correctness on really large datasets. The first bump is similar in both graphs, but then a little farther to the right in the time graph we see it go crazy... that is probably the algorithm starting to thrash the cache. Presumably if I made it big enough we'd see it go linear again. That said, if we go that large we're soon running on a NUMA machine, with even more layers of slowness, or hitting swap.

So, should you *ever* use guaranteed linear-time medianfind? Probably not. If there is a case, it's vanishingly rare. It happens that the pivot selection distributes well, so maybe there's a use there? But if you just used the "fastsort" we talked about in my last post you'd get even better performance, and it's still linear, AND it distributes well too! It's not comparison-based of course, but there are very few things that can't be usefully radixed if you're stubborn enough about it.

Big O notation didn't end up proving all that useful to us in this case, did it? Big O is usually a good indicator of which algorithm will be faster on large amounts of data. The problem is: what is large? Computers are finite, and sometimes "large" is so large that computers aren't that big, or their behavior changes at those sizes.

Fast Sort


In my last post I benchmarked various sorts against each other to test a couple of radix algorithms. Radix Sort came out the clear winner for large data sizes (as we would expect), but that experiment left some questions.

In particular, on small arrays (less than ~100 elements), Radix Sort performed quite poorly. I posited a couple of theories: maybe it's prefetching and branch prediction, or maybe it's all the passes over the radix array itself that cost more (as a friend of mine suggested). If the former is the main factor, then we should see little performance difference as the radix size changes. If the latter is the main factor, then as the radix size goes up our runtimes on these small arrays should shoot through the roof.

Radix Sort arity test results

As you'll recall from my last post, every point on this graph involves sorting the same total number of elements. The X axis indicates the length N of the array sorted, which is then sorted K times where N*K is a constant. This way the graph shows nonlinearities in the algorithms as they change with N. The subscript for each algorithm is the number of *bits* in the radix, so the actual radix size is 2 to that power (e.g. 256 for the yellow line). Note that the constant is not the same as in my last post, so please don't cross-compare the graphs.

This graph gives us a pretty clear answer: our slow performance on small arrays is definitely due to the radix array. With a radix of 256 it doesn't start to look linear until around 1000 elements, and doesn't beat a radix of 2 until ~20 elements.

You're probably wondering about the weirdness out on the right. Testing on larger arrays gets time-consuming quickly; logarithmic graphs are annoying that way. That said, I generated several graphs like this one, and all had similar artifacts towards the right, especially for larger radix sizes. It looks pretty noisy overall. My current guess is caching behavior. The machine I'm testing on is far from sanitary; lots of other stuff is going on in the background. It could be that as radix sort uses nearly the entire cache it becomes highly sensitive to other workloads. A larger radix might give more opportunity for chunks of the radix array to get kicked out repeatedly, and thus be more sensitive.

Mixed sort

We saw in my last post that radix sort is extremely valuable for larger data sizes; in fact, it beat out everything else by around 40 elements. On the other hand, we also saw that it has pretty bad behavior on really small arrays. Those results were found with a radix size of 32 (or 5 bits, which would land between the blue and green lines on today's graph).

From this new experiment we see that there is a significant cost to larger radix sizes. For most uses on modern computers 256 integers is a non-issue memory-wise, but with a large radix our performance is abysmal on really small arrays. A standard solution to this type of problem is to blend algorithms (this is frequently done with quicksort, using selection sort for extremely small arrays). Mergesort is stable, and heapsort, while not stable, is in-place with good worst-case behavior, so they both seem like good candidates. Let's see how they compare.

Radix sort arity test results

The graph is a bit ugly because I didn't do a ton of points, but the results are still pretty clear. If we just take the best algorithm at every point, it looks like we should use heapsort below ~20 elements, then radix sort with 4 bits up to 30, then radix sort with 6 bits up to 60, and then radix sort with 8 bits. No doubt if I tested larger radix sizes this trend would continue.

Of course, if our conditional choice gets too complex, its cost becomes a problem as well. To make radix sort fast it's compiled with the radix size as a template parameter, so each switch is going to thrash the code cache, etc. If you're getting THAT picky you'll be benchmarking for your specific use-case anyway. So for general use I propose we keep it simple: heapsort up to about 30 elements, then radix sort with 6 bits above that. That's a nice simple algorithm that gets us pretty close to our ideal speeds, and it certainly wipes the floor with any other single algorithm. We'll call our new mixed algorithm fast sort.
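A minimal sketch of that dispatch (my own illustration, with a compact LSD radix sort inlined so it's self-contained, and std::make_heap/std::sort_heap standing in for a hand-rolled heapsort; the cutoff of ~30 comes from the graphs above):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Compact LSD radix sort; BITS is the per-pass radix width.
template <unsigned BITS>
void radix_sort(std::vector<uint32_t>& v) {
    constexpr size_t buckets = size_t{1} << BITS;
    constexpr uint32_t mask = static_cast<uint32_t>(buckets - 1);
    std::vector<uint32_t> scratch(v.size());
    for (unsigned shift = 0; shift < 32; shift += BITS) {
        size_t count[buckets] = {};
        for (uint32_t x : v) ++count[(x >> shift) & mask];
        size_t pos = 0;
        for (size_t c = 0; c < buckets; ++c) {
            size_t n = count[c];
            count[c] = pos;
            pos += n;
        }
        for (uint32_t x : v) scratch[count[(x >> shift) & mask]++] = x;
        v.swap(scratch);
    }
}

// "Fast sort": heapsort for tiny arrays, 6-bit radix sort otherwise.
void fast_sort(std::vector<uint32_t>& v) {
    if (v.size() < 30) {
        std::make_heap(v.begin(), v.end());
        std::sort_heap(v.begin(), v.end());
    } else {
        radix_sort<6>(v);
    }
}
```

The branch runs once per sort call, not per element, so the dispatch overhead is negligible next to either underlying algorithm.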

Radix sort arity test results

It's a little hard to see the green line that is fastsort, but that's because it performs exactly how we would hope: like heapsort up until 30, and then like radix sort with a 6-bit radix.


Discussions of sort speeds are usually theoretical (how many comparisons are being done). In reality, as we showed with AVL trees being just as fast or faster than Red-Black trees, moves and cache behavior play at least as big a role as comparison count. Most of the literature I've seen says heapsort is not a good choice for speed alone, while here we've found it to be a very good option for smaller arrays. Our final fast sort algorithm differs from anything you are likely to find in a textbook, yet is extremely performant and, once the radix sort kicks in, stable by happenstance.

There are of course caveats to these conclusions. As noted in earlier posts, these are my implementations; it's always possible (even probable) that I'm farther from the ideal implementation for one algorithm than another. Additionally, our fast sort has one rather annoying property drastically limiting its utility: the radix algorithm has to work on the type being sorted. At the moment it's written for unsigned types, but it would have to be changed for signed types, strings, integral types larger than size_t, etc. With comparison-based sorts you can sort largest-to-smallest simply by swapping the comparator; with radix sort (and thus fast sort), significantly less trivial adjustments have to be made to the sort itself if you don't want to reverse it in a separate pass. In practice, these limitations are probably why radix sort gets so little attention despite its obvious strengths: the details can't be abstracted away as cleanly as with comparison-based sorts.
