XKCD's phrase-to-units at high m·s^-1

In the previous installment of ~~bastardizing the English language~~ fun wordplay with Randall, we made a sentence-to-map-directions solver based on XKCD 2260. This time, we're in for something a little more scientific: a sentence-to-units solver based on the recent XKCD 2312. This time with unit symbols, we've got much smaller puzzle pieces to pack into phrases than we used to with towns, so we can aim for exact solutions. It turns out that people have made a lot of units, sometimes very weird ones, so we also shouldn't have much trouble getting at least the matching done.

The real fun question is how to do good matching. Undoubtedly, the best solution is short and sweet, one that can take some crazy complicated rap and find that it's got units of flow [m³/s]. These units we give names to are generally simple in base unit terms, so a good proxy for an elegant algorithm is one that makes small base-unit terms.

We can revisualize this problem in terms of geometry, where the named units are vectors in the 7-dimensional space of the SI units, drawn by their exponents along each dimension. We then consider all sets of named units with "good" coverage of the string (for which I've found that maximal coverage still yields a lot of different solutions) and ask about the statistics of the distance of the optimal sums over each set from the origin. With fairly good likelihood, with a long-enough string, there will be some solutions that actually produce perfectly unitless answers, thanks in large part to the strong law of small numbers. The vectors are usually pretty small and fairly scattered (although biased towards the "common" units like length and time). We also only consider integer exponents of the base units, and so only integer points inthe SI space. With the mean lengths of the vectors is in and around 3 or so, it's not hard to see that the number of points within a ball of radius around 3 of the origin fills up pretty quick even with a few disjoint sets that fill the target phrase.

As I hinted earlier, the number of sets that maximally fill the string is pretty darn big from my testing, typically in the hundreds or thousands, and with some degree of pruning that I haven't quantified either. This is very encouraging for suboptimal solutions, and so with that thinking in mind, I went for the easy win: another dynamic-programming string muncher.

DP description & implementation

Unfortunately, the size of the number of possible permutations of the vectors vs. the origin-facing ones is quite gruesome, and so we need some heuristics to guide our way. Since the units can have positive and negative exponents, and we're optimizing an absolute kind of score, any local decisions will definitely be sub-optimal. However, staying close to 0 when we can is a good start, and so the DP is used to pull out a decent solution with near-zero units.

I went for the most obvious DP algorithm, where I minimize the unit size at each letter. Given the target phrase $s$, the exponent of the $i$th of the seven base units of some unit $x$ is $u_i{x}$, and the set of all units at our disposal is $L$, the DP step is:

$$\forall l \in L . C[n,l] = min_{j \in C[n+len(l)]} \left \{ \sum_{i=1}^7 | u_i\{l\} + u_i\{C[n+len(l),j]\} | \right \}\quad if:l = s[n:n+len(l)]$$

falling back to:

$$ C[n] = C[n+1] $$

if there are no matches at all. By the logic illustrated above, this is not optimal. However, I've come to appreciate how lucky we are that we as a society generally settled on the more common letters for the more common units. (Thank goodness for elementary charge → e for example). It's meant that the optimal solution would probably not involve a bunch of crazy aggregate units locking perfectly into each other all too often.

In fact, the data structuring posed a bigger problem with this project than the algorithm design. There are two tricky bits of this DP in particular:

Prefixes and reciprocation: The results of each memoization step are copied extensively, yet each solution needed to be able to tack on a prefix and reciprocate the result, so the memory that is shared between DP branches is tricky to manage.
Mappings galore: There are many mappings to preserve: the string segments to the phrase, the modified units to their original SI units, named unit symbols to their familiar names and Wikipedia links, and more.

A quick note on stack: I first scaffolded a solution in Python for agility, or so I thought. Within a few hours, the hassle of managing the isolation of branches of the DP by manual copying quickly became the bottleneck of the whole project, so it didn't take much convincing to flip the backend and relegate Python to the data preparation, where it filtered the heterogenous Wikipedia data. Haskell took over the server implementation, with its immutability and clear typing elegantly servicing both problems.

The last step was converting the SI units back into human-readable unit types (the "pressure"s and "force"s, etc.). This is a version of knapsack problem. In this case, knowing the weights are fairly small and that base units have direct representations in the unit set (e.g. meter, kilogram), the heuristic could be pretty aggressive in favoring "uncreative" solutions. In this case, I forced the unit score to be non-increasing over iterations to contain the explosions that even giving an extra +1 at each iteration created. However, every now and then you get an energy per unit volume in the solution and can't help but smile a little.

Please give it a spin! I'm on the lookout for a better units dataset, as Wikipedia's is especially skewed towards length and time dimensions. I wil probably fill in the electrical units soon as it's a minor travesty that farad and henry are missing, but the list is still certainly a workable start.

Use it: units.lam.io
GitHub: acrylic-origami/phrase2unit