The Curious Case of the Sorted Array

A curious journey into the world of geohashes and microprocessor caches, where code can be made to run faster without changing it.

27 October, 2020

Written by Ed Wright, Lead Geospatial Engineer

Microprocessors. Those tiny circuits with their billions of transistors working together at crazy speeds. As software engineers, we tend to think of them as engines to get our work done, often sitting in the cloud. Sure, we can have a big engine or a small engine, which in our high-level world usually only translates to being able to run a lot of threads concurrently, or perhaps only a few, but it usually stops there.

I remember with fondness the day in the office when it didn’t stop there.


The problem domain

So at Vortexa, we had one of those interesting problems to solve where you need both data scientists and engineers doing their thing: predicting where vessels are going to go, and the path they are going to take to get there. Determining the path is my department.

There is so much to this I could write thousands of words, but in essence we’re looking at this:

[Figure: a flow chart of the voyage-prediction pipeline]


  • Data science works out the vessel destination. Clever. Lots of machine learning models and stuff. Outside the scope of the curious sorted array thing (which is coming)
  • Model of the world’s oceans built. Again out of scope, but imagine a complex grid, a mesh of points across the world’s oceans mapping all the places vessels can go
  • A-Star shortest path – a standard algorithm for finding the shortest (cheapest would be a better term) path between two points on an interconnected mesh
  • Put those together, and we have a path across the seas, the route where we think the vessel will go

A-Star searching

There’s a pretty good page on Wikipedia about this. What it boils down to is we need the model of our world to be a collection of vertices, nodes in our mesh, which:

  • Have a location
  • Have a list of neighbours they are connected to, with a cost for getting to each connection

We also need two functions: a heuristic function which, given a location, estimates the cost to get to the destination (typically the distance), and a function giving the cost/distance between two connected nodes. Planet Earth is a bit curvy, so the haversine formula will do the trick for the cost calculations.
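As a rough sketch (not the production code, and with illustrative names), a haversine cost function in Rust might look something like this, taking coordinates in degrees and returning kilometres:

// A minimal haversine sketch: great-circle distance in kilometres between
// two latitude/longitude points given in degrees. Names and the constant
// below are illustrative, not Vortexa's production code.
const EARTH_RADIUS_KM: f64 = 6371.0;

fn haversine_km(lat1: f64, lon1: f64, lat2: f64, lon2: f64) -> f64 {
    let (lat1, lon1) = (lat1.to_radians(), lon1.to_radians());
    let (lat2, lon2) = (lat2.to_radians(), lon2.to_radians());
    let dlat = lat2 - lat1;
    let dlon = lon2 - lon1;
    let a = (dlat / 2.0).sin().powi(2)
        + lat1.cos() * lat2.cos() * (dlon / 2.0).sin().powi(2);
    2.0 * EARTH_RADIUS_KM * a.sqrt().asin()
}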

We have a lot of vertices and a lot of connections. We are also going to make heavy use of this service for a whole host of reasons. Thus we’re back in my comfort zone, where efficiency matters. As an example, each neighbourly connection can have its associated cost pre-calculated.

The goal here isn’t to teach A-Star; Wikipedia and other sources can do that. I just needed to set the scene so the data structure below makes sense.

Did I mention this thing needs to be fast? Rust was the language of choice. So we end up with a data structure which looks something like this:
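A minimal sketch, with illustrative names – each vertex carries its location plus a list of (neighbour index, pre-computed cost) pairs, so the whole mesh is one flat array:

// Illustrative sketch only. Each vertex stores its location and a list of
// (index of connected vertex, pre-computed cost) pairs; the mesh itself is
// a single Vec indexed by usize.
struct Vertex {
    lat: f64,
    lon: f64,
    neighbours: Vec<(usize, f64)>,
}

struct World {
    vertices: Vec<Vertex>,
}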

Even if Rust isn’t your thing, the above should be fairly self-explanatory.


Fast forward a bit

So I spent a good deal of time measuring where the A-Star code was slow, picking the right priority-queue library, tweaking this and that. Then a colleague gave me some advice which seemed random. I tried it, and the code ran 15% faster even though I hadn’t modified the algorithm.

Sort the world array by geohash.

Six words. We need to really take these apart to understand where that 15% came from.


For starters, what’s a Geohash?

Imagine my postcode, my zip-code, is 1234567. As you are not familiar with this fictitious country, you might not know where that is, but if I were to say I knew someone living in 1234568, you could guess that probably wasn’t very far away – and most likely you’d be right. You’re almost looking at a geohash.

There are many hashing schemes. Here’s a simple one, to illustrate the point.

Take a point on planet Earth – it’ll have two coordinates, latitude and longitude. Now get a pen and paper out.

  • Is the point above (further north) of the equator? Write a 1 if so, otherwise a 0.
  • Is the point west of the meridian? Write a 1 if so, otherwise (east) a 0.
  • Say in the first step we’d found the point to be north, now we’d be asking if it is north of 45º or not, 1 or 0. A similar story applies for the southern hemisphere.
  • Again if west, is it more than 90º west or not? 1 or 0, and the same trick can apply in the east.

…and on we go, subdividing the planet down, literally bit by bit. If I’d chosen New York in the US, we’d have 40ºN, 74ºW, so 1 (north), 1 (west), 0 (not further north), 0 (not further west); the geohash so far would be 1100.

What is 1100? Apart from a 4-bit binary number which only needs half a byte to store it, using this scheme it looks like this:

[Figure: map centred on the North Atlantic, showing the region covered by 1100]

It is 1/16th of the planet. Start adding more bits (you can encode this properly in bytes, no need to stick with binary strings) and you can start getting pretty precise. Google says there are about 510 million km² on the planet, so an innocent 32-bit number would narrow it down to 510 x 10^6 / 2^32 ~= 0.12 km². Note the units. This is quite precise!

Geohashes have the property that, apart from some boundary conditions, in general the longer the prefix two geohashes share, the closer the two regions are together. 11100101 and 11100111 are probably pretty close, and with longer geohashes the precision only improves.
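To make the scheme concrete, here’s a small sketch in Rust of the subdivision described above – not a real geohash library, and the function name and bit ordering are purely illustrative:

// Sketch of the bit-by-bit scheme above: alternately halve the latitude and
// longitude ranges, writing a 1 when the point falls in the "north" / "west"
// half of the current range. A proper geohash library does this for real;
// this is just the idea.
fn simple_geohash(lat: f64, lon: f64, bits: u32) -> u64 {
    let (mut lat_lo, mut lat_hi) = (-90.0_f64, 90.0_f64);
    let (mut lon_lo, mut lon_hi) = (-180.0_f64, 180.0_f64);
    let mut hash: u64 = 0;
    for i in 0..bits {
        hash <<= 1;
        if i % 2 == 0 {
            // Even bits split latitude: 1 means the northern half.
            let mid = (lat_lo + lat_hi) / 2.0;
            if lat >= mid { hash |= 1; lat_lo = mid; } else { lat_hi = mid; }
        } else {
            // Odd bits split longitude: 1 means the western half.
            let mid = (lon_lo + lon_hi) / 2.0;
            if lon <= mid { hash |= 1; lon_hi = mid; } else { lon_lo = mid; }
        }
    }
    hash
}

For the New York example above, simple_geohash(40.0, -74.0, 4) comes out as 0b1100.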


But sorting by Geohash?

How do you sort an array of 2D points? If you sort by latitude (Y coordinate), you’ll get longitude (X coordinate) values all over the place. Ditto if you sort the opposite way. No good.

If you sort by geohash, most (not all, there are boundary conditions) of the points in the final array will be geospatially close to their neighbours. New York will be next to New Jersey within the array, if you will.


Back to A-Star. Or do I mean microprocessor architecture?

So arrays are random access, right? We could have a vertex at, say, index 123 referring to a neighbour at index 1230000. We software engineers know that the indexing operator on an array is pretty fast. Random access. Fast.

Did you know that access time is not constant?

So, to the silicon. A microprocessor uses a memory hierarchy, simply because i) super fast memory costs a fortune and ii) modern systems need a lot of memory. So typically a modern CPU will have:

  • Registers. These are special kinds of memory which hold one word, one value, at a time. They can be literal values (the number 123) or a pointer to an address in memory, which ultimately is just another number. Interpretation is everything here. This is the fastest memory of all. It is difficult to quantify the access time as it varies with architecture, but suffice it to say it is one clock cycle of the CPU or less – well under a nanosecond.
  • Level 1 cache. Tiny. Could be 8KB. Comparable with a microcomputer from the early 1980s, but incredibly fast. Access will take low single-digit clock cycles. Note when a cache gets full, the least recently used data is discarded. This is not a data loss, as it was a mirror of data in main memory.
  • Level 2 and level 3 caches. Bigger and slower, but still way faster than main memory. Still small in the grand scheme of things, a few MB.
  • Main memory. Yawn. So very slow. And big. It can take over 100 clock cycles to get something from main memory. This is epically slower than the L1 cache or registers.
  • SSD. Let’s not even go there, thousands of times slower still. You thought they were fast?

A modern CPU is packed full of tricks (and thus huge numbers of transistors) aimed at using the high-speed cache memory and registers to the maximum, and avoiding main memory access whenever possible. Another trait of main memory is that, once accessed, it can provide a small chunk of memory in one go. Caches are designed to accept these chunks, called cache lines. On a typical Intel chip, a cache line is 64 bytes.

So, we’re doing A-Star. We’re at node 123. We need to go to node 1230000. But the latter is not in our cache – we’ve not read it before (a cache miss). Sad face. What the CPU actually does is load a cache line from main memory, grabbing the data you want and some bytes around it. The principle is called locality of reference: statistically, if a program accesses data at a certain address, it is likely to want more data nearby shortly afterwards. If the processor can grab memory before the code even requests it, you hide some of the slow main memory latency and things run a lot faster.
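A tiny, unscientific way to feel this effect for yourself: sum a large array once in memory order and once in a scattered order. A sketch along these lines (the exact ratio will vary wildly by machine):

use std::time::Instant;

// Unscientific sketch: sum a large array in memory order, then in a scattered
// order. The arithmetic is identical; only the memory access pattern changes.
fn main() {
    const N: usize = 1 << 24; // ~16 million u64s, far bigger than any cache
    let data: Vec<u64> = (0..N as u64).collect();

    let t = Instant::now();
    let sequential: u64 = data.iter().sum();
    println!("sequential: {:?} (sum {})", t.elapsed(), sequential);

    // Hop around with a large odd stride so consecutive reads land on
    // different cache lines (and mostly on different pages).
    let t = Instant::now();
    let mut scattered: u64 = 0;
    let mut idx: usize = 0;
    for _ in 0..N {
        scattered = scattered.wrapping_add(data[idx]);
        idx = (idx + 1_000_003) % N;
    }
    println!("scattered:  {:?} (sum {})", t.elapsed(), scattered);
}

Both loops visit every element exactly once and produce the same sum; the scattered one just defeats the prefetcher and the cache.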


So the curious case?

We should note at this point that, unlike in some languages such as Java, when Rust has an array of structures, the structures themselves are contiguous in memory. This is not an array of object references; this is an array of cookie-cutter structures, their fields laid out one after another, repeating the pattern. This is about to matter.

If, as a pre-processing step, we sort the array of vertices by their geohash (derived from the latitude and longitude coordinates), data which refers to similar locations on the planet will sit in similar locations in the array, and thus in physical memory. We don’t even need to store the geohash; we just use it as a sort key.
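As a sketch of that pre-processing step – reusing the illustrative Vertex/World shape and simple_geohash function from the earlier sketches – we sort the vertices by geohash and then rewrite the neighbour indices so they point at the new positions:

// Sketch of the pre-processing step, with the illustrative types from above.
fn sort_world_by_geohash(world: &mut World) {
    let n = world.vertices.len();

    // Old indices, ordered by each vertex's geohash.
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by_key(|&i| {
        let v = &world.vertices[i];
        simple_geohash(v.lat, v.lon, 32)
    });

    // Map old index -> new index so neighbour references can be rewritten.
    let mut new_index = vec![0usize; n];
    for (new, &old) in order.iter().enumerate() {
        new_index[old] = new;
    }

    // Rebuild the vertex array in geohash order, remapping neighbour indices.
    let sorted: Vec<Vertex> = order
        .iter()
        .map(|&old| {
            let v = &world.vertices[old];
            Vertex {
                lat: v.lat,
                lon: v.lon,
                neighbours: v
                    .neighbours
                    .iter()
                    .map(|&(nbr, cost)| (new_index[nbr], cost))
                    .collect(),
            }
        })
        .collect();
    world.vertices = sorted;
}

Note that the indices stored in each neighbours list have to be remapped too, otherwise the connections would point at the wrong vertices after the shuffle.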

So when A-Star accesses node 123 and then wants neighbouring node 125 (not 1230000, because neighbours are now close within the structure), there is a strong chance that, in reading node 123, the microprocessor has already pulled node 125 into the cache as part of the same cache line. The performance penalty of accessing comparatively super-slow main memory has been partially hidden.

Thus pre-sorting the array of vertices by geohash, without touching the actual code which uses the data, results in a measured 15% speed increase.

I grant you this is not a game-changing difference in performance, but the enjoyable thing for me at least was the journey. As an engineer, I simply love the fact we can make the code faster with no runtime code changes and no hardware changes!


Vortexa

I feel privileged to be working with such an amazing team. A team which can conjure up suggestions such as “sort the array by geohash” because they know what a geohash is and how a modern microprocessor works!

This has just been some fun, but the insights we have on a daily basis allow us to dig deeper into the vast array of data we hold, extracting more meaning for our customers.

We are hiring. Come join us.
