Data Structure Optimization

Let's store some data.

I need to store the combination of a 64-bit integer ID and a one-byte bit field. Like this:

type info struct {
  id   int64
  data byte
}

That's all well and good, but I actually need an ordered list of them. Like this:

type list []info

Which wouldn't be so bad if I didn't really need a bunch of them. Like this:

type blob struct {
  infos list
}

var AllTheThings []*blob

It gets rather expensive. The info struct gets padded, so instead of being 8 + 1 bytes, it is actually 16 bytes on a 64-bit machine. An array of them is therefore 43.75% padding. An alternative design is something like this:

type blob struct {
  ids  []int64
  data []byte
}

The small overhead of the extra slice header is quickly outweighed by the win of not having a ton of padding. I lose the type-system-supported invariant that each id is associated with exactly one byte of data. The maintenance of that invariant falls on the programmer: any change to the ids slice must be coordinated with a change to the data slice. But the memory usage wins can be significant. For reference, this sort of change dropped my program's heap usage from about 50GB to about 30GB.

Can we do better?

Another thing to look at is the real data that is being stored in the id field. For reasons of consistency it's an int64, but the real data I'm working with fits inside an int16 with quite a bit of headroom. So how about compressing things further:

type blob struct {
  ids16  []int16
  ids32  []int32
  ids64  []int64
  data16 []byte
  data32 []byte
  data64 []byte
}

That's even more code to maintain: the various bits of data have to end up in the right slice, stay sorted correctly, and get rearranged as necessary. But this brought my heap down to 16GB. Almost certainly worth the added code complexity.

Incidentally the original form of the struct is still available in my library, with tests to ensure parity between the original and compressed data types. The original format is more convenient to use and works well enough when memory isn't an issue.

But why?

This was done to ensure that all of my data fits in memory. The code has to go through all of the data in order fairly often, and it's much better to have everything in memory than to page it in from disk.