I was talking to $PERSON from $BIG_COMPANY the other day, and they happened to mention that around 80% of their CPU time was spent on serializing and deserializing data. We’re talking Protocol Buffers and C++ here, so it isn’t the usual suspects.
Deserialize / Alter / Serialize
With anything web-related, once flushed from the cover of frameworks and libraries, an enormous proportion of operations end up looking something like:
    def update_attribute(new_value):
        object = deserialize(input_stream)
        object.attribute = new_value
        output_stream.write(object.serialize())
… where the input_stream and output_stream are connections to a database of some kind. The serialization and deserialization may be hidden in the database bindings, or even in the core of the database itself, but it’s a general pattern.
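For instance, with a record stored as a JSON blob in some key/value store, the same function might look like this (db.get and db.put here are hypothetical stand-ins, not any particular library’s API):

    import json

    def update_attribute(db, user_id, new_value):
        raw = db.get(user_id)                 # serialized bytes come in,
        user = json.loads(raw)                # the whole record is deserialized,
        user['attribute'] = new_value         # one field is changed,
        db.put(user_id, json.dumps(user))     # and the whole record is reserialized.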
The problem is that, to improve efficiency, the object’s class tends to grow to encapsulate more and more information. Anything not in there has to be fetched separately, after all, so it is natural to denormalize as much into it as you can. So a “user” object tends to grow to include names, addresses, preferences, many-to-many relationships, and anything else you might need to update the user in a hurry.
But this means we’re deserializing and then reserializing all of these elements, from protocol format to native format and back, with every operation. And while it doesn’t take a lot of CPU to deserialize a single element or construct a single native object, it quickly adds up.
If you’re moving a lot of data around, a small increase in efficiency can be worth a lot.
Cap’n Proto looks like it has a lot of potential here, since it doesn’t actually deserialize to native objects … instead it offers accessor classes that let you read and manipulate the data in situ.
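The Python bindings (pycapnp) make the point nicely; the calls below are from memory and the user.capnp schema is made up, so treat this as a sketch rather than gospel:

    import capnp                             # pycapnp bindings
    user_capnp = capnp.load('user.capnp')    # hypothetical schema defining a User struct

    with open('user.bin', 'rb') as f:
        user = user_capnp.User.read(f)       # no parse step: accessors work on the wire bytes
        print(user.name)                     # fields are decoded lazily, in situ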
I think it’s a great idea, but as per Conway’s Law these kinds of APIs often represent administrative boundaries within the organization, so it is difficult to co-ordinate such a change of protocol. If you’re currently using JSON or ProtoBufs, you’re probably stuck with it.
A Modest Proposal
Back in the day, XML ran into exactly this problem. RAM was smaller back then, and big datasets couldn’t be deserialized all in one go, so XML SAX Parsers ruled the earth.
Instead of parsing the entire document in one go, a SAX parser runs over the document and generates ‘events’ at the start and end of every tag. These events cause callback functions to be run, and the callback functions can pull out the parts of the document you’re interested in. The event callbacks can also emit, element by element, a new XML document. So you can alter or transform an XML document without ever loading the whole thing into RAM at one time.
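Here’s a rough sketch of that idea using Python’s standard xml.sax module. The handler streams every event straight through to an XMLGenerator, rewriting the text of a hypothetical <attribute> element on the way, so the whole document is never held in memory at once:

    import sys
    import xml.sax
    from xml.sax.saxutils import XMLGenerator

    class RewritingHandler(xml.sax.ContentHandler):
        def __init__(self, out, new_value):
            super().__init__()
            self.out = out                  # XMLGenerator wrapping the output stream
            self.new_value = new_value
            self.in_attribute = False

        def startDocument(self):
            self.out.startDocument()

        def startElement(self, name, attrs):
            self.in_attribute = (name == 'attribute')
            self.out.startElement(name, attrs)

        def endElement(self, name):
            self.in_attribute = False
            self.out.endElement(name)

        def characters(self, content):
            if self.in_attribute:
                # replace the text of <attribute>; pass everything else through
                self.out.characters(self.new_value)
                self.in_attribute = False   # don't emit it twice if the text arrives in chunks
            else:
                self.out.characters(content)

    out = XMLGenerator(sys.stdout, encoding='utf-8')
    xml.sax.parse(sys.stdin.buffer, RewritingHandler(out, 'new value'))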
This general idea could be used with other protocol languages too, so your new pseudocode looks like:
    def update_attribute(new_value):
        def updater(key, value):
            if key == 'attribute':
                return new_value
            else:
                return value
        update(input_stream, output_stream, updater)
All update has to do is call the updater function for every element in the structure, passing the name of the element within the structure and the existing value. The updater function can then modify the structure on the fly.
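What update actually looks like depends entirely on the wire format. Just to make the shape concrete, here is a toy version for a made-up, line-oriented key=value format; a real one would walk Protocol Buffer fields or JSON tokens instead, but the point is the same: no whole-record native object is ever built.

    def update(input_stream, output_stream, updater):
        # stream one element at a time, straight from input to output
        for line in input_stream:               # one "key=value" element per line
            key, _, value = line.rstrip('\n').partition('=')
            output_stream.write(f'{key}={updater(key, value)}\n')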
More generally, a transform function could allow a new structure to be built, much as SAX does. This can even be done asynchronously, so that the data to be written can start being emitted even before the data being read has been fully received, much like cut-through switching in networking.
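Sticking with the same made-up key=value format, a transform might let the callback return zero, one or many output elements for each input element, so the output structure doesn’t have to mirror the input:

    def transform(input_stream, output_stream, transformer):
        for line in input_stream:
            key, _, value = line.rstrip('\n').partition('=')
            for out_key, out_value in transformer(key, value):
                output_stream.write(f'{out_key}={out_value}\n')

    def split_name(key, value):
        if key == 'password':
            return []                                    # drop this element entirely
        if key == 'name':
            first, _, last = value.partition(' ')
            return [('first_name', first), ('last_name', last)]
        return [(key, value)]                            # pass everything else through

Calling transform(input_stream, output_stream, split_name) streams the whole record through, renaming and dropping fields as it goes.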
Even a small improvement in serialization efficiency pays an enormous dividend.
It is easy to think of the sophisticated class structures as the “real” data and the serialized data as a pale reflection of that reality, but perhaps this isn’t the right way to think of it at all …