Friday, July 11, 2008

Protocol Buffers

Google decided to open their tool for serializing structured data, called Protocol Buffers. It's language and platform neutral way of communicating data over networks or serializing it for storage. Interesting bit is that Google is using it in almost all of their projects.

On the surface, tt reminds me a lot of CORBA, especially the way you define message structures, but it differs from it a lot in terms of message exchange and serialization - you can store and/or communicate your data structure across the network by any means, unlike CORBA where you're forced to use CORBa message brokers. In my opinion, that's the main reason why CORBA was never so widely accepted.

How does it all work? First of all, you define your structure using a DSL and store it in a .proto file. You then compile that file using a tool and create data access classes for your language of choice. Classes can be generated for C++, Java and Python - my choice would be, as always, Python. Those generated classes are then used to create, populate, serialize and retrieve your protocol buffers messages.

The messages are very flexible - you can add new fields to your messages without breaking old code that's using them; they will simply ignore it. That functionality comes in very handy, especially for larger systems (think versioning and deployment).

Yes, but what about XML, you might ask? According to Google, PBs have many advantages over XML; they are:
  • simpler
  • 3 to 10 times smaller
  • 20 to 100 times faster
  • less ambiguous
  • generate data access classes that are easier to use programatically

I might disagree with the last one (think of JAXB in Java world), but I completely agree with the other ones - it's quite true that XML tends to be cumbersome and a big overkill, especially in environments where size and speed does matter.

For the end, I saved a very interesting quote from Google's PB pages:
Protocol buffers are now Google's lingua franca for data - at time of writing, there are 48,162 different message types defined in the Google code tree across 12,183 .proto files. They're used both in RPC systems and for persistent storage of data in a variety of storage systems.

Looks very interesting and very promising, considering Google is behind it. I'll follow up this article with some neat examples.

No comments: