Engineer Perspective

Serialization is the process of converting data structures or objects into a format that can be stored or transmitted and reconstructed later. Over time, many serialization formats have been created to address needs in efficiency, interoperability, and convenience. We group them here as text-based formats, binary formats, schema-driven formats, and language-native serializers, tracing their history and design trade-offs. We also highlight key people behind these innovations. Code snippets from real libraries are shown (with references) to illustrate how these serializers work in practice.

Text-Based Formats (Human-Readable)

Early on, data interchange used simple text formats (e.g. CSV, INI-like key-value). The need for structured, self-describing data led to more sophisticated text formats:

  • XML (1996–1998): Developed by W3C editors (Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen) to mark up documents. XML is verbose and schema-optional, allowing nested trees. It became popular for configuration and RPC (SOAP), but its size and parsing complexity were criticized (JSON proponents mocked XML’s verbosity as “Too Much Markup”).

  • JSON (2001): Invented by Douglas Crockford and colleagues (e.g. Chip Morningstar) for lightweight browser–server messaging. JSON stands for JavaScript Object Notation, a human-readable format derived from JavaScript syntax. It uses name/value pairs and arrays. Why JSON? By the early 2000s, web apps needed a simpler, faster alternative to XML. Crockford “specified and popularized the JSON format” around 2001. Unlike XML, JSON omits closing tags and maps directly onto JavaScript-native types, making it much more compact. Trade-offs: it is easy to parse and widely supported, but has no built-in schema (only optional mechanisms like JSON Schema) and no native binary type (raw bytes must be encoded, e.g. as Base64 strings). For example, Python’s built-in JSON encoder exposes many options for controlling output, as the signature below shows.

    # Signature of json.dumps from CPython's json/__init__.py (simplified)
    def dumps(obj, *, skipkeys=False, ensure_ascii=True, check_circular=True,
              allow_nan=True, cls=None, indent=None, separators=None,
              default=None, sort_keys=False, **kw):
        """Serialize `obj` to a JSON formatted `str`."""
        # (cached encoder optimization omitted)
        return _default_encoder.encode(obj)
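
For instance, a quick round-trip using a couple of those options (standard library only; the indent and sort_keys arguments are chosen purely for illustration):

    import json

    record = {"name": "Alice", "id": 1234}
    text = json.dumps(record, indent=2, sort_keys=True)  # pretty-printed, keys sorted
    restored = json.loads(text)                          # parse back into a dict
    assert restored == record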
    
  • YAML (2001–2004): Spearheaded by Clark Evans, Brian “Ingy” Ingerson, and Oren Ben-Kiki, YAML (“YAML Ain’t Markup Language”) was created to be a very human-friendly data language (YAML 1.2 is effectively a superset of JSON). It uses indentation for structure (like Python), supports anchors and references, and can express complex data with minimal syntax. Why YAML? People needed a format easier to write by hand (e.g. config files) than JSON or XML. Its designers collaborated openly, publishing drafts from 2001 and the final YAML 1.0 spec in 2004. Trade-offs: extremely readable, but parsing is more complicated and implementations historically had security issues (PyYAML’s yaml.load could construct arbitrary objects, which is why yaml.safe_load is recommended). Nonetheless, YAML became popular in DevOps (e.g. Docker Compose, Kubernetes configs). For instance, using ruamel.yaml (a modern YAML library), one can dump a Python dict with:

    import sys
    from ruamel.yaml import YAML
    data = {"servers": [{"ip": "10.0.0.1", "role": "db"}, {"ip": "10.0.0.2", "role": "web"}]}
    yaml = YAML()
    yaml.dump(data, sys.stdout)
    # Outputs something like:
    # servers:
    #   - ip: 10.0.0.1
    #     role: db
    #   - ip: 10.0.0.2
    #     role: web
    
  • Others: Other text-based formats include TOML (2013, for configuration, by Tom Preston-Werner; see the snippet below), INI (simple key-value pairs), and EDN (Clojure’s extensible data notation, by Rich Hickey). The main trend is balancing readability vs. verbosity. Early formats (XML) were too verbose; JSON reduced syntax and became nearly universal for web APIs; YAML and TOML pare the syntax down further for hand-edited files. However, all text formats share the overhead of parsing text and offer only limited support for raw binary data.
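
As a quick illustration of TOML’s shape, Python 3.11+ ships tomllib, a read-only TOML parser (writing requires a third-party package such as tomli-w):

    import tomllib  # standard library since Python 3.11 (read-only parsing)

    doc = """
    title = "demo"

    [server]
    ip = "10.0.0.1"
    port = 8080
    """
    config = tomllib.loads(doc)
    print(config)  # {'title': 'demo', 'server': {'ip': '10.0.0.1', 'port': 8080}}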

Binary Formats (Compact/High-Performance)

Text formats have overhead (ASCII characters, parsing). As applications demanded higher performance and smaller messages (especially in distributed systems, IoT, and performance-sensitive apps), binary serialization emerged:

  • ASN.1 / XDR (1980s): Long before JSON, standards like ASN.1 and Sun’s XDR (External Data Representation) defined binary encodings for structured data (often in telecom or RPC). Both are schema-driven (ASN.1’s BER encoding uses type-length-value triples; XDR uses fixed 4-byte-aligned fields) but were complex to implement. ASN.1 still underlies many telecom and security protocols (e.g. X.509 certificates). A small XDR sketch follows.
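
For a taste of XDR’s fixed-width, 4-byte-aligned encoding, Python’s xdrlib module makes a handy sketch (it was deprecated in 3.11 and removed in 3.13, so this assumes Python 3.12 or earlier):

    import xdrlib  # in the standard library up to Python 3.12

    p = xdrlib.Packer()
    p.pack_int(1234)          # 4-byte big-endian signed integer
    p.pack_string(b"db")      # 4-byte length prefix, payload padded to a 4-byte boundary
    data = p.get_buffer()

    u = xdrlib.Unpacker(data)
    print(u.unpack_int())     # 1234
    print(u.unpack_string())  # b'db'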

  • BSON (2009): MongoDB introduced BSON (“Binary JSON”) to store JSON-like documents efficiently (authors Dwight Merriman and Eliot Horowitz). BSON adds length prefixes and a few extra types (e.g. binary blob, Date, ObjectId). The length prefixes let a reader skip fields quickly, though BSON is not always smaller than text JSON, and it still embeds field names, so it is not minimal. Trade-offs: convenient inside MongoDB, but larger than purely schema-driven binary formats.
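
A minimal sketch, assuming the bson module that ships with the PyMongo distribution (recent versions expose encode/decode helpers):

    import bson  # the BSON codec bundled with PyMongo

    doc = {"ip": "10.0.0.1", "role": "db"}
    raw = bson.encode(doc)   # length-prefixed binary document
    print(raw[:4])           # first 4 bytes hold the total size (little-endian int32)
    print(bson.decode(raw))  # {'ip': '10.0.0.1', 'role': 'db'}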

  • MessagePack (2008): Created by Sadayuki Furuhashi to be “like JSON, but fast and small”. It encodes JSON-like data (ints, strings, lists, maps) in a compact binary form, replacing JSON’s punctuation with one-byte type tags and length prefixes. MessagePack has wide language support. For example, in Python:

    import msgpack
    packed = msgpack.packb({"nums": [1,2,3]})
    print(packed)         # e.g. b'\x81\xa4nums\x93\x01\x02\x03'
    print(msgpack.unpackb(packed))
    # Output: {'nums': [1, 2, 3]}
    

Trade-offs: MessagePack is faster/more compact than JSON, but still dynamic (no schema). It can encode raw bytes and supports extension types. It’s ideal for RPC or in-memory caching.

  • CBOR (2013): The IETF standardized CBOR (Concise Binary Object Representation). Designed by Carsten Bormann and Paul Hoffman, CBOR targets tiny code footprint (for IoT) and flexible extension. It encodes JSON-like data (maps, arrays, numbers, text, byte strings) with a compact binary format. Key goals: “extremely small code size, fairly small message size, and extensibility”. Compared to MsgPack, CBOR has more precise typing (e.g. separate major types for text vs binary, built-in date/time tags) and is an RFC. Trade-offs: similar to MsgPack but sometimes more compact; implementations exist in many languages.
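
A minimal sketch, assuming the third-party cbor2 package; note that the text keys and the bytes value are encoded with distinct major types:

    import cbor2  # third-party CBOR codec

    payload = {"temp": 21.5, "ok": True, "raw": b"\x00\x01"}
    blob = cbor2.dumps(payload)  # compact binary; text vs. byte strings stay distinct
    print(cbor2.loads(blob))     # {'temp': 21.5, 'ok': True, 'raw': b'\x00\x01'}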

  • UBJSON and others: UBJSON (Universal Binary JSON), Smile (a binary JSON variant from the Jackson project), Google’s FlatBuffers (below), and others exist with niche usage. The general trend: binary formats give up human readability (messages need tooling to inspect) in exchange for speed and size. For example, Facebook’s Thrift binary protocol (below) and SBE (Simple Binary Encoding) are used in high-frequency systems.

Schema-Driven Formats (IDL-Defined)

For complex applications (microservices, RPC, data storage), schema-based serialization has become popular. These systems use an Interface Definition Language (IDL) or schema to generate code and ensure compatibility.

  • Protocol Buffers (open-sourced 2008): Developed at Google in the early 2000s (initial design by Sanjay Ghemawat and Jeff Dean; the open-source release was led by Kenton Varda). Protobuf uses a .proto schema file defining message types and fields, and generates code in many languages. Data is binary: fields are tagged by number and encoded with variable-length integers, omitting field names entirely. Why Protobuf? Google needed a fast, compact alternative to XML/JSON for internal RPC. Protobuf is highly efficient because it transmits just the field numbers and values, not the field names. Trade-offs: schemas must be compiled, and schema evolution requires care (reserving removed field numbers, choosing sensible defaults) for backward compatibility. Protobuf offers features like optional fields, nested messages, and enum types. For example, given message Person { int32 id = 1; string name = 2; }, the auto-generated Python API lets you write:

    import addressbook_pb2  # module generated by protoc from the .proto schema

    person = addressbook_pb2.Person()
    person.id = 1234                   # typed field: assigning a string raises TypeError
    person.name = "Alice"
    data = person.SerializeToString()  # compact wire bytes (field tags + varints)
    person2 = addressbook_pb2.Person()
    person2.ParseFromString(data)      # round-trip back into a typed message
    

This shows how code enforces types. Protobuf became a de facto standard (especially protobuf v3, with JSON compatibility) and influenced many others. (Notably, Kenton Varda later designed Cap’n Proto with lessons learned.)

  • Apache Thrift (2007): Originating at Facebook, Thrift combines an IDL, code generation, and a full RPC framework. Its IDL is similar to Protobuf’s. Thrift allows pluggable protocols and transports (binary, compact, JSON) and was designed for “scalable cross-language services development”. Trade-offs: the early Facebook/Apache C++ implementation is fast, but Thrift’s Java library historically had performance issues. Unlike Protobuf, Thrift was conceived with RPC in mind, and its multiple protocols let you choose readability (JSON protocol) or speed (binary/compact protocol). A minimal Python serialization sketch follows.
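
This sketch uses the Thrift Python library’s serialization helpers. The Person struct and the person.ttypes module are assumptions, standing in for code generated by thrift --gen py from an IDL like struct Person { 1: i32 id, 2: string name }:

    from thrift.TSerialization import serialize, deserialize
    from thrift.protocol.TBinaryProtocol import TBinaryProtocolFactory
    from person.ttypes import Person  # hypothetical generated module

    p = Person(id=1234, name="Alice")
    data = serialize(p, TBinaryProtocolFactory())               # binary protocol bytes
    p2 = deserialize(Person(), data, TBinaryProtocolFactory())  # decode into a new object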

  • Apache Avro (2010): Created by Doug Cutting as part of the Hadoop ecosystem. Avro uses JSON to define schemas (e.g. {"type":"record","name":"User",...}) and then encodes data in a compact binary form. A key idea: the schema travels with the data (in “object container files”) or is stored in a registry. This means no code generation is strictly required at read time (dynamic parsing is possible). Avro is optimized for Hadoop/Big Data (it uses row-based, schema-first encoding). Trade-offs: Having JSON schemas can be verbose, but it makes Avro very flexible and self-describing. It also supports schema evolution (old readers can skip unknown fields). For example, Avro’s file format starts with a header containing the JSON schema, then binary data blocks according to that schema.
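
A minimal sketch of that self-describing container, assuming the third-party fastavro package:

    import io
    from fastavro import writer, reader  # third-party fastavro package

    schema = {
        "type": "record", "name": "User",
        "fields": [{"name": "name", "type": "string"},
                   {"name": "age", "type": "int"}],
    }
    buf = io.BytesIO()
    writer(buf, schema, [{"name": "Alice", "age": 30}])  # header embeds the JSON schema
    buf.seek(0)
    print(list(reader(buf)))  # the reader recovers the schema from the header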

  • Cap’n Proto (2013): Designed by Kenton Varda (author of Protobuf v2) to eliminate the cost of parsing entirely. Cap’n Proto keeps data in a flat, pre-serialized “message” form that can be memory-mapped and accessed without a decode step. It uses an IDL (similar to Protobuf’s) and generates code. Cap’n Proto achieves extremely high speed (zero-copy), but the trade-off is rigidity: the wire layout is the in-memory layout, so messages are larger than packed encodings unless compressed, and schema evolution is more constrained (new fields must be appended with increasing ordinals). Use cases include game networking and shared-memory IPC.

  • FlatBuffers (2014): Developed by Wouter van Oortmerssen at Google, FlatBuffers also targets zero-copy access. It encodes objects with internal offsets, so fields can be read directly from the buffer without a parse step, much like Cap’n Proto. It uses its own schema language (and its compiler can also consume Protobuf .proto files). Trade-offs: FlatBuffers prioritizes read speed and memory efficiency, but building messages is more awkward and in-place mutation is limited (unlike Protobuf or Thrift, which make object construction easy). It’s used in performance-critical contexts (e.g. games, mobile). Wikipedia notes it “allows you to directly access serialized data without parsing”.

  • Others: There are many related systems, for example SBE (Simple Binary Encoding) from the FIX Trading Community (financial tech) and Amazon’s Ion (a richly-typed superset of JSON with both text and binary forms). The key theme is always the balance among size, speed, and flexibility. Static schemas (Protobuf/Thrift/Avro) give speed and type safety at the cost of flexibility, whereas dynamic formats (JSON/MsgPack) are more flexible but heavier. Modern developments push speed via zero-copy (Cap’n Proto, FlatBuffers) or hardware-accelerated parsing (e.g. simdjson for JSON).

Language-Specific / Native Serializers

In addition to portable formats, many languages have their own object serialization:

  • Python Pickle (mid-1990s): Python’s built-in pickle serializes arbitrary Python objects (classes, lists, dicts) to a binary stream. Invented early in Python’s history, it is very flexible but insecure (untrusted pickles can execute code) and not cross-language. Example usage is simply pickle.dumps(obj) and pickle.loads(bytes). (For code safety, new projects often prefer JSON or more explicit formats.)
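
For instance, a quick round-trip, using the standard library’s pickletools to inspect the opcode stream:

    import pickle
    import pickletools

    obj = {"nums": [1, 2, 3], "tag": ("a", "b")}
    blob = pickle.dumps(obj)          # binary pickle stream
    assert pickle.loads(blob) == obj  # never unpickle untrusted input!
    pickletools.dis(blob)             # disassemble the pickle opcodes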

  • Java Serializable (1996): Java’s java.io.Serializable allows objects to be written to bytes. It’s easy but not recommended for long-term storage (it’s not compact and can break across Java versions). Frameworks like Kryo, Hessian, or Protocol Buffers are often used instead.

  • Other languages: Ruby’s Marshal, .NET’s BinaryFormatter, Erlang’s External Term Format, etc. These are mainly for inter-process or storage within the same environment. They typically assume exact same runtime and are not interoperable.

Trade-offs: Native serializers require trust (they can be exploited like pickles), tie to a runtime, and aren’t optimized for cross-language or long-term storage. They serve convenience when moving data between processes written in the same language.

Key Contributors and Timeline

This section highlights people and milestones in the serialization story:

  • Douglas Crockford (JSON, 2001): Pioneered JSON to “break away” from XML complexity. JSON’s simplicity led to explosive popularity.

  • Clark Evans, Brian “Ingy” Ingerson, Oren Ben-Kiki (YAML, 2001–2004): Designed YAML as a more human-friendly alternative to XML and, eventually, a superset of JSON.

  • Sadayuki Furuhashi (MessagePack, 2008): Created MessagePack to be a “binary JSON” – fast and compact.

  • Kenton Varda (Protocol Buffers, 2008; Cap’n Proto, 2013): Led the open-source release of Google’s Protocol Buffers (noted in the code as “Author: kenton@google.com”) and later created Cap’n Proto to push performance further.

  • Sanjay Ghemawat & Jeff Dean (Protocol Buffers, 2008): Original designers of Google’s data interchange, as seen in Google’s own source: “Based on original Protocol Buffers design by Sanjay Ghemawat, Jeff Dean, and others”.

  • Doug Cutting (Avro, ~2009): Avro originated in the Hadoop ecosystem. DuckDB’s blog confirms “Avro was developed by Doug Cutting” around 2009. Cutting, creator of Lucene and co-creator of Hadoop, needed a flexible, compact format for big data.

  • Tim Bray, Jean Paoli, and C. M. Sperberg-McQueen (XML, 1996–1998): As W3C editors, they defined XML, which was the foundation for data interchange before JSON’s rise.

  • Carsten Bormann & Paul Hoffman (CBOR, 2013): Authored the CBOR RFC to standardize a compact binary JSON.

  • Wouter van Oortmerssen (FlatBuffers, 2014): Credited as FlatBuffers’ primary author; he created it at Google, originally with game development in mind.

  • Facebook Developers (Thrift, 2007): Thrift was “developed by Facebook” and later open-sourced via Apache. The original whitepaper credits Facebook engineers Mark Slee, Aditya Agarwal, and Marc Kwiatkowski.

These are just a few of many contributors. The evolution of serialization also had input from standards bodies (IETF, ECMA), academic papers (e.g. on ASN.1, data encoding), and companies (Google, Amazon, IBM, Facebook) pushing formats that solved their unique problems (e.g. Avro for Hadoop, Ion for AWS).

References

We have cited authoritative sources for each point above. Key references include:

  • JSON: Douglas Crockford’s work on JSON.
  • YAML: YAML 1.0 spec (Evans/Ingerson/Ben-Kiki).
  • MessagePack: Wikipedia entry (Furuhashi).
  • CBOR: IETF RFC 7049 (Bormann/Hoffman).
  • Protobuf: Google code comments (Ghemawat/Dean); protocol buffer tutorials.
  • Avro: DuckDB blog (Cutting).
  • FlatBuffers: Wikipedia (van Oortmerssen).
  • Thrift: Wikipedia (Facebook).
  • Code examples from various repos/libraries: CPython json.py, ruamel.yaml docs, msgpack-python README, Thrift utility example, Avro+Kafka example, and Protobuf Python guide.

Each citation provides concrete evidence (lines from specs or code) supporting the historical claims and showing real serializer usage. By weaving these elements, this “Serialization 101” overview covers the motivations, trade-offs, and evolution of serialization from multiple angles.