Monday, 22 August 2016

Deterministic json output from python

How do we get deterministic JSON output from Python?

JSON objects are not inherently ordered, their properties can be written in any order without making any difference to their meaning.

Unfortunately, people often use text tools to check the output, and sometimes we want to generate deterministic JSON:

Consider this bash/Python mess:

for ((i=0; i != 10; i++ )); do python3 -c 'import json; print(json.dumps({"a":420, "b":99, "c":100}))'; done

{"a": 420, "c": 100, "b": 99}
{"a": 420, "c": 100, "b": 99}
{"c": 100, "a": 420, "b": 99}
{"b": 99, "a": 420, "c": 100}
{"c": 100, "b": 99, "a": 420}
{"c": 100, "b": 99, "a": 420}
{"c": 100, "a": 420, "b": 99}
{"c": 100, "a": 420, "b": 99}
{"b": 99, "c": 100, "a": 420}
{"c": 100, "a": 420, "b": 99}

What's happening here?

Python's internal dicts' hashing is using some random number to prevent hash-collision attacks, which means that each time we run the program a different hashing pattern is produced, and the dictionary keys appear in a different order. This is good because the order isn't predictable to an attacker, but it means that an otherwise deterministic program generates different output each time it's run.

Why would this be a problem?
  • You can't compare JSON files using "diff" any more, because these differences always appear
  • Continuous integration - different output triggers false alerts
  • Humans may be able to read the json more easily in a specific order.

How do we fix it?

Use collections.OrderedDict. But be sure that you don't initialise it from a regular dict.

Correct:

od = collections.OrderedDict([("a",420), ("b", 999), ("c", 888)] )
od = collections.OrderedDict(); od["a"] = 420; od["b"] = 999; od["c"] = 888


Incorrect (creates a normal dict first):

od = collections.OrderedDict({"a":420, "b": 999, "c": 888})