- Using a bad universal data format.
- Depending on a universe of tools that make this bad format seem like the best choice.
We need that universe of tools. But we need them for a better data format.
What's wrong with plain text, then?
It is fundamentally incongruous with the data we store. Almost all data is structured: HTML, XML, JSON, and TOML are all ways to store structured data in text files. Programming languages are structured with complex grammars. Where we use binary formats, almost all of them store structured data. ZIP files, DOC files, PNG files: everything is structured.
The incongruity is in the use of in-band signaling to delineate data. We can signal the start and end of data in two ways:
- Length-prefixed encoding. The data is prefixed with a length field, then the exact specified number of bytes follows. Data content does not need to be escaped and finding the end of the data is trivial.
- In-band signals. The length of the data is not indicated in advance; instead, the data is terminated by a specific byte or sequence of bytes. If you want to include the terminator sequence as part of the data, it must be escaped.
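To make the difference concrete, here is a minimal sketch of both approaches. The function names, the newline terminator, and the backslash escape are choices made for this illustration, not part of any particular format.

```python
import struct

TERMINATOR = b"\n"   # assumed terminator for this sketch
ESCAPE = b"\\"       # assumed escape byte for this sketch

def frame_length_prefixed(payload: bytes) -> bytes:
    """Prefix the payload with its length as a big-endian 32-bit integer."""
    return struct.pack(">I", len(payload)) + payload

def read_length_prefixed(buf: bytes) -> tuple[bytes, bytes]:
    """Return (payload, rest). The payload bytes are never inspected."""
    (n,) = struct.unpack_from(">I", buf, 0)
    if len(buf) < 4 + n:
        raise ValueError("truncated input")
    return buf[4:4 + n], buf[4 + n:]

def frame_terminated(payload: bytes) -> bytes:
    """Escape the escape byte and the terminator, then append the terminator."""
    escaped = payload.replace(ESCAPE, ESCAPE + ESCAPE)
    escaped = escaped.replace(TERMINATOR, ESCAPE + b"n")
    return escaped + TERMINATOR

def read_terminated(buf: bytes) -> tuple[bytes, bytes]:
    """Scan byte by byte for the terminator, undoing escapes along the way."""
    out = bytearray()
    i = 0
    while i < len(buf):
        b = buf[i:i + 1]
        if b == TERMINATOR:
            return bytes(out), buf[i + 1:]
        if b == ESCAPE:
            nxt = buf[i + 1:i + 2]
            if nxt == b"n":
                out += TERMINATOR
            elif nxt == ESCAPE:
                out += ESCAPE
            else:
                raise ValueError("bad escape sequence")
            i += 2
        else:
            out += b
            i += 1
    raise ValueError("missing terminator")
```

The length-prefixed reader is a slice and a bounds check. The terminated reader has to look at every byte, and both sides must agree on the escaping rules exactly.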
In-band signaling causes problems everywhere it is used:
- The security and usability problems related to including JavaScript within HTML in a <script> tag.
- Above, I could not write "<script>" - I had to write &lt;script&gt;. Conversely, that had to be written as &amp;lt;script&amp;gt; - and so on.
- How do you include binary data in JSON? You base64-encode it, blowing up the size by a factor of 4/3.
- Security problems related to strings and line termination in HTML, JS and JSON.
- Ever tried including C++ code within C++ code - as in, a code generator? Or JavaScript within C++ code? Ha ha.
- In SMTP, email content is terminated by a line containing only a single dot. Any line in an email that begins with a dot must be escaped with an extra dot and unescaped in transmission.
- In email stored in the mbox format, any line in the content that begins with "From " must be escaped. This escaping is often not undone, so ">From" is visible to the recipient.
- ...
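Two of the items above are easy to reproduce. The sketch below measures the base64 overhead and shows a simplified version of SMTP dot-stuffing (the transparency procedure in RFC 5321, section 4.5.2). The function names are made up for this example, and real mail software handles many more details.

```python
import base64

def base64_overhead(n: int) -> float:
    """Ratio of base64-encoded size to raw size for n zero bytes."""
    return len(base64.b64encode(bytes(n))) / n

def dot_stuff(body: str) -> str:
    """Sender side: prepend a dot to every line that begins with a dot."""
    lines = body.split("\r\n")
    return "\r\n".join("." + line if line.startswith(".") else line
                       for line in lines)

def dot_unstuff(body: str) -> str:
    """Receiver side: strip the leading dot from lines that begin with one."""
    lines = body.split("\r\n")
    return "\r\n".join(line[1:] if line.startswith(".") else line
                       for line in lines)

# 3000 raw bytes become 4000 base64 characters.
ratio = base64_overhead(3000)

# A body containing a lone dot would otherwise end the message early.
body = "Hello\r\n.\r\n..x\r\nbye"
stuffed = dot_stuff(body)
```

With a length prefix, neither step would exist: the binary data and the lone dot would be transmitted as they are.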
A better universal data format would be much like XML or JSON or TOML. These formats are designed for general-purpose structured data, which is what we almost always want to store.
Except: it needs to be binary and use length-prefixed encoding.
Then, we need a universe of tools, as powerful as the tools we have for textual files right now, to search, create, process, edit, compile, compare, and store versions of files in this universal data format.
The reason plain text seems "friendly" right now is simply the presence of all those tools. If we can settle on a universal binary format with length-prefixed encoding, and develop the associated tools, the new format and its toolset will be obviously superior and preferable to most everyone. The only problem we have right now is... no tools.
A candidate format could be ASN.1, but ASN.1 over-emphasizes saving every bit possible. This complicates the format so much that decoders are rife with security problems, and the complexity is an obstacle to the development of tools. In comparison, the SSH protocol does not emphasize saving every bit possible, and as a result is very straightforward to decode. For example, a string is a big-endian 32-bit length field followed by the bytes - encoded exactly the way you'd expect.
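As a rough illustration of how little code this takes, here is a sketch of the SSH `string` and `uint32` encodings described in RFC 4251. The helper and class names are invented for this example and do not come from any SSH library.

```python
import struct

def ssh_uint32(value: int) -> bytes:
    """uint32: four bytes, big-endian."""
    return struct.pack(">I", value)

def ssh_string(data: bytes) -> bytes:
    """string: a uint32 length followed by exactly that many bytes."""
    return ssh_uint32(len(data)) + data

class Reader:
    """Reads consecutive fields from a buffer (hypothetical helper)."""

    def __init__(self, buf: bytes) -> None:
        self.buf = buf
        self.pos = 0

    def uint32(self) -> int:
        if self.pos + 4 > len(self.buf):
            raise ValueError("truncated uint32")
        (value,) = struct.unpack_from(">I", self.buf, self.pos)
        self.pos += 4
        return value

    def string(self) -> bytes:
        n = self.uint32()
        end = self.pos + n
        if end > len(self.buf):
            raise ValueError("string length exceeds buffer")
        data = self.buf[self.pos:end]
        self.pos = end
        return data

# The string "testing" is encoded as 00 00 00 07 followed by the 7 bytes.
encoded = ssh_string(b"testing")
```

The only validation the decoder needs is a bounds check on the length field. The content, newlines and quotes included, passes through untouched.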
Perhaps we need something like JSON, encoded like SSH does it.
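To give a feel for what that might look like, here is a purely hypothetical sketch: every value is a one-byte type tag, a big-endian 32-bit length, and then the payload. The tag letters and the `encode`/`decode` names are assumptions made up for this example, not an existing format.

```python
import struct

def encode(value) -> bytes:
    """Encode a JSON-like value as tag + uint32 length + payload."""
    if value is None:
        tag, body = b"N", b""
    elif value is True:
        tag, body = b"T", b""
    elif value is False:
        tag, body = b"F", b""
    elif isinstance(value, int):
        size = (value.bit_length() + 8) // 8   # room for the sign bit
        tag, body = b"I", value.to_bytes(size, "big", signed=True)
    elif isinstance(value, float):
        tag, body = b"D", struct.pack(">d", value)
    elif isinstance(value, str):
        tag, body = b"S", value.encode("utf-8")
    elif isinstance(value, bytes):
        tag, body = b"B", value                # raw bytes, no base64
    elif isinstance(value, list):
        tag, body = b"L", b"".join(encode(v) for v in value)
    elif isinstance(value, dict):
        tag, body = b"M", b"".join(encode(k) + encode(v)
                                   for k, v in value.items())
    else:
        raise TypeError(f"unsupported type: {type(value).__name__}")
    return tag + struct.pack(">I", len(body)) + body

def decode(buf: bytes, pos: int = 0):
    """Decode one value starting at pos; return (value, next position)."""
    if pos + 5 > len(buf):
        raise ValueError("truncated header")
    tag = buf[pos:pos + 1]
    (n,) = struct.unpack_from(">I", buf, pos + 1)
    end = pos + 5 + n
    if end > len(buf):
        raise ValueError("length exceeds buffer")
    body = buf[pos + 5:end]
    if tag == b"N":
        value = None
    elif tag == b"T":
        value = True
    elif tag == b"F":
        value = False
    elif tag == b"I":
        value = int.from_bytes(body, "big", signed=True)
    elif tag == b"D":
        (value,) = struct.unpack(">d", body)
    elif tag == b"S":
        value = body.decode("utf-8")
    elif tag == b"B":
        value = body
    elif tag in (b"L", b"M"):
        items, p = [], 0
        while p < len(body):
            item, p = decode(body, p)
            items.append(item)
        value = items if tag == b"L" else dict(zip(items[0::2], items[1::2]))
    else:
        raise ValueError(f"unknown tag: {tag!r}")
    return value, end
```

Strings with quotes and newlines, and arbitrary binary blobs, round-trip without any escaping. A tool can also skip over a value it does not care about by reading five bytes and jumping ahead.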