Serialization formats help to avoid messy custom grammars (though so do non-messy custom grammars), and are used extensively in computing – whenever there is a need to store data, to pass it from one program to another, to show it to or read it from a user.
Different formats come with their pros and cons, and often do affect the modules/structures one creates with an intent to serialize them – since the formats often lack a specified way to support some constructs (most notably, sum types), and their supported primitive types vary (e.g., unicode strings and binary data are not always supported, number representations are a mess). There is a Wikipedia comparison of data serialization formats, but I'd rather compare a different set of features here – the ones that seemed important to me at one point or another. Here they are:
Since it is mostly about shades of gray, common formats are (rather subjectively) rated on the scale from 0 to 4:
format | sum | txt | bin | sch | desc | prim | str | simp |
---|---|---|---|---|---|---|---|---|
json | 0 | 3 | 0 | 3 | 4 | 3 | 0 | 2 |
yaml | 0 | 3 | 0 | 0 | 4 | 2 | 3 | 0 |
xml | 2 | 2 | 0 | 4 | 4 | 3 | 3 | 1 |
dsv | 2 | 2 | 1 | 2 | 0 | 4 | 4 | 3 |
posix | 1 | 2 | 1 | 3 | 0 | 2 | 3 | 1 |
s-exp | 3 | 3 | 0 | 0 | 4 | 0 | 0 | 2 |
n3 | - | 1 | 0 | 4 | 4 | 3 | 4 | 3 |
cfg/bnf | 4 | 4 | 1 | 3 | 2 | 4 | 2 | 4 |
Expanding on those:
JSON was my go-to format for a while, but the situation with streaming and sum types is annoying. S-expressions would be more usable if they were standardised, and DSV has its pros and cons compared to those. But maybe XML is good enough for most purposes. Regardless of serialisation format choice, it is always possible to mess up underlying data models, or to compose and serialise those in a sensible way.
It is entertaining to muse on making a format from scratch, and perhaps useful to consider the choices one would make if it was practical to compose such a format.
I would try to pick a model for information encoding before its serialization: to be generally useful, it should be the least arbitrary. There are a few description logics made specifically for knowledge representation, and (G)ADTs usable with constructive/intuitionistic type theory and logic, which is usable as a foundation of mathematics. There are alternatives, and they are tricky to compare, but I'm fairly certain that any decent model would be quite usable, even if not the only (or the best) solution to knowledge representation.
Then there's the rabbit hole of composing a language for all sorts of things (which is also exciting, but likely less practical; see "formal human languages" for more musings on that). So perhaps it is a good idea to just pick a seemingly practical logic.
The serialization itself should then be as simple as possible according to the Chomsky hierarchy, and with as few rules as possible. And preferably individual schemas should be extensible by different parties without conflicts and confusion (as done in XML and n-triples).
Out of the listed formats, JSON and YAML quite clearly don't fit the description, perhaps POSIX file format notation doesn't either; s-expressions, XML, n-triples, and possibly just DSV seem fairly close, or at least usable for pretending that they are.
At some point in 2022 I tried to sketch such a format, aiming at ADTs encoded into DSV (akin to Coalpit, but with a basic specification instead of Haskell: relying on prefix notation with known/fixed arity), but that looked much like BNF. Since then, context-free grammars with BNF have been included in the comparison.
This reminds me of similarly simplifying and generalizing mathematical logic (proof assistance), producing metamath: similarly focusing on textual representation, and its rewriting. Likewise, rewriting systems are quite commonly (though often implicitly) used in programming and CS, and it might be interesting to try integrating those applications better than they commonly are. Also related are transition systems and action languages.
One can compose a format scoring well in the comparison table above by encoding algebraic data types using a format readable as both text and binary data, using fixed-length (powers of 2) textual strings for identifiers, hexadecimal numbers. Perhaps it can be made DSV-compatible by separating values with delimiters. ADTs would also provide a way to encode lists, though for optimization (and as a syntactic sugar) something akin to netstrings or quotation with escaping can be used. Compatibility with POSIX file format and s-expressions may be achievable in some cases.
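As a rough illustration of the netstring option mentioned above, here is a minimal sketch of the length-prefixed encoding (`<decimal length>:<bytes>,`). The function names `encode_netstring` and `decode_netstring` are made up for this example, and the sketch only shows why no escaping is needed: the length prefix says exactly where the payload ends.

```python
def encode_netstring(data: bytes) -> bytes:
    """Encode bytes as a netstring: b"<decimal length>:<data>,"."""
    return str(len(data)).encode("ascii") + b":" + data + b","


def decode_netstring(buf: bytes) -> tuple[bytes, bytes]:
    """Decode one netstring from the start of buf; return (payload, rest)."""
    head, sep, tail = buf.partition(b":")
    if not sep or not head.isdigit():
        raise ValueError("missing or malformed length prefix")
    n = int(head)
    # The payload must be followed by a comma right after n bytes.
    if len(tail) < n + 1 or tail[n:n + 1] != b",":
        raise ValueError("truncated payload or missing trailing comma")
    return tail[:n], tail[n + 1:]
```

The payload may contain delimiters, quotes, or parentheses unchanged, at the cost of being less pleasant to write by hand than quotation with escaping.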
Turning s-expressions into a sort of parenthesized DSV (skipping things like references, explicit primitive datatypes other than arbitrary text runs, mixing symbols and strings together, only using quotation to deal with spaces, defaulting to UTF-8), they would look like this:
(foo bar "baz \" ) . </text> qux" (1 2 3))
Instead of special syntax for strings with spaces, they could be grouped into lists (or even viewed as lists of characters, as done in some programming languages), rather similarly to use of texts within XML elements:
(foo bar (baz " \) . </text> qux) (1 2 3))
If there is a schema (constructors are known, have fixed numbers of arguments), parentheses can be skipped, as done in Coalpit:
foo bar "baz \" ) . </text> qux" cons 1 cons 2 cons 3 nil
Or:
foo bar "baz \" ) . </text> qux" : 1 : 2 : 3 .
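To make the schema-driven reading concrete, here is a minimal sketch of parsing prefix notation with known arity. It is not Coalpit's implementation: the `ARITY` table is a hypothetical schema for the example above, the input is assumed to be tokenized already (so quoted strings arrive as single tokens), and any token that is not a known constructor is treated as a primitive value.

```python
from typing import Iterator, Union

Tree = Union[str, tuple]  # a leaf token, or (constructor, arg1, ..., argN)

# Hypothetical schema for the example above: constructor name -> arity.
ARITY = {"foo": 3, "bar": 0, "cons": 2, "nil": 0}


def parse_prefix(tokens: Iterator[str], arity: dict) -> Tree:
    """Read one value in prefix notation, consuming exactly as many
    arguments as the schema says each constructor takes."""
    try:
        head = next(tokens)
    except StopIteration:
        raise ValueError("unexpected end of input") from None
    if head not in arity:
        return head  # not a known constructor: treat as a primitive value
    args = [parse_prefix(tokens, arity) for _ in range(arity[head])]
    return (head, *args)


tokens = ["foo", "bar", 'baz " ) . </text> qux',
          "cons", "1", "cons", "2", "cons", "3", "nil"]
tree = parse_prefix(iter(tokens), ARITY)
```

Since arities are fixed, the parser always knows where each value ends, which is what makes the parentheses redundant. One limitation of this sketch is that a string value that happens to equal a constructor name is indistinguishable from that constructor once tokenized.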
Putting those two together, omitting both parentheses and quotations:
foo bar : baz " ) \. </text> qux . : 1 : 2 : 3 .
At which point lists are not necessary for encoding of strings with (unescaped) spaces: one can generalize it to reading an arbitrary string value until there is a constructor that would allow taking another branch. That is, instead of "list a = : a list | .", "text a = <text> a </text>" would suffice and be more explicit, turning it into:
foo bar <text> baz " ) . \</text> qux </text> : 1 : 2 : 3 .
Its XML version would have been:
<foo> <bar/> <text> baz " ) . </text> qux </text> <:> 1 <:> 2 <:> 3 <./> </:> </:> </:> </foo>
Which is more verbose, but does not need a schema to parse. Conventional array encoding is different in XML though. And that can be converted into a sexp-like structure, akin to using SXML:
(foo bar (text baz " \) . </text> qux) (: 1 (: 2 (: 3 .))))
If CDATA were used in XML, it would more closely translate into a quoted string in s-expressions. After all those conversions, arbitrary-length runs of tokens within parentheses are used for unstructured data (text runs) instead of arbitrary-length lists, for which regular constructors are used. But since the parser may not be aware of constructors, one could as well use those runs for both, taking us back to an early simplification:
(foo bar (baz " \) . </text> qux) (1 2 3))
Preservation of text formatting on parsing would be a little more tricky than with explicit strings, more like that with XML, yet the syntax becomes simpler. Quotation marks help to distinguish between runs that should be parsed and raw texts, similarly to type annotations. These text runs, which look like expressions, can be processed similarly to primitive types: perhaps provided by a parser as strings, to be parsed recursively. Explicit type annotations (e.g., (uint32 3), (string (foo bar))) can be embedded into s-expressions as well.
It can also be approached the other way around, replacing parentheses with quotation marks (which is more idiomatic for DSVs), but that would require much more awkward escaping for nested structures, while with parentheses escaping can be skipped if there is no risk of closing the outer expression unintentionally.
I drafted the grammar and a few implementations (C with Bison and flex, Haskell with attoparsec, Python with pyparsing) in the word-tree repository. The ABNF grammar follows:
delimiter = " " | "\n"
restricted-char = "(" | ")" | delimiter | "\"
tree-or-val = "(" forest ")" | delimiter + | ("\" restricted-char | any-char - restricted-char) +
forest = tree-or-val *
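The grammar above is small enough to parse by hand. Below is a minimal recursive-descent sketch in plain Python; it is not the code from the word-tree repository, and the names `parse_forest` and `parse` are made up here. As a simplification, delimiter runs are dropped rather than kept as items, so text formatting is not preserved.

```python
DELIMITERS = " \n"
RESTRICTED = "()\\" + DELIMITERS


def parse_forest(text: str, pos: int = 0, depth: int = 0):
    """Parse a forest starting at text[pos]; return (items, next_pos).

    Items are either words (str, with escapes resolved) or nested
    forests (list). Delimiter runs are dropped in this sketch.
    """
    items = []
    while pos < len(text):
        ch = text[pos]
        if ch in DELIMITERS:
            pos += 1
        elif ch == "(":
            subtree, pos = parse_forest(text, pos + 1, depth + 1)
            if pos >= len(text):  # ran out of input before ")"
                raise ValueError("unclosed parenthesis")
            items.append(subtree)
            pos += 1  # skip the closing ")"
        elif ch == ")":
            if depth == 0:
                raise ValueError(f"unmatched ')' at offset {pos}")
            return items, pos  # the caller consumes the ")"
        else:
            word = []
            while pos < len(text) and text[pos] not in "()" + DELIMITERS:
                if text[pos] == "\\":
                    pos += 1  # a backslash must escape a restricted char
                    if pos >= len(text) or text[pos] not in RESTRICTED:
                        raise ValueError(f"bad escape at offset {pos}")
                word.append(text[pos])
                pos += 1
            items.append("".join(word))
    return items, pos


def parse(text: str) -> list:
    """Parse a whole document into a forest (a list of words and subtrees)."""
    items, _ = parse_forest(text)
    return items
```

For instance, `parse(r'(foo bar (baz " \) . </text> qux) (1 2 3))')` yields a single tree whose third element is the text run with the escaped parenthesis resolved to a plain `)` word.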
For its schema, since it has no types, I think it makes sense to focus on parsing alone, possibly using generic grammar descriptions, such as ABNF. This brings us back to the CFG option mentioned above, just with a grammar skeleton, which would allow one to skip defining a proper parser in some cases.