- Feature Name: binary-type
- Start Date: 2020-12-16
- Tremor Issue: tremor-rs/tremor-runtime#0000
- RFC PR: tremor-rs/tremor-rfcs#0000
This RFC proposes the addition of a new type, `binary`, along with the underpinnings needed to extend tremor's type system.
The motivation here is three-fold.
At the time of writing, tremor has no way to handle 'raw' data. Even if we never want to inspect or modify a payload, we always pay the price of translating the data into tremor's internal, JSON-esque format or representing it as a UTF-8 string. Both choices carry a computational cost at best and are wrong and lossy at worst. A binary type allows us to treat a message as "a bunch of bytes", removing any loss or additional computational cost of encoding and decoding it.
The second reason for a binary data type is to handle messages that are not covered by existing codecs. A binary type, along with functions to inspect, deconstruct, or create binary messages, gives users a new extension point for parsing their own non-textual formats.
Last but not least, extending tremor with a binary type, something that JSON is not capable of representing, lays the groundwork for adding more powerful types to tremor. It serves as a case study of the cost and complexity involved and will help simplify later additions.
The `binary` type in tremor comes with additions to the tremor-script syntax, the type system, codecs, the standard library, and a specification of the serialization behavior. We will discuss each of those in turn.
The easiest way to think about it is that the `binary` type represents an array of bytes.
Serialization is codec dependent and not always symmetric. A general-purpose codec that cannot represent binary data (JSON, for example) will default to base64-encoding the data as a string. This is an asymmetric change: by default we will not decode a base64 string as binary, since without context we cannot determine whether that is the desired behavior. Decoding can, however, be done explicitly using tremor-script functions.
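The asymmetry can be sketched in a few lines of Rust. This is only an illustration, not tremor's codec code: `Value`, `encode_for_text_codec`, and `decode_from_text_codec` are hypothetical names, and the base64 encoder is a small hand-written version of the standard alphabet with `=` padding.

```rust
/// Hypothetical stand-in for a tremor value; only the two variants needed here.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    String(String),
    Bytes(Vec<u8>),
}

const ALPHABET: &[u8; 64] =
    b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 (RFC 4648 alphabet) with `=` padding.
pub fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        // Pack up to three bytes into the low 24 bits of a u32.
        let b0 = chunk[0] as u32;
        let b1 = *chunk.get(1).unwrap_or(&0) as u32;
        let b2 = *chunk.get(2).unwrap_or(&0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(ALPHABET[(n >> 18) as usize & 63] as char);
        out.push(ALPHABET[(n >> 12) as usize & 63] as char);
        if chunk.len() > 1 {
            out.push(ALPHABET[(n >> 6) as usize & 63] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(ALPHABET[n as usize & 63] as char);
        } else {
            out.push('=');
        }
    }
    out
}

/// Encoding side: a text-only codec falls back to base64 for binary values.
pub fn encode_for_text_codec(value: &Value) -> String {
    match value {
        Value::String(s) => s.clone(),
        Value::Bytes(b) => base64_encode(b),
    }
}

/// Decoding side: a string stays a string, even if it is valid base64.
pub fn decode_from_text_codec(text: &str) -> Value {
    Value::String(text.to_string())
}
```

Passing `Value::Bytes(vec![1, 2, 3])` through both functions yields `Value::String("AQID")`, not the original bytes. Recovering the bytes takes an explicit decoding step, which is the role the base64 functions in tremor-script are meant to play.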
We use the Erlang bit syntax as an inspiration. Tremor script gains support for binary semi-literals of the form `<< expr1:<size>/<type>, expr2 >>` where:
- `expr` can be either a number, another binary, or a string.
- `size` defines the size in bits for numbers, ranging from 1 to 64 (sub-byte sizes are supported), or the size in bytes for strings and binaries. The default size is `8` for integers and the entire binary for strings and binaries.
- `type` needs to be specified as `binary` for strings and binaries, or can be a combination of endianness (`little`), signedness (`signed`), and numeric type (`integer`), where the parts are concatenated by a `-` and the first one enumerated is the default.
The default (no size or type suffix) being equivalent to
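To make the effect of size and endianness on an integer segment concrete, here is a minimal Rust sketch. It is not the tremor implementation: `Endian` and `encode_int` are hypothetical names, the `Big` variant assumes a big-endian option exists alongside `little`, and the sketch is restricted to byte-aligned sizes to stay short.

```rust
/// Byte order of an integer segment (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Endian {
    Big,
    Little,
}

/// Encodes the low `bits` bits of `value` as one integer segment.
/// Returns `None` unless `bits` is a multiple of 8 between 8 and 64;
/// sub-byte sizes are deliberately left out of this sketch.
pub fn encode_int(value: i64, bits: u32, endian: Endian) -> Option<Vec<u8>> {
    if bits == 0 || bits > 64 || bits % 8 != 0 {
        return None;
    }
    let len = (bits / 8) as usize;
    // Two's-complement bytes, most significant first; keep the lowest `len`.
    let be = value.to_be_bytes();
    let mut out = be[8 - len..].to_vec();
    if endian == Endian::Little {
        out.reverse();
    }
    Some(out)
}
```

In this sketch signedness does not appear as a parameter, because truncating a two's-complement value gives the same bytes either way: `encode_int(-1, 8, Endian::Big)` and `encode_int(255, 8, Endian::Big)` both produce a single `0xFF` byte.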
This RFC introduces the `base64` module for encoding and decoding `binary` data as base64 strings. It also adds the `binary` module with basic functions such as `len`, as well as conversion functions to and from arrays of numbers.
The `string` module gains functions to convert a string to a `binary`, as well as a function that takes a `binary` and converts it to a UTF-8 `string`, in a potentially lossy fashion for invalid UTF-8 data.
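What "potentially lossy" means can be shown with the Rust standard library, which offers the same kind of conversion. The function names below are hypothetical and do not reflect the final tremor-script API.

```rust
/// String to binary: the string's UTF-8 bytes, copied as-is.
pub fn string_into_binary(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

/// Binary to string: every invalid UTF-8 sequence is replaced by U+FFFD,
/// so the original bytes cannot always be recovered.
pub fn string_from_utf8_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}
```

Valid UTF-8 survives a round trip unchanged. A lone `0xFF` byte, by contrast, comes back as the three bytes `0xEF 0xBF 0xBD`, the UTF-8 encoding of the replacement character.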
The `type` module will gain the `is_binary` function, and `type::as_string` will now return `"binary"` when appropriate.
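On the runtime side, these two additions amount to one more variant in the value type. The following Rust sketch assumes a simplified, hypothetical `Value` enum; apart from `"binary"`, the type names returned by `as_string` are placeholders.

```rust
/// Simplified, hypothetical value type with the new binary variant.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    I64(i64),
    String(String),
    Bytes(Vec<u8>),
}

impl Value {
    /// Mirrors the proposed `type::is_binary`.
    pub fn is_binary(&self) -> bool {
        matches!(self, Value::Bytes(_))
    }

    /// Mirrors the proposed `type::as_string` behavior for binaries.
    /// The non-binary names are placeholders for this sketch.
    pub fn as_string(&self) -> &'static str {
        match self {
            Value::I64(_) => "integer",
            Value::String(_) => "string",
            Value::Bytes(_) => "binary",
        }
    }
}
```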
This RFC introduces the `binary` codec, which passes bytes through unmodified as binary data.
The `json` codec is changed to serialize `binary` data as a base64-encoded string; however, it will not automatically decode base64 strings as binary.
The `msgpack` codec should encode and decode binary as
The `binary` type is a `Vec<u8>` that serializes as a base64-encoded string in simd-json and as `bytes` in serde-compatible encoders.
If the newly implemented functions are all constant, they can be pre-computed wherever constant folding is possible.
Binary semi-literals will, whenever possible, be turned into full literals using constant folding, but remain constructs where that is not possible, following the example of arrays and records.
If a literal is created that is not byte aligned, the bits that overhang the last byte boundary will be treated as part of a new byte, filling its less significant bits.
In other words, `<<1:8, 2:4>>`, where we only have 4 bits of the second byte, will fill to:
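One reading of the overhang rule can be sketched as a small bit packer in Rust. This is an illustration under stated assumptions, not the tremor implementation: `pack_bits` is a hypothetical helper, fields are treated as unsigned, and bits within a field are written most significant first, as in Erlang.

```rust
/// Packs unsigned fields given as (value, width in bits), width 1..=64.
/// Bits are written most significant first. If the total is not a multiple
/// of 8, the leftover bits land in the least significant positions of a
/// final byte whose upper bits are zero.
pub fn pack_bits(fields: &[(u64, u32)]) -> Vec<u8> {
    let mut bytes = Vec::new();
    let mut acc: u8 = 0; // bits collected for the byte in progress
    let mut used: u32 = 0; // number of bits currently held in `acc`
    for &(value, width) in fields {
        assert!((1..=64).contains(&width), "width must be 1..=64");
        for i in (0..width).rev() {
            let bit = ((value >> i) & 1) as u8;
            acc = (acc << 1) | bit;
            used += 1;
            if used == 8 {
                bytes.push(acc);
                acc = 0;
                used = 0;
            }
        }
    }
    if used > 0 {
        // Overhang: bits were shifted in from the right, so they already
        // occupy the low end of `acc`; the high bits are zero.
        bytes.push(acc);
    }
    bytes
}
```

Under these assumptions, `pack_bits(&[(1, 8), (2, 4)])` produces the two bytes `0x01` and `0x02`: the four overhanging bits `0010` occupy the low half of the second byte.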
This breaks symmetry with the JSON representation; however, all differences can be recovered using the provided tremor-script functionality.
During our research we found no decent alternative binary syntax to Erlang's binary forms.
Many C-derived languages support encoding bytes in a string-like form; this, however, offers nothing near the capabilities we require.
This RFC is heavily inspired by the [Erlang bit syntax](https://erlang.org/doc/programming_examples/bit_syntax.html).
At the time of writing the first draft of this RFC, it is still open how far bit syntax compatibility will go while balancing benefit against time constraints. Binary comprehensions, as they exist in Erlang, are not part of this RFC, and neither is support for 'less than a byte' boundaries.
Binary comprehensions are a major future possibility for bit syntax. They are also well separated, in that extracting them will not harm this RFC.
Matching binaries is the next logical step. Because it is significantly more complex, and the work on binary types and semi-literals is self-contained, it will be handled in a separate RFC.