troy

Summary

Add a deployment language to tremor with which whole event flows, consisting of connectors, pipelines and flows, can be defined, instantiated, run and removed again.

A connector is an existing artefact and represents a connection between tremor and an external system.

A pipeline is an existing artefact and is equivalent to a trickle query script which compiles to an executable pipeline as defined in the tremor-pipeline crate.

A flow is a new construct which replaces the concepts of a binding and a mapping in the legacy YAML deployment syntax that this RFC is targeted at replacing.

Motivation

Current deployments are separated into configuration scopes as follows:

  1. onramps - these define legacy onramp or source connector configurations
  2. offramps - these define legacy offramp or sink connector configurations
  3. pipelines - these are referenced implicitly by name using the file stem ( name minus extension, for a top level pipeline ) passed to the tremor server command line arguments
  4. bindings - these define how onramps, offramps and pipelines interconnect
  5. mappings - these instantiate and deploy bindings

The YAML format used for deployment configuration is sub-optimal for tremor application authors.

With modular tremor scripts, the business logic needed in tremor queries is significantly reduced for well-designed and well-factored tremor streaming applications. With modular queries, entire sub-queries can be modularised and reused, allowing the query logic required to compose a query application to be similarly terse.

This specification provides the same mechanisms and modularity provisions for deployments as the query and script language DSLs, replacing the YAML syntax with a modular deployment domain specific language that achieves the same hygiene factors with minimal additional syntax.

Also, given our experiences with tremor-script and tremor-query, a DSL that is well suited for the task at hand provides significant benefits:

  • Helpful and consistent hygienic errors that are syntax and semantics aware
  • Modularization of common definitions as reusable modules (e.g. HTTP proxy setup with sources, sinks, pipeline, and interconnecting flows)
  • More expressive - the deployment language embeds the query language, which embeds the scripting language, allowing more expressive application authoring.
  • Paves the way for a live REPL environment in the future

The primary advantage is the capability to integrate definitions seamlessly and reuse existing language concepts and idioms. Tremor developers and operators get a consistent experience across the suite of languages and avoid the distractions inherent in context switching from native DSLs to YAML, which requires a higher degree of knowledge of tremor internals than is desirable for new users.

Guide-level explanation

A useful tremor deployment always consists of one or more connectors, connected to one or more pipelines. These are deployed as a single unit that captures their interconnections and the overall flow of the deployed application logic.

The deployment lifecycle for an artefact follows the form:

  • Publish the definition (with all configuration) with a unique artefact id.
  • Compute a Tremor URL that uniquely identifies the instance being created.
  • Incarnate or launch the instance, if not already running, given its URL.
  • Compute the instance id and URL and register the instance accordingly.
  • Interconnect connectors and pipelines as required for correct runtime operation.
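
As an illustration, using names from the example later in this section: a connector definition published with the artefact id my_http_connector, and an instance created from it with the instance id ma_http, is registered under the instance URL referenced by connect statements below:

/connector/my_http_connector/ma_http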

The shutdown or undeployment sequence follows a similar form:

  • Pause and disconnect the interlinked artefact instances.
  • Stop and remove instances that are no longer referenced (quiescence).
  • Cascade and sweep remove any artefact instances that are no longer in use.
  • Cascade and sweep remove any artefact definitions that are no longer in use.

The Tremor Deployment Language (troy for short) is the interface to the tremor runtime and guarantees that these control interactions follow the correct sequence of operations.

The language aims for sufficient expressivity to define complex deployment patterns, whilst providing a syntax that encourages iterative refinement and development of complex applications from simple working parts.

When a troy deployment specification is deployed, the definitions, instances and interconnections are created and launched by the runtime, following the lifecycle patterns summarised above.

As with trickle, troy supports modular definitions, leverages the same preprocessor mod .. end and use statements, and shares the same module path, loading and module resolution mechanisms.

This allows boilerplate or reusable logic to be shared across multiple application deployments, whether deployed in the same tremor instance or otherwise.
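
As a sketch of how such reuse might look, assuming trickle-style use resolution (the module path deploy::connectors and the names within it are illustrative, not part of this RFC):

# hypothetical module file: deploy/connectors.troy
define http_server connector proxy
with
  codec = "json"
end;

# consuming troy file
use deploy::connectors;

define flow proxied
flow
  create connector my_proxy from connectors::proxy;
  connect my_proxy/out to "/pipeline/system::passthrough/system/in";
end;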

Example

# Defines a connector called `my_http_connector` based on
# the builtin `http_server` connector
define http_server connector my_http_connector

# Specifies required, possibly defaulted, arguments and defines
# the public interface of the `my_http_connector` user defined
# connector
args
  server = "Tremor/1.2.3",

# Defaults builtin `http_server` connector arguments, potentially
# referencing interface arguments
with
  codec = "json",
  preprocessors = [ "newline" ],
  config = {
    "headers": {
      "Server": args.server
    }
  }
end;

# Defines a connector called `my_file` based on
# the builtin `file` connector
define file connector my_file

# Has an empty interface specification
with
  # Specializes the arguments to the builtin `file` connector
  config = {
    "path": "/snot/badger.json"
  },
  codec = "json",
  preprocessors = [ "newline" ]
end;

# Define a simple passthrough pipeline called `passthrough`
define pipeline passthrough
pipeline
  select event from in into out;
end;

# Define the `my_flow_application` deployment
define flow my_flow_application
flow

  # This flow will create one instance of `my_http_connector`
  create connector ma_http from my_http_connector
  with
    # specializing or overriding the `server` interface argument
    server = "Tremor/3.2.1"
  end;

  # This flow will create one instance of `passthrough`
  create pipeline passthrough;

  # This flow will create one instance of `my_file`
  create connector my_file;

  # Link instances by id - may reference pre-existing instances known
  # to the target runtime where this flow is deployed
  connect "/connector/my_http_connector/ma_http/out" to "/pipeline/passthrough/passthrough/in";

  # Alternatively, may use local identifiers with explicit port assignments
  connect passthrough/out to my_file/in;

  # Or, local identifiers with defaulted port assignments
  connect passthrough to my_file;
end;

# Finally, instruct the runtime to `deploy` our flow application
deploy flow my_app from my_flow_application;

For an overview of alternatives we considered and discussed, see Rationale and Alternatives.

Reference-level explanation

Troy supports the following basic kinds of statements:

  • Definition of artefacts with define statements
  • Creation of artefact instances with create statements
  • Interlinking of instances with connect statements
  • Deployment control plane commands with deploy statements

Artefact definitions

Definitions of connector, pipeline and flow specifications are declared using the define statement. The EBNF grammar for each is provided below.

Connector definitions

EBNF grammar:

connector             = "define" builtin_ref "connector" artefact_id
                        [ "args" argument_list ]
                        [ with_block ]
                        "end"
builtin_ref           = id
artefact_id           = id
argument_list         = ( required_no_default | required_with_default ) [ "," argument_list ]
required_no_default   = id
required_with_default = id "=" expr
with_block            = "with" assignment_list
assignment_list       = assignment [ "," assignment_list ]
assignment            = let_path "=" expr
let_path              = id [ "." id ]

The statement declares a user defined connector based on a builtin connector provided by tremor, bound to a user defined id. A set of specification arguments, with required and optionally defaulted parameters, can be provided in the args clause. The configuration parameters required by the builtin connector from which the user defined connector is being created are provided in the with clause.

The args specification is the user's parameterization of the user defined connector. It declares the intended interface to users of the connector.

The with specification provides required configuration to the underlying builtin tremor connector.

Defaulted arguments provided in the args block can be used in the with block. The values can be any valid tremor-script expression that ordinarily returns a value: literals, match and patch statements, or for comprehensions. The emit and drop expressions, which ordinarily halt event flow, cannot be used.

Whitespace and newlines are not significant in the EBNF grammar.

Example:

define http connector my_http

# Specifies the `my_http` user defined connector interface
args
  required_argument,
  optional_arg_with_default = "default value"

# Provides essential configuration to the intrinsic or builtin
# `http` connector on top of which `my_http` is defined.
with
  codec = "json",
  preprocessors = [
    {
      "type": "split",
      "config": {
        "split_by": "\n"
      }
    },
    "base64"
  ],
  config = {
    "host": "localhost",
    "port": args.required_argument, # TODO verify
  },
  err_required = args.optional_arg_with_default
end;
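
Because with values are full expressions, a configuration entry may also be computed. A minimal sketch, assuming a hypothetical connector and argument (neither is part of the example above):

define http_server connector sized_server
args
  profile = "small"
with
  codec = "json",
  config = match args.profile of
    case "large" => {"max_connections": 1024}
    default => {"max_connections": 64}
  end
end;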

Pipeline definitions

EBNF grammar:

pipeline              = "define" "pipeline" artefact_id
                        [ "args" argument_list ]
                        "pipeline" query
                        "end"
artefact_id           = id
argument_list         = ( required_no_default | required_with_default ) [ "," argument_list ]
required_no_default   = id
required_with_default = id "=" expr
query                 = <tremor-pipeline>

The statement declares a user defined pipeline based on an inline tremor query, bound to a user defined id. A set of specification arguments, with required and optionally defaulted parameters, can be provided in the args clause.

This statement does not support a with clause, as there is no intrinsic artefact being extended or parameterised by this type of definition.

The args specification is the user's parameterization of the user defined pipeline. It declares the intended interface to users of the pipeline.

Whitespace and newlines are not significant in the EBNF grammar.

Example:

define pipeline my_pipeline
# Specifies the `my_pipeline` user defined pipeline interface
args
  required_argument,
  optional_arg_with_default = "default value"
pipeline
  # An inline tremor query ( trickle ) script
  use std::datetime;

  define tumbling window fifteen_secs
  with
    interval = datetime::with_seconds(args.required_argument),
  end;

  select { "count": aggr::stats::count(event) } from in[fifteen_secs] into out having event.count > 0;
end;

Flow definitions

EBNF grammar:

flow                  = "define" "flow" artefact_id
                        [ "args" argument_list ]
                        "flow" flow_stmts
                        "end"
artefact_id           = id
argument_list         = ( required_no_default | required_with_default ) [ "," argument_list ]
required_no_default   = id
required_with_default = id "=" expr
flow_stmts            = flow_stmt [ ";" flow_stmts ]
flow_stmt             = flow_create | flow_connect
flow_create           = "create" create_kind instance_id [ "from" modular_target ] [ with_block "end" ]
create_kind           = "connector" | "pipeline"
instance_id           = id
modular_target        = [ id "::" ]* id
with_block            = "with" assignment_list
assignment_list       = assignment [ "," assignment_list ]
assignment            = let_path "=" expr
let_path              = id [ "." id ]
flow_connect          = "connect" connect_endpoint "to" connect_endpoint
connect_endpoint      = system_endpoint | troy_endpoint
system_endpoint       = <stringified-url>
troy_endpoint         = id [ "/" port_id ]
port_id               = id

The statement declares a user defined flow specification. Flow specifications replace the binding and mapping constructs of the legacy YAML syntax. In the legacy syntax, the binding section specified how artefacts linked with each other to form a flow graph, and the mapping section specified how instances of the referenced artefacts were created at runtime.

Both were required to define a useful event flow graph at runtime. Production users of tremor who have been with the project from the beginning have outgrown these primitives: they are now building significant, complex streaming applications on top of tremor.

The YAML syntax is not modular, and outside of simpler deployments the separation of instance management and interlinking does not work well for larger tremor-based applications.

Medium to large streaming applications were niche 18 months ago. They are normal today.

A set of specification arguments with required and optionally defaulted parameters can be provided in the args clause.

Flow specifications unify the creation of artefact instances and their interlinking in a hygienic fashion, with the potential to enhance the model by allowing sub-flows to be defined, interlinked and instantiated in future revisions of troy.

Whitespace and newlines are not significant in the EBNF grammar.

Example:

define flow my_eventflow
args
  required_arg,
  optional_arg = "default value"
flow
  # Create instances required by this flow specification
  create connector my_source from my_http_connector;
  create pipeline my_pipeline from passthrough
  with
    required_argument = 15 # 15 second aggregation intervals
  end;
  create connector my_sink from my_file;

  # Interlink instances
  connect "tremor://connector/my_source/{required_arg}/out" to my_pipeline/in;
  connect my_pipeline/out to my_sink/in;
end;

Flow Create Statements

Every artefact that has a definition via define in scope of a flow specification can be created, optionally passing parameters using a with clause:

EBNF Grammar:

flow_create     = "create" create_kind instance_id [ "from" modular_target ]
                  [ "with" assignment_list "end" ]
create_kind     = "connector" | "pipeline"
instance_id     = id
modular_target  = [ id "::" ]* id
assignment_list = assignment [ "," assignment_list ]
assignment      = let_path "=" expr
let_path        = id [ "." id ]

The instance_id provided in the create statement, coupled with the computed modular scope of the flow definition it appears within and the troy file it is created from, is used to form the internal representation.

The create statement can be used with connector or pipeline definitions but not flow definitions in this iteration of the troy language.

Example:

create pipeline my_pipeline from passthrough
with
  required_argument = 15 # 15 second aggregation intervals
end;

Flow Connect Statements

Flow connect statements allow artefact instances created in the same flow to be interlinked, or allow instances already deployed elsewhere in a tremor deployment to be linked into a flow.

EBNF Grammar:

flow_connect     = "connect" connect_endpoint "to" connect_endpoint
connect_endpoint = system_endpoint | troy_endpoint
system_endpoint  = <stringified-url>
troy_endpoint    = id [ "/" port_id ]
port_id          = id

The connect statement links artefact instance specifications. Both a from and a to endpoint are required, and each needs to resolve either to a pre-existing instance already running in tremor, or to an instance defined in the context of the enclosing flow definition.

There are two basic forms of endpoint:

Troy Native Connect Flow Form

This form uses the creation ids from the flow specification to create a reference to the target instance. If no port specification is provided, the default in and out ports are inferred automatically.

The short form:

connect passthrough to my_pipeline

is equivalent to the longer form:

connect passthrough/out to my_pipeline/in

Runtime Flow Connect Form

The alternative form can be used to link with artefact instances that are defined external to the flow specification that is currently in context.

For example, we can connect to the default passthrough pipeline provided by the tremor runtime as follows:

connect my_http_connector/out to "/pipeline/system::passthrough/system/in"

Deploying Flows

Flow definitions are specifications: they define how artefacts interconnect and specify the instances to be created, if required. They can also reference pre-existing instances via their Tremor URLs.

To deploy a flow into the tremor runtime and start, run and activate any connectors, pipelines and flow graph connections defined in a flow, we must use the deploy statement.

A unit of deployment in troy is a flow specification.

EBNF Grammar:

deploy          = "deploy" "flow" deployment_id "from" modular_target
                  [ "with" assignment_list "end" ]
deployment_id   = id
modular_target  = [ id "::" ]* id
assignment_list = assignment [ "," assignment_list ]
assignment      = let_path "=" expr
let_path        = id [ "." id ]

The deploy statement references a flow to be deployed using the from clause and may use a with clause to override the default flow configuration specified for that flow, if required.
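
For example, assuming hypothetically that my_flow_application declared a defaulted server argument in its args clause, a deployment could override it as follows:

deploy flow my_app from my_flow_application
with
  server = "Tremor/4.0.0"
end;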

Drawbacks

  • YAML is widely known, especially within cloud-native environments.

  • Troy is a new DSL in tremor, but it builds on the same mechanisms as the query and script DSLs and leverages the same namespace and modularity mechanisms.

  • Introducing a new DSL steepens the learning curve for new tremor users.

Rationale and alternatives

  • The query language trickle was once defined in YAML. The query language was designed to provide a better user experience: hygienic, easy to interpret error reporting, and a syntax and semantics designed to meet the needs of authors of tremor continuous streaming queries.

  • By embedding the tremor-script language via the query language's script operator, the unification reduced deployment and operational complexity and led to easier to develop queries.

  • By following the same proven and well-trodden path with troy, building on the idioms and conventions introduced with trickle, this should hold true for troy as well. Early indications from tremor production users are that it is a significant improvement over the YAML based deployment model.

  • Lastly, removing the remaining YAML serialization, deserialization and supporting code from the tremor runtime enables refactoring of complex, far-reaching internals that affect rust lifetimes and other performance critical or user facing parts of tremor. These internals haven't evolved since the earliest versions of tremor, when YAML was introduced as a stop-gap solution at the very beginning of the project.

Versions we considered and discarded

One initial draft used with as a keyword for starting a key-value mapping (a record in tremor-script), as a special case only used in configuration contexts:

define connector artefact ws_conn
with
  type = ws,
  # nested record
  config with
    host = "localhost",
    port = 8080
  end,
  codec = my_json with ... end,
  interceptors = ...
end;

This was discarded because with doesn't really work as a keyword for a key-value mapping. To be consistent with tremor-script, it should be record.
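
Purely for illustration, a record based form, which was not implemented either, might have looked like:

define connector artefact ws_conn
with
  type = ws,
  config = record
    host = "localhost",
    port = 8080
  end
end;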

The first name for flow was deployment, but that was an oversimplification: in a flow we specify the required artefact instances and their interlinking. So flow is better suited, and the design is richer and more expressive than the constrained syntax offered by the YAML based binding and mapping configuration scopes.

Also, separating runtime commands such as deploy from the specifications to be deployed, or that are subject to lifecycle management interactions, was quickly established as a requirement when brainstorming the structure and semantics of the language and comparing it with production use cases where the needs are very well understood.

Prior art

Environments like terraform from HashiCorp are a good example of resource or state management in the form of configuration as code, with a scheduler and tooling that moves a runtime from one planned state to another. In troy we separate definitions or specifications from flows, which capture the runtime model and interlinking, and from deployment commands such as deploy, which instruct the runtime to execute a change to deployed state.

Another environment worth exploring is the Erlang REPL. Erlang is unique as a language and runtime in that the REPL can be used to navigate, explore and alter the running state of a live clustered deployment. In contrast to the terraform model, in Erlang and other actor-based or actor-inspired systems the runtime topology can be modified live.

Both are useful models that inspired and influenced the evolution of troy as documented in this RFC.

Unresolved questions

  • Should we use Tremor URLs in create statements? Disposition: The simpler connect model is seen as sufficient at this time.

  • Should we add means to refer to artefacts by their attributes (host, artefact-type, artefact-id, configuration, module path)? Disposition: There are no use cases that we know of that would benefit from this flexibility at this time.

  • Should we allow users to create dormant, non-deployed and non-referenced artefacts or instances? Disposition: Whilst powerful, this is not seen as a need at this time, and the tradeoffs would need to be carefully considered.

Future possibilities

Configure Artefacts using Script blocks

Given the plan is to implement the with block as syntactic sugar around a tremor-script block with a multi-let expression, we could enable the use of a more complex script block to define the configuration record. For example, if we need to dispatch on some arguments and choose different config entries based on that, this would not be possible using a with block.

Example:

define file connector my_file_connector
args
  dispatch_arg = "default"
script
  match args.dispatch_arg of
    case "snot" => {"snot": true, "arg": args.dispatch_arg}
    case "badger" => {"badger": true, "arg": args.dispatch_arg}
    default => {"default": true}
  end
end;

Graceful Operations

We could add statement variants for create that fail if an instance with the same id already exists, or that gracefully do nothing if that is the case. This graceful behavior needs to be able to verify that the existing instance is from the very same artefact, and for this might need to check the artefact content too, not just its name.
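
One possible shape for such variants, with syntax that is purely illustrative and not part of this RFC:

# hypothetical strict variant: fail if an instance with id `my_sink` exists
create connector my_sink from my_file;

# hypothetical graceful variant: no-op if an identical instance already exists
create connector my_sink from my_file if not exists;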

Non-String Tremor-URLs

It would be nice for static analysis of troy scripts if the tremor URLs in connect statements were not strings, but instead used ids as references to defined artefacts or already created artefact instances. String URLs allow references outside the context of the current troy script, though, which might or might not be valid depending on the state of the registry at the point of creation. So an id based syntax for tremor URLs might help with error detection in troy scripts, but would only work when referencing artefacts and instances is limited to the current troy script (including imports).

Some ideas to spawn discussion:

pipeline:my_pipeline/instance_id:in

If we changed the requirement for artefact ids to be globally unique, not only unique per artefact type, we wouldn't even need to prefix them with their artefact type every time:

my_pipeline/instance_id:in

Define Codecs and Interceptors (a.k.a. pre- and post-processors)

It might be nice to be able to define codecs and interceptors in the deployment language as well. That would mean:

  • builtin codecs and interceptors are predefined in something like a Troy stdlib:

    intrinsic codec json;

  • codecs and interceptors can be provided with configuration and be referenceable under a unique name:

    define codec pretty_json from json
    with
      pretty = true,
      indent = 4,
    end;

This would solve the current problem that pre- and postprocessors are not configurable. It would nonetheless introduce another type of artefact that isn't actually a proper artefact, so applying the language concepts to it might not fully work out and could lead to confusion.

Also, it is not possible to fully define codecs and interceptors inside troy; they are all written in rust for performance reasons. The only thing we can do is configure them, associating a codec/interceptor with its configuration and making this pair referenceable within the language.


Errata

Syntax sugar for let expressions in with blocks

In order to make defining config entries in tremor-script convenient, we introduced the with block. We need to add a few features to tremor-script to make this work reasonably well in a configuration context, where all we want is to return a record value without too much fuss and ceremony:

  • Add multi-let statements that combine multiple lets inside a single statement, define the variables therein, and return a record value with all those definitions (see the sketch after this list).

  • Add auto-creation of intermediary path segments in let statements:

    let config.nested.value = 1;

    In this case, if config.nested does not exist, we would auto-create it as part of this let as an empty record. The statement would fail if we tried to nest into an existing field that is not a record.
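
A sketch of how these two features might combine (illustrative only): a with block such as

with
  codec = "json",
  config.nested.value = 1
end

would, under this desugaring, behave like a multi-let that auto-creates config.nested and returns the record:

{
  "codec": "json",
  "config": {"nested": {"value": 1}}
}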

Reference an external trickle file for pipeline definitions

It might be interesting to be able to load a trickle query from a trickle file. To that end, we would add new config directives to trickle that can define arguments and their default values.

#!config arg my_arg = "foo"
#!config arg required_arg
...
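
Inside such a trickle file, these arguments would then presumably be available via args, mirroring the inline pipeline example earlier in this RFC (the file name and argument here are illustrative):

# hypothetical my_pipeline.trickle
#!config arg interval_secs = 15
use std::datetime;

define tumbling window by_interval
with
  interval = datetime::with_seconds(args.interval_secs)
end;

select event from in[by_interval] into out;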

Flow Connect Statement Sugar

Connect statements describe the primitive operation needed to establish an event flow from sources/connectors via pipelines towards sinks/connectors. Defining every single connection manually might be a bit too verbose. That is why we will provide some more convenient variants that all encode connect statements, but are much more concise and expressive.

We call them arrow statements. Both the LHS and the RHS are Tremor URL strings. The LHS of a simple arrow statement is the "sender" of events, the RHS is the "receiver". The direction of the arrow describes the flow direction. If no port is provided, the LHS uses the out port and the RHS uses the in port, so users don't need to specify them in the normal case.

Examples:

# ports defaulted to out -> in
"/source/stdin" -> "/pipeline/pipe";

# with explicit ports
"/pipeline/pipe/err" -> "/sink/system::stderr";
"/pipeline/pipe/err" -> "/sink/my_error_file/in";

Arrow statements also support chaining. This works as follows: when an arrow statement is used as an expression, it exposes its RHS when it appears as the LHS of another arrow, and its LHS when it appears as the RHS. Arrow statements handle other arrow statements as LHS and RHS separately. In effect, chaining arrow statements is just writing multiple Tremor URLs connected via arrows:

"/source/system::stdin" -> "/pipeline/system::passthrough" -> "/sink/system::stderr";

This is equivalent to the following, when adding explicit parens to show precedence:

("/source/system::stdin/out" -> "/pipeline/system::passthrough/in") -> "/sink/system::stderr/in";

Which resolves via desugaring:

connect "/source/system::stdin/out" to "/pipeline/system::passthrough/in";
connect "/pipeline/system::passthrough/out" to "/sink/system::stderr/in";

Arrow statements also support tuples of Tremor URLs or tuples of other Arrow statements. These describe branching and joining at the troy level.

"/source/system::stdin/out" -> ("/pipeline/system::passthrough", "/pipeline/my_pipe" ) -> "/sink/system::stderr/in";

This desugars to:

"/source/system::stdin/out" -> "/pipeline/system::passthrough" -> "/sink/system::stderr/in";
"/source/system::stdin/out" -> "/pipeline/my_pipe" -> "/sink/system::stderr/in";

If we have multiple tuples within a statement, we create statements for each combination of them. Example:

# full sugar
("/source/my_file", "/source/my_other_file") -> ("/pipeline/pipe1", "/pipeline/pipe2") -> "/sink/system::stderr";

# desugars to:

"/source/my_file" -> "/pipeline/pipe1" -> "/sink/system::stderr";
"/source/my_file" -> "/pipeline/pipe2" -> "/sink/system::stderr";
"/source/my_other_file" -> "/pipeline/pipe1" -> "/sink/system::stderr";
"/source/my_other_file" -> "/pipeline/pipe2" -> "/sink/system::stderr";

The immediate win in terseness is obvious, we hope.

It will be very interesting to explore how to expose the same sugar to trickle select queries. This is a future possibility.

Top level connect statements

Connect statements have been introduced as part of flow definitions. It would greatly simplify setting up tremor installations if we could write them at the top level of a troy script.

The process of describing a tremor deployment would then conceptually consist of the following steps:

  1. define artefacts to be connected
  2. connect those artefacts

These steps give a good intuition for setting up a graph of nodes for events to flow through.

Example:

define file connector my-file-connector
with
  source = "/my-file.json",
  codec = "json"
end;

connect "/connector/my-file-connector/out" to "/pipeline/system::passthrough/in";
connect "/pipeline/system::passthrough/out" to "/connector/system::stderr/in";

For setting up tremor with this script, the concept of a flow doesn't even need to be introduced. Especially for getting started and trying out simple setups in a local dev environment, or for tutorials and workshops, this removes friction and decreases time-to-get-something-running-and-have-fun-understanding.

What is going on under the hood here to make this work?

Those connect statements will be put into a synthetic flow artefact, with the artefact id derived from the file in which they are declared. There is exactly one such synthetic flow artefact per troy file, but only if at least one global connect statement is given. The flow instance id will be default. This flow artefact is defined and created without args upon deployment of the troy script file.
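
Under this scheme, the top level statements in the example above would desugar to something like the following for a file named main.troy (the derived artefact id main_troy is illustrative):

define flow main_troy
flow
  connect "/connector/my-file-connector/out" to "/pipeline/system::passthrough/in";
  connect "/pipeline/system::passthrough/out" to "/connector/system::stderr/in";
end;

# instance id `default`, created upon deployment of the troy file
deploy flow default from main_troy;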

Following this route, it seems reasonable to also add disconnect statements as the dual of connect:

disconnect "/connector/my-file-connector/out" from "/pipeline/system::passthrough/in";

With flows we would destroy the whole flow instance to delete the connections therein at once. With top level connect statements, we cannot reference any such instance, unless we reference it using the naming scheme above. But we would lose the power to modify single connections if we'd refrain from looking into disconnect.