Skip to main content

Onramp Postgres

Summary

Pull data from Postgres tables.

Motivation

We see use cases a function of features supported by PostgreSQL out of the box, as well as additional features introduced as extensions or otherwise packaged as completely separate products. Primary examples that we are looking at are TimescaleDB and PipelineDB.

The ultimate goal is to be able to reliably pipe in data from PostgreSQL core to Tremor engine.

Guide-level Explanation

PostgreSQL is a standard Tremor onramp. It is configured by passing in standard PostgreSQL connection string arguments, such as host, port, user, password and dbname.

In addition to connection string arguments, we require:

  • interval in milliseconds that specifies the time we wait before executing the next query.
  • consume_from specifies the timestamp from which to backfill data.
  • query in form of standard SELECT query which retrieves rows from a table or view.

Unlike other onramps implemented so far, PostgreSQL onramp will persist consume_from and consume_until, allowing recovery from error conditions or other failures.

Example onramp.yml:

id: db
type: postgres
codec: json
config:
host: localhost
port: 5432
user: postgres
password: example
query: "SELECT id, name from events WHERE produced_at <= $1 AND produced_at >
$2"
interval_ms: 1000
dbname: sales
cache:
path: "/path/to/cache.json"
size: 4096

Cache is a memory mapped region, either file-based or anonymous.

Reference-level Explanation

Drawbacks

The onramp does not provide an implementation of all types Postgres supports and the entirety of query language (for example, LISTEN/NOTIFY semantics).

Rationale and Alternatives

As a starting point, or first implementation, supporting basic SELECT statements with a time interval as additional WHERE clause supports the major use case: ingesting rows from a table or view.

Alternative approach would utilize PostgreSQL single row mode suitable for ingestion of results returned by queries that span a large number of rows. Ensuring at-most-once ingestion of rows would be a potential problem and a time-consuming/API-breaking implementation.

Other approaches, such as trigger-based watches or binary log readers are also a possibility.

Prior Art

Data Integration.

Unresolved Questions

Future Possibilities

For users that utilise PostgreSQL as an event store, which seems to be a more common use case lately, support for LISTEN/NOTIFY would be essential to become one of modes of this onramp.

Support multiple queries in the same onramp.

More flexible means of specifying parameters.