SQLStreamBuilder Concepts Guide
Eventador SQLStreamBuilder (SSB) is a production-grade query engine for streaming data.
It is a comprehensive interface for creating stateful stream processing jobs against boundless streams of data using Structured Query Language (SQL). With streaming SQL, you can declare expressions that filter, aggregate, route, and otherwise mutate streams of data.
SQLStreamBuilder is particularly well-suited for:
- Creating entire stream processing systems without needing to code in Java/Scala.
- Building real-time ETL pipelines.
- Pre-aggregating data before storing it into traditional datastores.
- Testing/Hypothesizing/Reasoning about your data.
- Allowing anyone familiar with SQL to build stream processors.
- Filtering, splitting, routing, cleansing, and aggregating to department-specific streams/topics or datastores.
- Joining various incoming streams into one useful dataset.
SSB runs in an interactive fashion where you can quickly see the results of your query and iterate on your SQL syntax. It also works in a persistent fashion where the SQL queries you execute run as jobs on the Flink cluster, operating on boundless streams of data until canceled. This allows you to author, launch, and monitor stream processing jobs within SSB. Every SQL query is a job. The Job Management Guide has details on managing jobs.
SSB processes data from a source to a sink using the SQL you specify, or sends the results to the browser only. When a source or sink is created, you assign it a virtual table name. You then use that virtual table name as the FROM table in your query (source) and select the destination (sink) in the interface. This allows you to run aggregations, filters, or any other SQL expression against the stream. Sources and sinks are specified in the SSB console.
It’s important to note that because this is streaming SQL, the query does not end until canceled, and results may not show up immediately if there is a long time window, or if data matching the criteria isn’t being streamed at that moment.
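The delay caused by a long time window can be illustrated with a small Python model. This is a simplified sketch of a tumbling window, not SSB's or Flink's implementation: the function name `tumbling_counts` and the event layout are hypothetical, and a window is assumed to close only when an event from a later window arrives.

```python
def tumbling_counts(events, window_seconds):
    """Yield (window_start, count) each time a window closes.

    events: iterable of (timestamp_seconds, payload), ordered by timestamp.
    Assumption for this sketch: a window closes only when an event that
    belongs to a later window arrives, so the newest window is never
    emitted while the stream is still open.
    """
    current_start = None
    count = 0
    for ts, _payload in events:
        start = ts - (ts % window_seconds)  # start of the window holding ts
        if current_start is None:
            current_start = start
        elif start != current_start:
            # A later window has begun: emit the finished one and reset.
            yield current_start, count
            current_start, count = start, 0
        count += 1


events = [(0, "a"), (10, "b"), (59, "c"), (61, "d"), (125, "e")]
closed_windows = list(tumbling_counts(events, window_seconds=60))
```

With a 60-second window, the first result appears only once the event at second 61 arrives, and the window that starts at second 120 produces nothing until a later event closes it.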
Routing the query result data to a sink
SQLStreamBuilder is unique in that it lets you iterate on your SQL and also build robust data pipelines based on that SQL. When you execute a query, the results go to the Virtual Table Sink that you selected in the SQL window. This allows you to create aggregations, filters, joins, etc. and then “pipe” the results to a sink. The schema for the results is the schema that you created when you ran the query (see above).
Results are also sampled to your browser so you can inspect the data and iterate on your query. 100 rows are sampled back at a time. You can sample 100 more rows by clicking the Sample button. If you select Results in Browser for the Sink Virtual Table, then results are only sampled to the browser.
A Kafka sink is an output virtual table that receives the data resulting from the query. Kafka sink data is in JSON format.
Amazon S3 Sinks
An Amazon S3 sink is an empty S3 bucket that will be filled with objects that use the following filename layout:
Each of these files contains one JSON message per line. The S3 sink is currently only supported in the same region as your cluster.
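A consumer of these files can treat each line as an independent JSON document. The sketch below shows one way to write and read that layout in Python. The helper names and the sample rows are hypothetical, and the actual object names and fields depend on your sink and query.

```python
import json


def to_json_lines(rows):
    """Serialize an iterable of dicts as one JSON object per line."""
    return "".join(json.dumps(row) + "\n" for row in rows)


def from_json_lines(text):
    """Parse text that holds one JSON message per line, skipping blank lines."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


# Hypothetical rows standing in for query results.
rows = [{"sensor": "a1", "temp": 21.5}, {"sensor": "b2", "temp": 19.0}]
body = to_json_lines(rows)
parsed = from_json_lines(body)
```

Because `json.dumps` escapes newlines inside string values, each message stays on a single line and the file can be processed one line at a time.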
Schema is defined for a given source when the source is created. You define a schema using JSON for the incoming data, and each virtual table has its own schema. For instance, if the source is Apache Kafka, you specify a topic name and you define a schema for the data.
For sinks, the schema is defined as the table structure in the SQL statement, both for column names and datatypes. SSB supports aliases for column names like SELECT bar AS foo FROM baz as well as CAST(value AS type) for type conversions.
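How aliases and casts shape the result schema can be tried out locally. The sketch below uses Python's built-in sqlite3 module as a stand-in, which is an assumption made purely for illustration: SQLite runs against a static table rather than a stream, and its type names differ from those in SSB. The `amount` column and the `amount_int` alias are hypothetical.

```python
import sqlite3

# In-memory SQLite database used only as a local stand-in for a virtual table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baz (bar TEXT, amount TEXT)")
conn.execute("INSERT INTO baz VALUES ('x', '42')")

# The alias sets the output column name; CAST sets the output type.
cur = conn.execute(
    "SELECT bar AS foo, CAST(amount AS INTEGER) AS amount_int FROM baz"
)
column_names = [d[0] for d in cur.description]
row = cur.fetchone()
conn.close()
```

The result columns are named after the aliases, and the cast value comes back as an integer rather than text, which mirrors how the SELECT list defines the structure of what reaches the sink.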
SSB supports a couple of meta commands, like show tables and describe <table>, that make it easy to understand the schema if you are familiar with most database platforms.
When a key is removed from the messages coming from a source, SSB will happily continue to consume messages; however, upon sinking, it will mark the missing key as NULL. Similarly, when a key is removed from the source schema but not the messages coming from the source, SSB will ignore the key on the incoming stream.
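The two cases above amount to projecting each incoming message onto the keys declared in the source schema. The following Python sketch models that behavior. It is not SSB's implementation: `project_to_schema` is a hypothetical helper, and Python's `None` stands in for SQL NULL.

```python
def project_to_schema(message, schema_keys):
    """Keep only the keys declared in the schema.

    Keys missing from the message become None (standing in for NULL);
    keys present in the message but absent from the schema are dropped.
    """
    return {key: message.get(key) for key in schema_keys}


schema_keys = ["id", "name"]  # hypothetical source schema

# "name" was removed from the message: it is filled with None.
missing_key = project_to_schema({"id": 1}, schema_keys)

# "extra" is not in the schema: it is ignored.
extra_key = project_to_schema({"id": 2, "name": "n", "extra": "x"}, schema_keys)
```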
Source and Sink availability
|JDBC (Relational DB)||TBA|
User Functions, once created, can be referenced in any SQL statement as a SQL function. You can build a library of functions, and your team can use them as needed. User Functions also expose the entire Java 8 API, further enhancing the usefulness of SQLStreamBuilder. For more information on User Functions, see the User Function Guide.