SQLStreamBuilder Getting Started Guide

There are a few steps to complete before you can start executing and iterating on SQL jobs for the first time:

  1. Create a Cloud Environment
  2. Create one or more data providers
  3. Create a virtual table as a source
  4. Create a virtual table as a sink
  5. Execute SQL jobs

Once an SSB deployment is created, data providers are defined, and you have a good set of virtual tables for your use case, you don’t need to repeat these steps.

1. Create a Cloud Environment

A cloud environment is the cloud account from which processing resources are allocated. This can be your everyday AWS account, or a new account that is VPC-peered back to your main account.

  • Click on Cloud Environments
  • Click on the Create Cloud Environment button
  • Enter a logical name for this environment. It’s generally good to include the region in the name, for example AWS_EU_CENTRAL.
  • Add a description for the environment
  • Select AWS as your provider
  • Select the AWS region you would like to deploy to
  • Enter your AWS IAM Access Key
  • Enter your AWS IAM Secret Key
  • Select a VPC CIDR Range or use the default if it doesn’t collide with your IP range.
  • Ensure Triple AZ is selected
  • Select Create Environment

AWS resources will then be allocated to the account associated with the IAM keys. Initially only a few resources will be built; more will be created as you create clusters to run jobs on.

2. Create a data source

A data source (also called a data provider) is a registered connection to a cluster or endpoint you wish to query: it holds the URI or connect string for the cluster, along with the corresponding credentials. Once you set up a data source, you can reference it without re-entering this configuration.

For data source types and availability see data providers.

  • Click on Data Sources on the side menu.
  • Click on the Add Kafka Provider button at the top.
  • Add a unique name for the source; this can be anything that makes sense for your use case.
  • Add the Kafka brokers connect string (typically a comma-separated list of host:port pairs, for example broker1.example.com:9092,broker2.example.com:9092).
  • If you are connecting to an endpoint over plaintext, leave PLAINTEXT selected for Connection Protocol; otherwise, select the connection protocol your source Kafka cluster supports.
  • If you selected SASL/SSL, fill in the SASL Username and SASL Password. Typically these can be generated or found in your cloud provider’s control panel.
  • Click on Save Changes to create the data provider.

Currently, SSB only supports Kafka data sources. Other data sources will be added soon.

3. Create a virtual table as a source

You must have at least one source to run queries on SSB.

  • Click on SQLStreamBuilder on the side menu.
  • Click on the Virtual Tables tab.
  • Click on the Source Virtual Tables tab.
  • Click Add Source and select Apache Kafka.
  • Specify the name of the source; this is what you will use in the FROM clause of your query (see the example at the end of this step). Give it any logical name that makes sense to you.
  • Select a Kafka cluster and topic for the source. The cluster is the data provider you set up in the previous step.
  • Define a data format. Data formats can be of type JSON or AVRO.
  • Define or register a schema:

If you are using JSON-encoded data, you must define the schema. The format is JSON Schema-like, and a sample schema is provided as a template. Change the values to match your specific schema.

[Image: source]

If you are using AVRO-encoded data, you must register the schema by providing a Schema Registry endpoint. SSB will use the schema, with Schema Registry handling versioning.

  • Add the Schema Registry endpoint for Server and Port and click the plus button.

[Image: avro_source]

  • Click on Save Changes.

Note: You can create multiple virtual tables for the same Kafka topic.
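
As a quick illustration of how a source virtual table is used, the sketch below assumes a hypothetical source virtual table named payments with amount and currency fields; substitute your own table and column names.

    -- Hypothetical example: the source virtual table "payments" is referenced
    -- directly by name in the FROM clause, just like a regular table.
    SELECT amount, currency
    FROM payments
    WHERE currency = 'EUR';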

4. Create a virtual table as a sink

(This step is optional: you can select Results in Browser to test functionality without sending the results to a sink.)

  • Click on SQLStreamBuilder on the side menu.
  • Click on the Virtual Tables tab.
  • Click on the Sink Virtual Tables tab.
  • Click Add Sink and select Apache Kafka.
  • Specify the name of the sink; this is what you will select as the sink in the SQL window. Give it any logical name that makes sense to you.
  • Select a Kafka cluster and topic for the sink. The schema of the sink is defined by the columns and their respective datatypes as specified in the query (see the example below).

[Image: sink]
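
To make the last point concrete, here is a minimal sketch assuming a hypothetical source virtual table named payments. The column names and datatypes produced by the SELECT list define the schema written to whichever sink you select for the job.

    -- Hypothetical example: the sink selected for this job receives records
    -- with two fields, "account_id" and "eur_amount", taking their datatypes
    -- from the underlying columns and expressions.
    SELECT account_id,
           amount AS eur_amount
    FROM payments
    WHERE currency = 'EUR';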

5. Execute SQL jobs

  • Click on SQLStreamBuilder on the side menu
  • Set a unique name in the SQL Job Name field. Each query is a job and requires a name. If you are iterating quickly on your SQL statement, just use the default name. You can also select the Name Job button to generate a name for you if needed.
  • Select Environment.Cluster, and click Add Cluster.
  • In the Add Deployment box, choose a logical Name for this deployment.
  • Give it a good Description.
  • Select a Region/Zone.
  • Click Create. Over the next few moments, a cluster will be provisioned; when it is complete, move on to the next step.

[Image: create]

Note: If you already have a cluster created, select that one and skip this step. Creating a cluster is typically done only a few times, not every time you run a query.

  • Select Sink Virtual Table, and choose the sink you want the query results sent to.

Execute SQL

  • To list the available virtual tables, enter SHOW TABLES in the SSB SQL Console, then click Execute or use the key combination Ctrl+Enter. A list of tables will be shown in the Results tab below the editor.
  • To view the schema for a virtual table, run DESC <tablename>, substituting the name of the table you would like to describe. Click Execute or use the key combination Ctrl+Enter. The schema will be displayed in the Results tab below the editor.
  • To run SQL, write a SQL statement into the editor and click Execute or use the key combination Ctrl+Enter (see the examples after this list).
  • SQL syntax is Calcite-compatible; check out the SQL Syntax Reference.
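
For example, a minimal exploration session might look like the following; payments is a hypothetical table name, so substitute one of your own virtual tables.

    -- List all registered virtual tables.
    SHOW TABLES;

    -- Show the schema of one of them ("payments" is a hypothetical name).
    DESC payments;

    -- Run a simple query against it.
    SELECT * FROM payments;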

Remember: Each execution is a new job.

The log window offers verbose output on the parse status of the statement, and the results of the query are returned to the Results tab below the editor as well as spooled to the Virtual Table Sink. Note that this paradigm is a bit different from typical database queries because it is a streaming SQL query: the query runs continuously, and the displayed results are a sample of its output. If you navigate away from the page, the query will still be running as a job, and you can view it in the SQL Jobs tab.
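
To illustrate the continuous nature of a streaming job, here is a sketch of an aggregation that keeps emitting results as new events arrive. It assumes a hypothetical virtual table named payments with an eventTimestamp rowtime column, and that Calcite-style TUMBLE group windows are supported by your SSB version; consult the SQL Syntax Reference for the exact windowing functions available.

    -- Hypothetical continuous aggregation: one result row per currency per minute,
    -- emitted for as long as the job runs. Table, columns, and window syntax are
    -- illustrative only.
    SELECT currency,
           COUNT(*) AS payment_count,
           SUM(amount) AS total_amount
    FROM payments
    GROUP BY currency,
             TUMBLE(eventTimestamp, INTERVAL '1' MINUTE);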

If you don’t see results right away, this may be because of the nature of streaming SQL: data may not immediately match your criteria. The Logs tab shows the status of the SQL job. You can also view the logs of an existing job by selecting the job in the SQL Jobs tab.

If you would like to re-sample the results from the query, select the Sample button at any time.

Here is a video showing building and executing a query: