Data Sources

Introduction

Qwak data sources configure connections to your data. You use data sources to create feature sets.

There are two main types of data sources:

  • Batch: Data-at-rest sources, such as Athena, Snowflake, and Redshift.
  • Streaming: Data-in-motion sources, such as Kafka and Kinesis.

To connect to a data source:

  1. Enable network connectivity between the data sources and Qwak's cluster if they are not publicly accessible.
  2. Grant Qwak access to your data lake components by creating read-only service accounts and/or IAM roles.
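As a rough illustration of the read-only access in step 2, the sketch below builds an AWS IAM policy document that allows only listing and reading objects in a single S3 bucket. It assumes the data lake lives on S3; the bucket name, prefix, and the helper read_only_s3_policy are hypothetical, and the exact permissions Qwak needs for a given source type may differ.

```python
import json


def read_only_s3_policy(bucket: str, prefix: str = "") -> dict:
    """Build an IAM policy document granting read-only access to one bucket.

    Hypothetical helper for illustration; not part of the Qwak SDK.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Allow listing the objects in the bucket
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Allow reading objects under the given prefix, but no writes
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }


# Assumed bucket and prefix names; replace them with your own.
policy = read_only_s3_policy("my-data-lake-bucket", "features/")
print(json.dumps(policy, indent=2))
```

The resulting JSON can be attached to the IAM role or service account you create for Qwak. Keeping the statement list limited to list and get actions is what makes the access read-only.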

Defining Data Sources

Data Sources can be defined and registered programmatically with the Qwak SDK and CLI, or created entirely in the Qwak Dashboard.

Programmatically

Qwak provides Python classes to define any Data Source type using the qwak.feature_store.data_sources package.

For example, you can define a CsvSource that reads a CSV file stored in S3 as follows:

from qwak.feature_store.data_sources import CsvSource

# The S3 anonymous config class is required for public S3 buckets
from qwak.feature_store.data_sources import AnonymousS3Configuration

# Create a CsvSource object to represent a CSV data source 
# This example uses a CSV file from a public S3 bucket

csv_source = CsvSource(
    name='credit_risk_data',                                    # Name of the data source
    description='A dataset of personal credit details',         # Description of the data source
    date_created_column='date_created',                         # Column name that represents the creation date
    path='s3://qwak-public/example_data/data_credit_risk.csv',  # S3 path to the CSV file 
    filesystem_configuration=AnonymousS3Configuration(),        # Configuration for anonymous access to S3
    quote_character='"',                                        # Character used for quoting in the CSV file
    escape_character='"'                                        # Character used for escaping in the CSV file
)

📘

Data Sources defined with the Qwak SDK are not registered in the cloud platform until the qwak features register command is run for that object.


From the UI

  1. Select Data Sources from the sidebar.
  2. Click Create New Data Source.
  3. Select the required data source type from the list.
  4. Fill in the form (all required fields are marked with an asterisk).
  5. Test the connection to the data source to verify that it works.
  6. Click Save.
  7. The data source is created.

Below is an example of creating a batch, CSV-file-based Data Source in the Qwak Dashboard.


Registering Data Sources

To register a Data Source class defined with the SDK, use the Qwak CLI features command as follows:

qwak features register -p data_source.py
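If registration needs to run from a script (for example, in a CI job), the command above can be wrapped in Python. This is a minimal sketch that assumes the qwak CLI is installed and on the PATH; the helper names build_register_command and register are hypothetical, and only the -p flag shown above is used. By default it only prints the command instead of running it.

```python
import shlex
import subprocess


def build_register_command(definition_file: str) -> list[str]:
    """Return the argv for `qwak features register -p <file>`."""
    return ["qwak", "features", "register", "-p", definition_file]


def register(definition_file: str, dry_run: bool = True) -> list[str]:
    """Print the command (dry run) or execute it with the Qwak CLI."""
    command = build_register_command(definition_file)
    if dry_run:
        # Show what would be executed without touching the platform
        print("Would run:", shlex.join(command))
    else:
        # Raises CalledProcessError if the CLI exits with a non-zero status
        subprocess.run(command, check=True)
    return command


command = register("data_source.py")
```

Passing dry_run=False runs the actual CLI command. Building the argument list separately keeps the command easy to inspect and avoids shell quoting issues.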

Deleting Data Sources

To delete a data source, execute the following qwak command in the terminal:

qwak features delete --data-source <data-source-name>

🚧

Deleting Data Sources in use

Before you can delete a Data Source that is linked to one or more Feature Sets, you must either remove those Feature Sets or reassign them to a different Data Source.
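The dependency rule above can be sketched as a simple pre-delete check. The mapping of Feature Sets to Data Sources below is hand-written, hypothetical data (the sketch does not assume any Qwak API for listing it), and blocking_feature_sets is an illustrative helper, not part of the SDK.

```python
def blocking_feature_sets(
    data_source: str, feature_sets: dict[str, list[str]]
) -> list[str]:
    """Return the names of Feature Sets that still use the given Data Source."""
    return sorted(
        name for name, sources in feature_sets.items() if data_source in sources
    )


# Hypothetical inventory: Feature Set name -> Data Sources it reads from
feature_sets = {
    "user_credit_features": ["credit_risk_data"],
    "user_activity_features": ["clickstream_events"],
}

blockers = blocking_feature_sets("credit_risk_data", feature_sets)
if blockers:
    # Remove or reassign these Feature Sets before deleting the Data Source
    print("Cannot delete yet; still used by:", ", ".join(blockers))
else:
    print("Safe to run: qwak features delete --data-source credit_risk_data")
```

Once the list of blocking Feature Sets is empty, the delete command shown earlier can be run for that Data Source.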


What’s Next

Learn more about different types of Data Sources.