Accessing Features for Training and Retraining

This documentation provides examples and usage patterns for interacting with the Offline Feature Store using the OfflineClientV2 in Python (available from SDK version 0.5.61 and higher). It covers how to retrieve feature values for machine learning model training and analysis.

Prerequisites:

Before using these examples, ensure you have the following Python packages installed:

pip install pyathena pyarrow

APIs:

Get Feature Values

This API retrieves features from the offline feature store for one or more feature sets, given a population DataFrame. The resulting DataFrame is the population DataFrame enriched with the requested feature values as of the point in time given in each row.

Arguments:

  • features: List[FeatureSetFeatures] - required
    A list of feature sets to fetch.
  • population: pd.DataFrame - required
    A DataFrame containing:
    • All keys of the requested feature sets.
    • A point in time column.
    • Optional enrichments, e.g., labels.
  • point_in_time_column_name: str - required
    The name of the point in time column in the population DataFrame.

Returns: pd.DataFrame
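
Because a missing key or a malformed timestamp in the population DataFrame is easy to overlook, it can help to check the frame locally before calling the API. The following is a minimal sketch using only pandas; check_population is a hypothetical helper and not part of the Qwak SDK, and the key column names are assumed from the example population used in this section.

```python
from typing import List

import pandas as pd


def check_population(
    population: pd.DataFrame,
    key_columns: List[str],
    point_in_time_column_name: str,
) -> List[str]:
    """Hypothetical helper (not part of the SDK): validate a population frame.

    Returns the names of the remaining columns, which are treated as
    optional enrichments such as labels.
    """
    required = list(key_columns) + [point_in_time_column_name]
    missing = [column for column in required if column not in population.columns]
    if missing:
        raise ValueError(f"population is missing required columns: {missing}")

    # Raises ValueError if any value cannot be parsed as a timestamp.
    pd.to_datetime(population[point_in_time_column_name])

    return [column for column in population.columns if column not in required]


population_df = pd.DataFrame(
    columns=['impression_id', 'purchase_id', 'timestamp', 'label'],
    data=[['1', '100', '2021-01-02 17:00:00', 1], ['2', '200', '2021-01-01 12:00:00', 0]]
)

enrichments = check_population(
    population_df,
    key_columns=['impression_id', 'purchase_id'],  # assumed key names
    point_in_time_column_name='timestamp',
)
print(enrichments)  # ['label']
```

The helper only inspects the frame and does not modify it, so the same population_df can be passed on to get_feature_values unchanged.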

Example call:

import pandas as pd
from qwak.feature_store.offline import OfflineClientV2
from qwak.feature_store.offline.feature_set_features import FeatureSetFeatures

offline_feature_store = OfflineClientV2()

user_impressions_features = FeatureSetFeatures(
    feature_set_name='impressions',
    feature_names=['number_of_impressions']
)
user_purchases_features = FeatureSetFeatures(
    feature_set_name='purchases',
    feature_names=['number_of_purchases', 'avg_purchase_amount']
)
features = [user_impressions_features, user_purchases_features]

population_df = pd.DataFrame(
    columns=['impression_id', 'purchase_id', 'timestamp', 'label'],
    data=[['1', '100', '2021-01-02 17:00:00', 1], ['2', '200', '2021-01-01 12:00:00', 0]]
)

train_df: pd.DataFrame = offline_feature_store.get_feature_values(
    features=features,
    population=population_df,
    point_in_time_column_name='timestamp'
)

print(train_df.head())

Example results:

#    impression_id  purchase_id  timestamp            label  impressions.number_of_impressions  purchases.number_of_purchases  purchases.avg_purchase_amount
# 0  1              100          2021-01-02 17:00:00  1      312                                76                             4.796842
# 1  2              200          2021-01-01 12:00:00  0      86                                 5                              1.548000

In this example, the label is an enrichment that is carried through to the result; it is not used to select data. This pattern is useful when you already have a complete list of keys and their timestamps. The API joins data from multiple feature sets and returns at most one record for each row in population_df. Qwak's feature store is time-series based: each feature vector (key) is stored with start_timestamp and end_timestamp bounds, so every unique key-timestamp combination resolves to a single, most relevant result.
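
To build intuition for this point-in-time behavior, the sketch below reproduces the idea locally with pandas.merge_asof. It is an illustration only and not how the Feature Store is implemented: the history frame, its values, and the use of start_timestamp alone (without end_timestamp) are assumptions made for the example.

```python
import pandas as pd

# Hypothetical feature history for a single feature set: each row is a
# feature vector that becomes valid at start_timestamp.
history = pd.DataFrame({
    "purchase_id": ["100", "100", "200"],
    "start_timestamp": pd.to_datetime([
        "2021-01-01 00:00:00",
        "2021-01-02 00:00:00",
        "2021-01-01 00:00:00",
    ]),
    "number_of_purchases": [70, 76, 5],
})

population = pd.DataFrame({
    "purchase_id": ["100", "200"],
    "timestamp": pd.to_datetime(["2021-01-02 17:00:00", "2021-01-01 12:00:00"]),
    "label": [1, 0],
})

# For each population row, pick the latest feature vector of the same key
# whose start_timestamp is not after the row's timestamp.
# merge_asof requires both frames to be sorted by their "on" columns.
enriched = pd.merge_asof(
    population.sort_values("timestamp"),
    history.sort_values("start_timestamp"),
    left_on="timestamp",
    right_on="start_timestamp",
    by="purchase_id",
    direction="backward",
)

print(enriched[["purchase_id", "timestamp", "label", "number_of_purchases"]])
```

Each population row yields exactly one output row, and key "100" picks up the vector that started on 2021-01-02 rather than the older one, which mirrors the "at most one record per row" guarantee described above.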


Get Feature Range Values

Retrieves features from an offline feature set for a given time range. The resulting DataFrame contains all data points of the feature set within that range. If population is provided, the result is filtered to the key values it contains.

Arguments:

  • features: FeatureSetFeatures - required
    The features to fetch from a single feature set.
  • start_date: datetime - required
    The lower time bound.
  • end_date: datetime - required
    The upper time bound.
  • population: pd.DataFrame - optional
    A DataFrame containing:
    • The key of the requested feature set (required).
    • Enrichments, e.g., labels (optional).

Returns: pd.DataFrame

Example call:

from datetime import datetime
import pandas as pd
from qwak.feature_store.offline import OfflineClientV2
from qwak.feature_store.offline.feature_set_features import FeatureSetFeatures

offline_feature_store = OfflineClientV2()

start_date = datetime(year=2021, month=1, day=1)
end_date = datetime(year=2021, month=1, day=3)
features = FeatureSetFeatures(
    feature_set_name='purchases',
    feature_names=['number_of_purchases', 'avg_purchase_amount']
)

train_df: pd.DataFrame = offline_feature_store.get_feature_range_values(
    features=features,
    start_date=start_date,
    end_date=end_date
)

print(train_df.head())

Example results:

#    purchase_id  timestamp            purchases.number_of_purchases  purchases.avg_purchase_amount
# 0  1            2021-01-02 17:00:00  76                             4.796842
# 1  1            2021-01-01 12:00:00  5                              1.548000
# 2  2            2021-01-02 12:00:00  5                              5.548000
# 3  2            2021-01-01 18:00:00  5                              2.788000
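
The effect of the optional population argument can be pictured with plain pandas. The sketch below is a local illustration and not the SDK's implementation: it assumes inclusive time bounds and assumes that enrichment columns are carried into the result, neither of which is stated above. The rows frame copies part of the example results.

```python
from datetime import datetime

import pandas as pd

# Data points of the 'purchases' feature set, taken from the example results.
rows = pd.DataFrame({
    "purchase_id": ["1", "1", "2", "2"],
    "timestamp": pd.to_datetime([
        "2021-01-02 17:00:00",
        "2021-01-01 12:00:00",
        "2021-01-02 12:00:00",
        "2021-01-01 18:00:00",
    ]),
    "purchases.number_of_purchases": [76, 5, 5, 5],
})

# Hypothetical population: only key '2', plus a label as an enrichment.
population = pd.DataFrame({"purchase_id": ["2"], "label": [0]})

start_date = datetime(year=2021, month=1, day=1)
end_date = datetime(year=2021, month=1, day=3)

# Step 1: keep every data point inside the time range (bounds assumed inclusive).
in_range = rows[(rows["timestamp"] >= start_date) & (rows["timestamp"] <= end_date)]

# Step 2: keep only keys present in the population, carrying enrichments along.
filtered = in_range.merge(population, on="purchase_id", how="inner")

print(filtered)
```

Unlike the point-in-time lookup, a key can appear several times here, once for each data point in the range.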

Current limitations

The get_feature_range_values API is currently not available for Streaming Aggregations feature sets, and it cannot fetch data from multiple feature sets in a single call (joined data).
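
One way to work within the single-feature-set limitation is to issue one call per feature set and keep the results separate. The sketch below is an assumption-laden illustration: fetch_ranges_separately and the stub are hypothetical names, and the stub stands in for the real client call so the snippet runs without a Qwak environment.

```python
from datetime import datetime
from typing import Any, Callable, Dict

import pandas as pd


def fetch_ranges_separately(
    fetch_range: Callable[..., pd.DataFrame],
    features_by_name: Dict[str, Any],
    start_date: datetime,
    end_date: datetime,
) -> Dict[str, pd.DataFrame]:
    """Hypothetical helper: one range query per feature set, results kept apart."""
    return {
        name: fetch_range(features=features, start_date=start_date, end_date=end_date)
        for name, features in features_by_name.items()
    }


def stub_fetch_range(features, start_date, end_date) -> pd.DataFrame:
    # Stand-in for the real client call; returns a one-row placeholder frame.
    return pd.DataFrame({"requested": [features]})


results = fetch_ranges_separately(
    stub_fetch_range,
    {"purchases": "purchases-features", "impressions": "impressions-features"},
    start_date=datetime(year=2021, month=1, day=1),
    end_date=datetime(year=2021, month=1, day=3),
)
print(sorted(results))  # ['impressions', 'purchases']
```

In real use, offline_feature_store.get_feature_range_values would be passed as fetch_range and each dictionary value would be a FeatureSetFeatures object. How the resulting frames should be combined depends on the keys and timestamps of each feature set, so that step is left out here.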