Feature Generator
About the Framework
The Feature Generator aims to provide a standardised framework, easy to use and easy to contribute to, for creating machine learning features from the data available within Heineken. The Feature Generation framework has three components:
- The Config, which provides an easy-to-use API and format for defining your dataset and the features that you want.
- A FeatureSet, which is a generic class that defines a group of features.
- The FeatureCalculator, which houses the API to generate your features from the config file.
Info
The only code a user of the framework needs to write is FeatureCalculator(config=config).run(df=df). That is the main benefit: as a user, you simply define your config.yaml and ensure that your dataframe is correctly filtered.
Usage
As a user, you only need 2 components, a config file:
dataset:
  key_cols: ['Customer_ID']
  date_col: 'IssuedDate'
features:
  SingleMonthFeatureConfig:
    sum_columns:
      - QTY
      - grandTotal
    count_columns:
      - QTY
    calculate_percentage_change: True
  PeriodMonthFeatureConfig:
    calculation_columns:
      - QTY
      - grandTotal
    deviation_from_mean_months:
      - 3
    calculate_percentage_change: True
    calculate_deltas: True
    resolve_divide_by_zero: True
and your code:
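from amee_utils.feature_generator.calculator import FeatureCalculator
from amee_utils.feature_generator.config import Config

# `spark` is assumed to be an existing SparkSession (for example, the one a notebook provides).
df = spark.read.parquet('path/to/dataset')

# Load the feature definitions, then compute the features for the dataframe.
config = Config.from_yaml('path/to/config.yaml')
result = FeatureCalculator(config=config).run(df=df)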
Note
The assumption here is that the data you pass to the FeatureCalculator contains everything required to calculate your features; the framework will not fill in missing data for you.
The config file has more requirements than this example shows; they are discussed below.
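Because the framework assumes the dataframe already holds everything the config refers to, a quick pre-flight check can catch missing columns before calling run. The sketch below is not part of amee_utils: required_columns and missing_columns are hypothetical helpers, the config is written out as a plain dict mirroring the example YAML, and the rule that only options ending in "_columns" name dataframe columns is an assumption based on that example.

```python
from typing import Any, Dict, List, Set


def required_columns(config: Dict[str, Any]) -> Set[str]:
    """Collect every column name the config refers to (hypothetical helper)."""
    dataset = config["dataset"]
    columns = set(dataset["key_cols"])
    columns.add(dataset["date_col"])
    for feature_options in config["features"].values():
        for option, value in feature_options.items():
            # Assumption: options ending in "_columns" hold lists of column names.
            if option.endswith("_columns"):
                columns.update(value)
    return columns


def missing_columns(config: Dict[str, Any], available: List[str]) -> List[str]:
    """Return the required columns that are absent from `available`, sorted."""
    return sorted(required_columns(config) - set(available))


# Plain-dict mirror of the example config.yaml shown earlier.
example_config = {
    "dataset": {"key_cols": ["Customer_ID"], "date_col": "IssuedDate"},
    "features": {
        "SingleMonthFeatureConfig": {
            "sum_columns": ["QTY", "grandTotal"],
            "count_columns": ["QTY"],
            "calculate_percentage_change": True,
        },
        "PeriodMonthFeatureConfig": {
            "calculation_columns": ["QTY", "grandTotal"],
            "deviation_from_mean_months": [3],
            "calculate_percentage_change": True,
            "calculate_deltas": True,
            "resolve_divide_by_zero": True,
        },
    },
}

# With a Spark DataFrame you would pass `df.columns` as `available`.
print(missing_columns(example_config, ["Customer_ID", "IssuedDate", "QTY"]))
```

A check like this only covers column names. Whether the dataframe holds enough history for the periods you ask for still has to be verified separately.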
Defining your Configuration file
As of amee-utils@v0.3.1, the config file should be defined in YAML. Returning to the previous example config.yaml:
dataset:
  key_cols: ['Customer_ID']
  date_col: 'IssuedDate'
features:
  SingleMonthFeatureConfig:
    sum_columns:
      - QTY
      - grandTotal
    count_columns:
      - QTY
    calculate_percentage_change: True
  PeriodMonthFeatureConfig:
    calculation_columns:
      - QTY
      - grandTotal
    deviation_from_mean_months:
      - 3
    calculate_percentage_change: True
    calculate_deltas: True
    resolve_divide_by_zero: True
There are two components to the Config: the dataset config and the features definition.
Dataset Configuration
- key_cols: Defines the key within your dataset that will be used for various aggregations. You can provide a list of key columns as well.
- date_col: The column that defines the date order of your dataset. It is used to filter for specific periods during feature calculations and to ensure the dataset has an order.
Note
As of v0.3.1, this framework only handles transactional data, but the dataset configuration described here should still apply in future versions.
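To make the roles of key_cols and date_col concrete, here is a small plain-Python sketch of a per-key monthly sum. It only illustrates the idea of grouping by the key columns and bucketing by the date column; it is not how the framework computes its features, and monthly_sum, the sample rows, and their values are made up for illustration.

```python
from collections import defaultdict
from datetime import date
from typing import Any, Dict, List, Tuple


def monthly_sum(
    rows: List[Dict[str, Any]],
    key_cols: List[str],
    date_col: str,
    value_col: str,
) -> Dict[Tuple[Any, ...], float]:
    """Sum `value_col` per (key columns..., year, month) group."""
    totals: Dict[Tuple[Any, ...], float] = defaultdict(float)
    for row in rows:
        # key_cols identify the entity being aggregated (e.g. a customer).
        key = tuple(row[col] for col in key_cols)
        # date_col decides which period (here: calendar month) a row falls into.
        period = (row[date_col].year, row[date_col].month)
        totals[key + period] += row[value_col]
    return dict(totals)


# Made-up transactional rows using the column names from the example config.
rows = [
    {"Customer_ID": "A", "IssuedDate": date(2024, 1, 5), "QTY": 2},
    {"Customer_ID": "A", "IssuedDate": date(2024, 1, 20), "QTY": 3},
    {"Customer_ID": "A", "IssuedDate": date(2024, 2, 1), "QTY": 1},
    {"Customer_ID": "B", "IssuedDate": date(2024, 1, 9), "QTY": 4},
]

result = monthly_sum(rows, key_cols=["Customer_ID"], date_col="IssuedDate", value_col="QTY")
for group, total in sorted(result.items()):
    print(group, total)
```

Passing more than one name in key_cols simply widens the grouping key, which is why the config accepts a list.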
Feature Configuration
The feature config is a bit more complicated. The features you add to your config must match the FeatureSet config implementation that the developer defines. For details on the specific feature configs that can be used, see the User Guide.
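One way to picture this matching requirement is a class whose fields are the only options allowed for a given feature block. The sketch below is hypothetical and does not show the real amee_utils classes: SingleMonthFeatureConfigSketch, REGISTRY, and build_feature_config are invented names, the field names are taken from the example YAML, and the defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SingleMonthFeatureConfigSketch:
    """Hypothetical stand-in for a developer-defined feature config."""

    sum_columns: List[str] = field(default_factory=list)
    count_columns: List[str] = field(default_factory=list)
    calculate_percentage_change: bool = False  # default is an assumption


# Hypothetical registry mapping config block names to their classes.
REGISTRY = {"SingleMonthFeatureConfig": SingleMonthFeatureConfigSketch}


def build_feature_config(name: str, options: Dict[str, Any]):
    """Instantiate the class registered under `name`, rejecting unknown input."""
    if name not in REGISTRY:
        raise ValueError(f"Unknown feature config: {name}")
    try:
        return REGISTRY[name](**options)
    except TypeError as exc:
        # Dataclasses raise TypeError for keyword arguments that are not fields.
        raise ValueError(f"Invalid option for {name}: {exc}") from exc


cfg = build_feature_config(
    "SingleMonthFeatureConfig",
    {
        "sum_columns": ["QTY", "grandTotal"],
        "count_columns": ["QTY"],
        "calculate_percentage_change": True,
    },
)
print(cfg.sum_columns)
```

Under this design, a misspelled option or an unregistered block name fails when the config is loaded rather than partway through feature calculation.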