Feature Generator

About the Framework

The Feature Generator aims to provide a standardised framework, easy to use and easy to contribute to, for creating machine learning features from the data available within Heineken. The framework has three components:

  1. The Config, which provides an easy-to-use API and format for defining your dataset and the features you want.
  2. The FeatureSet, a generic class that defines a group of features.
  3. The FeatureCalculator, which houses the API that generates your features from the config file.

Info

The only code a user of the framework needs to write is FeatureCalculator(config=config).run(df=df). This is the main benefit: as a user, you simply define your config.yaml and ensure that your dataframe is correctly filtered.

Usage

As a user, you only need two things, starting with a config file:

config.yaml
dataset:
  key_cols: ['Customer_ID']
  date_col: 'IssuedDate'

features:
  SingleMonthFeatureConfig:
    sum_columns:
      - QTY
      - grandTotal
    count_columns:
      - QTY
    calculate_percentage_change: True
  PeriodMonthFeatureConfig:
    calculation_columns:
      - QTY
      - grandTotal
    deviation_from_mean_months:
      - 3
    calculate_percentage_change: True
    calculate_deltas: True
    resolve_divide_by_zero: True

and your code:

from amee_utils.feature_generator.calculator import FeatureCalculator
from amee_utils.feature_generator.config import Config

df = spark.read.parquet('path/to/dataset')
config = Config.from_yaml('path/to/config.yaml')
result = FeatureCalculator(config=config).run(df=df)

Note

This assumes that the dataframe you pass to the FeatureCalculator contains all the data required to calculate your features. The framework isn't going to do everything for you 🙃.
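Because the framework expects the dataframe to already contain every column the config refers to, a quick pre-flight check can catch mistakes before running the calculator. The sketch below is a hypothetical helper, not part of amee_utils. It assumes the config has been loaded into a plain dict with the same shape as the example config.yaml, and that every list of strings under a feature config names dataframe columns.

```python
from typing import Any

# Dict form of the example config.yaml (the shape a YAML loader would return).
EXAMPLE_CONFIG: dict[str, Any] = {
    "dataset": {"key_cols": ["Customer_ID"], "date_col": "IssuedDate"},
    "features": {
        "SingleMonthFeatureConfig": {
            "sum_columns": ["QTY", "grandTotal"],
            "count_columns": ["QTY"],
            "calculate_percentage_change": True,
        },
        "PeriodMonthFeatureConfig": {
            "calculation_columns": ["QTY", "grandTotal"],
            "deviation_from_mean_months": [3],
            "calculate_percentage_change": True,
            "calculate_deltas": True,
            "resolve_divide_by_zero": True,
        },
    },
}


def referenced_columns(config: dict[str, Any]) -> list[str]:
    """Collect every column name the config refers to, without duplicates."""
    dataset = config["dataset"]
    columns = list(dataset["key_cols"]) + [dataset["date_col"]]
    for options in config.get("features", {}).values():
        for value in options.values():
            # Assumption: lists of strings are column lists; lists of
            # numbers (e.g. deviation_from_mean_months) are skipped.
            if isinstance(value, list):
                columns.extend(v for v in value if isinstance(v, str))
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(columns))


def missing_columns(config: dict[str, Any], available: list[str]) -> list[str]:
    """Return the referenced columns that are absent from `available`."""
    available_set = set(available)
    return [c for c in referenced_columns(config) if c not in available_set]
```

With a Spark dataframe, calling missing_columns(EXAMPLE_CONFIG, df.columns) before FeatureCalculator(config=config).run(df=df) would return an empty list when everything needed is present.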

There is more to how the config file needs to look, which is discussed below.

Defining your Configuration file

As of amee-utils@v0.3.1 the config file should be defined in YAML. Returning to the earlier example config.yaml:

config.yaml
dataset:
  key_cols: ['Customer_ID']
  date_col: 'IssuedDate'

features:
  SingleMonthFeatureConfig:
    sum_columns:
      - QTY
      - grandTotal
    count_columns:
      - QTY
    calculate_percentage_change: True
  PeriodMonthFeatureConfig:
    calculation_columns:
      - QTY
      - grandTotal
    deviation_from_mean_months:
      - 3
    calculate_percentage_change: True
    calculate_deltas: True
    resolve_divide_by_zero: True

The Config has two components: the dataset config and the features definition.

Dataset Configuration

  • key_cols: The key column(s) in your dataset used for the various aggregations. You can provide more than one key column in the list.
  • date_col: The column that defines the date order of your dataset. It is used to filter for specific periods during feature calculation and to ensure the dataset is ordered.
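To make the roles of the two settings concrete, here is a minimal plain-Python sketch of a monthly aggregation keyed on key_cols and bucketed by date_col. It is an illustration only and does not reflect how the framework is implemented (which runs on Spark dataframes); the function name monthly_sum and the list-of-dicts row format are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date


def monthly_sum(rows, key_cols, date_col, value_col):
    """Sum `value_col` per (key, month), where the key is built from
    `key_cols` and the month is derived from `date_col`."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[c] for c in key_cols)      # e.g. ("A",)
        month = row[date_col].strftime("%Y-%m")    # e.g. "2024-01"
        totals[(key, month)] += row[value_col]
    return dict(totals)


# Hypothetical sample rows using the column names from the example config.
sample_rows = [
    {"Customer_ID": "A", "IssuedDate": date(2024, 1, 5), "QTY": 2},
    {"Customer_ID": "A", "IssuedDate": date(2024, 1, 20), "QTY": 3},
    {"Customer_ID": "B", "IssuedDate": date(2024, 1, 9), "QTY": 4},
]
monthly_qty = monthly_sum(sample_rows, ["Customer_ID"], "IssuedDate", "QTY")
```

Passing several names in key_cols simply makes the grouping key a longer tuple, which mirrors the option of listing more than one key column.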

Note

As of v0.3.1 this framework only supports transactional data, but the dataset configuration above should still apply in future versions.

Feature Configuration

The feature config is a bit more complicated. The features you add to your config must match the FeatureSet config implementations that the developer has defined. For details on the specific feature configs available, see the User Guide.
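One way to picture this matching is a lookup from the name used under features to a config class whose fields define the allowed options. The sketch below is purely illustrative: SingleMonthSketch, FEATURE_CONFIGS and build_feature_config are hypothetical stand-ins, and the real amee_utils classes and their validation may look quite different. Only the option names are taken from the example config.yaml.

```python
from dataclasses import dataclass, field, fields


@dataclass
class SingleMonthSketch:
    """Hypothetical stand-in for a developer-defined feature config."""
    sum_columns: list = field(default_factory=list)
    count_columns: list = field(default_factory=list)
    calculate_percentage_change: bool = False


# Hypothetical registry: YAML section name -> config class.
FEATURE_CONFIGS = {"SingleMonthFeatureConfig": SingleMonthSketch}


def build_feature_config(name, options):
    """Turn one entry under `features` into a config object, rejecting
    names and options that no config class defines."""
    if name not in FEATURE_CONFIGS:
        raise KeyError(f"Unknown feature config: {name}")
    config_cls = FEATURE_CONFIGS[name]
    allowed = {f.name for f in fields(config_cls)}
    unknown = set(options) - allowed
    if unknown:
        raise ValueError(f"{name} does not accept: {sorted(unknown)}")
    return config_cls(**options)
```

Under this sketch, a misspelled option such as sum_cols would fail when the config is loaded rather than later during feature calculation.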