SKU Feature Set

The SKU Feature Set generates features related to the top and bottom performing SKUs (Stock Keeping Units) for each customer over specified time periods.

Usage

Configure the SKU features in your config.yaml file:

config.yaml
features:
  SKUFeatureConfig:
    lag_months: [3,6]
    sku_id_column: "sku_id"
    primary_sort_field: "quantity"
    secondary_sort_field: "volume_hl"
    number_of_skus: 3
    resolve_tie_break: True

Features Generated

The SKU Feature Set generates the following types of features:

| Feature Type | Description | Example Column Names |
|---|---|---|
| Top SKUs | The top N SKUs based on the primary sort field | TOP_SKU_P3_1, TOP_SKU_P3_2, TOP_SKU_P3_3 |
| Bottom SKUs | The bottom N SKUs based on the primary sort field | BOTTOM_SKU_P3_1, BOTTOM_SKU_P3_2, BOTTOM_SKU_P3_3 |

Where:

  • P3 indicates a 3-month lag period
  • The number at the end (1, 2, 3) represents the rank of the SKU
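
The naming pattern can be illustrated with a small helper that lists the columns to expect for a given configuration. This is a sketch only: sku_feature_columns is a hypothetical name, not part of the feature set, and the P6 names assume that other lag periods follow the same pattern as P3.

```python
def sku_feature_columns(lag_months, number_of_skus):
    """List expected column names, e.g. TOP_SKU_P3_1 (hypothetical helper)."""
    columns = []
    for lag in lag_months:
        for prefix in ("TOP_SKU", "BOTTOM_SKU"):
            for rank in range(1, number_of_skus + 1):
                # P<lag> is the lag period in months, the suffix is the rank.
                columns.append(f"{prefix}_P{lag}_{rank}")
    return columns


if __name__ == "__main__":
    for name in sku_feature_columns([3, 6], 3):
        print(name)
```

With lag_months: [3, 6] and number_of_skus: 3, this yields 12 names: three top and three bottom columns per lag period.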

Configuration Options

SKUFeatureConfig Class

The SKUFeatureConfig class allows you to customize the SKU feature generation:

  • lag_months: List of integers representing the number of months to lag (e.g., [3, 6])
  • sku_id_column: Name of the column containing SKU IDs
  • primary_sort_field: The main field used for sorting SKUs (e.g., "quantity")
  • number_of_skus: Number of top and bottom SKUs to calculate
  • secondary_sort_field: Optional field for tie-breaking (e.g., "volume_hl")
  • resolve_tie_break: Boolean flag to enable tie-breaking using the secondary sort field
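
To make the shape of these options concrete, the sketch below mirrors them in a plain dataclass. It is not the actual SKUFeatureConfig implementation: the class name SKUFeatureConfigSketch, the default values, and the validation rules are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SKUFeatureConfigSketch:
    """Illustrative stand-in for SKUFeatureConfig (defaults are assumed)."""

    lag_months: List[int]
    sku_id_column: str
    primary_sort_field: str
    number_of_skus: int
    secondary_sort_field: Optional[str] = None  # assumed optional
    resolve_tie_break: bool = False  # assumed default

    def __post_init__(self):
        # Assumed sanity checks; the real class may validate differently.
        if not self.lag_months or any(m <= 0 for m in self.lag_months):
            raise ValueError("lag_months must contain positive integers")
        if self.number_of_skus < 1:
            raise ValueError("number_of_skus must be at least 1")
        if self.resolve_tie_break and self.secondary_sort_field is None:
            raise ValueError("resolve_tie_break needs a secondary_sort_field")


if __name__ == "__main__":
    config = SKUFeatureConfigSketch(
        lag_months=[3, 6],
        sku_id_column="sku_id",
        primary_sort_field="quantity",
        number_of_skus=3,
        secondary_sort_field="volume_hl",
        resolve_tie_break=True,
    )
    print(config)
```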

Feature Generation Process

  1. For each value in lag_months, the input data is filtered to that lag period.
  2. SKUs are ranked for each customer based on the primary sort field.
  3. If tie-breaking is enabled, the secondary sort field is used to resolve ties.
  4. The top N and bottom N SKUs are selected for each customer and lag period.
  5. The results are formatted into columns like TOP_SKU_P3_1, BOTTOM_SKU_P3_1, etc.
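
The steps above can be sketched in plain Python. This is a minimal illustration, not the feature set's actual code, and it rests on several assumptions: the lag window runs from the first day of the month `lag` months before a reference date up to that date, values are summed per SKU, and ties are broken in the same direction as the primary sort. The function names and the reference-date parameter are hypothetical.

```python
from collections import defaultdict
from datetime import date


def months_back(ref, months):
    """First day of the month `months` months before ref's month."""
    idx = ref.year * 12 + (ref.month - 1) - months
    return date(idx // 12, idx % 12 + 1, 1)


def top_bottom_skus(rows, ref_date, lag, n, primary="quantity",
                    secondary="volume_hl", resolve_tie_break=True):
    """Return {customer_id: {column_name: sku_id or None}} for one lag."""
    cutoff = months_back(ref_date, lag)

    # Step 1: keep rows inside the (assumed) lag window, summing per SKU.
    totals = defaultdict(lambda: defaultdict(lambda: [0.0, 0.0]))
    for row in rows:
        if cutoff <= row["date"] <= ref_date:
            agg = totals[row["customer_id"]][row["sku_id"]]
            agg[0] += row[primary]
            agg[1] += row[secondary]

    features = {}
    for customer, skus in totals.items():
        # Steps 2-3: rank by the primary field, then the secondary on ties.
        def sort_key(sku):
            p, s = skus[sku]
            return (p, s) if resolve_tie_break else (p,)

        ranked = sorted(skus, key=sort_key, reverse=True)

        # Step 4: top N from the front, bottom N from the back.
        top = ranked[:n]
        bottom = ranked[::-1][:n]

        # Step 5: format into columns, padding with None when short.
        row_out = {}
        for i in range(n):
            row_out[f"TOP_SKU_P{lag}_{i + 1}"] = top[i] if i < len(top) else None
            row_out[f"BOTTOM_SKU_P{lag}_{i + 1}"] = (
                bottom[i] if i < len(bottom) else None
            )
        features[customer] = row_out
    return features


if __name__ == "__main__":
    rows = [
        {"customer_id": "C1", "sku_id": "SKU4", "quantity": 10,
         "volume_hl": 5.0, "date": date(2024, 5, 10)},
        {"customer_id": "C1", "sku_id": "SKU3", "quantity": 8,
         "volume_hl": 4.0, "date": date(2024, 4, 2)},
        {"customer_id": "C1", "sku_id": "SKU6", "quantity": 5,
         "volume_hl": 2.0, "date": date(2024, 6, 1)},
        {"customer_id": "C1", "sku_id": "SKU5", "quantity": 5,
         "volume_hl": 1.0, "date": date(2024, 5, 20)},
        # Outside the 3-month window, so it is ignored.
        {"customer_id": "C1", "sku_id": "SKU9", "quantity": 100,
         "volume_hl": 50.0, "date": date(2023, 1, 15)},
    ]
    print(top_bottom_skus(rows, date(2024, 6, 30), lag=3, n=3))
```

In this made-up data, SKU6 and SKU5 tie on quantity, so volume_hl decides their order. Because the customer has only four SKUs in the window, some SKUs appear in both the top and bottom columns.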

Example

Given an input DataFrame with columns: customer_id, sku_id, quantity, volume_hl, and date, the SKU Feature Set might generate output like this:

| customer_id | TOP_SKU_P3_1 | TOP_SKU_P3_2 | TOP_SKU_P3_3 | BOTTOM_SKU_P3_1 | BOTTOM_SKU_P3_2 | BOTTOM_SKU_P3_3 |
|---|---|---|---|---|---|---|
| C1 | SKU4 | SKU3 | SKU6 | SKU5 | SKU6 | SKU3 |
| C2 | SKU4 | SKU6 | SKU5 | SKU3 | SKU5 | SKU6 |

This output shows the top 3 and bottom 3 SKUs for each customer based on a 3-month lag period, sorted primarily by quantity and with ties resolved using volume_hl.

Notes

  • Null values are handled without errors: rank columns that have no matching SKU are filled with None.
  • You can generate features for multiple lag periods (e.g., 3 months and 6 months) in a single run.
  • Ensure that your input data has sufficient history to cover the largest lag period you specify.
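
One way to check the last point before running the feature set is to compare the earliest date in the input against the start of the largest lag window. The helper below is a hypothetical sketch, not part of the feature set, and it assumes the window starts on the first day of the month that lies max(lag_months) months before a reference date you choose.

```python
from datetime import date


def has_sufficient_history(dates, reference_date, lag_months):
    """True if the earliest date reaches back to the largest lag window."""
    largest = max(lag_months)
    # Month arithmetic on a running month index (year * 12 + month - 1).
    idx = reference_date.year * 12 + (reference_date.month - 1) - largest
    window_start = date(idx // 12, idx % 12 + 1, 1)
    return min(dates) <= window_start


if __name__ == "__main__":
    observed = [date(2023, 11, 1), date(2024, 3, 15), date(2024, 6, 1)]
    print(has_sufficient_history(observed, date(2024, 6, 30), [3, 6]))
```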