SKU Feature Set
The SKU Feature Set generates features that identify the top- and bottom-performing SKUs (Stock Keeping Units) for each customer over specified time periods.
Usage
Configure the SKU features in your config.yaml file:
features:
  SKUFeatureConfig:
    lag_months: [3, 6]
    sku_id_column: "sku_id"
    primary_sort_field: "quantity"
    secondary_sort_field: "volume_hl"
    number_of_skus: 3
    resolve_tie_break: True
Features Generated
The SKU Feature Set generates the following types of features:
| Feature Type | Description | Example Column Names |
|---|---|---|
| Top SKUs | The top N SKUs based on the primary sort field | TOP_SKU_P3_1, TOP_SKU_P3_2, TOP_SKU_P3_3 |
| Bottom SKUs | The bottom N SKUs based on the primary sort field | BOTTOM_SKU_P3_1, BOTTOM_SKU_P3_2, BOTTOM_SKU_P3_3 |
Where:
- P3 indicates a 3-month lag period
- The number at the end (1, 2, 3) represents the rank of the SKU
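The naming pattern can be illustrated with a small helper. This is only a sketch: `sku_column_names` is a hypothetical function, not part of the feature set, and it assumes that a 6-month lag uses a `P6` prefix in the same way the documented `P3` examples do.

```python
def sku_column_names(lag_months, number_of_skus):
    """Build names following the TOP_SKU_P{lag}_{rank} pattern (assumed)."""
    return [
        f"{prefix}_SKU_P{lag}_{rank}"
        for lag in lag_months
        for prefix in ("TOP", "BOTTOM")
        for rank in range(1, number_of_skus + 1)
    ]


# Two lag periods with three ranks each give 12 column names.
names = sku_column_names([3, 6], 3)
```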
Configuration Options
SKUFeatureConfig Class
The SKUFeatureConfig class allows you to customize the SKU feature generation:
- lag_months: List of integers representing the number of months to lag (e.g., [3, 6])
- sku_id_column: Name of the column containing SKU IDs
- primary_sort_field: The main field used for sorting SKUs (e.g., "quantity")
- number_of_skus: Number of top and bottom SKUs to calculate
- secondary_sort_field: Optional field for tie-breaking (e.g., "volume_hl")
- resolve_tie_break: Boolean flag to enable tie-breaking using the secondary sort field
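To make the shape of these options concrete, here is a hypothetical dataclass that mirrors them. It is not the library's SKUFeatureConfig class. The name `SKUFeatureConfigSketch`, the default values, and the validation rules are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SKUFeatureConfigSketch:
    """Hypothetical mirror of the documented options; not the real class."""

    lag_months: List[int]
    sku_id_column: str
    primary_sort_field: str
    number_of_skus: int
    secondary_sort_field: Optional[str] = None  # assumed default
    resolve_tie_break: bool = False  # assumed default

    def __post_init__(self):
        # Assumed sanity checks; the real class may validate differently.
        if not self.lag_months or any(m <= 0 for m in self.lag_months):
            raise ValueError("lag_months must be a non-empty list of positive integers")
        if self.number_of_skus < 1:
            raise ValueError("number_of_skus must be at least 1")
        if self.resolve_tie_break and not self.secondary_sort_field:
            raise ValueError("resolve_tie_break requires secondary_sort_field")


# Same values as the config.yaml example in the Usage section.
config = SKUFeatureConfigSketch(
    lag_months=[3, 6],
    sku_id_column="sku_id",
    primary_sort_field="quantity",
    number_of_skus=3,
    secondary_sort_field="volume_hl",
    resolve_tie_break=True,
)
```

Rejecting `resolve_tie_break=True` when no secondary field is given is a design choice of this sketch. It surfaces a likely misconfiguration early, but the actual feature set may instead ignore the flag.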
Feature Generation Process
- The input data is filtered based on the specified lag months.
- SKUs are ranked for each customer based on the primary sort field.
- If tie-breaking is enabled, the secondary sort field is used to resolve ties.
- The top N and bottom N SKUs are selected for each customer and lag period.
- The results are formatted into columns like TOP_SKU_P3_1, BOTTOM_SKU_P3_1, etc.
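The steps above can be sketched in plain Python. This is a simplified illustration, not the feature set's actual implementation. The function names, the `as_of` reference date, the window definition (from the first day of the month `lag` months before `as_of`), and the summing of sort fields per SKU are all assumptions. Plain dictionaries stand in for the DataFrame so the ranking logic is easy to follow.

```python
from collections import defaultdict
from datetime import date


def window_start(as_of, months):
    """First day of the month that lies `months` months before `as_of`."""
    idx = as_of.year * 12 + (as_of.month - 1) - months
    return date(idx // 12, idx % 12 + 1, 1)


def sku_features(rows, as_of, lag_months, n, primary, secondary=None):
    """Return {customer_id: {column_name: sku_id or None}}."""
    features = defaultdict(dict)
    for lag in lag_months:
        start = window_start(as_of, lag)
        # Step 1: keep rows in the lag window, summing sort fields per SKU.
        totals = defaultdict(dict)
        for r in rows:
            if start <= r["date"] <= as_of:
                prev = totals[r["customer_id"]].get(r["sku_id"], (0, 0))
                sec = r[secondary] if secondary else 0
                totals[r["customer_id"]][r["sku_id"]] = (
                    prev[0] + r[primary],
                    prev[1] + sec,
                )
        for customer, skus in totals.items():
            # Steps 2-3: rank descending by primary; secondary breaks ties.
            ranked = sorted(skus, key=skus.get, reverse=True)
            # Step 4: top N from the front, bottom N from the back,
            # padded with None when fewer than N SKUs exist.
            top = (ranked + [None] * n)[:n]
            bottom = (ranked[::-1] + [None] * n)[:n]
            # Step 5: one column per rank.
            for i in range(n):
                features[customer][f"TOP_SKU_P{lag}_{i + 1}"] = top[i]
                features[customer][f"BOTTOM_SKU_P{lag}_{i + 1}"] = bottom[i]
    return dict(features)


# Illustrative data: SKU3 and SKU6 tie on quantity; volume_hl breaks the tie.
rows = [
    {"customer_id": "C1", "sku_id": "SKU4", "date": date(2024, 5, 10),
     "quantity": 10, "volume_hl": 3.0},
    {"customer_id": "C1", "sku_id": "SKU3", "date": date(2024, 5, 12),
     "quantity": 8, "volume_hl": 2.0},
    {"customer_id": "C1", "sku_id": "SKU6", "date": date(2024, 4, 2),
     "quantity": 8, "volume_hl": 1.0},
    {"customer_id": "C1", "sku_id": "SKU5", "date": date(2024, 6, 1),
     "quantity": 1, "volume_hl": 0.5},
]
result = sku_features(rows, date(2024, 6, 30), [3], 3, "quantity", "volume_hl")
# C1 ranks SKU4, SKU3, SKU6 on top and SKU5, SKU6, SKU3 on the bottom.
```

In this sketch, bottom rank 1 is the lowest-performing SKU, so with only four SKUs the top and bottom lists overlap. When no secondary field is passed, tied SKUs keep their first-seen order because Python's sort is stable.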
Example
Given an input DataFrame with columns: customer_id, sku_id, quantity, volume_hl, and date, the SKU Feature Set might generate output like this:
| customer_id | TOP_SKU_P3_1 | TOP_SKU_P3_2 | TOP_SKU_P3_3 | BOTTOM_SKU_P3_1 | BOTTOM_SKU_P3_2 | BOTTOM_SKU_P3_3 |
|---|---|---|---|---|---|---|
| C1 | SKU4 | SKU3 | SKU6 | SKU5 | SKU6 | SKU3 |
| C2 | SKU4 | SKU6 | SKU5 | SKU3 | SKU5 | SKU6 |
This output shows the top 3 and bottom 3 SKUs for each customer based on a 3-month lag period, sorted primarily by quantity and with ties resolved using volume_hl.
Notes
- The feature set handles null values gracefully: missing SKUs (for example, when a customer has fewer SKUs than number_of_skus) are filled with None.
- You can generate features for multiple lag periods (e.g., 3 months and 6 months) in a single run.
- Ensure that your input data has sufficient history to cover the largest lag period you specify.
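One way to check the history requirement before running the feature set is to compare the span of available data with the largest lag. The helper below is a hypothetical sketch. `covers_largest_lag` is not part of the feature set, and counting whole calendar months between the earliest record and a reference date is an assumption about how coverage should be measured.

```python
from datetime import date


def covers_largest_lag(earliest, as_of, lag_months):
    """True if the calendar months from `earliest` to `as_of` reach max(lag_months)."""
    months_available = (as_of.year - earliest.year) * 12 + (as_of.month - earliest.month)
    return months_available >= max(lag_months)


# Data starting in January 2024 spans 5 calendar months by June 2024,
# which is too short for a 6-month lag.
enough = covers_largest_lag(date(2024, 1, 1), date(2024, 6, 30), [3, 6])
```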