Utils
Utilities for the feature_set module.
calculate_percentage_change(df, key_cols, round_digits=5, resolve_division_by_zero=False)
Calculate the percentage change for columns in the DataFrame.
Columns must be in the format: 'COLUMN_NAME_PREFIXMONTH', where MONTH is an integer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The DataFrame to calculate the percentage change for. | required |
| `key_cols` | `list[str]` | A list of key columns to identify unique records. | required |
| `round_digits` | `int` | The number of decimal places to round the percentage change to (default is 5). | `5` |
| `resolve_division_by_zero` | `bool` | Boolean indicating whether or not to resolve division by zero errors by filling NaN values with 1. | `False` |
Returns:
| Name | Type | Description |
|---|---|---|
| `df` | `DataFrame` | The DataFrame with the calculated percentage change added as new columns. |
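The naming rule above (a prefix followed by an integer month) can be illustrated with a small plain-Python sketch. It only shows how column names might be grouped by prefix and ordered by month; the helper name `group_month_columns`, the example column names, and the trailing-digits regex are assumptions for illustration, not the library's implementation. Which months are then compared with each other is not specified here.

```python
import re
from collections import defaultdict


def group_month_columns(columns, key_cols):
    """Group column names by prefix and sort each group by its trailing month.

    Hypothetical helper: assumes the month is the run of digits at the very
    end of the column name and everything before it is the prefix.
    """
    pattern = re.compile(r"^(?P<prefix>.*?)(?P<month>\d+)$")
    grouped = defaultdict(list)
    for name in columns:
        if name in key_cols:
            continue  # key columns identify records and are not measured
        match = pattern.match(name)
        if match:
            month = int(match.group("month"))
            grouped[match.group("prefix")].append((month, name))
    # Sort numerically by month so that e.g. month 12 comes after month 3.
    return {
        prefix: [name for _, name in sorted(pairs)]
        for prefix, pairs in grouped.items()
    }


columns = ["id", "sales_1", "sales_3", "sales_12", "visits_2", "visits_1"]
groups = group_month_columns(columns, key_cols=["id"])
```

Sorting on the parsed integer rather than on the raw string avoids ordering `sales_12` before `sales_3`.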
Source code in amee_utils/feature_generator/feature_set/utils.py
join_multiple_to_base(base_df, df_list, key_cols, join_type='left')
Join multiple DataFrames to a base DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `base_df` | `DataFrame` | The base DataFrame to join the other DataFrames to. | required |
| `df_list` | `list of DataFrame` | A list of DataFrames to join to the base DataFrame. | required |
| `key_cols` | `list of str` | A list of key columns to join the DataFrames on. | required |
| `join_type` | `str` | The type of join to perform (default is "left"). | `'left'` |
Returns:
| Type | Description |
|---|---|
| `DataFrame` | The base DataFrame with the other DataFrames joined to it. |
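One plausible way to express this kind of repeated join is a fold over `df_list`, sketched below. This is an assumption about the general technique, not the source of `join_multiple_to_base`; the name `join_all` is hypothetical. The sketch is duck-typed: it relies only on each object exposing `join(other, on=..., how=...)`, which is the shape of PySpark's `DataFrame.join`.

```python
from functools import reduce


def join_all(base_df, df_list, key_cols, join_type="left"):
    """Fold df_list into base_df, one join at a time (illustrative sketch).

    Works with any objects that provide join(other, on=..., how=...),
    such as PySpark DataFrames.
    """
    return reduce(
        lambda acc, other: acc.join(other, on=key_cols, how=join_type),
        df_list,
        base_df,  # starting value: an empty df_list returns base_df unchanged
    )
```

With the default `join_type`, every row of the base DataFrame is preserved, and key combinations missing from a joined DataFrame yield nulls in that DataFrame's columns.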
Source code in amee_utils/feature_generator/feature_set/utils.py
percentage_change(change_from_column, change_to_column)
Calculate the percentage change between two columns in a PySpark DataFrame.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `change_from_column` | `str` | The name of the column to calculate the percentage change from. | required |
| `change_to_column` | `str` | The name of the column to calculate the percentage change to. | required |
Returns:
| Type | Description |
|---|---|
| `Column` | A PySpark Column object representing the percentage change between the two columns. |
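The exact formula is not stated here. As a worked illustration, the sketch below assumes the conventional definition, (to - from) / from, expressed as a fraction rather than multiplied by 100. It operates on plain Python numbers rather than Spark columns, and the name `pct_change_value` is hypothetical.

```python
def pct_change_value(change_from, change_to):
    """Fractional change from change_from to change_to (assumed formula).

    Returns None when the result is undefined: either input is missing,
    or the starting value is zero (division by zero).
    """
    if change_from is None or change_to is None:
        return None
    if change_from == 0:
        return None
    return (change_to - change_from) / change_from


increase = pct_change_value(50.0, 75.0)   # half again as large
decrease = pct_change_value(80.0, 60.0)   # a quarter smaller
undefined = pct_change_value(0.0, 5.0)    # zero starting value
```

Under this assumption a value of `0.5` means a 50% increase and `-0.25` a 25% decrease.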
Source code in amee_utils/feature_generator/feature_set/utils.py
resolve_division_by_zero_func(df, column_name_pct_change, change_from_column, change_to_column)
Resolve division by zero scenarios for a specific percentage change column in a DataFrame.
Notes
This function addresses the cases where division by zero would occur during the calculation of percentage changes between two columns. It adjusts the specified percentage change column by:
- Setting it to `None` (null in Spark) if both `change_from_column` and `change_to_column` are null, indicating no data available for percentage change calculation.
- Setting it to `-1` if `change_to_column` is null but `change_from_column` is not null, indicating a scenario that could be interpreted as a 100% decrease or a special case.
- Leaving the value as is or filling NaN values with `1` for all other cases.
- Filling NaN with `1` specifically also deals with cases where `change_to_column` is not null but `change_from_column` is null and can be interpreted as a 100% increase.
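The rules above can be read as a small decision table. The sketch below applies them to single Python values instead of Spark columns, purely to make the precedence of the cases explicit. The name `resolve_pct_change` is hypothetical, and treating a missing result (`None`) the same as NaN in the final case is an interpretation, not something the notes state.

```python
import math


def resolve_pct_change(pct_change, change_from, change_to):
    """Apply the resolution rules to one row's values (illustrative sketch)."""
    # Rule 1: no data on either side, so the change stays null.
    if change_from is None and change_to is None:
        return None
    # Rule 2: a value existed before and is now missing, read as -100%.
    if change_to is None:
        return -1.0
    # Rules 3 and 4: fill an undefined result with 1 (read as +100%).
    # Assumption: None is treated like NaN here.
    if pct_change is None or math.isnan(pct_change):
        return 1.0
    # Otherwise the computed value is left as is.
    return pct_change
```

The order of the checks matters: the both-null case must be tested before the single-null cases, otherwise it would be reported as `-1` instead of staying null.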
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | The DataFrame containing the columns to calculate percentage change between and to apply the resolution. | required |
| `column_name_pct_change` | `str` | The name of the column where the calculated percentage change is stored. This column will be modified based on the resolution rules. | required |
| `change_from_column` | `str` | The name of the column representing the initial value in the percentage change calculation. | required |
| `change_to_column` | `str` | The name of the column representing the subsequent value in the percentage change calculation. | required |
Returns:
| Name | Type | Description |
|---|---|---|
| `df` | `DataFrame` | The modified DataFrame with division by zero scenarios resolved in the specified percentage change column. |