Feature Engineering

This is the documentation for feature engineering. For most use cases the SimpleFeatureEngineer will suffice. It is also possible to use your own function as a feature engineer through FeatureEngineer. For more information, see the Feature engineering examples.

Feature engineering for timeseries

class sam.feature_engineering.BaseFeatureEngineer

Bases: BaseEstimator, TransformerMixin, ABC

Base class for feature engineering. To use this class, you need to implement the feature_engineer_ method.

Methods

feature_engineer_(X)

Implement this method to do the feature engineering.

fit(X[, y])

fit method

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Function for obtaining feature names.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

transform method

abstract feature_engineer_(X) → DataFrame

Implement this method to do the feature engineering.

fit(X, y=None)

fit method

get_feature_names_out(input_features=None) → List[str]

Function for obtaining feature names. Generally used instead of the attribute, and more compatible with the sklearn API.

Returns:
list:

list of feature names

transform(X) → DataFrame

transform method

class sam.feature_engineering.FeatureEngineer(feature_engineer_function: Callable[[DataFrame, DataFrame], DataFrame] | None = None)

Bases: BaseFeatureEngineer

Feature engineering class. This class feature engineers the data using default methods and makes integration with the timeseries models easier. You can implement your own feature engineering code as a function that takes two arguments, X and y, and returns a feature table as a pandas dataframe.

Parameters:
feature_engineer_function: Callable[[pd.DataFrame, pd.DataFrame], pd.DataFrame]

The feature engineering function.
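
A minimal sketch of such a function, using only pandas. The column name `flow` and the function name `my_features` are hypothetical. Passing the function to FeatureEngineer is shown only as a comment, since that step assumes sam is installed.

```python
import pandas as pd


def my_features(X: pd.DataFrame, y=None) -> pd.DataFrame:
    """Hypothetical feature function: takes X and y, returns a feature table."""
    features = pd.DataFrame(index=X.index)
    # Value of the previous row (a lag of one step).
    features["flow_lag_1"] = X["flow"].shift(1)
    # Mean over the current row and the two rows before it.
    features["flow_mean_3"] = X["flow"].rolling(3).mean()
    return features


X = pd.DataFrame({"flow": [1.0, 2.0, 3.0, 4.0, 5.0]})
features = my_features(X)
print(features)

# Assumed usage with sam installed (not executed here):
# from sam.feature_engineering import FeatureEngineer
# fe = FeatureEngineer(my_features)
# features = fe.fit_transform(X)
```

The function only reads from X and returns a new dataframe with the same index, which keeps the feature table aligned with the input rows.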

Methods

feature_engineer_(X)

feature engineering function

fit(X[, y])

fit method

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Function for obtaining feature names.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

transform method

feature_engineer_(X: DataFrame) → DataFrame

feature engineering function

class sam.feature_engineering.IdentityFeatureEngineer(numeric_only: bool = True)

Bases: BaseFeatureEngineer

Identity feature engineering class. This is a placeholder class for when you don’t want to apply any feature engineering. Makes compatibility with the sam API easier.

Parameters:
numeric_only: bool

Whether to only include numeric columns in the output.

Methods

feature_engineer_(X)

feature engineering function, returns the input dataframe

fit(X[, y])

fit method

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Function for obtaining feature names.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

transform method

feature_engineer_(X: DataFrame) → DataFrame

feature engineering function, returns the input dataframe

class sam.feature_engineering.SimpleFeatureEngineer(rolling_features: List[Tuple] | DataFrame | None = None, time_features: List[Tuple] | DataFrame | None = None, time_col: str | None = None, timezone: str | None = None, drop_first: bool = True, keep_original: bool = False)

Bases: BaseFeatureEngineer

Base class for simple time series feature engineering. Provides a method to create two types of features: rolling features and time components (one hot or cyclical).

Parameters:
rolling_features: list or pandas.DataFrame (default=[])

List of tuples of the form (column, method, window). Can also be provided as a dataframe with columns: [‘column’, ‘method’, ‘window’]. The column is the name of the column to be transformed, the method is the method to be used (string), and the window is the window size (integer or string). Valid methods are “lag” or any of the pandas rolling methods (e.g. “mean”, “median”, etc.).

time_features: list (default=[])

List of tuples of the form (component, type). Can also be provided as a dataframe with columns [‘component’, ‘type’]. For supported component values, see SimpleFeatureEngineer.valid_components (e.g. “second_of_day”, “hour_of_day”). Valid types are “onehot” or “cyclical”.

time_col: str (default=None)

Name of the time column (e.g. “TIME”). If None, the index of the dataframe is used.

timezone: str, optional (default=None)

If timezone is not None, the time is converted to the specified timezone before creating features. timezone can be any string that is recognized by pytz, for example Europe/Amsterdam. We assume that the TIME column is always in UTC, even if the datetime object has no tz info.

drop_first: bool (default=True)

Whether to drop the first value of time components (used for onehot encoding).

keep_original: bool (default=False)

Whether to keep the original columns in the dataframe.
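
To make the specification format concrete, the sketch below writes rolling_features both as a list of tuples and as a dataframe, and shows roughly what each tuple corresponds to in plain pandas. The column name `flow` and the output column names are illustrative assumptions; they are not necessarily the names SimpleFeatureEngineer produces.

```python
import pandas as pd

# Specifications in the documented tuple form.
rolling_features = [("flow", "lag", 2), ("flow", "mean", 3)]
time_features = [("hour_of_day", "onehot"), ("day_of_week", "cyclical")]

# The same rolling specification as a dataframe.
rolling_df = pd.DataFrame(rolling_features, columns=["column", "method", "window"])

X = pd.DataFrame({"flow": [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]})

out = pd.DataFrame(index=X.index)
for column, method, window in rolling_features:
    name = f"{column}_{method}_{window}"  # illustrative naming only
    if method == "lag":
        # A lag is a plain shift of the column.
        out[name] = X[column].shift(window)
    else:
        # Any pandas rolling method, looked up by its name.
        out[name] = getattr(X[column].rolling(window), method)()
print(out)
```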

Methods

feature_engineer_(X)

Feature engineering function that creates rolling features and time components.

fit(X[, y])

fit method

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Function for obtaining feature names.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

transform method

component_function = {'day_of_month': <function SimpleFeatureEngineer.<lambda>>, 'day_of_week': <function SimpleFeatureEngineer.<lambda>>, 'day_of_year': <function SimpleFeatureEngineer.<lambda>>, 'hour_of_day': <function SimpleFeatureEngineer.<lambda>>, 'hour_of_week': <function SimpleFeatureEngineer.<lambda>>, 'minute_of_day': <function SimpleFeatureEngineer.<lambda>>, 'minute_of_hour': <function SimpleFeatureEngineer.<lambda>>, 'month_of_year': <function SimpleFeatureEngineer.<lambda>>, 'second_of_day': <function SimpleFeatureEngineer.<lambda>>, 'second_of_hour': <function SimpleFeatureEngineer.<lambda>>, 'second_of_minute': <function SimpleFeatureEngineer.<lambda>>, 'week_of_year': <function SimpleFeatureEngineer.<lambda>>}
component_range = {'day_of_month': (1, 31), 'day_of_week': (1, 7), 'day_of_year': (1, 366), 'hour_of_day': (0, 23), 'hour_of_week': (0, 167), 'minute_of_day': (0, 1439), 'minute_of_hour': (0, 59), 'month_of_year': (1, 12), 'second_of_day': (0, 86399), 'second_of_hour': (0, 3599), 'second_of_minute': (0, 59), 'week_of_year': (1, 53)}
feature_engineer_(X: DataFrame) → DataFrame

Feature engineering function that creates rolling features and time components.

valid_components = dict_keys(['second_of_minute', 'second_of_hour', 'second_of_day', 'minute_of_hour', 'minute_of_day', 'hour_of_day', 'hour_of_week', 'day_of_week', 'day_of_month', 'day_of_year', 'week_of_year', 'month_of_year'])

Rolling Features

class sam.feature_engineering.BuildRollingFeatures(rolling_type: str = 'mean', lookback: int = 1, window_size: str | None = None, deviation: str | None = None, alpha: float = 0.5, width: int = 1, nfft_ncol: int = 10, proportiontocut: float = 0.1, timecol: str | None = None, keep_original: bool = True, add_lookback_to_colname: bool = False)

Bases: BaseEstimator, TransformerMixin

Applies some rolling function to a pandas dataframe

This class provides a stateless transformer that applies to each column in a dataframe. It works by applying a certain rolling function to each column individually, with a window size. The rolling function is given by rolling_type, for example ‘mean’, ‘median’, ‘sum’, etcetera.

An important note is that this transformer assumes that the data is sorted by time already! So if the input dataframe is not sorted by time (in ascending order), the results will be completely wrong.

A note about the way the output is rolled: in case of ‘lag’ and ‘diff’, the output will always be lagged, even if lookback is 0. This is because these functions inherently look at a previous cell, regardless of what the lookback is. All other functions will start by looking at the current cell if lookback is 0, and will also look at previous cells if window_size is greater than 1.
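
The effect of lookback on a rolling function can be illustrated in plain pandas. This is a sketch of the semantics described above, not the code BuildRollingFeatures runs internally: a rolling mean over window_size rows is shifted by lookback rows.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], name="DEBIET")
window_size = 2

# lookback=0: the window ends at the current row, so the current value is used.
mean_lookback_0 = s.rolling(window_size).mean().shift(0)
# lookback=1: the window ends one row earlier, so the current value is not used.
mean_lookback_1 = s.rolling(window_size).mean().shift(1)

print(pd.DataFrame({"value": s,
                    "lookback_0": mean_lookback_0,
                    "lookback_1": mean_lookback_1}))
```

With lookback=1 the feature at a given row depends only on earlier rows, which is what prevents leakage of the current value.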

‘ewm’ treats window_size differently: instead of a discrete number of points to look at, ‘ewm’ takes a parameter alpha in (0, 1].
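
As an illustration of how alpha takes the place of a window size, the sketch below computes exponentially decaying weights of the form alpha*(1-alpha)^k in plain Python. How the library normalises the weights or treats the start of the series is not covered here, so this only shows the decay.

```python
alpha = 0.5

# Weight of the sample k steps in the past: alpha * (1 - alpha) ** k.
weights = [alpha * (1 - alpha) ** k for k in range(6)]
print(weights)       # [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625]
print(sum(weights))  # 0.984375
```

A larger alpha concentrates the weight on recent samples, which plays the role a small window plays for the other rolling types.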

Parameters:
window_size: array-like, shape = (n_outputs, ), optional (default=None)

Vector of values to shift. Ignored when rolling_type is ‘ewm’. If integer, the window size is fixed and the timestamps are assumed to be uniform. If a time-offset string (for example ‘1H’), the input dataframe must have a DatetimeIndex. Time offsets are not supported for rolling_type ‘lag’, ‘fourier’, ‘ewm’ or ‘diff’.

lookback: number type, optional (default=1)

The features that are built will be shifted by this value. If more than 0, this prevents leakage.

rolling_type: string, optional (default=”mean”)

The rolling function. Must be one of: ‘median’, ‘skew’, ‘kurt’, ‘max’, ‘std’, ‘lag’, ‘mean’, ‘diff’, ‘sum’, ‘var’, ‘min’, ‘numpos’, ‘ewm’, ‘fourier’, ‘cwt’, ‘trimmean’

deviation: str, optional (default=None)

one of [‘subtract’, ‘divide’]. If this option is set, the resulting column will either have the original column subtracted, or will be divided by the original column. If None, just return the resulting column. This option is not allowed when rolling_type is ‘cwt’ or ‘fourier’, but it is allowed with all other rolling_types.

alpha: numeric, optional (default=0.5)

if rolling_type is ‘ewm’, this is the parameter alpha used for weighing the samples. The current sample weighs alpha, the previous sample weighs alpha*(1-alpha), the sample before that weighs alpha*(1-alpha)^2, etcetera. Must be in (0, 1]

width: numeric, optional (default=1)

if rolling_type is ‘cwt’, the wavelet transform uses a ricker signal. This parameter defines the width of that signal

nfft_ncol: numeric, optional (default=10)

If rolling_type is ‘fourier’, the number of output columns must be fixed, since the number of outputs of the transform is not known a priori. If the transform has more outputs, the additional outputs are discarded. If it has fewer outputs, the remaining columns are right-padded with 0.

proportiontocut: numeric, optional (default=0.1)

if rolling_type is ‘trimmean’, this is the parameter used to trim values on both tails of the distribution. Must be in [0, 0.5). Value 0 results in the mean, close to 0.5 approaches the median.

keep_original: boolean, optional (default=True)

Whether the original columns should be kept or discarded. True by default, which means the new columns are added to the old ones.

timecol: str, optional (default=None)

Optional, the column to set as the index during transform. The index is restored before returning. This is only useful when using a timeoffset for window_size, since that needs a datetimeindex. So this column can specify a time column. This column will not be feature-engineered, and will never be returned in the output!

add_lookback_to_colname: bool, optional (default=False)

Whether to add lookback to the newly generated column names. If False, column names will be like DEBIET#mean_2. If True, column names will be like DEBIET#mean_2_lookback_0.
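
For the ‘trimmean’ rolling type, the role of proportiontocut can be sketched in plain Python. This follows the conventional definition of a trimmed mean (drop a fraction of the sorted values from each tail, then average). Whether the library computes it exactly this way is an assumption.

```python
def trimmed_mean(values, proportiontocut=0.1):
    """Mean after dropping a fraction of the sorted values from both tails."""
    if not 0 <= proportiontocut < 0.5:
        raise ValueError("proportiontocut must be in [0, 0.5)")
    ordered = sorted(values)
    n_cut = int(proportiontocut * len(ordered))  # values dropped per tail
    kept = ordered[n_cut:len(ordered) - n_cut]
    return sum(kept) / len(kept)


window = [1, 2, 3, 4, 100]
print(trimmed_mean(window, 0.0))  # 22.0, the ordinary mean
print(trimmed_mean(window, 0.2))  # 3.0, the outlier 100 is dropped
```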

Examples

>>> from sam.feature_engineering import BuildRollingFeatures
>>> import pandas as pd
>>> df = pd.DataFrame({'RAIN': [0.1, 0.2, 0.0, 0.6, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
...                    'DEBIET': [1, 2, 3, 4, 5, 5, 4, 3, 2, 4, 2, 3]})
>>>
>>> BuildRollingFeatures(rolling_type='lag', window_size = [0,1,4], \
...                      lookback=0, keep_original=False).fit_transform(df)
    RAIN#lag_0  DEBIET#lag_0  ...  RAIN#lag_4  DEBIET#lag_4
0          0.1             1  ...         NaN           NaN
1          0.2             2  ...         NaN           NaN
2          0.0             3  ...         NaN           NaN
3          0.6             4  ...         NaN           NaN
4          0.1             5  ...         0.1           1.0
5          0.0             5  ...         0.2           2.0
6          0.0             4  ...         0.0           3.0
7          0.0             3  ...         0.6           4.0
8          0.0             2  ...         0.1           5.0
9          0.0             4  ...         0.0           5.0
10         0.0             2  ...         0.0           4.0
11         0.0             3  ...         0.0           3.0

[12 rows x 6 columns]

Methods

fit([X, y])

Calculates window_size and feature function

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Returns feature names for the outcome of the last transform call.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transforms pandas dataframe X to apply rolling function

fit(X: Any | None = None, y: Any | None = None)

Calculates window_size and feature function

Parameters:
X: optional, is ignored
y: optional, is ignored
get_feature_names_out(input_features=None) → List[str]

Returns feature names for the outcome of the last transform call.

transform(X: DataFrame) → DataFrame

Transforms pandas dataframe X to apply rolling function

Parameters:
X: pandas dataframe, shape = `(n_rows, n_features)`

the pandas dataframe that you want to apply rolling functions on

Returns:
result: pandas dataframe, shape = (n_rows, n_features * (n_outputs + 1))

the pandas dataframe, appended with the new columns

sam.feature_engineering.range_lag_column(original_column: Series, range_shift: tuple = (0, 1)) → Series

Lags a column with a range. Will not lag the actual value, but will set a 1 in the specified range for any non-zero value.

The range can be positive and/or negative. If negative it will ‘lag’ to the future.

Parameters:
original_column: pandas series

The original column with non-zero items to lag

range_shift: tuple (default=(0, 1))

The range to lag the original column, it is inclusive. A value of 0 is no lag at all.

Returns:
pandas series

The lagged column as a series. The input will be converted to float64.
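
The behaviour described above can be sketched in plain pandas. The function `range_lag_sketch` is a hypothetical re-implementation written from the description, not the library's code, and the direction of positive versus negative shifts is an assumption based on that description.

```python
import pandas as pd


def range_lag_sketch(original_column: pd.Series, range_shift=(0, 1)) -> pd.Series:
    """Set 1.0 wherever a non-zero value occurs within the shifted range."""
    start, end = sorted(range_shift)
    nonzero = (original_column != 0).astype("float64")
    result = pd.Series(0.0, index=original_column.index)
    for shift in range(start, end + 1):  # the range is inclusive
        # Positive shifts move the flag to later rows, negative to earlier rows.
        result = result.combine(nonzero.shift(shift, fill_value=0.0), max)
    return result


s = pd.Series([0, 3, 0, 0, 0])
print(range_lag_sketch(s, (0, 1)).tolist())   # [0.0, 1.0, 1.0, 0.0, 0.0]
print(range_lag_sketch(s, (-1, 0)).tolist())  # [1.0, 1.0, 0.0, 0.0, 0.0]
```

Note that the value 3 itself is not carried along: only a flag of 1.0 is set in the rows covered by the range.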

Automatic Rolling Engineering

class sam.feature_engineering.AutomaticRollingEngineering(window_sizes: List[List], rolling_types: List[str] = ['mean', 'lag'], n_iter_per_param: int = 25, cv: int = 3, estimator_type: str = 'lin', passthrough: bool = True, cyclicals: List[str] | None = None, onehots: List[str] | None = None)

Bases: BaseEstimator, TransformerMixin

Steps for automatic rolling engineering:

  • set up self.n_rollings different rolling features (not yet parameterized) in an sklearn ColumnTransformer pipeline

  • find the best parameters for each of the rolling features using random search

  • set up a ColumnTransformer with these best features that can be used in the transform method

Parameters:
window_sizes: list of lists

Each list should consist of integers or be one of the scipy.stats.distributions that convert to a window_size for BuildRollingFeatures. Each sublist corresponds to the range tried for one of the n_rollings, and the ranges should be non-overlapping. So if you want 2 rollings to be generated per rolling_type and per feature, this could be: [scipy.stats.randint(1, 24), scipy.stats.randint(24, 168)]. Note that using long lists results in an overflow error, therefore randint is recommended.

rolling_types: list of strings (default=[‘mean’, ‘lag’])

rolling_types to try for BuildRollingFeatures. Note: cannot be ‘ewm’.

n_iter_per_param: int (default=25)

number of random values to try for each parameter. The total number of iterations is given by n_iter_per_param * len(window_sizes) * len(rolling_types)

cv: int (default=3)

number of cross-validated tries to attempt for each parameter combination

estimator_type: str (default=’lin’)

type of estimator to determine rolling importance. Can be one of: [‘rf’, ‘lin’, ‘bayeslin’]

passthrough: bool (default=True)

whether to pass original features in the transform method or not

cyclicals: list (default=None)

A list of pandas datetime properties, such as [‘minute’, ‘hour’, ‘dayofweek’, ‘week’], that will be converted to cyclicals. The rationale is that without time features, the rolling engineering may find that, for instance, values from 1 day ago predict well, while this is actually a recurring daily pattern that time features can capture. Note that time features added here are not added in the transform method. Therefore, you will have to add them yourself during subsequent model building stages.

onehots: list (default=None)

A list of pandas datetime properties, such as [‘minute’, ‘hour’, ‘dayofweek’, ‘week’], that will be converted using onehot encoding. The rationale is that without time features, the rolling engineering may find that, for instance, values from 1 day ago predict well, while this is actually a recurring daily pattern that time features can capture. Note that time features added here are not added in the transform method. Therefore, you will have to add them yourself during subsequent model building stages.
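
The size of the random search follows directly from the formula given for n_iter_per_param. A small worked example, with strings standing in for the two window-size distributions. The final line assumes one model fit per cross-validation fold per iteration, which is not stated explicitly above.

```python
n_iter_per_param = 25
cv = 3
window_sizes = ["randint(1, 24)", "randint(24, 168)"]  # placeholders for two distributions
rolling_types = ["mean", "lag"]

total_iterations = n_iter_per_param * len(window_sizes) * len(rolling_types)
print(total_iterations)  # 100

# Assumption: every iteration is fitted once per cross-validation fold.
total_fits = total_iterations * cv
print(total_fits)  # 300
```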

Examples

>>> from sam.data_sources import read_knmi
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> import pandas as pd
>>> from sam.feature_engineering import AutomaticRollingEngineering
>>> from sklearn.model_selection import train_test_split
>>> from scipy.stats import randint
>>> # load some data
>>> data = read_knmi(
...     '2018-01-01',
...     '2019-01-01',
...     variables = ['T', 'FH', 'FF', 'FX', 'SQ', 'Q', 'DR', 'RH']).set_index(['TIME'])
>>> # let's predict temperature 12 hours into the future
>>> target = 'T'
>>> fut = 12
>>> y = data[target].shift(-fut).iloc[:-fut]
>>> X = data.iloc[:-fut]
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, shuffle=False)
>>> # do the feature selection, first try without adding timefeatures
>>> ARE = AutomaticRollingEngineering(
...     window_sizes=[randint(1,24)], rolling_types=['mean', 'lag'])
>>> ARE = ARE.fit(X_train, y_train)  
Fitting ...
>>> # check some diagnostics, note: results may vary as it is a random search
>>> r2_base, r2_rollings, yhat_base, yhat_roll = ARE.compute_diagnostics(
...     X_train, X_test, y_train, y_test)
>>> print(r2_base, r2_rollings)
0.34353081601421964 0.7027068150592539
>>> # you can also inspect feature importances:
>>> barplot = sns.barplot(data=ARE.feature_importances_, y='feature_name', x='coefficients')
>>> # and make plot of the timeseries:
>>> timeseries_fig = plt.figure(figsize=(12, 6))
>>> timeseries_fig = plt.plot(X_test.index, y_test.ravel(), 'ok', label='data')
>>> timeseries_fig = plt.plot(
...     X_test.index, yhat_base, lw=3, alpha=0.75, label='yhat_base (r2: %.2f)'%r2_base
... )
>>> timeseries_fig = plt.plot(
...     X_test.index, yhat_roll, lw=3, alpha=0.75, label='yhat_rolling (r2: %.2f)'%r2_rollings)
>>> timeseries_fig = plt.legend(loc='best')
Attributes:
feature_importances_: pandas dataframe

Dataframe with a ‘feature_name’ column and, depending on estimator_type, a ‘coefficients’ column (for ‘lin’) or an ‘importances’ column (for ‘rf’).

feature_names_: list of strings

names of all features, depends on self.passthrough

rolling_feature_names_: list of strings

names of rolling features that were added

Methods

compute_diagnostics(X_train, X_test, ...)

This function is meant to provide some insight in the performance gained by adding the rolling features.

fit(X, y)

Finds the best rolling feature parameters and sets up the transformer for the transform method. Note: all input must be linearly increasing in time and have a datetime index.

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Applies the BuildRollingFeatures transformations found to work best in the fit method. Note: all input must be linearly increasing in time and have a datetime index.

get_feature_names_out

compute_diagnostics(X_train: DataFrame, X_test: DataFrame, y_train: DataFrame, y_test: DataFrame) → Tuple[float, float, ndarray, ndarray]

This function is meant to provide some insight in the performance gained by adding the rolling features.

For this, it computes r-squared between y_test and predictions made for two different proxy models of type defined by self.estimator_type. The first is for the original features presented in X_train and X_test (r2_base), and the second is for these features including the rolling features (r2_rollings). It also returns the fitted predictions.

Note!: all input must be linearly increasing in time and have a datetime index.

Parameters:
X_train: pandas dataframe

with shape [n_samples x n_features].

X_test: pandas dataframe

with shape [n_samples x n_features].

y_train: pandas dataframe

with shape [n_samples x n_features].

y_test: pandas dataframe

with shape [n_samples]

Returns:
r2_base: float

r-squared for the base model (without rollings)

r2_rollings: float

r-squared for the model including rollings

yhat_base: 1D array

prediction for the base model (without rollings)

yhat_roll: 1D array

prediction for the model including rollings
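
For reference, r-squared can be computed by hand as follows. This is the standard coefficient of determination in plain Python. Which function the library calls internally is not stated here, and the numbers are made up for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_test = [1.0, 2.0, 3.0, 4.0]
yhat_base = [2.5, 2.5, 2.5, 2.5]  # predicts the mean everywhere
yhat_roll = [1.0, 2.0, 3.0, 5.0]  # closer to the data

r2_base = r_squared(y_test, yhat_base)
r2_rollings = r_squared(y_test, yhat_roll)
print(r2_base, r2_rollings)
```

A model that always predicts the mean scores 0, so the gap between r2_base and r2_rollings indicates how much the rolling features add.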

fit(X: DataFrame, y: DataFrame)

Finds the best rolling feature parameters and sets up the transformer for the transform method. Note: all input must be linearly increasing in time and have a datetime index.

Parameters:
X: pandas dataframe

with shape [n_samples x n_features]

y: pandas dataframe

with shape [n_samples]

get_feature_names_out(input_features=None) → List[str]

transform(X: DataFrame) → DataFrame

Applies the BuildRollingFeatures transformations found to work best in the fit method. Note: all input must be linearly increasing in time and have a datetime index.

Parameters:
X: pandas dataframe

with shape [n_samples x n_features]

Returns:
X_transformed: pandas DataFrame

with shape [n_samples x n_features]

Figure: rolling feature importances (general_documents/images/automatic_rolling_importances.png)

Figure: test set timeseries (general_documents/images/automatic_rolling_engineering.png)

Decompose datetime

sam.feature_engineering.decompose_datetime(df: DataFrame, column: str | None = 'TIME', components: Sequence[str] | None = None, cyclicals: Sequence[str] | None = None, onehots: Sequence[str] | None = None, remove_categorical: bool = True, keep_original: bool = True, cyclical_maxes: Sequence[int] | None = None, cyclical_mins: Sequence[int] | int | None = (0,), timezone: str | None = None) → DataFrame

Decomposes a time column to one or more components suitable as features.

The input is a dataframe with a pandas timestamp column. New columns will be added to this dataframe. For example, if column is ‘TIME’, and components is [‘hour’, ‘minute’], two columns: ‘TIME_hour’ and ‘TIME_minute’ will be added.

Optionally, cyclical features can be added instead. For example, if cyclicals is [‘hour’], then the ‘TIME_hour’ column will not be added, but two columns ‘TIME_hour_sin’ and ‘TIME_hour_cos’ will be added instead. If you want both the categorical and cyclical features, set ‘remove_categorical’ to False.

Parameters:
df: dataframe

The dataframe with source column

column: str (default=’TIME’)

Name of the source column to extract components from. Note: the time column should have a datetime format. If None, the index is assumed to contain the time.

components: list

List of components to extract from the datetime column. All default pandas dt components are supported, plus some custom functions: [‘secondofday’, ‘week’]. Note: week was added here since it is deprecated in pandas in favor of isocalendar().week.

cyclicals: list

List of strings of newly created .dt time variables (like hour, month) you want to convert to cyclicals using sine and cosine transformations. Cyclicals are variables that do not increase linearly, but wrap around, such as days of the week and hours of the day. Format is identical to components input.

onehots: list

List of strings of newly created .dt time variables (like hour, month) you want to convert to one-hot-encoded variables. This is suitable when you think that variables do not vary smoothly with time (e.g. Sunday and Monday are quite different). This list must be mutually exclusive from cyclicals, i.e. non-overlapping.

remove_categorical: bool, optional (default=True)

Whether to remove the original categorical features (e.g. day) after conversion to cyclicals (e.g. day_sin, day_cos).

keep_original: bool, optional (default=True)

whether to keep the original columns from the dataframe. If this is False, then the returned dataframe will only contain newly generated columns, and none of the original ones

cyclical_maxes: sequence, optional (default=None)

Passed through to recode_cyclical_features. See recode_cyclical_features for more information.

cyclical_mins: sequence or int, optional (default=0)

Passed through to recode_cyclical_features. See recode_cyclical_features for more information.

timezone: str, optional (default=None)

If timezone is not None, the time is converted to the specified timezone before creating features. timezone can be any string that is recognized by pytz, for example Europe/Amsterdam. We assume that the TIME column is always in UTC, even if the datetime object has no tz info.

Returns:
dataframe

The original dataframe with extra columns containing time components

Examples

>>> from sam.feature_engineering import decompose_datetime
>>> import pandas as pd
>>> df = pd.DataFrame({'TIME': pd.date_range("2018-12-27", periods = 4),
...                    'OTHER_VALUE': [1, 2, 3, 2]})
>>> decompose_datetime(df, components= ["year", "dayofweek"])
        TIME  OTHER_VALUE  TIME_year  TIME_dayofweek
0 2018-12-27            1       2018               3
1 2018-12-28            2       2018               4
2 2018-12-29            3       2018               5
3 2018-12-30            2       2018               6
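
For readers without sam at hand, the two columns in the example above can be reproduced with the pandas .dt accessor. This is a sketch of what the example computes, not the function's implementation.

```python
import pandas as pd

df = pd.DataFrame({"TIME": pd.date_range("2018-12-27", periods=4),
                   "OTHER_VALUE": [1, 2, 3, 2]})

# Components are read from the .dt accessor and prefixed with the column name.
for component in ["year", "dayofweek"]:
    df[f"TIME_{component}"] = getattr(df["TIME"].dt, component)

print(df)
```

In pandas, dayofweek counts from Monday=0, so Thursday 2018-12-27 gives 3.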

Cyclical features

sam.feature_engineering.recode_cyclical_features(df: DataFrame, cols: Sequence[str], prefix: str = '', remove_categorical: bool = True, keep_original: bool = True, cyclical_maxes: Sequence[int] | None = None, cyclical_mins: Sequence[int] | int | None = (0,)) → DataFrame

Convert cyclical features (like day of week, hour of day) to continuous variables, so that Sunday and Monday are close together numerically.

IMPORTANT NOTE: This function requires a global minimum and maximum for the data. For example, for minutes, the global minimum and maximum are 0 and 60 respectively, even if your data never reaches these global minimums/maximums explicitly. This function assumes that the minimum and maximum should be encoded as the same value: minute 0 and minute 60 mean the same thing.

If you only use cyclical pandas timefeatures, nothing needs to be done. For these features, the minimum/maximum will be chosen automatically. These are: [‘day’, ‘dayofweek’, ‘weekday’, ‘dayofyear’, ‘hour’, ‘microsecond’, ‘minute’, ‘month’, ‘quarter’, ‘second’, ‘week’]

For any other scenario, global minimums/maximums will need to be passed in the parameters cyclical_maxes and cyclical_mins. Minimums are set to 0 by default, meaning that only the maxes need to be chosen as the value that is equivalent to 0.
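
The idea of encoding the minimum and the maximum as the same value can be sketched with a sine and cosine pair. The formula below is the common unit-circle encoding; that recode_cyclical_features uses exactly this formula is an assumption, and `encode_cyclical` is a hypothetical helper.

```python
import math


def encode_cyclical(value, cyclical_min=0, cyclical_max=24):
    """Map a value on [cyclical_min, cyclical_max] to a point on the unit circle."""
    angle = 2 * math.pi * (value - cyclical_min) / (cyclical_max - cyclical_min)
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between the encodings of two hours."""
    return math.dist(encode_cyclical(a), encode_cyclical(b))


# Hour 0 and hour 24 land on the same point (up to floating point error).
sin_0, cos_0 = encode_cyclical(0)
sin_24, cos_24 = encode_cyclical(24)
print(abs(sin_0 - sin_24) < 1e-9, abs(cos_0 - cos_24) < 1e-9)

# Hour 23 is closer to hour 0 than hour 12 is.
print(distance(23, 0) < distance(12, 0))  # True
```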

Parameters:
df: pandas dataframe

Dataframe in which the columns to convert should be present.

cols: list of strings

The suffixes of the column names to convert to continuous numerical values. These suffixes are appended to the prefix argument, with a ‘_’ in between, to get the actual column names.

prefix: string, optional (default=’’)

name of original time column in df, e.g. TIME. By default, assume the columns in cols literally refer to column names in the data

remove_categorical: bool, optional (default=True)

Whether to remove the original categorical features (e.g. day) after conversion to cyclicals (e.g. day_sin, day_cos).

keep_original: bool, optional (default=True)

whether to keep the original columns from the dataframe. If this is False, then the returned dataframe will only contain newly generated columns, and none of the original ones. If remove_categorical is False, the categoricals will be kept, regardless of this argument.

cyclical_maxes: array-like, optional (default=None)

The maximums that your data can reach. Keep in mind that the maximum value and the minimum value will be encoded as the same value. By default, None means that only standard pandas timefeatures will be encoded.

cyclical_mins: array-like or scalar, optional (default=[0])

The minimums that your data can reach. Keep in mind that the maximum value and the minimum value will be encoded as the same value. By default, 0 is used, which is applicable for all pandas timefeatures.

Returns:
dataframe

The input dataframe with cols removed, and replaced by the converted features (two for each feature).

Weather features

class sam.feature_engineering.SPEITransformer(metric: str = 'SPEI', window: str = '30D', smoothing: bool = True, min_years: int = 30, model_: DataFrame | None = None)

Bases: BaseEstimator, TransformerMixin

Standardized Precipitation (and Evaporation) Index

Computation of standardized metric that measures relative drought or precipitation shortage.

SP(E)I is a metric computed per day. Therefore daily weather data is required as input. This class assumes that the data contains precipitation columns ‘RH’ and optionally evaporation column ‘EV24’. These namings are KNMI standards.

The method computes a rolling average over the precipitation (and evaporation). Based on historic data (at least 30 years) the mean and standard deviation of the rolling average are computed across years. The daily rolling average is then transformed to a Z-score by subtracting the corresponding mean and dividing by the corresponding standard deviation.

Smoothing can be applied to make the model more robust, and able to compute the SP(E)I for leap year days. If smoothing=False, the transform method can return NA’s

The resulting score describes how dry the weather is. A very low score (smaller than -2) indicates extremely dry weather. A high score (above 2) indicates very wet weather.
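
The computation described above can be sketched in plain pandas on synthetic data. This is an illustration under simplifying assumptions, not the SPEITransformer implementation: only three made-up years are used, no smoothing is applied, and only precipitation is considered (for SPEI, evaporation would also enter the rolling quantity).

```python
import pandas as pd

# Hypothetical daily precipitation: three years, each wetter than the last.
index = pd.date_range("2001-01-01", "2003-12-31", freq="D")
rh = pd.Series(1.0, index=index)
rh[index.year == 2002] = 2.0
rh[index.year == 2003] = 3.0

# Rolling 30-day average of precipitation.
rolling = rh.rolling("30D").mean()

# Mean and standard deviation per day of the year, computed across years.
day = rolling.index.dayofyear
mean_per_day = rolling.groupby(day).transform("mean")
std_per_day = rolling.groupby(day).transform("std")

# Z-score: subtract the mean, then divide by the standard deviation.
spi = (rolling - mean_per_day) / std_per_day
print(spi.loc[pd.Timestamp("2001-07-01")], spi.loc[pd.Timestamp("2003-07-01")])
```

The driest year gets a negative score and the wettest a positive one for the same calendar day, matching the interpretation given above.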

See: http://www.droogtemonitor.nl/index.php/over-de-droogte-monitor/theorie

Parameters:
metric: {“SPI”, “SPEI”}, default=”SPEI”

The type of KPI to compute. “SPI” computes the Standardized Precipitation Index; “SPEI” computes the Standardized Precipitation Evaporation Index.

window: str or int, default=’30D’

Window size to compute the rolling precipitation or precip-evap sum

smoothing: boolean, default=True

Whether to use smoothing on the estimated mean and std for each day of the year. When smoothing=True, a centered rolling median of five steps is applied to the model’s estimated mean and standard deviations per day. The model definition will therefore be more robust. Smoothing causes less sensitivity, especially for the std. Use the plot method to visualize the estimated mean and std.

min_years: int, default=30

Minimum number of years for configuration. When setting less than 30, make sure that the estimated model makes sense, using the plot method

model_: dataframe, default=None

Ignore this variable, this is required to keep the model configured when creating a new instance (common in for example cross validation)

Examples

>>> from sam.data_sources import read_knmi
>>> from sam.feature_engineering import SPEITransformer
>>> knmi_data = read_knmi(start_date='1960-01-01', end_date='2020-01-01',
...     variables=['RH', 'EV24'], freq='daily').set_index('TIME').dropna()
>>> knmi_data['RH'] = knmi_data['RH'].divide(10).clip(0)
>>> knmi_data['EV24'] = knmi_data['EV24'].divide(10)
>>> spi = SPEITransformer().configure(knmi_data)
>>> spi.transform(knmi_data)  
            SPEI_30D
TIME ...

Methods

configure(X[, y])

Fit normal distribution on rolling precipitation (and evaporation). Apply this to historic data of precipitation (at least min_years years).

fit(X[, y])

Fit function.

fit_transform(X[, y])

Fit to data, then transform it.

get_params([deep])

Get parameters for this estimator.

plot()

Plot model. Visualization of the configured model.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transforming new weather data to SP(E)I metric

configure(X: DataFrame, y: Any | None = None)

Fit normal distribution on rolling precipitation (and evaporation). Apply this to historic data of precipitation (at least min_years years).

Parameters:
X: pandas dataframe

A dataframe containing the column ‘RH’ (and optionally ‘EV24’). It should have a DatetimeIndex.

y: Any, default=None

Not used

fit(X: DataFrame, y: Any | None = None)

Fit function. Does nothing other than checking input, but is required for a transformer. This function will not change the SP(E)I model. The SP(E)I model should be configured with the configure method. In this way, the SPEITransformer can be used within an sklearn pipeline without requiring more than 30 years of data.

Parameters:
X: pandas dataframe

A dataframe containing the column ‘RH’ (and optionally ‘EV24’). It should have a DatetimeIndex.

y: Any, default=None

Not used

plot()

Plot model. Visualization of the configured model. This function shows the estimated mean and standard deviation per day of the year.

transform(X: DataFrame) → DataFrame

Transforming new weather data to SP(E)I metric

Parameters:
X: pd.DataFrame

New weather data to transform

Returns:
pd.DataFrame

Returns a dataframe with a single column