Models

This is the documentation for modeling functions.

Linear Quantile Regression

class sam.models.LinearQuantileRegression(**kwargs)

Bases: BaseEstimator, RegressorMixin

Scikit-learn style wrapper for statsmodels' QuantReg. Fits a linear quantile regression model; the base idea comes from https://github.com/Marco-Santoni/skquantreg/blob/master/skquantreg/quantreg.py. This class requires statsmodels.

Parameters:
quantiles: list or float, default=[0.05, 0.95]

Quantiles to fit, with 0 < q < 1 for each q in quantiles.

tol: float, default=1e-3

The tolerance for the optimization. The optimization stops when the duality gap is smaller than the tolerance.

max_iter: int, default=1000

The maximum number of iterations

fit_intercept: bool, default=True

Whether to calculate the intercept for this model. If set to False, no intercept will be used in calculations (i.e. the data is expected to be already centered).

Examples

>>> from sam.models import LinearQuantileRegression
>>> from sam.data_sources import read_knmi
>>> from sklearn.model_selection import train_test_split
>>>
>>> # Prepare data
>>> data = read_knmi('2018-02-01', '2019-10-01', freq='hourly',
...                 variables=['FH', 'FF', 'FX', 'T']).set_index('TIME')
>>> y = data['T']
>>> X = data.drop('T', axis=1)
>>> # Fit model
>>> model = LinearQuantileRegression()
>>> model.fit(X, y)  
Attributes:
model_: statsmodel model

The underlying statsmodel class

model_result_: statsmodel results

The underlying statsmodel results

Methods

fit(X, y)

Fit a Linear Quantile Regression using statsmodels

get_params([deep])

Get parameters for this estimator.

predict(X)

Predict / estimate quantiles

score(X, y)

Default score function.

set_params(**params)

Set the parameters of this estimator.

fit(X: array, y: array)

Fit a Linear Quantile Regression using statsmodels

predict(X: array)

Predict / estimate quantiles

score(X: array, y: array)

Default score function. Returns the tilted loss
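The tilted (or pinball) loss is the standard scoring rule for quantile predictions. As a rough illustration of what this score represents (a sketch of the textbook definition, not necessarily the exact reduction used internally), the loss for a single quantile q can be computed as:

>>> import numpy as np
>>> def tilted_loss(q, y_true, y_pred):
...     # Pinball loss: under-predictions are weighted by q, over-predictions by (1 - q)
...     e = np.asarray(y_true) - np.asarray(y_pred)
...     return np.mean(np.maximum(q * e, (q - 1) * e))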

Keras templates

sam.models.create_keras_quantile_mlp(n_input: int, n_neurons: int, n_layers: int, n_target: int = 1, quantiles: list | None = None, dropout: float = 0.0, momentum: float = 1.0, hidden_activation: str = 'relu', output_activation: str = 'linear', lr: float = 0.001, average_type: str = 'mean') Callable

Creates a multilayer perceptron in keras. Optimizes the keras_joint_mse_tilted_loss to do multiple quantile and mean/median regression with a single model.

Parameters:
n_input: int

Number of input nodes

n_neurons: int

Number of neurons hidden layer

n_layers: int

Number of hidden layers. 0 implies that there is no additional layer between input and output.

n_target: int, optional (default=1)

Number of distinct outputs. Each will have its own mean and quantiles. When fitting the model, this should be equal to the number of columns in y_train.

quantiles: list of floats (default=None)

Quantiles to predict, values between 0 and 1. The default is None, which returns a regular MLP (single output) for mean squared error regression.

dropout: float, optional (default=0.0)

Rate parameter for dropout, value in (0, 1). The default is 0.0, which means that no dropout is applied.

momentum: float, optional (default=1.0)

Parameter for batch normalization, value in (0, 1). The default is 1.0, which means that no batch normalization is applied. Smaller values mean stronger batch normalization; see the keras documentation: https://keras.io/layers/normalization/

hidden_activation: str (default=’relu’)

Activation function for hidden layers, for more explanation: https://keras.io/layers/core/

output_activation: str (default=’linear’)

Activation function for output layer, for more explanation: https://keras.io/layers/core/

lr: float (default=0.001)

Learning rate

average_type: str (default=’mean’)

Determines what to fit as the average: ‘mean’, or ‘median’. The average is the last node in the output layer and does not reflect a quantile, but rather estimates the central tendency of the data. Setting to ‘mean’ results in fitting that node with MSE, and setting this to ‘median’ results in fitting that node with MAE (equal to 0.5 quantile).

Returns:
keras model

Examples

>>> from sam.models import create_keras_quantile_mlp
>>> from sam.datasets import load_rainbow_beach
>>> data = load_rainbow_beach()
>>> X, y = data, data["water_temperature"]
>>> n_input = X.shape[1]
>>> n_neurons = 64
>>> n_layers = 3
>>> quantiles = [0.1, 0.5, 0.9]
>>> model = create_keras_quantile_mlp(n_input, n_neurons, n_layers, quantiles=quantiles)
>>> model.fit(X, y, batch_size=16, epochs=20, verbose=0)  
<keras.callbacks.History ...
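After fitting, predictions contain one output column per quantile plus a final column for the average, since the average is the last node in the output layer (see average_type above). A hedged sketch of splitting a prediction, assuming the quantile columns appear in the order of the quantiles argument:

>>> pred = model.predict(X, verbose=0)
>>> quantile_preds = pred[:, :len(quantiles)]  # assumed order: the 0.1, 0.5 and 0.9 quantiles
>>> mean_pred = pred[:, -1]                    # the last node estimates the central tendency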
sam.models.create_keras_quantile_rnn(input_shape: tuple, n_neurons: int = 64, n_layers: int = 2, quantiles: list | None = None, n_target: int = 1, layer_type: str = 'GRU', dropout: float = 0.0, recurrent_dropout: str = 'dropout', hidden_activation: str = 'relu', output_activation: str = 'linear', lr: float = 0.001) Callable

Creates a simple RNN (LSTM or GRU) with keras. Optimizes the keras_joint_mse_tilted_loss to do multiple quantile and mean regression with a single model.

Parameters:
input_shape: tuple,

A shape tuple (integers) of a single input sample: (window, number of features), where window is the parameter used in the preprocessing.RecurrentReshaper class.

n_neurons: int (default=64)

Number of neurons hidden layer

n_layers: int (default=2)

Number of hidden layers. 0 implies that there is no additional layer between input and output.

quantiles: list of floats (default=None)

Quantiles to predict, values between 0 and 1, default is None, which returns a regular rnn (single output) for mean squared error regression

n_target: int, optional (default=1)

Number of distinct outputs. Each will have its own mean and quantiles. When fitting the model, this should be equal to the number of columns in y_train.

layer_type: str (default=’GRU’)

Type of recurrent layer. Options: ‘LSTM’ (long short-term memory) or ‘GRU’ (gated recurrent unit).

dropout: float, optional (default=0.0)

Rate parameter for dropout, value in (0, 1). The default is 0.0, which means that no dropout is applied.

recurrent_dropout: float or str, optional (default=’dropout’)

Rate parameter for recurrent dropout, value in (0, 1). The default is ‘dropout’, which means that the recurrent dropout is equal to the dropout parameter (dropout between layers).

hidden_activation: str (default=’relu’)

Activation function for hidden layers, for more explanation: https://keras.io/layers/core/

output_activation: str (default=’linear’)

Activation function for output layer, for more explanation: https://keras.io/layers/core/

lr: float (default=0.001)

Learning rate

Returns:

keras model

Examples

>>> import pandas as pd
>>> import numpy as np
>>> from sam.data_sources import synthetic_date_range, synthetic_timeseries
>>> from sam.preprocessing import RecurrentReshaper
>>> from sam.models import create_keras_quantile_rnn
>>> dates = pd.Series(synthetic_date_range().to_pydatetime())
>>> y = synthetic_timeseries(dates, daily = 2, noise = {'normal': 0.25}, seed=2)
>>> y = y[~np.isnan(y)]
>>> X = pd.DataFrame(y)
>>> X_3d = RecurrentReshaper(window=24, lookback = 1).fit_transform(X)
>>> X_3d = X_3d[24:]
>>> y = y[24:]
>>> input_shape = X_3d.shape[1:]
>>> model = create_keras_quantile_rnn(input_shape, quantiles=[0.01, 0.99])
>>> model.fit(X_3d, y, batch_size=32, epochs=5, verbose=0)  
<keras.callbacks.History ...
sam.models.create_keras_autoencoder_mlp(n_input: int, encoder_neurons: list = [64, 16], dropout: float = 0.0, momentum: float = 1.0, hidden_activation: str = 'relu', output_activation: str = 'linear', lr: float = 0.001) Callable

Function to create an MLP auto-encoder in keras. Optimizes the mean squared error to reconstruct the input, after passing it through a bottleneck neural network.

Parameters:
n_input: int

Number of input nodes

encoder_neurons: list (default=[64, 16])

List of integers, each representing the number of neurons per layer within the encoder. The decoder is a reversed version of the encoder. The last element is the number of neurons in the “representation” layer. Example: if encoder_neurons=[64, 12] and the number of features is 120, the number of neurons per layer is [120, 64, 12, 64, 120].

dropout: float, optional (default=0.0)

Rate parameter for dropout, value in (0, 1). The default is 0.0, which means that no dropout is applied.

momentum: float, optional (default=1.0)

Parameter for batch normalization, value in (0, 1). The default is 1.0, which means that no batch normalization is applied. Smaller values mean stronger batch normalization; see the keras documentation: https://keras.io/layers/normalization/

hidden_activation: str (default=’relu’)

Activation function for hidden layers, for more explanation: https://keras.io/layers/core/

output_activation: str (default=’linear’)

Activation function for output layers, for more explanation: https://keras.io/layers/core/

lr: float (default=0.001)

Learning rate

Returns:

keras model

Examples

>>> import pandas as pd
>>> from sam.data_sources import synthetic_date_range, synthetic_timeseries
>>> from sam.preprocessing import RecurrentReshaper
>>> from sam.models import create_keras_autoencoder_mlp
>>> dates = pd.Series(synthetic_date_range().to_pydatetime())
>>> X = [synthetic_timeseries(dates, daily=2, noise={'normal': 0.25}, seed=i)
...      for i in range(100)]
>>> X = pd.DataFrame(X)
>>> model = create_keras_autoencoder_mlp(n_input=100)
>>> model.fit(X.T, X.T, batch_size=32, epochs=5, verbose=0)  
<keras.callbacks.History ...
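Because the autoencoder is trained to reconstruct its input under a mean squared error loss, a per-sample reconstruction error can be computed after fitting. A hedged continuation of the example above (the reconstruction is simply the model's prediction of its own input):

>>> import numpy as np
>>> reconstruction = model.predict(X.T, verbose=0)
>>> reconstruction_error = np.mean((X.T.values - reconstruction) ** 2, axis=1)
>>> # rows of X.T with a large reconstruction_error are poorly captured by the bottleneck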
sam.models.create_keras_autoencoder_rnn(input_shape: tuple, encoder_neurons: list = [64, 16], layer_type: str = 'GRU', dropout: float = 0.0, recurrent_dropout: float = 0.0, hidden_activation: str = 'relu', output_activation: str = 'linear', lr: float = 0.001) Callable

Function to create a recurrent auto-encoder in keras. Optimizes the mean squared error to reconstruct the input, after passing it through a bottleneck neural network. References: https://towardsdatascience.com/lstm-autoencoder-for-anomaly-detection-e1f4f2ee7ccf and https://blog.keras.io/building-autoencoders-in-keras.html

Parameters:
input_shape: tuple,

Shape of a single input sample: (recurrent steps, number of features).

encoder_neurons: list (default=[64, 16])

List of integers, each representing the number of neurons per layer within the encoder. The decoder is a reversed version of the encoder. The last element is the number of neurons in the “representation” layer. Example: if encoder_neurons=[64, 12] and the number of features is 120, the number of neurons per layer is [120, 64, 12, 64, 120].

layer_type: str (default=’GRU’)

Type of recurrent layer Options: ‘LSTM’ (long short-term memory) or ‘GRU’ (gated recurrent unit)

dropout: float, optional (default=0.0)

Rate parameter for dropout, value in (0, 1). The default is 0.0, which means that no dropout is applied.

recurrent_dropout: float, optional (default=0.0)

Rate parameter for recurrent dropout, value in (0, 1). The default is 0.0, which means that no recurrent dropout is applied.

hidden_activation: str (default=’relu’)

Activation function for hidden layers, for more explanation: https://keras.io/layers/core/

output_activation: str (default=’linear’)

Activation function for output layers, for more explanation: https://keras.io/layers/core/

lr: float (default=0.001)

Learning rate

Returns:

keras model

Examples

>>> import pandas as pd
>>> from sam.data_sources import synthetic_date_range, synthetic_timeseries
>>> from sam.preprocessing import RecurrentReshaper
>>> from sam.models import create_keras_autoencoder_rnn
>>> dates = pd.Series(synthetic_date_range().to_pydatetime())
>>> y = synthetic_timeseries(dates, daily=2, noise={'normal': 0.25}, seed=2)
>>> X = pd.DataFrame(y)
>>> X_3d = RecurrentReshaper(window=24, lookback=1).fit_transform(X)
>>> X_3d = X_3d[24:]
>>> input_shape = X_3d.shape[1:]
>>> model = create_keras_autoencoder_rnn(input_shape)
>>> model.fit(X_3d, X_3d, batch_size=32, epochs=5, verbose=0)  
<keras.callbacks.History ...

Statistical process control

class sam.models.ConstantTimeseriesRegressor(predict_ahead: Sequence[int] = (0,), quantiles: Sequence[float] = (), use_diff_of_y: bool = False, timecol: str | None = None, y_scaler: TransformerMixin | None = None, average_type: str = 'median', **kwargs)

Bases: BaseTimeseriesRegressor

Constant Regression model

Baseline model that always predicts the median and quantiles. This model can be used as a benchmark or fall-back method, since the predicted median and quantiles can still be used to trigger alarms. Also see https://en.wikipedia.org/wiki/Statistical_process_control

This model uses the same init parameters as the other SAM models for compatibility, but ignores all of the feature engineering parameters.

Note: using use_diff_of_y changes how this model works: instead of predicting static bounds, it fits the median and quantiles on the differenced target and then undoes the differencing by adding those values to the last timestep, resulting in a model that predicts the last timestep plus the median difference. This approach works especially well when trying to predict a signal that has a continuous trend.
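For example, a persistence-style benchmark that forecasts one timestep ahead can be configured as described under predict_ahead below. A minimal sketch (parameter values are illustrative):

>>> from sam.models import ConstantTimeseriesRegressor
>>> persistence_model = ConstantTimeseriesRegressor(
...     predict_ahead=(1,),       # forecast one timestep into the future
...     use_diff_of_y=True,       # fit the median/quantiles on the differenced target
...     quantiles=(0.25, 0.75),
...     timecol='TIME',
... )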

Parameters:
predict_ahead: tuple of integers, optional (default=(0,))

how many steps to predict ahead. For example, if (1, 2), the model will predict both 1 and 2 timesteps into the future. If (0,), predict the present. If not equal to (0,), predict the future. Combine with use_diff_of_y to get a persistence benchmark forecasting model.

quantiles: tuple of floats, optional (default=())

The quantiles to predict. Values between 0 and 1. Keep in mind that the mean will be predicted regardless of this parameter

use_diff_of_y: bool, optional (default=False)

If True differencing is used (the difference between y now and shifted y), else differencing is not used (shifted y is used).

timecol: string, optional (default=None)

If not None, the column to use for constructing time features. Creating features from a DateTimeIndex is not yet supported.

y_scaler: object, optional (default=None)

Should be an sklearn-type transformer that has a transform and inverse_transform method. E.g.: StandardScaler() or PowerTransformer().

average_type: str = “median”,

The type of average that is used to calculate the median and quantiles. Currently only “median” is supported.

kwargs: dict, optional

Not used. Just for compatibility with other SAM models.

Examples

>>> from sam.models import ConstantTimeseriesRegressor
>>> from sam.data_sources import read_knmi
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.metrics import mean_squared_error
...
>>> # Prepare data
>>> data = read_knmi('2018-02-01', '2019-10-01', latitude=52.11, longitude=5.18, freq='hourly',
...                  variables=['FH', 'FF', 'FX', 'T', 'TD', 'SQ', 'Q', 'DR', 'RH', 'P',
...                             'VV', 'N', 'U', 'IX', 'M', 'R', 'S', 'O', 'Y'])
>>> y = data['T']
>>> X = data.drop('T', axis=1)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, shuffle=False)
...
>>> model = ConstantTimeseriesRegressor(timecol='TIME', quantiles=[0.25, 0.75])
...
>>> model.fit(X_train, y_train)
>>> pred = model.predict(X_test, y_test)
>>> pred.head()
       predict_lead_0_q_0.25  predict_lead_0_q_0.75  predict_lead_0_mean
11655                   56.0                  158.0                101.0
11656                   56.0                  158.0                101.0
11657                   56.0                  158.0                101.0
11658                   56.0                  158.0                101.0
11659                   56.0                  158.0                101.0

Methods

dump(foldername[, prefix])

Writes the model instance to foldername/prefix.pkl

fit(X, y[, validation_data])

Fit the ConstantTimeseriesRegressor model

get_actual(y)

Convenience function for getting the actual values (perfect prediction).

get_feature_names_out([input_features])

Function for obtaining feature names.

get_input_cols()

Function to obtain the input column names.

get_params([deep])

Get parameters for this estimator.

get_untrained_model()

Returns an underlying model that can be trained

load(foldername[, prefix])

Reads and loads the model located at foldername/prefix.pkl

make_prediction_monotonic(prediction)

When fitting multiple quantile regressions it is possible that individual quantile regression lines overlap; in other words, a quantile regression line fitted to a lower quantile predicts higher than a line fitted to a higher quantile.

postprocess_predict(prediction, X, y[, ...])

Postprocessing function for the prediction result.

predict(X[, y, return_data])

Predict using the ConstantTimeseriesRegressor

preprocess(X, y[, train])

Preprocess the data.

preprocess_fit(X, y[, validation_data])

This function does the following: - Validate that the input is monospaced and has enough rows - Perform differencing on the target - Fitting/applying the feature engineer - Bookkeeping to create the output columns - Remove rows with nan that can't be used for fitting - Optionally, preprocess validation data to give to the fit

preprocess_predict(X, y[, dropna])

Transform a DataFrame X so it can be fed to self.model_.

score(X, y)

Default score function.

set_params(**params)

Set the parameters of this estimator.

validate_data(X)

Validates the data and raises an exception if: - there is no time column - the data is not monospaced

validate_predict_ahead()

Perform checks to validate the predict_ahead attribute

verify_same_indexes(X, y[, y_can_be_none])

Verify that X and y have the same index

dump(foldername: str, prefix: str = 'model') None

Writes the model instance to foldername/prefix.pkl

prefix is configurable, and is ‘model’ by default

Overwrites the abstract method from SamQuantileRegressor

Parameters:
foldername: str

The name of the folder to save the model

prefix: str, optional (Default=’model’)

The name of the model

fit(X: DataFrame, y: Series, validation_data: Tuple[DataFrame, Series] | None = None, **fit_kwargs) Callable

Fit the ConstantTimeseriesRegressor model

This function will preprocess the input data, get the untrained underlying model and fits the model.

For compatibility reasons the method accepts fit_kwargs, which are not used.

Parameters:
X: pd.DataFrame

The independent variables used to ‘train’ the model

y: pd.Series

Target data (dependent variable) used to ‘train’ the model.

validation_data: tuple(pd.DataFrame, pd.Series) (X_val, y_val respectively)

Data used for validation step

Returns:
Always returns None, since there is no history object for the fit procedure.
get_untrained_model() Callable

Returns an underlying model that can be trained

Creates an instance of the ConstantTemplate class

Returns:
A trainable model class
classmethod load(foldername, prefix='model') Callable

Reads and loads the model located at foldername/prefix.pkl

prefix is configurable, and is ‘model’ by default. The output is an entire instance of the fitted model that was saved.

Overwrites the abstract method from SamQuantileRegressor

Returns:
A fitted ConstantTimeseriesRegressor object
predict(X: DataFrame, y: Series | None = None, return_data: bool = False, **predict_kwargs) DataFrame | Tuple[DataFrame, DataFrame]

Predict using the ConstantTimeseriesRegressor

This will either predict the static bounds that were fitted during fit(), or, when using use_diff_of_y, predict the last timestep plus the median/quantile difference.

In the first situation X is only used to determine how many datapoints need to be predicted. In the latter case it will use X to undo the differencing.

For compatibility reasons the method accepts predict_kwargs, which are not used.

Parameters:
X: pd.DataFrame

The independent variables used to predict.

y: pd.Series

The target values

return_data: bool, optional (default=False)

whether to return only the prediction, or to return both the prediction and the transformed input (X) dataframe.

Returns:
prediction: pd.DataFrame

The predictions coming from the model

X_transformed: pd.DataFrame, optional

The transformed input data, when return_data is True, otherwise None

Multilayer Perceptron (MLP)

class sam.models.MLPTimeseriesRegressor(predict_ahead: Sequence[int] = (0,), quantiles: Sequence[float] = (), use_diff_of_y: bool = False, timecol: str | None = None, y_scaler: TransformerMixin | None = None, feature_engineer: BaseFeatureEngineer | None = None, n_neurons: int = 200, n_layers: int = 2, batch_size: int = 16, epochs: int = 20, lr: float = 0.001, dropout: float | None = None, momentum: float | None = None, verbose: int = 1, r2_callback_report: bool = False, average_type: str = 'mean', **kwargs)

Bases: BaseTimeseriesRegressor

Multi-layer Perceptron Regressor for time series

This model combines several approaches to time series data: Multiple outputs for forecasting, quantile regression, and feature engineering. This class is an implementation of an MLP to estimate multiple quantiles for all forecasting horizons at once.

This is a wrapper for a keras MLP model. For more information on the model parameters, see the keras documentation.

Parameters:
predict_ahead: tuple of integers, optional (default=(0,))

how many steps to predict ahead. For example, if (1, 2), the model will predict both 1 and 2 timesteps into the future. If (0,), predict the present.

quantiles: tuple of floats, optional (default=())

The quantiles to predict. Values between 0 and 1. Keep in mind that the mean will be predicted regardless of this parameter

use_diff_of_y: bool, optional (default=False)

If True differencing is used (the difference between y now and shifted y), else differencing is not used (shifted y is used).

timecol: string, optional (default=None)

If not None, the column to use for constructing time features. Creating features from a DateTimeIndex is not yet supported.

y_scaler: object, optional (default=None)

Should be an sklearn-type transformer that has a transform and inverse_transform method. E.g.: StandardScaler() or PowerTransformer()

feature_engineer: object, optional (default=None)

Should be an sklearn-type transformer that has a transform method, e.g. sam.feature_engineering.SimpleFeatureEngineer.

n_neurons: integer, optional (default=200)

The number of neurons to use in the model, see create_keras_quantile_mlp

n_layers: integer, optional (default=2)

The number of layers to use in the model, see create_keras_quantile_mlp

batch_size: integer, optional (default=16)

The batch size to use in the model, see create_keras_quantile_mlp

epochs: integer, optional (default=20)

The number of epochs to use in the model, see create_keras_quantile_mlp

lr: float, optional (default=0.001)

The learning rate to use in the model, see create_keras_quantile_mlp

dropout: float, optional (default=None)

The dropout rate to use in the model, see create_keras_quantile_mlp

momentum: float, optional (default=None)

The momentum to use for batch normalization in the model, see create_keras_quantile_mlp

verbose: integer, optional (default=1)

The verbosity of fitting the keras model. Can be either 0, 1 or 2.

r2_callback_report: boolean (default=False)

Whether to add train and validation r2 to each epoch as a callback. This also changes self.verbose to 2 to prevent cluttered log output.

average_type: str (default=’mean’)

Determines what to fit as the average: ‘mean’, or ‘median’. The average is the last node in the output layer and does not reflect a quantile, but rather estimates the central tendency of the data. Setting to ‘mean’ results in fitting that node with MSE, and setting this to ‘median’ results in fitting that node with MAE (equal to 0.5 quantile).

kwargs: dict, optional

Not used. Just for compatibility with other SAM models.

Examples

>>> import pandas as pd
>>> from sam.models import MLPTimeseriesRegressor
>>> from sam.feature_engineering import SimpleFeatureEngineer
>>> from sam.datasets import load_rainbow_beach
...
>>> data = load_rainbow_beach()
>>> X, y = data, data["water_temperature"]
>>> simple_features = SimpleFeatureEngineer(
...     rolling_features=[
...         ("wave_height", "mean", 24),
...         ("wave_height", "mean", 12),
...     ],
...     time_features=[
...         ("hour_of_day", "cyclical"),
...     ],
...     keep_original=False,
... )
>>> model = MLPTimeseriesRegressor(
...     predict_ahead=(0,),
...     feature_engineer=simple_features,
...     verbose=0,
... )
>>> model.fit(X, y)  
<keras.callbacks.History ...
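A hedged continuation of the example above: predict returns a DataFrame whose columns follow the same predict_lead_<ahead>_... naming convention shown in the ConstantTimeseriesRegressor example (for this configuration only the average column, e.g. predict_lead_0_mean; with quantiles set, additional predict_lead_0_q_... columns appear):

>>> pred = model.predict(X, y)
>>> # pred contains one column per predict_ahead/quantile combination, plus the average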
Attributes:
feature_engineer_: Sklearn transformer

The transformer used on the raw data before prediction

n_inputs_: integer

The number of inputs used for the underlying neural network

n_outputs_: integer

The number of outputs (columns) from the model

prediction_cols_: array of strings

The names of the output columns from the model.

model_: Keras model

The underlying keras model

Methods

dump(foldername[, prefix])

Writes the following files: * prefix.pkl * prefix.h5

fit(X, y[, validation_data])

This function does the following: - Validate that the input is monospaced and has enough rows - Perform differencing on the target - Create feature engineer by calling self.get_feature_engineer() - Fitting/applying the feature engineer - Bookkeeping to create the output columns - Remove rows with nan that can't be used for fitting - Get untrained model with self.get_untrained_model() - Fit the untrained model and return the history object - Optionally, preprocess validation data to give to the fit - Pass through any other fit_kwargs to the fit function

get_actual(y)

Convenience function for getting the actual values (perfect prediction).

get_explainer(X[, y, sample_n])

Obtain a shap explainer-like object.

get_feature_names_out([input_features])

Function for obtaining feature names.

get_input_cols()

Function to obtain the input column names.

get_params([deep])

Get parameters for this estimator.

get_untrained_model()

Returns a simple 2d keras model.

load(foldername[, prefix])

Reads the following files: * prefix.pkl * prefix.h5

make_prediction_monotonic(prediction)

When fitting multiple quantile regressions it is possible that individual quantile regression lines overlap; in other words, a quantile regression line fitted to a lower quantile predicts higher than a line fitted to a higher quantile.

postprocess_predict(prediction, X, y[, ...])

Postprocessing function for the prediction result.

predict(X[, y, return_data, ...])

Make a prediction, and undo differencing in the case it was used

preprocess(X, y[, train])

Preprocess the data.

preprocess_fit(X, y[, validation_data])

This function does the following: - Validate that the input is monospaced and has enough rows - Perform differencing on the target - Fitting/applying the feature engineer - Bookkeeping to create the output columns - Remove rows with nan that can't be used for fitting - Optionally, preprocess validation data to give to the fit

preprocess_predict(X, y[, dropna])

Transform a DataFrame X so it can be fed to self.model_.

quantile_feature_importances(X, y[, score, ...])

Computes feature importances based on the loss function used to estimate the average.

score(X, y)

Default score function.

set_params(**params)

Set the parameters of this estimator.

summary([print_fn])

Combines several methods to create a 'wrapper' summary method.

validate_data(X)

Validates the data and raises an exception if: - there is no time column - the data is not monospaced

validate_predict_ahead()

Perform checks to validate the predict_ahead attribute

verify_same_indexes(X, y[, y_can_be_none])

Verify that X and y have the same index

dump(foldername: str | Path, prefix: str = 'model') None

Writes the following files: * prefix.pkl * prefix.h5

to the folder given by foldername. prefix is configurable, and is ‘model’ by default

Overwrites the abstract method from BaseTimeseriesRegressor

Parameters:
foldername: str

The name of the folder to save the model

prefix: str, optional (Default=’model’)

The name of the model

fit(X: DataFrame, y: Series, validation_data: Tuple[DataFrame, Series] | None = None, **fit_kwargs) Callable

This function does the following:

  • Validate that the input is monospaced and has enough rows
  • Perform differencing on the target
  • Create feature engineer by calling self.get_feature_engineer()
  • Fitting/applying the feature engineer
  • Bookkeeping to create the output columns
  • Remove rows with nan that can’t be used for fitting
  • Get untrained model with self.get_untrained_model()
  • Fit the untrained model and return the history object
  • Optionally, preprocess validation data to give to the fit
  • Pass through any other fit_kwargs to the fit function

Overwrites the abstract method from BaseTimeseriesRegressor

Parameters:
X: pd.DataFrame

The independent variables used to ‘train’ the model

y: pd.Series

Target data (dependent variable) used to ‘train’ the model.

validation_data: tuple(pd.DataFrame, pd.Series) (X_val, y_val respectively)

Data used for validation step

Returns:
tf.keras.callbacks.History:

The history object after fitting the keras model

get_explainer(X: DataFrame, y: Series | None = None, sample_n: int | None = None) SamShapExplainer

Obtain a shap explainer-like object. This object can be used to create shap values and explain predictions.

Keep in mind that this will explain the created features from self.get_feature_names_out(), not the input features. To help with this, the explainer comes with a test_values() method that calculates the test values corresponding to the shap values.

Parameters:
X: pd.DataFrame

The dataframe used to ‘train’ the explainer

y: pd.Series, optional (default=None)

Target data used to ‘train’ the explainer. Only required when self.predict_ahead > 0.

sample_n: integer, optional (default=None)

The number of samples to give to the explainer. If your background set is larger than 5000 samples, it is recommended to subsample for performance reasons.

Returns:
SamShapExplainer:

Custom Sam object that inherits from shap.DeepExplainer

Examples

>>> import pandas as pd
>>> import shap
>>> from sam.models import MLPTimeseriesRegressor
>>> from sam.feature_engineering import SimpleFeatureEngineer
>>> from sam.datasets import load_rainbow_beach
...
>>> data = load_rainbow_beach()
>>> X, y = data, data["water_temperature"]
>>> test_size = int(X.shape[0] * 0.33)
>>> train_size = X.shape[0] - test_size
>>> X_train, y_train = X.iloc[:train_size, :], y[:train_size]
>>> X_test, y_test = X.iloc[train_size:, :], y[train_size:]
...
>>> simple_features = SimpleFeatureEngineer(
...     rolling_features=[
...         ("wave_height", "mean", 24),
...         ("wave_height", "mean", 12),
...     ],
...     time_features=[
...         ("hour_of_day", "cyclical"),
...     ],
...     keep_original=False,
... )
...
>>> model = MLPTimeseriesRegressor(
...     predict_ahead=(0,),
...     feature_engineer=simple_features,
...     verbose=0,
... )
...
>>> model.fit(X_train, y_train)  
<keras.callbacks.History ...
>>> ();explainer = model.get_explainer(X_test, y_test, sample_n=10);()
... 
(...)
>>> ();shap_values = explainer.shap_values(X_test[0:30], y_test[0:30]);()
... 
(...)
>>> test_values = explainer.test_values(X_test[0:30], y_test[0:30])
>>> shap.force_plot(explainer.expected_value[0], shap_values[0][-1,:],
...                 test_values.iloc[-1,:], matplotlib=True)
get_untrained_model() Callable

Returns a simple 2d keras model. This is just a wrapper for sam.models.create_keras_quantile_mlp

Overwrites the abstract method from BaseTimeseriesRegressor

classmethod load(foldername: str | Path, prefix='model')

Reads the following files: * prefix.pkl * prefix.h5

from the folder given by foldername. prefix is configurable, and is ‘model’ by default. The output is an entire instance of the fitted model that was saved.

Overwrites the abstract method from BaseTimeseriesRegressor

Returns:
Keras model
predict(X: DataFrame, y: Series | None = None, return_data: bool = False, force_monotonic_quantiles: bool = False) DataFrame | Tuple[DataFrame, DataFrame]

Make a prediction, and undo differencing in the case it was used

Important: this is different from the sklearn/tensorflow API. We need y during prediction for two reasons: 1) a lagged version is used for feature engineering, and 2) the underlying model can predict a differenced number, and then we want to output the ‘real’ prediction, so we need y to undo the differencing.

Keep in mind how prediction works when you are predicting the future: e.g. if you have data from 00:00-12:00 and are predicting 4 hours into the future, the model will predict what the values will be at 04:00-16:00.

Overwrites the abstract method from BaseTimeseriesRegressor

Parameters:
X: pd.DataFrame

The independent variables used to predict.

y: pd.Series

The target values

return_data: bool, optional (default=False)

whether to return only the prediction, or to return both the prediction and the transformed input (X) dataframe.

force_monotonic_quantiles: bool, optional (default=False)

whether to force quantiles to not overlap. When fitting multiple quantile regressions it is possible that individual quantile regression lines overlap; in other words, a quantile regression line fitted to a lower quantile predicts higher than a line fitted to a higher quantile. If this occurs for a certain prediction, the output distribution is invalid. We can force monotonicity by making the outer quantiles at least as high as the inner quantiles (a short illustrative sketch follows the Returns block below).

Returns:
prediction: pd.DataFrame

The predictions coming from the model

X_transformed: pd.DataFrame, optional

The transformed input data, when return_data is True, otherwise None
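A minimal sketch of the monotonicity idea behind force_monotonic_quantiles, on an array of quantile predictions whose columns are assumed to be ordered from the lowest to the highest quantile (this illustrates the principle, not the library's internal implementation):

>>> import numpy as np
>>> q_pred = np.array([[1.0, 0.8, 1.2]])               # middle quantile dips below the lower one
>>> monotonic = np.maximum.accumulate(q_pred, axis=1)  # each quantile is pushed up to at least the previous one
>>> # monotonic is now [[1.0, 1.0, 1.2]]: the quantiles no longer overlap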

quantile_feature_importances(X: DataFrame, y: Series, score: str | Callable | None = None, n_iter: int = 5, sum_time_components: bool = False, random_state: int | None = None) DataFrame

Computes feature importances based on the loss function used to estimate the average. This function uses ELI5’s get_score_importances (https://eli5.readthedocs.io/en/latest/autodocs/permutation_importance.html) to compute feature importances. It is a method that measures how the score decreases when a feature is not available. This is essentially a model-agnostic type of feature importance that works with every model, including keras MLP models.

Note that we compute feature importance over the average (the central trace, and the last output node, either median or mean depending on self.average_type), and do not include the quantiles in the loss calculation. Initially, the quantiles were included, but experimentation showed that importances behaved very badly when including the quantiles in the loss: importances were sometimes consistently negative (i.e. in all random iterations), while these features should have been important according to theory, and excluding them indeed led to much worse model performance. This behavior goes away when only using the mean trace to estimate feature importance.

Parameters:
X: pd.DataFrame

dataframe with test or train features

y: pd.Series

dataframe with test or train target

score: str or function, optional (default=None)

Either a function with signature score(X, y, model) that returns a scalar. Will be used to measure score decreases for ELI5. If None, defaults to MSE or MAE depending on self.average_type. Note that if score computes a loss (i.e. higher is worse), negative values indicate positive contribution to model performance (i.e. negative score decrease means that removing this feature will increase the metric, which is a bad thing with MAE/MSE).

n_iter: int, optional (default=5)

Number of iterations to use for ELI5. Since ELI5 results can vary wildly, increasing this parameter may provide more stability at the cost of a longer runtime

sum_time_components: bool, optional (default=False)

if set to true, sums feature importances of the different subfeatures of each time component (i.e. weekday_1, weekday_2 etc. in one ‘weekday’ importance)

random_state: int, optional (default=None)

Used for shuffling columns of matrix columns.

Returns:
score_decreases: Pandas dataframe, shape (n_iter x n_features)

The score decreases when leaving out each feature per iteration. The larger the magnitude, the more important each feature is considered by the model.

Examples

>>> # Example with a fictional dataset with only 2 features
>>> import pandas as pd
>>> import seaborn
>>> from sam.models import MLPTimeseriesRegressor
>>> from sam.feature_engineering import SimpleFeatureEngineer
>>> from sam.datasets import load_rainbow_beach
...
>>> data = load_rainbow_beach()
>>> X, y = data, data["water_temperature"]
>>> test_size = int(X.shape[0] * 0.33)
>>> train_size = X.shape[0] - test_size
>>> X_train, y_train = X.iloc[:train_size, :], y[:train_size]
>>> X_test, y_test = X.iloc[train_size:, :], y[train_size:]
...
>>> simple_features = SimpleFeatureEngineer(
...     rolling_features=[
...         ("wave_height", "mean", 24),
...         ("wave_height", "mean", 12),
...     ],
...     time_features=[
...         ("hour_of_day", "cyclical"),
...     ],
...     keep_original=False,
... )
...
>>> model = MLPTimeseriesRegressor(
...     predict_ahead=(0,),
...     feature_engineer=simple_features,
...     verbose=0,
... )
...
>>> model.fit(X_train, y_train)  
<keras.callbacks.History ...
>>> score_decreases = model.quantile_feature_importances(
...     X_test[:100], y_test[:100], n_iter=3, random_state=42)
>>> # The score decreases of each feature in each iteration
>>> feature_importances = score_decreases.mean()
>>> # This will show a barplot of all the score importances, with error bars
>>> seaborn.barplot(data=score_decreases)  
summary(print_fn: ~typing.Callable = <built-in function print>) None

Combines several methods to create a ‘wrapper’ summary method.

Parameters:
print_fn: Callable (default=print)

A function for writing down the results

Benchmarking

sam.models.preprocess_data_for_benchmarking(data: DataFrame, column_filter: Callable, targetcol: str, test_size: float = 0.3, resample: bool = 'auto', resample_freq: str = 'auto', ffill_limit: int = 5)

Takes a dataframe in SAM format, and converts it to X_train/X_test/y_train/y_test, while taking some liberties when it comes to ‘faithfulness’ of the underlying data.

This function should NEVER be used in an actual project, but only for benchmarking models. For example, you want to compare two feature engineering methods to each other on a dataset in SAM format, but don’t feel like properly reshaping/imputing/resampling the dataset yourself.

Parameters:
data: pd.DataFrame

data in SAM format (TIME, TYPE, ID, VALUE columns)

column_filter: function

Function that takes a column name (string) and returns a boolean indicating whether this column should be included in the result. The columns ‘TIME’ and targetcol will be included in the result regardless of this function.

targetcol: str

The column to use as the target.

test_size: float, optional (default=0.3)

The portion of the data to use as test data. This is always the last portion, so by default the first 70% is train and the last 30% is test.

resample: bool, optional (default=’auto’)

Whether or not to resample the data. If resample == ‘auto’, resample if the data is not monospaced yet

resample_freq: str, optional (default=’auto’)

If resample, the frequency to resample to. If resample_freq == ‘auto’, use the median frequency of the data

ffill_limit: int, optional (default=5)

If resample, the maximum number of values to ffill. By default, only ffill 5 values. This means the data won’t be exactly monospaced, but will prevent extremely long flatlines if there are gaps in the data

Returns:
X_train: pd.DataFrame

Monospaced data in SAM format with no targetcol that can be used for training together with y_train

X_test: pd.DataFrame

Monospaced data in SAM format with no targetcol that can be used for testing together with y_test

y_train: pd.Series

Values from targetcol that can be used for training together with X_train

y_test: pd.Series

Values from targetcol that can be used for testing together with X_test
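A hedged usage sketch; the dataframe, column filter, and target column name below are illustrative placeholders:

>>> from sam.models import preprocess_data_for_benchmarking
>>> X_train, X_test, y_train, y_test = preprocess_data_for_benchmarking(
...     data,                                      # a dataframe in SAM format (TIME, TYPE, ID, VALUE)
...     column_filter=lambda col: 'flow' in col,   # keep only columns whose name contains 'flow' (hypothetical)
...     targetcol='mytarget',                      # hypothetical target column
...     test_size=0.3,
... )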

sam.models.benchmark_model(train_test_data: ~typing.Tuple[~pandas.core.frame.DataFrame, ~pandas.core.frame.DataFrame, ~pandas.core.series.Series, ~pandas.core.series.Series], scorer: ~typing.Callable = <function mean_squared_error>, validation_data=True, return_histories=False, fit_kwargs=None, **modeldict)

Benchmarks a dictionary of SAM models on train/test data, and returns a dictionary with scores. The models are assumed to be SAM models in two ways:

  • predict is called as predict(X_test, y_test)

  • model.get_actual(y_test) is called

Parameters:
train_test_data: tuple

tuple with elements X_train, X_test, y_train, y_test

scorer: function, optional (default=sklearn.metrics.mean_squared_error)

Function with signature func(y_true, y_pred) where y_true and y_pred are pandas series, and it returns a scalar

validation_data: bool

If true, X_test and y_test will be added as ‘validation_data’ in the kwargs (fit_kwargs in this case) when fitting the models.

return_histories: bool (default=False)

If true, returns also a dictionary of History objects

fit_kwargs: dict (default=None)

The kwargs to include in the method fit of each model.

modeldict: dict

Dictionary with the model names as keys (str) and the model objects as values

Returns:
dict:

Dictionary of form: {modelname_1: score, ..., modelname_n: score, persistence_benchmark: score, mean_benchmark: score} where all scores are scalars.

dict (optional):

If return_histories == True returns also a dictionary of tensorflow History objects
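A hedged usage sketch, continuing from the preprocess_data_for_benchmarking sketch above; the model names and configurations are illustrative:

>>> from sam.models import benchmark_model, ConstantTimeseriesRegressor, MLPTimeseriesRegressor
>>> scores = benchmark_model(
...     (X_train, X_test, y_train, y_test),
...     constant=ConstantTimeseriesRegressor(timecol='TIME'),
...     mlp=MLPTimeseriesRegressor(timecol='TIME', verbose=0),
... )
>>> # scores maps each model name, plus 'persistence_benchmark' and 'mean_benchmark', to a scalar score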

sam.models.plot_score_dicts(**score_dicts)

Very simple plotting function for showing the results

Parameters:
score_dicts: kwargs

Containing score dictionaries (output from benchmark_model). The key of the dictionary is the name/descriptor of the dataset.

Returns:
matplotlib.axes.Axes:

Plot of score dictionaries

Examples

>>> fig = plot_score_dicts(
...     chicago={'model_a': 0.5, 'persistence_benchmark': 0.6, 'mean_benchmark': 0.4},
...     china={'model_a': 0.1, 'mean_benchmark': 0.3, 'persistence_benchmark': 0.4}
... )
sam.models.benchmark_wrapper(models: dict, datasets: dict, column_filters: dict, targetcols: dict)

Wrapper around the entire benchmark pipeline. Takes a dictionary of models and a dictionary of datasets in SAM format, then preprocesses, fits and evaluates all models and benchmarks, and plots the results.

Parameters:
models: dict

Dictionary of SAM models

datasets: dict

Dictionary of datasets in SAM format

column_filters: dict

Dictionary of functions that accept a column in ID_TYPE format

targetcols: dict

Dictionary of strings, the target columns in ID_TYPE format

Examples

>>> from sam.datasets import load_rainbow_beach, load_sewage_data
>>> from sam.models import MLPTimeseriesRegressor, benchmark_wrapper
>>> from sam.preprocessing import wide_to_sam_format
>>>
>>> sewage = load_sewage_data()
>>> sewage = sewage.drop(['Precipitation', 'Temperature'], axis=1)
>>> sewage = sewage.iloc[0:200, :]
>>> sewage['TIME'] = sewage.index
>>> sewage = wide_to_sam_format(sewage)
>>> datasets = {
...     'sewage': sewage,
... }
>>> column_filters = {
...     'sewage': lambda x: x,
... }
>>> targetcols = {
...     'sewage': 'Discharge_Hoofdgemaal',
... }
>>> models = {
...     'mymodel': MLPTimeseriesRegressor(predict_ahead=[3], timecol='TIME',
...                               dropout=0.5, verbose=True)  # some non-default params
... }
>>> benchmark_wrapper(models, datasets, column_filters, targetcols)  
Epoch ...