Metrics

This is the documentation for Metrics functions.

Incident recall

Incident recall is a new metric that extends recall; to distinguish the two, we will refer to regular recall as ‘traditional recall’. It is most easily explained with an example:

Note

A prediction has to be made whether a value will go over a threshold in the near future. We call such an exceedance an incident. These incidents have to be predicted before they happen, so someone can act on them. Because acting takes some time, we want to know six to four hours in advance. A prediction is made every hour on the hour, so at t-6, t-5 and t-4 we expect a positive prediction. In our dataset we have 10 incidents. Say that, for all 10 incidents, we predict correctly at t-5 (true positive) and incorrectly at t-6 and t-4. In this case we have a ‘traditional recall’ of only 33%.

This 33% is, however, not representative of how many incidents the model actually saw coming in the timespan of t-6 to t-4: every incident was correctly predicted at some point in time. To fix this we define incident recall. We aggregate the positive predictions per incident, so the question becomes whether the incident was correctly predicted at any point during the lead-up. In our previous (contrived) example the incident recall would be 100%, as all incidents were predicted correctly at t-5.

Note that incident recall >= recall in every case: traditional recall is a lower bound. Also note that this reasoning does not translate to precision. For precision you care how many of your positive predictions are correct, regardless of whether the incident was also correctly predicted a moment earlier or later, so we can still use ‘traditional’ precision.
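
To make the aggregation step concrete, here is a minimal numpy sketch of the idea (illustrative only, not the actual sam implementation):

>>> import numpy as np
>>> y_pred = np.array([0, 1, 0, 0])        # a positive prediction at t=1
>>> y_incidents = np.array([0, 0, 0, 1])   # one incident at t=3
>>> start, end = 1, 2                      # accept predictions 1 to 2 steps in advance
>>> hits = [y_pred[max(0, i - end):i - start + 1].any()
...         for i in np.flatnonzero(y_incidents)]
>>> float(sum(hits) / len(hits))           # fraction of incidents predicted at least once
1.0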

sam.metrics.incident_recall(y_incidents: ndarray, y_pred: ndarray, range_pred: Tuple[int, int] = (0, 0))

Given y_pred, y_incidents and a prediction range, compute what percentage of incidents in y_incidents was positively predicted in y_pred, within window range_pred. Works for binary classification only (predicting 1 means incident, predicting 0 means no incident).

For use in a make_scorer, e.g. make_scorer(incident_recall, y_incidents=df['incidents'], range_pred=(1,5))

Parameters:
y_incidents: 1d array-like, or label indicator array / sparse matrix

Incidents that we want to predict with the classifier within window range_pred

y_pred: 1d array-like, or label indicator array / sparse matrix

Predicted labels, as returned by a classifier.

range_pred: tuple of ints, optional (default = (0,0))

length of two: (start prediction window, end prediction window). If we want to predict incidents 5 to 1 rows in advance, this will be (1,5). range_pred is inclusive, so (0, 1) means you can predict either 0 or 1 timesteps in advance.

Returns:
result: float

The percentage of incidents that was positively predicted

Examples

>>> from sam.metrics import incident_recall
>>> y_pred = [1,0,0,1,0,0,0]
>>> y_incidents = [0,1,0,0,0,0,1]
>>> range_pred = (0,2)
>>> incident_recall(y_incidents, y_pred, range_pred)
0.5
sam.metrics.make_incident_recall_scorer(range_pred: Tuple[int, int] = (0, 0), colname: str = 'incident')

Wrapper around incident_recall, to make it an actual sklearn scorer. This works by obtaining the ‘incident’ column from the data itself; this column name is configurable. This scorer does need the incident column to be present, which means that by default it will also be used when training the model. If this is not desired, a custom model will have to be made that deletes this variable before fitting/predicting.
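
A minimal sketch of such a custom wrapper (illustrative only; DropColumnEstimator is not part of sam):

>>> from sklearn.base import BaseEstimator
>>> class DropColumnEstimator(BaseEstimator):
...     """Hide the incident column from the underlying model during fit/predict."""
...     def __init__(self, estimator, colname="incident"):
...         self.estimator = estimator
...         self.colname = colname
...     def fit(self, X, y=None):
...         self.estimator.fit(X.drop(columns=[self.colname]), y)
...         return self
...     def predict(self, X):
...         return self.estimator.predict(X.drop(columns=[self.colname]))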

Parameters:
range_pred: tuple of int, optional (default = (0,0))

length of two: (start prediction window, end prediction window). Passed through to incident_recall.

colname: string, optional (default = “incident”)

The column name that is given to incident_recall as incidents we want to predict

Returns:
scorer: a function that acts as a sklearn scorer object.

signature is scorer(clf, X), where clf is a fit model, X is test data

Examples

>>> from sam.metrics import make_incident_recall_scorer
>>> from sklearn.base import BaseEstimator
>>> import pandas as pd
>>> import numpy as np
>>>
>>> op = type("MyClassifier", (BaseEstimator, object),
...          {"predict": lambda self, X: np.array([0, 1, 0, 0, 0, 0, 0, 0])})
>>> data = pd.DataFrame({"incident": [0, 0, 0, 1, 1, 0, 1, 0], "other": 1})
>>>
>>> scorer = make_incident_recall_scorer((1, 3), "incident")
>>> scorer(op(), data)
0.6666666666666666
sam.metrics.precision_incident_recall_curve(y_incidents: ndarray, y_pred: ndarray, range_pred: Tuple[int, int] = (0, 0))

Analogous to sklearn.metrics.precision_recall_curve <https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_curve.html>, but for incident recall and precision. Precision is the percentage of correct incidents in the prediction. Incident recall is described in the incident_recall function: for every incident, check whether there is at least a single positive prediction.

The calculation of the thresholds is done by calling sklearn.metrics.precision_recall_curve.

Given incidents and a prediction, as well as range, returns precision, recall, thresholds.
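
Conceptually, each point on the curve corresponds to binarizing the predictions at one threshold and scoring the result. A rough sketch of that idea (the exact thresholding and precision bookkeeping of the sam implementation may differ):

>>> import numpy as np
>>> from sam.metrics import incident_recall
>>>
>>> def incident_recall_at(y_incidents, y_prob, threshold, range_pred):
...     """Incident recall after binarizing the probabilities at a single threshold."""
...     y_bin = (np.asarray(y_prob) >= threshold).astype(int)
...     return incident_recall(y_incidents, y_bin, range_pred)
>>>
>>> incident_recall_at([0, 1, 0, 0, 0, 0, 1], [0.4, 0.4, 0.1, 0.2, 0.6, 0.5, 0.1], 0.5, (0, 2))
0.5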

Parameters:
y_incidents: 1d array-like, or label indicator array / sparse matrix

Incidents that we want to predict with the classifier within window range_pred

y_pred: 1d array-like, or label indicator array / sparse matrix

Predicted labels, as returned by a classifier.

range_pred: tuple of int, optional (default = (0,0))

length of two: (start prediction window, end prediction window). If we want to predict incidents 5 to 1 rows in advance, this will be (1,5).

Returns:
precision: 1d array-like

an array of precision scores

recall: 1d array-like

an array of incident recall scores

thresholds: 1d array-like

an array of thresholds to interpret the above 2

Examples

>>> from sam.metrics import precision_incident_recall_curve
>>> y_pred = [0.4, 0.4, 0.1, 0.2, 0.6, 0.5, 0.1]
>>> y_incidents = [0, 1, 0, 0, 0, 0, 1]
>>> range_pred = (0, 2)
>>> p, r, t = precision_incident_recall_curve(y_incidents, y_pred, range_pred)
>>> p
array([0.71428571, 0.8       , 1.        , 1.        , 1.        ,
       1.        ])
>>> r
array([1. , 1. , 1. , 0.5, 0.5, 0. ])
>>> t
array([0.1, 0.2, 0.4, 0.5, 0.6])

Mean absolute scaled error

sam.metrics.mean_absolute_scaled_error(y_true: array, y_pred: array, shift: int, sample_weight: array | None = None, multioutput: str = 'uniform_average')

Given the true and predicted values, calculates MASE. Lower is better. MASE is the mean absolute error, divided by the MAE of a naive benchmark. The naive benchmark is the ‘persistence benchmark’: predicting the same value shift points ahead. For example, when shift=1, the naive benchmark is the MAE obtained by predicting, for each timestep n, the value at timestep n-1. As shift increases, the naive benchmark will get worse, and the MASE will get better (lower).

Even though this function does not require timestamps as input, it requires that the inputs are sorted by TIME (ascending), and that the timestamps are uniform. Otherwise, the output of this function will have no meaning.

The first shift points of the array cannot be used for the persistence benchmark, but they are used for calculating the MAE of the prediction. If the truth vector is constant, the naive benchmark will be 0; in this case the function will throw an error, since MASE has no meaning.
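
As a formula, a minimal numpy sketch of this calculation (illustrative; the sam implementation may differ in details such as the constant-truth error and multioutput support):

>>> import numpy as np
>>>
>>> def naive_mase(y_true, y_pred, shift):
...     """MAE of the prediction divided by the MAE of the persistence benchmark."""
...     y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
...     mae_model = np.abs(y_true - y_pred).mean()
...     mae_naive = np.abs(y_true[shift:] - y_true[:-shift]).mean()
...     return mae_model / mae_naive
>>>
>>> float(naive_mase([1, 2, 3], [0.5, 1.5, 2.5], shift=1))
0.5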

Parameters:
y_true: array-like of shape = (n_samples) or (n_samples, n_outputs)

Ground truth (correct) target values, sorted by time (ascending), with uniform timestamps

y_pred: array-like of shape = (n_samples) or (n_samples, n_outputs)

Estimated target values, sorted by time (ascending), with uniform timestamps

shift: int

The shift used when calculating the naive benchmark. For example, when the timestamps have a frequency of 15 minutes and your model predicts 1 hour ahead, you should set shift to 4 to have an honest benchmark.

sample_weight: array-like of shape = (n_samples), optional

Sample weights.

multioutput: string in [‘raw_values’, ‘uniform_average’] or array-like of shape (n_outputs)

Defines aggregating of multiple output values.

‘raw_values’:

Returns a full set of errors in case of multioutput input.

‘uniform_average’:

Errors of all outputs are averaged with uniform weight.

Returns:
loss: float or ndarray of floats

MASE output is non-negative floating point. The best value is 0.0.

Examples

>>> from sam.metrics import mean_absolute_scaled_error
>>> y_true = [1, 2, 3]
>>> y_pred = [0.5, 1.5, 2.5]
>>> # persistence benchmark would have a loss of 1. Our prediction has a loss of 0.5
>>> mean_absolute_scaled_error(y_true, y_pred, shift=1)
0.5

Train r2

sam.metrics.train_r2(true: Series | array, predicted: Series | array, benchmark: float | array)

Calculation of r2 using a benchmark (often the mean/median/mode of the train set). The idea is that calculating r2 against an average of the test set ‘leaks’ information, which is especially problematic if the test set is small compared to the train set. It is therefore more suitable to compare the model predictions to the average estimated on the train set. It is also possible to pass an array as benchmark, for example a prediction made by your model with a certain input feature nulled (such as a rainfall forecast). By comparing this benchmark to the model prediction that does use the rainfall forecast, you can determine whether that feature is important to the model and indeed improves predictions.

Parameters:
true: np.array or pd.Series

Actual timeseries

predicted: np.array or pd.Series

Predicted timeseries

benchmark: float or pd.Series

Average (mean/median/mode) estimated from the train set or a timeseries to compare to

Returns:
r2: float

Custom R2 with training mean/median
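
A minimal numpy sketch of the calculation described above, assuming the usual r2 form with the benchmark in the denominator (illustrative; not necessarily identical to the sam implementation):

>>> import numpy as np
>>>
>>> def benchmark_r2(true, predicted, benchmark):
...     """r2 where the baseline is a benchmark instead of the test-set average."""
...     true, predicted = np.asarray(true, float), np.asarray(predicted, float)
...     ss_res = np.sum((true - predicted) ** 2)
...     ss_bench = np.sum((true - benchmark) ** 2)
...     return 1 - ss_res / ss_bench
>>>
>>> float(benchmark_r2([0.0, 2.0, 4.0], [0.5, 2.0, 3.5], benchmark=2.0))
0.9375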

Tilted loss

sam.metrics.tilted_loss(y_true: ndarray, y_pred: ndarray, quantile: float = 0.5, axis: int = -1)

Calculate tilted loss (also known as pinball or quantile loss) with numpy. Given a quantile \(q\) and an error \(e=y_\text{true}-y_\text{pred}\), the tilted loss \(L_q\) is defined as:

\[\begin{split}L_q(e) = \frac{1}{n_\text{samples}}\left(\sum_{i:e_i\leq0} (1 - q)\,|e_i| + \sum_{i:e_i > 0} q\,|e_i|\right)\\ \text{where } i \in \{0, \dots, n_\text{samples} - 1\} \text{ and } q \in [0, 1]\end{split}\]

For \(q=0.5\) this function is proportional to the mean absolute error (exactly half of it), whose minimizer approximates the median:

\[\begin{split}\begin{eqnarray} L_{0.5} &= \frac{1}{n_\text{samples}}\left(\sum_{i:e_i\leq0} 0.5\,|e_i| + \sum_{i:e_i > 0} 0.5\,|e_i|\right) \\ &= \frac{0.5}{n_\text{samples}} \sum_i |e_i| \\ &= 0.5\,\text{MAE}, \quad\text{i.e. MAE} = 2 L_{0.5} \end{eqnarray}\end{split}\]

For a given value of \(q\), the function that minimizes the tilted loss will be the \(q\)’th quantile function.
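
The formula above translates to a few lines of numpy (a sketch of the idea; the sam implementation additionally supports the axis argument):

>>> import numpy as np
>>>
>>> def pinball(y_true, y_pred, q=0.5):
...     """Mean tilted/pinball loss for a single quantile q."""
...     e = np.asarray(y_true, float) - np.asarray(y_pred, float)
...     return np.mean(np.where(e > 0, q * np.abs(e), (1 - q) * np.abs(e)))
>>>
>>> float(pinball([0.0, 0.0], [1.0, -3.0], q=0.25))
0.75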

Parameters:
y_true: array-like of shape = (n_samples, n_cols)

True labels.

y_pred: array-like of shape = (n_samples, n_cols)

Predictions. Must be same shape as y_true

quantile: float, optional (default=0.5)

The quantile to use when computing tilted loss. Must be between 0 and 1 for tilted loss to be positive.

axis: int, optional (default=-1)

Over which axis to take the mean. By default the last axis (-1) is taken. For a (n_samples, n_cols) target array set axis=0 to retrieve the tilted loss of each col.

Returns:
float:

The quantile loss

Examples

>>> import numpy as np
>>> from sam.metrics import tilted_loss
>>> actual = np.array([1, 2, 3, 4])
>>> pred = np.array([0.9, 2.1, 2.9, 3.1])
>>> tilted_loss(actual, pred, quantile=0.5)
0.15000000000000002

Keras metrics

sam.metrics.keras_rmse(y_true: Tensor, y_pred: Tensor)

Calculate root mean squared error in Keras.
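
In Keras backend terms this amounts to roughly the following (a sketch, not necessarily the exact sam implementation):

>>> import tensorflow.keras.backend as K
>>>
>>> def rmse(y_true, y_pred):
...     """Root mean squared error as a Keras-compatible loss."""
...     return K.sqrt(K.mean(K.square(y_pred - y_true)))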

Parameters:
y_true: Theano/TensorFlow tensor.

True labels.

y_pred: Theano/TensorFlow tensor.

Predictions. Must be same shape as y_true

Examples

>>> from sam.metrics import keras_rmse
>>> from sam.models import create_keras_quantile_mlp
>>> n_input = 1000
>>> n_neurons = 64
>>> n_layers = 3
>>> quantiles = [0.1, 0.5, 0.9]
>>> model = create_keras_quantile_mlp(n_input, n_neurons, n_layers, quantiles=quantiles)
>>> model.compile(loss=keras_rmse)
sam.metrics.keras_tilted_loss(y_true: Tensor, y_pred: Tensor, quantile: float = 0.5)

Calculate tilted (or quantile) loss in Keras. Given a quantile q and an error e, the tilted loss is defined as (1-q) * |e| if e < 0, and q * |e| if e > 0.

This function is proportional to the mean absolute error if q=0.5, which approximates the median. For a given value of q, the function that minimizes the tilted loss will be the q’th quantile function.
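
The piecewise definition above can be written compactly with a maximum (a minimal backend sketch, not necessarily the sam implementation):

>>> import tensorflow.keras.backend as K
>>>
>>> def tilted(y_true, y_pred, q=0.5):
...     """q * e when the error e is positive, (1 - q) * |e| when it is negative."""
...     e = y_true - y_pred
...     return K.mean(K.maximum(q * e, (q - 1) * e))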

Parameters:
y_true: Theano/TensorFlow tensor.

True labels.

y_pred: Theano/TensorFlow tensor.

Predictions. Must be same shape as y_true

quantile: float, optional (default=0.5)

The quantile to use when computing tilted loss. Must be between 0 and 1 for tilted loss to be positive.

Examples

>>> from sam.metrics import keras_tilted_loss
>>> from sam.models import create_keras_quantile_mlp
>>> n_input = 1000
>>> n_neurons = 64
>>> n_layers = 3
>>> quantiles = [0.1, 0.5, 0.9]
>>> model = create_keras_quantile_mlp(n_input, n_neurons, n_layers, quantiles=quantiles)
>>> quantile = 0.5  # Quantile, in this case the median
>>> model.compile(loss=lambda y,f: keras_tilted_loss(y, f, quantile))
sam.metrics.keras_joint_mse_tilted_loss(y_true: Tensor, y_pred: Tensor, quantiles: List[float] | None = None, n_targets: int = 1)

Joint mean and quantile regression loss function using mse: the sum of a mean squared error term and multiple tilted loss terms. Custom loss function, inspired by https://github.com/fmpr/DeepJMQR. Only compatible with the tensorflow backend.

This calculates loss for multiple quantiles, and multiple targets. The total loss is the sum of the mse of all targets, and the tilted loss of all quantiles.

y_true is expected to be a tensor with shape (None, n_targets). y_pred is expected to be a tensor with shape (None, n_targets * (n_quantiles + 1)). For example, if there are 2 outputs and 2 quantiles, the order of y_pred should be: [output_1_quantile_1, output_2_quantile_1, output_1_quantile_2, output_2_quantile_2, output_1_mean, output_2_mean]
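
To make the column ordering concrete, the following numpy sketch splits such a prediction array back into its quantile and mean blocks (illustrative only; the numbers are made up):

>>> import numpy as np
>>>
>>> n_targets, quantiles = 2, [0.1, 0.9]
>>> y_pred = np.array([[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]])  # 1 sample, 2 targets, 2 quantiles
>>> for i, q in enumerate(quantiles):
...     print(q, y_pred[:, i * n_targets:(i + 1) * n_targets])  # both targets for quantile q
0.1 [[1. 2.]]
0.9 [[3. 4.]]
>>> y_pred[:, -n_targets:]  # the mean predictions for both targets come last
array([[5., 6.]])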

Parameters:
y_true: tensorflow tensor

True values

y_pred: tensorflow tensor

Predicted values

quantiles: list of floats (default=None)

Quantiles to predict. Values between 0 and 1. By default, no quantile loss is used.

n_targets: integer, optional (default=1)

The number of distinct outputs to predict

Examples

>>> from sam.metrics import keras_joint_mse_tilted_loss as mse_tilted
>>> from sam.models import create_keras_quantile_mlp
>>> n_input = 1000
>>> n_neurons = 64
>>> n_layers = 3
>>> quantiles = [0.1, 0.5, 0.9]
>>> model = create_keras_quantile_mlp(n_input, n_neurons, n_layers, quantiles=quantiles)
>>> qs = [0.1, 0.9]
>>> model.compile(loss=lambda y,f: mse_tilted(y, f, qs))
sam.metrics.get_keras_forecasting_metrics(quantiles: List[float] | None = None)

Get list of standard forecasting metrics. These metrics could be used specifically for regression problems, such as forecasting. The default list is mse, mae, rmse. Quantile loss can also be added, by giving a list of quantiles.

Parameters:
quantiles: list of floats (default=None)

List of quantiles to make losses of. Values between 0 and 1. By default, no quantile loss is used. Note that mae is already included and is equivalent (up to a constant factor) to quantile loss with quantile 0.5.

Examples

>>> from sam.metrics import keras_rmse, get_keras_forecasting_metrics
>>> from sam.models import create_keras_quantile_mlp
>>> n_input = 1000
>>> n_neurons = 64
>>> n_layers = 3
>>> quantiles = [0.1, 0.5, 0.9]
>>> model = create_keras_quantile_mlp(n_input, n_neurons, n_layers, quantiles=quantiles)
>>> model.compile(loss=keras_rmse, metrics=get_keras_forecasting_metrics())

Compute quantile ratios

sam.metrics.compute_quantile_ratios(y: Series, pred: DataFrame, predict_ahead: int = 0, precision: int = 3)

Computes the total proportion of data points in y beneath each quantile in pred. For example, with quantile 0.75 you would expect 75% of the values in y to lie beneath this quantile. This function then returns a dictionary with 0.75 as key, and as value the actual proportion of values that fell beneath that quantile in y (e.g. 0.743). This allows the user to judge whether the fitted quantile model accurately reflects the distribution of the data.

Parameters:

y: pd.Series

series of observed true values

pred: pd.DataFrame

output of MLPTimeseriesRegressor.predict() function

predict_ahead: int (default=0)

Number of timestep ahead to evaluate

precision: int (default=3)

decimal point precision of result

Returns:
quantile_ratios: dict

key is quantile, value is the observed proportion of data points below that quantile
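
Since no example is given above, here is a rough numpy sketch of the idea behind the computation, using a plain array of predicted 0.75-quantiles rather than the MLPTimeseriesRegressor output format (illustrative only; whether the comparison is strict may differ in the actual implementation):

>>> import numpy as np
>>>
>>> y = np.array([1.0, 2.0, 3.0, 4.0])
>>> q75_pred = np.array([2.5, 2.5, 2.5, 4.5])   # hypothetical predicted 0.75-quantiles
>>> ratio = float((y < q75_pred).mean())        # observed proportion of y below the quantile
>>> round(ratio, 3)
0.75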