Data Validation

This is the documentation for the validation functions.

Base Validation class

class sam.validation.BaseValidator

Bases: BaseEstimator, TransformerMixin, ABC

Abstract base class for validators

Methods

fit(X[, y])

fit method

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Function for obtaining feature names. Generally used instead of the attribute, and more compatible with the sklearn API. Returns a list of feature names.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

transform method

validate(X)

Validate the data.

fit(X, y=None)

fit method

get_feature_names_out(input_features=None) List[str]

Function for obtaining feature names. Generally used instead of the attribute, and more compatible with the sklearn API.

Returns:
list: list of feature names

transform(X: DataFrame) DataFrame

transform method

abstract validate(X: DataFrame) DataFrame

Validate the data.

This method should return a boolean array of the same shape as X, where True indicates a value that is invalid.
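The contract above can be illustrated without the library itself. The following is a minimal sketch, assuming that the boolean mask is applied by replacing flagged entries with NaN. The helper names validate_negative and apply_mask are hypothetical and are not part of sam.

```python
import pandas as pd


def validate_negative(X: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical rule: flag negative values as invalid (True = invalid)."""
    return X < 0


def apply_mask(X: pd.DataFrame, invalid: pd.DataFrame) -> pd.DataFrame:
    """Replace every flagged entry with NaN (assumed behaviour for this sketch)."""
    return X.mask(invalid)


X = pd.DataFrame({"a": [1.0, -2.0, 3.0], "b": [-1.0, 5.0, 6.0]})
invalid = validate_negative(X)    # same shape as X, True marks invalid values
cleaned = apply_mask(X, invalid)  # invalid values become NaN
```

The mask has the same shape, index and columns as the input, so it can be combined with masks from other rules using the element-wise | operator before it is applied.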

Detect Outside Range

class sam.validation.OutsideRangeValidator(cols: list | None = None, min_value: float | dict | str | None = None, max_value: float | dict | str | None = None)

Bases: BaseValidator

Validator that removes data outside the provided range.

Parameters:
cols: list (optional)

Columns of input data to be checked for being outside range. If None, all columns will be validated.

min_value: float, dict or “auto” (optional)

Minimum value to check against. If None, no minimum will be checked. If “auto”, the minimum value of the data will be used.

max_value: float, dict or “auto” (optional)

Maximum value to check against. If None, no maximum will be checked. If “auto”, the maximum value of the data will be used.
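As an illustration of the range check these parameters describe, here is a minimal pandas sketch. It is not the library's implementation: the function name outside_range is hypothetical, the bounds are assumed to be inclusive, and a dict is assumed to map column names to per-column bounds. The “auto” option is not covered.

```python
import pandas as pd


def outside_range(X, min_value=None, max_value=None):
    """Return a boolean frame where True marks a value outside the bounds.

    Bounds may be None (no check), a scalar, or a dict of column -> bound.
    The dict semantics are an assumption made for this sketch.
    """
    invalid = pd.DataFrame(False, index=X.index, columns=X.columns)
    for col in X.columns:
        # Resolve the bound for this column; a missing dict key means no check
        lo = min_value.get(col) if isinstance(min_value, dict) else min_value
        hi = max_value.get(col) if isinstance(max_value, dict) else max_value
        if lo is not None:
            invalid[col] |= X[col] < lo
        if hi is not None:
            invalid[col] |= X[col] > hi
    return invalid


X = pd.DataFrame({"flow": [0.5, 12.0, 3.0], "level": [-1.0, 2.0, 2.5]})
# Scalar minimum for all columns, maximum only for the "flow" column
mask = outside_range(X, min_value=0, max_value={"flow": 10})
```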

Methods

fit(X[, y])

Fit the transformer to the data.

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Function for obtaining feature names. Generally used instead of the attribute, and more compatible with the sklearn API. Returns a list of feature names.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

transform method

validate(X)

Transform the data.

fit(X, y=None)

Fit the transformer to the data.

Parameters:
X: pd.DataFrame

Dataframe containing the features to be checked.

y: pd.Series or pd.DataFrame (optional)

Series or dataframe containing the target (ignored)

validate(X)

Transform the data.

Parameters:
X: pd.DataFrame

Dataframe containing the features to be checked.

Detect Extreme Values

class sam.validation.MADValidator(rollingwindow: int | str, cols: list | None = None, madthresh=15)

Bases: BaseValidator

This transformer finds extreme values and sets them to nan in a few steps:

  • Estimate upper and lower bounds from the data in the fit method by computing median deviation above and below a running median

  • Mark differences outside these bounds as nan in the transform method

This class can be passed to the plot function (see Extreme value removal plot) to create a visualization of the removal procedure. It is advisable to look at this diagnostic plot to check whether your rollingwindow parameter is large enough to capture slow variations, without removing local peaks that might be ‘outliers’.

In addition, the default madthresh of 15 is relatively conservative. Less strict thresholds can be tried.

Note that you should only pass cols that are suited for extreme value detection. For instance, a pump can sometimes be out of operation and therefore be set to 0. Such a signal is not suited for extreme value detection.

Note that nans still have to be filled in with a later procedure.

Note that you should fit this method to the train set!

Parameters

rollingwindow: int or string

If a number, this many values will be used for the rolling window. If a string, it should be in pandas timedelta format (‘1D’), and the data should have a datetime index. A sensible value depends on your time resolution, but you could try values between 200 and 400.

cols: list of strings (optional)

columns to detect extreme values for. If None, all columns will be used.

madthresh: float

number of median absolute deviations to use as threshold.
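To make the roles of rollingwindow and madthresh concrete, the following is a minimal sketch of the idea described above: compute a running median, take the median deviation above and below it, and scale both by madthresh. It is one interpretation of that description, not the library's code. The function name mad_bounds is hypothetical, and for brevity the bounds are estimated and applied on the same series, whereas in practice they should be fitted on the train set.

```python
import pandas as pd


def mad_bounds(series, rollingwindow, madthresh=15):
    """Sketch: bounds from median deviations around a running median."""
    running_median = series.rolling(rollingwindow, min_periods=1).median()
    diff = series - running_median
    # Median deviation above and below the running median, scaled by madthresh
    thresh_high = madthresh * diff[diff > 0].median()
    thresh_low = -madthresh * diff[diff < 0].abs().median()
    return running_median, thresh_low, thresh_high


# Alternating signal with one spike at position 25
values = pd.Series([1.0, 2.0] * 20)
values.iloc[25] = 50.0

running_median, thresh_low, thresh_high = mad_bounds(values, rollingwindow=4)
diff = values - running_median
is_extreme = (diff > thresh_high) | (diff < thresh_low)
cleaned = values.mask(is_extreme)  # extreme values become NaN
```

Because the bounds are built from medians, the single spike barely influences them, which is why it stands out clearly against the regular variation.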

Examples

>>> from sam.validation import MADValidator
>>> from sam.visualization import diagnostic_extreme_removal
>>> import numpy as np
>>> import pandas as pd
>>>
>>> # create some random data
>>> np.random.seed(10)
>>> data = np.random.random(size=(1000))
>>>
>>> # split in train and test
>>> train_df = pd.DataFrame()
>>> train_df['values'] = data[:800]
>>> test_df = pd.DataFrame()
>>> test_df['values'] = data[800:]
>>>
>>> # with one clear outlier
>>> train_df.loc[25] *= 10
>>>
>>> # now detect extremes
>>> cols_to_check = ['values']
>>> REV = MADValidator(
...     rollingwindow=10,
...     cols=cols_to_check,
...     madthresh=10)
>>> train_corrected = REV.fit_transform(train_df)
>>> fig = diagnostic_extreme_removal(REV, train_df, 'values')
>>> test_corrected = REV.transform(test_df)
>>> fig = diagnostic_extreme_removal(REV, test_df, 'values')

Methods

fit(data)

Estimate upper and lower bounds per column by computing the median deviation above and below a running median.

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Function for obtaining feature names. Generally used instead of the attribute, and more compatible with the sklearn API. Returns a list of feature names.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

transform method

validate(X)

Sets values that fall outside bounds set in the fit method to nan

fit(data: DataFrame)

Estimate upper and lower bounds per column by computing the median deviation above and below a running median. This method creates the attributes self.thresh_high and self.thresh_low, which contain the respective bounds.

Parameters:
data: pd.DataFrame

with time indices and feature columns

validate(X: DataFrame)

Sets values that fall outside bounds set in the fit method to nan

Parameters:
X: pd.DataFrame

with time indices and feature columns

Returns:
data_r: pd.DataFrame

input data with values outside the bounds marked as nan

Trainset image:

_images/extreme_values_example_trainset.png

Testset image:

_images/extreme_values_example_testset.png

Detect Flatlines

class sam.validation.FlatlineValidator(cols: list | None = None, window: int | str = 1, pvalue: float | None = None, margin: float = 0, backfill: bool = True)

Bases: BaseValidator

Detect flatlines and set them to nan. Note that you should check whether signals can contain natural flatlines (such as machines turned off) that might not need to be removed.

Parameters:
cols: list of strings (defaults to None)

columns to apply this method to. If None, will apply to every column.

window: “auto” or int (default = 1)

Number of previous equal values needed to consider the current value a flatliner. If set to 2, the 2 previous values must be identical to the current value for it to be set to nan. If set to “auto”, the threshold is derived in the fit method: based on a train set, the probability of a difference being 0 is estimated. This probability is used to estimate the number of consecutive flatline samples before the likelihood drops below the pvalue parameter. The maximum acceptable flatline window is derived for each column separately, with the same pvalue.

pvalue: float or None (default=None)

Threshold for the likelihood of multiple consecutive flatline samples. Only used if window="auto". Small pvalues lead to a larger threshold, hence fewer flatlines will be removed.

margin: float (default = 0)

Maximum absolute difference within window to consider them equal. Default is 0, which means that all samples within used window must be exactly equal to form a flatline.

backfill: bool (default = True)

Whether to label all values within the window, even those before the first detected data point. This is useful if you want to remove flatlines from the beginning of a signal. However, that is not always representative of a real-time application, so one might want to set this to False.
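The interplay of window, margin and backfill can be sketched in plain Python. This is an illustration of the parameter descriptions above under stated assumptions, not the library's implementation: the function name flag_flatlines is hypothetical, and the exact boundary handling in sam may differ.

```python
def flag_flatlines(values, window=1, margin=0.0, backfill=True):
    """Return a list of booleans where True marks a flatline sample.

    A sample is flagged when it and its `window` predecessors differ by at
    most `margin`. With backfill=True the predecessors are flagged as well.
    """
    flagged = [False] * len(values)
    for i in range(window, len(values)):
        segment = values[i - window:i + 1]
        if max(segment) - min(segment) <= margin:
            # With backfill, label the whole segment; otherwise only sample i
            start = i - window if backfill else i
            for j in range(start, i + 1):
                flagged[j] = True
    return flagged


data = [1, 2, 6, 3, 4, 4, 4, 3, 6, 7, 7, 2, 2]
with_backfill = flag_flatlines(data, window=2, backfill=True)
without_backfill = flag_flatlines(data, window=2, backfill=False)
```

In this sketch only the run of three 4s is long enough for window=2. With backfill all three samples are flagged, without it only the last one, which matches what a real-time application could have known at that moment.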

Examples

>>> import pandas as pd
>>> from sam.validation import FlatlineValidator
>>> # create some data
>>> data = [1, 2, 6, 3, 4, 4, 4, 3, 6, 7, 7, 2, 2]
>>> # put the data in a dataframe
>>> test_df = pd.DataFrame()
>>> test_df['values'] = data
>>> # now detect flatlines
>>> cols_to_check = ['values']
>>> RF = FlatlineValidator(
...     cols=cols_to_check,
...     window=3)
>>> data_corrected = RF.fit_transform(test_df)

Methods

fit(data)

If window size is 'auto', derive thresholds for each column. The threshold is based on the probability that a sensor value does not change.

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Function for obtaining feature names. Generally used instead of the attribute, and more compatible with the sklearn API. Returns a list of feature names.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

transform method

validate(X)

Validates the dataframe against the fitted dataframe.

fit(data: DataFrame)

If window size is ‘auto’, derive thresholds for each column. The threshold is based on the probability that a sensor value does not change. The likelihood of a flatliner of m time steps is this probability to the power m. The threshold is chosen such that flatliners with a likelihood below the pvalue are removed.

Parameters:
data: pd.DataFrame

The dataframe to apply the method to
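The derivation described for window="auto" can be worked through in a few lines. The sketch below assumes that the threshold is the smallest number of steps m for which the probability to the power m falls below the pvalue; the function names are hypothetical, and the library may define the window with a different offset.

```python
def zero_difference_probability(values):
    """Fraction of consecutive samples whose difference is exactly 0."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return sum(d == 0 for d in diffs) / len(diffs)


def flatline_window(p_equal, pvalue):
    """Smallest m for which p_equal ** m drops below pvalue."""
    if not 0 < pvalue < 1:
        raise ValueError("pvalue must lie strictly between 0 and 1")
    if p_equal >= 1:
        raise ValueError("signal never changes; no finite window exists")
    m = 1
    while p_equal ** m >= pvalue:
        m += 1
    return m


train = [1, 2, 6, 3, 4, 4, 4, 3, 6, 7, 7, 2, 2]
p_equal = zero_difference_probability(train)    # 4 of 12 differences are 0
window = flatline_window(p_equal, pvalue=0.01)  # (1/3)**5 is the first below 0.01
```

Lowering the pvalue makes the loop run longer and so yields a larger window, which is consistent with the note that small pvalues remove fewer flatlines.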

validate(X: DataFrame) DataFrame

Validates the dataframe against the fitted dataframe. Returns a boolean dataframe where True indicates an invalid value.

Parameters:
X: pd.DataFrame

Input dataframe to validate

Trainset image:

_images/flatline_removal_example.png

Create validation Pipeline

sam.validation.create_validation_pipe(cols: list, rollingwindow: int | str, remove_extreme_values: bool = True, remove_flatlines: bool = True, impute_values: bool = True, madthresh: int = 15, flatwindow: int = 2, max_iter: int = 10, n_nearest_features: int = 10, impute_method: str = 'iterative')

Sets up a pipeline to do data validation. Can incorporate:

  • remove extreme values

  • remove flatlines

  • impute values

Parameters

cols: list of strings

which columns in the dataframe to apply data validation to

rollingwindow: int or str

parameter used in MADValidator, see MADValidator

remove_extreme_values: bool (default=True)

if set to True, will find extreme values and set to nan

remove_flatlines: bool (default=True)

if set to True, will find flatline signals

impute_values: bool (default=True)

if set to True, will impute found nan signals

madthresh: int (default=15)

parameter used in MADValidator, see MADValidator

flatwindow: int (default=2)

parameter used in FlatlineValidator, see FlatlineValidator

max_iter: int (default=10)

how many iterations to try for iterative_imputer, see https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html

n_nearest_features: int (default=10)

how many close features to use for iterative_imputer, see https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html

impute_method: string (default=’iterative’)

if set to ‘iterative’, will impute values using IterativeImputer. This is much slower, but also much more accurate. Can also be set to any of the SimpleImputer strategies: ‘mean’, ‘median’, ‘most_frequent’, ‘constant’

Returns:
pipe: sklearn.pipeline.Pipeline instance

The input data should be a wide-format dataframe, where rows are time and columns are features. The rows of the data should be linearly increasing in time, and can contain gaps in time. However, when a string is used to specify the rollingwindow parameter, the input data should have a datetime index.
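The expected input layout can be prepared and checked with plain pandas. The following sketch uses hypothetical sensor names and values; it reshapes long-format readings into the wide format described above and verifies the two properties that matter when rollingwindow is given as a string.

```python
import pandas as pd

# Hypothetical long-format sensor readings: one row per (time, sensor)
times = pd.date_range("2024-01-01", periods=3, freq=pd.Timedelta(hours=1))
long_df = pd.DataFrame({
    "time": list(times) * 2,
    "sensor": ["flow"] * 3 + ["level"] * 3,
    "value": [1.0, 1.2, 1.1, 7.0, 7.1, 6.9],
})

# Wide format: rows are time, columns are features
wide_df = long_df.pivot(index="time", columns="sensor", values="value").sort_index()

# Checks worth running before passing the frame to a validation pipeline
has_datetime_index = isinstance(wide_df.index, pd.DatetimeIndex)
is_time_ordered = wide_df.index.is_monotonic_increasing
```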

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from sam.validation import create_validation_pipe
>>> from sam.visualization import diagnostic_extreme_removal, diagnostic_flatline_removal
>>>
>>> # create some data
>>> np.random.seed(10)
>>> base = np.random.randn(100)
>>> X_train = pd.DataFrame(np.tile(base, (3, 3)).T, columns=['1', '2', '3'])
>>> X_test = pd.DataFrame(np.tile(base, (3, 1)).T, columns=['1', '2', '3'])
>>> y_test = pd.Series(base, name='target')
>>> y_train = pd.Series(np.tile(base, 3).T, name='target')
>>>
>>> # add outliers to y_test:
>>> y_test.iloc[[5, 10, 61]] *= 30
>>> # add flatlines to y_train and y_test:
>>> y_test.iloc[20:40] = 1
>>> y_train.iloc[20:50] = 1
>>>
>>> # setup pipeline
>>> pipe = create_validation_pipe(cols=list(X_train.columns) + ['target'], rollingwindow=5,
...                              impute_method='iterative')
>>>
>>> # put data together
>>> train_data = X_train.join(y_train)
>>> test_data = X_test.join(y_test)
>>>
>>> # now fit the pipeline on the train data and transform both train and test
>>> train_data = pd.DataFrame(pipe.fit_transform(train_data), columns=train_data.columns,
...                           index=train_data.index)  
[IterativeImputer] ...
>>> test_data = pd.DataFrame(pipe.transform(test_data), columns=test_data.columns,
...                          index=test_data.index)  
[IterativeImputer] ...
>>>
>>> # the fitted pipeline can now be passed to diagnostic plot functions:
>>> # create validation visualizations
>>> f_ext = diagnostic_extreme_removal(
...     pipe['extreme'], test_data, 'target')
>>> f_ext = diagnostic_flatline_removal(
...     pipe['flat'], test_data, 'target')