Preprocessing

This is the documentation for preprocessing functions.

Clipping data

class sam.preprocessing.ClipTransformer(cols: list | None = None, min_value: float | None = None, max_value: float | None = None)

Bases: BaseEstimator, TransformerMixin

Transformer that clips values to a given range.

Parameters:
cols: list (optional)

Columns of input data to be clipped. If None, all columns will be clipped.

min_value: float (optional)

Minimum value to clip to. If None, min will be set to the minimum value of the data.

max_value: float (optional)

Maximum value to clip to. If None, max will be set to the maximum value of the data.

Methods

fit(X[, y])

Fit the transformer to the data.

fit_transform(X[, y])

Fit to data, then transform it.

get_feature_names_out([input_features])

Get the names of the output features.

get_params([deep])

Get parameters for this estimator.

set_output(*[, transform])

Set output container.

set_params(**params)

Set the parameters of this estimator.

transform(X)

Transform the data.

fit(X: DataFrame, y: DataFrame | Series | None = None)

Fit the transformer to the data.

Parameters:
X: pd.DataFrame

Dataframe containing the features to be clipped.

y: pd.Series or pd.DataFrame (optional)

Series or dataframe containing the target (ignored)

get_feature_names_out(input_features=None) List[str]

Get the names of the output features.

transform(X: DataFrame) DataFrame

Transform the data.

Parameters:
X: pd.DataFrame

Dataframe containing the features to be clipped.
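Examples

An illustrative example (not part of the original documentation), assuming the clipping behavior described above:

>>> import pandas as pd
>>> from sam.preprocessing import ClipTransformer
>>> X = pd.DataFrame({'a': [1.0, 5.0, 10.0]})
>>> ClipTransformer(max_value=6.0).fit_transform(X)
     a
0  1.0
1  5.0
2  6.0

Since min_value is None here, the minimum is taken from the data during fit, so only the upper bound changes any values.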

Normalize timestamps

Warning

If each timestamp in your data appears only once (for example, if your data is in wide format), you should almost certainly not use this function. Consider using pd.resample instead: it does the same as this function, but is more stable, more performant, and has more options. The only time you should consider using this function is when the same timestamp can occur multiple times (for example, if your data is in long format).

sam.preprocessing.normalize_timestamps(df: DataFrame, freq: str, start_time: datetime | str = '', end_time: datetime | str = '', round_method: str = 'ceil', aggregate_method: str | Callable | dict | List[Callable] = 'last', fillna_method: str | None = None)

Create a dataframe with all timestamps according to a given frequency. Fills in values for these timestamps from a given dataframe in SAM format.

WARNING: This function makes assumptions about the data, and may not be safe for all datasets. For instance, a dataset with timestamps distributed like a Poisson process will change significantly when normalized, which may mean throwing away data. Furthermore, grouping measurements into the same timestamp may destroy cause-and-effect: if a cause is measured at 15:58, and an effect is measured at 15:59, grouping them both at 16:00 makes it impossible to learn which came first. Use this function with caution, and mainly when the data already has normalized or close-to-normalized timestamps.

The process consists of four steps:

Firstly, ‘normalized’ date ranges are created according to the required frequency. The start/end times of these date ranges can be given by start_time/end_time. If not given, the global minimum/maximum across all TYPE/ID is used. For example, if ID=’foo’ runs from 2017 to 2019, and ID=’bar’ runs from 2018 to 2019, then ID=’bar’ will have missing values in the entirety of 2017.

Secondly, all timestamps are rounded to the required frequency. For example, if the frequency is 1 hour, we may want the timestamp 19:45:12 to be rounded to 20:00:00. The method of rounding is ceiling by default, and is given by round_method.

Thirdly, any timestamps with multiple measurements are aggregated. This is the last non-null value by default, and is given by aggregate_method. Other options are ‘mean’, ‘median’, ‘first’, and other pandas aggregation functions.

Fourthly, any timestamps with missing values are filled. The method is given by fillna_method; by default, no filling is done. The other options are backward filling (‘bfill’) and forward filling (‘ffill’).
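Conceptually, for a single ID/TYPE combination, these four steps correspond to the following pandas operations (a minimal sketch for intuition, not the actual implementation):

>>> import pandas as pd
>>> def normalize_sketch(df, freq="15min"):
...     # step 1: build the normalized date range
...     idx = pd.date_range(df["TIME"].min().ceil(freq),
...                         df["TIME"].max().ceil(freq), freq=freq)
...     out = (df.assign(TIME=df["TIME"].dt.ceil(freq))   # step 2: round timestamps
...              .groupby("TIME")["VALUE"].last()         # step 3: aggregate duplicates
...              .reindex(idx))                           # insert the missing timestamps
...     return out.ffill()                                # step 4: optional filling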

Parameters:
df: pandas dataframe with TIME, TYPE, ID and VALUE columns, shape = (nrows, 4)

Dataframe from which the values are created

freq: str or DateOffset

The required frequency for the time features. Frequencies can have multiples, e.g. “15 min” for 15 minutes. See the pandas documentation on frequency aliases for all options.

start_time: str or datetime-like, optional (default = ‘’)

The start time of the period to create features over. If a string, the format %Y-%m-%d %H:%M:%S will always work; pandas also accepts other formats, or a datetime object.

end_time: str or datetime-like, optional (default = ‘’)

The end time of the period to create features over. If a string, the format %Y-%m-%d %H:%M:%S will always work; pandas also accepts other formats, or a datetime object.

round_method: string, optional (default = ‘ceil’)

How to group the times in bins. By default, rows are grouped by ceiling them to the frequency (e.g.: if the frequency is hourly, the timestamp 18:01 will be grouped together with 18:59, and the TIME will be set to 19:00). The options are:

  • ‘floor’: Group times by flooring to the nearest frequency

  • ‘ceil’: Group times by ceiling to the nearest frequency

  • ‘round’: Group times by rounding to the nearest frequency

Ceiling is the safest option to prevent leakage: it guarantees that a value in the output will have a TIME that is not before the time at which it actually occurred.
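For intuition, the rounding options behave like pandas’ own Timestamp.floor/ceil (shown here for illustration; this is not part of normalize_timestamps itself):

>>> import pandas as pd
>>> ts = pd.Timestamp('2018-06-09 19:45:12')
>>> ts.ceil('h')
Timestamp('2018-06-09 20:00:00')
>>> ts.floor('h')
Timestamp('2018-06-09 19:00:00')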

aggregate_method: function, string, dictionary, or list of strings/functions (default = ‘last’)

Method that is used to aggregate values when multiple values fall within the same frequency bin. For example, when you have data per 5 minutes, but you’re creating an hourly frequency, the values need to be aggregated. Can be a string such as ‘mean’, ‘sum’, ‘min’, ‘max’, or a function. See the pandas documentation on aggregation for more options.

fillna_method: string, optional (default = None)

Method used to fill NA values; must be an option from pd.DataFrame.fillna. Options are: ‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, or None.

Returns:
complete_df: pandas dataframe,

shape = (length(TIME) * length(unique IDs) * length(unique TYPEs), 4). Dataframe containing all possible combinations of timestamps, IDs and TYPEs, with the selected frequency, aggregate method and fillna method applied.

Examples

>>> from sam.preprocessing import normalize_timestamps
>>> from datetime import datetime
>>> import pandas as pd
>>> df = pd.DataFrame({'TIME': [datetime(2018, 6, 9, 11, 13), datetime(2018, 6, 9, 11, 34),
...                             datetime(2018, 6, 9, 11, 44), datetime(2018, 6, 9, 11, 46)],
...                    'ID': "SENSOR",
...                    'TYPE': "DEPTH",
...                    'VALUE': [1, 20, 3, 20]})
>>>
>>> normalize_timestamps(df, freq = "15 min", end_time="2018-06-09 12:15:00")
                 TIME      ID   TYPE  VALUE
0 2018-06-09 11:15:00  SENSOR  DEPTH    1.0
1 2018-06-09 11:30:00  SENSOR  DEPTH    NaN
2 2018-06-09 11:45:00  SENSOR  DEPTH    3.0
3 2018-06-09 12:00:00  SENSOR  DEPTH   20.0
4 2018-06-09 12:15:00  SENSOR  DEPTH    NaN
>>> from sam.preprocessing import normalize_timestamps
>>> from datetime import datetime
>>> import pandas as pd
>>> df = pd.DataFrame({'TIME': [datetime(2018, 6, 9, 11, 13), datetime(2018, 6, 9, 11, 34),
...                             datetime(2018, 6, 9, 11, 44), datetime(2018, 6, 9, 11, 46)],
...                    'ID': "SENSOR",
...                    'TYPE': "DEPTH",
...                    'VALUE': [1, 20, 3, 20]})
>>>
>>> normalize_timestamps(df, freq = "15 min", end_time="2018-06-09 12:15:00",
...                     aggregate_method = "mean", fillna_method="ffill")
                 TIME      ID   TYPE  VALUE
0 2018-06-09 11:15:00  SENSOR  DEPTH    1.0
1 2018-06-09 11:30:00  SENSOR  DEPTH    1.0
2 2018-06-09 11:45:00  SENSOR  DEPTH   11.5
3 2018-06-09 12:00:00  SENSOR  DEPTH   20.0
4 2018-06-09 12:15:00  SENSOR  DEPTH   20.0

Correct extremes

sam.preprocessing.correct_outside_range(series, threshold=(0, 1), method='na', value=None)

This documentation covers correct_above_threshold, correct_below_threshold and correct_outside_range. These three functions can be used to filter extreme values or fill them with a specified method. They correctly handle series with a DatetimeIndex, so that interpolation works correctly even when measurements have a varying frequency.

Note: this function does not affect nans. To filter/fill missing values, use pd.fillna instead.

Parameters:
series: A pandas series

The series containing potential outliers

threshold: number or tuple

The exclusive threshold. A number for correct_above_threshold and correct_below_threshold; for correct_outside_range it must be a tuple (lower, upper). The defaults are 1 (above), 0 (below) and (0, 1) (range).

method: string (default = “na”)

What the threshold-exceeding values should be corrected to. The options are:

  • If ‘na’, set values to np.nan

  • If ‘previous’, set values to the previous non-exceeding, non-na value

  • If ‘average’, linearly interpolate values using pandas.DataFrame.interpolate; this may leak future information and requires a suitable index

  • If ‘clip’, set to the threshold (the lower/upper bound in case of a range)

  • If ‘value’, set to a specific value, given by the value parameter

  • If ‘remove’, remove the complete row

value: (default = None)

If method is ‘value’, set the threshold exceeding entry to this value

Returns:
series: pandas series

The original series with the threshold exceeding values corrected

Examples

>>> from sam.preprocessing import correct_below_threshold
>>> import pandas as pd
>>> data = pd.Series([0, -1, 2, 1], index = [2, 3, 4, 6])
>>>
>>> correct_below_threshold(data, method = "average", threshold=0)
2    0.0
3    1.0
4    2.0
6    1.0
dtype: float64
>>> from sam.preprocessing import correct_outside_range
>>> import pandas as pd
>>> data = pd.Series([0, -1, 2, 1])
>>>
>>> correct_outside_range(data, method = "na", threshold=(0,1))
0    0.0
1    NaN
2    NaN
3    1.0
dtype: float64
>>> from sam.preprocessing import correct_above_threshold
>>> import pandas as pd
>>> data = pd.Series([0, -1, 2, 1])
>>>
>>> correct_above_threshold(data, method = "remove", threshold = 1)
0    0
1   -1
3    1
dtype: int64
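The ‘clip’ method caps values at the threshold instead of replacing them. An illustrative example (output inferred from the documented behavior, not taken from the original docs):

>>> correct_above_threshold(data, method = "clip", threshold = 1).tolist()
[0, -1, 1, 1]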
sam.preprocessing.correct_above_threshold(series, threshold=1, method='na', value=None)

This documentation covers correct_above_threshold, correct_below_threshold and correct_outside_range. These three functions can be used to filter extreme values or fill them with a specified method. They correctly handle series with a DatetimeIndex, so that interpolation works correctly even when measurements have a varying frequency.

Note: this function does not affect nans. To filter/fill missing values, use pd.fillna instead.

Parameters:
series: A pandas series

The series containing potential outliers

threshold: number or tuple

The exclusive threshold. A number for correct_above_threshold and correct_below_threshold; for correct_outside_range it must be a tuple (lower, upper). The defaults are 1 (above), 0 (below) and (0, 1) (range).

method: string (default = “na”)

What the threshold-exceeding values should be corrected to. The options are:

  • If ‘na’, set values to np.nan

  • If ‘previous’, set values to the previous non-exceeding, non-na value

  • If ‘average’, linearly interpolate values using pandas.DataFrame.interpolate; this may leak future information and requires a suitable index

  • If ‘clip’, set to the threshold (the lower/upper bound in case of a range)

  • If ‘value’, set to a specific value, given by the value parameter

  • If ‘remove’, remove the complete row

value: (default = None)

If method is ‘value’, set the threshold exceeding entry to this value

Returns:
series: pandas series

The original series with the threshold exceeding values corrected

Examples

>>> from sam.preprocessing import correct_below_threshold
>>> import pandas as pd
>>> data = pd.Series([0, -1, 2, 1], index = [2, 3, 4, 6])
>>>
>>> correct_below_threshold(data, method = "average", threshold=0)
2    0.0
3    1.0
4    2.0
6    1.0
dtype: float64
>>> from sam.preprocessing import correct_outside_range
>>> import pandas as pd
>>> data = pd.Series([0, -1, 2, 1])
>>>
>>> correct_outside_range(data, method = "na", threshold=(0,1))
0    0.0
1    NaN
2    NaN
3    1.0
dtype: float64
>>> from sam.preprocessing import correct_above_threshold
>>> import pandas as pd
>>> data = pd.Series([0, -1, 2, 1])
>>>
>>> correct_above_threshold(data, method = "remove", threshold = 1)
0    0
1   -1
3    1
dtype: int64
sam.preprocessing.correct_below_threshold(series, threshold=0, method='na', value=None)

This documentation covers correct_above_threshold, correct_below_threshold and correct_outside_range. These three functions can be used to filter extreme values or fill them with a specified method. They correctly handle series with a DatetimeIndex, so that interpolation works correctly even when measurements have a varying frequency.

Note: this function does not affect nans. To filter/fill missing values, use pd.fillna instead.

Parameters:
series: A pandas series

The series containing potential outliers

threshold: number or tuple

The exclusive threshold. A number for correct_above_threshold and correct_below_threshold; for correct_outside_range it must be a tuple (lower, upper). The defaults are 1 (above), 0 (below) and (0, 1) (range).

method: string (default = “na”)

What the threshold-exceeding values should be corrected to. The options are:

  • If ‘na’, set values to np.nan

  • If ‘previous’, set values to the previous non-exceeding, non-na value

  • If ‘average’, linearly interpolate values using pandas.DataFrame.interpolate; this may leak future information and requires a suitable index

  • If ‘clip’, set to the threshold (the lower/upper bound in case of a range)

  • If ‘value’, set to a specific value, given by the value parameter

  • If ‘remove’, remove the complete row

value: (default = None)

If method is ‘value’, set the threshold exceeding entry to this value

Returns:
series: pandas series

The original series with the threshold exceeding values corrected

Examples

>>> from sam.preprocessing import correct_below_threshold
>>> import pandas as pd
>>> data = pd.Series([0, -1, 2, 1], index = [2, 3, 4, 6])
>>>
>>> correct_below_threshold(data, method = "average", threshold=0)
2    0.0
3    1.0
4    2.0
6    1.0
dtype: float64
>>> from sam.preprocessing import correct_outside_range
>>> import pandas as pd
>>> data = pd.Series([0, -1, 2, 1])
>>>
>>> correct_outside_range(data, method = "na", threshold=(0,1))
0    0.0
1    NaN
2    NaN
3    1.0
dtype: float64
>>> from sam.preprocessing import correct_above_threshold
>>> import pandas as pd
>>> data = pd.Series([0, -1, 2, 1])
>>>
>>> correct_above_threshold(data, method = "remove", threshold = 1)
0    0
1   -1
3    1
dtype: int64

Time-specific preprocessing

sam.preprocessing.average_winter_time(data: DataFrame, tmpcol: str = 'tmp_UNID')

Solve duplicate timestamps in wintertime by averaging them. Because the to_wintertime hour happens twice, there can be duplicate timestamps. This function removes those duplicates by averaging the VALUE column; all other columns are used as group-by columns.

Parameters:
data: pandas Dataframe

must have columns TIME, VALUE, and optionally others like ID and TYPE.

tmpcol: string, optional (default=’tmp_UNID’)

temporary column name that is created in the dataframe. This column name cannot already exist in the dataframe

Returns:
data: pandas Dataframe

The same dataframe as was given as input, but with duplicate timestamps removed if they occurred during the duplicate wintertime hour

Examples

>>> from sam.preprocessing import average_winter_time
>>> import pandas as pd
>>> import numpy as np
>>>
>>> daterange = pd.date_range('2019-10-27 01:45:00', '2019-10-27 03:00:00', freq='15min')
>>> test_df = pd.DataFrame({"TIME": daterange.values[[0, 1, 1, 2, 2, 3, 3, 4, 4, 5]],
...                         "VALUE": np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])})
>>> average_winter_time(test_df)
                 TIME  VALUE
0 2019-10-27 01:45:00    0.0
1 2019-10-27 02:00:00    1.5
2 2019-10-27 02:15:00    3.5
3 2019-10-27 02:30:00    5.5
4 2019-10-27 02:45:00    7.5
5 2019-10-27 03:00:00    9.0
sam.preprocessing.label_dst(timestamps_series: Series)

Find possible conflicts due to daylight saving time, by labeling timestamps_series. This converts a series of timestamps to a series of strings. The strings are either ‘normal’, ‘to_summertime’, or ‘to_wintertime’. to_summertime happens on the last Sunday morning of March, from 2:00 to 2:59. to_wintertime happens on the last Sunday morning of October, from 2:00 to 2:59. These hours are potential problems because they occur either twice or not at all; to_summertime timestamps should therefore be impossible.

Parameters:
timestamps_series: pd.Series, shape = (n_inputs,)

a series of pandas timestamps

Returns:
labels: array-like of strings, shape = (n_inputs,)

a numpy array of strings, that are all either ‘normal’, ‘to_summertime’, or ‘to_wintertime’

Examples

>>> from sam.preprocessing import label_dst
>>> import pandas as pd
>>>
>>> daterange = pd.date_range('2019-10-27 01:00:00', '2019-10-27 03:00:00', freq='15min')
>>> date_labels = label_dst(pd.Series(daterange))
>>>
>>> pd.DataFrame({'TIME' : daterange,
...               'LABEL': date_labels})
                 TIME          LABEL
0 2019-10-27 01:00:00         normal
1 2019-10-27 01:15:00         normal
2 2019-10-27 01:30:00         normal
3 2019-10-27 01:45:00         normal
4 2019-10-27 02:00:00  to_wintertime
5 2019-10-27 02:15:00  to_wintertime
6 2019-10-27 02:30:00  to_wintertime
7 2019-10-27 02:45:00  to_wintertime
8 2019-10-27 03:00:00         normal

SAM-format Reshaping

sam.preprocessing.sam_format_to_wide(data: DataFrame, sep: str = '_')

Converts a typical sam-format df ‘(TIME, ID, TYPE, VALUE)’ to wide format. This is almost a wrapper around pd.pivot_table, although it does a few extra things. It removes the multiindex that would normally occur with ID + TYPE, by concatenating them with a separator in between. It also sorts the output by ‘TIME’, which is not guaranteed by pivot_table.

Parameters:
data: pd.DataFrame

dataframe with TIME, ID, TYPE, VALUE columns

sep: str, optional (default=’_’)

separator that will be placed between ID and TYPE to create column names.

Returns:
data_wide: pd.DataFrame

the data, in wide format, with 1 column for every ID/TYPE combination, as well as a TIME column. For example, if ID is ‘abc’ and TYPEs are ‘debiet’ and ‘stand’, the created column names will be ‘abc_debiet’ and ‘abc_stand’, as well as TIME. The result will be sorted by TIME, ascending. The index will be a range from 0 to nrows.
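
Examples

An illustrative example (not part of the original documentation); the output follows the column-naming and sorting rules described above:

>>> import pandas as pd
>>> from sam.preprocessing import sam_format_to_wide
>>> df = pd.DataFrame({'TIME': pd.to_datetime(['2020-01-01 01:00', '2020-01-01 00:00',
...                                            '2020-01-01 00:00']),
...                    'ID': 'abc',
...                    'TYPE': ['debiet', 'debiet', 'stand'],
...                    'VALUE': [3.0, 1.0, 2.0]})
>>> sam_format_to_wide(df)
                 TIME  abc_debiet  abc_stand
0 2020-01-01 00:00:00         1.0        2.0
1 2020-01-01 01:00:00         3.0        NaN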

sam.preprocessing.wide_to_sam_format(data: DataFrame, sep: str = '_', idvalue: str = '', timecol: str = 'TIME')

Convert a wide format dataframe to sam format. This function has the requirement that the dataframe has a time column, of which the name is given by timecol. Furthermore, the TYPE/ID combinations should be present in value column names that look like ID(sep)TYPE.

If sep is None, then all column names (except timecol) are assumed to be TYPE, with no id, so id will always be set to idvalue.

Columns that look like ‘A(sep)B(sep)C’ will be split as ‘ID = A, TYPE = B(sep)C’. Columns that look like ‘A’ (without sep) will be split as ‘ID = idvalue, TYPE = A’.

Parameters:
data: pd.DataFrame

data in wide format, with time column and other column names like ‘ID(sep)TYPE’

sep: string, optional (default=’_’)

the separator that appears in column names between id and type.

idvalue: string, optional (default=’’)

the default id value that is used when a column name contains no id

timecol: string, optional (default=’TIME’)

the column name of the time column that must be present

Returns:
df: pd.DataFrame

the data in sam format, with columns TIME, ID, TYPE, VALUE.
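
Examples

An illustrative example (not part of the original documentation), assuming the splitting rules described above:

>>> import pandas as pd
>>> from sam.preprocessing import wide_to_sam_format
>>> wide = pd.DataFrame({'TIME': pd.to_datetime(['2020-01-01 00:00', '2020-01-01 01:00']),
...                      'abc_debiet': [1.0, 3.0],
...                      'abc_stand': [2.0, 4.0]})
>>> long_df = wide_to_sam_format(wide)
>>> sorted(long_df.columns)
['ID', 'TIME', 'TYPE', 'VALUE']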

Recurrent Features Reshaping

sam.preprocessing.RecurrentReshaper(window, lookback=1, remove_leading_nan=False)

Reshapes a two-dimensional feature table into a three-dimensional sliding-window table, usable for recurrent neural networks.

An important note is that this transformer assumes that the data is sorted by time already! So if the input dataframe is not sorted by time (in ascending order), the results will be completely wrong.

Given an input array with shape (n_samples, n_features), the output array is of shape (n_samples, window, n_features)

Parameters:
window: integer

Number of rows to look back

lookback: integer (default=1)

The features that are built will be shifted by this value. If the target is in X, lookback should be greater than 0 to avoid leakage.

remove_leading_nan: boolean (default=False)

Whether leading nans should be removed. Leading nans arise because there is no history for the first samples

Examples

>>> from sam.data_sources import read_knmi
>>> from sam.preprocessing import RecurrentReshaper
>>> X = read_knmi('2018-01-01 00:00:00', '2018-01-08 00:00:00').set_index('TIME')
>>> reshaper = RecurrentReshaper(window=7)
>>> X3D = reshaper.fit_transform(X)
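
A self-contained shape check (illustrative; assumes the output shape described above):

>>> import pandas as pd
>>> X_small = pd.DataFrame({'a': [0.0, 1.0, 2.0, 3.0, 4.0]})
>>> RecurrentReshaper(window=2).fit_transform(X_small).shape
(5, 2, 1)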

Differencing

sam.preprocessing.make_differenced_target(y: Series, lags: int | list = 1, newcol_prefix: str | None = None)

Creates a target dataframe by performing differencing (once or multiple times) on a monospaced series. The dataframe contains columns ‘TARGET_diff_x’, where x are the lags and TARGET is the name of the input series.

Parameters:
y: pd.Series

A series containing the target data. Must be monospaced in time, for the differencing to work correctly.

lags: array-like or int, optional (default=1)

A list of integers, or a single integer describing what lags should be used to look in the future. For example, if this is [1, 2, 3], the output will have three columns, performing differencing on 1, 2, and 3 timesteps in the future. If this is a list, the output will be a dataframe. If this is a scalar, the output will be a pd.Series

newcol_prefix: str, optional (default=None)

The prefix that the output columns will have. If None, y.name is used instead.

Returns:
target: pd.DataFrame or pd.Series

A target with the same index as y, and columns equal to len(lags) The values will be ‘future values’ of y but differenced. If we consider the index to be the ‘timestamp’, then the index will be the moment the prediction is made, not the moment the prediction is about. Therefore, the columns will be different future values with different lags. Any values that cannot be calculated (because there is no available future value) will be set to np.nan.

Examples

>>> import pandas as pd
>>> from sam.preprocessing import make_differenced_target
>>> df = pd.DataFrame({
...     'X': [18, 19, 20, 21],
...     'y': [10, 20, 50, 100]
... })
>>> make_differenced_target(df['y'], lags=1)
0    10.0
1    30.0
2    50.0
3     NaN
Name: y_diff_1, dtype: float64
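When lags is a list, the output is a dataframe with one column per lag (an illustrative continuation, following the naming scheme in the example above):

>>> make_differenced_target(df['y'], lags=[1, 2])
   y_diff_1  y_diff_2
0      10.0      40.0
1      30.0      80.0
2      50.0       NaN
3       NaN       NaN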
sam.preprocessing.inverse_differenced_target(predictions: DataFrame, y: Series)

Inverses differencing by adding the current values to the prediction.

This function will take differenced target(s) and the current values, and return the actual target(s). Can be used to convert predictions from a differenced model to real predictions.

predictions and y must be joined on index. Any indexes that only appear in predictions, or only appear in y, will also appear in the output, with nans inserted.

Parameters:
predictions: pd.DataFrame

Dataframe containing differenced values.

y: pd.Series

The actual values in the present

Returns:
actual: pd.DataFrame

Dataframe containing un-differenced values, created by adding predictions to y on index. The index of this output will be the union of the indexes of predictions and y. The columns refer to the values/predictions made at a single point in time. For example, if the index is ‘18:00’, and the predictions are made on differencing 1 hour, 2 hour and 3 hours, then one row will contain the predictions made at 18:00, predicting what the target will be at 19:00, 20:00 and 21:00

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from sam.preprocessing import make_differenced_target, inverse_differenced_target
>>> df = pd.DataFrame({
...     'X': [18, 19, 20, 21],
...     'y': [10, 20, 50, 100]
... })
>>> target = make_differenced_target(df['y'], lags=1)
>>> inverse_differenced_target(target, df['y'])
0     20.0
1     50.0
2    100.0
3      NaN
Name: y_diff_1, dtype: float64
>>> prediction = pd.DataFrame({
...    'pred_diff_1': [15, 25, 34, np.nan],
...    'pred_diff_2': [40, 55, np.nan, np.nan]
... })
>>> inverse_differenced_target(prediction, df['y'])
   pred_diff_1  pred_diff_2
0         25.0         50.0
1         45.0         75.0
2         84.0          NaN
3          NaN          NaN
>>> # This means that at timestep 0, we predict that the next two values will be 25 and 50
>>> # At timestep 1, we predict the next two values will be 45 and 75
>>> # At timestep 2, we predict the next value will be 84; the value after that is unknown