Exploration

This is the documentation for Exploration functions.

If you are interested in feature selection, please see Features for further details on which features to use.

Incident curves information

sam.exploration.incident_curves(data: DataFrame, under_conf_interval: bool = True, max_gap: int = 0, min_duration: int = 0, max_gap_perc: float = 1, min_dist_total: float = 0, actual: str = 'ACTUAL', low: str = 'PREDICT_LOW', high: str = 'PREDICT_HIGH')

Finds and labels connected outliers, or ‘curves’. The basic idea of this function is to define an outlier as a row where the value is outside some interval. ‘Interval’ here refers to a prediction interval, that can be different for every row. This interval defines what a model considers a ‘normal’ value for that datapoint. Missing values are not considered outliers.

Then, we apply various checks and filters to the outliers to create ‘curves’: streaks of connected outliers. These curves can have gaps, if max_gap > 0. In the end, only the curves that satisfy the conditions are kept. Curves that do not satisfy one of the conditions are ignored (essentially, the output will act as if they are not outliers at all).

The output is an array with the same length as the number of rows in data, with each streak of outliers labeled with a unique number. This algorithm assumes the input is sorted by time: adjacent rows are adjacent measurements!
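The outlier definition above can be sketched in plain pandas. This is a minimal illustration of the idea, not the library's implementation; the column names follow the defaults documented below:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'ACTUAL': [0.3, np.nan, 0.7, 0.5],
                     'PREDICT_LOW': 0.4, 'PREDICT_HIGH': 0.6})

# A row is an outlier when its value falls outside the interval.
# Comparisons with NaN evaluate to False, so missing values are never outliers.
outlier = (data['ACTUAL'] > data['PREDICT_HIGH']) | (data['ACTUAL'] < data['PREDICT_LOW'])
print(outlier.tolist())  # [True, False, True, False]
```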

Parameters:
data: pd.DataFrame (n_rows, _)

dataframe containing values, as well as two columns determining an interval. These three column names can be configured using the ‘actual’, ‘low’, and ‘high’ parameters

under_conf_interval: bool, optional (default=True)

If true, values lower than the interval count as outliers. Else, only values higher than the interval are counted as outliers.

max_gap: int, optional (default=0)

How many gaps are allowed between outliers. For example, if max_gap = 2 and the outliers look like [True, False, False, True], the gap of 2 is absorbed into the curve, turning it into a single curve of outliers of length 4

min_duration: int, optional (default=0)

Minimum number of outliers per curve. Curves with a smaller length than this value will be ignored. Gaps are counted in the duration of a curve.

max_gap_perc: float, optional (default=1)

The maximum proportion of gaps a curve may contain. For example, if this is 0.4 and a curve contains 2 outliers and 2 gaps (a gap proportion of 0.5), then the curve will be ignored.

min_dist_total: float, optional (default=0)

The minimum total ‘outlier size’ a curve must have. The outlier size here is defined as the distance between the value and the end of the interval. For example, if the interval is (10, 20) and the value is 21, the ‘outlier size’ is 1. These values are summed (gaps are counted as 0) and compared to this value. Curves with a sum that is too low will be ignored.

actual: string, optional (default=’ACTUAL’)

The name of the column in the data containing the value for each row

low: string, optional (default=’PREDICT_LOW’)

The name of the column in the data containing the lower end of the interval for each row

high: string, optional (default=’PREDICT_HIGH’)

The name of the column in the data containing the higher end of the interval for each row

Returns:
outlier_curves: array-like (n_rows,)

A numpy array of numbers labeling each curve. 0 means there is no outlier curve here that satisfies all conditions.

Examples

>>> import numpy as np
>>> import pandas as pd
>>> from sam.exploration import incident_curves
>>> data = pd.DataFrame({'ACTUAL': [0.3, np.nan, 0.3, np.nan, 0.3, 0.5, np.nan, 0.7],
...                      'PREDICT_HIGH': 0.6, 'PREDICT_LOW': 0.4})
>>> incident_curves(data)
array([1, 0, 2, 0, 3, 0, 0, 4])
>>> incident_curves(data, max_gap=1)
array([1, 1, 1, 1, 1, 0, 0, 2])
>>> incident_curves(data, max_gap=1, max_gap_perc=0.2)
array([0, 0, 0, 0, 0, 0, 0, 2])
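The gap-tolerant labeling in the examples above can be sketched as follows. This is a simplified reimplementation for illustration only: it handles only max_gap and applies none of the other filters:

```python
import numpy as np
import pandas as pd

def label_curves(outlier, max_gap=0):
    """Label streaks of outliers, merging streaks separated by <= max_gap non-outliers."""
    labels = np.zeros(len(outlier), dtype=int)
    curve = 0
    last_outlier = None  # index of the most recent outlier seen
    for i, is_out in enumerate(outlier):
        if not is_out:
            continue
        if last_outlier is None or i - last_outlier - 1 > max_gap:
            curve += 1  # gap too large: start a new curve
        else:
            labels[last_outlier + 1:i] = curve  # absorb the gap into the curve
        labels[i] = curve
        last_outlier = i
    return labels

actual = pd.Series([0.3, np.nan, 0.3, np.nan, 0.3, 0.5, np.nan, 0.7])
outlier = (actual > 0.6) | (actual < 0.4)
print(label_curves(outlier, max_gap=0))  # [1 0 2 0 3 0 0 4]
print(label_curves(outlier, max_gap=1))  # [1 1 1 1 1 0 0 2]
```

This reproduces the first two doctest outputs above; the real function additionally applies min_duration, max_gap_perc, and min_dist_total.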
sam.exploration.incident_curves_information(data: DataFrame, under_conf_interval: bool = True, return_aggregated: bool = True, normal: str = 'PREDICT', time: str = 'TIME', **kwargs)

Aggregates a dataframe by incident curves. This function computes the curves using incident_curves, and then calculates information about each outlier curve. This function can either return raw information about each outlier row, keeping the number of rows the same, or it can aggregate the dataframe by outlier curve, which means there will be specific information per curve, such as the length, the total outlier distance, etcetera.

The data must contain the actual, low, and high values (actual is the actual value, and low/high form an interval deeming what is ‘normal’). The data must also contain a ‘time’ column. Lastly, the data must contain a ‘normal’ column, describing what would be considered the most normal value (for example, the middle of the interval). This algorithm assumes the input is sorted by time: adjacent rows are adjacent measurements!

Parameters:
data: pd.DataFrame (n_rows, _)

dataframe containing actual, low, high, normal, and time columns. These column names can be configured using those parameters. The default values are (ACTUAL, PREDICT_LOW, PREDICT_HIGH, PREDICT, TIME)

under_conf_interval: bool, optional (default=True)

If true, values lower than the interval count as outliers. Else, only values higher than the interval are counted as outliers.

return_aggregated: bool, optional (default=True)

If true the information about the outliers will be aggregated by OUTLIER_CURVE. Else, information will not be aggregated. The two options return different types of information.

normal: string, optional (default=’PREDICT’)

The name of the column in the data containing a ‘normal’ value for each row

time: string, optional (default=’TIME’)

The name of the column in the data containing a ‘time’ value for each row

Returns:
information, dataframe

if return_aggregated is false: information about each outlier. The output will have the following columns:

  • all the original columns

  • OUTLIER (bool) whether the value of the row is considered an outlier

  • OUTLIER_CURVE (numeric) the streak the outlier belongs to, or 0 if it’s not an outlier

  • OUTLIER_DIST (numeric) The distance between the value and the outside of the interval, describing how much ‘out of the normal’ the value is

  • OUTLIER_SCORE (numeric) If x is OUTLIER_DIST, and y is the distance between the value and the ‘normal’ column, then this is x / (1 + y). This defines some ‘ratio’ of how abnormal the value is. This can be useful in scale-free data, where the absolute distance is not a fair metric.

  • OUTLIER_TYPE (string) ‘positive’ if the outlier is above the interval, and ‘negative’ if the outlier is below the interval

if return_aggregated is true: information about each outlier curve. The output will have the following columns:

  • index: OUTLIER_CURVE (numeric) The id of the curve. 0 is not included

  • OUTLIER_DURATION (numeric) The number of points in the curve, including gaps

  • OUTLIER_TYPE (string) if the first point is positive or negative. Other points in the curve may have other types

  • OUTLIER_SCORE_MAX (numeric) The maximum of OUTLIER_SCORE of all the points in the curve

  • OUTLIER_START_TIME (datetime) The value of the ‘time’ column of the first point in the curve

  • OUTLIER_END_TIME (datetime) The value of the ‘time’ column of the last point in the curve

  • OUTLIER_DIST_SUM (numeric) The sum of OUTLIER_DIST of the points in the curve. Gaps count as 0

  • OUTLIER_DIST_MAX (numeric) The max of OUTLIER_DIST of the points in the curve
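Several of the aggregated columns above can be sketched with a pandas groupby. This is a simplified illustration using made-up curve labels, not the library's implementation:

```python
import pandas as pd

# Toy per-row output: curve label 0 means 'no outlier curve'
df = pd.DataFrame({'TIME': [1, 2, 3, 4, 5],
                   'OUTLIER_CURVE': [1, 1, 0, 2, 2],
                   'OUTLIER_DIST': [0.1, 0.3, 0.0, 0.2, 0.4]})

agg = (df[df['OUTLIER_CURVE'] != 0]            # curve 0 is excluded from the result
       .groupby('OUTLIER_CURVE')
       .agg(OUTLIER_DURATION=('TIME', 'size'),
            OUTLIER_START_TIME=('TIME', 'min'),
            OUTLIER_END_TIME=('TIME', 'max'),
            OUTLIER_DIST_SUM=('OUTLIER_DIST', 'sum'),
            OUTLIER_DIST_MAX=('OUTLIER_DIST', 'max')))
print(agg)
```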

Examples

>>> import pandas as pd
>>> from sam.exploration import incident_curves_information
>>> data = pd.DataFrame({'TIME': range(1547477436, 1547477436+3),  # unix timestamps
...                     'ACTUAL': [0.3, 0.5, 0.7],
...                     'PREDICT_HIGH': 0.6, 'PREDICT_LOW': 0.4, 'PREDICT': 0.5})
>>> incident_curves_information(data)  
               OUTLIER_DURATION  ...
>>> incident_curves_information(data, return_aggregated=False)
         TIME  ACTUAL  PREDICT_HIGH  ...  OUTLIER_DIST  OUTLIER_SCORE  OUTLIER_TYPE
0  1547477436     0.3           0.6  ...           0.1       0.090909      negative
1  1547477437     0.5           0.6  ...           0.0       0.000000          none
2  1547477438     0.7           0.6  ...           0.1       0.090909      positive

[3 rows x 10 columns]

Retrieve correlation features

sam.exploration.lag_correlation(df: DataFrame, target_name: str, lag: int = 12, method: str | Callable = 'pearson')

Creates a new dataframe that contains the correlation of target_name with the other variables in the dataframe, based on the output from BuildRollingFeatures. The results are processed for easy visualization, with one column for the lag and one correlation column per feature.

Parameters:
df: pd.DataFrame

input dataframe containing the variables to calculate the lag correlation of

target_name: str

The name of the goal variable to calculate lag correlation with

lag: int or list of ints (default=12)

When an integer is provided, a range is created from 0 to lag in steps of 1; when a list of ints is provided, it is used directly. Default is 12, which means the correlation is calculated for lags ranging from 0 to 11.

method: string or callable, optional (default=’pearson’)

The method used to calculate correlation. See pandas.DataFrame.corrwith. Options are {‘pearson’, ‘kendall’, ‘spearman’}, or a callable.

Returns:
tab: pandas dataframe

A dataframe with the correlation per lag. The column headers contain the feature names.

Examples

>>> import pandas as pd
>>> from sam.exploration import lag_correlation
>>> import numpy as np
>>> X = pd.DataFrame({
...        'RAIN': [0.1, 0.2, 0.0, 0.6, 0.1, 0.0, 0.0,
...                 0.0, 0.0, 0.0, 0.0, 0.0],
...        'DEBIET#A': [1, 2, 3, 4, 5, 5, 4, 3, 2, 4, 2, 3],
...        'DEBIET#B': [3, 1, 2, 3, 3, 6, 4, 1, 3, 3, 1, 5]})
>>> X['DEBIET#TOTAAL'] = X['DEBIET#A'] + X['DEBIET#B']
>>> tab = lag_correlation(X, 'DEBIET#TOTAAL', lag=11)
>>> tab
    LAG  DEBIET#A  DEBIET#B      RAIN
0     0  0.838591  0.897340 -0.017557
1     1  0.436484  0.102808  0.204983
2     2  0.287863 -0.401768  0.672316
3     3 -0.388095 -0.140876  0.188438
4     4 -0.632980 -0.509307 -0.227071
5     5 -0.667537 -0.367268 -0.048162
6     6 -0.152832  0.615239  0.110876
7     7  0.457496 -0.107833 -0.719702
8     8  0.291111  0.039253  0.871695
9     9  0.188982  0.755929 -0.944911
10   10  1.000000 -1.000000  1.000000
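Conceptually, the lag-k correlation is the correlation between the target and each feature shifted k steps back. A minimal sketch with plain pandas (not the library's implementation, which builds the lags via BuildRollingFeatures; `simple_lag_correlation` is a hypothetical helper for illustration):

```python
import pandas as pd

def simple_lag_correlation(df, target_name, lags):
    rows = []
    for k in lags:
        # Correlate the target with every other column shifted k steps back
        shifted = df.drop(columns=target_name).shift(k)
        rows.append(shifted.corrwith(df[target_name]))
    out = pd.DataFrame(rows)
    out.insert(0, 'LAG', list(lags))
    return out.reset_index(drop=True)

df = pd.DataFrame({'x': [1, 3, 2, 5, 4, 6]})
df['y'] = df['x'].shift(1)  # y trails x by exactly one step
tab = simple_lag_correlation(df, 'y', range(2))
print(tab)  # the lag-1 correlation of x with y is 1.0
```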
sam.exploration.top_score_correlations(df: DataFrame, goal_feature: str, score: float = 0.5)

Returns the features that have a correlation above a certain threshold with the defined goal feature

Parameters:
df: pd.DataFrame

Dataframe containing the features that have to be correlated.

goal_feature: str

Feature that is used to compare the correlation with other features

score: float (default: 0.5)

minimal absolute correlation value

Returns:
corrs: pd.DataFrame

A dataframe containing 2 columns (index, goal feature). Index contains the correlating features and goal feature contains the correlation values.

Examples

>>> import pandas as pd
>>> from sam.feature_engineering import BuildRollingFeatures
>>> from sam.exploration import top_score_correlations
>>> import numpy as np
>>> goal_feature = 'DEBIET_TOTAAL#lag_0'
>>> df = pd.DataFrame({
...                'RAIN': [0.1, 0.2, 0.0, 0.6, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
...                'DEBIET_A': [1, 2, 3, 4, 5, 5, 4, 3, 2, 4, 2, 3],
...                'DEBIET_B': [3, 1, 2, 3, 3, 6, 4, 1, 3, 3, 1, 5]})
>>> df['DEBIET_TOTAAL'] = df['DEBIET_A'] + df['DEBIET_B']
>>> RollingFeatures = BuildRollingFeatures(rolling_type='lag',
...     window_size=np.arange(10), lookback=0, keep_original=False)
>>> res = RollingFeatures.fit_transform(df)
>>> top_score_correlations(res, goal_feature, score=0.8)
                 index  DEBIET_TOTAAL#lag_0
0  DEBIET_TOTAAL#lag_9             0.944911
1           RAIN#lag_9            -0.944911
2       DEBIET_B#lag_0             0.897340
3           RAIN#lag_8             0.871695
4       DEBIET_A#lag_0             0.838591
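The thresholding itself amounts to filtering a correlation series by absolute value. A sketch of the idea in plain pandas, not the library's implementation:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [4, 3, 2, 1],
                   'c': [2, 1, 4, 3],
                   'goal': [1, 2, 3, 4]})

# Correlate every feature with the goal, then keep |corr| >= threshold
corrs = df.drop(columns='goal').corrwith(df['goal'])
top = corrs[corrs.abs() >= 0.8].sort_values(key=lambda s: s.abs(), ascending=False)
print(top)  # keeps 'a' (1.0) and 'b' (-1.0); 'c' (0.6) is dropped
```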
sam.exploration.top_n_correlations(df: DataFrame, goal_feature: str, n: int = 5, grouped: bool = True, sep: str = '#')

Given a dataset, retrieve the top n absolute correlating features per group or in general

Parameters:
df: pd.DataFrame

Dataframe containing the features that have to be correlated.

goal_feature: str

Feature that is used to compare the correlation with other features

n: int (default: 5)

Number of correlating features that are returned

grouped: bool (default: True)

Whether to group the features and take the top n of a group, or just the top n correlating features in general. Groups are created based on the column name, and consist of all characters before the first occurrence of the sep. For example, if the sep is ‘#’, then DEBIET_TOTAAL#lag_0 is in group DEBIET_TOTAAL

sep: str (default: ‘#’)

The separator character. The group of a column is defined as everything before the first occurrence of this character. Only relevant if grouped is True

Returns:
df: pd.DataFrame

If grouped is true, a dataframe containing 3 columns (GROUP, index, goal_variable) is returned, else a dataframe containing 2 columns (index, goal_variable) is returned. index contains the correlating features and goal_variable the correlation value. GROUP contains the group.

Examples

>>> import pandas as pd
>>> from sam.feature_engineering import BuildRollingFeatures
>>> from sam.exploration import top_n_correlations
>>> import numpy as np
>>> goal_feature = 'DEBIET_TOTAAL#lag_0'
>>> df = pd.DataFrame({
...                'RAIN': [0.1, 0.2, 0.0, 0.6, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
...                'DEBIET_A': [1, 2, 3, 4, 5, 5, 4, 3, 2, 4, 2, 3],
...                'DEBIET_B': [3, 1, 2, 3, 3, 6, 4, 1, 3, 3, 1, 5]})
>>> df['DEBIET_TOTAAL'] = df['DEBIET_A'] + df['DEBIET_B']
>>> RollingFeatures = BuildRollingFeatures(rolling_type='lag',
...     window_size = np.arange(12), lookback=0, keep_original=False)
>>> res = RollingFeatures.fit_transform(df)
>>> top_n_correlations(res, goal_feature, n=2, grouped=True, sep='#')
           GROUP                 index  DEBIET_TOTAAL#lag_0
0       DEBIET_A       DEBIET_A#lag_10             1.000000
1       DEBIET_A        DEBIET_A#lag_0             0.838591
2       DEBIET_B       DEBIET_B#lag_10            -1.000000
3       DEBIET_B        DEBIET_B#lag_0             0.897340
4  DEBIET_TOTAAL  DEBIET_TOTAAL#lag_10            -1.000000
5  DEBIET_TOTAAL   DEBIET_TOTAAL#lag_9             0.944911
6           RAIN           RAIN#lag_10             1.000000
7           RAIN            RAIN#lag_9            -0.944911
>>> top_n_correlations(res, goal_feature, n=2, grouped=False)
                  index  DEBIET_TOTAAL#lag_0
0  DEBIET_TOTAAL#lag_10                 -1.0
1       DEBIET_A#lag_10                  1.0
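The per-group selection can be sketched as follows: derive the group from the prefix before the separator, sort by absolute correlation, and keep the top n rows per group. An illustration of the grouping logic with made-up correlations, not the library's implementation:

```python
import pandas as pd

corrs = pd.Series({'A#lag_0': 0.9, 'A#lag_1': -0.95, 'A#lag_2': 0.1,
                   'B#lag_0': 0.5, 'B#lag_1': -0.2})
df = corrs.rename('corr').reset_index()
df['GROUP'] = df['index'].str.split('#').str[0]  # everything before the first sep

# Order rows by |corr| descending, then keep the first 2 rows of each group
top2 = (df.reindex(df['corr'].abs().sort_values(ascending=False).index)
          .groupby('GROUP').head(2)
          .sort_values('GROUP'))
print(top2)
```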