Data Sources

This is the documentation for the data sources functions.

Create Synthetic Data

sam.data_sources.synthetic_date_range(start='2016-01-01', end='2017-01-01', freq='H', max_delay=0, random_stop_freq=0, random_stop_max_length=1, seed=None)

Create a synthetic, somewhat realistic-looking array of datetimes.

Given a start time, end time, frequency, and some variables governing noise, creates an array of datetimes that is somewhat random.

The algorithm:

  • Generate a regular pandas date_range with start, end, and frequency

  • Delay each time by a uniformly chosen random number between 0 and max_delay, in seconds.

  • Pick a proportion random_stop_freq of times at random. Each chosen time x_i is deemed a ‘stoppage’: for each, a number k_i between 1 and random_stop_max_length is uniformly chosen, and the k_i points after x_i are deleted, creating a hole in the times.

  • Only the times strictly smaller than end are kept. This means end is an exclusive bound.
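The steps above can be sketched as follows. This is an illustrative re-implementation, not the library's actual code; in particular it uses numpy's newer Generator API, whereas the real function seeds numpy.random.seed.

```python
import numpy as np
import pandas as pd

def sketch_synthetic_date_range(start, end, freq, max_delay,
                                stop_freq, stop_max_len, seed=None):
    """Illustrative sketch of the documented algorithm."""
    rng = np.random.default_rng(seed)
    times = pd.date_range(start, end, freq=freq)
    # 1. Delay each time by Uniform(0, max_delay) seconds.
    delays = pd.to_timedelta(rng.uniform(0, max_delay, len(times)), unit="s")
    times = times + delays
    # 2. Pick stoppage starts; for each, delete the k_i following points.
    keep = np.ones(len(times), dtype=bool)
    for i in np.flatnonzero(rng.random(len(times)) < stop_freq):
        k = rng.integers(1, stop_max_len + 1)  # uniform in {1, ..., stop_max_len}
        keep[i + 1 : i + 1 + k] = False
    times = times[keep]
    # 3. end is an exclusive bound.
    return times[times < pd.Timestamp(end)]
```

With max_delay=0 and no stoppages, this reduces to a regular hourly range with the endpoint dropped.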

Parameters:
start: str or datetime-like, optional (default=’2016-01-01’)

Left bound for generating dates.

end: str or datetime-like, optional (default=’2017-01-01’)

Right bound for generating dates. Exclusive bound.

freq: str or DateOffset, optional (default=’H’) (hourly)

Frequency strings can have multiples, e.g. ‘5H’. See https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases for a list of frequency aliases.

max_delay: numeric, optional (default=0)

Each time is delayed by a random number of seconds, chosen between 0 and max_delay

random_stop_freq: numeric, optional (default=0)

Number between 0 and 1. This proportion of all times are deemed as starting points of ‘stoppages’. A stoppage means that a number of points are removed from the result.

random_stop_max_length: numeric, optional (default=1)

Each stoppage will have a randomly generated length, between 1 and random_stop_max_length. A stoppage of length k means that the first k points after the start of the stoppage are deleted.

seed: int or 1-d array_like, optional (default=None)

seed for random noise generation. Passed through to numpy.random.seed. By default, no call to numpy.random.seed is made.

Returns:
rng: DatetimeIndex

A pandas DatetimeIndex of noisy times.

Examples

>>> # Generate times with point approximately every 6 hours
>>> from sam.data_sources.synthetic_data import synthetic_date_range
>>> synthetic_date_range('2016-01-01', '2016-01-02', '6H', 600, 0, 1, seed=0)
DatetimeIndex(['2016-01-01 00:05:29.288102356',
               '2016-01-01 06:12:38.401722180',
               '2016-01-01 12:18:40.059747823',
               '2016-01-01 18:24:06.989657621'],
              dtype='datetime64[ns]', freq=None)
>>> # Generate times with very likely stops of length 1
>>> synthetic_date_range('2016-01-01', '2016-01-02', 'H', 0, 0.5, 1, seed=0)
DatetimeIndex(['2016-01-01 00:00:00', '2016-01-01 01:00:00',
               '2016-01-01 02:00:00', '2016-01-01 03:00:00',
               '2016-01-01 04:00:00', '2016-01-01 05:00:00',
               '2016-01-01 09:00:00', '2016-01-01 10:00:00',
               '2016-01-01 11:00:00', '2016-01-01 13:00:00',
               '2016-01-01 16:00:00', '2016-01-01 21:00:00'],
              dtype='datetime64[ns]', freq=None)
sam.data_sources.synthetic_timeseries(dates, monthly=0, daily=0, hourly=0, monthnoise=(None, 0), daynoise=(None, 0), noise={}, minmax_values=None, cutoff_values=None, negabs=None, random_missing=None, seed=None)

Create a synthetic time series with some temporal patterns and some noise. There are various parameters to control the distribution of the variables. The output will never be completely realistic, but it will at least resemble what real-life data could look like.

The algorithm works like this:

  • 3 cubic splines are created: one with a monthly pattern, one with a day-of-week pattern, and one with an hourly pattern. These splines are added together.

  • For each month and day of the week, noise is generated according to monthnoise and daynoise. These two sources of noise are added together

  • Noise as specified by the noise parameter is generated for each point

  • The above three series are added together, and the result is rescaled according to minmax_values

  • Missing values are added according to cutoff_values and random_missing

  • The values are mutated according to negabs

The result is returned in a numpy array with the same length as the dates input. Due to the way the cubic splines are generated, there may be several dozen to a hundred data points at the beginning and end that are nan. To fix this, choose a dates array that is a couple of days longer than what you really want. Then, at the end, filter the output to only the dates in the middle.
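The padding workaround described above can be sketched like this; a stand-in random generator replaces synthetic_timeseries so the snippet is self-contained.

```python
import numpy as np
import pandas as pd

# Pad the requested range by a few days on both sides, generate values for
# the padded range, then slice back to the range you actually want.
wanted_start, wanted_end = '2016-06-01', '2016-07-01'
padded = pd.date_range(pd.Timestamp(wanted_start) - pd.Timedelta('3D'),
                       pd.Timestamp(wanted_end) + pd.Timedelta('3D'),
                       freq='H').to_series()
# Stand-in for: values = synthetic_timeseries(padded, ...)
values = np.random.default_rng(0).normal(size=len(padded))
# The nan-prone edges fall in the padding and are sliced away here.
result = pd.Series(values, index=padded.index)[wanted_start:wanted_end]
```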

Parameters:
dates: series of datetime, shape=(n_inputs,)

The index of the time series that will be created. At least length 2. Must be a pandas series, with a .dt attribute.

monthly: numeric, optional (default=0)

The magnitude of the (random) monthly pattern. A random magnitude will be created for each month, with a cubic spline interpolating between months. The higher this value, the stronger the monthly pattern

daily: numeric, optional (default=0)

The magnitude of the (random) daily pattern. A random magnitude will be created for each day of the week, with a cubic spline interpolating between days. The higher this value, the stronger the daily pattern

hourly: numeric, optional (default=0)

The magnitude of the (random) hourly pattern. A random magnitude will be created for each hour, with a cubic spline interpolating between hours. The higher this value, the stronger the hourly pattern

monthnoise: tuple of (str, numeric), optional (default=(None, 0))

The type and magnitude of the monthly noise. For each month, a different magnitude will be uniformly drawn between 0 and monthnoise[1]. The type of the noise is given in monthnoise[0] and is either ‘normal’, ‘poisson’, or other (no noise). This noise is added to all points, but the magnitude will differ between the 12 months.

daynoise: tuple of (str, numeric), optional (default=(None, 0))

The type and magnitude of the daily noise. For each day of the week, a different magnitude will be drawn between 0 and daynoise[1]. The type of the noise is given in daynoise[0] and is either ‘normal’, ‘poisson’, or other (no noise). This noise is added to all points, but the magnitude will differ between the 7 days of the week.

noise: dict, optional (default={})

The types of noise added to every single point. The keys of this dictionary are ‘normal’, ‘poisson’, or other (ignored). The value is the scale of the noise: the standard deviation for normal noise, and the lambda for poisson noise. The greater the scale, the higher the variance of the result.
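The semantics of this noise dictionary can be sketched as follows (an illustration of the description above, not the library's code):

```python
import numpy as np

def sketch_point_noise(n, noise, rng):
    """Per-point noise: keys 'normal'/'poisson' are summed, other keys ignored."""
    out = np.zeros(n)
    for kind, scale in noise.items():
        if kind == 'normal':
            out += rng.normal(0, scale, n)   # scale = standard deviation
        elif kind == 'poisson':
            out += rng.poisson(scale, n)     # scale = lambda
        # any other key is ignored
    return out
```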

minmax_values: tuple, optional (default=None)

The values will be linearly rescaled to always fall within these bounds. By default, no rescaling is done.

cutoff_values: tuple, optional (default=None)

After rescaling, all the values that fall outside of these bounds will be set to nan. By default, no cutoff is done, and no values will be set to nan.

negabs: numeric, optional (default=None)

This value is subtracted from the output (after rescaling), and the absolute value of the result is taken. This oddly specific operation is useful when you want a positive variable with many values around 0, which is hard to achieve otherwise.
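A minimal numpy illustration of the negabs trick: values centred around 3 become positive values piled up near 0.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3, scale=1, size=100_000)  # values centred around 3
y = np.abs(x - 3)                             # negabs=3: positive, concentrated near 0
```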

random_missing: numeric, optional (default=None)

Between 0 and 1. The fraction of values that will be set to nan. Used to emulate time series with a lot of missing values. The missing values will be completely randomly distributed with no pattern.

seed: int or 1-d array_like, optional (default=None)

seed for random noise generation. Passed through to numpy.random.seed. By default, no call to numpy.random.seed is made.

Returns:
timeseries: numpy array, shape=(n_inputs,)

A numpy array containing numbers, generated according to the provided parameters.

Examples

>>> # Create data that slightly resembles the temperature in a Nereda reactor:
>>> import pandas as pd
>>> from sam.data_sources.synthetic_data import synthetic_date_range, synthetic_timeseries
>>> dates = pd.date_range('2015-01-01', '2016-01-01', freq='6H').to_series()
>>> rnd = synthetic_timeseries(
...     dates,
...     monthly=5,
...     daily=1,
...     hourly=0.0,
...     monthnoise=('normal', 0.01),
...     daynoise=('normal', 0.01),
...     noise={'normal': 0.1},
...     minmax_values=(5, 25),
...     cutoff_values=None,
...     random_missing=0.12,
...     seed=0,
... )
>>> # visualize the result to see if it looks random or not
>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax = ax.plot(dates[600:700], rnd[600:700])
>>> fig = fig.autofmt_xdate()
>>> plt.show()  
_images/synthetic_temperature.png

Read Weather API’s

sam.data_sources.read_knmi_stations()

Function to get all KNMI stations from API

sam.data_sources.read_knmi_station_data(start_date='2021-01-01', end_date='2021-01-02', stations=None, freq='daily', variables='default', parse=True, preprocess=True)

Read KNMI data for a specific station. To find station numbers, use sam.data_sources.read_knmi_stations, or use sam.data_sources.read_knmi to find the closest station by lat/lon.

Source: https://www.knmi.nl/kennis-en-datacentrum/achtergrond/data-ophalen-vanuit-een-script

Parameters:
start_date: str or datetime-like

the start time of the period from which to export weather. If str, must be in the format %Y-%m-%d %H:%M:%S or %Y-%m-%d

end_date: str or datetime-like

the end time of the period from which to export weather. If str, must be in the format %Y-%m-%d %H:%M:%S or %Y-%m-%d

stations: int, str, list or None

station number or list of station numbers, either int or str. If None, data from all stations is returned

freq: str, optional (default=’daily’)

frequency of export. Must be ‘hourly’ or ‘daily’

variables: str, None or list, optional (default=’default’)

knmi-variables to export. See all hourly variables here, or all daily variables here. By default, export [average temperature, sunshine duration, rainfall], which is [‘RH’, ‘SQ’, ‘T’] for hourly and [‘RH’, ‘SQ’, ‘TG’] for daily. If None, all variables will be collected

preprocess: bool, optional (default=True)

if True (default), data is scaled to whole units and default values of -1 are mapped to 0. If False, variables are returned in their original units (often 0.1 mm)
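The preprocessing described above can be sketched for a single rainfall column. This is a hedged illustration: it assumes a column named ‘RH’ in 0.1 mm units with -1 as the ‘less than 0.05 mm’ sentinel, which matches the description but is not the library's verified implementation.

```python
import pandas as pd

def sketch_preprocess_rain(knmi: pd.DataFrame) -> pd.DataFrame:
    """Scale a KNMI rainfall column from 0.1 mm units to mm, mapping -1 to 0."""
    out = knmi.copy()
    out['RH'] = out['RH'].replace(-1, 0) * 0.1  # -1 -> 0, then 0.1 mm -> mm
    return out

df = pd.DataFrame({'RH': [-1, 0, 5, 23]})
sketch_preprocess_rain(df)['RH']  # approximately 0.0, 0.0, 0.5, 2.3
```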

parse: bool, optional (default=True)

if True, parse the data to a pandas dataframe. Only use False for debugging purposes

Returns:
knmi: dataframe

Dataframe with columns as in ‘variables’, and STN, TIME columns

sam.data_sources.read_knmi(start_date, end_date, latitude=52.11, longitude=5.18, freq='hourly', variables='default', find_nonan_station=False, preprocess=False, drop_station=True)

Export historic variables from KNMI, either hourly or daily. There are many weather stations in the Netherlands; this function selects the station physically closest to the desired location and uses that station. KNMI only has historic data: usually, the most recent datapoint is about half a day before the current time. If start_date and/or end_date fall after the most recent available datapoint, the unavailable datapoints are simply not included in the results, not even as missing data.
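The ‘physically closest station’ selection can be sketched with a great-circle (haversine) distance over a small station table. The station numbers and coordinates below are approximate and purely illustrative; the real list comes from read_knmi_stations.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371 * math.asin(math.sqrt(a))

# Hypothetical station table (station number -> approximate lat, lon)
stations = {260: (52.10, 5.18),   # De Bilt
            240: (52.32, 4.79),   # Schiphol
            344: (51.96, 4.45)}   # Rotterdam

def closest_station(lat, lon):
    return min(stations, key=lambda s: haversine_km(lat, lon, *stations[s]))

closest_station(52.09, 5.09)  # 260: De Bilt is nearest to Utrecht
```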

Parameters:
start_date: str or datetime-like

the start time of the period from which to export weather. If str, must be in the format %Y-%m-%d %H:%M:%S or %Y-%m-%d

end_date: str or datetime-like

the end time of the period from which to export weather. If str, must be in the format %Y-%m-%d %H:%M:%S or %Y-%m-%d

latitude: float, optional (default=52.11)

latitude of the location from which to export weather. By default, use location of weather station De Bilt

longitude: float, optional (default=5.18)

longitude of the location from which to export weather. By default, use location of weather station De Bilt

freq: str, optional (default=’hourly’)

frequency of export. Must be ‘hourly’ or ‘daily’

variables: str, None or list, optional (default=’default’)

knmi-variables to export. See all hourly variables here, or all daily variables here. By default, export [average temperature, sunshine duration, rainfall], which is [‘RH’, ‘SQ’, ‘T’] for hourly and [‘RH’, ‘SQ’, ‘TG’] for daily. If None, all variables will be collected

find_nonan_station: bool, optional (default=False)

by default (False), return the closest station even if its data includes NaNs. If True, return the closest station whose data does not include NaNs

preprocess: bool, optional (default=False)

by default (False), return variables in default units (often 0.1 mm). If True, data is scaled to whole units, and default values of -1 are mapped to 0

drop_station: bool, optional (default=True)

by default (True), drop the ‘STN’ column from the result. If False, the returned dataframe contains a column STN with the station number. This number is the same for all rows, since this function returns data only for the closest station. To get data from multiple stations, use read_knmi_station_data

Returns:
knmi: dataframe

Dataframe with columns as in ‘variables’, and TIME column

Examples

>>> read_knmi('2018-01-01 00:00:00', '2018-01-01 06:00:00', 52.09, 5.09, 'hourly', ['SQ', 'T'])
    SQ     T                TIME
0  0.0  87.0 2018-01-01 00:00:00
1  0.0  85.0 2018-01-01 01:00:00
2  0.0  71.0 2018-01-01 02:00:00
3  0.0  78.0 2018-01-01 03:00:00
4  0.0  80.0 2018-01-01 04:00:00
5  0.0  75.0 2018-01-01 05:00:00
6  0.0  69.0 2018-01-01 06:00:00
sam.data_sources.read_openweathermap(latitude=52.11, longitude=5.18)

Use openweathermap API to obtain a weather forecast. This forecast has a frequency of 3 hours, with a total of 39 observations, meaning the forecast is up to 5 days in the future. The resulting timestamp always uses UTC.

Parameters:
latitude: float, optional (default=52.11)

latitude of the location from which to export weather. By default, use location of weather station De Bilt

longitude: float, optional (default=5.18)

longitude of the location from which to export weather. By default, use location of weather station De Bilt

Returns:
forecast: dataframe

Dataframe with a TIME column containing the time of each forecast (timezone UTC), and the following columns:

  • cloud_coverage, in %

  • humidity, in %

  • pressure: generally same as pressure_sealevel, in hPa

  • pressure_groundlevel, in hPa

  • pressure_sealevel, in hPa

  • temp, in Celsius

  • temp_max, in Celsius

  • temp_min, in Celsius

  • rain_3h: volume of the last 3h, in mm

  • wind_deg: wind direction in degrees (meteorological)

  • wind_speed, in meter/sec

Examples

>>> read_openweathermap(52.11, 5.18)  
    cloud_coverage  pressure_groundlevel  humidity  pressure  pressure_sealevel  temp  temp_max  temp_min  rain_3h  wind_deg  wind_speed                TIME
0               92                991.91        95    992.77             992.77  8.82      8.82      7.20    1.005   225.510       11.82 2019-03-07 15:00:00
1               92                991.57        91    992.55             992.55  8.01      8.01      6.79    0.280   223.501       13.01 2019-03-07 18:00:00
...
39              80               1009.42        73   1010.39            1010.39  8.41      8.41      8.41    0.090   204.502       10.28 2019-03-12 12:00:00
sam.data_sources.read_regenradar(start_date: str, end_date: str, latitude: float = 52.0237687, longitude: float = 5.5920412, freq: str = '5min', batch_size: str = '7D', crs: str = 'EPSG:4326', **kwargs) -> DataFrame

Export historic precipitation from Nationale Regenradar.

By default, this function collects the best-known information for a single point, given by latitude and longitude in coordinate system EPSG:4326 (WGS84). This can be configured using **kwargs, but this requires some knowledge of the underlying API.

The parameters agg=average, rasters=730d6675, srs=EPSG:4326 are given to the API, as well as start, end and window, given by start_date, end_date and freq. Lastly geom, which is POINT+(latitude+longitude). Alternatively, a different geometry can be passed via the geom argument in **kwargs, and a different coordinate system via the srs argument (a WKT string). For example: geom=’POINT+(191601+500127)’, srs=’epsg:28992’. Exact information about the API specification and possible arguments is unfortunately unknown.
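Assembling those query parameters, including the documented behaviour of dropping any **kwargs entry whose value is None, could look like this. The parameter names are taken from the description above, not from a verified API specification.

```python
def sketch_build_params(start_date, end_date, freq, latitude, longitude, **kwargs):
    """Assemble regenradar-style query parameters; keys with value None are dropped."""
    params = {
        'agg': 'average',
        'rasters': '730d6675',
        'srs': 'EPSG:4326',
        'start': start_date,
        'end': end_date,
        'window': freq,
        'geom': f'POINT+({latitude}+{longitude})',
    }
    params.update(kwargs)  # caller overrides, e.g. geom=None, srs='epsg:28992'
    return {k: str(v) for k, v in params.items() if v is not None}
```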

Parameters:
start_date: str or datetime-like

the start time of the period from which to export weather. If str, must be in the format %Y-%m-%d or %Y-%m-%d %H:%M:%S

end_date: str or datetime-like

the end time of the period from which to export weather. If str, must be in the format %Y-%m-%d or %Y-%m-%d %H:%M:%S

latitude: float, optional (default=52.0237687)

latitude of the location from which to export precipitation

longitude: float, optional (default=5.5920412)

longitude of the location from which to export precipitation

freq: str or DateOffset, default ‘5min’

frequency of export. The minimum, and default, frequency is every 5 minutes. To learn more about frequency strings, see https://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases.

batch_size: str, default ‘7D’

batch size for collecting data from the API to avoid time-out. Default is 7 days.
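One way such batching could be implemented is sketched below; the library's actual splitting logic may differ.

```python
import pandas as pd

def sketch_batches(start, end, batch_size='7D'):
    """Split [start, end] into consecutive windows of at most batch_size."""
    edges = pd.date_range(start, end, freq=batch_size)
    if edges[-1] < pd.Timestamp(end):  # cover the remainder with a final window
        edges = edges.append(pd.DatetimeIndex([end]))
    return list(zip(edges[:-1], edges[1:]))

sketch_batches('2018-01-01', '2018-01-20')  # three windows: Jan 1-8, 8-15, 15-20
```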

crs: str, default ‘EPSG:4326’

coordinate system for provided longitude (x) and latitude (y) values (or geometry by kwargs). Default is WGS84.

kwargs: dict

additional parameters passed in the url. Must be convertible to string. Any entries with a value of None will be ignored and not passed in the url.

Returns:
result: dataframe

Dataframe with column PRECIPITATION and column TIME. PRECIPITATION is the precipitation in the last 5 minutes, in mm.

Examples

>>> from sam.data_sources import read_regenradar
>>> read_regenradar('2018-01-01', '2018-01-01 00:20:00')  
    TIME    PRECIPITATION
0   2018-05-01 00:00:00     0.05
1   2018-05-01 00:05:00     0.09
2   2018-05-01 00:10:00     0.09
3   2018-05-01 00:15:00     0.07
4   2018-05-01 00:20:00     0.04
>>> # Example of using alternative **kwargs
>>> # For more info about these parameters, ask regenradar experts at RHDHV
>>> read_regenradar(
...     '2018-01-01',
...     '2018-01-01 00:20:00',
...     boundary_type='MUNICIPALITY',
...     geom_id=95071,
...     geom=None,
... )  
    TIME    PRECIPITATION
0   2018-05-01 00:00:00     0.00
1   2018-05-01 00:05:00     0.00
2   2018-05-01 00:10:00     0.00
3   2018-05-01 00:15:00     0.00
4   2018-05-01 00:20:00     0.00

Mongo wrapper

class sam.data_sources.MongoWrapper(db, collection, location='localhost', port=27017, **kwargs)

Bases: object

Provides a simple wrapper around MongoDB

This class provides a wrapper for basic functionality in MongoDB. We aim to use MongoDB as a storage layer between analyses and e.g. dashboarding.

Parameters:
db: string

Name of the database

collection: string

the name of the collection to fetch

location: string, optional (default=”localhost”)

Location of the database

port: integer, optional (default=27017)

Port that the database is reachable on

**kwargs: arbitrary keyword arguments

Passed through to pymongo.MongoClient

Examples

>>> from sam.data_sources import MongoWrapper  
>>> mon = MongoWrapper('test_magweg','test_magookweg')  
>>> mon.empty().add([{'test': 7}]).get()  

Methods

add(content)

Add content to the collection

empty()

Empty the collection

get([query, as_df])

Get a specific collection from the database

add(content)

Add content to the collection

Parameters:
content: list of dictionaries, or pandas dataframe

list of items to add to the collection

Returns:
result: self
empty()

Empty the collection

Returns:
result: self
get(query={}, as_df=True)

Get a specific collection from the database

Parameters:
query: dictionary-like, optional (default={})

dictionary of parameters to use in the query. e.g. { “address”: “Park Lane 38” }

as_df: boolean, optional (default=True)

return the query results as a Pandas Dataframe

Returns:
result: pandas dataframe, or list of dictionaries

the results of the query
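The chaining in the examples (mon.empty().add(...).get()) works because empty and add return self. The pattern can be illustrated without a running MongoDB server by backing it with a plain list; ChainableStore below is a hypothetical stand-in for illustration, not part of sam.

```python
class ChainableStore:
    """The same chainable pattern as MongoWrapper, backed by a plain list
    instead of a live MongoDB collection (so it runs without a server)."""

    def __init__(self):
        self._items = []

    def empty(self):
        self._items.clear()
        return self  # returning self enables .empty().add(...).get()

    def add(self, content):
        self._items.extend(content)
        return self

    def get(self, query=None):
        query = query or {}
        return [d for d in self._items
                if all(d.get(k) == v for k, v in query.items())]

store = ChainableStore()
store.empty().add([{'test': 7}, {'test': 8}]).get({'test': 7})  # [{'test': 7}]
```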