Data formats

To make collaboration within a project easier, we defined a standard way of storing data. It was designed to be easy to use and to work across the different programming languages. To reshape data from one format to the other, see SAM-format Reshaping.

Right now we use two main formats:

Long format (Sam data)

Many functions work on long format data (usually called sam-data). This is the simplest way to store data: it requires only a few columns, and it stays compact when the time indices are not the same for all signals. In this format we expect the following columns to be present:

TIME, ID, TYPE, VALUE

Those columns stand for:

  • TIME is a datetime column

  • ID specifies the object (usually a location)

  • TYPE specifies what is measured (e.g. precipitation, flow or vibrations)

  • VALUE is the measured sensor value

For example:

TIME                 ID     TYPE   VALUE
2020-01-01 00:00:00  Pump1  Flow   2
2020-01-01 00:02:00  Pump1  Speed  800
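As a minimal sketch, the example above can be built as a pandas DataFrame and checked against the expected columns. The use of pandas and the helper name `is_long_format` are assumptions made for illustration; neither is taken from the SAM API.

```python
import pandas as pd

# Column names taken from the long format description above.
REQUIRED_COLUMNS = ["TIME", "ID", "TYPE", "VALUE"]

# The two example rows from the table above.
long_df = pd.DataFrame(
    {
        "TIME": pd.to_datetime(["2020-01-01 00:00:00", "2020-01-01 00:02:00"]),
        "ID": ["Pump1", "Pump1"],
        "TYPE": ["Flow", "Speed"],
        "VALUE": [2, 800],
    }
)


def is_long_format(df: pd.DataFrame) -> bool:
    """Hypothetical helper: True if all required columns exist and TIME is a datetime column."""
    has_columns = all(col in df.columns for col in REQUIRED_COLUMNS)
    # The dtype check only runs when the TIME column is known to exist.
    return has_columns and pd.api.types.is_datetime64_any_dtype(df["TIME"])
```

A check like this is cheap to run right after loading data, before it is passed on to functions that expect long format input.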

Wide format

This format is used by the models and feature engineers. It usually has the following columns:

TIME, ID1_TYPE1, ID1_TYPE2, ID2_TYPE1, …

Note that _ is the default separator, but it can be changed in the reshaping functions, in case your ID or TYPE contains an underscore.
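The reason for a configurable separator can be illustrated with a small sketch that splits a wide column name back into its ID and TYPE. The helper `split_wide_column` and the `#` separator are hypothetical choices for this illustration, not part of the SAM reshaping functions.

```python
from typing import Tuple


def split_wide_column(name: str, sep: str = "_") -> Tuple[str, str]:
    """Hypothetical helper: split a wide column name into (ID, TYPE) at the first separator.

    Raises ValueError if the separator does not occur in the name.
    """
    id_part, type_part = name.split(sep, 1)
    return id_part, type_part


# With the default separator, a plain ID and TYPE split cleanly.
simple = split_wide_column("Pump1_Flow")

# An underscore inside the ID makes the split ambiguous: the ID is cut short.
ambiguous = split_wide_column("Pump_1_Flow")

# A separator that does not occur in the ID or TYPE avoids the problem.
safe = split_wide_column("Pump_1#Flow", sep="#")
```

Here `simple` is `("Pump1", "Flow")`, `ambiguous` is `("Pump", "1_Flow")`, and `safe` is `("Pump_1", "Flow")`.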

The columns stand for:

  • TIME is a datetime column

  • ID_TYPE columns hold the sensor values for each combination of ID and TYPE

For example:

TIME                 Pump1_Flow  Pump1_Speed  Pump2_Flow
2020-01-01 00:00:00  2           800          0
2020-01-01 00:02:00  3           802          0
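The relation between the two formats can be sketched with plain pandas. This is an illustration under assumptions, not the SAM reshaping function itself: the helper name `long_to_wide` is hypothetical, and the long format rows below are simply the wide example above written out one measurement per row.

```python
import pandas as pd

# Long format version of the wide example above (one row per measurement).
long_df = pd.DataFrame(
    {
        "TIME": pd.to_datetime(["2020-01-01 00:00:00"] * 3 + ["2020-01-01 00:02:00"] * 3),
        "ID": ["Pump1", "Pump1", "Pump2"] * 2,
        "TYPE": ["Flow", "Speed", "Flow"] * 2,
        "VALUE": [2, 800, 0, 3, 802, 0],
    }
)


def long_to_wide(df: pd.DataFrame, sep: str = "_") -> pd.DataFrame:
    """Hypothetical helper: pivot long format data to one column per ID/TYPE pair."""
    # unstack raises a ValueError if a (TIME, ID, TYPE) combination occurs twice.
    wide = df.set_index(["TIME", "ID", "TYPE"])["VALUE"].unstack(["ID", "TYPE"])
    # Flatten the (ID, TYPE) column pairs into names such as "Pump1_Flow".
    wide.columns = [f"{id_}{sep}{type_}" for id_, type_ in wide.columns]
    return wide.reset_index()


wide_df = long_to_wide(long_df)
```

Combinations of TIME, ID and TYPE that are missing in the long data end up as NaN in the wide result, which is why the wide format is less compact when signals do not share the same time index.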

Timezone

SAM does not need timezone information to work. However, some feature engineering functions (such as those using datetime components) expect the data to be in UTC (with or without tz info). It is therefore strongly encouraged to make sure your data is in UTC, to prevent summer/winter-time issues. The feature engineer can take the local time into account while the data remains in UTC.
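A minimal sketch of converting local timestamps to UTC with pandas is shown below. The helper name `to_utc` and the `Europe/Amsterdam` timezone are assumptions chosen for illustration; substitute the timezone your data was recorded in.

```python
import pandas as pd


def to_utc(times: pd.Series, local_tz: str) -> pd.Series:
    """Hypothetical helper: interpret naive timestamps as local time and convert them to UTC."""
    return times.dt.tz_localize(local_tz).dt.tz_convert("UTC")


# Two naive local timestamps around the switch to summer time on 2020-03-29.
local_times = pd.Series(pd.to_datetime(["2020-03-29 01:30:00", "2020-03-29 03:30:00"]))

# Assumption: the data was recorded in Europe/Amsterdam local time.
utc_times = to_utc(local_times, "Europe/Amsterdam")

# Optionally drop the tz info while keeping the UTC clock values.
utc_naive = utc_times.dt.tz_localize(None)
```

On the local clock these two timestamps look two hours apart, but in UTC they are one hour apart, which is the kind of summer/winter-time discrepancy that storing data in UTC avoids.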