TimeSeries

class getml.data.TimeSeries(population, time_stamps, alias=None, alias2=None, peripheral=None, split=None, deep_copy=False, on=None, memory=None, horizon=None, lagged_targets=False, upper_time_stamp=None)

A TimeSeries is a simplifying abstraction that can be used for machine learning problems on time series data.

It unifies Container and DataModel, so you do not have to distinguish between the concrete data and the abstract data model. It also removes the need to define self-joins yourself.

Args:
population (DataFrame or View):

The population table defines the statistical population of the machine learning problem and contains the target variables.

time_stamps (str):

The time stamps used to limit the self-join.

alias (str, optional):

The alias to be used for the population table. If it isn’t set, ‘population’ is used as the alias. To explicitly set an alias for the peripheral table, use with_name().

peripheral (dict, optional):

The peripheral tables are joined onto population or other peripheral tables. Note that you can also pass them using add().

split (StringColumn or StringColumnView, optional):

Contains information on how you want to split population into different Subsets. Also refer to split.

deep_copy (bool, optional):

Whether you want to create deep copies of your tables.

on (None, string, Tuple[str] or List[Union[str, Tuple[str]]], optional):

The join keys to use. If None is passed, everything is joined to everything else.

memory (float, optional):

The maximum difference between the time stamps before data is ‘forgotten’. Limiting your joins using memory can significantly reduce training time. Also refer to time.

horizon (float, optional):

The prediction horizon to apply to this join. Also refer to time.

lagged_targets (bool, optional):

Whether you want to allow lagged targets. If this is set to True, you must also pass a positive, non-zero horizon.

upper_time_stamp (str, optional):

Name of a time stamp in right_df that serves as an upper limit on the join.
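
The interplay of horizon and memory can be pictured with a small, library-free sketch. It does not call getML; visible_rows is a hypothetical helper, and the window rule in its docstring is an assumption about how the self-join is limited, not code taken from the library.

```python
from typing import List, Optional


def visible_rows(
    time_stamps: List[float],
    target_index: int,
    horizon: float = 0.0,
    memory: Optional[float] = None,
) -> List[int]:
    """Return the indices of rows a self-join may use for one target row.

    Assumed rule (illustrative only): a row with time stamp ts is
    visible to a target row with time stamp t when
        ts + horizon <= t             (no look-ahead)
        t < ts + horizon + memory     (only checked if memory is set)
    """
    t = time_stamps[target_index]
    selected = []
    for i, ts in enumerate(time_stamps):
        # Rows that are too recent are hidden by the horizon.
        if ts + horizon > t:
            continue
        # Rows that are too old have been 'forgotten'.
        if memory is not None and t >= ts + horizon + memory:
            continue
        selected.append(i)
    return selected


# Ten rows with time stamps 0.0, 1.0, ..., 9.0.
rows = [float(i) for i in range(10)]

# Under the assumed rule, only rows 6, 7 and 8 are visible to row 9.
print(visible_rows(rows, target_index=9, horizon=1.0, memory=3.0))
```

Under this assumed rule, a horizon of zero leaves the target row inside its own window, which gives one intuition for why lagged_targets requires a positive horizon.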

Example:
# All rows before row 10500 will be used for training.
split = getml.data.split.time(data_all, "rowid", test=10500)

time_series = getml.data.TimeSeries(
    population=data_all,
    time_stamps="rowid",
    split=split,
    lagged_targets=False,
    memory=30,
)

pipe = getml.Pipeline(
    data_model=time_series.data_model,
    feature_learners=[...],
    predictors=...
)

pipe.check(time_series.train)

pipe.fit(time_series.train)

pipe.score(time_series.test)

# To generate predictions on new data,
# it is sufficient to use a Container.
# You don't have to recreate the entire
# TimeSeries, because the abstract data model
# is stored in the pipeline.
container = getml.data.Container(
    population=population_new,
)

# Add the data as a peripheral table, for the
# self-join.
container.add(population=population_new)

predictions = pipe.predict(container.full)
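
The time-based split at the top of the example can be pictured as a per-row labelling. The following is a minimal sketch in plain Python; time_split is a hypothetical helper rather than getml.data.split.time itself, and it assumes that rows at or after the threshold go to the test set, in line with the comment "All rows before row 10500 will be used for training."

```python
from typing import List


def time_split(time_stamps: List[float], test: float) -> List[str]:
    """Label each row 'train' if its time stamp lies before `test`,
    otherwise 'test' (assumed semantics, for illustration only)."""
    return ["train" if ts < test else "test" for ts in time_stamps]


# Four rows around the threshold used in the example.
labels = time_split([10498, 10499, 10500, 10501], test=10500)
print(labels)
```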

Methods

join(right_df[, alias, on, time_stamps, ...])

Joins a DataFrame or View to the population table.

sync()

Synchronizes the last change with the data to avoid warnings that the data has been changed.

Attributes

container

The underlying Container.

data_model

The underlying DataModel.