Container

class getml.data.Container(population=None, peripheral=None, split=None, deep_copy=False, train=None, validation=None, test=None, **kwargs)[source]

A container holds the actual data in the form of a DataFrame or a View.

The purpose of a container is twofold:

  • Assigning concrete data to an abstract DataModel.

  • Storing data and allowing you to reproduce previous results.

Args:
population (DataFrame or View, optional):

The population table defines the statistical population of the machine learning problem and contains the target variables.

peripheral (dict, optional):

The peripheral tables are joined onto population or other peripheral tables. Note that you can also pass them later using add(); a sketch follows the example below.

split (StringColumn or StringColumnView, optional):

Contains information on how you want to split population into different Subsets. Also refer to the split module.

deep_copy (bool, optional):

Whether you want to create deep copies of your tables.

train (DataFrame or View, optional):

The population table used in the train Subset. You can either pass population and split or you can pass the subsets separately using train, validation, test and kwargs.

validation (DataFrame or View, optional):

The population table used in the validation Subset. You can either pass population and split or you can pass the subsets separately using train, validation, test and kwargs.

test (DataFrame or View, optional):

The population table used in the test Subset. You can either pass population and split or you can pass the subsets separately using train, validation, test and kwargs.

kwargs (DataFrame or View, optional):

The population tables used in Subsets other than the predefined train, validation and test subsets. You can name these subsets anything you want and access them just like train, validation and test. You can either pass population and split or pass the subsets separately using train, validation, test and kwargs.

Example:
# Pass the subset.
container = getml.data.Container(my_subset=my_data_frame)

# You can access the subset just like train,
# validation or test
my_pipeline.fit(container.my_subset)
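
A sketch of passing peripheral tables through the peripheral dictionary; the table names below are placeholders for your own DataFrames or Views:

# Peripheral tables can be passed directly through
# the 'peripheral' dictionary instead of calling
# add() later.
container = getml.data.Container(
    population=my_population,
    peripheral={"meta": meta, "order": order, "trans": trans},
)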
Examples:

A DataModel only contains abstract data. When we fit a pipeline, we need to assign concrete data.

Note that this example is taken from the loans notebook.

# The abstract data model is constructed
# using the DataModel class. A data model
# does not contain any actual data. It just
# defines the abstract relational structure.
dm = getml.data.DataModel(
    population_train.to_placeholder("population")
)

dm.add(getml.data.to_placeholder(
    meta=meta,
    order=order,
    trans=trans)
)

dm.population.join(
    dm.trans,
    on="account_id",
    time_stamps=("date_loan", "date")
)

dm.population.join(
    dm.order,
    on="account_id",
)

dm.population.join(
    dm.meta,
    on="account_id",
)

# We now have abstract placeholders on something
# called "population", "meta", "order" and "trans".
# But how do we assign concrete data? By using
# a container.
container = getml.data.Container(
    train=population_train,
    test=population_test
)

# meta, order and trans are either
# DataFrames or Views. Their aliases need
# to match the names of the placeholders in the
# data model.
container.add(
    meta=meta,
    order=order,
    trans=trans
)

# Freezing makes the container immutable.
# This is not required, but often a good idea.
container.freeze()

# When we call 'train', the container
# will return the train set and the
# peripheral tables.
my_pipeline.fit(container.train)

# Same for 'test'
my_pipeline.score(container.test)

If you don’t already have a train and test set, you can use a function from the split module.

split = getml.data.split.random(
    train=0.8, test=0.2)

container = getml.data.Container(
    population=population_all,
    split=split,
)

# The remaining code is the same as in
# the example above. In particular,
# container.train and container.test
# work just like above.
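
The split functions also support a validation set; a minimal sketch (the 80/10/10 ratios are arbitrary):

split = getml.data.split.random(
    train=0.8, validation=0.1, test=0.1)

container = getml.data.Container(
    population=population_all,
    split=split,
)

# container.train, container.validation and
# container.test now all work as shown above.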

Containers can also be used for storing and reproducing your results. A recommended pattern is to assign ‘baseline roles’ to your data frames and then use a View to tweak them:

# Assign baseline roles
data_frame.set_role(["jk"], getml.data.roles.join_key)
data_frame.set_role(["col1", "col2"], getml.data.roles.categorical)
data_frame.set_role(["col3", "col4"], getml.data.roles.numerical)
data_frame.set_role(["col5"], getml.data.roles.target)

# Make the data frame immutable, so in-place operations are
# no longer possible.
data_frame.freeze()

# Save the data frame.
data_frame.save()

# I suspect that col1 leads to overfitting, so I will drop it.
view = data_frame.drop(["col1"])

# Insert the view into a container.
container = getml.data.Container(...)
container.add(some_alias=view)
container.save()

The advantage of this pattern is that you can always retrace your entire pipeline without creating deep copies of the data frames every time you make a small change like the one in our example. Note that the pipeline will record which container you have used.
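
To retrace a result in a later session, a saved container can be reloaded; a minimal sketch, assuming getml.data.load_container and a placeholder id:

# Reload the saved container. (The id below is a
# placeholder; use the id of the container you
# actually saved.)
container = getml.data.load_container("some_container_id")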

Methods

add(*args, **kwargs)

Adds new peripheral data frames or views.

freeze()

Freezes the container, so that changes are no longer possible.

save()

Saves the Container to disk.

sync()

Synchronizes the last change with the data to avoid warnings that the data has been changed. See the sketch at the end of this section.

to_pandas()

Returns the container's contents as pandas DataFrames.
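
A minimal sketch of when sync() helps, assuming data_frame is referenced by container:

# After an in-place change to a data frame that the
# container references, sync() prevents warnings
# that the data has been changed.
data_frame.set_role(["col6"], getml.data.roles.numerical)
container.sync()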