Container
- class getml.data.Container(population=None, peripheral=None, split=None, deep_copy=False, train=None, validation=None, test=None, **kwargs)
A container holds the actual data in the form of a DataFrame or a View.
The purpose of a container is twofold:
- Assigning concrete data to an abstract DataModel.
- Storing data and allowing you to reproduce previous results.
- Args:
  - population (DataFrame or View, optional): The population table defines the statistical population of the machine learning problem and contains the target variables.
  - peripheral (dict, optional): The peripheral tables are joined onto population or other peripheral tables. Note that you can also pass them using add().
  - split (StringColumn or StringColumnView, optional): Contains information on how you want to split population into different Subsets. Also refer to the split module.
  - deep_copy (bool, optional): Whether you want to create deep copies of your tables.
  - train (DataFrame or View, optional): The population table used in the train Subset. You can either pass population and split, or you can pass the subsets separately using train, validation, test and kwargs.
  - validation (DataFrame or View, optional): The population table used in the validation Subset. You can either pass population and split, or you can pass the subsets separately using train, validation, test and kwargs.
  - test (DataFrame or View, optional): The population table used in the test Subset. You can either pass population and split, or you can pass the subsets separately using train, validation, test and kwargs.
  - kwargs (DataFrame or View, optional): The population table used in Subsets other than the predefined train, validation and test subsets. You can give these subsets any name you like and access them just like train, validation and test. You can either pass population and split, or you can pass the subsets separately using train, validation, test and kwargs.
    - Example:

          # Pass the subset.
          container = getml.data.Container(my_subset=my_data_frame)

          # You can access the subset just like train,
          # validation or test
          my_pipeline.fit(container.my_subset)
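The two ways of supplying the population described above (population plus split, or named subsets) can be illustrated with a small pure-Python model. This is only a sketch: ToyContainer is a hypothetical stand-in written for this illustration, it is not part of getml, and it says nothing about how getml implements Container internally. Rows are plain dicts and the split is a plain list of labels.

```python
class ToyContainer:
    """Hypothetical stand-in for illustration only; NOT getml's implementation."""

    def __init__(self, population=None, split=None, **subsets):
        if population is not None and subsets:
            raise ValueError(
                "Pass either population and split, or named subsets, not both."
            )
        if population is not None:
            if split is None or len(split) != len(population):
                raise ValueError("split must provide one label per population row.")
            # Group the population rows by their split label.
            subsets = {}
            for row, label in zip(population, split):
                subsets.setdefault(label, []).append(row)
        self._subsets = dict(subsets)

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails, so every
        # subset name (predefined or custom) is reachable as an attribute.
        try:
            return self.__dict__["_subsets"][name]
        except KeyError:
            raise AttributeError(name) from None


rows = [{"id": 1}, {"id": 2}, {"id": 3}]

# Variant 1: one population table plus one split label per row.
by_split = ToyContainer(population=rows, split=["train", "train", "test"])

# Variant 2: subsets passed separately, including a custom name.
by_name = ToyContainer(train=rows[:2], my_subset=rows[2:])
```

In both variants the subsets are then read back by name, for example by_split.train or by_name.my_subset, which mirrors the access pattern container.my_subset shown in the argument list.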
- Examples:
A DataModel only contains abstract data. When we fit a pipeline, we need to assign concrete data.
Note that this example is taken from the loans notebook.

    # The abstract data model is constructed
    # using the DataModel class. A data model
    # does not contain any actual data. It just
    # defines the abstract relational structure.
    dm = getml.data.DataModel(
        population_train.to_placeholder("population")
    )

    dm.add(getml.data.to_placeholder(
        meta=meta,
        order=order,
        trans=trans)
    )

    dm.population.join(
        dm.trans,
        on="account_id",
        time_stamps=("date_loan", "date")
    )

    dm.population.join(
        dm.order,
        on="account_id",
    )

    dm.population.join(
        dm.meta,
        on="account_id",
    )

    # We now have abstract placeholders on something
    # called "population", "meta", "order" and "trans".
    # But how do we assign concrete data? By using
    # a container.
    container = getml.data.Container(
        train=population_train,
        test=population_test
    )

    # meta, order and trans are either
    # DataFrames or Views. Their aliases need
    # to match the names of the placeholders in the
    # data model.
    container.add(
        meta=meta,
        order=order,
        trans=trans
    )

    # Freezing makes the container immutable.
    # This is not required, but often a good idea.
    container.freeze()

    # When we call 'train', the container
    # will return the train set and the
    # peripheral tables.
    my_pipeline.fit(container.train)

    # Same for 'test'
    my_pipeline.score(container.test)

If you don’t already have a train and test set, you can use a function from the split module.

    split = getml.data.split.random(train=0.8, test=0.2)

    container = getml.data.Container(
        population=population_all,
        split=split,
    )

    # The remaining code is the same as in
    # the example above. In particular,
    # container.train and container.test
    # work just like above.

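To make the idea of a split column more concrete, here is a rough pure-Python sketch of what a random split amounts to: one subset label per population row, drawn according to the requested shares. The function toy_random_split and its seed parameter are assumptions made up for this illustration; the sketch does not reproduce the algorithm or the signature of getml.data.split.random.

```python
import random


def toy_random_split(n_rows, seed=0, **shares):
    """Hypothetical helper: draw one subset label per row.

    Illustrative only; this is not getml.data.split.random.
    """
    if abs(sum(shares.values()) - 1.0) > 1e-9:
        raise ValueError("Shares must sum to 1.")
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    names = list(shares)
    weights = [shares[name] for name in names]
    return rng.choices(names, weights=weights, k=n_rows)


# One label per row; roughly 80% "train" and 20% "test".
labels = toy_random_split(1000, train=0.8, test=0.2)
```

Because each label is drawn independently, the realized subset sizes only approximate the requested shares. Fixing the seed keeps the assignment stable across runs, which matters when results need to be reproduced.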
Containers can also be used for storing and reproducing your results. A recommended pattern is to assign ‘baseline roles’ to your data frames and then use a View to tweak them:

    # Assign baseline roles
    data_frame.set_role(["jk"], getml.data.roles.join_key)
    data_frame.set_role(["col1", "col2"], getml.data.roles.categorical)
    data_frame.set_role(["col3", "col4"], getml.data.roles.numerical)
    data_frame.set_role(["col5"], getml.data.roles.target)

    # Make the data frame immutable, so in-place operations are
    # no longer possible.
    data_frame.freeze()

    # Save the data frame.
    data_frame.save()

    # I suspect that col1 leads to overfitting, so I will drop it.
    view = data_frame.drop(["col1"])

    # Insert the view into a container.
    container = getml.data.Container(...)
    container.add(some_alias=view)
    container.save()

The advantage of using such a pattern is that it enables you to always completely retrace your entire pipeline without creating deep copies of the data frames whenever you have made a small change like the one in our example. Note that the pipeline will record which container you have used.
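Why a view avoids deep copies can be sketched with a toy model: the view keeps a reference to the unchanged base data and records only the tweak applied on top of it. ToyView is a hypothetical class written for this illustration, and a read-only mapping stands in for a frozen data frame; neither reflects how getml implements View or DataFrame.

```python
from types import MappingProxyType


class ToyView:
    """Hypothetical lazy view: stores the tweak, never copies the base."""

    def __init__(self, base, dropped=()):
        self.base = base  # shared reference to the base data, no copy
        self.dropped = frozenset(dropped)

    def drop(self, names):
        # Returns a new view on the same base with more columns hidden.
        return ToyView(self.base, self.dropped | set(names))

    def columns(self):
        return [name for name in self.base if name not in self.dropped]


# A read-only mapping plays the role of a frozen data frame.
base = MappingProxyType({"col1": [1, 2], "col2": [3, 4], "col5": [0, 1]})

# The tweak from the example above: hide col1 without touching the base.
view = ToyView(base).drop(["col1"])
```

Since the base never changes and the view holds only the list of dropped columns, storing the view is cheap, and the original data can always be recovered from the same base.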
Methods
- add(*args, **kwargs): Adds new peripheral data frames or views.
- freeze(): Freezes the container, so that changes are no longer possible.
- save(): Saves the Container to disk.
- sync(): Synchronizes the last change with the data to avoid warnings that the data has been changed.