StarSchema

class getml.data.StarSchema(population=None, alias=None, peripheral=None, split=None, deep_copy=False, train=None, validation=None, test=None, **kwargs)

A StarSchema is a simplifying abstraction for machine learning problems whose data can be organized in a simple star schema.

It unifies Container and DataModel, abstracting away the need to differentiate between the concrete data and the abstract data model.

The class is designed using composition: it is neither a Container nor a DataModel, but has both of them.

This means that you can always fall back to the more flexible methods using Container and DataModel by directly accessing the attributes container and data_model.
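The composition idea can be illustrated with a minimal pure-Python sketch. The classes `ToyDataModel`, `ToyContainer` and `ToyStarSchema` are hypothetical stand-ins invented for this illustration; they are not part of getML and do not reflect how the library is implemented. The sketch only shows the "has both of them" relationship and why the two attributes remain directly accessible.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDataModel:
    """Hypothetical stand-in for the abstract side: names and join keys only."""
    population: str
    joins: list = field(default_factory=list)


@dataclass
class ToyContainer:
    """Hypothetical stand-in for the concrete side: the actual tables."""
    subsets: dict = field(default_factory=dict)
    peripheral: dict = field(default_factory=dict)


class ToyStarSchema:
    """Owns one ToyContainer and one ToyDataModel; inherits from neither."""

    def __init__(self, **subsets):
        self.container = ToyContainer(subsets=dict(subsets))
        self.data_model = ToyDataModel(population="population")

    def join(self, name, table, on):
        # A single call updates both halves: the data and its description.
        self.container.peripheral[name] = table
        self.data_model.joins.append((name, on))


schema = ToyStarSchema(train=[{"account_id": 1}], test=[{"account_id": 2}])
schema.join("trans", [{"account_id": 1, "amount": 10.0}], on="account_id")

# Both parts stay reachable for lower-level work.
print(schema.data_model.joins)
print(sorted(schema.container.subsets))
```

In this toy version, one `join` call keeps the data and the abstract description in sync, which is the convenience a wrapper of this kind provides, while `schema.container` and `schema.data_model` can still be used on their own.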

Args:

population (DataFrame or View, optional):

The population table defines the statistical population of the machine learning problem and contains the target variables.

alias (str, optional):

The alias to be used for the population table. This is required if population is a View.

peripheral (dict, optional):

The peripheral tables are joined onto population or other peripheral tables. Note that you can also pass them using join().

split (StringColumn or StringColumnView, optional):

Contains information on how you want to split population into different Subsets. Also refer to split.

deep_copy (bool, optional):

Whether you want to create deep copies of your tables.

train (DataFrame or View, optional):

The population table used in the train Subset. You can either pass population and split or you can pass the subsets separately using train, validation, test and kwargs.

validation (DataFrame or View, optional):

The population table used in the validation Subset. You can either pass population and split or you can pass the subsets separately using train, validation, test and kwargs.

test (DataFrame or View, optional):

The population table used in the test Subset. You can either pass population and split or you can pass the subsets separately using train, validation, test and kwargs.

kwargs (DataFrame or View, optional):

The population table used in Subsets other than the predefined train, validation and test subsets. You can call these subsets anything you want and can access them just like train, validation and test. You can either pass population and split or you can pass the subsets separately using train, validation, test and kwargs.

Example:

# Pass the subset.
star_schema = getml.data.StarSchema(
    my_subset=my_data_frame)

# You can access the subset just like train,
# validation or test
my_pipeline.fit(star_schema.my_subset)

Examples:

Note that this example is taken from the loans notebook.

You might also want to refer to DataFrame, View and Pipeline.

# First, we insert our data.
# population_train and population_test are either
# DataFrames or Views. The population table
# defines the statistical population of your
# machine learning problem and contains the
# target variables.
star_schema = getml.data.StarSchema(
    train=population_train,
    test=population_test
)

# meta, order and trans are either
# DataFrames or Views.
# Because this is a star schema,
# all joins take place on the population
# table.
star_schema.join(
    trans,
    on="account_id",
    time_stamps=("date_loan", "date")
)

star_schema.join(
    order,
    on="account_id",
)

star_schema.join(
    meta,
    on="account_id",
)

# Now you can insert your data model,
# your preprocessors, feature learners,
# feature selectors and predictors
# into the pipeline.
# Note that the pipeline only knows
# the abstract data model, but hasn't
# seen the actual data yet.
pipe = getml.Pipeline(
    data_model=star_schema.data_model,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    feature_selectors=[feature_selector],
    predictors=predictor,
)

# Now, we pass the actual data.
# This passes 'population_train' and the
# peripheral tables (meta, order and trans)
# to the pipeline.
pipe.check(star_schema.train)

pipe.fit(star_schema.train)

pipe.score(star_schema.test)

# To generate predictions on new data,
# it is sufficient to use a Container.
# You don't have to recreate the entire
# StarSchema, because the abstract data model
# is stored in the pipeline.
container = getml.data.Container(
    population=population_new)

container.add(
    trans=trans_new,
    order=order_new,
    meta=meta_new)

predictions = pipe.predict(container.full)

If you don’t already have a train and test set, you can use a function from the split module.

split = getml.data.split.random(
    train=0.8, test=0.2)

star_schema = getml.data.StarSchema(
    population=population_all,
    split=split,
)

# The remaining code is the same as in
# the example above. In particular,
# star_schema.train and star_schema.test
# work just like above.
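Conceptually, a random split assigns each row of the population table to one subset with the given probabilities. The following pure-Python sketch illustrates that idea only; `toy_random_split` and its `seed` parameter are hypothetical names made up for this example, and the code is not getML's implementation of `getml.data.split.random`.

```python
import random


def toy_random_split(n_rows, train=0.8, test=0.2, seed=5849):
    """Return one subset label per row, drawn with the given probabilities."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    return rng.choices(["train", "test"], weights=[train, test], k=n_rows)


rows = list(range(1000))  # stand-in for the rows of population_all
labels = toy_random_split(len(rows))

# Each subset contains exactly the rows that carry its label.
train_rows = [r for r, label in zip(rows, labels) if label == "train"]
test_rows = [r for r, label in zip(rows, labels) if label == "test"]

print(len(train_rows), len(test_rows))
```

Because the labels are drawn at random, the subset sizes only approximate the requested 80/20 proportions; every row still ends up in exactly one subset.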

Methods

`join`(right_df[, alias, on, time_stamps, ...])	Joins a `DataFrame` or `View` to the population table.
`sync`()	Synchronizes the last change with the data to avoid warnings that the data has been changed.

Attributes

`container`	The underlying `Container`.
`data_model`	The underlying `DataModel`.