StarSchema

class getml.data.StarSchema(population=None, alias=None, peripheral=None, split=None, deep_copy=False, train=None, validation=None, test=None, **kwargs)

A StarSchema is a simplifying abstraction that can be used for machine learning problems whose data can be organized in a simple star schema.
It unifies Container and DataModel, thus abstracting away the need to differentiate between the concrete data and the abstract data model.

The class is designed using composition: it is neither a Container nor a DataModel, but has both of them. This means that you can always fall back to the more flexible methods using Container and DataModel by directly accessing the attributes container and data_model.

Args:
- population (DataFrame or View, optional):
  The population table defines the statistical population of the machine learning problem and contains the target variables.

- alias (str, optional):
  The alias to be used for the population table. This is required if population is a View.

- peripheral (dict, optional):
  The peripheral tables are joined onto population or other peripheral tables. Note that you can also pass them using join().

- split (StringColumn or StringColumnView, optional):
  Contains information on how you want to split population into different Subsets. Also refer to split.

- deep_copy (bool, optional):
  Whether you want to create deep copies of your tables.

- train (DataFrame or View, optional):
  The population table used in the train Subset. You can either pass population and split, or you can pass the subsets separately using train, validation, test and kwargs.

- validation (DataFrame or View, optional):
  The population table used in the validation Subset. You can either pass population and split, or you can pass the subsets separately using train, validation, test and kwargs.

- test (DataFrame or View, optional):
  The population table used in the test Subset. You can either pass population and split, or you can pass the subsets separately using train, validation, test and kwargs.

- kwargs (DataFrame or View, optional):
  The population tables used in Subsets other than the predefined train, validation and test subsets. You can call these subsets anything you want and can access them just like train, validation and test. You can either pass population and split, or you can pass the subsets separately using train, validation, test and kwargs.

Example:

    # Pass the subset.
    star_schema = getml.data.StarSchema(
        my_subset=my_data_frame)

    # You can access the subset just like train,
    # validation or test.
    my_pipeline.fit(star_schema.my_subset)
Examples:

Note that this example is taken from the loans notebook. You might also want to refer to DataFrame, View and Pipeline.

    # First, we insert our data.
    # population_train and population_test are either
    # DataFrames or Views. The population table
    # defines the statistical population of your
    # machine learning problem and contains the
    # target variables.
    star_schema = getml.data.StarSchema(
        train=population_train,
        test=population_test
    )

    # meta, order and trans are either
    # DataFrames or Views.
    # Because this is a star schema,
    # all joins take place on the population
    # table.
    star_schema.join(
        trans,
        on="account_id",
        time_stamps=("date_loan", "date")
    )

    star_schema.join(
        order,
        on="account_id",
    )

    star_schema.join(
        meta,
        on="account_id",
    )

    # Now you can insert your data model,
    # your preprocessors, feature learners,
    # feature selectors and predictors
    # into the pipeline.
    # Note that the pipeline only knows
    # the abstract data model, but hasn't
    # seen the actual data yet.
    pipe = getml.Pipeline(
        data_model=star_schema.data_model,
        preprocessors=[mapping],
        feature_learners=[fast_prop],
        feature_selectors=[feature_selector],
        predictors=predictor,
    )

    # Now, we pass the actual data.
    # This passes 'population_train' and the
    # peripheral tables (meta, order and trans)
    # to the pipeline.
    pipe.check(star_schema.train)
    pipe.fit(star_schema.train)
    pipe.score(star_schema.test)

    # To generate predictions on new data,
    # it is sufficient to use a Container.
    # You don't have to recreate the entire
    # StarSchema, because the abstract data model
    # is stored in the pipeline.
    container = getml.data.Container(
        population=population_new)

    container.add(
        trans=trans_new,
        order=order_new,
        meta=meta_new)

    predictions = pipe.predict(container.full)
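To make the "all joins take place on the population table" idea concrete, here is a plain-Python sketch of the star-schema join pattern with hypothetical toy data (simple dicts standing in for DataFrames; none of this calls the actual getml engine):

    # Conceptual sketch only. The key name "account_id" mirrors the
    # example above; the data and the star_join helper are made up
    # for illustration.
    population = [
        {"account_id": 1, "target": 0},
        {"account_id": 2, "target": 1},
    ]

    # A peripheral table: several rows can refer to the same
    # population row.
    trans = [
        {"account_id": 1, "amount": 100.0},
        {"account_id": 1, "amount": 250.0},
        {"account_id": 2, "amount": 80.0},
    ]

    def star_join(population, peripheral, on):
        """Collect the matching peripheral rows for each population row.

        In a star schema, every peripheral table joins directly onto
        the population table via a shared key, never onto another
        peripheral table.
        """
        joined = []
        for row in population:
            matches = [p for p in peripheral if p[on] == row[on]]
            joined.append({**row, "matches": matches})
        return joined

    result = star_join(population, trans, on="account_id")

Feature learners then aggregate over each row's list of matches; that is what distinguishes this one-to-many setup from an ordinary flat-table join.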
If you don't already have a train and test set, you can use a function from the split module.

    split = getml.data.split.random(
        train=0.8, test=0.2)

    star_schema = getml.data.StarSchema(
        population=population_all,
        split=split,
    )

    # The remaining code is the same as in
    # the example above. In particular,
    # star_schema.train and star_schema.test
    # work just like above.
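Conceptually, the split column assigns each population row a subset label, and StarSchema groups rows by that label. The following is a stand-alone sketch of that idea in plain Python; it is not getml's implementation (the real getml.data.split.random returns a StringColumnView that is evaluated lazily by the engine):

    import random

    def random_split(n_rows, train=0.8, test=0.2, seed=42):
        """Assign each row a subset label with the given probabilities.

        Hypothetical stand-in used only to illustrate what a split
        column contains: one label per population row.
        """
        rng = random.Random(seed)
        return ["train" if rng.random() < train else "test"
                for _ in range(n_rows)]

    labels = random_split(1000)

Because the assignment is random rather than exact, the resulting subset sizes only approximate the requested fractions, which is also why you should not expect an exactly 80/20 row count from the real function.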
Methods

join(right_df[, alias, on, time_stamps, …])
    Joins a DataFrame or View onto the population table or a peripheral table.

sync()
    Synchronizes the last change with the data to avoid warnings that the data has been changed.
Attributes

container
    The underlying Container.

data_model
    The underlying DataModel.
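The composition design described above (a StarSchema has a container and a data_model but is neither) can be sketched as a minimal skeleton. This is a hypothetical illustration of the pattern, not getml's actual classes:

    class Container:
        """Stand-in for getml.data.Container: holds the concrete data."""
        def __init__(self, **subsets):
            self.subsets = subsets

    class DataModel:
        """Stand-in for getml.data.DataModel: the abstract schema only."""
        def __init__(self, population_name):
            self.population_name = population_name

    class StarSchema:
        """Composition: delegates to its two parts instead of
        inheriting from either of them."""
        def __init__(self, population_name, **subsets):
            self.container = Container(**subsets)
            self.data_model = DataModel(population_name)

        def __getattr__(self, name):
            # Fall through to the named subsets, so schema.train
            # works just like schema.container.subsets["train"].
            try:
                return self.container.subsets[name]
            except KeyError:
                raise AttributeError(name)

    schema = StarSchema("population", train=[1, 2], test=[3])

This is why you can always drop down a level: anything the convenience wrapper does not expose is still reachable through the container and data_model attributes directly.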