Pipeline

class getml.Pipeline(data_model: Optional[DataModel] = None, peripheral: Optional[List[Placeholder]] = None, preprocessors: Optional[Union[CategoryTrimmer, EmailDomain, Imputation, Mapping, Seasonal, Substring, TextFieldSplitter, List[Union[CategoryTrimmer, EmailDomain, Imputation, Mapping, Seasonal, Substring, TextFieldSplitter]]]] = None, feature_learners: Optional[Union[Union[Fastboost, FastProp, Multirel, Relboost, RelMT], List[Union[Fastboost, FastProp, Multirel, Relboost, RelMT]]]] = None, feature_selectors: Optional[Union[Union[LinearRegression, LogisticRegression, XGBoostClassifier, XGBoostRegressor, ScaleGBMClassifier, ScaleGBMRegressor], List[Union[LinearRegression, LogisticRegression, XGBoostClassifier, XGBoostRegressor, ScaleGBMClassifier, ScaleGBMRegressor]]]] = None, predictors: Optional[Union[LinearRegression, LogisticRegression, XGBoostClassifier, XGBoostRegressor, ScaleGBMClassifier, ScaleGBMRegressor, List[Union[LinearRegression, LogisticRegression, XGBoostClassifier, XGBoostRegressor, ScaleGBMClassifier, ScaleGBMRegressor]]]] = None, loss_function: Optional[str] = None, tags: Optional[list[str]] = None, include_categorical: bool = False, share_selected_features: float = 0.5)

A Pipeline is the main class for feature learning and prediction.

Args:

data_model (DataModel): Abstract representation of the data model, which defines the relationships between the tables. Required for the feature learners.
peripheral (Union[Placeholder, List[Placeholder]], optional): Abstract representations of the additional tables used to augment the information provided in the population table. These have to be the same objects that were join()ed onto the population Placeholder. If you pass the data frames as a list, their order determines the order of the peripheral DataFrames passed to the ‘peripheral_tables’ argument in check(), fit(), predict(), score(), and transform(). If you omit the peripheral placeholders, they will be inferred from the data model and ordered alphabetically.
preprocessors (Union[_Preprocessor, List[_Preprocessor]], optional): The preprocessor(s) to be used. Must be from preprocessors. A single preprocessor does not have to be wrapped in a list.
feature_learners (Union[_FeatureLearner, List[_FeatureLearner]], optional): The feature learner(s) to be used. Must be from feature_learning. A single feature learner does not have to be wrapped in a list.
feature_selectors (Union[_Predictor, List[_Predictor]], optional): Predictor(s) used to select the best features. Must be from predictors. A single feature selector does not have to be wrapped in a list. Make sure to also set share_selected_features.
predictors (Union[_Predictor, List[_Predictor]], optional): Predictor(s) used to generate the predictions. If more than one predictor is passed, the predictions generated will be averaged. Must be from predictors. A single predictor does not have to be wrapped in a list.
loss_function (str or None): The loss function to use for the feature learners.
tags (List[str], optional): Tags exist to help you organize your pipelines. You can add any tags that help you remember what you were trying to do.
include_categorical (bool, optional): Whether you want to pass categorical columns in the population table to the predictor.
share_selected_features (float, optional): The share of features you want the feature selection to keep. When set to 0.0, all features are kept.
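
Two of the rules above, averaging across several predictors and keeping a share of the features, can be illustrated without the engine. The following is a minimal sketch in plain Python, not getML code: `average_predictions` and `select_features` are hypothetical helpers, and the rounding rule in `select_features` is an assumption, since the argument description does not say how a fractional feature count is resolved.

```python
import math
from typing import Dict, List


def average_predictions(per_predictor: List[List[float]]) -> List[float]:
    """Average the predictions of several predictors row by row."""
    if not per_predictor:
        raise ValueError("need at least one predictor")
    n_rows = len(per_predictor[0])
    if any(len(preds) != n_rows for preds in per_predictor):
        raise ValueError("all predictors must predict the same rows")
    # zip(*...) groups the predictions that belong to the same row.
    return [sum(row) / len(per_predictor) for row in zip(*per_predictor)]


def select_features(importances: Dict[str, float], share: float) -> List[str]:
    """Keep the most important `share` of the features; 0.0 keeps all."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must lie in [0.0, 1.0]")
    ranked = sorted(importances, key=lambda name: importances[name], reverse=True)
    if share == 0.0:
        return ranked
    # Assumption: fractional counts are rounded up, and at least one
    # feature is kept. The real engine may resolve this differently.
    n_keep = max(1, math.ceil(share * len(ranked)))
    return ranked[:n_keep]


averaged = average_predictions([[1.0, 2.0], [3.0, 4.0]])
kept = select_features({"f1": 0.1, "f2": 0.6, "f3": 0.25, "f4": 0.05}, share=0.5)
```

With the default `share_selected_features=0.5` and four candidate features, this sketch keeps the two with the highest importance.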

Examples:

We assume that you have already set up your preprocessors (refer to preprocessors), your feature learners (refer to feature_learning) as well as your feature selectors and predictors (refer to predictors, which can be used for prediction and feature selection).

You might also want to refer to DataFrame, View, DataModel, Container, Placeholder and StarSchema.

If you want to create features for a time series problem, the easiest way to do so is to use the TimeSeries abstraction.

Note that this example is taken from the robot notebook.

# All rows before row 10500 will be used for training.
split = getml.data.split.time(data_all, "rowid", test=10500)

time_series = getml.data.TimeSeries(
    population=data_all,
    time_stamps="rowid",
    split=split,
    lagged_targets=False,
    memory=30,
)

pipe = getml.Pipeline(
    data_model=time_series.data_model,
    feature_learners=[...],
    predictors=...
)

pipe.check(time_series.train)

pipe.fit(time_series.train)

pipe.score(time_series.test)

# To generate predictions on new data,
# it is sufficient to use a Container.
# You don't have to recreate the entire
# TimeSeries, because the abstract data model
# is stored in the pipeline.
container = getml.data.Container(
    population=population_new,
)

# Add the data as a peripheral table, for the
# self-join.
container.add(population=population_new)

predictions = pipe.predict(container.full)
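
The split at the start of this example assigns every row to a subset by comparing its time stamp with a cut-off. The sketch below mirrors only what the comment in the example states (rows before 10500 are used for training). It is plain Python with a hypothetical helper `split_by_time`, not the implementation of `getml.data.split.time`.

```python
from typing import List


def split_by_time(time_stamps: List[float], test: float) -> List[str]:
    """Label a row 'train' if its time stamp lies before `test`, else 'test'."""
    return ["train" if ts < test else "test" for ts in time_stamps]


# Four consecutive row IDs around the cut-off used in the example.
rowids = [10498, 10499, 10500, 10501]
labels = split_by_time(rowids, test=10500)
```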

If your data can be organized in a simple star schema, you can use StarSchema. StarSchema unifies Container and DataModel.

Note that this example is taken from the loans notebook:

# First, we insert our data into a StarSchema.
# population_train and population_test are either
# DataFrames or Views. The population table
# defines the statistical population of your
# machine learning problem and contains the
# target variables.
star_schema = getml.data.StarSchema(
    train=population_train,
    test=population_test
)

# meta, order and trans are either
# DataFrames or Views.
# Because this is a star schema,
# all joins take place on the population
# table.
star_schema.join(
    trans,
    on="account_id",
    time_stamps=("date_loan", "date")
)

star_schema.join(
    order,
    on="account_id",
)

star_schema.join(
    meta,
    on="account_id",
)

# Now you can insert your data model,
# your preprocessors, feature learners,
# feature selectors and predictors
# into the pipeline.
# Note that the pipeline only knows
# the abstract data model, but hasn't
# seen the actual data yet.
pipe = getml.Pipeline(
    data_model=star_schema.data_model,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    feature_selectors=[feature_selector],
    predictors=predictor,
)

# Now, we pass the actual data.
# This passes 'population_train' and the
# peripheral tables (meta, order and trans)
# to the pipeline.
pipe.check(star_schema.train)

pipe.fit(star_schema.train)

pipe.score(star_schema.test)

StarSchema is simpler, but cannot be used for more complex data models. The general approach is to use Container and DataModel:

# First, we insert our data into a Container.
# population_train and population_test are either
# DataFrames or Views.
container = getml.data.Container(
    train=population_train,
    test=population_test
)

# meta, order and trans are either
# DataFrames or Views. They are given
# aliases, so we can refer to them in the
# DataModel.
container.add(
    meta=meta,
    order=order,
    trans=trans
)

# Freezing makes the container immutable.
# This is not required, but often a good idea.
container.freeze()

# The abstract data model is constructed
# using the DataModel class. A data model
# does not contain any actual data. It just
# defines the abstract relational structure.
dm = getml.data.DataModel(
    population_train.to_placeholder("population")
)

dm.add(
    getml.data.to_placeholder(
        meta=meta,
        order=order,
        trans=trans,
    )
)

dm.population.join(
    dm.trans,
    on="account_id",
    time_stamps=("date_loan", "date")
)

dm.population.join(
    dm.order,
    on="account_id",
)

dm.population.join(
    dm.meta,
    on="account_id",
)

# Now you can insert your data model,
# your preprocessors, feature learners,
# feature selectors and predictors
# into the pipeline.
# Note that the pipeline only knows
# the abstract data model, but hasn't
# seen the actual data yet.
pipe = getml.Pipeline(
    data_model=dm,
    preprocessors=[mapping],
    feature_learners=[fast_prop],
    feature_selectors=[feature_selector],
    predictors=predictor,
)

# This passes 'population_train' and the
# peripheral tables (meta, order and trans)
# to the pipeline.
pipe.check(container.train)

pipe.fit(container.train)

pipe.score(container.test)

Technically, you do not have to use a Container. A Container is just syntactic sugar for passing the population table and a dictionary of peripheral tables directly:

pipe.check(
    population_train,
    {"meta": meta, "order": order, "trans": trans},
)

pipe.fit(
    population_train,
    {"meta": meta, "order": order, "trans": trans},
)

pipe.score(
    population_test,
    {"meta": meta, "order": order, "trans": trans},
)

You can also pass the peripheral tables as a list. Their order can be inferred from the __repr__ method of the pipeline and is usually alphabetical:

pipe.check(
    population_train,
    [meta, order, trans],
)

pipe.fit(
    population_train,
    [meta, order, trans],
)

pipe.score(
    population_test,
    [meta, order, trans],
)
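
When the peripheral tables are passed as a list, the position of each table carries the meaning that the dictionary keys carry otherwise. The following plain-Python sketch shows one way to derive the list from a dictionary, assuming the alphabetical order described above. `to_ordered_list` is a hypothetical helper, and the strings stand in for real DataFrames or Views.

```python
from typing import Dict, List, TypeVar

T = TypeVar("T")


def to_ordered_list(tables: Dict[str, T], placeholder_names: List[str]) -> List[T]:
    """Arrange the tables in the order of the peripheral placeholder names."""
    missing = [name for name in placeholder_names if name not in tables]
    if missing:
        raise KeyError(f"no table passed for placeholders: {missing}")
    return [tables[name] for name in placeholder_names]


# Stand-ins for the actual data frames, keyed by placeholder name.
tables = {"trans": "trans_df", "meta": "meta_df", "order": "order_df"}

# Assumption: the placeholders were inferred from the data model,
# so they are ordered alphabetically.
inferred_order = sorted(tables)
ordered = to_ordered_list(tables, inferred_order)
```

Passing a dictionary avoids this bookkeeping entirely, which is why it is the safer choice when the order of the placeholders is not obvious.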

Methods

`check`(population_table[, peripheral_tables])	Checks the validity of the data model.
`delete`()	Deletes the pipeline from the engine.
`deploy`(deploy)	Allows a fitted pipeline to be addressable via an HTTP request.
`fit`(population_table[, peripheral_tables, ...])	Trains the feature learning algorithms, feature selectors and predictors.
`predict`(population_table[, ...])	Forecasts on new, unseen data using the trained `predictor`.
`refresh`()	Reloads the pipeline from the engine.
`score`(population_table[, peripheral_tables])	Calculates the performance of the `predictor`.
`transform`(population_table[, ...])	Translates new data into the trained features.

Attributes

`accuracy`	A convenience wrapper to retrieve the accuracy of the latest scoring run (the last time .score() was called) on the pipeline.
`auc`	A convenience wrapper to retrieve the auc of the latest scoring run (the last time .score() was called) on the pipeline.
`columns`	`Columns` object that can be used to handle information about the original columns utilized by the feature learners.
`cross_entropy`	A convenience wrapper to retrieve the cross entropy of the latest scoring run (the last time .score() was called) on the pipeline.
`features`	`Features` object that can be used to handle the features generated by the feature learners.
`fitted`	Whether the pipeline has already been fitted.
`id`	ID of the pipeline.
`is_classification`	Whether the pipeline can be used for classification problems.
`is_regression`	Whether the pipeline can be used for regression problems.
`mae`	A convenience wrapper to retrieve the mae of the latest scoring run (the last time .score() was called) on the pipeline.
`metadata`	Contains information on the data frames that were passed to .fit(...).
`name`	Returns the ID of the pipeline.
`plots`	`Plots` object that can be used to generate plots like an ROC curve or a lift curve.
`rmse`	A convenience wrapper to retrieve the rmse of the latest scoring run (the last time .score() was called) on the pipeline.
`rsquared`	A convenience wrapper to retrieve the rsquared of the latest scoring run (the last time .score() was called) on the pipeline.
`scored`	Whether the pipeline has been scored.
`scores`	Contains all scores generated by `score()`.
`tables`	`Tables` object that can be used to handle information about the original tables utilized by the feature learners.
`targets`	Contains the names of the targets used for this pipeline.