Pipeline

class getml.pipeline.Pipeline(population=None, peripheral=None, preprocessors=None, feature_learners=None, feature_selectors=None, predictors=None, tags=None, include_categorical=False, share_selected_features=0.5)[source]

A Pipeline is the main class for feature learning and prediction.

Example:

We assume that you have already set up your preprocessors (see preprocessors), your feature learners (see feature_learning), and your feature selectors and predictors (see predictors; predictors can be used both for prediction and for feature selection).
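Such a setup might look roughly like the following sketch. The feature learner class names and their default constructors are assumptions that may differ between getml versions; XGBoostClassifier comes from predictors and can serve both as feature selector and as predictor.

import getml

# Illustrative component setup -- the feature learner class names below
# are assumptions; check the feature_learning module of your getml
# version for the exact names and parameters.
feature_learner_1 = getml.feature_learning.Multirel()    # assumed class name
feature_learner_2 = getml.feature_learning.Relboost()    # assumed class name

# Predictors can double as feature selectors.
feature_selector = getml.predictors.XGBoostClassifier()
predictor = getml.predictors.XGBoostClassifier()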

For more detailed information on how to set up your data model, please refer to the documentation of the Placeholder.

population_placeholder = getml.data.Placeholder("population")

order_placeholder = getml.data.Placeholder("order")

trans_placeholder = getml.data.Placeholder("trans")

population_placeholder.join(order_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp"
)

population_placeholder.join(trans_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp"
)

pipe = getml.pipeline.Pipeline(
    tags=["multirel", "relboost", "31 features"],
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)

# You can pass the peripheral tables as a list. In that
# case they have to match the order in which you have passed
# the peripheral placeholders to the pipeline.
pipe.check(
    population_table=population_training,
    peripheral_tables=[order, trans]
)

# You can also pass them as a dictionary, in which
# case their order doesn't matter, but the keys
# of the dictionary need to match the names of the
# peripheral placeholders.
pipe.check(
    population_table=population_training,
    peripheral_tables={"order": order, "trans": trans}
)

# Everything we have discussed above applies to
# .fit(...), .score(...), .predict(...) and .transform(...)
# as well.
pipe.fit(
    population_table=population_training,
    peripheral_tables={"order": order, "trans": trans}
)

pipe.score(
    population_table=population_testing,
    peripheral_tables={"order": order, "trans": trans}
)

Args:
population (getml.data.Placeholder, optional):

Abstract representation of the population table, which defines the statistical population and contains the target variables.

peripheral (Union[Placeholder, List[Placeholder]], optional):

Abstract representations of the additional tables used to augment the information provided in population. These have to be the same objects that were joined onto the population Placeholder. Their order determines the order of the peripheral DataFrames passed to the 'peripheral_tables' argument in check(), fit(), predict(), score(), and transform(), if you pass the data frames as a list. If you omit the peripheral placeholders, they will be inferred from the population placeholder and ordered alphabetically.

preprocessors (Union[_Preprocessor, List[_Preprocessor]], optional):

The preprocessor(s) to be used. Must be from preprocessors. A single preprocessor does not have to be wrapped in a list.

feature_learners (Union[_FeatureLearner, List[_FeatureLearner]], optional):

The feature learner(s) to be used. Must be from feature_learning. A single feature learner does not have to be wrapped in a list.

feature_selectors (Union[_Predictor, List[_Predictor]], optional):

Predictor(s) used to select the best features. Must be from predictors. A single feature selector does not have to be wrapped in a list. Make sure to also set share_selected_features.

predictors (Union[_Predictor, List[_Predictor]], optional):

Predictor(s) used to generate the predictions. If more than one predictor is passed, the predictions generated will be averaged. Must be from predictors. A single predictor does not have to be wrapped in a list.

tags (List[str], optional):

Tags exist to help you organize your pipelines. You can add any tags that help you remember what you were trying to do.

include_categorical (bool, optional):

Whether you want to pass categorical columns in the population table to the predictor.

share_selected_features (float, optional):

The share of features you want the feature selection to keep. When set to 0.0, all features will be kept. See the sketch below the argument list.
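For instance, to forward categorical columns of the population table to the predictor and to keep only a quarter of the generated features, the pipeline could be configured as follows. This is a sketch reusing the placeholders and components from the example above.

pipe = getml.pipeline.Pipeline(
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    include_categorical=True,
    share_selected_features=0.25
)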

Methods

check(population_table[, peripheral_tables])

Checks the validity of the data model.

delete()

Deletes the pipeline from the engine.

deploy(deploy)

Allows a fitted pipeline to be addressable via an HTTP request (see the sketch following the method list).

fit(population_table[, peripheral_tables])

Trains the feature learning algorithms, feature selectors and predictors.

info()

Prints detailed information on the Pipeline.

predict(population_table[, …])

Forecasts on new, unseen data using the trained predictor (see the example following the method list).

refresh()

Reloads the pipeline from the engine.

score(population_table[, peripheral_tables])

Calculates the performance of the predictor.

transform(population_table[, …])

Translates new data into the trained features (see the example following the method list).

validate()

Checks both the types and the values of all instance variables and raises an exception if something is off.
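Once the pipeline has been fitted, it can be exposed to HTTP requests via deploy(). A minimal sketch, assuming that passing False later removes the deployment again:

# Make the fitted pipeline addressable via HTTP.
pipe.deploy(True)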
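Predictions on new data follow the same calling convention as check(), fit(), and score() in the example at the top of this section:

predictions = pipe.predict(
    population_table=population_testing,
    peripheral_tables={"order": order, "trans": trans}
)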
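The same applies to transform(), which returns the trained features for new data instead of predictions:

features = pipe.transform(
    population_table=population_testing,
    peripheral_tables={"order": order, "trans": trans}
)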

Attributes

accuracy

A convenience wrapper to retrieve the accuracy of the latest scoring run (the last time .score() was called) on the pipeline.

auc

A convenience wrapper to retrieve the auc of the latest scoring run (the last time .score() was called) on the pipeline.

columns

Columns object that can be used to handle the columns generated by the feature learners.

cross_entropy

A convenience wrapper to retrieve the cross entropy of the latest scoring run (the last time .score() was called) on the pipeline.

features

Features object that can be used to handle the features generated by the feature learners.

fitted

Whether the pipeline has already been fitted.

id

ID of the pipeline.

is_classification

Whether the pipeline can be used for classification problems.

is_regression

Whether the pipeline can be used for regression problems.

mae

A convenience wrapper to retrieve the mae of the latest scoring run (the last time .score() was called) on the pipeline.

metrics

Metrics object that can be used to generate metrics like an ROC curve or a lift curve.

name

Returns the ID of the pipeline.

rmse

A convenience wrapper to retrieve the rmse of the latest scoring run (the last time .score() was called) on the pipeline.

rsquared

A convenience wrapper to retrieve the rsquared of the latest scoring run (the last time .score() was called) on the pipeline.

scored

Whether the pipeline has been scored.

scores

Contains all scores generated by getml.pipeline.Pipeline.score() (see the example at the end of this section).

targets

Contains the names of the targets used for this pipeline.
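After a call to score(), the individual metrics can be read directly off the pipeline. Which of them are populated depends on whether the pipeline is a classification or a regression pipeline:

# Inspect the scores of the latest scoring run.
print(pipe.scores)
print(pipe.auc)     # classification pipelines
print(pipe.rmse)    # regression pipelines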