Pipeline

class getml.pipeline.Pipeline(population=None, peripheral=None, preprocessors=None, feature_learners=None, feature_selectors=None, predictors=None, tags=None, include_categorical=False, share_selected_features=0.5)

A Pipeline is the main class for feature learning and prediction.
Example:
We assume that you have already set up your preprocessors (refer to preprocessors), your feature learners (refer to feature_learning), as well as your feature selectors and predictors (refer to predictors, which can be used for both prediction and feature selection).

For more detailed information on how to set up your data model, please refer to the documentation of the Placeholder.

    import getml

    population_placeholder = getml.data.Placeholder("population")
    order_placeholder = getml.data.Placeholder("order")
    trans_placeholder = getml.data.Placeholder("trans")

    population_placeholder.join(
        order_placeholder,
        join_key="join_key",
        time_stamp="time_stamp"
    )

    population_placeholder.join(
        trans_placeholder,
        join_key="join_key",
        time_stamp="time_stamp"
    )

    pipe = getml.pipeline.Pipeline(
        tags=["multirel", "relboost", "31 features"],
        population=population_placeholder,
        peripheral=[order_placeholder, trans_placeholder],
        feature_learners=[feature_learner_1, feature_learner_2],
        feature_selectors=feature_selector,
        predictors=predictor,
        share_selected_features=0.5
    )

    # You can pass the peripheral tables as a list. In that
    # case they have to match the order in which you have passed
    # the peripheral placeholders to the pipeline.
    pipe.check(
        population_table=population_training,
        peripheral_tables=[order, trans]
    )

    # You can also pass them as a dictionary, in which
    # case their order doesn't matter, but the keys
    # of the dictionary need to match the names of the
    # peripheral placeholders.
    pipe.check(
        population_table=population_training,
        peripheral_tables={"order": order, "trans": trans}
    )

    # Everything we have discussed above applies to
    # .fit(...), .score(...), .predict(...) and .transform(...)
    # as well.
    pipe.fit(
        population_table=population_training,
        peripheral_tables={"order": order, "trans": trans}
    )

    pipe.score(
        population_table=population_testing,
        peripheral_tables={"order": order, "trans": trans}
    )
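The two ways of passing peripheral tables described in the comments of the example can be pictured with a small, library-independent sketch. The helper `resolve_peripheral_tables` is hypothetical and not part of getML; it only illustrates the stated rule that a list is matched to the peripheral placeholders by position, whereas a dictionary is matched by placeholder name.

```python
from typing import Dict, List, Sequence, Union


def resolve_peripheral_tables(
    placeholder_names: Sequence[str],
    tables: Union[Sequence[object], Dict[str, object]],
) -> List[object]:
    """Hypothetical helper (not a getML function).

    Returns the tables ordered like `placeholder_names`.
    """
    if isinstance(tables, dict):
        # Dictionary: the order is irrelevant, but every placeholder
        # name must appear as a key.
        missing = [name for name in placeholder_names if name not in tables]
        if missing:
            raise KeyError(f"No table passed for placeholder(s): {missing}")
        return [tables[name] for name in placeholder_names]

    # List: the i-th table is assumed to belong to the i-th placeholder.
    if len(tables) != len(placeholder_names):
        raise ValueError(
            f"Expected {len(placeholder_names)} tables, got {len(tables)}"
        )
    return list(tables)
```

With placeholders named "order" and "trans", the list `[order, trans]` and the dictionary `{"trans": trans, "order": order}` resolve to the same sequence, which is why only the list form depends on ordering.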
Args:
- population (getml.data.Placeholder, optional): Abstract representation of the population table, which defines the statistical population and contains the target variables.
- peripheral (Union[Placeholder, List[Placeholder]], optional): Abstract representations of the additional tables used to augment the information provided in population. These have to be the same objects that were joined onto the population Placeholder. Their order determines the order of the peripheral DataFrames passed to the peripheral_tables argument in check(), fit(), predict(), score(), and transform(), if you pass the data frames as a list. If you omit the peripheral placeholders, they will be inferred from the population placeholder and ordered alphabetically.
- preprocessors (Union[_Preprocessor, List[_Preprocessor]], optional): The preprocessor(s) to be used. Must be from preprocessors. A single preprocessor does not have to be wrapped in a list.
- feature_learners (Union[_FeatureLearner, List[_FeatureLearner]], optional): The feature learner(s) to be used. Must be from feature_learning. A single feature learner does not have to be wrapped in a list.
- feature_selectors (Union[_Predictor, List[_Predictor]], optional): Predictor(s) used to select the best features. Must be from predictors. A single feature selector does not have to be wrapped in a list. Make sure to also set share_selected_features.
- predictors (Union[_Predictor, List[_Predictor]], optional): Predictor(s) used to generate the predictions. If more than one predictor is passed, the predictions generated will be averaged. Must be from predictors. A single predictor does not have to be wrapped in a list.
- tags (List[str], optional): Tags exist to help you organize your pipelines. You can add any tags that help you remember what you were trying to do.
- include_categorical (bool, optional): Whether you want to pass categorical columns in the population table to the predictor.
- share_selected_features (float, optional): The share of features you want the feature selection to keep. When set to 0.0, all features are kept.
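As a rough illustration of share_selected_features, the following sketch keeps the top share of features ranked by importance. The function `select_features`, the importance scores, and the rounding rule are assumptions made for illustration; the source only states that the share is the fraction of features to keep and that 0.0 keeps all of them, so this is not a description of getML's internal procedure.

```python
import math
from typing import Dict, List


def select_features(importances: Dict[str, float], share: float) -> List[str]:
    """Hypothetical illustration, not getML's implementation."""
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must lie in [0.0, 1.0]")

    # Rank feature names from most to least important.
    ranked = sorted(importances, key=importances.get, reverse=True)

    if share == 0.0:
        # Documented special case: 0.0 keeps every feature.
        return ranked

    # Assumption: round up and keep at least one feature.
    n_keep = max(1, math.ceil(share * len(ranked)))
    return ranked[:n_keep]
```

Under these assumptions, four features with share 0.5 reduce to the two most important ones, while share 0.0 returns all four.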
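The predictors argument states that the predictions of several predictors are averaged. A minimal sketch of that idea follows, assuming an unweighted arithmetic mean per row; the source does not say whether any weighting is applied, and `average_predictions` is a hypothetical name rather than a getML function.

```python
from typing import List, Sequence


def average_predictions(predictions: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical helper: row-wise mean over several predictors' outputs.

    `predictions[k][i]` is the prediction of predictor k for row i.
    """
    if not predictions:
        raise ValueError("At least one predictor output is required")

    n_rows = len(predictions[0])
    if any(len(p) != n_rows for p in predictions):
        raise ValueError("All predictors must predict the same number of rows")

    # zip(*predictions) groups the predictions row by row.
    return [sum(row) / len(predictions) for row in zip(*predictions)]
```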
Methods
- check(population_table[, peripheral_tables]): Checks the validity of the data model.
- delete(): Deletes the pipeline from the engine.
- deploy(deploy): Allows a fitted pipeline to be addressable via an HTTP request.
- fit(population_table[, peripheral_tables]): Trains the feature learning algorithms, feature selectors and predictors.
- info(): Prints detailed information on the Pipeline.
- predict(population_table[, …]): Forecasts on new, unseen data using the trained predictor.
- refresh(): Reloads the pipeline from the engine.
- score(population_table[, peripheral_tables]): Calculates the performance of the predictor.
- transform(population_table[, …]): Translates new data into the trained features.
- validate(): Checks both the types and the values of all instance variables and raises an exception if something is off.
Attributes
- A convenience wrapper to retrieve the accuracy of the latest scoring run (the last time .score() was called) on the pipeline.
- A convenience wrapper to retrieve the auc of the latest scoring run (the last time .score() was called) on the pipeline.
- Columns object that can be used to handle the columns generated by the feature learners.
- A convenience wrapper to retrieve the cross entropy of the latest scoring run (the last time .score() was called) on the pipeline.
- Features object that can be used to handle the features generated by the feature learners.
- Whether the pipeline has already been fitted.
- ID of the pipeline.
- Whether the pipeline can be used for classification problems.
- Whether the pipeline can be used for regression problems.
- A convenience wrapper to retrieve the mae of the latest scoring run (the last time .score() was called) on the pipeline.
- Metrics object that can be used to generate metrics like an ROC curve or a lift curve.
- Returns the ID of the pipeline.
- A convenience wrapper to retrieve the rmse of the latest scoring run (the last time .score() was called) on the pipeline.
- A convenience wrapper to retrieve the rsquared of the latest scoring run (the last time .score() was called) on the pipeline.
- Whether the pipeline has been scored.
- Contains all scores generated by getml.pipeline.Pipeline.score().
- Contains the names of the targets used for this pipeline.
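For orientation, the regression scores named mae and rmse above are conventionally defined as the mean absolute error and the root mean squared error. The sketch below uses these textbook definitions in plain Python; that the values getML reports after .score() match them exactly is an assumption, and the function names are chosen for illustration only.

```python
import math
from typing import Sequence


def mae(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Mean absolute error (textbook definition)."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("Inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def rmse(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Root mean squared error (textbook definition)."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("Inputs must be non-empty and of equal length")
    squared = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    return math.sqrt(squared / len(targets))
```

Because rmse squares the residuals before averaging, a single large error raises it more than it raises mae, which is one reason to look at both after a scoring run.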