Pipeline
class getml.pipeline.Pipeline(population=None, peripheral=None, preprocessors=None, feature_learners=None, feature_selectors=None, predictors=None, tags=None, include_categorical=False, share_selected_features=0.5)
Bases: object
A Pipeline is the main class for feature learning and prediction.
- Parameters
population (getml.data.Placeholder, optional) – Abstract representation of the population table, which defines the statistical population and contains the target variables.
peripheral (Union[Placeholder, List[Placeholder]], optional) – Abstract representations of the additional tables used to augment the information provided in population. These must be the same objects that were joined onto the population Placeholder using join(). Their order strictly determines the order of the peripheral DataFrame objects provided in the 'peripheral_tables' argument of check(), fit(), predict(), score(), and transform().
feature_learners (Union[_FeatureLearner, List[_FeatureLearner]], optional) – The feature learner(s) to be used. Must be from feature_learning. A single feature learner does not have to be wrapped in a list.
feature_selectors (Union[_Predictor, List[_Predictor]], optional) – Predictor(s) used to select the best features. Must be from predictors. A single feature selector does not have to be wrapped in a list. Make sure to also set share_selected_features.
predictors (Union[_Predictor, List[_Predictor]], optional) – Predictor(s) used to generate the predictions. If more than one predictor is passed, their predictions are averaged. Must be from predictors. A single predictor does not have to be wrapped in a list.
tags (List[str], optional) – Tags help you organize your pipelines. You can add any tags that remind you of what you were trying to do.
include_categorical (bool, optional) – Whether to pass categorical columns in the population table to the predictor.
share_selected_features (float, optional) – The share of features that the feature selection should keep. When set to 0.0, all features are kept.
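To make the effect of share_selected_features more concrete, here is a minimal sketch in plain Python. It is not getML's implementation: the function select_features, the importance values, the rounding rule and the [0.0, 1.0] bounds are all assumptions made for illustration. Only the rule that 0.0 keeps every feature is taken from the parameter description.

```python
import math


def select_features(importances, share_selected_features=0.5):
    """Return the names of the features to keep, most important first.

    Hypothetical helper, not part of getML.
    """
    # Assumption: the share is a fraction between 0.0 and 1.0.
    if not 0.0 <= share_selected_features <= 1.0:
        raise ValueError("share_selected_features must be in [0.0, 1.0]")

    # Rank feature names by their importance, highest first.
    ranked = sorted(importances, key=importances.get, reverse=True)

    if share_selected_features == 0.0:
        # Documented special case: 0.0 keeps all features.
        return ranked

    # Assumption: round up, so that at least one feature survives.
    n_keep = max(1, math.ceil(share_selected_features * len(ranked)))
    return ranked[:n_keep]


# Made-up importances, as a feature selector might report them.
importances = {
    "feature_1": 0.40,
    "feature_2": 0.05,
    "feature_3": 0.30,
    "feature_4": 0.25,
}
kept = select_features(importances, share_selected_features=0.5)
```

With four features and a share of 0.5, the sketch keeps the two highest-ranked features, feature_1 and feature_3.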
Example
We assume that you have already set up your data model using Placeholder, your feature learners (refer to feature_learning), and your feature selectors and predictors (refer to predictors, which can be used for both prediction and feature selection).

    pipe = getml.pipeline.Pipeline(
        tags=["multirel", "relboost", "31 features"],
        population=population_placeholder,
        peripheral=[order_placeholder, trans_placeholder],
        feature_learners=[feature_learner_1, feature_learner_2],
        feature_selectors=feature_selector,
        predictors=predictor,
        share_selected_features=0.5
    )

    # "order" and "trans" refer to the names of the
    # placeholders.
    pipe.check(
        population_table=population_training,
        peripheral_tables={"order": order, "trans": trans}
    )

    pipe.fit(
        population_table=population_training,
        peripheral_tables={"order": order, "trans": trans}
    )

    pipe.score(
        population_table=population_testing,
        peripheral_tables={"order": order, "trans": trans}
    )
Attributes Summary
columns – Columns object that can be used to handle the columns generated by the feature learners.
features – Features object that can be used to handle the features generated by the feature learners.
fitted – Whether the pipeline has already been fitted.
id – ID of the pipeline.
is_classification – Whether the pipeline can be used for classification problems.
is_regression – Whether the pipeline can be used for regression problems.
metrics – Metrics object that can be used to generate metrics like an ROC curve or a lift curve.
name – Returns the ID of the pipeline.
targets – The names of the targets to which the pipeline has been fitted.
Methods Summary
check(population_table[, peripheral_tables]) – Checks the validity of the data model.
delete() – Deletes the pipeline from the engine.
deploy(deploy) – Allows a fitted pipeline to be addressable via an HTTP request.
fit(population_table[, peripheral_tables]) – Trains the feature learning algorithms, feature selectors and predictors.
info() – Prints detailed information on the Pipeline.
predict(population_table[, …]) – Forecasts on new, unseen data using the trained predictor.
refresh() – Reloads the pipeline from the engine.
score(population_table[, peripheral_tables]) – Calculates the performance of the predictor.
transform(population_table[, …]) – Translates new data into the trained features.
validate() – Checks both the types and the values of all instance variables and raises an exception if something is off.
Attributes Documentation
features
Features object that can be used to handle the features generated by the feature learners.
fitted
Whether the pipeline has already been fitted.
id
ID of the pipeline. This is used to uniquely identify the pipeline on the engine.
is_classification
Whether the pipeline can be used for classification problems.
is_regression
Whether the pipeline can be used for regression problems.
name
Returns the ID of the pipeline. The name property is kept for backward compatibility.
targets
The names of the targets to which the pipeline has been fitted.
Methods Documentation
check(population_table, peripheral_tables=None)
Checks the validity of the data model.
- Parameters
population_table (getml.data.DataFrame) – Main table containing the target variable(s) and corresponding to the population Placeholder instance variable.
peripheral_tables (List[getml.data.DataFrame] or dict) – Additional tables corresponding to the peripheral Placeholder instance variable. They must be provided in exactly the same order as their corresponding placeholders.
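The ordering requirement can be illustrated with a small plain-Python sketch. The helper order_peripheral_tables is hypothetical and not part of getML; it only shows one way a dict keyed by placeholder name (as in the class example, where "order" and "trans" are placeholder names) could be resolved into a list that follows the placeholder order. Strings stand in for getml.data.DataFrame objects so that the sketch runs without an engine.

```python
def order_peripheral_tables(placeholder_names, peripheral_tables=None):
    """Return the peripheral tables as a list ordered like the placeholders.

    Hypothetical helper, not part of getML.
    """
    if peripheral_tables is None:
        return []

    if isinstance(peripheral_tables, dict):
        # Look tables up by placeholder name, so the dict order is irrelevant.
        missing = [n for n in placeholder_names if n not in peripheral_tables]
        if missing:
            raise KeyError(f"no table provided for placeholder(s): {missing}")
        return [peripheral_tables[n] for n in placeholder_names]

    if not isinstance(peripheral_tables, (list, tuple)):
        # A single table is wrapped into a list.
        peripheral_tables = [peripheral_tables]

    if len(peripheral_tables) != len(placeholder_names):
        raise ValueError("expected one table per peripheral placeholder")

    # A list is taken as-is: the caller is responsible for its order.
    return list(peripheral_tables)


# Strings stand in for getml.data.DataFrame objects.
names = ["order", "trans"]
from_dict = order_peripheral_tables(names, {"trans": "trans_df", "order": "order_df"})
from_list = order_peripheral_tables(names, ["order_df", "trans_df"])
```

In this sketch both calls yield the same ordered list. The dict form is less error-prone because the names, not the positions, decide the order.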
delete()
Deletes the pipeline from the engine.
- Raises
KeyError – If an unsupported instance variable is encountered (via validate()).
TypeError – If any instance variable is of the wrong type (via validate()).
ValueError – If any instance variable does not match its possible choices (string) or is out of the expected bounds (numerical) (via validate()).
Note
Caution: You cannot undo this action!
deploy(deploy)
Allows a fitted pipeline to be addressable via an HTTP request. See Deployment for details.
- Parameters
deploy (bool) – If True, the deployment of the pipeline will be triggered.
- Raises
TypeError – If deploy is not of type bool.
KeyError – If an unsupported instance variable is encountered.
TypeError – If any instance variable is of the wrong type.
ValueError – If any instance variable does not match its possible choices (string) or is out of the expected bounds (numerical).
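Since deployment applies to fitted pipelines, a caller may want to guard the call. The following is a minimal sketch of such a guard. It relies only on the fitted attribute and the deploy(deploy) method documented here. The helper deploy_if_fitted and the StubPipeline class are hypothetical and exist solely so that the sketch runs without a getML engine.

```python
def deploy_if_fitted(pipe):
    """Trigger deployment only for a fitted pipeline.

    Returns True if deploy() was called, False otherwise.
    Hypothetical helper, not part of getML.
    """
    if not pipe.fitted:
        return False
    pipe.deploy(True)
    return True


class StubPipeline:
    """Hypothetical stand-in for a Pipeline, used only to run the sketch."""

    def __init__(self, fitted):
        self.fitted = fitted
        self.deployed = False

    def deploy(self, deploy):
        # Mirrors the documented TypeError for a non-bool argument.
        if not isinstance(deploy, bool):
            raise TypeError("deploy must be of type bool")
        self.deployed = deploy


fitted_pipe = StubPipeline(fitted=True)
unfitted_pipe = StubPipeline(fitted=False)
results = (deploy_if_fitted(fitted_pipe), deploy_if_fitted(unfitted_pipe))
```

With a real pipeline, the same function could be called directly on the Pipeline instance after fit().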
fit(population_table, peripheral_tables=None)
Trains the feature learning algorithms, feature selectors and predictors.
- Parameters
population_table (getml.data.DataFrame) – Main table containing the target variable(s) and corresponding to the population Placeholder instance variable.
peripheral_tables (List[getml.data.DataFrame]) – Additional tables corresponding to the peripheral Placeholder instance variable. They must be passed in exactly the same order as their corresponding placeholders. A single DataFrame will be wrapped into a list internally.
- Raises
IOError – If the pipeline corresponding to the instance variable name could not be found on the engine or the pipeline could not be fitted.
TypeError – If any input argument is not of the proper type.
KeyError – If an unsupported instance variable is encountered.
TypeError – If any instance variable is of the wrong type.
ValueError – If any instance variable does not match its possible choices (string) or is out of the expected bounds (numerical).
info()
Prints detailed information on the Pipeline.
predict(population_table, peripheral_tables=None, table_name='')
Forecasts on new, unseen data using the trained predictor.
Returns the predictions generated by the pipeline based on population_table and peripheral_tables, or writes them into a database table named table_name.
- Parameters
population_table (Union[pandas.DataFrame, getml.data.DataFrame]) – Main table corresponding to the population Placeholder instance variable. Its target variable(s) will be ignored.
peripheral_tables (Union[getml.data.DataFrame, List[getml.data.DataFrame]]) – Additional tables corresponding to the peripheral Placeholder instance variable. They must be provided in exactly the same order as their corresponding placeholders. A single DataFrame will be wrapped into a list internally.
table_name (str, optional) – If not an empty string, the resulting predictions will be written into a table of that name in the database. See Unified import interface for further information.
- Raises
IOError – If the pipeline corresponding to the instance variable name could not be found on the engine or the pipeline could not be fitted.
TypeError – If any input argument is not of the proper type.
ValueError – If no valid predictor was set.
KeyError – If an unsupported instance variable is encountered.
TypeError – If any instance variable is of the wrong type.
ValueError – If any instance variable does not match its possible choices (string) or is out of the expected bounds (numerical).
- Returns
Resulting predictions provided in an array of shape (number of rows in population_table, number of targets in population_table).
- Return type
numpy.ndarray
Note
Only fitted pipelines (fit()) can be used for prediction.
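The class description states that predictions are averaged when more than one predictor is passed, and predict() returns one row per population row and one column per target. The sketch below illustrates that averaging in plain Python with nested lists. The function average_predictions and the sample numbers are assumptions for illustration. How getML combines predictors internally is not shown in this reference.

```python
def average_predictions(per_predictor):
    """Average element-wise over several predictors' outputs.

    Each entry of per_predictor is a nested list of shape
    (number of rows, number of targets).
    Hypothetical helper, not part of getML.
    """
    if not per_predictor:
        raise ValueError("at least one set of predictions is required")

    n_rows = len(per_predictor[0])
    n_targets = len(per_predictor[0][0]) if n_rows else 0

    # All predictors must agree on the output shape.
    for preds in per_predictor:
        if len(preds) != n_rows or any(len(row) != n_targets for row in preds):
            raise ValueError("all predictors must produce the same shape")

    n = len(per_predictor)
    return [
        [sum(preds[i][j] for preds in per_predictor) / n for j in range(n_targets)]
        for i in range(n_rows)
    ]


# Made-up outputs of two predictors for three rows and one target.
predictor_a = [[0.25], [0.5], [1.0]]
predictor_b = [[0.75], [1.0], [0.5]]
averaged = average_predictions([predictor_a, predictor_b])
shape = (len(averaged), len(averaged[0]))
```

The result keeps the (rows, targets) shape of the individual predictions, here (3, 1).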
refresh()
Reloads the pipeline from the engine.
This discards all local changes you have made since the last time you called fit().
- Raises
IOError – If the engine did not send a proper pipeline.
- Returns
Current instance
- Return type
Pipeline
score(population_table, peripheral_tables=None)
Calculates the performance of the predictor.
Returns different scores calculated on population_table and peripheral_tables.
- Parameters
population_table (getml.data.DataFrame) – Main table corresponding to the population Placeholder instance variable.
peripheral_tables (Union[getml.data.DataFrame, List[getml.data.DataFrame], dict]) – Additional tables corresponding to the peripheral Placeholder instance variable. They must be provided in exactly the same order as their corresponding placeholders. A single DataFrame will be wrapped into a list internally.
- Raises
IOError – If the pipeline corresponding to the instance variable name could not be found on the engine or the pipeline could not be fitted.
TypeError – If any input argument is not of the proper type.
ValueError – If no valid predictor was set.
KeyError – If an unsupported instance variable is encountered.
TypeError – If any instance variable is of the wrong type.
ValueError – If any instance variable does not match its possible choices (string) or is out of the expected bounds (numerical).
- Returns
Mapping of the name of the score (str) to the corresponding value (float).
For regression problems the following scores are returned: rmse, mae, rsquared.
For classification problems the following scores are returned: accuracy, auc, cross_entropy.
- Return type
dict
Note
Only fitted pipelines (fit()) can be scored.
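To illustrate the shape of the return value, a mapping from score name to float, here is a plain-Python sketch that computes two common regression scores under their textbook definitions (mean absolute error and root mean squared error). The function regression_scores, the key names and the sample data are assumptions for illustration. This is not how getML computes its scores on the engine.

```python
import math


def regression_scores(y_true, y_pred):
    """Return a dict mapping score names to float values.

    Hypothetical helper, not part of getML.
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    errors = [p - t for t, p in zip(y_true, y_pred)]

    # Mean absolute error: average magnitude of the errors.
    mae = sum(abs(e) for e in errors) / len(errors)

    # Root mean squared error: penalizes large errors more strongly.
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

    return {"mae": mae, "rmse": rmse}


# Made-up targets and predictions; only the last prediction is off (by 2.0).
scores = regression_scores([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

A single large error raises rmse more than mae, which is why the two values differ for this data.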
transform(population_table, peripheral_tables=None, df_name='', table_name='')
Translates new data into the trained features.
Transforms the data provided in population_table and peripheral_tables into features, which can be used to drive machine learning models. In addition to returning them as a numerical array, this method can also write the results into a database table called table_name.
- Parameters
population_table (getml.data.DataFrame) – Main table corresponding to the population Placeholder instance variable. Its target variable(s) will be ignored.
peripheral_tables (List[getml.data.DataFrame]) – Additional tables corresponding to the peripheral Placeholder instance variable. They must be provided in exactly the same order as their corresponding placeholders. A single DataFrame will be wrapped into a list internally.
df_name (str, optional) – If not an empty string, the resulting features will be written into a newly created DataFrame.
table_name (str, optional) – If not an empty string, the resulting features will be written into a table of that name in the database. See Unified import interface for further information.
- Raises
IOError – If the pipeline could not be found on the engine or the pipeline could not be fitted.
TypeError – If any input argument is not of the proper type.
KeyError – If an unsupported instance variable is encountered.
TypeError – If any instance variable is of the wrong type.
ValueError – If any instance variable does not match its possible choices (string) or is out of the expected bounds (numerical).
- Returns
Resulting features provided in an array of shape (number of rows in population_table, number of selected features), or a getml.data.DataFrame containing the resulting features.
- Return type
numpy.ndarray or getml.data.DataFrame
Note
Only fitted pipelines (fit()) can transform data into features.
validate()
Checks both the types and the values of all instance variables and raises an exception if something is off.
- Raises
KeyError – If an unsupported instance variable is encountered.
TypeError – If any instance variable is of the wrong type.
ValueError – If any instance variable does not match its possible choices (string) or is out of the expected bounds (numerical).
Note
This method is triggered at the end of the __init__ constructor and every time a function communicating with the getML engine, except refresh(), is called.
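The three exception types map onto three kinds of checks: unknown names, wrong types, and values out of bounds. The sketch below shows that pattern in plain Python. It is not getML's validate(): the function validate_params, the EXPECTED_TYPES table and the [0.0, 1.0] bounds for share_selected_features are assumptions made for illustration.

```python
# Assumption: a small subset of instance variables and their expected types.
EXPECTED_TYPES = {
    "tags": list,
    "include_categorical": bool,
    "share_selected_features": float,
}


def validate_params(params):
    """Raise KeyError, TypeError or ValueError if something is off.

    Hypothetical helper, not part of getML.
    """
    for key, value in params.items():
        if key not in EXPECTED_TYPES:
            raise KeyError(f"unsupported instance variable: {key}")
        if not isinstance(value, EXPECTED_TYPES[key]):
            raise TypeError(f"{key} must be of type {EXPECTED_TYPES[key].__name__}")

    # Assumption: the share is a fraction between 0.0 and 1.0.
    share = params.get("share_selected_features", 0.5)
    if not 0.0 <= share <= 1.0:
        raise ValueError("share_selected_features is out of the expected bounds")


def outcome(params):
    """Return 'ok' or the name of the exception raised by validate_params."""
    try:
        validate_params(params)
    except (KeyError, TypeError, ValueError) as err:
        return type(err).__name__
    return "ok"


outcomes = [
    outcome({"tags": ["demo"], "share_selected_features": 0.5}),
    outcome({"colour": "red"}),               # unknown name
    outcome({"include_categorical": "yes"}),  # wrong type
    outcome({"share_selected_features": 1.5}),  # out of bounds
]
```

Running the checks in this order (names, then types, then values) means the value check can assume that the type is already correct.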