Pipeline

class getml.pipeline.Pipeline(population=None, peripheral=None, preprocessors=None, feature_learners=None, feature_selectors=None, predictors=None, tags=None, include_categorical=False, share_selected_features=0.5)

Bases: object

A Pipeline is the main class for feature learning and prediction.

Parameters
  • population (getml.data.Placeholder, optional) – Abstract representation of the population table, which defines the statistical population and contains the target variables.

  • peripheral (Union[Placeholder, List[Placeholder]], optional) – Abstract representations of the additional tables used to augment the information provided in population. These have to be the same objects that were joined onto the population Placeholder via join(), and their order strictly determines the order of the peripheral DataFrames passed in the ‘peripheral_tables’ argument of check(), fit(), predict(), score(), and transform().

  • feature_learners (Union[_FeatureLearner, List[_FeatureLearner]], optional) – The feature learner(s) to be used. Must be from feature_learning. A single feature learner does not have to be wrapped in a list.

  • feature_selectors (Union[_Predictor, List[_Predictor]], optional) – Predictor(s) used to select the best features. Must be from predictors. A single feature selector does not have to be wrapped in a list. Make sure to also set share_selected_features.

  • predictors (Union[_Predictor, List[_Predictor]], optional) – Predictor(s) used to generate the predictions. If more than one predictor is passed, the predictions generated will be averaged. Must be from predictors. A single predictor does not have to be wrapped in a list.

  • tags (List[str], optional) – Tags exist to help you organize your pipelines. You can add any tags that help you remember what you were trying to do.

  • include_categorical (bool, optional) – Whether you want to pass categorical columns in the population table to the predictor.

  • share_selected_features (float, optional) – The share of features you want the feature selection to keep. When set to 0.0, all features will be kept.
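The effect of share_selected_features can be illustrated with a minimal sketch in plain Python. The actual selection happens inside the getML engine; the helper select_top_share, the importance scores, and the rounding rule below are assumptions made for illustration and are not part of the getML API.

```python
import math


def select_top_share(importances, share):
    """Return the names of the features to keep, most important first.

    `importances` maps feature name -> importance score (hypothetical input).
    A share of 0.0 disables the selection and keeps every feature.
    """
    if not 0.0 <= share <= 1.0:
        raise ValueError("share must be between 0.0 and 1.0")
    # Rank the feature names by descending importance.
    ranked = sorted(importances, key=importances.get, reverse=True)
    if share == 0.0:
        return ranked
    # Assumption: round up so that at least one feature survives.
    n_keep = max(1, math.ceil(share * len(ranked)))
    return ranked[:n_keep]


importances = {
    "feature_1": 0.40,
    "feature_2": 0.05,
    "feature_3": 0.30,
    "feature_4": 0.25,
}
print(select_top_share(importances, 0.5))  # ['feature_1', 'feature_3']
print(len(select_top_share(importances, 0.0)))  # 4
```

With a share of 0.5 and four candidate features, the two highest-ranked features remain; with 0.0, nothing is dropped.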

Example

We assume that you have already set up your data model using Placeholder, your feature learners (refer to feature_learning) as well as your feature selectors and predictors (refer to predictors, which can be used for prediction and feature selection).

import getml

pipe = getml.pipeline.Pipeline(
    tags=["multirel", "relboost", "31 features"],
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    feature_learners=[feature_learner_1, feature_learner_2],
    feature_selectors=feature_selector,
    predictors=predictor,
    share_selected_features=0.5
)

# "order" and "trans" refer to the names of the
# placeholders.
pipe.check(
    population_table=population_training,
    peripheral_tables={"order": order, "trans": trans}
)

pipe.fit(
    population_table=population_training,
    peripheral_tables={"order": order, "trans": trans}
)

pipe.score(
    population_table=population_testing,
    peripheral_tables={"order": order, "trans": trans}
)

Attributes Summary

columns

Columns object that can be used to handle the columns generated by the feature learners.

features

Features object that can be used to handle the features generated by the feature learners.

fitted

Whether the pipeline has already been fitted.

id

ID of the pipeline.

is_classification

Whether the pipeline can be used for classification problems.

is_regression

Whether the pipeline can be used for regression problems.

metrics

Metrics object that can be used to generate metrics like an ROC curve or a lift curve.

name

Returns the ID of the pipeline.

targets

The names of the targets to which the pipeline has been fitted.

Methods Summary

check(population_table[, peripheral_tables])

Checks the validity of the data model.

delete()

Deletes the pipeline from the engine.

deploy(deploy)

Allows a fitted pipeline to be addressable via an HTTP request.

fit(population_table[, peripheral_tables])

Trains the feature learning algorithms, feature selectors and predictors.

info()

Prints detailed information on the Pipeline.

predict(population_table[, …])

Forecasts on new, unseen data using the trained predictor.

refresh()

Reloads the pipeline from the engine.

score(population_table[, peripheral_tables])

Calculates the performance of the predictor.

transform(population_table[, …])

Translates new data into the trained features.

validate()

Checks both the types and the values of all instance variables and raises an exception if something is off.

Attributes Documentation

columns

Columns object that can be used to handle the columns generated by the feature learners.

features

Features object that can be used to handle the features generated by the feature learners.

fitted

Whether the pipeline has already been fitted.

id

ID of the pipeline. This is used to uniquely identify the pipeline on the engine.

is_classification

Whether the pipeline can be used for classification problems.

is_regression

Whether the pipeline can be used for regression problems.

metrics

Metrics object that can be used to generate metrics like an ROC curve or a lift curve.

name

Returns the ID of the pipeline. The name property is kept for backward compatibility.

targets

The names of the targets to which the pipeline has been fitted.

Methods Documentation

check(population_table, peripheral_tables=None)

Checks the validity of the data model.

Parameters
  • population_table (getml.data.DataFrame) – Main table containing the target variable(s) and corresponding to the population Placeholder instance variable.

  • peripheral_tables (List[getml.data.DataFrame] or dict) – Additional tables corresponding to the peripheral Placeholder instance variable. They have to be provided in the exact same order as their corresponding placeholders!
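The ordering requirement can be made concrete with a small sketch. It shows one way a dict keyed by placeholder name could be resolved into a list in placeholder order, while a list is taken as already ordered. The helper order_peripheral_tables is hypothetical and only mirrors the rule described above; it is not how getML necessarily implements it.

```python
def order_peripheral_tables(placeholder_names, tables):
    """Return the tables as a list ordered like the peripheral placeholders."""
    if isinstance(tables, dict):
        # A dict is resolved by placeholder name, so its own order is irrelevant.
        missing = [name for name in placeholder_names if name not in tables]
        if missing:
            raise KeyError(f"No table passed for placeholder(s): {missing}")
        return [tables[name] for name in placeholder_names]
    tables = list(tables)
    if len(tables) != len(placeholder_names):
        raise ValueError("Expected one table per peripheral placeholder")
    # A list is assumed to be in placeholder order already.
    return tables


names = ["order", "trans"]
# Strings stand in for getml.data.DataFrame objects.
resolved = order_peripheral_tables(names, {"trans": "trans_df", "order": "order_df"})
print(resolved)  # ['order_df', 'trans_df']
```

Passing a dict avoids silent mix-ups when several peripheral tables have compatible columns, because the mapping to placeholders is explicit.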

delete()

Deletes the pipeline from the engine.

Raises
  • KeyError – If an unsupported instance variable is encountered (via validate()).

  • TypeError – If any instance variable is of wrong type (via validate()).

  • ValueError – If any instance variable does not match its possible choices (string) or is out of the expected bounds (numerical) (via validate()).

Note

Caution: You cannot undo this action!

deploy(deploy)

Allows a fitted pipeline to be addressable via an HTTP request. See Deployment for details.

Parameters

deploy (bool) – If True, the deployment of the pipeline will be triggered.

Raises
  • TypeError – If deploy is not of type bool.

  • KeyError – If an unsupported instance variable is encountered.

  • TypeError – If any instance variable is of wrong type.

  • ValueError – If any instance variable does not match its possible choices (string) or is out of the expected bounds (numerical).

fit(population_table, peripheral_tables=None)

Trains the feature learning algorithms, feature selectors and predictors.

Parameters
  • population_table (getml.data.DataFrame) – Main table containing the target variable(s) and corresponding to the population Placeholder instance variable.

  • peripheral_tables (List[getml.data.DataFrame]) – Additional tables corresponding to the peripheral Placeholder instance variable. They have to be passed in the exact same order as their corresponding placeholders! A single DataFrame will be wrapped into a list internally.

Raises
  • IOError – If the pipeline corresponding to the instance variable name could not be found on the engine or the pipeline could not be fitted.

  • TypeError – If any input argument is not of proper type.

  • KeyError – If an unsupported instance variable is encountered.

  • TypeError – If any instance variable is of wrong type.

  • ValueError – If any instance variable does not match its possible choices (string) or is out of the expected bounds (numerical).

info()

Prints detailed information on the Pipeline.

predict(population_table, peripheral_tables=None, table_name='')

Forecasts on new, unseen data using the trained predictor.

Returns the predictions generated by the pipeline based on population_table and peripheral_tables or writes them into a database named table_name.

Parameters
  • population_table (Union[pandas.DataFrame, getml.data.DataFrame]) – Main table corresponding to the population Placeholder instance variable. Its target variable(s) will be ignored.

  • peripheral_tables (Union[getml.data.DataFrame, List[getml.data.DataFrame]]) – Additional tables corresponding to the peripheral Placeholder instance variable. They have to be provided in the exact same order as their corresponding placeholders! A single DataFrame will be wrapped into a list internally.

  • table_name (str, optional) – If not an empty string, the resulting predictions will be written into the database of the same name. See Unified import interface for further information.

Raises
  • IOError – If the pipeline corresponding to the instance variable name could not be found on the engine or the pipeline could not be fitted.

  • TypeError – If any input argument is not of proper type.

  • ValueError – If no valid predictor was set.

  • KeyError – If an unsupported instance variable is encountered.

  • TypeError – If any instance variable is of wrong type.

  • ValueError – If any instance variable does not match its possible choices (string) or is out of the expected bounds (numerical).

Returns

Resulting predictions provided in an array of shape (number of rows in population_table, number of targets in population_table).

Return type

numpy.ndarray

Note

Only fitted pipelines (fit()) can be used for prediction.
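When more than one predictor is passed to the pipeline, the predictions are averaged. The following sketch illustrates that averaging and the resulting (rows, targets) layout with plain Python lists. The function average_predictions and the sample values are hypothetical; an unweighted mean is assumed, and the real computation takes place in the engine and returns a numpy.ndarray.

```python
def average_predictions(per_predictor):
    """Average element-wise over predictors.

    `per_predictor` holds one entry per predictor; each entry is a list of
    rows, and each row holds one value per target.
    """
    n_predictors = len(per_predictor)
    if n_predictors == 0:
        raise ValueError("No valid predictor was set")
    n_rows = len(per_predictor[0])
    n_targets = len(per_predictor[0][0]) if n_rows else 0
    return [
        [
            sum(pred[row][target] for pred in per_predictor) / n_predictors
            for target in range(n_targets)
        ]
        for row in range(n_rows)
    ]


predictor_a = [[0.25], [0.75], [0.5]]
predictor_b = [[0.75], [0.25], [0.5]]
averaged = average_predictions([predictor_a, predictor_b])
print(averaged)  # [[0.5], [0.5], [0.5]]
print((len(averaged), len(averaged[0])))  # (3, 1): rows x targets
```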

refresh()

Reloads the pipeline from the engine.

This discards all local changes you have made since the last time you called fit().

Raises

IOError – If the engine did not send a proper pipeline.

Returns

Current instance

Return type

Pipeline

score(population_table, peripheral_tables=None)

Calculates the performance of the predictor.

Returns different scores calculated on population_table and peripheral_tables.

Parameters
  • population_table (getml.data.DataFrame) – Main table corresponding to the population Placeholder instance variable.

  • peripheral_tables (Union[getml.data.DataFrame, List[getml.data.DataFrame], dict]) – Additional tables corresponding to the peripheral Placeholder instance variable. They have to be provided in the exact same order as their corresponding placeholders! A single DataFrame will be wrapped into a list internally.

Raises
  • IOError – If the pipeline corresponding to the instance variable name could not be found on the engine or the pipeline could not be fitted.

  • TypeError – If any input argument is not of proper type.

  • ValueError – If no valid predictor was set.

  • KeyError – If an unsupported instance variable is encountered.

  • TypeError – If any instance variable is of wrong type.

  • ValueError – If any instance variable does not match its possible choices (string) or is out of the expected bounds (numerical).

Returns

Mapping of the name of the score (str) to the corresponding value (float).

For regression problems the following scores are returned:

For classification problems the following scores are returned:

Return type

dict

Note

Only fitted pipelines (fit()) can be scored.

transform(population_table, peripheral_tables=None, df_name='', table_name='')

Translates new data into the trained features.

Transforms the data provided in population_table and peripheral_tables into features, which can be used to drive machine learning models. In addition to returning them as a numerical array, this method is also able to write the results into a database called table_name.

Parameters
  • population_table (getml.data.DataFrame) – Main table corresponding to the population Placeholder instance variable. Its target variable(s) will be ignored.

  • peripheral_tables (List[getml.data.DataFrame]) – Additional tables corresponding to the peripheral Placeholder instance variable. They have to be provided in the exact same order as their corresponding placeholders. A single DataFrame will be wrapped into a list internally.

  • df_name (str, optional) – If not an empty string, the resulting features will be written into a newly created DataFrame.

  • table_name (str, optional) – If not an empty string, the resulting features will be written into the database of the same name. See Unified import interface for further information.

Raises
  • IOError – If the pipeline could not be found on the engine or the pipeline could not be fitted.

  • TypeError – If any input argument is not of proper type.

  • KeyError – If an unsupported instance variable is encountered.

  • TypeError – If any instance variable is of wrong type.

  • ValueError – If any instance variable does not match its possible choices (string) or is out of the expected bounds (numerical).

Returns

Resulting features provided in an array of shape (number of rows in population_table, number of selected features), or a getml.data.DataFrame containing the resulting features.

Return type

numpy.ndarray

Note

Only fitted pipelines (fit()) can transform data into features.

validate()

Checks both the types and the values of all instance variables and raises an exception if something is off.

Raises
  • KeyError – If an unsupported instance variable is encountered.

  • TypeError – If any instance variable is of wrong type.

  • ValueError – If any instance variable does not match its possible choices (string) or is out of the expected bounds (numerical).

Note

This method is triggered at the end of the __init__ constructor and every time a function communicating with the getML engine, except refresh(), is called.
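The three exception types map to three kinds of problems: unknown names, wrong types, and out-of-range values. A minimal sketch of such a check follows. The ALLOWED table, the function validate_settings, and the [0.0, 1.0] bound on share_selected_features are assumptions made for illustration; they do not reproduce the actual checks performed by getML.

```python
# Hypothetical subset of instance variables and their expected types.
ALLOWED = {
    "tags": list,
    "include_categorical": bool,
    "share_selected_features": float,
}


def validate_settings(settings):
    """Raise KeyError, TypeError or ValueError if something is off."""
    for name, value in settings.items():
        if name not in ALLOWED:
            raise KeyError(f"Unsupported instance variable: {name}")
        if not isinstance(value, ALLOWED[name]):
            raise TypeError(f"{name} must be of type {ALLOWED[name].__name__}")
    # Assumed bound: a share is a fraction between 0.0 and 1.0.
    share = settings.get("share_selected_features", 0.5)
    if not 0.0 <= share <= 1.0:
        raise ValueError("share_selected_features must be within [0.0, 1.0]")


def error_name(settings):
    """Return the name of the exception raised, or None if validation passes."""
    try:
        validate_settings(settings)
    except (KeyError, TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


print(error_name({"share_selected_features": 0.5}))  # None
print(error_name({"unknown_option": 1}))  # KeyError
print(error_name({"include_categorical": "yes"}))  # TypeError
print(error_name({"share_selected_features": 1.5}))  # ValueError
```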