Columns

class getml.pipeline.Columns(name, targets, peripheral)

Bases: object

Custom class for handling the columns inserted into the pipeline.

Example

names, importances = my_pipeline.columns.importances()

# Sets all categorical and numerical columns that are not
# in the top 20% to unused.
my_pipeline.columns.select(
    population_table,
    peripheral_tables,
    share_selected_columns=0.2
)

Methods Summary

importances([target_num, sort])

Returns the data for the column importances.

select(population_table[, …])

Marks all categorical or numerical columns that are not sufficiently important as unused.

to_pandas()

Returns all information related to the columns in a pandas data frame.

Methods Documentation

importances(target_num=0, sort=True)

Returns the data for the column importances.

Column importances extend the idea of feature importances to the columns originally inserted into the pipeline. Each column is assigned an importance value that measures its contribution to the predictive performance. All column importances add up to 1.

Parameters
  • target_num (int) – Index of the target for which to return the importances. (Pipelines can have more than one target.)

  • sort (bool) – Whether to sort the results.

Returns

  • The first array contains the names of the columns.

  • The second array contains their importances. By definition, all importances add up to 1.

Return type

(numpy.ndarray, numpy.ndarray)
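As a minimal sketch of how the two returned arrays line up, the snippet below uses made-up column names and raw scores (both hypothetical, not getML output) held in plain Python lists. It normalizes the scores so that they add up to 1 and sorts names and importances together. It only illustrates the shape of the result; it does not reproduce how getML computes importances, and the descending sort order is an assumption.

```python
# Hypothetical column names and raw scores; stand-ins for illustration only.
names = ["age", "income", "zip_code", "signup_date"]
raw_scores = [3.0, 5.0, 0.5, 1.5]

# Normalize so that all importances add up to 1.
total = sum(raw_scores)
importances = [score / total for score in raw_scores]

# Sort names and importances together, most important first
# (assumed here to mirror what sort=True does).
pairs = sorted(zip(names, importances), key=lambda p: p[1], reverse=True)
sorted_names = [name for name, _ in pairs]
sorted_importances = [imp for _, imp in pairs]

for name, imp in pairs:
    print(f"{name}: {imp:.2f}")
```

Keeping the two sequences paired while sorting ensures that position i in the names always refers to position i in the importances.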

select(population_table, peripheral_tables=None, share_selected_columns=0.5)

Marks all categorical or numerical columns that are not sufficiently important as unused.

Parameters
  • population_table (getml.data.DataFrame) – Main table containing the target variable(s) and corresponding to the population Placeholder instance variable.

  • peripheral_tables (List[getml.data.DataFrame] or dict) – Additional tables corresponding to the peripheral Placeholder instance variable.

  • share_selected_columns (numerical) – The share of columns to keep, ranked by importance; for example, 0.2 keeps the top 20%. Must be between 0.0 and 1.0.
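The effect of share_selected_columns can be sketched in plain Python. The helper select_top_share below is hypothetical and not part of getML; it ranks columns by importance and keeps the top share. Rounding the number of kept columns up is an assumption, since the rounding rule is not stated here.

```python
import math


def select_top_share(names, importances, share_selected_columns):
    """Return the names kept when only the top share of columns survives.

    Hypothetical helper for illustration; not part of the getML API.
    """
    if not 0.0 <= share_selected_columns <= 1.0:
        raise ValueError("share_selected_columns must be between 0.0 and 1.0")
    # Rank columns from most to least important.
    ranked = sorted(zip(names, importances), key=lambda p: p[1], reverse=True)
    # Assumption: the number of kept columns is rounded up.
    n_keep = math.ceil(share_selected_columns * len(ranked))
    return [name for name, _ in ranked[:n_keep]]


# Made-up columns and importances that add up to 1.
names = ["a", "b", "c", "d", "e"]
importances = [0.10, 0.40, 0.05, 0.30, 0.15]

kept = select_top_share(names, importances, 0.4)
# Everything that was not kept would be treated as unused.
unused = [name for name in names if name not in kept]
print("kept:", kept)
print("unused:", unused)
```

With five columns and a share of 0.4, two columns are kept and the remaining three fall into the unused group.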

to_pandas()

Returns all information related to the columns in a pandas data frame.