Placeholder

class getml.data.Placeholder(name, categorical=None, numerical=None, join_keys=None, time_stamps=None, targets=None)

Bases: object

Schematic representation of tables and their relations

This class provides a lightweight representation of a DataFrame, covering both its general structure and its relations to other DataFrames. It does not, however, contain the actual data.

Examples

Although you can use the constructor of Placeholder directly to replicate the structure of a DataFrame, we highly recommend using the getml.data.DataFrame.to_placeholder() method to generate placeholders automatically.

# Creates some DataFrames
population_table, peripheral_table = getml.datasets.make_numerical()

# Derives Placeholder from them
population_placeholder = population_table.to_placeholder()
peripheral_placeholder = peripheral_table.to_placeholder()

With your Placeholder in place you can use the join() method to construct the data model (required to construct the models later on).

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp"
)

Parameters

Raises

TypeError – If any of the input arguments is of wrong type.

Note

The input arguments categorical, numerical, join_keys, time_stamps, and targets represent the general structure of the corresponding DataFrame. Its relations to other DataFrames, stored in the instance variables join_keys_used, other_join_keys_used, time_stamps_used, other_time_stamps_used, upper_time_stamps_used, and joined_tables, are not set by the constructor but by the set_relations() method. That method is called internally when you invoke join() and is only exported as a public method to allow more sophisticated scripting.

Methods Summary

join(other, join_key[, time_stamp, …])

Establish a relation between two Placeholder instances.

set_relations([join_keys_used, …])

Set all relational instance variables not exposed in the constructor.

Methods Documentation

join(other, join_key, time_stamp='', other_join_key='', other_time_stamp='', upper_time_stamp='')

Establish a relation between two Placeholder instances.

For the feature engineering algorithm to craft sophisticated features, it has to know the general structure of your relational data. This structure, the data model, consists of both the schematic representations of the DataFrames involved and their relations to each other. This method introduces the latter.

Examples

population_table, peripheral_table = getml.datasets.make_numerical()
population_placeholder = population_table.to_placeholder()
peripheral_placeholder = peripheral_table.to_placeholder()

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp"
)

The example above constructs a data model in which ‘population_table’ depends on ‘peripheral_table’ via the ‘join_key’ column. In addition, only those rows in ‘peripheral_table’ whose ‘time_stamp’ is no later than the ‘time_stamp’ of the corresponding row in ‘population_table’ are considered.
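To make the row selection concrete, here is a minimal sketch in plain Python that makes no getML calls. The helper matching_rows and the sample rows are hypothetical, and the inclusive comparison is an assumption based on the Note for this method, which describes peripheral rows as "at most as recent" as the population row.

```python
# Illustrative stand-in data; not produced by getML.
population = [
    {"join_key": "a", "time_stamp": 2.0},
    {"join_key": "b", "time_stamp": 1.0},
]
peripheral = [
    {"join_key": "a", "time_stamp": 1.0, "value": 10},
    {"join_key": "a", "time_stamp": 3.0, "value": 20},  # lies in the future
    {"join_key": "b", "time_stamp": 1.0, "value": 30},
    {"join_key": "c", "time_stamp": 0.5, "value": 40},  # no matching key
]


def matching_rows(pop_row, peripheral_rows):
    """Hypothetical helper: keep peripheral rows that share the join key
    and are not more recent than the population row (assumed inclusive)."""
    return [
        row
        for row in peripheral_rows
        if row["join_key"] == pop_row["join_key"]
        and row["time_stamp"] <= pop_row["time_stamp"]
    ]


# One list of matched peripheral values per population row.
values = [
    [row["value"] for row in matching_rows(pop_row, peripheral)]
    for pop_row in population
]
print(values)
```

The row with value 20 is dropped because its time stamp lies after that of the population row it would otherwise join to.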

Parameters
  • other (Placeholder) – Placeholder the current instance will depend on.

  • join_key (str) –

    Name of the StringColumn in the corresponding DataFrame used to establish a relation between the current instance and other.

    The provided string must be contained in the join_keys instance variable.

    If other_join_key is an empty string, join_key will be used to determine the column of other too.

  • time_stamp (str, optional) –

    Name of the FloatColumn in the corresponding DataFrame used to ensure causality.

    The provided string must be contained in the time_stamps instance variable.

    If other_time_stamp is an empty string, time_stamp will be used to determine the column of other too.

  • other_join_key (str, optional) –

    Name of the StringColumn in the DataFrame represented by other used to establish a relation between the current instance and other.

    If an empty string is provided, join_key will be used instead.

  • other_time_stamp (str, optional) –

    Name of the FloatColumn in the DataFrame represented by other used to ensure causality.

    If an empty string is provided, time_stamp will be used instead.

  • upper_time_stamp (str, optional) –

    Optional additional time stamp in other that limits the joined rows to a certain point in the past. This is useful for data with limited correlation length.

    Expressed as SQL code, this will add the condition

    t1.time_stamp < t2.upper_time_stamp OR
    t2.upper_time_stamp IS NULL
    

    to the feature.

    If an empty string is provided, all values in the past will be considered.

Raises
  • TypeError – If any of the input arguments is of wrong type.

  • Exception – If other was created earlier (temporally) than the current instance.

Note

The time_stamp plays a crucial role. It ensures causality by including only those rows of other in the join operation for which the time stamp in other_time_stamp is at most as recent as the one in the corresponding row of time_stamp in the current instance. Since you usually call this method on the population table and provide the peripheral table as other (see Tables), this ensures that no information from the future is considered during training. upper_time_stamp additionally limits the joined rows to a certain point in the past.

In terms of SQL syntax, this method corresponds to a LEFT JOIN.
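The conditions described for this method can be tried out on toy data with Python's built-in sqlite3 module. The sketch below is not getML code and makes no claim about how the engine works internally. The table layout, the sample rows, and the inclusive comparison on time_stamp (taken from the wording "at most as recent" above) are assumptions made for illustration; only the upper_time_stamp condition is copied from the parameter description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (join_key TEXT, time_stamp REAL)")
conn.execute(
    "CREATE TABLE peripheral "
    "(join_key TEXT, time_stamp REAL, upper_time_stamp REAL, value INTEGER)"
)
conn.executemany(
    "INSERT INTO population VALUES (?, ?)",
    [("a", 5.0), ("b", 5.0)],
)
conn.executemany(
    "INSERT INTO peripheral VALUES (?, ?, ?, ?)",
    [
        ("a", 1.0, 3.0, 10),   # excluded: 5.0 < 3.0 is false
        ("a", 2.0, 8.0, 20),   # included: 5.0 < 8.0
        ("a", 4.0, None, 30),  # included: no upper bound (NULL)
        ("a", 6.0, None, 40),  # excluded: lies in the future
    ],
)

# t1 is the table join() is called on, t2 is the table passed as `other`.
rows = conn.execute(
    """
    SELECT t1.join_key, t2.value
    FROM population t1
    LEFT JOIN peripheral t2
      ON t1.join_key = t2.join_key
     AND t2.time_stamp <= t1.time_stamp
     AND (t1.time_stamp < t2.upper_time_stamp
          OR t2.upper_time_stamp IS NULL)
    ORDER BY t1.join_key, t2.value
    """
).fetchall()
conn.close()
print(rows)
```

With this data, rows contains two matches for key "a" and a single row for key "b" whose value is None, because a left join keeps rows of the left table even when nothing matches.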

other must be created (temporally) after the current instance. This was implemented as a measure to prevent circular dependencies in the data model.

set_relations(join_keys_used=None, other_join_keys_used=None, time_stamps_used=None, other_time_stamps_used=None, upper_time_stamps_used=None, joined_tables=None)

Set all relational instance variables not exposed in the constructor.

The setting of the instance variables is split across two functions, this one and the constructor, because of their distinct nature. While the input arguments of the constructor concern the structure of the particular DataFrame the current instance represents, the arguments of this function cover the relations of that table to all other DataFrames.

The intended usage during a data science project is to first construct the Placeholder for every DataFrame and later join them together, that is, define their relations, using join(). Letting join() manage the relational instance variables instead of requiring them during the construction of the Placeholder makes the code much less error-prone. At some points, however, such as deserialization or extended scripting, it might still be necessary to initialize the whole instance in one (well, two) steps, which is where this method comes into play.

The ordering within the provided arguments is important. All of them are assumed to have the same length, and the keys/time stamps at a given position define the relation between the table this instance represents and the one represented by the Placeholder at the same position in the joined_tables argument.
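The position-wise pairing can be pictured with a small stand-alone sketch. The helper pair_relations below is hypothetical and not part of getML; it uses plain strings in place of Placeholder instances and only mirrors the documented contract that all lists have the same length and that entries at the same index belong together.

```python
def pair_relations(join_keys_used, other_join_keys_used, time_stamps_used,
                   other_time_stamps_used, upper_time_stamps_used,
                   joined_tables):
    """Hypothetical helper: group the entries found at the same position."""
    args = [join_keys_used, other_join_keys_used, time_stamps_used,
            other_time_stamps_used, upper_time_stamps_used, joined_tables]
    # Mirror the documented ValueError for arguments of unequal length.
    if len({len(arg) for arg in args}) != 1:
        raise ValueError("All arguments must be of the same length.")
    # Entry i of every list describes the relation to joined_tables[i].
    return list(zip(joined_tables, join_keys_used, other_join_keys_used,
                    time_stamps_used, other_time_stamps_used,
                    upper_time_stamps_used))


# Strings stand in for Placeholder instances; all names are made up.
relations = pair_relations(
    join_keys_used=["customer_id", "customer_id"],
    other_join_keys_used=["id", "customer_id"],
    time_stamps_used=["date", "date"],
    other_time_stamps_used=["order_date", "visit_date"],
    upper_time_stamps_used=["", "valid_until"],
    joined_tables=["orders", "visits"],
)
for relation in relations:
    print(relation)
```

Swapping two entries in only one of the lists would silently attach a key or time stamp to the wrong table, which is why the ordering matters.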

Parameters
  • join_keys_used (List[str]) – Elements in join_keys used to define the relations to the other tables provided in joined_tables.

  • other_join_keys_used (List[str]) – join_keys of the Placeholder in joined_tables used to define a relation with the current instance. Note that the join_keys instance variable is not contained in the joined_tables.

  • time_stamps_used (List[str]) – Elements in time_stamps used to define the relations to the other tables provided in joined_tables.

  • other_time_stamps_used (List[str]) – time_stamps of the Placeholder in joined_tables used to define a relation with the current instance. Note that the time_stamps instance variable is not contained in the joined_tables.

  • upper_time_stamps_used (List[str]) – time_stamps of the Placeholder in joined_tables used as ‘upper_time_stamp’ to define a relation with the current instance. For details, please see the join() method. Note that the time_stamps instance variable is not contained in the joined_tables.

  • joined_tables (List[Placeholder]) – List of all other Placeholder the current instance is joined on.

Raises
  • TypeError – If any of the input arguments is of wrong type.

  • ValueError – If the input arguments are not of the same length.