Placeholder

class getml.data.Placeholder(name)

Bases: object

Abstract representation of tables and their relations.

This class provides an abstract representation of the DataFrame. However, it does not contain any actual data.

Examples

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

With your placeholders in place, you can use the join() method to construct the data model (required for the Pipeline).

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp"
)

Parameters

name (str) – The name used for this placeholder. This name will appear in the generated SQL code.

Raises

TypeError – If any of the input arguments is of the wrong type.

Methods Summary

join(other[, join_key, time_stamp, …])

Establish a relation between two Placeholders.

set_relations([allow_lagged_targets, …])

Set all relational instance variables not exposed in the constructor.

Methods Documentation

join(other, join_key='', time_stamp='', other_join_key='', other_time_stamp='', upper_time_stamp='', horizon=0.0, memory=0.0, allow_lagged_targets=False, relationship='many-to-many')

Establish a relation between two Placeholders.

Examples

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp"
)

The example above constructs a data model in which the population table depends on the peripheral table via the ‘join_key’ column. In addition, only those rows in the peripheral table whose ‘time_stamp’ is not later than the ‘time_stamp’ in the population table are considered.
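To make the effect of the join key and the time stamp concrete, the following plain-Python sketch mimics which peripheral rows would be matched to each population row. It does not use getML and is not how the engine executes a join. The row data and the helper `matching_rows` are hypothetical, and the time condition is the default one given under horizon (`t1.time_stamp - t2.other_time_stamp >= 0`).

```python
# Illustration only: hypothetical data and helper, not part of getML.
population = [
    {"join_key": "a", "time_stamp": 3.0},
    {"join_key": "b", "time_stamp": 1.0},
]
peripheral = [
    {"join_key": "a", "time_stamp": 1.0, "value": 10},
    {"join_key": "a", "time_stamp": 5.0, "value": 20},  # lies in the future of "a"
    {"join_key": "b", "time_stamp": 0.5, "value": 30},
    {"join_key": "c", "time_stamp": 0.0, "value": 40},  # no matching join key
]


def matching_rows(pop_row, peripheral_rows):
    """Return the peripheral rows that share the join key and are not in the future."""
    return [
        row
        for row in peripheral_rows
        if row["join_key"] == pop_row["join_key"]
        and pop_row["time_stamp"] - row["time_stamp"] >= 0
    ]


# Collect the matched values per population row.
matches = {
    pop_row["join_key"]: [row["value"] for row in matching_rows(pop_row, peripheral)]
    for pop_row in population
}
print(matches)
```

The row with value 20 is dropped because its time stamp lies after the population row's time stamp, and the row with key "c" is dropped because no population row shares its join key.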

Parameters
  • other (Placeholder) – Placeholder the current instance will depend on.

  • join_key (str or List[str]) –

    Name of the StringColumn in the corresponding DataFrame used to establish a relation between the current instance and other.

    If no join_key is passed, then all rows of the two data frames will be joined.

    If a list of strings is passed, then all join keys must match.

    If other_join_key is an empty string, join_key will be used to determine the column of other too.

  • time_stamp (str, optional) –

    Name of the FloatColumn in the corresponding DataFrame used to ensure causality.

    The provided string must be contained in the time_stamps instance variable.

    If other_time_stamp is an empty string, time_stamp will be used to determine the column of other too.

  • other_join_key (str or List[str], optional) –

    Name of the StringColumn in the DataFrame represented by other used to establish a relation between the current instance and other.

    If an empty string is passed, join_key will be used instead.

  • other_time_stamp (str, optional) –

    Name of the FloatColumn in the DataFrame represented by other used to ensure causality.

    If an empty string is provided, time_stamp will be used instead.

  • upper_time_stamp (str, optional) –

    Optional additional time stamp in other that limits the number of joined rows to a certain point in the past. This is useful for data with limited correlation length.

    Expressed as SQL code, this will add the condition

    t1.time_stamp < t2.upper_time_stamp OR
    t2.upper_time_stamp IS NULL
    

    to the feature.

    If an empty string is provided, all values in the past will be considered.

  • horizon (float, optional) –

    Period of time between the time_stamp and the other_time_stamp.

    Usually, you need to ensure that no data from the future is used for your prediction, like this:

    t1.time_stamp - t2.other_time_stamp >= 0
    

    But in some cases, you would like the gap to be something other than zero. For such cases, you can set a horizon:

    t1.time_stamp - t2.other_time_stamp >= horizon
    

  • memory (float, optional) –

    Period of time to which the join is limited.

    Expressed as SQL code, this will add the condition

    t1.time_stamp - t2.other_time_stamp < horizon + memory
    

    to the feature.

    When the memory is set to 0.0 or a negative number, there is no limit.

    Limiting the joins using the memory or upper_time_stamp parameter can significantly reduce the training time. However, you can only set an upper_time_stamp or memory, but not both.

  • allow_lagged_targets (bool, optional) – Some applications allow aggregating over target variables from the past; others do not. If allow_lagged_targets is set to True, you must pass a horizon that is greater than zero, otherwise you would have a data leak (an exception will be thrown to prevent this).

  • relationship (str, optional) – If the relationship between two tables is many-to-one or one-to-one, then feature learning is not necessary or meaningful. If you mark such relationships using one of the constants defined in getml.data.relationship, the tables will be joined directly by the pipeline.

Note

other must be created (temporally) after the current instance. This was implemented as a measure to prevent circular dependencies in the data model.
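The horizon, memory, and upper_time_stamp conditions described for join() can be combined into a single predicate. The sketch below is a plain-Python restatement of the SQL conditions quoted in the parameter descriptions, not getML code. The function name `satisfies_time_conditions` is hypothetical, `None` stands in for SQL `NULL`, and raising a `ValueError` when both memory and upper_time_stamp are set is an assumption about how the "not both" rule might be enforced.

```python
def satisfies_time_conditions(
    time_stamp,
    other_time_stamp,
    horizon=0.0,
    memory=0.0,
    upper_time_stamp=None,
):
    """Check whether a peripheral row may be joined to a population row.

    Hypothetical helper that mirrors the documented SQL conditions.
    """
    # Assumption: setting both limits is rejected with a ValueError.
    if memory > 0.0 and upper_time_stamp is not None:
        raise ValueError("Set either memory or upper_time_stamp, not both.")

    gap = time_stamp - other_time_stamp

    # t1.time_stamp - t2.other_time_stamp >= horizon
    if gap < horizon:
        return False

    # t1.time_stamp - t2.other_time_stamp < horizon + memory
    # (a memory of 0.0 or less means there is no limit)
    if memory > 0.0 and gap >= horizon + memory:
        return False

    # t1.time_stamp < t2.upper_time_stamp OR t2.upper_time_stamp IS NULL
    if upper_time_stamp is not None and not time_stamp < upper_time_stamp:
        return False

    return True
```

For example, with `horizon=1.0` and `memory=3.0`, a peripheral row is accepted only if it lies at least 1.0 and less than 4.0 time units before the population row's time stamp.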

set_relations(allow_lagged_targets=None, join_keys_used=None, horizon=None, relationship=None, memory=None, other_join_keys_used=None, time_stamps_used=None, other_time_stamps_used=None, upper_time_stamps_used=None, joined_tables=None)

Set all relational instance variables not exposed in the constructor.

Parameters
  • allow_lagged_targets (List[bool]) – Whether we want to allow lagged targets to be aggregated in the join.

  • join_keys_used (List[str]) – Elements in join_keys used to define the relations to the other tables provided in joined_tables.

  • horizon (List[float]) – horizon of the join. Determines the gap between time_stamp and other_time_stamp.

  • memory (List[float]) – memory of the join. Determines how much of the past data may be joined.

  • other_join_keys_used (List[str]) – join_keys of the Placeholder in joined_tables used to define a relation with the current instance. Note that the join_keys instance variable is not contained in the joined_tables.

  • time_stamps_used (List[str]) – Elements in time_stamps used to define the relations to the other tables provided in joined_tables.

  • other_time_stamps_used (List[str]) – time_stamps of the Placeholder in joined_tables used to define a relation with the current instance. Note that the time_stamps instance variable is not contained in the joined_tables.

  • upper_time_stamps_used (List[str]) – time_stamps of the Placeholder in joined_tables used as ‘upper_time_stamp’ to define a relation with the current instance. For details, please see the join() method. Note that the time_stamps instance variable is not contained in the joined_tables.

  • joined_tables (List[Placeholder]) – List of all other Placeholders the current instance is joined on.

Raises
  • TypeError – If any of the input arguments is of the wrong type.

  • ValueError – If the input arguments are not of the same length.
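Because set_relations() takes parallel lists, presumably with one entry per joined table, the length requirement can be pictured with a small check like the one below. This is a hypothetical plain-Python sketch of such a validation, not the check getML performs internally; the helper name `check_same_length` and the error message are assumptions.

```python
def check_same_length(**named_lists):
    """Raise a ValueError unless all passed lists have the same length.

    Hypothetical helper for illustration only.
    """
    lengths = {name: len(values) for name, values in named_lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Input arguments differ in length: {lengths}")
    return True


# Two joined tables, so every list carries two entries.
check_same_length(
    join_keys_used=["join_key", "join_key"],
    time_stamps_used=["time_stamp", "time_stamp"],
    horizon=[0.0, 1.0],
    memory=[0.0, 0.0],
)
```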