join

Placeholder.join(other, join_key='', time_stamp='', other_join_key='', other_time_stamp='', upper_time_stamp='', horizon=0.0, memory=0.0, allow_lagged_targets=False, relationship='many-to-many')[source]

Establish a relation between two Placeholders.

Examples:

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp"
)

The example above constructs a data model in which 'POPULATION' depends on 'PERIPHERAL' via the 'join_key' column. In addition, only those rows in 'PERIPHERAL' whose 'time_stamp' is smaller than the 'time_stamp' in 'POPULATION' are considered.

Args:

other (Placeholder):

Placeholder the current instance will depend on.

join_key (str or List[str]):

Name of the StringColumn in the corresponding DataFrame used to establish a relation between the current instance and other.

If no join_key is passed, then all rows of the two data frames will be joined.

If a list of strings is passed, then all of the listed join keys must match for two rows to be joined.

If other_join_key is an empty string, join_key will be used to determine the column of other too.
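To make the multi-key behavior concrete, here is a minimal sketch in plain Python (not part of the getml API) of the matching rule applied when a list of join keys is passed: two rows are only joined if they agree on every listed key.

```python
def keys_match(population_row, peripheral_row, join_keys):
    """Illustrative only: rows join only if ALL listed join keys match."""
    return all(population_row[k] == peripheral_row[k] for k in join_keys)

pop = {"customer_id": 1, "shop_id": 7}
per = {"customer_id": 1, "shop_id": 7}
print(keys_match(pop, per, ["customer_id", "shop_id"]))  # True
```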

time_stamp (str, optional):

Name of the FloatColumn in the corresponding DataFrame used to ensure causality.

The provided string must be contained in the time_stamps instance variable.

If other_time_stamp is an empty string, time_stamp will be used to determine the column of other too.

other_join_key (str or List[str], optional):

Name of the StringColumn in the DataFrame represented by other used to establish a relation between the current instance and other.

If an empty string is passed, join_key will be used instead.

other_time_stamp (str, optional):

Name of the FloatColumn in the DataFrame represented by other used to ensure causality.

If an empty string is provided, time_stamp will be used instead.

upper_time_stamp (str, optional):

Optional additional time stamp in other that limits how far into the past rows are joined. This is useful for data with limited correlation length.

Expressed as SQL code, this will add the condition

t1.time_stamp < t2.upper_time_stamp OR
t2.upper_time_stamp IS NULL

to the feature.

If an empty string is provided, all values in the past will be considered.
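The SQL condition above can be sketched in plain Python (illustrative only, not part of the getml API), with None standing in for SQL NULL:

```python
def within_upper_limit(t1_time_stamp, t2_upper_time_stamp):
    """Mirrors: t1.time_stamp < t2.upper_time_stamp OR t2.upper_time_stamp IS NULL."""
    # A NULL upper_time_stamp means there is no limit into the past.
    return t2_upper_time_stamp is None or t1_time_stamp < t2_upper_time_stamp

print(within_upper_limit(5.0, None))   # True: no upper limit set
print(within_upper_limit(5.0, 10.0))   # True: 5.0 < 10.0
print(within_upper_limit(10.0, 5.0))   # False: row is too old
```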

horizon (float, optional):

Period of time between the time_stamp and the other_time_stamp.

Usually, you need to ensure that no data from the future is used for your prediction, like this:

t1.time_stamp - t2.other_time_stamp >= 0

But in some cases, you would like the gap to be something other than zero. For such cases, you can set a horizon:

t1.time_stamp - t2.other_time_stamp >= horizon

memory (float, optional):

Period of time to which the join is limited.

Expressed as SQL code, this will add the condition

t1.time_stamp - t2.other_time_stamp < horizon + memory

to the feature.

When the memory is set to 0.0 or a negative number, there is no limit.

Limiting the joins using the memory or upper_time_stamp parameter can significantly reduce the training time. However, you can only set an upper_time_stamp or memory, but not both.
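Putting the horizon and memory conditions together, the time window for a joined row can be sketched in plain Python (illustrative only, not part of the getml API):

```python
def in_time_window(t1_time_stamp, t2_time_stamp, horizon=0.0, memory=0.0):
    """Mirrors the two SQL conditions shown above."""
    diff = t1_time_stamp - t2_time_stamp
    # Lower bound: t1.time_stamp - t2.other_time_stamp >= horizon
    if diff < horizon:
        return False
    # Upper bound: t1.time_stamp - t2.other_time_stamp < horizon + memory
    # (a memory of 0.0 or less means no upper limit)
    if memory > 0.0 and diff >= horizon + memory:
        return False
    return True

print(in_time_window(10.0, 9.0))                            # True: past row, no memory limit
print(in_time_window(9.0, 10.0))                            # False: row from the future
print(in_time_window(10.0, 2.0, horizon=0.0, memory=5.0))   # False: outside the memory window
```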

allow_lagged_targets (bool, optional):

Some applications allow aggregating over target variables from the past; others do not. If allow_lagged_targets is set to True, you must pass a horizon that is greater than zero; otherwise you would have a data leak (an exception is thrown to prevent this).

relationship (str, optional):

If the relationship between two tables is many-to-one or one-to-one, then feature learning is not necessary or meaningful. If you mark such relationships using one of the constants defined in :mod:`~getml.data.relationship`, the tables will be joined directly by the pipeline.
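A usage sketch in the style of the example above (requires getml; `many_to_one` is assumed here to be one of the constants in the getml.data.relationship module referenced in this section):

```python
population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

# Marking the relationship tells the pipeline to join the tables
# directly instead of running feature learning on the join.
population_placeholder.join(
    peripheral_placeholder,
    join_key="join_key",
    time_stamp="time_stamp",
    relationship=getml.data.relationship.many_to_one,
)
```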

Note:

other must be created (temporally) after the current instance. This restriction was implemented to prevent circular dependencies in the data model.