join
Placeholder.join(other, join_key='', time_stamp='', other_join_key='', other_time_stamp='', upper_time_stamp='', horizon=0.0, memory=0.0, allow_lagged_targets=False, relationship='many-to-many')
Establish a relation between two Placeholders.
Examples:
import getml

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

# The population placeholder depends on the peripheral placeholder.
population_placeholder.join(
    peripheral_placeholder,
    join_key="join_key",
    time_stamp="time_stamp",
)
The example above will construct a data model in which the 'population_table' depends on the 'peripheral_table' via the 'join_key' column. In addition, only those rows in 'peripheral_table' whose 'time_stamp' is less than or equal to the 'time_stamp' in 'population_table' are considered.
Args:
other (Placeholder):
Placeholder the current instance will depend on.
join_key (str or List[str]):
Name of the StringColumn in the corresponding DataFrame used to establish a relation between the current instance and other.
If no join_key is passed, then all rows of the two data frames will be joined.
If a list of strings is passed, then all join keys must match.
If other_join_key is an empty string, join_key will be used to determine the column of other too.
time_stamp (str, optional):
Name of the FloatColumn in the corresponding DataFrame used to ensure causality.
The provided string must be contained in the time_stamps instance variable.
If other_time_stamp is an empty string, time_stamp will be used to determine the column of other too.
other_join_key (str or List[str], optional):
Name of the StringColumn in the DataFrame represented by other used to establish a relation between the current instance and other.
If an empty string is passed, join_key will be used instead.
other_time_stamp (str, optional):
Name of the FloatColumn in the DataFrame represented by other used to ensure causality.
If an empty string is provided, time_stamp will be used instead.
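The defaulting and matching rules for join_key and other_join_key can be sketched in plain Python. This is only an illustration of the documented semantics, not getML's implementation: the helpers resolve_keys and rows_match are hypothetical, rows are modelled as plain dicts, and both key lists are assumed to have the same length.

```python
from typing import Dict, List, Tuple, Union

Keys = Union[str, List[str]]


def resolve_keys(join_key: Keys = "", other_join_key: Keys = "") -> Tuple[List[str], List[str]]:
    """Normalise both key arguments to lists and apply the documented default."""
    own = [join_key] if isinstance(join_key, str) else list(join_key)
    own = [k for k in own if k != ""]
    if other_join_key == "":
        # An empty other_join_key falls back to join_key.
        return own, list(own)
    other = [other_join_key] if isinstance(other_join_key, str) else list(other_join_key)
    return own, other


def rows_match(row: Dict[str, str], other_row: Dict[str, str],
               join_key: Keys = "", other_join_key: Keys = "") -> bool:
    """Return True if the two rows would be joined on their key columns."""
    own, other = resolve_keys(join_key, other_join_key)
    # With no join key at all, zip() yields nothing and all() is True,
    # so every pair of rows matches (all rows are joined).
    return all(row[a] == other_row[b] for a, b in zip(own, other))
```

With a list of keys, a pair of rows only matches if every key pair agrees, which mirrors the "all join keys must match" rule.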
upper_time_stamp (str, optional):
Optional additional time stamp in other that limits how far back in time rows are joined: a row of other is only used while the time stamp of the current instance is still smaller than its upper_time_stamp. This is useful for data with limited correlation length.
Expressed as SQL code, this will add the condition
t1.time_stamp < t2.upper_time_stamp OR t2.upper_time_stamp IS NULL
to the feature.
If an empty string is provided, all values in the past will be considered.
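The SQL condition for upper_time_stamp can be mirrored in a few lines of Python. This is a minimal sketch of the documented condition only; the function name passes_upper_time_stamp is hypothetical, and None stands in for SQL NULL.

```python
from typing import Optional


def passes_upper_time_stamp(time_stamp: float, upper_time_stamp: Optional[float]) -> bool:
    """Mirror of: t1.time_stamp < t2.upper_time_stamp OR t2.upper_time_stamp IS NULL."""
    # A missing upper_time_stamp (None) means the row in `other` never expires.
    return upper_time_stamp is None or time_stamp < upper_time_stamp


# A row of `other` that expires at 20.0 is used at 15.0 but not at 20.0 or later.
print(passes_upper_time_stamp(15.0, 20.0))  # True
print(passes_upper_time_stamp(20.0, 20.0))  # False
print(passes_upper_time_stamp(99.0, None))  # True
```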
horizon (float, optional):
Minimum period of time that must lie between the time_stamp and the other_time_stamp.
Usually, you need to ensure that no data from the future is used for your prediction, like this:
t1.time_stamp - t2.other_time_stamp >= 0
But in some cases, you would like the gap to be something other than zero. For such cases, you can set a horizon:
t1.time_stamp - t2.other_time_stamp >= horizon
memory (float, optional):
Period of time to which the join is limited.
Expressed as SQL code, this will add the condition
t1.time_stamp - t2.other_time_stamp < horizon + memory
to the feature.
When the memory is set to 0.0 or a negative number, there is no limit.
Limiting the joins using the memory or upper_time_stamp parameter can significantly reduce the training time. However, you can only set an upper_time_stamp or memory, but not both.
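Taken together, horizon and memory define a time window on the difference between the two time stamps. The sketch below combines the two documented SQL conditions in plain Python; the function name in_time_window is hypothetical and the code is an illustration, not getML's implementation.

```python
def in_time_window(time_stamp: float, other_time_stamp: float,
                   horizon: float = 0.0, memory: float = 0.0) -> bool:
    """Check whether a row of `other` falls inside the join window."""
    diff = time_stamp - other_time_stamp
    # Lower bound: t1.time_stamp - t2.other_time_stamp >= horizon
    if diff < horizon:
        return False
    # A memory of 0.0 or less means there is no upper bound.
    if memory <= 0.0:
        return True
    # Upper bound: t1.time_stamp - t2.other_time_stamp < horizon + memory
    return diff < horizon + memory


other_time_stamps = [2.0, 5.0, 8.0, 9.5, 10.0, 11.0]
kept = [t for t in other_time_stamps
        if in_time_window(10.0, t, horizon=1.0, memory=5.0)]
print(kept)  # [5.0, 8.0]
```

With the defaults (horizon=0.0, memory=0.0), every row whose time stamp is not in the future is kept.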
allow_lagged_targets (bool, optional):
For some applications, it is allowed to aggregate over target variables from the past. In others, this is not allowed. If allow_lagged_targets is set to True, you must pass a horizon that is greater than zero, otherwise you would have a data leak (an exception will be thrown to prevent this).
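The rule for allow_lagged_targets amounts to a simple consistency check between two arguments. The sketch below shows that check in isolation. The function name check_lagged_targets is hypothetical, and ValueError is an assumption, since the text above only says that an exception will be thrown.

```python
def check_lagged_targets(allow_lagged_targets: bool, horizon: float) -> None:
    """Reject lagged targets unless a strictly positive horizon separates them."""
    if allow_lagged_targets and horizon <= 0.0:
        # Without a gap, the target at the prediction time itself could be
        # aggregated, which would leak the value being predicted.
        raise ValueError(
            "allow_lagged_targets=True requires a horizon greater than zero."
        )


check_lagged_targets(allow_lagged_targets=True, horizon=1.0)   # accepted
check_lagged_targets(allow_lagged_targets=False, horizon=0.0)  # accepted
```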
relationship (str, optional):
If the relationship between two tables is many-to-one or one-to-one, then feature learning is not necessary or meaningful. If you mark such relationships using one of the constants defined in getml.data.relationship, the tables will be joined directly by the pipeline.
Note:
other must be created (temporally) after the current instance. This was implemented as a measure to prevent circular dependencies in the data model.