join

Placeholder.join(other, join_key='', time_stamp='', other_join_key='', other_time_stamp='', upper_time_stamp='', horizon=0.0, memory=0.0, allow_lagged_targets=False, relationship='many-to-many')

Establish a relation between two Placeholders.

Examples:

This example constructs a data model in which the 'POPULATION' table depends on the 'PERIPHERAL' table via the 'join_key' column. In addition, only those rows in 'PERIPHERAL' whose 'time_stamp' is smaller than the 'time_stamp' in 'POPULATION' are considered. This prevents data leaks:

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp"
)

If the relationship between the two tables is many-to-one or one-to-one, you should explicitly say so:

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp",
                            relationship=getml.data.relationship.many_to_one
)

Please also refer to relationship.

If you want to do a self-join, you can do something like this:

population_placeholder = getml.data.Placeholder("POPULATION")
population_placeholder2 = getml.data.Placeholder("POPULATION")

population_placeholder.join(population_placeholder2,
                            join_key="join_key",
                            time_stamp="time_stamp"
)

If the join keys or time stamps are named differently in the two tables, use other_join_key and other_time_stamp:

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key_in_population",
                            other_join_key="join_key_in_peripheral",
                            time_stamp="time_stamp_in_population",
                            other_time_stamp="time_stamp_in_peripheral"
)

You can join over more than one join key:

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key=["join_key_1", "join_key_2"],
                            time_stamp="time_stamp"
)

You can also limit the scope of your joins using memory. This can significantly reduce training time. For instance, if you only want to consider data from the last seven days, you could do something like this:

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp",
                            memory=getml.data.time.days(7)
)

In some use cases, particularly those involving time series, it might be a good idea to use targets from the past. You can activate this using allow_lagged_targets. But if you do that, you must also define a prediction horizon. For instance, if you want to predict data for the next hour, using data from the last seven days, you could do this:

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp",
                            allow_lagged_targets=True,
                            horizon=getml.data.time.hours(1),
                            memory=getml.data.time.days(7)
)

Please also refer to time.

If the join involves many matches, it might be a good idea to set the relationship to propositionalization. This would force the pipeline to always use a propositionalization algorithm for this join, which can significantly speed things up.

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp",
                            relationship=getml.data.relationship.propositionalization
)

Please also refer to relationship.

Args:

other (Placeholder):

Placeholder the current instance will depend on.

join_key (str or List[str]):

Name of the StringColumn in the corresponding DataFrame used to establish a relation between the current instance and other.

If no join_key is passed, then all rows of the two data frames will be joined.

If a list of strings is passed, then all join keys must match.

If other_join_key is an empty string, join_key will be used to determine the column of other too.
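
For illustration, a minimal sketch of a join without a join_key, which joins all rows of the two data frames (the placeholder and column names are assumptions for this example):

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

# No join_key is passed, so every row in PERIPHERAL is matched with
# every row in POPULATION (subject to the time stamp condition).
population_placeholder.join(peripheral_placeholder,
                            time_stamp="time_stamp"
)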

time_stamp (str, optional):

Name of the FloatColumn in the corresponding DataFrame used to ensure causality.

The provided string must be contained in the time_stamps instance variable.

If other_time_stamp is an empty string, time_stamp will be used to determine the column of other too.

other_join_key (str or List[str], optional):

Name of the StringColumn in the DataFrame represented by other used to establish a relation between the current instance and other.

If an empty string is passed, join_key will be used instead.

other_time_stamp (str, optional):

Name of the FloatColumn in the DataFrame represented by other used to ensure causality.

If an empty string is provided, time_stamp will be used instead.

upper_time_stamp (str, optional):

Additional time stamp in other that limits the number of joined rows to a certain point in the past. This is useful for data with limited correlation length.

Expressed as SQL code, this will add the condition

t1.time_stamp < t2.upper_time_stamp OR
t2.upper_time_stamp IS NULL

to the feature.

If an empty string is provided, all values in the past will be considered.
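
A hedged sketch of how this could look (the column name "upper_time_stamp" is an assumption; it stands for a time stamp column in the peripheral table):

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

# Rows in PERIPHERAL are only considered while
# t1.time_stamp < t2.upper_time_stamp (or upper_time_stamp IS NULL).
population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp",
                            upper_time_stamp="upper_time_stamp"
)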

horizon (float, optional):

Period of time between the time_stamp and the other_time_stamp.

Usually, you need to ensure that no data from the future is used for your prediction, like this:

t1.time_stamp - t2.other_time_stamp >= 0

But in some cases, you would like the gap to be something other than zero. For such cases, you can set a horizon:

t1.time_stamp - t2.other_time_stamp >= horizon
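
For instance, a sketch of a join that enforces a gap of one hour between the two time stamps, using getml.data.time.hours as in the examples above (placeholder and column names are illustrative):

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

# Only rows for which t1.time_stamp - t2.other_time_stamp >= 1 hour
# are matched.
population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp",
                            horizon=getml.data.time.hours(1)
)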

memory (float, optional):

Period of time to which the join is limited.

Expressed as SQL code, this will add the condition

t1.time_stamp - t2.other_time_stamp < horizon + memory

to the feature.

When the memory is set to 0.0 or a negative number, there is no limit.

Limiting the joins using the memory or upper_time_stamp parameter can significantly reduce training time. However, you can set either upper_time_stamp or memory, but not both.

allow_lagged_targets (bool, optional):

In some applications, aggregating over target variables from the past is allowed; in others, it is not. If allow_lagged_targets is set to True, you must pass a horizon that is greater than zero; otherwise you would have a data leak (an exception will be thrown to prevent this).

relationship (str, optional):

If the relationship between the two tables is many-to-one or one-to-one, then feature learning is neither necessary nor meaningful. If you mark such relationships using one of the constants defined in relationship, the tables will be joined directly by the pipeline.

Note:

other must be created (temporally) after the current instance. This was implemented as a measure to prevent circular dependencies in the data model.