join

StarSchema.join(right_df, alias=None, on=None, time_stamps=None, relationship='many-to-many', memory=None, horizon=None, lagged_targets=False, upper_time_stamp=None)

Joins a DataFrame or View to the population table.

In a StarSchema or TimeSeries, all joins take place on the population table. If you want to create more complex data models, use DataModel instead.

Examples:

This example constructs a data model in which the ‘population_table’ depends on the ‘peripheral_table’ via the ‘join_key’ column. In addition, only those rows in ‘peripheral_table’ whose ‘time_stamp’ is less than or equal to the ‘time_stamp’ in ‘population_table’ are considered:

star_schema = getml.data.StarSchema(
    population=population_table, split=split)

star_schema.join(
    peripheral_table,
    on="join_key",
    time_stamps="time_stamp"
)
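
To make the join condition concrete, here is a minimal plain-Python sketch of which peripheral rows would be matched to a single population row. It does not call getML: the row dictionaries and the helper `matching_rows` are hypothetical and only mirror the rule stated above (equal join keys, peripheral time stamp less than or equal to the population time stamp).

```python
def matching_rows(population_row, peripheral_rows,
                  on="join_key", time_stamps="time_stamp"):
    """Return the peripheral rows a population row would be joined to.

    Illustrative only: mimics the rule described above, not getML's engine.
    """
    return [
        row for row in peripheral_rows
        if row[on] == population_row[on]
        and row[time_stamps] <= population_row[time_stamps]
    ]


population_row = {"join_key": "a", "time_stamp": 10.0}
peripheral_rows = [
    {"join_key": "a", "time_stamp": 5.0},   # same key, earlier: matched
    {"join_key": "a", "time_stamp": 10.0},  # same key, equal: matched
    {"join_key": "a", "time_stamp": 11.0},  # same key, later: excluded
    {"join_key": "b", "time_stamp": 1.0},   # different key: excluded
]

matched = matching_rows(population_row, peripheral_rows)
```

Only the first two peripheral rows end up in `matched`; the third lies in the future relative to the population row, and the fourth has a different join key.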

If the relationship between the two tables is many-to-one or one-to-one, you should state this explicitly:

star_schema.join(
    peripheral_table,
    on="join_key",
    time_stamps="time_stamp",
    relationship=getml.data.relationship.many_to_one,
)

Please also refer to relationship.

If the join keys or time stamps are named differently in the two different tables, use a tuple:

star_schema.join(
    peripheral_table,
    on=("join_key", "other_join_key"),
    time_stamps=("time_stamp", "other_time_stamp"),
)

You can join over more than one join key:

star_schema.join(
    peripheral_table,
    on=["join_key1", "join_key2", ("join_key3", "other_join_key3")],
    time_stamps="time_stamp",
)
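
The `on` argument therefore accepts several shapes. The helper below, `normalize_on`, is hypothetical and not part of getML; it sketches how a string, a tuple, or a mixed list could be expanded into explicit pairs of column names. It assumes that the first element of each tuple names the column in the population table and the second names the column in right_df.

```python
def normalize_on(on):
    """Expand an `on` specification into (population_key, peripheral_key) pairs.

    Hypothetical helper for illustration; tuple order is an assumption.
    """
    if on is None:
        return []
    # A single key (string or tuple) is treated like a one-element list.
    if isinstance(on, (str, tuple)):
        on = [on]
    pairs = []
    for key in on:
        if isinstance(key, str):
            # Same column name in both tables.
            pairs.append((key, key))
        else:
            population_key, peripheral_key = key
            pairs.append((population_key, peripheral_key))
    return pairs


pairs = normalize_on(
    ["join_key1", "join_key2", ("join_key3", "other_join_key3")]
)
```

Under these assumptions, a row would match only if all three pairs of columns agree.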

You can also limit the scope of your joins using memory. This can significantly reduce training time. For instance, if you only want to consider data from the last seven days, you could do something like this:

star_schema.join(
    peripheral_table,
    on="join_key",
    time_stamps="time_stamp",
    memory=getml.data.time.days(7),
)
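
The effect of memory can be pictured as a sliding window over the peripheral time stamps. The following plain-Python sketch does not use getML. The helper `days` is a stand-in that assumes time stamps and durations are expressed in seconds, and `within_memory` is a hypothetical function whose boundary handling (window open at the far end) is an assumption rather than documented behaviour.

```python
def days(num):
    """Stand-in for getml.data.time.days: express days in seconds (assumption)."""
    return float(num) * 24.0 * 60.0 * 60.0


def within_memory(population_ts, peripheral_ts, memory):
    """True if the peripheral row lies in the past and is not yet 'forgotten'.

    Assumes the window includes the population time stamp itself and
    excludes rows that are exactly `memory` old; getML may differ here.
    """
    age = population_ts - peripheral_ts
    return 0.0 <= age < memory


population_ts = days(30)
memory = days(7)

recent = within_memory(population_ts, days(29), memory)     # one day old
forgotten = within_memory(population_ts, days(20), memory)  # ten days old
future = within_memory(population_ts, days(31), memory)     # not yet known
```

In this sketch only the one-day-old row is kept; the ten-day-old row falls outside the seven-day memory and the future row is excluded by the time stamp condition.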

In some use cases, particularly those involving time series, it might be a good idea to use targets from the past. You can activate this using lagged_targets. But if you do that, you must also define a prediction horizon. For instance, if you want to predict data for the next hour, using data from the last seven days, you could do this:

star_schema.join(
    peripheral_table,
    on="join_key",
    time_stamps="time_stamp",
    lagged_targets=True,
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.days(7),
)
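
A rough way to picture how horizon and memory interact: the peripheral time stamps are shifted forward by the horizon, and the memory window is applied to the shifted values. The plain-Python sketch below illustrates that reading. It is an assumption about the semantics, not getML's implementation, and the helpers `hours`, `days`, `in_window` and `check_lagged_targets` are hypothetical.

```python
def hours(num):
    """Stand-in for getml.data.time.hours: hours in seconds (assumption)."""
    return float(num) * 60.0 * 60.0


def days(num):
    """Stand-in for getml.data.time.days: days in seconds (assumption)."""
    return float(num) * 24.0 * 60.0 * 60.0


def check_lagged_targets(lagged_targets, horizon):
    """Mirror the documented rule: lagged targets need a positive horizon."""
    if lagged_targets and (horizon is None or horizon <= 0.0):
        raise ValueError("lagged_targets=True requires a positive horizon")


def in_window(population_ts, peripheral_ts, horizon, memory):
    """True if the peripheral row is old enough (horizon) but not too old (memory)."""
    shifted = peripheral_ts + horizon
    return shifted <= population_ts < shifted + memory


check_lagged_targets(lagged_targets=True, horizon=hours(1))

population_ts = days(10)
too_recent = in_window(population_ts, days(10) - hours(0.5), hours(1), days(7))
usable = in_window(population_ts, days(10) - hours(2), hours(1), days(7))
too_old = in_window(population_ts, days(2), hours(1), days(7))
```

Under these assumptions, a target observed half an hour ago is too recent to be used for a one-hour-ahead prediction, one observed two hours ago is usable, and one observed eight days ago has been forgotten.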

Please also refer to time.

If the join involves many matches, it might be a good idea to set the relationship to propositionalization. This forces the pipeline to always use a propositionalization algorithm for this join, which can significantly speed things up.

star_schema.join(
    peripheral_table,
    on="join_key",
    time_stamps="time_stamp",
    relationship=getml.data.relationship.propositionalization,
)

Please also refer to relationship.

Args:
right_df (DataFrame or View):

The data frame or view you would like to join.

alias (str or None):

The name by which right_df will be referred to in the generated SQL code.

on (None, str, Tuple[str, str] or List[Union[str, Tuple[str, str]]]):

The join keys to use. If None is passed, every row of right_df will be joined to every row of the population table.

time_stamps (None, str or Tuple[str, str]):

The time stamps used to limit the join.

relationship (str):

The relationship between the two tables. Must be one of the values defined in relationship.

memory (float):

The difference between the time stamps beyond which data is ‘forgotten’. Limiting your joins using memory can significantly reduce training time. Also refer to time.

horizon (float):

The prediction horizon to apply to this join. Also refer to time.

lagged_targets (bool):

Whether you want to allow lagged targets. If this is set to True, you must also pass a positive horizon.

upper_time_stamp (str):

Name of a time stamp in right_df that serves as an upper limit on the join.
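
As a rough illustration of upper_time_stamp, the plain-Python sketch below treats the column as the end of each peripheral row's validity period. It does not use getML. The column name ‘valid_until’ and the helper `is_valid` are hypothetical, and treating the upper limit as exclusive is an assumption.

```python
def is_valid(population_ts, row,
             time_stamp="time_stamp", upper_time_stamp="valid_until"):
    """True if the row has started and, where an upper limit exists, not yet expired.

    Assumes the upper limit is exclusive; this is not confirmed by the text above.
    """
    upper = row.get(upper_time_stamp)
    started = row[time_stamp] <= population_ts
    not_expired = upper is None or population_ts < upper
    return started and not_expired


rows = [
    {"time_stamp": 1.0, "valid_until": 5.0},   # expired before t=6
    {"time_stamp": 4.0, "valid_until": 9.0},   # valid at t=6
    {"time_stamp": 2.0, "valid_until": None},  # no upper limit
]

valid = [is_valid(6.0, row) for row in rows]
```

With a population time stamp of 6.0, the first row has already expired, while the second and third are still considered.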