Placeholder

class getml.data.Placeholder(name)

Abstract representation of tables and their relations.

A Placeholder describes the structure of a DataFrame and its relations to other tables, but it does not contain any actual data.

Examples:

This example constructs a data model in which ‘population_table’ depends on ‘peripheral_table’ via the ‘join_key’ column. In addition, only those rows in ‘peripheral_table’ whose ‘time_stamp’ is smaller than the ‘time_stamp’ in ‘population_table’ are considered, which prevents data leaks:

import getml

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp"
)
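
To make the time-stamp condition concrete, the following plain-Python sketch mimics which peripheral rows would be matched to a given population row. It does not use getML: the helper matching_rows and the toy rows are hypothetical, and the strict comparison simply follows the wording above rather than any documented engine behavior.

```python
# Illustrative sketch only: plain Python, not the getML engine.
# `matching_rows` and the sample rows are hypothetical.

def matching_rows(population_row, peripheral_rows):
    """Return peripheral rows that share the join key and whose time stamp
    is smaller than the population row's time stamp."""
    return [
        row
        for row in peripheral_rows
        if row["join_key"] == population_row["join_key"]
        and row["time_stamp"] < population_row["time_stamp"]
    ]


peripheral = [
    {"join_key": "a", "time_stamp": 1.0, "value": 10},
    {"join_key": "a", "time_stamp": 5.0, "value": 20},  # lies in the future
    {"join_key": "b", "time_stamp": 2.0, "value": 30},  # different join key
]

population_row = {"join_key": "a", "time_stamp": 3.0}

matches = matching_rows(population_row, peripheral)
print(matches)
```

Only the first peripheral row survives: the second one is newer than the population row, so using it would leak information from the future, and the third belongs to a different join key.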

If the relationship between two tables is many-to-one or one-to-one, you should state this explicitly:

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp",
                            relationship=getml.data.relationship.many_to_one
)

Please also refer to getml.data.relationship.

To do a self-join, create a second placeholder with the same name and join it to the first:

population_placeholder = getml.data.Placeholder("POPULATION")
population_placeholder2 = getml.data.Placeholder("POPULATION")

population_placeholder.join(population_placeholder2,
                            join_key="join_key",
                            time_stamp="time_stamp"
)

If the join keys or time stamps are named differently in the two tables, use other_join_key and other_time_stamp:

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key_in_population",
                            other_join_key="join_key_in_peripheral",
                            time_stamp="time_stamp_in_population",
                            other_time_stamp="time_stamp_in_peripheral"
)

You can join over more than one join key:

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key=["join_key_1", "join_key_2"],
                            time_stamp="time_stamp"
)

You can also limit the scope of your joins using memory. This can significantly reduce training time. For instance, if you only want to consider data from the last seven days, you could do something like this:

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp",
                            memory=getml.data.time.days(7)
)

In some use cases, particularly those involving time series, it might be a good idea to use targets from the past. You can enable this with allow_lagged_targets, but you must then also define a prediction horizon. For instance, if you want to predict data for the next hour, using data from the last seven days, you could do this:

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp",
                            allow_lagged_targets=True,
                            horizon=getml.data.time.hours(1),
                            memory=getml.data.time.days(7)
)

Please also refer to getml.data.time.
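
One way to picture how horizon and memory interact is as a sliding window over past rows. The plain-Python sketch below assumes that a past row is usable only if it is at least horizon older than the prediction time stamp and less than horizon plus memory older. This window rule, the helper in_window, and the boundary handling are assumptions made for illustration; the exact inequalities used by getML are not stated here.

```python
# Illustrative sketch only: plain Python, not the getML engine.
# The window rule below is an assumed reading of horizon and memory.

HOUR = 3600.0      # seconds
DAY = 24 * HOUR    # seconds


def in_window(population_ts, peripheral_ts, horizon, memory):
    """True if the peripheral row is at least `horizon` old and less than
    `horizon + memory` old, relative to the population time stamp."""
    age = population_ts - peripheral_ts
    return horizon <= age < horizon + memory


now = 10 * DAY
candidates = {
    "30 minutes ago": now - 0.5 * HOUR,  # too recent for a 1-hour horizon
    "2 hours ago": now - 2 * HOUR,
    "3 days ago": now - 3 * DAY,
    "8 days ago": now - 8 * DAY,         # outside the 7-day memory
}

kept = [
    label
    for label, ts in candidates.items()
    if in_window(now, ts, horizon=1 * HOUR, memory=7 * DAY)
]
print(kept)
```

Under these assumptions, rows that are too fresh are excluded because they would not yet be available when predicting one hour ahead, and rows older than the memory are dropped, which is what shrinks the join.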

If the join involves many matches, it might be a good idea to set the relationship to propositionalization. This forces the pipeline to always use a propositionalization algorithm for this join, which can significantly speed things up:

population_placeholder = getml.data.Placeholder("POPULATION")
peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

population_placeholder.join(peripheral_placeholder,
                            join_key="join_key",
                            time_stamp="time_stamp",
                            relationship=getml.data.relationship.propositionalization
)

Please also refer to getml.data.relationship.

Args:

name (str): The name used for this placeholder. This name will appear in the generated SQL code.

Raises:

TypeError: If any of the input arguments is of the wrong type.

Methods

`join`(other[, join_key, time_stamp, …])	Establish a relation between two `Placeholder`s.
`set_relations`([allow_lagged_targets, …])	Set all relational instance variables not exposed in the constructor.