Placeholder

class getml.data.Placeholder(name)

Abstract representation of tables and their relations.
This class provides an abstract representation of the DataFrame. However, it does not contain any actual data.

Examples:
This example will construct a data model in which the ‘population_table’ depends on the ‘peripheral_table’ via the ‘join_key’ column. In addition, only those rows in ‘peripheral_table’ for which ‘time_stamp’ is smaller than the ‘time_stamp’ in ‘population_table’ are considered. This is to prevent data leaks:
    import getml

    population_placeholder = getml.data.Placeholder("POPULATION")
    peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

    population_placeholder.join(
        peripheral_placeholder,
        join_key="join_key",
        time_stamp="time_stamp",
    )
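The time-stamp condition described above can be illustrated in plain Python. This is only a sketch of the row-matching rule as the text states it, not getML's implementation: the helper `matching_rows` and the sample rows are hypothetical, and the strict `<` comparison follows the wording "smaller than" (whether the boundary is inclusive in the library is an assumption).

```python
# Illustrative sketch in plain Python; getML itself is not used here.

def matching_rows(population_row, peripheral_rows,
                  join_key="join_key", time_stamp="time_stamp"):
    """Return the peripheral rows that share the join key with the
    population row and carry a strictly smaller time stamp."""
    return [
        row for row in peripheral_rows
        if row[join_key] == population_row[join_key]
        and row[time_stamp] < population_row[time_stamp]
    ]


population_row = {"join_key": 1, "time_stamp": 10.0}
peripheral_rows = [
    {"join_key": 1, "time_stamp": 5.0, "value": 1.0},   # same key, earlier: kept
    {"join_key": 1, "time_stamp": 12.0, "value": 2.0},  # same key, later: dropped
    {"join_key": 2, "time_stamp": 3.0, "value": 3.0},   # different key: dropped
]

matches = matching_rows(population_row, peripheral_rows)
```

Dropping the later row is what prevents information from the future of a population row from leaking into its features.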
If the relationship between two tables is many-to-one or one-to-one, you should state this explicitly:

    population_placeholder = getml.data.Placeholder("POPULATION")
    peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

    population_placeholder.join(
        peripheral_placeholder,
        join_key="join_key",
        time_stamp="time_stamp",
        relationship=getml.data.relationship.many_to_one,
    )
Please also refer to relationship.

If you want to do a self-join, you can do something like this:

    population_placeholder = getml.data.Placeholder("POPULATION")
    population_placeholder2 = getml.data.Placeholder("POPULATION")

    population_placeholder.join(
        population_placeholder2,
        join_key="join_key",
        time_stamp="time_stamp",
    )
If the join keys or time stamps are named differently in the two different tables, use other_join_key and other_time_stamp:
    population_placeholder = getml.data.Placeholder("POPULATION")
    peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

    population_placeholder.join(
        peripheral_placeholder,
        join_key="join_key_in_population",
        other_join_key="join_key_in_peripheral",
        time_stamp="time_stamp_in_population",
        other_time_stamp="time_stamp_in_peripheral",
    )
You can join over more than one join key:
    population_placeholder = getml.data.Placeholder("POPULATION")
    peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

    population_placeholder.join(
        peripheral_placeholder,
        join_key=["join_key_1", "join_key_2"],
        time_stamp="time_stamp",
    )
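Conceptually, joining over several keys means that two rows match only if all listed keys agree. The following plain-Python sketch illustrates that idea by indexing rows under a tuple of key values. The helpers `composite_key` and `index_by_keys` and the sample data are hypothetical and say nothing about how getML performs the join internally.

```python
# Illustrative sketch in plain Python; getML itself is not used here.
from collections import defaultdict


def composite_key(row, join_keys):
    """Combine the values of all join keys into one hashable tuple."""
    return tuple(row[key] for key in join_keys)


def index_by_keys(rows, join_keys):
    """Group rows by their composite key for fast lookup."""
    index = defaultdict(list)
    for row in rows:
        index[composite_key(row, join_keys)].append(row)
    return index


join_keys = ["join_key_1", "join_key_2"]
peripheral_rows = [
    {"join_key_1": "a", "join_key_2": 1, "value": 10},
    {"join_key_1": "a", "join_key_2": 2, "value": 20},  # second key differs
    {"join_key_1": "b", "join_key_2": 1, "value": 30},  # first key differs
]
population_row = {"join_key_1": "a", "join_key_2": 1}

index = index_by_keys(peripheral_rows, join_keys)
matches = index.get(composite_key(population_row, join_keys), [])
```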
You can also limit the scope of your joins using memory. This can significantly reduce training time. For instance, if you only want to consider data from the last seven days, you could do something like this:

    population_placeholder = getml.data.Placeholder("POPULATION")
    peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

    population_placeholder.join(
        peripheral_placeholder,
        join_key="join_key",
        time_stamp="time_stamp",
        memory=getml.data.time.days(7),
    )
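The effect of memory can be pictured as a sliding window behind each population row. The sketch below is plain Python under stated assumptions: time stamps and the memory are expressed in seconds, the helper `within_memory` is hypothetical, and the exact boundary handling (strict or inclusive) is a guess rather than documented behavior.

```python
# Illustrative sketch in plain Python; getML itself is not used here.
SECONDS_PER_DAY = 24 * 60 * 60


def within_memory(population_ts, peripheral_ts, memory):
    """True if the peripheral row lies in the past of the population row,
    but no further back than `memory` seconds."""
    age = population_ts - peripheral_ts
    return 0 < age <= memory


memory = 7 * SECONDS_PER_DAY   # seven days, in seconds
now = 30 * SECONDS_PER_DAY     # time stamp of the population row

candidates = {
    "three_days_old": now - 3 * SECONDS_PER_DAY,
    "ten_days_old": now - 10 * SECONDS_PER_DAY,
    "future": now + 1,
}

kept = sorted(
    name for name, ts in candidates.items()
    if within_memory(now, ts, memory)
)
```

Fewer rows per population row means fewer matches to aggregate, which is where the reduction in training time comes from.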
In some use cases, particularly those involving time series, it might be a good idea to use targets from the past. You can activate this using allow_lagged_targets. But if you do that, you must also define a prediction horizon. For instance, if you want to predict data for the next hour, using data from the last seven days, you could do this:
    population_placeholder = getml.data.Placeholder("POPULATION")
    peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

    population_placeholder.join(
        peripheral_placeholder,
        join_key="join_key",
        time_stamp="time_stamp",
        allow_lagged_targets=True,
        horizon=getml.data.time.hours(1),
        memory=getml.data.time.days(7),
    )
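Why lagged targets require a horizon can be seen from a small plain-Python sketch: a past target is only usable if it was already known at the moment the prediction is made, that is, if it is at least one horizon old. The helper `usable` is hypothetical, all quantities are assumed to be in seconds, and the boundary handling is an assumption rather than getML's documented behavior.

```python
# Illustrative sketch in plain Python; getML itself is not used here.
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def usable(population_ts, peripheral_ts, horizon, memory):
    """True if the peripheral row is older than the prediction horizon
    and still inside the memory window that starts at the horizon."""
    age = population_ts - peripheral_ts
    return horizon < age <= horizon + memory


horizon = 1 * SECONDS_PER_HOUR  # predict one hour ahead
memory = 7 * SECONDS_PER_DAY    # look back seven days
now = 30 * SECONDS_PER_DAY      # time stamp of the population row

too_recent = usable(now, now - 30 * 60, horizon, memory)             # 30 minutes old
in_window = usable(now, now - 2 * SECONDS_PER_HOUR, horizon, memory)  # 2 hours old
too_old = usable(now, now - 8 * SECONDS_PER_DAY, horizon, memory)     # 8 days old
```

In this sketch, the 30-minute-old target is rejected because it would not yet be available when a one-hour-ahead prediction has to be made.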
Please also refer to time.

If the join involves many matches, it might be a good idea to set the relationship to propositionalization. This would force the pipeline to always use a propositionalization algorithm for this join, which can significantly speed things up.

    population_placeholder = getml.data.Placeholder("POPULATION")
    peripheral_placeholder = getml.data.Placeholder("PERIPHERAL")

    population_placeholder.join(
        peripheral_placeholder,
        join_key="join_key",
        time_stamp="time_stamp",
        relationship=getml.data.relationship.propositionalization,
    )
Please also refer to relationship.

Args:
    name (str): The name used for this placeholder. This name will appear in the generated SQL code.

Raises:
    TypeError: If any of the input arguments is of the wrong type.

Methods

join(other[, join_key, time_stamp, …])    Establish a relation between two Placeholders.
set_relations([allow_lagged_targets, …])  Set all relational instance variables not exposed in the constructor.