Placeholder
class getml.data.Placeholder(name, categorical=None, numerical=None, join_keys=None, time_stamps=None, targets=None)
Bases: object
Schematic representation of tables and their relations
This class provides a lightweight representation of a DataFrame, including both its general structure and its relations to other DataFrame objects. The actual data, however, is not contained.
Examples
Although you can use the constructor of Placeholder directly to replicate the structure of a DataFrame, we highly recommend using the getml.data.DataFrame.to_placeholder() method to generate placeholders automatically.
    import getml

    # Create some DataFrames
    population_table, peripheral_table = getml.datasets.make_numerical()
    # Derive Placeholders from them
    population_placeholder = population_table.to_placeholder()
    peripheral_placeholder = peripheral_table.to_placeholder()
With your Placeholder in place, you can use the join() method to construct the data model (required to construct the models later on).
    population_placeholder.join(peripheral_placeholder,
                                join_key="join_key",
                                time_stamp="time_stamp")
- Parameters
name (str) – Name of the corresponding DataFrame the Placeholder is modelled on.
categorical (List[str], optional) – Names of the columns of the DataFrame that have the role categorical.
numerical (List[str], optional) – Names of the columns of the DataFrame that have the role numerical.
join_keys (List[str], optional) – Names of the columns of the DataFrame that have the role join_key.
time_stamps (List[str], optional) – Names of the columns of the DataFrame that have the role time_stamp.
targets (List[str], optional) – Names of the columns of the DataFrame that have the role target.
- Raises
TypeError – If any of the input arguments is of wrong type.
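The constructor arguments and the TypeError behaviour described above can be illustrated with a small stand-in class. This is only a sketch: PlaceholderSketch and _checked_str_list are hypothetical names, the code does not use getml, and the choice to store omitted arguments as empty lists is an assumption rather than documented behaviour.

```python
def _checked_str_list(value, arg_name):
    """Return a list of str; raise TypeError for any other input (hypothetical helper)."""
    if value is None:
        return []  # assumption: omitted roles are stored as empty lists
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise TypeError(f"'{arg_name}' must be a list of str or None")
    return list(value)


class PlaceholderSketch:
    """Hypothetical stand-in mirroring the documented constructor; NOT the getml class."""

    def __init__(self, name, categorical=None, numerical=None,
                 join_keys=None, time_stamps=None, targets=None):
        if not isinstance(name, str):
            raise TypeError("'name' must be a str")
        self.name = name
        # Each argument lists the column names that carry the respective role.
        self.categorical = _checked_str_list(categorical, "categorical")
        self.numerical = _checked_str_list(numerical, "numerical")
        self.join_keys = _checked_str_list(join_keys, "join_keys")
        self.time_stamps = _checked_str_list(time_stamps, "time_stamps")
        self.targets = _checked_str_list(targets, "targets")


population = PlaceholderSketch(
    "population",
    join_keys=["join_key"],
    time_stamps=["time_stamp"],
    targets=["target"],
)
```

Passing a plain string where a list is expected, for example numerical="column_01", raises TypeError in this sketch.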
Note
The input arguments categorical, numerical, join_keys, time_stamps, and targets represent the general structure of the corresponding DataFrame. Its relations to other DataFrame objects, which are stored in the instance variables join_keys_used, other_join_keys_used, time_stamps_used, other_time_stamps_used, upper_time_stamps_used, and joined_tables, are not set by the constructor but by the set_relations() method. That method is called internally when invoking join() and is only exported as a public method to allow more sophisticated scripting.
Methods Summary
join(other, join_key[, time_stamp, …]) – Establish a relation between two Placeholder objects.
set_relations([join_keys_used, …]) – Set all relational instance variables not exposed in the constructor.
Methods Documentation
join(other, join_key, time_stamp='', other_join_key='', other_time_stamp='', upper_time_stamp='')
Establish a relation between two Placeholder objects.
For the feature engineering algorithm to craft sophisticated features, it has to know the general structure of your relational data. This structure, the data model, is composed of both the schematic representations of the involved DataFrame objects and their relations to each other. The latter are introduced using this method.
Examples
    population_table, peripheral_table = getml.datasets.make_numerical()
    population_placeholder = population_table.to_placeholder()
    peripheral_placeholder = peripheral_table.to_placeholder()
    population_placeholder.join(peripheral_placeholder,
                                join_key="join_key",
                                time_stamp="time_stamp")
The example above constructs a data model in which the ‘population_table’ depends on the ‘peripheral_table’ via the ‘join_key’ column. In addition, only those rows in ‘peripheral_table’ whose ‘time_stamp’ is smaller than the ‘time_stamp’ in ‘population_table’ are considered.
- Parameters
other (Placeholder) – Placeholder the current instance will depend on.
join_key (str) – Name of the StringColumn in the corresponding DataFrame used to establish a relation between the current instance and other. The provided string must be contained in the join_keys instance variable. If other_join_key is an empty string, join_key will be used to determine the column of other too.
time_stamp (str, optional) – Name of the FloatColumn in the corresponding DataFrame used to ensure causality. The provided string must be contained in the time_stamps instance variable. If other_time_stamp is an empty string, time_stamp will be used to determine the column of other too.
other_join_key (str, optional) – Name of the StringColumn in the DataFrame represented by other used to establish a relation between the current instance and other. If an empty string is provided, join_key will be used instead.
other_time_stamp (str, optional) – Name of the FloatColumn in the DataFrame represented by other used to ensure causality. If an empty string is provided, time_stamp will be used instead.
upper_time_stamp (str, optional) – Optional additional time stamp in other that will limit the number of joined rows to a certain point in the past. This is useful for data with limited correlation length. Expressed as SQL code, this will add the condition
    t1.time_stamp < t2.upper_time_stamp OR t2.upper_time_stamp IS NULL
to the feature. If an empty string is provided, all values in the past will be considered.
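The SQL condition attached to upper_time_stamp can be read as a simple predicate. The sketch below mirrors it in plain Python, with None standing in for SQL NULL. The function name is hypothetical, and reading t1 as the current instance and t2 as other is an assumption, since the text does not spell out the aliases.

```python
from typing import Optional


def within_upper_time_stamp(t1_time_stamp: float,
                            t2_upper_time_stamp: Optional[float]) -> bool:
    """Mirror `t1.time_stamp < t2.upper_time_stamp OR t2.upper_time_stamp IS NULL`."""
    # None plays the role of SQL NULL: without an upper bound the row always passes.
    return t2_upper_time_stamp is None or t1_time_stamp < t2_upper_time_stamp
```

Because the comparison is strict, a row whose upper time stamp equals t1.time_stamp is excluded in this sketch.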
- Raises
TypeError – If any of the input arguments is of wrong type.
Exception – If other was created earlier (temporally) than the current instance.
Note
The role of time_stamp is crucial. It ensures causality by incorporating only those rows of other in the join operation for which the time stamp in other_time_stamp is at most as recent as the one in the corresponding row of time_stamp in the current instance. Since you usually call this method on the population table and provide the peripheral table as other (see Tables), this ensures that no information from the future is considered during training. upper_time_stamp additionally limits the joined rows up to a certain point in the past.
In terms of SQL syntax, this method corresponds to a LEFT JOIN.
other must be created (temporally) after the current instance. This was implemented as a measure to prevent circular dependencies in the data model.
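To make the causality rule in the note concrete, here is a minimal pure-Python sketch of a LEFT JOIN that keeps only peripheral rows that are at most as recent as the population row. It does not use getml: causal_left_join and the sample rows are hypothetical, tables are modelled as lists of dicts, and the non-strict comparison follows the wording "at most as recent" from the note.

```python
def causal_left_join(population, peripheral,
                     join_key="join_key", time_stamp="time_stamp"):
    """Pair every population row with its matching, non-future peripheral rows."""
    result = []
    for pop_row in population:
        matches = [
            per_row for per_row in peripheral
            if per_row[join_key] == pop_row[join_key]        # relation via the join key
            and per_row[time_stamp] <= pop_row[time_stamp]   # no information from the future
        ]
        # LEFT JOIN semantics: the population row is kept even without matches.
        result.append((pop_row, matches))
    return result


population = [
    {"join_key": "a", "time_stamp": 10.0},
    {"join_key": "b", "time_stamp": 5.0},
]
peripheral = [
    {"join_key": "a", "time_stamp": 3.0, "column_01": 0.5},
    {"join_key": "a", "time_stamp": 12.0, "column_01": 0.9},  # later than population row "a"
    {"join_key": "c", "time_stamp": 1.0, "column_01": 0.1},   # no population row with key "c"
]

joined = causal_left_join(population, peripheral)
```

In this toy data, population row "a" is paired only with the peripheral row at time stamp 3.0, and population row "b" is retained with an empty match list.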
set_relations(join_keys_used=None, other_join_keys_used=None, time_stamps_used=None, other_time_stamps_used=None, upper_time_stamps_used=None, joined_tables=None)
Set all relational instance variables not exposed in the constructor.
The setting of the instance variables is split into two functions, this one and the constructor, because of their distinct nature. While the input arguments of the latter are concerned with the structure of the particular DataFrame the current instance represents, the arguments of this function cover the relations of that table to all the other DataFrame objects.
The intended usage during a data science project is to first construct the Placeholder for all DataFrame objects and later join them together, that is, define their relations, using join(). Letting this method manage the relational instance variables instead of requiring them during the construction of the Placeholder makes the code much less error-prone. But at some points, like in deserialization or extended scripting, it might still be required to initialize the whole instance in one (well, two) step, which is where this method comes into play.
The ordering within the provided arguments is important. It is assumed that all of them are of the same length and that all keys/time stamps at a certain position define the relation of the table this instance represents with another one represented by the Placeholder in the same position of the joined_tables argument.
- Parameters
join_keys_used (List[str]) – Elements in join_keys used to define the relations to the other tables provided in joined_tables.
other_join_keys_used (List[str]) – join_keys of the Placeholder in joined_tables used to define a relation with the current instance. Note that the join_keys instance variable is not contained in the joined_tables.
time_stamps_used (List[str]) – Elements in time_stamps used to define the relations to the other tables provided in joined_tables.
other_time_stamps_used (List[str]) – time_stamps of the Placeholder in joined_tables used to define a relation with the current instance. Note that the time_stamps instance variable is not contained in the joined_tables.
upper_time_stamps_used (List[str]) – time_stamps of the Placeholder in joined_tables used as ‘upper_time_stamp’ to define a relation with the current instance. For details please see the join() method. Note that the time_stamps instance variable is not contained in the joined_tables.
joined_tables (List[Placeholder]) – List of all other Placeholder the current instance is joined on.
- Raises
TypeError – If any of the input arguments is of wrong type.
ValueError – If the input arguments are not of same length.
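The position-based alignment of the arguments, together with the documented ValueError for unequal lengths, can be sketched as follows. This is an illustration only: align_relations is a hypothetical helper, not part of getml, the column names are made up, and plain table names stand in for the Placeholder instances in joined_tables.

```python
def align_relations(join_keys_used, other_join_keys_used, time_stamps_used,
                    other_time_stamps_used, upper_time_stamps_used, joined_tables):
    """Group the position-aligned arguments into one tuple per joined table."""
    columns = [join_keys_used, other_join_keys_used, time_stamps_used,
               other_time_stamps_used, upper_time_stamps_used, joined_tables]
    # All lists must have the same length, otherwise positions cannot be matched.
    if len({len(column) for column in columns}) > 1:
        raise ValueError("All input arguments must be of the same length.")
    # Entry i of every list describes the relation to joined_tables[i].
    return list(zip(*columns))


relations = align_relations(
    join_keys_used=["customer_id", "customer_id"],
    other_join_keys_used=["customer_id", "id"],
    time_stamps_used=["time_stamp", "time_stamp"],
    other_time_stamps_used=["time_stamp", "created_at"],
    upper_time_stamps_used=["", "valid_until"],
    joined_tables=["orders", "contracts"],  # names stand in for Placeholder instances
)
```

Each resulting tuple collects everything that describes one relation, which makes it easy to see why reordering a single list would silently attach keys to the wrong table.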