Annotating data¶
After you have uploaded your data into the getML engine, there is one more step before you can start engineering features: You need to assign a role to each column. Why is that?
First, the general structure of the individual data frames is needed to construct the relational data model. This is done by assigning the roles join_key and time_stamp. The former defines the columns that are used to join different data frames, the latter ensures that only rows within a reasonable time frame are taken into account (otherwise there might be data leaks).
Second, you need to tell the feature engineering algorithm how to interpret the individual columns for it to construct sophisticated features. The roles numerical, categorical, and target serve this purpose. You can also assign units to each column in a DataFrame.
This chapter provides you with detailed insights into the individual roles and units.
In short¶
When building the data model, you should keep the following things in mind:
- Every DataFrame in a data model needs to have at least one column with the role join_key.
- The role time_stamp has to be used to prevent data leaks (see Time series for details).
When engineering features, please keep the following things in mind:
- Only columns with the roles categorical, numerical, and time_stamp will be used by the feature engineering algorithm for aggregations or conditions.
- Columns are only compared with each other if they have the same unit.
- If you want to make sure that a column is only used for comparison, you can set comparison_only (see Units). Time stamps are always used for comparison only.
Roles¶
Roles determine if and how columns are handled during the construction of the data model and how they are interpreted by the feature engineering algorithm. The following roles are available in getML:
Role | Class | Included by the FE |
---|---|---|
categorical | StringColumn | yes |
numerical | FloatColumn | yes |
time_stamp | FloatColumn | yes |
join_key | StringColumn | no |
target | FloatColumn | no |
unused_float | FloatColumn | no |
unused_string | StringColumn | no |
When constructing a DataFrame via the class methods from_csv(), from_pandas(), from_db(), and from_json(), all columns will have either the role unused_float or unused_string. Unused columns will be ignored by the feature engineering and machine learning (ML) algorithms.
>>> import getml
>>> import pandas as pd
>>> data_df = dict(
... animal=["hawk", "parrot", "goose"],
... votes=[12341, 5127, 65311],
... weight=[12.14, 12.6, 11.92],
... animal_id=[123, 512, 671],
... date=["2019-05-02", "2019-02-28", "2018-12-24"]
... )
>>> pandas_df = pd.DataFrame(data=data_df)
>>> getml_df = getml.data.DataFrame.from_pandas(pandas_df, name='animal elections')
>>> getml_df
| votes | weight | animal_id | animal | date |
| unused float | unused float | unused float | unused string | unused string |
------------------------------------------------------------------------------
| 12341 | 12.14 | 123 | hawk | 2019-05-02 |
| 5127 | 12.6 | 512 | parrot | 2019-02-28 |
| 65311 | 11.92 | 671 | goose | 2018-12-24 |
To make use of the uploaded data, you have to tell the getML suite how you intend to use each column by assigning a role. This is done using the set_role() method of the DataFrame. Each column must have exactly one role. If you wish to use a column in two different roles, you have to add it twice and assign each copy a different role.
>>> getml_df.set_role(['animal_id'], getml.data.roles.join_key)
>>> getml_df.set_role(['animal'], getml.data.roles.categorical)
>>> getml_df.set_role(['votes', 'weight'], getml.data.roles.numerical)
>>> getml_df.set_role(['date'], getml.data.roles.time_stamp)
>>> getml_df
| date | animal_id | animal | votes | weight |
| time stamp | join key | categorical | numerical | numerical |
---------------------------------------------------------------------------------
| 2019-05-02T00:00:00.000000Z | 123 | hawk | 12341 | 12.14 |
| 2019-02-28T00:00:00.000000Z | 512 | parrot | 5127 | 12.6 |
| 2018-12-24T00:00:00.000000Z | 671 | goose | 65311 | 11.92 |
When assigning new roles to existing columns, you might notice that some of these calls complete in an instant while others take a considerable amount of time. What’s happening here? A column’s role also determines its type, so when you set a new role, an implicit type conversion might take place.
A note on reproducibility and efficiency¶
When building a stable pipeline you want to deploy in a productive environment, the flexible default behavior of the import interface might be more of an obstacle. For instance, CSV files are not type-safe. A column that was interpreted as a float column for one set of files might be interpreted as a string column for a different set of files. This obviously has implications for the stability of your pipeline. Therefore, it might be a good idea to hard-code column roles.
In the getML Python API, you can bypass the default deduction of the role of each column by providing a dictionary mapping each column name to a role in the import interface.
>>> roles = {"categorical": ["colname1", "colname2"], "target": ["colname3"]}
>>>
>>> df_expd = data.DataFrame.from_csv(
... fnames=["file1.csv", "file2.csv"],
... name="MY DATA FRAME",
... sep=';',
... quotechar='"',
... roles=roles,
... ignore=True
... )
If the ignore argument is set to True, any columns missing in the dictionary won’t be imported at all.
If you feel that writing the roles by hand is too tedious, you can use dry: If you call the import interface while setting the dry argument to True, no data is read. Instead, the default roles of all columns will be returned as a dictionary. You can store, alter, and hard-code this dictionary into your stable pipeline.
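Since the returned roles object is a plain dictionary mapping role names to lists of column names, you can adjust it with ordinary Python before hard-coding it. The following sketch (using hypothetical column names, not part of the getML API) moves columns into different roles:

```python
# A roles dictionary as a dry run might return it: role name -> column
# names. The column names below are hypothetical placeholders.
roles = {
    "unused_float": ["votes", "weight", "animal_id"],
    "unused_string": ["animal", "date"],
}

def reassign(roles, column, new_role):
    """Move a column from its current role to new_role, dropping
    any role that ends up with no columns."""
    updated = {
        role: [c for c in cols if c != column]
        for role, cols in roles.items()
    }
    updated.setdefault(new_role, []).append(column)
    return {role: cols for role, cols in updated.items() if cols}

roles = reassign(roles, "animal_id", "join_key")
roles = reassign(roles, "date", "time_stamp")
```

The edited dictionary can then be passed to the roles argument of the import interface as shown above.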
>>> roles = data.DataFrame.from_csv(
... fnames=["file1.csv", "file2.csv"],
... name="MY DATA FRAME",
... sep=';',
... quotechar='"',
... dry=True
... )
Even if your data source is type-safe, setting roles during import is still a good idea because it is more efficient: set_role() creates a deep copy of the original column and might perform an implicit type conversion. If you already know where you want your data to end up, it is a good idea to set the roles in advance.
Join key¶
Join keys are required to establish a relation between two data frames (DataFrame). See chapter Data model for details.
Join key columns are allowed to contain NULL values. NULL values won’t be matched to anything, not even to NULL values in other join keys.
Columns of this role will not be aggregated by the feature engineering algorithm or used for conditions.
Time stamp¶
This role is used to prevent data leaks. When you join one table onto another, you usually want to make sure that no data from the future is used. Time stamps can be used to limit your joins.
In addition, the feature engineering algorithm can aggregate time stamps or use them for conditions. However, they will never be compared to fixed values. This means that conditions like this are not possible:
...
WHERE time_stamp > some_fixed_date
...
Instead, time stamps will always be compared to other time stamps:
...
WHERE time_stamp1 - time_stamp2 > some_value
...
This is because it is unlikely that comparing time stamps to a fixed date performs well out-of-sample.
When assigning the role time_stamp to a column that is currently a StringColumn, you need to specify the format of this string. You can do so using the time_formats argument of set_role(), which accepts a list of time formats that are tried in order when interpreting the input strings. Possible format options are
%w - abbreviated weekday (Mon, Tue, …)
%W - full weekday (Monday, Tuesday, …)
%b - abbreviated month (Jan, Feb, …)
%B - full month (January, February, …)
%d - zero-padded day of month (01 .. 31)
%e - day of month (1 .. 31)
%f - space-padded day of month ( 1 .. 31)
%m - zero-padded month (01 .. 12)
%n - month (1 .. 12)
%o - space-padded month ( 1 .. 12)
%y - year without century (70)
%Y - year with century (1970)
%H - hour (00 .. 23)
%h - hour (00 .. 12)
%a - am/pm
%A - AM/PM
%M - minute (00 .. 59)
%S - second (00 .. 59)
%s - seconds and microseconds (equivalent to %S.%F)
%i - millisecond (000 .. 999)
%c - centisecond (0 .. 9)
%F - fractional seconds/microseconds (000000 - 999999)
%z - time zone differential in ISO 8601 format (Z or +NN.NN)
%Z - time zone differential in RFC format (GMT or +NNNN)
%% - percent sign
If none of the provided formats works, the getML engine will try to interpret the time stamps as numerical values. If this fails, the time stamp will be set to NULL.
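This fallback order can be sketched in plain Python (an illustration of the logic only, not getML's implementation; note that Python's strptime codes also differ from getML's format options):

```python
from datetime import datetime

def parse_time_stamp(value, time_formats):
    """Try each format in order; if none matches, fall back to a
    numerical interpretation; otherwise signal NULL with None."""
    for fmt in time_formats:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            pass
    try:
        return float(value)
    except ValueError:
        return None

parse_time_stamp("1971-01-02", ["%Y-%m-%d"])  # datetime(1971, 1, 2, 0, 0)
parse_time_stamp("365", ["%Y-%m-%d"])         # 365.0 (numerical fallback)
parse_time_stamp("not a date", [])            # None (becomes NULL)
```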
>>> data_df = dict(
... date1=[365, 366, 367],
... date2=['1971-01-01', '1971-01-02', '1971-01-03'],
... date3=['1|1|71', '1|2|71', '1|3|71'],
... )
>>> df = getml.data.DataFrame.from_dict(data_df, name='dates')
>>> df.set_role(['date1', 'date2', 'date3'], getml.data.roles.time_stamp, time_formats=['%Y-%m-%d', '%n|%e|%y'])
>>> df
| date1 | date2 | date3 |
| time stamp | time stamp | time stamp |
-------------------------------------------------------------------------------------------
| 1971-01-01T00:00:00.000000Z | 1971-01-01T00:00:00.000000Z | 1971-01-01T00:00:00.000000Z |
| 1971-01-02T00:00:00.000000Z | 1971-01-02T00:00:00.000000Z | 1971-01-02T00:00:00.000000Z |
| 1971-01-03T00:00:00.000000Z | 1971-01-03T00:00:00.000000Z | 1971-01-03T00:00:00.000000Z |
Note
GetML does not handle UNIX time stamps but encodes time as multiples and fractions of days since 1970-01-01T00:00:00. For example, 7.251388888888889 = 7 + 6/24 + 2/(24*60) would be interpreted as 1970-01-08T06:02:00.
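For illustration, converting this day-based encoding into a conventional datetime can be done with a few lines of plain Python (a sketch, not part of the getML API):

```python
from datetime import datetime, timedelta, timezone

def getml_time_to_datetime(days):
    """Convert days (and fractions thereof) since 1970-01-01T00:00:00
    into a UTC datetime."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(days=days)

# 7 days + 6 hours + 2 minutes after the epoch
getml_time_to_datetime(7 + 6/24 + 2/(24*60))  # 1970-01-08 06:02:00+00:00
```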
Target¶
The associated columns contain the variables we want to predict in our data science project. They are neither included in the data model nor used by the feature engineering algorithm (since they will be unknown in all future events). But they are such an important part of the analysis that you are required to provide at least one of them in the population table (see Tables). They are allowed to be present in peripheral tables too, but will be ignored by the feature engineering algorithms.
The content of the target columns needs to be numerical. For classification problems, target variables can only assume the values 0 or 1. Target variables can never be NULL.
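These constraints can be checked before import with a few lines of plain Python (a sketch of the rules above, not a getML function):

```python
import math

def validate_target(values, classification=False):
    """Raise if a target column violates the requirements: values must
    be numerical, never NULL, and restricted to 0 or 1 for
    classification problems."""
    for v in values:
        if not isinstance(v, (int, float)) or (
            isinstance(v, float) and math.isnan(v)
        ):
            raise ValueError(f"target must be numerical and non-NULL: {v!r}")
        if classification and v not in (0, 1):
            raise ValueError(f"classification targets must be 0 or 1: {v!r}")
    return True

validate_target([0, 1, 1, 0], classification=True)  # OK
validate_target([3.5, -1.2])                        # OK for regression
```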
MultirelModel supports multiple targets out of the box. However, RelboostModel can only train on one target at a time. If you have several targets, you need to train separate models (either by providing unique names or using the time-based default names) and specify the corresponding target_num instance variable of RelboostModel. Which number is associated with which target is determined by their ordering in the target_names instance variable of the DataFrame.
This difference is due to the nature of the underlying relational learning algorithm used to perform the automated feature engineering. Unlike MultirelModel, RelboostModel needs to train separate weights and rules for each target.
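The mapping from targets to target_num simply follows the order of target_names, which can be illustrated in plain Python (the target names here are hypothetical):

```python
# target_names as it might appear on a DataFrame with two targets;
# the names are hypothetical placeholders.
target_names = ["churn", "revenue"]

# target_num is the position of the target within target_names.
target_num = {name: num for num, name in enumerate(target_names)}
```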
Numerical¶
This role tells the getML engine to include the associated FloatColumn during feature engineering.
It should be used for all data with an inherent ordering, regardless of whether it is sampled from a continuous quantity, like elapsed time or the total amount of rainfall, or a discrete quantity, like the number of sugary mulberries one has eaten since lunch.
Categorical¶
This role tells the getML engine to include the associated StringColumn during feature engineering.
It should be used for all data with no inherent ordering, even if the categories are encoded as integers instead of strings.
You should also make sure that the number of unique categories relative to the number of non-NULL entries is not too high (otherwise your features might overfit). You can check the summary statistics in the getML monitor to inform your decision on whether to include particular string columns.
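A quick check of this ratio can be written in plain Python (a sketch; in practice, you can read these statistics off the getML monitor):

```python
def category_ratio(values):
    """Share of unique categories among the non-NULL entries.
    Values close to 1.0 indicate a column that may lead to
    overfitted features."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return len(set(non_null)) / len(non_null)

category_ratio(["hawk", "parrot", "hawk", "hawk", None])  # 0.5
```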
Unused_float¶
Marks a FloatColumn as unused.
The associated columns will be used neither for the data model nor by the feature engineering algorithms and predictors.
Unused_string¶
Marks a StringColumn as unused.
The associated columns will be used neither for the data model nor by the feature engineering algorithms and predictors.
Units¶
By default, all columns of either role categorical or numerical will only be compared to fixed values:
...
WHERE numerical_column > some_value
OR categorical_column == 'some string'
...
If you want the feature engineering algorithms to compare these columns with each other (like in the snippet below), you have to explicitly set a unit.
...
WHERE numerical_column1 - numerical_column2 > some_value
OR categorical_column1 != categorical_column2
...
Using set_unit(), you can set the unit of a column to an arbitrary, non-empty string. If it matches the unit string of another column, both of them will be compared by the getML engine. Please note that a column cannot have more than one unit.
There are occasions where only a pairwise comparison of columns, but not a comparison with fixed values, is useful. To cope with this, you can set the comparison_only flag in set_unit().
Note that time stamps are always used for comparison only. The feature engineering algorithm will never compare them to a fixed date, because it is very unlikely that such a feature would perform well out-of-sample.
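The rules above can be summarized in a small plain-Python sketch (modeling the behavior, not the getML API; the columns and units are hypothetical):

```python
def can_compare(col_a, col_b):
    """Columns are compared with each other only if they share the
    same non-empty unit."""
    return bool(col_a["unit"]) and col_a["unit"] == col_b["unit"]

def can_compare_to_fixed_value(col):
    """Comparison with fixed values is ruled out by comparison_only;
    time stamps behave as if comparison_only were always set."""
    return not col.get("comparison_only", False)

weight_a = {"unit": "kg"}
weight_b = {"unit": "kg"}
height = {"unit": "cm", "comparison_only": True}
```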