Annotating data

After you have uploaded your data into the getML engine, there is one more step before you can start engineering features: You need to assign a role to each column. Why is that?

First, the general structure of the individual data frames is needed to construct the relational data model. This is done by assigning the roles join_key and time_stamp. The former defines the columns used to join different data frames; the latter ensures that only rows within a reasonable time frame are taken into account (otherwise there might be data leakage).

Second, you need to tell the feature engineering algorithm how to interpret the individual columns in order for it to construct sophisticated features. The roles numerical, categorical, and target serve this purpose. You can also assign units to each column in a DataFrame.

This chapter provides you with detailed insights into the individual roles and units.

In short

When building the data model and engineering features, you should keep the following things in mind:

  • Only columns with the role categorical, numerical, or time_stamp will be used by the feature engineering algorithm for aggregations or conditions.

  • Columns are only compared with each other if they have the same unit.

  • If you want to make sure that a column is only used for comparison, you can set comparison_only (see Units). Time stamps are always used for comparison only.

Roles

Roles determine if and how columns are handled during the construction of the data model and how they are interpreted by the feature engineering algorithm. The following roles are available in getML:

| Role          | Class        | Included by the FE algorithm |
|---------------|--------------|------------------------------|
| categorical   | StringColumn | yes                          |
| numerical     | FloatColumn  | yes                          |
| time_stamp    | FloatColumn  | yes                          |
| join_key      | StringColumn | no                           |
| target        | FloatColumn  | no                           |
| unused_float  | FloatColumn  | no                           |
| unused_string | StringColumn | no                           |

When constructing a DataFrame via the class methods from_csv(), from_pandas(), from_db(), and from_json(), all columns will have either the role unused_float or unused_string. Unused columns will be ignored by the feature engineering and machine learning (ML) algorithms.

>>> import getml
>>> import pandas as pd
>>> data_df = dict(
... animal=["hawk", "parrot", "goose"],
... votes=[12341, 5127, 65311],
... weight=[12.14, 12.6, 11.92],
... animal_id=[123, 512, 671],
... date=["2019-05-02", "2019-02-28", "2018-12-24"]
... )
>>> pandas_df = pd.DataFrame(data=data_df)
>>> getml_df = getml.data.DataFrame.from_pandas(pandas_df, name='animal elections')

>>> getml_df
| votes        | weight       | animal_id    | animal        | date          |
| unused float | unused float | unused float | unused string | unused string |
------------------------------------------------------------------------------
| 12341        | 12.14        | 123          | hawk          | 2019-05-02    |
| 5127         | 12.6         | 512          | parrot        | 2019-02-28    |
| 65311        | 11.92        | 671          | goose         | 2018-12-24    |

To make use of the uploaded data, you have to tell the getML suite how you intend to use each column by assigning it a role. This is done using the set_role() method of the DataFrame. Each column must have exactly one role. If you wish to use a column in two different roles, you have to add it twice and assign each copy a different role (see the sketch after the following example).

>>> getml_df.set_role(['animal_id'], getml.data.roles.join_key)
>>> getml_df.set_role(['animal'], getml.data.roles.categorical)
>>> getml_df.set_role(['votes', 'weight'], getml.data.roles.numerical)
>>> getml_df.set_role(['date'], getml.data.roles.time_stamp)
>>> getml_df
| date                        | animal_id | animal      | votes     | weight    |
| time stamp                  | join key  | categorical | numerical | numerical |
---------------------------------------------------------------------------------
| 2019-05-02T00:00:00.000000Z | 123       | hawk        | 12341     | 12.14     |
| 2019-02-28T00:00:00.000000Z | 512       | parrot      | 5127      | 12.6      |
| 2018-12-24T00:00:00.000000Z | 671       | goose       | 65311     | 11.92     |
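
If, for instance, you wanted to use animal_id both as a join key and as a numerical column, a minimal sketch might look like this. The column-copy assignment below is an assumption about the API; check the reference of your getML version for the exact syntax.

>>> # Hypothetical: duplicate the column under a new name, then give the
>>> # copy its own role (this triggers a string-to-float conversion).
>>> getml_df['animal_id_num'] = getml_df['animal_id']
>>> getml_df.set_role(['animal_id_num'], getml.data.roles.numerical)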

When assigning new roles to existing columns, you might notice that some of these calls complete in an instant while others take a considerable amount of time. What’s happening here? A column’s role also determines its type. When you set a new role, an implicit type conversion might take place.

A note on reproducibility and efficiency

When building a stable pipeline you want to deploy in a production environment, the flexible default behavior of the import interface might be more of an obstacle. For instance, CSV files are not type-safe: a column that was interpreted as a float column for one set of files might be interpreted as a string column for a different set of files. This obviously has implications for the stability of your pipeline. Therefore, it might be a good idea to hard-code column roles.

In the getML Python API, you can bypass the default deduction of the role of each column by providing a dictionary mapping each column name to a role in the import interface.

>>> roles = {"categorical": ["colname1", "colname2"], "target": ["colname3"]}
>>>
>>> df_expd = getml.data.DataFrame.from_csv(
...         fnames=["file1.csv", "file2.csv"],
...         name="MY DATA FRAME",
...         sep=';',
...         quotechar='"',
...         roles=roles,
...         ignore=True
... )

If the ignore argument is set to True, any columns missing in the dictionary won’t be imported at all.

If you feel that writing the roles by hand is too tedious, you can use dry: If you call the import interface while setting the dry argument to True, no data is read. Instead, the default roles of all columns will be returned as a dictionary. You can store, alter, and hard-code this dictionary into your stable pipeline.

>>> roles = getml.data.DataFrame.from_csv(
...         fnames=["file1.csv", "file2.csv"],
...         name="MY DATA FRAME",
...         sep=';',
...         quotechar='"',
...         dry=True
... )

Even if your data source is type-safe, setting roles during import is still a good idea because it is more efficient: set_role() creates a deep copy of the original column and might perform an implicit type conversion. If you already know which role each column should have, set it in advance.
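
The same applies to the other import methods. Assuming from_pandas() accepts the same roles argument (check the API reference of your version), the example from above can be imported with all roles already in place:

>>> roles = {
...     "join_key": ["animal_id"],
...     "categorical": ["animal"],
...     "numerical": ["votes", "weight"],
...     "time_stamp": ["date"],
... }
>>> getml_df = getml.data.DataFrame.from_pandas(
...         pandas_df,
...         name='animal elections',
...         roles=roles
... )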

Join key

Join keys are required to establish a relation between two DataFrame objects. See the Data model chapter for details.

Join key columns are allowed to contain NULL values. NULL values won’t be matched to anything, not even to NULL values in other join keys.

Columns of this role will not be aggregated by the feature engineering algorithm or used in conditions.
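
To illustrate how a join key enters the data model, here is a sketch using placeholders. It reuses the column names from the example above; the peripheral table is hypothetical, and the exact join() signature may differ between getML versions.

>>> population = getml.data.Placeholder('animal elections')
>>> peripheral = getml.data.Placeholder('sightings')  # hypothetical table
>>> population.join(
...         peripheral,
...         join_key='animal_id',
...         time_stamp='date'
... )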

Time stamp

This role is used to prevent data leaks. When you join one table onto another, you usually want to make sure that no data from the future is used. Time stamps can be used to limit your joins.

In addition, the feature engineering algorithm can aggregate time stamps or use them for conditions. However, they will never be compared to fixed values. This means that conditions like this are not possible:

...
WHERE time_stamp > some_fixed_date
...

Instead, time stamps will always be compared to other time stamps:

...
WHERE time_stamp1 - time_stamp2 > some_value
...

This is because it is unlikely that comparing time stamps to a fixed date performs well out-of-sample.

When assigning the role time_stamp to a column that is currently a StringColumn, you need to specify the format of this string. You can do so using the time_formats argument of set_role(), which accepts a list of formats that are tried in order when interpreting the input strings. Possible format options are:

  • %w - abbreviated weekday (Mon, Tue, …)

  • %W - full weekday (Monday, Tuesday, …)

  • %b - abbreviated month (Jan, Feb, …)

  • %B - full month (January, February, …)

  • %d - zero-padded day of month (01 .. 31)

  • %e - day of month (1 .. 31)

  • %f - space-padded day of month ( 1 .. 31)

  • %m - zero-padded month (01 .. 12)

  • %n - month (1 .. 12)

  • %o - space-padded month ( 1 .. 12)

  • %y - year without century (70)

  • %Y - year with century (1970)

  • %H - hour (00 .. 23)

  • %h - hour (00 .. 12)

  • %a - am/pm

  • %A - AM/PM

  • %M - minute (00 .. 59)

  • %S - second (00 .. 59)

  • %s - seconds and microseconds (equivalent to %S.%F)

  • %i - millisecond (000 .. 999)

  • %c - centisecond (0 .. 9)

  • %F - fractional seconds/microseconds (000000 - 999999)

  • %z - time zone differential in ISO 8601 format (Z or +NN.NN)

  • %Z - time zone differential in RFC format (GMT or +NNNN)

  • %% - percent sign

If none of the provided formats works, the getML engine will try to interpret the time stamps as numerical values. If this fails, the time stamp will be set to NULL.

>>> data_df = dict(
... date1=[365, 366, 367],
... date2=['1971-01-01', '1971-01-02', '1971-01-03'],
... date3=['1|1|71', '1|2|71', '1|3|71'],
... )
>>> df = getml.data.DataFrame.from_dict(data_df, name='dates')
>>> df.set_role(['date1', 'date2', 'date3'], getml.data.roles.time_stamp, time_formats=['%Y-%m-%d', '%n|%e|%y'])
>>> df
| date1                       | date2                       | date3                       |
| time stamp                  | time stamp                  | time stamp                  |
-------------------------------------------------------------------------------------------
| 1971-01-01T00:00:00.000000Z | 1971-01-01T00:00:00.000000Z | 1971-01-01T00:00:00.000000Z |
| 1971-01-02T00:00:00.000000Z | 1971-01-02T00:00:00.000000Z | 1971-01-02T00:00:00.000000Z |
| 1971-01-03T00:00:00.000000Z | 1971-01-03T00:00:00.000000Z | 1971-01-03T00:00:00.000000Z |

Note

GetML does not handle UNIX time stamps but encodes time as multiples and fractions of days since January 1, 1970 (1970-01-01T00:00:00). For example, 7.334722222222222 = 7 + 8/24 + 2/(24*60) would be interpreted as 1970-01-08T08:02:00.
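
If your raw data contains UNIX time stamps (seconds since the epoch), you can convert them to this day-based encoding yourself before the import. A minimal sketch in plain Python:

>>> unix_seconds = 633720                       # 1970-01-08T08:02:00 in UNIX time
>>> getml_days = unix_seconds / (24 * 60 * 60)  # seconds per day
>>> # getml_days is now ~7.3347, the value from the note above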

Target

The associated columns contain the variables we want to predict in our data science project. They are neither included in the data model nor are they used by the feature engineering algorithm (since they will be unknown in all future events). But they are such an important part of the analysis that you are required to provide at least one of them in the population table (see Tables). They are allowed to be present in peripheral tables too, but will be ignored by the feature engineering algorithms.

The content of the target columns needs to be numerical. For classification problems, target variables can only assume the values 0 or 1. Target variables can never be NULL.

MultirelModel supports multiple targets out of the box. RelboostModel, however, can only train on one target at a time. If you have several targets, you need to train separate models (either by providing unique names or by using the time-based default names) and specify the corresponding target_num instance variable of RelboostModel. Which number is associated with which target is determined by their ordering in the target_names instance variable of the DataFrame.

This difference is due to the nature of the underlying relational learning algorithm used to perform the automated feature engineering. Unlike MultirelModel, RelboostModel needs to train separate weights and rules for each target.
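
Training one RelboostModel per target might then look like the following sketch. The constructor arguments shown are illustrative and incomplete (population, peripheral, etc. are omitted); check the RelboostModel reference for the full signature.

>>> # One model per entry in df.target_names; target_num selects the target.
>>> models = [
...     getml.models.RelboostModel(
...         name='model for ' + target,  # unique name per target
...         target_num=num
...     )
...     for num, target in enumerate(df.target_names)
... ]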

Numerical

This role tells the getML engine to include the associated FloatColumn during feature engineering.

It should be used for all data with an inherent ordering, regardless of whether it is sampled from a continuous quantity, like elapsed time or the total amount of rainfall, or a discrete quantity, like the number of sugary mulberries one has eaten since lunch.

Categorical

This role tells the getML engine to include the associated StringColumn during feature engineering.

It should be used for all data with no inherent ordering, even if the categories are encoded as integers instead of strings.

You should also make sure that the number of unique categories relative to the number of non-NULL entries is not too high (otherwise your features might overfit). You can check the summary statistics in the getML monitor to inform your decision on whether to include particular string columns.
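
As a quick pre-import plausibility check, you can compute this ratio with pandas, for example on the data frame from above:

>>> non_null = pandas_df['animal'].notna().sum()
>>> ratio = pandas_df['animal'].nunique() / non_null
>>> # ratio is 1.0 here: every row has its own category. On real data,
>>> # a ratio close to 1 is a warning sign for overfitting features.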

Unused_float

Marks a FloatColumn as unused.

The associated columns will not be used in the data model, by the feature engineering algorithm, or by the predictors.

Unused_string

Marks a StringColumn as unused.

The associated columns will not be used in the data model, by the feature engineering algorithm, or by the predictors.

Units

By default, all columns of either role categorical or numerical will only be compared to fixed values:

...
WHERE numerical_column > some_value
OR categorical_column == 'some string'
...

If you want the feature engineering algorithms to compare these columns with each other (like in the snippet below), you have to explicitly set a unit.

...
WHERE numerical_column1 - numerical_column2 > some_value
OR categorical_column1 != categorical_column2
...

Using set_unit() you can set the unit of a column to an arbitrary, non-empty string. If it matches the unit of another column, both of them will be compared by the getML engine. Please note that a column cannot have more than one unit.

There are occasions where only a pairwise comparison of columns, but not a comparison with fixed values, is useful. For these cases, you can set the comparison_only flag in set_unit().
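
For example, to let the engine compare two time-of-day columns with each other but never against a fixed value (the data frames and column names below are illustrative):

>>> df1.set_unit(['delivery_time'], 'hour', comparison_only=True)
>>> df2.set_unit(['order_time'], 'hour', comparison_only=True)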

Note that time stamps are always used for comparison only. The feature engineering algorithm will never compare them to a fixed date, because it is very unlikely that such a feature would perform well out-of-sample.