Feature engineering

The deep learning revolution has enabled automated feature engineering for image and sound data. Yet, for relational data and classical time series analysis, feature engineering is still done by hand or with very simple brute-force methods. Our mission is to change that.

The automation of feature engineering on relational data and time series is at the heart of the getML software suite. There are other libraries that implement feature engineering tools on top of frameworks like data.table in R, pandas in Python, or Apache Spark. In essence, they all use a brute-force approach: Generate a large number of features, then use some sort of feature selection routine to pick a small subset of them.

getML has chosen another path: Our highly efficient feature engineering algorithms produce features that are far more advanced than what manual feature engineering could achieve or what could be accomplished using simple brute-force approaches.

Definition

Feature engineering is the process of constructing explanatory variables, so-called features, from a dataset. These features are used as input for machine learning algorithms. In most real-world datasets, the raw data is spread over multiple tables, and the task is to bring these tables together and construct features based on their relationships. These features are stored in a flat feature table. In other words, feature engineering is the operation of merging and aggregating a relational data model into a flat (feature) table.

This is sometimes referred to as data wrangling.
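To make the definition concrete, here is a minimal sketch in pandas of merging and aggregating a relational data model into a flat feature table. The tables, column names, and numbers are made up for illustration; they are not part of the getML API.

```python
import pandas as pd

# Hypothetical relational data model: a population table of customers
# (carrying the target) and a peripheral table of their transactions.
population = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "churn": [0, 1, 0],  # target variable
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [10.0, 20.0, 5.0, 7.0, 3.0, 1.0],
})

# Aggregate the peripheral table per join key ...
features = (
    transactions.groupby("customer_id")["amount"]
    .agg(num_transactions="count", total_amount="sum")
    .reset_index()
)

# ... and merge the result onto the population table: the flat feature table.
feature_table = population.merge(features, on="customer_id", how="left")
print(feature_table)
```

Everything getML automates happens in those two steps: choosing which aggregations to compute and under which conditions, instead of hard-coding `count` and `sum` by hand.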

Traditionally, feature engineering is done manually using brute force approaches or domain knowledge. In any case it is a tedious, time-consuming, and error-prone process. Manual feature engineering is done by writing scripts in Python, R, or SQL.

getML’s approach to feature engineering is fundamentally different. It provides a framework capable of automatically extracting useful and meaningful features from a relational data model by finding the best merge and aggregate operations.

Note

Unfortunately, the term feature engineering is ambiguous. More often than not, feature engineering is meant to describe numerical transformations or encoding techniques on a single table. The definition used above assumes that the raw data comes in relational form, which is true for almost all real-world data sets.

Design principles

The general procedure for feature engineering on relational data and time series using getML looks like this:

  • The only required input is a relational data schema. In particular, there need to be one or more target variables that are to be predicted. For time series, the schema would typically be a self-join. In addition to this general information on the data schema, the intended usage of the variables has to be provided by setting the roles of the corresponding columns. How to set up a data schema is described in Data model.

  • Features are often of the type (illustrated in pseudo SQL-like syntax)

    COUNT the number of `transactions` within the last X `days`
    

    where X is some sort of fixed numerical value. getML’s algorithms identify appropriate values for X automatically; there is no need for you to provide them by hand.

  • Features can also take the form of

    COUNT the number of `transactions` for which the `transaction
    type` is 'type_1' OR 'type_2' OR 'type_3' OR …
    

    getML’s algorithms also find appropriate conditions based on categorical data without any input from the user.

  • The feature engineering algorithms can handle combinations of conditions too. So, features of the form

    SOME_AGGREGATION( over some column ) WHERE ( condition_1 AND
        condition_2 ) OR ( condition_3 AND condition_4 ) OR condition_5
    

    will be engineered automatically as well. Again, no input from the user is required.

  • To increase the transparency of the created features, they can be expressed as SQL code. Even though automatically generated features will always be less intuitive than hand-crafted ones and can be quite complex, we want the user to understand what is going on.

Algorithms

getML provides two powerful feature engineering algorithms: Multirel and Relboost.

Multirel

Simply put, Multirel is a more efficient variation of Multi-relational Decision Tree Learning (MRDTL). The core idea is to minimize redundancies in the original algorithm through incremental updates. We then combine our improved version of MRDTL with ensemble learning methods.

MRDTL stems from a strain of academic literature that was particularly popular in the early 2000s. It is based on a greedy, tree-like approach:

  • Define some sort of objective function that evaluates the quality of your feature as it relates to the target variable(s).

  • Pick an aggregation and some column to be aggregated.

  • Try out different conditions. Keep the one that generates the greatest improvement of your objective. Repeat until no improvement can be found or some sort of stopping criterion is reached.
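The greedy loop above can be sketched in a few lines of Python. This is an illustrative toy, not getML's implementation: the objective, the candidate thresholds, and the data are all made up, and the "feature" is a simple COUNT over a time window.

```python
# Toy sketch of the greedy MRDTL-style search described above.

def objective(feature, targets):
    """Negative squared error between feature and targets (higher is better)."""
    return -sum((f - t) ** 2 for f, t in zip(feature, targets))

def count_within(matches_per_row, threshold):
    """COUNT the matches per population row whose 'days' value is within X."""
    return [sum(1 for d in row if d <= threshold) for row in matches_per_row]

def greedy_search(matches_per_row, targets, candidate_thresholds):
    """Keep the condition that yields the best objective value."""
    best_threshold, best_score = None, float("-inf")
    for x in candidate_thresholds:
        score = objective(count_within(matches_per_row, x), targets)
        if score > best_score:
            best_threshold, best_score = x, score
    return best_threshold

# Three rows of a population table, each with the 'days' values of its matches.
matches = [[1, 5, 80], [2, 95, 200], [3, 4, 5]]
targets = [2, 1, 3]
print(greedy_search(matches, targets, candidate_thresholds=[10, 90, 100]))  # 10
```

In a real run, this search is repeated to grow subconditions until the stopping criterion is reached, which is exactly where the redundancy problem discussed below comes from.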

The reason this approach has never really taken off outside of academia is that an efficient implementation is far from trivial. Most papers on MRDTL implement the algorithm on top of an existing relational database system, like MySQL.

The main problem with trying to implement something like this on top of an existing database is that it requires many redundant operations. Consider a feature like

COUNT the number of `transactions` in the last X `days`

As we iterate through different values for the threshold X, we are forced to repeat the same operations on the same data over and over again. Tasks like this bring traditional database engines to their knees.

The core idea of getML’s Multirel algorithm is to minimize redundancies through incremental updates. To allow for incremental updates and maximal efficiency, we developed a database engine from scratch in C++. When we evaluate a feature like

COUNT the number of `transactions` in the last 90 `days`

and

COUNT the number of `transactions` in the last 91 `days`

very little changes in between. Multirel only recalculates what has changed and keeps everything else untouched. Therefore, it needs two ingredients that can be incrementally updated: An objective function and the actual aggregation(s).

Our first ingredient must be suited for incremental updates. When we move from 90 to 91 days, presumably only very few lines in the population table actually change. We do not need to recalculate the entire table. In practice, most commonly used objective functions are fine and this is not much of a limitation. However, there are some, like rank correlation, that cannot be used.
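As a sketch of what an incrementally updatable objective can look like, consider Pearson correlation kept as running sums: when a single row's feature value changes, the sums are adjusted in O(1) instead of recomputing over the whole table. This is an illustration under our own simplifying assumptions, not getML's actual objective.

```python
import math

class IncrementalCorrelation:
    """Pearson correlation between feature and target, via running sums."""

    def __init__(self, feature, target):
        self.n = len(feature)
        self.feature = list(feature)
        self.target = list(target)
        self.sx = sum(feature)
        self.sxx = sum(f * f for f in feature)
        self.sy = sum(target)
        self.syy = sum(t * t for t in target)
        self.sxy = sum(f * t for f, t in zip(feature, target))

    def update(self, i, new_value):
        """Row i's feature value changes; adjust the sums in O(1)."""
        old = self.feature[i]
        t = self.target[i]
        self.sx += new_value - old
        self.sxx += new_value ** 2 - old ** 2
        self.sxy += (new_value - old) * t
        self.feature[i] = new_value

    def value(self):
        cov = self.n * self.sxy - self.sx * self.sy
        var_x = self.n * self.sxx - self.sx ** 2
        var_y = self.n * self.syy - self.sy ** 2
        return cov / math.sqrt(var_x * var_y)

obj = IncrementalCorrelation([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
obj.update(2, 4.0)  # one row's COUNT ticks up as the threshold moves
print(obj.value())  # 1.0: feature and target now match perfectly
```

An objective like rank correlation breaks this scheme because changing one value can reshuffle the ranks of many rows, which is why it is excluded above.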

The second ingredient, the aggregations, must allow for incremental updates too. This part is a bit harder, so let us elaborate: Say we have a match between the population table that contains our targets and another table (or a self-join), and that this match happens to fall between the two thresholds of 90 and 91 days. As we move from 90 to 91 days, we have to update our aggregation for that match. For maximum efficiency, this also needs to be done incrementally: we do not want to recalculate the entire aggregation over all of its matches, just the one match between the two thresholds.

We want, however, to also support the AND and OR combinations of conditions. So, it is possible that a match was not included in the aggregation before, but becomes part of it as we move the threshold. It is also possible that the match was included in the aggregation, but now it isn’t anymore.

For an aggregation like Count, incremental updates are straightforward: If the match was not included, but now it is, increment by 1. If it was included, but is not anymore, decrement by 1.

Things are trickier for aggregations like Max, Median, or CountDistinct. For instance, whereas incrementing Max is easy, decrementing it is hard: If the match used to be included and happens to be the maximum value, we now have to find the next-biggest match. And we have to find it quickly - ideally, iterating through a set of thresholds should take linear time in the number of matches. To make it even more complicated, some cross-joins might result in a lot of matches, so any data structure with non-trivial memory overhead is a no-go.
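One classic way to support a decrementable Max is a max-heap with lazy deletion, sketched below next to the trivial Count case. This is a generic data-structure illustration, not getML's C++ implementation, and a heap does carry some memory overhead that the real engine has to avoid.

```python
import heapq

class IncrementalCount:
    """COUNT: both directions are O(1)."""
    def __init__(self):
        self.value = 0
    def insert(self, _match):
        self.value += 1
    def remove(self, _match):
        self.value -= 1

class IncrementalMax:
    """MAX via a max-heap (negated values) with lazy deletion."""
    def __init__(self):
        self._heap = []     # stores -value, so the smallest entry is the max
        self._removed = {}  # values removed but not yet popped from the heap
    def insert(self, value):
        heapq.heappush(self._heap, -value)
    def remove(self, value):
        self._removed[value] = self._removed.get(value, 0) + 1
    @property
    def value(self):
        # Discard stale maxima that were lazily removed.
        while self._heap and self._removed.get(-self._heap[0], 0) > 0:
            self._removed[-self._heap[0]] -= 1
            heapq.heappop(self._heap)
        return -self._heap[0] if self._heap else None

agg = IncrementalMax()
for v in [3.0, 9.0, 5.0]:
    agg.insert(v)
agg.remove(9.0)   # the maximum leaves the window as the threshold moves
print(agg.value)  # 5.0: the next-biggest match, without a full rescan
```

Amortized over a sweep through all thresholds, each match is pushed and popped at most once, which is what keeps the sweep close to linear time.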

Everything so far has shed light on how we train one feature. But in practice, we want more than one. So, how do we do that? Since we are using a tree-based algorithm anyway, we are able to harness the power of ensemble learning algorithms that have been shown to work very well with non-relational decision trees, namely bagging and gradient boosting.

With bagging, we just sample randomly from our population table. We train a feature on that sample and then pick a different random sample to train the next feature.

With gradient boosting, we calculate the pseudo-residuals of our previously trained features. We then train features that predict these pseudo-residuals. This procedure guarantees that new features are targeted and compensate for the weaknesses of older ones.
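The ensemble loop can be sketched as follows. The stand-in "feature" here is just a constant fit on a bagged sample (for squared loss, the pseudo-residuals are simply target minus current prediction); in the real algorithm, each step would run the tree-based feature search instead.

```python
import random

def fit_constant(sample):
    """A trivial stand-in for training one feature: predict the mean."""
    return sum(sample) / len(sample)

def boost(targets, num_features=50, learning_rate=0.3, seed=0):
    """Each new 'feature' is fit on the pseudo-residuals of the last round."""
    rng = random.Random(seed)
    predictions = [0.0] * len(targets)
    for _ in range(num_features):
        # Pseudo-residuals for squared loss: target minus prediction.
        residuals = [t - p for t, p in zip(targets, predictions)]
        # Bagging: train this feature on a random subsample of the rows.
        sample = [residuals[rng.randrange(len(residuals))] for _ in residuals]
        step = fit_constant(sample)
        predictions = [p + learning_rate * step for p in predictions]
    return predictions

print(boost([2.0, 2.0, 2.0, 2.0]))  # predictions converge towards the targets
```

Because every feature is fit on what the previous features got wrong, the ensemble improves exactly where it is weakest.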

Transpiled to SQL, a typical feature generated by Multirel looks like this:

CREATE TABLE FEATURE_1 AS
SELECT COUNT( * ) AS feature_1,
       t1.join_key,
       t1.time_stamp
FROM (
     SELECT *,
            ROW_NUMBER() OVER ( ORDER BY join_key, time_stamp ASC ) AS rownum
     FROM POPULATION
) t1
LEFT JOIN PERIPHERAL t2
ON t1.join_key = t2.join_key
WHERE (
   ( t1.time_stamp - t2.time_stamp <= 0.499624 )
) AND t2.time_stamp <= t1.time_stamp
GROUP BY t1.rownum,
         t1.join_key,
         t1.time_stamp;

Further information can be found in the API documentation for MultirelModel.

Relboost

Relboost is a generalization of the gradient boosting algorithm. More specifically, it generalizes the xgboost implementation to relational learning.

The main difference between Relboost and Multirel is that Multirel aggregates columns, whereas Relboost aggregates learnable weights.

Relboost addresses a problem with Multirel that is related to computational complexity theory: In Multirel, every column can be aggregated and/or used for generating a condition. That means that the number of possible features is \mathcal{O}(n^2) in the number of columns in the original tables. As a result, having twice as many columns leads to a search space that is four times as large (in reality, it is a bit more complicated than that, but the basic point holds).

Any computer scientist or applied mathematician will tell you that \mathcal{O}(n^2) is not nice. If you have tables with many columns, it might turn out to be a problem. Of course, this issue is not specific to Multirel: It is a very fundamental problem that you would also have, if you were to write your features by hand or use brute force.

Relboost offers a way out of this dilemma: Because Relboost aggregates learnable weights, columns are only used for conditions, not for aggregations. The search space is therefore \mathcal{O}(n) in the number of columns in the original tables - much better.
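A toy count makes the scaling concrete, under the simplifying assumption that Multirel pairs every aggregated column with every condition column while Relboost only searches over condition columns:

```python
# Back-of-the-envelope comparison of candidate counts per search step.

def multirel_candidates(n_columns):
    # Any column can be aggregated AND any column can appear in a condition.
    return n_columns * n_columns

def relboost_candidates(n_columns):
    # Only conditions are searched over; the aggregated weights are learned.
    return n_columns

for n in (10, 20, 40):
    print(n, multirel_candidates(n), relboost_candidates(n))
```

Doubling the columns quadruples the first count but only doubles the second, which is the whole argument in miniature.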

This might seem very theoretical, but it has considerable implications: From our experience with real-world data in various projects, we know that Relboost usually outperforms Multirel in terms of predictive accuracy and training time.

However, these advantages come at a price: First, the features generated by Relboost are less intuitive. They are further away from what you might write by hand, even though they can still be expressed as SQL code. Second, it is more difficult to apply Relboost to multiple targets, because Relboost has to learn separate rules and weights for each target.

Expressed as SQL code, a typical feature generated by Relboost looks like this:

CREATE TABLE FEATURE_1 AS
SELECT SUM(
CASE
     WHEN ( t1.time_stamp - t2.time_stamp > 0.499624 ) THEN 0.0
     WHEN ( t1.time_stamp - t2.time_stamp <= 0.499624 OR t1.time_stamp IS NULL OR t2.time_stamp IS NULL ) THEN 1.0
     ELSE NULL
END
) AS feature_1,
     t1.join_key,
     t1.time_stamp
FROM (
     SELECT *,
            ROW_NUMBER() OVER ( ORDER BY join_key, time_stamp ASC ) AS rownum
     FROM POPULATION
) t1
LEFT JOIN PERIPHERAL t2
ON t1.join_key = t2.join_key
WHERE t2.time_stamp <= t1.time_stamp
GROUP BY t1.rownum,
         t1.join_key,
         t1.time_stamp;

Further information can be found in the API documentation for RelboostModel.

Best practices

This is a list of suggestions on how to get the most out of getML’s feature engineering algorithms.

How to make features more interpretable

First of all, you should keep in mind that there is a trade-off between the interpretability and the predictive power of the generated features, just like there is a trade-off between interpretability and predictive power of machine learning algorithms.

MultirelModel offers numerous parameters to regularize your features, to keep them readable, and to avoid overfitting.

The most important ones are:

1. regularization: A higher value of this parameter is the most elegant and efficient way to regularize the feature engineering algorithm and will lead to less complex features and a lower danger of overfitting.

2. min_num_samples: Determines the minimum number of samples a subcondition should apply to in order for it to be considered. Higher values lead to less complex statements and less danger of overfitting.

3. max_length: This parameter imposes a hard upper limit on the length of your subconditions. Setting it to 1 or 2 will greatly increase the readability of your features.

4. allow_sets: Multirel can combine categories into sets. This will produce conditions of the form

(CATEGORY = "value_1" OR CATEGORY = "value_2" OR ...)

Enabling this option can be powerful, but it can also produce features that are hard to read. So you might want to set allow_sets to False if you strive for interpretability.

How to add handcrafted features

Just add your hand-crafted features to the population table as numerical columns. They will be automatically passed on to the predictor.

If you have categorical hand-crafted features, you can add them as well, but you also have to set the include_categorical flag in MultirelModel to True.

Which hyperparameters have the most impact for Multirel?

There are several things you have to keep in mind when working with MultirelModel.

  1. The ideal num_features is often more than you think.

From our experience with hand-crafted features, we were used to having some 30 to 50 features. But many problems, particularly on business-related data sets, come with millions of labeled samples. That means that if we create some 1000 features, the ratio of features to samples will still be below 0.1%. So the only thing that should stop us from generating 1000 features is memory limitations. If you have sufficient memory, do try to generate as many features as you please and see whether it improves your scores on the validation set.

  2. Playing with share_aggregations can make a difference.

Every time a new feature is generated, the aggregation will be taken from a random subsample of possible aggregations and values to be aggregated. share_aggregations determines the size of that subsample.

share_aggregations is thus the parameter that controls the accuracy-vs-variance trade-off: A higher value results in more accurate features, whereas a lower one results in greater variance.

The ideal value can vary greatly with the data set. We have seen data sets (such as Customer Expenditure) where 0.25 is appropriate, but we achieved excellent results with values around 0.05 on others.

The most extreme way to get great variance at the expense of accuracy is round_robin: When it is set to True, you are guaranteed a great variety of features (but many of them might be meaningless).

  3. Put some effort into regularization.

You can regularize your features using regularization, min_num_samples, or max_length.

  4. A good data model trumps everything else.

But the most important thing is to think about your data model. We have tried to make the Python API as easy and flexible as possible by introducing our Placeholder. Thinking about your data model (and do not forget about the units) is often a better use of your time than micro-tuning the hyperparameters. After all, you have hyperparameter optimization at hand to automate that task.

How to reduce the training time of Multirel

The following parameters can be used to reduce your training time:

1. share_aggregations: Every time a new feature is generated, the particular aggregation will be taken from a random subsample of possible aggregations and values to be aggregated. This parameter determines the size of that subsample. A lower value will reduce training time significantly.

2. round_robin: Setting this parameter to True will almost certainly reduce your training time.

3. regularization, min_num_samples, and max_length: These parameters are used to regularize your features and keep them simple. Simpler features take less time to train. So, anything you do to regularize them will also reduce training time.

4. aggregations: Keep in mind that some aggregations, such as CountDistinct or Median, can be far more expensive than others. If you want to reduce training time, sticking to the “big 5” (Count, Sum, Avg, Min, Max) might be a good idea.

5. grid_factor: A higher value will cause Multirel to try more critical values for your numeric features, which can take quite some time. But keep in mind that this time penalty is quite low compared to the increased accuracy you get.

Influence of the data set on the training time

It may sound counterintuitive, but size does not matter much! Multirel uses a bootstrapping approach and draws subsamples from your population table; a new subsample is generated for each feature. So whether your population table consists of 500 or 50,000,000 lines does not matter much.

What does matter is whether the number of unique join keys is small relative to the size of your data set. If, for every entry in your population table, there are a lot of entries to aggregate in your peripheral table (as you might get in a time series problem), training will take a while.

By contrast, if you are doing a churn use case with many customers and few activities per customer, that will be fast.

In addition, categorical features tend to take a bit longer than numerical ones.