Data model

Defining the data model is a crucial step before training one of getML's feature engineering algorithms. You typically deal with this step after having uploaded your data and specified the role of each column.

When working with getML, the raw data usually comes in the form of relational data. That means the information relevant for a prediction is spread over several tables. The data model is the definition of the relations between all of them. If you are not sure what relational data is, check out our blog post on this topic.

Note

If you are dealing with time series, you might only have a single table. All the concepts described in this section still apply, and getML will still fit your needs and generate any number of features automatically. You only have to keep in mind that the feature engineering algorithms will rely on so-called self-joins, as explained in Time series.

Tables

When defining the data model, we distinguish between a population table and one or more peripheral tables.

The population table

The population table is the main table of the analysis. It must contain at least one column with the role target, which is the variable we want to predict. Furthermore, the table must also contain one or more columns with the role join_key. These are foreign keys used to establish relations - also called joins - with one or more peripheral tables. In data warehousing, the population table is sometimes called the fact table.

The following example shows the population table of a customer churn analysis. The target variable is churn: whether or not a given customer has stopped using the company's services and products after a certain reference date. The join key customer_id is used to establish relations with a peripheral table. Additionally, the date the customer joined our fictional company is given in the column date_joined, which we have assigned the role time_stamp.

../../_images/population_table.png
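
To make this concrete, here is a minimal sketch of how such a population table could be loaded into getML. The data values are made up, and the exact role-assignment calls (from_pandas, set_role, getml.data.roles) may differ slightly between getML versions.

    import pandas as pd
    import getml

    # A toy population table matching the churn example above:
    # one row per customer, "churn" is the target.
    population_pd = pd.DataFrame({
        "customer_id": [1, 2, 3],
        "date_joined": pd.to_datetime(["2015-01-10", "2016-03-22", "2017-07-01"]),
        "churn": [0, 1, 0],
    })

    # Load into the getML engine and assign the roles discussed above.
    population = getml.data.DataFrame.from_pandas(population_pd, name="population")
    population.set_role("churn", getml.data.roles.target)
    population.set_role("customer_id", getml.data.roles.join_key)
    population.set_role("date_joined", getml.data.roles.time_stamp)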

Peripheral tables

Peripheral tables contain additional information relevant for the prediction of the target variable in the population table. Each of them is related to the latter (or another peripheral table, see The snowflake schema) via a column with role join_key. In data warehousing, the peripheral tables are sometimes called dimension tables.

The following pictures show two peripheral tables that could be used in the customer churn analysis from the example above. One contains the complaints a given customer has filed with a certain agent, the other the transactions the customer has made using her account.

../../_images/peripheral_tables.png
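
The two peripheral tables can be prepared in the same way. A minimal sketch with made-up column names and values (the real tables in the images above may differ):

    # Complaints filed by customers with a certain agent (toy data).
    complaints_pd = pd.DataFrame({
        "customer_id": [1, 1, 3],
        "agent_id": [101, 102, 101],
        "date_filed": pd.to_datetime(["2017-01-05", "2017-02-11", "2018-03-02"]),
    })
    complaints = getml.data.DataFrame.from_pandas(complaints_pd, name="complaints")
    complaints.set_role(["customer_id", "agent_id"], getml.data.roles.join_key)
    complaints.set_role("date_filed", getml.data.roles.time_stamp)

    # Transactions made by customers using their accounts (toy data).
    transactions_pd = pd.DataFrame({
        "customer_id": [1, 2, 2, 3],
        "amount": [9.99, 120.0, 34.5, 12.0],
        "transaction_date": pd.to_datetime(
            ["2016-05-01", "2016-06-12", "2017-01-20", "2018-02-28"]
        ),
    })
    transactions = getml.data.DataFrame.from_pandas(transactions_pd, name="transactions")
    transactions.set_role("customer_id", getml.data.roles.join_key)
    transactions.set_role("amount", getml.data.roles.numerical)
    transactions.set_role("transaction_date", getml.data.roles.time_stamp)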

Placeholders

In getML, Placeholders are used to construct the data model. They are light-weight representations of data frames and the relations among them, but they do not contain any data.

The idea behind the placeholder concept is that it allows constructing an abstract data model without any reference to an actual data set. This data model serves as input for the feature engineering algorithms. Later on, these algorithms can be trained on and applied to any data set that follows this data model.

In most cases, placeholders are constructed for a specific data set and are therefore most easily derived from the actual DataFrame using its to_placeholder() method. More information on how to construct placeholders is given in the API documentation for Placeholder.
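
For the running example, deriving placeholders from the data frames is a one-liner each (a sketch; the data frame names are the ones used in the snippets above):

    # Light-weight placeholders derived from the actual data frames.
    population_placeholder = population.to_placeholder()
    complaints_placeholder = complaints.to_placeholder()
    transactions_placeholder = transactions.to_placeholder()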

Joins

Joins are used to establish relations between placeholders. In order to join two placeholders, the data frames used to derive them must both have at least one column of role join_key. The joining itself is done using the getml.data.Placeholder.join() method.

Special care must be taken when dealing with time-based data. All columns containing time stamps have to be given the role time_stamp, and the relevant one in both the population and the peripheral table must be passed to the getml.data.Placeholder.join() method. This ensures that only those rows of the peripheral table are included in the join whose time stamp is the same as or earlier than the time stamp of the corresponding row in the population table. This way, no information from the future is considered during training.
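
As a sketch, joining the transactions placeholder to the population placeholder could look like the following. join_key and time_stamp are the parameters referenced above; the other_time_stamp keyword for the peripheral-side time stamp column is an assumption and may differ between getML versions.

    # Relate transactions to the population table via the customer id,
    # using the time stamps to rule out information from the future.
    population_placeholder.join(
        transactions_placeholder,
        join_key="customer_id",
        time_stamp="date_joined",
        other_time_stamp="transaction_date",  # assumed keyword for the peripheral side
    )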

Data schemata

After having created placeholders for all data frames in an analysis, we are ready to create the actual data schema. A data schema is a certain way of assembling population and peripheral tables.

The star schema

The star schema is the simplest way of establishing relations between the population table and the peripheral tables. It is sufficient for the majority of data science projects.

In the star schema, the population table is surrounded by any number of peripheral tables, all joined via a certain join key. However, no joins between peripheral tables are allowed.

The population table and two peripheral tables introduced in Tables can be arranged in a star schema like this:

../../_images/star_scheme.png
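
Sketched with placeholders, the star schema amounts to joining each peripheral placeholder directly to the population placeholder (keyword assumptions as in the join example above):

    # Star schema: every peripheral table is joined directly to the population.
    population_placeholder.join(
        complaints_placeholder,
        join_key="customer_id",
        time_stamp="date_joined",
        other_time_stamp="date_filed",
    )
    population_placeholder.join(
        transactions_placeholder,
        join_key="customer_id",
        time_stamp="date_joined",
        other_time_stamp="transaction_date",
    )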

The snowflake schema

In some cases, the star schema is not enough to represent the complexity of a data set. This is where the snowflake schema comes in: In a snowflake schema, peripheral tables can have peripheral tables of their own.

Assume that in the customer churn analysis shown above, there is an additional table containing information about the calls a certain agent made in customer service. It can be joined to the COMPLAINTS table using the key agent_id.

../../_images/snowflake_schema.png
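
Sketched with placeholders, the snowflake schema only adds one more join: the hypothetical calls table is joined to the complaints placeholder, which in turn is joined to the population placeholder as before.

    # "calls" is a hypothetical getML data frame of the agents' customer-service calls.
    calls_placeholder = calls.to_placeholder()

    # Snowflake schema: a peripheral table gets a peripheral table of its own.
    complaints_placeholder.join(
        calls_placeholder,
        join_key="agent_id",
    )
    population_placeholder.join(
        complaints_placeholder,
        join_key="customer_id",
        time_stamp="date_joined",
        other_time_stamp="date_filed",
    )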

Time series

Time series deserve special treatment since, at first glance, a single table containing your (multivariate) time series does not seem to fit the relational data model. Self-joining a single table explains why it nevertheless does. In addition, some extra parameters and considerations come into play when building features based on time stamps.

Self-joining a single table

If you deal with a classical (multivariate) time series and all your data is contained in a single table, all the concepts covered so far still apply. You just have to perform a so-called self-join: provide your table as both the population and the peripheral table and join them to each other, ensuring causality via the time stamp.

This might sound a little strange at first. You can think of the process as follows: Each row in the population table - a single measurement - is combined with all rows of the peripheral table - the same time series - whose time stamps are earlier than the one of the row we picked. The result is our original table with all rows containing future data - from the perspective of the picked row - discarded. This ensures that no knowledge unavailable at prediction time enters the training step, preventing data leaks.
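
A self-join can be sketched like this. The data frame and column names (a series identifier and a time stamp) are made up for illustration:

    # The same data frame serves as both population and peripheral table.
    ts_population = time_series.to_placeholder()
    ts_peripheral = time_series.to_placeholder()

    # The time stamp in the join ensures that each row is only combined
    # with rows whose time stamps are not in its future.
    ts_population.join(
        ts_peripheral,
        join_key="series_id",
        time_stamp="timestamp",
    )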

Features based on time stamps

Time stamps are handled differently than numerical columns. In addition, the getML engine is able to automatically generate features based on aggregations over time windows. Both the length of the time window and the aggregation itself are figured out by the feature engineering algorithm. The only thing you have to do is provide the temporal resolution at which your time series is sampled via the delta_t parameter of either MultirelModel or RelboostModel.
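
A rough sketch of where delta_t enters. Apart from delta_t, which is the parameter named above, the module path and the remaining arguments are assumptions and should be checked against the MultirelModel / RelboostModel reference:

    # Hypothetical model setup; delta_t is the sampling resolution of the series.
    model = getml.models.MultirelModel(
        population=ts_population,
        peripheral=[ts_peripheral],
        delta_t=1.0,  # e.g. one time step between measurements, in the unit of the time stamps
    )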