Training a single RelboostModel

Please note that RelboostModel is not supported by the basic version.

Getting started

For the complete script, please refer to example_02c_build_features_using_relboost.py.

Just like there are numerous ways to stage data, there are also numerous ways to train models. We will begin with the simplest one.

As a first step, we reset the project and reload the data. This might not be necessary if we haven’t shut down the engine in the meantime, but it is good practice to keep these steps strictly separate.

engine.set_project("CE")

# -----------------------------------------------------------------------------
# Reload the data - if you haven't shut down the engine since loading the data
# in the first script, you can also call .refresh()

df_population_training = data.load_data_frame("POPULATION_TRAINING")

df_population_validation = data.load_data_frame("POPULATION_VALIDATION")

df_population_testing = data.load_data_frame("POPULATION_TESTING")

df_expd = data.load_data_frame("EXPD")

df_memd = data.load_data_frame("MEMD")
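
As an aside, if the engine has been running continuously since the first script, the reload above can be replaced by refreshing the existing handles, as mentioned in the comment. A minimal sketch, assuming the data frame objects from the first script are still in scope and that .refresh() returns the updated handle (this may differ between versions):

# Sketch only: refresh instead of reload; assumes .refresh() returns
# the updated handle, which may differ between versions.
df_population_training = df_population_training.refresh()
df_population_validation = df_population_validation.refresh()
df_population_testing = df_population_testing.refresh()
df_expd = df_expd.refresh()
df_memd = df_memd.refresh()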

Building the data model

The next step is to build the data model. As we mentioned earlier, we would like to have two joins over NEWID (the customer id): one to the expenditure data (EXPD) and one to the member data (MEMD). In both cases, the time stamp ensures that only data from the past of each population row is used:

population_placeholder = models.Placeholder("POPULATION")

expd_placeholder = models.Placeholder("EXPD")

memd_placeholder = models.Placeholder("MEMD")

# Join the expenditure data (EXPD) over the customer id. The time
# stamp ensures that only data from the past is used.
population_placeholder.join(
    expd_placeholder,
    join_key="NEWID",
    time_stamp="TIME_STAMP"
)

# Join the member data (MEMD) in the same way.
population_placeholder.join(
    memd_placeholder,
    join_key="NEWID",
    time_stamp="TIME_STAMP"
)
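
The time_stamp argument means that, for each row in the population table, only peripheral rows whose time stamp does not lie in its future are considered, which prevents data leakage. Roughly, the join semantics look like this pandas sketch (illustrative only; the engine does not materialize such a join, and the toy data is made up):

import pandas as pd

# Toy tables mimicking the POPULATION and EXPD structure.
population = pd.DataFrame({"NEWID": [1, 1], "TIME_STAMP": [3.0, 5.0]})
expd = pd.DataFrame({"NEWID": [1, 1, 1], "TIME_STAMP": [2.0, 4.0, 6.0]})

# Join over the customer id, then keep only peripheral rows that are
# not in the future of the corresponding population row.
joined = population.merge(expd, on="NEWID", suffixes=("_pop", "_per"))
joined = joined[joined["TIME_STAMP_per"] <= joined["TIME_STAMP_pop"]]
print(joined)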

Building the model

We define an XGBoostClassifier for both feature selection and prediction (note that the feature selector is commented out in the model below). For those who would rather use a LogisticRegression, that option is provided as well:

feature_selector = predictors.XGBoostClassifier(
    booster="gbtree",
    n_estimators=100,
    n_jobs=6,
    max_depth=7,
    reg_lambda=500
)

# feature_selector = predictors.LogisticRegression()

predictor = predictors.XGBoostClassifier(
    booster="gbtree",
    n_estimators=100,
    n_jobs=6,
    max_depth=7,
    reg_lambda=0.0
)

# predictor = predictors.LogisticRegression()

model = models.RelboostModel(
    population=population_placeholder,
    peripheral=[expd_placeholder, memd_placeholder],
    loss_function=loss_functions.CrossEntropyLoss(),
    shrinkage=0.1,
    gamma=0.0,
    min_num_samples=200,
    num_features=20,
    share_selected_features=0.0,
    reg_lambda=0.0,
    sampling_factor=1.0,
    predictor=predictor,
    # feature_selector=feature_selector,  # uncomment to enable feature selection
    num_threads=0,
    include_categorical=True
)

Fitting the model

Fitting should be very familiar:

model = model.fit(
    population_table=df_population_training,
    peripheral_tables=[df_expd, df_memd]
)

Evaluation

To detect overfitting, it is often a good idea to calculate some in-sample scores as well. The monitor will always show the results from the last time .score(…) was called.

scores = model.score(
    population_table=df_population_training,
    peripheral_tables=[df_expd, df_memd]
)

print("In-sample:")
print(scores)
print()

scores = model.score(
    population_table=df_population_validation,
    peripheral_tables=[df_expd, df_memd]
)

print("Out-of-sample:")
print(scores)
print()
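A noticeably better in-sample score is a classic sign of overfitting. If we keep both score sets in separate variables, we can compare them directly. A minimal sketch, assuming the returned object behaves like a dict with an "auc" entry (the exact keys may differ between engine versions; print the object to check):

in_sample = model.score(
    population_table=df_population_training,
    peripheral_tables=[df_expd, df_memd]
)

out_of_sample = model.score(
    population_table=df_population_validation,
    peripheral_tables=[df_expd, df_memd]
)

# The "auc" key is an assumption - inspect the object for the real keys.
print("AUC gap:", in_sample["auc"] - out_of_sample["auc"])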

Retrieving data

If we want to retrieve the generated features as a numpy.array, here is how that works:

features = model.transform(
    population_table=df_population_validation,
    peripheral_tables=[df_expd, df_memd]
)
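
Since the result is a plain numpy.array, it can be inspected or handed to any external library directly. A quick sketch, assuming the usual layout of one row per population row and one column per generated feature:

import numpy as np

# One row per row in the population table, one column per feature
# (num_features=20 above, so we would expect 20 columns).
print(features.shape)
print(np.round(features[:5], 3))  # peek at the first five rows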

And here is how we can retrieve the predictions, also as a numpy.array:

predictions = model.predict(
    population_table=df_population_validation,
    peripheral_tables=[df_expd, df_memd]
)
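
For a classifier trained with CrossEntropyLoss, the predictions can be interpreted as probabilities for the positive class (an assumption worth verifying against the documentation of your version). A minimal sketch turning them into hard class labels with an arbitrary 0.5 threshold:

import numpy as np

# Threshold the predicted probabilities at 0.5 to obtain class labels.
# The threshold is an arbitrary example, not a recommendation.
labels = (np.asarray(predictions) > 0.5).astype(int)
print(labels[:10])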