Loan default prediction

In this tutorial you will use the getML Python API in order to

  1. Approach a real world dataset

  2. Train a single Multirel Model

  3. Perform a Hyperparameter optimization

  4. Access the trained features

The main result is the analysis of a real-world problem from the financial sector. You will learn how to take a data science problem from scratch all the way to a production-ready solution.

Introduction

This tutorial features a use case from the financial sector. We will use getML in order to predict loan default. A loan is the lending of money to companies or individuals. Banks grant loans in exchange for the promise of repayment. Loan default is defined as the failure to meet this legal obligation, for example when a home buyer fails to make a mortgage payment. It is essential for a bank to estimate the risk it carries when granting loans to potentially non-performing customers.

The analysis is based on the financial dataset from the CTU Prague Relational Learning Repository. It contains information about 606 successful and 76 unsuccessful loans and consists of 8 tables:

../../_images/loans_4_0.png

The loan table contains information about the loans granted by the bank, such as the date of creation, the amount, and the planned duration of the loan. It also contains the status of the loan. This is the target variable that we will predict in this analysis. The loan table is connected to the table account via the column account_id.

The account table contains further information about the accounts associated with each loan. Static characteristics, such as the date of creation, are given in account, while dynamic characteristics, such as debited payments and balances, are given in order and trans. The table client contains characteristics of the account owners. Clients and accounts are related via the table disp. The card table describes credit card services the bank offers to its clients and is also related to an account via the table disp. The table district contains publicly available information, such as the unemployment rate, about the districts an account or client is related to. More information about the dataset can be found here.

In the following, we will further explore the data and prepare a data model to be used for the analysis with getML. As usual, we start by setting up a project.

import getml
getml.engine.set_project('loans')
Creating new project 'loans'

Since the data sets from the CTU Prague Relational Learning Repository are available from a MariaDB database, we use getML’s database connector to load the data directly into the getML engine.

getml.database.connect_mysql(
    host="relational.fit.cvut.cz",
    port=3306,
    dbname="financial",
    user="guest",
    password="relational",
    time_formats=['%Y/%m/%d']
)

loan = getml.data.DataFrame.from_db('loan', name='loan')
account = getml.data.DataFrame.from_db('account', name='account')
order = getml.data.DataFrame.from_db('order', name='order')
trans = getml.data.DataFrame.from_db('trans', name='trans')
card = getml.data.DataFrame.from_db('card', name='card')
client = getml.data.DataFrame.from_db('client', name='client')
disp = getml.data.DataFrame.from_db('disp', name='disp')
district = getml.data.DataFrame.from_db('district', name='district')

Data preparation

We will have a closer look at the tables from the financial dataset and set up the data model. Note that a convenient way to explore the data frames we just loaded into the getML engine is to have a look at them in the getML monitor. We recommend checking what is going on there in parallel with this tutorial.

Setting roles

In order to tell getML’s feature engineering algorithms how to treat the columns of each DataFrame, we need to set their roles. For more info on roles, check out the user guide. The loan table looks like this

loan_id account_id amount duration date payments status
0 4959.0 2.0 80952.0 24.0 1994-01-05 3373.00 A
1 4961.0 19.0 30276.0 12.0 1996-04-29 2523.00 B
2 4962.0 25.0 30276.0 12.0 1997-12-08 2523.00 A
3 4967.0 37.0 318480.0 60.0 1998-10-14 5308.00 D
4 4968.0 38.0 110736.0 48.0 1998-04-19 2307.00 C
... ... ... ... ... ... ... ...
677 7294.0 11327.0 39168.0 24.0 1998-09-27 1632.00 C
678 7295.0 11328.0 280440.0 60.0 1998-07-18 4674.00 C
679 7304.0 11349.0 419880.0 60.0 1995-10-29 6998.00 C
680 7305.0 11359.0 54024.0 12.0 1996-08-06 4502.00 A
681 7308.0 11362.0 129408.0 24.0 1996-12-27 5392.00 A

The status column is our target variable. It contains 4 different categories:

  • A stands for contract finished, no problems,

  • B stands for contract finished, loan not paid,

  • C stands for running contract, OK so far,

  • D stands for running contract, client in debt

Before assigning it the role target, we need to transform it into a numerical variable. We will consider A and C a successful loan and B and D a default.

default = ((loan['status'] == 'B') | (loan['status'] == 'D'))
loan.add(default, name='default', role='target')
print(loan['default'].sum())
SUM aggregation, value: 76.0.

The data set contains 76 defaulted loans out of 681 data points in total, which corresponds to roughly 11%.
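
If you want to double-check this figure, you can export the target column to numpy (to_numpy is used the same way later in this tutorial). The following is a minimal sketch:

import numpy as np

# Export the binary target and compute the share of defaults directly.
defaults = loan['default'].to_numpy()
print("Number of defaults: {:.0f}".format(defaults.sum()))
print("Default rate: {:.1%}".format(defaults.mean()))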

Next, we assign roles to the remaining columns in loan

  • join_key: loan_id, account_id

  • time_stamp: date

  • numerical: amount, duration, payments

Note that the column status, which obviously contains a data leak, will not be considered by getML since we do not assign it a role.

loan.set_role(["account_id", "loan_id"], getml.data.roles.join_key)
loan.set_role(["date"], getml.data.roles.time_stamp)
loan.set_role(["amount", "duration", "payments"], getml.data.roles.numerical)

The account table looks like this

account_id district_id frequency date
0 1.0 18.0 POPLATEK MESICNE 1995-03-24
1 2.0 1.0 POPLATEK MESICNE 1993-02-26
2 3.0 5.0 POPLATEK MESICNE 1997-07-07
3 4.0 12.0 POPLATEK MESICNE 1996-02-21
4 5.0 15.0 POPLATEK MESICNE 1997-05-30
... ... ... ... ...
4495 11333.0 8.0 POPLATEK MESICNE 1994-05-26
4496 11349.0 1.0 POPLATEK TYDNE 1995-05-26
4497 11359.0 61.0 POPLATEK MESICNE 1994-10-01
4498 11362.0 67.0 POPLATEK MESICNE 1995-10-14
4499 11382.0 74.0 POPLATEK MESICNE 1995-08-20
print(account['frequency'].count_distinct())
COUNT_DISTINCT aggregation, value: 3.0.

Frequency is a categorical variable with 3 distinct categories, so we add it to the data model. Accordingly, we also set the time_stamp and join_key columns for account.

account.set_role(["account_id", "district_id"], getml.data.roles.join_key)
account.set_role(["date"], getml.data.roles.time_stamp)
account.set_role(["frequency"], getml.data.roles.categorical)

Population table

Let’s have a closer look at the relation between loans and account:

print(loan['account_id'].count_distinct())
print(account['account_id'].count_distinct())
COUNT_DISTINCT aggregation, value: 682.0.
COUNT_DISTINCT aggregation, value: 4500.0.

The join key account_id has no duplicated values in either loan or account. That means each row in loan is associated with exactly one row in account. This is called a one-to-one relation.
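
You can confirm the absence of duplicate join keys with numpy, for instance as in this small sketch:

import numpy as np

# account_id defines a one-to-one relation if no value occurs more than
# once on either side of the join.
for df_ in [loan, account]:
    ids = df_['account_id'].to_numpy()
    print(df_.name, len(ids) == len(np.unique(ids)))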

It does not make sense to let getML’s feature engineering algorithms try to find aggregations over the accounts associated with each loan (because there is only one). So we perform the join operation between both tables before feeding the data to getML. This is part of the definition of the data model and is generally recommended for one-to-one or many-to-one relations. The resulting table will be the population table of our analysis.

population = loan.join(
    name='population',
    other=account,
    how='left',
    join_key='account_id',
    other_cols=[
        account['district_id'],
        account['frequency'],
        account['date'].alias('date_account')
    ]
)
frequency account_id loan_id district_id amount duration payments default date date_account status
0 POPLATEK MESICNE 2 4959 1 80952.0 24.0 3373.0 0.0 1994-01-05 1993-02-26 A
1 POPLATEK MESICNE 19 4961 21 30276.0 12.0 2523.0 1.0 1996-04-29 1995-04-07 B
2 POPLATEK MESICNE 25 4962 68 30276.0 12.0 2523.0 0.0 1997-12-08 1996-07-28 A
3 POPLATEK MESICNE 37 4967 20 318480.0 60.0 5308.0 1.0 1998-10-14 1997-08-18 D
4 POPLATEK TYDNE 38 4968 19 110736.0 48.0 2307.0 0.0 1998-04-19 1997-08-08 C
5 POPLATEK MESICNE 67 4973 16 165960.0 24.0 6915.0 0.0 1996-05-02 1994-10-19 A
6 POPLATEK MESICNE 97 4986 74 102876.0 12.0 8573.0 0.0 1997-08-10 1996-05-05 A
7 POPLATEK MESICNE 103 4988 44 265320.0 36.0 7370.0 1.0 1997-12-06 1996-03-10 D
8 POPLATEK MESICNE 105 4989 21 352704.0 48.0 7348.0 0.0 1998-12-05 1997-07-10 C
9 POPLATEK MESICNE 110 4990 36 162576.0 36.0 4516.0 0.0 1997-09-08 1996-07-17 C
... ... ... ... ... ... ... ... ... ... ... ...
672 POPLATEK MESICNE 11231 7277 1 89280.0 12.0 7440.0 0.0 1997-10-24 1997-02-05 A
673 POPLATEK MESICNE 11244 7279 33 155760.0 24.0 6490.0 0.0 1997-12-11 1997-01-12 C
674 POPLATEK MESICNE 11265 7284 15 52788.0 12.0 4399.0 0.0 1993-09-15 1993-01-14 A
675 POPLATEK MESICNE 11271 7286 5 67320.0 36.0 1870.0 0.0 1997-01-31 1995-09-20 C
676 POPLATEK MESICNE 11317 7292 50 317460.0 60.0 5291.0 0.0 1998-11-22 1997-07-11 C
677 POPLATEK MESICNE 11327 7294 7 39168.0 24.0 1632.0 0.0 1998-09-27 1997-10-15 C
678 POPLATEK MESICNE 11328 7295 54 280440.0 60.0 4674.0 0.0 1998-07-18 1996-11-05 C
679 POPLATEK TYDNE 11349 7304 1 419880.0 60.0 6998.0 0.0 1995-10-29 1995-05-26 C
680 POPLATEK MESICNE 11359 7305 61 54024.0 12.0 4502.0 0.0 1996-08-06 1994-10-01 A
681 POPLATEK MESICNE 11362 7308 67 129408.0 24.0 5392.0 0.0 1996-12-27 1995-10-14 A

We also randomly split the data into a training and a testing set. We use 70% of the data set for training and the rest for testing.

split = 0.7
# Reusing the same random column (with a fixed seed) keeps the two subsets
# complementary and makes the split reproducible.
random_col = population.random(seed=1706)
population_train = population.where('population_train', random_col < split)
population_test = population.where('population_test', random_col >= split)
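
As a quick sanity check, you can compare the sizes of the two subsets (again a small sketch based on to_numpy):

# The two subsets should add up to the full population table.
n_train = len(population_train['default'].to_numpy())
n_test = len(population_test['default'].to_numpy())
print(n_train, n_test, n_train + n_test)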

Peripheral tables

The next step is to check the join relations between the population table and the remaining peripheral tables. We start by considering order and trans since they are both joined via account_id and do not have any further relationships with other tables. We check whether any of the rows in population has a one-to-many relationship with order (or trans). If that is the case, we cannot perform the join directly but instead pass the peripheral table to getML’s feature engineering algorithms in order to let them find the right aggregation operations to create the best features.

import numpy as np

account_ids = population['account_id'].to_numpy()

for peri_ in [order, trans]:
    print(peri_.name)
    # Count how often each account_id occurs in the peripheral table.
    unique, counts = np.unique(peri_['account_id'].to_numpy(), return_counts=True)
    for acc_ in account_ids:
        idx = np.where(unique == float(acc_))[0]
        # More than one occurrence means a one-to-many relationship.
        if idx.size > 0 and counts[idx[0]] > 1:
            print("-> has one-to-many")
            break
order
-> has one-to-many
trans
-> has one-to-many

Consequently, we keep both order and trans as part of our relational data model and assign appropriate roles to the columns in both tables. Before assigning a column the role categorical, we make sure that the number of distinct categories is not too large.

print('order')
order.set_role(["account_id"], getml.data.roles.join_key)
order.set_role(["amount"], getml.data.roles.numerical)

for col_ in ["bank_to", "k_symbol", "account_to"]:
    unique_cat = len(np.unique(order[col_].to_numpy()))
    print("Distinct categories in {}: {}".format(col_, unique_cat))
    if unique_cat <= 20:
        order.set_role([col_], getml.data.roles.categorical)

print('trans')
trans.set_role(["account_id", "trans_id"], getml.data.roles.join_key)
trans.set_role(["date"], getml.data.roles.time_stamp)
trans.set_role(["amount", "balance"], getml.data.roles.numerical)

for col_ in ["type", "k_symbol", "bank","operation", "account"]:
    unique_cat = len(np.unique(trans[col_].to_numpy()))
    print("Distinct categories in {}: {}".format(col_, unique_cat))
    if unique_cat <= 20:
        trans.set_role([col_], getml.data.roles.categorical)
order
Distinct categories in bank_to: 13
Distinct categories in k_symbol: 5
Distinct categories in account_to: 6446
trans
Distinct categories in type: 3
Distinct categories in k_symbol: 9
Distinct categories in bank: 14
Distinct categories in operation: 6
Distinct categories in account: 768596

Setting units

Let’s stop adding peripheral tables to our data model here and see how far we can get with only the two tables trans and order. There is, however, one more thing we can do: We can also tell the engine about the unit of each column. Columns with the same unit will be compared in the feature engineering process. For more info, check out the user guide.

loan.set_unit(["amount"], 'money')
order.set_unit(["amount"], 'money')
trans.set_unit(["amount", "balance"], 'money')

Data model

Now, we can formally define the data model. This is done using Placeholders. We can create these placeholders directly from the DataFrames.

population_placeholder = population.to_placeholder()
order_placeholder = order.to_placeholder()
trans_placeholder = trans.to_placeholder()

These placeholders are then joined together in order to define the data model that we will hand over to the engine.

population_placeholder.join(order_placeholder,
                            join_key="account_id")
population_placeholder.join(trans_placeholder,
                            join_key="account_id",
                            time_stamp="date")

The final data model looks like this

../../_images/loans_39_0.png

Training a Multirel Model

Having prepared the dataset, we can dive into the actual analysis. This is the point where getML steps in with automated feature engineering and model training. We will train a MultirelModel in order to predict the target column default. We will start with largely default settings and take care of the hyperparameter optimization later on. Inputs to the model are a feature selector and a predictor; we will use XGBoost for both in this tutorial.

feature_selector = getml.predictors.XGBoostClassifier(
    reg_lambda=500
)

predictor = getml.predictors.XGBoostClassifier(
    reg_lambda=500
)

We also need to provide the placeholders defined above. Now we are ready to instantiate the MultirelModel.

agg_ = getml.models.aggregations

model = getml.models.MultirelModel(
    aggregation=[
        agg_.Avg,
        agg_.Count,
        agg_.Max,
        agg_.Median,
        agg_.Min,
        agg_.Sum,
        agg_.Var
    ],
    num_features=30,
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    loss_function=getml.models.loss_functions.CrossEntropyLoss(),
    feature_selector=feature_selector,
    predictor=predictor,
    seed=1706
).send()

The next step is to fit the model using the training data set.

model = model.fit(
    population_table=population_train,
    peripheral_tables=[order, trans]
)
Loaded data. Features are now being trained...
Trained model.
Time taken: 0h:0m:31.185045

The training time of the model is below one minute. Let’s look at how well the model performs both on the training set and on the test set.

in_sample = model.score(
        population_table=population_train,
        peripheral_tables=[order, trans]
)

out_of_sample = model.score(
        population_table=population_test,
        peripheral_tables=[order, trans]
)
print("In sample accuracy: {:.2f}\nIn sample AUC: {:.2f}\nOut of sample accuracy: {:.2f}\nOut of sample AUC: {:.2f}".format(
    in_sample['accuracy'][0], in_sample['auc'][0], out_of_sample['accuracy'][0], out_of_sample['auc'][0]))
In sample accuracy: 0.94
In sample AUC: 0.92
Out of sample accuracy: 0.94
Out of sample AUC: 0.83
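
Beyond aggregate scores, you can also retrieve the predicted default probabilities for individual loans. The sketch below assumes that MultirelModel exposes a predict method with the same signature as score; please check the API reference of your getML version:

# Hedged sketch: predicted default probabilities for the test set.
probabilities = model.predict(
    population_table=population_test,
    peripheral_tables=[order, trans]
)
print(probabilities[:10])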

This is already a promising result but we can try to do better by performing a hyperparameter optimization.

Hyperparameter optimization

We will perform a hyperparameter optimization to improve the out-of-sample performance. We will do this using a Latin hypercube search.

param_space = dict(
    grid_factor = [1.0, 16.0],
    max_length = [1, 10],
    num_features = [10, 100],
    regularization = [0.0, 0.01],
    share_aggregations = [0.01, 0.3],
    share_selected_features = [0.1, 1.0],
    shrinkage = [0.01, 0.4],
    predictor_n_estimators = [100, 400],
    predictor_max_depth = [3, 15],
    predictor_reg_lambda = [0.0, 1000.0]
)

latin_search = getml.hyperopt.LatinHypercubeSearch(
    model=model,
    param_space=param_space,
    # Set n_iter to a larger value (e.g. 30) for a more thorough search;
    # we use a small value here so that the notebook finishes quickly.
    n_iter=2,
    seed=1706
)

latin_search.fit(
  population_table_training=population_train,
  population_table_validation=population_test,
  peripheral_tables=[order, trans],
)
Launched hyperparameter optimization...
scores = latin_search.get_scores()
best_model_name = max(scores, key=lambda key: scores[key]['auc'])
print("Out of sample accuracy: {:.2f}".format(scores[best_model_name]['accuracy'][0]))
print("AUC: {:.2f}".format(scores[best_model_name]['auc'][0]))
Out of sample accuracy: 0.94
AUC: 0.93

The hyperparameter optimization has improved the out-of-sample AUC from 0.83 to 0.93. These results can be improved further by performing a more thorough hyperparameter optimization.

Extracting Features

So far, we have trained a MultirelModel and conducted a hyperparameter optimization. But what actually happened behind the scenes? In order to gain insight into the features the Multirel model has constructed, we will look at the SQL code of the constructed features. This information is available in the getML monitor or by calling to_sql on a getML model. The feature with the highest importance looks like the code below. This is a typical example of a feature generated by Multirel: you can see the logic behind the aggregation, but it is also clear that it would have been impossible to come up with the specific values by hand or using brute-force approaches:

CREATE TABLE FEATURE_2 AS
SELECT MEDIAN( t1.date_account - t2.date ) AS feature_2,
       t1.account_id,
       t1.date
FROM (
     SELECT *,
            ROW_NUMBER() OVER ( ORDER BY account_id, date ASC ) AS rownum
     FROM population
) t1
LEFT JOIN trans t2
ON t1.account_id = t2.account_id
WHERE (
   ( t2.balance > 390.000000 AND t2.balance > 159331.000000 AND t1.date - t2.date <= 28.142857 )
OR ( t2.balance > 390.000000 AND t2.balance <= 159331.000000 AND t1.amount > 464288.000000 )
OR ( t2.balance <= 390.000000 AND t1.date_account - t2.date <= -148.000000 )
) AND t2.date <= t1.date
GROUP BY t1.rownum,
         t1.account_id,
         t1.date;
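
If you prefer to inspect the features from Python rather than in the monitor, the transpiled SQL mentioned above can be obtained by calling to_sql on the model. A minimal sketch (the output formatting may differ between getML versions):

# Print the SQL representation of the trained features.
sql_code = model.to_sql()
print(sql_code)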

Results

We are able to predict loan default in the example dataset with an accuracy of over 95% (with a more thorough hyperparameter optimization than the abbreviated one shown above) and a very good AUC. With this result, getML is in the top 1% of published solutions to this problem. The training time for the initial model was less than one minute. Together with the data preparation, this project can easily be completed within one day.

You can use this tutorial as a starting point for your own analysis or head over to the other tutorials and the user guide if you want to learn more about the functionality getML offers. Please contact us with your feedback about this tutorial or with general inquiries. We also offer proof-of-concept projects if you need help getting started with getML.