Loan default prediction¶

In this tutorial you will use the getML Python API in order to

Approach a real world dataset
Train a single Multirel Model
Perform a Hyperparameter optimization
Access the trained features

The main result is the analysis of a real world problem from the financial sector. You will learn how to tackle a data science problem from scratch to a production-ready solution.

Introduction

This tutorial features a use case from the financial sector. We will use getML in order to predict loan default. A loan is the lending of money to companies or individuals. Banks grant loans in exchange for the promise of repayment. Loan default is defined as the failure to meet this legal obligation, for example when a home buyer fails to make a mortgage payment. It is essential for a bank to estimate the risk it carries when granting loans to potentially non-performing customers.

The analysis is based on the financial dataset from the the CTU Prague Relational Learning Repository. It contains information about 606 successful and 76 not successful loans and consists of 8 tables:

The loan table contains information about the loans granted by the bank, such as the date of creation, the amount, and the planned duration of the loan. It also contains the status of the loan. This is the target variable that we will predict in this analysis. The loan table is connected to the table account via the column account_id.

The account table contains further information about the accounts associated with each loan. Static characteristics such as the date of creation are given in account and dynamic characteristics such as debited payments and balances are given in order and trans. The table client contains characteristics of the account owners. Clients and accounts are related via the table disp. The card table describes credit card services the bank offers to its clients and is related to a certain account also via the table disp. The table district contains publicly available information such as the unemployment rate about the districts a certain account or client is related to. More information about the dataset can be found here.

In the following, we will further explore the data and prepare a data model to be used for the analysis with getML. As usual, we start with setting a project.

import getml
getml.engine.set_project('loans')

Creating new project 'loans'

Since the data sets from the CTU Prague Relational Learning Repository are available from a MariaDB database, we use getML’s data base connector to directly load the data into the getML engine.

getml.database.connect_mysql(
    host="relational.fit.cvut.cz",
    port=3306,
    dbname="financial",
    user="guest",
    password="relational",
    time_formats=['%Y/%m/%d']
)

loan = getml.data.DataFrame.from_db('loan', name='loan')
account = getml.data.DataFrame.from_db('account', name='account')
order = getml.data.DataFrame.from_db('order', name='order')
trans = getml.data.DataFrame.from_db('trans', name='trans')
card = getml.data.DataFrame.from_db('card', name='card')
client = getml.data.DataFrame.from_db('client', name='client')
disp = getml.data.DataFrame.from_db('disp', name='disp')
district = getml.data.DataFrame.from_db('district', name='district')

Data preparation¶

We will have a closer look at the tables from the financial dataset and setup the data model. Note that a convenient way to explore the data frames we just loaded into the getML engine is to have a look at them in the getML monitor. We recommend to check what is going on there in parallel to this tutorial.

Setting roles¶

In order to tell getML feature engineering algorithms how to treat the columns of each Data Frame we need to set its role. For more info in roles check out the user guide. The loan table looks like this

	loan_id	account_id	amount	duration	date	payments	status
0	4959.0	2.0	80952.0	24.0	1994-01-05	3373.00	A
1	4961.0	19.0	30276.0	12.0	1996-04-29	2523.00	B
2	4962.0	25.0	30276.0	12.0	1997-12-08	2523.00	A
3	4967.0	37.0	318480.0	60.0	1998-10-14	5308.00	D
4	4968.0	38.0	110736.0	48.0	1998-04-19	2307.00	C
...	...	...	...	...	...	...	...
677	7294.0	11327.0	39168.0	24.0	1998-09-27	1632.00	C
678	7295.0	11328.0	280440.0	60.0	1998-07-18	4674.00	C
679	7304.0	11349.0	419880.0	60.0	1995-10-29	6998.00	C
680	7305.0	11359.0	54024.0	12.0	1996-08-06	4502.00	A
681	7308.0	11362.0	129408.0	24.0	1996-12-27	5392.00	A

The status column is our target variable. It contains 4 different categories:

A stands for contract finished, no problems,
B stands for contract finished, loan not paid,
C stands for running contract, OK so far,
D stands for running contract, client in debt

Before assigning it the role target we need to transform it to a numerical variable. We will consider A and C a successfull loan and B and D a default.

default = ((loan['status'] == 'B') | (loan['status'] == 'D'))
loan.add(default, name='default', role='target')
print(loan['default'].sum())

SUM aggregation, value: 76.0.

The data set contains 76 defaulted loans out of 681 data points in total, which corresponds to roughly 10%.

Next, ow we assign roles to the remaining columns in loan

join_key: loan_id, account_id
time_stamp: date
numerical: amount, duration, payments

Note that the column status, which obviously contains a data leak, will not be considered by getML since we do not assign it a role.

loan.set_role(["account_id", "loan_id"], getml.data.roles.join_key)
loan.set_role(["date"], getml.data.roles.time_stamp)
loan.set_role(["amount", "duration", "payments"], getml.data.roles.numerical)

The account table looks like this

	account_id	district_id	frequency	date
0	1.0	18.0	POPLATEK MESICNE	1995-03-24
1	2.0	1.0	POPLATEK MESICNE	1993-02-26
2	3.0	5.0	POPLATEK MESICNE	1997-07-07
3	4.0	12.0	POPLATEK MESICNE	1996-02-21
4	5.0	15.0	POPLATEK MESICNE	1997-05-30
...	...	...	...	...
4495	11333.0	8.0	POPLATEK MESICNE	1994-05-26
4496	11349.0	1.0	POPLATEK TYDNE	1995-05-26
4497	11359.0	61.0	POPLATEK MESICNE	1994-10-01
4498	11362.0	67.0	POPLATEK MESICNE	1995-10-14
4499	11382.0	74.0	POPLATEK MESICNE	1995-08-20

print(account['frequency'].count_distinct())

COUNT_DISTINCT aggregation, value: 3.0.

Frequency is a categorial variable with 3 distinct categories, so we add it to the data model. Accordingly, we set the time_stamp and join_key columns for account.

account.set_role(["account_id", "district_id"], getml.data.roles.join_key)
account.set_role(["date"], getml.data.roles.time_stamp)
account.set_role(["frequency"], getml.data.roles.categorical)

Population table¶

Let’s have a closer look at the relation between loans and account:

print(loan['account_id'].count_distinct())
print(account['account_id'].count_distinct())

COUNT_DISTINCT aggregation, value: 682.0.
COUNT_DISTINCT aggregation, value: 4500.0.

The join key account_id has no duplicated value neither in loan nor in account. That means, each row in loan is associated with exactly one row in account. This is called a one-to-one relation.

It does not make sense to let getML’s feature engieering algorithms try to find aggregations over the accounts associated with each loan (because there is only one). So we perform the join operation between both tables before feeding to getML. This is part of the definition of the data model and is generally recommended for one-to-one or many-to-one relations. The resulting table will be the population table of our analysis.

population = loan.join(
    name='population',
    other=account,
    how='left',
    join_key='account_id',
    other_cols=[
        account['district_id'],
        account['frequency'],
        account['date'].alias('date_account')
    ]
)

	frequency	account_id	loan_id	district_id	amount	duration	payments	default	date	date_account	status
0	POPLATEK MESICNE	2	4959	1	80952.0	24.0	3373.0	0.0	1994-01-05	1993-02-26	A
1	POPLATEK MESICNE	19	4961	21	30276.0	12.0	2523.0	1.0	1996-04-29	1995-04-07	B
2	POPLATEK MESICNE	25	4962	68	30276.0	12.0	2523.0	0.0	1997-12-08	1996-07-28	A
3	POPLATEK MESICNE	37	4967	20	318480.0	60.0	5308.0	1.0	1998-10-14	1997-08-18	D
4	POPLATEK TYDNE	38	4968	19	110736.0	48.0	2307.0	0.0	1998-04-19	1997-08-08	C
5	POPLATEK MESICNE	67	4973	16	165960.0	24.0	6915.0	0.0	1996-05-02	1994-10-19	A
6	POPLATEK MESICNE	97	4986	74	102876.0	12.0	8573.0	0.0	1997-08-10	1996-05-05	A
7	POPLATEK MESICNE	103	4988	44	265320.0	36.0	7370.0	1.0	1997-12-06	1996-03-10	D
8	POPLATEK MESICNE	105	4989	21	352704.0	48.0	7348.0	0.0	1998-12-05	1997-07-10	C
9	POPLATEK MESICNE	110	4990	36	162576.0	36.0	4516.0	0.0	1997-09-08	1996-07-17	C
...	...	...	...	...	...	...	...	...	...	...	...
672	POPLATEK MESICNE	11231	7277	1	89280.0	12.0	7440.0	0.0	1997-10-24	1997-02-05	A
673	POPLATEK MESICNE	11244	7279	33	155760.0	24.0	6490.0	0.0	1997-12-11	1997-01-12	C
674	POPLATEK MESICNE	11265	7284	15	52788.0	12.0	4399.0	0.0	1993-09-15	1993-01-14	A
675	POPLATEK MESICNE	11271	7286	5	67320.0	36.0	1870.0	0.0	1997-01-31	1995-09-20	C
676	POPLATEK MESICNE	11317	7292	50	317460.0	60.0	5291.0	0.0	1998-11-22	1997-07-11	C
677	POPLATEK MESICNE	11327	7294	7	39168.0	24.0	1632.0	0.0	1998-09-27	1997-10-15	C
678	POPLATEK MESICNE	11328	7295	54	280440.0	60.0	4674.0	0.0	1998-07-18	1996-11-05	C
679	POPLATEK TYDNE	11349	7304	1	419880.0	60.0	6998.0	0.0	1995-10-29	1995-05-26	C
680	POPLATEK MESICNE	11359	7305	61	54024.0	12.0	4502.0	0.0	1996-08-06	1994-10-01	A
681	POPLATEK MESICNE	11362	7308	67	129408.0	24.0	5392.0	0.0	1996-12-27	1995-10-14	A

We also randomly split the data into a training and a validation set. We use 70% of the data set for training and the rest for testing.

split = 0.7
population_train = population.where('population_train', population.random() < split)
population_test = population.where('population_test', population.random() >= split)

Peripheral tables¶

The next step is to check the join relations between the population table and the remaining peripheral tables. We start by considering order and trans since the are both joined via account_id and do not have any further relationships with other tables. We check if any of the rows in population has a one-to-many relationship with order (or trans). If it is the case, we cannot perform the join relation directly but pass the peripheral table to getML’s feature engieering algorithms in order to let them find the right aggregation operations to create the best features.

import numpy as np

account_ids = population['account_id'].to_numpy()

for peri_ in [order, trans]:
    print(peri_.name)
    unique, counts = np.unique(peri_['account_id'].to_numpy(), return_counts=True)
    for acc_ in account_ids:
        idx = np.where(unique == float(acc_))[0]
        if counts[idx] > 1:
            print("-> has one-to-many")
            break

order
-> has one-to-many
trans
-> has one-to-many

Consequently we keep both order and trans as part of our relational data model. Therefore we assign the columns in both tables appropriate roles. Before assigning a column the role categorical we make sure that the number of distinct categories is not too large.

print('order')
order.set_role(["account_id"], getml.data.roles.join_key)
order.set_role(["amount"], getml.data.roles.numerical)

for col_ in ["bank_to", "k_symbol", "account_to"]:
    unique_cat = len(np.unique(order[col_].to_numpy()))
    print("Distinct categories in {}: {}".format(col_, unique_cat))
    if unique_cat <= 20:
        order.set_role([col_], getml.data.roles.categorical)

print('trans')
trans.set_role(["account_id", "trans_id"], getml.data.roles.join_key)
trans.set_role(["date"], getml.data.roles.time_stamp)
trans.set_role(["amount", "balance"], getml.data.roles.numerical)

for col_ in ["type", "k_symbol", "bank","operation", "account"]:
    unique_cat = len(np.unique(trans[col_].to_numpy()))
    print("Distinct categories in {}: {}".format(col_, unique_cat))
    if unique_cat <= 20:
        trans.set_role([col_], getml.data.roles.categorical)

order
Distinct categories in bank_to: 13
Distinct categories in k_symbol: 5
Distinct categories in account_to: 6446
trans
Distinct categories in type: 3
Distinct categories in k_symbol: 9
Distinct categories in bank: 14
Distinct categories in operation: 6
Distinct categories in account: 768596

Setting units¶

Let’s stop here with adding peripheral tables to our data model and see how far we can get with only the two tables trans and order. There is, however, one more thing we can do: We can also to tell the engine about the unit for each column. Columns will the same unit will be compared in the feature engineering process. For more info check out the user guide.

loan.set_unit(["amount"], 'money')
order.set_unit(["amount"], 'money')
trans.set_unit(["amount", "balance"], 'money')

Data model¶

Now, we can formally define the data model. This is done using Placeholders. We can create these placeholder directly from the DataFrames.

population_placeholder = population.to_placeholder()
order_placeholder = order.to_placeholder()
trans_placeholder = trans.to_placeholder()

These Placeholders are then joined together in order to define the data model we will then turn over to the engine.

population_placeholder.join(order_placeholder,
                            join_key="account_id")
population_placeholder.join(trans_placeholder,
                            join_key="account_id",
                            time_stamp="date")

The final data model looks like this

Training a Multirel Model¶

After having prepared the dataset we can dive into the actual analysis. This is the point where getML sets in with automated feature engineering and model training. We will train a MultirelModel in order to predict the target column default. We will start with the default settings and take care of the hyperparameter optimization later on. Input to the model are a feature selector and a predictor. We will use XGBoost for both in this tutorial.

feature_selector = getml.predictors.XGBoostClassifier(
    reg_lambda=500
)

predictor = getml.predictors.XGBoostClassifier(
    reg_lambda=500
)

We also need to provied the placeholders defined above. Now we are ready to instantiate the MultirelModel.

agg_ = getml.models.aggregations

model = getml.models.MultirelModel(
    aggregation=[
        agg_.Avg,
        agg_.Count,
        agg_.Max,
        agg_.Median,
        agg_.Min,
        agg_.Sum,
        agg_.Var
    ],
    num_features=30,
    population=population_placeholder,
    peripheral=[order_placeholder, trans_placeholder],
    loss_function=getml.models.loss_functions.CrossEntropyLoss(),
    feature_selector=feature_selector,
    predictor=predictor,
    seed=1706
).send()

The next step is to fit the model using the training data set.

model = model.fit(
    population_table=population_train,
    peripheral_tables=[order, trans]
)

Loaded data. Features are now being trained...
Trained model.
Time taken: 0h:0m:31.185045

The training time of the model is below one minute. Let’s look at how well the model performs on the validation dataset.

in_sample = model.score(
        population_table=population_train,
        peripheral_tables=[order, trans]
)

out_of_sample = model.score(
        population_table=population_test,
        peripheral_tables=[order, trans]
)
print("In sample accuracy: {:.2f}\nIn sample AUC: {:.2f}\nOut of sample accuracy: {:.2f}\nOut of sample AUC: {:.2f}".format(
    in_sample['accuracy'][0], in_sample['auc'][0], out_of_sample['accuracy'][0], out_of_sample['auc'][0]))

In sample accuracy: 0.94
In sample AUC: 0.92
Out of sample accuracy: 0.94
Out of sample AUC: 0.83

This is already a promising result but we can try to do better by performing a hyperparameter optimization.

Hyperparameter optimization¶

We will perform a hyperparamter optimization to improve the out of sample accuracy. We will do this using a Latin Hypercube search.

param_space = dict(
    grid_factor = [1.0, 16.0],
    max_length = [1, 10],
    num_features = [10, 100],
    regularization = [0.0, 0.01],
    share_aggregations = [0.01, 0.3],
    share_selected_features = [0.1, 1.0],
    shrinkage = [0.01, 0.4],
    predictor_n_estimators = [100, 400],
    predictor_max_depth = [3, 15],
    predictor_reg_lambda = [0.0, 1000.0]
)

latin_search = getml.hyperopt.LatinHypercubeSearch(
    model=model,
    param_space=param_space,
    # n_iter=30,
    # Set n_iter to a smaller value in order to make the notebook finish quickly
    n_iter=2,
    seed=1706
)

latin_search.fit(
  population_table_training=population_train,
  population_table_validation=population_test,
  peripheral_tables=[order, trans],
)

Launched hyperparameter optimization...

scores = latin_search.get_scores()
best_model_name = max(scores, key=lambda key: scores[key]['auc'])
print("Out of sample accuracy: {:.2f}".format(scores[best_model_name]['accuracy'][0]))
print("AUC: {:.2f}".format(scores[best_model_name]['auc'][0]))

Out of sample accuracy: 0.94
AUC: 0.93

The hyperparameter optimization has improved the in sample accuracy and AUC. These results will get even better when performing a more thorough hyperparameter optimization.

Extracting Features¶

So far, we have trained a MultirelModel and conducted a hyperparameter optimization. But what did actually happened behind the scenes? In order to gain insight into the features the Multirel Model has construced, we will look at the in SQL code of the constructed features. This information is available in the getML monitor or by caling to_sql on a getML model. The feature with the highest importance looks like this. This is a typical example for a feature generated by Multirel. You can see the logic behind the aggregation, but its also clear that it would have been impossible to come up with the specific values by hand or using brute force approaches:

CREATE TABLE FEATURE_2 AS
SELECT MEDIAN( t1.date_account - t2.date ) AS feature_2,
       t1.account_id,
       t1.date
FROM (
     SELECT *,
            ROW_NUMBER() OVER ( ORDER BY account_id, date ASC ) AS rownum
     FROM population
) t1
LEFT JOIN trans t2
ON t1.account_id = t2.account_id
WHERE (
   ( t2.balance > 390.000000 AND t2.balance > 159331.000000 AND t1.date - t2.date <= 28.142857 )
OR ( t2.balance > 390.000000 AND t2.balance <= 159331.000000 AND t1.amount > 464288.000000 )
OR ( t2.balance <= 390.000000 AND t1.date_account - t2.date <= -148.000000 )
) AND t2.date <= t1.date
GROUP BY t1.rownum,
         t1.account_id,
         t1.date;

Results¶

We are able to predict loan default in the example dataset with an accuracy of over 95% and a very good AUC. With this result getML is in the top 1% of published solutions on this problem. The training time for the initial model was less than one minute. Together with the data preparation this project can easily be completed within one day.

You can use this tutorial as starting point for your own analysis or head over the other tutorials and the user guide if you want to learn more about the functionalty getML offers. Please contact us with your feedback about this tutorial or general inquiries. We also offer proof-of-concept project if you need help getting started with getML.