How to understand your customers and interpret a black box model

Making use of the Random Forest classifier and SHAP

In this post, I interpret a Random Forest classifier using SHAP values and, along the way, answer the following questions:

1. What characteristics do customers who placed a deposit have?

2. What characteristics do customers who did NOT place a deposit have?

3. Based on the available data, what can be done next time to increase the conversion rate (CVR)?

Data

I used a public dataset with the results of Portuguese bank marketing campaigns. The campaigns were conducted mostly through direct phone calls, offering bank clients a term deposit. If, after all marketing efforts, the client agreed to place a deposit, the target variable is marked ‘yes’, otherwise ‘no’.

There are no NULL values in the dataset.
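The null check can be sketched as follows. In the post, the data comes from the public bank marketing CSV (file name and separator below are assumptions based on the common `bank-additional-full.csv` release); the tiny stand-in frame just lets the snippet run on its own:

```python
import pandas as pd

# Real loading step would look like:
# df = pd.read_csv("bank-additional-full.csv", sep=";")
# Tiny stand-in frame in the same schema, for illustration only:
df = pd.DataFrame({
    "age": [56, 41, 25],
    "job": ["housemaid", "technician", "student"],
    "y": ["no", "no", "yes"],
})

# Count missing values across all columns.
missing_total = df.isnull().sum().sum()
print(missing_total)  # 0 → no NULL values
```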

Exploratory Data Analysis (EDA)

Let us explore the dataset to have a better understanding of the data.

It is difficult to spot correlations in the table, so let us visualize the results.

It’s interesting to see that some parameters are highly correlated, such as “emp.var.rate” and “nr.employed”, “emp.var.rate” and “euribor3m”, and “euribor3m” and “nr.employed”. Unfortunately, I could not find how they were calculated or their exact definitions.
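A correlation matrix makes these relationships easy to check. The frame below is synthetic, built so that the three macro-economic indicators move together (the real values come from the campaign dataset):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in where euribor3m and nr.employed track emp.var.rate,
# mimicking the strong correlations observed in the real data.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "emp.var.rate": base,
    "euribor3m": 0.9 * base + 0.1 * rng.normal(size=200),
    "nr.employed": 0.95 * base + 0.05 * rng.normal(size=200),
})

corr = df.corr()
print(corr.round(2))
# A heatmap makes this easy to see:
# import seaborn as sns; sns.heatmap(corr, annot=True)
```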

Let us build a count plot to see the conversion rate.

4,640 users out of 41,188 placed a deposit in the bank during the campaigns. CVR is about 11%.
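The CVR quoted above follows directly from the counts:

```python
# Conversion rate from the campaign counts quoted above.
converted = 4640
total = 41188
cvr = converted / total
print(f"CVR: {cvr:.1%}")  # CVR: 11.3%
```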

Average values for each group have the following results:

There is a drastic difference in the last contact duration (seconds) for those who placed and did not place a deposit. Also, there is some difference in the number of contacts performed during this campaign (“campaign”), number of days that passed by after the client was last contacted from a previous campaign (“pdays”), number of contacts performed before this campaign and for this client (“previous”) and the employment variation rate.
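The per-group averages come from a simple group-by on the target. The numbers in the toy frame below are illustrative only; the real table uses the full dataset:

```python
import pandas as pd

# Toy frame in the dataset's schema (values are illustrative only).
df = pd.DataFrame({
    "duration": [600, 700, 120, 90, 150],
    "campaign": [1, 2, 3, 4, 2],
    "y": ["yes", "yes", "no", "no", "no"],
})

# Average feature values for each target group, as in the table above.
group_means = df.groupby("y").mean(numeric_only=True)
print(group_means)
```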

The EDA does not explicitly answer the business questions we posed at the beginning. Therefore, it is reasonable to apply machine learning to better understand the data. For this problem, I used a Random Forest classifier and SHAP values.

Data Preparation

I used one-hot encoding to create dummy variables for categorical attributes. This procedure needs to be done before feeding the data to the model.
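One-hot encoding can be done in one line with pandas. The column names below are assumptions based on the bank marketing dataset; any categorical columns work the same way:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [56, 41, 25],
    "job": ["housemaid", "technician", "student"],
    "marital": ["married", "single", "single"],
})

# Replace each categorical column with one dummy column per category.
encoded = pd.get_dummies(df, columns=["job", "marital"])
print(encoded.columns.tolist())
```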

Random Forest Classifier

Random Forest is an ensemble of decision tree algorithms. Random Forest creates decision trees on randomly selected data samples, gets a prediction from each tree and selects the best solution by means of voting. A prediction on a classification problem is the majority vote for the class label across the trees in the ensemble.

The sizes of the train and test sets

Let us train and test the model:
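The training step can be sketched as below. Synthetic data stands in for the one-hot-encoded bank marketing features, and the hyperparameters shown are assumptions, not the post's exact settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded bank marketing feature matrix.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the ensemble and score it on held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```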

The accuracy on the test set is quite high. Let us look at the ROC curve to get a better understanding of the model’s performance.

We can see that the ROC Area Under the Curve (AUC) for the Random Forest classifier is about 0.745, which is better than a no-skill classifier with a score of about 0.5. It would be possible to improve the ROC AUC by model tuning. However, I did not tune parameters such as n_estimators and max_depth, which directly affect SHAP computation time.
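ROC AUC is computed from predicted probabilities rather than hard labels. A minimal sketch on synthetic data (the post computes this on the bank marketing test set):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# ROC AUC needs the probability of the positive class, not predictions.
proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(f"ROC AUC: {auc:.3f}")  # 0.5 ≈ no skill, 1.0 = perfect
```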

SHAP Plots

SHAP (SHapley Additive exPlanations) is a method to explain individual predictions. The goal of SHAP is to explain the prediction of an instance x by computing the contribution of each feature to the prediction. The SHAP explanation method computes Shapley values from coalitional game theory: the feature values of a data instance act as players in a coalition, and Shapley values tell us how to fairly distribute the “payout” (the prediction) among the features. For tabular data, a player is typically an individual feature value.

SHAP summary plot

Based on the SHAP summary plot, we can see the top 20 features and how each feature’s value relates to its impact on the classification model. For a better understanding, let us have a look at the dependence plots.

SHAP waterfall plot

Based on the SHAP waterfall plot, we can say that duration is the most important feature in the model, accounting for more than 30% of the model’s explainability. These top 20 features together provide more than 80% of the model’s interpretation.

SHAP dependence plot for duration
SHAP dependence plot for euribor3m
SHAP dependence plot for emp.var.rate
SHAP dependence plot for nr.employed

Based on the SHAP dependence plots, it is clear that users who subscribed to a deposit tend to have the following characteristics:
* last contact duration longer than 500 seconds
* euribor3m values around 1 and 2
* nr.employed <= 5100
* emp.var.rate <= 0

On the other hand, users who did NOT subscribe to the service tend to have the following characteristics:
* last contact duration shorter than 500 seconds
* euribor3m values around 4 and 5
* nr.employed >= 5100
* emp.var.rate >= 0

Summary

Let us answer the questions we placed at the beginning.

1) What characteristics do people who placed a deposit have?
* last contact duration longer than 500 seconds
* euribor3m values around 1 and 2
* nr.employed <= 5100
* emp.var.rate <= 0

2) What characteristics do people who did NOT place a deposit have?
* last contact duration shorter than 500 seconds
* euribor3m values around 4 and 5
* nr.employed >= 5100
* emp.var.rate >= 0

3) Based on the available data, is there any opportunity to increase CVR?
Based on the SHAP waterfall plot, the most important feature is duration (last contact duration, in seconds), which contributes more than 30% of the model’s explainability. Therefore, keeping the last contact relatively long (> 500 seconds) may increase CVR.

This analysis was done to gain a practical understanding of applying SHAP values and a Random Forest classifier. The findings may not reflect the real drivers of placing a deposit. We also saw parameters such as euribor3m, emp.var.rate, cons.price.idx, cons.conf.idx, and nr.employed, which need additional clarity on how they were derived and what their exact definitions are.

The source code that was created for this post can be found here. I would be pleased to receive feedback or questions on any of the above. For any questions or just a chat, you can reach out to me on LinkedIn.

Reference

  1. Molnar, Christoph. “Interpretable Machine Learning: A Guide for Making Black Box Models Explainable”, 2019. https://christophm.github.io/interpretable-ml-book/
  2. Brownlee, Jason. “How to Develop a Random Forest Ensemble in Python”. https://machinelearningmastery.com/random-forest-ensemble-in-python/
  3. Géron, Aurélien. “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”, 2nd Edition. https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/

Data Scientist @ Rakuten | 💜 Data Science and Psychology | https://www.linkedin.com/in/aigerimshopenova/
