Predicting Potentially Interested Clients

Introduction

In this article we build a machine learning model with the goal of predicting which customers of a bank are interested in opening a term deposit account. For this purpose we compare seven classification algorithms, with emphasis on Boosting models (XGBoost and LightGBM). The data come from the “UCI Machine Learning Repository” and specifically from the Bank Marketing dataset.

Before the analysis, let us define a few basic concepts.

What is a term deposit?

It is a type of bank account in which the customer commits not to make a withdrawal for a predefined period of time (e.g. one year). In exchange, the bank offers higher interest rates compared with ordinary savings accounts.

Some indicative examples:

Piraeus Bank offers double the interest rate on its term-deposit accounts.
Eurobank offers zero interest on savings accounts, 0.01%–0.35% on savings-plan accounts, and 0.1%–1% on term deposits, depending on the programme and the size of the deposit.
According to a recent report by the Bank of Greece, term-deposit interest rates range between 1.2% and 1.4%, versus just 0.03% for standard household accounts.

The suitable profile for these products generally concerns people with a significant savings balance and without heavy financial obligations (loans, overdue debts).

Prerequisites

Importing libraries

For this analysis we need a few standard R libraries: {readr} for importing the data and {dplyr} for reshaping it. The {kableExtra} package is a useful addition for printing the results as tables. Data visualisation is an important part of the work. I initially used {ggplot2}, but it produces static charts, which is limiting for a website. So for my articles I use the {highcharter} package, which produces interactive charts that adapt to all screen sizes. Finally, since the aim is to classify the bank’s customers by their interest in a banking product, the {tidymodels} package is essential.

Importing data

Once the necessary libraries are loaded, we import the data. The dataset is available in several versions, a larger one and a more compact one, which differ only in the number of observations. For this article I choose the more compact version, since fitting Boosting models is particularly time-consuming compared with simpler classification models (e.g. Logistic Regression, k-Nearest Neighbours).
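A minimal sketch of the import step. It assumes the compact file has been downloaded locally as `bank.csv` (a hypothetical path) and keeps the layout of the UCI original, which is semicolon-delimited and stores the response in a column named `y`. The `read_bank()` helper is purely illustrative.

```r
library(readr)
library(dplyr)

# Hypothetical helper: read the semicolon-delimited UCI file and tidy it up.
read_bank <- function(path) {
  read_delim(path, delim = ";", show_col_types = FALSE) %>%
    rename(deposit = y) %>%                          # clearer name for the response
    mutate(across(where(is.character), as.factor))   # text columns become factors
}

# Usage (the path is an assumption):
# bank <- read_bank("bank.csv")
# glimpse(bank)
```

Converting the text columns to factors up front keeps the later modelling steps from having to guess which variables are qualitative.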

Data preview

Below is a small sample of the dataset (the first 6 observations), so that we understand its structure and the type of the variables.

Dataset preview (first 6 observations)
ID	Age	Job	Marital	Education	Default	Balance	Housing	Loan	Contact	Day	Month	Duration	Campaign	pdays	Previous	Poutcome	Deposit
1	30	Unemployed	Married	Primary	No	1787	No	No	Mobile	19	October	79	1	-1	0	Unknown	No
2	33	Services	Married	Secondary	No	4789	Yes	Yes	Mobile	11	May	220	1	339	4	Failure	No
3	35	Management	Single	Tertiary	No	1350	Yes	No	Mobile	16	April	185	1	330	1	Failure	No
4	30	Management	Married	Tertiary	No	1476	Yes	Yes	Unknown	3	June	199	4	-1	0	Unknown	No
5	59	Blue Collar	Married	Secondary	No	0	Yes	No	Unknown	5	May	226	1	-1	0	Unknown	No
6	35	Management	Single	Tertiary	No	747	No	No	Mobile	23	February	141	2	176	3	Failure	No

Before doing any analysis it is good to determine the type of the data we have available. In general, variables can be classified as follows:

Quantitative: Discrete or Continuous
Qualitative: Categorical (nominal) or Ordinal

Summary of the dataset variables

Variable	Variable type	Description
`Age`	quantitative (continuous)	Age of the individual
`Job`	qualitative (categorical)	Individual’s field of employment
`Marital`	qualitative (categorical)	Marital status
`Education`	qualitative (ordinal)	Highest level of education
`Default`	qualitative (categorical)	Has defaulted on credit obligations?
`Balance`	quantitative (continuous)	Average annual account balance (in €)
`Housing`	qualitative (categorical)	Has a housing loan?
`Loan`	qualitative (categorical)	Has a personal loan?
`Contact`	qualitative (categorical)	Means of contact
`Day`	quantitative (discrete)	Day of the month of the most recent contact
`Month`	qualitative (ordinal)	Month of most recent contact
`Duration`	quantitative (continuous)	Duration (in seconds) of the last contact
`Campaign`	quantitative (discrete)	Number of contacts made to the individual during the current campaign
`pdays`	quantitative (discrete)	Number of days since the last contact in a previous campaign (-1 if never contacted before)
`Previous`	quantitative (discrete)	Number of contacts made to the individual before the current campaign
`poutcome`	qualitative (categorical)	Outcome of the previous marketing campaign
`Deposit`	qualitative (categorical)	Did the client open a term deposit?

Our sample consists of 17 variables (columns), of which 7 are quantitative and the remaining 10 qualitative. Of the qualitative variables, 8 are categorical and only two are ordinal (contact month and level of education).
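For the two ordinal variables it can help to make the ranking explicit. The sketch below uses ordered factors on a few toy values. The lower-case labels follow the raw UCI file, which is an assumption about how the data is stored at this stage.

```r
# Toy values using the lower-case labels of the raw UCI file (an assumption).
education_raw <- c("primary", "tertiary", "secondary", "unknown")
month_raw     <- c("oct", "may", "feb")

# Ordered factors make the ranking explicit; labels that are not listed
# (here "unknown") become NA and need separate handling.
education <- factor(education_raw,
                    levels  = c("primary", "secondary", "tertiary"),
                    ordered = TRUE)
month <- factor(month_raw, levels = tolower(month.abb), ordered = TRUE)

education[1] < education[2]   # TRUE: primary ranks below tertiary
sort(month)                   # sorted by calendar order, not alphabetically
```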

Defining functions

Ok, we have seen some basic details of the data and its structure. Can we start the analysis now?

It depends. If we want a quick analysis to extract one specific result, it is probably fine. Most of the time, though, a more careful study design is required. A common mistake, which I too have made in the past, is repeating the same procedures over and over. To avoid writing the same code many times, it is essential to write some functions.

Consequently, we define two functions. The first, univariateQualitativePlot, creates pie charts and bar charts for the qualitative variables. Similarly, univariateQuantitativePlot builds bar charts for the quantitative variables. Both functions build their charts with the {highcharter} package.
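As a rough illustration, a function like univariateQualitativePlot could look as follows. This is a hypothetical reconstruction: the argument names, the default chart type and the toy data are all assumptions.

```r
library(dplyr)
library(highcharter)

# Hypothetical version of univariateQualitativePlot(); the arguments are assumptions.
univariateQualitativePlot <- function(data, variable, type = c("column", "pie")) {
  type <- match.arg(type)
  counts <- data %>%
    count(level = as.character(.data[[variable]]), name = "n")  # frequency per level
  hchart(counts, type, hcaes(x = level, y = n), name = variable)
}

# Usage on a toy data frame:
toy <- data.frame(marital = c("married", "single", "married", "divorced"))
p <- univariateQualitativePlot(toy, "marital", type = "pie")
# p   # print the object to display the interactive chart
```

Keeping the counting step inside the function means every qualitative variable is summarised in exactly the same way before it is plotted.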

Descriptive Analysis

Missing values

The dataset contains no missing values. This is, of course, a rare, ideal case. Otherwise, we would have to fill in the empty values using some estimation method (imputation).
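A quick way to run this check, sketched on a toy data frame. One caveat is visible in the preview above: labels such as "Unknown" and the value -1 in pdays are placeholder codes rather than true missing values, so they are not counted by `is.na()`.

```r
# Toy data frame standing in for the bank data.
toy <- data.frame(
  age     = c(30, 33, NA),
  contact = c("mobile", "unknown", "mobile"),
  pdays   = c(-1, 339, -1)
)

sum(is.na(toy))       # total number of missing cells
colSums(is.na(toy))   # missing cells per column

# Placeholder codes are not NA, so they have to be counted explicitly.
sum(toy$contact == "unknown")
sum(toy$pdays == -1)
```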

Univariate analysis

Next, it is important to study our variables, their values and their distributions. This step helps us understand the sample and take further parameters into account when building the model.

As for the field of employment, the sample includes a significant share of people in jobs that probably go together with higher studies and consequently higher earnings, such as executives, administrative staff and entrepreneurs. About 40% of the bank’s customers are employed in “blue-collar” jobs, which most often go together with a reduced willingness to commit capital. Finally, around 10% of the clientele belongs to population groups for whom, for various reasons, opening such an account is not advantageous: the unemployed, students, and pensioners, who may need to cover emergency healthcare needs.

Another available piece of information is marital status, which may be linked to increased needs and household expenses. In general, the effect of this indicator is not clear in advance. In the sample under examination, around 60% of individuals are married, a quarter are single, and the rest are divorced.

An indicator that, at least intuitively, may be among the most important is the customer’s highest level of education. It is reasonable that someone with a higher level of studies is able to work in jobs that require specialisation. In this particular dataset, only 30% have a university education.

Beyond studies, the individual’s obligations also play an important role. These can be distinguished through three variables:

whether the customer has credit in default
whether the customer has taken out a housing loan
whether the customer has taken out a personal consumer loan

It is obvious that if someone has non-performing debts, the last thing they will think about is saving. In the sample, only 76 people, corresponding to 1.7% of customers, fall into this category.

In addition, a significant share of the bank’s customers already have substantial obligations, both in the short and long term. More than half have taken out a housing loan. A more significant reason for refusal may be consumer loans, which are notorious for their high interest rates. In our case, about 15% have taken out a consumer loan, which is a fairly encouraging figure.

Housing loan
Has housing loan?	Frequency	Percentage
Yes	2559	56.6
No	1962	43.4

Personal loan
Has personal loan?	Frequency	Percentage
No	3830	84.7
Yes	691	15.3

Another piece of information provided is the means of contact with the customer. More than half (64%) have declared a mobile phone as their means of contact.

Another interesting variable is the last month in which a customer was contacted. Most of the last contacts appear to have been made during the summer months. Of course, this figure needs care in its interpretation.

According to the data and its description, the bank had also run similar campaigns in previous years. In those earlier campaigns 129 customers agreed to open a term deposit, while the outcome for many others is unknown.

The bank’s depositors are mainly younger people, with the vast majority under 60 years old. The histogram has a bell shape resembling the normal distribution, with a slight positive skew.

Closely linked to the customer’s profile is the size of their bank balance. The distribution of deposits is strongly right-skewed: the vast majority is concentrated around low balances (a median of just €444), with a long tail of a few customers with particularly high deposits. There is even a small share with a negative balance.

Another element is the duration of the call. Intuitively, a very brief call suggests that the customer is not interested, whereas a longer interaction between the bank’s caller and the customer probably indicates increased interest in the banking product.

Finally, it is worth seeing how many times each customer was contacted in the context of the current campaign. Most received one or two calls, while a smaller part was contacted repeatedly, something that rarely pays off.

To close the univariate analysis, let us also look at the response variable itself: how many, that is, finally opened a term deposit. As expected, the data is strongly imbalanced, with only ~11% of customers responding positively. This imbalance will particularly concern us at the modelling stage.

Bivariate analysis

In the previous sub-section we examined some basic descriptive details per variable. It is also worth comparing these variables with the response variable (that is, with the desire to open a term-deposit account).

An important comparison is the customer’s field of employment against their final decision. The data indicate the lowest proportional demand among people with manual work, while pensioners are those with the highest.

The above results were perhaps expected to some extent. The case of pensioners is interesting: while we initially assumed that they are not a suitable audience because of possible emergency needs, the data show that they respond positively at a higher rate. This can be explained by the fact that many pensioners already have a stable income without significant new obligations.

Subscription rates by job (%)
Job	No	Yes
Retired	76.5	23.5
Student	77.4	22.6
Unknown	81.6	18.4
Management	86.5	13.5
Housemaid	87.5	12.5
Administrative	87.9	12.1
Self-Employed	89.1	10.9
Technician	89.2	10.8
Unemployed	89.8	10.2
Services	90.9	9.1
Entrepreneur	91.1	8.9
Blue Collar	92.7	7.3
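A table of this kind can be produced with row-wise proportions. The sketch below uses a small toy data frame, so its numbers are unrelated to the figures above.

```r
# Toy data; in the article the same idea is applied to job and deposit.
toy <- data.frame(
  job     = c("retired", "retired", "student", "student", "services", "services"),
  deposit = c("yes", "no", "no", "no", "no", "yes")
)

# Row-wise percentages: each job sums to 100.
rates <- round(100 * prop.table(table(toy$job, toy$deposit), margin = 1), 1)
rates[order(-rates[, "yes"]), ]   # highest subscription rate first
```

Using `margin = 1` is what makes the comparison fair across jobs of very different sizes, since each row is normalised separately.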

But there are also variables for which the answer is not obvious, such as marital status. The figure shows a significantly higher proportional participation among people who are on their own (either single or divorced).

A summary of the above can also be given with a flow diagram. The first column shows the distribution across marital statuses, which is then combined with the educational background, and the third column is the customer’s final answer.

I considered it important to compare the previous campaign with the current one. First, the bank retains a large part of the trust of the service’s previous users, since 64% of those who had agreed to open a term deposit do so again in the new campaign. The crucial figure in this bivariate analysis, however, is the extent to which the new campaign persuaded customers who had previously refused the service. This share approaches 13%, which is a fairly satisfactory performance.

Model Building

In R there are two widespread frameworks for building models, {caret} and {tidymodels}. On the one hand, {caret} is quite easy to use. On the other hand, {tidymodels} is an “all-in-one” solution, since it is a meta-package that aims to cover the whole modelling workflow; however, it has less documentation because it is more recent.

Splitting the dataset

The first step is to split the original dataset. In this particular analysis we use a three-way split:

Training set (bank_train): used to train all the models and for cross-validation.
Validation set (bank_val): a small part of the training set that we keep sealed exclusively for selecting the optimal classification threshold of the Stack Ensemble.
Test set (bank_test): used only for the final evaluation; we do not touch it at any other stage.

Subset	Observations	Purpose
Full set	4,521	-
`bank_trainval`	3,390	Training + validation
`bank_test`	1,131	Final evaluation only
`bank_train`	2,712	Model training & cross-validation
`bank_val`	678	Stack Ensemble threshold selection

bank_val contains 678 observations, enough to give a reliable estimate of the threshold without removing a significant part of the training data.
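A sketch of how such a three-way split can be built with {rsample}. The proportions are inferred from the table above (3,390 / 4,521 ≈ 0.75 and 2,712 / 3,390 = 0.80). Stratifying by the response and the seed value are assumptions, and the data is a synthetic stand-in, so the counts may differ from the table by a few rows.

```r
library(rsample)

set.seed(2024)  # arbitrary seed
# Synthetic stand-in with the same size and roughly the same class balance.
bank <- data.frame(
  balance = rnorm(4521),
  deposit = factor(sample(c("no", "yes"), 4521, replace = TRUE,
                          prob = c(0.885, 0.115)))
)

# 75% train + validation, 25% test; then 80% / 20% inside the first part.
split_outer   <- initial_split(bank, prop = 0.75, strata = deposit)
bank_trainval <- training(split_outer)
bank_test     <- testing(split_outer)

split_inner <- initial_split(bank_trainval, prop = 0.80, strata = deposit)
bank_train  <- training(split_inner)
bank_val    <- testing(split_inner)

c(train = nrow(bank_train), val = nrow(bank_val), test = nrow(bank_test))
```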

Data preprocessing

Of course, building the models is not such an easy matter. Between splitting the dataset and specifying the models comes data preprocessing. Fortunately, the {tidymodels} package offers ready-made functions for this, and other packages address problems that are common in data like ours. For example, most customers are expected not to want a term deposit. Data is called imbalanced when the categories of the variable we are trying to predict differ greatly in size (here roughly 90% “no” versus 10% “yes”). In this case we use the step_smote() function from the {themis} package, which generates synthetic observations of the minority category.

It is worth noting that the duration variable (call duration) was deliberately removed from the models. The reason is that the value of this variable becomes known only after the call is completed, that is, after the event we are trying to predict. If we included it, we would create a data-leakage problem. In addition, the poutcome variable was also removed, as it concerns previous campaigns for which there is insufficient data for the vast majority of customers.
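The preprocessing described above could be sketched as a recipe like the one below. Only the removal of duration and poutcome and the use of step_smote() come from the text. The remaining steps, the toy data and the seed are assumptions added so that the example runs on its own.

```r
library(recipes)
library(themis)

set.seed(1)  # arbitrary seed
# Synthetic stand-in: 180 "no" and 20 "yes".
toy <- data.frame(
  age      = rnorm(200, 40, 10),
  balance  = rnorm(200, 1000, 500),
  housing  = factor(sample(c("no", "yes"), 200, replace = TRUE)),
  duration = rpois(200, 200),
  poutcome = factor(sample(c("failure", "unknown"), 200, replace = TRUE)),
  deposit  = factor(rep(c("no", "yes"), times = c(180, 20)))
)

bank_recipe <- recipe(deposit ~ ., data = toy) %>%
  step_rm(duration, poutcome) %>%                # leakage / mostly unknown
  step_dummy(all_nominal_predictors()) %>%       # SMOTE needs numeric predictors
  step_normalize(all_numeric_predictors()) %>%
  step_smote(deposit)                            # synthesise minority-class rows

# SMOTE is applied to the training data only (it is skipped for new data).
balanced <- bake(prep(bank_recipe), new_data = NULL)
table(balanced$deposit)
```

Because the oversampling step is skipped when the recipe is applied to new data, the validation and test sets keep their natural class balance.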

Cross-validation

Ok, is it time to build our model?

Not so fast. In theory we could continue. However, relying on a single split into two parts is not recommended, since the evaluation would then depend heavily on how that particular split happened to fall.

Cross-validation is a technique in which the training set is divided into k equal subsets (folds). At each iteration, one fold is used as the evaluation set and the remaining k-1 as the training set. In our case, we used 5-fold cross-validation with stratification so that the ratio of interested / not-interested is maintained in each fold.
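A small sketch of stratified 5-fold cross-validation with {rsample}, on toy data with an arbitrary seed. The last line checks that the share of "yes" stays roughly constant across the held-out folds.

```r
library(rsample)

set.seed(7)  # arbitrary seed
toy_train <- data.frame(
  balance = rnorm(500),
  deposit = factor(rep(c("no", "yes"), times = c(440, 60)))
)

# 5 folds, stratified so every fold keeps roughly the same share of "yes".
folds <- vfold_cv(toy_train, v = 5, strata = deposit)

# Share of "yes" in the held-out part of each fold.
sapply(folds$splits, function(s) mean(assessment(s)$deposit == "yes"))
```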

Building the model

Next, with the {parsnip} package we have the ability to define the characteristics of the various models. Seven different classification models were developed and compared: Logistic Regression, k-Nearest Neighbours (KNN), Random Forest, Naive Bayes, SVM, XGBoost and LightGBM.
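To give an idea of what a {parsnip} specification looks like, here are two of the seven models. The engines, the fixed number of trees and the parameters marked with `tune()` are assumptions. A LightGBM specification would look like the XGBoost one, with the "lightgbm" engine provided by the {bonsai} package.

```r
library(tidymodels)

# Two of the seven specifications; engines and tuned parameters are assumptions.
log_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

xgb_spec <- boost_tree(trees = 500, tree_depth = tune(), learn_rate = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# A workflow pairs a specification with the preprocessing recipe, e.g.
# workflow() %>% add_recipe(bank_recipe) %>% add_model(xgb_spec)
```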

Applying the models

Having defined the models and the corresponding workflows, we proceed to optimise their hyperparameters. For the simpler models we use grid search (tune_grid()). For the gradient boosting models (XGBoost and LightGBM), I choose Bayesian optimisation (tune_bayes()), which saves time compared with exhaustive search.
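The two tuning strategies could be wrapped as below. This is a sketch with hypothetical helper names: `wf` stands for a workflow with `tune()` placeholders, `folds` for the cross-validation object, and the grid size, number of iterations and metric set are all assumptions.

```r
library(tidymodels)

# Hypothetical wrappers; `wf` is a workflow with tune() placeholders and
# `folds` is the cross-validation object.
tune_simple <- function(wf, folds, grid_size = 20) {
  tune_grid(
    wf,
    resamples = folds,
    grid      = grid_size,                       # number of candidate settings
    metrics   = metric_set(roc_auc, f_meas),
    control   = control_grid(save_pred = TRUE,   # keep out-of-fold predictions
                             save_workflow = TRUE)
  )
}

tune_boosting <- function(wf, folds, n_iter = 25) {
  tune_bayes(
    wf,
    resamples = folds,
    initial   = 10,                              # random starting points
    iter      = n_iter,                          # sequential search steps
    metrics   = metric_set(roc_auc, f_meas),     # the first metric is optimised
    control   = control_bayes(save_pred = TRUE, save_workflow = TRUE,
                              no_improve = 10)
  )
}
```

Saving the predictions during tuning is what makes the out-of-fold predictions available later, when a classification threshold has to be chosen.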

Beyond the individual evaluation of each model, we also apply a technique called stacking. The idea behind stacking is that instead of choosing a single model as the final one, we combine the predictions of several models into a meta-model. For this I use the {stacks} package.
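In outline, a stack is assembled from tuning results like this. The helper and its three arguments are hypothetical, and which models actually enter each version of the ensemble is not specified here.

```r
library(tidymodels)
library(stacks)

# Hypothetical helper: each argument is a tuning result that was created with
# control_stack_grid() or control_stack_bayes(), so predictions were saved.
build_stack <- function(lgbm_res, xgb_res, log_res) {
  stacks() %>%
    add_candidates(lgbm_res) %>%   # candidate members from each tuned model
    add_candidates(xgb_res) %>%
    add_candidates(log_res) %>%
    blend_predictions() %>%        # regularised meta-model picks the weights
    fit_members()                  # refit the retained members on the training set
}
```

The blending step typically keeps only a few of the candidates, so the final ensemble is usually much smaller than the list of models that were offered to it.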

A note on the Stack Ensemble threshold

Unlike the individual models (LightGBM, XGBoost, Logistic Regression) for which we can draw out-of-fold predictions from cross-validation, the Stack Ensemble does not have corresponding predictions. For this reason we created bank_val earlier, which is a subset of the training set that the model has not seen during training, and is used exclusively for selecting the threshold.

Threshold selection

Each model produces for each customer a probability of interest, not directly a decision. To move from the probability to the classification (“yes” / “no”) we need a threshold: if the probability exceeds this threshold, the customer is classified as interested.

The default value of 0.5 is rarely the optimal choice with imbalanced data; in our case only 11% of customers belong to the “yes” category. Instead, we choose the threshold that maximises F1, which balances the ability to detect interested customers (recall) with the reliability of the positive predictions (precision).
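The search for the F1-maximising threshold can be sketched in a few lines of base R. The helper name, the threshold grid and the toy probabilities are assumptions for illustration only.

```r
# Hypothetical helper: scan a grid of thresholds and keep the one with the best F1.
best_f1_threshold <- function(prob_yes, truth, grid = seq(0.01, 0.99, by = 0.01)) {
  f1 <- sapply(grid, function(t) {
    pred <- prob_yes >= t
    tp <- sum(pred & truth == "yes")
    fp <- sum(pred & truth == "no")
    fn <- sum(!pred & truth == "yes")
    if (tp == 0) 0 else 2 * tp / (2 * tp + fp + fn)
  })
  list(threshold = grid[which.max(f1)], f1 = max(f1))
}

# Toy example: predicted probabilities and the true labels.
prob  <- c(0.053, 0.104, 0.203, 0.307, 0.352, 0.601, 0.151, 0.082)
truth <- c("no", "no", "no", "yes", "yes", "yes", "no", "no")
best_f1_threshold(prob, truth)
```

When several thresholds reach the same F1, `which.max()` returns the lowest of them, which favours recall among otherwise equivalent choices.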

A common mistake is to find the threshold on the same test set used for the final evaluation. To avoid this, we follow a different approach depending on the model:

For the individual models we use the out-of-fold (OOF) predictions from cross-validation.
For the Stack Ensemble we use bank_val.

It is worth noting that the thresholds differ significantly between models. This does not mean that some model is “wrong”; each model simply calibrates its probabilities differently. What matters is that each threshold was found on data that the corresponding model had not used during training.

Optimal threshold per model
Model	Threshold	F1 (estimate)
Stack Ensemble (v1)	0.1035	0.396
Stack Ensemble (v2)	0.1341	0.396
LightGBM	0.2533	0.385
XGBoost	0.3195	0.37
Naive Bayes	0.7316	0.365
Random Forest	0.4146	0.362
Logistic Regression	0.5559	0.319
SVM	0.6287	0.309

Results

Variable importance

From the variable-importance analysis of the LightGBM model, it emerges that the most decisive factors for predicting interest in a term deposit are the unknown means of contact (15.4%) and the existence of a housing loan (15.2%), followed by secondary education (10.2%) and the “married” marital status (9.5%). The contribution of the contact month is also notable, with May standing out as the most important (8.0%). These results largely agree with the findings of the descriptive analysis.

Variable importance, descending
Variable	Importance (%)
Unknown contact method	15.41
Housing loan	15.23
Secondary education	10.18
Married	9.5
Month: May	7.97
Month: August	3.56
Days since last contact	3.48
Job: Management	3.36
Number of contacts	2.94
Balance	2.81

Model comparison

Before proceeding to the final evaluation, it is worth comparing graphically the predictive ability of the top models with the help of ROC curves (Receiver Operating Characteristic curves). The ROC curve depicts sensitivity against the false-positive rate, across all possible thresholds. The closer the curve is to the upper left corner, the better the overall predictive ability. From the diagram, the superiority of the combined models (Stack Ensemble) over the others is clearly evident.

With imbalanced data such as ours, the ROC curve can be overly optimistic. For this reason, we additionally examine the Precision–Recall curves, which focus exclusively on the category we are interested in (the interested customers). The dashed reference line corresponds to the performance of a model with no discriminative ability (equal to the frequency of the positive category, ~11.6%).
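Both summaries can be computed with {yardstick}. The sketch below uses toy predictions, and the column names are assumptions that follow the usual tidymodels naming.

```r
library(yardstick)

# Toy predictions; "yes" is the second factor level, hence event_level = "second".
toy <- data.frame(
  deposit   = factor(c("no", "no", "no", "no", "yes", "no", "yes", "yes"),
                     levels = c("no", "yes")),
  .pred_yes = c(0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80)
)

roc_auc(toy, truth = deposit, .pred_yes, event_level = "second")
pr_auc(toy,  truth = deposit, .pred_yes, event_level = "second")

# Baseline of the precision-recall plot: the share of positives.
mean(toy$deposit == "yes")
```

On the test set of this article the same baseline is 131 / 1,131, which gives the ~11.6% mentioned above.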

Overall, we observe that accuracy on its own is misleading. XGBoost, which shows the top accuracy value (88.4%), has essentially zero F1, since it does not detect a single truly interested customer. This happens because accuracy mainly “rewards” the correct negative predictions, and in our imbalanced data, saying “no” to the vast majority is easy. F1 is the most honest criterion here, because it penalises false positives and false negatives equally. In terms of sensitivity, the combined models and Logistic Regression stand out, detecting about 50% of the truly interested; LightGBM, by contrast, gives up part of its sensitivity (≈39%) in exchange for higher accuracy and the highest precision (positive predictive value) among the strong models.

Model comparison on the test set (each at its own threshold)
Model	Accuracy	F1	Sensitivity	Specificity	Precision
LightGBM	0.866	0.402	0.389	0.928	0.415
Stack Ensemble (v2)	0.828	0.401	0.496	0.872	0.337
Stack Ensemble (v1)	0.809	0.379	0.504	0.849	0.304
SVM	0.794	0.34	0.458	0.838	0.27
Random Forest	0.822	0.337	0.389	0.879	0.297
Logistic Regression	0.758	0.332	0.519	0.789	0.244
Naive Bayes	0.869	0.178	0.122	0.967	0.327
XGBoost	0.884	—	0	1	—
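All five columns follow from the four cells of the confusion matrix. The sketch below recomputes the LightGBM row from its counts on the test set (51 true positives, 72 false positives, 80 false negatives, and therefore 1,131 - 203 = 928 true negatives) and then shows what happens to a model that always answers "no". The function name is my own.

```r
# Metrics from the four cells of a confusion matrix.
confusion_metrics <- function(tp, fp, fn, tn) {
  c(
    accuracy    = (tp + tn) / (tp + fp + fn + tn),
    f1          = 2 * tp / (2 * tp + fp + fn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp)
  )
}

# LightGBM on the test set.
round(confusion_metrics(tp = 51, fp = 72, fn = 80, tn = 928), 3)

# A model that always answers "no": high accuracy, F1 of 0, precision undefined (NaN).
round(confusion_metrics(tp = 0, fp = 0, fn = 131, tn = 1000), 3)
```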

Conclusions

The business evaluation table below translates the statistical measures into business reality and highlights a different ranking. LightGBM appears as the most efficient choice per call: with just 123 calls it achieves a per-call success rate of 41.5%, that is, about 4 in 10 contacts pay off. If the resource being spent is the call centre’s time, this model makes the best use of it. The price, however, is 80 missed prospects: out of 131 interested customers it detected only 51.

If, on the contrary, the goal is to detect as many interested parties as possible, first place goes to Logistic Regression, which catches 68 out of 131 (the most of all models) but at a cost of 279 calls and a success rate of just 24.4%. The combined models follow very closely (Stack Ensemble v1: 66 interested, v2: 65), with significantly fewer calls. Among these high-recall options, Stack Ensemble v2 makes the fewest wrong approaches (128 versus 151 for v1, that is 23 fewer), so it is preferable when we want to bother as few uninterested people as possible.

The biggest surprise was XGBoost, with zero correct predictions out of 131 truly positive cases, which makes it practically useless for our purpose. Naive Bayes, on the other hand, is interesting for being particularly economical (32.7% success with just 49 calls), although it detects only 16 interested customers.

Business evaluation per model
Model	True positives (TP)	False positives (FP)	Missed prospects (FN)	Total calls	Success per call (%)
LightGBM	51	72	80	123	41.5
Stack Ensemble (v2)	65	128	66	193	33.7
Naive Bayes	16	33	115	49	32.7
Stack Ensemble (v1)	66	151	65	217	30.4
Random Forest	51	121	80	172	29.7
SVM	60	162	71	222	27
Logistic Regression	68	211	63	279	24.4
XGBoost	0	0	131	0	—
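The derived columns of this table are simple arithmetic on the true and false positives. The sketch below rebuilds them from the counts above and picks a "winner" for two different objectives. XGBoost is left out because it makes no calls, so its success rate is undefined.

```r
# Counts taken from the business table above (test set, 131 interested customers).
business <- data.frame(
  model = c("LightGBM", "Stack Ensemble (v2)", "Naive Bayes", "Stack Ensemble (v1)",
            "Random Forest", "SVM", "Logistic Regression"),
  tp = c(51, 65, 16, 66, 51, 60, 68),
  fp = c(72, 128, 33, 151, 121, 162, 211)
)
business$calls   <- business$tp + business$fp                    # everyone predicted "yes"
business$success <- round(100 * business$tp / business$calls, 1) # success per call (%)
business$missed  <- 131 - business$tp                            # missed prospects (FN)

business$model[which.max(business$success)]  # most efficient per call
business$model[which.max(business$tp)]       # most interested customers found
```

Changing the objective changes the answer, which is exactly the point of this section: the ranking depends on what the bank wants to optimise.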

From the above analysis it becomes clear that there is no “right” or best model, but rather the one that serves our purpose. If the bank’s goal were to detect as many interested parties as possible, then the answer leans towards Logistic Regression or the combined models. The goals could, however, be more conservative, for example, to expand the services but bother as few uninterested people as possible. In such a case Stack Ensemble v2 is ideal, since it makes the fewest wrong contacts among the high-recall models. Finally, if the priority is efficiency per call (that is, respecting the call centre’s time), the winner is clearly LightGBM.

In conclusion, the ideal model is not determined solely by the ideal parameters, but by the question itself and the purpose of our organisation.