Predict Possible Interested Clients

Build a classification machine learning model (using LightGBM & XGBoost) to predict whether a person is likely to be interested in opening a term deposit.

R
Classification
Tidymodels
Author

stesiam

Published

November 24, 2022

Introduction

In this article we build a machine learning model with the goal of predicting which customers of a bank are interested in opening a term deposit account. For this purpose we compare seven classification algorithms, with emphasis on Boosting models (XGBoost and LightGBM). The data come from the “UCI Machine Learning Repository” (Dua & Graff, 2017) and specifically from the Bank Marketing dataset (Moro, Cortez, & Rita, 2014).

Before the analysis, let us define some basic concepts.

What is a term deposit?

It is a type of bank account where the customer commits to not making withdrawals for a predetermined period of time (e.g., one year). In return, the bank offers higher interest rates compared to regular savings or current accounts.

For reference:

  • Bank of Piraeus offers double the interest rate on its term deposit accounts.
  • Eurobank offers zero interest on current accounts, 0.01%-0.35% on savings accounts, and 0.1%-1% on term deposits, depending on the program and deposit amount.
  • According to a recent report by the Bank of Greece, term deposit interest rates range between 1.2% and 1.4%, compared to just 0.03% for typical household accounts.

The appropriate profile for these products generally involves individuals with significant savings balances and without heavy financial obligations (loans, overdue debts).

Prerequisites

Loading Libraries

For this analysis we will need standard R libraries for importing data, via the readr package (Wickham, Hester, & Bryan, 2025), and formatting it with the dplyr package (Wickham, François, Henry, Müller, & Vaughan, 2023). The kableExtra package (Zhu, 2024) is a useful addition for printing results in table format. An important part of the workflow is data visualization. I initially used the ggplot2 package to create charts, which is limiting for a website since ggplot2 produces static graphics. So for my articles I use the highcharter package (Kunst, 2022), which enables responsive charts that work well on all screen sizes. Finally, this analysis aims to classify bank customers based on their interest in a banking product, so the tidymodels package (Kuhn & Wickham, 2025) is essential, as we will need a classification model.

Load libraries
# General data processing
library(readr)
library(dplyr)
library(forcats)
library(tidyr)
library(glue)

# Results presentation
library(kableExtra)
library(reactable)
library(gt)

# Interactive charts
library(highcharter)

# Machine learning models
library(tidymodels)
library(bonsai)    # LightGBM via tidymodels
library(themis)    # SMOTE for imbalanced data
library(stacks)    # Ensemble stacking
library(probably)
library(discrim)

# Individual algorithms
library(kknn)
library(ranger)
library(naivebayes)
library(kernlab)
library(vip)       # Variable importance

Importing Data

After loading the necessary libraries, we need to import our data. There are several versions of the same dataset, a larger and a more compact one, differing only in the number of observations. For this article I will choose the more compact version since fitting Boosting models is particularly time-consuming compared to simpler classification models (e.g., Logistic Regression, k-Nearest Neighbors).

Show the code
bank_dataset <- read_delim("bank_dataset_files/bank.csv",  
                           delim = ";", 
                           escape_double = FALSE, 
                           trim_ws = TRUE)

bank_dataset <- bank_dataset %>% tibble::rowid_to_column("ID")

Data Preview

Below we present a small sample of the dataset (the first 6 observations) to understand its structure and variable types.

Show the code
preview_bank_dataset = head(bank_dataset, 6)
reactable(
  preview_bank_dataset,
  bordered = FALSE,
  striped = FALSE,
  highlight = FALSE,
  defaultColDef = colDef(align = "center"),
  style = list(fontSize = "14px", border = "none"),
  theme = reactableTheme(
    borderColor = "transparent",
    cellStyle = list(
      borderBottom = "transparent"
    ),
    headerStyle = list(
      borderBottom = "2px solid rgb(117, 117, 117)",
      borderTop = "2px solid rgb(117, 117, 117)",
      fontWeight = "bold"
    )
  )
)
Table 1: Data preview (first 6 observations)

Before performing any analysis it is good to identify the type of data we have available. For the most part we can learn this by looking at the values a variable takes. In general, variables can be classified based on the values they take as follows:

graph TD;
  A(Variable Types) --> B(Quantitative)
  A(Variable Types) --> C(Qualitative)
  B --> D(Discrete)
  B --> E(Continuous)
  C --> J(Nominal)
  C --> G(Ordinal)
Figure 1: Variable classification based on their values
Table 2: Variable summary

| Variable | Type | Description |
|----------|------|-------------|
| Age | quantitative (continuous) | Individual's age |
| Job | qualitative (nominal) | Employment sector |
| Marital | qualitative (nominal) | Marital status |
| Education | qualitative (ordinal) | Highest level of education |
| Default | qualitative (nominal) | Has credit in default? |
| Balance | quantitative (continuous) | Average annual account balance (in euros) |
| Housing | qualitative (nominal) | Has a housing loan? |
| Loan | qualitative (nominal) | Has a personal loan? |
| Contact | qualitative (nominal) | Contact method |
| Day | quantitative | Day of the month of the most recent contact |
| Month | qualitative (ordinal) | Month of most recent contact |
| Duration | quantitative (continuous) | Duration (in seconds) of last call |
| Campaign | quantitative | Number of contacts during this campaign |
| pdays | quantitative | Days since last contact from a previous campaign |
| previous | quantitative | Number of contacts before this campaign |
| poutcome | qualitative (nominal) | Outcome of previous marketing campaign |
| Deposit (y) | qualitative (nominal) | Did the client subscribe to a term deposit? |

The dataset consists of 18 columns: the ID we added plus 17 variables, of which 7 are quantitative and the remaining 10 are qualitative. Of the qualitative variables, 8 are nominal and only two are ordinal (month of contact and education level).
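These types can be verified directly in R before moving on; a quick check on the imported tibble (character columns are the qualitative variables before any factor conversion):

# Column types at a glance
dplyr::glimpse(bank_dataset)

# Count columns by their primary class
table(sapply(bank_dataset, function(x) class(x)[1]))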

Defining Functions

Ok, we have seen some basic characteristics of the data and their structure. Can I now start my analysis?

It depends. If we want a quick analysis to extract a specific result, it is probably fine. However, most of the time a more careful study design is required. A common mistake I have made in the past is repeating the same procedures over and over. To avoid writing the same code multiple times, reusable functions are essential. For example, in the previous section we observed the presence of variables of the same type, both qualitative and quantitative. Why write similar code 10 times when we can encapsulate the core logic into a single function?

Consequently, we define a few helper functions: gt_custom, which standardizes table output; univariateQualitativePlot, which creates pie charts and bar charts for the qualitative variables; and univariateQualitative, which returns the underlying frequency table.

Show the code
gt_custom <- function(data, head_max = 5, use_labels = TRUE) {
    
    data_subset <- as.data.frame(data) %>% head(head_max)
    
    gt_tbl <- data_subset %>%
        gt::gt(groupname_col = "model") %>%
        cols_align(align = "center", columns = everything()) %>%
        tab_style(
            style = cell_text(
                v_align = "middle",
                weight = "bold",
                whitespace = "normal"
            ),
            locations = cells_column_labels()
        ) %>%
        tab_options(
            row.striping.include_table_body = FALSE,
            table.background.color = "transparent",
            column_labels.background.color = "transparent",
            table.width = pct(100), 
            table.font.size = px(13),           
            data_row.padding = px(15), 
            column_labels.padding = px(15),
            table.border.top.style = "none",
            table.border.bottom.style = "none",
            table_body.hlines.style = "none", 
            column_labels.border.top.width = px(2),
            column_labels.border.top.color = "#757575",
            column_labels.border.bottom.width = px(2),
            column_labels.border.bottom.color = "#757575",
            table_body.border.bottom.width = px(2),
            table_body.border.bottom.color = "#757575"
        ) %>%
        opt_css(css = "
      .gt_table th, .gt_table td { 
        text-align: center !important; 
        vertical-align: middle !important;
      }
      .gt_table { 
        margin-left: auto !important; 
        margin-right: auto !important; 
        width: 100% !important;
      }
      .gt_table tr { background-color: transparent !important; }
      .gt_row { background-color: transparent !important; }
    ")
    
    if (use_labels) {
        gt_tbl <- gt_tbl %>%
            cols_label_with(
                fn = function(x) {
                    html(paste0(
                        "<b>", col_labels[x], "</b>",
                        "<br><span style='font-size:10px; color:#777; font-style:italic;'>",
                        x,
                        "</span>"
                    ))
                }
            )
    }
    
    return(gt_tbl)
}
Show the code
univariateQualitativePlot <- function(data, column, title, subtitle, chart_type = "bar") {

  freq_table <- data %>%
    count({{ column }}, name = "Frequency") %>%
    arrange(desc(Frequency)) %>%
    rename(Variable = {{ column }}) %>%
    mutate(pct = round((Frequency / sum(Frequency) * 100), digits = 1))

  hc <- highchart() %>%
    hc_title(text = title) %>%
    hc_subtitle(text = subtitle) %>%
    hc_tooltip(pointFormat = "{point.name}: {point.y}") %>%
    hc_legend(enabled = chart_type == "pie")

  hc <- if (chart_type == "bar") {
    hc %>%
      hc_chart(type = "bar") %>%
      hc_xAxis(categories = freq_table$Variable, title = list(text = "Category")) %>%
      hc_yAxis(title = list(text = "Frequency")) %>%
      hc_series(list(name = "Frequency", data = freq_table$Frequency))
  } else if (chart_type == "pie") {
    pie_data <- lapply(1:nrow(freq_table), function(i) {
      list(name = freq_table$Variable[i], y = freq_table$Frequency[i])
    })
    hc %>%
      hc_chart(type = "pie") %>%
      hc_series(list(name = "Frequency", data = pie_data))
  } else {
    stop("Invalid chart type. Use 'bar' or 'pie'.")
  }

  return(hc)
}

univariateQualitative <- function(data, column) {

  freq_table <- data %>%
    count({{ column }}, name = "Frequency") %>%
    arrange(desc(Frequency)) %>%
    rename(Variable = {{ column }}) %>%
    mutate(pct = round((Frequency / sum(Frequency) * 100), digits = 1))

  return(freq_table)
}

Similarly, for the quantitative variables we will build histograms. In the sections below these are created directly with highcharter's hchart(), though the same logic could be wrapped in a univariateQuantitativePlot helper analogous to the one above. In both cases the charts are interactive, courtesy of the highcharter package.
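A minimal sketch of what such a helper might look like, assuming hchart() is applied to the numeric column exactly as in the histograms below:

univariateQuantitativePlot <- function(data, column, title = "", subtitle = "") {
  values <- dplyr::pull(data, {{ column }})
  hchart(values) %>%   # hchart() on a numeric vector draws a histogram
    hc_title(text = title) %>%
    hc_subtitle(text = subtitle) %>%
    hc_legend(enabled = FALSE)
}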

Descriptive Analysis

Missing Values

Show the code
how_many_nas = sum(is.na(bank_dataset))

The dataset contains a total of 0 missing values. This is of course a rare and ideal situation. Otherwise, we would need to impute the missing values using some estimation method.
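Had the count been non-zero, a per-column breakdown would show where imputation is needed; for completeness:

# Missing values per column (all zeros here)
colSums(is.na(bank_dataset))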

Univariate Analysis

Next it is important to study our variables, their values and their distributions. This is a critical step to understand the sample and to take additional context into account when building the model.

Regarding the employment sector, the sample shows significant participation from individuals with jobs likely associated with higher education and consequently higher earnings, such as management executives, administrative employees, and entrepreneurs. About 21% of the bank’s customers work in blue-collar jobs (close to 40% if technicians are included), which are most often associated with a reduced willingness to commit capital. Finally, in the bank’s customer base there is a share of around 10% from population groups who, for various reasons, would not benefit from creating a term deposit account: the unemployed, students who are in a financially vulnerable period with high expenses and limited income, and retirees who may need to cover unexpected healthcare costs. For retirees specifically, who make up 5% of customers, there are other financial instruments available for securing their retirement, such as reverse mortgages.

Show the code
univariateQualitativePlot(bank_dataset, job, 
                          title = "Employment Sector", 
                          subtitle = "Absolute count and percentage (%) of all customers")
Figure 2: Bar chart of employment sectors of bank customers
Table 3: Employment sector of bank customers
Occupation Frequency Percentage
Management 969 21.4
Blue Collar 946 20.9
Technician 768 17.0
Administrative 478 10.6
Services 417 9.2
Retired 230 5.1
Self-Employed 183 4.0
Entrepreneur 168 3.7
Unemployed 128 2.8
Housemaid 112 2.5
Student 84 1.9
Unknown 38 0.8

Another available piece of information is marital status, which may be associated with increased household needs and expenses. A married customer may have higher household expenses (e.g., due to children). Alternatively, married individuals may also have greater financial stability. In general, the interpretation of this indicator is not clear-cut a priori. In any case, in the sample under examination about 60% of individuals are married, one quarter are single, and the rest are divorced.

Show the code
univariateQualitativePlot(bank_dataset, marital, 
                          title = "What is your marital status?", 
                          subtitle = "The majority are married.",
                          "pie")
Figure 3: Pie chart of marital status of bank customers
Table 4: Marital status of bank customers
Marital Status Frequency Percentage
Married 2797 61.9
Single 1196 26.5
Divorced 528 11.7

A variable that, at least intuitively, can be among the most important is the customer’s highest level of education. It stands to reason that someone with higher-level education can work in specialized jobs that are rewarded accordingly, given the reduced supply of such skills. This indirectly determines professional opportunities and consequently the salary one can command and the amount left over for savings. In this particular dataset only 30% have university-level education.

Show the code
univariateQualitativePlot(bank_dataset, 
                          education, 
                          title = "Highest level of education",
                          subtitle = "")
Figure 4: Bar chart of educational background of bank customers
Table 5: Education level
Education Frequency Percentage
Secondary 2306 51.0
Tertiary 1350 29.9
Primary 678 15.0
Unknown 187 4.1

Beyond education, which is a strong indicator, the customer’s financial obligations also play an important role. These can be examined through three variables:

  • whether the customer has credit in default
  • whether the customer has a housing loan
  • whether the customer has a personal loan

It is obvious that a customer with outstanding debts is unlikely to consider saving rather than paying off what they owe. In the sample only 76 individuals, corresponding to 1.7% of customers, fall into this category, which is encouraging for the vast majority.

Show the code
univariateQualitativePlot(bank_dataset, default, 
                          title = "Do you have any credit in default?", 
                          subtitle = "Percentage (%) of individuals who have not fulfilled their credit obligations",  
                          "pie")
Figure 5: Pie chart of customers by credit default status
Table 6: How many have outstanding debts?
Default Frequency Percentage
No 4445 98.3
Yes 76 1.7

Furthermore, a significant share of the bank’s customers already carry substantial obligations, both short-term and long-term. More than half have taken out a housing loan, a fixed, long-term commitment that is difficult to walk away from, since it is typically secured against the property itself. On the other hand, one could argue that having a housing loan means the household has already budgeted for housing costs rather than paying rent. What may be the more important barrier is the presence of a personal or consumer loan. These loans are typically taken out to cover short-term unexpected needs and are known for their high interest rates, making them expensive. Paying them off first is the sensible priority, and opening a term deposit is not a logical step in that context. In our case about 15% have a personal loan, which is mildly encouraging since it leaves 85% without this particular obstacle.

Show the code
univariateQualitative(bank_dataset, housing) %>%
    rename("Housing Loan?" = Variable,
           "Frequency"     = Frequency,
           "Percentage"    = pct) %>%
    gt_custom(head_max = Inf, use_labels = FALSE)
univariateQualitative(bank_dataset, loan) %>%
    rename("Personal Loan?" = Variable,
           "Frequency"      = Frequency,
           "Percentage"     = pct) %>%
    gt_custom(head_max = Inf, use_labels = FALSE)
Table 7: Types of loans held by bank customers
(a) Housing loan
Housing Loan? Frequency Percentage
Yes 2559 56.6
No 1962 43.4
(b) Personal loan
Personal Loan? Frequency Percentage
No 3830 84.7
Yes 691 15.3

Another piece of information available is the contact method used to reach customers. More than half (64%) have indicated a mobile phone as their preferred contact method.

Show the code
univariateQualitativePlot(bank_dataset, contact, 
                          title = "Contact method", 
                          subtitle = "Mobile contact is more widespread than landline",
                          "pie")
Figure 6: Pie chart of contact methods used to reach customers
Table 8: Contact method
Contact Method Frequency Percentage
Mobile 2896 64.1
Unknown 1324 29.3
Landline 301 6.7

Another interesting variable is the last month in which a customer was contacted. Most recent contacts appear to have occurred during the summer months. This detail requires careful interpretation, however, as the data may have been collected in September, which would naturally make summer months the most recent for most customers. A bivariate analysis in the next section, looking at successful contacts by month, will help examine this more precisely.

Show the code
univariateQualitativePlot(bank_dataset, month, 
                          title = "Contacts by month", 
                          subtitle = "(#) Absolute number of contacts per month")
Figure 7: Bar chart of contacts by month

According to the data and its description, the bank had previously run similar campaigns promoting term deposit accounts. As a result, we have records of customers who subscribed in previous campaigns, which may be useful since a notable share could be interested in re-subscribing. The previous campaigns resulted in 129 customers having subscribed, while the status of many others is unknown.

Show the code
univariateQualitativePlot(bank_dataset, poutcome, 
                          title = "What was the outcome of the previous contact?", 
                          subtitle = "For every five unsuccessful contacts (failure or other), there is one successful one", "pie")
Figure 8: Pie chart of customer responses in previous term deposit campaigns

Here we have the outcome we are trying to predict: the customer’s response to the current campaign. Based on these outcomes and the previous variables, we will build our prediction models, and the models will be evaluated against these observed responses.

Show the code
univariateQualitativePlot(bank_dataset, y, 
                          title = "How many decided to open a term deposit?", 
                          subtitle = "Percentage of customers who decided to open a term deposit as a result of the current campaign.",
                          "pie")
Figure 9: Pie chart of customer responses to the current term deposit campaign

The bank’s customers are predominantly younger, with the vast majority under 60. The histogram appears to have a bell shape resembling a normal distribution, but with a slight positive skew.

Show the code
hchart(bank_dataset$age) %>%
    hc_title(text = "Age Distribution") %>%
    hc_subtitle(text = glue("The median age is <b>{median(bank_dataset$age)}</b> years.")) %>%
    hc_caption(text = "Bank Marketing Dataset from UCI") %>%
    hc_tooltip(pointFormat = "{point.name}: {point.y}") %>%
    hc_legend(enabled = FALSE)
Figure 10: Histogram of age distribution of bank customers
Show the code
hchart(bank_dataset$balance) %>%
    hc_title(text = "Balance Distribution") %>%
    hc_subtitle(text = glue("Most balances range between €0 and €200. The median account balance is {median(bank_dataset$balance)}. The 75th percentile is €1,480, meaning one quarter of customers have balances above this amount. Balances range from {min(bank_dataset$balance)} to {max(bank_dataset$balance)}.")) %>%
    hc_caption(text = "Bank Marketing Dataset from UCI") %>%
    hc_tooltip(pointFormat = "{point.name}: {point.y}") %>%
    hc_legend(enabled = FALSE) %>%
    hc_xAxis(
      title = list(text = "Account balance in EUR"),
      max = 10000
    )
Figure 11: Histogram of account balance distribution of bank customers

Another variable of interest is the duration of the last call. Intuitively, a very short call suggests the customer is not interested, while a longer interaction between the agent and the customer likely signals greater interest in the banking product.

Show the code
hchart(bank_dataset$duration) %>%
    hc_title(text = "Call Duration") %>%
    hc_subtitle(text = glue("The average call duration is 4.4 minutes, while the median is 3 minutes.")) %>%
    hc_caption(text = "Bank Marketing Dataset from UCI") %>%
    hc_tooltip(pointFormat = "{point.name}: {point.y}") %>%
    hc_legend(enabled = FALSE) %>%
    hc_xAxis(
      title = list(text = "Call duration (in seconds)"),
      max = 1000)
Figure 12: Histogram of total call duration (in seconds)
Show the code
s <- bank_dataset %>%
  mutate(category = case_when(
    campaign >= 1 & campaign <= 7 ~ as.character(campaign),
    campaign > 7                  ~ "8+",
    TRUE                          ~ NA_character_
  )) %>%
  mutate(category = factor(category, levels = c(as.character(1:7), "8+"), ordered = TRUE))
Show the code
hchart(s$category) %>%
    hc_title(text = "Number of Contacts") %>%
    hc_subtitle(text = glue("How many times was the same customer contacted")) %>%
    hc_caption(text = "Bank Marketing Dataset from UCI") %>%
    hc_legend(enabled = FALSE) %>%
    hc_xAxis(
      title = list(text = "Number of contacts")
    )
Figure 13: Bar chart of total number of contacts made with each customer
Table 9: Number of contacts per customer
Contacts (#) Frequency
1 1734
2 1264
3 558
4 325
5 167
6 155
7 75
8+ 243

Bivariate Analysis

In the previous sub-section we examined some basic descriptive statistics per variable, giving us a sense of the bank’s customer base. Looking at these figures individually may not be sufficient to draw meaningful conclusions, and a bivariate analysis (comparing two variables at a time) becomes necessary. One approach would be to investigate the correlation between quantitative variables and the association between qualitative ones. In this case, it is particularly useful to compare the previous variables against the response variable, namely the customer’s interest in opening a term deposit.
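Besides the charts that follow, such relationships can also be quantified formally. As a quick sketch, a chi-squared test of independence for a qualitative pair and a correlation matrix for the quantitative variables:

# Is employment sector associated with the outcome?
chisq.test(table(bank_dataset$job, bank_dataset$y))

# Pairwise correlations among the quantitative variables
bank_dataset %>%
  select(age, balance, day, duration, campaign, pdays, previous) %>%
  cor() %>%
  round(2)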

An important comparison is the customer’s employment sector against their final decision. The chart below shows that blue-collar workers have the lowest positive response rate, while retirees have the highest.

Show the code
plotjob <- bank_dataset %>%
  group_by(job, y) %>%
  summarize(n = n(), .groups = "drop_last") %>% 
  mutate(pct = round(100*(n/sum(n)), digits = 1),
         lbl = scales::percent(pct / 100)) %>%  # pct is on a 0-100 scale
  arrange(desc(pct))  # all "No" rows come first, ordered by descending share

highchart() %>%
  hc_chart(type = "column") %>%
  hc_xAxis(categories = plotjob$job) %>%
  hc_yAxis(
    min = 0,
    max = 100,
    title = list(text = "Percentage"),
    labels = list(format = "{value}%")
  ) %>%
  hc_plotOptions(
    column = list(
      stacking = "percent",
      dataLabels = list(
        enabled = TRUE,
        format = "{point.y:.0f}%"
      )
    )
  ) %>%
  hc_series(
    # Jobs are ordered by descending "No" share; because each pair of
    # percentages sums to 100, sorting the "Yes" shares ascending keeps
    # both series aligned with the same categories
    list(name = "No",  data = plotjob$pct[plotjob$y == "No"], color = "#AA5733"),
    list(name = "Yes", data = sort(plotjob$pct[plotjob$y == "Yes"]), color = "#33bb57")
  ) %>%
  hc_tooltip(pointFormat = "<b>{series.name}</b>: {point.y}%") %>%
  hc_title(text = "Current Campaign Responses by Occupation") %>%
  hc_subtitle(text = "Students and retirees are the groups with the highest proportional subscription rates.") %>%
  hc_responsive(
    rules = list(
      list(
        condition = list(maxWidth = 500),
        chartOptions = list(
          plotOptions = list(
            column = list(dataLabels = list(enabled = FALSE))
          )
        )
      )
    )
)
Figure 14: Bar chart of campaign outcomes by occupation
Table 10: Campaign success rates by occupation
Occupation No Yes
Blue Collar 92.7 7.3
Entrepreneur 91.1 8.9
Services 90.9 9.1
Unemployed 89.8 10.2
Technician 89.2 10.8
Self-Employed 89.1 10.9
Administrative 87.9 12.1
Housemaid 87.5 12.5
Management 86.5 13.5
Unknown 81.6 18.4
Student 77.4 22.6
Retired 76.5 23.5

These results may have been partially expected when considering the likely financial situation of each group. The case of retirees is particularly interesting: while we initially assumed they would not be an ideal target due to potential unexpected expenses, the data show that they respond positively at a higher rate. This can be explained by the fact that many retirees have a stable income with few new financial obligations, making them suitable candidates for committing capital.

There are also variables where the answer is less obvious, such as marital status. The chart shows that individuals without a spouse (single or divorced) subscribe proportionally more than married individuals.

Show the code
plotmarital <- bank_dataset %>%
  group_by(marital, y) %>%
  summarize(n = n(), .groups = "drop_last") %>% 
  mutate(pct = round(100*(n/sum(n)), digits = 1),
         lbl = scales::percent(pct / 100)) %>%  # pct is on a 0-100 scale
  arrange(desc(pct))  # all "No" rows first, keeping the series aligned below

highchart() %>%
  hc_chart(type = "column") %>%
  hc_xAxis(categories = plotmarital$marital) %>%
  hc_yAxis(
    min = 0,
    max = 100,
    title = list(text = "Percentage"),
    labels = list(format = "{value}%")
  ) %>%
  hc_plotOptions(
    column = list(
      stacking = "percent",
      dataLabels = list(
        enabled = TRUE,
        format = "{point.y:.0f}%"
      )
    )
  ) %>%
  hc_series(
    # Same alignment trick as before: the ascending "Yes" shares match the
    # descending "No" ordering of the categories
    list(name = "No",  data = plotmarital$pct[plotmarital$y == "No"], color = "#AA5733"),
    list(name = "Yes", data = sort(plotmarital$pct[plotmarital$y == "Yes"]), color = "#33bb57")
  ) %>%
  hc_tooltip(pointFormat = "<b>{series.name}</b>: {point.y}%") %>%
  hc_title(text = "Responses by Marital Status") %>%
  hc_subtitle(text = "Single and divorced individuals show proportionally greater interest in term deposits compared to married customers.") %>%
  hc_responsive(
    rules = list(
      list(
        condition = list(maxWidth = 500),
        chartOptions = list(
          plotOptions = list(
            column = list(dataLabels = list(enabled = FALSE))
          )
        )
      )
    )
)
Figure 15: Bar chart of campaign outcomes by marital status

A summary of the above can also be visualised with the following Sankey diagram. The first column shows marital status, whose flows combine with educational background in the second column, and we end at the third column with the customer’s final response (whether they are interested in opening a term deposit).

Show the code
custom_df = 
  tibble(
    r = bank_dataset$education,
    t = bank_dataset$marital,
    m = bank_dataset$y
  )

df_sankey = custom_df %>%
  group_by(r, t, m) %>%
  summarise(weight = n(), .groups = "drop")

links2 = df_sankey %>%
  group_by(t, r) %>%
  summarise(weight = sum(weight), .groups = "drop") %>%
  rename(from = t, to = r)

links3 = df_sankey %>%
  group_by(r, m) %>%
  summarise(weight = sum(weight), .groups = "drop") %>%
  rename(from = r, to = m)

links_all <- bind_rows(links2, links3)

highchart() %>%
  hc_chart(type = "sankey") %>%
  hc_title(text = "Marital Status -> Education -> Outcome") %>%
  hc_subtitle(text = "Flow of customers by combination of demographic characteristics and final decision") %>%
  hc_add_series(
    keys = c("from", "to", "weight"),
    data = list_parse(links_all),
    name = "Flow"
  ) %>%
  hc_tooltip(pointFormat = "{point.from} -> {point.to}: <b>{point.weight}</b>")

I also wanted to compare the difference between previous and current campaign outcomes, to understand where subscribers come from. Did customers re-subscribe, or did the bank succeed in expanding its reach to new clients? The bank retains significant loyalty from previous subscribers, with 64% of those who had previously agreed to open a term deposit doing so again in the new campaign. The key figure in this bivariate analysis is the rate at which those who previously declined were converted. This rate approaches 13%, which is a reasonably satisfactory result given that they had previously refused.
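These rates can be checked with a simple cross-tabulation before drawing the diagram:

# Share of current-campaign subscriptions within each previous outcome
bank_dataset %>%
  count(poutcome, y) %>%
  group_by(poutcome) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  ungroup()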

Show the code
custom_df = 
  tibble(
    t = bank_dataset$poutcome,
    l = bank_dataset$y
  )

df_sankey = custom_df %>%
  group_by(t, l) %>%
  summarise(weight = n(), .groups = "drop")

links3 = df_sankey %>%
  group_by(t, l) %>%
  summarise(weight = sum(weight), .groups = "drop") %>%
  rename(from = t, to = l)

links_all <- bind_rows(links3)

highchart() %>%
  hc_chart(type = "sankey") %>%
  hc_title(text = "Sankey Diagram: Previous vs Current Campaign") %>%
  hc_subtitle(text = "The diagram highlights the improved results of the current campaign (second column) compared to previous ones. Approximately 12% (63 individuals) who previously declined were persuaded to subscribe in the current campaign.") %>%
  hc_add_series(
    keys = c("from", "to", "weight"),
    data = list_parse(links_all),
    name = "Flow"
  ) %>%
  hc_tooltip(pointFormat = "{point.from} -> {point.to}: <b>{point.weight}</b>")

Building the Model

In R there are two widely used frameworks for building models: caret and tidymodels. On one hand, the caret package is fairly easy to use and has a large body of guides and tutorials. On the other hand, tidymodels is an all-in-one solution, a meta-package (a collection of packages) that provides a comprehensive workflow, though with somewhat less documentation since it is more recent.

Splitting the Dataset

The first step is to split the original dataset. In this analysis we use a three-way split:

  • Training set (bank_train): used for training all models and cross-validation.
  • Validation set (bank_val): a small portion of the training set held aside exclusively for selecting the optimal classification threshold for the Stack Ensemble.
  • Test set (bank_test): used only for the final evaluation, and not touched at any other stage.
Show the code
set.seed(123)

# Main split: 75% train, 25% test
bank_dataset_split <- initial_split(bank_dataset,
                                    prop   = 0.75,
                                    strata = y)
bank_trainval <- training(bank_dataset_split)
bank_test     <- testing(bank_dataset_split)

# Secondary split: from trainval we produce
# bank_train (80%) and bank_val (20%)
set.seed(123)
trainval_split <- initial_split(bank_trainval,
                                prop   = 0.80,
                                strata = y)
bank_train <- training(trainval_split)
bank_val   <- testing(trainval_split)
graph TD;
  A(Full Dataset <br> 4521 observations) --> B(bank_trainval <br> 3390 observations)
  A --> C(bank_test <br> 1131 observations)
  B --> D(bank_train <br> 2712 observations)
  B --> E(bank_val <br> 678 observations)
  D --> F(Model Training)
  D --> G(Cross-validation)
  E --> H(Stack Ensemble Threshold)
  C --> I(Final Evaluation)
Figure 16: Three-way data split

The bank_val set contains 678 observations, enough to give a reliable threshold estimate without removing a significant portion of the training data.

As mentioned, we have 4521 observations in total. The most common split is 75% (or 80%) for training and 25% (or 20%) for evaluation. We therefore end up with two subsets with the same number of variables: bank_trainval with 3390 observations, which is further divided as shown above, and the test set with 1131.

Training Data

Show the code
bank_train %>%
  gt_custom(.)
Table 11: Preview of the model training subset (train dataset)
ID Age Job Marital Education Default Balance Housing Loan Contact Day Month Duration Campaign Pdays Previous Poutcome y
1 30 Unemployed Married Primary No 1787 No No Mobile 19 October 79 1 -1 0 Unknown No
2 33 Services Married Secondary No 4789 Yes Yes Mobile 11 May 220 1 339 4 Failure No
5 59 Blue Collar Married Secondary No 0 Yes No Unknown 5 May 226 1 -1 0 Unknown No
7 36 Self-Employed Married Tertiary No 307 Yes No Mobile 14 May 341 1 330 2 Other No
9 41 Entrepreneur Married Tertiary No 221 Yes No Unknown 14 May 57 2 -1 0 Unknown No

Evaluation Data

Show the code
bank_test %>%
  gt_custom(.)
Table 12: Preview of the model evaluation subset (test dataset)
ID Age Job Marital Education Default Balance Housing Loan Contact Day Month Duration Campaign Pdays Previous Poutcome y
3 35 Management Single Tertiary No 1350 Yes No Mobile 16 April 185 1 330 1 Failure No
6 35 Management Single Tertiary No 747 No No Mobile 23 February 141 2 176 3 Failure No
15 31 Blue Collar Married Secondary No 360 Yes Yes Mobile 29 January 89 1 241 1 Failure No
23 44 Services Single Secondary No 106 No No Unknown 12 June 109 2 -1 0 Unknown No
51 45 Blue Collar Divorced Primary No 844 No No Unknown 5 June 1018 3 -1 0 Unknown Yes

Data Preprocessing

Building models is not straightforward. Between splitting the dataset and fitting the models comes the data preprocessing stage. The steps involved are not fixed and vary depending on the type of problem (classification or regression) and the model chosen. Fortunately, the tidymodels package provides ready-made tools to make this stage easier, particularly the recipes package, which forms part of tidymodels. There are also additional packages that address common dataset issues. For example, our data are expected to be imbalanced, as the vast majority of customers did not subscribe to a term deposit. A dataset is considered imbalanced when the target variable has a large disparity between classes (e.g., 90% “No” / 10% “Yes”). In this case we use the step_smote() function from the themis package to balance the target variable.

Show the code
# Recipe for tree-based models (RF, XGBoost, LightGBM)
tree_recipe <- recipe(y ~., data = bank_train) %>%
  step_rm(poutcome, ID, duration) %>%
  step_corr(all_numeric(), threshold = 0.75) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_smote(y)

# Recipe for distance/linear models (Logistic Regression, KNN)
linear_recipe <- recipe(y ~., data = bank_train) %>%
  step_rm(poutcome, ID, duration) %>%
  step_corr(all_numeric(), threshold = 0.75) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric_predictors()) %>%  # <-- additional step
  step_smote(y)

It is worth noting that the duration variable (call duration) was deliberately removed from the models. Its value is only known after the call has ended, that is, after the event we are trying to predict. Including it would create a data leakage problem, making the model artificially accurate but practically useless. The poutcome variable was also removed, for a different reason: it concerns previous campaigns for which data are unavailable for the large majority of customers, limiting its usefulness as a predictor.
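To confirm that step_smote() does what we expect, we can compare the class counts before and after the recipe is applied; juice() returns the processed training data:

# Class balance in the raw training data...
bank_train %>% count(y)

# ...and after the tree recipe, where SMOTE has balanced the classes
tree_recipe %>% prep() %>% juice() %>% count(y)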

Here is a preview of the dataset after applying the preprocessing steps for linear models:

Show the code
linear_recipe %>%
  prep() %>%
  juice() %>%
  head() %>%
  gt_custom()
Table 13: Dataset after applying recipes for linear models
Columns, in order: age, balance, day, campaign, pdays, previous, followed by dummy indicators for job, marital, education, default, housing, loan, contact, and month (38 standardized predictors in total), and finally the outcome y.
-1.037076 0.1075 0.3454 -0.5588 -0.4107 -0.3184 -0.5135 -0.1936 -0.1651 -0.5158 -0.2359 -0.2065 -0.3249 -0.137 -0.4436 5.9285 -0.1057 0.8009 -0.6045 -1.0245 -0.6479 -0.2151 -0.1127 -1.1299 -0.4159 -0.2744 -0.6336 -0.3999 -0.07702 -0.2216 -0.1956 -0.4292 -0.3533 -0.1075 -0.6721 -0.3072 7.295 -0.1057 No
-0.756037 1.0681 -0.6206 -0.5588 2.9742 1.8971 -0.5135 -0.1936 -0.1651 -0.5158 -0.2359 -0.2065 3.0769 -0.137 -0.4436 -0.1686 -0.1057 0.8009 -0.6045 0.9758 -0.6479 -0.2151 -0.1127 0.8847 2.4037 -0.2744 -0.6336 -0.3999 -0.07702 -0.2216 -0.1956 -0.4292 -0.3533 -0.1075 1.4874 -0.3072 -0.137 -0.1057 No
1.679640 -0.4643 -1.3451 -0.5588 -0.4107 -0.3184 1.9468 -0.1936 -0.1651 -0.5158 -0.2359 -0.2065 -0.3249 -0.137 -0.4436 -0.1686 -0.1057 0.8009 -0.6045 0.9758 -0.6479 -0.2151 -0.1127 0.8847 -0.4159 -0.2744 1.5778 -0.3999 -0.07702 -0.2216 -0.1956 -0.4292 -0.3533 -0.1075 1.4874 -0.3072 -0.137 -0.1057 No
-0.474997 -0.3660 -0.2584 -0.5588 2.8846 0.7894 -0.5135 -0.1936 -0.1651 -0.5158 -0.2359 4.8398 -0.3249 -0.137 -0.4436 -0.1686 -0.1057 0.8009 -0.6045 -1.0245 1.5429 -0.2151 -0.1127 0.8847 -0.4159 -0.2744 -0.6336 -0.3999 -0.07702 -0.2216 -0.1956 -0.4292 -0.3533 -0.1075 1.4874 -0.3072 -0.137 -0.1057 No
-0.006598 -0.3936 -0.2584 -0.2429 -0.4107 -0.3184 -0.5135 5.1637 -0.1651 -0.5158 -0.2359 -0.2065 -0.3249 -0.137 -0.4436 -0.1686 -0.1057 0.8009 -0.6045 -1.0245 1.5429 -0.2151 -0.1127 0.8847 -0.4159 -0.2744 1.5778 -0.3999 -0.07702 -0.2216 -0.1956 -0.4292 -0.3533 -0.1075 1.4874 -0.3072 -0.137 -0.1057 No

And a preview after applying the steps for tree-based models:

Show the code
tree_recipe %>%
  prep() %>%
  juice() %>%
  head() %>%
  gt_custom()
Table 14: Dataset after applying recipes for tree-based models
Columns, in order: age, balance, day, campaign, pdays, previous, followed by dummy indicators for job, marital, education, default, housing, loan, contact, and month (38 predictors in total, left unscaled), and finally the outcome y.
30 1787 19 1 -1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 No
33 4789 11 1 339 4 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 No
59 0 5 1 -1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 No
36 307 14 1 330 2 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 No
41 221 14 2 -1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 No

Cross-Validation

Ok, is it time to build our model now?

Not so fast. Theoretically we could proceed, but the recommended approach is not to rely on a single two-way split, since the evaluation results depend heavily on how that split happened and which observations ended up where. To get a more reliable estimate of model performance, we build subsets of our data to estimate, on average across 5 or 10 sub-samples, which parameters consistently lead to better accuracy.

Cross-validation is a technique in which the training set is divided into k equal subsets (folds). In each iteration, one fold is used as the validation set and the remaining k-1 folds are used for training. The process is repeated k times and the results are averaged, giving us a more reliable estimate of model performance. In our case we used 5-fold cross-validation with stratification, so that the ratio of subscribers to non-subscribers is preserved in each fold.

Show the code
set.seed(123)
cv_folds <- vfold_cv(bank_train, v = 5, strata = y)

# Control objects
ctrl_grid  <- control_stack_grid()
ctrl_bayes <- control_stack_bayes() # instead of control_grid(), for stacking

Building the Model

With the parsnip package we can define the characteristics of various models. In this case we want to evaluate several models and compare their performance. We will therefore define:

  • k-Nearest Neighbors
  • Logistic Regression
  • Random Forest
  • Naive Bayes
  • Linear SVM
  • LightGBM
  • XGBoost

The parsnip package provides a unified interface for specifying predictive models. In the framework of this analysis, seven different classification models were developed and compared using the tidymodels ecosystem in R: Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, Naive Bayes, SVM, XGBoost, and LightGBM.

For Logistic Regression, a regularized implementation via the glmnet library was used, allowing hyperparameter tuning of the regularization strength (penalty) and the mixing ratio (mixture), corresponding to Ridge and Lasso regularization. The KNN model was based on the kknn algorithm, with tuning of the number of neighbors and the distance weighting function (weight_func). For Random Forest, the ranger implementation was used with 200 decision trees, with hyperparameter tuning of the number of variables examined at each split (mtry) and the minimum number of observations per node (min_n). The two gradient boosting methods, XGBoost and LightGBM, were both defined with 200 trees, with tuning of tree depth (tree_depth), learning rate (learn_rate), and minimum observations per node (min_n). Hyperparameters marked with tune() were optimized to find the combination that maximizes predictive performance.

Finally, each model is paired with the appropriate preprocessing recipe. For linear models (Logistic Regression and KNN), the linear recipe is used. For tree-based models (XGBoost, LightGBM, and Random Forest), the tree recipe is applied.

Show the code
# --- Logistic Regression ---
log_reg_model <- parsnip::logistic_reg(
  penalty = tune(),
  mixture = tune()
) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# --- K-Nearest Neighbors ---
knn_model <- parsnip::nearest_neighbor(
  neighbors = tune(),
  weight_func = tune()
) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# --- Random Forest ---
rf_model <- parsnip::rand_forest(
  trees = 200,
  mtry  = tune(),
  min_n = tune()
) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# --- Naive Bayes ---
nb_model <- naive_Bayes(
  smoothness = tune(),
  Laplace    = tune()
) %>%
  set_engine("naivebayes") %>%
  set_mode("classification")

# --- SVM ---
svm_model <- parsnip::svm_linear(
  cost = tune()
) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

# --- XGBoost ---
xgb_model <- parsnip::boost_tree(
  trees      = 200,
  min_n      = tune(),
  learn_rate = tune(),
  tree_depth = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# --- LightGBM ---
lgbm_model <- parsnip::boost_tree(
  trees      = 200,
  min_n      = tune(),
  learn_rate = tune(),
  tree_depth = tune()
) %>%
  set_engine("lightgbm") %>%
  set_mode("classification")

log_wf  <- workflow() %>% add_recipe(linear_recipe) %>% add_model(log_reg_model)
knn_wf  <- workflow() %>% add_recipe(linear_recipe) %>% add_model(knn_model)
rf_wf   <- workflow() %>% add_recipe(tree_recipe)   %>% add_model(rf_model)
nb_wf   <- workflow() %>% add_recipe(linear_recipe) %>% add_model(nb_model)
svm_wf  <- workflow() %>% add_recipe(linear_recipe) %>% add_model(svm_model)
xgb_wf  <- workflow() %>% add_recipe(tree_recipe)   %>% add_model(xgb_model)
lgbm_wf <- workflow() %>% add_recipe(tree_recipe)   %>% add_model(lgbm_model)

Fitting the Models

Having defined the models and their corresponding workflows, we proceed to hyperparameter optimization. For the simpler models (Logistic Regression, KNN, Naive Bayes, SVM, and Random Forest) we use grid search (tune_grid()), systematically testing various hyperparameter combinations. For the gradient boosting models (XGBoost and LightGBM), which have more hyperparameters and are more computationally expensive per run, I opt for Bayesian optimization (tune_bayes()). In this approach, the algorithm learns from previous trials and selects the next parameter combination more intelligently, saving time compared to an exhaustive search.

Show the code
set.seed(123)

# Logistic Regression -- tune_grid
log_results <- tune_grid(
  log_wf,
  resamples = cv_folds,
  grid = 10,
  metrics = metric_set(roc_auc, accuracy),
  control = ctrl_grid
)

# KNN -- tune_grid
knn_results <- tune_grid(
  knn_wf,
  resamples = cv_folds,
  grid = 10,
  metrics = metric_set(roc_auc, accuracy),
  control = ctrl_grid
)

# Random Forest -- tune_grid (moderate cost)
rf_results <- tune_grid(
  rf_wf,
  resamples = cv_folds,
  grid = 15,
  metrics = metric_set(roc_auc, accuracy),
  control = ctrl_grid
)

nb_results <- tune_grid(
  nb_wf,
  resamples = cv_folds,
  grid      = 10,
  metrics   = metric_set(roc_auc, accuracy),
  control   = ctrl_grid
)

svm_results <- tune_grid(
  svm_wf,
  resamples = cv_folds,
  grid      = 10,
  metrics   = metric_set(roc_auc, accuracy),
  control   = ctrl_grid
)

# XGBoost -- tune_bayes
xgb_results <- tune_bayes(
  xgb_wf,
  resamples = cv_folds,
  initial = 5,
  iter = 25,
  metrics = metric_set(roc_auc, accuracy),
  control = ctrl_bayes
)

# LightGBM -- tune_bayes
lgbm_results <- tune_bayes(
  lgbm_wf,
  resamples = cv_folds,
  initial = 5,
  iter = 25,
  metrics = metric_set(roc_auc, accuracy),
  control = ctrl_bayes
)

Beyond evaluating each model individually, we also apply a technique called stacking. The idea behind stacking is that instead of selecting a single model as the final one, we combine the predictions of multiple models in a meta-model that learns which model performs best and in which situations. In practice, the models trained previously (Logistic Regression, KNN, Random Forest, XGBoost, LightGBM) feed their predictions into a second level, where a linear model with L1 regularization assigns weights to each and can exclude those that do not contribute. This way we leverage the strengths of each algorithm without committing to a single approach. We implement this technique using the stacks package, which integrates smoothly with the tidymodels ecosystem. We assemble two candidate sets: a first stack (v1) with the five core models, and a second (v2) that additionally includes Naive Bayes and the SVM.

Show the code
bank_stack_v1 <- stacks() %>%
  add_candidates(log_results)  %>%
  add_candidates(knn_results)  %>%
  add_candidates(rf_results)   %>%
  add_candidates(xgb_results)  %>%
  add_candidates(lgbm_results)

bank_stack_v2 <- bank_stack_v1 %>%
    add_candidates(nb_results)   %>%
    add_candidates(svm_results)

# Compute weights for each model
set.seed(123)
bank_stack_model_v1 <- bank_stack_v1 %>%
  blend_predictions(
    penalty = 10^(-2:0),       # L1 regularization for model selection
    metric  = metric_set(roc_auc)
  )

bank_stack_model_v2 <- bank_stack_v2 %>%
  blend_predictions(
    penalty = 10^(-2:0),
    metric  = metric_set(roc_auc)
  )

# Train the selected members
bank_stack_fit_v1 <- bank_stack_model_v1 %>%
  fit_members()

bank_stack_fit_v2 <- bank_stack_model_v2 %>%
  fit_members()

Note on the Stack Ensemble threshold

Unlike individual models (LightGBM, XGBoost, Logistic Regression) for which we can draw out-of-fold predictions from cross-validation, the Stack Ensemble does not have equivalent predictions. For this reason we created bank_val earlier: a subset of the training set that the model has not seen during training, used exclusively for threshold selection in the next section.

Threshold Selection

Each model produces for each customer a probability of interest, not a direct decision. To move from probability to classification (“Yes” / “No”) we need a threshold: if the probability exceeds it, the customer is classified as interested.

The default value of 0.5 is rarely the optimal choice for imbalanced data. In our case only 11% of customers belong to the “Yes” class. Instead, we select the threshold that maximizes the F1 score, which balances the ability to detect interested customers (recall) with the reliability of positive predictions (precision).
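For intuition, the F1 score is the harmonic mean of precision and recall; a small illustration (not part of the pipeline):

# A model must do reasonably well on both components to score high
f1 <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}
f1(0.5, 0.5)  # 0.50
f1(0.9, 0.1)  # 0.18 -- high precision cannot offset poor recall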

A common mistake is to find the threshold on the same test set used for final evaluation. This way the model indirectly “sees” the evaluation data, leading to optimistic results that do not reflect true performance. To avoid this we follow different approaches depending on the model:

  • For individual models (LightGBM, XGBoost, Logistic Regression) we use the out-of-fold (OOF) predictions from cross-validation. In each fold, predictions are made by a model that has not seen those observations, so the predictions are unbiased.
  • For the Stack Ensemble we use bank_val, which was set aside from the beginning exclusively for this purpose.
Show the code
# Helper function: finds threshold from OOF predictions
get_oof_threshold <- function(tune_results, model_name) {
  
  best_params <- select_best(tune_results, metric = "roc_auc")
  
  oof_preds <- tune_results %>%
    collect_predictions(
      summarize  = FALSE,
      parameters = best_params
    )
  
  # Dynamic upper bound: up to 95% of the maximum predicted probability
  # Avoids threshold values where no one is predicted as "Yes"
  upper_bound <- max(oof_preds$.pred_Yes, na.rm = TRUE) * 0.95
  
  oof_preds %>%
    threshold_perf(
      truth      = y,
      estimate   = .pred_Yes,
      thresholds = seq(0.005, upper_bound, length.out = 200),
      event_level = "second",
      metrics    = metric_set(f_meas)
    ) %>%
    filter(.metric == "f_meas", !is.na(.estimate)) %>%
    slice_max(.estimate, n = 1, with_ties = FALSE) %>%
    transmute(
      model     = model_name,
      threshold = .threshold,
      f1_oof    = round(.estimate, 3)
    )
}

thresholds_individual <- bind_rows(
  get_oof_threshold(lgbm_results, "LightGBM"),
  get_oof_threshold(xgb_results,  "XGBoost"),
  get_oof_threshold(log_results,  "Logistic Regression"),
  get_oof_threshold(rf_results,   "Random Forest"),
  get_oof_threshold(nb_results,   "Naive Bayes"),
  get_oof_threshold(svm_results,  "SVM")
)

# Predictions on bank_val -- the stack has not seen it
## For first stack
stack_val_preds_v1 <- predict(bank_stack_fit_v1,
                           bank_val,
                           type = "prob") %>%
  bind_cols(bank_val %>% select(y))

upper_bound_stack_v1 <- max(stack_val_preds_v1$.pred_Yes, 
                         na.rm = TRUE) * 0.95

## For second stack
stack_val_preds_v2 <- predict(bank_stack_fit_v2,
                           bank_val,
                           type = "prob") %>%
  bind_cols(bank_val %>% select(y))

upper_bound_stack_v2 <- max(stack_val_preds_v2$.pred_Yes, 
                         na.rm = TRUE) * 0.95

stack_threshold_v1 <- stack_val_preds_v1 %>%
  threshold_perf(
    truth      = y,
    estimate   = .pred_Yes,
    thresholds = seq(0.005, upper_bound_stack_v1, length.out = 200),
    event_level = "second",
    metrics    = metric_set(f_meas)
  ) %>%
  filter(.metric == "f_meas", !is.na(.estimate)) %>%
  slice_max(.estimate, n = 1, with_ties = FALSE) %>%
  transmute(
    model     = "Stack Ensemble (v1)",
    threshold = .threshold,
    f1_oof    = round(.estimate, 3)
  )

stack_threshold_v2 <- stack_val_preds_v2 %>%
  threshold_perf(
    truth      = y,
    estimate   = .pred_Yes,
    thresholds = seq(0.005, upper_bound_stack_v2, length.out = 200),
    event_level = "second",
    metrics    = metric_set(f_meas)
  ) %>%
  filter(.metric == "f_meas", !is.na(.estimate)) %>%
  slice_max(.estimate, n = 1, with_ties = FALSE) %>%
  transmute(
    model     = "Stack Ensemble (v2)",
    threshold = .threshold,
    f1_oof    = round(.estimate, 3)
  )

all_thresholds <- bind_rows(thresholds_individual,
                            stack_threshold_v1,
                            stack_threshold_v2)

all_thresholds %>%
  arrange(desc(f1_oof)) %>%
  rename(
    "Model"          = model,
    "Threshold"      = threshold,
    "F1 (estimate)"  = f1_oof
  ) %>%
  gt_custom(head_max = Inf, use_labels = FALSE)
Table 15: Optimal threshold per model
Model Threshold F1 (estimate)
Stack Ensemble (v2) 0.11061 0.417
Stack Ensemble (v1) 0.08791 0.408
Random Forest 0.23905 0.390
LightGBM 0.26002 0.382
XGBoost 0.38405 0.377
Naive Bayes 0.73631 0.331
Logistic Regression 0.55355 0.319
SVM 0.67053 0.312

It is worth noting that thresholds differ considerably across models. This does not mean that any model is “wrong”: each model simply calibrates its probabilities differently. What matters is that every threshold was found on data that the corresponding model had not used during training. Furthermore, the F1 values in the table above are estimates used for threshold selection and are not directly comparable across models. The individual models used OOF predictions (2712 observations) while the Stack Ensemble used bank_val (678 observations). The definitive comparison is made on the test set below.

Results

Show the code
# Helper function for generating predictions
get_preds <- function(fitted_wf, model_name) {
  predict(fitted_wf, bank_test, type = "prob") %>%
    bind_cols(predict(fitted_wf, bank_test)) %>%
    bind_cols(bank_test %>% select(y)) %>%
    mutate(model = model_name)
}

# Select best hyperparameters for each model
best_lgbm <- select_best(lgbm_results, metric = "roc_auc")
best_lr   <- select_best(log_results,  metric = "roc_auc")
best_xgb  <- select_best(xgb_results,  metric = "roc_auc")
best_rf   <- select_best(rf_results,   metric = "roc_auc")
best_nb   <- select_best(nb_results,   metric = "roc_auc")
best_svm  <- select_best(svm_results,  metric = "roc_auc")

# Finalize and train on full training set
final_lgbm_wf <- finalize_workflow(lgbm_wf, best_lgbm) %>% fit(data = bank_train)
final_lr_wf   <- finalize_workflow(log_wf,  best_lr)   %>% fit(data = bank_train)
final_xgb_wf  <- finalize_workflow(xgb_wf,  best_xgb)  %>% fit(data = bank_train)
final_rf_wf   <- finalize_workflow(rf_wf,   best_rf)   %>% fit(data = bank_train)
final_nb_wf   <- finalize_workflow(nb_wf,   best_nb)   %>% fit(data = bank_train)
final_svm_wf  <- finalize_workflow(svm_wf,  best_svm)  %>% fit(data = bank_train)

stack_preds_v1 <- predict(bank_stack_fit_v1, bank_test, type = "prob") %>%
  bind_cols(predict(bank_stack_fit_v1, bank_test)) %>%
  bind_cols(bank_test %>% select(y)) %>%
  mutate(model = "Stack Ensemble (v1)")

stack_preds_v2 <- predict(bank_stack_fit_v2, bank_test, type = "prob") %>%
  bind_cols(predict(bank_stack_fit_v2, bank_test)) %>%
  bind_cols(bank_test %>% select(y)) %>%
  mutate(model = "Stack Ensemble (v2)")

# Generate predictions on the test set for all models
all_preds <- bind_rows(
  get_preds(final_lgbm_wf, "LightGBM"),
  get_preds(final_lr_wf,   "Logistic Regression"),
  get_preds(final_xgb_wf,  "XGBoost"),
  get_preds(final_rf_wf,   "Random Forest"),
  get_preds(final_nb_wf,   "Naive Bayes"),
  get_preds(final_svm_wf,  "SVM"),
  stack_preds_v1,
  stack_preds_v2
)

Variable Importance

The variable importance analysis from the LightGBM model reveals that the most decisive factors for predicting interest in a term deposit are an unknown contact method (contact_Unknown, 15.6%) and the presence of a housing loan (housing_Yes, 15.4%), followed by secondary education (9.7%) and married marital status (marital_Married, 9.5%). Noteworthy is also the contribution of the contact month, with May appearing as the most important month (7.1%). These results largely align with the findings of the descriptive analysis, confirming that demographic characteristics such as marital status and educational background, along with operational parameters such as the contact method and timing, play a decisive role in a customer’s decision.
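One way a table like this can be produced is with the vip package, sketched below; it assumes final_lgbm_wf (the fitted LightGBM workflow from the Results section) and that vip supports the lightgbm engine:

# Extract and normalize variable importance from the fitted LightGBM model
final_lgbm_wf %>%
  extract_fit_parsnip() %>%
  vip::vi() %>%
  mutate(Importance = round(100 * Importance / sum(Importance), 2)) %>%
  slice_max(Importance, n = 10)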

Table 16: Variable importance in descending order
Variable Importance
Unknown Contact Method 15.56
Housing Loan 15.40
Secondary Education 9.72
Married 9.50
Month: May 7.08
Month: August 3.99
job_Management 3.84
pdays 3.74
balance 3.24
Occupation: Blue Collar 3.14
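For reference, a table like the one above can be reproduced from the fitted workflow. The sketch below assumes that extract_fit_engine() returns the raw lgb.Booster (as it does for the bonsai lightgbm engine) and that the reported importances are gain shares rescaled to percentages; the mapping of raw dummy names to friendly labels is omitted.

Show the code
# Sketch: read gain-based importances off the fitted LightGBM booster
booster <- extract_fit_engine(final_lgbm_wf)  # underlying lgb.Booster

lightgbm::lgb.importance(booster) %>%
  transmute(Variable   = Feature,
            Importance = round(100 * Gain, 2)) %>%  # gain share, as %
  slice_head(n = 10)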

Model Comparison

Before moving on to the final evaluation, it is worth graphically comparing the predictive ability of the four selected models using ROC (Receiver Operating Characteristic) curves. A ROC curve plots sensitivity against the false positive rate across all possible thresholds; the closer the curve lies to the upper-left corner, the better the model’s overall discriminative ability.
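As a numeric companion to the curves plotted below, the area under each ROC curve can be computed directly from the test-set predictions; the snippet assumes the all_preds frame built in the Results section above.

Show the code
# ROC AUC per model; "Yes" is the second factor level
all_preds %>%
  group_by(model) %>%
  roc_auc(truth = y, .pred_Yes, event_level = "second") %>%
  arrange(desc(.estimate))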

Show the code
model_order <- c("Logistic Regression", "XGBoost",
                 "Stack Ensemble (v1)", "Stack Ensemble (v2)")

roc_all <- all_preds %>%
  mutate(model = factor(model, levels = model_order)) %>%
  group_by(model) %>%
  roc_curve(truth = y, .pred_Yes, event_level = "second") %>%
  ungroup() %>%
  drop_na(model)

colors <- c(
  "Logistic Regression" = "#009E73",
  "XGBoost"             = "#D55E00",
  "Stack Ensemble (v1)" = "#CC79A7",
  "Stack Ensemble (v2)" = "#E69F00"
)

hchart(roc_all, type = "line",
       hcaes(x = 1 - specificity, y = sensitivity, group = model)) %>%
    hc_title(text = "ROC Curves") %>%
    hc_subtitle(text = "Comparison of Logistic Regression, XGBoost, and Stack Ensemble models") %>%
    hc_xAxis(title = list(text = "1 - Specificity (False Positive Rate)"),
             min = 0, max = 1) %>%
    hc_yAxis(title = list(text = "Sensitivity (True Positive Rate)"),
             min = 0, max = 1) %>%
    hc_colors(unname(colors)) %>%
    hc_tooltip(headerFormat = "",
               pointFormat = "<b>{series.name}</b><br>TPR: {point.y:.3f}<br>FPR: {point.x:.3f}") %>%
    hc_add_series(
        data = list(list(x = 0, y = 0), list(x = 1, y = 1)),
        type = "line",
        name = "Random model",
        color = "#888888",
        dashStyle = "Dash",
        lineWidth = 1.5,
        marker = list(enabled = FALSE),
        enableMouseTracking = FALSE
    )

The diagram clearly shows the superiority of the Stack Ensemble models over the other two. Because our classes are heavily imbalanced, we also complement the ROC curves with Precision-Recall curves, where the reference baseline is the positive-class proportion rather than the diagonal.

Show the code
pr_all <- all_preds %>%
    mutate(model = factor(model, levels = model_order)) %>%
    group_by(model) %>%
    pr_curve(truth = y, .pred_Yes, event_level = "second") %>%
    ungroup() %>%
    drop_na(model)

# Baseline = proportion of positives in the test set
baseline <- mean(bank_test$y == "Yes")

hchart(pr_all, type = "line",
       hcaes(x = recall, y = precision, group = model)) %>%
    hc_title(text = "Precision-Recall Curves") %>%
    hc_subtitle(text = "Model comparison on imbalanced data") %>%
    hc_xAxis(title = list(text = "Recall"),
             min = 0.05, max = 1) %>%
    hc_yAxis(
        title = list(text = "Precision"),
        min = 0, max = 0.6,
        plotLines = list(
            list(value = baseline, color = "#888", dashStyle = "Dash", width = 1,
                 label = list(text = paste0("Baseline: ", round(baseline, 3))))
        )
    ) %>%
    hc_colors(unname(colors)) %>%
    hc_tooltip(headerFormat = "",
               pointFormat = "<b>{series.name}</b><br>Recall: {point.x:.3f}<br>Precision: {point.y:.3f}")

Overall, the table that follows shows that accuracy alone is misleading. XGBoost achieves the highest accuracy (0.884), yet its F1 cannot even be computed: at its selected threshold it never predicts the positive class, so precision is undefined. This happens because accuracy primarily rewards correct negative predictions, and in our imbalanced dataset, predicting “No” for the vast majority is easy. F1 is the more honest criterion here, penalizing false positives and false negatives equally. Also worth noting is the sensitivity of Logistic Regression: it identifies 53% of the truly interested customers, a rate no other model matches, albeit with the lowest specificity and precision of the group.
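A quick sanity check makes the point concrete: a classifier that always answers “No” scores essentially the same accuracy that XGBoost reports in the table below.

Show the code
# Accuracy of the trivial "always No" baseline is just the share of
# negatives in the test set (~0.884, matching XGBoost's accuracy)
mean(bank_test$y == "No")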

Show the code
threshold_lookup <- setNames(
    all_thresholds$threshold,
    all_thresholds$model
)

all_preds_custom <- all_preds %>%
    mutate(
        thr = threshold_lookup[model],
        .pred_class = factor(
            ifelse(.pred_Yes >= thr, "Yes", "No"),
            levels = c("No", "Yes")
        )
    )

all_preds_custom %>%
  group_by(model) %>%
  metric_set(accuracy, f_meas, sens, spec, precision)(
    truth = y, estimate = .pred_class,
    event_level = "second"
  ) %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  mutate(across(where(is.numeric), ~ round(.x, 3))) %>%
  dplyr::arrange(-f_meas) %>%
  rename(
  "Model"                        = model,
  "Accuracy"                     = accuracy,
  "F1"                           = f_meas,
  "Sensitivity"                  = sens,
  "Specificity"                  = spec,
  "Positive Predictive Value"    = precision
) %>%
  gt_custom(head_max = Inf, use_labels = FALSE)
Model Accuracy F1 Sensitivity Specificity Positive Predictive Value
LightGBM 0.868 0.402 0.382 0.932 0.424
Stack Ensemble (v2) 0.851 0.396 0.420 0.908 0.374
Stack Ensemble (v1) 0.794 0.340 0.458 0.838 0.270
Random Forest 0.807 0.339 0.427 0.857 0.281
Logistic Regression 0.753 0.334 0.534 0.782 0.243
SVM 0.819 0.300 0.336 0.882 0.272
Naive Bayes 0.871 0.180 0.122 0.969 0.340
XGBoost 0.884 NA 0.000 1.000 NA

Conclusions

The second table translates the statistical figures into operational reality; the “success rate” per call is simply the positive predictive value of the previous table restated in campaign terms. LightGBM emerges here as the most efficient option: with only 118 calls it achieves a success rate of 42.4%, meaning roughly 4 in every 10 contacts result in a subscription. If the constrained resource is call center time, this model is the most respectful of it. The trade-off, however, is 81 missed subscribers: of the 131 actually interested customers, it identifies 50. Interestingly, of the two ensembles it is Stack Ensemble v2, a combination of seven individual models compared to the five in v1, that comes out ahead: it makes 75 fewer calls (147 versus 222) at a clearly higher success rate (37.4% versus 27.0%), giving up only five true positives.

Logistic Regression, on the other hand, identifies 70 subscribers, the most of any model, but needs 288 calls to do so, at a success rate of just 24.3%. That is almost two and a half times the calls of LightGBM to gain 20 additional true positives. Also notable is Naive Bayes, which with a mere 47 calls converts about a third of its contacts (34.0%), though it misses 115 of the 131 subscribers. Finally, the most striking result is XGBoost, which at its selected threshold made no positive predictions at all, missing every one of the 131 actual positive cases.

Show the code
all_preds_custom %>%
  group_by(model) %>%
  summarise(
    `True Positives (TP)`        = sum(.pred_class == "Yes" & y == "Yes"),
    `False Positives (FP)`       = sum(.pred_class == "Yes" & y == "No"),
    `Missed Subscribers (FN)`    = sum(.pred_class == "No"  & y == "Yes"),
    `Total Calls Made`           = sum(.pred_class == "Yes"),
    `Success Rate (%)`           = round(100 * `True Positives (TP)` / `Total Calls Made`, 1),
    .groups = "drop"
  ) %>%
  arrange(-`Success Rate (%)`) %>%
  rename("Model" = model) %>%
  gt_custom(head_max = Inf, use_labels = FALSE)
Model True Positives (TP) False Positives (FP) Missed Subscribers (FN) Total Calls Made Success Rate (%)
LightGBM 50 68 81 118 42.4
Stack Ensemble (v2) 55 92 76 147 37.4
Naive Bayes 16 31 115 47 34.0
Random Forest 56 143 75 199 28.1
SVM 44 118 87 162 27.2
Stack Ensemble (v1) 60 162 71 222 27.0
Logistic Regression 70 218 61 288 24.3
XGBoost 0 0 131 0 NaN

From the analysis above it becomes clear that there is no single “correct” or universally best model, but rather one that best fits our goal. If the bank’s objective is to maximize revenue by reaching as many interested customers as possible, Logistic Regression identifies the most subscribers (70 of 131), though at the cost of by far the most calls. If the goals are more conservative, for instance expanding services among existing customers while disturbing as few uninterested ones as possible, Naive Bayes becomes attractive, since it makes both the fewest contacts and the fewest incorrect ones (setting aside XGBoost, which makes no calls at all). If the goal is a balanced approach between the number of contacts required and the expected campaign results, the winner is clearly LightGBM, with Stack Ensemble v2 a close runner-up.

In conclusion, the ideal model is not determined solely by optimal parameters, but by the question being asked and the goals of the organization.

References

Bray, A., Ismay, C., Chasnovski, E., Couch, S., Baumer, B., & Cetinkaya-Rundel, M. (2025). Infer: Tidy statistical inference. Retrieved from https://github.com/tidymodels/infer
Couch, S. P., Bray, A. P., Ismay, C., Chasnovski, E., Baumer, B. S., & Çetinkaya-Rundel, M. (2021). infer: An R package for tidyverse-friendly statistical inference. Journal of Open Source Software, 6(65), 3661. https://doi.org/10.21105/joss.03661
Couch, S., Frick, H., Hvitfeldt, E., & Kuhn, M. (2025). Tailor: Iterative steps for postprocessing model predictions. Retrieved from https://github.com/tidymodels/tailor
Couch, S., & Kuhn, M. (2025). Stacks: Tidy model stacking. Retrieved from https://stacks.tidymodels.org/
Dua, D., & Graff, C. (2017). UCI machine learning repository. University of California, Irvine, School of Information; Computer Sciences. Retrieved from http://archive.ics.uci.edu/ml
Falbel, D., Damiani, A., Hogervorst, R. M., Kuhn, M., Couch, S., & Hvitfeldt, E. (2025). Bonsai: Model wrappers for tree-based models. Retrieved from https://bonsai.tidymodels.org/
Frick, H., Chow, F., Kuhn, M., Mahoney, M., Silge, J., & Wickham, H. (2025). Rsample: General resampling infrastructure. Retrieved from https://rsample.tidymodels.org
Frick, H., Kuhn, M., & Couch, S. (2025). Workflowsets: Create a collection of tidymodels workflows. Retrieved from https://github.com/tidymodels/workflowsets
Greenwell, B. M., & Boehmke, B. (2025). Vip: Variable importance plots. Retrieved from https://github.com/koalaverse/vip/
Greenwell, B. M., & Boehmke, B. C. (2020). Variable importance plots—an introduction to the vip package. The R Journal, 12(1), 343–366. Retrieved from https://doi.org/10.32614/RJ-2020-013
Hester, J., & Bryan, J. (2024). Glue: Interpreted string literals. Retrieved from https://glue.tidyverse.org/
Hvitfeldt, E. (2025). Themis: Extra recipes steps for dealing with unbalanced data. Retrieved from https://github.com/tidymodels/themis
Hvitfeldt, E., & Kuhn, M. (2025). Discrim: Model wrappers for discriminant analysis. Retrieved from https://github.com/tidymodels/discrim
Iannone, R., Cheng, J., Schloerke, B., Haughton, S., Hughes, E., Lauer, A., … Roy, O. (2025). Gt: Easily create presentation-ready display tables. Retrieved from https://gt.rstudio.com
Kabacoff, R. (2019). Data visualization with R. Retrieved from https://rkabacoff.github.io/datavis/
Karatzoglou, A., Smola, A., & Hornik, K. (2024). Kernlab: Kernel-based machine learning lab. https://doi.org/10.32614/CRAN.package.kernlab
Karatzoglou, A., Smola, A., Hornik, K., & Zeileis, A. (2004). Kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11(9), 1–20. https://doi.org/10.18637/jss.v011.i09
Kirenz, J. (2021). Classification with tidymodels, workflows and recipes. Retrieved November 22, 2022, from https://www.kirenz.com/post/2021-02-17-r-classification-tidymodels/#data-preparation
Kuhn, M. (2025a). Modeldata: Data sets useful for modeling examples. Retrieved from https://modeldata.tidymodels.org
Kuhn, M. (2025b). Tune: Tidy tuning tools. Retrieved from https://tune.tidymodels.org/
Kuhn, M., & Frick, H. (2025). Dials: Tools for creating tuning parameter values. Retrieved from https://dials.tidymodels.org
Kuhn, M., & Vaughan, D. (2026). Parsnip: A common API to modeling and analysis functions. Retrieved from https://github.com/tidymodels/parsnip
Kuhn, M., Vaughan, D., & Hvitfeldt, E. (2025). Yardstick: Tidy characterizations of model performance. Retrieved from https://github.com/tidymodels/yardstick
Kuhn, M., Vaughan, D., & Ruiz, E. (2025). Probably: Tools for post-processing predicted values. Retrieved from https://github.com/tidymodels/probably
Kuhn, M., & Wickham, H. (2020). Tidymodels: A collection of packages for modeling and machine learning using tidyverse principles. Retrieved from https://www.tidymodels.org
Kuhn, M., & Wickham, H. (2025). Tidymodels: Easily install and load the tidymodels packages. Retrieved from https://tidymodels.tidymodels.org
Kuhn, M., Wickham, H., & Hvitfeldt, E. (2025). Recipes: Preprocessing and feature engineering steps for modeling. Retrieved from https://github.com/tidymodels/recipes
Kunst, J. (2022). Highcharter: A wrapper for the highcharts library. Retrieved from https://jkunst.com/highcharter/
Lin, G. (2025). Reactable: Interactive data tables for R. Retrieved from https://glin.github.io/reactable/
Majka, M. (2024). Naivebayes: High performance implementation of the naive bayes algorithm. Retrieved from https://github.com/majkamichal/naivebayes
Moro, S., Cortez, P., & Rita, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62, 22–31.
R Core Team. (2025). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
R-Bloggers. (2017). 5 ways to measure running time of r code. Retrieved November 20, 2022, from https://www.r-bloggers.com/2017/05/5-ways-to-measure-running-time-of-r-code/
R-Bloggers. (2020). How to use lightgbm with tidymodels. Retrieved November 20, 2022, from https://www.r-bloggers.com/2020/08/how-to-use-lightgbm-with-tidymodels/
Robinson, D., Hayes, A., Couch, S., & Hvitfeldt, E. (2025). Broom: Convert statistical objects into tidy tibbles. Retrieved from https://broom.tidymodels.org/
Schliep, K., & Hechenbichler, K. (2025). Kknn: Weighted k-nearest neighbors. Retrieved from https://github.com/KlausVigo/kknn
Vaughan, D., Couch, S., & Frick, H. (2025). Workflows: Modeling workflows. Retrieved from https://github.com/tidymodels/workflows
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. Retrieved from https://ggplot2.tidyverse.org
Wickham, H. (2025). Forcats: Tools for working with categorical variables (factors). Retrieved from https://forcats.tidyverse.org/
Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., … van den Brand, T. (2025). ggplot2: Create elegant data visualisations using the grammar of graphics. Retrieved from https://ggplot2.tidyverse.org
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). Dplyr: A grammar of data manipulation. Retrieved from https://dplyr.tidyverse.org
Wickham, H., & Henry, L. (2026). Purrr: Functional programming tools. Retrieved from https://purrr.tidyverse.org/
Wickham, H., Hester, J., & Bryan, J. (2025). Readr: Read rectangular text data. Retrieved from https://readr.tidyverse.org
Wickham, H., Pedersen, T. L., & Seidel, D. (2025). Scales: Scale functions for visualization. Retrieved from https://scales.r-lib.org
Wickham, H., Vaughan, D., & Girlich, M. (2025). Tidyr: Tidy messy data. Retrieved from https://tidyr.tidyverse.org
Wright, M. N. (2026). Ranger: A fast implementation of random forests. Retrieved from https://imbs-hl.github.io/ranger/
Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1–17. https://doi.org/10.18637/jss.v077.i01
Zhu, H. (2024). kableExtra: Construct complex table with kable and pipe syntax. Retrieved from http://haozhu233.github.io/kableExtra/

Citation

BibTeX citation:
@online{stesiam2022,
  author = {stesiam},
  title = {Predict {Possible} {Interested} {Clients}},
  date = {2022-11-24},
  url = {https://stesiam.com/posts/predict-possible-clients/},
  langid = {en}
}
For attribution, please cite this work as:
stesiam. (2022, November 24). Predict Possible Interested Clients. Retrieved from https://stesiam.com/posts/predict-possible-clients/