Predict Possible Interested Clients

Build a classification machine learning model (using LightGBM & XGBoost) to predict whether a person is likely to be interested in opening a term deposit.

R
Classification
Tidymodels
Author

stesiam

Published

November 24, 2022

Introduction

In this article we build a machine learning model with the goal of predicting which customers of a bank are interested in opening a term deposit account. For this purpose we compare seven classification algorithms, with emphasis on Boosting models (XGBoost and LightGBM). The data come from the “UCI Machine Learning Repository” (Dua & Graff, 2017) and specifically from the Bank Marketing dataset (Moro, Cortez, & Rita, 2014).

Before the analysis, let us define some basic concepts.

What is a term deposit?

It is a type of bank account where the customer commits to not making withdrawals for a predetermined period of time (e.g., one year). In return, the bank offers higher interest rates compared to regular savings or current accounts.

For reference:

  • Bank of Piraeus offers double the interest rate on its term deposit accounts.
  • Eurobank offers zero interest on current accounts, 0.01%-0.35% on savings accounts, and 0.1%-1% on term deposits, depending on the program and deposit amount.
  • According to a recent report by the Bank of Greece, term deposit interest rates range between 1.2% and 1.4%, compared to just 0.03% for typical household accounts.

The appropriate profile for these products generally involves individuals with significant savings balances and without heavy financial obligations (loans, overdue debts).

Prerequisites

Loading Libraries

For this analysis we will need standard R libraries for importing data, via the readr package (Wickham, Hester, & Bryan, 2025), and formatting it with the dplyr package (Wickham, François, Henry, Müller, & Vaughan, 2023). The kableExtra package (Zhu, 2024) is a useful addition for printing results in table format. An important part of the workflow is data visualization. I initially used the ggplot2 package to create charts, which is limiting for a website since ggplot2 produces static graphics. So for my articles I use the highcharter package (Kunst, 2022), which enables responsive charts that work well on all screen sizes. Finally, this analysis aims to classify bank customers based on their interest in a banking product, so the tidymodels package (Kuhn & Wickham, 2025) is essential, as we will need a classification model.

Load libraries
# General data processing
library(readr)
library(dplyr)
library(forcats)
library(tidyr)
library(glue)

# Results presentation
library(kableExtra)
library(reactable)
library(gt)

# Interactive charts
library(highcharter)

# Machine learning models
library(tidymodels)
library(bonsai)    # LightGBM via tidymodels
library(themis)    # SMOTE for imbalanced data
library(stacks)    # Ensemble stacking
library(probably)
library(discrim)

# Individual algorithms
library(kknn)
library(ranger)
library(naivebayes)
library(kernlab)
library(vip)       # Variable importance

Importing Data

After loading the necessary libraries, we need to import our data. There are several versions of the same dataset, a larger and a more compact one, differing only in the number of observations. For this article I will choose the more compact version since fitting Boosting models is particularly time-consuming compared to simpler classification models (e.g., Logistic Regression, k-Nearest Neighbors).

Show the code
bank_dataset <- read_delim("bank_dataset_files/bank.csv",  
                           delim = ";", 
                           escape_double = FALSE, 
                           trim_ws = TRUE)

bank_dataset <- bank_dataset %>% tibble::rowid_to_column("ID")

Data Preview

Below we present a small sample of the dataset (the first 6 observations) to understand its structure and variable types.

Show the code
preview_bank_dataset = head(bank_dataset, 6)
reactable(
  preview_bank_dataset,
  bordered = FALSE,
  striped = FALSE,
  highlight = FALSE,
  defaultColDef = colDef(align = "center"),
  style = list(fontSize = "14px", border = "none"),
  theme = reactableTheme(
    borderColor = "transparent",
    cellStyle = list(
      borderBottom = "transparent"
    ),
    headerStyle = list(
      borderBottom = "2px solid rgb(117, 117, 117)",
      borderTop = "2px solid rgb(117, 117, 117)",
      fontWeight = "bold"
    )
  )
)
Table 1: Data preview (first 6 observations)

Before performing any analysis it is good to identify the type of data we have available. For the most part we can learn this by looking at the values a variable takes. In general, variables can be classified based on the values they take as follows:

graph TD;
  A(Variable Types) --> B(Quantitative)
  A(Variable Types) --> C(Qualitative)
  B --> D(Discrete)
  B --> E(Continuous)
  C --> J(Nominal)
  C --> G(Ordinal)
Figure 1: Variable classification based on their values
Table 2: Variable summary

| Variable | Type | Description |
|----------|------|-------------|
| Age | quantitative (continuous) | Individual's age |
| Job | qualitative (nominal) | Employment sector |
| Marital | qualitative (nominal) | Marital status |
| Education | qualitative (ordinal) | Highest level of education |
| Default | qualitative (nominal) | Has credit in default? |
| Balance | quantitative (continuous) | Average annual account balance (in euros) |
| Housing | qualitative (nominal) | Has a housing loan? |
| Loan | qualitative (nominal) | Has a personal loan? |
| Contact | qualitative (nominal) | Contact method |
| Day | quantitative | Day of the month of the most recent contact |
| Month | qualitative (ordinal) | Month of most recent contact |
| Duration | quantitative (continuous) | Duration (in seconds) of last call |
| Campaign | quantitative | Number of contacts during this campaign |
| pdays | quantitative | Days since last contact from a previous campaign |
| previous | quantitative | Number of contacts before this campaign |
| poutcome | qualitative (nominal) | Outcome of previous marketing campaign |
| Deposit (y) | qualitative (nominal) | Did the client subscribe to a term deposit? |

The dataset consists of 18 columns: the ID we added plus 17 variables, of which 7 are quantitative and the remaining 10 are qualitative. Of the qualitative variables, 8 are nominal and only two are ordinal (month of contact and education level).
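These types can be verified directly in R before moving on; a quick check on the imported tibble (character columns are the qualitative variables before any factor conversion):

# Column types at a glance
dplyr::glimpse(bank_dataset)

# Count columns by their primary class
table(sapply(bank_dataset, function(x) class(x)[1]))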

Defining Functions

Ok, we have seen some basic characteristics of the data and their structure. Can I now start my analysis?

It depends. If we want a quick analysis to extract a specific result, it is probably fine. However, most of the time a more careful study design is required. A common mistake I have made in the past is repeating the same procedures over and over. To avoid writing the same code multiple times, reusable functions are essential. For example, in the previous section we observed the presence of variables of the same type, both qualitative and quantitative. Why write similar code 10 times when we can encapsulate the core logic into a single function?

Consequently, we define a few helper functions: gt_custom, which standardizes table output; univariateQualitativePlot, which creates pie charts and bar charts for the qualitative variables; and univariateQualitative, which returns the underlying frequency table.

Show the code
gt_custom <- function(data, head_max = 5, use_labels = TRUE) {
    
    data_subset <- as.data.frame(data) %>% head(head_max)
    
    gt_tbl <- data_subset %>%
        gt::gt(groupname_col = "model") %>%
        cols_align(align = "center", columns = everything()) %>%
        tab_style(
            style = cell_text(
                v_align = "middle",
                weight = "bold",
                whitespace = "normal"
            ),
            locations = cells_column_labels()
        ) %>%
        tab_options(
            row.striping.include_table_body = FALSE,
            table.background.color = "transparent",
            column_labels.background.color = "transparent",
            table.width = pct(100), 
            table.font.size = px(13),           
            data_row.padding = px(15), 
            column_labels.padding = px(15),
            table.border.top.style = "none",
            table.border.bottom.style = "none",
            table_body.hlines.style = "none", 
            column_labels.border.top.width = px(2),
            column_labels.border.top.color = "#757575",
            column_labels.border.bottom.width = px(2),
            column_labels.border.bottom.color = "#757575",
            table_body.border.bottom.width = px(2),
            table_body.border.bottom.color = "#757575"
        ) %>%
        opt_css(css = "
      .gt_table th, .gt_table td { 
        text-align: center !important; 
        vertical-align: middle !important;
      }
      .gt_table { 
        margin-left: auto !important; 
        margin-right: auto !important; 
        width: 100% !important;
      }
      .gt_table tr { background-color: transparent !important; }
      .gt_row { background-color: transparent !important; }
    ")
    
    if (use_labels) {
        gt_tbl <- gt_tbl %>%
            cols_label_with(
                fn = function(x) {
                    html(paste0(
                        "<b>", col_labels[x], "</b>",
                        "<br><span style='font-size:10px; color:#777; font-style:italic;'>",
                        x,
                        "</span>"
                    ))
                }
            )
    }
    
    return(gt_tbl)
}
Show the code
univariateQualitativePlot <- function(data, column, title, subtitle, chart_type = "bar") {

  freq_table <- data %>%
    count({{ column }}, name = "Frequency") %>%
    arrange(desc(Frequency)) %>%
    rename(Variable = {{ column }}) %>%
    mutate(pct = round((Frequency / sum(Frequency) * 100), digits = 1))

  hc <- highchart() %>%
    hc_title(text = title) %>%
    hc_subtitle(text = subtitle) %>%
    hc_tooltip(pointFormat = "{point.name}: {point.y}") %>%
    hc_legend(enabled = chart_type == "pie")

  hc <- if (chart_type == "bar") {
    hc %>%
      hc_chart(type = "bar") %>%
      hc_xAxis(categories = freq_table$Variable, title = list(text = "Category")) %>%
      hc_yAxis(title = list(text = "Frequency")) %>%
      hc_series(list(name = "Frequency", data = freq_table$Frequency))
  } else if (chart_type == "pie") {
    pie_data <- lapply(1:nrow(freq_table), function(i) {
      list(name = freq_table$Variable[i], y = freq_table$Frequency[i])
    })
    hc %>%
      hc_chart(type = "pie") %>%
      hc_series(list(name = "Frequency", data = pie_data))
  } else {
    stop("Invalid chart type. Use 'bar' or 'pie'.")
  }

  return(hc)
}

univariateQualitative <- function(data, column) {

  freq_table <- data %>%
    count({{ column }}, name = "Frequency") %>%
    arrange(desc(Frequency)) %>%
    rename(Variable = {{ column }}) %>%
    mutate(pct = round((Frequency / sum(Frequency) * 100), digits = 1))

  return(freq_table)
}

Similarly, for the quantitative variables we will build histograms. In the sections below these are created directly with highcharter's hchart(), though the same logic could be wrapped in a univariateQuantitativePlot helper analogous to the one above. In both cases the charts are interactive, courtesy of the highcharter package.
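A minimal sketch of what such a helper might look like, assuming hchart() is applied to the numeric column exactly as in the histograms below:

univariateQuantitativePlot <- function(data, column, title = "", subtitle = "") {
  values <- dplyr::pull(data, {{ column }})
  hchart(values) %>%   # hchart() on a numeric vector draws a histogram
    hc_title(text = title) %>%
    hc_subtitle(text = subtitle) %>%
    hc_legend(enabled = FALSE)
}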

Descriptive Analysis

Missing Values

Show the code
how_many_nas = sum(is.na(bank_dataset))

The dataset contains a total of 0 missing values. This is of course a rare and ideal situation. Otherwise, we would need to impute the missing values using some estimation method.
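Had the count been non-zero, a per-column breakdown would show where imputation is needed; for completeness:

# Missing values per column (all zeros here)
colSums(is.na(bank_dataset))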

Univariate Analysis

Next it is important to study our variables, their values and their distributions. This is a critical step to understand the sample and to take additional context into account when building the model.

Regarding the employment sector, the sample shows significant participation from individuals with jobs likely associated with higher education and consequently higher earnings, such as management executives, administrative employees, and entrepreneurs. About 21% of the bank’s customers work in blue-collar jobs (close to 40% if technicians are included), which are most often associated with a reduced willingness to commit capital. Finally, in the bank’s customer base there is a share of around 10% from population groups who, for various reasons, would not benefit from creating a term deposit account: the unemployed, students who are in a financially vulnerable period with high expenses and limited income, and retirees who may need to cover unexpected healthcare costs. For retirees specifically, who make up 5% of customers, there are other financial instruments available for securing their retirement, such as reverse mortgages.

Show the code
univariateQualitativePlot(bank_dataset, job, 
                          title = "Employment Sector", 
                          subtitle = "Absolute count and percentage (%) of all customers")
Figure 2: Bar chart of employment sectors of bank customers
Table 3: Employment sector of bank customers
Occupation Frequency Percentage
Management 969 21.4
Blue Collar 946 20.9
Technician 768 17.0
Administrative 478 10.6
Services 417 9.2
Retired 230 5.1
Self-Employed 183 4.0
Entrepreneur 168 3.7
Unemployed 128 2.8
Housemaid 112 2.5
Student 84 1.9
Unknown 38 0.8

Another available piece of information is marital status, which may be associated with increased household needs and expenses. A married customer may have higher household expenses (e.g., due to children). Alternatively, married individuals may also have greater financial stability. In general, the interpretation of this indicator is not clear-cut a priori. In any case, in the sample under examination about 60% of individuals are married, one quarter are single, and the rest are divorced.

Show the code
univariateQualitativePlot(bank_dataset, marital, 
                          title = "What is your marital status?", 
                          subtitle = "The majority are married.",
                          "pie")
Figure 3: Pie chart of marital status of bank customers
Table 4: Marital status of bank customers
Marital Status Frequency Percentage
Married 2797 61.9
Single 1196 26.5
Divorced 528 11.7

A variable that, at least intuitively, can be among the most important is the customer’s highest level of education. It stands to reason that someone with higher-level education can work in specialized jobs that are rewarded accordingly, given the reduced supply of such skills. This indirectly determines professional opportunities and consequently the salary one can command and the amount left over for savings. In this particular dataset only 30% have university-level education.

Show the code
univariateQualitativePlot(bank_dataset, 
                          education, 
                          title = "Highest level of education",
                          subtitle = "")
Figure 4: Bar chart of educational background of bank customers
Table 5: Education level
Education Frequency Percentage
Secondary 2306 51.0
Tertiary 1350 29.9
Primary 678 15.0
Unknown 187 4.1

Beyond education, which is a strong indicator, the customer’s financial obligations also play an important role. These can be examined through three variables:

  • whether the customer has credit in default
  • whether the customer has a housing loan
  • whether the customer has a personal loan

It is obvious that a customer with outstanding debts is unlikely to consider saving rather than paying off what they owe. In the sample only 76 individuals, corresponding to 1.7% of customers, fall into this category, which is encouraging for the vast majority.

Show the code
univariateQualitativePlot(bank_dataset, default, 
                          title = "Do you have any credit in default?", 
                          subtitle = "Percentage (%) of individuals who have not fulfilled their credit obligations",  
                          "pie")
Figure 5: Pie chart of customers by credit default status
Table 6: How many have outstanding debts?
Default Frequency Percentage
No 4445 98.3
Yes 76 1.7

Furthermore, a significant share of the bank’s customers already carry substantial obligations, both short-term and long-term. More than half have taken out a housing loan, a fixed, long-term commitment that is difficult to walk away from, since it is typically secured against the property itself. On the other hand, one could argue that having a housing loan means the household has already budgeted for housing costs rather than paying rent. What may be the more important barrier is the presence of a personal or consumer loan. These loans are typically taken out to cover short-term unexpected needs and are known for their high interest rates, making them expensive. Paying them off first is the sensible priority, and opening a term deposit is not a logical step in that context. In our case about 15% have a personal loan, which is mildly encouraging since it leaves 85% without this particular obstacle.

Show the code
univariateQualitative(bank_dataset, housing) %>%
    rename("Housing Loan?" = Variable,
           "Frequency"     = Frequency,
           "Percentage"    = pct) %>%
    gt_custom(head_max = Inf, use_labels = FALSE)
univariateQualitative(bank_dataset, loan) %>%
    rename("Personal Loan?" = Variable,
           "Frequency"      = Frequency,
           "Percentage"     = pct) %>%
    gt_custom(head_max = Inf, use_labels = FALSE)
Table 7: Types of loans held by bank customers
(a) Housing loan
Housing Loan? Frequency Percentage
Yes 2559 56.6
No 1962 43.4
(b) Personal loan
Personal Loan? Frequency Percentage
No 3830 84.7
Yes 691 15.3

Another piece of information available is the contact method used to reach customers. More than half (64%) have indicated a mobile phone as their preferred contact method.

Show the code
univariateQualitativePlot(bank_dataset, contact, 
                          title = "Contact method", 
                          subtitle = "Mobile contact is more widespread than landline",
                          "pie")
Figure 6: Pie chart of contact methods used to reach customers
Table 8: Contact method
Contact Method Frequency Percentage
Mobile 2896 64.1
Unknown 1324 29.3
Landline 301 6.7

Another interesting variable is the last month in which a customer was contacted. Most recent contacts appear to have occurred during the summer months. This detail requires careful interpretation, however, as the data may have been collected in September, which would naturally make summer months the most recent for most customers. A bivariate analysis in the next section, looking at successful contacts by month, will help examine this more precisely.

Show the code
univariateQualitativePlot(bank_dataset, month, 
                          title = "Contacts by month", 
                          subtitle = "(#) Absolute number of contacts per month")
Figure 7: Bar chart of contacts by month

According to the data and its description, the bank had previously run similar campaigns promoting term deposit accounts. As a result, we have records of customers who subscribed in previous campaigns, which may be useful since a notable share could be interested in re-subscribing. The previous campaigns resulted in 129 customers having subscribed, while the status of many others is unknown.

Show the code
univariateQualitativePlot(bank_dataset, poutcome, 
                          title = "What was the outcome of the previous contact?", 
                          subtitle = "For every five unsuccessful contacts (failure or other), there is one successful one", "pie")
Figure 8: Pie chart of customer responses in previous term deposit campaigns

Here we have the outcome we are trying to predict: the customer’s response to the current campaign. Based on these outcomes and the previous variables, we will build our prediction models, and the models will be evaluated against these observed responses.

Show the code
univariateQualitativePlot(bank_dataset, y, 
                          title = "How many decided to open a term deposit?", 
                          subtitle = "Percentage of customers who decided to open a term deposit as a result of the current campaign.",
                          "pie")
Figure 9: Pie chart of customer responses to the current term deposit campaign

The bank’s customers are predominantly younger, with the vast majority under 60. The histogram appears to have a bell shape resembling a normal distribution, but with a slight positive skew.

Show the code
hchart(bank_dataset$age) %>%
    hc_title(text = "Age Distribution") %>%
    hc_subtitle(text = glue("The median age is <b>{median(bank_dataset$age)}</b> years.")) %>%
    hc_caption(text = "Bank Marketing Dataset from UCI") %>%
    hc_tooltip(pointFormat = "{point.name}: {point.y}") %>%
    hc_legend(enabled = FALSE)
Figure 10: Histogram of age distribution of bank customers
Show the code
hchart(bank_dataset$balance) %>%
    hc_title(text = "Balance Distribution") %>%
    hc_subtitle(text = glue("Most balances range between €0 and €200. The median account balance is {median(bank_dataset$balance)}. The 75th percentile is €1,480, meaning one quarter of customers have balances above this amount. Balances range from {min(bank_dataset$balance)} to {max(bank_dataset$balance)}.")) %>%
    hc_caption(text = "Bank Marketing Dataset from UCI") %>%
    hc_tooltip(pointFormat = "{point.name}: {point.y}") %>%
    hc_legend(enabled = FALSE) %>%
    hc_xAxis(
      title = list(text = "Account balance in EUR"),
      max = 10000
    )
Figure 11: Histogram of account balance distribution of bank customers

Another variable of interest is the duration of the last call. Intuitively, a very short call suggests the customer is not interested, while a longer interaction between the agent and the customer likely signals greater interest in the banking product.

Show the code
hchart(bank_dataset$duration) %>%
    hc_title(text = "Call Duration") %>%
    hc_subtitle(text = glue("The average call duration is 4.4 minutes, while the median is 3 minutes.")) %>%
    hc_caption(text = "Bank Marketing Dataset from UCI") %>%
    hc_tooltip(pointFormat = "{point.name}: {point.y}") %>%
    hc_legend(enabled = FALSE) %>%
    hc_xAxis(
      title = list(text = "Call duration (in seconds)"),
      max = 1000)
Figure 12: Histogram of total call duration (in seconds)
Show the code
s <- bank_dataset %>%
  mutate(category = case_when(
    campaign >= 1 & campaign <= 7 ~ as.character(campaign),
    campaign > 7                  ~ "8+",
    TRUE                          ~ NA_character_
  )) %>%
  mutate(category = factor(category, levels = c(as.character(1:7), "8+"), ordered = TRUE))
Show the code
hchart(s$category) %>%
    hc_title(text = "Number of Contacts") %>%
    hc_subtitle(text = glue("How many times was the same customer contacted")) %>%
    hc_caption(text = "Bank Marketing Dataset from UCI") %>%
    hc_legend(enabled = FALSE) %>%
    hc_xAxis(
      title = list(text = "Number of contacts")
    )
Figure 13: Bar chart of total number of contacts made with each customer
Table 9: Number of contacts per customer
Contacts (#) Frequency
1 1734
2 1264
3 558
4 325
5 167
6 155
7 75
8+ 243

Bivariate Analysis

In the previous sub-section we examined some basic descriptive statistics per variable, giving us a sense of the bank’s customer base. Looking at these figures individually may not be sufficient to draw meaningful conclusions, and a bivariate analysis (comparing two variables at a time) becomes necessary. One approach would be to investigate the correlation between quantitative variables and the association between qualitative ones. In this case, it is particularly useful to compare the previous variables against the response variable, namely the customer’s interest in opening a term deposit.
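Besides the charts that follow, such relationships can also be quantified formally. As a quick sketch, a chi-squared test of independence for a qualitative pair and a correlation matrix for the quantitative variables:

# Is employment sector associated with the outcome?
chisq.test(table(bank_dataset$job, bank_dataset$y))

# Pairwise correlations among the quantitative variables
bank_dataset %>%
  select(age, balance, day, duration, campaign, pdays, previous) %>%
  cor() %>%
  round(2)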

An important comparison is the customer’s employment sector against their final decision. The chart below shows that blue-collar workers have the lowest positive response rate, while retirees have the highest.

Show the code
plotjob <- bank_dataset %>%
  group_by(job, y) %>%
  summarize(n = n(), .groups = "drop_last") %>% 
  mutate(pct = round(100*(n/sum(n)), digits = 1),
         lbl = scales::percent(pct / 100)) %>%  # pct is on a 0-100 scale
  arrange(desc(pct))  # all "No" rows come first, ordered by descending share

highchart() %>%
  hc_chart(type = "column") %>%
  hc_xAxis(categories = plotjob$job) %>%
  hc_yAxis(
    min = 0,
    max = 100,
    title = list(text = "Percentage"),
    labels = list(format = "{value}%")
  ) %>%
  hc_plotOptions(
    column = list(
      stacking = "percent",
      dataLabels = list(
        enabled = TRUE,
        format = "{point.y:.0f}%"
      )
    )
  ) %>%
  hc_series(
    # Jobs are ordered by descending "No" share; because each pair of
    # percentages sums to 100, sorting the "Yes" shares ascending keeps
    # both series aligned with the same categories
    list(name = "No",  data = plotjob$pct[plotjob$y == "No"], color = "#AA5733"),
    list(name = "Yes", data = sort(plotjob$pct[plotjob$y == "Yes"]), color = "#33bb57")
  ) %>%
  hc_tooltip(pointFormat = "<b>{series.name}</b>: {point.y}%") %>%
  hc_title(text = "Current Campaign Responses by Occupation") %>%
  hc_subtitle(text = "Students and retirees are the groups with the highest proportional subscription rates.") %>%
  hc_responsive(
    rules = list(
      list(
        condition = list(maxWidth = 500),
        chartOptions = list(
          plotOptions = list(
            column = list(dataLabels = list(enabled = FALSE))
          )
        )
      )
    )
)
Figure 14: Bar chart of campaign outcomes by occupation
Table 10: Campaign success rates by occupation
Occupation No Yes
Blue Collar 92.7 7.3
Entrepreneur 91.1 8.9
Services 90.9 9.1
Unemployed 89.8 10.2
Technician 89.2 10.8
Self-Employed 89.1 10.9
Administrative 87.9 12.1
Housemaid 87.5 12.5
Management 86.5 13.5
Unknown 81.6 18.4
Student 77.4 22.6
Retired 76.5 23.5

These results may have been partially expected when considering the likely financial situation of each group. The case of retirees is particularly interesting: while we initially assumed they would not be an ideal target due to potential unexpected expenses, the data show that they respond positively at a higher rate. This can be explained by the fact that many retirees have a stable income with few new financial obligations, making them suitable candidates for committing capital.

There are also variables where the answer is less obvious, such as marital status. The chart shows that individuals without a spouse (single or divorced) subscribe proportionally more than married individuals.

Show the code
plotmarital <- bank_dataset %>%
  group_by(marital, y) %>%
  summarize(n = n(), .groups = "drop_last") %>% 
  mutate(pct = round(100*(n/sum(n)), digits = 1),
         lbl = scales::percent(pct / 100)) %>%  # pct is on a 0-100 scale
  arrange(desc(pct))  # all "No" rows first, keeping the series aligned below

highchart() %>%
  hc_chart(type = "column") %>%
  hc_xAxis(categories = plotmarital$marital) %>%
  hc_yAxis(
    min = 0,
    max = 100,
    title = list(text = "Percentage"),
    labels = list(format = "{value}%")
  ) %>%
  hc_plotOptions(
    column = list(
      stacking = "percent",
      dataLabels = list(
        enabled = TRUE,
        format = "{point.y:.0f}%"
      )
    )
  ) %>%
  hc_series(
    # Same alignment trick as before: the ascending "Yes" shares match the
    # descending "No" ordering of the categories
    list(name = "No",  data = plotmarital$pct[plotmarital$y == "No"], color = "#AA5733"),
    list(name = "Yes", data = sort(plotmarital$pct[plotmarital$y == "Yes"]), color = "#33bb57")
  ) %>%
  hc_tooltip(pointFormat = "<b>{series.name}</b>: {point.y}%") %>%
  hc_title(text = "Responses by Marital Status") %>%
  hc_subtitle(text = "Single and divorced individuals show proportionally greater interest in term deposits compared to married customers.") %>%
  hc_responsive(
    rules = list(
      list(
        condition = list(maxWidth = 500),
        chartOptions = list(
          plotOptions = list(
            column = list(dataLabels = list(enabled = FALSE))
          )
        )
      )
    )
)
Figure 15: Bar chart of campaign outcomes by marital status

A summary of the above can also be visualised with the following Sankey diagram. The first column shows marital status, whose flows combine with educational background in the second column, and we end at the third column with the customer’s final response (whether they are interested in opening a term deposit).

Show the code
custom_df = 
  tibble(
    r = bank_dataset$education,
    t = bank_dataset$marital,
    m = bank_dataset$y
  )

df_sankey = custom_df %>%
  group_by(r, t, m) %>%
  summarise(weight = n(), .groups = "drop")

links2 = df_sankey %>%
  group_by(t, r) %>%
  summarise(weight = sum(weight), .groups = "drop") %>%
  rename(from = t, to = r)

links3 = df_sankey %>%
  group_by(r, m) %>%
  summarise(weight = sum(weight), .groups = "drop") %>%
  rename(from = r, to = m)

links_all <- bind_rows(links2, links3)

highchart() %>%
  hc_chart(type = "sankey") %>%
  hc_title(text = "Marital Status -> Education -> Outcome") %>%
  hc_subtitle(text = "Flow of customers by combination of demographic characteristics and final decision") %>%
  hc_add_series(
    keys = c("from", "to", "weight"),
    data = list_parse(links_all),
    name = "Flow"
  ) %>%
  hc_tooltip(pointFormat = "{point.from} -> {point.to}: <b>{point.weight}</b>")

I also wanted to compare the difference between previous and current campaign outcomes, to understand where subscribers come from. Did customers re-subscribe, or did the bank succeed in expanding its reach to new clients? The bank retains significant loyalty from previous subscribers, with 64% of those who had previously agreed to open a term deposit doing so again in the new campaign. The key figure in this bivariate analysis is the rate at which those who previously declined were converted. This rate approaches 13%, which is a reasonably satisfactory result given that they had previously refused.
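These rates can be checked with a simple cross-tabulation before drawing the diagram:

# Share of current-campaign subscriptions within each previous outcome
bank_dataset %>%
  count(poutcome, y) %>%
  group_by(poutcome) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  ungroup()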

Show the code
custom_df = 
  tibble(
    t = bank_dataset$poutcome,
    l = bank_dataset$y
  )

df_sankey = custom_df %>%
  group_by(t, l) %>%
  summarise(weight = n(), .groups = "drop")

links3 = df_sankey %>%
  group_by(t, l) %>%
  summarise(weight = sum(weight), .groups = "drop") %>%
  rename(from = t, to = l)

links_all <- bind_rows(links3)

highchart() %>%
  hc_chart(type = "sankey") %>%
  hc_title(text = "Sankey Diagram: Previous vs Current Campaign") %>%
  hc_subtitle(text = "The diagram highlights the improved results of the current campaign (second column) compared to previous ones. Approximately 12% (63 individuals) who previously declined were persuaded to subscribe in the current campaign.") %>%
  hc_add_series(
    keys = c("from", "to", "weight"),
    data = list_parse(links_all),
    name = "Flow"
  ) %>%
  hc_tooltip(pointFormat = "{point.from} -> {point.to}: <b>{point.weight}</b>")

Building the Model

In R there are two widely used frameworks for building models: caret and tidymodels. On one hand, the caret package is fairly easy to use and has a large body of guides and tutorials. On the other hand, tidymodels is an all-in-one solution, a meta-package (a collection of packages) that provides a comprehensive workflow, though with somewhat less documentation since it is more recent.

Splitting the Dataset

The first step is to split the original dataset. In this analysis we use a three-way split:

  • Training set (bank_train): used for training all models and cross-validation.
  • Validation set (bank_val): a small portion of the training set held aside exclusively for selecting the optimal classification threshold for the Stack Ensemble.
  • Test set (bank_test): used only for the final evaluation, and not touched at any other stage.
Show the code
set.seed(123)

# Main split: 75% train, 25% test
bank_dataset_split <- initial_split(bank_dataset,
                                    prop   = 0.75,
                                    strata = y)
bank_trainval <- training(bank_dataset_split)
bank_test     <- testing(bank_dataset_split)

# Secondary split: from trainval we produce
# bank_train (80%) and bank_val (20%)
set.seed(123)
trainval_split <- initial_split(bank_trainval,
                                prop   = 0.80,
                                strata = y)
bank_train <- training(trainval_split)
bank_val   <- testing(trainval_split)
graph TD;
  A(Full Dataset <br> 4521 observations) --> B(bank_trainval <br> 3390 observations)
  A --> C(bank_test <br> 1131 observations)
  B --> D(bank_train <br> 2712 observations)
  B --> E(bank_val <br> 678 observations)
  D --> F(Model Training)
  D --> G(Cross-validation)
  E --> H(Stack Ensemble Threshold)
  C --> I(Final Evaluation)
Figure 16: Three-way data split

The bank_val set contains 678 observations, enough to give a reliable threshold estimate without removing a significant portion of the training data.

As mentioned, we have 4521 observations in total. The most common split is 75% (or 80%) for training and 25% (or 20%) for evaluation. We therefore end up with two subsets with the same number of variables: bank_trainval with 3390 observations, which is further divided as shown above, and the test set with 1131.

Training Data

Show the code
bank_train %>%
  gt_custom(.)
Table 11: Preview of the model training subset (train dataset)
ID Age Job Marital Education Default Balance Housing Loan Contact Day Month Duration Campaign Pdays Previous Poutcome y
1 30 Unemployed Married Primary No 1787 No No Mobile 19 October 79 1 -1 0 Unknown No
2 33 Services Married Secondary No 4789 Yes Yes Mobile 11 May 220 1 339 4 Failure No
5 59 Blue Collar Married Secondary No 0 Yes No Unknown 5 May 226 1 -1 0 Unknown No
7 36 Self-Employed Married Tertiary No 307 Yes No Mobile 14 May 341 1 330 2 Other No
9 41 Entrepreneur Married Tertiary No 221 Yes No Unknown 14 May 57 2 -1 0 Unknown No

Evaluation Data

Show the code
bank_test %>%
  gt_custom(.)
Table 12: Preview of the model evaluation subset (test dataset)
ID Age Job Marital Education Default Balance Housing Loan Contact Day Month Duration Campaign Pdays Previous Poutcome y
3 35 Management Single Tertiary No 1350 Yes No Mobile 16 April 185 1 330 1 Failure No
6 35 Management Single Tertiary No 747 No No Mobile 23 February 141 2 176 3 Failure No
15 31 Blue Collar Married Secondary No 360 Yes Yes Mobile 29 January 89 1 241 1 Failure No
23 44 Services Single Secondary No 106 No No Unknown 12 June 109 2 -1 0 Unknown No
51 45 Blue Collar Divorced Primary No 844 No No Unknown 5 June 1018 3 -1 0 Unknown Yes

Data Preprocessing

Building models is not straightforward. Between splitting the dataset and fitting the models comes the data preprocessing stage. The steps involved are not fixed and vary depending on the type of problem (classification or regression) and the model chosen. Fortunately, the tidymodels package provides ready-made tools to make this stage easier, particularly the recipes package, which forms part of tidymodels. There are also additional packages that address common dataset issues. For example, our data are expected to be imbalanced, as the vast majority of customers did not subscribe to a term deposit. A dataset is considered imbalanced when the target variable has a large disparity between classes (e.g., 90% “No” / 10% “Yes”). In this case we use the step_smote() function from the themis package to balance the target variable.

Show the code
# Recipe for tree-based models (RF, XGBoost, LightGBM)
tree_recipe <- recipe(y ~., data = bank_train) %>%
  step_rm(poutcome, ID, duration) %>%
  step_corr(all_numeric(), threshold = 0.75) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_smote(y)

# Recipe for distance/linear models (Logistic Regression, KNN)
linear_recipe <- recipe(y ~., data = bank_train) %>%
  step_rm(poutcome, ID, duration) %>%
  step_corr(all_numeric(), threshold = 0.75) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_numeric_predictors()) %>%  # <-- additional step
  step_smote(y)

It is worth noting that the duration variable (call duration) was deliberately removed from the models. Its value is only known after the call has ended, that is, after the event we are trying to predict. Including it would create a data leakage problem, making the model artificially accurate but practically useless. The poutcome variable was also removed, for a different reason: it concerns previous campaigns for which data are unavailable for the large majority of customers, limiting its usefulness as a predictor.
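To confirm that step_smote() does what we expect, we can compare the class counts before and after the recipe is applied; juice() returns the processed training data:

# Class balance in the raw training data...
bank_train %>% count(y)

# ...and after the tree recipe, where SMOTE has balanced the classes
tree_recipe %>% prep() %>% juice() %>% count(y)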

Here is a preview of the dataset after applying the preprocessing steps for linear models:

Show the code
linear_recipe %>%
  prep() %>%
  juice() %>%
  head() %>%
  gt_custom()
Table 13: Dataset after applying recipes for linear models
Columns, in order: age, balance, day, campaign, pdays, previous, followed by dummy indicators for job, marital, education, default, housing, loan, contact, and month (38 standardized predictors in total), and finally the outcome y.
-1.037076 0.1075 0.3454 -0.5588 -0.4107 -0.3184 -0.5135 -0.1936 -0.1651 -0.5158 -0.2359 -0.2065 -0.3249 -0.137 -0.4436 5.9285 -0.1057 0.8009 -0.6045 -1.0245 -0.6479 -0.2151 -0.1127 -1.1299 -0.4159 -0.2744 -0.6336 -0.3999 -0.07702 -0.2216 -0.1956 -0.4292 -0.3533 -0.1075 -0.6721 -0.3072 7.295 -0.1057 No
-0.756037 1.0681 -0.6206 -0.5588 2.9742 1.8971 -0.5135 -0.1936 -0.1651 -0.5158 -0.2359 -0.2065 3.0769 -0.137 -0.4436 -0.1686 -0.1057 0.8009 -0.6045 0.9758 -0.6479 -0.2151 -0.1127 0.8847 2.4037 -0.2744 -0.6336 -0.3999 -0.07702 -0.2216 -0.1956 -0.4292 -0.3533 -0.1075 1.4874 -0.3072 -0.137 -0.1057 No
1.679640 -0.4643 -1.3451 -0.5588 -0.4107 -0.3184 1.9468 -0.1936 -0.1651 -0.5158 -0.2359 -0.2065 -0.3249 -0.137 -0.4436 -0.1686 -0.1057 0.8009 -0.6045 0.9758 -0.6479 -0.2151 -0.1127 0.8847 -0.4159 -0.2744 1.5778 -0.3999 -0.07702 -0.2216 -0.1956 -0.4292 -0.3533 -0.1075 1.4874 -0.3072 -0.137 -0.1057 No
-0.474997 -0.3660 -0.2584 -0.5588 2.8846 0.7894 -0.5135 -0.1936 -0.1651 -0.5158 -0.2359 4.8398 -0.3249 -0.137 -0.4436 -0.1686 -0.1057 0.8009 -0.6045 -1.0245 1.5429 -0.2151 -0.1127 0.8847 -0.4159 -0.2744 -0.6336 -0.3999 -0.07702 -0.2216 -0.1956 -0.4292 -0.3533 -0.1075 1.4874 -0.3072 -0.137 -0.1057 No
-0.006598 -0.3936 -0.2584 -0.2429 -0.4107 -0.3184 -0.5135 5.1637 -0.1651 -0.5158 -0.2359 -0.2065 -0.3249 -0.137 -0.4436 -0.1686 -0.1057 0.8009 -0.6045 -1.0245 1.5429 -0.2151 -0.1127 0.8847 -0.4159 -0.2744 1.5778 -0.3999 -0.07702 -0.2216 -0.1956 -0.4292 -0.3533 -0.1075 1.4874 -0.3072 -0.137 -0.1057 No

And a preview after applying the steps for tree-based models:

Show the code
tree_recipe %>%
  prep() %>%
  juice() %>%
  head() %>%
  gt_custom()
Table 14: Dataset after applying recipes for tree-based models
Columns, in order: age, balance, day, campaign, pdays, previous, followed by dummy indicators for job, marital, education, default, housing, loan, contact, and month (38 predictors in total, left unscaled), and finally the outcome y.
30 1787 19 1 -1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 No
33 4789 11 1 339 4 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 No
59 0 5 1 -1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 No
36 307 14 1 330 2 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 No
41 221 14 2 -1 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 No

Cross-Validation

Ok, is it time to build our model now?

Not so fast. Theoretically we could proceed, but the recommended approach is not to rely on a single two-way split, since the evaluation results depend heavily on how that split happened and which observations ended up where. To get a more reliable estimate of model performance, we build subsets of our data to estimate, on average across 5 or 10 sub-samples, which parameters consistently lead to better accuracy.

Cross-validation is a technique in which the training set is divided into k equal subsets (folds). In each iteration, one fold is used as the validation set and the remaining k-1 folds are used for training. The process is repeated k times and the results are averaged, giving us a more reliable estimate of model performance. In our case we used 5-fold cross-validation with stratification, so that the ratio of subscribers to non-subscribers is preserved in each fold.

Show the code
set.seed(123)
cv_folds <- vfold_cv(bank_train, v = 5, strata = y)

# Control objects
ctrl_grid  <- control_stack_grid()
ctrl_bayes <- control_stack_bayes() # instead of control_grid(), for stacking

Building the Model

With the parsnip package we can define the characteristics of various models. In this case we want to evaluate several models and compare their performance. We will therefore define:

  • k-Nearest Neighbors
  • Logistic Regression
  • Random Forest
  • Naive Bayes
  • Linear SVM
  • LightGBM
  • XGBoost

The parsnip package provides a unified interface for specifying predictive models. In the framework of this analysis, seven different classification models were developed and compared using the tidymodels ecosystem in R: Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, Naive Bayes, SVM, XGBoost, and LightGBM.

For Logistic Regression, a regularized implementation via the glmnet library was used, allowing hyperparameter tuning of the regularization strength (penalty) and the mixing ratio (mixture), corresponding to Ridge and Lasso regularization. The KNN model was based on the kknn algorithm, with tuning of the number of neighbors and the distance weighting function (weight_func). For Random Forest, the ranger implementation was used with 200 decision trees, with hyperparameter tuning of the number of variables examined at each split (mtry) and the minimum number of observations per node (min_n). The two gradient boosting methods, XGBoost and LightGBM, were both defined with 200 trees, with tuning of tree depth (tree_depth), learning rate (learn_rate), and minimum observations per node (min_n). Hyperparameters marked with tune() were optimized to find the combination that maximizes predictive performance.

Finally, each model is paired with the appropriate preprocessing recipe. For linear models (Logistic Regression and KNN), the linear recipe is used. For tree-based models (XGBoost, LightGBM, and Random Forest), the tree recipe is applied.

Show the code
# --- Logistic Regression ---
log_reg_model <- parsnip::logistic_reg(
  penalty = tune(),
  mixture = tune()
) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

# --- K-Nearest Neighbors ---
knn_model <- parsnip::nearest_neighbor(
  neighbors = tune(),
  weight_func = tune()
) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# --- Random Forest ---
rf_model <- parsnip::rand_forest(
  trees = 200,
  mtry  = tune(),
  min_n = tune()
) %>%
  set_engine("ranger") %>%
  set_mode("classification")

# --- Naive Bayes ---
nb_model <- naive_Bayes(
  smoothness = tune(),
  Laplace    = tune()
) %>%
  set_engine("naivebayes") %>%
  set_mode("classification")

# --- SVM ---
svm_model <- parsnip::svm_linear(
  cost = tune()
) %>%
  set_engine("kernlab") %>%
  set_mode("classification")

# --- XGBoost ---
xgb_model <- parsnip::boost_tree(
  trees      = 200,
  min_n      = tune(),
  learn_rate = tune(),
  tree_depth = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

# --- LightGBM ---
lgbm_model <- parsnip::boost_tree(
  trees      = 200,
  min_n      = tune(),
  learn_rate = tune(),
  tree_depth = tune()
) %>%
  set_engine("lightgbm") %>%
  set_mode("classification")

log_wf  <- workflow() %>% add_recipe(linear_recipe) %>% add_model(log_reg_model)
knn_wf  <- workflow() %>% add_recipe(linear_recipe) %>% add_model(knn_model)
rf_wf   <- workflow() %>% add_recipe(tree_recipe)   %>% add_model(rf_model)
nb_wf   <- workflow() %>% add_recipe(linear_recipe) %>% add_model(nb_model)
svm_wf  <- workflow() %>% add_recipe(linear_recipe) %>% add_model(svm_model)
xgb_wf  <- workflow() %>% add_recipe(tree_recipe)   %>% add_model(xgb_model)
lgbm_wf <- workflow() %>% add_recipe(tree_recipe)   %>% add_model(lgbm_model)

Fitting the Models

Having defined the models and their corresponding workflows, we proceed to hyperparameter optimization. For the simpler models (Logistic Regression, KNN, Naive Bayes, SVM, and Random Forest) we use grid search (tune_grid()), systematically testing various hyperparameter combinations. For the gradient boosting models (XGBoost and LightGBM), which have more hyperparameters and are more computationally expensive per run, I opt for Bayesian optimization (tune_bayes()). In this approach, the algorithm learns from previous trials and selects the next parameter combination more intelligently, saving time compared to an exhaustive search.

Show the code
set.seed(123)

# Logistic Regression -- tune_grid
log_results <- tune_grid(
  log_wf,
  resamples = cv_folds,
  grid = 10,
  metrics = metric_set(roc_auc, accuracy),
  control = ctrl_grid
)

# KNN -- tune_grid
knn_results <- tune_grid(
  knn_wf,
  resamples = cv_folds,
  grid = 10,
  metrics = metric_set(roc_auc, accuracy),
  control = ctrl_grid
)

# Random Forest -- tune_grid (moderate cost)
rf_results <- tune_grid(
  rf_wf,
  resamples = cv_folds,
  grid = 15,
  metrics = metric_set(roc_auc, accuracy),
  control = ctrl_grid
)

nb_results <- tune_grid(
  nb_wf,
  resamples = cv_folds,
  grid      = 10,
  metrics   = metric_set(roc_auc, accuracy),
  control   = ctrl_grid
)

svm_results <- tune_grid(
  svm_wf,
  resamples = cv_folds,
  grid      = 10,
  metrics   = metric_set(roc_auc, accuracy),
  control   = ctrl_grid
)

# XGBoost -- tune_bayes
xgb_results <- tune_bayes(
  xgb_wf,
  resamples = cv_folds,
  initial = 5,
  iter = 25,
  metrics = metric_set(roc_auc, accuracy),
  control = ctrl_bayes
)

# LightGBM -- tune_bayes
lgbm_results <- tune_bayes(
  lgbm_wf,
  resamples = cv_folds,
  initial = 5,
  iter = 25,
  metrics = metric_set(roc_auc, accuracy),
  control = ctrl_bayes
)

Beyond evaluating each model individually, we also apply a technique called stacking. The idea behind stacking is that instead of selecting a single model as the final one, we combine the predictions of multiple models in a meta-model that learns which model performs best and in which situations. In practice, the models trained previously (Logistic Regression, KNN, Random Forest, XGBoost, LightGBM) feed their predictions into a second level, where a linear model with L1 regularization assigns weights to each and can exclude those that do not contribute. This way we leverage the strengths of each algorithm without committing to a single approach. We implement this technique using the stacks package, which integrates smoothly with the tidymodels ecosystem. We assemble two candidate sets: a first stack (v1) with the five core models, and a second (v2) that additionally includes Naive Bayes and the SVM.

Show the code
bank_stack_v1 <- stacks() %>%
  add_candidates(log_results)  %>%
  add_candidates(knn_results)  %>%
  add_candidates(rf_results)   %>%
  add_candidates(xgb_results)  %>%
  add_candidates(lgbm_results)

bank_stack_v2 <- bank_stack_v1 %>%
    add_candidates(nb_results)   %>%
    add_candidates(svm_results)

# Compute weights for each model
set.seed(123)
bank_stack_model_v1 <- bank_stack_v1 %>%
  blend_predictions(
    penalty = 10^(-2:0),       # L1 regularization for model selection
    metric  = metric_set(roc_auc)
  )

bank_stack_model_v2 <- bank_stack_v2 %>%
  blend_predictions(
    penalty = 10^(-2:0),
    metric  = metric_set(roc_auc)
  )

# Train the selected members
bank_stack_fit_v1 <- bank_stack_model_v1 %>%
  fit_members()

bank_stack_fit_v2 <- bank_stack_model_v2 %>%
  fit_members()

Note on the Stack Ensemble threshold

Unlike individual models (LightGBM, XGBoost, Logistic Regression) for which we can draw out-of-fold predictions from cross-validation, the Stack Ensemble does not have equivalent predictions. For this reason we created bank_val earlier: a subset of the training set that the model has not seen during training, used exclusively for threshold selection in the next section.

Threshold Selection

Each model produces for each customer a probability of interest, not a direct decision. To move from probability to classification (“Yes” / “No”) we need a threshold: if the probability exceeds it, the customer is classified as interested.

The default value of 0.5 is rarely the optimal choice for imbalanced data. In our case only 11% of customers belong to the “Yes” class. Instead, we select the threshold that maximizes the F1 score, which balances the ability to detect interested customers (recall) with the reliability of positive predictions (precision).
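For intuition, the F1 score is the harmonic mean of precision and recall; a small illustration (not part of the pipeline):

# A model must do reasonably well on both components to score high
f1 <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}
f1(0.5, 0.5)  # 0.50
f1(0.9, 0.1)  # 0.18 -- high precision cannot offset poor recall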

A common mistake is to find the threshold on the same test set used for final evaluation. This way the model indirectly “sees” the evaluation data, leading to optimistic results that do not reflect true performance. To avoid this we follow different approaches depending on the model:

  • For individual models (LightGBM, XGBoost, Logistic Regression) we use the out-of-fold (OOF) predictions from cross-validation. In each fold, predictions are made by a model that has not seen those observations, so the predictions are unbiased.
  • For the Stack Ensemble we use bank_val, which was set aside from the beginning exclusively for this purpose.
Show the code
# Helper function: finds threshold from OOF predictions
get_oof_threshold <- function(tune_results, model_name) {
  
  best_params <- select_best(tune_results, metric = "roc_auc")
  
  oof_preds <- tune_results %>%
    collect_predictions(
      summarize  = FALSE,
      parameters = best_params
    )
  
  # Dynamic upper bound: up to 95% of the maximum predicted probability
  # Avoids threshold values where no one is predicted as "Yes"
  upper_bound <- max(oof_preds$.pred_Yes, na.rm = TRUE) * 0.95
  
  oof_preds %>%
    threshold_perf(
      truth      = y,
      estimate   = .pred_Yes,
      thresholds = seq(0.005, upper_bound, length.out = 200),
      event_level = "second",
      metrics    = metric_set(f_meas)
    ) %>%
    filter(.metric == "f_meas", !is.na(.estimate)) %>%
    slice_max(.estimate, n = 1, with_ties = FALSE) %>%
    transmute(
      model     = model_name,
      threshold = .threshold,
      f1_oof    = round(.estimate, 3)
    )
}

thresholds_individual <- bind_rows(
  get_oof_threshold(lgbm_results, "LightGBM"),
  get_oof_threshold(xgb_results,  "XGBoost"),
  get_oof_threshold(log_results,  "Logistic Regression"),
  get_oof_threshold(rf_results,   "Random Forest"),
  get_oof_threshold(nb_results,   "Naive Bayes"),
  get_oof_threshold(svm_results,  "SVM")
)

# Predictions on bank_val -- the stack has not seen it
## For first stack
stack_val_preds_v1 <- predict(bank_stack_fit_v1,
                           bank_val,
                           type = "prob") %>%
  bind_cols(bank_val %>% select(y))

upper_bound_stack_v1 <- max(stack_val_preds_v1$.pred_Yes, 
                         na.rm = TRUE) * 0.95

## For second stack
stack_val_preds_v2 <- predict(bank_stack_fit_v2,
                           bank_val,
                           type = "prob") %>%
  bind_cols(bank_val %>% select(y))

upper_bound_stack_v2 <- max(stack_val_preds_v2$.pred_Yes, 
                         na.rm = TRUE) * 0.95

stack_threshold_v1 <- stack_val_preds_v1 %>%
  threshold_perf(
    truth      = y,
    estimate   = .pred_Yes,
    thresholds = seq(0.005, upper_bound_stack_v1, length.out = 200),
    event_level = "second",
    metrics    = metric_set(f_meas)
  ) %>%
  filter(.metric == "f_meas", !is.na(.estimate)) %>%
  slice_max(.estimate, n = 1, with_ties = FALSE) %>%
  transmute(
    model     = "Stack Ensemble (v1)",
    threshold = .threshold,
    f1_oof    = round(.estimate, 3)
  )

stack_threshold_v2 <- stack_val_preds_v2 %>%
  threshold_perf(
    truth      = y,
    estimate   = .pred_Yes,
    thresholds = seq(0.005, upper_bound_stack_v2, length.out = 200),
    event_level = "second",
    metrics    = metric_set(f_meas)
  ) %>%
  filter(.metric == "f_meas", !is.na(.estimate)) %>%
  slice_max(.estimate, n = 1, with_ties = FALSE) %>%
  transmute(
    model     = "Stack Ensemble (v2)",
    threshold = .threshold,
    f1_oof    = round(.estimate, 3)
  )

all_thresholds <- bind_rows(thresholds_individual,
                            stack_threshold_v1,
                            stack_threshold_v2)

all_thresholds %>%
  arrange(desc(f1_oof)) %>%
  rename(
    "Model"          = model,
    "Threshold"      = threshold,
    "F1 (estimate)"  = f1_oof
  ) %>%
  gt_custom(head_max = Inf, use_labels = FALSE)
Table 15: Optimal threshold per model
Model Threshold F1 (estimate)
Stack Ensemble (v2) 0.11061 0.417
Stack Ensemble (v1) 0.08791 0.408
Random Forest 0.23905 0.390
LightGBM 0.26002 0.382
XGBoost 0.38405 0.377
Naive Bayes 0.73631 0.331
Logistic Regression 0.55355 0.319
SVM 0.67053 0.312

It is worth noting that thresholds differ considerably across models. This does not mean that any model is “wrong”: each model simply calibrates its probabilities differently. What matters is that every threshold was found on data that the corresponding model had not used during training. Furthermore, the F1 values in the table above are estimates used for threshold selection and are not directly comparable across models. The individual models used OOF predictions (2712 observations) while the Stack Ensemble used bank_val (678 observations). The definitive comparison is made on the test set below.

Results

Show the code
# Helper function for generating predictions
get_preds <- function(fitted_wf, model_name) {
  predict(fitted_wf, bank_test, type = "prob") %>%
    bind_cols(predict(fitted_wf, bank_test)) %>%
    bind_cols(bank_test %>% select(y)) %>%
    mutate(model = model_name)
}

# Select best hyperparameters for each model
best_lgbm <- select_best(lgbm_results, metric = "roc_auc")
best_lr   <- select_best(log_results,  metric = "roc_auc")
best_xgb  <- select_best(xgb_results,  metric = "roc_auc")
best_rf   <- select_best(rf_results,   metric = "roc_auc")
best_nb   <- select_best(nb_results,   metric = "roc_auc")
best_svm  <- select_best(svm_results,  metric = "roc_auc")

# Finalize and train on full training set
final_lgbm_wf <- finalize_workflow(lgbm_wf, best_lgbm) %>% fit(data = bank_train)
final_lr_wf   <- finalize_workflow(log_wf,  best_lr)   %>% fit(data = bank_train)
final_xgb_wf  <- finalize_workflow(xgb_wf,  best_xgb)  %>% fit(data = bank_train)
final_rf_wf   <- finalize_workflow(rf_wf,   best_rf)   %>% fit(data = bank_train)
final_nb_wf   <- finalize_workflow(nb_wf,   best_nb)   %>% fit(data = bank_train)
final_svm_wf  <- finalize_workflow(svm_wf,  best_svm)  %>% fit(data = bank_train)

stack_preds_v1 <- predict(bank_stack_fit_v1, bank_test, type = "prob") %>%
  bind_cols(predict(bank_stack_fit_v1, bank_test)) %>%
  bind_cols(bank_test %>% select(y)) %>%
  mutate(model = "Stack Ensemble (v1)")

stack_preds_v2 <- predict(bank_stack_fit_v2, bank_test, type = "prob") %>%
  bind_cols(predict(bank_stack_fit_v2, bank_test)) %>%
  bind_cols(bank_test %>% select(y)) %>%
  mutate(model = "Stack Ensemble (v2)")

# Generate predictions on the test set for all models
all_preds <- bind_rows(
  get_preds(final_lgbm_wf, "LightGBM"),
  get_preds(final_lr_wf,   "Logistic Regression"),
  get_preds(final_xgb_wf,  "XGBoost"),
  get_preds(final_rf_wf,   "Random Forest"),
  get_preds(final_nb_wf,   "Naive Bayes"),
  get_preds(final_svm_wf,  "SVM"),
  stack_preds_v1,
  stack_preds_v2
)

Variable Importance

The variable importance analysis from the LightGBM model reveals that the most decisive factors for predicting interest in a term deposit are an unknown contact method (contact_Unknown, 15.6%) and the presence of a housing loan (housing_Yes, 15.4%), followed by secondary education (9.7%) and married marital status (marital_Married, 9.5%). Noteworthy is also the contribution of the contact month, with May appearing as the most important month (7.1%). These results largely align with the findings of the descriptive analysis, confirming that demographic characteristics such as marital status and educational background, along with operational parameters such as the contact method and timing, play a decisive role in a customer’s decision.
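One way a table like this can be produced is with the vip package, sketched below; it assumes final_lgbm_wf (the fitted LightGBM workflow from the Results section) and that vip supports the lightgbm engine:

# Extract and normalize variable importance from the fitted LightGBM model
final_lgbm_wf %>%
  extract_fit_parsnip() %>%
  vip::vi() %>%
  mutate(Importance = round(100 * Importance / sum(Importance), 2)) %>%
  slice_max(Importance, n = 10)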

Table 16: Variable importance in descending order
Variable Importance
Unknown Contact Method 15.56
Housing Loan 15.40
Secondary Education 9.72
Married 9.50
Month: May 7.08
Month: August 3.99
job_Management 3.84
pdays 3.74
balance 3.24
Occupation: Blue Collar 3.14
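For reference, a table like the one above can be reproduced from the fitted workflow. The sketch below assumes that extract_fit_engine() returns the raw lgb.Booster (as it does for the bonsai lightgbm engine) and that the reported importances are gain shares rescaled to percentages; the mapping of raw dummy names to friendly labels is omitted.

Show the code
# Sketch: read gain-based importances off the fitted LightGBM booster
booster <- extract_fit_engine(final_lgbm_wf)  # underlying lgb.Booster

lightgbm::lgb.importance(booster) %>%
  transmute(Variable   = Feature,
            Importance = round(100 * Gain, 2)) %>%  # gain share, as %
  slice_head(n = 10)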

Model Comparison

Before moving on to the final evaluation, it is worth graphically comparing the predictive ability of the four selected models using ROC (Receiver Operating Characteristic) curves. A ROC curve plots sensitivity against the false positive rate across all possible thresholds; the closer the curve lies to the upper-left corner, the better the model’s overall discriminative ability.
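As a numeric companion to the curves plotted below, the area under each ROC curve can be computed directly from the test-set predictions; the snippet assumes the all_preds frame built in the Results section above.

Show the code
# ROC AUC per model; "Yes" is the second factor level
all_preds %>%
  group_by(model) %>%
  roc_auc(truth = y, .pred_Yes, event_level = "second") %>%
  arrange(desc(.estimate))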

Show the code
model_order <- c("Logistic Regression", "XGBoost",
                 "Stack Ensemble (v1)", "Stack Ensemble (v2)")

roc_all <- all_preds %>%
  mutate(model = factor(model, levels = model_order)) %>%
  group_by(model) %>%
  roc_curve(truth = y, .pred_Yes, event_level = "second") %>%
  ungroup() %>%
  drop_na(model)

colors <- c(
  "Logistic Regression" = "#009E73",
  "XGBoost"             = "#D55E00",
  "Stack Ensemble (v1)" = "#CC79A7",
  "Stack Ensemble (v2)" = "#E69F00"
)

hchart(roc_all, type = "line",
       hcaes(x = 1 - specificity, y = sensitivity, group = model)) %>%
    hc_title(text = "ROC Curves") %>%
    hc_subtitle(text = "Comparison of Logistic Regression, XGBoost, and Stack Ensemble models") %>%
    hc_xAxis(title = list(text = "1 - Specificity (False Positive Rate)"),
             min = 0, max = 1) %>%
    hc_yAxis(title = list(text = "Sensitivity (True Positive Rate)"),
             min = 0, max = 1) %>%
    hc_colors(unname(colors)) %>%
    hc_tooltip(headerFormat = "",
               pointFormat = "<b>{series.name}</b><br>TPR: {point.y:.3f}<br>FPR: {point.x:.3f}") %>%
    hc_add_series(
        data = list(list(x = 0, y = 0), list(x = 1, y = 1)),
        type = "line",
        name = "Random model",
        color = "#888888",
        dashStyle = "Dash",
        lineWidth = 1.5,
        marker = list(enabled = FALSE),
        enableMouseTracking = FALSE
    )

The diagram clearly shows the superiority of the Stack Ensemble models over the other two. Because our classes are heavily imbalanced, we also complement the ROC curves with Precision-Recall curves, where the reference baseline is the positive-class proportion rather than the diagonal.

Show the code
pr_all <- all_preds %>%
    mutate(model = factor(model, levels = model_order)) %>%
    group_by(model) %>%
    pr_curve(truth = y, .pred_Yes, event_level = "second") %>%
    ungroup() %>%
    drop_na(model)

# Baseline = proportion of positives in the test set
baseline <- mean(bank_test$y == "Yes")

hchart(pr_all, type = "line",
       hcaes(x = recall, y = precision, group = model)) %>%
    hc_title(text = "Precision-Recall Curves") %>%
    hc_subtitle(text = "Model comparison on imbalanced data") %>%
    hc_xAxis(title = list(text = "Recall"),
             min = 0.05, max = 1) %>%
    hc_yAxis(
        title = list(text = "Precision"),
        min = 0, max = 0.6,
        plotLines = list(
            list(value = baseline, color = "#888", dashStyle = "Dash", width = 1,
                 label = list(text = paste0("Baseline: ", round(baseline, 3))))
        )
    ) %>%
    hc_colors(unname(colors)) %>%
    hc_tooltip(headerFormat = "",
               pointFormat = "<b>{series.name}</b><br>Recall: {point.x:.3f}<br>Precision: {point.y:.3f}")

Overall, the table that follows shows that accuracy alone is misleading. XGBoost achieves the highest accuracy (0.884), yet its F1 cannot even be computed: at its selected threshold it never predicts the positive class, so precision is undefined. This happens because accuracy primarily rewards correct negative predictions, and in our imbalanced dataset, predicting “No” for the vast majority is easy. F1 is the more honest criterion here, penalizing false positives and false negatives equally. Also worth noting is the sensitivity of Logistic Regression: it identifies 53% of the truly interested customers, a rate no other model matches, albeit with the lowest specificity and precision of the group.
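A quick sanity check makes the point concrete: a classifier that always answers “No” scores essentially the same accuracy that XGBoost reports in the table below.

Show the code
# Accuracy of the trivial "always No" baseline is just the share of
# negatives in the test set (~0.884, matching XGBoost's accuracy)
mean(bank_test$y == "No")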

Show the code
threshold_lookup <- setNames(
    all_thresholds$threshold,
    all_thresholds$model
)

all_preds_custom <- all_preds %>%
    mutate(
        thr = threshold_lookup[model],
        .pred_class = factor(
            ifelse(.pred_Yes >= thr, "Yes", "No"),
            levels = c("No", "Yes")
        )
    )

all_preds_custom %>%
  group_by(model) %>%
  metric_set(accuracy, f_meas, sens, spec, precision)(
    truth = y, estimate = .pred_class,
    event_level = "second"
  ) %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  mutate(across(where(is.numeric), ~ round(.x, 3))) %>%
  dplyr::arrange(-f_meas) %>%
  rename(
  "Model"                        = model,
  "Accuracy"                     = accuracy,
  "F1"                           = f_meas,
  "Sensitivity"                  = sens,
  "Specificity"                  = spec,
  "Positive Predictive Value"    = precision
) %>%
  gt_custom(head_max = Inf, use_labels = FALSE)
Model Accuracy F1 Sensitivity Specificity Positive Predictive Value
LightGBM 0.868 0.402 0.382 0.932 0.424
Stack Ensemble (v2) 0.851 0.396 0.420 0.908 0.374
Stack Ensemble (v1) 0.794 0.340 0.458 0.838 0.270
Random Forest 0.807 0.339 0.427 0.857 0.281
Logistic Regression 0.753 0.334 0.534 0.782 0.243
SVM 0.819 0.300 0.336 0.882 0.272
Naive Bayes 0.871 0.180 0.122 0.969 0.340
XGBoost 0.884 NA 0.000 1.000 NA

Conclusions

The second table translates the statistical figures into operational reality; the “success rate” per call is simply the positive predictive value of the previous table restated in campaign terms. LightGBM emerges here as the most efficient option: with only 118 calls it achieves a success rate of 42.4%, meaning roughly 4 in every 10 contacts result in a subscription. If the constrained resource is call center time, this model is the most respectful of it. The trade-off, however, is 81 missed subscribers: of the 131 actually interested customers, it identifies 50. Interestingly, of the two ensembles it is Stack Ensemble v2, a combination of seven individual models compared to the five in v1, that comes out ahead: it makes 75 fewer calls (147 versus 222) at a clearly higher success rate (37.4% versus 27.0%), giving up only five true positives.

Logistic Regression, on the other hand, identifies 70 subscribers, the most of any model, but needs 288 calls to do so, at a success rate of just 24.3%. That is almost two and a half times the calls of LightGBM to gain 20 additional true positives. Also notable is Naive Bayes, which with a mere 47 calls converts about a third of its contacts (34.0%), though it misses 115 of the 131 subscribers. Finally, the most striking result is XGBoost, which at its selected threshold made no positive predictions at all, missing every one of the 131 actual positive cases.

Show the code
all_preds_custom %>%
  group_by(model) %>%
  summarise(
    `True Positives (TP)`        = sum(.pred_class == "Yes" & y == "Yes"),
    `False Positives (FP)`       = sum(.pred_class == "Yes" & y == "No"),
    `Missed Subscribers (FN)`    = sum(.pred_class == "No"  & y == "Yes"),
    `Total Calls Made`           = sum(.pred_class == "Yes"),
    `Success Rate (%)`           = round(100 * `True Positives (TP)` / `Total Calls Made`, 1),
    .groups = "drop"
  ) %>%
  arrange(-`Success Rate (%)`) %>%
  rename("Model" = model) %>%
  gt_custom(head_max = Inf, use_labels = FALSE)
Model True Positives (TP) False Positives (FP) Missed Subscribers (FN) Total Calls Made Success Rate (%)
LightGBM 50 68 81 118 42.4
Stack Ensemble (v2) 55 92 76 147 37.4
Naive Bayes 16 31 115 47 34.0
Random Forest 56 143 75 199 28.1
SVM 44 118 87 162 27.2
Stack Ensemble (v1) 60 162 71 222 27.0
Logistic Regression 70 218 61 288 24.3
XGBoost 0 0 131 0 NaN

From the analysis above it becomes clear that there is no single “correct” or universally best model, but rather one that best fits our goal. If the bank’s objective is to maximize revenue by reaching as many interested customers as possible, Logistic Regression identifies the most subscribers (70 of 131), though at the cost of by far the most calls. If the goals are more conservative, for instance expanding services among existing customers while disturbing as few uninterested ones as possible, Naive Bayes becomes attractive, since it makes both the fewest contacts and the fewest incorrect ones (setting aside XGBoost, which makes no calls at all). If the goal is a balanced approach between the number of contacts required and the expected campaign results, the winner is clearly LightGBM, with Stack Ensemble v2 a close runner-up.

In conclusion, the ideal model is not determined solely by optimal parameters, but by the question being asked and the goals of the organization.

References

Bray, A., Ismay, C., Chasnovski, E., Couch, S., Baumer, B., & Cetinkaya-Rundel, M. (2025). Infer: Tidy statistical inference. Retrieved from https://github.com/tidymodels/infer
Couch, S. P., Bray, A. P., Ismay, C., Chasnovski, E., Baumer, B. S., & Çetinkaya-Rundel, M. (2021). infer: An R package for tidyverse-friendly statistical inference. Journal of Open Source Software, 6(65), 3661. https://doi.org/10.21105/joss.03661
Couch, S., Frick, H., Hvitfeldt, E., & Kuhn, M. (2025). Tailor: Iterative steps for postprocessing model predictions. Retrieved from https://github.com/tidymodels/tailor
Couch, S., & Kuhn, M. (2025). Stacks: Tidy model stacking. Retrieved from https://stacks.tidymodels.org/
Dua, D., & Graff, C. (2017). UCI machine learning repository. University of California, Irvine, School of Information; Computer Sciences. Retrieved from http://archive.ics.uci.edu/ml
Falbel, D., Damiani, A., Hogervorst, R. M., Kuhn, M., Couch, S., & Hvitfeldt, E. (2025). Bonsai: Model wrappers for tree-based models. Retrieved from https://bonsai.tidymodels.org/
Frick, H., Chow, F., Kuhn, M., Mahoney, M., Silge, J., & Wickham, H. (2025). Rsample: General resampling infrastructure. Retrieved from https://rsample.tidymodels.org
Frick, H., Kuhn, M., & Couch, S. (2025). Workflowsets: Create a collection of tidymodels workflows. Retrieved from https://github.com/tidymodels/workflowsets
Greenwell, B. M., & Boehmke, B. (2025). Vip: Variable importance plots. Retrieved from https://github.com/koalaverse/vip/
Greenwell, B. M., & Boehmke, B. C. (2020). Variable importance plots—an introduction to the vip package. The R Journal, 12(1), 343–366. Retrieved from https://doi.org/10.32614/RJ-2020-013
Hester, J., & Bryan, J. (2024). Glue: Interpreted string literals. Retrieved from https://glue.tidyverse.org/
Hvitfeldt, E. (2025). Themis: Extra recipes steps for dealing with unbalanced data. Retrieved from https://github.com/tidymodels/themis
Hvitfeldt, E., & Kuhn, M. (2025). Discrim: Model wrappers for discriminant analysis. Retrieved from https://github.com/tidymodels/discrim
Iannone, R., Cheng, J., Schloerke, B., Haughton, S., Hughes, E., Lauer, A., … Roy, O. (2025). Gt: Easily create presentation-ready display tables. Retrieved from https://gt.rstudio.com
Kabacoff, R. (2019). Data visualization with R. Retrieved from https://rkabacoff.github.io/datavis/
Karatzoglou, A., Smola, A., & Hornik, K. (2024). Kernlab: Kernel-based machine learning lab. https://doi.org/10.32614/CRAN.package.kernlab
Karatzoglou, A., Smola, A., Hornik, K., & Zeileis, A. (2004). Kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11(9), 1–20. https://doi.org/10.18637/jss.v011.i09
Kirenz, J. (2021). Classification with tidymodels, workflows and recipes. Retrieved November 22, 2022, from https://www.kirenz.com/post/2021-02-17-r-classification-tidymodels/#data-preparation
Kuhn, M. (2025a). Modeldata: Data sets useful for modeling examples. Retrieved from https://modeldata.tidymodels.org
Kuhn, M. (2025b). Tune: Tidy tuning tools. Retrieved from https://tune.tidymodels.org/
Kuhn, M., & Frick, H. (2025). Dials: Tools for creating tuning parameter values. Retrieved from https://dials.tidymodels.org
Kuhn, M., & Vaughan, D. (2026). Parsnip: A common API to modeling and analysis functions. Retrieved from https://github.com/tidymodels/parsnip
Kuhn, M., Vaughan, D., & Hvitfeldt, E. (2025). Yardstick: Tidy characterizations of model performance. Retrieved from https://github.com/tidymodels/yardstick
Kuhn, M., Vaughan, D., & Ruiz, E. (2025). Probably: Tools for post-processing predicted values. Retrieved from https://github.com/tidymodels/probably
Kuhn, M., & Wickham, H. (2020). Tidymodels: A collection of packages for modeling and machine learning using tidyverse principles. Retrieved from https://www.tidymodels.org
Kuhn, M., & Wickham, H. (2025). Tidymodels: Easily install and load the tidymodels packages. Retrieved from https://tidymodels.tidymodels.org
Kuhn, M., Wickham, H., & Hvitfeldt, E. (2025). Recipes: Preprocessing and feature engineering steps for modeling. Retrieved from https://github.com/tidymodels/recipes
Kunst, J. (2022). Highcharter: A wrapper for the highcharts library. Retrieved from https://jkunst.com/highcharter/
Lin, G. (2025). Reactable: Interactive data tables for R. Retrieved from https://glin.github.io/reactable/
Majka, M. (2024). Naivebayes: High performance implementation of the naive bayes algorithm. Retrieved from https://github.com/majkamichal/naivebayes
Moro, S., Cortez, P., & Rita, P. (2014). A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62, 22–31.
R Core Team. (2025). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
R-Bloggers. (2017). 5 ways to measure running time of r code. Retrieved November 20, 2022, from https://www.r-bloggers.com/2017/05/5-ways-to-measure-running-time-of-r-code/
R-Bloggers. (2020). How to use lightgbm with tidymodels. Retrieved November 20, 2022, from https://www.r-bloggers.com/2020/08/how-to-use-lightgbm-with-tidymodels/
Robinson, D., Hayes, A., Couch, S., & Hvitfeldt, E. (2025). Broom: Convert statistical objects into tidy tibbles. Retrieved from https://broom.tidymodels.org/
Schliep, K., & Hechenbichler, K. (2025). Kknn: Weighted k-nearest neighbors. Retrieved from https://github.com/KlausVigo/kknn
Vaughan, D., Couch, S., & Frick, H. (2025). Workflows: Modeling workflows. Retrieved from https://github.com/tidymodels/workflows
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. Retrieved from https://ggplot2.tidyverse.org
Wickham, H. (2025). Forcats: Tools for working with categorical variables (factors). Retrieved from https://forcats.tidyverse.org/
Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., … van den Brand, T. (2025). ggplot2: Create elegant data visualisations using the grammar of graphics. Retrieved from https://ggplot2.tidyverse.org
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). Dplyr: A grammar of data manipulation. Retrieved from https://dplyr.tidyverse.org
Wickham, H., & Henry, L. (2026). Purrr: Functional programming tools. Retrieved from https://purrr.tidyverse.org/
Wickham, H., Hester, J., & Bryan, J. (2025). Readr: Read rectangular text data. Retrieved from https://readr.tidyverse.org
Wickham, H., Pedersen, T. L., & Seidel, D. (2025). Scales: Scale functions for visualization. Retrieved from https://scales.r-lib.org
Wickham, H., Vaughan, D., & Girlich, M. (2025). Tidyr: Tidy messy data. Retrieved from https://tidyr.tidyverse.org
Wright, M. N. (2026). Ranger: A fast implementation of random forests. Retrieved from https://imbs-hl.github.io/ranger/
Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1–17. https://doi.org/10.18637/jss.v077.i01
Zhu, H. (2024). kableExtra: Construct complex table with kable and pipe syntax. Retrieved from http://haozhu233.github.io/kableExtra/

Citation

BibTeX citation:
@online{stesiam2022,
  author = {stesiam},
  title = {Predict {Possible} {Interested} {Clients}},
  date = {2022-11-24},
  url = {https://stesiam.com/posts/predict-possible-clients/},
  langid = {en}
}
For attribution, please cite this work as:
stesiam. (2022, November 24). Predict Possible Interested Clients. Retrieved from https://stesiam.com/posts/predict-possible-clients/