Introduction
In this article, I will build an ML model to predict which of a bank's customers are interested in opening a term deposit. To achieve this I will use two boosting algorithms (XGBoost and LightGBM). The data come from the UCI repository (Dua & Graff, 2017), and the dataset I worked with is “Bank Marketing” (Moro, Cortez, & Rita, 2014).
But first let’s explain some things.
What is a term deposit?
Generally, we save our money in a regular bank account; with a term deposit, we additionally promise not to withdraw it for a specific amount of time.
And why would someone do that?
Clearly, term deposits have a major disadvantage over regular accounts in terms of capital availability, so people have no incentive to choose them without something in return. For that reason, banks usually offer higher interest rates on such accounts. For example, Piraeus Bank (a bank in Greece) offers double the interest rate on term deposits compared to regular accounts.
So, we can assume that there will be serious interest in this product mostly among people who have significant savings (> €1,000) and do not have large, extraordinary expenses (loans, kids, etc.).
Prerequisites
Import libraries
For this analysis we will need the standard libraries for importing and processing the data, such as readr (Wickham, Hester, & Bryan, 2024) and dplyr (Wickham, François, Henry, Müller, & Vaughan, 2023). The kableExtra (Zhu, 2024) package was used to print results in table format. Among general-purpose R libraries, I also used gridExtra to show ggplot items side by side.
At its core, this analysis aims to predict whether someone is interested in opening a term deposit, so we need to build an ML model. The all-in-one package tidymodels is crucial for this. However, there are two complications with our setup: the response variable is imbalanced, and LightGBM is not supported by parsnip out of the box. Thankfully, these are solved by the themis and bonsai packages, respectively.
Finally, the ggplot2 (Wickham et al., 2025) package was used to create some visualizations, along with an auxiliary package, ggtext, for further formatting them.
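The library calls themselves are folded away in the chunks below; for completeness, here is a sketch of the imports used throughout this post (reactable, highcharter and glue also appear in later chunks):

library(readr)       # data import
library(dplyr)       # data wrangling
library(kableExtra)  # styled tables
library(gridExtra)   # side-by-side ggplots
library(tidymodels)  # modelling framework (recipes, parsnip, tune, ...)
library(bonsai)      # LightGBM engine for parsnip
library(themis)      # recipe steps for class imbalance (step_smote)
library(ggplot2)     # visualizations
library(ggtext)      # text formatting for ggplot
library(reactable)   # interactive tables
library(highcharter) # interactive charts
library(glue)        # string interpolation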
Import dataset
After loading the R libraries, I will load the data. The original source provides several versions of the same dataset. I will use the smaller one, as the fitting process for boosting algorithms is more time-consuming than for other methods (e.g., logistic regression, k-nearest neighbours, etc.).
Show the code
bank_dataset <- read_delim("bank_dataset_files/bank.csv", delim = ";",
                           escape_double = FALSE, trim_ws = TRUE)
# Add a row identifier so individual clients can be referenced later
bank_dataset <- bank_dataset %>% tibble::rowid_to_column("ID")
Preview dataset
Here we can see a small chunk of the dataset (the first six rows/observations), just to understand its structure and the variable types.
Show the code
preview_bank_dataset <- head(bank_dataset, 6)

reactable(
  preview_bank_dataset,
  bordered = FALSE,   # no borders (booktabs style)
  striped = FALSE,    # no stripes, for a clean look
  highlight = FALSE,  # no hover highlighting
  defaultColDef = colDef(align = "center"),
  style = list(fontSize = "14px", border = "none"),
  theme = reactableTheme(
    borderColor = "transparent",  # remove outer border
    cellStyle = list(
      borderBottom = "transparent"  # no lines between rows
    ),
    headerStyle = list(
      borderBottom = "2px solid rgb(117, 117, 117)",  # thick rules around the header
      borderTop = "2px solid rgb(117, 117, 117)",
      fontWeight = "bold"
    )
  )
)
Before we do any analysis we have to define what kinds of data we have available. We can assess this by looking at the values of each variable. Generally, we can classify our variables, depending on their values, as follows:
graph TD;
  A(Type of variables) --> B(Quantitative)
  A(Type of variables) --> C(Qualitative)
  B --> D(Discrete)
  B --> E(Continuous)
  C --> J(Nominal)
  C --> G(Ordinal)
Our dataset consists of 18 variables (columns, including the ID I added) and 4,521 observations (rows). More specifically, the variables are as follows:
Variable | Property | Description |
---|---|---|
age | quantitative (continuous) | The age of the respondent |
job | qualitative (nominal) | The sector of employment of the respondent |
marital | qualitative (nominal) | The marital status of the respondent |
education | qualitative (ordinal) | The highest education level the respondent has reached |
default | qualitative (nominal) | Has credit in default? |
balance | quantitative (continuous) | Average yearly balance, in euros |
housing | qualitative (nominal) | Has housing loan? |
loan | qualitative (nominal) | Has personal loan? |
contact | qualitative (nominal) | Contact communication type |
day | quantitative (discrete) | Last contact day of the month |
month | qualitative (ordinal) | Last contact month of the year |
duration | quantitative (continuous) | Last contact duration, in seconds |
campaign | quantitative (discrete) | Number of contacts performed during this campaign and for this client |
pdays | quantitative (discrete) | Number of days that passed after the client was last contacted in a previous campaign |
previous | quantitative (discrete) | Number of contacts performed before this campaign and for this client |
poutcome | qualitative (nominal) | Outcome of the previous marketing campaign |
y (deposit) | qualitative (nominal) | Has the client subscribed to a term deposit? |
Thus, excluding the ID, the sample has 17 variables, of which 7 are quantitative and 10 are qualitative; of the qualitative ones, 8 are nominal and the remaining two (education, month) are ordinal.
Show the code
# Convert the response to a factor so models treat it as a classification target
bank_dataset$y <- as.factor(bank_dataset$y)
Custom functions
So, I have a basic idea about my data. Can we start analyzing it?
It depends. For a simple analysis, yes. Most of the time, though, this is not the case: there will be a need for repetitive actions. To avoid repeating ourselves, we should define some helper functions before starting the analysis.
Here, I found it useful to define a plotting function for qualitative variables.
Show the code
univariateQualitativePlot <- function(data, column, title, subtitle, chart_type = "bar") {
  library(dplyr)
  library(highcharter)

  # Compute the frequency table
  freq_table <- data %>%
    count({{ column }}, name = "Frequency") %>%
    arrange(desc(Frequency)) %>%
    rename(Variable = {{ column }})

  # Define the common chart properties
  hc <- highchart() %>%
    hc_title(text = title) %>%
    hc_subtitle(text = subtitle) %>%
    hc_tooltip(pointFormat = "{point.name}: {point.y}") %>%
    hc_legend(enabled = chart_type == "pie")  # enable the legend only for pie charts

  # Choose the chart type
  hc <- if (chart_type == "bar") {
    hc %>%
      hc_chart(type = "bar") %>%
      hc_xAxis(categories = freq_table$Variable, title = list(text = "Category")) %>%
      hc_yAxis(title = list(text = "Frequency")) %>%
      hc_series(list(name = "Frequency", data = freq_table$Frequency))
  } else if (chart_type == "pie") {
    pie_data <- lapply(seq_len(nrow(freq_table)), function(i) {
      list(name = freq_table$Variable[i], y = freq_table$Frequency[i])
    })
    hc %>%
      hc_chart(type = "pie") %>%
      hc_series(list(name = "Frequency", data = pie_data))
  } else {
    stop("Invalid chart type. Use 'bar' or 'pie'.")
  }

  return(hc)
}
I will do the same for univariate numeric data.
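A minimal sketch of such a helper, reusing the highcharter histogram pattern that appears in the charts further down (the name univariateQuantitativePlot and its arguments are my own choices):

library(highcharter)
library(glue)

# Sketch: histogram of a numeric column, with the median reported in the subtitle
univariateQuantitativePlot <- function(data, column, title,
                                       caption = "Bank Marketing Dataset from UCI") {
  values <- data[[column]]
  hchart(values) %>%
    hc_title(text = title) %>%
    hc_subtitle(text = glue("The median {column} is <b>{median(values, na.rm = TRUE)}</b>.")) %>%
    hc_caption(text = caption) %>%
    hc_tooltip(pointFormat = "{point.name}: {point.y}") %>%
    hc_legend(enabled = FALSE)
}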
EDA with R
Missing Values
This dataset contains no missing values at all, so there is no need for imputation.
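A quick way to verify this (any equivalent check works):

# Missing values per column, and in total
colSums(is.na(bank_dataset))
sum(is.na(bank_dataset))  # 0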
Univariate analysis
Qualitative variables
Show the code
univariateQualitativePlot(bank_dataset, job,
  title = "Sector of employment of the respondent",
  subtitle = "Number and (%) of total respondents")
Show the code
univariateQualitativePlot(bank_dataset, marital,
title = "What is your marital status?",
subtitle = "Most of the participants are married.", "pie")
Show the code
univariateQualitativePlot(bank_dataset, education, title = "What is your education level?", subtitle = "")
Show the code
univariateQualitativePlot(bank_dataset, default,
title = "Has credit in default?",
subtitle = "(%) of respondents that have failed to meet their payment obligations", "pie")
Show the code
univariateQualitativePlot(bank_dataset, housing,
title = "Has housing loan?",
subtitle = "(%) of respondents who have acquired a housing loan", "pie")
Show the code
univariateQualitativePlot(bank_dataset, loan,
title = "Has personal loan?",
subtitle = "(%) of respondents who have acquired a personal loan", "pie")
Show the code
univariateQualitativePlot(bank_dataset, contact,
title = "Type of Contact",
subtitle = "Vast majority of apporaches are via cellular phones.", "pie")
Show the code
univariateQualitativePlot(bank_dataset, month,
  title = "Month of last contact",
  subtitle = "Most contacts took place in May.")
Show the code
univariateQualitativePlot(bank_dataset, poutcome,
title = "Which was the outcome of the previous approaches?",
subtitle = "For every five failed approaches there was one succussful.", "pie")
Show the code
univariateQualitativePlot(bank_dataset, y,
  title = "How many people opened a term deposit account?",
  subtitle = "(%) of respondents who decided to open a term deposit
  as a result of the current approach",
  "pie")
Quantitative variables
Show the code
subtitle_text <- glue("The median age is <b>{median(bank_dataset$age)}</b>.")
hchart(bank_dataset$age) %>%
hc_title(text = "Age Distribution") %>%
hc_subtitle(text = subtitle_text) %>%
hc_caption(text = "Bank Marketing Dataset from UCI") %>%
hc_tooltip(pointFormat = "{point.name}: {point.y}") %>%
hc_legend(enabled = FALSE)
Show the code
hchart(bank_dataset$balance) %>%
hc_title(text = "Balance Distribution") %>%
hc_subtitle(text = glue("The most prevalent group is the one with balance between 0$ and 200$. The median balance is {median(bank_dataset$balance)}")) %>%
hc_caption(text = "Bank Marketing Dataset from UCI") %>%
hc_tooltip(pointFormat = "{point.name}: {point.y}") %>%
hc_legend(enabled = FALSE) %>%
hc_xAxis(
title = list(text = "Balance in $"),
max = 10000
)
Show the code
hchart(bank_dataset$duration) %>%
  hc_title(text = "Call duration") %>%
  hc_subtitle(text = glue("The median call duration is {median(bank_dataset$duration)} seconds.")) %>%
  hc_caption(text = "Bank Marketing Dataset from UCI") %>%
  hc_tooltip(pointFormat = "{point.name}: {point.y}") %>%
  hc_legend(enabled = FALSE) %>%
  hc_xAxis(
    title = list(text = "Call duration (in seconds)"),
    max = 1000)
Show the code
s = bank_dataset %>%
mutate(category = case_when(
campaign >= 1 & campaign <= 7 ~ as.character(campaign),
campaign > 7 ~ "8+",
TRUE ~ NA_character_
)) %>%
mutate(category = factor(category, levels = c(as.character(1:7), "8+"), ordered = TRUE))
Show the code
hchart(s$category) %>%
  hc_title(text = "Number of contacts during this campaign") %>%
  hc_subtitle(text = "Most clients were approached only once or twice during this campaign.") %>%
  hc_caption(text = "Bank Marketing Dataset from UCI") %>%
  hc_legend(enabled = FALSE) %>%
  hc_xAxis(
    title = list(text = "Number of contacts")
  )
Bivariate analysis
In the previous section I learned a lot about the individual variables. Now I have to reveal the relationships between them; visualizing those relationships will make the results even easier to explain. To pick the right plot for each occasion I used the book “Datavis with R” (Chapter 4: Bivariate Graphs).
Qualitative variables
Show the code
plotjob <- bank_dataset %>%
group_by(job, y) %>%
summarize(n = n()) %>%
mutate(pct = round(100*(n/sum(n)), digits = 1),
lbl = scales::percent(pct)) %>%
arrange(desc(pct))
#categories <- unique(plotjob$job)
# series_data <- plotjob %>%
# group_by(y) %>%
# summarise(series = list(list(name = first(y), data = pct))) %>%
# pull(series)
highchart() %>%
hc_chart(type = "column") %>%
hc_xAxis(categories = unique(plotjob$job)) %>%
hc_yAxis(
min = 0,
max = 100,
title = list(text = "Percentage"),
labels = list(format = "{value}%")
) %>%
hc_plotOptions(
column = list(
stacking = "percent",
dataLabels = list(
enabled = TRUE,
format = "{point.y:.0f}%"
)
)
) %>%
hc_series(
list(name = "No", data = plotjob$pct[plotjob$y == "no"], color = "#FF5733"),
list(name = "Yes", data = sort(plotjob$pct[plotjob$y == "yes"]), color = "#33FF57")
) %>%
hc_tooltip(pointFormat = "<b>{series.name}</b>: {point.y}%") %>%
hc_title(text = "Job of Respondent by Interest to Term Deposit") %>%
hc_subtitle(text = "Students and retirees are the population groups with a proportionally lower interest in term deposits compared to other groups.") %>%
hc_responsive(
rules = list(
list(
condition = list(
maxWidth = 500 # Hide labels when screen width is ≤500px
),
chartOptions = list(
plotOptions = list(
column = list(
dataLabels = list(enabled = FALSE) # Disable labels
)
)
)
)
)
)
Show the code
plotmarital <- bank_dataset %>%
group_by(marital, y) %>%
summarize(n = n()) %>%
mutate(pct = n/sum(n),
lbl = pct *100 ) |>
arrange(y)
highchart() %>%
hc_chart(type = "column") %>%
hc_xAxis(categories = plotmarital$marital) %>%
hc_yAxis(
min = 0,
max = 100,
title = list(text = "Percentage"),
labels = list(format = "{value}%")
) %>%
hc_plotOptions(
column = list(
stacking = "percent",
dataLabels = list(
enabled = TRUE,
format = "{point.y:.0f}%"
)
)
) %>%
hc_series(
list(name = "Όχι", data = plotmarital$lbl[plotmarital$y == "Όχι"], color = "#FF5733"),
list(name = "Ναι", data = plotmarital$lbl[plotmarital$y == "Ναι"], color = "#33FF57")
) %>%
hc_tooltip(pointFormat = "<b>{series.name}</b>: {point.y:.2f}%") %>%
hc_title(text = "Marital Status by Interest to Term Deposit") %>%
hc_subtitle(text = "Couples are less interested in comparison to singles / divorced people in term deposit accounts. The increased expenses of a hpusehold may averts people from maintaining a term deposit.") %>%
hc_responsive(
rules = list(
list(
condition = list(
maxWidth = 500 # Hide labels when screen width is ≤500px
),
chartOptions = list(
plotOptions = list(
column = list(
dataLabels = list(enabled = FALSE) # Disable labels
)
)
)
)
)
)
Show the code
educdata = bank_dataset %>%
group_by(education, y) %>%
summarize(n = n()) %>%
mutate(pct = n/sum(n)*100) %>%
arrange(y)
highchart() %>%
hc_chart(type = "column") %>%
hc_xAxis(categories = educdata$education) %>%
hc_yAxis(
min = 0,
max = 100,
title = list(text = "Percentage"),
labels = list(format = "{value}%")
) %>%
hc_plotOptions(
column = list(
stacking = "percent",
dataLabels = list(
enabled = TRUE,
format = "{point.y:.0f}%"
)
)
) %>%
hc_series(
list(name = "No", data = educdata$pct[educdata$y == "no"], color = "#FF5733"),
list(name = "Yes", data = educdata$pct[educdata$y == "yes"], color = "#33FF57")
) %>%
hc_tooltip(pointFormat = "<b>{series.name}</b>: {point.y:.2f}%") %>%
hc_title(text = "Education Level & Interest to Term Deposit") %>%
hc_subtitle(text = "The more educated people tend to be more interested to new bank/finanial products. ") %>%
hc_responsive(
rules = list(
list(
condition = list(
maxWidth = 500 # Hide labels when screen width is ≤500px
),
chartOptions = list(
plotOptions = list(
column = list(
dataLabels = list(enabled = FALSE) # Disable labels
)
)
)
)
)
)
Show the code
plot_housing2 <- bank_dataset %>%
  group_by(housing, y) %>%
  summarize(n = n()) %>%
  mutate(pct = n / sum(n) * 100) %>%
  arrange(y)
Show the code
highchart() %>%
hc_chart(type = "column") %>%
hc_xAxis(categories = plot_housing2$housing) %>%
hc_yAxis(
min = 0,
max = 100,
title = list(text = "Percentage"),
labels = list(format = "{value}%")
) %>%
hc_plotOptions(
column = list(
stacking = "percent",
dataLabels = list(
enabled = TRUE,
format = "{point.y:.0f}%"
)
)
) %>%
hc_series(
list(name = "No", data = plot_housing2$pct[plot_housing2$y == "no"], color = "#FF5733"),
list(name = "Yes", data = plot_housing2$pct[plot_housing2$y == "yes"], color = "#33FF57")
) %>%
hc_tooltip(pointFormat = "<b>{series.name}</b>: {point.y:.2f}%") %>%
hc_title(text = "Do you have housing loan?") %>%
hc_subtitle(text = "People without loans tend to open more willingly deposit accounts.") %>%
hc_responsive(
rules = list(
list(
condition = list(
maxWidth = 500 # Hide labels when screen width is ≤500px
),
chartOptions = list(
plotOptions = list(
column = list(
dataLabels = list(enabled = FALSE) # Disable labels
)
)
)
)
)
)
Show the code
plot_loan <- bank_dataset %>%
  group_by(loan, y) %>%
  summarize(n = n()) %>%
  mutate(pct = n / sum(n) * 100) %>%
  arrange(y)
Show the code
highchart() %>%
hc_chart(type = "column") %>%
hc_xAxis(categories = plot_loan$loan) %>%
hc_yAxis(
min = 0,
max = 100,
title = list(text = "Percentage"),
labels = list(format = "{value}%")
) %>%
hc_plotOptions(
column = list(
stacking = "percent",
dataLabels = list(
enabled = TRUE,
format = "{point.y:.0f}%"
)
)
) %>%
hc_series(
list(name = "No", data = plot_loan$pct[plot_loan$y == "no"], color = "#FF5733"),
list(name = "Yes", data = plot_loan$pct[plot_loan$y == "yes"], color = "#33FF57")
) %>%
hc_tooltip(pointFormat = "<b>{series.name}</b>: {point.y:.2f}%") %>%
hc_title(text = "Do you have personal loan?") %>%
hc_subtitle(text = "People without loans tend to open more willingly deposit accounts.") %>%
hc_responsive(
rules = list(
list(
condition = list(
maxWidth = 500 # Hide labels when screen width is ≤500px
),
chartOptions = list(
plotOptions = list(
column = list(
dataLabels = list(enabled = FALSE) # Disable labels
)
)
)
)
)
)
Quantitative variables
Show the code
dat <- data_to_boxplot(bank_dataset, age, y)
highchart() %>%
hc_xAxis(type = "category") %>%
hc_add_series_list(dat) %>%
hc_title(text = "Age & Desire of Bank Deposit Account") %>%
hc_subtitle(text = "There are not significant demographic differences concerning people that choose to open term deposit accounts.") %>%
hc_legend(enabled = FALSE)
Show the code
dat <- data_to_boxplot(bank_dataset, balance, y)
highchart() %>%
hc_xAxis(type = "category") %>%
hc_add_series_list(dat) %>%
hc_title(text = "Balance & Desire of Bank Deposit Account") %>%
hc_subtitle(text = "") %>%
hc_legend(enabled = FALSE)
Building model
In R there are two fairly well-known libraries for model building: caret and tidymodels. On one hand, caret is simple to use and has plenty of resources, guides and explanatory articles. On the other hand, tidymodels is an all-in-one solution but has less documentation and fewer articles compared to caret.
Split train/test dataset
Our first step is to split the dataset into two parts, each serving a different purpose. The first part (the train dataset) is used to build the model. The other part (the test dataset) is used to evaluate the model's performance on unseen data.
Show the code
graph TD;
  A(Data) --> B(Train Dataset <br> 3390 obs.)
  A(Data) --> C(Test Dataset <br> 1131 obs.)
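A minimal sketch of the split with rsample; the 75/25 proportion reproduces the 3390/1131 counts above (the seed and the strata argument are illustrative choices):

library(tidymodels)

set.seed(123)  # assumption: any fixed seed, for reproducibility
bank_dataset_split <- initial_split(bank_dataset, prop = 0.75, strata = y)
bank_train <- training(bank_dataset_split)
bank_test  <- testing(bank_dataset_split)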
Show the code
head(bank_train) %>%
kbl(toprule = T,align = 'c',booktabs = T) %>%
kable_styling(full_width = F, position = "center", html_font = "Cambria")
ID | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 30 | unemployed | married | primary | no | 1787 | no | no | cellular | 19 | oct | 79 | 1 | -1 | 0 | unknown | no |
2 | 33 | services | married | secondary | no | 4789 | yes | yes | cellular | 11 | may | 220 | 1 | 339 | 4 | failure | no |
4 | 30 | management | married | tertiary | no | 1476 | yes | yes | unknown | 3 | jun | 199 | 4 | -1 | 0 | unknown | no |
5 | 59 | blue-collar | married | secondary | no | 0 | yes | no | unknown | 5 | may | 226 | 1 | -1 | 0 | unknown | no |
7 | 36 | self-employed | married | tertiary | no | 307 | yes | no | cellular | 14 | may | 341 | 1 | 330 | 2 | other | no |
8 | 39 | technician | married | secondary | no | 147 | yes | no | cellular | 6 | may | 151 | 2 | -1 | 0 | unknown | no |
Show the code
head(bank_test) %>%
kbl(toprule = T,align = 'c',booktabs = T) %>%
kable_styling(full_width = F, position = "center", html_font = "Cambria")
ID | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
6 | 35 | management | single | tertiary | no | 747 | no | no | cellular | 23 | feb | 141 | 2 | 176 | 3 | failure | no |
15 | 31 | blue-collar | married | secondary | no | 360 | yes | yes | cellular | 29 | jan | 89 | 1 | 241 | 1 | failure | no |
23 | 44 | services | single | secondary | no | 106 | no | no | unknown | 12 | jun | 109 | 2 | -1 | 0 | unknown | no |
51 | 45 | blue-collar | divorced | primary | no | 844 | no | no | unknown | 5 | jun | 1018 | 3 | -1 | 0 | unknown | yes |
52 | 37 | technician | single | secondary | no | 228 | yes | no | cellular | 20 | aug | 1740 | 2 | -1 | 0 | unknown | no |
Recipes
An important part of the model-building process is preprocessing. Depending on the model type and data structure, I have to make the necessary changes. tidymodels offers ready-made preprocessing functions that make the whole process a piece of cake.
For instance, the dataset I am working with has an imbalanced response variable (term deposit interest). SMOTE addresses this by synthesizing new minority-class examples between existing ones, balancing the classes in the training data. For that reason, I used the recipe step step_smote()
from the themis package.
Show the code
bank_recipe <- recipes::recipe(y~.,
data = bank_train) %>%
step_rm(poutcome, ID) %>%
step_corr(all_numeric(), threshold = 0.75) %>%
step_dummy(all_nominal(), -all_outcomes()) %>% prep() %>%
step_smote(y) %>%
prep()
Let’s preview our dataset after applying our recipe:
Show the code
head(bake(prep(bank_recipe), new_data = NULL))
age | balance | day | duration | campaign | pdays | previous | job_blue.collar | job_entrepreneur | job_housemaid | job_management | job_retired | job_self.employed | job_services | job_student | job_technician | job_unemployed | job_unknown | marital_married | marital_single | education_secondary | education_tertiary | education_unknown | default_yes | housing_yes | loan_yes | contact_telephone | contact_unknown | month_aug | month_dec | month_feb | month_jan | month_jul | month_jun | month_mar | month_may | month_nov | month_oct | month_sep | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
30 | 1787 | 19 | 79 | 1 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | no |
33 | 4789 | 11 | 220 | 1 | 339 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | no |
30 | 1476 | 3 | 199 | 4 | -1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | no |
59 | 0 | 5 | 226 | 1 | -1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | no |
36 | 307 | 14 | 341 | 1 | 330 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | no |
39 | 147 | 6 | 151 | 2 | -1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | no |
Create validation set
So, how good is the model? Not so fast…
We could build the model right away and evaluate its performance on the test set. The problem with that approach is the sample: a single split gives one noisy performance estimate, and tuning hyperparameters against the test set would leak information. Instead, I will create cross-validation folds from the training data.
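A sketch of the fold creation; five folds match the n = 5 column in the tuning results below (the seed and stratification are illustrative):

set.seed(123)  # assumption
cv_folds <- vfold_cv(bank_train, v = 5, strata = y)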
Specify models
Next, parsnip helps us specify our models. First, I will define a LightGBM model,
Show the code
lightgbm_model<- parsnip::boost_tree(
mode = "classification",
trees = 50,
min_n = tune(),
learn_rate = tune(),
tree_depth = tune()) %>%
set_engine("lightgbm", loss_function = "squarederror")
and an XGBoost one.
Show the code
xgboost_model<- parsnip::boost_tree(
mode = "classification",
trees = 50,
min_n = tune(),
learn_rate = tune(),
tree_depth = tune()) %>%
set_engine("xgboost")
Hyperparameter tuning
Now we specify the hyperparameters' value ranges and a grid, to check which combination of values performs best according to our desired metric (in our case, ROC AUC). This has to be done for both models; first LightGBM,
Show the code
lightgbm_params <- dials::parameters(
min_n(),
tree_depth(range = c(4,10)),
learn_rate() # learning rate
)
Show the code
lightgbm_grid <- dials::grid_max_entropy(
lightgbm_params,
size = 10)
head(lightgbm_grid) %>%
kbl(toprule = T,align = 'c',booktabs = T) %>%
kable_styling(full_width = F, position = "center", html_font = "Cambria")
min_n | tree_depth | learn_rate |
---|---|---|
3 | 5 | 0.0000 |
3 | 7 | 0.0000 |
11 | 10 | 0.0000 |
33 | 4 | 0.0787 |
10 | 5 | 0.0116 |
31 | 9 | 0.0000 |
and XGBoost.
Show the code
xgboost_params <- dials::parameters(
min_n(),
tree_depth(range = c(4,10)),
learn_rate() # learning rate
)
Show the code
xgboost_grid <- dials::grid_max_entropy(
xgboost_params,
size = 10
)
head(xgboost_grid) %>%
kbl(toprule = T,align = 'c',booktabs = T) %>%
kable_styling(full_width = F, position = "center", html_font = "Cambria")
min_n | tree_depth | learn_rate |
---|---|---|
5 | 6 | 0.0000 |
4 | 8 | 0.0000 |
37 | 9 | 0.0000 |
2 | 5 | 0.0000 |
21 | 9 | 0.0057 |
21 | 5 | 0.0472 |
Fit resamples
Finally, we can build the LightGBM model by combining:
- The workflows we set up above
- The resamples
- Grid of values (hyperparameters)
- Metric based on which we will evaluate our model’s performance
Show the code
start_time_lightgbm <- Sys.time()
lightgbm_tuned_model <- tune::tune_grid(
object = lightgbm_workflow,
resamples = cv_folds,
metrics = metric_set(roc_auc, accuracy),
grid = lightgbm_grid,
control = tune::control_grid(verbose = FALSE) # set this to TRUE to see
# in what step of the process you are. But that doesn't look that well in
# a blog.
)
end_time_lightgbm <- Sys.time()
time_lightgbm = difftime(end_time_lightgbm,start_time_lightgbm,units = "secs")
Similarly, for XGBoost.
Show the code
start_time_xgboost <- Sys.time()
xgboost_tuned_model <- tune::tune_grid(
object = xgboost_workflow,
resamples = cv_folds,
metrics = metric_set(roc_auc, accuracy),
grid = xgboost_grid,
control = tune::control_grid(verbose = FALSE) # set this to TRUE to see
# in what step of the process you are. But that doesn't look that well in
# a blog.
)
end_time_xgboost <- Sys.time()
time_xgboost= difftime(end_time_xgboost,start_time_xgboost,units = "secs")
Evaluate model
Our first results based on resamples for LightGBM
Show the code
lightgbm_tuned_model %>%
show_best(., metric = "roc_auc", n = 5) %>%
kbl(toprule = T,align = 'c',booktabs = T) %>%
kable_styling(full_width = F, position = "center", html_font = "Cambria")
min_n | tree_depth | learn_rate | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|---|
16 | 10 | 0.0646 | roc_auc | binary | 0.9001 | 5 | 0.0022 | Preprocessor1_Model10 |
33 | 4 | 0.0787 | roc_auc | binary | 0.8921 | 5 | 0.0021 | Preprocessor1_Model04 |
10 | 5 | 0.0116 | roc_auc | binary | 0.8775 | 5 | 0.0037 | Preprocessor1_Model05 |
34 | 10 | 0.0006 | roc_auc | binary | 0.8653 | 5 | 0.0035 | Preprocessor1_Model07 |
31 | 9 | 0.0000 | roc_auc | binary | 0.8580 | 5 | 0.0088 | Preprocessor1_Model06 |
and XGBoost
Show the code
xgboost_tuned_model %>%
show_best(., metric = "roc_auc", n = 5) %>%
kbl(toprule = T,align = 'c',booktabs = T) %>%
kable_styling(full_width = F, position = "center", html_font = "Cambria")
min_n | tree_depth | learn_rate | .metric | .estimator | mean | n | std_err | .config |
---|---|---|---|---|---|---|---|---|
21 | 5 | 0.0472 | roc_auc | binary | 0.8748 | 5 | 0.0030 | Preprocessor1_Model06 |
21 | 9 | 0.0057 | roc_auc | binary | 0.8552 | 5 | 0.0026 | Preprocessor1_Model05 |
5 | 6 | 0.0000 | roc_auc | binary | 0.8290 | 5 | 0.0074 | Preprocessor1_Model01 |
40 | 5 | 0.0002 | roc_auc | binary | 0.8285 | 5 | 0.0061 | Preprocessor1_Model08 |
19 | 7 | 0.0000 | roc_auc | binary | 0.8242 | 5 | 0.0077 | Preprocessor1_Model07 |
Last fit
By now we can tell which combination of hyperparameter values works best for each model. Given those, I will assess each model's performance on data unknown to the model. The LightGBM model achieves a ROC AUC of 0.8469
Show the code
last_fit_lightgbm_model = parsnip::boost_tree(
mode = "classification",
trees = 100,
min_n = 33,
learn_rate = 0.0787,
tree_depth = 4) %>%
set_engine("lightgbm")
Show the code
## TODO
# options(digits = 4)
#
# last_fit_workflow <- lightgbm_workflow %>%
# update_model(last_fit_lightgbm_model)
#
# last_lightgbm_fit <-
# last_fit_workflow %>%
# last_fit(bank_dataset_split)
#
# last_lightgbm_fit %>%
# collect_metrics() %>%
# kbl(toprule = T,align = 'c',booktabs = T) %>%
# kable_styling(full_width = F, position = "center", html_font = "Cambria")
and XGBoost 0.8693, as shown in the metrics below.
Show the code
last_fit_xgboost_model = parsnip::boost_tree(
mode = "classification",
trees = 100,
min_n = 21,
learn_rate = 0.0472,
tree_depth = 5) %>%
set_engine("xgboost")
Show the code
options(digits = 4)
last_fit_xgboost_workflow <- xgboost_workflow %>%
update_model(last_fit_xgboost_model)
last_xgboost_fit <-
last_fit_xgboost_workflow %>%
last_fit(bank_dataset_split)
last_xgboost_fit %>%
collect_metrics() %>%
kbl(toprule = T,align = 'c',booktabs = T) %>%
kable_styling(full_width = F, position = "center", html_font = "Cambria")
.metric | .estimator | .estimate | .config |
---|---|---|---|
accuracy | binary | 0.8921 | Preprocessor1_Model1 |
roc_auc | binary | 0.8693 | Preprocessor1_Model1 |
brier_class | binary | 0.0750 | Preprocessor1_Model1 |
Results
It seems that the models have really good predictive value even though I applied relatively simple specifications with light tuning. If higher accuracy were required, we could tune more hyperparameters, grow more trees, or try another boosting implementation (e.g., CatBoost).
To summarize,
Model | Time to build | ROC AUC (test) | ROC AUC (CV) |
---|---|---|---|
LightGBM | 25.65 sec. | 0.8469 | 0.9001 |
XGBoost | 41.19 sec. | 0.8693 | 0.8748 |
In conclusion, the LightGBM model was less time-consuming to build than XGBoost. XGBoost needed more time (which was expected) but outperformed LightGBM on the test set, although LightGBM scored higher on the cross-validation resamples.
So, we got some results; so what? The main question is still unanswered: all this modelling was about finding out who is interested in a term deposit. Usually I would use the predict() function with the corresponding data to obtain my predictions. However, I wasn't able to do that here, as I received a class-related error. No matter, collect_predictions() does the same job.
Show the code
pr = last_xgboost_fit %>% collect_predictions()
Now I will select the predicted value and the actual outcome and bind them to my test data.
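A sketch of that step (the column name prediction matches the preview below; bind_cols will make the duplicated y column names unique):

# Attach the predicted class and the truth to the test-set rows
final <- pr %>%
  select(prediction = .pred_class, y) %>%
  bind_cols(bank_test)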
Now let’s say that I want to produce a list of only the interested clients. I will filter the predicted class to keep only “yes” values. Since the dataset does not include any personal information, I will also include the ID so each client can be looked up.
Show the code
prediction | y | ID | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
no | no | 3 | 35 | management | single | tertiary | no | 1350 | yes | no | cellular | 16 | apr | 185 | 1 | 330 | 1 | failure | no |
no | no | 6 | 35 | management | single | tertiary | no | 747 | no | no | cellular | 23 | feb | 141 | 2 | 176 | 3 | failure | no |
no | no | 15 | 31 | blue-collar | married | secondary | no | 360 | yes | yes | cellular | 29 | jan | 89 | 1 | 241 | 1 | failure | no |
no | no | 23 | 44 | services | single | secondary | no | 106 | no | no | unknown | 12 | jun | 109 | 2 | -1 | 0 | unknown | no |
no | yes | 51 | 45 | blue-collar | divorced | primary | no | 844 | no | no | unknown | 5 | jun | 1018 | 3 | -1 | 0 | unknown | yes |
yes | no | 52 | 37 | technician | single | secondary | no | 228 | yes | no | cellular | 20 | aug | 1740 | 2 | -1 | 0 | unknown | no |
Show the code
# Keep the plain data frame separate from the styled table,
# so it can be reused later (e.g. with nrow())
interested_clients <- final %>%
  select(ID, prediction) %>%
  filter(prediction == "yes")

interested_clients %>%
  kbl(toprule = T, align = 'c', booktabs = T) %>%
  kable_styling(full_width = F, position = "center", html_font = "Cambria") %>%
  scroll_box(width = "100%", height = "200px")
ID | prediction |
---|---|
52 | yes |
99 | yes |
157 | yes |
201 | yes |
203 | yes |
301 | yes |
313 | yes |
355 | yes |
495 | yes |
510 | yes |
620 | yes |
662 | yes |
685 | yes |
703 | yes |
838 | yes |
856 | yes |
960 | yes |
1127 | yes |
1193 | yes |
1280 | yes |
1343 | yes |
1432 | yes |
1612 | yes |
1739 | yes |
1761 | yes |
1764 | yes |
1780 | yes |
1897 | yes |
1904 | yes |
1915 | yes |
1946 | yes |
1975 | yes |
1981 | yes |
1992 | yes |
2056 | yes |
2161 | yes |
2259 | yes |
2376 | yes |
2634 | yes |
2761 | yes |
2880 | yes |
2918 | yes |
2986 | yes |
3089 | yes |
3306 | yes |
3590 | yes |
3615 | yes |
3645 | yes |
3755 | yes |
3795 | yes |
3932 | yes |
3957 | yes |
3969 | yes |
4233 | yes |
4263 | yes |
4266 | yes |
4276 | yes |
So, out of 1131 possible clients, I finally got 75 that the model predicts are actually interested. I have therefore reduced the required phone calls by 93% (and, with them, the working hours allocated to this task). But at what cost?
Show the code
ds <- final %>%
select(prediction) %>%
dplyr::count(prediction) %>%
dplyr::mutate(pct = round(n/sum(n) * 100, digits = 2))
ds %>%
hchart("pie", hcaes(x = prediction, y = pct))%>%
hc_title(text = "Expected number of clients") %>%
hc_subtitle(text = glue("From {nrow(final)}, only {nrow(interested_clients)} are interested to open a deposit account.")) %>%
hc_caption("stesiam, 2023")
Hypothesis
Let’s suppose we have to launch a new campaign and inform the interested parties. There are two options. The traditional one, in which case we call everyone (1131 people). Alternatively, given that we have some data, we can use a machine learning (ML) model and call only those predicted to be interested. I also assume that a phone call lasts, on average, 4 minutes and, based on Eurostat, that the average hourly labour cost is 30.5 euros.
To call everyone, we would need 4,524 minutes, which corresponds to 75.4 working hours and therefore costs about 2,300 euros. Using the model, we would only make 75 calls, equivalent to 5 working hours and 152.5 euros. Even in this simple example (with so few observations), the benefit for the company is significant. Last but not least, the company avoids annoying customers by promoting something they are not interested in, and thereby potentially losing them. ML modelling therefore helps not only financially but also in maintaining a healthy brand.
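The arithmetic behind those figures, as a quick sketch:

mins_per_call <- 4     # assumed average call length
wage_per_hour <- 30.5  # Eurostat average hourly labour cost, in euros

cost <- function(calls) calls * mins_per_call / 60 * wage_per_hour
cost(1131)  # ~2300 euros (75.4 hours) -- calling everyone
cost(75)    # 152.5 euros (5 hours)    -- calling only the predicted positives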
Acknowledgements
Image by Gerd Altmann from Pixabay
References
Citation
@online{2022,
author = {stesiam},
title = {Predict {Possible} {Interested} {Clients}},
date = {2022-11-24},
url = {https://stesiam.com/posts/predict-possible-clients/},
langid = {en}
}