The data in the project is related to telephone-based direct marketing activities of a Portuguese banking institution.The marketing campaigns were based on phone calls. In order to determine if the product (bank term deposit) would be subscribed to with (‘yes’) or not (‘no’), it was frequently necessary to make more than one contact with the same client.
A term deposit is one that a bank or other financial institution offers with the fixed rate (generally preferable to just opening a deposit account) that guarantees the return of your funds at a certain maturity date. The reseacrh question is if the client will subscribe to a deposit from a Portuguese banking institution or not.
The multivariate dataset contains 16 features and 1 target variable (17 attributes) with 45211 numbers of instances. The data is related with direct marketing campaigns (phone calls) of a Portuguese banking institution. This dataset is public available for research. The details are described in [Moro et al., 2014].
Below is the list of libraries used:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom 1.0.1 ✔ rsample 1.1.0
## ✔ dials 1.1.0 ✔ tune 1.0.1
## ✔ infer 1.0.3 ✔ workflows 1.1.2
## ✔ modeldata 1.0.1 ✔ workflowsets 1.0.0
## ✔ parsnip 1.0.3 ✔ yardstick 1.1.0
## ✔ recipes 1.0.3
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter() masks stats::filter()
## ✖ recipes::fixed() masks stringr::fixed()
## ✖ dplyr::lag() masks stats::lag()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step() masks stats::step()
## • Use tidymodels_prefer() to resolve common conflicts.
library(tidyr)
library(skimr)
library(reshape)
##
## Attaching package: 'reshape'
##
## The following object is masked from 'package:dplyr':
##
## rename
##
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
library(dplyr)
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
##
## Attaching package: 'openintro'
##
## The following object is masked from 'package:reshape':
##
## tips
##
## The following object is masked from 'package:modeldata':
##
## ames
library(infer)
library(foreign)
library(statsr)
## Loading required package: BayesFactor
## Loading required package: coda
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following object is masked from 'package:reshape':
##
## expand
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## ************
## Welcome to BayesFactor 0.9.12-4.4. If you have questions, please contact Richard Morey (richarddmorey@gmail.com).
##
## Type BFManual() to open the manual.
## ************
##
## Attaching package: 'statsr'
##
## The following objects are masked from 'package:openintro':
##
## calc_streak, evals, nycflights, present
##
## The following object is masked from 'package:modeldata':
##
## ames
##
## The following object is masked from 'package:infer':
##
## rep_sample_n
library(airports)
library(cherryblossom)
library(usdata)
library(describer)
library(BayesFactor)
library(coda)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:openintro':
##
## densityPlot
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(psych)
##
## Attaching package: 'psych'
##
## The following object is masked from 'package:car':
##
## logit
##
## The following object is masked from 'package:describer':
##
## describe
##
## The following objects are masked from 'package:scales':
##
## alpha, rescale
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(carData)
library(Matrix)
library(heatmaply)
## Loading required package: plotly
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:reshape':
##
## rename
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
##
## Loading required package: viridis
## Loading required package: viridisLite
##
## Attaching package: 'viridis'
##
## The following object is masked from 'package:scales':
##
## viridis_pal
##
##
## ======================
## Welcome to heatmaply version 1.4.0
##
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
##
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## You may ask questions at stackoverflow, use the r and heatmaply tags:
## https://stackoverflow.com/questions/tagged/heatmaply
## ======================
library(viridis)
library(viridisLite)
library(corrplot)
## corrplot 0.92 loaded
library(shape)
##
## Attaching package: 'shape'
##
## The following object is masked from 'package:corrplot':
##
## colorlegend
library(ggplot2)
rm(list = ls())
getwd()
## [1] "C:/Users/Ivan/OneDrive/Desktop"
theme_set(theme_light())
The.CSV file can be easily obtained because I’ve kept it on github.
## age job marital education default balance housing loan contact day
## 1 58 management married tertiary no 2143 yes no unknown 5
## 2 44 technician single secondary no 29 yes no unknown 5
## 3 33 entrepreneur married secondary no 2 yes yes unknown 5
## 4 47 blue-collar married unknown no 1506 yes no unknown 5
## 5 33 unknown single unknown no 1 no no unknown 5
## 6 35 management married tertiary no 231 yes no unknown 5
## month duration campaign pdays previous poutcome y
## 1 may 261 1 -1 0 unknown no
## 2 may 151 1 -1 0 unknown no
## 3 may 76 1 -1 0 unknown no
## 4 may 92 1 -1 0 unknown no
## 5 may 198 1 -1 0 unknown no
## 6 may 139 1 -1 0 unknown no
bank %>%
summarize(across(.cols = everything(),
~sum(is.na(.x))))
## age job marital education default balance housing loan contact day month
## 1 0 0 0 0 0 0 0 0 0 0 0
## duration campaign pdays previous poutcome y
## 1 0 0 0 0 0 0
bank <- bank %>%
filter(!is.na(education))
sum(is.na(bank))
## [1] 0
I will convert jobs to a factor for prediction purposes
bank$job <- as_factor(bank$job)
Data visualizations
The variables that are being predicted are housing and jobs
ggplot(bank, aes(x=housing, y=job)) +
geom_point()
More variables that are being predicted are age and loans
bank2 <- bank %>%
mutate(age = ifelse(bank$marital > 21, "yes", "no")) %>%
na.exclude()
ggplot(bank2, aes(x=loan, y=age)) + geom_boxplot() + theme_bw()
### Part 3 - Exploratory data analysis
Here is the matrix
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
bank %>%
ggscatmat(columns = 3:16, color="housing",alpha = 0.7)
## Warning in ggscatmat(., columns = 3:16, color = "housing", alpha = 0.7): Factor
## variables are omitted in plot
## Warning: The dot-dot notation (`..scaled..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(scaled)` instead.
## ℹ The deprecated feature was likely used in the GGally package.
## Please report the issue at <]8;;https://github.com/ggobi/ggally/issueshttps://github.com/ggobi/ggally/issues]8;;>.
Let’s plot the jobs per their average balance. As expected people who
are retired have the highest balance.
bank2 %>%
group_by(job) %>%
summarise(avg_balance = mean(balance)) %>%
arrange(desc(avg_balance)) %>%
na.omit() %>%
slice(1:25) %>%
ggplot(aes(x = reorder(job, avg_balance),avg_balance)) +
geom_bar(stat="identity") +
coord_flip()
bank2 %>%
group_by(job) %>%
summarise(avg_balance = mean(balance)) %>%
arrange(desc(avg_balance)) %>%
na.omit() %>%
slice(1:25) %>%
ggplot(aes(x = reorder(job, -avg_balance),avg_balance)) +
geom_bar(stat="identity") +
coord_flip()
bank3 <- bank[, c(1,2,4,6,8,12,17)]
str(bank3)
## 'data.frame': 45211 obs. of 7 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "management","technician",..: 1 2 3 4 5 1 1 3 6 2 ...
## $ education: chr "tertiary" "secondary" "secondary" "unknown" ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ y : chr "no" "no" "no" "no" ...
head(bank3)
## age job education balance loan duration y
## 1 58 management tertiary 2143 no 261 no
## 2 44 technician secondary 29 no 151 no
## 3 33 entrepreneur secondary 2 yes 76 no
## 4 47 blue-collar unknown 1506 no 92 no
## 5 33 unknown unknown 1 no 198 no
## 6 35 management tertiary 231 no 139 no
summary(bank3$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 33.00 39.00 40.94 48.00 95.00
describe(bank3$age)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 45211 40.94 10.62 39 40.25 10.38 18 95 77 0.68 0.32 0.05
summary(bank3$duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 103.0 180.0 258.2 319.0 4918.0
describe(bank3$duration)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 45211 258.16 257.53 180 210.87 137.88 0 4918 4918 3.14 18.15
## se
## X1 1.21
round(prop.table(table(bank3$education))*100,digit = 1)
##
## primary secondary tertiary unknown
## 15.2 51.3 29.4 4.1
boxplot(bank3$age)
ggplot(data = bank3, aes(bank3$age))+geom_histogram(binwidth = 2, position="identity", alpha=0.5)+
labs(title="Age of customer",x="Age", y = "balance")
## Warning: Use of `bank3$age` is discouraged.
## ℹ Use `age` instead.
H0: There is no association between balance and the variables of Age.
HA: There is an association with at least one of the variables of Age.
The Normal Probability Plot
The normal probability plot shows that the balance and age are not normally distributed as can be seen that the data points does not closely follow the line.
qqnorm(bank3$age)
qqline(bank3$age)
I will use the Linear Regression to address my research question.
bank3$education <- ifelse(bank3$education == "Y", 1, 0)
str(bank3)
## 'data.frame': 45211 obs. of 7 variables:
## $ age : int 58 44 33 47 33 35 28 42 58 43 ...
## $ job : Factor w/ 12 levels "management","technician",..: 1 2 3 4 5 1 1 3 6 2 ...
## $ education: num 0 0 0 0 0 0 0 0 0 0 ...
## $ balance : int 2143 29 2 1506 1 231 447 2 121 593 ...
## $ loan : chr "no" "no" "yes" "no" ...
## $ duration : int 261 151 76 92 198 139 217 380 50 55 ...
## $ y : chr "no" "no" "no" "no" ...
The Linear Regression Model
bank3Model <- lm(duration~ age + balance, data = bank3)
summary(bank3Model)
##
## Call:
## lm(formula = duration ~ age + balance, data = bank3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -355.9 -155.0 -77.8 60.3 4653.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.624e+02 4.823e+00 54.397 < 2e-16 ***
## age -1.654e-01 1.146e-01 -1.444 0.149
## balance 1.880e-03 3.996e-04 4.705 2.55e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 257.5 on 45208 degrees of freedom
## Multiple R-squared: 0.0005109, Adjusted R-squared: 0.0004667
## F-statistic: 11.56 on 2 and 45208 DF, p-value: 9.614e-06
The Equation for the Regression is thus:
y^=262.37471−0.16545∗AGE+0.00188∗balance
R^2 suggesting that the alternative hypothesis is a better explanation for my data.
anova(bank3Model)
## Analysis of Variance Table
##
## Response: duration
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 64788 64788 0.9773 0.3229
## balance 1 1467200 1467200 22.1332 2.551e-06 ***
## Residuals 45208 2996821166 66290
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the ANOVA result I should reject the null hypothesis in favor of the alternative hypothesis.
Check of Conditions:
Test for constant variance - The NcvTest
ncvTest(bank3Model)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 31.4982, Df = 1, p = 1.9962e-08
The test shows that constant error of variance does not exist.
Test for linearity - The crPlots
crPlots(bank3Model)
The test indicates that none of the predictors shows linearity.
Test for Independence of error- Autocorrelation- Durbin- Watson Test
durbinWatsonTest(bank3Model)
## lag Autocorrelation D-W Statistic p-value
## 1 0.04669605 1.906605 0
## Alternative hypothesis: rho != 0
Predicting balance with Age variables
plot(bank3$balance ~ bank3$duration + bank3$age)
bank3 %>%
group_by(y) %>%
summarise(avg_balance = mean(balance)) %>%
arrange(desc(avg_balance)) %>%
na.omit() %>%
slice(1:2) %>%
ggplot(aes(x = reorder(y, avg_balance),avg_balance)) +
geom_bar(stat="identity") +
coord_flip()
The answer to research question if the client will subscribe to a deposit from a Portuguese banking institution is Yes.This finding is importatant because it indicates that direct marketing activities of a Portuguese banking institution are successful and could be implemented by other financial institutions.
Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014 https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#