In this project we are going to discuss about the implementing different supervised models using R programming. This process includes the process of data loading, data preparation, exploratory data analysis and modelling (varies subject to dataset). All datasets are taken from Kaggle.
The binary classification goal is to predict if the client will subscribe a bank term deposit (variable y).
This dataset is based on “Bank Marketing” UCI dataset. Please check the description at: UCI. The data is enriched by the addition of five new social and economic features/attributes (national wide indicators from a ~10M population country), published by the Banco de Portugal and publicly available at:Banco de Portugal.
Number of Instances: 41188 for
bank-additional-full.csv
Number of Attributes: 20 + output attribute.
Attribute information:
Bank client data:
1. age (numeric)
2. job : type of job (categorical:
“admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown”)
3. marital : marital status (categorical:
“divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced
or widowed)
4. education (categorical:
“basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown”)
5. default: has credit in default? (categorical:
“no”,“yes”,“unknown”)
6. housing: has housing loan? (categorical: “no”,“yes”,“unknown”)
7. loan: has personal loan? (categorical: “no”,“yes”,“unknown”)
Related with the last contact of the current
campaign:
8. contact: contact communication type (categorical:
“cellular”,“telephone”)
9. month: last contact month of year (categorical: “jan”, “feb”, “mar”,
…, “nov”, “dec”)
10. day_of_week: last contact day of the week (categorical:
“mon”,“tue”,“wed”,“thu”,“fri”)
11. duration: last contact duration, in seconds (numeric). Important
note: this attribute highly affects the output target (e.g., if
duration=0 then y=“no”). Yet, the duration is not known before a call is
performed. Also, after the end of the call y is obviously known. Thus,
this input should only be included for benchmark purposes and should be
discarded if the intention is to have a realistic predictive
model.
Other attributes:
12. campaign: number of contacts performed during this campaign and for
this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last
contacted from a previous campaign (numeric; 999 means client was not
previously contacted)
14. previous: number of contacts performed before this campaign and for
this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical:
“failure”,“nonexistent”,“success”)
Social and economic context attributes
16. emp.var.rate: employment variation rate - quarterly indicator
(numeric)
17. cons.price.idx: consumer price index - monthly indicator
(numeric)
18. cons.conf.idx: consumer confidence index - monthly indicator
(numeric)
19. euribor3m: euribor 3 month rate - daily indicator (numeric)
20. nr.employed: number of employees - quarterly indicator (numeric)
### Loading required packages
library(tidyverse)
library(GGally)
library(ggplot2)
library(vtreat)
library(MASS)
library(gridExtra)
library(plotly)
library(dplyr)
library(corrgram)
library(corrplot)
## Loading required packages
library(knitr)
library(dplyr)
## Reading the dataset
bank_data<-read.csv("bank-additional-full.csv")
## head of dataset
head(bank_data)
## age job marital education default housing loan contact month
## 1 56 housemaid married basic.4y no no no telephone may
## 2 57 services married high.school unknown no no telephone may
## 3 37 services married high.school no yes no telephone may
## 4 40 admin. married basic.6y no no no telephone may
## 5 56 services married high.school no no yes telephone may
## 6 45 services married basic.9y unknown no no telephone may
## day_of_week duration campaign pdays previous poutcome emp.var.rate
## 1 mon 261 1 999 0 nonexistent 1.1
## 2 mon 149 1 999 0 nonexistent 1.1
## 3 mon 226 1 999 0 nonexistent 1.1
## 4 mon 151 1 999 0 nonexistent 1.1
## 5 mon 307 1 999 0 nonexistent 1.1
## 6 mon 198 1 999 0 nonexistent 1.1
## cons.price.idx cons.conf.idx euribor3m nr.employed y
## 1 93.994 -36.4 4.857 5191 no
## 2 93.994 -36.4 4.857 5191 no
## 3 93.994 -36.4 4.857 5191 no
## 4 93.994 -36.4 4.857 5191 no
## 5 93.994 -36.4 4.857 5191 no
## 6 93.994 -36.4 4.857 5191 no
## string type of data
str(bank_data)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
## missing data
colSums(is.na(bank_data)) %>% show()
## age job marital education default
## 0 0 0 0 0
## housing loan contact month day_of_week
## 0 0 0 0 0
## duration campaign pdays previous poutcome
## 0 0 0 0 0
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 0 0 0 1 0
## y
## 0
names(bank_data)
## [1] "age" "job" "marital" "education"
## [5] "default" "housing" "loan" "contact"
## [9] "month" "day_of_week" "duration" "campaign"
## [13] "pdays" "previous" "poutcome" "emp.var.rate"
## [17] "cons.price.idx" "cons.conf.idx" "euribor3m" "nr.employed"
## [21] "y"
sum(is.na(bank_data$euribor3m))
## [1] 1
bank_data$euribor3m[is.na(bank_data$euribor3m)]<-mean(bank_data$euribor3m,na.rm=TRUE)
sum(is.na(bank_data$euribor3m))
## [1] 0
sum(is.na(bank_data$day_of_week))
## [1] 0
bank_data$day_of_week[is.na(bank_data$day_of_week)]<-mode(bank_data$day_of_week)
sum(is.na(bank_data$day_of_week))
## [1] 0
library(ggplot2)
## Dimension of dataset
dim(bank_data)
## [1] 41188 21
## summary of all columns
summary(bank_data)
## age job marital education
## Min. :17.00 Length:41188 Length:41188 Length:41188
## 1st Qu.:32.00 Class :character Class :character Class :character
## Median :38.00 Mode :character Mode :character Mode :character
## Mean :40.02
## 3rd Qu.:47.00
## Max. :98.00
## default housing loan contact
## Length:41188 Length:41188 Length:41188 Length:41188
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## month day_of_week duration campaign
## Length:41188 Length:41188 Min. : 0.0 Min. : 1.000
## Class :character Class :character 1st Qu.: 102.0 1st Qu.: 1.000
## Mode :character Mode :character Median : 180.0 Median : 2.000
## Mean : 258.3 Mean : 2.568
## 3rd Qu.: 319.0 3rd Qu.: 3.000
## Max. :4918.0 Max. :56.000
## pdays previous poutcome emp.var.rate
## Min. : 0.0 Min. :0.000 Length:41188 Min. :-3.40000
## 1st Qu.:999.0 1st Qu.:0.000 Class :character 1st Qu.:-1.80000
## Median :999.0 Median :0.000 Mode :character Median : 1.10000
## Mean :962.5 Mean :0.173 Mean : 0.08189
## 3rd Qu.:999.0 3rd Qu.:0.000 3rd Qu.: 1.40000
## Max. :999.0 Max. :7.000 Max. : 1.40000
## cons.price.idx cons.conf.idx euribor3m nr.employed
## Min. :92.20 Min. :-50.8 Min. :0.634 Min. :4964
## 1st Qu.:93.08 1st Qu.:-42.7 1st Qu.:1.344 1st Qu.:5099
## Median :93.75 Median :-41.8 Median :4.857 Median :5191
## Mean :93.58 Mean :-40.5 Mean :3.621 Mean :5167
## 3rd Qu.:93.99 3rd Qu.:-36.4 3rd Qu.:4.961 3rd Qu.:5228
## Max. :94.77 Max. :-26.9 Max. :5.045 Max. :5228
## y
## Length:41188
## Class :character
## Mode :character
##
##
##
bp <- barplot(table(bank_data$y), beside=TRUE, ylim=c(0, max(table(bank_data$y)) + 3452),
main="Term Deposit(yes/no) Distribution", col = c("#eb8060", "#b9e38d"), border=0)
text(bp, table(bank_data$y) + 1200, table(bank_data$y), font=2, col="black")
## correlation check
corr_data<-round(cor(bank_data[sapply(bank_data, is.numeric)]),2)
corr_data
## age duration campaign pdays previous emp.var.rate
## age 1.00 0.00 0.00 -0.03 0.02 0.00
## duration 0.00 1.00 -0.07 -0.05 0.02 -0.03
## campaign 0.00 -0.07 1.00 0.05 -0.08 0.15
## pdays -0.03 -0.05 0.05 1.00 -0.59 0.27
## previous 0.02 0.02 -0.08 -0.59 1.00 -0.42
## emp.var.rate 0.00 -0.03 0.15 0.27 -0.42 1.00
## cons.price.idx 0.00 0.01 0.13 0.08 -0.20 0.78
## cons.conf.idx 0.13 -0.01 -0.01 -0.09 -0.05 0.20
## euribor3m 0.01 -0.03 0.14 0.30 -0.45 0.97
## nr.employed -0.02 -0.04 0.14 0.37 -0.50 0.91
## cons.price.idx cons.conf.idx euribor3m nr.employed
## age 0.00 0.13 0.01 -0.02
## duration 0.01 -0.01 -0.03 -0.04
## campaign 0.13 -0.01 0.14 0.14
## pdays 0.08 -0.09 0.30 0.37
## previous -0.20 -0.05 -0.45 -0.50
## emp.var.rate 0.78 0.20 0.97 0.91
## cons.price.idx 1.00 0.06 0.69 0.52
## cons.conf.idx 0.06 1.00 0.28 0.10
## euribor3m 0.69 0.28 1.00 0.95
## nr.employed 0.52 0.10 0.95 1.00
library(reshape2)
melted_corr_data <- melt(corr_data)
ggplot(data = melted_corr_data, aes(x=Var1, y=Var2, fill=value)) +
geom_tile() +
geom_text(aes(Var2, Var1, label = value), size = 5) +
scale_fill_gradient2(low = "blue", high = "red",
limit = c(-1,1), name="Correlation") +
theme(axis.title.x = element_blank(),
axis.text.x = element_text(angle = 90),
axis.title.y = element_blank(),
panel.background = element_blank())
## Density plot for age column
# Create a histogram
hist(bank_data$age, freq = TRUE, xlab = "Age", main = "Distribution of Age")
## Distribution of Term deposit across the age
ggplot(bank_data, aes(x = age, fill = y)) +
geom_histogram(position = "identity", alpha = 0.4) +
labs(title = "Age and Term Deposit") +
theme(plot.title = element_text(hjust = 0.5))+guides(fill=guide_legend(title="Term Deposit"))
## Distribution of customer marital status by Term Deposit
mar_counts <- bank_data %>%
count(Marital = factor(marital), Term_Deposit = factor(y)) %>%
mutate(pct = prop.table(n))
mar_counts$pct<-round(mar_counts$pct,digits = 3)
ggplot(mar_counts,aes(x = reorder(Marital,-pct), y = pct, fill = Term_Deposit, label = scales::percent(pct))) +
geom_col(position = 'dodge') +
geom_text(position = position_dodge(width = .9), # move to center of bars
vjust = -0.5, # nudge above top of bar
size = 3) +
scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 0)) + ggtitle("Marital Status v/s Term Deposit") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5)) + guides(fill=guide_legend(title="Term Deposit"))
# Statistical test between marital status variable and Term Deposit target variable
chisq.test(bank_data$marital, bank_data$y, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: bank_data$marital and bank_data$y
## X-squared = 122.66, df = 3, p-value < 2.2e-16
## table of job title and term deposit
library(tidyverse)
## checking any relation in job and the term deposit
job_counts<-as.data.frame(table(bank_data$job, bank_data$y))
job_counts<-job_counts %>%
pivot_wider(names_from=Var2, values_from=Freq)
job_counts<-as.data.frame(job_counts)
names(job_counts)<-c("Job Title","Term Deposit No","Term Deposit Yes")
job_counts$TD_No_Per<-round((job_counts$`Term Deposit No`/sum(job_counts$`Term Deposit No`))*100,2)
job_counts$TD_Yes_Per<-round((job_counts$`Term Deposit Yes`/sum(job_counts$`Term Deposit Yes`))*100,2)
job_counts
## Job Title Term Deposit No Term Deposit Yes TD_No_Per TD_Yes_Per
## 1 admin. 9070 1352 24.82 29.14
## 2 blue-collar 8616 638 23.57 13.75
## 3 entrepreneur 1332 124 3.64 2.67
## 4 housemaid 954 106 2.61 2.28
## 5 management 2596 328 7.10 7.07
## 6 retired 1286 434 3.52 9.35
## 7 self-employed 1272 149 3.48 3.21
## 8 services 3646 323 9.98 6.96
## 9 student 600 275 1.64 5.93
## 10 technician 6013 730 16.45 15.73
## 11 unemployed 870 144 2.38 3.10
## 12 unknown 293 37 0.80 0.80
## Distribution of Job variable
library(dplyr)
JB_counts <- bank_data %>%
count(Job = factor(job)) %>%
mutate(pct = prop.table(n))
JB_counts$pct<-round(JB_counts$pct,digits = 3)
ggplot(JB_counts,aes(x = reorder(Job,-pct), y = pct, fill = Job, label = scales::percent(pct))) +
geom_col(position = 'dodge') +
geom_text(position = position_dodge(width = .9), # move to center of bars
vjust = -0.5, # nudge above top of bar
size = 3) +
scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 90),legend.position="none") + ggtitle("Distribution of Job variable") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5))
# Statistical test between Job variable and Term Deposit target variable
chisq.test(bank_data$job, bank_data$y, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: bank_data$job and bank_data$y
## X-squared = 961.24, df = 11, p-value < 2.2e-16
## Distribution of education variable
ed_counts <- bank_data %>%
count(Education = factor(education)) %>%
mutate(pct = prop.table(n))
ed_counts$pct<-round(ed_counts$pct,digits = 3)
ggplot(ed_counts,aes(x = reorder(Education,-pct), y = pct, fill = Education, label = scales::percent(pct))) +
geom_col(position = 'dodge') +
geom_text(position = position_dodge(width = .9), # move to center of bars
vjust = -0.5, # nudge above top of bar
size = 3) +
scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 90),legend.position="none") + ggtitle("Distribution of Education variable") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5))
# Statistical test between Education variable and Term Deposit target variable
chisq.test(bank_data$education, bank_data$y, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: bank_data$education and bank_data$y
## X-squared = 193.11, df = 7, p-value < 2.2e-16
# Distribution of education variable by term deposit
edu_counts <- bank_data %>%
count(Education = factor(education), Term_Deposit = factor(y)) %>%
mutate(pct = prop.table(n))
edu_counts$pct<-round(edu_counts$pct,digits = 3)
ggplot(edu_counts,aes(x = reorder(Education,-pct), y = pct, fill = Term_Deposit, label = scales::percent(pct))) +
geom_col(position = 'dodge') +
geom_text(position = position_dodge(width = .9), # move to center of bars
vjust = -0.5, # nudge above top of bar
size = 3) +
scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 90),legend.position="none") + ggtitle("Education v/s Term Deposit") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5)) + guides(fill=guide_legend(title="Term Deposit"))
## Distribution of housing variable
hou_counts <- bank_data %>%
count(Housing = factor(housing)) %>%
mutate(pct = prop.table(n))
hou_counts$pct<-round(hou_counts$pct,digits = 3)
ggplot(hou_counts,aes(x = reorder(Housing,-pct), y = pct, fill = Housing, label = scales::percent(pct))) +
geom_col(position = 'dodge') +
geom_text(position = position_dodge(width = .9), # move to center of bars
vjust = -0.5, # nudge above top of bar
size = 3) +
scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 0),legend.position="none") + ggtitle("Distribution of Housing variable") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5))
## checking any relation in housing and the term deposit
hou_counts1 <- bank_data %>%
count(Housing = factor(housing), Term_Deposit = factor(y)) %>%
mutate(pct = prop.table(n))
hou_counts1$pct<-round(hou_counts1$pct,digits = 3)
ggplot(hou_counts1,aes(x = reorder(Housing,-pct), y = pct, fill = Term_Deposit, label = scales::percent(pct))) +
geom_col(position = 'dodge') +
geom_text(position = position_dodge(width = .9), # move to center of bars
vjust = -0.5, # nudge above top of bar
size = 3) +
scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 0),legend.position="none") + ggtitle("Housing v/s Term Deposit") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5)) + guides(fill=guide_legend(title="Term Deposit"))
## Distribution of Loan variable
ln_counts <- bank_data %>%
count(Loan = factor(loan)) %>%
mutate(pct = prop.table(n))
ln_counts$pct<-round(ln_counts$pct,digits = 3)
ggplot(ln_counts,aes(x = reorder(Loan,-pct), y = pct, fill = Loan, label = scales::percent(pct))) +
geom_col(position = 'dodge') +
geom_text(position = position_dodge(width = .9), # move to center of bars
vjust = -0.5, # nudge above top of bar
size = 3) +
scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 0),legend.position="none") + ggtitle("Distribution of Loan variable") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5))
## checking any relation in loan and the term deposit
loan_counts <- bank_data %>%
count(Loan = factor(loan), Term_Deposit = factor(y)) %>%
mutate(pct = prop.table(n))
loan_counts$pct<-round(loan_counts$pct,digits = 3)
ggplot(loan_counts,aes(x = reorder(Loan,-pct), y = pct, fill = Term_Deposit, label = scales::percent(pct))) +
geom_col(position = 'dodge') +
geom_text(position = position_dodge(width = .9), # move to center of bars
vjust = -0.5, # nudge above top of bar
size = 3) +
scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 0),legend.position="none") + ggtitle("Loan v/s Term Deposit") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5)) + guides(fill=guide_legend(title="Term Deposit"))
## checking any relation in month and the term deposit
mon_counts <- bank_data %>%
count(Month = factor(month), Term_Deposit = factor(y)) %>%
mutate(pct = prop.table(n))
mon_counts$pct<-round(mon_counts$pct,digits = 3)
ggplot(mon_counts,aes(x = reorder(Month,-pct), y = pct, fill = Term_Deposit, label = scales::percent(pct))) +
geom_col(position = 'dodge') +
geom_text(position = position_dodge(width = .9), # move to center of bars
vjust = -0.5, # nudge above top of bar
size = 3) +
scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 0),legend.position="none") + ggtitle("Month v/s Term Deposit") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5)) + guides(fill=guide_legend(title="Term Deposit"))
mon_cont_y_counts<-as.data.frame(table(bank_data$month, bank_data$contact, bank_data$y))
names(mon_cont_y_counts)<-c("Month","Contact","TermDepositYesNo","Freq")
ggplot(mon_cont_y_counts, aes(x = Month, y = Freq))+
geom_bar(
aes(fill = TermDepositYesNo), stat = "identity", color = "white",
position = position_dodge(0.9)
)+facet_wrap(~Contact)+guides(fill=guide_legend(title="Contact"))
library(knitr)
library(dplyr)
library(kableExtra)
str(bank_data)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : chr "housemaid" "services" "services" "admin." ...
## $ marital : chr "married" "married" "married" "married" ...
## $ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
## $ default : chr "no" "unknown" "no" "no" ...
## $ housing : chr "no" "no" "yes" "no" ...
## $ loan : chr "no" "no" "no" "no" ...
## $ contact : chr "telephone" "telephone" "telephone" "telephone" ...
## $ month : chr "may" "may" "may" "may" ...
## $ day_of_week : chr "mon" "mon" "mon" "mon" ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : chr "no" "no" "no" "no" ...
factor_cols <- c("job", "marital", "education", "default","housing","loan","contact","month","day_of_week","poutcome","y")
bank_data[,factor_cols] <- lapply(bank_data[,factor_cols], factor)
str(bank_data)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day_of_week : Factor w/ 6 levels "","fri","mon",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ emp.var.rate : num 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
## $ cons.price.idx: num 94 94 94 94 94 ...
## $ cons.conf.idx : num -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
## $ euribor3m : num 4.86 4.86 4.86 4.86 4.86 ...
## $ nr.employed : num 5191 5191 5191 5191 5191 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
# Loading caret library
require(caret)
# Splitting the data into train and test
index <- createDataPartition(bank_data$y, p = .70, list = FALSE)
train <- bank_data[index, ]
test <- bank_data[-index, ]
dim(train)
## [1] 28832 21
dim(test)
## [1] 12356 21
table(train$y)
##
## no yes
## 25584 3248
table(test$y)
##
## no yes
## 10964 1392
# Loading package
library(caTools)
library(ROCR)
# Training the model
logistic_model <- glm(y ~ ., family = binomial(), train)
# Checking the model
summary(logistic_model)
##
## Call:
## glm(formula = y ~ ., family = binomial(), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.2540 -0.3004 -0.1867 -0.1357 3.3536
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.453e+02 1.461e+02 -1.679 0.09309 .
## age 1.030e-03 2.919e-03 0.353 0.72410
## jobblue-collar -2.258e-01 9.506e-02 -2.376 0.01751 *
## jobentrepreneur -1.789e-01 1.467e-01 -1.220 0.22243
## jobhousemaid 3.037e-02 1.789e-01 0.170 0.86518
## jobmanagement -8.765e-02 1.028e-01 -0.853 0.39368
## jobretired 2.660e-01 1.279e-01 2.079 0.03762 *
## jobself-employed -8.928e-02 1.382e-01 -0.646 0.51814
## jobservices -1.759e-01 1.045e-01 -1.682 0.09254 .
## jobstudent 3.176e-01 1.316e-01 2.413 0.01583 *
## jobtechnician 3.856e-02 8.494e-02 0.454 0.64986
## jobunemployed -1.258e-02 1.529e-01 -0.082 0.93443
## jobunknown 1.787e-03 2.700e-01 0.007 0.99472
## maritalmarried 3.335e-02 8.288e-02 0.402 0.68740
## maritalsingle 4.874e-02 9.460e-02 0.515 0.60641
## maritalunknown 3.079e-01 4.454e-01 0.691 0.48939
## educationbasic.6y 1.937e-02 1.480e-01 0.131 0.89591
## educationbasic.9y 8.790e-02 1.128e-01 0.779 0.43592
## educationhigh.school 5.195e-02 1.098e-01 0.473 0.63621
## educationilliterate 1.598e+00 9.302e-01 1.718 0.08580 .
## educationprofessional.course 1.189e-01 1.209e-01 0.984 0.32508
## educationuniversity.degree 2.147e-01 1.097e-01 1.957 0.05032 .
## educationunknown 2.325e-01 1.423e-01 1.634 0.10227
## defaultunknown -3.191e-01 8.094e-02 -3.943 8.06e-05 ***
## defaultyes -8.279e+00 1.869e+02 -0.044 0.96467
## housingunknown -1.573e-01 1.702e-01 -0.924 0.35534
## housingyes 9.657e-03 4.944e-02 0.195 0.84513
## loanunknown NA NA NA NA
## loanyes -1.331e-01 6.947e-02 -1.916 0.05541 .
## contacttelephone -5.478e-01 9.002e-02 -6.086 1.16e-09 ***
## monthaug 9.041e-01 1.446e-01 6.251 4.08e-10 ***
## monthdec 3.836e-01 2.493e-01 1.539 0.12391
## monthjul 1.746e-01 1.155e-01 1.512 0.13045
## monthjun -4.677e-01 1.518e-01 -3.082 0.00206 **
## monthmar 2.068e+00 1.734e-01 11.927 < 2e-16 ***
## monthmay -4.026e-01 9.888e-02 -4.072 4.66e-05 ***
## monthnov -3.551e-01 1.447e-01 -2.454 0.01412 *
## monthoct 3.373e-01 1.843e-01 1.831 0.06717 .
## monthsep 3.762e-01 2.143e-01 1.755 0.07924 .
## day_of_weekfri 8.499e+00 1.388e+02 0.061 0.95117
## day_of_weekmon 8.385e+00 1.388e+02 0.060 0.95182
## day_of_weekthu 8.557e+00 1.388e+02 0.062 0.95083
## day_of_weektue 8.553e+00 1.388e+02 0.062 0.95086
## day_of_weekwed 8.712e+00 1.388e+02 0.063 0.94994
## duration 4.666e-03 8.863e-05 52.646 < 2e-16 ***
## campaign -3.976e-02 1.370e-02 -2.901 0.00371 **
## pdays -9.868e-04 2.512e-04 -3.929 8.55e-05 ***
## previous -4.538e-02 7.004e-02 -0.648 0.51704
## poutcomenonexistent 4.538e-01 1.126e-01 4.030 5.58e-05 ***
## poutcomesuccess 9.917e-01 2.455e-01 4.039 5.37e-05 ***
## emp.var.rate -1.766e+00 1.690e-01 -10.448 < 2e-16 ***
## cons.price.idx 2.154e+00 3.005e-01 7.168 7.62e-13 ***
## cons.conf.idx 1.952e-02 9.344e-03 2.089 0.03672 *
## euribor3m 3.159e-01 1.545e-01 2.044 0.04094 *
## nr.employed 6.102e-03 3.723e-03 1.639 0.10121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 20299 on 28831 degrees of freedom
## Residual deviance: 11972 on 28778 degrees of freedom
## AIC: 12080
##
## Number of Fisher Scoring iterations: 11
# Predicting in the test dataset
pred_prob <- predict(logistic_model, test, type = "response")
# Converting from probability to actual output
test$pred_class <- ifelse(pred_prob >= 0.5, "yes", "no")
test$pred_class <- as.factor(test$pred_class)
# Generating the classification table
ctab_test <- table(test$y, test$pred_class)
ctab_test
##
## no yes
## no 10663 301
## yes 813 579
library(pROC)
roc <- roc(train$y, logistic_model$fitted.values)
auc(roc)
## Area under the curve: 0.9365
## Accuracy in Test dataset
# Accuracy = (TP + TN)/(TN + FP + FN + TP)
accuracy_test <- sum(diag(ctab_test))/sum(ctab_test)
accuracy_test
## [1] 0.9098414
#Precision = TP/FP + TP (Precision indicates how often does your predicted TRUE values are actually TRUE.)
# Precision in Test dataset
Precision <- (ctab_test[2, 2]/sum(ctab_test[, 2]))
Precision
## [1] 0.6579545
# Recall Or TPR = TP/(FN + TP) (Recall or TPR indicates how often does our model predicts actual TRUE from the overall TRUE events.)
# Recall in Train dataset
Recall <- (ctab_test[2, 2]/sum(ctab_test[2, ]))
Recall
## [1] 0.4159483
# F1 score (F-Score is a harmonic mean of recall and precision. The score value lies between 0 and 1. The value of 1 represents perfect precision & recall. The value 0 represents the worst case.)
F_Score <- (2 * Precision * Recall / (Precision + Recall))
F_Score
## [1] 0.5096831
metric_eval <- data.frame(matrix(ncol = 6, nrow = 0))
x <- c("Model_Name", "Accuracy", "Precision","Recall", "F1_score", "AUC")
colnames(metric_eval) <- x
library(caret)
lgr_val <- c("Logistic Regression",accuracy_test,Precision,Recall,F_Score,auc(roc))
metric_eval <- rbind(metric_eval,lgr_val)
names(metric_eval)<-x
## making null for predicted column created in test data
test$pred_class<-NULL
library(caTools)
library(knitr)
set.seed(123)
library(rpart)
classifier = rpart(formula = y ~ .,
data = train)
# Predicting the Test set results
names(test)
## [1] "age" "job" "marital" "education"
## [5] "default" "housing" "loan" "contact"
## [9] "month" "day_of_week" "duration" "campaign"
## [13] "pdays" "previous" "poutcome" "emp.var.rate"
## [17] "cons.price.idx" "cons.conf.idx" "euribor3m" "nr.employed"
## [21] "y"
y_pred = predict(classifier,
newdata = test,
type = 'prob')[,2]
library(pROC)
tree.roc <- roc(test$y, y_pred)
dt_auc<-tree.roc$auc[1]
## for confusion matrix evaluation
y_pred = predict(classifier,
newdata = test,
type = 'class')
# Making the Confusion Matrix
library(caret)
cm<-confusionMatrix(as.factor(y_pred), test$y, mode = "everything", positive="yes")
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10615 758
## yes 349 634
##
## Accuracy : 0.9104
## 95% CI : (0.9052, 0.9154)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.486
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.45546
## Specificity : 0.96817
## Pos Pred Value : 0.64496
## Neg Pred Value : 0.93335
## Precision : 0.64496
## Recall : 0.45546
## F1 : 0.53389
## Prevalence : 0.11266
## Detection Rate : 0.05131
## Detection Prevalence : 0.07956
## Balanced Accuracy : 0.71181
##
## 'Positive' Class : yes
##
dt_val <- c("Decision Tree",cm$overall[1],cm$byClass[5],cm$byClass[6],cm$byClass[7],dt_auc)
metric_eval <- rbind(metric_eval,dt_val)
names(metric_eval)<-x
library(randomForest)
library(knitr)
library(randomForest)
# Random Forest for classification
classifier_RF = randomForest(x = train[-21],
y = train$y,
ntree = 500)
classifier_RF
##
## Call:
## randomForest(x = train[-21], y = train$y, ntree = 500)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 8.58%
## Confusion matrix:
## no yes class.error
## no 24642 942 0.03681989
## yes 1531 1717 0.47136700
# Predicting the Test set results
y_pred_rf = predict(classifier_RF, newdata = test[-21])
# Plot the error vs The number of trees graph
plot(classifier_RF)
# Variable importance plot
varImpPlot(classifier_RF)
# confusion matrix
cm<-confusionMatrix(y_pred_rf, test$y, mode = "everything", positive="yes")
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 10560 655
## yes 404 737
##
## Accuracy : 0.9143
## 95% CI : (0.9092, 0.9192)
## No Information Rate : 0.8873
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5347
##
## Mcnemar's Test P-Value : 1.562e-14
##
## Sensitivity : 0.52945
## Specificity : 0.96315
## Pos Pred Value : 0.64592
## Neg Pred Value : 0.94160
## Precision : 0.64592
## Recall : 0.52945
## F1 : 0.58192
## Prevalence : 0.11266
## Detection Rate : 0.05965
## Detection Prevalence : 0.09234
## Balanced Accuracy : 0.74630
##
## 'Positive' Class : yes
##
require(pROC)
rf.roc<-roc(train$y,classifier_RF$votes[,2])
plot(rf.roc)
rf_auc<-auc(rf.roc)[1]
rf_val <- c("Random Forest",cm$overall[1],cm$byClass[5],cm$byClass[6],cm$byClass[7],rf_auc)
metric_eval <- rbind(metric_eval,rf_val)
colnames(metric_eval) <- x
metric_eval$Accuracy<-round(as.numeric(metric_eval$Accuracy),digits = 4)
metric_eval$Precision<-round(as.numeric(metric_eval$Precision),digits = 4)
metric_eval$Recall<-round(as.numeric(metric_eval$Recall),digits = 4)
metric_eval$F1_score<-round(as.numeric(metric_eval$F1_score),digits = 4)
metric_eval$AUC<-round(as.numeric(metric_eval$AUC),digits = 4)
metric_eval
## Model_Name Accuracy Precision Recall F1_score AUC
## 1 Logistic Regression 0.9098 0.6580 0.4159 0.5097 0.9365
## 2 Decision Tree 0.9104 0.6450 0.4555 0.5339 0.8651
## 3 Random Forest 0.9143 0.6459 0.5295 0.5819 0.9439