Business Analytics with R Project

Project Description

In this project we are going to discuss about the implementing different supervised models using R programming. This process includes the process of data loading, data preparation, exploratory data analysis and modelling (varies subject to dataset). All datasets are taken from Kaggle.

Problem Statement

The binary classification goal is to predict if the client will subscribe a bank term deposit (variable y).

Source and Understanding of Dataset

This dataset is based on “Bank Marketing” UCI dataset. Please check the description at: UCI. The data is enriched by the addition of five new social and economic features/attributes (national wide indicators from a ~10M population country), published by the Banco de Portugal and publicly available at:Banco de Portugal.

Data set description:

Number of Instances: 41188 for bank-additional-full.csv
Number of Attributes: 20 + output attribute.
Attribute information:

Input variables:

Bank client data:
1. age (numeric)
2. job : type of job (categorical: “admin.”,“blue-collar”,“entrepreneur”,“housemaid”,“management”,“retired”,“self-employed”,“services”,“student”,“technician”,“unemployed”,“unknown”)
3. marital : marital status (categorical: “divorced”,“married”,“single”,“unknown”; note: “divorced” means divorced or widowed)
4. education (categorical: “basic.4y”,“basic.6y”,“basic.9y”,“high.school”,“illiterate”,“professional.course”,“university.degree”,“unknown”)
5. default: has credit in default? (categorical: “no”,“yes”,“unknown”)
6. housing: has housing loan? (categorical: “no”,“yes”,“unknown”)
7. loan: has personal loan? (categorical: “no”,“yes”,“unknown”)
Related with the last contact of the current campaign:
8. contact: contact communication type (categorical: “cellular”,“telephone”)
9. month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
10. day_of_week: last contact day of the week (categorical: “mon”,“tue”,“wed”,“thu”,“fri”)
11. duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=“no”). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
Other attributes:
12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14. previous: number of contacts performed before this campaign and for this client (numeric)
15. poutcome: outcome of the previous marketing campaign (categorical: “failure”,“nonexistent”,“success”)
Social and economic context attributes
16. emp.var.rate: employment variation rate - quarterly indicator (numeric)
17. cons.price.idx: consumer price index - monthly indicator (numeric)
18. cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19. euribor3m: euribor 3 month rate - daily indicator (numeric)
20. nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):

y - has the client subscribed a term deposit? (binary: “yes”,“no”)

### Loading required packages
library(tidyverse)
library(GGally)
library(ggplot2)
library(vtreat)
library(MASS)
library(gridExtra)
library(plotly)
library(dplyr)
library(corrgram)
library(corrplot)

Data processing

## Loading required packages
library(knitr)
library(dplyr)
## Reading the dataset
bank_data<-read.csv("bank-additional-full.csv")
## head of dataset
head(bank_data)

##   age       job marital   education default housing loan   contact month
## 1  56 housemaid married    basic.4y      no      no   no telephone   may
## 2  57  services married high.school unknown      no   no telephone   may
## 3  37  services married high.school      no     yes   no telephone   may
## 4  40    admin. married    basic.6y      no      no   no telephone   may
## 5  56  services married high.school      no      no  yes telephone   may
## 6  45  services married    basic.9y unknown      no   no telephone   may
##   day_of_week duration campaign pdays previous    poutcome emp.var.rate
## 1         mon      261        1   999        0 nonexistent          1.1
## 2         mon      149        1   999        0 nonexistent          1.1
## 3         mon      226        1   999        0 nonexistent          1.1
## 4         mon      151        1   999        0 nonexistent          1.1
## 5         mon      307        1   999        0 nonexistent          1.1
## 6         mon      198        1   999        0 nonexistent          1.1
##   cons.price.idx cons.conf.idx euribor3m nr.employed  y
## 1         93.994         -36.4     4.857        5191 no
## 2         93.994         -36.4     4.857        5191 no
## 3         93.994         -36.4     4.857        5191 no
## 4         93.994         -36.4     4.857        5191 no
## 5         93.994         -36.4     4.857        5191 no
## 6         93.994         -36.4     4.857        5191 no

## string type of data
str(bank_data)

## 'data.frame':    41188 obs. of  21 variables:
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : chr  "housemaid" "services" "services" "admin." ...
##  $ marital       : chr  "married" "married" "married" "married" ...
##  $ education     : chr  "basic.4y" "high.school" "high.school" "basic.6y" ...
##  $ default       : chr  "no" "unknown" "no" "no" ...
##  $ housing       : chr  "no" "no" "yes" "no" ...
##  $ loan          : chr  "no" "no" "no" "no" ...
##  $ contact       : chr  "telephone" "telephone" "telephone" "telephone" ...
##  $ month         : chr  "may" "may" "may" "may" ...
##  $ day_of_week   : chr  "mon" "mon" "mon" "mon" ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ y             : chr  "no" "no" "no" "no" ...

## missing data
colSums(is.na(bank_data)) %>% show()

##            age            job        marital      education        default 
##              0              0              0              0              0 
##        housing           loan        contact          month    day_of_week 
##              0              0              0              0              0 
##       duration       campaign          pdays       previous       poutcome 
##              0              0              0              0              0 
##   emp.var.rate cons.price.idx  cons.conf.idx      euribor3m    nr.employed 
##              0              0              0              1              0 
##              y 
##              0

Data Treatment

names(bank_data)

##  [1] "age"            "job"            "marital"        "education"     
##  [5] "default"        "housing"        "loan"           "contact"       
##  [9] "month"          "day_of_week"    "duration"       "campaign"      
## [13] "pdays"          "previous"       "poutcome"       "emp.var.rate"  
## [17] "cons.price.idx" "cons.conf.idx"  "euribor3m"      "nr.employed"   
## [21] "y"

sum(is.na(bank_data$euribor3m))

## [1] 1

bank_data$euribor3m[is.na(bank_data$euribor3m)]<-mean(bank_data$euribor3m,na.rm=TRUE)
sum(is.na(bank_data$euribor3m))

## [1] 0

sum(is.na(bank_data$day_of_week))

## [1] 0

bank_data$day_of_week[is.na(bank_data$day_of_week)]<-mode(bank_data$day_of_week)
sum(is.na(bank_data$day_of_week))

## [1] 0

Exploratory Data Analysis

library(ggplot2)
## Dimension of dataset
dim(bank_data)

## [1] 41188    21

## summary of all columns
summary(bank_data)

##       age            job              marital           education        
##  Min.   :17.00   Length:41188       Length:41188       Length:41188      
##  1st Qu.:32.00   Class :character   Class :character   Class :character  
##  Median :38.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :40.02                                                           
##  3rd Qu.:47.00                                                           
##  Max.   :98.00                                                           
##    default            housing              loan             contact         
##  Length:41188       Length:41188       Length:41188       Length:41188      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     month           day_of_week           duration         campaign     
##  Length:41188       Length:41188       Min.   :   0.0   Min.   : 1.000  
##  Class :character   Class :character   1st Qu.: 102.0   1st Qu.: 1.000  
##  Mode  :character   Mode  :character   Median : 180.0   Median : 2.000  
##                                        Mean   : 258.3   Mean   : 2.568  
##                                        3rd Qu.: 319.0   3rd Qu.: 3.000  
##                                        Max.   :4918.0   Max.   :56.000  
##      pdays          previous       poutcome          emp.var.rate     
##  Min.   :  0.0   Min.   :0.000   Length:41188       Min.   :-3.40000  
##  1st Qu.:999.0   1st Qu.:0.000   Class :character   1st Qu.:-1.80000  
##  Median :999.0   Median :0.000   Mode  :character   Median : 1.10000  
##  Mean   :962.5   Mean   :0.173                      Mean   : 0.08189  
##  3rd Qu.:999.0   3rd Qu.:0.000                      3rd Qu.: 1.40000  
##  Max.   :999.0   Max.   :7.000                      Max.   : 1.40000  
##  cons.price.idx  cons.conf.idx     euribor3m      nr.employed  
##  Min.   :92.20   Min.   :-50.8   Min.   :0.634   Min.   :4964  
##  1st Qu.:93.08   1st Qu.:-42.7   1st Qu.:1.344   1st Qu.:5099  
##  Median :93.75   Median :-41.8   Median :4.857   Median :5191  
##  Mean   :93.58   Mean   :-40.5   Mean   :3.621   Mean   :5167  
##  3rd Qu.:93.99   3rd Qu.:-36.4   3rd Qu.:4.961   3rd Qu.:5228  
##  Max.   :94.77   Max.   :-26.9   Max.   :5.045   Max.   :5228  
##       y            
##  Length:41188      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

bp <- barplot(table(bank_data$y), beside=TRUE, ylim=c(0, max(table(bank_data$y)) + 3452), 
             main="Term Deposit(yes/no) Distribution", col = c("#eb8060", "#b9e38d"), border=0)
text(bp, table(bank_data$y) + 1200, table(bank_data$y), font=2, col="black")

## correlation check
corr_data<-round(cor(bank_data[sapply(bank_data, is.numeric)]),2)
corr_data

##                  age duration campaign pdays previous emp.var.rate
## age             1.00     0.00     0.00 -0.03     0.02         0.00
## duration        0.00     1.00    -0.07 -0.05     0.02        -0.03
## campaign        0.00    -0.07     1.00  0.05    -0.08         0.15
## pdays          -0.03    -0.05     0.05  1.00    -0.59         0.27
## previous        0.02     0.02    -0.08 -0.59     1.00        -0.42
## emp.var.rate    0.00    -0.03     0.15  0.27    -0.42         1.00
## cons.price.idx  0.00     0.01     0.13  0.08    -0.20         0.78
## cons.conf.idx   0.13    -0.01    -0.01 -0.09    -0.05         0.20
## euribor3m       0.01    -0.03     0.14  0.30    -0.45         0.97
## nr.employed    -0.02    -0.04     0.14  0.37    -0.50         0.91
##                cons.price.idx cons.conf.idx euribor3m nr.employed
## age                      0.00          0.13      0.01       -0.02
## duration                 0.01         -0.01     -0.03       -0.04
## campaign                 0.13         -0.01      0.14        0.14
## pdays                    0.08         -0.09      0.30        0.37
## previous                -0.20         -0.05     -0.45       -0.50
## emp.var.rate             0.78          0.20      0.97        0.91
## cons.price.idx           1.00          0.06      0.69        0.52
## cons.conf.idx            0.06          1.00      0.28        0.10
## euribor3m                0.69          0.28      1.00        0.95
## nr.employed              0.52          0.10      0.95        1.00

library(reshape2)
melted_corr_data <- melt(corr_data)
ggplot(data = melted_corr_data, aes(x=Var1, y=Var2, fill=value)) + 
  geom_tile() +
  geom_text(aes(Var2, Var1, label = value), size = 5) +
  scale_fill_gradient2(low = "blue", high = "red",
                       limit = c(-1,1), name="Correlation") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_text(angle = 90),
        axis.title.y = element_blank(),
        panel.background = element_blank())

## Density plot for age column
# Create a histogram
hist(bank_data$age, freq = TRUE, xlab = "Age", main = "Distribution of Age")

## Distribution of Term deposit across the age
ggplot(bank_data, aes(x = age, fill = y)) +
  geom_histogram(position = "identity", alpha = 0.4) +
  labs(title = "Age and Term Deposit") +
  theme(plot.title = element_text(hjust = 0.5))+guides(fill=guide_legend(title="Term Deposit"))

## Distribution of customer marital status by Term Deposit
mar_counts <- bank_data %>% 
  count(Marital = factor(marital), Term_Deposit = factor(y)) %>% 
  mutate(pct = prop.table(n))
mar_counts$pct<-round(mar_counts$pct,digits = 3)
ggplot(mar_counts,aes(x = reorder(Marital,-pct), y = pct, fill = Term_Deposit, label = scales::percent(pct))) + 
  geom_col(position = 'dodge') + 
  geom_text(position = position_dodge(width = .9),    # move to center of bars
            vjust = -0.5,    # nudge above top of bar
            size = 3) + 
  scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 0)) + ggtitle("Marital Status v/s Term Deposit") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5)) + guides(fill=guide_legend(title="Term Deposit"))

# Statistical test between marital status variable and Term Deposit target variable
chisq.test(bank_data$marital, bank_data$y, correct=FALSE)

## 
##  Pearson's Chi-squared test
## 
## data:  bank_data$marital and bank_data$y
## X-squared = 122.66, df = 3, p-value < 2.2e-16

## table of job title and term deposit
library(tidyverse)
## checking any relation in job and the term deposit
job_counts<-as.data.frame(table(bank_data$job, bank_data$y))
job_counts<-job_counts %>% 
  pivot_wider(names_from=Var2, values_from=Freq)
job_counts<-as.data.frame(job_counts)
names(job_counts)<-c("Job Title","Term Deposit No","Term Deposit Yes") 
job_counts$TD_No_Per<-round((job_counts$`Term Deposit No`/sum(job_counts$`Term Deposit No`))*100,2)
job_counts$TD_Yes_Per<-round((job_counts$`Term Deposit Yes`/sum(job_counts$`Term Deposit Yes`))*100,2)
job_counts

##        Job Title Term Deposit No Term Deposit Yes TD_No_Per TD_Yes_Per
## 1         admin.            9070             1352     24.82      29.14
## 2    blue-collar            8616              638     23.57      13.75
## 3   entrepreneur            1332              124      3.64       2.67
## 4      housemaid             954              106      2.61       2.28
## 5     management            2596              328      7.10       7.07
## 6        retired            1286              434      3.52       9.35
## 7  self-employed            1272              149      3.48       3.21
## 8       services            3646              323      9.98       6.96
## 9        student             600              275      1.64       5.93
## 10    technician            6013              730     16.45      15.73
## 11    unemployed             870              144      2.38       3.10
## 12       unknown             293               37      0.80       0.80

## Distribution of Job variable
library(dplyr)
JB_counts <- bank_data %>% 
  count(Job = factor(job)) %>% 
  mutate(pct = prop.table(n))
JB_counts$pct<-round(JB_counts$pct,digits = 3)
ggplot(JB_counts,aes(x = reorder(Job,-pct), y = pct, fill = Job, label = scales::percent(pct))) + 
  geom_col(position = 'dodge') + 
  geom_text(position = position_dodge(width = .9),    # move to center of bars
            vjust = -0.5,    # nudge above top of bar
            size = 3) + 
  scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 90),legend.position="none") + ggtitle("Distribution of Job variable") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5))

# Statistical test between Job variable and Term Deposit target variable
chisq.test(bank_data$job, bank_data$y, correct=FALSE)

## 
##  Pearson's Chi-squared test
## 
## data:  bank_data$job and bank_data$y
## X-squared = 961.24, df = 11, p-value < 2.2e-16

## Distribution of education variable
ed_counts <- bank_data %>% 
  count(Education = factor(education)) %>% 
  mutate(pct = prop.table(n))
ed_counts$pct<-round(ed_counts$pct,digits = 3)
ggplot(ed_counts,aes(x = reorder(Education,-pct), y = pct, fill = Education, label = scales::percent(pct))) + 
  geom_col(position = 'dodge') + 
  geom_text(position = position_dodge(width = .9),    # move to center of bars
            vjust = -0.5,    # nudge above top of bar
            size = 3) + 
  scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 90),legend.position="none") + ggtitle("Distribution of Education variable") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5))

# Statistical test between Education variable and Term Deposit target variable
chisq.test(bank_data$education, bank_data$y, correct=FALSE)

## 
##  Pearson's Chi-squared test
## 
## data:  bank_data$education and bank_data$y
## X-squared = 193.11, df = 7, p-value < 2.2e-16

# Distribution of education variable by term deposit
edu_counts <- bank_data %>% 
  count(Education = factor(education), Term_Deposit = factor(y)) %>% 
  mutate(pct = prop.table(n))
edu_counts$pct<-round(edu_counts$pct,digits = 3)
ggplot(edu_counts,aes(x = reorder(Education,-pct), y = pct, fill = Term_Deposit, label = scales::percent(pct))) + 
  geom_col(position = 'dodge') + 
  geom_text(position = position_dodge(width = .9),    # move to center of bars
            vjust = -0.5,    # nudge above top of bar
            size = 3) + 
  scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 90),legend.position="none") + ggtitle("Education v/s Term Deposit") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5)) + guides(fill=guide_legend(title="Term Deposit"))

## Distribution of housing variable
hou_counts <- bank_data %>% 
  count(Housing = factor(housing)) %>% 
  mutate(pct = prop.table(n))
hou_counts$pct<-round(hou_counts$pct,digits = 3)
ggplot(hou_counts,aes(x = reorder(Housing,-pct), y = pct, fill = Housing, label = scales::percent(pct))) + 
  geom_col(position = 'dodge') + 
  geom_text(position = position_dodge(width = .9),    # move to center of bars
            vjust = -0.5,    # nudge above top of bar
            size = 3) + 
  scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 0),legend.position="none") + ggtitle("Distribution of Housing variable") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5))

## checking any relation in housing and the term deposit
hou_counts1 <- bank_data %>% 
  count(Housing = factor(housing), Term_Deposit = factor(y)) %>% 
  mutate(pct = prop.table(n))
hou_counts1$pct<-round(hou_counts1$pct,digits = 3)
ggplot(hou_counts1,aes(x = reorder(Housing,-pct), y = pct, fill = Term_Deposit, label = scales::percent(pct))) + 
  geom_col(position = 'dodge') + 
  geom_text(position = position_dodge(width = .9),    # move to center of bars
            vjust = -0.5,    # nudge above top of bar
            size = 3) + 
  scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 0),legend.position="none") + ggtitle("Housing v/s Term Deposit") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5)) + guides(fill=guide_legend(title="Term Deposit"))

## Distribution of Loan variable
ln_counts <- bank_data %>% 
  count(Loan = factor(loan)) %>% 
  mutate(pct = prop.table(n))
ln_counts$pct<-round(ln_counts$pct,digits = 3)
ggplot(ln_counts,aes(x = reorder(Loan,-pct), y = pct, fill = Loan, label = scales::percent(pct))) + 
  geom_col(position = 'dodge') + 
  geom_text(position = position_dodge(width = .9),    # move to center of bars
            vjust = -0.5,    # nudge above top of bar
            size = 3) + 
  scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 0),legend.position="none") + ggtitle("Distribution of Loan variable") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5))

## checking any relation in loan and the term deposit
loan_counts <- bank_data %>% 
  count(Loan = factor(loan), Term_Deposit = factor(y)) %>% 
  mutate(pct = prop.table(n))
loan_counts$pct<-round(loan_counts$pct,digits = 3)
ggplot(loan_counts,aes(x = reorder(Loan,-pct), y = pct, fill = Term_Deposit, label = scales::percent(pct))) + 
  geom_col(position = 'dodge') + 
  geom_text(position = position_dodge(width = .9),    # move to center of bars
            vjust = -0.5,    # nudge above top of bar
            size = 3) + 
  scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 0),legend.position="none") + ggtitle("Loan v/s Term Deposit") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5)) + guides(fill=guide_legend(title="Term Deposit"))

## checking any relation in month and the term deposit
mon_counts <- bank_data %>% 
  count(Month = factor(month), Term_Deposit = factor(y)) %>% 
  mutate(pct = prop.table(n))
mon_counts$pct<-round(mon_counts$pct,digits = 3)
ggplot(mon_counts,aes(x = reorder(Month,-pct), y = pct, fill = Term_Deposit, label = scales::percent(pct))) + 
  geom_col(position = 'dodge') + 
  geom_text(position = position_dodge(width = .9),    # move to center of bars
            vjust = -0.5,    # nudge above top of bar
            size = 3) + 
  scale_y_continuous(labels = scales::percent) + theme(axis.title.x=element_blank(),axis.text.x = element_text(angle = 0),legend.position="none") + ggtitle("Month v/s Term Deposit") + ylab("% of Records") + theme(plot.title = element_text(hjust = 0.5)) + guides(fill=guide_legend(title="Term Deposit"))

mon_cont_y_counts<-as.data.frame(table(bank_data$month, bank_data$contact, bank_data$y))
names(mon_cont_y_counts)<-c("Month","Contact","TermDepositYesNo","Freq")
ggplot(mon_cont_y_counts, aes(x = Month, y = Freq))+
  geom_bar(
    aes(fill = TermDepositYesNo), stat = "identity", color = "white",
    position = position_dodge(0.9)
  )+facet_wrap(~Contact)+guides(fill=guide_legend(title="Contact"))

Data Modelling (Classification)

Data Preparation for Modelling

library(knitr)
library(dplyr)
library(kableExtra)
str(bank_data)

## 'data.frame':    41188 obs. of  21 variables:
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : chr  "housemaid" "services" "services" "admin." ...
##  $ marital       : chr  "married" "married" "married" "married" ...
##  $ education     : chr  "basic.4y" "high.school" "high.school" "basic.6y" ...
##  $ default       : chr  "no" "unknown" "no" "no" ...
##  $ housing       : chr  "no" "no" "yes" "no" ...
##  $ loan          : chr  "no" "no" "no" "no" ...
##  $ contact       : chr  "telephone" "telephone" "telephone" "telephone" ...
##  $ month         : chr  "may" "may" "may" "may" ...
##  $ day_of_week   : chr  "mon" "mon" "mon" "mon" ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : chr  "nonexistent" "nonexistent" "nonexistent" "nonexistent" ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ y             : chr  "no" "no" "no" "no" ...

factor_cols <- c("job", "marital", "education", "default","housing","loan","contact","month","day_of_week","poutcome","y")
bank_data[,factor_cols] <- lapply(bank_data[,factor_cols], factor)
str(bank_data)

## 'data.frame':    41188 obs. of  21 variables:
##  $ age           : int  56 57 37 40 56 45 59 41 24 25 ...
##  $ job           : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
##  $ marital       : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
##  $ education     : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
##  $ default       : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
##  $ housing       : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
##  $ loan          : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
##  $ contact       : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
##  $ month         : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ day_of_week   : Factor w/ 6 levels "","fri","mon",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ duration      : int  261 149 226 151 307 198 139 217 380 50 ...
##  $ campaign      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays         : int  999 999 999 999 999 999 999 999 999 999 ...
##  $ previous      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome      : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ emp.var.rate  : num  1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 1.1 ...
##  $ cons.price.idx: num  94 94 94 94 94 ...
##  $ cons.conf.idx : num  -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 -36.4 ...
##  $ euribor3m     : num  4.86 4.86 4.86 4.86 4.86 ...
##  $ nr.employed   : num  5191 5191 5191 5191 5191 ...
##  $ y             : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

# Loading caret library
require(caret)
# Splitting the data into train and test
index <- createDataPartition(bank_data$y, p = .70, list = FALSE)
train <- bank_data[index, ]
test <- bank_data[-index, ]
dim(train)

## [1] 28832    21

dim(test)

## [1] 12356    21

table(train$y)

## 
##    no   yes 
## 25584  3248

table(test$y)

## 
##    no   yes 
## 10964  1392

Logistic Regression

# Loading package
library(caTools)
library(ROCR)
# Training the model
logistic_model <- glm(y ~ ., family = binomial(), train)

# Checking the model
summary(logistic_model)

## 
## Call:
## glm(formula = y ~ ., family = binomial(), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.2540  -0.3004  -0.1867  -0.1357   3.3536  
## 
## Coefficients: (1 not defined because of singularities)
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -2.453e+02  1.461e+02  -1.679  0.09309 .  
## age                           1.030e-03  2.919e-03   0.353  0.72410    
## jobblue-collar               -2.258e-01  9.506e-02  -2.376  0.01751 *  
## jobentrepreneur              -1.789e-01  1.467e-01  -1.220  0.22243    
## jobhousemaid                  3.037e-02  1.789e-01   0.170  0.86518    
## jobmanagement                -8.765e-02  1.028e-01  -0.853  0.39368    
## jobretired                    2.660e-01  1.279e-01   2.079  0.03762 *  
## jobself-employed             -8.928e-02  1.382e-01  -0.646  0.51814    
## jobservices                  -1.759e-01  1.045e-01  -1.682  0.09254 .  
## jobstudent                    3.176e-01  1.316e-01   2.413  0.01583 *  
## jobtechnician                 3.856e-02  8.494e-02   0.454  0.64986    
## jobunemployed                -1.258e-02  1.529e-01  -0.082  0.93443    
## jobunknown                    1.787e-03  2.700e-01   0.007  0.99472    
## maritalmarried                3.335e-02  8.288e-02   0.402  0.68740    
## maritalsingle                 4.874e-02  9.460e-02   0.515  0.60641    
## maritalunknown                3.079e-01  4.454e-01   0.691  0.48939    
## educationbasic.6y             1.937e-02  1.480e-01   0.131  0.89591    
## educationbasic.9y             8.790e-02  1.128e-01   0.779  0.43592    
## educationhigh.school          5.195e-02  1.098e-01   0.473  0.63621    
## educationilliterate           1.598e+00  9.302e-01   1.718  0.08580 .  
## educationprofessional.course  1.189e-01  1.209e-01   0.984  0.32508    
## educationuniversity.degree    2.147e-01  1.097e-01   1.957  0.05032 .  
## educationunknown              2.325e-01  1.423e-01   1.634  0.10227    
## defaultunknown               -3.191e-01  8.094e-02  -3.943 8.06e-05 ***
## defaultyes                   -8.279e+00  1.869e+02  -0.044  0.96467    
## housingunknown               -1.573e-01  1.702e-01  -0.924  0.35534    
## housingyes                    9.657e-03  4.944e-02   0.195  0.84513    
## loanunknown                          NA         NA      NA       NA    
## loanyes                      -1.331e-01  6.947e-02  -1.916  0.05541 .  
## contacttelephone             -5.478e-01  9.002e-02  -6.086 1.16e-09 ***
## monthaug                      9.041e-01  1.446e-01   6.251 4.08e-10 ***
## monthdec                      3.836e-01  2.493e-01   1.539  0.12391    
## monthjul                      1.746e-01  1.155e-01   1.512  0.13045    
## monthjun                     -4.677e-01  1.518e-01  -3.082  0.00206 ** 
## monthmar                      2.068e+00  1.734e-01  11.927  < 2e-16 ***
## monthmay                     -4.026e-01  9.888e-02  -4.072 4.66e-05 ***
## monthnov                     -3.551e-01  1.447e-01  -2.454  0.01412 *  
## monthoct                      3.373e-01  1.843e-01   1.831  0.06717 .  
## monthsep                      3.762e-01  2.143e-01   1.755  0.07924 .  
## day_of_weekfri                8.499e+00  1.388e+02   0.061  0.95117    
## day_of_weekmon                8.385e+00  1.388e+02   0.060  0.95182    
## day_of_weekthu                8.557e+00  1.388e+02   0.062  0.95083    
## day_of_weektue                8.553e+00  1.388e+02   0.062  0.95086    
## day_of_weekwed                8.712e+00  1.388e+02   0.063  0.94994    
## duration                      4.666e-03  8.863e-05  52.646  < 2e-16 ***
## campaign                     -3.976e-02  1.370e-02  -2.901  0.00371 ** 
## pdays                        -9.868e-04  2.512e-04  -3.929 8.55e-05 ***
## previous                     -4.538e-02  7.004e-02  -0.648  0.51704    
## poutcomenonexistent           4.538e-01  1.126e-01   4.030 5.58e-05 ***
## poutcomesuccess               9.917e-01  2.455e-01   4.039 5.37e-05 ***
## emp.var.rate                 -1.766e+00  1.690e-01 -10.448  < 2e-16 ***
## cons.price.idx                2.154e+00  3.005e-01   7.168 7.62e-13 ***
## cons.conf.idx                 1.952e-02  9.344e-03   2.089  0.03672 *  
## euribor3m                     3.159e-01  1.545e-01   2.044  0.04094 *  
## nr.employed                   6.102e-03  3.723e-03   1.639  0.10121    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 20299  on 28831  degrees of freedom
## Residual deviance: 11972  on 28778  degrees of freedom
## AIC: 12080
## 
## Number of Fisher Scoring iterations: 11

# Predicting in the test dataset
pred_prob <- predict(logistic_model, test, type = "response")

# Converting from probability to actual output
test$pred_class <- ifelse(pred_prob >= 0.5, "yes", "no")
test$pred_class <- as.factor(test$pred_class)
# Generating the classification table
ctab_test <- table(test$y, test$pred_class)
ctab_test

##      
##          no   yes
##   no  10663   301
##   yes   813   579

library(pROC)
roc <- roc(train$y, logistic_model$fitted.values)
auc(roc)

## Area under the curve: 0.9365

## Accuracy in Test dataset
# Accuracy = (TP + TN)/(TN + FP + FN + TP)
accuracy_test <- sum(diag(ctab_test))/sum(ctab_test)
accuracy_test

## [1] 0.9098414

#Precision = TP/FP + TP (Precision indicates how often does your predicted TRUE values are actually TRUE.)
# Precision in Test dataset
Precision <- (ctab_test[2, 2]/sum(ctab_test[, 2]))
Precision

## [1] 0.6579545

# Recall Or TPR = TP/(FN + TP) (Recall or TPR indicates how often does our model predicts actual TRUE from the overall TRUE events.)
# Recall in Train dataset
Recall <- (ctab_test[2, 2]/sum(ctab_test[2, ]))
Recall

## [1] 0.4159483

# F1 score (F-Score is a harmonic mean of recall and precision. The score value lies between 0 and 1. The value of 1 represents perfect precision & recall. The value 0 represents the worst case.)
F_Score <- (2 * Precision * Recall / (Precision + Recall))
F_Score

## [1] 0.5096831

metric_eval <- data.frame(matrix(ncol = 6, nrow = 0))
x <- c("Model_Name", "Accuracy", "Precision","Recall", "F1_score", "AUC")
colnames(metric_eval) <- x
library(caret)
lgr_val <- c("Logistic Regression",accuracy_test,Precision,Recall,F_Score,auc(roc))
metric_eval <- rbind(metric_eval,lgr_val)
names(metric_eval)<-x
## making null for predicted column created in test data
test$pred_class<-NULL

Decision Tree

library(caTools)
library(knitr)
set.seed(123)
library(rpart)
classifier = rpart(formula = y ~ .,
                   data = train)
# Predicting the Test set results
names(test)

##  [1] "age"            "job"            "marital"        "education"     
##  [5] "default"        "housing"        "loan"           "contact"       
##  [9] "month"          "day_of_week"    "duration"       "campaign"      
## [13] "pdays"          "previous"       "poutcome"       "emp.var.rate"  
## [17] "cons.price.idx" "cons.conf.idx"  "euribor3m"      "nr.employed"   
## [21] "y"

y_pred = predict(classifier,
                 newdata = test,
                 type = 'prob')[,2]

library(pROC)
tree.roc <- roc(test$y, y_pred)
dt_auc<-tree.roc$auc[1]
## for confusion matrix evaluation
y_pred = predict(classifier,
                 newdata = test,
                 type = 'class')
# Making the Confusion Matrix
library(caret)
cm<-confusionMatrix(as.factor(y_pred), test$y, mode = "everything", positive="yes")
cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  10615   758
##        yes   349   634
##                                           
##                Accuracy : 0.9104          
##                  95% CI : (0.9052, 0.9154)
##     No Information Rate : 0.8873          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.486           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.45546         
##             Specificity : 0.96817         
##          Pos Pred Value : 0.64496         
##          Neg Pred Value : 0.93335         
##               Precision : 0.64496         
##                  Recall : 0.45546         
##                      F1 : 0.53389         
##              Prevalence : 0.11266         
##          Detection Rate : 0.05131         
##    Detection Prevalence : 0.07956         
##       Balanced Accuracy : 0.71181         
##                                           
##        'Positive' Class : yes             
##

dt_val <- c("Decision Tree",cm$overall[1],cm$byClass[5],cm$byClass[6],cm$byClass[7],dt_auc)
metric_eval <- rbind(metric_eval,dt_val)
names(metric_eval)<-x

Random Forest

library(randomForest)
library(knitr)
library(randomForest)
# Random Forest for classification
classifier_RF = randomForest(x = train[-21],
                             y = train$y,
                             ntree = 500)

classifier_RF

## 
## Call:
##  randomForest(x = train[-21], y = train$y, ntree = 500) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 8.58%
## Confusion matrix:
##        no  yes class.error
## no  24642  942  0.03681989
## yes  1531 1717  0.47136700

# Predicting the Test set results
y_pred_rf = predict(classifier_RF, newdata = test[-21])
# Plot the error vs The number of trees graph
plot(classifier_RF)

# Variable importance plot
varImpPlot(classifier_RF)

# confusion matrix
cm<-confusionMatrix(y_pred_rf, test$y, mode = "everything", positive="yes")
cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    no   yes
##        no  10560   655
##        yes   404   737
##                                           
##                Accuracy : 0.9143          
##                  95% CI : (0.9092, 0.9192)
##     No Information Rate : 0.8873          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5347          
##                                           
##  Mcnemar's Test P-Value : 1.562e-14       
##                                           
##             Sensitivity : 0.52945         
##             Specificity : 0.96315         
##          Pos Pred Value : 0.64592         
##          Neg Pred Value : 0.94160         
##               Precision : 0.64592         
##                  Recall : 0.52945         
##                      F1 : 0.58192         
##              Prevalence : 0.11266         
##          Detection Rate : 0.05965         
##    Detection Prevalence : 0.09234         
##       Balanced Accuracy : 0.74630         
##                                           
##        'Positive' Class : yes             
##

require(pROC)
rf.roc<-roc(train$y,classifier_RF$votes[,2])
plot(rf.roc)

rf_auc<-auc(rf.roc)[1]
rf_val <- c("Random Forest",cm$overall[1],cm$byClass[5],cm$byClass[6],cm$byClass[7],rf_auc)
metric_eval <- rbind(metric_eval,rf_val)
colnames(metric_eval) <- x
metric_eval$Accuracy<-round(as.numeric(metric_eval$Accuracy),digits = 4)
metric_eval$Precision<-round(as.numeric(metric_eval$Precision),digits = 4)
metric_eval$Recall<-round(as.numeric(metric_eval$Recall),digits = 4)
metric_eval$F1_score<-round(as.numeric(metric_eval$F1_score),digits = 4)
metric_eval$AUC<-round(as.numeric(metric_eval$AUC),digits = 4)
metric_eval

##            Model_Name Accuracy Precision Recall F1_score    AUC
## 1 Logistic Regression   0.9098    0.6580 0.4159   0.5097 0.9365
## 2       Decision Tree   0.9104    0.6450 0.4555   0.5339 0.8651
## 3       Random Forest   0.9143    0.6459 0.5295   0.5819 0.9439