1.0.3.1 Main Objective
To build a model that predicts if the client will subscribe to a term deposit or not
Term deposits are a major source of income for a bank. A term deposit is a cash investment held at a financial institution. Your money is invested for an agreed rate of interest over a fixed amount of time, or term. The bank has various outreach plans to sell term deposits to their customers such as email marketing, advertisements, telephonic marketing, and digital marketing.
Telephonic marketing campaigns remain one of the most effective ways to reach out to people. However, they require a huge investment, as large call centers are hired to execute them. Hence, it is crucial to identify beforehand the customers most likely to convert so that they can be specifically targeted by call. The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution.
Specific objectives:
Answer questions derived from our main objective.
Find and deal with outliers, anomalies, and missing data within the data set.
Perform EDA.
Build a model that predicts whether a client will subscribe to a term deposit or not (the best model should have a Balanced Accuracy score above 80%).
Provide a conclusion and recommendations from our insights.
Loading Important Libraries
library(data.table)
library(dplyr)
library(tidyverse)
library(ggplot2)
We have two data sets, i.e. train and test; we will load them separately as follows:
a. Loading the train data set
library(readr)
train <- read_delim("train.csv", delim = ";",
escape_double = FALSE, trim_ws = TRUE)
Previewing the first six rows
head(train)
Checking the number of rows and columns
dim(train)
## [1] 45211 17
The train data set has 45,211 rows and 17 columns.
Checking the data types
str(train)
## spec_tbl_df [45,211 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:45211] 58 44 33 47 33 35 28 42 58 43 ...
## $ job : chr [1:45211] "management" "technician" "entrepreneur" "blue-collar" ...
## $ marital : chr [1:45211] "married" "single" "married" "married" ...
## $ education: chr [1:45211] "tertiary" "secondary" "secondary" "unknown" ...
## $ default : chr [1:45211] "no" "no" "no" "no" ...
## $ balance : num [1:45211] 2143 29 2 1506 1 ...
## $ housing : chr [1:45211] "yes" "yes" "yes" "yes" ...
## $ loan : chr [1:45211] "no" "no" "yes" "no" ...
## $ contact : chr [1:45211] "unknown" "unknown" "unknown" "unknown" ...
## $ day : num [1:45211] 5 5 5 5 5 5 5 5 5 5 ...
## $ month : chr [1:45211] "may" "may" "may" "may" ...
## $ duration : num [1:45211] 261 151 76 92 198 139 217 380 50 55 ...
## $ campaign : num [1:45211] 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : num [1:45211] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ previous : num [1:45211] 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : chr [1:45211] "unknown" "unknown" "unknown" "unknown" ...
## $ y : chr [1:45211] "no" "no" "no" "no" ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. job = col_character(),
## .. marital = col_character(),
## .. education = col_character(),
## .. default = col_character(),
## .. balance = col_double(),
## .. housing = col_character(),
## .. loan = col_character(),
## .. contact = col_character(),
## .. day = col_double(),
## .. month = col_character(),
## .. duration = col_double(),
## .. campaign = col_double(),
## .. pdays = col_double(),
## .. previous = col_double(),
## .. poutcome = col_character(),
## .. y = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
We have a mixture of numeric and categorical variables.
b. Loading the test data set
library(readr)
test <- read_delim("test.csv", delim = ";",
escape_double = FALSE, trim_ws = TRUE)
Previewing the first six rows
head(test)
Checking the number of rows and columns
dim(test)
## [1] 4521 17
The test data set has 4,521 rows and 17 columns.
Previewing our test data types
str(test)
## spec_tbl_df [4,521 × 17] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:4521] 30 33 35 30 59 35 36 39 41 43 ...
## $ job : chr [1:4521] "unemployed" "services" "management" "management" ...
## $ marital : chr [1:4521] "married" "married" "single" "married" ...
## $ education: chr [1:4521] "primary" "secondary" "tertiary" "tertiary" ...
## $ default : chr [1:4521] "no" "no" "no" "no" ...
## $ balance : num [1:4521] 1787 4789 1350 1476 0 ...
## $ housing : chr [1:4521] "no" "yes" "yes" "yes" ...
## $ loan : chr [1:4521] "no" "yes" "no" "yes" ...
## $ contact : chr [1:4521] "cellular" "cellular" "cellular" "unknown" ...
## $ day : num [1:4521] 19 11 16 3 5 23 14 6 14 17 ...
## $ month : chr [1:4521] "oct" "may" "apr" "jun" ...
## $ duration : num [1:4521] 79 220 185 199 226 141 341 151 57 313 ...
## $ campaign : num [1:4521] 1 1 1 4 1 2 1 2 2 1 ...
## $ pdays : num [1:4521] -1 339 330 -1 -1 176 330 -1 -1 147 ...
## $ previous : num [1:4521] 0 4 1 0 0 3 2 0 0 2 ...
## $ poutcome : chr [1:4521] "unknown" "failure" "failure" "unknown" ...
## $ y : chr [1:4521] "no" "no" "no" "no" ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. job = col_character(),
## .. marital = col_character(),
## .. education = col_character(),
## .. default = col_character(),
## .. balance = col_double(),
## .. housing = col_character(),
## .. loan = col_character(),
## .. contact = col_character(),
## .. day = col_double(),
## .. month = col_character(),
## .. duration = col_double(),
## .. campaign = col_double(),
## .. pdays = col_double(),
## .. previous = col_double(),
## .. poutcome = col_character(),
## .. y = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
For data cleaning, we will start with the train data set.
Train data set
is.null(train)
## [1] FALSE
colSums(is.na(train))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
We have no missing values.
duplicated_rows <- train[duplicated(train),]
duplicated_rows
We have no duplicates.
boxplot(train$balance, ylab = "average yearly balance, in euros", main = 'Average Yearly Balance')
We have outliers in the balance column.
ggplot(train) +
aes(x = "", y =age) +
geom_boxplot(fill = "#0c4c8a") +
theme_minimal() + labs(title = 'Age')
We have outliers in age.
ggplot(train) +
aes(x = "", y =day) +
geom_boxplot(fill = "#0c4c8a") +
theme_minimal() + labs(title = 'Last contact day of the month')
There are no outliers in the day column.
boxplot(train$duration, ylab = "last contact duration, in seconds", main = 'Last contact duration')
boxplot(train$campaign, ylab = "number of contacts performed for this client", main = 'Number of contacts')
Most of the numeric columns have outliers, but we will not drop them since they are significant for our analysis.
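To quantify the outliers rather than only eyeballing the boxplots, a minimal sketch of an IQR-based count per numeric column could look as follows (the 1.5 × IQR rule mirrors the boxplot whiskers; the helper function name is our own):
# Count boxplot-style outliers (values beyond 1.5 * IQR from the quartiles) per numeric column
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}
sapply(train[, c("age", "balance", "duration", "campaign", "pdays", "previous")], count_outliers)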
Test data set
is.null(test)
## [1] FALSE
colSums(is.na(test))
## age job marital education default balance housing loan
## 0 0 0 0 0 0 0 0
## contact day month duration campaign pdays previous poutcome
## 0 0 0 0 0 0 0 0
## y
## 0
We have no missing values.
duplicated_rows <- test[duplicated(test),]
duplicated_rows
We have no duplicates.
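As a quicker numeric check than printing the filtered tables, one could simply count duplicated rows in both sets; a minimal sketch:
# Number of fully duplicated rows in each data set (0 means no duplicates)
c(train = sum(duplicated(train)), test = sum(duplicated(test)))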
We will combine the two tables for EDA.
df <- rbind(train, test)
head(df)
Age distribution of the customers
hist((df$age),
main = "Customer age distribution",
xlab = 'Age',
ylab = 'count',
col = "blue")The age bracket of most clients was 35 years, there was an extreme of 95 years and 18 years
Education level Distribution of the customers
edu <- (df$education)
edu.frequency <- table(edu)
edu.frequency
## edu
## primary secondary tertiary unknown
## 7529 25508 14651 2044
barplot(edu.frequency,
main="Distribution of Education level among the customer",
xlab="Education Level",
ylab = "Frequency",
col=c("magenta","blue", "green", "yellow"),
)
Most of our customers had some form of education: the largest group had reached secondary education, followed by tertiary, and the smallest group were those who did not disclose their education level.
Job types distribution
job <- (df$job)
job.frequency <- table(job)
job.frequency
## job
## admin. blue-collar entrepreneur housemaid management
## 5649 10678 1655 1352 10427
## retired self-employed services student technician
## 2494 1762 4571 1022 8365
## unemployed unknown
## 1431 326
ggplot(df, aes(x=job)) + geom_bar() + ggtitle("Customers Job Type Distribution") + coord_flip()
Most of the clients in the campaign worked in blue-collar, management, and technician roles, with the fewest being students and those who did not disclose their jobs.
Marital status
marital <- (df$marital)
marital.frequency <- table(marital)
marital.frequency
## marital
## divorced married single
## 5735 30011 13986
barplot(marital.frequency,
main="Customers Marital Status",
xlab="Marital Status",
ylab = "Frequency",
col=c("magenta","blue", "red"),
)
Most of the customers participating in the campaigns were married, followed by single and finally divorced customers.
Credit status
default <- (df$default)
default.frequency <- table(default)
default.frequency
## default
## no yes
## 48841 891
barplot(default.frequency,
main="Distribution of Customers on Default Credit",
xlab="Has Credit in default",
ylab = "Frequency",
col=c("magenta","blue"),
)
The graph above shows that most customers do not have credit in default.
Housing Loan
housing <- (df$housing)
housing.frequency <- table(housing)
housing.frequency
## housing
## no yes
## 22043 27689
barplot(housing.frequency,
main="Customer Housing Loan Distribution",
xlab="Housing Loan",
ylab = "Frequency",
col=c("magenta","blue"),
)
Most of the customers have a housing loan.
Personal loan
loan <- (df$loan)
loan.frequency <- table(loan)
loan.frequency
## loan
## no yes
## 41797 7935
barplot(loan.frequency,
main="Customer's Personal Loan Distribution Status",
xlab="Personal Loan",
ylab = "Frequency",
col=c("magenta","blue"),
)
Most customers do not have a personal loan.
Outcome of the previous marketing campaign
outcome <- (df$poutcome)
outcome.frequency <- table(outcome)
outcome.frequency
## outcome
## failure other success unknown
## 5391 2037 1640 40664
barplot(outcome.frequency,
main="Previous Marketing Campaign Outcome",
xlab="Previous campaign Outcome",
ylab = "Frequency",
col=c("magenta","blue", "grey", "black"),
)
The graph shows that the previous campaign outcome is unknown for most customers, and success was the least common outcome among the known results.
Subscription to term deposit
sb <- (df$y)
sb.frequency <- table(sb)
sb.frequency
## sb
## no yes
## 43922 5810
barplot(sb.frequency,
main="Term Deposit Subscription",
xlab="Subscription to term deposit",
ylab = "Frequency",
col=c("Purple","green"),
)
The graph shows that most customers did not subscribe to a term deposit.
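To put the imbalance in numbers, the class proportions can be derived from the same frequency table; a minimal sketch (from the counts above, only about 12% of customers subscribed):
# Share of customers in each subscription class, rounded to three decimal places
round(prop.table(sb.frequency), 3)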
library(reshape2)
Comparing age vs average yearly balance
plot((df$age), (df$balance),
main = "Age vs Average yearly Balance",
xlab = 'Age',
ylab = 'Average yearly balance')
Across all ages, most customers' average yearly balances are concentrated near the lower end of the range; however, between roughly 40 and 60 years of age there are outliers at the upper end.
Does having a housing loan affect whether a client subscribed to a term deposit or not?
library(plyr)
counts <- ddply(df, .(df$y, df$housing), nrow)
names(counts) <- c("term deposit", "housing loan", "Freq")
counts
The table shows that most people with a housing loan did not subscribe to a term deposit.
We can see this visually
ggplot(df, aes(fill=y, x=housing)) + geom_bar(position = "dodge" ) + labs(title = 'Housing loan vs Term deposit subscription',
x = 'Housing Loan', y = 'Customer count')
We can therefore answer our objective: having a housing loan does affect whether someone subscribes to a term deposit. Most of the people who subscribed did not have a housing loan.
Does having a Personal loan affect whether a client subscribed to a term deposit or not?
loan_counts <- ddply(df, .(df$y, df$loan), nrow)
names(loan_counts) <- c("Term deposit", "Personal loan", "Freq")
loan_counts
The table shows that most people with a personal loan did not subscribe to a term deposit.
ggplot(df, aes(fill=y, x=loan)) + geom_bar(position = "dodge" ) + labs(title = 'Personal loan vs Term deposit subscription',
x = 'Personal Loan', y = 'Customer count')
We can therefore answer our objective: having a personal loan does affect whether someone subscribes to a term deposit. Most of the people who subscribed did not have a personal loan.
Does previous campaign success lead to term deposit subscription in the current campaign?
previous_outcome <- ddply(df, .(df$y, df$poutcome), nrow)
names(previous_outcome) <- c("Term deposit", "Previous outcome", "Freq")
previous_outcome
From this table we can see that previous success did indeed lead to current success.
ggplot(df, aes(fill=y, x=poutcome)) + geom_bar(position = "dodge" ) + labs(title = 'Previous campaign outcome vs Term deposit subscription',
x = 'Previous campaign outcome', y = 'Customer count')
Customers whose previous campaign ended in success were more likely to subscribe in the current campaign.
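Because the group sizes differ so much, within-group subscription rates make this comparison clearer than raw counts; a minimal base-R sketch:
# Row-wise proportions: share of "yes"/"no" within each previous campaign outcome
round(prop.table(table(df$poutcome, df$y), margin = 1), 3)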
Does having credit on default affect term deposit subscription?
default_count <- ddply(df, .(df$y, df$default), nrow)
names(default_count) <- c("term deposit", "Credit by Default", "Freq")
default_count
ggplot(df, aes(fill=y, x=default)) + geom_bar(position = "dodge" ) + labs(title = 'Customers default credit status vs Term deposit subscription',
x = 'Customers default credit status', y = 'Customer count')
The graph and table above show that having credit in default does not lead to term deposit subscription.
Job type vs Term deposit subscription
job_count <- ddply(df, .(df$job, df$y), nrow)
names(job_count) <- c("Job type", "term deposit", "Freq")
job_count
The table above shows that most people who subscribed to a term deposit worked in management, followed by blue-collar and administrative jobs.
Marital status vs Term deposit subscription
maritalstatus <- ddply(df, .(df$marital, df$y), nrow)
names(maritalstatus) <- c("maritalstatus", "Term Deposit", "Freq")
maritalstatus
ggplot(df, aes(fill=y, x=marital)) + geom_bar(position = "dodge" ) + labs(title = 'Customers marital status vs Term deposit subscription',
x = 'Customers Marital status', y = 'Customer count')
Married people made up the largest group of subscribers; however, they were also the majority of customers in the campaign.
Education Level vs Term deposit subscription
edu_count<- ddply(df, .(df$education, df$y), nrow)
names(edu_count) <- c("Education level", "term deposit", "Freq")
edu_count
ggplot(df, aes(fill=y, x=education)) + geom_bar(position = "dodge" ) + labs(title = 'Customers education level vs Term deposit subscription',
x = 'Customers education level', y = 'Customer count')
As observed earlier, most customers had secondary education. Proportionally, however, tertiary education holders subscribed to a term deposit at a higher rate than customers at other education levels.
Did multiple campaign contacts (calls) lead to a term deposit subscription?
campaign_count<- ddply(df, .(df$campaign, df$y), nrow)
names(campaign_count) <- c("Campaign", "term deposit", "Freq")
campaign_count
The table above shows that multiple contacts during the campaign did not result in more subscriptions; most of the people who subscribed to a term deposit were contacted only once.
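The subscription rate by number of contacts shows this more directly; a minimal base-R sketch (bucketing four or more contacts together is our own choice):
# Subscription rate per number of campaign contacts, with 4 or more contacts grouped together
contacts <- ifelse(df$campaign >= 4, "4+", as.character(df$campaign))
round(tapply(df$y == "yes", contacts, mean), 3)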
Getting a summary of the variables
summary(df)
## age job marital education
## Min. :18.00 Length:49732 Length:49732 Length:49732
## 1st Qu.:33.00 Class :character Class :character Class :character
## Median :39.00 Mode :character Mode :character Mode :character
## Mean :40.96
## 3rd Qu.:48.00
## Max. :95.00
## default balance housing loan
## Length:49732 Min. : -8019 Length:49732 Length:49732
## Class :character 1st Qu.: 72 Class :character Class :character
## Mode :character Median : 448 Mode :character Mode :character
## Mean : 1368
## 3rd Qu.: 1431
## Max. :102127
## contact day month duration
## Length:49732 Min. : 1.00 Length:49732 Min. : 0.0
## Class :character 1st Qu.: 8.00 Class :character 1st Qu.: 103.0
## Mode :character Median :16.00 Mode :character Median : 180.0
## Mean :15.82 Mean : 258.7
## 3rd Qu.:21.00 3rd Qu.: 320.0
## Max. :31.00 Max. :4918.0
## campaign pdays previous poutcome
## Min. : 1.000 Min. : -1.00 Min. : 0.0000 Length:49732
## 1st Qu.: 1.000 1st Qu.: -1.00 1st Qu.: 0.0000 Class :character
## Median : 2.000 Median : -1.00 Median : 0.0000 Mode :character
## Mean : 2.767 Mean : 40.16 Mean : 0.5769
## 3rd Qu.: 3.000 3rd Qu.: -1.00 3rd Qu.: 0.0000
## Max. :63.000 Max. :871.00 Max. :275.0000
## y
## Length:49732
## Class :character
## Mode :character
##
##
##
The summary above shows the following:
* The minimum age was 18 years and the maximum was 95 years, with a mean of about 41.
* The minimum average yearly balance was -8,019 euros and the maximum was 102,127 euros, with a mean of about 1,368.
* The day of the month on which clients were last contacted ranged from 1 to 31, with a mean of about 16.
* The minimum number of contacts performed during this campaign for a particular client was 1 and the maximum was 63, with a mean of about 2.8.
Checking for correlation
library(corrplot)
numeric <- df %>%
  select_if(is.numeric) %>%
  select("age", "balance", "duration", "day", "campaign", "pdays", "previous")
corrplot(cor(numeric))
No strong correlations are observed among the numeric columns.
We will perform our modeling using supervised methods, then challenge the solution with an unsupervised approach.
Loading important libraries
library(caTools)
library(party)
library(dplyr)
library(magrittr)
library(randomForest)
library(e1071)
library(caTools)
library(class)
library(rpart)
library(rpart.plot)
library(caret)
library(caretEnsemble)
library(psych)
library(Amelia)
library(mice)
library(GGally)
Previewing our combined data set.
head(df)
Selecting numeric columns
num <- df[, c(1,6,10,12:15)]
head(num)
Selecting categorical columns
cat <- df[, c(2:5,7:9,11,16,17)]
head(cat)
Label encoding our categorical columns
library(superml)
label <- LabelEncoder$new()
cat$job <- label$fit_transform(cat$job)
cat$marital <- label$fit_transform(cat$marital)
cat$education <- label$fit_transform(cat$education)
cat$default <- label$fit_transform(cat$default)
cat$housing <- label$fit_transform(cat$housing)
cat$loan <- label$fit_transform(cat$loan)
cat$contact <- label$fit_transform(cat$contact)
cat$month <- label$fit_transform(cat$month)
cat$poutcome <- label$fit_transform(cat$poutcome)
cat$y <- label$fit_transform(cat$y)
head(cat)
Joining the numeric and categorical data
data <-cbind(num, cat)
head(data)
We will also perform feature selection to remove redundant features from our data set.
correlationMatrix <- cor(data)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.70)
Dataset2 <- data[-highlyCorrelated]
par(mfrow = c(1, 2))
corrplot(correlationMatrix, order = "hclust")
corrplot(cor(Dataset2), order = "hclust")
From the plots above we can see there are no highly correlated features (above the 0.70 cutoff), so none were removed.
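One caveat worth noting: when findCorrelation() returns an empty index vector, data[-integer(0)] selects zero columns in base R, so guarding with a length check is safer; a minimal sketch of the same step with that guard:
# Guard against an empty index from findCorrelation(); data[-integer(0)] would drop every column
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.70)
Dataset2 <- if (length(highlyCorrelated) > 0) data[, -highlyCorrelated] else data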
Previewing our classes
head(data)
class <- (data$y)
class.frequency <- table(class)
class.frequency
## class
## 0 1
## 43922 5810
The frequency table shows a large class imbalance, which we will address before moving forward.
library(imbalance)
Selecting the two classes in the data set
df_p <- which(data$y == "0")
df_n <- which(data$y == "1")
Undersampling the majority class.
nsample <- 5810
pick_negative <- sample(df_p, nsample)
undersample_df1 <- data[c(df_n, pick_negative), ]
dim(undersample_df1)
## [1] 11620 17
The result is a new, balanced data set with 11,620 rows and 17 columns.
Previewing our response variable class
table(undersample_df1$y)
##
## 0 1
## 5810 5810
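Since sample() is random, the undersampled rows will differ between runs; setting a seed before sampling makes the result reproducible. A minimal sketch of the same step (the seed value is an arbitrary choice):
# Reproducible undersampling: fix the RNG seed before drawing the majority-class sample
set.seed(42)
pick_negative <- sample(df_p, nsample)
undersample_df1 <- data[c(df_n, pick_negative), ]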
Now we will split our data into train and test sets.
train.size = floor(0.75*nrow(undersample_df1))
train.index = sample(1:nrow(undersample_df1), train.size)
train.set = undersample_df1[train.index,]
test.set = undersample_df1[-train.index,]
x.train = train.set[,-17]
x.test = test.set[,-17]
y.train = train.set[,17]
y.test = test.set[,17]
Fitting KNN model
knn.3 <- knn(train = x.train, test = x.test, cl = y.train , k = 5)
def = table(predicted = knn.3, true = y.test)
def
## true
## predicted 0 1
## 0 1121 338
## 1 353 1093
confusionMatrix(def)
## Confusion Matrix and Statistics
##
## true
## predicted 0 1
## 0 1121 338
## 1 353 1093
##
## Accuracy : 0.7621
## 95% CI : (0.7462, 0.7775)
## No Information Rate : 0.5074
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5242
##
## Mcnemar's Test P-Value : 0.5943
##
## Sensitivity : 0.7605
## Specificity : 0.7638
## Pos Pred Value : 0.7683
## Neg Pred Value : 0.7559
## Prevalence : 0.5074
## Detection Rate : 0.3859
## Detection Prevalence : 0.5022
## Balanced Accuracy : 0.7622
##
## 'Positive' Class : 0
##
The model gives us a balanced accuracy of 76.22% before any further tuning is performed.
Parameter tuning
Creating a standardization function
standardize = function(x){
z <- (x - mean(x)) / sd(x)
return( z)
}
Applying the function to the data set
undersample_df2 <-
apply(undersample_df1, 2, standardize)
head(undersample_df2)
## age balance day duration campaign pdays previous
## 84 1.48497332 0.2439696 -1.259277 1.8747220 -0.5374291 -0.4841538 -0.3742031
## 87 1.23327436 -0.4747472 -1.259277 3.0799292 -0.5374291 -0.4841538 -0.3742031
## 88 -0.02522044 -0.0916192 -1.259277 2.8587382 -0.5374291 -0.4841538 -0.3742031
## 130 1.14937471 0.2855664 -1.259277 0.5617552 -0.5374291 -0.4841538 -0.3742031
## 169 1.06547506 -0.4312739 -1.259277 0.8283186 -0.1884531 -0.4841538 -0.3742031
## 271 0.05867922 -0.4888213 -1.259277 0.5135469 -0.1884531 -0.4841538 -0.3742031
## job marital education default housing loan contact
## 84 0.7357723 -0.7841811 0.02026809 -0.1271889 -1.0484622 -0.389973 -1.705916
## 87 0.7357723 -0.7841811 0.02026809 -0.1271889 0.9536957 -0.389973 -1.705916
## 88 -0.8076016 -0.7841811 0.02026809 -0.1271889 -1.0484622 -0.389973 -1.705916
## 130 1.0444471 -0.7841811 0.02026809 -0.1271889 -1.0484622 -0.389973 -1.705916
## 169 0.7357723 -0.7841811 -1.02646594 -0.1271889 0.9536957 -0.389973 -1.705916
## 271 -1.1162764 0.6596763 -1.02646594 -0.1271889 -1.0484622 2.564060 -1.705916
## month poutcome y
## 84 -1.032287 -0.5178625 0.999957
## 87 -1.032287 -0.5178625 0.999957
## 88 -1.032287 -0.5178625 0.999957
## 130 -1.032287 -0.5178625 0.999957
## 169 -1.032287 -0.5178625 0.999957
## 271 -1.032287 -0.5178625 0.999957
train1.size = floor(0.75*nrow(undersample_df2))
train1.index = sample(1:nrow(undersample_df2), train1.size)
train1.set = undersample_df2[train1.index,]
test1.set = undersample_df2[-train1.index,]
x.train1 = train1.set[,-17]
x.test1 = test1.set[,-17]
y.train1 = train1.set[,17]
y.test1 = test1.set[,17]
knn5 <- knn(train = x.train1, test = x.test1, cl = y.train1 , k = 5)
defp = table(predicted = knn5, true = y.test1)
confusionMatrix(defp)
## Confusion Matrix and Statistics
##
## true
## predicted -0.999956969814305 0.999956969814305
## -0.999956969814305 1214 331
## 0.999956969814305 271 1089
##
## Accuracy : 0.7928
## 95% CI : (0.7776, 0.8074)
## No Information Rate : 0.5112
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.585
##
## Mcnemar's Test P-Value : 0.01619
##
## Sensitivity : 0.8175
## Specificity : 0.7669
## Pos Pred Value : 0.7858
## Neg Pred Value : 0.8007
## Prevalence : 0.5112
## Detection Rate : 0.4179
## Detection Prevalence : 0.5318
## Balanced Accuracy : 0.7922
##
## 'Positive' Class : -0.999956969814305
##
After standardizing the features, the model's balanced accuracy improved to 79.22%. (Note that the standardization was applied to every column, including the response y, which is why the class labels in the confusion matrix above appear as ±0.99996 rather than 0 and 1.)
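Beyond feature scaling, the neighbourhood size k itself could also be tuned. A minimal sketch on the earlier unstandardized split (x.train/x.test), using plain accuracy as a quick comparison metric and an arbitrary set of odd candidate values:
# Quick scan over candidate k values; odd k values avoid ties in binary voting
ks <- c(3, 5, 7, 9, 11)
acc <- sapply(ks, function(k) {
  pred <- knn(train = x.train, test = x.test, cl = y.train, k = k)
  mean(pred == y.test)
})
data.frame(k = ks, accuracy = round(acc, 4))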
Fitting Naive Bayes Model
set.seed(120)
classifier_cl <- naiveBayes(y.train ~ ., data = x.train)
Predicting on test data
y_pred <- predict(classifier_cl, newdata = x.test)
Confusion Matrix
cm <- table(y.test, y_pred)
cm
## y_pred
## y.test 0 1
## 0 1133 341
## 1 334 1097
Model Evaluation
confusionMatrix(cm)
## Confusion Matrix and Statistics
##
## y_pred
## y.test 0 1
## 0 1133 341
## 1 334 1097
##
## Accuracy : 0.7676
## 95% CI : (0.7518, 0.7829)
## No Information Rate : 0.505
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.5352
##
## Mcnemar's Test P-Value : 0.8174
##
## Sensitivity : 0.7723
## Specificity : 0.7629
## Pos Pred Value : 0.7687
## Neg Pred Value : 0.7666
## Prevalence : 0.5050
## Detection Rate : 0.3900
## Detection Prevalence : 0.5074
## Balanced Accuracy : 0.7676
##
## 'Positive' Class : 0
##
The model had a balanced accuracy of 76.76%, which was lower than KNN and also below our metric of success.
Fitting SVM to the Training set
classifier = svm(formula = y.train ~ .,
data = x.train,
type = 'C-classification',
kernel = 'linear')
Predicting the Test set results
y_pred = predict(classifier, newdata = x.test)
Making the Confusion Matrix
cm = table(y.test, y_pred)
cm
## y_pred
## y.test 0 1
## 0 1215 259
## 1 255 1176
confusionMatrix(cm)
## Confusion Matrix and Statistics
##
## y_pred
## y.test 0 1
## 0 1215 259
## 1 255 1176
##
## Accuracy : 0.8231
## 95% CI : (0.8087, 0.8368)
## No Information Rate : 0.506
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6461
##
## Mcnemar's Test P-Value : 0.8947
##
## Sensitivity : 0.8265
## Specificity : 0.8195
## Pos Pred Value : 0.8243
## Neg Pred Value : 0.8218
## Prevalence : 0.5060
## Detection Rate : 0.4182
## Detection Prevalence : 0.5074
## Balanced Accuracy : 0.8230
##
## 'Positive' Class : 0
##
The SVM model had a balanced accuracy of 82.30%, making it the best of the three models; it also meets our metric of success.
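As a possible next step, the cost parameter of the linear SVM could be tuned with e1071's built-in cross-validated grid search; a minimal sketch (the candidate cost values and the seed are arbitrary choices):
# Cross-validated grid search over the SVM cost parameter using e1071::tune
set.seed(120)
tuned <- tune(svm, train.x = x.train, train.y = as.factor(y.train),
              kernel = "linear", type = "C-classification",
              ranges = list(cost = c(0.1, 1, 10)))
summary(tuned)
tuned$best.parameters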
Challenging the solution with unsupervised learning (k-means clustering)
head(undersample_df1)
Selecting the predictor columns
predictorcol <- undersample_df1[, -17]
label <- undersample_df1[, 17]
Fitting the k-means clustering model using k = 2
kmeans.re <- kmeans(predictorcol, centers = 2, nstart = 20)
Confusion Matrix
kmeancm <- table(label, kmeans.re$cluster)
kmeancm
##
## label 1 2
## 0 5589 221
## 1 5511 299
The table shows that both classes fall almost entirely into the same cluster, so k-means fails to separate subscribers from non-subscribers, making this model unsuitable for the task.
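Since k-means is distance-based, the unscaled predictors (balance and duration span much larger ranges than the encoded categorical columns) may contribute to the poor separation. A minimal sketch of an exploratory variation that standardizes the predictors first (not guaranteed to separate the classes):
# k-means on standardized predictors so no large-scale column dominates the distance
scaled_pred <- scale(predictorcol)
set.seed(120)
kmeans.scaled <- kmeans(scaled_pred, centers = 2, nstart = 20)
table(label, kmeans.scaled$cluster)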
The models above performed differently, as summarized below:
Overall, the best model for determining whether a customer will subscribe to a term deposit is the SVM. It is also important to note that unsupervised techniques are not suitable for this project.
Most customers who will subscribe to a term deposit are those without a loan (housing or personal).
Making multiple campaign calls to the same customer does not result in them subscribing to a term deposit.
Having credit in default does not translate into term deposit subscription.
For more effective campaigns, the marketing team should:
Avoid calling one customer more than twice; instead, spread that time across other customers.
Be aware that customers with an existing loan (of any form) might not be willing to subscribe to a term deposit.
Focus on customers without credit in default, since they are more likely to subscribe to a term deposit.