The data set provides details from Thera Bank about a personal loan campaign executed by the bank. 4982 customers were targeted with a personal loan offer, out of which 480 customers responded positively. The data is used to build classification model(s) to predict the response of new customers in the future, based on the attributes available in the data. The analysis covers: 1. Exploratory Analytics 2. Clustering 3. CART & Random Forest. Once the classification models are built and pruned, a recommendation is provided.
#install.packages("caret, repos = http://cran.us.r-project.org")
#install.packages("rpart, repos = http://cran.us.r-project.org")
#install.packages("rpart.plot, repos = http://cran.us.r-project.org")
#install.packages("randomForest, repos = http://cran.us.r-project.org")
library(caret)
Loading required package: lattice
Loading required package: ggplot2
Want to understand how all the pieces fit together? Buy the ggplot2 book: http://ggplot2.org/book/
library(rpart)
library(rpart.plot)
library(ggplot2)
library(randomForest)
randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':
margin
base_data <-read.csv('D:/Freelancer_questions/Thera_bank/Thera Bank_Personal_Loan_Modelling.csv')
# Find out Total Number of Rows and Columns
dim(base_data)
[1] 5000 14
str(base_data)
'data.frame': 5000 obs. of 14 variables:
$ ï..ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Age..in.years. : int 25 45 39 35 35 37 53 50 35 34 ...
$ Experience..in.years.: int 1 19 15 9 8 13 27 24 10 9 ...
$ Income..in.K.month. : int 49 34 11 100 45 29 72 22 81 180 ...
$ ZIP.Code : int 91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
$ Family.members : int 4 3 1 1 4 4 2 1 3 1 ...
$ CCAvg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
$ Education : int 1 1 1 2 2 2 2 3 2 3 ...
$ Mortgage : int 0 0 0 0 0 155 0 0 104 0 ...
$ Personal.Loan : int 0 0 0 0 0 0 0 0 0 1 ...
$ Securities.Account : int 1 1 0 0 0 0 0 0 0 0 ...
$ CD.Account : int 0 0 0 0 0 0 0 0 0 0 ...
$ Online : int 0 0 0 0 0 1 1 0 1 0 ...
$ CreditCard : int 0 0 0 0 1 0 0 1 0 0 ...
colnames(base_data)<-c('ID','Age_in_years','Experience(years)','Income_Monthly','Zip_code','Family_members','CCAvg','Education','Mortgage','Personal_loan','Securities_Account','CD_Account','Online','CreditCard')
head(base_data)
ID Age_in_years Experience(years) Income_Monthly Zip_code Family_members CCAvg Education Mortgage Personal_loan Securities_Account
1 1 25 1 49 91107 4 1.6 1 0 0 1
2 2 45 19 34 90089 3 1.5 1 0 0 1
3 3 39 15 11 94720 1 1.0 1 0 0 0
4 4 35 9 100 94112 1 2.7 2 0 0 0
5 5 35 8 45 91330 4 1.0 2 0 0 0
6 6 37 13 29 92121 4 0.4 2 155 0 0
CD_Account Online CreditCard
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
5 0 0 1
6 0 1 0
prop.table(table(base_data$Personal_loan))*100
0 1
90.4 9.6
## Univariate analysis
hist(base_data$Age_in_years,
main = "Histogram of Age",
xlab = "Age in Years")
Inference : Age is very close to a normal distribution
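As an additional check (a minimal sketch, not part of the original output), a normal Q-Q plot can show how closely Age follows a normal distribution:
qqnorm(base_data$Age_in_years, main = "Normal Q-Q Plot of Age")
qqline(base_data$Age_in_years, col = "red") # points lying close to the line indicate approximate normality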
base_data$Zip_code<-as.factor(base_data$Zip_code)
base_data$Family_members<-as.factor(base_data$Family_members)
base_data$Education<-as.factor(base_data$Education)
base_data$Personal_loan<-as.factor(base_data$Personal_loan)
base_data$Securities_Account<-as.factor(base_data$Securities_Account)
base_data$CD_Account<-as.factor(base_data$CD_Account)
base_data$Online<-as.factor(base_data$Online)
base_data$CreditCard<-as.factor(base_data$CreditCard)
# Grouped Bar Plot
counts <- table(base_data$Family_members, base_data$Personal_loan)
barplot(counts, main="Family members vs Personal Loan",
xlab="Personal Loan No vs Yes", col=c("darkblue","red","green","yellow"),
legend = rownames(counts), beside=TRUE)
Inference : People with more family members have a higher likelihood of taking a personal loan
counts <- table(base_data$Education, base_data$Personal_loan)
barplot(counts, main="Education Category vs Personal Loan",
xlab="Personal Loan No vs Yes", col=c("darkblue","red","green"),
legend = c("1 Undergrad", "2 Graduate","3 Advanced/Professional"), beside=TRUE)
Inference (hypothesis) : Customers in the Advanced/Professional education category may require loans for higher studies
boxplot(base_data$Age_in_years,
main = toupper("Boxplot of Age"),
ylab = "Age in years",
col = "blue")
Inference : There are hardly any outliers in the Age column
boxplot(base_data$`Experience(years)`,
main = toupper("Boxplot of Experience"),
ylab = "Experience in years",
col = "purple")
Inference : There are hardly any outliers in the Experience column
boxplot(base_data$Income_Monthly,
main = toupper("Boxplot of Monthly Income"),
ylab = "Monthly Income",
col = "pink")
Inference : There are lots of outliers in the monthly income
boxplot(base_data$CCAvg,
main = toupper("Boxplot of Average Spending of credit card per month"),
ylab = "Average Spending",
col = "maroon")
Inference : Average monthly credit card spending also has many outliers
boxplot(base_data$Mortgage,
main = toupper("Boxplot of House Mortgage if any"),
ylab = "House Mortgag",
col = "maroon")
Inference : The Mortgage column also has many outliers
my_data <- base_data[, c(2,3,4,7,9)]
res <- cor(my_data)
round(res, 2)
Age_in_years Experience(years) Income_Monthly CCAvg Mortgage
Age_in_years 1.00 0.99 -0.06 -0.05 -0.01
Experience(years) 0.99 1.00 -0.05 -0.05 -0.01
Income_Monthly -0.06 -0.05 1.00 0.65 0.21
CCAvg -0.05 -0.05 0.65 1.00 0.11
Mortgage -0.01 -0.01 0.21 0.11 1.00
Inference
Only the numerical features are used for clustering.
The categorical features are not considered, as they are not meaningful for distance-based (k-means) clustering.
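For reference, the numeric columns could also be selected programmatically rather than by position; a minimal sketch (num_cols and my_data_alt are illustrative names):
# Pick the numeric columns, then drop the ID identifier column
num_cols <- sapply(base_data, is.numeric)
num_cols["ID"] <- FALSE
my_data_alt <- base_data[, num_cols] # should match the positional selection in my_data above
str(my_data_alt)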
wss <- (nrow(my_data)-1)*sum(apply(my_data,2,var))
for(i in 2:15) wss[i] <- sum(kmeans(my_data, centers = i, iter.max = 15)$withinss)
plot(1:15, wss, type = "b", main = "Elbow curve", xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
A fundamental step for any unsupervised algorithm is to determine the optimal number of clusters into which the data may be clustered. The Elbow Method is one of the most popular methods to determine this optimal value of k.
We now demonstrate this method using the K-Means clustering technique.
Inference : Based on the elbow curve, 4 clusters appear appropriate
fit <- kmeans(my_data,4)
library(cluster)
library(fpc)
package 'fpc' was built under R version 3.5.3
plotcluster(my_data,fit$cluster)
points(fit$centers,col=1:8,pch=16)
mydata <- data.frame(my_data,fit$cluster)
cluster_mean <- aggregate(mydata,by = list(fit$cluster),FUN = mean)
cluster_mean
Group.1 Age_in_years Experience.years. Income_Monthly CCAvg Mortgage fit.cluster
1 1 45.73072 20.40566 51.66331 1.387648 0.000000 1
2 2 44.62385 19.47706 129.17431 3.044434 343.116208 2
3 3 44.44778 19.44667 139.28778 3.605644 1.925556 3
4 4 45.35738 20.12164 56.98742 1.566745 141.411074 4
“It is important to remember that Data Analytics Projects require a delicate balance between experimentation, intuition, but also following (once a while) a process to avoid getting fooled by randomness and “finding results and patterns” that are mainly driven by our own biases and not by the facts/data themselves"-https://inseaddataanalytics.github.io/INSEADAnalytics/CourseSessions/Sessions45/ClusterAnalysisReading.html
my_data2<-my_data
outliers3 <- boxplot(my_data2$Income_Monthly, plot=FALSE)
outliers3<-outliers3$out
my_data2 <- my_data2[-which(my_data2$Income_Monthly %in% outliers3),]
outliers4 <- boxplot(my_data2$CCAvg, plot=FALSE)
outliers4<-outliers4$out
my_data2 <- my_data2[-which(my_data2$CCAvg %in% outliers4),]
outliers5 <- boxplot(my_data2$Mortgage, plot=FALSE)
outliers5<-outliers5$out
my_data2 <- my_data2[-which(my_data2$Mortgage %in% outliers5),]
nrow(my_data2)
[1] 4374
Inference : Outliers have been successfully removed
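A quick re-check of the cleaned columns (keeping in mind that the 1.5*IQR fences are recomputed on the reduced data, so a few points may still be flagged); a minimal sketch:
# Count of points flagged by boxplot's 1.5*IQR rule after removal
length(boxplot(my_data2$Income_Monthly, plot = FALSE)$out)
length(boxplot(my_data2$CCAvg, plot = FALSE)$out)
length(boxplot(my_data2$Mortgage, plot = FALSE)$out)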
wss <- (nrow(my_data2)-1)*sum(apply(my_data2,2,var))
for(i in 2:15) wss[i] <- sum(kmeans(my_data2, centers = i, iter.max = 15)$withinss)
plot(1:15, wss, type = "b", main = "Elbow curve", xlab = "Number of clusters", ylab = "Within-cluster sum of squares")
Inference : 5 clusters make sense here
fit2<-kmeans(my_data2,5)
my_data3 <- data.frame(my_data2)
cluster_mean_2 <- aggregate(my_data3,by = list(fit2$cluster),FUN = mean)
cluster_mean_2
Group.1 Age_in_years Experience.years. Income_Monthly CCAvg Mortgage
1 1 55.67742 30.26100 75.12023 1.8905572 0.111437
2 2 44.76296 19.78889 136.06481 2.5069074 1.851852
3 3 46.07710 20.75944 31.42791 0.9644025 0.000000
4 4 45.42485 20.18332 54.30061 1.4017289 140.939183
5 5 35.54119 10.31960 73.52983 1.7996591 0.000000
Inference : The 5 clusters make much more sense after outlier removal
my_data2$cluster<-fit2$cluster
library(dplyr)
package 'dplyr' was built under R version 3.5.1
Attaching package: 'dplyr'
The following object is masked from 'package:randomForest':
combine
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
head(my_data2)
Age_in_years Experience(years) Income_Monthly CCAvg Mortgage cluster
1 25 1 49 1.6 0 5
2 45 19 34 1.5 0 3
3 39 15 11 1.0 0 3
4 35 9 100 2.7 0 5
5 35 8 45 1.0 0 3
6 37 13 29 0.4 155 4
index <- as.integer(rownames(my_data2))
Personal_loan<-base_data[index,10]
my_data2$Personal_loan<-Personal_loan
head(my_data2)
Age_in_years Experience(years) Income_Monthly CCAvg Mortgage cluster Personal_loan
1 25 1 49 1.6 0 5 0
2 45 19 34 1.5 0 3 0
3 39 15 11 1.0 0 3 0
4 35 9 100 2.7 0 5 0
5 35 8 45 1.0 0 3 0
6 37 13 29 0.4 155 4 0
Inference : Each customer now carries both a cluster assignment and the Personal_loan response, so the clusters can be profiled against loan acceptance
# Grouped Bar Plot
counts <- table( my_data2$Personal_loan,my_data2$cluster)
barplot(counts, main="Family members vs Personal Loan",
xlab="Personal Loan No vs Yes", col=c("red","green"),
legend = c("Personal_Loan_No","Personal_Loan_Yes"), beside=TRUE)
Inference : Targeting the cluster 4 segment would be the best option for a higher conversion rate, and its population is close to 500 (see the sketch below)
This would help the company spend its marketing budget on the right customers rather than wasting it on all customers
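To support this numerically, a minimal sketch of per-cluster conversion rates and cluster sizes (conv_rate is an illustrative name):
# Percentage of loan acceptors (Personal_loan = 1) within each cluster
conv_rate <- prop.table(table(my_data2$cluster, my_data2$Personal_loan), margin = 1) * 100
round(conv_rate, 1)
# Cluster sizes, to confirm the targeted segment is large enough
table(my_data2$cluster)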
Creating Training and Testing Datasets: The given data set is divided into training and testing sets in a 70:30 proportion. The distribution of the responder and non-responder classes is verified in both sets and confirmed to be close to equal (a quick check follows the split below).
set.seed(111)
trainIndex <- createDataPartition(Personal_loan,
p=0.7,
list = FALSE,
times = 1)
train.data <- base_data[trainIndex,2:length(base_data) ]
test.data <- base_data[-trainIndex,2:length(base_data) ]
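A minimal sketch of that class-distribution check (not shown in the original output):
# Percentage of responders (1) vs non-responders (0) in each split
round(prop.table(table(train.data$Personal_loan)) * 100, 1)
round(prop.table(table(test.data$Personal_loan)) * 100, 1)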
Model Building - CART (Unbalanced Dataset): Setting the control parameter inputs for rpart.
r.ctrl <- rpart.control(minsplit = 100,
minbucket = 10,
cp = 0,
xval = 10
)
# The ID column was already excluded when creating train.data above
cart.train <- train.data
m1 <- rpart(formula = Personal_loan~.,
data = cart.train,
method = "class",
control = r.ctrl
)
#install.packages("rattle")
#install.packages("RColorBrewer")
library(rattle)
Rattle: A free graphical interface for data science with R.
Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
Attaching package: 'rattle'
The following object is masked from 'package:randomForest':
importance
library(RColorBrewer)
fancyRpartPlot(m1)
Variables used in the tree construction
printcp(m1)
Classification tree:
rpart(formula = Personal_loan ~ ., data = cart.train, method = "class",
control = r.ctrl)
Variables actually used in tree construction:
[1] CCAvg Education Family_members Income_Monthly Zip_code
Root node error: 308/3062 = 0.10059
n= 3062
CP nsplit rel error xerror xstd
1 0.3214286 0 1.000000 1.00000 0.054039
2 0.1525974 2 0.357143 0.36688 0.033871
3 0.0487013 3 0.204545 0.21753 0.026283
4 0.0097403 5 0.107143 0.23052 0.027039
5 0.0000000 6 0.097403 0.23377 0.027224
“Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. This means that the noise or random fluctuations in the training data is picked up and learned as concepts by the model. The problem is that these concepts do not apply to new data and negatively impact the models ability to generalize” -https://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/
plotcp(m1)
Pruning the tree
We take cp = 0.045 as the pruning parameter and rebuild the pruned tree (a programmatic alternative for choosing cp is sketched after the final tree plot below).
ptree <- prune(m1, cp = 0.045)
printcp(ptree)
Classification tree:
rpart(formula = Personal_loan ~ ., data = cart.train, method = "class",
control = r.ctrl)
Variables actually used in tree construction:
[1] CCAvg Education Family_members Income_Monthly Zip_code
Root node error: 308/3062 = 0.10059
n= 3062
CP nsplit rel error xerror xstd
1 0.321429 0 1.00000 1.00000 0.054039
2 0.152597 2 0.35714 0.36688 0.033871
3 0.048701 3 0.20455 0.21753 0.026283
4 0.045000 5 0.10714 0.23052 0.027039
fancyRpartPlot(ptree,
uniform = TRUE,
main = "Final Tree",
palettes = c("Blues", "Oranges")
)
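Rather than reading the pruning parameter off the plot manually, the cp with the lowest cross-validated error can also be picked programmatically from the cp table; a minimal sketch (best_cp and ptree_auto are illustrative names):
# cp value with the minimum cross-validated error (xerror)
cp_table <- m1$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
best_cp
ptree_auto <- prune(m1, cp = best_cp) # alternative pruned tree, for comparison with ptree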
## Scoring Holdout sample
cart.test <- test.data
cart.test$predict.class = predict(ptree, cart.test,type = "class")
x<-cart.test$Personal_loan
cart.test$predict.score = predict(ptree, cart.test, type = "prob")
library(caret)
confusionMatrix(table(as.factor(x),cart.test$predict.class ))
Confusion Matrix and Statistics
0 1
0 1744 22
1 25 147
Accuracy : 0.9757
95% CI : (0.9679, 0.9821)
No Information Rate : 0.9128
P-Value [Acc > NIR] : <2e-16
Kappa : 0.8489
Mcnemar's Test P-Value : 0.7705
Sensitivity : 0.9859
Specificity : 0.8698
Pos Pred Value : 0.9875
Neg Pred Value : 0.8547
Prevalence : 0.9128
Detection Rate : 0.8999
Detection Prevalence : 0.9112
Balanced Accuracy : 0.9278
'Positive' Class : 0
library("ROCR")
Pred.cart = predict(ptree, newdata = cart.test, type = "prob")[,2]
Pred2 = prediction(Pred.cart, cart.test$Personal_loan)
plot(performance(Pred2, "tpr", "fpr"))
abline(0, 1, lty = 2)
#######################################
##
auc.tmp <- performance(Pred2,"auc")
auc <- as.numeric(auc.tmp@y.values)
print(auc)
[1] 0.973238
Inference : The area under the curve is close to 0.97
Result : The CART model has achieved close to 97.6% accuracy on the test data in predicting which customers will take a personal loan
library(randomForest)
library(caret)
library(e1071)
trainIndex <- createDataPartition(Personal_loan,
p=0.7,
list = FALSE,
times = 1)
base_data_2<-base_data[,-5]
train.data <- base_data_2[trainIndex,2:length(base_data_2) ]
colnames(train.data)<-c('Age_in_years','Experience_years','Income_Monthly','Family_members','CCAvg','Education','Mortgage',
'Personal_loan','Securities_Account','CD_Account','Online','CreditCard')
train.data$Personal_loan<-as.factor(train.data$Personal_loan)
train.data<-na.omit(train.data)
test.data <- base_data_2[-trainIndex,2:length(base_data_2) ]
colnames(test.data)<-c('Age_in_years','Experience_years','Income_Monthly','Family_members','CCAvg','Education','Mortgage',
'Personal_loan','Securities_Account','CD_Account','Online','CreditCard')
test.data<-na.omit(test.data)
test.data$Personal_loan<-as.factor(test.data$Personal_loan)
model1 <- randomForest(Personal_loan ~ ., ntree = 100,data = train.data, importance = TRUE)
model1
Call:
randomForest(formula = Personal_loan ~ ., data = train.data, ntree = 100, importance = TRUE)
Type of random forest: classification
Number of trees: 100
No. of variables tried at each split: 3
OOB estimate of error rate: 1.41%
Confusion matrix:
0 1 class.error
0 2730 6 0.002192982
1 37 277 0.117834395
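To check whether 100 trees are sufficient, the OOB and per-class error rates can be plotted against the number of trees; a minimal sketch using the fitted model:
# Error rates should stabilise once enough trees have been grown
plot(model1, main = "Error rate vs number of trees")
legend("topright", legend = colnames(model1$err.rate), col = 1:3, lty = 1:3)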
Pred_rf <- predict(model1, test.data, type = 'class')
confusionMatrix(test.data$Personal_loan, Pred_rf)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1766 2
1 25 139
Accuracy : 0.986
95% CI : (0.9797, 0.9908)
No Information Rate : 0.927
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9039
Mcnemar's Test P-Value : 2.297e-05
Sensitivity : 0.9860
Specificity : 0.9858
Pos Pred Value : 0.9989
Neg Pred Value : 0.8476
Prevalence : 0.9270
Detection Rate : 0.9141
Detection Prevalence : 0.9151
Balanced Accuracy : 0.9859
'Positive' Class : 0
Result : Random Forest has performed very well, with 98.6% accuracy on the test data
library("ROCR")
Pred_rf <- predict(model1, test.data, type = 'prob')[,2]
require(pROC)
rf.roc<-roc(test.data$Personal_loan,Pred_rf)
plot(rf.roc)
#######################################
##
Inference : The ROC is very close to ideal
auc(rf.roc)
Area under the curve: 0.9975
varImpPlot(model1,
sort = T,
n.var=10,
main="Top 10 - Variable Importance")
Inference