Applied (7-12)
In the lab, we applied random forests to the Boston data using mtry=6 and using ntree=25 and ntree=500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.
library(tree)
library(randomForest)
library(MASS)
#splitting the data into training and testing data
set.seed(1)
subset<-sample(1:nrow(Boston),nrow(Boston)*0.7)
Boston.train<-Boston[subset,-14]
Boston.test<-Boston[-subset,-14]
y.train<-Boston[subset,14]
y.test<-Boston[-subset,14]
#Building 5 different models with tuning parameter m=p,p/2,p/3,p/4,p^0.5
#mtry must be an integer, so the fractions of p are rounded down with floor()
p<-ncol(Boston.train)
rfmodel1<-randomForest(Boston.train,y=y.train,xtest = Boston.test,ytest = y.test,ntree=500,mtry=p)
rfmodel2<-randomForest(Boston.train,y.train,xtest = Boston.test,ytest = y.test,ntree=500,mtry=floor(p/2))
rfmodel3<-randomForest(Boston.train,y.train,xtest = Boston.test,ytest = y.test,ntree=500,mtry=floor(sqrt(p)))
rfmodel4<-randomForest(Boston.train,y.train,xtest = Boston.test,ytest = y.test,ntree=500,mtry=floor(p/3))
rfmodel5<-randomForest(Boston.train,y.train,xtest = Boston.test,ytest = y.test,ntree=500,mtry=floor(p/4))
plot(1:500,rfmodel1$test$mse,col="red",type="l",xlab = "Number of Trees",ylab = "Test MSE",ylim = c(10,25))
lines(1:500,rfmodel2$test$mse, col="orange",type="l")
lines(1:500,rfmodel3$test$mse, col="green",type="l")
lines(1:500,rfmodel4$test$mse, col="blue",type="l")
lines(1:500,rfmodel5$test$mse, col="black",type="l")
legend("topright",c("m=p=13","m=p/2","m=sqrt(p)","m=p/3","m=p/4"),col=c("red","orange","green","blue","black"),cex=0.5,lty=1)
We see that the test MSE decreases as the number of trees increases. It stabilizes after a certain number of trees, with no further improvement.
The minimum test MSE is observed when m=sqrt(p) is chosen.
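For an even more comprehensive grid, one could loop over every value of mtry from 1 to p rather than fitting five separate models. A minimal sketch, reusing the split and p defined above:
#Sketch: test MSE curves for every mtry from 1 to p
test.mse<-matrix(NA,nrow=500,ncol=p)
for (m in 1:p){
  fit<-randomForest(Boston.train,y=y.train,xtest=Boston.test,ytest=y.test,ntree=500,mtry=m)
  test.mse[,m]<-fit$test$mse
}
matplot(1:500,test.mse,type="l",lty=1,col=1:p,xlab="Number of Trees",ylab="Test MSE")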
In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
Split the data set into a training set and a test set.
library(ISLR)
attach(Carseats)
set.seed(1)
subset<-sample(nrow(Carseats),nrow(Carseats)*0.7)
car.train<-Carseats[subset,]
car.test<-Carseats[-subset,]
Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
library(tree)
car.model.train<-tree(Sales~.,car.train)
summary(car.model.train)
##
## Regression tree:
## tree(formula = Sales ~ ., data = car.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "CompPrice" "Advertising"
## [6] "Population" "Income"
## Number of terminal nodes: 18
## Residual mean deviance: 2.502 = 655.5 / 262
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.74100 -0.98720 -0.02545 0.00000 1.02000 5.06900
plot(car.model.train)
text(car.model.train,pretty=0)
tree.prediction<-predict(car.model.train,newdata=car.test)
tree.mse<-mean((car.test$Sales-tree.prediction)^2)
tree.mse
## [1] 5.288256
The test MSE for the full-grown tree is 5.288.
Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
set.seed(1)
cv.car<-cv.tree(car.model.train)
plot(cv.car$size,cv.car$dev,xlab = "Size of Tree",ylab = "Deviance",type = "b")
prune.car<-prune.tree(car.model.train,best=6)
plot(prune.car)
text(prune.car,pretty=0)
prune.predict<-predict(prune.car,car.test)
mean((prune.predict-car.test$Sales)^2)
## [1] 5.453816
We use cross-validation to find the size to which the tree should be pruned; the size with the minimum deviance is selected as the optimal size for pruning.
For the pruned tree we get a test MSE of 5.454, so pruning did not improve the test MSE here.
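Instead of reading the optimal size off the plot, it can be extracted programmatically from the cv.car object computed above:
#pick the size with the minimum cross-validated deviance
best.size<-cv.car$size[which.min(cv.car$dev)]
best.size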
Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
#Carseats has p = 10 predictors; mtry = 10 (all predictors) is equivalent to bagging
bag.car<-randomForest(Sales~.,car.train,importance=TRUE,mtry=10)
importance(bag.car)
## %IncMSE IncNodePurity
## CompPrice 27.2205963 211.989549
## Income 8.4435449 108.627807
## Advertising 22.1539607 184.298723
## Population 5.7100937 93.196984
## Price 64.9087095 693.051954
## ShelveLoc 69.6421864 591.926746
## Age 20.9073044 194.036137
## Education 0.2049133 53.557062
## Urban -5.0569793 8.424407
## US 1.6671334 11.686647
bag.car.predict<-predict(bag.car,car.test)
mean((bag.car.predict-car.test$Sales)^2)
## [1] 2.324245
We use randomForest with m = p = 10, the total number of predictors, which is equivalent to bagging.
The test set MSE obtained is 2.324. This is a further reduction compared to the single pruned tree; thus bagging helped reduce the MSE.
We can see that Price and ShelveLoc are the two most important variables chosen by the bagging model.
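The importance measures can also be visualized with varImpPlot() from the randomForest package:
varImpPlot(bag.car)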
Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
#m = floor(sqrt(p)) = floor(sqrt(10)) = 3 variables tried at each split
rf.car<-randomForest(Sales~.,car.train,importance=TRUE,mtry=floor(sqrt(10)))
importance(rf.car)
## %IncMSE IncNodePurity
## CompPrice 17.471607 191.36441
## Income 6.623987 150.25084
## Advertising 19.843893 211.95548
## Population 2.498219 130.16751
## Price 50.014724 600.96795
## ShelveLoc 50.103591 506.77540
## Age 17.389449 215.75786
## Education 2.256214 77.89851
## Urban -1.936739 14.00996
## US 2.398366 29.92999
rf.car.predict<-predict(rf.car,car.test)
mean((rf.car.predict-car.test$Sales)^2)
## [1] 2.517948
Using the random forest, the test MSE increased compared to bagging.
The important variables chosen by the random forest are the same as those chosen by bagging.
Random forests decorrelate the trees and are therefore often expected to perform better than bagging, but that is not the case here. Further tuning of the random forest model should be tried.
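To see the effect of m directly, one could compute the test MSE for every value of mtry from 1 to 10. A minimal sketch, reusing the split above:
#Sketch: test MSE as a function of m for the Carseats data
mse.by.m<-sapply(1:10,function(m){
  fit<-randomForest(Sales~.,car.train,mtry=m)
  mean((predict(fit,car.test)-car.test$Sales)^2)
})
plot(1:10,mse.by.m,type="b",xlab="m (variables tried at each split)",ylab="Test MSE")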
Full Grown Tree MSE: 5.288
Pruned Tree MSE: 5.454
Bagging Model MSE: 2.324
Random Forest MSE: 2.518
This problem involves the OJ data set which is part of the ISLR package.
Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
library(ISLR)
attach(OJ)
train<-sample(1:nrow(OJ),800)
oj.train<-OJ[train,]
oj.test<-OJ[-train,]
Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
oj.tree<-tree(Purchase~.,oj.train)
fulltree.summary<-summary(oj.tree)
fulltree.summary
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7287 = 577.8 / 793
## Misclassification error rate: 0.145 = 116 / 800
fulltree.misclass<-fulltree.summary$misclass[1]*100/fulltree.summary$misclass[2]
The algorithm used only three variables in the tree: “LoyalCH”, “PriceDiff”, and “ListPriceDiff”
Full Tree Misclassification error rate: 14.5 % for training data
Number of terminal nodes: 7
Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
oj.tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1081.00 CH ( 0.59250 0.40750 )
## 2) LoyalCH < 0.48285 297 304.30 MM ( 0.20875 0.79125 )
## 4) LoyalCH < 0.136344 96 33.26 MM ( 0.04167 0.95833 ) *
## 5) LoyalCH > 0.136344 201 241.50 MM ( 0.28856 0.71144 )
## 10) PriceDiff < 0.31 158 161.90 MM ( 0.20886 0.79114 ) *
## 11) PriceDiff > 0.31 43 58.47 CH ( 0.58140 0.41860 ) *
## 3) LoyalCH > 0.48285 503 475.60 CH ( 0.81909 0.18091 )
## 6) LoyalCH < 0.74912 238 303.90 CH ( 0.66387 0.33613 )
## 12) PriceDiff < -0.165 33 24.38 MM ( 0.12121 0.87879 ) *
## 13) PriceDiff > -0.165 205 230.00 CH ( 0.75122 0.24878 )
## 26) ListPriceDiff < 0.135 37 50.62 MM ( 0.43243 0.56757 ) *
## 27) ListPriceDiff > 0.135 168 157.70 CH ( 0.82143 0.17857 ) *
## 7) LoyalCH > 0.74912 265 91.54 CH ( 0.95849 0.04151 ) *
Terminal nodes are marked with an asterisk
We select node 11 (PriceDiff > 0.31) for interpretation
This leaf contains 43 observations, with a residual deviance of 58.47. The overall prediction is CH, with about 58% of the observations taking the value CH and the remaining 42% taking MM
Create a plot of the tree, and interpret the results.
plot(oj.tree)
text(oj.tree,pretty=0)
The most important indicator of Purchase appears to be “LoyalCH”
The top three splits in the tree all involve “LoyalCH”
Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
oj.predict.test<-predict(oj.tree,newdata = oj.test,type = "class")
table(oj.test$Purchase,oj.predict.test,dnn = c("Actual","Predicted"))
## Predicted
## Actual CH MM
## CH 142 37
## MM 21 70
oj.testerror<-(21+37)/nrow(oj.test)
round(oj.testerror,3)
## [1] 0.215
The Test Error Rate for Full Grown Tree is 21.5 %
Apply the cv.tree() function to the training set in order to determine the optimal tree size.
oj.cv.train<-cv.tree(oj.tree,FUN = prune.misclass)
oj.cv.train
## $size
## [1] 7 5 4 2 1
##
## $dev
## [1] 147 146 146 163 326
##
## $k
## [1] -Inf 3.5 5.0 12.5 173.0
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(oj.cv.train$size,oj.cv.train$dev,xlab="Size of the Tree",ylab="CV Deviance",type = "b")
points(4,min(oj.cv.train$dev),col="red")
Which tree size corresponds to the lowest cross-validated classification error rate?
We see that the cross-validated deviance is minimized at tree size 4 (tied with size 5)
Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
oj.prune<-prune.misclass(oj.tree,best=4)
plot(oj.prune)
text(oj.prune,pretty=0)
Compare the training error rates between the pruned and unpruned trees. Which is higher?
prune.summary<-summary(oj.prune)
prune.summary
##
## Classification tree:
## snip.tree(tree = oj.tree, nodes = c(2L, 13L))
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 4
## Residual mean deviance: 0.8169 = 650.2 / 796
## Misclassification error rate: 0.16 = 128 / 800
prune.misclass.rate<-prune.summary$misclass[1]*100/prune.summary$misclass[2]
Full tree misclassification error rate for training data: 14.5 %
Pruned tree misclassification error rate for training data: 16 %
Pruning did not reduce the misclassification error on the training data; the higher rate for the pruned tree is expected, since the full tree fits the training data more closely
Compare the test error rates between the pruned and unpruned trees. Which is higher?
prune.predict.test<-predict(oj.prune,newdata = oj.test,type="class")
table(oj.test$Purchase,prune.predict.test,dnn = c("Actual","Predicted"))
## Predicted
## Actual CH MM
## CH 139 40
## MM 15 76
oj.testerror.prune<-(40+15)/nrow(oj.test)
round(oj.testerror.prune,3)
## [1] 0.204
Full Tree Misclassification error rate for test data: 21.5 %
Pruned Tree Misclassification error rate for test data: 20.4 %
We find that after pruning, the misclassification rate on the test data is lower than that of the full-grown tree. This suggests the full tree overfits: it gives a lower training misclassification error but a higher test misclassification error
We now use boosting to predict Salary in the Hitters data set.
Remove the observations for whom the salary information is unknown, and then log-transform the salaries.
attach(Hitters)
Hitters<-na.omit(Hitters)
Hitters$Salary<-log(Hitters$Salary)
Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.
subset<-1:200
hitters.train<-Hitters[subset,]
hitters.test<-Hitters[-subset,]
Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.
library(gbm)
set.seed(1)
#defining the range of lambdas from 0.01 to 1
powers<-seq(-2,0,by=0.1)
lambdas<-10^powers
#vector to store the training errors
train.error<-rep(NA,length(lambdas))
#Loop over the range of lambdas, storing the error for each
#tuning parameters of boosting: shrinkage (lambda), number of trees (n.trees), and distribution ("gaussian" for regression, "bernoulli" for binary classification)
for (i in 1:length(lambdas)){
hitters.gbm<-gbm(Salary~.,hitters.train,distribution = "gaussian",n.trees=1000,shrinkage=lambdas[i])
#predicting on the training set
hitters.train.pred<-predict(hitters.gbm,hitters.train,n.trees=1000)
train.error[i]<-mean((hitters.train.pred-hitters.train$Salary)^2)
}
#Plotting training MSE against Lambdas
plot(lambdas,train.error,type="b",xlab="Shrinkage Value(lambda)",ylab="Training MSE")
Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.
#test mse
set.seed(1)
#vector to store the test errors
test.error<-rep(NA,length(lambdas))
#Loop over the range of lambdas, storing the test error for each (same tuning parameters as above)
for (i in 1:length(lambdas)){
hitters.gbm<-gbm(Salary~.,hitters.train,distribution = "gaussian",n.trees=1000,shrinkage=lambdas[i])
#predicting on the test set
hitters.test.pred<-predict(hitters.gbm,hitters.test,n.trees=1000)
test.error[i]<-mean((hitters.test.pred-hitters.test$Salary)^2)
}
#Plotting testing MSE against Lambdas
plot(lambdas,test.error,type="b",xlab="Shrinkage Value(lambda)",ylab="Test MSE")
hitters.gbm.testerror<-min(test.error)
hitters.gbm.testerror
## [1] 0.255455
The Minimum Test MSE obtained by boosting is 0.26
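The shrinkage value at which this minimum occurs can be read off from the objects computed above:
lambdas[which.min(test.error)]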
Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.
library(glmnet)
#Fitting least square regression model
lm<-lm(Salary~.,hitters.train)
hitters.predict.lm<-predict(lm,hitters.test)
hitters.lm.test.mse<-mean((hitters.predict.lm-hitters.test$Salary)^2)
hitters.lm.test.mse
## [1] 0.4917959
#Ridge regression model
#here we have selected s = 0.01 as the value of lambda at which to predict
x<-model.matrix(Salary~.,hitters.train)
x.test<-model.matrix(Salary ~ . , hitters.test)
y<-hitters.train$Salary
hitters.ridge<-glmnet(x,y,alpha=0)
hitters.ridge.predict<-predict(hitters.ridge,s=0.01,x.test)
hitters.ridge.test.mse<-mean((hitters.ridge.predict-hitters.test$Salary)^2)
hitters.ridge.test.mse
## [1] 0.4570283
#Lasso regression model
#here we have selected s = 0.01 as the value of lambda at which to predict
x<-model.matrix(Salary~.,hitters.train)
x.test<-model.matrix(Salary ~ . , hitters.test)
y<-hitters.train$Salary
hitters.lasso<-glmnet(x,y,alpha=1)
hitters.lasso.predict<-predict(hitters.lasso,s=0.01,x.test)
hitters.lasso.test.mse<-mean((hitters.lasso.predict-hitters.test$Salary)^2)
hitters.lasso.test.mse
## [1] 0.4700537
The test MSEs for the different methods are summarized below. Boosting gives the lowest test MSE among the four models.
Boosting Model Test MSE: 0.26
Least Squares Regression Full Model Test MSE: 0.49
Ridge Regression Model Test MSE: 0.46
Lasso Regression Model Test MSE: 0.47
Which variables appear to be the most important predictors in the boosted model?
boost.hitters<-gbm(Salary~.,data=hitters.train,distribution = "gaussian",n.trees = 1000,shrinkage=lambdas[which.min(test.error)])
summary(boost.hitters)
## var rel.inf
## CAtBat CAtBat 30.0391455
## Years Years 7.1722320
## CWalks CWalks 6.6962390
## PutOuts PutOuts 6.1919523
## Walks Walks 6.0430398
## CRuns CRuns 5.8184446
## CHmRun CHmRun 5.7355580
## CRBI CRBI 5.6678930
## Hits Hits 5.5180489
## HmRun HmRun 3.7274075
## Assists Assists 3.6165621
## AtBat AtBat 2.8770937
## RBI RBI 2.8062318
## CHits CHits 2.8030774
## Errors Errors 2.3509666
## Runs Runs 1.5093746
## Division Division 0.6614964
## NewLeague NewLeague 0.5632541
## League League 0.2019828
We find that CAtBat is by far the most important variable
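Its marginal effect on log-salary can be examined with a partial dependence plot; plot.gbm accepts a variable name through its i.var argument:
plot(boost.hitters,i.var="CAtBat")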
Now apply bagging to the training set. What is the test set MSE for this approach?
set.seed(1)
hitters.bagging<-randomForest(Salary~.,hitters.train,mtry=19,importance=TRUE)
hitters.bagg.predict<-predict(hitters.bagging,hitters.test)
hitters.bagg.test.mse<-mean((hitters.bagg.predict-hitters.test$Salary)^2)
hitters.bagg.test.mse
## [1] 0.228722
The test set MSE for bagging is 0.23.
This is lower than the test set MSE of 0.26 obtained for boosting.
This question uses the Caravan data set.
Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.
attach(Caravan)
set.seed(1)
caravan.subset<-1:1000
Caravan$Purchase<-ifelse(Caravan$Purchase=="Yes",1,0)
caravan.train<-Caravan[caravan.subset,]
caravan.test<-Caravan[-caravan.subset,]
Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?
library(gbm)
library(DT)
#boosting with lambda=0.01, ntrees=1000, distribution=bernoulli as classification problem
set.seed(1)
caravan.boost<-gbm(Purchase~.,caravan.train,shrinkage = 0.01,n.trees = 1000,distribution = "bernoulli")
datatable(summary(caravan.boost), class="table-condensed", style="bootstrap", options = list(dom = 'tp'))
It can be seen that “PPERSAUT” is the most important variable
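If an interactive table is not available, the top of the relative-influence ranking can be printed directly; summary.gbm takes plotit = FALSE to suppress the bar chart:
head(summary(caravan.boost,plotit=FALSE),5)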
Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20 %. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this data set?
#Prediction using Boosting
caravan.predict.boost<-predict(caravan.boost,caravan.test,n.trees = 1000,type="response")
caravan.predict.prob.boost<-ifelse(caravan.predict.boost>0.2,1,0)
table(caravan.test$Purchase,caravan.predict.prob.boost,dnn=c("Actual","Predicted"))
## Predicted
## Actual 0 1
## 0 4410 123
## 1 256 33
#fraction of people predicted to make a purchase who actually made one
TPF<-33/(123+33)
TPF
## [1] 0.2115385
#Logistic Regression
caravan.logit<-glm(Purchase~.,caravan.train,family = "binomial")
caravan.predict.logit<-predict(caravan.logit,caravan.test,type="response")
caravan.predict.prob.logit<-ifelse(caravan.predict.logit>0.2,1,0)
table(caravan.test$Purchase,caravan.predict.prob.logit,dnn=c("Actual","Predicted"))
## Predicted
## Actual 0 1
## 0 4183 350
## 1 231 58
TPF.logit<-58/(350+58)
TPF.logit
## [1] 0.1421569
21% of the people predicted to make a purchase actually made one under the Boosting model
14% of the people predicted to make a purchase actually made one under the Logistic Regression model
Thus targeting customers using the Boosting model will serve better than using the Logistic Regression model
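The question also mentions KNN. A minimal sketch of that comparison, assuming standardized predictors (as in the Chapter 4 lab) and an arbitrary choice of k = 5:
library(class)
#standardize the predictors; KNN is sensitive to scale (same split as above)
X<-scale(Caravan[,-which(names(Caravan)=="Purchase")])
knn.pred<-knn(X[caravan.subset,],X[-caravan.subset,],caravan.train$Purchase,k=5,prob=TRUE)
#knn() returns the winning class; attr(,"prob") is the winning-class vote fraction
prob.win<-attr(knn.pred,"prob")
prob.purchase<-ifelse(knn.pred==1,prob.win,1-prob.win)
table(caravan.test$Purchase,ifelse(prob.purchase>0.2,1,0),dnn=c("Actual","Predicted"))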
Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?
We use the German Credit data. The misclassification rate on the test set is used as the performance metric.
#Loading Dependencies
library(knitr)
#Loading Data Set
german_credit = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")
colnames(german_credit) = c("chk_acct", "duration", "credit_his", "purpose",
"amount", "saving_acct", "present_emp", "installment_rate", "sex", "other_debtor",
"present_resid", "property", "age", "other_install", "housing", "n_credits",
"job", "n_people", "telephone", "foreign", "response")
#Take a glance at the Dataset
kable(head(german_credit,10))
chk_acct | duration | credit_his | purpose | amount | saving_acct | present_emp | installment_rate | sex | other_debtor | present_resid | property | age | other_install | housing | n_credits | job | n_people | telephone | foreign | response |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A11 | 6 | A34 | A43 | 1169 | A65 | A75 | 4 | A93 | A101 | 4 | A121 | 67 | A143 | A152 | 2 | A173 | 1 | A192 | A201 | 1 |
A12 | 48 | A32 | A43 | 5951 | A61 | A73 | 2 | A92 | A101 | 2 | A121 | 22 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 2 |
A14 | 12 | A34 | A46 | 2096 | A61 | A74 | 2 | A93 | A101 | 3 | A121 | 49 | A143 | A152 | 1 | A172 | 2 | A191 | A201 | 1 |
A11 | 42 | A32 | A42 | 7882 | A61 | A74 | 2 | A93 | A103 | 4 | A122 | 45 | A143 | A153 | 1 | A173 | 2 | A191 | A201 | 1 |
A11 | 24 | A33 | A40 | 4870 | A61 | A73 | 3 | A93 | A101 | 4 | A124 | 53 | A143 | A153 | 2 | A173 | 2 | A191 | A201 | 2 |
A14 | 36 | A32 | A46 | 9055 | A65 | A73 | 2 | A93 | A101 | 4 | A124 | 35 | A143 | A153 | 1 | A172 | 2 | A192 | A201 | 1 |
A14 | 24 | A32 | A42 | 2835 | A63 | A75 | 3 | A93 | A101 | 4 | A122 | 53 | A143 | A152 | 1 | A173 | 1 | A191 | A201 | 1 |
A12 | 36 | A32 | A41 | 6948 | A61 | A73 | 2 | A93 | A101 | 2 | A123 | 35 | A143 | A151 | 1 | A174 | 1 | A192 | A201 | 1 |
A14 | 12 | A32 | A43 | 3059 | A64 | A74 | 2 | A91 | A101 | 4 | A121 | 61 | A143 | A152 | 1 | A172 | 1 | A191 | A201 | 1 |
A12 | 30 | A34 | A40 | 5234 | A61 | A71 | 4 | A94 | A101 | 2 | A123 | 28 | A143 | A152 | 2 | A174 | 1 | A191 | A201 | 2 |
#Dimension of the Table
dim(german_credit)
## [1] 1000 21
#The data set has 20 predictor variables and 1 response variable with 1000 observations
#Variable types
str(german_credit)
## 'data.frame': 1000 obs. of 21 variables:
## $ chk_acct : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_his : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ purpose : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ saving_acct : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ present_emp : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ installment_rate: int 4 2 2 2 3 2 3 2 2 4 ...
## $ sex : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ other_debtor : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ present_resid : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ other_install : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ n_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ n_people : int 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
## $ response : int 1 2 1 1 2 1 1 1 1 2 ...
#Changing the variable types of response and present_resid. Coding response as 1 and 0 instead of 2 and 1
# original response coding: 1 = good, 2 = bad; we need 0 = good, 1 = bad
german_credit$response = german_credit$response - 1 # Run this once
german_credit$response = as.factor(german_credit$response)
german_credit$present_resid <- as.factor(german_credit$present_resid)
#Data Summary
summary(german_credit)
## chk_acct duration credit_his purpose amount
## A11:274 Min. : 4.0 A30: 40 A43 :280 Min. : 250
## A12:269 1st Qu.:12.0 A31: 49 A40 :234 1st Qu.: 1366
## A13: 63 Median :18.0 A32:530 A42 :181 Median : 2320
## A14:394 Mean :20.9 A33: 88 A41 :103 Mean : 3271
## 3rd Qu.:24.0 A34:293 A49 : 97 3rd Qu.: 3972
## Max. :72.0 A46 : 50 Max. :18424
## (Other): 55
## saving_acct present_emp installment_rate sex other_debtor
## A61:603 A71: 62 Min. :1.000 A91: 50 A101:907
## A62:103 A72:172 1st Qu.:2.000 A92:310 A102: 41
## A63: 63 A73:339 Median :3.000 A93:548 A103: 52
## A64: 48 A74:174 Mean :2.973 A94: 92
## A65:183 A75:253 3rd Qu.:4.000
## Max. :4.000
##
## present_resid property age other_install housing
## Min. :1.000 A121:282 Min. :19.00 A141:139 A151:179
## 1st Qu.:2.000 A122:232 1st Qu.:27.00 A142: 47 A152:713
## Median :3.000 A123:332 Median :33.00 A143:814 A153:108
## Mean :2.845 A124:154 Mean :35.55
## 3rd Qu.:4.000 3rd Qu.:42.00
## Max. :4.000 Max. :75.00
##
## n_credits job n_people telephone foreign response
## Min. :1.000 A171: 22 Min. :1.000 A191:596 A201:963 0:700
## 1st Qu.:1.000 A172:200 1st Qu.:1.000 A192:404 A202: 37 1:300
## Median :1.000 A173:630 Median :1.000
## Mean :1.407 A174:148 Mean :1.155
## 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :4.000 Max. :2.000
##
#No Missing Values present in the Data
#Splitting the Data into training and validation data set in 75:25 ratio
set.seed(10743959)
subset<-sample(nrow(german_credit),nrow(german_credit)*0.75)
germancredit.train<-german_credit[subset,]
germancredit.test<-german_credit[-subset,]
#Logistic Regression
lr.gc.model<-glm(response ~ .,family = binomial,germancredit.train)
lr.gc.predict<-predict(lr.gc.model,germancredit.test,type="response")
#Choosing cutoff probability as 0.5 to classify into bad and good customers
lr.gc.predict<-ifelse(lr.gc.predict>0.5,1,0)
table(germancredit.test$response,lr.gc.predict,dnn = c("Actual ","Predicted "))
## Predicted
## Actual 0 1
## 0 148 25
## 1 39 38
lr.mr<-mean(ifelse(lr.gc.predict!=germancredit.test$response,1,0))
round(lr.mr,2)
## [1] 0.26
#Bagging: tuning parameters: ntree = 1000, mtry = 20 (all of the predictors considered at every split)
set.seed(10743959)
#Model Building
bag.gc.model<-randomForest(response ~ .,germancredit.train,ntree=1000,mtry=20,importance=TRUE)
#Predicting on test
bag.gc.predict<-predict(bag.gc.model,germancredit.test,type="class")
table(germancredit.test$response,bag.gc.predict,dnn = c("Actual ","Predicted "))
## Predicted
## Actual 0 1
## 0 156 17
## 1 39 38
#Test misclassification rate
bagging.mr<-(17+39)/nrow(germancredit.test)
bagging.mr
## [1] 0.224
#Random Forest: tuning parameters: ntree = 1000, mtry = floor(sqrt(20)) = 4 predictors considered at every split
set.seed(10743959)
#Model Building
rf.gc.model<-randomForest(response ~ .,germancredit.train,ntree=1000,mtry=floor(sqrt(20)),importance=TRUE)
#Predicting on test
rf.gc.predict<-predict(rf.gc.model,germancredit.test,type="class")
table(germancredit.test$response,rf.gc.predict,dnn = c("Actual ","Predicted "))
## Predicted
## Actual 0 1
## 0 157 16
## 1 49 28
#Test misclassification rate
rf.mr<-(16+49)/nrow(germancredit.test)
rf.mr
## [1] 0.26
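The exercise also calls for boosting. Below is a minimal sketch on the same split (not run here, so no misclassification rate is reported); note that gbm expects a numeric 0/1 response for distribution = "bernoulli":
set.seed(10743959)
#gbm needs a numeric 0/1 response, so undo the factor coding on a copy of the training set
gbm.train<-germancredit.train
gbm.train$response<-as.numeric(as.character(gbm.train$response))
boost.gc.model<-gbm(response~.,data=gbm.train,distribution="bernoulli",n.trees=1000,shrinkage=0.01)
boost.gc.prob<-predict(boost.gc.model,germancredit.test,n.trees=1000,type="response")
boost.gc.predict<-ifelse(boost.gc.prob>0.5,1,0)
boost.mr<-mean(boost.gc.predict!=as.numeric(as.character(germancredit.test$response)))
boost.mr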
The test set misclassification rates for the different modeling techniques are summarized below:
Logistic Regression: 0.26
Bagging: 0.22
Random Forest: 0.26
Bagging gives the lowest misclassification rate among the three methods; on this split the random forest only ties with logistic regression.