Chapter 8: Tree-Based Methods

Classification Trees, Regression Trees, Bagging, Random Forest, Boosting

Applied (7-12)

Problem 7

In the lab, we applied random forests to the Boston data using mtry=6 and using ntree=25 and ntree=500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.

Solution 7

library(tree)
library(randomForest)
library(MASS)
#splitting the data into training and testing data
set.seed(1)
subset<-sample(1:nrow(Boston),nrow(Boston)*0.7)
Boston.train<-Boston[subset,-14]
Boston.test<-Boston[-subset,-14]
y.train<-Boston[subset,14]
y.test<-Boston[-subset,14]
#Building 5 different models with tuning parameter m = p, p/2, sqrt(p), p/3, p/4
rfmodel1<-randomForest(Boston.train,y=y.train,xtest = Boston.test,ytest = y.test,ntree=500,mtry=ncol(Boston.train))
rfmodel2<-randomForest(Boston.train,y.train,xtest = Boston.test,ytest = y.test,ntree=500,mtry=(ncol(Boston.train))/2)
rfmodel3<-randomForest(Boston.train,y.train,xtest = Boston.test,ytest = y.test,ntree=500,mtry=(ncol(Boston.train))^(0.5))
rfmodel4<-randomForest(Boston.train,y.train,xtest = Boston.test,ytest = y.test,ntree=500,mtry=(ncol(Boston.train))/3)
rfmodel5<-randomForest(Boston.train,y.train,xtest = Boston.test,ytest = y.test,ntree=500,mtry=(ncol(Boston.train))/4)

plot(1:500,rfmodel1$test$mse,col="red",type="l",xlab = "Number of Trees",ylab = "Test MSE",ylim = c(10,25))
lines(1:500,rfmodel2$test$mse, col="orange",type="l")
lines(1:500,rfmodel3$test$mse, col="green",type="l")
lines(1:500,rfmodel4$test$mse, col="blue",type="l")
lines(1:500,rfmodel5$test$mse, col="black",type="l")
legend("topright",c("m=p=13","m=p/2","m=sqrt(p)","m=p/3","m=p/4"),col=c("red","orange","green","blue","black"),cex=0.5,lty=1)

We see that the test MSE decreases as the number of trees grows. It stabilizes after a few hundred trees, with no further improvement beyond that.

The minimum test MSE is obtained with m = sqrt(p).
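As a numeric check (a small sketch reusing the five fits above; the resulting values are not reproduced here), the final test MSE at ntree=500 can be pulled out of each model directly:

#final test MSE of each model at ntree=500
final.mse<-sapply(list(rfmodel1,rfmodel2,rfmodel3,rfmodel4,rfmodel5),function(model) tail(model$test$mse,1))
names(final.mse)<-c("m=p","m=p/2","m=sqrt(p)","m=p/3","m=p/4")
round(final.mse,2)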

Problem 8

In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

(a)

Split the data set into a training set and a test set.

Solution (a)

library(ISLR)
attach(Carseats)
set.seed(1)
subset<-sample(nrow(Carseats),nrow(Carseats)*0.7)
car.train<-Carseats[subset,]
car.test<-Carseats[-subset,]

(b)

Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

Solution (b)

library(tree)
car.model.train<-tree(Sales~.,car.train)
summary(car.model.train)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = car.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "CompPrice"   "Advertising"
## [6] "Population"  "Income"     
## Number of terminal nodes:  18 
## Residual mean deviance:  2.502 = 655.5 / 262 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.74100 -0.98720 -0.02545  0.00000  1.02000  5.06900
plot(car.model.train)
text(car.model.train,pretty=0)

tree.prediction<-predict(car.model.train,newdata=car.test)
tree.mse<-mean((car.test$Sales-tree.prediction)^2)
tree.mse
## [1] 5.288256

The test MSE for the fully grown tree is 5.288.

(c)

Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

Solution (c)

set.seed(1)
cv.car<-cv.tree(car.model.train)
plot(cv.car$size,cv.car$dev,xlab = "Size of Tree",ylab = "Deviance",type = "b")
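The size that minimizes the CV deviance can also be extracted programmatically rather than read off the plot (a small sketch; below we prune to 6 terminal nodes based on the plot):

best.size<-cv.car$size[which.min(cv.car$dev)]
best.size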

prune.car<-prune.tree(car.model.train,best=6)
plot(prune.car)
text(prune.car,pretty=0)

prune.predict<-predict(prune.car,car.test)
mean((prune.predict-car.test$Sales)^2)
## [1] 5.453816

We used cross-validation to find the size to which the tree should be pruned; the size with the lowest CV deviance is selected as the optimal pruning size.

For the pruned tree the test MSE is 5.454, slightly higher than for the full tree, so pruning did not improve the test MSE in this case.

(d)

Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

Solution (d)

bag.car<-randomForest(Sales~.,car.train,importance=TRUE,mtry=10) #Carseats has 10 predictors
importance(bag.car)
##                %IncMSE IncNodePurity
## CompPrice   27.2205963    211.989549
## Income       8.4435449    108.627807
## Advertising 22.1539607    184.298723
## Population   5.7100937     93.196984
## Price       64.9087095    693.051954
## ShelveLoc   69.6421864    591.926746
## Age         20.9073044    194.036137
## Education    0.2049133     53.557062
## Urban       -5.0569793      8.424407
## US           1.6671334     11.686647
bag.car.predict<-predict(bag.car,car.test)
mean((bag.car.predict-car.test$Sales)^2)
## [1] 2.324245

We use randomForest with m = p = 10, i.e. all predictors considered at each split, which is equivalent to bagging.

The test set MSE is 2.324, a clear reduction relative to the single pruned tree; bagging helped reduce the MSE.

We can see that Price and ShelveLoc are the two most important variables in the bagged model.
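The same ranking can be inspected visually with randomForest's built-in importance plot:

varImpPlot(bag.car)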

(e)

Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

Solution (e)

rf.car<-randomForest(Sales~.,car.train,importance=TRUE,mtry=sqrt(13)) #note: Carseats has p = 10 predictors, so mtry = sqrt(10) would be the textbook choice
importance(rf.car)
##               %IncMSE IncNodePurity
## CompPrice   17.471607     191.36441
## Income       6.623987     150.25084
## Advertising 19.843893     211.95548
## Population   2.498219     130.16751
## Price       50.014724     600.96795
## ShelveLoc   50.103591     506.77540
## Age         17.389449     215.75786
## Education    2.256214      77.89851
## Urban       -1.936739      14.00996
## US           2.398366      29.92999
rf.car.predict<-predict(rf.car,car.test)
mean((rf.car.predict-car.test$Sales)^2)
## [1] 2.517948

Using the random forest, the test MSE increased slightly compared to bagging.

The variables ranked most important by the random forest are the same as those chosen by bagging.

Random forests decorrelate the trees and are therefore usually expected to outperform bagging, but that is not the case here. Further tuning of m would be worth trying; the sketch below sweeps m over its full range.
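The sketch records the test MSE for each value of m from 1 to p (exact numbers will vary with the random seed). Since bagging beat m = sqrt(p) here, we would expect the curve to keep improving toward larger m:

#test MSE as a function of m, the number of variables tried at each split
p<-ncol(car.train)-1
mse.by.m<-sapply(1:p,function(m){
  fit<-randomForest(Sales~.,car.train,mtry=m)
  mean((predict(fit,car.test)-car.test$Sales)^2)
})
plot(1:p,mse.by.m,type="b",xlab="m (variables per split)",ylab="Test MSE")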

Full Grown Tree MSE: 5.288

Pruned Tree MSE: 5.454

Bagging Model MSE: 2.324

Random Forest MSE: 2.518

Problem 9

This problem involves the OJ data set which is part of the ISLR package.

(a)

Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

Solution (a)

library(ISLR)
attach(OJ)
train<-sample(1:nrow(OJ),800)
oj.train<-OJ[train,]
oj.test<-OJ[-train,]

(b)

Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

Solution (b)

oj.tree<-tree(Purchase~.,oj.train)
fulltree.summary<-summary(oj.tree)
fulltree.summary
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "ListPriceDiff"
## Number of terminal nodes:  7 
## Residual mean deviance:  0.7287 = 577.8 / 793 
## Misclassification error rate: 0.145 = 116 / 800
fulltree.misclass<-fulltree.summary$misclass[1]*100/fulltree.summary$misclass[2]

The algorithm used only the LoyalCH, PriceDiff, and ListPriceDiff variables in constructing the tree

Full Tree Misclassification error rate: 14.5 % for training data

Number of terminal nodes: 7

(c)

Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

Solution (c)

oj.tree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1081.00 CH ( 0.59250 0.40750 )  
##    2) LoyalCH < 0.48285 297  304.30 MM ( 0.20875 0.79125 )  
##      4) LoyalCH < 0.136344 96   33.26 MM ( 0.04167 0.95833 ) *
##      5) LoyalCH > 0.136344 201  241.50 MM ( 0.28856 0.71144 )  
##       10) PriceDiff < 0.31 158  161.90 MM ( 0.20886 0.79114 ) *
##       11) PriceDiff > 0.31 43   58.47 CH ( 0.58140 0.41860 ) *
##    3) LoyalCH > 0.48285 503  475.60 CH ( 0.81909 0.18091 )  
##      6) LoyalCH < 0.74912 238  303.90 CH ( 0.66387 0.33613 )  
##       12) PriceDiff < -0.165 33   24.38 MM ( 0.12121 0.87879 ) *
##       13) PriceDiff > -0.165 205  230.00 CH ( 0.75122 0.24878 )  
##         26) ListPriceDiff < 0.135 37   50.62 MM ( 0.43243 0.56757 ) *
##         27) ListPriceDiff > 0.135 168  157.70 CH ( 0.82143 0.17857 ) *
##      7) LoyalCH > 0.74912 265   91.54 CH ( 0.95849 0.04151 ) *

Terminal nodes are marked with an asterisk.

We pick node 11 (PriceDiff > 0.31) for interpretation.

This leaf contains 43 observations, with a residual deviance of 58.47. The overall prediction is CH, with about 58% of the observations in the node taking the value CH and the remaining 42% MM.

(d)

Create a plot of the tree, and interpret the results.

Solution (d)

plot(oj.tree)
text(oj.tree,pretty=0)

The most important indicator of Purchase appears to be LoyalCH: the top three splits all involve LoyalCH.

(e)

Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

Solution (e)

oj.predict.test<-predict(oj.tree,newdata = oj.test,type = "class")
table(oj.test$Purchase,oj.predict.test,dnn = c("Actual","Predicted"))
##       Predicted
## Actual  CH  MM
##     CH 142  37
##     MM  21  70
oj.testerror<-(21+37)/nrow(oj.test)
round(oj.testerror,3)
## [1] 0.215

The test error rate for the fully grown tree is 21.5%.
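The same rate can be computed directly, without reading counts off the confusion matrix:

mean(oj.predict.test!=oj.test$Purchase)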

(f)

Apply the cv.tree() function to the training set in order to determine the optimal tree size.

Solution (f)

oj.cv.train<-cv.tree(oj.tree,FUN = prune.misclass)
oj.cv.train
## $size
## [1] 7 5 4 2 1
## 
## $dev
## [1] 147 146 146 163 326
## 
## $k
## [1]  -Inf   3.5   5.0  12.5 173.0
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

(g)

Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

Solution (g)

plot(oj.cv.train$size,oj.cv.train$dev,xlab="Size of the Tree",ylab="CV Deviance",type = "b")
points(4,min(oj.cv.train$dev),col="red")

(h)

Which tree size corresponds to the lowest cross-validated classification error rate?

Solution (h)

The cross-validated deviance is minimized at tree size 4 (sizes 4 and 5 are tied at a deviance of 146; we pick the smaller tree).

(i)

Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

Solution (i)

oj.prune<-prune.misclass(oj.tree,best=4)
plot(oj.prune)
text(oj.prune,pretty=0)

(j)

Compare the training error rates between the pruned and unpruned trees. Which is higher?

Solution (j)

prune.summary<-summary(oj.prune)
prune.summary
## 
## Classification tree:
## snip.tree(tree = oj.tree, nodes = c(2L, 13L))
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  4 
## Residual mean deviance:  0.8169 = 650.2 / 796 
## Misclassification error rate: 0.16 = 128 / 800
prune.misclas<-prune.summary$misclass[1]*100/prune.summary$misclass[2]

Full Tree Misclassification error rate for training data: 14.5 %

Pruned Tree Misclassification error rate for training data: 16 %

Pruning did not reduce the misclassification error on the training data; this is expected, since pruning sacrifices training fit in exchange for a simpler tree.

(k)

Compare the test error rates between the pruned and unpruned trees. Which is higher?

Solution (k)

prune.predict.test<-predict(oj.prune,newdata = oj.test,type="class")
table(oj.test$Purchase,prune.predict.test,dnn = c("Actual","Predicted"))
##       Predicted
## Actual  CH  MM
##     CH 139  40
##     MM  15  76
oj.testerror.prune<-(40+15)/nrow(oj.test)
round(oj.testerror.prune,3)
## [1] 0.204

Full Tree Misclassification error rate for test data: 21.5 %

Pruned Tree Misclassification error rate for test data: 20.4 %

After pruning, the misclassification rate on the test data is lower than for the fully grown tree. This indicates that the full tree overfits: it achieves a lower training error but a higher test error.

Problem 10

We now use boosting to predict Salary in the Hitters data set.

(a)

Remove the observations for whom the salary information is unknown, and then log-transform the salaries.

Solution (a)

attach(Hitters)
Hitters<-na.omit(Hitters)
Hitters$Salary<-log(Hitters$Salary)

(b)

Create a training set consisting of the first 200 observations, and a test set consisting of the remaining observations.

Solution (b)

subset<-1:200
hitters.train<-Hitters[subset,]
hitters.test<-Hitters[-subset,]

(c)

Perform boosting on the training set with 1,000 trees for a range of values of the shrinkage parameter λ. Produce a plot with different shrinkage values on the x-axis and the corresponding training set MSE on the y-axis.

Solution (c)

library(gbm)
set.seed(1)
#defining a range of lambdas from 0.01 to 1
powers<-seq(-2,0,by=0.1)
lambdas<-10^powers
#defining list storing training errors
train.error<-rep(NA,length(lambdas))
#For loop over the range of values of lambdas to store error
#boosting tuning parameters: shrinkage (lambda), number of trees (n.trees), and distribution ("gaussian" for regression, "bernoulli" for classification)
for (i in 1:length(lambdas)){
hitters.gbm<-gbm(Salary~.,hitters.train,distribution = "gaussian",n.trees=1000,shrinkage=lambdas[i])

#computing training error

hitters.train.pred<-predict(hitters.gbm,hitters.train,n.trees=1000)
train.error[i]<-mean((hitters.train.pred-hitters.train$Salary)^2)
}

#Plotting training MSE against Lambdas
plot(lambdas,train.error,type="b",xlab="Shrinkage Value(lambda)",ylab="Training MSE")
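For a fixed number of trees, larger shrinkage values let boosting fit the training data more aggressively, so the training MSE typically decreases monotonically in lambda. This can be checked directly (a small sketch; the values are not reproduced here):

lambdas[which.min(train.error)]
min(train.error)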

(d)

Produce a plot with different shrinkage values on the x-axis and the corresponding test set MSE on the y-axis.

Solution (d)

#test mse
set.seed(1)

#defining list storing testing errors
test.error<-rep(NA,length(lambdas))
#Same loop over the lambdas as in (c), now storing the test error
for (i in 1:length(lambdas)){
hitters.gbm<-gbm(Salary~.,hitters.train,distribution = "gaussian",n.trees=1000,shrinkage=lambdas[i])

#computing test error

hitters.test.pred<-predict(hitters.gbm,hitters.test,n.trees=1000)
test.error[i]<-mean((hitters.test.pred-hitters.test$Salary)^2)
}

#Plotting testing MSE against Lambdas
plot(lambdas,test.error,type="b",xlab="Shrinkage Value(lambda)",ylab="Test MSE")

hitters.gbm.testerror<-min(test.error)
hitters.gbm.testerror
## [1] 0.255455

The minimum test MSE obtained by boosting is about 0.26.
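The shrinkage value achieving this minimum can be extracted directly; it is reused when refitting the model in part (f):

lambdas[which.min(test.error)]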

(e)

Compare the test MSE of boosting to the test MSE that results from applying two of the regression approaches seen in Chapters 3 and 6.

Solution (e)

library(glmnet)

#Fitting least square regression model
lm<-lm(Salary~.,hitters.train)
hitters.predict.lm<-predict(lm,hitters.test)
hitters.lm.test.mse<-mean((hitters.predict.lm-hitters.test$Salary)^2)
hitters.lm.test.mse
## [1] 0.4917959
#Ridge regression model

#here we fix the penalty at s = 0.01 rather than tuning it

x<-model.matrix(Salary~.,hitters.train)
x.test<-model.matrix(Salary ~ . , hitters.test)
y<-hitters.train$Salary
hitters.ridge<-glmnet(x,y,alpha=0)
hitters.ridge.predict<-predict(hitters.ridge,s=0.01,x.test)
hitters.ridge.test.mse<-mean((hitters.ridge.predict-hitters.test$Salary)^2)
hitters.ridge.test.mse
## [1] 0.4570283
#Lasso regression model

#here we fix the penalty at s = 0.01 rather than tuning it

x<-model.matrix(Salary~.,hitters.train)
x.test<-model.matrix(Salary ~ . , hitters.test)
y<-hitters.train$Salary
hitters.lasso<-glmnet(x,y,alpha=1)
hitters.lasso.predict<-predict(hitters.lasso,s=0.01,x.test)
hitters.lasso.test.mse<-mean((hitters.lasso.predict-hitters.test$Salary)^2)
hitters.lasso.test.mse
## [1] 0.4700537

The test MSEs for the different methods are summarized below. Boosting gives the lowest test MSE of the four models.

Least Squares Regression (full model) Test MSE: 0.49

Ridge Regression Model Test MSE: 0.46

Lasso Regression Model Test MSE: 0.47

Boosting Model Test MSE: 0.26
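Rather than fixing s = 0.01, cross-validation could choose the penalty. A small sketch for ridge, reusing x, y, and x.test from above (for the lasso, set alpha = 1):

#cross-validating the ridge penalty instead of fixing s = 0.01
set.seed(1)
cv.ridge<-cv.glmnet(x,y,alpha=0)
ridge.cv.predict<-predict(cv.ridge,s=cv.ridge$lambda.min,newx=x.test)
mean((ridge.cv.predict-hitters.test$Salary)^2)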

(f)

Which variables appear to be the most important predictors in the boosted model?

Solution (f)

boost.hitters<-gbm(Salary~.,data=hitters.train,distribution = "gaussian",n.trees = 1000,shrinkage=lambdas[which.min(test.error)])

summary(boost.hitters)

##                 var    rel.inf
## CAtBat       CAtBat 30.0391455
## Years         Years  7.1722320
## CWalks       CWalks  6.6962390
## PutOuts     PutOuts  6.1919523
## Walks         Walks  6.0430398
## CRuns         CRuns  5.8184446
## CHmRun       CHmRun  5.7355580
## CRBI           CRBI  5.6678930
## Hits           Hits  5.5180489
## HmRun         HmRun  3.7274075
## Assists     Assists  3.6165621
## AtBat         AtBat  2.8770937
## RBI             RBI  2.8062318
## CHits         CHits  2.8030774
## Errors       Errors  2.3509666
## Runs           Runs  1.5093746
## Division   Division  0.6614964
## NewLeague NewLeague  0.5632541
## League       League  0.2019828

We find that CAtBat is by far the most important variable, accounting for about 30% of the relative influence

(g)

Now apply bagging to the training set. What is the test set MSE for this approach?

Solution (g)

set.seed(1)
#bagging: mtry = 19, i.e. all of the Hitters predictors, considered at every split
hitters.bagging<-randomForest(Salary~.,hitters.train,mtry=19,importance=TRUE)
hitters.bagg.predict<-predict(hitters.bagging,hitters.test)
hitters.bagg.test.mse<-mean((hitters.bagg.predict-hitters.test$Salary)^2)
hitters.bagg.test.mse
## [1] 0.228722

The test set MSE for bagging is 0.23.

This is lower than the test MSE of 0.26 obtained with boosting.

Problem 11

This question uses the Caravan data set.

(a)

Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.

Solution (a)

attach(Caravan)
set.seed(1)
caravan.subset<-1:1000
Caravan$Purchase<-ifelse(Caravan$Purchase=="Yes",1,0)
caravan.train<-Caravan[caravan.subset,]
caravan.test<-Caravan[-caravan.subset,]

(b)

Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?

Solution (b)

library(gbm)
library(DT)

#boosting with lambda=0.01, ntrees=1000, distribution=bernoulli as classification problem

set.seed(1)
caravan.boost<-gbm(Purchase~.,caravan.train,shrinkage = 0.01,n.trees = 1000,distribution = "bernoulli")
datatable(summary(caravan.boost), class="table-condensed", style="bootstrap", options = list(dom = 'tp'))

It can be seen that “PPERSAUT” is the most important variable

(c)

Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20 %. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this data set?

Solution (c)

#Prediction using Boosting

caravan.predict.boost<-predict(caravan.boost,caravan.test,n.trees = 1000,type="response")
caravan.predict.prob.boost<-ifelse(caravan.predict.boost>0.2,1,0)
table(caravan.test$Purchase,caravan.predict.prob.boost,dnn=c("Actual","Predicted"))
##       Predicted
## Actual    0    1
##      0 4410  123
##      1  256   33
#fraction of people predicted to purchase who actually did
TPF<-33/(123+33)
TPF
## [1] 0.2115385
#Logistic Regression

caravan.logit<-glm(Purchase~.,caravan.train,family = "binomial")
carvan.predict.logit<-predict(caravan.logit,caravan.test,type="response")
caravan.predict.prob.logit<-ifelse(carvan.predict.logit>0.2,1,0)
table(caravan.test$Purchase,caravan.predict.prob.logit,dnn=c("Actual","Predicted"))
##       Predicted
## Actual    0    1
##      0 4183  350
##      1  231   58
TPF.logit<-58/(350+58)
TPF.logit
## [1] 0.1421569

With the boosting model, about 21% of the people predicted to make a purchase actually made one.

With the logistic regression model, only about 14% did.

Targeting customers with the boosting model should therefore work better than with the logistic regression model.
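The question also asks about KNN. Below is a sketch following the Chapter 4 lab (standardized predictors, class::knn); k=5 is an arbitrary choice here. Majority-vote KNN does not output probabilities directly, so the 20% cutoff is approximated from the vote fractions returned with prob=TRUE:

library(class)
#standardize the 85 predictors; Purchase is the 86th column
standardized.X<-scale(Caravan[,-86])
knn.pred<-knn(standardized.X[caravan.subset,],standardized.X[-caravan.subset,],as.factor(caravan.train$Purchase),k=5,prob=TRUE)
#attr "prob" is the vote fraction of the winning class; convert it to P(Purchase=1)
votes<-attr(knn.pred,"prob")
p.purchase<-ifelse(knn.pred=="1",votes,1-votes)
knn.class<-ifelse(p.purchase>0.2,1,0)
table(caravan.test$Purchase,knn.class,dnn=c("Actual","Predicted"))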

Problem 12: Credit Risk Analysis

Apply boosting, bagging, and random forests to a data set of your choice. Be sure to fit the models on a training set and to evaluate their performance on a test set. How accurate are the results compared to simple methods like linear or logistic regression? Which of these approaches yields the best performance?

Solution

We use the German Credit data. The misclassification rate on the test set is used as the performance metric.

#Loading Dependencies
library(knitr)

#Loading Data Set

german_credit = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")

colnames(german_credit) = c("chk_acct", "duration", "credit_his", "purpose", 
                            "amount", "saving_acct", "present_emp", "installment_rate", "sex", "other_debtor", 
                            "present_resid", "property", "age", "other_install", "housing", "n_credits", 
                            "job", "n_people", "telephone", "foreign", "response")

#Take a glance at the Dataset
kable(head(german_credit,10))
(Output: a preview of the first 10 rows of german_credit across all 21 columns, from chk_acct through response.)
#Dimension of the Table
dim(german_credit)
## [1] 1000   21
#The data set has 20 predictor variables and 1 response variable with 1000 observations

#Variable types
str(german_credit)
## 'data.frame':    1000 obs. of  21 variables:
##  $ chk_acct        : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
##  $ duration        : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_his      : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
##  $ purpose         : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
##  $ amount          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ saving_acct     : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
##  $ present_emp     : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ installment_rate: int  4 2 2 2 3 2 3 2 2 4 ...
##  $ sex             : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
##  $ other_debtor    : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
##  $ present_resid   : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property        : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
##  $ age             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ other_install   : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ housing         : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
##  $ n_credits       : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ job             : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
##  $ n_people        : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone       : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign         : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
##  $ response        : int  1 2 1 1 2 1 1 1 1 2 ...
#Recoding the response as 0/1 instead of 1/2, and converting response and present_resid to factors

# original response coding: 1 = good, 2 = bad; we need 0 = good, 1 = bad
german_credit$response = german_credit$response - 1 # Run this once

german_credit$response = as.factor(german_credit$response)
german_credit$present_resid <- as.factor(german_credit$present_resid)


#Data Summary
summary(german_credit)
##  chk_acct     duration    credit_his    purpose        amount     
##  A11:274   Min.   : 4.0   A30: 40    A43    :280   Min.   :  250  
##  A12:269   1st Qu.:12.0   A31: 49    A40    :234   1st Qu.: 1366  
##  A13: 63   Median :18.0   A32:530    A42    :181   Median : 2320  
##  A14:394   Mean   :20.9   A33: 88    A41    :103   Mean   : 3271  
##            3rd Qu.:24.0   A34:293    A49    : 97   3rd Qu.: 3972  
##            Max.   :72.0              A46    : 50   Max.   :18424  
##                                      (Other): 55                  
##  saving_acct present_emp installment_rate  sex      other_debtor
##  A61:603     A71: 62     Min.   :1.000    A91: 50   A101:907    
##  A62:103     A72:172     1st Qu.:2.000    A92:310   A102: 41    
##  A63: 63     A73:339     Median :3.000    A93:548   A103: 52    
##  A64: 48     A74:174     Mean   :2.973    A94: 92               
##  A65:183     A75:253     3rd Qu.:4.000                          
##                          Max.   :4.000                          
##                                                                 
##  present_resid   property        age        other_install housing   
##  Min.   :1.000   A121:282   Min.   :19.00   A141:139      A151:179  
##  1st Qu.:2.000   A122:232   1st Qu.:27.00   A142: 47      A152:713  
##  Median :3.000   A123:332   Median :33.00   A143:814      A153:108  
##  Mean   :2.845   A124:154   Mean   :35.55                           
##  3rd Qu.:4.000              3rd Qu.:42.00                           
##  Max.   :4.000              Max.   :75.00                           
##                                                                     
##    n_credits       job         n_people     telephone  foreign    response
##  Min.   :1.000   A171: 22   Min.   :1.000   A191:596   A201:963   0:700   
##  1st Qu.:1.000   A172:200   1st Qu.:1.000   A192:404   A202: 37   1:300   
##  Median :1.000   A173:630   Median :1.000                                 
##  Mean   :1.407   A174:148   Mean   :1.155                                 
##  3rd Qu.:2.000              3rd Qu.:1.000                                 
##  Max.   :4.000              Max.   :2.000                                 
## 
#No Missing Values present in the Data
#Splitting the Data into training and validation data set in 75:25 ratio

set.seed(10743959)
subset<-sample(nrow(german_credit),nrow(german_credit)*0.75)
germancredit.train<-german_credit[subset,]
germancredit.test<-german_credit[-subset,]

#Logistic Regression

lr.gc.model<-glm(response ~ .,family = binomial,germancredit.train)

lr.gc.predict<-predict(lr.gc.model,germancredit.test,type="response")
#Choosing cutoff probability as 0.5 to classify into bad and good customers
lr.gc.predict<-ifelse(lr.gc.predict>0.5,1,0)
table(germancredit.test$response,lr.gc.predict,dnn = c("Actual ","Predicted "))
##        Predicted 
## Actual    0   1
##       0 148  25
##       1  39  38
lr.mr<-mean(ifelse(lr.gc.predict!=germancredit.test$response,1,0))
round(lr.mr,2)
## [1] 0.26
#Bagging: tuning parameters ntree = 1000 and mtry = 20, i.e. all predictors considered at every split
set.seed(10743959)
#Model Building
bag.gc.model<-randomForest(response ~ .,germancredit.train,ntree=1000,mtry=20,importance=TRUE)


#Predicting on test
bag.gc.predict<-predict(bag.gc.model,germancredit.test,type="class")

table(germancredit.test$response,bag.gc.predict,dnn = c("Actual ","Predicted "))
##        Predicted 
## Actual    0   1
##       0 156  17
##       1  39  38
#Test misclassification rate
bagging.mr<-(17+39)/nrow(germancredit.test)
bagging.mr
## [1] 0.224
#Random Forest: tuning parameters ntree = 1000 and mtry = sqrt(20), roughly the square root of the number of predictors
set.seed(10743959)
#Model Building
rf.gc.model<-randomForest(response ~ .,germancredit.train,ntree=1000,mtry=sqrt(20),importance=TRUE)


#Predicting on test
rf.gc.predict<-predict(rf.gc.model,germancredit.test,type="class")

table(germancredit.test$response,rf.gc.predict,dnn = c("Actual ","Predicted "))
##        Predicted 
## Actual    0   1
##       0 157  16
##       1  49  28
#Test misclassification rate
rf.mr<-(16+49)/nrow(germancredit.test)
rf.mr
## [1] 0.26
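The problem also asks for boosting. Below is a sketch with gbm (bernoulli deviance, lambda = 0.01, 1,000 trees); its misclassification rate is not included in the summary that follows. Note that gbm expects a numeric 0/1 response, so the factor is converted back:

#Boosting: convert the factor response back to numeric 0/1 for gbm
set.seed(10743959)
gc.train.gbm<-germancredit.train
gc.train.gbm$response<-as.numeric(as.character(gc.train.gbm$response))
boost.gc.model<-gbm(response~.,gc.train.gbm,distribution="bernoulli",n.trees=1000,shrinkage=0.01)

#classify with a 0.5 cutoff, as for the other models
boost.gc.prob<-predict(boost.gc.model,germancredit.test,n.trees=1000,type="response")
boost.gc.predict<-ifelse(boost.gc.prob>0.5,1,0)
boost.mr<-mean(boost.gc.predict!=as.numeric(as.character(germancredit.test$response)))
boost.mr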

The test set misclassification rates for the different modeling techniques are summarized below:

Logistic Regression: 0.26

Bagging: 0.22

Random Forest: 0.26

Bagging gives the lowest misclassification rate among the methods compared, while the random forest performs no better than logistic regression here.