Reading & Explaining the Data

Please download the data using the link

Please review the code below and execute it in your own RMD file.

Reading the Data

# reading the data
df <- read.csv("AutoFinanaceData.csv")

# attach the data frame
attach(df)

# Number of rows and columns
dim(df)

## [1] 28906    21

Printing the Column Names of the Data

# column names
colnames(df)

##  [1] "Agmt.No"        "ContractStatus" "StartDate"      "AGE"           
##  [5] "NOOFDEPE"       "MTHINCTH"       "SALDATFR"       "TENORYR"       
##  [9] "DWNPMFR"        "PROFBUS"        "QUALHSC"        "QUAL_PG"       
## [13] "SEXCODE"        "FULLPDC"        "FRICODE"        "WASHCODE"      
## [17] "Region"         "Branch"         "DefaulterFlag"  "DefaulterType" 
## [21] "DATASET"

List of Data Columns

DEFAULT

1. Defaulter Flag

1: Customer has delayed paying at least once
0: Otherwise

DEMOGRAPHIC VARIABLES

1. Gender

SEXCODE = 1 (Male)
SEXCODE = 0 (Female)

2. Age

3. Education

QUALHSC
QUAL_PG

4. Income

Monthly Income in Thousands (MTHINCTH)
Owns a Fridge (FRICODE)
Owns a Washing Machine (WASHCODE)

5. Profession

PROFBUS = 1 (BUSINESS)
PROFBUS = 0 (PROFESSIONAL)

6. Number of Dependents

NOOFDEPE

7. Region

Structure of the Dataset

# structure
str(df)

## 'data.frame':    28906 obs. of  21 variables:
##  $ Agmt.No       : chr  "AP18100057" "AP18100140" "AP18100198" "AP18100217" ...
##  $ ContractStatus: chr  "Closed" "Closed" "Closed" "Closed" ...
##  $ StartDate     : chr  "19-01-01" "10-05-01" "05-08-01" "03-09-01" ...
##  $ AGE           : int  26 28 32 31 36 33 41 47 43 27 ...
##  $ NOOFDEPE      : int  2 2 2 0 2 2 2 0 0 0 ...
##  $ MTHINCTH      : num  4.5 5.59 8.8 5 12 ...
##  $ SALDATFR      : num  1 1 1 1 1 1 1 1 0.97 1 ...
##  $ TENORYR       : num  1.5 2 1 1 1 2 1 2 1.5 2 ...
##  $ DWNPMFR       : num  0.27 0.25 0.51 0.66 0.17 0.18 0.37 0.42 0.27 0.47 ...
##  $ PROFBUS       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ QUALHSC       : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ QUAL_PG       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SEXCODE       : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ FULLPDC       : int  1 1 1 1 1 0 0 1 1 1 ...
##  $ FRICODE       : int  0 1 1 1 1 0 0 0 0 0 ...
##  $ WASHCODE      : int  0 0 1 1 0 0 0 0 0 0 ...
##  $ Region        : chr  "AP2" "AP2" "AP2" "AP2" ...
##  $ Branch        : chr  "Vizag" "Vizag" "Vizag" "Vizag" ...
##  $ DefaulterFlag : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DefaulterType : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DATASET       : chr  " " "BUILD" "BUILD" "BUILD" ...

Convert categorical variables to `factor`

# columns 10 to 20 (PROFBUS through DefaulterType) hold categorical codes
factorCols <- 10:20
df[, factorCols] <- lapply(df[, factorCols], factor)
str(df)

## 'data.frame':    28906 obs. of  21 variables:
##  $ Agmt.No       : chr  "AP18100057" "AP18100140" "AP18100198" "AP18100217" ...
##  $ ContractStatus: chr  "Closed" "Closed" "Closed" "Closed" ...
##  $ StartDate     : chr  "19-01-01" "10-05-01" "05-08-01" "03-09-01" ...
##  $ AGE           : int  26 28 32 31 36 33 41 47 43 27 ...
##  $ NOOFDEPE      : int  2 2 2 0 2 2 2 0 0 0 ...
##  $ MTHINCTH      : num  4.5 5.59 8.8 5 12 ...
##  $ SALDATFR      : num  1 1 1 1 1 1 1 1 0.97 1 ...
##  $ TENORYR       : num  1.5 2 1 1 1 2 1 2 1.5 2 ...
##  $ DWNPMFR       : num  0.27 0.25 0.51 0.66 0.17 0.18 0.37 0.42 0.27 0.47 ...
##  $ PROFBUS       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ QUALHSC       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 1 1 ...
##  $ QUAL_PG       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SEXCODE       : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ FULLPDC       : Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 2 2 2 ...
##  $ FRICODE       : Factor w/ 2 levels "0","1": 1 2 2 2 2 1 1 1 1 1 ...
##  $ WASHCODE      : Factor w/ 2 levels "0","1": 1 1 2 2 1 1 1 1 1 1 ...
##  $ Region        : Factor w/ 8 levels "AP1","AP2","Chennai",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Branch        : Factor w/ 14 levels "Bangalore","Chennai",..: 14 14 14 14 14 14 14 14 14 14 ...
##  $ DefaulterFlag : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DefaulterType : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ DATASET       : chr  " " "BUILD" "BUILD" "BUILD" ...
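
Selecting columns by position breaks silently if the column order ever changes. A minimal sketch of the same conversion done by column name, on a hypothetical three-row toy data frame (the values are made up for illustration; only the column names come from the dataset):

```r
# Hypothetical toy data; only the column names match the real dataset
toy <- data.frame(
  AGE     = c(26, 28, 32),
  SEXCODE = c(1, 1, 0),
  Region  = c("AP2", "AP2", "Chennai"),
  stringsAsFactors = FALSE
)

# Convert the categorical columns to factors by name instead of by position
factorCols <- c("SEXCODE", "Region")
toy[factorCols] <- lapply(toy[factorCols], factor)

str(toy)
```

Numeric columns such as AGE are left untouched, so they still enter a model as continuous predictors.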

Creating Train and Test dataset

Reserve 75% for training and 25% for testing

# loading the package
library(caTools)

## Warning: package 'caTools' was built under R version 4.0.4

# fix the random seed so the split is reproducible
set.seed(123)
# split the data, preserving the ratio of DefaulterFlag classes
split = sample.split(df$DefaulterFlag, SplitRatio = 0.75)
# creating the training set
trainingSet = subset(df, split == TRUE)
# creating the test set
testSet = subset(df, split == FALSE)
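
To see what a class-preserving split does, here is a base-R sketch on a hypothetical outcome vector. It is an illustration of the idea only, not the internal implementation of `sample.split`; the vector `y`, the 30/70 class mix, and the names `trainIdx` and `inTrain` are assumptions made for the example.

```r
set.seed(123)

# Hypothetical outcome: 300 non-defaulters ("0") and 700 defaulters ("1")
y <- factor(rep(c("0", "1"), times = c(300, 700)))

# Sample 75% of the row indices separately within each class
trainIdx <- unlist(lapply(
  split(seq_along(y), y),
  function(idx) sample(idx, size = round(0.75 * length(idx)))
))
inTrain <- seq_along(y) %in% trainIdx

# The class proportions in both parts match the full vector
prop.table(table(y[inTrain]))
prop.table(table(y[!inTrain]))
```

Because sampling happens within each class, the training and test parts keep the same share of defaulters as the full data.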

Section 1: Decision Tree

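We fitted the decision tree on the training dataset with the Gini index as the splitting criterion and 10-fold cross-validation to choose the complexity parameter (cp). The resulting tree is shown below.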

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

set.seed(2345)
dTree <- train(DefaulterFlag ~ AGE
                       + NOOFDEPE
                       + MTHINCTH 
                       + SALDATFR 
                       + TENORYR
                       + DWNPMFR
                       + PROFBUS
                       + QUALHSC 
                       + QUAL_PG
                       + SEXCODE
                       + FULLPDC
                       + FRICODE
                       + WASHCODE
                       + Region, 
                       data = trainingSet, 
                       method = "rpart", 
                       parms = list(split = "gini"), 
                       trControl = trainControl(method = "cv"))
dTree

## CART 
## 
## 21679 samples
##    14 predictor
##     2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 19511, 19511, 19512, 19511, 19511, 19511, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa    
##   0.01024164  0.7292763  0.2129099
##   0.01152184  0.7269239  0.2189725
##   0.01408225  0.7186214  0.1326211
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.01024164.

library(rpart.plot)

## Loading required package: rpart

prp(dTree$finalModel, box.palette = "Reds", tweak = 1.2, varlen = 20)

Using the decision tree, we built the confusion matrix on the test dataset, shown below, assuming a probability threshold of 50%.

library(caret)
# predicted class probabilities on the test set
predProbTestTree <- predict(dTree, testSet, type = "prob")
# classify as "Yes" when the probability of default (column 2) exceeds 0.5
yPred <- ifelse(predProbTestTree[, 2] > 0.5, "Yes", "No")
predY <- factor(yPred, levels = c("No", "Yes"))
# relabel the reference levels "0"/"1" as "No"/"Yes" to match the predictions
levels(testSet$DefaulterFlag) <- c("No", "Yes")
confusionMatrix(data = predY, reference = testSet$DefaulterFlag, positive = "Yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   434  297
##        Yes 1649 4847
##                                           
##                Accuracy : 0.7307          
##                  95% CI : (0.7203, 0.7409)
##     No Information Rate : 0.7118          
##     P-Value [Acc > NIR] : 0.0001805       
##                                           
##                   Kappa : 0.1867          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9423          
##             Specificity : 0.2084          
##          Pos Pred Value : 0.7462          
##          Neg Pred Value : 0.5937          
##              Prevalence : 0.7118          
##          Detection Rate : 0.6707          
##    Detection Prevalence : 0.8989          
##       Balanced Accuracy : 0.5753          
##                                           
##        'Positive' Class : Yes             
##
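
The headline statistics can be reproduced by hand from the four cell counts of the decision tree confusion matrix above. A short sketch, with the counts copied from that output (the variable names are illustrative):

```r
# Counts copied from the decision tree confusion matrix (positive class = "Yes")
TP <- 4847  # predicted Yes, actually Yes
FN <- 297   # predicted No,  actually Yes
FP <- 1649  # predicted Yes, actually No
TN <- 434   # predicted No,  actually No

accuracy    <- (TP + TN) / (TP + TN + FP + FN)  # share of correct predictions
sensitivity <- TP / (TP + FN)                   # share of actual defaulters caught
specificity <- TN / (TN + FP)                   # share of non-defaulters cleared
balancedAcc <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced = balancedAcc), 4)
```

The rounded values agree with the Accuracy, Sensitivity, Specificity and Balanced Accuracy lines reported by `confusionMatrix`.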

Using the decision tree, we plotted the ROC curve on the test dataset, shown below.

# loading the package
library(ROCR)

## Warning: package 'ROCR' was built under R version 4.0.4

DTPrediction <- predict(dTree, testSet,type = "prob")
DTPrediction <- prediction(DTPrediction[2],testSet$DefaulterFlag)
DTperformance <- performance(DTPrediction, "tpr","fpr")
# plotting ROC curve
plot(DTperformance,main = "ROC Curve",col = 2,lwd = 2)
abline(a = 0,b = 1,lwd = 2,lty = 3,col = "black")

Using the decision tree, we calculated the area under the ROC curve (AUC), shown below.

library(ROCR)
# area under curve
DTPrediction <- prediction(predProbTestTree[2],testSet$DefaulterFlag)
aucDT <- performance(DTPrediction, measure = "auc")
aucDT <- aucDT@y.values[[1]]
aucDT

## [1] 0.6829464
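
AUC can be read as the probability that a randomly chosen defaulter receives a higher predicted score than a randomly chosen non-defaulter. A base-R sketch of that rank-based calculation on hypothetical toy data (the scores, the labels and the helper `rankAUC` are made up for illustration and are not part of ROCR):

```r
# Hypothetical toy data: predicted scores and true labels (1 = positive)
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0, 0, 1, 1)

# Rank-based AUC: chance that a random positive outranks a random negative
rankAUC <- function(scores, labels) {
  r    <- rank(scores)        # average ranks are used for tied scores
  nPos <- sum(labels == 1)
  nNeg <- sum(labels == 0)
  (sum(r[labels == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

aucToy <- rankAUC(scores, labels)
aucToy
```

Here three of the four positive-negative pairs are ordered correctly, so the toy AUC is 0.75.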

Section 2: Random Forest

Que 1. Write R code to fit the Random Forest model on the training dataset; the output is shown below.

Use nbagg = 50 and set.seed(123).

library(caret)
# control parameters: no resampling, keep class probabilities
set.seed(123)
trctrl <- trainControl(method = "none", classProbs = TRUE)

## relabel the outcome levels "0"/"1" as "No"/"Yes"
trainingSet$DefaulterFlag <- factor(ifelse(trainingSet$DefaulterFlag == "1", "Yes", "No"),
                                    levels = c("No", "Yes"))

RFModel <- train(DefaulterFlag ~ AGE
                       + NOOFDEPE
                       + MTHINCTH 
                       + SALDATFR 
                       + TENORYR
                       + DWNPMFR
                       + PROFBUS
                       + QUALHSC 
                       + QUAL_PG
                       + SEXCODE
                       + FULLPDC
                       + FRICODE
                       + WASHCODE
                       + Region, 
                       data = trainingSet, 
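                       method = "rf",
                       # nbagg and parms are not randomForest arguments; they are
                       # passed through `...` and appear to have no effect here
                       nbagg = 50,
                       parms = list(split = "gini"),
                       trControl = trctrl,
                       importance = TRUE)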
# model summary
RFModel

## Random Forest 
## 
## 21679 samples
##    14 predictor
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: None

Que 2. Using the Random Forest model, write R code to generate the confusion matrix shown below, assuming a probability threshold of 50%.

library(caret)
# predicted class probabilities on the test set
predProbTestRF <- predict(RFModel, testSet, type = "prob")
# classify as "Yes" when the probability of default (column 2) exceeds 0.5
yPred <- ifelse(predProbTestRF[, 2] > 0.5, "Yes", "No")
predY <- factor(yPred, levels = c("No", "Yes"))
# reference levels are "No"/"Yes" (already relabelled in Section 1)
levels(testSet$DefaulterFlag) <- c("No", "Yes")
confusionMatrix(data = predY, reference = testSet$DefaulterFlag, positive = "Yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   715  558
##        Yes 1368 4586
##                                           
##                Accuracy : 0.7335          
##                  95% CI : (0.7231, 0.7437)
##     No Information Rate : 0.7118          
##     P-Value [Acc > NIR] : 2.118e-05       
##                                           
##                   Kappa : 0.2655          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8915          
##             Specificity : 0.3433          
##          Pos Pred Value : 0.7702          
##          Neg Pred Value : 0.5617          
##              Prevalence : 0.7118          
##          Detection Rate : 0.6346          
##    Detection Prevalence : 0.8239          
##       Balanced Accuracy : 0.6174          
##                                           
##        'Positive' Class : Yes             
##

Que 3. Using the Random Forest model, write R code to plot the ROC curve, as shown below.

# loading the package
library(ROCR)
RFPrediction <- predict(RFModel, testSet,type = "prob")
RFPrediction <- prediction(RFPrediction[2],testSet$DefaulterFlag)
RFperformance <- performance(RFPrediction, "tpr","fpr")
# plotting ROC curve
plot(RFperformance,main = "ROC Curve",col = 2,lwd = 2)
abline(a = 0,b = 1,lwd = 2,lty = 3,col = "black")

Que 4. Using the Random Forest model, write R code to calculate the AUC (Area Under Curve).

library(ROCR)
# area under curve
RFPrediction <- prediction(predProbTestRF[2],testSet$DefaulterFlag)
aucRF <- performance(RFPrediction, measure = "auc")
aucRF <- aucRF@y.values[[1]]
aucRF

## [1] 0.7243218

Que 5. Write R code to draw the ROC curves for Decision Tree and Random Forest on the same graph using the test dataset.

# ROC curves for the decision tree and random forest on the test set
plot(DTperformance, main = "ROC Curves", col = "black", lwd = 2)
plot(RFperformance, add = TRUE, col = "red", lwd = 3)
# legend colours and line widths match the curves drawn above
legend(x = "bottomright",
       legend = c("Decision Tree", "Random Forest"),
       col = c("black", "red"), lwd = c(2, 3))

Que 6. Which Machine Learning technique (Decision Tree or Random Forest) is better based on a) Accuracy, b) Sensitivity, c) Specificity, d) AUC? Please explain your reasoning in two or three sentences.

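#a) Accuracy = Random Forest (0.7335 vs 0.7307)
#b) Sensitivity = Decision Tree (0.9423 vs 0.8915)
#c) Specificity = Random Forest (0.3433 vs 0.2084)
#d) AUC = Random Forest (0.7243 vs 0.6829)
# Random Forest is better on three of the four measures. The Decision Tree's
# higher sensitivity comes from labelling about 90% of test cases as defaulters
# (detection prevalence 0.8989), which is why its specificity is so low.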

Section 3: Bagging

Que 7. Write R code to fit the Bagging model on the training dataset; the output is shown below.

Use nbagg = 50 and set.seed(123).

library(caret)
set.seed(123)
# control parameters: no resampling, keep class probabilities
trctrl <- trainControl(method = "none", classProbs = TRUE)

## DefaulterFlag was already relabelled as "No"/"Yes" in Section 2
#trainingSet$DefaulterFlag <- ifelse(trainingSet$DefaulterFlag == "1","Yes","No")

BaggingModel <- train(DefaulterFlag ~ AGE
                       + NOOFDEPE
                       + MTHINCTH 
                       + SALDATFR 
                       + TENORYR
                       + DWNPMFR
                       + PROFBUS
                       + QUALHSC 
                       + QUAL_PG
                       + SEXCODE
                       + FULLPDC
                       + FRICODE
                       + WASHCODE
                       + Region, 
                       data = trainingSet, 
                         method = "treebag",
                         nbagg = 50,
                         trControl = trctrl,
                         importance = TRUE)
# model summary
BaggingModel

## Bagged CART 
## 
## 21679 samples
##    14 predictor
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: None

Que 8. Using the Bagging model, write R code to generate the confusion matrix shown below, assuming a probability threshold of 50%.

library(caret)
# predicted class probabilities on the test set
predProbTestBagg <- predict(BaggingModel, testSet, type = "prob")
# classify as "Yes" when the probability of default (column 2) exceeds 0.5
yPred <- ifelse(predProbTestBagg[, 2] > 0.5, "Yes", "No")
predY <- factor(yPred, levels = c("No", "Yes"))
# reference levels are "No"/"Yes" (already relabelled in Section 1)
levels(testSet$DefaulterFlag) <- c("No", "Yes")
confusionMatrix(data = predY, reference = testSet$DefaulterFlag, positive = "Yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   744  706
##        Yes 1339 4438
##                                           
##                Accuracy : 0.717           
##                  95% CI : (0.7065, 0.7274)
##     No Information Rate : 0.7118          
##     P-Value [Acc > NIR] : 0.1651          
##                                           
##                   Kappa : 0.2418          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.8628          
##             Specificity : 0.3572          
##          Pos Pred Value : 0.7682          
##          Neg Pred Value : 0.5131          
##              Prevalence : 0.7118          
##          Detection Rate : 0.6141          
##    Detection Prevalence : 0.7994          
##       Balanced Accuracy : 0.6100          
##                                           
##        'Positive' Class : Yes             
##

Que 9. Using the Bagging model, write R code to plot the ROC curve, as shown below.

# loading the package
library(ROCR)
BaggPrediction <- predict(BaggingModel, testSet,type = "prob")
BaggPrediction <- prediction(BaggPrediction[2],testSet$DefaulterFlag)
Baggperformance <- performance(BaggPrediction, "tpr","fpr")
# plotting ROC curve
plot(Baggperformance,main = "ROC Curve",col = 2,lwd = 2)
abline(a = 0,b = 1,lwd = 2,lty = 3,col = "black")

Que 10. Using the Bagging model, write R code to calculate the AUC (Area Under Curve).

library(ROCR)
# area under curve
BaggPrediction <- prediction(predProbTestBagg[2],testSet$DefaulterFlag)
aucBagg <- performance(BaggPrediction, measure = "auc")
aucBagg <- aucBagg@y.values[[1]]
aucBagg

## [1] 0.7010126

Que 11. Which Machine Learning technique (Bagging or Random Forest) is better based on a) Accuracy, b) Sensitivity, c) Specificity, d) AUC? Please explain your reasoning in two or three sentences.

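#a) Accuracy = Random Forest (0.7335 vs 0.7170)
#b) Sensitivity = Random Forest (0.8915 vs 0.8628)
#c) Specificity = Bagging (0.3572 vs 0.3433)
#d) AUC = Random Forest (0.7243 vs 0.7010)
# Random Forest is better on accuracy, sensitivity and AUC. Bagging is ahead
# only on specificity, and by a small margin.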

Que 12. Write R code to draw the ROC curves for Decision Tree, Random Forest and Bagging on the same graph using the test dataset.

# ROC curves for the decision tree, random forest and bagging on the test set
plot(DTperformance, main = "ROC Curves", col = "black", lwd = 2)
plot(RFperformance, add = TRUE, col = "red", lwd = 3)
plot(Baggperformance, add = TRUE, col = "green", lwd = 4)
# legend colours and line widths match the curves drawn above
legend(x = "bottomright",
       legend = c("Decision Tree", "Random Forest", "Bagging"),
       col = c("black", "red", "green"), lwd = c(2, 3, 4))

Que 13. Which model is doing better (Decision Tree / Random Forest / Bagging)? Please rank order the models, based on 1) Accuracy, 2) Sensitivity, 3) Specificity, 4) AUC. Please explain your reasoning in four or five sentences.

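#a) Accuracy = Random Forest (Random Forest 0.7335 > Decision Tree 0.7307 > Bagging 0.7170)
#b) Sensitivity = Decision Tree (Decision Tree 0.9423 > Random Forest 0.8915 > Bagging 0.8628)
#c) Specificity = Bagging (Bagging 0.3572 > Random Forest 0.3433 > Decision Tree 0.2084)
#d) AUC = Random Forest (Random Forest 0.7243 > Bagging 0.7010 > Decision Tree 0.6829)
# Random Forest ranks first on accuracy and AUC and second on the other two
# measures, so it is the best all-round model of the three on this test set.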
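
The rank orders can also be derived programmatically. A small sketch that collects the test-set metrics reported in the outputs above into a data frame and sorts the models on each measure (the object names `metrics` and `rankings` are illustrative):

```r
# Test-set metrics copied from the confusion matrix and AUC outputs above
metrics <- data.frame(
  model       = c("Decision Tree", "Random Forest", "Bagging"),
  accuracy    = c(0.7307, 0.7335, 0.7170),
  sensitivity = c(0.9423, 0.8915, 0.8628),
  specificity = c(0.2084, 0.3433, 0.3572),
  auc         = c(0.6829, 0.7243, 0.7010)
)

# For each metric, order the models from best to worst
rankings <- sapply(metrics[-1], function(x) {
  paste(metrics$model[order(x, decreasing = TRUE)], collapse = " > ")
})
rankings
```

Each element of `rankings` is one ordering, named after the metric it was computed from.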

Assignment 6 (Case: Auto Finance)

Sameer Mathur
