Some of the respondents have been tagged as belonging to group 1 – 6. However, due to a data calculation issue, some of the respondents have had their groups (pov6) missing. Build a model that will classify these respondents back into one of the 6 groups.
# Loading of Libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(RColorBrewer)
library(corrplot)
## corrplot 0.84 loaded
library(rpart)
library(rpart.plot)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
library(gbm)
## Loaded gbm 2.1.5
# Loading the Data
train_in <- read.csv('./spendata.csv', header=T)
test_in <- read.csv('./testdata.csv', header=T)
# Cleaning the Data
trainData <- train_in[, colSums(is.na(train_in)) == 0]
testData <- test_in[, colSums(is.na(test_in)) == 0]
trainData <- trainData[, -c(1)]
testData <- testData[, -c(1)]
testData[, "pov6"] <- ""
common_column_names <- intersect(names(trainData), names(testData))
trainData <- trainData[, common_column_names]
testData <- testData[, common_column_names]
dim(trainData)
## [1] 18379 227
dim(testData)
## [1] 4595 227
We split the training data (trainData) into 50% for training (trainData) and 50% for cross validation (validData). This will help us to determine out-of-sample errors. We will then use our prediction models to classify respondents with missing groups back into one of the 6 groups.
# Splitting the Training Data
set.seed(1234)
inTrain <- createDataPartition(trainData$pov6, p = 0.5, list = FALSE)
trainData <- trainData[inTrain, ]
validData <- trainData[-inTrain, ]
# Removing Predictors with Near Zero Variance
NZV <- nearZeroVar(trainData)
trainData <- trainData[, -NZV]
validData <- validData[, -NZV]
dim(trainData)
## [1] 9190 108
dim(validData)
## [1] 4627 108
trainData <- as.data.frame(lapply(trainData, as.numeric))
validData <- as.data.frame(lapply(validData, as.numeric))
# Plotting a Correlation Plot for Training Data
cor_mat <- cor(trainData)
corrplot(cor_mat, order = "FPC", method = "color",
type = "upper", tl.cex = 0.8, tl.col = rgb(0, 0, 0))
In the Correlation Plot shown above, the variables that are highly correlated are highlighted at the dark blue intersections. We used a threshold value of 0.95 to determine these highly correlated variables.
For this Question, we will use 3 different algorithms to predict the outcome. The algorithms are as follows:
# Building our Classification Tree Model with Training Data
set.seed(12345)
decisionTreeMod1 <- rpart(pov6 ~ ., data = trainData, method = "class")
fancyRpartPlot(decisionTreeMod1)
Next, we cross validate our Classification Tree Model with our validation data (validData), to determine the accuracy of this prediction model.
# Cross Validating the Classification Tree Model with Validation Data
Prediction_Matrix_CT <- predict(decisionTreeMod1, validData, type = "class")
cmtree <- confusionMatrix(table(Prediction_Matrix_CT, validData$pov6))
cmtree
## Confusion Matrix and Statistics
##
##
## Prediction_Matrix_CT 1 2 3 4 5 6
## 1 3501 92 3 4 29 13
## 2 53 741 4 13 0 9
## 3 0 0 95 0 0 0
## 4 3 0 0 19 0 1
## 5 0 0 0 0 0 0
## 6 0 0 0 1 0 46
##
## Overall Statistics
##
## Accuracy : 0.9514
## 95% CI : (0.9448, 0.9574)
## No Information Rate : 0.7687
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8658
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.9843 0.8896 0.93137 0.513514 0.000000 0.666667
## Specificity 0.8682 0.9792 1.00000 0.999129 1.000000 0.999781
## Pos Pred Value 0.9613 0.9037 1.00000 0.826087 NaN 0.978723
## Neg Pred Value 0.9431 0.9758 0.99846 0.996090 0.993732 0.994978
## Prevalence 0.7687 0.1800 0.02204 0.007997 0.006268 0.014912
## Detection Rate 0.7566 0.1601 0.02053 0.004106 0.000000 0.009942
## Detection Prevalence 0.7871 0.1772 0.02053 0.004971 0.000000 0.010158
## Balanced Accuracy 0.9262 0.9344 0.96569 0.756321 0.500000 0.833224
# Plotting Results in a Matrix
plot(cmtree$table, col = cmtree$byClass,
main = paste("Classification Tree: Accuracy =",
round(cmtree$overall['Accuracy'], 4)))
From the Classification Tree Matrix shown above, the accuracy of our Classification Tree Model is 0.9514. Therefore, its out-of-sample error is 0.0486.
# Building our Generalized Boosted Models with Training Data
set.seed(12345)
modGBM <- gbm(formula = as.factor(pov6) ~ ., distribution = "gaussian",
data = trainData, n.trees = 1000, interaction.depth = 3,
shrinkage = 0.1, cv.folds = 5, n.cores = NULL, verbose = FALSE)
print(modGBM)
## gbm(formula = as.factor(pov6) ~ ., distribution = "gaussian",
## data = trainData, n.trees = 1000, interaction.depth = 3,
## shrinkage = 0.1, cv.folds = 5, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## The best cross-validation iteration was 952.
## There were 107 predictors of which 85 had non-zero influence.
# Plotting our Generalized Boosted Models
gbm.perf(modGBM, method = "cv")
## [1] 952
# Cross Validating the Generalized Boosted Models with Validation Data
Prediction_Matrix_GBM <- round(predict(modGBM, newdata=validData, n.trees=933), digits=0)
cmGBM <- confusionMatrix(table(Prediction_Matrix_GBM, validData$pov6))
cmGBM
## Confusion Matrix and Statistics
##
##
## Prediction_Matrix_GBM 1 2 3 4 5 6
## 1 3495 77 1 0 2 0
## 2 55 748 5 4 1 0
## 3 7 8 91 11 1 5
## 4 0 0 5 21 13 3
## 5 0 0 0 1 12 15
## 6 0 0 0 0 0 46
##
## Overall Statistics
##
## Accuracy : 0.9537
## 95% CI : (0.9473, 0.9596)
## No Information Rate : 0.7687
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8762
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 0.9826 0.8980 0.89216 0.567568 0.413793 0.666667
## Specificity 0.9252 0.9829 0.99293 0.995425 0.996520 1.000000
## Pos Pred Value 0.9776 0.9200 0.73984 0.500000 0.428571 1.000000
## Neg Pred Value 0.9411 0.9777 0.99756 0.996510 0.996304 0.994979
## Prevalence 0.7687 0.1800 0.02204 0.007997 0.006268 0.014912
## Detection Rate 0.7553 0.1617 0.01967 0.004539 0.002593 0.009942
## Detection Prevalence 0.7726 0.1757 0.02658 0.009077 0.006051 0.009942
## Balanced Accuracy 0.9539 0.9404 0.94254 0.781496 0.705157 0.833333
# Plotting Results in a Matrix
plot(cmGBM$table, col = cmGBM$byClass,
main = paste("Generalized Boosted Models: Accuracy =",
round(cmGBM$overall['Accuracy'], 4)))
From the Generalized Boosted Models Matrix shown above, the accuracy of our Generalized Boosted Models is 0.9537. Therefore, its out-of-sample error is 0.0463.
# Building our Random Decision Forests Model with Training Data
set.seed(12345)
controlRF <- trainControl(method="cv", number=3, verboseIter=FALSE)
modRF1 <- train(as.factor(pov6) ~ ., data=trainData,
method="rf", ntree=100, trControl=controlRF)
modRF1$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 100, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 54
##
## OOB estimate of error rate: 3.74%
## Confusion matrix:
## 1 2 3 4 5 6 class.error
## 1 7001 98 0 7 8 1 0.01602249
## 2 114 1495 0 5 0 3 0.07544836
## 3 5 8 167 0 0 0 0.07222222
## 4 6 24 0 46 0 2 0.41025641
## 5 18 7 0 0 41 0 0.37878788
## 6 15 18 0 4 1 96 0.28358209
# Plotting our Random Decision Forests Model
plot(modRF1)
Next, we cross validate our Random Decision Forests Model with our validation data (validData), to determine the accuracy of this prediction model.
# Cross Validating the Random Decision Forests Model with Validation Data
Prediction_Matrix_RDF <- predict(modRF1, newdata=validData, type = "raw")
cmrf <- confusionMatrix(table(Prediction_Matrix_RDF, validData$pov6))
cmrf
## Confusion Matrix and Statistics
##
##
## Prediction_Matrix_RDF 1 2 3 4 5 6
## 1 3557 0 0 0 0 0
## 2 0 833 0 0 0 0
## 3 0 0 102 0 0 0
## 4 0 0 0 37 0 0
## 5 0 0 0 0 29 0
## 6 0 0 0 0 0 69
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9992, 1)
## No Information Rate : 0.7687
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity 1.0000 1.00 1.00000 1.000000 1.000000 1.00000
## Specificity 1.0000 1.00 1.00000 1.000000 1.000000 1.00000
## Pos Pred Value 1.0000 1.00 1.00000 1.000000 1.000000 1.00000
## Neg Pred Value 1.0000 1.00 1.00000 1.000000 1.000000 1.00000
## Prevalence 0.7687 0.18 0.02204 0.007997 0.006268 0.01491
## Detection Rate 0.7687 0.18 0.02204 0.007997 0.006268 0.01491
## Detection Prevalence 0.7687 0.18 0.02204 0.007997 0.006268 0.01491
## Balanced Accuracy 1.0000 1.00 1.00000 1.000000 1.000000 1.00000
# Plotting Results in a Matrix
plot(cmrf$table, col = cmrf$byClass,
main = paste("Random Decision Forests: Accuracy =",
round(cmrf$overall['Accuracy'], 4)))
From the Random Decision Forests Matrix shown above, the accuracy of our Random Decision Forests Model is 1. Therefore, its out-of-sample error is 0.
The accuracy values of the 3 prediction models are as follows:
From this comparison, we concluded that the Random Decision Forests Model is the best prediction model for our analysis.
# Using our Random Decision Forests Model on Test Data
testData <- as.data.frame(lapply(testData, as.numeric))
Results <- predict(modRF1, newdata=testData, type = "raw")
testData[, "pov6"] <- Results
write.csv(testData,'./testdata_pov6.csv', row.names = FALSE)
Our Random Forests Model was able to classify the respondents with missing groups from the test data (testData) back into one of the 6 groups. The generated output (Results) contains this information and it has been added back into the test data accordingly. You may refer to the new test data file (testdata_pov6.csv) for more information.