Overview

Some of the respondents have been tagged as belonging to group 1 – 6. However, due to a data calculation issue, some of the respondents have had their groups (pov6) missing. Build a model that will classify these respondents back into one of the 6 groups.

Loading of Libraries

# Loading of Libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(RColorBrewer)
library(corrplot)
## corrplot 0.84 loaded
library(rpart)
library(rpart.plot)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
## 
##     importance
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(gbm)
## Loaded gbm 2.1.5

Loading and Cleaning of Data

# Loading the Data
train_in <- read.csv('./spendata.csv', header=T)
test_in <- read.csv('./testdata.csv', header=T)

# Cleaning the Data
trainData <- train_in[, colSums(is.na(train_in)) == 0]
testData <- test_in[, colSums(is.na(test_in)) == 0]
trainData <- trainData[, -c(1)]
testData <- testData[, -c(1)]
testData[, "pov6"] <- ""
common_column_names <- intersect(names(trainData), names(testData))
trainData <- trainData[, common_column_names]
testData <- testData[, common_column_names]
dim(trainData)
## [1] 18379   227
dim(testData)
## [1] 4595  227

Preparing Datasets for Prediction

We split the training data (trainData) into 50% for training (trainData) and 50% for cross validation (validData). This will help us to determine out-of-sample errors. We will then use our prediction models to classify respondents with missing groups back into one of the 6 groups.

# Splitting the Training Data
set.seed(1234)
inTrain <- createDataPartition(trainData$pov6, p = 0.5, list = FALSE)
trainData <- trainData[inTrain, ]
validData <- trainData[-inTrain, ]

# Removing Predictors with Near Zero Variance
NZV <- nearZeroVar(trainData)
trainData <- trainData[, -NZV]
validData <- validData[, -NZV]
dim(trainData)
## [1] 9190  108
dim(validData)
## [1] 4627  108
trainData <- as.data.frame(lapply(trainData, as.numeric))
validData <- as.data.frame(lapply(validData, as.numeric))

Plotting a Correlation Plot for Training Data

# Plotting a Correlation Plot for Training Data
cor_mat <- cor(trainData)
corrplot(cor_mat, order = "FPC", method = "color",
         type = "upper", tl.cex = 0.8, tl.col = rgb(0, 0, 0))

In the Correlation Plot shown above, the variables that are highly correlated are highlighted at the dark blue intersections. We used a threshold value of 0.95 to determine these highly correlated variables.

Finding Highly Correlated Variables in Training Data

# Finding Highly Correlated Variables in Training Data
highlyCorrelated = findCorrelation(cor_mat, cutoff=0.95)
names(trainData)[highlyCorrelated]
##  [1] "b.17"  "c.30"  "f.188" "b.15"  "b.16"  "c.58"  "c.159" "c.161" "c.164"
## [10] "c.165" "c.166" "c.32"

Building our Prediction Models

For this Question, we will use 3 different algorithms to predict the outcome. The algorithms are as follows:

  1. Classification Tree
  2. Generalized Boosted Models
  3. Random Decision Forests

Prediction with Classification Tree

# Building our Classification Tree Model with Training Data
set.seed(12345)
decisionTreeMod1 <- rpart(pov6 ~ ., data = trainData, method = "class")
fancyRpartPlot(decisionTreeMod1)

Next, we cross validate our Classification Tree Model with our validation data (validData), to determine the accuracy of this prediction model.

# Cross Validating the Classification Tree Model with Validation Data
Prediction_Matrix_CT <- predict(decisionTreeMod1, validData, type = "class")
cmtree <- confusionMatrix(table(Prediction_Matrix_CT, validData$pov6))
cmtree
## Confusion Matrix and Statistics
## 
##                     
## Prediction_Matrix_CT    1    2    3    4    5    6
##                    1 3501   92    3    4   29   13
##                    2   53  741    4   13    0    9
##                    3    0    0   95    0    0    0
##                    4    3    0    0   19    0    1
##                    5    0    0    0    0    0    0
##                    6    0    0    0    1    0   46
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9514          
##                  95% CI : (0.9448, 0.9574)
##     No Information Rate : 0.7687          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8658          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity            0.9843   0.8896  0.93137 0.513514 0.000000 0.666667
## Specificity            0.8682   0.9792  1.00000 0.999129 1.000000 0.999781
## Pos Pred Value         0.9613   0.9037  1.00000 0.826087      NaN 0.978723
## Neg Pred Value         0.9431   0.9758  0.99846 0.996090 0.993732 0.994978
## Prevalence             0.7687   0.1800  0.02204 0.007997 0.006268 0.014912
## Detection Rate         0.7566   0.1601  0.02053 0.004106 0.000000 0.009942
## Detection Prevalence   0.7871   0.1772  0.02053 0.004971 0.000000 0.010158
## Balanced Accuracy      0.9262   0.9344  0.96569 0.756321 0.500000 0.833224
# Plotting Results in a Matrix
plot(cmtree$table, col = cmtree$byClass,
     main = paste("Classification Tree: Accuracy =",
                  round(cmtree$overall['Accuracy'], 4)))

From the Classification Tree Matrix shown above, the accuracy of our Classification Tree Model is 0.9514. Therefore, its out-of-sample error is 0.0486.

Prediction with Generalized Boosted Models

# Building our Generalized Boosted Models with Training Data
set.seed(12345)
modGBM <- gbm(formula = as.factor(pov6) ~ ., distribution = "gaussian",
              data = trainData, n.trees = 1000, interaction.depth = 3,
              shrinkage = 0.1, cv.folds = 5, n.cores = NULL, verbose = FALSE)
print(modGBM)
## gbm(formula = as.factor(pov6) ~ ., distribution = "gaussian", 
##     data = trainData, n.trees = 1000, interaction.depth = 3, 
##     shrinkage = 0.1, cv.folds = 5, verbose = FALSE, n.cores = NULL)
## A gradient boosted model with gaussian loss function.
## 1000 iterations were performed.
## The best cross-validation iteration was 952.
## There were 107 predictors of which 85 had non-zero influence.
# Plotting our Generalized Boosted Models
gbm.perf(modGBM, method = "cv")

## [1] 952
# Cross Validating the Generalized Boosted Models with Validation Data
Prediction_Matrix_GBM <- round(predict(modGBM, newdata=validData, n.trees=933), digits=0)
cmGBM <- confusionMatrix(table(Prediction_Matrix_GBM, validData$pov6))
cmGBM
## Confusion Matrix and Statistics
## 
##                      
## Prediction_Matrix_GBM    1    2    3    4    5    6
##                     1 3495   77    1    0    2    0
##                     2   55  748    5    4    1    0
##                     3    7    8   91   11    1    5
##                     4    0    0    5   21   13    3
##                     5    0    0    0    1   12   15
##                     6    0    0    0    0    0   46
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9537          
##                  95% CI : (0.9473, 0.9596)
##     No Information Rate : 0.7687          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8762          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity            0.9826   0.8980  0.89216 0.567568 0.413793 0.666667
## Specificity            0.9252   0.9829  0.99293 0.995425 0.996520 1.000000
## Pos Pred Value         0.9776   0.9200  0.73984 0.500000 0.428571 1.000000
## Neg Pred Value         0.9411   0.9777  0.99756 0.996510 0.996304 0.994979
## Prevalence             0.7687   0.1800  0.02204 0.007997 0.006268 0.014912
## Detection Rate         0.7553   0.1617  0.01967 0.004539 0.002593 0.009942
## Detection Prevalence   0.7726   0.1757  0.02658 0.009077 0.006051 0.009942
## Balanced Accuracy      0.9539   0.9404  0.94254 0.781496 0.705157 0.833333
# Plotting Results in a Matrix
plot(cmGBM$table, col = cmGBM$byClass,
     main = paste("Generalized Boosted Models: Accuracy =",
                  round(cmGBM$overall['Accuracy'], 4)))

From the Generalized Boosted Models Matrix shown above, the accuracy of our Generalized Boosted Models is 0.9537. Therefore, its out-of-sample error is 0.0463.

Prediction with Random Decision Forests

# Building our Random Decision Forests Model with Training Data
set.seed(12345)
controlRF <- trainControl(method="cv", number=3, verboseIter=FALSE)
modRF1 <- train(as.factor(pov6) ~ ., data=trainData,
                method="rf", ntree=100, trControl=controlRF)
modRF1$finalModel
## 
## Call:
##  randomForest(x = x, y = y, ntree = 100, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 54
## 
##         OOB estimate of  error rate: 3.74%
## Confusion matrix:
##      1    2   3  4  5  6 class.error
## 1 7001   98   0  7  8  1  0.01602249
## 2  114 1495   0  5  0  3  0.07544836
## 3    5    8 167  0  0  0  0.07222222
## 4    6   24   0 46  0  2  0.41025641
## 5   18    7   0  0 41  0  0.37878788
## 6   15   18   0  4  1 96  0.28358209
# Plotting our Random Decision Forests Model
plot(modRF1)

Next, we cross validate our Random Decision Forests Model with our validation data (validData), to determine the accuracy of this prediction model.

# Cross Validating the Random Decision Forests Model with Validation Data
Prediction_Matrix_RDF <- predict(modRF1, newdata=validData, type = "raw")
cmrf <- confusionMatrix(table(Prediction_Matrix_RDF, validData$pov6))
cmrf
## Confusion Matrix and Statistics
## 
##                      
## Prediction_Matrix_RDF    1    2    3    4    5    6
##                     1 3557    0    0    0    0    0
##                     2    0  833    0    0    0    0
##                     3    0    0  102    0    0    0
##                     4    0    0    0   37    0    0
##                     5    0    0    0    0   29    0
##                     6    0    0    0    0    0   69
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9992, 1)
##     No Information Rate : 0.7687     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity            1.0000     1.00  1.00000 1.000000 1.000000  1.00000
## Specificity            1.0000     1.00  1.00000 1.000000 1.000000  1.00000
## Pos Pred Value         1.0000     1.00  1.00000 1.000000 1.000000  1.00000
## Neg Pred Value         1.0000     1.00  1.00000 1.000000 1.000000  1.00000
## Prevalence             0.7687     0.18  0.02204 0.007997 0.006268  0.01491
## Detection Rate         0.7687     0.18  0.02204 0.007997 0.006268  0.01491
## Detection Prevalence   0.7687     0.18  0.02204 0.007997 0.006268  0.01491
## Balanced Accuracy      1.0000     1.00  1.00000 1.000000 1.000000  1.00000
# Plotting Results in a Matrix
plot(cmrf$table, col = cmrf$byClass,
     main = paste("Random Decision Forests: Accuracy =",
                  round(cmrf$overall['Accuracy'], 4)))

From the Random Decision Forests Matrix shown above, the accuracy of our Random Decision Forests Model is 1. Therefore, its out-of-sample error is 0.

Best Prediction Model

The accuracy values of the 3 prediction models are as follows:

  1. Classification Tree = 0.9514
  2. Generalized Boosted Models = 0.9537
  3. Random Decision Forests = 1

From this comparison, we concluded that the Random Decision Forests Model is the best prediction model for our analysis.

# Using our Random Decision Forests Model on Test Data
testData <- as.data.frame(lapply(testData, as.numeric))
Results <- predict(modRF1, newdata=testData, type = "raw")
testData[, "pov6"] <- Results
write.csv(testData,'./testdata_pov6.csv', row.names = FALSE)

Our Random Forests Model was able to classify the respondents with missing groups from the test data (testData) back into one of the 6 groups. The generated output (Results) contains this information and it has been added back into the test data accordingly. You may refer to the new test data file (testdata_pov6.csv) for more information.