Testing ML Prediction Models

Synopsis

The goal of this project is to build a predictive model that determines how individuals performed an exercise, using the training and test datasets.

In this project, I train several machine learning prediction models and provide a detailed report outlining my model-building process, how cross-validation was applied, and the accuracy of each model. Additionally, I apply my most accurate model to predict the outcomes for 20 distinct test cases.

Background of the Data

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, my goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

The training data for this project are available here:

Training Dataset

The test data are available here:

Test dataset

Environment Preparation

These are all the packages used in this project; they are required to reproduce the results.

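require(caret)         #training and evaluating models (train, confusionMatrix, nearZeroVar)
require(rpart)         #decision trees
require(rattle)        #fancyRpartPlot for plotting the tree
require(gbm)           #gradient boosting machine
require(randomForest)  #random forest
require(knitr)         #rendering the report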

Cleaning & Exploring Dataset

Now that the environment is set up, let's load the datasets.

traindat <- read.csv("pml-training.csv")   #reading the training data
testdat <- read.csv("pml-testing.csv")     #reading the testing data
dim(traindat)

## [1] 19622   160

After exploring the dataset, I see many variables with near-zero variance, which can hurt our prediction models, so the best option is to remove them.

novar  <- nearZeroVar(traindat)
traindat <- traindat[,-novar]   #getting rid of all the near zero variance columns

I also need to remove the columns that consist mostly of NA values. For this I set an 80% threshold: any column with a larger share of NA values is dropped from the dataset.

mostly_na <- sapply(traindat, function(x) mean(is.na(x)) > 0.8) #for each column, TRUE if more than 80% of its values are NA

traindat <- traindat[, !mostly_na]    #keeping only the columns that are at most 80% NA
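To make the NA threshold concrete, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are made up for illustration and are not part of the project data; only base R is used.

```r
# Toy data frame (hypothetical values, not from the project data)
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),      # no missing values
  b = rep(NA_real_, 5),      # 100% missing
  c = c(1, NA, 3, NA, 5)     # 40% missing
)

# Fraction of missing values in each column
na_frac <- sapply(toy, function(x) mean(is.na(x)))

# Keep only the columns whose NA fraction does not exceed 0.8
kept <- toy[, na_frac <= 0.8, drop = FALSE]

print(na_frac)
print(names(kept))
```

Column `b` is dropped because every value is missing, while `c` survives because only 40% of its values are NA.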

To estimate out-of-sample accuracy we also need a validation set, so I split the data into two sets: training and validation.

data <- createDataPartition(traindat$classe,p = 0.70,list = FALSE) #splitting the data set into 70% train and 30% validation 
training <- traindat[data,]
validation <- traindat[-data,]

training1 <- training[,-(1:5)]              #dropping the first five columns (metadata)
validation1 <- validation[,-(1:5)]

Model Training

I will train three machine learning models, and the one that performs best on the validation set will be used to predict the test data.

Decision Tree

A decision tree works by asking a series of questions that together form a flow chart. We start with one big group of data and ask questions about different variables to split it into smaller, more homogeneous groups, which helps us predict outcomes more accurately. Of the three models used here, this one takes the least time to train.

set.seed(101)
train_dt <- rpart(classe~.,data = training1,method = "class")  #training the model (rpart package)
fancyRpartPlot(train_dt)                                       #plotting the model (rattle package)

## Warning: labs do not fit even at cex 0.15, there may be some overplotting

Now that the tree is plotted, let us use it to predict the validation dataset.

pred_dt <- predict(train_dt, validation1,type = "class")
conf_mat_dt <- confusionMatrix(table(pred_dt,validation1$classe))
conf_mat_dt

## Confusion Matrix and Statistics
## 
##        
## pred_dt    A    B    C    D    E
##       A 1533  262   16   92   85
##       B   47  644   78   53   94
##       C   20   98  837  144   85
##       D   50   87   71  581   53
##       E   24   48   24   94  765
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7409         
##                  95% CI : (0.7295, 0.752)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6701         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9158   0.5654   0.8158  0.60270   0.7070
## Specificity            0.8919   0.9427   0.9286  0.94696   0.9604
## Pos Pred Value         0.7711   0.7031   0.7069  0.69002   0.8010
## Neg Pred Value         0.9638   0.9004   0.9598  0.92405   0.9357
## Prevalence             0.2845   0.1935   0.1743  0.16381   0.1839
## Detection Rate         0.2605   0.1094   0.1422  0.09873   0.1300
## Detection Prevalence   0.3378   0.1556   0.2012  0.14308   0.1623
## Balanced Accuracy      0.9039   0.7540   0.8722  0.77483   0.8337

We see that this model has an accuracy of 0.7409, which is not that high, so let us proceed with another model.
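As a check on how the accuracy figure is derived, the sketch below recomputes it from the decision tree confusion matrix printed above, using base R only. The names `cm_dt`, `accuracy_dt`, and `sensitivity_dt` are introduced here for illustration.

```r
# Decision tree confusion matrix, copied from the printed output
# (rows = predicted class, columns = actual class)
cm_dt <- matrix(
  c(1533, 262,  16,  92,  85,
      47, 644,  78,  53,  94,
      20,  98, 837, 144,  85,
      50,  87,  71, 581,  53,
      24,  48,  24,  94, 765),
  nrow = 5, byrow = TRUE,
  dimnames = list(predicted = LETTERS[1:5], actual = LETTERS[1:5])
)

# Accuracy = correctly classified cases (the diagonal) / all cases
accuracy_dt <- sum(diag(cm_dt)) / sum(cm_dt)

# Sensitivity per class = correct predictions / actual cases of that class
sensitivity_dt <- diag(cm_dt) / colSums(cm_dt)

print(round(accuracy_dt, 4))      # 0.7409
print(round(sensitivity_dt, 4))
```

The diagonal sums to 4360 out of 5885 validation cases, which gives the 0.7409 reported by confusionMatrix.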

Random Forest

Random Forest works by drawing many bootstrap samples, growing a decision tree on each, and then combining their predictions to make a final decision. The key idea is that by combining the results from many trees, it can produce more accurate predictions than an individual decision tree. This model takes a lot of time to train compared to the other models.

set.seed(100)
trcon <- trainControl(method = "cv",number = 5) #5-fold cross-validation
train_rf <- train(classe~., data = training1,method = "rf",trControl = trcon,verbose = FALSE) #training the model(randomForest Package)
pred_rf <- predict(train_rf,validation1)
conf_mat_rf <- confusionMatrix(table(pred_rf,validation1$classe))
conf_mat_rf

## Confusion Matrix and Statistics
## 
##        
## pred_rf    A    B    C    D    E
##       A 1673    1    0    0    0
##       B    0 1134    1    0    0
##       C    0    3 1025    4    0
##       D    0    1    0  960    0
##       E    1    0    0    0 1082
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9981          
##                  95% CI : (0.9967, 0.9991)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9976          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9956   0.9990   0.9959   1.0000
## Specificity            0.9998   0.9998   0.9986   0.9998   0.9998
## Pos Pred Value         0.9994   0.9991   0.9932   0.9990   0.9991
## Neg Pred Value         0.9998   0.9989   0.9998   0.9992   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1927   0.1742   0.1631   0.1839
## Detection Prevalence   0.2845   0.1929   0.1754   0.1633   0.1840
## Balanced Accuracy      0.9996   0.9977   0.9988   0.9978   0.9999

This model has an accuracy of 0.9981, so it predicted almost all of the validation cases correctly.
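The 5-fold cross-validation requested through trainControl can be pictured with the base R sketch below. It is a simplified illustration, not caret's internal code: the row count `n`, the seed, and the plain random fold assignment are assumptions made here, and caret may assign folds differently (for example, balancing classes across folds).

```r
set.seed(1)   # arbitrary seed so the illustration is repeatable
n <- 20       # hypothetical number of training rows
k <- 5        # number of folds, matching number = 5 in trainControl

# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  holdout_rows <- which(fold_id == fold)   # rows used to score this round
  fit_rows     <- which(fold_id != fold)   # rows used to fit this round
  # A model would be fitted on fit_rows and scored on holdout_rows here.
  cat("fold", fold, ": fit on", length(fit_rows),
      "rows, score on", length(holdout_rows), "rows\n")
}
```

Each row is held out exactly once, and the per-fold scores are averaged to estimate how well the model generalises.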

GBM

Gradient Boosting Machine (GBM) is another powerful machine learning algorithm that works by building an ensemble of decision trees. However, unlike Random Forest, which builds trees independently and combines their predictions, GBM builds trees sequentially: each new tree looks at the errors (residuals) left by the trees before it and focuses on minimising them.
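The sequential, residual-fitting idea can be sketched in base R. This is a deliberately simplified illustration under stated assumptions: it uses a toy regression problem with squared error and one-split trees ("stumps"), whereas this project is a classification task handled by the gbm package. The data, the helpers `fit_stump` and `predict_stump`, and the `shrinkage` value are all hypothetical.

```r
# Toy regression data (hypothetical): y is a noisy step-like function of x
set.seed(2)
x <- seq(0, 1, length.out = 50)
y <- ifelse(x > 0.5, 2, 0) + x + rnorm(50, sd = 0.1)

# Fit a one-split stump: pick the threshold that minimises squared error
fit_stump <- function(x, r) {
  splits <- head(sort(unique(x)), -1)   # candidate thresholds (right side never empty)
  sse <- sapply(splits, function(s) {
    left <- r[x <= s]
    right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  s <- splits[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$split, stump$left, stump$right)
}

shrinkage <- 0.5                  # learning rate (assumed value)
pred <- rep(mean(y), length(y))   # start from a constant prediction
mse <- numeric(10)

for (m in 1:10) {
  residual <- y - pred                        # errors of the current ensemble
  stump <- fit_stump(x, residual)             # the next tree targets those errors
  pred <- pred + shrinkage * predict_stump(stump, x)
  mse[m] <- mean((y - pred)^2)                # training error after m trees
}

print(round(mse, 4))
```

Each round shrinks the training error because the new stump is fitted to whatever the previous rounds left unexplained.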

set.seed(102)
trcon <- trainControl(method = "cv",number = 5)
train_gbm <- train(classe~., data = training1,method = "gbm",trControl = trcon,verbose = FALSE)
pred_gbm <- predict(train_gbm,validation1)
conf_mat_gbm <- confusionMatrix(table(pred_gbm,validation1$classe))
conf_mat_gbm

## Confusion Matrix and Statistics
## 
##         
## pred_gbm    A    B    C    D    E
##        A 1669    8    0    1    2
##        B    4 1111    6   14    4
##        C    0   20 1013   13    3
##        D    1    0    6  932    8
##        E    0    0    1    4 1065
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9839          
##                  95% CI : (0.9803, 0.9869)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9796          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9970   0.9754   0.9873   0.9668   0.9843
## Specificity            0.9974   0.9941   0.9926   0.9970   0.9990
## Pos Pred Value         0.9935   0.9754   0.9657   0.9842   0.9953
## Neg Pred Value         0.9988   0.9941   0.9973   0.9935   0.9965
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2836   0.1888   0.1721   0.1584   0.1810
## Detection Prevalence   0.2855   0.1935   0.1782   0.1609   0.1818
## Balanced Accuracy      0.9972   0.9848   0.9900   0.9819   0.9916

We see that this model has an accuracy of 0.9839, which is very good but not as accurate as the Random Forest model.

So, after training models with three different machine learning algorithms, we see that the most accurate prediction model is the Random Forest model.

Final Prediction using the Random Forest on the Testing Data

I will now apply my most accurate predictive model, the Random Forest model, to predict the outcomes for the 20 distinct test cases.

pred_test <- predict(train_rf,testdat)
pred_test

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E