The goal of this project is to develop a predictive model that determines how individuals performed an exercise, using the provided training and test datasets.
In this project, I will train several machine learning models and provide a detailed report outlining my model-building process, how cross-validation was applied, and the accuracy of each model. Additionally, I will apply my most accurate model to predict the outcomes for 20 distinct test cases.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, my goal is to use the data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The training data for this project are available here:
The test data are available here:
These are all the packages used in this project; they are required to reproduce the analysis.
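library(caret) #model training, cross-validation and confusion matrices
library(rpart) #decision trees
library(rattle) #fancyRpartPlot for plotting the tree
library(gbm) #gradient boosting machine
library(randomForest) #random forest
library(knitr) #report generation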
Now that the environment is set up, let's load the datasets.
traindat <- read.csv("pml-training.csv") #reading the training data
testdat <- read.csv("pml-testing.csv") #reading the testing data
dim(traindat)
## [1] 19622 160
After exploring the dataset, I see a lot of variables with near-zero variance. Such variables carry almost no information and can hurt the prediction models, so the best thing is to get rid of them.
novar <- nearZeroVar(traindat)
traindat <- traindat[,-novar] #getting rid of all the near zero variance columns
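To make the filter more concrete, here is a minimal base-R sketch of the two quantities that near-zero-variance checks are usually built on: the frequency ratio of the two most common values and the percentage of unique values. The helper name `is_near_zero_var` and the toy data are hypothetical, and the cutoffs are assumed to mirror caret's documented defaults; this is a simplified illustration, not caret's implementation.

```r
# Simplified sketch of a near-zero-variance check (not caret's actual code).
# A column is flagged when its most common value dominates the runner-up
# (frequency ratio above freq_cut) AND it has few distinct values
# (percent unique below unique_cut), or when it holds a single value.
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)          # constant column
  freq_ratio <- as.numeric(counts[1] / counts[2])
  pct_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && pct_unique < unique_cut
}

# Hypothetical toy data: 'flag' is almost always "no", 'reading' varies freely.
toy <- data.frame(
  flag    = c(rep("no", 99), "yes"),
  reading = seq(0.5, 50, by = 0.5)
)

nzv_flags <- sapply(toy, is_near_zero_var)   # TRUE marks a column to drop
toy_kept  <- toy[, !nzv_flags, drop = FALSE]
names(toy_kept)
```

In this toy case only `flag` is dropped, because 99 of its 100 values are identical.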
Next, I need to get rid of the columns that consist mostly of NA values. I will set an 80% threshold: any column with more than 80% NA values is removed from the data set.
NO_NA <- sapply(traindat, function(x) mean(is.na(x)) > 0.8) #for each column, TRUE if more than 80% of its values are NA
traindat <- traindat[, NO_NA == FALSE] #keeping only the columns that are FALSE (i.e. at most 80% NA)
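The threshold works because the mean of a logical vector is the proportion of `TRUE` values. A small sketch on a hypothetical toy data frame (the names `toy`, `na_fraction`, `mostly_na` and `toy_clean` are made up for illustration):

```r
# Hypothetical toy data frame: column 'b' is 90% NA, column 'c' is 20% NA.
toy <- data.frame(
  a = 1:10,
  b = c(1, rep(NA, 9)),
  c = c(NA, NA, 3:10)
)

# is.na() gives TRUE/FALSE per cell; mean() of a logical vector is the
# proportion of TRUE values, i.e. the fraction of missing entries.
na_fraction <- sapply(toy, function(x) mean(is.na(x)))
mostly_na   <- na_fraction > 0.8

# Keep only the columns at or below the 80% threshold.
toy_clean <- toy[, !mostly_na, drop = FALSE]
names(toy_clean)
```

Here only column `b` exceeds the threshold, so `a` and `c` are kept.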
To estimate out-of-sample accuracy I need a validation set, so I will split the data into two sets: a training set and a validation set.
data <- createDataPartition(traindat$classe,p = 0.70,list = FALSE) #splitting the data set into 70% train and 30% validation
training <- traindat[data,]
validation <- traindat[-data,]
training1 <- training[,-(1:5)] #getting rid of meta data
validation1 <- validation[,-(1:5)]
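The split above is stratified on `classe`, so both subsets keep roughly the same class proportions. The idea can be sketched in base R as follows; this is an approximation of the concept with hypothetical labels and a hypothetical seed, not the code that `createDataPartition` runs.

```r
# Sketch of a stratified 70/30 split in base R (illustrative only).
set.seed(42)  # hypothetical seed so the toy split is repeatable

# Hypothetical labels with unequal class sizes.
labels <- rep(c("A", "B", "C"), times = c(50, 30, 20))

# Within each class, sample 70% of the row indices for training.
rows_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

train_labels      <- labels[train_idx]
validation_labels <- labels[-train_idx]

# Class proportions stay close to the originals in both subsets.
prop.table(table(train_labels))
prop.table(table(validation_labels))
```

Sampling within each class, instead of over all rows at once, prevents a rare class from ending up under-represented in either subset.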
I will train three machine learning models, and the one that performs best on the validation set will be used to predict the test data.
A decision tree works by asking a series of questions that together form a flow chart. We start with the full data set and split it on different variables into smaller, more homogeneous groups, which helps us predict outcomes more accurately. Of the three models used here, this one takes the least time to train.
set.seed(101)
train_dt <- rpart(classe~.,data = training1,method = "class") #training the model (rpart package)
fancyRpartPlot(train_dt) #plotting the model (rattle package)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
Now that the tree is plotted, let us predict the classes of the validation dataset.
pred_dt <- predict(train_dt, validation1,type = "class")
conf_mat_dt <- confusionMatrix(table(pred_dt,validation1$classe))
conf_mat_dt
## Confusion Matrix and Statistics
##
##
## pred_dt A B C D E
## A 1533 262 16 92 85
## B 47 644 78 53 94
## C 20 98 837 144 85
## D 50 87 71 581 53
## E 24 48 24 94 765
##
## Overall Statistics
##
## Accuracy : 0.7409
## 95% CI : (0.7295, 0.752)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6701
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9158 0.5654 0.8158 0.60270 0.7070
## Specificity 0.8919 0.9427 0.9286 0.94696 0.9604
## Pos Pred Value 0.7711 0.7031 0.7069 0.69002 0.8010
## Neg Pred Value 0.9638 0.9004 0.9598 0.92405 0.9357
## Prevalence 0.2845 0.1935 0.1743 0.16381 0.1839
## Detection Rate 0.2605 0.1094 0.1422 0.09873 0.1300
## Detection Prevalence 0.3378 0.1556 0.2012 0.14308 0.1623
## Balanced Accuracy 0.9039 0.7540 0.8722 0.77483 0.8337
We see that this model has an accuracy of 0.7409, which is not that high, so let us proceed with another model.
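The accuracy and sensitivity figures in the printed output can be recomputed directly from the confusion matrix. A minimal sketch in base R, with the counts copied from the decision tree output above; the variable names `cm`, `accuracy` and `sensitivity` are my own:

```r
# Decision-tree confusion matrix copied from the output above
# (rows = predicted class, columns = actual class).
cm <- matrix(
  c(1533, 262,  16,  92,  85,
      47, 644,  78,  53,  94,
      20,  98, 837, 144,  85,
      50,  87,  71, 581,  53,
      24,  48,  24,  94, 765),
  nrow = 5, byrow = TRUE,
  dimnames = list(predicted = LETTERS[1:5], actual = LETTERS[1:5])
)

# Accuracy = correctly classified cases (the diagonal) / all cases.
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity per class = correct predictions / actual cases of that class.
sensitivity <- diag(cm) / colSums(cm)

round(accuracy, 4)      # 0.7409
round(sensitivity, 4)   # A 0.9158, B 0.5654, C 0.8158, D 0.6027, E 0.7070
```

That is 4360 correct predictions out of 5885 validation cases. The low sensitivity for classes B and D shows where the single tree struggles most.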
Random Forest works by drawing many bootstrap samples of the data, growing a decision tree on each sample, and then combining the trees' predictions into a final decision. The key idea is that by combining the results from many trees, it can produce more accurate predictions than an individual decision tree. This model takes a lot of time to train compared to the other models.
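The two ingredients described above, resampling the rows and combining the trees by vote, can be illustrated with a small base-R sketch. The votes, the seed and all variable names are hypothetical; no actual trees are grown here.

```r
# Each tree in a forest is grown on a bootstrap sample: rows drawn with
# replacement, so some rows repeat and others are left out of that tree.
set.seed(3)  # hypothetical seed
bootstrap_rows <- sample(10, size = 10, replace = TRUE)

# Hypothetical votes: each row is one observation, each column one tree.
votes <- rbind(
  c("A", "A", "B", "A", "C"),
  c("B", "B", "B", "C", "B"),
  c("E", "D", "E", "E", "D")
)

# The forest's prediction is the most frequent class among the trees.
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
majority
```

A single tree that votes wrongly is outvoted as long as most of the other trees agree on the right class.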
set.seed(100)
trcon <- trainControl(method = "cv",number = 5) #5-fold cross-validation
train_rf <- train(classe~., data = training1,method = "rf",trControl = trcon,verbose = FALSE) #training the model(randomForest Package)
pred_rf <- predict(train_rf,validation1)
conf_mat_rf <- confusionMatrix(table(pred_rf,validation1$classe))
conf_mat_rf
## Confusion Matrix and Statistics
##
##
## pred_rf A B C D E
## A 1673 1 0 0 0
## B 0 1134 1 0 0
## C 0 3 1025 4 0
## D 0 1 0 960 0
## E 1 0 0 0 1082
##
## Overall Statistics
##
## Accuracy : 0.9981
## 95% CI : (0.9967, 0.9991)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9976
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9956 0.9990 0.9959 1.0000
## Specificity 0.9998 0.9998 0.9986 0.9998 0.9998
## Pos Pred Value 0.9994 0.9991 0.9932 0.9990 0.9991
## Neg Pred Value 0.9998 0.9989 0.9998 0.9992 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1927 0.1742 0.1631 0.1839
## Detection Prevalence 0.2845 0.1929 0.1754 0.1633 0.1840
## Balanced Accuracy 0.9996 0.9977 0.9988 0.9978 0.9999
This model has an accuracy of 0.9981, so it predicted almost all of the validation cases correctly.
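The 5-fold cross-validation requested through `trainControl(method = "cv", number = 5)` splits the training rows into five folds and holds each one out in turn. A minimal base-R sketch of that bookkeeping, assuming a plain random fold assignment with a hypothetical seed and row count (caret handles the fold creation and model fitting internally):

```r
# Minimal base-R sketch of 5-fold cross-validation bookkeeping.
set.seed(1)   # hypothetical seed for a repeatable fold assignment
n <- 100      # hypothetical number of training rows
k <- 5

# Shuffle the labels 1..k so every row lands in exactly one fold.
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  held_out <- which(fold_id == fold)   # rows used to score the model
  fit_rows <- which(fold_id != fold)   # rows used to fit the model
  # A real run would fit on fit_rows and record the accuracy on held_out here.
  cat("fold", fold, ": fit on", length(fit_rows),
      "rows, evaluate on", length(held_out), "rows\n")
}
```

Averaging the five held-out accuracies gives an estimate of out-of-sample performance that does not depend on a single split.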
Gradient Boosting Machine (GBM) is another powerful machine learning algorithm that works by building an ensemble of decision trees. However, unlike Random Forest, which builds trees independently and combines their predictions, GBM builds trees sequentially: each new tree is fit to the errors (residuals) of the trees built so far, with the aim of minimising them.
set.seed(102)
trcon <- trainControl(method = "cv",number = 5)
train_gbm <- train(classe~., data = training1,method = "gbm",trControl = trcon,verbose = FALSE)
pred_gbm <- predict(train_gbm,validation1)
conf_mat_gbm <- confusionMatrix(table(pred_gbm,validation1$classe))
conf_mat_gbm
## Confusion Matrix and Statistics
##
##
## pred_gbm A B C D E
## A 1669 8 0 1 2
## B 4 1111 6 14 4
## C 0 20 1013 13 3
## D 1 0 6 932 8
## E 0 0 1 4 1065
##
## Overall Statistics
##
## Accuracy : 0.9839
## 95% CI : (0.9803, 0.9869)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9796
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9970 0.9754 0.9873 0.9668 0.9843
## Specificity 0.9974 0.9941 0.9926 0.9970 0.9990
## Pos Pred Value 0.9935 0.9754 0.9657 0.9842 0.9953
## Neg Pred Value 0.9988 0.9941 0.9973 0.9935 0.9965
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2836 0.1888 0.1721 0.1584 0.1810
## Detection Prevalence 0.2855 0.1935 0.1782 0.1609 0.1818
## Balanced Accuracy 0.9972 0.9848 0.9900 0.9819 0.9916
We see that this model has an accuracy of 0.9839, which is very good but not as high as that of the Random Forest model.
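The sequential, residual-fitting idea behind boosting can be illustrated with a toy example. The sketch below boosts one-split "stumps" on a squared-error regression problem. It is a didactic simplification with hypothetical data, seed, learning rate and helper names (`fit_stump`, `predict_stump`), and it is not the multi-class procedure that `gbm` runs for this project.

```r
# Toy gradient boosting for squared-error loss with one-split "stumps".
fit_stump <- function(x, r) {
  # Try every midpoint between sorted unique x values as a split point and
  # keep the one that minimises the squared error around the two group means.
  xs   <- sort(unique(x))
  cuts <- (xs[-length(xs)] + xs[-1]) / 2
  sse  <- sapply(cuts, function(cut) {
    left  <- r[x <= cut]
    right <- r[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

set.seed(7)                          # hypothetical seed
x <- seq(0, 10, length.out = 200)    # hypothetical predictor
y <- sin(x) + rnorm(200, sd = 0.1)   # hypothetical noisy response

learning_rate <- 0.1
prediction <- rep(mean(y), length(y))   # start from a constant model
mse <- numeric(50)
for (m in 1:50) {
  residuals  <- y - prediction                 # errors of the ensemble so far
  stump      <- fit_stump(x, residuals)        # next tree is fit to those errors
  prediction <- prediction + learning_rate * predict_stump(stump, x)
  mse[m]     <- mean((y - prediction)^2)
}
mse[c(1, 10, 50)]   # training error after 1, 10 and 50 trees
```

Each stump corrects only a small part of the remaining error, which is why boosting typically needs many trees and a learning rate to work well.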
So, after training models with three different machine learning algorithms, we see that the most accurate one is the Random Forest model.
I will now apply my most accurate model, the Random Forest model, to predict the outcomes for the 20 distinct test cases.
pred_test <- predict(train_rf,testdat)
pred_test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E