Prediction Assignment Writeup

Executive Summary

Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. The data analysed in this report come from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, each repetition being labelled with one of 5 classes. Three models were fitted to a subset comprising 70% of the provided data: a random forest, a linear discriminant analysis (LDA), and a stacked model combining the first two via a generalized additive model (GAM). These models were then used to predict the movement class for the remaining 30% of the data, and the accuracy of each prediction was measured against the original values in this held-out test set. The random forest model should be adopted for future predictions, since it outperforms the other models by far, with 99.6% out-of-sample accuracy.

Loading and processing data

After loading the “pml-training.csv” dataset, and since no labelled test dataset is available, the provided data were split into two subsets, “train” and “test”, using the createDataPartition function, with 70% of observations going to the “train” set and the remaining 30% to the “test” set; this makes cross-validation of the prediction models possible. The dataset has a very large number of variables, so the following steps were taken to remove unnecessary predictors from the model. First, using the select function in the dplyr package, the first seven variables were removed, since they merely identify observations and have no predictive value. Second, several columns that show an NA value in the first rows appear to be almost entirely made up of NA values, so these were removed as well. Finally, the nearZeroVar function was used to single out variables with variance equal or close to zero, and these too were dropped. All these operations were performed on both the “train” and “test” sets.

library(caret)
training<-read.csv("pml-training.csv")
# split the provided data 70/30 into "train" and "test" subsets
inTrain<-createDataPartition(training$classe,p=0.7,list=FALSE)
train<-training[inTrain,]
test<-training[-inTrain,]
library(dplyr)
# drop the first seven identifier/timestamp columns, which have no predictive value
train<-select(train,-c(X,user_name,raw_timestamp_part_1,raw_timestamp_part_2,cvtd_timestamp,new_window,num_window))
test<-select(test,-c(X,user_name,raw_timestamp_part_1,raw_timestamp_part_2,cvtd_timestamp,new_window,num_window))
# drop columns that are NA in an arbitrary early row (row 5), since those
# columns are almost entirely NA throughout the dataset
train<-train[,!is.na(train[5,])]
test<-test[,!is.na(test[5,])]
# drop near-zero-variance predictors
nzv<-nearZeroVar(train,saveMetrics=TRUE)
train<-train[,nzv$nzv==FALSE]
nzv2<-nearZeroVar(test,saveMetrics=TRUE)
test<-test[,nzv2$nzv==FALSE]
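
As a quick sanity check (an illustrative addition, not part of the original analysis), the two subsets should end up with identical column sets, since the NA and near-zero-variance filters were applied to each subset independently:

# verify that "train" and "test" kept the same predictors after filtering
dim(train)
dim(test)
identical(names(train), names(test))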

Model Building and Prediction

Two machine learning algorithms were used to build two different prediction models for the “classe” variable. Random forests and boosting are generally the most successful approaches, but since the computer on which this analysis was performed lacked the computational power to fit a boosting model, even after the reduction in the number of predictors, the next best approach was selected: the two chosen algorithms were random forests and linear discriminant analysis. The random forest model was generated with the randomForest function in the randomForest package, and the LDA model with the train function in the caret package. A third, stacked model combining the predictions of the previous two was also produced through a generalized additive model (GAM), again using the train function. The three models were used to generate three different predictions for the values in the “test” set.

library(randomForest)
# model 1: random forest on all remaining predictors
model1<-randomForest(classe~.,data=train)
pred1<-predict(model1,test)
# model 2: linear discriminant analysis via caret
model2<-train(classe~.,method="lda",data=train)
pred2<-predict(model2,test)
# model 3: stack the two sets of predictions and fit a GAM on top of them
predDF<-data.frame(pred1,pred2,classe=test$classe)
model3<-train(classe~.,method="gam",data=predDF)
pred3<-predict(model3,predDF)  # the stacked model predicts from predDF, not from test
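
Before turning to the held-out set, the random forest also reports an out-of-bag (OOB) error estimate, which acts as an internal preview of the out-of-sample error (this printout is an illustrative addition):

# the OOB estimate of error rate anticipates performance on unseen data
print(model1)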

To evaluate the models, a confusion matrix was calculated for each set of predicted values using the confusionMatrix function in the caret package.

confusionMatrix(pred1,test$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1672    1    0    0    0
##          B    2 1134    7    0    0
##          C    0    4 1019    7    1
##          D    0    0    0  957    2
##          E    0    0    0    0 1079
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9959          
##                  95% CI : (0.9939, 0.9974)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9948          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9988   0.9956   0.9932   0.9927   0.9972
## Specificity            0.9998   0.9981   0.9975   0.9996   1.0000
## Pos Pred Value         0.9994   0.9921   0.9884   0.9979   1.0000
## Neg Pred Value         0.9995   0.9989   0.9986   0.9986   0.9994
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2841   0.1927   0.1732   0.1626   0.1833
## Detection Prevalence   0.2843   0.1942   0.1752   0.1630   0.1833
## Balanced Accuracy      0.9993   0.9969   0.9954   0.9962   0.9986
confusionMatrix(pred2,test$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1400  176  106   60   46
##          B   33  716  100   47  174
##          C  120  148  662  116  103
##          D  118   51  120  693   99
##          E    3   48   38   48  660
## 
## Overall Statistics
##                                           
##                Accuracy : 0.702           
##                  95% CI : (0.6901, 0.7136)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6224          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8363   0.6286   0.6452   0.7189   0.6100
## Specificity            0.9079   0.9254   0.8998   0.9212   0.9715
## Pos Pred Value         0.7830   0.6692   0.5762   0.6411   0.8281
## Neg Pred Value         0.9331   0.9121   0.9231   0.9436   0.9171
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2379   0.1217   0.1125   0.1178   0.1121
## Detection Prevalence   0.3038   0.1818   0.1952   0.1837   0.1354
## Balanced Accuracy      0.8721   0.7770   0.7725   0.8200   0.7907
confusionMatrix(pred3,test$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1672    1    0    0    0
##          B    2 1138 1026  964 1082
##          C    0    0    0    0    0
##          D    0    0    0    0    0
##          E    0    0    0    0    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4775          
##                  95% CI : (0.4647, 0.4903)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3306          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9988   0.9991   0.0000   0.0000   0.0000
## Specificity            0.9998   0.3523   1.0000   1.0000   1.0000
## Pos Pred Value         0.9994   0.2702      NaN      NaN      NaN
## Neg Pred Value         0.9995   0.9994   0.8257   0.8362   0.8161
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2841   0.1934   0.0000   0.0000   0.0000
## Detection Prevalence   0.2843   0.7157   0.0000   0.0000   0.0000
## Balanced Accuracy      0.9993   0.6757   0.5000   0.5000   0.5000
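
To put the three results side by side, the overall accuracies can be collected into a single named vector (a convenience step, not part of the original output):

# gather the overall accuracy of each model for direct comparison
acc <- sapply(list(rf = pred1, lda = pred2, stacked = pred3),
              function(p) unname(confusionMatrix(p, test$classe)$overall["Accuracy"]))
round(acc, 4)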

From the accuracy values, we can see that the random forest model performs far better than the linear discriminant analysis: 99.6% accuracy for the random forest versus 70.2% for the LDA. The stacked model performs far worse than either separate model, with only 47.8% accuracy; this is likely because caret’s “gam” method only supports two-class outcomes, so the stacked model effectively just separates class A from the rest, as its confusion matrix above shows. We can therefore conclude that the random forest model should provide an accurate prediction of the type of movement performed by an individual using the same set of accelerometers the participants used in this study.
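
As a final sketch of how the chosen model would be used in practice (the new observations here are stand-ins taken from the held-out set, since no genuinely new data are available):

# hypothetical usage: predict the class of new observations that carry the
# same accelerometer columns as the cleaned "train" set
new_obs <- test[1:3, setdiff(names(test), "classe")]  # stand-in rows for illustration
predict(model1, new_obs)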