Link to project on GitHub
Link to project on RPubs

Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. This project uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).

The goal of this project is to predict the manner in which unilateral dumbbell biceps curls were performed, based on data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. The 5 possible classes are:
A: exactly according to the specification
B: throwing the elbows to the front
C: lifting the dumbbell only halfway
D: lowering the dumbbell only halfway
E: throwing the hips to the front

Data processing

Preprocessing

Loading necessary libraries and setting the seed:

library(caret)
library(rpart)
library(rattle)
library(randomForest)

set.seed(12345)

Downloading and reading training and testing datasets:

trainingURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

training <- read.csv(trainingURL, na.strings=c("NA",""), header=TRUE)
testing <- read.csv(testURL, na.strings=c("NA",""), header=TRUE)
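
To illustrate what the na.strings argument does, here is a minimal sketch on a small inline CSV. The column names and values are made up for illustration and are not taken from the real files; the point is only that both the literal NA and empty fields end up as missing values.

```r
# Hypothetical three-row CSV: the last column mixes empty fields with a text value,
# so it is read as a character column.
csvText <- "id,roll_belt,kurtosis\n1,1.41,\n2,NA,#DIV/0!\n3,1.45,"

# With na.strings = c("NA", ""), empty fields are treated as missing.
toy <- read.csv(text = csvText, na.strings = c("NA", ""), header = TRUE)

# With only "NA", the empty fields in the character column stay as "".
toyDefault <- read.csv(text = csvText, na.strings = "NA", header = TRUE)

# Count missing values per column in both versions.
naCounts <- colSums(is.na(toy))
naCountsDefault <- colSums(is.na(toyDefault))
print(naCounts)
print(naCountsDefault)
```

Marking empty fields as NA at read time is what lets the next step detect sparsely populated columns with is.na().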

Removing from both datasets every column that contains at least one NA:

indexForNA_training <- apply(training,2,function(x) {sum(is.na(x))}) 
training <- training[,which(indexForNA_training == 0)]

indexForNA_testing <- apply(testing,2,function(x) {sum(is.na(x))}) 
testing <- testing[,which(indexForNA_testing == 0)]
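
The filter above keeps only columns whose NA count is zero. A minimal sketch of the same idea on a toy data frame (made-up values), using colSums(), which should give the same counts as the apply() call:

```r
# Toy data frame (made-up values): column b contains a missing value.
toy <- data.frame(a = c(1, 2, 3), b = c(NA, 5, 6), c = c("x", "y", "z"))

# Number of NAs in each column; colSums() works on the logical matrix from is.na().
naPerColumn <- colSums(is.na(toy))

# Keep only the columns without any NA.
toyComplete <- toy[, naPerColumn == 0, drop = FALSE]
print(names(toyComplete))
```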

Setting classe as factor:

training$classe <- as.factor(training$classe)

Preprocessing columns with numeric data:

numericCol <- which(lapply(training, class) %in% "numeric")

preObj <- preProcess(training[, numericCol], method = c('knnImpute', 'center', 'scale'))
trainPreProcessed <- predict(preObj, training[, numericCol])
trainPreProcessed$classe <- training$classe

testingPreProcessed <- predict(preObj, testing[, numericCol])
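
The preprocessing object is estimated on the training data and then applied unchanged to the testing data. A minimal base-R sketch of centering and scaling under that rule, with made-up values and hypothetical variable names (this does not reproduce caret's internals or the knnImpute step):

```r
# Made-up training and test values for a single numeric feature.
trainX <- c(2, 4, 6, 8)
testX  <- c(5, 10)

# Parameters are estimated on the training data only.
trainMean <- mean(trainX)
trainSd   <- sd(trainX)

# The same parameters are then applied to both sets.
trainScaled <- (trainX - trainMean) / trainSd
testScaled  <- (testX - trainMean) / trainSd

print(trainScaled)
print(testScaled)
```

After this transformation the training feature has mean 0 and standard deviation 1, while the test values are expressed on the same scale.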

Removing the variables with near-zero variance:

nzvTraining <- nearZeroVar(trainPreProcessed,saveMetrics=TRUE)
trainPreProcessed <- trainPreProcessed[,nzvTraining$nzv==FALSE]

nzvTesting <- nearZeroVar(testingPreProcessed,saveMetrics=TRUE)
testingPreProcessed <- testingPreProcessed[,nzvTesting$nzv==FALSE]
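
A near-zero-variance check looks at how dominant the most common value is and how many distinct values a variable has. The sketch below is a simplified base-R illustration; isNearZeroVar is a hypothetical helper, and its cutoffs are assumed to match the defaults documented for caret's nearZeroVar (freqCut = 95/5, uniqueCut = 10).

```r
# Simplified sketch of a near-zero-variance check (hypothetical helper, not caret's code).
isNearZeroVar <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)            # a single distinct value: zero variance
  freqRatio     <- counts[[1]] / counts[[2]]      # most common vs. second most common value
  percentUnique <- 100 * length(counts) / length(x)
  freqRatio > freqCut && percentUnique <= uniqueCut
}

almostConstant <- c(rep(0, 99), 1)   # freqRatio = 99, percentUnique = 2
varied         <- seq_len(100)       # every value is distinct

print(isNearZeroVar(almostConstant))  # TRUE
print(isNearZeroVar(varied))          # FALSE
```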

The training set now has 28 variables (27 predictors plus classe), down from the initial 160.

Cross validation

Dividing the training set into two parts, 75% for training and 25% held out for validation:

inTrain <- createDataPartition(trainPreProcessed$classe, p = 3/4, list = FALSE)
trainingPart <- trainPreProcessed[inTrain, ]

testingPart <- trainPreProcessed[-inTrain, ]
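
The split above is stratified by classe, so both parts keep roughly the same class proportions. A base-R approximation of that idea on made-up labels is sketched below; it is not caret's exact algorithm, and the variable names are hypothetical.

```r
set.seed(12345)

# Made-up class labels with unbalanced frequencies.
classe <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Sample 75% of the row indices within each class, so class proportions are preserved.
inTrainToy <- unlist(lapply(split(seq_along(classe), classe), function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}), use.names = FALSE)

trainLabels <- classe[inTrainToy]
testLabels  <- classe[-inTrainToy]

print(table(trainLabels))
print(table(testLabels))
```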

Decision Tree Model

Fitting a decision tree:

decisiontree <- train(classe~.,method="rpart", data=trainingPart)
fancyRpartPlot(decisiontree$finalModel)

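Predicting on the held-out part and evaluating the results with a confusion matrix. Because the true classes are passed as the first argument here, the rows labelled Prediction in the output hold the true classes and the columns labelled Reference hold the predicted ones; overall accuracy and kappa are unaffected: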

predictions <- predict(decisiontree,newdata = testingPart)
confusionMatrix(testingPart$classe, predictions)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 870   2 380 138   5
##          B 162 176 329 282   0
##          C  29  16 710 100   0
##          D  46   4 352 402   0
##          E  16   4 264 228 389
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5194          
##                  95% CI : (0.5053, 0.5334)
##     No Information Rate : 0.415           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4002          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7747  0.87129   0.3489  0.34957  0.98731
## Specificity            0.8611  0.83560   0.9495  0.89291  0.88647
## Pos Pred Value         0.6237  0.18546   0.8304  0.50000  0.43174
## Neg Pred Value         0.9279  0.99343   0.6728  0.81756  0.99875
## Prevalence             0.2290  0.04119   0.4150  0.23450  0.08034
## Detection Rate         0.1774  0.03589   0.1448  0.08197  0.07932
## Detection Prevalence   0.2845  0.19352   0.1743  0.16395  0.18373
## Balanced Accuracy      0.8179  0.85344   0.6492  0.62124  0.93689

The decision tree fits poorly, with an accuracy of about 52% on the held-out part.
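
The overall statistics can be recomputed directly from the printed table. A short sketch, with the counts copied from the decision tree confusion matrix above:

```r
# Counts copied from the decision tree confusion matrix (rows and columns as printed).
cm <- matrix(c(870,   2, 380, 138,   5,
               162, 176, 329, 282,   0,
                29,  16, 710, 100,   0,
                46,   4, 352, 402,   0,
                16,   4, 264, 228, 389),
             nrow = 5, byrow = TRUE,
             dimnames = list(LETTERS[1:5], LETTERS[1:5]))

# Accuracy: cases on the diagonal (both labels agree) over all cases.
accuracy <- sum(diag(cm)) / sum(cm)

# No-information rate: share of the largest class among the columns.
noInfoRate <- max(colSums(cm)) / sum(cm)

print(round(accuracy, 4))    # 0.5194
print(round(noInfoRate, 4))  # 0.415
```

An accuracy of 0.5194 against a no-information rate of 0.415 means the tree does better than always guessing the most frequent class, but not by much.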

Random forest

A random forest is trained next because of its typically high accuracy. The train control method is 10-fold cross-validation.

modFit <- train(classe ~ .,
                method = "rf",
                data = trainingPart,
                # number and allowParallel are trainControl() arguments; 10 folds matches the summary below
                trControl = trainControl(method = "cv", number = 10, allowParallel = TRUE))

modFit
## Random Forest 
## 
## 14718 samples
##    27 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 13246, 13247, 13246, 13245, 13246, 13247, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9927982  0.9908903  0.002504764  0.003168151
##   14    0.9921190  0.9900311  0.002583399  0.003267831
##   27    0.9899450  0.9872823  0.003048269  0.003854566
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.
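
The sample sizes in the resampling summary follow from how k-fold cross-validation splits the data: each resample trains on all rows except one fold. A minimal base-R sketch of a plain (unstratified) fold assignment, using the sample and fold counts from the summary above; caret's own folds are stratified by class, so its sizes can differ by a row or two.

```r
set.seed(12345)

n <- 14718   # number of samples reported in the model summary
k <- 10      # number of folds reported in the model summary

# Assign every row to one of k folds of nearly equal size, in random order.
foldId <- sample(rep(seq_len(k), length.out = n))

# Each resample trains on all rows outside one fold and evaluates on that fold.
trainSizes <- n - as.vector(table(foldId))
print(trainSizes)
```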

Next, accuracy is computed on the training part and on the held-out validation part.

Training set:

trainingPartPrediction <- predict(modFit, trainingPart)
confusionMatrix(trainingPartPrediction, trainingPart$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4185    0    0    0    0
##          B    0 2848    0    0    0
##          C    0    0 2567    0    0
##          D    0    0    0 2412    0
##          E    0    0    0    0 2706
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1839
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1839
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1839
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

Cross validation set:

testingPartPrediction <- predict(modFit, testingPart)
confusionMatrix(testingPartPrediction, testingPart$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395   12    0    0    0
##          B    0  933    5    0    0
##          C    0    4  844    8    1
##          D    0    0    6  795    2
##          E    0    0    0    1  898
## 
## Overall Statistics
##                                           
##                Accuracy : 0.992           
##                  95% CI : (0.9891, 0.9943)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9899          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9831   0.9871   0.9888   0.9967
## Specificity            0.9966   0.9987   0.9968   0.9980   0.9998
## Pos Pred Value         0.9915   0.9947   0.9848   0.9900   0.9989
## Neg Pred Value         1.0000   0.9960   0.9973   0.9978   0.9993
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1903   0.1721   0.1621   0.1831
## Detection Prevalence   0.2869   0.1913   0.1748   0.1637   0.1833
## Balanced Accuracy      0.9983   0.9909   0.9920   0.9934   0.9982
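
The held-out results give an estimate of the out-of-sample error. A short sketch that recomputes it, along with the per-class sensitivity, from the counts in the confusion matrix above:

```r
# Counts copied from the random forest confusion matrix on the held-out part
# (rows: predicted class, columns: reference class).
cmRf <- matrix(c(1395,  12,   0,   0,   0,
                    0, 933,   5,   0,   0,
                    0,   4, 844,   8,   1,
                    0,   0,   6, 795,   2,
                    0,   0,   0,   1, 898),
               nrow = 5, byrow = TRUE,
               dimnames = list(LETTERS[1:5], LETTERS[1:5]))

# Estimated out-of-sample error: share of misclassified held-out cases.
outOfSampleError <- 1 - sum(diag(cmRf)) / sum(cmRf)

# Sensitivity per class: correct predictions divided by the number of true cases.
sensitivity <- diag(cmRf) / colSums(cmRf)

print(round(outOfSampleError, 4))  # 0.008
print(round(sensitivity, 4))
```

Only 39 of the 4904 held-out cases are misclassified, which corresponds to an estimated out-of-sample error of about 0.8%.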

Final predictions on the real testing set

testingPrediction <- predict(modFit, testingPreProcessed)
testingPrediction
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E