Introduction

The aim of this analysis is to create a model that predicts the manner in which participants did the exercise (the classe variable in the datasets). The data come from the following study:

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

More information about the study can be found at http://groupware.les.inf.puc-rio.br/har (see the section Weight Lifting Exercises Dataset).

Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).

The goal of this project is to predict the manner (class) in which participants did the exercise.

The data used for training can be found here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The data used for testing can be found here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
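If the files are not yet available locally, they can be downloaded first (a minimal sketch; the destination file names are chosen to match the read.csv calls below):

#download the datasets into the working directory, skipping files that already exist
train.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(train.url, "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(test.url, "pml-testing.csv")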

Data loading and preprocessing

First, the datasets are loaded.

#it is assumed that the datasets are in the working directory
data.train <- read.csv("pml-training.csv",
                       na.strings = c("NA","#DIV/0!",""))
data.test <- read.csv("pml-testing.csv",
                      na.strings = c("NA","#DIV/0!",""))
#ensure the outcome is a factor (read.csv in R >= 4.0 no longer converts strings automatically)
data.train$classe <- factor(data.train$classe)

We keep only variables that do not contain any NAs (many variables consist mostly of NAs). The first seven variables are also dropped because they are not needed for prediction (they are identifiers and timestamps, not directly related to exercise performance).

#remove the first seven columns
data.train <- data.train[, -c(1:7)]
data.test <- data.test[, -c(1:7)]
#remove columns that contain NAs
data.train <- data.train[, colSums(is.na(data.train)) == 0]
data.test <- data.test[, colSums(is.na(data.test)) == 0]

For cross-validation we split the initial training dataset into a training set and a test set (randomly assigning 60% of the observations to training and 40% to testing). The initial testing dataset is reserved for computing the final predictions to be submitted.

#seed is set for reproducibility
set.seed(100)
library(caret)
inTrain <- createDataPartition(y=data.train$classe,
                               p=0.6, list=FALSE)
inTrain.train <- data.train[inTrain,]
inTrain.test <- data.train[-inTrain,]

The training dataset consists of 11776 observations and 53 variables; the testing dataset consists of 7846 observations and 53 variables.
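These figures can be verified directly from the partitions (a quick sanity check using the objects created above):

#check the dimensions of both partitions
dim(inTrain.train)
## [1] 11776    53
dim(inTrain.test)
## [1] 7846   53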

From the following plot we can see that most observations in the training dataset are in class A and the fewest are in class D. All classes have at least 1930 observations.

plot(inTrain.train$classe, ylab="Frequency",xlab="classe")

[Figure: bar plot of classe frequencies in the training dataset]
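The exact counts behind these observations can be checked directly (output omitted here):

#count observations per class in the training partition
table(inTrain.train$classe)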

To prepare the data for model building, we check whether any variables have near zero variance. Variables with near zero variance barely change across observations, so they carry little value for predicting the outcome.

nsv <- nearZeroVar(inTrain.train, saveMetrics=TRUE)
#number of zero-variance variables
sum(nsv$zeroVar)
## [1] 0
#number of near-zero-variance variables
sum(nsv$nzv)
## [1] 0

As seen, there are no variables with near zero variance, so all variables can be used to build the prediction model.

Model building

As the aim of the model is to predict the fashion (class) in which the exercise was done, a decision-tree based model is tried first. We fit a CART model (method “rpart”).

library(caret)
modFit <- train(classe ~ .,method="rpart",data=inTrain.train)
library(rattle)
library(rpart.plot)
fancyRpartPlot(modFit$finalModel, sub="")

[Figure: decision tree of the fitted rpart model]

To assess how well the model predicts, we apply it to the test set held out from the training data.

prediction1 <- predict(modFit, inTrain.test)
confusionMatrix(prediction1, inTrain.test$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1994  652  628  590  168
##          B   36  496   36  241  198
##          C  163  370  704  455  392
##          D    0    0    0    0    0
##          E   39    0    0    0  684
## 
## Overall Statistics
##                                         
##                Accuracy : 0.494         
##                  95% CI : (0.483, 0.505)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.34          
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.893   0.3267   0.5146    0.000   0.4743
## Specificity             0.637   0.9192   0.7870    1.000   0.9939
## Pos Pred Value          0.495   0.4926   0.3378      NaN   0.9461
## Neg Pred Value          0.938   0.8506   0.8848    0.836   0.8936
## Prevalence              0.284   0.1935   0.1744    0.164   0.1838
## Detection Rate          0.254   0.0632   0.0897    0.000   0.0872
## Detection Prevalence    0.514   0.1283   0.2656    0.000   0.0921
## Balanced Accuracy       0.765   0.6230   0.6508    0.500   0.7341

As seen from the table, the model's accuracy on the test set is 49.4%, which is quite low. The model classifies class A reasonably well (sensitivity 89.3%) but performs much worse on the other classes; class D, for example, is never predicted at all (sensitivity 0%).

For a more accurate model, the random forest method is chosen, and its accuracy is likewise assessed on the test set.

library(randomForest)
#classification is inferred automatically because classe is a factor,
#so no method argument is needed
model2 <- randomForest(classe ~ ., data=inTrain.train)
prediction2 <- predict(model2, inTrain.test)
confusionMatrix(prediction2, inTrain.test$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2230    7    0    0    0
##          B    2 1508    8    0    0
##          C    0    3 1357   17    2
##          D    0    0    3 1269    6
##          E    0    0    0    0 1434
## 
## Overall Statistics
##                                         
##                Accuracy : 0.994         
##                  95% CI : (0.992, 0.995)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.992         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.999    0.993    0.992    0.987    0.994
## Specificity             0.999    0.998    0.997    0.999    1.000
## Pos Pred Value          0.997    0.993    0.984    0.993    1.000
## Neg Pred Value          1.000    0.998    0.998    0.997    0.999
## Prevalence              0.284    0.193    0.174    0.164    0.184
## Detection Rate          0.284    0.192    0.173    0.162    0.183
## Detection Prevalence    0.285    0.193    0.176    0.163    0.183
## Balanced Accuracy       0.999    0.996    0.994    0.993    0.997

As seen from the table, the random forest's prediction accuracy on the test dataset is 99.4%, a substantial increase over the initial model. The confusion matrix shows high sensitivity and specificity for all exercise classes (all above 98%), and also that class D remains the hardest class to predict. The expected out-of-sample error is 100% - 99.4% = 0.6%.
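The out-of-sample error can also be computed directly from the confusion matrix object rather than by hand (a small sketch reusing the objects above):

#out-of-sample error = 1 - accuracy on the held-out test partition
cm <- confusionMatrix(prediction2, inTrain.test$classe)
1 - as.numeric(cm$overall["Accuracy"])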

The following plot shows how the error for each class decreases as the number of trees grows.

library(reshape2)
#reshape the per-class error-rate matrix into long format for plotting
errors <- melt(model2$err.rate)

library(ggplot2)
ggplot(subset(errors, Var2!="OOB"), aes(x=Var1, y=value, group=Var2))+
    geom_line(aes(color=Var2))+
    ylab("Error")+
    xlab("trees")+
    scale_colour_discrete(name="Class (classe)")

[Figure: per-class error rate versus number of trees]

As the plot shows, the random forest method has produced a model with low error. However, the model might not achieve the same accuracy on other datasets, because it was calibrated on the training dataset, which may contain noise not present elsewhere. This means the estimated out-of-sample error of 0.6% is the minimum expected error (the model has not been trained on any data beyond the training set).
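For comparison, the model's own out-of-bag (OOB) error estimate can be read from the fitted object, and the chosen model can then be applied to the original testing dataset to produce the final predictions for submission (a sketch; variable names follow the code above):

#OOB error estimate from the final row of the error-rate matrix
model2$err.rate[nrow(model2$err.rate), "OOB"]
#predict the classes of the original testing dataset
predictions.final <- predict(model2, data.test)
predictions.final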

Conclusion

Two models were fitted on the training dataset to find the best prediction model for the exercise classes. The first model (decision tree, method “rpart”) had low accuracy (49.4%), so a second model (random forest) was built. The random forest model has very high accuracy and a low out-of-sample error (about 0.6% on the test data), and is therefore the chosen model; on other datasets the error might be larger. The model most accurately classified whether the exercise was done correctly (Class A, sensitivity 99.9%). The lowest classification accuracy relates to the following mistakes: lifting the dumbbell only halfway (Class C, sensitivity 99.2%) and lowering the dumbbell only halfway (Class D, sensitivity 98.7%). This is plausible, because both mistakes involve only a partial dumbbell movement, which makes the exact type of mistake harder to detect. Even so, both classes are still classified with high accuracy on the test data.

Additional materials

The following materials were used in making this analysis: