1 Executive Summary

This article shows how to build a classification model from the data contained in the Weight Lifting Exercises Dataset (see References). The model is then used to predict the value of a variable in a separate test-case dataset. Three different models are created and compared, and the best-performing one is used to predict the values of the variable classe (which encodes how the exercise was performed) in the test-case dataset.

2 Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

3 Data loading and analysis

The first step consists of loading the training and test-case datasets.

# Load the libraries used throughout the analysis
library(caret); library(rpart); library(rattle)
library(corrplot); library(randomForest); library(gbm)

set.seed(111)
trainDS <- read.csv("./pml-training.csv", stringsAsFactors = FALSE)
TestCaseSet <- read.csv("./pml-testing.csv", stringsAsFactors = FALSE)
dim(trainDS)
## [1] 19622   160
dim(TestCaseSet)
## [1]  20 160

Analysing the data, one can see that many columns provide very little information: they are mostly NA or simply empty.

Before applying any regression or classification model it is important to have a complete dataset, either by removing the missing values or by imputing them. In this dataset some columns contain valid values for less than 5% of the rows, so imputation is not a recommendable option: more than 95% of the values would have to be estimated from less than 5% of the data.
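This pattern is easy to verify with a quick check (a minimal sketch, not part of the original cleanup code):

# Fraction of NA values per column: columns tend to be either complete
# or almost entirely NA
naFrac <- colMeans(is.na(trainDS))
summary(naFrac[naFrac > 0])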

We decide to:

– remove columns with less than 5% of valid values

– convert column classe (which contains the values to predict) from string to factor (with levels A to E)

– remove the first 7 columns, which contain information not relevant for the classification (ID, timestamp, and the like)

As a last step the initial dataset is split into two parts: one for training the models and one for testing their accuracy, which in turn is used to choose the model for the final prediction.

# Convert the outcome variable classe from string to factor
trainDS$classe <- as.factor(trainDS$classe)

# Remove columns that are almost entirely NA or empty
v <- sapply(trainDS, function(x) mean(!is.na(x))) > 0.95
trainDS <- trainDS[,v]
TestCaseSet  <- TestCaseSet[,v]
v <- sapply(names(trainDS), function(x) mean(trainDS[,x] != "") > 0.95)
trainDS <- trainDS[,v]
TestCaseSet  <- TestCaseSet[,v]

# Remove the first seven columns, which contain data unrelated to the classification
trainDS <- trainDS[, -(1:7)]
TestCaseSet  <- TestCaseSet[, -(1:7)]

# Split the data into training (70%) and testing (30%) partitions
partition  <- createDataPartition(trainDS$classe, p=0.7, list=FALSE)
TrainSet <- trainDS[partition, ]
TestSet  <- trainDS[-partition, ]
dim(TrainSet)
## [1] 13737    53

After removing unnecessary data, the training set is left with 53 of the original 160 columns. Having a smaller dataset simplifies model construction and reduces computation time.
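As an additional sanity check (not part of the original analysis), caret's nearZeroVar can confirm that none of the remaining predictors is near-constant:

# Indices of near-zero-variance predictors among the retained columns;
# an empty result means every predictor still carries some variability
nzv <- nearZeroVar(TrainSet)
length(nzv)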

For the sake of clarity, this experiment uses the following 3 datasets:

– TrainSet: the training dataset which is used to create the classification models

– TestSet: the dataset used for evaluating the accuracy of the classification models

– TestCaseSet: the dataset containing the 20 test cases; the best-performing classification model will be used to predict the values of the variable classe for these cases

3.1 Data correlation

The following diagram shows the correlation among the variables of the dataset (the darker the color, the stronger the correlation):

corMatrix <- cor(TrainSet[, -53])
corrplot(corMatrix, type = "upper", order = "hclust", method = "circle", tl.cex = 0.7,  tl.col="black", tl.srt=45)
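To complement the plot numerically, caret's findCorrelation can list the predictors involved in strong pairwise correlations; the 0.8 cutoff below is an illustrative choice:

# Names of predictors involved in pairwise correlations above 0.8
highCor <- findCorrelation(corMatrix, cutoff = 0.8, names = TRUE)
highCor

All predictors are nevertheless kept, since tree-based models are generally tolerant of correlated inputs.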

4 Model creation and selection

4.1 Decision tree

The first model is based on a simple decision tree.

modelDecisionTree <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(modelDecisionTree)

# Use decision tree to predict results in test dataset
predictDecisionTree <- predict(modelDecisionTree, newdata=TestSet, type="class")
confMatDecisionTree <- confusionMatrix(predictDecisionTree, TestSet$classe)
accuracyDecisionTree <- round(confMatDecisionTree$overall['Accuracy'], 4)
confMatDecisionTree
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1540  175   19   49   46
##          B   52  644   91   91   78
##          C   34  107  813  145  106
##          D   20   96   75  602   73
##          E   28  117   28   77  779
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7439         
##                  95% CI : (0.7326, 0.755)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6751         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9200   0.5654   0.7924   0.6245   0.7200
## Specificity            0.9314   0.9343   0.9193   0.9464   0.9479
## Pos Pred Value         0.8420   0.6736   0.6747   0.6952   0.7570
## Neg Pred Value         0.9670   0.8996   0.9545   0.9279   0.9376
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2617   0.1094   0.1381   0.1023   0.1324
## Detection Prevalence   0.3108   0.1624   0.2048   0.1472   0.1749
## Balanced Accuracy      0.9257   0.7498   0.8559   0.7854   0.8340

The decision tree reaches an accuracy of 0.7439, so the prediction contains a relatively high number of errors compared to the real values in the test set.
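For reference, the per-class statistics reported above can be extracted directly from the caret confusionMatrix object; for example:

# Balanced accuracy per class, taken from the confusionMatrix object
round(confMatDecisionTree$byClass[, "Balanced Accuracy"], 4)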

4.2 Random forest

Next we create a new classification model based on the random forest algorithm. In order to limit model overfitting, a k-fold (k = 3) cross-validation scheme is used. This way the model is tuned and evaluated on different blocks of samples and is therefore expected to provide better performance when predicting values on a new dataset.

ctrlRandomForest <- trainControl(method="cv", number=3, verboseIter=FALSE)
modelRandomForest <- train(classe ~ ., data=TrainSet, method="rf", trControl=ctrlRandomForest)
modelRandomForest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.59%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3906    0    0    0    0 0.000000000
## B   15 2640    3    0    0 0.006772009
## C    0   17 2376    3    0 0.008347245
## D    0    0   33 2218    1 0.015097691
## E    0    0    2    7 2516 0.003564356
# Apply the model to predict the variable classe in the test dataset
predictRandomForest <- predict(modelRandomForest, newdata=TestSet)

# Calculate accuracy
confMatRandomForest <- confusionMatrix(predictRandomForest, TestSet$classe)
accuracyRandomForest <- round(confMatRandomForest$overall['Accuracy'], 4)
confMatRandomForest
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1672    2    0    0    0
##          B    1 1129    4    0    0
##          C    0    8 1021   25    0
##          D    0    0    1  938    4
##          E    1    0    0    1 1078
## 
## Overall Statistics
##                                           
##                Accuracy : 0.992           
##                  95% CI : (0.9894, 0.9941)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9899          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9988   0.9912   0.9951   0.9730   0.9963
## Specificity            0.9995   0.9989   0.9932   0.9990   0.9996
## Pos Pred Value         0.9988   0.9956   0.9687   0.9947   0.9981
## Neg Pred Value         0.9995   0.9979   0.9990   0.9947   0.9992
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2841   0.1918   0.1735   0.1594   0.1832
## Detection Prevalence   0.2845   0.1927   0.1791   0.1602   0.1835
## Balanced Accuracy      0.9992   0.9951   0.9942   0.9860   0.9979

The model reaches an accuracy of 0.992 which, as expected, is a considerable improvement over the decision tree model. On the other hand, models based on random forests require more computation time and are harder to interpret than decision trees.
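One way to partially recover interpretability is to inspect variable importance; a minimal sketch using caret's varImp:

# Rank the predictors by their importance in the random forest model
impRF <- varImp(modelRandomForest)
plot(impRF, top = 20)  # the 20 most influential predictors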

4.3 Generalized Boosted Model

Let’s now create a generalized boosted model. This time we use a repeated cross-validation scheme in order to reduce model overfitting.

controlGBM  <- trainControl(method = "repeatedcv", number = 3, repeats = 2)
modelFitGBM <- train(classe ~ ., data=TrainSet, method = "gbm", trControl = controlGBM, verbose = FALSE)
modelFitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 52 had non-zero influence.
# Predict on the test set and calculate the accuracy
predictGBM <- predict(modelFitGBM, newdata = TestSet)
confMatGBM <- confusionMatrix(predictGBM, TestSet$classe)
accuracyGBM <- round(confMatGBM$overall['Accuracy'], 4)
confMatGBM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1642   39    0    0    0
##          B   21 1070   31    2    8
##          C    6   29  973   34    6
##          D    5    0   17  923   18
##          E    0    1    5    5 1050
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9614          
##                  95% CI : (0.9562, 0.9662)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9512          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9809   0.9394   0.9483   0.9575   0.9704
## Specificity            0.9907   0.9869   0.9846   0.9919   0.9977
## Pos Pred Value         0.9768   0.9452   0.9284   0.9585   0.9896
## Neg Pred Value         0.9924   0.9855   0.9890   0.9917   0.9934
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2790   0.1818   0.1653   0.1568   0.1784
## Detection Prevalence   0.2856   0.1924   0.1781   0.1636   0.1803
## Balanced Accuracy      0.9858   0.9632   0.9665   0.9747   0.9841

The generalized boosted model also provides very high accuracy, with more than 96% of the samples correctly classified.

4.4 Model choice and prediction

Comparing the accuracy of the three models, it is easy to see that the random forest model performs best on the test dataset:

Model type                   Accuracy
Decision tree                0.7439
Random forest                0.992
Generalized Boosted Model    0.9614
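The same comparison can be assembled directly from the accuracy values computed above:

# Collect the accuracies stored earlier into a single comparison table
data.frame(
  Model    = c("Decision tree", "Random forest", "Generalized Boosted Model"),
  Accuracy = as.numeric(c(accuracyDecisionTree, accuracyRandomForest, accuracyGBM))
)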

Hence we will use the model based on random forest to predict the values of the variable classe for the 20 samples contained in the test-case dataset.

predictClasseValuesRF <- predict(modelRandomForest, TestCaseSet)
predictClasseValuesRF
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

References

[1] Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.