This article shows how to build a classification model from the data contained in the Weight Lifting Exercises Dataset (see References). Three different models are created and compared, and the best performing one is then used to predict the values of the variable classe (which encodes how the exercise was performed, levels A to E) in a separate test case dataset.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The first step consists of loading the training and test case datasets.
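The analysis relies on a handful of R packages; a minimal setup, assuming they are installed (caret loads randomForest and gbm on demand when training those models):
# Packages used throughout the analysis
library(caret)     # data partitioning, model training, confusion matrices
library(rpart)     # decision trees
library(rattle)    # fancyRpartPlot for drawing rpart trees
library(corrplot)  # correlation plot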
# Fix the random seed for reproducibility
set.seed(111)
# Load the raw training data and the 20 test cases
trainDS <- read.csv("./pml-training.csv", stringsAsFactors = FALSE)
TestCaseSet <- read.csv("./pml-testing.csv", stringsAsFactors = FALSE)
dim(trainDS)
## [1] 19622 160
dim(TestCaseSet)
## [1] 20 160
Analysing the data, one can see that many columns provide very little information, as they contain mostly NA values or are simply empty.
Before applying any regression or classification model it is important to have a complete dataset, either by removing missing values or by imputing them. In this dataset some columns contain less than 5% valid values, so imputing them is not advisable: more than 95% of the data would have to be estimated from less than 5% of it.
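A quick way to verify this is to compute, per column, the fraction of values that are missing or empty; a minimal sketch (the name naFraction is only illustrative):
# Fraction of missing or empty values per column (illustrative)
naFraction <- sapply(trainDS, function(x) mean(is.na(x) | x == ""))
# Columns are either nearly complete or nearly empty
table(naFraction > 0.95)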
We decide to:
– remove columns with less than 5% valid values
– convert the column classe (which contains the values to predict) from string to factor (with levels A to E)
– remove the first seven columns, which contain information that is not relevant for classification (row ID, user name, timestamps, and the like)
As a last step, the initial dataset is split into two parts: one for model training and one for testing model accuracy (which, in turn, will be used to choose the model for the final prediction).
# Convert classe (the outcome to predict) from string to factor
trainDS$classe <- as.factor(trainDS$classe)
# Remove columns which contain almost only NA values
v <- sapply(trainDS, function(x) mean(!is.na(x))) > 0.95
trainDS <- trainDS[,v]
TestCaseSet <- TestCaseSet[,v]
# Remove columns which are almost entirely empty strings
v <- sapply(names(trainDS), function(x) mean(trainDS[,x] != "") > 0.95)
trainDS <- trainDS[,v]
TestCaseSet <- TestCaseSet[,v]
# Remove the first seven columns, which contain data unrelated to classification
trainDS <- trainDS[,-7:-1]
TestCaseSet <- TestCaseSet[,-7:-1]
# Split the data into training (70%) and testing (30%) partitions
partition <- createDataPartition(trainDS$classe, p=0.7, list=FALSE)
TrainSet <- trainDS[partition, ]
TestSet <- trainDS[-partition, ]
dim(TrainSet)
## [1] 13737 53
After removing the unnecessary data, the training set retains 53 of the original 160 columns. A smaller dataset simplifies model construction and reduces computation time.
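As a sanity check, one can verify that the processed training data and test cases now share the same predictors; in this dataset the test cases carry a problem_id column in place of classe. An illustrative check:
# The two sets should differ only in the outcome/identifier column (illustrative)
setdiff(names(trainDS), names(TestCaseSet))   # expected: "classe"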
For the sake of clarity, this experiment uses the following three datasets:
– TrainSet: the training dataset, used to create the classification models
– TestSet: the dataset used to evaluate the accuracy of the classification models
– TestCaseSet: the dataset containing the 20 test cases; the best performing classification model will be used to predict the values of the variable classe for these cases
The following diagram shows the correlations among the variables of the dataset (the darker the color, the stronger the correlation):
# Compute correlations among the 52 predictors (classe is column 53)
corMatrix <- cor(TrainSet[, -53])
corrplot(corMatrix, type = "upper", order = "hclust", method = "circle", tl.cex = 0.7, tl.col="black", tl.srt=45)
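The plot reveals a few clusters of strongly correlated predictors. If one wanted to prune them, caret's findCorrelation could be used; a minimal sketch (the 0.8 cutoff is an arbitrary choice):
# List predictors that are highly correlated with others (illustrative)
highCorr <- findCorrelation(corMatrix, cutoff = 0.8)
names(TrainSet)[highCorr]
Here, however, all 52 predictors are kept, since tree-based models cope reasonably well with correlated inputs.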
The first model is based on a simple decision tree.
# Train a decision tree on the training partition and plot it
modelDecisionTree <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(modelDecisionTree)
# Use decision tree to predict results in test dataset
predictDecisionTree <- predict(modelDecisionTree, newdata=TestSet, type="class")
confMatDecisionTree <- confusionMatrix(predictDecisionTree, TestSet$classe)
accuracyDecisionTree <- round(confMatDecisionTree$overall['Accuracy'], 4)
confMatDecisionTree
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1540 175 19 49 46
## B 52 644 91 91 78
## C 34 107 813 145 106
## D 20 96 75 602 73
## E 28 117 28 77 779
##
## Overall Statistics
##
## Accuracy : 0.7439
## 95% CI : (0.7326, 0.755)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6751
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9200 0.5654 0.7924 0.6245 0.7200
## Specificity 0.9314 0.9343 0.9193 0.9464 0.9479
## Pos Pred Value 0.8420 0.6736 0.6747 0.6952 0.7570
## Neg Pred Value 0.9670 0.8996 0.9545 0.9279 0.9376
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2617 0.1094 0.1381 0.1023 0.1324
## Detection Prevalence 0.3108 0.1624 0.2048 0.1472 0.1749
## Balanced Accuracy 0.9257 0.7498 0.8559 0.7854 0.8340
The decision tree reaches an accuracy of 0.7439, so its predictions contain a relatively high number of errors compared to the true values in the test set.
Next we create a new classification model based on the random forest algorithm. In order to avoid overfitting, k-fold cross-validation (k = 3) is used: the model is trained and validated on different blocks of samples and is therefore expected to provide better performance when predicting values on a new dataset.
# Train a random forest with 3-fold cross-validation
ctrlRandomForest <- trainControl(method="cv", number=3, verboseIter=FALSE)
modelRandomForest <- train(classe ~ ., data=TrainSet, method="rf", trControl=ctrlRandomForest)
modelRandomForest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.59%
## Confusion matrix:
## A B C D E class.error
## A 3906 0 0 0 0 0.000000000
## B 15 2640 3 0 0 0.006772009
## C 0 17 2376 3 0 0.008347245
## D 0 0 33 2218 1 0.015097691
## E 0 0 2 7 2516 0.003564356
# Apply the model to predict classe in the test dataset
predictRandomForest <- predict(modelRandomForest, newdata=TestSet)
# Calculate accuracy
confMatRandomForest <- confusionMatrix(predictRandomForest, TestSet$classe)
accuracyRandomForest <- round(confMatRandomForest$overall['Accuracy'], 4)
confMatRandomForest
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 2 0 0 0
## B 1 1129 4 0 0
## C 0 8 1021 25 0
## D 0 0 1 938 4
## E 1 0 0 1 1078
##
## Overall Statistics
##
## Accuracy : 0.992
## 95% CI : (0.9894, 0.9941)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9899
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9988 0.9912 0.9951 0.9730 0.9963
## Specificity 0.9995 0.9989 0.9932 0.9990 0.9996
## Pos Pred Value 0.9988 0.9956 0.9687 0.9947 0.9981
## Neg Pred Value 0.9995 0.9979 0.9990 0.9947 0.9992
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2841 0.1918 0.1735 0.1594 0.1832
## Detection Prevalence 0.2845 0.1927 0.1791 0.1602 0.1835
## Balanced Accuracy 0.9992 0.9951 0.9942 0.9860 0.9979
The model reaches an accuracy of 0.992 which, as expected, is a large improvement over the decision tree model. On the other hand, random forests require more computation time and are harder to interpret than decision trees.
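Some interpretability can be recovered by inspecting variable importance, for example with caret's varImp; a minimal sketch:
# Inspect the most influential predictors of the random forest (illustrative)
impRandomForest <- varImp(modelRandomForest)
plot(impRandomForest, top = 10)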
Let's now create a generalized boosted model (GBM). This time we use repeated cross-validation, again in order to reduce overfitting.
# Train a GBM with 3-fold cross-validation repeated twice
controlGBM <- trainControl(method = "repeatedcv", number = 3, repeats = 2)
modelFitGBM <- train(classe ~ ., data=TrainSet, method = "gbm", trControl = controlGBM, verbose = FALSE)
modelFitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 52 had non-zero influence.
# Calculate accuracy
predictGBM <- predict(modelFitGBM, newdata = TestSet)
confMatGBM <- confusionMatrix(predictGBM, TestSet$classe)
accuracyGBM <- round(confMatGBM$overall['Accuracy'], 4)
confMatGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1642 39 0 0 0
## B 21 1070 31 2 8
## C 6 29 973 34 6
## D 5 0 17 923 18
## E 0 1 5 5 1050
##
## Overall Statistics
##
## Accuracy : 0.9614
## 95% CI : (0.9562, 0.9662)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9512
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9809 0.9394 0.9483 0.9575 0.9704
## Specificity 0.9907 0.9869 0.9846 0.9919 0.9977
## Pos Pred Value 0.9768 0.9452 0.9284 0.9585 0.9896
## Neg Pred Value 0.9924 0.9855 0.9890 0.9917 0.9934
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2790 0.1818 0.1653 0.1568 0.1784
## Detection Prevalence 0.2856 0.1924 0.1781 0.1636 0.1803
## Balanced Accuracy 0.9858 0.9632 0.9665 0.9747 0.9841
The generalized boosted model also provides very high accuracy, with more than 96% of the samples correctly classified.
Comparing the accuracy of the three models, it is easy to see that the random forest model performs best on the test dataset:
| Model type | Accuracy |
|---|---|
| Decision tree | 0.7439 |
| Random forest | 0.9920 |
| Generalized Boosted Model | 0.9614 |
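The same comparison can be assembled directly from the accuracy values computed earlier; an illustrative snippet:
# Tabulate the accuracies of the three models (illustrative)
data.frame(Model = c("Decision tree", "Random forest", "GBM"),
           Accuracy = c(accuracyDecisionTree, accuracyRandomForest, accuracyGBM))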
Hence we use the random forest model to predict the values of the variable classe for the 20 samples contained in the test case dataset.
predictClasseValuesRF <- predict(modelRandomForest, TestCaseSet)
predictClasseValuesRF
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
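If each prediction has to be submitted in its own file, a small helper along these lines could be used (writePredictions is a hypothetical function, not part of the analysis above):
# Hypothetical helper: write each prediction to its own text file
writePredictions <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
writePredictions(predictClasseValuesRF)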
[1] Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.