In this project we work on the Weight Lifting Data Set. This data set contains measurements from acceleration sensors that were fitted to different body parts of the participants (belt, arm, forearm, dumbbell). Our task is to predict how well the participants performed the biceps curl, a common weight lifting exercise.
Let’s start by loading the data:
library(dplyr)  # data cleaning pipeline
library(caret)  # model training and evaluation

df <- read.csv("../pml-training.csv", na.strings = c("NA", ""))
dim(df)
## [1] 19622 160
This data set contains a lot of columns with summary statistics. These columns only contain data for a small fraction of the rows because they summarize the data in certain time windows (marked by the column new_window). We won’t need them in this project. The row index (X), the user name and the time window data cannot be used for prediction and are removed as well.
df <- df %>%
    filter(new_window == "no") %>%
    dplyr::select(-contains("timestamp")) %>%
    dplyr::select(-starts_with("kurtosis_")) %>%
    dplyr::select(-starts_with("skewness_")) %>%
    dplyr::select(-starts_with("max_")) %>%
    dplyr::select(-starts_with("min_")) %>%
    dplyr::select(-starts_with("amplitude_")) %>%
    dplyr::select(-starts_with("avg_")) %>%
    dplyr::select(-starts_with("var_")) %>%
    dplyr::select(-starts_with("stddev_")) %>%
    dplyr::select(-c(X, user_name, new_window, num_window))
We separate the data into three sets: training, testing and validation. For the single models, training and testing are combined into the building data set. For the combined model, the base models are fit on the training set and the combined predictor is set up using their predictions on the testing set. As with the other models, the validation set is used for the final model assessment.
10-fold cross-validation is used for hyperparameter optimization. Models without hyperparameters are trained without cross-validation.
inBuild <- createDataPartition(y = df$classe, p = 0.7, list = FALSE)
building <- df[inBuild,]
validation <- df[-inBuild,]
inTrain <- createDataPartition(y = building$classe, p = 0.7, list = FALSE)
training <- building[inTrain,]
testing <- building[-inTrain,]
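The 10-fold cross-validation mentioned above corresponds to caret's trainControl(method = "cv"), which uses 10 folds by default. Written out explicitly it would look like this (cv.ctrl is just an illustrative name; the calls below pass the control object inline):
# Explicit version of the resampling setup used for the tuned models below
cv.ctrl <- trainControl(method = "cv", number = 10)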
There are five class labels with similar frequencies. This means that classification accuracy is a good way to compare the models: it is unlikely that an algorithm achieves good overall accuracy while performing significantly worse on some of the classes. Accuracy will be used throughout the project to assess the out-of-sample error on the validation data set (accuracy in percent = 100% - out-of-sample error in percent).
table(training$classe)
##
## A B C D E
## 2681 1823 1643 1543 1729
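The counts translate into the following relative frequencies (a quick sketch; the exact values depend on the random partition):
# Relative class frequencies in the training set
round(prop.table(table(training$classe)), 3)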
Classification with more than two class labels can be done with (multinomial) logistic regression, but it is more common to use Linear Discriminant Analysis (LDA) instead.
We use the caret package throughout the project. For models without hyperparameters, such as LDA, cross-validation is superfluous, so we disable it.
lda.fit <- train(classe ~ ., data=building, method="lda",
                 trControl=trainControl(method="none"))
mean(predict(lda.fit, validation) == validation$classe)
## [1] 0.7013708
We can do better than that! While LDA assumes a shared covariance matrix for all classes, Quadratic Discriminant Analysis (QDA) drops this assumption. This gives improved accuracy:
qda.fit <- train(classe ~ ., data=building, method="qda",
                 trControl=trainControl(method="none"))
mean(predict(qda.fit, validation) == validation$classe)
## [1] 0.8948464
Still not stellar. Maybe a decision tree can help to untangle the many features of this data set. Caret's default grid did not do a good job of optimizing the complexity parameter cp; the custom tune grid below gives a better indication of the best value to choose.
While figuring out the tune grid I used the training and testing data sets. Only for the final model assessment were the data sets changed to building and validation.
set.seed(12345)
rpart.fit <- train(classe ~ ., data=building, method="rpart",
                   trControl=trainControl(method="cv"),
                   tuneGrid=data.frame(cp=c(1e-6,5e-6,1e-5,5e-5,1e-4,5e-4,1e-3,1e-2)))
plot(rpart.fit, scales = list(x = list(log = 10)))
mean(predict(rpart.fit, validation) == validation$classe)
## [1] 0.924345
Accuracy does not improve further even for very small values of cp. Since this parameter controls by how much each split has to decrease the lack-of-fit metric, it is not a good sign that we have to set it to such a tiny value to get a decent accuracy.
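To see which value cross-validation actually selected, and how large the resulting tree is, the fit can be inspected directly (a quick sketch; the exact numbers depend on the seed and fold assignment):
rpart.fit$bestTune                # cp value selected by cross-validation
nrow(rpart.fit$finalModel$frame)  # number of nodes in the final tree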
Maybe all three models together perform better? To build the combined predictor, the three base models are fit on the training set, their predictions on the testing set are stacked into a data frame, and a decision tree is trained on top of these predictions.
set.seed(12345)
lda.fit.pred <- train(classe ~ ., data=training, method="lda",
                      trControl=trainControl(method="none"))
qda.fit.pred <- train(classe ~ ., data=training, method="qda",
                      trControl=trainControl(method="none"))
rpart.fit.pred <- train(classe ~ ., data=training, method="rpart",
                        trControl=trainControl(method="cv"),
                        tuneGrid=data.frame(cp=c(1e-6,5e-6,1e-5,5e-5,1e-4,5e-4,1e-3,1e-2)))
lda.pred <- predict(lda.fit.pred, testing)
qda.pred <- predict(qda.fit.pred, testing)
rpart.pred <- predict(rpart.fit.pred, testing)
df.pred <- data.frame(lda.pred,
                      qda.pred,
                      rpart.pred,
                      classe=testing$classe)
combined.fit <- train(classe ~ ., data=df.pred, method="rpart",
                      trControl=trainControl(method="cv"),
                      tuneGrid=data.frame(cp=seq(0, 0.01, 0.001)))
lda.val.pred <- predict(lda.fit.pred, validation)
qda.val.pred <- predict(qda.fit.pred, validation)
rpart.val.pred <- predict(rpart.fit.pred, validation)
df.val.pred <- data.frame(lda.pred=lda.val.pred,
                          qda.pred=qda.val.pred,
                          rpart.pred=rpart.val.pred,
                          classe=validation$classe)
mean(predict(combined.fit, df.val.pred) == df.val.pred$classe)
## [1] 0.9264272
The combined model only gives a small improvement over the decision tree alone. One problem could be the correlations between the predictions of the LDA, QDA and decision tree models:
c(cor(as.numeric(lda.pred), as.numeric(qda.pred)),
  cor(as.numeric(lda.pred), as.numeric(rpart.pred)),
  cor(as.numeric(qda.pred), as.numeric(rpart.pred)))
## [1] 0.7101423 0.6531314 0.8848267
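Pearson correlation on the numeric class codes is only a rough measure for categorical predictions; the raw agreement rates between the models tell a similar story (a small sketch on the testing-set predictions):
# Fraction of testing-set cases where two models predict the same class
mean(lda.pred == qda.pred)
mean(lda.pred == rpart.pred)
mean(qda.pred == rpart.pred)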
As a last resort, let’s try to combine many decision trees into a random forest.
Caret does not optimize the number of trees (ntree). In general, the more trees the better; the only constraint is the available computing power. In my case, I have to settle for a relatively small number of trees.
set.seed(12345)
rforest.fit <- train(classe ~ ., data=building, method="rf",
                     trControl=trainControl(method="cv"),
                     ntree=100)
mean(predict(rforest.fit, validation) == validation$classe)
## [1] 0.9930592
This is what we've been craving!
The accuracy of the random forest model on the validation set is 99.3059%. This is equivalent to an out-of-sample error of 0.6941%.
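The out-of-sample error is simply one minus the validation accuracy; as a one-line sketch:
# Estimated out-of-sample error on the validation set
1 - mean(predict(rforest.fit, validation) == validation$classe)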
To celebrate, we also take a look at the confusion matrix and again marvel at the power of random forests. Only a tiny number of class labels were mispredicted!
confusionMatrix(predict(rforest.fit, validation), validation$classe)$table
## Reference
## Prediction A B C D E
## A 1635 10 0 0 0
## B 4 1104 4 1 0
## C 1 1 996 14 0
## D 0 0 5 929 0
## E 1 0 0 0 1058
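To confirm that the high overall accuracy is not hiding a weak class (cf. the discussion of the class frequencies above), the per-class sensitivities can be checked as well (a sketch using caret's confusionMatrix):
# Sensitivity (recall) for each class on the validation set
confusionMatrix(predict(rforest.fit, validation), validation$classe)$byClass[, "Sensitivity"]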
All jolly and swell, but which features were actually the most important in predicting the class labels, say the top 15?
importance <- varImp(rforest.fit)$importance
head(importance[order(-importance$Overall), , drop=FALSE], 15)
## Overall
## roll_belt 100.00000
## pitch_forearm 59.38056
## yaw_belt 55.97338
## pitch_belt 46.44064
## magnet_dumbbell_y 45.09193
## magnet_dumbbell_z 42.55110
## roll_forearm 40.31743
## accel_dumbbell_y 19.59835
## accel_forearm_x 17.98583
## roll_dumbbell 17.63200
## magnet_belt_z 16.86979
## magnet_dumbbell_x 15.68579
## accel_dumbbell_z 15.20799
## total_accel_dumbbell 14.25629
## magnet_belt_y 13.59359
This indicates that the sensor fitted to the (upper) arm can be omitted without much loss of predictive power: none of its features appear in the top 15. This makes sense, since the upper arm does not move much during a biceps curl compared to the forearm and the dumbbell.
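A quick way to check this claim would be to refit the random forest without the arm-sensor columns and compare validation accuracy. A sketch (not run here; building.noarm and rforest.noarm are names introduced only for this illustration):
# Drop the upper-arm sensor features ("_arm" matches e.g. roll_arm and gyros_arm_x,
# but not the forearm columns) and refit the random forest
building.noarm <- dplyr::select(building, -contains("_arm"))
rforest.noarm <- train(classe ~ ., data=building.noarm, method="rf",
                       trControl=trainControl(method="cv"), ntree=100)
mean(predict(rforest.noarm, validation) == validation$classe)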
Random forests were clearly the winning model in this analysis. I expect that gradient-boosted trees would work at least as well, but the accuracy achieved with random forests is impressive nonetheless.
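For the curious, such a comparison could be run with caret's "gbm" method; a sketch under the same setup (not executed here, so no accuracy is reported):
set.seed(12345)
gbm.fit <- train(classe ~ ., data=building, method="gbm",
                 trControl=trainControl(method="cv"), verbose=FALSE)
mean(predict(gbm.fit, validation) == validation$classe)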