Analysis of Weight Lifting Data Set

In this project we work with the Weight Lifting Data Set. It contains measurements from acceleration sensors attached to the belt, arm and forearm of the participants, as well as to the dumbbell. Our task is to predict how well the participants performed the biceps curl, a common weight lifting exercise.

Getting and cleaning the data

Let’s start by loading the data:

library(dplyr)   # used for the cleaning pipeline below
library(caret)   # used for model training and evaluation

df <- read.csv("../pml-training.csv", na.strings = c("NA", ""))
dim(df)
## [1] 19622   160

This data set contains many columns with summary statistics (kurtosis, skewness, max, min, amplitude, avg, var, stddev). These columns only contain data for the small fraction of rows that close a time window (marked by the column new_window), so we drop both the summary columns and the summary rows. The row index (X), the user name and the time-window information cannot be used for prediction and are removed as well.
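Before dropping them, a quick optional check confirms that the summary columns are almost entirely empty and that only a small fraction of rows are summary rows (kurtosis_roll_belt is used here as an arbitrary example of a summary column):

# fraction of missing values in one summary column (arbitrary example)
mean(is.na(df$kurtosis_roll_belt))
# fraction of rows that are summary rows
mean(df$new_window == "yes")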

df <- df %>%
  filter(new_window == "no") %>%                 # drop the summary rows
  dplyr::select(-contains("timestamp"),          # drop time-window and summary columns
                -starts_with("kurtosis_"),
                -starts_with("skewness_"),
                -starts_with("max_"),
                -starts_with("min_"),
                -starts_with("amplitude_"),
                -starts_with("avg_"),
                -starts_with("var_"),
                -starts_with("stddev_"),
                -c(X, user_name, new_window, num_window)) %>%
  mutate(classe = factor(classe))                # read.csv no longer creates factors in R >= 4.0

Splitting the data set

We separate the data into three sets: training, testing and validation. For the individual models, training and testing are combined into a building data set on which they are fit. For the combined model, the individual models are trained on the training set and the combining predictor is then fit on their predictions for the testing set. The validation set is used for the final assessment of all models.

10-fold cross-validation is used for hyperparameter optimization. Models without hyperparameters are trained without cross-validation.
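For reference, this corresponds to caret’s trainControl with method = "cv", which defaults to 10 folds. The calls below pass it inline; storing it once would look like this (the name cv.ctrl is only for illustration):

cv.ctrl <- trainControl(method = "cv", number = 10)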

inBuild <- createDataPartition(y = df$classe, p = 0.7, list = FALSE)
building <- df[inBuild,]
validation <- df[-inBuild,]

inTrain <- createDataPartition(y = building$classe, p = 0.7, list = FALSE)
training <- building[inTrain,]
testing <- building[-inTrain,]
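As a quick sanity check, we can look at the sizes of the resulting sets:

c(building = nrow(building), training = nrow(training),
  testing = nrow(testing), validation = nrow(validation))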

Classification performance measure

There are five class labels with similar frequencies. This means that classification accuracy is a reasonable measure for comparing the models: it is unlikely that an algorithm has good overall accuracy while performing significantly worse on some of the classes. Accuracy will be used throughout the project to estimate the out-of-sample error on the validation data set (accuracy in percent = 100% - out-of-sample error in percent).

table(training$classe)
## 
##    A    B    C    D    E 
## 2681 1823 1643 1543 1729
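The relative frequencies make the balance between the classes even more apparent:

round(prop.table(table(training$classe)), 3)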

Model selection

Classification with more than two class labels can be done with multinomial logistic regression. However, it is more common to use Linear Discriminant Analysis (LDA) instead.

We use the caret package throughout the project. For models such as LDA that have no hyperparameters to tune, cross-validation is superfluous, so we disable it.

lda.fit <- train(classe ~ ., data=building, method="lda",
                 trControl=trainControl(method="none"))

mean(predict(lda.fit, validation) == validation$classe)
## [1] 0.7013708

We can do better than that! While LDA assumes a shared covariance matrix for all classes, Quadratic Discriminant Analysis (QDA) drops this assumption. This gives improved accuracy:

qda.fit <- train(classe ~ ., data=building, method="qda",
                 trControl=trainControl(method="none"))

mean(predict(qda.fit, validation) == validation$classe)
## [1] 0.8948464
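The improvement is consistent with the per-class covariances actually differing. As a rough optional check, we can compare the covariance of two features across the classes (the pair roll_belt/pitch_forearm is an arbitrary choice for this sketch):

# per-class covariance matrices of two example features
by(building[, c("roll_belt", "pitch_forearm")], building$classe, cov)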

Still not stellar. Maybe a decision tree can help to untangle the many features of this data set. With its default tuning grid, caret did not do a good job of optimizing the complexity parameter cp; supplying a custom tuning grid gives a better indication of the best value to choose.

While figuring out the tune grid I used the training and test data sets. Only for the final model assessment were the data sets changed to building and validation.

set.seed(12345)
rpart.fit <- train(classe ~ ., data=building, method="rpart",
                   trControl=trainControl(method="cv"),
                   tuneGrid=data.frame(cp=c(1e-6,5e-6,1e-5,5e-5,1e-4,5e-4,1e-3,1e-2)))
plot(rpart.fit, scales = list(x = list(log = 10)))

mean(predict(rpart.fit, validation) == validation$classe)
## [1] 0.924345

Accuracy does not improve further even for very small values of cp. Since this parameter controls by how much each split has to decrease the lack-of-fit metric, having to set it to such a tiny value to reach a decent accuracy is not a good sign.
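To see which value of cp the cross-validation actually selected, and how accuracy varied across the grid, we can inspect the fitted object:

rpart.fit$bestTune   # cp selected by cross-validation
rpart.fit$results    # cross-validated accuracy for each candidate cp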

Maybe all three models together perform better?

set.seed(12345)
lda.fit.pred <- train(classe ~ ., data=training, method="lda",
                      trControl=trainControl(method="none"))
qda.fit.pred <- train(classe ~ ., data=training, method="qda",
                      trControl=trainControl(method="none"))
rpart.fit.pred <- train(classe ~ ., data=training, method="rpart",
                        trControl=trainControl(method="cv"),
                        tuneGrid=data.frame(cp=c(1e-6,5e-6,1e-5,5e-5,1e-4,5e-4,1e-3,1e-2)))
                   
lda.pred <- predict(lda.fit.pred, testing)
qda.pred <- predict(qda.fit.pred, testing)
rpart.pred <- predict(rpart.fit.pred, testing)

df.pred <- data.frame(lda.pred,
                      qda.pred,
                      rpart.pred,
                      classe=testing$classe)

combined.fit <- train(classe ~ ., data=df.pred, method="rpart",
                     trControl=trainControl(method="cv"),
                     tuneGrid=data.frame(cp=seq(0, 0.01, 0.001)))
 
lda.val.pred <- predict(lda.fit.pred, validation)
qda.val.pred <- predict(qda.fit.pred, validation)
rpart.val.pred <- predict(rpart.fit.pred, validation)
df.val.pred <- data.frame(lda.pred=lda.val.pred,
                          qda.pred=qda.val.pred,
                          rpart.pred=rpart.val.pred,
                          classe=validation$classe)

mean(predict(combined.fit, df.val.pred) == df.val.pred$classe)
## [1] 0.9264272

The combined model gives only a small improvement over the decision tree alone. One problem could be the correlations between the predictions of the LDA, QDA and decision tree models (computed here on the numeric class codes, which is only a rough heuristic):

c(cor(as.numeric(lda.pred), as.numeric(qda.pred)),
  cor(as.numeric(lda.pred), as.numeric(rpart.pred)),
  cor(as.numeric(qda.pred), as.numeric(rpart.pred)))
## [1] 0.7101423 0.6531314 0.8848267
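Since correlating numerically coded class labels is only a rough heuristic, an alternative view is the pairwise agreement rate of the predictions on the testing set:

c(lda_qda   = mean(lda.pred == qda.pred),
  lda_rpart = mean(lda.pred == rpart.pred),
  qda_rpart = mean(qda.pred == rpart.pred))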

As a last resort, let’s try to combine many decision trees into a random forest.

Caret does not optimize the number of trees. In general, more trees do not hurt accuracy; the only real constraint is computing time. In my case, I have to settle for a relatively small number of trees.

set.seed(12345)
rforest.fit <- train(classe ~ ., data=building, method="rf",
                     trControl=trainControl(method="cv"),
                     ntree=100)

mean(predict(rforest.fit, validation) == validation$classe)
## [1] 0.9930592

This is what we’ve been craving!

Analysis of random forest model

The accuracy of the random forest model on the validation set is 99.3059%. This is equivalent to an out-of-sample error of 0.6941%.
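Equivalently, the out-of-sample error can be computed directly:

1 - mean(predict(rforest.fit, validation) == validation$classe)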

To celebrate, we also take a look at the confusion matrix and again marvel at the power of random forests. Only a tiny number of class labels were mispredicted!

confusionMatrix(predict(rforest.fit, validation), validation$classe)$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1635   10    0    0    0
##          B    4 1104    4    1    0
##          C    1    1  996   14    0
##          D    0    0    5  929    0
##          E    1    0    0    0 1058

All jolly and swell, but which features were actually the most important in predicting the class labels, say the top 15?

importance <- varImp(rforest.fit)$importance
head(importance[order(importance$Overall, decreasing = TRUE), , drop = FALSE], 15)
##                        Overall
## roll_belt            100.00000
## pitch_forearm         59.38056
## yaw_belt              55.97338
## pitch_belt            46.44064
## magnet_dumbbell_y     45.09193
## magnet_dumbbell_z     42.55110
## roll_forearm          40.31743
## accel_dumbbell_y      19.59835
## accel_forearm_x       17.98583
## roll_dumbbell         17.63200
## magnet_belt_z         16.86979
## magnet_dumbbell_x     15.68579
## accel_dumbbell_z      15.20799
## total_accel_dumbbell  14.25629
## magnet_belt_y         13.59359

This indicates that the sensor fitted to the arm can be omitted without much loss of predictive power. This makes sense, since the arm does not move much during biceps curls compared to the forearm and the dumbbell.
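To back this up, we can check where the arm-sensor features rank in the importance table. Note that the pattern "_arm" matches only the arm sensor columns, not the forearm ones:

# importance scores of the arm-sensor features (none of them made the top 15)
importance[grep("_arm", rownames(importance)), , drop = FALSE]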

Random forests were clearly the winning model in this analysis. I expect that gradient-boosted trees would perform at least as well, but the accuracy achieved with random forests is impressive nonetheless.