Executive summary

A random forest model has been created to predict the quality of a bicep curl based on sensor data. This model was tuned using a subset of 12 features and is expected to have an out of sample error of 2.7%.

The ease with which a highly accurate model can be fitted to this data suggests overfitting. In this case it is easy to match a specific observation because there are around 60 similar observations from the same instance of a bicep curl. An alternative approach would be to select a smaller number of measurements that would be expected to differ for the skeletal movement involved in each type of bicep curl.

The original study that generated this data is Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements by Ugulino, Cardador, Vega, Velloso, Milidiú, and Fuks. The study categorised bicep curls into the following categories:

  • Class A: exactly according to the specification
  • Class B: throwing the elbows to the front
  • Class C: lifting the dumbbell only halfway
  • Class D: lowering the dumbbell only halfway
  • Class E: throwing the hips to the front

Data preparation

The project supplied a large training set and a small set of 20 records to be predicted, which will be used as follows: the training set is split 90/10 into training and validation subsets (the validation subset is used under Model analysis), and the 20 records are held back for the final prediction.
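A sketch of the split, assuming the supplied training data is loaded as bc.raw (a hypothetical name) with outcome column classe; the 90/10 proportion matches the validation set described under Model analysis:

library(caret, quietly = TRUE)
set.seed(125)                                   # arbitrary seed, for reproducibility
# Hold back 10% of the supplied training data as a validation set
in.train      <- createDataPartition(bc.raw$classe, p = 0.9, list = FALSE)
bc.train      <- bc.raw[in.train, ]
bc.validation <- bc.raw[-in.train, ]
classe.train      <- bc.train$classe
classe.validation <- bc.validation$classe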

Initial exploration of the data shows that the study derived many factors for which 98% of the values are NA or blank. These factors can be removed from the model, along with all metadata, leaving 52 factors for modelling.
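A sketch of that cleaning step, assuming bc.train and bc.validation as above; the metadata column names are those of the published dataset:

# Metadata columns: row id, participant, timestamps and window markers
meta.cols <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
               "cvtd_timestamp", "new_window", "num_window")
# Fraction of values that are NA or blank in each column of the training subset
na.frac <- sapply(bc.train, function(x) mean(is.na(x) | x == ""))
keep    <- setdiff(names(bc.train)[na.frac < 0.98], meta.cols)
bc.train      <- bc.train[, keep]
bc.validation <- bc.validation[, keep]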

Exploratory data analysis

Ranges of values for each participant from the belt sensor

The following plots show that some belt sensor measurements take very different ranges of values for different participants. This could be because the sensor was fitted in a different way for each participant, or even upside down. A few measurements from the other sensors showed similar issues, but to a smaller extent. Whilst it might be possible to normalise each participant's data so that it can be combined, in this case we will omit the belt sensor observations from the dataset, as sketched below.
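A sketch of that omission, assuming the belt columns can be identified by "_belt" in their names, as in the published dataset:

# Drop every measurement from the belt sensor
belt.cols     <- grep("_belt", names(bc.train), value = TRUE)
bc.train      <- bc.train[, setdiff(names(bc.train), belt.cols)]
bc.validation <- bc.validation[, setdiff(names(bc.validation), belt.cols)]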

Ranges of values for each classification

The following plots show the variability by classification for the six factors with the greatest importance in the final model. They show visible differences between classifications, but a combination of factors will be needed for an accurate prediction.
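A sketch of how such plots might be produced, assuming bc.train still carries the classe outcome; the six factors named are those listed under Reducing overfitting:

library(ggplot2)
library(tidyr)
top.factors <- c("magnet_dumbbell_z", "pitch_forearm", "magnet_dumbbell_y",
                 "roll_forearm", "magnet_dumbbell_x", "roll_arm")
plot.data <- pivot_longer(bc.train[, c(top.factors, "classe")],
                          cols = all_of(top.factors),
                          names_to = "factor", values_to = "value")
# One boxplot panel per factor, split by classification A-E
ggplot(plot.data, aes(x = classe, y = value)) +
  geom_boxplot() +
  facet_wrap(~ factor, scales = "free_y")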

Model selection

Algorithm selection

Candidate machine learning algorithms were compared with caret, including random forest, random forest with principal component analysis (PCA), and neural networks. Random forest was the most accurate, with a suspiciously high accuracy of 98%.
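A sketch of that comparison, assuming bc.train and classe.train from the data preparation step; "rf" and "nnet" are standard caret method names, and a small explicit cross-validation keeps the three fits comparable and fast:

library(caret, quietly = TRUE)
set.seed(848)
bc.features <- subset(bc.train, select = -classe)     # predictors only
ctrl <- trainControl(method = "cv", number = 5)       # 5-fold CV for a quick comparison
fit.rf     <- train(bc.features, classe.train, method = "rf",  trControl = ctrl)
fit.rf.pca <- train(bc.features, classe.train, method = "rf",  trControl = ctrl,
                    preProcess = "pca")
fit.nnet   <- train(bc.features, classe.train, method = "nnet", trControl = ctrl,
                    preProcess = c("center", "scale"), trace = FALSE)
# Best resampled accuracy for each candidate
sapply(list(rf = fit.rf, rf.pca = fit.rf.pca, nnet = fit.nnet),
       function(f) max(f$results$Accuracy))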

Nodesize tuning

The algorithm was tuned for nodesize. Performance was slow using the default settings for caret, since a nodesize of 1 generates large trees and resampling is repeated on 25 bootstrap samples by default. In this case randomForest() was called directly. In future analyses I would use the caret package with train(method = "rf", nodesize = 10) and trainControl(method = "none"). Investigation into nodesize, sketched below, illustrated the trade-off between performance and accuracy as nodesize approaches 1. nodesize = 1 will be used in the final model.
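A sketch of that investigation, assuming bc.features and classe.train as above; timing and out-of-bag (OOB) error are taken directly from each randomForest fit:

library(randomForest)
set.seed(848)
for (ns in c(50, 25, 10, 5, 1)) {
  elapsed <- system.time(
    rf.fit <- randomForest(bc.features, classe.train, nodesize = ns)
  )["elapsed"]
  # The OOB error of the full forest is the last row of err.rate
  oob <- rf.fit$err.rate[nrow(rf.fit$err.rate), "OOB"]
  cat(sprintf("nodesize = %2d: %5.1f s, OOB error = %.4f\n", ns, elapsed, oob))
}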

mtry tuning

The algorithm was tuned for mtry. Values of mtry from 1 to 15 were tried, and mtry = 9 was found to be the optimum. The code is shown below for information but is not evaluated, since it requires the training data to be further split to create an additional test set. In future analyses I would use the caret package with train(method = "rf", tuneGrid = data.frame(mtry = 1:15)).

library(caret, quietly = TRUE)
library(randomForest)

set.seed(848)
m.comparison <- data.frame()
for (m in 1:15) {
  # Fit a forest for each candidate mtry on the further-split training set,
  # then score it against the additional held-out test set
  rfm.fit <- randomForest(bc.train2, classe.train, mtry = m, nodesize = 10)
  rfm.predict <- predict(rfm.fit, subset.test)
  m.comparison <- rbind(m.comparison,
                        c(m, confusionMatrix(rfm.predict, classe.test)$overall))
}
names(m.comparison) <- c("mtry", "Accuracy", "Kappa", "AccuracyLower",
                         "AccuracyUpper", "AccuracyNull", "AccuracyPValue",
                         "McnemarPValue")
m.comparison

Principal component analysis

Given the relationship between the sensor data in different dimensions, principal component analysis (PCA) appears to be a valuable approach. Investigation showed that the default variance threshold of 95% reduces the number of factors from 52 to 25 while achieving similar levels of accuracy. Since there is no significant improvement in accuracy, PCA will not be used.
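A sketch of that check, assuming bc.features as above; thresh = 0.95 is in fact preProcess's default:

library(caret, quietly = TRUE)
# PCA components retaining 95% of the variance (the preProcess default)
pca.pre <- preProcess(bc.features, method = "pca", thresh = 0.95)
pca.pre$numComp              # number of components retained (25 here)
bc.pca  <- predict(pca.pre, bc.features)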

Reducing overfitting

The overfitting caused by using too many factors can be addressed by reducing the number of factors used in the model. A similar analysis to the tuning of mtry, sketched after Figure 3, shows that the 12 most important factors should achieve around 95% accuracy.

Figure 3: Accuracy with different numbers of features
features  Accuracy  Kappa  AccuracyLower  AccuracyUpper  AccuracyNull  AccuracyPValue
      30     0.982  0.977          0.978          0.985         0.284               0
      25     0.979  0.974          0.975          0.983         0.284               0
      20     0.976  0.970          0.972          0.980         0.284               0
      15     0.964  0.955          0.959          0.969         0.284               0
      12     0.955  0.943          0.949          0.960         0.284               0
      10     0.935  0.918          0.929          0.941         0.284               0
       5     0.850  0.811          0.841          0.860         0.284               0
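A sketch of that analysis, assuming a full-feature fit rf.full (a hypothetical name) and the same further-split bc.train2/subset.test objects used for mtry tuning:

library(caret, quietly = TRUE)
library(randomForest)
# Rank factors by mean decrease in Gini impurity, then refit at each cut-off
imp    <- importance(rf.full)
ranked <- rownames(imp)[order(imp[, "MeanDecreaseGini"], decreasing = TRUE)]
for (n in c(30, 25, 20, 15, 12, 10, 5)) {
  rf.n <- randomForest(bc.train2[, ranked[1:n]], classe.train,
                       mtry = min(9, n), nodesize = 10)
  acc  <- confusionMatrix(predict(rf.n, subset.test[, ranked[1:n]]),
                          classe.test)$overall["Accuracy"]
  cat(n, "factors: accuracy =", round(acc, 3), "\n")
}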

The following factors will be used in the final model.

  • magnet_dumbbell_z
  • pitch_forearm
  • magnet_dumbbell_y
  • roll_forearm
  • magnet_dumbbell_x
  • roll_arm
  • accel_dumbbell_y
  • accel_dumbbell_z
  • roll_dumbbell
  • accel_forearm_x
  • gyros_dumbbell_y
  • magnet_forearm_z

Final model

The selected model is a random forest with mtry = 9 and nodesize = 1 (the randomForest default). It is built using caret, with bootstrap resampling over 25 samples (the caret default, made explicit below).

library(caret, quietly = TRUE) 
set.seed(29)
feature.list <- c("magnet_dumbbell_z",
                  "pitch_forearm",
                  "magnet_dumbbell_y",
                  "roll_forearm",
                  "magnet_dumbbell_x",
                  "roll_arm",
                  "accel_dumbbell_y",
                  "accel_dumbbell_z",
                  "roll_dumbbell",
                  "accel_forearm_x",
                  "gyros_dumbbell_y",
                  "magnet_forearm_z")

# caret defaults to bootstrap resampling over 25 samples; made explicit here
ctrl <- trainControl(method = "boot", number = 25)
# Cache the fitted model on disk to avoid refitting on every run
if (file.exists("bcFinal.rds")) {
  bc.fit <- readRDS("bcFinal.rds")
} else {
  bc.fit <- train(subset(bc.train, select = feature.list),
                  classe.train,
                  method = "rf",
                  trControl = ctrl,
                  tuneGrid = data.frame(mtry = 9))
  saveRDS(bc.fit, file = "bcFinal.rds")
}

# Predict the outcomes for the validation set using our final model
suppressMessages(bc.predict <- predict(bc.fit, subset(bc.validation, select = feature.list)))
bc.results <- confusionMatrix(bc.predict, classe.validation)

Model analysis

Validation was carried out using a simple holdout set containing 10% of the original training data. The estimated out of sample accuracy is 0.973, with a 95% confidence interval of 0.965 to 0.980.
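The estimate and interval are read straight from the confusionMatrix result computed in the previous section:

bc.results$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")]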

This model is a suspiciously good fit, which is a strong indicator of overfitting. The 19622 observations are split across only 6 participants, 5 variants, and 10 repetitions, so there are roughly 60 observations for every individual bicep curl. When predicting one of these observations, it can easily be matched to the other observations from the same repetition via the noise peculiar to that repetition. The model would be less accurate at predicting observations from new participants, or even from additional bicep curls by the same participants.

One way of making the model more generalisable would be to refine the selection of factors, ideally based on an understanding of the human musculoskeletal system. It may also be necessary to build additional factors derived from a full bicep curl, rather than from a single observation of sensor readings, as sketched below.
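A sketch of that idea; the rep_id column identifying each full curl is hypothetical, since the published data only carries sliding-window markers:

library(dplyr)
# Hypothetical: collapse every full bicep curl into one row of derived features
curl.features <- bc.train %>%
  group_by(rep_id, classe) %>%                 # rep_id: hypothetical repetition id
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)),
            .groups = "drop")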

The following figures show the confusion matrix and variable importance for the final model.

Figure 4: Confusion matrix for the final model
           Reference
Prediction    A    B    C    D    E
         A  554    7    0    0    1
         B    2  365    3    0    3
         C    0    5  334   14    2
         D    2    0    4  302    2
         E    0    2    1    5  352