Eduardo Velloso et al., in their paper "Qualitative Activity Recognition of Weight Lifting Exercises," demonstrate sensor- and model-based approaches to assessing the quality of execution of a weight lifting exercise and providing feedback to the user. For their experiment, they had six male participants perform a biceps curl in five ways: perfectly (A), throwing the elbows to the front (B), lifting the dumbbell only halfway (C), lowering the dumbbell only halfway (D), and throwing the hips forward (E). Inertial Measurement Units, which provide acceleration, gyroscope, and magnetometer data in all three axes, were placed in three spots on the user (glove, upper arm, lumbar region) and in one spot on the dumbbell. Velloso et al. used the Random Forest algorithm to classify the data, and the point of my project was to see whether I could apply Random Forest techniques and produce similar results.
The dataset had to be cleaned up before I could perform an analysis. There were, in a sense, two types of records, distinguished by whether or not the "new window" flag was set. Records with the flag set contained extra summary data, e.g. kurtosis, skewness, standard deviation, and variance. One option was to remove those records (which comprised only about 2% of the data), but I chose to leave them in and remove the extra columns instead. Those columns would have had to be removed even if I had deleted the "new window" records, since they contain no data for the "no new window" records, and by keeping the records I was left with a slightly larger dataset. I divided this dataset into a training set (70% of the original) and a test set (30%) using the createDataPartition function in the caret package.
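A minimal sketch of the cleaning and partitioning steps, assuming the standard pml-training.csv file; the file name, the seed, and the 90% emptiness threshold used to identify the summary columns are my assumptions, not the exact code used:

```r
library(caret)

# Read the data, treating blank strings as missing; the summary columns
# (kurtosis, skewness, etc.) are blank on all "no new window" rows.
pml <- read.csv("pml-training.csv", na.strings = c("NA", ""))

# Drop columns that are almost entirely missing (the summary columns),
# keeping all records, including the "new window" ones.
mostly_empty <- sapply(pml, function(x) mean(is.na(x)) > 0.9)
pml <- pml[, !mostly_empty]

# 70/30 split into training and test sets.
set.seed(1234)
inTrain  <- createDataPartition(pml$classe, p = 0.7, list = FALSE)
training <- pml[inTrain, ]
testing  <- pml[-inTrain, ]
```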
With a clean dataset, I started by fitting a single tree using the tree package and using it to predict the test set. Next, I used cross-validation to determine the optimal tree size and pruned the tree accordingly; the pruned tree was then also used to predict the test set.
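A sketch of the single-tree fit and prediction, assuming the training and testing objects from the partition above; the row-normalised table is the form in which the probability tables below are reported:

```r
library(tree)

# Fit a single classification tree on the training set.
single_tree <- tree(classe ~ ., data = training)

# Predict the test set and tabulate the proportions of predicted
# classes within each observed class (rows sum to 1).
tree_pred <- predict(single_tree, newdata = testing, type = "class")
round(prop.table(table(observed = testing$classe, predict = tree_pred),
                 margin = 1), 2)
```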
To build a more accurate model, I used the Random Forest algorithm as implemented in the randomForest package, and used the result to predict the test set.
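The model call echoed in the output further below suggests the fit looked roughly like this; the seed is my assumption, and I spell out the proximity and importance arguments that the echoed call abbreviates:

```r
library(randomForest)

# Fit the forest; ntree = 200 reflects the final model reported below
# (the initial run used 500 trees).
set.seed(1234)
rf_model <- randomForest(classe ~ ., data = training,
                         proximity = TRUE, ntree = 200, importance = TRUE)

# Predict the held-out test set.
rf_pred <- predict(rf_model, newdata = testing)
```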
The single tree had many branches (Fig. 1) and seemed to inherently overfit the data. I include the tree only to show its structure and made no attempt to make its text readable.
Figure 1. Single tree produced from training data.
Using the single-tree model to predict the test set produced the following probability table:
## predict
## observed A B C D E
## A 0.74 0.01 0.05 0.18 0.01
## B 0.21 0.31 0.08 0.18 0.22
## C 0.15 0.16 0.51 0.09 0.09
## D 0.21 0.22 0.02 0.38 0.17
## E 0.15 0.03 0.08 0.06 0.68
The results are quite a bit better at predicting the class (A-E) than flipping a coin, but they are not great. Could pruning the tree help? I used cross-validation (via the cv.tree function in the tree package) to determine the optimal tree size. Figure 2 shows the cross-validation results and indicates that 8 or 9 terminal nodes produce the least deviance. I chose to prune the tree to 8 terminal nodes (simpler is generally preferable) and used the resulting tree (Fig. 3) to predict the test set.
Figure 2. Deviance vs Number of Terminal Nodes as determined by cross-validation
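The cross-validation and pruning might be sketched as follows, assuming the single_tree object fitted earlier (the seed is my assumption):

```r
# Cross-validate on misclassification error to find the best tree size;
# cv_result$size and cv_result$dev give the data plotted in Figure 2.
set.seed(1234)
cv_result <- cv.tree(single_tree, FUN = prune.misclass)
plot(cv_result$size, cv_result$dev, type = "b",
     xlab = "Number of terminal nodes", ylab = "Deviance")

# Prune to 8 terminal nodes and predict the test set with the result.
pruned_tree <- prune.misclass(single_tree, best = 8)
pruned_pred <- predict(pruned_tree, newdata = testing, type = "class")
```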
Figure 3. Pruned tree without and with labels
The resulting probability table shows that the pruned tree is not better (it is worse in many ways), but it is possible that it will perform better on a novel dataset, i.e. it may generalize better than the original tree.
## predict
## observed A B C D E
## A 0.65 0.00 0.28 0.07 0.00
## B 0.10 0.23 0.48 0.19 0.00
## C 0.01 0.02 0.87 0.09 0.00
## D 0.04 0.11 0.55 0.30 0.00
## E 0.00 0.01 0.37 0.18 0.44
In building the Random Forest model, I originally used 500 trees, but subsequent analysis showed that far fewer were needed. The best way to see this is to plot the error rate against the number of trees: Figure 4 shows that 100 trees produce a nearly optimal forest; indeed, 25 trees do most of the work.
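A plot like Figure 4 can be produced directly from the fitted model, since plot.randomForest charts the OOB and per-class error rates against the number of trees (the title is my addition):

```r
# Error rate (OOB and per-class) as a function of the number of trees.
plot(rf_model, main = "Error Rate vs Number of Trees")
```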
Figure 4. Error Rate vs Number of Trees in Random Forest Model
Using 200 trees, the OOB error rate was 0.21%:
##
## Call:
## randomForest(formula = classe ~ ., data = training, prox = TRUE, ntree = 200, importance = T)
## Type of random forest: classification
## Number of trees: 200
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.21%
## Confusion matrix:
## A B C D E class.error
## A 3905 1 0 0 0 0.000256
## B 6 2651 1 0 0 0.002634
## C 0 5 2389 2 0 0.002922
## D 0 0 10 2239 3 0.005773
## E 0 0 0 1 2524 0.000396
I used the resulting model to predict the test set. The probability table shows that the model predicted it perfectly. Of course, the real test is on a novel dataset.
## predict
## observed A B C D E
## A 1 0 0 0 0
## B 0 1 0 0 0
## C 0 0 1 0 0
## D 0 0 0 1 0
## E 0 0 0 0 1
Cross-validation is effectively built into the random forest procedure. As Leo Breiman, who introduced and developed random forests, says on his website (http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm), "[an] unbiased estimate of the test set error . . . is estimated internally, during the run, as follows: Each tree is constructed using a different bootstrap sample from the original data. About one-third of the cases are left out of the bootstrap sample and not used in the construction of the kth tree." The randomForest function uses this feature of the algorithm to calculate the OOB (out-of-bag) error rate, which serves as an estimate of the out-of-sample error.
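Because this estimate is computed during fitting, it can be read directly off the model object; a sketch, assuming the rf_model object from above:

```r
# err.rate has one row per tree; the "OOB" column of the final row is
# the cumulative out-of-bag error estimate reported in the model summary.
rf_model$err.rate[rf_model$ntree, "OOB"]
```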
It is certainly possible that I could have made a simpler model by removing some of the variables. Figure 5 shows the most important variables ranked by mean decrease in accuracy and by mean decrease in Gini impurity.
Figure 5. Plot of Variable Importance
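A plot like Figure 5, along with the underlying numbers, is available from the fitted forest because the model was built with importance = TRUE:

```r
# Numeric importance measures (mean decrease in accuracy and in Gini).
importance(rf_model)

# Dot charts of the top variables by both measures, as in Figure 5.
varImpPlot(rf_model, main = "Variable Importance")
```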