Machine Learning

Summary

The goal of this project is to use data from wearable devices to build a model that tells whether an exercise was performed correctly or incorrectly. Six participants performed one specific exercise, barbell lifts, in 5 different ways, both correctly and incorrectly. The report below covers data processing, exploratory analysis, model selection, model validation, estimation of the out-of-sample error, and testing the model on a separate set of data.

Data processing

The data consist primarily of readings from four accelerometers located on the belt, forearm, arm, and dumbbell. The dataset, however, contains 160 variables, so the first step is to identify which variables to include in the model.

A literature review (http://groupware.les.inf.puc-rio.br/work.jsf?p1=10335) suggested that averages and standard deviations would be the most helpful predictors. We therefore try to replicate these findings by working with a subset of the data that contains summaries of the observations in the form of averages and standard deviations, and compare it with subsets containing other variables.

Random forest model with 3-fold cross-validation

## Random Forest 
## 
## 246 samples
##  24 predictor
##   5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 164, 164, 164 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa
##    2    0.703     0.624
##   13    0.683     0.600
##   24    0.675     0.590
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  A  B  C  D  E
##          A 36  7  3  4  0
##          B  3 21  1  0  3
##          C  3  1 23  3  1
##          D  0  1  1 20  2
##          E  1  1  0  0 25
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7812          
##                  95% CI : (0.7091, 0.8427)
##     No Information Rate : 0.2688          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7232          
##  Mcnemar's Test P-Value : 0.2469          
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8372   0.6774   0.8214   0.7407   0.8065
## Specificity            0.8803   0.9457   0.9394   0.9699   0.9845
## Pos Pred Value         0.7200   0.7500   0.7419   0.8333   0.9259
## Neg Pred Value         0.9364   0.9242   0.9612   0.9485   0.9549
## Prevalence             0.2687   0.1938   0.1750   0.1688   0.1938
## Detection Rate         0.2250   0.1313   0.1437   0.1250   0.1562
## Detection Prevalence   0.3125   0.1750   0.1938   0.1500   0.1688
## Balanced Accuracy      0.8588   0.8116   0.8804   0.8553   0.8955

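Training the random forest model from the caret package, we get an accuracy of 0.78125 on the 160 observations in the confusion matrix above (125 classified correctly), which is an error of about 22%. The 3-fold cross-validated accuracy of 0.703 is more pessimistic and puts the out-of-sample error at around 30%.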
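
The accuracy and kappa values reported above can be re-derived from the confusion matrix alone. The following Python sketch only redoes that arithmetic (the analysis itself was run with R's caret package); the function names `accuracy` and `cohen_kappa` are illustrative and do not come from any library.

```python
# Confusion matrix copied from the random forest output above:
# rows are predictions, columns are reference classes A..E.
RF_MATRIX = [
    [36, 7, 3, 4, 0],
    [3, 21, 1, 0, 3],
    [3, 1, 23, 3, 1],
    [0, 1, 1, 20, 2],
    [1, 1, 0, 0, 25],
]


def accuracy(matrix):
    """Share of observations on the diagonal (classified correctly)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def cohen_kappa(matrix):
    """Agreement beyond chance: (p_observed - p_expected) / (1 - p_expected)."""
    total = sum(sum(row) for row in matrix)
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(col) for col in zip(*matrix)]
    p_observed = accuracy(matrix)
    p_expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total**2
    return (p_observed - p_expected) / (1 - p_expected)


rf_accuracy = accuracy(RF_MATRIX)
rf_kappa = cohen_kappa(RF_MATRIX)
print(f"accuracy = {rf_accuracy:.4f}, error = {1 - rf_accuracy:.4f}, kappa = {rf_kappa:.4f}")
```

The diagonal holds 125 of the 160 observations, so the error on this set is 1 - 125/160, noticeably lower than the error implied by the cross-validated accuracy of 0.703.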

However, one problem with this approach is that the data provided for testing contain no values for this set of predictors, so a model built on them cannot be applied to that test set.

Alternatives

In the following sections we consider alternative methods and alternative sets of predictors.

Alternative models used with the same variables

We can see if another method does better.

##        
## predlda  A  B  C  D  E
##       A 24  8  4  5  1
##       B  8 11  0  2  3
##       C  6  2 21  3  2
##       D  1  7  3 14  4
##       E  4  3  0  3 21

Training the “lda” model from the caret package, we get an accuracy of 0.56875.

##        
## predgbm  A  B  C  D  E
##       A 35 11  3  2  0
##       B  5 15  3  1  3
##       C  2  1 21  4  2
##       D  0  2  1 17  2
##       E  1  2  0  3 24

##          
## predrpart  A  B  C  D  E
##         A 24  7  2  4  0
##         B 10 14  6  1  2
##         C  6  2 10  3  2
##         D  2  7 10 17  2
##         E  1  1  0  2 25

Using the “gbm” method, we get an accuracy of 0.7; the “rpart” method yields an accuracy of 0.5625. Combining these models seems to reduce accuracy rather than improve it.
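
To compare the alternative methods side by side, the accuracies can be recomputed from the three confusion matrices printed above. This is a minimal Python sketch of that comparison, not the original R code; the dictionary `MATRICES` and the helper `accuracy` are names chosen here for illustration.

```python
# Confusion matrices copied from the outputs above
# (rows: predictions, columns: reference classes A..E).
MATRICES = {
    "lda": [
        [24, 8, 4, 5, 1],
        [8, 11, 0, 2, 3],
        [6, 2, 21, 3, 2],
        [1, 7, 3, 14, 4],
        [4, 3, 0, 3, 21],
    ],
    "gbm": [
        [35, 11, 3, 2, 0],
        [5, 15, 3, 1, 3],
        [2, 1, 21, 4, 2],
        [0, 2, 1, 17, 2],
        [1, 2, 0, 3, 24],
    ],
    "rpart": [
        [24, 7, 2, 4, 0],
        [10, 14, 6, 1, 2],
        [6, 2, 10, 3, 2],
        [2, 7, 10, 17, 2],
        [1, 1, 0, 2, 25],
    ],
}


def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all observations."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


accuracies = {name: accuracy(matrix) for name, matrix in MATRICES.items()}

# Print the methods from most to least accurate.
for name, value in sorted(accuracies.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:5s} {value:.5f}")
```

All three matrices cover the same 160 observations, so the accuracies are directly comparable with each other and with the random forest result.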

Using a larger and different set of variables

It is possible that we are discarding too much of the data, so we tried using more predictors.

Using the “lda” method from the caret package with more variables does not change the reported accuracy of predictions, which stays at 0.56875. Likewise, the reported accuracy for the “rpart” method stays at 0.5625. The “gbm” and random forest methods become slower with a larger subset of the data. If computational time is not an issue, it is possible to achieve much greater accuracy by using all of the predictors that have no missing values and dropping the first seven columns: observation number, person’s name, time info, more time info, etc.
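
The column selection described above (drop the first seven bookkeeping columns, then keep only predictors without missing values) can be sketched as follows. This is a standard-library Python illustration under stated assumptions, not the R code used for the report: the function name `select_complete_columns`, the encoding of missing values as `""` or `"NA"`, and the toy table are all hypothetical.

```python
# Assumption: missing values appear as empty strings or the literal "NA".
MISSING = {"", "NA"}


def select_complete_columns(header, rows, n_leading=7):
    """Drop the first n_leading columns, then keep columns with no missing values."""
    keep = [
        j
        for j in range(n_leading, len(header))
        if all(row[j] not in MISSING for row in rows)
    ]
    kept_header = [header[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_header, kept_rows


# Hypothetical toy table: the first seven names are placeholders for the
# bookkeeping columns, not the real column names of the dataset.
header = ["id", "name", "t1", "t2", "t3", "w1", "w2",
          "roll_belt", "avg_roll_belt", "classe"]
rows = [
    ["1", "p1", "0", "0", "0", "no", "1", "1.41", "NA", "A"],
    ["2", "p2", "0", "0", "0", "yes", "1", "1.45", "1.43", "B"],
]

kept_header, kept_rows = select_complete_columns(header, rows)
print(kept_header)
```

In the toy table the summary column `avg_roll_belt` is dropped because one row lacks a value, while the raw reading and the class label are kept.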

Below are the results of the calculations that take longer but yield much higher classification accuracy.

##       
## predrf    A    B    C    D    E
##      A 2229   11    0    0    0
##      B    3 1505   16    0    0
##      C    0    2 1350   22    0
##      D    0    0    2 1263    3
##      E    0    0    0    1 1439

The classes predicted for the provided test dataset are B, A, B, A, A, E, D, B, A, A, B, C, B, A, E, E, A, B, B, B. Based on the confusion matrix above (7786 of 7846 observations classified correctly), the model should get only about 1% of its predictions wrong.
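
The error estimate of about 1% can be checked against the confusion matrix above and translated into what to expect for the 20 test cases. The Python sketch below is only a worked calculation; it assumes that the test cases are independent and that each is misclassified with the same probability as in the confusion matrix, which is an assumption made here and not a result from the report.

```python
# Random forest confusion matrix copied from the output above
# (rows: predictions, columns: reference classes A..E).
RF_FULL_MATRIX = [
    [2229, 11, 0, 0, 0],
    [3, 1505, 16, 0, 0],
    [0, 2, 1350, 22, 0],
    [0, 0, 2, 1263, 3],
    [0, 0, 0, 1, 1439],
]
N_TEST_CASES = 20  # number of predicted classes listed above

total = sum(sum(row) for row in RF_FULL_MATRIX)
correct = sum(RF_FULL_MATRIX[i][i] for i in range(len(RF_FULL_MATRIX)))
error_rate = 1 - correct / total

# Assumption: independent test cases with a common error rate.
expected_errors = N_TEST_CASES * error_rate
p_all_correct = (1 - error_rate) ** N_TEST_CASES

print(f"error rate      = {error_rate:.4f}")
print(f"expected errors = {expected_errors:.2f} out of {N_TEST_CASES}")
print(f"P(all correct)  = {p_all_correct:.3f}")
```

Under these assumptions the expected number of wrong predictions among the 20 cases is well below one.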

Conclusion

It seems that the best approach is to use the random forest method with a subset of the data containing the average and standard deviation variables, that is, the predictors listed below.

List of variables in the preferred model
avg_pitch_arm
avg_pitch_belt
avg_pitch_dumbbell
avg_pitch_forearm
avg_roll_arm
avg_roll_belt
avg_roll_dumbbell
avg_roll_forearm
avg_yaw_arm
avg_yaw_belt
avg_yaw_dumbbell
avg_yaw_forearm
stddev_pitch_arm
stddev_pitch_belt
stddev_pitch_dumbbell
stddev_pitch_forearm
stddev_roll_arm
stddev_roll_belt
stddev_roll_dumbbell
stddev_roll_forearm
stddev_yaw_arm
stddev_yaw_belt
stddev_yaw_dumbbell
stddev_yaw_forearm
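
The 24 names above follow a regular pattern: a statistic, an angle, and a sensor location joined by underscores. As a small illustration in Python (not part of the original R analysis; the constant names are chosen here), the list can be generated instead of typed out.

```python
from itertools import product

STATISTICS = ["avg", "stddev"]
ANGLES = ["pitch", "roll", "yaw"]
SENSORS = ["arm", "belt", "dumbbell", "forearm"]

# Every combination of statistic, angle and sensor, in the order listed above.
predictors = [
    f"{statistic}_{angle}_{sensor}"
    for statistic, angle, sensor in product(STATISTICS, ANGLES, SENSORS)
]

print(len(predictors))
print(predictors[:4])
```

The count of 2 x 3 x 4 = 24 agrees with the 24 predictors reported in the random forest output.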

The calculations are quick because the dataset is much smaller: only key statistics (mean and standard deviation) describing the observations are used. However, if computational time is not an issue, greater accuracy can be achieved by using the ‘raw’ accelerometer data with the random forest method of the caret package.