Introduction

Using the training and test datasets provided by the Human Activity Recognition project (http://groupware.les.inf.puc-rio.br/har#weight_lifting_exercises), we will attempt to construct the best-fit model to predict whether a specific exercise was performed correctly (classe == A) or in one of four incorrect manners (classe == B through E). See the Appendix for a full description of the classes.

In addition to finding the most accurate model, we’ll also keep track of processing time for each model because the most accurate model may not be the “best” model for real-life large data sets.

Data Cleanup and Pre-processing

We are interested in the accelerometer data as it relates to whether an exercise was performed correctly. Visual inspection of the data shows that we can remove columns 1 through 7, which contain identifying data, timestamps and other data irrelevant to our question.
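
A minimal sketch of that step, assuming the raw data has already been read into a data frame named training:

# drop columns 1 through 7: row index, user_name, timestamps and window flags
training <- training[, -(1:7)]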

Many columns contain almost no data. If more than 80% of the values in a column are NA, we eliminate that column from the data set.
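
A sketch of that filter (the variable name naFrac is illustrative):

# fraction of NA values in each column
naFrac <- colMeans(is.na(training))
# keep only the columns that are at least 80% populated
training <- training[, naFrac <= 0.8]
dim(training)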

## [1] 19622    53

Next, remove columns with near-zero or zero variance.
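
Using caret's nearZeroVar(), something like:

library(caret)
# indices of predictors with zero or near-zero variance
nzv <- nearZeroVar(training)
if (length(nzv) > 0) training <- training[, -nzv]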

There are still 53 columns left, identical to the previous step, so we can leave out the costly nearZeroVar() function in the future. What remains is data that can be used to classify whether an exercise was performed correctly (class A) or incorrectly (classes B through E).

Modeling and Training Data

Subset Training Data

The dataset is still very large, containing 19622 observations and 53 variables. To speed up initial model testing, we'll create three subsets of the training data, then partition each of those subsets. Once we model and predict on the smaller data sets, we can apply the best model (most accurate relative to processing time) to the full data set.
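
A sketch of one such subset and its partition, using caret (the fold and partition fractions and the strain/stest1 names are assumptions consistent with the code below):

set.seed(3383)
# split the training data into three roughly equal subsets
folds <- createFolds(training$classe, k = 3)
sub1  <- training[folds[[1]], ]
# partition the subset 70/30 into its own training and test sets
inTrain <- createDataPartition(sub1$classe, p = 0.7, list = FALSE)
strain  <- sub1[inTrain, ]
stest1  <- sub1[-inTrain, ]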

As this is a classification problem, we'll be using a variety of "tree" models, including rpart, randomForest and gbm.

Model 1, rpart()

Create the first model fit using recursive partitioning and regression trees, a.k.a. rpart, and predict with that model. rpart is great for creating a decision tree plot (see Appendix fig 1) and is more scalable, but tends to have lower accuracy than the other functions.
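
A sketch of the fit and prediction (the model and prediction names here are illustrative; strain and stest1 are the subset partitions from above):

library(rpart)
set.seed(3383)
# fit a single classification tree on the small training partition
model.fit1 <- rpart(classe ~ ., data = strain, method = "class")
# predict classes on the held-out partition and tabulate the results
pred1 <- predict(model.fit1, stest1, type = "class")
cmp1  <- confusionMatrix(pred1, stest1$classe)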

Below are the accuracy and confusion matrix (rows are predictions, columns are the reference values):

##  Accuracy 
## 0.7188465
|   |   A|   B|   C|   D|   E|
|:--|---:|---:|---:|---:|---:|
|A  | 466|  49|   5|   7|   4|
|B  |  11| 220|  33|  12|  25|
|C  |  14|  24| 229|  43|  44|
|D  |  50|  73|  64| 246|  49|
|E  |  11|  10|   8|  10| 235|

Model 2, randomForest()

randomForest differs from rpart in that it grows many trees, a forest rather than a single tree, and combines their votes to find the best prediction. It can be less scalable in terms of CPU time, but if processing time is not an issue it generally produces the best results.

library(randomForest)

ptm <- proc.time()
set.seed(3383)

# being sneaky here and producing our final training model after
# discovering that this is the best model fit
model.fit <- randomForest(classe ~ ., data = strain,
                          importance = TRUE, allowParallel = TRUE)

pred2 <- predict(model.fit, stest1)
cmp2  <- confusionMatrix(pred2, stest1$classe)

t2 <- cmp2$table
trainall.time <- proc.time() - ptm

Predict on the second model:

##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9788877      0.9732862      0.9714669      0.9848079      0.2842430 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN
cmp2$overall["Accuracy"]

##  Accuracy 
## 0.9788877

kable(t2)
|   |   A|   B|   C|   D|   E|
|:--|---:|---:|---:|---:|---:|
|A  | 549|   7|   0|   1|   0|
|B  |   1| 365|  11|   0|   0|
|C  |   0|   4| 327|  10|   2|
|D  |   0|   0|   1| 307|   2|
|E  |   2|   0|   0|   0| 353|

Model 3, train() method=gbm

randomForest gave us pretty good accuracy; we'll see if we can do better with boosting, using the method **gbm** (for classification boosting) in the train function.
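
A sketch of the call, assuming the 5-fold cross-validation shown in the output below:

set.seed(3383)
# 5-fold cross-validation for tuning the boosting parameters
ctrl <- trainControl(method = "cv", number = 5)
model.fit3 <- train(classe ~ ., data = strain, method = "gbm",
                    trControl = ctrl, verbose = FALSE)
model.fit3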

## Stochastic Gradient Boosting 
## 
## 4537 samples
##   52 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 3629, 3629, 3629, 3631, 3630 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD   
##   1                   50      0.7562297  0.6907052  0.014001115  0.017692966
##   1                  100      0.8249983  0.7783593  0.013139273  0.016482413
##   1                  150      0.8510022  0.8113774  0.012936625  0.016348686
##   2                   50      0.8446087  0.8031443  0.010778871  0.013682292
##   2                  100      0.9014762  0.8752628  0.007972290  0.010065436
##   2                  150      0.9219713  0.9012586  0.012290797  0.015590197
##   3                   50      0.8902355  0.8610216  0.012405872  0.015684179
##   3                  100      0.9325532  0.9146411  0.006508225  0.008276051
##   3                  150      0.9446755  0.9300113  0.005098588  0.006467697
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.

Predict on the third model:
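
Again a sketch, reusing the names from the gbm fit above:

# predict on the held-out partition and extract the accuracy
pred3 <- predict(model.fit3, stest1)
cmp3  <- confusionMatrix(pred3, stest1$classe)
cmp3$overall["Accuracy"]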

##  Accuracy 
## 0.9588054

|   |   A|   B|   C|   D|   E|
|:--|---:|---:|---:|---:|---:|
|A  | 542|   9|   1|   0|   0|
|B  |   4| 357|  18|   1|   3|
|C  |   2|   9| 316|   9|   6|
|D  |   0|   0|   3| 306|   7|
|E  |   4|   1|   1|   2| 341|

Compare all 3 Models for Accuracy and Compare Processing Times

Model 2, Random Forest, has a slightly higher accuracy (0.9788877) than Gradient Boosting (0.9588054), and its computation time is only about 23% of the latter's (25.02 s vs. 108.40 s elapsed), so we'll use this model.

The estimated out-of-sample error rate should therefore be about 0.0211123 (1 minus the held-out accuracy).
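
A one-liner for that estimate, reusing the Model 2 confusion matrix:

# estimated out-of-sample error: 1 minus the held-out accuracy
unname(1 - cmp2$overall["Accuracy"])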

Processing times are in seconds, as reported by proc.time().

|model             |  Accuracy| user.self| sys.self| elapsed|
|:-----------------|---------:|---------:|--------:|-------:|
|rpart             | 0.7188465|      1.06|     0.00|    1.06|
|gradient boosting | 0.9588054|    108.10|     0.25|  108.40|
|random forest     | 0.9788877|     24.72|     0.20|   25.02|

Applying Model 2, trained on the full training set, to the testing set, we get the following predictions:
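
A sketch of that final step (testing is assumed to be the 20-case quiz data set; model.fit is the random forest from Model 2):

# predict the 20 quiz cases with the random forest model
finalPrediction <- predict(model.fit, testing)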

Final Prediction for Quiz

print(finalPrediction[1:20])
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Appendix

fig 1. Recursive Partitioning (rpart) Tree Plot & Prediction Plot Model 1

fig 2. Random Forest (randomForest) Error Plot & Prediction Plot Model 2

fig 3. Model 2 Random Forest Variable Importance

Source:

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.