After downloading the data and cleaning it to keep only the acceleration predictors, two models, a random forest and a generalized boosted regression model, were trained on 60% of the training set. Both models were then evaluated on the remaining 40% to estimate their accuracy. The better-performing model, the random forest, was then applied to the provided testing set. The results are given below.
Six participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal of the project is to use data from accelerometers on the belt, forearm, arm, and dumbbell to predict how the barbell lift was performed, as recorded in the classe variable. The report describes how the model was built, including how missing data were handled and how the modeling method was chosen.
The data are downloaded from the given URLs.
The data are then read into R for processing.
training <- read.csv("trainingdata.csv", stringsAsFactors = FALSE)
testing <- read.csv("testingdata.csv", stringsAsFactors = FALSE)
The training data set is large enough to split into two pieces: 60% to train the models and the remaining 40% to test them for accuracy.
library(caret)  # provides createDataPartition()
set.seed(23846)
inTrain <- createDataPartition(y = training$classe, p = 0.6, list = FALSE)
trainset <- training[inTrain, ]
testset <- training[-inTrain, ]
Select only predictors that indicate acceleration. After working with the first selection, it becomes clear that predictors that measure the variance of the acceleration need to be excluded.
dim(trainset)
[1] 11776 17
This leaves us with 16 predictors plus the outcome, classe.
names(trainset)
[1] "classe" "total_accel_belt"
[3] "accel_belt_x" "accel_belt_y"
[5] "accel_belt_z" "total_accel_arm"
[7] "accel_arm_x" "accel_arm_y"
[9] "accel_arm_z" "total_accel_dumbbell"
[11] "accel_dumbbell_x" "accel_dumbbell_y"
[13] "accel_dumbbell_z" "total_accel_forearm"
[15] "accel_forearm_x" "accel_forearm_y"
[17] "accel_forearm_z"
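The report does not show the code that produced this selection. A minimal sketch of one way to do it follows, assuming the variance summaries are the columns whose names start with "var_"; the toy data frame and its column names are illustrative stand-ins, not the real data.

```r
# Toy data frame standing in for the raw training data (illustrative columns only).
raw <- data.frame(
  classe = c("A", "B", "C"),
  roll_belt = c(1.4, 1.5, 1.6),
  total_accel_belt = c(3, 4, 3),
  var_total_accel_belt = c(NA, 0.2, NA),
  accel_belt_x = c(-21, -22, -20),
  var_accel_arm = c(NA, 0.1, NA),
  accel_arm_x = c(-288, -290, -289)
)

# Flag columns whose name mentions acceleration, and the variance summaries
# (assumed here to be the names starting with "var_").
is_accel <- grepl("accel", names(raw))
is_variance <- grepl("^var_", names(raw))

# Keep the outcome plus the acceleration columns that are not variance summaries.
keep <- names(raw) == "classe" | (is_accel & !is_variance)
selected <- raw[, keep]

print(names(selected))
```

Matching on column names keeps the selection reproducible if the column order of the input file changes.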
Remove predictors with near-zero variance (this removes nothing here: the dimensions are unchanged).
nsv <- nearZeroVar(trainset, saveMetrics = TRUE)
trainset <- trainset[, !nsv$nzv]
dim(trainset)
[1] 11776 17
Remove predictors with missing values (this also removes nothing: the dimensions are unchanged).
dim(trainset)
[1] 11776 17
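The code for the missing-value step is not shown in the report. One common way to drop columns that contain any NA is sketched below on a small hypothetical data frame; whether the original used this exact approach is an assumption.

```r
# Toy data frame (hypothetical values); one column contains a missing value.
toy <- data.frame(
  classe = c("A", "B", "C", "D"),
  accel_belt_x = c(-21, -22, -20, -19),
  accel_arm_x = c(-288, NA, -289, -290)
)

# Count missing values per column, then keep only the columns with none.
na_counts <- colSums(is.na(toy))
complete_cols <- toy[, na_counts == 0, drop = FALSE]

print(na_counts)
print(dim(complete_cols))
```

Applied to a data frame with no missing values, the same two lines return it unchanged, which matches the unchanged dimensions reported above.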
model1$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 6.06%
Confusion matrix:
A B C D E class.error
A 3213 22 52 59 2 0.04032258
B 100 2049 87 24 19 0.10092146
C 41 60 1931 15 7 0.05988315
D 44 7 89 1779 11 0.07823834
E 3 25 25 22 2090 0.03464203
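As a check on how to read this summary, the per-class errors and the OOB error rate can be recomputed from the confusion matrix itself. The sketch below copies the counts from the printout and assumes rows are the true classes and columns the out-of-bag predictions.

```r
# Confusion matrix copied from the printed random forest summary
# (rows: true class, columns: out-of-bag prediction).
classes <- c("A", "B", "C", "D", "E")
oob <- matrix(
  c(3213,   22,   52,   59,    2,
     100, 2049,   87,   24,   19,
      41,   60, 1931,   15,    7,
      44,    7,   89, 1779,   11,
       3,   25,   25,   22, 2090),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

# Per-class error: share of each row that falls off the diagonal.
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: share of all samples that fall off the diagonal.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(round(class_error, 4))
print(round(100 * oob_error, 2))
```

The counts sum to 11776, the number of rows in trainset, and the recomputed values agree with the class.error column and the 6.06% OOB estimate.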
We then validate the fitted model, model1, on the held-out testset to see how well it performs, looking at the Accuracy value.
cmrf$overall
Accuracy Kappa AccuracyLower AccuracyUpper
9.418812e-01 9.264350e-01 9.364741e-01 9.469571e-01
AccuracyNull AccuracyPValue McnemarPValue
2.844762e-01 0.000000e+00 1.687764e-12
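The call that created cmrf is not shown. A minimal sketch of how such an object is typically built with caret's confusionMatrix() follows; the label vectors are hypothetical stand-ins for testset$classe and the model's predictions on testset.

```r
library(caret)

# Hypothetical held-out labels and predictions; in the report these would be
# the classe column of testset and the predictions of model1 on testset.
lv <- c("A", "B", "C", "D", "E")
truth <- factor(c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E"), levels = lv)
pred  <- factor(c("A", "A", "B", "A", "C", "C", "D", "C", "E", "E"), levels = lv)

# confusionMatrix() cross-tabulates predictions against the reference labels
# and reports Accuracy, Kappa, and a confidence interval in $overall.
cm <- confusionMatrix(data = pred, reference = truth)

print(cm$overall[c("Accuracy", "AccuracyLower", "AccuracyUpper")])
```

Both arguments need to be factors with the same levels. Because the data were read with stringsAsFactors = FALSE, classe is a character column and would need converting with factor() first.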
We see that the accuracy of the random forest model on the testset is 94%.
## Generalized Boosted Regression
### Training
model2$finalModel
A gradient boosted model with multinomial loss function.
150 iterations were performed.
There were 16 predictors of which 16 had non-zero influence.
# print model summary
print(model2)
Stochastic Gradient Boosting
11776 samples
16 predictor
5 classes: 'A', 'B', 'C', 'D', 'E'
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 1 times)
Summary of sample sizes: 9420, 9422, 9420, 9421, 9421
Resampling results across tuning parameters:
interaction.depth n.trees Accuracy Kappa
1 50 0.5454308 0.4166360
1 100 0.6046194 0.4948253
1 150 0.6370584 0.5369123
2 50 0.6560795 0.5608886
2 100 0.7314882 0.6585172
2 150 0.7676625 0.7050553
3 50 0.7324214 0.6594103
3 100 0.7896571 0.7330953
3 150 0.8202283 0.7720946
Tuning parameter 'shrinkage' was held constant at a value
of 0.1
Tuning parameter 'n.minobsinnode' was held constant
at a value of 10
Accuracy was used to select the optimal model using
the largest value.
The final values used for the model were n.trees =
150, interaction.depth = 3, shrinkage = 0.1
and n.minobsinnode = 10.
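The selection rule stated in the summary can be reproduced directly from the tuning table. The sketch below copies the cross-validated accuracies from the printout and picks the row with the largest value.

```r
# Resampling results copied from the printed tuning table.
tuning <- data.frame(
  interaction.depth = rep(1:3, each = 3),
  n.trees = rep(c(50, 100, 150), times = 3),
  Accuracy = c(0.5454308, 0.6046194, 0.6370584,
               0.6560795, 0.7314882, 0.7676625,
               0.7324214, 0.7896571, 0.8202283)
)

# Select the combination with the largest cross-validated accuracy.
best <- tuning[which.max(tuning$Accuracy), ]

print(best)
```

Accuracy is still rising at the largest values in the grid, so a wider grid might do better; that is an observation from the table, not something the report tested.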
We then validate the fitted model, model2, on the held-out testset to see how well it performs, looking at the Accuracy value.
cmgbm$overall
Accuracy Kappa AccuracyLower AccuracyUpper
8.202906e-01 7.721739e-01 8.116128e-01 8.287298e-01
AccuracyNull AccuracyPValue McnemarPValue
2.844762e-01 0.000000e+00 3.239876e-36
We see that the accuracy of the “gbm” model on the testset is 82%.
# Running Best Model on Testing data
Because the random forest model had markedly better accuracy on the testset (94% versus 82%), we use it to predict our answers for the quiz.
Results
[1] B A C A A E D B A A B C B A E E A B B B
Levels: A B C D E
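The choice of model behind these predictions rests on the two held-out accuracies reported earlier. As an informal check, the sketch below copies the accuracies and interval bounds from the two $overall printouts and tests whether the intervals overlap. Both intervals come from the same testset, so this is a rough comparison rather than a formal test.

```r
# Held-out accuracy and interval bounds copied from the two $overall printouts.
results <- data.frame(
  model = c("rf", "gbm"),
  accuracy = c(0.9418812, 0.8202906),
  lower = c(0.9364741, 0.8116128),
  upper = c(0.9469571, 0.8287298)
)

# Split the table into the more accurate model and the other one.
top <- which.max(results$accuracy)
best <- results[top, ]
other <- results[-top, ]

# The intervals are disjoint when the better model's lower bound
# lies above the other model's upper bound.
disjoint <- best$lower > other$upper

print(as.character(best$model))
print(disjoint)
```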
The data come from:
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.