We use accelerometer data from six subjects to build a highly predictive machine learning model. After cross validation, the best model, a random forest, achieves 99.2% accuracy and correctly predicts all 20 values in the ultimate test set.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
First we download the data for this project. The training data contains 19,622 observations of 160 variables; the ultimate test set contains 20 observations of the same 160 variables.
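For reproducibility, the two csv files can be fetched directly from R. The cloudfront URLs below are assumed to be the standard course links rather than taken from this report; adjust them if the files are hosted elsewhere.
# Sketch: download the two csv files if they are not already present.
# The URLs are assumed to be the standard course links for this dataset.
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(trainUrl, "pml-training.csv")
if (!file.exists("pml-testing.csv"))  download.file(testUrl, "pml-testing.csv")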
We will hold back the ultimate test data until the very end of the model-building procedure since it contains only 20 cases, and instead subdivide the training data into our own training and test sets.
Nomenclature: PML refers to the training csv as a whole (19,622 cases). Training refers to the subset of PML used for training, and testing refers to the subset of PML used for cross validation. The “ultimate test” set refers to the separate 20-case csv on which the final model is applied.
Load the data
# Load the (total) training set.
# Missing values appear in three forms in the raw csv; treat all of them as NA.
pml <- read.csv("pml-training.csv", na.strings=c("#DIV/0!","","NA"))
Load the caret package and divide PML into training and testing sets: training is 60% of PML, testing is 40%. Set the seed for reproducibility.
# load the caret package for training and partitioning
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(1337)
# segment the original training set for cross validation
inTrain <- createDataPartition(y=pml$classe,
p=0.6, list=FALSE)
training <- pml[inTrain,]
testing <- pml[-inTrain,]
Many columns consist almost entirely of missing values, so remove any column that contains NAs. The first 7 columns contain bookkeeping information that shouldn’t improve the model (row numbers, user name, timestamps, and window counters) and can also be removed.
# remove columns that contain missing values
training <- training[,colSums(is.na(training)) == 0]
testing <- testing[,colSums(is.na(testing)) == 0]
# remove the first 7 bookkeeping columns (not useful as predictors)
training <- training[,-(c(1:7))]
testing <- testing[,-c((1:7))]
The variable being predicted in both the training and testing sets is “classe”, which takes one of five values (A, B, C, D, E).
I expect a suitable model to have an out-of-sample error rate of less than 1 in 20, since the assignment ultimately expects us to correctly predict all 20 values in the ultimate test set.
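As a rough back-of-the-envelope check, if the 20 ultimate test predictions are treated as independent, the chance of a perfect score is the per-prediction accuracy raised to the 20th power:
# Rough guide: probability of 20/20 correct is roughly accuracy^20 (independence assumed)
0.95^20     # ~0.36: 95% accuracy gives only about a 1-in-3 chance of a perfect score
0.99^20     # ~0.82: 99% accuracy makes a perfect score likely
0.5^(1/20)  # ~0.966: per-prediction accuracy needed for even a 50% chance of 20/20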
Let’s run a quadratic discriminant analysis (QDA), a more flexible version of the linear discriminant classifier, and then validate the model by applying it to the testing data.
# Train modFit1 with QDA. Predict on the model, and
# cross-validate with the testing data.
modFit1 <- train(classe ~ .,method="qda",data=training)
## Loading required package: MASS
prediction1 <- predict(modFit1,newdata=testing)
confusionMatrix(prediction1, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2061 80 1 2 0
## B 95 1279 66 5 58
## C 36 137 1286 176 54
## D 30 2 6 1086 39
## E 10 20 9 17 1291
##
## Overall Statistics
##
## Accuracy : 0.8926
## 95% CI : (0.8855, 0.8993)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8643
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9234 0.8426 0.9401 0.8445 0.8953
## Specificity 0.9852 0.9646 0.9378 0.9883 0.9913
## Pos Pred Value 0.9613 0.8510 0.7614 0.9338 0.9584
## Neg Pred Value 0.9700 0.9623 0.9867 0.9701 0.9768
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2627 0.1630 0.1639 0.1384 0.1645
## Detection Prevalence 0.2733 0.1916 0.2153 0.1482 0.1717
## Balanced Accuracy 0.9543 0.9036 0.9389 0.9164 0.9433
This model gives 89.3% accuracy, an error rate of roughly one in ten, which misses our target of less than one error in twenty. I think we can build a model that does better, especially since the submission stage expects 20/20 correct predictions, which strongly suggests that accuracy well above 89.3% is achievable.
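Under the same independence assumption as before, 89.3% per-prediction accuracy makes a perfect submission unlikely:
# Probability of a perfect 20/20 submission with the QDA model
0.8926^20  # ~0.10, roughly a 1-in-10 chance of getting all 20 right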
We will use the randomForest package directly because it runs much faster than training a random forest through caret. After fitting the model we will validate it on the testing set.
# Predict with random forest
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
modFit2 <- randomForest(classe ~ ., data=training)
prediction2 <- predict(modFit2,newdata=testing)
confusionMatrix(prediction2, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 13 0 0 0
## B 0 1500 15 0 0
## C 0 5 1352 24 0
## D 0 0 1 1262 4
## E 0 0 0 0 1438
##
## Overall Statistics
##
## Accuracy : 0.9921
## 95% CI : (0.9899, 0.9939)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.99
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9881 0.9883 0.9813 0.9972
## Specificity 0.9977 0.9976 0.9955 0.9992 1.0000
## Pos Pred Value 0.9942 0.9901 0.9790 0.9961 1.0000
## Neg Pred Value 1.0000 0.9972 0.9975 0.9964 0.9994
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1912 0.1723 0.1608 0.1833
## Detection Prevalence 0.2861 0.1931 0.1760 0.1615 0.1833
## Balanced Accuracy 0.9988 0.9929 0.9919 0.9903 0.9986
This is much better, with a 99.2% accuracy rate! That’s approximately one error in every 125 predictions.
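A quick sanity check on those figures (again treating the 20 ultimate test predictions as independent):
# Sanity check on the random forest error rate and the chance of a perfect submission
1 / (1 - 0.9921)  # ~127, i.e. roughly one error per 125 predictions
0.9921^20         # ~0.85 probability of getting all 20 ultimate test cases right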
Let’s apply our random forest (modFit2) to the ultimate test set and output our predictions:
# try small test set
pmlultimatetest <- read.csv("pml-testing.csv", na.strings=c("#DIV/0!","","NA"))
# predict for small 20 case set
predict3 <- predict(modFit2, newdata=pmlultimatetest)
predict3
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Let’s write each prediction to its own file for submission.
pml_write_files = function(x) {
  n = length(x)
  for (i in 1:n) {
    filename = paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(predict3)
I submitted these predictions to the second part of the assignment and all 20/20 were correct!
# QDA model
print(modFit1$finalModel)
## Call:
## qda(x, grouping = y)
##
## Prior probabilities of groups:
## A B C D E
## 0.2843071 0.1935292 0.1744226 0.1638927 0.1838485
##
## Group means:
## roll_belt pitch_belt yaw_belt total_accel_belt gyros_belt_x
## A 59.96214 0.2422073 -11.142936 10.75269 -0.005627240
## B 64.73887 -0.4847258 -13.468916 11.08425 -0.002395788
## C 64.48253 -1.2340604 -7.113057 11.13291 -0.015998053
## D 59.93643 1.5626166 -18.212503 11.14611 -0.013756477
## E 74.04343 0.5354919 -5.568674 12.70254 0.006013857
## gyros_belt_y gyros_belt_z accel_belt_x accel_belt_y accel_belt_z
## A 0.04120968 -0.1230585 -6.105137 29.17473 -63.62515
## B 0.04251426 -0.1328872 -4.464239 31.64195 -72.90478
## C 0.03897274 -0.1360419 -3.824732 30.82668 -70.54382
## D 0.03601036 -0.1359378 -8.141451 29.88912 -68.07461
## E 0.03978753 -0.1287667 -4.461894 28.92240 -91.00416
## magnet_belt_x magnet_belt_y magnet_belt_z roll_arm pitch_arm
## A 58.35663 602.2261 -338.0606 -0.9099074 3.444131
## B 50.26547 599.2264 -336.6073 31.7422203 -6.671071
## C 57.45764 599.7317 -337.6738 25.5064216 -1.449391
## D 48.90000 593.7606 -341.4803 23.8906114 -10.568031
## E 62.59076 568.4425 -377.2416 20.1005266 -13.276910
## yaw_arm total_accel_arm gyros_arm_x gyros_arm_y gyros_arm_z
## A -10.6519654 27.32706 0.01674731 -0.2184827 0.2693668
## B 7.3877271 26.54410 0.02355419 -0.2889601 0.2573892
## C 5.2980964 24.32619 0.06817916 -0.2392600 0.2631402
## D 5.2741140 23.45233 0.03256995 -0.2455389 0.2613316
## E 0.1092656 24.42910 0.07375058 -0.3094134 0.2894965
## accel_arm_x accel_arm_y accel_arm_z magnet_arm_x magnet_arm_y
## A -131.33333 46.80974 -73.75179 -16.68817 236.57139
## B -41.60026 24.69987 -97.39798 238.61694 126.48355
## C -79.53603 41.23856 -54.14460 146.95618 190.94985
## D 13.24922 26.24197 -49.02021 392.94404 96.80052
## E -15.06420 16.39353 -77.31732 333.72656 81.55658
## magnet_arm_z roll_dumbbell pitch_dumbbell yaw_dumbbell
## A 412.1219 21.40308 -18.729006 0.9157553
## B 190.6762 35.58885 2.845738 13.2715066
## C 362.8140 -13.15525 -24.922559 -16.2464268
## D 298.4461 50.72061 -1.656519 0.9171672
## E 212.8554 27.61226 -7.324180 5.3187394
## total_accel_dumbbell gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z
## A 14.68668 0.1062037 0.042458184 -0.05898148
## B 14.35630 0.1684642 0.014111452 -0.14433523
## C 12.88364 0.1949464 0.051645570 -0.15321811
## D 11.22021 0.2081658 0.008150259 -0.12967358
## E 14.55566 0.1381109 0.105838337 -0.13818938
## accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x
## A -50.3900836 52.79152 -57.14367 -385.9973
## B -0.9262835 69.15665 -16.84116 -253.4370
## C -40.2156767 30.03651 -52.76582 -370.3096
## D -21.5025907 52.73938 -32.54611 -319.8098
## E -19.3852194 56.72887 -25.00739 -291.2619
## magnet_dumbbell_y magnet_dumbbell_z roll_forearm pitch_forearm
## A 217.8874 10.28901 25.96955 -6.733557
## B 267.5024 50.31768 32.49744 14.560926
## C 153.9830 62.94985 58.52051 12.591198
## D 220.8720 60.73109 15.65495 27.931922
## E 239.6106 72.18476 38.92143 16.402503
## yaw_forearm total_accel_forearm gyros_forearm_x gyros_forearm_y
## A 25.36562 32.00000 0.1744504 0.154596774
## B 14.27789 35.55287 0.1571128 0.087595437
## C 39.89561 34.88462 0.2005161 0.106363194
## D 5.00142 35.99637 0.1120570 0.023683938
## E 12.95390 36.64527 0.1436351 -0.005404157
## gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z
## A 0.1841876 -0.9596774 170.3796 -58.98656
## B 0.1752611 -77.9324265 139.3629 -46.37473
## C 0.1411928 -48.8583252 213.3189 -61.01607
## D 0.1289534 -150.1839378 154.6187 -43.51865
## E 0.1385266 -71.8138568 146.9478 -61.36998
## magnet_forearm_x magnet_forearm_y magnet_forearm_z
## A -195.2264 477.5412 406.1263
## B -327.6235 279.7016 376.2211
## C -339.6480 500.5501 458.5336
## D -449.2969 319.0663 355.3938
## E -334.1404 280.5926 357.5358
# Random forest model. Can't list all the trees, but here is the forest, so to speak.
print(modFit2)
##
## Call:
## randomForest(formula = classe ~ ., data = training)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.65%
## Confusion matrix:
## A B C D E class.error
## A 3345 2 0 0 1 0.0008960573
## B 14 2258 7 0 0 0.0092145678
## C 0 17 2035 2 0 0.0092502434
## D 0 0 24 1905 1 0.0129533679
## E 0 0 2 6 2157 0.0036951501
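If more detail than the summary above is wanted, the randomForest package can also report variable importance and extract individual trees; the following is only a sketch and its output is not shown here.
# Sketch: peek inside the fitted forest (output not shown).
# varImpPlot() ranks predictors by mean decrease in Gini impurity;
# getTree() extracts a single tree (here the first of the 500) as a data frame.
varImpPlot(modFit2, n.var = 10, main = "Top 10 predictors")
head(getTree(modFit2, k = 1, labelVar = TRUE))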