In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to build a model that quantifies how well they perform the exercise (the classe variable). This report describes how we built our model, how we used cross-validation, what we expect the out-of-sample error to be, and why we made the choices we did. The model is then used to predict 20 different test cases.
# install.packages("caret")
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
training <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"))
testing <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"))
Data before cleaning
dim(training)
## [1] 19622 160
# remove variables that don't make intuitive sense for prediction (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp), which happen to be the first five variables
training <- training[, -(1:5)]
# remove variables with nearly zero variance
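nzv <- nearZeroVar(training)
# guard against an empty index: training[, -integer(0)] would drop every column
if (length(nzv) > 0) training <- training[, -nzv]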
# remove variables that are almost always NA
mostlyNA <- sapply(training, function(x) mean(is.na(x))) > 0.95
training <- training[, !mostlyNA]
Data after cleaning
dim(training)
## [1] 19622 54
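The NA filter used in the cleaning step can be illustrated on a small example. The sketch below uses a made-up data frame (`toy`, with hypothetical columns `a`, `b`, `c`), not the project data, and only base R; it applies the same 0.95 threshold.

```r
# Toy data frame (hypothetical, not the project data): 'b' is NA in every row,
# 'c' has a single NA out of 20 values.
toy <- data.frame(
  a = 1:20,
  b = rep(NA_real_, 20),
  c = c(NA, seq(0.5, 9.5, by = 0.5))
)

# Fraction of missing values in each column
na_share <- sapply(toy, function(x) mean(is.na(x)))

# Keep only the columns whose NA share does not exceed the 0.95 threshold
keep <- na_share <= 0.95
toy_clean <- toy[, keep, drop = FALSE]

names(toy_clean)
```

Here `b` is dropped because all of its values are missing, while `c` is kept because only 5% of its values are missing.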
set.seed(258)
inTrain <- createDataPartition(y=training$classe, p=0.7, list=FALSE)
MyTraining <- training[inTrain, ]
MyTesting <- training[-inTrain, ]
# instruct train to use 3-fold CV to select optimal tuning parameters
fitControl <- trainControl(method="cv", number=3, verboseIter=FALSE)
# fit model on MyTraining data
fit <- train(classe ~ ., data=MyTraining, method="rf", trControl=fitControl)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:ggplot2':
##
## margin
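To make the idea behind 3-fold cross-validation concrete, here is a minimal base R sketch of how rows can be split into folds. It is a simplified illustration under our own assumptions (12 hypothetical rows, an arbitrary seed, plain random assignment) and does not reproduce how caret builds its resamples internally.

```r
set.seed(1)  # arbitrary seed, for this illustration only
n <- 12      # hypothetical number of rows
k <- 3       # number of folds

# Randomly assign each row index to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

for (i in seq_len(k)) {
  held_out <- which(fold_id == i)   # rows used to assess the model
  fit_rows <- which(fold_id != i)   # rows used to fit the model
  cat("Fold", i, ": fit on", length(fit_rows),
      "rows, assess on", length(held_out), "rows\n")
}
```

Each row is held out exactly once, so every candidate tuning value is assessed on data that was not used to fit it.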
# print final model to see tuning parameters it chose
fit$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.22%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3905    1    0    0    0 0.0002560164
## B    6 2648    3    1    0 0.0037622272
## C    0    7 2389    0    0 0.0029215359
## D    0    0    4 2247    1 0.0022202487
## E    0    1    0    6 2518 0.0027722772
# use model to predict classe in validation set (MyTesting)
preds <- predict(fit, newdata=MyTesting)
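# show confusion matrix to get estimate of out-of-sample error
# note: confusionMatrix() expects (data = predictions, reference = truth); with the arguments in this order,
# rows are the true classes and columns are the predictions. Overall accuracy and kappa are unaffected,
# but the per-class statistics are computed with the two roles swapped.
confusionMatrix(MyTesting$classe, preds)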
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    0    0    0    1
##          B    3 1136    0    0    0
##          C    0    3 1023    0    0
##          D    0    0    6  958    0
##          E    0    0    0    0 1082
##
## Overall Statistics
##
## Accuracy : 0.9978
## 95% CI : (0.9962, 0.9988)
## No Information Rate : 0.2848
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9972
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9974   0.9942   1.0000   0.9991
## Specificity            0.9998   0.9994   0.9994   0.9988   1.0000
## Pos Pred Value         0.9994   0.9974   0.9971   0.9938   1.0000
## Neg Pred Value         0.9993   0.9994   0.9988   1.0000   0.9998
## Prevalence             0.2848   0.1935   0.1749   0.1628   0.1840
## Detection Rate         0.2843   0.1930   0.1738   0.1628   0.1839
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9990   0.9984   0.9968   0.9994   0.9995
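The accuracy on the validation set is 99.78% (95% CI: 99.62% to 99.88%), so the estimated out-of-sample error is about 0.22%, in line with the random forest's OOB error estimate of 0.22%. We therefore expect the model to predict new cases very accurately.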
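As a check on the arithmetic, the overall accuracy can be recomputed from the counts in the validation confusion matrix printed above. The sketch below re-enters those counts by hand into a matrix (the names `cm`, `accuracy`, and `oos_error` are ours) and uses only base R.

```r
classes <- c("A", "B", "C", "D", "E")

# Counts copied from the validation confusion matrix in this report
cm <- matrix(
  c(1673,    0,    0,   0,    1,
       3, 1136,    0,   0,    0,
       0,    3, 1023,   0,    0,
       0,    0,    6, 958,    0,
       0,    0,    0,   0, 1082),
  nrow = 5, byrow = TRUE,
  dimnames = list(classes, classes)
)

# Accuracy is the share of cases on the diagonal (correctly classified)
accuracy <- sum(diag(cm)) / sum(cm)

# The estimated out-of-sample error is the complement of the accuracy
oos_error <- 1 - accuracy

round(c(accuracy = accuracy, oos_error = oos_error), 4)
```

This gives 5872 correct predictions out of 5885 cases, i.e. an accuracy of about 0.9978 and an estimated error of about 0.0022.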
imps <- varImp(fit)
imps
## rf variable importance
##
## only 20 most important variables shown (out of 53)
##
## Overall
## num_window 100.000
## roll_belt 63.804
## pitch_forearm 40.188
## yaw_belt 33.912
## magnet_dumbbell_z 28.971
## pitch_belt 28.191
## magnet_dumbbell_y 27.693
## roll_forearm 22.248
## accel_dumbbell_y 12.404
## magnet_dumbbell_x 11.545
## roll_dumbbell 11.072
## accel_forearm_x 10.586
## accel_belt_z 9.221
## total_accel_dumbbell 8.930
## accel_dumbbell_z 8.251
## magnet_belt_y 7.835
## magnet_forearm_z 6.617
## magnet_belt_z 6.568
## magnet_belt_x 6.261
## roll_arm 5.205
predsfinal <- predict(fit, newdata=testing)
predsfinal
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E