Overview:

In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to build a model that quantifies how well they perform the exercise. This report describes how we built our model, how we used cross validation, what we expect the out-of-sample error to be, and why we made the choices we did. The model is then used to predict 20 different test cases.

1. How we built our model

# install.packages("caret")
library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

training <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"))
testing <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"))

Data before cleaning

dim(training)

## [1] 19622   160

# remove identifier and timestamp variables that should not drive prediction
# (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp);
# these are the first five columns
training <- training[, -(1:5)]


# remove variables with nearly zero variance
nzv <- nearZeroVar(training)
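# guard against an empty index: training[, -integer(0)] would drop every column
if (length(nzv) > 0) training <- training[, -nzv]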

# remove variables that are almost always NA
mostlyNA <- sapply(training, function(x) mean(is.na(x))) > 0.95
training <- training[, !mostlyNA]

Data after cleaning

dim(training)

## [1] 19622    54
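The two filters above can be illustrated on a small data frame without caret. This is only a sketch under stated assumptions: the `toy` data frame and its values are hypothetical, and counting distinct values is a crude stand-in for `nearZeroVar`, which by default looks at the frequency ratio and the percentage of unique values.

```r
# Toy data frame (hypothetical values) standing in for the sensor data
toy <- data.frame(
  roll_belt    = c(1.4, 1.5, 1.3, 1.6, 1.2),
  max_roll     = c(NA, NA, NA, NA, NA),   # always missing
  constant_col = c(0, 0, 0, 0, 0),        # a single distinct value
  classe       = factor(c("A", "B", "A", "C", "B"))
)

# Fraction of missing values per column (same rule as mostlyNA above)
na_frac <- sapply(toy, function(x) mean(is.na(x)))

# Number of distinct non-missing values per column
# (simplified substitute for nearZeroVar, labelled as an assumption)
n_unique <- sapply(toy, function(x) length(unique(x[!is.na(x)])))

# Keep columns that are rarely missing and take more than one value
keep <- na_frac <= 0.95 & n_unique > 1
cleaned <- toy[, keep, drop = FALSE]
names(cleaned)
```

Using `drop = FALSE` keeps the result a data frame even if only one column survives.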

set.seed(258)
inTrain <- createDataPartition(y=training$classe, p=0.7, list=FALSE)
MyTraining <- training[inTrain, ]
MyTesting <- training[-inTrain, ]
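`createDataPartition` samples within each level of the outcome so that the class proportions are preserved in both pieces. A rough base R sketch of that idea follows; the `labels` vector is hypothetical, and the exact sampling scheme inside caret may differ.

```r
set.seed(258)

# Hypothetical outcome vector with unequal class sizes
labels <- factor(rep(c("A", "B", "C"), times = c(10, 20, 30)))

# Within each class, draw 70% of that class's row indices
in_train <- unlist(
  lapply(split(seq_along(labels), labels),
         function(idx) sample(idx, size = round(0.7 * length(idx)))),
  use.names = FALSE
)

train_labels <- labels[in_train]    # 70% of each class
test_labels  <- labels[-in_train]   # remaining 30% of each class

table(train_labels)
table(test_labels)
```

Because the draw is done per class, a rare class cannot end up entirely in one partition by chance.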

2. How we used cross validation

# instruct train to use 3-fold CV to select optimal tuning parameters
fitControl <- trainControl(method="cv", number=3, verboseIter=FALSE)
# fit model on MyTraining data
fit <- train(classe ~ ., data=MyTraining, method="rf", trControl=fitControl)

## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
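To make the 3-fold scheme concrete, the sketch below builds the fold assignment by hand for a hypothetical set of 12 rows. It assumes plain random folds; caret's own fold construction also takes the outcome into account, so this is an illustration of the idea rather than what `train` does internally.

```r
set.seed(258)
n <- 12   # hypothetical number of rows
k <- 3    # number of folds, matching number=3 above

# Assign each row to one of k folds of equal size, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

for (i in seq_len(k)) {
  held_out <- which(fold_id == i)   # rows used to score the candidate model
  fit_rows <- which(fold_id != i)   # rows used to fit the candidate model
  cat("fold", i, "fits on", length(fit_rows),
      "rows and scores on", length(held_out), "rows\n")
}
```

Each row is held out exactly once, and the accuracy averaged over the three folds is what `train` compares across candidate values of `mtry`.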

# print final model to see tuning parameters it chose
fit$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.22%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3905    1    0    0    0 0.0002560164
## B    6 2648    3    1    0 0.0037622272
## C    0    7 2389    0    0 0.0029215359
## D    0    0    4 2247    1 0.0022202487
## E    0    1    0    6 2518 0.0027722772

# use model to predict classe in validation set (MyTesting)
preds <- predict(fit, newdata=MyTesting)

3. What the expected out of sample error is

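# show confusion matrix to get estimate of out-of-sample error
# note: the observed classes are passed first, so the rows labelled "Prediction"
# below hold the observed classes and the columns labelled "Reference" hold the
# model's predictions; the overall accuracy is the same either way
confusionMatrix(MyTesting$classe, preds)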

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    0    0    0    1
##          B    3 1136    0    0    0
##          C    0    3 1023    0    0
##          D    0    0    6  958    0
##          E    0    0    0    0 1082
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9978          
##                  95% CI : (0.9962, 0.9988)
##     No Information Rate : 0.2848          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9972          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9974   0.9942   1.0000   0.9991
## Specificity            0.9998   0.9994   0.9994   0.9988   1.0000
## Pos Pred Value         0.9994   0.9974   0.9971   0.9938   1.0000
## Neg Pred Value         0.9993   0.9994   0.9988   1.0000   0.9998
## Prevalence             0.2848   0.1935   0.1749   0.1628   0.1840
## Detection Rate         0.2843   0.1930   0.1738   0.1628   0.1839
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9990   0.9984   0.9968   0.9994   0.9995
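The out-of-sample error estimate follows directly from the counts in the confusion matrix printed above: it is one minus the share of cases on the diagonal. The sketch below re-enters those counts by hand (the object names `cm`, `accuracy`, and `oos_error` are ours, not caret's).

```r
# Counts copied from the confusion matrix printed above
cm <- matrix(
  c(1673,    0,    0,   0,    1,
       3, 1136,    0,   0,    0,
       0,    3, 1023,   0,    0,
       0,    0,    6, 958,    0,
       0,    0,    0,   0, 1082),
  nrow = 5, byrow = TRUE,
  dimnames = list(row = LETTERS[1:5], col = LETTERS[1:5])
)

accuracy  <- sum(diag(cm)) / sum(cm)   # 5872 correct out of 5885 cases
oos_error <- 1 - accuracy              # estimated out-of-sample error

round(accuracy, 4)    # 0.9978
round(oos_error, 4)   # 0.0022
```

That is, 13 of the 5885 held-out cases were misclassified, an error of about 0.22%.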

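4. Why we made the choices we did

The held-out accuracy of 0.9978 corresponds to an estimated out-of-sample error of about 0.22%, which agrees with the OOB estimate of 0.22% reported for the final model, so we conclude that the model predicts very accurately. The variable importance ranking below shows which predictors the random forest relied on most.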

imps <- varImp(fit)
imps

## rf variable importance
## 
##   only 20 most important variables shown (out of 53)
## 
##                      Overall
## num_window           100.000
## roll_belt             63.804
## pitch_forearm         40.188
## yaw_belt              33.912
## magnet_dumbbell_z     28.971
## pitch_belt            28.191
## magnet_dumbbell_y     27.693
## roll_forearm          22.248
## accel_dumbbell_y      12.404
## magnet_dumbbell_x     11.545
## roll_dumbbell         11.072
## accel_forearm_x       10.586
## accel_belt_z           9.221
## total_accel_dumbbell   8.930
## accel_dumbbell_z       8.251
## magnet_belt_y          7.835
## magnet_forearm_z       6.617
## magnet_belt_z          6.568
## magnet_belt_x          6.261
## roll_arm               5.205

5. Using our prediction model to predict 20 different test cases

predsfinal <- predict(fit, newdata=testing)
predsfinal

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Practical Machine Learning Course Project

Janpu Hou

January 15, 2016