Practical Machine Learning Assignment

High Level Overview

In this project, we’ll use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants in order to classify how a given exercise was performed. Participants in this study were asked to perform barbell lifts in 5 different ways (both correctly and incorrectly). We will fit a random forest model and apply it to a testing set to classify each lift.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Getting and Cleaning Data

First we download the datasets. We won’t use the testing set at all until it’s time to predict the quiz results; instead, we’ll partition the training set again into training/testing subsets to validate the model before submitting results.

training <- read.csv(url("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"),
                     na.strings = c("NA", "#DIV/0!", ""), stringsAsFactors = FALSE)
testing  <- read.csv(url("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"),
                     na.strings = c("NA", "#DIV/0!", ""), stringsAsFactors = FALSE)

We keep only the fields where more than 50% of the values are not NA.

training<-training[, colSums(is.na(training)) < nrow(training) * 0.5]

We also get rid of the ID fields, which make up the first 8 columns.

training <- training[, -(1:8)]

Having cleaned up the training set, we drop the fields in the testing set that are not included in the training set, while keeping the “problem_id” field, which doesn’t exist in training. This leaves 52 fields in each.

testing<-testing[,(names(testing) %in% c(names(training),"problem_id"))]
dim(training)
## [1] 19622    52
dim(testing)
## [1] 20 52

Partitioning Training and Test Sets - Cross Validation

As discussed above, since we won’t use the testing set until model submission, we create a partition within the training dataset in order to validate the model. This is our (simple) cross-validation step to help avoid overfitting the model. We’ll keep 70% of the dataset for training.

inTrain  <- createDataPartition(training$classe, p=0.7, list=FALSE)
partTrain <- training[inTrain, ]
partTest  <- training[-inTrain, ]
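
As a quick sanity check (an addition not shown in the original output), we can confirm that the split is roughly 70/30:

# Row/column counts of the two partitions
dim(partTrain)
dim(partTest)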

Building and validating the model

We’ll be using a random forest model for prediction. First we build the model on our partitioned training data, using 5-fold cross-validation within train().

set.seed(111)
control <- trainControl(method="cv", 5)
modFit <- train(classe ~ ., data=partTrain, method="rf", trControl=control, ntree=250)
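
To see which sensor readings drive the classification, we could also rank the predictors by importance. This is a hedged addition using caret’s varImp(); it was not part of the original analysis:

# Optional: inspect the fitted model and rank predictors by importance
modFit
varImp(modFit)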

Then we’ll apply the model to our partitioned test set, and use the resulting accuracy to estimate the out-of-sample error.

predict_model <- predict(modFit, partTest)

Results

confusionMatrix(partTest$classe, predict_model)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1670    1    1    0    2
##          B    5 1129    5    0    0
##          C    0    8 1015    3    0
##          D    0    0    7  957    0
##          E    0    0    4    1 1077
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9937          
##                  95% CI : (0.9913, 0.9956)
##     No Information Rate : 0.2846          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.992           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9970   0.9921   0.9835   0.9958   0.9981
## Specificity            0.9990   0.9979   0.9977   0.9986   0.9990
## Pos Pred Value         0.9976   0.9912   0.9893   0.9927   0.9954
## Neg Pred Value         0.9988   0.9981   0.9965   0.9992   0.9996
## Prevalence             0.2846   0.1934   0.1754   0.1633   0.1833
## Detection Rate         0.2838   0.1918   0.1725   0.1626   0.1830
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9980   0.9950   0.9906   0.9972   0.9986

Based on the above output, the model is 99.37% accurate on the validation partition (Kappa 0.992). This gives an estimated out-of-sample error rate of roughly 0.6%, well under 1%. This is a very strong result, but having used cross-validation and applied the model to the held-out partition only once, we’re confident we haven’t overfit.
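
For completeness, the out-of-sample error estimate can be computed directly from the validation accuracy. This is a small sketch rather than code from the original report:

# Out-of-sample error estimated as 1 - accuracy on the held-out partition
cm <- confusionMatrix(partTest$classe, predict_model)
1 - as.numeric(cm$overall["Accuracy"])  # ~0.0063, given the 0.9937 accuracy above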

Applying the model to the final testing set and obtaining quiz results

predict_model_final <- predict(modFit, testing)
predict_model_final
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
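
If the quiz answers need to be submitted as one text file per problem (as in the original course setup), a small helper along the following lines could be used. The function name and file format here are assumptions, not part of the original report:

# Hypothetical helper: write each prediction to its own text file
write_quiz_files <- function(predictions) {
  for (i in seq_along(predictions)) {
    write.table(predictions[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write_quiz_files(predict_model_final)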
