In this project, we’ll use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to classify how a given exercise was performed. Participants in this study were asked to perform barbell lifts in 5 different ways (correctly and incorrectly). We will fit a model using random forests and apply it to a testing set to classify each lift.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
First we download the datasets. We won’t use the testing set at all until it’s time to predict the quiz results; as such, we’ll partition the training set into its own training/testing subsets to validate the model before submitting results.
training <- read.csv(url("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"),
                     na.strings = c("NA", "#DIV/0!", ""), stringsAsFactors = FALSE)
testing <- read.csv(url("http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"),
                    na.strings = c("NA", "#DIV/0!", ""), stringsAsFactors = FALSE)
We keep only the columns where more than 50% of the values are not NA.
training<-training[, colSums(is.na(training)) < nrow(training) * 0.5]
We also get rid of ID fields, which make up the first 8 cols
training <- training[, -(1:8)]
Having cleaned up the training set, we remove any fields from the testing set that are not in the training set, keeping the “problem_id” field, which doesn’t exist in training. This leaves 52 fields in each.
testing<-testing[,(names(testing) %in% c(names(training),"problem_id"))]
dim(training)
## [1] 19622 52
dim(testing)
## [1] 20 52
As discussed above, since we won’t use the testing set until model submission, we partition the training dataset in order to validate the model. This hold-out split is our (simple) cross-validation step to guard against overfitting. We’ll keep 70% of the dataset for training.
inTrain <- createDataPartition(training$classe, p=0.7, list=FALSE)
partTrain <- training[inTrain, ]
partTest <- training[-inTrain, ]
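As a quick check (a minimal sketch, not part of the original analysis), we can confirm that createDataPartition produced a stratified split, i.e. that the class proportions are nearly identical in the two partitions:
# Class proportions should match closely between partitions, since
# createDataPartition samples within each level of classe
round(prop.table(table(partTrain$classe)), 3)
round(prop.table(table(partTest$classe)), 3)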
We’ll be using a Random Forest model for prediction. First we build the model on our partitioned training data, using 5-fold cross-validation (via trainControl) for tuning.
set.seed(111)
control <- trainControl(method="cv", 5)
modFit <- train(classe ~ ., data=partTrain, method="rf", trControl=control, ntree=250)
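To see how the 5-fold cross-validation inside train() turned out (a minimal sketch that just prints objects created above; output omitted), we can inspect the fitted model:
# Cross-validated accuracy for each candidate mtry value tried by caret
print(modFit)
# The final random forest selected by train()
modFit$finalModel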
Then we’ll apply the model to our partitioned test set and use its accuracy to estimate the out-of-sample error.
predict_model <- predict(modFit, partTest)
confusionMatrix(partTest$classe, predict_model)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1670 1 1 0 2
## B 5 1129 5 0 0
## C 0 8 1015 3 0
## D 0 0 7 957 0
## E 0 0 4 1 1077
##
## Overall Statistics
##
## Accuracy : 0.9937
## 95% CI : (0.9913, 0.9956)
## No Information Rate : 0.2846
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.992
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9970 0.9921 0.9835 0.9958 0.9981
## Specificity 0.9990 0.9979 0.9977 0.9986 0.9990
## Pos Pred Value 0.9976 0.9912 0.9893 0.9927 0.9954
## Neg Pred Value 0.9988 0.9981 0.9965 0.9992 0.9996
## Prevalence 0.2846 0.1934 0.1754 0.1633 0.1833
## Detection Rate 0.2838 0.1918 0.1725 0.1626 0.1830
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9980 0.9950 0.9906 0.9972 0.9986
Based on the above output, the model is about 99.4% accurate on the held-out partition (Kappa of 0.992), which gives an estimated out-of-sample error rate of under 1%. This is remarkably high, but having used cross-validation and only run the model against the held-out set once, we’re confident we haven’t overfit.
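As a sanity check (a minimal sketch reusing the objects created above; the name cm is not part of the original analysis), the out-of-sample error estimate can be computed directly from the held-out confusion matrix:
cm <- confusionMatrix(partTest$classe, predict_model)
# Estimated out-of-sample error = 1 - accuracy on the held-out partition
1 - as.numeric(cm$overall["Accuracy"])
Finally, we apply the model to the original testing set to generate the 20 quiz predictions.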
predict_model_final <- predict(modFit, testing)
predict_model_final
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E