By Megan Williams
Background:
Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, my goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways:
Class A: exactly according to the specification (proper execution)
Class B: throwing the elbows to the front (common mistake)
Class C: lifting the dumbbell only halfway (common mistake)
Class D: lowering the dumbbell only halfway (common mistake)
Class E: throwing the hips to the front (common mistake)
Goal of this Project: The goal of this project is to predict the manner in which the exercise was performed (i.e., Class A, B, C, D, or E). This report will describe the following:
- How I built my model
- How I used cross validation
- What I think the expected out-of-sample error is
- An explanation of the choices I made
First, I will load the appropriate packages
library(AppliedPredictiveModeling)
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:survival':
##
## cluster
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:Hmisc':
##
## combine
Next, I will load and examine the data
rm(list = ls(all = TRUE))
setwd('/Users/meganwilliams/Desktop/MachineLearning')
Training = read.csv(file="pml-training.csv", header=TRUE, as.is = TRUE, stringsAsFactors = FALSE, sep=',', na.strings = c('NA','','#DIV/0!'))
Testing = read.csv(file="pml-testing.csv", header=TRUE, as.is = TRUE, stringsAsFactors = FALSE, sep=',', na.strings = c('NA','','#DIV/0!'))
Training$classe = as.factor(Training$classe)
dim(Training)
## [1] 19622 160
dim(Testing)
## [1] 20 160
summary(Training$classe)
## A B C D E
## 5580 3797 3422 3216 3607
Next, I will remove the columns that contain missing values
NAs = apply(Training,2,function(x) {sum(is.na(x))})
Training = Training[,which(NAs == 0)]
NAs = apply(Testing,2,function(x) {sum(is.na(x))})
Testing = Testing[,which(NAs == 0)]
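As a quick sanity check (my addition, not part of the original analysis), it is worth confirming that filtering the two files separately left them with the same predictor columns; if it had not, the test set would need to be realigned to the training columns before prediction.
# Sanity check: the predictor columns should agree between the two sets
# (Training ends in 'classe', Testing ends in 'problem_id', so drop the last column of each)
identical(names(Training)[-ncol(Training)], names(Testing)[-ncol(Testing)])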
Next, I will work on preprocessing the variables
pre_Proc = which(lapply(Training, class) %in% "numeric")
pre_Obj = preProcess(Training[,pre_Proc],method=c('knnImpute', 'center', 'scale'))
train = predict(pre_Obj, Training[,pre_Proc])
train$classe = Training$classe
test = predict(pre_Obj,Testing[,pre_Proc])
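To verify that centering and scaling behaved as expected, a quick spot-check (again my addition, assuming train now holds only the preprocessed numeric predictors plus classe) is that each predictor has a mean of roughly 0 and a standard deviation of roughly 1.
# Spot-check the preprocessing: means should be ~0 and sds ~1
round(range(colMeans(train[, -ncol(train)])), 6)
round(range(apply(train[, -ncol(train)], 2, sd)), 6)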
Near-Zero Variance Variables
Now, let's remove the near-zero variance variables
nzv = nearZeroVar(train,saveMetrics=TRUE)
train = train[,nzv$nzv==FALSE]
nzv = nearZeroVar(test,saveMetrics=TRUE)
test = test[,nzv$nzv==FALSE]
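One caveat: running nearZeroVar() separately on the training and test sets could, in principle, flag different columns. A more defensive alternative (a sketch of what could be done instead, not what was run above) is to compute the filter on the training set only and keep the same columns in the test set.
# Defensive alternative: derive the filter from the training set only
nzv_metrics = nearZeroVar(train, saveMetrics = TRUE)
keep_cols = rownames(nzv_metrics)[nzv_metrics$nzv == FALSE]
test_filtered = test[, intersect(keep_cols, names(test))]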
Cross Validation
Now, we must split the data into one set for training and one set for cross validation. The model itself is tuned with k-fold cross validation inside train(), and the held-out cross-validation set is then used to estimate the out-of-sample error.
set.seed(12031987)
inTrain = createDataPartition(train$classe, p = 3/4, list=FALSE)
training = train[inTrain,]
crossValidation = train[-inTrain,]
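Because createDataPartition() samples within each class, the class proportions should be nearly identical in the two subsets. A quick check (my addition) makes this easy to confirm:
# Class proportions should be roughly equal across the two partitions
round(prop.table(table(training$classe)), 3)
round(prop.table(table(crossValidation$classe)), 3)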
Model
Next, we will train the model using a Random Forest.
fit = train(classe ~ ., method="rf", data=training, trControl=trainControl(method='cv', number=5, allowParallel=TRUE))
##
## Attaching package: 'e1071'
##
## The following object is masked from 'package:Hmisc':
##
## impute
save(fit,file="/Users/meganwilliams/Desktop/MachineLearning/fit.R")
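Although not required for the write-up, caret's varImp() offers a quick look at which sensor readings drive the random forest's predictions; a minimal sketch:
# Inspect the most influential predictors in the fitted model
imp = varImp(fit)
plot(imp, top = 10)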
Accuracy
Next, let's check the accuracy on the training set and on the cross-validation set
## Training Set
train_Pred <- predict(fit, training)
confusionMatrix(train_Pred, training$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4185 0 0 0 0
## B 0 2848 0 0 0
## C 0 0 2567 0 0
## D 0 0 0 2412 0
## E 0 0 0 0 2706
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (1, 1)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.000 1.000 1.000 1.000 1.000
## Specificity 1.000 1.000 1.000 1.000 1.000
## Pos Pred Value 1.000 1.000 1.000 1.000 1.000
## Neg Pred Value 1.000 1.000 1.000 1.000 1.000
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.284 0.194 0.174 0.164 0.184
## Detection Prevalence 0.284 0.194 0.174 0.164 0.184
## Balanced Accuracy 1.000 1.000 1.000 1.000 1.000
## Cross Validation Set
cross_Pred <- predict(fit, crossValidation)
confusionMatrix(cross_Pred, crossValidation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1392 3 0 0 0
## B 2 944 2 0 0
## C 0 2 852 3 0
## D 0 0 1 801 3
## E 1 0 0 0 898
##
## Overall Statistics
##
## Accuracy : 0.997
## 95% CI : (0.994, 0.998)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.996
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.998 0.995 0.996 0.996 0.997
## Specificity 0.999 0.999 0.999 0.999 1.000
## Pos Pred Value 0.998 0.996 0.994 0.995 0.999
## Neg Pred Value 0.999 0.999 0.999 0.999 0.999
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.284 0.192 0.174 0.163 0.183
## Detection Prevalence 0.284 0.193 0.175 0.164 0.183
## Balanced Accuracy 0.998 0.997 0.998 0.998 0.998
Out of Sample Error
Next, we should estimate the out-of-sample error. We do this by subtracting the cross-validation accuracy from 1. The estimated out-of-sample error is low, suggesting that it is unlikely that the test samples will be classified incorrectly.
Out_of_Sample_Error = 1-.9965
Out_of_Sample_Error
## [1] 0.0035
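Rather than hard-coding the accuracy, the same quantity can be pulled straight from the confusion-matrix object computed above (an equivalent calculation, reusing cross_Pred):
# Equivalent calculation without hard-coding the accuracy
cv_acc = confusionMatrix(cross_Pred, crossValidation$classe)$overall['Accuracy']
1 - cv_acc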
Results
Finally, let's look at the predictions on the actual testing set
test_Pred = predict(fit, test)
test_Pred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
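For the submission portion of the assignment, each prediction is typically written to its own text file. A minimal sketch of such a helper (the function name and file-naming scheme are my own assumptions, not specified in this report):
# Hypothetical helper: write one prediction per file for submission
pml_write_files = function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(test_Pred)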