The objective of this analysis is to use accelerometer data to predict (classify) the manner in which a participant performed an exercise. There are five classes to predict, lettered A through E, with A being the only correct method. The data were collected from sensors on each participant's belt, forearm, arm, and dumbbell. Data like this are collected by popular devices such as Jawbone or FitBit. The data were generously provided by Groupware@LES. You can read more about the project here.
library(knitr)
opts_chunk$set(comment=NA)
# trainURL<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
#testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
#
# download.file(trainURL, "train.csv", mode="w")
#download.file(testURL, "test.csv", mode="w")
train <- read.csv("train.csv", header = T, stringsAsFactors = F, na.strings = c("","NA"))
test <- read.csv("test.csv", header = T, stringsAsFactors = F, na.strings = "NA")
train$classe<- factor(train$classe)
There are 19622 observations in the training set and 20 observations in the test set. Since the training set is so large, I can hold out a small percentage as a validation set and still have enough observations to validate the model adequately. I'll use a 90/10 split.
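Those counts are easy to confirm with a quick dim() check (a sketch; the comment values reflect the standard shape of this dataset):

dim(train)   # 19622 observations, 160 columns
dim(test)    # 20 observations, 160 columns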
library(caret)
# A set.seed() call here would make the 90/10 partition reproducible.
split <- createDataPartition(train$classe, p = 0.9, list = FALSE)
validation <- train[-split, ]
train <- train[split, ]
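Because createDataPartition stratifies on classe, the class proportions should be nearly identical in the two pieces; a quick check (a sketch using the objects just created):

# Class proportions in each piece of the split; the rows should match closely.
round(rbind(train = prop.table(table(train$classe)),
            validation = prop.table(table(validation$classe))), 3)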
The five outcome classes, A through E, are not evenly distributed: there are more A's than any other class. 'A' is the class for a correctly performed exercise.
table(train$classe)
   A    B    C    D    E
5022 3418 3080 2895 3247
# Drop the first seven bookkeeping columns (row id, user name,
# timestamps, and window indicators); they are not sensor readings.
train <- train[, -c(1:7)]
validation <- validation[, -c(1:7)]
Running summary() on the training set made it easy to see that many variables contain NA values. In fact, if a variable had any NA values, it was almost entirely NA. Because of this, it is safe to remove those variables, since they add no value for the majority of observations.
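This all-or-nothing pattern is easy to verify before removing anything (a sketch; na_frac is my own name):

# Per-column fraction of NAs: the values cluster at 0 or roughly 0.98.
na_frac <- colMeans(is.na(train))
table(round(na_frac, 2))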
# Drop every column that contains any NA values.
index <- which(colSums(is.na(train)) > 0)
train <- train[, -index]
validation <- validation[, -index]
I set up the model to use 5-fold cross-validation, which reduces computing time compared with caret's default bootstrap resampling. Even so, training the random forest takes several minutes on my machine.
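Note that the allowParallel = TRUE flag in the call below only takes effect if a foreach backend has been registered first; a minimal sketch using doParallel (the core count is illustrative):

# Register a parallel backend before calling train(); caret hands the
# resampling loops to the registered foreach workers.
library(doParallel)
cl <- makeCluster(4)   # illustrative; detectCores() - 1 is a common choice
registerDoParallel(cl)
# ... run train() as below, then release the workers with stopCluster(cl)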
fit <- train(classe ~ .,
             data = train,
             method = "rf",   # random forest
             trControl = trainControl(method = "cv",
                                      number = 5,
                                      allowParallel = TRUE))
pred <- predict(fit, newdata = validation)
confusionMatrix(validation$classe, pred)
Confusion Matrix and Statistics

          Reference
Prediction   A   B   C   D   E
         A 557   0   0   0   1
         B   2 377   0   0   0
         C   0   2 339   1   0
         D   0   0   1 320   0
         E   0   0   0   0 360

Overall Statistics

               Accuracy : 0.9964
                 95% CI : (0.9927, 0.9986)
    No Information Rate : 0.2852
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.9955
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9964   0.9947   0.9971   0.9969   0.9972
Specificity            0.9993   0.9987   0.9981   0.9994   1.0000
Pos Pred Value         0.9982   0.9947   0.9912   0.9969   1.0000
Neg Pred Value         0.9986   0.9987   0.9994   0.9994   0.9994
Prevalence             0.2852   0.1934   0.1735   0.1638   0.1842
Detection Rate         0.2842   0.1923   0.1730   0.1633   0.1837
Detection Prevalence   0.2847   0.1934   0.1745   0.1638   0.1837
Balanced Accuracy      0.9979   0.9967   0.9976   0.9981   0.9986
Not bad: the model misclassified only 7 of the 1,960 validation observations, for an accuracy of about 99.6% (95% CI 0.9927 to 0.9986).
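The estimated out-of-sample error is simply one minus the validation accuracy; a quick sketch reusing the objects above (cm and oos_error are my own names):

cm <- confusionMatrix(validation$classe, pred)
oos_error <- 1 - cm$overall["Accuracy"]   # estimated out-of-sample error
round(unname(oos_error), 4)               # ~0.0036 on this validation split

With that estimate in hand, I'll run the prediction against the test set and get the results for the quiz.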
# Apply the same column removals to the test set, then predict.
test <- test[, -c(1:7)]
test <- test[, -index]
final_pred <- data.frame(Predictions = predict(fit, newdata = test))
write.csv(final_pred, file = "Predictions.csv")
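As a follow-up, caret can report which sensor readings drove the forest (a sketch; output omitted here):

# Scaled variable importance from the fitted random forest.
varImp(fit)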