The goal of this project is to predict the manner in which participants performed an exercise. This is the “classe” variable in the training data set. The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The data were collected from a group of enthusiasts who regularly take measurements about themselves to improve their health. The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

We want to classify the manner of exercise for the 20 test cases from their measurements.

Load and Pre-process Data

We load the CSV data. The training data has 19622 rows and 160 columns; the last column, classe, is the outcome to be predicted from the remaining columns. We remove the first 7 columns, which are not relevant predictors of classe. We then remove columns in which more than 10% of the values are NA, and columns with near-zero variance, since these carry little information. This leaves 53 columns: 52 predictor variables plus classe.

library(caret)
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.2.1
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.1
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
testData <- read.csv("./pml-testing.csv",header=TRUE,na.strings=c("NA","NaN","#DIV/0!", ""))
trainData <- read.csv("./pml-training.csv",header=TRUE,na.strings=c("NA","NaN","#DIV/0!", ""))
dim(trainData)
## [1] 19622   160
# clean up train data
# remove columns that are not relevant to predictions
trainData <- trainData[,-c(1:7)] 
# remove columns in which more than 10% of the values are NA
trainData <- trainData[,colSums(is.na(trainData)) <= 0.1*nrow(trainData)]
# remove near-zero-variance variables
nzv <- nearZeroVar(trainData,saveMetrics = TRUE)
trainData <- trainData[,!nzv$nzv]
dim(trainData)
## [1] 19622    53
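
The NA filter above can be checked on a small example. The following is a minimal sketch in base R, using a hypothetical toy data frame (`toy` and its column names are made up for illustration). It keeps the columns whose share of missing values is at most 10%, which is the same rule as the `colSums(...) <= 0.1*nrow(...)` test applied to the real data.

```r
# Hypothetical toy data: 20 rows, one complete column, one mostly-missing
# column, and one column with a single NA (5% missing).
toy <- data.frame(
  complete = 1:20,
  sparse   = c(1, rep(NA, 19)),
  one_na   = c(NA, 2:20)
)

# Fraction of missing values per column
na_frac <- colMeans(is.na(toy))

# Keep only columns with at most 10% missing values
keep <- na_frac <= 0.1
toy_clean <- toy[, keep, drop = FALSE]

print(na_frac)
print(names(toy_clean))
```

Working with fractions (`colMeans`) rather than counts (`colSums`) makes the threshold easier to read, but the two forms select the same columns.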

Predict with Random Forests and Model Selection

We partition the training data set, with 80% of the rows used for training and 20% held out for testing. We first fit the full model and list the importance of each variable. The confusion matrix shows that the accuracy of the model on the held-out set is 0.9959.

# set the seed before any random step so that the split and the forest are reproducible
set.seed(15)
trainIndex  <- createDataPartition(trainData$classe,p = .8,list = FALSE)
trainDataIn <- trainData[trainIndex,]
trainDataOut <- trainData[-trainIndex,]
# fit the full model with 52 variables and check the importance of variables
# (naming the response as classe lets "." expand to the remaining columns only)
rf <- randomForest(classe ~ ., data=trainDataIn, ntree=300, importance=TRUE)
varImpPlot(rf)

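# test the full model with the held-out data set
predictData <- predict(rf,trainDataOut)
# note: confusionMatrix() expects (data = predictions, reference = truth); with the
# arguments in this order, the rows labelled "Prediction" below hold the true classes
confusionMatrix(trainDataOut$classe,predictData)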
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    0    0    0    0
##          B    2  755    2    0    0
##          C    0    4  679    1    0
##          D    0    0    6  637    0
##          E    0    0    0    1  720
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9959          
##                  95% CI : (0.9934, 0.9977)
##     No Information Rate : 0.285           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9948          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9947   0.9884   0.9969   1.0000
## Specificity            1.0000   0.9987   0.9985   0.9982   0.9997
## Pos Pred Value         1.0000   0.9947   0.9927   0.9907   0.9986
## Neg Pred Value         0.9993   0.9987   0.9975   0.9994   1.0000
## Prevalence             0.2850   0.1935   0.1751   0.1629   0.1835
## Detection Rate         0.2845   0.1925   0.1731   0.1624   0.1835
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9991   0.9967   0.9934   0.9975   0.9998
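
The reported accuracy can be reproduced by hand from the table: it is the sum of the diagonal (correctly classified cases) divided by the total number of held-out cases. The sketch below copies the counts from the output above into a matrix (the names `cm`, `accuracy` and `oos_error` are illustrative) and also derives the corresponding hold-out error estimate.

```r
# Confusion matrix counts copied from the output above (classes A to E)
cm <- matrix(
  c(1116,   0,   0,   0,   0,
       2, 755,   2,   0,   0,
       0,   4, 679,   1,   0,
       0,   0,   6, 637,   0,
       0,   0,   0,   1, 720),
  nrow = 5, byrow = TRUE,
  dimnames = list(LETTERS[1:5], LETTERS[1:5])
)

accuracy  <- sum(diag(cm)) / sum(cm)  # 3907 correct out of 3923 cases
oos_error <- 1 - accuracy             # estimated out-of-sample error

print(round(accuracy, 4))   # 0.9959
print(round(oos_error, 4))  # 0.0041
```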

We attempt to reduce the number of variables using simple 2-fold cross-validation, chosen to limit computing time. We set step to 0.5 so that the 50% least important variables are removed at each step. We then plot the cross-validation error against the number of variables and find that the error starts to level off at 6 variables. With the top six variables, the accuracy is 0.9829, compared to 0.9959 for the full model with 52 variables.
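
To see which model sizes a step of 0.5 produces, the sketch below computes a halving schedule in base R. It assumes that, on the log scale, the number of variables is multiplied by `step` at each stage and that the sequence always ends with a single variable; the exact rounding inside `rfcv` may differ in detail, and the names `p`, `k` and `n_var` are illustrative.

```r
p    <- 52    # number of predictors left after pre-processing
step <- 0.5   # with 0.5, half of the variables are dropped at each stage

# Number of stages before the schedule would drop below one variable
k <- floor(log(p, base = 1 / step))

# Candidate model sizes: p, p/2, p/4, ... (rounded), ending with 1
n_var <- unique(round(p * step^(0:(k - 1))))
if (!(1 %in% n_var)) n_var <- c(n_var, 1)

print(n_var)
```

With 52 predictors the schedule passes through a model of about six variables, which is the size examined next.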

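# pass only the predictors: classe is the last column, so drop it by name
rf.training <- rfcv(trainDataIn[, names(trainDataIn) != "classe"], trainDataIn$classe, cv.fold = 2, scale = "log", step = 0.5)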
with(rf.training, plot(n.var, error.cv, log="x", type="o", lwd=2))

# we try a model consisting of the six most important variables
rf2 <- randomForest(classe ~ yaw_belt+roll_belt+pitch_belt+magnet_dumbbell_z+magnet_dumbbell_y+pitch_forearm, data=trainDataIn, ntree=300, importance=TRUE)
predictData2 <- predict(rf2,trainDataOut)
confusionMatrix(trainDataOut$classe,predictData2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1098    5   10    3    0
##          B    6  734   16    3    0
##          C    1    2  677    4    0
##          D    2    0    5  636    0
##          E    0    5    2    3  711
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9829          
##                  95% CI : (0.9784, 0.9867)
##     No Information Rate : 0.2822          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9784          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9919   0.9839   0.9535   0.9800   1.0000
## Specificity            0.9936   0.9921   0.9978   0.9979   0.9969
## Pos Pred Value         0.9839   0.9671   0.9898   0.9891   0.9861
## Neg Pred Value         0.9968   0.9962   0.9898   0.9960   1.0000
## Prevalence             0.2822   0.1902   0.1810   0.1654   0.1812
## Detection Rate         0.2799   0.1871   0.1726   0.1621   0.1812
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9927   0.9880   0.9757   0.9889   0.9984

Predict With Test Data

We predict the outcome for the test data with both the full model and the six-variable model, and find that the predictions are identical.

predictDataOut <- predict(rf,testData)
predictDataOut2 <- predict(rf2,testData)
identical(predictDataOut,predictDataOut2)
## [1] TRUE
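
`identical()` only reports whether the two prediction vectors match exactly. When two models disagree, an agreement rate and a cross-tabulation show how often and where. The sketch below uses hypothetical prediction vectors (`pred_full` and `pred_small` are made up for illustration and are not the project's predictions).

```r
classes <- c("A", "B", "C", "D", "E")

# Hypothetical predictions from two models on the same six cases
pred_full  <- factor(c("A", "B", "C", "A", "E", "D"), levels = classes)
pred_small <- factor(c("A", "B", "C", "B", "E", "D"), levels = classes)

# Share of cases on which the two models agree
agreement <- mean(pred_full == pred_small)

# Cross-tabulation: off-diagonal cells are disagreements
agree_table <- table(full = pred_full, small = pred_small)

print(agreement)
print(agree_table)
```

Here the two vectors differ on one case out of six, which appears as a single off-diagonal count in row A, column B.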