PML Course Project

Project Overview

The project is the final course project for the Practical Machine Learning Course from the Johns Hopkins University Data Specialization on Coursera.

This project explores the use of machine learning to predict correct and incorrect exercise form using measured exercise data collected from various test subjects. The data set is large, including 19,622 observations in the training dataset and 20 additional observations in the test set. Each set includes 160 variables (159 predictors and the classification variable.

Using the random forests methodology and a training class of 20% of the training data, we are able to generate a prediction algorithm whichwith an accuracy of 99.26% when tested on the other 80% of the training set (the quiz set) 58 of the 159 original variables. The prediction algorithm achieved a 100% success rate on the 20 test observations. Using the alternative rpart methodology results in a much lower accuracy rate.

When attempted subsequently on a data set with sequential variables (timestamp, number, etc) using random forests, the accuracy rate remains high although the prediction model fails to accurately predict all of the final test classifications.

Data Preparation

Importing Data

The personal activity training data is located in a .csv file called “pml-training.csv”. In this code chunk we load pml-training.csv into a data.frame object called “trainingset”. The test data set, containing 20 observations, is called into a variable called “testingset”.

trainingset<-data.frame(fread(input = "pml-training.csv",header = T,na.strings = "NA"))
testingset<-data.frame(fread(input = "pml-testing.csv",header = T,na.strings = "NA"))

Data Cleaning

Many of the variables included in the training data are not included in the test data set. The majority of these variables contain no or close to no data in the test set as well. We will remove those variables as they will not be useful in predicting the test variables, leaving 58 remaining predictor variables and 1 classificatoin variable.

valuablecols<-c(2:11,37:49,60:68,84:86,102,113:124,140,151:160)
trainingset<-trainingset[,valuablecols]
testingset<-testingset[,valuablecols]
rm(valuablecols)

Training The Model

Using the random forests package, we train a predictive model (“modFit”) using the training set, which contains 3,927 observations. Due to the long processing time for this segment, the code below calls a model fit previously evaluated using the the commented code instead of evaluating the code in the markdown itself.

Partitioning Data

Using the createDataPartition function, we next subset the data into a training set and a quiz set from the available data. Due to the large size of the dataset, we limit the training data set to only 20% of the full provided training set and leave the rest of the set available as a quiz set.

inTrain<-createDataPartition(trainingset$classe,
                             p=.2,
                             list=FALSE)

train<-trainingset[inTrain,]
quiz<-trainingset[-inTrain,]

rm(inTrain)

Random Forests

The first attempt uses the Random Forests approach and provides a very accurate model. The training algorithm uses the default settings and predicts the classe (classification variable) using all other variables.

# RF20<-train(classe ~ .,
#               method="rf",
#               data=train)
# 
# saveRDS(modFitRF50,file = "rf modfit 20.RDS")

RF20<-readRDS(file="rf modfit 20.RDS")

# plot(RF20)
# RF20

RPart Method

The second attempt uses the rpart method, which provides a lower degree of accuracy. The training algorithm uses the default settings and predicts the classe (classification variable) using all other variables.

# RP20<-train(classe ~ .,
#               method="rpart",
#               data=train)
# View(train)
# saveRDS(RP20,file = "rpart modfit 20.RDS")
# 
RP20<-readRDS(file="rpart modfit 20.RDS")

Visualizing the Trained Models

Using the VarImp function we determine the most significant variables for each training algorithm. First we will consider the Random Forests model by variable imporance.

varImp(RF20)

## rf variable importance
## 
##   only 20 most important variables shown (out of 80)
## 
##                                Overall
## raw_timestamp_part_1           100.000
## num_window                      48.003
## roll_belt                       46.787
## pitch_forearm                   27.932
## magnet_dumbbell_z               22.079
## pitch_belt                      15.137
## magnet_dumbbell_y               13.820
## roll_forearm                    12.834
## cvtd_timestamp30/11/2011 17:12  12.704
## yaw_belt                        11.905
## cvtd_timestamp02/12/2011 14:58  11.695
## magnet_dumbbell_x               11.300
## magnet_belt_y                    7.840
## cvtd_timestamp28/11/2011 14:15   7.125
## accel_dumbbell_y                 6.873
## roll_dumbbell                    6.482
## cvtd_timestamp02/12/2011 13:33   6.478
## accel_forearm_x                  5.766
## accel_belt_z                     5.363
## accel_dumbbell_z                 5.029

Then the RPart by Variable Importance.

varImp(RP20)

## rpart variable importance
## 
##   only 20 most important variables shown (out of 91)
## 
##                                Overall
## raw_timestamp_part_1            100.00
## pitch_forearm                    86.29
## roll_belt                        84.20
## roll_dumbbell                    68.70
## magnet_dumbbell_y                64.77
## accel_belt_z                     62.88
## cvtd_timestamp28/11/2011 14:13   62.07
## num_window                       61.20
## accel_forearm_x                  58.66
## total_accel_belt                 47.41
## yaw_belt                         46.02
## cvtd_timestamp30/11/2011 17:12   41.66
## cvtd_timestamp30/11/2011 17:11   40.52
## accel_dumbbell_y                 39.55
## magnet_belt_y                    36.12
## pitch_belt                       35.14
## magnet_dumbbell_z                32.53
## cvtd_timestamp05/12/2011 11:24   32.27
## cvtd_timestamp02/12/2011 14:57   31.68
## cvtd_timestamp28/11/2011 14:14   31.22

Both training algorithms rely heavily on the raw_timestamp_part_1 variable, a variable representing the time the observation was taken. Random Forests uses num_window, an ordered variable, as well as roll_belt, pitch_forearm, magnet_dumbbell_z, and other observed spatial orientation variables. Rpart uses pitch_forearm, roll_belt, roll_dumbbell, and other variables.

Assessing variable correlation

First we analyze the correlation between the y variable and the timestamp and numeric variables. Both of these show a cascading pattern compared against the classifier variable, consistent with the reality that the exercise classifications were clumped together and performed sequentially by the test subjects.

featurePlot(x=trainingset[,c("raw_timestamp_part_1", "num_window")],
            y=trainingset$classe,
            plot="pairs")

For a more realistic analysis, it is important to disregard these variables, as we will discuss further below. This chart represents the correlation between the classifier variable and two of the top predictor variables aside from the synthetic, time and order variables above. Here we plot roll_belt and pitch_forearm. Both show some significant patterns with the classifier variables.

featurePlot(x=trainingset[,c("roll_belt","pitch_forearm")],
            y=trainingset$classe,
            plot="pairs")

Testing the Model

We then test the model on the remaining 15,695 observations in the quiz set, finding an accuracy of 99.26% using the random forests method but only 72% for the rpart classification tree method.

Confusion Matrices

Confusion matrices display a wide range of valuable data on a predictive test, including prediction accuracy, sensitivity, and specificity by class. As seen below, the RF model is highly accurate across all classes when sampled on the quiz set.

confusionMatrix(predict(RF20, quiz),quiz$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4454   12    0    0    0
##          B   10 3012   20    0    0
##          C    0   12 2704   17    0
##          D    0    1   13 2555    5
##          E    0    0    0    0 2880
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9943         
##                  95% CI : (0.993, 0.9954)
##     No Information Rate : 0.2844         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9927         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9978   0.9918   0.9879   0.9934   0.9983
## Specificity            0.9989   0.9976   0.9978   0.9986   1.0000
## Pos Pred Value         0.9973   0.9901   0.9894   0.9926   1.0000
## Neg Pred Value         0.9991   0.9980   0.9975   0.9987   0.9996
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2838   0.1919   0.1723   0.1628   0.1835
## Detection Prevalence   0.2845   0.1938   0.1741   0.1640   0.1835
## Balanced Accuracy      0.9983   0.9947   0.9929   0.9960   0.9991

In contrast the rpart method has only 72% accuracy and has less than 50% sensitivity for Class C.

confusionMatrix(predict(RP20, quiz),quiz$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3780  632  555  361   78
##          B  275 1893   92   14   16
##          C  140  426 1348   73  165
##          D  201   86  724 1992  341
##          E   68    0   18  132 2285
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7198          
##                  95% CI : (0.7127, 0.7269)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6434          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8468   0.6233  0.49251   0.7745   0.7920
## Specificity            0.8552   0.9686  0.93795   0.8970   0.9830
## Pos Pred Value         0.6992   0.8266  0.62639   0.5957   0.9129
## Neg Pred Value         0.9335   0.9147  0.89744   0.9530   0.9545
## Prevalence             0.2844   0.1935  0.17439   0.1639   0.1838
## Detection Rate         0.2408   0.1206  0.08589   0.1269   0.1456
## Detection Prevalence   0.3444   0.1459  0.13711   0.2131   0.1595
## Balanced Accuracy      0.8510   0.7960  0.71523   0.8357   0.8875

Visualizing Accuracy Across Metrics

These accuracies are visualized below.

accurateRF<-predict(RF20,trainingset)==trainingset$classe
accurateRP<-predict(RP20,trainingset)==trainingset$classe

RFplot<-
  qplot(classe,
      pitch_forearm,
      data=trainingset,
      colour=predict(RF20,trainingset),
      main="RF Model Accuracy: pitch_forearm"
      )+theme(legend.position = "none")

RPplot<-
  qplot(classe,
      pitch_forearm,
      data=trainingset,
      colour=predict(RP20,trainingset),
      main="RP Model Accuracy: pitch_forearm")+
      theme(legend.position = "none")

grid.arrange(RFplot,RPplot, ncol=2)

Hard Mode

Several variables which progressed sequentially with the data were important to the classification tree, including the raw time stamp and the num_window variable. Because the exercise methods were also sequential, this may represent an unfair training and testing advantage for this dataset relative to efforts to classify these behaviors observed outside of an experimental setting. In order to rectify this, we attempt to train and test a new model without the benefit of these sequential variables. This effort will be limited to the more accurate random forests method.

trainhard<-trainingset[,-(1:6)]
testhard<-testingset[,-(1:6)]
# 
# inTrain<-createDataPartition(trainhard$classe,
#                              p=.2,
#                              list=FALSE)
# 
# train<-trainhard[inTrain,]
# quiz<-trainhard[-inTrain,]
# 
# rm(inTrain)

# start<-proc.time()
# RF20hardmode<-train(classe ~ .,
#               method="rf",
#               data=train)
# end<-proc.time()-start

# saveRDS(RF20hardmode,file = "rf modfit 20 hard.RDS")

RF20hardmode<-readRDS(file = "rf modfit 20 hard.RDS")

RFHardPredicts<-predict(RF20hardmode, newdata=quiz)

Notably, the hard mode model is comparable in its accuracy to the model fitted including the timestamp data, coming in at 97.9% accuracy on the quiz set. It also returns the same results as the RF model for the final test set.

confusionMatrix(predict(RF20hardmode, quiz),quiz$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4439   73    0    1    1
##          B   15 2931   50   13    8
##          C    7   27 2668   40   10
##          D    2    5   19 2512   47
##          E    1    1    0    6 2819
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9792          
##                  95% CI : (0.9769, 0.9814)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9737          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9944   0.9651   0.9748   0.9767   0.9771
## Specificity            0.9933   0.9932   0.9935   0.9944   0.9994
## Pos Pred Value         0.9834   0.9715   0.9695   0.9718   0.9972
## Neg Pred Value         0.9978   0.9916   0.9947   0.9954   0.9949
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2828   0.1867   0.1700   0.1601   0.1796
## Detection Prevalence   0.2876   0.1922   0.1753   0.1647   0.1801
## Balanced Accuracy      0.9939   0.9792   0.9842   0.9856   0.9882

cbind(predict(RF20, testingset),predict(RF20hardmode,testhard))

##       [,1] [,2]
##  [1,]    2    2
##  [2,]    1    1
##  [3,]    2    2
##  [4,]    1    1
##  [5,]    1    1
##  [6,]    5    5
##  [7,]    4    4
##  [8,]    2    2
##  [9,]    1    1
## [10,]    1    1
## [11,]    2    2
## [12,]    3    3
## [13,]    2    2
## [14,]    1    1
## [15,]    5    5
## [16,]    5    5
## [17,]    1    1
## [18,]    2    2
## [19,]    2    2
## [20,]    2    2

varImp(RF20hardmode)

## rf variable importance
## 
##   only 20 most important variables shown (out of 52)
## 
##                      Overall
## roll_belt            100.000
## pitch_forearm         57.201
## yaw_belt              52.211
## magnet_dumbbell_z     45.978
## roll_forearm          41.690
## pitch_belt            40.906
## magnet_dumbbell_y     39.930
## accel_dumbbell_y      23.363
## roll_dumbbell         19.135
## magnet_dumbbell_x     17.666
## accel_forearm_x       16.464
## magnet_belt_z         15.726
## magnet_belt_y         13.197
## accel_dumbbell_z      12.954
## accel_belt_z          12.496
## total_accel_dumbbell  10.976
## magnet_forearm_z      10.472
## gyros_belt_z          10.164
## yaw_dumbbell           9.668
## magnet_belt_x          9.437

Final Test Accuracy Across Models

Finally, we benchmark the various models against the testing set accuracy, finding that

correctset<-predict(RF20, testingset)
RF20set<-predict(RF20, testingset)
RP20set<-predict(RP20, testingset)
RF20hardSet<-predict(RF20hardmode, testhard)

Sets<-data.frame(RF20set,RP20set,RF20hardSet,correctset)

knitr::kable(Sets,row.names = 1:20,col.names = c("Random Forests", "RPart", "Random Forests Hard Mode", "Correct"))

## Warning in if (is.na(row.names)) row.names = has_rownames(x): the condition
## has length > 1 and only the first element will be used

## Warning in if (row.names) {: the condition has length > 1 and only the
## first element will be used

	Random Forests	RPart	Random Forests Hard Mode	Correct
1	B	D	B	B
2	A	C	A	A
3	B	A	B	B
4	A	A	A	A
5	A	A	A	A
6	E	E	E	E
7	D	D	D	D
8	B	D	B	B
9	A	A	A	A
10	A	A	A	A
11	B	B	B	B
12	C	C	C	C
13	B	B	B	B
14	A	A	A	A
15	E	E	E	E
16	E	E	E	E
17	A	A	A	A
18	B	B	B	B
19	B	D	B	B
20	B	B	B	B

paste("Random Forests Test Accuracy: ",  sum(RF20set==correctset)/20*100, "%", sep="")

## [1] "Random Forests Test Accuracy: 100%"

paste("RPart Test Accuracy: ",  sum(RP20set==correctset)/20*100, "%", sep="")

## [1] "RPart Test Accuracy: 75%"

paste("Random Forests Hard Mode Test Accuracy: ",  sum(RF20hardSet==correctset)/20*100,"%", sep="")

## [1] "Random Forests Hard Mode Test Accuracy: 100%"