Machine Learning Final Project

Overview

This project outlines a method for taking belt, forearm, arm, and dumbell acceleration and gyroscope measurements and predicting whether the intended exercise was done correctly or not. The data were collected by Groupware@LES.

Cleaning the Data

The training data were first read in to create a data frame for the building data with exercise classifications marked A (correctly performed), as well as B - E (some type of incorrect performance of the exercise). The non-sensor data was removed from the data as well as measurement types with more than 10% missing values because model building and prediction are greatly impacted by missing values. Finally, several columns had numeric data in character format so those were converted for analysis.

     projBuild <- read.csv("pml-training.csv", stringsAsFactors = FALSE)
    
     ## Change outcome variable to class factor for analysis.
     projBuild$classe <- as.factor(projBuild$classe)
     
     ## 1st 7 columns are non-sensor data.
     projBuild <- projBuild[,-(1:7)]
     
      ## Convert all character columns to numeric.
     projBuild[,names(projBuild)[sapply(projBuild, class) == "character"]] <- 
          sapply(projBuild[,names(projBuild)[
          sapply(projBuild, class) == "character"]], function(x) 
               x <- as.numeric(x))
     
     ## Several columns had a lot of missing values.
     sapply(projBuild, function(x) paste(round(100*sum(is.na(x))/length(x),1),"% NAs",sep=""))

##                roll_belt               pitch_belt                 yaw_belt 
##                 "0% NAs"                 "0% NAs"                 "0% NAs" 
##         total_accel_belt       kurtosis_roll_belt      kurtosis_picth_belt 
##                 "0% NAs"                "98% NAs"              "98.1% NAs" 
##        kurtosis_yaw_belt       skewness_roll_belt     skewness_roll_belt.1 
##               "100% NAs"                "98% NAs"              "98.1% NAs" 
##        skewness_yaw_belt            max_roll_belt           max_picth_belt 
##               "100% NAs"              "97.9% NAs"              "97.9% NAs" 
##             max_yaw_belt            min_roll_belt           min_pitch_belt 
##                "98% NAs"              "97.9% NAs"              "97.9% NAs" 
##             min_yaw_belt      amplitude_roll_belt     amplitude_pitch_belt 
##                "98% NAs"              "97.9% NAs"              "97.9% NAs" 
##       amplitude_yaw_belt     var_total_accel_belt            avg_roll_belt 
##                "98% NAs"              "97.9% NAs"              "97.9% NAs" 
##         stddev_roll_belt            var_roll_belt           avg_pitch_belt 
##              "97.9% NAs"              "97.9% NAs"              "97.9% NAs" 
##        stddev_pitch_belt           var_pitch_belt             avg_yaw_belt 
##              "97.9% NAs"              "97.9% NAs"              "97.9% NAs" 
##          stddev_yaw_belt             var_yaw_belt             gyros_belt_x 
##              "97.9% NAs"              "97.9% NAs"                 "0% NAs" 
##             gyros_belt_y             gyros_belt_z             accel_belt_x 
##                 "0% NAs"                 "0% NAs"                 "0% NAs" 
##             accel_belt_y             accel_belt_z            magnet_belt_x 
##                 "0% NAs"                 "0% NAs"                 "0% NAs" 
##            magnet_belt_y            magnet_belt_z                 roll_arm 
##                 "0% NAs"                 "0% NAs"                 "0% NAs" 
##                pitch_arm                  yaw_arm          total_accel_arm 
##                 "0% NAs"                 "0% NAs"                 "0% NAs" 
##            var_accel_arm             avg_roll_arm          stddev_roll_arm 
##              "97.9% NAs"              "97.9% NAs"              "97.9% NAs" 
##             var_roll_arm            avg_pitch_arm         stddev_pitch_arm 
##              "97.9% NAs"              "97.9% NAs"              "97.9% NAs" 
##            var_pitch_arm              avg_yaw_arm           stddev_yaw_arm 
##              "97.9% NAs"              "97.9% NAs"              "97.9% NAs" 
##              var_yaw_arm              gyros_arm_x              gyros_arm_y 
##              "97.9% NAs"                 "0% NAs"                 "0% NAs" 
##              gyros_arm_z              accel_arm_x              accel_arm_y 
##                 "0% NAs"                 "0% NAs"                 "0% NAs" 
##              accel_arm_z             magnet_arm_x             magnet_arm_y 
##                 "0% NAs"                 "0% NAs"                 "0% NAs" 
##             magnet_arm_z        kurtosis_roll_arm       kurtosis_picth_arm 
##                 "0% NAs"              "98.3% NAs"              "98.3% NAs" 
##         kurtosis_yaw_arm        skewness_roll_arm       skewness_pitch_arm 
##                "98% NAs"              "98.3% NAs"              "98.3% NAs" 
##         skewness_yaw_arm             max_roll_arm            max_picth_arm 
##                "98% NAs"              "97.9% NAs"              "97.9% NAs" 
##              max_yaw_arm             min_roll_arm            min_pitch_arm 
##              "97.9% NAs"              "97.9% NAs"              "97.9% NAs" 
##              min_yaw_arm       amplitude_roll_arm      amplitude_pitch_arm 
##              "97.9% NAs"              "97.9% NAs"              "97.9% NAs" 
##        amplitude_yaw_arm            roll_dumbbell           pitch_dumbbell 
##              "97.9% NAs"                 "0% NAs"                 "0% NAs" 
##             yaw_dumbbell   kurtosis_roll_dumbbell  kurtosis_picth_dumbbell 
##                 "0% NAs"                "98% NAs"              "97.9% NAs" 
##    kurtosis_yaw_dumbbell   skewness_roll_dumbbell  skewness_pitch_dumbbell 
##               "100% NAs"                "98% NAs"              "97.9% NAs" 
##    skewness_yaw_dumbbell        max_roll_dumbbell       max_picth_dumbbell 
##               "100% NAs"              "97.9% NAs"              "97.9% NAs" 
##         max_yaw_dumbbell        min_roll_dumbbell       min_pitch_dumbbell 
##                "98% NAs"              "97.9% NAs"              "97.9% NAs" 
##         min_yaw_dumbbell  amplitude_roll_dumbbell amplitude_pitch_dumbbell 
##                "98% NAs"              "97.9% NAs"              "97.9% NAs" 
##   amplitude_yaw_dumbbell     total_accel_dumbbell       var_accel_dumbbell 
##                "98% NAs"                 "0% NAs"              "97.9% NAs" 
##        avg_roll_dumbbell     stddev_roll_dumbbell        var_roll_dumbbell 
##              "97.9% NAs"              "97.9% NAs"              "97.9% NAs" 
##       avg_pitch_dumbbell    stddev_pitch_dumbbell       var_pitch_dumbbell 
##              "97.9% NAs"              "97.9% NAs"              "97.9% NAs" 
##         avg_yaw_dumbbell      stddev_yaw_dumbbell         var_yaw_dumbbell 
##              "97.9% NAs"              "97.9% NAs"              "97.9% NAs" 
##         gyros_dumbbell_x         gyros_dumbbell_y         gyros_dumbbell_z 
##                 "0% NAs"                 "0% NAs"                 "0% NAs" 
##         accel_dumbbell_x         accel_dumbbell_y         accel_dumbbell_z 
##                 "0% NAs"                 "0% NAs"                 "0% NAs" 
##        magnet_dumbbell_x        magnet_dumbbell_y        magnet_dumbbell_z 
##                 "0% NAs"                 "0% NAs"                 "0% NAs" 
##             roll_forearm            pitch_forearm              yaw_forearm 
##                 "0% NAs"                 "0% NAs"                 "0% NAs" 
##    kurtosis_roll_forearm   kurtosis_picth_forearm     kurtosis_yaw_forearm 
##              "98.4% NAs"              "98.4% NAs"               "100% NAs" 
##    skewness_roll_forearm   skewness_pitch_forearm     skewness_yaw_forearm 
##              "98.4% NAs"              "98.4% NAs"               "100% NAs" 
##         max_roll_forearm        max_picth_forearm          max_yaw_forearm 
##              "97.9% NAs"              "97.9% NAs"              "98.4% NAs" 
##         min_roll_forearm        min_pitch_forearm          min_yaw_forearm 
##              "97.9% NAs"              "97.9% NAs"              "98.4% NAs" 
##   amplitude_roll_forearm  amplitude_pitch_forearm    amplitude_yaw_forearm 
##              "97.9% NAs"              "97.9% NAs"              "98.4% NAs" 
##      total_accel_forearm        var_accel_forearm         avg_roll_forearm 
##                 "0% NAs"              "97.9% NAs"              "97.9% NAs" 
##      stddev_roll_forearm         var_roll_forearm        avg_pitch_forearm 
##              "97.9% NAs"              "97.9% NAs"              "97.9% NAs" 
##     stddev_pitch_forearm        var_pitch_forearm          avg_yaw_forearm 
##              "97.9% NAs"              "97.9% NAs"              "97.9% NAs" 
##       stddev_yaw_forearm          var_yaw_forearm          gyros_forearm_x 
##              "97.9% NAs"              "97.9% NAs"                 "0% NAs" 
##          gyros_forearm_y          gyros_forearm_z          accel_forearm_x 
##                 "0% NAs"                 "0% NAs"                 "0% NAs" 
##          accel_forearm_y          accel_forearm_z         magnet_forearm_x 
##                 "0% NAs"                 "0% NAs"                 "0% NAs" 
##         magnet_forearm_y         magnet_forearm_z                   classe 
##                 "0% NAs"                 "0% NAs"                 "0% NAs"

     ## Only keep sensor measurement types with less than 10% NA values.
     projBuild <- projBuild[,names(projBuild)[sapply(projBuild, function(x) 100*sum(is.na(x))/length(x) < 10)]]
     
     ## Print out a table of the outcomes and counts for each outcome.
     uniqueOutcomesTable <- table(projBuild$classe)
     
     ## Display a plot of the counts for each outcome (A-E) for the building
     ## data set.
     outcomesPlot <- ggplot(data.frame(uniqueOutcomesTable), aes(Var1, Freq)) + 
          geom_bar(stat = "Identity", fill = "red") + 
          labs(x="Unique Outcomes") + 
          labs(y="Counts") + 
          labs(title="Unique Outcomes Counts Plot") + 
          theme_minimal() + 
          theme(title = element_text(face = "bold", size=18),
                plot.margin=unit(c(2,4,2,2),"lines"),
                axis.title.x = element_text(margin = ggplot2::margin(25,0,0,0), size=16),
                axis.title.y = element_text(margin = ggplot2::margin(0,25,0,0), size=16))

     print(uniqueOutcomesTable)

## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607

     print(outcomesPlot)

Anomolous Data

Row 5373 has 3 high-leverage outliers in the gyros_forearm_y, gyros_forearm_z and gyros_dumbbell_z columns. This row was therefore removed.

     ## Identify only gyroscope data.
     gyro <- projBuild[,names(projBuild)[grep("gyro", names(projBuild))]]
     
     ## Return which gyro data rows have the outlier.
     sapply(projBuild[,names(projBuild)[
          grep("gyro", names(projBuild))]], function(x) match(x[abs(x)>200],x))

## $gyros_belt_x
## integer(0)
## 
## $gyros_belt_y
## integer(0)
## 
## $gyros_belt_z
## integer(0)
## 
## $gyros_arm_x
## integer(0)
## 
## $gyros_arm_y
## integer(0)
## 
## $gyros_arm_z
## integer(0)
## 
## $gyros_dumbbell_x
## [1] 5373
## 
## $gyros_dumbbell_y
## integer(0)
## 
## $gyros_dumbbell_z
## [1] 5373
## 
## $gyros_forearm_x
## integer(0)
## 
## $gyros_forearm_y
## [1] 5373
## 
## $gyros_forearm_z
## [1] 5373

     ## Create plots of each pair of correlated variables.
     plotList <- list()
     
     plotList[[1]] <- 
          ggplot(projBuild, aes(1,gyros_forearm_y)) + 
          geom_boxplot(color = "red") + 
          labs(x=NULL) + 
          labs(y="gyros_forearem_y") + 
          labs(title=NULL) + 
          theme_minimal() + 
          theme(title = element_text(face = "bold", size=12),
               ## plot.margin=unit(c(2,4,2,2),"lines"),
                axis.text.x = element_blank(),
                axis.title.y = element_text(margin = ggplot2::margin(0,25,0,0), size=10))
     
     plotList[[2]] <- 
          ggplot(projBuild, aes(1,gyros_forearm_z)) + 
          geom_boxplot(color = "red") + 
          labs(x=NULL) + 
          labs(y="gyros_forearem_z") + 
          labs(title=NULL) + 
          theme_minimal() + 
          theme(title = element_text(face = "bold", size=12),
               ## plot.margin=unit(c(2,4,2,2),"lines"),
                axis.text.x = element_blank(),
                axis.title.y = element_text(margin = ggplot2::margin(0,25,0,0), size=10))
     
     plotList[[3]] <- 
          ggplot(projBuild, aes(1,gyros_dumbbell_z)) + 
          geom_boxplot(color = "red") + 
          labs(x=NULL) + 
          labs(y="gyros_dumbbell_z") + 
          labs(title=NULL) + 
          theme_minimal() + 
          theme(title = element_text(face = "bold", size=12),
              ##  plot.margin=unit(c(2,4,2,2),"lines"),
                axis.text.x = element_blank(),
                axis.title.y = element_text(margin = ggplot2::margin(0,25,0,0), size=10))
     
     ## And print the plots.
     grid.arrange(plotList[[1]], plotList[[2]], ncol=2)

      ## Remove the offending row.
     projBuild <- projBuild[-5373,]

Correlated Variables

Because there are 53 sensor variables being tracked on only 4 different areas of the body or equipment, there are several pairs of highly correlated variables. As can be seen below, 15 out of the remaining 53 sensor variables (about 28%) from the building data set are highly correlated (abs(r) > .8) with other variables. Any model for this data will need to use a weighted average approach to including predictor variables in order to overcome this.

     ## Create matrix of correlations between all predictor variables.
     M <- abs(cor(projBuild[,-length(projBuild[1,])])) 
     
     ## Transoform the matrix to lower triangular so that we get unique 
     ## correlated variable pairs.
     lowerTriangularM <- M*lower.tri(M, diag=FALSE)
     
     ## Create a table of correlated variable indices.
     corVarsIndices <- which(lowerTriangularM > .8, arr.ind=TRUE)

     ## Use the indices to grab the variable names and store them in a data frame.
     correlatedVariables <- 
          data.frame(rownames(M)[corVarsIndices[,1]],
                     colnames(M)[corVarsIndices[,2]])
     colnames(correlatedVariables) <- c("Var 1", "Var 2")
     
     ## Create plots of each pair of correlated variables.
     plotList <- list()
     for(i in 1:length(correlatedVariables[,1])){
          plotList[[i]] <- 
               ggplot(data=projBuild,aes_string(
                    as.character(correlatedVariables[i, 1]),
                    as.character(correlatedVariables[i, 2]))) +
               geom_point(color="red") + theme_minimal() + 
               theme(plot.title = element_text(face="bold"));
          
          plotList[[i]] <- plotList[[i]] +
               labs(title = paste("r =", round(eval(parse(
                    text=paste("cor(projBuild$", plotList[[i]]$labels[[1]],
                    ", projBuild$", plotList[[i]]$labels[[2]],
                    ")", sep=""))), 2)))
     }
     
     ## And print the first 3 plots.
     grid.arrange(plotList[[11]],plotList[[12]], ncol=2)

Selecting a Model

Initially, the building data was further subsetted into a training set, a testing set and a validation set. A tree model, a random forest model, and a boosted tree model were trained on the training set, and then tested on the test set. They were then combined in a general additive model which was trained on the testing set. Finally, all four models were applied to the validation set to directly compare their results.

     ## Use parallel processing to speed up calculations.
     registerDoMC(cores=8)
     inProjTrain <- createDataPartition(projBuild$classe, p = .7, list=FALSE)
     
     projTrain <- projBuild[inProjTrain,]
      
     projTestAndValidate <- projBuild[-inProjTrain,]
     
     inProjTest <- createDataPartition(projTestAndValidate$classe, p = .5, list=FALSE)
     
     projTest <- projTestAndValidate[inProjTest,]
     
     projValidate <- projTestAndValidate[-inProjTest,]

     ## Perform parallel processing to speed up computations.
     registerDoMC(cores=8)
      
     ## Fit a tree model to the training data.
     rpartProjModel <- train(classe~., method="rpart", data=projTrain, preProcess = c("center", "scale"))

     ## Perform parallel processing to speed up computations.
     registerDoMC(cores=8)
     
     ## Fit a random forest model to the training data.
     rfProjModel <- train(classe~., method="rf", data=projTrain, preProcess = c("center", "scale"))

     ## Perform parallel processing to speed up computations.
     registerDoMC(cores=8)
      
     ## Fit a gradient boosting machine to the training data.
     gbmProjModel <- train(classe~., method="gbm", data=projTrain, preProcess = c("center", "scale"), verbose=FALSE)

     ## Make predictions on the test set.
     rpartPredTest <- predict(rpartProjModel, projTest)
     
     rfPredTest <- predict(rfProjModel, projTest)
     
     gbmPredTest <- predict(gbmProjModel, projTest)

     ## Combine predictions from 3 models and train on the test set.
     predDFtest <- data.frame(rpartPredTest, rfPredTest, gbmPredTest, classe = projTest$classe)
     
     registerDoMC(cores=4)
     comboModFit <- train(classe~., method="gam", data=predDFtest)
     
     comboPredTest <- predict(comboModFit, predDFtest)

Validation test results

As can be seen below, the tree model performed the worst on the validation set with an accuracy of 50.77%, the boosted tree model did better at 96.12%, and the random forest performed the best at 99.29%. These 3 models were also combined combined but performed poorly with an accuracy of 65.76%. Since the random forest was the best-performing individual model, and the combination of models did not help, a random forest model was selected going forward.

     ## Make predictions from all 3 models on the validation set.
     rpartPredValidate <- predict(rpartProjModel, projValidate)
      
     rfPredValidate <- predict(rfProjModel, projValidate)
     
     gbmPredValidate <- predict(gbmProjModel, projValidate)
     
     ## Combine predictions from 3 models and predict on the validation set.
     predDFvalidate <- data.frame(rpartPredValidate, rfPredValidate, gbmPredValidate, classe=projValidate$classe)
     
     comboModFit <- train(classe~., method="rpart", data=predDFvalidate)
     
     comboPredValidate <- predict(comboModFit, predDFvalidate)

     ## Print Tree Model Results
     print(confusionMatrix(rpartPredValidate, projValidate$classe))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 528 102  12  19  15
##          B   2 109  13   5   7
##          C 164  94 326 150 109
##          D 140 264 162 308 188
##          E   2   0   0   0 222
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5077          
##                  95% CI : (0.4894, 0.5259)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3865          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.6316  0.19156   0.6355   0.6390  0.41035
## Specificity            0.9297  0.98862   0.7871   0.6934  0.99917
## Pos Pred Value         0.7811  0.80147   0.3867   0.2900  0.99107
## Neg Pred Value         0.8640  0.83601   0.9109   0.9074  0.88259
## Prevalence             0.2843  0.19347   0.1744   0.1639  0.18395
## Detection Rate         0.1795  0.03706   0.1108   0.1047  0.07548
## Detection Prevalence   0.2299  0.04624   0.2866   0.3611  0.07616
## Balanced Accuracy      0.7806  0.59009   0.7113   0.6662  0.70476

     ## Print Random Forest Model Results
     print(confusionMatrix(rfPredValidate, projValidate$classe))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 836   6   0   0   0
##          B   0 562   3   0   0
##          C   0   1 506   1   2
##          D   0   0   4 480   3
##          E   0   0   0   1 536
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9929          
##                  95% CI : (0.9891, 0.9956)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.991           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9877   0.9864   0.9959   0.9908
## Specificity            0.9971   0.9987   0.9984   0.9972   0.9996
## Pos Pred Value         0.9929   0.9947   0.9922   0.9856   0.9981
## Neg Pred Value         1.0000   0.9971   0.9971   0.9992   0.9979
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1840
## Detection Rate         0.2843   0.1911   0.1721   0.1632   0.1823
## Detection Prevalence   0.2863   0.1921   0.1734   0.1656   0.1826
## Balanced Accuracy      0.9986   0.9932   0.9924   0.9965   0.9952

     ## Print Boosted Tree Model Results
     print(confusionMatrix(gbmPredValidate, projValidate$classe))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 828  21   0   0   1
##          B   5 534  17   3   9
##          C   0  12 487  11   4
##          D   3   0   8 460   9
##          E   0   2   1   8 518
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9612          
##                  95% CI : (0.9536, 0.9679)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9509          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9904   0.9385   0.9493   0.9544   0.9575
## Specificity            0.9895   0.9857   0.9889   0.9919   0.9954
## Pos Pred Value         0.9741   0.9401   0.9475   0.9583   0.9792
## Neg Pred Value         0.9962   0.9853   0.9893   0.9911   0.9905
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1840
## Detection Rate         0.2815   0.1816   0.1656   0.1564   0.1761
## Detection Prevalence   0.2890   0.1931   0.1748   0.1632   0.1799
## Balanced Accuracy      0.9900   0.9621   0.9691   0.9731   0.9765

     ## Print Combined Model Results
     print(confusionMatrix(comboPredValidate, projValidate$classe))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 836   7 510 481   5
##          B   0 562   3   0   0
##          C   0   0   0   0   0
##          D   0   0   0   0   0
##          E   0   0   0   1 536
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6576          
##                  95% CI : (0.6401, 0.6748)
##     No Information Rate : 0.2843          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5444          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9877   0.0000   0.0000   0.9908
## Specificity            0.5235   0.9987   1.0000   1.0000   0.9996
## Pos Pred Value         0.4546   0.9947      NaN      NaN   0.9981
## Neg Pred Value         1.0000   0.9971   0.8256   0.8361   0.9979
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1840
## Detection Rate         0.2843   0.1911   0.0000   0.0000   0.1823
## Detection Prevalence   0.6253   0.1921   0.0000   0.0000   0.1826
## Balanced Accuracy      0.7618   0.9932   0.5000   0.5000   0.9952

Model Development and Cross Validation

The trainControl function in the caret package allows model creation to be cross validated so that the out of sample error rate can be estimated and the best paramters selected. A new randomization seed was selected and a random forest model was trained on the building set which was subsetted by the train function into 25 different training-test set pairs using a 632 bootstrapping method for cross validation.

     set.seed(12345)
     
     ## Create instructions for cross-validating train function results.
     control <- trainControl(method="boot632", number = 25, returnResamp = "all")
      
     ## Perform parallel processing to speed up computations.
     registerDoMC(cores=8)
     
     ## Fit a random forest model to the training data.
     rfControlModel <- train(classe~., method="rf", data=projBuild, preProcess = c("center", "scale"), trControl = control)

##rfProjModel$r
    
     print(rfControlModel)

## Random Forest 
## 
## 19621 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (52), scaled (52) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 19621, 19621, 19621, 19621, 19621, 19621, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9957192  0.9945826
##   27    0.9956004  0.9944324
##   52    0.9902780  0.9876954
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.

     confusionMatrix(predict(rfControlModel), projBuild$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 5579    0    0    0    0
##          B    0 3797    0    0    0
##          C    0    0 3422    0    0
##          D    0    0    0 3216    0
##          E    0    0    0    0 3607
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9998, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

Results and Test Predictions

As can be seen above in the confusion matrix output, the best run of the random forest model predicted perfectly on the building set. The out of sample error rate, averaged over all runs by the 632 bootstrap method, was estimated to be 1 - 99.57% = .43%. The following code would print out the test predictions.

     projTest <- read.csv("pml-testing.csv", stringsAsFactors = FALSE)
     
     testSetPredictions <- predict(rfControlModel, projTest)
     
     print(testSetPredictions)