Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, my goal was to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which each participant completed the exercise. Each participant was asked to perform barbell lifts correctly (Class A) and incorrectly in 4 different ways (Classes B-E).

More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Data

The data for this project come from the Weight Lifting Exercise Dataset cited above. The training data (pml-training.csv) and the test data (pml-testing.csv) are read in during Step 1 below.

My goal was to create a report describing how I built the prediction models, how I used cross-validation, and what I expected the out-of-sample error to be, while justifying each choice I made along the way. Analyses were completed in R, and the report was generated using R Markdown and knitr.

Step 1: Clean the Data

Using the validation set approach, I split the training data into two sub-samples: a training set (70%) and a testing set (30%). I then used the provided test data as the validation data for my final model. I also removed variables that contained missing values, variables related to date-stamps and timestamps, and metadata variables that would not be used to predict class assignment. After that, I removed any variables with near-zero variance.

    library(caret)  #createDataPartition(), nearZeroVar(), train(), confusionMatrix()
    library(dplyr)  #select() and the %>% pipe
    
    pml = read.csv("./data/pml-training.csv", header=T)
    pml$classe = factor(pml$classe) #Ensure the outcome is a factor (read.csv in R >= 4.0 imports it as character)
    
    #Split the data into training, testing, and validation sets. We will use these sub-samples to evaluate the out-of-sample error later.
        intrain = createDataPartition(y=pml$classe, p=0.7, list=FALSE) #Put 70% of the data in training and 30% in testing
        training = pml[intrain,]
        testing = pml[-intrain,] #Include all rows that don't appear in the "intrain" subset
        validation = read.csv("./data/pml-testing.csv", header=T) #The provided test file serves as the validation set
        
    #Remove the variables that contain missing values:
        training = training[, colSums(is.na(training))==0]
        testing = testing[, colSums(is.na(testing))==0]
        validation = validation[, colSums(is.na(validation))==0]
        
    #The first few columns are related to date-stamps, timestamps, and other metadata; we don't need them to predict classe.
        #We also don't need the new_window or num_window columns. 
        training = training %>% select(-(1:7))
        testing = testing %>% select(-(1:7))
        validation = validation %>% select(-(1:7))
        
    #Also remove the variables with near-zero variance (i.e., variables that are almost constant and carry little information).    
        #Only do this with the training and testing data, NOT the validation data!
        nzv = nearZeroVar(training) #Set saveMetrics=TRUE to see the details behind each flag
        training = training[, -nzv]
        testing = testing[, -nzv]

Step 2: Preprocessing

Next, I investigated the correlations between the remaining predictors, excluding the classe variable. I also plotted the correlation matrix with colors to visually highlight variables that were highly correlated (|r| > 0.75). I subsequently removed these highly correlated variables from the training set. Thanks to http://rismyhammer.com/ml/Pre-Processing.html for the incredibly helpful documentation on correlation matrices, visualization, and removal of highly correlated variables.

    library(corrplot) #corrplot() for visualizing the correlation matrix
    
    #Investigate correlated predictors (column 53 is the classe variable)
    corrtable = cor(training[, -53])
    summary(corrtable[upper.tri(corrtable)]) #Summary table of the min, 25%, median/mean, 75%, and max correlations
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.992077 -0.112831  0.002472  0.002050  0.091479  0.981343
        #Min = -0.99 and max = 0.98!
    
    corrplot(corrtable, order="FPC", method="color", type = "upper", tl.cex = 0.8, tl.col = rgb(0, 0, 0))

    #Given a correlation matrix, findCorrelation() flags predictors for removal:
    highcorrvars = findCorrelation(corrtable, cutoff = 0.75)
    names(training)[highcorrvars]
##  [1] "accel_belt_z"      "roll_belt"         "accel_belt_y"     
##  [4] "total_accel_belt"  "accel_dumbbell_z"  "accel_belt_x"     
##  [7] "pitch_belt"        "magnet_dumbbell_x" "accel_dumbbell_y" 
## [10] "magnet_dumbbell_y" "accel_dumbbell_x"  "accel_arm_x"      
## [13] "accel_arm_z"       "magnet_arm_y"      "magnet_belt_z"    
## [16] "accel_forearm_y"   "gyros_forearm_y"   "gyros_dumbbell_x" 
## [19] "gyros_dumbbell_z"  "gyros_arm_x"
    #Remove the highly correlated variables from the training set only:
    training = training[, -highcorrvars]

Step 3: Model Building

I fit three separate models to the training subset: 1.) a decision tree, 2.) a random forest, and 3.) a generalized boosted regression model. Then, I evaluated each model on the testing subset. Each model centers and scales (i.e., standardizes) the predictors before training. The confusion matrix for each model is presented below, along with a corresponding plot comparing accuracy in the training and testing subsets.

Decision Tree Classification Model

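The model-fitting code is not echoed in the report, so the following is a minimal sketch of how this model could have been fit with caret. The "rpart" method and the object name ctmodel are assumptions; ctpredict matches the prediction object reused in Step 4.

    #Hypothetical sketch: fit a decision tree (rpart) on centered/scaled predictors,
    #then predict on the testing subset and inspect the confusion matrix.
    ctmodel = train(classe ~ ., method = "rpart",
                    preProcess = c("center", "scale"),
                    data = training)
    ctpredict = predict(ctmodel, newdata = testing)
    confusionMatrix(ctpredict, testing$classe)
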
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1027  206   28   51   27
##          B  330  683  210  336  489
##          C  243  215  741  232  181
##          D   74   34   46  254   37
##          E    0    1    1   91  348
## 
## Overall Statistics
##                                                
##                Accuracy : 0.5188               
##                  95% CI : (0.5059, 0.5316)     
##     No Information Rate : 0.2845               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3939               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.6135   0.5996   0.7222  0.26349  0.32163
## Specificity            0.9259   0.7124   0.8207  0.96119  0.98064
## Pos Pred Value         0.7670   0.3335   0.4597  0.57079  0.78912
## Neg Pred Value         0.8577   0.8812   0.9333  0.86949  0.86517
## Prevalence             0.2845   0.1935   0.1743  0.16381  0.18386
## Detection Rate         0.1745   0.1161   0.1259  0.04316  0.05913
## Detection Prevalence   0.2275   0.3480   0.2739  0.07562  0.07494
## Balanced Accuracy      0.7697   0.6560   0.7715  0.61234  0.65113

Random Forest Model

For this model, I used parallel processing to reduce processing time (see the trainControl() set-up sketched below). Thanks to the following site for the guidance on parallel processing code: https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-randomForestPerformance.md. To improve the precision of the resampling estimates, I used cross-validation with 5 folds.

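Again, the fitting code is not echoed above, so here is a minimal sketch of the set-up described in this section. The cluster configuration follows the referenced guide, rfmodel matches the object applied to the validation data in Step 6, and rfpredict matches the prediction object used in Step 4; the names rfcontrol and the exact arguments are assumptions.

    library(parallel)
    library(doParallel)
    
    #Hypothetical sketch: register a parallel backend, then train a random forest
    #with 5-fold cross-validation on centered/scaled predictors.
    cluster = makeCluster(detectCores() - 1) #Leave one core for the operating system
    registerDoParallel(cluster)
    
    rfcontrol = trainControl(method = "cv", number = 5, allowParallel = TRUE)
    rfmodel = train(classe ~ ., method = "rf",
                    preProcess = c("center", "scale"),
                    trControl = rfcontrol,
                    data = training)
    
    stopCluster(cluster)
    registerDoSEQ() #Return to sequential processing
    
    rfpredict = predict(rfmodel, newdata = testing)
    confusionMatrix(rfpredict, testing$classe)
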
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    7    0    0    0
##          B    2 1128    6    0    0
##          C    1    4 1015   10    0
##          D    0    0    5  953    1
##          E    0    0    0    1 1081
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9937               
##                  95% CI : (0.9913, 0.9956)     
##     No Information Rate : 0.2845               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.992                
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9903   0.9893   0.9886   0.9991
## Specificity            0.9983   0.9983   0.9969   0.9988   0.9998
## Pos Pred Value         0.9958   0.9930   0.9854   0.9937   0.9991
## Neg Pred Value         0.9993   0.9977   0.9977   0.9978   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2839   0.1917   0.1725   0.1619   0.1837
## Detection Prevalence   0.2851   0.1930   0.1750   0.1630   0.1839
## Balanced Accuracy      0.9983   0.9943   0.9931   0.9937   0.9994

Generalized Boosted Regression Model

For this model, I again used parallel processing to reduce processing time (see the trainControl() set-up sketched below). Thanks to the following site for the guidance on parallel processing code: https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-randomForestPerformance.md. To improve the precision of the resampling estimates, I used repeated cross-validation with 5 folds and 1 repeat.

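As with the random forest, the fitting code is not echoed, so below is a minimal sketch of the repeated cross-validation set-up described above. The "gbm" method, boostmodel, and boostcontrol are assumptions; boostpredict matches the prediction object used in Step 4.

    #Hypothetical sketch: repeated 5-fold cross-validation (1 repeat) with the same
    #parallel backend approach used for the random forest.
    cluster = makeCluster(detectCores() - 1)
    registerDoParallel(cluster)
    
    boostcontrol = trainControl(method = "repeatedcv", number = 5, repeats = 1, allowParallel = TRUE)
    boostmodel = train(classe ~ ., method = "gbm",
                       preProcess = c("center", "scale"),
                       trControl = boostcontrol,
                       data = training)
    
    stopCluster(cluster)
    registerDoSEQ()
    
    boostpredict = predict(boostmodel, newdata = testing)
    confusionMatrix(boostpredict, testing$classe)
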
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.6094             nan     0.1000    0.1956
##      2        1.4854             nan     0.1000    0.1452
##      3        1.3955             nan     0.1000    0.1019
##      4        1.3296             nan     0.1000    0.0920
##      5        1.2734             nan     0.1000    0.0775
##      6        1.2252             nan     0.1000    0.0664
##      7        1.1841             nan     0.1000    0.0521
##      8        1.1509             nan     0.1000    0.0584
##      9        1.1154             nan     0.1000    0.0506
##     10        1.0832             nan     0.1000    0.0445
##     20        0.8381             nan     0.1000    0.0226
##     40        0.6020             nan     0.1000    0.0134
##     60        0.4732             nan     0.1000    0.0067
##     80        0.3881             nan     0.1000    0.0036
##    100        0.3266             nan     0.1000    0.0035
##    120        0.2798             nan     0.1000    0.0024
##    140        0.2424             nan     0.1000    0.0024
##    150        0.2255             nan     0.1000    0.0015
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1635   44    0    1    2
##          B   22 1049   60    4   17
##          C    9   40  946   38   14
##          D    5    1   19  914   13
##          E    3    5    1    7 1036
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9482               
##                  95% CI : (0.9422, 0.9537)     
##     No Information Rate : 0.2845               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.9344               
##                                                
##  Mcnemar's Test P-Value : 0.0000001787         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9767   0.9210   0.9220   0.9481   0.9575
## Specificity            0.9888   0.9783   0.9792   0.9923   0.9967
## Pos Pred Value         0.9721   0.9106   0.9035   0.9601   0.9848
## Neg Pred Value         0.9907   0.9810   0.9835   0.9899   0.9905
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2778   0.1782   0.1607   0.1553   0.1760
## Detection Prevalence   0.2858   0.1958   0.1779   0.1618   0.1788
## Balanced Accuracy      0.9828   0.9496   0.9506   0.9702   0.9771

Step 4: Ensemble Learning

After creating three different models, I combined the classifiers by "stacking" their predictions. To do this, I built a new data frame consisting of the predictions from each model along with the classe variable pulled from the testing dataset. Then, I fit a new model that relates the classe variable to the three prediction variables using the random forest method. Note that because this stacked model is both trained and evaluated on the testing subset, its reported accuracy is likely optimistic.

        #1. 
        #Build a new dataset that consists of predictions from each model
        #and create a classe variable that is pulled from the classe variable in the testing dataset
        preddf = data.frame(ctpredict, rfpredict, boostpredict, classe=testing$classe)
        
        #2.     
        #Fit a new model that relates the classe variable to the three prediction variables
        #Instead of fitting a model based on the predictors, we are fitting a model based on the predictions!
        combmodfit = train(classe ~ ., method = "rf", 
                           preProcess = c("center", "scale"),
                           data=preddf)
        combpred = predict(combmodfit, preddf)
        
        #Confusion Matrix:
        confusionMatrix(combpred, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    7    0    0    0
##          B    2 1128    6    0    0
##          C    1    4 1015   10    0
##          D    0    0    5  953    1
##          E    0    0    0    1 1081
## 
## Overall Statistics
##                                                
##                Accuracy : 0.9937               
##                  95% CI : (0.9913, 0.9956)     
##     No Information Rate : 0.2845               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.992                
##                                                
##  Mcnemar's Test P-Value : NA                   
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9903   0.9893   0.9886   0.9991
## Specificity            0.9983   0.9983   0.9969   0.9988   0.9998
## Pos Pred Value         0.9958   0.9930   0.9854   0.9937   0.9991
## Neg Pred Value         0.9993   0.9977   0.9977   0.9978   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2839   0.1917   0.1725   0.1619   0.1837
## Detection Prevalence   0.2851   0.1930   0.1750   0.1630   0.1839
## Balanced Accuracy      0.9983   0.9943   0.9931   0.9937   0.9994
        comb_accuracy = confusionMatrix(combpred, testing$classe)$overall["Accuracy"]

Step 5: Evaluate the Out-of-Sample Error and Accuracy of Each Model

Finally, I evaluated the accuracy of each model using the test sub-sample and selected the model with the highest accuracy.

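The per-model accuracy objects rounded below (ct_accuracy, rf_accuracy, boost_accuracy) are not defined in the code shown above; presumably they were extracted from each model's confusion matrix on the testing subset, in the same way as comb_accuracy. A minimal sketch, assuming the prediction objects from the sketches in Step 3:

        #Extract the accuracy of each individual model from its confusion matrix on the testing subset
        ct_accuracy = confusionMatrix(ctpredict, testing$classe)$overall["Accuracy"]
        rf_accuracy = confusionMatrix(rfpredict, testing$classe)$overall["Accuracy"]
        boost_accuracy = confusionMatrix(boostpredict, testing$classe)$overall["Accuracy"]
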
        #What is the resulting accuracy on the test set? 
        #Is it better or worse than each of the individual predictions? 
        ct_accuracy = round(ct_accuracy, 4)
        rf_accuracy = round(rf_accuracy, 4)
        boost_accuracy = round(boost_accuracy, 4)
        comb_accuracy = round(comb_accuracy, 4)

The random forest model (0.9937) and the stacked model (0.9937) achieved the same accuracy on the testing subset. Since stacking added complexity without improving performance, I selected the random forest model to apply to the validation data. Given its accuracy, I expect the out-of-sample error rate to be approximately 1 - 0.9937 ≈ 0.6%, i.e., below 1%.

Step 6: Apply the Best Model to the Validation Data

        results = predict(rfmodel, newdata = validation)
        #Results are suppressed and will be submitted to the quiz portion of the course project.