This R Markdown file is the final report for Practical Machine Learning. The goal of this report is to predict the manner in which 6 participants performed a dumbbell exercise, one of five possible ways. I first split the labeled sample into a training and a testing subset, clean both in the same way, and then fit three different models to see which predicts the held-out data best. Based on those results, I chose the boosted regression (GBM) approach and submitted its predictions to the automated grader.
Citation for this data:
Velloso, E., Bulling, A., Gellersen, H., Ugulino, W., & Fuks, H. (2013). "Qualitative Activity Recognition of Weight Lifting Exercises." Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI.
The five possible outcomes (the classe variable) are:
A: Exactly according to the specification.
B: Throwing the elbows to the front.
C: Lifting the dumbbell only halfway.
D: Lowering the dumbbell only halfway.
E: Throwing the hips to the front.
# Load libraries for modeling and plotting
library(caret)
library(rattle)
library(rpart)
library(rpart.plot)
library(randomForest)
library(repmis)

# Set a seed for reproducibility
set.seed(13192)

# Read the labeled data, treating blanks and division errors as NA
training_data <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))

# 60/40 split of the labeled data into training and testing subsets
inTrain <- createDataPartition(y = training_data$classe, p = 0.6, list = FALSE)
training <- training_data[inTrain, ]
testing_subset <- training_data[-inTrain, ]

# The 20 unlabeled cases for the automated grader
testing_data <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))
With the partition created, we clean the data. First, we remove all variables with near-zero variance. Then we drop the row-index column, and finally we remove every variable where at least 1% of the values are missing.
# Identify near-zero-variance variables and drop them
myDataNZV <- nearZeroVar(training, saveMetrics = TRUE)
training_subset <- training[, !myDataNZV$nzv]

# Drop the first column (X, a row index)
training_subset <- training_subset[c(-1)]

# Drop every variable where 1% or more of its values are missing
na_fraction <- colMeans(is.na(training_subset))
training_final <- training_subset[, na_fraction < 0.01]
# Tidy up intermediate objects
rm(training_subset, myDataNZV, inTrain, training, training_data, na_fraction)
# Keep only the cleaned columns in both test sets; the unlabeled
# grader set has no classe column, so exclude it there by name
clean1 <- colnames(training_final)
testing_data <- testing_data[clean1[clean1 != "classe"]]
testing_subset <- testing_subset[clean1]
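The chunk that produced the confusion matrix below does not appear in this section. Judging by the libraries loaded above (rpart, rpart.plot, rattle) and the accuracy that follows, the first model was most likely a single classification tree; here is a minimal sketch of such a chunk, with modelfit_tree and predictions_tree as assumed names:

# Sketch of the (missing) first model: a single CART classification tree.
# modelfit_tree and predictions_tree are assumed names, not from the original.
modelfit_tree <- rpart(classe ~ ., data = training_final, method = "class")
fancyRpartPlot(modelfit_tree)  # tree diagram via rattle
predictions_tree <- predict(modelfit_tree, testing_subset, type = "class")
confusionMatrix(predictions_tree, testing_subset$classe)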
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 2163 208 9 2 0
B 52 1120 99 69 0
C 17 182 1234 137 51
D 0 8 26 1015 190
E 0 0 0 63 1201
Overall Statistics
Accuracy : 0.8581
95% CI : (0.8502, 0.8658)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8202
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9691 0.7378 0.9020 0.7893 0.8329
Specificity 0.9610 0.9652 0.9403 0.9659 0.9902
Pos Pred Value 0.9081 0.8358 0.7613 0.8192 0.9502
Neg Pred Value 0.9874 0.9388 0.9785 0.9590 0.9634
Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
Detection Rate 0.2757 0.1427 0.1573 0.1294 0.1531
Detection Prevalence 0.3036 0.1708 0.2066 0.1579 0.1611
Balanced Accuracy 0.9650 0.8515 0.9212 0.8776 0.9115
Our first model's output shows an overall accuracy of 85.81% on the held-out testing subset (an estimated out-of-sample error of about 14.19%). Not bad, but other models can likely do better.
# 5-fold cross-validation, one repeat
fitControl <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 1)

# Second model: gradient boosted trees (gbm) via caret
modelfit_boostedr <- train(classe ~ ., data = training_final, method = "gbm",
                           trControl = fitControl,
                           verbose = FALSE)
modelfit_boostedr$finalModel
A gradient boosted model with multinomial loss function.
150 iterations were performed.
There were 79 predictors of which 42 had non-zero influence.
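As an aside (not part of the original analysis), the relative influence behind that "42 had non-zero influence" line can be inspected with caret's varImp; gbm_importance is an assumed name:

# Rank predictors by relative influence in the boosted model
gbm_importance <- varImp(modelfit_boostedr)
plot(gbm_importance, top = 20)  # top 20 most influential predictors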
# Evaluate the boosted model on the held-out testing subset
predictions_boostedr <- predict(modelfit_boostedr, newdata = testing_subset)
gbmAccuracyTest <- confusionMatrix(predictions_boostedr, testing_subset$classe)
gbmAccuracyTest
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 2230 1 0 0 0
B 2 1508 0 0 0
C 0 1 1358 0 0
D 0 8 10 1283 0
E 0 0 0 3 1442
Overall Statistics
Accuracy : 0.9968
95% CI : (0.9953, 0.9979)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.996
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9991 0.9934 0.9927 0.9977 1.0000
Specificity 0.9998 0.9997 0.9998 0.9973 0.9995
Pos Pred Value 0.9996 0.9987 0.9993 0.9862 0.9979
Neg Pred Value 0.9996 0.9984 0.9985 0.9995 1.0000
Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
Detection Rate 0.2842 0.1922 0.1731 0.1635 0.1838
Detection Prevalence 0.2843 0.1925 0.1732 0.1658 0.1842
Balanced Accuracy 0.9995 0.9965 0.9963 0.9975 0.9998
# Accuracy across boosting iterations and tree depths
plot(modelfit_boostedr, ylim = c(0.8, 1))
Our second model, the boosted regression, shows 99.68% accuracy! That seems more than enough, but why not try just one more?
# Third model: random forest on the same cleaned training set
modelfit_rforest <- randomForest(classe ~ ., data = training_final)
predictions_rforest <- predict(modelfit_rforest, testing_subset, type = "class")
confusionMatrix(predictions_rforest, testing_subset$classe)
Confusion Matrix and Statistics
Reference
Prediction A B C D E
A 2230 0 0 0 0
B 2 1518 4 0 0
C 0 0 1363 0 0
D 0 0 1 1285 0
E 0 0 0 1 1442
Overall Statistics
Accuracy : 0.999
95% CI : (0.998, 0.9996)
No Information Rate : 0.2845
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9987
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: A Class: B Class: C Class: D Class: E
Sensitivity 0.9991 1.0000 0.9963 0.9992 1.0000
Specificity 1.0000 0.9991 1.0000 0.9998 0.9998
Pos Pred Value 1.0000 0.9961 1.0000 0.9992 0.9993
Neg Pred Value 0.9996 1.0000 0.9992 0.9998 1.0000
Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
Detection Rate 0.2842 0.1935 0.1737 0.1638 0.1838
Detection Prevalence 0.2842 0.1942 0.1737 0.1639 0.1839
Balanced Accuracy 0.9996 0.9995 0.9982 0.9995 0.9999
confusionmatrix2 <- confusionMatrix(predictions_rforest, testing_subset$classe)
plot(confusionmatrix2$table, col = confusionmatrix2$byClass,
     main = paste("Random Forest Confusion Matrix: Accuracy =",
                  round(confusionmatrix2$overall['Accuracy'], 4)))
Our third model, a random forest, showed 99.9% accuracy.
However, I like the methodology behind a boosted regression more than a random forest, and I find its output more visually appealing. I'm fine with accepting a 0.22% lower accuracy rate in exchange for interpretability, so I have chosen the boosted regression as my final model. This means I am estimating an out-of-sample error rate of 0.32%.
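As a quick sanity check on those figures, the estimated out-of-sample errors can be read straight off the two held-out confusion matrices (a sketch using the gbmAccuracyTest and confusionmatrix2 objects created above; gbm_error and rf_error are assumed names):

# Out-of-sample error estimate = 1 minus held-out accuracy
gbm_error <- 1 - unname(gbmAccuracyTest$overall['Accuracy'])   # ~0.0032
rf_error  <- 1 - unname(confusionmatrix2$overall['Accuracy'])  # ~0.0010
round(100 * c(gbm = gbm_error, rf = rf_error), 2)              # in percent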
# Generate predictions for the 20 grader cases using the chosen boosted model
predict_final <- predict(modelfit_boostedr, newdata = testing_data)

# Write one prediction per file, as the automated grader expects
prediction_write <- function(x) {
  n <- length(x)
  for (i in 1:n) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
prediction_write(predict_final)