Executive Summary

In this project, we predict the manner in which an exercise was performed, a single outcome variable (classe) with five classes, using the randomForest package. Subjects were asked to correctly perform a unilateral dumbbell biceps curl and then asked to perform the same exercise incorrectly in four different ways. The subjects wore sensors on different parts of their bodies as well as on the dumbbells they were lifting.

The estimated out-of-sample error, measured on a held-out test set, was 0.41%.

The classes we are predicting consist of the following:

  A. Exactly according to the specification (the correct execution)
  B. Throwing the elbows to the front
  C. Lifting the dumbbell only halfway
  D. Lowering the dumbbell only halfway
  E. Throwing the hips to the front

Methodology

The initial data consisted of 160 variables, but most of them contained about 98% missing (NA) values, and the first few were personal-identification variables that we weren't interested in. Hence, we simply dropped these variables from both the training and test data. We partitioned the remaining data into training and test sets, and used the final "testing" file as a validation set for submission on the course website, where all 20 cases were predicted correctly.

We built multiple models with different methodologies but in the end chose Random Forest for its speed of computation and its accuracy. The other models we built were a regression tree and a neural network. We also built these same models on Principal Component Analysis scores, but the overall fit was not as good as when using all the variables: PCA reduces dimensionality, but in doing so it discards variance that this classification problem needs.
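For reference, here is a minimal sketch of how the alternative models could be fit through caret's train() interface. The method identifiers "rpart" and "nnet" are caret's standard names for trees and single-hidden-layer neural networks, the PCA variants add preProcess = "pca", and the exact tuning settings we used are not reproduced here (data.train is created in the R Code section below).

# Sketch: alternative models via caret (illustrative settings)
modFit.tree <- train(classe ~ ., data = data.train, method = "rpart")
modFit.nnet <- train(classe ~ ., data = data.train, method = "nnet", trace = FALSE)
modFit.treePCA <- train(classe ~ ., data = data.train, method = "rpart",
                        preProcess = "pca")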

Coding Steps:

  1. Read and clean the data
  2. Partition the data into training and testing sets
  3. Build a model with the randomForest() function on the training dataset
  4. Evaluate out-of-sample accuracy on the held-out testing set (a single hold-out split; see the cross-validation sketch after this list)
  5. Predict on the final validation set
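
Note that step 4 estimates out-of-sample accuracy from a single hold-out split rather than k-fold cross-validation. If k-fold cross-validation were wanted instead, caret's trainControl() provides it; a minimal sketch, where the 5-fold setting is an illustrative assumption rather than what we ran:

# Sketch: 5-fold cross-validation of a random forest via caret
ctrl <- trainControl(method = "cv", number = 5)
modFit.cv <- train(classe ~ ., data = data.train, method = "rf", trControl = ctrl)
modFit.cv$results  # accuracy averaged across the folds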

R Code

Data Reading & Cleaning

# Load Libraries
library(caret)
library(randomForest)
library(corrplot)
# Read and Clean Data
data <- read.csv("pml-training.csv", na.strings = c("NA", ""))
data.valid <- read.csv("pml-testing.csv", na.strings = c("NA", ""))

# Determine Data's NA Columns: 60 columns have no NAs, 100 are ~98% NA
table(round(apply(data, 2, function(x) sum(is.na(x)))/nrow(data), 2))
## 
##    0 0.98 
##   60  100
columns.naCount <- apply(data, 2, function(x) sum(is.na(x)))

# Remove Data's NA Columns
columns.NA <- names(columns.naCount)[columns.naCount > 0]
columns.notNA <- names(columns.naCount)[columns.naCount == 0]
data <- data[, columns.notNA]
data <- data[, -c(1:7)]  # drop row index, user name, timestamp, and window columns

# Set Seed
set.seed(3599)

# Partition Data
inTrain <- createDataPartition(y = data$classe, p = 0.75, list = F)
data.train <- data[inTrain, ]
data.test <- data[-inTrain, ]
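
As a quick sanity check (not part of the original pipeline), one can confirm the split has the expected shape and class balance:

# Sketch: verify the partition
dim(data.train)                       # roughly 75% of the rows, 53 columns
dim(data.test)                        # the remaining rows
prop.table(table(data.train$classe))  # class proportions preserved by createDataPartition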

Model Building

# Explore Data
cor.data <- cor(data[, -53])  # correlation matrix of the 52 predictors (column 53 is classe)
corrplot(cor.data, method = "color")
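
To make the correlation plot actionable, caret's findCorrelation() can list the predictors involved in strong pairwise correlations; the 0.9 cutoff below is an illustrative assumption:

# Sketch: flag highly correlated predictors
highCor <- findCorrelation(cor.data, cutoff = 0.9)
names(data)[highCor]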

# Train Random Forest
modFit.RF <- randomForest(classe ~ ., data = data.train)  # default settings (ntree = 500)

# Helper: format a proportion as a percentage string
percent <- function(x, digits = 2, format = "f", ...) {
    paste0(formatC(100 * x, format = format, digits = digits, ...), "%")
}
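
As a quick usage check of the helper:

percent(0.0041)
## [1] "0.41%"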

# Confusion Matrix on the held-out test set. Note that caret's confusionMatrix()
# expects (predictions, reference); with the order reversed as here, the table is
# transposed, but the overall accuracy is unaffected.
confMatrix <- confusionMatrix(data.test$classe, predict(modFit.RF, data.test))

confMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1393    2    0    0    0
##          B    4  942    3    0    0
##          C    0    2  853    0    0
##          D    0    0    8  795    1
##          E    0    0    0    0  901
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9959          
##                  95% CI : (0.9937, 0.9975)
##     No Information Rate : 0.2849          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9948          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9971   0.9958   0.9873   1.0000   0.9989
## Specificity            0.9994   0.9982   0.9995   0.9978   1.0000
## Pos Pred Value         0.9986   0.9926   0.9977   0.9888   1.0000
## Neg Pred Value         0.9989   0.9990   0.9973   1.0000   0.9998
## Prevalence             0.2849   0.1929   0.1762   0.1621   0.1839
## Detection Rate         0.2841   0.1921   0.1739   0.1621   0.1837
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9983   0.9970   0.9934   0.9989   0.9994
# Out-of-Sample Error Rate (1 - accuracy on the held-out test set)
percent(1 - confMatrix$overall[[1]])
## [1] "0.41%"
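
The fitted forest also carries an internal out-of-bag (OOB) error estimate, which should land close to the hold-out figure above; a sketch for pulling it from the model object (err.rate is a standard component of a randomForest classification fit):

# Sketch: OOB error after the final tree
percent(tail(modFit.RF$err.rate[, "OOB"], 1))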

Final Prediction

predict(modFit.RF, data.valid)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E
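
For the course submission mentioned in the Methodology section, each prediction was written to its own text file. Below is a minimal sketch of such a helper; the function name and the problem_id_ file-naming convention are assumptions following the course's suggested template:

# Sketch: write one prediction per file for submission (hypothetical helper)
pml_write_files <- function(x) {
    for (i in seq_along(x)) {
        write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                    quote = FALSE, row.names = FALSE, col.names = FALSE)
    }
}
pml_write_files(as.character(predict(modFit.RF, data.valid)))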