Prediction Assignment

1. Overview

People regularly quantify how much of an exercise they do, but rarely measure how well they perform it. In this investigation, data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants is used to predict the manner in which each participant performed the exercise (the classe variable). Three methods are compared to find the most accurate predictor: random forest, decision tree, and generalized boosted model. The analysis is carried out in R.

2. Data preparation

The data for this project is taken from the Human Activity Recognition project by Groupware@LES. For more information, please visit their website.

As the first step in this investigation, data preparation is needed. The following code is used to load the corresponding libraries.

library(knitr)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
library(corrplot)
library(e1071)

The next step is to download the datasets from the URLs provided and store them in the training and testing variables.

trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainFile <- "pml-training.csv"
testFile <- "pml-testing.csv"

if(!file.exists(trainFile))
{
  download.file(trainURL, destfile = trainFile)
}
if(!file.exists(testFile))
{
  download.file(testURL, destfile = testFile)
}

training <- read.csv(trainFile)
testing <- read.csv(testFile)

So that the models can be evaluated on data they were not fitted on, the training dataset is partitioned into 2 subsets:
* trainSet: consists of 70% of the dataset; will be used for the modeling process
* validationSet: consists of 30% of the dataset; will be held out to validate the fitted models

trainingPartition <- createDataPartition(training$classe, p = 0.7, list = FALSE)
trainSet <- training[trainingPartition, ]
validationSet <- training[-trainingPartition, ]
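For a factor outcome such as classe, createDataPartition samples within each class, so the class proportions are preserved in both subsets. The following is a minimal base-R sketch of that idea, not caret's implementation; the outcome vector and the names idxByClass and trainIdx are hypothetical and used only for illustration.

```r
# Base-R illustration of stratified sampling (not caret's implementation)
set.seed(1234)

# Hypothetical outcome with unbalanced classes
outcome <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Draw 70% of the row indices within each class
idxByClass <- split(seq_along(outcome), outcome)
trainIdx <- unlist(lapply(idxByClass, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}))

# Class proportions in the full data and in the training subset
fullProps  <- prop.table(table(outcome))
trainProps <- prop.table(table(outcome[trainIdx]))
print(round(fullProps, 3))
print(round(trainProps, 3))
```

The partition is drawn at random, so the exact rows in each subset depend on the state of the random number generator at the time createDataPartition is called.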

To ensure classification rules can be applied to the dataset, data cleansing must be done. The following steps are applied:
1. Remove the constant and almost constant variables across the sample
2. Remove variables that consist of more than 95% missing values
3. Remove identification variables, such as time and user information

# Remove constant and almost constant variables across the sample
NZV <- nearZeroVar(trainSet)
trainSet <- trainSet[, -NZV]
validationSet <- validationSet[, -NZV]
# Remove variables with mostly missing values
na <- sapply(trainSet, function(x) mean(is.na(x))) > 0.95
trainSet <- trainSet[, na == FALSE]
validationSet <- validationSet[, na == FALSE]
# Remove identification variables
trainSet <- trainSet[, -(1:5)]
validationSet <- validationSet[, -(1:5)]

After this cleansing process, there are 53 predictor variables suited for analysis, plus the outcome variable classe.
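The missing-value filter used above can be seen in isolation on a small example. This is a sketch on a hypothetical toy data frame (the columns a, b, and c are invented for illustration); it applies the same mean(is.na(x)) test with the same 0.95 threshold.

```r
# Hypothetical toy data: column b is entirely missing, column c is 25% missing
toy <- data.frame(
  a = c(1.2, 3.4, 5.6, 7.8),
  b = c(NA, NA, NA, NA),
  c = c(10, NA, 30, 40)
)

# Share of missing values per column
naShare <- sapply(toy, function(x) mean(is.na(x)))
print(naShare)

# Keep only columns where at most 95% of the values are missing
keep <- naShare <= 0.95
cleaned <- toy[, keep, drop = FALSE]
print(names(cleaned))
```

Note that is.na() does not flag empty strings in character columns. If such columns should count as missing, one option is to pass na.strings to read.csv when loading the files.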

3. Exploratory analysis

To gain better insight into the relationships between the variables, a correlation analysis is performed.

plotCorrelation <- cor(trainSet[, -54])
corrplot(plotCorrelation, method = "color", order = "AOE", type = "lower", tl.cex = 0.5, tl.col = rgb(0, 0, 0), title = "Figure 1: Correlation Plot", mar=c(0,0,1,0))

In Figure 1, strongly positively correlated pairs of variables are shown in dark blue, while strongly negatively correlated pairs are shown in dark red.
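With 53 predictors the plot is dense, so it can help to list the strongly correlated pairs as a table. The sketch below shows one way to do that in base R. It uses the built-in mtcars data as a stand-in for trainSet[, -54], and the 0.8 cutoff is an arbitrary assumption chosen for illustration.

```r
# Built-in mtcars data stands in for trainSet[, -54] (an assumption for illustration)
corMatrix <- cor(mtcars)

# Keep only the lower triangle so each pair appears once
corMatrix[upper.tri(corMatrix, diag = TRUE)] <- NA

# Positions of pairs whose absolute correlation exceeds a hypothetical 0.8 cutoff
highIdx <- which(abs(corMatrix) > 0.8, arr.ind = TRUE)
highPairs <- data.frame(
  var1 = rownames(corMatrix)[highIdx[, "row"]],
  var2 = colnames(corMatrix)[highIdx[, "col"]],
  r    = round(corMatrix[highIdx], 3)
)

# Strongest relationships first
highPairs <- highPairs[order(-abs(highPairs$r)), ]
print(highPairs)
```

Applied to the project data, the same steps would start from cor(trainSet[, -54]) instead of cor(mtcars).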

4. Predictive models

Now, three popular methods will be applied to build classification models on the training dataset. A confusion matrix is printed and plotted at the end of each analysis to better visualize the accuracy of the models.

4.1 Random forest

# Set seed for reproducibility
set.seed(1234)
# Create random forest model
controlRF <- trainControl(method = "cv", number = 3, verboseIter = FALSE)
modelRF <- train(classe ~ ., data = trainSet, method = "rf", trControl = controlRF)
modelRF$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.21%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3905    1    0    0    0 0.0002560164
## B    7 2647    4    0    0 0.0041384500
## C    0    4 2392    0    0 0.0016694491
## D    0    0    7 2244    1 0.0035523979
## E    0    1    0    4 2520 0.0019801980

# Predict on the validation set
predictRF <- predict(modelRF, newdata = validationSet)
confusionMatrixRF <- confusionMatrix(predictRF, validationSet$classe)
confusionMatrixRF

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    6    0    0    0
##          B    0 1132    2    0    0
##          C    0    0 1024    4    0
##          D    0    1    0  960    1
##          E    1    0    0    0 1081
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9975          
##                  95% CI : (0.9958, 0.9986)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9968          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9939   0.9981   0.9959   0.9991
## Specificity            0.9986   0.9996   0.9992   0.9996   0.9998
## Pos Pred Value         0.9964   0.9982   0.9961   0.9979   0.9991
## Neg Pred Value         0.9998   0.9985   0.9996   0.9992   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1924   0.1740   0.1631   0.1837
## Detection Prevalence   0.2853   0.1927   0.1747   0.1635   0.1839
## Balanced Accuracy      0.9990   0.9967   0.9986   0.9977   0.9994

# Plot results
plot(confusionMatrixRF$table, col = confusionMatrixRF$byClass,
    main = paste("Figure 2: Random Forest Plot - Accuracy =",
                 round(confusionMatrixRF$overall['Accuracy'], 3)))
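The accuracy reported by confusionMatrix can be recomputed by hand from the table: it is the sum of the diagonal (correct predictions) divided by the total number of cases. The sketch below copies the random forest validation table from the output above into a matrix; the names cm, accuracy, and sensitivity are introduced here for illustration only.

```r
# Random forest confusion matrix copied from the validation output above
# (rows are predictions, columns are reference classes)
cm <- matrix(
  c(1673,    6,    0,   0,    0,
       0, 1132,    2,   0,    0,
       0,    0, 1024,   4,    0,
       0,    1,    0, 960,    1,
       1,    0,    0,   0, 1081),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Overall accuracy: correctly classified cases over all cases
accuracy <- sum(diag(cm)) / sum(cm)
print(round(accuracy, 4))

# Per-class sensitivity: correct predictions over reference totals
sensitivity <- diag(cm) / colSums(cm)
print(round(sensitivity, 4))
```

This gives 5870 / 5885, about 0.9975, which agrees with the Accuracy line in the output, and the per-class values match the Sensitivity row.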

4.2 Decision tree

# Set seed for reproducibility
set.seed(1234)
# Create decision tree model
modelDT <- rpart(classe ~ ., data = trainSet, method = "class")
# Predict on the validation set
predictDT <- predict(modelDT, newdata = validationSet, type = "class")
confusionMatrixDT <- confusionMatrix(predictDT, validationSet$classe)
confusionMatrixDT

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1481  170   18   60   79
##          B  134  707   24  117  138
##          C    9   51  901  130   68
##          D   21  118   69  601  129
##          E   29   93   14   56  668
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7405          
##                  95% CI : (0.7291, 0.7517)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6709          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8847   0.6207   0.8782   0.6234   0.6174
## Specificity            0.9223   0.9130   0.9469   0.9315   0.9600
## Pos Pred Value         0.8191   0.6313   0.7774   0.6407   0.7767
## Neg Pred Value         0.9527   0.9093   0.9736   0.9266   0.9176
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2517   0.1201   0.1531   0.1021   0.1135
## Detection Prevalence   0.3072   0.1903   0.1969   0.1594   0.1461
## Balanced Accuracy      0.9035   0.7668   0.9125   0.7775   0.7887

# Plot results
plot(confusionMatrixDT$table, col = confusionMatrixDT$byClass,
    main = paste("Figure 3: Decision Tree Plot - Accuracy =",
                 round(confusionMatrixDT$overall['Accuracy'], 3)))

4.3 Generalized boosted model

# Set seed for reproducibility
set.seed(1234)
# Create generalized boosted model
controlGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modelGBM <- train(classe ~ ., data = trainSet, method = "gbm", trControl = controlGBM, verbose = FALSE)
modelGBM$finalModel

## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 41 had non-zero influence.

# Predict on the validation set
predictGBM <- predict(modelGBM, newdata = validationSet)
confusionMatrixGBM <- confusionMatrix(predictGBM, validationSet$classe)
confusionMatrixGBM

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1670   12    0    1    0
##          B    3 1111   10    6    4
##          C    0   12 1012   11    1
##          D    1    4    4  944    5
##          E    0    0    0    2 1072
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9871          
##                  95% CI : (0.9839, 0.9898)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9837          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9976   0.9754   0.9864   0.9793   0.9908
## Specificity            0.9969   0.9952   0.9951   0.9972   0.9996
## Pos Pred Value         0.9923   0.9797   0.9768   0.9854   0.9981
## Neg Pred Value         0.9990   0.9941   0.9971   0.9959   0.9979
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2838   0.1888   0.1720   0.1604   0.1822
## Detection Prevalence   0.2860   0.1927   0.1760   0.1628   0.1825
## Balanced Accuracy      0.9973   0.9853   0.9907   0.9882   0.9952

# Plot results
plot(confusionMatrixGBM$table, col = confusionMatrixGBM$byClass,
    main = paste("Figure 4: Generalized Boosted Model Plot - Accuracy =",
                 round(confusionMatrixGBM$overall['Accuracy'], 3)))

5. Applying selected model to test data

In this investigation, the accuracy of the three models on the validation set is the following:
* Random forest: 0.9975
* Decision tree: 0.7405
* GBM: 0.9871

Therefore, the random forest model is used to predict the results for the test data.

# Store the result under its own name to avoid masking the predict() function
predictTest <- predict(modelRF, newdata = testing)
predictTest

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
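As a cross-check on the model choice in this section, the validation accuracies printed in the confusion matrix outputs of section 4 can be tabulated together with the corresponding estimated out-of-sample error (one minus accuracy). This is a small sketch; the object names accuracy, comparison, and bestModel are introduced here for illustration.

```r
# Validation accuracies as printed in the confusion matrix outputs of section 4
accuracy <- c(randomForest = 0.9975, decisionTree = 0.7405, gbm = 0.9871)

# Estimated out-of-sample error is one minus the validation accuracy
comparison <- data.frame(
  model = names(accuracy),
  accuracy = accuracy,
  errorRate = 1 - accuracy,
  row.names = NULL
)

# Most accurate model first
comparison <- comparison[order(-comparison$accuracy), ]
print(comparison)

bestModel <- as.character(comparison$model[1])
print(bestModel)
```

In the full analysis the same values could be read from confusionMatrixRF$overall['Accuracy'] and its counterparts rather than typed in by hand.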