Practical Machine Learning Course Project

Background

This report describes a model that predicts the manner (the classe variable) in which six participants performed barbell lifts.

Getting Data

First, download the raw data to a local folder, and then load the data into R.

setwd("~/R/PMLproject")
pmltrain <- read.csv("pml-training.csv")

Model Building

Pre-load the (potentially) required packages for model building.

library(AppliedPredictiveModeling)
library(caret)
library(ElemStatLearn)

It is important to take a quick look at the raw dataset before starting. Many columns are either empty or contain only a few values (the rest being NA), so the raw data needs to be cleaned up.

First, copy the data into a new data frame and remove columns in which at least 90% of the values are NA, since variables with that many missing values contribute little to a model. Next, remove any near-zero-variance covariates, which are similarly uninformative. Finally, remove the first five columns, which are not covariates. The end result is a new data frame with 54 columns (53 variables and 1 outcome).

newtrain <- pmltrain
newtrain <- newtrain[,colSums(is.na(newtrain))<nrow(newtrain)*0.9]
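nsv <- nearZeroVar(newtrain)
# Guard against an empty result: newtrain[, -integer(0)] would drop every column
if (length(nsv) > 0) newtrain <- newtrain[, -nsv]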
newtrain <- newtrain[,-c(1:5)]
dim(newtrain)

## [1] 19622    54
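
The missing-value filter can be illustrated on a small example. The following is a minimal base-R sketch, assuming a made-up data frame (the names toy, roll and sparse are hypothetical and not part of the course data); it uses colMeans as an equivalent way of expressing the 90% rule.

```r
# Toy data frame (hypothetical values): 'sparse' is 95% NA
toy <- data.frame(
  roll   = seq(1, 2, length.out = 20),
  sparse = c(rep(NA, 19), 0.5),
  classe = factor(rep(c("A", "B", "C", "D", "E"), 4))
)

# Fraction of missing values in each column
na_fraction <- colMeans(is.na(toy))

# Keep only the columns with fewer than 90% missing values
keep <- na_fraction < 0.9
cleaned <- toy[, keep, drop = FALSE]

print(na_fraction)
print(names(cleaned))
```

Using drop = FALSE keeps the result a data frame even if only one column survives the filter.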

Now the new dataset can be split into training and testing data sets. Since the sample size is large, the dataset can be split equally.

inTrain <- createDataPartition(newtrain$classe, p = 0.5, list = FALSE)
training <- newtrain[inTrain, ]
testing <- newtrain[-inTrain, ]
dim(training)

## [1] 9812   54

dim(testing)

## [1] 9810   54
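
The split is random, so the row counts and later results will vary from run to run unless a seed is set. The sketch below shows the general idea of a stratified 50/50 split in base R on a made-up outcome vector. It is an illustration of the technique under stated assumptions, not caret's actual implementation, and the seed value and class counts are arbitrary.

```r
set.seed(123)  # arbitrary seed, chosen only so the sketch is repeatable

# Toy outcome vector (hypothetical): five classes with unequal frequencies
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(28, 19, 17, 16, 20)))

# Sample half of the row indices within each class, so that both halves
# keep roughly the same class proportions
idx_by_class <- split(seq_along(classe), classe)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(length(idx) / 2))]
}), use.names = FALSE)

train_classes <- classe[in_train]
test_classes  <- classe[-in_train]

print(table(train_classes))
print(table(test_classes))
```

In the report itself, calling set.seed before createDataPartition and randomForest would serve the same purpose.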

Random forest is widely regarded as one of the most accurate classification methods, so it is used here to build the predictive model. It grows many decision trees, each on a bootstrap sample of the observations, and considers a random subset of the variables at each split. This can be done with the “rf” method of the train function in the ‘caret’ package; however, earlier attempts were unsuccessful because of excessively long processing times.

Fortunately, the separate ‘randomForest’ package provides the same method as a stand-alone function. Using it, a model is fit for classe against all other variables in the training dataset. The out-of-bag (OOB) estimate of the error rate is 0.49%, which is extremely low. Because each OOB prediction comes only from trees that did not see that observation, this figure is already an estimate of out-of-sample error rather than a plain training error.

library(randomForest)
modFit <- randomForest(classe ~ ., data=training)
modFit

## 
## Call:
##  randomForest(formula = classe ~ ., data = training) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.49%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 2789    1    0    0    0 0.0003584229
## B    8 1888    3    0    0 0.0057925224
## C    0   15 1696    0    0 0.0087668030
## D    0    0   14 1592    2 0.0099502488
## E    0    0    0    5 1799 0.0027716186
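
The figures in this output can be checked by hand. The sketch below re-enters the confusion matrix printed above and recomputes the OOB error rate and the per-class errors; it also shows where the value 7 for "variables tried at each split" comes from, since randomForest defaults to floor(sqrt(p)) for classification and there are 53 predictors here. The object names are chosen for illustration only.

```r
# Confusion matrix copied from the randomForest output above
# (rows = actual class, columns = OOB-predicted class)
oob <- matrix(
  c(2789,    1,    0,    0,    0,
       8, 1888,    3,    0,    0,
       0,   15, 1696,    0,    0,
       0,    0,   14, 1592,    2,
       0,    0,    0,    5, 1799),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# OOB error rate = misclassified cases / all cases
oob_error <- 1 - sum(diag(oob)) / sum(oob)
print(round(100 * oob_error, 2))   # 0.49 (percent)

# Per-class error = 1 - correct / row total
class_error <- 1 - diag(oob) / rowSums(oob)
print(round(class_error, 4))

# Default mtry for classification with p = 53 predictors
print(floor(sqrt(53)))             # 7
```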

Model Validation

Now the model can be applied to the held-out testing set. The confusionMatrix function in the ‘caret’ package reports 99.4% accuracy (95% CI: 99.22% to 99.54%), which corresponds to an estimated out-of-sample error of about 0.6%. This hold-out validation shows that the model built on the training set generalizes almost perfectly to the testing set.

predictions <- predict(modFit, testing)
testcm <- confusionMatrix(predictions, testing$classe)
testcm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2790   13    0    0    0
##          B    0 1877   17    0    0
##          C    0    8 1694   13    0
##          D    0    0    0 1591    4
##          E    0    0    0    4 1799
## 
## Overall Statistics
##                                           
##                Accuracy : 0.994           
##                  95% CI : (0.9922, 0.9954)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9924          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9889   0.9901   0.9894   0.9978
## Specificity            0.9981   0.9979   0.9974   0.9995   0.9995
## Pos Pred Value         0.9954   0.9910   0.9878   0.9975   0.9978
## Neg Pred Value         1.0000   0.9973   0.9979   0.9979   0.9995
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2844   0.1913   0.1727   0.1622   0.1834
## Detection Prevalence   0.2857   0.1931   0.1748   0.1626   0.1838
## Balanced Accuracy      0.9991   0.9934   0.9937   0.9945   0.9986
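
The headline statistics follow directly from the confusion matrix. As a minimal sketch, the code below re-enters the table printed above (in caret's layout, rows are predictions and columns are the reference classes) and recomputes the accuracy, the out-of-sample error, and two of the per-class statistics. The object names are illustrative.

```r
# Confusion matrix copied from the confusionMatrix output above
# (rows = predicted class, columns = reference class)
cm <- matrix(
  c(2790,   13,    0,    0,    0,
       0, 1877,   17,    0,    0,
       0,    8, 1694,   13,    0,
       0,    0,    0, 1591,    4,
       0,    0,    0,    4, 1799),
  nrow = 5, byrow = TRUE,
  dimnames = list(prediction = LETTERS[1:5], reference = LETTERS[1:5])
)

# Accuracy = correctly classified cases / all cases
accuracy <- sum(diag(cm)) / sum(cm)
out_of_sample_error <- 1 - accuracy

# Sensitivity: correct predictions / all reference cases of that class
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions / all predictions of that class
pos_pred_value <- diag(cm) / rowSums(cm)

print(round(accuracy, 3))
print(round(sensitivity, 4))
print(round(pos_pred_value, 4))
```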

Prediction For Test Cases

Finally, this model is used to predict the classe of twenty test cases. As before, the raw data was downloaded to a local folder and loaded into R. Columns with at least 90% NA values and the first five non-covariate columns are removed. The end result is a new data frame with 55 columns (54 variables and 1 problem id). The near-zero-variance filter is not reapplied here, and this frame has one more variable than the training frame (54 versus 53).

pmltest <- read.csv("pml-testing.csv")
newtest <- pmltest
newtest <- newtest[,colSums(is.na(newtest))<nrow(newtest)*0.9]
newtest <- newtest[,-c(1:5)]
dim(newtest)

## [1] 20 55
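
Since the training and test frames were cleaned with slightly different steps, it can be worth confirming that every predictor used by the model is present in the test frame before predicting. The sketch below shows one way to do this with setdiff. The column names are hypothetical stand-ins; in the report the inputs would be names(training) and names(newtest).

```r
# Hypothetical stand-ins for names(training) and names(newtest)
train_cols <- c("var_a", "var_b", "var_c", "classe")
test_cols  <- c("var_extra", "var_a", "var_b", "var_c", "problem_id")

# Predictors are all training columns except the outcome
predictors <- setdiff(train_cols, "classe")

# Predictors the test frame lacks (should be empty)
missing_in_test <- setdiff(predictors, test_cols)

# Test columns that are not model predictors
extra_in_test <- setdiff(test_cols, predictors)

print(missing_in_test)
print(extra_in_test)
```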

The following is the prediction for the twenty test cases using the fitted model. The first row lists the test case numbers and the second row gives the classe predicted by the model for each.

submission <- predict(modFit, newtest[,-55])
submission

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

The function provided on the assignment page generates twenty individual text files, each containing the predicted classe for one problem id (i.e. test case). After the files are confirmed to be in the working folder, they can be submitted for grading. Upon submission, all twenty cases were found to be predicted correctly.

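# Write one text file per test case, named problem_id_<i>.txt
pml_write_files <- function(x) {
  n <- length(x)
  for (i in seq_len(n)) {
    filename <- paste0("problem_id_", i, ".txt")
    # Each file holds only the predicted classe, without quotes or headers
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}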
pml_write_files(submission)