1. Executive Summary

Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. Using the data collected, machine learning models are built and the best performing model is used to predict the 'classe' variable. The data are provided by http://groupware.les.inf.puc-rio.br/har.

A prediction model built with the random forest algorithm, evaluated against a validation set extracted from the training data, yielded 99.8% accuracy.

2. Setup runtime environment

Loading in all the necessary libraries.
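The libraries themselves are not listed in the text; a minimal sketch of the setup, assuming the models below are built with caret, rpart and randomForest (the seed value is illustrative, not taken from the original run):

library(caret)         # createDataPartition, nearZeroVar, confusionMatrix
library(rpart)         # decision tree model
library(randomForest)  # random forest model and rfcv
set.seed(12345)        # illustrative seed for reproducible partitions and fits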

3. Download data

The training and testing datasets are downloaded from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv and https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv, respectively.
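A minimal sketch of the download step, assuming the files are saved in the working directory under their original names:

trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# download only when the files are not already present
if (!file.exists("pml-training.csv")) download.file(trainURL, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv"))  download.file(testURL,  destfile = "pml-testing.csv")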

4. Load data

The downloaded data are loaded into memory.
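A minimal sketch of the load step, assuming the files were saved locally under their original names. The files are read without extra na.strings so that "#DIV/0!" and empty strings survive for the cleaning step in section 6:

# read the CSV files into the train and test variables used below;
# factors are needed later (e.g. classe as the response for randomForest)
train <- read.csv("pml-training.csv", stringsAsFactors = TRUE)
test  <- read.csv("pml-testing.csv",  stringsAsFactors = TRUE)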

A summary of the training data (train variable) and testing data (test) is as follows:

dim(train)
## [1] 19622   160
dim(test)
## [1]  20 160

5. Subset the training data into two datasets

A validation dataset is extracted from the downloaded training dataset: 60% is reserved for actual training and the remaining 40% for validation. The validation data is used to validate the prediction models before the best one is run once against the actual testing data.

# partition the training data: 60% for model training, 40% for validation
inTrain <- createDataPartition(y=train$classe, p=0.6, list=FALSE)
subTrain <- train[inTrain, ]
subValidate <- train[-inTrain, ]

A summary of the training data (subTrain variable) and validation data (subValidate) is as follows:

dim(subTrain)
## [1] 11776   160
dim(subValidate)
## [1] 7846  160

6. Prepare the datasets for training

Variables are excluded as predictors according to the following criteria:

  1. Variables in which more than 90% of the values are "NA", "#DIV/0!" or empty.
  2. The row serial number (X), timestamps and personal data (user_name).
  3. Variables with near zero variance.

The finalised list of variables selected for building the model is also applied to the validation data and testing data.

# remove columns in which more than 90% of observations are NA
cleanSubTrain <- subTrain[, !(colSums(is.na(subTrain))/dim(subTrain)[1] > 0.90)]

# remove columns containing '#DIV/0!' or empty strings
# (the elementwise `|` is needed so that every row of each column is tested)
cleanSubTrain <- cleanSubTrain[, apply(cleanSubTrain, 2, 
                                  function(x) sum(x == "#DIV/0!" | x == "", na.rm = TRUE)) == 0]

# remove variables that will not impact the prediction
drops <- c("X","user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", 
           "cvtd_timestamp")
cleanSubTrain <- cleanSubTrain[,!(names(cleanSubTrain) %in% drops)]

# identify covariates that have near zero variance and remove them;
# new_window is the column flagged by nearZeroVar
nsv <- nearZeroVar(cleanSubTrain, saveMetrics=T)
drops <- c("new_window")
cleanSubTrain <- cleanSubTrain[,!(names(cleanSubTrain) %in% drops)]

# apply the same cleaning process to the validation and test datasets by
# keeping only the columns that survived cleaning on the training set
# (the test set has no classe column, so match on names rather than subsetting by them)
cleanSubValidate <- subValidate[, names(subValidate) %in% names(cleanSubTrain)]
cleanTest <- test[, names(test) %in% names(cleanSubTrain)]

7. Using Machine Learning : Decision Tree

A decision tree model is built with rpart and evaluated against the validation data.

# time the model training for the comparison in section 10
dt.stime <- proc.time()
modFit.DT <- rpart(classe ~ ., data=cleanSubTrain, method="class")
dt.etime <- proc.time()

# predict on the validation set and summarise the performance
predict.DT <- predict(modFit.DT, cleanSubValidate, type = "class")
CM.DT <- confusionMatrix(predict.DT, cleanSubValidate$classe)
print(CM.DT)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1935  114    3   22    6
##          B  131 1130   86  125   69
##          C   22  100 1129   48    5
##          D  130   89  122 1012  154
##          E   14   85   28   79 1208
## 
## Overall Statistics
##                                         
##                Accuracy : 0.817         
##                  95% CI : (0.809, 0.826)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.77          
##  Mcnemar's Test P-Value : <2e-16        
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.867    0.744    0.825    0.787    0.838
## Specificity             0.974    0.935    0.973    0.925    0.968
## Pos Pred Value          0.930    0.733    0.866    0.672    0.854
## Neg Pred Value          0.948    0.938    0.963    0.957    0.964
## Prevalence              0.284    0.193    0.174    0.164    0.184
## Detection Rate          0.247    0.144    0.144    0.129    0.154
## Detection Prevalence    0.265    0.196    0.166    0.192    0.180
## Balanced Accuracy       0.921    0.840    0.899    0.856    0.903

8. Using Machine Learning : Random Forest

A random forest model is built with randomForest and evaluated against the validation data.

rf.stime <- proc.time()
modFit.RF <- randomForest (classe ~ ., data=cleanSubTrain, importance=TRUE)
rf.etime <- proc.time()

predict.RF <- predict(modFit.RF, cleanSubValidate, type = "response")
CM.RF <- confusionMatrix(predict.RF, cleanSubValidate$classe)
print(CM.RF)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2232    2    0    0    0
##          B    0 1514    7    0    0
##          C    0    2 1361    4    0
##          D    0    0    0 1282    1
##          E    0    0    0    0 1441
## 
## Overall Statistics
##                                         
##                Accuracy : 0.998         
##                  95% CI : (0.997, 0.999)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.997         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             1.000    0.997    0.995    0.997    0.999
## Specificity             1.000    0.999    0.999    1.000    1.000
## Pos Pred Value          0.999    0.995    0.996    0.999    1.000
## Neg Pred Value          1.000    0.999    0.999    0.999    1.000
## Prevalence              0.284    0.193    0.174    0.164    0.184
## Detection Rate          0.284    0.193    0.173    0.163    0.184
## Detection Prevalence    0.285    0.194    0.174    0.164    0.184
## Balanced Accuracy       1.000    0.998    0.997    0.998    1.000

9. Using Machine Learning : Random Forest with Cross Validation

The rfcv function performs random forest training with k-fold cross validation to estimate how prediction error varies with the number of predictors. Since rfcv returns cross-validated error rates rather than a fitted model, the prediction below reuses modFit.RF from section 8, which is why the confusion matrix matches the previous one.

rfcv.stime <- proc.time()
# column 54 of the cleaned training data is classe; rfcv takes the
# predictors and the response as separate arguments
modFit.RFCV <- rfcv(trainx = cleanSubTrain[,-54], trainy = cleanSubTrain[,54], 
                 scale = "log", step=0.5, cv.fold=3)
rfcv.etime <- proc.time()

# rfcv does not return a model object, so predictions reuse modFit.RF
predict.RFCV <- predict(modFit.RF, cleanSubValidate, type = "response")
CM.RFCV <- confusionMatrix(predict.RFCV, cleanSubValidate$classe)
print(CM.RFCV)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2232    2    0    0    0
##          B    0 1514    7    0    0
##          C    0    2 1361    4    0
##          D    0    0    0 1282    1
##          E    0    0    0    0 1441
## 
## Overall Statistics
##                                         
##                Accuracy : 0.998         
##                  95% CI : (0.997, 0.999)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.997         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             1.000    0.997    0.995    0.997    0.999
## Specificity             1.000    0.999    0.999    1.000    1.000
## Pos Pred Value          0.999    0.995    0.996    0.999    1.000
## Neg Pred Value          1.000    0.999    0.999    0.999    1.000
## Prevalence              0.284    0.193    0.174    0.164    0.184
## Detection Rate          0.284    0.193    0.173    0.163    0.184
## Detection Prevalence    0.285    0.194    0.174    0.164    0.184
## Balanced Accuracy       1.000    0.998    0.997    0.998    1.000
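
The cross-validated error rates computed by rfcv can be inspected directly; a minimal sketch:

# error.cv maps each number of predictors tried to its cross-validated error rate
modFit.RFCV$error.cv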

10. Compare the models

Model Name                            Accuracy   Elapsed Time (s)
Decision Tree                         0.817487               3.87
Random Forest                         0.997961             105.11
Random Forest with Cross Validation   0.997961             272.13

The model built using Random Forest (section 8) is the best choice: it matches the accuracy of the cross-validated variant in less than half the elapsed time, and it is far more accurate than the decision tree.
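
For reference, a minimal sketch of how the figures in the table above can be assembled from the objects computed earlier:

# collect accuracy from each confusion matrix and elapsed training time (seconds)
comparison <- data.frame(
    Model    = c("Decision Tree", "Random Forest",
                 "Random Forest with Cross Validation"),
    Accuracy = c(CM.DT$overall["Accuracy"], CM.RF$overall["Accuracy"],
                 CM.RFCV$overall["Accuracy"]),
    Elapsed  = c((dt.etime - dt.stime)["elapsed"],
                 (rf.etime - rf.stime)["elapsed"],
                 (rfcv.etime - rfcv.stime)["elapsed"]))
print(comparison)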

11. Out of sample error

print(modFit.RF)
## 
## Call:
##  randomForest(formula = classe ~ ., data = cleanSubTrain, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.36%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3347    1    0    0    0 0.000298686
## B    6 2269    4    0    0 0.004387889
## C    0    8 2044    2    0 0.004868549
## D    0    0   16 1913    1 0.008808290
## E    0    0    0    4 2161 0.001847575

The OOB (out-of-bag) error estimate on the training data is 0.36%, which is low and consistent with the accuracy observed on the validation set.
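
The expected out of sample error can also be estimated from the held-out validation set; a minimal sketch:

# out of sample error = 1 - accuracy on data unseen during training
oos.error <- 1 - CM.RF$overall["Accuracy"]
oos.error    # roughly 0.002, i.e. about 0.2%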

12. Predict on the testing data & write the outcomes to files

predict.RF.test <- predict(modFit.RF, cleanTest)
              
# write each predicted class to its own text file for submission
pml_write_files = function(x){
    n = length(x)
    for(i in 1:n){
        filename = paste0("problem_id_", i, ".txt")
        write.table(x[i], file=filename, quote=FALSE, row.names=FALSE, col.names=FALSE)
    }
}

pml_write_files(predict.RF.test)

The predictions for the 20 test cases are: B, A, B, A, A, E, D, B, A, A, B, C, B, A, E, E, A, B, B, B.