Preparing the Data (Data Cleaning)

Training Dataset

The training dataset is from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv. It contains 19622 rows (observations) and 160 variables (features).

Testing Dataset

The testing dataset is from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv. It contains 20 rows (the 20 test cases to be predicted) and 160 variables (features).

Strategy

This project partitions the training dataset into 70% for training and 30% for validation when building the machine learning model.

Not all 160 features are needed, so the dataset is cleaned by removing some columns: for example, columns containing NA or #DIV/0! values, empty columns, and columns such as the participants' names that are not related to the classes to be predicted.

## Data loading
setwd("..")
pmlTrain<-read.csv("pml-training.csv", header=TRUE, na.strings=c("NA", "#DIV/0!"))
pmlTest<-read.csv("pml-testing.csv", header=TRUE, na.strings=c("NA", "#DIV/0!"))

## Keep only the variables that contain no NA values
noNApmlTrain<-pmlTrain[, colSums(is.na(pmlTrain)) == 0]
dim(noNApmlTrain)

## [1] 19622    60

## Drop the first 8 columns: user information, timestamps and undefined variables
cleanpmlTrain<-noNApmlTrain[,-c(1:8)]
dim(cleanpmlTrain)

## [1] 19622    52

## Apply the same column selection to the 20 test cases (all columns except classe, column 52)
cleanpmltest<-pmlTest[,names(cleanpmlTrain[,-52])]
dim(cleanpmltest)

## [1] 20 51
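The cleaning steps above can be illustrated on a small toy data frame. This is only a sketch: the data frame, its column names, and its values are hypothetical stand-ins for the real data, and the outcome column is excluded by name here rather than by position.

```r
# Toy data frame standing in for the raw training data (hypothetical values).
toyTrain <- data.frame(
  X         = 1:4,                      # row index, not a predictor
  user_name = c("a", "b", "a", "b"),    # identifies the subject
  accel_x   = c(0.1, 0.4, 0.3, 0.9),    # complete sensor column
  kurtosis  = c(NA, 2.1, NA, NA),       # mostly missing summary column
  classe    = factor(c("A", "B", "A", "B"))
)

# Keep only the columns that contain no missing values.
completeCols <- colSums(is.na(toyTrain)) == 0
toyNoNA <- toyTrain[, completeCols]

# Drop the leading identifier columns by position.
toyClean <- toyNoNA[, -c(1:2)]

# Select the same predictors (everything except the outcome) in a test frame.
toyTest <- data.frame(X = 5L, user_name = "a", accel_x = 0.5,
                      kurtosis = NA, problem_id = 1L)
predictorNames <- setdiff(names(toyClean), "classe")
toyTestClean <- toyTest[, predictorNames, drop = FALSE]

names(toyClean)      # "accel_x" "classe"
names(toyTestClean)  # "accel_x"
```

Selecting the outcome by name instead of by index is a design choice that keeps working if the column order changes.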

The cleaned training dataset is then partitioned into 70% for training and 30% for validation, using the following code:

#Create Partition
library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

inTrain<-createDataPartition(y=cleanpmlTrain$classe, p=0.70, list=FALSE)
training<-cleanpmlTrain[inTrain,] 
validate<-cleanpmlTrain[-inTrain,] 

#Training and validate set dimensions
dim(training)

## [1] 13737    52

dim(validate)

## [1] 5885   52
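The idea behind a class-stratified 70/30 split can be sketched in base R as follows. This is an illustration of the concept on hypothetical toy data, not caret's implementation; the seed value is an arbitrary assumption, included only because the split is random and would otherwise differ between runs.

```r
set.seed(123)  # hypothetical seed, chosen only to make the sketch repeatable

# Toy outcome vector with unbalanced classes (hypothetical data).
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class, then combine them.
idxByClass <- split(seq_along(classe), classe)
inTrainToy <- sort(unlist(lapply(idxByClass, function(idx) {
  sample(idx, size = round(0.70 * length(idx)))
}), use.names = FALSE))

trainY    <- classe[inTrainToy]
validateY <- classe[-inTrainToy]

table(trainY)     # 35 A, 21 B, 14 C
table(validateY)  # 15 A,  9 B,  6 C
```

Because the sampling is done per class, both partitions keep roughly the same class proportions as the full dataset.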

Training

The model was trained on the training partition using a random forest (caret method "rf"). Three-fold cross-validation was used to control the training and select the model's tuning parameter. The resulting model was then evaluated on the held-out partition of the training dataset to estimate its accuracy and prediction error.

Using 51 features to predict five classes, with 3-fold cross-validation, the model achieved an accuracy of 99.3% (95% CI [0.9906, 0.995]) and a Kappa value of 0.9912 on the validation partition.

library(caret)
set.seed(786541)
pmlTrainControl<-trainControl(method="cv", number=3, allowParallel=TRUE, verboseIter=TRUE)
pmlTrainedModel<-train(classe~.,data=training, method="rf", trControl=pmlTrainControl, verbose=FALSE)

## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.

## + Fold1: mtry= 2 
## - Fold1: mtry= 2 
## + Fold1: mtry=26 
## - Fold1: mtry=26 
## + Fold1: mtry=51 
## - Fold1: mtry=51 
## + Fold2: mtry= 2 
## - Fold2: mtry= 2 
## + Fold2: mtry=26 
## - Fold2: mtry=26 
## + Fold2: mtry=51 
## - Fold2: mtry=51 
## + Fold3: mtry= 2 
## - Fold3: mtry= 2 
## + Fold3: mtry=26 
## - Fold3: mtry=26 
## + Fold3: mtry=51 
## - Fold3: mtry=51 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 26 on full training set
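The log above shows three candidate values of mtry (2, 26 and 51), each fitted once per fold. The sketch below reproduces those values as three evenly spaced points between 2 and the number of predictors, and counts the fits. That caret derives its default grid this way is an assumption made for illustration; the variable names are hypothetical.

```r
nPredictors <- 51  # predictors after cleaning (52 columns minus classe)

# Three evenly spaced candidates between 2 and the number of predictors,
# rounded down to whole numbers (assumed to mirror the grid in the log).
mtryCandidates <- floor(seq(2, nPredictors, length.out = 3))
mtryCandidates  # 2 26 51

# Each candidate is fitted once per fold, so 3-fold CV gives 3 x 3 = 9 fits,
# plus one final fit on the full training partition with the selected mtry.
nFolds <- 3
nFits  <- nFolds * length(mtryCandidates) + 1
nFits  # 10
```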

Validating

The trained model is then tested on the validation dataset (the 30% of the data partitioned from the original training dataset).

predrf<-predict(pmlTrainedModel, newdata=validate)
confusionMatrix(predrf, validate$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1672   10    0    0    0
##          B    1 1126    7    0    0
##          C    0    2 1014   12    2
##          D    0    0    5  952    0
##          E    1    1    0    0 1080
## 
## Overall Statistics
##                                          
##                Accuracy : 0.993          
##                  95% CI : (0.9906, 0.995)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9912         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9988   0.9886   0.9883   0.9876   0.9982
## Specificity            0.9976   0.9983   0.9967   0.9990   0.9996
## Pos Pred Value         0.9941   0.9929   0.9845   0.9948   0.9982
## Neg Pred Value         0.9995   0.9973   0.9975   0.9976   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2841   0.1913   0.1723   0.1618   0.1835
## Detection Prevalence   0.2858   0.1927   0.1750   0.1626   0.1839
## Balanced Accuracy      0.9982   0.9935   0.9925   0.9933   0.9989
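The headline statistics can be recomputed by hand from the confusion matrix printed above. The sketch below is plain base R, with the counts copied from that output; the variable names are illustrative.

```r
# Confusion matrix copied from the output above
# (rows = prediction, columns = reference).
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(c(1672,   10,    0,    0,    0,
                  1, 1126,    7,    0,    0,
                  0,    2, 1014,   12,    2,
                  0,    0,    5,  952,    0,
                  1,    1,    0,    0, 1080),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy    <- sum(diag(cm)) / sum(cm)   # share of correct predictions
oosError    <- 1 - accuracy              # estimated out-of-sample error
sensitivity <- diag(cm) / colSums(cm)    # per-class recall

round(accuracy, 4)      # 0.993
round(oosError, 4)      # 0.007
round(sensitivity, 4)
```

With 5844 of 5885 validation cases on the diagonal, the estimated out-of-sample error is about 0.7%.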

Results and Conclusion

The trained model is then used to make predictions for the 20 test cases provided at the beginning of the project.

pred20<-predict(pmlTrainedModel, newdata=cleanpmltest)
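One possible way to store such predictions is to pair each label with its case number and write the table to a CSV file. The sketch below uses placeholder labels, not the model's actual predictions, and the output path is a hypothetical temporary file.

```r
# Placeholder labels standing in for pred20 (NOT the model's actual output).
predPlaceholder <- factor(c("A", "B", "C"),
                          levels = c("A", "B", "C", "D", "E"))

# One row per test case: case number and predicted class.
results <- data.frame(case = seq_along(predPlaceholder),
                      predicted = as.character(predPlaceholder))

outFile <- tempfile(fileext = ".csv")   # hypothetical output location
write.csv(results, outFile, row.names = FALSE)

# Read the file back to confirm what was written.
readBack <- read.csv(outFile)
readBack
```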