Summary

The goal of this machine learning exercise is to predict the type of training that participants performed (the classe variable) in the experiment described at http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

I tried to find a model with better prediction accuracy using two approaches: increasing the size of the training sample and changing the training algorithm. Two algorithms are used: a decision tree via the rpart function and a random forest via the randomForest function. First, I fit an rpart model on a small training sample (n = 787); its accuracy was only 0.666. To see how much sample size affects the result, I then fit a second rpart model on a training sample about 10 times larger; the accuracy was still low, 0.738. Since a larger training sample alone did not produce an accurate model, I moved to randomForest, again fitting first on the small sample and then on the larger one. The accuracies in the runs shown below are 0.907 and 0.986, which shows that randomForest is the better choice here. I picked the more accurate of the two models to predict the test data.

Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Load data

First, load the required libraries and read the datasets from the local directory.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(rpart.plot)             # Enhanced tree plots

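# stringsAsFactors=TRUE keeps classe and other text columns as factors (no longer the default since R 4.0)
train <- read.csv("pml-training.csv", header=TRUE, stringsAsFactors=TRUE)
test <- read.csv("pml-testing.csv", header=TRUE, stringsAsFactors=TRUE)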

Clean data

The train data contain many NAs and missing values, so the first step is to clean the data by COLUMN: any column with an NA or an empty string is dropped. After that, the train dataset is split into training and testing datasets for the final model check.

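# flag columns that contain any NA
badTrainind1 <- sapply(train, function(x) any(is.na(x)))
# flag columns that contain any empty string (works for factor and character columns)
badTrainind2 <- sapply(train, function(x) any(as.character(x) == "", na.rm=TRUE))
badTrainindtot <- badTrainind1 | badTrainind2
# keep only the columns with no NAs and no empty strings
cleanTrain <- train[, !badTrainindtot]

# the first 7 columns of cleanTrain are not useful for training the model; remove them
cleanTrain <- cleanTrain[, -(1:7)]

# split the train data into training and testing sets with a 70/30 ratio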
set.seed(125)
inTrain <-createDataPartition(cleanTrain$classe, p = 0.7, list=FALSE)
training <- cleanTrain[inTrain,]
testing <- cleanTrain[-inTrain,]
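
The column filter above can be tried on a small example before running it on the full dataset. The sketch below uses a made-up data frame (`toy`, with invented values) and base R only; it drops every column that has an NA or an empty string, which is the same rule applied to the train data.

```r
# Toy data frame standing in for the raw train data (values are made up)
toy <- data.frame(
  a = c(1, 2, 3),                    # complete numeric column: kept
  b = c(NA, 5, 6),                   # contains an NA: dropped
  c = factor(c("x", "", "z")),       # contains an empty string: dropped
  classe = factor(c("A", "B", "A"))  # complete outcome column: kept
)

# One logical flag per column
has_na    <- sapply(toy, function(x) any(is.na(x)))
has_empty <- sapply(toy, function(x) any(as.character(x) == "", na.rm = TRUE))

# Keep only the columns where neither flag is set
clean <- toy[, !(has_na | has_empty), drop = FALSE]
names(clean)
```

Indexing with a logical vector (`!flag`) instead of `-which(flag)` also behaves sensibly when no column is flagged, in which case `-which(flag)` would select zero columns.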

Model selection

Test whether the rpart function can give a good prediction with a small training dataset.

set.seed(125)
inTrainsmall <-createDataPartition(cleanTrain$classe, p = 0.04, list=FALSE)
trainingsmall <- cleanTrain[inTrainsmall,]
testingsmall <- cleanTrain[-inTrainsmall,]

# quick survey via decision tree (n=787), then check the accuracy of this model
rpartmodelfitsmall <- rpart(classe~., data=trainingsmall, method="class")
predictiontrainingsmall <- predict(rpartmodelfitsmall, newdata=testingsmall, type="class")
confusionMatrix(predictiontrainingsmall, testingsmall$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4425  569  320  354   29
##          B  238 1641  271  272  161
##          C  222  531 2162  610  378
##          D  248  416  278 1615  198
##          E  223  488  254  236 2696
## 
## Overall Statistics
##                                         
##                Accuracy : 0.666         
##                  95% CI : (0.659, 0.672)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.576         
##  Mcnemar's Test P-Value : <2e-16        
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.826   0.4502    0.658   0.5232    0.779
## Specificity             0.906   0.9380    0.888   0.9276    0.922
## Pos Pred Value          0.777   0.6353    0.554   0.5862    0.692
## Neg Pred Value          0.929   0.8767    0.925   0.9085    0.949
## Prevalence              0.284   0.1935    0.174   0.1639    0.184
## Detection Rate          0.235   0.0871    0.115   0.0857    0.143
## Detection Prevalence    0.302   0.1371    0.207   0.1463    0.207
## Balanced Accuracy       0.866   0.6941    0.773   0.7254    0.850
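
The headline figures in the output above can be recomputed by hand from the confusion matrix. The sketch below copies the matrix from that output and applies the usual definitions: accuracy is the share of counts on the diagonal, and Kappa is computed with the standard Cohen's kappa formula, which is assumed here to be what confusionMatrix reports. The variable names are only for illustration.

```r
# Confusion matrix copied from the output above
# (rows = prediction, columns = reference)
cm <- matrix(c(4425,  569,  320,  354,   29,
                238, 1641,  271,  272,  161,
                222,  531, 2162,  610,  378,
                248,  416,  278, 1615,  198,
                223,  488,  254,  236, 2696),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

n        <- sum(cm)                                # number of rows scored
accuracy <- sum(diag(cm)) / n                      # observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
cohen_kappa <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, kappa = cohen_kappa), 3)
```

Kappa is lower than accuracy because it discounts the agreement that would occur by chance given the class frequencies.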

The accuracy is only 0.666. Next, I use a larger training dataset to see whether its size has a large effect on accuracy.

set.seed(125)
# resample a training dataset with p = 0.4 (7850 observations, about 10 times the small one) and use rpart to create the model
inTrainsmall1 <-createDataPartition(cleanTrain$classe, p = 0.4, list=FALSE)
trainingsmall1 <- cleanTrain[inTrainsmall1,]
testingsmall1 <- cleanTrain[-inTrainsmall1,]
# quick survey via decision tree (n=7850), then check the accuracy of this model
rpartmodelfitsmall1 <- rpart(classe~., data=trainingsmall1, method="class")
predictiontrainingsmall1 <- predict(rpartmodelfitsmall1, newdata=testingsmall1, type="class")
confusionMatrix(predictiontrainingsmall1, testingsmall1$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3054  507   40  107  157
##          B   21 1292  131  126   46
##          C  104  383 1755  585  318
##          D  138   76   42 1028   81
##          E   31   20   85   83 1562
## 
## Overall Statistics
##                                        
##                Accuracy : 0.738        
##                  95% CI : (0.73, 0.746)
##     No Information Rate : 0.284        
##     P-Value [Acc > NIR] : <2e-16       
##                                        
##                   Kappa : 0.667        
##  Mcnemar's Test P-Value : <2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.912    0.567    0.855   0.5329    0.722
## Specificity             0.904    0.966    0.857   0.9658    0.977
## Pos Pred Value          0.790    0.800    0.558   0.7531    0.877
## Neg Pred Value          0.963    0.903    0.965   0.9134    0.940
## Prevalence              0.284    0.194    0.174   0.1639    0.184
## Detection Rate          0.259    0.110    0.149   0.0873    0.133
## Detection Prevalence    0.328    0.137    0.267   0.1160    0.151
## Balanced Accuracy       0.908    0.767    0.856   0.7493    0.850
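
The two rpart runs above differ only in the fraction of data used for training. The same comparison can be written as a loop over several fractions. The sketch below is an illustration on R's built-in iris data, not on the weight-lifting data, and it uses plain random sampling rather than createDataPartition; `holdout_accuracy` is a hypothetical helper name.

```r
library(rpart)

# Hypothetical helper: fit a tree on a fraction p of the rows and
# return the accuracy on the remaining rows.
holdout_accuracy <- function(data, p, seed = 125) {
  set.seed(seed)
  idx  <- sample(nrow(data), size = ceiling(p * nrow(data)))
  fit  <- rpart(Species ~ ., data = data[idx, ], method = "class")
  pred <- predict(fit, newdata = data[-idx, ], type = "class")
  mean(pred == data$Species[-idx])
}

# Accuracy on the held-out rows for a range of training fractions
fractions <- c(0.2, 0.4, 0.6, 0.8)
acc <- sapply(fractions, function(p) holdout_accuracy(iris, p))
data.frame(fraction = fractions, accuracy = round(acc, 3))
```

If accuracy levels off as the fraction grows, more data is unlikely to help and a different algorithm is the more promising change.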

Even with almost half of the train data used to fit the rpart model, the best accuracy I got is 0.738. This tells me that rpart is not a good function for building the prediction model here, so I use the randomForest function instead.

# fit a random forest on the small training set, trainingsmall, and check its accuracy
rf787modelfit <- randomForest(classe~., data=trainingsmall, importance = FALSE)
predictionrf787modelfit <- predict(rf787modelfit, newdata=testingsmall, type="class")
confusionMatrix(predictionrf787modelfit, testingsmall$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 5220  343   42   19   19
##          B   36 2996  189    8   28
##          C   25  237 2976  342   85
##          D   59   29   36 2664  108
##          E   16   40   42   54 3222
## 
## Overall Statistics
##                                         
##                Accuracy : 0.907         
##                  95% CI : (0.902, 0.911)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.882         
##  Mcnemar's Test P-Value : <2e-16        
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.975    0.822    0.906    0.863    0.931
## Specificity             0.969    0.983    0.956    0.985    0.990
## Pos Pred Value          0.925    0.920    0.812    0.920    0.955
## Neg Pred Value          0.990    0.958    0.980    0.973    0.984
## Prevalence              0.284    0.194    0.174    0.164    0.184
## Detection Rate          0.277    0.159    0.158    0.141    0.171
## Detection Prevalence    0.300    0.173    0.195    0.154    0.179
## Balanced Accuracy       0.972    0.902    0.931    0.924    0.960
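
The per-class rows in the output above follow directly from the confusion matrix. As a minimal sketch, with the matrix copied from that output, sensitivity is each diagonal count divided by its column total (the reference class), and positive predictive value is each diagonal count divided by its row total (the predicted class).

```r
# Confusion matrix copied from the output above
# (rows = prediction, columns = reference)
cm <- matrix(c(5220,  343,   42,   19,   19,
                 36, 2996,  189,    8,   28,
                 25,  237, 2976,  342,   85,
                 59,   29,   36, 2664,  108,
                 16,   40,   42,   54, 3222),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

sensitivity    <- diag(cm) / colSums(cm)   # correct / all true members of the class
pos_pred_value <- diag(cm) / rowSums(cm)   # correct / all predictions of the class

round(rbind(sensitivity, pos_pred_value), 3)
```

Class B has the lowest sensitivity in this run because 343 of its rows were predicted as A and 237 as C.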

The accuracy is 0.907, which is a great improvement. Next, I change the size of the training dataset to see its effect on accuracy.

rf7870modelfit <- randomForest(classe~., data=trainingsmall1, importance = FALSE)
predictionrf7870modelfit <- predict(rf7870modelfit, newdata=testingsmall1, type="class")
confusionMatrix(predictionrf7870modelfit, testingsmall1$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3345   44    0    1    0
##          B    1 2222   24    0    0
##          C    2   10 2015   51    0
##          D    0    2   14 1876   17
##          E    0    0    0    1 2147
## 
## Overall Statistics
##                                         
##                Accuracy : 0.986         
##                  95% CI : (0.984, 0.988)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.982         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.999    0.975    0.981    0.973    0.992
## Specificity             0.995    0.997    0.994    0.997    1.000
## Pos Pred Value          0.987    0.989    0.970    0.983    1.000
## Neg Pred Value          1.000    0.994    0.996    0.995    0.998
## Prevalence              0.284    0.194    0.174    0.164    0.184
## Detection Rate          0.284    0.189    0.171    0.159    0.182
## Detection Prevalence    0.288    0.191    0.177    0.162    0.182
## Balanced Accuracy       0.997    0.986    0.988    0.985    0.996
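
The out-of-sample error rate is one minus the accuracy on rows the model did not see during fitting. The sketch below recomputes it from the counts in the output above. The confidence interval uses base R's binom.test, on the assumption that the interval reported by confusionMatrix is an exact binomial interval.

```r
# Counts taken from the confusion matrix above
correct <- 3345 + 2222 + 2015 + 1876 + 2147   # diagonal: correctly predicted rows
total   <- 11772                              # all rows in testingsmall1

accuracy   <- correct / total
error_rate <- 1 - accuracy                    # out-of-sample error estimate

# Exact binomial 95% confidence interval for the accuracy
ci <- binom.test(correct, total)$conf.int

round(c(accuracy = accuracy, error = error_rate,
        lower = ci[1], upper = ci[2]), 3)
```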

The accuracy is 0.986, so the estimated out-of-sample error rate is 0.014. Before picking this model as the final model for the test dataset, I check its accuracy on the testing dataset.

predictionrftesting <- predict(rf7870modelfit, newdata=testing, type="class")
confusionMatrix(predictionrftesting, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673   16    0    1    0
##          B    0 1119    7    0    0
##          C    1    3 1014   17    0
##          D    0    1    5  945    3
##          E    0    0    0    1 1079
## 
## Overall Statistics
##                                         
##                Accuracy : 0.991         
##                  95% CI : (0.988, 0.993)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.988         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.999    0.982    0.988    0.980    0.997
## Specificity             0.996    0.999    0.996    0.998    1.000
## Pos Pred Value          0.990    0.994    0.980    0.991    0.999
## Neg Pred Value          1.000    0.996    0.998    0.996    0.999
## Prevalence              0.284    0.194    0.174    0.164    0.184
## Detection Rate          0.284    0.190    0.172    0.161    0.183
## Detection Prevalence    0.287    0.191    0.176    0.162    0.184
## Balanced Accuracy       0.998    0.990    0.992    0.989    0.999
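
The testing set scored above and the trainingsmall1 set used to fit the model were drawn from cleanTrain by two separate partitions, so they are not guaranteed to be disjoint. The sketch below shows, on made-up row indices and with plain random sampling instead of createDataPartition, how to count the rows that two such draws share.

```r
set.seed(125)
rows <- seq_len(1000)                               # stand-in for the rows of cleanTrain

# First partition: 70% for training, the remaining 30% held out
held_out <- setdiff(rows, sample(rows, size = 700))

# Second, separate partition: 40% used to fit the model
train_40 <- sample(rows, size = 400)

# Rows that are in the held-out set and were also used for fitting
overlap <- intersect(held_out, train_40)
length(overlap)

# On the real data the same check would be:
# length(intersect(rownames(testing), rownames(trainingsmall1)))
```

A nonzero count means some held-out rows were seen during fitting, which tends to make the accuracy measured on the held-out set look better than it is.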

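The accuracy on the testing dataset is 0.991, which is good for model prediction. Note that testing and trainingsmall1 were drawn from cleanTrain by separate partitions, so some testing rows may also have been used to fit the model; the 0.986 measured on testingsmall1 is therefore the more conservative estimate. Either way, I believe this model is a good choice for the test dataset of 20 observations. Go ahead and predict!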

# apply the model to the real test dataset, the one with 20 observations
predictionrftest<- predict(rf7870modelfit, newdata=test, type="class")
predictionrftest
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E