The goal of the machine learning prediction conducted here is to predict the type of training the participants used in the experiment described at the following website: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
My aim was to find a model with good prediction accuracy. Two approaches were tried: increasing the size of the training sample, and changing the training algorithm. Two machine learning algorithms were used: a decision tree via the rpart function, and a random forest via the randomForest function. I began with an rpart model fit on a small training set; its accuracy was only 0.666. I then enlarged the training set to see how strongly sample size affects the result: with roughly ten times more training data, the rpart accuracy was still low, 0.738. Since training-set size was clearly not the limiting factor, I switched to the randomForest function, again fitting first on the small training set and then on the larger one. The accuracies were 0.907 and 0.986 respectively, which shows that randomForest is the better algorithm here, and I picked the best of these models for prediction on the test data.
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
First off, load the libraries needed and read the datasets from the local repo.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
library(rpart.plot) # Enhanced tree plots
train <- read.csv("pml-training.csv", header=TRUE)
test <- read.csv("pml-testing.csv", header=TRUE)
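A note on the raw files: much of the missingness in this dataset is encoded as empty strings or the spreadsheet artifact "#DIV/0!" rather than as literal NA (an assumption about the CSVs, worth verifying). An alternative, not used below, is to normalize these at read time so a single NA check covers everything:
# treat "NA", "" and "#DIV/0!" as missing while reading (assumed encodings)
trainNA <- read.csv("pml-training.csv", header=TRUE,
                    na.strings=c("NA", "", "#DIV/0!"))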
The train data contain many NAs and missing values, so column-wise cleaning is needed first. Afterwards, split the train dataset into training and testing datasets for final model testing.
# flag columns that contain any NA values
badTrainind1 <- sapply(train, function(x) any(is.na(x)))
# flag factor columns that contain empty-string entries
badTrainind2 <- sapply(train, function(x) "" %in% levels(x))
badTrainindtot <- badTrainind1 | badTrainind2
cleanTrain <- train[, -which(badTrainindtot)]
# the first 7 columns (row id, user name, timestamps, window markers) are not
# useful predictors for the training model; remove them
cleanTrain <- cleanTrain[, -1:-7]
# split the train data into training (70%) and testing (30%) sets
set.seed(125)
inTrain <-createDataPartition(cleanTrain$classe, p = 0.7, list=FALSE)
training <- cleanTrain[inTrain,]
testing <- cleanTrain[-inTrain,]
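One caveat on the cleaning step: the levels()-based check assumes read.csv() turned string columns into factors, which is the default only on R versions before 4.0. A sketch of an equivalent check that works either way (the Alt names are hypothetical):
# flag columns with any NA or empty-string entries, factor or character alike
badColsAlt <- sapply(train, function(x) any(is.na(x) | x == ""))
cleanTrainAlt <- train[, !badColsAlt]
cleanTrainAlt <- cleanTrainAlt[, -1:-7]
# randomForest() also needs a factor response, which newer R no longer creates
cleanTrainAlt$classe <- factor(cleanTrainAlt$classe)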
Test whether the rpart function trained on a small dataset can give a decent prediction.
set.seed(125)
inTrainsmall <-createDataPartition(cleanTrain$classe, p = 0.04, list=FALSE)
trainingsmall <- cleanTrain[inTrainsmall,]
testingsmall <- cleanTrain[-inTrainsmall,]
# quick survey via decision tree (n = 787) and check the accuracy of this model
rpartmodelfitsmall <- rpart(classe~., data=trainingsmall, method="class")
predictiontrainingsmall <- predict(rpartmodelfitsmall, newdata=testingsmall, type="class")
confusionMatrix(predictiontrainingsmall, testingsmall$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4425 569 320 354 29
## B 238 1641 271 272 161
## C 222 531 2162 610 378
## D 248 416 278 1615 198
## E 223 488 254 236 2696
##
## Overall Statistics
##
## Accuracy : 0.666
## 95% CI : (0.659, 0.672)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.576
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.826 0.4502 0.658 0.5232 0.779
## Specificity 0.906 0.9380 0.888 0.9276 0.922
## Pos Pred Value 0.777 0.6353 0.554 0.5862 0.692
## Neg Pred Value 0.929 0.8767 0.925 0.9085 0.949
## Prevalence 0.284 0.1935 0.174 0.1639 0.184
## Detection Rate 0.235 0.0871 0.115 0.0857 0.143
## Detection Prevalence 0.302 0.1371 0.207 0.1463 0.207
## Balanced Accuracy 0.866 0.6941 0.773 0.7254 0.850
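As a side note, the rpart.plot package loaded earlier can visualize the fitted tree and show where it confuses classes; a minimal sketch (the plot title is my own):
# draw the fitted classification tree using the rpart.plot package
rpart.plot(rpartmodelfitsmall, main="Decision tree fit on n = 787 observations")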
Accuracy is only 0.666. I chose a larger dataset to see whether the size of the training set affects accuracy much.
set.seed(125)
# resample a training dataset of about 7870 observations and use rpart to build the model
inTrainsmall1 <-createDataPartition(cleanTrain$classe, p = 0.4, list=FALSE)
trainingsmall1 <- cleanTrain[inTrainsmall1,]
testingsmall1 <- cleanTrain[-inTrainsmall1,]
# quick survey via decision tree (n = 7870) and check the accuracy of this model
rpartmodelfitsmall1 <- rpart(classe~., data=trainingsmall1, method="class")
predictiontrainingsmall1 <- predict(rpartmodelfitsmall1, newdata=testingsmall1, type="class")
confusionMatrix(predictiontrainingsmall1, testingsmall1$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3054 507 40 107 157
## B 21 1292 131 126 46
## C 104 383 1755 585 318
## D 138 76 42 1028 81
## E 31 20 85 83 1562
##
## Overall Statistics
##
## Accuracy : 0.738
## 95% CI : (0.73, 0.746)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.667
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.912 0.567 0.855 0.5329 0.722
## Specificity 0.904 0.966 0.857 0.9658 0.977
## Pos Pred Value 0.790 0.800 0.558 0.7531 0.877
## Neg Pred Value 0.963 0.903 0.965 0.9134 0.940
## Prevalence 0.284 0.194 0.174 0.1639 0.184
## Detection Rate 0.259 0.110 0.149 0.0873 0.133
## Detection Prevalence 0.328 0.137 0.267 0.1160 0.151
## Balanced Accuracy 0.908 0.767 0.856 0.7493 0.850
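Before abandoning rpart, its complexity table offers a quick check that the tree itself, not the sample size, is the bottleneck; a minimal sketch, where bestcp and prunedfitsmall1 are hypothetical names:
# cross-validated error (xerror) vs. tree complexity for the larger tree
printcp(rpartmodelfitsmall1)
# prune at the complexity value with the lowest cross-validated error
bestcp <- rpartmodelfitsmall1$cptable[which.min(rpartmodelfitsmall1$cptable[, "xerror"]), "CP"]
prunedfitsmall1 <- prune(rpartmodelfitsmall1, cp=bestcp)
If the pruned tree's cross-validated error is close to the full tree's, a single tree has hit its ceiling on these data.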
Even with almost half of the train data used to build the model, the best accuracy I got from rpart is 0.738. So rpart is not a good function for building this prediction model; I use the randomForest function instead.
# fit a random forest model on the small training set, trainingsmall
rf787modelfit <- randomForest(classe~., data=trainingsmall, importance = FALSE)
predictionrf787modelfit <- predict(rf787modelfit, newdata=testingsmall, type="class")
confusionMatrix(predictionrf787modelfit, testingsmall$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 5220 343 42 19 19
## B 36 2996 189 8 28
## C 25 237 2976 342 85
## D 59 29 36 2664 108
## E 16 40 42 54 3222
##
## Overall Statistics
##
## Accuracy : 0.907
## 95% CI : (0.902, 0.911)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.882
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.975 0.822 0.906 0.863 0.931
## Specificity 0.969 0.983 0.956 0.985 0.990
## Pos Pred Value 0.925 0.920 0.812 0.920 0.955
## Neg Pred Value 0.990 0.958 0.980 0.973 0.984
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.277 0.159 0.158 0.141 0.171
## Detection Prevalence 0.300 0.173 0.195 0.154 0.179
## Balanced Accuracy 0.972 0.902 0.931 0.924 0.960
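To see which predictors the forest relies on, the model can be refit with importance tracking enabled (it was switched off above, presumably for speed); a minimal sketch with the hypothetical name rf787impfit:
# refit on the small training set, recording variable importance measures
rf787impfit <- randomForest(classe~., data=trainingsmall, importance=TRUE)
# plot mean decrease in accuracy and in Gini index for the top 10 predictors
varImpPlot(rf787impfit, n.var=10)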
The accuracy is 0.907, a big improvement. Now change the size of the training dataset to see the effect of size on accuracy.
rf7870modelfit <- randomForest(classe~., data=trainingsmall1, importance = FALSE)
predictionrf7870modelfit <- predict(rf7870modelfit, newdata=testingsmall1, type="class")
confusionMatrix(predictionrf7870modelfit, testingsmall1$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3345 44 0 1 0
## B 1 2222 24 0 0
## C 2 10 2015 51 0
## D 0 2 14 1876 17
## E 0 0 0 1 2147
##
## Overall Statistics
##
## Accuracy : 0.986
## 95% CI : (0.984, 0.988)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.982
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.999 0.975 0.981 0.973 0.992
## Specificity 0.995 0.997 0.994 0.997 1.000
## Pos Pred Value 0.987 0.989 0.970 0.983 1.000
## Neg Pred Value 1.000 0.994 0.996 0.995 0.998
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.284 0.189 0.171 0.159 0.182
## Detection Prevalence 0.288 0.191 0.177 0.162 0.182
## Balanced Accuracy 0.997 0.986 0.988 0.985 0.996
The accuracy is 0.986, so the estimated out-of-sample error rate is 0.014. Before picking this as the final model for prediction on the test dataset, check its accuracy on the held-out testing set.
predictionrftesting <- predict(rf7870modelfit, newdata=testing, type="class")
confusionMatrix(predictionrftesting, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 16 0 1 0
## B 0 1119 7 0 0
## C 1 3 1014 17 0
## D 0 1 5 945 3
## E 0 0 0 1 1079
##
## Overall Statistics
##
## Accuracy : 0.991
## 95% CI : (0.988, 0.993)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.988
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.999 0.982 0.988 0.980 0.997
## Specificity 0.996 0.999 0.996 0.998 1.000
## Pos Pred Value 0.990 0.994 0.980 0.991 0.999
## Neg Pred Value 1.000 0.996 0.998 0.996 0.999
## Prevalence 0.284 0.194 0.174 0.164 0.184
## Detection Rate 0.284 0.190 0.172 0.161 0.183
## Detection Prevalence 0.287 0.191 0.176 0.162 0.184
## Balanced Accuracy 0.998 0.990 0.992 0.989 0.999
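The same accuracy, and hence the estimated out-of-sample error, can be pulled directly out of the confusionMatrix object; a minimal sketch with the hypothetical name cmtesting:
# store the confusion matrix and derive the estimated out-of-sample error
cmtesting <- confusionMatrix(predictionrftesting, testing$classe)
1 - as.numeric(cmtesting$overall["Accuracy"])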
The accuracy on the testing dataset is 0.991, which is good for model prediction. (Note, though, that testing was partitioned independently of trainingsmall1, so the two sets share some rows; the 0.986 estimate from testingsmall1 is the more conservative out-of-sample figure.) I therefore trust this model for the test dataset of 20 observations. Go ahead and predict!
# apply the model to the real test dataset, the one with 20 observations
predictionrftest<- predict(rf7870modelfit, newdata=test, type="class")
predictionrftest
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
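If these answers are to be submitted, the course convention (assumed here, not part of the analysis above) is one small text file per prediction; a minimal sketch with the hypothetical helper name pml_write_files:
# write each predicted label to its own problem_id_N.txt file (assumed format)
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file=paste0("problem_id_", i, ".txt"),
                quote=FALSE, row.names=FALSE, col.names=FALSE)
  }
}
pml_write_files(as.character(predictionrftest))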