The goal of this project is to predict the manner in which the participants performed the exercise, recorded in the “classe” variable of the training set; any of the other variables may be used as predictors. This report describes how the model was built, how cross-validation was used, what the expected out-of-sample error is, and why these choices were made. The final prediction model is also applied to 20 test cases.
First, the training and testing data sets are downloaded and loaded into R.
# Download the data files if they are not already present
if (!file.exists("./pml-training.csv")){
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",destfile="./pml-training.csv")
}
if (!file.exists("./pml-testing.csv")){
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",destfile="./pml-testing.csv")
}
#Load data
training<-read.csv("pml-training.csv")
testing<-read.csv("pml-testing.csv")
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
# Remove near-zero-variance columns
library(caret)
training1<-training[, -nearZeroVar(training)]
# Remove columns that are mostly NA (more than 50% missing)
training<-training1[, -which(colMeans(is.na(training1)) > 0.5)]
# Remove the first 7 columns (row index, user name, timestamps)
training<-training[,-c(1:7)]
dim(training)
## [1] 19622 52
# Remove any rows with missing values
training1<-training[complete.cases(training), ]
dim(training1)
## [1] 19622 52
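As an optional sanity check (output not shown in the original report), we can confirm that no missing values remain after the column filtering:
# Count the remaining NA values in the cleaned training data; expected to be 0,
# consistent with complete.cases() keeping all 19622 rows above
sum(is.na(training1))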
# Create a data partition: 70% for training, 30% for validation
dpart<-createDataPartition(training1$classe,p=0.7,list=FALSE)
trainSet<-training1[dpart,]
testSet<-training1[-dpart,]
dim(trainSet)
## [1] 13737 52
dim(testSet)
## [1] 5885 52
Note: training the three models below takes more than 30 minutes, so please be patient.
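One optional way to shorten this run time (not used in the original analysis) is to let caret evaluate the cross-validation folds in parallel. The sketch below assumes the parallel and doParallel packages are installed; fitControlPar would simply replace the fitControl object defined below.
# Register a parallel backend so train() can fit CV folds on multiple cores
library(parallel)
library(doParallel)
cluster <- makeCluster(max(1, detectCores() - 1))  # leave one core for the OS
registerDoParallel(cluster)
# allowParallel = TRUE lets train() use the registered backend
fitControlPar <- trainControl(method = "cv", number = 10, allowParallel = TRUE)
# ... fit the models with trControl = fitControlPar ...
stopCluster(cluster)  # release the workers when training is done
registerDoSEQ()       # return foreach to sequential execution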
set.seed(1234)
# Use 10-fold cross-validation
fitControl <- trainControl(method = "cv",number = 10)
# Random forest
rf<-train(classe~.,method="rf",trControl = fitControl,data=trainSet,verbose = FALSE)
# Gradient boosting machine
gbm<-train(classe~.,method="gbm",trControl = fitControl,data=trainSet,verbose = FALSE)
# Linear discriminant analysis
lda<-train(classe~.,method="lda",trControl = fitControl,data=trainSet,verbose = FALSE)
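# Predict each model on the held-out validation set and evaluate its accuracy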
predrf<-predict(rf,testSet)
predgbm<-predict(gbm,testSet)
predlda<-predict(lda,testSet)
confusionMatrix(predrf,testSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 14 0 0 0
## B 0 1121 7 0 0
## C 0 3 1017 6 4
## D 0 0 2 957 2
## E 0 1 0 1 1076
##
## Overall Statistics
##
## Accuracy : 0.9932
## 95% CI : (0.9908, 0.9951)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9914
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9842 0.9912 0.9927 0.9945
## Specificity 0.9967 0.9985 0.9973 0.9992 0.9996
## Pos Pred Value 0.9917 0.9938 0.9874 0.9958 0.9981
## Neg Pred Value 1.0000 0.9962 0.9981 0.9986 0.9988
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1905 0.1728 0.1626 0.1828
## Detection Prevalence 0.2868 0.1917 0.1750 0.1633 0.1832
## Balanced Accuracy 0.9983 0.9914 0.9943 0.9960 0.9970
confusionMatrix(predgbm,testSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1653 41 0 0 3
## B 12 1073 37 5 8
## C 3 25 975 28 14
## D 6 0 12 923 18
## E 0 0 2 8 1039
##
## Overall Statistics
##
## Accuracy : 0.9623
## 95% CI : (0.9571, 0.967)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9523
## Mcnemar's Test P-Value : 1.25e-09
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9875 0.9421 0.9503 0.9575 0.9603
## Specificity 0.9896 0.9869 0.9856 0.9927 0.9979
## Pos Pred Value 0.9741 0.9454 0.9330 0.9625 0.9905
## Neg Pred Value 0.9950 0.9861 0.9895 0.9917 0.9911
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2809 0.1823 0.1657 0.1568 0.1766
## Detection Prevalence 0.2884 0.1929 0.1776 0.1630 0.1782
## Balanced Accuracy 0.9885 0.9645 0.9679 0.9751 0.9791
confusionMatrix(predlda,testSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1372 184 112 46 43
## B 35 711 101 58 171
## C 126 139 642 105 129
## D 133 50 144 703 119
## E 8 55 27 52 620
##
## Overall Statistics
##
## Accuracy : 0.6879
## 95% CI : (0.6758, 0.6997)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6049
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8196 0.6242 0.6257 0.7293 0.5730
## Specificity 0.9086 0.9231 0.8973 0.9094 0.9704
## Pos Pred Value 0.7809 0.6608 0.5627 0.6118 0.8136
## Neg Pred Value 0.9268 0.9110 0.9191 0.9449 0.9098
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2331 0.1208 0.1091 0.1195 0.1054
## Detection Prevalence 0.2986 0.1828 0.1939 0.1952 0.1295
## Balanced Accuracy 0.8641 0.7737 0.7615 0.8193 0.7717
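The random forest clearly performs best on the validation set (about 99.3% accuracy, versus 96.2% for GBM and 68.8% for LDA). As an optional cross-check (not part of the original output), caret's resamples() helper can summarize the cross-validation results of the three fitted models side by side; for a strict comparison the models should share identical fold indices (e.g., by passing the same index list to trainControl).
# Collect and summarize the 10-fold CV results of the three train() objects
cvResults <- resamples(list(RF = rf, GBM = gbm, LDA = lda))
summary(cvResults)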
set.seed(1234)
values<-predict(rf,testing)
values
The final predictions for the 20 test cases, produced by the random forest model, are shown below. Based on the held-out validation set, the estimated out-of-sample accuracy is about 99.3% (see the confusion matrix above), so the expected out-of-sample error is roughly 0.7%.
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
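Optionally, each of the 20 predictions can be written to its own text file for submission. The helper below (pml_write_files) is a hypothetical convenience function, not part of the original analysis.
# Write one text file per prediction (problem_id_1.txt, ..., problem_id_20.txt)
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(values)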