This report was prepared for the Course Project Prediction Assignment of Johns Hopkins’ Practical Machine Learning online class.
The goal is to use the http://groupware.les.inf.puc-rio.br/har data to predict “classe”, the manner in which the participants performed the exercise.
Focus On
Due to the course project page limit, output results are available in the Appendix. The R code is in the accompanying R file.
set.seed(999)
loadDataset <- function(data.file) { # load one data set at a time
  df.data <- read.csv(data.file, header=TRUE, sep=",", na.strings=c("NA","#DIV/0!"));
  return(df.data)
}
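The script also relies on two small helpers, logStep and downloadZipFile, which are defined in the separate R file. The sketch below is only an assumed reconstruction of their behavior (a cat()-based logger, and a download that is skipped when the destination file already exists), not the original code.
logStep <- function(...) {    # assumed sketch: plain cat()-based logger; callers add "\n" themselves
  cat(...)
}
downloadZipFile <- function(src.file.url, dest.dir, dest.file, unzip=FALSE) {
  # assumed sketch: download src.file.url to dest.dir/dest.file, skipping the download if it already exists
  dest.filepath <- file.path(dest.dir, dest.file)
  logStep("Downloading from", src.file.url, "\n to dest.filepath=", dest.filepath)
  if (!file.exists(dest.filepath)) {
    download.file(src.file.url, destfile=dest.filepath, mode="wb")
    if (unzip) unzip(dest.filepath, exdir=dest.dir)
  } else {
    logStep(" dest.filepath exist. No download is needed.")
  }
  invisible(dest.filepath)
}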
logStep("\n(1) Getting data - Download data files")
downloadZipFile(src.file.url="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
dest.dir=data.dir, dest.file="pml-training.csv", unzip=FALSE)
downloadZipFile(src.file.url="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
dest.dir=data.dir, dest.file="pml-testing.csv", unzip=FALSE)
logStep("\n(2) Read data file, and load to data set")
# read data, treating the strings "NA", "", and "#DIV/0!" as NA values
traindata.filename<-"pml-training.csv"
traindata<-read.csv(file.path(data.dir, traindata.filename), na.strings=c("NA","#DIV/0!", ""))
testdata.filename<-"pml-testing.csv"
testdata<-read.csv(file.path(data.dir, testdata.filename), na.strings=c("NA","#DIV/0!", ""))
logStep("\n(3) Exploring data")
logStep("...dim= ", dim(traindata)); logStep("...dim= ", dim(testdata));
#-See appdx for details: colnames(traindata); str(traindata); summary(traindata$classe); colnames(testdata)
logStep("...Important column info: summary of classe=\n"); print(summary(traindata$classe))
##
## (1) Getting data - Download data files Downloading from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
## to dest.filepath= D:/_github/PracticalMachineLearningProject/mydata/pml-training.csv dest.filepath exist.No download is needed. Downloading from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
## to dest.filepath= D:/_github/PracticalMachineLearningProject/mydata/pml-testing.csv dest.filepath exist.No download is needed.
## (2) Read data file, and load to data set
## (3) Exploring data ...dim= 19622 160 ...dim= 20 160 ...Important column info: summary of classe=
## A B C D E
## 5580 3797 3422 3216 3607
The following criteria are used to examine the data and remove columns (variables) that match any of these conditions: columns containing NA values, near-zero-variance columns, columns not relevant to 'classe', and highly correlated columns.
logStep("(4) PreProcess data "); logStep("(4.1) Check and remove NA variables" )
#-see appdx sort(colSums(is.na(traindata)))
var.isna<-(colSums(is.na(traindata)) > 0) # is.na value: TRUE(1) if NA is found; FALSE(0) if NA not found
cdata<-traindata[, !var.isna ]; logStep("...data ncol=", ncol(cdata))
## (4) PreProcess data (4.1) Check and remove NA variables ...data ncol= 60
logStep("(4.2) Checking and remove Near Zero Variables ") # Near to zero variable have no variability, and are not useful when constructing a prediction model
require(caret); data.1 <- cdata;
var.nearzero<-nearZeroVar(data.1, allowParallel=TRUE, saveMetrics=FALSE) ; logStep("...nearzero.length=", length(var.nearzero));
if (length(var.nearzero)>0 ) {
logStep("...nearzero.colname=", names(cdata)[var.nearzero])
data.1<-data.1[, -var.nearzero ]
}
logStep("...data ncol=", ncol(data.1))
## (4.2) Checking and remove Near Zero Variables ...nearzero.length= 1 ...nearzero.colname= new_window ...data ncol= 59
logStep("(4.3) Remove non-relevant Variables toward 'classe' "); logStep(names(data.1)[1:6] );
data.1<-data.1[, -1:-6 ] ; logStep("...data ncol=", ncol(data.1))
## (4.3) Remove non-relevant Variables toward 'classe' X user_name raw_timestamp_part_1 raw_timestamp_part_2 cvtd_timestamp num_window ...data ncol= 53
logStep("(4.4) Check and remove High correlated Variables ") #shows the effect of removing descriptors with absolute correlations above .75.
cor.1<-cor(data.1[,-ncol(data.1)]); logStep("...Summary(cor.1)= \n"); print( summary(cor.1[upper.tri( cor.1 )]))
cor.high <- sum(abs(cor.1[upper.tri(cor.1)]) > .75); logStep("...Summary(abs(cor.1)) > 0.75 \n"); print( cor.high );
cor.highly<-findCorrelation( cor.1, cutoff=0.75 )
if(length(cor.highly)>0) {
data.1<-data.1[,-cor.highly];
cor.2<-cor(data.1[,-ncol(data.1)]); logStep("...cor.2)) cutoff=0.75 \n"); print(summary(cor.2[upper.tri( cor.2 )]));
}
logStep("...data ncol=", ncol(data.1))
## (4.4) Check and remove High correlated Variables ...Summary(cor.1)=
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.992000 -0.110100 0.002092 0.001790 0.092550 0.980900
## ...Summary(abs(cor.1)) > 0.75
## [1] 31
## ...cor.2)) cutoff=0.75
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.607000 -0.103800 0.006527 0.003332 0.087530 0.736500
## ...data ncol= 32
The predictors are now cut down to 31 variables (32 columns including the classe outcome) for model fitting and cross validation.
Note: With this many predictors, generating feature plots and deciding from plots which variables fit the models best would be extremely time consuming. Therefore I went directly to fitting prediction models, and will determine the important predictors from the best model based on the fitting results.
The data (32 columns) is now clean and ready for cross validation.
Create a data partition for cross validation.
Split the data 60% into a training set (sTrain) and 40% into a test set (sTest).
‘center’ and ‘scale’ are used to pre-process the predictor data.
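As a small illustration (an assumed sketch, not part of the pipeline): passing preProcess=c("center","scale") to train() below asks caret to estimate each predictor's mean and standard deviation on the training data and standardize the columns, roughly equivalent to:
pp <- preProcess(data.1[, -ncol(data.1)], method=c("center","scale"))  # estimate means/sds on the predictors only
scaled.example <- predict(pp, data.1[, -ncol(data.1)])                 # each numeric predictor now has mean 0, sd 1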
data<- data.1;
idxTrain<-createDataPartition(y=data$classe, p=0.6, list=FALSE)
sTrain<- data[idxTrain,]; dim(sTrain);
## [1] 11776 32
sTest<- data[-idxTrain,]; dim(sTest);
## [1] 7846 32
Note: the following train() calls took over 2 hours to run across all methods, so I saved each train result to a local .RData file and reload it when generating the prediction/comparison results for the Rmd and HTML report.
require(randomForest); require(forecast); require(caret);
#(5.1) Classification tree modeling
rdata.file <- file.path(data.dir, "fit.rpart.RData")
if (file.exists(rdata.file)) {
load(file=rdata.file)
} else {
fit.rpart<-train(classe ~ . , preProcess=c("center","scale"), data = sTrain, method="rpart");
save(fit.rpart, file=rdata.file);
}
pred.rpart<-predict(fit.rpart, sTest)
#-see appdx print(fit.rpart); ..cross validation predict(fit.rpart, sTest); print(summary(pred.rpart))
sTest$rightPred <- pred.rpart == sTest$classe
#-see appdx print(table(pred.rpart, sTest$classe))
accuracy.rpart<- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.rpart<-sum(sTest$rightPred)/length(pred.rpart)
outOfSample.error.rpart<-(1-outOfSample.accuracy.rpart)
cf.rpart<- confusionMatrix(pred.rpart, sTest$classe)
#-see appdx print(cf.rpart)
#(5.2) Random Forest model
rdata.file <- file.path(data.dir, "fit.rf.RData")
if (file.exists(rdata.file)) {
logStep( "loading ",rdata.file, load(file=rdata.file))
} else {
traincontrol <- trainControl(method = "cv", number = 10) # using "cv" for resampling method, "10" number of resampling iterations
logStep (system.time(fit.rf <- train(classe ~., preProcess=c("center","scale"), data = sTrain,
method="rf", trControl=traincontrol , verbose = FALSE)))
save(fit.rf,file=rdata.file)
}
pred.rf <-predict(fit.rf, sTest)
#-see appdx print(fit.rf); ..cross validation predict(fit.rf, sTest); print(summary(pred.rf))
sTest$rightPred<-pred.rf == sTest$classe
#-see appdx print(table(pred.rf, sTest$classe))
accuracy.rf <- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.rf<-sum(sTest$rightPred)/length(pred.rf)
outOfSample.error.rf<-(1-outOfSample.accuracy.rf)
cf.rf <- confusionMatrix(pred.rf, sTest$classe)
#-see appdx print(cf.rf)
## loading D:/_github/PracticalMachineLearningProject/mydata/fit.rf.RData fit.rf
#(5.3) Boosting model
rdata.file <- file.path(data.dir, "fit.gbm.RData")
if (file.exists(rdata.file)) {
load(file=rdata.file)
} else {
logStep (system.time(fit.gbm <- train(classe ~., preProcess=c("center","scale"), data = sTrain, method="gbm", verbose = FALSE)))
save(fit.gbm,file=rdata.file)
}
pred.gbm<-predict(fit.gbm, sTest)
#-see appdx print(fit.gbm); ..cross validation predict(fit.gbm, sTest); print(summary(pred.gbm))
sTest$rightPred<-pred.gbm == sTest$classe
#-see appdx print(table(pred.gbm, sTest$classe))
accuracy.gbm<- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.gbm<-sum(sTest$rightPred)/length(pred.gbm)
outOfSample.error.gbm<-(1-outOfSample.accuracy.gbm)
cf.gbm <- confusionMatrix(pred.gbm, sTest$classe)
#-see appdx print(cf.gbm)
#(5.4) Linear Discriminant Analysis model
rdata.file <- file.path(data.dir, "fit.lda.RData")
if (file.exists(rdata.file)) {
load(file=rdata.file)
} else {
logStep (system.time(fit.lda <- train(classe ~., preProcess=c("center","scale"), data = sTrain, method="lda", verbose = FALSE)))
save(fit.lda,file=rdata.file)
}
pred.lda<-predict(fit.lda, sTest);
#-see appdx print(fit.lda); ..cross validation predict(fit.lda, sTest); print(summary(pred.lda))
sTest$rightPred<-pred.lda == sTest$classe
#-see appdx print(table(pred.lda, sTest$classe))
accuracy.lda <- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.lda<-sum(sTest$rightPred)/length(pred.lda)
outOfSample.error.lda<-(1-outOfSample.accuracy.lda)
cf.lda <- confusionMatrix(pred.lda, sTest$classe)
#-see appdx print(cf.lda)
#(5.5) Naive Bayes model
require(klaR)
rdata.file <- file.path(data.dir, "fit.nb.RData")
if (file.exists(rdata.file)) {
load(file=rdata.file)
} else {
logStep (system.time(fit.nb <- train(classe ~., preProcess=c("center","scale"), data = sTrain, method="nb", verbose = FALSE)))
save(fit.nb,file=rdata.file)
}
pred.nb<-predict(fit.nb, sTest)
#-see appdx print(fit.nb);..cross validation predict(fit.nb, sTest); print(summary(pred.nb));
sTest$rightPred<-pred.nb == sTest$classe
#-see appdx print(table(pred.nb, sTest$classe))
accuracy.nb <- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.nb<-sum(sTest$rightPred)/length(pred.nb)
outOfSample.error.nb<-(1-outOfSample.accuracy.nb)
cf.nb <- confusionMatrix(pred.nb, sTest$classe)
#-see appdx print(cf.nb)
The cross validation estimate is an out-of-sample estimate. For each train method the comparison includes:
- In-Sample Accuracy
- Confusion Matrix Accuracy
- Out-of-Sample Accuracy
- In-Sample Error
- Out-of-Sample Error
The best performing model is expected to have (1) the highest accuracy value and (2) the lowest error value.
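These quantities all come from the same held-out sTest predictions; as a quick sketch of the relationship (equivalent to the sum(...)/nrow(...) code above, shown here for the rf model):
acc.example <- mean(pred.rf == sTest$classe)  # proportion of correct hold-out predictions; equals accuracy.rf above
err.example <- 1 - acc.example                # out-of-sample error is one minus out-of-sample accuracy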
logStep ("(6) Comparison accuracy, In/Out sample error\n")
comp.result <- cbind( c("model/train method", "rpart","rf","gbm","lda","nb"),
c("InSample.Accuracy", accuracy.rpart, accuracy.rf, accuracy.gbm, accuracy.lda, accuracy.nb),
c("confusionMatrix.Accuracy",cf.rpart$overall['Accuracy'],cf.rf$overall['Accuracy'], cf.gbm$overall['Accuracy'],
cf.lda$overall['Accuracy'],cf.nb$overall['Accuracy']),
c("InSample.Error",1-cf.rpart$overall['Accuracy'], 1-cf.rf$overall['Accuracy'], 1-cf.gbm$overall['Accuracy'],
1-cf.lda$overall['Accuracy'], 1-cf.nb$overall['Accuracy']),
c("OutOfSample.Accuracy",outOfSample.accuracy.rpart, outOfSample.accuracy.rf, outOfSample.accuracy.gbm,
outOfSample.accuracy.lda, outOfSample.accuracy.nb),
c("OutOfSample.Error",outOfSample.error.rpart, outOfSample.error.rf, outOfSample.error.gbm,
outOfSample.error.lda, outOfSample.error.nb))
print(comp.result)
logStep("...rpart postResample\n"); print(postResample(pred.rpart, sTest$classe))
logStep("...rf postResample\n"); print(postResample(pred.rf, sTest$classe))
logStep("...gbm postResample\n"); print(postResample(pred.gbm, sTest$classe))
logStep("...lda postResample\n"); print(postResample(pred.lda, sTest$classe))
logStep("...nb postResample\n"); print(postResample(pred.nb, sTest$classe))
## (6) Comparison accuracy, In/Out sample error
## [,1] [,2]
## "model/train method" "InSample.Accuracy"
## Accuracy "rpart" "0.547030333928116"
## Accuracy "rf" "0.989039000764721"
## Accuracy "gbm" "0.943028294672445"
## Accuracy "lda" "0.571501401988274"
## Accuracy "nb" "0.729671170022942"
## [,3] [,4]
## "confusionMatrix.Accuracy" "InSample.Error"
## Accuracy "0.547030333928116" "0.452969666071884"
## Accuracy "0.989039000764721" "0.0109609992352792"
## Accuracy "0.943028294672445" "0.0569717053275555"
## Accuracy "0.571501401988274" "0.428498598011726"
## Accuracy "0.729671170022942" "0.270328829977058"
## [,5] [,6]
## "OutOfSample.Accuracy" "OutOfSample.Error"
## Accuracy "0.547030333928116" "0.452969666071884"
## Accuracy "0.989039000764721" "0.0109609992352792"
## Accuracy "0.943028294672445" "0.0569717053275555"
## Accuracy "0.571501401988274" "0.428498598011726"
## Accuracy "0.729671170022942" "0.270328829977058"
## ...rpart postResample
## Accuracy Kappa
## 0.5470303 0.4302387
## ...rf postResample
## Accuracy Kappa
## 0.9890390 0.9861331
## ...gbm postResample
## Accuracy Kappa
## 0.9430283 0.9279087
## ...lda postResample
## Accuracy Kappa
## 0.5715014 0.4577694
## ...nb postResample
## Accuracy Kappa
## 0.7296712 0.6608958
Comparison Result and Choosing the Final Model
Comparing the “rpart”, “rf”, “gbm”, “lda”, and “nb” models, the two highest-ranking methods were:
| Method | In Sample Accuracy | Out Of Sample Accuracy | 95% CI | Kappa | In Sample Error | Out Of Sample Error |
|---|---|---|---|---|---|---|
| rf | 98.9% | 98.9% | (98.7%, 99.1%) | 98.6% | 1.1% | 1.1 % |
| gbm | 94.3% | 94.3 % | (93.8%, 94.8%) | 92.8% | 5.7% | 5.7% |
The best performing model is the Random Forest (rf), so I chose rf as the final model to apply to the PML test data.
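As a quick follow-up (a sketch using caret's varImp, not part of the original submission code), the chosen random forest fit can also report which of the 31 predictors contribute most:
imp.rf <- varImp(fit.rf)  # caret's variable importance for the chosen random forest model
print(imp.rf)             # ranks the most influential predictors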
logStep ("(7) Generate files for assginment submission\n")
# Prediction submission -- select RF model to predict test data
pred.answer <- predict( fit.rf, testdata)
logStep("...pred.answer=")
print( pred.answer)
# write each predicted answer to its own problem_id_<i>.txt file for submission
pml_write_files = function(x){
  n = length(x)
  for (i in 1:n) {
    filename = paste0("problem_id_", i, ".txt")
    filepath <- file.path(output.dir, filename)
    write.table(x[i], file=filepath, quote=FALSE, row.names=FALSE, col.names=FALSE)
  }
}
logStep("...write to files"); pml_write_files(pred.answer)
## (7) Generate files for assignment submission
## ...pred.answer= [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
## ...write to files
After comparing the “rpart”, “rf”, “gbm”, “lda”, and “nb” results, the Random Forest (rf) model showed:
| In Sample Accuracy | Out Of Sample Accuracy | 95% CI | Kappa | In Sample Error | Out Of Sample Error |
|---|---|---|---|---|---|
| 98.9% | 98.9% | (98.7%, 99.1%) | 98.6% | 1.1% | 1.1 % |
The course project submission result was 20/20 correct.
logStep("(3) Exploring data\n ..Train Data colnames= \n"); head(colnames(traindata)); #str(head(traindata)) ;
## (3) Exploring data
## ..Train Data colnames=
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
head(colnames(testdata));
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
logStep("(4) PreProcess data (4.1) Check and remove NA variables" ); head( sort(colSums(is.na(traindata)), decreasing = TRUE))
## (4) PreProcess data (4.1) Check and remove NA variables
## kurtosis_yaw_belt skewness_yaw_belt kurtosis_yaw_dumbbell
## 19622 19622 19622
## skewness_yaw_dumbbell kurtosis_yaw_forearm skewness_yaw_forearm
## 19622 19622 19622
logStep ("(5.1) Classification tree modeling\n ..fit.rpart=\n"); print(fit.rpart);
## (5.1) Classification tree modeling
## ..fit.rpart=
## CART
##
## 11776 samples
## 31 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.02491694 0.5395200 0.4158615 0.02502035 0.03913236
## 0.02758662 0.5184748 0.3880017 0.02578250 0.04047764
## 0.04182487 0.4131974 0.2203257 0.10276153 0.17280339
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02491694.
logStep("..cross validation predict(fit.rpart, sTest)=\n"); print( summary(pred.rpart))
## ..cross validation predict(fit.rpart, sTest)=
## A B C D E
## 1786 1989 2206 603 1262
logStep("..table(pred.rpart,sTest$classe)=\n"); print(table(pred.rpart, sTest$classe));
## ..table(pred.rpart,sTest$classe)=
##
## pred.rpart A B C D E
## A 1394 266 31 66 29
## B 223 860 173 318 415
## C 316 290 1016 329 255
## D 107 37 60 339 60
## E 192 65 88 234 683
print(cf.rpart)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 266 31 66 29
## B 223 860 173 318 415
## C 316 290 1016 329 255
## D 107 37 60 339 60
## E 192 65 88 234 683
##
## Overall Statistics
##
## Accuracy : 0.547
## 95% CI : (0.5359, 0.5581)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4302
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.6246 0.5665 0.7427 0.26361 0.47365
## Specificity 0.9302 0.8216 0.8163 0.95976 0.90959
## Pos Pred Value 0.7805 0.4324 0.4606 0.56219 0.54120
## Neg Pred Value 0.8617 0.8877 0.9376 0.86925 0.88472
## Prevalence 0.2845 0.1935 0.1744 0.16391 0.18379
## Detection Rate 0.1777 0.1096 0.1295 0.04321 0.08705
## Detection Prevalence 0.2276 0.2535 0.2812 0.07685 0.16085
## Balanced Accuracy 0.7774 0.6941 0.7795 0.61168 0.69162
require(rattle); fancyRpartPlot(fit.rpart$finalModel, main="Rpart Tree Model") # fancyRpartPlot comes from the rattle package
logStep ("(5.2) Random Forest model\n ..fit.rf=\n"); print(fit.rf )
## (5.2) Random Forest model
## ..fit.rf=
## Random Forest
##
## 11776 samples
## 31 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10597, 10598, 10599, 10600, 10598, 10598, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9884510 0.9853886 0.002342040 0.002964147
## 16 0.9887908 0.9858192 0.003246769 0.004107953
## 31 0.9808937 0.9758304 0.003883894 0.004914435
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 16.
logStep("..cross validation predict(fit,rf, sTest) =\n"); print( summary(pred.rf))
## ..cross validation predict(fit,rf, sTest) =
## A B C D E
## 2242 1519 1378 1271 1436
logStep("..table(pred.rf,sTest$classe)=\n"); print(table(pred.rf, sTest$classe))
## ..table(pred.rf,sTest$classe)=
##
## pred.rf A B C D E
## A 2228 13 0 1 0
## B 2 1498 18 1 0
## C 0 7 1342 24 5
## D 2 0 8 1258 3
## E 0 0 0 2 1434
print(cf.rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2228 13 0 1 0
## B 2 1498 18 1 0
## C 0 7 1342 24 5
## D 2 0 8 1258 3
## E 0 0 0 2 1434
##
## Overall Statistics
##
## Accuracy : 0.989
## 95% CI : (0.9865, 0.9912)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9861
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9868 0.9810 0.9782 0.9945
## Specificity 0.9975 0.9967 0.9944 0.9980 0.9997
## Pos Pred Value 0.9938 0.9862 0.9739 0.9898 0.9986
## Neg Pred Value 0.9993 0.9968 0.9960 0.9957 0.9988
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2840 0.1909 0.1710 0.1603 0.1828
## Detection Prevalence 0.2858 0.1936 0.1756 0.1620 0.1830
## Balanced Accuracy 0.9979 0.9918 0.9877 0.9881 0.9971
logStep ("(5.3) Boosting model \n ..fit.gbm=\n\n"); print(fit.gbm )
## (5.3) Boosting model
## ..fit.gbm=
## Stochastic Gradient Boosting
##
## 11776 samples
## 31 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa Accuracy SD
## 1 50 0.7101335 0.6324070 0.006924339
## 1 100 0.7730951 0.7127079 0.007371483
## 1 150 0.8075144 0.7564129 0.004440979
## 2 50 0.8202313 0.7723139 0.005457241
## 2 100 0.8749929 0.8418270 0.004043285
## 2 150 0.9023588 0.8764695 0.004721279
## 3 50 0.8629356 0.8264678 0.004982426
## 3 100 0.9148331 0.8922231 0.005503074
## 3 150 0.9378045 0.9213011 0.005168093
## Kappa SD
## 0.008788581
## 0.009331767
## 0.005705155
## 0.007002126
## 0.005163815
## 0.006008614
## 0.006358391
## 0.007007181
## 0.006553734
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
logStep("..cross validation predict(fit.gbm, sTest)=\n"); print(summary(pred.gbm))
## ..cross validation predict(fit.gbm, sTest)=
## A B C D E
## 2257 1516 1418 1259 1396
logStep("..table(pred.gbm,sTest$classe)=\n"); print(table(pred.gbm, sTest$classe))
## ..table(pred.gbm,sTest$classe)=
##
## pred.gbm A B C D E
## A 2193 58 0 2 4
## B 16 1377 82 19 22
## C 8 69 1260 65 16
## D 15 6 25 1191 22
## E 0 8 1 9 1378
print(cf.gbm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2193 58 0 2 4
## B 16 1377 82 19 22
## C 8 69 1260 65 16
## D 15 6 25 1191 22
## E 0 8 1 9 1378
##
## Overall Statistics
##
## Accuracy : 0.943
## 95% CI : (0.9377, 0.9481)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9279
## Mcnemar's Test P-Value : 2.539e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9825 0.9071 0.9211 0.9261 0.9556
## Specificity 0.9886 0.9780 0.9756 0.9896 0.9972
## Pos Pred Value 0.9716 0.9083 0.8886 0.9460 0.9871
## Neg Pred Value 0.9930 0.9777 0.9832 0.9856 0.9901
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2795 0.1755 0.1606 0.1518 0.1756
## Detection Prevalence 0.2877 0.1932 0.1807 0.1605 0.1779
## Balanced Accuracy 0.9856 0.9426 0.9483 0.9579 0.9764
logStep ("(5.4) Linear Discriminant Analysis model \n ..fit.lda\n"); print(fit.lda)
## (5.4) Linear Discriminant Analysis model
## ..fit.lda
## Linear Discriminant Analysis
##
## 11776 samples
## 31 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ...
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.5816037 0.4702839 0.0062626 0.007891692
##
##
logStep("..cross validation predict(fit.lda, sTest)=\n"); print(summary(pred.lda))
## ..cross validation predict(fit.lda, sTest)=
## A B C D E
## 2297 1532 1649 1352 1016
logStep("..table(pred.lda,sTest$classe)=\n"); print(table(pred.lda, sTest$classe))
## ..table(pred.lda,sTest$classe)=
##
## pred.lda A B C D E
## A 1521 319 254 91 112
## B 298 704 142 133 255
## C 215 201 838 189 206
## D 174 152 110 734 182
## E 24 142 24 139 687
print(cf.lda)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1521 319 254 91 112
## B 298 704 142 133 255
## C 215 201 838 189 206
## D 174 152 110 734 182
## E 24 142 24 139 687
##
## Overall Statistics
##
## Accuracy : 0.5715
## 95% CI : (0.5605, 0.5825)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4578
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.6815 0.46377 0.6126 0.57076 0.47642
## Specificity 0.8618 0.86915 0.8748 0.90579 0.94863
## Pos Pred Value 0.6622 0.45953 0.5082 0.54290 0.67618
## Neg Pred Value 0.8719 0.87108 0.9145 0.91500 0.88946
## Prevalence 0.2845 0.19347 0.1744 0.16391 0.18379
## Detection Rate 0.1939 0.08973 0.1068 0.09355 0.08756
## Detection Prevalence 0.2928 0.19526 0.2102 0.17232 0.12949
## Balanced Accuracy 0.7716 0.66646 0.7437 0.73828 0.71252
logStep ("(5.5) Naive Bayes model \n ..fit.nb=\n"); print(fit.nb)
## (5.5) Naive Bayes model
## ..fit.nb=
## Naive Bayes
##
## 11776 samples
## 31 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa Accuracy SD Kappa SD
## FALSE 0.5808695 0.4717188 0.007301880 0.009109709
## TRUE 0.7248053 0.6546382 0.009763068 0.012263254
##
## Tuning parameter 'fL' was held constant at a value of 0
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0 and usekernel = TRUE.
logStep("..cross validation predict(fit.nb, sTest)=\n"); print(summary(pred.nb))
## ..cross validation predict(fit.nb, sTest)=
## A B C D E
## 1800 1449 1969 1379 1249
logStep("..table(pred.nb,sTest$classe)=\n"); print(table(pred.nb, sTest$classe))
## ..table(pred.nb,sTest$classe)=
##
## pred.nb A B C D E
## A 1580 114 48 34 24
## B 138 1057 113 13 128
## C 231 203 1119 292 124
## D 255 102 79 873 70
## E 28 42 9 74 1096
print(cf.nb)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1580 114 48 34 24
## B 138 1057 113 13 128
## C 231 203 1119 292 124
## D 255 102 79 873 70
## E 28 42 9 74 1096
##
## Overall Statistics
##
## Accuracy : 0.7297
## 95% CI : (0.7197, 0.7395)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6609
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7079 0.6963 0.8180 0.6788 0.7601
## Specificity 0.9608 0.9381 0.8688 0.9229 0.9761
## Pos Pred Value 0.8778 0.7295 0.5683 0.6331 0.8775
## Neg Pred Value 0.8922 0.9279 0.9576 0.9361 0.9476
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2014 0.1347 0.1426 0.1113 0.1397
## Detection Prevalence 0.2294 0.1847 0.2510 0.1758 0.1592
## Balanced Accuracy 0.8343 0.8172 0.8434 0.8009 0.8681