This report was prepared for the Course Project Prediction Assignment of Johns Hopkins' Practical Machine Learning online class.

The goal is to use the http://groupware.les.inf.puc-rio.br/har data to predict "classe", the manner in which the participants did the exercise.

Focus On

Due to the course project page limit, detailed output is available in the Appendix. The R code is kept in a separate R file; the helper functions it defines (logStep, downloadZipFile) and the directory settings are sketched below.
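A minimal sketch, assuming implementations along these lines (the real definitions live in the R file and may differ):

# Directory settings assumed for this sketch; the actual paths are set in the R file.
data.dir   <- "./mydata"    # where the raw .csv files and cached .RData files live
output.dir <- "./output"    # where the submission files are written

logStep <- function(...) {
  # Print the arguments as one progress message.
  cat("\n  ", ..., " ")
}

downloadZipFile <- function(src.file.url, dest.dir, dest.file, unzip=FALSE) {
  # Download src.file.url to dest.dir/dest.file unless the file already exists;
  # optionally unzip the downloaded archive.
  dest.filepath <- file.path(dest.dir, dest.file)
  if (!file.exists(dest.filepath)) {
    logStep("Downloading from ", src.file.url, " to dest.filepath= ", dest.filepath)
    download.file(src.file.url, destfile=dest.filepath, mode="wb")
    if (unzip) unzip(dest.filepath, exdir=dest.dir)
  } else {
    logStep("dest.filepath= ", dest.filepath, " exists. No download is needed.")
  }
}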

Access the Data, Explore Data and Basic Statistics

set.seed(999)
loadDataset<-function(data.file ) { # load one data set at a time
  df.data<-read.csv(data.file, header=TRUE, sep = ",", na.strings=c("NA","#DIV/0!")); 
  return (df.data)  
}
logStep("\n(1) Getting data - Download data files")
downloadZipFile(src.file.url="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
 dest.dir=data.dir, dest.file="pml-training.csv", unzip=FALSE)
downloadZipFile(src.file.url="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
 dest.dir=data.dir, dest.file="pml-testing.csv", unzip=FALSE)

logStep("\n(2) Read data file, and load to data set")
# read and store values as NA if the string value is one of "NA", "", "#DIV/0!"
traindata.filename<-"pml-training.csv" 
traindata<-read.csv(file.path(data.dir, traindata.filename), na.strings=c("NA","#DIV/0!", ""))
testdata.filename<-"pml-testing.csv" 
testdata<-read.csv(file.path(data.dir, testdata.filename), na.strings=c("NA","#DIV/0!", ""))
logStep("\n(3) Exploring data")
logStep("...dim= ", dim(traindata)); logStep("...dim= ", dim(testdata));
#-See appdx for details: colnames(traindata); str(traindata); summary(traindata$classe); colnames(testdata)
logStep("...Important column info: summary of classe=\n"); print(summary(traindata$classe))
##   
## (1) Getting data - Download data files    Downloading from  https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv 
## to dest.filepath= D:/_github/PracticalMachineLearningProject/mydata/pml-training.csv  dest.filepath exist.No download is needed.  Downloading from  https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv 
## to dest.filepath= D:/_github/PracticalMachineLearningProject/mydata/pml-testing.csv   dest.filepath exist.No download is needed.  
## (2) Read data file, and load to data set  
## (3) Exploring data    ...dim=  19622 160  ...dim=  20 160     ...Important column info: summary of classe=
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607

Preprocess Data: Examine and Clean the Data

The data is examined against the following criteria, and columns (variables) matching any of these conditions are removed:

  1. Check for missing values/fields/columns and remove variables containing NAs
  2. Check for and remove near-zero-variance variables: variables with near zero variance have no variability and are not useful when constructing a prediction model
  3. Remove variables that are not relevant to 'classe'
  4. Check for and remove highly correlated variables: remove descriptors with absolute correlations above 0.75
logStep("(4) PreProcess data "); logStep("(4.1) Check and remove NA variables" )
#-see appdx sort(colSums(is.na(traindata)))
var.isna<-(colSums(is.na(traindata)) > 0) # is.na value: TRUE(1) if NA is found; FALSE(0) if NA not found 
cdata<-traindata[, !var.isna ]; logStep("...data ncol=", ncol(cdata))
##   (4) PreProcess data     (4.1) Check and remove NA variables     ...data ncol= 60
logStep("(4.2) Checking and remove Near Zero Variables ") # Near to zero variable have no variability, and are not useful when constructing a prediction model
require(caret); data.1 <- cdata;
var.nearzero<-nearZeroVar(data.1, allowParallel=TRUE, saveMetrics=FALSE) ; logStep("...nearzero.length=", length(var.nearzero)); 
if (length(var.nearzero)>0 ) {
 logStep("...nearzero.colname=", names(cdata)[var.nearzero])
 data.1<-data.1[, -var.nearzero ]
}
logStep("...data ncol=", ncol(data.1))
##   (4.2) Checking and remove Near Zero Variables   ...nearzero.length= 1   ...nearzero.colname= new_window     ...data ncol= 59
logStep("(4.3) Remove non-relevant Variables toward 'classe' "); logStep(names(data.1)[1:6] ); 
data.1<-data.1[, -1:-6 ] ; logStep("...data ncol=", ncol(data.1))
##   (4.3) Remove non-relevant Variables toward 'classe'     X user_name raw_timestamp_part_1 raw_timestamp_part_2 cvtd_timestamp num_window     ...data ncol= 53
logStep("(4.4) Check and remove High correlated Variables ") #shows the effect of removing descriptors with absolute correlations above .75.
cor.1<-cor(data.1[,-ncol(data.1)]); logStep("...Summary(cor.1)= \n"); print( summary(cor.1[upper.tri( cor.1 )]))
cor.high <- sum(abs(cor.1[upper.tri(cor.1)]) > .75); logStep("...Summary(abs(cor.1)) > 0.75 \n"); print( cor.high ); 
cor.highly<-findCorrelation( cor.1, cutoff=0.75 )
if(length(cor.highly)>0) {
  data.1<-data.1[,-cor.highly]; 
  cor.2<-cor(data.1[,-ncol(data.1)]); logStep("...cor.2)) cutoff=0.75 \n"); print(summary(cor.2[upper.tri( cor.2 )]));
}
logStep("...data ncol=", ncol(data.1))
##   (4.4) Check and remove High correlated Variables    ...Summary(cor.1)= 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.992000 -0.110100  0.002092  0.001790  0.092550  0.980900 
##   ...Summary(abs(cor.1)) > 0.75 
## [1] 31
##   ...cor.2)) cutoff=0.75 
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.607000 -0.103800  0.006527  0.003332  0.087530  0.736500 
##   ...data ncol= 32

The data is now cut down to 32 columns (31 predictors plus 'classe') for model fitting and cross validation.

Note: With this many predictors, it is extremely time consuming to generate feature plots or to determine from plots which variables are better for fitting models. I therefore went directly to fitting prediction models and determined the important predictors from the best model afterwards.
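For example, once the models below are trained, variable importance can be read directly from the best fit with caret's varImp() (a sketch, assuming the fit.rf object created in the next section):

require(caret)
# Rank the predictors by importance from the random forest fit; a quick
# substitute for exploratory feature plots.
imp.rf <- varImp(fit.rf)
print(imp.rf)            # importance scores per predictor
plot(imp.rf, top = 20)   # plot the 20 most important predictors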

Fit Prediction Modeling and Cross Validation

The data (32 columns) is now clean and ready to be used for cross validation.

Create Data Partition

Create a data partition for cross validation: 60% of the data for training (sTrain) and 40% for validation (sTest).

data<- data.1; 
idxTrain<-createDataPartition(y=data$classe, p=0.6, list=FALSE)
sTrain<- data[idxTrain,]; dim(sTrain);
## [1] 11776    32
sTest<- data[-idxTrain,]; dim(sTest);
## [1] 7846   32

Build Prediction Models on the sTrain Set and Validate on the sTest Set

Note: Training the following models with 'train' took over 2 hours in total, so each train result was saved to a local .RData file and re-loaded to generate the prediction/comparison results for the Rmd and HTML report (a sketch of this cache-or-train pattern follows the list below).

  1. Classification tree modeling (rpart)
  2. Random Forest model (rf)
    • using "cv" as the resampling method with 10 resampling iterations
  3. Boosting model (gbm)
  4. Linear Discriminant Analysis model (lda)
  5. Naive Bayes model (nb)
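Each model chunk below repeats the same cache-or-train logic; a generic wrapper could factor it out. This is only a sketch and is not part of the project code (the trainCached name is hypothetical):

# Hypothetical helper: load a cached caret fit from rdata.file if it exists,
# otherwise train it and cache it. Mirrors the per-model chunks below.
trainCached <- function(rdata.file, fit.name, ...) {
  if (file.exists(rdata.file)) {
    # Restore the fit saved under fit.name into this function's environment.
    load(file=rdata.file, envir=environment())
  } else {
    assign(fit.name, train(...), envir=environment())   # formula, data, method, preProcess, ...
    save(list=fit.name, file=rdata.file, envir=environment())
  }
  get(fit.name, envir=environment())
}
# Example usage (same settings as the rpart chunk below):
# fit.rpart <- trainCached(file.path(data.dir, "fit.rpart.RData"), "fit.rpart",
#                          classe ~ ., data=sTrain, method="rpart",
#                          preProcess=c("center","scale"))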
require(randomForest); require(forecast); require(caret); 
#(5.1) Classification tree modeling
rdata.file <- file.path(data.dir, "fit.rpart.RData")
if (file.exists(rdata.file)) {
  load(file=rdata.file)
} else {
  fit.rpart<-train(classe ~ . , preProcess=c("center","scale"), data = sTrain, method="rpart");
  save(fit.rpart, file=rdata.file);
}
pred.rpart<-predict(fit.rpart, sTest)
#-see appdx print(fit.rpart);  ..cross validation   predict(fit.rpart, sTest); print( summary(pred.rpart))
sTest$rightPred <- pred.rpart == sTest$classe
#-see appdx print(table(pred.rpart, sTest$classe))
accuracy.rpart<- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.rpart<-sum(sTest$rightPred)/length(pred.rpart)
outOfSample.error.rpart<-(1-outOfSample.accuracy.rpart)
cf.rpart<- confusionMatrix(pred.rpart, sTest$classe)
#-see appdx print(cf.rpart)
#(5.2) Random Forest model
rdata.file <- file.path(data.dir, "fit.rf.RData")
if (file.exists(rdata.file)) {
  logStep( "loading ",rdata.file, load(file=rdata.file))
} else {
  traincontrol <- trainControl(method = "cv", number = 10) # use "cv" as the resampling method with 10 resampling iterations
  logStep (system.time(fit.rf <- train(classe ~., preProcess=c("center","scale"), data = sTrain, 
                                       method="rf", trControl=traincontrol , verbose = FALSE)))
  save(fit.rf,file=rdata.file)
}
pred.rf <-predict(fit.rf, sTest)
#-see appdx print(fit.rf ); ..cross validation predict(fit.rf, sTest); print( summary(pred.rf))
sTest$rightPred<-pred.rf == sTest$classe
#-see appdx print(table(pred.rf, sTest$classe))
accuracy.rf <- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.rf<-sum(sTest$rightPred)/length(pred.rf)
outOfSample.error.rf<-(1-outOfSample.accuracy.rf)
cf.rf <- confusionMatrix(pred.rf, sTest$classe)
#-see appdx print(cf.rf)
##   loading  D:/_github/PracticalMachineLearningProject/mydata/fit.rf.RData fit.rf
#(5.3) Boosting model
rdata.file <- file.path(data.dir, "fit.gbm.RData")
if (file.exists(rdata.file)) {
  load(file=rdata.file)
} else {
  logStep (system.time(fit.gbm <- train(classe ~., preProcess=c("center","scale"), data = sTrain, method="gbm", verbose = FALSE)))
  save(fit.gbm,file=rdata.file)
}
pred.gbm<-predict(fit.gbm, sTest)
#-see appdx print(fit.gbm); ..cross validation   predict(fit.gbm, sTest); print(summary(pred.gbm))
sTest$rightPred<-pred.gbm == sTest$classe
#-see appdx print(table(pred.gbm, sTest$classe))
accuracy.gbm<- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.gbm<-sum(sTest$rightPred)/length(pred.gbm)
outOfSample.error.gbm<-(1-outOfSample.accuracy.gbm)
cf.gbm <- confusionMatrix(pred.gbm, sTest$classe)
#-see appdx print(cf.gbm)
#(5.4) Linear Discriminant Analysis model
rdata.file <- file.path(data.dir, "fit.lda.RData")
if (file.exists(rdata.file)) {
  load(file=rdata.file) 
} else {
  logStep (system.time(fit.lda <- train(classe ~., preProcess=c("center","scale"), data = sTrain, method="lda", verbose = FALSE)))
  save(fit.lda,file=rdata.file)
}
pred.lda<-predict(fit.lda, sTest); 
#-see appdx print(fit.lda); ..cross validation   predict(fit.lda, sTest); print(summary(pred.lda))
sTest$rightPred<-pred.lda == sTest$classe
#-see appdx print(table(pred.lda, sTest$classe))
accuracy.lda <- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.lda<-sum(sTest$rightPred)/length(pred.lda)
outOfSample.error.lda<-(1-outOfSample.accuracy.lda)
cf.lda <- confusionMatrix(pred.lda, sTest$classe)
#-see appdx print(cf.lda)
#(5.5) Naive Bayes model 
require(klaR)
rdata.file <- file.path(data.dir, "fit.nb.RData")
if (file.exists(rdata.file)) {
  load(file=rdata.file) 
} else {
  logStep (system.time(fit.nb <- train(classe ~., preProcess=c("center","scale"), data = sTrain, method="nb", verbose = FALSE)))
  save(fit.nb,file=rdata.file)
}
pred.nb<-predict(fit.nb, sTest) 
#-see appdx print(fit.nb);..cross validation   predict(fit.nb, sTest); print(summary(pred.nb));
sTest$rightPred<-pred.nb == sTest$classe
#-see appdx print(table(pred.nb, sTest$classe))
accuracy.nb <- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.nb<-sum(sTest$rightPred)/length(pred.nb)
outOfSample.error.nb<-(1-outOfSample.accuracy.nb)
cf.nb <- confusionMatrix(pred.nb, sTest$classe)
#-see appdx print(cf.nb)

Comparison Result for Cross Validation

The cross validation estimate is an out-of-sample estimate, computed on the held-out sTest set. For each train method, the comparison includes:
- In-sample accuracy
- Confusion matrix accuracy
- Out-of-sample accuracy
- In-sample error and out-of-sample error

The best performing model is expected to have (1) the highest accuracy and (2) the lowest error.

logStep ("(6) Comparison accuracy, In/Out sample error\n")
comp.result <- cbind( c("model/train method", "rpart","rf","gbm","lda","nb"),
  c("InSample.Accuracy", accuracy.rpart, accuracy.rf, accuracy.gbm, accuracy.lda, accuracy.nb),
  c("confusionMatrix.Accuracy",cf.rpart$overall['Accuracy'],cf.rf$overall['Accuracy'], cf.gbm$overall['Accuracy'], 
    cf.lda$overall['Accuracy'],cf.nb$overall['Accuracy']),
  c("InSample.Error",1-cf.rpart$overall['Accuracy'], 1-cf.rf$overall['Accuracy'], 1-cf.gbm$overall['Accuracy'], 
    1-cf.lda$overall['Accuracy'], 1-cf.nb$overall['Accuracy']),
  c("OutOfSample.Accuracy",outOfSample.accuracy.rpart, outOfSample.accuracy.rf, outOfSample.accuracy.gbm,
    outOfSample.accuracy.lda, outOfSample.accuracy.nb),
  c("OutOfSample.Error",outOfSample.error.rpart, outOfSample.error.rf, outOfSample.error.gbm,
    outOfSample.error.lda, outOfSample.error.nb))
print(comp.result)
logStep("...rpart postResample\n"); print(postResample(pred.rpart, sTest$classe))
logStep("...rf postResample\n"); print(postResample(pred.rf, sTest$classe))
logStep("...gbm postResample\n"); print(postResample(pred.gbm, sTest$classe))
logStep("...lda postResample\n"); print(postResample(pred.lda, sTest$classe))
logStep("...nb postResample\n"); print(postResample(pred.nb, sTest$classe))
##   (6) Comparison accuracy, In/Out sample error
##          [,1]                 [,2]               
##          "model/train method" "InSample.Accuracy"
## Accuracy "rpart"              "0.547030333928116"
## Accuracy "rf"                 "0.989039000764721"
## Accuracy "gbm"                "0.943028294672445"
## Accuracy "lda"                "0.571501401988274"
## Accuracy "nb"                 "0.729671170022942"
##          [,3]                       [,4]                
##          "confusionMatrix.Accuracy" "InSample.Error"    
## Accuracy "0.547030333928116"        "0.452969666071884" 
## Accuracy "0.989039000764721"        "0.0109609992352792"
## Accuracy "0.943028294672445"        "0.0569717053275555"
## Accuracy "0.571501401988274"        "0.428498598011726" 
## Accuracy "0.729671170022942"        "0.270328829977058" 
##          [,5]                   [,6]                
##          "OutOfSample.Accuracy" "OutOfSample.Error" 
## Accuracy "0.547030333928116"    "0.452969666071884" 
## Accuracy "0.989039000764721"    "0.0109609992352792"
## Accuracy "0.943028294672445"    "0.0569717053275555"
## Accuracy "0.571501401988274"    "0.428498598011726" 
## Accuracy "0.729671170022942"    "0.270328829977058" 
##   ...rpart postResample
##  Accuracy     Kappa 
## 0.5470303 0.4302387 
##   ...rf postResample
##  Accuracy     Kappa 
## 0.9890390 0.9861331 
##   ...gbm postResample
##  Accuracy     Kappa 
## 0.9430283 0.9279087 
##   ...lda postResample
##  Accuracy     Kappa 
## 0.5715014 0.4577694 
##   ...nb postResample
##  Accuracy     Kappa 
## 0.7296712 0.6608958
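A side note on the output above: cbind() with mixed character and numeric columns coerces everything to character, which is why the accuracies print as quoted strings. A numeric data.frame built from the same objects is easier to read (a sketch reusing the variables computed above):

# Same comparison as a data.frame, keeping the accuracy/error columns numeric.
comp.df <- data.frame(
  method               = c("rpart","rf","gbm","lda","nb"),
  insample.accuracy    = c(accuracy.rpart, accuracy.rf, accuracy.gbm, accuracy.lda, accuracy.nb),
  cm.accuracy          = c(cf.rpart$overall["Accuracy"], cf.rf$overall["Accuracy"],
                           cf.gbm$overall["Accuracy"], cf.lda$overall["Accuracy"],
                           cf.nb$overall["Accuracy"]),
  outofsample.accuracy = c(outOfSample.accuracy.rpart, outOfSample.accuracy.rf,
                           outOfSample.accuracy.gbm, outOfSample.accuracy.lda,
                           outOfSample.accuracy.nb),
  outofsample.error    = c(outOfSample.error.rpart, outOfSample.error.rf,
                           outOfSample.error.gbm, outOfSample.error.lda, outOfSample.error.nb))
# Print the models ranked from best to worst out-of-sample accuracy.
print(comp.df[order(-comp.df$outofsample.accuracy), ], row.names=FALSE)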

Comparison Results and Choice of the Final Model

Comparing the "rpart", "rf", "gbm", "lda", and "nb" models, the two highest-ranked are:

Method  In-Sample Accuracy  Out-of-Sample Accuracy  95% CI          Kappa  In-Sample Error  Out-of-Sample Error
rf      98.9%               98.9%                   (98.7%, 99.1%)  98.6%  1.1%             1.1%
gbm     94.3%               94.3%                   (93.8%, 94.8%)  92.8%  5.7%             5.7%

The best performing model is the Random Forest (rf), so I chose rf as the final model to apply to the PML test data.
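A rough sanity check on what to expect from the 20-case test set: assuming the 20 cases are independent, the chance of predicting all of them correctly with an out-of-sample accuracy of about 0.989 is roughly 0.989^20.

# Rough check (assumes the 20 test cases are independent):
# probability that the rf model predicts all 20 quiz cases correctly.
0.989 ^ 20   # approximately 0.80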

Generate files for assignment submission

logStep ("(7) Generate files for assginment submission\n")
# Prediction submission -- select RF model to predict test data
pred.answer <- predict( fit.rf, testdata)
logStep("...pred.answer=")
print( pred.answer)
pml_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_",i,".txt")
    filepath<-file.path(output.dir, filename)
    write.table(x[i],file=filepath,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}
logStep("...write to files"); pml_write_files(pred.answer)
##   (7) Generate files for assignment submission
##   ...pred.answer= [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
##   ...write to files

Conclusion of My Observations in this Exercise

Reference:

Appendix

logStep("(3) Exploring data\n ..Train Data colnames= \n"); head(colnames(traindata)); #str(head(traindata)) ;
##   (3) Exploring data
##  ..Train Data colnames=
## [1] "X"                    "user_name"            "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"
head(colnames(testdata));
## [1] "X"                    "user_name"            "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"
logStep("(4) PreProcess data   (4.1) Check and remove NA variables" ); head( sort(colSums(is.na(traindata)), decreasing = TRUE))
##   (4) PreProcess data   (4.1) Check and remove NA variables
##     kurtosis_yaw_belt     skewness_yaw_belt kurtosis_yaw_dumbbell 
##                 19622                 19622                 19622 
## skewness_yaw_dumbbell  kurtosis_yaw_forearm  skewness_yaw_forearm 
##                 19622                 19622                 19622
logStep ("(5.1) Classification tree modeling\n ..fit.rpart=\n"); print(fit.rpart);
##   (5.1) Classification tree modeling
##  ..fit.rpart=
## CART 
## 
## 11776 samples
##    31 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (31), scaled (31) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ... 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa      Accuracy SD  Kappa SD  
##   0.02491694  0.5395200  0.4158615  0.02502035   0.03913236
##   0.02758662  0.5184748  0.3880017  0.02578250   0.04047764
##   0.04182487  0.4131974  0.2203257  0.10276153   0.17280339
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was cp = 0.02491694.
logStep("..cross validation   predict(fit.rpart, sTest)=\n"); print( summary(pred.rpart))
##   ..cross validation   predict(fit.rpart, sTest)=
##    A    B    C    D    E 
## 1786 1989 2206  603 1262
logStep("..table(pred.rpart,sTest$classe)=\n"); print(table(pred.rpart, sTest$classe)); 
##   ..table(pred.rpart,sTest$classe)=
##           
## pred.rpart    A    B    C    D    E
##          A 1394  266   31   66   29
##          B  223  860  173  318  415
##          C  316  290 1016  329  255
##          D  107   37   60  339   60
##          E  192   65   88  234  683
print(cf.rpart)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1394  266   31   66   29
##          B  223  860  173  318  415
##          C  316  290 1016  329  255
##          D  107   37   60  339   60
##          E  192   65   88  234  683
## 
## Overall Statistics
##                                           
##                Accuracy : 0.547           
##                  95% CI : (0.5359, 0.5581)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4302          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.6246   0.5665   0.7427  0.26361  0.47365
## Specificity            0.9302   0.8216   0.8163  0.95976  0.90959
## Pos Pred Value         0.7805   0.4324   0.4606  0.56219  0.54120
## Neg Pred Value         0.8617   0.8877   0.9376  0.86925  0.88472
## Prevalence             0.2845   0.1935   0.1744  0.16391  0.18379
## Detection Rate         0.1777   0.1096   0.1295  0.04321  0.08705
## Detection Prevalence   0.2276   0.2535   0.2812  0.07685  0.16085
## Balanced Accuracy      0.7774   0.6941   0.7795  0.61168  0.69162
fancyRpartPlot(fit.rpart$finalModel, main="Rpart Tree Model")

logStep ("(5.2) Random Forest model\n ..fit.rf=\n"); print(fit.rf )
##   (5.2) Random Forest model
##  ..fit.rf=
## Random Forest 
## 
## 11776 samples
##    31 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (31), scaled (31) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 10597, 10598, 10599, 10600, 10598, 10598, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9884510  0.9853886  0.002342040  0.002964147
##   16    0.9887908  0.9858192  0.003246769  0.004107953
##   31    0.9808937  0.9758304  0.003883894  0.004914435
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 16.
logStep("..cross validation   predict(fit,rf, sTest) =\n"); print( summary(pred.rf))
##   ..cross validation   predict(fit,rf, sTest) =
##    A    B    C    D    E 
## 2242 1519 1378 1271 1436
logStep("..table(pred.rf,sTest$classe)=\n"); print(table(pred.rf, sTest$classe))
##   ..table(pred.rf,sTest$classe)=
##        
## pred.rf    A    B    C    D    E
##       A 2228   13    0    1    0
##       B    2 1498   18    1    0
##       C    0    7 1342   24    5
##       D    2    0    8 1258    3
##       E    0    0    0    2 1434
print(cf.rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2228   13    0    1    0
##          B    2 1498   18    1    0
##          C    0    7 1342   24    5
##          D    2    0    8 1258    3
##          E    0    0    0    2 1434
## 
## Overall Statistics
##                                           
##                Accuracy : 0.989           
##                  95% CI : (0.9865, 0.9912)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9861          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9868   0.9810   0.9782   0.9945
## Specificity            0.9975   0.9967   0.9944   0.9980   0.9997
## Pos Pred Value         0.9938   0.9862   0.9739   0.9898   0.9986
## Neg Pred Value         0.9993   0.9968   0.9960   0.9957   0.9988
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2840   0.1909   0.1710   0.1603   0.1828
## Detection Prevalence   0.2858   0.1936   0.1756   0.1620   0.1830
## Balanced Accuracy      0.9979   0.9918   0.9877   0.9881   0.9971
logStep ("(5.3) Boosting model \n ..fit.gbm=\n\n"); print(fit.gbm )
##   (5.3) Boosting model 
##  ..fit.gbm=
## Stochastic Gradient Boosting 
## 
## 11776 samples
##    31 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (31), scaled (31) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD
##   1                   50      0.7101335  0.6324070  0.006924339
##   1                  100      0.7730951  0.7127079  0.007371483
##   1                  150      0.8075144  0.7564129  0.004440979
##   2                   50      0.8202313  0.7723139  0.005457241
##   2                  100      0.8749929  0.8418270  0.004043285
##   2                  150      0.9023588  0.8764695  0.004721279
##   3                   50      0.8629356  0.8264678  0.004982426
##   3                  100      0.9148331  0.8922231  0.005503074
##   3                  150      0.9378045  0.9213011  0.005168093
##   Kappa SD   
##   0.008788581
##   0.009331767
##   0.005705155
##   0.007002126
##   0.005163815
##   0.006008614
##   0.006358391
##   0.007007181
##   0.006553734
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
logStep("..cross validation   predict(fit.gbm, sTest)=\n"); print(summary(pred.gbm))
##   ..cross validation   predict(fit.gbm, sTest)=
##    A    B    C    D    E 
## 2257 1516 1418 1259 1396
logStep("..table(pred.gbm,sTest$classe)=\n"); print(table(pred.gbm, sTest$classe))
##   ..table(pred.gbm,sTest$classe)=
##         
## pred.gbm    A    B    C    D    E
##        A 2193   58    0    2    4
##        B   16 1377   82   19   22
##        C    8   69 1260   65   16
##        D   15    6   25 1191   22
##        E    0    8    1    9 1378
print(cf.gbm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2193   58    0    2    4
##          B   16 1377   82   19   22
##          C    8   69 1260   65   16
##          D   15    6   25 1191   22
##          E    0    8    1    9 1378
## 
## Overall Statistics
##                                           
##                Accuracy : 0.943           
##                  95% CI : (0.9377, 0.9481)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9279          
##  Mcnemar's Test P-Value : 2.539e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9825   0.9071   0.9211   0.9261   0.9556
## Specificity            0.9886   0.9780   0.9756   0.9896   0.9972
## Pos Pred Value         0.9716   0.9083   0.8886   0.9460   0.9871
## Neg Pred Value         0.9930   0.9777   0.9832   0.9856   0.9901
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2795   0.1755   0.1606   0.1518   0.1756
## Detection Prevalence   0.2877   0.1932   0.1807   0.1605   0.1779
## Balanced Accuracy      0.9856   0.9426   0.9483   0.9579   0.9764
logStep ("(5.4) Linear Discriminant Analysis model \n ..fit.lda\n"); print(fit.lda)
##   (5.4) Linear Discriminant Analysis model 
##  ..fit.lda
## Linear Discriminant Analysis 
## 
## 11776 samples
##    31 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (31), scaled (31) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ... 
## Resampling results
## 
##   Accuracy   Kappa      Accuracy SD  Kappa SD   
##   0.5816037  0.4702839  0.0062626    0.007891692
## 
## 
logStep("..cross validation   predict(fit.lda, sTest)=\n"); print(summary(pred.lda))
##   ..cross validation   predict(fit.lda, sTest)=
##    A    B    C    D    E 
## 2297 1532 1649 1352 1016
logStep("..table(pred.lda,sTest$classe)=\n"); print(table(pred.lda, sTest$classe))
##   ..table(pred.lda,sTest$classe)=
##         
## pred.lda    A    B    C    D    E
##        A 1521  319  254   91  112
##        B  298  704  142  133  255
##        C  215  201  838  189  206
##        D  174  152  110  734  182
##        E   24  142   24  139  687
print(cf.lda)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1521  319  254   91  112
##          B  298  704  142  133  255
##          C  215  201  838  189  206
##          D  174  152  110  734  182
##          E   24  142   24  139  687
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5715          
##                  95% CI : (0.5605, 0.5825)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4578          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.6815  0.46377   0.6126  0.57076  0.47642
## Specificity            0.8618  0.86915   0.8748  0.90579  0.94863
## Pos Pred Value         0.6622  0.45953   0.5082  0.54290  0.67618
## Neg Pred Value         0.8719  0.87108   0.9145  0.91500  0.88946
## Prevalence             0.2845  0.19347   0.1744  0.16391  0.18379
## Detection Rate         0.1939  0.08973   0.1068  0.09355  0.08756
## Detection Prevalence   0.2928  0.19526   0.2102  0.17232  0.12949
## Balanced Accuracy      0.7716  0.66646   0.7437  0.73828  0.71252
logStep ("(5.5) Naive Bayes model \n ..fit.nb=\n"); print(fit.nb)
##   (5.5) Naive Bayes model 
##  ..fit.nb=
## Naive Bayes 
## 
## 11776 samples
##    31 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (31), scaled (31) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy   Kappa      Accuracy SD  Kappa SD   
##   FALSE      0.5808695  0.4717188  0.007301880  0.009109709
##    TRUE      0.7248053  0.6546382  0.009763068  0.012263254
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were fL = 0 and usekernel = TRUE.
logStep("..cross validation   predict(fit.nb, sTest)=\n"); print(summary(pred.nb))
##   ..cross validation   predict(fit.nb, sTest)=
##    A    B    C    D    E 
## 1800 1449 1969 1379 1249
logStep("..table(pred.nb,sTest$classe)=\n"); print(table(pred.nb, sTest$classe))
##   ..table(pred.nb,sTest$classe)=
##        
## pred.nb    A    B    C    D    E
##       A 1580  114   48   34   24
##       B  138 1057  113   13  128
##       C  231  203 1119  292  124
##       D  255  102   79  873   70
##       E   28   42    9   74 1096
print(cf.nb)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1580  114   48   34   24
##          B  138 1057  113   13  128
##          C  231  203 1119  292  124
##          D  255  102   79  873   70
##          E   28   42    9   74 1096
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7297          
##                  95% CI : (0.7197, 0.7395)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6609          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7079   0.6963   0.8180   0.6788   0.7601
## Specificity            0.9608   0.9381   0.8688   0.9229   0.9761
## Pos Pred Value         0.8778   0.7295   0.5683   0.6331   0.8775
## Neg Pred Value         0.8922   0.9279   0.9576   0.9361   0.9476
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2014   0.1347   0.1426   0.1113   0.1397
## Detection Prevalence   0.2294   0.1847   0.2510   0.1758   0.1592
## Balanced Accuracy      0.8343   0.8172   0.8434   0.8009   0.8681