This report was prepared for the Course Project Prediction Assignment of Johns Hopkins’ Practical Machine Learning online class.
The goal is to use the http://groupware.les.inf.puc-rio.br/har data to predict “classe”, the manner in which the participants performed the exercise.
Focus On
Due to the course project page limit, output results are available in the Appendix. The R code is in the accompanying R file.
set.seed(999)
loadDataset <- function(data.file) { # load one data set at a time
  df.data <- read.csv(data.file, header=TRUE, sep=",", na.strings=c("NA","#DIV/0!"));
  return(df.data)
}
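The script also relies on two small helpers, logStep and downloadZipFile, which are defined in the separate R file. The sketch below is only an assumed reconstruction of their behavior (a cat()-based logger, and a download that is skipped when the destination file already exists), not the original code.
logStep <- function(...) {    # assumed sketch: plain cat()-based logger; callers add "\n" themselves
  cat(...)
}
downloadZipFile <- function(src.file.url, dest.dir, dest.file, unzip=FALSE) {
  # assumed sketch: download src.file.url to dest.dir/dest.file, skipping the download if it already exists
  dest.filepath <- file.path(dest.dir, dest.file)
  logStep("Downloading from", src.file.url, "\n to dest.filepath=", dest.filepath)
  if (!file.exists(dest.filepath)) {
    download.file(src.file.url, destfile=dest.filepath, mode="wb")
    if (unzip) unzip(dest.filepath, exdir=dest.dir)
  } else {
    logStep(" dest.filepath exist. No download is needed.")
  }
  invisible(dest.filepath)
}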
logStep("\n(1) Getting data - Download data files")
downloadZipFile(src.file.url="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
dest.dir=data.dir, dest.file="pml-training.csv", unzip=FALSE)
downloadZipFile(src.file.url="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
dest.dir=data.dir, dest.file="pml-testing.csv", unzip=FALSE)
logStep("\n(2) Read data file, and load to data set")
# read data, treating the strings "NA", "", and "#DIV/0!" as NA values
traindata.filename<-"pml-training.csv"
traindata<-read.csv(file.path(data.dir, traindata.filename), na.strings=c("NA","#DIV/0!", ""))
testdata.filename<-"pml-testing.csv"
testdata<-read.csv(file.path(data.dir, testdata.filename), na.strings=c("NA","#DIV/0!", ""))
logStep("\n(3) Exploring data")
logStep("...dim= ", dim(traindata)); logStep("...dim= ", dim(testdata));
#-See appdx for details: colnames(traindata); str(traindata); summary(traindata$classe); colnames(testdata)
logStep("...Important column info: summary of classe=\n"); print(summary(traindata$classe))
##
## (1) Getting data - Download data files Downloading from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
## to dest.filepath= D:/_github/PracticalMachineLearningProject/mydata/pml-training.csv dest.filepath exist.No download is needed. Downloading from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
## to dest.filepath= D:/_github/PracticalMachineLearningProject/mydata/pml-testing.csv dest.filepath exist.No download is needed.
## (2) Read data file, and load to data set
## (3) Exploring data ...dim= 19622 160 ...dim= 20 160 ...Important column info: summary of classe=
## A B C D E
## 5580 3797 3422 3216 3607
The following criteria are used to examine the data and remove columns (variables) that match any of these conditions: columns containing NA values, near-zero-variance columns, columns not relevant to 'classe', and highly correlated columns.
logStep("(4) PreProcess data "); logStep("(4.1) Check and remove NA variables" )
#-see appdx sort(colSums(is.na(traindata)))
var.isna<-(colSums(is.na(traindata)) > 0) # is.na value: TRUE(1) if NA is found; FALSE(0) if NA not found
cdata<-traindata[, !var.isna ]; logStep("...data ncol=", ncol(cdata))
## (4) PreProcess data (4.1) Check and remove NA variables ...data ncol= 60
logStep("(4.2) Checking and remove Near Zero Variables ") # Near to zero variable have no variability, and are not useful when constructing a prediction model
require(caret); data.1 <- cdata;
var.nearzero<-nearZeroVar(data.1, allowParallel=TRUE, saveMetrics=FALSE) ; logStep("...nearzero.length=", length(var.nearzero));
if (length(var.nearzero)>0 ) {
logStep("...nearzero.colname=", names(cdata)[var.nearzero])
data.1<-data.1[, -var.nearzero ]
}
logStep("...data ncol=", ncol(data.1))
## (4.2) Checking and remove Near Zero Variables ...nearzero.length= 1 ...nearzero.colname= new_window ...data ncol= 59
logStep("(4.3) Remove non-relevant Variables toward 'classe' "); logStep(names(data.1)[1:6] );
data.1<-data.1[, -1:-6 ] ; logStep("...data ncol=", ncol(data.1))
## (4.3) Remove non-relevant Variables toward 'classe' X user_name raw_timestamp_part_1 raw_timestamp_part_2 cvtd_timestamp num_window ...data ncol= 53
logStep("(4.4) Check and remove High correlated Variables ") #shows the effect of removing descriptors with absolute correlations above .75.
cor.1<-cor(data.1[,-ncol(data.1)]); logStep("...Summary(cor.1)= \n"); print( summary(cor.1[upper.tri( cor.1 )]))
cor.high <- sum(abs(cor.1[upper.tri(cor.1)]) > .75); logStep("...Summary(abs(cor.1)) > 0.75 \n"); print( cor.high );
cor.highly<-findCorrelation( cor.1, cutoff=0.75 )
if(length(cor.highly)>0) {
data.1<-data.1[,-cor.highly];
cor.2<-cor(data.1[,-ncol(data.1)]); logStep("...cor.2)) cutoff=0.75 \n"); print(summary(cor.2[upper.tri( cor.2 )]));
}
logStep("...data ncol=", ncol(data.1))
## (4.4) Check and remove High correlated Variables ...Summary(cor.1)=
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.992000 -0.110100 0.002092 0.001790 0.092550 0.980900
## ...Summary(abs(cor.1)) > 0.75
## [1] 31
## ...cor.2)) cutoff=0.75
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.607000 -0.103800 0.006527 0.003332 0.087530 0.736500
## ...data ncol= 32
The predictors are now cut down to 31 variables (32 columns including the classe outcome) for model fitting and cross validation.
Note: With this many predictors, generating feature plots and deciding from plots which variables fit the models best would be extremely time consuming. Therefore I went directly to fitting prediction models, and will determine the important predictors from the best model based on the fitting results.
The data (32 columns) is now clean and ready for cross validation.
Create a data partition for cross validation.
Split the data 60% into a training set (sTrain) and 40% into a test set (sTest).
‘center’ and ‘scale’ are used to pre-process the predictor data.
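As a small illustration (an assumed sketch, not part of the pipeline): passing preProcess=c("center","scale") to train() below asks caret to estimate each predictor's mean and standard deviation on the training data and standardize the columns, roughly equivalent to:
pp <- preProcess(data.1[, -ncol(data.1)], method=c("center","scale"))  # estimate means/sds on the predictors only
scaled.example <- predict(pp, data.1[, -ncol(data.1)])                 # each numeric predictor now has mean 0, sd 1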
data<- data.1;
idxTrain<-createDataPartition(y=data$classe, p=0.6, list=FALSE)
sTrain<- data[idxTrain,]; dim(sTrain);
## [1] 11776 32
sTest<- data[-idxTrain,]; dim(sTest);
## [1] 7846 32
Note: the following train() calls took over 2 hours to run across all methods, so I saved each train result to a local .RData file and reload it when generating the prediction/comparison results for the Rmd and HTML report.
require(randomForest); require(forecast); require(caret);
#(5.1) Classification tree modeling
rdata.file <- file.path(data.dir, "fit.rpart.RData")
if (file.exists(rdata.file)) {
load(file=rdata.file)
} else {
fit.rpart<-train(classe ~ . , preProcess=c("center","scale"), data = sTrain, method="rpart");
save(fit.rpart, file=rdata.file);
}
pred.rpart<-predict(fit.rpart, sTest)
#-see appdx print(fit.rpart); ..cross validation predict(fit.rpart, sTest); print(summary(pred.rpart))
sTest$rightPred <- pred.rpart == sTest$classe
#-see appdx print(table(pred.rpart, sTest$classe))
accuracy.rpart<- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.rpart<-sum(sTest$rightPred)/length(pred.rpart)
outOfSample.error.rpart<-(1-outOfSample.accuracy.rpart)
cf.rpart<- confusionMatrix(pred.rpart, sTest$classe)
#-see appdx print(cf.rpart)
#(5.2) Random Forest model
rdata.file <- file.path(data.dir, "fit.rf.RData")
if (file.exists(rdata.file)) {
logStep( "loading ",rdata.file, load(file=rdata.file))
} else {
traincontrol <- trainControl(method = "cv", number = 10) # using "cv" for resampling method, "10" number of resampling iterations
logStep (system.time(fit.rf <- train(classe ~., preProcess=c("center","scale"), data = sTrain,
method="rf", trControl=traincontrol , verbose = FALSE)))
save(fit.rf,file=rdata.file)
}
pred.rf <-predict(fit.rf, sTest)
#-see appdx print(fit.rf); ..cross validation predict(fit.rf, sTest); print(summary(pred.rf))
sTest$rightPred<-pred.rf == sTest$classe
#-see appdx print(table(pred.rf, sTest$classe))
accuracy.rf <- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.rf<-sum(sTest$rightPred)/length(pred.rf)
outOfSample.error.rf<-(1-outOfSample.accuracy.rf)
cf.rf <- confusionMatrix(pred.rf, sTest$classe)
#-see appdx print(cf.rf)
## loading D:/_github/PracticalMachineLearningProject/mydata/fit.rf.RData fit.rf
#(5.3) Boosting model
rdata.file <- file.path(data.dir, "fit.gbm.RData")
if (file.exists(rdata.file)) {
load(file=rdata.file)
} else {
logStep (system.time(fit.gbm <- train(classe ~., preProcess=c("center","scale"), data = sTrain, method="gbm", verbose = FALSE)))
save(fit.gbm,file=rdata.file)
}
pred.gbm<-predict(fit.gbm, sTest)
#-see appdx print(fit.gbm); ..cross validation predict(fit.gbm, sTest); print(summary(pred.gbm))
sTest$rightPred<-pred.gbm == sTest$classe
#-see appdx print(table(pred.gbm, sTest$classe))
accuracy.gbm<- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.gbm<-sum(sTest$rightPred)/length(pred.gbm)
outOfSample.error.gbm<-(1-outOfSample.accuracy.gbm)
cf.gbm <- confusionMatrix(pred.gbm, sTest$classe)
#-see appdx print(cf.gbm)
#(5.4) Linear Discriminant Analysis model
rdata.file <- file.path(data.dir, "fit.lda.RData")
if (file.exists(rdata.file)) {
load(file=rdata.file)
} else {
logStep (system.time(fit.lda <- train(classe ~., preProcess=c("center","scale"), data = sTrain, method="lda", verbose = FALSE)))
save(fit.lda,file=rdata.file)
}
pred.lda<-predict(fit.lda, sTest);
#-see appdx print(fit.lda); ..cross validation predict(fit.lda, sTest); print(summary(pred.lda))
sTest$rightPred<-pred.lda == sTest$classe
#-see appdx print(table(pred.lda, sTest$classe))
accuracy.lda <- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.lda<-sum(sTest$rightPred)/length(pred.lda)
outOfSample.error.lda<-(1-outOfSample.accuracy.lda)
cf.lda <- confusionMatrix(pred.lda, sTest$classe)
#-see appdx print(cf.lda)
#(5.5) Naive Bayes model
require(klaR)
rdata.file <- file.path(data.dir, "fit.nb.RData")
if (file.exists(rdata.file)) {
load(file=rdata.file)
} else {
logStep (system.time(fit.nb <- train(classe ~., preProcess=c("center","scale"), data = sTrain, method="nb", verbose = FALSE)))
save(fit.nb,file=rdata.file)
}
pred.nb<-predict(fit.nb, sTest)
#-see appdx print(fit.nb);..cross validation predict(fit.nb, sTest); print(summary(pred.nb));
sTest$rightPred<-pred.nb == sTest$classe
#-see appdx print(table(pred.nb, sTest$classe))
accuracy.nb <- sum(sTest$rightPred)/nrow(sTest)
outOfSample.accuracy.nb<-sum(sTest$rightPred)/length(pred.nb)
outOfSample.error.nb<-(1-outOfSample.accuracy.nb)
cf.nb <- confusionMatrix(pred.nb, sTest$classe)
#-see appdx print(cf.nb)
The cross validation estimate is an out-of-sample estimate. For each train method the comparison includes:
- In-Sample Accuracy
- Confusion Matrix Accuracy
- Out-of-Sample Accuracy
- In-Sample Error
- Out-of-Sample Error
The best performing model is expected to have (1) the highest accuracy value and (2) the lowest error value.
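These quantities all come from the same held-out sTest predictions; as a quick sketch of the relationship (equivalent to the sum(...)/nrow(...) code above, shown here for the rf model):
acc.example <- mean(pred.rf == sTest$classe)  # proportion of correct hold-out predictions; equals accuracy.rf above
err.example <- 1 - acc.example                # out-of-sample error is one minus out-of-sample accuracy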
logStep ("(6) Comparison accuracy, In/Out sample error\n")
comp.result <- cbind( c("model/train method", "rpart","rf","gbm","lda","nb"),
c("InSample.Accuracy", accuracy.rpart, accuracy.rf, accuracy.gbm, accuracy.lda, accuracy.nb),
c("confusionMatrix.Accuracy",cf.rpart$overall['Accuracy'],cf.rf$overall['Accuracy'], cf.gbm$overall['Accuracy'],
cf.lda$overall['Accuracy'],cf.nb$overall['Accuracy']),
c("InSample.Error",1-cf.rpart$overall['Accuracy'], 1-cf.rf$overall['Accuracy'], 1-cf.gbm$overall['Accuracy'],
1-cf.lda$overall['Accuracy'], 1-cf.nb$overall['Accuracy']),
c("OutOfSample.Accuracy",outOfSample.accuracy.rpart, outOfSample.accuracy.rf, outOfSample.accuracy.gbm,
outOfSample.accuracy.lda, outOfSample.accuracy.nb),
c("OutOfSample.Error",outOfSample.error.rpart, outOfSample.error.rf, outOfSample.error.gbm,
outOfSample.error.lda, outOfSample.error.nb))
print(comp.result)
logStep("...rpart postResample\n"); print(postResample(pred.rpart, sTest$classe))
logStep("...rf postResample\n"); print(postResample(pred.rf, sTest$classe))
logStep("...gbm postResample\n"); print(postResample(pred.gbm, sTest$classe))
logStep("...lda postResample\n"); print(postResample(pred.lda, sTest$classe))
logStep("...nb postResample\n"); print(postResample(pred.nb, sTest$classe))
## (6) Comparison accuracy, In/Out sample error
## [,1] [,2]
## "model/train method" "InSample.Accuracy"
## Accuracy "rpart" "0.547030333928116"
## Accuracy "rf" "0.989039000764721"
## Accuracy "gbm" "0.943028294672445"
## Accuracy "lda" "0.571501401988274"
## Accuracy "nb" "0.729671170022942"
## [,3] [,4]
## "confusionMatrix.Accuracy" "InSample.Error"
## Accuracy "0.547030333928116" "0.452969666071884"
## Accuracy "0.989039000764721" "0.0109609992352792"
## Accuracy "0.943028294672445" "0.0569717053275555"
## Accuracy "0.571501401988274" "0.428498598011726"
## Accuracy "0.729671170022942" "0.270328829977058"
## [,5] [,6]
## "OutOfSample.Accuracy" "OutOfSample.Error"
## Accuracy "0.547030333928116" "0.452969666071884"
## Accuracy "0.989039000764721" "0.0109609992352792"
## Accuracy "0.943028294672445" "0.0569717053275555"
## Accuracy "0.571501401988274" "0.428498598011726"
## Accuracy "0.729671170022942" "0.270328829977058"
## ...rpart postResample
## Accuracy Kappa
## 0.5470303 0.4302387
## ...rf postResample
## Accuracy Kappa
## 0.9890390 0.9861331
## ...gbm postResample
## Accuracy Kappa
## 0.9430283 0.9279087
## ...lda postResample
## Accuracy Kappa
## 0.5715014 0.4577694
## ...nb postResample
## Accuracy Kappa
## 0.7296712 0.6608958
Comparison Result and Choosing the Final Model
Comparing the “rpart”, “rf”, “gbm”, “lda”, and “nb” models, the two highest-ranking methods were:
| Method | In Sample Accuracy | Out Of Sample Accuracy | 95% CI | Kappa | In Sample Error | Out Of Sample Error |
|---|---|---|---|---|---|---|
| rf | 98.9% | 98.9% | (98.7%, 99.1%) | 98.6% | 1.1% | 1.1 % |
| gbm | 94.3% | 94.3 % | (93.8%, 94.8%) | 92.8% | 5.7% | 5.7% |
The best performing model is the Random Forest (rf), so I chose rf as the final model to apply to the PML test data.
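As a quick follow-up (a sketch using caret's varImp, not part of the original submission code), the chosen random forest fit can also report which of the 31 predictors contribute most:
imp.rf <- varImp(fit.rf)  # caret's variable importance for the chosen random forest model
print(imp.rf)             # ranks the most influential predictors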
logStep ("(7) Generate files for assginment submission\n")
# Prediction submission -- select RF model to predict test data
pred.answer <- predict( fit.rf, testdata)
logStep("...pred.answer=")
print( pred.answer)
# write each predicted answer to its own problem_id_<i>.txt file for submission
pml_write_files = function(x){
  n = length(x)
  for (i in 1:n) {
    filename = paste0("problem_id_", i, ".txt")
    filepath <- file.path(output.dir, filename)
    write.table(x[i], file=filepath, quote=FALSE, row.names=FALSE, col.names=FALSE)
  }
}
logStep("...write to files"); pml_write_files(pred.answer)
## (7) Generate files for assignment submission
## ...pred.answer= [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
## ...write to files
After comparing the “rpart”, “rf”, “gbm”, “lda”, and “nb” results, the Random Forest (rf) model showed:
| In Sample Accuracy | Out Of Sample Accuracy | 95% CI | Kappa | In Sample Error | Out Of Sample Error |
|---|---|---|---|---|---|
| 98.9% | 98.9% | (98.7%, 99.1%) | 98.6% | 1.1% | 1.1 % |
The course project submission result was 20/20 correct.
logStep("(3) Exploring data\n ..Train Data colnames= \n"); head(colnames(traindata)); #str(head(traindata)) ;
## (3) Exploring data
## ..Train Data colnames=
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
head(colnames(testdata));
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
logStep("(4) PreProcess data (4.1) Check and remove NA variables" ); head( sort(colSums(is.na(traindata)), decreasing = TRUE))
## (4) PreProcess data (4.1) Check and remove NA variables
## kurtosis_yaw_belt skewness_yaw_belt kurtosis_yaw_dumbbell
## 19622 19622 19622
## skewness_yaw_dumbbell kurtosis_yaw_forearm skewness_yaw_forearm
## 19622 19622 19622
logStep ("(5.1) Classification tree modeling\n ..fit.rpart=\n"); print(fit.rpart);
## (5.1) Classification tree modeling
## ..fit.rpart=
## CART
##
## 11776 samples
## 31 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.02491694 0.5395200 0.4158615 0.02502035 0.03913236
## 0.02758662 0.5184748 0.3880017 0.02578250 0.04047764
## 0.04182487 0.4131974 0.2203257 0.10276153 0.17280339
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.02491694.
logStep("..cross validation predict(fit.rpart, sTest)=\n"); print( summary(pred.rpart))
## ..cross validation predict(fit.rpart, sTest)=
## A B C D E
## 1786 1989 2206 603 1262
logStep("..table(pred.rpart,sTest$classe)=\n"); print(table(pred.rpart, sTest$classe));
## ..table(pred.rpart,sTest$classe)=
##
## pred.rpart A B C D E
## A 1394 266 31 66 29
## B 223 860 173 318 415
## C 316 290 1016 329 255
## D 107 37 60 339 60
## E 192 65 88 234 683
print(cf.rpart)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 266 31 66 29
## B 223 860 173 318 415
## C 316 290 1016 329 255
## D 107 37 60 339 60
## E 192 65 88 234 683
##
## Overall Statistics
##
## Accuracy : 0.547
## 95% CI : (0.5359, 0.5581)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4302
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.6246 0.5665 0.7427 0.26361 0.47365
## Specificity 0.9302 0.8216 0.8163 0.95976 0.90959
## Pos Pred Value 0.7805 0.4324 0.4606 0.56219 0.54120
## Neg Pred Value 0.8617 0.8877 0.9376 0.86925 0.88472
## Prevalence 0.2845 0.1935 0.1744 0.16391 0.18379
## Detection Rate 0.1777 0.1096 0.1295 0.04321 0.08705
## Detection Prevalence 0.2276 0.2535 0.2812 0.07685 0.16085
## Balanced Accuracy 0.7774 0.6941 0.7795 0.61168 0.69162
require(rattle); fancyRpartPlot(fit.rpart$finalModel, main="Rpart Tree Model") # fancyRpartPlot comes from the rattle package
logStep ("(5.2) Random Forest model\n ..fit.rf=\n"); print(fit.rf )
## (5.2) Random Forest model
## ..fit.rf=
## Random Forest
##
## 11776 samples
## 31 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 10597, 10598, 10599, 10600, 10598, 10598, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9884510 0.9853886 0.002342040 0.002964147
## 16 0.9887908 0.9858192 0.003246769 0.004107953
## 31 0.9808937 0.9758304 0.003883894 0.004914435
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 16.
logStep("..cross validation predict(fit,rf, sTest) =\n"); print( summary(pred.rf))
## ..cross validation predict(fit,rf, sTest) =
## A B C D E
## 2242 1519 1378 1271 1436
logStep("..table(pred.rf,sTest$classe)=\n"); print(table(pred.rf, sTest$classe))
## ..table(pred.rf,sTest$classe)=
##
## pred.rf A B C D E
## A 2228 13 0 1 0
## B 2 1498 18 1 0
## C 0 7 1342 24 5
## D 2 0 8 1258 3
## E 0 0 0 2 1434
print(cf.rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2228 13 0 1 0
## B 2 1498 18 1 0
## C 0 7 1342 24 5
## D 2 0 8 1258 3
## E 0 0 0 2 1434
##
## Overall Statistics
##
## Accuracy : 0.989
## 95% CI : (0.9865, 0.9912)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9861
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9868 0.9810 0.9782 0.9945
## Specificity 0.9975 0.9967 0.9944 0.9980 0.9997
## Pos Pred Value 0.9938 0.9862 0.9739 0.9898 0.9986
## Neg Pred Value 0.9993 0.9968 0.9960 0.9957 0.9988
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2840 0.1909 0.1710 0.1603 0.1828
## Detection Prevalence 0.2858 0.1936 0.1756 0.1620 0.1830
## Balanced Accuracy 0.9979 0.9918 0.9877 0.9881 0.9971
logStep ("(5.3) Boosting model \n ..fit.gbm=\n\n"); print(fit.gbm )
## (5.3) Boosting model
## ..fit.gbm=
## Stochastic Gradient Boosting
##
## 11776 samples
## 31 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa Accuracy SD
## 1 50 0.7101335 0.6324070 0.006924339
## 1 100 0.7730951 0.7127079 0.007371483
## 1 150 0.8075144 0.7564129 0.004440979
## 2 50 0.8202313 0.7723139 0.005457241
## 2 100 0.8749929 0.8418270 0.004043285
## 2 150 0.9023588 0.8764695 0.004721279
## 3 50 0.8629356 0.8264678 0.004982426
## 3 100 0.9148331 0.8922231 0.005503074
## 3 150 0.9378045 0.9213011 0.005168093
## Kappa SD
## 0.008788581
## 0.009331767
## 0.005705155
## 0.007002126
## 0.005163815
## 0.006008614
## 0.006358391
## 0.007007181
## 0.006553734
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
logStep("..cross validation predict(fit.gbm, sTest)=\n"); print(summary(pred.gbm))
## ..cross validation predict(fit.gbm, sTest)=
## A B C D E
## 2257 1516 1418 1259 1396
logStep("..table(pred.gbm,sTest$classe)=\n"); print(table(pred.gbm, sTest$classe))
## ..table(pred.gbm,sTest$classe)=
##
## pred.gbm A B C D E
## A 2193 58 0 2 4
## B 16 1377 82 19 22
## C 8 69 1260 65 16
## D 15 6 25 1191 22
## E 0 8 1 9 1378
print(cf.gbm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2193 58 0 2 4
## B 16 1377 82 19 22
## C 8 69 1260 65 16
## D 15 6 25 1191 22
## E 0 8 1 9 1378
##
## Overall Statistics
##
## Accuracy : 0.943
## 95% CI : (0.9377, 0.9481)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9279
## Mcnemar's Test P-Value : 2.539e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9825 0.9071 0.9211 0.9261 0.9556
## Specificity 0.9886 0.9780 0.9756 0.9896 0.9972
## Pos Pred Value 0.9716 0.9083 0.8886 0.9460 0.9871
## Neg Pred Value 0.9930 0.9777 0.9832 0.9856 0.9901
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2795 0.1755 0.1606 0.1518 0.1756
## Detection Prevalence 0.2877 0.1932 0.1807 0.1605 0.1779
## Balanced Accuracy 0.9856 0.9426 0.9483 0.9579 0.9764
logStep ("(5.4) Linear Discriminant Analysis model \n ..fit.lda\n"); print(fit.lda)
## (5.4) Linear Discriminant Analysis model
## ..fit.lda
## Linear Discriminant Analysis
##
## 11776 samples
## 31 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ...
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.5816037 0.4702839 0.0062626 0.007891692
##
##
logStep("..cross validation predict(fit.lda, sTest)=\n"); print(summary(pred.lda))
## ..cross validation predict(fit.lda, sTest)=
## A B C D E
## 2297 1532 1649 1352 1016
logStep("..table(pred.lda,sTest$classe)=\n"); print(table(pred.lda, sTest$classe))
## ..table(pred.lda,sTest$classe)=
##
## pred.lda A B C D E
## A 1521 319 254 91 112
## B 298 704 142 133 255
## C 215 201 838 189 206
## D 174 152 110 734 182
## E 24 142 24 139 687
print(cf.lda)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1521 319 254 91 112
## B 298 704 142 133 255
## C 215 201 838 189 206
## D 174 152 110 734 182
## E 24 142 24 139 687
##
## Overall Statistics
##
## Accuracy : 0.5715
## 95% CI : (0.5605, 0.5825)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4578
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.6815 0.46377 0.6126 0.57076 0.47642
## Specificity 0.8618 0.86915 0.8748 0.90579 0.94863
## Pos Pred Value 0.6622 0.45953 0.5082 0.54290 0.67618
## Neg Pred Value 0.8719 0.87108 0.9145 0.91500 0.88946
## Prevalence 0.2845 0.19347 0.1744 0.16391 0.18379
## Detection Rate 0.1939 0.08973 0.1068 0.09355 0.08756
## Detection Prevalence 0.2928 0.19526 0.2102 0.17232 0.12949
## Balanced Accuracy 0.7716 0.66646 0.7437 0.73828 0.71252
logStep ("(5.5) Naive Bayes model \n ..fit.nb=\n"); print(fit.nb)
## (5.5) Naive Bayes model
## ..fit.nb=
## Naive Bayes
##
## 11776 samples
## 31 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## Pre-processing: centered (31), scaled (31)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 11776, 11776, 11776, 11776, 11776, 11776, ...
## Resampling results across tuning parameters:
##
## usekernel Accuracy Kappa Accuracy SD Kappa SD
## FALSE 0.5808695 0.4717188 0.007301880 0.009109709
## TRUE 0.7248053 0.6546382 0.009763068 0.012263254
##
## Tuning parameter 'fL' was held constant at a value of 0
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0 and usekernel = TRUE.
logStep("..cross validation predict(fit.nb, sTest)=\n"); print(summary(pred.nb))
## ..cross validation predict(fit.nb, sTest)=
## A B C D E
## 1800 1449 1969 1379 1249
logStep("..table(pred.nb,sTest$classe)=\n"); print(table(pred.nb, sTest$classe))
## ..table(pred.nb,sTest$classe)=
##
## pred.nb A B C D E
## A 1580 114 48 34 24
## B 138 1057 113 13 128
## C 231 203 1119 292 124
## D 255 102 79 873 70
## E 28 42 9 74 1096
print(cf.nb)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1580 114 48 34 24
## B 138 1057 113 13 128
## C 231 203 1119 292 124
## D 255 102 79 873 70
## E 28 42 9 74 1096
##
## Overall Statistics
##
## Accuracy : 0.7297
## 95% CI : (0.7197, 0.7395)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6609
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7079 0.6963 0.8180 0.6788 0.7601
## Specificity 0.9608 0.9381 0.8688 0.9229 0.9761
## Pos Pred Value 0.8778 0.7295 0.5683 0.6331 0.8775
## Neg Pred Value 0.8922 0.9279 0.9576 0.9361 0.9476
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2014 0.1347 0.1426 0.1113 0.1397
## Detection Prevalence 0.2294 0.1847 0.2510 0.1758 0.1592
## Balanced Accuracy 0.8343 0.8172 0.8434 0.8009 0.8681