Synopsis

This report makes use of the dataset from a study titled “Qualitative Activity Recognition of Weight Lifting Exercises”.1 The research team gathered the data to assess whether weightlifting exercise mistakes could be detected using activity recognition techniques. Measurements were taken, under supervision, while weightlifting exercises were executed both properly and improperly.

Six male subjects, ages ranging from 20 to 28 years, wore athletic equipment fitted with accelerometers (weight belt, glove, and armband). In addition, an accelerometer was placed on the dumbbell. The subjects performed dumbbell lifts correctly (class A) and incorrectly (classes B, C, D, and E). This report evaluates several machine learning models that attempt to predict the manner in which a subject is performing a dumbbell lift. The most accurate models are then used to predict 20 test cases.

1 Qualitative Activity Recognition of Weight Lifting Exercises (Velloso, Bulling, Gellersen, Ugulino, Fuks); ACM SIGCHI, 2013.

Exploratory Analysis

The dataset pml-training.csv will be read and split 70/30 into training and testing datasets, labeled trn and tst respectively. trn will be used to build and tune the models; tst will be used to estimate their predictive performance. Variables in trn and tst are listed in Backup Material. classe (a factor) is the response.

The dataset includes summary statistics calculated at measurement window boundaries; these columns are very sparse and/or contain high proportions of missing or error values. We also note that columns 1:7 hold an observation id, the subject name, and time stamp information. The information content of all these variables is very low, so they will be discarded.

# data: pml-training.csv, assumed already read (e.g. data <- read.csv("pml-training.csv"))
div0 <- apply(data, 2, function(x) sum(x=="#DIV/0!", na.rm = T)>0)  # columns containing #DIV/0! error values
blank <- apply(data, 2, function(x) sum(x=="", na.rm = T)>0)  # columns containing blank ("") entries
NAs <- apply(data, 2, function(x) sum(is.na(x))>0 )  # columns containing NAs
hiValue <- c(rep(TRUE, 7), rep(FALSE, ncol(data)-7))  # columns 1:7 - id, name, timestamps (low information)
elim <- names(data[, ( div0|blank|NAs|hiValue )])  # names of variables to discard
elimVar <- c(); nC <- 5  # table of discarded variables for Backup Material, nC columns wide
names <- c(paste0("` ", elim, " `"), rep("...",3)); nV <- length(names)
for(i in 1:nC) elimVar <- cbind(elimVar, names[seq(1,nV,nV/nC)[i] : (seq(1,nV,nV/nC)+(nV/nC-1))[i]])
data <- data[, !( div0|blank|NAs|hiValue )] # eliminate variables listed above
allVars <- names(data) # list of all remaining variables
allPreds <- allVars[-grep("classe", allVars)] # list of all predictors
nzr <- nearZeroVar(data, saveMetrics = T)  # zero / near-zero variance diagnostics (caret)
if(all(!is.na(data)) & all(!nzr$zeroVar) & all(!nzr$nzv)) 
     cat("No missing values, no zero / near zero variance predictors remaining after eliminations")
## No missing values, no zero / near zero variance predictors remaining after eliminations

Summary statistics have been eliminated, except for the four total acceleration variables. The remaining predictors are raw numeric data from sensors attached to the test subject's forearm, arm, and belt, and to the dumbbell. The 53 remaining and 107 discarded variables are presented in the Backup Material section. Per the analysis above, zero and near-zero variance predictors need no further consideration in our model development.

Comments on Exploratory Graphs, Sensor Data Correlation, Predictor Skewness

The reader is referred to Exploratory Graphs in Backup Material to explore characteristics of the training dataset. The complexity of the associations between response and predictors is displayed in Exploratory Scatterplots - Arm Accelerometer, Magnetometer sensors, which plots the classe response against the arm-mounted magnetometer and accelerometer signals. Also refer to the lattice boxplots subgrouped by sensor location (forearm, arm, belt, dumbbell), presented in the Exploratory Boxplots subsections.

Is there significant collinearity among the remaining predictors? Refer to Sensor data correlation by location in Backup Material, where sensor data correlation tables (correlations > 0.5) may be found. As shown, there are mild to strong positive and negative correlations among sensor signals from the same location. Despite the implied collinearity, the author has chosen not to optimize further, and to err on the side of over-producing features so as to maximize the information presented to the machine learning algorithms discussed later in this report.
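
Had pruning of collinear predictors been desired, caret's findCorrelation() can flag candidates for removal. A minimal sketch, not applied in this report, using an illustrative 0.75 cutoff:

# sketch only: flag predictors whose pairwise correlation exceeds a cutoff
corMat <- cor(data[, allPreds])                  # predictor correlation matrix
tooHigh <- findCorrelation(corMat, cutoff=0.75)  # column indices suggested for removal
allPreds[tooHigh]                                # candidate predictors to drop (not dropped here)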

As shown in Skewness Analysis in Backup Material, twelve of the 52 predictors exhibit skewness > 1.0. Some are so skewed that a cube root transform (which reduces skew yet preserves sign) was applied to display their distributions meaningfully. The transform was applied only for plotting; model development relies on caret pre-processing transforms instead.
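
For reference, the signed cube-root transform used for those plots is a one-liner (a sketch; the helper name cubeRoot is hypothetical and not used elsewhere in this report):

cubeRoot <- function(x) sign(x) * abs(x)^(1/3)  # shrinks magnitude, keeps sign
cubeRoot(c(-8, 0, 27))  # returns -2 0 3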

Splitting

A 70/30 non-overlapping split of the dataset into trn and tst is performed with createDataPartition().

set.seed(1)
inTrain <- createDataPartition(y = data$classe, p = 0.7, list = FALSE)
trn <- data[inTrain, ] # 70% to trn
tst <- data[-inTrain, ] # 30% to tst
cat("trn:", dim(trn)[1], "observations by", dim(trn)[2], "variables;  tst:",
    dim(tst)[1], "observations by", dim(tst)[2], "variables")
## trn: 13737 observations by 53 variables;  tst: 5885 observations by 53 variables

Model Development

The author fit five models for this supervised-learning classification problem: (1) a CART decision tree, (2) a generalized boosted model, (3) a support vector machine, (4) a random forest model invoked directly through randomForest() rather than train(), with tuning parameters informed by (5), a random forest model tuned within train(). Optimal tuning parameters are discussed below.

Decision Tree Model

We evaluate a CART decision tree using train() with method rpart. The 52 predictors were centered and scaled, and 19 important variables were identified; a scree plot of these is presented in Backup Material. Cross-validation parameters used: 10 folds, repeated 10 times.

set.seed(1)  # decision tree model
dTreeModel <- train(classe ~ . , method = "rpart", data=trn, preProcess=c("center", "scale"))
dTreeModelPred <- predict(dTreeModel, newdata=tst, na.action = na.pass)  # predict w tst
cmDT <- confusionMatrix(dTreeModelPred, tst$classe)
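
The cross-validation settings quoted above correspond to the following trainControl() specification (a sketch; the chunk above relies on train()'s default resampling):

# sketch: 10-fold cross-validation, repeated 10 times
ctrlCV <- trainControl(method="repeatedcv", number=10, repeats=10)
# dTreeModel <- train(classe ~ . , method="rpart", data=trn,
#     preProcess=c("center", "scale"), trControl=ctrlCV)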

Generalized Boosted Model

We now evaluate a Generalized Boosted Model tuned with train(). After some experimentation, the best tune was n.trees=150, interaction.depth=3, shrinkage=0.1. Cross-validation parameters: 10 folds, repeated 10 times.

set.seed(1)  # Generalized Boosted Model
gbModel <- train(classe ~ . , data=trn, method='gbm', metric='Accuracy', preProcess=c("center", "scale"),
    verbose=FALSE, trControl=trainControl(method="repeatedcv", repeats=10))
gbModelPred <- predict(gbModel, newdata=tst) # prediction with testing dataset
cmGB <- confusionMatrix(gbModelPred, tst$classe)
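
Should the reader wish to pin the reported best tune rather than let train() search its default grid, it can be supplied explicitly (a sketch; n.minobsinnode=10 is the gbm default and is assumed here, since caret's gbm grid requires it):

# sketch: supply the reported best tune directly via tuneGrid
gbmGrid <- expand.grid(n.trees=150, interaction.depth=3, shrinkage=0.1, n.minobsinnode=10)
# gbModel <- train(classe ~ . , data=trn, method='gbm', metric='Accuracy', tuneGrid=gbmGrid,
#     preProcess=c("center", "scale"), verbose=FALSE,
#     trControl=trainControl(method="repeatedcv", repeats=10))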

Support Vector Machine Model

We now evaluate a Support Vector Machine with a Radial Basis Function kernel, tuned with train() using method svmRadial and tuneLength=12. Fine-tuning across multiple svm fits, with sigma and C parameter ranges supplied through expand.grid(), identified the tuning parameters used in the calculation below. Over-fitting was addressed with cross-validation (10 folds, repeated 10 times).

set.seed(1)  # support vector machines model
gridSVM <- expand.grid(sigma=0.01228219, C=513)
svmModel <- train(classe ~ . , data = trn, method = "svmRadial", preProc = c("center", "scale"),
     tuneGrid = gridSVM, trControl = trainControl(method = "repeatedcv", repeats = 10))
svmModelPred <- predict(svmModel, newdata=tst) # prediction with testing dataset
cmSVM <- confusionMatrix(svmModelPred, tst$classe)
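
The fine-tuning runs mentioned above amount to a grid search over sigma and C; a hypothetical sketch of such a search (the ranges below are illustrative, not the author's exact values):

# sketch: coarse grid over kernel width (sigma) and cost (C)
svmSearch <- expand.grid(sigma=c(0.005, 0.01, 0.02), C=2^(5:10))
# svmFit <- train(classe ~ . , data=trn, method="svmRadial", preProc=c("center", "scale"),
#     tuneGrid=svmSearch, trControl=trainControl(method="repeatedcv", repeats=10))
# svmFit$bestTune  # sigma / C pair with the highest resampled accuracy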

Random Forest Models

Two random forest models were created. The first (RF1) was invoked with randomForest() rather than train(); it performs quite well even with default parameter values, but was fine-tuned based on experimentation with RF2, specifically using expand.grid() to hunt for ntree (the number of trees) and mtry (the number of variables randomly sampled at each split). Model settings common to both RF1 and RF2 are ntree=500, mtry=8, and no pre-processing. In addition, we guarded against over-fitting on RF2 with cross-validation (10 folds, repeated 10 times).

# random forest model 1
set.seed(1)
RF1 <- randomForest(classe ~ ., trn, mtry=8, importance=T)
RF1Pred <- predict(RF1, newdata=tst)
cmRF1 <- confusionMatrix(RF1Pred, tst$classe)
# random forest model 2
grid <- expand.grid(.mtry=8)  # Grid Search for best tune mtry value
RF2 <- train(classe ~ . , data=trn, method="rf", metric='Accuracy', tuneGrid=grid, ntree=500,
     trControl=trainControl(method="repeatedcv", number=10, repeats=10, search="grid"))
RF2Pred <- predict(RF2, newdata=tst)
cmRF2 <- confusionMatrix(RF2Pred, tst$classe)
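
The ntree / mtry hunt mentioned above can be sketched as a short loop: caret's rf method tunes only mtry, so ntree is varied by hand (the candidate values below are illustrative):

# sketch: search mtry within train() while looping over candidate forest sizes
rfGrid <- expand.grid(.mtry=c(2, 8, 14, 26))
# for(nt in c(250, 500, 1000)) {
#     fit <- train(classe ~ . , data=trn, method="rf", tuneGrid=rfGrid, ntree=nt,
#         trControl=trainControl(method="cv", number=10))
#     print(cbind(ntree=nt, fit$bestTune, Accuracy=max(fit$results$Accuracy)))
# }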

Model Comparisons

# table of 95% confidence intervals for model accuracy
accuracyTbl <- round(100*cmDT$overall[3:4],1)
accuracyTbl <- rbind(accuracyTbl, round(100*cmGB$overall[3:4],1))
accuracyTbl <- rbind(accuracyTbl, round(100*cmSVM$overall[3:4],1))
accuracyTbl <- rbind(accuracyTbl, round(100*cmRF1$overall[3:4],1))
accuracyTbl <- rbind(accuracyTbl, round(100*cmRF2$overall[3:4],1))
row.names(accuracyTbl) <- c('Decision Tree', 'Generalized Boosted', 'Support Vector Machine',
    'RandomForest 1', 'RandomForest 2')
htmlTable(accuracyTbl, caption='**Comparisons of model accuracy, 95% confidence intervals**')
Comparisons of model accuracy, 95% confidence intervals
AccuracyLower AccuracyUpper
Decision Tree 47.8 50.4
Generalized Boosted 95.9 96.8
Support Vector Machine 99.3 99.7
RandomForest 1 99.4 99.8
RandomForest 2 99.5 99.8

The last four models listed above were the highest performing of the five evaluated. For those four, we continue with a comparison of confusion matrices, out-of-bag error rates, and out-of-sample error rates. The confusion matrices are combined below.

# concatenate all confusion matrices
labs <- paste0("| ", rownames(cmSVM$table)); aln <- paste0(rep('lccccc',4))
hdr <- paste0('..',substring(labs, 3),'..')
tbl <- cbind(labs, cmGB$table, labs, cmSVM$table, labs, cmRF1$table, labs, cmRF2$table)
htmlTable(tbl, header=c('[ GB ]', hdr,'[ SVM ]', hdr, '[ RF1 ]', hdr,'[ RF2 ]', hdr),
    rnames=F, align=aln, header.align=aln, caption="**Confusion Matrices, Four Models**")
Confusion Matrices, Four Models
[ GB ] ..A.. ..B.. ..C.. ..D.. ..E.. [ SVM ] ..A.. ..B.. ..C.. ..D.. ..E.. [ RF1 ] ..A.. ..B.. ..C.. ..D.. ..E.. [ RF2 ] ..A.. ..B.. ..C.. ..D.. ..E..
| A 1640 39 0 1 1 | A 1673 3 0 0 0 | A 1671 2 0 0 0 | A 1671 2 0 0 0
| B 20 1075 39 3 13 | B 1 1131 7 0 0 | B 0 1135 7 0 0 | B 2 1135 6 0 0
| C 9 24 976 22 9 | C 0 4 1016 5 0 | C 1 2 1018 4 0 | C 0 2 1019 4 0
| D 4 1 9 929 7 | D 0 0 3 956 2 | D 0 0 1 959 3 | D 0 0 1 959 2
| E 1 0 2 9 1052 | E 0 1 0 3 1080 | E 2 0 0 1 1079 | E 1 0 0 1 1080

Out-of-Bag error rate model comparisons

# OOB error rate: misclassified / total training observations; GB and SVM re-predict trn,
# while the random forests use their out-of-bag confusion matrices
sumTrainObserv <- dim(trn)[1]  # total training observations
GBConfusion <- confusionMatrix(predict(gbModel, newdata=trn), trn$classe) # Generalized Boosted Model
sumMisclassified.GB <- sumTrainObserv-sum(diag(GBConfusion$table))
OOB.GB <- round(sumMisclassified.GB / sumTrainObserv, 5)
SVMConfusion <- confusionMatrix(predict(svmModel, newdata=trn), trn$classe) # Support Vector Machine
sumMisclassified.SVM <- sumTrainObserv-sum(diag(SVMConfusion$table))
OOB.SVM <- round(sumMisclassified.SVM / sumTrainObserv, 5)
RF1Confusion <- with(RF1, confusion[1:nrow(confusion),1:nrow(confusion)])  # RF1
sumMisclassified.RF1 <- sumTrainObserv-sum(diag(RF1Confusion))
OOB.RF1 <- round(sumMisclassified.RF1 / sumTrainObserv, 5)
sumMisclassified.RF2 <- sumTrainObserv-sum(diag(RF2$finalModel$confusion))  # RF2
OOB.RF2 <- round(sumMisclassified.RF2 / sumTrainObserv, 5)
# comparison table
htmlTable(c(OOB.GB, OOB.SVM, OOB.RF1, OOB.RF2),
     header=c('| Generalized Boost |', '| Support Vector Machine |', '| RandForest1 |', '| RandForest2 |'),
     caption='**Out-of-Bag Error Rate Comparison, Four Models**', rnames=F)
Out-of-Bag Error Rate Comparison, Four Models
| Generalized Boost | | Support Vector Machine | | RandForest1 | | RandForest2 |
0.02701 0.00051 0.00517 0.00531

Refer to the table above. The two random forest methods yielded very accurate fits and similar results, yet appear to be out-performed on the training set by the support vector machine. Note, however, that the GB and SVM figures come from re-predicting the training set (resubstitution error), whereas the random forest figures are the forests' built-in out-of-bag estimates, which are typically closer to true generalization error.

Out-of-Sample error rate model comparisons

# OOS is (1 - Accuracy) of fit to testing set observations
OOS.GB <- round(1-cmGB$overall[1],5); OOS.SVM <- round(1-cmSVM$overall[1],5)
OOS.RF1 <- round(1-cmRF1$overall[1],5); OOS.RF2 <- round(1-cmRF2$overall[1],5)
htmlTable(c(OOS.GB, OOS.SVM, OOS.RF1, OOS.RF2),
     header=c('| Generalized Boost |', '| Support Vector Machine |', '| RandForest1 |', '| RandForest2 |'),
     caption='**Out-of-Sample Error Rate Comparison, Four Models**', rnames=F)
Out-of-Sample Error Rate Comparison, Four Models
| Generalized Boost | | Support Vector Machine | | RandForest1 | | RandForest2 |
0.03619 0.00493 0.00391 0.00357

Refer to the table above. The support vector machine looked strongest on the out-of-bag comparison, but when out-of-sample (OOS) error rates are compared, the random forest models come out ahead. So we see a slightly different result when the models are evaluated on the held-out test dataset. Again, both random forest models deliver similar results, with the train()-tuned random forest (RF2) slightly ahead.

Final Test: Evaluate Performance

The pml-testing.csv dataset includes 20 observations that will be used to evaluate predictive performance.

# final.test: pml-testing.csv, assumed already read (e.g. final.test <- read.csv("pml-testing.csv"))
gbPred.FT <- predict(gbModel, newdata=final.test, na.action = na.pass)  # evaluate on `final.test`
svmPred.FT <- predict(svmModel, newdata=final.test)
RF1Pred.FT <- predict(RF1, newdata=final.test)
RF2Pred.FT <- predict(RF2, newdata=final.test)
eq <- ifelse(gbPred.FT==svmPred.FT & svmPred.FT==RF1Pred.FT & RF1Pred.FT==RF2Pred.FT, 'T', 'F')
space <- rep('...' , 20)
htmlTable(t(data.frame('GB'=gbPred.FT, 'SVM'=svmPred.FT, 'RF1'=RF1Pred.FT, 'RF2'=RF2Pred.FT,
     space, 'Equal'=eq, fix.empty.names=F)), header=c(paste0('Ob','0',(1:9)), paste0('Ob',(10:20))))
Ob01 Ob02 Ob03 Ob04 Ob05 Ob06 Ob07 Ob08 Ob09 Ob10 Ob11 Ob12 Ob13 Ob14 Ob15 Ob16 Ob17 Ob18 Ob19 Ob20
GB B A B A A E D B A A B C B A E E A B B B
SVM B A B A A E D B A A B C B A E E A B B B
RF1 B A B A A E D B A A B C B A E E A B B B
RF2 B A B A A E D B A A B C B A E E A B B B
Equal T T T T T T T T T T T T T T T T T T T T

Refer to the table above. Each of the four highest-performing models yields identical predictions for the twenty-observation test dataset.

Backup Material

Correlation map, ALL training predictors

Exploratory Analysis referred the reader to the correlation plots in this section. A correlation map covering all training dataset predictors is provided below.

# correlation maps for ALL training predictors, plot order: first principal component order
corrplot(cor(trn[, allPreds]), tl.cex=.8, tl.col="firebrick", order="FPC",
         method="ellipse", type="full", sig.level=0.05, tl.pos="lt", cl.pos="n")

Sensor data correlation by location

lst <- c('_forearm', '_arm', '_belt', '_dumbbell')
l <- lapply(lst, function(x) {
    tmp <- cor(trn[, grepl(x, allVars) & !grepl('total_accel', allVars) & !grepl('classe', allVars)])
    names.tmp <- gsub(x, "", row.names(tmp))
    diag(tmp) <- 0  # zero self correlations
    for(i in 1:dim(tmp)[2]) tmp[,i][abs(tmp[,i]) < .5] <- 0  # select corr values > threshold
    tmp <- apply(round(tmp,2), 2, as.character)  # convert to char class
    for(i in 1:dim(tmp)[2]) tmp[,i][tmp[,i] =="0"] <- ''  # blank out values < threshold
    colnames(tmp) <- row.names(tmp) <- names.tmp
    return(noquote(tmp))
})
names.tmp <- row.names(l[[1]])  # recover variable names
df <- data.frame(matrix(unlist(l), nrow=dim(l[[1]])[1], byrow=T, dimnames=list(names.tmp,rep(names.tmp,4))),
     check.names=F, stringsAsFactors=F) 
for(i in 1:length(l)) print(htmlTable(df[,(1+12*(i-1)):(12*i)],
     caption=paste0('***Correlations > 0.5 by Sensor Location:  ',
          gsub('_','',lst[i]),'***'), header=paste0('`', names.tmp, '`')))
Correlations > 0.5 by Sensor Location: forearm
roll pitch yaw gyros_x gyros_y gyros_z accel_x accel_y accel_z magnet_x magnet_y magnet_z
roll
pitch 0.88
yaw -0.55
gyros_x
gyros_y -0.92
gyros_z 0.55 0.78
accel_x 0.81 0.92 -0.99
accel_y 0.53
accel_z -0.99 -0.78 -0.93
magnet_x 0.72
magnet_y -0.73 0.69
magnet_z 0.57 0.85 0.68 -0.66 0.53
Correlations > 0.5 by Sensor Location: arm
roll pitch yaw gyros_x gyros_y gyros_z accel_x accel_y accel_z magnet_x magnet_y magnet_z
roll
pitch 0.88
yaw 0.68
gyros_x -0.53 0.56 0.53
gyros_y -0.59
gyros_z -0.53 0.82 -0.79 -0.58
accel_x -0.7 -0.97 -0.89
accel_y -0.5
accel_z -0.89 0.73 0.89
magnet_x 0.52 0.81 0.57
magnet_y -0.99 0.69
magnet_z 0.58 0.53 -0.77
Correlations > 0.5 by Sensor Location: belt
roll pitch yaw gyros_x gyros_y gyros_z accel_x accel_y accel_z magnet_x magnet_y magnet_z
roll
pitch 0.68
yaw 0.77
gyros_x
gyros_y 0.82 -0.7 -0.66
gyros_z 0.56 -0.7 0.55 -0.79 0.82
accel_x 0.81 -0.7 0.53 0.71 0.6 -0.78 0.73
accel_y -0.97 0.71 0.89
accel_z 0.78
magnet_x 0.52 0.54 -0.54 0.85 0.58
magnet_y 0.81 0.54 0.68
magnet_z -0.77
Correlations > 0.5 by Sensor Location: dumbbell
roll pitch yaw gyros_x gyros_y gyros_z accel_x accel_y accel_z magnet_x magnet_y magnet_z
roll
pitch 0.77
yaw -0.55
gyros_x -0.92
gyros_y -0.59
gyros_z 0.53 -0.66 0.78 -0.58 0.82
accel_x
accel_y 0.92 0.6 -0.5 -0.93
accel_z 0.78
magnet_x -0.73 -0.99
magnet_y 0.72 -0.54 -0.66
magnet_z

Exploratory Graphs

Exploratory Scatterplots - Arm Accelerometer, Magnetometer sensors
what <- (grepl("accel_arm", allPreds)|grepl("magnet_arm", allPreds))&!grepl("total", allPreds)
featurePlot(x=trn[,what], y=trn$classe, plot = "pairs", auto.key = list(columns = 5),
    pch='.', alpha=.5, cex=.2, aspect=.75, main="Exploratory Scatterplots - Arm Accelerometer, Magnetometer Sensors")

Exploratory Boxplots - Accelerometer, Magnetometer sensors
locs <- c("_arm", "_forearm", "_belt", "_dumbbell")  # sensor locations
plotIt <- function(feat) featurePlot(x=trn[ , feat], y=trn$classe, "box", label=NULL)
featGrep <- function(loc, f1, f2, f3, not)
      grepl(loc,allPreds)&!grepl(not,allPreds)&(grepl(f1,allPreds)|grepl(f2,allPreds)|grepl(f3,allPreds))
p1 <- plotIt(featGrep(locs[1], "magnet_", "accel_","accel_","total_accel_"))
p2 <- plotIt(featGrep(locs[2], "magnet_", "accel_","accel_","total_accel_"))
p3 <- plotIt(featGrep(locs[3], "magnet_", "accel_","accel_","total_accel_"))
p4 <- plotIt(featGrep(locs[4], "magnet_", "accel_","accel_","total_accel_"))
p5 <- plotIt(grepl("total_accel_", allPreds))
grid.arrange(p1, p2, p3, p4, p5, heights=c(3.75,3.75,3))

Exploratory Boxplots - Gyroscopic sensors
p6 <- plotIt(featGrep(locs[1], "gyros_", "gyros_","gyros_","total_accel_"))
p7 <- plotIt(featGrep(locs[2], "gyros_", "gyros_","gyros_","total_accel_"))
p8 <- plotIt(featGrep(locs[3], "gyros_", "gyros_","gyros_","total_accel_"))
p9 <- plotIt(featGrep(locs[4], "gyros_", "gyros_","gyros_","total_accel_"))
grid.arrange(p6, p7, p8, p9)

Exploratory Boxplots - Roll/Pitch/Yaw sensors
p10 <- plotIt(featGrep(locs[1], "roll_", "pitch_","yaw_","total_accel_"))
p11 <- plotIt(featGrep(locs[2], "roll_", "pitch_","yaw_","total_accel_"))
p12 <- featurePlot(trn[ , allPreds[featGrep(locs[3], "roll_", "pitch_","yaw_",
      "total_accel_")]], trn$classe, "box", label=NULL) # workaround
p13 <- plotIt(featGrep(locs[4], "roll_", "pitch_","yaw_","total_accel_"))
grid.arrange(p10, p11, p12, p13)

Skewness Analysis

The dotted red lines on the histograms are mean references. We observe considerable skew, and some distributions are multi-modal. If the goal were an accurate simple linear regression model, the author expects that these untransformed predictors would present a considerable challenge. In fact, the adjusted \(R^2\) from LM <- train(as.integer(classe) ~ .,data=trn,method="lm") is only 0.53, so 47% of the variability in the response is not accounted for by this simple linear model.
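
A minimal sketch of that check, assuming the same trn split; the adjusted \(R^2\) is read from the underlying lm fit:

set.seed(1)
LM <- train(as.integer(classe) ~ . , data=trn, method="lm")  # naive linear fit, outcome coerced to integer
summary(LM$finalModel)$adj.r.squared  # adjusted R-squared, ~0.53 as noted above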

skew <- function(x) round(abs(skewness(x)),2);  skew <- apply(trn[,allPreds], 2, skew)
skew <- skew[order(-skew)]
predSkew <- data.frame(Predictor=paste0("` ", names(skew), " `"), Skew=skew)
row.names(predSkew) <- NULL; n <- dim(predSkew)[1]
htmlTable(cbind(predSkew[1:(n/4),], predSkew[(n/4+1):(2*n/4),],
    predSkew[(2*n/4+1):(3*n/4),], predSkew[(3*n/4+1):n,]), align.header=rep('l',8),
    rnames=F, align=rep('l',8),  caption="***Sorted Predictor Skew***")
Sorted Predictor Skew
Predictor Skew Predictor Skew Predictor Skew Predictor Skew
gyros_dumbbell_z 114.78 accel_belt_x 0.96 magnet_arm_y 0.49 gyros_arm_z 0.17
gyros_dumbbell_x 108.99 yaw_belt 0.9 accel_dumbbell_x 0.47 accel_belt_y 0.15
gyros_forearm_z 102.34 magnet_dumbbell_z 0.88 roll_forearm 0.46 magnet_arm_x 0.14
gyros_forearm_y 54.58 accel_arm_z 0.86 accel_forearm_z 0.44 gyros_belt_z 0.11
gyros_dumbbell_y 36.04 magnet_forearm_y 0.74 accel_arm_x 0.36 gyros_arm_y 0.11
gyros_forearm_x 2.65 roll_dumbbell 0.73 accel_dumbbell_y 0.36 yaw_arm 0.1
magnet_belt_y 2.22 accel_forearm_y 0.65 gyros_arm_x 0.29 accel_dumbbell_z 0.1
magnet_dumbbell_y 1.74 magnet_forearm_x 0.65 yaw_forearm 0.27 gyros_belt_y 0.09
magnet_dumbbell_x 1.69 total_accel_dumbbell 0.61 accel_forearm_x 0.23 accel_arm_y 0.09
magnet_belt_x 1.44 total_accel_forearm 0.57 yaw_dumbbell 0.22 total_accel_arm 0.07
magnet_forearm_z 1.22 gyros_belt_x 0.54 magnet_belt_z 0.19 total_accel_belt 0.03
magnet_arm_z 1.15 pitch_dumbbell 0.52 pitch_arm 0.19 roll_belt 0.02
pitch_belt 1 pitch_forearm 0.52 roll_arm 0.18 accel_belt_z 0.02

Predictors, Skewness > 1.0, cube-root transformed

skewThr <- 1.0 # "skewed" threshold
old.par <- par(no.readonly=T); par(mfrow=c(4,3),mar=c(3,2,1,1))
  for(i in 1 : sum(skew > skewThr)) {
    tmp <- sign(trn[, names(skew[i])]) * abs(trn[, names(skew[i])])^(1/3) # cube root transform
    hist(tmp, main=names(skew[i]), breaks=60, xlab='')
    abline(v=mean(tmp, na.rm=T), lty=3, lwd=3, col="firebrick")
}

par(old.par)

Scree Plot, Ranked Variable Importance, Decision Tree Model

A scree plot identifies the 19 variables that explain most of the variability for this model.

tblDT <- data.frame(var.imp=dTreeModel$finalModel$variable.importance)
old.par <- par(no.readonly=T); par(las=3, mar=c(7,4,2,1))
plot(x=seq_along(1:dim(tblDT)[1]), y=tblDT$var.imp, xaxt='n', type='s', ylim=c(0,1200),
     xlab='', main='Scree Plot:  Decision Tree Variables', ylab='Variable Importance')
axis(side=1, cex.axis=0.8, labels=rownames(tblDT), at=seq_along(1:dim(tblDT)[1]))
abline(h=0, lty=3);par(old.par)

List of Remaining Variables Post-Eliminations

The variable elimination process described in the Exploratory Analysis section yielded the following variables remaining for model development, including the outcome classe.

tbl<-c(); nC <- 6; names <- c(paste0("` ", allVars[order(allVars)], " `"), rep("...",1))
nV <- length(names)
for(i in 1:nC)
    tbl <- cbind(tbl, c(names[seq(1, nV, nV/nC)[i] : (seq(1, nV, nV/nC)+(nV/nC-1))[i]]))
htmlTable::htmlTable(tbl, align='llllll', header=c(rep("",6)),
              caption="***Remaining variables in dataset (alphabetical order)***")
Remaining variables in dataset (alphabetical order)
accel_arm_x accel_forearm_x gyros_belt_z magnet_arm_z magnet_forearm_z total_accel_arm
accel_arm_y accel_forearm_y gyros_dumbbell_x magnet_belt_x pitch_arm total_accel_belt
accel_arm_z accel_forearm_z gyros_dumbbell_y magnet_belt_y pitch_belt total_accel_dumbbell
accel_belt_x classe gyros_dumbbell_z magnet_belt_z pitch_dumbbell total_accel_forearm
accel_belt_y gyros_arm_x gyros_forearm_x magnet_dumbbell_x pitch_forearm yaw_arm
accel_belt_z gyros_arm_y gyros_forearm_y magnet_dumbbell_y roll_arm yaw_belt
accel_dumbbell_x gyros_arm_z gyros_forearm_z magnet_dumbbell_z roll_belt yaw_dumbbell
accel_dumbbell_y gyros_belt_x magnet_arm_x magnet_forearm_x roll_dumbbell yaw_forearm
accel_dumbbell_z gyros_belt_y magnet_arm_y magnet_forearm_y roll_forearm

List of Discarded Variables

The Exploratory Analysis section discussed discarded variables, and referred the reader to this list.

htmlTable::htmlTable(elimVar, align='lllll', header=c(rep("",5)),
    caption=paste0("***Discarded ", length(elim), " variables: high proportion
        missing values or NAs, #DIV/0! errors, otherwise low information content***"))
Discarded 107 variables: high proportion missing values or NAs, #DIV/0! errors, otherwise low information content
X var_total_accel_belt kurtosis_yaw_arm min_roll_dumbbell max_roll_forearm
user_name avg_roll_belt skewness_roll_arm min_pitch_dumbbell max_picth_forearm
raw_timestamp_part_1 stddev_roll_belt skewness_pitch_arm min_yaw_dumbbell max_yaw_forearm
raw_timestamp_part_2 var_roll_belt skewness_yaw_arm amplitude_roll_dumbbell min_roll_forearm
cvtd_timestamp avg_pitch_belt max_roll_arm amplitude_pitch_dumbbell min_pitch_forearm
new_window stddev_pitch_belt max_picth_arm amplitude_yaw_dumbbell min_yaw_forearm
num_window var_pitch_belt max_yaw_arm var_accel_dumbbell amplitude_roll_forearm
kurtosis_roll_belt avg_yaw_belt min_roll_arm avg_roll_dumbbell amplitude_pitch_forearm
kurtosis_picth_belt stddev_yaw_belt min_pitch_arm stddev_roll_dumbbell amplitude_yaw_forearm
kurtosis_yaw_belt var_yaw_belt min_yaw_arm var_roll_dumbbell var_accel_forearm
skewness_roll_belt var_accel_arm amplitude_roll_arm avg_pitch_dumbbell avg_roll_forearm
skewness_roll_belt.1 avg_roll_arm amplitude_pitch_arm stddev_pitch_dumbbell stddev_roll_forearm
skewness_yaw_belt stddev_roll_arm amplitude_yaw_arm var_pitch_dumbbell var_roll_forearm
max_roll_belt var_roll_arm kurtosis_roll_dumbbell avg_yaw_dumbbell avg_pitch_forearm
max_picth_belt avg_pitch_arm kurtosis_picth_dumbbell stddev_yaw_dumbbell stddev_pitch_forearm
max_yaw_belt stddev_pitch_arm kurtosis_yaw_dumbbell var_yaw_dumbbell var_pitch_forearm
min_roll_belt var_pitch_arm skewness_roll_dumbbell kurtosis_roll_forearm avg_yaw_forearm
min_pitch_belt avg_yaw_arm skewness_pitch_dumbbell kurtosis_picth_forearm stddev_yaw_forearm
min_yaw_belt stddev_yaw_arm skewness_yaw_dumbbell kurtosis_yaw_forearm var_yaw_forearm
amplitude_roll_belt var_yaw_arm max_roll_dumbbell skewness_roll_forearm
amplitude_pitch_belt kurtosis_roll_arm max_picth_dumbbell skewness_pitch_forearm
amplitude_yaw_belt kurtosis_picth_arm max_yaw_dumbbell skewness_yaw_forearm