This report makes use of the dataset from a study titled "Qualitative Activity Recognition of Weight Lifting Exercises".1 The research team gathered the data to assess whether weightlifting exercise mistakes could be detected using activity recognition techniques. Measurements were taken while subjects executed weightlifting movements both properly and improperly, under supervision.
Six male subjects, ages ranging from 20 to 28 years, wore athletic equipment (weight belt, glove, and armband) fitted with inertial sensors; an additional sensor was mounted on the dumbbell. The subjects performed dumbbell lifts correctly (class A) and incorrectly in four distinct ways (classes B, C, D, E). This report evaluates several machine learning models that attempt to predict the manner in which a subject is performing a dumbbell lift. The most accurate models will then be used to predict 20 different test cases.
1 Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. "Qualitative Activity Recognition of Weight Lifting Exercises." ACM SIGCHI, 2013.
The dataset pml-training.csv will be read and split 70/30 into training and testing datasets, labeled trn and tst respectively. trn will be used to build and tune the models; tst will be used to estimate each model's predictive performance. Variables in trn and tst are listed in Backup Material. classe (a factor) is the response.
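The chunk that reads the raw files is not shown in the report; a minimal sketch, assuming the two CSVs named in the text sit in the working directory (the object names data and final.test match the later chunks):

library(caret); library(randomForest) # modeling
library(corrplot); library(gridExtra) # correlation map; plot grids
library(e1071); library(htmlTable) # skewness(); HTML tables
data <- read.csv("pml-training.csv") # full dataset, split 70/30 below
final.test <- read.csv("pml-testing.csv") # 20 held-out cases for the final predictions

Note that the cleaning chunk below tests for empty strings (x==""), so blanks must be read as-is rather than converted to NA, which is read.csv()'s default behavior.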
The dataset includes summary statistics calculated at measurement-window boundaries; these columns are very sparse and/or contain high proportions of missing values and/or error values ("#DIV/0!"). We also note that columns 1:7 contain an observation id, subject name, and timestamp information. The information content of all these variables is very low, so they will be discarded.
div0 <- apply(data, 2, function(x) sum(x=="#DIV/0!", na.rm = T)>0) # flag columns containing "#DIV/0!" errors
blank <- apply(data, 2, function(x) sum(x=="", na.rm = T)>0) # flag columns containing blank entries
NAs <- apply(data, 2, function(x) sum(is.na(x))>0 ) # flag columns containing NAs
hiValue <- c(rep(TRUE, 7), rep(FALSE, length(names(data))-7)) # flag id / name / timestamp columns 1:7
elim <- names(data[, ( div0|blank|NAs|hiValue )])
elimVar <- c(); nC <- 5 # table of discarded variables for Backup Material
names <- c(paste0("` ", elim, " `"), rep("...",3)); nV <- length(names)
for(i in 1:nC) elimVar <- cbind(elimVar, names[seq(1,nV,nV/nC)[i] : (seq(1,nV,nV/nC)+(nV/nC-1))[i]])
data <- data[, !( div0|blank|NAs|hiValue )] # eliminate variables listed above
allVars <- names(data) # list of all variables
allPreds <- allVars[-grep("classe", allVars)] # list of all predictors
nzr <- nearZeroVar(data, saveMetrics = T)
if(!any(is.na(data)) & !any(nzr$zeroVar) & !any(nzr$nzv)) # TRUE only if no NAs and no (near) zero variance columns remain
cat("No missing values, no zero / near zero variance predictors remaining after eliminations")
## No missing values, no zero / near zero variance predictors remaining after eliminations
Summary statistics have been eliminated, except for four total acceleration variables. The remaining predictors represent raw numeric data from sensors attached to each subject's forearm, arm, and belt, and to the dumbbell. The 53 remaining and 107 discarded variables are presented in the Backup Material section. Per the analysis above, we will not need to consider zero or near-zero variance predictors in our model development.
A 70/30 non-overlapping split of the dataset into trn and tst is performed with createDataPartition().
set.seed(1)
inTrain <- createDataPartition(y = data$classe, p = 0.7, list = FALSE)
trn <- data[inTrain, ] # 70% to trn
tst <- data[-inTrain, ] # 30% to tst
cat("trn:", dim(trn)[1], "observations by", dim(trn)[2], "variables; tst:",
dim(tst)[1], "observations by", dim(tst)[2], "variables")
## trn: 13737 observations by 53 variables; tst: 5885 observations by 53 variables
The author fit five models to this supervised classification problem: (1) a CART decision tree; (2) a generalized boosted model; (3) a support vector machine; (4) a random forest invoked directly with randomForest() rather than train(), with tuning parameters informed by (5), a random forest tuned within train(). Optimal tuning parameters are discussed below.
We evaluate a CART decision tree using train() with method = "rpart". The 52 predictors were centered and scaled; 19 important variables were identified, and a scree plot of these is presented in Backup Material. Cross-validation parameters used: 10 folds, repeated 10 times.
set.seed(1) # decision tree model
dTreeModel <- train(classe ~ . , method = "rpart", data=trn, preProcess=c("center", "scale"),
trControl=trainControl(method="repeatedcv", repeats=10)) # repeated 10-fold CV per the text
dTreeModelPred <- predict(dTreeModel, newdata=tst, na.action = na.pass) # predict with tst
cmDT <- confusionMatrix(dTreeModelPred, tst$classe)
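For a quick look at the fitted tree itself, one could plot the final model (an optional illustration, not part of the original analysis; assumes the rattle package is installed):

library(rattle) # provides fancyRpartPlot()
fancyRpartPlot(dTreeModel$finalModel, sub="") # render the final CART tree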
We now evaluate a Generalized Boosted Model tuned with train(). After some experimentation, the best tune was n.trees=150, interaction.depth=3, shrinkage=0.1. Cross-validation parameters: 10 folds, repeated 10 times.
set.seed(1) # Generalized Boosted Model
gbModel <- train(classe ~ . , data=trn, method='gbm', metric='Accuracy', preProcess=c("center", "scale"),
verbose=FALSE, trControl=trainControl(method="repeatedcv", repeats=10))
gbModelPred <- predict(gbModel, newdata=tst) # prediction with testing dataset
cmGB <- confusionMatrix(gbModelPred, tst$classe)
We now evaluate a support vector machine with a radial basis function kernel, tuned via train() with method="svmRadial". Fine tuning from multiple preliminary svm fits, using tuneLength=12 and sigma / C ranges in expand.grid(), identified the tuning parameters used in the calculation below. We guard against over-fitting with cross-validation (10-fold, repeated 10 times).
set.seed(1) # support vector machines model
gridSVM <- expand.grid(sigma=0.01228219, C=513)
svmModel <- train(classe ~ . , data = trn, method = "svmRadial", preProc = c("center", "scale"),
tuneGrid = gridSVM, trControl = trainControl(method = "repeatedcv", repeats = 10))
svmModelPred <- predict(svmModel, newdata=tst) # prediction with testing dataset
cmSVM <- confusionMatrix(svmModelPred, tst$classe)
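The preliminary fits that located sigma=0.01228219 and C=513 are not shown; a sketch of that kind of search (the grid values here are illustrative, not the author's):

coarseSVM <- expand.grid(sigma=c(0.005, 0.01, 0.02), C=2^(7:10)) # coarse sigma / C grid
svmSearch <- train(classe ~ . , data=trn, method="svmRadial", preProc=c("center", "scale"),
tuneGrid=coarseSVM, trControl=trainControl(method="cv", number=5))
svmSearch$bestTune # best sigma / C pair; refine the grid around it and repeat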
Two random forest models were created. The first (RF1) was invoked with randomForest() rather than train(), and performs quite well with near-default parameter values; its tuning was informed by experimentation on RF2, specifically using expand.grid() to hunt for ntree (number of trees) and mtry (the number of variables randomly sampled at each split). Model features common to both RF1 and RF2 are ntree=500, mtry=8, and no pre-processing. In addition, we compensated for over-fitting on RF2 with cross-validation (10-fold, repeated 10 times).
# random forest model 1
set.seed(1)
RF1 <- randomForest(classe ~ ., trn, mtry=8, importance=T)
RF1Pred <- predict(RF1, newdata=tst)
cmRF1 <- confusionMatrix(RF1Pred, tst$classe)
# random forest model 2
grid <- expand.grid(.mtry=8) # Grid Search for best tune mtry value
RF2 <- train(classe ~ . , data=trn, method="rf", metric='Accuracy', tuneGrid=grid, ntree=500,
trControl=trainControl(method="repeatedcv", number=10, repeats=10, search="grid"))
RF2Pred <- predict(RF2, newdata=tst)
cmRF2 <- confusionMatrix(RF2Pred, tst$classe)
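The mtry hunt mentioned above can be sketched as follows (candidate values are illustrative; caret's rf method tunes mtry while ntree is passed through fixed, and sqrt(52) ≈ 7 is the usual classification default):

gridHunt <- expand.grid(.mtry=c(2, 5, 8, 14, 26)) # candidates around the default
rfSearch <- train(classe ~ . , data=trn, method="rf", ntree=500, tuneGrid=gridHunt,
trControl=trainControl(method="cv", number=5))
rfSearch$bestTune # mtry=8 per the text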
# table of 95% confidence intervals for model accuracy
accuracyTbl <- round(100*cmDT$overall[3:4],1)
accuracyTbl <- rbind(accuracyTbl, round(100*cmGB$overall[3:4],1))
accuracyTbl <- rbind(accuracyTbl, round(100*cmSVM$overall[3:4],1))
accuracyTbl <- rbind(accuracyTbl, round(100*cmRF1$overall[3:4],1))
accuracyTbl <- rbind(accuracyTbl, round(100*cmRF2$overall[3:4],1))
row.names(accuracyTbl) <- c('Decision Tree', 'Generalized Boosted', 'Support Vector Machine',
'RandomForest 1', 'RandomForest 2')
htmlTable(accuracyTbl, caption='**Comparisons of model accuracy, 95% confidence intervals**')
**Comparisons of model accuracy, 95% confidence intervals**

| | AccuracyLower (%) | AccuracyUpper (%) |
|---|---|---|
| Decision Tree | 47.8 | 50.4 |
| Generalized Boosted | 95.9 | 96.8 |
| Support Vector Machine | 99.3 | 99.7 |
| RandomForest 1 | 99.4 | 99.8 |
| RandomForest 2 | 99.5 | 99.8 |
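The interval bounds are caret's exact binomial confidence limits on accuracy; the RF2 row, for example, can be recovered directly from its confusion matrix (assuming the objects above):

binom.test(sum(diag(cmRF2$table)), sum(cmRF2$table))$conf.int # matches the RF2 row (as proportions)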
The last four models listed above were the highest performing of the five evaluated. For those four, we continue with a comparison of confusion matrices, out-of-box (training-set) error rates, and out-of-sample (test-set) error rates. The confusion matrices are combined below.
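Both error rates are misclassification fractions computed from a confusion matrix \(C\); they differ only in the observations used: the training set for the out-of-box (OOB) rate (for the random forests, the out-of-bag confusion matrix plays this role) and the held-out test set for the out-of-sample (OOS) rate:

\[ \text{error rate} = 1 - \frac{\sum_{i} C_{ii}}{\sum_{i,j} C_{ij}} \]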
# concatenate all confusion matrices
labs <- paste0("| ", rownames(cmSVM$table)); aln <- paste0(rep('lccccc',4))
hdr <- paste0('..',substring(labs, 3),'..')
tbl <- cbind(labs, cmGB$table, labs, cmSVM$table, labs, cmRF1$table, labs, cmRF2$table)
htmlTable(tbl, header=c('[ GB ]', hdr,'[ SVM ]', hdr, '[ RF1 ]', hdr,'[ RF2 ]', hdr),
rnames=F, align=aln, header.align=aln, caption="**Confusion Matrices, Four Models**")
**Confusion Matrices, Four Models** (rows: prediction; columns: reference)

| [ GB ] | A | B | C | D | E | [ SVM ] | A | B | C | D | E | [ RF1 ] | A | B | C | D | E | [ RF2 ] | A | B | C | D | E |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1640 | 39 | 0 | 1 | 1 | A | 1673 | 3 | 0 | 0 | 0 | A | 1671 | 2 | 0 | 0 | 0 | A | 1671 | 2 | 0 | 0 | 0 |
| B | 20 | 1075 | 39 | 3 | 13 | B | 1 | 1131 | 7 | 0 | 0 | B | 0 | 1135 | 7 | 0 | 0 | B | 2 | 1135 | 6 | 0 | 0 |
| C | 9 | 24 | 976 | 22 | 9 | C | 0 | 4 | 1016 | 5 | 0 | C | 1 | 2 | 1018 | 4 | 0 | C | 0 | 2 | 1019 | 4 | 0 |
| D | 4 | 1 | 9 | 929 | 7 | D | 0 | 0 | 3 | 956 | 2 | D | 0 | 0 | 1 | 959 | 3 | D | 0 | 0 | 1 | 959 | 2 |
| E | 1 | 0 | 2 | 9 | 1052 | E | 0 | 1 | 0 | 3 | 1080 | E | 2 | 0 | 0 | 1 | 1079 | E | 1 | 0 | 0 | 1 | 1080 |
# OOB is ratio of misclassified to total training observations in confusion matrix
sumTrainObserv <- dim(trn)[1] # total training observations
GBConfusion <- confusionMatrix(predict(gbModel, newdata=trn), trn$classe) # Generalized Boosted Model
sumMisclassified.GB <- sumTrainObserv-sum(diag(GBConfusion$table))
OOB.GB <- round(sumMisclassified.GB / sumTrainObserv, 5)
SVMConfusion <- confusionMatrix(predict(svmModel, newdata=trn), trn$classe) # Support Vector Machine
sumMisclassified.SVM <- sumTrainObserv-sum(diag(SVMConfusion$table))
OOB.SVM <- round(sumMisclassified.SVM / sumTrainObserv, 5)
RF1Confusion <- with(RF1, confusion[1:nrow(confusion),1:nrow(confusion)]) # RF1
sumMisclassified.RF1 <- sumTrainObserv-sum(diag(RF1Confusion))
OOB.RF1 <- round(sumMisclassified.RF1 / sumTrainObserv, 5)
sumMisclassified.RF2 <- sumTrainObserv-sum(diag(RF2$finalModel$confusion)) # RF2
OOB.RF2 <- round(sumMisclassified.RF2 / sumTrainObserv, 5)
# comparison table
htmlTable(c(OOB.GB, OOB.SVM, OOB.RF1, OOB.RF2),
header=c('| Generalized Boost |', '| Support Vector Machine |', '| RandForest1 |', '| RandForest2 |'),
caption='**Out-of-Box Error Rate Comparison, Four Models**', rnames=F)
**Out-of-Box Error Rate Comparison, Four Models**

| Generalized Boost | Support Vector Machine | RandForest1 | RandForest2 |
|---|---|---|---|
| 0.02701 | 0.00051 | 0.00517 | 0.00531 |
Refer to the table above. The two random forest methods yielded very accurate fits with similar results, but both were outperformed on this training set by the support vector machine.
# OOS is (1 - Accuracy) of fit to testing set observations
OOS.GB <- round(1-cmGB$overall[1],5); OOS.SVM <- round(1-cmSVM$overall[1],5)
OOS.RF1 <- round(1-cmRF1$overall[1],5); OOS.RF2 <- round(1-cmRF2$overall[1],5)
htmlTable(c(OOS.GB, OOS.SVM, OOS.RF1, OOS.RF2),
header=c('| Generalized Boost |', '| Support Vector Machine |', '| RandForest1 |', '| RandForest2 |'),
caption='**Out-of-Sample Error Rate Comparison, Four Models**', rnames=F)
**Out-of-Sample Error Rate Comparison, Four Models**

| Generalized Boost | Support Vector Machine | RandForest1 | RandForest2 |
|---|---|---|---|
| 0.03619 | 0.00493 | 0.00391 | 0.00357 |
Refer to the table above. The support vector machine led on OOB error rate, but when out-of-sample (OOS) rates are compared, the random forest models pull ahead; evaluating on the test dataset thus reverses the ranking slightly. Again, both random forest models deliver similar results, with the trained random forest (RF2) slightly outperforming.
The pml-testing.csv dataset, read into final.test, includes 20 observations that will be used to evaluate predictive performance.
gbPred.FT <- predict(gbModel, newdata=final.test, na.action = na.pass) # evaluate on `final.test`
svmPred.FT <- predict(svmModel, newdata=final.test)
RF1Pred.FT <- predict(RF1, newdata=final.test)
RF2Pred.FT <- predict(RF2, newdata=final.test)
eq <- ifelse(gbPred.FT==svmPred.FT & svmPred.FT==RF1Pred.FT & RF1Pred.FT==RF2Pred.FT, 'T', 'F')
space <- rep('...' , 20)
htmlTable(t(data.frame('GB'=gbPred.FT, 'SVM'=svmPred.FT, 'RF1'=RF1Pred.FT, 'RF2'=RF2Pred.FT,
space, 'Equal'=eq, fix.empty.names=F)), header=c(paste0('Ob','0',(1:9)), paste0('Ob',(10:20))))
| | Ob01 | Ob02 | Ob03 | Ob04 | Ob05 | Ob06 | Ob07 | Ob08 | Ob09 | Ob10 | Ob11 | Ob12 | Ob13 | Ob14 | Ob15 | Ob16 | Ob17 | Ob18 | Ob19 | Ob20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GB | B | A | B | A | A | E | D | B | A | A | B | C | B | A | E | E | A | B | B | B |
| SVM | B | A | B | A | A | E | D | B | A | A | B | C | B | A | E | E | A | B | B | B |
| RF1 | B | A | B | A | A | E | D | B | A | A | B | C | B | A | E | E | A | B | B | B |
| RF2 | B | A | B | A | A | E | D | B | A | A | B | C | B | A | E | E | A | B | B | B |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | |
| Equal | T | T | T | T | T | T | T | T | T | T | T | T | T | T | T | T | T | T | T | T |
Refer to the table above. Each of the four highest-performing models yields identical predictions for the 20-observation test dataset.
Exploratory Analysis referred the reader to the correlation plots in this section. A correlation plot spanning all training-dataset predictors appears below; per-location correlation tables follow.
# correlation maps for ALL training predictors, plot order: first principal component order
corrplot(cor(trn[, allPreds]), tl.cex=.8, tl.col="firebrick", order="FPC",
method="ellipse", type="full", sig.level=0.05, tl.pos="lt", cl.pos="n")
lst <- c('_forearm', '_arm', '_belt', '_dumbbell')
l <- lapply(lst, function(x) {
tmp <- cor(trn[, grepl(x, allVars) & !grepl('total_accel', allVars) & !grepl('classe', allVars)])
names.tmp <- gsub(x, "", row.names(tmp))
diag(tmp) <- 0 # zero self correlations
for(i in 1:dim(tmp)[2]) tmp[,i][abs(tmp[,i]) < .5] <- 0 # select corr values > threshold
tmp <- apply(round(tmp,2), 2, as.character) # convert to char class
for(i in 1:dim(tmp)[2]) tmp[,i][tmp[,i] =="0"] <- '' # blank out values < threshold
colnames(tmp) <- row.names(tmp) <- names.tmp
return(noquote(tmp))
})
names.tmp <- row.names(l[[1]]) # recover variable names
df <- data.frame(matrix(unlist(l), nrow=dim(l[[1]])[1], byrow=T, dimnames=list(names.tmp,rep(names.tmp,4))),
check.names=F, stringsAsFactors=F)
for(i in 1:length(l)) print(htmlTable(df[,(1+12*(i-1)):(12*i)],
caption=paste0('***Correlations > 0.5 by Sensor Location: ',
gsub('_','',lst[i]),'***'), header=paste0('`', names.tmp, '`')))
***Correlations > 0.5 by Sensor Location: forearm***

| | roll | pitch | yaw | gyros_x | gyros_y | gyros_z | accel_x | accel_y | accel_z | magnet_x | magnet_y | magnet_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| roll | ||||||||||||
| pitch | 0.88 | |||||||||||
| yaw | -0.55 | |||||||||||
| gyros_x | ||||||||||||
| gyros_y | -0.92 | |||||||||||
| gyros_z | 0.55 | 0.78 | ||||||||||
| accel_x | 0.81 | 0.92 | -0.99 | |||||||||
| accel_y | 0.53 | |||||||||||
| accel_z | -0.99 | -0.78 | -0.93 | |||||||||
| magnet_x | 0.72 | |||||||||||
| magnet_y | -0.73 | 0.69 | ||||||||||
| magnet_z | 0.57 | 0.85 | 0.68 | -0.66 | 0.53 | |||||||
***Correlations > 0.5 by Sensor Location: arm***

| | roll | pitch | yaw | gyros_x | gyros_y | gyros_z | accel_x | accel_y | accel_z | magnet_x | magnet_y | magnet_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| roll | ||||||||||||
| pitch | 0.88 | |||||||||||
| yaw | 0.68 | |||||||||||
| gyros_x | -0.53 | 0.56 | 0.53 | |||||||||
| gyros_y | -0.59 | |||||||||||
| gyros_z | -0.53 | 0.82 | -0.79 | -0.58 | ||||||||
| accel_x | -0.7 | -0.97 | -0.89 | |||||||||
| accel_y | -0.5 | |||||||||||
| accel_z | -0.89 | 0.73 | 0.89 | |||||||||
| magnet_x | 0.52 | 0.81 | 0.57 | |||||||||
| magnet_y | -0.99 | 0.69 | ||||||||||
| magnet_z | 0.58 | 0.53 | -0.77 | |||||||||
***Correlations > 0.5 by Sensor Location: belt***

| | roll | pitch | yaw | gyros_x | gyros_y | gyros_z | accel_x | accel_y | accel_z | magnet_x | magnet_y | magnet_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| roll | ||||||||||||
| pitch | 0.68 | |||||||||||
| yaw | 0.77 | |||||||||||
| gyros_x | ||||||||||||
| gyros_y | 0.82 | -0.7 | -0.66 | |||||||||
| gyros_z | 0.56 | -0.7 | 0.55 | -0.79 | 0.82 | |||||||
| accel_x | 0.81 | -0.7 | 0.53 | 0.71 | 0.6 | -0.78 | 0.73 | |||||
| accel_y | -0.97 | 0.71 | 0.89 | |||||||||
| accel_z | 0.78 | |||||||||||
| magnet_x | 0.52 | 0.54 | -0.54 | 0.85 | 0.58 | |||||||
| magnet_y | 0.81 | 0.54 | 0.68 | |||||||||
| magnet_z | -0.77 | |||||||||||
***Correlations > 0.5 by Sensor Location: dumbbell***

| | roll | pitch | yaw | gyros_x | gyros_y | gyros_z | accel_x | accel_y | accel_z | magnet_x | magnet_y | magnet_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| roll | ||||||||||||
| pitch | 0.77 | |||||||||||
| yaw | -0.55 | |||||||||||
| gyros_x | -0.92 | |||||||||||
| gyros_y | -0.59 | |||||||||||
| gyros_z | 0.53 | -0.66 | 0.78 | -0.58 | 0.82 | |||||||
| accel_x | ||||||||||||
| accel_y | 0.92 | 0.6 | -0.5 | -0.93 | ||||||||
| accel_z | 0.78 | |||||||||||
| magnet_x | -0.73 | -0.99 | ||||||||||
| magnet_y | 0.72 | -0.54 | -0.66 | |||||||||
| magnet_z | ||||||||||||
what <- (grepl("accel_arm", allPreds)|grepl("magnet_arm", allPreds))&!grepl("total", allPreds)
featurePlot(x=trn[, allPreds[what]], y=trn$classe, plot = "pairs", auto.key = list(columns = 5), # index by predictor name: `what` is a logical over allPreds, not over trn's 53 columns
pch='.', alpha=.5, cex=.2, aspect=.75, main="Exploratory Scatterplots - Arm Accelerometer, Magnetometer Sensors")
locs <- c("_arm", "_forearm", "_belt", "_dumbbell") # sensor locations
plotIt <- function(feat) featurePlot(x=trn[ , allPreds[feat]], y=trn$classe, "box", label=NULL) # feat is a logical over allPreds
featGrep <- function(loc, f1, f2, f3, not)
grepl(loc,allPreds)&!grepl(not,allPreds)&(grepl(f1,allPreds)|grepl(f2,allPreds)|grepl(f3,allPreds))
p1 <- plotIt(featGrep(locs[1], "magnet_", "accel_","accel_","total_accel_"))
p2 <- plotIt(featGrep(locs[2], "magnet_", "accel_","accel_","total_accel_"))
p3 <- plotIt(featGrep(locs[3], "magnet_", "accel_","accel_","total_accel_"))
p4 <- plotIt(featGrep(locs[4], "magnet_", "accel_","accel_","total_accel_"))
p5 <- plotIt(grepl("total_accel_", allPreds))
grid.arrange(p1, p2, p3, p4, p5, heights=c(3.75,3.75,3))
p6 <- plotIt(featGrep(locs[1], "gyros_", "gyros_","gyros_","total_accel_"))
p7 <- plotIt(featGrep(locs[2], "gyros_", "gyros_","gyros_","total_accel_"))
p8 <- plotIt(featGrep(locs[3], "gyros_", "gyros_","gyros_","total_accel_"))
p9 <- plotIt(featGrep(locs[4], "gyros_", "gyros_","gyros_","total_accel_"))
grid.arrange(p6, p7, p8, p9)
p10 <- plotIt(featGrep(locs[1], "roll_", "pitch_","yaw_","total_accel_"))
p11 <- plotIt(featGrep(locs[2], "roll_", "pitch_","yaw_","total_accel_"))
p12 <- plotIt(featGrep(locs[3], "roll_", "pitch_","yaw_","total_accel_"))
p13 <- plotIt(featGrep(locs[4], "roll_", "pitch_","yaw_","total_accel_"))
grid.arrange(p10, p11, p12, p13)
The dotted red lines on the histograms are mean references. We observe considerable skew, and some distributions are multi-modal. If the goal were an accurate simple linear regression model, the author expects these untransformed predictors would present a considerable challenge. In fact, \(R^2_{adj}\) from LM <- train(as.integer(classe) ~ .,data=trn,method="lm") is only 0.53, so 47% of the variability in the response is not accounted for by this simple linear model.
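For reference, the linear-model check quoted above can be reproduced as follows (a sketch; coercing classe with as.integer() imposes an arbitrary equal spacing on the five classes, one reason the fit is poor):

set.seed(1)
LM <- train(as.integer(classe) ~ . , data=trn, method="lm")
round(summary(LM$finalModel)$adj.r.squared, 2) # ~0.53 per the text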
skewFn <- function(x) round(abs(skewness(x)), 2) # absolute skewness per predictor
skew <- apply(trn[, allPreds], 2, skewFn)
skew <- skew[order(-skew)]
predSkew <- data.frame(Predictor=paste0("` ", names(skew), " `"), Skew=skew)
row.names(predSkew) <- NULL; n <- dim(predSkew)[1]
htmlTable(cbind(predSkew[1:(n/4),], predSkew[(n/4+1):(2*n/4),],
predSkew[(2*n/4+1):(3*n/4),], predSkew[(3*n/4+1):n,]), align.header=rep('l',8),
rnames=F, align=rep('l',8), caption="***Sorted Predictor Skew***")
***Sorted Predictor Skew***

| Predictor | Skew | Predictor | Skew | Predictor | Skew | Predictor | Skew |
|---|---|---|---|---|---|---|---|
| `gyros_dumbbell_z` | 114.78 | `accel_belt_x` | 0.96 | `magnet_arm_y` | 0.49 | `gyros_arm_z` | 0.17 |
| `gyros_dumbbell_x` | 108.99 | `yaw_belt` | 0.9 | `accel_dumbbell_x` | 0.47 | `accel_belt_y` | 0.15 |
| `gyros_forearm_z` | 102.34 | `magnet_dumbbell_z` | 0.88 | `roll_forearm` | 0.46 | `magnet_arm_x` | 0.14 |
| `gyros_forearm_y` | 54.58 | `accel_arm_z` | 0.86 | `accel_forearm_z` | 0.44 | `gyros_belt_z` | 0.11 |
| `gyros_dumbbell_y` | 36.04 | `magnet_forearm_y` | 0.74 | `accel_arm_x` | 0.36 | `gyros_arm_y` | 0.11 |
| `gyros_forearm_x` | 2.65 | `roll_dumbbell` | 0.73 | `accel_dumbbell_y` | 0.36 | `yaw_arm` | 0.1 |
| `magnet_belt_y` | 2.22 | `accel_forearm_y` | 0.65 | `gyros_arm_x` | 0.29 | `accel_dumbbell_z` | 0.1 |
| `magnet_dumbbell_y` | 1.74 | `magnet_forearm_x` | 0.65 | `yaw_forearm` | 0.27 | `gyros_belt_y` | 0.09 |
| `magnet_dumbbell_x` | 1.69 | `total_accel_dumbbell` | 0.61 | `accel_forearm_x` | 0.23 | `accel_arm_y` | 0.09 |
| `magnet_belt_x` | 1.44 | `total_accel_forearm` | 0.57 | `yaw_dumbbell` | 0.22 | `total_accel_arm` | 0.07 |
| `magnet_forearm_z` | 1.22 | `gyros_belt_x` | 0.54 | `magnet_belt_z` | 0.19 | `total_accel_belt` | 0.03 |
| `magnet_arm_z` | 1.15 | `pitch_dumbbell` | 0.52 | `pitch_arm` | 0.19 | `roll_belt` | 0.02 |
| `pitch_belt` | 1 | `pitch_forearm` | 0.52 | `roll_arm` | 0.18 | `accel_belt_z` | 0.02 |
***Predictors, Skewness > 1.0, cube-root transformed***
skewThr <- 1.0 # "skewed" threshold
old.par <- par(no.readonly=T); par(mfrow=c(4,3),mar=c(3,2,1,1))
for(i in 1 : sum(skew > skewThr)) {
tmp <- sign(trn[, names(skew[i])]) * abs(trn[, names(skew[i])])^(1/3) # cube root transform
hist(tmp, main=names(skew[i]), breaks=60, xlab='')
abline(v=mean(tmp, na.rm=T), lty=3, lwd=3, col="firebrick")
}
par(old.par)
A scree plot identifies the 19 variables that explain most of the variability for this model.
tblDT <- data.frame(var.imp=dTreeModel$finalModel$variable.importance)
old.par <- par(no.readonly=T); par(las=3, mar=c(7,4,2,1))
plot(x=seq_len(nrow(tblDT)), y=tblDT$var.imp, xaxt='n', type='s', ylim=c(0,1200),
xlab='', main='Scree Plot: Decision Tree Variables', ylab='Variable Importance')
axis(side=1, cex.axis=0.8, labels=rownames(tblDT), at=seq_len(nrow(tblDT)))
abline(h=0, lty=3);par(old.par)
The variable elimination process described in the Exploratory Analysis section yielded the following variables remaining for model development, including the outcome classe.
tbl<-c(); nC <- 6; names <- c(paste0("` ", allVars[order(allVars)], " `"), rep("...",1))
nV <- length(names)
for(i in 1:nC)
tbl <- cbind(tbl, c(names[seq(1, nV, nV/nC)[i] : (seq(1, nV, nV/nC)+(nV/nC-1))[i]]))
htmlTable::htmlTable(tbl, align='llllll', header=c(rep("",6)),
caption="***Remaining variables in dataset (alphabetical order)***")
***Remaining variables in dataset (alphabetical order)***

| | | | | | |
|---|---|---|---|---|---|
| `accel_arm_x` | `accel_forearm_x` | `gyros_belt_z` | `magnet_arm_z` | `magnet_forearm_z` | `total_accel_arm` |
| `accel_arm_y` | `accel_forearm_y` | `gyros_dumbbell_x` | `magnet_belt_x` | `pitch_arm` | `total_accel_belt` |
| `accel_arm_z` | `accel_forearm_z` | `gyros_dumbbell_y` | `magnet_belt_y` | `pitch_belt` | `total_accel_dumbbell` |
| `accel_belt_x` | `classe` | `gyros_dumbbell_z` | `magnet_belt_z` | `pitch_dumbbell` | `total_accel_forearm` |
| `accel_belt_y` | `gyros_arm_x` | `gyros_forearm_x` | `magnet_dumbbell_x` | `pitch_forearm` | `yaw_arm` |
| `accel_belt_z` | `gyros_arm_y` | `gyros_forearm_y` | `magnet_dumbbell_y` | `roll_arm` | `yaw_belt` |
| `accel_dumbbell_x` | `gyros_arm_z` | `gyros_forearm_z` | `magnet_dumbbell_z` | `roll_belt` | `yaw_dumbbell` |
| `accel_dumbbell_y` | `gyros_belt_x` | `magnet_arm_x` | `magnet_forearm_x` | `roll_dumbbell` | `yaw_forearm` |
| `accel_dumbbell_z` | `gyros_belt_y` | `magnet_arm_y` | `magnet_forearm_y` | `roll_forearm` | … |
The Exploratory Analysis section discussed discarded variables and referred the reader to this list. Variable names, including misspellings such as max_picth_* and kurtosis_picth_*, are verbatim from the source dataset.
htmlTable::htmlTable(elimVar, align='lllll', header=c(rep("",5)),
caption=paste0("***Discarded ", length(elim), " variables: high proportion
missing values or NAs, #DIV/0! errors, otherwise low information content***"))
***Discarded 107 variables: high proportion missing values or NAs, #DIV/0! errors, otherwise low information content***

| | | | | |
|---|---|---|---|---|
| `X` | `var_total_accel_belt` | `kurtosis_yaw_arm` | `min_roll_dumbbell` | `max_roll_forearm` |
| `user_name` | `avg_roll_belt` | `skewness_roll_arm` | `min_pitch_dumbbell` | `max_picth_forearm` |
| `raw_timestamp_part_1` | `stddev_roll_belt` | `skewness_pitch_arm` | `min_yaw_dumbbell` | `max_yaw_forearm` |
| `raw_timestamp_part_2` | `var_roll_belt` | `skewness_yaw_arm` | `amplitude_roll_dumbbell` | `min_roll_forearm` |
| `cvtd_timestamp` | `avg_pitch_belt` | `max_roll_arm` | `amplitude_pitch_dumbbell` | `min_pitch_forearm` |
| `new_window` | `stddev_pitch_belt` | `max_picth_arm` | `amplitude_yaw_dumbbell` | `min_yaw_forearm` |
| `num_window` | `var_pitch_belt` | `max_yaw_arm` | `var_accel_dumbbell` | `amplitude_roll_forearm` |
| `kurtosis_roll_belt` | `avg_yaw_belt` | `min_roll_arm` | `avg_roll_dumbbell` | `amplitude_pitch_forearm` |
| `kurtosis_picth_belt` | `stddev_yaw_belt` | `min_pitch_arm` | `stddev_roll_dumbbell` | `amplitude_yaw_forearm` |
| `kurtosis_yaw_belt` | `var_yaw_belt` | `min_yaw_arm` | `var_roll_dumbbell` | `var_accel_forearm` |
| `skewness_roll_belt` | `var_accel_arm` | `amplitude_roll_arm` | `avg_pitch_dumbbell` | `avg_roll_forearm` |
| `skewness_roll_belt.1` | `avg_roll_arm` | `amplitude_pitch_arm` | `stddev_pitch_dumbbell` | `stddev_roll_forearm` |
| `skewness_yaw_belt` | `stddev_roll_arm` | `amplitude_yaw_arm` | `var_pitch_dumbbell` | `var_roll_forearm` |
| `max_roll_belt` | `var_roll_arm` | `kurtosis_roll_dumbbell` | `avg_yaw_dumbbell` | `avg_pitch_forearm` |
| `max_picth_belt` | `avg_pitch_arm` | `kurtosis_picth_dumbbell` | `stddev_yaw_dumbbell` | `stddev_pitch_forearm` |
| `max_yaw_belt` | `stddev_pitch_arm` | `kurtosis_yaw_dumbbell` | `var_yaw_dumbbell` | `var_pitch_forearm` |
| `min_roll_belt` | `var_pitch_arm` | `skewness_roll_dumbbell` | `kurtosis_roll_forearm` | `avg_yaw_forearm` |
| `min_pitch_belt` | `avg_yaw_arm` | `skewness_pitch_dumbbell` | `kurtosis_picth_forearm` | `stddev_yaw_forearm` |
| `min_yaw_belt` | `stddev_yaw_arm` | `skewness_yaw_dumbbell` | `kurtosis_yaw_forearm` | `var_yaw_forearm` |
| `amplitude_roll_belt` | `var_yaw_arm` | `max_roll_dumbbell` | `skewness_roll_forearm` | … |
| `amplitude_pitch_belt` | `kurtosis_roll_arm` | `max_picth_dumbbell` | `skewness_pitch_forearm` | … |
| `amplitude_yaw_belt` | `kurtosis_picth_arm` | `max_yaw_dumbbell` | `skewness_yaw_forearm` | … |
Comments on Exploratory Graphs, Sensor Data Correlation, Predictor Skewness
The reader is referred to Exploratory Graphs in Backup Material to explore characteristics of the training dataset. The complexity of the associations between response and predictors is displayed in Exploratory Scatterplots - Arm Accelerometer, Magnetometer Sensors, which plots the classe response against the arm-mounted magnetometer and accelerometer sensors. Refer also to the lattice boxplots subgrouped by sensor location (forearm, arm, belt, dumbbell).

Is there significant collinearity among the remaining predictors? Refer to the sensor data correlation tables (correlations > 0.5) by location in Backup Material. As shown, mild to strong positive and negative correlations exist among sensor signals from the same location. In spite of the implied collinearity, the author has chosen not to optimize further, and to err on the side of over-production of features so as to maximize the ranges presented to the machine learning algorithms discussed in this report.
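Had pruning been desired instead, caret can flag the collinear members directly; a sketch (the cutoff value is illustrative):

highCorr <- findCorrelation(cor(trn[, allPreds]), cutoff=0.75) # indices of candidates to drop
allPreds[highCorr] # names of highly inter-correlated predictors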
As shown in the Skewness Analysis in Backup Material, twelve of the 52 predictors exhibit skewness > 1.0. Some are so skewed that a cube-root transform (which reduces skew yet preserves sign) was applied in order to display their distributions meaningfully. The transform was applied only for plotting; we rely on caret pre-processing transforms to develop the models.