This report makes use of the dataset from a study titled "Qualitative Activity Recognition of Weight Lifting Exercises".1 The research team gathered the data to assess whether weightlifting exercise mistakes could be detected using activity recognition techniques. Measurements were taken while subjects executed weightlifting movements both properly and improperly, under supervision.
Six male subjects, ages ranging from 20 to 28 years, wore athletic equipment (weight belt, glove, and armband) fitted with inertial sensors; an additional sensor was mounted on the dumbbell. The subjects performed dumbbell lifts correctly (class A) and incorrectly in four distinct ways (classes B, C, D, E). This report evaluates several machine learning models that attempt to predict the manner in which a subject is performing a dumbbell lift. The most accurate models will then be used to predict 20 different test cases.
1 Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. "Qualitative Activity Recognition of Weight Lifting Exercises." ACM SIGCHI, 2013.
The dataset pml-training.csv will be read and split 70/30 into training and testing datasets, labeled trn and tst respectively. trn will be used to build and tune the models; tst will be used to estimate each model's predictive performance. Variables in trn and tst are listed in Backup Material. classe (a factor) is the response.
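The chunk that reads the raw files is not shown in the report; a minimal sketch, assuming the two CSVs named in the text sit in the working directory (the object names data and final.test match the later chunks):

library(caret); library(randomForest) # modeling
library(corrplot); library(gridExtra) # correlation map; plot grids
library(e1071); library(htmlTable) # skewness(); HTML tables
data <- read.csv("pml-training.csv") # full dataset, split 70/30 below
final.test <- read.csv("pml-testing.csv") # 20 held-out cases for the final predictions

Note that the cleaning chunk below tests for empty strings (x==""), so blanks must be read as-is rather than converted to NA, which is read.csv()'s default behavior.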
The dataset includes summary statistics calculated at measurement-window boundaries; these columns are very sparse and/or contain high proportions of missing values and/or error values ("#DIV/0!"). We also note that columns 1:7 contain an observation id, subject name, and timestamp information. The information content of all these variables is very low, so they will be discarded.
div0 <- apply(data, 2, function(x) sum(x=="#DIV/0!", na.rm = T)>0) # flag columns containing "#DIV/0!" errors
blank <- apply(data, 2, function(x) sum(x=="", na.rm = T)>0) # flag columns containing blank entries
NAs <- apply(data, 2, function(x) sum(is.na(x))>0 ) # flag columns containing NAs
hiValue <- c(rep(TRUE, 7), rep(FALSE, length(names(data))-7)) # flag id / name / timestamp columns 1:7
elim <- names(data[, ( div0|blank|NAs|hiValue )])
elimVar <- c(); nC <- 5 # table of discarded variables for Backup Material
names <- c(paste0("` ", elim, " `"), rep("...",3)); nV <- length(names)
for(i in 1:nC) elimVar <- cbind(elimVar, names[seq(1,nV,nV/nC)[i] : (seq(1,nV,nV/nC)+(nV/nC-1))[i]])
data <- data[, !( div0|blank|NAs|hiValue )] # eliminate variables listed above
allVars <- names(data) # list of all variables
allPreds <- allVars[-grep("classe", allVars)] # list of all predictors
nzr <- nearZeroVar(data, saveMetrics = T)
if(!any(is.na(data)) & !any(nzr$zeroVar) & !any(nzr$nzv)) # TRUE only if no NAs and no (near) zero variance columns remain
cat("No missing values, no zero / near zero variance predictors remaining after eliminations")
## No missing values, no zero / near zero variance predictors remaining after eliminations
Summary statistics have been eliminated, except for four total acceleration variables. The remaining predictors represent raw numeric data from sensors attached to each subject's forearm, arm, and belt, and to the dumbbell. The 53 remaining and 107 discarded variables are presented in the Backup Material section. Per the analysis above, we will not need to consider zero or near-zero variance predictors in our model development.
A 70/30 non-overlapping split of the dataset into trn and tst is performed with createDataPartition().
set.seed(1)
inTrain <- createDataPartition(y = data$classe, p = 0.7, list = FALSE)
trn <- data[inTrain, ] # 70% to trn
tst <- data[-inTrain, ] # 30% to tst
cat("trn:", dim(trn)[1], "observations by", dim(trn)[2], "variables; tst:",
dim(tst)[1], "observations by", dim(tst)[2], "variables")
## trn: 13737 observations by 53 variables; tst: 5885 observations by 53 variables
The author fit five models to this supervised classification problem: (1) a CART decision tree; (2) a generalized boosted model; (3) a support vector machine; (4) a random forest invoked directly with randomForest() rather than train(), with tuning parameters informed by (5), a random forest tuned within train(). Optimal tuning parameters are discussed below.
We evaluate a CART decision tree using train() with method = "rpart". The 52 predictors were centered and scaled; 19 important variables were identified, and a scree plot of these is presented in Backup Material. Cross-validation parameters used: 10 folds, repeated 10 times.
set.seed(1) # decision tree model
dTreeModel <- train(classe ~ . , method = "rpart", data=trn, preProcess=c("center", "scale"),
trControl=trainControl(method="repeatedcv", repeats=10)) # repeated 10-fold CV per the text
dTreeModelPred <- predict(dTreeModel, newdata=tst, na.action = na.pass) # predict with tst
cmDT <- confusionMatrix(dTreeModelPred, tst$classe)
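For a quick look at the fitted tree itself, one could plot the final model (an optional illustration, not part of the original analysis; assumes the rattle package is installed):

library(rattle) # provides fancyRpartPlot()
fancyRpartPlot(dTreeModel$finalModel, sub="") # render the final CART tree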
We now evaluate a Generalized Boosted Model tuned with train(). After some experimentation, the best tune was n.trees=150, interaction.depth=3, shrinkage=0.1. Cross-validation parameters: 10 folds, repeated 10 times.
set.seed(1) # Generalized Boosted Model
gbModel <- train(classe ~ . , data=trn, method='gbm', metric='Accuracy', preProcess=c("center", "scale"),
verbose=FALSE, trControl=trainControl(method="repeatedcv", repeats=10))
gbModelPred <- predict(gbModel, newdata=tst) # prediction with testing dataset
cmGB <- confusionMatrix(gbModelPred, tst$classe)
We now evaluate a support vector machine with a radial basis function kernel, tuned via train() with method="svmRadial". Fine tuning from multiple preliminary svm fits, using tuneLength=12 and sigma / C ranges in expand.grid(), identified the tuning parameters used in the calculation below. We guard against over-fitting with cross-validation (10-fold, repeated 10 times).
set.seed(1) # support vector machines model
gridSVM <- expand.grid(sigma=0.01228219, C=513)
svmModel <- train(classe ~ . , data = trn, method = "svmRadial", preProc = c("center", "scale"),
tuneGrid = gridSVM, trControl = trainControl(method = "repeatedcv", repeats = 10))
svmModelPred <- predict(svmModel, newdata=tst) # prediction with testing dataset
cmSVM <- confusionMatrix(svmModelPred, tst$classe)
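The preliminary fits that located sigma=0.01228219 and C=513 are not shown; a sketch of that kind of search (the grid values here are illustrative, not the author's):

coarseSVM <- expand.grid(sigma=c(0.005, 0.01, 0.02), C=2^(7:10)) # coarse sigma / C grid
svmSearch <- train(classe ~ . , data=trn, method="svmRadial", preProc=c("center", "scale"),
tuneGrid=coarseSVM, trControl=trainControl(method="cv", number=5))
svmSearch$bestTune # best sigma / C pair; refine the grid around it and repeat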
Two random forest models were created. The first (RF1) was invoked with randomForest() rather than train(), and performs quite well with near-default parameter values; its tuning was informed by experimentation on RF2, specifically using expand.grid() to hunt for ntree (number of trees) and mtry (the number of variables randomly sampled at each split). Model features common to both RF1 and RF2 are ntree=500, mtry=8, and no pre-processing. In addition, we compensated for over-fitting on RF2 with cross-validation (10-fold, repeated 10 times).
# random forest model 1
set.seed(1)
RF1 <- randomForest(classe ~ ., trn, mtry=8, importance=T)
RF1Pred <- predict(RF1, newdata=tst)
cmRF1 <- confusionMatrix(RF1Pred, tst$classe)
# random forest model 2
grid <- expand.grid(.mtry=8) # Grid Search for best tune mtry value
RF2 <- train(classe ~ . , data=trn, method="rf", metric='Accuracy', tuneGrid=grid, ntree=500,
trControl=trainControl(method="repeatedcv", number=10, repeats=10, search="grid"))
RF2Pred <- predict(RF2, newdata=tst)
cmRF2 <- confusionMatrix(RF2Pred, tst$classe)
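The mtry hunt mentioned above can be sketched as follows (candidate values are illustrative; caret's rf method tunes mtry while ntree is passed through fixed, and sqrt(52) ≈ 7 is the usual classification default):

gridHunt <- expand.grid(.mtry=c(2, 5, 8, 14, 26)) # candidates around the default
rfSearch <- train(classe ~ . , data=trn, method="rf", ntree=500, tuneGrid=gridHunt,
trControl=trainControl(method="cv", number=5))
rfSearch$bestTune # mtry=8 per the text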
# table of 95% confidence intervals for model accuracy
accuracyTbl <- round(100*cmDT$overall[3:4],1)
accuracyTbl <- rbind(accuracyTbl, round(100*cmGB$overall[3:4],1))
accuracyTbl <- rbind(accuracyTbl, round(100*cmSVM$overall[3:4],1))
accuracyTbl <- rbind(accuracyTbl, round(100*cmRF1$overall[3:4],1))
accuracyTbl <- rbind(accuracyTbl, round(100*cmRF2$overall[3:4],1))
row.names(accuracyTbl) <- c('Decision Tree', 'Generalized Boosted', 'Support Vector Machine',
'RandomForest 1', 'RandomForest 2')
htmlTable(accuracyTbl, caption='**Comparisons of model accuracy, 95% confidence intervals**')
**Comparisons of model accuracy, 95% confidence intervals**

| | AccuracyLower (%) | AccuracyUpper (%) |
|---|---|---|
| Decision Tree | 47.8 | 50.4 |
| Generalized Boosted | 95.9 | 96.8 |
| Support Vector Machine | 99.3 | 99.7 |
| RandomForest 1 | 99.4 | 99.8 |
| RandomForest 2 | 99.5 | 99.8 |
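The interval bounds are caret's exact binomial confidence limits on accuracy; the RF2 row, for example, can be recovered directly from its confusion matrix (assuming the objects above):

binom.test(sum(diag(cmRF2$table)), sum(cmRF2$table))$conf.int # matches the RF2 row (as proportions)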
The last four models listed above were the highest performing of the five evaluated. For those four, we continue with a comparison of confusion matrices, out-of-box (training-set) error rates, and out-of-sample (test-set) error rates. The confusion matrices are combined below.
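Both error rates are misclassification fractions computed from a confusion matrix \(C\); they differ only in the observations used: the training set for the out-of-box (OOB) rate (for the random forests, the out-of-bag confusion matrix plays this role) and the held-out test set for the out-of-sample (OOS) rate:

\[ \text{error rate} = 1 - \frac{\sum_{i} C_{ii}}{\sum_{i,j} C_{ij}} \]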
# concatenate all confusion matrices
labs <- paste0("| ", rownames(cmSVM$table)); aln <- paste0(rep('lccccc',4))
hdr <- paste0('..',substring(labs, 3),'..')
tbl <- cbind(labs, cmGB$table, labs, cmSVM$table, labs, cmRF1$table, labs, cmRF2$table)
htmlTable(tbl, header=c('[ GB ]', hdr,'[ SVM ]', hdr, '[ RF1 ]', hdr,'[ RF2 ]', hdr),
rnames=F, align=aln, header.align=aln, caption="**Confusion Matrices, Four Models**")
**Confusion Matrices, Four Models** (rows: prediction; columns: reference)

| [ GB ] | A | B | C | D | E | [ SVM ] | A | B | C | D | E | [ RF1 ] | A | B | C | D | E | [ RF2 ] | A | B | C | D | E |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1640 | 39 | 0 | 1 | 1 | A | 1673 | 3 | 0 | 0 | 0 | A | 1671 | 2 | 0 | 0 | 0 | A | 1671 | 2 | 0 | 0 | 0 |
| B | 20 | 1075 | 39 | 3 | 13 | B | 1 | 1131 | 7 | 0 | 0 | B | 0 | 1135 | 7 | 0 | 0 | B | 2 | 1135 | 6 | 0 | 0 |
| C | 9 | 24 | 976 | 22 | 9 | C | 0 | 4 | 1016 | 5 | 0 | C | 1 | 2 | 1018 | 4 | 0 | C | 0 | 2 | 1019 | 4 | 0 |
| D | 4 | 1 | 9 | 929 | 7 | D | 0 | 0 | 3 | 956 | 2 | D | 0 | 0 | 1 | 959 | 3 | D | 0 | 0 | 1 | 959 | 2 |
| E | 1 | 0 | 2 | 9 | 1052 | E | 0 | 1 | 0 | 3 | 1080 | E | 2 | 0 | 0 | 1 | 1079 | E | 1 | 0 | 0 | 1 | 1080 |
# OOB is ratio of misclassified to total training observations in confusion matrix
sumTrainObserv <- dim(trn)[1] # total training observations
GBConfusion <- confusionMatrix(predict(gbModel, newdata=trn), trn$classe) # Generalized Boosted Model
sumMisclassified.GB <- sumTrainObserv-sum(diag(GBConfusion$table))
OOB.GB <- round(sumMisclassified.GB / sumTrainObserv, 5)
SVMConfusion <- confusionMatrix(predict(svmModel, newdata=trn), trn$classe) # Support Vector Machine
sumMisclassified.SVM <- sumTrainObserv-sum(diag(SVMConfusion$table))
OOB.SVM <- round(sumMisclassified.SVM / sumTrainObserv, 5)
RF1Confusion <- with(RF1, confusion[1:nrow(confusion),1:nrow(confusion)]) # RF1
sumMisclassified.RF1 <- sumTrainObserv-sum(diag(RF1Confusion))
OOB.RF1 <- round(sumMisclassified.RF1 / sumTrainObserv, 5)
sumMisclassified.RF2 <- sumTrainObserv-sum(diag(RF2$finalModel$confusion)) # RF2
OOB.RF2 <- round(sumMisclassified.RF2 / sumTrainObserv, 5)
# comparison table
htmlTable(c(OOB.GB, OOB.SVM, OOB.RF1, OOB.RF2),
header=c('| Generalized Boost |', '| Support Vector Machine |', '| RandForest1 |', '| RandForest2 |'),
caption='**Out-of-Box Error Rate Comparison, Four Models**', rnames=F)
**Out-of-Box Error Rate Comparison, Four Models**

| Generalized Boost | Support Vector Machine | RandForest1 | RandForest2 |
|---|---|---|---|
| 0.02701 | 0.00051 | 0.00517 | 0.00531 |
Refer to the table above. The two random forest methods yielded very accurate fits with similar results, but both were outperformed on this training set by the support vector machine.
# OOS is (1 - Accuracy) of fit to testing set observations
OOS.GB <- round(1-cmGB$overall[1],5); OOS.SVM <- round(1-cmSVM$overall[1],5)
OOS.RF1 <- round(1-cmRF1$overall[1],5); OOS.RF2 <- round(1-cmRF2$overall[1],5)
htmlTable(c(OOS.GB, OOS.SVM, OOS.RF1, OOS.RF2),
header=c('| Generalized Boost |', '| Support Vector Machine |', '| RandForest1 |', '| RandForest2 |'),
caption='**Out-of-Sample Error Rate Comparison, Four Models**', rnames=F)
**Out-of-Sample Error Rate Comparison, Four Models**

| Generalized Boost | Support Vector Machine | RandForest1 | RandForest2 |
|---|---|---|---|
| 0.03619 | 0.00493 | 0.00391 | 0.00357 |
Refer to the table above. The support vector machine led on OOB error rate, but when out-of-sample (OOS) rates are compared, the random forest models pull ahead; evaluating on the test dataset thus reverses the ranking slightly. Again, both random forest models deliver similar results, with the trained random forest (RF2) slightly outperforming.
The pml-testing.csv dataset, read into final.test, includes 20 observations that will be used to evaluate predictive performance.
gbPred.FT <- predict(gbModel, newdata=final.test, na.action = na.pass) # evaluate on `final.test`
svmPred.FT <- predict(svmModel, newdata=final.test)
RF1Pred.FT <- predict(RF1, newdata=final.test)
RF2Pred.FT <- predict(RF2, newdata=final.test)
eq <- ifelse(gbPred.FT==svmPred.FT & svmPred.FT==RF1Pred.FT & RF1Pred.FT==RF2Pred.FT, 'T', 'F')
space <- rep('...' , 20)
htmlTable(t(data.frame('GB'=gbPred.FT, 'SVM'=svmPred.FT, 'RF1'=RF1Pred.FT, 'RF2'=RF2Pred.FT,
space, 'Equal'=eq, fix.empty.names=F)), header=c(paste0('Ob','0',(1:9)), paste0('Ob',(10:20))))
| | Ob01 | Ob02 | Ob03 | Ob04 | Ob05 | Ob06 | Ob07 | Ob08 | Ob09 | Ob10 | Ob11 | Ob12 | Ob13 | Ob14 | Ob15 | Ob16 | Ob17 | Ob18 | Ob19 | Ob20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GB | B | A | B | A | A | E | D | B | A | A | B | C | B | A | E | E | A | B | B | B |
| SVM | B | A | B | A | A | E | D | B | A | A | B | C | B | A | E | E | A | B | B | B |
| RF1 | B | A | B | A | A | E | D | B | A | A | B | C | B | A | E | E | A | B | B | B |
| RF2 | B | A | B | A | A | E | D | B | A | A | B | C | B | A | E | E | A | B | B | B |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | … | |
| Equal | T | T | T | T | T | T | T | T | T | T | T | T | T | T | T | T | T | T | T | T |
Refer to the table above. Each of the four highest-performing models yields identical predictions for the 20-observation test dataset.
Exploratory Analysis referred the reader to the correlation plots in this section. A correlation plot spanning all training-dataset predictors appears below; per-location correlation tables follow.
# correlation maps for ALL training predictors, plot order: first principal component order
corrplot(cor(trn[, allPreds]), tl.cex=.8, tl.col="firebrick", order="FPC",
method="ellipse", type="full", sig.level=0.05, tl.pos="lt", cl.pos="n")
lst <- c('_forearm', '_arm', '_belt', '_dumbbell')
l <- lapply(lst, function(x) {
tmp <- cor(trn[, grepl(x, allVars) & !grepl('total_accel', allVars) & !grepl('classe', allVars)])
names.tmp <- gsub(x, "", row.names(tmp))
diag(tmp) <- 0 # zero self correlations
for(i in 1:dim(tmp)[2]) tmp[,i][abs(tmp[,i]) < .5] <- 0 # select corr values > threshold
tmp <- apply(round(tmp,2), 2, as.character) # convert to char class
for(i in 1:dim(tmp)[2]) tmp[,i][tmp[,i] =="0"] <- '' # blank out values < threshold
colnames(tmp) <- row.names(tmp) <- names.tmp
return(noquote(tmp))
})
names.tmp <- row.names(l[[1]]) # recover variable names
df <- data.frame(matrix(unlist(l), nrow=dim(l[[1]])[1], byrow=T, dimnames=list(names.tmp,rep(names.tmp,4))),
check.names=F, stringsAsFactors=F)
for(i in 1:length(l)) print(htmlTable(df[,(1+12*(i-1)):(12*i)],
caption=paste0('***Correlations > 0.5 by Sensor Location: ',
gsub('_','',lst[i]),'***'), header=paste0('`', names.tmp, '`')))
***Correlations > 0.5 by Sensor Location: forearm***

| | roll | pitch | yaw | gyros_x | gyros_y | gyros_z | accel_x | accel_y | accel_z | magnet_x | magnet_y | magnet_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| roll | ||||||||||||
| pitch | 0.88 | |||||||||||
| yaw | -0.55 | |||||||||||
| gyros_x | ||||||||||||
| gyros_y | -0.92 | |||||||||||
| gyros_z | 0.55 | 0.78 | ||||||||||
| accel_x | 0.81 | 0.92 | -0.99 | |||||||||
| accel_y | 0.53 | |||||||||||
| accel_z | -0.99 | -0.78 | -0.93 | |||||||||
| magnet_x | 0.72 | |||||||||||
| magnet_y | -0.73 | 0.69 | ||||||||||
| magnet_z | 0.57 | 0.85 | 0.68 | -0.66 | 0.53 | |||||||
***Correlations > 0.5 by Sensor Location: arm***

| | roll | pitch | yaw | gyros_x | gyros_y | gyros_z | accel_x | accel_y | accel_z | magnet_x | magnet_y | magnet_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| roll | ||||||||||||
| pitch | 0.88 | |||||||||||
| yaw | 0.68 | |||||||||||
| gyros_x | -0.53 | 0.56 | 0.53 | |||||||||
| gyros_y | -0.59 | |||||||||||
| gyros_z | -0.53 | 0.82 | -0.79 | -0.58 | ||||||||
| accel_x | -0.7 | -0.97 | -0.89 | |||||||||
| accel_y | -0.5 | |||||||||||
| accel_z | -0.89 | 0.73 | 0.89 | |||||||||
| magnet_x | 0.52 | 0.81 | 0.57 | |||||||||
| magnet_y | -0.99 | 0.69 | ||||||||||
| magnet_z | 0.58 | 0.53 | -0.77 | |||||||||
***Correlations > 0.5 by Sensor Location: belt***

| | roll | pitch | yaw | gyros_x | gyros_y | gyros_z | accel_x | accel_y | accel_z | magnet_x | magnet_y | magnet_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| roll | ||||||||||||
| pitch | 0.68 | |||||||||||
| yaw | 0.77 | |||||||||||
| gyros_x | ||||||||||||
| gyros_y | 0.82 | -0.7 | -0.66 | |||||||||
| gyros_z | 0.56 | -0.7 | 0.55 | -0.79 | 0.82 | |||||||
| accel_x | 0.81 | -0.7 | 0.53 | 0.71 | 0.6 | -0.78 | 0.73 | |||||
| accel_y | -0.97 | 0.71 | 0.89 | |||||||||
| accel_z | 0.78 | |||||||||||
| magnet_x | 0.52 | 0.54 | -0.54 | 0.85 | 0.58 | |||||||
| magnet_y | 0.81 | 0.54 | 0.68 | |||||||||
| magnet_z | -0.77 | |||||||||||
***Correlations > 0.5 by Sensor Location: dumbbell***

| | roll | pitch | yaw | gyros_x | gyros_y | gyros_z | accel_x | accel_y | accel_z | magnet_x | magnet_y | magnet_z |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| roll | ||||||||||||
| pitch | 0.77 | |||||||||||
| yaw | -0.55 | |||||||||||
| gyros_x | -0.92 | |||||||||||
| gyros_y | -0.59 | |||||||||||
| gyros_z | 0.53 | -0.66 | 0.78 | -0.58 | 0.82 | |||||||
| accel_x | ||||||||||||
| accel_y | 0.92 | 0.6 | -0.5 | -0.93 | ||||||||
| accel_z | 0.78 | |||||||||||
| magnet_x | -0.73 | -0.99 | ||||||||||
| magnet_y | 0.72 | -0.54 | -0.66 | |||||||||
| magnet_z | ||||||||||||
what <- (grepl("accel_arm", allPreds)|grepl("magnet_arm", allPreds))&!grepl("total", allPreds)
featurePlot(x=trn[, allPreds[what]], y=trn$classe, plot = "pairs", auto.key = list(columns = 5), # index by predictor name: `what` is a logical over allPreds, not over trn's 53 columns
pch='.', alpha=.5, cex=.2, aspect=.75, main="Exploratory Scatterplots - Arm Accelerometer, Magnetometer Sensors")
locs <- c("_arm", "_forearm", "_belt", "_dumbbell") # sensor locations
plotIt <- function(feat) featurePlot(x=trn[ , allPreds[feat]], y=trn$classe, "box", label=NULL) # feat is a logical over allPreds
featGrep <- function(loc, f1, f2, f3, not)
grepl(loc,allPreds)&!grepl(not,allPreds)&(grepl(f1,allPreds)|grepl(f2,allPreds)|grepl(f3,allPreds))
p1 <- plotIt(featGrep(locs[1], "magnet_", "accel_","accel_","total_accel_"))
p2 <- plotIt(featGrep(locs[2], "magnet_", "accel_","accel_","total_accel_"))
p3 <- plotIt(featGrep(locs[3], "magnet_", "accel_","accel_","total_accel_"))
p4 <- plotIt(featGrep(locs[4], "magnet_", "accel_","accel_","total_accel_"))
p5 <- plotIt(grepl("total_accel_", allPreds))
grid.arrange(p1, p2, p3, p4, p5, heights=c(3.75,3.75,3))
p6 <- plotIt(featGrep(locs[1], "gyros_", "gyros_","gyros_","total_accel_"))
p7 <- plotIt(featGrep(locs[2], "gyros_", "gyros_","gyros_","total_accel_"))
p8 <- plotIt(featGrep(locs[3], "gyros_", "gyros_","gyros_","total_accel_"))
p9 <- plotIt(featGrep(locs[4], "gyros_", "gyros_","gyros_","total_accel_"))
grid.arrange(p6, p7, p8, p9)
p10 <- plotIt(featGrep(locs[1], "roll_", "pitch_","yaw_","total_accel_"))
p11 <- plotIt(featGrep(locs[2], "roll_", "pitch_","yaw_","total_accel_"))
p12 <- plotIt(featGrep(locs[3], "roll_", "pitch_","yaw_","total_accel_"))
p13 <- plotIt(featGrep(locs[4], "roll_", "pitch_","yaw_","total_accel_"))
grid.arrange(p10, p11, p12, p13)
The dotted red lines on the histograms are mean references. We observe considerable skew, and some distributions are multi-modal. If the goal were an accurate simple linear regression model, the author expects these untransformed predictors would present a considerable challenge. In fact, \(R^2_{adj}\) from LM <- train(as.integer(classe) ~ .,data=trn,method="lm") is only 0.53, so 47% of the variability in the response is not accounted for by this simple linear model.
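For reference, the linear-model check quoted above can be reproduced as follows (a sketch; coercing classe with as.integer() imposes an arbitrary equal spacing on the five classes, one reason the fit is poor):

set.seed(1)
LM <- train(as.integer(classe) ~ . , data=trn, method="lm")
round(summary(LM$finalModel)$adj.r.squared, 2) # ~0.53 per the text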
skewFn <- function(x) round(abs(skewness(x)), 2) # absolute skewness per predictor
skew <- apply(trn[, allPreds], 2, skewFn)
skew <- skew[order(-skew)]
predSkew <- data.frame(Predictor=paste0("` ", names(skew), " `"), Skew=skew)
row.names(predSkew) <- NULL; n <- dim(predSkew)[1]
htmlTable(cbind(predSkew[1:(n/4),], predSkew[(n/4+1):(2*n/4),],
predSkew[(2*n/4+1):(3*n/4),], predSkew[(3*n/4+1):n,]), align.header=rep('l',8),
rnames=F, align=rep('l',8), caption="***Sorted Predictor Skew***")
***Sorted Predictor Skew***

| Predictor | Skew | Predictor | Skew | Predictor | Skew | Predictor | Skew |
|---|---|---|---|---|---|---|---|
| `gyros_dumbbell_z` | 114.78 | `accel_belt_x` | 0.96 | `magnet_arm_y` | 0.49 | `gyros_arm_z` | 0.17 |
| `gyros_dumbbell_x` | 108.99 | `yaw_belt` | 0.9 | `accel_dumbbell_x` | 0.47 | `accel_belt_y` | 0.15 |
| `gyros_forearm_z` | 102.34 | `magnet_dumbbell_z` | 0.88 | `roll_forearm` | 0.46 | `magnet_arm_x` | 0.14 |
| `gyros_forearm_y` | 54.58 | `accel_arm_z` | 0.86 | `accel_forearm_z` | 0.44 | `gyros_belt_z` | 0.11 |
| `gyros_dumbbell_y` | 36.04 | `magnet_forearm_y` | 0.74 | `accel_arm_x` | 0.36 | `gyros_arm_y` | 0.11 |
| `gyros_forearm_x` | 2.65 | `roll_dumbbell` | 0.73 | `accel_dumbbell_y` | 0.36 | `yaw_arm` | 0.1 |
| `magnet_belt_y` | 2.22 | `accel_forearm_y` | 0.65 | `gyros_arm_x` | 0.29 | `accel_dumbbell_z` | 0.1 |
| `magnet_dumbbell_y` | 1.74 | `magnet_forearm_x` | 0.65 | `yaw_forearm` | 0.27 | `gyros_belt_y` | 0.09 |
| `magnet_dumbbell_x` | 1.69 | `total_accel_dumbbell` | 0.61 | `accel_forearm_x` | 0.23 | `accel_arm_y` | 0.09 |
| `magnet_belt_x` | 1.44 | `total_accel_forearm` | 0.57 | `yaw_dumbbell` | 0.22 | `total_accel_arm` | 0.07 |
| `magnet_forearm_z` | 1.22 | `gyros_belt_x` | 0.54 | `magnet_belt_z` | 0.19 | `total_accel_belt` | 0.03 |
| `magnet_arm_z` | 1.15 | `pitch_dumbbell` | 0.52 | `pitch_arm` | 0.19 | `roll_belt` | 0.02 |
| `pitch_belt` | 1 | `pitch_forearm` | 0.52 | `roll_arm` | 0.18 | `accel_belt_z` | 0.02 |
***Predictors, Skewness > 1.0, cube-root transformed***
skewThr <- 1.0 # "skewed" threshold
old.par <- par(no.readonly=T); par(mfrow=c(4,3),mar=c(3,2,1,1))
for(i in 1 : sum(skew > skewThr)) {
tmp <- sign(trn[, names(skew[i])]) * abs(trn[, names(skew[i])])^(1/3) # cube root transform
hist(tmp, main=names(skew[i]), breaks=60, xlab='')
abline(v=mean(tmp, na.rm=T), lty=3, lwd=3, col="firebrick")
}
par(old.par)
A scree plot identifies the 19 variables that explain most of the variability for this model.
tblDT <- data.frame(var.imp=dTreeModel$finalModel$variable.importance)
old.par <- par(no.readonly=T); par(las=3, mar=c(7,4,2,1))
plot(x=seq_len(nrow(tblDT)), y=tblDT$var.imp, xaxt='n', type='s', ylim=c(0,1200),
xlab='', main='Scree Plot: Decision Tree Variables', ylab='Variable Importance')
axis(side=1, cex.axis=0.8, labels=rownames(tblDT), at=seq_len(nrow(tblDT)))
abline(h=0, lty=3);par(old.par)
The variable elimination process described in the Exploratory Analysis section yielded the following variables remaining for model development, including the outcome classe.
tbl<-c(); nC <- 6; names <- c(paste0("` ", allVars[order(allVars)], " `"), rep("...",1))
nV <- length(names)
for(i in 1:nC)
tbl <- cbind(tbl, c(names[seq(1, nV, nV/nC)[i] : (seq(1, nV, nV/nC)+(nV/nC-1))[i]]))
htmlTable::htmlTable(tbl, align='llllll', header=c(rep("",6)),
caption="***Remaining variables in dataset (alphabetical order)***")
***Remaining variables in dataset (alphabetical order)***

| | | | | | |
|---|---|---|---|---|---|
| `accel_arm_x` | `accel_forearm_x` | `gyros_belt_z` | `magnet_arm_z` | `magnet_forearm_z` | `total_accel_arm` |
| `accel_arm_y` | `accel_forearm_y` | `gyros_dumbbell_x` | `magnet_belt_x` | `pitch_arm` | `total_accel_belt` |
| `accel_arm_z` | `accel_forearm_z` | `gyros_dumbbell_y` | `magnet_belt_y` | `pitch_belt` | `total_accel_dumbbell` |
| `accel_belt_x` | `classe` | `gyros_dumbbell_z` | `magnet_belt_z` | `pitch_dumbbell` | `total_accel_forearm` |
| `accel_belt_y` | `gyros_arm_x` | `gyros_forearm_x` | `magnet_dumbbell_x` | `pitch_forearm` | `yaw_arm` |
| `accel_belt_z` | `gyros_arm_y` | `gyros_forearm_y` | `magnet_dumbbell_y` | `roll_arm` | `yaw_belt` |
| `accel_dumbbell_x` | `gyros_arm_z` | `gyros_forearm_z` | `magnet_dumbbell_z` | `roll_belt` | `yaw_dumbbell` |
| `accel_dumbbell_y` | `gyros_belt_x` | `magnet_arm_x` | `magnet_forearm_x` | `roll_dumbbell` | `yaw_forearm` |
| `accel_dumbbell_z` | `gyros_belt_y` | `magnet_arm_y` | `magnet_forearm_y` | `roll_forearm` | … |
The Exploratory Analysis section discussed discarded variables and referred the reader to this list. Variable names, including misspellings such as max_picth_* and kurtosis_picth_*, are verbatim from the source dataset.
htmlTable::htmlTable(elimVar, align='lllll', header=c(rep("",5)),
caption=paste0("***Discarded ", length(elim), " variables: high proportion
missing values or NAs, #DIV/0! errors, otherwise low information content***"))
***Discarded 107 variables: high proportion missing values or NAs, #DIV/0! errors, otherwise low information content***

| | | | | |
|---|---|---|---|---|
| `X` | `var_total_accel_belt` | `kurtosis_yaw_arm` | `min_roll_dumbbell` | `max_roll_forearm` |
| `user_name` | `avg_roll_belt` | `skewness_roll_arm` | `min_pitch_dumbbell` | `max_picth_forearm` |
| `raw_timestamp_part_1` | `stddev_roll_belt` | `skewness_pitch_arm` | `min_yaw_dumbbell` | `max_yaw_forearm` |
| `raw_timestamp_part_2` | `var_roll_belt` | `skewness_yaw_arm` | `amplitude_roll_dumbbell` | `min_roll_forearm` |
| `cvtd_timestamp` | `avg_pitch_belt` | `max_roll_arm` | `amplitude_pitch_dumbbell` | `min_pitch_forearm` |
| `new_window` | `stddev_pitch_belt` | `max_picth_arm` | `amplitude_yaw_dumbbell` | `min_yaw_forearm` |
| `num_window` | `var_pitch_belt` | `max_yaw_arm` | `var_accel_dumbbell` | `amplitude_roll_forearm` |
| `kurtosis_roll_belt` | `avg_yaw_belt` | `min_roll_arm` | `avg_roll_dumbbell` | `amplitude_pitch_forearm` |
| `kurtosis_picth_belt` | `stddev_yaw_belt` | `min_pitch_arm` | `stddev_roll_dumbbell` | `amplitude_yaw_forearm` |
| `kurtosis_yaw_belt` | `var_yaw_belt` | `min_yaw_arm` | `var_roll_dumbbell` | `var_accel_forearm` |
| `skewness_roll_belt` | `var_accel_arm` | `amplitude_roll_arm` | `avg_pitch_dumbbell` | `avg_roll_forearm` |
| `skewness_roll_belt.1` | `avg_roll_arm` | `amplitude_pitch_arm` | `stddev_pitch_dumbbell` | `stddev_roll_forearm` |
| `skewness_yaw_belt` | `stddev_roll_arm` | `amplitude_yaw_arm` | `var_pitch_dumbbell` | `var_roll_forearm` |
| `max_roll_belt` | `var_roll_arm` | `kurtosis_roll_dumbbell` | `avg_yaw_dumbbell` | `avg_pitch_forearm` |
| `max_picth_belt` | `avg_pitch_arm` | `kurtosis_picth_dumbbell` | `stddev_yaw_dumbbell` | `stddev_pitch_forearm` |
| `max_yaw_belt` | `stddev_pitch_arm` | `kurtosis_yaw_dumbbell` | `var_yaw_dumbbell` | `var_pitch_forearm` |
| `min_roll_belt` | `var_pitch_arm` | `skewness_roll_dumbbell` | `kurtosis_roll_forearm` | `avg_yaw_forearm` |
| `min_pitch_belt` | `avg_yaw_arm` | `skewness_pitch_dumbbell` | `kurtosis_picth_forearm` | `stddev_yaw_forearm` |
| `min_yaw_belt` | `stddev_yaw_arm` | `skewness_yaw_dumbbell` | `kurtosis_yaw_forearm` | `var_yaw_forearm` |
| `amplitude_roll_belt` | `var_yaw_arm` | `max_roll_dumbbell` | `skewness_roll_forearm` | … |
| `amplitude_pitch_belt` | `kurtosis_roll_arm` | `max_picth_dumbbell` | `skewness_pitch_forearm` | … |
| `amplitude_yaw_belt` | `kurtosis_picth_arm` | `max_yaw_dumbbell` | `skewness_yaw_forearm` | … |
Comments on Exploratory Graphs, Sensor Data Correlation, Predictor Skewness
The reader is referred to Exploratory Graphs in Backup Material to explore characteristics of the training dataset. The complexity of the associations between response and predictors is displayed in Exploratory Scatterplots - Arm Accelerometer, Magnetometer Sensors, which plots the classe response against the arm-mounted magnetometer and accelerometer sensors. Refer also to the lattice boxplots subgrouped by sensor location (forearm, arm, belt, dumbbell).

Is there significant collinearity among the remaining predictors? Refer to the sensor data correlation tables (correlations > 0.5) by location in Backup Material. As shown, mild to strong positive and negative correlations exist among sensor signals from the same location. In spite of the implied collinearity, the author has chosen not to optimize further, and to err on the side of over-production of features so as to maximize the ranges presented to the machine learning algorithms discussed in this report.
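Had pruning been desired instead, caret can flag the collinear members directly; a sketch (the cutoff value is illustrative):

highCorr <- findCorrelation(cor(trn[, allPreds]), cutoff=0.75) # indices of candidates to drop
allPreds[highCorr] # names of highly inter-correlated predictors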
As shown in the Skewness Analysis in Backup Material, twelve of the 52 predictors exhibit skewness > 1.0. Some are so skewed that a cube-root transform (which reduces skew yet preserves sign) was applied in order to display their distributions meaningfully. The transform was applied only for plotting; we rely on caret pre-processing transforms to develop the models.