Machine-Learning-based Assessment of the quality of weight-lifting exercises

INTRODUCTION

The rapid diffusion of sensors able to record physical parameters associated with motion (e.g. accelerometers), in dedicated devices and more importantly in general consumer electronics available/used by a broader population has sparked a great interest in developing applications taking advantage of these motion-related data. One area of particular interest concerns fitness-related activities.

This report summarizes the results of the development, and testing, of a Machine Learning model able to recognize the quality of a simple weight lifting exercise, namely whether or not it was performed appropropriately (and hence safely and effectively).

We used the dataset put together by the research group on Human Activity Recognition at the PUC of Rio de Janeiro.

SUMMARY OF RESULTS

We tested three types of ML algorithms, all tree-based methods: CART trees, boosted trees, and random forest.

The first two methods failed to yield high quality results. This may have been caused by less than ideal choice of parameters, although in most cases we run them with the default values from caret, which are expected to be reasonable for decent results.

Random forest models produced high quality results, with accuracies exceeding 99%, both in the built-in Out Of the Bag resampling, and on our separate testing subset.

Beside its clearly better performance, the choice of a random forest as an ensemble method is supported by its ability to handle multi-class problems.

We ran random forest models with three different internal cross-validation setups (implemented through the trainControl() function of caret): * 4-fold Cross-Validation, * bootstrap, and * Leave Group Out Cross Validation.

As noted, the trained models achieved exceptional accuracy in the ability of predicting the outcome variable classe, not only when tested against the 20-entries project benchmark, but more importantly when tested against the portion (25%) of the full dataset that we set aside for __validation_.

The results of a random forest model are not easily interpretable, even in presence of physically/motion based predictors. Nevertheless, as illustrated in some example plots, the data contain fairly clear pattern and differences between categories of exercise quality, that can be related to the slight differences in the motion of the body and weight dumbbell, and that are apparently very well picked out by the algorithm.

THE DATA SET

The data for the project were made available from the Coursera ML course webpage. Two separate sets were posted:

a training dataset. This set comprises a little over 16,000 entries for 160 variables.
a testing dataset, to be used as a final project benchmark, comprising 20 “anonymized” entries.

Structure

The dataset comprises 160 variables:

152 actual predictors, i.e. the sensor data.
1 is the quality class of the exercise (classe, taking values A, B, C, D, E).
7 are auxiliary variables:
- the user name (user_name).
- 3 time stamp related variables: raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp.
- 2 exercise window markers/counters: new_window, num_window.

The sensor data

As described in the paper by Velloso et al. [REF], four inertial measurement units (IMU) where setup, placed
on belt, arm, forearm, dumbbell. Each sensor measured 3-axes acceleration, gyroscope and magnetometer data at high cadence (45 Hz). These data were processed to yield 13 timed variables for each sensor:

total acceleration.
roll, pitch, yaw angles.
x, y, z values for gyroscope, acceleleration, and magnetometer.

For instance, for the belt sensor the basic timed data are: total_accel_belt, roll_belt, pitch_belt, yaw_belt, gyros_belt_x, gyros_belt_y, gyros_belt_z, accel_belt_x, accel_belt_y, accel_belt_z, magnet_belt_x, magnet_belt_y, magnet_belt_z.

The dataset therefore comprises \(4 \times 13 = 52\) basic timed data.

In addition to these, several statistical summaries are computed and reported for each exercise window, for each sensor:

For total_accel, its variance var_accel.
For each of the three angles: avg, stddev, var, kurtosis, skewness, max, min, amplitude (\(3 \times 8\) variables).

These \(1 + 24 = 25\) statistical summaries for each sensor add another \(100\) variables to the dataset for a total of \(152\) variables.

It is worth emphasizing that the dataset presents timed and summary variables all together in one table. While this may be practically convenient, it makes this dataset un-tidy by combining variables of different nature. Fortunately the two types of variables can be easily separated on the basis of the value of the new_window auxiliary variable, which has value no for entries corresponding to timed data, and yes for their statistical summaries over each exercise window.

DATA PREPARATION

Loading

full <- read.csv("./pml-training.csv", na.strings=c("#DIV/0!","","NA"), stringsAsFactors=FALSE)
full <- add_new_variables(full)
alt.full <- tidy_df(full)

TEST <- read.csv("./pml-testing.csv", na.strings=c("#DIV/0!","","NA"), stringsAsFactors=FALSE)
alt.TEST <- tidy_df(TEST)

Cleaning/Tidying

Non-sensor variables

Some variables should be discarded because associated with very specific aspects of the experiment that should be irrelevant from the point of view of its goal, such as window flags and time stamps.
These are the excluded variables: X, user_name, new_window, num_window, cvtd_timestamp, raw_timestamp_part_1, raw_timestamp_part_2.

Beside their intrinsic irrelevance, keeping these in would likely strongly drive the results in a completely spurious and meaningless way, because for instance the algorithm may hook on the user_name or num_window.

Individual measurements vs summaries : the `new_window` variable

To the best of my understanding, the dataset combines two different kinds of observations:

single measurements of the main observables from the sensors, with some time cadence, and organized in windows, which are numbered (num_window variable).
These data have new_window == "no".
statistical summaries of the measurements of each main observable over each window.
These data have new_window == "yes", and

We restricted our analysis to the 52 variables representing individual timed measurements, discarding the summary data.

alt.full <- subset(alt.full, new_window == "no")
alt.full.good <- select_proper_vars(alt.full)
alt.TEST.good <- select_proper_vars(alt.TEST)
alt.user <- alt.full$user_name

We also filtered out variables with NA, which basically means filtering against the summary variables.

# columns without ANY NA
alt.tt <- colSums(is.na(alt.full.good)) == 0

alt.full.select <- alt.full.good[, alt.tt]
alt.TEST.select <- alt.TEST.good[, alt.tt]

Some exploratory plots

Features plotted vs. sequence index, and color coded by `user_name` and `classe`

These kind of plots shows that some of the features seem to correlated very strongly with the user, even more than with their classe, somewhat oddly.

This suggest that the training to predict the quality parameter of the weight lifting exercise (classe) that we can achieve with this dataset may not be easily generalized.

plot of chunk plots-examine2b

plot of chunk plots-examine2c

Feature vs. Feature plots with separate panels by `classe`

This second set of example plots shows that there are indeed some reasonably recognizable patterns allowing to distinguish between different classe categories.

The expectation is that the ML algorithm will be able to identify them and build on them a classification scheme.

plot of chunk plots-more_1

plot of chunk plots-more_2

ABOUT FEATURE SELECTION

Zero/low variance predictors

We checked the dataset for un-informative predictors, namely variables taking (nearly) unique values or having very little variance in their values.
The caret package provides a very convenient function to perform this quality-check, nearZeroVar().

None of the 52 features meets the criteria for exclusion on the basis of near Zero Variance.
The full results of running it on our dataset (nearZeroVar(alt.full.select, saveMetrics=TRUE)) are reported in the Appendix.

Collinearity between predictors

The presence of correlated predictor is undesirable because it can bias/mislead the modeling and in any case it may lead to run a model with an unnecessarily large(r) number of predictors. Although some ML algorithms are not negatively affected, it is generally safe to exclude correlated pr edictors.

For tree-based models it is actually recommended to clean the data set of correlated predictors because they end up sharing their overall importance, thus appearing to be less significant than they actually are.

We took advantage of the caret function findCorrelation() to identify variables whose absolute correlation value exceeds a set threshold (we chose 0.75) and obtain a list of variables to exclude selected among those with high correlation.

The actual predictors filtering was done applying this method just on the training subset (see below).

DATA SPLITTING: “NEW” TRAINING AND TESTING SUBSETS

For validation purposes we split the full dataset in two subsets:

a training subset, comprising 75% of the data.
a testing subset, comprising 25% of the data.

This training / testing split should not be confused with the original two datasets, which unfortunately are named also training and testing.

We are splitting the original training large dataset in two to be able to have an independent validation of the models, beyond what may already be done internally by some ML algorithms or by caret wrapped around them (e.g. by bootstrapping, or the built-in randomization and subsetting of random forest methods).

seed.split <- 12468
set.seed(seed.split)
i.train.alt <- createDataPartition(y = alt.full.select$classe, p=0.75, list=FALSE)

alt.training <- alt.full.select[i.train.alt, ]
alt.testing <- alt.full.select[-i.train.alt, ]

Feature selection on training / testing subsets

In the spirit of truly preserving the independence of the testing data subset, we performed the correlation-based feature reduction on the basis of the correlation between variables computed on the training subset instead of the full dataset, and applied the same variables filtering to the testing subset.

# correlation filtering done on the training subset
alt.allCorr <- cor(alt.training[, -1])
i.fC.75.alt <- findCorrelation(alt.allCorr, cutoff=0.75)

The following plot shows the correlation matrix, with variables ordered on the basis of their clustering.

corrplot(alt.allCorr, order="hclust", method="color", 
         col=color1(20), cl.length=21, tl.cex=0.8, tl.col="black", mar=c(1,1,1,0))

plot of chunk split_data-3

On the basis of their correlation, with a threshold of 0.75, these are the variables that would be excluded.

# variables to be excluded
colnames(alt.training)[i.fC.75.alt+1]
##  [1] "accel_belt_z"      "roll_belt"         "accel_belt_y"      "total_accel_belt" 
##  [5] "accel_dumbbell_z"  "accel_belt_x"      "pitch_belt"        "magnet_dumbbell_x"
##  [9] "accel_dumbbell_y"  "magnet_dumbbell_y" "accel_dumbbell_x"  "accel_arm_x"      
## [13] "accel_arm_z"       "magnet_arm_y"      "magnet_belt_y"     "accel_forearm_y"  
## [17] "gyros_arm_y"       "gyros_forearm_z"   "gyros_forearm_y"   "gyros_dumbbell_x"

# variables selection
alt.training.cut75 <- alt.training[, -(i.fC.75.alt+1)]
alt.testing.cut75 <- alt.testing[, -(i.fC.75.alt+1)]

MODELING

We tested three types of ML algorithms, all within the framework provided by caret, and all generally speaking tree-based models.

CART trees, namely rpart2.
boosted tree, namely gbm.
random forest, namely rf.

The first two methods failed to yield high quality results, in fact in some cases their performance on the testing subset was very poor.
This may have been caused by less than ideal choice of parameters, but in most cases we let the modeling run with the default values from caret, which are expected to be reasonable for decent results.
We have to acknowledge that in some cases, in particular for the gbm models, the running time turned out to be very long and the memory requirements large enough to make it impractical, and we did not pursue those models more extensively.

On the other hand random forest models produced high quality results, with accuracies exceeding 99%, both in the built-in Out Of the Bag resampling, and on our separate testing subset.

In the next three sections we illustrate the results of random forest models run with three different internal cross-validation setups, implemented through the trainControl() function of caret:

cv: Cross-Validation, 4-fold (i.e. 75%/25% splits).
boot (the default): bootstrap, 25 repeats.
’LGOCV`: Leave Group Out Cross Validation, 25 repeats, 75%/25% train/test splits of the data.

In all cases we also tried a set of values for mtry, which regulates how many predictors are selected in the random forest random subsetting of variables.

Random Forest case 1 : 4-fold CV

With mtry = 2, 6, 10, 18, 26, 34.

mtry.values <- c(2, 6, 10, 18, 26, 34)

ctrl.rf1c <- trainControl(method = "cv", number=4)

seed.rf1c <- 16790; set.seed(seed.rf1c)
mod.alt.rf1c <- train(x = alt.training.cut75[, -1], 
                      y = alt.training.cut75$classe, 
                      method = "rf", 
                      trControl = ctrl.rf1c,
                      tuneGrid = data.frame(mtry = mtry.values),
                      importance = TRUE, 
                      proximity = TRUE)

Fit Summary

mod.alt.rf1c
## Random Forest 
## 
## 14414 samples
##    32 predictors
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (4 fold) 
## 
## Summary of sample sizes: 10811, 10810, 10811, 10810 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##   2     1         1      0.004        0.005   
##   6     1         1      0.003        0.004   
##   10    1         1      0.003        0.004   
##   20    1         1      0.003        0.004   
##   30    1         1      0.003        0.004   
##   30    1         1      0.003        0.004   
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 6.
mod.alt.rf1c$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 0.61%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 4100    2    0    1    1   0.0009747
## B   17 2762   10    0    0   0.0096809
## C    0   19 2484   10    1   0.0119332
## D    1    0   21 2337    2   0.0101652
## E    0    0    1    2 2643   0.0011338
mod.alt.rf1c$results
##   mtry Accuracy  Kappa AccuracySD  KappaSD
## 1    2   0.9898 0.9871   0.003952 0.005002
## 2    6   0.9912 0.9889   0.002983 0.003775
## 3   10   0.9901 0.9875   0.003060 0.003873
## 4   18   0.9880 0.9848   0.003000 0.003797
## 5   26   0.9830 0.9785   0.003199 0.004049
## 6   34   0.9778 0.9719   0.003414 0.004320

Predictions on testing subset

pred.rf1c.test75 <- predict(mod.alt.rf1c, alt.testing.cut75, type="raw")

# confusion matrix
confusionMatrix(alt.testing.cut75$classe, pred.rf1c.test75)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1364    3    0    0    0
##          B    0  919    7    0    3
##          C    0    6  824    8    0
##          D    0    0   12  772    2
##          E    0    0    0    1  881
## 
## Overall Statistics
##                                         
##                Accuracy : 0.991         
##                  95% CI : (0.988, 0.994)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.989         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             1.000    0.990    0.977    0.988    0.994
## Specificity             0.999    0.997    0.996    0.997    1.000
## Pos Pred Value          0.998    0.989    0.983    0.982    0.999
## Neg Pred Value          1.000    0.998    0.995    0.998    0.999
## Prevalence              0.284    0.193    0.176    0.163    0.185
## Detection Rate          0.284    0.191    0.172    0.161    0.183
## Detection Prevalence    0.285    0.193    0.175    0.164    0.184
## Balanced Accuracy       1.000    0.994    0.987    0.992    0.997

Predictions on TEST subset (the 20 benchmark values for the Project)

pred.rf1c.TEST <- predict(mod.alt.rf1c, alt.TEST.select, type="raw")

# comparison with "truth"
pred.rf1c.TEST == answers
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [20] TRUE

Variable Importance

varImp(mod.alt.rf1c, useModel=TRUE, scale=FALSE)
## rf variable importance
## 
##   variables are sorted by maximum importance across the classes
##   only 20 most important variables shown (out of 32)
## 
##                         A    B    C    D    E
## yaw_belt             71.8 61.7 57.9 69.8 48.8
## magnet_dumbbell_z    62.4 51.8 64.3 48.2 52.1
## roll_dumbbell        36.9 46.1 62.0 48.9 43.3
## magnet_belt_z        41.2 48.8 45.7 58.4 54.4
## pitch_forearm        40.7 44.1 46.5 45.3 45.9
## gyros_belt_z         35.6 43.8 37.7 44.6 46.3
## roll_arm             28.8 44.1 37.7 46.3 33.3
## roll_forearm         34.9 33.0 44.0 31.7 31.6
## yaw_dumbbell         28.4 37.4 42.1 36.8 40.6
## gyros_dumbbell_y     28.6 32.5 40.3 33.2 27.3
## total_accel_dumbbell 31.8 36.0 34.2 33.4 39.9
## magnet_forearm_z     32.9 38.5 31.3 38.0 35.5
## yaw_arm              33.3 38.1 32.4 37.0 32.0
## gyros_arm_x          21.0 36.8 29.3 34.8 29.3
## magnet_belt_x        23.1 36.3 36.8 29.3 36.3
## accel_forearm_x      21.8 31.8 32.4 36.2 29.6
## pitch_arm            24.4 35.5 28.2 30.0 24.6
## accel_forearm_z      22.2 29.6 34.9 29.0 31.0
## magnet_arm_z         28.6 33.8 30.0 31.1 28.6
## accel_arm_y          23.6 32.7 29.2 29.5 27.1

# plot(varImp(mod.alt.rf1c, useModel=TRUE, scale=FALSE), top=ncol(mod.alt.rf1c$trainingData)-1)
dotPlot(varImp(mod.alt.rf1c, useModel=TRUE, scale=FALSE), top=ncol(mod.alt.rf1c$trainingData)-1)

plot of chunk rf1c_post-plots

Random Forest case 2 : bootstrap, 25 reps

With mtry = 2, 6, 10, 18, 26, 34

mtry.values <- c(2, 6, 10, 18, 26, 34)

seed.rf1b <- 16789; set.seed(seed.rf1b)
mod.rf1b <- train(x = training.cut75[, -1], 
                      y = training.cut75$classe, 
                      method = "rf", 
                      tuneGrid = data.frame(mtry = mtry.values))

Fit Summary

mod.rf1b
## Random Forest 
## 
## 14718 samples
##    34 predictors
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 14718, 14718, 14718, 14718, 14718, 14718, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##   2     1         1      0.002        0.003   
##   6     1         1      0.002        0.002   
##   10    1         1      0.001        0.002   
##   20    1         1      0.002        0.002   
##   30    1         1      0.002        0.003   
##   30    1         1      0.003        0.004   
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 6.
mod.rf1b$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 0.64%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 4182    3    0    0    0   0.0007168
## B   13 2826    8    0    1   0.0077247
## C    0   19 2530   17    1   0.0144137
## D    0    0   21 2387    4   0.0103648
## E    0    0    2    5 2699   0.0025868
mod.rf1b$results
##   mtry Accuracy  Kappa AccuracySD  KappaSD
## 1    2   0.9880 0.9848   0.002011 0.002541
## 2    6   0.9892 0.9864   0.001674 0.002114
## 3   10   0.9887 0.9857   0.001480 0.001868
## 4   18   0.9866 0.9831   0.001856 0.002344
## 5   26   0.9836 0.9792   0.002145 0.002709
## 6   34   0.9768 0.9707   0.002933 0.003698

Predictions on testing subset

pred.rf1b.test75 <- predict(mod.rf1b, testing.cut75, type="raw")

# confusion matrix
confusionMatrix(testing.cut75$classe, pred.rf1b.test75)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1393    1    0    0    1
##          B    7  936    4    0    2
##          C    0    5  845    4    1
##          D    0    0    5  799    0
##          E    0    0    0    3  898
## 
## Overall Statistics
##                                         
##                Accuracy : 0.993         
##                  95% CI : (0.991, 0.995)
##     No Information Rate : 0.285         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.991         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             0.995    0.994    0.989    0.991    0.996
## Specificity             0.999    0.997    0.998    0.999    0.999
## Pos Pred Value          0.999    0.986    0.988    0.994    0.997
## Neg Pred Value          0.998    0.998    0.998    0.998    0.999
## Prevalence              0.285    0.192    0.174    0.164    0.184
## Detection Rate          0.284    0.191    0.172    0.163    0.183
## Detection Prevalence    0.284    0.194    0.174    0.164    0.184
## Balanced Accuracy       0.997    0.995    0.993    0.995    0.997

Predictions on TEST subset (the 20 benchmark values for the Project)

pred.rf1b.TEST <- predict(mod.rf1b, TEST.select, type="raw")

# comparison with "truth"
pred.rf1b.TEST == answers
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [20] TRUE

Variable Importance

varImp(mod.rf1b, useModel=TRUE, scale=FALSE)
## rf variable importance
## 
##   only 20 most important variables shown (out of 34)
## 
##                      Overall
## yaw_belt                1155
## magnet_dumbbell_z        806
## pitch_forearm            763
## magnet_belt_z            698
## roll_forearm             626
## roll_dumbbell            574
## gyros_belt_z             499
## roll_arm                 415
## total_accel_dumbbell     391
## gyros_dumbbell_y         378
## yaw_dumbbell             377
## magnet_forearm_z         342
## accel_forearm_x          333
## magnet_arm_x             327
## accel_forearm_z          320
## pitch_dumbbell           307
## magnet_belt_x            306
## yaw_arm                  285
## magnet_forearm_y         272
## magnet_forearm_x         235

# plot(varImp(mod.rf1b, useModel=TRUE, scale=FALSE), top=ncol(mod.rf1b$trainingData)-1)
dotPlot(varImp(mod.rf1b, useModel=TRUE, scale=FALSE), top=ncol(mod.rf1b$trainingData)-1)

plot of chunk rf1b_post-plots

Random Forest case 3 : LGOCV, 25 repeats, 75%/25% splits

With mtry = 2, 4, 6, 8, 10.

mtryValues <- c(2, 4, 6, 8, 10)

ctrl <- trainControl(method = "LGOCV",
                     classProbs = TRUE)

seed.rf1e <- 17891; set.seed(seed.rf1e)
mod.alt.rf1e <- train(x = alt.training.cut75[, -1], 
                  y = alt.training.cut75$classe, 
                  method = "rf", 
                  tuneGrid = data.frame(mtry=mtryValues),
                  trControl = ctrl,
                  importance = TRUE, 
                  proximity = TRUE)

Fit Summary

mod.alt.rf1e
## Random Forest 
## 
## 14414 samples
##    32 predictors
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Repeated Train/Test Splits Estimated (25 reps, 0.75%) 
## 
## Summary of sample sizes: 10812, 10812, 10812, 10812, 10812, 10812, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
##   2     1         1      0.002        0.002   
##   4     1         1      0.001        0.002   
##   6     1         1      0.001        0.002   
##   8     1         1      0.002        0.002   
##   10    1         1      0.002        0.002   
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 4.
mod.alt.rf1e$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 0.65%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 4101    1    0    1    1    0.000731
## B   18 2764    7    0    0    0.008964
## C    0   19 2483   12    0    0.012331
## D    0    0   28 2330    3    0.013130
## E    0    0    1    3 2642    0.001512
mod.alt.rf1e$results
##   mtry Accuracy  Kappa AccuracySD  KappaSD
## 1    2   0.9897 0.9869   0.001745 0.002209
## 2    4   0.9910 0.9887   0.001204 0.001523
## 3    6   0.9910 0.9886   0.001362 0.001724
## 4    8   0.9907 0.9882   0.001660 0.002101
## 5   10   0.9903 0.9877   0.001702 0.002154

Predictions on testing subset

pred.rf1e.test75 <- predict(mod.alt.rf1e, alt.testing.cut75, type="raw")

# confusion matrix
confusionMatrix(alt.testing.cut75$classe, pred.rf1e.test75)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1365    2    0    0    0
##          B    0  920    6    0    3
##          C    0    8  823    7    0
##          D    0    0   12  771    3
##          E    0    0    0    1  881
## 
## Overall Statistics
##                                         
##                Accuracy : 0.991         
##                  95% CI : (0.988, 0.994)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.989         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             1.000    0.989    0.979    0.990    0.993
## Specificity             0.999    0.998    0.996    0.996    1.000
## Pos Pred Value          0.999    0.990    0.982    0.981    0.999
## Neg Pred Value          1.000    0.997    0.995    0.998    0.998
## Prevalence              0.284    0.194    0.175    0.162    0.185
## Detection Rate          0.284    0.192    0.171    0.161    0.183
## Detection Prevalence    0.285    0.193    0.175    0.164    0.184
## Balanced Accuracy       1.000    0.993    0.987    0.993    0.996

Predictions on TEST subset (the 20 benchmark values for the Project)

pred.rf1e.TEST <- predict(mod.alt.rf1e, alt.TEST.select, type="raw")

# comparison with "truth"
pred.rf1e.TEST == answers
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [20] TRUE

Variable Importance

# plot(varImp(mod.alt.rf1e, useModel=TRUE, scale=FALSE), top=ncol(mod.alt.rf1e$trainingData)-1)
dotPlot(varImp(mod.alt.rf1e, useModel=TRUE, scale=FALSE), top=ncol(mod.alt.rf1e$trainingData)-1)

plot of chunk rf1e_post-plots

Appendices

Timed vs. summary data entries

alt.statsNA <- as.data.frame(t(sapply(alt.full.good, function(x){ c(good = sum(!is.na(x)), bad = sum(is.na(x)))})))
print(alt.statsNA, quote=FALSE, print.gap=5)
##                               good       bad
## classe                       19216         0
## total_accel_belt             19216         0
## var_accel_belt                   0     19216
## roll_belt                    19216         0
## avg_roll_belt                    0     19216
## stddev_roll_belt                 0     19216
## var_roll_belt                    0     19216
## kurtosis_roll_belt               0     19216
## skewness_roll_belt               0     19216
## min_roll_belt                    0     19216
## max_roll_belt                    0     19216
## amplitude_roll_belt              0     19216
## pitch_belt                   19216         0
## avg_pitch_belt                   0     19216
## stddev_pitch_belt                0     19216
## var_pitch_belt                   0     19216
## kurtosis_pitch_belt              0     19216
## skewness_pitch_belt              0     19216
## min_pitch_belt                   0     19216
## max_pitch_belt                   0     19216
## amplitude_pitch_belt             0     19216
## yaw_belt                     19216         0
## avg_yaw_belt                     0     19216
## stddev_yaw_belt                  0     19216
## var_yaw_belt                     0     19216
## kurtosis_yaw_belt                0     19216
## skewness_yaw_belt                0     19216
## min_yaw_belt                     0     19216
## max_yaw_belt                     0     19216
## amplitude_yaw_belt               0     19216
## gyros_belt_x                 19216         0
## gyros_belt_y                 19216         0
## gyros_belt_z                 19216         0
## accel_belt_x                 19216         0
## accel_belt_y                 19216         0
## accel_belt_z                 19216         0
## magnet_belt_x                19216         0
## magnet_belt_y                19216         0
## magnet_belt_z                19216         0
## total_accel_arm              19216         0
## var_accel_arm                    0     19216
## roll_arm                     19216         0
## avg_roll_arm                     0     19216
## stddev_roll_arm                  0     19216
## var_roll_arm                     0     19216
## kurtosis_roll_arm                0     19216
## skewness_roll_arm                0     19216
## min_roll_arm                     0     19216
## max_roll_arm                     0     19216
## amplitude_roll_arm               0     19216
## pitch_arm                    19216         0
## avg_pitch_arm                    0     19216
## stddev_pitch_arm                 0     19216
## var_pitch_arm                    0     19216
## kurtosis_pitch_arm               0     19216
## skewness_pitch_arm               0     19216
## min_pitch_arm                    0     19216
## max_pitch_arm                    0     19216
## amplitude_pitch_arm              0     19216
## yaw_arm                      19216         0
## avg_yaw_arm                      0     19216
## stddev_yaw_arm                   0     19216
## var_yaw_arm                      0     19216
## kurtosis_yaw_arm                 0     19216
## skewness_yaw_arm                 0     19216
## min_yaw_arm                      0     19216
## max_yaw_arm                      0     19216
## amplitude_yaw_arm                0     19216
## gyros_arm_x                  19216         0
## gyros_arm_y                  19216         0
## gyros_arm_z                  19216         0
## accel_arm_x                  19216         0
## accel_arm_y                  19216         0
## accel_arm_z                  19216         0
## magnet_arm_x                 19216         0
## magnet_arm_y                 19216         0
## magnet_arm_z                 19216         0
## total_accel_forearm          19216         0
## var_accel_forearm                0     19216
## roll_forearm                 19216         0
## avg_roll_forearm                 0     19216
## stddev_roll_forearm              0     19216
## var_roll_forearm                 0     19216
## kurtosis_roll_forearm            0     19216
## skewness_roll_forearm            0     19216
## min_roll_forearm                 0     19216
## max_roll_forearm                 0     19216
## amplitude_roll_forearm           0     19216
## pitch_forearm                19216         0
## avg_pitch_forearm                0     19216
## stddev_pitch_forearm             0     19216
## var_pitch_forearm                0     19216
## kurtosis_pitch_forearm           0     19216
## skewness_pitch_forearm           0     19216
## min_pitch_forearm                0     19216
## max_pitch_forearm                0     19216
## amplitude_pitch_forearm          0     19216
## yaw_forearm                  19216         0
## avg_yaw_forearm                  0     19216
## stddev_yaw_forearm               0     19216
## var_yaw_forearm                  0     19216
## kurtosis_yaw_forearm             0     19216
## skewness_yaw_forearm             0     19216
## min_yaw_forearm                  0     19216
## max_yaw_forearm                  0     19216
## amplitude_yaw_forearm            0     19216
## gyros_forearm_x              19216         0
## gyros_forearm_y              19216         0
## gyros_forearm_z              19216         0
## accel_forearm_x              19216         0
## accel_forearm_y              19216         0
## accel_forearm_z              19216         0
## magnet_forearm_x             19216         0
## magnet_forearm_y             19216         0
## magnet_forearm_z             19216         0
## total_accel_dumbbell         19216         0
## var_accel_dumbbell               0     19216
## roll_dumbbell                19216         0
## avg_roll_dumbbell                0     19216
## stddev_roll_dumbbell             0     19216
## var_roll_dumbbell                0     19216
## kurtosis_roll_dumbbell           0     19216
## skewness_roll_dumbbell           0     19216
## min_roll_dumbbell                0     19216
## max_roll_dumbbell                0     19216
## amplitude_roll_dumbbell          0     19216
## pitch_dumbbell               19216         0
## avg_pitch_dumbbell               0     19216
## stddev_pitch_dumbbell            0     19216
## var_pitch_dumbbell               0     19216
## kurtosis_pitch_dumbbell          0     19216
## skewness_pitch_dumbbell          0     19216
## min_pitch_dumbbell               0     19216
## max_pitch_dumbbell               0     19216
## amplitude_pitch_dumbbell         0     19216
## yaw_dumbbell                 19216         0
## avg_yaw_dumbbell                 0     19216
## stddev_yaw_dumbbell              0     19216
## var_yaw_dumbbell                 0     19216
## kurtosis_yaw_dumbbell            0     19216
## skewness_yaw_dumbbell            0     19216
## min_yaw_dumbbell                 0     19216
## max_yaw_dumbbell                 0     19216
## amplitude_yaw_dumbbell           0     19216
## gyros_dumbbell_x             19216         0
## gyros_dumbbell_y             19216         0
## gyros_dumbbell_z             19216         0
## accel_dumbbell_x             19216         0
## accel_dumbbell_y             19216         0
## accel_dumbbell_z             19216         0
## magnet_dumbbell_x            19216         0
## magnet_dumbbell_y            19216         0
## magnet_dumbbell_z            19216         0

Checking for zero/low variance predictors.

nzv <- nearZeroVar(alt.full.select, saveMetrics=TRUE)
nzv
##                      freqRatio percentUnique zeroVar   nzv
## classe                   1.471       0.02602   FALSE FALSE
## total_accel_belt         1.069       0.15092   FALSE FALSE
## roll_belt                1.086       6.83805   FALSE FALSE
## pitch_belt               1.037       9.53893   FALSE FALSE
## yaw_belt                 1.047      10.10616   FALSE FALSE
## gyros_belt_x             1.051       0.72336   FALSE FALSE
## gyros_belt_y             1.149       0.35908   FALSE FALSE
## gyros_belt_z             1.071       0.87948   FALSE FALSE
## accel_belt_x             1.059       0.85346   FALSE FALSE
## accel_belt_y             1.115       0.74417   FALSE FALSE
## accel_belt_z             1.081       1.55600   FALSE FALSE
## magnet_belt_x            1.089       1.70171   FALSE FALSE
## magnet_belt_y            1.097       1.55079   FALSE FALSE
## magnet_belt_z            1.019       2.37302   FALSE FALSE
## total_accel_arm          1.020       0.34346   FALSE FALSE
## roll_arm                51.154      13.75937   FALSE FALSE
## pitch_arm               85.282      15.96066   FALSE FALSE
## yaw_arm                 32.282      14.89904   FALSE FALSE
## gyros_arm_x              1.024       3.34617   FALSE FALSE
## gyros_arm_y              1.451       1.95150   FALSE FALSE
## gyros_arm_z              1.119       1.29059   FALSE FALSE
## accel_arm_x              1.018       4.04351   FALSE FALSE
## accel_arm_y              1.169       2.78414   FALSE FALSE
## accel_arm_z              1.139       4.12157   FALSE FALSE
## magnet_arm_x             1.012       6.96295   FALSE FALSE
## magnet_arm_y             1.045       4.53268   FALSE FALSE
## magnet_arm_z             1.028       6.57785   FALSE FALSE
## total_accel_forearm      1.133       0.36428   FALSE FALSE
## roll_forearm            11.726      11.23543   FALSE FALSE
## pitch_forearm           64.576      15.09679   FALSE FALSE
## yaw_forearm             15.236      10.29871   FALSE FALSE
## gyros_forearm_x          1.050       1.54559   FALSE FALSE
## gyros_forearm_y          1.043       3.84055   FALSE FALSE
## gyros_forearm_z          1.112       1.58201   FALSE FALSE
## accel_forearm_x          1.143       4.13197   FALSE FALSE
## accel_forearm_y          1.050       5.20920   FALSE FALSE
## accel_forearm_z          1.019       3.01311   FALSE FALSE
## magnet_forearm_x         1.013       7.92569   FALSE FALSE
## magnet_forearm_y         1.256       9.72627   FALSE FALSE
## magnet_forearm_z         1.018       8.75833   FALSE FALSE
## total_accel_dumbbell     1.081       0.22377   FALSE FALSE
## roll_dumbbell            1.038      83.75312   FALSE FALSE
## pitch_dumbbell           2.248      81.22398   FALSE FALSE
## yaw_dumbbell             1.132      83.02456   FALSE FALSE
## gyros_dumbbell_x         1.010       1.25416   FALSE FALSE
## gyros_dumbbell_y         1.271       1.44151   FALSE FALSE
## gyros_dumbbell_z         1.053       1.06682   FALSE FALSE
## accel_dumbbell_x         1.006       2.21170   FALSE FALSE
## accel_dumbbell_y         1.062       2.41986   FALSE FALSE
## accel_dumbbell_z         1.150       2.12323   FALSE FALSE
## magnet_dumbbell_x        1.094       5.84929   FALSE FALSE
## magnet_dumbbell_y        1.189       4.38177   FALSE FALSE
## magnet_dumbbell_z        1.027       3.51270   FALSE FALSE

Some handy functions

tidy_df <- function( data ) {
    for(i in 8:159) { 
        data[,i] <- as.numeric(data[,i])
    }
    colnames(data) <- gsub("_picth", "_pitch", colnames(data), perl=TRUE)
    colnames(data) <- gsub("var_total_accel_", "var_accel_", colnames(data), perl=TRUE)
    colnames(data) <- gsub("roll_belt.1", "pitch_belt", colnames(data), perl=TRUE)
    return(data)
}

add_new_variables <- function( data ) { 
    data$classe <- as.factor(data$classe)
    data$timestamp <- data$raw_timestamp_part_1 + data$raw_timestamp_part_2
    data$date <- strptime(as.character(data$cvtd_timestamp), "%d/%m/%Y %H:%M")
    return(data)
}

select_proper_vars <- function( data ) {
    vec0 <- c("total_accel", "var_accel")
    
    nn1 <- c("avg", "stddev", "var", "kurtosis", "skewness", "min", "max", "amplitude")
    vec1 <- c("roll", paste(nn1, "roll", sep="_"), 
              "pitch", paste(nn1, "pitch", sep="_"), 
              "yaw", paste(nn1, "yaw", sep="_"))
    
    nn2 <- c("gyros", "accel", "magnet")
    vec2 <- paste( rep(nn2, each=3), "VVV", c("x","y","z"), sep="_")
    
    vec.VVV <- c(paste(c("total_accel", "var_accel", vec1), "VVV", sep="_"), vec2)
    vec.belt <- gsub("_VVV", "_belt", vec.VVV, perl=TRUE)
    vec.arm <- gsub("_VVV", "_arm", vec.VVV, perl=TRUE)
    vec.forearm <- gsub("_VVV", "_forearm", vec.VVV, perl=TRUE)
    vec.dumbbell <- gsub("_VVV", "_dumbbell", vec.VVV, perl=TRUE)
    i.classe <- which( colnames(data) == "classe")
    
    if( length(i.classe) > 0 ) {
        select <- data[, c("classe", vec.belt, vec.arm, vec.forearm, vec.dumbbell)]
    } else {
        select <- data[, c("problem_id", vec.belt, vec.arm, vec.forearm, vec.dumbbell)]
    }
    return(select)
}

# define color palettes
color1 <- colorRampPalette(c("#7F0000", "red", "#FF7F00", "yellow", "white", "cyan", "#007FFF", "blue", "#00007F"))
color2 <- colorRampPalette(c("#67001F", "#B2182B", "#D6604D", "#F4A582", "#FDDBC7",
                           "#FFFFFF", "#D1E5F0", "#92C5DE", "#4393C3", "#2166AC", "#053061"))   
color3 <- colorRampPalette(c("red", "white", "blue"))   

# correct answers
answers <- c("B", "A", "B", "A", "A", "E", "D", "B", "A", "A", "B", "C", "B", "A", "E", "E", "A", "B", "B", "B")

Machine-Learning-based Assessment of the quality of weight-lifting exercises

Giovanni Fossati

INTRODUCTION

SUMMARY OF RESULTS

THE DATA SET

Structure

The sensor data

DATA PREPARATION

Loading

Cleaning/Tidying

Non-sensor variables

Individual measurements vs summaries : the new_window variable

Some exploratory plots

Features plotted vs. sequence index, and color coded by user_name and classe

Feature vs. Feature plots with separate panels by classe

ABOUT FEATURE SELECTION

Zero/low variance predictors

Collinearity between predictors

DATA SPLITTING: “NEW” TRAINING AND TESTING SUBSETS

Feature selection on training / testing subsets

MODELING

Random Forest case 1 : 4-fold CV

Fit Summary

Predictions on testing subset

Predictions on TEST subset (the 20 benchmark values for the Project)

Variable Importance

Random Forest case 2 : bootstrap, 25 reps

Fit Summary

Predictions on testing subset

Predictions on TEST subset (the 20 benchmark values for the Project)

Variable Importance

Random Forest case 3 : LGOCV, 25 repeats, 75%/25% splits

Fit Summary

Predictions on testing subset

Predictions on TEST subset (the 20 benchmark values for the Project)

Variable Importance

Appendices

Timed vs. summary data entries

Checking for zero/low variance predictors.

Some handy functions

Individual measurements vs summaries : the `new_window` variable

Features plotted vs. sequence index, and color coded by `user_name` and `classe`

Feature vs. Feature plots with separate panels by `classe`