Introduction

The pml data set describes weight lifting exercises performed by 6 individuals, measured by a set of sensors attached to each participant's belt, arm, dumbbell, and forearm.

The weight lifting exercises, namely unilateral dumbbell biceps curl, were performed in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Note that these classes are unordered.

The goal of the project is to predict the manner in which the individuals did the exercise, given a set of sensor measurements.

Data Understanding

The sensor measurement variables can be arranged in groups of three, such that each group spans a 3D orthogonal space for each of the objects (belt, arm, dumbbell, forearm). To illustrate the movement in space, imagine one of the objects, say the arm, in place of the airplane in the figure below.

[Figure: Yaw_Axis_Corrected, an airplane shown with its roll, pitch, and yaw axes]

From here we expect the measurements to be approximately normally distributed with zero mean: if the arm is "lowered down" (measurements towards \(-\infty\)), it must be "pulled up" again (measurements towards \(+\infty\)), crossing the middle point twice (measurement equal to zero), while the full extension and contraction of the arm are each performed once per cycle.
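
As a quick sanity check of this intuition, the empirical distribution of one of the measurements can be inspected. A minimal sketch, assuming the training CSV has been downloaded to ./data/ as in the Data Preparation section below:

## Inspect the empirical distribution of one arm measurement (sketch)
train <- read.csv("./data/pml-training.csv")
hist(train$roll_arm, breaks = 50,
     main = "Distribution of roll_arm", xlab = "roll_arm")
summary(train$roll_arm)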

In the original data set (available here), there are 6 participants performing the exercise in 5 different fashions, that is, 30 participant-performance combinations. In contrast, the train-set contains 6 combinations, which would cause the model to place great emphasis on the participant-name features. Furthermore, the test-set contains participant-performance combinations that are not included in the train-set; that is, overfitting is likely to occur if we chose to use the participant-name feature in our model.
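
To see these combinations explicitly, the participant names can be cross-tabulated against the performance classes. A minimal sketch, using the train data frame loaded above and assuming the original column names user_name and classe:

## Participant-performance combinations (sketch; assumes columns user_name and classe)
table(train$user_name, train$classe)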

Data Preparation

Get the Data

We start by loading the data and selecting only the measurement variables (note the variables are grouped in triplets).

## Get the Data
train <- read.csv("./data/pml-training.csv")
test  <- read.csv("./data/pml-testing.csv")

## Subset the data
var.names <- c("roll_belt","pitch_belt","yaw_belt",
               "gyros_belt_x","gyros_belt_y","gyros_belt_z",
               "accel_belt_x","accel_belt_y","accel_belt_z",
               "magnet_belt_x","magnet_belt_y","magnet_belt_z",
               "roll_arm","pitch_arm","yaw_arm",
               "gyros_arm_x","gyros_arm_y","gyros_arm_z",
               "accel_arm_x","accel_arm_y","accel_arm_z",
               "magnet_arm_x","magnet_arm_y","magnet_arm_z",
               "gyros_dumbbell_x","gyros_dumbbell_y","gyros_dumbbell_z",
               "accel_dumbbell_x","accel_dumbbell_y","accel_dumbbell_z",
               "magnet_dumbbell_x","magnet_dumbbell_y","magnet_dumbbell_z",
               "roll_forearm","pitch_forearm","yaw_forearm",
               "gyros_forearm_x","gyros_forearm_y","gyros_forearm_z",
               "accel_forearm_x","accel_forearm_y","accel_forearm_z",
               "magnet_forearm_x","magnet_forearm_y","magnet_forearm_z")
X_tr <- apply(train[,var.names],2, as.numeric)
X_te <- apply(test[,var.names], 2, as.numeric)
y <- train[,"classe"]

Splitting the Data

We are in a data-rich situation: the train-set contains 19622 observations on 45 variables, and only about 4% of the elements are zero (i.e., the matrix is dense).
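
These figures can be verified directly from the predictor matrix; a minimal check:

## Verify the dimensions and the fraction of zero entries
dim(X_tr)        # 19622 observations x 45 variables
mean(X_tr == 0)  # share of zero entries, roughly 0.04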

The best approach for both model selection [1] and model assessment [2] is to randomly divide the dataset into three parts [3]: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.

set.seed(2015)
u <- runif(nrow(X_tr)) # Generate a uniformly distributed random variable
index.tr <- which(u < 0.50)                # 50% train set (used to fit the final model)
index.va <- which(u > 0.75)                # 25% validation set (used for model selection)
index.te <- which((0.5 <= u) & (u < 0.75)) # 25% test set (used to assess the final model)

The test set is kept in a “vault” and brought out only at the end of the data analysis.

Nature of the Predictors

Between-Predictor Correlations

Each pairwise correlation is computed from the train data and colored according to its magnitude. The visualization is symmetric: the upper and lower triangles show identical information. Dark blue indicates strong positive correlations, dark red indicates strong negative correlations, and white implies no empirical relationship between the predictors.
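
A figure of this kind can be produced from the correlation matrix of the training predictors. A minimal sketch, assuming the corrplot package is installed (the original figure may have been generated differently):

## Between-predictor correlation plot (sketch)
library("corrplot")
corrplot(cor(X_tr), method = "color", type = "full",
         order = "hclust", tl.cex = 0.6)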

Generally, there is low correlation between the different variables, with a few exceptions. We choose to address the issue by detecting and removing the highly correlated variables (rather than by applying PCA).

## Remove highly correlated variables (r>0.9)
index.hcv <- caret::findCorrelation(cor(X_tr), cutoff=0.90)
X_tr <- X_tr[,-index.hcv]
X_te <- X_te[,-index.hcv]

The variables accel_belt_z, roll_belt, accel_belt_x, gyros_dumbbell_x, gyros_dumbbell_z, and gyros_arm_x were removed from the dataset.

Predictors Distributions

To check whether the distributions are skewed and/or need centering and scaling, we look at summary statistics of the different predictors (a sketch of how they can be computed follows the list below; for the full table, please see the appendix).

  • Note there are 0 predictors with zero or near-zero variance.
  • Skewness and kurtosis values between -2 and +2 are commonly considered acceptable for treating a univariate distribution as approximately normal.
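
A hedged illustration of how the near-zero-variance check and the skewness/kurtosis figures reported in the appendix can be computed, assuming the e1071 package is available:

## Near-zero-variance predictors and per-variable shape statistics (sketch)
library("caret")
library("e1071")
nzv <- nearZeroVar(X_tr)  # column indices with (near-)zero variance
length(nzv)               # 0 in our data
skew <- apply(X_tr, 2, skewness)
kurt <- apply(X_tr, 2, kurtosis)
round(cbind(Skewness = skew, Kurtosis = kurt), 2)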

Analyzing the statistics, we choose to standardize the variables, keeping in mind that linear operations on the variables preserve the orthogonality between them.

library("caret")
preObj <- preProcess(X_tr, method=c("center","scale"))
X_tr   <- predict(preObj, X_tr)
X_te   <- predict(preObj, X_te)

Modeling

First, we set up the computational nuances of the model-training phase for 10-fold CV,

fitControl <- trainControl(
        classProbs=TRUE,
        ## 10-fold CV        
        method = "cv",
        number = 10,   
        allowParallel = TRUE,
        returnData = FALSE, # saves memory
        seeds = list(2016,2017,2018,2019,2020,2021,2022,2023,2024,2025,2026))

Second, we build a benchmark using multinomial regression on the validation set,

mdl.glm <- train(classe~., 
                 data=data.frame(X_tr[index.va,],classe=y[index.va]),
                 method="glmnet", # Multinomial Regression
                 tuneGrid=expand.grid("alpha"=1, "lambda"=0),
                 metric="Accuracy", maximize=TRUE,
                 trControl=fitControl)

Third, we fit a Random Forest model on the validation set,

mdl.rf <- train(classe~., 
                data=data.frame(X_tr[index.va,],classe=y[index.va]),
                method="rf",
                metric="Accuracy", maximize=TRUE,
                tuneGrid=expand.grid(mtry=round(sqrt(45))),
                trControl=fitControl)

Evaluation

Between-Model Comparison

We can clearly see that the Random Forest model outperforms the benchmark (please see the appendix for further details). The 10-fold CV shows Accuracy and Kappa \(>97\%\) for the Random Forest, whereas the benchmark reaches only about \(67\%\) Accuracy and \(59\%\) Kappa.
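
The cross-validated comparison can be summarized directly from the two train objects; a minimal sketch using caret's resamples:

## Compare the cross-validated resampling results of the two models (sketch)
resamps <- resamples(list(GLMNET = mdl.glm, RF = mdl.rf))
summary(resamps)  # Accuracy and Kappa distributions across the 10 folds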

Final Model Assessment

If it seems too good to be true, it probably is

To put that to the test, we fit a Random Forest model on our (subdivided) 50% train-set and evaluate it on unseen data (our subdivided 25% test-set).

mdl.valid <- train(classe~., 
                   data=data.frame(X_tr[index.tr,],classe=y[index.tr]),
                   method="parRF",
                   metric="Accuracy", maximize=TRUE,
                   tuneGrid=expand.grid(mtry=round(sqrt(45))),
                   trControl=trainControl(classProbs=TRUE,
                                          method = "none",
                                          allowParallel = TRUE,
                                          returnData = FALSE))
y_valid <- predict(mdl.valid, newdata=X_tr[index.te,])
confusionMatrix(y_valid, y[index.te])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1411   13    0    0    0
##          B    2  939   14    0    0
##          C    1    6  845   10    1
##          D    0    0    1  821    0
##          E    0    0    0    0  881
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9903          
##                  95% CI : (0.9872, 0.9928)
##     No Information Rate : 0.2859          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9877          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9979   0.9802   0.9826   0.9880   0.9989
## Specificity            0.9963   0.9960   0.9956   0.9998   1.0000
## Pos Pred Value         0.9909   0.9832   0.9791   0.9988   1.0000
## Neg Pred Value         0.9991   0.9952   0.9963   0.9976   0.9998
## Prevalence             0.2859   0.1937   0.1739   0.1680   0.1784
## Detection Rate         0.2853   0.1899   0.1709   0.1660   0.1782
## Detection Prevalence   0.2880   0.1931   0.1745   0.1662   0.1782
## Balanced Accuracy      0.9971   0.9881   0.9891   0.9939   0.9994

The evaluation is consistent with our previous findings, so we proceed to the next phase with confidence, expecting similarly consistent results (\(\text{Accuracy} \approx 99\%\)) on the given test set.

Deployment

We fit the final model using all of the given train-set,

mdl.final <- train(classe~., 
                   data=data.frame(X_tr,classe=y),
                   method="parRF",
                   metric="Accuracy", maximize=TRUE,
                   tuneGrid=expand.grid(mtry=round(sqrt(45))),
                   trControl=trainControl(classProbs=TRUE,
                                          method = "none",
                                          allowParallel = TRUE,
                                          returnData = FALSE))
y_hat <- predict(mdl.final, newdata=X_te)

Finally, we export our predictions,

export.pred <- function(y_hat){
        dir.name <- "Prediction Assignment Submission"
        dir.create(dir.name,FALSE)
        
        n <- length(y_hat)
        for(i in 1:n){
                file.name <- paste0("problem_id_",i,".txt")
                write.table(y_hat[i],
                            file.path(getwd(),dir.name,file.name),
                            quote=FALSE,row.names=FALSE,col.names=FALSE)
        } #end for y_hat
} #end export.pred

export.pred(y_hat)

Results

Using the described model, the total score is 20/20.

Appendix

Predictors Distributions

Important Predictors Statistics
Variable             Median     Mean Variance Skewness Kurtosis
pitch_belt             0.23    -0.01     1.02    -0.97    -0.17
yaw_belt               0.04     0.02     1.01     0.86    -0.66
gyros_belt_x           0.17     0.00     1.05    -0.54     6.76
gyros_belt_y          -0.25     0.01     1.02     0.14     4.33
gyros_belt_z           0.00    -0.04     1.00     0.02     6.30
accel_belt_y           0.36     0.03     0.99     0.10    -1.64
magnet_belt_x         -0.32     0.01     1.02     1.41     2.16
magnet_belt_y          0.21    -0.01     1.06    -2.23     7.43
magnet_belt_z          0.41     0.01     1.07     0.33    13.26
roll_arm              -0.25    -0.01     0.97    -0.15    -0.43
pitch_arm              0.15    -0.01     0.97     0.18     0.27
yaw_arm                0.01     0.01     0.98    -0.09    -0.11
gyros_arm_y            0.00     0.00     1.00     0.16     0.64
gyros_arm_z           -0.04     0.01     1.00    -0.17     0.12
accel_arm_x            0.11     0.01     1.00     0.33    -0.80
accel_arm_y           -0.20    -0.02     1.01     0.10    -1.03
accel_arm_z            0.17     0.02     0.98    -0.82     0.88
magnet_arm_x           0.22     0.01     1.00    -0.16    -1.60
magnet_arm_y           0.21     0.00     0.99    -0.46    -0.84
magnet_arm_z           0.41     0.00     0.99    -1.13     0.09
gyros_dumbbell_y       0.01     0.01     0.63     0.11     2.98
accel_dumbbell_x       0.29     0.02     0.97    -0.40     1.05
accel_dumbbell_y      -0.14     0.00     1.01     0.33    -0.46
accel_dumbbell_z       0.35     0.00     0.97    -0.17    -0.74
magnet_dumbbell_x     -0.44     0.01     1.01     1.66     1.32
magnet_dumbbell_y      0.27     0.00     1.04    -1.96     5.12
magnet_dumbbell_z     -0.24    -0.01     0.98     0.90     0.36
roll_forearm          -0.12    -0.01     0.99    -0.46    -0.74
pitch_forearm         -0.06     0.00     0.98    -0.48     0.85
yaw_forearm           -0.19     0.01     0.99    -0.26    -1.14
gyros_forearm_x       -0.17     0.01     0.92     0.15     0.86
gyros_forearm_y       -0.01     0.00     0.48    -0.18    -0.36
gyros_forearm_z       -0.04    -0.01     0.12    -0.90    11.68
accel_forearm_x        0.02    -0.01     0.99    -0.21    -0.69
accel_forearm_y        0.18     0.01     0.99    -0.68    -0.05
accel_forearm_z        0.12    -0.01     0.98     0.43    -0.87
magnet_forearm_x      -0.19    -0.01     0.99     0.60    -0.10
magnet_forearm_y       0.42     0.01     0.98    -0.76    -0.36
magnet_forearm_z       0.33     0.01     0.99    -1.28     1.11

Model Performance: Multinomial Regression

##   TrainAccuracy TrainKappa method
## 1     0.6742728  0.5878087 glmnet
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentages of table totals)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 22.4  2.9  2.6  1.0  1.4
##          B  1.2 11.6  1.4  1.6  2.4
##          C  1.7  1.8 11.3  2.3  1.5
##          D  1.6  0.8  1.3 10.7  2.2
##          E  0.7  1.9  1.0  1.3 11.5

Model Performance: Random Forest

##   TrainAccuracy TrainKappa method
## 1     0.9771575  0.9711597     rf
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentages of table totals)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 27.5  0.5  0.0  0.0  0.0
##          B  0.0 18.2  0.4  0.0  0.0
##          C  0.1  0.3 17.0  0.5  0.1
##          D  0.0  0.0  0.1 16.4  0.1
##          E  0.0  0.0  0.0  0.0 18.7

References

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2008. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. Berlin: Springer.

Kuhn, Max, and Kjell Johnson. 2013. Applied Predictive Modeling. Springer.


  1. Model selection: estimating the performance of different models in order to choose the best one.

  2. Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

  3. We use a typical split: 50% for training, and 25% each for validation and testing.