Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:

exactly according to the specification (Class A)
throwing the elbows to the front (Class B)
lifting the dumbbell only halfway (Class C)
lowering the dumbbell only halfway (Class D)
throwing the hips to the front (Class E)

Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).

Goal of the project is to predict the manner in which people did the exercise (“classe” variable).

Data

The dataset provided can be obtained from my Github Repo in the source_data/ folder. It is composed of two files :

pml-training.csv : a training set containing 19622 observations of 160 variables including the dependent variable “classe”.
pml-testing.csv : the validation set containing 20 observations of 159 variables for which the “classe” is to be predicted.

We first load these two datasets :

data <- read.csv("./source_data/pml-training.csv", na.string = c(NA,""))
validation  <- read.csv("./source_data/pml-testing.csv", na.string = c(NA,""))

Preprocessing

A summary of the dataset reveals that out of the 159 predictors, 100 present 98% of missing values in the data dataset (same predictors present 100% missing values in the validation dataset). These predictors are removed :

idx1 <- which(sapply(data,function(x) sum(is.na(x))/nrow(data))>0.95)
data <- data[,-idx1]

Among the 59 predictors left, We remove X which corresponds to the row number. Besides, the purpose being to predict the classe outcome from the sensors informations, we do not use predictors related to time and date:

data <- subset(data, select=-c(X,raw_timestamp_part_1,raw_timestamp_part_2,cvtd_timestamp,new_window,num_window))

These preprocessing steps lead to the following 53 predictors (for more details about their meaning refer to the original website and related informations):

names(data)

##  [1] "user_name"            "roll_belt"            "pitch_belt"          
##  [4] "yaw_belt"             "total_accel_belt"     "gyros_belt_x"        
##  [7] "gyros_belt_y"         "gyros_belt_z"         "accel_belt_x"        
## [10] "accel_belt_y"         "accel_belt_z"         "magnet_belt_x"       
## [13] "magnet_belt_y"        "magnet_belt_z"        "roll_arm"            
## [16] "pitch_arm"            "yaw_arm"              "total_accel_arm"     
## [19] "gyros_arm_x"          "gyros_arm_y"          "gyros_arm_z"         
## [22] "accel_arm_x"          "accel_arm_y"          "accel_arm_z"         
## [25] "magnet_arm_x"         "magnet_arm_y"         "magnet_arm_z"        
## [28] "roll_dumbbell"        "pitch_dumbbell"       "yaw_dumbbell"        
## [31] "total_accel_dumbbell" "gyros_dumbbell_x"     "gyros_dumbbell_y"    
## [34] "gyros_dumbbell_z"     "accel_dumbbell_x"     "accel_dumbbell_y"    
## [37] "accel_dumbbell_z"     "magnet_dumbbell_x"    "magnet_dumbbell_y"   
## [40] "magnet_dumbbell_z"    "roll_forearm"         "pitch_forearm"       
## [43] "yaw_forearm"          "total_accel_forearm"  "gyros_forearm_x"     
## [46] "gyros_forearm_y"      "gyros_forearm_z"      "accel_forearm_x"     
## [49] "accel_forearm_y"      "accel_forearm_z"      "magnet_forearm_x"    
## [52] "magnet_forearm_y"     "magnet_forearm_z"     "classe"

We then look at the correlation Matrix. Below we print the number of strong correlations (>0.80) for each predictor (not including autocorrelation):

corMatrix <- abs(cor(subset(data, select=-c(user_name,classe))))>.8
diag(corMatrix) <- FALSE # do not take into account autocorrelations
colSums(corMatrix)       # display number of high correlation for each variable

##            roll_belt           pitch_belt             yaw_belt 
##                    4                    2                    1 
##     total_accel_belt         gyros_belt_x         gyros_belt_y 
##                    3                    0                    0 
##         gyros_belt_z         accel_belt_x         accel_belt_y 
##                    0                    2                    3 
##         accel_belt_z        magnet_belt_x        magnet_belt_y 
##                    3                    2                    0 
##        magnet_belt_z             roll_arm            pitch_arm 
##                    0                    0                    0 
##              yaw_arm      total_accel_arm          gyros_arm_x 
##                    0                    0                    1 
##          gyros_arm_y          gyros_arm_z          accel_arm_x 
##                    1                    0                    1 
##          accel_arm_y          accel_arm_z         magnet_arm_x 
##                    0                    0                    1 
##         magnet_arm_y         magnet_arm_z        roll_dumbbell 
##                    1                    1                    0 
##       pitch_dumbbell         yaw_dumbbell total_accel_dumbbell 
##                    1                    1                    0 
##     gyros_dumbbell_x     gyros_dumbbell_y     gyros_dumbbell_z 
##                    2                    0                    2 
##     accel_dumbbell_x     accel_dumbbell_y     accel_dumbbell_z 
##                    1                    0                    1 
##    magnet_dumbbell_x    magnet_dumbbell_y    magnet_dumbbell_z 
##                    0                    0                    0 
##         roll_forearm        pitch_forearm          yaw_forearm 
##                    0                    0                    0 
##  total_accel_forearm      gyros_forearm_x      gyros_forearm_y 
##                    0                    0                    1 
##      gyros_forearm_z      accel_forearm_x      accel_forearm_y 
##                    3                    0                    0 
##      accel_forearm_z     magnet_forearm_x     magnet_forearm_y 
##                    0                    0                    0 
##     magnet_forearm_z 
##                    0

For threshold of 0.80, looking more clesely at the correlation matrix reveals the following strong pairwise correlations:

total_accel_belt with accel_belt_y and accel_belt_z (last 2 also correlated)
magnet_belt_x with accel_belt_x and pitch_belt (last 2 also correlated)
gyro_arm_x with gyro_arm_y
magnet_arm_x with accel_arm_x
magnet_arm_y with magnet_arm_z
pitch_dumbbell with accel_dumbbell_x
yaw_dumbdell with accel_dumbbell_z
gyros_forearm_z with gyros_forearm_y and gyros_dumbbell_z and gyros_dumbbell_x (last 2 also correlated)

A vector of variables to remove to reduce pairwise correlation can also be obtained as follows:

library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

flags <- findCorrelation(cor(subset(data, select=-c(user_name,classe))),cutoff = .80)
names(subset(data, select=-c(user_name,classe)))[flags]

##  [1] "accel_belt_z"     "roll_belt"        "accel_belt_y"    
##  [4] "accel_dumbbell_z" "accel_belt_x"     "pitch_belt"      
##  [7] "accel_arm_x"      "accel_dumbbell_x" "magnet_arm_y"    
## [10] "gyros_arm_y"      "gyros_forearm_z"  "gyros_dumbbell_x"

Which confirms the previous observations.

Machine learning

Before training a model, we split the data into a training and testing sets. The training set is to be used for training the model while the testing set will be use to provide an unbiased estimate of the model accuracy. Note that we do not use here the validation set of 20 observations because these predictions are to be uploaded on Coursera for project grading.

Splitting is done as follows (ratio 75/25):

set.seed(3764) # for reproducibility
inTrain<- createDataPartition(data$classe, p = 0.75, list = FALSE)
training <- data[inTrain,]
testing  <- data[-inTrain,]

Random forest model using all predictors

We train a random forest on the training set, keeping all 53 predictors. 10-fold cross validation is used along with different values of the mtry parameter (number of variables allowed for each split):

# takes about 40 minutes to run on intel core i5-4200U and 8Gb Ram.
numFolds <- trainControl(method = "cv", number = 10)
rffit    <- train(classe ~ ., method="rf", data=training, trControl = numFolds)

rffit

## Random Forest 
## 
## 14718 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 13245, 13247, 13246, 13244, 13246, 13246, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    2    0.9916434  0.9894285  0.002640017  0.003340925
##   29    0.9922541  0.9902017  0.001901686  0.002405403
##   57    0.9860711  0.9823788  0.003642431  0.004609978
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 29.

The best model is obtained for mtry = 29, leading to a CV accuracy of 99.2%. Plot below shows the variable importance (top 10 predictors):

library(randomForest)
varImpPlot(rffit$finalModel, n.var = 10, main = "Random Forest Model - top 10 predictors")

We then apply the model on the hold out testing set in order to estimate the model accuracy:

testPred <- predict(rffit, newdata = testing)
confusionMatrix(testPred,testing$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1394    9    0    0    1
##          B    0  936    3    0    1
##          C    1    4  848    7    2
##          D    0    0    4  797    4
##          E    0    0    0    0  893
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9927          
##                  95% CI : (0.9899, 0.9949)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9907          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9993   0.9863   0.9918   0.9913   0.9911
## Specificity            0.9972   0.9990   0.9965   0.9980   1.0000
## Pos Pred Value         0.9929   0.9957   0.9838   0.9901   1.0000
## Neg Pred Value         0.9997   0.9967   0.9983   0.9983   0.9980
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2843   0.1909   0.1729   0.1625   0.1821
## Detection Prevalence   0.2863   0.1917   0.1758   0.1642   0.1821
## Balanced Accuracy      0.9982   0.9926   0.9942   0.9947   0.9956

The estimated model accuracy on the testing set is 0.993 (error rate 0.007). This is consistent with the CV accuracy meaning we are not overfitting the training set.

The model was then applied to the 20 observations of the validation set (result not shown here because subject to the Coursera Honor Code.)

Random forest model using 9 predictors

The variable importance of the previous model, along with the pairwise correlations discused previously, suggest that not all predictors are necessary to perform a proper prediction on this dataset. Therefore, considering correlations and variable importance, the number of predictors was iteratively decreased until the following training set was obtained:

training2 <- subset(training, select=-c(roll_belt,pitch_belt,yaw_belt,roll_dumbbell, magnet_dumbbell_x, magnet_dumbbell_y, magnet_dumbbell_z, roll_forearm, pitch_forearm, classe))

A random forest is again trained with the above predictors (because there are only 9 predictors we here expand the grid for mtry parameter.) :

numFolds <- trainControl( method = "cv", number = 10)
cpGrid   <- expand.grid( .mtry = seq(3,10,by = 1))
rffit2   <- train(classe ~ ., method="rf", data=training2, trControl = numFolds, tuneGrid = cpGrid)

rffit2

## Random Forest 
## 
## 14718 samples
##     9 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## 
## Summary of sample sizes: 13244, 13248, 13246, 13245, 13247, 13246, ... 
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa      Accuracy SD  Kappa SD   
##    3    0.9849167  0.9809275  0.002841563  0.003589338
##    4    0.9846450  0.9805836  0.002583553  0.003262094
##    5    0.9846450  0.9805847  0.003329737  0.004203510
##    6    0.9834219  0.9790388  0.004321749  0.005455679
##    7    0.9830140  0.9785235  0.004343940  0.005482633
##    8    0.9809751  0.9759457  0.005567400  0.007028496
##    9    0.9781894  0.9724250  0.006248816  0.007887492
##   10    0.9780535  0.9722526  0.005873128  0.007411011
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 3.

This model achieves 0.985 CV accuracy for mtry=3, which is confirmed by predicting on the held out testing set as shown below:

testPred <- predict(rffit2, newdata = testing)
confusionMatrix(testPred,testing$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1388   11    2    0    0
##          B    3  925    5    0    6
##          C    2   10  843    6    1
##          D    2    3    5  797    2
##          E    0    0    0    1  892
## 
## Overall Statistics
##                                           
##                Accuracy : 0.988           
##                  95% CI : (0.9845, 0.9908)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9848          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9950   0.9747   0.9860   0.9913   0.9900
## Specificity            0.9963   0.9965   0.9953   0.9971   0.9998
## Pos Pred Value         0.9907   0.9851   0.9780   0.9852   0.9989
## Neg Pred Value         0.9980   0.9939   0.9970   0.9983   0.9978
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2830   0.1886   0.1719   0.1625   0.1819
## Detection Prevalence   0.2857   0.1915   0.1758   0.1650   0.1821
## Balanced Accuracy      0.9956   0.9856   0.9906   0.9942   0.9949

This model with only 9 predictors exhibit like the previous one a 100% accuracy regarding the 20 observations of the validation set. Note that it is possible to keep decreasing the number of predictors while keeping a relatively high accuracy (down to about 3-4 predictors and accuracy > 80%), however the objective here was to find a minimal model able to predict correctly the 20 validation obsevations (i.e. with less than 9 predictors I have not been able to get all of them correct).

Reference

Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science., pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.

Detecting Execution Mistakes in Weight Lifting Exercises Using On-body sensing and Machine Learning

Renaud Dufour