Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:
Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).
Goal of the project is to predict the manner in which people did the exercise (“classe” variable).
The dataset provided can be obtained from my Github Repo in the source_data/ folder. It is composed of two files :
We first load these two datasets :
data <- read.csv("./source_data/pml-training.csv", na.string = c(NA,""))
validation <- read.csv("./source_data/pml-testing.csv", na.string = c(NA,""))
A summary of the dataset reveals that out of the 159 predictors, 100 present 98% of missing values in the data dataset (same predictors present 100% missing values in the validation dataset). These predictors are removed :
idx1 <- which(sapply(data,function(x) sum(is.na(x))/nrow(data))>0.95)
data <- data[,-idx1]
Among the 59 predictors left, We remove X which corresponds to the row number. Besides, the purpose being to predict the classe outcome from the sensors informations, we do not use predictors related to time and date:
data <- subset(data, select=-c(X,raw_timestamp_part_1,raw_timestamp_part_2,cvtd_timestamp,new_window,num_window))
These preprocessing steps lead to the following 53 predictors (for more details about their meaning refer to the original website and related informations):
names(data)
## [1] "user_name" "roll_belt" "pitch_belt"
## [4] "yaw_belt" "total_accel_belt" "gyros_belt_x"
## [7] "gyros_belt_y" "gyros_belt_z" "accel_belt_x"
## [10] "accel_belt_y" "accel_belt_z" "magnet_belt_x"
## [13] "magnet_belt_y" "magnet_belt_z" "roll_arm"
## [16] "pitch_arm" "yaw_arm" "total_accel_arm"
## [19] "gyros_arm_x" "gyros_arm_y" "gyros_arm_z"
## [22] "accel_arm_x" "accel_arm_y" "accel_arm_z"
## [25] "magnet_arm_x" "magnet_arm_y" "magnet_arm_z"
## [28] "roll_dumbbell" "pitch_dumbbell" "yaw_dumbbell"
## [31] "total_accel_dumbbell" "gyros_dumbbell_x" "gyros_dumbbell_y"
## [34] "gyros_dumbbell_z" "accel_dumbbell_x" "accel_dumbbell_y"
## [37] "accel_dumbbell_z" "magnet_dumbbell_x" "magnet_dumbbell_y"
## [40] "magnet_dumbbell_z" "roll_forearm" "pitch_forearm"
## [43] "yaw_forearm" "total_accel_forearm" "gyros_forearm_x"
## [46] "gyros_forearm_y" "gyros_forearm_z" "accel_forearm_x"
## [49] "accel_forearm_y" "accel_forearm_z" "magnet_forearm_x"
## [52] "magnet_forearm_y" "magnet_forearm_z" "classe"
We then look at the correlation Matrix. Below we print the number of strong correlations (>0.80) for each predictor (not including autocorrelation):
corMatrix <- abs(cor(subset(data, select=-c(user_name,classe))))>.8
diag(corMatrix) <- FALSE # do not take into account autocorrelations
colSums(corMatrix) # display number of high correlation for each variable
## roll_belt pitch_belt yaw_belt
## 4 2 1
## total_accel_belt gyros_belt_x gyros_belt_y
## 3 0 0
## gyros_belt_z accel_belt_x accel_belt_y
## 0 2 3
## accel_belt_z magnet_belt_x magnet_belt_y
## 3 2 0
## magnet_belt_z roll_arm pitch_arm
## 0 0 0
## yaw_arm total_accel_arm gyros_arm_x
## 0 0 1
## gyros_arm_y gyros_arm_z accel_arm_x
## 1 0 1
## accel_arm_y accel_arm_z magnet_arm_x
## 0 0 1
## magnet_arm_y magnet_arm_z roll_dumbbell
## 1 1 0
## pitch_dumbbell yaw_dumbbell total_accel_dumbbell
## 1 1 0
## gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z
## 2 0 2
## accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## 1 0 1
## magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z
## 0 0 0
## roll_forearm pitch_forearm yaw_forearm
## 0 0 0
## total_accel_forearm gyros_forearm_x gyros_forearm_y
## 0 0 1
## gyros_forearm_z accel_forearm_x accel_forearm_y
## 3 0 0
## accel_forearm_z magnet_forearm_x magnet_forearm_y
## 0 0 0
## magnet_forearm_z
## 0
For threshold of 0.80, looking more clesely at the correlation matrix reveals the following strong pairwise correlations:
A vector of variables to remove to reduce pairwise correlation can also be obtained as follows:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
flags <- findCorrelation(cor(subset(data, select=-c(user_name,classe))),cutoff = .80)
names(subset(data, select=-c(user_name,classe)))[flags]
## [1] "accel_belt_z" "roll_belt" "accel_belt_y"
## [4] "accel_dumbbell_z" "accel_belt_x" "pitch_belt"
## [7] "accel_arm_x" "accel_dumbbell_x" "magnet_arm_y"
## [10] "gyros_arm_y" "gyros_forearm_z" "gyros_dumbbell_x"
Which confirms the previous observations.
Before training a model, we split the data into a training and testing sets. The training set is to be used for training the model while the testing set will be use to provide an unbiased estimate of the model accuracy. Note that we do not use here the validation set of 20 observations because these predictions are to be uploaded on Coursera for project grading.
Splitting is done as follows (ratio 75/25):
set.seed(3764) # for reproducibility
inTrain<- createDataPartition(data$classe, p = 0.75, list = FALSE)
training <- data[inTrain,]
testing <- data[-inTrain,]
We train a random forest on the training set, keeping all 53 predictors. 10-fold cross validation is used along with different values of the mtry parameter (number of variables allowed for each split):
# takes about 40 minutes to run on intel core i5-4200U and 8Gb Ram.
numFolds <- trainControl(method = "cv", number = 10)
rffit <- train(classe ~ ., method="rf", data=training, trControl = numFolds)
rffit
## Random Forest
##
## 14718 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 13245, 13247, 13246, 13244, 13246, 13246, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9916434 0.9894285 0.002640017 0.003340925
## 29 0.9922541 0.9902017 0.001901686 0.002405403
## 57 0.9860711 0.9823788 0.003642431 0.004609978
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 29.
The best model is obtained for mtry = 29, leading to a CV accuracy of 99.2%. Plot below shows the variable importance (top 10 predictors):
library(randomForest)
varImpPlot(rffit$finalModel, n.var = 10, main = "Random Forest Model - top 10 predictors")
We then apply the model on the hold out testing set in order to estimate the model accuracy:
testPred <- predict(rffit, newdata = testing)
confusionMatrix(testPred,testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 9 0 0 1
## B 0 936 3 0 1
## C 1 4 848 7 2
## D 0 0 4 797 4
## E 0 0 0 0 893
##
## Overall Statistics
##
## Accuracy : 0.9927
## 95% CI : (0.9899, 0.9949)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9907
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9993 0.9863 0.9918 0.9913 0.9911
## Specificity 0.9972 0.9990 0.9965 0.9980 1.0000
## Pos Pred Value 0.9929 0.9957 0.9838 0.9901 1.0000
## Neg Pred Value 0.9997 0.9967 0.9983 0.9983 0.9980
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2843 0.1909 0.1729 0.1625 0.1821
## Detection Prevalence 0.2863 0.1917 0.1758 0.1642 0.1821
## Balanced Accuracy 0.9982 0.9926 0.9942 0.9947 0.9956
The estimated model accuracy on the testing set is 0.993 (error rate 0.007). This is consistent with the CV accuracy meaning we are not overfitting the training set.
The model was then applied to the 20 observations of the validation set (result not shown here because subject to the Coursera Honor Code.)
The variable importance of the previous model, along with the pairwise correlations discused previously, suggest that not all predictors are necessary to perform a proper prediction on this dataset. Therefore, considering correlations and variable importance, the number of predictors was iteratively decreased until the following training set was obtained:
training2 <- subset(training, select=-c(roll_belt,pitch_belt,yaw_belt,roll_dumbbell, magnet_dumbbell_x, magnet_dumbbell_y, magnet_dumbbell_z, roll_forearm, pitch_forearm, classe))
A random forest is again trained with the above predictors (because there are only 9 predictors we here expand the grid for mtry parameter.) :
numFolds <- trainControl( method = "cv", number = 10)
cpGrid <- expand.grid( .mtry = seq(3,10,by = 1))
rffit2 <- train(classe ~ ., method="rf", data=training2, trControl = numFolds, tuneGrid = cpGrid)
rffit2
## Random Forest
##
## 14718 samples
## 9 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 13244, 13248, 13246, 13245, 13247, 13246, ...
##
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 3 0.9849167 0.9809275 0.002841563 0.003589338
## 4 0.9846450 0.9805836 0.002583553 0.003262094
## 5 0.9846450 0.9805847 0.003329737 0.004203510
## 6 0.9834219 0.9790388 0.004321749 0.005455679
## 7 0.9830140 0.9785235 0.004343940 0.005482633
## 8 0.9809751 0.9759457 0.005567400 0.007028496
## 9 0.9781894 0.9724250 0.006248816 0.007887492
## 10 0.9780535 0.9722526 0.005873128 0.007411011
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
This model achieves 0.985 CV accuracy for mtry=3, which is confirmed by predicting on the held out testing set as shown below:
testPred <- predict(rffit2, newdata = testing)
confusionMatrix(testPred,testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1388 11 2 0 0
## B 3 925 5 0 6
## C 2 10 843 6 1
## D 2 3 5 797 2
## E 0 0 0 1 892
##
## Overall Statistics
##
## Accuracy : 0.988
## 95% CI : (0.9845, 0.9908)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9848
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9950 0.9747 0.9860 0.9913 0.9900
## Specificity 0.9963 0.9965 0.9953 0.9971 0.9998
## Pos Pred Value 0.9907 0.9851 0.9780 0.9852 0.9989
## Neg Pred Value 0.9980 0.9939 0.9970 0.9983 0.9978
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2830 0.1886 0.1719 0.1625 0.1819
## Detection Prevalence 0.2857 0.1915 0.1758 0.1650 0.1821
## Balanced Accuracy 0.9956 0.9856 0.9906 0.9942 0.9949
This model with only 9 predictors exhibit like the previous one a 100% accuracy regarding the 20 observations of the validation set. Note that it is possible to keep decreasing the number of predictors while keeping a relatively high accuracy (down to about 3-4 predictors and accuracy > 80%), however the objective here was to find a minimal model able to predict correctly the 20 validation obsevations (i.e. with less than 9 predictors I have not been able to get all of them correct).