This project aims to predict the manner in which a weight lifting exercise was performed. The data come from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in different fashions. According to the documentation available here, the quality of the execution of the exercise is divided into 5 classes (A to E): class A corresponds to an execution exactly according to the specification, while the other 4 classes correspond to common mistakes.
In order to come up with the best prediction model, we go through the following steps. We first read and clean the data. We then split the training data set into a training sample and a validation sample. We use k-fold cross-validation to tune and evaluate three machine learning models: a classification tree, a random forest and stochastic gradient boosting. The best model is selected and its out-of-sample error is estimated. We finally apply the model to the original testing data set.
Let’s download and read the training and testing data sets:
trainingFileName <- "pml-training.csv"
if( !file.exists(trainingFileName) ) {
trainingFileURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(trainingFileURL, trainingFileName, method = "curl")
}
testingFileName <- "pml-testing.csv"
if( !file.exists(testingFileName) ) {
testingFileURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(testingFileURL, testingFileName, method = "curl")
}
training <- read.csv(trainingFileName, header = TRUE)
testing <- read.csv(testingFileName, header = TRUE)
The training data set contains 19622 observations of 160 variables. The testing data set contains 20 observations of the same variables.
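These dimensions can be confirmed directly on the data frames (a minimal sanity check, not part of the modelling itself):
dim(training)
## [1] 19622   160
dim(testing)
## [1]  20 160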
We first remove the identification and time variables (the first 7 columns) from both data frames. These are not related to the classe variable that we want to predict.
training <- training[,-(1:7)]
testing <- testing[,-(1:7)]
Let’s take a look at variables with NA values in the testing data set.
NAs <- sapply(testing, function(x) mean(is.na(x) ) )
length(which(NAs > 0) )
## [1] 100
min(NAs[NAs > 0])
## [1] 1
There are 100 variables containing NA values. Moreover, one can see that for each of these variables the NA values represent 100% of the observations. We therefore cannot impute the missing values and instead discard all 100 variables from both data sets.
training <- training[,NAs == 0]
testing <- testing[,NAs == 0]
We need to make sure that the training data set does not have other variables with NA values.
sum(complete.cases(training) ) == dim(training)[1]
## [1] TRUE
There are no NA values left in the training data set.
The training data set is now made of 19622 observations of 53 variables. We partition this data set into a training sample (60% of the observations) and a validation sample (the remaining 40%).
library("caret")
inTrain <- createDataPartition(training$classe, p = 0.6, list = FALSE)
trainingSample <- training[inTrain,]
validationSample <- training[-inTrain,]
The training sample will be used to tune and compare the models considered in this project, while the validation sample will allow us to estimate the out-of-sample error of the best model.
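As a quick, optional check (a minimal sketch on top of the original analysis), one can verify the sizes of the two samples and that the stratified split preserves the class proportions:
dim(trainingSample)
dim(validationSample)
round(prop.table(table(trainingSample$classe) ), 3)
round(prop.table(table(validationSample$classe) ), 3)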
A classification tree, a random forest and a stochastic gradient boosting model are used to predict the manner in which the participants did the exercise (classe variable). We use all 52 predictors to do so. For the training scheme, we use a 5-fold cross-validation repeated 4 times, which yields 20 resampled accuracy values per model and allows us to compare their accuracy distributions.
control <- trainControl(method = "repeatedcv", number = 5, repeats = 4)
We now train the three models.
set.seed(64)
model_dt <- train(classe ~ ., method = 'rpart', trControl = control, data = trainingSample)
model_rf <- train(classe ~ ., method = 'rf', trControl = control, data = trainingSample, ntree = 200)
model_gbm <- train(classe ~ ., method = 'gbm', trControl = control, data = trainingSample, verbose = FALSE)
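Before comparing the models, one can optionally inspect the tuning parameter values retained by cross-validation for each fit (a minimal sketch; the exact values depend on the resampling):
model_dt$bestTune
model_rf$bestTune
model_gbm$bestTune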
The accuracy distribution for each model is summarized below (minimum, quartiles, median, mean and maximum):
results <- resamples(list(DT = model_dt, RF = model_rf, GBM = model_gbm) )
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: DT, RF, GBM
## Number of resamples: 20
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.9847 0.9868 0.9883 0.9888 0.9903 0.9945 0
## DT 0.5193 0.5815 0.6149 0.5986 0.6270 0.6412 0
## GBM 0.9503 0.9554 0.9594 0.9585 0.9610 0.9673 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.9807 0.9833 0.9852 0.9858 0.9878 0.9930 0
## DT 0.3729 0.4747 0.5164 0.4903 0.5313 0.5486 0
## GBM 0.9371 0.9436 0.9487 0.9475 0.9507 0.9586 0
One can see that the random forest model performs the best. We will use this model from now on.
Using the validation data sample, we extract the confusion matrix of the random forest model.
predict_validation <- predict(model_rf, newdata = validationSample)
confusionMatrix(predict_validation, validationSample$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2230 7 0 0 0
## B 1 1510 3 0 0
## C 1 1 1363 9 1
## D 0 0 2 1277 1
## E 0 0 0 0 1440
##
## Overall Statistics
##
## Accuracy : 0.9967
## 95% CI : (0.9951, 0.9978)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9958
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9991 0.9947 0.9963 0.9930 0.9986
## Specificity 0.9988 0.9994 0.9981 0.9995 1.0000
## Pos Pred Value 0.9969 0.9974 0.9913 0.9977 1.0000
## Neg Pred Value 0.9996 0.9987 0.9992 0.9986 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2842 0.1925 0.1737 0.1628 0.1835
## Detection Prevalence 0.2851 0.1930 0.1752 0.1631 0.1835
## Balanced Accuracy 0.9989 0.9970 0.9972 0.9963 0.9993
The out-of-sample error of the random forest model, estimated on the validation sample as 1 minus the accuracy reported above (1 − 0.9967), is about 0.33%.
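This estimate can be reproduced programmatically from the confusion matrix object returned by caret (a minimal sketch of the computation described above):
cm <- confusionMatrix(predict_validation, validationSample$classe)
out_of_sample_error <- 1 - as.numeric(cm$overall["Accuracy"])
round(100 * out_of_sample_error, 2)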
We apply the random forest model to the test data and print out the result.
predict_testing <- predict(model_rf, newdata = testing)
predict_testing
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
library("corrplot")
M <- cor(trainingSample[,-dim(trainingSample)[2]])
corrplot(M, method = 'pie', type = 'lower')
We observe very little correlation between the predictors. For this reason we do not perform any pre-processing, such as principal component analysis (PCA), to reduce the number of predictors based on their correlations.
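As a complementary check (a sketch, not part of the original analysis), caret's findCorrelation() can count how many predictors exceed a given absolute correlation cutoff; a small number supports the decision to skip PCA:
# indices of predictors with pairwise absolute correlation above 0.8 (the cutoff value is an assumption)
highCorr <- findCorrelation(M, cutoff = 0.8)
length(highCorr)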
# distributions of accuracy and kappa across the cross-validation resamples for the three models
bwplot(results)
# 15 most important predictors according to the random forest model
plot(varImp(model_rf), top = 15)