Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants.
They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here:
The data used to build the prediction model are:
Before analyzing the data, we download the files (if not already present) and load them, treating empty strings, "NA", and "#DIV/0!" entries as missing values:
if (!file.exists("pml-training.csv")) {
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", destfile="pml-training.csv")
}
if (!file.exists("pml-testing.csv")) {
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", destfile="pml-testing.csv")
}
pml_train <- read.csv("pml-training.csv", header = TRUE, na.strings = c("", "NA", "#DIV/0!"))
pml_test <- read.csv("pml-testing.csv", header = TRUE, na.strings = c("", "NA", "#DIV/0!"))
The file pml-training.csv has dimensions:
dim(pml_train)
## [1] 19622 160
Based on the number of rows, the following commands count the missing values in each column:
na_count <- lapply(pml_train, function(y) sum(is.na(y)))
length(na_count[na_count == 0])
## [1] 60
length(na_count[na_count > 19000])
## [1] 100
From this it is easy to see that 100 of the 160 columns consist almost entirely of missing (NA) values, with more than 19000 missing entries each. These columns are dropped from the model, as they cannot serve as useful predictors:
bad_predictors <- names(na_count[na_count > 19000])
pml_train <- pml_train[,!(names(pml_train) %in% bad_predictors)]
pml_test <- pml_test[,!(names(pml_test) %in% bad_predictors)]
Finally, I remove other unnecessary predictors (row index, user name, timestamps, and window indicator), which can easily be spotted just by looking at the remaining columns:
other_bad_predictors <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "new_window")
pml_train <- pml_train[,!(names(pml_train) %in% other_bad_predictors)]
pml_test <- pml_test[,!(names(pml_test) %in% other_bad_predictors)]
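As a quick sanity check (a sketch; the expected count follows from the numbers above, 160 - 100 - 6 = 54), the cleaned data sets should now contain 54 columns each:
# Verify that only 54 columns remain after removing the 106 unnecessary ones
dim(pml_train)
dim(pml_test)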
The classe column records the manner in which the exercise was performed, and it is the value that we want to predict:
summary(pml_train$classe)
## A B C D E
## 5580 3797 3422 3216 3607
As we can see, it is a factor with five levels: A, B, C, D, and E.
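Note that on R 4.0 and later, read.csv no longer converts character columns to factors by default, so classe may be read as a character vector. In that case it can be converted explicitly; a minimal sketch:
# Ensure classe is a factor (needed on R >= 4.0, where stringsAsFactors defaults to FALSE)
pml_train$classe <- as.factor(pml_train$classe)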
To estimate how well the model generalizes, we split the original training data, using 70% of it for training and the remaining 30% for testing (validation):
library("caret")
## Loading required package: lattice
## Loading required package: ggplot2
inTrain <- createDataPartition(y = pml_train$classe, p = 0.7, list = FALSE)
training <- pml_train[inTrain,]
testing <- pml_train[-inTrain,]
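For reproducibility, it is common to fix the random seed right before creating the partition; a minimal sketch (the seed value is arbitrary and chosen only for illustration, showing where the call would sit relative to the partitioning step above):
set.seed(12345)  # arbitrary seed so the 70/30 split is reproducible
inTrain <- createDataPartition(y = pml_train$classe, p = 0.7, list = FALSE)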
Now that we have isolated the necessary predictors, let us see how they correlate with one another, ordering them by the first principal component ("FPC" order). Due to the high number of predictors, I use corrplot to keep the display readable (leaving out the classe column):
library("corrplot")
cor_training <- cor(training[, -54])
corrplot(cor_training, method="circle", order = "FPC", type = "lower", tl.cex = 0.5)
The correlation values are displayed on a scale shading from red (negative correlation) to blue (positive correlation). Some pairs of predictors are strongly correlated, but overall the correlation structure looks fairly homogeneous and not too strong, so I will keep all of the predictors.
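If one did want to prune the most strongly correlated predictors instead, caret provides findCorrelation, which flags columns whose pairwise correlation exceeds a chosen cutoff; a sketch assuming a cutoff of 0.9:
# Indices (within cor_training) of predictors with pairwise correlation above 0.9
high_corr <- findCorrelation(cor_training, cutoff = 0.9)
length(high_corr)  # how many predictors would be flagged for removal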
Bagging may be a good choice of prediction model due to its capacity to deal with non-linear relationships while keeping both bias and variance low. However, to get better accuracy, a random forest is likely the better choice. The main problem with random forests is that training can turn out to be extremely slow.
In my case I have limited the cross-validation to 4 folds only. Also, to accelerate the process, I have used the parallel random forest implementation (parRF).
modControl <- trainControl(method = "cv", number = 4)
modFit <- train(classe ~ ., data = training, method = "parRF", trControl = modControl)
## Warning: executing %dopar% sequentially: no parallel backend registered
modFit
## Parallel Random Forest
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 10303, 10303, 10302, 10303
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9932301 0.9914361
## 27 0.9968698 0.9960406
## 53 0.9954866 0.9942913
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
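Note the warning above: no parallel backend was registered, so parRF actually ran sequentially. A minimal sketch of registering a backend with the doParallel package before calling train (the package choice and core count are assumptions, not part of the original setup):
library(doParallel)
cl <- makeCluster(max(1, detectCores() - 1))  # leave one core free for the OS
registerDoParallel(cl)
# ... run train(classe ~ ., data = training, method = "parRF", trControl = modControl) here ...
stopCluster(cl)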
Since a random forest cannot be displayed as a single tree, I fit a separate classification tree with rpart on the same training data to illustrate the general decision structure:
library(rpart)
library(rpart.plot)
tree <- rpart(classe ~ ., data=training, method="class")
prp(tree)
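To see which predictors drive the actual fitted random forest, caret's varImp can be applied to the trained model (a sketch; output not shown here):
# Variable importance ranking of the fitted random forest
varImp(modFit)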
In order to evaluate the accuracy of the model on the testing (validation) set, we apply it to obtain predictions:
prediction <- predict(modFit, testing)
accuracy <- postResample(prediction, testing$classe)
accuracy
## Accuracy Kappa
## 0.9974511 0.9967760
The confusion matrix provides an alternative way to evaluate the accuracy:
confusion_matrix <- confusionMatrix(testing$classe, prediction)
confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 1 0 0 0
## B 2 1135 2 0 0
## C 0 3 1020 3 0
## D 0 0 3 961 0
## E 0 0 0 1 1081
##
## Overall Statistics
##
## Accuracy : 0.9975
## 95% CI : (0.9958, 0.9986)
## No Information Rate : 0.2846
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9968
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9988 0.9965 0.9951 0.9959 1.0000
## Specificity 0.9998 0.9992 0.9988 0.9994 0.9998
## Pos Pred Value 0.9994 0.9965 0.9942 0.9969 0.9991
## Neg Pred Value 0.9995 0.9992 0.9990 0.9992 1.0000
## Prevalence 0.2846 0.1935 0.1742 0.1640 0.1837
## Detection Rate 0.2843 0.1929 0.1733 0.1633 0.1837
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9993 0.9978 0.9969 0.9976 0.9999
and we can estimate the out-of-sample error simply with:
1 - as.numeric(confusion_matrix$overall[1])
## [1] 0.002548853
Both accuracy evaluations returned a value of approximately 99.75%, so the estimated out-of-sample error is approximately 0.25%.
Finally, let us see the values that the model predicts for the 20 cases in the original testing set:
predict(modFit, pml_test)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E