Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants.
They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here:
The data used to build the prediction model are:
Before analyzing the data, we download the files (if not already present) and load them, treating empty strings, "NA", and "#DIV/0!" entries as missing values:
if (!file.exists("pml-training.csv")) {
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", destfile="pml-training.csv")
}
if (!file.exists("pml-testing.csv")) {
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", destfile="pml-testing.csv")
}
pml_train <- read.csv("pml-training.csv", header = TRUE, na.strings = c("", "NA", "#DIV/0!"))
pml_test <- read.csv("pml-testing.csv", header = TRUE, na.strings = c("", "NA", "#DIV/0!"))
The file pml-training.csv has dimensions:
dim(pml_train)
## [1] 19622 160
Based on the number of rows, the following commands count the missing values in each column:
na_count <- lapply(pml_train, function(y) sum(is.na(y)))
length(na_count[na_count == 0])
## [1] 60
length(na_count[na_count > 19000])
## [1] 100
From this it is easy to see that 100 of the 160 columns consist almost entirely of missing (NA) values, with more than 19000 missing entries each. These columns are dropped from the model, as they cannot serve as useful predictors:
bad_predictors <- names(na_count[na_count > 19000])
pml_train <- pml_train[,!(names(pml_train) %in% bad_predictors)]
pml_test <- pml_test[,!(names(pml_test) %in% bad_predictors)]
Finally, I remove other unnecessary predictors (row index, user name, timestamps, and window indicator), which can easily be spotted just by looking at the remaining columns:
other_bad_predictors <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "new_window")
pml_train <- pml_train[,!(names(pml_train) %in% other_bad_predictors)]
pml_test <- pml_test[,!(names(pml_test) %in% other_bad_predictors)]
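As a quick sanity check (a sketch; the expected count follows from the numbers above, 160 - 100 - 6 = 54), the cleaned data sets should now contain 54 columns each:
# Verify that only 54 columns remain after removing the 106 unnecessary ones
dim(pml_train)
dim(pml_test)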
The classe column records the manner in which the exercise was performed, and it is the value that we want to predict:
summary(pml_train$classe)
## A B C D E
## 5580 3797 3422 3216 3607
As we can see, it is a factor with five levels: A, B, C, D, and E.
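Note that on R 4.0 and later, read.csv no longer converts character columns to factors by default, so classe may be read as a character vector. In that case it can be converted explicitly; a minimal sketch:
# Ensure classe is a factor (needed on R >= 4.0, where stringsAsFactors defaults to FALSE)
pml_train$classe <- as.factor(pml_train$classe)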
To estimate how well the model generalizes, we split the original training data, using 70% of it for training and the remaining 30% for testing (validation):
library("caret")
## Loading required package: lattice
## Loading required package: ggplot2
inTrain <- createDataPartition(y = pml_train$classe, p = 0.7, list = FALSE)
training <- pml_train[inTrain,]
testing <- pml_train[-inTrain,]
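For reproducibility, it is common to fix the random seed right before creating the partition; a minimal sketch (the seed value is arbitrary and chosen only for illustration, showing where the call would sit relative to the partitioning step above):
set.seed(12345)  # arbitrary seed so the 70/30 split is reproducible
inTrain <- createDataPartition(y = pml_train$classe, p = 0.7, list = FALSE)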
Now that we have isolated the necessary predictors, let us see how they correlate with one another, ordering them by the first principal component ("FPC" order). Due to the high number of predictors, I use corrplot to keep the display readable (leaving out the classe column):
library("corrplot")
cor_training <- cor(training[, -54])
corrplot(cor_training, method="circle", order = "FPC", type = "lower", tl.cex = 0.5)
The correlation values are displayed on a scale shading from red (negative correlation) to blue (positive correlation). Some pairs of predictors are strongly correlated, but overall the correlation structure looks fairly homogeneous and not too strong, so I will keep all of the predictors.
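If one did want to prune the most strongly correlated predictors instead, caret provides findCorrelation, which flags columns whose pairwise correlation exceeds a chosen cutoff; a sketch assuming a cutoff of 0.9:
# Indices (within cor_training) of predictors with pairwise correlation above 0.9
high_corr <- findCorrelation(cor_training, cutoff = 0.9)
length(high_corr)  # how many predictors would be flagged for removal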
Bagging may be a good choice of prediction model due to its capacity to deal with non-linear relationships while keeping both bias and variance low. However, to get better accuracy, a random forest is likely the better choice. The main problem with random forests is that training can turn out to be extremely slow.
In my case I have limited the cross-validation to 4 folds only. Also, to accelerate the process, I have used the parallel random forest implementation (parRF).
modControl <- trainControl(method = "cv", number = 4)
modFit <- train(classe ~ ., data = training, method = "parRF", trControl = modControl)
## Warning: executing %dopar% sequentially: no parallel backend registered
modFit
## Parallel Random Forest
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 10303, 10303, 10302, 10303
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9932301 0.9914361
## 27 0.9968698 0.9960406
## 53 0.9954866 0.9942913
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
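Note the warning above: no parallel backend was registered, so parRF actually ran sequentially. A minimal sketch of registering a backend with the doParallel package before calling train (the package choice and core count are assumptions, not part of the original setup):
library(doParallel)
cl <- makeCluster(max(1, detectCores() - 1))  # leave one core free for the OS
registerDoParallel(cl)
# ... run train(classe ~ ., data = training, method = "parRF", trControl = modControl) here ...
stopCluster(cl)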
Since a random forest cannot be displayed as a single tree, I fit a separate classification tree with rpart on the same training data to illustrate the general decision structure:
library(rpart)
library(rpart.plot)
tree <- rpart(classe ~ ., data=training, method="class")
prp(tree)
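To see which predictors drive the actual fitted random forest, caret's varImp can be applied to the trained model (a sketch; output not shown here):
# Variable importance ranking of the fitted random forest
varImp(modFit)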
In order to evaluate the accuracy of the model on the testing (validation) set, we apply it to obtain predictions:
prediction <- predict(modFit, testing)
accuracy <- postResample(prediction, testing$classe)
accuracy
## Accuracy Kappa
## 0.9974511 0.9967760
The confusion matrix provides an alternative way to evaluate the accuracy:
confusion_matrix <- confusionMatrix(testing$classe, prediction)
confusion_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 1 0 0 0
## B 2 1135 2 0 0
## C 0 3 1020 3 0
## D 0 0 3 961 0
## E 0 0 0 1 1081
##
## Overall Statistics
##
## Accuracy : 0.9975
## 95% CI : (0.9958, 0.9986)
## No Information Rate : 0.2846
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9968
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9988 0.9965 0.9951 0.9959 1.0000
## Specificity 0.9998 0.9992 0.9988 0.9994 0.9998
## Pos Pred Value 0.9994 0.9965 0.9942 0.9969 0.9991
## Neg Pred Value 0.9995 0.9992 0.9990 0.9992 1.0000
## Prevalence 0.2846 0.1935 0.1742 0.1640 0.1837
## Detection Rate 0.2843 0.1929 0.1733 0.1633 0.1837
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9993 0.9978 0.9969 0.9976 0.9999
and we can estimate the out-of-sample error simply with:
1 - as.numeric(confusion_matrix$overall[1])
## [1] 0.002548853
Both accuracy evaluations returned a value of approximately 99.75%, so the estimated out-of-sample error is approximately 0.25%.
Finally, let us see the values that the model predicts for the 20 cases in the original testing set:
predict(modFit, pml_test)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E