Overview

In this document we define a model to predict how a weight lifting exercise is being performed. To do that, we use the dataset created by the Human Activity Recognition project at Groupware@LES.

From the data, the most relevant features were selected, the data was split into training and test sets, and several models were fitted and evaluated. The evaluation showed that the RandomForest model best predicts the way the user performed the exercise, with an accuracy above 99% on the held-out test data.

Preparation

Libraries being used:

library(caret)
library(rattle)
library(parallel)
library(doParallel)

Downloading data

training<-read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", 
                   na.strings = c("NA","","#DIV/0!"))
testing<-read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
          na.strings = c("NA","","#DIV/0!"))
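
To illustrate what the na.strings argument does, here is a minimal sketch on an inline toy CSV. The column names and values are hypothetical and are not taken from the real files; only the three missing-value markers match the calls above.

```r
# Toy CSV text (hypothetical values) containing the three markers
# that the read.csv() calls above treat as missing.
csv_text <- "roll,kurtosis\n1.5,#DIV/0!\n,2.0\nNA,3.5\n"
toy <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

# Every marker became NA, so both columns are parsed as numeric
# instead of falling back to character.
str(toy)
colSums(is.na(toy))
```

Without "#DIV/0!" in na.strings, a column containing that marker would be read as text rather than as numbers.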

Feature selection

After a visual inspection of the data, the first seven columns were removed because they provide no relevant information for the prediction. Features with almost no values were also discarded. Finally, the remaining features were checked for near-zero variance to see whether any of them carried no useful information.

# Remove unnecessary features
t<-training[,-1:-7]
# Remove features that are mostly missing (1900 or more NA values)
t<-t[, colSums(is.na(t)) < 1900]
# Check if there are other features that should be removed
nearZero<-nearZeroVar(t, saveMetrics = TRUE)

After visual inspection of the nearZero data frame, no further features needed to be removed. This leaves 53 features and the predicted value.
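
The missing-value filter can be sketched on a small data frame. This is an illustration only: the toy columns and the 50% threshold are assumptions, and it uses a fraction of missing values (colMeans) rather than the absolute count used above.

```r
# Toy data (hypothetical): one complete feature, one mostly missing
# feature, and the outcome column.
toy <- data.frame(
  full   = c(1, 2, 3, 4, 5),
  sparse = c(NA, NA, NA, NA, 5),
  classe = factor(c("A", "B", "A", "B", "A"))
)

# Fraction of missing values per column.
na_fraction <- colMeans(is.na(toy))

# Keep only columns where less than half of the values are missing
# (0.5 is an assumed threshold for this sketch).
kept <- toy[, na_fraction < 0.5, drop = FALSE]
names(kept)
```

A fractional threshold has the advantage of not depending on the number of rows in the dataset.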

Highly correlated features will also be removed.

# Choosing features
c<-cor(t[,-length(t)])

# Cutoff has been set to 0.8, since there are too many features and it may be difficult to compute models
correlated<-findCorrelation(c, cutoff=0.8)
t<-t[, -correlated]

With this final selection, only 40 features remain.
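
As a rough illustration of what a correlation cutoff does, the base R sketch below drops one column from each highly correlated pair in a simulated data frame. It is a simplification and not caret's algorithm: findCorrelation decides which member of a pair to drop based on mean absolute correlations, while this sketch simply drops the later column. All names and data here are hypothetical.

```r
set.seed(1)
x1 <- rnorm(100)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(100, sd = 0.05),  # nearly a copy of x1
  x3 = rnorm(100)                   # generated independently of x1
)

# Absolute pairwise correlations; only the upper triangle is inspected
# so that each pair is considered once.
cm <- abs(cor(toy))
high <- which(cm > 0.8 & upper.tri(cm), arr.ind = TRUE)

# Simplified rule (an assumption of this sketch): drop the later column
# of every pair above the cutoff.
drop_cols <- unique(high[, "col"])
reduced <- toy[, setdiff(seq_along(toy), drop_cols), drop = FALSE]
names(reduced)
```

Using setdiff() instead of a negative index keeps all columns when no pair exceeds the cutoff, whereas indexing with an empty negative vector would select no columns at all.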

Database split

To keep the evaluation independent, the training dataset is split into one part for training the models (train_data) and one part for testing them (test_data). The testing set is kept untouched.

set.seed(1234)
inTrain<-createDataPartition(training$classe, p = 0.70, list = FALSE)
train_data<-t[inTrain,]
test_data<-t[-inTrain,]
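
createDataPartition samples within each class of the outcome, so both parts keep roughly the same class proportions. The base R sketch below mimics that idea on hypothetical labels; it is not caret's implementation, only an illustration of a stratified 70/30 split.

```r
set.seed(1234)

# Hypothetical outcome with unbalanced classes.
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Draw 70% of the row indices within each class separately.
idx_by_class <- split(seq_along(labels), labels)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE))

# Class proportions are preserved in both parts.
prop.table(table(labels[in_train]))
prop.table(table(labels[-in_train]))
```

The same check, prop.table(table(...)), can be applied to train_data$classe and test_data$classe to confirm that the split above is balanced.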

Fitting models

Several models are fitted to find out which one best predicts how the exercise is being done.

To speed up the training process, a parallel cluster is registered first:

cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
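
A cluster created with makeCluster keeps its worker processes alive until it is stopped explicitly. The sketch below shows the full create, use, and stop life cycle with the parallel package alone; the worker count of 2 and the squared-number task are placeholders chosen for illustration.

```r
library(parallel)

# Hypothetical small cluster; the report uses detectCores() - 1 instead.
cl <- makeCluster(2)

# Run a trivial task on the workers and make sure the cluster is
# stopped even if the task fails.
squares <- tryCatch(
  parSapply(cl, 1:8, function(i) i^2),
  finally = stopCluster(cl)
)
squares
```

Wrapping the work in tryCatch(..., finally = ...) is one way to avoid leaving orphaned worker processes behind when a long-running fit is interrupted by an error.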

Tree

A decision tree model (rpart) is fitted to the training data:

tree<-train(classe ~., data=train_data, method="rpart")

tree_predict<-predict(tree,test_data)
tree_cm<-confusionMatrix(tree_predict,test_data$classe)

We can see what the decision tree looks like in this scenario:

fancyRpartPlot(tree$finalModel, sub = "")

GBM

fitControl <- trainControl(method = "cv",
                           number = 10,
                           allowParallel = TRUE)
gbm<-train(classe ~., data=train_data, method="gbm", 
           trControl = fitControl)
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.6094             nan     0.1000    0.2175
##      2        1.4741             nan     0.1000    0.1435
##      3        1.3860             nan     0.1000    0.1196
##      4        1.3125             nan     0.1000    0.0965
##      5        1.2528             nan     0.1000    0.0782
##      6        1.2037             nan     0.1000    0.0735
##      7        1.1586             nan     0.1000    0.0667
##      8        1.1165             nan     0.1000    0.0655
##      9        1.0774             nan     0.1000    0.0508
##     10        1.0458             nan     0.1000    0.0477
##     20        0.8181             nan     0.1000    0.0212
##     40        0.5809             nan     0.1000    0.0097
##     60        0.4492             nan     0.1000    0.0091
##     80        0.3661             nan     0.1000    0.0058
##    100        0.3058             nan     0.1000    0.0038
##    120        0.2569             nan     0.1000    0.0028
##    140        0.2247             nan     0.1000    0.0023
##    150        0.2080             nan     0.1000    0.0006
gbm_predict<-predict(gbm,test_data)
gbm_cm<-confusionMatrix(gbm_predict,test_data$classe)

RandomForest

# caret's argument is spelled preProcess (capital P)
rf<-train(classe ~., data=train_data, preProcess=c("center","scale"), 
          method="rf",trControl = fitControl)

rf_predict<-predict(rf,test_data)
rf_cm<-confusionMatrix(rf_predict,test_data$classe)

LDA

# caret's argument is spelled preProcess (capital P)
lda<-train(classe ~., data=train_data, preProcess=c("center","scale"),
           method="lda",trControl = fitControl)
lda_predict<-predict(lda,test_data)
lda_cm<-confusionMatrix(lda_predict,test_data$classe)

SVM

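ctrl <- trainControl(method = "repeatedcv", repeats = 10)
# caret's argument is spelled preProcess (capital P)
svm <- train(classe~., data=train_data, preProcess=c("center","scale"), 
             method = "svmLinear", trControl = ctrl)

svm_predict<-predict(svm,test_data)
svm_cm<-confusionMatrix(svm_predict,test_data$classe)

# All models are trained: release the parallel workers
stopCluster(cluster)
registerDoSEQ()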

Confusion Matrix

Checking the confusion matrix for the different algorithms used:

par(mfrow=c(2,2))
plot(tree_cm$table, main="TREE")
plot(gbm_cm$table, main="GBM")
plot(svm_cm$table, main="SVM")
plot(lda_cm$table, main="LDA")

par(mfrow=c(1,1))
plot(rf_cm$table, main="RandomForest")

As can be seen, the agreement between predicted and reference classes is almost perfect for RandomForest, so that is the model used for the exercise.

rf_cm$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    9    0    0    0
##          B    1 1126    8    0    0
##          C    0    4 1017   13    0
##          D    0    0    1  949    2
##          E    0    0    0    2 1080
rf_cm$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9932031      0.9914011      0.9907559      0.9951399      0.2844520 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN
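
The accuracy reported above can be recomputed by hand from the confusion matrix. The sketch below re-enters the table printed by rf_cm$table and derives the overall accuracy and the per-class sensitivity; the variable names are chosen for this illustration only.

```r
# Confusion matrix copied from the rf_cm$table output above
# (rows are predictions, columns are reference classes).
cm <- matrix(
  c(1673,    9,    0,   0,    0,
       1, 1126,    8,   0,    0,
       0,    4, 1017,  13,    0,
       0,    0,    1, 949,    2,
       0,    0,    0,   2, 1080),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Overall accuracy: correctly classified cases (the diagonal)
# divided by all cases, i.e. 5845 / 5885.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class sensitivity: correct predictions divided by the number
# of reference cases in each class.
sensitivity <- diag(cm) / colSums(cm)

round(accuracy, 7)        # 0.9932031
round(1 - accuracy, 4)    # estimated out-of-sample error, 0.0068
round(sensitivity, 4)
```

The 40 misclassified cases out of 5885 correspond to an estimated out-of-sample error of about 0.68% on test_data.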