This document defines a model to predict how a weight lifting exercise is being performed, using the dataset created by the Human Activity Recognition project at Groupware@LES.
The most relevant features were selected from the data, the data was split into training and test sets, and several models were fitted and evaluated. Of these, the random forest best predicted how the user performed the exercise, with an accuracy above 99% on the held-out test set.
Libraries being used:
library(caret)
library(rattle)
library(parallel)
library(doParallel)
Downloading data
training<-read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
na.strings = c("NA","","#DIV/0!"))
testing<-read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
na.strings = c("NA","","#DIV/0!"))
After a visual inspection of the data, the first seven columns were removed because they provide no relevant information for the prediction. Features with almost no values were also discarded. Finally, the remaining features were checked for near-zero variance.
# Remove unnecessary features
t<-training[,-1:-7]
# Remove features that are almost entirely missing (1900 or more NA values)
t<-t[, colSums(is.na(t)) < 1900]
# Check if there are other features that should be removed
nearZero<-nearZeroVar(t, saveMetrics = TRUE)
A visual inspection of the nearZero metrics confirmed that no further features needed to be removed. This leaves 53 features and the predicted value.
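The cutoff of 1900 missing values used in the filtering step depends on the number of rows in the dataset. As a minimal sketch of the same idea expressed as a fraction of missing values, the snippet below filters a small made-up data frame; the `toy` data and the 50% threshold are assumptions for illustration only, not values from this analysis.

```r
# Toy data frame standing in for the real training set (hypothetical values)
toy <- data.frame(
  mostly_na = c(NA, NA, NA, NA, 1),
  complete  = c(1, 2, 3, 4, 5),
  some_na   = c(1, NA, 3, 4, 5)
)

# Fraction of missing values in each column
na_fraction <- colMeans(is.na(toy))

# Keep columns with less than 50% missing values (threshold is an assumption)
max_na_fraction <- 0.5
toy_clean <- toy[, na_fraction < max_na_fraction, drop = FALSE]

names(toy_clean)
```

Expressing the threshold as a fraction keeps the filter meaningful if the number of rows changes.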
Highly correlated features will also be removed.
# Choosing features
c<-cor(t[,-length(t)])
# Cutoff set to 0.8 to reduce the number of features and keep model fitting tractable
correlated<-findCorrelation(c, cutoff=0.8)
t<-t[, -correlated]
With this final selection, only 40 features remain.
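To illustrate what a correlation cutoff does, here is a simplified base R sketch on simulated data. It is a stand-in for the idea, not caret's exact algorithm: `findCorrelation` chooses which member of a correlated pair to drop using mean absolute correlations, whereas this sketch simply drops the later column. The variables `x1`, `x2` and `x3` are made up for the example.

```r
set.seed(1)
x1 <- rnorm(100)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(100, sd = 0.1),  # nearly a copy of x1
  x3 = rnorm(100)                  # generated independently of x1
)

# Absolute pairwise correlations
cor_matrix <- abs(cor(toy))

# Look at each pair only once: zero out the diagonal and the lower triangle
cor_matrix[lower.tri(cor_matrix, diag = TRUE)] <- 0

# Drop any column that is highly correlated with an earlier column
to_drop <- names(which(apply(cor_matrix, 2, max) > 0.8))
toy_reduced <- toy[, !(names(toy) %in% to_drop), drop = FALSE]

names(toy_reduced)
```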
To keep the evaluation independent, the training dataset is split into one set for fitting the models (train_data) and one for testing them (test_data). The testing dataset is left untouched.
set.seed(1234)
inTrain<-createDataPartition(training$classe, p = 0.70, list = FALSE)
train_data<-t[inTrain,]
test_data<-t[-inTrain,]
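When the outcome is a factor, `createDataPartition` samples within each class so that the class proportions are roughly preserved in both parts. The following base R sketch shows that idea on made-up labels; it is an illustration under that assumption, not a reimplementation of caret's function.

```r
set.seed(1234)

# Hypothetical class labels with unequal class sizes
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class
in_train <- unlist(
  lapply(split(seq_along(labels), labels), function(idx) {
    idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
  }),
  use.names = FALSE
)

train_labels <- labels[in_train]
test_labels  <- labels[-in_train]

# Class proportions are the same in both parts
prop.table(table(train_labels))
prop.table(table(test_labels))
```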
Several models will be fitted to see which one best predicts how the exercise is being done.
To speed up training, a parallel cluster is registered first:
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
A tree model is fitted to the training data:
tree<-train(classe ~., data=train_data, method="rpart")
tree_predict<-predict(tree,test_data)
tree_cm<-confusionMatrix(tree_predict,test_data$classe)
We can see the resulting decision tree:
fancyRpartPlot(tree$finalModel, sub = "")
fitControl <- trainControl(method = "cv",
number = 10,
allowParallel = TRUE)
gbm<-train(classe ~., data=train_data, method="gbm",
trControl = fitControl)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2175
## 2 1.4741 nan 0.1000 0.1435
## 3 1.3860 nan 0.1000 0.1196
## 4 1.3125 nan 0.1000 0.0965
## 5 1.2528 nan 0.1000 0.0782
## 6 1.2037 nan 0.1000 0.0735
## 7 1.1586 nan 0.1000 0.0667
## 8 1.1165 nan 0.1000 0.0655
## 9 1.0774 nan 0.1000 0.0508
## 10 1.0458 nan 0.1000 0.0477
## 20 0.8181 nan 0.1000 0.0212
## 40 0.5809 nan 0.1000 0.0097
## 60 0.4492 nan 0.1000 0.0091
## 80 0.3661 nan 0.1000 0.0058
## 100 0.3058 nan 0.1000 0.0038
## 120 0.2569 nan 0.1000 0.0028
## 140 0.2247 nan 0.1000 0.0023
## 150 0.2080 nan 0.1000 0.0006
gbm_predict<-predict(gbm,test_data)
gbm_cm<-confusionMatrix(gbm_predict,test_data$classe)
# caret's argument is preProcess (capital P); a lowercase name is not recognised by train()
rf<-train(classe ~., data=train_data, preProcess=c("center","scale"),
method="rf",trControl = fitControl)
rf_predict<-predict(rf,test_data)
rf_cm<-confusionMatrix(rf_predict,test_data$classe)
# caret's argument is preProcess (capital P); a lowercase name is not recognised by train()
lda<-train(classe ~., data=train_data, preProcess=c("center","scale"),
method="lda",trControl = fitControl)
lda_predict<-predict(lda,test_data)
lda_cm<-confusionMatrix(lda_predict,test_data$classe)
ctrl <- trainControl(method = "repeatedcv", repeats = 10)
# caret's argument is preProcess (capital P); a lowercase name is not recognised by train()
svm <- train(classe~., data=train_data, preProcess=c("center","scale"),
method = "svmLinear", trControl = ctrl)
svm_predict<-predict(svm,test_data)
svm_cm<-confusionMatrix(svm_predict,test_data$classe)
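Once all models are fitted, the worker processes started with `makeCluster` can be released. The sketch below shows the pattern with the `parallel` package alone, using two workers and a trivial job as stand-ins for the real model fitting; both are assumptions for the example.

```r
library(parallel)

# Two workers for illustration; the analysis itself uses detectCores() - 1
cluster <- makeCluster(2)

# Run a trivial job on the workers and always release them afterwards,
# even if the job fails
squares <- tryCatch(
  parLapply(cluster, 1:4, function(i) i^2),
  finally = stopCluster(cluster)
)

unlist(squares)
```

When the cluster was registered with `registerDoParallel`, calling `registerDoSEQ()` from the foreach package after `stopCluster` returns later `train` calls to sequential execution.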
Checking the confusion matrix for the different algorithms used:
par(mfrow=c(2,2))
plot(tree_cm$table, main="TREE")
plot(gbm_cm$table, main="GBM")
plot(svm_cm$table, main="SVM")
plot(lda_cm$table, main="LDA")
par(mfrow=c(1,1))
plot(rf_cm$table, main="RandomForest")
As can be seen, the agreement between predicted and reference classes is almost perfect for the random forest, so that is the model used for the exercise.
rf_cm$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    9    0    0    0
##          B    1 1126    8    0    0
##          C    0    4 1017   13    0
##          D    0    0    1  949    2
##          E    0    0    0    2 1080
rf_cm$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull
##      0.9932031      0.9914011      0.9907559      0.9951399      0.2844520
## AccuracyPValue  McnemarPValue
##      0.0000000            NaN
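The reported accuracy can be checked by hand from the confusion matrix: it is the sum of the diagonal (correct predictions) divided by the total number of test cases. The sketch below re-enters the table printed above as a plain matrix; the names `rf_table`, `out_of_sample_error` and `sensitivity` are introduced here for illustration.

```r
# Random forest confusion matrix, rows = predictions, columns = reference
rf_table <- matrix(
  c(1673,    9,    0,   0,    0,
       1, 1126,    8,   0,    0,
       0,    4, 1017,  13,    0,
       0,    0,    1, 949,    2,
       0,    0,    0,   2, 1080),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases over all cases (5845 / 5885)
accuracy <- sum(diag(rf_table)) / sum(rf_table)

# Estimated out-of-sample error on the held-out test set
out_of_sample_error <- 1 - accuracy

# Per-class sensitivity: correct predictions over reference counts per class
sensitivity <- diag(rf_table) / colSums(rf_table)

round(accuracy, 7)
round(out_of_sample_error, 4)
```

This gives an accuracy of about 0.9932 and an estimated out-of-sample error of roughly 0.7%, in line with `rf_cm$overall`.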