Many mobile applications and external tools now make it possible to monitor human activity. The data they collect can be studied later, which has given rise to a dedicated field: Human Activity Recognition (HAR). In this project we try to predict how well a series of physical exercises is performed. We consider several prediction models and keep the one that performs this task best.
The packages to be used for the project are:
library(readr)
library(caret)
Obtaining the data
The dataset was compiled from various tools used for human activity recognition (HAR). It is released under the Creative Commons CC BY-SA license. Source:
“Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6”
The data come in two files, one for training and one for testing:
pml_testing <- read_csv("pml-testing.csv")
pml_training <- read_csv("pml-training.csv")
dim(pml_testing)
## [1] 20 160
dim(pml_training)
## [1] 19622 160
We split the training file into a training partition (75%) and a testing partition (25%). We set a seed so that the split is reproducible.
set.seed(2021)
inTrain <- createDataPartition(y=pml_training$classe, p=0.75, list=FALSE)
training <- pml_training[inTrain,]
testing <- pml_training[-inTrain,]
dim(training)
## [1] 14718 160
dim(testing)
## [1] 4904 160
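To illustrate what a stratified split does, the following base-R sketch draws 75% of the row indices within each class, so that the class proportions are preserved in both parts. It only approximates the idea behind createDataPartition and is not caret's implementation; the helper stratified_split and the toy labels are hypothetical.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_split <- function(y, p = 0.75) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # Index into idx so that classes with a single row are handled correctly.
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(2021)
# Toy labels standing in for pml_training$classe (made-up class sizes).
toy_classe <- factor(rep(c("A", "B", "C", "D", "E"),
                         times = c(40, 28, 24, 24, 28)))
in_train <- stratified_split(toy_classe, p = 0.75)

# Class proportions in the full data and in the training part.
prop.table(table(toy_classe))
prop.table(table(toy_classe[in_train]))
```

With these toy class sizes, the two proportion tables coincide, which is the property that matters when the outcome classes are not equally frequent.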
Cleaning and exploring the data
The dependent variable is classe, which takes the values A, B, C, D and E.
Many columns consist almost entirely of missing values (NA). We drop every column in which 90% or more of the values are missing.
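# Collect the positions of the columns with less than 90% missing values
vector <- NULL
intervalo <- seq_along(training)
for (i in intervalo) {
  if (mean(is.na(training[[i]])) < 0.9) vector <- append(vector, i)
}
# Keep the same columns in both partitions
training <- training[vector]
testing <- testing[vector]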
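The same filter can be written without an explicit loop. The sketch below shows a vectorized variant on a small made-up data frame; the column names and values are illustrative only, and the 0.9 threshold is the one used above.

```r
# Toy data frame standing in for the training partition (values are made up).
toy <- data.frame(
  sensor_a = c(1.4, 1.5, 1.6, 1.7),
  sensor_b = c(NA, NA, NA, NA),       # 100% missing: dropped
  sensor_c = c(0.1, NA, 0.3, 0.4),    # 25% missing: kept
  classe   = c("A", "A", "B", "B")
)

# Share of missing values per column, computed in one vectorized call.
na_share <- colMeans(is.na(toy))

# Positions of the columns with less than 90% missing values.
keep <- which(na_share < 0.9)
toy_clean <- toy[keep]
names(toy_clean)
```

Storing the positions in keep mirrors the role of vector in the project code: the same index can later be applied to any other data frame with the same column layout.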
We also remove the first six columns, which are not needed as predictors for the project.
training = training[,-c(1:6)]
testing = testing[,-c(1:6)]
After comparing several classification models, we opted for a random forest. The model is fitted with caret's train function, which selects the tuning parameter by resampling.
set.seed(2021)
modrf <- train(classe~ .,data=training,method="rf")
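By default, train evaluates each candidate value of mtry by bootstrap resampling, which can be slow on a data set of this size. The sketch below shows how the resampling scheme could be set explicitly with trainControl. It uses the built-in iris data so that it runs quickly; the 5-fold setting, the single mtry value and ntree = 50 are illustrative assumptions, not the settings used in this project.

```r
library(caret)

set.seed(2021)
# Assumed setting for illustration: 5-fold cross-validation.
ctrl <- trainControl(method = "cv", number = 5)

# Small example on iris; method = "rf" requires the randomForest package.
fit <- train(Species ~ ., data = iris, method = "rf",
             trControl = ctrl,
             tuneGrid = data.frame(mtry = 2),  # a single candidate value
             ntree = 50)                        # passed on to randomForest

fit$results
```

The project model modrf above was fitted with the defaults of train, so none of these settings apply to the results that follow.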
The final model is shown below.
modrf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.22%
## Confusion matrix:
## A B C D E class.error
## A 4184 0 0 0 1 0.0002389486
## B 8 2837 3 0 0 0.0038623596
## C 0 4 2561 2 0 0.0023373588
## D 0 0 9 2402 1 0.0041459370
## E 0 0 0 5 2701 0.0018477458
The out-of-bag (OOB) error rate of 0.22%, together with the resampled accuracy for each candidate value of mtry, indicates how reliable the model is.
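As a cross-check, the OOB error rate and the class.error column can be recomputed from the counts in the confusion matrix printed above. The sketch below copies those counts by hand; the object names are arbitrary.

```r
# OOB confusion matrix copied from the model printout (rows: actual class).
oob <- matrix(c(4184,    0,    0,    0,    1,
                   8, 2837,    3,    0,    0,
                   0,    4, 2561,    2,    0,
                   0,    0,    9, 2402,    1,
                   0,    0,    0,    5, 2701),
              nrow = 5, byrow = TRUE,
              dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5]))

# Overall OOB error: share of off-diagonal (misclassified) cases, in percent.
oob_error <- 1 - sum(diag(oob)) / sum(oob)
round(100 * oob_error, 2)

# Per-class error: misclassified cases in each row divided by the row total.
class_error <- 1 - diag(oob) / rowSums(oob)
round(class_error, 4)
```

Only 33 of the 14718 training cases are off the diagonal, which is where the 0.22% comes from.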
modrf$results
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.9925794 0.9906088 0.001383569 0.001756202
## 2 27 0.9958565 0.9947562 0.001365514 0.001731226
## 3 53 0.9910977 0.9887341 0.001981400 0.002511575
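For a classification model, train keeps by default the candidate with the highest resampled accuracy. The sketch below reproduces that choice from the figures in the table above, copied by hand into a data frame.

```r
# Resampling results copied from the modrf$results printout.
results <- data.frame(
  mtry     = c(2, 27, 53),
  Accuracy = c(0.9925794, 0.9958565, 0.9910977),
  Kappa    = c(0.9906088, 0.9947562, 0.9887341)
)

# The candidate with the highest resampled accuracy.
best <- results[which.max(results$Accuracy), ]
best$mtry
```

This agrees with the final model printout, which reports 27 variables tried at each split.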
We now validate the model by applying it to the held-out testing partition.
predictionDATA <- predict(modrf,testing)
We assess the predictions with a confusion matrix.
confusionM <- confusionMatrix(table(predictionDATA,testing$classe))
confusionM
## Confusion Matrix and Statistics
##
##
## predictionDATA A B C D E
## A 1395 5 0 0 0
## B 0 939 0 0 0
## C 0 4 855 5 0
## D 0 1 0 799 2
## E 0 0 0 0 899
##
## Overall Statistics
##
## Accuracy : 0.9965
## 95% CI : (0.9945, 0.998)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9956
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9895 1.0000 0.9938 0.9978
## Specificity 0.9986 1.0000 0.9978 0.9993 1.0000
## Pos Pred Value 0.9964 1.0000 0.9896 0.9963 1.0000
## Neg Pred Value 1.0000 0.9975 1.0000 0.9988 0.9995
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1915 0.1743 0.1629 0.1833
## Detection Prevalence 0.2855 0.1915 0.1762 0.1635 0.1833
## Balanced Accuracy 0.9993 0.9947 0.9989 0.9965 0.9989
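Accuracy on the held-out partition is 0.9965, so the estimated out-of-sample error is 1 - 0.9965 = 0.0035 (about 0.35%). The Kappa of 0.9956 confirms that the agreement between predictions and true classes is far above chance level.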
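As a cross-check, accuracy and Kappa can be recomputed from the counts in the test-set confusion matrix printed above. The sketch below copies those counts by hand and applies the standard definition of Cohen's Kappa; the object names are arbitrary.

```r
# Test-set confusion matrix copied from the printout (rows: predicted class).
cm <- matrix(c(1395,   5,   0,   0,   0,
                  0, 939,   0,   0,   0,
                  0,   4, 855,   5,   0,
                  0,   1,   0, 799,   2,
                  0,   0,   0,   0, 899),
             nrow = 5, byrow = TRUE,
             dimnames = list(predicted = LETTERS[1:5], actual = LETTERS[1:5]))

n <- sum(cm)

# Observed agreement: share of cases on the diagonal (the accuracy).
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa: agreement beyond chance, rescaled to a maximum of 1.
kappa_hat <- (accuracy - expected) / (1 - expected)

round(c(accuracy = accuracy, error = 1 - accuracy, kappa = kappa_hat), 4)
```

The misclassification rate on the held-out partition is 1 - accuracy, that is, 17 wrong predictions out of 4904 cases.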
A separate set of 20 cases was also provided. We apply the same preprocessing steps to it and then use the model to predict the class of each case.
pml_testing = pml_testing[vector]
pml_testing = pml_testing[,-c(1:6)]
predictionDATANEW <- predict(modrf,pml_testing)
predictionDATANEW
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
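Selecting columns by position assumes that the new file has exactly the same column order as the training file. A slightly more defensive variant, sketched below on toy data, selects the predictors by name instead. The data frames and column names are made up for illustration.

```r
# Toy stand-ins for the cleaned training set and the new cases (made-up values).
toy_training <- data.frame(sensor_a = c(1.0, 2.0),
                           sensor_b = c(0.5, 0.7),
                           classe   = c("A", "B"))
toy_new <- data.frame(problem_id = 1:2,
                      sensor_b   = c(0.6, 0.8),
                      sensor_a   = c(1.5, 2.5))

# Predictor names are every training column except the outcome.
predictors <- setdiff(names(toy_training), "classe")

# Fail early if the new data lacks any predictor the model expects.
stopifnot(all(predictors %in% names(toy_new)))

# Select and order the new data's columns to match the training predictors.
toy_new_clean <- toy_new[, predictors, drop = FALSE]
names(toy_new_clean)
```

The early check turns a silent column mismatch into an explicit error before predict is ever called.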
The choice of a random forest for this project responds to specific needs. Some interpretability is lost, but in this case predictive accuracy mattered more, and the model had to handle a classification task. In general, the best model depends on particular needs such as interpretability, predictive performance, the shape of the data, the computing resources available, etc.