Synopsis

There are currently many mobile applications and external devices that monitor human activity. The data they collect can later be studied, which has given rise to a dedicated field: Human Activity Recognition (HAR). In this project we try to predict how well a series of physical exercises is performed, comparing several prediction models to find the one that performs this task best.

Development

Exploratory analysis

The packages to be used for the project are:

library(readr)   # read_csv()
library(caret)   # createDataPartition(), train(), confusionMatrix()

Obtaining the data

The dataset was collected with wearable devices used for human activity recognition (HAR). It is released under a Creative Commons license (CC BY-SA). Source:

“Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of the 21st Brazilian Symposium on Artificial Intelligence, Advances in Artificial Intelligence - SBIA 2012. Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6”

The data come already divided into a testing set and a training set:

pml_testing <- read_csv("pml-testing.csv")
pml_training <- read_csv("pml-training.csv")
dim(pml_testing)
## [1]  20 160
dim(pml_training)
## [1] 19622   160

We partition the training data into subsets for training (75%) and testing (25%), setting a seed for reproducibility.

set.seed(2021)
inTrain <- createDataPartition(y=pml_training$classe, p=0.75, list=FALSE)
training <- pml_training[inTrain,]
testing <- pml_training[-inTrain,]
dim(training)
## [1] 14718   160
dim(testing)
## [1] 4904  160

Cleaning and exploring the data

The dependent variable is classe; a quick distribution check follows the list. It contains the values:

  • A: the exercise executed exactly according to the specification
  • B, C, D, E: common mistakes
    • B: throwing the elbows to the front
    • C: lifting the dumbbell only halfway
    • D: lowering the dumbbell only halfway
    • E: throwing the hips to the front
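
As a quick check, the distribution of these classes in the training partition can be tabulated (output omitted here):

# Distribution of the outcome classes in the training partition
table(training$classe)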

Many columns contain a large proportion of NA values. We drop the columns that do not carry enough information (more than 90% missing):

# Keep only the columns with fewer than 90% NA values
vector <- which(colMeans(is.na(training)) < 0.9)
training <- training[vector]
testing <- testing[vector]
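
caret also offers nearZeroVar() as a complementary way to flag uninformative predictors; a sketch on the same training data, not used for the results reported here:

# Complementary approach: flag near-zero-variance predictors with caret
nzv <- nearZeroVar(training)          # column indices with near-zero variance
# training_nzv <- training[, -nzv]    # hypothetical filtered copy (unused here)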

We also remove the bookkeeping variables, which carry no predictive information.

# Drop the first six columns (record ID, user, and timestamp/window metadata)
training = training[,-c(1:6)]
testing = testing[,-c(1:6)]

Training the model

After analyzing several classification models, we opted for a random forest. caret's train() function is used so that resampling selects the optimal tuning parameter.

set.seed(2021)
# Fit a random forest; train() tunes mtry via resampling
modrf <- train(classe~ .,data=training,method="rf")
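
With the default settings, train() uses bootstrap resampling (25 repetitions), which can be slow on this dataset. A faster alternative sketch using 5-fold cross-validation (the results reported below come from the default settings):

# Optional: 5-fold cross-validation instead of the default bootstrap
ctrl <- trainControl(method = "cv", number = 5)
modrf_cv <- train(classe ~ ., data = training, method = "rf", trControl = ctrl)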

The final fitted model is shown below.

modrf$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.22%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4184    0    0    0    1 0.0002389486
## B    8 2837    3    0    0 0.0038623596
## C    0    4 2561    2    0 0.0023373588
## D    0    0    9 2402    1 0.0041459370
## E    0    0    0    5 2701 0.0018477458
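
The OOB error can also be read programmatically from the fitted randomForest object (a small sketch; err.rate holds the per-tree error curve):

# OOB error of the full forest: last entry of the OOB error curve
tail(modrf$finalModel$err.rate[, "OOB"], 1)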

The resampled accuracy and the OOB error rate (0.22%) show how reliable the model is:

modrf$results
##   mtry  Accuracy     Kappa  AccuracySD     KappaSD
## 1    2 0.9925794 0.9906088 0.001383569 0.001756202
## 2   27 0.9958565 0.9947562 0.001365514 0.001731226
## 3   53 0.9910977 0.9887341 0.001981400 0.002511575
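
The effect of mtry on the resampled accuracy can also be inspected visually (plot output not shown):

# Accuracy as a function of the number of randomly selected predictors
plot(modrf)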

Validation on the held-out set

We proceed to validate the model by applying it to the held-out testing set.

predictionDATA <- predict(modrf,testing)

We assess the predictions with a confusion matrix.

confusionM <- confusionMatrix(table(predictionDATA,testing$classe))
confusionM
## Confusion Matrix and Statistics
## 
##               
## predictionDATA    A    B    C    D    E
##              A 1395    5    0    0    0
##              B    0  939    0    0    0
##              C    0    4  855    5    0
##              D    0    1    0  799    2
##              E    0    0    0    0  899
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9965         
##                  95% CI : (0.9945, 0.998)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9956         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9895   1.0000   0.9938   0.9978
## Specificity            0.9986   1.0000   0.9978   0.9993   1.0000
## Pos Pred Value         0.9964   1.0000   0.9896   0.9963   1.0000
## Neg Pred Value         1.0000   0.9975   1.0000   0.9988   0.9995
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1915   0.1743   0.1629   0.1833
## Detection Prevalence   0.2855   0.1915   0.1762   0.1635   0.1833
## Balanced Accuracy      0.9993   0.9947   0.9989   0.9965   0.9989

The Kappa statistic lets us estimate the error: 1 - Kappa = 1 - 0.9956 = 0.0044. Equivalently, the accuracy gives an estimated out-of-sample error of 1 - 0.9965 = 0.0035.
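
Both estimates can be computed directly from the confusion matrix object:

# Estimated out-of-sample error on the held-out set
1 - confusionM$overall["Accuracy"]   # ~0.0035
1 - confusionM$overall["Kappa"]      # ~0.0044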

Prediction on a separate evaluation set

A separate evaluation set of 20 cases was provided. It serves to show how future data must be processed before being passed to the model: we apply the same cleaning steps and then predict.

# Apply the same cleaning steps used on the training data, then predict
pml_testing = pml_testing[vector]
pml_testing = pml_testing[,-c(1:6)]
predictionDATANEW <- predict(modrf,pml_testing)
predictionDATANEW
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Conclusion

The decision to use a random forest in this project responds to its specific needs: some interpretability is lost, but here predictive accuracy on a classification task mattered more. It is worth stressing that the best model always depends on the particular requirements, such as interpretability, predictive performance, the shape of the data, the available computing resources, and so on.