Prediction model applied to Human Activity Recognition (HAR)

Development

Exploratory analysis

The packages to be used for the project are:

library(readr)
library(caret)

Obtaining the data

The data come from a human activity recognition (HAR) study based on wearable accelerometers. The dataset is licensed under the Creative Commons license (CC BY-SA). Source:

“Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of 21st Brazilian Symposium on Artificial Intelligence. Advances in Artificial Intelligence - SBIA 2012. In: Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin / Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6”

The data are provided in two files, one for testing and one for training:

pml_testing <- read_csv("pml-testing.csv")
pml_training <- read_csv("pml-training.csv")
dim(pml_testing)

## [1]  20 160

dim(pml_training)

## [1] 19622   160

We split the training file into a training partition (75%) and a testing partition (25%). A seed is set so that the split is reproducible.

set.seed(2021)
inTrain <- createDataPartition(y=pml_training$classe, p=0.75, list=FALSE)
training <- pml_training[inTrain,]
testing <- pml_training[-inTrain,]
dim(training)

## [1] 14718   160

dim(testing)

## [1] 4904  160
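createDataPartition draws its sample within each level of classe, so both partitions keep roughly the same class proportions. The sketch below mimics that idea in base R on made-up labels. The toy vector and the per-class sampling rule are assumptions for illustration, not caret's exact implementation.

```r
# Toy class labels (made up for illustration; not the HAR data)
set.seed(2021)
classe_toy <- factor(rep(c("A", "B", "C", "D", "E"), times = c(40, 25, 15, 10, 10)))

# Draw 75% of the row indices separately within each class
rows_by_class <- split(seq_along(classe_toy), classe_toy)
in_train <- unlist(lapply(rows_by_class, function(rows) {
  sample(rows, size = ceiling(0.75 * length(rows)))
}), use.names = FALSE)

# Class proportions in the full data and in the training partition
full_prop <- prop.table(table(classe_toy))
train_prop <- prop.table(table(classe_toy[in_train]))
print(round(full_prop, 3))
print(round(train_prop, 3))
```

Because the 75% is taken class by class, no class can end up over- or under-represented in the training partition by chance.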

Cleaning and exploring the data

The dependent variable is classe. Class A is the correct execution of the exercise as specified; classes B to E are common mistakes:

- B: throwing the elbows to the front
- C: lifting the dumbbell only halfway
- D: lowering the dumbbell only halfway
- E: throwing the hips to the front

Some columns consist mostly of missing values (NA). We keep only the columns in which less than 90% of the values are missing.

# Indices of the columns with less than 90% missing values
vector <- unname(which(colMeans(is.na(training)) < 0.9))
# Apply the same column selection to both partitions
training <- training[vector]
testing <- testing[vector]
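The effect of the 90% threshold is easiest to see on a small example. The data frame below is invented for illustration; only the filtering rule mirrors the step above.

```r
# Toy data frame (hypothetical columns, 20 rows)
toy <- data.frame(
  full      = 1:20,                    # no missing values
  mostly_na = c(1, rep(NA, 19)),       # 95% missing
  some_na   = c(rep(NA, 4), 5:20)      # 20% missing
)

# Share of missing values per column
na_share <- colMeans(is.na(toy))
print(na_share)

# Keep the columns below the 90% threshold
kept <- toy[, na_share < 0.9, drop = FALSE]
print(names(kept))
```

Only mostly_na is dropped. A column with a moderate share of missing values, such as some_na, survives the filter.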

We then drop the first six remaining columns, which are not needed as predictors for this project.

training = training[,-c(1:6)]
testing = testing[,-c(1:6)]

Training the model

After comparing several classification models, we chose a random forest. The model is fitted with caret's train function, which selects the tuning parameter mtry by resampling.

set.seed(2021)
modrf <- train(classe~ .,data=training,method="rf")

The fitted final model is shown below.

modrf$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.22%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4184    0    0    0    1 0.0002389486
## B    8 2837    3    0    0 0.0038623596
## C    0    4 2561    2    0 0.0023373588
## D    0    0    9 2402    1 0.0041459370
## E    0    0    0    5 2701 0.0018477458
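The OOB error rate and the class.error column can be recomputed from the confusion matrix in the printout. A minimal sketch, with the counts copied by hand from the output above:

```r
# OOB confusion matrix copied from the model printout (rows = actual class)
oob <- matrix(c(4184,    0,    0,    0,    1,
                   8, 2837,    3,    0,    0,
                   0,    4, 2561,    2,    0,
                   0,    0,    9, 2402,    1,
                   0,    0,    0,    5, 2701),
              nrow = 5, byrow = TRUE,
              dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5]))

# Overall OOB error: share of off-diagonal counts
oob_error <- 1 - sum(diag(oob)) / sum(oob)

# Per-class error: misclassified share within each row
class_error <- 1 - diag(oob) / rowSums(oob)

print(round(100 * oob_error, 2))  # in percent
print(class_error)
```

The matrix sums to 14718, the number of rows in the training partition, and 33 of those rows are off the diagonal.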

The out-of-bag (OOB) error rate of 0.22% indicates that the model is reliable on the training data. The resampled accuracy for each mtry value tried is shown below.

modrf$results

##   mtry  Accuracy     Kappa  AccuracySD     KappaSD
## 1    2 0.9925794 0.9906088 0.001383569 0.001756202
## 2   27 0.9958565 0.9947562 0.001365514 0.001731226
## 3   53 0.9910977 0.9887341 0.001981400 0.002511575
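The three mtry values in the table are consistent with an evenly spaced grid of three candidates between 2 and the number of predictors. The sketch below reproduces them under two assumptions that the text does not state: that 53 predictors remain after cleaning (inferred from the largest mtry tried) and that the grid is built by flooring an evenly spaced sequence.

```r
n_predictors <- 53   # assumption: inferred from the largest mtry in the table
n_candidates <- 3    # assumption: three candidate values, as in the table

# Evenly spaced candidates between 2 and the number of predictors
mtry_grid <- floor(seq(2, n_predictors, length.out = n_candidates))
print(mtry_grid)

# Resampled accuracies copied from modrf$results
accuracy <- c(0.9925794, 0.9958565, 0.9910977)

# The candidate with the highest resampled accuracy
best_mtry <- mtry_grid[which.max(accuracy)]
print(best_mtry)
```

The selected value, 27, matches the "No. of variables tried at each split" reported for the final model.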

Cross-validation

We now validate the model by applying it to the held-out testing partition.

predictionDATA <- predict(modrf,testing)

The confusion matrix compares the predictions with the true classes.

confusionM <- confusionMatrix(table(predictionDATA,testing$classe))
confusionM

## Confusion Matrix and Statistics
## 
##               
## predictionDATA    A    B    C    D    E
##              A 1395    5    0    0    0
##              B    0  939    0    0    0
##              C    0    4  855    5    0
##              D    0    1    0  799    2
##              E    0    0    0    0  899
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9965         
##                  95% CI : (0.9945, 0.998)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9956         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9895   1.0000   0.9938   0.9978
## Specificity            0.9986   1.0000   0.9978   0.9993   1.0000
## Pos Pred Value         0.9964   1.0000   0.9896   0.9963   1.0000
## Neg Pred Value         1.0000   0.9975   1.0000   0.9988   0.9995
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1915   0.1743   0.1629   0.1833
## Detection Prevalence   0.2855   0.1915   0.1762   0.1635   0.1833
## Balanced Accuracy      0.9993   0.9947   0.9989   0.9965   0.9989

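The estimated out-of-sample error is 1 - accuracy = 1 - 0.9965 = 0.0035, or about 0.35%. The Kappa value of 0.9956, which corrects the agreement for chance, is similarly close to 1 (1 - Kappa = 0.0044).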
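Both accuracy and Kappa can be recomputed by hand from the confusion matrix, which makes their relation to the error explicit. A minimal sketch, with the counts copied from the output above:

```r
# Test-set confusion matrix copied from the output (rows = predicted class)
cm <- matrix(c(1395,   5,   0,   0,   0,
                  0, 939,   0,   0,   0,
                  0,   4, 855,   5,   0,
                  0,   1,   0, 799,   2,
                  0,   0,   0,   0, 899),
             nrow = 5, byrow = TRUE,
             dimnames = list(predicted = LETTERS[1:5], reference = LETTERS[1:5]))

n <- sum(cm)

# Observed agreement (accuracy): share of counts on the diagonal
observed <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa: agreement beyond chance, rescaled to at most 1
kappa_hat <- (observed - expected) / (1 - expected)

print(round(c(accuracy = observed, error = 1 - observed, kappa = kappa_hat), 4))
```

The misclassification rate is 1 - accuracy. Kappa is slightly lower than accuracy because it discounts the agreement that the class frequencies alone would produce.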

Prediction in separate set of modeling data

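A separate set of 20 cases (pml-testing.csv) was provided for prediction. It shows how future data must be processed before being passed to the model: the same column selection used on the training data is applied, and the model then predicts the class of each case.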

pml_testing = pml_testing[vector]
pml_testing = pml_testing[,-c(1:6)]
predictionDATANEW <- predict(modrf,pml_testing)
predictionDATANEW

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
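Reusing the column index computed on the training data matters: if the filter were recomputed on the new data, a different set of columns could survive and the model would not find the predictors it expects. The toy data frames below are invented to illustrate this point.

```r
# Toy training data and toy new data (hypothetical columns)
train_toy <- data.frame(id = 1:4, a = c(1, 2, NA, 4), b = NA,
                        y = c("A", "B", "A", "B"))
new_toy <- data.frame(id = 5:6, a = c(7, 8), b = c(1, NA), y = NA)

# Column selection decided on the training data only
keep <- unname(which(colMeans(is.na(train_toy)) < 0.9))

# Apply the same selection to both, then drop the identifier column
train_clean <- train_toy[keep][, -1]
new_clean <- new_toy[keep][, -1]
print(names(train_clean))
print(names(new_clean))

# Recomputing the filter on the new data would select other columns
keep_new <- unname(which(colMeans(is.na(new_toy)) < 0.9))
print(names(new_toy)[keep_new])
```

With the shared index both data frames end up with the columns a and y, whereas the filter recomputed on the new data would keep b and drop y.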

Uriel Casas

8/2/2021
