Course Project

In this case, we have a dataset from Groupware. The data consists of accelerometer measurements collected while participants performed an exercise.

With this data, we must predict the manner in which they did the exercise, which is recorded in the classe variable.

First, we need to load the libraries and the data.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
train <- read.csv("pml-training.csv")
validation <- read.csv("pml-testing.csv")

Only the training and validation data were loaded, because the actual training data will be split into two groups, training and testing.
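Because createDataPartition samples the rows at random, it is worth setting a seed first so the split is reproducible (the seed value below is arbitrary, not from the original analysis):

set.seed(1234)  # arbitrary seed; makes the 75/25 split below reproducible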

in_train  <- createDataPartition(train$classe, p=0.75, list=FALSE)
training <- train[ in_train, ]
testing  <- train[-in_train, ]

We already have the data, but to predict something with it, we first need to know what the data contains. So, a brief exploratory data analysis:

# str(testing)
nCols <- dim(testing)[2]

The str() call is commented out because its output is too long. The data has 160 columns, and many of them are near-zero-variance variables or mostly NA, so those values must be removed.
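As a quick sketch (not part of the original run), the scale of both problems can be quantified before cleaning:

length(nearZeroVar(training))                             # how many near-zero-variance columns
sum(sapply(training, function(x) mean(is.na(x))) > 0.95)  # how many columns are more than 95% NA

The actual removal is then straightforward: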

nzv_var <- nearZeroVar(training)
training <- training[ , -nzv_var]
testing  <- testing [ , -nzv_var]

For the NA values, only the columns whose proportion of NAs is above 95% will be removed.

na_var <- sapply(training, function(x) mean(is.na(x))) > 0.95
training <- training[ , !na_var]
testing  <- testing [ , !na_var]

Now the data is almost clean; the first five columns will be removed because they are only identifiers (the row index, user name, and timestamps).

training <- training[ , -(1:5)]
testing  <- testing [ , -(1:5)]
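As a sanity check (a sketch, not in the original output), the dimensions can be compared against the original 160 columns to see how much the cleaning removed:

dim(training)  # far fewer columns should remain than the original 160
dim(testing)   # the testing set must keep exactly the same columns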

Only the model training is left.

First, a decision tree will be used.

mod1 <- train(classe ~ ., data = training, method = "rpart")

This model was relatively fast to train, so now its accuracy will be checked.

predm1 <- predict(mod1, newdata = testing)
matrix2 <- confusionMatrix(predm1, factor(testing$classe))
matrix2
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1259  402  399  372  116
##          B   23  298   28  127   99
##          C  109  249  428  305  252
##          D    0    0    0    0    0
##          E    4    0    0    0  434
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4933          
##                  95% CI : (0.4792, 0.5074)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3379          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9025  0.31401  0.50058   0.0000  0.48169
## Specificity            0.6327  0.92996  0.77402   1.0000  0.99900
## Pos Pred Value         0.4941  0.51826  0.31869      NaN  0.99087
## Neg Pred Value         0.9423  0.84962  0.88009   0.8361  0.89543
## Prevalence             0.2845  0.19352  0.17435   0.1639  0.18373
## Detection Rate         0.2567  0.06077  0.08728   0.0000  0.08850
## Detection Prevalence   0.5196  0.11725  0.27386   0.0000  0.08931
## Balanced Accuracy      0.7676  0.62199  0.63730   0.5000  0.74034

The accuracy is quite bad, and the "D" classe is never predicted. This is easier to see in a graph:

fancyRpartPlot(mod1$finalModel)

Now it is clearer that this is not a good predictor. We need to change it.

In this step, a random forest will be used, trained with repeated cross-validation.

ctrl_RF <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
fit_RF  <- train(classe ~ ., data = training, method = "rf",
                  trControl = ctrl_RF, verbose = FALSE)
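Once training finishes, caret keeps the cross-validation results inside the fitted object; a quick way to inspect them (a sketch) is:

fit_RF$bestTune  # the mtry value selected by repeated cross-validation
fit_RF$results   # resampled accuracy for each candidate mtry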

The accuracy of this model on the testing set will also be checked.

predict_RF <- predict(fit_RF, newdata = testing)
conf_matrix_RF <- confusionMatrix(predict_RF, factor(testing$classe))
conf_matrix_RF
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    3    0    0    0
##          B    0  946    0    0    0
##          C    0    0  855    3    0
##          D    0    0    0  800    0
##          E    0    0    0    1  901
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9986          
##                  95% CI : (0.9971, 0.9994)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9982          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9968   1.0000   0.9950   1.0000
## Specificity            0.9991   1.0000   0.9993   1.0000   0.9998
## Pos Pred Value         0.9979   1.0000   0.9965   1.0000   0.9989
## Neg Pred Value         1.0000   0.9992   1.0000   0.9990   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1929   0.1743   0.1631   0.1837
## Detection Prevalence   0.2851   0.1929   0.1750   0.1631   0.1839
## Balanced Accuracy      0.9996   0.9984   0.9996   0.9975   0.9999

Now the accuracy is much better. With this model, we can predict our validation data.
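Before doing so, the expected out-of-sample error is worth noting; a quick sketch, using the confusionMatrix object above (the error is simply one minus the accuracy):

oos_error <- 1 - as.numeric(conf_matrix_RF$overall["Accuracy"])
oos_error  # about 0.0014 given the accuracy reported above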

pred <- predict(fit_RF, newdata = validation)
pred
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E