Summary

The following is the final report for the Practical Machine Learning course, which is part of the Data Science specialization. Its goal is to train a machine learning algorithm on the classe variable of the training set and use it to predict the outcome for the 20 test cases available in the test data.

Introduction

With devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity in a cost-effective manner. These devices are part of the quantified self movement: a group of enthusiasts who regularly take measurements about themselves to improve their health, to find patterns in their behavior, or because they are technology fanatics. One thing people regularly quantify is how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we will use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available at the following website: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har

Data

Training data for this project:

http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

Test data:

http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Details of this project:

http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har

If you use the document you create for this class for any purpose, please cite the authors, as they have been very generous in allowing their data to be used for this type of assignment.

Load the necessary packages in R and set a seed for reproducibility.

paquetes <- c('lattice', #High-level data visualization system inspired by Trellis graphics
              'ggplot2', #for graphics
              'caret', #Functions for training and tracking classification and regression models
              'rpart', #Recursive partitioning for classification, regression and survival trees
              'rpart.plot', #Plot 'rpart' models
              'randomForest', #to build the random forest models
              'corrplot', #A graphical display of a correlation matrix or general matrix
              'rattle', #Utilities for the data scientist
              'RColorBrewer' #Provides color schemes for maps (and other graphics)
)
#Create a logical vector indicating whether each package is installed
instalados <- paquetes %in% installed.packages()
#If at least one package is not installed, install it
if(sum(instalados == FALSE) > 0) {
  install.packages(paquetes[!instalados])
}
lapply(paquetes, require, character.only = TRUE)
set.seed(12345)

We load the training and test data:

training <-read.csv('http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv', strip.white = TRUE, na.strings = c("NA","")) 
testing <- read.csv('http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv', strip.white = TRUE, na.strings = c("NA",""))
dim(training)
## [1] 19622   160
dim(testing)
## [1]  20 160
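
Before partitioning, it is useful to look at the outcome we want to predict. The following quick check is not part of the original output; it is a minimal sketch showing the distribution of the classe variable across its five levels:

# Distribution of the outcome variable classe (levels A to E)
table(training$classe)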

Two partitions (75% and 25%) are made within the original training data set.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
in_training  <- createDataPartition(training$classe, p=0.75, list=FALSE)
training_set <- training[ in_training, ]
test_set  <- training[-in_training, ]
dim(training_set)
## [1] 14718   160
dim(test_set)
## [1] 4904  160
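
createDataPartition performs a stratified split on classe, so both partitions should preserve the class proportions. A quick verification (a sketch, not part of the original output):

# Class proportions should be nearly identical in the two partitions
round(prop.table(table(training_set$classe)), 3)
round(prop.table(table(test_set$classe)), 3)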

We will clean the two data sets (training_set and test_set) by removing variables with near-zero variance (NZV), variables that are mostly NA, and the identification variables. First, remove the NZV variables:

nzv_var <- nearZeroVar(training_set)
training_set <- training_set[ , -nzv_var]
test_set  <- test_set [ , -nzv_var]
dim(training_set)
## [1] 14718   123
dim(test_set)
## [1] 4904  123

Next, eliminate the variables that are mostly NA; a threshold of 95% is selected.

na_var <- sapply(training_set, function(x) mean(is.na(x))) > 0.95
training_set <- training_set[ , na_var == FALSE]
test_set  <- test_set [ , na_var == FALSE]
dim(training_set)
## [1] 14718    59
dim(test_set)
## [1] 4904   59

Finally, we remove the identification columns (columns 1 to 5).

training_set <- training_set[ , -(1:5)]
test_set  <- test_set [ , -(1:5)]
dim(training_set)
## [1] 14718    54
dim(test_set)
## [1] 4904   54
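
As a sanity check (not part of the original output), one can confirm that no NA values remain in the cleaned training partition; a minimal sketch:

# Count of remaining NA values; the cleaning above should have removed the sparse columns
sum(is.na(training_set))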

Correlation Analysis

FPC" is selected for the first order of main components.

library(corrplot)
## corrplot 0.84 loaded
corr_matrix <- cor(training_set[ , -54])
corrplot(corr_matrix, order = "FPC", method = "circle", type = "lower",
         tl.cex = 0.6, tl.col = rgb(0, 0, 0))

When two variables are highly correlated, their intersection is drawn in a dark color: dark blue for a positive correlation and dark red for a negative correlation.
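
Beyond visual inspection, the highly correlated predictors can also be identified programmatically. The sketch below is not part of the original analysis; it uses caret's findCorrelation() with an assumed cutoff of 0.8:

# Indices of predictors with an absolute pairwise correlation above 0.8
high_corr <- findCorrelation(corr_matrix, cutoff = 0.8)
names(training_set)[high_corr]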

We will build a few prediction models.

Prediction Models

Decision Tree Model

library(rpart)
library(RColorBrewer)
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
set.seed(12345)
fit_decision_tree <- rpart(classe ~ ., data = training_set, method="class")
fancyRpartPlot(fit_decision_tree)

Prediction with the decision tree model on test_set.

library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
## 
##     importance
## The following object is masked from 'package:ggplot2':
## 
##     margin
predict_decision_tree <- predict(fit_decision_tree, newdata = test_set, type="class")
conf_matrix_decision_tree <- confusionMatrix(table(predict_decision_tree, test_set$classe))
conf_matrix_decision_tree
## Confusion Matrix and Statistics
## 
##                      
## predict_decision_tree    A    B    C    D    E
##                     A 1263  230   16   52   25
##                     B   43  527   28   23   22
##                     C   15   57  707  110   83
##                     D   54   81   50  549  107
##                     E   20   54   54   70  664
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7565          
##                  95% CI : (0.7443, 0.7685)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6909          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9054   0.5553   0.8269   0.6828   0.7370
## Specificity            0.9080   0.9707   0.9346   0.9288   0.9505
## Pos Pred Value         0.7963   0.8196   0.7274   0.6528   0.7703
## Neg Pred Value         0.9602   0.9010   0.9624   0.9372   0.9414
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2575   0.1075   0.1442   0.1119   0.1354
## Detection Prevalence   0.3234   0.1311   0.1982   0.1715   0.1758
## Balanced Accuracy      0.9067   0.7630   0.8807   0.8058   0.8437
# plot matrix results
plot(conf_matrix_decision_tree$table, 
     col = conf_matrix_decision_tree$byClass, 
     main = paste("Random Forest - Accuracy =", round(conf_matrix_decision_tree$overall['Accuracy'], 4)))