The following is the final report for the Practical Machine Learning course project, part of the Data Science specialization. Its purpose is to build a machine learning model that predicts the classe variable of the training set and to apply that model to the 20 test cases available in the test data.
With devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is possible to collect a large amount of data on personal activity in a cost-effective manner. These devices are part of the quantified self movement: a group of enthusiasts who regularly take measurements about themselves to improve their health, to find patterns in their behavior, or because they are technology fanatics. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, will be used. More information can be found at the following website: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har
Details of this project:
If you use the document you create for this class for any purpose, please cite the authors of the dataset, as they have been very generous in allowing their data to be used for this type of assignment.
paquetes <- c('lattice', #High-level data visualization system inspired by Trellis graphics
'ggplot2', #for graphics
'caret', #Functions for training and plotting classification and regression models
'rpart', #Recursive partitioning for classification, regression and survival trees
'rpart.plot', #Plot 'rpart' models
'randomForest', #to create the random forest models
'corrplot', #A graphical display of a correlation matrix or general matrix
'rattle', #Utilities for the data scientist
'RColorBrewer' #Provides color schemes for maps (and other graphics)
)
#Create a logical vector indicating whether each package is installed
instalados <- paquetes %in% rownames(installed.packages())
#Install any packages that are not yet installed
if(any(!instalados)) {
install.packages(paquetes[!instalados])
}
lapply(paquetes, require, character.only = TRUE)
set.seed(12345)
We load the data:
training <-read.csv('http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv', strip.white = TRUE, na.strings = c("NA",""))
testing <- read.csv('http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv', strip.white = TRUE, na.strings = c("NA",""))
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
Two partitions (75% and 25%) are made within the original training data set.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
in_training <- createDataPartition(training$classe, p=0.75, list=FALSE)
training_set <- training[ in_training, ]
test_set <- training[-in_training, ]
dim(training_set)
## [1] 14718 160
dim(test_set)
## [1] 4904 160
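createDataPartition() performs stratified sampling on classe, so the class distribution should be nearly identical in both partitions. A quick check (a sketch, not part of the original report):
# Class proportions in each partition; they should match closely
round(prop.table(table(training_set$classe)), 3)
round(prop.table(table(test_set$classe)), 3)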
We will remove the variables with near-zero variance (NZV), the variables that are mostly NA, and the identification variables from the 2 data sets (training_set and test_set). First, the NZV variables:
nzv_var <- nearZeroVar(training_set)
training_set <- training_set[ , -nzv_var]
test_set <- test_set[ , -nzv_var]
dim(training_set)
## [1] 14718 123
dim(test_set)
## [1] 4904 123
Next, remove the variables that are mostly NA; a threshold of 95% missing values is used.
na_var <- sapply(training_set, function(x) mean(is.na(x))) > 0.95
training_set <- training_set[ , na_var == FALSE]
test_set <- test_set[ , na_var == FALSE]
dim(training_set)
## [1] 14718 59
dim(test_set)
## [1] 4904 59
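As a quick sanity check (not part of the original output), one can confirm that no missing values remain after this step:
# Count of columns that still contain any NA; expected to be 0
sum(colSums(is.na(training_set)) > 0)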
Finally, we remove the identification columns (columns 1 to 5).
training_set <- training_set[ , -(1:5)]
test_set <- test_set[ , -(1:5)]
dim(training_set)
## [1] 14718 54
dim(test_set)
## [1] 4904 54
FPC" is selected for the first order of main components.
library(corrplot)
## corrplot 0.84 loaded
corr_matrix <- cor(training_set[ , -54])
corrplot(corr_matrix, order = "FPC", method = "circle", type = "lower",
tl.cex = 0.6, tl.col = rgb(0, 0, 0))
Highly correlated variables are shown in dark colors: dark blue (positive correlation) and dark red (negative correlation).
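caret also provides findCorrelation() to list the predictors whose pairwise correlation exceeds a cutoff; a minimal sketch (the 0.8 cutoff is an illustrative assumption, not a value from this report):
# Indices of predictors with an absolute pairwise correlation above 0.8
# (cutoff chosen for illustration only)
high_corr <- findCorrelation(corr_matrix, cutoff = 0.8)
names(training_set)[high_corr]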
We will build a few prediction models.
Prediction Models

6.1. Decision Tree Model
library(rpart)
library(RColorBrewer)
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
set.seed(12345)
fit_decision_tree <- rpart(classe ~ ., data = training_set, method="class")
fancyRpartPlot(fit_decision_tree)
Predictions of the decision tree model on test_set:
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
predict_decision_tree <- predict(fit_decision_tree, newdata = test_set, type="class")
conf_matrix_decision_tree <- confusionMatrix(table(predict_decision_tree, test_set$classe))
conf_matrix_decision_tree
## Confusion Matrix and Statistics
##
##
## predict_decision_tree A B C D E
## A 1263 230 16 52 25
## B 43 527 28 23 22
## C 15 57 707 110 83
## D 54 81 50 549 107
## E 20 54 54 70 664
##
## Overall Statistics
##
## Accuracy : 0.7565
## 95% CI : (0.7443, 0.7685)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6909
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9054 0.5553 0.8269 0.6828 0.7370
## Specificity 0.9080 0.9707 0.9346 0.9288 0.9505
## Pos Pred Value 0.7963 0.8196 0.7274 0.6528 0.7703
## Neg Pred Value 0.9602 0.9010 0.9624 0.9372 0.9414
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2575 0.1075 0.1442 0.1119 0.1354
## Detection Prevalence 0.3234 0.1311 0.1982 0.1715 0.1758
## Balanced Accuracy 0.9067 0.7630 0.8807 0.8058 0.8437
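The expected out-of-sample error can be estimated directly from the held-out accuracy; with the reported accuracy of 0.7565, this gives roughly 0.2435 (a quick derivation, not part of the original output):
# Estimated out-of-sample error = 1 - accuracy on the held-out test_set
1 - conf_matrix_decision_tree$overall['Accuracy']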
# plot matrix results
plot(conf_matrix_decision_tree$table,
col = conf_matrix_decision_tree$byClass,
main = paste("Decision Tree - Accuracy =", round(conf_matrix_decision_tree$overall['Accuracy'], 4)))
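The randomForest package was loaded above to create the next model; a minimal sketch of how it might be fit and evaluated on the same partitions (the call below is an assumption for illustration, not code from the original report):
set.seed(12345)
# Fit a random forest on the cleaned training partition;
# ntree = 500 is the randomForest default, written out for clarity
fit_random_forest <- randomForest(as.factor(classe) ~ ., data = training_set, ntree = 500)
# Predict on the held-out partition and summarize the results
predict_random_forest <- predict(fit_random_forest, newdata = test_set)
confusionMatrix(table(predict_random_forest, test_set$classe))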