Project Practical Machine Learning
Introduction
The goal of this project is to predict the manner in which six participants performed barbell lifts, correctly and incorrectly, in five different ways. Data from accelerometers on the belt, forearm, upper arm, and dumbbell are used to classify how well each lift was performed. The machine learning algorithm developed in this report is then applied to the 20 cases in the test data, and those predictions are submitted to the project's prediction quiz for grading.
Data Loading, Packages and Library
Download data from url
dataTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
dataTest <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
Load the datasets
training <- read.csv(url(dataTrain))
testing <- read.csv(url(dataTest))
Load packages and library
library(knitr)
library(caret)
Loading required package: ggplot2
Loading required package: lattice
library(rpart)
library(rpart.plot)
library(rattle)
Loading required package: tibble
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:rattle':
    importance
The following object is masked from 'package:ggplot2':
    margin
library(corrplot)
corrplot 0.92 loaded
Data splitting
The training data are split into a training set (70%) used to fit the models and a held-out test set (30%, called testSet) used to validate them. The separate 20-case testing data set is used only to produce the answers for the prediction quiz.
inTrain <- createDataPartition(training$classe, p=0.7, list=FALSE)
trainSet <- training[inTrain, ]
testSet <- training[-inTrain, ]
Exploratory Analysis and Cleaning
dim(trainSet)
[1] 13737 160
dim(testSet)
[1] 5885 160
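The split above is stratified on classe, so each class keeps roughly the same share in both parts. The following is a minimal base R sketch of that idea on toy data; it is not caret's implementation, and the seed, the toy data frame and the names trainToy and validToy are assumptions made for illustration. Setting a seed before the split is what makes the partition reproducible.

```r
# Sketch: stratified 70/30 split in base R (not caret's implementation).
set.seed(12345)  # assumed seed, chosen only for illustration

# Toy data standing in for the training data; `classe` mimics the outcome.
toy <- data.frame(
  x = rnorm(100),
  classe = factor(rep(c("A", "B", "C", "D", "E"), times = c(30, 20, 20, 20, 10)))
)

# Sample 70% of the row indices within each class.
idxByClass <- split(seq_len(nrow(toy)), toy$classe)
inTrainToy <- unlist(lapply(idxByClass, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

trainToy <- toy[inTrainToy, ]
validToy <- toy[-inTrainToy, ]

# Class shares are preserved in both parts.
table(trainToy$classe) / nrow(trainToy)
table(validToy$classe) / nrow(validToy)
```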
Note: Both trainSet and testSet contain 160 variables. Many of them consist almost entirely of NA values and are removed below, along with other unnecessary variables: those with near-zero variance and the identification variables.
Remove variables that are mostly NA
# TRUE for columns in which more than 95% of the values are NA
mostlyNA <- sapply(trainSet, function(x) mean(is.na(x))) > 0.95
trainSet <- trainSet[, !mostlyNA]
testSet <- testSet[, !mostlyNA]
dim(trainSet)
[1] 13737 93
dim(testSet)
[1] 5885 93
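To see how the 95% NA filter behaves, here is a small sketch on a made-up data frame; the column names and values are hypothetical. The filter is computed on one data set and then applied by the same logical mask to any other data set with the same columns.

```r
# Toy data: column b is entirely NA, column c is 25% NA.
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, NA),
  c = c(1, NA, 3, 4)
)

# Share of missing values per column.
naShare <- sapply(toy, function(x) mean(is.na(x)))

# Flag columns where more than 95% of the values are missing.
mostlyNA <- naShare > 0.95

# Keep the remaining columns; drop = FALSE preserves the data frame
# even if only one column survives.
cleaned <- toy[, !mostlyNA, drop = FALSE]
names(cleaned)
```

Column c survives because its NA share is below the threshold, so any model fitted afterwards still has to tolerate a few missing values or impute them.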
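Remove variables with near-zero variance
neZeVar <- nearZeroVar(trainSet)
# Guard: negative indexing with an empty vector would drop every column
if (length(neZeVar) > 0) {
  trainSet <- trainSet[, -neZeVar]
  testSet <- testSet[, -neZeVar]
}
dim(trainSet)
[1] 13737 59
dim(testSet)
[1] 5885 59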
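The near-zero-variance filter rests on two per-column measures: the ratio between the counts of the most common and second most common value, and the percentage of distinct values. The sketch below computes both in base R on toy data. The helper nzvMetrics is hypothetical, and the cutoffs 95/5 and 10 are assumed to match caret's documented defaults (freqCut and uniqueCut); caret's own handling of edge cases such as constant columns differs from this simplification.

```r
# Hypothetical helper: the two measures behind a near-zero-variance check.
nzvMetrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the runner-up's count.
  # A constant column has no runner-up; Inf is a simplification here.
  freqRatio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Percentage of distinct values relative to the number of rows.
  pctUnique <- 100 * length(counts) / length(x)
  c(freqRatio = freqRatio, pctUnique = pctUnique)
}

toy <- data.frame(
  varied = 1:100,                          # every value distinct
  flag = c(rep("no", 98), rep("yes", 2))   # dominated by a single value
)

metrics <- sapply(toy, nzvMetrics)

# Assumed cutoffs: frequency ratio above 95/5 and under 10% distinct values.
isNzv <- metrics["freqRatio", ] > 95 / 5 & metrics["pctUnique", ] < 10
isNzv
```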
Remove id variables
trainSet <- trainSet[, -(1:5)]
testSet <- testSet[, -(1:5)]
dim(trainSet)
[1] 13737 54
dim(testSet)
[1] 5885 54
Note: After cleaning, 54 variables remain for the analysis: 53 predictors plus the outcome classe.
Correlation Analysis
cMatrix <- cor(trainSet[, -54])
corrplot(cMatrix, order = "FPC", method = "color", type = "lower",
         tl.cex = 0.8, tl.col = rgb(0, 0, 0))
Note: Strongly correlated pairs of variables appear in darker colors.
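A plot is hard to scan with 53 predictors, so it can help to list the strongly correlated pairs directly. The sketch below does this in base R on simulated data; the 0.8 threshold, the seed and the variable names are assumptions for illustration, not values taken from this analysis.

```r
set.seed(1)  # assumed seed, for a reproducible toy example

# x2 is built from x1, so the two are strongly correlated; x3 is independent.
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.1),
  x3 = rnorm(200)
)

cm <- cor(toy)

# Blank out the diagonal and upper triangle so each pair is counted once.
cm[upper.tri(cm, diag = TRUE)] <- NA

# Row/column positions of pairs whose absolute correlation exceeds 0.8.
high <- which(abs(cm) > 0.8, arr.ind = TRUE)

highPairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r = cm[high]
)
highPairs
```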
Prediction Model Building
Random Forest, Decision Tree and Generalized Boosted Model methods are applied to this classification problem. The model with the highest accuracy on the held-out test set (testSet) is used to answer the prediction quiz.
Random Forest
Model fit
fitRfc <- trainControl(method="cv", number=3, verboseIter=FALSE)
modFitRf <- train(classe ~ ., data=trainSet, method="rf",
                  trControl=fitRfc)
modFitRf$finalModel
Call:
 randomForest(x = x, y = y, mtry = param$mtry)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 27
        OOB estimate of error rate: 0.19%
Confusion matrix:
     A    B    C    D    E  class.error
A 3904    1    0    0    1 0.0005120328
B    5 2650    2    1    0 0.0030097818
C    0    5 2390    1    0 0.0025041736
D    0    0    8 2244    0 0.0035523979
E    0    0    0    2 2523 0.0007920792
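The random forest above is tuned with 3-fold cross-validation (method="cv", number=3). As a rough illustration of what k-fold cross-validation does, the sketch below assigns rows to folds and averages the held-out accuracy in base R. It uses the built-in iris data and a deliberately simple nearest-centroid classifier as a stand-in for the random forest; the seed and the helper predictCentroid are assumptions, and caret's internal resampling is more elaborate than this.

```r
set.seed(12345)  # assumed seed, for reproducible fold assignment
k <- 3

# Assign each row of iris to one of k folds of equal size, in random order.
foldId <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Hypothetical stand-in classifier: predict the class with the nearest centroid.
predictCentroid <- function(trainX, trainY, newX) {
  # One row of feature means per class.
  centroids <- apply(trainX, 2, function(col) tapply(col, trainY, mean))
  # Squared Euclidean distance from every new row to every class centroid.
  dists <- apply(newX, 1, function(row) rowSums(sweep(centroids, 2, row)^2))
  factor(rownames(centroids)[apply(dists, 2, which.min)],
         levels = levels(trainY))
}

# For each fold: fit on the other folds, score on the held-out fold.
foldAcc <- sapply(seq_len(k), function(fold) {
  held <- foldId == fold
  pred <- predictCentroid(iris[!held, 1:4], iris$Species[!held],
                          iris[held, 1:4])
  mean(pred == iris$Species[held])
})

cvAccuracy <- mean(foldAcc)
cvAccuracy
```

Every observation is held out exactly once, which is why the averaged accuracy is a less optimistic estimate than accuracy measured on the rows used for fitting.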
Prediction on test dataset
predictRf <- predict(modFitRf, newdata=testSet)
confMatRf <- confusionMatrix(predictRf, as.factor(testSet$classe))
confMatRf
Confusion Matrix and Statistics
          Reference
Prediction    A    B    C    D    E
         A 1674    1    0    0    0
         B    0 1138    3    0    0
         C    0    0 1023    3    0
         D    0    0    0  960    0
         E    0    0    0    1 1082
Overall Statistics
               Accuracy : 0.9986
                 95% CI : (0.9973, 0.9994)
    No Information Rate : 0.2845
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.9983
 Mcnemar's Test P-Value : NA
Statistics by Class:
                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            1.0000   0.9991   0.9971   0.9959   1.0000
Specificity            0.9998   0.9994   0.9994   1.0000   0.9998
Pos Pred Value         0.9994   0.9974   0.9971   1.0000   0.9991
Neg Pred Value         1.0000   0.9998   0.9994   0.9992   1.0000
Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
Detection Rate         0.2845   0.1934   0.1738   0.1631   0.1839
Detection Prevalence   0.2846   0.1939   0.1743   0.1631   0.1840
Balanced Accuracy      0.9999   0.9992   0.9982   0.9979   0.9999
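Several of the statistics above follow directly from the confusion matrix. As a check, the sketch below re-enters the Random Forest matrix by hand and recomputes accuracy, the no-information rate and the per-class sensitivity and positive predictive value in base R. The object name cm is an assumption; the counts are copied from the output above.

```r
# Random Forest confusion matrix: rows are predictions, columns are references.
cm <- matrix(
  c(1674,    1,    0,   0,    0,
       0, 1138,    3,   0,    0,
       0,    0, 1023,   3,    0,
       0,    0,    0, 960,    0,
       0,    0,    0,   1, 1082),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(cm)) / sum(cm)

# No-information rate: share of the most frequent reference class.
noInfoRate <- max(colSums(cm)) / sum(cm)

# Sensitivity: correct predictions over the true count of each class.
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions over predicted count.
posPredValue <- diag(cm) / rowSums(cm)

round(c(accuracy = accuracy, noInfoRate = noInfoRate), 4)
round(rbind(sensitivity, posPredValue), 4)
```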
Plot matrix results
plot(confMatRf$table, col = confMatRf$byClass,
     main = paste("Random Forest - Accuracy =",
                  round(confMatRf$overall['Accuracy'], 4)))
Decision Trees
Model fit
set.seed(12345)
modFitDTree <- rpart(classe ~ ., data=trainSet, method="class")
fancyRpartPlot(modFitDTree)
Prediction on test dataset
predictDTree <- predict(modFitDTree, newdata=testSet, type="class")
confMatDTree <- confusionMatrix(predictDTree, as.factor(testSet$classe))
confMatDTree
Confusion Matrix and Statistics
          Reference
Prediction    A    B    C    D    E
         A 1468  165   49   58    5
         B   99  710   63   26   19
         C   57  113  819   60   27
         D   46  146   95  759  166
         E    4    5    0   61  865
Overall Statistics
               Accuracy : 0.7852
                 95% CI : (0.7745, 0.7957)
    No Information Rate : 0.2845
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.7284
 Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.8769   0.6234   0.7982   0.7873   0.7994
Specificity            0.9342   0.9564   0.9471   0.9079   0.9854
Pos Pred Value         0.8413   0.7743   0.7612   0.6262   0.9251
Neg Pred Value         0.9502   0.9136   0.9570   0.9561   0.9562
Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
Detection Rate         0.2494   0.1206   0.1392   0.1290   0.1470
Detection Prevalence   0.2965   0.1558   0.1828   0.2059   0.1589
Balanced Accuracy      0.9056   0.7899   0.8727   0.8476   0.8924
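The 95% CI reported for accuracy can be approximated from the counts alone. The sketch below applies an exact binomial interval from base R's binom.test to the decision tree result (4621 correct out of 5885, the diagonal sum of the matrix above). That confusionMatrix builds its interval this way is an assumption based on caret's documentation, so small differences from the printed interval are possible.

```r
# Correct predictions are the diagonal of the decision tree confusion matrix.
correct <- 1468 + 710 + 819 + 759 + 865
total <- 5885

accuracy <- correct / total

# Exact (Clopper-Pearson) binomial confidence interval for the accuracy.
ci <- binom.test(correct, total, conf.level = 0.95)$conf.int

round(accuracy, 4)
round(ci, 4)
```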
Plot matrix results
plot(confMatDTree$table, col = confMatDTree$byClass,
     main = paste("Decision Tree - Accuracy =",
                  round(confMatDTree$overall['Accuracy'], 4)))
Generalized Boosted Model
Model fit
set.seed(12345)
fitGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modFitGBM <- train(classe ~ ., data=trainSet, method = "gbm",
                   trControl = fitGBM, verbose = FALSE)
modFitGBM$finalModel
A gradient boosted model with multinomial loss function.
150 iterations were performed.
There were 53 predictors of which 53 had non-zero influence.
Prediction on test dataset
predictGBM <- predict(modFitGBM, newdata=testSet)
cfMatGBM <- confusionMatrix(predictGBM, as.factor(testSet$classe))
cfMatGBM
Confusion Matrix and Statistics
          Reference
Prediction    A    B    C    D    E
         A 1671   11    0    0    0
         B    1 1118   21    1    1
         C    0    7  996   10    0
         D    2    3    6  951    5
         E    0    0    3    2 1076
Overall Statistics
               Accuracy : 0.9876
                 95% CI : (0.9844, 0.9903)
    No Information Rate : 0.2845
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.9843
 Mcnemar's Test P-Value : NA
Statistics by Class:
                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9982   0.9816   0.9708   0.9865   0.9945
Specificity            0.9974   0.9949   0.9965   0.9967   0.9990
Pos Pred Value         0.9935   0.9790   0.9832   0.9835   0.9954
Neg Pred Value         0.9993   0.9956   0.9938   0.9974   0.9988
Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
Detection Rate         0.2839   0.1900   0.1692   0.1616   0.1828
Detection Prevalence   0.2858   0.1941   0.1721   0.1643   0.1837
Balanced Accuracy      0.9978   0.9883   0.9836   0.9916   0.9967
Plot matrix results
plot(cfMatGBM$table, col = cfMatGBM$byClass,
     main = paste("GBM - Accuracy =", round(cfMatGBM$overall['Accuracy'], 4)))
Prediction
The Random Forest model has the highest accuracy on the held-out test set, so it is used to predict the 20 cases of the prediction quiz. Accuracy of each model on testSet, as reported in the confusion matrix outputs above:
Random Forest : 0.9986
Decision Tree : 0.7852
GBM : 0.9876
predictTest <- predict(modFitRf, newdata=testing)
predictTest
 [1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E
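The model choice can be made explicit in code. The sketch below takes the accuracies printed in the three confusion matrix outputs of this report, picks the best model, and converts each accuracy into an estimated out-of-sample error rate (1 - accuracy). The vector and its names are assumptions for illustration; in the analysis itself the values would come from each confusionMatrix object's overall['Accuracy'] element.

```r
# Accuracies on the held-out test set, copied from the outputs in this report.
accuracies <- c(RandomForest = 0.9986, DecisionTree = 0.7852, GBM = 0.9876)

# Model with the highest held-out accuracy.
best <- names(which.max(accuracies))

# Estimated out-of-sample error rate for each model.
oosError <- 1 - accuracies

best
round(oosError, 4)
```

With an estimated error rate of about 0.14% for the Random Forest, errors among 20 quiz cases are expected to be rare, assuming those cases resemble the training data.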