This document is the final report of the Peer Assessment project from Coursera's course Practical Machine Learning, part of the Data Science Specialization. It was built in RStudio using its knitr functions and is meant to be published in HTML format. The analysis is the basis for the course quiz and a prediction assignment write-up. The main goal of the project is to predict the manner in which 6 participants performed the exercise described below; this is the “classe” variable in the training set. The machine learning algorithm described here is applied to the 20 test cases available in the test data, and the predictions are submitted in the appropriate format to the Course Project Prediction Quiz for automated grading.

## III. Data Loading and Exploratory Analysis
The training data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite them, as they have been very generous in allowing their data to be used for this kind of assignment.
rm(list=ls()) # free up memory for download of the data sets
knitr::opts_chunk$set(echo = TRUE)
### Environment Preparation

We first load the R libraries that are necessary for the complete analysis, set the seed, and load the datasets.
set.seed(1967)
library(knitr)
library(lattice)
library(ggplot2)
#install.packages("caret", dependencies = TRUE)
library(caret)
library(rpart)
library(rpart.plot)
library(corrplot)
## corrplot 0.84 loaded
library(RColorBrewer)
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
library(data.table)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(gbm)
## Loaded gbm 2.1.8
### Set Working Directory
setwd("C:/Users/patty/OneDrive/Desktop/Coursera/Machine Learning")
### Load the Training and Test Datasets

The next step is loading the datasets. The training dataset is then partitioned in two to create a Training set (70% of the data) for the modeling process and a Test set (the remaining 30%) for the validations. The testing dataset is not changed and will only be used for the quiz results generation.
# Read the training and test sets
data_train <- read.csv("pml-training.csv")
data_quiz <- read.csv("pml-testing.csv")
dim(data_train)
## [1] 19622 160
dim(data_quiz)
## [1] 20 160
Both created datasets have 160 variables. Many of those variables contain plenty of NA values, which can be removed with the cleaning procedures below.
in_train <- createDataPartition(data_train$classe, p=0.70, list=FALSE)
train_set <- data_train[ in_train, ]
test_set <- data_train[-in_train, ]
dim(train_set)
## [1] 13737 160
dim(test_set)
## [1] 5885 160
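As a quick sanity check (a sketch added here, not part of the original analysis), stratified partitioning with createDataPartition should keep the class proportions nearly identical in the two subsets:

# Compare class proportions between the partitions (expect near-identical values)
round(prop.table(table(train_set$classe)), 3)
round(prop.table(table(test_set$classe)), 3)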
# Remove variables with near-zero variance
nzv_var <- nearZeroVar(train_set)
train_set <- train_set[ , -nzv_var]
test_set <- test_set [ , -nzv_var]
dim(train_set)
## [1] 13737 104
dim(test_set)
## [1] 5885 104
# Remove variables that are mostly NA; a threshold of 95% is selected
na_var <- sapply(train_set, function(x) mean(is.na(x))) > 0.95
train_set <- train_set[ , na_var == FALSE]
test_set <- test_set [ , na_var == FALSE]
dim(train_set)
## [1] 13737 59
dim(test_set)
## [1] 5885 59
# Columns 1 to 5 are identification variables only, so they are removed as well
train_set <- train_set[ , -(1:5)]
test_set <- test_set [ , -(1:5)]
dim(train_set)
## [1] 13737 54
dim(test_set)
## [1] 5885 54
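Before modeling, one more sanity check (a sketch, not in the original run) confirms that every remaining predictor is also present in the quiz data, so the fitted models can be applied to data_quiz later:

# All predictors (everything except the outcome 'classe') should exist in data_quiz
setdiff(setdiff(names(train_set), "classe"), names(data_quiz))
# expect character(0)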
### Correlation Analysis

A correlation analysis among the variables is performed before proceeding to the modeling procedures.
corr_matrix <- cor(train_set[ , -54])
corrplot(corr_matrix, order = "FPC", method = "circle", type = "lower",
tl.cex = 0.6, tl.col = rgb(0, 0, 0))
The number of variables for the analysis has been reduced from the original 160 down to 54.
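As a supplementary step, caret's findCorrelation can list the predictors that are most strongly correlated. This is a minimal sketch; the 0.8 cutoff is an illustrative assumption rather than a value used in the analysis:

# Flag predictors with pairwise correlation above the (assumed) 0.8 cutoff
high_corr <- findCorrelation(corr_matrix, cutoff = 0.8)
names(train_set)[high_corr]

These variables could be dropped or compressed with PCA, but since the tree-based models below handle correlated inputs well, no further reduction is performed here.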
Three methods will be applied to model the classification (on the Train set), and the best one (with the highest accuracy when applied to the Test set) will be used for the quiz predictions. The methods are: Decision Tree, Generalized Boosted Model (GBM), and Random Forest (RF), as described below.
## Training the Models

A confusion matrix is plotted at the end of each analysis to better visualize the accuracy of the models.
### Decision Tree Model
set.seed(1967)
fit_DT <- rpart(classe ~ ., data = train_set, method="class")
fancyRpartPlot(fit_DT)
predict_DT <- predict(fit_DT, newdata = test_set, type="class")
conf_matrix_DT <- confusionMatrix(table(predict_DT, test_set$classe))
conf_matrix_DT
## Confusion Matrix and Statistics
##
##
## predict_DT A B C D E
## A 1460 177 29 37 13
## B 124 775 96 156 81
## C 19 69 800 130 87
## D 28 49 47 536 83
## E 43 69 54 105 818
##
## Overall Statistics
##
## Accuracy : 0.7458
## 95% CI : (0.7345, 0.7569)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6779
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8722 0.6804 0.7797 0.55602 0.7560
## Specificity 0.9392 0.9037 0.9372 0.95794 0.9436
## Pos Pred Value 0.8508 0.6291 0.7240 0.72140 0.7511
## Neg Pred Value 0.9487 0.9218 0.9527 0.91676 0.9450
## Prevalence 0.2845 0.1935 0.1743 0.16381 0.1839
## Detection Rate 0.2481 0.1317 0.1359 0.09108 0.1390
## Detection Prevalence 0.2916 0.2093 0.1878 0.12625 0.1850
## Balanced Accuracy 0.9057 0.7921 0.8585 0.75698 0.8498
plot(conf_matrix_DT$table, col = conf_matrix_DT$byClass,
main = paste("Decision Tree Model: Predictive Accuracy =",
round(conf_matrix_DT$overall['Accuracy'], 4)))
### Generalized Boosted Model (GBM)
set.seed(1967)
ctrl_GBM <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
fit_GBM <- train(classe ~ ., data = train_set, method = "gbm",
trControl = ctrl_GBM, verbose = FALSE)
fit_GBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 53 had non-zero influence.
predict_GBM <- predict(fit_GBM, newdata = test_set)
conf_matrix_GBM <- confusionMatrix(table(predict_GBM, test_set$classe))
conf_matrix_GBM
## Confusion Matrix and Statistics
##
##
## predict_GBM A B C D E
## A 1665 13 0 1 1
## B 8 1120 8 2 4
## C 0 6 1016 11 3
## D 1 0 2 950 10
## E 0 0 0 0 1064
##
## Overall Statistics
##
## Accuracy : 0.9881
## 95% CI : (0.985, 0.9907)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.985
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9946 0.9833 0.9903 0.9855 0.9834
## Specificity 0.9964 0.9954 0.9959 0.9974 1.0000
## Pos Pred Value 0.9911 0.9807 0.9807 0.9865 1.0000
## Neg Pred Value 0.9979 0.9960 0.9979 0.9972 0.9963
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2829 0.1903 0.1726 0.1614 0.1808
## Detection Prevalence 0.2855 0.1941 0.1760 0.1636 0.1808
## Balanced Accuracy 0.9955 0.9893 0.9931 0.9914 0.9917
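As with the Decision Tree, the GBM confusion matrix can be plotted; this is a sketch mirroring the earlier plot call, not part of the original output:

plot(conf_matrix_GBM$table, col = conf_matrix_GBM$byClass,
     main = paste("GBM: Predictive Accuracy =",
                  round(conf_matrix_GBM$overall['Accuracy'], 4)))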
### Random Forest Model
set.seed(1967)
ctrl_RF <- trainControl(method = "repeatedcv", number = 5, repeats = 2)
fit_RF <- train(classe ~ ., data = train_set, method = "rf",
trControl = ctrl_RF, verbose = FALSE)
fit_RF$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, verbose = FALSE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.2%
## Confusion matrix:
## A B C D E class.error
## A 3905 0 0 0 1 0.0002560164
## B 7 2648 3 0 0 0.0037622272
## C 0 4 2391 1 0 0.0020868114
## D 0 0 4 2246 2 0.0026642984
## E 0 0 0 6 2519 0.0023762376
predict_RF <- predict(fit_RF, newdata = test_set)
conf_matrix_RF <- confusionMatrix(table(predict_RF, test_set$classe))
conf_matrix_RF
## Confusion Matrix and Statistics
##
##
## predict_RF A B C D E
## A 1674 1 0 0 0
## B 0 1137 1 0 0
## C 0 1 1025 3 0
## D 0 0 0 961 2
## E 0 0 0 0 1080
##
## Overall Statistics
##
## Accuracy : 0.9986
## 95% CI : (0.9973, 0.9994)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9983
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9982 0.9990 0.9969 0.9982
## Specificity 0.9998 0.9998 0.9992 0.9996 1.0000
## Pos Pred Value 0.9994 0.9991 0.9961 0.9979 1.0000
## Neg Pred Value 1.0000 0.9996 0.9998 0.9994 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1932 0.1742 0.1633 0.1835
## Detection Prevalence 0.2846 0.1934 0.1749 0.1636 0.1835
## Balanced Accuracy 0.9999 0.9990 0.9991 0.9982 0.9991
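The same visualization for the Random Forest (again a sketch, mirroring the Decision Tree plot above):

plot(conf_matrix_RF$table, col = conf_matrix_RF$byClass,
     main = paste("Random Forest Model: Predictive Accuracy =",
                  round(conf_matrix_RF$overall['Accuracy'], 4)))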
## Applying the Best Predictive Model to the Test Data

To summarize, the three models evaluated performed as follows:
- Decision Tree model: the worst performer, with the lowest mean accuracy and the highest standard deviation.
- GBM model: a decent mean accuracy, though slightly lower than RF.
- Random Forest model: the highest mean accuracy and the lowest standard deviation.
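Since fit_GBM and fit_RF were both trained through caret with the same repeated cross-validation settings and seed, their resampling distributions can be compared directly. A minimal sketch (fit_DT was built with rpart directly, so it has no resampling profile to include):

# Compare the cross-validated accuracy distributions of the two caret models
resamps <- resamples(list(GBM = fit_GBM, RF = fit_RF))
summary(resamps)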
### Parameter Tuning

Checking prediction accuracy on my own testing/validation set; I expect accuracy similar to the mean from the cross-validation.
### The Kappa Statistic

The kappa statistic (labeled Kappa in the previous output) adjusts accuracy by accounting for the possibility of a correct prediction by chance alone. Kappa values range up to a maximum of 1, which indicates perfect agreement between the model's predictions and the true values, a rare occurrence. Values less than one indicate imperfect agreement.
Depending on how your model is to be used, the interpretation of the kappa statistic might vary. One common interpretation is as follows:

- Poor agreement: less than 0.20
- Fair agreement: 0.20 to 0.40
- Moderate agreement: 0.40 to 0.60
- Good agreement: 0.60 to 0.80
- Very good agreement: 0.80 to 1.00
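For concreteness, kappa can be reproduced by hand from a confusion matrix as (observed accuracy - chance accuracy) / (1 - chance accuracy). This sketch checks the Decision Tree value reported above:

# Chance (expected) accuracy comes from the row and column marginals
tab <- conf_matrix_DT$table
acc_obs <- sum(diag(tab)) / sum(tab)
acc_exp <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2
(acc_obs - acc_exp) / (1 - acc_exp)  # should be close to the reported Kappa of 0.6779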
All three models perform as expected: the deviation from the cross-validation accuracy is low, and I see no reason to change the resampling method or add repetitions.
Checking whether there is anything to gain from increasing the number of boosting iterations (GBM) or from further tuning of the Random Forest.
plot(fit_RF)
print(fit_RF$bestTune)
## mtry
## 2 27
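The analogous check for the GBM model (a sketch; this chunk was not run in the original report) would plot accuracy against the number of boosting iterations and print the winning grid point, which per the text below was 150 trees, interaction depth 3, and shrinkage 0.1:

plot(fit_GBM)            # accuracy vs. boosting iterations, by tree depth
print(fit_GBM$bestTune)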
The predictive accuracy of the Random Forest model is excellent at 99.8%; accuracy has plateaued, and further tuning would yield only marginal gains. For the Random Forest the best value of mtry was 27; for the GBM model the best tuning parameters were 150 trees (boosting iterations), an interaction depth of 3, and a shrinkage of 0.1.
- Decision Tree Model: 74.58 %
- Generalized Boosted Model: 98.81 %
- Random Forest Model: 99.86 %
The Random Forest model is selected and applied to make predictions on the 20 data points from the original testing dataset (data_quiz).
cat("Predictions: ", paste(predict(fit_RF, data_quiz)))
## Predictions: B A B A A E D B A A B C B A E E A B B B
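For quiz submission, the predictions can optionally be written to individual text files. This is a hypothetical convenience sketch, not part of the graded output:

# Write each of the 20 predictions to its own file (hypothetical helper step)
quiz_pred <- as.character(predict(fit_RF, data_quiz))
for (i in seq_along(quiz_pred)) {
  write.table(quiz_pred[i], file = paste0("problem_id_", i, ".txt"),
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}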