This is the course project for Coursera's Practical Machine Learning. It uses the Weight Lifting Exercises data: readings from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. The goal is to predict the classe variable, which labels how each exercise was performed, from those readings. Both training and testing data are provided.
First, load the packages the project requires.
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(caret)
## Loading required package: lattice
library(rpart)
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gbm)
## Loaded gbm 2.1.8
The next step reads the data from the course URLs into R data frames. Two files are provided: a training set and a 20-case testing set.
train_input <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                        header = TRUE)
test_input <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
                       header = TRUE)
dim(train_input)
## [1] 19622 160
dim(test_input)
## [1] 20 160
Next, tidy the data by dropping every column that contains at least one missing value (NA).
train_data <- train_input[, colSums(is.na(train_input)) == 0] # keep only columns with no NAs
test_data <- test_input[, colSums(is.na(test_input)) == 0]
dim(train_data)
## [1] 19622 93
dim(test_data)
## [1] 20 60
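# Drop the first 7 columns (row index, participant name, timestamps, window fields): identifiers, not sensor readings
train_set <- train_data[, -c(1:7)]
valid_set <- test_data[, -c(1:7)]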
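The filter above keeps only columns with zero NAs, which is stricter than dropping columns that are entirely NA. The sketch below contrasts the two rules on a made-up three-column CSV (the column names and values are hypothetical). It also shows the na.strings argument of read.csv. One possible reason the training set keeps 93 columns while the testing set keeps 60 is that blank cells or spreadsheet error strings such as "#DIV/0!" are read as text rather than NA by default; that explanation is an assumption about the raw files and is not verified here.

```r
# Hypothetical toy CSV: 'kurtosis' has a blank cell and a spreadsheet error string
csv_text <- "id,roll,kurtosis\n1,0.5,\n2,0.7,#DIV/0!\n3,0.9,1.2"

# Default parsing: the blank and "#DIV/0!" stay as text, so no NA is detected
raw_default <- read.csv(text = csv_text)

# Explicitly mark blanks and the error string as missing
raw_marked <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

has_any_na <- function(df) colSums(is.na(df)) > 0          # rule used in this report
is_all_na  <- function(df) colSums(is.na(df)) == nrow(df)  # looser alternative

kept_default    <- names(raw_default)[!has_any_na(raw_default)]
kept_marked     <- names(raw_marked)[!has_any_na(raw_marked)]
kept_not_all_na <- names(raw_marked)[!is_all_na(raw_marked)]

print(kept_default)
print(kept_marked)
print(kept_not_all_na)
```

With default parsing all three columns survive the any-NA filter; once the markers are declared as missing, the kurtosis column is dropped by the any-NA rule but kept by the all-NA rule.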
To estimate out-of-sample performance, split train_set into a training partition (80%) and a held-out test partition (20%) using createDataPartition from the caret package.
set.seed(1590)
intrain <- createDataPartition(y = train_set$classe,
                               p = 0.8,
                               list = FALSE)
trainData <- train_set[intrain,]
testData <- train_set[-intrain,]
dim(trainData)
## [1] 15699 86
dim(testData)
## [1] 3923 86
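For a factor outcome, createDataPartition samples within each class, so the class proportions are roughly preserved in both partitions. The base R sketch below imitates that idea on a made-up outcome vector (the class counts are hypothetical). It is an illustration of stratified sampling under those assumptions, not caret's actual implementation.

```r
set.seed(1590)

# Hypothetical outcome with unequal class sizes
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(50, 30, 30, 30, 40)))

# Sample 80% of the row indices separately within each class
rows_by_class <- split(seq_along(classe), classe)
train_rows <- sort(unlist(
  lapply(rows_by_class, function(rows) sample(rows, size = round(0.8 * length(rows)))),
  use.names = FALSE
))
test_rows <- setdiff(seq_along(classe), train_rows)

# Share of each class that landed in the training partition
train_share <- as.vector(table(classe[train_rows]) / table(classe))
print(train_share)
```

Because the sampling is done per class, every class contributes the same share of its rows to the training partition.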
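# Drop near-zero-variance predictors, identified on the training partition only
cols <- nearZeroVar(x = trainData)
if (length(cols) > 0) { # guard: indexing with -integer(0) would drop every column
  trainData <- trainData[, -cols]
  testData <- testData[, -cols]
}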
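nearZeroVar flags a predictor when one value dominates and there are few distinct values. The sketch below reimplements that rule in base R for a single vector, assuming caret's documented defaults (freqCut = 95/5 and uniqueCut = 10). The function name nzv_sketch and the example vectors are hypothetical, and the handling of single-valued columns is simplified.

```r
# Simplified near-zero-variance check for one predictor (illustrative only)
nzv_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most common value's count to the second most common
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of observations
  pct_unique <- 100 * length(counts) / length(x)
  list(freq_ratio = freq_ratio,
       pct_unique = pct_unique,
       flagged = freq_ratio > freq_cut && pct_unique < unique_cut)
}

mostly_zero <- c(rep(0, 98), 1, 2)  # one value dominates
varied      <- 1:100                # every value distinct

print(nzv_sketch(mostly_zero))
print(nzv_sketch(varied))
```

The first vector is flagged because 0 occurs 98 times as often as the runner-up and only 3% of its values are distinct; the second is not.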
Next, fit a random forest model through caret with its default training settings.
rf_mdl <- train(classe ~ .,
                data = trainData,
                method = "rf")
rf_mdl$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.59%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4462    2    0    0    0 0.0004480287
## B   18 3009   10    1    0 0.0095457538
## C    0   14 2715    9    0 0.0084002922
## D    0    2   22 2546    3 0.0104935873
## E    0    2    2    7 2875 0.0038115038
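The OOB error rate and the class.error column can be recomputed from the counts printed above. The sketch below retypes the matrix by hand (rows are the actual class, columns the predicted class) and is only an arithmetic check, not part of the original analysis.

```r
# Out-of-bag confusion matrix copied from the finalModel printout
oob_cm <- matrix(
  c(4462,    2,    0,    0,    0,
      18, 3009,   10,    1,    0,
       0,   14, 2715,    9,    0,
       0,    2,   22, 2546,    3,
       0,    2,    2,    7, 2875),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

# Overall OOB error: all off-diagonal counts over the total
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

print(round(class_error, 10))
print(round(100 * oob_error, 2))  # percent
```

There are 92 misclassified rows out of 15699, which gives the reported 0.59%.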
Next, evaluate the random forest on the held-out test partition.
rf_pred <- predict(object = rf_mdl, newdata = testData)
rf_cm <- confusionMatrix(rf_pred, as.factor(testData$classe))
rf_cm
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1111    7    0    0    0
##          B    1  752    1    1    0
##          C    3    0  680    5    4
##          D    0    0    3  637    4
##          E    1    0    0    0  713
##
## Overall Statistics
##
## Accuracy : 0.9924
## 95% CI : (0.9891, 0.9948)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9903
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9955   0.9908   0.9942   0.9907   0.9889
## Specificity            0.9975   0.9991   0.9963   0.9979   0.9997
## Pos Pred Value         0.9937   0.9960   0.9827   0.9891   0.9986
## Neg Pred Value         0.9982   0.9978   0.9988   0.9982   0.9975
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2832   0.1917   0.1733   0.1624   0.1817
## Detection Prevalence   0.2850   0.1925   0.1764   0.1642   0.1820
## Balanced Accuracy      0.9965   0.9949   0.9952   0.9943   0.9943
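The headline statistics in the random forest output follow directly from the confusion matrix counts. The sketch below retypes the held-out matrix (rows are predictions, columns are the reference classes) and recomputes accuracy, per-class sensitivity and specificity, and Cohen's kappa in base R. It is a hand check under that reading of the table, not a replacement for confusionMatrix.

```r
# Held-out confusion matrix for the random forest, copied from the output above
cm <- matrix(
  c(1111,   7,   0,   0,   0,
       1, 752,   1,   1,   0,
       3,   0, 680,   5,   4,
       0,   0,   3, 637,   4,
       1,   0,   0,   0, 713),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(cm)
accuracy <- sum(diag(cm)) / n

# Sensitivity: correct predictions over all true members of each class
sensitivity <- diag(cm) / colSums(cm)

# Specificity: 1 minus false positives over all non-members of each class
false_pos   <- rowSums(cm) - diag(cm)
specificity <- 1 - false_pos / (n - colSums(cm))

# Cohen's kappa: agreement beyond what the row and column totals alone would give
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(accuracy, 4))
print(round(sensitivity, 4))
print(round(specificity, 4))
print(round(cohen_kappa, 4))
```

The same arithmetic applies to the GBM and decision tree matrices later in the report.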
Next, fit a gradient boosting model (GBM), tuned with 3-fold cross-validation.
set.seed(1111)
gbm_trctrl <- trainControl(method = "repeatedcv",
                           repeats = 1,
                           number = 3)
gbm_mdl <- train(classe ~ .,
                 data = trainData,
                 method = "gbm",
                 trControl = gbm_trctrl)
gbm_mdl$finalModel
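With repeats = 1 and number = 3, the repeatedcv control amounts to a single pass of 3-fold cross-validation: each row is held out exactly once while the model is fit on the other two folds. The base R sketch below shows a plain random fold assignment for 15699 rows (the size of trainData reported earlier). It is a simplified illustration; caret builds its own folds internally and balances them by class, which this sketch does not do.

```r
set.seed(1111)

n <- 15699  # rows in trainData, per the dim() output earlier
k <- 3      # number of folds

# Shuffle a balanced vector of fold labels so each fold gets n / k rows
fold_id <- sample(rep(seq_len(k), length.out = n))
fold_sizes <- as.vector(table(fold_id))

# For fold i: fit on rows where fold_id != i, evaluate on rows where fold_id == i
holdout_rows <- lapply(seq_len(k), function(i) which(fold_id == i))

print(fold_sizes)
```

Each candidate tuning setting is scored by averaging its accuracy over the three held-out folds.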
Evaluate the GBM model on the same held-out test partition.
gbm_pred <- predict(object = gbm_mdl, newdata = testData)
gbm_cm <- confusionMatrix(gbm_pred, as.factor(testData$classe))
gbm_cm
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1091   28    0    0    2
##          B   12  704   26    3    6
##          C   12   25  644   22    6
##          D    1    2   12  613    9
##          E    0    0    2    5  698
##
## Overall Statistics
##
## Accuracy : 0.9559
## 95% CI : (0.949, 0.9621)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9442
##
## Mcnemar's Test P-Value : 0.0002073
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9776   0.9275   0.9415   0.9533   0.9681
## Specificity            0.9893   0.9851   0.9799   0.9927   0.9978
## Pos Pred Value         0.9732   0.9374   0.9083   0.9623   0.9901
## Neg Pred Value         0.9911   0.9827   0.9876   0.9909   0.9929
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2781   0.1795   0.1642   0.1563   0.1779
## Detection Prevalence   0.2858   0.1914   0.1807   0.1624   0.1797
## Balanced Accuracy      0.9835   0.9563   0.9607   0.9730   0.9830
Next, fit a single decision tree with rpart.
set.seed(3333)
dt_mdl <- rpart(classe ~ ., data = trainData, method = "class")
fancyRpartPlot(dt_mdl)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
Evaluate the decision tree on the held-out test partition.
dt_pred <- predict(dt_mdl, newdata = testData, type = "class")
dt_cm <- confusionMatrix(dt_pred, as.factor(testData$classe))
dt_cm
## Confusion Matrix and Statistics
##
##           Reference
## Prediction   A   B   C   D   E
##          A 969 129  15  46  11
##          B  35 435  36  50  61
##          C  39  82 538  93  75
##          D  31  48  46 395  32
##          E  42  65  49  59 542
##
## Overall Statistics
##
## Accuracy : 0.7339
## 95% CI : (0.7197, 0.7477)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6629
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8683   0.5731   0.7865   0.6143   0.7517
## Specificity            0.9284   0.9425   0.9108   0.9521   0.9329
## Pos Pred Value         0.8282   0.7050   0.6505   0.7156   0.7160
## Neg Pred Value         0.9466   0.9020   0.9528   0.9264   0.9435
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2470   0.1109   0.1371   0.1007   0.1382
## Detection Prevalence   0.2982   0.1573   0.2108   0.1407   0.1930
## Balanced Accuracy      0.8983   0.7578   0.8487   0.7832   0.8423
The held-out accuracy of the three models is collected into one data frame:
acc_cm <- data.frame(rf_cm$overall[1], gbm_cm$overall[1], dt_cm$overall[1])
Next, apply each model to the validation set (valid_set) and collect the predictions side by side. These 20 cases carry no classe labels, so accuracy cannot be computed for them here.
rf_pred_valid <- predict(rf_mdl, newdata = valid_set)
gbm_pred_valid <- predict(gbm_mdl, newdata = valid_set)
dt_pred_valid <- predict(dt_mdl, newdata = valid_set, type = "class")
pred_output <- data.frame(rf_pred_valid,
                          gbm_pred_valid,
                          dt_pred_valid)
headings <- c("RandomForest", "Gbm", "DecisionTree")
names(pred_output) <- headings
The table below gives each model's accuracy on the held-out test partition (testData). Because these rows were not used for fitting, the figures estimate generalization (out-of-sample) accuracy rather than resubstitution accuracy.
acc_cm
##          rf_cm.overall.1. gbm_cm.overall.1. dt_cm.overall.1.
## Accuracy        0.9923528         0.9559011        0.7338771
Random forest had the highest held-out accuracy, 0.9923528, which corresponds to an estimated out-of-sample error of about 0.76%.
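The accuracies can be turned into estimated out-of-sample error rates, and into a rough chance of classifying all 20 validation cases correctly. The sketch below uses the three accuracy values reported above. The all-correct figure assumes the 20 cases are independent and as hard as the held-out rows, which is a simplifying assumption rather than something established in this report.

```r
# Held-out accuracies copied from acc_cm
accuracy <- c(RandomForest = 0.9923528, Gbm = 0.9559011, DecisionTree = 0.7338771)

# Estimated out-of-sample error for each model
oos_error <- 1 - accuracy

# Rough probability of getting all 20 validation cases right (independence assumed)
p_all_20 <- accuracy^20

best_model <- names(which.max(accuracy))

print(round(100 * oos_error, 2))  # percent
print(round(p_all_20, 3))
print(best_model)
```

Small differences in per-case accuracy compound quickly over 20 cases, which is why the gap between the models matters for the final predictions.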
The predictions for the 20 validation cases are:
pred_output
##    RandomForest Gbm DecisionTree
## 1             B   B            B
## 2             A   A            A
## 3             B   B            E
## 4             A   A            D
## 5             A   A            A
## 6             E   E            C
## 7             D   D            D
## 8             B   B            A
## 9             A   A            A
## 10            A   A            A
## 11            B   B            C
## 12            C   C            E
## 13            B   B            C
## 14            A   A            A
## 15            E   E            E
## 16            E   E            E
## 17            A   A            A
## 18            B   B            B
## 19            B   B            B
## 20            B   B            B
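Since the validation cases have no labels, one simple sanity check is how often the models agree with each other. The sketch below retypes the three prediction columns from the table above and measures pairwise agreement with the random forest. Agreement between models is not the same as correctness, so this is only a consistency check.

```r
# Predictions copied from pred_output, one letter per validation case
pred_output <- data.frame(
  RandomForest = strsplit("BABAAEDBAABCBAEEABBB", "")[[1]],
  Gbm          = strsplit("BABAAEDBAABCBAEEABBB", "")[[1]],
  DecisionTree = strsplit("BAEDACDAAACECAEEABBB", "")[[1]],
  stringsAsFactors = FALSE
)

# Share of cases on which each model matches the random forest
agree_rf_gbm <- mean(pred_output$RandomForest == pred_output$Gbm)
agree_rf_dt  <- mean(pred_output$RandomForest == pred_output$DecisionTree)

# Cases where the decision tree departs from the random forest
disagree_cases <- which(pred_output$RandomForest != pred_output$DecisionTree)

print(agree_rf_gbm)
print(agree_rf_dt)
print(disagree_cases)
```

Random forest and GBM agree on all 20 cases, while the decision tree differs on 7 of them, which is in line with its lower held-out accuracy.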
Thus, random forest was the most effective method, with higher held-out accuracy (0.992) than the generalized boosting method (0.956) and the single decision tree (0.734). Its predictions for the 20 validation cases match those of GBM in every case.