Import Libraries
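The chunk that loads the packages is not echoed; the set below is inferred from the startup messages that follow, so it is a minimal sketch and the exact library() calls are an assumption.

library(caret)          # also loads lattice and ggplot2
library(dplyr)
library(rattle)         # also loads tibble and bitops
library(randomForest)
library(rpart)          # assumed for the decision-tree model; loads silently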
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
Cleaning the Data
Inspecting the data shows that cleaning is required:
- Step 1: treat NA, "", and #DIV/0! entries as missing values when reading the data.
- Step 2: remove near-zero-variance variables and variables that are almost entirely NA.
- Step 3: remove non-numerical identifier variables such as timestamps.
# Step 1: read the data, treating "NA", "#DIV/0!" and empty strings as missing
train_data <- read.csv('training.csv', na.strings = c("NA", "#DIV/0!", ""))
test_data  <- read.csv('testing.csv',  na.strings = c("NA", "#DIV/0!", ""))

# Step 2a: drop near-zero-variance variables (identified on the training set)
nz <- nearZeroVar(train_data)
train_data <- train_data[, -nz]
test_data  <- test_data[, -nz]

# Step 2b: drop variables that are more than 95% NA
rm_na <- sapply(train_data, function(x) mean(is.na(x))) > 0.95
train_data <- train_data[, rm_na == FALSE]
test_data  <- test_data[, rm_na == FALSE]

# Step 3: drop the first seven identifier columns (non-numerical variables such as timestamps)
train_data <- train_data[, -c(1:7)]
test_data  <- test_data[, -c(1:7)]
Split the Training Dataset
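The partitioning code is not shown; a minimal sketch using caret's createDataPartition (the seed, split proportion, and object names are assumptions) would look like:

set.seed(12345)                                         # assumed seed
inTrain <- createDataPartition(train_data$classe, p = 0.75, list = FALSE)
training   <- train_data[inTrain, ]                     # used to fit the models
validation <- train_data[-inTrain, ]                    # used for the confusion matrices below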
Machine Learning Algorithms for Prediction
Two algorithms are trained on the training partition and evaluated on the validation set; a sketch of the fitting code precedes each confusion matrix below.
- Decision Tree
- Random Forest
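A sketch of the decision-tree fit and its evaluation on the validation set; the use of rpart and the object names are assumptions, since the original chunk is not shown.

fit_dt  <- rpart(classe ~ ., data = training, method = "class")   # classification tree
pred_dt <- predict(fit_dt, validation, type = "class")            # predicted classes
confusionMatrix(pred_dt, factor(validation$classe))               # evaluate against held-out classes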
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1202 24 96 0 5
## B 397 276 218 34 1
## C 371 35 395 0 0
## D 351 10 292 137 0
## E 184 149 221 36 262
##
## Overall Statistics
##
## Accuracy : 0.4838
## 95% CI : (0.4694, 0.4982)
## No Information Rate : 0.5334
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3264
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4798 0.55870 0.32324 0.66184 0.97761
## Specificity 0.9429 0.84531 0.88313 0.85453 0.86676
## Pos Pred Value 0.9058 0.29806 0.49313 0.17342 0.30751
## Neg Pred Value 0.6132 0.94218 0.78768 0.98208 0.99844
## Prevalence 0.5334 0.10520 0.26022 0.04408 0.05707
## Detection Rate 0.2560 0.05877 0.08411 0.02917 0.05579
## Detection Prevalence 0.2826 0.19719 0.17057 0.16823 0.18143
## Balanced Accuracy 0.7114 0.70201 0.60319 0.75818 0.92218
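A corresponding sketch for the random forest (the randomForest package is loaded above; ntree and the object names are assumptions):

fit_rf  <- randomForest(factor(classe) ~ ., data = training, ntree = 500)
pred_rf <- predict(fit_rf, validation)
confusionMatrix(pred_rf, factor(validation$classe))               # evaluate against held-out classes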
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1327 0 0 0 0
## B 0 926 0 0 0
## C 0 1 800 0 0
## D 0 0 1 789 0
## E 0 0 0 0 852
##
## Overall Statistics
##
## Accuracy : 0.9996
## 95% CI : (0.9985, 0.9999)
## No Information Rate : 0.2826
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9995
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9989 0.9988 1.0000 1.0000
## Specificity 1.0000 1.0000 0.9997 0.9997 1.0000
## Pos Pred Value 1.0000 1.0000 0.9988 0.9987 1.0000
## Neg Pred Value 1.0000 0.9997 0.9997 1.0000 1.0000
## Prevalence 0.2826 0.1974 0.1706 0.1680 0.1814
## Detection Rate 0.2826 0.1972 0.1704 0.1680 0.1814
## Detection Prevalence 0.2826 0.1972 0.1706 0.1682 0.1814
## Balanced Accuracy 1.0000 0.9995 0.9992 0.9999 1.0000
Result
- The confusion matrices show that the random forest (accuracy 0.9996) clearly outperforms the decision tree (accuracy 0.4838), so the random forest model is used for the final prediction.
Conclusion
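The random-forest model is applied to the 20 test cases; a minimal sketch, using the object names assumed in the sketches above:

predict(fit_rf, test_data)    # predicted classe for each test case, printed below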
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E