Goal: To compare the performance of various models in predicting the species of iris flowers.
Approach: Compared cross-validated and test-set accuracy for several classification models (KNN, decision tree, random forest, naive Bayes, SVM).
Results: SVM and random forest achieved 100% test-set accuracy.
library(kableExtra)
library(rattle)
library(corrplot) # Correlation plots
library(dplyr)
library(ggplot2)
library(GGally)
library(ggthemes)
library(plotly)
library(tidyr)
library(caTools) # Train/test split
library(DT)
library(gridExtra)
library(ROCR)
library(leaps)
library(PRROC)
library(boot)
library(naniar) # Missing-value visualization
library(psych)
library(grid)
library(lattice)
library(caret) # Use cross-validation
library(class) # KNN
library(rpart) # Decision Tree
library(caretEnsemble)
library(e1071) # Naive Bayes, SVM
library(kernlab)
Structure of the dataset
## 'data.frame': 150 obs. of 6 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SepalLength: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ SepalWidth : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ PetalLength: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ PetalWidth : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : chr "Iris-setosa" "Iris-setosa" "Iris-setosa" "Iris-setosa" ...
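The code that produces this output is not shown; a minimal sketch, assuming the data were read from a CSV file named Iris.csv into a data frame called iris_data:
iris_data <- read.csv("Iris.csv", stringsAsFactors = FALSE)   # file name is an assumption
str(iris_data)                                                 # structure shown above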
The structure shows that the dataset has five numeric variables and one categorical variable.
Check for null and duplicate values
## id SepalLength SepalWidth PetalLength
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:150 FALSE:150 FALSE:150 FALSE:150
## PetalWidth Species
## Mode :logical Mode :logical
## FALSE:150 FALSE:150
## [1] FALSE
## [1] 150
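A minimal sketch of the checks that could produce this output, assuming the data frame is named iris_data:
summary(is.na(iris_data))     # per-column summary of missing values
any(duplicated(iris_data))    # TRUE if any rows are exact duplicates
nrow(iris_data)               # number of observations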
Look at the missing values
The graph shows that there are no missing values.
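One way such a graph might be produced with the naniar package (the data frame name is an assumption):
library(naniar)
gg_miss_var(iris_data)   # bar chart of missing values per variable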
Get a final glimpse of the data
## Observations: 150
## Variables: 6
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ SepalLength <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, ...
## $ SepalWidth <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, ...
## $ PetalLength <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, ...
## $ PetalWidth <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, ...
## $ Species <chr> "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-seto...
Check the distribution of the variables in the dataset using histograms
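A sketch of how the histograms might be drawn with ggplot2, using the column names from the structure above (the data frame name is an assumption):
library(dplyr)
library(tidyr)
library(ggplot2)

iris_data %>%
  select(SepalLength, SepalWidth, PetalLength, PetalWidth) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 20) +                 # bin count is an arbitrary choice
  facet_wrap(~ variable, scales = "free")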
SepalLength is strongly correlated with PetalLength and PetalWidth, and PetalLength and PetalWidth are strongly correlated with each other. SepalWidth shows only a weak correlation with the other variables.
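A possible corrplot call behind this observation (the data frame name is an assumption):
library(corrplot)
num_vars <- iris_data[, c("SepalLength", "SepalWidth", "PetalLength", "PetalWidth")]
corrplot(cor(num_vars), method = "number", type = "upper")   # pairwise Pearson correlations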
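Before scaling, the data are split into training and test sets. A minimal sketch with caTools; the seed, split ratio, and object names are assumptions (the test set in the results below has 36 rows), and the recoding of the species labels to 0/1/2 seen in the confusion matrices is omitted here:
library(caTools)
iris_data$id      <- NULL                       # drop the id column so Species is column 5
iris_data$Species <- factor(iris_data$Species)  # the classifiers below need a factor outcome
set.seed(123)                                   # assumed seed
split        <- sample.split(iris_data$Species, SplitRatio = 0.76)
training_set <- subset(iris_data, split == TRUE)
test_set     <- subset(iris_data, split == FALSE)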
# Standardize the feature columns (Species is in column 5)
training_set[-5] <- scale(training_set[-5])
test_set[-5]     <- scale(test_set[-5])
The accuracy from KNN is:
## [1] 97.22222
## [1] 97.22222
## [1] 97.22222
## [1] 100
## [1] 100
## [1] 100
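The call that produces these numbers is not shown; a sketch with class::knn on the scaled sets, where the value of k is an assumption:
library(class)
knn_predict <- knn(train = training_set[, -5],    # scaled features
                   test  = test_set[, -5],
                   cl    = training_set$Species,  # training labels
                   k     = 5)                     # assumed k
mean(knn_predict == test_set$Species) * 100       # test-set accuracy in percent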
The predicted accuracy of the decision tree model, estimated by running it on resamples of the training data:
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1 2
## 0 33.3 0.0 0.0
## 1 0.0 30.7 4.4
## 2 0.0 2.6 28.9
##
## Accuracy (average) : 0.9298
The accuracy from the decision tree is:
## [1] 97.22222
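A sketch of how the decision tree might be fit and cross-validated with caret and rpart; the seed and control settings are assumptions:
library(caret)
library(rpart)
set.seed(123)
ctrl     <- trainControl(method = "cv", number = 10)       # 10-fold cross-validation
dt_model <- train(Species ~ ., data = training_set,
                  method = "rpart", trControl = ctrl)
confusionMatrix(dt_model)                                  # cross-validated confusion matrix
dt_predict <- predict(dt_model, newdata = test_set)
mean(dt_predict == test_set$Species) * 100                 # test-set accuracy in percent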
The predicted accuracy of the random forest model, estimated by running it on resamples of the training data:
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1 2
## 0 33.3 0.0 0.0
## 1 0.0 30.7 3.5
## 2 0.0 2.6 29.8
##
## Accuracy (average) : 0.9386
The accuracy from random forest is:
## [1] 100
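The random forest can follow the same pattern, switching the caret method to "rf" (caret fits it with the randomForest package behind the scenes); again a sketch, not the exact call:
set.seed(123)
rf_model <- train(Species ~ ., data = training_set,
                  method = "rf",
                  trControl = trainControl(method = "cv", number = 10))
rf_predict <- predict(rf_model, newdata = test_set)
mean(rf_predict == test_set$Species) * 100    # test-set accuracy in percent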
The predicted accuracy of the Naive Bayes model, estimated by running it on resamples of the training data:
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1 2
## 0 33.3 0.0 0.0
## 1 0.0 31.6 2.6
## 2 0.0 1.8 30.7
##
## Accuracy (average) : 0.9561
## nb_predict
## 0 1 2
## 0 12 0 0
## 1 0 11 1
## 2 0 0 12
The accuracy from Naive Bayes is:
## [1] 97.22222
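A sketch of the Naive Bayes fit with e1071 (whether caret or e1071 produced the final predictions is not visible in the output above):
library(e1071)
nb_model   <- naiveBayes(Species ~ ., data = training_set)
nb_predict <- predict(nb_model, newdata = test_set)
table(test_set$Species, nb_predict)            # confusion table, as shown above
mean(nb_predict == test_set$Species) * 100     # test-set accuracy in percent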
The predicted accuracy of the SVM model, estimated by running it on resamples of the training data:
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction 0 1 2
## 0 33.3 0.0 0.0
## 1 0.0 29.8 2.6
## 2 0.0 3.5 30.7
##
## Accuracy (average) : 0.9386
The accuracy from SVM is:
## [1] 100
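A sketch of the SVM fit with e1071::svm; the kernel choice is an assumption:
library(e1071)
svm_model   <- svm(Species ~ ., data = training_set, kernel = "linear")
svm_predict <- predict(svm_model, newdata = test_set)
mean(svm_predict == test_set$Species) * 100    # test-set accuracy in percent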