Summary

Goal: To compare the performance of several classification models at predicting the species of an iris flower.

Approach: Compared cross-validated and test-set accuracy across classification models: k-NN, decision tree, random forest, naive Bayes, and SVM.

Results: SVM and random forest achieved 100% accuracy on the test set.

library(kableExtra)
library(rattle)
library(corrplot)
library(dplyr)
library(ggplot2)
library(GGally)
library(ggthemes) 
library(plotly) 
library(tidyr)
library(caTools)
library(DT)
library(gridExtra)
library(ROCR)
library(leaps)
library(PRROC)
library(boot)
library(naniar)
library(psych)
library(grid)
library(lattice)
library(caret) # Use cross-validation
library(class)
library(rpart) # Decision Tree
library(caretEnsemble)
library(e1071)
library(kernlab)

Data Cleaning

Structure of the dataset

## 'data.frame':    150 obs. of  6 variables:
##  $ id         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SepalLength: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ SepalWidth : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ PetalLength: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ PetalWidth : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species    : chr  "Iris-setosa" "Iris-setosa" "Iris-setosa" "Iris-setosa" ...

The structure shows that the dataset has five numeric variables (including the id column) and one categorical variable, Species.

Check for null and duplicate values

##      id          SepalLength     SepalWidth      PetalLength    
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:150       FALSE:150       FALSE:150       FALSE:150      
##  PetalWidth       Species       
##  Mode :logical   Mode :logical  
##  FALSE:150       FALSE:150
## [1] FALSE
## [1] 150
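
The code for these checks is hidden in this rendering; a minimal sketch that would produce output like the above, assuming the data frame is named data:

summary(is.na(data))      # per-column counts of missing values (all FALSE here)
any(duplicated(data))     # TRUE if any row is an exact duplicate
nrow(distinct(data))      # number of unique rows (dplyr); 150 means no duplicates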

Look at the missing values

The plot shows that there are no missing values.
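
The missingness plot can be produced with naniar, which is loaded above; a minimal sketch:

gg_miss_var(data)   # bar chart of missing-value counts per variable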

Get a final glimpse of the cleaned data

## Observations: 150
## Variables: 6
## $ id          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ SepalLength <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, ...
## $ SepalWidth  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, ...
## $ PetalLength <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, ...
## $ PetalWidth  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, ...
## $ Species     <chr> "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-seto...

Data Visualization

Distribution

Check the distribution of each variable in the dataset using histograms.
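
The histogram code is not shown; one way to produce such a panel with tidyr and ggplot2 (a sketch, again assuming the data frame is named data):

data %>%
  select(-id, -Species) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 20) +               # bin count is an arbitrary choice
  facet_wrap(~ variable, scales = "free")   # one panel per measurement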

Correlation

SepalLength is strongly correlated with PetalLength and PetalWidth, and PetalLength and PetalWidth are strongly correlated with each other. SepalWidth, by contrast, is only weakly and negatively correlated with the other measurements.
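
A sketch of the correlation plot, using corrplot (loaded above):

corr_mat <- cor(select(data, SepalLength, SepalWidth, PetalLength, PetalWidth))
corrplot(corr_mat, method = "number")   # display the correlation coefficients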

Outliers

No observations are removed after the outlier check: the data frame is unchanged, with 150 observations of 6 variables.
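
One way to inspect outliers is with per-species boxplots (a sketch; the original plot is not shown in this rendering):

data %>%
  pivot_longer(-c(id, Species), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = Species, y = value)) +
  geom_boxplot() +                          # whisker points flag potential outliers
  facet_wrap(~ variable, scales = "free")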

K-NN

Encoding the target feature as factor
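
The encoding and train/test split are not shown; a minimal sketch using caTools. The seed and split ratio are assumptions, with the ratio chosen so the test set has 36 rows, consistent with the accuracies reported below:

data <- select(data, -id)                 # drop id; Species becomes column 5
data$Species <- factor(data$Species)      # encode the target as a factor
set.seed(123)                             # assumed seed, for reproducibility
split <- sample.split(data$Species, SplitRatio = 0.76)
training_set <- subset(data, split == TRUE)
test_set <- subset(data, split == FALSE)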

Feature Scaling

training_set[-5] <- scale(training_set[-5])   # standardize the four predictors (Species is column 5)
test_set[-5] <- scale(test_set[-5])           # note: the test set is scaled with its own mean/sd

Fitting K-NN to the Training set and Predicting the Test set results

Making the Confusion Matrix

Classification Accuracy
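
A sketch of the fit, confusion matrix, and accuracy computation with class::knn; the six values printed below presumably come from repeating this for different values of k (the exact values tried are not shown):

y_pred <- knn(train = training_set[-5], test = test_set[-5],
              cl = training_set$Species, k = 5)   # k = 5 is an assumption
cm <- table(test_set$Species, y_pred)             # confusion matrix
sum(diag(cm)) / sum(cm) * 100                     # accuracy in percent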

The test accuracy from k-NN is:

## [1] 97.22222
## [1] 97.22222
## [1] 97.22222
## [1] 100
## [1] 100
## [1] 100

Decision Tree

The predicted accuracy of the decision tree model, estimated by running it on resamples of the training data:
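
A sketch of how such a cross-validated confusion matrix can be produced with caret (the method and control settings are assumptions; the 0/1/2 labels in the output suggest Species was re-coded to integer levels):

ctrl <- trainControl(method = "cv", number = 10)   # 10-fold cross-validation
dt_model <- train(Species ~ ., data = training_set,
                  method = "rpart", trControl = ctrl)
confusionMatrix(dt_model)   # CV confusion matrix, as below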

## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    0    1    2
##          0 33.3  0.0  0.0
##          1  0.0 30.7  4.4
##          2  0.0  2.6 28.9
##                             
##  Accuracy (average) : 0.9298

The test accuracy from the decision tree is:

## [1] 97.22222

Random Forest

The predicted accuracy of the random forest model, estimated by running it on resamples of the training data:
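
A similar caret sketch for the random forest (method = "rf" pulls in the randomForest package, which caret loads on demand):

ctrl <- trainControl(method = "cv", number = 10)
rf_model <- train(Species ~ ., data = training_set,
                  method = "rf", trControl = ctrl)
confusionMatrix(rf_model)   # CV confusion matrix, as below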

## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    0    1    2
##          0 33.3  0.0  0.0
##          1  0.0 30.7  3.5
##          2  0.0  2.6 29.8
##                             
##  Accuracy (average) : 0.9386

The test accuracy from the random forest is:

## [1] 100

NB

The predicted accuracy of the Naive Bayes model, estimated by running it on resamples of the training data:
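
A caret sketch for naive Bayes (method = "nb" uses the klaR package behind the scenes; the nb_predict table below is the test-set confusion matrix):

ctrl <- trainControl(method = "cv", number = 10)
nb_model <- train(Species ~ ., data = training_set,
                  method = "nb", trControl = ctrl)
confusionMatrix(nb_model)                            # CV confusion matrix
nb_predict <- predict(nb_model, newdata = test_set)  # test-set predictions
table(test_set$Species, nb_predict)                  # test-set confusion matrix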

## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    0    1    2
##          0 33.3  0.0  0.0
##          1  0.0 31.6  2.6
##          2  0.0  1.8 30.7
##                             
##  Accuracy (average) : 0.9561
##    nb_predict
##      0  1  2
##   0 12  0  0
##   1  0 11  1
##   2  0  0 12

The test accuracy from Naive Bayes is:

## [1] 97.22222

SVM

The predicted accuracy of the SVM model, estimated by running it on resamples of the training data:
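
A caret sketch for the SVM; the kernel is an assumption (kernlab is loaded above, so method = "svmLinear" or "svmRadial" is plausible):

ctrl <- trainControl(method = "cv", number = 10)
svm_model <- train(Species ~ ., data = training_set,
                   method = "svmLinear", trControl = ctrl)
confusionMatrix(svm_model)   # CV confusion matrix, as below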

## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    0    1    2
##          0 33.3  0.0  0.0
##          1  0.0 29.8  2.6
##          2  0.0  3.5 30.7
##                             
##  Accuracy (average) : 0.9386

The test accuracy from SVM is:

## [1] 100