Dataset & Objective

I will be using one of the most classic datasets, iris. This dataset gives the measurements, in centimeters, of sepal length and width and petal length and width for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

The features used will be Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width; the target variable is Species.
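As a quick sanity check on the description above, the shape and class balance of the data can be inspected directly. This is a minimal sketch that assumes only the built-in iris dataset that ships with base R.

```r
# Load the built-in dataset and confirm its shape and class balance
data(iris)

dim(iris)             # number of rows and columns
names(iris)           # the four measurements plus Species
table(iris$Species)   # number of flowers per species
```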

Libraries

I imported the following libraries to carry out the analysis:

library(class) # to carry out KNN
library(gmodels) # to check model accuracy
library(ggvis) # for better visualization
## Warning: package 'ggvis' was built under R version 3.2.5

Data Overview

This is what the iris data.frame table looks like (partial):

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Summary statistics of data:

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
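The preview and summary above have the shape of output from head() and summary(). The original calls are not shown, so treating them as the source of these tables is an assumption. The names preview and overview are only illustrative.

```r
data(iris)

# First six rows, matching the partial table shown above
preview <- head(iris)
preview

# Per-column summary: quartiles for numeric columns, counts for the Species factor
overview <- summary(iris)
overview
```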

Scatterplots

Scatter plots give a high-level view of how the measurements are correlated.

For sepal length and width, setosa shows a rather high positive correlation compared with versicolor and virginica.

For petal length and width, however, the plot suggests a fairly high positive correlation for all 3 species.
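The visual impressions from the scatter plots can be checked numerically. The sketch below computes the Pearson correlation within each species and draws one scatter plot with base graphics instead of ggvis; both the helper cor_by_species and the use of base graphics are assumptions made for illustration, not the original plotting code. Keep in mind that a correlation computed over all 150 rows can differ noticeably from the per-species values.

```r
data(iris)

# Hypothetical helper: Pearson correlation between two columns, per species
cor_by_species <- function(x, y) {
  sapply(split(iris, iris$Species), function(d) cor(d[[x]], d[[y]]))
}

sepal_cor <- cor_by_species("Sepal.Length", "Sepal.Width")
petal_cor <- cor_by_species("Petal.Length", "Petal.Width")

round(sepal_cor, 2)
round(petal_cor, 2)

# Scatter plot of sepal length vs. width, coloured by species
plot(iris$Sepal.Length, iris$Sepal.Width,
     col = iris$Species, pch = 19,
     xlab = "Sepal.Length", ylab = "Sepal.Width")
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)
```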

Training and Test Data

I will use the training set to train the system and the test set to evaluate the trained system. The training set is the larger share: the test set holds 40 of the 150 rows, so the ratio of training to test data is roughly 3:1.
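One common way to produce such a split is to draw a random group indicator for every row. The sketch below is an assumption about how the split could be done, not the original code: the seed and the 0.75/0.25 probabilities are chosen for illustration (the 40-row test set reported in the cross tabulation suggests about three quarters of the rows went to training). The object names match those used in the knn() call later on.

```r
data(iris)

set.seed(1234)  # assumed seed, only so the sketch is reproducible

# Assign each row to group 1 (training) or group 2 (test) at random
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.75, 0.25))

# Features: the four measurement columns
iris.training <- iris[ind == 1, 1:4]
iris.test     <- iris[ind == 2, 1:4]

# Class labels: the Species column for the same rows
iris.trainLabels <- iris[ind == 1, 5]
iris.testLabels  <- iris[ind == 2, 5]

c(train = nrow(iris.training), test = nrow(iris.test))
```

Because the indicator is drawn row by row, the group sizes vary with the seed and will not necessarily match the 40 test rows reported here.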

Training data:

##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 6          5.4         3.9          1.7         0.4
## 7          4.6         3.4          1.4         0.3

Test data:

##    Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5           5.0         3.6          1.4         0.2
## 11          5.4         3.7          1.5         0.2
## 14          4.3         3.0          1.1         0.1
## 16          5.7         4.4          1.5         0.4
## 26          5.0         3.0          1.6         0.2
## 28          5.2         3.5          1.5         0.2

Class Labels:

The class labels contain the target variable for the training and test data.

Training Labels:

## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

Test Labels:

## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

Building the Classifier

The machine learning algorithm I will be using is K Nearest Neighbour (KNN), which classifies each test observation by the majority species among its k closest training observations. The target variable is Species, and the parameter k used here is 3.

iris_pred <- knn(train = iris.training, test = iris.test, cl = iris.trainLabels, k=3)
iris_pred
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] versicolor versicolor versicolor versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] virginica  virginica  virginica  virginica  versicolor virginica 
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## [37] virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica
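To make the idea behind knn() concrete, here is a simplified hand-written version for a single observation. It is a sketch under stated assumptions, not the implementation in the class package: knn_one is a hypothetical helper, it uses plain Euclidean distance, and it resolves ties by taking the first species with the most votes, whereas class::knn breaks ties at random.

```r
data(iris)

# Hypothetical helper: classify one observation by majority vote
# among its k nearest training rows (Euclidean distance)
knn_one <- function(train, labels, query, k = 3) {
  q <- unlist(query, use.names = FALSE)
  # Subtract the query from every training row, then take the row-wise norm
  d <- sqrt(rowSums(sweep(as.matrix(train), 2, q)^2))
  nearest <- order(d)[seq_len(k)]
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

# Classify the first flower using all other rows as training data
knn_one(iris[-1, 1:4], iris$Species[-1], iris[1, 1:4], k = 3)
```

Running the helper over every row of a test set would give a vector comparable to iris_pred, apart from the tie-breaking difference noted above.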

Comparing the Outcomes

Here I compare the predicted species against the observed species for each test observation.

##    iris.testLabels  iris_pred
## 1           setosa     setosa
## 2           setosa     setosa
## 3           setosa     setosa
## 4           setosa     setosa
## 5           setosa     setosa
## 6           setosa     setosa
## 7           setosa     setosa
## 8           setosa     setosa
## 9           setosa     setosa
## 10          setosa     setosa
## 11          setosa     setosa
## 12          setosa     setosa
## 13      versicolor versicolor
## 14      versicolor versicolor
## 15      versicolor versicolor
## 16      versicolor versicolor
## 17      versicolor versicolor
## 18      versicolor versicolor
## 19      versicolor versicolor
## 20      versicolor versicolor
## 21      versicolor versicolor
## 22      versicolor versicolor
## 23      versicolor versicolor
## 24      versicolor versicolor
## 25       virginica  virginica
## 26       virginica  virginica
## 27       virginica  virginica
## 28       virginica  virginica
## 29       virginica versicolor
## 30       virginica  virginica
## 31       virginica  virginica
## 32       virginica  virginica
## 33       virginica  virginica
## 34       virginica  virginica
## 35       virginica  virginica
## 36       virginica  virginica
## 37       virginica  virginica
## 38       virginica  virginica
## 39       virginica  virginica
## 40       virginica  virginica

The model predicted everything correctly except for one entry in row 29, where a virginica flower was classified as versicolor. That is 39 correct out of 40, an accuracy of 97.5%.
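The same comparison can be summarised as a single accuracy figure. In the sketch below, observed and predicted are hypothetical stand-ins for iris.testLabels and iris_pred, rebuilt from the table above (12 setosa, 12 versicolor, 16 virginica, with row 29 predicted as versicolor) so that the example runs on its own.

```r
species <- c("setosa", "versicolor", "virginica")

# Rebuild the observed labels in the order shown in the comparison table
observed <- factor(rep(species, times = c(12, 12, 16)), levels = species)

# Predictions match the observations except for row 29
predicted <- observed
predicted[29] <- "versicolor"

# Share of test rows where prediction and observation agree
accuracy <- mean(predicted == observed)
accuracy

# Row numbers of the misclassified observations
misclassified <- which(predicted != observed)
misclassified
```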

Cross Tabulation Table

A cross tabulation table helps in understanding the relationship between the observed and predicted species.

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  40 
## 
##  
##                 | iris_pred 
## iris.testLabels |     setosa | versicolor |  virginica |  Row Total | 
## ----------------|------------|------------|------------|------------|
##          setosa |         12 |          0 |          0 |         12 | 
##                 |      1.000 |      0.000 |      0.000 |      0.300 | 
##                 |      1.000 |      0.000 |      0.000 |            | 
##                 |      0.300 |      0.000 |      0.000 |            | 
## ----------------|------------|------------|------------|------------|
##      versicolor |          0 |         12 |          0 |         12 | 
##                 |      0.000 |      1.000 |      0.000 |      0.300 | 
##                 |      0.000 |      0.923 |      0.000 |            | 
##                 |      0.000 |      0.300 |      0.000 |            | 
## ----------------|------------|------------|------------|------------|
##       virginica |          0 |          1 |         15 |         16 | 
##                 |      0.000 |      0.062 |      0.938 |      0.400 | 
##                 |      0.000 |      0.077 |      1.000 |            | 
##                 |      0.000 |      0.025 |      0.375 |            | 
## ----------------|------------|------------|------------|------------|
##    Column Total |         12 |         13 |         15 |         40 | 
##                 |      0.300 |      0.325 |      0.375 |            | 
## ----------------|------------|------------|------------|------------|
## 
##
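The table above has the layout produced by CrossTable() from gmodels; the call itself is not shown. As a rough equivalent in base R, the sketch below rebuilds the counts and the three kinds of proportions with table() and prop.table(). The vectors observed and predicted are hypothetical stand-ins for iris.testLabels and iris_pred, reconstructed from the counts in the table.

```r
species <- c("setosa", "versicolor", "virginica")

# Stand-ins for the test labels and predictions, rebuilt from the reported counts
observed <- factor(rep(species, times = c(12, 12, 16)), levels = species)
predicted <- observed
predicted[29] <- "versicolor"

# Counts: rows are observed species, columns are predicted species
confusion <- table(observed, predicted)
confusion

# N / Row Total: share of each observed species assigned to each prediction
row_share <- prop.table(confusion, margin = 1)
# N / Col Total: share of each predicted species coming from each observed species
col_share <- prop.table(confusion, margin = 2)
# N / Table Total: share of all test rows in each cell
total_share <- prop.table(confusion)

round(row_share, 3)
round(col_share, 3)
round(total_share, 3)
```

Read this way, the diagonal of the row shares gives the fraction of each observed species that was recovered, and the diagonal of the column shares gives the fraction of each predicted species that was actually correct.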