I will be using one of the most classic datasets, iris. This dataset gives the measurements, in centimeters, of sepal length, sepal width, petal length, and petal width for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
The features used are Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width; the target variable is Species.
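The structure described above can be checked directly, since iris ships with base R. This is only a quick sanity check, not part of the original analysis:

```r
# iris is part of the datasets package that loads with base R.
data(iris)

dim(iris)            # number of rows and columns
names(iris)          # column names
table(iris$Species)  # number of flowers per species
sapply(iris, class)  # four numeric features plus one factor
```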
I imported the following libraries to carry out the analysis:
library(class) # to carry out KNN
library(gmodels) # to check model accuracy
library(ggvis) # for better visualization
## Warning: package 'ggvis' was built under R version 3.2.5
Here are the first six rows of the iris data frame:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Summary statistics of data:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
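The two listings above look like the output of head() and summary(). The calls are not shown in the original, so treat that as an assumption. The aggregate() line is an extra that is not in the original; it breaks the means down by species, which the overall summary hides.

```r
head(iris)     # first six rows of the data frame
summary(iris)  # quartiles and mean per numeric column, counts for the factor

# Extra (not in the original): mean of each measurement within each species
by_species <- aggregate(. ~ Species, data = iris, FUN = mean)
by_species
```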
Scatter plots give a high-level view of how the variables relate to one another.
For petal length and petal width, the points for all three species show a fairly strong positive correlation.
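As a numeric companion to the scatter plot, the correlation can be computed for the whole data set and separately within each species. This is a minimal sketch in base R; the variable names `overall` and `by_species` are my own.

```r
# Pearson correlation between petal length and petal width, all rows pooled
overall <- cor(iris$Petal.Length, iris$Petal.Width)

# The same correlation computed separately within each species
by_species <- sapply(split(iris, iris$Species), function(d) {
  cor(d$Petal.Length, d$Petal.Width)
})

print(round(overall, 3))
print(round(by_species, 3))
```

The pooled value and the per-species values need not agree, because pooling also picks up the differences between species.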
I will use the training set to train the model and the test set to evaluate it. The test set holds 40 of the 150 rows, so the ratio of training to test data is roughly 3:1.
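The code that produced the split is not shown, so the following is only one way it could be done. The seed, the fixed test size of 40, and the helper name `test_idx` are assumptions; with a different seed or sampling method the selected rows will differ from the ones printed here.

```r
set.seed(1234)  # assumed seed, only to make this sketch repeatable

n_test <- 40    # test-set size, taken from the cross-tabulation total
test_idx <- sort(sample(nrow(iris), n_test))  # hypothetical helper

# Keep only the four feature columns; Species is stored separately as labels
iris.test     <- iris[test_idx, 1:4]
iris.training <- iris[-test_idx, 1:4]

nrow(iris.training)
nrow(iris.test)
```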
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 6 5.4 3.9 1.7 0.4
## 7 4.6 3.4 1.4 0.3
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 5 5.0 3.6 1.4 0.2
## 11 5.4 3.7 1.5 0.2
## 14 4.3 3.0 1.1 0.1
## 16 5.7 4.4 1.5 0.4
## 26 5.0 3.0 1.6 0.2
## 28 5.2 3.5 1.5 0.2
The class labels contain the target variable for the training and test data.
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
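The label vectors have to be built from the same row selection as the feature tables, so that label i belongs to row i. A sketch under the same kind of assumption (hypothetical seed and `test_idx` helper, 40 test rows):

```r
set.seed(1234)                            # assumed seed
test_idx <- sort(sample(nrow(iris), 40))  # hypothetical split of 40 test rows

iris.trainLabels <- iris[-test_idx, 5]    # column 5 is Species
iris.testLabels  <- iris[test_idx, 5]

head(iris.trainLabels)
table(iris.testLabels)
```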
The machine learning algorithm I will use to classify the test data on the target variable, Species, is k-nearest neighbours (KNN). The parameter k is set to 3.
iris_pred <- knn(train = iris.training, test = iris.test, cl = iris.trainLabels, k=3)
iris_pred
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] versicolor versicolor versicolor versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] virginica virginica virginica virginica versicolor virginica
## [31] virginica virginica virginica virginica virginica virginica
## [37] virginica virginica virginica virginica
## Levels: setosa versicolor virginica
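To make the k = 3 vote concrete, here is a hand-rolled sketch of what KNN does for a single observation: compute the Euclidean distance to every training row, take the k closest, and return the most common label. The function `knn_one` is hypothetical and is not how the class package is implemented. In particular, it resolves ties by taking the first label, whereas knn() breaks ties at random.

```r
# Classify one new observation by majority vote among its k nearest training rows.
knn_one <- function(train, labels, new_point, k = 3) {
  # Subtract new_point from every training row, column by column
  diffs <- sweep(as.matrix(train), 2, unlist(new_point))
  dists <- sqrt(rowSums(diffs^2))       # Euclidean distances
  nearest <- order(dists)[seq_len(k)]   # indices of the k closest rows
  votes <- table(labels[nearest])       # count labels among the neighbours
  names(votes)[which.max(votes)]        # first label with the most votes
}

features <- iris[, 1:4]

# Hold out row 1 (a setosa flower) and classify it from the other 149 rows
knn_one(features[-1, ], iris$Species[-1], features[1, ], k = 3)
```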
Here I compare the predicted species with the observed species in the test set.
## iris.testLabels iris_pred
## 1 setosa setosa
## 2 setosa setosa
## 3 setosa setosa
## 4 setosa setosa
## 5 setosa setosa
## 6 setosa setosa
## 7 setosa setosa
## 8 setosa setosa
## 9 setosa setosa
## 10 setosa setosa
## 11 setosa setosa
## 12 setosa setosa
## 13 versicolor versicolor
## 14 versicolor versicolor
## 15 versicolor versicolor
## 16 versicolor versicolor
## 17 versicolor versicolor
## 18 versicolor versicolor
## 19 versicolor versicolor
## 20 versicolor versicolor
## 21 versicolor versicolor
## 22 versicolor versicolor
## 23 versicolor versicolor
## 24 versicolor versicolor
## 25 virginica virginica
## 26 virginica virginica
## 27 virginica virginica
## 28 virginica virginica
## 29 virginica versicolor
## 30 virginica virginica
## 31 virginica virginica
## 32 virginica virginica
## 33 virginica virginica
## 34 virginica virginica
## 35 virginica virginica
## 36 virginica virginica
## 37 virginica virginica
## 38 virginica virginica
## 39 virginica virginica
## 40 virginica virginica
The model predicted every entry correctly except one: row 29, a virginica flower that was classified as versicolor.
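The same comparison can be reduced to a single accuracy figure. The sketch below rebuilds the two columns from the listing above (12 setosa, 12 versicolor, 16 virginica, with row 29 mispredicted) rather than rerunning the model; the names `observed` and `predicted` are my own stand-ins for iris.testLabels and iris_pred.

```r
species <- c("setosa", "versicolor", "virginica")

# Observed labels in the order shown in the listing
observed <- factor(rep(species, times = c(12, 12, 16)), levels = species)

# Predictions match the observed labels except for row 29
predicted <- observed
predicted[29] <- "versicolor"

accuracy <- mean(predicted == observed)  # share of correct predictions
accuracy                                 # 39 of 40, i.e. 0.975
which(predicted != observed)             # row number of the miss
```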
A cross-tabulation helps show the relationship between the observed and predicted species.
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 40
##
##
## | iris_pred
## iris.testLabels | setosa | versicolor | virginica | Row Total |
## ----------------|------------|------------|------------|------------|
## setosa | 12 | 0 | 0 | 12 |
## | 1.000 | 0.000 | 0.000 | 0.300 |
## | 1.000 | 0.000 | 0.000 | |
## | 0.300 | 0.000 | 0.000 | |
## ----------------|------------|------------|------------|------------|
## versicolor | 0 | 12 | 0 | 12 |
## | 0.000 | 1.000 | 0.000 | 0.300 |
## | 0.000 | 0.923 | 0.000 | |
## | 0.000 | 0.300 | 0.000 | |
## ----------------|------------|------------|------------|------------|
## virginica | 0 | 1 | 15 | 16 |
## | 0.000 | 0.062 | 0.938 | 0.400 |
## | 0.000 | 0.077 | 1.000 | |
## | 0.000 | 0.025 | 0.375 | |
## ----------------|------------|------------|------------|------------|
## Column Total | 12 | 13 | 15 | 40 |
## | 0.300 | 0.325 | 0.375 | |
## ----------------|------------|------------|------------|------------|
##
##
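The diagonal of this table can be read as per-class recall (N / Row Total) and precision (N / Col Total). The sketch below reproduces those figures with base R from the counts in the table. The vectors `observed` and `predicted` are reconstructed stand-ins for iris.testLabels and iris_pred, and the commented CrossTable() call is my guess at how the table was produced, since the call itself is not shown.

```r
species <- c("setosa", "versicolor", "virginica")
observed <- factor(rep(species, times = c(12, 12, 16)), levels = species)
predicted <- observed
predicted[29] <- "versicolor"  # the one virginica predicted as versicolor

# Rows are observed species, columns are predicted species
confusion <- table(observed = observed, predicted = predicted)
confusion

recall    <- diag(confusion) / rowSums(confusion)  # N / Row Total on the diagonal
precision <- diag(confusion) / colSums(confusion)  # N / Col Total on the diagonal
round(recall, 3)
round(precision, 3)

# With gmodels loaded, a call of this shape prints a table in the format above:
# CrossTable(x = observed, y = predicted, prop.chisq = FALSE)
```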