Let’s begin with our classification task on the Iris dataset using the k-Nearest Neighbours (k-NN) algorithm. To use the code in this document, follow these steps:

    Step 1: Start RStudio
    Step 2: Execute each R command one by one in the RStudio console

1. Load and view dataset

require("class") # load the pre-installed "class" package, which provides knn()
## Loading required package: class
require("datasets") # load the package that holds R's built-in datasets
data("iris") # load the Iris dataset
str(iris) # view the structure of the dataset
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris) # view a statistical summary of the dataset
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
head(iris) # view the first six rows of the dataset
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

2. Preprocess the dataset

Since classification is a type of supervised learning, we need two sets of data: training data and testing data. An 80:20 ratio is common; here we use 130 rows for training and 20 for testing. The Iris dataset ships with R in the datasets package and is already loaded, so we only need to divide it into two subsets. Our k-NN classifier is then built from the subset iris.train and tested on iris.test. Since the iris dataset is sorted by “Species” by default, we first shuffle the rows and then take the subsets. We also rescale every feature to the range 0 to 1, because k-NN is distance-based and a feature with a larger range would otherwise dominate the distance.

set.seed(99) # fix the random seed so the results can be reproduced
rnum<- sample(1:150) # random permutation of the row numbers 1 to 150
rnum
##   [1]  88  17 102 146  79 141  97  43  51  25  77  71  27 150  94  87  48
##  [18]  14  13  24  30  11 106  76  98  44   1 101 124 145  61  39  41  64
##  [35]   5  52 147 136  68 109  96  53  84 100 142 144  38 125  60  12 139
##  [52]  82  74  95  29  99   9 115  85  36  83  26 148 117 104  73  47 140
##  [69] 122  69   7 128  33  32 118  80  40 113  46  93 119  45   4  70  89
##  [86]  10 112  20   6 114  62  50 129  28  58 123 107  66  23 105   3 126
## [103]  91 116 110  31  16 111 137  75  19 134  55  37 131  86  92 127   8
## [120]  56 133  72 132  49  34  15 121  22  59 108   2  42  81 149 143  35
## [137]  21  90  57  78  67 103 138 120  63  54 130  65  18 135
iris<- iris[rnum,] # shuffle the rows of the "iris" dataset
head(iris)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 88           6.3         2.3          4.4         1.3 versicolor
## 17           5.4         3.9          1.3         0.4     setosa
## 102          5.8         2.7          5.1         1.9  virginica
## 146          6.7         3.0          5.2         2.3  virginica
## 79           6.0         2.9          4.5         1.5 versicolor
## 141          6.7         3.1          5.6         2.4  virginica
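# Normalize each numeric column to the range 0 to 1 (min-max scaling)
normalize <- function(x){
  return ((x-min(x))/(max(x)-min(x)))
}

# columns 1 to 4 are the numeric features; the row names are reset to 1..150
iris.new<- as.data.frame(lapply(iris[,c(1,2,3,4)],normalize))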
head(iris.new)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1    0.5555556   0.1250000   0.57627119   0.5000000
## 2    0.3055556   0.7916667   0.05084746   0.1250000
## 3    0.4166667   0.2916667   0.69491525   0.7500000
## 4    0.6666667   0.4166667   0.71186441   0.9166667
## 5    0.4722222   0.3750000   0.59322034   0.5833333
## 6    0.6666667   0.4583333   0.77966102   0.9583333
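To see what normalize() does, it may help to apply it to a small vector first. The values in v below are made up for illustration; the function body is the same min-max formula as above.

```r
# Same min-max formula as in the tutorial
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

v <- c(2, 4, 6, 10)    # hypothetical values, not taken from iris
scaled <- normalize(v) # smallest value maps to 0, largest to 1
scaled                 # 0.00 0.25 0.50 1.00
```

Note that a column whose values are all equal would produce 0/0, that is NaN, so this formula assumes each column has at least two distinct values (true for all four iris features).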
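# split the dataset: first 130 shuffled rows for training, last 20 for testing
iris.train<- iris.new[1:130,]
iris.train.target<- iris[1:130,5] # column 5 is Species, the class label
iris.test<- iris.new[131:150,]
iris.test.target<- iris[131:150,5]
summary(iris.new) # every feature now ranges from 0 to 1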
##   Sepal.Length     Sepal.Width      Petal.Length     Petal.Width     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.2222   1st Qu.:0.3333   1st Qu.:0.1017   1st Qu.:0.08333  
##  Median :0.4167   Median :0.4167   Median :0.5678   Median :0.50000  
##  Mean   :0.4287   Mean   :0.4406   Mean   :0.4675   Mean   :0.45806  
##  3rd Qu.:0.5833   3rd Qu.:0.5417   3rd Qu.:0.6949   3rd Qu.:0.70833  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000
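The split above takes the first 130 shuffled rows for training. For the more common 80:20 ratio mentioned earlier, a minimal sketch could sample the training row numbers directly. The names n, train.idx, test.idx, iris.train80 and iris.test20 are assumptions for illustration and are not used elsewhere in this walkthrough.

```r
data("iris")
set.seed(99)                                   # for reproducibility

n <- nrow(iris)                                # 150 rows
train.idx <- sample(n, size = round(0.8 * n))  # 120 random row numbers
test.idx  <- setdiff(seq_len(n), train.idx)    # the remaining 30 row numbers

iris.train80 <- iris[train.idx, ]
iris.test20  <- iris[test.idx, ]

nrow(iris.train80)  # 120
nrow(iris.test20)   # 30
```

Sampling the row numbers shuffles and splits in one step, so the species ordering of the original data does not matter here either.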

3. Apply k-NN classification algorithm

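# knn() returns the predicted class of every test row (there is no separate training step)
model1<- knn(train=iris.train, test=iris.test, cl=iris.train.target, k=16)
#model1 # uncomment to print the 20 predictions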
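For intuition, knn() classifies each test row by finding its k nearest training rows (Euclidean distance) and taking a majority vote among their labels. The sketch below does this by hand for a single query point on a toy dataset. All values and the names train, labels, query, d, nearest, votes and predicted are made up for illustration.

```r
library(class)  # only needed for the comparison with knn() at the end

# Toy training set: two features, two classes (hypothetical values)
train  <- data.frame(x = c(0.10, 0.20, 0.15, 0.80, 0.90, 0.85),
                     y = c(0.10, 0.20, 0.25, 0.80, 0.90, 0.75))
labels <- factor(c("a", "a", "a", "b", "b", "b"))
query  <- data.frame(x = 0.3, y = 0.3)
k      <- 3

# Euclidean distance from the query to every training row
d <- sqrt((train$x - query$x)^2 + (train$y - query$y)^2)

nearest   <- order(d)[1:k]                   # row numbers of the k closest points
votes     <- table(labels[nearest])          # count the labels among them
predicted <- names(votes)[which.max(votes)]  # majority vote
predicted                                    # "a"

# The same query through knn() from the class package
knn(train = train, test = query, cl = labels, k = k)
```

Here the three nearest points all carry the label "a", so both the by-hand vote and knn() return "a".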

4. Verify results

table(iris.test.target, model1)
##                 model1
## iris.test.target setosa versicolor virginica
##       setosa          5          0         0
##       versicolor      0          7         1
##       virginica       0          2         5
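The table above can be reduced to a single accuracy figure. The sketch below re-enters the counts from the table by hand so that it runs on its own; the names species, conf, correct, total and accuracy are assumptions for illustration.

```r
# Confusion matrix copied from the table above
# (rows = actual species, columns = predicted species)
species <- c("setosa", "versicolor", "virginica")
conf <- matrix(c(5, 0, 0,
                 0, 7, 1,
                 0, 2, 5),
               nrow = 3, byrow = TRUE,
               dimnames = list(actual = species, predicted = species))

correct  <- sum(diag(conf))  # correctly classified test rows
total    <- sum(conf)        # size of the test set
accuracy <- correct / total

correct   # 17
total     # 20
accuracy  # 0.85
```

In the live session, mean(model1 == iris.test.target) should give the same accuracy without retyping the counts.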

The values on the diagonal show the number of correctly classified instances: 17 of the 20 test instances (85%). The off-diagonal values are the misclassified instances: one versicolor was predicted as virginica, and two virginica were predicted as versicolor. Hence, there is scope for further improvement in the classifier. One option is to try different values of “k” and choose the one with the maximum accuracy. Other classification algorithms may also be tried to get a better result. There is no end to optimization!
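One way to try different values of “k” is sketched below. It rebuilds the shuffled, normalized 130/20 split from the earlier steps so that it runs on its own. The candidate range k.values and the names shuffled, features, train, test, train.y, test.y and results are assumptions for illustration. Because knn() breaks ties at random and the shuffle can differ between R versions, no output is shown.

```r
library(class)  # provides knn()

data("iris")
set.seed(99)                              # same seed as earlier
shuffled <- iris[sample(nrow(iris)), ]    # shuffle the rows

# Min-max scaling, as defined earlier
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
features <- as.data.frame(lapply(shuffled[, 1:4], normalize))

train   <- features[1:130, ]
test    <- features[131:150, ]
train.y <- shuffled[1:130, 5]
test.y  <- shuffled[131:150, 5]

k.values <- seq(1, 25, by = 2)            # hypothetical candidate values of k
accuracy <- sapply(k.values, function(k) {
  pred <- knn(train = train, test = test, cl = train.y, k = k)
  mean(pred == test.y)                    # share of correct predictions
})

results <- data.frame(k = k.values, accuracy = accuracy)
results
results$k[which.max(results$accuracy)]    # first k with the highest accuracy
```

With only 20 test rows, each misclassification moves the accuracy by 5 percentage points, so small differences between values of k should be read with caution. Choosing k on the test set also tends to give an optimistic estimate; knn.cv() from the same class package, which runs leave-one-out cross-validation on the training data, is one alternative.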

You may also wish to try out data classification, clustering, or linear regression in the related tutorials:

  1. k-NN Classification for beginners
    Using Airquality Dataset
  2. k-means Clustering for beginners
    Using Iris Dataset
    Using Airquality Dataset
  3. Linear Regression for beginners
    Using Iris Dataset
    Using Airquality Dataset

Good luck! :)