Let’s begin our classification task on the Iris dataset using the k-Nearest Neighbours (k-NN) algorithm. To use the code in this document:
Step 1: Start RStudio
Step 2: Execute each R command one by one in the RStudio Console
require("class") # load pre-installed package
## Loading required package: class
require("datasets")
data("iris") # load Iris Dataset
str(iris) #view structure of dataset
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris) #view statistical summary of dataset
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
head(iris) #view top rows of dataset
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Since classification is a type of supervised learning, we need two sets of data: training data and testing data. An 80:20 ratio is common; here we use 130 rows for training and 20 rows for testing. We have loaded the Iris dataset, which ships with R in the datasets package, and will now divide it into two subsets. Our k-NN classification model will use the subset iris.train as its training data and will be tested on iris.test. Since the iris dataset is sorted by “Species” by default, we first shuffle the rows and then take the subsets.
set.seed(99) # required to reproduce the results
rnum<- sample(1:150) # random permutation of the row numbers 1 to 150
rnum
## [1] 88 17 102 146 79 141 97 43 51 25 77 71 27 150 94 87 48
## [18] 14 13 24 30 11 106 76 98 44 1 101 124 145 61 39 41 64
## [35] 5 52 147 136 68 109 96 53 84 100 142 144 38 125 60 12 139
## [52] 82 74 95 29 99 9 115 85 36 83 26 148 117 104 73 47 140
## [69] 122 69 7 128 33 32 118 80 40 113 46 93 119 45 4 70 89
## [86] 10 112 20 6 114 62 50 129 28 58 123 107 66 23 105 3 126
## [103] 91 116 110 31 16 111 137 75 19 134 55 37 131 86 92 127 8
## [120] 56 133 72 132 49 34 15 121 22 59 108 2 42 81 149 143 35
## [137] 21 90 57 78 67 103 138 120 63 54 130 65 18 135
iris<- iris[rnum,] #randomize "iris" dataset
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 88 6.3 2.3 4.4 1.3 versicolor
## 17 5.4 3.9 1.3 0.4 setosa
## 102 5.8 2.7 5.1 1.9 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 79 6.0 2.9 4.5 1.5 versicolor
## 141 6.7 3.1 5.6 2.4 virginica
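To see why the shuffle matters, the short sketch below looks at what the test set would contain if the last 20 rows were taken from the unshuffled data. It reads a fresh copy through `datasets::iris` so the shuffled `iris` object is left alone; the variable names are illustrative only.

```r
# Fresh, unshuffled copy of the built-in dataset
original <- datasets::iris

# Species of the rows that would form the test set without shuffling
unshuffled_test <- original[131:150, "Species"]

# Count of each species among those 20 rows
table(unshuffled_test)
```

Because the original rows are ordered by species, every one of these 20 rows is virginica, so a test set taken this way would say nothing about how well setosa or versicolor are classified.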
# Normalize the dataset between values 0 and 1
normalize <- function(x){
  return ((x-min(x))/(max(x)-min(x)))
}
iris.new<- as.data.frame(lapply(iris[,c(1,2,3,4)],normalize))
head(iris.new)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 0.5555556 0.1250000 0.57627119 0.5000000
## 2 0.3055556 0.7916667 0.05084746 0.1250000
## 3 0.4166667 0.2916667 0.69491525 0.7500000
## 4 0.6666667 0.4166667 0.71186441 0.9166667
## 5 0.4722222 0.3750000 0.59322034 0.5833333
## 6 0.6666667 0.4583333 0.77966102 0.9583333
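As a small worked example of the min-max formula used above, the sketch below applies the same function to a made-up vector (the values in `toy` are hypothetical and not part of the iris data). The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.

```r
# Same min-max scaling as in the walkthrough
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical values: min is 2, max is 10, so the range is 8
toy <- c(2, 4, 6, 10)

normalize(toy)
## [1] 0.00 0.25 0.50 1.00
```

This rescaling matters for k-NN because distances are computed across all four measurements, and a column with a larger numeric range would otherwise dominate. Note that a column whose values are all identical would produce a division by zero with this formula.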
# subset the dataset
iris.train<- iris.new[1:130,]
iris.train.target<- iris[1:130,5]
iris.test<- iris.new[131:150,]
iris.test.target<- iris[131:150,5]
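The split above is hard-coded as 130 training rows and 20 test rows. If you prefer to work from a ratio such as 80:20, a minimal sketch is shown below; `train_fraction`, `n_train`, `train_rows` and `test_rows` are names introduced here for illustration.

```r
n <- 150                # number of rows in the iris dataset
train_fraction <- 0.8   # assumed ratio: 80% training, 20% testing

# Number of training rows, rounded to a whole row count
n_train <- round(train_fraction * n)

# Row indices for each subset (the rows are already shuffled at this point)
train_rows <- 1:n_train
test_rows <- (n_train + 1):n

length(train_rows)
## [1] 120
length(test_rows)
## [1] 30
```

These index vectors could be used in place of `1:130` and `131:150` in the subsetting commands.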
summary(iris.new)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.2222 1st Qu.:0.3333 1st Qu.:0.1017 1st Qu.:0.08333
## Median :0.4167 Median :0.4167 Median :0.5678 Median :0.50000
## Mean :0.4287 Mean :0.4406 Mean :0.4675 Mean :0.45806
## 3rd Qu.:0.5833 3rd Qu.:0.5417 3rd Qu.:0.6949 3rd Qu.:0.70833
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
model1<- knn(train=iris.train, test=iris.test, cl=iris.train.target, k=16)
#model1
table(iris.test.target, model1)
## model1
## iris.test.target setosa versicolor virginica
## setosa 5 0 0
## versicolor 0 7 1
## virginica 0 2 5
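The table can be reduced to a single accuracy figure. The sketch below re-enters the counts from the output above by hand (the object name `conf` is illustrative) and divides the diagonal total by the overall total.

```r
# Confusion matrix copied from the table above:
# rows are the actual species, columns are the predicted species
species <- c("setosa", "versicolor", "virginica")
conf <- matrix(c(5, 0, 0,
                 0, 7, 1,
                 0, 2, 5),
               nrow = 3, byrow = TRUE,
               dimnames = list(actual = species, predicted = species))

correct <- sum(diag(conf))  # correctly classified instances
total <- sum(conf)          # all instances in the test set

correct / total
## [1] 0.85
```

With the objects from the session, the same figure should also be obtainable with `mean(model1 == iris.test.target)`.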
The values on the diagonal show the number of correctly classified instances: 17 (5 + 7 + 5) out of the 20 instances in the test set, an accuracy of 85%. The values off the diagonal are the instances that have been incorrectly classified: one versicolor predicted as virginica and two virginica predicted as versicolor. Hence, there is scope for further improvement in the classifier model. One option is to try different values of “k” and choose the one with the highest accuracy. Other classification algorithms may also be tried to get a better result. There is no stopping in optimization!
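One way to try different values of “k” is sketched below. It repeats the data preparation in a self-contained form and then records the test accuracy for each k; the search range `1:25` and all variable names are assumptions made for this illustration, not part of the original walkthrough.

```r
library(class)

set.seed(99)  # for reproducibility (knn also breaks ties at random)

# Shuffle a fresh copy of iris and normalize the four measurements
shuffled <- datasets::iris[sample(1:150), ]
normalize <- function(x) (x - min(x)) / (max(x) - min(x))
features <- as.data.frame(lapply(shuffled[, 1:4], normalize))

# Same 130/20 split as in the walkthrough
train <- features[1:130, ]
test <- features[131:150, ]
train_target <- shuffled[1:130, 5]
test_target <- shuffled[131:150, 5]

# Accuracy on the test set for each candidate k (assumed range)
k_values <- 1:25
accuracy <- sapply(k_values, function(k) {
  predicted <- knn(train = train, test = test, cl = train_target, k = k)
  mean(predicted == test_target)
})

# Accuracy per k, and the first k that reaches the highest accuracy
data.frame(k = k_values, accuracy = accuracy)
best_k <- k_values[which.max(accuracy)]
best_k
```

With only 20 test rows, each misclassification shifts the accuracy by 5 percentage points, so small differences between neighbouring values of k should be read with caution.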
You may also wish to try out data classification, clustering or linear regression from the following links:
k-NN Classification for beginners
Using Airquality Dataset: k-means Clustering for beginners
Using Airquality Dataset: Linear Regression for beginners
Good luck! :)