Let’s begin with our clustering task on Iris Dataset using k-means algorithm. Follow the following points to use code in this document:
Step 1: Start R Studio
Step 2: Execute each R command one by one on the R Studio Console
require("datasets")
data("iris") # load Iris Dataset
str(iris) #view structure of dataset
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris) #view statistical summary of dataset
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
head(iris) #view top rows of dataset
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Since clustering is a type of Unsupervised Learning, we would not require Class Label(output) during execution of our algorithm. We will, therefore, remove Class Attribute “Species” and store it in another variable. We would then normalize the attributes between 0 and 1 using our own function.
iris.new<- iris[,c(1,2,3,4)]
iris.class<- iris[,"Species"]
head(iris.new)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
head(iris.class)
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
normalize <- function(x){
return ((x-min(x))/(max(x)-min(x)))
}
iris.new$Sepal.Length<- normalize(iris.new$Sepal.Length)
iris.new$Sepal.Width<- normalize(iris.new$Sepal.Width)
iris.new$Petal.Length<- normalize(iris.new$Petal.Length)
iris.new$Petal.Width<- normalize(iris.new$Petal.Width)
head(iris.new)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 0.22222222 0.6250000 0.06779661 0.04166667
## 2 0.16666667 0.4166667 0.06779661 0.04166667
## 3 0.11111111 0.5000000 0.05084746 0.04166667
## 4 0.08333333 0.4583333 0.08474576 0.04166667
## 5 0.19444444 0.6666667 0.06779661 0.04166667
## 6 0.30555556 0.7916667 0.11864407 0.12500000
result<- kmeans(iris.new,3) #aplly k-means algorithm with no. of centroids(k)=3
result$size # gives no. of records in each cluster
## [1] 61 39 50
result$centers # gives value of cluster center datapoint value(3 centers for k=3)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 0.4412568 0.3073770 0.57571548 0.54918033
## 2 0.7072650 0.4508547 0.79704476 0.82478632
## 3 0.1961111 0.5950000 0.07830508 0.06083333
result$cluster #gives cluster vector showing the custer where each record falls
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2
## [106] 2 1 2 2 2 2 2 2 1 2 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 1 2 2 2 1 2
## [141] 2 2 1 2 2 2 1 2 2 1
par(mfrow=c(2,2), mar=c(5,4,2,2))
plot(iris.new[c(1,2)], col=result$cluster)# Plot to see how Sepal.Length and Sepal.Width data points have been distributed in clusters
plot(iris.new[c(1,2)], col=iris.class)# Plot to see how Sepal.Length and Sepal.Width data points have been distributed originally as per "class" attribute in dataset
plot(iris.new[c(3,4)], col=result$cluster)# Plot to see how Petal.Length and Petal.Width data points have been distributed in clusters
plot(iris.new[c(3,4)], col=iris.class)
table(result$cluster,iris.class)
## iris.class
## setosa versicolor virginica
## 1 0 47 14
## 2 0 3 36
## 3 50 0 0
Result of table shows that Cluster 1 corresponds to Virginica, Cluster 2 corresponds to Versicolor and Cluster 3 to Setosa.
Total number of correctly classified instances are: 36 + 47 + 50= 133
Total number of incorrectly classified instances are: 3 + 14= 17
Accuracy = 133/(133+17) = 0.88 i.e our model has achieved 88% accuracy!
In order to improve this accuracy further, we may try different values of “k”. In some cases, it is also beneficial to change the algorithm in case k-means is unable to yield good results.
You may also wish to try out Data Classification, Clustering or Linear Regression from following links:
k-NN Classification for beginners
Using Airquality Datasetk-means Clustering for beginners
Using Airquality DatasetLinear Regression for beginners
Good luck! :)