This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
In this section we will see how the R language can be used to cluster a data set. We will start with k-means clustering on the iris data set.
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
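The definition above can be sketched in a few lines of R. This is only an illustration of the assign-then-update idea (often called Lloyd's algorithm); `simple_kmeans` is a hypothetical helper written for this example and is not the routine that R's built-in `kmeans()` uses by default.

```r
# Minimal sketch of the k-means idea; simple_kmeans is a hypothetical helper
# for illustration, not the algorithm behind stats::kmeans().
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Start from k observations chosen at random
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (i in seq_len(max_iter)) {
    # Assignment step: squared distance from every observation to every center
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_cluster <- max.col(-d, ties.method = "first")  # index of nearest center
    if (all(new_cluster == cluster)) break             # assignments are stable
    cluster <- new_cluster
    # Update step: move each center to the mean of its observations
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)  # arbitrary seed, chosen only for this sketch
fit <- simple_kmeans(iris[1:4], 3)
table(fit$cluster)
```

Each pass assigns every observation to its nearest mean and then recomputes the means, which is exactly the "nearest mean as prototype" idea in the definition.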
The task is to use the iris data set:
This data set is already built into R (it ships with the datasets package), so it is easy to find.
The iris data set is not huge, and advanced R programmers or data scientists would probably find it rather unsuitable, but for me, as a lady and beginner in big data analytics, it is a great place to start and to show some simple tricks for plotting and analyzing a data set.
Load the iris data set, which is required for completing the task:
library(datasets) # load the datasets package, which provides iris
data("iris")
The iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Check the first rows of the data set:
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
To get the total number of rows and columns of the iris data set:
dim(iris)
## [1] 150 5
So, the iris data set has 150 rows and 5 columns.
To see summary statistics for the iris data set:
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
K-means starts from randomly chosen centers, so I set a seed before clustering.
Setting a seed ensures that you get the same result each time you run the same process, as long as you start with that same seed.
set.seed(8593)
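A small check of what the seed buys us, as a sketch: the two runs below use `iris[1:4]` directly, and the names `run_a` and `run_b` are made up for this example.

```r
# Two runs from the same seed pick the same random starting centers
set.seed(8593)
run_a <- kmeans(iris[1:4], centers = 3)
set.seed(8593)
run_b <- kmeans(iris[1:4], centers = 3)

# Same seed, same cluster assignment for every flower
identical(run_a$cluster, run_b$cluster)
```

Without the repeated `set.seed()` call the second run may start from different centers, and the cluster labels (or even the clusters themselves) can change.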
Now I remove the Species column and assign the four remaining columns to the variable iris_data, which makes this process safer and easier,
because k-means is applied to numeric (non-categorical) data.
iris_data <- iris[1:4]
head(iris_data)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4
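Selecting columns by position works here because Species is the last column. A possible alternative, shown only as a sketch, is to select by type; `numeric_cols` and `iris_numeric` are names invented for this example.

```r
# Select columns by type rather than by position
numeric_cols <- sapply(iris, is.numeric)   # TRUE for the four measurements
iris_numeric <- iris[numeric_cols]

names(iris_numeric)
all.equal(iris_numeric, iris[1:4])         # same data as the positional subset
```

This version keeps working if the column order of a data set changes.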
My first step is to scale the given data set:
iris_data_scale <- scale(iris_data)
head(iris_data_scale,20)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] -0.89767388 1.01560199 -1.335752 -1.311052
## [2,] -1.13920048 -0.13153881 -1.335752 -1.311052
## [3,] -1.38072709 0.32731751 -1.392399 -1.311052
## [4,] -1.50149039 0.09788935 -1.279104 -1.311052
## [5,] -1.01843718 1.24503015 -1.335752 -1.311052
## [6,] -0.53538397 1.93331463 -1.165809 -1.048667
## [7,] -1.50149039 0.78617383 -1.335752 -1.179859
## [8,] -1.01843718 0.78617383 -1.279104 -1.311052
## [9,] -1.74301699 -0.36096697 -1.335752 -1.311052
## [10,] -1.13920048 0.09788935 -1.279104 -1.442245
## [11,] -0.53538397 1.47445831 -1.279104 -1.311052
## [12,] -1.25996379 0.78617383 -1.222456 -1.311052
## [13,] -1.25996379 -0.13153881 -1.335752 -1.442245
## [14,] -1.86378030 -0.13153881 -1.505695 -1.442245
## [15,] -0.05233076 2.16274279 -1.449047 -1.311052
## [16,] -0.17309407 3.08045544 -1.279104 -1.048667
## [17,] -0.53538397 1.93331463 -1.392399 -1.048667
## [18,] -0.89767388 1.01560199 -1.335752 -1.179859
## [19,] -0.17309407 1.70388647 -1.165809 -1.179859
## [20,] -0.89767388 1.70388647 -1.279104 -1.179859
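To see where these numbers come from, here is a sketch that repeats the default behavior of `scale()` by hand: subtract each column's mean and divide by its standard deviation. The name `manual_scale` is made up for this example.

```r
x <- as.matrix(iris[1:4])

# Center each column on its mean, then divide by its standard deviation
manual_scale <- sweep(sweep(x, 2, colMeans(x)), 2, apply(x, 2, sd), "/")

all.equal(manual_scale, scale(x), check.attributes = FALSE)
round(colMeans(manual_scale), 10)  # column means after scaling
apply(manual_scale, 2, sd)         # column standard deviations after scaling
```

After scaling, every variable has mean 0 and standard deviation 1, so no single measurement dominates the distance calculation just because of its units or range.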
To determine the value of k, visualize the within-cluster sum of squares of the scaled data set:
library(ggplot2)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.1
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust() determines and visualizes the optimal number of clusters using different methods: within-cluster sum of squares, average silhouette, and gap statistic. The option "wss" selects the total within-cluster sum of squares.
fviz_nbclust(iris_data_scale, kmeans, method = "wss") +
labs(title = "The elbow method")
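The same kind of curve can be drawn with base R only. The sketch below is my own approximation of the idea, not the internals of `fviz_nbclust()`; the range 1 to 10 and `nstart = 10` are assumptions chosen for the example.

```r
x <- scale(iris[1:4])
set.seed(8593)

# Total within-cluster sum of squares for k = 1, ..., 10
wss <- sapply(1:10, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

plot(1:10, wss, type = "b", pch = 19,
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares",
     main = "The elbow method (base R)")
```

The curve always falls as k grows; the elbow is the point after which adding another cluster reduces the sum of squares only a little.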
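The optimal number of clusters, k, is where the elbow occurs.
In the graph above we can clearly see that the elbow occurs at 3.
So, the value of k is 3. Note that the call below passes the unscaled iris_data to kmeans(), so the cluster means in the output are in centimeters.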
kmeans.model <- kmeans(iris_data, 3)
kmeans.model
## K-means clustering with 3 clusters of sizes 50, 62, 38
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.006000 3.428000 1.462000 0.246000
## 2 5.901613 2.748387 4.393548 1.433871
## 3 6.850000 3.073684 5.742105 2.071053
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
##
## Within cluster sum of squares by cluster:
## [1] 15.15100 39.82097 23.87947
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
After running this code I can see the sizes of the clusters, that is, the exact count of observations in each cluster, in this case: 50, 62, 38.
To see how well each cluster matches the species, I used:
table(iris$Species, kmeans.model$cluster)
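The other components listed in the output can be inspected the same way. The sketch below refits a model on `iris[1:4]` under the hypothetical name `fit` and checks how the sums of squares relate to each other.

```r
set.seed(8593)
fit <- kmeans(iris[1:4], centers = 3)

fit$size                                      # observations per cluster
all.equal(sum(fit$withinss), fit$tot.withinss)           # per-cluster values add up
all.equal(fit$tot.withinss + fit$betweenss, fit$totss)   # within + between = total
round(100 * fit$betweenss / fit$totss, 1)     # the percentage shown in the printout
```

The ratio `betweenss / totss` is the share of the total variation that is explained by the split into clusters.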
##
## 1 2 3
## setosa 50 0 0
## versicolor 0 48 2
## virginica 0 14 36
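The table can be turned into a single agreement figure. As a sketch, the counts below are copied by hand from the output above into a matrix called `tab` (a name chosen for this example), so the result does not depend on rerunning the clustering.

```r
# Cross-tabulation copied from the output above
# (rows: species, columns: clusters)
tab <- matrix(c(50,  0,  0,
                 0, 48,  2,
                 0, 14, 36),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("setosa", "versicolor", "virginica"),
                              c("1", "2", "3")))

# Share of each species that falls into its dominant cluster
apply(tab, 1, max) / rowSums(tab)

# Overall agreement between clusters and species
sum(apply(tab, 1, max)) / sum(tab)
```

With these counts, 134 of the 150 flowers (about 89%) sit in the cluster dominated by their own species; setosa is separated perfectly, while virginica loses 14 flowers to the versicolor cluster.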
Now visualize the clusters:
colnames(iris_data)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
plot(iris_data[c("Sepal.Length", "Sepal.Width")],
     col = kmeans.model$cluster,
     pch = 19)
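It can help to mark the cluster centers on the same plot. This is a sketch that refits the model under the hypothetical name `fit`; the marker style (`pch = 8`, `cex = 2`) is just my choice.

```r
set.seed(8593)
fit <- kmeans(iris[1:4], centers = 3)

plot(iris[c("Sepal.Length", "Sepal.Width")], col = fit$cluster, pch = 19)

# Overlay the cluster centers, using only the two plotted dimensions;
# center j gets color j, the same color as the points of cluster j
points(fit$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 2, lwd = 2)
```

Keep in mind that the clustering uses all four measurements, so a point can look closer to another center in this two-dimensional view than it really is.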
After the centers are assigned, we can see that some of the dots are closer to one center than to the others.
To show this, we create the cluster plots:
library(cluster)
library(fpc)
## Warning: package 'fpc' was built under R version 4.1.1
plotcluster(iris_data, kmeans.model$cluster)
clusplot(iris_data, kmeans.model$cluster, color = TRUE, shade = TRUE)
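As far as I understand, clusplot() draws the clusters on the first two principal components of the data. A rough base R sketch of that kind of projection, using `prcomp()` and the hypothetical names `pca` and `fit`, could look like this; it is not the exact computation done inside clusplot().

```r
set.seed(8593)
fit <- kmeans(iris[1:4], centers = 3)

# Project the four measurements onto their principal components
pca <- prcomp(iris[1:4])

plot(pca$x[, 1:2], col = fit$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Clusters on the first two principal components")

# Share of the total variance captured by the two plotted components
sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)
```

The last line tells us how much of the variability the two-dimensional picture keeps, which is a quick check of how far the plot can be trusted.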