R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Clustering By Using R

In this section we will see how the R language can be used to cluster a data set. We will start with K-means clustering on the iris data set.

K-Means Clustering:

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
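
To make the definition concrete, here is a minimal sketch of a single k-means iteration in base R: assign every point to its nearest center, then move each center to the mean of its points. The six toy points and the two starting centers are made up for illustration and are not part of the iris analysis.

```r
# Toy data: six points in two dimensions (hypothetical values)
x <- matrix(c(1.0, 1.0,
              1.5, 2.0,
              1.0, 0.5,
              8.0, 8.0,
              9.0, 9.5,
              8.5, 9.0),
            ncol = 2, byrow = TRUE)

# Two hypothetical starting centers, one per row
centers <- matrix(c(0, 0,
                    10, 10),
                  ncol = 2, byrow = TRUE)

# Assignment step: squared Euclidean distance from every point to every center
d2 <- sapply(seq_len(nrow(centers)), function(j) {
  rowSums(sweep(x, 2, centers[j, ])^2)
})
assignment <- max.col(-d2, ties.method = "first")  # index of the nearest center
assignment
# 1 1 1 2 2 2

# Update step: each center moves to the mean of the points assigned to it
new_centers <- t(sapply(seq_len(nrow(centers)), function(j) {
  colMeans(x[assignment == j, , drop = FALSE])
}))
new_centers
```

The built-in kmeans() function repeats these two steps (with a more refined algorithm by default) until the assignments stop changing.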

Loading the Required R Packages for Solving the Problem

Data Sets:

The task is to use the iris data set:

Data Source:

This data set is already built into R (it ships with the datasets package), so it is easy to find.

The iris data set is not large, and advanced R programmers or data scientists would probably find it rather unsuitable, but for me, as a lady and a beginner in big data analytics, it is a great place to start and to show some simple tricks for plotting and analyzing a data set.

Loading Data sets

Load the iris data set, which is required for completing the task:

library(datasets)  # load the datasets package, which provides the iris data set

data("iris")

Iris dataset gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.

To inspect the data set:

Check the first rows of the data set:

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

To read the total number of rows and columns of the iris data set:

dim(iris)
## [1] 150   5

So, the iris data set has 150 rows and 5 columns.

To check the details of the iris data set:

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Set the Seed

Setting a seed ensures that you get the same result each time you run the same process starting from that seed.

K-means starts from randomly chosen centers, so I set a seed to make the clustering and its visualization reproducible.

set.seed(8593)
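
As a small check of what the seed buys us, the sketch below runs kmeans() twice from the same seed and compares the cluster assignments. The names run1 and run2 are only illustrative.

```r
# Run k-means twice, resetting the seed before each run
set.seed(8593)
run1 <- kmeans(iris[1:4], 3)$cluster

set.seed(8593)
run2 <- kmeans(iris[1:4], 3)$cluster

# With the same seed, both runs pick the same random starting centers
identical(run1, run2)
# TRUE
```

Without the calls to set.seed(), the two runs could start from different centers and number (or even form) the clusters differently.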

K-Means Model:

Now I remove the Species column and assign the remaining four columns to the variable iris_data, which makes the process safer and easier, because K-means is applied to numeric (non-categorical) data.

iris_data <- iris[1:4]
head(iris_data)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4
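
Selecting columns by position works here because Species happens to be the last column. As a possible alternative, the sketch below keeps whichever columns are numeric, which does not depend on column order. The name numeric_cols is only illustrative.

```r
# Logical vector: TRUE for each numeric column of iris
numeric_cols <- sapply(iris, is.numeric)
numeric_cols

# Keep only the numeric columns
iris_data <- iris[numeric_cols]
colnames(iris_data)
```

For the iris data set this should select the same four measurement columns as iris[1:4].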

Find the Optimum Number of Clusters, or Determine the Value of k

My first step is to scale the given data set:

iris_data_scale <- scale(iris_data)
head(iris_data_scale, 20)
##       Sepal.Length Sepal.Width Petal.Length Petal.Width
##  [1,]  -0.89767388  1.01560199    -1.335752   -1.311052
##  [2,]  -1.13920048 -0.13153881    -1.335752   -1.311052
##  [3,]  -1.38072709  0.32731751    -1.392399   -1.311052
##  [4,]  -1.50149039  0.09788935    -1.279104   -1.311052
##  [5,]  -1.01843718  1.24503015    -1.335752   -1.311052
##  [6,]  -0.53538397  1.93331463    -1.165809   -1.048667
##  [7,]  -1.50149039  0.78617383    -1.335752   -1.179859
##  [8,]  -1.01843718  0.78617383    -1.279104   -1.311052
##  [9,]  -1.74301699 -0.36096697    -1.335752   -1.311052
## [10,]  -1.13920048  0.09788935    -1.279104   -1.442245
## [11,]  -0.53538397  1.47445831    -1.279104   -1.311052
## [12,]  -1.25996379  0.78617383    -1.222456   -1.311052
## [13,]  -1.25996379 -0.13153881    -1.335752   -1.442245
## [14,]  -1.86378030 -0.13153881    -1.505695   -1.442245
## [15,]  -0.05233076  2.16274279    -1.449047   -1.311052
## [16,]  -0.17309407  3.08045544    -1.279104   -1.048667
## [17,]  -0.53538397  1.93331463    -1.392399   -1.048667
## [18,]  -0.89767388  1.01560199    -1.335752   -1.179859
## [19,]  -0.17309407  1.70388647    -1.165809   -1.179859
## [20,]  -0.89767388  1.70388647    -1.279104   -1.179859
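
To see what scale() does with its default arguments, the sketch below standardizes one column by hand: subtract the column mean, then divide by the column standard deviation. The name manual is only illustrative.

```r
iris_data <- iris[1:4]
iris_data_scale <- scale(iris_data)

# Standardize Sepal.Length by hand: (value - mean) / standard deviation
manual <- (iris_data$Sepal.Length - mean(iris_data$Sepal.Length)) /
  sd(iris_data$Sepal.Length)

# Compare with the corresponding column produced by scale()
all.equal(manual, unname(iris_data_scale[, "Sepal.Length"]))
# TRUE

# After scaling, every column has mean 0 (up to rounding) and standard deviation 1
round(colMeans(iris_data_scale), 10)
apply(iris_data_scale, 2, sd)
```

Scaling matters for k-means because the distances it uses would otherwise be dominated by the variables with the largest spread.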

To determine the value of k, create a visualization of the within-cluster sum of squares for the scaled data set:

library(ggplot2)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.1
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

fviz_nbclust(): Determines and visualizes the optimal number of clusters using different methods: within-cluster sum of squares, average silhouette, and gap statistic. "wss" stands for the total within-cluster sum of squares.

fviz_nbclust(iris_data_scale, kmeans, method = "wss") + 
  labs(title = "The elbow method")

The optimum number of clusters, or value of k, is where the elbow occurs.

From the graph above, we can clearly see that the elbow occurs at 3.

So, the value of k is 3.
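
For readers who want to see what the elbow plot is built from, here is a rough base R sketch that computes the total within-cluster sum of squares for k = 1 to 10 and plots it. It is an approximation of the idea, not the exact code inside fviz_nbclust(); the range of k, the nstart value, and the name wss are my own choices.

```r
iris_data_scale <- scale(iris[1:4])

set.seed(8593)
# Total within-cluster sum of squares for each candidate k
wss <- sapply(1:10, function(k) {
  kmeans(iris_data_scale, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

The curve drops steeply for the first few values of k and then levels off; the bend between the two regimes is the elbow.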

K-Means Model:

kmeans.model <- kmeans(iris_data, 3)
kmeans.model
## K-means clustering with 3 clusters of sizes 50, 62, 38
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1     5.006000    3.428000     1.462000    0.246000
## 2     5.901613    2.748387     4.393548    1.433871
## 3     6.850000    3.073684     5.742105    2.071053
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
## 
## Within cluster sum of squares by cluster:
## [1] 15.15100 39.82097 23.87947
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

From this output I can see the sizes of the clusters, that is, the exact count of observations in each cluster, in this case 50, 62, and 38.
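
The components listed at the end of the printed output can be read directly from the model object. The sketch below refits the model so that it runs on its own, then pulls out the cluster sizes and recomputes the between_SS / total_SS percentage. Depending on the random start, the clusters may be numbered differently or split slightly differently from the output shown above.

```r
set.seed(8593)
kmeans.model <- kmeans(iris[1:4], 3)

# Number of observations in each cluster
kmeans.model$size

# Cluster centers: one row per cluster, one column per variable
kmeans.model$centers

# Share of the total variation explained by the clustering, in percent
round(100 * kmeans.model$betweenss / kmeans.model$totss, 1)

# The total sum of squares splits into a between part and a within part
all.equal(kmeans.model$totss,
          kmeans.model$betweenss + kmeans.model$tot.withinss)
```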

For further information about how accurately each species falls into a cluster, I used:

table(iris$Species, kmeans.model$cluster)
##             
##               1  2  3
##   setosa     50  0  0
##   versicolor  0 48  2
##   virginica   0 14 36
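
The table can be turned into a single agreement figure by counting, for each cluster, the observations that belong to its most common species. The sketch below retypes the counts from the table above so that it runs on its own; the names tab and matched are only illustrative.

```r
# Counts copied from the cross-tabulation above (rows: species, columns: clusters)
tab <- matrix(c(50,  0,  0,
                 0, 48,  2,
                 0, 14, 36),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("setosa", "versicolor", "virginica"),
                              c("1", "2", "3")))

# For each cluster, take the count of its majority species: 50 + 48 + 36
matched <- sum(apply(tab, 2, max))
matched
# 134

# Share of the 150 flowers that sit in the cluster dominated by their species
matched / sum(tab)
# 0.8933333
```

So about 89% of the flowers land in the cluster dominated by their own species; all of the mismatches are between versicolor and virginica.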

Visualization

Now visualize the clusters:

colnames(iris_data)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"
plot(iris_data[c("Sepal.Length", "Sepal.Width")],
     col = kmeans.model$cluster,
     pch = 19)
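
It can help to draw the cluster centers on top of the scatter plot. The sketch below refits the model so that it runs on its own and then marks each center with a star using points(); the marker shape and size are my own choices.

```r
set.seed(8593)
kmeans.model <- kmeans(iris[1:4], 3)

# Scatter plot of the two sepal measurements, colored by cluster
plot(iris[c("Sepal.Length", "Sepal.Width")],
     col = kmeans.model$cluster,
     pch = 19)

# Overlay the fitted centers for the same two variables
points(kmeans.model$centers[, c("Sepal.Length", "Sepal.Width")],
       col = 1:3, pch = 8, cex = 2)
```

Because only two of the four variables are shown, some points can look closer to another cluster's center than they really are in the full four-dimensional space.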

After assigning the centers, we can see that some of the dots are closer to one center than to the others.

So, to show that, we create a visualization of the cluster values.

library(cluster)
library(fpc)
## Warning: package 'fpc' was built under R version 4.1.1
plotcluster(iris_data, kmeans.model$cluster)

clusplot(iris_data, kmeans.model$cluster, color = TRUE, shade = TRUE)
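
Both plots squeeze the four variables into two dimensions. As a rough illustration of that idea, the base R sketch below projects the data onto its first two principal components with prcomp() and colors the points by cluster. This is only my approximation of the general approach and is not the exact picture that clusplot() or plotcluster() draws.

```r
set.seed(8593)
kmeans.model <- kmeans(iris[1:4], 3)

# Principal component analysis of the four measurements
pca <- prcomp(iris[1:4])

# Plot the observations in the plane of the first two components
plot(pca$x[, 1:2],
     col = kmeans.model$cluster,
     pch = 19,
     xlab = "PC1", ylab = "PC2")

# How much of the total variance each of the two components carries
summary(pca)$importance["Proportion of Variance", 1:2]
```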

Thank You!