The iris dataset contains five columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species. The dataset is built into R and can be loaded just by running “iris” in a chunk, but here I want to show how to read the data when it comes as a CSV file.
The dataset describes the characteristics of each species. For the sake of learning about unsupervised machine learning, we first ignore the Species column (pretending that it does not exist) and then try to cluster the dataset based on Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width. So we drop the Species column as below:
library(dplyr)
# Read the data from a CSV file in the working directory
datairis <- read.csv("iris.csv")
# Drop the label column so that only the four numeric features remain
datairis <- datairis %>%
  select(-c(Species))
datairis
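If no iris.csv file is at hand, the read-and-drop step can be reproduced from the built-in dataset. This is only a minimal sketch: it assumes the CSV holds the same rows as R's built-in iris data frame, and the temporary path and the name datairis_check are made up for illustration.

```r
# Assumption: iris.csv has the same content as R's built-in iris data frame.
csv_path <- file.path(tempdir(), "iris.csv")  # hypothetical location
write.csv(iris, csv_path, row.names = FALSE)

# Read the file back and drop the label column, using base R only
datairis_check <- read.csv(csv_path)
datairis_check <- datairis_check[, names(datairis_check) != "Species"]
str(datairis_check)
```

The result should have the same four numeric columns that the clustering steps work on.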
Now we check whether the dataset contains any NA values by running the chunk below:
colSums(is.na(datairis))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width
##            0            0            0            0
Based on the results above, we can conclude that there are no NA values in the dataset.
To determine the number of clusters K, we can use two methods:
- Business knowledge: the user already knows the expected number of clusters.
- Elbow method: choose the K at which the decrease in total within-cluster sum of squares (WSS) to the next K starts to flatten out.
In this case, I would like to use the elbow method. Based on the plot below, K=4 is the point where the slope starts to become relatively flat, so we will make 4 clusters on datairis.
library(factoextra)
# Plot total WSS against the number of clusters to look for the elbow
fviz_nbclust(x = datairis, FUNcluster = kmeans, method = "wss")
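To make the elbow plot less of a black box, the total WSS can also be computed by hand for each K. The sketch below uses only base R and assumes the built-in iris data stands in for iris.csv; the names features, k_values, and total_wss, the range 1 to 10, and nstart = 10 are my own choices, not taken from the analysis above.

```r
# Assumption: the built-in iris data stands in for iris.csv.
features <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

set.seed(100)
k_values <- 1:10

# Fit k-means for every K and keep the total within-cluster sum of squares.
# nstart = 10 restarts each fit to reduce the effect of unlucky starting centers.
total_wss <- sapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 10)$tot.withinss
})

# The elbow is the K after which the curve flattens out
plot(k_values, total_wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

Reading the elbow from such a curve is a judgment call, so it is worth looking at the numbers in total_wss as well as the plot.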
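# Use the pre-R 3.6 sampler so that the seed gives the same clusters across R versions
RNGkind(sample.kind = "Rounding")
set.seed(100)
# Fit k-means with K = 4, the value chosen from the elbow plot
datairis_cluster <- kmeans(x = datairis, centers = 4)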
fviz_cluster(object = datairis_cluster, data = datairis)
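Besides the plot, the object returned by kmeans carries a few numbers that are useful for a quick sanity check. This is a small sketch under assumptions: it refits the model on the built-in iris data with the default RNG settings, so the cluster numbering and sizes may differ from the ones shown above, and fit is a name chosen here for illustration.

```r
# Assumption: built-in iris stands in for iris.csv; default RNG, so labels may differ.
features <- iris[, 1:4]
set.seed(100)
fit <- kmeans(features, centers = 4)

fit$size                   # number of observations in each cluster
fit$centers                # mean of each feature within each cluster
fit$tot.withinss           # total within-cluster sum of squares
fit$betweenss / fit$totss  # share of the total sum of squares explained by the clustering
```

A ratio of betweenss to totss closer to 1 means the clusters are more compact relative to the overall spread of the data.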
# Attach each observation's cluster label as a factor column
datairis_new <- datairis %>%
  mutate(cluster = as.factor(datairis_cluster$cluster))
head(datairis_new)
# Profile each cluster by the mean of every feature
profil <- datairis_new %>%
  group_by(cluster) %>%
  summarise(across(everything(), mean))
head(profil)
library(tidyverse)
# Reshape the profile to long format and draw one bar panel per feature
profil %>%
  pivot_longer(cols = -cluster, names_to = "type", values_to = "value") %>%
  ggplot(aes(x = cluster, y = value)) +
  geom_col(aes(fill = cluster)) +
  facet_wrap(~type)
From the plot above, we can summarize the characteristics of each cluster:
Cluster 1 contains flowers with the largest sepal width of all clusters. So if you have a flower shop and a customer is looking for a flower with the widest sepals, you can offer a flower from cluster 1.
Cluster 2 contains flowers with the smallest sepal length, petal width, and petal length of all clusters.
Cluster 3 contains flowers with the largest sepal length, petal length, and petal width of all clusters.
Cluster 4 contains flowers with the smallest sepal width of all clusters.
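The same reading can be cross-checked numerically instead of by eye. The sketch below is a base R counterpart of the profiling step, again assuming the built-in iris data and the default RNG, so the cluster numbers it reports will not necessarily match the numbering used in the summary above; profile, highest, and lowest are names introduced here for illustration.

```r
# Assumption: built-in iris stands in for iris.csv; default RNG, so labels may differ.
features <- iris[, 1:4]
set.seed(100)
fit <- kmeans(features, centers = 4)

# Per-cluster mean of every feature (base R counterpart of the profil table)
profile <- aggregate(features, by = list(cluster = fit$cluster), FUN = mean)

# For each feature, find the cluster with the largest and the smallest mean
highest <- sapply(profile[-1], function(col) profile$cluster[which.max(col)])
lowest  <- sapply(profile[-1], function(col) profile$cluster[which.min(col)])
rbind(highest, lowest)
```

Each column of the resulting table names the cluster that leads and the cluster that trails on that feature, which is the information the bar chart conveys visually.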