data(iris)

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Dataset Overview

Variables Description

Data Manipulation

I will create a new variable, the Sepal Ratio, defined as Sepal Length divided by Sepal Width. This feature captures the sepal’s shape rather than its absolute size, and including it as an additional input gives the clustering analysis a dimension that may reveal patterns among the flowers that are not captured by the individual sepal measurements.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
iris <- iris %>%
  mutate(Sepal.Ratio = Sepal.Length / Sepal.Width)

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Ratio
## 1          5.1         3.5          1.4         0.2  setosa    1.457143
## 2          4.9         3.0          1.4         0.2  setosa    1.633333
## 3          4.7         3.2          1.3         0.2  setosa    1.468750
## 4          4.6         3.1          1.5         0.2  setosa    1.483871
## 5          5.0         3.6          1.4         0.2  setosa    1.388889
## 6          5.4         3.9          1.7         0.4  setosa    1.384615

Descriptive statistics

library(psych)

describe(iris)
##              vars   n mean   sd median trimmed  mad  min  max range  skew
## Sepal.Length    1 150 5.84 0.83   5.80    5.81 1.04 4.30 7.90  3.60  0.31
## Sepal.Width     2 150 3.06 0.44   3.00    3.04 0.44 2.00 4.40  2.40  0.31
## Petal.Length    3 150 3.76 1.77   4.35    3.76 1.85 1.00 6.90  5.90 -0.27
## Petal.Width     4 150 1.20 0.76   1.30    1.18 1.04 0.10 2.50  2.40 -0.10
## Species*        5 150 2.00 0.82   2.00    2.00 1.48 1.00 3.00  2.00  0.00
## Sepal.Ratio     6 150 1.95 0.40   2.03    1.94 0.38 1.27 2.96  1.69  0.02
##              kurtosis   se
## Sepal.Length    -0.61 0.07
## Sepal.Width      0.14 0.04
## Petal.Length    -1.42 0.14
## Petal.Width     -1.36 0.06
## Species*        -1.52 0.07
## Sepal.Ratio     -0.86 0.03

Let’s analyze a few of these results, starting with the two length variables.

Sepal Length

The mean sepal length is 5.84 cm with a standard deviation of 0.83, indicating that most measurements are relatively close to the mean, with some variability. The skewness of 0.31 indicates a slight rightward skew, suggesting that a few flowers have somewhat larger sepal lengths. The range, from 4.3 cm to 7.9 cm, shows that a few flowers have much larger sepals, while the majority fall within a more typical range.
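To make this slight right skew visible, a quick base-R histogram (a minimal sketch; the binning is arbitrary) can be drawn with the mean and median marked:

# Visual check of the slight right skew in Sepal.Length
hist(iris$Sepal.Length,
     breaks = 20,
     main = "Distribution of Sepal Length",
     xlab = "Sepal Length (cm)")
abline(v = mean(iris$Sepal.Length), lwd = 2)    # mean = 5.84
abline(v = median(iris$Sepal.Length), lty = 2)  # median = 5.80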

Petal Length

Petal length has a mean of 3.76 cm and a standard deviation of 1.77, notably higher than for the other parameters, reflecting a wider spread in petal lengths. The median of 4.35 cm is higher than the mean, indicating a left-skewed distribution: most petals are relatively long, while a tail of much shorter petals pulls the mean down. The range extends from 1.0 cm to 6.9 cm, signifying considerable variation in petal length across the dataset.
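A plausible driver of this wide spread is variation between species. Since psych is already loaded, a per-species breakdown (a sketch using describeBy) can check this:

# Descriptive statistics of petal length, split by species
describeBy(iris$Petal.Length, group = iris$Species)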

Clustering

RQ: can clustering techniques reveal natural groupings within the dataset of Iris flowers based on their sepal and petal characteristics?

Standardizing & dissimilarities

iris_clu_std <- as.data.frame(scale(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Sepal.Ratio")]))

head(iris_clu_std)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Ratio
## 1   -0.8976739  1.01560199    -1.335752   -1.311052  -1.2398566
## 2   -1.1392005 -0.13153881    -1.335752   -1.311052  -0.7999085
## 3   -1.3807271  0.32731751    -1.392399   -1.311052  -1.2108735
## 4   -1.5014904  0.09788935    -1.279104   -1.311052  -1.1731164
## 5   -1.0184372  1.24503015    -1.335752   -1.311052  -1.4102869
## 6   -0.5353840  1.93331463    -1.165809   -1.048667  -1.4209578

For the clustering analysis of the Iris dataset, I selected five variables: Sepal Length, Sepal Width, Petal Length, Petal Width, and Sepal Ratio. I excluded the Species variable from the clustering itself; the species labels will be used later to check how well the clusters match the actual species, as part of the criterion-validity assessment.
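As a quick sanity check (a sketch), the standardized variables should now have a mean of 0 and a standard deviation of 1:

# Verify that scale() produced mean 0 and sd 1 for every variable
round(colMeans(iris_clu_std), 2)
round(apply(iris_clu_std, 2, sd), 2)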

iris_clu_std$Dissimilarity <- sqrt(iris_clu_std$Sepal.Length^2 +
                                   iris_clu_std$Sepal.Width^2 +
                                   iris_clu_std$Petal.Length^2 +
                                   iris_clu_std$Petal.Width^2 +
                                   iris_clu_std$Sepal.Ratio^2)

head(data.frame(Flower_ID = 1:nrow(iris), Dissimilarity = iris_clu_std$Dissimilarity)[order(-iris_clu_std$Dissimilarity), ])
##     Flower_ID Dissimilarity
## 119       119      4.208884
## 16         16      3.867347
## 123       123      3.634628
## 34         34      3.624215
## 33         33      3.601554
## 132       132      3.537387

The dissimilarity calculation shows that flower ID 119 has a notably higher score (4.21) than the other flowers, suggesting it is an outlier, so I will remove it.

library(dplyr)

iris_cleaned <- iris_clu_std %>%
  filter(row_number() != 119)

iris_cleaned$Dissimilarity <- sqrt(
  iris_cleaned$Sepal.Length^2 +
  iris_cleaned$Sepal.Width^2 +
  iris_cleaned$Petal.Length^2 +
  iris_cleaned$Petal.Width^2 +
  iris_cleaned$Sepal.Ratio^2
)

head(iris_cleaned[order(-iris_cleaned$Dissimilarity), ])
##     Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Ratio Dissimilarity
## 16    -0.1730941   3.0804554    -1.279104   -1.048667  -1.6435927      3.867347
## 122    2.2421720  -0.5903951     1.666574    1.050416   1.9884107      3.634628
## 34    -0.4146207   2.6215991    -1.335752   -1.311052  -1.6084617      3.624215
## 33    -0.7769106   2.3921710    -1.279104   -1.442245  -1.7114159      3.601554
## 131    2.4836986   1.7038865     1.496631    1.050416   0.3127907      3.537387
## 118    2.2421720   1.7038865     1.666574    1.312801   0.1813696      3.530492

Hopkins statistic

library(factoextra)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances <- get_dist(iris_cleaned,
                      method = "euclidean")

fviz_dist(Distances, gradient = list(low = "darkred",
                                     mid = "grey95",
                                     high = "white"))

library(factoextra)

get_clust_tendency(iris_cleaned,
                   n = nrow(iris_cleaned) - 1,
                   graph = FALSE)
## $hopkins_stat
## [1] 0.8495027
## 
## $plot
## NULL

The Hopkins statistic, which measures how well-suited the data is for clustering, is 0.849. Values close to 1 (and well above 0.5, the value expected for random data) indicate a strong clustering tendency, so the data is well suited to clustering analysis. The distance-matrix heatmap also shows visible blocks of similar observations, which further suggests that the data forms natural groups.
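For context (a sketch; the exact value will vary with the random draw), the same statistic computed on uniformly random data of the same shape should land near 0.5, the value expected when no clustering structure is present:

set.seed(123)

# Uniform noise spanning the same range as each column of the real data
random_data <- as.data.frame(apply(iris_cleaned, 2,
                                   function(x) runif(length(x), min(x), max(x))))

get_clust_tendency(random_data,
                   n = nrow(random_data) - 1,
                   graph = FALSE)$hopkins_stat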

Number of clusters

library(factoextra) 
library(NbClust)

fviz_nbclust(iris_cleaned, kmeans, method = "wss") +
   labs(subtitle = "Elbow method")

The elbow method helps find the best number of clusters by plotting the total within-cluster sum of squares against the number of clusters. In the plot, we see noticeable breaks at 2 and 3 clusters. So, dividing the data into either 2 or 3 clusters gives us a good balance of simplicity and meaningful differences in the data.
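Under the hood, the elbow plot is simply the total within-cluster sum of squares computed for a range of k values; a minimal hand-rolled version (a sketch, with seed and nstart chosen arbitrarily) looks like this:

set.seed(42)
wss <- sapply(1:10, function(k) kmeans(iris_cleaned, centers = k, nstart = 25)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")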

fviz_nbclust(iris_cleaned, kmeans, method = "silhouette") +
   labs(subtitle = "Silhouette analysis")

The silhouette method also supports this, showing that 2 clusters is the most preferable choice.
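Equivalently, the average silhouette width behind that plot can be computed directly with the cluster package (a sketch; seed and nstart are arbitrary):

library(cluster)

set.seed(42)
d <- dist(iris_cleaned)
avg_sil <- sapply(2:10, function(k) {
  km <- kmeans(iris_cleaned, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
round(avg_sil, 3)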

library(NbClust)

nc <- NbClust(iris_cleaned, distance = "euclidean",
               min.nc = 2, max.nc = 10,
               method = "kmeans", index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 12 proposed 2 as the best number of clusters 
## * 7 proposed 3 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 2 proposed 9 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

Although the majority rule points to 2 clusters, a sizeable share of the indices (7 of them) propose 3, and the elbow plot also shows a break at 3. Aiming for more detailed and still distinct groups, I decided to go with 3 clusters, which allows a finer grouping while maintaining meaningful differences between the clusters.

Clustering groups

Clustering <- kmeans(iris_cleaned,
                         centers = 3,
                         nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 50, 64, 35
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Ratio Dissimilarity
## 1  -1.01119138   0.8504137    -1.300630  -1.2507035  -1.2072836      2.711802
## 2   0.07031946  -0.7266181     0.386691   0.3247565   0.5415159      1.403314
## 3   1.25191290   0.1437750     1.100097   1.1516218   0.6625869      2.320648
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 3 3 2 3 3 3 3
## [112] 2 3 2 2 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3 3
## [149] 2
## 
## Within cluster sum of squares by cluster:
## [1]  58.58131 107.41077  54.49020
##  (between_SS / total_SS =  73.2 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

From the K-means clustering results, here are some interesting findings:

  • Largest Cluster: Cluster 2 is the largest, with 64 flowers, making it the most populous cluster in the dataset.

  • Cluster Means: The mean for Petal Length in Cluster 1 is -1.300630, indicating that this cluster contains flowers with relatively smaller petal lengths compared to other clusters.

  • SS Ratio: The between-cluster to total sum of squares ratio (between_SS / total_SS) is 73.2%, which shows that the clusters are well-separated: most of the variability in the data is accounted for by differences between clusters rather than within each cluster. This high ratio suggests the clustering model is effective, and it can be recomputed directly from the kmeans object, as sketched below.
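The same ratio, along with the cluster sizes and per-cluster sums of squares, is available from the components of the kmeans object (a small sketch):

# between_SS / total_SS ratio reported in the printout above, as a percentage
round(Clustering$betweenss / Clustering$totss * 100, 1)

# Cluster sizes and within-cluster sums of squares
Clustering$size
Clustering$withinss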

library(factoextra)

fviz_cluster(Clustering,
             palette = "Set1",
             repel = TRUE,
             ggtheme = theme_bw(),
             data = iris_cleaned)

Based on the cluster plot, I will remove the observations that sit far from their cluster centres and look like potential outliers: ids 16, 34 and 40 from cluster 1; 86, 71, 58, 94, 99 and 61 from cluster 2; and 131, 118 and 110 from cluster 3.

library(dplyr)

iris_cleaned <- iris_cleaned %>% filter(!(row_number() %in% c(16, 34, 39, 40, 42, 86, 71, 58, 94, 99, 61, 131, 181, 110)))

iris_cleaned <- as.data.frame(scale(iris_cleaned[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Sepal.Ratio")]))
Clustering <- kmeans(iris_cleaned,
                        centers = 3,
                        nstart = 25)
Clustering
## K-means clustering with 3 clusters of sizes 52, 45, 39
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Ratio
## 1   0.05127646  -0.8884530    0.4145288   0.3411984   0.6045754
## 2  -1.07671389   0.9391076   -1.3274463  -1.2683989  -1.2389273
## 3   1.17399357   0.1010183    0.9789637   1.0086061   0.6234310
## 
## Clustering vector:
##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 3 3 3 1 1 1 3 1 1 1 1 1 1 3 1 1 1 1 1 1 1 1 3 3 3 1 1 1 1
##  [75] 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 3 1 3 3 3 3 1 3 1 3 1 3 1 1 3 3 3 1 3 1 3 1
## [112] 3 3 1 1 3 3 3 3 1 1 3 3 3 1 3 3 3 1 3 3 3 1 3 3 1
## 
## Within cluster sum of squares by cluster:
## [1] 59.93223 37.52378 43.38612
##  (between_SS / total_SS =  79.1 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The between_SS / total_SS ratio improved from 73.2% to 79.1%, indicating better-separated clusters after removing the outliers.

library(factoextra)

fviz_cluster(Clustering,
             palette = "Set1",
             repel = FALSE,
             ggtheme = theme_bw(),
             data = iris_cleaned)

The clusters now satisfy both requirements more convincingly: they are more compact within groups and better separated between groups, and with the outliers removed the separation is visibly clearer.

Cluster results

Averages <- Clustering$centers
 Averages
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Ratio
## 1   0.05127646  -0.8884530    0.4145288   0.3411984   0.6045754
## 2  -1.07671389   0.9391076   -1.3274463  -1.2683989  -1.2389273
## 3   1.17399357   0.1010183    0.9789637   1.0086061   0.6234310

Let’s explain two of the averages for the clusters:

  • Cluster 1 - Sepal Length: The average Sepal Length for Cluster 1 is 0.05127646. This means that, on average, the flowers in Cluster 1 have a Sepal Length close to the overall average. The value being close to zero suggests that the sepal length of these flowers is pretty typical compared to the entire dataset.

  • Cluster 2 - Petal Length: The average Petal Length for Cluster 2 is -1.3274463, which is lower than the overall average. This indicates that the flowers in Cluster 2 have smaller petal lengths compared to the rest of the flowers in the dataset. The negative value shows they are below the mean for petal length.
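Since the k-means centres are just the per-cluster means of the standardized variables, the table above can be reproduced by aggregating over the assigned clusters (a sketch):

# Per-cluster means of the standardized variables; should match Clustering$centers
round(aggregate(iris_cleaned, by = list(Cluster = Clustering$cluster), FUN = mean), 2)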

Figure <- as.data.frame(Averages)
 Figure$ID <- 1:nrow(Figure)

 library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width" , "Sepal.Ratio"))
 Figure$Group <- factor(Figure$ID,
                        levels = c(1, 2, 3),
                        labels = c("1", "2", "3"))
 Figure$NameF <- factor(Figure$name,
                        levels = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width" , "Sepal.Ratio"),
                        labels = c("Sepal Length", "Sepal Width", "Petal Length", "Petal Width" , "Sepal Ratio"))

 library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
   geom_hline(yintercept = 0) +
   theme_bw() +
   geom_point(aes(shape = Group, color = Group), size = 3) +
   geom_line(aes(group = ID), linewidth = 1) +
   ylab("Averages") +
   xlab("Cluster Variables") +
   ylim(-2, 3.5) +
   theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10))

In the line graph comparing the three clusters, we can observe the following trends:

  • Cluster 1 has values close to the overall average for most variables, showing moderate measurements for features like Sepal Length and Petal Length.
  • Cluster 2 stands out with lower values for Petal Length, indicating that the flowers in this group have smaller petals compared to the others. They also show higher values for Sepal Width and Petal Width, which may suggest different flower shapes.
  • Cluster 3 has higher values for Petal Length and Sepal Length, making these flowers have larger petals and slightly longer sepals than the others. The graph highlights these distinctions, showing how the clusters vary in size and shape.

The graph shows a distinct line for each cluster: Cluster 2 dips lower on the petal-related features, Cluster 3 rises higher, and Cluster 1 remains more balanced and centered.

Differentiation between groups

 iris_cleaned$Group <- Clustering$cluster
fit <- aov(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Sepal.Ratio) ~ as.factor(Group),  
           data = iris_cleaned)  

summary(fit)
##  Response Sepal.Length :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 106.058  53.029  243.69 < 2.2e-16 ***
## Residuals        133  28.942   0.218                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Sepal.Width :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 81.131  40.565  100.15 < 2.2e-16 ***
## Residuals        133 53.869   0.405                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Petal.Length :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 125.607  62.803  889.26 < 2.2e-16 ***
## Residuals        133   9.393   0.071                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Petal.Width :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 118.125  59.063  465.51 < 2.2e-16 ***
## Residuals        133  16.875   0.127                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Sepal.Ratio :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 103.237  51.618  216.14 < 2.2e-16 ***
## Residuals        133  31.763   0.239                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The very low p-values (less than 0.001) for all flower measurements show that the clustering clearly separates the flowers into different groups. Each cluster has its own unique pattern based on petal and sepal size. This means the clustering method worked well in finding groups of flowers with different shapes and sizes.
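To gauge how large these differences are, eta squared, the share of each variable's variance explained by the cluster grouping, can be worked out by hand from the sums of squares printed above as SS_between / (SS_between + SS_residual) (a sketch for two of the responses):

# Eta squared computed from the ANOVA tables above
round(106.058 / (106.058 + 28.942), 2)   # Sepal.Length: about 0.79
round(125.607 / (125.607 + 9.393), 2)    # Petal.Length: about 0.93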

Criterion validity

I had an issue when trying to run the chi-square test for criterion validity because my dataset didn’t have the Group column properly assigned. This happened because I had created a standardized version of the iris dataset (iris_clu_std) that didn’t include the Species variable, and then I removed outliers, which made it harder to match everything correctly.

To fix this, I asked ChatGPT for help. It suggested that I reassign the clusters to the original iris dataset by running K-means clustering again and adding the Group column properly. After that, I made sure there were no missing values and created a contingency table to compare the species with the clusters. Finally, I was able to run the chi-square test correctly and proceed with the criterion validity analysis.

set.seed(42)
kmeans_result <- kmeans(iris[, 1:4], centers = 3) 

iris$Group <- as.factor(kmeans_result$cluster)

iris_cleaned <- iris[complete.cases(iris$Species, iris$Group), ]

contingency_table <- table(iris_cleaned$Species, iris_cleaned$Group)

chi_square <- chisq.test(contingency_table)

chi_square
## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 223.6, df = 4, p-value < 2.2e-16

The null hypothesis says that there is no relationship between species and clusters, meaning the clusters do not group flowers based on their species.

Since the p-value is extremely small (p < 0.001), we can reject H0. In simple terms, the test confirms that the clusters are not random: there is a strong relationship between the species and the way the flowers were grouped, which shows that the clustering separated the different types of flowers well.

addmargins(chi_square$observed)
##             
##                1   2   3 Sum
##   setosa      50   0   0  50
##   versicolor   0  48   2  50
##   virginica    0  14  36  50
##   Sum         50  62  38 150
addmargins(round(chi_square$expected, 2))
##             
##                  1     2     3    Sum
##   setosa     16.67 20.67 12.67  50.01
##   versicolor 16.67 20.67 12.67  50.01
##   virginica  16.67 20.67 12.67  50.01
##   Sum        50.01 62.01 38.01 150.03

The results show that setosa flowers all fall in Cluster 1, versicolor is mostly in Cluster 2, and virginica is split between Clusters 2 and 3. The expected values show how each species would be distributed if species and cluster were unrelated: each species' 50 flowers would be split in proportion to the cluster sizes (e.g., 50 × 50 / 150 ≈ 16.67 for Cluster 1 and 50 × 62 / 150 ≈ 20.67 for Cluster 2). Since the observed counts differ sharply from these expectations, the clustering clearly groups similar flowers together.
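The expected counts follow the usual independence formula, (row total * column total) / grand total, and can be reproduced directly from the contingency table (a sketch):

# Expected counts under independence; should match chi_square$expected
round(outer(rowSums(contingency_table), colSums(contingency_table)) / sum(contingency_table), 2)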

round(chi_square$residuals, 2)
##             
##                  1     2     3
##   setosa      8.16 -4.55 -3.56
##   versicolor -4.08  6.01 -3.00
##   virginica  -4.08 -1.47  6.56
library(effectsize)
## 
## Attaching package: 'effectsize'
## The following object is masked from 'package:psych':
## 
##     phi
effectsize::cramers_v(iris$Species, iris$Group)
## Cramer's V (adj.) |       95% CI
## --------------------------------
## 0.86              | [0.76, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].

A high positive residual (like 8.16 for setosa in Cluster 1) means there are more setosa flowers in that cluster than expected, while a large negative residual (like -4.08 for virginica in Cluster 1) means there are fewer virginica flowers there than expected. This confirms that the clusters are meaningfully separating the species.
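These values are Pearson residuals, (observed - expected) / sqrt(expected), and can be recomputed from the stored components of the test object (a sketch):

# Pearson residuals; should match chi_square$residuals shown above
round((chi_square$observed - chi_square$expected) / sqrt(chi_square$expected), 2)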

The high Cramer’s V (0.86) confirms a strong relationship between the species and the clusters, meaning the clustering grouped the flowers very well.
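As a cross-check (a sketch; the effectsize output above reports a small-sample-adjusted value), the unadjusted Cramer's V follows directly from the chi-square statistic as sqrt(X^2 / (n * (min(rows, cols) - 1))):

# Unadjusted Cramer's V from the chi-square statistic above
sqrt(223.6 / (150 * (3 - 1)))   # about 0.86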