Research Question:

“Can we discover meaningful clusters of iris flowers based on their sepal and petal measurements, and how do these clusters correspond to the known species?”

library(datasets)
data(iris)
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Explanation of dataset:

  • Unit of Observation: Each row corresponds to a single iris flower.

  • Sample Size: The dataset contains 150 observations (flowers).

Variables:
  • Sepal Length: Length of the sepal (in centimeters).

  • Sepal Width: Width of the sepal (in centimeters).

  • Petal Length: Length of the petal (in centimeters).

  • Petal Width: Width of the petal (in centimeters).

Units of Measurement:

  • All measurements are in centimeters.

Data Source:

The Iris dataset was introduced by the British biologist and statistician Ronald A. Fisher in 1936.

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
  • Mean: The mean sepal length is approximately 5.84 cm, and the mean sepal width is approximately 3.06 cm.

  • 3rd Quartile of Petal Length: based on this sample, 75% of iris flowers have a petal length of up to 5.1 cm.

  • Median Petal Width: when the petal widths are arranged in ascending order, the middle value (median) is 1.3 cm.

#We will standardize our variables
iris$Sepal.Length_z <- scale(iris$Sepal.Length)
iris$Sepal.Width_z <- scale(iris$Sepal.Width)
iris$Petal.Length_z <- scale(iris$Petal.Length)
iris$Petal.Width_z <- scale(iris$Petal.Width)

#Creating new factor variable
iris$SpeciesFactor <- factor(iris$Species,
                              levels = c( "setosa", "versicolor", "virginica"),
                              labels = c( "Setosa", "Versicolor", "Virginica"))

# Display the modified dataset
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Sepal.Length_z
## 1          5.1         3.5          1.4         0.2  setosa     -0.8976739
## 2          4.9         3.0          1.4         0.2  setosa     -1.1392005
## 3          4.7         3.2          1.3         0.2  setosa     -1.3807271
## 4          4.6         3.1          1.5         0.2  setosa     -1.5014904
## 5          5.0         3.6          1.4         0.2  setosa     -1.0184372
## 6          5.4         3.9          1.7         0.4  setosa     -0.5353840
##   Sepal.Width_z Petal.Length_z Petal.Width_z SpeciesFactor
## 1    1.01560199      -1.335752     -1.311052        Setosa
## 2   -0.13153881      -1.335752     -1.311052        Setosa
## 3    0.32731751      -1.392399     -1.311052        Setosa
## 4    0.09788935      -1.279104     -1.311052        Setosa
## 5    1.24503015      -1.335752     -1.311052        Setosa
## 6    1.93331463      -1.165809     -1.048667        Setosa
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
rcorr(as.matrix(iris[, c ("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z")]),
      type = "pearson")
##                Sepal.Length_z Sepal.Width_z Petal.Length_z Petal.Width_z
## Sepal.Length_z           1.00         -0.12           0.87          0.82
## Sepal.Width_z           -0.12          1.00          -0.43         -0.37
## Petal.Length_z           0.87         -0.43           1.00          0.96
## Petal.Width_z            0.82         -0.37           0.96          1.00
## 
## n= 150 
## 
## 
## P
##                Sepal.Length_z Sepal.Width_z Petal.Length_z Petal.Width_z
## Sepal.Length_z                0.1519        0.0000         0.0000       
## Sepal.Width_z  0.1519                       0.0000         0.0000       
## Petal.Length_z 0.0000         0.0000                       0.0000       
## Petal.Width_z  0.0000         0.0000        0.0000

Ideally, the clustering variables should not be strongly correlated (pairwise correlations below about 0.3), which is clearly not the case here: petal length and petal width correlate at 0.96, and both correlate strongly with sepal length. Even so, I will continue with the clustering and decide later, based on the dendrogram, whether the resulting clusters look sensible.
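
As a quick illustration (a minimal sketch, not part of the original workflow), the pairs that exceed the 0.3 guideline can be listed directly from the correlation matrix:

# Flag variable pairs whose absolute Pearson correlation exceeds the 0.3 guideline
cor_mat <- cor(as.matrix(iris[, c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z")]))
which(abs(cor_mat) > 0.3 & upper.tri(cor_mat), arr.ind = TRUE)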

# Screening for potential outliers: for each flower, compute its Euclidean distance
# from the overall mean (the origin in standardized space)
iris$Dissimilarity <- sqrt(iris$Sepal.Length_z^2 + iris$Sepal.Width_z^2 +
                             iris$Petal.Length_z^2 + iris$Petal.Width_z^2)

head(iris[order(-iris$Dissimilarity), ], 10)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species Sepal.Length_z
## 118          7.7         3.8          6.7         2.2 virginica      2.2421720
## 132          7.9         3.8          6.4         2.0 virginica      2.4836986
## 16           5.7         4.4          1.5         0.4    setosa     -0.1730941
## 119          7.7         2.6          6.9         2.3 virginica      2.2421720
## 34           5.5         4.2          1.4         0.2    setosa     -0.4146207
## 33           5.2         4.1          1.5         0.1    setosa     -0.7769106
## 123          7.7         2.8          6.7         2.0 virginica      2.2421720
## 42           4.5         2.3          1.3         0.3    setosa     -1.6222537
## 110          7.2         3.6          6.1         2.5 virginica      1.6383555
## 136          7.7         3.0          6.1         2.3 virginica      2.2421720
##     Sepal.Width_z Petal.Length_z Petal.Width_z SpeciesFactor Dissimilarity
## 118     1.7038865       1.666574      1.312801     Virginica      3.525830
## 132     1.7038865       1.496631      1.050416     Virginica      3.523530
## 16      3.0804554      -1.279104     -1.048667        Setosa      3.500711
## 119    -1.0492515       1.779869      1.443994     Virginica      3.373621
## 34      2.6215991      -1.335752     -1.311052        Setosa      3.247735
## 33      2.3921710      -1.279104     -1.442245        Setosa      3.168951
## 123    -0.5903951       1.666574      1.050416     Virginica      3.042490
## 42     -1.7375359      -1.392399     -1.179859        Setosa      2.996929
## 110     1.2450302       1.326688      1.706379     Virginica      2.984316
## 136    -0.1315388       1.326688      1.443994     Virginica      2.981586

Based on these dissimilarities (each flower's distance from the overall standardized mean), I see no compelling reason to remove any unit: the largest values are around 3.5 standardized units, which does not suggest extreme abnormalities, so all observations are kept as they are.
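
As a rough illustration of this screening (a minimal sketch with an arbitrary cutoff of 3 standardized units, not a formal outlier rule), one could list the flowers that lie furthest from the overall mean:

# Flowers whose distance from the overall (standardized) mean exceeds 3
subset(iris, Dissimilarity > 3,
       select = c(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species, Dissimilarity))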

#install.packages("factoextra")
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# Calculating Euclidean distances between the standardized observations
distance <- get_dist(iris[c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z")],
                     method = "euclidean")
distance2 <- distance^2

fviz_dist(distance2)

The heat map of squared Euclidean distances already suggests that roughly three groups of similar flowers are forming.

get_clust_tendency(iris[c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z")],
                   n = nrow(iris) - 1,
                   graph = FALSE)
## $hopkins_stat
## [1] 0.8184781
## 
## $plot
## NULL

The Hopkins statistic (0.82) is well above 0.5, which indicates that the data have a genuine clustering tendency and are suitable for cluster analysis.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
WARD1 <- iris[c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z")] %>%
  get_dist(method = "euclidean") %>%
  hclust(method = "ward.D2")        

WARD1
## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 150
library(factoextra)
fviz_dend(WARD1, show_labels = FALSE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The dendrogram shows that three groups are formed. The leftmost group is the most homogeneous, since its units merge at the smallest heights.

fviz_dend(WARD1,
          k = 3,
          cex = 0.5,
          palette = "jama",
          color_labels_by_k = TRUE,
          rect = TRUE)

As noted above, cutting the dendrogram into three groups appears the most suitable choice in our case.
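
To corroborate the choice of three groups, one could also look at a standard validation plot; the sketch below (an optional check, not part of the original analysis) plots the total within-cluster sum of squares for different numbers of hierarchical clusters, where a clear "elbow" at k = 3 would support this choice:

# Optional check: within-cluster sum of squares for k = 1..10 hierarchical (Ward) clusters
fviz_nbclust(iris[c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z")],
             FUNcluster = hcut,
             method = "wss")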

iris$ClusterWard <- cutree(WARD1,
                             k = 3)

clustering1 <- cbind(ID = 1:nrow(iris), iris)
head(clustering1[c( "ID", "ClusterWard")])
##   ID ClusterWard
## 1  1           1
## 2  2           1
## 3  3           1
## 4  4           1
## 5  5           1
## 6  6           1
# Calculating the positions of the initial leaders (centroids of the Ward clusters).
# The warnings below come from the non-numeric columns (Species, SpeciesFactor),
# whose means cannot be computed and are returned as NA.
initial_leaders <- aggregate(iris,
                             by = list(iris$ClusterWard),
                             FUN = mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
initial_leaders
##   Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1       1     5.016327    3.451020     1.465306    0.244898      NA
## 2       2     5.530000    2.566667     3.930000    1.206667      NA
## 3       3     6.546479    2.992958     5.267606    1.854930      NA
##   Sepal.Length_z Sepal.Width_z Petal.Length_z Petal.Width_z SpeciesFactor
## 1     -0.9987207     0.9032290    -1.29875725  -1.252149314            NA
## 2     -0.3783917    -1.1257275     0.09743396   0.009620796            NA
## 3      0.8491418    -0.1476957     0.85515615   0.860094260            NA
##   Dissimilarity ClusterWard
## 1      2.404644           1
## 2      1.352279           2
## 3      1.696173           3
# Hierarchical k-means: the centroids of a Ward hierarchical solution serve as the initial leaders for k-means
library(factoextra)
kmeans_clu <- hkmeans(iris[c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z")],
                      k = 3,
                      hc.metric = "euclidean",
                      hc.method = "ward.D2")

kmeans_clu
## Hierarchical K-means clustering with 3 clusters of sizes 50, 53, 47
## 
## Cluster means:
##   Sepal.Length_z Sepal.Width_z Petal.Length_z Petal.Width_z
## 1    -1.01119138    0.85041372     -1.3006301    -1.2507035
## 2    -0.05005221   -0.88042696      0.3465767     0.2805873
## 3     1.13217737    0.08812645      0.9928284     1.0141287
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 3 2 2 2 3 2 2 2 2 2 2 2 2 3 2 2 2 2 3 2 2 2
##  [75] 2 3 3 3 2 2 2 2 2 2 2 3 3 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3
## [112] 3 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 3 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
## 
## Within cluster sum of squares by cluster:
## [1] 47.35062 44.08754 47.45019
##  (between_SS / total_SS =  76.7 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

Cluster 1 contains 50 units, cluster 2 contains 53, and cluster 3 contains 47. The ratio of the between-group sum of squares to the total sum of squares, 76.7%, gives the proportion of the total variability in the clustering variables that is accounted for by the differences between the clusters.
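
As a quick sanity check (a minimal sketch using the components listed under "Available components" above), the same ratio can be recomputed directly from the hkmeans result:

# Between-group sum of squares as a share of the total sum of squares (~0.767)
kmeans_clu$betweenss / kmeans_clu$totss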

library(factoextra)
fviz_cluster(kmeans_clu,
             palette = "Set1",
             repel = FALSE,
             ggtheme = theme_bw())

The cluster plot shows no obvious outliers, so I will not adjust the data for now.

iris$ClusterK_Means <- kmeans_clu$cluster
table(iris$ClusterWard)
## 
##  1  2  3 
## 49 30 71
table(iris$ClusterK_Means)
## 
##  1  2  3 
## 50 53 47
table(iris$ClusterWard, iris$ClusterK_Means)
##    
##      1  2  3
##   1 49  0  0
##   2  1 29  0
##   3  0 24 47
  • 2nd column: K-Means cluster 2 contains 53 units: 29 of them were already in cluster 2 of the hierarchical (Ward) solution, and 24 were reclassified into it from Ward cluster 3 (one unit from Ward cluster 2 moved to K-Means cluster 1).
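
As a simple summary of the agreement between the two solutions (a minimal sketch; it works here because the cluster labels happen to line up, which is not guaranteed in general), the share of flowers that keep the same label can be computed as:

# Proportion of flowers assigned the same cluster number by Ward and by K-Means
# (125 of 150, about 83%, based on the diagonal of the table above)
mean(iris$ClusterWard == iris$ClusterK_Means)
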
Leaders_final <- kmeans_clu$centers

Figure <- as.data.frame(Leaders_final)
Figure$ID <- 1:nrow(Figure)

library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z"))

Figure$Group <- factor(Figure$ID,
                       levels = c(1, 2, 3),
                       labels = c("1", "2", "3"))

Figure$NameF <- factor(Figure$name,
                       levels = c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z"),
                       labels = c("Sepal.Length_z", "Sepal.Width_z", "Petal.Length_z", "Petal.Width_z"))

library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = ID), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2.2, 2.2)

The chart shows that flowers in cluster 1 (red circles) have below-average sepal length, petal length, and petal width, but notably above-average sepal width. Flowers in cluster 2 (green triangles) have slightly below-average sepal length, slightly above-average petal length and petal width, and clearly below-average sepal width. Cluster 3 (blue squares) lies above the average on all variables, with the largest deviation in sepal length and a value close to the average for sepal width.

# One-way ANOVA of each standardized clustering variable across the K-Means clusters
fit <- aov(cbind(Sepal.Length_z, Sepal.Width_z, Petal.Length_z, Petal.Width_z) ~ as.factor(ClusterK_Means),
           data = iris)

summary(fit)
##  Response 1 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 111.504  55.752  218.57 < 2.2e-16 ***
## Residuals                 147  37.496   0.255                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 2 :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 77.608  38.804    79.9 < 2.2e-16 ***
## Residuals                 147 71.392   0.486                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 3 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 137.276  68.638  860.64 < 2.2e-16 ***
## Residuals                 147  11.724   0.080                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response 4 :
##                            Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 130.723  65.362   525.7 < 2.2e-16 ***
## Residuals                 147  18.277   0.124                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA test:

  • H0: Means of the respective variable are equal across all groups.

  • H1: At least one of the means of the respective variables differs across the groups.

    With these ANOVAs I assess how well the clustering variables separate the resulting clusters. Since all four p-values are below 0.001, we reject the null hypothesis for every variable: the cluster means differ significantly on each of the four clustering variables, so each variable contributes to distinguishing the clusters. (These tests are descriptive here, because the clusters were constructed from the same variables.)

chisq_results <- chisq.test(iris$SpeciesFactor, as.factor(iris$ClusterK_Means))
chisq_results
## 
##  Pearson's Chi-squared test
## 
## data:  iris$SpeciesFactor and as.factor(iris$ClusterK_Means)
## X-squared = 187.64, df = 4, p-value < 2.2e-16

Hypothesis for Chi square test:

  • H0: There is no association between species and clusters by K-Means.

  • H1: There is association between species and clusters by K-Means.

We reject the null hypothesis at p < 0.001 and conclude that there is an association between species and clusters by K-Means.

addmargins(chisq_results$observed)
##                   as.factor(iris$ClusterK_Means)
## iris$SpeciesFactor   1   2   3 Sum
##         Setosa      50   0   0  50
##         Versicolor   0  39  11  50
##         Virginica    0  14  36  50
##         Sum         50  53  47 150
round(chisq_results$expected, 2)
##                   as.factor(iris$ClusterK_Means)
## iris$SpeciesFactor     1     2     3
##         Setosa     16.67 17.67 15.67
##         Versicolor 16.67 17.67 15.67
##         Virginica  16.67 17.67 15.67
round(chisq_results$residuals, 2)  # Pearson residuals
##                   as.factor(iris$ClusterK_Means)
## iris$SpeciesFactor     1     2     3
##         Setosa      8.16 -4.20 -3.96
##         Versicolor -4.08  5.08 -1.18
##         Virginica  -4.08 -0.87  5.14

Sepal Length

aggregate(iris$Sepal.Length, 
          by = list(iris$ClusterK_Means), 
          FUN = "mean")
##   Group.1        x
## 1       1 5.006000
## 2       2 5.801887
## 3       3 6.780851

This shows the mean sepal length within each of the three K-Means clusters.
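
For comparison with the research question (a small sketch reusing the same aggregate call, not part of the original output), the corresponding means by known species can be computed in the same way:

# Mean sepal length within each known species, for comparison with the cluster means
aggregate(iris$Sepal.Length,
          by = list(iris$SpeciesFactor),
          FUN = "mean")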

# Checking whether mean sepal length differs significantly across the K-Means clusters
fit <- aov(Sepal.Length ~ as.factor(ClusterK_Means), 
           data = iris)
summary(fit)
##                            Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(ClusterK_Means)   2  76.46   38.23   218.6 <2e-16 ***
## Residuals                 147  25.71    0.17                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • H0: The means of sepal length are the same across 3 cluster groups.

  • H1: At least one cluster group differs in the mean of sepal length.

We can reject H0 and conclude that at least one cluster group differs from the others in mean sepal length (p < 0.001).

Conclusion

In summary: