Introduction

The Iris dataset originates from a 1936 publication by Ronald A. Fisher in the Annals of Eugenics. Fisher—while widely regarded for his contributions to statistics—was also an outspoken eugenicist. The dataset’s underlying aim was to demonstrate the separability of biological species based on morphological measurements, a framing that reinforces typological thinking and the flattening of biological variation. This analysis incorporates a critical data literacy perspective, recognizing that “neutral data” is a myth and that the choice of dataset carries inherent assumptions and power dynamics.

This project focuses on exploratory data analysis (EDA), examining the variability, relationships, and structure present in the dataset. No statistical inference or hypothesis testing is performed.

Key variables:

- Sepal.Length: Numeric, length of sepal in cm
- Sepal.Width: Numeric, width of sepal in cm
- Petal.Length: Numeric, length of petal in cm
- Petal.Width: Numeric, width of petal in cm
- Species: Categorical, the species of iris (setosa, versicolor, virginica)


Research Questions

1. How do the four floral features vary across the three iris species?
2. What combinations of features appear to best distinguish species, and where do overlaps emerge?
3. What patterns of internal variability exist within each species?
4. When reduced to principal components, does the structure suggest separation or overlap?
5. To what extent do k-means clusters align with the species labels?


Methodology

Exploratory Data Analysis (EDA) techniques were employed to understand the structure, variability, and relationships within the Iris dataset.

Methods included calculating summary statistics (mean, standard deviation) to assess central tendencies and spread.

Visualizations such as pairwise scatterplots, boxplots, and density plots were used to examine distributions and species separation.

Principal Component Analysis (PCA) was applied to reduce dimensionality and explore the primary sources of variance in the data. K-means clustering (with k=3) was performed to identify natural groupings in the feature space and compare them to the provided species labels.

This analysis also incorporates a critical examination of the dataset’s origins and implications.

Data Setup

First, we load the necessary libraries and the dataset.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.1     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
library(ggfortify)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(knitr)

# Load the iris dataset (it's built-in)
data(iris)
head(iris) # Display the first few rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Summary Statistics

We begin by exploring the basic distribution of floral measurements across the three iris species using summary statistics (mean and standard deviation). This helps establish whether there are observable differences in size and shape across the species.

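Table 1: Mean and Standard Deviation of Floral Features by Species (all measurements in cm)

| Species    | Sepal.Length_Mean | Sepal.Length_SD | Sepal.Width_Mean | Sepal.Width_SD | Petal.Length_Mean | Petal.Length_SD | Petal.Width_Mean | Petal.Width_SD |
|:-----------|------------------:|----------------:|-----------------:|---------------:|------------------:|----------------:|-----------------:|---------------:|
| setosa     | 5.01 | 0.35 | 3.43 | 0.38 | 1.46 | 0.17 | 0.25 | 0.11 |
| versicolor | 5.94 | 0.52 | 2.77 | 0.31 | 4.26 | 0.47 | 1.33 | 0.20 |
| virginica  | 6.59 | 0.64 | 2.97 | 0.32 | 5.55 | 0.55 | 2.03 | 0.27 |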

The summary statistics reveal clear differences in mean values across species, particularly for Petal.Length and Petal.Width. I. setosa generally has the smallest petal measurements, while I. virginica has the largest. Sepal measurements show less distinct differences, although I. setosa tends to have wider sepals on average.
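
The code that produced Table 1 is not shown in this report. The following is a minimal sketch of one way to compute it, assuming dplyr's `across()` and knitr's `kable()`; the object name `summary_tbl` is introduced here for illustration only.

```r
library(dplyr)

# Mean and standard deviation of each numeric feature, per species.
# The .names pattern yields columns such as Sepal.Length_Mean and Sepal.Length_SD.
summary_tbl <- iris %>%
  group_by(Species) %>%
  summarise(
    across(where(is.numeric), list(Mean = mean, SD = sd), .names = "{.col}_{.fn}"),
    .groups = "drop"
  ) %>%
  mutate(across(where(is.numeric), ~ round(.x, 2)))  # two decimals, as in Table 1

knitr::kable(
  summary_tbl,
  caption = "Mean and Standard Deviation of Floral Features by Species"
)
```

Rounding inside the pipeline keeps the printed table compact; for further calculations it is usually safer to keep the unrounded values and round only when printing.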

Results

## Pairwise Feature Relationships

### Pairwise Scatterplots

To explore relationships between all pairs of continuous variables and assess species separation, we use a scatterplot matrix.

library(ggplot2)
library(GGally)
p1 <- ggpairs(iris, columns = 1:4, aes(color = Species, alpha = 0.8))
print(p1)
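
As a numeric complement to the scatterplot matrix, the pairwise relationships can also be summarised with Pearson correlations. This is a sketch rather than part of the original analysis; the names `cor_all` and `cor_by_species` are introduced here for illustration. Pooled correlations mix within-species and between-species variation, so the per-species matrices are computed as well.

```r
# Pearson correlations across all 150 flowers (species pooled)
cor_all <- cor(iris[, 1:4])
round(cor_all, 2)

# The same correlation matrix computed separately within each species
cor_by_species <- lapply(split(iris[, 1:4], iris$Species), cor)
round(cor_by_species$setosa, 2)
```

Comparing the pooled and per-species matrices is one way to see how much of an apparent relationship is driven by differences between species rather than by variation within them.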

## Distribution of Floral Features by Species

To explore within-group variability and assess the degree of separation across species, we use boxplots and density plots for each of the four floral measurements. These visualizations help highlight both central tendencies and spread within each species.

Boxplots make it easier to compare medians, interquartile ranges, and potential outliers.
Density plots help visualize distribution shape and overlap across species.

These plots reveal that species differences are most pronounced in petal measurements (length and width), while sepal width shows considerable overlap across all three species.

### Petal Dimensions

library(ggplot2)
library(gridExtra)

# Boxplots
p_length_box <- ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_boxplot() + labs(title = "Petal Length by Species") + theme_minimal()

p_width_box <- ggplot(iris, aes(x = Species, y = Petal.Width, fill = Species)) +
  geom_boxplot() + labs(title = "Petal Width by Species") + theme_minimal()

# Density plots
p_length_density <- ggplot(iris, aes(x = Petal.Length, fill = Species, color = Species)) +
  geom_density(alpha = 0.5) + labs(title = "Petal Length Density") + theme_minimal()

p_width_density <- ggplot(iris, aes(x = Petal.Width, fill = Species, color = Species)) +
  geom_density(alpha = 0.5) + labs(title = "Petal Width Density") + theme_minimal()

# Display (grid.arrange() draws the grid itself, so no print() wrapper is needed;
# wrapping it in print() only echoes the TableGrob summary to the console)
grid.arrange(p_length_box, p_width_box, ncol = 2)
grid.arrange(p_length_density, p_width_density, ncol = 2)
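
The text above refers to all four floral measurements, while the code shown covers only the petal dimensions. A minimal sketch of one way to draw boxplots for all four features at once follows, assuming tidyr's `pivot_longer()` and ggplot2 facets; the names `iris_long`, `Feature`, `Value`, and `p_all_box` are introduced here for illustration.

```r
library(ggplot2)
library(tidyr)

# Reshape to long format: one row per flower per feature (150 x 4 = 600 rows)
iris_long <- pivot_longer(
  iris,
  cols = -Species,
  names_to = "Feature",
  values_to = "Value"
)

# One boxplot panel per feature; free y-scales because the features differ in range
p_all_box <- ggplot(iris_long, aes(x = Species, y = Value, fill = Species)) +
  geom_boxplot() +
  facet_wrap(~ Feature, scales = "free_y") +
  labs(title = "Floral Features by Species", y = "Measurement (cm)") +
  theme_minimal()

print(p_all_box)
```

Swapping `geom_boxplot()` for `geom_density(alpha = 0.5)`, with `Value` mapped to `x` instead of `y`, would give the matching density panels.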

## Principal Component Analysis (PCA)

To examine the overall variance structure in the dataset and explore potential low-dimensional separation between species, we apply PCA to the four numeric features.

PCA reduces the data to orthogonal components that capture the greatest variation.
In this case:

  • PC1 explains ~73% of the variance,
  • PC2 adds ~23%,
  • Together they explain over 95% of the total variance.

This indicates that most of the information in the 4-dimensional space can be captured in 2 dimensions.

library(ggfortify)  # for autoplot.prcomp

# Perform PCA (scaling is important for features on different scales)
pca_model <- prcomp(iris[, 1:4], scale. = TRUE)

# Print summary to document
summary(pca_model)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000
# Plot PCA colored by Species
p6 <- autoplot(pca_model, 
               data = iris, 
               colour = 'Species', 
               frame = TRUE, 
               frame.type = 'norm')
print(p6)
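
The variance proportions reported by `summary()` can be recovered from the component standard deviations, and the loadings show how each feature contributes to a component. The following is a sketch that refits the same scaled PCA so it can run on its own; the names `var_explained` and `loadings` are introduced here for illustration.

```r
# Refit the same scaled PCA so this block is self-contained
pca_model <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of variance per component: squared standard deviations
# divided by their total (should match the summary() output above)
var_explained <- pca_model$sdev^2 / sum(pca_model$sdev^2)
round(var_explained, 4)

# Loadings of the four features on the first two components
loadings <- pca_model$rotation[, 1:2]
round(loadings, 2)
```

The signs of the loadings are arbitrary (a component and its negation are equivalent), so only their relative magnitudes and sign patterns within a column should be interpreted.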


## K-Means Clustering and Species Alignment

To explore whether natural groupings emerge in the absence of species labels, we applied k-means clustering to the four numeric features in the dataset (sepal and petal length and width).  
We set `k = 3` to reflect the known number of species.

The resulting clusters were compared to species labels using a contingency table.  
This allows us to assess whether the structure identified by k-means aligns with labeled biological categories, without using those labels to inform the clustering process.


``` r
set.seed(42)  # for reproducibility

# Perform k-means clustering (k = 3)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Add cluster assignments to the iris data frame
iris_clustered <- iris
iris_clustered$Cluster <- as.factor(km$cluster)

# Contingency table: clusters vs true species
table(Species = iris$Species, Cluster = km$cluster)
##             Cluster
## Species       1  2  3
##   setosa      0  0 50
##   versicolor  2 48  0
##   virginica  36 14  0

```

# Summary of Results

The EDA of the Iris dataset reveals several key findings:

- Species differ clearly, most of all in petal measurements (length and width).
- I. setosa is distinctly separate from I. versicolor and I. virginica across every analysis (summary statistics, visualizations, PCA, clustering).
- I. versicolor and I. virginica overlap substantially in their feature distributions, which makes them harder to distinguish from these four measurements alone.
- PCA places most of the variation on a single axis: PC1 captures about 73% of the variance and appears to track overall size, particularly petal size. PC1 and PC2 together capture over 95%.
- K-means clustering (k = 3) recovers I. setosa exactly (all 50 flowers in one cluster) but struggles with the versicolor/virginica boundary: 14 virginica flowers are grouped with versicolor and 2 versicolor flowers with virginica, reflecting the variability and overlap in the data.

This analysis also acknowledges the dataset's origin within a eugenics framework and the importance of considering the historical and ethical context of data used in analysis. The Iris dataset, while useful for pedagogical purposes, should not be treated as a neutral artifact. The question of where meaning resides in classification, whether in our taxonomies or in the data's underlying structure, remains relevant.