Introduction

The Iris dataset originates from a 1936 publication by Ronald A. Fisher in the Annals of Eugenics. Fisher, while widely regarded for his contributions to statistics, was also an outspoken eugenicist. The dataset’s underlying aim was to demonstrate the separability of biological species based on morphological measurements, a framing that reinforces typological thinking and flattens biological variation.

This project focuses on exploratory data analysis (EDA), examining the variability, relationships, and structure present in the dataset. No statistical inference or hypothesis testing is performed.

#Key variables:

• Sepal.Length: Numeric, length of sepal in cm
• Sepal.Width: Numeric, width of sepal in cm
• Petal.Length: Numeric, length of petal in cm
• Petal.Width: Numeric, width of petal in cm
• Species: Categorical, the species of iris (setosa, versicolor, virginica)

#Research Questions

1. How do the four floral features vary across the three iris species?
2. What combinations of features appear to best distinguish species, and where do overlaps emerge?
3. What patterns of internal variability exist within each species?
4. When reduced to principal components, does the structure suggest separation or overlap?
5. To what extent do k-means clusters align with the species labels?

#Methodology

The analysis uses summary statistics, visualizations (pairwise scatterplots, boxplots, density plots), principal component analysis (PCA), and k-means clustering (k = 3). It also incorporates a critical examination of the dataset’s origins and implications.

#Results

#Data Setup

library(tidyverse)
library(GGally)
library(ggfortify)
library(factoextra)
library(cluster)
library(gridExtra)
library(knitr)
library(ggplot2)

data(iris)
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Summary Statistics

Table 1: Mean and Standard Deviation of Floral Features by Species
Species	Sepal.Length_Mean	Sepal.Length_SD	Sepal.Width_Mean	Sepal.Width_SD	Petal.Length_Mean	Petal.Length_SD	Petal.Width_Mean	Petal.Width_SD
setosa	5.01	0.35	3.43	0.38	1.46	0.17	0.25	0.11
versicolor	5.94	0.52	2.77	0.31	4.26	0.47	1.33	0.20
virginica	6.59	0.64	2.97	0.32	5.55	0.55	2.03	0.27
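A table like Table 1 can be produced along the following lines. This is a minimal sketch, not necessarily the code used for the table above: the object name `summary_table` is an assumption made for illustration, and the column naming pattern is chosen to match the table header.

```r
library(dplyr)

data(iris)

# Mean and standard deviation of each numeric feature, computed per species.
# `summary_table` is a hypothetical name used only in this sketch.
summary_table <- iris %>%
  group_by(Species) %>%
  summarise(
    across(
      where(is.numeric),
      list(Mean = mean, SD = sd),
      .names = "{.col}_{.fn}"   # e.g. Sepal.Length_Mean, Sepal.Length_SD
    ),
    .groups = "drop"
  ) %>%
  mutate(across(where(is.numeric), ~ round(.x, 2)))

knitr::kable(
  summary_table,
  caption = "Mean and Standard Deviation of Floral Features by Species"
)
```

Comparing the SD columns across rows gives a first look at within-species variability: setosa has the smallest spread in both petal measurements.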

Pairwise Plots

# Pairwise scatterplots and density plots by species
ggpairs(
  iris,
  columns = 1:4,
  aes(color = Species, alpha = 0.7)
)

Distribution of Petal Length by Species

p1 <- ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_boxplot() +
  labs(title = "Petal Length Distribution by Species", y = "Petal Length (cm)") +
  theme_minimal()

print(p1)

Distribution of Sepal Width by Species

p2 <- ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
  geom_boxplot() +
  labs(title = "Sepal Width Distribution by Species", y = "Sepal Width (cm)") +
  theme_minimal()

print(p2)

PCA Summary

pca_model <- prcomp(iris[, 1:4], scale. = TRUE)

# Print textual summary
summary(pca_model)

## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

# PCA Plot
p3 <- autoplot(pca_model, data = iris, colour = 'Species', frame = TRUE, frame.type = 'norm') +
  labs(title = "PCA Colored by Species")

print(p3)
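The "Proportion of Variance" row in the summary above is each component's variance divided by the total variance. As a small sketch (the names `pc_var`, `prop_var`, and `cum_var` are illustrative assumptions, not part of the original analysis), the same values can be recomputed from the `sdev` element of the `prcomp` result, and the `rotation` element shows how each measurement loads on the first two components.

```r
data(iris)

# Same model as above: PCA on the four standardized measurements.
pca_model <- prcomp(iris[, 1:4], scale. = TRUE)

# The variance of each component is the square of its standard deviation.
pc_var <- pca_model$sdev^2

# Share of total variance per component, and the running total.
prop_var <- pc_var / sum(pc_var)
cum_var  <- cumsum(prop_var)

round(rbind(Proportion = prop_var, Cumulative = cum_var), 4)

# Loadings: contribution of each original measurement to PC1 and PC2.
round(pca_model$rotation[, 1:2], 3)
```

Because the variables are standardized, the total variance equals the number of variables (4), so PC1's share is roughly 1.7084^2 / 4, or about 0.73.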

PCA: Species vs K-Means Clusters

set.seed(123)
k_result <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
iris$Cluster <- as.factor(k_result$cluster)

p4 <- fviz_pca_ind(pca_model,
                   geom.ind = "point",
                   col.ind = iris$Species,
                   palette = "jco",
                   addEllipses = TRUE,
                   ellipse.type = "norm",
                   legend.title = "Species") +
  ggtitle("PCA: Colored by Species")

p5 <- fviz_pca_ind(pca_model,
                   geom.ind = "point",
                   col.ind = iris$Cluster,
                   palette = "jco",
                   addEllipses = TRUE,
                   ellipse.type = "norm",
                   legend.title = "K-Means Cluster") +
  ggtitle("PCA: Colored by K-Means")

grid.arrange(p4, p5, ncol = 2)

K-Means Clustering and Species Alignment

Table 2: Contingency Table – K-means Clusters vs. Species Labels
Cluster	setosa	versicolor	virginica
1	0	48	14
2	0	2	36
3	50	0	0
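A cross-tabulation like Table 2 can be produced with `table()`. The sketch below repeats the clustering call so that it runs on its own; `tab` and `purity` are illustrative names, and the purity figure (the share of observations that belong to their cluster's majority species) is an added summary rather than part of the original analysis. K-means cluster numbers are arbitrary labels, so the row order may differ from the table above.

```r
data(iris)

set.seed(123)
k_result <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Rows are k-means clusters, columns are the recorded species labels.
tab <- table(Cluster = k_result$cluster, Species = iris$Species)
print(tab)

# Share of observations that fall in their cluster's majority species.
# Taking the row maximum avoids matching cluster numbers to species by hand.
purity <- sum(apply(tab, 1, max)) / sum(tab)
purity
```

With the counts in Table 2 this works out to (48 + 36 + 50) / 150, or about 0.89: the 16 remaining flowers are all versicolor or virginica.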

Summary of Results

• Species differ clearly, particularly in petal measurements.
• I. setosa is distinctly separate from I. versicolor and I. virginica.
• I. versicolor and I. virginica overlap substantially in both the PCA projection and the k-means clusters.
• The first two principal components capture about 95.8% of the variance.
• K-means clustering reproduces the setosa boundary exactly but places 16 of the 100 versicolor and virginica flowers in the other species' cluster.

Conclusion

This exploratory analysis shows that the Iris dataset, although well-structured and partly separable, reflects deeper tensions in statistical pedagogy. Common analyses such as PCA and k-means clustering separate I. setosa cleanly but leave I. versicolor and I. virginica overlapping, and researchers must remain attentive to how data is framed and interpreted, particularly when it originates from ethically complex or typologically motivated contexts. This analysis acknowledges the dataset’s eugenicist origins and urges caution in treating canonical datasets as ideologically neutral. Data is not apolitical; classification systems encode assumptions. Neutrality in analysis is not the absence of stance but the masking of one.

Project 1: Exploratory Data Analysis of the Iris Dataset – A Critical Perspective

Phinn Markson

October 22, 2025