The Iris dataset originates from a 1936 publication by Ronald A. Fisher in the Annals of Eugenics. Fisher—while widely regarded for his contributions to statistics—was also an outspoken eugenicist. The dataset’s underlying aim was to demonstrate the separability of biological species based on morphological measurements, a framing that reinforces typological thinking and the flattening of biological variation.
This project focuses on exploratory data analysis (EDA), examining the variability, relationships, and structure present in the dataset. No statistical inference or hypothesis testing is performed.
#Key variables
• Sepal.Length: Numeric, length of sepal in cm
• Sepal.Width: Numeric, width of sepal in cm
• Petal.Length: Numeric, length of petal in cm
• Petal.Width: Numeric, width of petal in cm
• Species: Categorical, the species of iris (setosa, versicolor, virginica)
#Research Questions
1. How do the four floral features vary across the three iris species?
2. What combinations of features appear to best distinguish species, and where do overlaps emerge?
3. What patterns of internal variability exist within each species?
4. When reduced to principal components, does the structure suggest separation or overlap?
5. To what extent do k-means clusters align with the species labels?
#Methodology
The analysis uses summary statistics, visualizations (pairwise scatterplots, boxplots, density plots), PCA on the standardized measurements, and k-means clustering (k = 3) on the raw measurements. It also critically examines the dataset’s origins and implications.
#Results
#Data Setup
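# Load packages for wrangling, plotting, PCA, and clustering
library(tidyverse)   # attaches dplyr, tidyr, and ggplot2
library(GGally)      # ggpairs()
library(ggfortify)   # autoplot() for prcomp objects
library(factoextra)  # fviz_pca_ind()
library(cluster)
library(gridExtra)   # grid.arrange()
library(knitr)
# Load the built-in iris data and preview the first rows
data(iris)
head(iris)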
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Species | Sepal.Length_Mean | Sepal.Length_SD | Sepal.Width_Mean | Sepal.Width_SD | Petal.Length_Mean | Petal.Length_SD | Petal.Width_Mean | Petal.Width_SD |
---|---|---|---|---|---|---|---|---|
setosa | 5.01 | 0.35 | 3.43 | 0.38 | 1.46 | 0.17 | 0.25 | 0.11 |
versicolor | 5.94 | 0.52 | 2.77 | 0.31 | 4.26 | 0.47 | 1.33 | 0.20 |
virginica | 6.59 | 0.64 | 2.97 | 0.32 | 5.55 | 0.55 | 2.03 | 0.27 |
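The table above reports per-species means and standard deviations, but the code that produced it is not shown. The following is a minimal sketch of one way to compute such a table, assuming dplyr's `across()`; the name `summary_tbl` is hypothetical, and the original analysis may have used a different approach.

```r
library(dplyr)

# Per-species mean and standard deviation for each of the four measurements.
# `summary_tbl` is an illustrative name, not taken from the original analysis.
summary_tbl <- iris %>%
  group_by(Species) %>%
  summarise(
    across(
      Sepal.Length:Petal.Width,
      list(Mean = mean, SD = sd),
      .names = "{.col}_{.fn}"   # e.g. Sepal.Length_Mean, Sepal.Length_SD
    ),
    .groups = "drop"
  ) %>%
  mutate(across(where(is.numeric), ~ round(.x, 2)))

print(summary_tbl)
```

The `.names` pattern is chosen so the column names match the headers in the table above.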
# Pairwise scatterplots and density plots by species
ggpairs(
  iris,
  columns = 1:4,
  aes(color = Species, alpha = 0.7)
)
# Boxplot of petal length by species
p1 <- ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
  geom_boxplot() +
  labs(title = "Petal Length Distribution by Species", y = "Petal Length (cm)") +
  theme_minimal()
print(p1)
# Boxplot of sepal width by species
p2 <- ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
  geom_boxplot() +
  labs(title = "Sepal Width Distribution by Species", y = "Sepal Width (cm)") +
  theme_minimal()
print(p2)
# PCA on the four measurements, standardized to unit variance
pca_model <- prcomp(iris[, 1:4], scale. = TRUE)
# Print textual summary
summary(pca_model)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
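The "Proportion of Variance" row in the summary above follows directly from the component standard deviations: each component's variance (its squared standard deviation) divided by the total variance. A short sketch of that arithmetic, refitting the same model under the illustrative name `pca_check`:

```r
# Refit the PCA on the four standardized measurements
# (`pca_check` is an illustrative name; it mirrors `pca_model` above)
pca_check <- prcomp(iris[, 1:4], scale. = TRUE)

# Variance of each component is its squared standard deviation
component_var <- pca_check$sdev^2

# Share of total variance per component, and the running total
prop_var <- component_var / sum(component_var)
cum_var <- cumsum(prop_var)

round(prop_var, 4)
round(cum_var, 4)
```

Because the features are standardized, the total variance equals the number of features (4), so the share for PC1 is roughly 1.7084^2 / 4, or about 0.73, in line with the printed summary.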
# PCA Plot
p3 <- autoplot(pca_model, data = iris, colour = 'Species', frame = TRUE, frame.type = 'norm') +
  labs(title = "PCA Colored by Species")
print(p3)
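# Fix the random seed so the cluster assignment is reproducible
set.seed(123)
# k-means on the four raw (unscaled) measurements; nstart = 25 keeps the best of 25 random starts
k_result <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
iris$Cluster <- as.factor(k_result$cluster)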
p4 <- fviz_pca_ind(pca_model,
                   geom.ind = "point",
                   col.ind = iris$Species,
                   palette = "jco",
                   addEllipses = TRUE,
                   ellipse.type = "norm",
                   legend.title = "Species") +
  ggtitle("PCA: Colored by Species")
p5 <- fviz_pca_ind(pca_model,
                   geom.ind = "point",
                   col.ind = iris$Cluster,
                   palette = "jco",
                   addEllipses = TRUE,
                   ellipse.type = "norm",
                   legend.title = "K-Means Cluster") +
  ggtitle("PCA: Colored by K-Means")
grid.arrange(p4, p5, ncol = 2)
Counts of specimens per k-means cluster (one row per cluster) by species (columns):

setosa | versicolor | virginica |
---|---|---|
0 | 48 | 14 |
0 | 2 | 36 |
50 | 0 | 0 |
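The table above cross-tabulates clusters against species. The following is a minimal sketch of how such a table, together with a simple agreement score, could be computed. The names `km_check`, `confusion`, and `purity` are illustrative, and the purity measure (the share of specimens that belong to their cluster's majority species) is an added summary that the original analysis does not report.

```r
set.seed(123)

# k-means on the four raw measurements, mirroring the call used above
km_check <- kmeans(iris[, 1:4], centers = 3, nstart = 25)

# Rows are clusters, columns are species
confusion <- table(Cluster = km_check$cluster, Species = iris$Species)
print(confusion)

# Purity: take the count of the majority species in each cluster,
# sum those counts, and divide by the total number of specimens
purity <- sum(apply(confusion, 1, max)) / sum(confusion)
round(purity, 3)
```

For the counts shown in the table, this works out to (48 + 36 + 50) / 150, or about 0.893. Cluster numbering is arbitrary, so the row order may differ between runs.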
#Summary of Results
• Clear differences exist between species, particularly in petal measurements.
• I. setosa is distinctly separate from I. versicolor and I. virginica.
• I. versicolor and I. virginica overlap substantially, especially in the PCA projection and the k-means clusters.
• The first two principal components capture 95.8% of the total variance.
• K-means clustering recovers I. setosa exactly (all 50 specimens in one cluster) but splits the other two species imperfectly: 14 virginica fall in the mostly versicolor cluster and 2 versicolor fall in the mostly virginica cluster.
This exploratory analysis demonstrates that the Iris dataset, while well-structured and partly separable, reflects deeper tensions in statistical pedagogy. Common analyses like PCA and k-means clustering separate I. setosa cleanly while leaving I. versicolor and I. virginica overlapping, but researchers must remain attentive to how data is framed and interpreted, particularly when it originates from ethically complex or typologically motivated contexts. This analysis acknowledges the dataset’s eugenicist origins and urges caution in treating canonical datasets as ideologically neutral. Data is not apolitical; classification systems encode assumptions. Neutrality in analysis is not the absence of stance but the masking of one.