title: "PCA Wine"
author: "Linfeng Noah Joe"

I chose the iris and wine datasets because they show two different cases: iris is simple and easy to separate, wine is more complex. This helps show what PCA can do well and where it has limits.

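PCA itself is done by prcomp(), which lives in the stats package that ships with R, so the only package we need to load is “ggplot2” for visualization. The scree plots also call scales::percent(); the scales package is installed alongside ggplot2 as one of its dependencies.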

# Load package
library(ggplot2)

Iris Part

The iris dataset has 150 observations and 4 numeric variables: sepal length, sepal width, petal length, and petal width.

We keep only the numeric columns (1–4) for PCA because PCA needs numbers.

The “Species” column is stored separately, so we can use it later to color points and see how species cluster in PCA space.

# Prepare data (just need numbers)
iris_data <- iris[, 1:4]
iris_species <- iris$Species # keep species labels for coloring

We run PCA using prcomp(), asking it to:

- center = TRUE: subtract the mean so that each variable has mean = 0

- scale. = TRUE: divide by standard deviation so variables are comparable

PCA finds the directions (PC1, PC2, …) that capture the most variation.

We then extract PC1 and PC2 scores, which tell us where each flower lies in this new 2D coordinate system.
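To make the centering and scaling step concrete, here is a minimal sketch that standardizes the iris measurements by hand with scale() and checks that prcomp() on the pre-standardized data gives the same component standard deviations. The names X, X_std, pca_auto, and pca_manual are introduced here only for illustration.

```r
# Illustrative names (X, X_std, pca_auto, pca_manual) are not part of the analysis below
X <- iris[, 1:4]

# Standardize by hand: subtract each column mean, divide by each column sd
X_std <- scale(X, center = TRUE, scale = TRUE)

# After standardizing, every column has mean ~0 and standard deviation 1
round(colMeans(X_std), 10)
apply(X_std, 2, sd)

# Let prcomp() standardize internally ...
pca_auto <- prcomp(X, center = TRUE, scale. = TRUE)
# ... or hand it data that is already standardized
pca_manual <- prcomp(X_std, center = FALSE, scale. = FALSE)

# Both routes should agree on the standard deviation of each PC
all.equal(pca_auto$sdev, pca_manual$sdev)
```

Letting prcomp() do the standardization is usually more convenient, because the centers and scales it used are then stored in the result.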

# Perform PCA (center and scale so that all variables are comparable)
iris_pca <- prcomp(iris_data, center = TRUE, scale. = TRUE)
# Get PC1 + PC2 for plotting
iris_scores <- data.frame(iris_pca$x[, 1:2], Species = iris_species)
# Show results
summary(iris_pca)
iris_pca$rotation

Iris Plots

This plot shows how the 150 iris flowers are distributed in the PC1–PC2 space.

Each point is one flower, colored by species (setosa, versicolor, virginica).

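We also draw a 95% ellipse for each species. By default, stat_ellipse() draws a data ellipse based on a multivariate t-distribution, so it describes the spread of the points rather than a confidence region for the group mean. If the first two PCs capture the differences between species, we expect to see clear separation, which indicates that the original measurements can distinguish them.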

# Plot PC1 vs PC2
# This plot shows how individual flowers are distributed in the "compressed" space
ggplot(iris_scores, aes(PC1, PC2, color = Species)) + geom_point(size = 3) +
  stat_ellipse(level = 0.95) +
  labs(title = "Iris PCA: PC1 vs PC2",
    x = paste0("PC1 (", round(summary(iris_pca)$importance[2, 1] * 100, 1), "%)"),
    y = paste0("PC2 (", round(summary(iris_pca)$importance[2, 2] * 100, 1), "%)")) + theme_minimal()

The scree plot shows how much variance each PC explains.

By construction, PC1 explains the largest share, PC2 the second largest, and so on.

This helps decide how many PCs to keep:

- If PC1+PC2 explain most of the variance (e.g. >70%), a 2D plot is a good summary of the data.

- If the first few PCs explain little, we might need more dimensions.
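The rule of thumb above can also be checked numerically. This is a minimal sketch, assuming a 70% cutoff (an illustrative threshold taken from the example above, not a fixed rule). The names pca, cum_var, threshold, and n_keep are introduced here only for illustration.

```r
# Illustrative sketch: how many PCs are needed to reach a chosen share of variance?
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of variance per PC, then the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
names(cum_var) <- paste0("PC", seq_along(cum_var))
round(cum_var, 3)

# Assumed cutoff; adjust to taste
threshold <- 0.70

# Index of the first PC at which the cumulative share reaches the cutoff
n_keep <- which(cum_var >= threshold)[1]
n_keep
```

The same lines work for any prcomp() result, so they can be reused for the wine PCA later on.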

# Scree plot
# Shows how much variance each PC explains
iris_var <- iris_pca$sdev^2 / sum(iris_pca$sdev^2)
iris_scree <- data.frame(PC = factor(paste0("PC", 1:4)), Var = iris_var)

ggplot(iris_scree, aes(PC, Var)) +
  geom_col(fill = "#66c2a5") +
  geom_line(aes(group = 1), color = "red") +
  geom_point(color = "red") +
  geom_text(aes(label = scales::percent(Var, accuracy = 0.1)), vjust = -0.5) +
  labs(title = "Iris PCA Scree Plot", x = "PC", y = "Variance Explained") + theme_minimal()

Wine Part

The wine dataset has 178 samples, each with 13 chemical measurements.

The first column is “Class” (1, 2, 3), telling us which wine cultivar it is.

We keep the class as a factor for coloring, and use the 13 numeric variables for PCA.

# Load data
wine <- read.csv("wine.data", header = FALSE)
# Add column names
colnames(wine) <- c("Class","Alcohol","Malic_Acid","Ash","Alcalinity_of_Ash",
                    "Magnesium","Total_Phenols","Flavanoids","Nonflavanoid_Phenols",
                    "Proanthocyanins","Color_Intensity","Hue","OD280/OD315","Proline")

# Separate class & numeric data
wine_class <- factor(wine$Class)
wine_data <- wine[, -1] # keep only numeric columns

We again run PCA with centering and scaling.

PCA here compresses 13 chemical features into a few principal components.

Each PC is a linear combination of the original measurements.
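To illustrate what "linear combination" means here, the sketch below rebuilds PC scores by hand from the standardized data and the loadings in $rotation. It uses iris as a stand-in because it ships with R and needs no external file; the same steps should apply to the wine data. The names X_std, pca, loadings_pc1, manual_pc1, and manual_scores are introduced only for illustration.

```r
# Illustrative sketch on iris (a stand-in for the wine data)
X_std <- scale(iris[, 1:4])                      # centered and scaled measurements
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Loadings (weights) that define PC1
loadings_pc1 <- pca$rotation[, "PC1"]
loadings_pc1

# PC1 score of the first flower = weighted sum of its standardized measurements
manual_pc1 <- sum(X_std[1, ] * loadings_pc1)
all.equal(manual_pc1, unname(pca$x[1, "PC1"]))

# All scores at once: standardized data times the loading matrix
manual_scores <- X_std %*% pca$rotation
all.equal(unname(manual_scores), unname(pca$x))
```

Variables with large absolute loadings on a PC contribute most to it, which is why printing $rotation helps interpret the components.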

# Run PCA
wine_pca <- prcomp(wine_data, center = TRUE, scale. = TRUE)

# Get PC1 + PC2
wine_scores <- data.frame(wine_pca$x[, 1:2], Class = wine_class)

# Show results
summary(wine_pca)
wine_pca$rotation

Wine Plots

This scatter plot also shows how wine samples from three cultivars separate in PCA space.

Points are colored by cultivar, and ellipses show the group spread.

If PCA captures the chemical differences, we expect the groups to be well-separated.

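If there is overlap, it means the first two PCs do not fully separate those cultivars; they may be chemically similar, or they may differ along later components that this plot does not show.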

# Plot PC1 vs PC2
# This shows how wine samples from three classes separate in PCA space
ggplot(wine_scores, aes(PC1, PC2, color = Class)) +
  geom_point(size = 3, alpha = 0.8) +
  stat_ellipse(level = 0.95) +
  labs(title = "Wine PCA: PC1 vs PC2",
    x = paste0("PC1 (", round(summary(wine_pca)$importance[2, 1] * 100, 1), "%)"),
    y = paste0("PC2 (", round(summary(wine_pca)$importance[2, 2] * 100, 1), "%)")) + theme_minimal()

This scree plot shows the proportion of variance explained by each PC (PC1–PC13).

Unlike in the iris dataset, the variance is spread across more components, because there are more features and they are not all strongly correlated with one another.

PC1 may explain only about a third of the variance, so we might need PC1–PC4 to capture most of it.

This is common in high-dimensional data: no single PC dominates, and more components are needed to summarize the data well.

# Scree plot
# Shows which PCs matter most
wine_var <- wine_pca$sdev^2 / sum(wine_pca$sdev^2)

# Make sure X axis is PC1 - PC13
wine_scree <- data.frame(
  PC = factor(paste0("PC", seq_along(wine_var)),
              levels = paste0("PC", seq_along(wine_var))),
  Var = wine_var)

ggplot(wine_scree, aes(x = PC, y = Var)) +
  geom_col(fill = "skyblue") +
  geom_line(aes(group = 1), color = "darkred") +
  geom_point(color = "darkred") +
  geom_text(aes(label = scales::percent(Var, accuracy = 0.1)), vjust = -0.5) +
  labs(title = "Wine PCA Scree Plot",
    x = "Principal Component",
    y = "Variance Explained") + theme_minimal()