Introduction

This session demonstrates Principal Component Analysis (PCA) using the built-in iris dataset in R. PCA is a dimensionality reduction technique that transforms correlated variables into a smaller set of uncorrelated variables called principal components. This means the factor or principal components seek for unidimensionality, i.e. the variables measure the same are unidimensional.

For example: If we measure attitude on fish consumption using 20 statements, we can reduce the 20 variables into 2-3 principal components (positive, negative and no opinion attitude) that capture the most variance in the data. This helps in visualizing and interpreting complex datasets.

Load libraries

pacman::p_load(tidyverse, GGally, psych)

Data Preparation

Load and Examine Data

data(iris)
rownames(iris) = paste(substr(iris$Species, 1, 2), 1:nrow(iris), sep="_")
head(iris)
##      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## se_1          5.1         3.5          1.4         0.2  setosa
## se_2          4.9         3.0          1.4         0.2  setosa
## se_3          4.7         3.2          1.3         0.2  setosa
## se_4          4.6         3.1          1.5         0.2  setosa
## se_5          5.0         3.6          1.4         0.2  setosa
## se_6          5.4         3.9          1.7         0.4  setosa

Summary Statistics

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Correlation Matrix

The variables should be correlated to each other for PCA to be effective. Let’s check the correlation matrix of the numeric variables in the iris dataset. We see that correlation coefficients are generally high, indicating strong relationships between the variables.

round(cor(iris[,1:4]), 2)
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length         1.00       -0.12         0.87        0.82
## Sepal.Width         -0.12        1.00        -0.43       -0.37
## Petal.Length         0.87       -0.43         1.00        0.96
## Petal.Width          0.82       -0.37         0.96        1.00

Exploratory Data Analysis

Pairwise Scatterplots for visualization of the correlations

ggpairs(iris[, 1:4], 
       lower = list(continuous = wrap("points", alpha = 0.5)),
       upper = list(continuous = wrap("cor", size = 3)),
       diag = list(continuous = wrap("barDiag", fill = "lightgray")))

Interpretation:
The scatterplots show strong correlations between petal measurements and between sepal length and petal measurements, suggesting PCA could be useful for dimensionality reduction.

KMO and Bartlett’s Test

Measure of Sampling Adequacy (MSA) value should be higher than 0.5.

KMO(iris[, 1:4])
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = iris[, 1:4])
## Overall MSA =  0.54
## MSA for each item = 
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##         0.58         0.27         0.53         0.63

Bartlett’s test should be significant (p < 0.05) for PCA to be appropriate. H0: Variables are not correlated.

bartlett.test(iris[, 1:4])
## 
##  Bartlett test of homogeneity of variances
## 
## data:  iris[, 1:4]
## Bartlett's K-squared = 294.19, df = 3, p-value < 2.2e-16

Communalities (variance explained by factor solution) should be high

# Commonalities extraction: smc = Squared Multiple Correlations
# Regressing each variable on all other variables and using the R-squared value as an estimate of the communality.
# Communalities range from 0 to 1. A higher communality (closer to 1) indicates that a larger proportion of the variable's variance is explained by the common factors.
# We do not exclude variables if communalities > 0.3

communalities = smc(iris[,1:4])
communalities
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##    0.8586117    0.5240071    0.9680118    0.9378503

Perform PCA

# Unrotated solution
# If loadings < 0.3, loadings are not shown
# Loadings show which variables are most important for each component

iris.pca = principal(iris[,1:4], 
                               nfactors = ncol(iris[,1:4]), 
                               rotate = "none", scores = FALSE)

# Display unrotated loadings and proportion variance explained by each factors
# SS loadings = total loadings = number of variables
iris.pca$loadings
## 
## Loadings:
##              PC1    PC2    PC3    PC4   
## Sepal.Length  0.890  0.361 -0.276       
## Sepal.Width  -0.460  0.883              
## Petal.Length  0.992                0.115
## Petal.Width   0.965         0.243       
## 
##                  PC1   PC2   PC3   PC4
## SS loadings    2.918 0.914 0.147 0.021
## Proportion Var 0.730 0.229 0.037 0.005
## Cumulative Var 0.730 0.958 0.995 1.000

PCA rotated

pca_rotated <- principal(iris[,1:4], nfactors=2, rotate="varimax")

# Rotated component matrix
# Variance explained after rotation
pca_rotated$loadings
## 
## Loadings:
##              RC1    RC2   
## Sepal.Length  0.959       
## Sepal.Width  -0.145  0.985
## Petal.Length  0.944 -0.304
## Petal.Width   0.932 -0.257
## 
##                  RC1   RC2
## SS loadings    2.702 1.130
## Proportion Var 0.676 0.283
## Cumulative Var 0.676 0.958

Key Findings:

  • PC1 explains 67.6% of the variance

  • PC2 explains 28.3% of the variance

  • Together, PC1 and PC2 explain 95.8% of the variance

Factors Loadings: Unrotated Component Matrix and Rotated Component Matrix

Interpretation of Loadings:

- PC1: All variables except Sepal.Width contribute positively (overall size)

- PC2: Dominated by Sepal.Width (narrow vs wide sepals)

- PC3 and PC4 explain minimal variance but show more complex patterns

Visualization of Results

Scree Plot

# PCA simple way
# Retain components with eigenvalues > 1 (Kaiser criterion)
# This is not a good example of PCA

plot.pca = prcomp(iris[,1:4], scale = TRUE)
plot(plot.pca, type = 'l')

Interpretation:
The elbow occurs at PC2, suggesting 2 principal components are sufficient. The red line shows the Kaiser criterion (eigenvalue > 1), which also suggests retaining 2 PCs.

Biplot

biplot(plot.pca, scale = 0, 
       cex = c(0.6, 1.2), 
       col = c("cadetblue", "blue"),
       main = "PCA Biplot of Iris Dataset",
       xlab = paste0("PC1 (67.6%)"),
       ylab = paste0("PC2 (28.3%)"))

Interpretation:

  • Petal.Length and Petal.Width vectors are close, indicating strong correlation

  • PC1 represents overall flower size

  • PC2 contrasts sepal width against other dimensions

PCA Scores Plot with ggplot2

pca_data = data.frame(plot.pca$x, Species = iris$Species)

ggplot(pca_data, aes(x = PC1, y = PC2, color = Species)) +
  geom_point(size = 3, alpha = 0.8) +
  stat_ellipse(level = 0.95, linewidth = 1) +
  labs(title = "PCA of Iris Dataset",
       x = paste0("PC1 (", round(100*summary(plot.pca)$importance[2,1], 1), "%)"),
       y = paste0("PC2 (", round(100*summary(plot.pca)$importance[2,2], 1), "%)"),
       color = "Iris Species") +
  theme_minimal(base_size = 12) +
  scale_color_manual(values = c("setosa" = "#E69F00", 
                                "versicolor" = "#56B4E9", 
                                "virginica" = "#009E73"))

Interpretation:

  • Clear separation of setosa from other species along PC1

  • Partial separation of versicolor and virginica along PC2

Comment

The PCA analysis successfully reduced the 4-dimensional iris data to 2 principal components that explain 95.8% of the variance. Key findings include:

  1. Setosa is clearly distinct from other species in PC1 space
  2. Versicolor and virginica show partial separation in PC2 space
  3. The main patterns are:
    • PC1: Overall flower size (petal and sepal dimensions)
    • PC2: Contrast between sepal width and other dimensions

Factor scores

Use unrotated scores if you need maximum variance preservation

Use rotated scores if you need interpretability of components

# Factor scores for the first two components
scores = pca_rotated$scores
PCA1 = scores[,1]
PCA2 = scores[,2]

iris_scores = bind_cols(iris, PCA1 = PCA1, PCA2 = PCA2)

# Display first few rows of scores
head(iris_scores)
##      Sepal.Length Sepal.Width Petal.Length Petal.Width Species       PCA1
## se_1          5.1         3.5          1.4         0.2  setosa -1.0834754
## se_2          4.9         3.0          1.4         0.2  setosa -1.3775358
## se_3          4.7         3.2          1.3         0.2  setosa -1.4198321
## se_4          4.6         3.1          1.5         0.2  setosa -1.4716069
## se_5          5.0         3.6          1.4         0.2  setosa -1.0952963
## se_6          5.4         3.9          1.7         0.4  setosa -0.6336551
##            PCA2
## se_1  0.9067262
## se_2 -0.2648876
## se_3  0.1165198
## se_4 -0.1474634
## se_5  1.0949536
## se_6  1.8641036

You can now continue with cluster analysis. PCA is usually done before cluster analysis.