This session demonstrates Principal Component Analysis (PCA) using
the built-in iris
dataset in R. PCA is a dimensionality
reduction technique that transforms correlated variables into a smaller
set of uncorrelated variables called principal components. This means
the factor or principal components seek for unidimensionality, i.e. the
variables measure the same are unidimensional.
For example: If we measure attitude on fish consumption using 20 statements, we can reduce the 20 variables into 2-3 principal components (positive, negative and no opinion attitude) that capture the most variance in the data. This helps in visualizing and interpreting complex datasets.
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## se_1 5.1 3.5 1.4 0.2 setosa
## se_2 4.9 3.0 1.4 0.2 setosa
## se_3 4.7 3.2 1.3 0.2 setosa
## se_4 4.6 3.1 1.5 0.2 setosa
## se_5 5.0 3.6 1.4 0.2 setosa
## se_6 5.4 3.9 1.7 0.4 setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
The variables should be correlated to each other for PCA to be effective. Let’s check the correlation matrix of the numeric variables in the iris dataset. We see that correlation coefficients are generally high, indicating strong relationships between the variables.
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.00 -0.12 0.87 0.82
## Sepal.Width -0.12 1.00 -0.43 -0.37
## Petal.Length 0.87 -0.43 1.00 0.96
## Petal.Width 0.82 -0.37 0.96 1.00
ggpairs(iris[, 1:4],
lower = list(continuous = wrap("points", alpha = 0.5)),
upper = list(continuous = wrap("cor", size = 3)),
diag = list(continuous = wrap("barDiag", fill = "lightgray")))
Interpretation:
The scatterplots show strong correlations between petal measurements and
between sepal length and petal measurements, suggesting PCA could be
useful for dimensionality reduction.
Measure of Sampling Adequacy (MSA) value should be higher than 0.5.
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = iris[, 1:4])
## Overall MSA = 0.54
## MSA for each item =
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.58 0.27 0.53 0.63
Bartlett’s test should be significant (p < 0.05) for PCA to be appropriate. H0: Variables are not correlated.
##
## Bartlett test of homogeneity of variances
##
## data: iris[, 1:4]
## Bartlett's K-squared = 294.19, df = 3, p-value < 2.2e-16
# Commonalities extraction: smc = Squared Multiple Correlations
# Regressing each variable on all other variables and using the R-squared value as an estimate of the communality.
# Communalities range from 0 to 1. A higher communality (closer to 1) indicates that a larger proportion of the variable's variance is explained by the common factors.
# We do not exclude variables if communalities > 0.3
communalities = smc(iris[,1:4])
communalities
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.8586117 0.5240071 0.9680118 0.9378503
# Unrotated solution
# If loadings < 0.3, loadings are not shown
# Loadings show which variables are most important for each component
iris.pca = principal(iris[,1:4],
nfactors = ncol(iris[,1:4]),
rotate = "none", scores = FALSE)
# Display unrotated loadings and proportion variance explained by each factors
# SS loadings = total loadings = number of variables
iris.pca$loadings
##
## Loadings:
## PC1 PC2 PC3 PC4
## Sepal.Length 0.890 0.361 -0.276
## Sepal.Width -0.460 0.883
## Petal.Length 0.992 0.115
## Petal.Width 0.965 0.243
##
## PC1 PC2 PC3 PC4
## SS loadings 2.918 0.914 0.147 0.021
## Proportion Var 0.730 0.229 0.037 0.005
## Cumulative Var 0.730 0.958 0.995 1.000
pca_rotated <- principal(iris[,1:4], nfactors=2, rotate="varimax")
# Rotated component matrix
# Variance explained after rotation
pca_rotated$loadings
##
## Loadings:
## RC1 RC2
## Sepal.Length 0.959
## Sepal.Width -0.145 0.985
## Petal.Length 0.944 -0.304
## Petal.Width 0.932 -0.257
##
## RC1 RC2
## SS loadings 2.702 1.130
## Proportion Var 0.676 0.283
## Cumulative Var 0.676 0.958
Key Findings:
PC1 explains 67.6% of the variance
PC2 explains 28.3% of the variance
Together, PC1 and PC2 explain 95.8% of the variance
Interpretation of Loadings:
- PC1: All variables except Sepal.Width contribute positively (overall size)
- PC2: Dominated by Sepal.Width (narrow vs wide sepals)
- PC3 and PC4 explain minimal variance but show more complex patterns
# PCA simple way
# Retain components with eigenvalues > 1 (Kaiser criterion)
# This is not a good example of PCA
plot.pca = prcomp(iris[,1:4], scale = TRUE)
plot(plot.pca, type = 'l')
Interpretation:
The elbow occurs at PC2, suggesting 2
principal components are sufficient. The red line shows the Kaiser
criterion (eigenvalue > 1), which also suggests retaining 2 PCs.
biplot(plot.pca, scale = 0,
cex = c(0.6, 1.2),
col = c("cadetblue", "blue"),
main = "PCA Biplot of Iris Dataset",
xlab = paste0("PC1 (67.6%)"),
ylab = paste0("PC2 (28.3%)"))
Interpretation:
Petal.Length and Petal.Width vectors are close, indicating strong correlation
PC1 represents overall flower size
PC2 contrasts sepal width against other dimensions
pca_data = data.frame(plot.pca$x, Species = iris$Species)
ggplot(pca_data, aes(x = PC1, y = PC2, color = Species)) +
geom_point(size = 3, alpha = 0.8) +
stat_ellipse(level = 0.95, linewidth = 1) +
labs(title = "PCA of Iris Dataset",
x = paste0("PC1 (", round(100*summary(plot.pca)$importance[2,1], 1), "%)"),
y = paste0("PC2 (", round(100*summary(plot.pca)$importance[2,2], 1), "%)"),
color = "Iris Species") +
theme_minimal(base_size = 12) +
scale_color_manual(values = c("setosa" = "#E69F00",
"versicolor" = "#56B4E9",
"virginica" = "#009E73"))
Interpretation:
Clear separation of setosa from other species along PC1
Partial separation of versicolor and virginica along PC2
The PCA analysis successfully reduced the 4-dimensional iris data to 2 principal components that explain 95.8% of the variance. Key findings include:
Use unrotated scores if you need maximum variance preservation
Use rotated scores if you need interpretability of components
# Factor scores for the first two components
scores = pca_rotated$scores
PCA1 = scores[,1]
PCA2 = scores[,2]
iris_scores = bind_cols(iris, PCA1 = PCA1, PCA2 = PCA2)
# Display first few rows of scores
head(iris_scores)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species PCA1
## se_1 5.1 3.5 1.4 0.2 setosa -1.0834754
## se_2 4.9 3.0 1.4 0.2 setosa -1.3775358
## se_3 4.7 3.2 1.3 0.2 setosa -1.4198321
## se_4 4.6 3.1 1.5 0.2 setosa -1.4716069
## se_5 5.0 3.6 1.4 0.2 setosa -1.0952963
## se_6 5.4 3.9 1.7 0.4 setosa -0.6336551
## PCA2
## se_1 0.9067262
## se_2 -0.2648876
## se_3 0.1165198
## se_4 -0.1474634
## se_5 1.0949536
## se_6 1.8641036
You can now continue with cluster analysis. PCA is usually done before cluster analysis.