Introduction
We will go through a simple example of Correspondence Analysis in R
on the HairEyeColor dataset.
Correspondence Analysis (CA) is a multivariate statistical technique
used to analyze and visualize the relationships between categorical
variables in a contingency table. It reduces the dimensionality of the
data, representing the associations between rows and columns in a
low-dimensional space, typically two dimensions, for easier
interpretation. CA uses the chi-squared distance between rows or columns to
highlight associations. The total chi-squared statistic of the table, which
CA decomposes along its dimensions, is:
\[
\chi^{2} = \sum_{i = 1}^{n}\sum_{j = 1}^{m} \frac{(n_{ij} -
Np_{ij})^{2}}{Np_{ij}}
\]
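where \(n_{ij}\) is the observed frequency in cell \((i, j)\), \(N\) is
the total number of observations, and \(Np_{ij}\) is the expected
frequency under the independence assumption, with \(p_{ij}\) the product
of the row and column marginal proportions.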
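As a minimal sketch, the statistic above can be computed by hand in base R and checked against `chisq.test()`. The variable names (`tab`, `expected`, `chi2`, `total_inertia`) are illustrative assumptions, and the table is collapsed over sex so that it is two-way.

```r
# Collapse the three-way table over Sex to get the 4 x 4 Hair x Eye table
tab <- margin.table(HairEyeColor, margin = c(1, 2))

N <- sum(tab)                                      # total number of observations
expected <- outer(rowSums(tab), colSums(tab)) / N  # N * p_ij under independence

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected
chi2 <- sum((tab - expected)^2 / expected)

# Total inertia decomposed by CA is the statistic divided by N
total_inertia <- chi2 / N

# Cross-check against base R's built-in test
chi2_ref <- unname(chisq.test(tab)$statistic)
c(by_hand = chi2, chisq_test = chi2_ref, total_inertia = total_inertia)
```

The two chi-squared values should agree, and the total inertia is the quantity that CA splits across its dimensions.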
HairEyeColor dataset
It is a three-way contingency table of 592 observations cross-classified
by 3 variables, namely ‘Hair’, ‘Eye’ and ‘Sex’.
# load libraries and dataset
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.1
library(ca)
## Warning: package 'ca' was built under R version 4.4.1
library(viridis)
## Warning: package 'viridis' was built under R version 4.4.1
## Loading required package: viridisLite
## Warning: package 'viridisLite' was built under R version 4.4.1
data(HairEyeColor)
head(HairEyeColor)
## , , Sex = Male
##
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
##
## , , Sex = Female
##
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8
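The structure of the table can also be checked programmatically. The following is a small sketch using base R only; the names `dims`, `n_obs` and `by_sex` are illustrative.

```r
# HairEyeColor ships with base R (package 'datasets')
dims <- dim(HairEyeColor)        # number of levels of Hair, Eye and Sex
dimnames(HairEyeColor)           # level names of each variable

n_obs <- sum(HairEyeColor)       # total number of observations

# One-way margin: number of observations per sex
by_sex <- margin.table(HairEyeColor, margin = 3)
by_sex
```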
Summary
To perform a correspondence analysis, we call the function ‘ca()’ from
the package ‘ca’ on a two-way contingency table. Here the data are
collapsed over ‘Sex’ to obtain a 4 x 4 table of hair color by eye color.
The summary of our analysis is displayed below.
# contingency table for hair and eye color
hair_eye = as.data.frame(HairEyeColor)
contingency_table = xtabs(Freq ~ Hair + Eye, data=hair_eye)
# correspondence analysis
ca_result = ca(contingency_table)
ca_result
##
##  Principal inertias (eigenvalues):
##            1        2        3
## Value      0.208773 0.022227 0.002598
## Percentage 89.37%   9.52%    1.11%
##
##
##  Rows:
##             Black     Brown       Red    Blond
## Mass     0.182432  0.483108  0.119932 0.214527
## ChiDist  0.551192  0.159461  0.354770 0.838397
## Inertia  0.055425  0.012284  0.015095 0.150793
## Dim. 1  -1.104277 -0.324463 -0.283473 1.828229
## Dim. 2   1.440917 -0.219111 -2.144015 0.466706
##
##
##  Columns:
##             Brown     Blue     Hazel     Green
## Mass     0.371622 0.363176  0.157095  0.108108
## ChiDist  0.500487 0.553684  0.288654  0.385727
## Inertia  0.093086 0.111337  0.013089  0.016085
## Dim. 1  -1.077128 1.198061 -0.465286  0.354011
## Dim. 2   0.592420 0.556419 -1.122783 -2.274122
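To see where the principal inertias printed above come from, the following sketch reproduces them in base R through a singular value decomposition of the standardized residuals. This is an illustration of the standard CA computation under stated assumptions, not the internal code of the ‘ca’ package. All variable names are hypothetical, and the signs of the coordinates may be flipped relative to the ‘ca’ output because SVD vectors are only defined up to sign.

```r
# Hair x Eye table, collapsed over Sex
tab <- margin.table(HairEyeColor, margin = c(1, 2))

P <- unclass(tab) / sum(tab)   # correspondence matrix (relative frequencies)
r <- rowSums(P)                # row masses
c_mass <- colSums(P)           # column masses

# Standardized residuals: D_r^(-1/2) (P - r c') D_c^(-1/2)
S <- diag(1 / sqrt(r)) %*% (P - outer(r, c_mass)) %*% diag(1 / sqrt(c_mass))

dec <- svd(S)

# Squared singular values are the principal inertias; a 4 x 4 table has
# at most 3 non-trivial dimensions (the 4th singular value is ~0)
inertias <- dec$d[1:3]^2
pct <- 100 * inertias / sum(inertias)

# Row standard coordinates on the first two dimensions
row_std <- diag(1 / sqrt(r)) %*% dec$u[, 1:2]
rownames(row_std) <- rownames(P)

round(rbind(Value = inertias, Percentage = pct), 4)
row_std
```

If the ‘ca’ package is loaded, `inertias` can be compared with `ca_result$sv^2`.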
Plot of factors
We plot the hair (row) and eye (column) categories in the first two
dimensions, using the standard coordinates returned by ‘ca()’.
# coordinates of rows and columns
row_coords = as.data.frame(ca_result$rowcoord)
col_coords = as.data.frame(ca_result$colcoord)
# labels for plotting
row_coords$Hair = rownames(row_coords)
col_coords$Eye = rownames(col_coords)
# Plot the correspondence analysis results using ggplot2
# (circles = hair colors, triangles = eye colors)
ggplot() +
  geom_point(data=row_coords, aes(x=Dim1, y=Dim2, color=Hair), size=4) +
  geom_point(data=col_coords, aes(x=Dim1, y=Dim2, color=Eye), size=4, shape=17) +
  geom_text(data=row_coords, aes(x=Dim1, y=Dim2, label=Hair), vjust=-1, hjust=-0.5) +
  geom_text(data=col_coords, aes(x=Dim1, y=Dim2, label=Eye), vjust=-1, hjust=1.5) +
  # the points are mapped to 'color', so the color scale (not fill) applies
  scale_color_viridis(discrete=TRUE, option="rocket") +
  labs(title = 'Correspondence analysis',
       subtitle = 'HairEyeColor data',
       y="Dimension 2", x="Dimension 1", color="Category") +
  theme(axis.text=element_text(size=8),
        axis.title=element_text(size=8),
        plot.subtitle=element_text(size=9, face="italic", color="darkred"),
        panel.background = element_rect(fill = "white", colour = "grey50"),
        panel.grid.major = element_line(colour = "grey90"))

Main observations
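Dimension Reduction: The relationships between hair color and eye
color are represented in a lower-dimensional space. The plot shows the
first two dimensions, which together account for 98.89% of the total
inertia (89.37% and 9.52%).
Association Visualization: Points close to each other in the plot
indicate a stronger association between the corresponding hair and eye
colors. For example, “Black Hair” and “Brown Eyes” lie close together on
the negative side of Dimension 1, which suggests that they are frequently
observed together. Likewise, “Blond Hair” and “Blue Eyes” both lie on the
positive side.
Dimensional Interpretation: The axes (Dimension 1 and Dimension 2)
represent the principal dimensions that capture the most inertia (the CA
analogue of variance) in the data. Dimension 1 mainly contrasts blond
hair and blue eyes with black hair and brown eyes.
Categorical Differentiation: The plot visually differentiates between
hair and eye colors using different shapes (circles for hair, triangles
for eyes) and colors, making it easy to interpret the correspondence
between categories.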
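The associations read from the plot can be cross-checked against the row profiles, that is, the distribution of eye colors within each hair color. This is a minimal base R sketch; `row_profiles` and `avg_profile` are illustrative names.

```r
# Hair x Eye table, collapsed over Sex
tab <- margin.table(HairEyeColor, margin = c(1, 2))

# Row profiles: share of each eye color within each hair color (rows sum to 1)
row_profiles <- prop.table(tab, margin = 1)

# Average profile: overall share of each eye color (the column masses)
avg_profile <- colSums(tab) / sum(tab)

round(row_profiles, 2)
round(avg_profile, 2)
```

A hair color whose profile is far above the average for some eye color is positively associated with it. For instance, about 74% of blond-haired observations have blue eyes (94 of 127) and about 63% of black-haired observations have brown eyes (68 of 108), while the overall shares of blue and brown eyes are both roughly 36 to 37%.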
References
An Introduction to Applied Multivariate Analysis with R, 2011, B.
Everitt, T. Hothorn, Springer, e-ISBN 978-1-4419-9650-3
The R Project for Statistical Computing: https://www.r-project.org/