Introduction

We will go through a simple example of Correspondence Analysis in R on the HairEyeColor dataset.

Correspondence Analysis (CA) is a multivariate statistical technique used to analyze and visualize the relationships between categorical variables in a contingency table. It reduces the dimensionality of the data, representing the associations between rows and columns in a low-dimensional space, typically two dimensions, for easier interpretation. CA relies on the chi-squared distance between row or column profiles to highlight associations, and is closely tied to the Pearson chi-squared statistic of the table:

\[ \chi^{2} = \sum_{i = 1}^{n}\sum_{j = 1}^{m} \frac{(n_{ij} - Np_{ij})^{2}}{Np_{ij}} \]

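where \(n_{ij}\) is the observed frequency in cell \((i, j)\), \(N\) is the grand total of the table, and \(Np_{ij}\) is the expected frequency under the independence assumption, with \(p_{ij}\) taken as the product of the row and column marginal proportions. The limits \(n\) and \(m\) are the numbers of rows and columns.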
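As a rough check of the formula, the statistic can be computed by hand in base R. This is only a sketch: the variable names (`tab`, `expected`, `chi2`, `total_inertia`) are illustrative, and the Hair by Eye table is obtained here by summing the built-in `HairEyeColor` table over Sex.

```r
# Hair x Eye table: sum the built-in three-way table over Sex
tab <- apply(HairEyeColor, c(1, 2), sum)
N <- sum(tab)  # grand total of the table

# Expected counts under independence: row total * column total / N
expected <- outer(rowSums(tab), colSums(tab)) / N

# Pearson chi-squared statistic, term by term as in the formula
chi2 <- sum((tab - expected)^2 / expected)

# The total inertia decomposed by CA is the statistic divided by N
total_inertia <- chi2 / N

print(round(chi2, 2))
print(round(total_inertia, 4))
```

The value of `total_inertia` should agree with the sum of the principal inertias that `ca()` reports in the Summary section.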

HairEyeColor dataset

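It is a three-way contingency table (4 x 4 x 2) that cross-classifies 592 observations by three variables, namely ‘Hair’, ‘Eye’ and ‘Sex’. Each cell holds a count rather than an individual record.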
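The structure can be inspected with a few base R calls. This is a small sketch; `hair_by_eye` is an illustrative name for the two-way table obtained by summing over Sex.

```r
# HairEyeColor ships with base R (datasets package)
print(dim(HairEyeColor))              # sizes of the Hair, Eye and Sex dimensions
print(names(dimnames(HairEyeColor)))  # variable names
print(sum(HairEyeColor))              # total number of people counted

# Two-way Hair x Eye table, summing the counts over Sex
hair_by_eye <- margin.table(HairEyeColor, c(1, 2))
print(hair_by_eye)
```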

# load libraries and dataset
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.1
library(ca)
## Warning: package 'ca' was built under R version 4.4.1
library(viridis)
## Warning: package 'viridis' was built under R version 4.4.1
## Loading required package: viridisLite
## Warning: package 'viridisLite' was built under R version 4.4.1
data(HairEyeColor)
head(HairEyeColor)
## , , Sex = Male
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    32   11    10     3
##   Brown    53   50    25    15
##   Red      10   10     7     7
##   Blond     3   30     5     8
## 
## , , Sex = Female
## 
##        Eye
## Hair    Brown Blue Hazel Green
##   Black    36    9     5     2
##   Brown    66   34    29    14
##   Red      16    7     7     7
##   Blond     4   64     5     8

Summary

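To perform a correspondence analysis, we call the function ‘ca()’ from the package ‘ca’ on a two-way contingency table. Since HairEyeColor is a three-way table, we first sum the counts over ‘Sex’ to obtain the Hair by Eye table. The printed result of the analysis is displayed below.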

#  contingency table for hair and eye color
hair_eye = as.data.frame(HairEyeColor)
contingency_table = xtabs(Freq ~ Hair + Eye, data=hair_eye)

# correspondence analysis
ca_result = ca(contingency_table)
ca_result
## 
##  Principal inertias (eigenvalues):
##            1        2        3       
## Value      0.208773 0.022227 0.002598
## Percentage 89.37%   9.52%    1.11%   
## 
## 
##  Rows:
##             Black     Brown       Red    Blond
## Mass     0.182432  0.483108  0.119932 0.214527
## ChiDist  0.551192  0.159461  0.354770 0.838397
## Inertia  0.055425  0.012284  0.015095 0.150793
## Dim. 1  -1.104277 -0.324463 -0.283473 1.828229
## Dim. 2   1.440917 -0.219111 -2.144015 0.466706
## 
## 
##  Columns:
##             Brown     Blue     Hazel     Green
## Mass     0.371622 0.363176  0.157095  0.108108
## ChiDist  0.500487 0.553684  0.288654  0.385727
## Inertia  0.093086 0.111337  0.013089  0.016085
## Dim. 1  -1.077128 1.198061 -0.465286  0.354011
## Dim. 2   0.592420 0.556419 -1.122783 -2.274122
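The principal inertias and masses in this printout can be reproduced with a singular value decomposition in base R. The sketch below follows the usual textbook construction of CA rather than the internals of `ca()`, and all variable names are illustrative.

```r
# Hair x Eye table, summing over Sex
tab <- apply(HairEyeColor, c(1, 2), sum)

P <- tab / sum(tab)      # correspondence matrix (cell proportions)
row_mass <- rowSums(P)   # row masses, the "Mass" line for Rows
col_mass <- colSums(P)   # column masses, the "Mass" line for Columns

# Standardized residuals: (p_ij - r_i * c_j) / sqrt(r_i * c_j)
indep <- outer(row_mass, col_mass)
S <- (P - indep) / sqrt(indep)

# Squared singular values are the principal inertias (eigenvalues)
dec <- svd(S)
inertia <- dec$d[1:3]^2
pct <- 100 * inertia / sum(inertia)

print(round(inertia, 6))
print(round(pct, 2))
print(round(row_mass, 6))
```

Only three dimensions carry inertia here because a 4 x 4 table has at most min(4, 4) - 1 = 3 non-trivial dimensions.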

Plot of factors

We plot the row (hair) and column (eye) categories on the first two dimensions, using the standard coordinates returned by ‘ca()’.

# coordinates of rows and columns
row_coords = as.data.frame(ca_result$rowcoord)
col_coords = as.data.frame(ca_result$colcoord)

# labels for plotting
row_coords$Hair = rownames(row_coords)
col_coords$Eye = rownames(col_coords)

# Plot the correspondence analysis results using ggplot2
ggplot() +
  geom_point(data=row_coords, aes(x=Dim1, y=Dim2, color=Hair), size=4) +
  geom_point(data=col_coords, aes(x=Dim1, y=Dim2, color=Eye), size=4, shape=17) +
  geom_text(data=row_coords, aes(x=Dim1, y=Dim2, label=Hair), vjust=-1, hjust=-0.5) +
  geom_text(data=col_coords, aes(x=Dim1, y=Dim2, label=Eye), vjust=-1, hjust=1.5) +
  scale_color_viridis(discrete=TRUE, option="rocket") +
  labs(title = 'Correspondence analysis',
       subtitle = 'HairEyeColor data',
       y="Dimension 2", x="Dimension 1") +
  theme(axis.text=element_text(size=8),
        axis.title=element_text(size=8),
        plot.subtitle=element_text(size=9, face="italic", color="darkred"),
        panel.background = element_rect(fill = "white", colour = "grey50"),
        panel.grid.major = element_line(colour = "grey90"))
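The plot uses standard coordinates, which are scaled to unit weighted variance on every axis. Another common choice is principal coordinates, where each axis is multiplied by its singular value so that distances between row points approximate chi-squared distances. The sketch below derives both in base R from an SVD, as an assumption-laden illustration rather than the code path of `ca()`. Variable names are illustrative, and the sign of each axis is arbitrary.

```r
# Hair x Eye table, summing over Sex
tab <- apply(HairEyeColor, c(1, 2), sum)
P <- tab / sum(tab)
row_mass <- rowSums(P)
col_mass <- colSums(P)

# SVD of the standardized residuals
indep <- outer(row_mass, col_mass)
dec <- svd((P - indep) / sqrt(indep))
k <- 1:3  # non-trivial dimensions of a 4 x 4 table

# Standard row coordinates: left singular vectors scaled by 1 / sqrt(mass)
row_std <- dec$u[, k] / sqrt(row_mass)
rownames(row_std) <- rownames(tab)

# Principal row coordinates: standard coordinates times singular values
row_principal <- sweep(row_std, 2, dec$d[k], `*`)

# Distance of each row profile from the average profile ("ChiDist")
chi_dist <- sqrt(rowSums(row_principal^2))

print(round(row_principal[, 1:2], 4))
print(round(chi_dist, 6))
```

Plotting `row_principal[, 1:2]` instead of the standard coordinates shrinks Dimension 2 relative to Dimension 1, reflecting how little inertia the second axis carries.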

Main observations

Dimension Reduction: CA represents the relationships between hair color and eye color in a lower-dimensional space. The plot shows the first two dimensions, which together account for 98.89% of the total inertia (89.37% + 9.52%).

Association Visualization: Points close to each other in the plot indicate a stronger association between the corresponding hair and eye colors. For example, “Black” hair and “Brown” eyes both lie on the negative side of Dimension 1 and are frequently observed together, while “Blond” hair and “Blue” eyes both lie on the positive side.
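These associations can be cross-checked against the raw counts by looking at row profiles, that is, the distribution of eye color within each hair color. This is a minimal base R sketch with illustrative variable names.

```r
# Hair x Eye table, summing over Sex
tab <- apply(HairEyeColor, c(1, 2), sum)

# Row profiles: each row divided by its total, so every row sums to 1
row_profiles <- prop.table(tab, margin = 1)

# Average profile: overall distribution of eye color
avg_profile <- colSums(tab) / sum(tab)

print(round(row_profiles, 2))
print(round(avg_profile, 2))
```

A hair color whose profile is far above the average in one column, such as “Blond” in the “Blue” column, is pulled toward that eye color on the map.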

Dimensional Interpretation: The axes (Dimension 1 and Dimension 2) are the principal dimensions that capture the most inertia, the CA counterpart of variance: 89.37% and 9.52% of the total, respectively.

Categorical Differentiation: The plot visually differentiates between hair and eye colors using different shapes and colors, making it easy to interpret the correspondence between categories.

References

An Introduction to Applied Multivariate Analysis with R, 2011, B. Everitt, T. Hothorn, Springer, e-ISBN 978-1-4419-9650-3

The R Project for Statistical Computing: https://www.r-project.org/