Correspondence analysis (CA), also called reciprocal averaging, is a data science visualization technique for discovering and displaying the relationships between categories. It is a multivariate statistical method, first proposed in 1935 by Herman Otto Hartley (then publishing as H. O. Hirschfeld), that plots the row and column categories of a table as points on a graph, so that associations between them can be seen visually.
It is conceptually similar to principal component analysis, but applies to categorical rather than continuous data. Like principal component analysis, it provides a means of displaying or summarizing a set of data in two-dimensional graphical form.
A correspondence analysis starts from a contingency table (a table of frequencies showing how the observations are distributed across the categories of the variables). The counts in the table are transformed relative to their row and column totals to produce relational data. The resulting data are then plotted to show those relationships visually.
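The series of transformations mentioned above can be sketched in base R. This is a minimal illustration of the standard CA computation (correspondence matrix, masses, standardized residuals, singular value decomposition), not the code used by any particular package. The 3 x 3 table `N` and all object names are made up for the example.

```r
# Hypothetical 3 x 3 contingency table, for illustration only
N <- matrix(c(20, 10,  5,
              10, 15, 10,
               5, 10, 20),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2", "r3"), c("c1", "c2", "c3")))

P      <- N / sum(N)    # correspondence matrix (relative frequencies)
r_mass <- rowSums(P)    # row masses
c_mass <- colSums(P)    # column masses

# Standardized residuals: departure from independence, scaled by the masses
S <- diag(1 / sqrt(r_mass)) %*% (P - r_mass %o% c_mass) %*% diag(1 / sqrt(c_mass))

dec <- svd(S)           # singular value decomposition of the residuals

# Principal coordinates of the rows and columns on the first two axes
row_pc <- (diag(1 / sqrt(r_mass)) %*% dec$u %*% diag(dec$d))[, 1:2]
col_pc <- (diag(1 / sqrt(c_mass)) %*% dec$v %*% diag(dec$d))[, 1:2]
rownames(row_pc) <- rownames(N)
rownames(col_pc) <- colnames(N)

# Total inertia: sum of squared singular values (chi-square statistic / n)
total_inertia <- sum(dec$d^2)
```

Plotting `row_pc` and `col_pc` on the same axes gives the kind of map that the `ca` package draws automatically.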
PART(A)
Description of the data
The “smoke” data set contains 5 rows (staff groups) and 4 columns (smoking categories), giving the frequencies of the smoking categories in each staff group of a fictional organization.
   none light medium heavy
SM    4     2      3     2
JM    4     3      7     4
SE   25    10     12     4
JE   18    24     33    13
SC   10     6      7     2
SM: senior managers
JM: junior managers
SE: senior employees
JE: junior employees
SC: secretaries
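For readers who want to work with the table directly, it can be entered by hand as a matrix. The sketch below does that and computes the row profiles (the share of each smoking category within a staff group), which are the quantities CA compares. The object names `smoke_tab`, `row_profiles` and `indep_test` are chosen for this example only.

```r
# The smoke table from above, entered by hand
smoke_tab <- matrix(c( 4,  2,  3,  2,
                       4,  3,  7,  4,
                      25, 10, 12,  4,
                      18, 24, 33, 13,
                      10,  6,  7,  2),
                    nrow = 5, byrow = TRUE,
                    dimnames = list(c("SM", "JM", "SE", "JE", "SC"),
                                    c("none", "light", "medium", "heavy")))

# Row profiles: each row divided by its total
row_profiles <- prop.table(smoke_tab, margin = 1)
print(round(row_profiles, 2))

# Chi-square test of independence between staff group and smoking category.
# Warnings are suppressed because some expected counts are small.
indep_test <- suppressWarnings(chisq.test(smoke_tab))
print(indep_test)
```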
library(ca)
library(ggrepel)
library(FactoMineR)
library(factoextra)
library(ade4)
data("smoke")          # smoke data set shipped with the ca package
ca_smoke <- ca(smoke)  # fit the simple CA (this step is not shown in the original)
plot(ca_smoke, mass = TRUE, contrib = "absolute", map = "rowgreen", arrows = c(FALSE, TRUE))
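According to the plot above, junior managers are associated with heavy smoking, while junior employees tend to be medium smokers. Non-smoking is most strongly associated with senior employees (25 of 51 are non-smokers). Senior managers are a small group (11 people, 4 of them non-smokers), so their position on the map should be read with caution.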
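The reading of the map can be cross-checked numerically. The sketch below assumes the `ca` package is installed and uses components of the fitted object (`sv`, `rowcoord`) to show how much of the total inertia each axis carries and where the staff groups fall in principal coordinates. The names `inertia`, `pct` and `row_pc` are introduced here for illustration.

```r
library(ca)

data("smoke")
ca_smoke <- ca(smoke)

# Principal inertias (squared singular values) and their share of the total
inertia <- ca_smoke$sv^2
pct     <- round(100 * inertia / sum(inertia), 1)
print(pct)

# Row principal coordinates: standard coordinates scaled by the singular values
k      <- ncol(ca_smoke$rowcoord)
row_pc <- sweep(ca_smoke$rowcoord, 2, ca_smoke$sv[seq_len(k)], "*")
rownames(row_pc) <- ca_smoke$rownames
print(round(row_pc[, 1:2], 3))
```

If the first axis carries most of the inertia, positions along the horizontal direction of the plot matter far more than positions along the vertical one.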
library(rgl)
plot3d.ca(ca(smoke, nd = 3))
PART(B)
Multiple Correspondence Analysis (MCA)
Multiple Correspondence Analysis (MCA) is an extension of simple CA of a single cross-tabulation to more than two categorical variables.
Description of the data
This data frame (“wg93”) contains the responses to four questions on attitudes towards science, recorded on a five-point scale (1 = agree strongly to 5 = disagree strongly), and three demographic variables (sex, age and education).
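The call that fits the model is not shown in the text. A plausible sketch, assuming `wg93` is taken from the `ca` package and all seven variables are treated as active, is given below. The object name `MCA_wg93` matches the one used in the later code, and `graph = FALSE` simply suppresses the default plots.

```r
library(ca)          # provides the wg93 data set
library(FactoMineR)  # provides MCA()

data("wg93")
str(wg93)            # 871 observations, 7 factor variables

# Fit the MCA on all seven variables without drawing the default plots
MCA_wg93 <- MCA(wg93, graph = FALSE)
print(MCA_wg93)
```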
**Results of the Multiple Correspondence Analysis (MCA)**
The analysis was performed on 871 individuals, described by 7 variables
*The results are available in the following objects:
name description
1 "$eig" "eigenvalues"
2 "$var" "results for the variables"
3 "$var$coord" "coord. of the categories"
4 "$var$cos2" "cos2 for the categories"
5 "$var$contrib" "contributions of the categories"
6 "$var$v.test" "v-test for the categories"
7 "$ind" "results for the individuals"
8 "$ind$coord" "coord. for the individuals"
9 "$ind$cos2" "cos2 for the individuals"
10 "$ind$contrib" "contributions of the individuals"
11 "$call" "intermediate results"
12 "$call$marge.col" "weights of columns"
13 "$call$marge.li" "weights of rows"
The proportion of variances retained by the different dimensions (axes) can be extracted using the function get_eigenvalue()
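A short sketch of that extraction is shown below. It refits the model so that it runs on its own, again assuming `wg93` from the `ca` package and all variables active. `eig_val` is a name chosen for this example.

```r
library(ca)
library(FactoMineR)
library(factoextra)

data("wg93")
MCA_wg93 <- MCA(wg93, graph = FALSE)

# Eigenvalue, percentage of variance and cumulative percentage per dimension
eig_val <- get_eigenvalue(MCA_wg93)
print(head(round(eig_val, 3)))
```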
According to the result above, no single dimension explains a substantial share of the total variance. This can be seen clearly in a scree plot, which can be drawn with the function “fviz_screeplot()”.
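A scree plot along these lines could be drawn as follows. The model is refitted so the snippet is self-contained. `addlabels = TRUE` prints the percentage on top of each bar, and `scree` is a name chosen for this example.

```r
library(ca)
library(FactoMineR)
library(factoextra)

data("wg93")
MCA_wg93 <- MCA(wg93, graph = FALSE)

# Bar plot of the percentage of variance retained by each dimension
scree <- fviz_screeplot(MCA_wg93, addlabels = TRUE)
print(scree)
```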
The function “fviz_mca_biplot()” is used to draw the biplot of individuals and variable categories
fviz_mca_biplot(MCA_wg93,
                repel = TRUE,  # Avoid text overlapping (slow if many points)
                ggtheme = theme_minimal())
Warning: ggrepel: 808 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
The plot above shows a global pattern within the data. Rows (individuals) are represented by blue points and columns (variable categories) by red triangles. The distance between any two row points or any two column points gives a measure of their similarity (or dissimilarity). Row points with similar profiles are close together on the factor map. The same holds true for column points.
The results for the variable categories can be extracted with the code below.
var <- get_mca_var(MCA_wg93)
var
Multiple Correspondence Analysis Results for variables
===================================================
Name Description
1 "$coord" "Coordinates for categories"
2 "$cos2" "Cos2 for categories"
3 "$contrib" "contributions of categories"
head(var$coord)
Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
A_1 -0.9671597 0.82273703 0.20937924 -0.42460797 0.07117658
A_2 -0.4260310 -0.07132741 -0.08186277 0.15141118 -0.05311328
A_3 0.1803910 -0.84426258 -0.06866201 -0.43681468 0.30997012
A_4 0.7686636 0.22766291 0.29174984 0.48798870 -0.66303991
A_5 1.6385852 1.18265182 -0.76001539 0.08379486 1.32124301
B_1 -1.4660261 1.39035983 0.25123381 -0.73875510 -0.14521786
The plot above helps to identify variables that are the most correlated with each dimension. The squared correlations between variables and the dimensions are used as coordinates.
It can be seen that the variables sex, age and education are the most correlated with dimension 1. Similarly, the variable recording the answers to question D is the most correlated with dimension 2.
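The code for this correlation plot is not shown. One way to produce a plot of this kind, assuming factoextra's `choice = "mca.cor"` option is what was used, is sketched below. The squared correlation ratios it displays are stored by FactoMineR in `MCA_wg93$var$eta2`. The names `cor_plot` and `eta2` are chosen for this example.

```r
library(ca)
library(FactoMineR)
library(factoextra)
library(ggplot2)

data("wg93")
MCA_wg93 <- MCA(wg93, graph = FALSE)

# Variables plotted by their squared correlation with dimensions 1 and 2
cor_plot <- fviz_mca_var(MCA_wg93, choice = "mca.cor",
                         repel = TRUE, ggtheme = theme_minimal())
print(cor_plot)

# The same information as numbers: one row per variable
eta2 <- MCA_wg93$var$eta2
print(round(eta2[, 1:2], 2))
```

Comparing the printed values with the plot is a quick way to confirm which variables drive each dimension.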
head(round(var$coord, 2), 4)
Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
A_1 -0.97 0.82 0.21 -0.42 0.07
A_2 -0.43 -0.07 -0.08 0.15 -0.05
A_3 0.18 -0.84 -0.07 -0.44 0.31
A_4 0.77 0.23 0.29 0.49 -0.66
fviz_mca_var(MCA_wg93,
             repel = TRUE,  # Avoid text overlapping (slow)
             ggtheme = theme_minimal())