Simple Correspondence Analysis (CA)

The following R commands perform a simple CA on Fisher’s (1940) data on eye color (rows) and hair color (columns) of people in Caithness, Scotland. The data are saved in the file “fisher.csv”, which is read next.

temp.f <- read.csv("~/Desktop/碩一下/多變量/fisher.csv")
## Warning in read.table(file = file, header = header, sep = sep, quote =
## quote, : incomplete final line found by readTableHeader on '~/Desktop/碩一
## 下/多變量/fisher.csv'
fisher <- temp.f[,-1]
rownames(fisher) <- temp.f[,1]

Then apply CA to the two-way table using the R package “ca”.

library(ca)

Let us request a 2-D solution by choosing nd = 2 and check the summary. (Recall that the maximal dimensionality of the solution is min(4-1, 5-1) = 3.)

fisher.ca <- ca(fisher, nd=2)
fisher.ca
## 
##  Principal inertias (eigenvalues):
##            1        2        3       
## Value      0.199245 0.030087 0.000859
## Percentage 86.56%   13.07%   0.37%   
## 
## 
##  Rows:
##              blue     light    medium     dark
## Mass     0.133284  0.293299  0.329311 0.244106
## ChiDist  0.437855  0.450620  0.247359 0.715398
## Inertia  0.025553  0.059557  0.020149 0.124932
## Dim. 1  -0.896793 -0.987318  0.075306 1.574347
## Dim. 2   0.953623  0.510004 -1.412478 0.772036
## 
## 
##  Columns:
##              fair       red    medium     dark    black
## Mass     0.270095  0.053091  0.396696 0.258214 0.021905
## ChiDist  0.571235  0.265854  0.212526 0.597901 1.132193
## Inertia  0.088134  0.003752  0.017918 0.092308 0.028079
## Dim. 1  -1.218714 -0.522575 -0.094147 1.318885 2.451760
## Dim. 2   1.002243  0.278336 -1.200909 0.599292 1.651357

Note that each “Value” (or eigenvalue) here is the square of the corresponding singular value obtained from the SVD (see our slides). The first dimension explains 0.199245/(0.199245+0.030087+0.000859) = 86.56% of the total inertia, and the second dimension explains most of the remainder, about 13%.
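The link between the eigenvalues and the SVD can be checked by hand. The following is a minimal sketch in base R, assuming that the data set MASS::caith (shipped with R's recommended MASS package) holds the same 4 x 5 table as fisher.csv; the variable names are illustrative only.

```r
# Assumption: MASS::caith is the same eye-by-hair table as fisher.csv.
N <- as.matrix(MASS::caith)
P <- N / sum(N)                 # correspondence matrix
r <- rowSums(P)                 # row masses
c_mass <- colSums(P)            # column masses

# Standardized residuals: D_r^{-1/2} (P - r c') D_c^{-1/2}
S <- (P - outer(r, c_mass)) / sqrt(outer(r, c_mass))

# At most min(I - 1, J - 1) singular values are nonzero
sv <- svd(S)$d[seq_len(min(dim(N)) - 1)]
ev <- sv^2                      # principal inertias (eigenvalues)
pct <- round(100 * ev / sum(ev), 2)
print(rbind(Value = round(ev, 6), Percentage = pct))
```

The sum of the eigenvalues is the total inertia, which equals the Pearson chi-square statistic of the table divided by the sample size.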

The output of the 2D solution is given by:

plot(fisher.ca)

It should be mentioned that in the default 2D plot the row and column scores are scaled to have weighted variances equal to the principal inertia (i.e. total inertia divided by N). This means that the square root of N is taken from the row and column scores, i.e., the scores in the plot correspond to the notation (^X, ^Y) shown on the course slide. With the 2D plot one can clearly explain the relationships between row and row, column and column, and row and column.

Note:

  1. The function “plot3d.ca”, which is included in the package “ca” and requires the package “rgl”, can be used to plot a 3D solution of the CA. (Unable to load…)

  2. With the option map = “symbiplot” in the plot function, one can produce a similar plot in which both row and column scores are scaled to have variances equal to the singular values (square roots of the eigenvalues). However, this scaling no longer preserves the row or column metrics (i.e. the Euclidean distances in the plot are not equivalent to chi-square distances).
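The different scalings can be compared numerically. The sketch below is an illustration in base R, not the internals of the “ca” package; it again assumes that MASS::caith is the same table as fisher.csv, and the helper wvar is a hypothetical name introduced here.

```r
# Assumption: MASS::caith is the same eye-by-hair table as fisher.csv.
N <- as.matrix(MASS::caith)
P <- N / sum(N)
r <- rowSums(P)
c_mass <- colSums(P)
S <- (P - outer(r, c_mass)) / sqrt(outer(r, c_mass))
dec <- svd(S)
k <- 1:2                                       # keep a 2-D solution
sv <- dec$d[k]

row_std <- dec$u[, k] / sqrt(r)                # standard coordinates
rownames(row_std) <- rownames(N)
row_prin <- sweep(row_std, 2, sv, "*")         # principal coordinates (default map)
row_symb <- sweep(row_std, 2, sqrt(sv), "*")   # symbiplot scaling

# Weighted variance per dimension (the weighted means are zero)
wvar <- function(X) colSums(r * X^2)
print(rbind(standard  = wvar(row_std),
            principal = wvar(row_prin),
            symbiplot = wvar(row_symb)))
```

The three rows of the printed matrix should equal 1, the eigenvalues, and the singular values, respectively. Up to an arbitrary sign per dimension, row_std corresponds to the “Dim. 1” and “Dim. 2” rows printed for the rows of the table above.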

Multiple Correspondence Analysis (MCA)

The data are saved in the file “mammal.csv”, which is read next.

temp.m <- read.csv("~/Desktop/碩一下/多變量/mammal.csv")
mammals <- temp.m[,-1]
rownames(mammals) <- temp.m[,1]

Then apply MCA to the data by starting with a 2-D solution:

mammals.mca<-mjca(mammals, nd=2, lambda="Burt") 
mammals.mca
## 
##  Eigenvalues:
##            1        2        3        4       5        6        7       
## Value      0.536655 0.144377 0.075683 0.04791 0.030939 0.014984 0.012649
## Percentage 60.5%    16.28%   8.53%    5.4%    3.49%    1.69%    1.43%   
##            8       9        10       11       12       13      14      
## Value      0.01072 0.006867 0.004247 0.000751 0.000728 0.00023 0.000162
## Percentage 1.21%   0.77%    0.48%    0.08%    0.08%    0.03%   0.02%   
##            15       16      17      18    19
## Value      0.000102 6.6e-05 1.6e-05 3e-06 0 
## Percentage 0.01%    0.01%   0%      0%    0%
## 
## 
##  Columns:
##            TI:TI1    TI:TI2    TI:TI3    TI:TI4    BI:BI1    BI:BI2
## Mass     0.018939  0.039773  0.017045  0.049242  0.003788  0.037879
## ChiDist  1.397021  1.002364  1.058524  0.921204  2.195275  1.097299
## Inertia  0.036963  0.039961  0.019099  0.041788  0.018255  0.045609
## Dim. 1  -0.747253 -1.183301  0.081736  1.214855 -0.385320 -1.327965
## Dim. 2   3.237438 -1.205632 -0.208552 -0.199198  0.960940 -1.239952
##           BI:BI3    BI:BI4    BI:BI5    TC:TC1    TC:TC2    BC:BC1
## Mass    0.009470  0.054924  0.018939  0.051136  0.073864  0.056818
## ChiDist 1.500927  0.757364  1.387976  0.941188  0.651592  0.870214
## Inertia 0.021333  0.031505  0.036486  0.045298  0.031360  0.043027
## Dim. 1  1.192217  0.937279 -0.581224 -1.267129  0.877243 -1.158651
## Dim. 2  0.193445 -0.392662  3.329713  0.080205 -0.055526  0.343329
##            BC:BC2    TP:TP1    TP:TP2    TP:TP3   TP:TP4    TP:TP5
## Mass     0.068182  0.011364  0.013258  0.022727 0.049242  0.028409
## ChiDist  0.725179  1.844593  1.458572  1.037330 0.699550  1.157696
## Inertia  0.035856  0.038665  0.028205  0.024456 0.024098  0.038076
## Dim. 1   0.965542 -1.553898 -1.305056 -0.754832 0.331747  1.259423
## Dim. 2  -0.286108 -0.649563 -1.448980 -0.988324 1.602172 -1.050422
##            BP:BP1    BP:BP2    BP:BP3   BP:BP4    BP:BP5    TM:TM1
## Mass     0.011364  0.022727  0.018939 0.045455  0.026515  0.043561
## ChiDist  1.844593  1.383880  0.999211 0.724795  1.179755  0.990776
## Inertia  0.038665  0.043526  0.018910 0.023879  0.036904  0.042761
## Dim. 1  -1.553898 -1.493375  0.359132 0.255901  1.250782  1.287239
## Dim. 2  -0.649563 -1.471432 -0.460603 1.736672 -1.108538 -0.275537
##            TM:TM2    BM:BM1    BM:BM2
## Mass     0.081439  0.039773  0.085227
## ChiDist  0.529950  1.010237  0.471444
## Inertia  0.022872  0.040591  0.018943
## Dim. 1  -0.688523  1.292350 -0.603097
## Dim. 2   0.147380 -0.197040  0.091952

First, the maximum number of dimensions is 27 (total number of categories) - 8 (number of variables) = 19. The first two dimensions explain 60.5% + 16.28% = 76.78% of the total inertia, which indicates a fairly good fit. Let us look at the resulting 2D plot of all categories projected on the principal axes:
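The arithmetic above can be reproduced as follows. This is a small sketch in which the number of categories per variable is read off the column labels in the printed output (TI1 to TI4, BI1 to BI5, and so on); the object names are illustrative.

```r
# Categories per variable, taken from the column labels printed above
levels_per_var <- c(TI = 4, BI = 5, TC = 2, BC = 2,
                    TP = 5, BP = 5, TM = 2, BM = 2)

J <- sum(levels_per_var)       # total number of categories
Q <- length(levels_per_var)    # number of variables
max_dim <- J - Q               # maximum number of dimensions
print(c(J = J, Q = Q, max_dim = max_dim))

# Cumulative percentage of inertia for the first two dimensions
pct <- c(60.5, 16.28)
print(cumsum(pct))
```

With the data frame at hand, the same counts could be obtained from the data, for example with sapply(mammals, function(x) length(unique(x))).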

plot(mammals.mca)

It is also interesting to include the objects in the plot:

plot(mammals.mca, what = c("all", "all"), col=c("blue","red"))

It is possible to make a 3D plot that displays the frequency at each object location (the same location can hold several different mammals). There are 27 distinct object locations in total; check the frequency at each location. I would suggest using Matlab to construct an informative 3-D plot of the categories and object locations, with the height given by the frequency at each object location. (Can you do this in R?)
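As a partial answer in R, the frequencies can be obtained by counting identical category profiles, since objects with the same categories on every variable are plotted at the same point. The sketch below uses a small hypothetical data frame named toy, not the mammals data; replacing toy with mammals would give the frequencies for the data above.

```r
# Hypothetical toy data: 5 objects, 2 categorical variables
toy <- data.frame(
  V1 = c("a", "a", "b", "a", "b"),
  V2 = c("x", "x", "y", "x", "x"),
  row.names = c("obj1", "obj2", "obj3", "obj4", "obj5")
)

# One string per object describing its full category profile
profile <- do.call(paste, c(toy, sep = "|"))

freq <- table(profile)                      # objects per distinct location
members <- split(rownames(toy), profile)    # which objects share a location

print(freq)
print(members)
```

The resulting frequencies could then serve as the heights in a 3-D plot over the 2-D object coordinates.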

Also, how do you interpret the result based on the 2D plot?

Note: To avoid overestimating the total inertia, one can perform Joint Correspondence Analysis (JCA; Greenacre, 1988), which uses the information in the off-diagonal blocks of the Burt table (JCA finds the optimal weighted least-squares fit to the off-diagonal tables).
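To see why the diagonal blocks matter, it may help to build a Burt table by hand. The following minimal sketch uses a hypothetical two-variable data frame named toy and a hypothetical helper named indicator; it is not how “mjca” computes the table internally.

```r
# Hypothetical toy data: 6 objects, 2 categorical variables
toy <- data.frame(
  A = factor(c("a1", "a2", "a1", "a2", "a1", "a1")),
  B = factor(c("b1", "b1", "b2", "b3", "b2", "b1"))
)

# Indicator (dummy) matrix for one factor, with labels like "A:a1"
indicator <- function(f, name) {
  z <- outer(as.character(f), levels(f), "==") * 1
  colnames(z) <- paste(name, levels(f), sep = ":")
  z
}

Z <- do.call(cbind, unname(Map(indicator, toy, names(toy))))
burt <- crossprod(Z)    # Burt table: Z'Z
print(burt)
```

The diagonal blocks of the Burt table are diagonal matrices holding the category counts of each variable, while the off-diagonal blocks are the two-way contingency tables between pairs of variables. Only the latter describe associations between variables, which is why JCA fits the off-diagonal blocks only.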

The following command implements the JCA:

mammals.jca<-mjca(mammals, nd=2, lambda="JCA") 

plot(mammals.jca)

plot(mammals.jca, what = c("all", "all"), col=c("blue","red"))