Assignment-06

Author

Hasantha

Correspondence Analysis(CA)

Correspondence analysis, also called reciprocal averaging, is a useful data science visualization technique for finding out and displaying the relationship between categories. It is a multivariate statistical tool that was first proposed in 1935 by Herman Otto Harley which uses a graph that plots data, visually showing the outcome of two or more data points.

It is conceptually similar to principle component analysis , but applies to categorical rather than continuous data. In a similar manner to principal component analysis, it provides a means of displaying or summarizing a set of data in two-dimensional graphical form.

A correspondence analysis uses a contingency table (a table of frequencies that shows how variables distribute categories). The data in the table undergoes a series of transformations in relation to the data around it to produce relational data. The resulting data is then graphed to show those relationships visually.

PART(A)

Description of the data

‘’smoke data set’’ which contains 5 rows (staff group) and 4 columns (smoking categories), giving the frequencies of smoking categories in each staff group in a fictional organization.

  none light medium heavy
SM 4 2 3 2
JM 4 3 7 4
SE 25 10 12 4
JE 18 24 33 13
SC 10 6 7 2

SM: senior managers

JM: junior managers

SE: senior employees

JE: junior employees

SC: secretaries

library(ca)
library(ggrepel)
Loading required package: ggplot2
library(FactoMineR)
library(factoextra)
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ade4)

Attaching package: 'ade4'
The following object is masked from 'package:FactoMineR':

    reconst
library(prettyGraphs)

data(smoke)
data(smoke, package="ca")
ca_smoke<-ca(smoke)
names(ca_smoke)
 [1] "sv"         "nd"         "rownames"   "rowmass"    "rowdist"   
 [6] "rowinertia" "rowcoord"   "rowsup"     "colnames"   "colmass"   
[11] "coldist"    "colinertia" "colcoord"   "colsup"     "N"         
[16] "call"      
summary(ca_smoke)

Principal inertias (eigenvalues):

 dim    value      %   cum%   scree plot               
 1      0.074759  87.8  87.8  **********************   
 2      0.010017  11.8  99.5  ***                      
 3      0.000414   0.5 100.0                           
        -------- -----                                 
 Total: 0.085190 100.0                                 


Rows:
    name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr  
1 |   SM |   57  893   31 |  -66  92   3 | -194 800 214 |
2 |   JM |   93  991  139 |  259 526  84 | -243 465 551 |
3 |   SE |  264 1000  450 | -381 999 512 |  -11   1   3 |
4 |   JE |  456 1000  308 |  233 942 331 |   58  58 152 |
5 |   SC |  130  999   71 | -201 865  70 |   79 133  81 |

Columns:
    name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr  
1 | none |  316 1000  577 | -393 994 654 |  -30   6  29 |
2 | lght |  233  984   83 |   99 327  31 |  141 657 463 |
3 | medm |  321  983  148 |  196 982 166 |    7   1   2 |
4 | hevy |  130  995  192 |  294 684 150 | -198 310 506 |
plot(ca_smoke)

plot(ca_smoke, mass = TRUE, contrib = "absolute",map = "rowgreen", arrows = c(FALSE, TRUE))

According to the above plot,we can clearly see that most of the senior managers are non smokers and junior managers are heavy smokers while junior employees seems to be medium smokers.

library(rgl)
plot3d.ca(ca(smoke, nd=3))

PART(B)

Multiple Correspondence Analysis (MCA)

Multiple Correspondence Analysis (MCA) is an extension of simple CA of a single cross-tabulation to more than two categorical variables.

Description of the data

This data frame(“wg93”) contains records of four questions on attitude towards science with responses on a five-point scale (1=agree strongly to 5=disagree strongly) and three demographic variables (sex, age and education).

data(wg93)
data(wg93, package="ca")
MCA_wg93<-MCA(wg93)

names(MCA_wg93)
[1] "eig"  "call" "ind"  "var"  "svd" 
summary(MCA_wg93)

Call:
MCA(X = wg93) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
Variance               0.289   0.255   0.208   0.197   0.183   0.168   0.165
% of var.              7.483   6.604   5.400   5.117   4.732   4.350   4.281
Cumulative % of var.   7.483  14.087  19.488  24.605  29.337  33.687  37.968
                       Dim.8   Dim.9  Dim.10  Dim.11  Dim.12  Dim.13  Dim.14
Variance               0.160   0.157   0.151   0.147   0.144   0.139   0.133
% of var.              4.139   4.078   3.912   3.824   3.730   3.599   3.458
Cumulative % of var.  42.107  46.184  50.097  53.921  57.651  61.250  64.708
                      Dim.15  Dim.16  Dim.17  Dim.18  Dim.19  Dim.20  Dim.21
Variance               0.129   0.127   0.127   0.122   0.117   0.112   0.107
% of var.              3.357   3.297   3.280   3.163   3.023   2.893   2.764
Cumulative % of var.  68.065  71.362  74.643  77.806  80.829  83.722  86.486
                      Dim.22  Dim.23  Dim.24  Dim.25  Dim.26  Dim.27
Variance               0.098   0.094   0.092   0.085   0.082   0.069
% of var.              2.530   2.442   2.398   2.210   2.138   1.796
Cumulative % of var.  89.016  91.458  93.856  96.066  98.204 100.000

Individuals (the 10 first)
       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr   cos2  
1   |  0.115  0.005  0.005 | -0.278  0.035  0.028 |  0.377  0.078  0.051 |
2   |  0.410  0.067  0.049 | -0.499  0.112  0.072 |  0.363  0.073  0.038 |
3   | -0.409  0.067  0.072 | -0.298  0.040  0.038 | -0.112  0.007  0.005 |
4   | -0.138  0.008  0.008 | -0.318  0.045  0.042 | -0.233  0.030  0.022 |
5   | -0.059  0.001  0.001 | -0.945  0.403  0.289 | -0.243  0.033  0.019 |
6   |  0.364  0.053  0.043 | -0.093  0.004  0.003 | -0.032  0.001  0.000 |
7   | -0.133  0.007  0.007 | -0.501  0.113  0.096 | -0.039  0.001  0.001 |
8   |  0.578  0.133  0.112 | -0.428  0.082  0.061 | -0.105  0.006  0.004 |
9   | -0.358  0.051  0.030 | -0.158  0.011  0.006 | -0.537  0.159  0.069 |
10  | -0.117  0.005  0.005 | -0.630  0.179  0.155 | -0.572  0.181  0.128 |

Categories (the 10 first)
        Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2  v.test  
A_1 |  -0.967   6.325   0.148 -11.348 |   0.823   5.186   0.107   9.654 |
A_2 |  -0.426   3.321   0.106  -9.624 |  -0.071   0.105   0.003  -1.611 |
A_3 |   0.180   0.377   0.010   2.943 |  -0.844   9.362   0.218 -13.772 |
A_4 |   0.769   5.976   0.152  11.491 |   0.228   0.594   0.013   3.403 |
A_5 |   1.639   7.324   0.157  11.672 |   1.183   4.323   0.082   8.424 |
B_1 |  -1.466   8.671   0.191 -12.882 |   1.390   8.837   0.172  12.217 |
B_2 |  -0.635   3.987   0.101  -9.358 |  -0.335   1.256   0.028  -4.934 |
B_3 |  -0.330   1.268   0.034  -5.399 |  -0.515   3.505   0.082  -8.432 |
B_4 |   0.454   3.289   0.098   9.238 |  -0.256   1.186   0.031  -5.211 |
B_5 |   1.105   9.712   0.234  14.262 |   0.979   8.645   0.184  12.641 |
      Dim.3     ctr    cos2  v.test  
A_1   0.209   0.411   0.007   2.457 |
A_2  -0.082   0.170   0.004  -1.849 |
A_3  -0.069   0.076   0.001  -1.120 |
A_4   0.292   1.193   0.022   4.361 |
A_5  -0.760   2.183   0.034  -5.414 |
B_1   0.251   0.353   0.006   2.208 |
B_2  -0.067   0.062   0.001  -0.991 |
B_3  -0.064   0.067   0.001  -1.055 |
B_4   0.366   2.970   0.064   7.457 |
B_5  -0.685   5.169   0.090  -8.839 |

Categorical variables (eta2)
      Dim.1 Dim.2 Dim.3  
A   | 0.471 0.349 0.059 |
B   | 0.544 0.418 0.126 |
C   | 0.519 0.486 0.203 |
D   | 0.026 0.395 0.160 |
sex | 0.069 0.026 0.055 |
age | 0.158 0.030 0.391 |
edu | 0.233 0.080 0.464 |
plot(MCA_wg93, cex = .7, col.var = "black", col.ind = "gray", invis = "ind")

library("factoextra")
print(MCA_wg93)
**Results of the Multiple Correspondence Analysis (MCA)**
The analysis was performed on 871 individuals, described by 7 variables
*The results are available in the following objects:

   name              description                       
1  "$eig"            "eigenvalues"                     
2  "$var"            "results for the variables"       
3  "$var$coord"      "coord. of the categories"        
4  "$var$cos2"       "cos2 for the categories"         
5  "$var$contrib"    "contributions of the categories" 
6  "$var$v.test"     "v-test for the categories"       
7  "$ind"            "results for the individuals"     
8  "$ind$coord"      "coord. for the individuals"      
9  "$ind$cos2"       "cos2 for the individuals"        
10 "$ind$contrib"    "contributions of the individuals"
11 "$call"           "intermediate results"            
12 "$call$marge.col" "weights of columns"              
13 "$call$marge.li"  "weights of rows"                 

The proportion of variances retained by the different dimensions (axes) can be extracted using the function get_eigenvalue()

eig.val <- get_eigenvalue(MCA_wg93)
eig.val
       eigenvalue variance.percent cumulative.variance.percent
Dim.1  0.28863097         7.483025                    7.483025
Dim.2  0.25473881         6.604339                   14.087365
Dim.3  0.20829251         5.400176                   19.487541
Dim.4  0.19738826         5.117473                   24.605014
Dim.5  0.18252871         4.732226                   29.337240
Dim.6  0.16777120         4.349624                   33.686864
Dim.7  0.16513859         4.281371                   37.968235
Dim.8  0.15963130         4.138589                   42.106824
Dim.9  0.15728148         4.077668                   46.184492
Dim.10 0.15090458         3.912341                   50.096833
Dim.11 0.14749027         3.823822                   53.920654
Dim.12 0.14387158         3.730004                   57.650658
Dim.13 0.13883734         3.599487                   61.250145
Dim.14 0.13338181         3.458047                   64.708192
Dim.15 0.12948212         3.356944                   68.065135
Dim.16 0.12716579         3.296891                   71.362026
Dim.17 0.12653319         3.280490                   74.642516
Dim.18 0.12201043         3.163233                   77.805750
Dim.19 0.11660840         3.023181                   80.828931
Dim.20 0.11158366         2.892910                   83.721840
Dim.21 0.10660185         2.763752                   86.485592
Dim.22 0.09758714         2.530037                   89.015629
Dim.23 0.09419675         2.442138                   91.457767
Dim.24 0.09248798         2.397837                   93.855604
Dim.25 0.08524996         2.210184                   96.065788
Dim.26 0.08246419         2.137960                   98.203748
Dim.27 0.06928400         1.796252                  100.000000

According to above result no dimension has explained significance amount of variance out of total. This can be clearly visualize by creating scree plot. To do that we can use the function “fviz_screeplot()”

fviz_screeplot(MCA_wg93, addlabels = TRUE, ylim = c(0, 45))

The function “fviz_mca_biplot()” is used to draw the biplot of individuals and variable categories

fviz_mca_biplot(MCA_wg93, 
                repel = TRUE, # Avoid text overlapping (slow if many point)
                ggtheme = theme_minimal())
Warning: ggrepel: 808 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

The plot above shows a global pattern within the data. Rows (individuals) are represented by blue points and columns (variable categories) by red triangles.The distance between any row points or column points gives a measure of their similarity (or dissimilarity). Row points with similar profile are closed on the factor map. The same holds true for column points.

Variables of Multiple Correspondence Analysis can be extracted by using below code.

var <- get_mca_var(MCA_wg93)
var
Multiple Correspondence Analysis Results for variables
 ===================================================
  Name       Description                  
1 "$coord"   "Coordinates for categories" 
2 "$cos2"    "Cos2 for categories"        
3 "$contrib" "contributions of categories"
head(var$coord)
         Dim 1       Dim 2       Dim 3       Dim 4       Dim 5
A_1 -0.9671597  0.82273703  0.20937924 -0.42460797  0.07117658
A_2 -0.4260310 -0.07132741 -0.08186277  0.15141118 -0.05311328
A_3  0.1803910 -0.84426258 -0.06866201 -0.43681468  0.30997012
A_4  0.7686636  0.22766291  0.29174984  0.48798870 -0.66303991
A_5  1.6385852  1.18265182 -0.76001539  0.08379486  1.32124301
B_1 -1.4660261  1.39035983  0.25123381 -0.73875510 -0.14521786
head(var$coord)
         Dim 1       Dim 2       Dim 3       Dim 4       Dim 5
A_1 -0.9671597  0.82273703  0.20937924 -0.42460797  0.07117658
A_2 -0.4260310 -0.07132741 -0.08186277  0.15141118 -0.05311328
A_3  0.1803910 -0.84426258 -0.06866201 -0.43681468  0.30997012
A_4  0.7686636  0.22766291  0.29174984  0.48798870 -0.66303991
A_5  1.6385852  1.18265182 -0.76001539  0.08379486  1.32124301
B_1 -1.4660261  1.39035983  0.25123381 -0.73875510 -0.14521786
head(var$coord)
         Dim 1       Dim 2       Dim 3       Dim 4       Dim 5
A_1 -0.9671597  0.82273703  0.20937924 -0.42460797  0.07117658
A_2 -0.4260310 -0.07132741 -0.08186277  0.15141118 -0.05311328
A_3  0.1803910 -0.84426258 -0.06866201 -0.43681468  0.30997012
A_4  0.7686636  0.22766291  0.29174984  0.48798870 -0.66303991
A_5  1.6385852  1.18265182 -0.76001539  0.08379486  1.32124301
B_1 -1.4660261  1.39035983  0.25123381 -0.73875510 -0.14521786

Correlation between variables and dimensions.

fviz_mca_var(MCA_wg93, choice = "mca.cor", 
             repel = TRUE, # Avoid text overlapping (slow)
             ggtheme = theme_minimal())

  • The plot above helps to identify variables that are the most correlated with each dimension. The squared correlations between variables and the dimensions are used as coordinates.

  • It can be seen that, the variables Sex, Age and education are the most correlated with dimension 1. Similarly, the variables “record of question D” is the most correlated with dimension 2.

head(round(var$coord, 2), 4)
    Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
A_1 -0.97  0.82  0.21 -0.42  0.07
A_2 -0.43 -0.07 -0.08  0.15 -0.05
A_3  0.18 -0.84 -0.07 -0.44  0.31
A_4  0.77  0.23  0.29  0.49 -0.66
fviz_mca_var(MCA_wg93, 
             repel = TRUE, # Avoid text overlapping (slow)
             ggtheme = theme_minimal())

fviz_mca_var(MCA_wg93, col.var="blue", shape.var = 15,
             repel = TRUE)

head(var$cos2, 4)
          Dim 1       Dim 2       Dim 3      Dim 4        Dim 5
A_1 0.148021742 0.107115226 0.006937394 0.02853024 0.0008016843
A_2 0.106454957 0.002983984 0.003930578 0.01344620 0.0016545879
A_3 0.009952545 0.218001464 0.001441907 0.05835778 0.0293862375
A_4 0.151760731 0.013312859 0.021862913 0.06116547 0.1129187613

Contribution of variable categories to the dimensions

The variable categories with the larger value, contribute the most to the definition of the dimensions. It can be obtained by using below code.

head(round(var$contrib,2), 4)
    Dim 1 Dim 2 Dim 3 Dim 4 Dim 5
A_1  6.33  5.19  0.41  1.78  0.05
A_2  3.32  0.11  0.17  0.61  0.08
A_3  0.38  9.36  0.08  3.23  1.76
A_4  5.98  0.59  1.19  3.52  7.03
fviz_contrib(MCA_wg93, choice = "var", axes = 1, top = 15)

fviz_contrib(MCA_wg93, choice = "var", axes = 2, top = 15)

Variable categories that contribute the most to Dim.1 and Dim.2 are the most important in explaining the variability in the data set.