It’s all about music

Back in 2019, university and high school students at HSE were casting votes for the headliner of the big graduation show. The results were presented in a neat picture like this https://www.hse.ru/news/life/264225010.html

We are going to turn these results into a table and find out which artists were closer to which group of students.

As part of the exercise, we will add the age of the lead singer as the third row. It will most probably be unrelated to the real votes, and thus should create another dimension. (Otherwise there would be no map but a line as the minimum number of dimensions = MIN(row, column) - 1.)

1. Let’s create a matrix with votes and ages:

vd <- matrix(c(1272,894, 797, 748, 781, 731, 693, 433, 461, 392,
        136, 176, 169, 163, 117, 146, 169, 151, 87, 126,
        57, 37, 34, 37, 50, 46, 24, 33, 32, 24) , nrow = 3, byrow = T)
coln <- c('VM', 'LB', 'Max_K', 'Ox', 'Bi2', 'Zemf', 'Mone', 'LSP', 'Skr', 'TB')
rown <- c('university', 'high_school', 'age')
colnames(vd) <- coln
row.names(vd) <- rown
vd <- as.table(vd)
vd
##               VM   LB Max_K   Ox  Bi2 Zemf Mone  LSP  Skr   TB
## university  1272  894   797  748  781  731  693  433  461  392
## high_school  136  176   169  163  117  146  169  151   87  126
## age           57   37    34   37   50   46   24   33   32   24

2. The next step is to run a chi-square test of independence (both variables are categorical):

chisq.test(vd)
## 
##  Pearson's Chi-squared test
## 
## data:  vd
## X-squared = 140.39, df = 18, p-value < 2.2e-16

Now we see the cells where counts are most different from the expected frequencies.

chisq.test(vd)$res # no standardization
##                      VM          LB       Max_K          Ox         Bi2
## university   2.97451955  0.32707154 -0.06376266 -0.33666224  0.86253374
## high_school -6.40563493 -0.06059180  0.73461199  0.94183332 -2.79651159
## age         -0.48371751 -1.31637585 -1.16165678 -0.37071858  1.70233408
##                    Zemf        Mone         LSP         Skr          TB
## university  -0.23180965 -0.55401602 -2.69638035 -0.10709807 -1.96807958
## high_school -0.11678249  2.31104497  5.28410689 -0.58550405  4.23833950
## age          1.24638901 -2.10359772  1.46384820  1.61885423  0.31990181
chisq.test(vd)$stdres # standardized for the size of cells
##                      VM          LB       Max_K          Ox         Bi2
## university   7.24619229  0.77853335 -0.15075885 -0.79342654  2.03277079
## high_school -7.63578811 -0.07057436  0.84991028  1.08613804 -3.22498426
## age         -0.53987888 -1.43557501 -1.25836245 -0.40028396  1.83809782
##                    Zemf        Mone         LSP         Skr          TB
## university  -0.54547142 -1.30068553 -6.22820108 -0.24683606 -4.52578258
## high_school -0.13446736  2.65495426  5.97243901 -0.66032165  4.76919886
## age          1.34371019 -2.26268548  1.54913413  1.70940997  0.33703829
library(corrplot)
corrplot(chisq.test(vd)$stdres, is.corr=FALSE)

3. Finally, let’s run correspondence analysis and get the correspondence map

library(FactoMineR)
my.ca <- CA(vd, graph = T)

summary(my.ca)
## 
## Call:
## CA(X = vd, graph = T) 
## 
## The chi square of independence between the two variables is equal to 140.3939 (p-value =  5.377263e-21 ).
## 
## Eigenvalues
##                        Dim.1   Dim.2
## Variance               0.014   0.002
## % of var.             87.249  12.751
## Cumulative % of var.  87.249 100.000
## 
## Rows
##               Iner*1000    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## university  |     2.366 | -0.054 16.952  0.973 | -0.009  3.168  0.027 |
## high_school |    11.298 |  0.266 83.012  0.998 | -0.011  1.016  0.002 |
## age         |     1.907 |  0.011  0.036  0.003 |  0.214 95.816  0.997 |
## 
## Columns
##               Iner*1000    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
## VM          |     5.558 | -0.185 40.808  0.997 | -0.009  0.712  0.003 |
## LB          |     0.204 | -0.006  0.038  0.025 | -0.040 10.040  0.975 |
## Max_K       |     0.210 |  0.021  0.370  0.240 | -0.038  8.042  0.760 |
## Ox          |     0.126 |  0.032  0.800  0.861 | -0.013  0.884  0.139 |
## Bi2         |     1.271 | -0.093  6.728  0.719 |  0.058 17.994  0.281 |
## Zemf        |     0.180 |  0.000  0.000  0.000 |  0.042  9.054  1.000 |
## Mone        |     1.117 |  0.077  4.295  0.522 | -0.074 26.878  0.478 |
## LSP         |     4.141 |  0.240 28.925  0.949 |  0.056 10.640  0.051 |
## Skr         |     0.330 | -0.019  0.172  0.071 |  0.069 15.444  0.929 |
## TB          |     2.433 |  0.201 17.865  0.997 |  0.010  0.311  0.003 |
library(factoextra)
fviz_screeplot(my.ca, addlabels = TRUE)

fviz_ca_biplot(my.ca, repel = TRUE)

Asymmetric plots

“If the angle between two arrows is acute, then their is a strong association between the corresponding row and column.

To interpret the distance between rows and and a column you should perpendicularly project row points on the column arrow.” http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/113-ca-correspondence-analysis-in-r-essentials/

fviz_ca_biplot(my.ca, repel = TRUE,
              map = "rowprincipal",
              arrow = c(TRUE, TRUE))

Questions for self-check:

  1. Which artists contribute the most to the observed differences?

  2. What is the meaning of the two dimensions here?

  3. Compare the results of the chi-squared test and the correspondence map.

LS0tDQp0aXRsZTogIkNvcnJlc3BvbmRlbmNlIEFuYWx5c2lzIg0KYXV0aG9yOiAiQW5uYSBTaGlyb2thbm92YSINCmRhdGU6ICJgciBTeXMuRGF0ZSgpYCINCm91dHB1dDoNCiAgaHRtbF9kb2N1bWVudDoNCiAgICBjb2RlX2Rvd25sb2FkOiB0cnVlDQotLS0NCg0KYGBge3Igc2V0dXAsIGluY2x1ZGU9RkFMU0V9DQprbml0cjo6b3B0c19jaHVuayRzZXQoZWNobyA9IFRSVUUsIG1lc3NhZ2UgPSBGLCB3YXJuaW5nID0gRikNCmBgYA0KDQojIyBJdCdzIGFsbCBhYm91dCBtdXNpYw0KDQpCYWNrIGluIDIwMTksIHVuaXZlcnNpdHkgYW5kIGhpZ2ggc2Nob29sIHN0dWRlbnRzIGF0IEhTRSB3ZXJlIGNhc3Rpbmcgdm90ZXMgZm9yIHRoZSBoZWFkbGluZXIgb2YgdGhlIGJpZyBncmFkdWF0aW9uIHNob3cuIFRoZSByZXN1bHRzIHdlcmUgcHJlc2VudGVkIGluIGEgbmVhdCBwaWN0dXJlIGxpa2UgdGhpcyA8aHR0cHM6Ly93d3cuaHNlLnJ1L25ld3MvbGlmZS8yNjQyMjUwMTAuaHRtbD4NCg0KV2UgYXJlIGdvaW5nIHRvIHR1cm4gdGhlc2UgcmVzdWx0cyBpbnRvIGEgdGFibGUgYW5kIGZpbmQgb3V0IHdoaWNoIGFydGlzdHMgd2VyZSBjbG9zZXIgdG8gd2hpY2ggZ3JvdXAgb2Ygc3R1ZGVudHMuDQoNCkFzIHBhcnQgb2YgdGhlIGV4ZXJjaXNlLCB3ZSB3aWxsIGFkZCB0aGUgYWdlIG9mIHRoZSBsZWFkIHNpbmdlciBhcyB0aGUgdGhpcmQgcm93LiBJdCB3aWxsIG1vc3QgcHJvYmFibHkgYmUgdW5yZWxhdGVkIHRvIHRoZSByZWFsIHZvdGVzLCBhbmQgdGh1cyBzaG91bGQgY3JlYXRlIGFub3RoZXIgZGltZW5zaW9uLiAoT3RoZXJ3aXNlIHRoZXJlIHdvdWxkIGJlIG5vIG1hcCBidXQgYSBsaW5lIGFzIHRoZSBtaW5pbXVtIG51bWJlciBvZiBkaW1lbnNpb25zID0gTUlOKHJvdywgY29sdW1uKSAtIDEuKQ0KDQoNCiMjIyAxLiBMZXQncyBjcmVhdGUgYSBtYXRyaXggd2l0aCB2b3RlcyBhbmQgYWdlczoNCg0KYGBge3J9DQp2ZCA8LSBtYXRyaXgoYygxMjcyLDg5NCwgNzk3LCA3NDgsIDc4MSwgNzMxLCA2OTMsIDQzMywgNDYxLCAzOTIsDQogICAgICAgIDEzNiwgMTc2LCAxNjksIDE2MywgMTE3LCAxNDYsIDE2OSwgMTUxLCA4NywgMTI2LA0KICAgICAgICA1NywgMzcsIDM0LCAzNywgNTAsIDQ2LCAyNCwgMzMsIDMyLCAyNCkgLCBucm93ID0gMywgYnlyb3cgPSBUKQ0KY29sbiA8LSBjKCdWTScsICdMQicsICdNYXhfSycsICdPeCcsICdCaTInLCAnWmVtZicsICdNb25lJywgJ0xTUCcsICdTa3InLCAnVEInKQ0Kcm93biA8LSBjKCd1bml2ZXJzaXR5JywgJ2hpZ2hfc2Nob29sJywgJ2FnZScpDQpjb2xuYW1lcyh2ZCkgPC0gY29sbg0Kcm93Lm5hbWVzKHZkKSA8LSByb3duDQp2ZCA8LSBhcy50YWJsZSh2ZCkNCnZkDQpgYGANCg0KIyMjIDIuIFRoZSBuZXh0IHN0ZXAgaXMgdG8gcnVuIGEgY2hpLXNxdWFyZSB0ZXN0IG9mIGluZGVwZW5kZW5jZSAoYm90aCB2YXJpYWJsZXMgYXJlIGNhdGVnb3JpY2FsKToNCg0KYGBge3J9DQpjaGlzcS50ZXN0KHZkKQ0KYGBgDQoNCk5vdyB3ZSBzZWUgdGhlIGNlbGxzIHdoZXJlIGNvdW50cyBhcmUgbW9zdCBkaWZmZXJlbnQgZnJvbSB0aGUgZXhwZWN0ZWQgZnJlcXVlbmNpZXMuDQoNCmBgYHtyfQ0KY2hpc3EudGVzdCh2ZCkkcmVzICMgbm8gc3RhbmRhcmRpemF0aW9uDQpjaGlzcS50ZXN0KHZkKSRzdGRyZXMgIyBzdGFuZGFyZGl6ZWQgZm9yIHRoZSBzaXplIG9mIGNlbGxzDQpsaWJyYXJ5KGNvcnJwbG90KQ0KY29ycnBsb3QoY2hpc3EudGVzdCh2ZCkkc3RkcmVzLCBpcy5jb3JyPUZBTFNFKQ0KYGBgDQoNCg0KIyMjIDMuIEZpbmFsbHksIGxldCdzIHJ1biBjb3JyZXNwb25kZW5jZSBhbmFseXNpcyBhbmQgZ2V0IHRoZSBjb3JyZXNwb25kZW5jZSBtYXANCg0KYGBge3J9DQpsaWJyYXJ5KEZhY3RvTWluZVIpDQpteS5jYSA8LSBDQSh2ZCwgZ3JhcGggPSBUKQ0Kc3VtbWFyeShteS5jYSkNCmxpYnJhcnkoZmFjdG9leHRyYSkNCmZ2aXpfc2NyZWVwbG90KG15LmNhLCBhZGRsYWJlbHMgPSBUUlVFKQ0KZnZpel9jYV9iaXBsb3QobXkuY2EsIHJlcGVsID0gVFJVRSkNCmBgYA0KDQpfX0FzeW1tZXRyaWMgcGxvdHNfXw0KDQoiSWYgdGhlIGFuZ2xlIGJldHdlZW4gdHdvIGFycm93cyBpcyBhY3V0ZSwgdGhlbiB0aGVpciBpcyBhIHN0cm9uZyBhc3NvY2lhdGlvbiBiZXR3ZWVuIHRoZSBjb3JyZXNwb25kaW5nIHJvdyBhbmQgY29sdW1uLg0KDQpUbyBpbnRlcnByZXQgdGhlIGRpc3RhbmNlIGJldHdlZW4gcm93cyBhbmQgYW5kIGEgY29sdW1uIHlvdSBzaG91bGQgcGVycGVuZGljdWxhcmx5IHByb2plY3Qgcm93IHBvaW50cyBvbiB0aGUgY29sdW1uIGFycm93LiIgPGh0dHA6Ly93d3cuc3RoZGEuY29tL2VuZ2xpc2gvYXJ0aWNsZXMvMzEtcHJpbmNpcGFsLWNvbXBvbmVudC1tZXRob2RzLWluLXItcHJhY3RpY2FsLWd1aWRlLzExMy1jYS1jb3JyZXNwb25kZW5jZS1hbmFseXNpcy1pbi1yLWVzc2VudGlhbHMvPg0KDQpgYGB7cn0NCmZ2aXpfY2FfYmlwbG90KG15LmNhLCByZXBlbCA9IFRSVUUsDQogICAgICAgICAgICAgIG1hcCA9ICJyb3dwcmluY2lwYWwiLA0KICAgICAgICAgICAgICBhcnJvdyA9IGMoVFJVRSwgVFJVRSkpDQpgYGANCg0KDQojIyMgUXVlc3Rpb25zIGZvciBzZWxmLWNoZWNrOg0KDQoxKSBXaGljaCBhcnRpc3RzIGNvbnRyaWJ1dGUgdGhlIG1vc3QgdG8gdGhlIG9ic2VydmVkIGRpZmZlcmVuY2VzPw0KDQoyKSBXaGF0IGlzIHRoZSBtZWFuaW5nIG9mIHRoZSB0d28gZGltZW5zaW9ucyBoZXJlPw0KDQozKSBDb21wYXJlIHRoZSByZXN1bHRzIG9mIHRoZSBjaGktc3F1YXJlZCB0ZXN0IGFuZCB0aGUgY29ycmVzcG9uZGVuY2UgbWFwLg0KDQojIyMgT3RoZXIgc291cmNlczoNCg0KQm91cmRpZXUsIDE5ODRbMTk3OV0gRGlzdGluY3Rpb246IHBwLjI2MiwgMjY2IDxodHRwczovL21vbm9za29wLm9yZy9pbWFnZXMvZS9lMC9QaWVycmVfQm91cmRpZXVfRGlzdGluY3Rpb25fQV9Tb2NpYWxfQ3JpdGlxdWVfb2ZfdGhlX0p1ZGdlbWVudF9vZl9UYXN0ZV8xOTg0LnBkZj4NCg0KIVtCYW5rcyBDb0FdKGh0dHBzOi8vc3RhdGljLnRpbGRhY2RuLmNvbS90aWxkNjQ2NS0zNDYxLTQ3NjUtYjA2Mi02MjY1NjQzNjYyMzgvQ29ycmVzc193aXRoX2R5bmFtaWMucG5nKSA8aHR0cHM6Ly9yYWRhci1yZXNlYXJjaC5ydS90cG9zdC9yM3VvamhyZngxLWFuYWxpei1zb290dmV0c3R2aWk+DQoNClNoYWZpciwgMjAwNjogcHAuNjEsIDY0LCA2OSwgNzgsIDgzICFbU2hhZmlyLCAyMDA2XShodHRwczovL3N0YXRtb2QucnUvd2lraS9fbWVkaWEvc3R1ZHk6ZmFsbDIwMTU6NXN0YXRfbGVjdHVyZTpzaGFmaXJfbS5fLXByb3N0b2lpX2lfbW5vamVzdHZlbm55aWlpaV9hbmFsaXpfc29vdHZldHN0dmlpaV9rYWtfbWV0b2RfcmF6dmVkb2Nobm9nb19hbmFsaXphX2Rhbm55aWloLl92eWlpcHlza25hZWFfa3ZhbGlmaWthdHMucGRmKSBMaW5rOiA8aHR0cHM6Ly9zdGF0bW9kLnJ1L3dpa2kvX21lZGlhL3N0dWR5OmZhbGwyMDE1OjVzdGF0X2xlY3R1cmU6c2hhZmlyX20uXy1wcm9zdG9paV9pX21ub2plc3R2ZW5ueWlpaWlfYW5hbGl6X3Nvb3R2ZXRzdHZpaWlfa2FrX21ldG9kX3JhenZlZG9jaG5vZ29fYW5hbGl6YV9kYW5ueWlpaC5fdnlpaXB5c2tuYWVhX2t2YWxpZmlrYXRzLnBkZj4NCg0K