Categorical Correlation Matrix

Lily (Zhi Lin) Zhou

Imagine…

  • You want to assess multicollinearity between your categorical variables
  • The corrplot package only works for continuous variables, as it uses Pearson’s r to generate correlation coefficients
  • Is there a simple way of doing this for categorical variables?

corrcat generates a correlation matrix for categorical variables

  1. uses Cramér’s V to calculate correlation coefficients
  2. produces a table with correlation coefficients
  3. produces a correlation matrix
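
As a rough illustration of step 1, Cramér’s V for two categorical variables is commonly defined as V = sqrt(chi² / (n * min(r - 1, c - 1))), where chi² is the Pearson chi-squared statistic of their r-by-c contingency table and n is the number of complete observations. The sketch below is not the corrcat source; `cramers_v` and `cramers_v_matrix` are hypothetical helper names, and the package may differ in details such as NA handling or bias correction.

```r
# Hypothetical helper, not the corrcat source: Cramér's V for two vectors.
cramers_v <- function(x, y) {
  keep <- !is.na(x) & !is.na(y)                    # drop pairs with a missing value
  tab <- table(factor(x[keep]), factor(y[keep]))   # factor() drops unused levels
  if (min(dim(tab)) < 2) return(NA_real_)          # V is undefined with one level
  # Pearson chi-squared without continuity correction; small-sample
  # warnings are suppressed because only the statistic is used.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# Hypothetical helper: pairwise Cramér's V for every column of a data frame.
cramers_v_matrix <- function(df) {
  vars <- names(df)
  m <- matrix(1, length(vars), length(vars), dimnames = list(vars, vars))
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      if (i != j) m[i, j] <- cramers_v(df[[i]], df[[j]])
    }
  }
  m
}
```

Because V is symmetric in its two arguments, the resulting matrix is symmetric, with 1 on the diagonal.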

Let’s test it out

Birds Dataset

   color country  diet can_fly  wingspan
1  white  canada   bug       1  44.94233
2   blue    peru  fish       0 141.64550
3  white  canada fruit       1 146.31591
4  white  canada   bug       0 109.00643
5  white   japan   bug       0  20.55236
6  white  canada   bug       1 146.03742
7   <NA>    peru fruit       0  53.71118
8   blue   japan fruit       0  18.54671
9   blue   japan fruit       1  39.78065
10 white   japan fruit       1  81.48394

corrplot_cat(birds)
           color country   diet can_fly wingspan
color    1.00000  0.4956 0.2570 0.07422   0.7888
country  0.49560  1.0000 0.2053 0.28370   1.0000
diet     0.25700  0.2053 1.0000 0.20750   1.0000
can_fly  0.07422  0.2837 0.2075 1.00000   1.0000
wingspan 0.78880  1.0000 1.0000 1.00000   1.0000

corrplot_cat(birds[1:4])
          color country   diet can_fly
color   1.00000  0.4956 0.2570 0.07422
country 0.49560  1.0000 0.2053 0.28370
diet    0.25700  0.2053 1.0000 0.20750
can_fly 0.07422  0.2837 0.2075 1.00000

Notes on the package

  • Automatically drops NAs
  • Only works for comparing nominal categorical variables
  • Continuous variables (such as wingspan) must be removed from the data set before calling the function, as in birds[1:4]
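
One possible way to filter out continuous variables is sketched below. It rebuilds the birds data from the values printed earlier and drops numeric columns with many distinct values. `is_continuous` and its `max_levels` threshold are hypothetical and not part of the package. A plain `is.numeric` check would not be enough here, because can_fly is stored as 0/1 and would be dropped as well.

```r
# Birds data, copied from the table shown on the earlier slide.
birds <- data.frame(
  color    = c("white", "blue", "white", "white", "white",
               "white", NA, "blue", "blue", "white"),
  country  = c("canada", "peru", "canada", "canada", "japan",
               "canada", "peru", "japan", "japan", "japan"),
  diet     = c("bug", "fish", "fruit", "bug", "bug",
               "bug", "fruit", "fruit", "fruit", "fruit"),
  can_fly  = c(1, 0, 1, 0, 0, 1, 0, 0, 1, 1),
  wingspan = c(44.94233, 141.64550, 146.31591, 109.00643, 20.55236,
               146.03742, 53.71118, 18.54671, 39.78065, 81.48394)
)

# Hypothetical rule: a column counts as continuous if it is numeric and has
# more than `max_levels` distinct non-missing values.
is_continuous <- function(col, max_levels = 5) {
  is.numeric(col) && length(unique(na.omit(col))) > max_levels
}

# Keep only the columns that are not flagged as continuous.
birds_cat <- birds[!vapply(birds, is_continuous, logical(1))]
names(birds_cat)
```

With this data the rule keeps the same four columns as birds[1:4], so birds_cat could be passed to corrplot_cat in the same way. The threshold is a judgment call and should be checked against each data set.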

Future Steps

  • Extend the function so categorical and continuous variables can be compared
  • Customize colors
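
For the first future step, one commonly used measure of association between a categorical and a continuous variable is the correlation ratio (eta), the square root of the between-group sum of squares divided by the total sum of squares. The sketch below only illustrates that measure and is not a planned or existing part of the package; `correlation_ratio` is a hypothetical name.

```r
# Hypothetical helper: correlation ratio (eta) between a categorical
# vector and a numeric vector.
correlation_ratio <- function(category, value) {
  keep <- !is.na(category) & !is.na(value)      # drop incomplete pairs
  category <- factor(category[keep])
  value <- value[keep]
  group_means <- tapply(value, category, mean)
  group_sizes <- tapply(value, category, length)
  ss_between <- sum(group_sizes * (group_means - mean(value))^2)
  ss_total <- sum((value - mean(value))^2)
  if (ss_total == 0) return(NA_real_)           # undefined for a constant value
  sqrt(ss_between / ss_total)
}
```

Like Cramér’s V, eta ranges from 0 (group means are identical) to 1 (the category fully determines the value), so in principle the two could share one matrix.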

Questions?