Correspondence Analysis

Author

Justin Pons

Introduction

Correspondence Analysis (CA) is a dimension reduction technique similar to principal components analysis. CA, however, uses categorical variables instead of continuous variables.

We will be conducting a correspondence analysis between Race and Pizza restaraunts.

As always we will be using tidyverse. The vegan package is used for CA.

library(tidyverse)
library(ca)
library(vegan)

Analysis

df <- read_csv("C:/Users/Justin Pons/Spring 2024/DA 6213/4/pizzafem truncated.csv")

We take a look at the data using str() function. The myid column is a unique identifier and not necessary. Pizza and race columns must be converted into factor, or character, variables. We also remove the id variable and any observations with missing values.

str(df)
spc_tbl_ [25,439 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ myid     : num [1:25439] 2354858 2354859 2354861 2354862 2354863 ...
 $ RESP_RACE: num [1:25439] 1 1 4 4 4 1 1 1 1 1 ...
 $ pizza    : num [1:25439] NA NA NA 2 2 NA NA NA NA NA ...
 - attr(*, "spec")=
  .. cols(
  ..   myid = col_double(),
  ..   RESP_RACE = col_double(),
  ..   pizza = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
df <- df |> na.omit()
df <- df |> mutate_at(c('RESP_RACE', 'pizza'), as.factor)
df <- df |> select(-myid)
str(df)
tibble [2,574 × 2] (S3: tbl_df/tbl/data.frame)
 $ RESP_RACE: Factor w/ 4 levels "1","2","3","4": 4 4 2 4 4 1 1 1 1 1 ...
 $ pizza    : Factor w/ 4 levels "1","2","3","4": 2 2 1 2 1 1 2 2 2 2 ...
 - attr(*, "na.action")= 'omit' Named int [1:22865] 1 2 3 6 7 8 9 10 11 12 ...
  ..- attr(*, "names")= chr [1:22865] "1" "2" "3" "6" ...

This code replaces the numeric representations with readable values.

df$RESP_RACE <- df$RESP_RACE |> 
  case_match("1"~"White","2"~"Black","3"~"Asian","4"~"Other")
df$pizza <- df$pizza |> 
  case_match("1"~"PizzaHut","2"~"Dominos","3"~"PapaJohns","4"~"LittleCaesars")

A frequency table is created. Margins are added with the sums of each column and row. This is necessary to calculated expected values.

sums <- addmargins(table(df$RESP_RACE, df$pizza))
sums
       
        Dominos LittleCaesars PapaJohns PizzaHut  Sum
  Asian      33             7        13       27   80
  Black      76            67        30      103  276
  Other     165           139        32      100  436
  White     508           476       238      560 1782
  Sum       782           689       313      790 2574

The chisq.test() function takes a frequency table as an input (a new table is created without the margins). The expected values are displayed for each cell.

frequency_table <- table(df$RESP_RACE, df$pizza)
expectedValues <- chisq.test(frequency_table)$expected
expectedValues
       
          Dominos LittleCaesars PapaJohns  PizzaHut
  Asian  24.30458      21.41414   9.72805  24.55322
  Black  83.85082      73.87879  33.56177  84.70862
  Other 132.45998     116.70707  53.01787 133.81507
  White 541.38462     477.00000 216.69231 546.92308

The chi^^2 statistic is calculated by squaring the difference between the observed and expected values, then dividing by the expected values.

chisq <- ((frequency_table-expectedValues)^2)/expectedValues
chisq
       
            Dominos LittleCaesars   PapaJohns    PizzaHut
  Asian 3.110946197   9.702348961 1.100493818 0.243825819
  Black 0.735059151   0.640477789 0.377996037 3.949708913
  Other 7.993754609   4.258308374 8.332113185 8.545070331
  White 2.058670514   0.002096436 2.095218590 0.312669047

This table shows us that Asian/Little Caesars contributes the most to the chi^^2 statistic total of 53.4. The total chi square is 53.45. The degrees of freedom are (r-1)(c-1) = 9. The critical value for p=.05 and df=3 is 16.9. 53>16, so we can reject the null hypothesis that there is no relationship.

addmargins(chisq)
       
             Dominos LittleCaesars    PapaJohns     PizzaHut          Sum
  Asian  3.110946197   9.702348961  1.100493818  0.243825819 14.157614795
  Black  0.735059151   0.640477789  0.377996037  3.949708913  5.703241890
  Other  7.993754609   4.258308374  8.332113185  8.545070331 29.129246499
  White  2.058670514   0.002096436  2.095218590  0.312669047  4.468654587
  Sum   13.898430472  14.603231560 11.905821630 13.051274110 53.458757772

This shows the eigan values for the new dimensions.

pizza.cca <- cca(frequency_table)
summary(pizza.cca)

Call:
cca(X = frequency_table) 

Partitioning of scaled Chi-square:
              Inertia Proportion
Total         0.02077          1
Unconstrained 0.02077          1

Eigenvalues, and their contribution to the scaled Chi-square 

Importance of components:
                          CA1      CA2      CA3
Eigenvalue            0.01411 0.005141 0.001514
Proportion Explained  0.67957 0.247513 0.072914
Cumulative Proportion 0.67957 0.927086 1.000000

Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions


Species scores

                   CA1       CA2        CA3
Dominos       -0.09709 -0.091360  4.712e-05
LittleCaesars -0.10740  0.097854  9.198e-03
PapaJohns      0.17396 -0.010912  8.750e-02
PizzaHut       0.12085  0.009415 -4.274e-02


Site scores (weighted averages of species scores)

          CA1     CA2     CA3
Asian  1.3894 -5.3923  0.4092
Black  0.7937  0.1799 -2.7683
Other -2.1607 -0.3928 -0.2838
White  0.3434  0.3103  0.4798
vegdist(frequency_table, method = "chisq")
          Asian     Black     Other
Black 0.4241554                    
Other 0.5541624 0.3663792          
White 0.4273424 0.1375780 0.3031929
pizza.cca <- ca::ca(frequency_table)
pizza.cca

 Principal inertias (eigenvalues):
           1        2        3       
Value      0.014114 0.005141 0.001514
Percentage 67.96%   24.75%   7.29%   


 Rows:
            Asian    Black     Other    White
Mass     0.031080 0.107226  0.169386 0.692308
ChiDist  0.420678 0.143750  0.258477 0.050077
Inertia  0.005500 0.002216  0.011317 0.001736
Dim. 1   1.389359 0.793746 -2.160742 0.343356
Dim. 2  -5.392331 0.179879 -0.392816 0.310330


 Columns:
          Dominos LittleCaesars PapaJohns PizzaHut
Mass     0.303807      0.267677  0.121601 0.306915
ChiDist  0.133315      0.145584  0.195033 0.128532
Inertia  0.005400      0.005673  0.004625 0.005070
Dim. 1  -0.817238     -0.904032  1.464276 1.017265
Dim. 2  -1.274237      1.364812 -0.152200 0.131312
summary(pizza.cca)

Principal inertias (eigenvalues):

 dim    value      %   cum%   scree plot               
 1      0.014114  68.0  68.0  *****************        
 2      0.005141  24.8  92.7  ******                   
 3      0.001514   7.3 100.0  **                       
        -------- -----                                 
 Total: 0.020769 100.0                                 


Rows:
    name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr  
1 | Asin |   31  999  265 |  165 154  60 | -387 845 904 |
2 | Blck |  107  438  107 |   94 430  68 |   13   8   3 |
3 | Othr |  169  998  545 | -257 986 791 |  -28  12  26 |
4 | Whit |  692  861   84 |   41 664  82 |   22 197  67 |

Columns:
    name   mass  qlt  inr    k=1 cor ctr    k=2 cor ctr  
1 | Dmns |  304 1000  260 |  -97 530 203 |  -91 470 493 |
2 | LttC |  268  996  273 | -107 544 219 |   98 452 499 |
3 | PpJh |  122  799  223 |  174 796 261 |  -11   3   3 |
4 | PzzH |  307  889  244 |  121 884 318 |    9   5   5 |
plot(ca(frequency_table, graph = TRUE))

chisq.test(frequency_table)$stdres
       
            Dominos LittleCaesars   PapaJohns    PizzaHut
  Asian  2.14752172   -3.69779717  1.13711356  0.60256339
  Black -1.08749100   -0.98975998 -0.69426756  2.52649621
  Other  3.71801464    2.64586432 -3.37934080 -3.85269907
  White -3.10006194   -0.09645641  2.78426354  1.21085174