library(tidyverse)
library(ca)
library(vegan)Correspondence Analysis
Introduction
Correspondence Analysis (CA) is a dimension reduction technique similar to principal components analysis. CA, however, uses categorical variables instead of continuous variables.
We will be conducting a correspondence analysis between Race and Pizza restaraunts.
As always we will be using tidyverse. The vegan package is used for CA.
Analysis
df <- read_csv("C:/Users/Justin Pons/Spring 2024/DA 6213/4/pizzafem truncated.csv")We take a look at the data using str() function. The myid column is a unique identifier and not necessary. Pizza and race columns must be converted into factor, or character, variables. We also remove the id variable and any observations with missing values.
str(df)spc_tbl_ [25,439 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ myid : num [1:25439] 2354858 2354859 2354861 2354862 2354863 ...
$ RESP_RACE: num [1:25439] 1 1 4 4 4 1 1 1 1 1 ...
$ pizza : num [1:25439] NA NA NA 2 2 NA NA NA NA NA ...
- attr(*, "spec")=
.. cols(
.. myid = col_double(),
.. RESP_RACE = col_double(),
.. pizza = col_double()
.. )
- attr(*, "problems")=<externalptr>
df <- df |> na.omit()
df <- df |> mutate_at(c('RESP_RACE', 'pizza'), as.factor)
df <- df |> select(-myid)
str(df)tibble [2,574 × 2] (S3: tbl_df/tbl/data.frame)
$ RESP_RACE: Factor w/ 4 levels "1","2","3","4": 4 4 2 4 4 1 1 1 1 1 ...
$ pizza : Factor w/ 4 levels "1","2","3","4": 2 2 1 2 1 1 2 2 2 2 ...
- attr(*, "na.action")= 'omit' Named int [1:22865] 1 2 3 6 7 8 9 10 11 12 ...
..- attr(*, "names")= chr [1:22865] "1" "2" "3" "6" ...
This code replaces the numeric representations with readable values.
df$RESP_RACE <- df$RESP_RACE |>
case_match("1"~"White","2"~"Black","3"~"Asian","4"~"Other")df$pizza <- df$pizza |>
case_match("1"~"PizzaHut","2"~"Dominos","3"~"PapaJohns","4"~"LittleCaesars")A frequency table is created. Margins are added with the sums of each column and row. This is necessary to calculated expected values.
sums <- addmargins(table(df$RESP_RACE, df$pizza))
sums
Dominos LittleCaesars PapaJohns PizzaHut Sum
Asian 33 7 13 27 80
Black 76 67 30 103 276
Other 165 139 32 100 436
White 508 476 238 560 1782
Sum 782 689 313 790 2574
The chisq.test() function takes a frequency table as an input (a new table is created without the margins). The expected values are displayed for each cell.
frequency_table <- table(df$RESP_RACE, df$pizza)
expectedValues <- chisq.test(frequency_table)$expected
expectedValues
Dominos LittleCaesars PapaJohns PizzaHut
Asian 24.30458 21.41414 9.72805 24.55322
Black 83.85082 73.87879 33.56177 84.70862
Other 132.45998 116.70707 53.01787 133.81507
White 541.38462 477.00000 216.69231 546.92308
The chi^^2 statistic is calculated by squaring the difference between the observed and expected values, then dividing by the expected values.
chisq <- ((frequency_table-expectedValues)^2)/expectedValues
chisq
Dominos LittleCaesars PapaJohns PizzaHut
Asian 3.110946197 9.702348961 1.100493818 0.243825819
Black 0.735059151 0.640477789 0.377996037 3.949708913
Other 7.993754609 4.258308374 8.332113185 8.545070331
White 2.058670514 0.002096436 2.095218590 0.312669047
This table shows us that Asian/Little Caesars contributes the most to the chi^^2 statistic total of 53.4. The total chi square is 53.45. The degrees of freedom are (r-1)(c-1) = 9. The critical value for p=.05 and df=3 is 16.9. 53>16, so we can reject the null hypothesis that there is no relationship.
addmargins(chisq)
Dominos LittleCaesars PapaJohns PizzaHut Sum
Asian 3.110946197 9.702348961 1.100493818 0.243825819 14.157614795
Black 0.735059151 0.640477789 0.377996037 3.949708913 5.703241890
Other 7.993754609 4.258308374 8.332113185 8.545070331 29.129246499
White 2.058670514 0.002096436 2.095218590 0.312669047 4.468654587
Sum 13.898430472 14.603231560 11.905821630 13.051274110 53.458757772
This shows the eigan values for the new dimensions.
pizza.cca <- cca(frequency_table)
summary(pizza.cca)
Call:
cca(X = frequency_table)
Partitioning of scaled Chi-square:
Inertia Proportion
Total 0.02077 1
Unconstrained 0.02077 1
Eigenvalues, and their contribution to the scaled Chi-square
Importance of components:
CA1 CA2 CA3
Eigenvalue 0.01411 0.005141 0.001514
Proportion Explained 0.67957 0.247513 0.072914
Cumulative Proportion 0.67957 0.927086 1.000000
Scaling 2 for species and site scores
* Species are scaled proportional to eigenvalues
* Sites are unscaled: weighted dispersion equal on all dimensions
Species scores
CA1 CA2 CA3
Dominos -0.09709 -0.091360 4.712e-05
LittleCaesars -0.10740 0.097854 9.198e-03
PapaJohns 0.17396 -0.010912 8.750e-02
PizzaHut 0.12085 0.009415 -4.274e-02
Site scores (weighted averages of species scores)
CA1 CA2 CA3
Asian 1.3894 -5.3923 0.4092
Black 0.7937 0.1799 -2.7683
Other -2.1607 -0.3928 -0.2838
White 0.3434 0.3103 0.4798
vegdist(frequency_table, method = "chisq") Asian Black Other
Black 0.4241554
Other 0.5541624 0.3663792
White 0.4273424 0.1375780 0.3031929
pizza.cca <- ca::ca(frequency_table)
pizza.cca
Principal inertias (eigenvalues):
1 2 3
Value 0.014114 0.005141 0.001514
Percentage 67.96% 24.75% 7.29%
Rows:
Asian Black Other White
Mass 0.031080 0.107226 0.169386 0.692308
ChiDist 0.420678 0.143750 0.258477 0.050077
Inertia 0.005500 0.002216 0.011317 0.001736
Dim. 1 1.389359 0.793746 -2.160742 0.343356
Dim. 2 -5.392331 0.179879 -0.392816 0.310330
Columns:
Dominos LittleCaesars PapaJohns PizzaHut
Mass 0.303807 0.267677 0.121601 0.306915
ChiDist 0.133315 0.145584 0.195033 0.128532
Inertia 0.005400 0.005673 0.004625 0.005070
Dim. 1 -0.817238 -0.904032 1.464276 1.017265
Dim. 2 -1.274237 1.364812 -0.152200 0.131312
summary(pizza.cca)
Principal inertias (eigenvalues):
dim value % cum% scree plot
1 0.014114 68.0 68.0 *****************
2 0.005141 24.8 92.7 ******
3 0.001514 7.3 100.0 **
-------- -----
Total: 0.020769 100.0
Rows:
name mass qlt inr k=1 cor ctr k=2 cor ctr
1 | Asin | 31 999 265 | 165 154 60 | -387 845 904 |
2 | Blck | 107 438 107 | 94 430 68 | 13 8 3 |
3 | Othr | 169 998 545 | -257 986 791 | -28 12 26 |
4 | Whit | 692 861 84 | 41 664 82 | 22 197 67 |
Columns:
name mass qlt inr k=1 cor ctr k=2 cor ctr
1 | Dmns | 304 1000 260 | -97 530 203 | -91 470 493 |
2 | LttC | 268 996 273 | -107 544 219 | 98 452 499 |
3 | PpJh | 122 799 223 | 174 796 261 | -11 3 3 |
4 | PzzH | 307 889 244 | 121 884 318 | 9 5 5 |
plot(ca(frequency_table, graph = TRUE))chisq.test(frequency_table)$stdres
Dominos LittleCaesars PapaJohns PizzaHut
Asian 2.14752172 -3.69779717 1.13711356 0.60256339
Black -1.08749100 -0.98975998 -0.69426756 2.52649621
Other 3.71801464 2.64586432 -3.37934080 -3.85269907
White -3.10006194 -0.09645641 2.78426354 1.21085174