Chi squared

Anna Shirokanova and Olesya Volchenko

February 15, 2022

Two Categorical Variables: Pearson’s Chi-Square Test

Read this: https://www.r-bloggers.com/chi-square-test-of-independence-in-r/

Example 1: Should pubs be open 24/7?

Chi-squared calculated on tabular data

This data is fictional!

Table <- matrix(c(120, 80, 90, 110), nrow=2)
row.names(Table) <- c("Male","Female")
colnames(Table) <- c("Yes","No")
Table
##        Yes  No
## Male   120  90
## Female  80 110
chisq.test(Table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  Table
## X-squared = 8.4311, df = 1, p-value = 0.003689

There is a statistically significant association between the opinion on pubs and the sex of the respondent, χ²(1) = 8.43, p = 0.004. This means that males and females differ in their opinions on the working hours of pubs.
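
Note that, for 2x2 tables, chisq.test() applies Yates' continuity correction by default (as the output header says). As a quick check we can rerun the test without it via the correct argument; the uncorrected statistic is a bit larger, but the conclusion stays the same:

chisq.test(Table, correct = FALSE)   # without the correction, X-squared should be around 9.0, p still below 0.01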

firstchi <- chisq.test(Table)

The result of the chisq.test() function is a list containing the following components:

str(firstchi)
## List of 9
##  $ statistic: Named num 8.43
##   ..- attr(*, "names")= chr "X-squared"
##  $ parameter: Named int 1
##   ..- attr(*, "names")= chr "df"
##  $ p.value  : num 0.00369
##  $ method   : chr "Pearson's Chi-squared test with Yates' continuity correction"
##  $ data.name: chr "Table"
##  $ observed : num [1:2, 1:2] 120 80 90 110
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "Male" "Female"
##   .. ..$ : chr [1:2] "Yes" "No"
##  $ expected : num [1:2, 1:2] 105 95 105 95
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "Male" "Female"
##   .. ..$ : chr [1:2] "Yes" "No"
##  $ residuals: num [1:2, 1:2] 1.46 -1.54 -1.46 1.54
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "Male" "Female"
##   .. ..$ : chr [1:2] "Yes" "No"
##  $ stdres   : num [1:2, 1:2] 3 -3 -3 3
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "Male" "Female"
##   .. ..$ : chr [1:2] "Yes" "No"
##  - attr(*, "class")= chr "htest"

To extract these values, use the $ sign followed by the component name. For example, printing the standardized residuals:

firstchi$stdres
##              Yes        No
## Male    3.003757 -3.003757
## Female -3.003757  3.003757

If the chi-squared test is significant, we need to take a look at the residuals.

If the standardized residual in a cell is lower than -2, the cell contains fewer observations than would be expected if the variables were independent. If the standardized residual is higher than 2, the cell contains more observations than would be expected under independence.

stdres formula: (observed - expected) / sqrt(expected * (1 - row sum/total) * (1 - column sum/total))

“A standardized Pearson residual that exceeds about 2 or 3 in absolute value indicates lack of fit of H0 in that cell. Larger values are more relevant when df is larger and it becomes more likely that at least one is large simply by chance.” (Agresti, 2002, p.81)
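
As a sanity check of the formula above, we can reproduce one standardized residual by hand. This is a small sketch that reuses the Table and firstchi objects created earlier:

n     <- sum(Table)                                       # grand total
obs   <- Table["Male", "Yes"]                             # observed count in the Male/Yes cell
expct <- sum(Table["Male", ]) * sum(Table[, "Yes"]) / n   # expected count under independence
rp    <- sum(Table["Male", ]) / n                         # row sum / total
cp    <- sum(Table[, "Yes"]) / n                          # column sum / total
(obs - expct) / sqrt(expct * (1 - rp) * (1 - cp))         # should match firstchi$stdres["Male", "Yes"], about 3.0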

We can plot the results using the corrplot() function:

library(corrplot)
corrplot(firstchi$stdres, is.cor = FALSE)

Positive residuals are in blue. Positive values in cells specify an attraction (positive association) between the corresponding row and column variables.

Negative residuals are in red. This implies a repulsion (negative association) between the corresponding row and column variables.

Our interpretation: According to the data, there are many more males saying that pubs should be open 24/7 than we would observe if sex and opinion on pubs were independent.

Example 2: Subjective class and level of education in Argentina

Chi-squared calculated on raw data

data <- read.csv("/Users/olesyavolchenko/Yandex.Disk.localized/teaching/Radboud/data/argentina.csv")

The dataset is available here http://www.worldvaluessurvey.org/wvs.jsp (WVS, round 6, Argentina subsample)

Let’s explore the relationship between subjective class and the level of education.

Contingency table:

table(data$educf, data$V238)
##            
##             Lower class Lower middle class Upper middle class Working class
##   primary            57                 94                  2            84
##   secondary          44                320                  8           177
##   tertiary            0                 38                  7            11
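
Raw counts are hard to compare across rows of different sizes, so before running the test it can help to look at row proportions, for instance with prop.table() (a quick sketch):

round(prop.table(table(data$educf, data$V238), margin = 1), 2)   # class shares within each education level
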
chisq.test(data$educf, data$V238)
## Warning in chisq.test(data$educf, data$V238): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  data$educf and data$V238
## X-squared = 92.078, df = 6, p-value < 2.2e-16
argchi <- chisq.test(data$educf, data$V238)
## Warning in chisq.test(data$educf, data$V238): Chi-squared approximation may be
## incorrect

These statistics show strong evidence of an association between the subjectively perceived social class of the respondents and their level of education. However, some of the chi-squared assumptions seem to be violated, hence the warning message.

argchi$observed
##            data$V238
## data$educf  Lower class Lower middle class Upper middle class Working class
##   primary            57                 94                  2            84
##   secondary          44                320                  8           177
##   tertiary            0                 38                  7            11
argchi$expected
##            data$V238
## data$educf  Lower class Lower middle class Upper middle class Working class
##   primary      28.42874          127.22565           4.785036      76.56057
##   secondary    65.85392          294.71259          11.084323     177.34917
##   tertiary      6.71734           30.06176           1.130641      18.09026
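
The usual rule of thumb is that all expected counts should be at least 5; the warning appears because two cells in the Upper middle class column fall below that. A quick check on the argchi object from above:

argchi$expected < 5   # TRUE marks the cells that violate the rule of thumb
# Another option that avoids the large-sample approximation altogether is a
# Monte Carlo p-value: chisq.test(data$educf, data$V238, simulate.p.value = TRUE)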

Here, deleting the very rare category (or merging it with a similar category) can solve the technical problem (the researcher is responsible for this decision).

table(data$V238)
## 
##        Lower class Lower middle class Upper middle class      Working class 
##                105                562                 34                306
data$V238[ data$V238 == "Upper middle class"] <- NA
argchi1 <- chisq.test(data$educf, data$V238)

No warnings! Yay! (We could also merge two categories instead of dropping one to avoid the warning; see the sketch below.)

argchi1
## 
##  Pearson's Chi-squared test
## 
## data:  data$educf and data$V238
## X-squared = 59.259, df = 4, p-value = 4.151e-12
argchi1$stdres
##            data$V238
## data$educf  Lower class Lower middle class Working class
##   primary     6.6436471         -5.3860589     1.0700456
##   secondary  -4.9700666          3.4742965    -0.2129345
##   tertiary   -2.6957869          3.3010847    -1.6152749

Now, the χ² statistic is smaller (59.3 vs. 92.1 above) on fewer degrees of freedom (4 vs. 6), but the results still provide strong evidence of an association.

The most contributing cells are Lower class * Primary education, Lower class * Secondary education, and Lower middle class * Primary education. In the data, there are many more respondents with primary education who feel they belong to the lower class than we would expect if the two variables were independent, and there are far fewer respondents with secondary or tertiary education who consider themselves lower class. By contrast, among the lower middle class, there are far fewer respondents with primary education and many more with secondary or tertiary education than if the variables were independent. There are other cells with large standardized Pearson residuals as well.
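
As mentioned above, instead of dropping the rare category we could merge it with a similar one. A minimal sketch (class3 is a name we made up, and the recode assumes the original V238 values, i.e. it should be run before "Upper middle class" was set to NA):

# Merge "Upper middle class" into "Lower middle class" instead of dropping it
data$class3 <- as.character(data$V238)
data$class3[data$class3 == "Upper middle class"] <- "Lower middle class"
chisq.test(data$educf, data$class3)   # should not trigger the warning: all categories are now reasonably filled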

The simplest way to plot the chi-squared results is to use the sjPlot package:

library(sjPlot)
plot_xtab(data$V238, data$educf, margin = "row", bar.pos = "stack",
         show.summary = TRUE)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.

These graphs help both to illustrate the proportions of the two variables and to display the chi-squared statistic on the go.
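
If sjPlot is not installed, a base-R alternative is a mosaic plot, which shades cells with large Pearson residuals (blue where there are more observations than expected, red where there are fewer). A sketch with the same two variables:

# Base-R alternative: mosaic plot of the contingency table, shaded by residuals
mosaicplot(table(data$educf, data$V238), shade = TRUE,
           main = "Education by subjective class")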

Example 3: A really bad implementation of chi-squared test. Don’t do that!

Do not attempt to duplicate, re-create, or perform the same or similar stunts and tricks at home, as personal injury or property damage may result.

Let’s check how gender is associated with the level of satisfaction with life (V23, 11 categories).

data <- read.csv("argentina.csv")
barplot(table(data$gndr))

barplot(table(data$V23))

chisq.test(data$gndr, data$V23)
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  data$gndr and data$V23
## X-squared = 6.203, df = 9, p-value = 0.7194
chisq.test(data$gndr, data$V23)$observed
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may be
## incorrect
##          data$V23
## data$gndr   2   3   4   5   6   7   8   9 Completely dissatisfied
##    Female   2   6  14  41  56 124 162  77                       3
##    Male     2   5  16  20  54 114 145  72                       4
##          data$V23
## data$gndr Completely satisfied
##    Female                   56
##    Male                     47
chisq.test(data$gndr, data$V23)$expected
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may be
## incorrect
##          data$V23
## data$gndr        2        3        4        5        6        7        8
##    Female 2.121569 5.834314 15.91176 32.35392 58.34314 126.2333 162.8304
##    Male   1.878431 5.165686 14.08824 28.64608 51.65686 111.7667 144.1696
##          data$V23
## data$gndr        9 Completely dissatisfied Completely satisfied
##    Female 79.02843                3.712745             54.63039
##    Male   69.97157                3.287255             48.36961
corrplot(chisq.test(data$gndr, data$V23)$stdres, is.cor = FALSE)
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may be
## incorrect

That’s all Folks