Anna Shirokanova and Olesya Volchenko
February 15, 2022
Read this: https://www.r-bloggers.com/chi-square-test-of-independence-in-r/
Chi-squared calculated on tabular data
This data is fictional!
Table <- matrix(c(120, 80, 90, 110), nrow=2)
row.names(Table) <- c("Male","Female")
colnames(Table) <- c("Yes","No")
Table## Yes No
## Male 120 90
## Female 80 110
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: Table
## X-squared = 8.4311, df = 1, p-value = 0.003689
There is a statistically significant association between opinion on pubs and sex of the respondent, X(1) = 8.43, p = 0.004. It means that males and females prefer different opinions on the working hours of pubs.
The result of chisq.test() function is a list containing the following components:
## List of 9
## $ statistic: Named num 8.43
## ..- attr(*, "names")= chr "X-squared"
## $ parameter: Named int 1
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.00369
## $ method : chr "Pearson's Chi-squared test with Yates' continuity correction"
## $ data.name: chr "Table"
## $ observed : num [1:2, 1:2] 120 80 90 110
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Male" "Female"
## .. ..$ : chr [1:2] "Yes" "No"
## $ expected : num [1:2, 1:2] 105 95 105 95
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Male" "Female"
## .. ..$ : chr [1:2] "Yes" "No"
## $ residuals: num [1:2, 1:2] 1.46 -1.54 -1.46 1.54
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Male" "Female"
## .. ..$ : chr [1:2] "Yes" "No"
## $ stdres : num [1:2, 1:2] 3 -3 -3 3
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Male" "Female"
## .. ..$ : chr [1:2] "Yes" "No"
## - attr(*, "class")= chr "htest"
The format of the R code to use for getting these values is as follow: printing standardized residuals
## Yes No
## Male 3.003757 -3.003757
## Female -3.003757 3.003757
If the chi squared test is significant we need to take a look at residuals.
If the value of standardized residual is lower than -2 it means that the cell contains fewer observations that it was expected (the case of variables independence). If the value of standardized residual is higher than 2 it means that the cell contains more observations that it was expected.
stdres formula: observed - expected / (expected * (1 - row sum/total) * (1 - column sum/total))
“A standardized Pearson residual that exceeds about 2 or 3 in absolute value indicates lack of fit of H0 in that cell. Larger values are more relevant when df is larger and it becomes more likely that at least one is large simply by chance.” (Agresti, 2002, p.81)
We can plot the results using corrplot() function
Positive residuals are in blue. Positive values in cells specify an attraction (positive association) between the corresponding row and column variables.
Negative residuals are in red. This implies a repulsion (negative association)between the corresponding row and column variables.
Our interpretation: According to the data, there are many more males saying that pubs should be open 24/7 than we would observe if the sex and opinion on pubs were independent.
chi-squared calculated on raw data
data <- read.csv("/Users/olesyavolchenko/Yandex.Disk.localized/teaching/Radboud/data/argentina.csv")The dataset is available here http://www.worldvaluessurvey.org/wvs.jsp (WVS, round 6, Argentina subsample)
Let’s explore the relationship between subjective class and the level of education.
Contingency table:
##
## Lower class Lower middle class Upper middle class Working class
## primary 57 94 2 84
## secondary 44 320 8 177
## tertiary 0 38 7 11
## Warning in chisq.test(data$educf, data$V238): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: data$educf and data$V238
## X-squared = 92.078, df = 6, p-value < 2.2e-16
## Warning in chisq.test(data$educf, data$V238): Chi-squared approximation may be
## incorrect
These statistics show strong evidence of association between the subjectively perceived social class of the respondents and their levels of education. However, some of the chi-squared assumptions seem to be broken, hence, the warning message.
## data$V238
## data$educf Lower class Lower middle class Upper middle class Working class
## primary 57 94 2 84
## secondary 44 320 8 177
## tertiary 0 38 7 11
## data$V238
## data$educf Lower class Lower middle class Upper middle class Working class
## primary 28.42874 127.22565 4.785036 76.56057
## secondary 65.85392 294.71259 11.084323 177.34917
## tertiary 6.71734 30.06176 1.130641 18.09026
Here, deleting the very rare category (or uniting it with a similar category) can solve the technical problem (the researcher is responsible for this decision).
##
## Lower class Lower middle class Upper middle class Working class
## 105 562 34 306
No warnings! Yay! (we also can unite 2 categories to avoid warnings in chi-squared)
##
## Pearson's Chi-squared test
##
## data: data$educf and data$V238
## X-squared = 59.259, df = 4, p-value = 4.151e-12
## data$V238
## data$educf Lower class Lower middle class Working class
## primary 6.6436471 -5.3860589 1.0700456
## secondary -4.9700666 3.4742965 -0.2129345
## tertiary -2.6957869 3.3010847 -1.6152749
Now, the statistic X is smaller (59.3 vs. 92.1 above) on fewer df’s (4 vs. 6), but the results still provide strong evidence of association.
The most contributing cells are the Lower class * Primary education, Lower class * Primary education, and Lower middle class* Primary education. In the data, there are many more respondents with primary education who feel they are lower class than we would expect if the two variables were independent, and there are much fewer respondents with secondary or tertiary education who consider themselves lower class. By contrast, among the lower middle class, there are much fewer respondents with primary education and many more of those with secondary or tertiary education than if the variables were independent. There are other cells with large standardized Pearson residuals as well.
The simplest way to plot chi-square results using sjPlot package
library(sjPlot)
plot_xtab(data$V238, data$educf, margin = "row", bar.pos = "stack",
show.summary = TRUE)## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
These graphs can help both illustrate the proportions of two variables and calculate the chi-squared statistic on the go.
Do not attempt to duplicate, re-create, or perform the same or similar stunts and tricks at home, as personal injury or property damage may result.
Let’s find how gender is associated with the level of satisfaction with life (V23, 11 categories)
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may be
## incorrect
##
## Pearson's Chi-squared test
##
## data: data$gndr and data$V23
## X-squared = 6.203, df = 9, p-value = 0.7194
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may be
## incorrect
## data$V23
## data$gndr 2 3 4 5 6 7 8 9 Completely dissatisfied
## Female 2 6 14 41 56 124 162 77 3
## Male 2 5 16 20 54 114 145 72 4
## data$V23
## data$gndr Completely satisfied
## Female 56
## Male 47
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may be
## incorrect
## data$V23
## data$gndr 2 3 4 5 6 7 8
## Female 2.121569 5.834314 15.91176 32.35392 58.34314 126.2333 162.8304
## Male 1.878431 5.165686 14.08824 28.64608 51.65686 111.7667 144.1696
## data$V23
## data$gndr 9 Completely dissatisfied Completely satisfied
## Female 79.02843 3.712745 54.63039
## Male 69.97157 3.287255 48.36961
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may be
## incorrect
Chi squared doesn’t take into account the order of categories (it treats all variables as nominal)
When you are using chi sqiuared test for variables with great number of categories you’ll probably violate the assumptions.
We’ll talk about more apropriate way to find the relationship between such variables later.