Chi-square
Read this: https://www.r-bloggers.com/chi-square-test-of-independence-in-r/
Should pubs be open 24/7?
This data is fictional!
Table <- matrix(c(120, 80, 90, 110), nrow=2)
row.names(Table) <- c("Male","Female")
colnames(Table) <- c("Yes","No")
Table
## Yes No
## Male 120 90
## Female 80 110
chisq.test(Table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: Table
## X-squared = 8.4311, df = 1, p-value = 0.003689
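The output above applies Yates’ continuity correction, which chisq.test() uses by default for 2x2 tables. As an aside (a minimal sketch; output not shown), the uncorrected statistic can be requested with the correct argument:
# same test without the continuity correction
chisq.test(Table, correct = FALSE)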
firstchi <- chisq.test(Table)
The result of the chisq.test() function is a list containing the following components:
str(firstchi)
## List of 9
## $ statistic: Named num 8.43
## ..- attr(*, "names")= chr "X-squared"
## $ parameter: Named int 1
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.00369
## $ method : chr "Pearson's Chi-squared test with Yates' continuity correction"
## $ data.name: chr "Table"
## $ observed : num [1:2, 1:2] 120 80 90 110
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Male" "Female"
## .. ..$ : chr [1:2] "Yes" "No"
## $ expected : num [1:2, 1:2] 105 95 105 95
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Male" "Female"
## .. ..$ : chr [1:2] "Yes" "No"
## $ residuals: num [1:2, 1:2] 1.46 -1.54 -1.46 1.54
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Male" "Female"
## .. ..$ : chr [1:2] "Yes" "No"
## $ stdres : num [1:2, 1:2] 3 -3 -3 3
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Male" "Female"
## .. ..$ : chr [1:2] "Yes" "No"
## - attr(*, "class")= chr "htest"
The R code for extracting these values is as follows. For example, printing the standardized residuals:
firstchi$stdres
## Yes No
## Male 3.003757 -3.003757
## Female -3.003757 3.003757
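Any other component is extracted the same way; for example (a minimal sketch, output not shown), the expected counts under independence (row total * column total / grand total) and the p-value:
# expected cell counts and the p-value of the test
firstchi$expected
firstchi$p.value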
If the chi-squared test is significant, we need to take a look at the residuals.
If a standardized residual is lower than -2, the cell contains fewer observations than would be expected under independence. If a standardized residual is higher than 2, the cell contains more observations than would be expected under independence.
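A quick way to flag the cells that cross this ±2 rule of thumb (a minimal sketch; output not shown):
# TRUE marks cells with noticeably more or fewer observations than expected under independence
abs(firstchi$stdres) > 2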
We can plot the results using the assocplot() function:
assocplot(t(Table), main="Residuals and number of observations")
library(corrplot)
## corrplot 0.84 loaded
corrplot(firstchi$stdres, is.cor = FALSE)
Positive residuals are in blue. Positive values in cells indicate an attraction (positive association) between the corresponding row and column variables.
Negative residuals are in red. They imply a repulsion (negative association) between the corresponding row and column variables.
library(foreign)
library(sjPlot)
alcohol <- read.spss("C:/Users/lssi5/Dropbox/Data Analysis II (2018)/Lab 3 One-way ANOVA/alcohol.sav", use.value.labels = T, to.data.frame = T)
A basic version of the plot:
library(sjPlot)
sjp.xtab(alcohol$marital, alcohol$sex)
A customized version of the plot:
sjp.xtab(alcohol$marital, alcohol$sex, margin = "row", bar.pos = "stack",
show.summary = TRUE, coord.flip = TRUE)
chi <- chisq.test(alcohol$sex, alcohol$marital)
chi
##
## Pearson's Chi-squared test
##
## data: alcohol$sex and alcohol$marital
## X-squared = 1151.3, df = 5, p-value < 2.2e-16
Table <- table(alcohol$sex, alcohol$marital)
assocplot(t(Table))
library(corrplot)
corrplot(chi$stdres, is.cor = FALSE)
data <- read.csv("C:/Users/lssi5/YandexDisk/teaching/DA MA 2018/dontcryformeargentina.csv")
We already tried today to plot the relationship between subjective class and level of education.
Let’s now run a formal test (i.e. the chi-squared test):
chisq.test(data$educf, data$V238)
## Warning in chisq.test(data$educf, data$V238): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: data$educf and data$V238
## X-squared = 92.078, df = 6, p-value < 2.2e-16
argchi <- chisq.test(data$educf, data$V238)
## Warning in chisq.test(data$educf, data$V238): Chi-squared approximation may
## be incorrect
argchi$observed
## data$V238
## data$educf Lower class Lower middle class Upper middle class
## primary 57 94 2
## secondary 44 320 8
## tertiary 0 38 7
## data$V238
## data$educf Working class
## primary 84
## secondary 177
## tertiary 11
counts <- table(data$educf, data$V238)
barplot(counts,
legend = rownames(counts), las = 2)
data1 <- data
table(data1$V238)
##
## Lower class Lower middle class Upper middle class
## 105 562 34
## Working class
## 306
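Before recoding, it is worth looking at the expected counts: any expected cell count below 5 is what triggers the “Chi-squared approximation may be incorrect” warning (a minimal sketch; output not shown):
# expected counts for the education x class table, and the cells below the usual threshold of 5
argchi$expected
argchi$expected < 5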
data1$V238[ data1$V238 == "Upper middle class"] <- NA
argchi1 <- chisq.test(data1$educf, data1$V238)
No warning! Yay! (We could also merge two categories to avoid the warning in the chi-squared test; see the sketch after the residuals below.)
argchi1
##
## Pearson's Chi-squared test
##
## data: data1$educf and data1$V238
## X-squared = 59.259, df = 4, p-value = 4.151e-12
argchi1$stdres
## data1$V238
## data1$educf Lower class Lower middle class Working class
## primary 6.6436471 -5.3860589 1.0700456
## secondary -4.9700666 3.4742965 -0.2129345
## tertiary -2.6957869 3.3010847 -1.6152749
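Here is a sketch of the alternative mentioned above: merging “Upper middle class” into a neighbouring category instead of setting it to NA. It assumes V238 was read as a factor (which droplevels() below also relies on); data2 and the target label are illustrative choices only:
# recode the factor level, then rerun the test (output not shown)
data2 <- data
levels(data2$V238)[levels(data2$V238) == "Upper middle class"] <- "Lower middle class"
chisq.test(data2$educf, data2$V238)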
Let’s remake the plot
library(RColorBrewer)
counts <- table(data1$educf, data1$V238)
barplot(counts,
col=brewer.pal(n = 3, name = "Set2"),
legend = rownames(counts), las = 2)
Hmmm… We still have “Upper middle class” as an empty category. Let’s drop that level.
data1$V238 <- droplevels(data1$V238)
pal <- brewer.pal(n = 3, name = "Set2")
counts <- table(data1$educf, data1$V238)
barplot(counts,
col = pal,
legend = rownames(counts), las = 2)
Let’s make bars of the same size to show the difference in proportions
par(mar = c(10, 4, 5, 1), xpd = T)
barplot(prop.table(counts, 2)*100, col = pal,
las = 2)
legend(1, 130, legend = rownames(counts),
fill = pal, horiz = T, bty = "n")
A similar plot using the sjPlot package:
sjp.xtab(data1$V238, data1$educf, margin = "row", bar.pos = "stack",
show.summary = TRUE)
Let’s see how gender is associated with the level of satisfaction with life (V23, 11 categories).
barplot(table(data$gndr))
barplot(table(data$V23))
chisq.test(data$gndr, data$V23)
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: data$gndr and data$V23
## X-squared = 6.203, df = 9, p-value = 0.7194
chisq.test(data$gndr, data$V23)$observed
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may
## be incorrect
## data$V23
## data$gndr 2 3 4 5 6 7 8 9 Completely dissatisfied
## Female 2 6 14 41 56 124 162 77 3
## Male 2 5 16 20 54 114 145 72 4
## data$V23
## data$gndr Completely satisfied
## Female 56
## Male 47
chisq.test(data$gndr, data$V23)$expected
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may
## be incorrect
## data$V23
## data$gndr 2 3 4 5 6 7 8
## Female 2.121569 5.834314 15.91176 32.35392 58.34314 126.2333 162.8304
## Male 1.878431 5.165686 14.08824 28.64608 51.65686 111.7667 144.1696
## data$V23
## data$gndr 9 Completely dissatisfied Completely satisfied
## Female 79.02843 3.712745 54.63039
## Male 69.97157 3.287255 48.36961
corrplot(chisq.test(data$gndr, data$V23)$stdres, is.cor = FALSE)
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may
## be incorrect
The chi-squared test does not take the order of categories into account (it treats all variables as nominal).
When you use the chi-squared test on variables with a large number of categories, you will probably violate its assumptions.
We’ll talk about more appropriate ways to assess the relationship between such variables later.
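As an aside, when expected counts are small (the source of the repeated warning above), chisq.test() can also compute a Monte Carlo p-value instead of relying on the asymptotic approximation; a minimal sketch (output not shown):
# p-value based on 2000 Monte Carlo replicates rather than the chi-squared approximation
chisq.test(data$gndr, data$V23, simulate.p.value = TRUE, B = 2000)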