Chi-square
Read this: https://www.r-bloggers.com/chi-square-test-of-independence-in-r/
Should pubs be open 24/7?
This data is fictional!
Table <- matrix(c(120, 80, 90, 110), nrow=2)
row.names(Table) <- c("Male","Female")
colnames(Table) <- c("Yes","No")
Table
## Yes No
## Male 120 90
## Female 80 110
chisq.test(Table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: Table
## X-squared = 8.4311, df = 1, p-value = 0.003689
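The output above applies Yates’ continuity correction, which chisq.test() uses by default for 2x2 tables. As an aside (a minimal sketch; output not shown), the uncorrected statistic can be requested with the correct argument:
# same test without the continuity correction
chisq.test(Table, correct = FALSE)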
firstchi <- chisq.test(Table)
The result of the chisq.test() function is a list containing the following components:
str(firstchi)
## List of 9
## $ statistic: Named num 8.43
## ..- attr(*, "names")= chr "X-squared"
## $ parameter: Named int 1
## ..- attr(*, "names")= chr "df"
## $ p.value : num 0.00369
## $ method : chr "Pearson's Chi-squared test with Yates' continuity correction"
## $ data.name: chr "Table"
## $ observed : num [1:2, 1:2] 120 80 90 110
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Male" "Female"
## .. ..$ : chr [1:2] "Yes" "No"
## $ expected : num [1:2, 1:2] 105 95 105 95
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Male" "Female"
## .. ..$ : chr [1:2] "Yes" "No"
## $ residuals: num [1:2, 1:2] 1.46 -1.54 -1.46 1.54
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Male" "Female"
## .. ..$ : chr [1:2] "Yes" "No"
## $ stdres : num [1:2, 1:2] 3 -3 -3 3
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:2] "Male" "Female"
## .. ..$ : chr [1:2] "Yes" "No"
## - attr(*, "class")= chr "htest"
The R code for extracting these values is as follows. For example, printing the standardized residuals:
firstchi$stdres
## Yes No
## Male 3.003757 -3.003757
## Female -3.003757 3.003757
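Any other component is extracted the same way; for example (a minimal sketch, output not shown), the expected counts under independence (row total * column total / grand total) and the p-value:
# expected cell counts and the p-value of the test
firstchi$expected
firstchi$p.value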
If the chi-squared test is significant, we need to take a look at the residuals.
If a standardized residual is lower than -2, the cell contains fewer observations than would be expected under independence. If a standardized residual is higher than 2, the cell contains more observations than would be expected under independence.
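A quick way to flag the cells that cross this ±2 rule of thumb (a minimal sketch; output not shown):
# TRUE marks cells with noticeably more or fewer observations than expected under independence
abs(firstchi$stdres) > 2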
We can plot the results using the assocplot() function:
assocplot(t(Table), main="Residuals and number of observations")
library(corrplot)
## corrplot 0.84 loaded
corrplot(firstchi$stdres, is.cor = FALSE)
Positive residuals are in blue. Positive values in cells indicate an attraction (positive association) between the corresponding row and column variables.
Negative residuals are in red. They imply a repulsion (negative association) between the corresponding row and column variables.
library(foreign)
library(sjPlot)
alcohol <- read.spss("C:/Users/lssi5/Dropbox/Data Analysis II (2018)/Lab 3 One-way ANOVA/alcohol.sav", use.value.labels = T, to.data.frame = T)
A basic version of the plot:
library(sjPlot)
sjp.xtab(alcohol$marital, alcohol$sex)
A customized version of the plot:
sjp.xtab(alcohol$marital, alcohol$sex, margin = "row", bar.pos = "stack",
show.summary = TRUE, coord.flip = TRUE)
chi <- chisq.test(alcohol$sex, alcohol$marital)
chi
##
## Pearson's Chi-squared test
##
## data: alcohol$sex and alcohol$marital
## X-squared = 1151.3, df = 5, p-value < 2.2e-16
Table <- table(alcohol$sex, alcohol$marital)
assocplot(t(Table))
library(corrplot)
corrplot(chi$stdres, is.cor = FALSE)
data <- read.csv("C:/Users/lssi5/YandexDisk/teaching/DA MA 2018/dontcryformeargentina.csv")
We already tried today to plot the relationship between subjective class and level of education.
Let’s now run a formal test (i.e. the chi-squared test):
chisq.test(data$educf, data$V238)
## Warning in chisq.test(data$educf, data$V238): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: data$educf and data$V238
## X-squared = 92.078, df = 6, p-value < 2.2e-16
argchi <- chisq.test(data$educf, data$V238)
## Warning in chisq.test(data$educf, data$V238): Chi-squared approximation may
## be incorrect
argchi$observed
## data$V238
## data$educf Lower class Lower middle class Upper middle class
## primary 57 94 2
## secondary 44 320 8
## tertiary 0 38 7
## data$V238
## data$educf Working class
## primary 84
## secondary 177
## tertiary 11
counts <- table(data$educf, data$V238)
barplot(counts,
legend = rownames(counts), las = 2)
data1 <- data
table(data1$V238)
##
## Lower class Lower middle class Upper middle class
## 105 562 34
## Working class
## 306
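Before recoding, it is worth looking at the expected counts: any expected cell count below 5 is what triggers the “Chi-squared approximation may be incorrect” warning (a minimal sketch; output not shown):
# expected counts for the education x class table, and the cells below the usual threshold of 5
argchi$expected
argchi$expected < 5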
data1$V238[ data1$V238 == "Upper middle class"] <- NA
argchi1 <- chisq.test(data1$educf, data1$V238)
No warning! Yay! (We could also merge two categories to avoid the warning in the chi-squared test; see the sketch after the residuals below.)
argchi1
##
## Pearson's Chi-squared test
##
## data: data1$educf and data1$V238
## X-squared = 59.259, df = 4, p-value = 4.151e-12
argchi1$stdres
## data1$V238
## data1$educf Lower class Lower middle class Working class
## primary 6.6436471 -5.3860589 1.0700456
## secondary -4.9700666 3.4742965 -0.2129345
## tertiary -2.6957869 3.3010847 -1.6152749
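Here is a sketch of the alternative mentioned above: merging “Upper middle class” into a neighbouring category instead of setting it to NA. It assumes V238 was read as a factor (which droplevels() below also relies on); data2 and the target label are illustrative choices only:
# recode the factor level, then rerun the test (output not shown)
data2 <- data
levels(data2$V238)[levels(data2$V238) == "Upper middle class"] <- "Lower middle class"
chisq.test(data2$educf, data2$V238)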
Let’s remake the plot
library(RColorBrewer)
counts <- table(data1$educf, data1$V238)
barplot(counts,
col=brewer.pal(n = 3, name = "Set2"),
legend = rownames(counts), las = 2)
Hmmm… We still have “Upper middle class” as an empty category. Let’s drop that level.
data1$V238 <- droplevels(data1$V238)
pal <- brewer.pal(n = 3, name = "Set2")
counts <- table(data1$educf, data1$V238)
barplot(counts,
col = pal,
legend = rownames(counts), las = 2)
Let’s make bars of the same size to show the difference in proportions
par(mar = c(10, 4, 5, 1), xpd = T)
barplot(prop.table(counts, 2)*100, col = pal,
las = 2)
legend(1, 130, legend = rownames(counts),
fill = pal, horiz = T, bty = "n")
A similar plot using the sjPlot package:
sjp.xtab(data1$V238, data1$educf, margin = "row", bar.pos = "stack",
show.summary = TRUE)
Let’s see how gender is associated with the level of satisfaction with life (V23, 11 categories).
barplot(table(data$gndr))
barplot(table(data$V23))
chisq.test(data$gndr, data$V23)
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may
## be incorrect
##
## Pearson's Chi-squared test
##
## data: data$gndr and data$V23
## X-squared = 6.203, df = 9, p-value = 0.7194
chisq.test(data$gndr, data$V23)$observed
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may
## be incorrect
## data$V23
## data$gndr 2 3 4 5 6 7 8 9 Completely dissatisfied
## Female 2 6 14 41 56 124 162 77 3
## Male 2 5 16 20 54 114 145 72 4
## data$V23
## data$gndr Completely satisfied
## Female 56
## Male 47
chisq.test(data$gndr, data$V23)$expected
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may
## be incorrect
## data$V23
## data$gndr 2 3 4 5 6 7 8
## Female 2.121569 5.834314 15.91176 32.35392 58.34314 126.2333 162.8304
## Male 1.878431 5.165686 14.08824 28.64608 51.65686 111.7667 144.1696
## data$V23
## data$gndr 9 Completely dissatisfied Completely satisfied
## Female 79.02843 3.712745 54.63039
## Male 69.97157 3.287255 48.36961
corrplot(chisq.test(data$gndr, data$V23)$stdres, is.cor = FALSE)
## Warning in chisq.test(data$gndr, data$V23): Chi-squared approximation may
## be incorrect
The chi-squared test does not take the order of categories into account (it treats all variables as nominal).
When you use the chi-squared test on variables with a large number of categories, you will probably violate its assumptions.
We’ll talk about more appropriate ways to assess the relationship between such variables later.
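As an aside, when expected counts are small (the source of the repeated warning above), chisq.test() can also compute a Monte Carlo p-value instead of relying on the asymptotic approximation; a minimal sketch (output not shown):
# p-value based on 2000 Monte Carlo replicates rather than the chi-squared approximation
chisq.test(data$gndr, data$V23, simulate.p.value = TRUE, B = 2000)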