Library setup

library(tidyverse)
library(flextable)

Motivation

On one occasion I helped a student with her statistics assignment, and to this day I don’t like how universities concentrate so much on theory without students working through the practical side of the methods.

This tutorial shows how to:

  • calculate frequencies
  • calculate proportions
  • test for associations

Contingency tables

When we work with categorical data in research, we analyse response data where there are either no predictor variables or all of the predictor variables are also categorical.

  • If you have no predictor variable, you are essentially testing whether the heights of a bar chart differ significantly from each other or from some hypothesized distribution. This is a one-way table and you perform a one-proportion Z-test, exact binomial test, chi-squared goodness-of-fit test, or G-test; a minimal one-way sketch follows this list.

  • If you have a single categorical predictor variable, you are testing whether the joint frequency counts differ from the expected frequency counts in the saturated model. This is a two-way table and you compare the frequency counts with a proportion difference test, chi-squared independence test, G-test, or Fisher’s exact test.

  • If you have multiple categorical predictor variables, you have a k-way contingency table and you can extend the chi-square independence test and G-test to cover this as well.
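As a minimal sketch of the one-way case (with made-up counts, not data from this tutorial), R’s built-in chisq.test() runs the goodness-of-fit test directly:

one_way <- c(A = 60, B = 25, C = 15)  # hypothetical counts for three categories
# chi-squared goodness-of-fit test against a uniform hypothesized distribution
chisq.test(one_way, p = c(1/3, 1/3, 1/3))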

Pearson Chi-square Test for independence

  • The chi-square independence test tests whether observed joint frequency counts \(O_{ij}\) differ from expected frequency counts \(E_{ij}\) under the independence model (the model of independent explanatory variables, \(\pi_{ij} = \pi_{i+} \pi_{+j}\)). \(H_0\) is \(O_{ij} = E_{ij}\).
  • It determines whether an association exists, that is, whether the two variables are dependent on each other. Sometimes \(H_0\) represents the model whose validity is to be tested. Contrast this with the conventional formulation of \(H_0\) as the hypothesis that is to be disproved. The goal in this case is not to disprove the model, but to see whether the data are consistent with the model and whether any deviation can be attributed to chance.
  • These tests do not measure the strength of an association.
  • These tests depend on and reflect the sample size: double the sample size by copying each observation and you double the \(\chi^2\) statistic, even though the strength of the association does not change.
  • The Pearson chi-square test is not appropriate when more than about 20% of the cells have an expected cell frequency of less than 5 (large-sample p-values are not appropriate).
  • When the sample size is small, exact p-values can be calculated (this is prohibitive for large samples); the calculation of exact p-values assumes that the column totals and row totals are fixed. A minimal sketch follows below.
    \[X^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

where \(O_{ij} = p_{ij} n\) and \(E_{ij} = \pi_{ij} n\). The sampling distribution of \(X^2\) approaches the \(\chi^2\) distribution with \((r-1)(c-1)\) degrees of freedom as the sample size \(n \rightarrow \infty\).
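For the small-sample case, Fisher’s exact test is the usual choice; here is a minimal sketch with hypothetical 2x2 counts (fisher.test() treats the row and column totals as fixed):

small_tab <- matrix(c(8, 2, 1, 5), nrow = 2)  # hypothetical small 2x2 table
fisher.test(small_tab)  # exact p-value, no large-sample approximation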

Question (Taken from Zimbabwe Open University)

A ZOU Regional Coordinator observed that, at one weekend school, the 150 students who turned up could be categorized by gender and programme as shown in the table below.

Gender   Psychology   Development studies   R & A Management   Total
Male     46           29                    28                 103
Female   27           14                    6                  47
Total    73           43                    34                 150

Record the data in R:

dat_o <- matrix(
  c(46, 27, 29, 14, 28, 6),  # filled column-wise: one programme per column
  nrow = 2,
  dimnames = list(Gender = c("Male", "Female"),
    study = c("Psychology", "Development studies", "R & A Management"))
)

View the observed values in a neat table:

dat_o %>% data.frame() %>% rownames_to_column(var = " ") %>%
  janitor::adorn_totals(where = c("row", "col")) %>%
  flextable::flextable() %>%
  flextable::colformat_int(j = c(2, 3, 4), big.mark = ",") %>%
  flextable::autofit()

         Psychology   Development.studies   R...A.Management   Total
Male     46           29                    28                 103
Female   27           14                    6                  47
Total    73           43                    34                 150

The contingency table for the data is given above.

Look at the observed row proportions:

prop.table(dat_o, margin = 1) %>% data.frame() %>% rownames_to_column(var = " ") %>%
  janitor::adorn_totals(where = c("col")) %>%
  flextable::flextable() %>%
  flextable::colformat_double(j = c(2, 3, 4), digits = 2) %>%
  flextable::autofit()

         Psychology   Development.studies   R...A.Management   Total
Male     0.45         0.28                  0.27               1
Female   0.57         0.30                  0.13               1
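As an optional visual check (not part of the original assignment), base R’s mosaicplot() shows the same proportions graphically:

# mosaic plot of the observed counts: bar widths reflect the row totals,
# splits within each bar reflect the row proportions above
mosaicplot(dat_o, main = "Gender by programme", color = TRUE)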

Calculating expected frequencies

\[ E_{ij}=\frac{O_{i+}\times O_{+j}}{n} \]

where \(O_{i+}\) and \(O_{+j}\) are the row and column totals.

\[ E_{11}=\frac{73\times 103}{150}=50.12667 \]

\[ E_{12}=\frac{43\times 103}{150}=29.52667 \]

\[ E_{13}=\frac{34\times 103}{150}=23.34667 \]

\[ E_{21}=\frac{73\times 47}{150}=22.87333 \]

\[ E_{22}=\frac{43\times 47}{150}=13.47333 \]

\[ E_{23}=\frac{34\times 47}{150}=10.65333 \]

Doing this in R yields:

expected <- matrix(rowSums(dat_o)) %*% t(matrix(colSums(dat_o))) / sum(dat_o)
expected
#>          [,1]     [,2]     [,3]
#> [1,] 50.12667 29.52667 23.34667
#> [2,] 22.87333 13.47333 10.65333

Define the hypotheses

  • \(H_0\) : There is no association between gender and student programme choice
  • \(H_1\) : There is an association between gender and student programme choice
The test statistic is calculated as follows: \[\chi^2=\sum_{j=1}^{c}\sum_{i=1}^{r}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}\]

We reject the null hypothesis at the 5% level of significance if \[\chi^2>\chi^{2}_{0.05}\left((r-1)(c-1)\right)\]

Calculating \(\chi^2\):

\[\chi^2=\frac{(46-50.12667)^2}{50.12667}+\frac{(29-29.52667)^2}{29.52667}+\frac{(28-23.34667)^2}{23.34667}\\
+\frac{(27-22.87333)^2}{22.87333}+\frac{(14-13.47333)^2}{13.47333}+\frac{(6-10.65333)^2}{10.65333}\]

\[\chi^2=4.07425\] and in R we can compute it as follows:

# expected counts under independence: (row total x column total) / grand total
vitc_e <- outer(rowSums(dat_o), colSums(dat_o)) / sum(dat_o)
X2 <- sum((dat_o - vitc_e)^2 / vitc_e)
print(X2)
#> [1] 4.074251

The degrees of freedom are:

vitc_dof <- (nrow(dat_o) - 1) * (ncol(dat_o) - 1)
print(vitc_dof)
#> [1] 2

\[\chi^2_{0.05}\left((2-1)(3-1)\right)=\chi^2_{0.05}(2)=5.99\]
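The critical value can be checked in R with qchisq():

qchisq(p = 0.95, df = 2)  # upper 5% point of the chi-square distribution, 2 df
#> [1] 5.991465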

  • Conclusion

Since \(\chi^{2}=4.07425 < 5.99\), we fail to reject \(H_0\) and conclude that there is no evidence of an association between gender and student programme choice.

In R we can compute the p-value directly:

pchisq(q = X2, df = vitc_dof, lower.tail = FALSE)
#> [1] 0.130403

Conclusion

Since the p-value (0.1304) is greater than 0.05, we again fail to reject \(H_0\): the data show no evidence of an association between the two variables.

The deviance statistic (from the G-test) is

\[G^2 = 2 \sum_{ij} O_{ij} \log \left( \frac{O_{ij}}{E_{ij}} \right)\]

G2 <- 2 * sum(dat_o * log(dat_o / vitc_e))
print(G2)
#> [1] 4.371267

\(X^2\) and \(G^2\) increase with the disagreement between the saturated model proportions \(p_{ij}\) and the independence model proportions \(\pi_{ij}\).

The associated p-value for the deviance statistic is

pchisq(q = G2, df = vitc_dof, lower.tail = FALSE)
#> [1] 0.1124065

Rather than doing all of the manual labour:

  • R has a built-in function, chisq.test(), which takes a table object.

Recall that:

dat_o
#>         study
#> Gender   Psychology Development studies R & A Management
#>   Male           46                  29               28
#>   Female         27                  14                6
vitc_chisq_test <- chisq.test(dat_o, correct = FALSE)
print(vitc_chisq_test)
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  dat_o
#> X-squared = 4.0743, df = 2, p-value = 0.1304

By default chisq.test() applies the Yates correction, which yields more conservative p-values.

chisq.test(dat_o)
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  dat_o
#> X-squared = 4.0743, df = 2, p-value = 0.1304

These p-values are not small enough to reject the independence model.

Here is why the corrected and uncorrected calls agree: this data set is 2x3, so the degrees of freedom are \((2-1)(3-1) = 2\), and the Yates continuity correction does not apply to tables other than 2x2, so correct = c(TRUE, FALSE) has no effect in chisq.test() here.
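To see the correction actually change something, here is a sketch on a hypothetical 2x2 table (any 2x2 counts would do):

tab22 <- matrix(c(12, 5, 7, 15), nrow = 2)  # hypothetical 2x2 counts
chisq.test(tab22, correct = FALSE)$statistic  # uncorrected Pearson statistic
chisq.test(tab22, correct = TRUE)$statistic   # smaller, Yates-corrected statistic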

Expected frequencies from the output:

vitc_chisq_test$expected
#>         study
#> Gender   Psychology Development studies R & A Management
#>   Male     50.12667            29.52667         23.34667
#>   Female   22.87333            13.47333         10.65333
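The Pearson residuals \((O_{ij}-E_{ij})/\sqrt{E_{ij}}\) indicate which cells contribute most to the statistic, and chisq.test() returns them as well:

# Pearson residuals: (observed - expected) / sqrt(expected)
vitc_chisq_test$residuals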