library(tidyverse)
library(flextable)
On one occassion i helped a student do her statistics assignment , up to know i don’t like how university concentrate so much on the theory without students doing the practical aspect of the methods.
This tutorial shows how to calculate :
when we work with categorical data in research we aim at discussing the analysis of response data when there are either no predictor variables, or all of the predictor variables are also categorical.
If you have no predictor variable, you are essentially testing whether the heights of a bar chart differ significantly from each or from some hypothesized distribution. This is a one-way table and you perform a one-proportion Z-test, exact binomial test, chi-squared goodness-of-fit test, or G-test.
If you have a single categorical predictor variable, you are testing whether the joint frequency counts differ from the expected frequency counts in the saturated model. This is a two-way table and you are compare the frequency counts with a proportion difference test, chi-squared independence test, G-test, or Fisher’s exact test.
If you have multiple categorical predictor variables, you have a k-way contingency table and you can extend the chi-square independence test and G-test to cover this as well.
where \(O_j = p_j n\) and \(E_j = \pi_j n\). The sampling distribution of \(X^2\) approaches the \(\chi_{J-1}^2\) as the sample size \(n \rightarrow \infty\).
A ZOU Regional Coordinator observed that on one weekend school, 150 students who turned up were categorized by gender and programme as shown in
Gender | Physcology | Development studies | R & A Management | Total |
---|---|---|---|---|
Male | 46 | 29 | 28 | 103 |
Female | 27 | 14 | 6 | 47 |
Total | 73 | 43 | 34 | 150 |
<- matrix(
dat_o c(46,27,29,14,28,6),
nrow = 2,
dimnames = list(Gender = c("Male", "Female"),
study = c("Physcology", "Development studies","R & A Management"))
)
%>% data.frame() %>% rownames_to_column(var = " ") %>%
dat_o ::adorn_totals(where = c("row", "col")) %>%
janitor::flextable() %>%
flextable::colformat_int(j = c(2, 3, 4), big.mark = ",") %>%
flextable::autofit() flextable
| Physcology | Development.studies | R...A.Management | Total |
---|---|---|---|---|
Male | 46 | 29 | 28 | 103 |
Female | 27 | 14 | 6 | 47 |
Total | 73 | 43 | 34 | 150 |
contigency table for the data is given above
prop.table(dat_o, margin = 1) %>% data.frame() %>% rownames_to_column(var = " ") %>%
::adorn_totals(where = c("col")) %>%
janitor::flextable() %>%
flextable::colformat_num(j = c(2, 3, 4), digits = 2) %>%
flextable::autofit() flextable
| Physcology | Development.studies | R...A.Management | Total |
---|---|---|---|---|
Male | 0.4466019 | 0.2815534 | 0.2718447 | 1 |
Female | 0.5744681 | 0.2978723 | 0.1276596 | 1 |
\[ E_{ij}=\frac{O_i\times O_{ij}}{n} \]
\[ E_{11}=\frac{73\times 103}{150}=50.12667 \]
\[ E_{12}=\frac{43\times 103}{150}=29.52667 \]
\[ E_{11}=\frac{34\times 103}{150}=23.34667 \]
\[ E_{11}=\frac{73\times 47}{150}=22.87333 \]
\[ E_{22}=\frac{43\times 47}{150}=13.47333 \]
\[ E_{23}=\frac{34\times 47}{150}=10.65333 \]
<- matrix(rowSums(dat_o)) %*% t(matrix(colSums(dat_o))) / sum(dat_o)
expected
expected#> [,1] [,2] [,3]
#> [1,] 50.12667 29.52667 23.34667
#> [2,] 22.87333 13.47333 10.65333
We reject the null hypothesis at 5% level of significance if \[\chi_2>\chi^{2}_{0.05}(r-1)(c-1)\]
calculating \(\chi^2\)
\[\chi^2=\frac{(46-50.12667)^2}{50.12667}+\frac{(29-29.52667)^2}{29.52667}\\ +\frac{(28-23.34667)^2}{23.34667}+\frac{(27-22.87333)^2}{22.87333}+\\ \frac{(14-13.47333)^2}{13.47333}+\frac{(6-10.65333)^2}{10.65333}\]
\[\chi^2=4.07425\] and in
R
We can
<- sum(dat_o) * prop.table(dat_o, 1) * prop.table(dat_o, 2)
vitc_e <- sum((dat_o - vitc_e)^2 / vitc_e)
X2 print(X2)
#> [1] 4.074251
The degrees of freedom is
<- (nrow(dat_o) - 1) * (ncol(dat_o) - 1)
vitc_dof print(vitc_dof)
#> [1] 2
\[\chi^2_{0.05}(2-1)(3-1)=\chi^2_{0.05}(2)=6\]
Since \(\chi^{2}=4.07425<6\) ,we fail to reject \(H_0\) and conclude that there is no association between gender and student programme choice.
in R we can do this manually
pchisq(q = X2, df = vitc_dof, lower.tail = FALSE)
#> [1] 0.130403
Conclusion
since the p-value is greater than
0.05
we conclude that there is no association between the two variables
\[G^2 = 2 \sum_{ij} O_{ij} \log \left( \frac{O_{ij}}{E_{ij}} \right)\]
<- - 2 * sum(dat_o * log(dat_o / vitc_e))
G2 print(G2)
#> [1] 4.371267
\(X^2\) and \(G^2\) increase with the disagreement between the saturated model proportions \(p_{ij}\) and the independence model proportions \(\pi_{ij}\).
The associated p-values for the deviance residuals are
pchisq(q = G2, df = vitc_dof, lower.tail = FALSE)
#> [1] 0.1124065
chisq.test()
and it
takes in a table objectrecall that
dat_o#> study
#> Gender Physcology Development studies R & A Management
#> Male 46 29 28
#> Female 27 14 6
<- chisq.test(dat_o, correct = FALSE)
vitc_chisq_test print(vitc_chisq_test)
#>
#> Pearson's Chi-squared test
#>
#> data: dat_o
#> X-squared = 4.0743, df = 2, p-value = 0.1304
The Yates correction yields more conservative p-values.
chisq.test(dat_o)
#>
#> Pearson's Chi-squared test
#>
#> data: dat_o
#> X-squared = 4.0743, df = 2, p-value = 0.1304
These p-values are evidence for rejecting the independence model.
Here is the chi-square test applied to the above data. Recall this
data set is 2x3, so the degrees of freedom are \((2-1)(3-1) = 3\). The Yates continuity
correction does not apply to data other than 2x2, so the
correct = c(TRUE, FALSE)
has no effect in
chisq.test()
.
$expected
vitc_chisq_test#> study
#> Gender Physcology Development studies R & A Management
#> Male 50.12667 29.52667 23.34667
#> Female 22.87333 13.47333 10.65333