There are 4 things we want to know for each statistical test:
What types of variables are needed for this type of test?
What is the null and alternative hypothesis?
How do we conduct the test?
What are the assumptions? How do we test them? What do we do if the assumptions are not met?
In this lab, we’ll work through these steps for the chi-square test.
Variables
1) Based on what we discussed in lecture, what types of variables do we
need to conduct a chi-square test? Categorical variables.
Null and Alternative Hypotheses
2) Recall: What are the null and alternative hypotheses for the chi-square
test?
Ho: p1 = p2 (proportions are equal)
Ha: p1 ≠ p2 (proportions are unequal)
Conducting the Chi-Square test
To conduct our chi-square test, we will use data from the dataset “High sex ratios in rural China: declining well-being with age in never-married men”. This dataset has information on partnership and education for men aged 20-29 in rural China. 691 men were included in the sample: 351 unpartnered and 340 partnered. The distribution of education is as follows:
Middle or high school: 156 unpartnered, 204 partnered
College or higher: 195 unpartnered, 136 partnered
Create a contingency table of the data below:
cont_table <- data.frame(partnered=c(204,136), Unpartnered=c(156,195))
cont_table
## partnered Unpartnered
## 1 204 156
## 2 136 195
There are two assumptions of the chi-square test:
The data are independent
The expected frequencies > 5 for each cell.
Are the data independent? Yes. Each individual is either partnered or unpartnered, and each individual’s highest education level is either middle/high school or college or higher. No individual is, or could be, counted in more than one cell.
Are the expected frequencies > 5 for each cell? How can you determine this? Yes. For each cell, expected = (row total × column total) / grand total. For this contingency table, all four expected values are well above 100.
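The row-times-column rule can be checked directly in R. The sketch below rebuilds the table as a matrix and computes each expected count by hand, then compares the result with the values chisq.test() stores in its `expected` component. The names `observed` and `expected` and the dimension labels are illustrative choices, not part of the lab.

```r
# Observed counts from the lab: rows = education, columns = partnership status
observed <- matrix(c(204, 136, 156, 195), nrow = 2,
                   dimnames = list(education = c("middle_hs", "college"),
                                   status = c("partnered", "unpartnered")))

# Expected count for each cell = (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
expected

# TRUE when every cell meets the "expected frequency > 5" assumption
all(expected > 5)

# Cross-check against the expected counts computed by chisq.test()
all.equal(expected, chisq.test(observed)$expected, check.attributes = FALSE)
```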
To conduct the chi-square test, we can use the R function
chisq.test(). Recall that you can either store the contingency table
in a separate variable and then pass it to chisq.test(), or do it
all in one line of code (from our lecture notes):
c=2
r=2
df <-(c-1)*(r-1)
df
## [1] 1
chisq.test(data.frame(c(204,136),c(156,195)))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data.frame(c(204, 136), c(156, 195))
## X-squared = 16.128, df = 1, p-value = 5.92e-05
qchisq(0.95, df = df)
## [1] 3.841459
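At this point, can you determine if the data meet the assumptions of our test? Why or why not? Yes, R did not give us a warning. If any expected frequency were below 5, R would warn that the chi-squared approximation may be incorrect. Note that the Yates’ continuity correction shown in the output is not a sign of a problem: chisq.test() applies it by default to every 2 x 2 table.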
What is the p-value? What would you conclude? Is there a significant
difference between the proportion of men who are partnered or not based
on completed education?
p = 5.92e-05. The test statistic we found (16.13) is larger than the
critical value (3.84), and the p-value is less than 0.05.
Therefore, we have strong evidence to reject the null hypothesis in
favor of the alternative that the proportions are different.
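The two decision rules used here (statistic vs. critical value, and p-value vs. 0.05) always agree, because both come from the same chi-square distribution. A minimal sketch, assuming a significance level of 0.05 and using illustrative variable names:

```r
# Run the test on the lab's 2 x 2 table (Yates' correction is the default here)
result <- chisq.test(data.frame(c(204, 136), c(156, 195)))

alpha <- 0.05                       # significance level (assumed)
stat <- unname(result$statistic)    # X-squared reported by the test
df <- unname(result$parameter)      # degrees of freedom

# Rule 1: reject the null if the statistic exceeds the critical value
critical <- qchisq(1 - alpha, df = df)
stat > critical

# Rule 2: reject the null if the upper-tail p-value is below alpha
p_value <- pchisq(stat, df = df, lower.tail = FALSE)
p_value < alpha
```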
ODDS RATIO
Odds ratios allow us to examine the strength of the relationship between two categorical variables, but an odds ratio can only compare four cells (it can’t handle larger contingency tables). So, we’ll need to decide which groups we would like to compare in this example. Let’s examine the relationship between being partnered and attaining higher education vs. middle/high school. To calculate the odds ratio, we need to calculate: odds (partnered after middle/high school) / odds (partnered after higher education). These odds can be calculated as follows:
Odds (partnered after higher education) = Number that have higher education and are partnered / Number that have higher education and are not partnered.
Odds (partnered after middle/high school) = Number that have middle/high school and are partnered / Number that have middle/high school but aren’t partnered.
Note: The odds ratio is easier to interpret if you divide the larger odds by the smaller odds.
8) Please calculate the odds ratio for this example
# Observed
part_mid_hs <- 204
part_college <- 136
unpart_mid_hs <- 156
unpart_college <- 195
# Odds (partnered after higher education) = Number that have higher education and are partnered / Number that have higher education and are not partnered
odds_part_after_college <- part_college / unpart_college
odds_part_after_college
## [1] 0.6974359
# Odds (partnered after middle/high school) = Number that have middle/high school and are partnered / Number that have middle/high school but aren’t partnered.
odds_part_precollege <- part_mid_hs / unpart_mid_hs
odds_part_precollege
## [1] 1.307692
# Odds ratio: larger odds / smaller odds
partnered_secondary_per_partnered_tertiary <- odds_part_precollege / odds_part_after_college
partnered_secondary_per_partnered_tertiary
## [1] 1.875
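For a 2 x 2 table, the same odds ratio can also be obtained in one step as the cross-product ratio (a × d) / (b × c). A short sketch using the lab's counts; the matrix name `tab` and its row and column labels are illustrative:

```r
# Rows = education level, columns = partnership status
tab <- matrix(c(204, 136, 156, 195), nrow = 2,
              dimnames = list(education = c("middle_hs", "college"),
                              status = c("partnered", "unpartnered")))

# Cross-product ratio: (a * d) / (b * c)
odds_ratio <- (tab["middle_hs", "partnered"] * tab["college", "unpartnered"]) /
  (tab["middle_hs", "unpartnered"] * tab["college", "partnered"])
odds_ratio

# Equivalent route: odds of being partnered within each education level,
# then the ratio of the two odds
odds <- tab[, "partnered"] / tab[, "unpartnered"]
odds["middle_hs"] / odds["college"]
```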
Review the slide on communicating results:
From lecture notes:
Start with some descriptive statistics.
The description tells you what the null hypothesis being tested is
A “stat” block is included
The results are interpreted
Write a 4-sentence paragraph describing the results of your chi-square test.
Of the 691 individuals in this survey, 360 had a highest level of
education of middle or high school, while 331 had at least
entered college. A chi-square test for association with Yates’ continuity
correction was conducted to test whether there was an association
between education level and being partnered. Results show a significant
association between education level and partnership status,
χ2(1, N = 691) = 16.13, p = 5.92e-05. Based on the odds
ratio, the odds of being partnered were 1.88 times higher for men whose
highest education was middle or high school than for men with at least
some college education.
FISHER’S EXACT TEST
In this case, we did not have to run Fisher’s Exact Test, but let’s examine an example where that might be necessary.
Lady tea tasting: Some people argue that they can tell (by taste alone) whether a cup of tea with milk had the tea poured first or the milk poured first. An experiment was performed. Eight cups of tea were prepared and presented in random order. Here are the results:
cont_table <- data.frame(Tea=c(3,1), Milk=c(1,3))
cont_table
## Tea Milk
## 1 3 1
## 2 1 3
In this case, the null hypothesis is that the proportions are equal
(meaning the person cannot tell which was really poured first). Evidence
for the proportions not being equal means that there is evidence the
person can tell which was poured first. Run a Fisher’s exact test with
fisher.test() to determine if there is evidence that this
person can tell by taste alone which was poured first.
fisher.test(data.frame(c(3,1),c(1,3)))
##
## Fisher's Exact Test for Count Data
##
## data: data.frame(c(3, 1), c(1, 3))
## p-value = 0.4857
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.2117329 621.9337505
## sample estimates:
## odds ratio
## 6.408309
Is Fisher’s exact test significant? What would you conclude?
No. The p-value of 0.4857 is not significant (p > 0.05).
Results do not show a significant association between which was believed to be poured first (tea or milk) and which was actually poured first. We fail to reject the null hypothesis, as the evidence does not support that the proportions are unequal.
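Fisher’s exact test gets its p-value from the hypergeometric distribution: with the row and column totals held fixed, it adds up the probabilities of every table that is no more likely than the one observed. The sketch below reproduces this for the tea-tasting table using dhyper(); the variable names are illustrative.

```r
# Tea-tasting table: with every margin fixed at 4, the top-left cell
# determines the whole table
tea <- matrix(c(3, 1, 1, 3), nrow = 2)
x_obs <- tea[1, 1]

# Probability of each possible top-left count (0 to 4) under the null
x <- 0:4
probs <- dhyper(x, m = 4, n = 4, k = 4)

# Two-sided p-value: sum over tables no more likely than the observed one
# (the small tolerance guards against floating-point ties)
p_manual <- sum(probs[probs <= probs[x == x_obs] * (1 + 1e-7)])
p_manual

# Should agree with the p-value reported by fisher.test()
fisher.test(tea)$p.value
```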
NO INDEPENDENCE
11) Our other assumption is that our data are independent. What would you do if the data were not independent? (Hint: Check your text for the answer)
Do not run the chi-square test. Consider McNemar’s test instead.
We will use data that reports the number of goals scored in a soccer match by individuals who work for U of I and those that work for Boise State (note: this data is fictional). The contingency table is as follows:
What is the proportion of individuals from U of I who scored? 5 out of 24 individuals (about 0.21).
c=2
r=2
df <-(c-1)*(r-1)
df
## [1] 1
cont_table <- data.frame(UofI=c(5,19), BSU=c(23,30))
cont_table
## UofI BSU
## 1 5 23
## 2 19 30
chisq.test(cont_table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cont_table
## X-squared = 2.7246, df = 1, p-value = 0.09881
chisq.test(data.frame(c(5,19),c(23,30)))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: data.frame(c(5, 19), c(23, 30))
## X-squared = 2.7246, df = 1, p-value = 0.09881
qchisq(0.95, df = 1)
## [1] 3.841459
This result has the Yates’ continuity correction.
chisq.test(cont_table, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: cont_table
## X-squared = 3.6342, df = 1, p-value = 0.0566
Is there a significant difference between the proportions? No. With the Yates’ continuity correction, p-value = 0.09881 (p > 0.05). Therefore, we fail to reject the null hypothesis that there is no association between university and whether the individual scored.
Are any of the expected values below 5?
Lowest expected value is 8.7.
Do we need to run a Fisher’s exact test? No
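The expected counts do not have to be worked out by hand: chisq.test() returns them in its `expected` component. A short sketch that rebuilds the soccer table (so it runs on its own) and checks the smallest expected value:

```r
# Same fictional soccer table as above
cont_table <- data.frame(UofI = c(5, 19), BSU = c(23, 30))

# Expected counts under the null hypothesis of no association
expected <- chisq.test(cont_table)$expected
expected

# Smallest expected count, and whether any cell falls below 5
min(expected)
any(expected < 5)
```

If `any(expected < 5)` were TRUE, Fisher’s exact test would be the safer choice.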
Determine the odds ratio and interpret it.
# Observed
UofI_score <- 5
UofI_noscore <- 19
BSU_score <- 23
BSU_noscore <- 30
# Odds (UofI) = Number scored and at UofI / Number not scored and at UofI
odds_UofI_score_noscore <- UofI_score / UofI_noscore
odds_UofI_score_noscore
## [1] 0.2631579
# Odds (BSU) = Number scored and at BSU / Number not scored and at BSU
odds_BSU_score_noscore <- BSU_score / BSU_noscore
odds_BSU_score_noscore
## [1] 0.7666667
odds_BSU_score_noscore/odds_UofI_score_noscore
## [1] 2.913333
χ2(1, N = 77) = 2.72, p = 0.099. Based on the odds
ratio, the odds of scoring were 2.91 times higher for individuals from
BSU than for individuals from UofI, although this difference was not
statistically significant.