MAS 261 - Lecture 19

Contingency Tables and Two Sample Tests of Proportions

Author

Penelope Pooler Eisenbies

Published

October 28, 2024

Housekeeping

Today’s plan
- Corrections to Syllabus
- Comments and Questions about Previous Lecture from Engagement Questions
- Upcoming Dates
- A few minutes for R Questions 🪄
- Review Question - Two-sided test
- Review of One Sample Proportion Hypothesis Tests
- Contingency Tables
  - Tests of Two proportions
  - Format of Hypothesis Tests

Upcoming Dates

HW 7 is now posted and is due 11/6 (Grace period ends 11/7).
- Most questions are multiple choice, but do not just guess and keep trying.
- Demo videos will be posted by tomorrow or Saturday at the latest.
Test 2 is on November 12th and will include material up through Lecture 20 (HW 7)
Lecture 21 - Intro to Portfolio Management will be on Final Exam, not on Test 2.

R and RStudio

In this course we will use R and RStudio to understand statistical concepts.
You will access R and RStudio through Posit Cloud.
- Sign up for a Free Posit Cloud Account
I will post R/RStudio files on Posit Cloud that you can access in provided links.
I will also provide demo videos that show how to access files and complete exercises.
NOTE: The free Posit Cloud account is limited to 25 hours per month.
- I demo how to download completed work so that you can use this allotment efficiently.
- For those who want to go further with R/RStudio:
  - I have added a new page to the MAS 261 website, Installing R and RStudio

Lecture 19 In-class Exercises - Q1

TRUE OR FALSE: When conducting a two-tailed two sample hypothesis test of means, we can only tell if two population means are significantly different, not which one is larger (or smaller).

Two-sided Two Sample Hypothesis Test
- $H_{0}: \mu_{1} = \mu_{2}$
- $H_{A}: \mu_{1} \neq \mu_{2}$

Review of One Sample Hypothesis Tests of Proportions

Question of Interest polled by YouGov:

987 adults in the US were asked:

Would you like to see the changing of the clocks eliminated, so people no longer change their clocks twice per year?

Should We End Daylight Savings Time?

YouGov Polled 987 US adults

612 said YES, we should eliminate the practice of changing our clocks.
375 said NO or they were unsure. We group these two categories together.

If we test these data, what are the null and alternative hpotheses:

$H_{0}: P_{YES} = P_{NO}$ - No difference in proportion with these two opinions.
$H_{A}: P_{YES} \neq P_{NO}$ - There is a difference in proportion with these two opinions.

Specify alpha ($\alpha$) as 0.05 unless we have a specific reason to choose a different alpha.

Should We End Daylight Savings Clock Changes?

Code

```{r echo=T}
prop.test(612,987,correct=F)
```


    1-sample proportions test without continuity correction

data:  612 out of 987, null probability 0.5
X-squared = 56.909, df = 1, p-value = 0.00000000000004565
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
 0.5893699 0.6498207
sample estimates:
        p 
0.6200608

One Sample Hypothesis Test Conclusion

Hypotheses being tested:

$H_{0}: P_{YES} = 0.5$
- The Yes and Not Yes votes are roughly equal.
$H_{A}: P_{YES} \neq 0.5$
- There is a difference in the proportion Yes and Not Yes votes.
P-value from hypotheses test: < 0.0001
Conclusion: P-value is much less than 0.05 so we REJECT $H_{0}$.
Interpretation: See Polling Question on next slide

Lecture 19 In-class Exercises - Q2

Given our stated hypotheses and our p-value < 0.0001

$H_{0}: P_{YES} = 0.5$
$H_{A}: P_{YES} \neq 0.5$

How do we interpret the outcome of this hypothesis test?

Contingency Tables to Examine Proportions in Multiple Categories

Question: Are these disparities in opinions about daylight savings consistent among age groups?

We can examine this question using tables, plots, and hypothesis tests.
A Contingency Table is 2 x 2 or larger and allows us to subdivide count data by categories
Commonly used in market research to understand opinions by category
Example: How do Gen Z (18-29) and Millennial adults feel (30-44) about daylight savings?

Should the USA Eliminate Daylight Savings Clock Changes
	Yes	No/Not Sure
Ages 18-29	99	98
Ages 30-44	129	105

Comparing Contingency Tables and Plots

Contingency tables and bar plots are two effective ways to examine these data

Should the USA Eliminate Daylight Savings Clock Changes
	Yes	No/Not Sure
Ages 18-29	99	98
Ages 30-44	129	105

Hypothesis Test Comparing Two Proportions

Hypotheses being tested:
- $H_{0}: P_{18-29} = P_{30-44}$
  - There is no difference between these two age groups with respect to proportion that says yes.
- $H_{A}: P_{18-29} \neq P_{30-44}$
  - There is a difference between these two age groups with respect to proportion that says yes.

Code

```{r echo=T}
x <- c(99, 129)  # yes votes in each age group (18-29 first)
n <- c(197, 234) # sample size in each age group (18-29 first)
prop.test(x,n, correct=F)
```


    2-sample test for equality of proportions without continuity correction

data:  x out of n
X-squared = 1.0199, df = 1, p-value = 0.3125
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.14327319  0.04578523
sample estimates:
   prop 1    prop 2 
0.5025381 0.5512821

Two Proportion Hypothesis Test Conclusion and Interpretation

Hypotheses being tested:
- $H_{0}: P_{18-29} = P_{30-44}$
  - No difference between these age groups respect to proportion that says yes.
- $H_{A}: P_{18-29} \neq P_{30-44}$
  - There is a difference between these age groups respect to proportion that says yes.

Questions we will answer:

What is the p-value from this test?
Do we Reject or Fail to Reject the Null Hypothesis?
What do we conclude about the opinions of these two age groups?

Lecture 19 In-class Exercises - Q3-Q4

Question 3: What is the p-value from this hypothesis test?

Question 4: If we specify $\alpha = 0.05$, do we reject or fail to reject the null hypothesis, $H_{0}$?

Lecture 19 In-class Exercises - Q5-Q6

Question 5: What do we conclude about the opinions of these two age groups?

Question 6 (Not on Point Solutions): What type of error might we have made?

Do Gen-Zs and Millenials differ from Gen-Xers?

Should the USA Eliminate Daylight Savings Clock Changes
	Yes	No/Not Sure
Ages 18-44	228	205
Ages 45-64	201	118

Column and Row Percentages

Original Data

Should the USA Eliminate Daylight Savings Clock Changes
	Yes	No/Not Sure
Ages 18-44	228	205
Ages 45-64	201	118

Row Percentages: Percentages of each age group that said ‘Yes’ or ‘No’.

Row Percentages
	Yes	No/Not Sure
Ages 18-44	52.66	47.34
Ages 45-64	63.01	36.99

Column percentages: Percentages of Yes/No opinions in each age group.

Column Percentages
	Yes	No/Not Sure
Ages 18-44	53.15	63.47
Ages 45-64	46.85	36.53

Hypothesis Test Comparing Two Proportions

Hypotheses being tested:
- $H_{0}: P_{18-44} = P_{45-64}$
  - No difference between these age groups respect to proportion that says yes.
- $H_{A}: P_{18-44} \neq P_{45-64}$
  - There is a difference between these age groups respect to proportion that says yes.

Code

```{r echo=T}
x <- c(228,201)  # yes votes in each age group (18-44 first)
n <- c(433,319)  # sample size in each age group (18-44 first)

prop.test(x,n, correct=F)
```


    2-sample test for equality of proportions without continuity correction

data:  x out of n
X-squared = 8.0355, df = 1, p-value = 0.004587
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.17437592 -0.03269438
sample estimates:
   prop 1    prop 2 
0.5265589 0.6300940

Lecture 19 In-class Exercises - Q7

What do you conclude from this two sample two-sided hypothesis test of two proportions?

Key Points from Today

Protocol for conducting and interpreting hypothesis tests is same, regardless of how they are specified.
This is true for quantitative data and for categorical proportion data
For two sample tests of proportions, it is helpful to examine the data using contingency tables.
By default, it is common for two sample tests of proportions to be conducted as two sided tests.
These same methods can be used with larger contingency tables tat are interatively analyzed.

To submit an Engagement Question or Comment about material from Lecture 19: Submit it by midnight today (day of lecture).

--- title: "MAS 261 - Lecture 19" subtitle: "Contingency Tables and Two Sample Tests of Proportions" author: "Penelope Pooler Eisenbies" date: last-modified toc: true toc-depth: 3 toc-location: left toc-title: "Table of Contents" toc-expand: 1 format: html: code-line-numbers: true code-fold: true code-tools: true execute: echo: fenced --- ## Housekeeping ```{r setup, echo=FALSE, warning=F, message=F, include=F} #| include: false # this line specifies options for default options for all R Chunks knitr::opts_chunk$set(echo=F) # suppress scientific notation options(scipen=100) # install helper package that loads and installs other packages, if needed if (!require("pacman")) install.packages("pacman", repos = "http://lib.stat.cmu.edu/R/CRAN/") # install and load required packages pacman::p_load(pacman,tidyverse, magrittr, olsrr, shadowtext, mapproj, knitr, kableExtra, countrycode, usdata, maps, RColorBrewer, gridExtra, ggthemes, gt, mosaicData, epiDisplay, vistributions) # verify packages # p_loaded() ``` - Today's plan - Corrections to [**Syllabus**](https://docs.google.com/document/d/1g8TGtKF1rR_6rRmRe5IHhiJ2KhjGAGhcZcOhM621kGM/edit?usp=sharing){target="_blank"} - Comments and Questions about Previous Lecture from Engagement Questions - Upcoming Dates - A few minutes for R Questions 🪄 - Review Question - Two-sided test - Review of One Sample Proportion Hypothesis Tests - Contingency Tables - Tests of Two proportions - Format of Hypothesis Tests ## Upcoming Dates - HW 7 is now posted and is due 11/6 (Grace period ends 11/7). - Most questions are multiple choice, but do not just guess and keep trying. - Demo videos will be posted by tomorrow or Saturday at the latest. - Test 2 is on November 12th and will include material up through Lecture 20 (HW 7) - Lecture 21 - Intro to Portfolio Management will be on Final Exam, not on Test 2. ## R and RStudio - In this course we will use R and RStudio to understand statistical concepts. - You will access R and RStudio through **Posit Cloud**. - Sign up for a [Free Posit Cloud Account](https://posit.cloud/plans/free){target="_blank"} - I will post R/RStudio files on Posit Cloud that you can access in provided links. - I will also provide demo videos that show how to access files and complete exercises. - NOTE: The free Posit Cloud account is limited to 25 hours per month. - I demo how to download completed work so that you can use this allotment efficiently. - For those who want to go further with R/RStudio: - I have added a new page to the MAS 261 website, [Installing R and RStudio](https://penelope2040.quarto.pub/mas-261/#installing-r-and-rstudio){target="_blank"} ## ### Lecture 19 In-class Exercises - Q1 ::::: columns ::: {.column width="50%"} TRUE OR FALSE: When conducting a two-tailed two sample hypothesis test of means, we can only tell if two population means are significantly different, not which one is larger (or smaller). - **Two-sided Two Sample Hypothesis Test** - $H_{0}: \mu_{1} = \mu_{2}$ - $H_{A}: \mu_{1} \neq \mu_{2}$ ::: ::: {.column width="50%"} ```{r} knitr::include_graphics("img/hypotheses_graphic.jpg", dpi=300) ``` ::: ::::: ## Review of One Sample Hypothesis Tests of Proportions ::::: columns ::: {.column width="50%"} Question of Interest polled by [YouGov](https://docs.cdn.yougov.com/etwjvohrxx/Daylight_Saving_Time_Toplines_Crosstabs.pdf): 987 adults in the US were asked: **Would you like to see the changing of the clocks eliminated, so people no longer change their clocks twice per year?** ::: ::: {.column width="50%"} ```{r} knitr::include_graphics("img/dst_fall_back.jpg") ``` ::: ::::: ## Should We End Daylight Savings Time? YouGov Polled 987 US adults - 612 said YES, we should eliminate the practice of changing our clocks. - 375 said NO or they were unsure. We group these two categories together. ::: fragment If we test these data, what are the null and alternative hpotheses: ::: - $H_{0}: P_{YES} = P_{NO}$ - No difference in proportion with these two opinions. - $H_{A}: P_{YES} \neq P_{NO}$ - There is a difference in proportion with these two opinions. ::: fragment Specify alpha ($\alpha$) as 0.05 unless we have a specific reason to choose a different alpha. ::: ## Should We End Daylight Savings Clock Changes? ::::: columns ::: {.column width="50%"} ```{r echo=T} prop.test(612,987,correct=F) ``` ::: ::: {.column width="50%"} ```{r} Opinion <- c("Yes", "No/Not Sure") Frequency <- c(612,375) dl <- tibble(Opinion, Frequency) (opinion_plot <- dl |> ggplot() + geom_bar(aes(x=Opinion, y=Frequency, fill=Opinion), stat="identity", show.legend = F) + scale_fill_manual(values=c("cornflowerblue","chartreuse3"))+ theme(legend.position = "none") + theme_classic() + labs(title="Should We Eliminate Daylight Savings Clock Changes") + theme(plot.title = element_text(size = 20), axis.title = element_text(size=18), axis.text = element_text(size=15), plot.caption = element_text(size = 10), legend.text = element_text(size = 12), legend.title = element_text(size = 15))) ``` ::: ::::: ## One Sample Hypothesis Test Conclusion Hypotheses being tested: - $H_{0}: P_{YES} = 0.5$ - The Yes and Not Yes votes are roughly equal. - $H_{A}: P_{YES} \neq 0.5$ - There is a difference in the proportion Yes and Not Yes votes. - P-value from hypotheses test: \< 0.0001 - Conclusion: P-value is much less than 0.05 so we **REJECT** $H_{0}$. - Interpretation: See Polling Question on next slide ## ### Lecture 19 In-class Exercises - Q2 Given our stated hypotheses and our p-value \< 0.0001 - $H_{0}: P_{YES} = 0.5$ - $H_{A}: P_{YES} \neq 0.5$ ::: fragment How do we interpret the outcome of this hypothesis test? ::: ## ### Contingency Tables to Examine Proportions in Multiple Categories :::::: columns ::: {.column width="54%"} Question: Are these disparities in opinions about daylight savings consistent among age groups? - We can examine this question using tables, plots, and hypothesis tests. - A Contingency Table is 2 x 2 or larger and allows us to subdivide count data by categories - Commonly used in market research to understand opinions by category - Example: How do Gen Z (18-29) and Millennial adults feel (30-44) about daylight savings? ::: ::: {.column width="2%"} ::: ::: {.column width="44%"} ```{r} # Summarized counts dl_poll <- matrix(c(99,98,129,105), ncol=2, byrow = T) # label columns (col) and rows colnames(dl_poll) <- c("Yes", "No/Not Sure") rownames(dl_poll) <- c("Ages 18-29","Ages 30-44") #create a table of these data in R dl_table <- as.table(dl_poll) #print version of table kable(dl_table, align="lcc", caption="Should the USA Eliminate Daylight Savings Clock Changes") ``` ::: :::::: ## Comparing Contingency Tables and Plots Contingency tables and bar plots are two effective ways to examine these data ::::: columns ::: {.column width="50%"} ```{r} # Summarized counts dl_poll <- matrix(c(99,98,129,105), ncol=2, byrow = T) # label columns (col) and rows colnames(dl_poll) <- c("Yes", "No/Not Sure") rownames(dl_poll) <- c("Ages 18-29","Ages 30-44") #create a table of these data in R dl_table <- as.table(dl_poll) #print version of table kable(dl_table, align="lcc", caption="Should the USA Eliminate Daylight Savings Clock Changes") ``` ::: ::: {.column width="50%"} ```{r} Frequency <- c(99,98,129,105) Age_Group <- c(rep("Ages 18-29",2), rep("Ages 30-44",2)) Opinion <- rep(c("Yes", "No/Not Sure"),2) dl_data2<- tibble(Age_Group,Opinion, Frequency) (op_plot2 <- dl_data2 |> ggplot() + geom_bar(aes(x=Age_Group, y=Frequency, fill=Opinion), stat="identity", position="dodge") + scale_fill_manual(values=c("cornflowerblue","chartreuse3")) + theme_classic() + labs(title="Should We Eliminate Daylight Savings Clock Changes", x="") + theme(plot.title = element_text(size = 20), axis.title = element_text(size=18), axis.text = element_text(size=15), plot.caption = element_text(size = 10), legend.text = element_text(size = 12), legend.title = element_text(size = 15))) ``` ::: ::::: ## ### Hypothesis Test Comparing Two Proportions - Hypotheses being tested: - $H_{0}: P_{18-29} = P_{30-44}$ - There is no difference between these two age groups with respect to proportion that says yes. - $H_{A}: P_{18-29} \neq P_{30-44}$ - There is a difference between these two age groups with respect to proportion that says yes. ::: fragment ```{r echo=T} x <- c(99, 129) # yes votes in each age group (18-29 first) n <- c(197, 234) # sample size in each age group (18-29 first) prop.test(x,n, correct=F) ``` ::: ## Two Proportion Hypothesis Test Conclusion and Interpretation - Hypotheses being tested: - $H_{0}: P_{18-29} = P_{30-44}$ - No difference between these age groups respect to proportion that says yes. - $H_{A}: P_{18-29} \neq P_{30-44}$ - There is a difference between these age groups respect to proportion that says yes. ::: fragment Questions we will answer: - What is the p-value from this test? - Do we Reject or Fail to Reject the Null Hypothesis? - What do we conclude about the opinions of these two age groups? ::: ## ### Lecture 19 In-class Exercises - Q3-Q4 Question 3: What is the p-value from this hypothesis test? Question 4: If we specify $\alpha = 0.05$, do we reject or fail to reject the null hypothesis, $H_{0}$? ## ### Lecture 19 In-class Exercises - Q5-Q6 Question 5: What do we conclude about the opinions of these two age groups? Question 6 (Not on Point Solutions): What type of error might we have made? ## ### Do Gen-Zs and Millenials differ from Gen-Xers? ::::: columns ::: {.column width="50%"} ```{r} # Summarized counts dl_poll3 <- matrix(c(228,205,201,118), ncol=2, byrow = T) # label columns (col) and rows colnames(dl_poll3) <- c("Yes", "No/Not Sure") rownames(dl_poll3) <- c("Ages 18-44","Ages 45-64") #create a table of these data in R dl_table3 <- as.table(dl_poll3) #print version of table kable(dl_table3, align="lcc", caption="Should the USA Eliminate Daylight Savings Clock Changes") ``` ::: ::: {.column width="50%"} ```{r} Frequency <- c(228,205,201,118) Age_Group <- c(rep("Ages 18-44",2), rep("Ages 45-64",2)) Opinion <- rep(c("Yes", "No/Not Sure"),2) dl_data2<- tibble(Age_Group,Opinion, Frequency) (op_plot2 <- dl_data2 |> ggplot() + geom_bar(aes(x=Age_Group, y=Frequency, fill=Opinion), stat="identity", position="dodge") + scale_fill_manual(values=c("cornflowerblue","chartreuse3")) + theme_classic() + labs(title="Should We Eliminate Daylight Savings Clock Changes", x="")+ theme(plot.title = element_text(size = 20), axis.title = element_text(size=18), axis.text = element_text(size=15), plot.caption = element_text(size = 10), legend.text = element_text(size = 12), legend.title = element_text(size = 15))) ``` ::: ::::: ## Column and Row Percentages ::::: columns ::: {.column width="50%"} **Original Data** ```{r} # Summarized counts dl_poll3 <- matrix(c(228,205,201,118), ncol=2, byrow = T) # label columns (col) and rows colnames(dl_poll3) <- c("Yes", "No/Not Sure") rownames(dl_poll3) <- c("Ages 18-44","Ages 45-64") #create a table of these data in R dl_table3 <- as.table(dl_poll3) #print version of table kable(dl_table3, align="lcc", caption="Should the USA Eliminate Daylight Savings Clock Changes") ``` ::: ::: {.column width="50%"} **Row Percentages: Percentages of each age group that said 'Yes' or 'No'.** ```{r} #print version of row percentages table kable(prop.table(dl_table3, 1)*100, digits=2, align="lcc", caption = "Row Percentages") ``` **Column percentages: Percentages of Yes/No opinions in each age group.** ```{r} #print version of column percentages table kable(prop.table(dl_table3, 2)*100, digits=2, align="lcc", caption = "Column Percentages") ``` ::: ::::: ## ### Hypothesis Test Comparing Two Proportions - Hypotheses being tested: - $H_{0}: P_{18-44} = P_{45-64}$ - No difference between these age groups respect to proportion that says yes. - $H_{A}: P_{18-44} \neq P_{45-64}$ - There is a difference between these age groups respect to proportion that says yes. ::: fragment ```{r echo=T} x <- c(228,201) # yes votes in each age group (18-44 first) n <- c(433,319) # sample size in each age group (18-44 first) prop.test(x,n, correct=F) ``` ::: ## ### Lecture 19 In-class Exercises - Q7 What do you conclude from this two sample two-sided hypothesis test of two proportions? ## ### Key Points from Today - Protocol for conducting and interpreting hypothesis tests is same, regardless of how they are specified. - This is true for quantitative data and for categorical proportion data - For two sample tests of proportions, it is helpful to examine the data using contingency tables. - By default, it is common for two sample tests of proportions to be conducted as two sided tests. - These same methods can be used with larger contingency tables tat are interatively analyzed. ::: fragment **To submit an Engagement Question or Comment about material from Lecture 19:** Submit it by midnight today (day of lecture). :::