To visualize data using bar graphs and mosaic plots.
To check Validity Conditions for Two Sample z-procedures: The theory-based test and interval for the difference in two proportions (called a two-sample z-test or interval) work well when there are at least 10 observations in each of the four cells of the 2 × 2 table.
To calculate the standard error for use in hypothesis tests with Two Proportions.
\[ z = \frac{\textrm{statistic} - \textrm{hypothesized value}}{\textrm{standard error of statistic}} =\frac{\hat{p}_1 - \hat{p}_2 - 0}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}},\]
where \(\hat{p}\) is the pooled success proportion.
To apply Theory-Based methods for Confidence Intervals for Two Proportions:
\[ \textrm{margin of error} = \textrm{multiplier} \times \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2} } \]
To draw appropriate conclusions from theory-based techniques for two proportions.
As usual, we start by loading our two packages: mosaic
and ggformula. To load a package, you use the
library() function, wrapped around the name of a package.
I’ve put the code to load one package into the chunk below. Add the
other package you need.
We’ll load the example data, GSS22clean.csv, from this
URL: https://raw.githubusercontent.com/IJohnson-math/Math138/main/GSS22clean.csv
using the read.csv() function.
Our research question is whether there is a difference in level of
support for the legalization of marijuana among working people in the
two groups of self-employed and people that work for somebody else. We
will view the works_for variable as the explanatory
variable and should_marijuana_be_legal as the response
variable.
Our null hypothesis is that the proportion of people who believe marijuana should be legal is the same in the self-employed group as it is among people who work for someone else. In other words, the null is that there is no association between views on marijuana legalization and whether a person works for someone else or is self-employed.
Let \(\pi_{\mathrm{SelfEmp}}\) be the proportion of self-employed people that think marijuana should be legal and \(\pi_{\mathrm{SomeoneElse}}\) be the proportion of people that work for someone else that think marijuana should be legal.
Our null and alternative hypotheses are
\[H_0 : \pi_{\mathrm{SelfEmp}} - \pi_{\mathrm{SomeoneElse}} = 0\]
\[H_a:\pi_{\mathrm{SelfEmp}} - \pi_{\mathrm{SomeoneElse}} \neq 0\]
To determine the proportion of self employed people that believe
marijuana should be made legal and the proportion of people that work
for someone else that believe marijuana should be made legal we create a
2-way table with the tally command.
Important note: the order of the variables matters!! It should be
tally(response_var ~ explanatory_var).
Check your work and make sure the categories of the explanatory variable are displayed horizontally, as the column headings across the top of the table. Be careful with this step or your work from here on will be incorrect!
Our variables are ‘works_for’ and ‘should_marijuana_be_legal’. As mentioned above, the ‘works_for’ variable is the explanatory (the variable that determines the groups) and ‘should_marijuana_be_legal’ is the response variable of interest.
##                           works_for
## should_marijuana_be_legal self-employed someone else <NA>
##   should be legal                    90          674   24
##   should not be legal                33          282   20
##   <NA>                              270         2056   95
Notice that the data has many NA values. We will remove the NAs before proceeding with our analysis.
Next, let’s rerun the tally function.
##                           works_for
## should_marijuana_be_legal self-employed someone else
##   should be legal                    90          674
##   should not be legal                33          282
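The chunk that removes the NAs is not shown in this handout. Here is a minimal sketch of that step in base R. It assumes a stand-in data frame, the hypothetical gss_demo, rebuilt from the counts in the first table instead of being read from the CSV, and it uses complete.cases() and table() in place of whatever filtering code the lab actually used.

```r
# Hypothetical stand-in for the survey data, rebuilt from the first table of counts
# (an assumption for illustration; the lab itself reads GSS22clean.csv).
counts <- data.frame(
  works_for = rep(c("self-employed", "someone else", NA), each = 3),
  should_marijuana_be_legal = rep(c("should be legal", "should not be legal", NA), times = 3),
  n = c(90, 33, 270, 674, 282, 2056, 24, 20, 95)
)
# Expand to one row per respondent
gss_demo <- counts[rep(seq_len(nrow(counts)), counts$n),
                   c("works_for", "should_marijuana_be_legal")]

# Keep only respondents who answered both questions
gss_complete <- gss_demo[complete.cases(gss_demo), ]

# Response variable in the rows, explanatory variable in the columns
two_way <- table(gss_complete$should_marijuana_be_legal, gss_complete$works_for)
two_way

nrow(gss_demo)      # respondents before filtering
nrow(gss_complete)  # respondents with both answers
```

The two row counts make it easy to see how many respondents the filtering step drops.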
For use later, we calculate the sample size for each group: \(n_1\) is the number of people that are self-employed and \(n_2\) the number of people that work for someone else.
## [1] 123
## [1] 956
## [1] 1079
To calculate the conditional proportions for support of marijuana
legalization in the two groups we include the option
format ="proportion" within the tally function.
##                           works_for
## should_marijuana_be_legal self-employed someone else
##   should be legal             0.7317073    0.7050209
##   should not be legal         0.2682927    0.2949791
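The sample sizes and conditional proportions can also be reproduced from the four counts alone. The sketch below uses base R's prop.table() rather than the tally() call from the lab; the object names two_way and cond_props are assumptions made for illustration.

```r
# 2 x 2 table of counts: response in rows, explanatory variable in columns
two_way <- matrix(
  c(90, 33, 674, 282), nrow = 2,
  dimnames = list(
    should_marijuana_be_legal = c("should be legal", "should not be legal"),
    works_for = c("self-employed", "someone else")
  )
)

# Group sizes are the column totals
n1 <- sum(two_way[, "self-employed"])  # self-employed
n2 <- sum(two_way[, "someone else"])   # works for someone else
n  <- n1 + n2

# Conditional proportions within each employment group (each column sums to 1)
cond_props <- prop.table(two_way, margin = 2)
cond_props
```

Using margin = 2 conditions on the columns, which is why the explanatory variable needs to be the column variable.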
The proportion of self-employed workers that believe marijuana should be legal is 0.7317 and the proportion of workers that work for someone else and support marijuana legalization is 0.705. Let’s save these values for use later.
#p1 is support for MJ in self-employed, p2 is support for MJ in non-self-employed
p1 <- 0.7317
p2 <- 0.7050
Here is a segmented bar graph.
gf_props( ~works_for, fill= ~ should_marijuana_be_legal, data=GSS22, position ="fill", xlab="Employment", title="Views on marijuana legalization by employment group" )
Here is a mosaic plot. Caution! In the
mosaicplot( ) function, make sure to list the explanatory
variable first.
mosaicplot( ~works_for + should_marijuana_be_legal, data=GSS22, main="Employment and Marijuana Legalization", ylab=" ", xlab=" ", las=1, color=c("salmon", "turquoise"))
Validity Conditions: The theory-based test and interval for the difference in two proportions (called a two-sample z-test or interval) work well when there are at least 10 observations in each of the four cells of the 2 × 2 table.
If we look at the tally of counts, we see that the
values in the 2 x 2 table are 90, 674, 33, 282, all of which are greater
than 10. So our validity conditions are definitely satisfied.
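The same check can be done in R instead of by eye. This is a small sketch using the four cell counts typed in by hand; the name cell_counts is an assumption for illustration.

```r
# The four cell counts from the 2 x 2 table
cell_counts <- c(90, 674, 33, 282)

# TRUE when every cell has at least 10 observations
all(cell_counts >= 10)

# The smallest cell is the one that decides the condition
min(cell_counts)
```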
Let’s start by finding our observed statistic, \(p_{\mathrm{diff}}\).
#(p1 for the self-employed group) - (p2 for the works for someone else group)
p_diff <- p1-p2
p_diff
## [1] 0.0267
For two proportions, in a hypothesis test the standard error of the null distribution is given by
\[ SE=\sqrt{\frac{\hat{p}(1-\hat{p})}{n_1}+\frac{\hat{p}(1-\hat{p})}{n_2}} \] where \(\hat{p}\) is the pooled proportion of “success”. Here success represents support for marijuana legalization.
Using R as a calculator the pooled proportion is
## [1] 0.708063
The standard error is
## [1] 0.04355216
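The calculator steps behind these two values are not shown above. A minimal sketch, assuming the counts from the two-way table are typed in directly and using the illustrative names p_pool and se_null:

```r
# Group sizes: self-employed, works for someone else
n1 <- 123
n2 <- 956

# Pooled proportion of "success": all supporters divided by all respondents
p_pool <- (90 + 674) / (n1 + n2)
p_pool

# Standard error of the null distribution, using the pooled proportion in both terms
se_null <- sqrt(p_pool * (1 - p_pool) / n1 + p_pool * (1 - p_pool) / n2)
se_null
```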
Next, we can calculate the standardized statistic using the formula
\[ z = \frac{\hat{p}_1 - \hat{p}_2 - 0}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n_1}+\frac{\hat{p}(1-\hat{p})}{n_2}}} = \frac{p_{\mathrm{diff}} - 0}{\sqrt{\frac{\hat{p}(1-\hat{p})}{n_1}+\frac{\hat{p}(1-\hat{p})}{n_2}}}\]
## [1] 0.613058
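Here is one way the standardized statistic could be computed, again as a sketch with the rounded proportions saved earlier. The last line is an addition not in the lab: it converts z to a two-sided p-value with pnorm(), assuming the standard normal approximation.

```r
# Rounded sample proportions and group sizes from earlier in the lab
p1 <- 0.7317
p2 <- 0.7050
n1 <- 123
n2 <- 956

# Pooled proportion and null standard error
p_pool  <- (90 + 674) / (n1 + n2)
se_null <- sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))

# Standardized statistic: (statistic - hypothesized value) / standard error
z <- (p1 - p2 - 0) / se_null
z

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))
p_value
```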
What evidence if any does this standardized statistic provide regarding our hypothesis test?
The standardized statistic (z = 0.613) is not greater than 2 or less than -2, so we don’t have enough evidence to reject the null hypothesis. It looks like the proportion of people that believe that marijuana should be legal in the self-employed group is similar to the proportion for workers that are not self-employed. Thus the difference between those proportions is plausibly equal to zero.
Next we calculate the theory based \(p\)-value using prop.test.
Note: In the code below we will omit the default continuity correction
(using the option correct = FALSE) because the counts in
all four cells of our two-way table are large. The continuity
correction becomes important if one of the cell counts is smaller than
10, especially if a count is less than or equal to 5.
#inference for two proportions
prop.test(should_marijuana_be_legal ~ works_for, data = GSS22, success = "should be legal", alternative = "two.sided", conf.level = 0.95, correct=FALSE)
##
## 2-sample test for equality of proportions without continuity correction
##
## data: tally(should_marijuana_be_legal ~ works_for)
## X-squared = 0.37546, df = 1, p-value = 0.54
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.05678065 0.11015344
## sample estimates:
## prop 1 prop 2
## 0.7317073 0.7050209
Here is another way to input data for two proportion inference. This assumes we only have data from a two-way table.
##                           works_for
## should_marijuana_be_legal self-employed someone else
##   should be legal                    90          674
##   should not be legal                33          282
#USE THIS COMMAND for inference when you only have the counts and not the data
# c(90, 674) are the success counts for the two groups: self employed or works for someone else
# c(123, 956) are the sample size counts for the two groups
# be consistent with the order of the numbers! I'm consistently putting the self-employed group first.
# Always use alternative = "two.sided" when calculating confidence intervals!
prop.test(c(90, 674), c(123, 956), alternative = "two.sided", conf.level = 0.99, correct=FALSE)
##
## 2-sample test for equality of proportions without continuity correction
##
## data: c out of c90 out of 123674 out of 956
## X-squared = 0.37546, df = 1, p-value = 0.54
## alternative hypothesis: two.sided
## 99 percent confidence interval:
## -0.0830079 0.1363807
## sample estimates:
## prop 1 prop 2
## 0.7317073 0.7050209
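It can help to pull individual pieces out of the test result rather than reading them off the printout. The sketch below calls the base R version, stats::prop.test(), so it runs even without mosaic loaded; the names res and z_from_chisq are assumptions for illustration. It also shows that the reported X-squared is the square of the z statistic computed from the unrounded proportions.

```r
# Same counts as above: successes and sample sizes, self-employed group first
res <- stats::prop.test(x = c(90, 674), n = c(123, 956),
                        alternative = "two.sided", conf.level = 0.99,
                        correct = FALSE)

res$p.value    # two-sided p-value
res$conf.int   # 99% confidence interval for the difference in proportions
res$estimate   # the two sample proportions

# The chi-squared statistic with 1 df is z squared, so its square root recovers |z|
z_from_chisq <- sqrt(unname(res$statistic))
z_from_chisq
```

The value of z_from_chisq differs slightly from the hand calculation because that calculation used proportions rounded to four decimal places.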
Does the \(p\)-value from
prop.test support the conclusion made with the standardized
statistic?
Yes, the p-value supports the same conclusion as the standardized statistic. The p-value is 0.54, which is much larger than common significance levels such as 0.05 or 0.01, so we do not have evidence that would support rejecting the null hypothesis. The null hypothesis is plausible. Contextually, this means we have no convincing evidence that self-employed people and people that work for someone else support marijuana legalization at different rates.
To find confidence intervals for a difference of proportions, we start by computing the standard error. Recall that the formula for standard error depends on whether we’re doing a confidence interval or a hypothesis test. The reason for the two formulas stems from the fact that when we do a hypothesis test we have a hypothesized value for the unknown parameter, namely \(\pi_{\mathrm{SelfEmp}} - \pi_{\mathrm{SomeoneElse}}=0\), but when determining a confidence interval we have no preferred value for the parameter.
For two proportions, the standard error for a confidence interval is given by \[ SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\]
Notice that this formula uses our observed proportions \(\hat{p}_1\) and \(\hat{p}_2\) instead of the pooled proportion \(\hat{p}\).
Our margin of error (MOE) for a 2SD interval is given by
## [1] 0.08517288
The interval is centered at the observed statistic, \(p_{\mathrm{diff}}\), and the left and right endpoints of our 2SD confidence interval are
## [1] -0.05847288
## [1] 0.1118729
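The calculations behind the margin of error and the two endpoints are not shown above. A minimal sketch, assuming the rounded proportions saved earlier and using the illustrative names se_ci, moe, lower and upper:

```r
# Rounded sample proportions and group sizes from earlier in the lab
p1 <- 0.7317
p2 <- 0.7050
n1 <- 123
n2 <- 956
p_diff <- p1 - p2

# Standard error for a confidence interval: unpooled, one term per group
se_ci <- sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

# 2SD margin of error: multiplier of 2 times the standard error
moe <- 2 * se_ci
moe

# Endpoints: statistic minus and plus the margin of error
lower <- p_diff - moe
upper <- p_diff + moe
c(lower, upper)
```

Replacing the multiplier 2 with qnorm(0.975) would give the slightly narrower interval that a theory-based 95% procedure uses.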
Does this align with a 95% confidence interval calculated using
prop.test? Yes, they are nearly identical.
Interpret the confidence interval: We are 95% confident that the difference in the proportions supporting marijuana legalization between self-employed people and people that work for someone else, \(\pi_{\mathrm{SelfEmp}} - \pi_{\mathrm{SomeoneElse}}\), is between -0.058 and 0.112. In other words, support for legalization could be up to 5.8 percentage points higher in the works-for-someone-else group, up to 11.2 percentage points higher in the self-employed group, or anywhere in between.
Connection between confidence interval and hypothesis test: Since this confidence interval contains 0, it is plausible that the two groups, self-employed people and people that work for someone else, support marijuana legalization at equal rates.
This is an observational study so we cannot make any conclusions about causation even if the results had been significant. Since our data is a random sample of individuals that participated in the GSS 2022 survey we can cautiously generalize to the population of working adults in the US. This generalization does warrant some caution because we filtered out 2465 of the 3544 people in the GSS survey. Only 1079 people reported answers to both the ‘works_for’ and ‘should_marijuana_be_legal’ questions and removing such a large portion of people from our random sample could be a potential source of bias.
Notice: the standardized statistic, the p-value and the confidence interval all lead to the same conclusion that the two proportions of interest are plausibly equal!