Inference for Two Populations

Joe Ripberger

Statistical Inference for One Population

Goal: estimate unknown population parameters using sample statistics as point estimates for the unknown population parameters
- Continuous data: the unknown population parameter is often \(\mu\), which we estimate with \(\bar{x}\)
- Categorical data: the unknown population parameter is often \(p\), which we estimate with \(\hat{p}\)
Requires that we account for sampling variation, which we do by estimating the standard deviation of the sampling distribution (SE)
- \(SE(\bar{x})=\frac{s}{\sqrt{n}}\), where \(s\) is the sample standard deviation and \(n\) is the sample size
- \(SE(\hat{p})=\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\), where \(\hat{p}\) is the proportion of successes in the sample and \(n\) is the sample size
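
The two standard error formulas above can be sketched as small R helpers. The function names `se_mean` and `se_prop` and the example vectors are hypothetical, introduced only for illustration; `se_prop` assumes the data are coded 0/1.

```r
# Hypothetical helper: standard error of a sample mean, s / sqrt(n)
se_mean <- function(x) {
  x <- x[!is.na(x)]            # drop missing values before counting n
  sd(x) / sqrt(length(x))
}

# Hypothetical helper: standard error of a sample proportion,
# sqrt(p_hat * (1 - p_hat) / n), assuming y is coded 0/1
se_prop <- function(y) {
  y <- y[!is.na(y)]
  p_hat <- mean(y)             # proportion of successes
  sqrt(p_hat * (1 - p_hat) / length(y))
}

# Made-up example data, not taken from the survey
risk  <- c(1, 2, 3, 4, 5)
agree <- c(1, 0, 1, 1)

se_mean(risk)
se_prop(agree)
```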

Statistical Inference for One Population (Proportion)

Are greenhouse gases, such as those resulting from the combustion of coal, oil, natural gas, and other materials, causing average global temperatures to rise?

Statistical Inference for One Population (Proportion)

H: More than 50% of Oklahoma residents believe that greenhouse gases are causing average global temperatures to rise
- \(H_0: p = 0.50\)
- \(H_A: p > 0.50\)

Point estimate: \(\hat{p}=0.571\)
Standard error: \(SE(\hat{p})=\sqrt{\frac{0.571(1-0.571)}{2547}}=0.010\)
Confidence interval: \(95\%CI=0.571 \pm 1.96*0.010 = [0.551, 0.591]\)
z-statistic: \(z=\frac{0.571-0.500}{0.010}=7.100\)
p-value: 1 - pnorm(7.1) < 0.0000001
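
The same steps can be sketched in R using the numbers on this slide. This is a minimal sketch, not the original analysis code; it follows the slide in building the standard error from \(\hat{p}\) (some texts use the null value \(p_0\) instead), and the variable names are illustrative.

```r
# Values taken from the slide
p_hat  <- 0.571   # sample proportion
p_null <- 0.50    # proportion under H_0
n      <- 2547    # sample size

se <- sqrt(p_hat * (1 - p_hat) / n)             # standard error of p_hat
ci <- c(p_hat - 1.96 * se, p_hat + 1.96 * se)   # 95% confidence interval
z  <- (p_hat - p_null) / se                     # z-statistic
p_value <- pnorm(z, lower.tail = FALSE)         # upper tail, matching H_A: p > 0.50

list(se = se, ci = ci, z = z, p_value = p_value)
```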

Statistical Inference for Two Populations

Goal: estimate unknown population parameters using sample statistics as point estimates for the unknown population parameters
- Continuous data: the unknown population parameter is often a difference of means (\(\mu_1-\mu_2\)), which we estimate with \(\bar{x}_1-\bar{x}_2\)
- Categorical data: the unknown population parameter is often a difference of proportions (\(p_1-p_2\)), which we estimate with \(\hat{p}_1-\hat{p}_2\)
Requires that we account for sampling variation, which we do by estimating the standard deviation of the sampling distribution (SE)
- \(SE(\bar{x}_1-\bar{x}_2) = \sqrt{SE(\bar{x}_1)^2+SE(\bar{x}_2)^2}\)
- \(SE(\hat{p}_1-\hat{p}_2) = \sqrt{SE(\hat{p}_1)^2+SE(\hat{p}_2)^2}\)
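
Both formulas rest on the fact that the variances of independent quantities add. A brief sketch of the reasoning, assuming the two samples are drawn independently of each other:

\[\mathrm{Var}(\bar{x}_1-\bar{x}_2)=\mathrm{Var}(\bar{x}_1)+\mathrm{Var}(\bar{x}_2) \quad\Rightarrow\quad SE(\bar{x}_1-\bar{x}_2)=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}\]

\[\mathrm{Var}(\hat{p}_1-\hat{p}_2)=\mathrm{Var}(\hat{p}_1)+\mathrm{Var}(\hat{p}_2) \quad\Rightarrow\quad SE(\hat{p}_1-\hat{p}_2)=\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}\]

The variances add even though the estimates are subtracted, because subtracting an uncertain quantity still contributes its uncertainty.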

Statistical Inference for Two Populations

We use confidence intervals and p-values to test hypotheses about the difference between means or proportions
- \(H_0: \mu_1-\mu_2 = 0, p_1-p_2 = 0\)
- \(H_A: \mu_1-\mu_2 \neq 0, p_1-p_2 \neq 0\)

Difference of Means

On a scale from zero to ten, where zero means no risk and ten means extreme risk, how much risk do you think global warming poses for people and the environment?

library(tidyverse)  # provides %>%, drop_na(), group_by(), summarise()

survey_data %>% 
  drop_na(gender, glbcc_risk) %>% 
  group_by(gender) %>% 
  summarise(n = n(), 
            mean = mean(glbcc_risk), 
            s = sd(glbcc_risk), 
            se = s / (sqrt(n)))

# A tibble: 2 × 5
  gender     n  mean     s     se
   <dbl> <int> <dbl> <dbl>  <dbl>
1      0  1512  6.13  2.98 0.0767
2      1  1023  5.67  3.18 0.0994

Difference of Means (Math)

\(H_0: \mu_1-\mu_2 = 0\)
\(H_A: \mu_1-\mu_2 \neq 0\)

Point estimate: \(\bar{x}_1-\bar{x}_2=6.13-5.67=0.46\)
Standard error: \(SE(\bar{x}_1-\bar{x}_2)=\sqrt{0.077^2+0.099^2}=0.125\)
Confidence interval: \(95\%CI=0.46 \pm 1.96*0.125 = [0.215, 0.705]\)
z-statistic: \(z=\frac{0.46-0}{0.125}=3.68\)
p-value: 2 * (1 - pnorm(3.68)) < 0.001 (doubled because \(H_A\) is two-sided)

Difference of Means (Code)

(pe <- 6.13 - 5.67)

[1] 0.46

(se <- sqrt(((2.98 / sqrt(1512)) ^ 2) + ((3.18 / sqrt(1023)) ^ 2)))

[1] 0.1255322

(ci <- c(pe - 1.96 * se, pe + 1.96 * se))

[1] 0.213957 0.706043

(z <- (pe - 0) / se)

[1] 3.664399

# Upper-tail area only; double it for the two-sided H_A
(pnorm(z, lower.tail = FALSE))

[1] 0.0001239598
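
The steps above can be wrapped into one reusable function. This is a sketch only: `diff_z_test` is a hypothetical helper name, and it reports a two-sided p-value to match the two-sided alternative.

```r
# Hypothetical helper: z-based inference for the difference of two
# independent estimates (means or proportions), given each estimate's SE
diff_z_test <- function(est1, se1, est2, se2, null = 0, conf = 0.95) {
  pe   <- est1 - est2                     # point estimate of the difference
  se   <- sqrt(se1^2 + se2^2)             # SE of the difference
  crit <- qnorm(1 - (1 - conf) / 2)       # 1.96 for a 95% interval
  z    <- (pe - null) / se
  list(estimate = pe,
       se       = se,
       ci       = c(pe - crit * se, pe + crit * se),
       z        = z,
       p_value  = 2 * pnorm(abs(z), lower.tail = FALSE))  # two-sided
}

# Summary statistics from the tibble shown earlier
res <- diff_z_test(est1 = 6.13, se1 = 2.98 / sqrt(1512),
                   est2 = 5.67, se2 = 3.18 / sqrt(1023))
res
```

Because the function takes standard errors rather than raw data, the same call works for a difference of proportions once each \(SE(\hat{p})\) has been computed.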

Difference of Proportions

Are greenhouse gases, such as those resulting from the combustion of coal, oil, natural gas, and other materials, causing average global temperatures to rise?

survey_data %>% 
  drop_na(gender, glbcc) %>% 
  count(gender)

# A tibble: 2 × 2
  gender     n
   <dbl> <int>
1      0  1520
2      1  1026

survey_data %>% 
  drop_na(gender, glbcc) %>% 
  group_by(gender, glbcc) %>% 
  summarise(n = n()) %>% 
  mutate(p = n / sum(n))

# A tibble: 4 × 4
# Groups:   gender [2]
  gender glbcc     n     p
   <dbl> <dbl> <int> <dbl>
1      0     0   606 0.399
2      0     1   914 0.601
3      1     0   485 0.473
4      1     1   541 0.527

Difference of Proportions (Math)

\(H_0: p_1-p_2 = 0\)
\(H_A: p_1-p_2 \neq 0\)

Point estimate: \(\hat{p}_1-\hat{p}_2=0.601-0.527=0.074\)
Standard error: \(SE(\hat{p}_1-\hat{p}_2)=\sqrt{\frac{0.601(1-0.601)}{1520}+\frac{0.527(1-0.527)}{1026}}=0.020\)
Confidence interval: \(95\%CI=0.074 \pm 1.96*0.020 = [0.035, 0.113]\)
z-statistic: \(z=\frac{0.074-0}{0.020}=3.7\)
p-value: 2 * (1 - pnorm(3.7)) < 0.001 (doubled because \(H_A\) is two-sided)

Difference of Proportions (Code)

(pe <- 0.601 - 0.527)

[1] 0.074

(se <- sqrt((0.601 * (1 - 0.601)) / 1520 + (0.527 * (1 - 0.527)) / 1026))

[1] 0.02001791

(ci <- c(pe - 1.96 * se, pe + 1.96 * se))

[1] 0.0347649 0.1132351

(z <- (pe - 0) / se)

[1] 3.69669

# Upper-tail area only; double it for the two-sided H_A
(pnorm(z, lower.tail = FALSE))

[1] 0.0001092145
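
As a cross-check, base R's `prop.test()` can run a two-sample test of proportions directly from the counts in the table above. This is a sketch rather than the original analysis. Note that `prop.test()` reports a chi-squared statistic built from a pooled standard error, so its p-value will be close to, but not identical to, the hand calculation; `correct = FALSE` turns off the continuity correction so the two are comparable.

```r
# Counts from the tibble: respondents answering 1 (yes) in each gender group
yes <- c(914, 541)     # gender 0, gender 1
n   <- c(1520, 1026)   # group sizes

pt_res <- prop.test(x = yes, n = n, correct = FALSE)

pt_res$estimate   # the two sample proportions
pt_res$conf.int   # 95% CI for the difference in proportions
pt_res$p.value    # two-sided p-value
```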

Student’s t-distribution

Sampling distribution approximates normality when \(n > 30\) and the distribution of \(x\) is roughly symmetric around the mean
- Allows us to calculate SEs, CIs, and p-values from a single sample, without having to observe the sampling distribution itself
- Allows for statistical inference
If \(n < 30\), slight skews in the distribution of \(x\) may generate abnormalities in the sampling distribution
- Can lead to inaccurate (biased) SEs, CIs, p-values, and inferences
To combat this bias, we assume that the sampling distribution follows the t-distribution rather than the z-distribution
- The t-distribution has heavier tails than the z-distribution
- The t-distribution is defined by \(\nu\), degrees of freedom (\(n - 1\))
- When \(\nu > 30\), the t-distribution \(\approx\) the z-distribution
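
A small simulation can illustrate why the heavier tails matter in small samples. This is a sketch under stated assumptions: the population is standard normal, \(n = 10\), and the seed and number of replications are arbitrary choices. It compares how often 95% intervals built with the z and t critical values cover the true mean.

```r
set.seed(42)    # arbitrary seed so the run is reproducible
n    <- 10      # small sample size
reps <- 5000    # number of simulated samples
mu   <- 0       # true population mean

covers <- replicate(reps, {
  x   <- rnorm(n, mean = mu, sd = 1)
  se  <- sd(x) / sqrt(n)
  err <- abs(mean(x) - mu)
  c(z = err <= qnorm(0.975) * se,            # interval using 1.96
    t = err <= qt(0.975, df = n - 1) * se)   # interval using the t critical value
})

# Share of intervals that contain the true mean
coverage <- rowMeans(covers)
coverage
```

The t-based interval is always the wider of the two, so its coverage can never be lower than the z-based interval's; with a sample this small, the z-based interval tends to fall short of the nominal 95%.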

Student’s t-distribution (pdf)

Confidence Intervals with the t-distribution

With the z-distribution, we ALWAYS use 1.96 as the critical value to calculate a 95% confidence interval
- \(95\%CI = Estimate \pm 1.96*SE\)
With the t-distribution, the critical value associated with a 95% confidence interval changes with \(\nu\)
- \(95\%CI = Estimate \pm t_\nu*SE\)

round(qnorm(p = 0.975), 2)

[1] 1.96

round(qt(p = 0.975, df = c(1, 2, 5, 30, 100, 200, 500)), 2)

[1] 12.71  4.30  2.57  2.04  1.98  1.97  1.96

P-values with the t-distribution

With the z-distribution, we use z-scores to calculate p-values
- \(z=\frac{Estimate - Null}{SE}\)
- pnorm(z) or 1 - pnorm(z)
With the t-distribution, we use t-scores to calculate p-values
- \(t=\frac{Estimate - Null}{SE}\)
- pt(t, df) or 1 - pt(t, df)
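
To see how the choice of distribution changes a p-value, the same statistic can be passed to both `pnorm()` and `pt()`. The statistic and degrees of freedom below are made-up values chosen for illustration, and the p-values are two-sided.

```r
stat <- 2.1   # hypothetical test statistic
df   <- 9     # hypothetical degrees of freedom (e.g., one mean with n = 10)

# Two-sided p-values: twice the area beyond |stat| in one tail
p_z <- 2 * pnorm(abs(stat), lower.tail = FALSE)
p_t <- 2 * pt(abs(stat), df = df, lower.tail = FALSE)

c(z = p_z, t = p_t)
```

With so few degrees of freedom, the normal-based p-value falls below 0.05 while the t-based p-value does not, which reflects the heavier tails of the t-distribution.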

Difference of Means (Math)

\(H_0: \mu_1-\mu_2 = 0\)
\(H_A: \mu_1-\mu_2 \neq 0\)

Point estimate: \(\bar{x}_1-\bar{x}_2=5.4-6.1=-0.7\)
Standard error: \(SE(\bar{x}_1-\bar{x}_2)=\sqrt{(\frac{3.3}{\sqrt{957}})^2+(\frac{3.0}{\sqrt{1543}})^2}=0.13\)
Confidence interval: \(95\%CI=-0.7 \pm 1.96*0.13 = [-0.95, -0.45]\)
t-statistic: \(t=\frac{-0.7-0}{0.13}=-5.38\)
p-value: 2 * pt(-5.38, df = (957 + 1543 - 2)) < 0.001 (doubled because \(H_A\) is two-sided)

Difference of Means (Code)

t.test(survey_data$glbcc_risk ~ survey_data$gender)


    Welch Two Sample t-test

data:  survey_data$glbcc_risk by survey_data$gender
t = 3.6927, df = 2097.5, p-value = 0.0002275
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 0.2174340 0.7099311
sample estimates:
mean in group 0 mean in group 1 
       6.134259        5.670577

Difference of Means (Welch’s t-test)

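Welch–Satterthwaite equation for the degrees of freedom, where \(s_i\) and \(n_i\) are each group's sample standard deviation and sample size: \[\nu = \frac{(s_1^2/n_1 + s_2^2/n_2)^2}{(s_1^2/n_1)^2/(n_1-1) + (s_2^2/n_2)^2/(n_2-1)}\]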
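
The equation translates directly into R. In this sketch, `welch_df` is a hypothetical helper name, and the inputs are the rounded standard deviations and group sizes from the summary tibble earlier in the deck, so the result will be close to, but not exactly, the `df` that `t.test()` reports from the raw data.

```r
# Hypothetical helper: Welch-Satterthwaite degrees of freedom
welch_df <- function(s1, n1, s2, n2) {
  v1 <- s1^2 / n1   # squared SE of the first group's mean
  v2 <- s2^2 / n2   # squared SE of the second group's mean
  (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
}

# Rounded summary statistics for the two gender groups
nu <- welch_df(s1 = 2.98, n1 = 1512, s2 = 3.18, n2 = 1023)
nu
```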

t-test calculations…

Independent (unpaired) two-sample t-test
- Unequal sample sizes, unequal variances
- Equal sample sizes, equal variances
Dependent (paired) two-sample t-test
- Matched pairs
- Repeated measures
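
Each variant listed above corresponds to an argument of base R's `t.test()`. The sketch below uses simulated data, so the vectors `before` and `after` are hypothetical and not drawn from the survey.

```r
set.seed(1)  # arbitrary seed so the run is reproducible

# Simulated repeated measures on the same 20 hypothetical units
before <- rnorm(20, mean = 5, sd = 2)
after  <- before + rnorm(20, mean = 0.5, sd = 1)

# Independent samples, unequal variances (Welch; R's default)
welch <- t.test(before, after)

# Independent samples, equal variances assumed (pooled)
pooled <- t.test(before, after, var.equal = TRUE)

# Dependent samples: matched pairs or repeated measures
paired <- t.test(before, after, paired = TRUE)

# A paired test is a one-sample t-test on the within-pair differences
one_sample <- t.test(before - after)

c(welch = welch$parameter, pooled = pooled$parameter, paired = paired$parameter)
```

The degrees of freedom differ across the three tests: the pooled test uses \(n_1 + n_2 - 2\), the paired test uses the number of pairs minus one, and the Welch test uses the Welch–Satterthwaite approximation.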