Comparing two means

M. Drew LaMar
October 29, 2018

“Statisticians, like artists, have the bad habit of falling in love with their models.”

- George Box

Class Announcements

  • Reading Assignment: Whitlock & Schluter: Chapter 12 - Comparing two means (NO QUIZ)
  • Homework #6: Chapters 9 and 10 (due Monday, November 5, 1:00 pm)
  • Lab #6: Data Visualization in R (due Monday, November 12, 11:59 pm)

It all starts with experimental design

We will be comparing the means of a numerical variable between two groups.

Definition: In the paired design, both treatments are applied to every sampled unit. In the two-sample design, each treatment group is composed of an independent, random sample of units.

Data:

  • Response: One numerical variable
  • Explanatory: One categorical variable with 2 levels

Paired design

Recall the standard error of the mean: \[ \sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}} \]

We can increase power and the precision of our estimates by decreasing the standard error through…

  1. …increasing the sample size (denominator).
  2. …decreasing the variability \( \sigma \) in our measured variable (numerator).

    The paired design mainly affects point 2 above, i.e., it reduces variability. How?
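One possible way to see the answer is a small simulation. The sketch below is illustrative only: the sample size, means, and standard deviations are made-up assumptions, not data from the lecture. Each unit has its own baseline, and both measurements are taken on the same unit, so the unit-to-unit variation cancels when we take differences.

```r
# Illustrative simulation; all numbers here are assumptions for this sketch.
set.seed(1)
n <- 30

# Unit-to-unit variation shared by both measurements on a unit
unit_effect <- rnorm(n, mean = 50, sd = 10)

# Both treatments are measured on the same units (paired design)
control   <- unit_effect + rnorm(n, sd = 2)
treatment <- unit_effect + 1 + rnorm(n, sd = 2)  # assumed treatment effect of 1

# Paired analysis: standard error of the mean difference
d <- treatment - control
se_paired <- sd(d) / sqrt(n)

# Same data analyzed as if the two groups were independent samples
se_unpaired <- sqrt(var(treatment) / n + var(control) / n)

c(paired = se_paired, unpaired = se_unpaired)
```

In this setup the paired standard error should come out much smaller, because the large unit-to-unit variation (sd = 10) drops out of the differences and only the measurement noise (sd = 2) remains.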

Experimental Design

[Figure panels: Unpaired Design; Paired Design]

Paired vs. Unpaired

[Figure panels: Unpaired; Paired]

Paired design examples

Discuss: Can you come up with an example of a paired and unpaired design?

From the book:

  • Comparing patient weight before and after hospitalization
  • Comparing fish species diversity in lakes before and after heavy metal contamination
  • Testing effects of sunscreen applied to one arm of each subject compared with a placebo applied to the other arm
  • Testing effects of smoking in a sample of smokers, each of which is compared with a nonsmoker closely matched by age, weight, and ethnic background

Paired design: What is our resulting variable?

Definition: Paired measurements are converted to a single measurement by taking the difference between them.


\[ d = Y_{T}-Y_{C}, \]

where \( Y_{T} \) and \( Y_{C} \) denote the measurements taken on the same unit (or matched pair) under the treatment and the control, respectively.

Paired design: Estimation

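If \( Y_{T}\sim N(\mu_{T},\sigma_{T}^2) \) and \( Y_{C}\sim N(\mu_{C},\sigma_{C}^2) \) are jointly normal, and \( d = Y_{T}-Y_{C} \), then

\[ d \sim N\left(\mu_{T}-\mu_{C},\ \sigma_{T}^2 + \sigma_{C}^2 - 2\,\mathrm{Cov}(Y_{T},Y_{C})\right) \]

Paired measurements are usually positively correlated, so the covariance term makes the variance of \( d \) smaller than \( \sigma_{T}^2 + \sigma_{C}^2 \), the value it would have if the two measurements were independent.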

Confidence intervals

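\[ \bar{d} - t_{\alpha(2),df}\mathrm{SE}_{\bar{d}} < \mu_{d} < \bar{d} + t_{\alpha(2),df}\mathrm{SE}_{\bar{d}} \]

where \( \mathrm{SE}_{\bar{d}} = s_{d}/\sqrt{n} \), \( s_{d} \) is the sample standard deviation of the differences, and \( df = n - 1 \) for \( n \) pairs.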

Paired design: Hypothesis testing

Paired \( t \)-test: One-sample \( t \)-test on the difference d

\[ H_{0}: \mu_{d} = \mu_{d0} \] \[ H_{A}: \mu_{d} \neq \mu_{d0} \]

Test statistic:

\[ t = \frac{\bar{d} - \mu_{d0}}{\mathrm{SE}_{\bar{d}}} \]

Assumptions: Same as one-sample t-test

  • The sampling units are randomly sampled from the population.
  • The paired differences have a normal distribution in the population. The original measurements DO NOT have to be normal.

Paired design: Practice Problem #1

Question: Can the death rate be influenced by tax incentives?

Kopczuk and Slemrod (2003) investigated this possibility using data on deaths in the United States in years in which the government announced it was changing (usually raising) the tax rate on inheritance (the estate tax). The authors calculated the death rate during the 14 days before, and the 14 days after, the changes in the estate tax rates took effect. The number of deaths per day for each of these periods was recorded.

Paired design: Practice Problem #1

   yearOfChange HigherTaxDeaths lowerTaxDeaths
1          1917           22.21          24.93
2          1917           18.86          20.00
3          1919           28.21          29.93
4          1924           31.64          30.64
5          1926           18.43          20.86
6          1932            9.50          10.14
7          1934           24.29          28.00
8          1935           26.64          25.29
9          1940           35.07          35.00
10         1941           38.86          37.57
11         1942           28.50          34.79
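The slides do not show how `deathRate` was loaded. As a stand-in, the data frame can be typed in directly from the table above; this is an assumption about how to reproduce the data, not the original loading code.

```r
# Rebuild the data frame by hand from the printed table
# (assumed construction; the original data-loading step is not shown).
deathRate <- data.frame(
  yearOfChange    = c(1917, 1917, 1919, 1924, 1926, 1932,
                      1934, 1935, 1940, 1941, 1942),
  HigherTaxDeaths = c(22.21, 18.86, 28.21, 31.64, 18.43, 9.50,
                      24.29, 26.64, 35.07, 38.86, 28.50),
  lowerTaxDeaths  = c(24.93, 20.00, 29.93, 30.64, 20.86, 10.14,
                      28.00, 25.29, 35.00, 37.57, 34.79)
)

str(deathRate)
```

With this object in the workspace, the plotting and testing code on the following slides can be run as written.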

Paired design: Practice Problem #1

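[Figure: strip chart of death rate under higher and lower estate tax rates, with line segments connecting the paired observations]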

Paired design: Practice Problem #1

with(deathRate, 
     stripchart(list(HigherTaxDeaths, 
                     lowerTaxDeaths), 
                vertical = TRUE, 
                group.names = c("Higher","Lower"), 
                xlim=c(0.5, 2.5), 
                pch = 16, 
                col = "firebrick", 
                ylab="Death Rate", xlab="Estate tax rate", 
                cex=1.5, cex.lab=1.5, cex.axis=1.5))

with(deathRate, 
     segments(1, HigherTaxDeaths, 
              2, lowerTaxDeaths))

Paired design: Practice Problem #1

Question: What are the null and alternative hypotheses?

Answer (in words):
\[ \begin{aligned} H_{0}: & \ \text{Mean change in death rate is zero}\\ H_{A}: & \ \text{Mean change in death rate is not zero} \end{aligned} \]

Answer (in symbols):
\[ H_{0}: \mu_{d} = 0 \] \[ H_{A}: \mu_{d} \neq 0 \]

Paired design: Practice Problem #1

Let's do a one-sample \( t \)-test

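library(dplyr)  # provides %>%, mutate(), and summarize()

deathRate %>% 
  mutate(d = HigherTaxDeaths - lowerTaxDeaths) %>%   # paired differences
  summarize(n = length(d),                           # number of pairs
            sderr = sd(d)/sqrt(n),                   # standard error of the mean difference
            tstat = (mean(d) - 0)/sderr,             # test statistic under H0: mu_d = 0
            pval = 2*pt(abs(tstat), df=n-1, lower.tail=FALSE))  # two-sided P-value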
   n     sderr     tstat       pval
1 11 0.7103096 -1.912098 0.08491016

Paired design: Practice Problem #1

The same test using R's built-in t.test() with paired = TRUE

with(deathRate, 
     t.test(HigherTaxDeaths, 
            lowerTaxDeaths, 
            mu = 0, 
            paired = TRUE))

    Paired t-test

data:  HigherTaxDeaths and lowerTaxDeaths
t = -1.9121, df = 10, p-value = 0.08491
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.9408501  0.2244865
sample estimates:
mean of the differences 
              -1.358182 
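As a check on the confidence-interval formula from the estimation slide, the 95% interval reported by `t.test()` can be rebuilt by hand. The sketch below retypes the two columns from the data table so that it runs on its own; the variable names `higher`, `lower`, `se_d`, and `t_crit` are chosen here for illustration.

```r
# Death rates retyped from the data table (deaths per day)
higher <- c(22.21, 18.86, 28.21, 31.64, 18.43, 9.50,
            24.29, 26.64, 35.07, 38.86, 28.50)
lower  <- c(24.93, 20.00, 29.93, 30.64, 20.86, 10.14,
            28.00, 25.29, 35.00, 37.57, 34.79)

d <- higher - lower             # paired differences
n <- length(d)                  # number of pairs
se_d <- sd(d) / sqrt(n)         # standard error of the mean difference

# Two-sided critical value t_{0.05(2), df} with df = n - 1
t_crit <- qt(1 - 0.05 / 2, df = n - 1)

# Confidence interval: mean difference plus or minus t_crit * SE
ci <- mean(d) + c(-1, 1) * t_crit * se_d
ci
```

The interval should agree with the one printed by `t.test()` above, which includes zero, consistent with the P-value being larger than 0.05.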

Two-sample design: Estimation

For two independent samples,

\[ \begin{aligned} \bar{Y}_{1} &\sim N(\mu_{1},\sigma_{\bar{Y}_{1}}^2) \ \text{and} \ \bar{Y}_{2}\sim N(\mu_{2},\sigma_{\bar{Y}_{2}}^2) \\ \bar{Y}_{1} - \bar{Y}_{2} &\sim N(\mu_{1}-\mu_{2}, \sigma_{\bar{Y}_{1}}^2+\sigma_{\bar{Y}_{2}}^2) \end{aligned} \]

Definition: Assuming the two populations have equal variances, the standard error of the difference between the means of two groups is given by

\[ \mathrm{SE}_{\bar{Y}_{1}-\bar{Y}_{2}} = \sqrt{s_{p}^2\left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right)}, \]

where the pooled sample variance \( s_{p}^{2} \) is given by

\[ s_{p}^2 = \frac{df_{1}s_{1}^2 + df_{2}s_{2}^2}{df_{1}+df_{2}}, \]

with \( df_{1} = n_{1} - 1 \) and \( df_{2} = n_{2} - 1 \).

Two-sample design: Estimation

Since the sampling distribution of \( \bar{Y}_{1} - \bar{Y}_{2} \) is normal,

\[ \bar{Y}_{1} - \bar{Y}_{2}\sim N(\mu_{1}-\mu_{2}, \sigma_{\bar{Y}_{1}}^2+\sigma_{\bar{Y}_{2}}^2) \]

the sampling distribution of the statistic

\[ t = \frac{\left(\bar{Y}_{1} - \bar{Y}_{2}\right) - \left(\mu_{1}-\mu_{2}\right)}{\mathrm{SE}_{\bar{Y}_{1} - \bar{Y}_{2}}} \]

has a Student's \( t \)-distribution with total degrees of freedom given by

\[ df = df_{1} + df_{2} = n_{1} + n_{2} - 2. \]
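The pooled standard error, the \( t \) statistic, and the degrees of freedom above can be sketched as a small R function. The function name `pooled_t` and the simulated groups are hypothetical, introduced only for illustration; the result is compared with R's `t.test()` using `var.equal = TRUE`, which performs the pooled-variance two-sample test.

```r
# Hypothetical helper: two-sample t statistic with pooled variance
pooled_t <- function(y1, y2, mu_diff = 0) {
  n1 <- length(y1)
  n2 <- length(y2)
  df1 <- n1 - 1
  df2 <- n2 - 1

  # Pooled sample variance: df-weighted average of the group variances
  sp2 <- (df1 * var(y1) + df2 * var(y2)) / (df1 + df2)

  # Standard error of the difference between the two sample means
  se <- sqrt(sp2 * (1 / n1 + 1 / n2))

  tstat <- ((mean(y1) - mean(y2)) - mu_diff) / se
  df <- df1 + df2  # equals n1 + n2 - 2
  pval <- 2 * pt(abs(tstat), df = df, lower.tail = FALSE)

  list(se = se, t = tstat, df = df, p.value = pval)
}

# Made-up data for illustration only
set.seed(42)
group1 <- rnorm(12, mean = 10, sd = 2)
group2 <- rnorm(15, mean = 12, sd = 2)

res <- pooled_t(group1, group2)
ref <- t.test(group1, group2, var.equal = TRUE)

# The hand computation should match the built-in test
all.equal(res$t, unname(ref$statistic))
all.equal(res$p.value, ref$p.value)
```

Note that `t.test()` defaults to `var.equal = FALSE` (Welch's test), so the argument must be set explicitly to reproduce the pooled-variance formulas above.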