Comparing two means

M. Drew LaMar
November 1, 2019

“Statisticians, like artists, have the bad habit of falling in love with their models.”

- George Box

Class Announcements

Reading Assignment for Monday, November 4: Whitlock & Schluter: Chapter 12 - Comparing two means (~~QUIZ~~)

Estimating population variance

In many cases, it isn't the mean that we are interested in estimating but the variability of a population measure.

Remember, variance is also a population parameter, so we should be able to estimate it.

Stalk-eyed flies have staring contests! Flies with longer eye stalks usually win.

Estimating population variance

Definition: If \( Y \) has a normal distribution, then the sampling distribution of the quantity \[ \chi^{2} = (n-1)s^2/\sigma^2 \] is the \( \chi^2 \) distribution with \( n-1 \) degrees of freedom.

\[ \frac{df s^2}{\chi^2_{\alpha/2,df}} < \sigma^2 < \frac{df s^2}{\chi^2_{1-\alpha/2,df}} \]
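
The definition above can be checked informally by simulation. The sketch below assumes a hypothetical normal population (the values of `n`, `sigma`, and `nsim` are illustrative, not from the textbook example) and compares the simulated quantity with the known mean and variance of a \( \chi^2 \) distribution.

```r
set.seed(1)      # arbitrary seed, only so the run is reproducible
n     <- 10      # hypothetical sample size
sigma <- 2       # hypothetical population standard deviation
nsim  <- 10000   # number of simulated samples

# For each simulated normal sample, compute (n - 1) * s^2 / sigma^2
chi2 <- replicate(nsim, (n - 1) * var(rnorm(n, mean = 0, sd = sigma)) / sigma^2)

# A chi-square distribution with n - 1 df has mean n - 1 and variance 2 * (n - 1)
c(simulated_mean = mean(chi2), theoretical_mean = n - 1)
c(simulated_var  = var(chi2),  theoretical_var  = 2 * (n - 1))
```

The simulated mean and variance should land close to \( n-1 \) and \( 2(n-1) \), up to simulation noise.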

Example 11.2: Stalk-eyed flies

myData <- read.csv("http://whitlockschluter.zoology.ubc.ca/wp-content/data/chapter11/chap11e2Stalkies.csv")
eyespan <- myData$eyespan
summary(eyespan)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  8.150   8.630   8.690   8.778   8.960   9.450

Example 11.2: Stalk-eyed flies

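svar <- var(eyespan)               # sample variance s^2
df <- length(eyespan) - 1          # degrees of freedom, n - 1
# Chi-square critical values with 2.5% of the area in each tail
chi2L <- qchisq(0.025, df=df, lower.tail=TRUE)
chi2U <- qchisq(0.025, df=df, lower.tail=FALSE)
# Dividing by the larger critical value gives the lower bound
ci <- c(df*svar/chi2U, df*svar/chi2L)
names(ci) <- c("lower bound", "upper bound")
ci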

lower bound upper bound 
 0.07238029  0.58225336

Note: This interval relies on the same assumptions as the confidence interval for the mean (random sample, normally distributed variable), but it is much less robust to departures from those assumptions!
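
One way to see the lack of robustness is a small coverage simulation. This is a sketch under assumed settings (sample size 20, an exponential population standing in for "non-normal"); the helper `covers()` is hypothetical and written only for this illustration.

```r
set.seed(2)    # arbitrary seed for reproducibility
n    <- 20     # hypothetical sample size
nsim <- 2000   # number of simulated samples per population

# Does the 95% chi-square interval for the variance contain the true variance?
covers <- function(y, true_var, alpha = 0.05) {
  df    <- length(y) - 1
  lower <- df * var(y) / qchisq(alpha / 2, df = df, lower.tail = FALSE)
  upper <- df * var(y) / qchisq(alpha / 2, df = df, lower.tail = TRUE)
  lower < true_var && true_var < upper
}

# Both populations have variance 1: standard normal and exponential with rate 1
coverage_normal <- mean(replicate(nsim, covers(rnorm(n), true_var = 1)))
coverage_skewed <- mean(replicate(nsim, covers(rexp(n, rate = 1), true_var = 1)))

c(normal = coverage_normal, exponential = coverage_skewed)
```

With normal data the interval should cover the true variance in roughly 95% of samples; with the skewed population the coverage typically falls well short of that.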

It all starts with experimental design

We will be comparing the means of a numerical variable between two groups.

Definition: In the paired design, both treatments are applied to every sampled unit. In the two-sample design, each treatment group is composed of an independent, random sample of units.

Data:

Response: One numerical variable
Explanatory: One categorical variable with 2 levels

Paired design

Remember standard error: \[ \sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}} \]

We can increase power and the precision of our estimates by decreasing the standard error through…

…increasing the sample size (denominator).
…decreasing the variability \( \sigma \) in our measured variable (numerator).

The paired design mainly affects the second item above, i.e., it reduces variability. How?
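
A minimal simulation sketch of one answer: when both measurements come from the same unit, unit-to-unit variation cancels in the difference. All numbers below (baseline spread, measurement noise, effect size) are hypothetical and chosen only to make the contrast visible.

```r
set.seed(3)       # arbitrary seed for reproducibility
n_units <- 1000   # large, so the sample standard deviations are stable

# Each unit has its own baseline, shared by both of its measurements
baseline  <- rnorm(n_units, mean = 50, sd = 10)                # unit-to-unit variation
control   <- baseline + rnorm(n_units, mean = 0, sd = 2)       # measurement noise
treatment <- baseline + 1 + rnorm(n_units, mean = 0, sd = 2)   # true effect of 1

# Paired analysis: the baseline cancels in the within-unit difference
sd_paired <- sd(treatment - control)

# Treating the two columns as unrelated samples keeps the baseline variation
sd_unpaired <- sqrt(var(treatment) + var(control))

c(paired = sd_paired, unpaired = sd_unpaired)
```

The standard deviation of the paired differences reflects only the measurement noise, so the standard error of the mean difference is much smaller for the same number of units.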

Experimental Design

[Figures: unpaired design; paired design]

Paired vs. Unpaired

[Figures: unpaired; paired]

Paired design examples

Discuss: Can you come up with an example of a paired and unpaired design?

From the book:

Comparing patient weight before and after hospitalization
Comparing fish species diversity in lakes before and after heavy metal contamination
Testing effects of sunscreen applied to one arm of each subject compared with a placebo applied to the other arm
Testing effects of smoking in a sample of smokers, each of which is compared with a nonsmoker closely matched by age, weight, and ethnic background

Paired design: What is our resulting variable?

Definition: Paired measurements are converted to a single measurement by taking the difference between them.

\[ d = Y_{T}-Y_{C}, \]

where \( Y_{T} \) and \( Y_{C} \) denote the variable in the treatment and control groups, respectively.

Paired design: Estimation

If \( Y_{T}\sim N(\mu_{T},\sigma_{T}^2) \), \( Y_{C}\sim N(\mu_{C},\sigma_{C}^2) \), and \( d = Y_{T}-Y_{C} \), then

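\[ d \sim N(\mu_{T}-\mu_{C},\ \sigma_{T}^2 + \sigma_{C}^2 - 2\,\mathrm{Cov}(Y_{T},Y_{C})) \]

Measurements taken on the same unit are usually positively correlated, so the covariance term makes the variance of \( d \) smaller than \( \sigma_{T}^2 + \sigma_{C}^2 \), the value it would take if the two measurements were independent.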

Confidence intervals

\[ \bar{d} - t_{\alpha(2),df}\mathrm{SE}_{\bar{d}} < \mu_{d} < \bar{d} + t_{\alpha(2),df}\mathrm{SE}_{\bar{d}} \]

Paired design: Hypothesis testing

Paired \( t \)-test: One-sample \( t \)-test on the differences \( d \)

\[ H_{0}: \mu_{d} = \mu_{d0} \] \[ H_{A}: \mu_{d} \neq \mu_{d0} \]

Test statistic:

\[ t = \frac{\bar{d} - \mu_{d0}}{SE_{\bar{d}}} \]

Assumptions: Same as one-sample t-test

The sampling units are randomly sampled from the population.
Paired differences have a normal distribution in the population. Original measurements DO NOT have to be normal.

Paired design: Practice Problem #1

Question: Can the death rate be influenced by tax incentives?

Kopczuk and Slemrod (2003) investigated this possibility using data on deaths in the United States in years in which the government announced it was changing (usually raising) the tax rate on inheritance (the estate tax). The authors calculated the death rate during the 14 days before, and the 14 days after, the changes in the estate tax rates took effect. The number of deaths per day for each of these periods was recorded.

Paired design: Practice Problem #1

   yearOfChange HigherTaxDeaths lowerTaxDeaths
1          1917           22.21          24.93
2          1917           18.86          20.00
3          1919           28.21          29.93
4          1924           31.64          30.64
5          1926           18.43          20.86
6          1932            9.50          10.14
7          1934           24.29          28.00
8          1935           26.64          25.29
9          1940           35.07          35.00
10         1941           38.86          37.57
11         1942           28.50          34.79
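
For readers following along without the data file, the table above can be entered by hand. This sketch assumes the printed values are complete to two decimals; the column names are copied from the table, and the vector `d` is introduced here only for illustration.

```r
# Deaths per day in the 14-day periods around each estate tax change
deathRate <- data.frame(
  yearOfChange    = c(1917, 1917, 1919, 1924, 1926, 1932,
                      1934, 1935, 1940, 1941, 1942),
  HigherTaxDeaths = c(22.21, 18.86, 28.21, 31.64, 18.43, 9.50,
                      24.29, 26.64, 35.07, 38.86, 28.50),
  lowerTaxDeaths  = c(24.93, 20.00, 29.93, 30.64, 20.86, 10.14,
                      28.00, 25.29, 35.00, 37.57, 34.79)
)

# Paired difference for each tax change: higher-tax period minus lower-tax period
d <- deathRate$HigherTaxDeaths - deathRate$lowerTaxDeaths
mean(d)
```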

Paired design: Practice Problem #1

with(deathRate, 
     stripchart(list(HigherTaxDeaths, 
                     lowerTaxDeaths), 
                vertical = TRUE, 
                group.names = c("Higher","Lower"), 
                xlim=c(0.5, 2.5), 
                pch = 16, 
                col = "firebrick", 
                ylab="Death Rate", xlab="Estate tax rate", 
                cex=1.5, cex.lab=1.5, cex.axis=1.5))

with(deathRate, 
     segments(1, HigherTaxDeaths, 
              2, lowerTaxDeaths))

Paired design: Practice Problem #1

Question: What are the null and alternate hypotheses?

Answer:
\[ \begin{align} H_{0}: & \mathrm{Mean \ change \ in \ death \ rate \ is \ zero}\\ H_{A}: & \mathrm{Mean \ change \ in \ death \ rate \ is \ not \ zero} \end{align} \]

Answer:
\[ H_{0}: \mu_{d} = 0 \] \[ H_{A}: \mu_{d} \neq 0 \]

Paired design: Practice Problem #1

Let's do a one-sample \( t \)-test

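library(dplyr)  # provides %>%, mutate(), and summarize()

deathRate %>% 
  mutate(d = HigherTaxDeaths - lowerTaxDeaths) %>%   # paired differences
  summarize(n = length(d),
            sderr = sd(d)/sqrt(n),                   # standard error of mean difference
            tstat = (mean(d) - 0)/sderr,
            pval = 2*pt(abs(tstat), df=n-1, lower.tail=FALSE))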

   n     sderr     tstat       pval
1 11 0.7103096 -1.912098 0.08491016

Paired design: Practice Problem #1

Let's do a one-sample \( t \)-test

with(deathRate, 
     t.test(HigherTaxDeaths, 
            lowerTaxDeaths, 
            mu = 0, 
            paired = TRUE))

Paired design: Practice Problem #1

Let's do a one-sample \( t \)-test


    Paired t-test

data:  HigherTaxDeaths and lowerTaxDeaths
t = -1.9121, df = 10, p-value = 0.08491
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.9408501  0.2244865
sample estimates:
mean of the differences 
              -1.358182
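
The confidence interval in this output can be reproduced by hand from \( \bar{d} \pm t_{\alpha(2),df}\mathrm{SE}_{\bar{d}} \). The sketch below re-enters the two columns from the data table (the vector names `higher` and `lower` are introduced only for this illustration) and also checks that a one-sample test on the differences agrees with the paired test.

```r
# Deaths per day, typed in from the practice problem table
higher <- c(22.21, 18.86, 28.21, 31.64, 18.43, 9.50,
            24.29, 26.64, 35.07, 38.86, 28.50)
lower  <- c(24.93, 20.00, 29.93, 30.64, 20.86, 10.14,
            28.00, 25.29, 35.00, 37.57, 34.79)
d <- higher - lower                # paired differences

n     <- length(d)
se    <- sd(d) / sqrt(n)           # standard error of the mean difference
tcrit <- qt(0.975, df = n - 1)     # two-sided 95% critical value

# 95% confidence interval for the mean difference
ci <- mean(d) + c(-1, 1) * tcrit * se
ci

# A one-sample t-test on d is the same test as the paired t-test
one_sample <- t.test(d, mu = 0)
paired     <- t.test(higher, lower, paired = TRUE)
c(one_sample = unname(one_sample$statistic), paired = unname(paired$statistic))
```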

Two-sample design: Estimation

\[ \bar{Y}_{1}\sim N(\mu_{1},\sigma_{\bar{Y}_{1}}^2) \ \mathrm{and} \ \bar{Y}_{2}\sim N(\mu_{2},\sigma_{\bar{Y}_{2}}^2) \\ \bar{Y}_{1} - \bar{Y}_{2}\sim N(\mu_{1}-\mu_{2}, \sigma_{\bar{Y}_{1}}^2+\sigma_{\bar{Y}_{2}}^2) \]

Definition: Assuming the two populations have the same variance, the standard error of the difference between the means of two groups is given by \[ \mathrm{SE}_{\bar{Y_{1}}-\bar{Y_{2}}} = \sqrt{s_{p}^2\left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right)} \] where the pooled sample variance \( s_{p}^{2} \) is given by

\[ s_{p}^2 = \frac{df_{1}s_{1}^2 + df_{2}s_{2}^2}{df_{1}+df_{2}}. \]
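
A short sketch of the pooled variance and standard error calculation. The two samples `y1` and `y2` are made-up values used only to show the arithmetic.

```r
# Hypothetical measurements for two independent groups (illustrative values only)
y1 <- c(5.1, 4.8, 6.0, 5.5, 5.9, 4.7)
y2 <- c(6.2, 6.8, 5.9, 7.1, 6.5)

df1 <- length(y1) - 1
df2 <- length(y2) - 1

# Pooled sample variance: df-weighted average of the two sample variances
sp2 <- (df1 * var(y1) + df2 * var(y2)) / (df1 + df2)

# Standard error of the difference between the two sample means
se_diff <- sqrt(sp2 * (1 / length(y1) + 1 / length(y2)))

c(pooled_variance = sp2, se_difference = se_diff)
```

Because \( s_{p}^2 \) is a weighted average, it always lies between the two sample variances.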

Two-sample design: Estimation

Since the sampling distribution of \( \bar{Y}_{1} - \bar{Y}_{2} \) is normal,

\[ \bar{Y}_{1} - \bar{Y}_{2}\sim N(\mu_{1}-\mu_{2}, \sigma_{\bar{Y}_{1}}^2+\sigma_{\bar{Y}_{2}}^2) \]

the sampling distribution of the statistic

\[ t = \frac{\left(\bar{Y}_{1} - \bar{Y}_{2}\right) - \left(\mu_{1}-\mu_{2}\right)}{\mathrm{SE}_{\bar{Y}_{1} - \bar{Y}_{2}}} \]

has a Student's \( t \)-distribution with total degrees of freedom given by

\[ df = df_{1} + df_{2} = n_{1} + n_{2} - 2. \]
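
The statistic and its degrees of freedom can be computed by hand and compared with R's built-in test. This is a sketch on made-up data (`y1` and `y2` are hypothetical); note that `t.test()` uses Welch's unequal-variance version by default, so `var.equal = TRUE` is needed to get the pooled-variance test described here.

```r
# Hypothetical measurements for two independent groups (illustrative values only)
y1 <- c(5.1, 4.8, 6.0, 5.5, 5.9, 4.7)
y2 <- c(6.2, 6.8, 5.9, 7.1, 6.5)
n1 <- length(y1)
n2 <- length(y2)

# Total degrees of freedom and pooled standard error
df      <- n1 + n2 - 2
sp2     <- ((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / df
se_diff <- sqrt(sp2 * (1 / n1 + 1 / n2))

# t statistic under the null hypothesis mu1 - mu2 = 0, with two-sided P-value
tstat <- ((mean(y1) - mean(y2)) - 0) / se_diff
pval  <- 2 * pt(abs(tstat), df = df, lower.tail = FALSE)

# Built-in pooled-variance two-sample t-test for comparison
builtin <- t.test(y1, y2, var.equal = TRUE)
c(by_hand = tstat, built_in = unname(builtin$statistic))
```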