M. Drew LaMar
October 14, 2020
“Statisticians, like artists, have the bad habit of falling in love with their models.”
- George Box
We will be comparing the means of a numerical variable between two groups.
Definition: In the paired design, both treatments are applied to every sampled unit. In the two-sample design, each treatment group is composed of an independent, random sample of units.
Remember standard error: \[ \sigma_{\bar{Y}} = \frac{\sigma}{\sqrt{n}} \]
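A quick numerical check of this formula may help. The sketch below uses a hypothetical helper, `se_mean`, and made-up values of \( \sigma \) and \( n \) chosen only for illustration.

```r
# Standard error of the mean for a population sd `sigma` and sample size `n`.
# `se_mean` is a hypothetical helper name used only for illustration.
se_mean <- function(sigma, n) sigma / sqrt(n)

se_mean(sigma = 10, n = 25)   # 10 / 5  = 2
se_mean(sigma = 10, n = 100)  # quadrupling n halves the SE: 10 / 10 = 1
se_mean(sigma = 5,  n = 25)   # halving sigma halves the SE: 5 / 5 = 1
```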
We can increase power and the precision of our estimates by decreasing the standard error through…
1. …increasing the sample size \( n \) (denominator).
2. …decreasing the variability \( \sigma \) in our measured variable (numerator).
The paired design mainly affects point 2 above, i.e., it reduces variability. How?
[Figure: unpaired versus paired design]
Discuss: Can you come up with an example of a paired and unpaired design?
From the book:
Definition: Paired measurements are converted to a single measurement by taking the difference between them.
\[ d = Y_{T}-Y_{C}, \]
where \( Y_{T} \) and \( Y_{C} \) denote the variable in the treatment and control groups, respectively.
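If \( Y_{T}\sim N(\mu_{T},\sigma_{T}^2) \) and \( Y_{C}\sim N(\mu_{C},\sigma_{C}^2) \) are jointly normal, and \( d = Y_{T}-Y_{C} \), then
\[ d \sim N(\mu_{T}-\mu_{C},\ \sigma_{T}^2 + \sigma_{C}^2 - 2\,\mathrm{Cov}(Y_{T},Y_{C})) \]
When the two measurements are independent, the covariance is zero and the variance reduces to \( \sigma_{T}^2 + \sigma_{C}^2 \); positively correlated pairs give a smaller variance.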
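How much the variance of \( d \) shrinks depends on how strongly the paired measurements are correlated. Here is a minimal sketch, assuming the standard identity \( \mathrm{Var}(Y_{T}-Y_{C}) = \sigma_{T}^2 + \sigma_{C}^2 - 2\rho\,\sigma_{T}\sigma_{C} \). The helper `var_diff`, the correlation `rho`, and the numbers are all illustrative assumptions, not values from the lecture.

```r
# Variance of d = Y_T - Y_C when the two measurements have correlation rho.
# `var_diff` is a hypothetical helper; rho is an assumed within-pair correlation.
var_diff <- function(sd_t, sd_c, rho = 0) {
  sd_t^2 + sd_c^2 - 2 * rho * sd_t * sd_c
}

var_diff(3, 4, rho = 0)    # independent measurements: 9 + 16 = 25
var_diff(3, 4, rho = 0.5)  # positively correlated pairs: 25 - 12 = 13
```

With `rho = 0` the result is just the sum of the two variances; any positive correlation within pairs lowers it, which is where the paired design gets its extra precision.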
Confidence intervals
\[ \bar{d} - t_{\alpha(2),df}\mathrm{SE}_{\bar{d}} < \mu_{d} < \bar{d} + t_{\alpha(2),df}\mathrm{SE}_{\bar{d}} \]
Paired \( t \)-test: a one-sample \( t \)-test on the differences \( d \)
\[ H_{0}: \mu_{d} = \mu_{d0} \] \[ H_{A}: \mu_{d} \neq \mu_{d0} \]
Test statistic:
\[ t = \frac{\bar{d} - \mu_{d0}}{SE_{\bar{d}}} \]
Assumptions: Same as one-sample t-test
Question: Can the death rate be influenced by tax incentives?
Kopczuk and Slemrod (2003) investigated this possibility using data on deaths in the United States in years in which the government announced it was changing (usually raising) the tax rate on inheritance (the estate tax). The authors calculated the death rate during the 14 days before, and the 14 days after, the changes in the estate tax rates took effect. The number of deaths per day for each of these periods was recorded.
yearOfChange HigherTaxDeaths lowerTaxDeaths
1 1917 22.21 24.93
2 1917 18.86 20.00
3 1919 28.21 29.93
4 1924 31.64 30.64
5 1926 18.43 20.86
6 1932 9.50 10.14
7 1934 24.29 28.00
8 1935 26.64 25.29
9 1940 35.07 35.00
10 1941 38.86 37.57
11 1942 28.50 34.79
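# Strip chart of death rates under the higher and lower estate tax rates
with(deathRate,
     stripchart(list(HigherTaxDeaths, lowerTaxDeaths),
                vertical = TRUE,
                group.names = c("Higher", "Lower"),
                xlim = c(0.5, 2.5),
                pch = 16,
                col = "firebrick",
                ylab = "Death Rate", xlab = "Estate tax rate",
                cex = 1.5, cex.lab = 1.5, cex.axis = 1.5))
# Connect the two observations from each tax change to show the pairing
with(deathRate,
     segments(1, HigherTaxDeaths,
              2, lowerTaxDeaths))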
Question: What are the null and alternative hypotheses?
Answer:
\[ \begin{align} H_{0}: & \mathrm{Mean \ change \ in \ death \ rate \ is \ zero}\\ H_{A}: & \mathrm{Mean \ change \ in \ death \ rate \ is \ not \ zero} \end{align} \]
Answer:
\[ H_{0}: \mu_{d} = 0 \] \[ H_{A}: \mu_{d} \neq 0 \]
Let's do a one-sample \( t \)-test
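library(dplyr)  # provides %>%, mutate(), and summarize()

deathRate %>%
  mutate(d = HigherTaxDeaths - lowerTaxDeaths) %>%  # paired differences
  summarize(n = length(d),
            sderr = sd(d) / sqrt(n),                 # standard error of the mean difference
            tstat = (mean(d) - 0) / sderr,           # test statistic under H0: mu_d = 0
            pval = 2 * pt(abs(tstat), df = n - 1, lower.tail = FALSE))  # two-sided P-value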
n sderr tstat pval
1 11 0.7103096 -1.912098 0.08491016
Let's do the same test with t.test() and paired = TRUE
with(deathRate,
     t.test(HigherTaxDeaths,
            lowerTaxDeaths,
            mu = 0,
            paired = TRUE))
Paired t-test
data: HigherTaxDeaths and lowerTaxDeaths
t = -1.9121, df = 10, p-value = 0.08491
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.9408501 0.2244865
sample estimates:
mean of the differences
-1.358182
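The 95 percent confidence interval in this output can be reproduced by hand from the interval formula \( \bar{d} \pm t_{\alpha(2),df}\,\mathrm{SE}_{\bar{d}} \). The sketch below rebuilds `deathRate` from the table printed earlier (the lecture presumably loads it from a file, so the `data.frame()` call is an assumption) and uses `qt()` for the critical value.

```r
# Death-rate data copied from the table shown earlier in these slides
deathRate <- data.frame(
  yearOfChange    = c(1917, 1917, 1919, 1924, 1926, 1932, 1934, 1935, 1940, 1941, 1942),
  HigherTaxDeaths = c(22.21, 18.86, 28.21, 31.64, 18.43, 9.50, 24.29, 26.64, 35.07, 38.86, 28.50),
  lowerTaxDeaths  = c(24.93, 20.00, 29.93, 30.64, 20.86, 10.14, 28.00, 25.29, 35.00, 37.57, 34.79)
)

d     <- with(deathRate, HigherTaxDeaths - lowerTaxDeaths)  # paired differences
n     <- length(d)
dbar  <- mean(d)                  # mean difference
se    <- sd(d) / sqrt(n)          # standard error of the mean difference
tcrit <- qt(0.975, df = n - 1)    # two-sided 95% critical value

ci <- c(lower = dbar - tcrit * se, upper = dbar + tcrit * se)
ci  # should agree with the interval printed by t.test()
```

Because the interval contains zero, it tells the same story as the P-value of 0.085: the data do not rule out a mean change of zero at the 5% level.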
In the two-sample design, the two sample means come from independent samples, so
\[ \bar{Y}_{1}\sim N(\mu_{1},\sigma_{\bar{Y}_{1}}^2) \ \mathrm{and} \ \bar{Y}_{2}\sim N(\mu_{2},\sigma_{\bar{Y}_{2}}^2) \\ \bar{Y}_{1} - \bar{Y}_{2}\sim N(\mu_{1}-\mu_{2}, \sigma_{\bar{Y}_{1}}^2+\sigma_{\bar{Y}_{2}}^2) \]
Definition: The standard error of the difference of the means between two groups is given by \[ \mathrm{SE}_{\bar{Y_{1}}-\bar{Y_{2}}} = \sqrt{s_{p}^2\left(\frac{1}{n_{1}} + \frac{1}{n_{2}}\right)} \] where the pooled sample variance \( s_{p}^{2} \) is given by
\[ s_{p}^2 = \frac{df_{1}s_{1}^2 + df_{2}s_{2}^2}{df_{1}+df_{2}}. \]
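These two formulas translate directly into R. The sketch below takes \( df_{i} = n_{i} - 1 \); the helper names `pooled_var` and `se_diff` and the input values are hypothetical, chosen so the arithmetic is easy to follow.

```r
# Pooled sample variance: a df-weighted average of the two sample variances.
# `pooled_var` and `se_diff` are hypothetical helper names for illustration.
pooled_var <- function(s1, s2, n1, n2) {
  df1 <- n1 - 1
  df2 <- n2 - 1
  (df1 * s1^2 + df2 * s2^2) / (df1 + df2)
}

# Standard error of the difference between the two sample means
se_diff <- function(s1, s2, n1, n2) {
  sqrt(pooled_var(s1, s2, n1, n2) * (1 / n1 + 1 / n2))
}

# Illustrative values: sample variances 4 and 6, five units per group
pooled_var(s1 = 2, s2 = sqrt(6), n1 = 5, n2 = 5)  # (4*4 + 4*6) / 8 = 5
se_diff(s1 = 2, s2 = sqrt(6), n1 = 5, n2 = 5)     # sqrt(5 * 2/5) = sqrt(2)
```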
Since the sampling distribution of \( \bar{Y}_{1} - \bar{Y}_{2} \) is normal,
\[ \bar{Y}_{1} - \bar{Y}_{2}\sim N(\mu_{1}-\mu_{2}, \sigma_{\bar{Y}_{1}}^2+\sigma_{\bar{Y}_{2}}^2) \]
the sampling distribution of the statistic
\[ t = \frac{\left(\bar{Y}_{1} - \bar{Y}_{2}\right) - \left(\mu_{1}-\mu_{2}\right)}{\mathrm{SE}_{\bar{Y}_{1} - \bar{Y}_{2}}} \]
has a Student's \( t \)-distribution with total degrees of freedom given by
\[ df = df_{1} + df_{2} = n_{1} + n_{2} - 2. \]
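As a rough check, the statistic can be computed by hand and compared with `t.test(..., var.equal = TRUE)`, which uses the pooled variance. The example below borrows R's built-in `mtcars` data (fuel economy of automatic versus manual cars) purely for illustration; it is not a data set from this lecture, and whether it meets the two-sample assumptions is not assessed here.

```r
# Built-in mtcars data: miles per gallon for automatic (am = 0) and manual (am = 1) cars
y1 <- mtcars$mpg[mtcars$am == 0]
y2 <- mtcars$mpg[mtcars$am == 1]

df1 <- length(y1) - 1
df2 <- length(y2) - 1

# Pooled variance and standard error of the difference in means
sp2 <- (df1 * var(y1) + df2 * var(y2)) / (df1 + df2)
se  <- sqrt(sp2 * (1 / length(y1) + 1 / length(y2)))

# Test statistic under H0: mu_1 - mu_2 = 0, with a two-sided P-value
t_by_hand <- ((mean(y1) - mean(y2)) - 0) / se
p_by_hand <- 2 * pt(abs(t_by_hand), df = df1 + df2, lower.tail = FALSE)

# Same test via t.test() with the equal-variance (pooled) option
tt <- t.test(y1, y2, var.equal = TRUE)
c(by_hand = t_by_hand, t.test = unname(tt$statistic))
```

Note that `t.test()` defaults to `var.equal = FALSE` (Welch's test), so the pooled version has to be requested explicitly.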