This lesson is about estimating the difference between two population means, \(\mu_1\) and \(\mu_2\). We will consider three different cases depending on whether the samples are independent or dependent and also depending on whether the population variances are equal.
When the samples are dependent, we analyze the differences. More on this below in Case 3.
When the samples are independent, we use the sampling distribution of \(\bar X_1 - \bar X_2\) as the basis of the confidence interval.
\[\bar X_1 - \bar X_2 \sim N(\mu_1 - \mu_2,\sqrt{ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} })\] Here the second argument is the standard deviation of \(\bar X_1 - \bar X_2\), not its variance. Of course, we don’t know \(\sigma_1^2\) and \(\sigma_2^2\), so we estimate them with the sample variances \(s_1^2\) and \(s_2^2\), which means that we will be using the \(t\) distribution instead of the normal distribution.
The degrees of freedom and the exact form of the estimated standard error depend on whether the population variances are equal.
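The sampling distribution above can be checked with a short simulation. This is only a sketch: the population means, standard deviations, sample sizes, number of replications and seed are all illustrative assumptions, not values from a real study.

```r
# Simulate many pairs of independent normal samples and compare the
# standard deviation of (xbar1 - xbar2) with the theoretical value.
set.seed(1)                      # assumed seed, for reproducibility only
n1 <- 20; n2 <- 23               # assumed sample sizes
mu1 <- 20; mu2 <- 18             # assumed population means
sigma1 <- 3; sigma2 <- 2         # assumed population standard deviations
reps <- 10000                    # number of simulated studies

diffs <- replicate(reps, mean(rnorm(n1, mu1, sigma1)) - mean(rnorm(n2, mu2, sigma2)))

theory_sd <- sqrt(sigma1^2 / n1 + sigma2^2 / n2)
c(simulated_mean = mean(diffs), theoretical_mean = mu1 - mu2)
c(simulated_sd = sd(diffs), theoretical_sd = theory_sd)
```

The simulated mean and standard deviation should land close to \(\mu_1 - \mu_2\) and \(\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}\), up to simulation noise.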
Case 1 (independent samples, population variances not assumed equal): A \(100(1-\alpha)\%\) confidence interval for \(\mu_1-\mu_2\) is
\[(\bar x_1 - \bar x_2) \pm t_{\nu,\alpha/2} \cdot \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}\]
where the approximate degrees of freedom \(\nu\) equal the smaller of \(n_1-1\) and \(n_2-1\), i.e. \(\nu = \min(n_1-1,n_2-1)\). This choice is conservative: it never makes the interval narrower than it should be. The exact degrees of freedom have a much more complicated formula.
We require independent random samples that are either both large \((n_1, n_2 \ge 30)\) or from normal populations.
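For reference, the more complicated degrees of freedom formula is usually taken to be the Welch-Satterthwaite approximation, which is what R's t.test uses when var.equal = FALSE. It is stated here as background, not as something this lesson requires:

\[\nu \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1-1} + \frac{(s_2^2/n_2)^2}{n_2-1}}\]

This value always lies between \(\text{min}(n_1-1,n_2-1)\) and \(n_1+n_2-2\), so the smaller-of-the-two rule gives an interval that is at least as wide.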
Example 1:
The local baseball team conducts a study to find the amount spent on refreshments at the ball park. Over the course of the season, they gather simple random samples of 20 men and 23 women. For men, the average expenditure was $20 with a standard deviation of $3. For women, the average expenditure was $18 with a standard deviation of $2. Construct a 99% confidence interval for the mean spending difference between men and women.
Let \(\mu_1=\) mean expenditure for men and \(\mu_2=\) mean expenditure for women. We want to estimate the mean difference \(\mu_1 - \mu_2\) with 99% confidence.
First, enter the sample data
n1 <- 20
xbar1 <- 20
s1 <- 3
n2 <- 23
xbar2 <- 18
s2 <- 2
Compute the standard error
se <- sqrt(s1^2/n1 + s2^2/n2)
se
[1] 0.7898817
Finally, find the t critical value
alpha <- 0.01
df <- min(n1-1,n2-1)
t <- qt(1-alpha/2,df)
t
[1] 2.860935
Put it all together to get the confidence interval
(xbar1 - xbar2) - t*se
[1] -0.2597998
(xbar1 - xbar2) + t*se
[1] 4.2598
We are 99% confident that the mean difference in expenditures between men and women lies between -$0.26 and $4.26. Because both sample sizes are less than \(30\), this interval is valid only if the populations we sampled from are normal.
Notice that the confidence interval contains zero, indicating that there is not a significant difference in the average expenditures for men and women at 99% confidence.
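To see how conservative the smaller-of-the-two degrees of freedom rule is for these data, here is a minimal sketch that computes the Welch-Satterthwaite degrees of freedom from the same summary statistics. The variable names (v1, v2, nu_welch and so on) are illustrative choices, and this alternative is shown for comparison only.

```r
# Summary statistics from Example 1
n1 <- 20; xbar1 <- 20; s1 <- 3
n2 <- 23; xbar2 <- 18; s2 <- 2
alpha <- 0.01

# Per-sample variance contributions to the standard error
v1 <- s1^2 / n1
v2 <- s2^2 / n2
se <- sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom vs. the conservative rule
nu_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
nu_conservative <- min(n1 - 1, n2 - 1)

# Critical values and intervals under each choice
t_welch <- qt(1 - alpha / 2, nu_welch)
t_conservative <- qt(1 - alpha / 2, nu_conservative)
ci_welch <- (xbar1 - xbar2) + c(-1, 1) * t_welch * se
ci_conservative <- (xbar1 - xbar2) + c(-1, 1) * t_conservative * se

nu_welch
ci_welch
ci_conservative
```

The Welch degrees of freedom come out larger than 19 (about 32), so the Welch interval is somewhat narrower than the one computed above.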
Case 2 (independent samples, equal population variances): A \(100(1-\alpha)\%\) confidence interval for \(\mu_1-\mu_2\) is
\[(\bar x_1 - \bar x_2) \pm t_{\nu,\alpha/2} \cdot \sqrt{s_p^2(\frac{1}{n_1} + \frac{1}{n_2})}\]
where the degrees of freedom are \(\nu = n_1+n_2-2\) and \[s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}\]
We require independent random samples that are either both large \((n_1, n_2 \ge 30)\) or from normal populations. We also require that the population variances are equal (this will be given to you in the problem).
Example 2: You are testing two computer processors (A and B) for speed (in MHz). Construct a 95% confidence interval for the mean difference in CPU speed. Assume equal population variances.
| Processor | \(n\) | \(\bar x\) | \(s\) |
|---|---|---|---|
| A | \(17\) | \(3004\) | \(74\) |
| B | \(14\) | \(2538\) | \(56\) |
Let \(\mu_1=\) mean speed for Processor A and \(\mu_2=\) mean speed for Processor B. We want to estimate the mean difference \(\mu_1 - \mu_2\) with 95% confidence.
First enter the sample data
n1 <- 17
xbar1 <- 3004
s1 <- 74
n2 <- 14
xbar2 <- 2538
s2 <- 56
Then compute the pooled variance and use it to find the standard error
pool <- ((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2)
pool
[1] 4427.034
se <- sqrt(pool*(1/n1 + 1/n2))
se
[1] 24.01313
Finally, find the t critical value
alpha <- 0.05
df <- n1 + n2 - 2
t <- qt(1-alpha/2,df)
t
[1] 2.04523
Put it all together to get the confidence interval
(xbar1 - xbar2) - t*se
[1] 416.8876
(xbar1 - xbar2) + t*se
[1] 515.1124
We are 95% confident that the mean difference in processor speed lies between 416.9 MHz and 515.1 MHz. Because both sample sizes are less than \(30\), this interval is valid only if the populations we sampled from are normal.
Notice that the confidence interval does not contain zero, indicating that there is a significant difference in the average processor speed at 95% confidence.
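The pooled-variance steps can also be wrapped in a small helper so the same calculation is reusable for other summary statistics. This is a sketch: pooled_ci is a hypothetical function name introduced here for illustration, not a built-in R function.

```r
# Hypothetical helper: pooled-variance confidence interval for mu1 - mu2
# from summary statistics, assuming equal population variances.
pooled_ci <- function(xbar1, s1, n1, xbar2, s2, n2, conf = 0.95) {
  df <- n1 + n2 - 2
  sp2 <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df   # pooled variance
  se <- sqrt(sp2 * (1 / n1 + 1 / n2))               # standard error
  tcrit <- qt(1 - (1 - conf) / 2, df)               # t critical value
  (xbar1 - xbar2) + c(lower = -1, upper = 1) * tcrit * se
}

# Example 2: Processor A vs. Processor B
ci <- pooled_ci(xbar1 = 3004, s1 = 74, n1 = 17,
                xbar2 = 2538, s2 = 56, n2 = 14, conf = 0.95)
ci
```

Called with the Example 2 values, it should reproduce the interval computed step by step above.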
Case 3 (dependent samples): Samples are dependent when they are related somehow. A common type of dependent sample is two measurements made on the same person, as in a before-and-after study. Twin studies, where one twin is randomly assigned to a treatment group and the other to a control group, are also dependent samples. Some dependent samples are artificially created by researchers who “match” subjects on various attributes. For example, matching on gender and weight would mean finding two people with the same gender and weight and randomly assigning one to the treatment group and the other to the control group.
When the samples are dependent, the differences are analyzed. We put the subscript \(d\) on our statistics in the formula below to remind us that we are working with the differences.
A \(100(1-\alpha)\%\) confidence interval for \(\mu_d = \mu_1 -\mu_2\) is
\[\bar x_d \pm t_{\nu,\alpha/2} \cdot \frac{s_d}{\sqrt{n_d}}\] where \(n_d =\) the number of differences, \(\bar x_d =\) the sample mean of the differences and \(s_d =\) the sample standard deviation of the differences.
The degrees of freedom are \(\nu = n_d-1\) and it is required that either \(n_d \ge 30\) or the population of differences is normal.
Example 3: Six people sign up for a weight loss program. Is it effective? Construct a 95% confidence interval for the true mean difference in weight.
| Person | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Before | 136 | 205 | 257 | 238 | 175 | 166 |
| After | 125 | 195 | 250 | 240 | 165 | 160 |
First enter the data
before <- c(136,205,257,238,175,166)
after <- c(125,195,250,240,165,160)
We can use the t.test command with the paired=TRUE argument. Listing after first means the differences are computed as after - before
t.test(after,before,paired=TRUE,conf.level=0.95)
Paired t-test
data: after and before
t = -3.5598, df = 5, p-value = 0.01622
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-12.054751 -1.945249
sample estimates:
mean of the differences
-7
Or we can find the differences and then find the mean and standard deviation of the differences
dif <- after - before
nd <- length(dif)
xbard <- mean(dif)
sd <- sd(dif)
nd
[1] 6
xbard
[1] -7
sd
[1] 4.816638
Find the t critical value
alpha <- 0.05
df <- nd - 1
t <- qt(1-alpha/2,df)
And then put it all together to get the confidence interval
xbard - t*sd/sqrt(nd)
[1] -12.05475
xbard + t*sd/sqrt(nd)
[1] -1.945249
Either way, we get the same answer. We are 95% confident that the mean difference in weight (after - before) lies between -12.1 and -1.9 pounds. So participants lost, on average, between approximately 2 and 12 pounds. Because the interval lies entirely below zero, the data suggest the program is effective. This interval requires that the population of differences is normal, since the number of differences is less than 30.
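Because the paired interval is built entirely from the differences, it is the same as a one-sample \(t\) interval on those differences. A minimal sketch using the Example 3 data (the object names one_sample and paired are illustrative):

```r
# Example 3 data
before <- c(136, 205, 257, 238, 175, 166)
after  <- c(125, 195, 250, 240, 165, 160)

# One-sample t interval on the differences (after - before)
dif <- after - before
one_sample <- t.test(dif, conf.level = 0.95)$conf.int

# Paired t interval on the two original vectors
paired <- t.test(after, before, paired = TRUE, conf.level = 0.95)$conf.int

# The two intervals agree
all.equal(as.numeric(one_sample), as.numeric(paired))
one_sample
```

This is why the case of dependent samples reduces to the familiar one-sample procedure once the differences are computed.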