Introduction

This lesson is about testing hypotheses about the difference between two population means, \(\mu_1\) and \(\mu_2\). We will consider three cases, depending on whether the samples are independent or dependent and, for independent samples, whether the population variances are assumed equal.

When the samples are dependent, we analyze the differences. More on this below in Case 3.

When the samples are independent, we use the sampling distribution of \(\bar X_1 - \bar X_2\) as the basis of the confidence interval.

\[\bar X_1 - \bar X_2 \sim N\left(\mu_1 - \mu_2,\sqrt{ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} }\right)\] Of course, we don't know \(\sigma_1^2\) and \(\sigma_2^2\), so we will use the sample variances \(s_1^2\) and \(s_2^2\), which means that we will be using the \(t\) distribution instead of the normal distribution.

The degrees of freedom and the exact form of the estimated standard error depend on whether we assume equal population variances.
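
Before working through the three cases, here is a quick simulation sketch (with arbitrary, made-up choices for the population means, standard deviations, and sample sizes) that checks the standard error formula above: the empirical standard deviation of \(\bar X_1 - \bar X_2\) should come out close to \(\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}\).

mu1 <- 50; sigma1 <- 8; n1 <- 25    # arbitrary values for illustration
mu2 <- 45; sigma2 <- 5; n2 <- 30

set.seed(1)
diffs <- replicate(10000, mean(rnorm(n1, mu1, sigma1)) - mean(rnorm(n2, mu2, sigma2)))

sd(diffs)                           # empirical standard deviation of xbar1 - xbar2
sqrt(sigma1^2/n1 + sigma2^2/n2)     # theoretical standard error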

Case 1: Independent Random Samples (\(\sigma_1^2 \ne \sigma_2^2\))

Step 1: Check conditions required

  • independent random samples
  • unequal population variances, \(\sigma_1^2 \ne \sigma_2^2\)
  • \(n_1 \ge 30\) and \(n_2 \ge 30\) or normal populations

Step 2: Define the population means, \(\mu_1\) and \(\mu_2\), in the context of the problem.

Step 3: Determine the null and alternative hypotheses. They can be structured three ways.

Two tailed: \(H_0: \mu_1 = \mu_2\) vs. \(H_a: \mu_1 \ne \mu_2\)
Left tailed: \(H_0: \mu_1 = \mu_2\) vs. \(H_a: \mu_1 < \mu_2\)
Right tailed: \(H_0: \mu_1 = \mu_2\) vs. \(H_a: \mu_1 > \mu_2\)

Step 4: Compute the test statistic

\[t = \frac{\bar x_1 - \bar x_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}\]

Step 5: Find the p-value with approximate degrees of freedom \(= \min(n_1-1,n_2-1)\). Software such as R uses the more exact Welch-Satterthwaite formula for the degrees of freedom; a sketch of that formula appears after the table below.

Two tailed: \(2P(T>|t|)\) = 2*(1-pt(abs(t),df))
Left tailed: \(P(T<t)\) = pt(t,df)
Right tailed: \(P(T>t)\) = 1 - pt(t,df)
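
For reference, here is a sketch of that Welch-Satterthwaite calculation. The helper name welch_df is just illustrative; it is not a built-in R function.

welch_df <- function(s1, n1, s2, n2) {
  v1 <- s1^2/n1    # estimated variance of xbar1
  v2 <- s2^2/n2    # estimated variance of xbar2
  (v1 + v2)^2/(v1^2/(n1 - 1) + v2^2/(n2 - 1))
}

This value always lies between \(\min(n_1-1,n_2-1)\) and \(n_1+n_2-2\), so the simpler rule above errs on the conservative side.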

Step 6: Reject \(H_0\) if the p-value \(\le \alpha\). Otherwise, fail to reject \(H_0\).

Step 7: State your conclusions in the context of the problem.

If you have raw data, then you can use the t.test command to find the test statistic and p-value.
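
For example, with two hypothetical vectors of raw data (the names and values of group1 and group2 below are made up purely for illustration), the default call performs the unequal-variance (Welch) test:

group1 <- c(23, 19, 25, 31, 27, 22, 26)   # hypothetical raw data
group2 <- c(18, 21, 17, 24, 20, 19)
t.test(group1, group2, alternative = "two.sided")   # var.equal = FALSE is the default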

Example 1:

The local baseball team conducts a study of the amount spent on refreshments at the ballpark. Over the course of the season, they gather simple random samples of 20 men and 23 women. For men, the average expenditure was $20 with a standard deviation of $3. For women, the average expenditure was $18 with a standard deviation of $2. Do men and women spend differently on average? Test at \(\alpha = 0.01.\)

Step 1: We require the following conditions

  • independent random samples
  • unequal population variances, \(\sigma_1^2 \ne \sigma_2^2\)
  • normal populations since the sample sizes are less than \(30\)

Step 2: Let \(\mu_1=\) mean expenditure for men and \(\mu_2=\) mean expenditure for women.

Step 3: We want to test the null hypothesis of no difference in spending on average \(H_0: \mu_1 = \mu_2\) against the alternative hypothesis that there is a difference in spending on average \(H_a: \mu_1 \ne \mu_2\)

Step 4: To find the test statistic we first enter the sample data

n1 <- 20
xbar1 <- 20
s1 <- 3
n2 <- 23
xbar2 <- 18
s2 <- 2

Compute the standard error

se <- sqrt(s1^2/n1 + s2^2/n2)
se
[1] 0.7898817

Finally, find the test statistic

t <- (xbar1 - xbar2)/se
t
[1] 2.532025

Step 5: To find the p-value, we need to look at the direction of the test. In this case, we are conducting a two tailed test.

df <- min(n1-1,n2-1)
2*(1 - pt(abs(t),df))
[1] 0.02031837

Step 6: Fail to reject \(H_0\) since the p-value of \(0.02\) is greater than \(\alpha = 0.01\).

Step 7: We have insufficient evidence to conclude that there is a difference in mean expenditure between men and women at the 1% significance level.

Case 2: Independent Random Samples (\(\sigma_1^2 = \sigma_2^2\))

Step 1: Check conditions required

  • independent random samples
  • equal population variances, \(\sigma_1^2 = \sigma_2^2\)
  • \(n_1 \ge 30\) and \(n_2 \ge 30\) or normal populations

Step 2: Define the population means, \(\mu_1\) and \(\mu_2\), in the context of the problem.

Step 3: Determine the null and alternative hypotheses. They can be structured three ways.

Two tailed: \(H_0: \mu_1 = \mu_2\) vs. \(H_a: \mu_1 \ne \mu_2\)
Left tailed: \(H_0: \mu_1 = \mu_2\) vs. \(H_a: \mu_1 < \mu_2\)
Right tailed: \(H_0: \mu_1 = \mu_2\) vs. \(H_a: \mu_1 > \mu_2\)

Step 4: Compute the test statistic

\[t = \frac{\bar x_1 - \bar x_2}{\sqrt{s_p^2({\frac{1}{n_1}+\frac{1}{n_2}})}}\] where \[s_p^2 = \frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}\]
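
As a sketch, these two formulas can be wrapped in a small helper that returns the pooled standard error; pooled_se is just an illustrative name (Example 2 below computes the same quantities step by step).

pooled_se <- function(s1, n1, s2, n2) {
  sp2 <- ((n1 - 1)*s1^2 + (n2 - 1)*s2^2)/(n1 + n2 - 2)   # pooled variance
  sqrt(sp2*(1/n1 + 1/n2))                                # estimated standard error
}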

Step 5: Find the p-value with df \(= n_1 + n_2 - 2\)

Two tailed: \(2P(T>|t|)\) = 2*(1-pt(abs(t),df))
Left tailed: \(P(T<t)\) = pt(t,df)
Right tailed: \(P(T>t)\) = 1 - pt(t,df)

Step 6: Reject \(H_0\) if the p-value \(\le \alpha\). Otherwise, fail to reject \(H_0\).

Step 7: State your conclusions in the context of the problem.

If you have raw data, then you can use the t.test command to find the test statistic and p-value.
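
For example, with hypothetical raw data again, adding var.equal = TRUE requests the pooled-variance version of the test:

group1 <- c(23, 19, 25, 31, 27, 22, 26)   # hypothetical raw data
group2 <- c(18, 21, 17, 24, 20, 19)
t.test(group1, group2, var.equal = TRUE, alternative = "two.sided")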

Example 2: Is computer processor A faster than computer processor B? Test the appropriate hypotheses at \(\alpha=0.05\). Assume equal population variances.

Processor \(n\) \(\bar x\) \(s\)
A \(17\) \(3004\) \(74\)
B \(14\) \(2538\) \(56\)

Step 1: We require the following conditions

  • independent random samples
  • equal population variances, \(\sigma_1^2 = \sigma_2^2\)
  • normal populations since the sample sizes are less than \(30\)

Step 2: Let \(\mu_1=\) mean speed for Processor A and \(\mu_2=\) mean speed for Processor B

Step 3: We want to test the null hypothesis of no difference in speed on average \(H_0: \mu_1 = \mu_2\) against the alternative hypothesis that Processor A is faster (has greater speed) than Processor B on average \(H_a: \mu_1 > \mu_2\)

Step 4: To find the test statistic we first enter the sample data

n1 <- 17
xbar1 <- 3004
s1 <- 74
n2 <- 14
xbar2 <- 2538
s2 <- 56

Then compute the pooled variance and use it to find the standard error

pool <- ((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2)
pool
[1] 4427.034
se <- sqrt(pool*(1/n1 + 1/n2))
se
[1] 24.01313

Finally, find the test statistic

t <- (xbar1-xbar2)/se
t
[1] 19.40605

Step 5: To find the p-value, we need to look at the direction of the test. In this case, we are conducting a right tailed test.

df <- n1 + n2 - 2
1 - pt(t,df)
[1] 0
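
The p-value is not exactly zero; it is simply so small that 1 - pt(t,df) rounds down to 0 in floating point arithmetic. Asking pt for the upper tail directly avoids that subtraction and returns the tiny positive value:

pt(t, df, lower.tail = FALSE)   # upper-tail probability computed directly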

Step 6: Reject \(H_0\) since the p-value (essentially \(0\)) is less than \(\alpha = 0.05\).

Step 7: We have sufficient evidence to conclude that Processor A is faster than Processor B, on average, at the 5% significance level.

Case 3: Dependent Random Samples

Step 1: Check conditions required

  • dependent random samples
  • \(n_d \ge 30\) or normal population of differences

Step 2: Define the population mean difference \(\mu_d=\mu_1-\mu_2\), in the context of the problem.

Step 3: Determine the null and alternative hypotheses. They can be structured three ways.

Two tailed: \(H_0: \mu_d = 0\) vs. \(H_a: \mu_d \ne 0\)
Left tailed: \(H_0: \mu_d = 0\) vs. \(H_a: \mu_d < 0\)
Right tailed: \(H_0: \mu_d = 0\) vs. \(H_a: \mu_d > 0\)

Step 4: Compute the test statistic

\[t=\frac{\bar x_d}{s_d/\sqrt{n_d}}\] where \(n_d=\) the number of differences, \(\bar x_d=\) the sample mean of the differences and \(s_d=\) the sample standard deviation of the differences.

Step 5: Find the p-value with df \(= n_d - 1\)

Two tailed: \(2P(T>|t|)\) = 2*(1-pt(abs(t),df))
Left tailed: \(P(T<t)\) = pt(t,df)
Right tailed: \(P(T>t)\) = 1 - pt(t,df)

Step 6: Reject \(H_0\) if the p-value \(\le \alpha\). Otherwise, fail to reject \(H_0\).

Step 7: State your conclusions in the context of the problem.

Example 3: Six people sign up for a weight loss program. Is it effective? Test the appropriate hypothesis at \(\alpha = 0.05\).

Person 1 2 3 4 5 6
Before 136 205 257 238 175 166
After 125 195 250 240 165 160

Step 1: Check conditions required

  • dependent random samples
  • normal population of differences since \(n_d<30\)

Step 2: Let \(\mu_d=\) the mean difference in weight (after \(-\) before).

Step 3: We want to test the null hypothesis that the program is not effective \(H_0:\mu_d=0\) against the alternative that the program is effective \(H_a: \mu_d < 0\).

Step 4: To find the test statistic we first enter the sample data

before <- c(136,205,257,238,175,166)
after <- c(125,195,250,240,165,160)

We can use the t.test command with the paired=TRUE argument

t.test(after,before,paired=TRUE,alternative="less")

    Paired t-test

data:  after and before
t = -3.5598, df = 5, p-value = 0.008109
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
      -Inf -3.037641
sample estimates:
mean of the differences 
                     -7 

Or we can compute the differences ourselves and then find their mean and standard deviation

dif <- after - before
nd <- length(dif)
xbard <- mean(dif)
sd <- sd(dif)
nd
[1] 6
xbard
[1] -7
sd
[1] 4.816638

Find the test statistic

t <- xbard/(sd/sqrt(nd))
t
[1] -3.559833

Find the p-value

pt(t,nd-1)
[1] 0.008108714

Either way, we reject the null hypothesis since the p-value of \(0.008\) is less than \(\alpha = 0.05\). There is sufficient evidence, at the 5% significance level, to conclude that the program is effective on average.
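
As a check, a one-sample t-test applied to the differences gives exactly the same test statistic, degrees of freedom, and p-value as the paired test above:

t.test(dif, mu = 0, alternative = "less")   # one-sample t-test on the differences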