KITADA

Lesson #2

Inference for Two-Sample problems comparing means or medians

Motivation: If an inference problem involves a quantitative response variable and a categorical explanatory variable with two categories, two-sample methods will be used.

What you need to know from this lesson:

After completing this lesson, you should be able to

To accomplish the above “What You Need to Know”, do the following:

1.  Attend lecture and answer the questions on the following pages of this lesson.

2.  Read Sections 6.10 through 6.12 in the text

3.  Do the Lesson 2 questions at the end of the lesson notes 

The Lesson

I. What makes an inference problem a two-sample problem

A. List the criteria for an inference problem to be a two-sample problem for a quantitative variable

There are two populations (two lists) of numeric data.

B. For each of the following, state whether the two-sample or one-sample methods are appropriate and why.

II. Example of a two-sample problem.

Example

Suppose that a commuter, Katy, wants to determine which of two possible driving routes gets her to work more quickly. She randomly selects 20 days during a 3-month period. Katy then randomly assigns 10 of those days to days where she’ll drive route 1 and the other ten days to days where she’ll drive route 2. She records commuting times, in minutes, which are as follows:

route1<-c(19.3,  20.5,  23.0,   25.8,   28.0,   28.8,   30.6,   32.1,   33.5,   38.4)  
route2<-c(23.7, 24.5, 27.7, 30.0,   31.9,   32.5,   32.6,   35.5,   38.7,   42.9)   

Step 1: Identify the variable of interest and the populations

1. What is the variable of interest? Is it quantitative or categorical?

The variable of interest is the commuting time, in minutes, on each of the two routes. The variable is quantitative.

2. What are the populations?

The populations are all the commuting times for the two routes.

Step 2: Assess whether the samples are representative of the populations

3. Why can we feel comfortable saying that the commuting times in the samples are representative of all commuting times for each of the routes?

We can feel comfortable saying that the commuting times in the samples are representative of all commuting times because Katy randomly selected the days and randomly assigned them to the two routes.

4. What are some arguments why the sample still may not be representative of all commuting times for each of the routes even though random samples were taken?

Perhaps commuting times during the 3-month period that was chosen are different than the commuting times during other times of the year.

Step 3: Determine if this is an estimation or hypothesis test problem

5. Is this an estimation or hypothesis test problem? Why?

This is a hypothesis testing problem because we want to know if there is evidence to suggest that the commuting times are different for the two routes.

Step 4: State the null and alternative hypothesis

6. State the null and alternative hypotheses in statistical notation and words.

\( H_0: \mu_1-\mu_2=0 \) or \( H_0: \mu_1=\mu_2 \)

\( H_A: \mu_1-\mu_2 \neq 0 \) or \( H_A: \mu_1 \neq \mu_2 \)

Step 5: Explore the sample data

Side-by-side box-and-whisker plots and dotplots for commuting times for route 1 and route 2 are given below.

# DOT PLOTS
par(mfrow=c(2,1))
stripchart(route1, method = "stack", offset = .5, at = .15, pch = 19, 
           main = "Dotplot of Commute Times for Route 1", xlim=c(15, 50),xlab = "Commute Time (Minutes)")
stripchart(route2, method = "stack", offset = .5, at = .15, pch = 19, 
           main = "Dotplot of Commute Times for Route 2", xlim=c(15, 50),xlab = "Commute Time (Minutes)")

[Figure: stacked dotplots of commute times for Route 1 and Route 2]

# SIDE-BY-SIDE BOX PLOTS
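# dataRoute is not defined earlier in these notes; build it from the two vectors
dataRoute<-data.frame(route1, route2)
boxplot(dataRoute)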

[Figure: side-by-side box plots of commute times for Route 1 and Route 2]

7. One reason for exploring the sample data is to get an initial idea of whether there will be evidence to reject the claim in the null hypothesis in favor of the claim made in the alternative hypothesis.

A. For this sample of days, does one route always get Katy to work more quickly than the other?

No, one is not always faster than the other. There is overlap in the range of commuting times.

B. For this sample of days, does one route tend to get Katy to work more quickly than the other? If so, which one?

Route 1 tends to be quicker: both its minimum and its maximum commute times are lower than those of Route 2.

C. Would you expect the claim made in the null hypothesis to be rejected? Explain.

Probably not. Although the Route 2 times look somewhat higher, the two samples overlap considerably, so I believe the hypothesis test will not find enough evidence to suggest that the two routes are different.

8. A second reason for exploring the sample data is to determine which two-sample method and which parameter are most appropriate to use. The following is a guideline and is similar to what was covered in Lesson 1.

Which method to use

Which parameter to use

A. Which method(s) can be used in this example? For what parameter?

The data appear to be fairly symmetric, so the two-sample t-methods could be used to compare the means.

B. Both sample sizes are somewhat small. Even though the sample data are roughly symmetric, give an argument why the data in each population may not be symmetric and bell-shaped.

It is possible that, because the sample sizes are so small, the samples do not reflect the true shape of the distribution of commuting times.

C. If the data in the population are skewed, what method and what parameter should be used?

If the data are skewed, we should use the randomization/bootstrap methods and compare the medians.

9. The summary statistics for each route are given below:

# SUMMARY STATS: ROUTE 1
n1<-length(route1)
xbar1<-mean(route1)
med1<-median(route1)
s1<-sd(route1)

n1
## [1] 10
xbar1
## [1] 28
med1
## [1] 28.4
s1
## [1] 6.003703
# SUMMARY STATS: ROUTE 2
n2<-length(route2)
xbar2<-mean(route2)
med2<-median(route2)
s2<-sd(route2)

n2
## [1] 10
xbar2
## [1] 32
med2
## [1] 32.2
s2
## [1] 6.001852

Steps 6 & 7: Find the p-value and answer the question of interest

10. The “un-pooled” two-sample t-methods:

A. If H0 is true, what is \( \mu_1-\mu_2 \)?

\( \mu_1-\mu_2 \) is the true difference in population means. If H0 is true, \( \mu_1-\mu_2=0 \).

B. What is the best estimate of \( \mu_1-\mu_2 \)?

The best estimate is \( \bar{x}_1-\bar{x}_2 \)

C. Calculate \( \bar{x}_1-\bar{x}_2 \).

xbar1-xbar2
## [1] -4

D. Could \( \bar{x}_1-\bar{x}_2 \)= -4 minutes even if H0 is true?

Yes. The distribution of differences in sample means is (approximately) normal, and a normal distribution is supported from negative infinity to positive infinity. This means it is possible for \( \bar{x}_1-\bar{x}_2 \) to be -4 even when the true difference is 0.

E. The idea is to find how likely it is to observe the difference in sample means we observed (-4 minutes) or a difference “more unusual” if the null hypothesis is true.

i. What values of the differences in sample means (route 1 minus route 2) are “as or more unusual” than the difference observed?

Differences of 4 or more, or of -4 or less, are as or more extreme than the difference observed, since we have a two-sided alternative.

ii. As a probability statement, write the probability of observing a difference in sample means like the one observed or a difference “more unusual”?

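\( Pr(\bar{x}_1-\bar{x}_2 \geq 4)+Pr(\bar{x}_1-\bar{x}_2 \leq -4) \) = \( 2 \times Pr(\bar{x}_1-\bar{x}_2 \geq 4) \), where the equality holds because the distribution is symmetric about 0 when H0 is true.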

F. Notice the probability statement is in terms of the difference in sample means (\( \bar{x}_1-\bar{x}_2 \)). To find this probability (and, therefore, the p-value), we need to generate the distribution of differences in sample means. To use the two-sample t-methods to find this probability, the distribution of differences in sample means needs to be approximately normally distributed.

Sketch the distribution of the difference in sample means (\( \bar{x}_1-\bar{x}_2 \)) under the condition that the null hypothesis is true. Label the x-axis, mark the spots at the mean of this distribution and one standard error away from the mean. Shade the region corresponding to the probability that we’re trying to find.

i. Under what condition is this shape normally distributed?

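If the commute times for each route are normally distributed in the population, then the distribution of the difference in sample means is also normal. With large sample sizes it is approximately normal even when the populations are not.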

ii. What is the value of the mean of this distribution (\( \mu_{\bar{x}_1-\bar{x}_2} \) ) if the null hypothesis is true?

If the null hypothesis is true, then \( \mu_{\bar{x}_1-\bar{x}_2}=\mu_1-\mu_2=0 \).

iii. The standard error of the distribution of the differences in sample means (\( SE_{\bar{x}_1-\bar{x}_2} \) ).

Derivation of formula:

Even though we’re considering two groups, we’re looking at one distribution: the distribution of the difference in sample means. In a sense, we are “combining” the two groups together into one distribution. To determine the variation in this distribution of differences, we need to “combine” the standard deviations of the two groups. To “combine” the standard deviations of the two separate groups, we ADD THE VARIANCES OF THE TWO GROUPS TOGETHER.

Recall that the variance is the standard deviation squared. The notation for the population variance is \( \sigma^2 \), and for the sample variance \( s^2 \).

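If the population standard deviations were known:

\( SE_{\bar{x}_1-\bar{x}_2}=\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}} \)

If the population standard deviations were unknown (which is usually the case):

\( SE_{\bar{x}_1-\bar{x}_2}=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} \)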

Use this formula to calculate the standard error

SEdiff<-sqrt((s1^2/n1)+(s2^2/n2))
SEdiff
## [1] 2.684524
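The "add the variances" idea can be checked with a quick simulation. The sketch below is only an illustration: the population means and standard deviations are hypothetical values chosen to resemble the commuting data, and the populations are assumed to be normal.

```r
# Simulation sketch: the SD of (xbar1 - xbar2) should be close to
# sqrt(sigma1^2/n1 + sigma2^2/n2).
# The population values below are HYPOTHETICAL, chosen only for illustration.
set.seed(1)
mu1 <- 28; sigma1 <- 6; n1 <- 10
mu2 <- 32; sigma2 <- 6; n2 <- 10

# Draw many pairs of samples and record the difference in sample means
diffs <- replicate(20000,
                   mean(rnorm(n1, mu1, sigma1)) - mean(rnorm(n2, mu2, sigma2)))

sim_se     <- sd(diffs)                            # simulated standard error
formula_se <- sqrt(sigma1^2 / n1 + sigma2^2 / n2)  # add the variances, then take the square root

c(simulated = sim_se, formula = formula_se)
```

The two numbers should agree closely, which is the reason the standard error formula adds the two variances rather than the two standard deviations.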

G. Calculate the t-statistic. What are the degrees of freedom for this t-statistic? \( t=\frac{\bar{x}_1-\bar{x}_2}{SE_{\bar{x}_1-\bar{x}_2}} \)

test_unpooled<-(xbar1-xbar2)/SEdiff
test_unpooled
## [1] -1.490022

Conservative DF: \( min(n_1-1, n_2-1) \)

Estimated DF:\( \frac{(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2})^2}{\frac{s_1^4}{n_1^2(n_1-1)}+\frac{s_2^4}{n_2^2(n_2-1)}} \)
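Both degrees-of-freedom choices can be computed directly from the sample data. This is a minimal sketch of the two formulas above; the variable names (v1, v2, df_estimated, df_conservative) are chosen here for illustration.

```r
# Commuting times from the example
route1 <- c(19.3, 20.5, 23.0, 25.8, 28.0, 28.8, 30.6, 32.1, 33.5, 38.4)
route2 <- c(23.7, 24.5, 27.7, 30.0, 31.9, 32.5, 32.6, 35.5, 38.7, 42.9)
n1 <- length(route1)
n2 <- length(route2)

# Each group's contribution to the squared standard error: s^2 / n
v1 <- var(route1) / n1
v2 <- var(route2) / n2

# Conservative DF: the smaller of (n1 - 1) and (n2 - 1)
df_conservative <- min(n1 - 1, n2 - 1)

# Estimated DF: same formula as above, since (s^2/n)^2 / (n - 1) = s^4 / (n^2 (n - 1))
df_estimated <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

c(conservative = df_conservative, estimated = df_estimated)
```

The estimated degrees of freedom are never smaller than the conservative value, so the conservative choice gives a slightly larger p-value and a slightly wider confidence interval.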

H. Use a calculator (hand-held or online) or t-table to find the p-value.

2*pt(-abs(test_unpooled), df=min(n1-1, n2-1), lower.tail=TRUE)
## [1] 0.1704049
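As a cross-check, R's built-in t.test() function runs the un-pooled two-sample t-test directly. This is a sketch and not part of the original notes. Note that t.test() uses the estimated degrees of freedom, not the conservative value, so its p-value will be somewhat smaller than the one computed above with 9 degrees of freedom.

```r
# Commuting times from the example
route1 <- c(19.3, 20.5, 23.0, 25.8, 28.0, 28.8, 30.6, 32.1, 33.5, 38.4)
route2 <- c(23.7, 24.5, 27.7, 30.0, 31.9, 32.5, 32.6, 35.5, 38.7, 42.9)

# Un-pooled (Welch) two-sample t-test, two-sided alternative.
# var.equal = FALSE is the default; it is written out here for clarity.
unpooled <- t.test(route1, route2, alternative = "two.sided", var.equal = FALSE)

unpooled$statistic   # t-statistic: should match test_unpooled
unpooled$parameter   # estimated degrees of freedom
unpooled$p.value     # two-sided p-value based on the estimated df
```

Either choice of degrees of freedom leads to the same conclusion here, because both p-values are well above 0.05.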

I. State a conclusion in the context of the problem.

Since the p-value (0.17) is greater than \( \alpha=0.05 \), we fail to reject the null hypothesis. There is not enough evidence to suggest that there is a difference in mean commuting times for the two routes.

J. The “pooled” two-sample t-methods

If the two samples came from populations that have the same standard deviation, another type of two-sample t-test should be performed instead. It’s called a “pooled” two-sample t-test because the standard deviations of the two groups will be “pooled” together to get the standard error of the distribution of differences in sample means. Again, this can only be done if we feel confident that the population standard deviations of the two populations being compared are the same. That is, if \( \sigma_1 \) is the population standard deviation of the first group and \( \sigma_2 \) is the population standard deviation of the second group, the “pooled” two-sample t-methods can only be used if \( \sigma_1=\sigma_2 \).

i. When is it reasonable to say \( \sigma_1=\sigma_2 \)?

Based on the above guidelines, can the pooled two-sample t-methods be used in the “commuting example”? Explain.

We have two independent samples from two different populations. Since the sample standard deviations are so close it may be reasonable to use pooled methods.

ii. What is different between the pooled methods and the “un-pooled” methods?

A. The standard error formula is different. Without derivation, here is the standard error formula for the pooled two-sample t-methods:

\( SE_{\bar{x}_1-\bar{x}_2}=s_p\times \sqrt{\frac{1}{n_1}+\frac{1}{n_2}} \) where \( s_p=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{(n_1+n_2-2)}} \)

Use this formula to calculate the standard error of the distribution of differences in sample means. How different is the standard error from the “un-pooled” methods?

Sp<-sqrt(((n1-1)*s1^2+(n2-1)*s2^2)/(n1+n2-2))
SEdiff_pooled<-Sp*sqrt(1/n1+1/n2)
SEdiff_pooled
## [1] 2.684524

compared to

SEdiff
## [1] 2.684524

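Because the two sample sizes are equal (\( n_1=n_2=10 \)), the pooled and unpooled standard errors are identical here. With unequal sample sizes they would differ, though only slightly when the sample standard deviations are as close as these.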

B. The calculation of the degrees of freedom is different. Here is the formula for the pooled two-sample t-methods: \( n_1 + n_2 - 2 \).

What are the degrees of freedom if the pooled two-sample t-methods were used?

df_pooled<-n1+n2-2
df_pooled
## [1] 18
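The pooled test can also be run with R's built-in t.test() by setting var.equal = TRUE. This is a sketch for checking the hand calculations, not part of the original notes.

```r
# Commuting times from the example
route1 <- c(19.3, 20.5, 23.0, 25.8, 28.0, 28.8, 30.6, 32.1, 33.5, 38.4)
route2 <- c(23.7, 24.5, 27.7, 30.0, 31.9, 32.5, 32.6, 35.5, 38.7, 42.9)

# Pooled two-sample t-test: var.equal = TRUE assumes sigma1 = sigma2
pooled <- t.test(route1, route2, var.equal = TRUE)

pooled$statistic   # same t-statistic as the un-pooled test, since the standard errors match
pooled$parameter   # degrees of freedom: n1 + n2 - 2
pooled$p.value     # two-sided p-value using the pooled degrees of freedom
```

Because the pooled test uses more degrees of freedom than the conservative un-pooled calculation, its p-value is somewhat smaller, but the conclusion is unchanged.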

11. The randomization test to compare two groups

Recall the idea of a randomization test when comparing two groups in the experimental setting: if the null hypothesis is true (no difference between the two comparison groups), we’d expect each case’s value of the response variable to be the same no matter to which group the case was assigned. A randomization test will generate many, many different randomizations of the cases to the different comparison groups and determine how many of them have differences in sample means (or medians) “as or more unusual” than the one observed. Since the hypothesized value of the differences in means (or medians) is always 0, “as or more unusual” implies differences further away from 0 than the original difference is from 0.

In particular, if an experiment has been performed, the twomeans function will

The twomeans function will also give an estimate of the standard deviation of the distribution of differences in sample means (or medians) by calculating the standard deviation of the several thousand differences and construct a confidence interval using the percentile method.

In order to run the twomeans function, the data set must be organized in such a way that there is one column (vector) that contains the values of the response variable and one column (vector) that contains the explanatory variable or groups.
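One way to put the commuting data into this layout is sketched below. The data set name COMMUTE and the column names time and route are taken from the function call used later in this lesson. How the original data set was actually built is not shown in these notes, so the construction and the group labels here are assumptions.

```r
# Commuting times from the example
route1 <- c(19.3, 20.5, 23.0, 25.8, 28.0, 28.8, 30.6, 32.1, 33.5, 38.4)
route2 <- c(23.7, 24.5, 27.7, 30.0, 31.9, 32.5, 32.6, 35.5, 38.7, 42.9)

# "Long" layout: one row per commute, one column for the response (time)
# and one column for the group (route).
# The labels "route1" and "route2" are an assumption made for this sketch.
COMMUTE <- data.frame(
  time  = c(route1, route2),
  route = factor(rep(c("route1", "route2"),
                     times = c(length(route1), length(route2))))
)

head(COMMUTE)                                # first few rows
tapply(COMMUTE$time, COMMUTE$route, mean)    # group means, as a quick check
```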

A general form of the twomeans function looks something like this:

with(NAMEOFDATASET,
     twomeans(response, groups, iterations, MEAN = FALSE, MEDIAN = FALSE,
              ci_level = NULL, Alt_Hyp = NULL, histogram = TRUE)
     )

For this example, the arguments are

In R, the “call” of the twomeans function would look like this:

my.twomeans <- 
  with(COMMUTE, 
       twomeans(time, route, 5000, MEAN = TRUE, ci_level = .95, Alt_Hyp = 3)
       )

After running the function, you can pull out any of the saved output using the following commands:

my.twomeans$Diffs_SD

my.twomeans$Confidence_Intervals

my.twomeans$pval

my.twomeans$Alt_Hyp

Here is the output from the twomeans function for a hypothesis test comparing the mean commuting time of the two routes:

my.twomeans$Diffs_SD
## [1] 2.757522
my.twomeans$Confidence_Intervals
##   CI_Percent CI_Formula
## 1      -9.34  -9.404643
## 2       1.30   1.404643
my.twomeans$pval
## [1] 0.1506
my.twomeans$Alt_Hyp
## [1] "Pop1 =/= Pop2 (two sided)"

Compare the p-value and standard deviation of the difference in sample means to the p-value and standard error from the two-sample t-test.
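The randomization idea described at the start of this section can also be sketched in base R. This is not the twomeans function, whose internals are not shown in these notes. It is a minimal illustration that shuffles the route labels, recomputes the difference in sample means each time, and counts how often the shuffled difference is at least as far from 0 as the observed one.

```r
# Commuting times from the example
route1 <- c(19.3, 20.5, 23.0, 25.8, 28.0, 28.8, 30.6, 32.1, 33.5, 38.4)
route2 <- c(23.7, 24.5, 27.7, 30.0, 31.9, 32.5, 32.6, 35.5, 38.7, 42.9)

set.seed(2)  # for reproducibility; any seed works
times <- c(route1, route2)
group <- rep(c("route1", "route2"), times = c(length(route1), length(route2)))

# Observed difference in sample means (route 1 minus route 2)
obs_diff <- mean(times[group == "route1"]) - mean(times[group == "route2"])

# Re-randomize the labels many times and recompute the difference each time
perm_diffs <- replicate(5000, {
  shuffled <- sample(group)
  mean(times[shuffled == "route1"]) - mean(times[shuffled == "route2"])
})

# Two-sided p-value: proportion of shuffled differences at least as far from 0
p_value <- mean(abs(perm_diffs) >= abs(obs_diff))

sd(perm_diffs)   # plays the role of the standard error of the difference
p_value
```

The exact values change with the seed and the number of iterations, so they will not match the twomeans output digit for digit, but they should be in the same neighborhood.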

12. Confidence intervals for the difference in the population means

A. Construct and interpret a 95% confidence interval for the difference in population means (\( \mu_1-\mu_2 \)) using the “un-pooled” two-sample t-methods by hand.

df<-min(n1-1, n2-1)
criticalVal<-qt(0.975, df)
criticalVal
## [1] 2.262157
CI<-(xbar1-xbar2)+c(-1,1)*criticalVal*SEdiff
CI
## [1] -10.072814   2.072814

We are 95% confident that the true difference in mean commuting times (route 1 minus route 2) is between -10.07 and 2.07 minutes. Since zero is contained in this interval, this suggests that there is not a significant difference between the mean commuting times for the two routes.

B. Would the confidence interval change if the pooled two-sample t-methods were used?

df
## [1] 9
df_pooled
## [1] 18

Yes. The degrees of freedom are larger in the pooled case (18 versus 9), so the associated critical value is smaller. Because the standard errors are the same in this example, the pooled interval is narrower.
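To see the size of the change, both intervals can be computed side by side. This is a sketch using the formulas from this lesson; the names ci_unpooled and ci_pooled are chosen here for illustration.

```r
# Commuting times from the example
route1 <- c(19.3, 20.5, 23.0, 25.8, 28.0, 28.8, 30.6, 32.1, 33.5, 38.4)
route2 <- c(23.7, 24.5, 27.7, 30.0, 31.9, 32.5, 32.6, 35.5, 38.7, 42.9)
n1 <- length(route1)
n2 <- length(route2)
xbar_diff <- mean(route1) - mean(route2)

# Un-pooled interval with the conservative degrees of freedom
se_unpooled <- sqrt(var(route1) / n1 + var(route2) / n2)
ci_unpooled <- xbar_diff + c(-1, 1) * qt(0.975, min(n1 - 1, n2 - 1)) * se_unpooled

# Pooled interval with n1 + n2 - 2 degrees of freedom
sp        <- sqrt(((n1 - 1) * var(route1) + (n2 - 1) * var(route2)) / (n1 + n2 - 2))
se_pooled <- sp * sqrt(1 / n1 + 1 / n2)
ci_pooled <- xbar_diff + c(-1, 1) * qt(0.975, n1 + n2 - 2) * se_pooled

rbind(unpooled = ci_unpooled, pooled = ci_pooled)
```

Both intervals contain 0, so the conclusion does not depend on which method is used.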

C. Write the 95% confidence interval using the percentile method from the twomeans function.

my.twomeans$Confidence_Intervals
##   CI_Percent CI_Formula
## 1      -9.34  -9.404643
## 2       1.30   1.404643

i. Show how the bounds were determined.

The bounds are the 2.5th and 97.5th percentiles of the bootstrapped differences in sample means, so they enclose the middle 95% of those differences.
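The percentile method can be sketched in base R as follows. This is an illustration only: it assumes each route's sample is resampled with replacement separately, which may differ in detail from what the twomeans function does internally.

```r
# Commuting times from the example
route1 <- c(19.3, 20.5, 23.0, 25.8, 28.0, 28.8, 30.6, 32.1, 33.5, 38.4)
route2 <- c(23.7, 24.5, 27.7, 30.0, 31.9, 32.5, 32.6, 35.5, 38.7, 42.9)

set.seed(3)  # for reproducibility; any seed works

# Resample each route with replacement and record the difference in sample means
boot_diffs <- replicate(5000, {
  mean(sample(route1, replace = TRUE)) - mean(sample(route2, replace = TRUE))
})

# Percentile method: cut off 2.5% in each tail to keep the middle 95%
ci_percentile <- quantile(boot_diffs, probs = c(0.025, 0.975))
ci_percentile
```

The bounds vary a little from run to run, so they will not reproduce the twomeans output exactly.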

ii. Compare the bounds with the two-sample t-methods.

The confidence interval from the two-sample t-methods (-10.07 to 2.07) is wider than the percentile interval (-9.34 to 1.30).

13. Although it seems appropriate to use the t-methods to compare the means, the sample sizes are quite small. With small sample sizes, we have to be certain that the data in the populations from which the samples were taken are normally distributed. In this example, that means that we must believe that all commuting times for both routes are normally distributed. An argument was given in #8b that commuting times for either route may not be normally distributed. If the population data are not normally distributed for one or both groups and the sample sizes are small, the t-methods to compare the means should not be used.

Let’s suppose that you felt that all commuting times for both routes weren’t perfectly normally distributed and/or there were “extreme” outliers. With the small sample sizes and skewed data and/or extreme outliers, inference on the medians should be performed. Only the randomization/bootstrap methods can be used when comparing the medians of two groups.

A. State the null and alternative hypotheses in notation. Define the notation used.

\( H_0: m_1-m_2=0 \) or \( H_0: m_1=m_2 \)

\( H_A: m_1-m_2 \neq 0 \) or \( H_A: m_1 \neq m_2 \)

B. Here is what the call to the twomeans function in R looks like in this situation:

my.twoMED <- with(COMMUTE, twomeans(time, route, 5000, MEDIAN = TRUE, ci_level = .95, Alt_Hyp = 3) )

C. Below is the output from the twomeans function comparing the medians of the two routes.

my.twoMED$Diffs_SD
## [1] 3.088312
my.twoMED$Confidence_Intervals
##   CI_Percent CI_Formula
## 1      -9.35   -9.85298
## 2       1.75    2.25298
my.twoMED$pval
## [1] 0.2484
my.twoMED$Alt_Hyp
## [1] "Pop1 =/= Pop2 (two sided)"

i. Based on the p-value, state a conclusion in the context of the problem. Does the conclusion agree with the conclusion when comparing the means of the two routes?

Again, because the p-value (0.2484) is large, there is no evidence to suggest that there is a difference in the median commute times for the two routes. We fail to reject the null hypothesis that the two population medians are the same, which agrees with the conclusion from comparing the means.

ii. Using the percentile method from the twomeans function, write and interpret the 95% confidence interval for the difference in the population medians between the two routes (route 1 – route 2).

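The middle 95% of the bootstrapped differences in sample medians lies between -9.35 and 1.75, so we are 95% confident that the difference in population median commute times (route 1 minus route 2) is between -9.35 and 1.75 minutes. Since 0 is contained in this interval, this suggests that there isn't a significant difference between the median commute times of the two routes.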
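The same percentile approach works for medians by swapping the statistic. The sketch below is an illustration in base R, not the twomeans function itself; it assumes each route's sample is resampled with replacement separately.

```r
# Commuting times from the example
route1 <- c(19.3, 20.5, 23.0, 25.8, 28.0, 28.8, 30.6, 32.1, 33.5, 38.4)
route2 <- c(23.7, 24.5, 27.7, 30.0, 31.9, 32.5, 32.6, 35.5, 38.7, 42.9)

set.seed(4)  # for reproducibility; any seed works

# Observed difference in sample medians (route 1 minus route 2)
obs_med_diff <- median(route1) - median(route2)

# Resample each route with replacement and record the difference in sample medians
boot_med_diffs <- replicate(5000, {
  median(sample(route1, replace = TRUE)) - median(sample(route2, replace = TRUE))
})

# Percentile method: the middle 95% of the bootstrapped differences
ci_median <- quantile(boot_med_diffs, probs = c(0.025, 0.975))

obs_med_diff
ci_median
```

With samples this small the bootstrapped medians take only a limited set of values, so the bounds can shift noticeably between runs.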