KITADA

Lesson #2

Inference for Two-Sample problems comparing means or medians

Motivation: If an inference problem involves a quantitative response variable and a categorical explanatory variable with two categories, two-sample methods will be used.

What you need to know from this lesson:

After completing this lesson, you should be able to

recognize when to use the two-sample methods as opposed to the one-sample methods
explain when the t-methods can and can’t be used in an inference problem involving two samples
recognize when an inference on the difference in medians should be performed and when an inference on the difference in means should be performed
construct and interpret a confidence interval for the difference in population means using the two-sample t-methods
recognize when the pooled two-sample t-methods can be used
- calculate the t-statistic and construct a confidence interval using the pooled two-sample t-methods
construct and interpret a confidence interval for the difference in population means or medians using the percentile method
perform a hypothesis test for the difference in means using a two-sample t-test
perform a hypothesis test for the difference in means or medians using the randomization/bootstrap methods
recognize that a valid inference to the populations is based on having data collected that are representative of the populations

To accomplish the above “What You Need to Know”, do the following:

1.  Attend lecture and answer the questions on the following pages of this lesson.

2.  Read Sections 6.10 through 6.12 in the text

3.  Do the Lesson 2 questions at the end of the lesson notes

The Lesson

I. What makes an inference problem a two-sample problem

A. List the criteria for an inference problem to be a two-sample problem for a quantitative variable

There are two populations (two lists) of numeric data.

B. For each of the following, state whether the two-sample or one-sample methods are appropriate and why.

1. A dietitian has developed a diet that is low in fats, carbohydrates, and cholesterol (let’s call this the “low-fat” diet). Although the diet was initially intended to be used by people with heart disease, the dietitian wished to examine the effect this diet had on the weights of obese people. Two hundred obese people were selected to participate in the study to investigate the effect of the dietitian’s diet on weight loss. One hundred of these were randomly assigned to the “low-fat” diet. The other 100 were placed on a diet that contained approximately the same quantity of food as those on the “low-fat” diet, but was not as low in fats, carbohydrates, and cholesterol. For each person, the amount of weight lost (or gained) in a 3-week period was recorded. Two people in the “low-fat” diet group did not complete the study and were excluded from the analysis. The dietitian hypothesized that the low-fat diet reduces the weight of obese people more than a regular diet does. -> Two-sample, because a quantitative response (weight change) is compared between two groups (the “low-fat” diet group and the regular diet group).
2. A report states that the national average yearly salary offer for students graduating with accounting degrees in 2010 was $48,722. A random sample of 50 accounting students who received job offers was taken from a large university. Administrators at this university wondered if the average yearly salary for students graduating with accounting degrees at their university was more than the national average. -> One-sample, because there is only one sample (one list of salaries), and its mean is compared with a fixed value ($48,722).

II. Example of a two-sample problem.

Example

Suppose that a commuter, Katy, wants to determine which of two possible driving routes gets her to work more quickly. She randomly selects 20 days during a 3-month period. Katy then randomly assigns 10 of those days to days where she’ll drive route 1 and the other ten days to days where she’ll drive route 2. She records commuting times, in minutes, which are as follows:

route1<-c(19.3,  20.5,  23.0,   25.8,   28.0,   28.8,   30.6,   32.1,   33.5,   38.4)  
route2<-c(23.7, 24.5, 27.7, 30.0,   31.9,   32.5,   32.6,   35.5,   38.7,   42.9)

Step 1: Identify the variable of interest and the populations

1. What is the variable of interest? Is it quantitative or categorical?

The variable of interest is the amount of time it takes her to commute on each of the two routes. The variable is quantitative.

2. What are the populations?

The populations are all the commuting times for the two routes.

Step 2: Assess whether the samples are representative of the populations

3. Why can we feel comfortable saying that the commuting times in the samples are representative of all commuting times for each of the routes?

We can feel comfortable saying that the commuting times in the samples are representative of all commuting times because Katy randomly chose the days and randomly assigned them to the two routes.

4. What are some arguments why the sample still may not be representative of all commuting times for each of the routes even though random samples were taken?

Perhaps commuting times during the 3-month period that was chosen are different than the commuting times during other times of the year.

Step 3: Determine if this is an estimation or hypothesis test problem

5. Is this an estimation or hypothesis test problem? Why?

This is a hypothesis testing problem because we want to know if there is evidence that the commuting times differ between the two routes.

Step 4: State the null and alternative hypothesis

6. State the null and alternative hypotheses in statistical notation and words.

$ H_0: \mu_1-\mu_2=0 $ or $ H_0: \mu_1=\mu_2 $

$ H_A: \mu_1-\mu_2 \neq 0 $ or $ H_A: \mu_1 \neq \mu_2 $

Step 5: Explore the sample data

Side-by-side box-and-whisker plots and dotplots for commuting times for route 1 and route 2 are given below.

# DOT PLOTS
par(mfrow=c(2,1))
stripchart(route1, method = "stack", offset = .5, at = .15, pch = 19, 
           main = "Dotplot of Commute Times for Route 1", xlim=c(15, 50),xlab = "Commute Time (Minutes)")
stripchart(route2, method = "stack", offset = .5, at = .15, pch = 19, 
           main = "Dotplot of Commute Times for Route 2", xlim=c(15, 50),xlab = "Commute Time (Minutes)")

[Figure: dotplots of commute times for Route 1 and Route 2]

# SIDE-BY-SIDE BOX PLOTS
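# dataRoute is not defined above, so build it from the two route vectors
dataRoute <- list("Route 1" = route1, "Route 2" = route2)
boxplot(dataRoute, ylab = "Commute Time (Minutes)")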

[Figure: side-by-side box plots of commute times for Route 1 and Route 2]

7. One reason for exploring the sample data is to get an initial idea of whether there will be evidence to reject the claim in the null hypothesis in favor of the claim made in the alternative hypothesis.

A. For this sample of days, does one route always get Katy to work more quickly than the other?

No, one is not always faster than the other. There is overlap in the range of commuting times.

B. For this sample of days, does one route tend to get Katy to work more quickly than the other? If so, which one?

It appears that the Route 1 commute times are lower since the minimum for Route 1 is less than the minimum for Route 2 and the maximum for Route 1 is less than the maximum for Route 2.

C. Would you expect the claim made in the null hypothesis to be rejected? Explain.

Even though one looks a little bit higher there still is a lot of overlap so I believe that the hypothesis test will find that there is not enough evidence to suggest that the two routes are different.

8. A second reason for exploring the sample data is to determine which two-sample method and which parameter are most appropriate to use. The following is a guideline and is similar to what was covered in Lesson 1.

Which method to use

Use the two-sample t-method only if
- the data in each sample are symmetric regardless of the sample sizes or
- both sample sizes are “large” and the data in each sample are symmetric or slightly skewed
Use the randomization/bootstrap methods if
- the data in one or both samples (or groups) is quite skewed and/or has extreme outliers, regardless of sample sizes
- one or both sample sizes are “small” and the sample data are not symmetric
- Note: the randomization/bootstrap methods can also be used for the criteria listed for using the two-sample t-method above.

Which parameter to use

Compare the means of the two groups unless
- the data in one or both samples is quite skewed and/or has extreme outliers, regardless of sample sizes
- one or both sample sizes are “small” and the sample data are not symmetric
In these two situations, compare the medians of the two groups

A. Which method(s) can be used in this example? For what parameter?

The data in each sample appear to be fairly symmetric, so we could compare the means using either the two-sample t-methods or the randomization/bootstrap methods.

B. Both sample sizes are somewhat small. Even though the sample data are roughly symmetric, give an argument why the data in each population may not be symmetric and bell-shaped.

It is possible that, because the sample sizes are so small, the samples do not reflect the true shape of the distribution of commuting times.

C. If the data in the population are skewed, what method and what parameter should be used?

If the data are skewed, we should use the randomization/bootstrap methods and compare the medians of the two routes.

9. The summary statistics for each route are given below:

# SUMMARY STATS: ROUTE 1
n1<-length(route1)
xbar1<-mean(route1)
med1<-median(route1)
s1<-sd(route1)

n1

## [1] 10

xbar1

## [1] 28

med1

## [1] 28.4

s1

## [1] 6.003703

# SUMMARY STATS: ROUTE 2
n2<-length(route2)
xbar2<-mean(route2)
med2<-median(route2)
s2<-sd(route2)

n2

## [1] 10

xbar2

## [1] 32

med2

## [1] 32.2

s2

## [1] 6.001852

Steps 6 & 7: find the p-value and answer the question of interest

10. The “un-pooled” two-sample t-methods:

A. If H0 is true, what is $ \mu_1-\mu_2 $?

If H0 is true, $ \mu_1-\mu_2=0 $; that is, the true difference in population means is zero.

B. What is the best estimate of $ \mu_1-\mu_2 $?

The best estimate is $ \bar{x}_1-\bar{x}_2 $

C. Calculate $ \bar{x}_1-\bar{x}_2 $.

xbar1-xbar2

## [1] -4

D. Could $ \bar{x}_1-\bar{x}_2 $= -4 minutes even if H0 is true?

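Yes. Sample means vary from sample to sample, so the difference in sample means can differ from 0 even when $ \mu_1-\mu_2=0 $. The distribution of differences in sample means is approximately normal, and a normal distribution places some probability on every value from negative infinity to positive infinity, so it is possible for $ \bar{x}_1-\bar{x}_2 $ to be -4.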

E. The idea is to find how likely it is to observe the difference in sample means we observed (-4 minutes) or a difference “more unusual” if the null hypothesis is true.

i. What values of the differences in sample means (route 1 – route 2) are “as or more unusual” than the difference observed”?

Differences of 4 or more, or of -4 or less, are as or more unusual than the difference observed, since we have a two-sided alternative.

ii. As a probability statement, write the probability of observing a difference in sample means like the one observed or a difference “more unusual”?

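$ Pr(\bar{x}_1-\bar{x}_2 \leq -4)+Pr(\bar{x}_1-\bar{x}_2 \geq 4) $ = $ 2 \times Pr(\bar{x}_1-\bar{x}_2 \geq 4) $, where the two tail probabilities are equal because the distribution is symmetric about 0 when H0 is true.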

F. Notice the probability statement is in terms of the difference in sample means ($ \bar{x}_1-\bar{x}_2 $). To find this probability (and, therefore, the p-value), we need to generate the distribution of differences in sample means. To use the two-sample t-methods to find this probability, the distribution of differences in sample means needs to be approximately normally distributed.

Sketch the distribution of the difference in sample means ($ \bar{x}_1-\bar{x}_2 $) under the condition that the null hypothesis is true. Label the x-axis, mark the spots at the mean of this distribution and one standard error away from the mean. Shade the region corresponding to the probability that we’re trying to find.

i. Under what condition is this shape normally distributed?

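If the commute times for each route are normally distributed in the population, then the difference in sample means is also normally distributed. (With large samples, the difference in sample means is approximately normal even if the populations are not.)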

ii. What is the value of the mean of this distribution ($ \mu_{\bar{x}_1-\bar{x}_2} $ ) if the null hypothesis is true?

If the null hypothesis is true, then $ \mu_{\bar{x}_1-\bar{x}_2}=\mu_1-\mu_2=0 $.

iii. The standard error of the distribution of the differences in sample means ($ SE_{\bar{x}_1-\bar{x}_2} $ ).

Derivation of formula:

Even though we’re considering two groups, we’re looking at one distribution: the distribution of the difference in sample means. In a sense, we are “combining” the two groups together into one distribution. To determine the variation in this distribution of differences, we need to “combine” the standard deviations of the two groups. To “combine” the standard deviations of the two separate groups, we ADD THE VARIANCES OF THE TWO GROUPS TOGETHER.

Recall, the variance is the standard deviation squared. The notation for the population variance is $ \sigma^2 $, and for the sample variance $ s^2 $.

If the population standard deviations were known:

variance for route 1: $ \sigma_{\bar{x}_1}^2=\frac{\sigma_1^2}{n_1} $
variance for route 2: $ \sigma_{\bar{x}_2}^2=\frac{\sigma_2^2}{n_2} $
Adding the variances, we get: $ \sigma_{\bar{x}_1-\bar{x}_2}^2=\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2} $
To get the standard deviation of the distribution of differences in sample means, just take the square root: $ \sigma_{\bar{x}_1-\bar{x}_2}=\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}} $

If the population standard deviations were unknown (which is usually the case):

replace the population variances with the sample variances. This is called the standard error of the distribution of the differences in sample means: $ SE_{\bar{x}_1-\bar{x}_2}=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}} $
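The "add the variances" rule can be checked with a small simulation. This is only a sketch: the two normal populations below (means of 30 and standard deviations of 6) are hypothetical values chosen for illustration and are not estimates from Katy's data.

```r
# Simulation sketch: the population values below are assumptions for illustration only
set.seed(1)
n1 <- 10; n2 <- 10
mu1 <- 30; mu2 <- 30          # assumed population means (equal, as under H0)
sigma1 <- 6; sigma2 <- 6      # assumed population standard deviations

# Draw many pairs of samples and record the difference in sample means each time
diffs <- replicate(20000,
                   mean(rnorm(n1, mu1, sigma1)) - mean(rnorm(n2, mu2, sigma2)))

sd_simulated <- sd(diffs)                            # spread of the simulated differences
sd_formula   <- sqrt(sigma1^2 / n1 + sigma2^2 / n2)  # value given by the formula
c(simulated = sd_simulated, formula = sd_formula)
```

The two values should agree closely, which supports adding the variances rather than the standard deviations.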

Use this formula to calculate the standard error

SEdiff<-sqrt((s1^2/n1)+(s2^2/n2))
SEdiff

## [1] 2.684524

G. Calculate the t-statistic. What are the degrees of freedom for this t-statistic? $ t=\frac{\bar{x}_1-\bar{x}_2}{SE_{\bar{x}_1-\bar{x}_2}} $

test_unpooled<-(xbar1-xbar2)/SEdiff
test_unpooled

## [1] -1.490022

Conservative DF: $ min(n_1-1, n_2-1) $

Estimated DF:$ \frac{(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2})^2}{\frac{s_1^4}{n_1^2(n_1-1)}+\frac{s_2^4}{n_2^2(n_2-1)}} $
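The two degrees-of-freedom choices can be computed directly from the data. The sketch below re-enters the commute times so that it runs on its own; the variable names (`v1`, `v2`, `df_conservative`, `df_estimated`) are introduced here for illustration. It also compares the estimated value with the degrees of freedom reported by R's built-in `t.test`, which performs the un-pooled test by default.

```r
route1 <- c(19.3, 20.5, 23.0, 25.8, 28.0, 28.8, 30.6, 32.1, 33.5, 38.4)
route2 <- c(23.7, 24.5, 27.7, 30.0, 31.9, 32.5, 32.6, 35.5, 38.7, 42.9)
n1 <- length(route1); n2 <- length(route2)

v1 <- var(route1) / n1   # s1^2 / n1
v2 <- var(route2) / n2   # s2^2 / n2

# Conservative DF: the smaller of the two sample sizes, minus 1
df_conservative <- min(n1 - 1, n2 - 1)

# Estimated DF: same formula as above, written in terms of v1 and v2
df_estimated <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

c(conservative = df_conservative, estimated = df_estimated)

# Cross-check against the built-in un-pooled two-sample t-test
welch <- t.test(route1, route2)   # var.equal = FALSE is the default
welch$parameter                   # degrees of freedom used by t.test
```

Because the two samples have equal sizes and nearly equal standard deviations, the estimated degrees of freedom come out very close to 18, twice the conservative value of 9. The conservative choice gives a somewhat larger p-value.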

H. Use a calculator (hand-held or online) or t-table to find the p-value.

2*pt(-abs(test_unpooled), df=min(n1-1, n2-1), lower.tail=TRUE)

## [1] 0.1704049

I. State a conclusion in the context of the problem.

Since the p-value (0.17) is larger than $ \alpha=0.05 $, we fail to reject the null hypothesis. There is not enough evidence to suggest that there is a difference in mean commuting times for the two routes.

J. The “pooled” two-sample t-methods

If the samples taken from the two populations came from populations that had the same standard deviations, another type of two-sample t-test should be performed instead. It’s called a “pooled” two-sample t-test because the standard deviations of the two groups will be “pooled” together to get the standard error of the distribution of differences in sample means. Again, this can only be done if we feel confident that the population standard deviations of the two populations being compared are the same. That is, if $ \sigma_1 $ is the population standard deviation of the first group and $ \sigma_2 $ is the population standard deviation of the second group, the “pooled” two-sample t-methods can only be used if $ \sigma_1=\sigma_2 $.

i. When is it reasonable to say $ \sigma_1=\sigma_2 $?

If an experiment is performed with two comparison groups, it is often reasonable to assume that the standard deviations of the two comparison groups are the same, because the randomization of the cases to the two comparison groups “balances” all variables, thus creating groups with the same spread in the response variable.
If independent random samples were taken from two populations, the pooled methods should only be used if the spread of the sample data is nearly the same for both comparison groups. (If the spread of the sample data for the two comparison groups is roughly the same, we can feel comfortable saying the samples came from populations with the same spread.) To determine if the spread of the sample data is nearly the same, we can do the following:
- Look at the side-by-side box-and-whisker plots: if the width of the boxes and the lengths of each of the whiskers for the two comparison groups is roughly the same, we can feel comfortable saying the samples came from populations with the same spread
- Compare the sample standard deviations: if the ratio of the larger sample standard deviation to the smaller sample standard deviation is close to 1, we can feel comfortable saying the samples came from populations with the same spread. How close does this ratio need to be to 1? There is not a right answer and not one acceptable ratio to use, but if it’s above 1.3 or so, do not use the pooled methods.
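The ratio check can be written in a couple of lines. This is a minimal sketch using the commute times from this lesson; `sd_ratio` is a name introduced here, and the 1.3 cutoff is the rough guideline stated above rather than a formal rule.

```r
route1 <- c(19.3, 20.5, 23.0, 25.8, 28.0, 28.8, 30.6, 32.1, 33.5, 38.4)
route2 <- c(23.7, 24.5, 27.7, 30.0, 31.9, 32.5, 32.6, 35.5, 38.7, 42.9)

s1 <- sd(route1)
s2 <- sd(route2)

# Ratio of the larger sample standard deviation to the smaller one (always >= 1)
sd_ratio <- max(s1, s2) / min(s1, s2)
sd_ratio

# TRUE means the ratio is within the rough 1.3 guideline
sd_ratio <= 1.3
```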

Based on the above guidelines, can the pooled two-sample t-methods be used in the “commuting example”? Explain.

We have two independent samples from two different populations. Since the sample standard deviations are so close it may be reasonable to use pooled methods.

ii. What is different between the pooled methods and the “un-pooled” methods?

A. The standard error formula is different. Without derivation, here is the standard error formula for the pooled two-sample t-methods:

$ SE_{\bar{x}_1-\bar{x}_2}=s_p\times \sqrt{\frac{1}{n_1}+\frac{1}{n_2}} $ where $ s_p=\sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{(n_1+n_2-2)}} $

Use this formula to calculate the standard error of the distribution of differences in sample means. How different is it from the standard error from the “un-pooled” methods?

Sp<-sqrt(((n1-1)*s1^2+(n2-1)*s2^2)/(n1+n2-2))
SEdiff_pooled<-Sp*sqrt(1/n1+1/n2)
SEdiff_pooled

## [1] 2.684524

compared to

SEdiff

## [1] 2.684524

The pooled and un-pooled standard errors are the same here. When the two sample sizes are equal, the two formulas give identical values; in this example the sample standard deviations are also nearly equal.

B. The calculation of the degrees of freedom is different. Here is the formula for the pooled two-sample t-methods: $ n_1 + n_2 - 2 $.

What are the degrees of freedom if the pooled two-sample t-methods were used?

df_pooled<-n1+n2-2
df_pooled

## [1] 18
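To finish the pooled test, the t-statistic and p-value can be found the same way as in the un-pooled case, but with the pooled standard error and 18 degrees of freedom. The sketch below re-enters the data so it runs on its own; `t_pooled` and `p_pooled` are names introduced here. It also cross-checks the result against R's built-in `t.test` with `var.equal = TRUE`.

```r
route1 <- c(19.3, 20.5, 23.0, 25.8, 28.0, 28.8, 30.6, 32.1, 33.5, 38.4)
route2 <- c(23.7, 24.5, 27.7, 30.0, 31.9, 32.5, 32.6, 35.5, 38.7, 42.9)
n1 <- length(route1); n2 <- length(route2)

# Pooled standard deviation and pooled standard error
Sp <- sqrt(((n1 - 1) * var(route1) + (n2 - 1) * var(route2)) / (n1 + n2 - 2))
SEdiff_pooled <- Sp * sqrt(1 / n1 + 1 / n2)
df_pooled <- n1 + n2 - 2

# Pooled t-statistic and two-sided p-value
t_pooled <- (mean(route1) - mean(route2)) / SEdiff_pooled
p_pooled <- 2 * pt(-abs(t_pooled), df = df_pooled)
c(t = t_pooled, df = df_pooled, p_value = p_pooled)

# Cross-check with the built-in pooled two-sample t-test
check <- t.test(route1, route2, var.equal = TRUE)
check$statistic
check$p.value
```

The t-statistic matches the un-pooled one because the standard errors are the same. Only the degrees of freedom, and therefore the p-value, change.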

11. The randomization test to compare two groups

Recall the idea of a randomization test when comparing two groups in the experimental setting: if the null hypothesis is true (no difference between the two comparison groups), we’d expect each case’s value of the response variable to be the same no matter to which group the case was assigned. A randomization test will generate many, many different randomizations of the cases to the different comparison groups and determine how many of them have differences in sample means (or medians) “as or more unusual” than the one observed. Since the hypothesized value of the differences in means (or medians) is always 0, “as or more unusual” implies differences further away from 0 than the original difference is from 0.

In particular, if an experiment has been performed, the twomeans function will

randomly assign each case to one of the two groups (so that each group has the same number of cases as in the original problem – 10 to route 1 and 10 to route 2 in this example)
calculate the mean (or median) of each group for the new random assignment
calculate the difference in the new sample means (or medians)
repeat this process the number of times specified in the argument for the function (suggested at least 2000 times)
count the number of randomizations that produce a difference in sample means (or medians) like the one observed or “more unusual” (further away from the hypothesized value for the difference, which is always 0 (no difference) in a two-sample problem)
divide this count by the number of randomizations specified in the argument for the function to get the p-value
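The steps above can be sketched in base R. This is not the course's twomeans function, whose internals are not shown here; `diff_stat`, `commute`, and `groups` are hypothetical names introduced for illustration, and the seed is arbitrary.

```r
set.seed(2)   # arbitrary seed so the sketch is reproducible
route1 <- c(19.3, 20.5, 23.0, 25.8, 28.0, 28.8, 30.6, 32.1, 33.5, 38.4)
route2 <- c(23.7, 24.5, 27.7, 30.0, 31.9, 32.5, 32.6, 35.5, 38.7, 42.9)

# One vector of responses and one vector of group labels
commute <- c(route1, route2)
groups  <- rep(c("route1", "route2"), times = c(length(route1), length(route2)))

# Hypothetical helper: difference (route 1 - route 2) in a summary statistic
diff_stat <- function(values, labels, stat = mean) {
  stat(values[labels == "route1"]) - stat(values[labels == "route2"])
}

observed <- diff_stat(commute, groups)

# Shuffle the group labels many times, keeping 10 cases per group,
# and recompute the difference in sample means each time
n_rand <- 5000
rand_diffs <- replicate(n_rand, diff_stat(commute, sample(groups)))

# Two-sided p-value: proportion of randomizations at least as far from 0
# as the observed difference (small tolerance guards against rounding error)
p_value <- mean(abs(rand_diffs) >= abs(observed) - 1e-8)

c(observed = observed, sd_of_diffs = sd(rand_diffs), p_value = p_value)
```

Passing `stat = median` to `diff_stat` in both places would give the corresponding randomization test for the difference in medians.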

The twomeans function will also give an estimate of the standard deviation of the distribution of differences in sample means (or medians) by calculating the standard deviation of the several thousand differences and construct a confidence interval using the percentile method.

In order to run the twomeans function, the data set must be organized in such a way that there is one column (vector) that contains the values of the response variable and one column (vector) that contains the explanatory variable or groups.

A general form of the twomeans function looks something like this:

with(NAMEOFDATASET,
     twomeans(response, groups, iterations, MEAN = FALSE, MEDIAN = FALSE,
              ci_level = NULL, Alt_Hyp = NULL, histogram = TRUE)
     )

For this example, the arguments are

response: time (the column of COMMUTE that contains commuting times)
groups: route (the column of COMMUTE that contains the routes)
iterations: 5000 (again, at least 2000 randomizations is suggested)
output: my.twomeans (the object that will store the results of the function)

In R, the “call” of the twomeans function would look like this:

my.twomeans <- 
  with(COMMUTE, 
       twomeans(time, route, 5000, MEAN = TRUE, ci_level = .95, Alt_Hyp = 3)
       )

After running the function, you can pull out any of the saved output using the following commands:

my.twomeans$Diffs_SD

my.twomeans$Confidence_Intervals

my.twomeans$pval

my.twomeans$Alt_Hyp

The output includes a confidence interval (at the confidence level specified in the function call) with bounds determined using the percentile method:
- lower bound is the (100 – CL) / 2 percentile, where CL = the confidence level
- upper bound is the (100 + CL) / 2 percentile
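The percentile bounds can be illustrated with a short bootstrap sketch in base R. How twomeans resamples internally is not shown in these notes, so the resampling scheme here (resampling each route's times with replacement) is an assumption, and `boot_diffs`, `cl`, `lower`, and `upper` are names introduced for illustration.

```r
set.seed(3)   # arbitrary seed so the sketch is reproducible
route1 <- c(19.3, 20.5, 23.0, 25.8, 28.0, 28.8, 30.6, 32.1, 33.5, 38.4)
route2 <- c(23.7, 24.5, 27.7, 30.0, 31.9, 32.5, 32.6, 35.5, 38.7, 42.9)

cl <- 95   # confidence level, in percent

# Assumed scheme: resample each route with replacement and
# record the difference in resampled means (route 1 - route 2)
boot_diffs <- replicate(5000,
                        mean(sample(route1, replace = TRUE)) -
                          mean(sample(route2, replace = TRUE)))

# Percentile method: quantile() takes probabilities, so divide percentiles by 100
lower <- quantile(boot_diffs, probs = (100 - cl) / 2 / 100)   # 2.5th percentile
upper <- quantile(boot_diffs, probs = (100 + cl) / 2 / 100)   # 97.5th percentile
c(lower, upper)
```

The exact bounds change slightly from run to run (and with the seed), which is why they will not match the twomeans output digit for digit.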

Here is the output from the twomeans function for a hypothesis test comparing the mean commuting time of the two routes:

my.twomeans$Diffs_SD

[1] 2.757522

my.twomeans$Confidence_Intervals

  CI_Percent CI_Formula
1      -9.34  -9.404643
2       1.30   1.404643

my.twomeans$pval

[1] 0.1506

my.twomeans$Alt_Hyp

[1] "Pop1 =/= Pop2 (two sided)"

Compare the p-value and standard deviation of the difference in sample means to the p-value and standard error from the two-sample t-test.

12. Confidence intervals for the difference in the population means

A. Construct and interpret a 95% confidence interval for the difference in population means ($ \mu_1-\mu_2 $) using the “un-pooled” two-sample t-methods by hand.

df<-min(n1-1, n2-1)
criticalVal<-qt(0.975, df)
criticalVal

## [1] 2.262157

CI<-(xbar1-xbar2)+c(-1,1)*criticalVal*SEdiff
CI

## [1] -10.072814   2.072814

We are 95% confident that the true difference in mean commuting times (route 1 - route 2) is between -10.07 and 2.07 minutes. Since zero is contained in this interval, there is not a significant difference between the mean commuting times for the two routes.

B. Would the confidence interval change if the pooled two-sample t-methods were used?

df

## [1] 9

df_pooled

## [1] 18

Yes. The standard error is the same here, but the degrees of freedom are larger in the pooled case (18 versus 9), so the associated critical value is smaller and the confidence interval is narrower.
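A minimal sketch of the pooled interval, following the same by-hand steps as in part A but with the pooled standard error and 18 degrees of freedom. The names `criticalVal_pooled` and `CI_pooled` are introduced here for illustration, and the data are re-entered so the sketch runs on its own.

```r
route1 <- c(19.3, 20.5, 23.0, 25.8, 28.0, 28.8, 30.6, 32.1, 33.5, 38.4)
route2 <- c(23.7, 24.5, 27.7, 30.0, 31.9, 32.5, 32.6, 35.5, 38.7, 42.9)
n1 <- length(route1); n2 <- length(route2)

# Pooled standard error and degrees of freedom
Sp <- sqrt(((n1 - 1) * var(route1) + (n2 - 1) * var(route2)) / (n1 + n2 - 2))
SEdiff_pooled <- Sp * sqrt(1 / n1 + 1 / n2)
df_pooled <- n1 + n2 - 2

# Critical value for 95% confidence with 18 degrees of freedom
criticalVal_pooled <- qt(0.975, df_pooled)

CI_pooled <- (mean(route1) - mean(route2)) + c(-1, 1) * criticalVal_pooled * SEdiff_pooled
CI_pooled

# Cross-check with the built-in pooled two-sample t-test
t.test(route1, route2, var.equal = TRUE)$conf.int
```

The pooled interval runs from about -9.64 to 1.64 minutes, narrower than the un-pooled interval built with 9 degrees of freedom, and it still contains 0.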

C. Write the 95% confidence interval using the percentile method from the twomeans function.

my.twomeans$Confidence_Intervals

  CI_Percent CI_Formula
1      -9.34  -9.404643
2       1.30   1.404643

i. Show how the bounds were determined.

The bounds are the 2.5th and 97.5th percentiles of the bootstrapped differences in sample means; that is, the interval covers the middle 95% of those differences.

ii. Compare the bounds with the two-sample t-methods.

The confidence interval for the two sample t-method is wider.

13. Although it seems appropriate to use the t-methods to compare the means, the sample sizes are quite small. With small sample sizes, we have to be certain that the data in the populations from which the samples were taken are normally distributed. In this example, that means that we must believe that all commuting times for both routes are normally distributed. An argument was given in #8b that commuting times for either route may not be normally distributed. If the population data are not normally distributed for one or both groups and the sample sizes are small, the t-methods to compare the means should not be used.

Let’s suppose that you felt that all commuting times for both routes weren’t perfectly normally distributed and/or there were “extreme” outliers. With the small sample sizes and skewed data and/or extreme outliers, inference on the medians should be performed. Only the randomization/bootstrap methods can be used when comparing the medians of two groups.

A. State the null and alternative hypotheses in notation. Define the notation used.

$ H_0: m_1-m_2=0 $ or $ H_0: m_1=m_2 $

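$ H_A: m_1-m_2 \neq 0 $ or $ H_A: m_1 \neq m_2 $

Here $ m_1 $ and $ m_2 $ are the population median commuting times (in minutes) for route 1 and route 2.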

B. Here is what the call to the twomeans function in R looks like in this situation:

my.twoMED <- with(COMMUTE, twomeans(time, route, 5000, MEDIAN = TRUE, ci_level = .95, Alt_Hyp = 3) )

C. Below is the output from the twomeans function comparing the medians of the two routes.

my.twoMED$Diffs_SD

[1] 3.088312

my.twoMED$Confidence_Intervals

  CI_Percent CI_Formula
1      -9.35   -9.85298
2       1.75    2.25298

my.twoMED$pval

[1] 0.2484

my.twoMED$Alt_Hyp

[1] "Pop1 =/= Pop2 (two sided)"

i. Based on the p-value, state a conclusion in the context of the problem. Does the conclusion agree with the conclusion when comparing the means of the two routes?

Again, due to the large p-value (0.2484), there is not enough evidence to suggest that there is a difference in the commute times for the two routes. Therefore, we fail to reject the null hypothesis that the median commute times for the two routes are the same. This agrees with the conclusion reached when comparing the means.

ii. Using the percentile method from the twomeans function, write and interpret the 95% confidence interval for the difference in the population medians between the two routes (route 1 – route 2).

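We are 95% confident that the difference in population median commuting times (route 1 - route 2) is between -9.35 and 1.75 minutes; these bounds are the 2.5th and 97.5th percentiles of the bootstrapped differences in sample medians. Since 0 is contained in this interval, there is not a significant difference between the median commute times of the two routes.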