Joel Correa da Rosa
February 1st 2017
In the most general (worst-case) scenario, the comparison of two independent population means is an inferential procedure that involves four unknown parameters:
\( \mu_1 \) : mean of population 1
\( \mu_2 \) : mean of population 2
\( \sigma_1^2 \) : variance of population 1
\( \sigma_2^2 \) : variance of population 2
Inference about two independent population means is usually motivated by the following scientific question:
“Is there any difference between the means of a variable across two independent populations?”
This question can be addressed in three ways:
Hypothesis testing
Confidence interval
Regression
An investigator claims that EAAT2 (Excitatory Amino Acid Transporter 2) gene expression levels are higher in old mice when compared to young mice. To verify this hypothesis, the investigator will draw two independent samples:
\( n_1=5 \) old mice
\( n_2=5 \) young mice
old | young |
---|---|
6.7 | 9.4 |
9.8 | 7.9 |
10.1 | 7.5 |
11.8 | 7.8 |
10.5 | 8.5 |
The mean and standard deviation for the sample of old mice are \( \bar{x}_1 \) = 9.78 and \( S_1 \) = 1.88 respectively.
The mean and standard deviation for the sample of young mice are \( \bar{x}_2 \) = 8.22 and \( S_2 \) = 0.75 respectively.
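These summary statistics can be reproduced in R; here is a minimal sketch using the values from the table above (not part of the original code):

old   <- c(6.7, 9.8, 10.1, 11.8, 10.5)
young <- c(9.4, 7.9, 7.5, 7.8, 8.5)

c(mean(old), sd(old))      # 9.78 and approximately 1.88
c(mean(young), sd(young))  # 8.22 and approximately 0.75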
Two hypotheses are stated:
\( H_0 \): The two populations have the same mean (\( \mu_1 = \mu_2 \))
\( H_1 \): The two populations have different means (\( \mu_1 \neq \mu_2 \))
The decision-making process is based on probabilities evaluated for a test statistic.
Basic assumptions: the variable follows a normal distribution in both populations.

If the variances are known and equal (\( \sigma_1^2 = \sigma_2^2 = \sigma^2 \)), the test statistic is

\( z = \frac{\bar{x}_1-\bar{x}_2-(\mu_1-\mu_2)}{\sigma\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \)

If the variances are known but unequal,

\( z = \frac{\bar{x}_1-\bar{x}_2-(\mu_1-\mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} \)

In both cases \( z \) follows the standard normal distribution.

When the variances are unknown, they are replaced by sample estimates and the test statistic follows a Student's t distribution. With a pooled estimate \( S_p \) (variances assumed equal),

\( t_{n_1+n_2-2} = \frac{\bar{x}_1-\bar{x}_2-(\mu_1-\mu_2)}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \)

With separate sample variances (Welch's approach),

\( t_{m} = \frac{\bar{x}_1-\bar{x}_2-(\mu_1-\mu_2)}{\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}} \)
If we assume that the variances are equal, they are estimated by a single pooled variance:
\( S_p^2=\frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2} \)
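As a quick check (a minimal sketch, not part of the original code), the pooled variance and the resulting t statistic for the mice data can be computed by hand; the result matches the t = 1.7198 reported by t.test further below.

old   <- c(6.7, 9.8, 10.1, 11.8, 10.5)
young <- c(9.4, 7.9, 7.5, 7.8, 8.5)
n1 <- length(old); n2 <- length(young)

# pooled variance and pooled standard deviation
sp2 <- ((n1 - 1) * var(old) + (n2 - 1) * var(young)) / (n1 + n2 - 2)
sp  <- sqrt(sp2)

# t statistic under H0: mu1 = mu2 (should match t = 1.7198 from t.test below)
(mean(old) - mean(young)) / (sp * sqrt(1 / n1 + 1 / n2))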
When the variances are unknown and not assumed equal, the degrees of freedom \( m \) for the Student's t statistic are obtained by combining the two sample variances and the two sample sizes (the Welch–Satterthwaite approximation).
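A commonly used form of this approximation, and the default in R's t.test, is

\( m = \frac{\left( \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} \right)^2}{\frac{(S_1^2/n_1)^2}{n_1-1} + \frac{(S_2^2/n_2)^2}{n_2-1}} \)

For the mice data this gives \( m \approx 5.25 \), the degrees of freedom reported in the Welch output below.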
The confidence interval gives a range of plausible values for the unknown difference \( \mu_1-\mu_2 \).
\( P(\bar{x}_1-\bar{x}_2-t(\alpha)\times s.e. < \mu_1-\mu_2 < \bar{x}_1-\bar{x}_2 + t(\alpha)\times s.e.) \) \( =1-\alpha \)
\( s.e. \) is the standard error of \( \bar{x}_1-\bar{x}_2 \), i.e., the denominator of the test statistic.
\( t(\alpha) \) is the quantile of Student's t distribution that leaves total probability \( \alpha \) split equally between the two tails.
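To make the formula concrete, here is a minimal sketch (using the rounded summary statistics above, not the original code) that builds the 95% interval for the equal-variance case; it is close to the interval reported by t.test(old, young, var.equal = TRUE) further below.

xbar1 <- 9.78; s1 <- 1.88; n1 <- 5   # old mice
xbar2 <- 8.22; s2 <- 0.75; n2 <- 5   # young mice

sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))  # pooled SD
se <- sp * sqrt(1 / n1 + 1 / n2)                                 # standard error
tq <- qt(0.975, df = n1 + n2 - 2)                                # t(alpha) for alpha = 0.05

(xbar1 - xbar2) + c(-1, 1) * tq * se  # roughly (-0.53, 3.65)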
old <- c(6.7, 9.8, 10.1, 11.8, 10.5)
young <- c(9.4, 7.9, 7.5, 7.8, 8.5)
t.test(old,young)
Welch Two Sample t-test
data: old and young
t = 1.7198, df = 5.247, p-value = 0.1433
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.739105 3.859105
sample estimates:
mean of x mean of y
9.78 8.22
old <- c(6.7, 9.8, 10.1, 11.8, 10.5)
young <- c(9.4, 7.9, 7.5, 7.8, 8.5)
t.test(old,young,var.equal = TRUE)
Two Sample t-test
data: old and young
t = 1.7198, df = 8, p-value = 0.1238
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5317377 3.6517377
sample estimates:
mean of x mean of y
9.78 8.22
The regression approach assumes that there is a relationship between two variables:
\( Y \) : The EAAT2 level of expression
\( X \) : The age group of the mouse (\( X=0 \) for young mice and \( X=1 \) for old mice)
This association is determined by the equation
\( Y = \beta_0 + \beta_1 X +\epsilon \)
We assume that \( \epsilon \) follows a normal distribution centered at zero.
In the regression approach:
\( \beta_0 \) : mean expression level of EAAT2 in the young mice population
\( \beta_1 \) : increase (or decrease) in mean expression level associated with being in the old population, i.e., the difference between the two population means.
The parameters can be estimated by minimizing the sum of squared residuals.
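As an illustration (a minimal sketch, not the author's code), the two coefficients can be recovered by numerically minimizing the residual sum of squares; the minimizer is approximately (8.22, 1.56), i.e., the young-group mean and the difference between group means.

old   <- c(6.7, 9.8, 10.1, 11.8, 10.5)
young <- c(9.4, 7.9, 7.5, 7.8, 8.5)
y <- c(old, young)
x <- c(rep(1, 5), rep(0, 5))              # X = 1 for old mice, 0 for young mice

rss <- function(beta) sum((y - beta[1] - beta[2] * x)^2)  # sum of squared residuals
optim(c(0, 0), rss)$par                   # approximately beta0 = 8.22, beta1 = 1.56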
Data configuration.
old <- c(6.7, 9.8, 10.1, 11.8, 10.5)
young <- c(9.4, 7.9, 7.5, 7.8, 8.5)
eaat2 <- c(old, young)
age <- c(rep('old', 5), rep('young', 5))
dr <- cbind.data.frame(eaat2 = eaat2, age = age)
dr$age <- factor(dr$age, levels = c('young', 'old'))
eaat2 | age |
---|---|
6.7 | old |
9.8 | old |
10.1 | old |
11.8 | old |
10.5 | old |
9.4 | young |
7.9 | young |
7.5 | young |
7.8 | young |
8.5 | young |
fit<-lm(eaat2~age,data=dr)
summary(fit)
Call:
lm(formula = eaat2 ~ age, data = dr)
Residuals:
Min 1Q Median 3Q Max
-3.080 -0.395 0.150 0.620 2.020
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.2200 0.6414 12.82 1.3e-06 ***
ageold 1.5600 0.9071 1.72 0.124
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.434 on 8 degrees of freedom
Multiple R-squared: 0.2699, Adjusted R-squared: 0.1787
F-statistic: 2.958 on 1 and 8 DF, p-value: 0.1238
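As a quick cross-check (not part of the original code): the estimate for ageold, 1.56, is exactly the difference between the two sample means, its p-value (0.124) matches the equal-variance t-test above, and the confidence interval for this coefficient reproduces the t-test interval.

confint(fit)["ageold", ]  # roughly (-0.53, 3.65), as in t.test(old, young, var.equal = TRUE)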
hist(fit$residuals, breaks = 'FD')  # histogram of the residuals
qqnorm(fit$residuals)               # normal Q-Q plot to check the normality assumption