Joel Correa da Rosa
February 1st 2017
In the most general (worst-case) scenario, the comparison of two independent population means is an inferential procedure that involves four unknown parameters:
\( \mu_1 \) : mean of population 1
\( \mu_2 \) : mean of population 2
\( \sigma_1^2 \) : variance of population 1
\( \sigma_2^2 \) : variance of population 2
Inference about two independent population means is usually motivated by the following scientific question:
“Is there any difference between the means of a variable across two independent populations?”
This question can be addressed in three ways:
Hypothesis testing
Confidence interval
Regression
An investigator claims that EAAT2 (Excitatory Amino Acid Transporter 2) gene expression levels are higher in old mice when compared to young mice. To verify this hypothesis, the investigator will draw two independent samples:
\( n_1=5 \) old mice
\( n_2=5 \) young mice
old | young |
---|---|
6.7 | 9.4 |
9.8 | 7.9 |
10.1 | 7.5 |
11.8 | 7.8 |
10.5 | 8.5 |
The mean and standard deviation for the sample of old mice are \( \bar{x}_1 \) = 9.78 and \( S_1 \) = 1.88 respectively.
The mean and standard deviation for the sample of young mice are \( \bar{x}_2 \) = 8.22 and \( S_2 \) = 0.75 respectively.
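These summary statistics can be reproduced in R; here is a minimal sketch using the values from the table above (not part of the original code):

old   <- c(6.7, 9.8, 10.1, 11.8, 10.5)
young <- c(9.4, 7.9, 7.5, 7.8, 8.5)

c(mean(old), sd(old))      # 9.78 and approximately 1.88
c(mean(young), sd(young))  # 8.22 and approximately 0.75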
Two hypotheses are stated:
\( H_0 \): The two populations have the same mean (\( \mu_1 = \mu_2 \))
\( H_1 \): The two populations have different means (\( \mu_1 \neq \mu_2 \))
The decision-making process is based on probabilities evaluated for a test statistic.
Basic assumptions: the variable follows a normal distribution in both populations.

If the variances are known and equal (\( \sigma_1^2 = \sigma_2^2 = \sigma^2 \)), the test statistic is

\( z = \frac{\bar{x}_1-\bar{x}_2-(\mu_1-\mu_2)}{\sigma\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \)

If the variances are known but unequal,

\( z = \frac{\bar{x}_1-\bar{x}_2-(\mu_1-\mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}} \)

In both cases \( z \) follows the standard normal distribution.

When the variances are unknown, they are replaced by sample estimates and the test statistic follows a Student's t distribution. With a pooled estimate \( S_p \) (variances assumed equal),

\( t_{n_1+n_2-2} = \frac{\bar{x}_1-\bar{x}_2-(\mu_1-\mu_2)}{S_p\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \)

With separate sample variances (Welch's approach),

\( t_{m} = \frac{\bar{x}_1-\bar{x}_2-(\mu_1-\mu_2)}{\sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}}} \)
If we assume that the variances are equal, they are estimated by a single pooled variance:
\( S_p^2=\frac{(n_1-1)S_1^2 + (n_2-1)S_2^2}{n_1+n_2-2} \)
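As a quick check (a minimal sketch, not part of the original code), the pooled variance and the resulting t statistic for the mice data can be computed by hand; the result matches the t = 1.7198 reported by t.test further below.

old   <- c(6.7, 9.8, 10.1, 11.8, 10.5)
young <- c(9.4, 7.9, 7.5, 7.8, 8.5)
n1 <- length(old); n2 <- length(young)

# pooled variance and pooled standard deviation
sp2 <- ((n1 - 1) * var(old) + (n2 - 1) * var(young)) / (n1 + n2 - 2)
sp  <- sqrt(sp2)

# t statistic under H0: mu1 = mu2 (should match t = 1.7198 from t.test below)
(mean(old) - mean(young)) / (sp * sqrt(1 / n1 + 1 / n2))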
When the variances are unknown and not assumed equal, the degrees of freedom \( m \) for the Student's t statistic are obtained by combining the two sample variances and the two sample sizes (the Welch–Satterthwaite approximation).
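A commonly used form of this approximation, and the default in R's t.test, is

\( m = \frac{\left( \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} \right)^2}{\frac{(S_1^2/n_1)^2}{n_1-1} + \frac{(S_2^2/n_2)^2}{n_2-1}} \)

For the mice data this gives \( m \approx 5.25 \), the degrees of freedom reported in the Welch output below.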
The confidence interval gives a range of plausible values for the unknown difference \( \mu_1-\mu_2 \).
\( P(\bar{x}_1-\bar{x}_2-t(\alpha)\times s.e. < \mu_1-\mu_2 < \bar{x}_1-\bar{x}_2 + t(\alpha)\times s.e.) \) \( =1-\alpha \)
\( s.e. \) is the standard error of \( \bar{x}_1-\bar{x}_2 \), i.e., the denominator of the test statistic.
\( t(\alpha) \) is the quantile of Student's t distribution that leaves total probability \( \alpha \) split equally between the two tails.
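To make the formula concrete, here is a minimal sketch (using the rounded summary statistics above, not the original code) that builds the 95% interval for the equal-variance case; it is close to the interval reported by t.test(old, young, var.equal = TRUE) further below.

xbar1 <- 9.78; s1 <- 1.88; n1 <- 5   # old mice
xbar2 <- 8.22; s2 <- 0.75; n2 <- 5   # young mice

sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))  # pooled SD
se <- sp * sqrt(1 / n1 + 1 / n2)                                 # standard error
tq <- qt(0.975, df = n1 + n2 - 2)                                # t(alpha) for alpha = 0.05

(xbar1 - xbar2) + c(-1, 1) * tq * se  # roughly (-0.53, 3.65)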
old <- c(6.7, 9.8, 10.1, 11.8, 10.5)
young <- c(9.4, 7.9, 7.5, 7.8, 8.5)
t.test(old,young)
Welch Two Sample t-test
data: old and young
t = 1.7198, df = 5.247, p-value = 0.1433
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.739105 3.859105
sample estimates:
mean of x mean of y
9.78 8.22
old <- c(6.7, 9.8, 10.1, 11.8, 10.5)
young <- c(9.4, 7.9, 7.5, 7.8, 8.5)
t.test(old,young,var.equal = TRUE)
Two Sample t-test
data: old and young
t = 1.7198, df = 8, p-value = 0.1238
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.5317377 3.6517377
sample estimates:
mean of x mean of y
9.78 8.22
The regression approach assumes that there is a relationship between two variables:
\( Y \) : The EAAT2 level of expression
\( X \) : The age group of the mouse (\( X=0 \) for young mice and \( X=1 \) for old mice)
This association is determined by the equation
\( Y = \beta_0 + \beta_1 X +\epsilon \)
We assume that \( \epsilon \) follows a normal distribution centered at zero.
In the regression approach:
\( \beta_0 \) : mean expression level of EAAT2 in the young mice population
\( \beta_1 \) : increase (or decrease) in mean expression level associated with being in the old population, i.e., the difference between the two population means.
The parameters can be estimated by minimizing the sum of squared residuals.
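As an illustration (a minimal sketch, not the author's code), the two coefficients can be recovered by numerically minimizing the residual sum of squares; the minimizer is approximately (8.22, 1.56), i.e., the young-group mean and the difference between group means.

old   <- c(6.7, 9.8, 10.1, 11.8, 10.5)
young <- c(9.4, 7.9, 7.5, 7.8, 8.5)
y <- c(old, young)
x <- c(rep(1, 5), rep(0, 5))              # X = 1 for old mice, 0 for young mice

rss <- function(beta) sum((y - beta[1] - beta[2] * x)^2)  # sum of squared residuals
optim(c(0, 0), rss)$par                   # approximately beta0 = 8.22, beta1 = 1.56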
Data configuration.
old <- c(6.7, 9.8, 10.1, 11.8, 10.5)
young <- c(9.4, 7.9, 7.5, 7.8, 8.5)
eaat2 <- c(old, young)
age <- c(rep('old', 5), rep('young', 5))
dr <- cbind.data.frame(eaat2 = eaat2, age = age)
dr$age <- factor(dr$age, levels = c('young', 'old'))
eaat2 | age |
---|---|
6.7 | old |
9.8 | old |
10.1 | old |
11.8 | old |
10.5 | old |
9.4 | young |
7.9 | young |
7.5 | young |
7.8 | young |
8.5 | young |
fit<-lm(eaat2~age,data=dr)
summary(fit)
Call:
lm(formula = eaat2 ~ age, data = dr)
Residuals:
Min 1Q Median 3Q Max
-3.080 -0.395 0.150 0.620 2.020
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.2200 0.6414 12.82 1.3e-06 ***
ageold 1.5600 0.9071 1.72 0.124
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.434 on 8 degrees of freedom
Multiple R-squared: 0.2699, Adjusted R-squared: 0.1787
F-statistic: 2.958 on 1 and 8 DF, p-value: 0.1238
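As a quick cross-check (not part of the original code): the estimate for ageold, 1.56, is exactly the difference between the two sample means, its p-value (0.124) matches the equal-variance t-test above, and the confidence interval for this coefficient reproduces the t-test interval.

confint(fit)["ageold", ]  # roughly (-0.53, 3.65), as in t.test(old, young, var.equal = TRUE)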
hist(fit$residuals, breaks = 'FD')  # histogram of the residuals
qqnorm(fit$residuals)               # normal Q-Q plot to check the normality assumption