Joel Correa da Rosa
February 8th 2017
The comparison of means from two dependent populations is an inferential procedure that depends on 5 unknown parameters:
- \( \mu_1 \): mean of population 1
- \( \mu_2 \): mean of population 2
- \( \sigma_1^2 \): variance of population 1
- \( \sigma_2^2 \): variance of population 2
- \( \sigma_{12} \): covariance of populations 1 and 2
The inference about two dependent population means is usually related to the following scientific question:
“Is there any difference between the means of a variable from two dependent populations?”
Typical examples of dependent samples include:

- healthy and “diseased” skin from the same patient
- repeated measures (visit #1 / visit #2)
- before-after-intervention experiments
- age- and gender-matched samples
10 subjects were given a very low calorie diet and their Body Mass Index (BMI) was measured before and after the diet.
bmi.pre | bmi.post |
---|---|
41.5 | 36.6 |
43.8 | 38.6 |
34.7 | 31.5 |
35.6 | 32.5 |
41.3 | 37.2 |
37.3 | 33.8 |
36.3 | 32.8 |
35.4 | 31.1 |
43.0 | 39.3 |
38.9 | 37.2 |
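To make the examples below reproducible, here is a minimal sketch of how this dataset could be entered in R; the data frame name owl matches the one used throughout the code in this document.

# sketch: re-creating the BMI dataset (values taken from the table above)
owl <- data.frame(
  bmi.pre  = c(41.5, 43.8, 34.7, 35.6, 41.3, 37.3, 36.3, 35.4, 43.0, 38.9),
  bmi.post = c(36.6, 38.6, 31.5, 32.5, 37.2, 33.8, 32.8, 31.1, 39.3, 37.2)
)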
Instead of two independent samples, the comparison of two dependent means is based on a sample of pairs:
\( (X_{11},X_{21}),(X_{12},X_{22}),...,(X_{1n},X_{2n}) \)
from the bivariate random variable \( (X_1,X_2) \).
The covariance is a measure of how two variables vary together. It can be estimated from a sample according to the formula:
\( S_{12}=\sum_{i=1}^n \frac{(x_{1i}-\bar{x}_1)(x_{2i}-\bar{x}_2)}{n-1} \)
If \( X_1 \) and \( X_2 \) are independent, the population covariance is zero, so we expect \( S_{12} \) to be close to zero.
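As a quick check of this formula, a sketch computing the sample covariance by hand and comparing it with R's cov() function:

# covariance computed directly from the formula above
sum((owl$bmi.pre - mean(owl$bmi.pre)) * (owl$bmi.post - mean(owl$bmi.post))) / (nrow(owl) - 1)
# same value from the built-in function (9.891333, as in the covariance matrix below)
cov(owl$bmi.pre, owl$bmi.post)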
The scatterplot is a useful tool for visualizing how two variables covary.
# scatterplot of BMI before vs. after the diet, with the least-squares line
plot(owl$bmi.pre, owl$bmi.post)
abline(lm(owl$bmi.post ~ owl$bmi.pre))
In this dataset, the covariance between pre- and post-diet BMI is 9.891. This value can be found in R by evaluating the covariance matrix:
cov(owl)
bmi.pre bmi.post
bmi.pre 11.477333 9.891333
bmi.post 9.891333 9.293778
Since the covariance is difficult to interpret, the correlation (\( \rho \)) is a dimensionless measure of association that can be estimated by:
\( r_{x_1,x_2} = \frac{S_{12}}{S_1 S_2} \)
where \( S_1 \) and \( S_2 \) are the sample standard deviations.
\( -1 \leq r_{x_1,x_2} \leq 1 \)
In general, the sample correlation coefficient for a bivariate random variable \( (X,Y) \) is obtained from:
\( r_{x,y} = \frac{\sum_{i=1}^n (x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}} \)
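A sketch verifying, on this dataset, that the correlation equals the covariance divided by the product of the standard deviations:

# correlation from covariance and standard deviations
cov(owl$bmi.pre, owl$bmi.post) / (sd(owl$bmi.pre) * sd(owl$bmi.post))
# same value from the built-in function (0.9577192, as in the matrix below)
cor(owl$bmi.pre, owl$bmi.post)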
cor(owl)
bmi.pre bmi.post
bmi.pre 1.0000000 0.9577192
bmi.post 0.9577192 1.0000000
The set of hypotheses in this setup is:
\( H_0: \mu_{D} = \mu_1 - \mu_2 = 0 \)
\( H_1: \mu_{D} = \mu_1 - \mu_2 \neq 0 \)
Note that \( D = X_1-X_2 \). The hypothesis test is applied to the difference between the variables. As a consequence, this is a one-sample test.
For testing the hypotheses, it is enough to have a sample of differences \( D_i = X_{1i}-X_{2i} \)
owl$Diff <- owl$bmi.post - owl$bmi.pre
library(knitr)   # for kable()
kable(owl)
bmi.pre | bmi.post | Diff |
---|---|---|
41.5 | 36.6 | -4.9 |
43.8 | 38.6 | -5.2 |
34.7 | 31.5 | -3.2 |
35.6 | 32.5 | -3.1 |
41.3 | 37.2 | -4.1 |
37.3 | 33.8 | -3.5 |
36.3 | 32.8 | -3.5 |
35.4 | 31.1 | -4.3 |
43.0 | 39.3 | -3.7 |
38.9 | 37.2 | -1.7 |
The variance of \( D \) incorporates the variances of \( X_1 \), \( X_2 \) and also their covariance.
\( Var(D) = Var(X_1-X_2) \)
\( Var(D) = Var(X_1)+Var(X_2)-2\times Cov(X_1,X_2) \)
# Var(D) from the individual variances and the covariance
var.diff <- var(owl$bmi.post) + var(owl$bmi.pre) - 2 * cov(owl$bmi.post, owl$bmi.pre)
var.diff
[1] 0.9884444
Alternatively, the same value is obtained directly from the sample of differences:
var.diff<-var(owl$Diff)
var.diff
[1] 0.9884444
If the variances and the covariance are known, the test statistic follows the standard normal distribution and the test is called a z-test.
\( z = \frac{\bar{D}-\mu_{D}}{\frac{\sigma_{D}}{\sqrt{n}}} \)
When the variances and covariance are unknown, the test statistic follows the Student's t distribution with \( n-1 \) degrees of freedom.
\( t_{n-1} = \frac{\bar{D}-\mu_{D}}{\frac{S_{D}}{\sqrt{n}}} \)
The standard error of the mean difference is
\( S_{\bar{D}} = \frac{S_D}{\sqrt{n}} \)
where
\( S_{D} = \sqrt{\sum_{i=1}^n \frac{(D_i-\bar{D})^2}{n-1}} \)
A confidence interval for \( \mu_D \) is then given by
\( \left(\bar{D}-t_{n-1}\frac{S_D}{\sqrt{n}},\ \bar{D}+t_{n-1}\frac{S_D}{\sqrt{n}}\right) \)
with \( t_{n-1} \) denoting the critical value of the Student's t distribution with \( n-1 \) degrees of freedom at the chosen confidence level.
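Before calling the built-in test, a sketch of the statistic and confidence interval computed by hand from the differences; note that owl$Diff was defined as post minus pre, so the signs are flipped relative to t.test(owl$bmi.pre, owl$bmi.post).

D.bar <- mean(owl$Diff)                              # -3.72 (post minus pre)
S.D   <- sd(owl$Diff)                                # standard deviation of the differences
n     <- length(owl$Diff)
D.bar / (S.D / sqrt(n))                              # t statistic: -11.832
D.bar + c(-1, 1) * qt(0.975, n - 1) * S.D / sqrt(n)  # 95% CI: (-4.43, -3.01)

The built-in t.test() with paired = TRUE reproduces these values (up to the sign convention):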
# paired t-test on the pre/post BMI measurements
t.test(owl$bmi.pre, owl$bmi.post, paired = TRUE)
Paired t-test
data: owl$bmi.pre and owl$bmi.post
t = 11.832, df = 9, p-value = 8.68e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
3.008788 4.431212
sample estimates:
mean of the differences
3.72
The paired t-test relies on the following assumptions:

- The differences \( D_i = X_{1i}-X_{2i} \) are normally distributed (an informal check is sketched below).
- The experimental units (pairs of observations) are independent of each other.
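An informal check of the normality assumption, as a sketch (not part of the original analysis), using a normal Q-Q plot and the Shapiro-Wilk test on the differences:

# Q-Q plot of the differences against the normal distribution
qqnorm(owl$Diff)
qqline(owl$Diff)
# Shapiro-Wilk test of normality
shapiro.test(owl$Diff)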
What happens if we do not take into account the dependence structure between the two samples and (incorrectly) treat them as independent?
# t-test for two independent samples
t.test(owl$bmi.pre,owl$bmi.post)
Welch Two Sample t-test
data: owl$bmi.pre and owl$bmi.post
t = 2.5811, df = 17.803, p-value = 0.01894
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.6897117 6.7502883
sample estimates:
mean of x mean of y
38.78 35.06
Note the impact on the p-value!
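A compact way to see this contrast, as a sketch, is to extract the two p-values side by side:

# p-value of the (correct) paired test vs. the (incorrect) independent-samples test
c(paired   = t.test(owl$bmi.pre, owl$bmi.post, paired = TRUE)$p.value,
  unpaired = t.test(owl$bmi.pre, owl$bmi.post)$p.value)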
The paired comparison can also be formulated as a model with one dependent variable and two nominal explanatory variables:
- \( Y \): BMI
- \( Time \): Pre / Post
- \( Subject \): 1, 2, 3, …, 10
\( Y_i = \beta_0 + \beta_1 Time + \alpha_i + \epsilon_i \)
where \( \alpha_i \) is a random effect for the i-th subject.
# reshape to long format: one row per subject per time point
Y <- c(owl$bmi.pre, owl$bmi.post)
Time <- c(rep('Pre', 10), rep('Post', 10))
Subject <- rep(1:10, 2)
df <- cbind.data.frame(Y = Y, Time = Time, Subject = Subject)
df$Time <- factor(df$Time, levels = c('Pre', 'Post'))
df$Subject <- factor(df$Subject)
Y | Time | Subject |
---|---|---|
41.5 | Pre | 1 |
43.8 | Pre | 2 |
34.7 | Pre | 3 |
35.6 | Pre | 4 |
41.3 | Pre | 5 |
37.3 | Pre | 6 |
36.3 | Pre | 7 |
35.4 | Pre | 8 |
43.0 | Pre | 9 |
38.9 | Pre | 10 |
36.6 | Post | 1 |
38.6 | Post | 2 |
31.5 | Post | 3 |
32.5 | Post | 4 |
37.2 | Post | 5 |
33.8 | Post | 6 |
32.8 | Post | 7 |
31.1 | Post | 8 |
39.3 | Post | 9 |
37.2 | Post | 10 |
require(nlme)
# mixed-effects model: fixed effect of Time, random intercept per Subject
fit <- lme(Y ~ Time, random = ~ 1 | Subject, data = df)
summary(fit)
Linear mixed-effects model fit by REML
Data: df
AIC BIC logLik
84.42936 87.99085 -38.21468
Random effects:
Formula: ~1 | Subject
(Intercept) Residual
StdDev: 3.145046 0.7030101
Fixed effects: Y ~ Time
Value Std.Error DF t-value p-value
(Intercept) 38.78 1.0190946 9 38.05339 0
TimePost -3.72 0.3143957 9 -11.83222 0
Correlation:
(Intr)
TimePost -0.154
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-1.39750155 -0.51164809 0.09163605 0.29931726 1.47585704
Number of Observations: 20
Number of Groups: 10
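As a sanity check, a sketch using nlme's accessors for the fitted model: the fixed effect of Time equals -3.72 with a t value of -11.83, matching the paired t-test (up to sign).

# fixed-effect estimates: (Intercept) 38.78, TimePost -3.72
fixef(fit)
# approximate confidence intervals for the fixed effects and variance components
intervals(fit)

The ordinary linear model below, which ignores the Subject effect, essentially reproduces the independent-samples analysis instead.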
# ordinary linear model ignoring the within-subject pairing
fit <- lm(Y ~ Time, data = df)
summary(fit)
Call:
lm(formula = Y ~ Time, data = df)
Residuals:
Min 1Q Median 3Q Max
-4.080 -2.715 -0.570 2.570 5.020
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.780 1.019 38.053 <2e-16 ***
TimePost -3.720 1.441 -2.581 0.0188 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.223 on 18 degrees of freedom
Multiple R-squared: 0.2701, Adjusted R-squared: 0.2296
F-statistic: 6.662 on 1 and 18 DF, p-value: 0.01883
require(ggplot2)
df$Subject <- factor(df$Subject)
# spaghetti plot: one line per subject connecting the Pre and Post values
ggplot(df, aes(x = Time, y = Y, colour = Subject)) +
  geom_point() +
  geom_line(aes(group = Subject))