Lecture 6: Comparing Means of Two Dependent Populations

Joel Correa da Rosa
February 8th 2017

Inference

The comparison of means from two dependent populations is an inferential procedure that depends on 5 unknown parameters:

  • \( \mu_1 \) : mean of population 1

  • \( \mu_2 \) : mean of population 2

  • \( \sigma_1^2 \) : variance of population 1

  • \( \sigma_2^2 \) : variance of population 2

  • \( \sigma_{12} \) : covariance of populations 1 and 2

Scientific Question

The inference about two dependent population means is usually related to the following scientific question:

“Is there any difference between the means of a variable from two dependent populations?”

Applications

  • healthy and “diseased” skin from the same patient

  • repeated measures (visit #1 / visit #2)

  • before/after intervention experiments

  • age- and gender-matched samples

Motivational Example

Ten subjects were given a very-low-calorie diet, and their Body Mass Index (BMI) was measured before and after the diet.

bmi.pre  bmi.post
   41.5      36.6
   43.8      38.6
   34.7      31.5
   35.6      32.5
   41.3      37.2
   37.3      33.8
   36.3      32.8
   35.4      31.1
   43.0      39.3
   38.9      37.2
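
In the code chunks that follow, this dataset is stored in a data frame named owl. A minimal sketch reconstructing it from the values above:

# reconstruct the example dataset used throughout these slides
owl <- data.frame(
  bmi.pre  = c(41.5, 43.8, 34.7, 35.6, 41.3, 37.3, 36.3, 35.4, 43.0, 38.9),
  bmi.post = c(36.6, 38.6, 31.5, 32.5, 37.2, 33.8, 32.8, 31.1, 39.3, 37.2)
)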

Paired Values

Instead of two independent samples, the comparison of two dependent means is based on a sample of pairs:

\( (X_{11},X_{21}),(X_{12},X_{22}),...,(X_{1n},X_{2n}) \)

from the bivariate random variable \( (X_1,X_2) \).

Covariance

The covariance is a measure of how two variables vary together. It can be estimated from a sample according to the formula:

\( S_{12}=\sum_{i=1}^n \frac{(x_{1i}-\bar{x}_1)(x_{2i}-\bar{x}_2)}{n-1} \)

If \( X_1 \) and \( X_2 \) are independent, the population covariance is zero, so we expect \( S_{12} \approx 0 \).
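
As a sanity check, the sample covariance can be computed directly from this formula; a minimal sketch (the result should agree with R's built-in cov()):

n <- nrow(owl)
# covariance from its definition: averaged cross-product of deviations
s12 <- sum((owl$bmi.pre - mean(owl$bmi.pre)) *
           (owl$bmi.post - mean(owl$bmi.post))) / (n - 1)
s12                       # compare with cov(owl$bmi.pre, owl$bmi.post)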

Example

The scatterplot is a tool for visualizing covariance.

plot(owl$bmi.pre,owl$bmi.post)
abline(lm(owl$bmi.post~owl$bmi.pre))

(Figure: scatterplot of bmi.post against bmi.pre with the fitted least-squares line.)

Example

In this dataset, the covariance between pre- and post-diet BMI is 9.891. This value appears in the off-diagonal entries of the covariance matrix returned by R:

cov(owl)
           bmi.pre bmi.post
bmi.pre  11.477333 9.891333
bmi.post  9.891333 9.293778

Correlation

Since the covariance is difficult to interpret directly, the correlation (\( \rho \)) is a dimensionless measure of association that can be estimated by:

\( r_{x_1,x_2} = \frac{S_{12}}{S_1 S_2} \)

where \( S_1 \) and \( S_2 \) are the sample standard deviations.

\( -1 \leq r_{x_1,x_2} \leq 1 \)

Pearson's product-moment sample correlation coefficient

The sample correlation coefficient for a bivariate random variable \( (X,Y) \) is obtained from:

\( r_{x,y} = \frac{\sum_{i=1}^n (x_{i}-\bar{x})(y_{i}-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2 \sum_{i=1}^n(y_i-\bar{y})^2}} \)
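
Equivalently, \( r \) can be obtained by rescaling the sample covariance by the two standard deviations, as defined two slides back; a quick check against cor():

# correlation = covariance standardized by the two standard deviations
r <- cov(owl$bmi.pre, owl$bmi.post) / (sd(owl$bmi.pre) * sd(owl$bmi.post))
r                         # compare with cor(owl$bmi.pre, owl$bmi.post)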

Example

cor(owl)
           bmi.pre  bmi.post
bmi.pre  1.0000000 0.9577192
bmi.post 0.9577192 1.0000000

Hypothesis Testing

The set of hypotheses in this setup is:

\( H_0: \mu_{D} = \mu_1 - \mu_2 = 0 \)

\( H_1: \mu_{D} = \mu_1 - \mu_2 \neq 0 \)

Note that \( D = X_1-X_2 \). The hypothesis test is applied to the difference between the variables. As a consequence, this is a one-sample test.

Difference BMI (post - pre)

For testing the hypotheses, it is enough to have a sample of differences \( D_i = X_{1i}-X_{2i} \) (computed below as post \(-\) pre).

owl$Diff<-owl$bmi.post-owl$bmi.pre
kable(owl)
bmi.pre  bmi.post  Diff
   41.5      36.6  -4.9
   43.8      38.6  -5.2
   34.7      31.5  -3.2
   35.6      32.5  -3.1
   41.3      37.2  -4.1
   37.3      33.8  -3.5
   36.3      32.8  -3.5
   35.4      31.1  -4.3
   43.0      39.3  -3.7
   38.9      37.2  -1.7

Variance

The variance of \( D \) incorporates the variances of \( X_1 \), \( X_2 \) and also their covariance.

\( Var(D) = Var(X_1-X_2) \)

\( Var(D) = Var(X_1)+Var(X_2)-2\times Cov(X_1,X_2) \)

Application

# Var(D) from the variance-covariance decomposition
var.diff <- var(owl$bmi.post) + var(owl$bmi.pre) - 2*cov(owl$bmi.post, owl$bmi.pre)
var.diff
[1] 0.9884444

Alternatively, compute the variance of the differences directly:

var.diff<-var(owl$Diff)
var.diff
[1] 0.9884444

Test statistic

If the variances and covariance are known, the test statistic follows the standard normal distribution and the test is called a z-test.

\( z = \frac{\bar{D}-\mu_{D}}{\frac{\sigma_{D}}{\sqrt{n}}} \)

Test statistic

When the variances and covariance are unknown, the test statistic follows Student's t distribution with \( n-1 \) degrees of freedom.

\( t_{n-1} = \frac{\bar{D}-\mu_{D}}{\frac{S_{D}}{\sqrt{n}}} \)

Standard Error of the Mean Difference

\( S_{\bar{D}} = \frac{S_D}{\sqrt{n}} \)

where

\( S_{D} = \sqrt{\sum_{i=1}^n \frac{(D_i-\bar{D})^2}{n-1}} \)

The Confidence Interval

\( \left(\bar{D}-t_{n-1,\,1-\alpha/2}\frac{S_D}{\sqrt{n}},\; \bar{D}+t_{n-1,\,1-\alpha/2}\frac{S_D}{\sqrt{n}}\right) \)

where \( t_{n-1,\,1-\alpha/2} \) is the critical value of the t distribution with \( n-1 \) degrees of freedom.
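
Before calling t.test(), the statistic and the 95% confidence interval can be assembled by hand from these formulas. A minimal sketch, where qt() supplies the critical value \( t_{n-1,\,0.975} \) (signs are flipped relative to the slide's t.test(pre, post) call because Diff was computed as post \(-\) pre):

n      <- length(owl$Diff)
d.bar  <- mean(owl$Diff)               # mean difference
se     <- sd(owl$Diff) / sqrt(n)       # standard error of the mean difference
t.stat <- d.bar / se                   # test statistic under H0: mu_D = 0
t.crit <- qt(0.975, df = n - 1)        # critical value for a 95% CI
c(lower = d.bar - t.crit * se, upper = d.bar + t.crit * se)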

Using R for Hypothesis Testing and Confidence Interval

t.test(owl$bmi.pre,owl$bmi.post,paired=T)

    Paired t-test

data:  owl$bmi.pre and owl$bmi.post
t = 11.832, df = 9, p-value = 8.68e-07
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 3.008788 4.431212
sample estimates:
mean of the differences 
                   3.72 

Assumptions for z-test and t-test

  1. The differences \( D_i = X_{1i} - X_{2i} \) are normally distributed (see the check after this list).

  2. The experimental units (the pairs) are independent of one another.
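
One common way to check assumption 1 is a normal Q-Q plot of the differences together with the Shapiro-Wilk test; a minimal sketch using base R:

qqnorm(owl$Diff); qqline(owl$Diff)     # points close to the line suggest normality
shapiro.test(owl$Diff)                 # a small p-value would flag non-normality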

What-if

What happens if we do not take into account the dependence structure between the two samples?

# t-test for two independent samples
t.test(owl$bmi.pre,owl$bmi.post)

    Welch Two Sample t-test

data:  owl$bmi.pre and owl$bmi.post
t = 2.5811, df = 17.803, p-value = 0.01894
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.6897117 6.7502883
sample estimates:
mean of x mean of y 
    38.78     35.06 

Note the impact on the p-value!
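
For comparison, the pooled-variance version of the independent-samples test can be requested with var.equal = TRUE; its p-value should match the lm() fit without random effects shown at the end of these slides:

# independent two-sample t-test with pooled variance (still ignores the pairing)
t.test(owl$bmi.pre, owl$bmi.post, var.equal = TRUE)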

A Regression View on Paired T-test

The paired t-test can be recast as a regression with one dependent variable and two nominal explanatory variables.

\( Y \) : BMI

\( Time \): Pre / Post

\( Subject \) : 1,2,3,…10

\( Y_{i} = \beta_0 + \beta_1 Time + \alpha_i + \epsilon_{i} \)

where \( \alpha_i \) is a random effect for the \( i \)-th subject.

Example in R

# stack the pre and post measurements in long format: one row per subject-time pair
Y <- c(owl$bmi.pre, owl$bmi.post)
Time <- c(rep('Pre',10), rep('Post',10))
Subject <- rep(1:10, 2)
df <- cbind.data.frame(Y = Y, Time = Time, Subject = Subject)
df$Time <- factor(df$Time, levels = c('Pre','Post'))   # 'Pre' as the reference level
df$Subject <- factor(df$Subject)

Example in R

Y      Time   Subject
41.5   Pre    1
43.8   Pre    2
34.7   Pre    3
35.6   Pre    4
41.3   Pre    5
37.3   Pre    6
36.3   Pre    7
35.4   Pre    8
43.0   Pre    9
38.9   Pre    10
36.6   Post   1
38.6   Post   2
31.5   Post   3
32.5   Post   4
37.2   Post   5
33.8   Post   6
32.8   Post   7
31.1   Post   8
39.3   Post   9
37.2   Post   10

Example in R

require(nlme)
fit<-lme(Y~Time,random=~1|Subject,data=df)
summary(fit)
Linear mixed-effects model fit by REML
 Data: df 
       AIC      BIC    logLik
  84.42936 87.99085 -38.21468

Random effects:
 Formula: ~1 | Subject
        (Intercept)  Residual
StdDev:    3.145046 0.7030101

Fixed effects: Y ~ Time 
            Value Std.Error DF   t-value p-value
(Intercept) 38.78 1.0190946  9  38.05339       0
TimePost    -3.72 0.3143957  9 -11.83222       0
 Correlation: 
         (Intr)
TimePost -0.154

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max 
-1.39750155 -0.51164809  0.09163605  0.29931726  1.47585704 

Number of Observations: 20
Number of Groups: 10 

Note that the t-value for TimePost (-11.83) equals the paired t-test statistic (up to sign): the mixed-effects model reproduces the paired analysis.

Example in R (Without Random Effects)

fit<-lm(Y~Time,data=df)
summary(fit)

Call:
lm(formula = Y ~ Time, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-4.080 -2.715 -0.570  2.570  5.020 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   38.780      1.019  38.053   <2e-16 ***
TimePost      -3.720      1.441  -2.581   0.0188 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.223 on 18 degrees of freedom
Multiple R-squared:  0.2701,    Adjusted R-squared:  0.2296 
F-statistic: 6.662 on 1 and 18 DF,  p-value: 0.01883

Without the subject random effect, this fit is equivalent to an independent two-sample t-test with pooled variance; the pairing information is lost and the p-value increases accordingly.

Visualization of Random Effects

require(ggplot2)
df$Subject<-factor(df$Subject)
ggplot(df,aes(x=Time,y=Y,colour=Subject))+
geom_point()+
geom_line(aes(group=Subject))

(Figure: spaghetti plot of BMI against Time, with lines connecting each subject's Pre and Post values.)