Two Sample t test


The Contrived Scenario

In September 1990, each student in a random sample of 200 biology majors at a large university was asked how many lab classes he or she was enrolled in. The sample results are shown below:

bio1990 <- c(rep(0,28), rep(1,62), rep(2,58), rep(3,28), rep(4,16), rep(5,8))
table(bio1990)

bio1990
 0  1  2  3  4  5 
28 62 58 28 16  8

... there's more

To determine whether the distribution has changed over the past 10 years, a similar survey was conducted in September 2000 by selecting a random sample of 200 biology majors. Results from the year 2000 sample are shown below:

bio2000 <- c(rep(0,20), rep(1,72), rep(2,60), rep(3,10), rep(4,26), rep(5,12))
table(bio2000)

bio2000
 0  1  2  3  4  5 
20 72 60 10 26 12

Histograms

[Figure: histograms of the 1990 and 2000 samples]
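The plotting code for this slide did not survive, so the following is only a sketch of one way such histograms could be drawn. The bin edges, titles, axis labels, and shared y-axis limit are assumptions, not the original settings.

```r
# Survey data, repeated here so the block runs on its own
bio1990 <- c(rep(0,28), rep(1,62), rep(2,58), rep(3,28), rep(4,16), rep(5,8))
bio2000 <- c(rep(0,20), rep(1,72), rep(2,60), rep(3,10), rep(4,26), rep(5,12))

# Assumed bin edges: one bin centred on each whole number of lab classes
breaks <- seq(-0.5, 5.5, by = 1)

# Draw the two histograms side by side with a common y-axis
op <- par(mfrow = c(1, 2))
h1990 <- hist(bio1990, breaks = breaks, ylim = c(0, 80),
              main = "1990", xlab = "Lab classes")
h2000 <- hist(bio2000, breaks = breaks, ylim = c(0, 80),
              main = "2000", xlab = "Lab classes")
par(op)  # restore the previous graphics settings
```

Centring the bins on the integers keeps each bar aligned with one row of the frequency tables.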

Calculating Sample Means and Sample SDs

xhat1 <- mean(bio1990)
xhat2 <- mean(bio2000)
s1 <- sd(bio1990)
s2 <- sd(bio2000)
n1 <- length(bio1990)
n2 <- length(bio2000)
c(n1, xhat1, s1)

[1] 200.000   1.830   1.292

c(n2, xhat2, s2)

[1] 200.000   1.930   1.369

The t Statistic

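\[ \begin{equation} \label{E:two sample t statistic} t = \frac{\bar{X}_1-\bar{X}_2} {\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2} }} \end{equation} \]

The denominator is the standard error of the difference between the sample means:

\[ s_{\bar{X}_1-\bar{X}_2}=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2} } \]

Note that the R code computes the difference as 2000 minus 1990 (xhat2 - xhat1). Swapping the order only flips the sign of t; the two-sided p-value is unchanged.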

R code for the t statistic

diff_in_means <- xhat2-xhat1
SE_diff_mean <- sqrt(s1^2/n1+s2^2/n2)
t_stat <- diff_in_means/SE_diff_mean
t_stat

[1] 0.7512

2*pt(-abs(t_stat), df=398) # two-sided p-value; abs() keeps it valid if t_stat is negative

[1] 0.453

R code for CI for difference between means

diff_in_means+qt(c(.025,.975),df=398)*
  SE_diff_mean # 95%

[1] -0.1617  0.3617


diff_in_means+qt(c(.1,.9),df=398)*
  SE_diff_mean # 80%

[1] -0.07088  0.27088
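Both intervals above follow the same pattern: estimate plus or minus a t quantile times the standard error. As a sketch, the pattern can be wrapped in a small function. The name `diff_ci` and its arguments are hypothetical helpers introduced here for illustration; they are not part of the original code or of base R.

```r
# Survey data, repeated here so the block runs on its own
bio1990 <- c(rep(0,28), rep(1,62), rep(2,58), rep(3,28), rep(4,16), rep(5,8))
bio2000 <- c(rep(0,20), rep(1,72), rep(2,60), rep(3,10), rep(4,26), rep(5,12))

# Hypothetical helper: confidence interval for mean(x) - mean(y).
# The default df is n1 + n2 - 2, matching the df = 398 used above.
diff_ci <- function(x, y, level = 0.95, df = length(x) + length(y) - 2) {
  est   <- mean(x) - mean(y)
  se    <- sqrt(var(x) / length(x) + var(y) / length(y))
  alpha <- 1 - level
  est + qt(c(alpha / 2, 1 - alpha / 2), df = df) * se
}

ci95 <- diff_ci(bio2000, bio1990, level = 0.95)
ci80 <- diff_ci(bio2000, bio1990, level = 0.80)
ci95
ci80
```

A lower confidence level uses a smaller t quantile, which is why the 80% interval is narrower than the 95% one.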

And Here's the Simple R Code

t.test(bio2000, bio1990, var.equal=TRUE)


    Two Sample t-test

data:  bio2000 and bio1990
t = 0.7512, df = 398, p-value = 0.453
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1617  0.3617
sample estimates:
mean of x mean of y 
     1.93      1.83
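The hand calculation earlier used the unpooled standard error, yet it matches the pooled (var.equal=TRUE) result exactly. The sketch below suggests why: when the two sample sizes are equal, the pooled and unpooled standard errors coincide. The names `sp2`, `se_pooled`, and `se_unpooled` are introduced here for illustration only.

```r
# Survey data, repeated here so the block runs on its own
bio1990 <- c(rep(0,28), rep(1,62), rep(2,58), rep(3,28), rep(4,16), rep(5,8))
bio2000 <- c(rep(0,20), rep(1,72), rep(2,60), rep(3,10), rep(4,26), rep(5,12))
n1 <- length(bio1990)
n2 <- length(bio2000)

# Pooled variance: the two sample variances weighted by their degrees of freedom
sp2 <- ((n1 - 1) * var(bio1990) + (n2 - 1) * var(bio2000)) / (n1 + n2 - 2)

# Standard error of the difference under each approach
se_pooled   <- sqrt(sp2 * (1 / n1 + 1 / n2))
se_unpooled <- sqrt(var(bio1990) / n1 + var(bio2000) / n2)

# With n1 == n2 the two agree up to floating-point error
all.equal(se_pooled, se_unpooled)

# The pooled t statistic computed by hand
t_pooled <- (mean(bio2000) - mean(bio1990)) / se_pooled
t_pooled
```

With unequal sample sizes the two standard errors would generally differ, and so would the two t statistics.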

And Similarly

t.test(bio2000, bio1990, var.equal=FALSE)


    Welch Two Sample t-test

data:  bio2000 and bio1990
t = 0.7512, df = 396.7, p-value = 0.453
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.1617  0.3617
sample estimates:
mean of x mean of y 
     1.93      1.83
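The printed summary is convenient to read, but the individual numbers can also be pulled out of the object that t.test() returns. A minimal sketch, where the variable name `welch` is just a label chosen here:

```r
# Survey data, repeated here so the block runs on its own
bio1990 <- c(rep(0,28), rep(1,62), rep(2,58), rep(3,28), rep(4,16), rep(5,8))
bio2000 <- c(rep(0,20), rep(1,72), rep(2,60), rep(3,10), rep(4,26), rep(5,12))

# Store the result instead of printing it
welch <- t.test(bio2000, bio1990, var.equal = FALSE)

welch$statistic  # t statistic
welch$parameter  # Welch degrees of freedom
welch$p.value    # two-sided p-value
welch$conf.int   # 95% confidence interval for the difference in means
welch$estimate   # the two sample means
```

The same components are available on the result of the var.equal=TRUE call.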

Degrees of Freedom Craziness

When Variances are Equal:

\[ df = n_1 + n_2 - 2 \]

When Variances are Unequal:

\[ df = \frac{ \left(\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}\right)^2 } {\frac{\left(\frac{s_1^2}{n_1}\right)^2}{n_1-1}+\frac{\left(\frac{s_2^2}{n_2}\right)^2}{n_2-1}} \]

(which is always at least the smaller of n1 - 1 and n2 - 1, and at most n1 + n2 - 2)
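As a check on the unequal-variance formula, the sketch below evaluates it by hand for the two samples and compares it with the df = 396.7 reported by the Welch test. The names `v1`, `v2`, and `df_welch` are introduced here for illustration.

```r
# Survey data, repeated here so the block runs on its own
bio1990 <- c(rep(0,28), rep(1,62), rep(2,58), rep(3,28), rep(4,16), rep(5,8))
bio2000 <- c(rep(0,20), rep(1,72), rep(2,60), rep(3,10), rep(4,26), rep(5,12))
n1 <- length(bio1990)
n2 <- length(bio2000)

# Per-sample variance of the mean: s^2 / n
v1 <- var(bio1990) / n1
v2 <- var(bio2000) / n2

# Degrees of freedom from the unequal-variance formula
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
df_welch

# Equal-variance degrees of freedom, for comparison
df_pooled <- n1 + n2 - 2
df_pooled
```

Here the two sample variances are close and the sample sizes are equal, so the Welch value falls only slightly below the pooled 398.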