In September 1990, each student in a random sample of 200 biology majors at a large university was asked how many lab classes he or she was enrolled in. The sample results are shown below:
bio1990 <- c(rep(0,28), rep(1,62), rep(2,58), rep(3,28), rep(4,16), rep(5,8))
table(bio1990)
bio1990
0 1 2 3 4 5
28 62 58 28 16 8
To determine whether the distribution has changed over the past 10 years, a similar survey was conducted in September 2000 by selecting a random sample of 200 biology majors. Results from the year 2000 sample are shown below:
bio2000 <- c(rep(0,20), rep(1,72), rep(2,60), rep(3,10), rep(4,26), rep(5,12))
table(bio2000)
bio2000
0 1 2 3 4 5
20 72 60 10 26 12
xhat1 <- mean(bio1990)
xhat2 <- mean(bio2000)
s1 <- sd(bio1990)
s2 <- sd(bio2000)
n1 <- length(bio1990)
n2 <- length(bio2000)
c(n1, xhat1, s1)
[1] 200.000 1.830 1.292
c(n2, xhat2, s2)
[1] 200.000 1.930 1.369
The two-sample t statistic is \[ \begin{equation} \label{E:two sample t statistic} t = \frac{\bar{X}_1-\bar{X}_2} {\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2} }} \end{equation} \] The denominator is the standard error of the difference between the means: \[ s_{\bar{X}_1-\bar{X}_2}=\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2} } \] The sign of t depends only on the order of subtraction; the code below computes the 2000 mean minus the 1990 mean.
diff_in_means <- xhat2-xhat1
SE_diff_mean <- sqrt(s1^2/n1+s2^2/n2)
t_stat <- diff_in_means/SE_diff_mean
t_stat
[1] 0.7512
2*pt(-abs(t_stat), df=398) # two-sided p-value (valid for either sign of t_stat)
[1] 0.453
diff_in_means+qt(c(.025,.975),df=398)*SE_diff_mean # 95% confidence interval
[1] -0.1617 0.3617
diff_in_means+qt(c(.1,.9),df=398)*SE_diff_mean # 80% confidence interval
[1] -0.07088 0.27088
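The two intervals above differ only in the confidence level, so the calculation can be wrapped in a small helper. The following is a minimal sketch: the function name `ci_diff` and its arguments are hypothetical and not part of the original analysis, and it defaults to the same df = n1 + n2 - 2 used above.

```r
# Rebuild the two samples so the block runs on its own
bio1990 <- c(rep(0,28), rep(1,62), rep(2,58), rep(3,28), rep(4,16), rep(5,8))
bio2000 <- c(rep(0,20), rep(1,72), rep(2,60), rep(3,10), rep(4,26), rep(5,12))

# Hypothetical helper: confidence interval for mean(x) - mean(y),
# using the unpooled standard error and a caller-supplied df
ci_diff <- function(x, y, level = 0.95, df = length(x) + length(y) - 2) {
  est <- mean(x) - mean(y)                          # point estimate
  se <- sqrt(var(x)/length(x) + var(y)/length(y))   # SE of the difference
  alpha <- 1 - level
  est + qt(c(alpha/2, 1 - alpha/2), df = df) * se   # lower and upper limits
}

ci_diff(bio2000, bio1990, level = 0.95)
ci_diff(bio2000, bio1990, level = 0.80)
```

Called this way, the helper should reproduce the 95% and 80% intervals computed by hand above.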
t.test(bio2000, bio1990, var.equal=TRUE)
Two Sample t-test
data: bio2000 and bio1990
t = 0.7512, df = 398, p-value = 0.453
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1617 0.3617
sample estimates:
mean of x mean of y
1.93 1.83
t.test(bio2000, bio1990, var.equal=FALSE)
Welch Two Sample t-test
data: bio2000 and bio1990
t = 0.7512, df = 396.7, p-value = 0.453
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.1617 0.3617
sample estimates:
mean of x mean of y
1.93 1.83
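Both calls report the same t statistic because the two samples have the same size: when n1 = n2, the pooled standard error used with var.equal=TRUE coincides with the unpooled one, and only the degrees of freedom differ. A minimal sketch that checks this numerically (the names `sp2`, `se_pooled` and `se_unpooled` are illustrative, not from the original):

```r
# Rebuild the two samples so the block runs on its own
bio1990 <- c(rep(0,28), rep(1,62), rep(2,58), rep(3,28), rep(4,16), rep(5,8))
bio2000 <- c(rep(0,20), rep(1,72), rep(2,60), rep(3,10), rep(4,26), rep(5,12))
n1 <- length(bio1990); n2 <- length(bio2000)
s1 <- sd(bio1990);     s2 <- sd(bio2000)

# Pooled variance: average of the sample variances, weighted by n - 1
sp2 <- ((n1 - 1)*s1^2 + (n2 - 1)*s2^2) / (n1 + n2 - 2)

se_pooled   <- sqrt(sp2 * (1/n1 + 1/n2))   # used when var.equal=TRUE
se_unpooled <- sqrt(s1^2/n1 + s2^2/n2)     # used when var.equal=FALSE

# With equal sample sizes the two standard errors agree
all.equal(se_pooled, se_unpooled)
```

With unequal sample sizes the two standard errors would generally differ, and so would the two t statistics.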
When Variances are Equal:
\[ df = n_1 + n_2 - 2 \]
When Variances are Unequal:
\[ df = \frac{ (\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2})^2 } {\frac{(\frac{s_1^2}{n_1})^2}{n_1-1}+\frac{(\frac{s_2^2}{n_2})^2}{n_2-1}} \] (which is always at least as large as the smaller of \(n_1-1\) and \(n_2-1\), and never larger than \(n_1+n_2-2\))
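The unequal-variance degrees of freedom can be computed directly from the formula. The sketch below does so for the two samples; the names `v1`, `v2` and `df_welch` are illustrative. The result should match the df reported by t.test with var.equal=FALSE and lie between min(n1, n2) - 1 and n1 + n2 - 2.

```r
# Rebuild the two samples so the block runs on its own
bio1990 <- c(rep(0,28), rep(1,62), rep(2,58), rep(3,28), rep(4,16), rep(5,8))
bio2000 <- c(rep(0,20), rep(1,72), rep(2,60), rep(3,10), rep(4,26), rep(5,12))
n1 <- length(bio1990); n2 <- length(bio2000)

# Per-sample variance of the mean, s^2 / n
v1 <- var(bio1990)/n1
v2 <- var(bio2000)/n2

# Degrees of freedom for unequal variances, term by term as in the formula
df_welch <- (v1 + v2)^2 / (v1^2/(n1 - 1) + v2^2/(n2 - 1))
df_welch

# Lower and upper bounds on the degrees of freedom
c(min(n1, n2) - 1, n1 + n2 - 2)
```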