The set of all hypothetical values that we cannot rule out at the \(\alpha = 0.05\) level of significance is known as the 95% confidence interval. If we are interested in the hypothetical values of a difference between sample means \(\bar{x}\) and \(\bar{y}\) that we cannot rule out, this interval is calculated as follows:
\[ (\bar{x} - \bar{y}) \pm \tau \times \text{se} \]
where \(\text{se}\) is the standard error of the difference between the sample means. In other words, the confidence interval is a set of values centered on the difference in sample means that extends \(\tau\) times the standard error of the difference in each direction. The value of \(\tau\) (Greek tau) that we need for this calculation depends on two things: the confidence level (here 95%) and the number of degrees of freedom of the t-distribution.
The value of \(\tau\) is crucial for calculating the confidence interval, so I will explain it in what follows. Consider a t-distribution with 1,000 degrees of freedom. The shape of a t-distribution is entirely determined by its number of degrees of freedom. The mean is fixed at 0 and, with this many degrees of freedom, the standard deviation is very close to 1. So if we know the degrees of freedom, we can work out that the interval between \(\tau=-1\) and \(\tau=1\) covers about 68% of the area under the curve.
We know this because the area below \(\tau=1\) is
pt(1, df = 1000)
[1] 0.8412238
and the area below \(\tau = -1\) is
pt(-1, df = 1000)
[1] 0.1587762
and so the area in between \(\tau=1\) and \(\tau=-1\) is
pt(1, df = 1000) - pt(-1, df = 1000)
[1] 0.6824476
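Because the t-distribution is symmetric around 0, the same area can also be obtained from a single tail. The following is a minimal sketch of that shortcut; `central_area` is a hypothetical helper name introduced here for illustration, not a base R function.

```r
# Area under the t-distribution between -a and a.
# By symmetry both tails hold the same mass, so the central
# area is 1 minus twice the mass in the lower tail.
central_area <- function(a, dof) {
  1 - 2 * pt(-abs(a), df = dof)
}

central_area(1, dof = 1000)            # same quantity as the subtraction above
pt(1, df = 1000) - pt(-1, df = 1000)   # for comparison
```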
The pt function gives us the area under the curve below a provided value of \(\tau\). If we instead want to determine the \(\tau\) value that corresponds to a particular area under the curve, we need the qt function. Like pt, qt makes use of the fact that the shape of the t-distribution is entirely determined by the number of degrees of freedom. Above we determined that the area under the curve below \(\tau=1\) is 0.84, or 84%. We would therefore expect to obtain 1 if we pass this area (0.8412238 before rounding) to qt, and indeed:
qt(0.8412238, df = 1000)
[1] 1
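Put differently, pt and qt are inverses of each other for a given number of degrees of freedom. A short sketch of the round trip, in which the variable names `dof` and `area` are chosen here for illustration:

```r
dof <- 1000

# From a tau value to an area, and back again
area <- pt(1, df = dof)   # area below tau = 1
qt(area, df = dof)        # recovers tau = 1, up to floating-point error

# From an area to a tau value, and back again
pt(qt(0.84, df = dof), df = dof)   # recovers the area 0.84
```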
Here is an illustration of this area corresponding to \(\tau=1\)
The area that we are actually interested in, however, is the interval centered on 0, which is why we subtracted the area below \(\tau=-1\) above.
Now, if our goal is to determine the \(\tau\) values that correspond to the lower and upper bounds of the 95% confidence interval, we are interested in the range illustrated in this figure:
To determine this range we cannot simply run qt(0.95, df = 1000). This function call would indeed return a \(\tau\) value below which 95% of the area under the curve lies, but that area would include the entire lower tail and would not be centered around 0. This is what it would look like:
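In numbers, the problem with the one-sided call can be checked directly. This is only a sketch; `dof` and `upper_only` are names chosen here for illustration.

```r
dof <- 1000
upper_only <- qt(0.95, df = dof)   # cuts off 5% in the upper tail only

# Area below the cut-off: 0.95, but it includes the whole lower tail
pt(upper_only, df = dof)

# Area between -upper_only and upper_only, i.e. centered on 0: only 0.90
pt(upper_only, df = dof) - pt(-upper_only, df = dof)
```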
To get the correct area we need to remember that the confidence interval has a lower and an upper bound and is centered around the quantity of interest (e.g. the difference in sample means above). If the area that we are interested in is 95% and the total area under the curve is 100%, then the remaining 5% is split equally between the two tails, so the mass in each tail is
(1 - .95) / 2
[1] 0.025
This is the mass in the lower tail, so the lower bound is the \(\tau\) value with 2.5% of the area below it. The upper tail holds the same mass, so the area below the upper bound is the total area under the curve minus 0.025:
1 - 0.025
[1] 0.975
These two values can then be used to obtain the \(\tau\) values that enclose the central 95% of the area:
qt(0.025, df = 1000)
[1] -1.962339
That is the lower bound. For the upper bound:
qt(0.975, df = 1000)
[1] 1.962339
Because the t-distribution is symmetric around 0, the \(\tau\) value for the lower bound is always the negative of the \(\tau\) value for the upper bound.
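Putting the pieces together, the interval \((\bar{x} - \bar{y}) \pm \tau \times \text{se}\) can be sketched as a small function. `t_interval` and its argument names are hypothetical, and the difference in means and the standard error passed to it below are made-up numbers used purely for illustration; in practice both would be computed from the data.

```r
# Sketch: confidence interval for a difference in sample means,
# assuming the difference and its standard error are already known.
t_interval <- function(diff_means, se, dof, level = 0.95) {
  tail_mass <- (1 - level) / 2         # mass in each tail, 0.025 for 95%
  tau <- qt(1 - tail_mass, df = dof)   # upper tau value; the lower one is -tau
  c(lower = diff_means - tau * se,
    upper = diff_means + tau * se)
}

# Illustrative values only (not taken from any real data set)
t_interval(diff_means = 2.5, se = 0.8, dof = 1000)
```

Only the upper \(\tau\) value is computed here; symmetry means the lower bound can reuse it with the sign flipped.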