Confidence intervals

In the previous, we discussed creating a confidence interval using the CLT
In this lecture, we discuss some methods for small samples, notably Gosset's \( t \) distribution
To discuss the \( t \) distribution we must discuss the Chi-squared distribution
Throughout we use the following general procedure for creating CIs

a. Create a Pivot or statistic that does not depend on the parameter of interest

b. Solve the probability that the pivot lies between bounds for the parameter

The Chi-squared distribution

Suppose that \( S^2 \) is the sample variance from a collection of iid \( N(\mu,\sigma^2) \) data; then \[ \frac{(n - 1) S^2}{\sigma^2} \sim \chi^2_{n-1} \] which reads: follows a Chi-squared distribution with \( n-1 \) degrees of freedom
The Chi-squared distribution is skewed and has support on \( 0 \) to \( \infty \)
The mean of the Chi-squared is its degrees of freedom
The variance of the Chi-squared distribution is twice the degrees of freedom

Confidence interval for the variance

Note that if \( \chi^2_{n-1, \alpha} \) is the \( \alpha \) quantile of the Chi-squared distribution then

\[ \begin{eqnarray*} 1 - \alpha & = & P \left( \chi^2_{n-1, \alpha/2} \leq \frac{(n - 1) S^2}{\sigma^2} \leq \chi^2_{n-1,1 - \alpha/2} \right) \\ \\ & = & P\left(\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}} \leq \sigma^2 \leq \frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}} \right) \\ \end{eqnarray*} \] So that \[ \left[\frac{(n-1)S^2}{\chi^2_{n-1,1-\alpha/2}}, \frac{(n-1)S^2}{\chi^2_{n-1,\alpha/2}}\right] \] is a \( 100(1-\alpha)\% \) confidence interval for \( \sigma^2 \)

Notes about this interval

This interval relies heavily on the assumed normality
Square-rooting the endpoints yields a CI for \( \sigma \)

Example

Confidence interval for the standard deviation of sons' heights from Galton's data

library(UsingR)

## Loading required package: MASS

data(father.son)
x <- father.son$sheight
s <- sd(x)
n <- length(x)
round(sqrt((n - 1) * s^2/qchisq(c(0.975, 0.025), n - 1)), 3)

## [1] 2.701 2.939

Gosset's \( t \) distribution

Invented by William Gosset (under the pseudonym “Student”) in 1908
Has thicker tails than the normal
Is indexed by a degrees of freedom; gets more like a standard normal as df gets larger
Is obtained as \[ \frac{Z}{\sqrt{\frac{\chi^2}{df}}} \] where \( Z \) and \( \chi^2 \) are independent standard normals and Chi-squared distributions respectively

Result

Suppose that \( (X_1,\ldots,X_n) \) are iid \( N(\mu,\sigma^2) \), then: a. \( \frac{\bar X - \mu}{\sigma / \sqrt{n}} \) is standard normal b. \( \sqrt{\frac{(n - 1) S^2}{\sigma^2 (n - 1)}} = S / \sigma \) is the square root of a Chi-squared divided by its df
Therefore \[ \frac{\frac{\bar X - \mu}{\sigma /\sqrt{n}}}{S/\sigma} = \frac{\bar X - \mu}{S/\sqrt{n}} \] follows Gosset's \( t \) distribution with \( n-1 \) degrees of freedom

Confidence intervals for the mean

Notice that the \( t \) statistic is a pivot, therefore we use it to create a confidence interval for \( \mu \)
Let \( t_{df,\alpha} \) be the \( \alpha^{th} \) quantile of the t distribution with \( df \) degrees of freedom \[ \begin{eqnarray*} & & 1 - \alpha \\ & = & P\left(-t_{n-1,1-\alpha/2} \leq \frac{\bar X - \mu}{S/\sqrt{n}} \leq t_{n-1,1-\alpha/2}\right) \\ \\ & = & P\left(\bar X - t_{n-1,1-\alpha/2} S / \sqrt{n} \leq \mu \leq \bar X + t_{n-1,1-\alpha/2}S /\sqrt{n}\right) \end{eqnarray*} \]
Interval is \( \bar X \pm t_{n-1,1-\alpha/2} S/\sqrt{n} \)

Note's about the \( t \) interval

The \( t \) interval technically assumes that the data are iid normal, though it is robust to this assumption
It works well whenever the distribution of the data is roughly symmetric and mound shaped
Paired observations are often analyzed using the \( t \) interval by taking differences
For large degrees of freedom, \( t \) quantiles become the same as standard normal quantiles; therefore this interval converges to the same interval as the CLT yielded
For skewed distributions, the spirit of the \( t \) interval assumptions are violated
Also, for skewed distributions, it doesn't make a lot of sense to center the interval at the mean
In this case, consider taking logs or using a different summary like the median
For highly discrete data, like binary, other intervals are available

Sleep data

In R typing data(sleep) brings up the sleep data originally analyzed in Gosset's Biometrika paper, which shows the increase in hours for 10 patients on two soporific drugs. R treats the data as two groups rather than paired.

The data

data(sleep)
head(sleep)

##   extra group ID
## 1   0.7     1  1
## 2  -1.6     1  2
## 3  -0.2     1  3
## 4  -1.2     1  4
## 5  -0.1     1  5
## 6   3.4     1  6

g1 <- sleep$extra[1:10]
g2 <- sleep$extra[11:20]
difference <- g2 - g1
mn <- mean(difference)
s <- sd(difference)
n <- 10
mn + c(-1, 1) * qt(0.975, n - 1) * s/sqrt(n)

## [1] 0.7001 2.4599

t.test(difference)$conf.int

## [1] 0.7001 2.4599
## attr(,"conf.level")
## [1] 0.95