In the previous lesson, we considered how a parameter can be estimated from sample data. However, it is also important to understand how good the resulting estimate is. For example, suppose that we want to estimate \(\mu\), the mean cost of a car repair. We take a sample of \(100\) such car repairs and find that \(\bar x = 250\). Because of sampling variability, it is almost never the case that \(\bar x\) will equal \(\mu\). Is the mean repair cost likely to be between \(225\) and \(275\)? Or is it likely to be between \(240\) and \(260\)? Bounds that represent an interval of plausible values for a parameter are an example of a confidence interval.
We cannot be certain that the interval contains the true, unknown population parameter - we only use a sample from the population to compute the point estimate and the interval. However, the confidence interval is constructed so that we have high confidence that it does contain the unknown population parameter.
The basic ideas of a confidence interval (CI) are most easily understood by initially considering a simple situation. Suppose that we have a population with unknown mean \(\mu\) and known variance \(\sigma^2\). This is a somewhat unrealistic scenario because we typically do not know \(\sigma^2\).
It is important to note that we consider this simple unrealistic scenario only to derive the basic ideas. We will NOT be using the formulas in this section for anything other than understanding the concepts of confidence intervals.
Suppose that \(X_1,X_2,\ldots,X_n\) is a random sample from a population with unknown mean \(\mu\) and known variance \(\sigma^2\). Then we know that \[\bar X \sim N(\mu,\sigma/\sqrt{n})\] where the second argument is the standard deviation of \(\bar X\). This holds exactly if the population is normal and approximately if \(n \ge 30\).
Furthermore, \[Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim N(0,1)\] It follows that \[P\Big(-z_{\alpha/2} \le \frac{\bar X - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}\Big) = 1-\alpha\] where \(-z_{\alpha/2} = \text{qnorm}(\alpha/2,0,1)\) and \(z_{\alpha/2} = \text{qnorm}(1-\alpha/2,0,1)\). Rearranging terms, we get the result \[P(\bar X - z_{\alpha/2}\cdot\sigma/\sqrt{n} \le \mu \le \bar X + z_{\alpha/2} \cdot\sigma/\sqrt{n}) = 1-\alpha\]
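As a quick check on the critical values defined above, the following sketch evaluates qnorm for three commonly used confidence levels. The variable names `conf_levels` and `z_crit` are illustrative and are not part of the lesson.

```r
# Critical values z_{alpha/2} for 90%, 95%, and 99% confidence.
conf_levels <- c(0.90, 0.95, 0.99)
alpha <- 1 - conf_levels

# Upper critical value: the point with upper-tail probability alpha/2.
z_crit <- qnorm(1 - alpha / 2, 0, 1)
round(z_crit, 3)
## [1] 1.645 1.960 2.576

# By symmetry, the lower critical value is the negative of the upper one.
all.equal(qnorm(alpha / 2, 0, 1), -z_crit)
## [1] TRUE
```

A higher confidence level gives a larger critical value and therefore a wider interval.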
Suppose there is a sample of size \(n\) from a population with unknown mean \(\mu\) and known variance \(\sigma^2\). Then a \(100(1-\alpha)\%\) confidence interval for \(\mu\) is \[\bar x \pm z_{\alpha/2}\cdot \sigma/\sqrt{n}\] if \(n \ge 30\) or if the population is normal.
In this problem, \(\mu=\) true mean weight of all bags of sugar produced by the process. We want a \(99\%\) confidence interval so \(100(1-\alpha) = 99\) which means that \(\alpha=0.01\). The easiest thing to do is use the following code in R:
sigma <- 1.2
xbar <- 19.8
n <- 25
alpha <- 0.01
z <- qnorm(1-alpha/2,0,1)
xbar - z*sigma/sqrt(n)
## [1] 19.1818
xbar + z*sigma/sqrt(n)
## [1] 20.4182
We are \(99\%\) confident that the true mean weight of all bags of sugar produced by the process lies between \(19.1818\) ounces and \(20.4182\) ounces. This means that the method we used produces an interval containing the true unknown value of \(\mu\) for \(99\%\) of all possible samples.
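The meaning of the confidence level can be illustrated by simulation. The sketch below assumes, purely for illustration, a normal population whose true mean is \(20\) (a hypothetical value, since \(\mu\) is unknown in practice) with \(\sigma = 1.2\). It draws many samples of size \(25\), builds a \(99\%\) interval from each one, and records how often the interval contains the true mean.

```r
set.seed(1)        # arbitrary seed, used only for reproducibility
mu_true <- 20      # hypothetical true mean (unknown in a real problem)
sigma <- 1.2
n <- 25
alpha <- 0.01
z <- qnorm(1 - alpha / 2, 0, 1)
reps <- 10000      # number of simulated samples

# For each simulated sample, check whether its interval contains mu_true.
covered <- replicate(reps, {
  x <- rnorm(n, mean = mu_true, sd = sigma)
  half_width <- z * sigma / sqrt(n)
  (mean(x) - half_width <= mu_true) && (mu_true <= mean(x) + half_width)
})

# Proportion of intervals that contain mu_true; this should be close to 0.99.
mean(covered)
```

Any single interval either contains \(\mu\) or it does not. The \(99\%\) describes the long-run success rate of the procedure.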
We will now consider the more common case where we don’t know the population variance. This case depends on a class of probability distributions called the \(t\) distribution, developed by William Gosset, an English statistician, while working for the Guinness Brewery in Dublin in the early 1900s.
The \(t\) distribution is a family of distributions indexed by the degrees of freedom parameter. In general we will use the parameter \(\nu\) to represent the degrees of freedom and a \(t\) random variable with \(\nu\) degrees of freedom will be written \(t_\nu\).
The \(t\) distribution is bell shaped and symmetric around its mean, just like the normal distribution. However, it has heavier tails than the normal due to the extra variability introduced by estimating \(\sigma^2\). As the degrees of freedom increase, the \(t\) distribution approaches the standard normal distribution, and for large degrees of freedom the two are virtually identical.
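The extra spread of the \(t\) distribution shows up in its tails. As a rough illustration, the sketch below uses R's pt and pnorm functions to compare the probability of a value larger than \(3\) under several \(t\) distributions and under the standard normal. The cutoff \(3\) and the degrees of freedom chosen here are arbitrary.

```r
# Upper-tail probability P(T > 3) for several degrees of freedom.
nu <- c(2, 5, 10, 30, 100)
t_tail <- pt(3, df = nu, lower.tail = FALSE)

# The same upper-tail probability under the standard normal distribution.
z_tail <- pnorm(3, lower.tail = FALSE)

# Each t tail probability is larger than the normal one,
# and the gap shrinks as nu grows.
data.frame(nu = nu, t_tail = t_tail, z_tail = z_tail)
```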
In order to base inferences about a population mean on the \(t\) distribution, critical values analogous to \(z_{\alpha/2}\) are needed. Just as \(z_{\alpha/2}\) is the value from the standard normal distribution such that the upper tail probability is \(\alpha/2\) so \(t_{\nu,\alpha/2}\) is the value from the \(t\) distribution with \(\nu\) degrees of freedom such that the upper tail probability is \(\alpha/2\). We can get this in R using the qt command:
qt(\(1-\alpha/2,\nu\))
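For instance, with \(\alpha = 0.05\) the call can be evaluated for several degrees of freedom at once and compared with the corresponding normal critical value. The values of \(\nu\) below are chosen only for illustration.

```r
alpha <- 0.05
nu <- c(5, 10, 30, 100, 1000)

# t critical values t_{nu, alpha/2}, one for each value of nu.
t_crit <- qt(1 - alpha / 2, nu)
round(t_crit, 2)
## [1] 2.57 2.23 2.04 1.98 1.96

# Normal critical value z_{alpha/2} for comparison.
z_crit <- qnorm(1 - alpha / 2, 0, 1)
round(z_crit, 2)
## [1] 1.96
```

The \(t\) critical value is always larger than \(z_{\alpha/2}\), so a \(t\)-based interval is wider than a \(z\)-based interval computed with the same standard error.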
The development of the confidence interval in the previous section depended on the fact that the random variable \[Z=\frac{\bar X - \mu}{\sigma/\sqrt{n}}\] had a standard normal distribution.
When \(\sigma^2\) is unknown, we replace it with the sample variance \(s^2\) and we get a new random variable \[T=\frac{\bar X - \mu}{s/\sqrt{n}}\] which has a \(t\) distribution with \(n-1\) degrees of freedom when the population is normal.
Suppose there is a sample of size \(n\) from a population with unknown mean \(\mu\) and unknown variance \(\sigma^2\). Then a \(100(1-\alpha)\%\) confidence interval for \(\mu\) is \[\bar x \pm t_{n-1,\alpha/2}\cdot s/\sqrt{n}\] if \(n \ge 30\) or if the population is normal.
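To see why the \(t\) critical value matters when \(s\) stands in for \(\sigma\), the following simulation sketch compares two \(95\%\) intervals built from small samples: one that uses \(z_{\alpha/2}\) with \(s\), and one that uses \(t_{n-1,\alpha/2}\) with \(s\). The population (standard normal), the sample size \(n = 5\), and all variable names are assumptions made for this illustration.

```r
set.seed(2)          # arbitrary seed, used only for reproducibility
mu_true <- 0         # hypothetical true mean
sigma_true <- 1      # hypothetical true standard deviation
n <- 5
alpha <- 0.05
reps <- 10000

z_crit <- qnorm(1 - alpha / 2, 0, 1)
t_crit <- qt(1 - alpha / 2, n - 1)

# For each sample, record whether each type of interval contains mu_true.
hits <- replicate(reps, {
  x <- rnorm(n, mean = mu_true, sd = sigma_true)
  se <- sd(x) / sqrt(n)            # estimated standard error
  err <- abs(mean(x) - mu_true)    # distance from estimate to true mean
  c(z = err <= z_crit * se, t = err <= t_crit * se)
})

# Proportion of samples whose interval contains mu_true, by method.
coverage <- rowMeans(hits)
coverage
```

In runs of this kind the \(z\)-based interval covers \(\mu\) noticeably less often than the nominal \(95\%\), while the \(t\)-based interval stays close to it.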
Example 2: A business school placement director wants to estimate the mean annual salaries five years after students graduate. A random sample of 25 such graduates found a mean salary of \(\$42,740\) and a standard deviation of \(\$4,780\). Construct a \(90\%\) confidence interval for the true mean annual salaries five years after students graduate. Assume the population is normal.
We want to estimate \(\mu\) = mean annual salary five years after students graduate with \(90\%\) confidence, so \(\alpha = 0.10\). Using the following R code, we can find this confidence interval:
xbar <- 42740
s <- 4780
n <- 25
alpha <- 0.10
df <- n - 1
t <- qt(1-alpha/2,df)
xbar - t*s/sqrt(n)
## [1] 41104.4
xbar + t*s/sqrt(n)
## [1] 44375.6
We are \(90\%\) confident that the true mean annual salary five years after students graduate lies between \(\$41,104.40\) and \(\$44,375.60\).
We are interested in estimating \(\mu\) = average speed of all cars traveling over this stretch of highway.
Enter the data into R and use the t.test command.
speed <- c(79,73,68,77,86,71,69)
t.test(speed,conf.level=0.99)
One Sample t-test
data: speed
t = 30.908, df = 6, p-value = 7.617e-08
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
65.75217 83.67640
sample estimates:
mean of x
74.71429
We are \(99\%\) confident that the mean speed lies between \(65.75217\) mph and \(83.67640\) mph. We must assume that the data come from a normal population since \(n < 30\).
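The interval reported by t.test is the same \(\bar x \pm t_{n-1,\alpha/2}\cdot s/\sqrt{n}\) interval described earlier. As a check, the sketch below computes it directly from the data and compares it with the `conf.int` component of the t.test result. The names `manual` and `from_t_test` are illustrative.

```r
speed <- c(79, 73, 68, 77, 86, 71, 69)
n <- length(speed)
alpha <- 0.01

# Interval computed from the formula xbar +/- t * s / sqrt(n).
t_crit <- qt(1 - alpha / 2, n - 1)
manual <- mean(speed) + c(-1, 1) * t_crit * sd(speed) / sqrt(n)
manual

# Interval extracted from the object returned by t.test.
from_t_test <- as.numeric(t.test(speed, conf.level = 0.99)$conf.int)

# The two intervals agree.
all.equal(manual, from_t_test)
## [1] TRUE
```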
Suppose that we want to estimate the population mean \(\mu\) with a margin of error \(m\) at \(100(1-\alpha)\%\) confidence. How large a sample would be required?
For this calculation, we will assume that we know the population variance. In that case, the margin of error is \[m = z_{\alpha/2} \cdot \sigma/\sqrt{n}\] Solving for \(n\), we get our result \[n = \Big ( \frac{z_{\alpha/2} \cdot \sigma}{m} \Big )^2\] The final step is to round our answer UP to the next integer.
Example 4: The branch manager of a bank would like an estimate of \(\mu\), the mean checking account balance of all checking account customers. How many accounts should be sampled to estimate \(\mu\) to within \(\$5\) with \(90\%\) confidence? Assume that \(\sigma = \$55.00\).
328 accounts should be sampled. Using R, we can compute the value of \(n\) with the following code and then round the result up to the next integer.
m <- 5
sigma <- 55
alpha <- 0.10
z <- qnorm(1-alpha/2,0,1)
n <- (z*sigma/m)^2
n
## [1] 327.3708
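The rounding step can also be done in R with ceiling. The sketch below repeats the calculation, rounds up, and then checks the margin of error at the rounded sample size and at one account fewer. The name `n_required` is illustrative.

```r
m <- 5
sigma <- 55
alpha <- 0.10
z <- qnorm(1 - alpha / 2, 0, 1)

# Round the required sample size UP to the next integer.
n_required <- ceiling((z * sigma / m)^2)
n_required
## [1] 328

# Margin of error with n_required accounts: no larger than m.
z * sigma / sqrt(n_required)

# Margin of error with one account fewer: slightly larger than m.
z * sigma / sqrt(n_required - 1)
```

Rounding down to \(327\) would leave the margin of error just above \(\$5\), which is why the result is always rounded up.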