More Confidence Intervals

Confidence Intervals

Previously, we used a bootstrap method to determine confidence intervals.
Now we'll look at a more classical method to find confidence intervals.
These are based on the Central Limit Theorem.
Pay attention to the different cases:
- the underlying population is normal and we assume we know the standard deviation
- the underlying population is normal, but we don't know the standard deviation
- the underlying population is not normal; it may be skewed, for example

Central Limit Theorem

The main idea is that, for large $ n $, the sampling distribution of the mean is approximately normal with mean $ \mu $ and standard deviation $ \frac{\sigma}{\sqrt{n}} $, where $ \mu $ is the mean of the population and $ \sigma $ is the standard deviation of the population. The approximation improves as $ n \rightarrow \infty $, and it is exact for any $ n $ when the population itself is normal.
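A quick simulation can make this concrete. The sketch below is not from the book: it assumes a skewed Exponential(rate = 1) population (mean 1, standard deviation 1), and the sample size, number of repetitions and seed are arbitrary choices.

```r
set.seed(1)      # arbitrary seed, only for reproducibility
n     <- 50      # size of each sample (assumed)
reps  <- 10000   # number of samples drawn (assumed)
sigma <- 1       # sd of an Exponential(rate = 1) population

# Draw `reps` samples of size n and record the mean of each one.
means <- replicate(reps, mean(rexp(n, rate = 1)))

mean(means)      # should be close to the population mean, 1
sd(means)        # should be close to sigma / sqrt(n)
sigma / sqrt(n)  # theoretical standard deviation of the sample mean
```

Running hist(means) afterwards should show a roughly bell-shaped histogram, even though the population is strongly skewed.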

Population normally distributed, known standard deviation

We'll work through the examples in the book, but I'll present a slightly different R procedure.

Example 7.1

Weights of populations are well known and normally distributed.
According to the CDC, 13 year old girls:
- have a mean weight of 101 lbs
- with a standard deviation of 24.6 lbs
A community wants to see if the mean weight of 13 year old girls in their community is significantly different from that of the overall population.
They take a sample of size $ n=150 $.
The average weight of the sample is $ \overline{x}=95 $ lbs.

Sampling distribution for community

The sampling distribution for the mean is normally distributed with:
- mean $ =\overline{x}=95 $
- standard deviation $ =\frac{24.6}{\sqrt{150}} $
We use qnorm to find a 95% confidence interval for the mean weight of 13 year old girls in this community:

xbar <- 95
sigma <- 24.6
n <- 150
qnorm(c(.025, .975), mean=xbar, sd=sigma/sqrt(n))

[1] 91.06325 98.93675

Interpretation

If we repeat this sampling process many times, about 95% of the intervals we obtain will contain the true mean weight.
What does the community conclude? The interval does not contain the mean weight of the national population (101 lbs), so they have evidence that the mean weight of girls in their community is different from that of the nation.
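The "95% of the time" statement can be checked by simulation. The sketch below is an illustration under stated assumptions: it pretends the true mean is known to be 101 lbs so that we can count how often the interval catches it. The number of repetitions and the seed are arbitrary.

```r
set.seed(2)      # arbitrary seed
mu    <- 101     # "true" mean, assumed known for the simulation only
sigma <- 24.6    # known population standard deviation
n     <- 150     # sample size
reps  <- 10000   # number of repeated samples (assumed)

# For each repetition: draw a sample, build the 95% interval as above,
# and record whether it contains mu.
covered <- replicate(reps, {
  xbar <- mean(rnorm(n, mean = mu, sd = sigma))
  ci   <- qnorm(c(.025, .975), mean = xbar, sd = sigma / sqrt(n))
  ci[1] < mu && mu < ci[2]
})

mean(covered)    # proportion of intervals containing mu; should be near 0.95
```

Each individual interval either contains the true mean or it doesn't; the 95% describes the long-run success rate of the procedure.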

Alternate form of reporting

This type of result is often reported as: $ \overline{x} \pm $ MOE, where MOE is the margin of error.
The MOE can be computed with:

qnorm(0.975, mean=0, sd=1)*sigma/sqrt(n)

[1] 3.936748

qnorm(0.025, mean=0, sd=1)*sigma/sqrt(n)

[1] -3.936748

Check:

xbar-qnorm(0.975, mean=0, sd=1)*sigma/sqrt(n)

[1] 91.06325

xbar+qnorm(0.975, mean=0, sd=1)*sigma/sqrt(n)

[1] 98.93675

Notice: We used a mean of 0 and a standard deviation of 1. This is for a standard normal distribution. The reason for this type of reporting is historical: before R there were tables of values, but only for standard normal distributions.

Your turn

Exercise 7.3a)

Population normally distributed, unknown standard deviation

Again we want to estimate a mean value.
We assume the population is normal.
But we don't have a standard deviation to work with.
The only option at hand is to use the standard deviation of the sample.
Instead of using $ \frac{\sigma}{\sqrt{n}} $, we'll use $ \frac{S}{\sqrt{n}} $
$ \sigma $ is the true standard deviation (which we don't know).
$ S $ is the standard deviation of our sample, with which we'll have to be content.

But...

When we replace $ \sigma $ with $ S $, the standardized sample mean $ \frac{\overline{x}-\mu}{S/\sqrt{n}} $ is no longer normally distributed.
Fortunately, its distribution is known: it's a $ t $-distribution.
So, same process, but we use a $ t $-distribution instead of a normal distribution.
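To see what switching distributions changes, compare the 0.975 quantiles. This is a small sketch with degrees of freedom chosen arbitrarily for illustration (degrees of freedom are explained with the example that follows).

```r
df  <- c(5, 27, 100, 1000)   # illustrative degrees of freedom (assumed)
q_t <- qt(0.975, df)         # t quantiles, one per value of df
q_z <- qnorm(0.975)          # standard normal quantile

# Side-by-side table: the t quantiles are larger than the normal one
# and shrink toward it as df grows.
round(cbind(df, q_t, q_z), 3)
```

A larger quantile means a wider interval: that is the price of estimating the standard deviation from the sample.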

Boys (Example 7.5)

For some reason the CDC lost the information about the weights of boys.
The community still wants to know the mean weight of boys in their community.
They assume the weights are normally distributed, but they do not have a known standard deviation to work with.
They have a sample from $ n=28 $ boys:
- whose mean weight is 110 lbs
- with a sample standard deviation of 7.5 lbs.

We compute a 90% confidence interval.
Unlike qnorm and pnorm, qt and pt take no mean or sd arguments: they work only with the standard $ t $-distribution.
So we need to find the MOE and work from there.
If we want a 90% confidence interval, we put 5% in each tail. Let $ q $ be the 0.95 quantile of the $ t $-distribution; the MOE is \[ q \cdot \frac{S}{\sqrt{n}}. \]
The $ t $-distribution also takes a degrees of freedom parameter. Its value is $ n-1 $.

xbar <- 110
S <- 7.5
n <- 28
df <- n-1
q <- qt(.95, df)
q

[1] 1.703288

MOE <- q*S/sqrt(n)
xbar-MOE

[1] 107.5858

xbar+MOE

[1] 112.4142
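The same steps can be wrapped in a small function for reuse. The name t_interval and its arguments are hypothetical, made up for this sketch; they are not from the book or from any package.

```r
# Hypothetical helper: two-sided t confidence interval from summary statistics.
t_interval <- function(xbar, S, n, conf.level = 0.95) {
  alpha <- 1 - conf.level
  q     <- qt(1 - alpha / 2, df = n - 1)   # upper-tail quantile
  MOE   <- q * S / sqrt(n)                 # margin of error
  c(lower = xbar - MOE, upper = xbar + MOE)
}

# The boys example: 90% interval with xbar = 110, S = 7.5, n = 28.
ci <- t_interval(110, 7.5, 28, conf.level = 0.90)
ci
```

The result should agree with the two bounds computed step by step above.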

Your turn

Exercise 7.7
I haven't told you how to do a one-sided confidence interval - think about what makes sense.

If you have the data (Example 7.6)...

there is a shorter procedure.

Again, we'll assume the distribution is normal. Let's check this assumption with a qqplot:

library(resampledata)
girls <- subset(NCBirths2004, select=Weight, subset=Gender=="Female", drop=TRUE)
qqnorm(girls)

[Figure: normal Q-Q plot of girls]

To find the CI, use t.test. For a 99% confidence interval:

t.test(girls, conf.level=.99)$conf.int

[1] 3343.305 3453.328
attr(,"conf.level")
[1] 0.99

(If you take off $conf.int you'll see some other information.)
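To confirm that t.test does the same computation as the MOE approach, we can compare the two on the same data. The sketch below uses simulated data (made up here, not NCBirths2004) so that it runs without any extra package. The sample size, mean, standard deviation and seed are arbitrary.

```r
set.seed(3)                               # arbitrary seed
x <- rnorm(40, mean = 3400, sd = 500)     # made-up sample, for illustration only

n <- length(x)
q <- qt(0.995, df = n - 1)                # 99% interval: 0.5% in each tail

# Manual interval: xbar +/- MOE
manual  <- mean(x) + c(-1, 1) * q * sd(x) / sqrt(n)

# Built-in interval; as.numeric() drops the conf.level attribute
builtin <- as.numeric(t.test(x, conf.level = 0.99)$conf.int)

manual
builtin
```

The two pairs of bounds should be identical up to rounding.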

Population not normal

If the data is symmetric, but not normal, the $ t $-distribution still works well if the sample size is big enough.
If the data is skewed rather than symmetric, the $ t $-distribution does not work well.
We'll come back to this issue with a bootstrap version.
So: check the assumption about an underlying normal distribution with a qqplot before proceeding with a $ t $-distribution process.
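A simulation shows what "does not work well" means in practice. This sketch assumes a skewed Exponential(rate = 1) population (true mean 1) and a small sample size. Both choices, along with the seed and number of repetitions, are arbitrary.

```r
set.seed(4)     # arbitrary seed
n    <- 10      # small sample size (assumed)
reps <- 5000    # number of repeated samples (assumed)
mu   <- 1       # true mean of an Exponential(rate = 1) population

# For each repetition: draw a skewed sample, build a nominal 95% t interval,
# and record whether it contains the true mean.
covered <- replicate(reps, {
  x  <- rexp(n, rate = 1)
  ci <- t.test(x, conf.level = 0.95)$conf.int
  ci[1] < mu && mu < ci[2]
})

mean(covered)   # actual coverage; expect it to fall short of 0.95
```

Running qqnorm(rexp(n, rate = 1)) on one such sample shows the kind of curved pattern that should make us hesitate before using a $ t $ interval.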

Difference of means

We'd like to find a confidence interval for the difference of two means.
Instead of something like \[ \text{lower bound} < \overline{X} < \text{upper bound} \]
we want \[ \text{lower bound} < \overline{X}-\overline{Y} < \text{upper bound} \]
where $ \overline{X} $ and $ \overline{Y} $ are the sample means of two different groups, such as a control group and a treatment group.

Plan

Use a $ t $-distribution
Find the desired quantile with qt. E.g., for a 95% CI: q <- qt(.975, df)
See the comment on page 199 about the degrees of freedom.
Find the MOE.
- This is where things are different. The MOE is: \[ q \cdot \sqrt{\frac{S_1^2}{n_1}+\frac{S_2^2}{n_2}} \] where $ S_1 $ is the standard deviation for the first group which is of size $ n_1 $. Similar for group 2.
Then the CI is: $ \overline{X}-\overline{Y} \pm $ MOE
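The plan can be sketched in R as follows. Two things here are assumptions, not taken from the book: the data are simulated (group sizes, means and standard deviations are made up), and the degrees of freedom use Welch's approximation, which is what t.test uses by default for two samples. Check the comment on page 199 for the choice the book recommends.

```r
set.seed(5)                          # arbitrary seed
x <- rnorm(30, mean = 10, sd = 2)    # made-up group 1
y <- rnorm(25, mean = 9,  sd = 3)    # made-up group 2

n1 <- length(x); n2 <- length(y)
v1 <- var(x) / n1                    # S_1^2 / n_1
v2 <- var(y) / n2                    # S_2^2 / n_2

# Welch's approximation for the degrees of freedom (an assumption here)
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

q   <- qt(.975, df)                  # quantile for a 95% CI
MOE <- q * sqrt(v1 + v2)             # margin of error

manual  <- mean(x) - mean(y) + c(-1, 1) * MOE
builtin <- as.numeric(t.test(x, y, conf.level = 0.95)$conf.int)

manual
builtin
```

The manual interval and the one from t.test(x, y) should match, which confirms that the MOE formula above is the one t.test applies when it does not assume equal variances.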

Your turn

Exercise 7.13