The puzzle
The confidence interval of the mean \(\overline{x}\) of a sample of size \(n\) from a Normal population whose mean \(\mu\) and variance \(\sigma\) are both unknown, but related in a known way, e.g. \(\sigma^2 = \frac{\mu}{10}\), is correctly estimated as a function of the sample mean \(\overline{x}\) but not as a function of the sample standard deviation \(s\).
Yet, since \(\sigma^2 = \frac{\mu}{10}\), both \(\overline{x}\) and \(s\) are unbiased estimators for both \(\mu\) and \(\sigma^2\). Why don’t they both work work equally well for calculating confidence intervals?
Three kinds of 95% CI of the mean of a sample drawn from this population
library(dplyr)
library(ggplot2)
# Draw `n` samples of size `m` from normal
# distro of mean `mu` and variance = `mu`/10,
# return simulated ci's at the `ci` level.
# `varden` stands for "variance denominator"
# and it's there to give us the option of trying
# different relationships between the population
# mean and variance (e.g., when `varden` < 1, we
# have a distribution with variance > mean).
showCI <- function(m = 5, n = 10^3, ci = .95, mu = 12, varden = 10) {
# draw n samples of size m from population of
# known ~ N(mean = mu, variance = mu / 10):
foo <- replicate(n = n, rnorm(n = m, mean = mu, sd = sqrt(mu / varden)))
# collect the n sample means and sample
# standard deviations in a tibble of
# 2 columns and n rows
msd <- cbind(sm = apply(foo, 2, mean),
ssd = apply(foo, 2, sd)) %>%
tbl_df()
# To each sm there corresponds a lower and
# upper bound of the `ci` confidence interval
# calculated with standard normal critical values
# times standard error of the mean as a function
# of the sample mean (x_bar) and varden. The
# standard normal critical values qnorm() are
# the correct ones to use when the sample is drawn
# from a population with known variance. Since the
# sample variance is an unbiased estimator of it,
# we can use the relationship between the population
# mean and variance to estimate the standard error
# of the sample mean, instead of using the direct
# calculation of the sample standard deviation.
msd <- mutate(msd,
lower_sm = sm - qnorm(p = ci + (1 - ci) / 2) * sqrt((sm / varden) / m),
upper_sm = sm + qnorm(p = ci + (1 - ci) / 2) * sqrt((sm / varden) / m))
# Now let's instead calculate the standard error of
# the mean as a function of the sample standard deviation
# shown in the ssd column, but again pair it with the
# qnorm() critical values, because we know the population
# variance
msd <- mutate(msd,
lower_ssd = sm - qnorm(p = ci + (1 - ci) / 2) * ssd / sqrt(m),
upper_ssd = sm + qnorm(p = ci + (1 - ci) / 2) * ssd / sqrt(m))
# And what if we used the t distro critical values paired with the
# standard error of the mean calculated as above, which is what
# we would have to do if we did not know the population variance?
msd <- mutate(msd,
lower_ssdt = sm - qt(p = ci + (1 - ci) / 2, df = m - 1) * ssd / sqrt(m),
upper_ssdt = sm + qt(p = ci + (1 - ci) / 2, df = m - 1) * ssd / sqrt(m))
# So how many times is mu seen inside each of these pairs
# of `ci` bounds? Let's see:
mutate(msd,
inside_sm = mu > lower_sm & mu < upper_sm,
inside_ssd = mu > lower_ssd & mu < upper_ssd,
inside_ssdt = mu > lower_ssdt & mu < upper_ssdt)
}
m <- 5
n <- 10^3
ci <- .95
mu <- 12
varden <- 10
checks <- showCI()
If you draw 1,000 samples of size 5 from a population distributed ~N(mu = 12, sigma^2 = 1.2), the population mean \(\mu\) falls inside each sample’s own calculated CI 95% of the time when this CI is calculated as a function of \(\overline{x}\) and 87% of the time when it’s calculated as a function of \(s\).
This replicates the original puzzle. But what if in the CI formula that uses \(s\) we used t critical values, \(qt\)? Then \(\mu\) would fall inside the CI 95% of the time. This is close enough to resolve the puzzle.
It appears that when the population variance is a function of the population mean, the sample mean \(\overline{x}\) is an unbiased estimator of the population variance, so you can use N(0, 1) critical values as if you drew from a population with unknown mean but known variance. But since the sample standard deviation \(s\) is not an unbiased estimator of the population variance, if you use \(s\) instead of \(\overline{x}\) you need to switch to t(df = m - 1) critical values, as if you drew from a population of unknown variance. The rules of when to use which are mentioned here.
Does this hold across many sample sizes?
Yes. Here’s a density plot that shows that the share of cases when \(\mu\) shows up inside the 95% CI spikes on .95, as it should, when you use t critical values with the sample standard deviation.
biglist <- lapply(c(5:50), showCI)
checkthese <- sapply(biglist, function(checks) {
c(prop.table(table(checks$inside_sm))['TRUE'],
prop.table(table(checks$inside_ssd))['TRUE'],
prop.table(table(checks$inside_ssdt))['TRUE'])
})
verify <- t(checkthese) %>%
tbl_df()
names(verify) <- c('Inside CI xbar', 'Inside CI s', 'Inside CI s qt')
verify <- mutate(verify, `Sample size` = c(5:50))
verify <- tidyr::gather(verify, 'Which', 'Value', 1:3)
ggplot(data = verify, aes(x = Value, colour = Which)) +
geom_line(stat = 'density') +
ggtitle('Using t distro with sample sd solves the puzzle')

---
title: "When the sample mean works, but sample standard deviation does not"
output: html_notebook
---

## The puzzle

The confidence interval of the mean \(\overline{x}\) of a sample of size \(n\) from a Normal population whose mean \(\mu\) and variance \(\sigma\) are both unknown, but related in a known way, e.g. \(\sigma^2 = \frac{\mu}{10}\), is correctly estimated as a function of the sample mean \(\overline{x}\) but not as a function of the sample standard deviation \(s\). 

Yet, since \(\sigma^2 = \frac{\mu}{10}\), both \(\overline{x}\) and \(s\) are unbiased estimators for both \(\mu\) and \(\sigma^2\). Why don't they both work work equally well for calculating confidence intervals?

## Three kinds of 95% CI of the mean of a sample drawn from this population

```{r simulateCI, message = FALSE, warning = FALSE}
library(dplyr)
library(ggplot2)

# Draw `n` samples of size `m` from normal 
# distro of mean `mu` and variance = `mu`/10,
# return simulated ci's at the `ci` level. 
# `varden` stands for "variance denominator" 
# and it's there to give us the option of trying 
# different relationships between the population 
# mean and variance (e.g., when `varden` < 1, we  
# have a distribution with variance > mean).
showCI <- function(m = 5, n = 10^3, ci = .95, mu = 12, varden = 10) {
   # draw n samples of size m from population of  
   # known ~ N(mean = mu, variance = mu / 10):
   foo <- replicate(n = n, rnorm(n = m, mean = mu, sd = sqrt(mu / varden)))
   
   # collect the n sample means and sample 
   # standard deviations in a tibble of 
   # 2 columns and n rows
   msd <- cbind(sm = apply(foo, 2, mean), 
                ssd = apply(foo, 2, sd)) %>% 
      tbl_df()
   
   # To each sm there corresponds a lower and 
   # upper bound of the `ci` confidence interval 
   # calculated with standard normal critical values
   # times standard error of the mean as a function
   # of the sample mean (x_bar) and varden. The  
   # standard normal critical values qnorm() are 
   # the correct ones to use when the sample is drawn
   # from a population with known variance. Since the 
   # sample variance is an unbiased estimator of it, 
   # we can use the relationship between the population 
   # mean and variance to estimate the standard error 
   # of the sample mean, instead of using the direct 
   # calculation of the sample standard deviation.
   msd <- mutate(msd, 
                 lower_sm = sm - qnorm(p = ci + (1 - ci) / 2) * sqrt((sm / varden) / m), 
                 upper_sm = sm + qnorm(p = ci + (1 - ci) / 2) * sqrt((sm / varden) / m))
   
   # Now let's instead calculate the standard error of 
   # the mean as a function of the sample standard deviation
   # shown in the ssd column, but again pair it with the 
   # qnorm() critical values, because we know the population
   # variance
   msd <- mutate(msd, 
                 lower_ssd = sm - qnorm(p = ci + (1 - ci) / 2) * ssd / sqrt(m), 
                 upper_ssd = sm + qnorm(p = ci + (1 - ci) / 2) * ssd / sqrt(m))
   
   # And what if we used the t distro critical values paired with the
   # standard error of the mean calculated as above, which is what
   # we would have to do if we did not know the population variance?
   msd <- mutate(msd, 
                 lower_ssdt = sm - qt(p = ci + (1 - ci) / 2, df = m - 1) * ssd / sqrt(m), 
                 upper_ssdt = sm + qt(p = ci + (1 - ci) / 2, df = m - 1) * ssd / sqrt(m))
   
   # So how many times is mu seen inside each of these pairs
   # of `ci` bounds? Let's see:
   mutate(msd, 
          inside_sm = mu > lower_sm & mu < upper_sm, 
          inside_ssd = mu > lower_ssd & mu < upper_ssd, 
          inside_ssdt = mu > lower_ssdt & mu < upper_ssdt)
}
m <- 5
n <- 10^3
ci <- .95
mu <- 12
varden <- 10
checks <- showCI()
```

If you draw `r prettyNum(n, big.mark = ',')` samples of size `r m` from a population distributed `r paste('~N(mu = ', mu, ', sigma^2 = ', mu/varden ,')', sep = '')`, the population mean \(\mu\) falls inside each sample's own calculated CI `r round(100 * prop.table(table(checks$inside_sm))['TRUE'])`% of the time when this CI is calculated as a function of \(\overline{x}\) and `r round(100 * prop.table(table(checks$inside_ssd))['TRUE'])`% of the time when it's calculated as a function of \(s\).

This replicates the original puzzle. But what if in the CI formula that uses \(s\) we used t critical values, \(qt\)? Then \(\mu\) would fall inside the CI `r round(100 * prop.table(table(checks$inside_ssdt))['TRUE'])`% of the time. This is close enough to resolve the puzzle.

It appears that when the population variance is a function of the population mean, the sample mean \(\overline{x}\) is an unbiased estimator of the population variance, so you can use N(0, 1) critical values as if you drew from a population with unknown mean but known variance. But since the sample standard deviation \(s\) is not an unbiased estimator of the population variance, if you use \(s\) instead of \(\overline{x}\) you need to switch to t(df = m - 1) critical values, as if you drew from a population of unknown variance. The rules of when to use which are mentioned [here](http://www.stat.yale.edu/Courses/1997-98/101/confint.htm).

## Does this hold across many sample sizes?

Yes. Here's a density plot that shows that the share of cases when \(\mu\) shows up inside the 95% CI spikes on .95, as it should, when you use t critical values with the sample standard deviation.

```{r multipleSampleSizes, fig.width = 8, fig.height = 6}
biglist <- lapply(c(5:50), showCI)
checkthese <- sapply(biglist, function(checks) {
   c(prop.table(table(checks$inside_sm))['TRUE'], 
     prop.table(table(checks$inside_ssd))['TRUE'], 
     prop.table(table(checks$inside_ssdt))['TRUE'])
})
verify <- t(checkthese) %>% 
   tbl_df()
names(verify) <- c('Inside CI xbar', 'Inside CI s', 'Inside CI s qt')
verify <- mutate(verify, `Sample size` = c(5:50))
verify <- tidyr::gather(verify, 'Which', 'Value', 1:3)
ggplot(data = verify, aes(x = Value, colour = Which)) + 
   geom_line(stat = 'density') + 
   ggtitle('Using t distro with sample sd solves the puzzle')
```

