Confidence Intervals for the Sample Mean Using R Computation Project

Jessica Stuart

Sampling distribution. Suppose you draw a random sample of size 9 from a population with a normal distribution and compute the sample mean. Imagine doing this 100 times so that you have 100 sample means.

Generate 9 random numbers, normally distributed with mean 80 and standard deviation 5, and repeat this 100 times.

data <- replicate(100, rnorm(9, mean = 80, sd = 5))

What does this data look like?

dim(data)
## [1]   9 100

The data are stored in a matrix with 9 rows and 100 columns of numeric values; each column is one simulated sample of size 9.

Compute the sample means.

xbar <- apply(data, 2, mean)
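
As an aside, colMeans would give the same vector of 100 sample means as the apply call above; a quick optional check (not part of the original assignment):

all.equal(xbar, colMeans(data))   # should return TRUE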

Compare the distribution of the sample means with the population distribution.

hist(data[1, ], nclass = 20, col = "red", xlab = "Values for 100 Subjects", 
    main = "Population Measurements", xlim = c(60, 100))

[Figure: histogram "Population Measurements"]

hist(xbar, nclass = 20, col = "blue", xlab = "Means for Samples of Size 9", 
    main = "Sample Means", xlim = c(60, 100))

[Figure: histogram "Sample Means"]

While the values change each time the simulation is run, the main difference is spread: the population measurements cover a much larger range of values than the sample means, which cluster tightly around 80. This is expected, since the standard deviation of a sample mean is \( \sigma/\sqrt{n} = 5/\sqrt{9} \approx 1.67 \), compared to \( \sigma = 5 \) for an individual measurement.
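
A quick numerical check of these spreads, using the objects already in memory (a sketch; the exact values vary from run to run):

sd(data[1, ])   # spread of 100 individual measurements, roughly 5
sd(xbar)        # spread of the 100 sample means, roughly 1.67
5/sqrt(9)       # theoretical standard deviation of the sample mean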

Confidence Intervals I. Suppose n = 9 men are selected at random from a population. Assume the heights of the men in this population are normal with mean 69 inches and standard deviation 3 inches. Simulate the results of this selection 100 times and in each case find a 90% confidence interval for the mean.

In order to do this, generate 100 columns of normally distributed random numbers with mean 69 and standard deviation 3 (each column has 9 rows).

heights <- replicate(100, rnorm(9, mean = 69, sd = 3))

Calculate a 90% confidence interval for each column (each column is one sample of 9 heights).

tint <- matrix(NA, nrow = dim(heights)[2], ncol = 2)
for (i in 1:dim(heights)[2]) {
    temp <- t.test(heights[, i], conf.level = 0.9)
    tint[i, ] <- temp$conf.int
}
colnames(tint) <- c("lcl", "ucl")
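
For a single sample, t.test is simply computing \( \bar{x} \pm t \cdot s/\sqrt{n} \) with 8 degrees of freedom. A minimal sketch of that calculation for the first column (the numbers will differ from run to run):

x <- heights[, 1]
mean(x) + c(-1, 1) * qt(0.95, df = 8) * sd(x)/sqrt(9)   # should match tint[1, ]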

What is the width of each confidence interval?

width <- apply(tint, 1, diff)
tint <- cbind(tint, width)
tint <- data.frame(tint)
head(tint)
##     lcl   ucl width
## 1 66.96 70.78 3.821
## 2 66.76 70.21 3.446
## 3 67.45 70.57 3.120
## 4 67.22 69.91 2.696
## 5 66.88 70.30 3.413
## 6 67.42 71.24 3.817

How many intervals contain \( \mu \)?

indx <- (tint$lcl <= 69) & (tint$ucl >= 69)
sum(indx)
## [1] 89

How many intervals do you expect to contain \( \mu \)?

I expect 90 intervals to contain \( \mu \) since we are finding confidence intervals at a 90% confidence level and 90 is 90% of 100.
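
The observed count of 89 is consistent with this expectation: the number of covering intervals behaves like a Binomial(100, 0.9) random variable, so deviations of a few intervals from 90 are normal. A rough sketch of the expected spread:

sqrt(100 * 0.9 * 0.1)   # standard deviation of the coverage count, 3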

Do all of the intervals have the same width? Why or why not?

No, the intervals do not all have the same width. Each width equals \( 2 t \cdot s/\sqrt{9} \) (with \( t = t_{0.95,\,8} \)), and the sample standard deviation \( s \) is different for each randomly generated sample.
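
A quick sketch verifying this relationship with the objects from above:

s <- apply(heights, 2, sd)
head(2 * qt(0.95, df = 8) * s/sqrt(9))   # should match head(tint$width)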

Confidence Intervals II. Suppose you computed 95% confidence intervals instead of 90% confidence intervals. Would you expect more or fewer intervals to cover 69? Would they be narrower or wider?

I would expect more intervals to cover 69 at a 95% confidence level than at a 90% confidence level, and the 95% confidence intervals should be wider than the 90% confidence intervals.
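
The extra width comes from the larger t multiplier; for a given sample, \( \bar{x} \) and \( s \) do not change, only the critical value does. A sketch of the two multipliers for n = 9:

qt(0.95, df = 8)    # multiplier for a 90% interval, about 1.86
qt(0.975, df = 8)   # multiplier for a 95% interval, about 2.31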

To check this, I will repeat the above simulation.

First, generate 100 columns of normally distributed random numbers with mean 69 and standard deviation 3 (each column has 9 rows).

heights <- replicate(100, rnorm(9, mean = 69, sd = 3))

Calculate a 95% confidence interval for each column.

tint <- matrix(NA, nrow = dim(heights)[2], ncol = 2)
for (i in 1:dim(heights)[2]) {
    temp <- t.test(heights[, i], conf.level = 0.95)
    tint[i, ] <- temp$conf.int
}
colnames(tint) <- c("lcl", "ucl")

Find the width of each confidence interval.

width <- apply(tint, 1, diff)
tint <- cbind(tint, width)
tint <- data.frame(tint)
head(tint)
##     lcl   ucl width
## 1 66.08 69.38 3.294
## 2 67.07 70.98 3.905
## 3 67.81 70.53 2.721
## 4 66.72 70.41 3.692
## 5 65.97 70.19 4.226
## 6 69.38 73.59 4.215

Find the number of intervals that contain \( \mu \).

indx <- (tint$lcl <= 69) & (tint$ucl >= 69)
sum(indx)
## [1] 95

Based on the simulation, I can confirm my claim above: more intervals cover the true mean at the 95% confidence level than at the 90% level (95 versus 89), and the 95% confidence intervals are wider than the 90% intervals.

Confidence Intervals III. Suppose you took samples of size n = 100 instead of n = 9. Would you expect more or fewer 90% confidence intervals to cover 69? Would they be narrower or wider?

I would expect about the same number of 90% confidence intervals to cover 69, but the intervals should be narrower with a sample size of 100 rather than 9, because the standard error \( s/\sqrt{n} \) shrinks as \( n \) grows (see the sketch below).
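
A rough sketch of the typical widths, using the population value \( \sigma = 3 \) in place of \( s \):

2 * qt(0.95, df = 8) * 3/sqrt(9)      # typical width with n = 9, about 3.7
2 * qt(0.95, df = 99) * 3/sqrt(100)   # typical width with n = 100, about 1.0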

To check this, I will repeat the above simulation.

First, generate 100 columns of normally distributed random numbers with mean 69 and standard deviation 3 (each column has 100 rows).

heights <- replicate(100, rnorm(100, mean = 69, sd = 3))

Calculate a 90% confidence interval for each column.

tint <- matrix(NA, nrow = dim(heights)[2], ncol = 2)
for (i in 1:dim(heights)[2]) {
    temp <- t.test(heights[, i], conf.level = 0.9)
    tint[i, ] <- temp$conf.int
}
colnames(tint) <- c("lcl", "ucl")

Find the width of each confidence interval.

width <- apply(tint, 1, diff)
tint <- cbind(tint, width)
tint <- data.frame(tint)
head(tint)
##     lcl   ucl  width
## 1 68.62 69.72 1.0936
## 2 68.75 69.82 1.0681
## 3 68.73 69.85 1.1177
## 4 68.80 69.78 0.9881
## 5 69.23 70.22 0.9882
## 6 68.29 69.43 1.1379

Find the number of intervals that contain \( \mu \).

indx <- (tint$lcl <= 69) & (tint$ucl >= 69)
sum(indx)
## [1] 87

Based on the simulation, I can confirm my claim above: about the same number of intervals cover the true mean (87 of 100), and the intervals are much narrower, when the sample size is 100 instead of 9 at the same 90% confidence level.

Robustness. The assumption for small sample confidence intervals (t-intervals) is that the population is normally distributed. What happens if this assumption is violated? You will generate observations from an exponential distribution with mean 25, and compute 90% confidence intervals.

To get a feel for the exponential distribution, generate 1000 observations from this distribution, and then make a histogram and a normal probability plot. Describe your results.

mean <- 25
data <- rexp(1000, rate = 1/mean)
hist(data, nclass = 20, col = "green", main = "Random Numbers from Exponential Distribution")

[Figure: histogram "Random Numbers from Exponential Distribution"]

qqnorm(data)
qqline(data)

[Figure: normal probability plot of the exponential data]

Based on the histogram and the normal probability plot, it is clear that the data are not normally distributed: the distribution is strongly right-skewed, and the points in the qqnorm plot do not lie along the straight reference line drawn by qqline (only a small fraction of the points fall close to the line).
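
For an exponential distribution the mean and standard deviation are equal, so both should be close to 25 here. A quick optional check on the 1000 simulated values:

mean(data)   # should be near 25
sd(data)     # should also be near 25, since mean and sd coincide for an exponential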

Create an empty matrix with 1000 rows and 5 columns, and label the columns as follows:

xx <- matrix(NA, nrow = 1000, ncol = 5)
colnames(xx) <- c("xbar", "stdev", "lcl", "ucl", "cover true mean?")

Now generate 1000 samples of 10 observations each from an exponential distribution with mean 25, and compute the mean and standard deviation of each sample.

mean <- 25
data <- replicate(1000, rexp(10, rate = 1/mean))
xbar <- apply(data, 2, mean)
stdev <- apply(data, 2, sd)

Compute the lower and upper ends of the t-intervals.

tint <- matrix(NA, nrow = dim(data)[2], ncol = 2)
for (i in 1:dim(data)[2]) {
    temp <- t.test(data[, i], conf.level = 0.9)
    tint[i, ] <- temp$conf.int
}
colnames(tint) <- c("lcl", "ucl")

Store the results in the matrix that was created.

xx[, 1] <- xbar
xx[, 2] <- stdev
xx[, 3] <- tint[, 1]
xx[, 4] <- tint[, 2]

How many intervals include the true mean?

xx[, 5] <- (xx[, 3] < mean) & (xx[, 4] > mean)
sum(xx[, 5])
## [1] 850

What do you find? Is this what you expected? How does this compare to the coverage of confidence intervals for a normal distribution, such as the heights in the section Confidence Intervals I, if that simulation had been run 1000 times?

The number of intervals that contain the true mean is 850, noticeably below 900, the number expected when 1000 intervals are computed at a 90% confidence level. For normally distributed data, as in the section Confidence Intervals I, I would expect the count to be much closer to 900 if the simulation were run 1000 times, because the t-interval assumes the underlying data are normally distributed.
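
To see that 850 of 1000 is a genuine shortfall rather than simulation noise, note that the coverage count for a correctly calibrated 90% interval would be Binomial(1000, 0.9). A rough sketch of the expected spread:

sqrt(1000 * 0.9 * 0.1)                # standard deviation of the coverage count, about 9.5
(900 - 850)/sqrt(1000 * 0.9 * 0.1)    # 850 is roughly 5 standard deviations below 900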

To verify this claim, I will repeat the simulation from Confidence Intervals I, but using 1000 simulations instead of 100.

heights <- replicate(1000, rnorm(9, mean = 69, sd = 3))
tint <- matrix(NA, nrow = dim(heights)[2], ncol = 2)
for (i in 1:dim(heights)[2]) {
    temp <- t.test(heights[, i], conf.level = 0.9)
    tint[i, ] <- temp$conf.int
}
colnames(tint) <- c("lcl", "ucl")
width <- apply(tint, 1, diff)
tint <- cbind(tint, width)
tint <- data.frame(tint)
indx <- (tint$lcl <= 69) & (tint$ucl >= 69)
sum(indx)
## [1] 895

Based on this simulation, the coverage for normally distributed data (895 of 1000 intervals) is much closer to the expected 900 than the coverage for exponential data (850 of 1000). This confirms that the t-interval loses coverage when the normality assumption is violated and the sample size is small.