1 Brief Introduction

Please watch this video to get an overview of confidence intervals (CI).

2 CI in Business

This video shows how you can apply confidence intervals in business.

3 Your Exercise

In this section, you are expected to get familiar with confidence intervals by working through the following exercises:

3.1 Exercise 1

Find a point estimate of the average university student age using the sample data from survey.

library(MASS)
age <- survey$Age

# named age_mean so the variable does not share a name with the mean() function
age_mean <- mean(age, na.rm = TRUE)
age_mean
## [1] 20.37451

Based on this result, the point estimate for the average age in the survey data is 20.37451 years.
Next, let's obtain a confidence interval for the average university student age.

pe <- t.test(age, conf.level = 0.95)

pe$conf.int
## [1] 19.54600 21.20303
## attr(,"conf.level")
## [1] 0.95

As expected, the interval contains the point estimate, because a t interval is centred on the sample mean. We can write it as 19.546 < \(\mu\) < 21.203 at the 95% confidence level.

3.2 Exercise 2

Assume the population standard deviation \(\sigma\) of the student age in the survey data is 7. Find the margin of error and interval estimate at the 95% confidence level.

age.response = na.omit(survey$Age)
n = length(age.response)

sigma = 7

sd = sigma/sqrt(n)   # standard error of the mean (not a standard deviation)

e = qnorm(0.975)*sd

e
## [1] 0.8911934

So the margin of error is 0.8911934 years. Next, we add it to and subtract it from the sample mean to find the confidence interval.

xbar = mean(age.response)

xbar
## [1] 20.37451

The confidence interval is then computed like this.

xbar + c(-e, e)
## [1] 19.48332 21.26571

Assuming a population standard deviation of 7, the margin of error for the student age at the 95% confidence level is 0.8911934 years. The confidence interval is 19.48332 < \(\mu\) < 21.26571.
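As a check on the arithmetic, the margin of error can be worked by hand from the values used in this exercise (\(n = 237\), \(\sigma = 7\), and the 97.5th percentile of the standard normal distribution):

```latex
\[
E = z_{0.975}\,\frac{\sigma}{\sqrt{n}}
  = 1.959964 \times \frac{7}{\sqrt{237}}
  \approx 1.959964 \times 0.4547
  \approx 0.8912
\]
```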

Alternatively, we can use z.test from the TeachingDemos package.

library(TeachingDemos)
ztest <- z.test(age.response, sd = sigma)

ztest
## 
##  One Sample z-test
## 
## data:  age.response
## z = 44.809, n = 237.0000, Std. Dev. = 7.0000, Std. Dev. of the sample
## mean = 0.4547, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  19.48332 21.26571
## sample estimates:
## mean of age.response 
##             20.37451

The result is the same as with the first approach.

3.3 Exercise 3

Without assuming the population standard deviation \(\sigma\) of the student Age in survey, find the margin of error and interval estimate at 95% confidence level.

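s <- sd(age.response)   # sample standard deviation, used because sigma is unknown
SE <- s/sqrt(n)         # estimated standard error of the mean

E <- qt(0.975, df = n - 1)*SE

round(E, 4)
## [1] 0.8285

The margin of error at the 95% confidence level is about 0.8285 years.

xbar + c(-E, E)
## [1] 19.54600 21.20303

Without assuming a population standard deviation, the margin of error for the student age is about 0.8285 years at the 95% confidence level, and the confidence interval runs from 19.54600 to 21.20303 years. This matches the t.test result from Exercise 1.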
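For the same standard error, a t critical value always gives a slightly wider margin than a normal critical value, and the gap shrinks as the degrees of freedom grow. The sketch below compares the two; the degrees of freedom 5 and 30 are arbitrary illustrative choices, while 236 corresponds to n - 1 for the 237 ages.

```r
# Compare 97.5th-percentile critical values of the t and standard normal distributions.
dfs <- c(5, 30, 236)            # 5 and 30 are illustrative; 236 = 237 - 1
t_crit <- qt(0.975, df = dfs)   # one t critical value per df
z_crit <- qnorm(0.975)          # normal critical value, about 1.96

round(t_crit, 3)
round(z_crit, 3)
```

With 236 degrees of freedom the two critical values are already very close, which is why the choice between them matters little for a sample of this size.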

3.4 Exercise 4

Improve the quality of a sample survey by increasing the sample size when the standard deviation \(\sigma\) is unknown.

zstar = qnorm(0.975)
x = zstar^2

size <- x*0.25 / (0.05)^2

size
## [1] 384.1459

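So the required sample size is 384.1459, which we round up to 385, to improve the quality of a sample survey with unknown standard deviation σ. The calculation uses the conservative value 0.25 for p(1 - p) and a 5% margin of error at the 95% confidence level.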
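Because a sample size must be a whole number, and rounding down would leave the margin of error slightly above the target, the result is usually rounded up. A minimal sketch of that step follows; `sample_size_prop` is a hypothetical helper written for illustration, not a function from any package.

```r
# Hypothetical helper: sample size needed to estimate a proportion
# within margin of error E at confidence level conf.
sample_size_prop <- function(E, p = 0.5, conf = 0.95) {
  zstar <- qnorm(1 - (1 - conf) / 2)      # two-sided critical value
  ceiling(zstar^2 * p * (1 - p) / E^2)    # round up to a whole respondent
}

n_needed <- sample_size_prop(E = 0.05)
n_needed
## [1] 385
```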

3.5 Exercise 5

Assume you don’t have a planned proportion estimate. Find the sample size needed to achieve a 5% margin of error for the male student survey at the 95% confidence level.

gender.response = na.omit(survey$Sex)
n = length(gender.response)
k = sum(gender.response == "Male")

k
## [1] 118
pbar = k/n;pbar
## [1] 0.5

The number of male students is 118, and the proportion of male students is 0.5.

Now we want to find the sample size needed to achieve a 5% margin of error for the male student survey at the 95% confidence level.

zstar = qnorm(0.975)
p = 0.5

e = 0.05

sizee <- zstar^2 * p * (1-p) / e^2

sizee
## [1] 384.1459

So we need a sample size of 384.1459, rounded up to 385, to achieve a 5% margin of error for the male student survey at the 95% confidence level.
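Using p = 0.5 is the conservative choice when no planned proportion is available, because the term p(1 - p) in the sample size formula is largest there. A quick check over a grid of illustrative values (the grid itself is an arbitrary choice):

```r
p_grid <- seq(0.1, 0.9, by = 0.1)            # illustrative candidate proportions
variance_term <- p_grid * (1 - p_grid)       # the p(1 - p) factor in the formula
best_p <- p_grid[which.max(variance_term)]   # proportion that maximises p(1 - p)
best_p
```

Any other planned proportion would give a smaller required sample size for the same margin of error.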

3.6 Exercise 6

Perform confidence intervals analysis on this data set from 2004 that includes data on average hourly earnings, marital status, gender, and age for thousands of people.

cps <- read.csv("cps04.csv", header = TRUE, sep = ",")

3.7 Hourly Earnings

avghour.response <- na.omit(cps$ahe)
n = length(avghour.response)

sigma = sd(avghour.response)

SE = sigma/sqrt(n)

E = qnorm(0.975) * SE

E
## [1] 0.1920964
xbar <- mean(avghour.response)

xbar
## [1] 16.7712
xbar + c(-E, E)
## [1] 16.57911 16.96330

From the output above, the margin of error of the average hourly earnings is 0.1920964. The sample mean (xbar) is 16.7712, and the confidence interval runs from 16.57911 to 16.96330.

3.8 Age

age.respon = na.omit(cps$age)
n = length(age.respon)

sigma = sd(age.respon)

SE = sigma/sqrt(n)

E = qnorm(0.975) * SE

E
## [1] 0.06340892
xbar <- mean(age.respon)

xbar
## [1] 29.75445
xbar + c(-E, E)
## [1] 29.69104 29.81785

From the output above, the margin of error of age is 0.06340892 years. The sample mean (xbar) is 29.75445, and the confidence interval runs from 29.69104 to 29.81785 years.
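The same steps (drop missing values, compute the standard error, multiply by the critical value) repeat for each variable in this exercise, so they could be wrapped in a small function. In the sketch below, `mean_ci` is a hypothetical helper written for illustration, and the simulated vector only stands in for a column such as cps$ahe.

```r
# Hypothetical helper: normal-approximation confidence interval for a mean.
mean_ci <- function(x, conf = 0.95) {
  x <- x[!is.na(x)]                      # drop missing values
  se <- sd(x) / sqrt(length(x))          # standard error of the mean
  e <- qnorm(1 - (1 - conf) / 2) * se    # margin of error
  c(lower = mean(x) - e, mean = mean(x), upper = mean(x) + e)
}

set.seed(1)                                         # make the simulated data reproducible
fake_wage <- c(rnorm(500, mean = 17, sd = 9), NA)   # simulated stand-in, not CPS data
ci <- mean_ci(fake_wage)
ci
```

Calling the helper on each column would replace the repeated blocks while keeping the calculation identical.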

3.9 Female

fem.response = na.omit(cps$female)
n = length(fem.response)
k = sum(fem.response == "1")

k
## [1] 3313
sigma = sd(fem.response)

SE = sigma/sqrt(n)

E = qnorm(0.975) * SE

E
## [1] 0.01080662
xbar <- mean(fem.response)

xbar + c(-E, E)
## [1] 0.4040444 0.4256576

From the output above, the number of female participants is 3313 and the margin of error of the female proportion is 0.01080662. The sample mean (xbar), which for a 0/1 variable is the sample proportion, is 0.414851, and the confidence interval runs from 0.4040444 to 0.4256576. Because the whole interval lies below 0.5, there are more male than female participants.

3.10 Bachelor

bach.response = na.omit(cps$bachelor)
n = length(bach.response)
k = sum(bach.response == "1")

k
## [1] 3640
sigma = sd(bach.response)

SE = sigma/sqrt(n)

E = qnorm(0.975) * SE

E
## [1] 0.01092388
xbar <- mean(bach.response)


xbar + c(-E,E)
## [1] 0.4448738 0.4667215

From the output above, the number of participants with a bachelor's degree is 3640 and the margin of error of the bachelor proportion is 0.01092388. The sample mean (xbar), which is again the sample proportion, is 0.4557976, and the confidence interval runs from 0.4448738 to 0.4667215. Because the whole interval lies below 0.5, there are more participants without a bachelor's degree than with one.
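For a 0/1 variable such as female or bachelor, the standard error based on sd() is almost the same as the usual proportion formula sqrt(p(1 - p)/n); the two differ only by the factor n/(n - 1) inside the square root. The sketch below illustrates this with a simulated indicator vector, which is an assumption standing in for the CPS columns.

```r
set.seed(42)
x <- rbinom(2000, size = 1, prob = 0.4)   # simulated 0/1 indicator, not CPS data
n_x <- length(x)
p_hat <- mean(x)                          # sample proportion

se_sd <- sd(x) / sqrt(n_x)                    # approach used in this section
se_prop <- sqrt(p_hat * (1 - p_hat) / n_x)    # textbook proportion formula

c(se_sd = se_sd, se_prop = se_prop)
```

With thousands of observations the difference between the two is negligible, so either gives essentially the same interval.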

4 Case Study

Assume you have access to data on an entire population, say the size of every house in all residential home sales in Ames, Iowa between 2006 and 2010. It is then straightforward to answer questions like:

  • How big is the typical house in Ames?
  • How much variation is there in the sizes of houses?
  • What is the average price of a house in Ames?
  • What is a confidence interval for the price of a house in Ames?

But if you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make inferences about what your population looks like.

4.1 Collect Data

To access the data in R, type the following code:

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")

In this case study we’ll start with a simple random sample of size 60 from the population. Note that the data set has information on many housing variables, but for the first portion of the lab we’ll focus on the size of the house, represented by the variable Gr.Liv.Area.

# set a seed so the outputs in this assignment are reproducible
set.seed(0)
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
samp
##  [1] 2200 2093 1040 2233 1523 1660 1555 1102  848 1136 2061 1122  960 1092 2610
## [16] 2217 1959 2334 1660 1576  848 2004  988 1500  874 1340 1800 1069 1456  784
## [31]  985 1928  882 1124 1639 1214 1434 1150 1544 1812 1511 1949 1077 1248 1480
## [46] 1320 1717 1367  928 2552 1953  693 2690 2276 1173 1258 2582 1558  672 1488

4.2 Visualization

As usual, before you analyze your data further, it’s important to visualize it. Here, we use the random sample of size 60 from the population.

# Histogram
library(moments)
hist(samp, breaks = 20, col = 'pink')

# Make a histogram of your sample
hist(samp, main = "Distribution of Samp",
     col = "deeppink3",
     xlim = c(200, 3500),
     freq = FALSE,
     xlab = "Samp")
# ...and add a normal density curve with the sample's mean and sd
curve(dnorm(x,
            mean = mean(samp),
            sd = sd(samp)), add = TRUE,
            col = "blue", lwd = 2)

Your Challenge:

  • Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.
  • Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

4.3 Confidence Intervals

One of the most common ways to describe the typical or central value of a distribution is to use the mean. In this case we can calculate the mean of the sample using,

sample_mean <- mean(samp)
sample_mean
## [1] 1514.133

Return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it sample_mean). That serves as a good point estimate but it would be useful to also communicate how uncertain we are of that estimate. This can be captured by using a confidence interval.

We can calculate a 95% confidence interval for a sample mean by adding 1.96 standard errors to, and subtracting 1.96 standard errors from, the point estimate (I assume you are already familiar with this formula).

se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1381.351 1646.915

This is an important inference that we’ve just made: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the values lower and upper. There are a few conditions that must be met for this interval to be valid.

Your Challenge:

  • For the confidence interval to be valid, the sample mean must be normally distributed and have standard error \(s/\sqrt{n}\). What conditions must be met for this to be true?
  • What does “95% confidence” mean?
  • Does your confidence interval capture the true average size of houses in Ames? If you are working on this case study with classmates, does your classmate’s interval capture this value?

4.4 Simulation

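Let’s simulate a classroom scenario to see how often a confidence interval captures the true average size of houses in Ames. Suppose we have 100 students in the classroom, each drawing their own sample of size 60. The value 1499.69 in the code is the population mean that the intervals are checked against.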

count = 0
for (i in 1:100) {
  samp <- sample(population,60)
  samp_mean<- mean(samp)
  se <- sd(samp)/sqrt(60)
  lower <- samp_mean-1.96*se
  upper <- samp_mean+1.96*se
  if ((lower <= 1499.69) & (upper >= 1499.69)) {
    count = count+1
  }  
}
count
## [1] 97

Using R, we’re going to recreate many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here (If you are unfamiliar with loops, review the Sampling Distribution Lab).

Here is the rough outline:

  • Obtain a random sample.
  • Calculate and store the sample’s mean and standard deviation.
  • Repeat the first two steps 50 times.
  • Use these stored statistics to calculate many confidence intervals.

But before we do all of this, we need to first create empty vectors where we can save the means and standard deviations that will be calculated from each sample. And while we’re at it, let’s also store the desired sample size as \(n\).

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
n
## [1] 60

Now we’re ready for the loop where we calculate the means and standard deviations of 50 random samples.

for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}
samp
##  [1] 2206 3672 2270 1786 1041 2614 1655 1378 1250 1884 1358  764 1176 1595 1419
## [16] 1620 1299 1097 1073 1647 1220 1086 1928 1412 1091 2263 1968 1261 1538  793
## [31] 1337 1768 1604 1609 1479  980  480  816  951 1069 1709 1742 2237 1458  864
## [46] 1665 1778 1949 1040 1414  954 1142 1614 1368 5642 1383 1242  816 2082 1728

Lastly, we construct the confidence intervals.

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

Lower bounds of these 50 confidence intervals are stored in lower_vector, and the upper bounds are in upper_vector. Let’s view the first interval.

c(lower_vector[1], upper_vector[1])
## [1] 1400.415 1718.352
# confidence interval visualization
plot_ci(lower_vector, upper_vector, mean(population))

# For a 95% confidence interval, the critical values are -1.959964 and 1.959964.
qnorm((1-0.95)/2)
## [1] -1.959964
qnorm((1+0.95)/2)
## [1] 1.959964
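One way to count how many intervals capture the true mean is to compare both bounds with the population mean in a single vectorised expression. The sketch below is self-contained: it uses a simulated skewed population (an assumption standing in for ames$Gr.Liv.Area) and its own variable names, all prefixed with sim_.

```r
set.seed(123)
sim_population <- rgamma(3000, shape = 9, rate = 0.006)  # simulated stand-in, mean near 1500
sim_n <- 60
sim_reps <- 50

# Draw sim_reps samples and keep each sample's mean and standard deviation.
sim_means <- numeric(sim_reps)
sim_sds <- numeric(sim_reps)
for (i in seq_len(sim_reps)) {
  s <- sample(sim_population, sim_n)
  sim_means[i] <- mean(s)
  sim_sds[i] <- sd(s)
}

sim_lower <- sim_means - 1.96 * sim_sds / sqrt(sim_n)
sim_upper <- sim_means + 1.96 * sim_sds / sqrt(sim_n)

# Proportion of the intervals that contain the simulated population mean.
covered <- sim_lower <= mean(sim_population) & sim_upper >= mean(sim_population)
coverage <- mean(covered)
coverage
```

The same comparison applied to lower_vector, upper_vector, and mean(population) would give the proportion for the Ames samples.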

Your Challenge:

  • What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.
  • Pick a confidence level other than 95%, for example 99%. What is the appropriate critical value?
  • Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?