Please watch this video to get some ideas about confidence intervals (CI).
This video shows how you can apply confidence intervals in business.
In this section, you are expected to get familiar with confidence intervals through the following exercises:
Find a point estimate of average university student Age with the sample data from survey!
library(MASS)
age <- survey$Age
mean <- mean(age, na.rm = TRUE)
mean
## [1] 20.37451
Based on these results, the point estimate for the average age in the survey data is 20.37451 years.
Next, let us find a confidence interval for the average university student age.
pe <- t.test(age, conf.level = 0.95)
pe$conf.int
## [1] 19.54600 21.20303
## attr(,"conf.level")
## [1] 0.95
The result is consistent with the point estimate, because 20.37451 lies inside the interval. We can write this confidence interval as \(19.546 < \mu < 21.203\) at the 95% confidence level.
Assume the population standard deviation \(\sigma\) of the student Age in the survey data is 7. Find the margin of error and the interval estimate at the 95% confidence level.
age.response = na.omit(survey$Age)
n = length(age.response)
sigma = 7
sd = sigma/sqrt(n)
e = qnorm(0.975)*sd
e
## [1] 0.8911934
The margin of error is 0.8911934 years. Next, we subtract it from and add it to the sample mean to obtain the confidence interval.
xbar = mean(age.response)
xbar
## [1] 20.37451
The confidence interval is then computed as follows.
xbar + c(-e, e)
## [1] 19.48332 21.26571
Assuming a population standard deviation of 7, the margin of error for the student age at the 95% confidence level is 0.8911934 years. The confidence interval for this case is \(19.48332 < \mu < 21.26571\).
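The two steps above (margin of error first, then the interval) can be wrapped in a small helper. This is only a minimal sketch and not part of the original exercise: the function name z_interval and its arguments are hypothetical, and it assumes the population standard deviation is known. The inputs are the values reported above (sample mean 20.37451, \(\sigma\) = 7, n = 237).

```r
# Hypothetical helper: z-based confidence interval when sigma is assumed known.
z_interval <- function(xbar, sigma, n, conf.level = 0.95) {
  alpha <- 1 - conf.level
  # margin of error = critical value * standard error of the mean
  e <- qnorm(1 - alpha / 2) * sigma / sqrt(n)
  c(lower = xbar - e, upper = xbar + e, margin = e)
}

# Values taken from the exercise above: xbar = 20.37451, sigma = 7, n = 237
res <- z_interval(xbar = 20.37451, sigma = 7, n = 237)
print(round(res, 5))
```

Changing conf.level (for example to 0.99) widens the interval, which is a quick way to see how the confidence level and the margin of error trade off.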
We can also solve this another way, using z.test from the TeachingDemos package.
library(TeachingDemos)
ztest <- z.test(age.response, sd = sigma)
ztest
##
## One Sample z-test
##
## data: age.response
## z = 44.809, n = 237.0000, Std. Dev. = 7.0000, Std. Dev. of the sample
## mean = 0.4547, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 19.48332 21.26571
## sample estimates:
## mean of age.response
## 20.37451
As we can see, the result is the same as with the first approach.
Without assuming the population standard deviation \(\sigma\) of the student Age in survey, find the margin of error and interval estimate at 95% confidence level.
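E <- qt(0.975, df = n - 1)*sd  # sd still holds sigma/sqrt(n) from the previous exercise
E
## [1] 0.8957872
With the t critical value, the margin of error at the 95% confidence level is 0.8957872 years.
xbar + c(-E, E)
## [1] 19.47873 21.27030
The margin of error for the student age is 0.8957872 years at the 95% confidence level, and the confidence interval lies between 19.47873 and 21.27030 years. Note that the variable sd was defined earlier as sigma/sqrt(n) with sigma = 7, so this result still relies on the assumed \(\sigma\) and only replaces the normal critical value with a t critical value. Using the sample standard deviation instead, as t.test does, gives the interval 19.54600 to 21.20303 reported earlier.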
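For comparison, a t-based interval that estimates the standard error from the sample itself, rather than from an assumed \(\sigma\), could be sketched as follows. The helper name t_interval and the ages vector are made up for illustration; they are not the survey data. The sketch checks itself against the built-in t.test.

```r
# Hypothetical helper: t-based interval using the sample standard deviation.
t_interval <- function(x, conf.level = 0.95) {
  x <- x[!is.na(x)]                 # drop missing values, as na.omit() does
  n <- length(x)
  se <- sd(x) / sqrt(n)             # standard error estimated from the sample
  e <- qt(1 - (1 - conf.level) / 2, df = n - 1) * se
  c(lower = mean(x) - e, upper = mean(x) + e, margin = e)
}

# Made-up ages, for illustration only (not the survey data)
ages <- c(18.2, 19.5, 21.0, 17.9, 23.4, 20.1, 35.8, 18.7, NA, 19.3)
manual <- t_interval(ages)
builtin <- t.test(ages, conf.level = 0.95)$conf.int
print(manual)
print(builtin)
```

The manual bounds and the t.test bounds should agree, which confirms that t.test uses the sample standard deviation with n - 1 degrees of freedom.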
Improve the quality of a sample survey by increasing the sample size when the standard deviation \(\sigma\) is unknown.
zstar = qnorm(0.975)
x = zstar^2
size <- x*0.25 / (0.05)^2
size
## [1] 384.1459
So, using the conservative value p(1 - p) = 0.25 and a 5% margin of error, we get 384.1459, which rounds up to a sample size of 385.
Assume you do not have a planned proportion estimate. Find the sample size needed to achieve a 5% margin of error for the male student survey at the 95% confidence level.
gender.response = na.omit(survey$Sex)
n = length(gender.response)
k = sum(gender.response == "Male")
k
## [1] 118
pbar = k/n
pbar
## [1] 0.5
The number of male students is 118, and the proportion of male students is 0.5.
Now we want to find the sample size needed to achieve a 5% margin of error for the male student survey at the 95% confidence level.
zstar = qnorm(0.975)
p = 0.5
e = 0.05
sizee <- zstar^2 * p * (1-p) / e^2
sizee
## [1] 384.1459
Thus, rounding 384.1459 up, we need a sample size of 385 to achieve a 5% margin of error for the male student survey at the 95% confidence level.
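The sample size formula used above, \(n = z^{*2}\, p(1-p) / e^2\), can be packaged as a reusable function. This is a sketch under stated assumptions: the name sample_size_prop is hypothetical, p = 0.5 is the conservative default when no planned proportion is available, and the result is rounded up because rounding down would leave the margin of error slightly above the target.

```r
# Hypothetical helper: sample size needed to estimate a proportion.
sample_size_prop <- function(margin, p = 0.5, conf.level = 0.95) {
  zstar <- qnorm(1 - (1 - conf.level) / 2)
  n_raw <- zstar^2 * p * (1 - p) / margin^2
  ceiling(n_raw)   # round up so the target margin of error is not exceeded
}

n_needed <- sample_size_prop(margin = 0.05)
print(n_needed)   # 385
```

If a planned proportion is available (say p = 0.2, a made-up value), passing it as p gives a smaller required sample, since p(1 - p) is largest at p = 0.5.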
Perform confidence intervals analysis on this data set from 2004 that includes data on average hourly earnings, marital status, gender, and age for thousands of people.
cps <- read.csv("cps04.csv", header = TRUE, sep = ",")
avghour.response <- na.omit(cps$ahe)
n = length(avghour.response)
sigma = sd(avghour.response)
SE = sigma/sqrt(n)
E = qnorm(0.975) * SE
E
## [1] 0.1920964
xbar <- mean(avghour.response)
xbar
## [1] 16.7712
xbar + c(-E, E)
## [1] 16.57911 16.96330
From the output above, the margin of error of the average hourly earnings is 0.1920964. The sample mean (xbar) is 16.7712, and the confidence interval lies between 16.57911 and 16.96330.
age.respon = na.omit(cps$age)
n = length(age.respon)
sigma = sd(age.respon)
SE = sigma/sqrt(n)
E = qnorm(0.975) * SE
E
## [1] 0.06340892
xbar <- mean(age.respon)
xbar
## [1] 29.75445
xbar + c(-E, E)
## [1] 29.69104 29.81785
From the output above, the margin of error of age is 0.06340892 years. The sample mean (xbar) is 29.75445, and the confidence interval lies between 29.69104 and 29.81785 years.
fem.response = na.omit(cps$female)
n = length(fem.response)
k = sum(fem.response == "1")
k
## [1] 3313
sigma = sd(fem.response)
SE = sigma/sqrt(n)
E = qnorm(0.975) * SE
E
## [1] 0.01080662
xbar <- mean(fem.response)
xbar + c(-E, E)
## [1] 0.4040444 0.4256576
From the output above, the number of female participants is 3313 and the margin of error of the proportion of females is 0.01080662. The sample mean (xbar), which here is the sample proportion, is 0.414851, and the confidence interval lies between 0.4040444 and 0.4256576. Because the whole interval is below 0.5, there are more male than female participants.
bach.response = na.omit(cps$bachelor)
n = length(bach.response)
k = sum(bach.response == "1")
k
## [1] 3640
sigma = sd(bach.response)
SE = sigma/sqrt(n)
E = qnorm(0.975) * SE
E
## [1] 0.01092388
xbar <- mean(bach.response)
xbar + c(-E,E)
## [1] 0.4448738 0.4667215
From the output above, the number of participants with a bachelor's degree is 3640 and the margin of error of the proportion is 0.01092388. The sample mean (xbar), which here is the sample proportion, is 0.4557976, and the confidence interval lies between 0.4448738 and 0.4667215. Because the whole interval is below 0.5, there are more participants without a bachelor's degree than with one.
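The four analyses above repeat the same steps: drop missing values, compute the standard error, then the margin of error and the interval. A minimal sketch of a helper that does this once per column is shown below. The function name mean_ci is hypothetical, and because cps04.csv is not bundled here, the example runs on a made-up data frame (toy) whose columns only imitate the shape of the CPS variables.

```r
# Hypothetical helper: large-sample (z-based) interval for a column mean.
mean_ci <- function(x, conf.level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  se <- sd(x) / sqrt(n)
  e <- qnorm(1 - (1 - conf.level) / 2) * se
  c(mean = mean(x), margin = e, lower = mean(x) - e, upper = mean(x) + e)
}

# Made-up stand-in for the CPS columns (the real file is not used here)
set.seed(1)
toy <- data.frame(
  ahe = rlnorm(500, meanlog = 2.7, sdlog = 0.5),  # earnings-like positive values
  female = rbinom(500, size = 1, prob = 0.4)      # 0/1 indicator
)

# One row of results per column
ci_table <- t(sapply(toy, mean_ci))
print(round(ci_table, 4))
```

For a 0/1 column the mean is the sample proportion, and for large n the standard error sd(x)/sqrt(n) is nearly identical to the usual proportion formula sqrt(p(1 - p)/n), which is why the same code works for the female and bachelor indicators.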
Assume you have access to data on an entire population, say the size of every house in all residential home sales in Ames, Iowa between 2006 and 2010. In that case it is straightforward to answer questions about the typical house size.
However, if you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make inferences about what your population looks like.
To access the data in R, type the following code:
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
In this case study we’ll start with a simple random sample of size 60 from the population. Note that the data set has information on many housing variables, but for the first portion of the lab we’ll focus on the size of the house, represented by the variable Gr.Liv.Area.
# set a seed so that the outputs in this assignment are reproducible
set.seed(0)
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
samp
## [1] 2200 2093 1040 2233 1523 1660 1555 1102 848 1136 2061 1122 960 1092 2610
## [16] 2217 1959 2334 1660 1576 848 2004 988 1500 874 1340 1800 1069 1456 784
## [31] 985 1928 882 1124 1639 1214 1434 1150 1544 1812 1511 1949 1077 1248 1480
## [46] 1320 1717 1367 928 2552 1953 693 2690 2276 1173 1258 2582 1558 672 1488
As usual, before analyzing your data further, it is important to visualize it. Here, we use the random sample of size 60 from the population.
# Histogram
library(moments)
## Warning: package 'moments' was built under R version 4.1.3
hist(samp, breaks = 20, col = 'pink')
# Make a histogram of your sample
hist(samp, main = "Distribution of Samp",
     col = "deeppink3",
     xlim = c(200, 3500),
     freq = FALSE,
     xlab = "Samp")
# ...and add a density curve
curve(dnorm(x,
            mean = mean(samp),
            sd = sd(samp)), add = TRUE,
      col = "blue", lwd = 2)
Your Challenge:
One of the most common ways to describe the typical or central value of a distribution is to use the mean. In this case we can calculate the mean of the sample using,
sample_mean <- mean(samp)
sample_mean
## [1] 1514.133
Return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it sample_mean). That serves as a good point estimate but it would be useful to also communicate how uncertain we are of that estimate. This can be captured by using a confidence interval.
We can calculate a 95% confidence interval for a sample mean by adding and subtracting 1.96 standard errors to the point estimate (I assume that you are already familiar with this formula).
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1381.351 1646.915
This is an important inference that we’ve just made: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the values lower and upper. There are a few conditions that must be met for this interval to be valid.
Your Challenge:
Ames? If you are working on this case study, does your classmate’s interval capture this value?
Let’s simulate a classroom scenario to see how often confidence intervals capture the true average size of houses in Ames. Suppose we have 100 students in the classroom, each drawing their own sample of 60 houses.
count = 0
for (i in 1:100) {
  samp <- sample(population, 60)
  samp_mean <- mean(samp)
  se <- sd(samp)/sqrt(60)
  lower <- samp_mean - 1.96*se
  upper <- samp_mean + 1.96*se
  # count the interval if it captures 1499.69, the value used here as the true mean
  if ((lower <= 1499.69) & (upper >= 1499.69)) {
    count = count + 1
  }
}
count
## [1] 97
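The same coverage check can be written without an explicit counter by using replicate(). The sketch below is an illustration under stated assumptions: since the Ames data must be downloaded, it uses a made-up right-skewed population (population_sim), and the helper name covers_once is hypothetical. With 95% intervals, roughly 95 out of every 100 intervals are expected to capture the true mean, although the exact count varies from run to run.

```r
set.seed(42)
# Made-up, right-skewed stand-in for the Ames living areas (assumption)
population_sim <- rgamma(3000, shape = 9, scale = 165)
mu <- mean(population_sim)   # true mean of the simulated population

# TRUE when the 95% interval from one sample of size n covers mu
covers_once <- function(pop, n = 60, mu = mean(pop)) {
  s <- sample(pop, n)
  se <- sd(s) / sqrt(n)
  (mean(s) - 1.96 * se <= mu) && (mean(s) + 1.96 * se >= mu)
}

# Repeat the experiment 1000 times and record the share of covering intervals
covered <- replicate(1000, covers_once(population_sim, mu = mu))
coverage <- mean(covered)
print(coverage)
```

To apply this to the lab data, population_sim could be replaced by the population vector defined earlier.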
Using R, we’re going to recreate many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here (If you are unfamiliar with loops, review the Sampling Distribution Lab).
Here is the rough outline:
But before we do all of this, we need to first create empty vectors where we can save the means and standard deviations that will be calculated from each sample. And while we’re at it, let’s also store the desired sample size as \(n\).
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
n
## [1] 60
Now we’re ready for the loop where we calculate the means and standard deviations of 50 random samples.
for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}
samp
## [1] 2206 3672 2270 1786 1041 2614 1655 1378 1250 1884 1358 764 1176 1595 1419
## [16] 1620 1299 1097 1073 1647 1220 1086 1928 1412 1091 2263 1968 1261 1538 793
## [31] 1337 1768 1604 1609 1479 980 480 816 951 1069 1709 1742 2237 1458 864
## [46] 1665 1778 1949 1040 1414 954 1142 1614 1368 5642 1383 1242 816 2082 1728
Lastly, we construct the confidence intervals.
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
Lower bounds of these 50 confidence intervals are stored in lower_vector, and the upper bounds are in upper_vector. Let’s view the first interval.
c(lower_vector[1], upper_vector[1])
## [1] 1400.415 1718.352
# confidence interval visualization
plot_ci(lower_vector, upper_vector, mean(population))
For a 95% confidence interval, the critical values are -1.959964 and 1.959964.
qnorm((1-0.95)/2)
## [1] -1.959964
qnorm((1+0.95)/2)
## [1] 1.959964
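The same qnorm call generalizes to other confidence levels. A minimal sketch follows; the helper name crit_value is hypothetical, and it returns only the positive critical value because the negative one is its mirror image.

```r
# Hypothetical helper: two-sided normal critical value for a confidence level.
crit_value <- function(conf.level) qnorm((1 + conf.level) / 2)

conf_levels <- c(0.90, 0.95, 0.99)
z <- sapply(conf_levels, crit_value)
print(round(z, 6))   # 1.644854 1.959964 2.575829
```

A higher confidence level gives a larger critical value and therefore a wider interval for the same sample.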
Your Challenge: