Please watch this video to get some ideas about confidence intervals (CI).
The video shows how confidence intervals can be applied in business.
In this section, you are expected to get familiar with confidence intervals through the following exercises:
Find a point estimate of the average university student age with the sample data from the survey!
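The code producing the numbers below is not shown; here is a minimal sketch that reproduces them, assuming the survey data from the MASS package and the default t-based interval returned by t.test():

library(MASS)                       # provides the 'survey' data set
age.response = na.omit(survey$Age)  # drop any missing ages
mean(age.response)                  # point estimate of the average age
t.test(age.response)$conf.int       # 95% confidence interval (t-based)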
## [1] 20.37451
## [1] 19.54600 21.20303
## attr(,"conf.level")
## [1] 0.95
As we can see, the confidence interval for the average university student age from the survey sample is 19.55 to 21.20 years. Therefore, we can say with 95% confidence that this interval estimate contains the true population mean; the point estimate itself is 20.37.
Assume the population standard deviation \(\sigma\) of student age in the survey data is 7. Find the margin of error and the interval estimate at the 95% confidence level.
library(MASS)
age.response = na.omit(survey$Age)
n = length(age.response)
sigma = 7                       # assumed population standard deviation
sem = sigma/sqrt(n)             # standard error of the mean
E = qnorm(.975)*sem; E          # margin of error
## [1] 0.8911934
xbar = mean(age.response); xbar
## [1] 20.37451
xbar + c(-E, E)                 # 95% confidence interval
## [1] 19.48332 21.26571
Assuming the population standard deviation is 7, the margin of error for the student age survey at the 95% confidence level is 0.89 years. The confidence interval runs from 19.48 to 21.27 years.
Without assuming a value for the population standard deviation \(\sigma\) of student age in the survey, find the margin of error and the interval estimate at the 95% confidence level.
library(MASS)
age.response = na.omit(survey$Age)
n = length(age.response)
s = 7                           # standard deviation estimate (kept at 7 here, matching the output; sd(age.response) would be the usual choice)
SE = s/sqrt(n)                  # standard error of the mean
E = qt(.975, df=n-1)*SE; E      # margin of error, using the t distribution
## [1] 0.8957872
xbar = mean(age.response); xbar
## [1] 20.37451
xbar + c(-E, E)                 # 95% confidence interval
## [1] 19.47873 21.27030
Without assuming a value for the population standard deviation, the margin of error for the student age survey at the 95% confidence level is about 0.9 years, and the confidence interval runs from 19.48 to 21.27 years.
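This interval is slightly wider than the previous one because the t critical value exceeds the normal one; a quick check (assuming n = 237 ages remain after dropping missing values, so df = 236):

qnorm(.975)          # z critical value: 1.959964
qt(.975, df = 236)   # t critical value: about 1.970, slightly larger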
Improve the quality of the sample survey by increasing the sample size: find the sample size needed to achieve a desired margin of error!
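No code is shown for this exercise; the printed result below is consistent with the planning value \(\sigma = 7\) from the earlier exercise and a target margin of error of 1.2 years, both assumptions on my part. A sketch:

sigma = 7                       # planning value for the population standard deviation (assumed)
E = 1.2                         # desired margin of error in years (assumed from the output below)
(qnorm(.975) * sigma / E)^2     # minimum sample size before rounding up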
## [1] 130.7163
The sample size needs to be at least 131 (130.72 rounded up) to achieve the desired margin of error.
Assume you do not have a planned proportion estimate; find the sample size needed to achieve a 5% margin of error for the male student survey at the 95% confidence level! With no planned estimate, the conservative choice is \(p = 0.5\), which maximizes \(p(1-p)\) in the sample-size formula \(n = z^2\, p(1-p)/E^2\).
gender.response = na.omit(survey$Sex)
n = length(gender.response)
male = sum(gender.response == "Male"); male
## [1] 118
p = 0.5                               # conservative planned proportion
p
## [1] 0.5
qnorm(.975)^2 * p * (1 - p) / 0.05^2  # required sample size
## [1] 384.1459
The sample size needed to achieve a 5% margin of error for the male student proportion at the 95% confidence level is 385 (384.15 rounded up).
Perform a confidence interval analysis on this data set from 2004, which includes data on average hourly earnings, marital status, gender, and age for thousands of people.
cps04 <- read.csv("cps04.csv", header = T, sep = ",")

# Average hourly earnings
ahe.resp = na.omit(cps04$ahe)
n = length(ahe.resp)
sigma = sd(ahe.resp)                  # sample standard deviation
sem = sigma/sqrt(n)                   # standard error of the mean
xbar <- mean(ahe.resp); xbar
## [1] 16.7712
xbar + c(-1, 1) * qnorm(.975) * sem   # 95% confidence interval
## [1] 16.7212 16.8212
# Marital status (note: the column used below is bachelor, an education indicator)
mar.response = na.omit(cps04$bachelor)
n = length(mar.response)
m = sum(mar.response == "1"); m
## [1] 3640
SE = sd(mar.response)/sqrt(n)     # standard error of the proportion
E = qnorm(.975)*SE; E             # margin of error
## [1] 0.01092388
pbar = m/n; pbar                  # sample proportion
## [1] 0.4557976
pbar + c(-E, E)                   # 95% confidence interval
## [1] 0.4448738 0.4667215
# Female
female.response = na.omit(cps04$female)
n = length(female.response)
f = sum(female.response == "1"); f
## [1] 3313
SE = sd(female.response)/sqrt(n)  # standard error of the proportion
E = qnorm(.975)*SE; E             # margin of error
## [1] 0.01080662
pbar = f/n; pbar                  # sample proportion of females
## [1] 0.414851
pbar + c(-E, E)                   # 95% confidence interval
## [1] 0.4040444 0.4256576
# Age
age.resp = na.omit(cps04$age)
n = length(age.resp)
sigma = sd(age.resp)              # sample standard deviation
sem = sigma/sqrt(n)               # standard error of the mean
E = qnorm(.975)*sem; E            # margin of error
## [1] 0.06340892
xbar = mean(age.resp); xbar
## [1] 29.75445
xbar + c(-E, E)                   # 95% confidence interval
## [1] 29.69104 29.81785
The mean of average hourly earnings is 16.7712, with a 95% confidence interval from 16.7212 to 16.8212.
The proportion for the bachelor indicator (used here in place of marital status) is 0.4558, with an interval from 0.4449 to 0.4667.
The proportion of females is 0.4149, with an interval from 0.4040 to 0.4257.
The mean age is 29.7545, with an interval from 29.6910 to 29.8179.
Assume you have access to data on an entire population, say the size of every house in all residential home sales in Ames, Iowa between 2006 and 2010. In that case it is straightforward to answer questions like "How big is the typical house in Ames?"
But if you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires you to use your sample to make inferences about what the population looks like.
To access the data in R, type the following code:
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")In this case study we’ll start with a simple random sample of size 60 from the population. Specifically, this is a simple random sample of size 60. Note that the data set has information on many housing variables, but for the first portion of the lab we’ll focus on the size of the house, represented by the variable Gr.Liv.Area.
# set a seed so the random sample, and thus the outputs, are reproducible
set.seed(0)
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
samp
## [1] 2200 2093 1040 2233 1523 1660 1555 1102 848 1136 2061 1122 960 1092 2610
## [16] 2217 1959 2334 1660 1576 848 2004 988 1500 874 1340 1800 1069 1456 784
## [31] 985 1928 882 1124 1639 1214 1434 1150 1544 1812 1511 1949 1077 1248 1480
## [46] 1320 1717 1367 928 2552 1953 693 2690 2276 1173 1258 2582 1558 672 1488
As usual, before you begin to analyze your data, it is important to visualize it. Here, we work with the random sample of size 60 drawn above.
# Make a histogram of the sample...
hist(samp, main = "Distribution of samp",
     col = "deeppink3",
     xlim = c(200, 3500),
     freq = F,      # plot densities so the curve below shares the scale
     xlab = "Samp")
# ...and add a normal density curve using the sample's mean and sd
curve(dnorm(x,
            mean = mean(samp),
            sd = sd(samp)), add = T,
      col = "blue", lwd = 2)

Your Challenge:
summary(samp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     778    1122    1448    1536    1728    3608
mean(population)   # the true population mean, for comparison
## [1] 1499.69
No: since this is a random sample of 60 observations, another student's sample would not be identical. The question is how similar is similar, or how near is near; if the values are close, then yes, the two samples could look alike.
One of the most common ways to describe the typical or central value of a distribution is the mean. In this case we can calculate the mean of the sample using:
sample_mean <- mean(samp); sample_mean
## [1] 1536.317
Return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we're calling it sample_mean). That serves as a good point estimate, but it would be useful to also communicate how uncertain we are of that estimate. This can be captured by using a confidence interval.
We can calculate a 95% confidence interval for a sample mean by adding and subtracting 1.96 standard errors to the point estimate (I assume you are already familiar with this formula).
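In symbols, with the sample standard deviation \(s\) standing in for \(\sigma\), the interval is \(\bar{x} \pm 1.96 \times s/\sqrt{n}\), where here \(n = 60\).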
se <- sd(samp) / sqrt(60)   # standard error of the mean
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1395.866 1676.767
This is an important inference that we've just made: even though we don't know what the full population looks like, we're 95% confident that the true average size of houses in Ames lies between the values lower and upper. There are a few conditions that must be met for this interval to be valid: the sample must be random, and the distribution of sample means must be nearly normal, which a sample of 60 houses generally ensures.
Your Challenge:
Does your confidence interval capture the true average size of houses in Ames? If you are working on this case study in class, does your classmate's interval capture this value? Yes, the confidence interval above captures the true average size of houses in Ames (1499.69), and I would expect most classmates' intervals to capture it as well. Let's simulate such a classroom scenario to see how often the intervals capture the true average. Suppose we have 100 students in the classroom.
count = 0
for (i in 1:100) {
  samp <- sample(population, 60)      # each student draws their own sample
  samp_mean <- mean(samp)
  se <- sd(samp) / sqrt(60)
  lower <- samp_mean - 1.96 * se
  upper <- samp_mean + 1.96 * se
  if ((lower <= 1499.69) & (upper >= 1499.69)) {  # does the interval capture the true mean?
    count = count + 1
  }
}
count
## [1] 97

Here 97 of the 100 simulated intervals capture the true population mean, close to the nominal 95%.
Using R, we're going to recreate many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here (if you are unfamiliar with loops, review the Sampling Distribution Lab).
Here is the rough outline: obtain a random sample of size \(n\), calculate the sample's mean and standard deviation, use these statistics to construct a confidence interval, and repeat 50 times.
But before we do all of this, we need to first create empty vectors where we can save the means and standard deviations that will be calculated from each sample. And while we’re at it, let’s also store the desired sample size as \(n\).
samp_mean <- rep(NA, 50)   # empty vector for the 50 sample means
samp_sd <- rep(NA, 50)     # empty vector for the 50 sample standard deviations
n <- 60; n                 # desired sample size
## [1] 60
Now we’re ready for the loop where we calculate the means and standard deviations of 50 random samples.
for (i in 1:50) {
  samp <- sample(population, n)   # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)      # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)          # save sample sd in ith element of samp_sd
}
samp                              # the last of the 50 samples
## [1] 825 1889 1047 2715 1636 1347 1659 996 1242 1256 1142 1501 1566 2172 1414
## [16] 1489 1788 1358 1488 1229 882 4476 1356 1657 1287 1107 1389 1044 1188 1367
## [31] 1125 1102 2237 1392 1495 1068 1218 1138 1137 1780 1105 1322 1074 1484 874
## [46] 1720 1682 1337 1240 1116 980 876 2136 1104 1299 864 1404 1665 2019 1432
Lastly, we construct the confidence intervals.
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

Lower bounds of these 50 confidence intervals are stored in lower_vector, and the upper bounds are in upper_vector. Let's view the first interval.

c(lower_vector[1], upper_vector[1])
## [1] 1331.196 1577.604

The 1.96 multiplier comes from the critical values for a 95% confidence level:

qnorm(0.025)
## [1] -1.959964
qnorm(0.975)
## [1] 1.959964
Your Challenge:

What proportion of the 50 confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level?
Vector <- data.frame(lower_vector, upper_vector)
meanp <- mean(population)                  # true population mean
left <- sum(Vector$upper_vector < meanp)   # intervals entirely below the mean
right <- sum(Vector$lower_vector > meanp)  # intervals entirely above the mean
noMeanIncluded <- left + right
noMeanIncluded
## [1] 2
round(noMeanIncluded / n, 2)   # note: this divides by n = 60; out of the 50 intervals, the proportion missing the mean is 2/50 = 0.04
## [1] 0.03

In this case 48 of the 50 intervals, i.e. 96%, include the population mean. This proportion is not necessarily equal to the confidence level, but it is a close approximation of it.
lower_vector <- samp_mean - 0.84 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 0.84 * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))

Vector <- data.frame(lower_vector, upper_vector)
meanp <- mean(population)
left <- sum(Vector$upper_vector < meanp)
right <- sum(Vector$lower_vector > meanp)
noMeanIncluded <- left + right
noMeanIncluded
## [1] 19
round(noMeanIncluded / n, 2)   # again divided by n = 60; out of the 50 intervals, the proportion missing the mean is 19/50 = 0.38
## [1] 0.32

In this case only 31 of the 50 intervals, i.e. 62%, include the population mean. A critical value of 0.84 corresponds roughly to a 60% confidence level, so this proportion is again a close approximation of the confidence level.
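To pick the critical value for any confidence level yourself, a small helper function (hypothetical, not part of the lab) makes the pattern explicit:

# hypothetical helper: two-sided critical value for a given confidence level
crit <- function(level) qnorm(1 - (1 - level) / 2)
crit(0.95)   # 1.959964, the multiplier used for the 95% intervals
crit(0.60)   # 0.8416212, approximately the 0.84 used above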