Please watch this video to get some ideas about confidence intervals (CI).
The video shows how you can apply confidence intervals in business.
In this section, you are expected to get familiar with the following confidence interval exercises:
Find a point estimate of the average university student Age
using the sample data from survey.
# Survey data from MASS package
library(MASS)
# The point estimate of average university student `Age`
mean(survey$Age)
## [1] 20.37451
We load the survey data from the MASS package, then use the mean function to obtain the point estimate of the average student Age.
Assume the population standard deviation \(\sigma\) of the student Age
in the survey data
is 7. Find the margin of error and interval estimate at the 95% confidence level.
Age.resp = na.omit(survey$Age) # filter out missing values in Age
n = length(Age.resp) # assign the length of response
sigma = 7 # population standard deviation
sem = sigma/sqrt(n) # standard error of the mean
E = qnorm(.975)*sem; E               # margin of error (upper 97.5% quantile of the normal distribution)
xbar = mean(Age.resp); xbar          # sample mean (point estimate)
xbar + c(-E, E)                      # interval estimate at 95% confidence
## [1] 0.8911934
## [1] 20.37451
## [1] 19.48332 21.26571
First we use the na.omit function to filter out missing values in Age. Then the standard error of the mean is sigma/sqrt(n), and the margin of error is qnorm(0.975) (about 1.96) times the standard error. The interval estimate is the sample mean plus or minus this margin of error.
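For reference, the interval computed above is the usual z-based formula: \[ \bar{x} \pm z_{0.975}\,\frac{\sigma}{\sqrt{n}}, \] where \(\bar{x}\) is the sample mean and \(z_{0.975}\) (about 1.96) is given by qnorm(.975).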
Without assuming the population standard deviation \(\sigma\) of the student Age
in the survey data, find the margin of error and interval estimate at the 95% confidence level.
Age.resp = na.omit(survey$Age) # filter out missing values in Age
n = length(Age.resp) # assign the length of response
s = 7 # sample standard deviation
SE = s/sqrt(n) # standard error estimate
E = qt(.975, df=n-1)*SE; E           # margin of error (t distribution with n-1 degrees of freedom)
xbar = mean(Age.resp); xbar          # sample mean (point estimate)
xbar + c(-E, E)                      # interval estimate at 95% confidence
## [1] 0.8957872
## [1] 20.37451
## [1] 19.47873 21.27030
Again we use na.omit to filter out missing values in Age, compute the standard error estimate as s/sqrt(n), and obtain the margin of error by multiplying qt(.975, df=n-1) by the standard error. Because \(\sigma\) is not assumed known, the t critical value replaces the z critical value; since the same value 7 is plugged in for the standard deviation, the slightly wider interval reflects only that change.
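Written as a formula, the only change from the previous interval is that a t critical value replaces the z critical value: \[ \bar{x} \pm t_{0.975,\,n-1}\,\frac{s}{\sqrt{n}}, \] where \(s\) is the standard deviation estimate and \(t_{0.975,\,n-1}\) is given by qt(.975, df = n-1).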
Improve the quality of the sample survey
by increasing the sample size, with the standard deviation \(\sigma\) unknown.
Please explain what you observe in your result.
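As a small illustration (a sketch, not part of the original exercise, using the assumed \(\sigma = 7\) from above): because the margin of error is proportional to \(1/\sqrt{n}\), quadrupling the sample size halves the margin of error.
n.try = c(100, 400, 1600)            # hypothetical sample sizes
qnorm(.975) * 7 / sqrt(n.try)        # margin of error at 95% confidence: about 1.372, 0.686, 0.343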
Assuming you have no planned proportion estimate, find the sample size needed to achieve a 5% margin of error for the male student survey
at the 95% confidence level.
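The computation below uses the standard sample-size formula for a proportion; with no planned estimate, \(p = 0.5\) is used because it maximizes \(p(1-p)\) and therefore gives the most conservative (largest) sample size: \[ n \ge \frac{z_{0.975}^{2}\;p(1-p)}{E^{2}}, \] where \(E\) is the desired margin of error (0.05 here).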
gender.response = na.omit(survey$Sex)
n = length(gender.response)
male = sum(gender.response == "Male"); male       # number of male respondents
p = 0.5; p                                        # conservative planned proportion when none is available
qnorm(.975)^2 * p * (1-p) / 0.05^2                # required sample size for a 5% margin of error
## [1] 118
## [1] 0.5
## [1] 384.1459
Please explain what you observe in your result.
Perform a confidence interval analysis on this data set from 2004, which includes data on average hourly earnings, a bachelor’s degree indicator, gender, and age for thousands of people.
cps04 <- read.csv("cps04.csv", header = T, sep = ",")
# Average Hourly Earnings
ahe.resp = na.omit(cps04$ahe)
n = length(ahe.resp)
sigma = sd(ahe.resp)
sem = sigma/sqrt(n)
xbar <- mean(ahe.resp); xbar
## [1] 16.7712
## [1] 16.7212 16.8212
# Bachelor's degree (the bachelor variable, coded 1/0)
bach.response = na.omit(cps04$bachelor)
n = length(bach.response)
m = sum(bach.response == "1"); m                      # number with a bachelor's degree
E = qnorm(.975)*sd(bach.response)/sqrt(n); E          # margin of error for the proportion
pbar = m/n; pbar                                      # sample proportion
pbar + c(-E, E)                                       # 95% confidence interval
## [1] 3640
## [1] 0.01092388
## [1] 0.4557976
## [1] 0.4448738 0.4667215
# Female
female.response = na.omit(cps04$female)
n = length(female.response)
f = sum(female.response == "1"); f                    # number of female respondents
E = qnorm(.975)*sd(female.response)/sqrt(n); E        # margin of error for the proportion
pbar = f/n; pbar                                      # sample proportion of females
pbar + c(-E, E)                                       # 95% confidence interval
## [1] 3313
## [1] 0.01080662
## [1] 0.414851
## [1] 0.4040444 0.4256576
# Age
age.resp = na.omit(cps04$age)
n = length(age.resp)
sigma = sd(age.resp)
sem = sigma/sqrt(n)
E = qnorm(.975)*sem; E               # margin of error
xbar = mean(age.resp); xbar          # sample mean age
xbar + c(-E, E)                      # 95% confidence interval
## [1] 0.06340892
## [1] 29.75445
## [1] 29.69104 29.81785
Please explain what you observe in these results.
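As an optional cross-check (not part of the original exercise), base R’s prop.test gives a similar interval for the proportion of females; it uses a score-based method with continuity correction, so its endpoints will differ slightly from the normal-approximation interval above.
# Cross-check of the female-proportion interval; f was computed above, and the
# count of non-missing values is recomputed here rather than reusing n
prop.test(f, length(na.omit(cps04$female)))$conf.int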
Assume you have access to data on an entire population, say the size of every house in all residential home sales in Ames, Iowa between 2006 and 2010. Then it is straightforward to answer questions like “How big is the typical house in Ames?” or “How much do house sizes vary?”
But if you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make inferences about what the population looks like.
To access the data in R, type the following code:
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
In this case study we’ll start with a simple random sample of size 60 from the population. Note that the data set has information on many housing variables, but for this portion of the lab we’ll focus on the size of the house, represented by the variable Gr.Liv.Area.
# set the seed so the outputs in this assignment are reproducible
set.seed(0)
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
samp
## [1] 2200 2093 1040 2233 1523 1660 1555 1102 848 1136 2061 1122 960 1092 2610
## [16] 2217 1959 2334 1660 1576 848 2004 988 1500 874 1340 1800 1069 1456 784
## [31] 985 1928 882 1124 1639 1214 1434 1150 1544 1812 1511 1949 1077 1248 1480
## [46] 1320 1717 1367 928 2552 1953 693 2690 2276 1173 1258 2582 1558 672 1488
As usual, before analyzing the data further, it is important to visualize it. Here we work with the random sample of size 60 drawn from the population.
# Make a histogram of your sample
hist(samp, main = "Distribution of Samp",
     col = "deeppink3",
     xlim = c(200, 3500),
     freq = F,
     xlab = "Samp")
# ...and add a normal density curve based on the sample mean and sd
curve(dnorm(x, mean = mean(samp), sd = sd(samp)),
      add = T, col = "blue", lwd = 2)
Your Challenge: describe the distribution of your sample. What would you say is the “typical” size within your sample, how does it compare to the population, and would you expect another student’s sample to look the same?
summary(samp)                        # numerical summary of the sample
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 672 1100 1484 1514 1933 2690
mean(population)                     # the true population mean, for comparison
## [1] 1499.69
The distribution of this sample appears to be right skewed. The sample mean is quite close to the original population mean of 1499.6904437, so in that sense the sample is fairly “typical”; here a “typical” value means one that lies near where most of the values in the distribution are concentrated.
Another student’s sample would not be identical to this one, since it is a different set of 60 randomly selected observations, but we would expect it to be broadly similar. How similar is “similar”, and how near is “near”? If the sample means land close to one another, then yes, the samples can look much alike.
One of the most common ways to describe the typical or central value of a distribution is the mean. In this case we can calculate the mean of the sample using:
sample_mean <- mean(samp); sample_mean
## [1] 1514.133
Return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it sample_mean
). That serves as a good point estimate but it would be useful to also communicate how uncertain we are of that estimate. This can be captured by using a confidence interval.
We can calculate a 95% confidence interval for a sample mean by adding and subtracting 1.96 standard errors to the point estimate (assuming you are already familiar with this formula).
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1381.351 1646.915
This is an important inference that we’ve just made: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames
lies between the values lower and upper. There are a few conditions that must be met for this interval to be valid.
Your Challenge: what conditions must be met for this confidence interval to be valid, and what does “95% confidence” mean here?
The sample must consist of at least 30 independent observations, and the data should not be strongly skewed; under these conditions the distribution of the sample mean is well approximated by a normal model.
The 95% refers to the confidence level of the interval under the normal model with standard error SE. The confidence interval for the population parameter is \( \text{point estimate} \pm z^{\star}\,SE \), where \(z^{\star}\) corresponds to the confidence level selected.
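As an informal check of these conditions for our sample (a sketch, not part of the original lab):
length(samp) >= 30                   # at least 30 independent observations? TRUE here, since the sample size is 60
hist(samp)                           # quick visual check that the data are not strongly skewed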
Does your confidence interval capture the true average size of houses in Ames? If you are working on this case study with classmates, does their interval capture this value as well? Yes, the confidence interval above captures the true average size of houses in Ames, and I would expect my classmates’ intervals to capture the mean value on their labs as well.
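As a quick programmatic check (a small sketch, not part of the original lab), we can ask R directly whether the population mean lies inside the interval computed above:
# TRUE if the 95% interval [lower, upper] computed earlier contains the population mean
mean(population) >= lower & mean(population) <= upper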
Let’s simulate this confidence interval scenario in a classroom, to see how often the intervals capture the true average size of houses in Ames. Suppose we have 100 students in the classroom, each of whom draws their own sample of 60 houses.
count = 0
for (i in 1:100) {
  samp <- sample(population, 60)       # each student draws their own sample of 60 houses
  samp_mean <- mean(samp)
  se <- sd(samp) / sqrt(60)
  lower <- samp_mean - 1.96 * se
  upper <- samp_mean + 1.96 * se
  if ((lower <= 1499.69) & (upper >= 1499.69)) {   # does this interval capture the population mean?
    count = count + 1
  }
}
count
## [1] 97
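Since each interval is constructed at the 95% confidence level, we expect about 0.95 × 100 = 95 of the 100 students’ intervals to capture the true mean; observing 97 here is well within ordinary simulation variability.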
Using R, we’re going to recreate many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here (If you are unfamiliar with loops, review the Sampling Distribution Lab).
Here is the rough outline: obtain a random sample, calculate the sample’s mean and standard deviation, repeat that many times, and then use the stored statistics to construct confidence intervals.
But before we do all of this, we need to first create empty vectors where we can save the means and standard deviations that will be calculated from each sample. And while we’re at it, let’s also store the desired sample size as \(n\).
samp_mean <- rep(NA, 50)             # empty vector to hold the 50 sample means
samp_sd <- rep(NA, 50)               # empty vector to hold the 50 sample standard deviations
n <- 60; n                           # the desired sample size
## [1] 60
Now we’re ready for the loop where we calculate the means and standard deviations of 50 random samples.
for(i in 1:50){
samp <- sample(population, n) # obtain a sample of size n = 60 from the population
samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
samp
## [1] 2206 3672 2270 1786 1041 2614 1655 1378 1250 1884 1358 764 1176 1595 1419
## [16] 1620 1299 1097 1073 1647 1220 1086 1928 1412 1091 2263 1968 1261 1538 793
## [31] 1337 1768 1604 1609 1479 980 480 816 951 1069 1709 1742 2237 1458 864
## [46] 1665 1778 1949 1040 1414 954 1142 1614 1368 5642 1383 1242 816 2082 1728
Lastly, we construct the confidence intervals.
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
Lower bounds of these 50 confidence intervals are stored in lower_vector
, and the upper bounds are in upper_vector
. Let’s view the first interval.
c(lower_vector[1], upper_vector[1])
## [1] 1400.415 1718.352
The 1.96 multiplier is the pair of 95% critical values from the standard normal distribution:
qnorm(.025)
## [1] -1.959964
qnorm(.975)
## [1] 1.959964
Your Challenge: what proportion of your 50 confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.
Vector <- data.frame(lower_vector, upper_vector)
meanp <- mean(population)
left <- sum(Vector$upper_vector < meanp)
right <- sum(Vector$lower_vector > meanp)
noMeanIncluded <- left + right
noMeanIncluded
## [1] 2
In this case 2 of the 50 intervals fail to capture the population mean, so 96% of them include it. This proportion is not necessarily the same as our confidence level, but it is a close approximation of it.
Let’s pick a 99% confidence level. For this, the critical value is qnorm(.995), which is about 2.58.
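Note that plot_ci is not a base R function; it is a helper distributed with the OpenIntro lab materials. If it is not already available in your session, it is typically obtained along the lines of the sketch below (the URL follows the OpenIntro labs and may have changed):
# plot_ci is assumed to come from the OpenIntro lab materials; the URL may differ
download.file("http://www.openintro.org/stat/data/plot_ci.R", destfile = "plot_ci.R")
source("plot_ci.R")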
Your Challenge: using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?
lower_vector <- samp_mean - qnorm(.995) * samp_sd / sqrt(n)
upper_vector <- samp_mean + qnorm(.995) * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))
Vector <- data.frame(lower_vector, upper_vector)
meanp <- mean(population)
left <- sum(Vector$upper_vector < meanp)
right <- sum(Vector$lower_vector > meanp)
noMeanIncluded <- left + right
noMeanIncluded
With these wider 99% intervals we expect about 99% of them to capture the population mean, so typically at most one of the 50 intervals misses it. As before, the observed proportion is not necessarily the same as our confidence level, but it should be a close approximation of it.