Please watch this video to get some ideas about Confidence Intervals (CI).
This video shows how you can apply confidence intervals in business.
In this section, you are expected to become familiar with confidence interval exercises:
Find a point estimate of the average university student age
using the sample data from survey.
# Load the MASS package, which provides the survey data set
library(MASS)
# Save the student ages from the survey data
ageofsurvey = survey$Age
# Find the point estimate of student age
# As it turns out, not all students filled in the age field, so we have to
# filter out the missing values by setting the na.rm argument to TRUE
mean(ageofsurvey, na.rm = TRUE)
## [1] 20.37451
The code above gives a point estimate of the mean age of 20.37451 years.
point.estimate <- t.test(ageofsurvey, conf.level = 0.95)
point.estimate$conf.int
## [1] 19.54600 21.20303
## attr(,"conf.level")
## [1] 0.95
The 95% confidence interval for the mean student age is
19.54600 to 21.20303 years. So we are 95% confident that this interval contains
the true population mean; the sample mean of 20.37451 years lies at its center.
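The object returned by t.test carries both the point estimate and the interval, so the two can be read from a single call. The following is a minimal sketch, assuming the same MASS survey data; the names res, point_est, ci, and margin are illustrative and do not appear in the original.

```r
library(MASS)  # provides the survey data set

# t.test drops missing values itself, so no na.rm argument is needed
res <- t.test(survey$Age, conf.level = 0.95)

point_est <- unname(res$estimate)   # sample mean
ci <- as.numeric(res$conf.int)      # lower and upper bounds
margin <- diff(ci) / 2              # half-width of the interval

c(estimate = point_est, lower = ci[1], upper = ci[2], margin = margin)
```

Because a t interval is symmetric, the point estimate should sit exactly midway between the two bounds.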
Assume the population standard deviation \(\sigma\) of the student Age in
the survey data is 7. Find the margin of error and the interval
estimate at the 95% confidence level.
library(MASS)
age.response = na.omit(survey$Age)
n = length(age.response)
# population standard deviation
sigma = 7
# standard error
SE = sigma/sqrt(n)
# margin of error
E = qnorm(0.975)*SE
E
## [1] 0.8911934
So the margin of error is 0.8911934 years. Next, we add it to and subtract it from the sample mean to find the confidence interval.
# sample mean
xbar = mean(age.response)
xbar
## [1] 20.37451
# Confidence interval
xbar + c(-E, E)
## [1] 19.48332 21.26571
Assuming a population standard deviation of 7, the margin of error for student age at the 95% confidence level is 0.8911934 years. The confidence interval in this case runs from 19.48332 to 21.26571 years.
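The steps above (standard error, critical value, margin of error, interval) can be wrapped in one function. This is only a sketch: z_interval is a hypothetical helper name, and it assumes the population standard deviation is supplied by the caller.

```r
# Hypothetical helper: z interval for a mean when sigma is treated as known
z_interval <- function(x, sigma, conf = 0.95) {
  x <- x[!is.na(x)]                      # drop missing values
  z_star <- qnorm(1 - (1 - conf) / 2)    # two-sided critical value
  margin <- z_star * sigma / sqrt(length(x))
  c(lower = mean(x) - margin, upper = mean(x) + margin, margin = margin)
}

library(MASS)  # provides the survey data set
z_interval(survey$Age, sigma = 7)
```

With sigma = 7 and the default 95% level, this should reproduce the margin of error and interval computed above.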
Without assuming the population standard deviation \(\sigma\) of the student Age in
survey, find the margin of error and interval estimate at 95% confidence
level.
# Load the MASS package
library(MASS)
# Filter out the missing values
age.response = na.omit(survey$Age)
# Sample size
n = length(age.response)
# Sample standard deviation (fixed at 7 here, as in the previous exercise;
# sd(age.response) would estimate it from the data)
s = 7
# Estimated standard error
SE = s/sqrt(n)
# Margin of error for a two-sided 95% confidence interval (t critical value)
E = qt(0.975, df = n - 1)*SE
E
## [1] 0.8957872
We find that the margin of error for the 95% confidence interval is 0.8957872 years.
# sample mean
xbar = mean(age.response)
xbar
## [1] 20.37451
# Confidence interval
xbar + c(-E, E)
## [1] 19.47873 21.27030
The margin of error for the surveyed student age is 0.8957872 years at the 95% confidence level, and the confidence interval runs from 19.47873 to 21.27030 years.
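The calculation above plugs in the value 7 for s. When \(\sigma\) is truly unknown, s is usually estimated from the sample itself. The sketch below makes that substitution with sd(age.response) and checks the result against t.test; the names manual_ci and builtin_ci are illustrative.

```r
library(MASS)  # provides the survey data set

age.response <- na.omit(survey$Age)
n <- length(age.response)

# Estimate the standard deviation from the sample instead of fixing it at 7
s <- sd(age.response)
E <- qt(0.975, df = n - 1) * s / sqrt(n)

# The resulting interval should agree with the one from t.test
manual_ci <- mean(age.response) + c(-E, E)
builtin_ci <- as.numeric(t.test(age.response, conf.level = 0.95)$conf.int)
all.equal(manual_ci, builtin_ci)
```

With these data the result should match the t.test interval of 19.54600 to 21.20303 reported earlier, which is slightly narrower than the interval obtained with s = 7.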
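Improve the quality of a sample survey by increasing the
sample size when the standard deviation \(\sigma\) is unknown. The calculation assumes a proportion of 0.5 and a 5% margin of error at the 95% confidence level.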
zstar = qnorm(0.975)
(zstar^2 * 0.5 * 0.5) / 0.05^2
## [1] 384.1459
So we obtain 384.1459, which rounds up to a sample size of 385, to
improve the quality of the sample survey when the standard deviation
\(\sigma\) is unknown.
Assume you don’t have a planned proportion estimate. Find the sample
size needed to achieve a 5% margin of error for the male student
survey at the 95% confidence level.
Solution:
What we know:
library(MASS)
gender.response = na.omit(survey$Sex)
n = length(gender.response)
k = sum(gender.response == "Male")
k
## [1] 118
pbar = k/n
pbar
## [1] 0.5
The number of male students is 118, so the proportion of male students is 0.5.
Now we want to find the sample size needed to achieve a 5% margin of
error for the male student survey at the 95% confidence level.
zstar = qnorm(0.975)
p = 0.5
# Margin of error
E = 0.05
zstar^2*p*(1-p)/E^2
## [1] 384.1459
So we need a sample size of 384.1459, rounded up to 385, to achieve a 5% margin of error for the male student survey at the 95% confidence level.
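The sample size formula \(n = z^{*2} \, p(1-p) / E^2\) can be packaged so that the rounding is handled consistently. The following is a sketch with a hypothetical helper name, sample_size_prop; it rounds up because a fractional respondent is not possible and rounding down would leave the margin of error slightly above the target.

```r
# Hypothetical helper: sample size for estimating a proportion
sample_size_prop <- function(E, conf = 0.95, p = 0.5) {
  z_star <- qnorm(1 - (1 - conf) / 2)
  n_raw <- z_star^2 * p * (1 - p) / E^2
  ceiling(n_raw)   # round up so the margin of error is not exceeded
}

sample_size_prop(E = 0.05)            # conservative choice p = 0.5
sample_size_prop(E = 0.05, p = 0.3)   # a planned estimate away from 0.5 needs fewer
```

The default p = 0.5 maximizes p(1 - p), which is why it is the safe choice when no planned proportion estimate is available.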
Perform a confidence interval analysis on this data set from 2004, which includes data on average hourly earnings, marital status, gender, and age for thousands of people.
cps04 <- read.csv("cps04.csv", header = TRUE, sep = ",")
# Average hourly earnings
avghour.response = na.omit(cps04$ahe)
n = length(avghour.response)
# Sample standard deviation
sigma = sd(avghour.response)
# Standard error of the mean
SE = sigma/sqrt(n)
# Margin of error
E = qnorm(0.975)*SE
E
## [1] 0.1920964
xbar <- mean(avghour.response)
xbar
## [1] 16.7712
xbar + c(-E, E)
## [1] 16.57911 16.96330
From the code above, the margin of error for the mean hourly earnings is 0.1920964. The sample mean xbar is 16.7712, and the confidence interval runs from 16.57911 to 16.96330.
# Age
age.respon = na.omit(cps04$age)
n = length(age.respon)
# Sample standard deviation
sigma = sd(age.respon)
# Standard error of the mean
SE = sigma/sqrt(n)
# Margin of error
E = qnorm(0.975)*SE
E
## [1] 0.06340892
xbar <- mean(age.respon)
xbar
## [1] 29.75445
xbar + c(-E, E)
## [1] 29.69104 29.81785
From the code above, the margin of error for age is 0.06340892. The sample mean xbar is 29.75445 years, and the confidence interval runs from 29.69104 to 29.81785 years.
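These intervals use the normal critical value from qnorm even though the standard deviation is estimated from the sample. With thousands of observations this makes almost no difference, because the t critical value approaches the normal one as the sample size grows. A short sketch of that comparison, with sample sizes chosen only for illustration:

```r
# Compare normal and t critical values for a two-sided 95% interval
n_values <- c(10, 30, 100, 1000, 8000)
z_star <- qnorm(0.975)
t_star <- qt(0.975, df = n_values - 1)

data.frame(n = n_values,
           z_star = z_star,
           t_star = t_star,
           ratio = t_star / z_star)
```

For small samples the t value is noticeably larger, so the distinction matters more in exercises like the student age survey than here.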
# Female
female.response = na.omit(cps04$female)
n = length(female.response)
k = sum(female.response == 1)
k
## [1] 3313
# Sample standard deviation
sigma = sd(female.response)
# Standard error of the mean
SE = sigma/sqrt(n)
# Margin of error
E = qnorm(0.975)*SE
E
## [1] 0.01080662
xbar <- mean(female.response)
xbar
## [1] 0.414851
xbar + c(-E, E)
## [1] 0.4040444 0.4256576
We can see that the number of women is 3313 and the margin of error for the proportion of women is 0.01080662. The sample mean xbar, which here is the proportion of women, is 0.414851, and the confidence interval runs from 0.4040444 to 0.4256576. Because the whole interval lies below 0.5, we conclude that there are more male than female participants.
# Bachelor
bachelor.response = na.omit(cps04$bachelor)
n = length(bachelor.response)
k = sum(bachelor.response == 1)
k
## [1] 3640
# Sample standard deviation
sigma = sd(bachelor.response)
# Standard error of the mean
SE = sigma/sqrt(n)
# Margin of error
E = qnorm(0.975)*SE
E
## [1] 0.01092388
xbar <- mean(bachelor.response)
xbar
## [1] 0.4557976
xbar + c(-E, E)
## [1] 0.4448738 0.4667215
We can see that the number of bachelor's degree holders is 3640 and the margin of error for their proportion is 0.01092388. The sample mean xbar is 0.4557976, and the confidence interval runs from 0.4448738 to 0.4667215. Because the whole interval lies below 0.5, we conclude that participants without a bachelor's degree outnumber those with one.
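For 0/1 variables such as female and bachelor, the mean is a sample proportion, and its standard error is more commonly written as \(\sqrt{\hat{p}(1-\hat{p})/n}\). The sketch below applies that form to the female indicator. The count k = 3313 comes from the output above, while n = 7986 is an assumption inferred by dividing k by xbar; it is not stated in the source.

```r
# Counts for the female indicator; n is inferred from k / xbar (an assumption)
k <- 3313
n <- 7986

p_hat <- k / n
se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of a proportion
wald_ci <- p_hat + c(-1, 1) * qnorm(0.975) * se

# Built-in alternative: Wilson score interval without continuity correction
score_ci <- as.numeric(prop.test(k, n, conf.level = 0.95, correct = FALSE)$conf.int)

rbind(wald = wald_ci, score = score_ci)
```

The first row should be nearly identical to the sd-based interval above, since sd divides by n - 1 rather than n, and with a sample this large the score interval is very close as well.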
Assume you have access to data on an entire population, say the size of every house in all residential home sales in Ames, Iowa between 2006 and 2010. Then it is straightforward to answer questions such as how large the typical house is.
But if you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make inferences about what your population looks like.
To access the data in R, type the following code:
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
In this case study we’ll start with a simple random sample of size 60
from the population. Note that the data set has information on many housing
variables, but for the first portion of the lab we’ll focus on the size
of the house, represented by the variable Gr.Liv.Area.
# Set the random seed so that the outputs in this assignment are reproducible
set.seed(0)
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
As usual, before you analyze your data further, it is important to visualize it. Here, we use a random sample of size 60 from the population.
# Histogram
library(moments)
hist(samp, breaks = 20, col = 'pink')
# Make a histogram of your sample
hist(samp, main = "Distribution of Samp",
     col = "deeppink3",
     xlim = c(200, 3500),
     freq = FALSE,
     xlab = "Samp")
# ...and add a density curve
curve(dnorm(x,
            mean = mean(samp),
            sd = sd(samp)), add = TRUE,
      col = "blue", lwd = 2)
Your Challenge:
\(Answer\)
I used the summary function to find the mean. So, the “typical” size
within my sample is 1514. For your information, a typical value is one
that lies near most of the values in the
distribution.
summary(samp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     672    1100    1484    1514    1933    2690
One of the most common ways to describe the typical or central value of a distribution is to use the mean. In this case we can calculate the mean of the sample using,
sample_mean <- mean(samp)
sample_mean
## [1] 1514.133
Return for a moment to the question that first motivated this lab:
based on this sample, what can we infer about the population? Based only
on this single sample, the best estimate of the average living area of
houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it
sample_mean). That serves as a good point estimate but it
would be useful to also communicate how uncertain we are of that
estimate. This can be captured by using a confidence interval.
We can calculate a 95% confidence interval for a sample mean by adding and subtracting 1.96 standard errors to the point estimate (I assume that you are already familiar with this formula).
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1381.351 1646.915
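The constant 1.96 is tied to the 95% level. The sketch below derives the critical value from the confidence level instead, which shows how the interval widens as more confidence is demanded. The helper name mean_ci and the synthetic data are assumptions for illustration; the data are simulated and are not the Ames sample.

```r
# Hypothetical helper: normal-approximation interval for a sample mean
mean_ci <- function(x, conf = 0.95) {
  z_star <- qnorm((1 + conf) / 2)
  se <- sd(x) / sqrt(length(x))
  c(lower = mean(x) - z_star * se, upper = mean(x) + z_star * se)
}

set.seed(1)
# Synthetic stand-in for 60 house sizes (not the Ames data)
fake_sizes <- rlnorm(60, meanlog = 7.3, sdlog = 0.3)

# One column per confidence level: higher levels give wider intervals
sapply(c(0.90, 0.95, 0.99), function(cl) mean_ci(fake_sizes, cl))
```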
This is an important inference that we’ve just made: even though we
don’t know what the full population looks like, we’re 95% confident that
the true average size of houses in Ames lies between the
values lower and upper. There are a few conditions that must be met for
this interval to be valid.
Your Challenge:
Does your confidence interval capture the true average size of houses in
Ames? If you are working on this case study, does
your classmate’s interval capture this value?
\(Answer\)
The conditions that must be met are that the sample contains at least thirty independent observations and that the distribution is not too strongly skewed. More importantly, the sampling distribution of the sample mean must be well approximated by a normal model.
95% confidence means that if we repeatedly drew samples and built an interval from each one, about 95% of those intervals would contain the true population mean. The formula of the confidence interval is point estimate ± \(z^*\) × standard error, where \(z^*\) is the critical value corresponding to the confidence level (1.96 for 95%).
Yes, my confidence interval captures the true average size of houses in Ames, which is 1499.69. A classmate’s interval will most likely capture this value as well, although about 5% of such intervals are expected to miss it.
Let’s simulate a classroom scenario to see how often a confidence interval
captures the true average size of houses in Ames. Suppose we
have 100 students in the classroom, each drawing a sample of 60 houses.
count = 0
for (i in 1:100) {
  samp <- sample(population, 60)
  samp_mean <- mean(samp)
  se <- sd(samp)/sqrt(60)
  lower <- samp_mean - 1.96*se
  upper <- samp_mean + 1.96*se
  # 1499.69 is the true population mean of Gr.Liv.Area
  if ((lower <= 1499.69) && (upper >= 1499.69)) {
    count = count + 1
  }
}
count
## [1] 97
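The same classroom experiment can be run without the Ames file and without a hard-coded population mean. The sketch below is an illustration under stated assumptions: population_sim is a simulated right-skewed population, not the Ames data, and the number of repetitions is raised to 1000 to make the coverage estimate more stable.

```r
set.seed(42)
# Synthetic right-skewed population standing in for the Ames living areas
population_sim <- rlnorm(3000, meanlog = 7.3, sdlog = 0.3)
true_mean <- mean(population_sim)   # computed rather than hard-coded

# TRUE when the 95% interval from one sample of 60 captures the true mean
covers <- replicate(1000, {
  s <- sample(population_sim, 60)
  se <- sd(s) / sqrt(60)
  abs(mean(s) - true_mean) <= qnorm(0.975) * se
})

mean(covers)   # proportion of intervals that capture the true mean
```

The proportion should land near 0.95 but will rarely equal it exactly, which is the same behavior seen in the count of 97 out of 100.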
Using R, we’re going to recreate many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here (If you are unfamiliar with loops, review the Sampling Distribution Lab).
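Here is the rough outline: obtain a random sample, calculate and store its mean and standard deviation, repeat these two steps 50 times, and then use the stored statistics to calculate the confidence intervals.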
But before we do all of this, we need to first create empty vectors where we can save the means and standard deviations that will be calculated from each sample. And while we’re at it, let’s also store the desired sample size as \(n\).
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
n
## [1] 60
Now we’re ready for the loop where we calculate the means and standard deviations of 50 random samples.
for (i in 1:50) {
  samp <- sample(population, n)  # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)     # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)         # save sample sd in ith element of samp_sd
}
# Show the last sample drawn
samp
## [1] 2206 3672 2270 1786 1041 2614 1655 1378 1250 1884 1358 764 1176 1595 1419
## [16] 1620 1299 1097 1073 1647 1220 1086 1928 1412 1091 2263 1968 1261 1538 793
## [31] 1337 1768 1604 1609 1479 980 480 816 951 1069 1709 1742 2237 1458 864
## [46] 1665 1778 1949 1040 1414 954 1142 1614 1368 5642 1383 1242 816 2082 1728
Lastly, we construct the confidence intervals.
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
The lower bounds of these 50 confidence intervals are stored in
lower_vector, and the upper bounds are in
upper_vector. Let’s view the first interval.
c(lower_vector[1], upper_vector[1])
## [1] 1400.415 1718.352
# Confidence interval visualization
plot_ci(lower_vector, upper_vector, mean(population))
# For a 95% confidence interval, the critical values are -1.959964 and 1.959964.
qnorm((1-0.95)/2)
## [1] -1.959964
qnorm((1+0.95)/2)
## [1] 1.959964
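The same qnorm expression gives the critical value for any confidence level. A small sketch, with the levels chosen only for illustration:

```r
# Two-sided critical values for several confidence levels
conf_levels <- c(0.70, 0.90, 0.95, 0.99)
z_star <- qnorm((1 + conf_levels) / 2)

round(data.frame(conf_level = conf_levels, z_star = z_star), 3)
```

Lower confidence levels give smaller critical values and therefore narrower intervals, at the cost of missing the population mean more often.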
Your Challenge:
\(Answer\)
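Vector <- data.frame(lower_vector, upper_vector)
meanpopulation = mean(population)
# Intervals that lie entirely below or entirely above the population mean
left <- sum(Vector$upper_vector < meanpopulation)
right <- sum(Vector$lower_vector > meanpopulation)
NotIncludingMean <- left + right
# Divide by the number of intervals (50), not by the sample size n = 60
proportion <- NotIncludingMean / length(lower_vector)
proportion
## [1] 0.04
proportion <- round(proportion, 2)
proportion
## [1] 0.04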
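A proportion of this kind can also be computed in one step by taking the mean of a logical vector, in which case the denominator is automatically the number of intervals. The sketch below uses small made-up vectors; lower_demo, upper_demo, and mu are hypothetical values, not taken from the Ames samples.

```r
# Toy interval bounds (hypothetical values, not from the Ames samples)
lower_demo <- c(1400, 1510, 1380, 1450, 1300)
upper_demo <- c(1700, 1800, 1490, 1650, 1520)
mu <- 1500   # stand-in for the population mean

# TRUE where an interval contains mu
contains_mu <- lower_demo <= mu & mu <= upper_demo

mean(contains_mu)       # proportion of intervals that include mu
1 - mean(contains_mu)   # proportion of intervals that miss it
```

Here three of the five toy intervals contain mu, so the two proportions are 0.6 and 0.4.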
Solution:
Because we cannot choose 99%, let’s choose 70%
# Critical value for a 70% confidence level
x <- 1 - (70/100)
y <- 1 - (x/2)
cv <- qnorm(y)
cv
## [1] 1.036433
# 1.04 is the critical value for a 70% confidence level, rounded to two decimals
lower_vector <- samp_mean - 1.04 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.04 * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))
c(lower_vector[1], upper_vector[1])
## [1] 1475.033 1643.734
Vector <- data.frame(lower_vector, upper_vector)
meanofpopulation <- mean(population)
left <- sum(Vector$upper_vector < meanofpopulation)
right <- sum(Vector$lower_vector > meanofpopulation)
NotIncludingMean <- left + right
NotIncludingMean
## [1] 16
# Divide by the number of intervals (50), not by the sample size n = 60
proportion <- NotIncludingMean / length(lower_vector)
proportion
## [1] 0.32
proportion <- round(proportion, 2)
proportion
## [1] 0.32
So 32% of the intervals miss the population mean, which means that 68% of them include it. The observed proportion is not exactly the same as the 70% confidence level, but it is very close.