Your Exercise
In this section, you are expected to get familiar with confidence intervals through the following exercises:
Exercise 1
Find a point estimate of the average university student Age using the sample data from survey.
Solution:
First, we load the survey data with library(MASS). Then we save the student ages from the survey data. To find the point estimate of the average student age, we use the mean function. Not all students filled in the age field, so we pass na.rm=TRUE to drop the missing values.
## [1] 20.37451
The point estimate of the average student age is 20.37451 years.
## [1] 19.54600 21.20303
## attr(,"conf.level")
## [1] 0.95
The 95% confidence interval for the average university student age runs from 19.54600 to 21.20303 years. Hence, we are 95% confident that the true population mean lies in this interval, which is centered on the point estimate of 20.37451 years.
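The code for this exercise is not shown above, so the following is only a minimal sketch of how the two results could be obtained. It assumes the interval was produced with t.test (the `conf.level` attribute in the output suggests this); the variable names `age`, `x_bar` and `ci` are illustrative.

```r
library(MASS)                          # provides the survey data frame

age   <- survey$Age                    # student ages from the survey
x_bar <- mean(age, na.rm = TRUE)       # point estimate of the mean age
x_bar

# 95% t-based confidence interval for the mean;
# t.test() removes missing values on its own
ci <- t.test(age, conf.level = 0.95)$conf.int
ci
```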
Exercise 2
Assume the population standard deviation \(\sigma\) of the student Age in data survey is 7. Find the margin of error and interval estimate at 95% confidence level.
Solution:
We need to filter out the missing values in survey$Age by using the na.omit function and save the result as age.response. After that, we compute the standard error of the mean, \(\sigma/\sqrt{n}\). Since the normal distribution has two tails, a 95% confidence level corresponds to the 97.5th percentile of the normal distribution at the upper tail. So, to get the margin of error, we multiply qnorm(0.975) by the standard error of the mean.
## [1] 0.8911934
The margin of error is 0.8911934 years. We then add it to and subtract it from the sample mean to find the confidence interval.
## [1] 20.37451
## [1] 19.48332 21.26571
Assuming a population standard deviation of 7, the margin of error of the student age at the 95% confidence level is 0.8911934 years. The confidence interval in this case runs from 19.48332 to 21.26571 years.
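The steps described in this solution can be sketched as follows. The name `age.response` comes from the text; `n`, `sigma`, `sem`, `E`, `xbar` and `ci` are illustrative names, not necessarily the ones used in the original code.

```r
library(MASS)

age.response <- na.omit(survey$Age)   # drop missing ages
n     <- length(age.response)         # sample size
sigma <- 7                            # population SD assumed by the exercise
sem   <- sigma / sqrt(n)              # standard error of the mean
E     <- qnorm(0.975) * sem           # margin of error at 95% confidence
E

xbar <- mean(age.response)            # sample mean
ci   <- xbar + c(-E, E)               # interval estimate
ci
```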
\(Alternative\) \(Solution\)
For the alternative solution, we can use the z.test function in the TeachingDemos package. It also has to be installed and loaded into the workspace.
## Warning: package 'TeachingDemos' was built under R version 3.6.3
##
## One Sample z-test
##
## data: age.response
## z = 44.809, n = 237.0000, Std. Dev. = 7.0000, Std. Dev. of the sample
## mean = 0.4547, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 19.48332 21.26571
## sample estimates:
## mean of age.response
## 20.37451
Exercise 3
Without assuming the population standard deviation \(\sigma\) of the student Age in survey, find the margin of error and interval estimate at 95% confidence level.
Solution:
We first load the MASS package to get the survey data. After that, we filter out the missing values using na.omit, store the sample size using the length function, and compute the sample standard deviation. Then we estimate the standard error and the margin of error at the 95% confidence level, using the 97.5th percentile of the t distribution with \(n - 1\) degrees of freedom.
## [1] 0.8957872
We find that the margin of error at the 95% confidence level is 0.8957872 years.
## [1] 20.37451
## [1] 19.47873 21.27030
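The margin of error for the student age is 0.8957872 years at the 95% confidence level, and the confidence interval runs from 19.47873 to 21.27030 years. This interval is slightly wider than the one reported by t.test in the alternative solution below (19.54600 to 21.20303). The reported margin corresponds to a standard error of about 0.4547, the value obtained with \(\sigma = 7\) in Exercise 2, so the manual calculation appears to have reused that assumed value instead of the sample standard deviation.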
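As a hedged sketch of the manual calculation with the sample standard deviation: the variable names below are illustrative, and the original code is not shown. Because this is the same formula that t.test applies, the resulting interval should agree with the t.test interval in the alternative solution.

```r
library(MASS)

age.response <- na.omit(survey$Age)      # drop missing ages
n    <- length(age.response)             # sample size
s    <- sd(age.response)                 # sample standard deviation
sem  <- s / sqrt(n)                      # estimated standard error of the mean
E    <- qt(0.975, df = n - 1) * sem      # margin of error from the t distribution
E

xbar <- mean(age.response)
ci   <- xbar + c(-E, E)                  # interval estimate
ci
```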
\(Alternative\) \(Solution\)
##
## One Sample t-test
##
## data: age.response
## t = 48.447, df = 236, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 19.54600 21.20303
## sample estimates:
## mean of x
## 20.37451
Exercise 4
Improve the quality of a sample survey by increasing the sample size, with unknown standard deviation \(\sigma\).
Solution:
Since we don’t know the standard deviation and want to improve the quality of the sample survey, we assume a planned proportion of one half, which gives the maximum variability. So, p is 0.5. Now, suppose we want a 5% margin of error at a 95% confidence level, which gives a z value of 1.96.
## [1] 384.1459
So, we need a sample size of at least 384.1459, which rounds up to 385, to improve the quality of a sample survey with unknown standard deviation \(\sigma\).
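The sample size comes from the formula \(n = z^2 \, p(1-p) / E^2\). A minimal sketch, with illustrative variable names:

```r
p <- 0.5                             # planned proportion (maximum variability)
E <- 0.05                            # desired margin of error
z <- qnorm(0.975)                    # critical value for 95% confidence

n_raw <- z^2 * p * (1 - p) / E^2     # required sample size before rounding
n_raw
ceiling(n_raw)                       # round up so the margin is not exceeded
```

Rounding up rather than to the nearest integer keeps the achieved margin of error at or below the 5% target.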
Exercise 5
Assume you don’t have a planned proportion estimate. Find the sample size needed to achieve a 5% margin of error for the male student survey at the 95% confidence level.
Solution:
What we know:
- 5% margin of error
- 95% confidence level
So, we can get \(z\) = 1.96.
First we need to find the number of male students. We can count them with the sum function and divide the count by n to find the proportion of male students in this sample survey.
## [1] 118
## [1] 0.5
The number of male students is 118, and the proportion of male students is 0.5.
Now, we want to find the sample size needed to achieve a 5% margin of error for the male student survey at the 95% confidence level.
## [1] 384.1459
Then, we find that we need a sample size of at least 384.1459, which rounds up to 385, to achieve a 5% margin of error for the male student survey at the 95% confidence level.
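A possible way to carry out these steps is sketched below. It assumes that n counts only the students who reported their sex, which is consistent with the proportion of exactly 0.5; the helper `sample_size_prop` is a hypothetical function written for this illustration, not part of any package.

```r
library(MASS)

sex   <- na.omit(survey$Sex)         # drop students with missing Sex
k     <- sum(sex == "Male")          # number of male students
n     <- length(sex)                 # number of non-missing responses
p_hat <- k / n                       # sample proportion of male students
k
p_hat

# Hypothetical helper: sample size for a proportion, rounded up
sample_size_prop <- function(E, conf_level = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  ceiling(z^2 * p * (1 - p) / E^2)
}

sample_size_prop(E = 0.05, p = p_hat)
```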
Exercise 6
Perform confidence intervals analysis on this data set from 2004 that includes data on average hourly earnings, marital status, gender, and age for thousands of people.
\(Solution:\)
First we read the cps04 csv file. Then we find the confidence intervals for the average hourly earnings, marital status, gender, and age.
## [1] 0.1920964
## [1] 16.7712
## [1] 16.57911 16.96330
From the output above, the margin of error of the average hourly earnings is 0.1920964. The sample mean (xbar) is 16.7712, and the confidence interval runs from 16.57911 to 16.96330.
## [1] 0.06340892
## [1] 29.75445
## [1] 29.69104 29.81785
From the output above, the margin of error of age is 0.06340892. The sample mean (xbar) is 29.75445, and the confidence interval runs from 29.69104 to 29.81785 years.
## [1] 3313
## [1] 0.01080662
## [1] 0.414851
## [1] 0.4040444 0.4256576
From the output above, the number of female participants is 3313 and the margin of error of the female proportion is 0.01080662. The sample proportion is 0.414851, and the confidence interval runs from 0.4040444 to 0.4256576. Because the whole interval lies below 0.5, we know that there are more male than female participants.
## [1] 3640
## [1] 0.01092388
## [1] 0.4557976
## [1] 0.4448738 0.4667215
From the output above, the number of bachelor participants is 3640 and the margin of error of the bachelor proportion is 0.01092388. The sample proportion is 0.4557976, and the confidence interval runs from 0.4448738 to 0.4667215. Because the whole interval lies below 0.5, we know that there are more non-bachelor than bachelor participants.
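The cps04 file and the original code are not reproduced here, so the following only sketches how a confidence interval for a proportion can be computed from a count. The total of 7986 observations is an assumption, inferred from dividing the female count (3313) by the reported proportion (0.414851); all variable names are illustrative.

```r
x <- 3313                            # number of female participants (from the output above)
n <- 7986                            # ASSUMPTION: total rows, inferred as 3313 / 0.414851

p_hat <- x / n                       # sample proportion
E     <- qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n)   # 95% margin of error
ci    <- p_hat + c(-E, E)            # interval estimate for the proportion

p_hat
E
ci
```

The same pattern applies to any 0/1 variable; for a numeric variable such as earnings or age, the standard error would be the sample standard deviation divided by \(\sqrt{n}\) instead.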
Case Study
Assume you have access to data on an entire population, say the size of every house in all residential home sales in Ames, Iowa between 2006 and 2010. Then it’s straightforward to answer questions like:
- How big is the typical house in Ames?
- How much variation is there in the sizes of houses?
- What is the average price of a house in Ames?
- What is a confidence interval for the price of a house in Ames?
But if you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make inferences about what your population looks like.
Collect Data
To access the data in R, type the following code:
In this case study we’ll start with a simple random sample of size 60 from the population. Note that the data set has information on many housing variables, but for the first portion of the lab we’ll focus on the size of the house, represented by the variable Gr.Liv.Area.
Visualization
As usual, before you begin to analyze your data further, it’s important to visualize it. Here, we use a random sample of size 60 from the population.

# Make a histogram of your sample (density scale, so the curve can be overlaid)
hist(samp, main = "Distribution of Samp",
     col = "deeppink3",
     xlim = c(200, 3500),
     freq = FALSE,
     xlab = "Samp")
# ...and add a normal density curve with the sample's mean and sd
curve(dnorm(x,
            mean = mean(samp),
            sd = sd(samp)), add = TRUE,
      col = "blue", lwd = 2)

Your Challenge:
- Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.
- Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?
\(Answer\)
- My sample distribution is right skewed. I take the “typical” size to be the mean of my sample, which we can find with the summary function. So, the “typical” size within my sample is 1514. Here, the mean serves as a central value around which most of the observations in the distribution lie.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 672 1100 1484 1514 1933 2690
- I would expect another student’s distribution to be similar to mine but not identical, because each sample consists of 60 randomly selected observations.
Confidence Intervals
One of the most common ways to describe the typical or central value of a distribution is to use the mean. In this case we can calculate the mean of the sample using,
## [1] 1514.133
Return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it sample_mean). That serves as a good point estimate but it would be useful to also communicate how uncertain we are of that estimate. This can be captured by using a confidence interval.
We can calculate a 95% confidence interval for a sample mean by adding and subtracting 1.96 standard errors to the point estimate (I assume that you are already familiar with this formula).
## [1] 1381.351 1646.915
This is an important inference that we’ve just made: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the values lower and upper. There are a few conditions that must be met for this interval to be valid.
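The formula above can be sketched in R as follows. The 60 living areas used here are the values printed in the Simulation part of this section, reused purely as example data; they are not necessarily the sample behind the interval reported above. The names `se`, `lower` and `upper` follow the wording of the text but are otherwise an assumption.

```r
# Example data: 60 living areas (square feet) printed later in this section
samp <- c(2206, 3672, 2270, 1786, 1041, 2614, 1655, 1378, 1250, 1884,
          1358,  764, 1176, 1595, 1419, 1620, 1299, 1097, 1073, 1647,
          1220, 1086, 1928, 1412, 1091, 2263, 1968, 1261, 1538,  793,
          1337, 1768, 1604, 1609, 1479,  980,  480,  816,  951, 1069,
          1709, 1742, 2237, 1458,  864, 1665, 1778, 1949, 1040, 1414,
           954, 1142, 1614, 1368, 5642, 1383, 1242,  816, 2082, 1728)

sample_mean <- mean(samp)                     # point estimate
se    <- sd(samp) / sqrt(length(samp))        # standard error of the mean
lower <- sample_mean - 1.96 * se              # lower bound of the 95% interval
upper <- sample_mean + 1.96 * se              # upper bound of the 95% interval
c(lower, upper)
```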
Your Challenge:
- For the confidence interval to be valid, the sample mean must be normally distributed and have standard error \(s/\sqrt{n}\). What conditions must be met for this to be true?
- What does “95% confidence” mean?
- Does your confidence interval capture the true average size of houses in Ames? If you are working on this case study, does your classmate’s interval capture this value?
\(Answer\)
The conditions that must be met are that the sample contains at least thirty independent observations and that the data are not too skewed. Under these conditions, the distribution of the sample mean is well approximated by a normal model.
“95% confidence” describes the procedure rather than any single interval: if we drew many random samples and built an interval from each in the same way, about 95% of those intervals would contain the true population mean. It does not mean a 5% margin of error. The formula of the confidence interval is \(\text{point estimate} \pm z^{*} \times SE\), where \(z^{*}\) is the critical value corresponding to the confidence level.
Yes, my confidence interval captures the true average size of houses in Ames. My neighbour’s interval will most likely capture this value as well, although this is not guaranteed, since about 5% of such intervals are expected to miss it.
Simulation
Let’s simulate confidence intervals in a classroom setting to see how often they capture the true average size of houses in Ames. Suppose we have 100 students in the classroom.
## [1] 97
Using R, we’re going to recreate many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here (If you are unfamiliar with loops, review the Sampling Distribution Lab).
Here is the rough outline:
- Obtain a random sample.
- Calculate and store the sample’s mean and standard deviation.
- Repeat steps (1) and (2) 50 times.
- Use these stored statistics to calculate many confidence intervals.
But before we do all of this, we need to first create empty vectors where we can save the means and standard deviations that will be calculated from each sample. And while we’re at it, let’s also store the desired sample size as \(n\).
## [1] 60
Now we’re ready for the loop where we calculate the means and standard deviations of 50 random samples.
## [1] 2206 3672 2270 1786 1041 2614 1655 1378 1250 1884 1358 764 1176 1595 1419
## [16] 1620 1299 1097 1073 1647 1220 1086 1928 1412 1091 2263 1968 1261 1538 793
## [31] 1337 1768 1604 1609 1479 980 480 816 951 1069 1709 1742 2237 1458 864
## [46] 1665 1778 1949 1040 1414 954 1142 1614 1368 5642 1383 1242 816 2082 1728
Lastly, we construct the confidence intervals.
Lower bounds of these 50 confidence intervals are stored in lower_vector, and the upper bounds are in upper_vector. Let’s view the first interval.
## [1] 1400.415 1718.352
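The outline above can be sketched as the loop below. Since the Ames data is not loaded in this sketch, `population` is a simulated stand-in (an assumption made only so the code runs on its own); with the real data it would be the Gr.Liv.Area column. The names `samp_mean` and `samp_sd` are illustrative, while `lower_vector` and `upper_vector` follow the text.

```r
set.seed(42)                                   # make the sketch reproducible

# ASSUMPTION: simulated, right-skewed stand-in for the house sizes
population <- round(rlnorm(3000, meanlog = log(1400), sdlog = 0.35))

n         <- 60                                # sample size
n_samples <- 50                                # number of samples to draw
samp_mean <- rep(NA, n_samples)                # empty vectors for the statistics
samp_sd   <- rep(NA, n_samples)

for (i in seq_len(n_samples)) {
  samp         <- sample(population, n)        # step 1: obtain a random sample
  samp_mean[i] <- mean(samp)                   # step 2: store its mean
  samp_sd[i]   <- sd(samp)                     #         and standard deviation
}

# Construct the 95% confidence intervals from the stored statistics
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

c(lower_vector[1], upper_vector[1])            # view the first interval

# Proportion of intervals that contain the true population mean
covered <- lower_vector <= mean(population) & upper_vector >= mean(population)
mean(covered)
```

Storing only the means and standard deviations keeps the loop small and lets intervals at other confidence levels be computed later without drawing new samples.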

## [1] -1.959964
## [1] 1.959964
Your Challenge:
- What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.
- Pick a confidence level of your choosing, provided it is not 99%. What is the appropriate critical value?
- Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?
\(Answer\)
- About 97% of my confidence intervals include the true population mean. This proportion is not exactly equal to the 95% confidence level, only close to it, because every interval is built from a different random sample, so the observed coverage varies from one set of samples to another.
## [1] 0.03333333
## [1] 0.03
- Pick a confidence level of your choosing, provided it is not 99%. What is the appropriate critical value?
Solution:
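Because we cannot choose 99%, let’s choose 70%. The critical value for a 70% confidence level is qnorm(0.85), about 1.036. Note that the value printed below, 2.326348, equals qnorm(0.99), which is the critical value for a 98% confidence level.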
## [1] 2.326348
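For a two-sided interval, the critical value is the normal quantile at \(1 - (1 - \text{level})/2\). A small sketch, where `critical_value` is a hypothetical helper written for this illustration:

```r
# Hypothetical helper: two-sided normal critical value for a confidence level
critical_value <- function(conf_level) {
  qnorm(1 - (1 - conf_level) / 2)
}

z_70 <- critical_value(0.70)    # same as qnorm(0.85)
z_95 <- critical_value(0.95)    # same as qnorm(0.975)
z_98 <- critical_value(0.98)    # same as qnorm(0.99)
c(z_70, z_95, z_98)
```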
- Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?

## [1] 1475.033 1643.734
## [1] 16
## [1] 0.2666667
## [1] 0.27
So, in this answer 16 intervals miss the true population mean, a proportion of about 0.27, which means that about 73% of the intervals include it. From this we know that the proportion is not exactly the same as the 70% confidence level, but it is close.
