The Data
Load up the data for this lab by inserting a code chunk here:
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
- Describing the Data
- Describe the distribution of your sample. (Be sure to use
appropriate terminology such as symmetric, skewed right/left, unimodal,
bimodal, etc.)
- What would you say is the “typical” house size in your sample? State
precisely what you interpreted “typical” to mean.
population <- ames$Gr.Liv.Area
samp <- sample(population,60)
hist(samp)

mean(samp)
## [1] 1577.717
Solution: The distribution of the sample is unimodal and skewed
right. We interpret “typical” to mean the sample mean, which is
computed above.
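Because the sample is skewed right, the mean and the median can differ, so it may help to look at both before settling on what “typical” means. The sketch below uses a simulated right-skewed vector, `demo_samp`, as a stand-in for `samp`. That stand-in and its parameters are assumptions made so the chunk runs without the Ames file; the numbers it prints are not the lab's.

```r
# Stand-in for `samp`: 60 draws from a right-skewed (log-normal) distribution.
# The parameters are illustrative assumptions, not estimates from the Ames data.
set.seed(42)
demo_samp <- rlnorm(60, meanlog = 7.3, sdlog = 0.3)

# Two common notions of "typical": the mean and the median.
demo_mean <- mean(demo_samp)
demo_median <- median(demo_samp)

# Minimum, quartiles, median, mean, and maximum in one call.
summary(demo_samp)

# If this gap is positive (mean above median), that is consistent with right skew.
demo_mean - demo_median
```

With the real data, replacing `demo_samp` with `samp` gives the same comparison for the lab's sample.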
- Would you expect another group’s sample distribution to be identical
to yours? Would you expect it to be similar? Why or why not?
Solution: Because sampling is random, we shouldn’t expect another
group’s distribution to be identical. However, it should be somewhat
similar because both samples come from the same population and are
reasonably large (60 observations each).
Confidence intervals
- Create a variable called se that contains the standard error. Then
create a variable called za that has the critical value for a 95%
confidence interval. Finally, create a variable called lower that has
the lower bound of the confidence interval and a variable called upper
that has the upper bound of the confidence interval. After doing that,
use the command c(lower, upper) to show the confidence interval you just
created.
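sample_mean <- mean(samp)
# Critical value for a 95% interval (upper 2.5% point of the standard normal)
za <- qnorm(0.025, lower.tail = FALSE)
# Standard error; uses the population SD, which is known in this lab
se <- sd(population) / sqrt(60)
lower <- sample_mean - za * se
upper <- sample_mean + za * se
c(lower, upper)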
## [1] 1449.808 1705.626
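The interval above uses the population standard deviation, which is available here only because the whole population is loaded. When sigma is unknown, the sample standard deviation is the usual substitute. A minimal sketch of that variant follows; `ci_mean` is a hypothetical helper name, and `demo_samp` is simulated stand-in data rather than the Ames sample.

```r
# Hypothetical helper: normal-based confidence interval for a mean,
# using the sample standard deviation as the estimate of sigma.
ci_mean <- function(x, level = 0.95) {
  za <- qnorm((1 - level) / 2, lower.tail = FALSE)  # critical value
  se <- sd(x) / sqrt(length(x))                     # estimated standard error
  c(lower = mean(x) - za * se, upper = mean(x) + za * se)
}

set.seed(1)
demo_samp <- rnorm(60, mean = 1500, sd = 500)  # stand-in data (assumption)

ci_mean(demo_samp)                # 95% interval
ci_mean(demo_samp, level = 0.99)  # wider interval at a higher confidence level
```

Calling `ci_mean(samp)` on the lab's sample would give an interval close to, but not identical to, the one computed with `sd(population)`.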
- For the confidence interval to be valid, the sample mean must be
normally distributed and have standard error sigma/sqrt(n). What
conditions must be met for this to be true?
Solution: The observations must be independent, which random sampling
gives us. We also need either a normally distributed population or a
sample large enough for the central limit theorem to apply (a common
rule of thumb is at least 30). The latter holds here, since n = 60.
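One way to check the central limit theorem claim is by simulation: draw many samples of size 60 from a skewed population and compare the spread of the sample means with sigma/sqrt(n). The sketch below does this with `demo_pop`, a simulated gamma population that is only an assumed stand-in for `population`.

```r
# Stand-in population: right-skewed gamma draws (illustrative assumption).
set.seed(123)
demo_pop <- rgamma(3000, shape = 2, scale = 750)
n <- 60

# Draw 5000 samples of size n and record each sample mean.
sample_means <- replicate(5000, mean(sample(demo_pop, n)))

# Theoretical standard error versus the observed spread of the sample means.
sigma <- sd(demo_pop)
c(theoretical = sigma / sqrt(n), simulated = sd(sample_means))
```

The two values should be close. Running `hist(sample_means)` afterwards should show a roughly bell-shaped distribution even though `demo_pop` itself is skewed.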
- Does your confidence interval capture the true average size of
houses in Ames?
mean(population)
## [1] 1499.69
Solution: Yes, the confidence interval contains the true population
mean. (Note, this may not be true upon some knits of this notebook due
to the random nature of sampling.)
- Each student in your class should have gotten a slightly different
confidence interval. Most of them probably contain the true population
mean, but it’s possible that some might not. What percent do you expect
to contain the true population mean?
Solution: We expect 95% to contain the true population mean because
we are creating 95% confidence intervals.
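The 95% figure is a long-run rate, which a simulation with many “students” can illustrate. The sketch below builds 2000 intervals from a simulated stand-in population (`demo_pop`, an assumption) and reports the share that contain its mean. `covers_mu` is a hypothetical helper. The result should land near 0.95, though it can fall slightly short when the population is skewed and the sample standard deviation is used.

```r
set.seed(2024)
demo_pop <- rgamma(3000, shape = 2, scale = 750)  # stand-in population (assumption)
mu <- mean(demo_pop)
n <- 60
za <- qnorm(0.025, lower.tail = FALSE)

# One simulated "student": draw a sample, build a 95% interval, check coverage.
covers_mu <- function() {
  s <- sample(demo_pop, n)
  se <- sd(s) / sqrt(n)
  (mean(s) - za * se <= mu) && (mu <= mean(s) + za * se)
}

# Share of 2000 simulated intervals that contain the population mean.
coverage <- mean(replicate(2000, covers_mu()))
coverage
```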
- Looping to create many samples and confidence intervals:
- Create a code chunk that performs the following actions: First,
create empty vectors of size 50 called samp_mean and samp_sd. Then
create a variable called n that holds the sample size, 60.
- Now we’re ready for the loop where we calculate the means and
standard deviations of 50 random samples. Create a for-loop that repeats
50 times. Inside the loop, first collect a sample of size n from the
population. Store the sample in a variable called samp. Then compute the
mean of samp and store it as the ith entry in the samp_mean vector you
created. Finally, compute the standard deviation of samp and store it as
the ith entry in the samp_sd vector.
samp_mean <- rep(0,50)
samp_sd <- rep(0,50)
n <- 60
for (i in 1:50) {
  samp <- sample(population, n)  # draw a sample of size n
  samp_mean[i] <- mean(samp)
  samp_sd[i] <- sd(samp)
}
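The same bookkeeping can also be written without an explicit loop. The sketch below uses `replicate()`, with a simulated `demo_pop` standing in for `population` (an assumption, so the chunk runs on its own). It is an alternative phrasing, not the form the lab asks for.

```r
set.seed(7)
demo_pop <- rgamma(3000, shape = 2, scale = 750)  # stand-in population (assumption)
n <- 60
reps <- 50

# Each column of `stats` holds the mean and sd of one sample of size n.
stats <- replicate(reps, {
  s <- sample(demo_pop, n)
  c(mean = mean(s), sd = sd(s))
})

# Pull the two rows out as plain vectors, one entry per sample.
samp_mean <- stats["mean", ]
samp_sd <- stats["sd", ]
length(samp_mean)
```

Because `replicate()` returns a 2 x 50 matrix here, there is no need to create empty vectors first.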
- What proportion of your confidence intervals include the true
population mean? Is this proportion exactly equal to the confidence
level? If not, explain why.
za <- qnorm(0.025, lower.tail=FALSE)
lower_vector <- samp_mean - za * samp_sd / sqrt(n)
upper_vector <- samp_mean + za * samp_sd / sqrt(n)
plot_ci(lower_vector,upper_vector,mean(population))

Solution: This will vary from run to run. However, the proportion
can never exactly match the confidence level: with 50 intervals it must
be a multiple of 2%, and 95% is not. The closest possible values are 47
out of 50 (94%) and 48 out of 50 (96%).
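The plot shows which intervals miss, but the proportion can also be computed directly. The sketch below rebuilds 50 intervals from a simulated stand-in population (`demo_pop`, an assumption) and counts how many contain its mean. With the lab's own `lower_vector`, `upper_vector`, and `mean(population)`, only the `captured` lines are needed.

```r
set.seed(99)
demo_pop <- rgamma(3000, shape = 2, scale = 750)  # stand-in population (assumption)
mu <- mean(demo_pop)
n <- 60
za <- qnorm(0.025, lower.tail = FALSE)

# 50 samples of size n, one per column, then their means and sds.
samples <- replicate(50, sample(demo_pop, n))
samp_mean <- colMeans(samples)
samp_sd <- apply(samples, 2, sd)

lower_vector <- samp_mean - za * samp_sd / sqrt(n)
upper_vector <- samp_mean + za * samp_sd / sqrt(n)

# TRUE for each interval that contains the population mean.
captured <- lower_vector <= mu & mu <= upper_vector
sum(captured)    # number of intervals that capture mu
mean(captured)   # proportion; always a multiple of 1/50 = 0.02

# Idealization: if each interval captured mu independently with probability
# 0.95, the count would be Binomial(50, 0.95). Probabilities of 46 to 50 hits:
dbinom(46:50, size = 50, prob = 0.95)
```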
- More confidence intervals:
- Pick a new confidence level and use R to compute the appropriate
critical value.
- Calculate 50 confidence intervals at the confidence level you chose.
You do not need to obtain new samples, simply calculate new intervals
based on the sample means and standard deviations you have already
collected.
- Using the plot_ci function, plot all intervals and calculate the
proportion of intervals that include the true population mean. How does
this percentage compare to the confidence level selected for the
intervals?
za <- qnorm(0.005, lower.tail=FALSE)
lower_vector_99 <- samp_mean - za * samp_sd / sqrt(n)
upper_vector_99 <- samp_mean + za * samp_sd / sqrt(n)
plot_ci(lower_vector_99,upper_vector_99,mean(population))

Solution: I chose a 99% confidence level here. As expected, about 49
of the 50 confidence intervals typically contain the true population
mean, which is consistent with the 99% level.
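For any confidence level, the two-sided critical value is the upper (1 - level)/2 point of the standard normal. The sketch below tabulates it for three common levels, along with the expected number of capturing intervals out of 50. The variable name `conf_levels` is illustrative.

```r
# Two-sided critical values for several confidence levels.
conf_levels <- c(0.90, 0.95, 0.99)
za <- qnorm((1 - conf_levels) / 2, lower.tail = FALSE)
round(za, 3)
# [1] 1.645 1.960 2.576

# Expected number of intervals, out of 50, that capture the population mean.
50 * conf_levels
# [1] 45.0 47.5 49.5
```

A larger critical value widens every interval, which is why more of the 99% intervals contain the population mean than the 95% ones.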