Lab 4b

load("more/ames.RData")
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
  1. Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

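The sample is right-skewed: the mean (1612) is larger than the median (1532), and the largest house (4676) is far above the third quartile (1822). I took “typical” to mean the center of the distribution. Because of the skew I would rely on the median, so a typical house in my sample is roughly 1500 to 1600 square feet.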

summary(population)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
summary(samp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     784    1209    1532    1612    1822    4676
hist(samp)

  2. Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

I wouldn’t expect it to be identical, because each student draws a different random sample of 60 houses. I would expect it to be similar, though, because all of the samples come from the same population, so they should have roughly the same center, spread, and shape.

sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1447.276 1776.590
  3. For the confidence interval to be valid, the sample mean must be normally distributed and have standard error \(s / \sqrt{n}\). What conditions must be met for this to be true?

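The observations must be independent, which holds when the sample is random and makes up less than 10% of the population. The sample must also be large enough (at least 30 is the usual rule of thumb), and the population distribution must not be strongly skewed; the more skewed it is, the larger the sample needs to be.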
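
These conditions can be checked roughly in code. The following is a minimal sketch, not part of the lab: `sim_pop` is a simulated right-skewed population standing in for `ames$Gr.Liv.Area` (its size and parameters are assumptions chosen for illustration) so that the block runs on its own, and the thresholds of 30 observations and 10% of the population are rules of thumb rather than strict requirements.

```r
set.seed(1)  # reproducible illustration

# Hypothetical stand-in for the Ames living areas: a right-skewed population
sim_pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
n <- 60
sim_samp <- sample(sim_pop, n)  # simple random sample without replacement

# Independence: random sample that is under 10% of the population
cond_independent <- n < 0.10 * length(sim_pop)

# Sample size: at least 30 observations (rule of thumb)
cond_size <- n >= 30

# Skew: a mean well above the median suggests right skew;
# hist(sim_samp) should be inspected as well
skew_gap <- mean(sim_samp) - median(sim_samp)

c(independent = cond_independent, large_enough = cond_size)
skew_gap
```

With the lab's data, `sim_pop` and `sim_samp` would be replaced by `population` and `samp`.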

  4. What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.

It means that if we took many random samples from the population and built an interval of sample mean +/- 1.96 standard errors from each one, about 95% of those intervals would contain the population mean.
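
The long-run reading of “95% confidence” can be illustrated with a simulation. This is a sketch under assumptions: `sim_pop` is a hypothetical simulated population (not the Ames data), and 5000 repetitions is an arbitrary choice. The proportion it prints should land near 0.95, though not exactly on it.

```r
set.seed(2024)

# Hypothetical right-skewed population used only for illustration
sim_pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
mu <- mean(sim_pop)  # true population mean
n <- 60

# For each repetition: draw a sample, build the 95% interval,
# and record whether the interval contains mu
covers <- replicate(5000, {
  s <- sample(sim_pop, n)
  se <- sd(s) / sqrt(n)
  abs(mean(s) - mu) <= 1.96 * se
})

mean(covers)  # long-run proportion of intervals that capture mu
```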

mean(population)
## [1] 1499.69
  5. Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

It does, because the population mean of about 1500 (1499.69) falls between the lower bound (1447.28) and the upper bound (1776.59) calculated above.

print(c(lower,upper))
## [1] 1447.276 1776.590
  6. Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working on this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.

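I would expect about 95% of the intervals to capture the true population mean, because that is what the 95% confidence level describes: the capture rate of the procedure over repeated random samples. This does not require the population itself to be normally distributed (it is right-skewed); it only requires the sampling distribution of the mean to be nearly normal, which is reasonable for samples of 60.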
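
How far a class of 50 intervals might stray from 95% can be estimated with a binomial model. This is a rough sketch, assuming each interval captures the mean independently with probability exactly 0.95 (in practice the true rate is only approximately 0.95).

```r
n_intervals <- 50    # number of intervals collected
conf_level  <- 0.95  # assumed chance that one interval captures the mean

# Expected number of capturing intervals and its standard deviation
expected_hits <- n_intervals * conf_level                     # 47.5
sd_hits <- sqrt(n_intervals * conf_level * (1 - conf_level))  # about 1.54

# Probability that between 45 and 50 of the 50 intervals capture the mean
p_45_to_50 <- sum(dbinom(45:50, size = n_intervals, prob = conf_level))

c(expected = expected_hits, sd = sd_hits, p_45_to_50 = p_45_to_50)
```

Under this model, a count a few intervals away from 47.5 is ordinary sampling variation rather than evidence that the method failed.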

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
c(lower_vector[1], upper_vector[1])
## [1] 1322.517 1532.317

On your own

1 Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.

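Out of 50 intervals, only 3 did not include the population mean, so 47/50 = 94% of them captured it. This is not exactly equal to the 95% confidence level, and it could not be: 95% of 50 is 47.5 intervals, and the confidence level describes the long-run capture rate over many samples, so any particular set of 50 intervals will vary around it.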

plot_ci(lower_vector, upper_vector, mean(population))

2 Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?

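I picked the 88% confidence level. A two-sided 88% interval leaves 6% in each tail, so the appropriate critical value is the 94th percentile of the standard normal distribution, about 1.555. The 88th percentile, 1.175, is not the right value: an interval of +/- 1.175 standard errors has a confidence level of only about 76%.

qnorm(0.94)
## [1] 1.554774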
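
A two-sided interval at confidence level C leaves (1 - C)/2 in each tail, so its critical value is the (1 + C)/2 quantile. The sketch below assumes the normal model that the lab's 1.96 multiplier is based on; `critical_value` and `implied_level` are hypothetical helper names introduced here, not functions supplied with the lab.

```r
# Two-sided critical value for a confidence level between 0 and 1 (normal model)
critical_value <- function(conf_level) {
  stopifnot(conf_level > 0, conf_level < 1)
  qnorm(1 - (1 - conf_level) / 2)  # leaves (1 - conf_level) / 2 in each tail
}

# Two-sided confidence level implied by a given multiplier z
implied_level <- function(z) {
  2 * pnorm(z) - 1
}

critical_value(0.95)  # about 1.96
critical_value(0.88)  # about 1.555
implied_level(1.175)  # about 0.76
```

The last line shows why a one-sided percentile cannot be used directly as the multiplier: the 88th percentile leaves 12% in each tail, not 6%.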

3 Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?

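Below is the chart of 50 confidence intervals. 10 are highlighted in red because they didn’t include the population mean, so 40/50 = 80% of the intervals captured it. The intervals use a multiplier of 1.175, which is the 88th percentile of the normal distribution rather than the two-sided 88% critical value (about 1.555). Intervals of +/- 1.175 standard errors have a confidence level of about 76%, so the observed 80% is in line with what these intervals should achieve and, as expected, falls short of 88%.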

lower_vector2 <- samp_mean - 1.175 * samp_sd / sqrt(n) 
upper_vector2 <- samp_mean + 1.175 * samp_sd / sqrt(n)

plot_ci(lower_vector2,upper_vector2,mean(population))
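
The proportion of intervals that capture the mean can also be computed rather than counted from the plot, and the same samples can be reused with different multipliers. This is a sketch on simulated data: `sim_pop` is a hypothetical stand-in for the Ames population and `coverage` is a helper name introduced here, so the proportions it prints will not match the ones reported above.

```r
set.seed(7)

# Hypothetical right-skewed population used only for illustration
sim_pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
n <- 60

# 50 samples of size n, stored as the columns of an n x 50 matrix
samps <- replicate(50, sample(sim_pop, n))
samp_mean <- colMeans(samps)
samp_sd <- apply(samps, 2, sd)

# Proportion of intervals (mean +/- z * se) that contain the population mean
coverage <- function(z) {
  lower <- samp_mean - z * samp_sd / sqrt(n)
  upper <- samp_mean + z * samp_sd / sqrt(n)
  mean(lower <= mean(sim_pop) & mean(sim_pop) <= upper)
}

# Same samples, three multipliers: wider intervals can only capture more often
sapply(c(z_1.175 = 1.175, z_1.555 = 1.555, z_1.96 = 1.96), coverage)
```

With the lab's own vectors, the equivalent one-liner would be `mean(lower_vector2 <= mean(population) & mean(population) <= upper_vector2)`.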