Foundations for statistical inference

1. Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

load("more/ames.RData")
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
hist(samp)

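Ans: The distribution of the sample is unimodal and right-skewed: most houses sit near the center, with a tail of a few much larger houses. By “typical” I mean the center of the distribution, measured by the sample mean, which is about 1506 square feet (the midpoint of the interval reported in question 5). The 60 is the sample size, that is, the number of houses selected at random from the house size variable, not a typical house size.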
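One way to put a number on “typical” is to compare the sample mean and the sample median; in a right-skewed sample the mean usually sits above the median. The sketch below is only an illustration: `fake_population` and `fake_samp` are hypothetical stand-ins (a gamma distribution with mean 1500) so that the block runs without `ames.RData`. With the real data, apply the same two calls to `samp`.

```r
# Hypothetical stand-in for ames$Gr.Liv.Area (an assumption, not the real data):
# a right-skewed gamma population with mean 1500 and sd 500.
set.seed(1)
fake_population <- rgamma(3000, shape = 9, rate = 9 / 1500)
fake_samp <- sample(fake_population, 60)

# Two common readings of "typical": the mean and the median.
typical <- c(mean = mean(fake_samp), median = median(fake_samp))
typical
```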

2. Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

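Ans: I would not expect another student’s distribution to be identical to mine, because each sample is a different random draw of 60 houses. I would expect it to be similar, though: both samples come from the same population, so both should show roughly the same center and the same skew toward large houses. How pronounced the skew looks depends on whether the random sample happens to include some of the outliers in the tail of the population.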

3. For the confidence interval to be valid, the sample mean must be normally distributed and have standard error \(s / \sqrt{n}\). What conditions must be met for this to be true?

Ans: The sample observations must be independent (a random sample of less than 10% of the population); the sample size must be at least 30; and the population distribution must not be strongly skewed.
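These conditions can be checked roughly by simulation. The sketch below assumes a hypothetical right-skewed stand-in population (`fake_population`, not the Ames data), checks the 10% rule of thumb, and compares the spread of many simulated sample means with the theoretical standard error.

```r
set.seed(2)
# Hypothetical stand-in population (an assumption, not the real data).
fake_population <- rgamma(3000, shape = 9, rate = 9 / 1500)
n <- 60

# Independence rule of thumb: the sample is less than 10% of the population.
n / length(fake_population) < 0.10

# Approximate the sampling distribution of the mean with 5000 sample means.
sample_means <- replicate(5000, mean(sample(fake_population, n)))

# The simulated spread should be close to sigma / sqrt(n).
c(simulated = sd(sample_means),
  theoretical = sd(fake_population) / sqrt(n))
```

Calling `hist(sample_means)` afterwards gives a visual check that the sample means look roughly normal even though the population is skewed.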

4. What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.

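Ans: “95% confidence” means that if we took many samples and built an interval of about 2 standard errors (more precisely, 1.96) on either side of each sample mean, about 95% of those intervals would contain the true population mean.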
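In symbols, using the standard error from question 3, the interval has the form below. Here \(\bar{x}\) is the sample mean and \(z^{*}\) is the critical value; the symbol names are notation chosen for this sketch.

```latex
\[
\bar{x} \pm z^{*} \cdot \frac{s}{\sqrt{n}},
\qquad z^{*} \approx 1.96 \text{ for 95\% confidence}
\]
```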

5. Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)

## [1] 1384.617 1628.083

mean(population)

## [1] 1499.69

Ans: Yes. The true average, 1499.69, lies within my interval (1384.617, 1628.083).

6. Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.

Ans: I expect about 95% of the students’ intervals to capture the true population mean, because each interval was built with a procedure that captures the true mean 95% of the time. There are 60 students in the class, so about 57 of the intervals (0.95 × 60 = 57) should capture the true mean, although the actual count will vary from class to class.
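The expected count can be worked out with a binomial model. This is a sketch that assumes each student's sample is independent and that every interval has exactly a 95% chance of capturing the mean; the variable names are illustrative.

```r
n_students <- 60   # class size stated in the answer above
conf_level <- 0.95

# Expected number of intervals that capture the true mean.
expected_hits <- n_students * conf_level

# Standard deviation of that count under the binomial model.
sd_hits <- sqrt(n_students * conf_level * (1 - conf_level))

# Probability that 57 or more of the 60 intervals capture the mean.
p_at_least_57 <- pbinom(56, size = n_students, prob = conf_level,
                        lower.tail = FALSE)

c(expected = expected_hits, sd = sd_hits, p_at_least_57 = p_at_least_57)
```

Under this model, 57 is the expected count rather than a guaranteed minimum: the last value shows that fewer than 57 captures is a real possibility.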

On your own

- Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60

for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

c(lower_vector[1], upper_vector[1])

## [1] 1338.089 1586.977

plot_ci(lower_vector, upper_vector, mean(population))

Ans: In the plot above, 3 of the 50 intervals (6%) do not include the population mean, so 94% of the intervals capture it. This is close to, but not exactly, 95%. The confidence level describes the long-run proportion over many repeated samples; with only 50 intervals the observed proportion varies from run to run, and exactly 95% is not even possible because it would require 47.5 intervals.
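Instead of counting intervals by eye on the plot, the proportion can be computed directly. The sketch below repeats the loop above on a hypothetical stand-in population (`fake_population`, an assumption so the block runs without `ames.RData`); with the real data, only the `captured` line and the two summaries are needed, using `mean(population)`.

```r
set.seed(3)
# Hypothetical stand-in population (an assumption, not the real data).
fake_population <- rgamma(3000, shape = 9, rate = 9 / 1500)
n <- 60
n_intervals <- 50

samp_mean <- rep(NA, n_intervals)
samp_sd <- rep(NA, n_intervals)
for (i in 1:n_intervals) {
  s <- sample(fake_population, n)  # one sample of size n
  samp_mean[i] <- mean(s)
  samp_sd[i] <- sd(s)
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

# TRUE where an interval contains the population mean.
mu <- mean(fake_population)
captured <- lower_vector <= mu & mu <= upper_vector

sum(captured)   # number of intervals that capture the mean
mean(captured)  # proportion of intervals that capture the mean
```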

- Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?

Ans: I pick a 90% confidence level; the appropriate critical value is approximately 1.645 (rounded to 1.64 in the code that follows).
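The critical value for any confidence level can be obtained from the standard normal quantile function `qnorm`. In this sketch, `critical_value` is a hypothetical helper name, not part of the lab materials.

```r
# Critical value z* for a two-sided interval at the given confidence level:
# leave (1 - conf_level) / 2 in each tail of the standard normal.
critical_value <- function(conf_level) {
  qnorm(1 - (1 - conf_level) / 2)
}

critical_value(0.90)  # about 1.645
critical_value(0.95)  # about 1.96
```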

- Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the `plot_ci` function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?

lower_vector90 <- samp_mean - 1.64 * samp_sd / sqrt(n) 
upper_vector90 <- samp_mean + 1.64 * samp_sd / sqrt(n)
plot_ci(lower_vector90, upper_vector90, mean(population))

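Ans: Using the same 50 samples, 3 of the 50 intervals do not include the population mean, which is 6%, so 94% of the intervals capture it. That is close to, but not exactly, the 90% confidence level; with only 50 intervals the observed proportion varies from run to run around the nominal level.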

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.

Foundations for statistical inference - Confidence intervals

Chunhui Zhu

October 3, 2017