Foundations for Statistical Inference

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")

Exercise 1

Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

population = ames$Gr.Liv.Area
samp = sample(population, 60)
hist(samp, breaks = 10, main = "Gr.Liv.Area Sample")

summary(samp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     605    1120    1416    1492    1735    2872

The distribution is right-skewed and unimodal. Because the data are skewed, I prefer the median (1416) as the “typical” size within the sample. By “typical” I mean the middle value: half of the sampled houses are smaller and half are larger, and a few very large houses do not pull it upward the way they pull the mean (1492).
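The effect of skew on the two summaries can be checked with a small, made-up example. The values in sizes below are hypothetical living areas, not drawn from the Ames data; the point is only that one very large house moves the mean much more than the median.

```r
# Hypothetical living areas (square feet); not taken from the Ames data
sizes = c(900, 1100, 1300, 1400, 1500, 1700, 2000)

# Same houses plus one very large house in the right tail
with_large_house = c(sizes, 5000)

# Compare how each summary reacts to the extra house
before = c(mean = mean(sizes), median = median(sizes))
after = c(mean = mean(with_large_house), median = median(with_large_house))
rbind(before, after)
```

Adding the 5000 square foot house raises the mean from about 1414 to 1862.5, while the median moves only from 1400 to 1450.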

Exercise 2

Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

I would not expect another student’s distribution to be identical to mine, because each sample is a different random draw of 60 houses. I would expect it to be similar (right-skewed and unimodal), because every random sample comes from the same population and should roughly reflect its shape.

Exercise 3

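For the confidence interval to be valid, the sample mean must be normally distributed and have standard error s/sqrt(n). What conditions must be met for this to be true?

The sample must be random and the observations independent; when sampling without replacement, this is usually taken to hold if the sample is less than 10% of the population. In addition, either the population distribution should be nearly normal, or the sample should be large enough (commonly n ≥ 30) for the Central Limit Theorem to apply, with a larger sample needed the more skewed the population is.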
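The role of sample size can be illustrated by simulating the sampling distribution of the mean. This is only a sketch under stated assumptions: skewed_pop is a hypothetical right-skewed population generated with rlnorm (it is not the Ames data), and the seed and the 5000 repetitions are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed population standing in for house sizes
skewed_pop = rlnorm(3000, meanlog = 7.2, sdlog = 0.35)
n = 60

# Means of 5000 random samples of size n (each drawn without replacement)
sample_means = replicate(5000, mean(sample(skewed_pop, n)))

# Theoretical standard error versus the observed spread of the sample means
se_theory = sd(skewed_pop) / sqrt(n)
se_observed = sd(sample_means)

# If the sample means are close to normal, about 95% of them should fall
# within 1.96 standard errors of the population mean
share_within = mean(abs(sample_means - mean(skewed_pop)) < 1.96 * se_theory)

c(se_theory = se_theory, se_observed = se_observed, share_within = share_within)
```

Calling hist(sample_means) afterwards gives a visual check: the histogram should look roughly bell-shaped even though the population itself is skewed.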

Exercise 4

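What does “95% confidence” mean?

“95% confidence” means that if we repeatedly drew samples of the same size from the same population and built an interval from each one, about 95% of those intervals (roughly 95 out of every 100) would contain the true population parameter, here the population mean. It describes the long-run success rate of the method, not the probability that any single computed interval contains the mean.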
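The long-run reading of the confidence level can be checked by simulation. The sketch below assumes a hypothetical population pop generated with rlnorm rather than the Ames data, and the seed and the 2000 repetitions are arbitrary; it simply counts how often an interval of the form mean ± 1.96 × s/sqrt(n) captures the known population mean.

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in population; its mean is known, unlike in practice
pop = rlnorm(3000, meanlog = 7.2, sdlog = 0.35)
true_mean = mean(pop)
n = 60

# For each of 2000 samples, build a 95% interval and record whether it
# contains the true population mean
covers = replicate(2000, {
  s = sample(pop, n)
  half_width = 1.96 * sd(s) / sqrt(n)
  abs(mean(s) - true_mean) < half_width
})

# Proportion of intervals that captured the true mean
coverage = mean(covers)
coverage
```

The resulting proportion should land close to 0.95, though not exactly on it, because the number of repetitions is finite and the population is skewed.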

Exercise 5

Does your confidence interval capture the true average size of houses in Ames?

samp_mean = mean(samp)                 # point estimate from the sample
se = sd(samp)/sqrt(60)                 # standard error of the sample mean
lower = samp_mean - 1.96 * se
upper = samp_mean + 1.96 * se
round(c(lower, upper), 1)

## [1] 1368.9 1614.9

pop_mean = mean(population)            # true mean, known here because the lab treats ames as the full population
captured = pop_mean > lower & pop_mean < upper
samps = matrix(c(round(lower,1),
                 round(pop_mean,2),
                 round(upper,1),
                 as.character(captured)),ncol=4,byrow=TRUE)
colnames(samps) = c("Lower_Int", "Pop_Mean", "Upper_Int", "Within_Int?")
as.table(samps)

##   Lower_Int Pop_Mean Upper_Int Within_Int?
## A 1368.9    1499.69  1614.9    TRUE

Yes, my CI does capture the true average size of houses in Ames.

Exercise 6

Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.

I am not working in a classroom, so I cannot collect other students’ intervals. By the definition of a 95% confidence level, however, I would expect approximately 95% of the students’ intervals to capture the true population mean.

ON YOUR OWN

1. Using the following function “plot_ci”, plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.

samp_mean = rep(NA, 50)
samp_sd = rep(NA, 50)
n = 60

for(i in 1:50){
  samp = sample(population, n)
  samp_mean[i] = mean(samp)
  samp_sd[i] = sd(samp)
}

lower_vector = samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector = samp_mean + 1.96 * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))

1-(3/50)

## [1] 0.94

Of the 50 intervals, 3 did not contain the true population mean, so 94% of my confidence intervals include it. This proportion is not exactly equal to the confidence level, and it should not be expected to be. With 50 intervals the proportion can only change in steps of 2%, so exactly 95% is impossible, and each interval is built from a different random sample, so the number that miss varies from run to run. The 95% level describes the long-run capture rate, and three misses out of 50 is well within what that rate would produce.
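How unusual three misses are can be estimated with the binomial distribution. This is a rough sketch that assumes each interval misses independently with probability 0.05, which is reasonable here because every interval comes from a separate random sample; the variable names are illustrative only.

```r
n_intervals = 50
conf_level = 0.95
miss_prob = 1 - conf_level

# Expected number of intervals that miss the true mean
expected_misses = n_intervals * miss_prob

# Probability that exactly 3 of the 50 intervals miss
p_three_misses = dbinom(3, size = n_intervals, prob = miss_prob)

# Probability that 3 or more intervals miss
p_three_or_more = 1 - pbinom(2, size = n_intervals, prob = miss_prob)

c(expected_misses = expected_misses,
  p_three_misses = p_three_misses,
  p_three_or_more = p_three_or_more)
```

Under these assumptions about 2.5 misses are expected, exactly three misses occur with probability of roughly 0.22, and three or more occur with probability of roughly 0.46, so the observed 94% is an ordinary outcome.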

2. Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?

# 99% confidence interval
conf_level = 0.99
alpha = 1 - conf_level      # total area left in the two tails
Z = qnorm(1 - alpha/2)      # upper-tail critical value
Z

## [1] 2.575829

The appropriate critical value is approximately 2.58.
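The same calculation works for any confidence level. The helper below is a hypothetical convenience function (critical_value is not part of the lab materials) that wraps qnorm for a two-sided interval.

```r
# Hypothetical helper: two-sided normal critical value for a confidence level
critical_value = function(conf_level) {
  stopifnot(conf_level > 0, conf_level < 1)
  alpha = 1 - conf_level          # total area left in the two tails
  qnorm(1 - alpha / 2)            # quantile that leaves alpha/2 in the upper tail
}

# Critical values for three common confidence levels
crit = sapply(c(0.90, 0.95, 0.99), critical_value)
round(crit, 3)
```

For 90%, 95%, and 99% confidence this gives approximately 1.645, 1.960, and 2.576.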

3. Calculate 50 confidence intervals at the 99% confidence level. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?

samp_mean = rep(NA, 50)
samp_sd = rep(NA, 50)
n = 60

for(i in 1:50){
  samp = sample(population, n)
  samp_mean[i] = mean(samp)
  samp_sd[i] = sd(samp)
}

lower_vector = samp_mean - Z * samp_sd / sqrt(n)   # Z = qnorm(0.995), computed above
upper_vector = samp_mean + Z * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))

1-(0/50)

## [1] 1

All 50 confidence intervals contain the true population mean, which is not surprising given that a 99% interval is wider than a 95% interval. Only about 1 interval in 100 is expected to miss at this level, so fewer intervals miss than at 95%, and seeing no misses in 50 is consistent with the 99% confidence level.
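Two quick calculations put the 99% result in context. The sketch assumes the same sample standard deviation and sample size at both levels, so the interval widths differ only through the critical value, and it assumes the 50 intervals miss independently; the variable names are illustrative.

```r
# Critical values for the two confidence levels
z_95 = qnorm(0.975)
z_99 = qnorm(0.995)

# How much wider a 99% interval is than a 95% interval built from the same sample
width_ratio = z_99 / z_95

# Chance that all 50 independent 99% intervals capture the true mean
p_all_capture = 0.99^50

c(width_ratio = width_ratio, p_all_capture = p_all_capture)
```

Under these assumptions a 99% interval is about 31% wider than the matching 95% interval, and all 50 intervals capture the mean in roughly 6 runs out of 10.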

Foundations for Statistical Inference - Confidence Intervals

Georgia Galanopoulos
