DATA_606_Lab4b

load("more/ames.RData")

population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
summary(samp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     816    1194    1542    1540    1713    2728

hist(samp)

The distribution is unimodal but not symmetric: many more homes have an area below the mode than above it. A typical home in the sample has about 1540 square feet of living area. I took "typical" to mean the sample mean.
I would not expect another student’s distribution to be identical to mine. Each sample is a different random subset of the population, so two point estimates of the mean are unlikely to match exactly. I would still expect the means to be similar, since a sample of 60 is large enough to be somewhat representative of the population.

sample_mean <- mean(samp)

se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)

## [1] 1425.747 1653.953

For the confidence interval to be valid, the sampling distribution of the sample mean must be nearly normal. This holds when the sample observations are independent, the sample size is large enough (at least 30), and the population is not strongly skewed. The larger the sample, the more lenient we can be about skew. Observations can be treated as independent when they come from a simple random sample that makes up less than 10% of the population.
95% confidence means that, across repeated samples, the sample mean falls within 1.96 standard errors of the true population mean about 95% of the time, so about 95% of intervals built this way contain the true mean.
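
As a worked check, the interval reported above follows from the usual formula. The values of the sample mean and standard error shown here are recovered from the printed endpoints (the midpoint, and the half-width divided by 1.96) rather than from separate output, so they are approximate.

```latex
\[
\bar{x} \pm z^{*}\,\frac{s}{\sqrt{n}}
= 1539.85 \pm 1.96 \times 58.22
\approx (1425.7,\ 1654.0)
\]
```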

mean(population)

## [1] 1499.69

Yes, my confidence interval (1425.747 to 1653.953) captures the true average size of houses in Ames, 1499.69. I am not working on this in a classroom.
I would expect about 95% of those intervals to capture the true population mean, because that is what a 95% confidence level means.

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60

for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

plot_ci(lower_vector, upper_vector, mean(population))

47/50 = 0.94, so 94% of the confidence intervals include the population mean. This proportion is close to the 95% confidence level. With 50 intervals the proportion can only change in steps of 2%, so exactly 95% is impossible; the closest possible values are 94% and 96%. The more samples that are drawn, the closer the observed proportion can come to 95%.
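
The count can also be computed rather than read off the plot. The following is a minimal sketch, not the lab's own code: because the Ames data file is not bundled here, `population` is a simulated right-skewed stand-in (its size and shape are arbitrary assumptions), and `captured` is a name introduced only for illustration.

```r
# Hypothetical stand-in for ames$Gr.Liv.Area: a simulated right-skewed
# population. Size, shape, and scale are arbitrary assumptions.
set.seed(606)  # arbitrary seed, used only to make the sketch reproducible
population <- rgamma(2930, shape = 9, scale = 167)

n <- 60
n_samples <- 50
samp_mean <- rep(NA, n_samples)
samp_sd <- rep(NA, n_samples)

for (i in 1:n_samples) {
  samp <- sample(population, n)  # draw one sample of size n
  samp_mean[i] <- mean(samp)     # save its mean
  samp_sd[i] <- sd(samp)         # save its standard deviation
}

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

# TRUE for each interval that contains the population mean
captured <- lower_vector <= mean(population) & mean(population) <= upper_vector

sum(captured)   # number of intervals that capture the population mean
mean(captured)  # proportion of intervals; always a multiple of 1/50 = 0.02
```

With the real data, the same last three lines could be run directly after the lab's own `lower_vector` and `upper_vector` are built.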

I am choosing an 80% confidence level. This means that about 80% of intervals built this way should contain the population mean.
Critical z-score = 1.28
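
The critical value does not have to be looked up in a table. As a small sketch, base R's `qnorm()` returns it for any confidence level; `z_star` is a hypothetical helper name introduced here, not part of the lab.

```r
# Two-sided critical value for a given confidence level.
# z_star is a hypothetical helper name, not part of the lab materials.
z_star <- function(conf_level) {
  # Leave (1 - conf_level) / 2 in each tail of the standard normal
  qnorm(1 - (1 - conf_level) / 2)
}

z_star(0.95)  # about 1.96
z_star(0.80)  # about 1.28
```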

lower_vector <- samp_mean - 1.28 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.28 * samp_sd / sqrt(n)

plot_ci(lower_vector, upper_vector, mean(population))

35/50 = 70% of the confidence intervals include the population mean. This is lower than the 80% confidence level. With only 50 samples the observed proportion can differ noticeably from 80%. Taking more than 50 samples would bring the observed proportion closer to the confidence level.
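
One way to see how much the observed proportion can wander is to treat each interval as an independent success-or-failure trial, which is an assumption of this sketch rather than something the lab states. The proportion of captures then has a binomial standard error; `coverage_se` is a hypothetical helper name.

```r
# Standard error of an observed coverage proportion, assuming each interval
# independently captures the mean with probability conf_level.
# coverage_se is a hypothetical helper name, not part of the lab materials.
coverage_se <- function(conf_level, n_intervals) {
  sqrt(conf_level * (1 - conf_level) / n_intervals)
}

coverage_se(0.80, 50)                  # about 0.057
(0.70 - 0.80) / coverage_se(0.80, 50)  # 70% sits about 1.8 standard errors below 80%
coverage_se(0.80, 5000)                # about 0.0057 with 100 times as many intervals
```

Under this assumption, an observed 70% from 50 intervals is low but not wildly out of line with a true coverage of 80%.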

Sarah Wigodsky

October 1, 2017