The data

In the previous lab, “Sampling Distributions”, we looked at the population data of houses from Ames, Iowa. Let’s start by loading that data set.

load("more/ames.RData")

In this lab we’ll start with a simple random sample of size 60 from the population. Note that the data set has information on many housing variables, but for the first portion of the lab we’ll focus on the size of the house, represented by the variable Gr.Liv.Area.

population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
  1. Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

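Answer: My simple random sample of 60 houses is slightly right-skewed, like the population: in the output below its mean (1522) is a little above its median (1498), and both are close to the population mean (1500). I take “typical” to mean the center of the distribution, which here is about 1,500 square feet of living area. The sample standard deviation (492.8303) is slightly smaller than the population’s (505.5089), probably because the sample contains fewer extreme outliers.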

summary(population)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
sd(population)
## [1] 505.5089
hist(population, breaks=25)

qqnorm(population)
qqline(population)

summary(samp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     612    1211    1498    1522    1796    2868
sd(samp)
## [1] 492.8303
hist(samp, breaks=25)

qqnorm(samp)
qqline(samp)

#(mean(samp)- mean(population))/mean(population)
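The summaries above change every time the code is run, because `sample()` draws at random. A minimal sketch of how a draw could be made reproducible with `set.seed()`; the log-normal `population` here is a synthetic stand-in (an assumption, so the sketch runs without `ames.RData`), and the seed values are arbitrary:

```r
# Synthetic stand-in for ames$Gr.Liv.Area (assumption: not the real data)
set.seed(1)
population <- round(rlnorm(3000, meanlog = 7.26, sdlog = 0.33))

set.seed(42)                       # fix the random number generator state
samp_a <- sample(population, 60)

set.seed(42)                       # same seed again
samp_b <- sample(population, 60)   # reproduces samp_a exactly

samp_c <- sample(population, 60)   # no reset, so a fresh draw

identical(samp_a, samp_b)          # TRUE: same seed, same sample
mean(samp_a)
mean(samp_c)                       # generally differs from mean(samp_a)
```

Setting a seed once at the top of a report is usually enough to keep the written answers in step with the printed output.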
  1. Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

Answer: (1) Another student’s distribution is very unlikely to be identical to mine, but I expect it to be similar, with a similar mean and standard deviation, because both samples come from the same population. It will not match exactly because simple random sampling introduces sampling variability, which is noticeable when the sample size is not large. Here I generate a second sample of size 60. The second sample’s mean, 1550, is larger than the first sample’s 1522, and its standard deviation, 662.4161, is larger than the first’s 492.8303.

samp2 <- sample(population, 60)
summary(samp2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1093    1445    1550    1882    5095
sd(samp2)
## [1] 662.4161
hist(samp2)

qqnorm(samp2)
qqline(samp2)

(2) Here I also generate 5000 sample means to see what the sampling distribution looks like. The result suggests that most students will get a sample mean close to the population mean.

sample_means60 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(population, 60)
   sample_means60[i] <- mean(samp)
   }

hist(sample_means60)
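One way to check the simulation above is to compare the spread of the simulated sample means with the theoretical standard error, the population standard deviation divided by \(\sqrt{n}\). The sketch below uses `replicate()` in place of the `for` loop and a synthetic log-normal `population` as a stand-in for the Ames data (both are assumptions made so the example runs on its own):

```r
set.seed(1)
# Synthetic stand-in for ames$Gr.Liv.Area (assumption: not the real data)
population <- rlnorm(3000, meanlog = 7.26, sdlog = 0.33)
n <- 60

# 5000 sample means, each from a simple random sample of size n
sample_means60 <- replicate(5000, mean(sample(population, n)))

# Center: the sample means should cluster around the population mean
center <- c(mean_of_sample_means = mean(sample_means60),
            population_mean = mean(population))

# Spread: sd of the sample means versus the theoretical standard error
spread <- c(sd_of_sample_means = sd(sample_means60),
            theoretical_se = sd(population) / sqrt(n))

print(center)
print(spread)
```

The two entries in each pair should be close, which is what justifies using \(s / \sqrt{n}\) as the standard error in the intervals that follow.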

Confidence intervals

  1. For the confidence interval to be valid, the sample mean must be normally distributed and have standard error \(s / \sqrt{n}\). What conditions must be met for this to be true?

Answer: The following conditions must be met: 1) the sample size is at least 30, 2) the sample observations are independent, and 3) the population distribution is not strongly skewed.

Confidence levels

  1. What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.

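Answer: “95% confidence” means that if we repeatedly drew samples and built an interval from each one, about 95% of those intervals would contain the true population mean. This works because a normally distributed sample mean falls within 1.96 standard errors of the population mean about 95% of the time.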
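The link between the 1.96 multiplier and the interval can be written out as a short derivation. It assumes the sample mean \(\bar{x}\) is approximately normal with mean \(\mu\) and standard error \(s / \sqrt{n}\); the symbols \(\bar{x}\) and \(\mu\) are notation introduced here, not taken from the lab text:

```latex
\[
P\left(-1.96 \le \frac{\bar{x} - \mu}{s/\sqrt{n}} \le 1.96\right) \approx 0.95
\quad\Longleftrightarrow\quad
P\left(\bar{x} - 1.96\,\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + 1.96\,\frac{s}{\sqrt{n}}\right) \approx 0.95
\]
```

The randomness is in \(\bar{x}\) and \(s\), which change from sample to sample, not in \(\mu\), which is fixed.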

mean(population)
## [1] 1499.69
  1. Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

Answer: Yes. My 95% confidence interval, [1327.834, 1582.666], captures the population mean of 1499.69. My neighbor’s interval will probably differ from mine because it depends on their sample’s mean and standard deviation.

sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1327.834 1582.666
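The interval calculation above can be wrapped in a small function and checked against the population mean. This is a minimal sketch: `ci_mean` is a hypothetical helper, not part of the lab, and the log-normal `population` is a synthetic stand-in for the Ames data.

```r
# Hypothetical helper: normal-approximation confidence interval for a mean
ci_mean <- function(x, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)   # critical value, about 1.96 for 95%
  se <- sd(x) / sqrt(length(x))      # estimated standard error
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

set.seed(1)
# Synthetic stand-in for ames$Gr.Liv.Area (assumption: not the real data)
population <- rlnorm(3000, meanlog = 7.26, sdlog = 0.33)
samp <- sample(population, 60)

ci <- ci_mean(samp)
mu <- mean(population)

# TRUE when the interval captures the population mean
captures <- unname(ci["lower"] <= mu & mu <= ci["upper"])

print(ci)
print(captures)
```

Passing `level = 0.99` to the same function gives the wider interval used later in the lab.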
  1. Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.

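Answer: I don’t know what data my classmates have, but I expect about 95% of their intervals to capture the true population mean, because a 95% confidence interval comes from a procedure that captures the true mean in about 95% of random samples. I can estimate what they would get by random sampling. Assuming a class of 50 students, the 95% confidence interval computed below is (1329.151, 1581.349), which captures the population mean 1499.69.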

sample_means50 <- rep(NA, 50)

for(i in 1:50){
   samp <- sample(population, 50)
   sample_means50[i] <- mean(samp)
   }
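# Caution: se uses only the last sample drawn in the loop, and sample_mean
# still holds the mean of the earlier size-60 sample, so this is a single
# interval rather than one interval per simulated student.
se <- sd(samp) / sqrt(50)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se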
c(lower, upper)
## [1] 1329.151 1581.349
hist(sample_means50, breaks=50)

Here is the rough outline:

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60

Now we’re ready for the loop where we calculate the means and standard deviations of 50 random samples.

for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}

Lastly, we construct the confidence intervals.

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

Lower bounds of these 50 confidence intervals are stored in lower_vector, and the upper bounds are in upper_vector. Let’s view the first interval.

c(lower_vector[1], upper_vector[1])
## [1] 1450.773 1741.760

On your own

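Answer: I plotted 95% confidence intervals for 50 random samples of size 60 each. Five of the intervals did not capture the true mean, 1499.6904, so the proportion that did is 90% (1 - 5/50). This is not exactly 95% because the confidence level describes the long-run capture rate, and with only 50 intervals the observed proportion varies from run to run. A confidence interval is not meant to pin down an exact value; it gives a range from a procedure that captures the true population value about 95% of the time.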

  plot_ci(lower_vector, upper_vector, mean(population))
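The proportion read off the plot can also be computed directly from the interval bounds. A minimal sketch, using a synthetic log-normal `population` as a stand-in for the Ames data (an assumption); `captured` and `n_intervals` are names introduced here for illustration:

```r
set.seed(1)
# Synthetic stand-in for ames$Gr.Liv.Area (assumption: not the real data)
population <- rlnorm(3000, meanlog = 7.26, sdlog = 0.33)
n <- 60
n_intervals <- 50

samp_mean <- rep(NA, n_intervals)
samp_sd <- rep(NA, n_intervals)
for (i in 1:n_intervals) {
  samp <- sample(population, n)
  samp_mean[i] <- mean(samp)
  samp_sd[i] <- sd(samp)
}

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

# TRUE for each interval that contains the population mean
mu <- mean(population)
captured <- lower_vector <= mu & mu <= upper_vector

sum(!captured)   # number of intervals that miss the true mean
mean(captured)   # proportion of intervals that capture it
```

With only 50 intervals, the proportion moves in steps of 2%, so values a few points away from 95% are expected.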

Answer: I pick a 99% confidence level. Its significance level is 0.01, so for a two-tailed interval the critical value is the 0.995 quantile of the standard normal distribution, which is 2.575829.

z <- qnorm(0.995)
z
## [1] 2.575829
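The same calculation generalizes to any confidence level: the critical value is the standard normal quantile at 1 minus half the significance level. A short sketch, with `conf_levels`, `alpha`, and `z_star` as illustrative names:

```r
conf_levels <- c(0.90, 0.95, 0.99)   # confidence levels to compare
alpha <- 1 - conf_levels             # significance levels
z_star <- qnorm(1 - alpha / 2)       # two-tailed critical values

round(z_star, 3)   # 1.645 1.960 2.576
```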

Answer: The 99% confidence intervals are wider than the 95% intervals. In my run, the 99% intervals captured the true population mean for all 50 random samples.

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60

for(i in 1:50){
  samp <- sample(population, n)
  samp_mean[i] <- mean(samp)
  samp_sd[i] <- sd(samp)
}

lower99 <- samp_mean - 2.58 * samp_sd / sqrt(n)
upper99 <- samp_mean + 2.58 * samp_sd / sqrt(n)

c(lower99[1], upper99[1])
## [1] 1344.499 1701.768
  plot_ci(lower99, upper99, mean(population))
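To compare the two confidence levels on equal footing, both sets of intervals can be built from the same samples. A hedged sketch, with a synthetic `population` standing in for the Ames data and names such as `samp_stats`, `half95`, and `half99` introduced only for illustration:

```r
set.seed(1)
# Synthetic stand-in for ames$Gr.Liv.Area (assumption: not the real data)
population <- rlnorm(3000, meanlog = 7.26, sdlog = 0.33)
n <- 60
mu <- mean(population)

# One row per sample: its mean and estimated standard error
samp_stats <- t(replicate(50, {
  samp <- sample(population, n)
  c(mean = mean(samp), se = sd(samp) / sqrt(n))
}))

# Half-widths of the 95% and 99% intervals for each sample
half95 <- qnorm(0.975) * samp_stats[, "se"]
half99 <- qnorm(0.995) * samp_stats[, "se"]

# An interval captures mu when the sample mean is within one half-width of it
captured95 <- abs(samp_stats[, "mean"] - mu) <= half95
captured99 <- abs(samp_stats[, "mean"] - mu) <= half99

mean(half99 / half95)   # each 99% interval is about 1.31 times as wide
c(prop95 = mean(captured95), prop99 = mean(captured99))
```

Because each 99% interval contains the 95% interval from the same sample, it captures the mean whenever the 95% interval does, at the cost of a wider range.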