In this lab we will use the CASchools dataset used in class and the last lab. The package AER that contains it is already loaded in the first code chunk.

  1. Start by loading the data.
data("CASchools", package = "AER")
  1. The variable lunch indicates the share of students who qualify for free lunch. Add a new variable to the original dataset that is true when more than 40% of students qualify for free lunch.
# lunch is measured in percent (0 to 100), so the cutoff is 40 rather than 0.4
CASchools$above_40_lunch <- CASchools$lunch > 40
  1. Does the new variable created above roughly split the sample in two subsamples of similar size? Why? (Compute summary statistics for lunch (e.g. mean, median, etc.) to answer.) The subsamples are of similar size only if the median of lunch is close to 40, since half of the schools lie on each side of the median. The summary statistics below report the median and mean of lunch for the full sample, and the share of schools above the cutoff shows the split directly: a value near 0.5 means two subsamples of similar size.
summary(CASchools$lunch)
mean(CASchools$lunch > 40)  # share of schools with above_40_lunch equal to TRUE
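  1. Compute the mean score in math and read conditional on more than 40% of students qualifying for free lunch. How do they compare to the population means? Both conditional means are lower than the corresponding population means: schools where more students qualify for free lunch score lower on average in math and in reading.
mean_math_high_lunch <- mean(CASchools$math[CASchools$above_40_lunch])
mean_read_high_lunch <- mean(CASchools$read[CASchools$above_40_lunch])
mean_math_high_lunch - mean(CASchools$math)  # gap to the population mean in math
mean_read_high_lunch - mean(CASchools$read)  # gap to the population mean in read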
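The comparison between conditional and population means can also be laid out in a single table. The sketch below is only an illustration: the data frame `schools` and its values are made up for the example and are not the CASchools data.

```r
# Toy data frame with hypothetical values (not the CASchools data)
schools <- data.frame(
  lunch = c(10, 25, 35, 45, 60, 80),
  math  = c(680, 670, 660, 650, 640, 630),
  read  = c(685, 672, 664, 652, 641, 628)
)

# Indicator for schools above the 40% cutoff
high <- schools$lunch > 40

# Population means and means conditional on the indicator
pop_means  <- colMeans(schools[, c("math", "read")])
cond_means <- colMeans(schools[high, c("math", "read")])

# Negative gaps mean the high-lunch group scores below the population average
gap <- cond_means - pop_means
print(rbind(population = pop_means, high_lunch = cond_means, gap = gap))
```

With the lab data, `schools` would be replaced by `CASchools` and `high` by `CASchools$above_40_lunch`.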
  1. Draw 10 samples of 40 observations each from the population of schools where more than 40% of students qualify for free lunch. For each sample, compute the mean read score, and create a histogram showing the distribution of these values. Notice that you will build the histogram with 10 observations, each corresponding to one sample mean. Add the true mean value to the histogram as a vertical line. How does the mean of the sample average compare to the population mean? The mean of the 10 sample means is close to the population mean but not equal to it, because with only 10 samples some sampling noise remains.
set.seed(123)

n_samples <- 10
n_obs <- 40

mean_read_samples <- numeric(n_samples)

for (i in 1:n_samples) {
  sample_index <- sample(which(CASchools$above_40_lunch == TRUE), n_obs)
  mean_read_samples[i] <- mean(CASchools$read[sample_index])
}
# Population mean of read among schools above the 40% cutoff
true_mean_read <- mean(CASchools$read[CASchools$above_40_lunch])
hist(mean_read_samples, main = "Distribution of Mean Read Scores", xlab = "Mean Read Score", col = "blue")
# Dashed red line marks the population mean
abline(v = true_mean_read, col = "red", lwd = 2, lty = 2)
mean(mean_read_samples)  # mean of the 10 sample means, to compare with true_mean_read
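As a reminder of why the sample means cluster around the vertical line, the standard result for the mean of a simple random sample can be written out. The symbols below are notation introduced for this note, not taken from the lab: $\mu$ and $\sigma^2$ are the mean and variance (with divisor $N$) of read in the subgroup, $N$ is the number of schools in the subgroup, and $n = 40$ is the sample size.

```latex
\[
E[\bar{X}] = \mu,
\qquad
\operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n} \cdot \frac{N - n}{N - 1}
\]
```

The factor $(N - n)/(N - 1)$ is the finite population correction; it applies here because `sample()` draws without replacement by default.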

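  1. Now repeat the exercise above but drawing 50 samples of 40 observations each. How does the mean of the sample average compare to the population mean? How did it change now that you increase the number of draws? With 50 samples the mean of the sample means is typically closer to the population mean than with 10 samples, because averaging over more draws cancels more of the sampling noise. Increasing the number of draws does not push the mean systematically up or down; any decrease relative to the 10-sample run is due to chance.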
set.seed(123)

n_samples <- 50
n_obs <- 40

mean_read_samples <- numeric(n_samples)

for (i in 1:n_samples) {
  sample_index <- sample(which(CASchools$above_40_lunch == TRUE), n_obs)
  mean_read_samples[i] <- mean(CASchools$read[sample_index])
}

mean(mean_read_samples)
## [1] 653.745
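The two sampling exercises repeat the same loop with a different `n_samples`, so the logic can be wrapped in a small function. The following is a minimal sketch, not part of the lab: `draw_sample_means` is a hypothetical helper name, and `scores` is a synthetic vector standing in for the read scores of the subgroup.

```r
# Hypothetical helper: draw n_samples samples of size n_obs from x
# (without replacement within each sample) and return each sample's mean.
draw_sample_means <- function(x, n_samples, n_obs) {
  replicate(n_samples, mean(sample(x, n_obs)))
}

set.seed(123)
# Synthetic stand-in for the subgroup's read scores (not real data)
scores <- rnorm(200, mean = 650, sd = 20)

means_10 <- draw_sample_means(scores, n_samples = 10, n_obs = 40)
means_50 <- draw_sample_means(scores, n_samples = 50, n_obs = 40)

# Compare each mean of sample means with the population mean
print(c(population = mean(scores), ten = mean(means_10), fifty = mean(means_50)))
```

With the lab data, `x` would be `CASchools$read[CASchools$above_40_lunch]`.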
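  1. What would happen if you kept doing it and drawing more and more samples? What would happen with the mean and with the distribution? Why? Explain. As the number of samples grows, the mean of the sample means converges to the population mean of the subgroup (law of large numbers); it does not keep decreasing, and any change seen between the 10-sample and 50-sample runs is sampling noise. The histogram becomes smoother and settles into an approximately bell-shaped distribution centered on the population mean (central limit theorem). Its spread is determined by the sample size of 40, not by the number of samples, so drawing more samples fills in the shape without making it narrower.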
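The claim about convergence can be checked with a short simulation. This is a sketch under stated assumptions: `population` is a synthetic vector (not the CASchools data), and the grid of sample counts is an arbitrary choice for illustration.

```r
set.seed(42)
# Synthetic population standing in for the subgroup's read scores (not real data)
population <- rnorm(250, mean = 650, sd = 20)
n_obs <- 40

# For each number of samples, record how far the mean of the sample means
# is from the population mean, and the spread of the sample means.
n_samples_grid <- c(10, 50, 500, 5000)
results <- t(sapply(n_samples_grid, function(k) {
  sample_means <- replicate(k, mean(sample(population, n_obs)))
  c(n_samples = k,
    abs_error = abs(mean(sample_means) - mean(population)),
    spread = sd(sample_means))
}))
print(results)

# Theoretical spread of a sample mean, with the finite population correction
N <- length(population)
sigma <- sqrt(mean((population - mean(population))^2))
theory_sd <- sigma / sqrt(n_obs) * sqrt((N - n_obs) / (N - 1))
print(theory_sd)
```

The `abs_error` column tends to shrink as `n_samples` grows, while the `spread` column stays near `theory_sd`, since the spread depends on the sample size `n_obs` rather than on how many samples are drawn.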