In this lab we will use the CASchools dataset from class and the
previous lab. The AER package, which contains it, is already loaded
in the first code chunk.
data("CASchools", package = "AER")
lunch indicates the share of students who
qualify for free lunch. Add a new variable to the original dataset that
is true when more than 40% of students qualify for free lunch.
CASchools$above_40_lunch <- CASchools$lunch > 0.4
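The AER documentation describes `lunch` as a percentage, so it may be worth confirming the scale before settling on a cutoff. The sketch below uses a small made-up data frame (`toy`, with invented values, not the real data) to show how the same rule behaves under a fraction-style cutoff and a percent-style cutoff.

```r
# Hypothetical toy data: lunch recorded in percent (0-100), by assumption
toy <- data.frame(lunch = c(0, 12.5, 39.9, 40.1, 76.3))

# Inspect the range to see which scale the variable uses
range(toy$lunch)

# A fraction-style cutoff flags almost every row when the data are in percent
flag_fraction <- toy$lunch > 0.4
# A percent-style cutoff matches "more than 40%" on a 0-100 scale
flag_percent <- toy$lunch > 40

table(flag_fraction)
table(flag_percent)
```

If `range(CASchools$lunch)` shows values well above 1, the percent-style cutoff is likely the one that matches the question.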
lunch (e.g. mean, median, etc.) to answer.) The sample is
roughly split in two, because the mean and median are
similar.
mean(CASchools$lunch[CASchools$above_40_lunch == TRUE])
## [1] 46.01862
mean(CASchools$lunch[CASchools$above_40_lunch == FALSE])
## [1] 0.05005
median(CASchools$lunch[CASchools$above_40_lunch == TRUE])
## [1] 44.3252
median(CASchools$lunch[CASchools$above_40_lunch == FALSE])
## [1] 0
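Another way to check how the sample splits is to count the two groups directly instead of inferring the split from the mean and median. A minimal sketch on invented values (the `toy` data frame and its `above_40` column are hypothetical, not part of CASchools):

```r
# Hypothetical toy data with lunch in percent
toy <- data.frame(lunch = c(5, 20, 35, 41, 58, 90))
toy$above_40 <- toy$lunch > 40

# Count of rows in each group
table(toy$above_40)

# The mean of a logical vector is the share of TRUE values
share_above <- mean(toy$above_40)
share_above
```

A share near 0.5 would support the claim that the sample is roughly split in two.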
math and read
conditional on more than 40% of students qualifying for free lunch. How
do they compare to the population means? They are both about 2 apart
from one another.
mean_math_high_lunch <- mean(CASchools$math[CASchools$above_40_lunch == TRUE])
mean_read_high_lunch <- mean(CASchools$read[CASchools$above_40_lunch == TRUE])
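To compare conditional means with the population means, both sets need to be computed and printed side by side. A sketch of one way to do that, using a made-up data frame (`toy`) whose scores are invented for illustration:

```r
# Hypothetical toy data: lunch in percent, plus math and read scores
toy <- data.frame(
  lunch = c(10, 25, 45, 60, 80),
  math  = c(680, 670, 650, 640, 630),
  read  = c(685, 672, 655, 642, 631)
)

# Rows where more than 40% qualify for free lunch
high <- toy$lunch > 40

# Means within the high-lunch group and over all rows
cond_means <- colMeans(toy[high, c("math", "read")])
pop_means  <- colMeans(toy[, c("math", "read")])

# Difference between conditional and population means, per subject
cond_means - pop_means
```

The same pattern should carry over to `CASchools` by swapping in the real data frame and flag variable.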
read score, and create a
histogram showing the distribution of these values. Notice that you will
build the histogram with 10 observations, each corresponding to one
sample mean. Add the true mean value to the histogram as a vertical
line. How does the mean of the sample average compare to the population
mean? The means are very similar.
set.seed(123)
n_samples <- 10  # number of samples to draw
n_obs <- 40      # observations per sample
mean_read_samples <- numeric(n_samples)
for (i in seq_len(n_samples)) {
  # draw n_obs schools (without replacement) from the flagged group
  sample_index <- sample(which(CASchools$above_40_lunch == TRUE), n_obs)
  mean_read_samples[i] <- mean(CASchools$read[sample_index])
}
# mean read score over all flagged schools, used as the reference line
true_mean_read <- mean(CASchools$read[CASchools$above_40_lunch == TRUE])
hist(mean_read_samples, main = "Distribution of Mean Read Scores", xlab = "Mean Read Score", col = "blue")
abline(v = true_mean_read, col = "red", lwd = 2, lty = 2)
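The sampling loop above can also be written more compactly with `replicate()`. The sketch below is an alternative under stated assumptions: `population` is a simulated vector of scores standing in for the real `read` column, and the seed and distribution parameters are arbitrary.

```r
set.seed(1)
# Hypothetical population of scores, not CASchools
population <- rnorm(200, mean = 650, sd = 20)

n_samples <- 10  # number of samples
n_obs <- 40      # observations per sample

# Each replication draws n_obs values without replacement and returns their mean
sample_means <- replicate(n_samples, mean(sample(population, n_obs)))

# Histogram of the sample means with the population mean as a dashed line
hist(sample_means, main = "Sample means (toy data)", xlab = "Mean score")
abline(v = mean(population), col = "red", lwd = 2, lty = 2)
```

`replicate()` avoids preallocating the result vector and indexing by `i`, at the cost of hiding the loop.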
# Repeat the exercise with 50 samples of 40 observations each
set.seed(123)
n_samples <- 50
n_obs <- 40
mean_read_samples <- numeric(n_samples)
for (i in seq_len(n_samples)) {
  sample_index <- sample(which(CASchools$above_40_lunch == TRUE), n_obs)
  mean_read_samples[i] <- mean(CASchools$read[sample_index])
}
# average of the 50 sample means
mean(mean_read_samples)
## [1] 653.745
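To see how the number of samples affects the result, one option is to measure the gap between the average of the sample means and the population mean for several values of `n_samples`. This is a sketch on simulated data; `population` and the helper `mean_of_sample_means()` are hypothetical names introduced here for illustration.

```r
set.seed(1)
# Hypothetical population of scores, not CASchools
population <- rnorm(200, mean = 650, sd = 20)

# Hypothetical helper: average of n_samples sample means, each from n_obs draws
mean_of_sample_means <- function(n_samples, n_obs) {
  mean(replicate(n_samples, mean(sample(population, n_obs))))
}

# Absolute gap to the population mean for 10, 50 and 500 samples
gaps <- sapply(c(10, 50, 500), function(k) {
  abs(mean_of_sample_means(k, n_obs = 40) - mean(population))
})
gaps
```

The gap typically shrinks as the number of samples grows, though any single run can deviate from that pattern.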