This homework assignment is written in R Markdown in the hopes that you will open this document in RStudio, knit the document to run the R code provided, answer the questions, add your own R code as necessary, and turn in one final, knitted, seamless document with all code, numerical output, graphical output, and documentation in one place.
Canvas as cex.pdf. Skim through this document and use it to answer the following questions:Answer:
Answer:
Answer:
Answer:
Answer:
Answer:
Answer:
set.seed(571) # set a random number seed to make results reproducible
N <- 5000
y_pop <- rnorm(N)
Next, draw a boxplot of the response variables \(\{y_k\}_{k\in U}\) and compute the theoretical \(y\)-total, \(T_y\):
boxplot(y_pop)
T_y <- sum(y_pop)
T_y
## [1] 9.55913
Draw a sample \(s\subset U\) via simple random sampling without replacement of size \(n=50\). Compute the Horvitz-Thompson estimator (\(N\bar y_s\)) of the total. The estimated standard error of HT, using the with-replacement approximation, is
\[
\left(\frac{N^2}{n}S^2_{ys}\right)^{1/2}.
\] The HT estimator is trivial in this case, but for future reference we will use the svytotal function in the R survey package to compute the HT estimator and its standard error, then check against the simple computations.
library(survey)
## Loading required package: grid
## Loading required package: Matrix
## Loading required package: survival
##
## Attaching package: 'survey'
## The following object is masked from 'package:graphics':
##
## dotchart
n <- 50
# Draw the srswor:
s <- sample(1:N, size = n, replace = FALSE)
y_sample <- y_pop[s]
# Simple computation of HT and its standard error:
N * mean(y_sample)
## [1] -456.4132
sqrt((N ^ 2 / n) * var(y_sample))
## [1] 706.9018
# Compute the sampling weights and construct data frame.
Sampling_Weight <- rep(N / n, n)
d <- data.frame(y_sample, Sampling_Weight)
# Create survey design object, specifying the weights.
dd <- svydesign(ids = ~1, weights = Sampling_Weight, data = d)
svytotal(~y_sample, design = dd)
## total SE
## y_sample -456.41 706.9
Now we will simulate the randomization distribution of the HT estimator by repeating the above computations over 1,000 independent simple random samples without replacement of size \(n=50\) from the finite population. For each sample, we compute the Horvitz-Thompson estimator (\(N\bar y_s\)) of the total and its standard error using the svytotal function.
nreps <- 1000
# Initialize vectors into which to put simulated results
HT <- rep(0, nreps)
SE_HT <- rep(0, nreps)
for (r in 1:nreps) {
# Draw the srswor:
s <- sample(1:N, size = n, replace = FALSE)
y_sample <- y_pop[s]
# Compute the sampling weights and construct data frame.
Sampling_Weight <- rep(N / n, n)
d <- data.frame(y_sample, Sampling_Weight)
# Create survey design object, specifying the weights.
dd <- svydesign(ids = ~1, weights = Sampling_Weight, data = d)
out <- svytotal(~y_sample, design = dd)
HT[r] <- coef(out)
SE_HT[r] <- SE(out)
}
Compute the Monte Carlo approximation to the theoretical expectation of the HT estimator, by computing the empirical mean of our 1,000 simulated HT estimators. Compare this approximation to the theoretical mean of the HT estimator, \(T_y=\sum_{k\in U}y_k\) with a formal hypothesis test: \[ H_0: \mbox{E}[\widehat T_y]=T_y\;\mbox{versus}\;H_A: \mbox{E}[\widehat T_y]\ne T_y \] Test statistic is \[ Z=\frac{\mbox{mean}\left(\{\widehat{T}_{y,r}\}_{r=1}^{1000}\right)-T_y}{\sqrt{\mbox{var}\left(\{\widehat{T}_{y,r}\}_{r=1}^{1000}\right) / 1000}} \] which has an approximately Normal(0,1) distribution under \(H_0\).
Z <- (mean(HT) - T_y) / sqrt(var(HT) / nreps)
Z
## [1] 0.7552266
Clearly, do not reject \(H_0\). The Monte Carlo mean of the HT estimator is perfectly consistent with its theoretical value, \(T_y\).
Now compute the Monte Carlo approximation to the theoretical variance of the HT estimator, by computing the empirical variance of your 1,000 simulated HT estimators. Compare this approximation to the theoretical variance of the HT estimator, \(N^2(1-n/N)S^2_{yU}/n\).
# Monte Carlo approximation to theoretical variance:
var(HT)
## [1] 474676.6
# Theoretical variance under srswor:
N ^ 2 * (1 - n / N) * var(y_pop) / n
## [1] 482997.8
These are clearly very close. Next, compare the theoretical variance above to the average estimated variance:
mean(SE_HT ^ 2)
## [1] 488898.9
Even though we have used the with-replacement approximation, our estimated variance is nearly unbiased.
Finally, plot the histogram of the 1,000 Horvitz-Thompson estimators. Add to your histogram a curve giving the normal approximation to the distribution of the HT estimator. Add a red vertical line at the theoretical mean of the HT estimator to your histogram. Comment on the quality of the approximation.
hist(HT,
col = "lightblue",
freq = FALSE,
xlab = "Horvitz-Thompson Estimator")
abline(v = T_y, col = "red", lwd = 2) # true value
# fine grid of points on which to plot normal approximation
grid <- min(HT) + (1:1000) * (max(HT) - min(HT)) / 1000
lines(grid, dnorm(grid, mean = T_y,
sd = sqrt(N ^ 2 * var(y_pop) * (1 - n / N) / n)),
col = "blue",
lwd = 3)
Clearly, the normal approximation is very good.
Now, repeat the steps above for each of the following distributions:
Answer: Please reset your random number seed via set.seed(571) prior to each of the following simulations.
Simulation for \(t_3\):
Simulation for \(\chi^2_1\):
Simulation for Bernoulli(0.2)
Simulation for Uniform(0,1):
A common statement is ``I don’t trust normal-theory confidence intervals because my responses are nowhere near normal.’’ Given your simulation results, comment on the usefulness of normal-theory confidence intervals for Horvitz-Thompson estimation of finite population totals under simple random sampling without replacement.
Answer:
\[ \begin{array}{r|cccccc} s & \{1,2\} & \{1,3\} & \{1,4\} & \{2,3\} & \{2, 4\}& \{3,4\}\\\hline p(s) & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \end{array} \]
As an alternative, consider a design in which the sample \(s\) is selected with probability proportional to the total farm acres in the sample, \(\sum_{k\in s}x_k\). So, for example, sample \(\{1, 2\}\) is selected with probability proportional to \(x_1+x_2=4+6=10\) and sample \(\{3,4\}\) is selected with probability proportional to \(x_3+x_4=6+20=26\).
Answer:
\[ \begin{array}{r|cccccc} s & \{1,2\} & \{1,3\} & \{1,4\} & \{2,3\} & \{2, 4\}& \{3,4\}\\\hline p(s) \end{array} \]
Answer:
Answer:
Answer:
Answer:
Answer: