STAA571 Homework #1: Due Sunday, February 4, 2018

Name:

This homework assignment is written in R Markdown in the hopes that you will open this document in RStudio, knit the document to run the R code provided, answer the questions, add your own R code as necessary, and turn in one final, knitted, seamless document with all code, numerical output, graphical output, and documentation in one place.

Documentation for the Consumer Expenditure Survey, conducted by the US Bureau of Labor Statistics, is available on Canvas as cex.pdf. Skim through this document and use it to answer the following questions:

What are the objectives of the CE?

Answer:

What are the two component surveys within CE and why are there two component surveys? That is, what measurement issues/errors are they trying to address?

Answer:

What two types of frames are used in the selection of households for the CE?

Answer:

What kind of coverage error is represented by Type C nonresponses?

Answer:

What population does the CE sample represent? Be precise.

Answer:

What is the name of the method used to estimate CE standard errors?

Answer:

According to the documentation, what is the typical value for \(\pi_k^{-1}\) in the CE, where \(k\) is a consumer unit?

Answer:

Consider a finite population \(U=\{1,\ldots,N\}\) with \(N=5,000\). In practice, you would draw one sample. In this simulation experiment, we will draw 1,000 replicated samples to study the randomization distribution of the Horvitz-Thompson estimator. First, we will generate response variables \(\{y_k\}_{k\in U}\) as independent and identically distributed Normal(0,1); these will remain fixed for the following sampling experiment.

set.seed(571) # set a random number seed to make results reproducible
N <- 5000
y_pop <- rnorm(N)

Next, draw a boxplot of the response variables \(\{y_k\}_{k\in U}\) and compute the theoretical \(y\)-total, \(T_y\):

boxplot(y_pop)

T_y <- sum(y_pop)
T_y

## [1] 9.55913

Draw a sample \(s\subset U\) via simple random sampling without replacement of size \(n=50\). Compute the Horvitz-Thompson estimator (\(N\bar y_s\)) of the total. The estimated standard error of HT, using the with-replacement approximation, is
\[ \left(\frac{N^2}{n}S^2_{ys}\right)^{1/2}. \] The HT estimator is trivial in this case, but for future reference we will use the svytotal function in the R survey package to compute the HT estimator and its standard error, then check against the simple computations.

library(survey)

## Loading required package: grid

## Loading required package: Matrix

## Loading required package: survival

## 
## Attaching package: 'survey'

## The following object is masked from 'package:graphics':
## 
##     dotchart

n <- 50
# Draw the srswor:
s <- sample(1:N, size = n, replace = FALSE)
y_sample <- y_pop[s]
# Simple computation of HT and its standard error:
N * mean(y_sample)

## [1] -456.4132

sqrt((N ^ 2 / n) * var(y_sample))

## [1] 706.9018

# Compute the sampling weights and construct data frame.
Sampling_Weight <- rep(N / n, n)
d <- data.frame(y_sample, Sampling_Weight)
# Create survey design object, specifying the weights.
dd <- svydesign(ids = ~1, weights = Sampling_Weight, data = d)
svytotal(~y_sample, design = dd)

##            total    SE
## y_sample -456.41 706.9

Now we will simulate the randomization distribution of the HT estimator by repeating the above computations over 1,000 independent simple random samples without replacement of size \(n=50\) from the finite population. For each sample, we compute the Horvitz-Thompson estimator (\(N\bar y_s\)) of the total and its standard error using the svytotal function.

nreps <- 1000
# Initialize vectors into which to put simulated results
HT <- rep(0, nreps)  
SE_HT <- rep(0, nreps)
for (r in 1:nreps) {
  # Draw the srswor:
  s <- sample(1:N, size = n, replace = FALSE)
  y_sample <- y_pop[s]
  # Compute the sampling weights and construct data frame.
  Sampling_Weight <- rep(N / n, n)
  d <- data.frame(y_sample, Sampling_Weight)
  # Create survey design object, specifying the weights.
  dd <- svydesign(ids = ~1, weights = Sampling_Weight, data = d)
  out <- svytotal(~y_sample, design = dd)
  HT[r] <- coef(out)
  SE_HT[r] <- SE(out)
}

Compute the Monte Carlo approximation to the theoretical expectation of the HT estimator, by computing the empirical mean of our 1,000 simulated HT estimators. Compare this approximation to the theoretical mean of the HT estimator, \(T_y=\sum_{k\in U}y_k\) with a formal hypothesis test: \[ H_0: \mbox{E}[\widehat T_y]=T_y\;\mbox{versus}\;H_A: \mbox{E}[\widehat T_y]\ne T_y \] Test statistic is \[ Z=\frac{\mbox{mean}\left(\{\widehat{T}_{y,r}\}_{r=1}^{1000}\right)-T_y}{\sqrt{\mbox{var}\left(\{\widehat{T}_{y,r}\}_{r=1}^{1000}\right) / 1000}} \] which has an approximately Normal(0,1) distribution under \(H_0\).

Z <- (mean(HT) - T_y) / sqrt(var(HT) / nreps)
Z

## [1] 0.7552266

Clearly, do not reject \(H_0\). The Monte Carlo mean of the HT estimator is perfectly consistent with its theoretical value, \(T_y\).

Now compute the Monte Carlo approximation to the theoretical variance of the HT estimator, by computing the empirical variance of your 1,000 simulated HT estimators. Compare this approximation to the theoretical variance of the HT estimator, \(N^2(1-n/N)S^2_{yU}/n\).

# Monte Carlo approximation to theoretical variance:
var(HT)

## [1] 474676.6

# Theoretical variance under srswor:
N ^ 2 * (1 - n / N) * var(y_pop) / n

## [1] 482997.8

These are clearly very close. Next, compare the theoretical variance above to the average estimated variance:

mean(SE_HT ^ 2)

## [1] 488898.9

Even though we have used the with-replacement approximation, our estimated variance is nearly unbiased.

Finally, plot the histogram of the 1,000 Horvitz-Thompson estimators. Add to your histogram a curve giving the normal approximation to the distribution of the HT estimator. Add a red vertical line at the theoretical mean of the HT estimator to your histogram. Comment on the quality of the approximation.

hist(HT,
     col = "lightblue",
     freq = FALSE,
     xlab = "Horvitz-Thompson Estimator")
abline(v = T_y, col = "red", lwd = 2) # true value
# fine grid of points on which to plot normal approximation
grid <- min(HT) + (1:1000) * (max(HT) - min(HT)) / 1000
lines(grid, dnorm(grid, mean = T_y, 
                  sd = sqrt(N ^ 2 * var(y_pop) * (1 - n / N) / n)),
     col = "blue",
     lwd = 3)

Clearly, the normal approximation is very good.

Now, repeat the steps above for each of the following distributions:

Student’s \(t\) with 3 degrees of freedom.
\(\chi^2\) with 1 degree of freedom.
Bernoulli (binary) with success probability 0.2.
Uniform(0,1).

Answer: Please reset your random number seed via set.seed(571) prior to each of the following simulations.

Simulation for \(t_3\):

Simulation for \(\chi^2_1\):

Simulation for Bernoulli(0.2)

Simulation for Uniform(0,1):

A common statement is ``I don’t trust normal-theory confidence intervals because my responses are nowhere near normal.’’ Given your simulation results, comment on the usefulness of normal-theory confidence intervals for Horvitz-Thompson estimation of finite population totals under simple random sampling without replacement.

Answer:

In the Farm I example discussed in class, we considered the sampling design

\[ \begin{array}{r|cccccc} s & \{1,2\} & \{1,3\} & \{1,4\} & \{2,3\} & \{2, 4\}& \{3,4\}\\\hline p(s) & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \end{array} \]

As an alternative, consider a design in which the sample \(s\) is selected with probability proportional to the total farm acres in the sample, \(\sum_{k\in s}x_k\). So, for example, sample \(\{1, 2\}\) is selected with probability proportional to \(x_1+x_2=4+6=10\) and sample \(\{3,4\}\) is selected with probability proportional to \(x_3+x_4=6+20=26\).

Compute \(p(s)\) under this alternative design

Answer:

\[ \begin{array}{r|cccccc} s & \{1,2\} & \{1,3\} & \{1,4\} & \{2,3\} & \{2, 4\}& \{3,4\}\\\hline p(s) \end{array} \]

For the alternative design, write down all of the first-order inclusion probabilities (FOIPs). Verify that \(\sum_{k\in U}\pi_k=\mbox{E}[n] = 2\) for this design.

Answer:

For the alternative design, write down all of the second-order inclusion probabilities (SOIPs) \(\{\pi_{k\ell}\}\) in a \(4\times 4\) table, and write down all of the design covariances \(\{\Delta_{k\ell}\}\) in a \(4\times 4\) table. Since \(n=2\) is not random under this design, Var\((n)=0\), so the sum of the design covariances is theoretically equal to zero: \[ 0=\mbox{Var}(n)=\mbox{Var}\left(\sum_{k\in U}I_k\right)=\sum\sum_{k,\ell\in U}\Delta_{k\ell}. \] Verify that, to a good approximation, your computed \(\Delta_{k\ell}\)’s sum to zero.

Answer:

Compute the randomization distribution of the Horvitz-Thompson (HT) estimator of the total acres of corn under this alternative design; that is, compute all six possible values of the HT estimator. Verify directly from the randomization distribution that HT is unbiased, by computing the sum of the six possible values times their respective probabilities: \(\sum_{s\in{\cal S}}p(s)\widehat T_y(s) = T_y\).

Answer:

Compute the design variance of the HT estimator under the altenative design in two ways: directly from the randomization distribution, via \[ \mbox{Var}(\widehat T_y) = \sum_{s\in{\cal S}}p(s)\widehat T_y^2(s)- T_y^2, \] and from the variance formula \[ \mbox{Var}(\widehat T_y) = \sum\sum_{k,\ell\in U}\Delta_{k\ell}\frac{y_k}{\pi_k}\frac{y_\ell}{\pi_\ell}. \] Compare to the theoretical variance of HT under the Farm I design, \[ \frac{N^2}{n}\left(1-\frac{n}{N}\right)S_{yU}^2. \]

Answer:

Graph the randomization distribution (HT on the horizontal axis, \(p(s)\) on the vertical axis) and compare in the same plot to the distribution of the HT estimator under the Farm~I design. Which design do you prefer, and why?

Answer: