Question 1
No, this alone does not guarantee a simple random sample. In a simple random sample of size n, every unit must have the same probability of being in the sample, and every subset of n units must also have the same probability of being the sample. Under a stratified sampling scheme with proportional allocation, every unit has the same probability of being included, but some subsets can never occur (for example, any subset drawn entirely from one stratum). Because not all subsets are equally likely, this design is not a simple random sample.
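A minimal sketch of this argument, assuming a made-up population of four units split into two strata of two units each (the labels and sizes are hypothetical). It enumerates every sample each design can produce:

```r
# Toy population: 4 units in two strata of size 2 (hypothetical labels).
units  <- 1:4
strata <- c("A", "A", "B", "B")

# Proportional allocation with n = 2: one unit from each stratum.
# All 4 possible stratified samples are equally likely.
strat_samples <- expand.grid(a = units[strata == "A"], b = units[strata == "B"])

# Inclusion probability of each unit = share of stratified samples containing it.
incl_prob <- sapply(units, function(u) {
  mean(strat_samples$a == u | strat_samples$b == u)
})

# An SRS of size 2 allows all choose(4, 2) = 6 subsets.
srs_samples <- t(combn(units, 2))

# Subsets an SRS can produce but the stratified design cannot.
strat_keys   <- paste(strat_samples$a, strat_samples$b)
srs_keys     <- paste(srs_samples[, 1], srs_samples[, 2])
never_occurs <- setdiff(srs_keys, strat_keys)
```

Here incl_prob is the same for every unit, yet never_occurs is not empty, which is exactly the gap between "equal inclusion probabilities" and "simple random sample".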

Question 2
a. You could do ratio estimation by using some known demographic information. For example, if you wanted to know how many votes a candidate will get from a certain age group, you could obtain voting and age information from the exit poll to find out what proportion of people in that age group are voting for the candidate. You can use that ratio and census information to estimate the total number of votes that candidate will get from that age group. This could also be done with other demographic information such as race.

b. Exit polls have a source of potential bias in that only those who physically go to the polls on election day can end up in the sample. Those who vote early or by mail have no chance of being included. As with many surveys, the people who are willing to respond may also differ from those who are not. For example, people who vote on their lunch break or before work in the morning may be less likely to stop and take a survey, because they are in more of a hurry than someone who doesn’t work or has a more flexible schedule. People might also not want to respond if their opinion on the election differs from that of most other people in their area or polling place (even though I believe the questionnaires are supposed to be anonymous).
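The ratio estimate described in part a can be sketched as follows. All numbers are made up for illustration: the poll responses are hypothetical, and t_x stands in for a count of voters in the age group that would come from census or turnout records.

```r
# Hypothetical exit-poll responses (made-up values, not real data).
poll <- data.frame(
  age_18_29 = c(1, 1, 0, 1, 0, 1, 0, 0),  # 1 if the respondent is in the age group
  candidate = c(1, 0, 1, 1, 0, 1, 1, 0)   # 1 if the respondent voted for the candidate
)

x <- poll$age_18_29                   # auxiliary variable: in the age group
y <- poll$age_18_29 * poll$candidate  # in the age group AND voted for the candidate

# Estimated ratio: share of the age group voting for the candidate.
B_hat <- sum(y) / sum(x)

# Known auxiliary total (assumed figure): voters in the age group.
t_x <- 12000

# Ratio estimate of the candidate's total votes from this age group.
t_hat_ratio <- B_hat * t_x
```

The same pattern would apply to any other demographic variable for which a population total is known.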

Question 3

Stratified and cluster sampling are similar in that observations are grouped in some way, but they differ in how the units are selected. For stratified sampling, you construct a sampling scheme that involves taking some number of units from every stratum. In cluster sampling, on the other hand, not all groups are sampled. You would take an SRS of the clusters, and then either sample every unit from that cluster, or perform a second stage of sampling. In stratified sampling, all strata will be represented in your sample. In cluster sampling, only the clusters that were selected by the SRS will be represented.

In stratified sampling, ideally each stratum is very different from the others but homogeneous within, because the variance of the estimator depends on the variability within each stratum. With cluster sampling, we want the variability between clusters to be low and the variability within each cluster to be high, so that each unit in a sampled cluster gives you more information. The variance here depends mostly on the variability between the different clusters.
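To make the within/between distinction concrete, here is a small sketch that splits the total sum of squares of a made-up population under two different groupings. The values, the groupings, and the helper name ss_decomp are all hypothetical.

```r
# Split the total sum of squares into within-group and between-group parts.
ss_decomp <- function(y, g) {
  group_means <- ave(y, g)  # each unit's own group mean
  c(within  = sum((y - group_means)^2),
    between = sum((group_means - mean(y))^2))
}

# Made-up population of six values.
y <- c(1, 2, 3, 10, 11, 12)

# Grouping 1: similar values together (what we want from strata).
as_strata <- ss_decomp(y, c(1, 1, 1, 2, 2, 2))

# Grouping 2: each group spans the range (what we want from clusters).
as_clusters <- ss_decomp(y, c(1, 2, 1, 2, 1, 2))
```

The first grouping puts almost all of the variability between groups, which suits stratification; the second puts most of it within groups, which suits cluster sampling.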

Question 4

library(SDaA)
library(dplyr)
measle <- read.csv("measles.csv")

#Mi total is total nonimmunized students at school. mi is number in sample

N <- 46
n <- 10

#a. estimate, for each school, the percentage of parents who returned the consent form:
# (parents who returned the form) / (parents sampled at the school)
# returnf: 1 is yes, 0 is no

#change the nonresponses to zeros
measle$returnf[which(measle$returnf==9)] <- 0

prop <- measle %>%
  group_by(school) %>%
  summarise(returns = sum(returnf), m_i = mean(mi), M_i = mean(Mitotal))
prop$percent <- prop$returns / prop$m_i

#answer for part a:
prop$percent
##  [1] 0.4750000 0.5000000 0.6842105 0.6000000 0.4000000 0.5200000 0.6521739
##  [8] 0.4883721 0.6052632 0.3333333
#b. sampling weights using number of respondents as mi -> number of yes's for returnf

prop$weight <- (N/n)*(prop$M_i/prop$returns)

prop$weight #sampling weights, answer for b
##  [1]  18.88421  57.62105  92.35385  44.46667  90.46667  66.52308  34.65333
##  [8]  37.23810  59.20000 136.02857
#c. overall percentage of parents who received consent form

forms <- group_by(measle, school) %>% summarise(formz = sum(form),m_i = mean(mi), M_i = mean(Mitotal))

totalforms = sum(forms$formz)
forms$p_i <- forms$formz/totalforms
forms$t_i <- forms$M_i*forms$p_i

#estimated proportion: the unbiased estimator of the total (t hat), with p_i in place of ybar_i, divided by the sum of the M_i
that_prop = (N/n)*(1/sum(forms$M_i))*sum(forms$M_i*forms$p_i) 

#variance of that_prop

ssquared_t = (1/(n-1))*sum((forms$t_i-(that_prop/N))^2) 
var2 = (N/n)*sum((1-(forms$m_i/forms$M_i)*forms$M_i^2*(forms$p_i*(1-forms$p_i)/(forms$m_i-1))))

variance <- N^2*(1-n/N)*(ssquared_t/n)+ var2
se <- sqrt(variance)*(1/sum(forms$M_i))
CI <- that_prop+ c(-1,1)*qnorm((1-(0.05/2)))*se

#d analyze as SRS

phat <- totalforms/sum(forms$M_i)
var_phat <- (1-(n/N))*(phat*(1-phat))/(n-1)
se_phat <- sqrt(var_phat)

ratio = variance/var_phat

Question 4, part c: the estimated overall percentage of parents who received a consent form is 44.5%.
95% confidence interval: 17% to 72.2%

Question 4, part d: the ratio of the cluster-design variance to the SRS variance is 6455107.97, which would mean clustering added a huge amount of variability. We expect cluster sampling to have higher variability than an SRS, especially when the Mi’s are unequal. We are also performing two stages of sampling here, which adds another level of variability. I am not confident that this number is correct, but it makes sense that cluster sampling would be less efficient here than a simple random sample.
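As a cross-check on the two-stage calculation, here is a minimal sketch of the ratio estimator of a proportion on made-up data. It assumes p_i is the within-school sample proportion (forms received divided by m_i) and uses the usual textbook form of the two-stage ratio variance as I understand it; the school counts below are hypothetical and are not the measles data.

```r
# Hypothetical two-stage sample: n = 3 schools drawn from N = 5 (made-up numbers).
N <- 5
schools <- data.frame(
  M_i   = c(40, 20, 30),  # parents in each sampled school
  m_i   = c(10, 10, 10),  # parents sampled in each school
  forms = c(5, 8, 2)      # sampled parents who received a form
)
n <- nrow(schools)

# Within-school sample proportion and estimated school total.
schools$p_i <- schools$forms / schools$m_i
schools$t_i <- schools$M_i * schools$p_i

# Ratio estimate of the overall proportion.
p_hat <- sum(schools$t_i) / sum(schools$M_i)

# Between-school component: spread of t_i around M_i * p_hat.
s2_r <- sum((schools$t_i - schools$M_i * p_hat)^2) / (n - 1)

# Within-school component: for a proportion, s_i^2 / m_i = p_i (1 - p_i) / (m_i - 1).
within <- with(schools,
               sum(M_i^2 * (1 - m_i / M_i) * p_i * (1 - p_i) / (m_i - 1)))

# Variance, standard error, and 95% confidence interval of p_hat.
M_bar <- mean(schools$M_i)
var_p <- ((1 - n / N) * s2_r / n + within / (n * N)) / M_bar^2
se_p  <- sqrt(var_p)
ci    <- p_hat + c(-1, 1) * qnorm(0.975) * se_p
```

Because p_hat is a weighted average of the p_i, it always lies between the smallest and largest school proportion, which gives a quick sanity check on any two-stage estimate.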

Question 5

  1. psi_i for each state is the state’s land area divided by the total land area of the country.
  2. psi_i for each state is the state’s population divided by the total population of the country.
data(statepps)

states <- statepps

total_land <- max(states$cumland)
total_pop <- max(states$cumpopn)

states$psi_iland <- states$landarea/total_land
states$psi_ipop <- states$popn/total_pop


set.seed(314)

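n <- 50
#draw 50 states with replacement, with probability proportional to land area
samp_landarea <- sample(1:51, n, replace = TRUE, prob = states$psi_iland)

#same, with probability proportional to population
samp_population <- sample(1:51, n, replace = TRUE, prob = states$psi_ipop)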

#estimate total number of counties using samp_landarea

t_hat_psiland <- (1/n)*sum(states$counties[samp_landarea]/states$psi_iland[samp_landarea])
t_hat_psipop <- (1/n)*sum(states$counties[samp_population]/states$psi_ipop[samp_population])

#and the variance...
var_land <- (1/n)*(1/(n-1))*sum(((states$counties[samp_landarea]/states$psi_iland[samp_landarea])-t_hat_psiland)^2)
s_e_land = sqrt(var_land)

var_pop <- (1/n)*(1/(n-1))*sum(((states$counties[samp_population]/states$psi_ipop[samp_population])-t_hat_psipop)^2)
s_e_pop <- sqrt(var_pop)
  1. estimate of total number of counties, probabilities proportional to land area: 2749
    standard error: 342
    The true total number of counties is 3007. This estimate is off by 258 counties.

  2. estimate of total number of counties, probabilities proportional to population: 3215
    standard error: 346
    This estimate is off from the true total by 208 counties. Using the state population gave an estimate that was closer to the true value, and we didn’t really lose any precision (the standard errors for both are very close).
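The two estimates above use the same with-replacement formula, so it can be wrapped in a small helper. This is a sketch only: the function name hh_estimate and the four-unit population are hypothetical. The toy population is chosen so that psi is exactly proportional to y, which is the ideal case for this kind of sampling.

```r
# With-replacement, unequal-probability estimate of a population total.
# y:   values of the sampled units (one entry per draw)
# psi: draw probabilities of those same sampled units
hh_estimate <- function(y, psi) {
  u <- y / psi  # each draw's own estimate of the total
  c(estimate = mean(u), se = sqrt(var(u) / length(u)))
}

# Made-up population where psi is exactly proportional to y.
y_pop   <- c(10, 20, 30, 40)
psi_pop <- y_pop / sum(y_pop)

set.seed(1)
draws <- sample(seq_along(y_pop), size = 3, replace = TRUE, prob = psi_pop)
res   <- hh_estimate(y_pop[draws], psi_pop[draws])
```

When psi is exactly proportional to y, every draw gives the true total and the standard error is zero (up to floating-point rounding). The closer the size measure tracks the number of counties, the smaller the standard error should be.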

#how about an SRS; use land sample first

bigN <- 51

ybar <- sum(states$counties[samp_landarea])/n
t_hat <- ybar*bigN

s_squared <- var(states$counties[samp_landarea])

#no fpc here because the sample was drawn with replacement
standard_error <- sqrt(bigN^2*(s_squared/n))

#now population sample

ybarpop <- sum(states$counties[samp_population])/n
t_hatpop <- ybarpop*bigN

s_squaredpop <- var(states$counties[samp_population])

#no fpc here because the sample was drawn with replacement
standard_errorpop <- sqrt(bigN^2*(s_squaredpop/n))
  1. If I use the sample generated with probabilities proportional to land area and treat it as an SRS, the estimate of the total number of counties is 3888, which is way off. The standard error of this estimate is 446, which is higher than for the previous two estimates.
    The sample generated with probabilities proportional to population is even worse: treated as a simple random sample, it gives an estimate of 4887 counties with a standard error of 451.
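One possible explanation for the overshoot is that the unweighted mean is biased upward whenever units with large y are more likely to be drawn. The sketch below shows this with exact expectations on a made-up four-unit population (not the state data), again with psi proportional to y:

```r
# Made-up population; psi favors units with large y.
y_pop   <- c(10, 20, 30, 40)
psi_pop <- y_pop / sum(y_pop)
N_pop   <- length(y_pop)

true_total <- sum(y_pop)

# Expected value of one weighted draw y_i / psi_i: equals the true total.
expected_weighted <- sum(psi_pop * (y_pop / psi_pop))

# Expected value of N * ybar when the same draws are treated as an SRS.
expected_unweighted <- N_pop * sum(psi_pop * y_pop)
```

Here the weighted estimator is centered on the true total, while the SRS-style estimator is centered well above it. If states with more land or more people tend to have more counties, the same mechanism would push both SRS-style estimates upward.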