Question 1
No, this alone does not guarantee a simple random sample. In a simple random sample, every unit must have the same probability of being in the sample, and every subset of units of the given sample size must also have the same probability of being the sample. Under a stratified sampling scheme with proportional allocation, every unit has the same probability of being included, but some subsets can never occur (for example, a sample drawn entirely from one stratum). Since not all subsets have the same probability of occurring, this design is different from a simple random sample.
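A small enumeration makes the distinction concrete. This is only a sketch on a made-up population of four units (units 1 and 2 in one stratum, units 3 and 4 in another); the names `units`, `inclusion_prob`, and `is_12` are illustrative and not part of the assignment.

```r
# Hypothetical population: units 1-2 form stratum A, units 3-4 form stratum B.
units <- 1:4

# SRS of size 2: all choose(4, 2) = 6 subsets are equally likely.
srs_samples <- t(combn(units, 2))
srs_prob <- rep(1 / nrow(srs_samples), nrow(srs_samples))

# Stratified sample with proportional allocation: one unit from each stratum.
strat_samples <- as.matrix(expand.grid(A = 1:2, B = 3:4))
strat_prob <- rep(1 / nrow(strat_samples), nrow(strat_samples))

# Inclusion probability of a unit = total probability of the samples containing it.
inclusion_prob <- function(samples, prob) {
  sapply(units, function(u) sum(prob[rowSums(samples == u) > 0]))
}
srs_incl <- inclusion_prob(srs_samples, srs_prob)
strat_incl <- inclusion_prob(strat_samples, strat_prob)

# Probability that the sample is exactly {1, 2} under each design.
is_12 <- function(samples) apply(samples, 1, function(s) setequal(s, c(1, 2)))
p12_srs <- sum(srs_prob[is_12(srs_samples)])
p12_strat <- sum(strat_prob[is_12(strat_samples)])
```

Both designs give every unit the same inclusion probability, but the subset {1, 2} has positive probability only under the SRS.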
Question 2
a. You could do ratio estimation by using some known demographic information. For example, if you wanted to know how many votes a candidate will get from a certain age group, you could obtain voting and age information from the exit poll to find out what proportion of people in that age group are voting for the candidate. You can use that ratio and census information to estimate the total number of votes that candidate will get from that age group. This could also be done with other demographic information such as race.
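The arithmetic behind this ratio estimate can be sketched as follows. All numbers are made up for illustration; `voters_in_group` stands in for whatever known total (census or turnout records) is available for the age group.

```r
# Hypothetical exit-poll tallies for one age group (made-up numbers).
polled_in_group <- 200        # respondents in the age group
polled_for_candidate <- 120   # of those, how many voted for the candidate

# Hypothetical known total of voters in that age group (auxiliary information).
voters_in_group <- 50000

# Estimated ratio from the poll, scaled up by the known total.
ratio_hat <- polled_for_candidate / polled_in_group
total_hat <- ratio_hat * voters_in_group
```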
Question 3
Stratified and cluster sampling are similar in that observations are grouped in some way, but they differ in how the units are selected. For stratified sampling, you construct a sampling scheme that involves taking some number of units from every stratum. In cluster sampling, on the other hand, not all groups are sampled. You would take an SRS of the clusters, and then either sample every unit from that cluster, or perform a second stage of sampling. In stratified sampling, all strata will be represented in your sample. In cluster sampling, only the clusters that were selected by the SRS will be represented.
In stratified sampling, ideally each stratum is very different from the others but homogeneous within, because the variance of the estimator depends on the variability within each stratum. With cluster sampling, we want the variability between clusters to be low and the variability within each cluster to be high, so that each unit in a cluster gives more information. The variance here depends mostly on the variability between the different clusters.
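The difference in how units are selected can be sketched on a toy population. The data frame `pop` and its four groups are hypothetical; the same groups are treated first as strata and then as clusters, and the cluster sample here is one-stage (every unit in a selected cluster is kept).

```r
set.seed(1)

# Hypothetical population: 20 units in 4 groups of 5.
pop <- data.frame(id = 1:20, group = rep(c("A", "B", "C", "D"), each = 5))

# Stratified sampling: an SRS of 2 units within every group.
strat_ids <- unlist(lapply(split(pop$id, pop$group), function(ids) sample(ids, 2)))
strat_sample <- pop[pop$id %in% strat_ids, ]

# One-stage cluster sampling: an SRS of 2 groups, keeping all of their units.
chosen <- sample(unique(pop$group), 2)
cluster_sample <- pop[pop$group %in% chosen, ]

# Number of groups represented under each design.
groups_in_strat <- length(unique(strat_sample$group))
groups_in_cluster <- length(unique(cluster_sample$group))
```

Every group appears in the stratified sample, while only the selected groups appear in the cluster sample.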
Question 4
library(SDaA)
library(dplyr)
measle <- read.csv("measles.csv")
#Mitotal is the total number of nonimmunized students at the school; mi is the number in the sample
#N is the number of schools in the population; n is the number of schools sampled
N <- 46
n <- 10
#a. estimate for each school the percentage of parents who returned the consent form;
#taken here as (number who returned the form)/(number of parents sampled at the school)
#returnf: 1 is yes, 0 is no, 9 is a nonresponse
#change the nonresponses to zeros
measle$returnf[measle$returnf == 9] <- 0
prop <- group_by(measle, school) %>% summarise(returns = sum(returnf), m_i = mean(mi), M_i = mean(Mitotal))
prop$percent <- prop$returns/prop$m_i
#answer for part a:
prop$percent
## [1] 0.4750000 0.5000000 0.6842105 0.6000000 0.4000000 0.5200000 0.6521739
## [8] 0.4883721 0.6052632 0.3333333
#b. sampling weights using number of respondents as mi -> number of yes's for returnf
prop$weight <- (N/n)*(prop$M_i/prop$returns)
prop$weight #sampling weights, answer for b
## [1] 18.88421 57.62105 92.35385 44.46667 90.46667 66.52308 34.65333
## [8] 37.23810 59.20000 136.02857
#c. overall percentage of parents who received consent form
forms <- group_by(measle, school) %>% summarise(formz = sum(form), m_i = mean(mi), M_i = mean(Mitotal))
totalforms <- sum(forms$formz)
#p_i: sample proportion within school i who received the form
forms$p_i <- forms$formz/forms$m_i
#t_i: estimated number in school i who received the form
forms$t_i <- forms$M_i*forms$p_i
#unbiased estimate of the total, using p_i in place of ybar_i
that <- (N/n)*sum(forms$t_i)
#the total number of students is not given, so estimate it by (N/n)*sum(M_i)
M0_hat <- (N/n)*sum(forms$M_i)
that_prop <- that/M0_hat
#variance of that: between-school term plus within-school term
ssquared_t <- (1/(n-1))*sum((forms$t_i-(that/N))^2)
var2 <- (N/n)*sum((1-(forms$m_i/forms$M_i))*forms$M_i^2*(forms$p_i*(1-forms$p_i)/(forms$m_i-1)))
variance <- N^2*(1-n/N)*(ssquared_t/n) + var2
#se of the proportion, treating M0_hat as fixed
se <- sqrt(variance)/M0_hat
CI <- that_prop + c(-1,1)*qnorm(1-(0.05/2))*se
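#d analyze as an SRS of students, ignoring the clustering
n_srs <- sum(forms$m_i)
phat <- totalforms/n_srs
#fpc uses the estimated number of students in all N schools
var_phat <- (1-(n_srs/((N/n)*sum(forms$M_i))))*(phat*(1-phat))/(n_srs-1)
se_phat <- sqrt(var_phat)
#compare variances on the same scale (both are for the proportion)
ratio <- se^2/var_phat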
Question 4, part c: estimated overall percentage of parents who received a consent form: 44.5%
95% confidence interval: 0.17 to 0.722
Question 4, part d: the computed ratio is 6455107.97, which would mean clustering added a huge amount of variability. We expect cluster sampling to have higher variance than an SRS of the same size, especially when the Mi’s are unequal. We are also performing two stages of sampling here, which adds another level of variability. I am not confident that this number is correct (a ratio in the millions is implausibly large for a design effect), but it makes sense that we would find cluster sampling to be less efficient here than a simple random sample.
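For reference, the two-stage calculation in part c can be sketched as a small function on made-up data. This is a sketch under stated assumptions, not the assignment's answer: the function name `two_stage_prop` and all input numbers are hypothetical, the within-school variance uses p_i(1 - p_i)/(m_i - 1), and the total number of students is estimated by (N/n) times the sum of the sampled M_i and then treated as fixed when converting the total to a proportion.

```r
# yes: number of "yes" responses per sampled cluster
# m:   number of units sampled per cluster
# M:   total number of units per sampled cluster
# N:   number of clusters in the population
two_stage_prop <- function(yes, m, M, N) {
  n <- length(M)
  p <- yes / m                      # within-cluster sample proportions
  t_i <- M * p                      # estimated cluster totals
  t_hat <- (N / n) * sum(t_i)       # unbiased estimate of the population total
  # Between-cluster term; var(t_i) equals sum((t_i - t_hat / N)^2) / (n - 1).
  between_var <- N^2 * (1 - n / N) * var(t_i) / n
  # Within-cluster term from the second stage of sampling.
  within_var <- (N / n) * sum((1 - m / M) * M^2 * p * (1 - p) / (m - 1))
  # Assumption: the population size is estimated and then treated as fixed.
  M0_hat <- (N / n) * sum(M)
  list(p_hat = t_hat / M0_hat, se = sqrt(between_var + within_var) / M0_hat)
}

# Made-up example: 3 of 20 clusters sampled.
res <- two_stage_prop(yes = c(5, 9, 4), m = c(10, 12, 10), M = c(40, 60, 50), N = 20)
```

Both `p_hat` and `se` are on the proportion scale, so `se^2` is the quantity to compare against an SRS variance for a proportion.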
Question 5
data(statepps)
states <- statepps
total_land <- max(states$cumland)
total_pop <- max(states$cumpopn)
states$psi_iland <- states$landarea/total_land
states$psi_ipop <- states$popn/total_pop
set.seed(314)
n<-50
samp_landarea <- sample(c(1:51), n, replace = TRUE, prob = states$psi_iland )
samp_population <- sample(c(1:51), n, replace = TRUE, prob = states$psi_ipop)
#estimate total number of counties using samp_landarea
t_hat_psiland <- (1/n)*sum(states$counties[samp_landarea]/states$psi_iland[samp_landarea])
t_hat_psipop <- (1/n)*sum(states$counties[samp_population]/states$psi_ipop[samp_population])
#and the variance...
var_land <- (1/n)*(1/(n-1))*sum(((states$counties[samp_landarea]/states$psi_iland[samp_landarea])-t_hat_psiland)^2)
s_e_land = sqrt(var_land)
var_pop <- (1/n)*(1/(n-1))*sum(((states$counties[samp_population]/states$psi_ipop[samp_population])-t_hat_psipop)^2)
s_e_pop <- sqrt(var_pop)
Estimate of total number of counties, probabilities proportional to land area: 2749
Standard error: 342
The true total number of counties is 3007, so this estimate is off by 258 counties.
Estimate of total number of counties, probabilities proportional to population: 3215
Standard error: 346
This estimate is off from the true total by 208 counties. Using the state population gave an estimate that was closer to the true value, and we didn’t really lose any precision (the standard errors for both are very close).
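The estimator used above (the mean of y_i/psi_i over the draws, for sampling with replacement with unequal probabilities) can be sketched on made-up data. The function name `pps_total` and the toy values are hypothetical. In this toy case y is exactly proportional to the size measure, so every draw gives the same value of y_i/psi_i and the standard error is zero; in general, the closer y tracks the size measure, the more precise the estimate.

```r
# Estimate a population total from a with-replacement unequal-probability sample.
# y:   values of the variable of interest for the sampled units (one per draw)
# psi: draw probabilities of those same units
pps_total <- function(y, psi) {
  n <- length(y)
  u <- y / psi                 # each draw gives its own estimate of the total
  c(t_hat = mean(u), se = sqrt(var(u) / n))
}

set.seed(314)
size <- c(10, 20, 30, 40)      # hypothetical size measure
y_pop <- 2 * size              # y exactly proportional to size; true total is 200
psi_pop <- size / sum(size)

# Draw 5 units with replacement, with probability proportional to size.
draws <- sample(seq_along(size), 5, replace = TRUE, prob = psi_pop)
est <- pps_total(y_pop[draws], psi_pop[draws])
```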
#how about analyzing as an SRS? use the land-area sample first
bigN <- 51
ybar <- mean(states$counties[samp_landarea])
t_hat <- ybar*bigN
s_squared <- var(states$counties[samp_landarea])
#the sample was drawn with replacement, so no fpc is applied
standard_error <- sqrt(bigN^2*(s_squared/n))
#now the population sample
ybarpop <- mean(states$counties[samp_population])
t_hatpop <- ybarpop*bigN
s_squaredpop <- var(states$counties[samp_population])
#again no fpc, since the sample was drawn with replacement
standard_errorpop <- sqrt(bigN^2*(s_squaredpop/n))