No, this fact alone does not guarantee that this sample is a simple random sample. In a simple random sample they all have the same probability, but that’s only part of it. All combinations and subsets must also have the same probability which separates it from proportional allocation.
A great article from the New York Times describes the issues of exit polls:
https://www.nytimes.com/2018/03/29/opinion/2016-exit-polls-election.html
In the article, it discusses the major problem with exit polls; it is very difficult to weigh an exit poll to the target population, especially on the night of the election because we don’t know yet what the electorate’s demographics will be when the voting is happening.
Ratio estimation can help because we can ask those administrating the poll to approximate an individuals age, race, and gender and weight their vote based on this. This auxillary information can be used to create a proportion to better accurately capture the way the individuals are voting.
Some sources of selection bias would include the fact that the sample consists of entirely volunteers. More educated individuals may be more interested in sharing their feedback in survey polls. Polls near college campuses may have a a larger audience who is willing to share their opinions which introduces bias.
Additionally, the individuals who vote early are not included in the exit polls.
A Stratified Random Sample
Stratified Random Sample: Population is divided into subgroups called strata; then a SRS is taken from each stratum, and the SRS’s in each strata are done independently. Often times, stratified random sampling increases the precision of the estimate compared to SRS. Stratification is often more efficient when means differ. In the “ANOVA” comparison, if there is a large between sum of squares and smaller variability within a strata, then stratified random sampling may be the better option.
Cluster Sample: Observations in the population are aggregated into clusters and a SRS is taken of the clusters. Optionally, a sample could be taken within each cluster. A clustering sample may provide a less precise estimate than an SRS of the same size. In the “ANOVA” comparison, if there is a small between sum of squares and larger variability within a cluster, then cluster sampling may be the better option. The variance essentially depends on variability between clusters.
A simple example to help put things into perspective. Lets take students at Loyola as our target population and we are trying to estimate their overall satisfaction. Lets assume our strata or cluster was grade level. If all first year students had a similar average satisfaction level, but a first year in comparison to a second year had a wildly different average satisfaction rate compared to a second year, we would use stratified random sampling. If there was more variability within the first years, but the average between the years was about the same, we would use cluster sampling.
The proportion of parents from each school and the estimated number of parents from each school who returned a consent form is listed in the last two columns of the new dataframe.
library(SDaA)
m <- read.csv("measles.csv")
m <- data.frame(m)
N <- 46
n <- 10
library(plyr)
##
## Attaching package: 'plyr'
## The following object is masked from 'package:SDaA':
##
## ozone
yes_returnf <- m[which(m$returnf == '1'),]
num_par <- count(yes_returnf$school)
no_shots <- m[which(m$noshot == '0'),]
num_noshot <- count(no_shots$school)
num_noshot$freq
## [1] 36 33 16 27 29 24 23 41 34 21
Mi <- c(78,238,261,174,236,188,113,170,296,207)
p <- num_par$freq/sum(num_par$freq)
newdf <- data.frame("Schools"=1:10, "Number of parents" =
num_par$freq, "Mi" = Mi, "pi" = p, "Mi*pi" = Mi*p)
newdf
## Schools Number.of.parents Mi pi Mi.pi
## 1 1 19 78 0.11875 9.26250
## 2 2 19 238 0.11875 28.26250
## 3 3 13 261 0.08125 21.20625
## 4 4 18 174 0.11250 19.57500
## 5 5 12 236 0.07500 17.70000
## 6 6 13 188 0.08125 15.27500
## 7 7 15 113 0.09375 10.59375
## 8 8 21 170 0.13125 22.31250
## 9 9 23 296 0.14375 42.55000
## 10 10 7 207 0.04375 9.05625
The sampling weights are calculated by the following formula:
\(w_{ij} = \frac{N* M_i}{n * m_i}\)
The weights have been extracted and listed below for each school.
m$wi <- (N*m$Mitotal)/(n*m$mi)
#m
wi <- c(8.970, 28.8105, 63.189, 26.68, 36.18667, 34.592, 22.6, 18.186, 35.8315, 45.343)
wi
## [1] 8.97000 28.81050 63.18900 26.68000 36.18667 34.59200 22.60000
## [8] 18.18600 35.83150 45.34300
The estimated value of percentage of parents who received the consent form is \(\hat{t_{prop}}\) = 0.4534
The variance is 77761.21.
The standard error is 0.1422023
The 95% confidence interval for the estimated percentage of parents who received the consent form is (0.1747546, 0.7321877).
yes_receivef <- m[which(m$form == '1'),]
num_par_form <- count(yes_receivef$school)
Mi <- c(78,238,261,174,236,188,113,170,296,207)
p <- num_par_form$freq/sum(num_par_form$freq)
mi <- c(40,38,19,30,30,25,23,43,38,21)
newdf_2 <- data.frame("Schools"=1:10, "Number of parents" =
num_par_form$freq, "Mi" = Mi, "mi" = mi, "pi" = p, "Mi*pi" = Mi*p)
newdf_2
## Schools Number.of.parents Mi mi pi Mi.pi
## 1 1 38 78 40 0.13523132 10.548043
## 2 2 36 238 38 0.12811388 30.491103
## 3 3 17 261 19 0.06049822 15.790036
## 4 4 30 174 30 0.10676157 18.576512
## 5 5 26 236 30 0.09252669 21.836299
## 6 6 24 188 25 0.08540925 16.056940
## 7 7 22 113 23 0.07829181 8.846975
## 8 8 36 170 43 0.12811388 21.779359
## 9 9 35 296 38 0.12455516 36.868327
## 10 10 17 207 21 0.06049822 12.523132
t_hat_prop <- (N/(n*sum(Mi)))*(sum(Mi*p))
t_hat_prop
## [1] 0.4534712
mp_tn <- ((Mi*p)-(4.6/N))**2
t_hat <- (1-(mi/Mi))*(Mi**2)*((p*(1-p))/(mi-1))
var_thatprop<- (((N**2)*(1-n/N))/n)*(sum(mp_tn)/n-1)+(N/n)*(sum(t_hat))
var_thatprop
## [1] 77762.21
SE <- (1/sum(Mi))*sqrt(var_thatprop)
SE
## [1] 0.1422023
CI <- t_hat_prop + c(-1,1)*1.96*SE
CI
## [1] 0.1747546 0.7321877
Now, we will ignore clustering and compute the estimates assuming SRS.
The estimated total number of parents who received the consent form is 8200.542.
The standard error is 76225.51.
The 95% confidence interval for the total estimated number of parents who received the consent form is (-141201.5, 157602.6).
The ratio and effect of clustering is \(\frac{\text {estimated var clustering}}{\text{estimated var SRS}}\) = 1.338344e-05. So, clustering is much more efficient.
newdf_3 <- data.frame("Schools"=1:10, "Number of parents" = num_par_form$freq, "Mi" = Mi, "mi" = mi, "wi" = wi, "wi*yi" = wi*num_par_form$freq)
newdf_3
## Schools Number.of.parents Mi mi wi wi.yi
## 1 1 38 78 40 8.97000 340.8600
## 2 2 36 238 38 28.81050 1037.1780
## 3 3 17 261 19 63.18900 1074.2130
## 4 4 30 174 30 26.68000 800.4000
## 5 5 26 236 30 36.18667 940.8534
## 6 6 24 188 25 34.59200 830.2080
## 7 7 22 113 23 22.60000 497.2000
## 8 8 36 170 43 18.18600 654.6960
## 9 9 35 296 38 35.83150 1254.1025
## 10 10 17 207 21 45.34300 770.8310
t_hat_unb <- sum(wi*num_par_form$freq)
t_hat_unb
## [1] 8200.542
y_bar <- mean(num_par_form$freq)
y_bar
## [1] 28.1
t <- Mi*y_bar
t
## [1] 2191.8 6687.8 7334.1 4889.4 6631.6 5282.8 3175.3 4777.0 8317.6 5816.7
si_2 <- 1/(mi -1) * sum((num_par_form$freq - y_bar)**2)
si_2
## [1] 14.84359 15.64595 32.16111 19.96207 19.96207 24.12083 26.31364
## [8] 13.78333 15.64595 28.94500
s2_t <- 1/(n-1)*sum((t-(t_hat_unb/N))**2)
s2_t
## [1] 35077937
var_thatunb <- (N**2 * (1-n/N) * (s2_t/n)) + N/n*(sum((1-(mi/Mi))*Mi**2*(si_2/mi)))
var_thatunb
## [1] 5810329070
SE <- sqrt(var_thatunb)
SE
## [1] 76225.51
CI <- t_hat_unb + c(-1,1)*1.96*SE
CI
## [1] -141201.5 157602.6
var_thatprop/var_thatunb
## [1] 1.338344e-05
The psi values are listed below along with the new data set from sampling 1000 observations with replacement using the probabilities of land area.
data("statepps")
head(statepps)
## state counties cumcount landarea cumland popn cumpopn
## 1 Alabama 67 67 50750 50750 4137511 4137511
## 2 Alaska 25 92 570374 621124 587766 4725277
## 3 Arizona 15 107 113642 734766 3832368 8557645
## 4 Arkansas 75 182 52075 786841 2394253 10951898
## 5 California 58 240 155973 942814 30895356 41847254
## 6 Colorado 63 303 103729 1046543 3464675 45311929
p_la <- statepps$landarea/sum(statepps$landarea)
data <- data.frame(statepps)
data$p_la <- statepps$landarea/sum(statepps$landarea)
ndata <- nrow(data)
set.seed(1234)
nsample <- 100000
samp_idx <- sample(seq_len(nrow(data)), nsample, prob=data$p_la, replace=TRUE)
new_la <- data[samp_idx, ]
state_freq <- as.data.frame(table(new_la$state))
state_freq$psi <- state_freq$Freq/nsample
state_freq$county <-data$counties
state_freq
## Var1 Freq psi county
## 1 Alabama 1421 0.01421 67
## 2 Alaska 16121 0.16121 25
## 3 Arizona 3164 0.03164 15
## 4 Arkansas 1394 0.01394 75
## 5 California 4336 0.04336 58
## 6 Colorado 2987 0.02987 63
## 7 Connecticut 126 0.00126 8
## 8 Delaware 58 0.00058 3
## 9 District of Columbia 1 0.00001 1
## 10 Florida 1489 0.01489 67
## 11 Georgia 1686 0.01686 159
## 12 Hawaii 192 0.00192 5
## 13 Idaho 2330 0.02330 44
## 14 Illinois 1641 0.01641 102
## 15 Indiana 972 0.00972 92
## 16 Iowa 1539 0.01539 99
## 17 Kansas 2321 0.02321 105
## 18 Kentucky 1124 0.01124 120
## 19 Louisiana 1233 0.01233 64
## 20 Maine 874 0.00874 16
## 21 Maryland 272 0.00272 24
## 22 Massachusetts 213 0.00213 14
## 23 Michigan 1676 0.01676 83
## 24 Minnesota 2264 0.02264 87
## 25 Mississippi 1360 0.01360 82
## 26 Missouri 2056 0.02056 115
## 27 Montana 4090 0.04090 57
## 28 Nebraska 2208 0.02208 93
## 29 Nevada 3096 0.03096 18
## 30 New Hampshire 260 0.00260 10
## 31 New Jersey 209 0.00209 21
## 32 New Mexico 3385 0.03385 33
## 33 New York 1338 0.01338 62
## 34 North Carolina 1358 0.01358 100
## 35 North Dakota 1860 0.01860 53
## 36 Ohio 1161 0.01161 88
## 37 Oklahoma 1982 0.01982 77
## 38 Oregon 2682 0.02682 36
## 39 Pennsylvania 1304 0.01304 67
## 40 Rhode Island 38 0.00038 5
## 41 South Carolina 849 0.00849 46
## 42 South Dakota 2112 0.02112 66
## 43 Tennessee 1137 0.01137 95
## 44 Texas 7462 0.07462 254
## 45 Utah 2249 0.02249 29
## 46 Vermont 253 0.00253 14
## 47 Virginia 1169 0.01169 136
## 48 Washington 1898 0.01898 39
## 49 West Virginia 717 0.00717 55
## 50 Wisconsin 1529 0.01529 72
## 51 Wyoming 2804 0.02804 23
The psi values are listed below along with the new data set from sampling 1000 observations with replacement using the probabilities proportional to population.
data("statepps")
head(statepps)
## state counties cumcount landarea cumland popn cumpopn
## 1 Alabama 67 67 50750 50750 4137511 4137511
## 2 Alaska 25 92 570374 621124 587766 4725277
## 3 Arizona 15 107 113642 734766 3832368 8557645
## 4 Arkansas 75 182 52075 786841 2394253 10951898
## 5 California 58 240 155973 942814 30895356 41847254
## 6 Colorado 63 303 103729 1046543 3464675 45311929
p_pop <- statepps$popn/sum(statepps$popn)
data <- data.frame(statepps)
data$p_pop <- statepps$popn/sum(statepps$popn)
ndata <- nrow(data)
set.seed(1234)
nsample <- 100000
samp_idx <- sample(seq_len(nrow(data)), nsample, prob=data$p_pop, replac=TRUE)
new_pop <- data[samp_idx, ]
state_freq_pop <- as.data.frame(table(new_pop$state))
state_freq_pop$psi <- state_freq_pop$Freq/nsample
state_freq_pop$county <- data$counties
state_freq_pop
## Var1 Freq psi county
## 1 Alabama 1618 0.01618 67
## 2 Alaska 216 0.00216 25
## 3 Arizona 1476 0.01476 15
## 4 Arkansas 946 0.00946 75
## 5 California 12083 0.12083 58
## 6 Colorado 1316 0.01316 63
## 7 Connecticut 1304 0.01304 8
## 8 Delaware 263 0.00263 3
## 9 District of Columbia 236 0.00236 1
## 10 Florida 5230 0.05230 67
## 11 Georgia 2651 0.02651 159
## 12 Hawaii 467 0.00467 5
## 13 Idaho 403 0.00403 44
## 14 Illinois 4506 0.04506 102
## 15 Indiana 2152 0.02152 92
## 16 Iowa 1126 0.01126 99
## 17 Kansas 964 0.00964 105
## 18 Kentucky 1374 0.01374 120
## 19 Louisiana 1766 0.01766 64
## 20 Maine 465 0.00465 16
## 21 Maryland 2015 0.02015 24
## 22 Massachusetts 2358 0.02358 14
## 23 Michigan 3713 0.03713 83
## 24 Minnesota 1686 0.01686 87
## 25 Mississippi 1003 0.01003 82
## 26 Missouri 2029 0.02029 115
## 27 Montana 319 0.00319 57
## 28 Nebraska 665 0.00665 93
## 29 Nevada 528 0.00528 18
## 30 New Hampshire 451 0.00451 10
## 31 New Jersey 3025 0.03025 21
## 32 New Mexico 594 0.00594 33
## 33 New York 7190 0.07190 62
## 34 North Carolina 2608 0.02608 100
## 35 North Dakota 248 0.00248 53
## 36 Ohio 4440 0.04440 88
## 37 Oklahoma 1262 0.01262 77
## 38 Oregon 1201 0.01201 36
## 39 Pennsylvania 4627 0.04627 67
## 40 Rhode Island 428 0.00428 5
## 41 South Carolina 1423 0.01423 46
## 42 South Dakota 285 0.00285 66
## 43 Tennessee 2042 0.02042 95
## 44 Texas 6867 0.06867 254
## 45 Utah 712 0.00712 29
## 46 Vermont 235 0.00235 14
## 47 Virginia 2535 0.02535 136
## 48 Washington 2094 0.02094 39
## 49 West Virginia 705 0.00705 55
## 50 Wisconsin 1970 0.01970 72
## 51 Wyoming 180 0.00180 23
The estimate of the total number of counties in the United States using the land area as a proportion is 6770.283 or 6770. The standard error is 1913.563.
I had to make my sample 100,000 because otherwise I would have psi values of 0.
t_hat_psi_la <- 1/51*sum(state_freq$county/state_freq$psi)
t_hat_psi_la
## [1] 6770.283
var_tpsi_la <- (1/51)*(1/50)*sum(((state_freq$county/state_freq$psi)-t_hat_psi_la)**2)
var_tpsi_la
## [1] 3661723
sqrt(var_tpsi_la)
## [1] 1913.563
The estimate of the total number of counties in the United States using the population as a proportion is 5414.608 or 5415 counties. The standard error is 727.0919.
t_hat_psi_pop <- 1/51*sum(state_freq_pop$county/state_freq_pop$psi)
t_hat_psi_pop
## [1] 5414.608
var_tpsi_pop <- (1/51)*(1/50)*sum(((state_freq_pop$county/state_freq_pop$psi)-t_hat_psi_pop)**2)
var_tpsi_pop
## [1] 528662.6
sqrt(var_tpsi_pop)
## [1] 727.0919
The estimate of the total number of counties is 3145.201, or 3145 counties, in the United States. The standard error is 337.6008.
The standard error here is MUCH smaller than from parts b and c which were 1913.563 and 727.0919.
It seems that an SRS would be much more effective here as the actual number of counties 3142.
data <- data.frame(statepps)
data$p_s <- 1/51 #Equal probabilities
ndata <- nrow(data)
set.seed(1234)
nsample <- 100000
samp_idx <- sample(seq_len(nrow(data)), nsample, prob=data$p_s, replace=TRUE)
new_same <- data[samp_idx, ]
state_freq_same<- as.data.frame(table(new_same$state))
state_freq_same$psi <- state_freq_same$Freq/nsample
state_freq_same$county <-data$counties
state_freq_same
## Var1 Freq psi county
## 1 Alabama 1953 0.01953 67
## 2 Alaska 1934 0.01934 25
## 3 Arizona 2048 0.02048 15
## 4 Arkansas 1954 0.01954 75
## 5 California 1870 0.01870 58
## 6 Colorado 1996 0.01996 63
## 7 Connecticut 1954 0.01954 8
## 8 Delaware 1950 0.01950 3
## 9 District of Columbia 1929 0.01929 1
## 10 Florida 2108 0.02108 67
## 11 Georgia 1953 0.01953 159
## 12 Hawaii 1913 0.01913 5
## 13 Idaho 1967 0.01967 44
## 14 Illinois 1915 0.01915 102
## 15 Indiana 1961 0.01961 92
## 16 Iowa 1964 0.01964 99
## 17 Kansas 1911 0.01911 105
## 18 Kentucky 1917 0.01917 120
## 19 Louisiana 1966 0.01966 64
## 20 Maine 1932 0.01932 16
## 21 Maryland 1918 0.01918 24
## 22 Massachusetts 1903 0.01903 14
## 23 Michigan 2014 0.02014 83
## 24 Minnesota 2071 0.02071 87
## 25 Mississippi 1975 0.01975 82
## 26 Missouri 1956 0.01956 115
## 27 Montana 1899 0.01899 57
## 28 Nebraska 1936 0.01936 93
## 29 Nevada 1921 0.01921 18
## 30 New Hampshire 1974 0.01974 10
## 31 New Jersey 1955 0.01955 21
## 32 New Mexico 2017 0.02017 33
## 33 New York 1893 0.01893 62
## 34 North Carolina 1923 0.01923 100
## 35 North Dakota 2002 0.02002 53
## 36 Ohio 2026 0.02026 88
## 37 Oklahoma 1993 0.01993 77
## 38 Oregon 2013 0.02013 36
## 39 Pennsylvania 2024 0.02024 67
## 40 Rhode Island 1919 0.01919 5
## 41 South Carolina 2031 0.02031 46
## 42 South Dakota 1924 0.01924 66
## 43 Tennessee 1859 0.01859 95
## 44 Texas 1939 0.01939 254
## 45 Utah 1965 0.01965 29
## 46 Vermont 2011 0.02011 14
## 47 Virginia 1999 0.01999 136
## 48 Washington 1924 0.01924 39
## 49 West Virginia 1972 0.01972 55
## 50 Wisconsin 1953 0.01953 72
## 51 Wyoming 1996 0.01996 23
t_hat_psi <- 1/51*sum(state_freq_same$county/state_freq_same$psi)
t_hat_psi
## [1] 3145.201
var_tpsi<- (1/51)*(1/50)*sum(((state_freq_same$county/state_freq_same$psi)-t_hat_psi)**2)
var_tpsi
## [1] 113974.3
sqrt(var_tpsi)
## [1] 337.6008