Problem 1

No, this fact alone does not guarantee that this sample is a simple random sample. In a simple random sample they all have the same probability, but that’s only part of it. All combinations and subsets must also have the same probability which separates it from proportional allocation.

Problem 2

Problem 2.a

A great article from the New York Times describes the issues of exit polls:

https://www.nytimes.com/2018/03/29/opinion/2016-exit-polls-election.html

In the article, it discusses the major problem with exit polls; it is very difficult to weigh an exit poll to the target population, especially on the night of the election because we don’t know yet what the electorate’s demographics will be when the voting is happening.

Ratio estimation can help because we can ask those administrating the poll to approximate an individuals age, race, and gender and weight their vote based on this. This auxillary information can be used to create a proportion to better accurately capture the way the individuals are voting.

Problem 2.b

Some sources of selection bias would include the fact that the sample consists of entirely volunteers. More educated individuals may be more interested in sharing their feedback in survey polls. Polls near college campuses may have a a larger audience who is willing to share their opinions which introduces bias.

Additionally, the individuals who vote early are not included in the exit polls.

Problem 3

A Stratified Random Sample

Stratified Random Sample: Population is divided into subgroups called strata; then a SRS is taken from each stratum, and the SRS’s in each strata are done independently. Often times, stratified random sampling increases the precision of the estimate compared to SRS. Stratification is often more efficient when means differ. In the “ANOVA” comparison, if there is a large between sum of squares and smaller variability within a strata, then stratified random sampling may be the better option.

Cluster Sample: Observations in the population are aggregated into clusters and a SRS is taken of the clusters. Optionally, a sample could be taken within each cluster. A clustering sample may provide a less precise estimate than an SRS of the same size. In the “ANOVA” comparison, if there is a small between sum of squares and larger variability within a cluster, then cluster sampling may be the better option. The variance essentially depends on variability between clusters.

A simple example to help put things into perspective. Lets take students at Loyola as our target population and we are trying to estimate their overall satisfaction. Lets assume our strata or cluster was grade level. If all first year students had a similar average satisfaction level, but a first year in comparison to a second year had a wildly different average satisfaction rate compared to a second year, we would use stratified random sampling. If there was more variability within the first years, but the average between the years was about the same, we would use cluster sampling.

Problem 4

Problem 4.a

The proportion of parents from each school and the estimated number of parents from each school who returned a consent form is listed in the last two columns of the new dataframe.

library(SDaA)
m <- read.csv("measles.csv")
m <- data.frame(m)

N <- 46
n <- 10

library(plyr)

## 
## Attaching package: 'plyr'

## The following object is masked from 'package:SDaA':
## 
##     ozone

yes_returnf <- m[which(m$returnf == '1'),]
num_par <- count(yes_returnf$school)

no_shots <- m[which(m$noshot == '0'),]
num_noshot <- count(no_shots$school)
num_noshot$freq

##  [1] 36 33 16 27 29 24 23 41 34 21

Mi <- c(78,238,261,174,236,188,113,170,296,207)
p <- num_par$freq/sum(num_par$freq)

newdf <- data.frame("Schools"=1:10, "Number of parents" = 
num_par$freq, "Mi" = Mi, "pi" = p, "Mi*pi" = Mi*p)
newdf

##    Schools Number.of.parents  Mi      pi    Mi.pi
## 1        1                19  78 0.11875  9.26250
## 2        2                19 238 0.11875 28.26250
## 3        3                13 261 0.08125 21.20625
## 4        4                18 174 0.11250 19.57500
## 5        5                12 236 0.07500 17.70000
## 6        6                13 188 0.08125 15.27500
## 7        7                15 113 0.09375 10.59375
## 8        8                21 170 0.13125 22.31250
## 9        9                23 296 0.14375 42.55000
## 10      10                 7 207 0.04375  9.05625

Problem 4.b

The sampling weights are calculated by the following formula:

\(w_{ij} = \frac{N* M_i}{n * m_i}\)

The weights have been extracted and listed below for each school.

m$wi <- (N*m$Mitotal)/(n*m$mi)
#m
wi <- c(8.970, 28.8105, 63.189, 26.68, 36.18667, 34.592, 22.6, 18.186, 35.8315, 45.343)
wi

##  [1]  8.97000 28.81050 63.18900 26.68000 36.18667 34.59200 22.60000
##  [8] 18.18600 35.83150 45.34300

Problem 4.c

The estimated value of percentage of parents who received the consent form is \(\hat{t_{prop}}\) = 0.4534

The variance is 77761.21.

The standard error is 0.1422023

The 95% confidence interval for the estimated percentage of parents who received the consent form is (0.1747546, 0.7321877).

yes_receivef <- m[which(m$form == '1'),]
num_par_form <- count(yes_receivef$school)

Mi <- c(78,238,261,174,236,188,113,170,296,207)
p <- num_par_form$freq/sum(num_par_form$freq)
mi <- c(40,38,19,30,30,25,23,43,38,21)

newdf_2 <- data.frame("Schools"=1:10, "Number of parents" =
num_par_form$freq, "Mi" = Mi, "mi" = mi, "pi" = p, "Mi*pi" = Mi*p)
newdf_2

##    Schools Number.of.parents  Mi mi         pi     Mi.pi
## 1        1                38  78 40 0.13523132 10.548043
## 2        2                36 238 38 0.12811388 30.491103
## 3        3                17 261 19 0.06049822 15.790036
## 4        4                30 174 30 0.10676157 18.576512
## 5        5                26 236 30 0.09252669 21.836299
## 6        6                24 188 25 0.08540925 16.056940
## 7        7                22 113 23 0.07829181  8.846975
## 8        8                36 170 43 0.12811388 21.779359
## 9        9                35 296 38 0.12455516 36.868327
## 10      10                17 207 21 0.06049822 12.523132

t_hat_prop <- (N/(n*sum(Mi)))*(sum(Mi*p))
t_hat_prop

## [1] 0.4534712

mp_tn <- ((Mi*p)-(4.6/N))**2
t_hat <- (1-(mi/Mi))*(Mi**2)*((p*(1-p))/(mi-1))

var_thatprop<- (((N**2)*(1-n/N))/n)*(sum(mp_tn)/n-1)+(N/n)*(sum(t_hat))
var_thatprop

## [1] 77762.21

SE <- (1/sum(Mi))*sqrt(var_thatprop)
SE

## [1] 0.1422023

CI <- t_hat_prop + c(-1,1)*1.96*SE
CI

## [1] 0.1747546 0.7321877

Problem 4.d

Now, we will ignore clustering and compute the estimates assuming SRS.

The estimated total number of parents who received the consent form is 8200.542.

The standard error is 76225.51.

The 95% confidence interval for the total estimated number of parents who received the consent form is (-141201.5, 157602.6).

The ratio and effect of clustering is \(\frac{\text {estimated var clustering}}{\text{estimated var SRS}}\) = 1.338344e-05. So, clustering is much more efficient.

newdf_3 <- data.frame("Schools"=1:10, "Number of parents" = num_par_form$freq, "Mi" = Mi, "mi" = mi, "wi" = wi, "wi*yi" = wi*num_par_form$freq)
newdf_3

##    Schools Number.of.parents  Mi mi       wi     wi.yi
## 1        1                38  78 40  8.97000  340.8600
## 2        2                36 238 38 28.81050 1037.1780
## 3        3                17 261 19 63.18900 1074.2130
## 4        4                30 174 30 26.68000  800.4000
## 5        5                26 236 30 36.18667  940.8534
## 6        6                24 188 25 34.59200  830.2080
## 7        7                22 113 23 22.60000  497.2000
## 8        8                36 170 43 18.18600  654.6960
## 9        9                35 296 38 35.83150 1254.1025
## 10      10                17 207 21 45.34300  770.8310

t_hat_unb <- sum(wi*num_par_form$freq)
t_hat_unb

## [1] 8200.542

y_bar <- mean(num_par_form$freq)
y_bar

## [1] 28.1

t <- Mi*y_bar
t

##  [1] 2191.8 6687.8 7334.1 4889.4 6631.6 5282.8 3175.3 4777.0 8317.6 5816.7

si_2 <- 1/(mi -1) * sum((num_par_form$freq - y_bar)**2)
si_2

##  [1] 14.84359 15.64595 32.16111 19.96207 19.96207 24.12083 26.31364
##  [8] 13.78333 15.64595 28.94500

s2_t <- 1/(n-1)*sum((t-(t_hat_unb/N))**2)
s2_t

## [1] 35077937

var_thatunb <- (N**2 * (1-n/N) * (s2_t/n)) + N/n*(sum((1-(mi/Mi))*Mi**2*(si_2/mi)))
var_thatunb

## [1] 5810329070

SE <- sqrt(var_thatunb)
SE

## [1] 76225.51

CI <- t_hat_unb + c(-1,1)*1.96*SE
CI

## [1] -141201.5  157602.6

var_thatprop/var_thatunb

## [1] 1.338344e-05

Problem 5

Problem 5.a

The psi values are listed below along with the new data set from sampling 1000 observations with replacement using the probabilities of land area.

data("statepps")
head(statepps)

##        state counties cumcount landarea cumland     popn  cumpopn
## 1    Alabama       67       67    50750   50750  4137511  4137511
## 2     Alaska       25       92   570374  621124   587766  4725277
## 3    Arizona       15      107   113642  734766  3832368  8557645
## 4   Arkansas       75      182    52075  786841  2394253 10951898
## 5 California       58      240   155973  942814 30895356 41847254
## 6   Colorado       63      303   103729 1046543  3464675 45311929

p_la <- statepps$landarea/sum(statepps$landarea)

data <- data.frame(statepps)
data$p_la <- statepps$landarea/sum(statepps$landarea)
ndata <- nrow(data)

set.seed(1234)
nsample <- 100000
samp_idx <- sample(seq_len(nrow(data)), nsample, prob=data$p_la, replace=TRUE)
new_la <- data[samp_idx, ]

state_freq <- as.data.frame(table(new_la$state))
state_freq$psi <- state_freq$Freq/nsample
state_freq$county <-data$counties
state_freq

##                    Var1  Freq     psi county
## 1               Alabama  1421 0.01421     67
## 2                Alaska 16121 0.16121     25
## 3               Arizona  3164 0.03164     15
## 4              Arkansas  1394 0.01394     75
## 5            California  4336 0.04336     58
## 6              Colorado  2987 0.02987     63
## 7           Connecticut   126 0.00126      8
## 8              Delaware    58 0.00058      3
## 9  District of Columbia     1 0.00001      1
## 10              Florida  1489 0.01489     67
## 11              Georgia  1686 0.01686    159
## 12               Hawaii   192 0.00192      5
## 13                Idaho  2330 0.02330     44
## 14             Illinois  1641 0.01641    102
## 15              Indiana   972 0.00972     92
## 16                 Iowa  1539 0.01539     99
## 17               Kansas  2321 0.02321    105
## 18             Kentucky  1124 0.01124    120
## 19            Louisiana  1233 0.01233     64
## 20                Maine   874 0.00874     16
## 21             Maryland   272 0.00272     24
## 22        Massachusetts   213 0.00213     14
## 23             Michigan  1676 0.01676     83
## 24            Minnesota  2264 0.02264     87
## 25          Mississippi  1360 0.01360     82
## 26             Missouri  2056 0.02056    115
## 27              Montana  4090 0.04090     57
## 28             Nebraska  2208 0.02208     93
## 29               Nevada  3096 0.03096     18
## 30        New Hampshire   260 0.00260     10
## 31           New Jersey   209 0.00209     21
## 32           New Mexico  3385 0.03385     33
## 33             New York  1338 0.01338     62
## 34       North Carolina  1358 0.01358    100
## 35         North Dakota  1860 0.01860     53
## 36                 Ohio  1161 0.01161     88
## 37             Oklahoma  1982 0.01982     77
## 38               Oregon  2682 0.02682     36
## 39         Pennsylvania  1304 0.01304     67
## 40         Rhode Island    38 0.00038      5
## 41       South Carolina   849 0.00849     46
## 42         South Dakota  2112 0.02112     66
## 43            Tennessee  1137 0.01137     95
## 44                Texas  7462 0.07462    254
## 45                 Utah  2249 0.02249     29
## 46              Vermont   253 0.00253     14
## 47             Virginia  1169 0.01169    136
## 48           Washington  1898 0.01898     39
## 49        West Virginia   717 0.00717     55
## 50            Wisconsin  1529 0.01529     72
## 51              Wyoming  2804 0.02804     23

Problem 5.b

The psi values are listed below along with the new data set from sampling 1000 observations with replacement using the probabilities proportional to population.

data("statepps")
head(statepps)

##        state counties cumcount landarea cumland     popn  cumpopn
## 1    Alabama       67       67    50750   50750  4137511  4137511
## 2     Alaska       25       92   570374  621124   587766  4725277
## 3    Arizona       15      107   113642  734766  3832368  8557645
## 4   Arkansas       75      182    52075  786841  2394253 10951898
## 5 California       58      240   155973  942814 30895356 41847254
## 6   Colorado       63      303   103729 1046543  3464675 45311929

p_pop <- statepps$popn/sum(statepps$popn)

data <- data.frame(statepps)
data$p_pop <- statepps$popn/sum(statepps$popn)
ndata <- nrow(data)

set.seed(1234)
nsample <- 100000
samp_idx <- sample(seq_len(nrow(data)), nsample, prob=data$p_pop, replac=TRUE)
new_pop <- data[samp_idx, ]

state_freq_pop <- as.data.frame(table(new_pop$state))
state_freq_pop$psi <- state_freq_pop$Freq/nsample
state_freq_pop$county <- data$counties
state_freq_pop

##                    Var1  Freq     psi county
## 1               Alabama  1618 0.01618     67
## 2                Alaska   216 0.00216     25
## 3               Arizona  1476 0.01476     15
## 4              Arkansas   946 0.00946     75
## 5            California 12083 0.12083     58
## 6              Colorado  1316 0.01316     63
## 7           Connecticut  1304 0.01304      8
## 8              Delaware   263 0.00263      3
## 9  District of Columbia   236 0.00236      1
## 10              Florida  5230 0.05230     67
## 11              Georgia  2651 0.02651    159
## 12               Hawaii   467 0.00467      5
## 13                Idaho   403 0.00403     44
## 14             Illinois  4506 0.04506    102
## 15              Indiana  2152 0.02152     92
## 16                 Iowa  1126 0.01126     99
## 17               Kansas   964 0.00964    105
## 18             Kentucky  1374 0.01374    120
## 19            Louisiana  1766 0.01766     64
## 20                Maine   465 0.00465     16
## 21             Maryland  2015 0.02015     24
## 22        Massachusetts  2358 0.02358     14
## 23             Michigan  3713 0.03713     83
## 24            Minnesota  1686 0.01686     87
## 25          Mississippi  1003 0.01003     82
## 26             Missouri  2029 0.02029    115
## 27              Montana   319 0.00319     57
## 28             Nebraska   665 0.00665     93
## 29               Nevada   528 0.00528     18
## 30        New Hampshire   451 0.00451     10
## 31           New Jersey  3025 0.03025     21
## 32           New Mexico   594 0.00594     33
## 33             New York  7190 0.07190     62
## 34       North Carolina  2608 0.02608    100
## 35         North Dakota   248 0.00248     53
## 36                 Ohio  4440 0.04440     88
## 37             Oklahoma  1262 0.01262     77
## 38               Oregon  1201 0.01201     36
## 39         Pennsylvania  4627 0.04627     67
## 40         Rhode Island   428 0.00428      5
## 41       South Carolina  1423 0.01423     46
## 42         South Dakota   285 0.00285     66
## 43            Tennessee  2042 0.02042     95
## 44                Texas  6867 0.06867    254
## 45                 Utah   712 0.00712     29
## 46              Vermont   235 0.00235     14
## 47             Virginia  2535 0.02535    136
## 48           Washington  2094 0.02094     39
## 49        West Virginia   705 0.00705     55
## 50            Wisconsin  1970 0.01970     72
## 51              Wyoming   180 0.00180     23

Problem 5.c

The estimate of the total number of counties in the United States using the land area as a proportion is 6770.283 or 6770. The standard error is 1913.563.

I had to make my sample 100,000 because otherwise I would have psi values of 0.

t_hat_psi_la <- 1/51*sum(state_freq$county/state_freq$psi)
t_hat_psi_la

## [1] 6770.283

var_tpsi_la <- (1/51)*(1/50)*sum(((state_freq$county/state_freq$psi)-t_hat_psi_la)**2)
var_tpsi_la

## [1] 3661723

sqrt(var_tpsi_la)

## [1] 1913.563

Problem 5.d

The estimate of the total number of counties in the United States using the population as a proportion is 5414.608 or 5415 counties. The standard error is 727.0919.

t_hat_psi_pop <- 1/51*sum(state_freq_pop$county/state_freq_pop$psi)
t_hat_psi_pop

## [1] 5414.608

var_tpsi_pop <- (1/51)*(1/50)*sum(((state_freq_pop$county/state_freq_pop$psi)-t_hat_psi_pop)**2)
var_tpsi_pop

## [1] 528662.6

sqrt(var_tpsi_pop)

## [1] 727.0919

Problem 5.e

The estimate of the total number of counties is 3145.201, or 3145 counties, in the United States. The standard error is 337.6008.

The standard error here is MUCH smaller than from parts b and c which were 1913.563 and 727.0919.

It seems that an SRS would be much more effective here as the actual number of counties 3142.

data <- data.frame(statepps)
data$p_s <- 1/51 #Equal probabilities
ndata <- nrow(data)

set.seed(1234)
nsample <- 100000
samp_idx <- sample(seq_len(nrow(data)), nsample, prob=data$p_s, replace=TRUE)
new_same <- data[samp_idx, ]

state_freq_same<- as.data.frame(table(new_same$state))
state_freq_same$psi <- state_freq_same$Freq/nsample
state_freq_same$county <-data$counties
state_freq_same

##                    Var1 Freq     psi county
## 1               Alabama 1953 0.01953     67
## 2                Alaska 1934 0.01934     25
## 3               Arizona 2048 0.02048     15
## 4              Arkansas 1954 0.01954     75
## 5            California 1870 0.01870     58
## 6              Colorado 1996 0.01996     63
## 7           Connecticut 1954 0.01954      8
## 8              Delaware 1950 0.01950      3
## 9  District of Columbia 1929 0.01929      1
## 10              Florida 2108 0.02108     67
## 11              Georgia 1953 0.01953    159
## 12               Hawaii 1913 0.01913      5
## 13                Idaho 1967 0.01967     44
## 14             Illinois 1915 0.01915    102
## 15              Indiana 1961 0.01961     92
## 16                 Iowa 1964 0.01964     99
## 17               Kansas 1911 0.01911    105
## 18             Kentucky 1917 0.01917    120
## 19            Louisiana 1966 0.01966     64
## 20                Maine 1932 0.01932     16
## 21             Maryland 1918 0.01918     24
## 22        Massachusetts 1903 0.01903     14
## 23             Michigan 2014 0.02014     83
## 24            Minnesota 2071 0.02071     87
## 25          Mississippi 1975 0.01975     82
## 26             Missouri 1956 0.01956    115
## 27              Montana 1899 0.01899     57
## 28             Nebraska 1936 0.01936     93
## 29               Nevada 1921 0.01921     18
## 30        New Hampshire 1974 0.01974     10
## 31           New Jersey 1955 0.01955     21
## 32           New Mexico 2017 0.02017     33
## 33             New York 1893 0.01893     62
## 34       North Carolina 1923 0.01923    100
## 35         North Dakota 2002 0.02002     53
## 36                 Ohio 2026 0.02026     88
## 37             Oklahoma 1993 0.01993     77
## 38               Oregon 2013 0.02013     36
## 39         Pennsylvania 2024 0.02024     67
## 40         Rhode Island 1919 0.01919      5
## 41       South Carolina 2031 0.02031     46
## 42         South Dakota 1924 0.01924     66
## 43            Tennessee 1859 0.01859     95
## 44                Texas 1939 0.01939    254
## 45                 Utah 1965 0.01965     29
## 46              Vermont 2011 0.02011     14
## 47             Virginia 1999 0.01999    136
## 48           Washington 1924 0.01924     39
## 49        West Virginia 1972 0.01972     55
## 50            Wisconsin 1953 0.01953     72
## 51              Wyoming 1996 0.01996     23

t_hat_psi <- 1/51*sum(state_freq_same$county/state_freq_same$psi)
t_hat_psi

## [1] 3145.201

var_tpsi<- (1/51)*(1/50)*sum(((state_freq_same$county/state_freq_same$psi)-t_hat_psi)**2)
var_tpsi

## [1] 113974.3

sqrt(var_tpsi)

## [1] 337.6008

STAT 488 Exam 2

Kajal Chokshi

12/4/2018