Objective : Use R as a calculator & Physical Notes.

Every single question here has been verified to be correct with the “Selected Answers” and provided solutions.

Aside

Problems with : SP denote study problems

Ch 4 : Simple Random Sample

Assigned Reading :

4.1-4.7 (Entire Chapter)

Ex. 4.1)

List all possible simple random samples of size n = 2 that can be selected from the population {0, 1, 2, 3, 4}. Calculate s2 for the population and V(y) for the sample.

N <- 5
n <- 2
sample_space <- choose(N,n) # 10 total comb. of n = 2
sample_space

## [1] 10

P <- 0:4
mu <- (1/5)*sum(P)
mu

## [1] 2

sigma2 <- (1/5)*sum((P-mu)^2)
FPC <- (N-n)/(N-1)
V_ybar <- (sigma2/n)*FPC
V_ybar

## [1] 0.75

Ex. 4.14) SP

State park officials were interested in the proportion of campers who consider the camp- site spacing adequate in a particular campground. They decided to take a simple random sample of n = 30 from the first N = 300 camping parties that visited the campground. Let yi = 0 if the head of the ith party sampled does not think the campsite spacing is ad- equateandyi = 1ifhedoes(i =1,2,…,30).Usethedataintheaccompanyingtable to estimate p, the proportion of campers who consider the campsite spacing adequate. Place a bound on the error of estimation.

rm(list=ls())
N <- 300
n <- 30
sm_yi <- 25
phat <- sm_yi/n; round(phat, 3)

## [1] 0.833

phat_c <- 1-phat
FPC <- (1-n/N)
Vhat_phat <- ((phat*phat_c)*(FPC))/(n-1); round(Vhat_phat, 3)

## [1] 0.004

Z_c <- 2 # est. 1.96
B <- Z_c*sqrt(Vhat_phat)
round(B, 3)

## [1] 0.131

Ex. 4.15)

Use the data in Exercise 4.14 to determine the sample size required to estimate p with a bound on the error of estimation of magnitude B = 0.05.

B <- .05
D <- (B^2)/4
n <- (N*phat*phat_c)/((N-1)*D+(phat*phat_c))
ceiling(n) # round to nearest integer

## [1] 128

Ex. 4.16) SP

A simple random sample of 100 water meters within a community is monitored to esti- mate the average daily water consumption per household over a specified dry spell. The sample mean and sample variance are found to be y = 12.5 and s2 = 1252. If we assume that there are N = 10,000 households within the community, estimate m, the true mean daily consumption.

rm(list=ls())
ybar <- 12.5
s2 <- 1252
N <-10000
n <- 100
FPC <- 1-n/N
Vhat_ybar <- (s2/n)*FPC; round(sqrt(Vhat_ybar), 3)

## [1] 3.521

Z_c <- 2 # est. 1.96
B <- Z_c*sqrt(Vhat_ybar); round(B, 3)

## [1] 7.041

Ex. 4.17)

Using data in Exercise 4.16, estimate the total number of gallons of water, t, used daily during the dry spell. Place a bound on the error of estimation.

tau_hat <- N*ybar; round(tau_hat, 3)

## [1] 125000

Vhat_tau <- (N^2)*Vhat_ybar; round(sqrt(Vhat_tau), 3)

## [1] 35206.25

B <- 2*sqrt(Vhat_tau); round(B, 3)

## [1] 70412.5

Ex. 4.19)

A dentist was interested in the effectiveness of a new toothpaste. A group of N = 1000 schoolchildren participated in a study. Prestudy records showed there was an average of 2.2 cavities every six months for the group. After three months of the study, the dentist sampled n = 10 children to determine how they were progressing on the new toothpaste. Using the data in the accompanying table, estimate the mean number of cavities for the entire group and place a bound on the error of estimation.

rm(list=ls())
N <-1000
y <- c(0,4,2,3,2,0,3,4,1,1)
hist(y)

df <- data.frame(y=y); df

ybar <- mean(y); ybar

## [1] 2

n <- nrow(df)
s2 <- sum((y-ybar)^2)/(n-1)
FPC <- 1-n/N
Vhat_ybar <- (s2/n)*FPC; round( sqrt(Vhat_ybar), 3)

## [1] 0.469

B <- 2*sqrt(Vhat_ybar)
round(B, 3)

## [1] 0.938

Ex. 4.20)

The Fish and Game Department of a particular state was concerned about the direction of its future hunting programs. To provide for a greater potential for future hunting, the department wanted to determine the proportion of hunters seeking any type of game bird. A simple random sample of n = 1000 of the N = 99,000 licensed hunters was obtained. Suppose 430 indicated that they hunted game birds. Estimate p, the proportion of licensed hunters seeking game birds. Place a bound on the error of estimation.

rm(list=ls())
n <- 1000
N <- 99000
FPC <- 1-n/N
sm_yi <- 430
phat <- sm_yi/n; round(phat, 3)

## [1] 0.43

qhat <- 1-phat
Vhat_phat <- (phat*qhat*FPC)/(n-1)
B <- 2*sqrt(Vhat_phat); round(B, 3)

## [1] 0.031

Ex. 4.21)

Using the data in Exercise 4.20, determine the sample size the department must obtain to estimate the proportion of game bird hunters, given a bound on the error of estimation of magnitude B = 0.02.

B <- .02
D <- (B^2)/4
n <- (N*phat*qhat)/((N-1)*D+phat*qhat)
ceiling(n)

## [1] 2392

Ex. 4.22) SP

A company auditor was interested in estimating the total number of travel vouchers that were incorrectly filed. In a simple random sample of n = 50 vouchers taken from a group of N = 250, 20 were filed incorrectly. Estimate the total number of vouchers from the N = 250 that have been filed incorrectly and place a bound on the error of estima- tion. [Hint: If p is the population proportion of incorrect vouchers, then Np is the total number of incorrect vouchers. An estimator of Np is NpN , which has an estimated vari- ance given by N2VN (pN ).]

rm(list=ls())
N <- 250
n <- 50
sm_y <- 20
phat <- sm_y/n
qhat <- 1-phat
tau_hat <- N*phat; round(tau_hat, 3)

## [1] 100

FPC <- 1 - n/N
Vhat_phat <- (phat*qhat*FPC)/(n-1)
Vhat_tau <- (N^2)*Vhat_phat
B <- 2*sqrt(Vhat_tau); round(B, 3)

## [1] 31.298

Ex. 4.48) SP

A marketing research firm estimates the proportion of potential customers preferring a certain brand of lipstick by “randomly” selecting 100 women who come by their booth in a shopping mall. Of the 100 sampled, 65 women stated a preference for brand A.

How would you estimate the true proportion of women preferring brand A, with a bound on the error of estimation?

\[ \hat{p}=\frac{1}{n}\sum_{i\in S}{y_i} \]

\[ B =Z_c*\sqrt{\hat{V}_{\hat{p}}} \]

\[ Z_c \approx2; \hat{V}_{\hat{p}}=\text{FPC}\frac{\hat{p}\hat{q}}{n-1} \]

\[ \text{FPC}=1-\frac{n}{N} \]

What is the target population in this study?

“potential customers preferring a certain brand of lipstick” – which is a poorly defined target population

Did the marketing research firm select a simple random sample?

No, this is a convenient sample – “selecting 100 women who come by their booth in a shopping mall” – self selection bias.

What additional problems do you see with this type of sampling?

not everyone is a customer. This is not a probability sample–cant apply probabalistic methods to a non-prob. samp. – further they will have been biased by the entertainment of those there

Ch 5 : Stratified Random Sampling

Assigned Reading :

5.1-5.12 (Entire Chapter)

Ex. 5.1)

A chain of department stores is interested in estimating the proportion of accounts receivable that are delinquent. The chain consists of four stores. To reduce the cost of sampling, stratified random sampling is used, with each store as a stratum. Because no information on population proportions is available before sampling, proportional allocation is used. From the accompanying table, estimate p, the proportion of delinquent accounts for the chain, and place a bound on the error of estimation.

rm(list=ls())
N1 <- 65
N2 <- 42
N3 <- 93
N4 <- 25

n1 <- 14
n2 <- 9
n3 <- 21
n4 <- 6

numb_delinquent_accounts <- c(4,2,8,1)
Ni <- c(N1, N2, N3, N4)
N <- sum(Ni); N

## [1] 225

ni <- c(n1,n2,n3,n4)
pi <- numb_delinquent_accounts/ni; round(pi, 3)

## [1] 0.286 0.222 0.381 0.167

qi <- 1-pi

phat <- (1/N)*sum(pi*Ni); round(phat, 3)

## [1] 0.3

FPCi <- (1 - ni/Ni); round(FPCi, 3)

## [1] 0.785 0.786 0.774 0.760

Vhat_pst <- (1/N^2)*sum((pi*qi*FPCi*Ni^2)/(ni-1))

Z_c <- 2

B <- 2*sqrt(Vhat_pst); round(B, 3)

## [1] 0.117

Ex. 5.2) SP

A corporation desires to estimate the total number of worker-hours lost, for a given month, because of accidents among all employees. Because laborers, technicians, and administrators have different accident rates, the researcher decides to use stratified ran- dom sampling, with each group forming a separate stratum. Data from previous years suggest the variances shown in the accompanying table for the number of worker-hours lost per employee in the three groups, and current data give the stratum sizes. Determine the Neyman allocation for a sample of n = 30 employees.

rm(list=ls())
sigma1 <- 36
sigma2 <- 25 
sigma3 <- 9

N1 <- 132
N2 <- 92
N3 <- 27

n <- 30

Ni <- c(N1, N2, N3)
sigmai <- c(sigma1, sigma2, sigma3)


ni <- n*(Ni*sqrt(sigmai)/sum(Ni*sqrt(sigmai)))
ni <- ceiling(ni); ni

## [1] 18 11  2

The main point of Neyman allocation is that we use it when the cost per observation is equivalent.

Ex. 5.5) SP

A corporation wishes to obtain information on the effectiveness of a business machine. A number of division heads will be interviewed by telephone and asked to rate the equip- ment on a numerical scale. The divisions are located in North America, Europe, and Asia. Hence, stratified sampling is used. The costs are larger for interviewing division heads located outside North America. The accompanying table gives the costs per inter- view, approximate variances of the ratings, and N that have been established. The corpo- ration wants to estimate the average rating with V(yst) = 0.1. Choose the sample size n that achieves this bound, and find the appropriate allocation.

rm(list=ls())
c1 <- 9
c2 <- 25
c3 <- 36

var1 <- 2.25
var2 <- 3.24
var3 <- 3.24

N1 <- 112
N2 <- 68
N3 <- 39

ci <- c(c1,c2,c3)
vari <- c(var1, var2, var3)
Ni <- c(N1, N2, N3)

var_yst <- .1
D <- var_yst


N <- sum(Ni)

A <- sum((Ni*sqrt(vari))/sqrt(ci))
B <- sum(Ni*sqrt(vari)*sqrt(ci))
C <- sum(Ni*vari)
E <- (N^2)*D 

n <- (A*B)/(E+C)
# round up
n <- ceiling(n); n

## [1] 27

# allocation : 
ai <- ((Ni*sqrt(vari)/sqrt(ci))/sum(Ni*sqrt(vari)/sqrt(ci)))

ni <- n*ai
round(ni, 2)

## [1] 16.40  7.17  3.43

# n
ni <- floor(ni)
sum(ni) # therefore must add one to one

## [1] 26

ni[3] <- ni[3] + 1 # highest
ni

## [1] 16  7  4

sum(ni) == n #!!!

## [1] TRUE

Ex. 5.22) SP

When does stratification produce large gains in precision over simple random sampling? (Assume costs of observations are constant under both designs.)

Stratafication is a good idea when 2 things are true : (1) Strata Means differ dramatically and (2) Within Strata variance is small (homogeneous) relative to population variance as apposed to a more varying population variance (heterogeneity).

Ex. 5.40) SP

Ch 6 : Ratio, Reg. & Difference est.

Assigned Reading :

6.1-6.9 (Entire Chapter)

Ex. 6.1) SP

\[ \text{Lec. Formulation : } \]

\[ \hat{\text{V}}(\hat{\tau_y})=\tau_x^2 \hat{\text{V}}(r) \\ \hat{\text{V}}(r)= \text{FPC}*\frac{1}{\mu_x^2}\frac{s_r^2}{n}\\ s^2_r=\frac{1}{n-1}\sum_{i\in S}(y_i - rx_i)\\ \text{FPC}=\frac{N-n}{N-1}\approx\frac{N-n}{N}\approx 1-\frac{n}{N} \]

Calculation

x <- c(.3, .5, .4, .9, .7, .2, .6, .5, .8, .4, .8, .6)
y <- c(6, 9, 7, 19, 15, 5,12, 9, 20, 9, 18, 13)
T_x <- 75
sm_y <- sum(y)
sm_x <- sum(x)
df <- data.frame(x=x, y=y)
n <- 12
N <- 250
r <- sm_y/sm_x
T_y <- r*T_x
round(T_y, 2) # estimation

## [1] 1589.55

FPC <- 1-n/N
sr2 <- (1/(n-1))*sum((y-r*x)^2)
mu_x <- T_x/N
V_r <- FPC*(1/mu_x^2)*(sr2/n)
V_Ty <- V_r * T_x^2
B <- 2*sqrt(V_Ty)
round(B, 2) # Bound

## [1] 186.32

ci_Ty <- T_y + c(-1, 1)*B
ci_Ty # conf intvl

## [1] 1403.235 1775.870

Ex. 6.2) SP

Calculation

T_y2 <- mean(y)*N
round(T_y2, 2) # est

## [1] 2958.33

V_Ty2 <-(var(y)/n)*FPC*N^2
B2 <- 2*sqrt(V_Ty2)
round(B2, 2) # bound

## [1] 730.13

# plt
plot(x,y)
abline(a=0, b = r, col = "green")

ci_Ty2 <- T_y2 + c(-1, 1)*B2
ci_Ty2 # conf intvl

## [1] 2228.199 3688.468

Comparison :

tab <- data.frame(
  Estimate = c("T_y", "V_Ty", "B","T_y2", "V_Ty2", "B2"),
  Value = c(T_y, V_Ty, B, T_y2, V_Ty2, B2)
)

tab

Given the origin-intercept, with slope r model good fit, it is clear that our ratio est is very likely better. Further given the fact that our est. bounds are dramatically smaller with our ratio est.

Further notice :

\[ N\bar{y} =\tau_x*r \iff \bar{y}=r\mu_x \\ \text{ or,}\\ N\bar{y} =\tau_x*r \iff \bar{x}=\mu_x \]

proof :

\[ N\bar{y} =\tau_xr \\ N\bar{y} =(N\mu_x)r \\ \boxed{\bar{y}=\mu_xr} \\ \bar{y}=\mu_x \frac{\bar{y}}{\bar{x}} \\ 1=\mu_x \frac{1}{\bar{x}}\\ \boxed{\bar{x}=\mu_x} \]

ex. 6.3)

rm(list=ls()) #del gloabal env obj

x <- c("25,100", "32,200", "29,600", "35,000", "34,400", "26,500", "28,700", "28,200", "34,600", "32,700", "31,500", "30,600", "27,700", "28,500")
x <- gsub(",", "", x) |> as.numeric() # rep "," with empty space, then make it a num-vec

y <- c(3800, 5100, 4200, 6200, 5800, 4100, 3900, 3600, 3800, 4100, 4500, 5100, 4200, 4000)

df <- data.frame(x=x, y=y)
df

plot(x,y)

Recall :

\[ \hat{\text{V}}(r)= \text{FPC}*\frac{1}{\mu_x^2}\frac{s_r^2}{n}\\ s^2_r=\frac{1}{n-1}\sum_{i\in S}(y_i - rx_i)\\ \text{FPC}=\frac{N-n}{N-1}\approx\frac{N-n}{N}\approx 1-\frac{n}{N} \]

Calculation

n <- 14
N <- 150
r <- mean(y)/mean(x)
round(r, 3)

## [1] 0.147

FPC <- 1-n/N 
Sr2 <- (1/(n-1))*sum((y-r*x)^2)
V_r <- (Sr2/n)*(1/mean(x)^2)*FPC #est. of mu_x with xbar
B <- 2*sqrt(V_r)
round(B, 4)

## [1] 0.0102

ci <- r + c(-1,1)*B
ci

## [1] 0.1364745 0.1569654

ex. 6.4)

recall :

rm(list=ls())

x <- c(550, 720, 1500, 1020, 620, 980, 928, 1200, 1350, 1750, 670, 729, 1530)
y <- c(610, 780, 1600, 1030, 600, 1050, 977, 1440, 1570, 2210, 980, 865, 1710)
df <- data.frame(x=x, y=y)
df

plot(x,y) # good candidate for ratio reg.

Calculation

T_x <- 128200
n <- 13
N <- 123
FPC <- 1-n/N
mu_x <- T_x/N

r <- sum(y)/sum(x)
T_y <- T_x * r 
round(T_y, 2) # est

## [1] 145943.8

y_hat <- r*x
sr2 <- (1/(n-1))*(sum((y-y_hat)^2))
V_r <- (sr2/n)*FPC*(1/mu_x^2)
V_Ty <- V_r*T_x^2
B <- 2*sqrt(V_Ty)
round(B, 2) # slightly off 7353.67

## [1] 7353.67

Ex. 6.5)

Calculation

mu_y <- mu_x*r
round(mu_y, 2)

## [1] 1186.53

V_muy <- (mu_x^2)*V_r
B <- 2*sqrt(V_muy)
round(B, 2)

## [1] 59.79

Ex. 6.6)

rm(list=ls())
x <- c(14.3, 15.7, 17.8, 17.5, 13.2, 18.8, 17.6, 14.3, 14.9, 17.9, 19.2)
y <- c(15.2, 16.1, 18.1, 17.6, 14.5, 19.4, 17.5, 14.1, 15.2, 18.1, 19.5)
N <- 763
n <- 11
mu_x <- 17.2

calculation

r <- sum(y)/sum(x)
r

## [1] 1.022627

plot(x,y)
abline(a=0, b=r, col = "red") # pretty decent fit

mu_y <- r*mu_x
round(mu_y, 2)

## [1] 17.59

FPC <- 1-n/N
sr2 <- (1/(n-1))*sum((y-r*x)^2)

V_r <- (sr2/n)*FPC*(1/mu_x^2)
V_muy <- (sr2/n)*FPC #notice they cancel out (mu_x^2)
B <- 2*sqrt(V_muy)
round(B, 3)

## [1] 0.271

Ex. 6.18) SP

Under what conditions should you employ a ratio estimator of a population total, rather than an estimator of the form $N\bar{y}$ ?

Notice for each of these we must visualize our data and check our assumption that (1) the relationship is linear and, (2) the intercept is through the origin $(x,y)=(0,0)$ – so under those conditions (and without extreme outliers or leverage pts. ), ratio est. of the mean is a better fit than our basic $N\bar{y}$ – “It is even better if the variation in y increases with x”.

Ex. 6.23) SP

National income from manufacturing industries is to be estimated for 1989 from a sample of 6 of the 19 industry categories that reported figures early for that year. Incomes from all 19 industries are known for 1980 and the total is $674 billion. From the data provided, estimate the total national income from manufacturing in 1989, with a bound on the error. All figures are in billions of constant (1982) dollars.

Find a ratio estimator of the 1989 total income, and place a bound on the error of estimation.
Find a regression estimator of the 1989 total income, and place a bound on the error of estimation
Find a difference estimator of the 1989 total income, and place a bound on the error of estimation

Aside

Units are in billions of dollars

Ratio Est.

rm(list=ls())
n <- 6
N <- 19

industry_data <- data.frame(
  industry = c(
    "Lumber and wood products",
    "Electric and electronic equipment",
    "Motor vehicles and equipment",
    "Food and kindred products",
    "Textile mill products",
    "Chemicals and allied products"
  ),
  `Income 1980` = c(21, 63, 91, 60, 70, 50),
  `Income 1989` = c(26, 91, 47, 70, 70, 50)
)
industry_data

plot(industry_data$Income.1980, industry_data$Income.1989)

sm_y <- sum(industry_data$Income.1989)
sm_x <- sum(industry_data$Income.1980)

r <- sm_y/sm_x

abline(a=0,b=r, col = "red")

Now… I dont know if you remember this but i do… this was on the exam – and, isn’t linear. God damn it… ITS NOT LINEAR – “you can fit a line through anything” – thank you Susan Cochran, I can see that…

T_x <- 674
T_y <- r*T_x; T_y

## [1] 672.1014

mu_x <- T_x/N
# variance  :
yhat <- r*industry_data$Income.1980
y <- industry_data$Income.1989
sr2 <- sum((y - yhat)^2)/(n-1)
FPC <- 1-n/N
Vhat_Ty <- (T_x^2)*(sr2/n)*FPC*(1/mu_x^2)
B <- 2*sqrt(Vhat_Ty)
round(B, 2)

## [1] 305.64

# conf interval : 85%
ci_Ty <- T_y + c(-1,1)*B

paste0(round(ci_Ty[1], 2), " Billion Dollars to ", round(ci_Ty[2], 2), " Billion Dollars -- est. ", round(T_y, 2), " B$")

## [1] "366.46 Billion Dollars to 977.74 Billion Dollars -- est. 672.1 B$"

Regression Est.

y <- industry_data$Income.1989
x <- industry_data$Income.1980

b <- sum((y -mean(y))*(x - mean(x)))/sum((x-mean(x))^2)
a <- mean(y) - b*mean(x)

# pt est. 
muhat_y <- a+b*mu_x
T_y2 <- N*muhat_y; T_y2

## [1] 918.1025

# variance 
yhat <- a+b*x
syL2 <- (1/(n-2))*sum((y-yhat)^2)
MSE <- syL2

Vhat_muy <- (syL2/n)*FPC

Vhat_Ty2 <- (N^2)*Vhat_muy
B2 <- 2*sqrt(Vhat_Ty2); round(B2, 2)

## [1] 289.38

ci_Ty2 <- T_y2 + c(-1,1)*B2
round(ci_Ty2,2)

## [1]  628.72 1207.48

paste0(round(ci_Ty2[1], 2), " Billion Dollars to ", round(ci_Ty2[2], 2), " Billion Dollars -- est. ", round(T_y2, 2), " B$")

## [1] "628.72 Billion Dollars to 1207.48 Billion Dollars -- est. 918.1 B$"

Difference Est.

#pt est. 
di <- y-x
mu_yD <- mu_x - mean(di)
T_y3 <- N*mu_yD

# variance est. 
syD2 <- sum((di - mean(di))^2)/(n-1)
Vhat_muyD <- (syD2/n)*FPC
Vhat_Ty3 <- (N^2)*Vhat_muyD

B3 <- 2*sqrt(Vhat_Ty3)

ci_Ty3 <- T_y3 + c(-1,1)*B3
paste0(round(ci_Ty3[1], 2), " Billion Dollars to ", round(ci_Ty3[2], 2), " Billion Dollars -- est. ", round(T_y3, 2), " B$")

## [1] "371.08 Billion Dollars to 983.25 Billion Dollars -- est. 677.17 B$"

Comparison

grep("^ci", ls(), value = T)

## [1] "ci_Ty"  "ci_Ty2" "ci_Ty3"

grep("^T_y", ls(), value = T)

## [1] "T_y"  "T_y2" "T_y3"

grep("^Vhat", ls(), value = T)

## [1] "Vhat_muy"  "Vhat_muyD" "Vhat_Ty"   "Vhat_Ty2"  "Vhat_Ty3"

# Basic scatter plot of 1980 vs 1989
plot(
  industry_data$Income.1980,
  industry_data$Income.1989,
  pch = 16,
  col = "black",
  xlab = "Income 1980",
  ylab = "Income 1989",
  main = "Estimated Models"
)

# Add estimator lines
abline(a = 0, b = r,  col = "red",   lwd = 2)   # Ratio estimator
abline(a = a, b = b, col = "blue",  lwd = 2)   # Regression estimator
abline(a = 0, b = 1, col = "green", lwd = 2)   # Difference estimator

legend(
  "topleft",
  legend = c(
    "Ty1: Ratio estimator",
    "Ty2: Regression estimator",
    "Ty3: Difference estimator"
  ),
  col = c("red", "blue", "green"),
  lwd = 2,
  bty = "n"
)

As we can see our estimate of ratio and difference are similar – ie. slopes are both about 1 with intercept (0,0)
Notice our regression est. is quite bad due to the encorperation of a intercept which was moved by leverage point + outliers.
Notice all models do not accurately describe the behavior of the data as it is curvelinear – ie. not a linear relationship

# Combine estimates and CIs
estimates <- c(T_y, T_y2, T_y3)

ci_mat <- rbind(
  ci_Ty,
  ci_Ty2,
  ci_Ty3
)

labels <- c(
  "Ratio Estimator",
  "Regression Estimator",
  "Difference Estimator"
)

# Create empty plot
plot(
  estimates,
  1:3,
  xlim = range(ci_mat),
  pch = 16,
  yaxt = "n",
  ylab = "",
  xlab = "Estimate",
  main = "Confidence Intervals of Estimators"
)

axis(2, at = 1:3, labels = labels)

# Add CI lines
segments(
  x0 = ci_mat[, 1],
  x1 = ci_mat[, 2],
  y0 = 1:3,
  y1 = 1:3,
  lwd = 2
)

abline(v = 0, lty = 2)

and as we can see the ratio and difference point estimates are in similar positions mean while regression estimate is quite different. All estimates have similar variances.

library(dplyr)
library(knitr)
library(kableExtra)

results_df <- data.frame(
  Estimator = c(
    "Ratio Estimator",
    "Regression Estimator",
    "Difference Estimator"
  ),
  Estimate = c(T_y, T_y2, T_y3),
  Variance = c(Vhat_Ty, Vhat_Ty2, Vhat_Ty3),
  Std_Error = sqrt(c(Vhat_Ty, Vhat_Ty2, Vhat_Ty3)),
  CI_Lower = c(ci_Ty[1], ci_Ty2[1], ci_Ty3[1]),
  CI_Upper = c(ci_Ty[2], ci_Ty2[2], ci_Ty3[2])
) |> 
  mutate(
    Relative_Efficiency = Variance / min(Variance)
  )

results_df |>
  kable(
    caption = "Point Estimates, Variance Estimates, and 95% Confidence Intervals",
    digits = 4,
    align = "c"
  ) |>
  kable_styling(
    full_width = FALSE,
    position = "center",
    latex_options = c("hold_position", "striped")
  )

Point Estimates, Variance Estimates, and 95% Confidence Intervals
Estimator	Estimate	Variance	Std_Error	CI_Lower	CI_Upper	Relative_Efficiency
Ratio Estimator	672.1014	23354.18	152.8207	366.4599	977.7429	1.1155
Regression Estimator	918.1025	20935.13	144.6898	628.7230	1207.4820	1.0000
Difference Estimator	677.1667	23422.46	153.0440	371.0787	983.2546	1.1188

Notice despite our regression est. having the best relative eff – it is also the most biased. Indicating the best relative eff. isnt necessarily the best option when we know there is clear bias

Conclusion :

None of these approaches are recommended as the relationship is not linear

Ch 7 : Systematic Sampling

Assigned Reading :

7.1-7.8 (Entire Chapter)

Ex. 7.4) SP

The management of a particular company is interested in estimating the proportion of employees favoring a new investment policy. A 1-in-10 systematic sample is obtained from employees leaving the building at the end of a particular workday. Use the data in the accompanying table to estimate p, the proportion in favor of the new policy and place a bound on the error of estimation. Assume N = 2000.

rm(list=ls())
N <- 2000
n <- 200
sm_y <- 132
phat <- sm_y/n
FPC <- 1-n/N
qhat <- 1-phat

# variance est . 
Vhat_psy <- (phat*qhat*FPC)/(n-1)

B <- 2 * sqrt(Vhat_psy); round(B, 4)

## [1] 0.0637

ci_psy <- phat + c(-1,1)*B; round(ci_psy, 2)

## [1] 0.60 0.72

Ex. 7.5)

For the situation outlined in Exercise 7.4, determine the sample size required to estimate p to within 0.01 unit. What type of systematic sample should be run?

rm(n) # determine it for myself 
B <- .01

D <- (B^2)/4

n <- (N*phat*qhat)/((N-1)*D + phat*qhat)
n <- ceiling(n); n # same as they used!

## [1] 1636

Ex. 7.9)

The highway patrol of a particular state is concerned about the proportion of motorists who carry their licenses. A checkpoint is set up on a major highway, and the driver of every seventh car is questioned. Use the data in the accompanying table to estimate the proportion of drivers carrying their licenses. Place a bound on the error of estimation. Assume that N 2800 cars pass the checkpoint during the sampling period.

rm(list=ls())

N <- 2800
n <- 400
FPC <- 1-n/N

sm_y <- 324
phat <- sm_y/n; round(phat, 2)

## [1] 0.81

qhat <- 1-phat

Vhat_psy <- (phat*qhat*FPC)/(n-1)

B <- 2 * sqrt(Vhat_psy); round(B, 3)

## [1] 0.036

ci_psy <- phat + c(-1,1)*B; round(ci_psy, 2)

## [1] 0.77 0.85

Ex. 7.10) SP

The highway patrol expects at least N = 3000 cars to pass the checkpoint. Determine the sample size required to estimate p to within B = 0.015 unit.

N <- 3000
B <- .015
rm(n)

D <- (B^2)/4
n <- (N*phat*qhat)/((N-1)*D + phat*qhat)
n <- ceiling(n); n # same as they used!

## [1] 1432

Ch 8 : Cluster Sampling

Assigned Reading :

8.1-8.10* (Entire Chapter)

Ex. 8.23) SP

A large firm has its equipment inventories listed separately by department. From the 15 departments in the firm, 5 are to be randomly sampled by an auditor, who will then check to make sure that all equipment is properly identified and located. The proportion of in- ventory items not properly identified is of interest to the auditor. The data are given in the accompanying table. Estimate the proportion of inventory items in the firm not properly identified and place a bound on the error of estimation

rm(list=ls())
N <- 15
n <- 5

FPC <- 1-n/N

dept <- paste("Dept. ", as.character(1:5))
Numb_equip <- c(15, 27,9,31,16)
Nump_inc <- c(2,3,1,1,2)

df <- data.frame(dept=dept, Numb_equip=Numb_equip, Nump_inc=Nump_inc)
sm_m <- sum(df$Numb_equip)
sm_a <- sum(df$Nump_inc)
phat <- sm_a/sm_m; round(phat, 4)

## [1] 0.0918

# variance est. 
ai <- df$Nump_inc
mi <- df$Numb_equip

sp2 <- (1/(n-1))*sum((ai - phat*mi)^2)
mbar <- (1/n)*sum(mi)
  
  
Vhat_phatc <- FPC*(sp2/n)*(1/mbar^2)

B <- sqrt(Vhat_phatc)*2; round(B, 4)

## [1] 0.039

ci_phatc <- phat + c(-1,1)*B; round(ci_phatc, 2)

## [1] 0.05 0.13

Ch 9 : Two-Stage Cluster Sampling

Assigned Reading :

9.1-9.5 (Entire Chapter)

Example 9.1 )

A garment manufacturer has 90 plants located throughout the United States and wants to estimate the average number of hours that the sewing machines were down for repairs in the past months. Because the plants are widely scattered, she decides to use cluster sampling, specifying each plant as a cluster of machines. Each plant con- tains many machines, and checking the repair record for each machine would be time-consuming. Therefore, she uses two-stage sampling. Enough time and money are available to sample n = 10 plants and approximately 20% of the machines in each plant. Using the data in Table 9.1, estimate the average downtime per machine and place a bound on the error of estimation. The manufacturer knows she has a combined total of 4500 machines in all plants.

Plant <- 1:10
M_i <- c(50, 65, 45, 48, 52, 58, 42, 66, 40, 56)
m_i <- c(10, 13, 9, 10, 10, 12, 8, 13, 8, 11)
ybar_i <- c(
    5.40, 4.00, 5.67, 4.80, 4.30,
    3.83, 5.00, 3.85, 4.88, 5.00
  )
s2_i <- c(
    11.38, 10.67, 16.75, 13.29, 11.12,
    14.88, 5.14, 4.31, 6.13, 11.80
)

plant_data <- data.frame(
  Plant = Plant,
  M_i = M_i,
  m_i = m_i,
  ybar_i = ybar_i,
  s2_i = s2_i
)

downtime_list <- list(
  c(5, 7, 9, 0, 11, 2, 8, 4, 3, 5),
  c(4, 3, 7, 2, 11, 0, 1, 9, 4, 3, 2, 1, 5),
  c(5, 6, 4, 11, 12, 0, 1, 8, 4),
  c(6, 4, 0, 1, 0, 9, 8, 4, 6, 10),
  c(11, 4, 3, 1, 0, 2, 8, 6, 5, 3),
  c(12, 11, 3, 4, 2, 0, 0, 1, 4, 3, 2, 4),
  c(3, 7, 6, 7, 8, 4, 3, 2),
  c(3, 6, 4, 3, 2, 2, 8, 4, 0, 4, 5, 6, 3),
  c(6, 4, 7, 3, 9, 1, 4, 5),
  c(6, 7, 5, 10, 11, 2, 1, 4, 0, 5, 4)
)

downtime_data <- do.call(
  rbind,
  lapply(1:10, function(i) {
    data.frame(
      Plant = i,
      Downtime = downtime_list[[i]]
    )
  })
)

# Merge plant-level info
downtime_data <- merge(downtime_data, plant_data, by = "Plant")

downtime_data$Plant <- as.factor(downtime_data$Plant)

n <- 10
N <- 90
Mbar <- 4500/90
FPC <- 1-n/N

mu_hat <- (1/Mbar)*sum(M_i*ybar_i)/n; round(mu_hat, 2)

## [1] 4.8

sb2 <- sum((M_i*ybar_i - Mbar*mu_hat)^2)/(n-1)
sm <- sum((M_i^2)*(1-m_i/M_i)*(s2_i/m_i))

Vhat_mu <- (sb2*FPC*1/(n*(Mbar^2)))+(1/(N*n*Mbar^2)*sm); round(Vhat_mu, 4)

## [1] 0.0371

B <- 2*sqrt(Vhat_mu)

ci <- mu_hat + c(-1,1)*B; round(ci, 2)

## [1] 4.42 5.19

Ex. 9.2) SP – INCOMPLETE

A nurseryman wants to estimate the average height of seedlings in a large field that is divided into 50 plots that vary slightly in size. He believes the heights are fairly constant throughout each plot but may vary considerably from plot to plot. Therefore, he decides to sample 10% of the trees within each of 10 plots using a two-stage cluster sample. The data are as given in the accompanying table. Estimate the average height of seedlings in the field and place a bound on the error of estimation

rm(list=ls())
N <- 50
n <- 10

FPC <- 1-n/N
plot_data <- data.frame(
  Plot = 1:10,
  Total_Seedlings = c(52, 56, 60, 46, 49, 51, 50, 61, 60, 45),
  Sampled_Seedlings = c(5, 6, 6, 5, 5, 5, 5, 6, 6, 6)
)

heights_list <- list(
  c(12, 11, 12, 10, 13),
  c(10, 9, 7, 9, 8, 10),
  c(6, 5, 7, 5, 6, 4),
  c(7, 8, 7, 7, 6),
  c(10, 11, 13, 12, 12),
  c(14, 15, 13, 12, 13),
  c(6, 7, 6, 8, 7),
  c(9, 10, 8, 9, 9, 10),
  c(7, 10, 8, 9, 9, 10),
  c(12, 11, 12, 13, 12, 12)
)

seedling_data <- do.call(
  rbind,
  lapply(1:10, function(i) {
    data.frame(
      Plot = i,
      Height = heights_list[[i]]
    )
  })
)


# Merge cluster info
seedling_data <- merge(seedling_data, plot_data, by = "Plot")

seedling_data # long form so we can squeeze into df

Mi <- plot_data$Total_Seedlings
M <- sum(Mi)
Mbarhat <- sum(Mi)/n

summary <- seedling_data |> group_by(Plot) |> summarize(ybar_i = mean(Height))
ybar_i <- summary$ybar_i
mu_rts <- sum(Mi*ybar_i)/sum(Mi); round(mu_rts, 2)

## [1] 9.38

# variance est 
FPC <- 1-n/N; FPC

## [1] 0.8

sr2 <- (1/(n-1))*sum((Mi*ybar_i - Mi*mu_rts)^2)
between <- FPC*(1/(n*(Mbarhat^2)))*sr2

Ex. 9.3) SP

Ex. 9.18) SP

(Aside : Notice we skip Ch10 as it is not discussed in the course)

Ch 11 : Supplemental Topics

Assigned Reading :

11.5*

Ch 12 : Summary

Assigned Reading :

12.1-12.2

Applied Sampling Practice Catelog

Isaiah C. Mireles

Ch 4 : Simple Random Sample

Assigned Reading :

Ex. 4.1)

Ex. 4.14) SP

Ex. 4.15)

Ex. 4.16) SP

Ex. 4.17)

Ex. 4.19)

Ex. 4.20)

Ex. 4.21)

Ex. 4.22) SP

Ex. 4.48) SP

Ch 5 : Stratified Random Sampling

Assigned Reading :

Ex. 5.1)

Ex. 5.2) SP

Ex. 5.5) SP

Ex. 5.22) SP

Ex. 5.40) SP

Ch 6 : Ratio, Reg. & Difference est.

Assigned Reading :

Ex. 6.1) SP

Calculation

Ex. 6.2) SP

Calculation

Comparison :

ex. 6.3)

Calculation

ex. 6.4)

Calculation

Ex. 6.5)

Calculation

Ex. 6.6)

calculation

Ex. 6.18) SP

Ex. 6.23) SP

Ratio Est.

Regression Est.

Difference Est.

Comparison

Conclusion :

Ch 7 : Systematic Sampling

Assigned Reading :

Ex. 7.4) SP

Ex. 7.5)

Ex. 7.9)

Ex. 7.10) SP

Ch 8 : Cluster Sampling

Assigned Reading :

Ex. 8.23) SP

Ch 9 : Two-Stage Cluster Sampling

Assigned Reading :

Example 9.1 )

Ex. 9.2) SP – INCOMPLETE

Ex. 9.3) SP

Ex. 9.18) SP

Ch 11 : Supplemental Topics

Assigned Reading :

Ch 12 : Summary

Assigned Reading :

Ex. 12.4) SP

Ex. 12.8) SP