Objective : Use R as a calculator & Physical Notes.
Every single question here has been verified to be correct with the “Selected Answers” and provided solutions.
Aside
4.1-4.7 (Entire Chapter)
List all possible simple random samples of size n = 2 that can be selected from the population {0, 1, 2, 3, 4}. Calculate s2 for the population and V(y) for the sample.
N <- 5
n <- 2
sample_space <- choose(N,n) # 10 total comb. of n = 2
sample_space
## [1] 10
P <- 0:4
mu <- (1/5)*sum(P)
mu
## [1] 2
sigma2 <- (1/5)*sum((P-mu)^2)
FPC <- (N-n)/(N-1)
V_ybar <- (sigma2/n)*FPC
V_ybar
## [1] 0.75
State park officials were interested in the proportion of campers who consider the camp- site spacing adequate in a particular campground. They decided to take a simple random sample of n = 30 from the first N = 300 camping parties that visited the campground. Let yi = 0 if the head of the ith party sampled does not think the campsite spacing is ad- equateandyi = 1ifhedoes(i =1,2,…,30).Usethedataintheaccompanyingtable to estimate p, the proportion of campers who consider the campsite spacing adequate. Place a bound on the error of estimation.
rm(list=ls())
N <- 300
n <- 30
sm_yi <- 25
phat <- sm_yi/n; round(phat, 3)
## [1] 0.833
phat_c <- 1-phat
FPC <- (1-n/N)
Vhat_phat <- ((phat*phat_c)*(FPC))/(n-1); round(Vhat_phat, 3)
## [1] 0.004
Z_c <- 2 # est. 1.96
B <- Z_c*sqrt(Vhat_phat)
round(B, 3)
## [1] 0.131
Use the data in Exercise 4.14 to determine the sample size required to estimate p with a bound on the error of estimation of magnitude B = 0.05.
B <- .05
D <- (B^2)/4
n <- (N*phat*phat_c)/((N-1)*D+(phat*phat_c))
ceiling(n) # round to nearest integer
## [1] 128
A simple random sample of 100 water meters within a community is monitored to esti- mate the average daily water consumption per household over a specified dry spell. The sample mean and sample variance are found to be y = 12.5 and s2 = 1252. If we assume that there are N = 10,000 households within the community, estimate m, the true mean daily consumption.
rm(list=ls())
ybar <- 12.5
s2 <- 1252
N <-10000
n <- 100
FPC <- 1-n/N
Vhat_ybar <- (s2/n)*FPC; round(sqrt(Vhat_ybar), 3)
## [1] 3.521
Z_c <- 2 # est. 1.96
B <- Z_c*sqrt(Vhat_ybar); round(B, 3)
## [1] 7.041
Using data in Exercise 4.16, estimate the total number of gallons of water, t, used daily during the dry spell. Place a bound on the error of estimation.
tau_hat <- N*ybar; round(tau_hat, 3)
## [1] 125000
Vhat_tau <- (N^2)*Vhat_ybar; round(sqrt(Vhat_tau), 3)
## [1] 35206.25
B <- 2*sqrt(Vhat_tau); round(B, 3)
## [1] 70412.5
A dentist was interested in the effectiveness of a new toothpaste. A group of N = 1000 schoolchildren participated in a study. Prestudy records showed there was an average of 2.2 cavities every six months for the group. After three months of the study, the dentist sampled n = 10 children to determine how they were progressing on the new toothpaste. Using the data in the accompanying table, estimate the mean number of cavities for the entire group and place a bound on the error of estimation.
rm(list=ls())
N <-1000
y <- c(0,4,2,3,2,0,3,4,1,1)
hist(y)
df <- data.frame(y=y); df
ybar <- mean(y); ybar
## [1] 2
n <- nrow(df)
s2 <- sum((y-ybar)^2)/(n-1)
FPC <- 1-n/N
Vhat_ybar <- (s2/n)*FPC; round( sqrt(Vhat_ybar), 3)
## [1] 0.469
B <- 2*sqrt(Vhat_ybar)
round(B, 3)
## [1] 0.938
The Fish and Game Department of a particular state was concerned about the direction of its future hunting programs. To provide for a greater potential for future hunting, the department wanted to determine the proportion of hunters seeking any type of game bird. A simple random sample of n = 1000 of the N = 99,000 licensed hunters was obtained. Suppose 430 indicated that they hunted game birds. Estimate p, the proportion of licensed hunters seeking game birds. Place a bound on the error of estimation.
rm(list=ls())
n <- 1000
N <- 99000
FPC <- 1-n/N
sm_yi <- 430
phat <- sm_yi/n; round(phat, 3)
## [1] 0.43
qhat <- 1-phat
Vhat_phat <- (phat*qhat*FPC)/(n-1)
B <- 2*sqrt(Vhat_phat); round(B, 3)
## [1] 0.031
Using the data in Exercise 4.20, determine the sample size the department must obtain to estimate the proportion of game bird hunters, given a bound on the error of estimation of magnitude B = 0.02.
B <- .02
D <- (B^2)/4
n <- (N*phat*qhat)/((N-1)*D+phat*qhat)
ceiling(n)
## [1] 2392
A company auditor was interested in estimating the total number of travel vouchers that were incorrectly filed. In a simple random sample of n = 50 vouchers taken from a group of N = 250, 20 were filed incorrectly. Estimate the total number of vouchers from the N = 250 that have been filed incorrectly and place a bound on the error of estima- tion. [Hint: If p is the population proportion of incorrect vouchers, then Np is the total number of incorrect vouchers. An estimator of Np is NpN , which has an estimated vari- ance given by N2VN (pN ).]
rm(list=ls())
N <- 250
n <- 50
sm_y <- 20
phat <- sm_y/n
qhat <- 1-phat
tau_hat <- N*phat; round(tau_hat, 3)
## [1] 100
FPC <- 1 - n/N
Vhat_phat <- (phat*qhat*FPC)/(n-1)
Vhat_tau <- (N^2)*Vhat_phat
B <- 2*sqrt(Vhat_tau); round(B, 3)
## [1] 31.298
A marketing research firm estimates the proportion of potential customers preferring a certain brand of lipstick by “randomly” selecting 100 women who come by their booth in a shopping mall. Of the 100 sampled, 65 women stated a preference for brand A.
\[ \hat{p}=\frac{1}{n}\sum_{i\in S}{y_i} \]
\[ B =Z_c*\sqrt{\hat{V}_{\hat{p}}} \]
\[ Z_c \approx2; \hat{V}_{\hat{p}}=\text{FPC}\frac{\hat{p}\hat{q}}{n-1} \]
\[ \text{FPC}=1-\frac{n}{N} \]
“potential customers preferring a certain brand of lipstick” – which is a poorly defined target population
No, this is a convenient sample – “selecting 100 women who come by their booth in a shopping mall” – self selection bias.
not everyone is a customer. This is not a probability sample–cant apply probabalistic methods to a non-prob. samp. – further they will have been biased by the entertainment of those there
5.1-5.12 (Entire Chapter)
A chain of department stores is interested in estimating the proportion of accounts receivable that are delinquent. The chain consists of four stores. To reduce the cost of sampling, stratified random sampling is used, with each store as a stratum. Because no information on population proportions is available before sampling, proportional allocation is used. From the accompanying table, estimate p, the proportion of delinquent accounts for the chain, and place a bound on the error of estimation.
rm(list=ls())
N1 <- 65
N2 <- 42
N3 <- 93
N4 <- 25
n1 <- 14
n2 <- 9
n3 <- 21
n4 <- 6
numb_delinquent_accounts <- c(4,2,8,1)
Ni <- c(N1, N2, N3, N4)
N <- sum(Ni); N
## [1] 225
ni <- c(n1,n2,n3,n4)
pi <- numb_delinquent_accounts/ni; round(pi, 3)
## [1] 0.286 0.222 0.381 0.167
qi <- 1-pi
phat <- (1/N)*sum(pi*Ni); round(phat, 3)
## [1] 0.3
FPCi <- (1 - ni/Ni); round(FPCi, 3)
## [1] 0.785 0.786 0.774 0.760
Vhat_pst <- (1/N^2)*sum((pi*qi*FPCi*Ni^2)/(ni-1))
Z_c <- 2
B <- 2*sqrt(Vhat_pst); round(B, 3)
## [1] 0.117
A corporation desires to estimate the total number of worker-hours lost, for a given month, because of accidents among all employees. Because laborers, technicians, and administrators have different accident rates, the researcher decides to use stratified ran- dom sampling, with each group forming a separate stratum. Data from previous years suggest the variances shown in the accompanying table for the number of worker-hours lost per employee in the three groups, and current data give the stratum sizes. Determine the Neyman allocation for a sample of n = 30 employees.
rm(list=ls())
sigma1 <- 36
sigma2 <- 25
sigma3 <- 9
N1 <- 132
N2 <- 92
N3 <- 27
n <- 30
Ni <- c(N1, N2, N3)
sigmai <- c(sigma1, sigma2, sigma3)
ni <- n*(Ni*sqrt(sigmai)/sum(Ni*sqrt(sigmai)))
ni <- ceiling(ni); ni
## [1] 18 11 2
The main point of Neyman allocation is that we use it when the cost per observation is equivalent.
A corporation wishes to obtain information on the effectiveness of a business machine. A number of division heads will be interviewed by telephone and asked to rate the equip- ment on a numerical scale. The divisions are located in North America, Europe, and Asia. Hence, stratified sampling is used. The costs are larger for interviewing division heads located outside North America. The accompanying table gives the costs per inter- view, approximate variances of the ratings, and N that have been established. The corpo- ration wants to estimate the average rating with V(yst) = 0.1. Choose the sample size n that achieves this bound, and find the appropriate allocation.
rm(list=ls())
c1 <- 9
c2 <- 25
c3 <- 36
var1 <- 2.25
var2 <- 3.24
var3 <- 3.24
N1 <- 112
N2 <- 68
N3 <- 39
ci <- c(c1,c2,c3)
vari <- c(var1, var2, var3)
Ni <- c(N1, N2, N3)
var_yst <- .1
D <- var_yst
N <- sum(Ni)
A <- sum((Ni*sqrt(vari))/sqrt(ci))
B <- sum(Ni*sqrt(vari)*sqrt(ci))
C <- sum(Ni*vari)
E <- (N^2)*D
n <- (A*B)/(E+C)
# round up
n <- ceiling(n); n
## [1] 27
# allocation :
ai <- ((Ni*sqrt(vari)/sqrt(ci))/sum(Ni*sqrt(vari)/sqrt(ci)))
ni <- n*ai
round(ni, 2)
## [1] 16.40 7.17 3.43
# n
ni <- floor(ni)
sum(ni) # therefore must add one to one
## [1] 26
ni[3] <- ni[3] + 1 # highest
ni
## [1] 16 7 4
sum(ni) == n #!!!
## [1] TRUE
When does stratification produce large gains in precision over simple random sampling? (Assume costs of observations are constant under both designs.)
Stratafication is a good idea when 2 things are true : (1) Strata Means differ dramatically and (2) Within Strata variance is small (homogeneous) relative to population variance as apposed to a more varying population variance (heterogeneity).
6.1-6.9 (Entire Chapter)
\[ \text{Lec. Formulation : } \]
\[ \hat{\text{V}}(\hat{\tau_y})=\tau_x^2 \hat{\text{V}}(r) \\ \hat{\text{V}}(r)= \text{FPC}*\frac{1}{\mu_x^2}\frac{s_r^2}{n}\\ s^2_r=\frac{1}{n-1}\sum_{i\in S}(y_i - rx_i)\\ \text{FPC}=\frac{N-n}{N-1}\approx\frac{N-n}{N}\approx 1-\frac{n}{N} \]
x <- c(.3, .5, .4, .9, .7, .2, .6, .5, .8, .4, .8, .6)
y <- c(6, 9, 7, 19, 15, 5,12, 9, 20, 9, 18, 13)
T_x <- 75
sm_y <- sum(y)
sm_x <- sum(x)
df <- data.frame(x=x, y=y)
n <- 12
N <- 250
r <- sm_y/sm_x
T_y <- r*T_x
round(T_y, 2) # estimation
## [1] 1589.55
FPC <- 1-n/N
sr2 <- (1/(n-1))*sum((y-r*x)^2)
mu_x <- T_x/N
V_r <- FPC*(1/mu_x^2)*(sr2/n)
V_Ty <- V_r * T_x^2
B <- 2*sqrt(V_Ty)
round(B, 2) # Bound
## [1] 186.32
ci_Ty <- T_y + c(-1, 1)*B
ci_Ty # conf intvl
## [1] 1403.235 1775.870
T_y2 <- mean(y)*N
round(T_y2, 2) # est
## [1] 2958.33
V_Ty2 <-(var(y)/n)*FPC*N^2
B2 <- 2*sqrt(V_Ty2)
round(B2, 2) # bound
## [1] 730.13
# plt
plot(x,y)
abline(a=0, b = r, col = "green")
ci_Ty2 <- T_y2 + c(-1, 1)*B2
ci_Ty2 # conf intvl
## [1] 2228.199 3688.468
tab <- data.frame(
Estimate = c("T_y", "V_Ty", "B","T_y2", "V_Ty2", "B2"),
Value = c(T_y, V_Ty, B, T_y2, V_Ty2, B2)
)
tab
Given the origin-intercept, with slope r model good fit, it is clear that our ratio est is very likely better. Further given the fact that our est. bounds are dramatically smaller with our ratio est.
Further notice :
\[ N\bar{y} =\tau_x*r \iff \bar{y}=r\mu_x \\ \text{ or,}\\ N\bar{y} =\tau_x*r \iff \bar{x}=\mu_x \]
proof :
\[ N\bar{y} =\tau_xr \\ N\bar{y} =(N\mu_x)r \\ \boxed{\bar{y}=\mu_xr} \\ \bar{y}=\mu_x \frac{\bar{y}}{\bar{x}} \\ 1=\mu_x \frac{1}{\bar{x}}\\ \boxed{\bar{x}=\mu_x} \]
rm(list=ls()) #del gloabal env obj
x <- c("25,100", "32,200", "29,600", "35,000", "34,400", "26,500", "28,700", "28,200", "34,600", "32,700", "31,500", "30,600", "27,700", "28,500")
x <- gsub(",", "", x) |> as.numeric() # rep "," with empty space, then make it a num-vec
y <- c(3800, 5100, 4200, 6200, 5800, 4100, 3900, 3600, 3800, 4100, 4500, 5100, 4200, 4000)
df <- data.frame(x=x, y=y)
df
plot(x,y)
Recall :
\[ \hat{\text{V}}(r)= \text{FPC}*\frac{1}{\mu_x^2}\frac{s_r^2}{n}\\ s^2_r=\frac{1}{n-1}\sum_{i\in S}(y_i - rx_i)\\ \text{FPC}=\frac{N-n}{N-1}\approx\frac{N-n}{N}\approx 1-\frac{n}{N} \]
n <- 14
N <- 150
r <- mean(y)/mean(x)
round(r, 3)
## [1] 0.147
FPC <- 1-n/N
Sr2 <- (1/(n-1))*sum((y-r*x)^2)
V_r <- (Sr2/n)*(1/mean(x)^2)*FPC #est. of mu_x with xbar
B <- 2*sqrt(V_r)
round(B, 4)
## [1] 0.0102
ci <- r + c(-1,1)*B
ci
## [1] 0.1364745 0.1569654
recall :
\[ \hat{\text{V}}(\hat{\tau_y})=\tau_x^2 \hat{\text{V}}(r) \\ \hat{\text{V}}(r)= \text{FPC}*\frac{1}{\mu_x^2}\frac{s_r^2}{n}\\ s^2_r=\frac{1}{n-1}\sum_{i\in S}(y_i - rx_i)\\ \text{FPC}=\frac{N-n}{N-1}\approx\frac{N-n}{N}\approx 1-\frac{n}{N} \]
rm(list=ls())
x <- c(550, 720, 1500, 1020, 620, 980, 928, 1200, 1350, 1750, 670, 729, 1530)
y <- c(610, 780, 1600, 1030, 600, 1050, 977, 1440, 1570, 2210, 980, 865, 1710)
df <- data.frame(x=x, y=y)
df
plot(x,y) # good candidate for ratio reg.
T_x <- 128200
n <- 13
N <- 123
FPC <- 1-n/N
mu_x <- T_x/N
r <- sum(y)/sum(x)
T_y <- T_x * r
round(T_y, 2) # est
## [1] 145943.8
y_hat <- r*x
sr2 <- (1/(n-1))*(sum((y-y_hat)^2))
V_r <- (sr2/n)*FPC*(1/mu_x^2)
V_Ty <- V_r*T_x^2
B <- 2*sqrt(V_Ty)
round(B, 2) # slightly off 7353.67
## [1] 7353.67
mu_y <- mu_x*r
round(mu_y, 2)
## [1] 1186.53
V_muy <- (mu_x^2)*V_r
B <- 2*sqrt(V_muy)
round(B, 2)
## [1] 59.79
rm(list=ls())
x <- c(14.3, 15.7, 17.8, 17.5, 13.2, 18.8, 17.6, 14.3, 14.9, 17.9, 19.2)
y <- c(15.2, 16.1, 18.1, 17.6, 14.5, 19.4, 17.5, 14.1, 15.2, 18.1, 19.5)
N <- 763
n <- 11
mu_x <- 17.2
r <- sum(y)/sum(x)
r
## [1] 1.022627
plot(x,y)
abline(a=0, b=r, col = "red") # pretty decent fit
mu_y <- r*mu_x
round(mu_y, 2)
## [1] 17.59
FPC <- 1-n/N
sr2 <- (1/(n-1))*sum((y-r*x)^2)
V_r <- (sr2/n)*FPC*(1/mu_x^2)
V_muy <- (sr2/n)*FPC #notice they cancel out (mu_x^2)
B <- 2*sqrt(V_muy)
round(B, 3)
## [1] 0.271
Under what conditions should you employ a ratio estimator of a population total, rather than an estimator of the form \(N\bar{y}\) ?
National income from manufacturing industries is to be estimated for 1989 from a sample of 6 of the 19 industry categories that reported figures early for that year. Incomes from all 19 industries are known for 1980 and the total is $674 billion. From the data provided, estimate the total national income from manufacturing in 1989, with a bound on the error. All figures are in billions of constant (1982) dollars.
Find a ratio estimator of the 1989 total income, and place a bound on the error of estimation.
Find a regression estimator of the 1989 total income, and place a bound on the error of estimation
Find a difference estimator of the 1989 total income, and place a bound on the error of estimation
Aside
rm(list=ls())
n <- 6
N <- 19
industry_data <- data.frame(
industry = c(
"Lumber and wood products",
"Electric and electronic equipment",
"Motor vehicles and equipment",
"Food and kindred products",
"Textile mill products",
"Chemicals and allied products"
),
`Income 1980` = c(21, 63, 91, 60, 70, 50),
`Income 1989` = c(26, 91, 47, 70, 70, 50)
)
industry_data
plot(industry_data$Income.1980, industry_data$Income.1989)
sm_y <- sum(industry_data$Income.1989)
sm_x <- sum(industry_data$Income.1980)
r <- sm_y/sm_x
abline(a=0,b=r, col = "red")
Now… I dont know if you remember this but i do… this was on the exam – and, isn’t linear. God damn it… ITS NOT LINEAR – “you can fit a line through anything” – thank you Susan Cochran, I can see that…
T_x <- 674
T_y <- r*T_x; T_y
## [1] 672.1014
mu_x <- T_x/N
# variance :
yhat <- r*industry_data$Income.1980
y <- industry_data$Income.1989
sr2 <- sum((y - yhat)^2)/(n-1)
FPC <- 1-n/N
Vhat_Ty <- (T_x^2)*(sr2/n)*FPC*(1/mu_x^2)
B <- 2*sqrt(Vhat_Ty)
round(B, 2)
## [1] 305.64
# conf interval : 85%
ci_Ty <- T_y + c(-1,1)*B
paste0(round(ci_Ty[1], 2), " Billion Dollars to ", round(ci_Ty[2], 2), " Billion Dollars -- est. ", round(T_y, 2), " B$")
## [1] "366.46 Billion Dollars to 977.74 Billion Dollars -- est. 672.1 B$"
y <- industry_data$Income.1989
x <- industry_data$Income.1980
b <- sum((y -mean(y))*(x - mean(x)))/sum((x-mean(x))^2)
a <- mean(y) - b*mean(x)
# pt est.
muhat_y <- a+b*mu_x
T_y2 <- N*muhat_y; T_y2
## [1] 918.1025
# variance
yhat <- a+b*x
syL2 <- (1/(n-2))*sum((y-yhat)^2)
MSE <- syL2
Vhat_muy <- (syL2/n)*FPC
Vhat_Ty2 <- (N^2)*Vhat_muy
B2 <- 2*sqrt(Vhat_Ty2); round(B2, 2)
## [1] 289.38
ci_Ty2 <- T_y2 + c(-1,1)*B2
round(ci_Ty2,2)
## [1] 628.72 1207.48
paste0(round(ci_Ty2[1], 2), " Billion Dollars to ", round(ci_Ty2[2], 2), " Billion Dollars -- est. ", round(T_y2, 2), " B$")
## [1] "628.72 Billion Dollars to 1207.48 Billion Dollars -- est. 918.1 B$"
#pt est.
di <- y-x
mu_yD <- mu_x - mean(di)
T_y3 <- N*mu_yD
# variance est.
syD2 <- sum((di - mean(di))^2)/(n-1)
Vhat_muyD <- (syD2/n)*FPC
Vhat_Ty3 <- (N^2)*Vhat_muyD
B3 <- 2*sqrt(Vhat_Ty3)
ci_Ty3 <- T_y3 + c(-1,1)*B3
paste0(round(ci_Ty3[1], 2), " Billion Dollars to ", round(ci_Ty3[2], 2), " Billion Dollars -- est. ", round(T_y3, 2), " B$")
## [1] "371.08 Billion Dollars to 983.25 Billion Dollars -- est. 677.17 B$"
grep("^ci", ls(), value = T)
## [1] "ci_Ty" "ci_Ty2" "ci_Ty3"
grep("^T_y", ls(), value = T)
## [1] "T_y" "T_y2" "T_y3"
grep("^Vhat", ls(), value = T)
## [1] "Vhat_muy" "Vhat_muyD" "Vhat_Ty" "Vhat_Ty2" "Vhat_Ty3"
# Basic scatter plot of 1980 vs 1989
plot(
industry_data$Income.1980,
industry_data$Income.1989,
pch = 16,
col = "black",
xlab = "Income 1980",
ylab = "Income 1989",
main = "Estimated Models"
)
# Add estimator lines
abline(a = 0, b = r, col = "red", lwd = 2) # Ratio estimator
abline(a = a, b = b, col = "blue", lwd = 2) # Regression estimator
abline(a = 0, b = 1, col = "green", lwd = 2) # Difference estimator
legend(
"topleft",
legend = c(
"Ty1: Ratio estimator",
"Ty2: Regression estimator",
"Ty3: Difference estimator"
),
col = c("red", "blue", "green"),
lwd = 2,
bty = "n"
)
As we can see our estimate of ratio and difference are similar – ie. slopes are both about 1 with intercept (0,0)
Notice our regression est. is quite bad due to the encorperation of a intercept which was moved by leverage point + outliers.
Notice all models do not accurately describe the behavior of the data as it is curvelinear – ie. not a linear relationship
# Combine estimates and CIs
estimates <- c(T_y, T_y2, T_y3)
ci_mat <- rbind(
ci_Ty,
ci_Ty2,
ci_Ty3
)
labels <- c(
"Ratio Estimator",
"Regression Estimator",
"Difference Estimator"
)
# Create empty plot
plot(
estimates,
1:3,
xlim = range(ci_mat),
pch = 16,
yaxt = "n",
ylab = "",
xlab = "Estimate",
main = "Confidence Intervals of Estimators"
)
axis(2, at = 1:3, labels = labels)
# Add CI lines
segments(
x0 = ci_mat[, 1],
x1 = ci_mat[, 2],
y0 = 1:3,
y1 = 1:3,
lwd = 2
)
abline(v = 0, lty = 2)
library(dplyr)
library(knitr)
library(kableExtra)
results_df <- data.frame(
Estimator = c(
"Ratio Estimator",
"Regression Estimator",
"Difference Estimator"
),
Estimate = c(T_y, T_y2, T_y3),
Variance = c(Vhat_Ty, Vhat_Ty2, Vhat_Ty3),
Std_Error = sqrt(c(Vhat_Ty, Vhat_Ty2, Vhat_Ty3)),
CI_Lower = c(ci_Ty[1], ci_Ty2[1], ci_Ty3[1]),
CI_Upper = c(ci_Ty[2], ci_Ty2[2], ci_Ty3[2])
) |>
mutate(
Relative_Efficiency = Variance / min(Variance)
)
results_df |>
kable(
caption = "Point Estimates, Variance Estimates, and 95% Confidence Intervals",
digits = 4,
align = "c"
) |>
kable_styling(
full_width = FALSE,
position = "center",
latex_options = c("hold_position", "striped")
)
| Estimator | Estimate | Variance | Std_Error | CI_Lower | CI_Upper | Relative_Efficiency |
|---|---|---|---|---|---|---|
| Ratio Estimator | 672.1014 | 23354.18 | 152.8207 | 366.4599 | 977.7429 | 1.1155 |
| Regression Estimator | 918.1025 | 20935.13 | 144.6898 | 628.7230 | 1207.4820 | 1.0000 |
| Difference Estimator | 677.1667 | 23422.46 | 153.0440 | 371.0787 | 983.2546 | 1.1188 |
7.1-7.8 (Entire Chapter)
The management of a particular company is interested in estimating the proportion of employees favoring a new investment policy. A 1-in-10 systematic sample is obtained from employees leaving the building at the end of a particular workday. Use the data in the accompanying table to estimate p, the proportion in favor of the new policy and place a bound on the error of estimation. Assume N = 2000.
rm(list=ls())
N <- 2000
n <- 200
sm_y <- 132
phat <- sm_y/n
FPC <- 1-n/N
qhat <- 1-phat
# variance est .
Vhat_psy <- (phat*qhat*FPC)/(n-1)
B <- 2 * sqrt(Vhat_psy); round(B, 4)
## [1] 0.0637
ci_psy <- phat + c(-1,1)*B; round(ci_psy, 2)
## [1] 0.60 0.72
For the situation outlined in Exercise 7.4, determine the sample size required to estimate p to within 0.01 unit. What type of systematic sample should be run?
rm(n) # determine it for myself
B <- .01
D <- (B^2)/4
n <- (N*phat*qhat)/((N-1)*D + phat*qhat)
n <- ceiling(n); n # same as they used!
## [1] 1636
The highway patrol of a particular state is concerned about the proportion of motorists who carry their licenses. A checkpoint is set up on a major highway, and the driver of every seventh car is questioned. Use the data in the accompanying table to estimate the proportion of drivers carrying their licenses. Place a bound on the error of estimation. Assume that N 2800 cars pass the checkpoint during the sampling period.
rm(list=ls())
N <- 2800
n <- 400
FPC <- 1-n/N
sm_y <- 324
phat <- sm_y/n; round(phat, 2)
## [1] 0.81
qhat <- 1-phat
Vhat_psy <- (phat*qhat*FPC)/(n-1)
B <- 2 * sqrt(Vhat_psy); round(B, 3)
## [1] 0.036
ci_psy <- phat + c(-1,1)*B; round(ci_psy, 2)
## [1] 0.77 0.85
The highway patrol expects at least N = 3000 cars to pass the checkpoint. Determine the sample size required to estimate p to within B = 0.015 unit.
N <- 3000
B <- .015
rm(n)
D <- (B^2)/4
n <- (N*phat*qhat)/((N-1)*D + phat*qhat)
n <- ceiling(n); n # same as they used!
## [1] 1432
8.1-8.10* (Entire Chapter)
A large firm has its equipment inventories listed separately by department. From the 15 departments in the firm, 5 are to be randomly sampled by an auditor, who will then check to make sure that all equipment is properly identified and located. The proportion of in- ventory items not properly identified is of interest to the auditor. The data are given in the accompanying table. Estimate the proportion of inventory items in the firm not properly identified and place a bound on the error of estimation
rm(list=ls())
N <- 15
n <- 5
FPC <- 1-n/N
dept <- paste("Dept. ", as.character(1:5))
Numb_equip <- c(15, 27,9,31,16)
Nump_inc <- c(2,3,1,1,2)
df <- data.frame(dept=dept, Numb_equip=Numb_equip, Nump_inc=Nump_inc)
sm_m <- sum(df$Numb_equip)
sm_a <- sum(df$Nump_inc)
phat <- sm_a/sm_m; round(phat, 4)
## [1] 0.0918
# variance est.
ai <- df$Nump_inc
mi <- df$Numb_equip
sp2 <- (1/(n-1))*sum((ai - phat*mi)^2)
mbar <- (1/n)*sum(mi)
Vhat_phatc <- FPC*(sp2/n)*(1/mbar^2)
B <- sqrt(Vhat_phatc)*2; round(B, 4)
## [1] 0.039
ci_phatc <- phat + c(-1,1)*B; round(ci_phatc, 2)
## [1] 0.05 0.13
9.1-9.5 (Entire Chapter)
A garment manufacturer has 90 plants located throughout the United States and wants to estimate the average number of hours that the sewing machines were down for repairs in the past months. Because the plants are widely scattered, she decides to use cluster sampling, specifying each plant as a cluster of machines. Each plant con- tains many machines, and checking the repair record for each machine would be time-consuming. Therefore, she uses two-stage sampling. Enough time and money are available to sample n = 10 plants and approximately 20% of the machines in each plant. Using the data in Table 9.1, estimate the average downtime per machine and place a bound on the error of estimation. The manufacturer knows she has a combined total of 4500 machines in all plants.
Plant <- 1:10
M_i <- c(50, 65, 45, 48, 52, 58, 42, 66, 40, 56)
m_i <- c(10, 13, 9, 10, 10, 12, 8, 13, 8, 11)
ybar_i <- c(
5.40, 4.00, 5.67, 4.80, 4.30,
3.83, 5.00, 3.85, 4.88, 5.00
)
s2_i <- c(
11.38, 10.67, 16.75, 13.29, 11.12,
14.88, 5.14, 4.31, 6.13, 11.80
)
plant_data <- data.frame(
Plant = Plant,
M_i = M_i,
m_i = m_i,
ybar_i = ybar_i,
s2_i = s2_i
)
downtime_list <- list(
c(5, 7, 9, 0, 11, 2, 8, 4, 3, 5),
c(4, 3, 7, 2, 11, 0, 1, 9, 4, 3, 2, 1, 5),
c(5, 6, 4, 11, 12, 0, 1, 8, 4),
c(6, 4, 0, 1, 0, 9, 8, 4, 6, 10),
c(11, 4, 3, 1, 0, 2, 8, 6, 5, 3),
c(12, 11, 3, 4, 2, 0, 0, 1, 4, 3, 2, 4),
c(3, 7, 6, 7, 8, 4, 3, 2),
c(3, 6, 4, 3, 2, 2, 8, 4, 0, 4, 5, 6, 3),
c(6, 4, 7, 3, 9, 1, 4, 5),
c(6, 7, 5, 10, 11, 2, 1, 4, 0, 5, 4)
)
downtime_data <- do.call(
rbind,
lapply(1:10, function(i) {
data.frame(
Plant = i,
Downtime = downtime_list[[i]]
)
})
)
# Merge plant-level info
downtime_data <- merge(downtime_data, plant_data, by = "Plant")
downtime_data$Plant <- as.factor(downtime_data$Plant)
n <- 10
N <- 90
Mbar <- 4500/90
FPC <- 1-n/N
mu_hat <- (1/Mbar)*sum(M_i*ybar_i)/n; round(mu_hat, 2)
## [1] 4.8
sb2 <- sum((M_i*ybar_i - Mbar*mu_hat)^2)/(n-1)
sm <- sum((M_i^2)*(1-m_i/M_i)*(s2_i/m_i))
Vhat_mu <- (sb2*FPC*1/(n*(Mbar^2)))+(1/(N*n*Mbar^2)*sm); round(Vhat_mu, 4)
## [1] 0.0371
B <- 2*sqrt(Vhat_mu)
ci <- mu_hat + c(-1,1)*B; round(ci, 2)
## [1] 4.42 5.19
A nurseryman wants to estimate the average height of seedlings in a large field that is divided into 50 plots that vary slightly in size. He believes the heights are fairly constant throughout each plot but may vary considerably from plot to plot. Therefore, he decides to sample 10% of the trees within each of 10 plots using a two-stage cluster sample. The data are as given in the accompanying table. Estimate the average height of seedlings in the field and place a bound on the error of estimation
rm(list=ls())
N <- 50
n <- 10
FPC <- 1-n/N
plot_data <- data.frame(
Plot = 1:10,
Total_Seedlings = c(52, 56, 60, 46, 49, 51, 50, 61, 60, 45),
Sampled_Seedlings = c(5, 6, 6, 5, 5, 5, 5, 6, 6, 6)
)
heights_list <- list(
c(12, 11, 12, 10, 13),
c(10, 9, 7, 9, 8, 10),
c(6, 5, 7, 5, 6, 4),
c(7, 8, 7, 7, 6),
c(10, 11, 13, 12, 12),
c(14, 15, 13, 12, 13),
c(6, 7, 6, 8, 7),
c(9, 10, 8, 9, 9, 10),
c(7, 10, 8, 9, 9, 10),
c(12, 11, 12, 13, 12, 12)
)
seedling_data <- do.call(
rbind,
lapply(1:10, function(i) {
data.frame(
Plot = i,
Height = heights_list[[i]]
)
})
)
# Merge cluster info
seedling_data <- merge(seedling_data, plot_data, by = "Plot")
seedling_data # long form so we can squeeze into df
Mi <- plot_data$Total_Seedlings
M <- sum(Mi)
Mbarhat <- sum(Mi)/n
summary <- seedling_data |> group_by(Plot) |> summarize(ybar_i = mean(Height))
ybar_i <- summary$ybar_i
mu_rts <- sum(Mi*ybar_i)/sum(Mi); round(mu_rts, 2)
## [1] 9.38
# variance est
FPC <- 1-n/N; FPC
## [1] 0.8
sr2 <- (1/(n-1))*sum((Mi*ybar_i - Mi*mu_rts)^2)
between <- FPC*(1/(n*(Mbarhat^2)))*sr2
(Aside : Notice we skip Ch10 as it is not discussed in the course)
11.5*
12.1-12.2