Stat130HW3

Q1:

Ratio estimation: I would measure the time devoted to sports in a sample of news broadcasts and use the ratio of sports time to total time to estimate the proportion across all broadcasts.
Regression estimation: I would use this by relating the number of fish caught to variables such as the angler’s type of bait used to predict the average catch per hour.
Ratio estimation: I would do this by dividing the total textbook spending by the total number of students.
Ratio estimation: I would measure the usable meat weight of a sample of chickens and calculate the ratio of usable to total weight.

library(ggplot2)
library(dplyr)
data <- data.frame(
  Diameter = c(12.0, 11.4, 7.9, 9.0, 10.5, 7.9, 7.3, 10.2, 11.7, 11.3,5.7, 8.0, 10.3, 12.0, 9.2, 8.5, 7.0, 10.7, 9.3, 8.2),
  Age = c(125, 119, 83, 85, 99, 117, 69, 133, 154, 168,61, 80, 114, 147, 122, 106, 82, 88, 97, 99)
)

#age vsdiameter 
ggplot(data, aes(x = Diameter, y = Age)) +
  geom_point() +
  labs(title = "Tree Age vs. Diameter",
       x = "Diameter in cm",
       y = "Age in years")

x_bar <- mean(data$Diameter)
y_bar <- mean(data$Age)

#polation mean
X_pop <- 10.3

#ratio estimation of the mean age
R_hat <- y_bar / x_bar
Y_hat_rat <- R_hat * X_pop
Y_hat_rat

## [1] 117.6204

residuals <- data$Age - R_hat * data$Diameter
S_e_squared <- sum(residuals^2) / (length(data$Age) - 1)

#ratio estimation standardd error
n <- nrow(data)
SE_1 <- sqrt((1 - n/1132)*(S_e_squared/n)*(X_pop^2/x_bar^2))
SE_1

## [1] 4.354872

#regression
model <- lm(Age ~ Diameter, data = data)

beta_0 <- coef(model)[1]
beta_1 <- coef(model)[2]
Y_hat_reg <- beta_0 + beta_1 * 10.3
Y_hat_reg

## (Intercept) 
##    118.3634

#regression estimation standard error
SE_regression <- summary(model)$sigma / sqrt(n) * sqrt(1 - n / 1132)
cat("Standard Error of Regression Estimate", SE_regression, "\n")

## Standard Error of Regression Estimate 4.070774

# ratio and reg estimates in graph
ggplot(data, aes(x = Diameter, y = Age)) +
  geom_point() +
  geom_hline(yintercept = Y_hat_rat, color = "blue") +
  geom_hline(yintercept = Y_hat_reg, color = "red", linetype = "dotted") +
  labs(title = "Scatterplot of Tree Age vs. Diameter with Estimates",
       x = "Diameter (cm)",
       y = "Age (years)")

The Ratio Estimation is the solid blue line.
The Regression estimation is the dotted red line.
How do the estimates compare? The ratio and regression estimates are almost overlapping on the graph, which suggests that both estimation methods are calculating very similar population mean ages, which we see in the calculations above.

Q10

data(agsrs)
#acres92 vs. farms87
plot(agsrs$farms87, agsrs$acres92, 
     xlab = "Number of Farms in 1987", 
     ylab = "Acres Devoted to Farming in 1992", 
     main = "Relationship between Number of Farms (1987) and Acres Farmed (1992)")

farms_tot_87 <- 2087759
sample_ratio <- sum(agsrs$acres92) / sum(agsrs$farms87)

#ratio estimate
rat_est <- sample_ratio * farms_tot_87
rat_est

## [1] 960155061

tot_farms87 <- 2087759
reg_model <- lm(acres92 ~ farms87, data = agsrs)

betaa_0 <- coef(reg_model)[1]
betaa_1 <- coef(reg_model)[2]

reg_est <- (tot_farms87 * betaa_1) + (300 * betaa_0)
reg_est

##   farms87 
## 179597441

To find out which method Gives the most Precision, I have to find the Standard Error of Both Estimates:

var_y <- var(agsrs$acres92)
var_x <- var(agsrs$farms87)

se_ratio <- (960155061)*sqrt((1/300)* (var_y/var_x) * (1 - (300/2087759)))

residuals <- residuals(reg_model)
se_regression <- sqrt(sum(residuals^2) / (300 - 2))
se_total_regression <- 179597441 * se_regression

cat("Standard Error for Ratio Estimate: ", se_ratio, "\n")

## Standard Error for Ratio Estimate:  44284844226

cat("Standard Error for Reg Estimate: ", se_total_regression, "\n")

## Standard Error for Reg Estimate:  6.187402e+13

With our output, we see the ratio estimate has the lower standard error, meaning this method gives the most precision.

Q17

mean_lead <- 127
sd_lead <- 146
n_lead <- 121

se_lead <- sd_lead / sqrt(n_lead)
ci_lead_lower <- mean_lead - 1.98 * se_lead
ci_lead_upper <- mean_lead + 1.98 * se_lead

mean_copper <- 35
sd_copper <- 16
n_copper <- 121

se_copper <- sd_copper / sqrt(n_copper)
ci_copper_lower <- mean_copper - 1.98 * se_copper
ci_copper_upper <- mean_copper + 1.98 * se_copper

cat("Confidence Interval for Lead: ", "[",ci_lead_lower, ",",ci_lead_upper, "]", "\n")

## Confidence Interval for Lead:  [ 100.72 , 153.28 ]

cat("Confidence Interval for copper: ", "[",ci_copper_lower, ",",ci_copper_upper, "]")

## Confidence Interval for copper:  [ 32.12 , 37.88 ]

n_lead_total <- 82 + 31 + 8

weighted_mean_lead <- (82 * 71 + 31 * 259 + 8 * 189) / n_lead_total

var_A_lead <- (28^2) / 82
var_B_lead <- (232^2) / 31
var_C_lead <- (79^2) / 8

var_lead_strat <- var_A_lead + var_B_lead + var_C_lead

se_lead_strat <- sqrt(var_lead_strat)

ci_lead_strat_lower <- weighted_mean_lead - 1.98 * se_lead_strat
ci_lead_strat_upper <- weighted_mean_lead + 1.98 * se_lead_strat

cat("Poststratified Sample 95% CI for Lead:", ci_lead_strat_lower, "to", ci_lead_strat_upper, "\n")

## Poststratified Sample 95% CI for Lead: 27.45458 to 226.4793

n_total <- 82 + 31 + 8
mean_weighted <- (82 * 28 + 31 * 50 + 8 * 45) / n_total

var_A <- (9^2) / 82
var_B <- (18^2) / 31
var_C <- (15^2) / 8

total_variance <- (82^2 / n_total^2) * var_A + (31^2 / n_total^2) * var_B + (8^2 / n_total^2) * var_C

# se
se_stratified <- sqrt(total_variance)

df_stratified <- n_total - 1

t_value_stratified <- qt(0.975, df_stratified)

ci_lower_stratified <- mean_weighted - t_value_stratified * se_stratified
ci_upper_stratified <- mean_weighted + t_value_stratified * se_stratified

cat("Poststratified Sample 95% CI for Copper:", ci_lower_stratified, "to", ci_upper_stratified, "\n")

## Poststratified Sample 95% CI for Copper: 32.53556 to 36.98511

The poststratified CI for Lead is larger than the regular CI

The poststratified CI for Copper is narrower compared to the regular CI

Yes, i believe using stratification in future surveys would increae precision.

Q40

data(agpop)
agpop$farms92 <- factor(agpop$farms92)

n_srs <- 500
n <- 40
y_values <- numeric(n_srs)
ypost_values <- numeric(n_srs)

Q42

Case Fatality Rate: This rate focuses on the proportion of confirmed cases that result in death and is used to measure the lethality of a disease among people diagnosed with it.

Use of Ratio Estimation: We would use a ratio of deaths to confirmed cases, estimating the lethality of a disease among diagnosed individuals.

Sampling Error: In some countries, testing could have biases which could result in an underreporting or overreporting of the number of confirmed cases.

Measurement Error: There may be errors in diagnosis as some deaths may not be linked to the disease.

Selection Bias: If only hospitalized patients are included in the confirmed cases, this may exclude milder cases of the disease, inflating the CFR for the population with severe disease.

Population Mortality Rate: This rate gives an estimate of the likelihood of dying from a particular disease for the entire population.

Use of Ratio Estimation: Population Mortality Rate is a ratio of deaths to the total population, estimating the likelihood of death for the entire population.

Sampling Error: If the data on deaths is sampled, it may not represent the entire population accurately.

Measurement Error: Errors can arise from classifying certain deaths as something else.

Selection Bias: If the population data isn’t representative of certain groups or regions, it can bias the mortality rate.

Q45B

data(baseball)

total_homeruns <- sum(baseball$homerun)
total_runs <- sum(baseball$run)

baseball_ratio <- total_homeruns / total_runs
cat("Ratio:", baseball_ratio, "\n")

## Ratio: 0.2363149

set.seed(123)
ratios <- numeric(1000)

for (i in 1:1000) {
  sample_data <- baseball[sample(1:nrow(baseball), replace = TRUE), ]
  
  total_hr <- sum(sample_data$homerun)
  total_r <- sum(sample_data$run)
  ratios[i] <- total_hr / total_r
}

ci_lower <- quantile(ratios, 0.025)
ci_upper <- quantile(ratios, 0.975)
cat("Confidence Interval for Ratio: [", ci_lower, ",", ci_upper, "]\n")

## Confidence Interval for Ratio: [ 0.2233592 , 0.2489921 ]

Stat130HW3

Nate McClaskey

2024-11-15