Tutorial Submission 3

Workshop 7

Part 2: Theory

According to data from Ipsos Global Express, 64% of Americans say there is never enough time in the day to get things done. Suppose this percentage is based on a random sample of 900 Americans.
- What is the point estimate of the corresponding population proportion?
  
  \[ p=0.64 \]
- Construct a 97% CI for the population proportion using the Wald and Agresti-Coull methods.
  
  Wald Method
  
  \[ \begin{align} [p-z_{1-a/2}\sqrt{\frac{p(1-p)}{n}},&\ p+z_{1-a/2}\sqrt{\frac{p(1-p)}{n}}] \\ [0.64-2.17\cdot\sqrt{\frac{0.64(1-0.64)}{900}},&\ 0.64+2.17\cdot\sqrt{\frac{0.64(1-0.64)}{900}}] \\ [0.6053,0.6747] \end{align} \]
  
  Agresti-Coull Method
  
  \[ \begin{align} [\tilde{p}-z_{1-a/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}},&\ \tilde{p}+z_{1-a/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{\tilde{n}}}] \\ \tilde{n}=900+2.17&^2=904.7089 \\ \tilde{p}=\frac{1}{904.7089}(576+&0.5\cdot2.17^2) = 0.6392... \\ \text{Plugging values in:}& \\ [0.6046,0.6739] \end{align} \]
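
As a check on the arithmetic above, here is a minimal R sketch of both intervals. The helper names `wald_ci` and `agresti_coull_ci` are illustrative and not part of the workshop material. The critical value comes from `qnorm` rather than the rounded 2.17, so the results may differ from the hand calculation in the last decimal place.

```r
# Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)
wald_ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  p_hat <- x / n
  half <- z * sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - half, upper = p_hat + half)
}

# Agresti-Coull interval: add z^2 / 2 successes and z^2 trials,
# then apply the Wald formula to the adjusted values
agresti_coull_ci <- function(x, n, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  n_tilde <- n + z^2
  p_tilde <- (x + z^2 / 2) / n_tilde
  half <- z * sqrt(p_tilde * (1 - p_tilde) / n_tilde)
  c(lower = p_tilde - half, upper = p_tilde + half)
}

# 64% of 900 respondents corresponds to 576 successes
wald <- wald_ci(576, 900, conf = 0.97)
ac <- agresti_coull_ci(576, 900, conf = 0.97)
print(round(wald, 4))
print(round(ac, 4))
```

With a sample this large the two methods differ by less than 0.001 at either endpoint.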

A random sample of 16 airline passengers at the Bay City airport showed that the mean time spent waiting in line to check in at the ticket counters was 31 minutes with a standard deviation of 7 minutes. Construct a 99% CI for the mean time spent waiting in line by all passengers at this airport. Assume the waiting times are normally distributed.

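Because the standard deviation is computed from the sample of 16 passengers, the critical value comes from the $t$ distribution with $n-1=15$ degrees of freedom:

\[ \begin{align} \bar{x}\pm t_{1-a/2,\,n-1}\frac{s}{\sqrt{n}} \\ 31 \pm 2.947\cdot\frac{7}{\sqrt{16}} \\ [25.843,36.157] \end{align} \]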
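
The interval can be reproduced in R from the summary statistics alone. The sketch below uses an illustrative helper, `mean_ci`, and computes the interval with both a normal and a $t$ critical value for comparison. Since the standard deviation of 7 comes from the sample of 16 itself, the $t$ value with 15 degrees of freedom is the usual choice.

```r
# CI for a mean from summary statistics.
# `use_t` switches between t and normal critical values.
mean_ci <- function(xbar, s, n, conf = 0.95, use_t = TRUE) {
  a <- 1 - conf
  crit <- if (use_t) qt(1 - a / 2, df = n - 1) else qnorm(1 - a / 2)
  half <- crit * s / sqrt(n)
  c(lower = xbar - half, upper = xbar + half)
}

t_int <- mean_ci(31, 7, 16, conf = 0.99)                 # t with 15 degrees of freedom
z_int <- mean_ci(31, 7, 16, conf = 0.99, use_t = FALSE)  # normal critical value
print(round(t_int, 3))
print(round(z_int, 3))
```

The $t$ interval is wider, which reflects the extra uncertainty from estimating the standard deviation with only 16 observations.
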

Money magazine reported that 30% of adults in the US could not correctly define any of the four main types of life insurance. The article cautioned there is a 3.1% margin of error. How many adults were interviewed in this survey? In your own words, explain what a 3.1% margin of error implies.

\[ \begin{align} 0.031 &= 1.96\sqrt{\frac{1}{4n}} \\ 0.0158...&=\sqrt{\frac{1}{4n}} \\ n&=\frac{1}{4\cdot0.0158...^2} \\ &=999.375...\\ &=1000\ \text{adults} \end{align} \]

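The margin of error is half the width of the 95% confidence interval. It implies that, with 95% confidence, the true proportion of US adults who could not correctly define any of the four main types of life insurance lies within 3.1 percentage points of the reported 30%, that is, between 26.9% and 33.1%.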
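
To see how sample size and margin of error relate, here is a small R sketch. The function `margin_of_error` is a hypothetical helper. It assumes the Wald form of the margin and defaults to the worst case $p = 0.5$ used in the calculation above.

```r
# Wald margin of error: z * sqrt(p * (1 - p) / n)
margin_of_error <- function(n, p = 0.5, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)
  z * sqrt(p * (1 - p) / n)
}

moe_worst <- margin_of_error(1000)           # worst case, p = 0.5
moe_at_30 <- margin_of_error(1000, p = 0.3)  # at the reported proportion
print(round(c(worst_case = moe_worst, at_p_0.3 = moe_at_30), 4))

# Interval implied by the reported estimate and margin
print(0.30 + c(-1, 1) * 0.031)
```

With 1000 respondents the worst-case margin is about 0.031, which matches the figure quoted in the article. Plugging in $p = 0.3$ instead gives a slightly smaller margin.
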
Let $Y_1,...,Y_n$ be a random sample of size $n$ from the pdf of $f_Y(y;\theta)=\frac{1}{\theta}e^{-y/\theta},y>0$.
1. Show that $\hat{\theta}_1=Y_1$ and $\hat{\theta}_2=\bar Y$ are unbiased estimators for $\theta$.
  
  \[ \begin{align} Y\sim& Exp(\lambda=\frac{1}{\theta}) \\ E[\hat{\theta}_1]&=E[\hat{\theta}_2]=\theta \\ E[Y_1]&=\theta\\ E[\bar{Y}]=E[\frac{\sum^n_{i=1}Y_i}{n}]&=\sum^n_{i=1}\frac{E[Y_i]}{n}=\frac{n\theta}{n}=\theta \end{align} \]
  
  Both equal $\theta$, so both are unbiased.
2. Which is the more efficient estimator?
  
  \[ \begin{align} Var[Y_1]&=\theta^2 \\ Var[\bar{Y}]&=\frac{1}{n^2}\sum^n_{i=1}Var[Y_i]=\frac{\theta^2}{n} \\ \frac{\theta^2}{n}&<\theta^2\ \text{ for } n>1\\ \text{so, }\bar{Y} &\text{ is the more efficient estimator} \end{align} \]
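
A quick simulation can make the comparison concrete. The sketch below uses illustrative values $\theta = 2$ and $n = 10$ (neither appears in the exercise) and a fixed seed. Note that `rexp` is parameterised by the rate $1/\theta$, not by the mean.

```r
set.seed(1)
theta <- 2      # assumed true mean, chosen for illustration
n <- 10         # assumed sample size
reps <- 20000   # number of simulated samples

# Each column is one sample of size n from Exp(rate = 1 / theta)
samples <- matrix(rexp(n * reps, rate = 1 / theta), nrow = n)

theta1 <- samples[1, ]       # estimator 1: the first observation
theta2 <- colMeans(samples)  # estimator 2: the sample mean

# Both averages should be near theta;
# the variances should be near theta^2 and theta^2 / n respectively
print(c(mean_theta1 = mean(theta1), mean_theta2 = mean(theta2)))
print(c(var_theta1 = var(theta1), var_theta2 = var(theta2)))
```

Both estimators centre on $\theta$, while the spread of $\bar{Y}$ shrinks by roughly a factor of $n$.
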

(Challenge question) A public health survey is being planned in a large metropolitan area for the purpose of estimating the proportion of children, ages 0 to 14, who are lacking adequate polio immunization. The sample proportion of inadequately immunized children is $\frac{X}{n}$. The organiser plans to estimate the proportion using a Wald CI with a confidence level of 98%. How large a sample is needed so that the true proportion is within 0.05 of the sample proportion? (Hint: use an approach similar to the sample size calculation for a population mean, with the maximum $p(1-p) = 0.25$.)

\[ \begin{align} d &= z_{1-a/2}\sqrt{\frac{1}{4n}} \\ (\frac{d}{z_{1-a/2}})^2&=\frac{1}{4n}\\ n&=0.25(z_{1-a/2}/d)^2\\ &=0.25(2.326/0.05)^2\\ &=541.03...\\ &=542 \end{align} \]
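
The same rearrangement can be wrapped in a short R function. `wald_sample_size` is an illustrative name. It assumes the Wald margin with a user-supplied guess for $p$, defaulting to the worst case $p = 0.5$, and rounds up to the next whole observation.

```r
# Smallest n such that z * sqrt(p * (1 - p) / n) <= d
wald_sample_size <- function(d, conf = 0.95, p = 0.5) {
  z <- qnorm(1 - (1 - conf) / 2)
  ceiling(p * (1 - p) * (z / d)^2)
}

n_needed <- wald_sample_size(d = 0.05, conf = 0.98)
print(n_needed)

# Sanity check: the margin at n_needed should not exceed d
achieved <- qnorm(0.99) * sqrt(0.25 / n_needed)
print(achieved <= 0.05)
```

Rounding up rather than to the nearest integer guarantees that the achieved margin stays at or below the target.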

Workshop 8

Computer Workshop

Write your own function which computes bootstrapped CI using the percentile method. (You can only use the functions in the base package).

```
library(PASWR2)  # needed only for the TITANIC3 data set
```
```
## Loading required package: lattice
## Loading required package: ggplot2
```
```
# Percentile bootstrap CI: resample `data` with replacement R times,
# apply `statistic` to each resample, and take the alpha/2 and
# 1 - alpha/2 order statistics of the bootstrap distribution.
bootstrapped_CI <- function(data, statistic, R = 9999, alpha) {
  boot_stats <- numeric(R)

  for (i in seq_len(R)) {
    # Resample the observed data, keeping the original sample size
    resample <- sample(data, length(data), replace = TRUE)
    boot_stats[i] <- statistic(resample)
  }

  # round() guards against floating-point error in the index
  sorted <- sort(boot_stats)
  PCBI <- c(sorted[round((R + 1) * (alpha / 2))],
            sorted[round((R + 1) * (1 - alpha / 2))])

  return(PCBI)
}

# Sample proportion of a 0/1 vector
prop <- function(data) {
  p <- sum(data) / length(data)
  return(p)
}
```

Randomly select 30 samples from the TITANIC3 data (in PASWR2 package) and use this as your sample.
```
s <- sample(TITANIC3$survived, 30, replace = TRUE)
```
Use the function you wrote to calculate 90% bootstrapped confidence interval (percentile) for the proportion of the population survived.
```
bootstrapped_CI(s, prop, alpha = 0.1)
```
```
## [1] 0.2333333 0.5333333
```

Check if the true survival rate is within the interval.
```
pop <- sum(TITANIC3$survived) / length(TITANIC3$survived)
print(pop)
```
```
## [1] 0.381971
```

The true survival rate, 0.382, lies within the interval.

Repeat the analysis 100 times. What is the coverage of the bootstrapped confidence interval (percentile)?
```
coverage <- 0

for (j in 1:100) {
  # Draw a fresh sample of 30 and bootstrap that sample
  s_j <- sample(TITANIC3$survived, 30, replace = TRUE)
  CI <- bootstrapped_CI(s_j, prop, alpha = 0.1)

  if (pop >= CI[1] && pop <= CI[2]) {
    coverage <- coverage + 1
  }
}

coverage
```

No seed is set, so the count changes from run to run. For a 90% interval it should come out close to 90 of the 100 repetitions.
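
For a check that does not depend on the PASWR2 data, the coverage experiment can also be sketched on a simulated 0/1 population. Everything below is an assumption made for illustration: the name `percentile_ci`, the Bernoulli population with success probability 0.38, the seed, and the smaller values of `R` and the number of repetitions, which keep the run time short.

```r
# Percentile bootstrap CI for an arbitrary statistic
percentile_ci <- function(x, statistic, R = 999, alpha = 0.1) {
  boot_stats <- numeric(R)
  for (i in seq_len(R)) {
    boot_stats[i] <- statistic(sample(x, length(x), replace = TRUE))
  }
  sorted <- sort(boot_stats)
  c(sorted[round((R + 1) * alpha / 2)],
    sorted[round((R + 1) * (1 - alpha / 2))])
}

set.seed(2024)
p_true <- 0.38   # assumed population proportion
n_rep <- 200     # number of repeated samples
covered <- logical(n_rep)

for (j in seq_len(n_rep)) {
  # Fresh sample of 30 observations from the simulated population
  x <- sample(c(0, 1), 30, replace = TRUE, prob = c(1 - p_true, p_true))
  ci <- percentile_ci(x, mean)
  covered[j] <- ci[1] <= p_true && p_true <= ci[2]
}

coverage_rate <- mean(covered)
print(coverage_rate)
```

The printed rate can be compared with the nominal 90%. With only 200 repetitions it carries a simulation error of a few percentage points.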

2024-04-28