\(var(\hat{p})=var(\overline{X}_{N})=var(\frac{1}{N}\sum_{i = 1}^{N}X_{i})\) \(=\frac{1}{N^2}var(\sum_{i = 1}^{N}X_{i})\) \(=\frac{1}{N^2}\cdot Np(1-p)\) \(=\frac{p(1-p)}{N}\), because the \(X_{i}\) are independent Bernoulli(\(p\)) variables, each with variance \(p(1-p)\).
\(se(\hat{p})=se(\overline{X}_{N})=(var(\overline{X}_{N}))^{\frac{1}{2}}=(\frac{p(1-p)}{N})^{\frac{1}{2}}\) \(=\sqrt{\frac{p(1-p)}{N}}\)
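From (A), we know \(se(\overline{X}_{N})=\sqrt{\frac{p(1-p)}{N}}\), and by the same argument \(se(\overline{Y}_{M})=\sqrt{\frac{q(1-q)}{M}}\). Because the two samples are independent, the variances of the two sample means add: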
\(se(\hat{p}-\hat{q})=se(\overline{X}_{N}-\overline{Y}_{M})=\sqrt{se^2(\overline{X}_{N})+se^2(\overline{Y}_{M})}\) \(=\sqrt{\frac{p(1-p)}{N}+\frac{q(1-q)}{M}}\)
\(se(\hat{p}-\hat{q})=\sqrt{\frac{p(1-p)}{N}+\frac{q(1-q)}{M}}\)
predimed = mutate(predimed,
meddiet = ifelse(group == 'Control', 'No', 'Yes'))
predimed <- predimed %>% mutate(card_event = ifelse(event=='Yes', yes=1, no=0))
predimed %>%
group_by(meddiet) %>%
summarize(avg=mean(card_event), count = n(), std_dev = sd(card_event))
## # A tibble: 2 × 4
## meddiet avg count std_dev
## <chr> <dbl> <int> <dbl>
## 1 No 0.0475 2042 0.213
## 2 Yes 0.0362 4282 0.187
se_hat_p = 0.2127631/sqrt(2042)
se_hat_q = 0.1868044/sqrt(4282)
se_difference = sqrt(se_hat_p^2 + se_hat_q^2)
So, we have the \(se(\hat{p}-\hat{q})\)
se_difference
## [1] 0.005506175
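As a rough check, plugging the rounded sample proportions from the table into the formula \(\sqrt{\frac{p(1-p)}{N}+\frac{q(1-q)}{M}}\) derived above gives nearly the same value. The small gap arises because `sd()` divides by \(n-1\) rather than \(n\).

\[\begin{aligned} se(\hat{p}-\hat{q})&\approx\sqrt{\frac{0.0475(1-0.0475)}{2042}+\frac{0.0362(1-0.0362)}{4282}}\\ &\approx\sqrt{2.216\times10^{-5}+8.15\times10^{-6}}\\ &\approx 0.0055 \end{aligned}\]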
Using the standard error from (C), we can now compute the normal-based CI
CI_normal_med = (0.04750245 - 0.03619804) + c(-1.96, 1.96)*se_difference
Now, we compute the bootstrap version of CI
## Confidence Interval from Bootstrap Distribution (1000 replicates)
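The chunk that produced `CI_boot_med` is not shown here. Below is a minimal base R sketch of one way to get a percentile bootstrap interval for a difference in means. The helper name `boot_diff_ci` and the simulated 0/1 vectors are hypothetical. Note that it resamples within each group, rather than resampling whole rows as the mosaic `do()`/`resample()` code later in this document does.

```r
# Hypothetical helper: percentile bootstrap CI for mean(x) - mean(y).
# Each group is resampled with replacement, independently of the other.
boot_diff_ci <- function(x, y, B = 1000, level = 0.95) {
  diffs <- replicate(B, mean(sample(x, replace = TRUE)) -
                        mean(sample(y, replace = TRUE)))
  alpha <- 1 - level
  quantile(diffs, probs = c(alpha / 2, 1 - alpha / 2))
}

# Simulated 0/1 outcomes for illustration only (not the predimed data)
set.seed(1)
x_sim <- rbinom(500, size = 1, prob = 0.30)
y_sim <- rbinom(500, size = 1, prob = 0.20)
ci_sim <- boot_diff_ci(x_sim, y_sim)
ci_sim
```

On the data used here, the same helper could be called with `predimed$card_event[predimed$meddiet == 'No']` and `predimed$card_event[predimed$meddiet == 'Yes']` as `x` and `y`. The endpoints will vary slightly from run to run unless a seed is set.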
Compare the two CI
CI_normal_med
## [1] 0.0005123067 0.0220965133
CI_boot_med
## 2.5% 97.5%
## percentile 0.0006525066 0.02259476
The two intervals are nearly identical. The bootstrap CI is in fact slightly wider (about 0.0219 versus 0.0216 for the normal-based CI) and is shifted slightly upward. Both intervals exclude 0.
t.test(card_event~meddiet, data = predimed , alternative = 'greater')
##
## Welch Two Sample t-test
##
## data: card_event by meddiet
## t = 2.053, df = 3586.4, p-value = 0.02007
## alternative hypothesis: true difference in means between group No and group Yes is greater than 0
## 95 percent confidence interval:
## 0.002245218 Inf
## sample estimates:
## mean in group No mean in group Yes
## 0.04750245 0.03619804
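The one-sided p-value under the null hypothesis of no difference between the groups is 0.02007, so at the 5% level we reject the null in favor of a higher event rate in the control group.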
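The reported test statistic can be reproduced by hand from the group means in the output and the standard error computed in (C), as a quick consistency check:

\[\begin{aligned} t&=\frac{\hat{p}-\hat{q}}{se(\hat{p}-\hat{q})}=\frac{0.04750245-0.03619804}{0.005506175}\\ &=\frac{0.01130441}{0.005506175}\approx 2.053 \end{aligned}\]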
\(se(\overline{X}_{N}-\overline{Y}_{M})=\sqrt{(se(\overline{X}_{N}))^2+(se(\overline{Y}_{M}))^2}\) \(=\sqrt{\frac{var_{X}}{N}+\frac{var_{Y}}{M}}\) \(=\sqrt{\frac{\sigma_{X}^2}{N}+\frac{\sigma_{Y}^2}{M}}\)
First, we find the standard error of the mean price for each group
GasPrices %>% group_by(Highway) %>%
summarize(price_mean = mean(Price), sample_size = n(), std_dev = sd(Price))
## # A tibble: 2 × 4
## Highway price_mean sample_size std_dev
## <chr> <dbl> <int> <dbl>
## 1 N 1.85 79 0.0807
## 2 Y 1.9 22 0.0759
se_hat_No = 0.08066524/sqrt(79)   # Highway == 'N': sd 0.0807, n = 79
se_hat_Yes = 0.07590721/sqrt(22)  # Highway == 'Y': sd 0.0759, n = 22
Then, we compute the normal-based confidence interval
se_hat = sqrt(se_hat_Yes^2+se_hat_No^2)
CI_normal_gas = 1.900-1.854303 + c(-1.96, 1.96)*se_hat
Compute the bootstrap CI
boot2 = do(1000)*{
GasPrices_boot = resample(GasPrices)
table2_boot = GasPrices_boot %>%
group_by(Highway) %>%
summarize(mprice = mean(Price))
table2_boot$mprice
}
CI_boot_gas = confint(boot2[,2]-boot2[,1], level=0.95)
Compare the two CI
CI_normal_gas
## [1] 0.009330134 0.082063866
CI_boot_gas
## 2.5% 97.5%
## percentile 0.009004028 0.08064436
The two CIs are almost the same width (about 0.0727 for the normal-based interval and 0.0716 for the bootstrap interval). The normal-based interval has slightly larger lower and upper bounds than the bootstrap one.
t.test(Price~Highway, data = GasPrices)
##
## Welch Two Sample t-test
##
## data: Price by Highway
## t = -2.4628, df = 35.344, p-value = 0.0188
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
## -0.083350777 -0.008041628
## sample estimates:
## mean in group N mean in group Y
## 1.854304 1.900000
Under the null hypothesis that the mean prices are equal, the two-sided p-value is 0.0188
From the hint, we have \(e^y=\sum_{k=0}^{\infty}\frac{y^k}{k!}\), so \(e^{\lambda}=\sum_{k=0}^{\infty}\frac{\lambda^k}{k!}\) and \(e^{-\lambda}=\sum_{k=0}^{\infty}\frac{(-\lambda)^k}{k!}\)
\[\begin{aligned} E(X)&=\sum_{k=0}^{\infty}k\,e^{-\lambda}\frac{\lambda^k}{k!}\\ &=e^{-\lambda}\sum_{k=1}^{\infty}k\,\frac{\lambda^k}{k!} \\ &=e^{-\lambda}\sum_{k=1}^{\infty}\frac{\lambda^k}{(k-1)!}\\ &=e^{-\lambda}\lambda\sum_{k=1}^{\infty}\frac{\lambda^{k-1}}{(k-1)!} \\ &=e^{-\lambda}(\lambda e^{\lambda}) =\lambda \end{aligned}\]\(E(X)=\lambda\)
\[\begin{aligned} E(X_{i}^2)&=\sum_{x=0}^{\infty}x^2\frac{\lambda^x}{x!}e^{-\lambda}=\sum_{x=0}^{\infty}\big(x(x-1)+x\big)\frac{\lambda^x}{x!}e^{-\lambda}\\ &=\sum_{x=2}^{\infty}x(x-1)\frac{\lambda^x}{x!}e^{-\lambda}+\sum_{x=1}^{\infty}x\frac{\lambda^x}{x!}e^{-\lambda} \\ &=e^{-\lambda}\sum_{x=2}^{\infty}\frac{\lambda^x}{(x-2)!}+e^{-\lambda}\sum_{x=1}^{\infty}\frac{\lambda^x}{(x-1)!}\\ &=e^{-\lambda}\lambda^2\sum_{x=2}^{\infty}\frac{\lambda^{x-2}}{(x-2)!}+e^{-\lambda}\lambda\sum_{x=1}^{\infty}\frac{\lambda^{x-1}}{(x-1)!}\\ &=e^{-\lambda}(\lambda^2 e^{\lambda})+e^{-\lambda}(\lambda e^{\lambda})\\ &=\lambda^2+\lambda \end{aligned}\]
\(var(X_{i})=(\lambda^2+\lambda)-\lambda^2=\lambda\)
Or, we just use the fact that \(\hat{\lambda}=\overline{X}_{n}\), and we have
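\(s.e.(\hat{\lambda})=\sqrt{var(\overline{X}_{n})} =\sqrt{\frac {var(X_{i})}{n}}= \sqrt{\frac{\lambda}{n}}\), which can be estimated by plugging in \(\hat{\lambda}=\overline{X}_{n}\) for \(\lambda\), or the sample variance for \(var(X_{i})\).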
\[\begin{aligned} \log L(\lambda)&=\log\left(e^{-n\lambda}\frac{\lambda^{\sum_{i = 1}^{n}x_{i}}}{\prod_{i = 1}^{n}x_{i}!}\right)\\ &=-n\lambda+\left(\sum_{i = 1}^{n}x_{i}\right)\log\lambda-\sum_{i = 1}^{n}\log(x_{i}!) \end{aligned}\]
\[\begin{aligned} \frac{d}{d\lambda}\log L(\lambda)&=\frac{d}{d\lambda}\left(-n\lambda+\left(\sum_{i = 1}^{n}x_{i}\right)\log\lambda-\sum_{i = 1}^{n}\log(x_{i}!)\right)\\ &=-n+\frac{\sum_{i = 1}^{n}x_{i}} {\lambda} = 0\\ n&=\frac{\sum_{i = 1}^{n}x_{i}} {\lambda}\\ \hat{\lambda}&=\frac{\sum_{i = 1}^{n}x_{i}} {n}=\overline{X}_{n} \end{aligned}\]
\(N=30\), \(\overline{X}=14.0667\), \(var(X)=12.823\). We want the 95% confidence interval for \(\lambda_{d,h}\), which is given by \(\hat{\lambda}\pm1.96\cdot s.e.(\hat{\lambda})\). From (C), the standard error of \(\hat{\lambda}\) is the standard error of the sample mean: \(s.e.(\hat{\lambda})=s.e.(\overline{X})=\sqrt{\frac{var(X)}{N}}\) \(=\sqrt{\frac{12.823}{30}}=0.6538\)
Confidence interval: \(14.0667\pm1.96\cdot0.6538=14.0667\pm1.2814\)
\(12.7853<\lambda<15.3481\)
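For comparison, the interval can also be computed with the Poisson plug-in variance, using \(var(X_{i})=\lambda\approx\overline{X}\) in place of the sample variance. This is only an alternative sketch under the Poisson assumption; it gives a slightly wider interval:

\[\begin{aligned} s.e.(\hat{\lambda})&\approx\sqrt{\frac{\overline{X}}{N}}=\sqrt{\frac{14.0667}{30}}\approx0.6848\\ 14.0667\pm1.96\cdot0.6848&=14.0667\pm1.3421\\ 12.7246&<\lambda<15.4088 \end{aligned}\]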
Let \(Z_{i}=I(X_{i}\le x)\). It takes only the values 0 and 1, so it is a Bernoulli random variable with \(P(Z_{i}=1)=P(X_{i}\le x)=F(x)\), where \(F\) is the CDF. Hence \(E(Z_{i})=F(x)\) and \(var(Z_{i})=F(x)(1-F(x))\).
Since \(\hat{F}_{n}(x)=\frac{1}{n}\sum_{i = 1}^{n}Z_{i}\), it is a rescaled binomial: \(n\hat{F}_{n}(x)\sim Binomial(n, F(x))\), that is, \(\hat{F}_{n}(x)\sim\frac{1}{n}Binomial(n, F(x))\). Therefore \(E(\hat{F}_{n}(x))=F(x)\) and \(var(\hat{F}_{n}(x))=\frac{F(x)(1-F(x))}{n}\), which goes to 0 as \(n\) grows, so the empirical CDF converges to the true CDF at every \(x\).
We use the empirical CDF to estimate F(-0.1), then compute a 95% confidence interval from its standard error and the normal critical value 1.96
x<-pull(stocks_bonds,SP500)
Fn <- ecdf(x)
Fn(-0.1)
## [1] 0.1235955
std_Fn<-sqrt(Fn(-0.1)*(1-Fn(-0.1)))
se_Fn<-std_Fn/sqrt(89)
CI_Fn = Fn(-0.1) + c(-1.96, 1.96)*se_Fn
Finally, we get the confidence interval for F(-0.1)
CI_Fn
## [1] 0.05521777 0.19197324
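The same steps can be wrapped in a small function so that the sample size is taken from the data rather than typed in by hand. This is a minimal sketch: the helper name `ecdf_ci` and the toy vector are hypothetical, and it assumes `x` has no missing values.

```r
# Hypothetical helper: normal-approximation CI for F(q) from a sample x.
ecdf_ci <- function(x, q, z = 1.96) {
  n <- length(x)
  p_hat <- mean(x <= q)                 # same value as ecdf(x)(q)
  se <- sqrt(p_hat * (1 - p_hat) / n)   # Bernoulli standard error
  c(estimate = p_hat, lower = p_hat - z * se, upper = p_hat + z * se)
}

# Toy input for illustration only (not the stocks_bonds data)
toy <- c(-0.3, -0.2, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7)
res <- ecdf_ci(toy, -0.1)
res
```

With the vector `x` pulled from `stocks_bonds` above, `ecdf_ci(x, -0.1)` should return the same estimate and interval as `Fn(-0.1)` and `CI_Fn`.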