Stats Hw5

Question 1

1(A)

\(var(\hat{p})=var(\overline{X}_{N})=var(\frac{1}{N}\sum_{i = 1}^{N}X_{i})\) \(=\frac{1}{N^2}var(\sum_{i = 1}^{N}X_{i})\) \(=\frac{Np(1-p)}{N^2}=\frac{p(1-p)}{N}\), since the \(X_{i}\) are independent and each has variance \(p(1-p)\).

\(se(\hat{p})=se(\overline{X}_{N})=(var(\overline{X}_{N}))^{\frac{1}{2}}=(\frac{p(1-p)}{N})^{\frac{1}{2}}\) \(=\sqrt{\frac{p(1-p)}{N}}\)
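As a sanity check on the formula \(\sqrt{p(1-p)/N}\), here is a minimal simulation sketch. The values of `p`, `N`, and `reps` are illustrative assumptions and do not come from the assignment.

```r
# Simulation check of se(p_hat) = sqrt(p(1-p)/N).
# p, N, and reps are illustrative assumptions, not values from the assignment.
set.seed(1)
p <- 0.3
N <- 100
reps <- 10000

# Each replicate: N Bernoulli(p) trials summarised as a sample proportion
p_hat <- rbinom(reps, size = N, prob = p) / N

se_simulated <- sd(p_hat)             # spread of p_hat across replicates
se_formula <- sqrt(p * (1 - p) / N)   # formula derived in 1(A)
c(simulated = se_simulated, formula = se_formula)
```

The two numbers should agree up to simulation noise.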

1(B)

From (A), we know \(se(\overline{X}_{N})=\sqrt{\frac{p(1-p)}{N}}\), so \(se(\overline{Y}_{M})=\sqrt{\frac{q(1-q)}{M}}\)

\(se(\hat{p}-\hat{q})=se(\overline{X}_{N}-\overline{Y}_{M})=\sqrt{se^2(\overline{X}_{N})+se^2(\overline{Y}_{M})}\) \(=\sqrt{\frac{p(1-p)}{N}+\frac{q(1-q)}{M}}\)

\(se(\hat{p}-\hat{q})=\sqrt{\frac{p(1-p)}{N}+\frac{q(1-q)}{M}}\)
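The formula can be wrapped in a small helper. This is only a sketch: `se_diff_prop` is a hypothetical name, and the inputs are the group proportions and sample sizes reported in 1(C).

```r
# Hypothetical helper: plug-in standard error for a difference in proportions
se_diff_prop <- function(p_hat, q_hat, N, M) {
  sqrt(p_hat * (1 - p_hat) / N + q_hat * (1 - q_hat) / M)
}

# Proportions and group sizes as reported in the 1(C) summary table
se_example <- se_diff_prop(p_hat = 0.04750245, q_hat = 0.03619804,
                           N = 2042, M = 4282)
se_example
```

This uses \(\hat{p}(1-\hat{p})\) directly, so it can differ very slightly from a version based on `sd()`, which divides by \(n-1\).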

1(C) plug-in standard error for difference in proportion

predimed = mutate(predimed,
                  meddiet = ifelse(group == 'Control', 'No', 'Yes'))

predimed <- predimed %>% mutate(card_event = ifelse(event=='Yes', yes=1, no=0))

predimed %>%
  group_by(meddiet) %>%
  summarize(avg=mean(card_event), count = n(), std_dev = sd(card_event))

## # A tibble: 2 × 4
##   meddiet    avg count std_dev
##   <chr>    <dbl> <int>   <dbl>
## 1 No      0.0475  2042   0.213
## 2 Yes     0.0362  4282   0.187

se_hat_p = 0.2127631/sqrt(2042)
se_hat_q = 0.1868044/sqrt(4282)
se_difference = sqrt(se_hat_p^2 + se_hat_q^2)

This gives the plug-in estimate of \(se(\hat{p}-\hat{q})\):

se_difference

## [1] 0.005506175

1(D) bootstrap CI

Using the standard error from (C), we can now compute the normal-based CI

CI_normal_med = (0.04750245 - 0.03619804) + c(-1.96, 1.96)*se_difference

Now, we compute the bootstrap version of CI

## Confidence Interval from Bootstrap Distribution (1000 replicates)
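The bootstrap code itself is not shown here. The following is a minimal base-R sketch of a percentile bootstrap for a difference in proportions, not the code that produced `CI_boot_med`. It assumes simulated 0/1 outcomes with roughly the group sizes and event rates from 1(C), and it resamples within each group.

```r
# Base-R percentile bootstrap for a difference in proportions.
# The data are simulated stand-ins (an assumption): 0/1 outcomes with roughly
# the group sizes and event rates reported in 1(C), not the predimed data.
set.seed(1)
control <- rbinom(2042, size = 1, prob = 0.0475)
diet    <- rbinom(4282, size = 1, prob = 0.0362)

boot_diff <- replicate(1000, {
  # Resample each group with replacement and recompute the difference
  mean(sample(control, replace = TRUE)) - mean(sample(diet, replace = TRUE))
})

# Percentile interval: middle 95% of the bootstrap differences
ci_boot_sketch <- quantile(boot_diff, probs = c(0.025, 0.975))
ci_boot_sketch
```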

Compare the two CIs

CI_normal_med

## [1] 0.0005123067 0.0220965133

CI_boot_med

##                    2.5%      97.5%
## percentile 0.0006525066 0.02259476

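The two intervals are nearly identical. The bootstrapped CI is slightly wider (about 0.0219 versus 0.0216 for the normal-based one) and shifted slightly upward, so it is not more precise. Both intervals exclude 0.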

1(E) hypothesis testing

t.test(card_event~meddiet, data = predimed , alternative = 'greater')

## 
##  Welch Two Sample t-test
## 
## data:  card_event by meddiet
## t = 2.053, df = 3586.4, p-value = 0.02007
## alternative hypothesis: true difference in means between group No and group Yes is greater than 0
## 95 percent confidence interval:
##  0.002245218         Inf
## sample estimates:
##  mean in group No mean in group Yes 
##        0.04750245        0.03619804

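The one-sided p-value is 0.02007, so at the 5% level we reject the null hypothesis that the event rate is the same in both groups, in favor of a higher rate in the group without the Mediterranean diet.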
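To see where the test statistic comes from, it can be rebuilt from the summary numbers in 1(C). This sketch uses a normal approximation instead of the t distribution (with about 3586 degrees of freedom the two are nearly the same), so the p-value is approximate.

```r
# Rebuild the test statistic from the summary numbers reported in 1(C)
diff_hat <- 0.04750245 - 0.03619804                        # control minus diet
se_diff  <- sqrt(0.2127631^2 / 2042 + 0.1868044^2 / 4282)  # unpooled se

z_stat <- diff_hat / se_diff

# One-sided p-value under a normal approximation (assumption: df is large
# enough that the t and normal distributions are nearly the same)
p_one_sided <- pnorm(z_stat, lower.tail = FALSE)
c(statistic = z_stat, p_value = p_one_sided)
```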

Question 2

2(A)

\(se(\overline{X}_{N}-\overline{Y}_{M})=\sqrt{(se(\overline{X}_{N}))^2+(se(\overline{Y}_{M}))^2}\) \(=\sqrt{\frac{var_{X}}{N}+\frac{var_{Y}}{M}}\) \(=\sqrt{\frac{\sigma_{X}^2}{N}+\frac{\sigma_{Y}^2}{M}}\)
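A quick simulation can confirm this formula for two independent samples with different sizes and variances. All parameter values below are illustrative assumptions.

```r
# Simulation check of se(Xbar - Ybar) = sqrt(sigma_x^2/N + sigma_y^2/M).
# Sample sizes and standard deviations are illustrative assumptions.
set.seed(1)
N <- 79
M <- 22
sigma_x <- 0.08
sigma_y <- 0.076
reps <- 10000

# Each replicate: draw both samples independently and take the mean difference
mean_diff <- replicate(reps, mean(rnorm(N, 0, sigma_x)) - mean(rnorm(M, 0, sigma_y)))

se_sim <- sd(mean_diff)
se_theory <- sqrt(sigma_x^2 / N + sigma_y^2 / M)
c(simulated = se_sim, formula = se_theory)
```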

2(B)

First, we find the standard error of the mean price in each group

GasPrices %>% group_by(Highway) %>%
    summarize(price_mean = mean(Price), sample_size = n(), std_dev = sd(Price))

## # A tibble: 2 × 4
##   Highway price_mean sample_size std_dev
##   <chr>        <dbl>       <int>   <dbl>
## 1 N             1.85          79  0.0807
## 2 Y             1.9           22  0.0759

se_hat_No = 0.08066524/sqrt(79)    # Highway == "N": std_dev / sqrt(sample_size)
se_hat_Yes = 0.07590721/sqrt(22)   # Highway == "Y": std_dev / sqrt(sample_size)

Then, we compute the normal-based confidence interval

se_hat = sqrt(se_hat_Yes^2+se_hat_No^2)
CI_normal_gas = 1.900-1.854303 + c(-1.96, 1.96)*se_hat

Compute the bootstrap CI

boot2 = do(1000)*{
  GasPrices_boot = resample(GasPrices)
  
  table2_boot = GasPrices_boot %>%
    group_by(Highway) %>%
    summarize(mprice = mean(Price)) 
  
  table2_boot$mprice
}
  
CI_boot_gas = confint(boot2[,2]-boot2[,1], level=0.95)

Compare the two CIs

CI_normal_gas

## [1] 0.009330134 0.082063866

CI_boot_gas

##                   2.5%      97.5%
## percentile 0.009004028 0.08064436

The two CIs are almost the same width (about 0.073 for the normal-based interval and 0.072 for the bootstrapped one), but the normal-based one has slightly larger lower and upper bounds than the bootstrapped one. Both intervals exclude 0.

2(C) hypothesis testing

t.test(Price~Highway, data = GasPrices)

## 
##  Welch Two Sample t-test
## 
## data:  Price by Highway
## t = -2.4628, df = 35.344, p-value = 0.0188
## alternative hypothesis: true difference in means between group N and group Y is not equal to 0
## 95 percent confidence interval:
##  -0.083350777 -0.008041628
## sample estimates:
## mean in group N mean in group Y 
##        1.854304        1.900000

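The two-sided p-value is 0.0188, so at the 5% level we reject the null hypothesis that the mean price is the same in the two Highway groups. Note that t.test reports the difference as group N minus group Y, so its confidence interval has the opposite sign from the intervals in 2(B).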
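The Welch statistic, its degrees of freedom, and the p-value can be rebuilt by hand from the group summaries. This is a sketch using the Welch-Satterthwaite approximation, with the standard deviations and sample sizes from 2(B) and the means from the t.test output.

```r
# Welch two-sample t-test rebuilt from the group summaries in 2(B)
n_N <- 79
n_Y <- 22
v_N <- 0.08066524^2 / n_N   # squared se of the mean, Highway == "N"
v_Y <- 0.07590721^2 / n_Y   # squared se of the mean, Highway == "Y"

# Difference is group N minus group Y, matching the t.test output
t_stat <- (1.854304 - 1.900000) / sqrt(v_N + v_Y)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v_N + v_Y)^2 / (v_N^2 / (n_N - 1) + v_Y^2 / (n_Y - 1))

p_two_sided <- 2 * pt(-abs(t_stat), df = df_welch)
c(t = t_stat, df = df_welch, p_value = p_two_sided)
```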

Question 3

3(A)

From the hint, we have \(e^{y}=\sum_{k=0}^{\infty}\frac{y^k}{k!}\), and in particular \(e^{\lambda}=\sum_{k=0}^{\infty}\frac{\lambda^k}{k!}\)

\[\begin{aligned} E(X)&=\sum_{k=0}^{\infty}k e^{-\lambda}\frac{\lambda^k}{k!}\\ &=e^{-\lambda}\sum_{k=1}^{\infty}k\frac{\lambda^k}{k!} \\ &=e^{-\lambda}\sum_{k=1}^{\infty}\frac{\lambda^k}{(k-1)!}\\ &=e^{-\lambda}\lambda\sum_{k=1}^{\infty}\frac{\lambda^{k-1}}{(k-1)!} \\ &=e^{-\lambda}(\lambda e^{\lambda}) =\lambda \end{aligned}\]

\(E(x)=\lambda\)

3(B)

\(var(X_{i})=E(X_{i}^2)-E(X_{i})^2\)

\[\begin{aligned} E(X_{i}^2)&=\sum_{x=0}^{\infty}x^2\frac{\lambda^x}{x!}e^{-\lambda}=\sum_{x=0}^{\infty}(x(x-1)+x)\frac{\lambda^x}{x!}e^{-\lambda}\\ &=\sum_{x=2}^{\infty} x(x-1)\frac{\lambda^x}{x!}e^{-\lambda}+\sum_{x=1}^{\infty} x\frac{\lambda^x}{x!}e^{-\lambda} \\ &=e^{-\lambda}\sum_{x=2}^{\infty}\frac{\lambda^x}{(x-2)!}+e^{-\lambda}\sum_{x=1}^{\infty}\frac{\lambda^x}{(x-1)!}\\ &=e^{-\lambda}\lambda^2\sum_{x=2}^{\infty}\frac{\lambda^{x-2}}{(x-2)!}+e^{-\lambda}\lambda\sum_{x=1}^{\infty}\frac{\lambda^{x-1}}{(x-1)!}\\ &=e^{-\lambda}(\lambda^2 e^{\lambda})+e^{-\lambda}(\lambda e^{\lambda})\\ &=\lambda^2+\lambda \end{aligned}\]

\(var(X_{i})=(\lambda^2+\lambda)-\lambda^2=\lambda\)
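The mean and variance results can be checked numerically by summing over the Poisson pmf. This is a minimal sketch in which `lambda` is an illustrative value and the infinite sums are truncated at 200, where the remaining tail is negligible.

```r
# Numerical check that a Poisson(lambda) variable has mean and variance lambda
lambda <- 3.5            # illustrative value (an assumption)
k <- 0:200               # truncated support; the tail beyond 200 is negligible
pmf <- dpois(k, lambda)

mean_num <- sum(k * pmf)                # E(X)
second_moment <- sum(k^2 * pmf)         # E(X^2), should be lambda^2 + lambda
var_num <- second_moment - mean_num^2   # var(X) = E(X^2) - E(X)^2

c(mean = mean_num, second_moment = second_moment, variance = var_num)
```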

Using the fact that \(\hat{\lambda}=\overline{X}_{n}\), the standard error is

\(s.e.(\hat{\lambda})=\sqrt{\frac{var(X_{i})}{n}}=\sqrt{\frac{\lambda}{n}}\), which we estimate by plugging in \(\hat{\lambda}\): \(\sqrt{\frac{\overline{X}_{n}}{n}}\)

3(C)

\[\begin{aligned} L(\lambda)=\prod_{i = 1}^{n}\frac{\lambda^{x_{i}}}{x_{i}!}e^{-\lambda}=e^{-n\lambda}\frac{\lambda^{\sum_{i = 1}^{n} x_{i}}}{\prod_{i = 1}^{n}x_{i}!} \end{aligned}\]

\[\begin{aligned} \log L(\lambda)&=\log\left(e^{-n\lambda}\frac{\lambda^{\sum_{i = 1}^{n}x_{i}}}{\prod_{i = 1}^{n}x_{i}!}\right)\\ &=-n\lambda+\sum_{i = 1}^{n}x_{i}\log\lambda-\sum_{i = 1}^{n}\log(x_{i}!) \end{aligned}\]

\[\begin{aligned} \frac{d}{d\lambda}\log L(\lambda)&=\frac{d}{d\lambda}\left(-n\lambda+\sum_{i = 1}^{n}x_{i}\log\lambda-\sum_{i = 1}^{n}\log(x_{i}!)\right)\\ &=-n+\frac{\sum_{i = 1}^{n}x_{i}} {\lambda} = 0\\ n&=\frac{\sum_{i = 1}^{n}x_{i}} {\lambda}\\ \hat{\lambda}&=\frac{\sum_{i = 1}^{n}x_{i}} {n}=\overline{X}_{n} \end{aligned}\]
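As a numerical cross-check, the Poisson log-likelihood can be maximised directly and compared with the sample mean. This is a sketch on simulated counts, where the sample size and true rate are illustrative assumptions.

```r
# Maximise the Poisson log-likelihood numerically and compare with the mean
set.seed(1)
x <- rpois(30, lambda = 14)   # simulated counts; n and lambda are illustrative

# Log-likelihood of the whole sample as a function of lambda
log_lik <- function(lambda) sum(dpois(x, lambda, log = TRUE))

fit <- optimize(log_lik, interval = c(0.01, 100), maximum = TRUE, tol = 1e-10)
c(numerical_mle = fit$maximum, sample_mean = mean(x))
```

The numerical maximiser should match `mean(x)` up to the optimiser's tolerance.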

3(D)

\(N=30\), \(\overline{X}=14.0667\), \(var(X)=12.823\)

We want a 95% confidence interval for \(\lambda_{d,h}\), using the formula \(\hat{\lambda}\pm1.96*se(\hat{\lambda})\). From (C), \(\hat{\lambda}=\overline{X}\), so the s.e. of \(\hat{\lambda}\) is the s.e. of the mean, estimated here with the sample variance:

\(s.e.(\hat{\lambda})=s.e.(\overline{X})=\sqrt{\frac{var(X)}{N}}\) \(=\sqrt{\frac{12.823}{30}}=0.6538\)

Confidence Interval: \(14.0667\pm1.96*0.6538\) \(\Rightarrow\) \(14.0667\pm1.2814\)

\(12.7853<\lambda<15.3481\)
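The arithmetic can be reproduced in R from the three summary numbers. The sketch also computes, for comparison only, the interval based on the Poisson plug-in standard error \(\sqrt{\overline{X}/N}\) from 3(B). That second interval is an added illustration and is not used in the answer above.

```r
# 95% CI for lambda from the summary numbers in 3(D)
n <- 30
xbar <- 14.0667
s2 <- 12.823

# Interval used above: se based on the sample variance
se_sample <- sqrt(s2 / n)
ci_sample <- xbar + c(-1.96, 1.96) * se_sample

# Comparison only: Poisson plug-in se, using var(X) = lambda estimated by xbar
se_pois <- sqrt(xbar / n)
ci_pois <- xbar + c(-1.96, 1.96) * se_pois

rbind(sample_variance = ci_sample, poisson_plug_in = ci_pois)
```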

Question 4

4(A)

Let \(Z_{i}=I(X_{i}\le x)\). It takes only the values 0 and 1, so it is a Bernoulli random variable with

\(P(Z_{i}=1)=P(X_{i}\le x)=F(x)\), where \(F\) is the CDF of \(X_{i}\).

Therefore \(E(Z_{i})=F(x)\) and \(var(Z_{i})=F(x)(1-F(x))\).

4(B)

\[\begin{aligned} \hat{F_{n}}(x)&=\frac{1}{n}\sum_{i=1}^{n}Z_{i}\\ E(\hat{F_{n}}(x))&=\frac{1}{n}\sum_{i=1}^{n}E(Z_{i})=\frac{1}{n}\sum_{i=1}^{n}F(x)=F(x)\\ Var(\hat{F_{n}}(x))&=Var\left(\frac{1}{n}\sum_{i=1}^{n}Z_{i}\right)=\frac{1}{n^2}\sum_{i=1}^{n}Var(Z_{i})\\ &=\frac{1}{n^2}*n*F(x)(1-F(x))\\ &=\frac{F(x)(1-F(x))}{n} \end{aligned}\]

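From this, \(n\hat{F_{n}}(x)=\sum_{i=1}^{n}Z_{i}\) is Binomial\((n, F(x))\), so \(\hat{F_{n}}(x)\) is a rescaled Binomial: \(\hat{F_{n}}(x)\)~\(\frac{1}{n}Binomial(n, F(x))\). Because \(\hat{F_{n}}(x)\) is unbiased and its variance goes to 0 as \(n\to\infty\), the empirical CDF converges to the true CDF at each \(x\).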
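The mean and variance of \(\hat{F_{n}}(x)\) can be checked by simulation. This sketch assumes standard normal data, with `n`, `x0`, and `reps` chosen only for illustration.

```r
# Simulation check: the ECDF at a point has mean F(x) and variance F(x)(1-F(x))/n
set.seed(1)
n <- 50        # sample size per replicate (illustrative)
x0 <- 0.5      # point at which the ECDF is evaluated (illustrative)
reps <- 5000

F_true <- pnorm(x0)   # true CDF value, assuming standard normal data

# Each replicate: the ECDF at x0 is the fraction of the sample that is <= x0
F_hat <- replicate(reps, mean(rnorm(n) <= x0))

var_theory <- F_true * (1 - F_true) / n
c(mean_sim = mean(F_hat), F_true = F_true,
  var_sim = var(F_hat), var_theory = var_theory)
```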

4(C)

We use the empirical CDF to estimate F(-0.1), then compute a 95% confidence interval from its standard error and the normal critical value 1.96

x<-pull(stocks_bonds,SP500)
Fn <- ecdf(x)
Fn(-0.1)

## [1] 0.1235955

std_Fn<-sqrt(Fn(-0.1)*(1-Fn(-0.1)))
se_Fn<-std_Fn/sqrt(89)
CI_Fn = Fn(-0.1) + c(-1.96, 1.96)*se_Fn

Finally, we get the confidence interval for F(-0.1)

CI_Fn

## [1] 0.05521777 0.19197324