#Question 1
success = 1
trials = 15
pihatWald <- success/trials
#part A, Wald
SE_Wald <- sqrt((pihatWald*(1-pihatWald))/trials)
WaldCI <- pihatWald + c(-1,1)*qnorm(0.025, lower.tail = FALSE)*SE_Wald
#part B, Modified Wald
#add 0.5 successes and 1 to trials
pihatMod<- (success + 0.5)/(trials + 1)
SE_Mod <- sqrt((pihatMod*(1-pihatMod))/(trials+1))
Mod_WaldCI <- pihatMod + c(-1,1)*qnorm(0.025, lower.tail = FALSE)*SE_Mod
#part C, Logit Transformation
phi_hat <- log((pihatWald)/(1-pihatWald))
I_phi_hat <- trials*pihatWald*(1-pihatWald)
phi_hat_CI <- phi_hat + c(-1,1)*qnorm(0.025, lower.tail = FALSE)*sqrt(1/(I_phi_hat))
logit_CI <- exp(phi_hat_CI)/(1+exp(phi_hat_CI))
#part D, Log-Likelihood
#let's use a function
library(binom)
binom.confint(success, trials, conf.level = 0.95, methods = "lrt")
## method x n mean lower upper
## 1 lrt 1 15 0.06666667 0.003926124 0.2621293
Question 1
The last method is preferred. It avoids the issue some of the other intervals have of having a negative lower bound, which doesn’t make sense for a probability. Likelihood ratio intervals are also scale invariant, and in this case it is the narrowest interval.
Question 2
#Question 2, part b
probability <- dpois(15, 20)
#Question 2, part c
#probability at most 15 -> y less than or equal to 15
probtwo <- ppois(15,20)
z <- (15.5-20)/sqrt(20)
approx <- pnorm(z)
#Question 2, part d
or <- 5
gr <- 2
pur <- 3
prob_multi <- dmultinom(c(or, gr, pur), size = 10, prob = c(8/20, 7/20, 5/20), log= FALSE)
The exact answer to the probability of at most 15 such students is 0.1565131. The normal approximation is 0.1571523. This was found using the ppois and pnorm functions.
The probability that, of 10 randomly chosen students, 5 have orange hair, 2 have green hair, and 3 have purple hair is 0.049392. This is because the conditional distribution of the vector \(X = X_1X_2...\) given the total \(n = X_1 + X_2 +...\) is \(multi(n, \pi)\).
#Question 3
n <-96
AA <- 15
Aa <- 33
aa <- 48
#part A, estimate proportion of dominant genes
#there are 2n total genes
#part B
#Pearson first
prop_est_dom <- (2*AA + Aa)/(2*n)
prop_est_rec <- (2*aa+Aa)/(2*n)
expected <- c(n*(prop_est_dom^2), 2*n*prop_est_dom*prop_est_rec, n*(prop_est_rec^2))
observed <- c(AA, Aa, aa)
chisq <- sum((expected-observed)^2/expected)
#get p-value
#degrees freedom is number of categories minus 1, minus 1 for estimation of pi
pvalue_one <- pchisq(4.66, df= 1, lower.tail=FALSE)
#Likelihood ratio
g_squared <- 2*sum(observed*log(observed/expected))
p_value_two <- pchisq(g_squared, df= 1, lower.tail=FALSE)
#part c
residuals <- (observed-expected)/sqrt(expected)
The estimate of the proportion of dominant genes in the population is 0.328125
For the goodness of fit test, the null hypothesis is that the population follows the Hardy Weinberg proportions. The alternative hypothesis is that the populations does not follow the Hardy Weinberg proportions.
For the Pearson test, the p-value is 0.097, which is greater than 0.05. That this p-value we fail to reject the null hypothesis.
For the likelihood test, the p-value is 0.0328. At this value we reject the null hypothesis. There is sufficient evidence to suggest that this population does not follow the Hardy Weinberg proportions.
The residuals for the test using the Pearson method are 1.451, -1.434, and 0.709. None of these values are larger than \(\sqrt{(k-1)/k}\) plus 2, so the model appears to fit for all cells.