CDA Homework 1

#Question 1

success = 1
trials = 15
pihatWald <- success/trials

#part A, Wald
SE_Wald <- sqrt((pihatWald*(1-pihatWald))/trials)

WaldCI <- pihatWald + c(-1,1)*qnorm(0.025, lower.tail = FALSE)*SE_Wald

#part B, Modified Wald
#add 0.5 successes and 1 to trials
pihatMod<- (success + 0.5)/(trials + 1)

SE_Mod <- sqrt((pihatMod*(1-pihatMod))/(trials+1)) 

Mod_WaldCI <- pihatMod + c(-1,1)*qnorm(0.025, lower.tail = FALSE)*SE_Mod

#part C, Logit Transformation

phi_hat <- log((pihatWald)/(1-pihatWald))

I_phi_hat <- trials*pihatWald*(1-pihatWald)
phi_hat_CI <- phi_hat + c(-1,1)*qnorm(0.025, lower.tail = FALSE)*sqrt(1/(I_phi_hat))
logit_CI <- exp(phi_hat_CI)/(1+exp(phi_hat_CI))

#part D, Log-Likelihood
#let's use a function
library(binom)
binom.confint(success, trials, conf.level = 0.95, methods = "lrt")

##   method x  n       mean       lower     upper
## 1    lrt 1 15 0.06666667 0.003926124 0.2621293

Question 1

The wald confidence interval is -0.059567, 0.1929003
The modified Wald confidence interval is -0.0490729, 0.2365729
The Logit Transformation confidence interval is 0.0093054, 0.3519903
The Log-likelihood confidence interval is 0.003962, 0.26213.

The last method is preferred. It avoids the issue some of the other intervals have of having a negative lower bound, which doesn’t make sense for a probability. Likelihood ratio intervals are also scale invariant, and in this case it is the narrowest interval.

Question 2

The distribution of students with either orange, green, or purple hair is Poisson with a mean of 20. The sum of indepdent Poisson random variables has a Poisson distribution with mean \(\lambda_1\) \(+\) \(\lambda_2\) \(+\)… In this case \(8+7+5 = 20\).

#Question 2, part b

probability <- dpois(15, 20)

The probability of exactly 15 such students is 0.0516489. This is the probability that this random variable, which has a Poisson distribution with parameter \(\lambda = 20\), takes the exact value 15, calculated using the dpois function.

#Question 2, part c

#probability at most 15 -> y less than or equal to 15

probtwo <- ppois(15,20)
z <- (15.5-20)/sqrt(20)
approx <- pnorm(z)

#Question 2, part d

or <- 5
gr <- 2
pur <- 3

prob_multi <- dmultinom(c(or, gr, pur), size = 10, prob = c(8/20, 7/20, 5/20), log= FALSE)

The exact answer to the probability of at most 15 such students is 0.1565131. The normal approximation is 0.1571523. This was found using the ppois and pnorm functions.
The probability that, of 10 randomly chosen students, 5 have orange hair, 2 have green hair, and 3 have purple hair is 0.049392. This is because the conditional distribution of the vector \(X = X_1X_2...\) given the total \(n = X_1 + X_2 +...\) is \(multi(n, \pi)\).

#Question 3
n <-96

AA <- 15
Aa <- 33
aa <- 48

#part A, estimate proportion of dominant genes
#there are 2n total genes

#part B
#Pearson first
prop_est_dom <- (2*AA + Aa)/(2*n)
prop_est_rec <- (2*aa+Aa)/(2*n)

expected <- c(n*(prop_est_dom^2), 2*n*prop_est_dom*prop_est_rec, n*(prop_est_rec^2))
observed <- c(AA, Aa, aa)

chisq <- sum((expected-observed)^2/expected)

#get p-value
#degrees freedom is number of categories minus 1, minus 1 for estimation of pi
pvalue_one <- pchisq(4.66, df= 1, lower.tail=FALSE)

#Likelihood ratio 

g_squared <- 2*sum(observed*log(observed/expected))

p_value_two <- pchisq(g_squared, df= 1, lower.tail=FALSE)

#part c

residuals <- (observed-expected)/sqrt(expected)

The estimate of the proportion of dominant genes in the population is 0.328125
For the goodness of fit test, the null hypothesis is that the population follows the Hardy Weinberg proportions. The alternative hypothesis is that the populations does not follow the Hardy Weinberg proportions.
For the Pearson test, the p-value is 0.097, which is greater than 0.05. That this p-value we fail to reject the null hypothesis.
For the likelihood test, the p-value is 0.0328. At this value we reject the null hypothesis. There is sufficient evidence to suggest that this population does not follow the Hardy Weinberg proportions.
The residuals for the test using the Pearson method are 1.451, -1.434, and 0.709. None of these values are larger than \(\sqrt{(k-1)/k}\) plus 2, so the model appears to fit for all cells.

CDA Homework 1

Holly Michalak

January 20, 2019