Suppose I keep track of the total number of students who attend my office hours each week for 15 weeks. Let xi denote the total number of office hours visitors during the ith week of class. Suppose that ∑xi = 100. Using the (questionable) assumption that the number of visitors each week is independent and identically distributed according to a Poisson distribution with unknown parameter λ, plot the log likelihood function for λ.
Submit your annotated R code and your plot. Make sure your plot has useful axis labels. Make sure your R code is annotated to explain the purpose of each chunk of code.
Hint: the book demonstrates plotting the likelihood for the parameter of a Binomial distribution (Figure 2.2); it may help to try that first.
Hint: check your work. Eyeball the MLE according to your log likelihood plot, calculate by hand the MLE using calculus, and compare the two values to make sure they agree.
#pmf of the Poisson distribution: p(x|lambda) = (exp(-lambda)*lambda^x)/factorial(x)
#The log-likelihood, up to an additive constant, is -15*lambda + (∑xi)*log(lambda)
#Need possible range of lambdas, sum of Xis, and number of weeks.
#function to produce the log-likelihood of each lambda:
LogLikelihood<- function(numWeeks,sumOfXs, lambda){
numWeeks*(-lambda)+sumOfXs*log(lambda)
}
#Choose a minimum lambda, maximum lambda, and step size to create a range of lambdas to check. I chose a step of 0.2 because it gave a close estimate of the true maximizer without making the graph look extremely messy.
lambdaVector<-seq(.5,10, by=0.2)
#Vector of log-likelihoods for the lambdas that we want to check
LLvector<-LogLikelihood(15,100,lambdaVector)
#To find lambda with maximum likelihood:
MaxLikeLambda<-lambdaVector[LLvector == max(LLvector)]
#Using calculus: setting the derivative of the log-likelihood, -15 + 100/lambda, equal to zero gives the MLE lambda = 100/15 ≈ 6.667
#Plot the lambdas against their log-likelihoods and add a line to connect them to visually see where the max could be:
#Blue denotes the real maximum while Red is the maximum produced from the exercise.
plot(lambdaVector,LLvector, xlab="Lambda", ylab="Log-Likelihood", main="Log-Likelihood for Poisson")
lines(lambdaVector, LLvector)
points(MaxLikeLambda,y= LogLikelihood(15,100,MaxLikeLambda), col="red",pch=7)
points(100/15, y=LogLikelihood(15,100,100/15), col="blue", pch=17)
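As an extra check on the eyeballed maximum (a minimal sketch; optimize() is base R, and the search interval c(0.1, 20) is an arbitrary choice wide enough to contain the MLE), the same log-likelihood can be maximized numerically:
#Numerically maximize the log-likelihood and compare with the hand-calculated MLE of 100/15
numCheck <- optimize(function(lam) LogLikelihood(15, 100, lam), interval = c(0.1, 20), maximum = TRUE)
numCheck$maximum  #should be very close to 100/15 = 6.667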
Use the swiss data set, which is included in base R. Fit a model for infant mortality using all the predictors (but no interactions). We want to decide whether to drop Agriculture from the model. Answer this question using: (a) the likelihood ratio test (Wilks test), (b) a Wald test, (c) an F-test (anova), (d) AIC, (e) BIC. Hint: get the log likelihood for a model using the logLik function. For example, fit your model using the lm function, then type logLik(mymod) to find the log likelihood.
#model with all factors
mod0<- lm(Infant.Mortality~Fertility+Agriculture+Examination+Education+Catholic, data=swiss)
#model with all factors except Agriculture
mod1<-lm(Infant.Mortality~Fertility+Examination+Education+Catholic, data=swiss)
A-- Wilks Test
logLik(mod0)
## 'log Lik.' -109.8608 (df=7)
logLik(mod1)
## 'log Lik.' -109.9607 (df=6)
#LRT test statistic = 2*(logLik(bigMod) - logLik(smallMod))
LRTteststat<-2*(-109.8608-(-109.9607))
#chi-squared test with df= difference in dfs of the models
pchisq(LRTteststat, lower.tail= FALSE, df=1)
## [1] 0.6548823
Since the test result is not significant, we can conclude that the smaller model (the one without Agriculture) is adequate; Agriculture can be dropped.
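The same statistic can be computed directly from the fitted models (a small sketch using only base R; it avoids copying the log-likelihood values by hand and should reproduce the p-value above up to rounding):
#LRT statistic extracted straight from the logLik objects
LRTstat <- as.numeric(2*(logLik(mod0) - logLik(mod1)))
pchisq(LRTstat, df = 1, lower.tail = FALSE)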
B-- Wald Test
summary(mod0)
##
## Call:
## lm(formula = Infant.Mortality ~ Fertility + Agriculture + Examination +
## Education + Catholic, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2512 -1.2860 0.1821 1.6914 6.0937
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.667e+00 5.435e+00 1.595 0.11850
## Fertility 1.510e-01 5.351e-02 2.822 0.00734 **
## Agriculture -1.175e-02 2.812e-02 -0.418 0.67827
## Examination 3.695e-02 9.607e-02 0.385 0.70250
## Education 6.099e-02 8.484e-02 0.719 0.47631
## Catholic 6.711e-05 1.454e-02 0.005 0.99634
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.683 on 41 degrees of freedom
## Multiple R-squared: 0.2439, Adjusted R-squared: 0.1517
## F-statistic: 2.645 on 5 and 41 DF, p-value: 0.03665
The test statistic is t = -0.418 with a corresponding p-value of 0.678, which is not significant at any conventional level. With the other variables included, Agriculture does not add to the linear relationship.
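For a single coefficient, the Wald statistic is just the estimate divided by its standard error; a minimal sketch of that calculation from the fitted mod0 (reproducing the summary() values above) is:
#Wald statistic for Agriculture computed by hand: estimate / standard error
est <- coef(mod0)["Agriculture"]
se <- sqrt(diag(vcov(mod0)))["Agriculture"]
est/se  #matches the t value of -0.418
2*pt(abs(est/se), df = df.residual(mod0), lower.tail = FALSE)  #matches the p-value of 0.678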
C-- F-Test (ANOVA)
anova(mod0,mod1)
## Analysis of Variance Table
##
## Model 1: Infant.Mortality ~ Fertility + Agriculture + Examination + Education +
## Catholic
## Model 2: Infant.Mortality ~ Fertility + Examination + Education + Catholic
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 41 295.07
## 2 42 296.32 -1 -1.2563 0.1746 0.6783
Since the F-test is not significant (p-value = 0.6783), including Agriculture is not necessary. Note that for a single coefficient the F statistic is the square of the Wald t statistic ((-0.418)^2 ≈ 0.175), so parts (b) and (c) agree exactly.
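Equivalently (an optional alternative, since the two models differ by a single term), drop1() reports the same single-term F-test for every predictor in the full model at once:
#Single-term deletions from the full model; the Agriculture row reproduces the F-test above
drop1(mod0, test = "F")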
D & E-- AIC & BIC
AIC(mod0)
## [1] 233.7217
AIC(mod1)
## [1] 231.9214
BIC(mod0)
## [1] 246.6727
BIC(mod1)
## [1] 243.0222
Since both the AIC and BIC scores are lower for the model without Agriculture, the added complexity of including this variable is not worth it.
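As a check on where these numbers come from (a sketch using the standard formulas AIC = -2*logLik + 2k and BIC = -2*logLik + k*log(n), where k counts the estimated parameters including the error variance):
#Recompute AIC and BIC for mod0 by hand from its log likelihood
ll <- logLik(mod0)
k <- attr(ll, "df")  #number of estimated parameters (7: intercept, 5 slopes, sigma)
n <- nobs(mod0)
-2*as.numeric(ll) + 2*k  #matches AIC(mod0)
-2*as.numeric(ll) + log(n)*k  #matches BIC(mod0)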
We should not include Agriculture in the model: all five criteria (the likelihood ratio test, Wald test, F-test, AIC, and BIC) favor dropping it.