Discuss the likelihood ratio test, deviance, AIC, and BIC
The likelihood ratio test statistic is \( LR = -2\ln\lambda = -2\ln\left(\frac{\hat{\mathcal{L}}_{R}}{\hat{\mathcal{L}}_{U}}\right) = -2[\ln(\hat{\mathcal{L}}_{R}) - \ln(\hat{\mathcal{L}}_{U})]. \) It is derived from \( \lambda \), the ratio of the maximized likelihood of a restricted model to that of an unrestricted model. To interpret this: the null hypothesis says that the restriction is valid, so smaller values of \( \lambda \) mean there is a large gap between the two likelihoods, and therefore evidence against the restricted model, and thus against the null hypothesis. Under the null, \( LR \) follows a \( \chi^2 \) distribution with degrees of freedom equal to the number of restrictions, so the \( \chi^2 \)-based p-value requires a larger statistic to reach significance as more restrictions, and thus degrees of freedom, are added.
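As an illustrative sketch (not part of the assignment data), the LR test can be computed directly from two nested \( \verb!glm()! \) fits; the object names \( \verb!fit.r! \) (restricted) and \( \verb!fit.u! \) (unrestricted) are hypothetical placeholders.
## Sketch (hypothetical objects): LR test for two nested glm() fits, fit.r (restricted) and fit.u (unrestricted)
lr.stat <- -2 * (as.numeric(logLik(fit.r)) - as.numeric(logLik(fit.u))) # LR statistic
q <- attr(logLik(fit.u), "df") - attr(logLik(fit.r), "df") # Number of restrictions
pchisq(lr.stat, df = q, lower.tail = FALSE) # p-value from the chi-squared distribution with q degrees of freedom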
The deviance is \( D = 2\ln\mathcal{L}(\boldsymbol{y}|\boldsymbol{y}) - 2\ln\mathcal{L}(\boldsymbol{\hat{\theta}}|\boldsymbol{y}), \) and when the saturated model assigns probability 1 to each observation, so that \( \mathcal{L}(\boldsymbol{y}|\boldsymbol{y}) = 1 \) and \( \ln\mathcal{L}(\boldsymbol{y}|\boldsymbol{y}) = 0, \) this reduces to \( D = -2\ln\mathcal{L}(\boldsymbol{\hat{\theta}}|\boldsymbol{y}). \) The deviance is the difference between the fitted model and a saturated (perfect) model, so it is a measure of lack of fit. Therefore, to interpret: the lower the deviance, the better the model fit. Looking at the formula, the higher the maximized likelihood \( \mathcal{L}(\boldsymbol{\hat{\theta}}|\boldsymbol{y}) \), the lower the deviance, so maximizing the likelihood can be viewed as minimizing the deviance. The deviance does not include a penalty for the number of parameters.
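For distributions where the saturated likelihood is not 1 (the Poisson, for example), the deviance is still twice the gap between the saturated and fitted log-likelihoods. A minimal sketch, assuming a hypothetical Poisson \( \verb!glm()! \) fit named \( \verb!fit! \) with response vector \( \verb!y! \):
## Sketch (hypothetical objects): residual deviance of a Poisson glm() fit named 'fit' with response y
ll.sat <- sum(dpois(y, lambda = y, log = TRUE)) # Saturated model: each observation gets its own fitted mean
ll.fit <- as.numeric(logLik(fit)) # Maximized log-likelihood of the fitted model
2 * (ll.sat - ll.fit) # Should match deviance(fit)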
The AIC is derived from the deviance, with a linear penalty for additional parameters: \( AIC = -2\ln\mathcal{L}(\boldsymbol{\hat{\theta}}|\boldsymbol{y}) + 2p = D(\boldsymbol{\hat{\theta}}) + 2p, \) where \( p \) is the number of estimated parameters. The AIC can be interpreted the way deviance is interpreted above; however, it adds a penalty term for model parsimony, meaning any additional parameter must reduce the deviance by more than the 2 it contributes to the \( 2p \) penalty in order to improve the AIC.
Like the AIC, the BIC is derived from the deviance, with a penalty for additional parameters: \( BIC = -2\ln\mathcal{L}(\boldsymbol{\hat{\theta}}|\boldsymbol{y}) + p\ln(n) = D(\boldsymbol{\hat{\theta}}) + p\ln(n), \) where \( p \) is the number of estimated parameters and \( n \) is the number of observations. The BIC can be interpreted similarly to the AIC in that it adds a penalty for model parsimony; however, the BIC penalizes model complexity more heavily because its penalty grows with the sample size as well as the number of parameters (for \( n \geq 8 \), \( \ln(n) > 2 \)).
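As a quick check of these formulas, R's built-in \( \verb!AIC()! \) and \( \verb!BIC()! \) reproduce the same quantities; a minimal sketch, again assuming a hypothetical \( \verb!glm()! \) fit named \( \verb!fit! \):
## Sketch (hypothetical object): AIC and BIC by hand for a glm() fit named 'fit'
p <- attr(logLik(fit), "df") # Number of estimated parameters
-2 * as.numeric(logLik(fit)) + 2 * p # Should match AIC(fit)
-2 * as.numeric(logLik(fit)) + p * log(nobs(fit)) # Should match BIC(fit)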
Penalties are important to avoid kitchen-sink models, in which non-significant predictors are added in an attempt to increase model fit.
In one paragraph, explain GLM on an intuitive level. With the explanation, include the assumptions we make and a discussion of the role of the link function.
The generalized linear model (GLM) is a generalization of classical linear regression that allows the dependent variable to have an error distribution other than the normal distribution. The GLM has a systematic component, the linear predictor: a linear combination of the independent variables and their corresponding beta coefficients. It also has a stochastic component, in which the response (dependent) variable is drawn from a probability mass function (PMF) or probability density function (PDF) in the exponential family. The link function 'links' the systematic and stochastic components by connecting the mean of the response distribution to the linear predictor, i.e., it lets us drop the linear predictor into the distribution we are using for the response. Assumptions we make include: the observations are independent and identically distributed; we have specified the correct error distribution for the response variable; we have specified the correct link function; and the response variable can be explained, on the scale of the link function, by a linear combination of the explanatory variables.
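A small sketch of the link in action, assuming a hypothetical Poisson \( \verb!glm()! \) fit named \( \verb!fit! \) with the canonical log link:
## Sketch (hypothetical object): link scale vs. response scale for a Poisson glm() fit named 'fit'
eta.hat <- predict(fit, type = "link") # Linear predictor (systematic component)
mu.hat <- predict(fit, type = "response") # Fitted mean of the response distribution (stochastic component)
all.equal(mu.hat, exp(eta.hat)) # The inverse link maps the linear predictor onto the mean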
a: Write a function in R that computes the log-likelihood of the Poisson distribution for a set of coefficients \( \beta \) given a dependent variable \( \boldsymbol{y} \) and a matrix of independent variables \( \boldsymbol{x} \).
For a Poisson distribution, \( \boldsymbol{\lambda} = \exp(\boldsymbol{\eta}) \), where \( \boldsymbol{\eta} = \boldsymbol{x\beta}. \)
## Poisson Log-likelihood
ll.pois <- function(par, y, x) {
  lambda <- exp(x %*% par) # Inverse log link: Poisson mean from the linear predictor
  sum(dpois(y, lambda = lambda, log = TRUE)) # Log-likelihood is the sum of the log Poisson densities
}
b: Generate some data and use \( \verb!optim()! \) to maximize the function from part a. There should be an intercept and at least two independent variables in the data. Report the coefficient estimates, standard errors, \( z \)-statistics, and \( p \)-values in a \( \LaTeX \) table.
## Simulate Data ##
set.seed(20599)
n <- 1000 # Sample size
x1 <- runif(n, -1, 1)
x2 <- runif(n, -1, 1)
b0 <- 0.75 # True parameter values
b1 <- 1
b2 <- 0.33
eta <- b0 + b1 * x1 + b2 * x2 # Systematic component of the model
y <- rpois(n, exp(eta)) # Generate response variable
## Maximize the Function ##
start.values <- c(0, 0, 0) # Starting values for the parameters, no constraints on betas
# Use optim()
model.pois <- optim(par = start.values, y = y, x = as.matrix(cbind(1, x1, x2)),
fn = ll.pois, method = "BFGS", hessian = TRUE, control = list(fnscale = -1)) # column of 1's in x for the intercept; fnscale = -1 makes optim() maximize rather than minimize
# Covariance matrix
cov.pois <- solve(-model.pois$hessian) # Covariance is inverse of Fisher Information, SE are sqrt(main diagonal of covariance matrix)
## Results! ##
results.pois <- cbind(model.pois$par, sqrt(diag(cov.pois)), model.pois$par/sqrt(diag(cov.pois)),
2 * pt(-abs(model.pois$par/sqrt(diag(cov.pois))), df = n - length(model.pois$par))) # Estimates, SEs, z-statistics, and two-sided p-values
colnames(results.pois) <- c("Estimate", "SE", "z-statistic", "p-value")
rownames(results.pois) <- c("Intercept", "x1", "x2")
# Generate LaTeX table
library(xtable)
table.pois <- xtable(results.pois, digits = 4) # Round to 4 decimals to match glm() results
print(table.pois, floating = F)
## % latex table generated in R 3.0.1 by xtable 1.7-1 package
## % Thu Oct 10 17:16:50 2013
## \begin{tabular}{rrrrr}
## \hline
## & Estimate & SE & z-statistic & p-value \\
## \hline
## Intercept & 0.8135 & 0.0225 & 36.1991 & 0.0000 \\
## x1 & 0.9496 & 0.0362 & 26.2369 & 0.0000 \\
## x2 & 0.3341 & 0.0343 & 9.7534 & 0.0000 \\
## \hline
## \end{tabular}
See \( \LaTeX \) document for \( \verb!optim()! \)-generated results table.
c: Estimate the model with the \( \verb!glm()! \) function and report the coefficient estimates, standard errors, \( z \)-statistics, and \( p \)-values in another \( \LaTeX \) table. How close are the results from \( \verb!glm()! \) to the results from part b?
## Estimate parameters using glm()
model.check <- glm(y ~ x1 + x2, family = "poisson")
# Compare the results
summary(model.check)
##
## Call:
## glm(formula = y ~ x1 + x2, family = "poisson")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.694 -0.927 -0.117 0.610 3.814
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.8136 0.0225 36.20 <2e-16 ***
## x1 0.9496 0.0362 26.24 <2e-16 ***
## x2 0.3340 0.0343 9.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2005.1 on 999 degrees of freedom
## Residual deviance: 1139.1 on 997 degrees of freedom
## AIC: 3525
##
## Number of Fisher Scoring iterations: 5
signif(results.pois, digits = 5)
## Estimate SE z-statistic p-value
## Intercept 0.8135 0.02247 36.199 5.672e-183
## x1 0.9496 0.03619 26.237 6.729e-115
## x2 0.3341 0.03425 9.753 7.093e-21
glm.table <- xtable(model.check)
print(glm.table, floating = F)
## % latex table generated in R 3.0.1 by xtable 1.7-1 package
## % Thu Oct 10 17:16:50 2013
## \begin{tabular}{rrrrr}
## \hline
## & Estimate & Std. Error & z value & Pr($>$$|$z$|$) \\
## \hline
## (Intercept) & 0.8136 & 0.0225 & 36.20 & 0.0000 \\
## x1 & 0.9496 & 0.0362 & 26.24 & 0.0000 \\
## x2 & 0.3340 & 0.0343 & 9.75 & 0.0000 \\
## \hline
## \end{tabular}
See \( \LaTeX \) document for \( \verb!glm()! \)-generated results table and comparison of the results generated from parts b and c.
d: Maximize the function for a different model which includes random noise as two new independent variables. In other words, create two new variables with \( \verb!runif()! \) or \( \verb!rnorm()! \) and include them in the estimation, but not in the true data-generating process (DGP). Compare this model with the model from part b using the likelihood ratio test, AIC, and BIC. Which model fits the data better and why?
## Create Unrestricted Model ##
# Create two variables that are random noise
set.seed(12344)
x3 <- runif(n, -1, 1)
x4 <- runif(n, -1, 1)
start.values2 <- c(0, 0, 0, 0, 0) # Starting values for intercept and 4 independent variables
# Use optim()
model.pois2 <- optim(par = start.values2, y = y, x = as.matrix(cbind(1, x1,
x2, x3, x4)), fn = ll.pois, method = "BFGS", hessian = TRUE, control = list(fnscale = -1)) # column of 1's in x for the intercept; fnscale = -1 makes optim() maximize rather than minimize
# Covariance matrix
cov.pois2 <- solve(-model.pois2$hessian) # Covariance is inverse of Fisher Information, SE are sqrt(main diagonal of covariance matrix)
## Results! ##
results.pois2 <- cbind(model.pois2$par, sqrt(diag(cov.pois2)), model.pois2$par/sqrt(diag(cov.pois2)),
2 * pt(-abs(model.pois2$par/sqrt(diag(cov.pois2))), df = n - length(model.pois2$par))) # Estimates, SEs, z-statistics, and two-sided p-values
colnames(results.pois2) <- c("Estimate", "SE", "z-statistic", "p-value")
rownames(results.pois2) <- c("Intercept", "x1", "x2", "x3", "x4")
signif(results.pois2, digits = 5)
## Estimate SE z-statistic p-value
## Intercept 0.81365 0.02247 36.2090 6.389e-183
## x1 0.94622 0.03629 26.0720 9.714e-114
## x2 0.33454 0.03426 9.7658 6.381e-21
## x3 -0.01784 0.03396 -0.5252 3.474e-01
## x4 0.03301 0.03421 0.9650 2.503e-01
# Generate LaTeX table
table.pois2 <- xtable(results.pois2, digits = 4)
print(table.pois2, floating = F)
## % latex table generated in R 3.0.1 by xtable 1.7-1 package
## % Thu Oct 10 17:16:50 2013
## \begin{tabular}{rrrrr}
## \hline
## & Estimate & SE & z-statistic & p-value \\
## \hline
## Intercept & 0.8136 & 0.0225 & 36.2088 & 0.0000 \\
## x1 & 0.9462 & 0.0363 & 26.0723 & 0.0000 \\
## x2 & 0.3345 & 0.0343 & 9.7658 & 0.0000 \\
## x3 & -0.0178 & 0.0340 & -0.5252 & 0.3474 \\
## x4 & 0.0330 & 0.0342 & 0.9650 & 0.2503 \\
## \hline
## \end{tabular}
See \( \LaTeX \) document for results table for the unrestricted model.
# Likelihood ratio test
lr <- -2 * (model.pois$value - model.pois2$value) # log-likelihood of restricted model versus that of unrestricted model
lr
## [1] 1.257
p.value <- 1 - pchisq(lr, df = 2) # df = 2 because the restricted model drops 2 parameters, forcing those two betas to 0; the null hypothesis is that the restrictions are valid
p.value # p-value is not statistically significant, so we fail to reject the null hypothesis and should not include beta_3 and beta_4
## [1] 0.5334
# AIC and BIC
aic1 <- -2 * (model.pois$value) + 2 * length(model.pois$par)
aic2 <- -2 * (model.pois2$value) + 2 * length(model.pois2$par)
bic1 <- -2 * (model.pois$value) + length(model.pois$par) * log(n)
bic2 <- -2 * (model.pois2$value) + length(model.pois2$par) * log(n)
compare <- matrix(data = cbind(aic1, aic2, bic1, bic2), nrow = 2, ncol = 2,
byrow = T)
colnames(compare) <- c("Restricted", "Unrestricted")
rownames(compare) <- c("AIC", "BIC")
compare # Restricted model has lower AIC and BIC values, reinforcing LR test above
## Restricted Unrestricted
## AIC 3525 3528
## BIC 3540 3553
All three comparisons above (the likelihood ratio test, AIC, and BIC) show that the restricted model (with two independent variables) fits better than the unrestricted model (with four independent variables). This is consistent with the fact that we know the extra two independent variables are random noise and did not contribute to the data-generating process.
Even though this is not required in the assignment, the \( \LaTeX \) table for the unrestricted model generated above also supports the LR, AIC, and BIC comparisons. The null hypothesis for each coefficient is that it equals zero, i.e., that the variable contributes nothing to explaining the response variable. Since the p-values for \( \boldsymbol{x_3} \) and \( \boldsymbol{x_4} \) are not statistically significant, we fail to reject that null, so we should not include these variables, leading to the conclusion that the restricted model (with two variables) is a better fit. Note, however, that we should not rely on individual p-values alone for model selection, because random-noise independent variables can occasionally appear significant by chance; use the model comparison criteria above instead.