(without random effects)
With normally distributed errors, the likelihood can be expressed explicitly as the product of the densities of the \(n\) independent normal observations.
The negative log-likelihood \(\ell=-\log L\) can then be written as \[ \begin{aligned} \ell\left(\left(y_{1}, \ldots, y_{n}\right), \mu, \sigma^{2}\right) &=-\log \left[\prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \sigma^{2}}} \exp \left(-\frac{\left(y_{i}-\mu\right)^{2}}{2 \sigma^{2}}\right)\right] \\ &=-\sum_{i=1}^{n}\left[\log \left(\frac{1}{\sqrt{2 \pi \sigma^{2}}}\right)-\frac{\left(y_{i}-\mu\right)^{2}}{2 \sigma^{2}}\right] \end{aligned} \]
In matrix form,
\[ \begin{aligned} \ell(\mathbf{y}, \beta, \gamma) &=\frac{1}{2}\left\{n \log (2 \pi)+\log \left|\sigma^{2} \mathbf{I}\right|+(\mathbf{y}-\mathbf{X} \beta)^{\prime}\left(\sigma^{2} \mathbf{I}\right)^{-1}(\mathbf{y}-\mathbf{X} \beta)\right\} \\ &=\frac{1}{2}\left\{n \log (2 \pi)+\log \left(\prod_{i=1}^{n} \sigma^{2}\right)+(\mathbf{y}-\mathbf{X} \beta)^{\prime}(\mathbf{y}-\mathbf{X} \beta) / \sigma^{2}\right\} \\ &=\frac{1}{2}\left\{n \log (2 \pi)+n \log \left(\sigma^{2}\right)+(\mathbf{y}-\mu)^{\prime}(\mathbf{y}-\mu) / \sigma^{2}\right\} \\ &=\frac{1}{2}\left\{n \log (2 \pi)+n \log \left(\sigma^{2}\right)+\sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2} / \sigma^{2}\right\} \end{aligned} \] where \(\gamma\) denotes the variance parameters contained in \(\sigma^{2} \mathbf{I}\) and \(\mu=\mathbf{X} \beta\).
Taking the derivatives of the negative log-likelihood function with respect to \(\mu\) and \(\sigma^{2}\):
\[ \begin{aligned} \frac{\partial \ell\left(\mu, \sigma^{2}\right)}{\partial \mu} &=\frac{1}{2}\left[\sum_{i=1}^{n}(-2)\left(y_{i}-\mu\right) / \sigma^{2}\right] \\ &=\left(n \mu-\sum_{i=1}^{n} y_{i}\right) / \sigma^{2}=0 \end{aligned} \]
\[ \frac{\partial \ell\left(\mu, \sigma^{2}\right)}{\partial \sigma^{2}}=\frac{1}{2}\left[\frac{n}{\sigma^{2}}-\frac{\sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}}{\left(\sigma^{2}\right)^{2}}\right]=0 \]
Setting the derivatives equal to zero and solving for the parameters gives
\[ \begin{aligned} \hat{\mu} &=\bar{y} \\ \hat{\sigma}^{2} &=\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\hat{\mu}\right)^{2} \\ &=\frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2} \end{aligned} \]
For REML (restricted maximum likelihood), however, the divisor is \(n-1\):
\[
\begin{aligned}
\hat{\sigma}^{2} &=\frac{1}{n-1}
\sum_{i=1}^{n}\left(y_{i}-\hat{\mu}\right)^{2} \\
&=\frac{1}{n-1} \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}
\end{aligned}
\]
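As a quick numerical check (a minimal sketch, not part of the demonstration below; the names y0 and n0 are illustrative only), the closed-form estimators can be computed directly in R:
set.seed(1)
y0 <- rnorm(50, mean = 5, sd = 2)     # simulated normal sample
n0 <- length(y0)
mean(y0)                              # ML estimate of mu
sum((y0 - mean(y0))^2)/n0             # ML estimate of sigma^2: divides by n
var(y0)                               # var() divides by n - 1, the REML-type estimate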
R demonstration
set.seed(123)
n <- 20000
x <- rnorm(n, 2, sqrt(2))
s <- rnorm(n, 0, 0.8)
y <- 1.5+x*3+s
mydata <- data.frame(y,x)
lmfit <- lm(y~., data=mydata)
# logLik(lmfit)
coefficients(lmfit)
## (Intercept) x
## 1.487701 3.003695
(summary(lmfit)$sigma**2)
## [1] 0.6397411
Negative log-likelihood function, written using vector and matrix notation:
## Using the mathematical expression:
minusloglik <- function(param){
beta <- param[-1] #Regression Coefficients
sigma <- param[1] #Variance
y <- as.vector(mydata$y) #DV
x <- cbind(1, mydata$x) #IV
mu <- x%*%beta #multiply matrices
0.5*(n*log(2*pi) + n*log(sigma) + sum((y-mu)^2)/sigma)
}
MLoptimize <- optim(c(1, 1, 1), minusloglik)
## The results:
MLoptimize$par
## [1] 0.6397201 1.4876186 3.0038675
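As a sanity check (this verification is an addition, not part of the original code), evaluating minusloglik at the lm estimates with the ML variance estimate RSS/n should reproduce the negative of logLik(lmfit), since logLik() for an lm object is also based on the ML variance estimate:
sigma2_ml <- sum(residuals(lmfit)^2)/n          # ML variance estimate (divides by n)
minusloglik(c(sigma2_ml, coefficients(lmfit)))
-logLik(lmfit)                                  # the two values should agree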
# Maximum likelihood estimation using the maxLik package
library(maxLik)
ols.lf <- function(param) {
beta <- param[-1] #Regression Coefficients
sigma <- param[1] #Variance
y <- as.vector(mydata$y) #DV
x <- cbind(1, mydata$x) #IV
mu <- x%*%beta #multiply matrices
sum(dnorm(y, mu, sqrt(sigma), log = TRUE)) # sum of normal log-densities (observations, mean, sd) = log-likelihood
}
mle_ols <- maxLik(logLik = ols.lf, start = c(sigma = 1, beta1 = 1, beta2 = 1 ))
summary(mle_ols)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 11 iterations
## Return code 8: successive function values within relative tolerance limit (reltol)
## Log-Likelihood: -23910.85
## 3 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## sigma 0.639677 0.006396 100.0 <2e-16 ***
## beta1 1.487701 0.009768 152.3 <2e-16 ***
## beta2 3.003695 0.003999 751.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
ols.lf <- function(param) {
beta <- param[-1] #Regression Coefficients
sigma <- param[1] #Variance
y <- as.vector(mtcars$mpg) #DV
x <- cbind(1, mtcars$cyl, mtcars$disp) #IV
mu <- x%*%beta #multiply matrices
sum(dnorm(y, mu, sqrt(sigma), log = TRUE)) # sum of normal log-densities (observations, mean, sd) = log-likelihood
}
mle_ols <- maxLik(logLik = ols.lf, start = c(sigma = 1, beta1 = 1, beta2 = 1, beta3=1))
summary(mle_ols)
## --------------------------------------------
## Maximum Likelihood estimation
## Newton-Raphson maximisation, 28 iterations
## Return code 2: successive function values within tolerance limit (tol)
## Log-Likelihood: -79.57282
## 4 free parameters
## Estimates:
## Estimate Std. error t value Pr(> t)
## sigma 8.460632 2.037359 4.153 3.29e-05 ***
## beta1 34.661013 2.395871 14.467 < 2e-16 ***
## beta2 -1.587281 0.673675 -2.356 0.0185 *
## beta3 -0.020584 0.009757 -2.110 0.0349 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## --------------------------------------------
#Checking against linear regression
lmfit2 <- lm(mpg ~ cyl + disp, data = mtcars)
lmfit2
##
## Call:
## lm(formula = mpg ~ cyl + disp, data = mtcars)
##
## Coefficients:
## (Intercept) cyl disp
## 34.66099 -1.58728 -0.02058
(summary(lmfit2)$sigma**2)
## [1] 9.335872
Note that this differs from the ML variance estimate above (8.46): summary(lm)$sigma^2 divides the residual sum of squares by the residual degrees of freedom \(n-p\) (here \(32-3=29\)) rather than by \(n\), analogous to the REML adjustment discussed earlier.
Most classical confidence intervals for parameters are based on the likelihood approach via the Wald interval, which relies on the asymptotic normality of the maximum likelihood estimator:
\[ \hat{\theta}_{i} \pm z_{1-\alpha / 2} S E_{\hat{\theta}_{i}} \]
where the standard error comes from the second derivative of the log-likelihood function; with more than one model parameter this is the Hessian matrix, i.e. the (observed) information matrix.
\[ I(\theta)=\ell^{\prime \prime}(\theta) \]
e.g., for the simple model above with the two parameters \(\mu\) and \(\sigma^{2}\).
Here are the four entries of the \(2 \times 2\) Hessian matrix: \[ \frac{\partial^{2} \ell\left(\mu, \sigma^{2}\right)}{\partial \mu^{2}}=\frac{n}{\sigma^{2}} \] \[ \frac{\partial^{2} \ell\left(\mu, \sigma^{2}\right)}{\partial\left(\sigma^{2}\right)^{2}}=-\frac{n}{2 \sigma^{4}}+\frac{\sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}}{\sigma^{6}}, \quad \text{which equals } \frac{n}{2 \hat{\sigma}^{4}} \text{ at the MLE} \]
\[ \frac{\partial^{2} \ell\left(\mu, \sigma^{2}\right)}{\partial \mu \partial\left(\sigma^{2}\right)}=\frac{\partial^{2} \ell\left(\mu, \sigma^{2}\right)}{\partial\left(\sigma^{2}\right) \partial \mu}=-\frac{n \mu-\sum_{i=1}^{n} y_{i}}{\sigma^{4}}=0 \quad \text{at } \mu=\hat{\mu}=\bar{y} \]
Therefore \[ S E_{\hat{\theta}_{i}}=\sqrt{\left(I(\hat{\theta})^{-1}\right)_{i i}} \]
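Numerically, this can be checked against the earlier maxLik fit: the reported standard errors are the square roots of the diagonal of the inverse observed information matrix (the negative Hessian of the log-likelihood at the maximum). A minimal sketch, assuming the maxLik object mle_ols from above is still available and using the hessian() and vcov() extractors provided by maxLik:
obs_info <- -hessian(mle_ols)   # observed information: negative Hessian at the maximum
sqrt(diag(solve(obs_info)))     # should reproduce the Std. error column above
sqrt(diag(vcov(mle_ols)))       # equivalent, via the variance-covariance extractor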
For the simple model above, the information matrix is diagonal, so its inverse is obtained by inverting each diagonal element.
For \(\mu\): \[
S E_{\hat{\mu}}=\sqrt{\left(I\left(\hat{\mu},
\hat{\sigma}^{2}\right)^{-1}\right)_{11}}=\sqrt{\frac{\hat{\sigma}^{2}}{n}}
\]
Thus, the Wald confidence interval for the mean would be
\[
\hat{\mu} \pm z_{1-\alpha / 2} \sqrt{\frac{\hat{\sigma}^{2}}{n}}
\]
For \(\sigma^{2}\): \[ S E_{\hat{\sigma}^{2}}=\sqrt{\left(I\left(\hat{\mu}, \hat{\sigma}^{2}\right)^{-1}\right)_{22}}=\sqrt{\frac{2\left(\hat{\sigma}^{2}\right)^{2}}{n}} \]
Thus, the Wald confidence interval for the variance would be
\[
\hat{\sigma}^{2} \pm z_{1-\alpha / 2} \hat{\sigma}^{2}
\sqrt{\frac{2}{n}}
\]
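As an illustration (a minimal sketch on a freshly simulated normal sample; y0, n0 and the 95% level are assumptions for the example), both Wald intervals can be computed directly:
set.seed(2)
y0 <- rnorm(100, mean = 5, sd = 2)
n0 <- length(y0)
mu_hat     <- mean(y0)
sigma2_hat <- sum((y0 - mu_hat)^2)/n0                 # ML estimate of the variance
z <- qnorm(0.975)                                     # z_{1 - alpha/2} for alpha = 0.05
mu_hat     + c(-1, 1) * z * sqrt(sigma2_hat/n0)       # Wald CI for the mean
sigma2_hat + c(-1, 1) * z * sigma2_hat * sqrt(2/n0)   # Wald CI for the variance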
However, the likelihood ratio (profile likelihood) confidence interval below will generally have much better small-sample properties than the Wald interval.
\[\left\{\theta \mid \frac{L(\theta)}{L(\hat{\theta})}>\exp (-3.84 / 2)\right\}\]
When there are several parameters, the likelihood for a single parameter of interest can be profiled by maximizing the likelihood function with respect to all the other parameters.
In the following equation, \(\sigma^{2}\) is replaced by its ML estimate expressed as a function of \(\mu\):
\[ \begin{aligned} L_{p}(\mu) &=L\left(\mu, \frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}\right) \\ &=\prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}}} \exp \left(-\frac{\left(y_{i}-\mu\right)^{2}}{2 \frac{1}{n} \sum_{i=1}^{n}\left(y_{i}-\mu\right)^{2}}\right) \end{aligned} \]
Similarly, the profile likelihood for the variance \(\sigma^{2}\) can be expressed \[ \begin{aligned} L_{p}\left(\sigma^{2}\right) &=L\left(\hat{\mu}\left(\sigma^{2}\right), \sigma^{2}\right) \\ &=L\left(\bar{y}, \sigma^{2}\right) \end{aligned} \]
This becomes particularly simple, as the estimate of \(\mu\) does not depend on \(\sigma^{2}\).
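A minimal sketch (illustrative names; y0 is a freshly simulated sample) of the profile likelihood interval for \(\mu\), using the cutoff \(\exp(-3.84 / 2)\) from above:
set.seed(3)
y0 <- rnorm(100, mean = 5, sd = 2)
n0 <- length(y0)
prof_loglik <- function(mu) {
  s2 <- sum((y0 - mu)^2)/n0                   # sigma^2 profiled out at its ML value given mu
  sum(dnorm(y0, mu, sqrt(s2), log = TRUE))
}
mu_grid <- seq(mean(y0) - 1, mean(y0) + 1, length.out = 1000)
ratio   <- exp(sapply(mu_grid, prof_loglik) - prof_loglik(mean(y0)))
range(mu_grid[ratio > exp(-3.84/2)])          # approximate profile likelihood CI for mu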