Notes 5

Likelihood Estimation for Regression Parameters

Loading packages

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GLMsData)
library(cowplot) #for plot_grid
## Warning: package 'cowplot' was built under R version 4.3.2
## 
## Attaching package: 'cowplot'
## 
## The following object is masked from 'package:lubridate':
## 
##     stamp
library(metR) #for geom_contour2
## Warning: package 'metR' was built under R version 4.3.3
## 
## Attaching package: 'metR'
## 
## The following object is masked from 'package:purrr':
## 
##     cross

quilpie data cont.

Example 4.10, p. 180

We now consider the relationship between whether the total July rainfall exceeds 10 mm and the southern oscillation index (SOI), the standardized difference between the air pressures at Darwin and Tahiti.

data(quilpie)
head(quilpie, 5)
##   Year Rain   SOI Phase Exceed y
## 1 1921 38.4   2.7     2    Yes 1
## 2 1922  0.0   2.0     5     No 0
## 3 1923  0.0 -10.7     3     No 0
## 4 1924 24.4   6.9     2    Yes 1
## 5 1925  0.0 -12.5     3     No 0

Plots indicate that a higher SOI is associated with more often exceeding the rain threshold:

boxplot <- ggplot(
  data = quilpie, 
  mapping = aes(x = SOI, y = Exceed)
) + 
  geom_boxplot() + 
  labs(y = "July rainfall exceeds 10 mm", 
       x = "July average SOI")

scatterplot <- ggplot(
  data = quilpie, 
  mapping = aes(x = SOI, y = Exceed)
) + 
  geom_jitter(height = .1) + 
  labs(y = "July rainfall exceeds 10 mm", 
       x = "July average SOI")

plot_grid(boxplot, scatterplot)

This indicates using a generalized linear model with the SOI predicting whether the rainfall exceeds 10 mm.

The binary response suggests a logistic regression model with

  • \(y \sim Binomial(p, 1)\)

  • \(\log\frac{p}{1-p} = \beta_0 + \beta_1 SOI\)


Review:

  1. What are the two components of a generalized linear model called? First: Random component, Second: systematic component

  2. What is name of the function of \(p\), \(\log\frac{p}{1-p}\),

    1. In general for GLMs: link function

    2. This specific function: logit function

    3. So we call it the: logit link function


Solving for \(p\) in the systematic component gives

\[p = \frac{\exp[\beta_0 + \beta_1 SOI]}{1 + \exp[\beta_0 + \beta_1 SOI]}\],

which guarantees that \(p\) is between 0 and 1

Example 4.15, p. 187

So now we have 2 parameters to estimate (\(\beta_0\) and \(\beta_1\)) which makes the probability for each year depend on the SOI.

Maximum likelihood estimation means finding the values of \(\beta_0\) and \(\beta_1\) that maximize the loglikelihood function. In this case, we can plot the loglikelihood function as contour plot over a 2-D grid of (\(\beta_0, \beta_1\)) values.

Log-likelihood function

The function below calculates the loglikehood function for specific (\(\beta_0, \beta_1\)) values

LL <- function(beta0, beta1)
{
  p <- exp(beta0 + beta1 *quilpie$SOI) / (1 + exp(beta0 + beta1 *quilpie$SOI))
  dbinom(quilpie$y, size=1, prob=p, log = TRUE) |> 
    sum()
}

LL_v <- Vectorize(LL)

Note that the function LL() is only defined to work with a single (\(\beta_0, \beta_1\)) pair. Otherwise, R will get confused about whether it is summing over rows of the dataset or parameter values. Using Vectorize allows it to work with vectors of (\(\beta_0, \beta_1\)) pairs.

Making a grid

Nesting the function seq() within rep() can be used to make a grid of values. For instance, suppose we want to make a grid of \((x,y)\) values where each of x and y ranges from 1 to 4.

tibble(
  x = seq(from=1, to=4, by=1) |> rep(each = 4), 
  y = seq(from=1, to=4, by=1) |> rep(times = 4)
)
## # A tibble: 16 × 2
##        x     y
##    <dbl> <dbl>
##  1     1     1
##  2     1     2
##  3     1     3
##  4     1     4
##  5     2     1
##  6     2     2
##  7     2     3
##  8     2     4
##  9     3     1
## 10     3     2
## 11     3     3
## 12     3     4
## 13     4     1
## 14     4     2
## 15     4     3
## 16     4     4

What do the following arguments in rep() do?

  • each: repeats each entry 4 times before moving to the next

  • times: repeats the entire sequence

Contour plot of loglikelihood function

Evaluating the loglikelihood function at each \((\beta_0, \beta_1)\) value and making a contour plot (Compare with Figure 4.7, p. 188)

tibble(
  beta0 = seq(from=-0.5, to=0.5, length=101) |> rep(each = 101), 
  beta1 = seq(from=0, to=0.25, length=101) |> rep(times = 101), 
  loglikelihood = LL_v(beta0, beta1)
) |> 
  ggplot(
    mapping = aes(x = beta0, y = beta1, z = loglikelihood)
  ) + 
  metR::geom_contour2(
    aes(label = after_stat(level)), 
    breaks = c(-48, -46, -44, -42, -40, -39, -38.5, -38), skip=0
  )

What values of \((\beta_0, \beta_1)\) maximize the loglikelihood? beta_0 = .1, beta_1 = .15

The book talks about the Fisher scoring algorithm which takes steps toward maximizing the loglikelihood. (This is beyond the scope of the course.)

Figure 4.6, p. 187

ggplot(
  data = quilpie, 
  mapping = aes(x = SOI, y = y)
) + 
  geom_jitter(height = .02) + 
  labs(y = "Prob of rainfall exceeding threshold", 
       x = "July average SOI") + 
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") + 
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, color = "black")

This plot gives two fitted lines to the data:

  • solid: from fitted logstic regression model:

\[p = \frac{\exp[.1 + .15 SOI]}{1 + \exp[.1 + .15 SOI]}\]

  • dashed: from ordinary least squares regression (assuming normal distribution)

One of the lines has a clear problem with it. What is it? The normal based line (dashed) goes outside the possible range of probabilities of 0 - 1.