Notes 5
Likelihood Estimation for Regression Parameters
Loading packages
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'cowplot' was built under R version 4.3.2
##
## Attaching package: 'cowplot'
##
## The following object is masked from 'package:lubridate':
##
## stamp
## Warning: package 'metR' was built under R version 4.3.3
##
## Attaching package: 'metR'
##
## The following object is masked from 'package:purrr':
##
## cross
quilpie
data cont.
Example 4.10, p. 180
We now consider the relationship between whether the total July rainfall exceeds 10 mm and the southern oscillation index (SOI), the standardized difference between the air pressures at Darwin and Tahiti.
## Year Rain SOI Phase Exceed y
## 1 1921 38.4 2.7 2 Yes 1
## 2 1922 0.0 2.0 5 No 0
## 3 1923 0.0 -10.7 3 No 0
## 4 1924 24.4 6.9 2 Yes 1
## 5 1925 0.0 -12.5 3 No 0
Plots indicate that a higher SOI is associated with more often exceeding the rain threshold:
boxplot <- ggplot(
data = quilpie,
mapping = aes(x = SOI, y = Exceed)
) +
geom_boxplot() +
labs(y = "July rainfall exceeds 10 mm",
x = "July average SOI")
scatterplot <- ggplot(
data = quilpie,
mapping = aes(x = SOI, y = Exceed)
) +
geom_jitter(height = .1) +
labs(y = "July rainfall exceeds 10 mm",
x = "July average SOI")
plot_grid(boxplot, scatterplot)
This indicates using a generalized linear model with the SOI predicting whether the rainfall exceeds 10 mm.
The binary response suggests a logistic regression model with
\(y \sim Binomial(p, 1)\)
\(\log\frac{p}{1-p} = \beta_0 + \beta_1 SOI\)
Review:
What are the two components of a generalized linear model called? First: Random component, Second: systematic component
What is name of the function of \(p\), \(\log\frac{p}{1-p}\),
In general for GLMs: link function
This specific function: logit function
So we call it the: logit link function
Solving for \(p\) in the systematic component gives
\[p = \frac{\exp[\beta_0 + \beta_1 SOI]}{1 + \exp[\beta_0 + \beta_1 SOI]}\],
which guarantees that \(p\) is between 0 and 1
Example 4.15, p. 187
So now we have 2 parameters to estimate (\(\beta_0\) and \(\beta_1\)) which makes the probability for each year depend on the SOI.
Maximum likelihood estimation means finding the values of \(\beta_0\) and \(\beta_1\) that maximize the loglikelihood function. In this case, we can plot the loglikelihood function as contour plot over a 2-D grid of (\(\beta_0, \beta_1\)) values.
Log-likelihood function
The function below calculates the loglikehood function for specific (\(\beta_0, \beta_1\)) values
LL <- function(beta0, beta1)
{
p <- exp(beta0 + beta1 *quilpie$SOI) / (1 + exp(beta0 + beta1 *quilpie$SOI))
dbinom(quilpie$y, size=1, prob=p, log = TRUE) |>
sum()
}
LL_v <- Vectorize(LL)
Note that the function LL()
is only defined to work with
a single (\(\beta_0, \beta_1\)) pair.
Otherwise, R will get confused about whether it is summing over rows of
the dataset or parameter values. Using Vectorize
allows it
to work with vectors of (\(\beta_0,
\beta_1\)) pairs.
Making a grid
Nesting the function seq()
within rep()
can
be used to make a grid of values. For instance, suppose we want to make
a grid of \((x,y)\) values where each
of x and y ranges from 1 to 4.
tibble(
x = seq(from=1, to=4, by=1) |> rep(each = 4),
y = seq(from=1, to=4, by=1) |> rep(times = 4)
)
## # A tibble: 16 × 2
## x y
## <dbl> <dbl>
## 1 1 1
## 2 1 2
## 3 1 3
## 4 1 4
## 5 2 1
## 6 2 2
## 7 2 3
## 8 2 4
## 9 3 1
## 10 3 2
## 11 3 3
## 12 3 4
## 13 4 1
## 14 4 2
## 15 4 3
## 16 4 4
What do the following arguments in rep()
do?
each
: repeats each entry 4 times before moving to the nexttimes
: repeats the entire sequence
Contour plot of loglikelihood function
Evaluating the loglikelihood function at each \((\beta_0, \beta_1)\) value and making a contour plot (Compare with Figure 4.7, p. 188)
tibble(
beta0 = seq(from=-0.5, to=0.5, length=101) |> rep(each = 101),
beta1 = seq(from=0, to=0.25, length=101) |> rep(times = 101),
loglikelihood = LL_v(beta0, beta1)
) |>
ggplot(
mapping = aes(x = beta0, y = beta1, z = loglikelihood)
) +
metR::geom_contour2(
aes(label = after_stat(level)),
breaks = c(-48, -46, -44, -42, -40, -39, -38.5, -38), skip=0
)
What values of \((\beta_0, \beta_1)\) maximize the loglikelihood? beta_0 = .1, beta_1 = .15
The book talks about the Fisher scoring algorithm which takes steps toward maximizing the loglikelihood. (This is beyond the scope of the course.)
Figure 4.6, p. 187
ggplot(
data = quilpie,
mapping = aes(x = SOI, y = y)
) +
geom_jitter(height = .02) +
labs(y = "Prob of rainfall exceeding threshold",
x = "July average SOI") +
geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, color = "black")
This plot gives two fitted lines to the data:
- solid: from fitted logstic regression model:
\[p = \frac{\exp[.1 + .15 SOI]}{1 + \exp[.1 + .15 SOI]}\]
- dashed: from ordinary least squares regression (assuming normal distribution)
One of the lines has a clear problem with it. What is it? The normal based line (dashed) goes outside the possible range of probabilities of 0 - 1.