Notes 1

Motivation for Generalized Linear Models

You will have to install the package GLMsData just once: Tools > Install Packages

library(tidyverse)
library(GLMsData)
library(GGally)

Reading companion

  • Reference book: Generalized Linear Models with Examples in R

  • Chapters 2 and 3: cover linear regression (normal-based)

  • These notes: Section 4.2 of the reference book, The need for non-normal regression models

Linear regression assumes

Model:

  • \(y\sim Normal(\mu, \sigma)\)
  • \(\mu = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p\)

Assumptions:

  • Constant variance
  • Linearity
  • Response follows a normal distribution
  • Independent observations
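
As a point of comparison for the models that follow, the linear regression setup above can be sketched on simulated data. The sample size, coefficients, and variable names below are assumptions chosen for illustration; they do not come from the book.

```r
# Simulated data (hypothetical values, not from the book)
set.seed(1)
n <- 200
x <- runif(n, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(n, mean = 0, sd = 1)  # normal errors, constant variance

# Fit the normal-based linear regression
fit_lm <- lm(y ~ x)
coef(fit_lm)

# Residuals vs. fitted values: a roughly even band around zero
# is what the constant-variance assumption would lead us to expect
plot(fitted(fit_lm), resid(fit_lm),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

When the response is a proportion, a count, or a positive continuous quantity, this kind of residual plot often shows a funnel shape instead, which is the motivation for the three situations below.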

Three common non-normal situations

1. Response is a proportion

  • Not normal because: proportions are bounded between 0 and 1, whereas the normal distribution is unbounded

  • Not constant variance because: the variance is smaller for proportions near 0 and 1 than for proportions near 0.5
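
The variance claim can be checked with a short calculation. This is a minimal sketch, assuming the proportion is y/n with y binomial, so that its variance is p(1 - p)/n; the choice n = 6 and the grid of p values are illustrative.

```r
# Variance of a sample proportion y/n when y ~ Binomial(n, p)
n <- 6                                   # illustrative number of trials
p <- c(0.05, 0.25, 0.50, 0.75, 0.95)     # illustrative true proportions
var_prop <- p * (1 - p) / n

# The variance peaks at p = 0.5 and shrinks toward 0 and 1
data.frame(p = p, variance = var_prop)
```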

Example: shuttles data - Example 4.2, p. 167

Note: The data() function loads a dataset from the GLMsData package (which accompanies the book) into the R environment

data(shuttles)
head(shuttles)
##   Temp Damaged
## 1   53       2
## 2   57       1
## 3   58       1
## 4   63       1
## 5   66       0
## 6   67       0
  • Response variable: Damaged, the number of O-rings that failed out of 6.

  • Explanatory variable: Temp, the temperature in °F.

shuttles |> 
  count(Damaged, Temp) |> 
  ggplot(
    mapping = aes(x = Temp, y = Damaged, size = n)
  ) + 
  geom_point() + 
  scale_size_continuous(breaks=1:3) + 
  labs(
    x = "Temperature (deg F)", 
    y = "Number of O-ring failures out of 6", 
    size = "# Obs"
  )

Model

  • \(y \sim Binomial(n = 6, p)\)
  • \(\log(\frac{p}{1-p}) = \beta_0 + \beta_1 temp\)
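
One way this model might be fitted in R is with glm() and the binomial family. This is a sketch rather than the book's exact code: the object name fit_shuttles and the temperatures used for prediction are assumptions made for illustration.

```r
library(GLMsData)
data(shuttles)

# Two-column response: damaged O-rings (successes) and undamaged O-rings
# (failures), with 6 O-rings per launch
fit_shuttles <- glm(
  cbind(Damaged, 6 - Damaged) ~ Temp,
  family = binomial(link = "logit"),
  data = shuttles
)
coef(fit_shuttles)

# Estimated probability that an O-ring fails at a few illustrative temperatures
predict(
  fit_shuttles,
  newdata = data.frame(Temp = c(55, 65, 75)),
  type = "response"
)
```

The logit link keeps every estimated probability between 0 and 1, which a straight line fitted to the proportions would not guarantee.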

2. Response is a count

  • Not normal because: the normal distribution models continuous quantities; counts are discrete and cannot be negative

  • Not constant variance because: the variance increases with the mean count.
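
A small simulation illustrates the mean-variance relationship for counts. It assumes the counts are Poisson, for which the variance equals the mean; the means and the number of draws are arbitrary choices.

```r
set.seed(1)
mu <- c(1, 5, 20)   # illustrative mean counts

# One column of simulated counts per mean
sims <- sapply(mu, function(m) rpois(10000, lambda = m))

# The sample variance grows with the sample mean
data.frame(
  mu          = mu,
  sample_mean = colMeans(sims),
  sample_var  = apply(sims, 2, var)
)
```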

Example: nminer (noisy miner) data - Example 4.3, p. 168

data(nminer)
head(nminer)
##   Miners Eucs Area Grazed Shrubs Bulokes Timber Minerab
## 1      0    2   22      0      1     120     16       0
## 2      0   10   11      0      1      67     25       0
## 3      1   16   51      0      1      85     13       3
## 4      1   20   22      0      1      45     12       2
## 5      1   19    4      0      1     160     14       8
## 6      1   18   61      0      1      75      6       1
  • Response variable: Minerab, the number of noisy miners

  • Explanatory variable: Eucs, the number of eucalyptus trees

nminer |> 
  count(Eucs, Minerab) |> 
  ggplot(
    mapping = aes(x = Eucs, y = Minerab, size = n)
  ) + 
  geom_point() + 
  labs(
    x = "Number of eucalyptus trees", 
    y = "Number of noisy miner birds", 
    size = "# Obs"
  ) + 
  scale_size_continuous(breaks=1:3)

Model

  • \(y\sim Poisson(\mu)\)
  • \(\log(\mu) = \beta_0 + \beta_1 trees\)
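
A possible way to fit this model in R is glm() with the Poisson family and log link. The object name fit_nminer is an assumption made for illustration, and this is a sketch rather than the book's exact code.

```r
library(GLMsData)
data(nminer)

# Poisson regression of the number of noisy miners on the number of eucalypts
fit_nminer <- glm(
  Minerab ~ Eucs,
  family = poisson(link = "log"),
  data = nminer
)
coef(fit_nminer)

# With a log link, exp(beta_1) is the multiplicative change in the
# expected count for each additional tree
exp(coef(fit_nminer)[["Eucs"]])
```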

3. Response is positive continuous

  • Not normal because: the data are often right-skewed; also, the normal distribution puts probability on negative values.

  • Not constant variance because: the variation often increases with the mean.

Example: sdrink (soft drink) data - Example 4.4, p. 169

data(sdrink)
head(sdrink)
##    Time Cases Distance
## 1 16.68     7      560
## 2 11.50     3      220
## 3 12.03     3      340
## 4 14.88     4       80
## 5 13.75     6      150
## 6 18.11     7      330
  • Response variable: Time

  • Explanatory variables: Cases (number of cases) and Distance (distance walked)

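sdrink |> 
  # move the response Time to the last column so it appears in the
  # last row and column of the plot matrix
  relocate(Time, .after = last_col()) |>
  ggpairs()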

Model

  • \(y\sim Gamma(\mu; \phi)\)
  • \(\mu = \beta_0 + \beta_1 cases + \beta_2 distance\)
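
This model might be fitted with glm() and the Gamma family, as in the sketch below. The identity link is not the default for Gamma() in R (the default is the inverse link), so it is requested explicitly. With an identity link the fitting can fail if a fitted mean becomes non-positive, so treat this as an illustration under that assumption; the object name fit_sdrink is hypothetical.

```r
library(GLMsData)
data(sdrink)

# Gamma regression with the mean modelled directly (identity link)
fit_sdrink <- glm(
  Time ~ Cases + Distance,
  family = Gamma(link = "identity"),
  data = sdrink
)
coef(fit_sdrink)

# Estimate of the dispersion parameter phi
summary(fit_sdrink)$dispersion
```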

Section 5.2: The two components of generalized linear models

  1. Random component: Defines the probability distribution of the response \(y\)

  2. Systematic component: Describes the mean \(\mu\) through a linear combination of the explanatory variables, connected by a link function \(g\): \(g(\mu) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p\)
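
In R, both components are bundled into a family object, which can be inspected directly. The sketch below uses only base R; the value of eta is a hypothetical linear predictor chosen for illustration.

```r
fam <- poisson()          # Poisson family with its default link
fam$family                # "poisson": the random component
fam$link                  # "log": the link used in the systematic component

eta <- 0.5                # hypothetical value of the linear predictor
mu  <- fam$linkinv(eta)   # mean implied by eta, here exp(0.5)
fam$linkfun(mu)           # applying the link to the mean recovers eta

binomial()$link                  # "logit"
Gamma(link = "identity")$link    # "identity", as in the soft drink model
```

Passing one of these family objects to glm() through its family argument is how the distribution and the link are chosen together.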