Notes 1
Motivation for Generalized Linear Models
You will have to install the package GLMsData just once (in RStudio: Tools > Install Packages).
Reading companion
Reference book: Generalized Linear Models with Examples in R
Chapters 2 and 3 cover linear regression (normal-based)
These notes: Section 4.2 of the reference book: The need for non-normal regression models
Linear regression assumes
Model:
- \(y\sim Normal(\mu, \sigma)\)
- \(\mu = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p\)
Assumptions:
- Constant variance
- Linearity
- Response follows a normal distribution
- Independent observations
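As a minimal sketch of the normal-based model above, the following fits a linear regression with lm(). It uses the cars data (columns speed and dist) only as a stand-in because it ships with base R; it is not one of the datasets in these notes, and the name fit_lm is illustrative.

```r
# Stand-in data: `cars` ships with base R (it is not from GLMsData)
fit_lm <- lm(dist ~ speed, data = cars)

# Estimates of beta_0 and beta_1 in mu = beta_0 + beta_1 * x
coef(fit_lm)

# Estimate of sigma, the standard deviation assumed constant across observations
sigma(fit_lm)

# Residuals: under the assumptions they should look like draws from one
# normal distribution, with no pattern against the fitted values
summary(resid(fit_lm))
```

Plotting resid(fit_lm) against fitted(fit_lm) is a common informal check of the constant variance and linearity assumptions.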
Three common non-normal situations
1. Response is a proportion
Not normal because: proportions are bounded between 0 and 1.
Not constant variance because: the variance is smaller for proportions near 0 and 1 than for proportions around 0.5.
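The variance claim can be illustrated with a short sketch. It assumes each proportion is y/n with y binomial on n trials, so that the variance of the proportion is p(1 - p)/n; the value n = 6 and the grid of p values are chosen only for illustration.

```r
n <- 6                                 # trials per observation (illustrative assumption)
p <- c(0.05, 0.25, 0.50, 0.75, 0.95)   # illustrative true proportions

# Variance of the sample proportion y / n when y ~ Binomial(n, p)
var_prop <- p * (1 - p) / n

# The variance peaks at p = 0.5 and shrinks toward 0 and 1
data.frame(p = p, variance = round(var_prop, 4))
```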
Example: shuttles data - Example 4.2, p. 167
Note: The data() function takes a dataset from the GLMsData package (which accompanies the book) and loads it into the R environment.
## Temp Damaged
## 1 53 2
## 2 57 1
## 3 58 1
## 4 63 1
## 5 66 0
## 6 67 0
Response variable: Damaged, the number of O-rings that failed out of 6.
Explanatory variable: Temp, the temperature in degrees Fahrenheit.
shuttles |>
  count(Damaged, Temp) |>
  ggplot(
    mapping = aes(x = Temp, y = Damaged, size = n)
  ) +
  geom_point() +
  scale_size_continuous(breaks = 1:3) +
  labs(
    x = "Temperature (deg F)",
    y = "Number of O-ring failures out of 6",
    size = "# Obs"
  )
Model
- \(y \sim Binomial(p, 6)\)
- \(\log(\frac{p}{1-p}) = \beta_0 + \beta_1 temp\)
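A minimal sketch of fitting this model with glm(). To keep the block self-contained it uses only the six rows printed above, typed in by hand, so the estimates will differ from those obtained on the full shuttles data. The names shuttles_head, fit_binom and p60 are illustrative.

```r
# Only the six rows shown in the notes; the full data come from
# data(shuttles) after loading GLMsData
shuttles_head <- data.frame(
  Temp    = c(53, 57, 58, 63, 66, 67),
  Damaged = c(2, 1, 1, 1, 0, 0)
)

# Binomial response supplied as cbind(failures, non-failures) out of 6 O-rings
fit_binom <- glm(
  cbind(Damaged, 6 - Damaged) ~ Temp,
  family = binomial(link = "logit"),
  data = shuttles_head
)

# beta_0 and beta_1 on the log-odds scale
coef(fit_binom)

# Estimated probability that a single O-ring fails at 60 deg F
p60 <- predict(fit_binom, newdata = data.frame(Temp = 60), type = "response")
p60
```

With the full data, replacing shuttles_head by shuttles in the data argument should be the only change needed.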
2. Response is a count
Not normal because: the normal distribution models continuous quantities; counts are discrete.
Not constant variance because: the variance increases with the mean count.
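As a quick check of the variance statement, the sketch below assumes the counts follow a Poisson distribution and computes the variance directly from the probability mass function. The means in mu are illustrative values, not taken from any dataset.

```r
mu <- c(1, 5, 20)   # illustrative mean counts

# Variance of Poisson(mu), computed as E[k^2] - (E[k])^2 from the pmf
pois_var <- sapply(mu, function(m) {
  k  <- 0:500                    # wide enough that the omitted tail is negligible
  pk <- dpois(k, lambda = m)
  sum(k^2 * pk) - sum(k * pk)^2
})

# For the Poisson distribution the variance equals the mean
data.frame(mean = mu, variance = pois_var)
```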
Example: nminer (noisy miner) data - Example 4.3, p. 168
## Miners Eucs Area Grazed Shrubs Bulokes Timber Minerab
## 1 0 2 22 0 1 120 16 0
## 2 0 10 11 0 1 67 25 0
## 3 1 16 51 0 1 85 13 3
## 4 1 20 22 0 1 45 12 2
## 5 1 19 4 0 1 160 14 8
## 6 1 18 61 0 1 75 6 1
Response variable: Minerab, the number of noisy miners
Explanatory variable: Eucs, the number of eucalyptus trees
nminer |>
  count(Eucs, Minerab) |>
  ggplot(
    mapping = aes(x = Eucs, y = Minerab, size = n)
  ) +
  geom_point() +
  labs(
    x = "Number of eucalyptus trees",
    y = "Number of noisy miner birds",
    size = "# Obs"
  ) +
  scale_size_continuous(breaks = 1:3)
Model
- \(y\sim Poisson(\mu)\)
- \(\log(\mu) = \beta_0 + \beta_1 trees\)
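A minimal sketch of fitting this Poisson model with glm(), again using only the six rows printed above (columns Eucs and Minerab, typed in by hand) so that the block runs without GLMsData. The names nminer_head, fit_pois and rate_ratio are illustrative, and the estimates will differ on the full nminer data.

```r
# Only the six rows shown in the notes; the full data come from data(nminer)
nminer_head <- data.frame(
  Eucs    = c(2, 10, 16, 20, 19, 18),
  Minerab = c(0, 0, 3, 2, 8, 1)
)

# Poisson response with a log link: log(mu) = beta_0 + beta_1 * Eucs
fit_pois <- glm(
  Minerab ~ Eucs,
  family = poisson(link = "log"),
  data = nminer_head
)

coef(fit_pois)

# Because of the log link, exp(beta_1) is the multiplicative change in the
# mean count for each additional tree
rate_ratio <- exp(coef(fit_pois)[["Eucs"]])
rate_ratio
```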
3. Response is positive continuous
Not normal because: the data are often right-skewed; also, the normal distribution puts probability on negative values.
Not constant variance because: the variation often increases as the mean of the response increases.
Example: sdrink (soft drink) data - Example 4.4, p. 169
## Time Cases Distance
## 1 16.68 7 560
## 2 11.50 3 220
## 3 12.03 3 340
## 4 14.88 4 80
## 5 13.75 6 150
## 6 18.11 7 330
Response variable: Time
Explanatory variables: Cases (number of cases) and Distance (distance walked)
Model
- \(y\sim Gamma(\mu; \phi)\)
- \(\mu = \beta_0 + \beta_1 cases + \beta_2 distance\)
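A minimal sketch of fitting this gamma model with glm(), using only the six rows printed above (typed in by hand) so that the block is self-contained. Note that Gamma() defaults to the inverse link in R, so the identity link is requested explicitly to match the model as written. The names sdrink_head, fit_gamma and phi_hat are illustrative, and with three coefficients and six rows the estimates are only a demonstration.

```r
# Only the six rows shown in the notes; the full data come from data(sdrink)
sdrink_head <- data.frame(
  Time     = c(16.68, 11.50, 12.03, 14.88, 13.75, 18.11),
  Cases    = c(7, 3, 3, 4, 6, 7),
  Distance = c(560, 220, 340, 80, 150, 330)
)

# Gamma response with identity link: mu = beta_0 + beta_1 * Cases + beta_2 * Distance
fit_gamma <- glm(
  Time ~ Cases + Distance,
  family = Gamma(link = "identity"),
  data = sdrink_head
)

coef(fit_gamma)

# Estimate of the dispersion parameter phi reported by summary()
phi_hat <- summary(fit_gamma)$dispersion
phi_hat
```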
Section 5.2: The two components of generalized linear models
Random component: Defines the distribution of the response \(y\).
Systematic component: Describes the mean \(\mu\) through a linear combination of the explanatory variables. The linear combination is equated to a link function \(g\) of the mean: \(g(\mu) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_px_p\).
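In R, both components are carried by a family object, which is what the family argument of glm() expects. The sketch below inspects the families matching the three situations in these notes; the identity link for the gamma case mirrors the soft drink model, and the names fams, links and logit are illustrative.

```r
# Random component: the distribution; systematic component: its link function
fams <- list(
  binomial = binomial(link = "logit"),
  poisson  = poisson(link = "log"),
  Gamma    = Gamma(link = "identity")
)

# Name of the link function g used by each family
links <- sapply(fams, function(f) f$link)
links

# The link maps the mean to the linear predictor; its inverse maps back
logit <- fams$binomial$linkfun
logit(0.5)                    # log(0.5 / 0.5) = 0
fams$binomial$linkinv(0)      # back to a probability of 0.5
fams$poisson$linkinv(0)       # exp(0) = 1
```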