For weeks, we’ve built analytical capacity using two core models:
Linear Regression (Normal Distribution) and
Logistic Regression (Binomial Distribution). Today, we
look under the hood of the glm() function in R
to see how these models are just two members of a much larger,
incredibly flexible family: Generalized Linear Models
(GLMs).
Your knowledge of Linear and Logistic Regression is the foundation.
If you can identify the type of data you have, you can select the
correct GLM family, and the rest of the process (fitting, interpreting,
diagnosing) follows familiar rules.
The GLM Framework: The Three Pillars
Every GLM is defined by three fundamental components. In
glm(), you only need to specify the Random
Component (via family).
| 1. Random |
The probability distribution of the response variable (\(Y\)). |
Normal (family = gaussian) |
Binomial (family = binomial) |
Poisson, Gamma, etc. |
| 2. Systematic |
The linear predictor: \(X\beta\)
(the familiar part). |
\(\beta_0 + \beta_1 X_1 +
\dots\) |
\(\beta_0 + \beta_1 X_1 +
\dots\) |
Same! |
| 3. Link |
The function that links the mean of \(Y\) to the linear predictor. |
Identity (\(\mu =
\eta\)) |
Logit (\(\log(\frac{\mu}{1-\mu})
= \eta\)) |
Log, Inverse, etc. |
For any new GLM, we only need to learn:
- Which
family to use (based on the data type).
- What the
link function does to the mean \(\mu\).
Modeling Counts with Poisson Regression
Call Center Incident Reports: A company wants to
model the number of customer support incidents reported per week for
different product lines. Since the response variable is a count (0, 1,
2, 3, …), the Normal (Gaussian) distribution is inappropriate, and the
Binomial (Logistic) distribution is for proportions. Count data is
exactly applicable to the Poisson GLM Family.
The Poisson Link Function
The Poisson distribution requires the mean (\(\mu\)) to be positive. The standard link
function is the Log Link:
\[\log(\mu) = \beta_0 + \beta_1 X_1 +
\dots\]
This is identical to the Logit link in Logistic Regression! This
means the coefficients will be interpreted similarly:
multiplicatively.
R Code & Interpretation
We’ll simulate some data for demonstration.
library(tidyverse)
# 1. Simulate data (Incident Reports)
set.seed(42)
data_poisson <- tibble(
Product_Age_Years = runif(100, 1, 10),
Support_Level = factor(sample(c("Basic", "Premium"), 100, replace = TRUE, prob = c(0.7, 0.3))),
# Generate counts (rate depends on age and support)
lambda = exp(1.5 - 0.2 * Product_Age_Years + 0.5 * (Support_Level == "Premium")),
Incidents_Weekly = rpois(100, lambda)
)
# 2. Fit the Poisson GLM
poisson_model <- glm(
Incidents_Weekly ~ Product_Age_Years + Support_Level,
family = poisson(link = "log"), # The only change from lm() or glm(..., family=binomial)
data = data_poisson
)
summary(poisson_model)
Call:
glm(formula = Incidents_Weekly ~ Product_Age_Years + Support_Level,
family = poisson(link = "log"), data = data_poisson)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.38469 0.16897 8.195 2.51e-16 ***
Product_Age_Years -0.19174 0.02863 -6.698 2.11e-11 ***
Support_LevelPremium 0.38595 0.15029 2.568 0.0102 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 183.54 on 99 degrees of freedom
Residual deviance: 124.62 on 97 degrees of freedom
AIC: 326.7
Number of Fisher Scoring iterations: 5
Interpretation of Coefficients:
Just like in Logistic Regression, we must exponentiate the
coefficients to return them to the original scale of the response
variable (counts). Exponentiated coefficients are interpreted as Rate
Ratios.
# Exponentiate the coefficients
exp(coef(poisson_model))
(Intercept) Product_Age_Years Support_LevelPremium
3.993592 0.825518 1.471006
Product_Age_Years |
-0.19 |
0.82 |
A 1-year increase in product age is associated with an 18%
decrease (\(\approx 1 -
0.82\)) in the expected weekly incident rate. |
Support_LevelPremium |
0.39 |
1.47 |
Premium support products have a 47% higher expected
weekly incident rate compared to Basic support products, holding age
constant. |
Assumption Check: Overdispersion
The core assumption of Poisson is that the mean equals the variance
(\(\text{Var}(Y) = \mu\)). If the
variance is much larger than the mean, we have overdispersion, which
leads to deflated standard errors (and inflated significance).
- Diagnostic: Check the Residual Deviance against the
Degrees of Freedom (DoF) from the
summary().
# Check for Overdispersion: Ratio of Residual Deviance / Residual DoF
deviance_ratio <- summary(poisson_model)$deviance / summary(poisson_model)$df.residual
print(paste("Deviance Ratio:", round(deviance_ratio, 2)))
[1] "Deviance Ratio: 1.28"
# Rule of Thumb: If ratio > 1.5, consider switching to a Quasi-Poisson or Negative Binomial model.
Graphical Summary
A key visualization is comparing the predicted rates to the observed
counts.
# Add predictions (expected counts) back to the data
data_poisson <- data_poisson %>%
mutate(Expected_Incidents = predict(poisson_model, type = "response"))
# Plot Expected vs. Observed Counts
ggplot(data_poisson, aes(x = Expected_Incidents, y = Incidents_Weekly, color = Support_Level)) +
geom_point(alpha = 0.6) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") +
labs(
title = "Poisson Model Fit: Expected vs. Observed Incident Counts",
x = "Expected Weekly Incidents (Model Prediction)",
y = "Observed Weekly Incidents",
color = "Support Level"
) +
theme_minimal()

Points on the graph above the line indicate where our model
underestimates the count, points below the line indicate where our model
overestimates the count, and points exactly on the line are perfect
predictions.
Modeling Skewed Continuous Data with Gamma Regression
Insurance Claim Amounts: An insurance company wants
to model the size of financial claims paid out to customers. Claim
amounts are continuous and positive (cannot be
negative). They are highly right-skewed (most claims are small,
but a few are very large). Positive, continuous, skewed data types can
be modeled with the Gamma GLM Family.
The Gamma Link Function
The Gamma distribution commonly uses the Log Link
(or sometimes the Inverse link). We’ll stick with Log
as it allows for the multiplicative interpretation we just learned.
\[\log(\mu) = \beta_0 + \beta_1 X_1 +
\dots\]
R Code & Interpretation
# 3. Simulate Gamma data (Claim Amounts)
set.seed(101)
data_gamma <- tibble(
Customer_Score = runif(100, 500, 800), # e.g., credit/risk score
Claim_Type = factor(sample(c("Minor", "Major"), 100, replace = TRUE, prob = c(0.7, 0.3))),
# Generate claim amounts (mean depends on score and claim type)
mu = exp(7 - 0.005 * Customer_Score + 0.5 * (Claim_Type == "Major")),
Claim_Amount = rgamma(100, shape = 2, rate = 2/mu) # Gamma parameterization
)
# 4. Fit the Gamma GLM
gamma_model <- glm(
Claim_Amount ~ Customer_Score + Claim_Type,
family = Gamma(link = "log"), # Family changes, link stays 'log'
data = data_gamma
)
summary(gamma_model)
Call:
glm(formula = Claim_Amount ~ Customer_Score + Claim_Type, family = Gamma(link = "log"),
data = data_gamma)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.3717659 0.6912186 10.665 < 2e-16 ***
Customer_Score -0.0048348 0.0009964 -4.852 4.66e-06 ***
Claim_TypeMinor -0.5678030 0.1919092 -2.959 0.00388 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 0.7735794)
Null deviance: 84.850 on 99 degrees of freedom
Residual deviance: 63.418 on 97 degrees of freedom
AIC: 953.49
Number of Fisher Scoring iterations: 5
Interpretation of Coefficients (Rate Ratios Again!):
Since we used the Log Link again, the interpretation template is the
same as Poisson and Logistic Regression—exponentiate the coefficient for
a multiplicative effect.
# Exponentiate the coefficients
exp(coef(gamma_model))
(Intercept) Customer_Score Claim_TypeMinor
1590.4398662 0.9951769 0.5667693
Since the Gamma model used the Log Link
(family = Gamma(link = "log")), we interpret the
exponentiated coefficients (\(\exp(\beta)\)) as a multiplicative
ratio relative to the expected mean (claim amount).
(Intercept) |
7.3717659 |
1590.44 |
This is the expected claim amount (in currency units, e.g., dollars)
when all predictors are zero (i.e., Customer Score = 0 and Claim Type =
Major). This value is usually theoretical since a score of
0 is unrealistic. |
Customer_Score |
-0.0048348 |
0.9951769 |
For every one-unit increase in the Customer Score,
the expected claim amount is multiplied by 0.995, or decreases by
approximately 0.48% (\(1 -
0.9951769 = 0.00482\)). This suggests higher customer scores are
associated with slightly lower claim amounts. |
Claim_TypeMinor |
-0.5678030 |
0.5667693 |
The baseline (omitted) category is Claim_TypeMajor.
Therefore, the expected claim amount for a Minor claim is
0.567 times the expected claim amount for a
Major claim, holding Customer Score constant. In other
words, Minor claims are expected to be about 43.3%
lower (\(1 - 0.5667693\)) than
Major claims. |
Assumption Check: Homoscedasticity vs. Variance Function
Unlike Linear Regression where variance is constant
(homoscedasticity), GLMs allow variance to be a function of the mean
(\(\text{Var}(Y) = V(\mu)\)).
- For Normal, \(V(\mu) = \sigma^2\)
(Constant variance).
- For Poisson, \(V(\mu) = \mu\)
(Variance increases with the mean).
- For Gamma, \(V(\mu) = \phi \mu^2\)
(Variance increases with the square of the mean).
Diagnostic: Plotting the standardized residuals
against the fitted values (mean).
plot(
fitted(gamma_model),
resid(gamma_model, type = "deviance"),
xlab = "Fitted Values (Expected Claim Amount)",
ylab = "Deviance Residuals",
main = "Gamma Model Residuals vs. Fitted Values"
)
abline(h = 0, col = "grey", lty = 2)
We look for random scatter around zero. We are not looking for
constant variance (as in Linear Regression), but rather ensuring there
is no obvious pattern (like a cone shape, which would indicate a poor
fit for the chosen variance function, \(V(\mu)
= \phi \mu^2\)).
Conclusion
The Transferable Analytical Capacity
You now know three different GLMs, and yet, the core skills you use
are identical:
| Model Specification |
Y ~ X1 + X2 (The systematic component is the same) |
| P-Value Interpretation |
A high Wald \(\chi^2\) statistic
(t-statistic for Linear) results in a low p-value, indicating the
predictor is relevant to the link function of the mean. |
| Coefficient Interpretation |
Identity Link (Linear): Additive effect. \(\beta_1\) is the change in \(Y\) per unit \(X_1\). |
|
Log Link (Logistic, Poisson, Gamma): Multiplicative effect. \(\exp(\beta_1)\) is the ratio of
means/rates/odds per unit \(X_1\). |
Whether you are dealing with Log Odds (Logistic), Log Rates
(Poisson), or Log Means (Gamma), the process of exponentiating the
coefficient to get a multiplicative ratio is the same. You just need to
adjust your language (odds, rate, or mean) to match the data type!
---
title: "Generalized Linear Models (GLMs)"
output: 
  html_notebook:
    toc: true
    toc_float: true
    theme: cerulean
---

For weeks, we’ve built analytical capacity using two core models: **Linear Regression (Normal Distribution)** and **Logistic Regression (Binomial Distribution)**. Today, we look under the hood of the `glm()` function in `R` to see how these models are just two members of a much larger, incredibly flexible family: **Generalized Linear Models (GLMs)**.

Your knowledge of Linear and Logistic Regression is the foundation. If you can identify the type of data you have, you can select the correct GLM family, and the rest of the process (fitting, interpreting, diagnosing) follows familiar rules.


## The GLM Framework: The Three Pillars 

Every GLM is defined by three fundamental components. In `glm()`, you only need to specify the **Random Component** (via `family`).

| Component     | Description                                                      | Linear Regression (Normal)      | Logistic Regression (Binomial)           | New GLMs             |
|---------------|------------------------------------------------------------------|---------------------------------|------------------------------------------|----------------------|
| 1. **Random**     | The probability distribution of the response variable ($Y$).     | *Normal* (`family = gaussian`)      | *Binomial* (`family = binomial`)             | *Poisson*, *Gamma*, etc. |
| 2. **Systematic** | The linear predictor: $X\beta$ (the familiar part).              | $\beta_0 + \beta_1 X_1 + \dots$ | $\beta_0 + \beta_1 X_1 + \dots$          | **Same!**                |
| 3. **Link**       | The function that links the mean of $Y$ to the linear predictor. | *Identity* ($\mu = \eta$)         | *Logit* ($\log(\frac{\mu}{1-\mu}) = \eta$) | *Log*, *Inverse*, etc.   |


For any new GLM, we only need to learn:

 - Which `family` to use (based on the data type).  
 - What the `link` function does to the mean $\mu$.
 
## Modeling Counts with Poisson Regression 

**Call Center Incident Reports:** A company wants to model the number of customer support incidents reported per week for different product lines. Since the response variable is a count (0, 1, 2, 3, ...), the Normal (Gaussian) distribution is inappropriate, and the Binomial (Logistic) distribution is for proportions. Count data is exactly applicable to the Poisson GLM Family.

### The Poisson Link Function

The Poisson distribution requires the mean ($\mu$) to be positive. The standard link function is the Log Link:

$$\log(\mu) = \beta_0 + \beta_1 X_1 + \dots$$

This is identical to the Logit link in Logistic Regression! This means the coefficients will be interpreted similarly: multiplicatively.

### R Code & Interpretation

We'll simulate some data for demonstration.

```{r}
library(tidyverse)

# 1. Simulate data (Incident Reports)
set.seed(42)
data_poisson <- tibble(
  Product_Age_Years = runif(100, 1, 10),
  Support_Level = factor(sample(c("Basic", "Premium"), 100, replace = TRUE, prob = c(0.7, 0.3))),
  # Generate counts (rate depends on age and support)
  lambda = exp(1.5 - 0.2 * Product_Age_Years + 0.5 * (Support_Level == "Premium")),
  Incidents_Weekly = rpois(100, lambda)
)

# 2. Fit the Poisson GLM
poisson_model <- glm(
  Incidents_Weekly ~ Product_Age_Years + Support_Level,
  family = poisson(link = "log"), # The only change from lm() or glm(..., family=binomial)
  data = data_poisson
)

summary(poisson_model)
```

### Interpretation of Coefficients:

Just like in Logistic Regression, we must exponentiate the coefficients to return them to the original scale of the response variable (counts). Exponentiated coefficients are interpreted as Rate Ratios.

```{r}
# Exponentiate the coefficients
exp(coef(poisson_model))
```

| Coefficient          | Estimate | Exp(Estimate) | Interpretation                                                                                                                     |
|----------------------|----------|---------------|------------------------------------------------------------------------------------------------------------------------------------|
| `Product_Age_Years`    | -0.19    | __0.82__          | A 1-year increase in product age is associated with an __18% decrease__ ($\approx 1 - 0.82$) in the expected weekly incident rate.     |
| `Support_LevelPremium` | 0.39     | __1.47__          | Premium support products have a __47% higher__ expected weekly incident rate compared to Basic support products, holding age constant. |


### Assumption Check: Overdispersion

The core assumption of Poisson is that the mean equals the variance ($\text{Var}(Y) = \mu$). If the variance is much larger than the mean, we have overdispersion, which leads to deflated standard errors (and inflated significance). 

 - __Diagnostic:__ Check the Residual Deviance against the Degrees of Freedom (DoF) from the `summary()`.
 
```{r}
# Check for Overdispersion: Ratio of Residual Deviance / Residual DoF
deviance_ratio <- summary(poisson_model)$deviance / summary(poisson_model)$df.residual
print(paste("Deviance Ratio:", round(deviance_ratio, 2)))

# Rule of Thumb: If ratio > 1.5, consider switching to a Quasi-Poisson or Negative Binomial model.
```

### Graphical Summary

A key visualization is comparing the predicted rates to the observed counts.

```{r}
# Add predictions (expected counts) back to the data
data_poisson <- data_poisson %>%
  mutate(Expected_Incidents = predict(poisson_model, type = "response"))

# Plot Expected vs. Observed Counts
ggplot(data_poisson, aes(x = Expected_Incidents, y = Incidents_Weekly, color = Support_Level)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") +
  labs(
    title = "Poisson Model Fit: Expected vs. Observed Incident Counts",
    x = "Expected Weekly Incidents (Model Prediction)",
    y = "Observed Weekly Incidents",
    color = "Support Level"
  ) +
  theme_minimal()
```
Points on the graph above the line indicate where our model underestimates the count, points below the line indicate where our model overestimates the count, and points exactly on the line are perfect predictions. 

## Modeling Skewed Continuous Data with Gamma Regression 

__Insurance Claim Amounts__: An insurance company wants to model the size of financial claims paid out to customers. Claim amounts are *continuous* and *positive* (cannot be negative). They are *highly right-skewed* (most claims are small, but a few are very large). Positive, continuous, skewed data types can be modeled with the __Gamma__ GLM Family.

### The Gamma Link Function

The Gamma distribution commonly uses the **Log Link** (or sometimes the **Inverse link**). We'll stick with Log as it allows for the multiplicative interpretation we just learned.

$$\log(\mu) = \beta_0 + \beta_1 X_1 + \dots$$

### R Code & Interpretation
```{r}
# 3. Simulate Gamma data (Claim Amounts)
set.seed(101)
data_gamma <- tibble(
  Customer_Score = runif(100, 500, 800), # e.g., credit/risk score
  Claim_Type = factor(sample(c("Minor", "Major"), 100, replace = TRUE, prob = c(0.7, 0.3))),
  # Generate claim amounts (mean depends on score and claim type)
  mu = exp(7 - 0.005 * Customer_Score + 0.5 * (Claim_Type == "Major")),
  Claim_Amount = rgamma(100, shape = 2, rate = 2/mu) # Gamma parameterization
)

# 4. Fit the Gamma GLM
gamma_model <- glm(
  Claim_Amount ~ Customer_Score + Claim_Type,
  family = Gamma(link = "log"), # Family changes, link stays 'log'
  data = data_gamma
)

summary(gamma_model)
```

### Interpretation of Coefficients (Rate Ratios Again!):

Since we used the Log Link again, the interpretation template is the same as Poisson and Logistic Regression—exponentiate the coefficient for a multiplicative effect.

```{r}
# Exponentiate the coefficients
exp(coef(gamma_model))
```

Since the Gamma model used the Log Link (`family = Gamma(link = "log")`), we interpret the exponentiated coefficients ($\exp(\beta)$) as a *multiplicative ratio* relative to the expected mean (claim amount).

| Coefficient     | Estimate $(\beta)$ | $\exp(\beta)$    | Interpretation (Multiplicative)                                                                                                                                                                                                                                                                                |
|-----------------|--------------|-----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `(Intercept)`     | 7.3717659    | 1590.44   | This is the expected claim amount (in currency units, e.g., dollars) when all predictors are zero (i.e., Customer Score = 0 and Claim Type = `Major`). This value is usually theoretical since a score of 0 is unrealistic.                                                                                      |
| `Customer_Score`  | -0.0048348   | **0.9951769** | For every **one-unit increase** in the Customer Score, the expected claim amount is multiplied by 0.995, or decreases by approximately **0.48%** ($1 - 0.9951769 = 0.00482$). This suggests higher customer scores are associated with slightly lower claim amounts.                                                   |
| `Claim_TypeMinor` | -0.5678030   | **0.5667693** | The baseline (omitted) category is `Claim_TypeMajor`. Therefore, the expected claim amount for a `Minor` claim is **0.567 times** the expected claim amount for a `Major` claim, holding Customer Score constant. In other words, Minor claims are expected to be about **43.3% lower** ($1 - 0.5667693$) than Major claims. |



### Assumption Check: Homoscedasticity vs. Variance Function

Unlike Linear Regression where variance is constant (homoscedasticity), GLMs allow variance to be a function of the mean ($\text{Var}(Y) = V(\mu)$). 

 - For Normal, $V(\mu) = \sigma^2$ (Constant variance). 
 - For Poisson, $V(\mu) = \mu$ (Variance increases with the mean).  
 - For Gamma, $V(\mu) = \phi \mu^2$ (Variance increases with the square of the mean).
 
__Diagnostic:__ Plotting the standardized residuals against the fitted values (mean).

```{r}
plot(
  fitted(gamma_model),
  resid(gamma_model, type = "deviance"),
  xlab = "Fitted Values (Expected Claim Amount)",
  ylab = "Deviance Residuals",
  main = "Gamma Model Residuals vs. Fitted Values"
)
abline(h = 0, col = "grey", lty = 2)
```

We look for random scatter around zero. We are not looking for constant variance (as in Linear Regression), but rather ensuring there is no obvious pattern (like a cone shape, which would indicate a poor fit for the chosen variance function, $V(\mu) = \phi \mu^2$).

## Conclusion

The Transferable Analytical Capacity

You now know three different GLMs, and yet, the core skills you use are identical:

| Task                       | Linear/Logistic/Poisson/Gamma                                                                                                                            |
|----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Model Specification        | `Y ~ X1 + X2` (The systematic component is the same)                                                                                                       |
| *P*-Value Interpretation     | A high Wald $\chi^2$ statistic (t-statistic for Linear) results in a low p-value, indicating the predictor is relevant to the link function of the mean. |
| Coefficient Interpretation | Identity Link (Linear): Additive effect. $\beta_1$ is the change in $Y$ per unit $X_1$.                                                                  |
|                            | Log Link (Logistic, Poisson, Gamma): Multiplicative effect. $\exp(\beta_1)$ is the ratio of means/rates/odds per unit $X_1$.                             |

Whether you are dealing with Log Odds (Logistic), Log Rates (Poisson), or Log Means (Gamma), the process of exponentiating the coefficient to get a multiplicative ratio is the same. You just need to adjust your language (odds, rate, or mean) to match the data type!

