For weeks, we’ve built analytical capacity using two core models: Linear Regression (Normal Distribution) and Logistic Regression (Binomial Distribution). Today, we look under the hood of the glm() function in R to see how these models are just two members of a much larger, incredibly flexible family: Generalized Linear Models (GLMs).

Your knowledge of Linear and Logistic Regression is the foundation. If you can identify the type of data you have, you can select the correct GLM family, and the rest of the process (fitting, interpreting, diagnosing) follows familiar rules.

The GLM Framework: The Three Pillars

Every GLM is defined by three fundamental components. In glm(), you only need to specify the Random Component (via family).

Component Description Linear Regression (Normal) Logistic Regression (Binomial) New GLMs
1. Random The probability distribution of the response variable (\(Y\)). Normal (family = gaussian) Binomial (family = binomial) Poisson, Gamma, etc.
2. Systematic The linear predictor: \(X\beta\) (the familiar part). \(\beta_0 + \beta_1 X_1 + \dots\) \(\beta_0 + \beta_1 X_1 + \dots\) Same!
3. Link The function that links the mean of \(Y\) to the linear predictor. Identity (\(\mu = \eta\)) Logit (\(\log(\frac{\mu}{1-\mu}) = \eta\)) Log, Inverse, etc.

For any new GLM, we only need to learn:

  • Which family to use (based on the data type).
  • What the link function does to the mean \(\mu\).

Modeling Counts with Poisson Regression

Call Center Incident Reports: A company wants to model the number of customer support incidents reported per week for different product lines. Since the response variable is a count (0, 1, 2, 3, …), the Normal (Gaussian) distribution is inappropriate, and the Binomial (Logistic) distribution is for proportions. Count data is exactly applicable to the Poisson GLM Family.

R Code & Interpretation

We’ll simulate some data for demonstration.

library(tidyverse)

# 1. Simulate data (Incident Reports)
set.seed(42)
data_poisson <- tibble(
  Product_Age_Years = runif(100, 1, 10),
  Support_Level = factor(sample(c("Basic", "Premium"), 100, replace = TRUE, prob = c(0.7, 0.3))),
  # Generate counts (rate depends on age and support)
  lambda = exp(1.5 - 0.2 * Product_Age_Years + 0.5 * (Support_Level == "Premium")),
  Incidents_Weekly = rpois(100, lambda)
)

# 2. Fit the Poisson GLM
poisson_model <- glm(
  Incidents_Weekly ~ Product_Age_Years + Support_Level,
  family = poisson(link = "log"), # The only change from lm() or glm(..., family=binomial)
  data = data_poisson
)

summary(poisson_model)

Call:
glm(formula = Incidents_Weekly ~ Product_Age_Years + Support_Level, 
    family = poisson(link = "log"), data = data_poisson)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)           1.38469    0.16897   8.195 2.51e-16 ***
Product_Age_Years    -0.19174    0.02863  -6.698 2.11e-11 ***
Support_LevelPremium  0.38595    0.15029   2.568   0.0102 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 183.54  on 99  degrees of freedom
Residual deviance: 124.62  on 97  degrees of freedom
AIC: 326.7

Number of Fisher Scoring iterations: 5

Interpretation of Coefficients:

Just like in Logistic Regression, we must exponentiate the coefficients to return them to the original scale of the response variable (counts). Exponentiated coefficients are interpreted as Rate Ratios.

# Exponentiate the coefficients
exp(coef(poisson_model))
         (Intercept)    Product_Age_Years Support_LevelPremium 
            3.993592             0.825518             1.471006 
Coefficient Estimate Exp(Estimate) Interpretation
Product_Age_Years -0.19 0.82 A 1-year increase in product age is associated with an 18% decrease (\(\approx 1 - 0.82\)) in the expected weekly incident rate.
Support_LevelPremium 0.39 1.47 Premium support products have a 47% higher expected weekly incident rate compared to Basic support products, holding age constant.

Assumption Check: Overdispersion

The core assumption of Poisson is that the mean equals the variance (\(\text{Var}(Y) = \mu\)). If the variance is much larger than the mean, we have overdispersion, which leads to deflated standard errors (and inflated significance).

  • Diagnostic: Check the Residual Deviance against the Degrees of Freedom (DoF) from the summary().
# Check for Overdispersion: Ratio of Residual Deviance / Residual DoF
deviance_ratio <- summary(poisson_model)$deviance / summary(poisson_model)$df.residual
print(paste("Deviance Ratio:", round(deviance_ratio, 2)))
[1] "Deviance Ratio: 1.28"
# Rule of Thumb: If ratio > 1.5, consider switching to a Quasi-Poisson or Negative Binomial model.

Graphical Summary

A key visualization is comparing the predicted rates to the observed counts.

# Add predictions (expected counts) back to the data
data_poisson <- data_poisson %>%
  mutate(Expected_Incidents = predict(poisson_model, type = "response"))

# Plot Expected vs. Observed Counts
ggplot(data_poisson, aes(x = Expected_Incidents, y = Incidents_Weekly, color = Support_Level)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") +
  labs(
    title = "Poisson Model Fit: Expected vs. Observed Incident Counts",
    x = "Expected Weekly Incidents (Model Prediction)",
    y = "Observed Weekly Incidents",
    color = "Support Level"
  ) +
  theme_minimal()

Points on the graph above the line indicate where our model underestimates the count, points below the line indicate where our model overestimates the count, and points exactly on the line are perfect predictions.

Modeling Skewed Continuous Data with Gamma Regression

Insurance Claim Amounts: An insurance company wants to model the size of financial claims paid out to customers. Claim amounts are continuous and positive (cannot be negative). They are highly right-skewed (most claims are small, but a few are very large). Positive, continuous, skewed data types can be modeled with the Gamma GLM Family.

R Code & Interpretation

# 3. Simulate Gamma data (Claim Amounts)
set.seed(101)
data_gamma <- tibble(
  Customer_Score = runif(100, 500, 800), # e.g., credit/risk score
  Claim_Type = factor(sample(c("Minor", "Major"), 100, replace = TRUE, prob = c(0.7, 0.3))),
  # Generate claim amounts (mean depends on score and claim type)
  mu = exp(7 - 0.005 * Customer_Score + 0.5 * (Claim_Type == "Major")),
  Claim_Amount = rgamma(100, shape = 2, rate = 2/mu) # Gamma parameterization
)

# 4. Fit the Gamma GLM
gamma_model <- glm(
  Claim_Amount ~ Customer_Score + Claim_Type,
  family = Gamma(link = "log"), # Family changes, link stays 'log'
  data = data_gamma
)

summary(gamma_model)

Call:
glm(formula = Claim_Amount ~ Customer_Score + Claim_Type, family = Gamma(link = "log"), 
    data = data_gamma)

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      7.3717659  0.6912186  10.665  < 2e-16 ***
Customer_Score  -0.0048348  0.0009964  -4.852 4.66e-06 ***
Claim_TypeMinor -0.5678030  0.1919092  -2.959  0.00388 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Gamma family taken to be 0.7735794)

    Null deviance: 84.850  on 99  degrees of freedom
Residual deviance: 63.418  on 97  degrees of freedom
AIC: 953.49

Number of Fisher Scoring iterations: 5

Interpretation of Coefficients (Rate Ratios Again!):

Since we used the Log Link again, the interpretation template is the same as Poisson and Logistic Regression—exponentiate the coefficient for a multiplicative effect.

# Exponentiate the coefficients
exp(coef(gamma_model))
    (Intercept)  Customer_Score Claim_TypeMinor 
   1590.4398662       0.9951769       0.5667693 

Since the Gamma model used the Log Link (family = Gamma(link = "log")), we interpret the exponentiated coefficients (\(\exp(\beta)\)) as a multiplicative ratio relative to the expected mean (claim amount).

Coefficient Estimate \((\beta)\) \(\exp(\beta)\) Interpretation (Multiplicative)
(Intercept) 7.3717659 1590.44 This is the expected claim amount (in currency units, e.g., dollars) when all predictors are zero (i.e., Customer Score = 0 and Claim Type = Major). This value is usually theoretical since a score of 0 is unrealistic.
Customer_Score -0.0048348 0.9951769 For every one-unit increase in the Customer Score, the expected claim amount is multiplied by 0.995, or decreases by approximately 0.48% (\(1 - 0.9951769 = 0.00482\)). This suggests higher customer scores are associated with slightly lower claim amounts.
Claim_TypeMinor -0.5678030 0.5667693 The baseline (omitted) category is Claim_TypeMajor. Therefore, the expected claim amount for a Minor claim is 0.567 times the expected claim amount for a Major claim, holding Customer Score constant. In other words, Minor claims are expected to be about 43.3% lower (\(1 - 0.5667693\)) than Major claims.

Assumption Check: Homoscedasticity vs. Variance Function

Unlike Linear Regression where variance is constant (homoscedasticity), GLMs allow variance to be a function of the mean (\(\text{Var}(Y) = V(\mu)\)).

  • For Normal, \(V(\mu) = \sigma^2\) (Constant variance).
  • For Poisson, \(V(\mu) = \mu\) (Variance increases with the mean).
  • For Gamma, \(V(\mu) = \phi \mu^2\) (Variance increases with the square of the mean).

Diagnostic: Plotting the standardized residuals against the fitted values (mean).

plot(
  fitted(gamma_model),
  resid(gamma_model, type = "deviance"),
  xlab = "Fitted Values (Expected Claim Amount)",
  ylab = "Deviance Residuals",
  main = "Gamma Model Residuals vs. Fitted Values"
)
abline(h = 0, col = "grey", lty = 2)

We look for random scatter around zero. We are not looking for constant variance (as in Linear Regression), but rather ensuring there is no obvious pattern (like a cone shape, which would indicate a poor fit for the chosen variance function, \(V(\mu) = \phi \mu^2\)).

Conclusion

The Transferable Analytical Capacity

You now know three different GLMs, and yet, the core skills you use are identical:

Task Linear/Logistic/Poisson/Gamma
Model Specification Y ~ X1 + X2 (The systematic component is the same)
P-Value Interpretation A high Wald \(\chi^2\) statistic (t-statistic for Linear) results in a low p-value, indicating the predictor is relevant to the link function of the mean.
Coefficient Interpretation Identity Link (Linear): Additive effect. \(\beta_1\) is the change in \(Y\) per unit \(X_1\).
Log Link (Logistic, Poisson, Gamma): Multiplicative effect. \(\exp(\beta_1)\) is the ratio of means/rates/odds per unit \(X_1\).

Whether you are dealing with Log Odds (Logistic), Log Rates (Poisson), or Log Means (Gamma), the process of exponentiating the coefficient to get a multiplicative ratio is the same. You just need to adjust your language (odds, rate, or mean) to match the data type!

---
title: "Generalized Linear Models (GLMs)"
output: 
  html_notebook:
    toc: true
    toc_float: true
    theme: cerulean
---

For weeks, we’ve built analytical capacity using two core models: **Linear Regression (Normal Distribution)** and **Logistic Regression (Binomial Distribution)**. Today, we look under the hood of the `glm()` function in `R` to see how these models are just two members of a much larger, incredibly flexible family: **Generalized Linear Models (GLMs)**.

Your knowledge of Linear and Logistic Regression is the foundation. If you can identify the type of data you have, you can select the correct GLM family, and the rest of the process (fitting, interpreting, diagnosing) follows familiar rules.


## The GLM Framework: The Three Pillars 

Every GLM is defined by three fundamental components. In `glm()`, you only need to specify the **Random Component** (via `family`).

| Component     | Description                                                      | Linear Regression (Normal)      | Logistic Regression (Binomial)           | New GLMs             |
|---------------|------------------------------------------------------------------|---------------------------------|------------------------------------------|----------------------|
| 1. **Random**     | The probability distribution of the response variable ($Y$).     | *Normal* (`family = gaussian`)      | *Binomial* (`family = binomial`)             | *Poisson*, *Gamma*, etc. |
| 2. **Systematic** | The linear predictor: $X\beta$ (the familiar part).              | $\beta_0 + \beta_1 X_1 + \dots$ | $\beta_0 + \beta_1 X_1 + \dots$          | **Same!**                |
| 3. **Link**       | The function that links the mean of $Y$ to the linear predictor. | *Identity* ($\mu = \eta$)         | *Logit* ($\log(\frac{\mu}{1-\mu}) = \eta$) | *Log*, *Inverse*, etc.   |


For any new GLM, we only need to learn:

 - Which `family` to use (based on the data type).  
 - What the `link` function does to the mean $\mu$.
 
## Modeling Counts with Poisson Regression 

**Call Center Incident Reports:** A company wants to model the number of customer support incidents reported per week for different product lines. Since the response variable is a count (0, 1, 2, 3, ...), the Normal (Gaussian) distribution is inappropriate, and the Binomial (Logistic) distribution is for proportions. Count data is exactly applicable to the Poisson GLM Family.

### The Poisson Link Function

The Poisson distribution requires the mean ($\mu$) to be positive. The standard link function is the Log Link:

$$\log(\mu) = \beta_0 + \beta_1 X_1 + \dots$$

This is identical to the Logit link in Logistic Regression! This means the coefficients will be interpreted similarly: multiplicatively.

### R Code & Interpretation

We'll simulate some data for demonstration.

```{r}
library(tidyverse)

# 1. Simulate data (Incident Reports)
set.seed(42)
data_poisson <- tibble(
  Product_Age_Years = runif(100, 1, 10),
  Support_Level = factor(sample(c("Basic", "Premium"), 100, replace = TRUE, prob = c(0.7, 0.3))),
  # Generate counts (rate depends on age and support)
  lambda = exp(1.5 - 0.2 * Product_Age_Years + 0.5 * (Support_Level == "Premium")),
  Incidents_Weekly = rpois(100, lambda)
)

# 2. Fit the Poisson GLM
poisson_model <- glm(
  Incidents_Weekly ~ Product_Age_Years + Support_Level,
  family = poisson(link = "log"), # The only change from lm() or glm(..., family=binomial)
  data = data_poisson
)

summary(poisson_model)
```

### Interpretation of Coefficients:

Just like in Logistic Regression, we must exponentiate the coefficients to return them to the original scale of the response variable (counts). Exponentiated coefficients are interpreted as Rate Ratios.

```{r}
# Exponentiate the coefficients
exp(coef(poisson_model))
```

| Coefficient          | Estimate | Exp(Estimate) | Interpretation                                                                                                                     |
|----------------------|----------|---------------|------------------------------------------------------------------------------------------------------------------------------------|
| `Product_Age_Years`    | -0.19    | __0.82__          | A 1-year increase in product age is associated with an __18% decrease__ ($\approx 1 - 0.82$) in the expected weekly incident rate.     |
| `Support_LevelPremium` | 0.39     | __1.47__          | Premium support products have a __47% higher__ expected weekly incident rate compared to Basic support products, holding age constant. |


### Assumption Check: Overdispersion

The core assumption of Poisson is that the mean equals the variance ($\text{Var}(Y) = \mu$). If the variance is much larger than the mean, we have overdispersion, which leads to deflated standard errors (and inflated significance). 

 - __Diagnostic:__ Check the Residual Deviance against the Degrees of Freedom (DoF) from the `summary()`.
 
```{r}
# Check for Overdispersion: Ratio of Residual Deviance / Residual DoF
deviance_ratio <- summary(poisson_model)$deviance / summary(poisson_model)$df.residual
print(paste("Deviance Ratio:", round(deviance_ratio, 2)))

# Rule of Thumb: If ratio > 1.5, consider switching to a Quasi-Poisson or Negative Binomial model.
```

### Graphical Summary

A key visualization is comparing the predicted rates to the observed counts.

```{r}
# Add predictions (expected counts) back to the data
data_poisson <- data_poisson %>%
  mutate(Expected_Incidents = predict(poisson_model, type = "response"))

# Plot Expected vs. Observed Counts
ggplot(data_poisson, aes(x = Expected_Incidents, y = Incidents_Weekly, color = Support_Level)) +
  geom_point(alpha = 0.6) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey") +
  labs(
    title = "Poisson Model Fit: Expected vs. Observed Incident Counts",
    x = "Expected Weekly Incidents (Model Prediction)",
    y = "Observed Weekly Incidents",
    color = "Support Level"
  ) +
  theme_minimal()
```
Points on the graph above the line indicate where our model underestimates the count, points below the line indicate where our model overestimates the count, and points exactly on the line are perfect predictions. 

## Modeling Skewed Continuous Data with Gamma Regression 

__Insurance Claim Amounts__: An insurance company wants to model the size of financial claims paid out to customers. Claim amounts are *continuous* and *positive* (cannot be negative). They are *highly right-skewed* (most claims are small, but a few are very large). Positive, continuous, skewed data types can be modeled with the __Gamma__ GLM Family.

### The Gamma Link Function

The Gamma distribution commonly uses the **Log Link** (or sometimes the **Inverse link**). We'll stick with Log as it allows for the multiplicative interpretation we just learned.

$$\log(\mu) = \beta_0 + \beta_1 X_1 + \dots$$

### R Code & Interpretation
```{r}
# 3. Simulate Gamma data (Claim Amounts)
set.seed(101)
data_gamma <- tibble(
  Customer_Score = runif(100, 500, 800), # e.g., credit/risk score
  Claim_Type = factor(sample(c("Minor", "Major"), 100, replace = TRUE, prob = c(0.7, 0.3))),
  # Generate claim amounts (mean depends on score and claim type)
  mu = exp(7 - 0.005 * Customer_Score + 0.5 * (Claim_Type == "Major")),
  Claim_Amount = rgamma(100, shape = 2, rate = 2/mu) # Gamma parameterization
)

# 4. Fit the Gamma GLM
gamma_model <- glm(
  Claim_Amount ~ Customer_Score + Claim_Type,
  family = Gamma(link = "log"), # Family changes, link stays 'log'
  data = data_gamma
)

summary(gamma_model)
```

### Interpretation of Coefficients (Rate Ratios Again!):

Since we used the Log Link again, the interpretation template is the same as Poisson and Logistic Regression—exponentiate the coefficient for a multiplicative effect.

```{r}
# Exponentiate the coefficients
exp(coef(gamma_model))
```

Since the Gamma model used the Log Link (`family = Gamma(link = "log")`), we interpret the exponentiated coefficients ($\exp(\beta)$) as a *multiplicative ratio* relative to the expected mean (claim amount).

| Coefficient     | Estimate $(\beta)$ | $\exp(\beta)$    | Interpretation (Multiplicative)                                                                                                                                                                                                                                                                                |
|-----------------|--------------|-----------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `(Intercept)`     | 7.3717659    | 1590.44   | This is the expected claim amount (in currency units, e.g., dollars) when all predictors are zero (i.e., Customer Score = 0 and Claim Type = `Major`). This value is usually theoretical since a score of 0 is unrealistic.                                                                                      |
| `Customer_Score`  | -0.0048348   | **0.9951769** | For every **one-unit increase** in the Customer Score, the expected claim amount is multiplied by 0.995, or decreases by approximately **0.48%** ($1 - 0.9951769 = 0.00482$). This suggests higher customer scores are associated with slightly lower claim amounts.                                                   |
| `Claim_TypeMinor` | -0.5678030   | **0.5667693** | The baseline (omitted) category is `Claim_TypeMajor`. Therefore, the expected claim amount for a `Minor` claim is **0.567 times** the expected claim amount for a `Major` claim, holding Customer Score constant. In other words, Minor claims are expected to be about **43.3% lower** ($1 - 0.5667693$) than Major claims. |



### Assumption Check: Homoscedasticity vs. Variance Function

Unlike Linear Regression where variance is constant (homoscedasticity), GLMs allow variance to be a function of the mean ($\text{Var}(Y) = V(\mu)$). 

 - For Normal, $V(\mu) = \sigma^2$ (Constant variance). 
 - For Poisson, $V(\mu) = \mu$ (Variance increases with the mean).  
 - For Gamma, $V(\mu) = \phi \mu^2$ (Variance increases with the square of the mean).
 
__Diagnostic:__ Plotting the standardized residuals against the fitted values (mean).

```{r}
plot(
  fitted(gamma_model),
  resid(gamma_model, type = "deviance"),
  xlab = "Fitted Values (Expected Claim Amount)",
  ylab = "Deviance Residuals",
  main = "Gamma Model Residuals vs. Fitted Values"
)
abline(h = 0, col = "grey", lty = 2)
```

We look for random scatter around zero. We are not looking for constant variance (as in Linear Regression), but rather ensuring there is no obvious pattern (like a cone shape, which would indicate a poor fit for the chosen variance function, $V(\mu) = \phi \mu^2$).

## Conclusion

The Transferable Analytical Capacity

You now know three different GLMs, and yet, the core skills you use are identical:

| Task                       | Linear/Logistic/Poisson/Gamma                                                                                                                            |
|----------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------|
| Model Specification        | `Y ~ X1 + X2` (The systematic component is the same)                                                                                                       |
| *P*-Value Interpretation     | A high Wald $\chi^2$ statistic (t-statistic for Linear) results in a low p-value, indicating the predictor is relevant to the link function of the mean. |
| Coefficient Interpretation | Identity Link (Linear): Additive effect. $\beta_1$ is the change in $Y$ per unit $X_1$.                                                                  |
|                            | Log Link (Logistic, Poisson, Gamma): Multiplicative effect. $\exp(\beta_1)$ is the ratio of means/rates/odds per unit $X_1$.                             |

Whether you are dealing with Log Odds (Logistic), Log Rates (Poisson), or Log Means (Gamma), the process of exponentiating the coefficient to get a multiplicative ratio is the same. You just need to adjust your language (odds, rate, or mean) to match the data type!

