Introduction

In statistics, a generalized linear model (GLM) is a flexible generalization of ordinary linear regression. The GLM generalizes linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.

There are three components to any GLM:

Random Component – refers to the probability distribution of the response variable (Y); e.g., a normal distribution for Y in linear regression, or a binomial distribution for Y in binary logistic regression. Also called the noise model or error model; it describes how random error is added to the prediction that comes out of the link function.

Systematic Component - specifies the explanatory variables \((X_1, X_2, \ldots, X_k)\) in the model, more specifically their linear combination in creating the so-called linear predictor; e.g., \(\beta_0X_0+\beta_1X_1+\beta_2X_2\), as we have seen in linear regression, or as we will see in logistic regression in this lesson.

Link Function (\(\eta\)) - specifies the link between the random and systematic components. It says how the expected value of the response relates to the linear predictor of the explanatory variables; e.g., \(\eta=E(Y_i)\) for linear regression, or \(\eta=\mathrm{logit}(\pi)\) for logistic regression.

A generalized linear model with a binomial random component and the canonical link (the logit link, i.e., log odds) is called logistic regression.
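
For a binary response with success probability \(\pi\), this model can be written as

\[ \mathrm{logit}(\pi)=\ln\left(\frac{\pi}{1-\pi}\right)=\beta_0+\beta_1X_1+\cdots+\beta_kX_k \]
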
In R, the glm function is used for logistic regression with the following arguments:

- formula: the model formula specifying the predictors (Systematic Component)
- family: "binomial" must be chosen (Random Component)
- link: "logit" (the default for the binomial family) (Link Component)

Some other family and link function combinations are as follows (a sketch of such calls appears below):

- family: "gaussian"; link: "identity"
- family: "Gamma"; link: "inverse"
- family: "poisson"; link: "log"
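
As a minimal sketch of such calls (the vectors x, y_cont, and y_count below are made-up placeholders, not part of the snoring data analyzed later):

x <- rnorm(20)                                        # placeholder predictor
y_cont <- 1 + 2 * x + rnorm(20)                       # continuous response
y_count <- rpois(20, lambda = exp(0.5 + 0.3 * x))     # count response
glm(y_cont ~ x, family = gaussian(link = "identity")) # ordinary linear regression
glm(y_count ~ x, family = poisson(link = "log"))      # Poisson regression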

Saturated Model: a saturated model has a parameter for every observation, so it fits the data perfectly. It is not a useful model in any statistical sense; rather, it serves as a baseline for comparison.

Drop-in-Deviance Test: used for model comparison; it is a likelihood ratio test.

Likelihood Ratio Test:

\[ \text{LRT} = -2\ln\left(\frac{\text{maximized likelihood under constraints}}{\text{maximized likelihood without constraints}}\right) \]

where \(H_0\): the simpler (constrained) model holds, and \(H_A\): the more complex model is needed. Writing \(L_{\text{simple}}\) and \(L_{\text{complex}}\) for the two maximized likelihoods:

\[ \text{LRT} = -2\ln L_{\text{simple}} + 2\ln L_{\text{complex}} = D_{\text{simple}} - D_{\text{complex}} \]

The last equality holds because the deviance of a GLM is \(D = -2(\ln L_{\text{model}} - \ln L_{\text{saturated}})\), so the saturated-model term cancels in the difference.

Data: a 4 x 2 contingency table cross-classifying snoring frequency (four ordered levels) with the presence of heart disease.

Objective:

In this study, I compare generalized linear models ranging from the most complex (saturated) to the simplest (null) model to evaluate the relationship between heart disease and snoring.

Methods to use:

- Ordered categorical variables
- GLM models (saturated to null)
- Drop-in-deviance test to compare the models



Data Set

library(ggplot2)
# Create a 4 x 2 contingency table: snoring level by heart disease status
heartDisease <-
  array(c(24, 35, 21, 30, 1355, 603, 192, 224),
        dim = c(4, 2),
        dimnames = list(
          Snoring = c("Never", "Occasional", "Almost Always", "Always"),
          HeartDisease = c("Yes", "No")
        ))

heartDisease
##                HeartDisease
## Snoring         Yes   No
##   Never          24 1355
##   Occasional     35  603
##   Almost Always  21  192
##   Always         30  224

Converting ordinal categorical variables to numeric scores:

#Scores: "Never" =0, "Occasional"=2, "Almost Always"=4, "Always"=5
Snoring_Score <- c(0, 2, 4, 5)

In this case, we have N=4 binomial observations, with different sample sizes \(n_i\) for each.

Method 1: Saturated model

The saturated model gives \[ \hat\theta_i=\frac{\mbox{number Yes}}{\mbox{number Yes}+\mbox{number No}} \] for \(i=1,2,3,4\).

Compute the estimated probabilities for this saturated model and the corresponding empirical logits (data) and plot the logits versus snoring score:

# Compute n_i: row sums = number of subjects at each snoring level
n_i <- apply(heartDisease, MARGIN = 1, FUN = sum)
Empirical_Prob <- heartDisease[, 1] / n_i
Empirical_Logit <- log(Empirical_Prob / (1 - Empirical_Prob))
plot(Snoring_Score, Empirical_Logit, pch = 16, col = "blue")

The saturated model has four parameters. We can fit the saturated model and see that it fits perfectly:

sat <- glm(heartDisease ~ factor(Snoring_Score), family = binomial(link = logit))
summary(sat)
## 
## Call:
## glm(formula = heartDisease ~ factor(Snoring_Score), family = binomial(link = logit))
## 
## Deviance Residuals: 
## [1]  0  0  0  0
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -4.0335     0.2059 -19.587  < 2e-16 ***
## factor(Snoring_Score)2   1.1869     0.2695   4.404 1.06e-05 ***
## factor(Snoring_Score)4   1.8205     0.3086   5.899 3.65e-09 ***
## factor(Snoring_Score)5   2.0231     0.2832   7.144 9.10e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6.5904e+01  on 3  degrees of freedom
## Residual deviance: 8.7930e-14  on 0  degrees of freedom
## AIC: 28.253
## 
## Number of Fisher Scoring iterations: 3

Deviance is a measure of the goodness of fit of a model: larger values indicate a worse fit.

The null deviance and the residual deviance are used to test whether the explanatory variables provide a statistically significant improvement in fit over the intercept-only model.
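
Both quantities are stored on the fitted glm object; a quick sketch of extracting them:

sat$null.deviance # 65.904: deviance of the intercept-only fit
sat$deviance      # essentially zero: the saturated model fits perfectly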

AIC: Akaike’s Information Criterion, AIC = 2k - 2*log-likelihood, where k is the number of estimated parameters. It is useful when comparing models: the smaller the AIC value, the better the fit, after penalizing for the number of parameters.
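
As a sanity check, a minimal sketch that recomputes the AIC of the saturated model from its log-likelihood (it should reproduce the 28.253 reported above):

ll <- logLik(sat)                       # maximized log-likelihood, with a df attribute
2 * attr(ll, "df") - 2 * as.numeric(ll) # 2k - 2*log-likelihood
AIC(sat)                                # built-in equivalent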

Note that Fisher Scoring is the iterative numerical optimization method used to get the parameter estimates.

Notice that the saturated model reproduces the simple proportions:

predict.glm(sat, type = "response")
##         Never    Occasional Almost Always        Always 
##    0.01740392    0.05485893    0.09859155    0.11811024
Empirical_Prob
##         Never    Occasional Almost Always        Always 
##    0.01740392    0.05485893    0.09859155    0.11811024

Method 2: Cubic Model (Another way to fit the Saturated Model)

Another way to fit a saturated model would be to treat Snoring_Score as a continuous variable and use a cubic polynomial (remember that an intercept is included by default):

cubic <- glm(heartDisease ~ poly(Snoring_Score, 3), family = binomial(link = logit))
summary(cubic)
## 
## Call:
## glm(formula = heartDisease ~ poly(Snoring_Score, 3), family = binomial(link = logit))
## 
## Deviance Residuals: 
## [1]  0  0  0  0
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.77587    0.10101 -27.480  < 2e-16 ***
## poly(Snoring_Score, 3)1  1.54595    0.20363   7.592 3.15e-14 ***
## poly(Snoring_Score, 3)2 -0.31305    0.19101  -1.639    0.101    
## poly(Snoring_Score, 3)3  0.03738    0.21094   0.177    0.859    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance:  6.5904e+01  on 3  degrees of freedom
## Residual deviance: -7.6605e-14  on 0  degrees of freedom
## AIC: 28.253
## 
## Number of Fisher Scoring iterations: 3
predict(cubic, type = "response")
##         Never    Occasional Almost Always        Always 
##    0.01740392    0.05485893    0.09859155    0.11811024
Empirical_Prob
##         Never    Occasional Almost Always        Always 
##    0.01740392    0.05485893    0.09859155    0.11811024

Note: the poly function in R does not just create \(x, x^2, \ldots, x^p\); it actually creates orthogonal polynomials. Hence, the coefficient of poly(Snoring_Score, 3)3 is not simply the coefficient of Snoring_Score^3.
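
To see the difference, a quick sketch (cubic_raw is a hypothetical name introduced here): raw = TRUE requests the plain powers, and since both parameterizations span the same column space, the fitted values agree:

poly(Snoring_Score, 3)              # orthogonal columns (the default)
poly(Snoring_Score, 3, raw = TRUE)  # plain x, x^2, x^3
cubic_raw <- glm(heartDisease ~ poly(Snoring_Score, 3, raw = TRUE),
                 family = binomial(link = logit))
all.equal(fitted(cubic), fitted(cubic_raw)) # same fitted values, different coefficients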



Method 3: Quadratic model

Now let’s fit a simpler model than the saturated model. How about a quadratic polynomial on the logit scale?

The formula and family arguments specify that the response heartDisease will be modeled as a binomial random variable whose mean is a function of Snoring_Score and an intercept, included by default.

quad <- glm(heartDisease ~ poly(Snoring_Score, 2), family = binomial(link = logit))
summary(quad)
## 
## Call:
## glm(formula = heartDisease ~ poly(Snoring_Score, 2), family = binomial(link = logit))
## 
## Deviance Residuals: 
##         Never     Occasional  Almost Always         Always  
##      -0.02616        0.07404       -0.14496        0.06620  
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -2.77268    0.09927 -27.932  < 2e-16 ***
## poly(Snoring_Score, 2)1  1.54786    0.20310   7.621 2.51e-14 ***
## poly(Snoring_Score, 2)2 -0.31751    0.18970  -1.674   0.0942 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.904481  on 3  degrees of freedom
## Residual deviance:  0.031563  on 1  degrees of freedom
## AIC: 26.284
## 
## Number of Fisher Scoring iterations: 3

The residual deviance is the likelihood ratio test statistic comparing the current model to the saturated model.

The quadratic model fits extremely well: the residual deviance (the test statistic) is much smaller than its degrees of freedom, which indicates a very good fit.
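
A sketch of verifying this deviance directly from the fitted objects, using the sat model from above:

# residual deviance = 2 * (saturated log-likelihood - model log-likelihood)
as.numeric(2 * (logLik(sat) - logLik(quad))) # ~0.0316
deviance(quad)                               # matches the summary output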

predict(quad, type = "response")
##         Never    Occasional Almost Always        Always 
##    0.01749620    0.05419401    0.10157902    0.11677411
Empirical_Prob
##         Never    Occasional Almost Always        Always 
##    0.01740392    0.05485893    0.09859155    0.11811024

We can also check the deviance residuals: they are all smaller than 1 in absolute value, which is further evidence of a good fit.
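
They can be extracted directly from the fitted model (this should match the Deviance Residuals block in the summary above):

residuals(quad, type = "deviance")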

The Null deviance is for a model with only the intercept. The Residual deviance is for our model with \(p+1=3\) parameters for the intercept, linear, and quadratic terms. Because we have grouped data, we can conduct the formal lack-of-fit test, comparing our quadratic model to the saturated model.

The null hypothesis is that the quadratic model fits adequately and the alternative is that it does not (it is oversimplified). We conduct the test by comparing the residual deviance to \(\chi^2_{4-3}=\chi^2_1\).

A chi-square test using the difference between the two deviances (null minus residual) indicates the overall significance of the model.

pchisq(quad$deviance, df = 1, lower.tail = FALSE)
## [1] 0.8589901

The p-value indicates no evidence of lack-of-fit.

Method 4: Linear model

Now let’s fit an even simpler model: linear instead of quadratic:

line <- glm(heartDisease ~ Snoring_Score, family = binomial(link = logit))
summary(line)
## 
## Call:
## glm(formula = heartDisease ~ Snoring_Score, family = binomial(link = logit))
## 
## Deviance Residuals: 
##         Never     Occasional  Almost Always         Always  
##       -0.8346         1.2521         0.2758        -0.6845  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -3.86625    0.16621 -23.261  < 2e-16 ***
## Snoring_Score  0.39734    0.05001   7.945 1.94e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.9045  on 3  degrees of freedom
## Residual deviance:  2.8089  on 2  degrees of freedom
## AIC: 27.061
## 
## Number of Fisher Scoring iterations: 4

The residual deviance is 2.8, which is close to its 2 degrees of freedom; the linear model is still a good fit.

predict(line, type = "response")
##         Never    Occasional Almost Always        Always 
##    0.02050742    0.04429511    0.09305411    0.13243885
Empirical_Prob
##         Never    Occasional Almost Always        Always 
##    0.01740392    0.05485893    0.09859155    0.11811024
pchisq(line$deviance, df = 2, lower.tail = FALSE)
## [1] 0.2455006

The p-value is not small, so we would not reject the null hypothesis that the simpler model fits: there is no evidence of lack of fit.

Drop in Deviance Test

We could do a drop-in-deviance test to see if the linear model fits as well as the quadratic model:

# Drop-in-deviance test between linear and quadratic model.
line$deviance - quad$deviance
## [1] 2.777349
pchisq(line$deviance - quad$deviance, df = 1, lower.tail = FALSE)
## [1] 0.09560631

This is a bit borderline, but we’d definitely prefer a simpler model for such a small data set (and might not want to get too carried away with the numerical properties of the snoring score).
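
Equivalently, a sketch using base R's anova method for nested GLMs, which carries out the same drop-in-deviance test:

anova(line, quad, test = "Chisq")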

Next, let’s add the fitted logits for our linear specification to our plot of empirical logits:

Predicted_Logit <- predict.glm(line, type = "link")
plot(c(Snoring_Score, Snoring_Score), c(Empirical_Logit, Predicted_Logit), 
     type = "n", ylab = "Logit", xlab = "Snoring Score")
points(Snoring_Score, Empirical_Logit, col = "blue", pch = 16)
points(Snoring_Score, Predicted_Logit, col = "red", pch = 17)

The fitted values of the linear model (triangles) track the empirical logits (dots) quite well.

So, since the linear model fits, can we simplify further?

Method 5: Null model

One last test: could we drop snoring scores from the model and use only the overall intercept? That is, are the odds of heart disease (and hence probability of heart disease) constant across snoring levels?
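
Before testing, a sketch of fitting the intercept-only (null) model explicitly (null_mod is a hypothetical name; its deviance should equal line$null.deviance):

null_mod <- glm(heartDisease ~ 1, family = binomial(link = logit))
deviance(null_mod) # should equal line$null.deviance = 65.9045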

# Drop-in-deviance test between null and linear model.
line$null.deviance - line$deviance
## [1] 63.09557
pchisq(line$null.deviance - line$deviance, df = 1, lower.tail = FALSE)
## [1] 1.969166e-15

The p-value indicates strong evidence of lack of fit for the null model, which means we should not simplify further. The null model does not fit at all; we will keep the snoring score.



References

  1. CSU course notes - Generalized Regression Models (STAA 552)
  2. https://en.wikipedia.org/wiki/Generalized_linear_model
  3. https://online.stat.psu.edu/stat504/lesson/6/6.1
  4. https://www.statmethods.net/advstats/glm.html

