GLMs

The normal distribution plays a central role when making inferences for ordinary linear regression models. These models assume that the response variable y follows a normal distribution.

In some practical situations, however, the response variable may be discrete, such as a count of events. Another possibility is a binary response, where the outcome is either success or failure (i.e., 0 or 1, yes or no).

In such situations, generalized linear models (GLMs) are often better suited than ordinary linear regression models. They allow us to fit regression models for response data whose response variable Y follows a distribution in the exponential family, which includes the normal, binomial, Poisson, geometric, negative binomial, exponential, gamma, and inverse Gaussian distributions.
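In R, the distribution is selected through the family argument of glm(). As a minimal sketch (the data frame df and the variables y and x here are hypothetical, not part of the GLOW data), the same interface covers several members of the exponential family:

    ## The family argument picks the assumed response distribution and its
    ## default link function; df, y, and x are placeholder names.
    fit.normal   <- glm(y ~ x, data = df, family = gaussian)             # ordinary linear regression
    fit.binary   <- glm(y ~ x, data = df, family = binomial)             # logistic regression (logit link)
    fit.counts   <- glm(y ~ x, data = df, family = poisson)              # Poisson regression (log link)
    fit.positive <- glm(y ~ x, data = df, family = Gamma(link = "log"))  # gamma regression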

The usual GLMs assume that the observations are independent. When this assumption is violated, it is appropriate to fit generalized estimating equations (GEEs) to account for the correlation structure between observations.
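As an aside, a GEE can be fitted in R with, for example, the geepack package. The sketch below uses hypothetical longitudinal data (long_data, with outcome y, covariates x1 and x2, and cluster identifier subject_id); the working correlation structure is an assumption made purely for illustration:

    library(geepack)  ## provides geeglm()
    
    ## Repeated binary outcomes within subjects, clustered by subject_id
    gee.fit <- geeglm(y ~ x1 + x2,
                      family = binomial,
                      data   = long_data,
                      id     = subject_id,
                      corstr = "exchangeable")  ## assumed working correlation
    summary(gee.fit)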

Two very important members of the family of generalized linear models are logistic regression and Poisson regression. They both find extensive application in biological, biomedical, and environmental problems. I will start with logistic regression.

For illustration purposes, I shall use data from the GLOW study, which is publicly available.

In the Global Longitudinal study of Osteoporosis in Women (GLOW), data were collected from tens of thousands of women worldwide concerning risk factors for osteoporosis, and the women were followed up to determine whether they suffered a bone fracture of any kind within the subsequent year. The sample used here has information on 500 women, n1 = 125 of whom had a fracture during the first year of follow-up and n0 = 375 who did not.

The objective of the study was to predict whether a participant will suffer a fracture during follow-up.

The study measured several variables, but those thought to be most closely related to fracture in the first year were: Age (AGE), Weight (WEIGHT), Prior Fracture (PRIORFRAC), Early Menopause (PREMENO), and Self-Reported Risk of Fracture (RATERISK).

Logistic Regression Models.

What distinguishes a logistic regression model from the linear regression model is that the outcome variable in logistic regression is binary or dichotomous. In this example, the response/outcome variable is fracture in the first year of follow-up (FRACTURE: yes or no).

The most commonly used expression for the logistic regression model is:

\[\pi_i=\frac{\exp(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip})}{1+\exp(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip})}\] or, after some algebraic manipulation, may as well be written as:

\[\log\left\{\frac{\pi_i}{1-\pi_i}\right\}= \beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip} \]

This transformation of \(\pi_i\) is often referred to as the logit transformation. The subscript \(i\) refers to a particular observation.
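To make the algebraic manipulation explicit: from the first expression,

\[1-\pi_i=\frac{1}{1+\exp(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip})}, \qquad \frac{\pi_i}{1-\pi_i}=\exp(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip}),\]

and taking the natural logarithm of the odds \(\pi_i/(1-\pi_i)\) gives the logit form shown above.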

How to estimate regression parameters for a Logistic Regression Model.

  1. Maximum likelihood estimation is used to estimate the regression parameters \(\beta_0,\beta_1,\ldots,\beta_p\) of the logistic regression model. However, for now I will avoid getting too technical about how the log-likelihood function is derived, and instead simply note that the glm() function ("glm" stands for "generalized linear model") in R implements this parameter estimation procedure with ease.

    Below is how to fit a logistic regression model using the glm() function. I have added comments to the commands to help you follow along.

    rm(list = ls())                                     ### clear the workspace
    install.packages("tidyverse", dependencies = TRUE)  #### the tidyverse suite includes the readr package for importing data
    library(readr)
    GLOW_Study <- read_csv("C:/Users/USER/Desktop/GLOW Study.csv")
    View(GLOW_Study) #### to have a preview of the data
    
    data <- GLOW_Study ### this is our data
    
    ## FITTING THE LOGISTIC REGRESSION MODEL
    ## family = binomial(link = "logit") tells glm() to fit a logistic regression
    glow.model <- glm(data$FRACTURE ~ data$AGE + data$WEIGHT + data$PRIORFRAC +
                        data$PREMENO + factor(data$RATERISK),
                      family = binomial(link = "logit"))
    glow.model
    
    summary(glow.model)

    The summary() function gives more information about the model, that is:

    ### output of summary()
    Call:
    glm(formula = data$FRACTURE ~ data$AGE + data$WEIGHT + data$PRIORFRAC + 
        data$PREMENO + factor(data$RATERISK))
    
    Deviance Residuals: 
         Min        1Q    Median        3Q       Max  
    -0.58108  -0.25440  -0.16946   0.09002   0.98256  
    
    Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
    (Intercept)            -0.5211077  0.2025130  -2.573 0.010367 *  
    data$AGE                0.0088605  0.0023033   3.847 0.000135 ***
    data$WEIGHT             0.0006767  0.0011941   0.567 0.571168    
    data$PRIORFRAC          0.1411731  0.0459970   3.069 0.002265 ** 
    data$PREMENO            0.0263517  0.0480023   0.549 0.583277    
    factor(data$RATERISK)2  0.0822376  0.0446741   1.841 0.066247 .  
    factor(data$RATERISK)3  0.1487330  0.0485318   3.065 0.002299 ** 
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    (Dispersion parameter for gaussian family taken to be 0.1732265)
    
        Null deviance: 93.750  on 499  degrees of freedom
    Residual deviance: 85.401  on 493  degrees of freedom
      (1 observation deleted due to missingness)
    AIC: 551.31
    
    Number of Fisher Scoring iterations: 2

    From the output, we see that the estimated logistic regression model is:

    \[\text{logit}(\hat{\pi}) = -0.5211077 + 0.0088605\,\text{AGE} + 0.0006767\,\text{WEIGHT} + 0.1411731\,\text{PRIORFRAC} + 0.0263517\,\text{PREMENO} + 0.0822376\,\text{RATERISK}_2 + 0.1487330\,\text{RATERISK}_3\]
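As a quick follow-up, here is a minimal sketch of how the fitted coefficients might be turned into odds ratios and predicted probabilities (this assumes glow.model was fitted with family = binomial as in the code above):

    ## Exponentiating the coefficients gives estimated odds ratios
    exp(coef(glow.model))
    
    ## Wald-type 95% confidence intervals on the odds-ratio scale
    exp(confint.default(glow.model))
    
    ## Predicted probabilities of fracture for the women in the data
    head(predict(glow.model, type = "response"))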

Once the model has been fitted, we then assess its significance.

To assess the significance of the variables in the model, statistical hypotheses are formulated and tested to determine whether the independent variables in the model are "significantly" related to the outcome variable.
One approach is to answer the question: does the model that includes the variable in question tell us more about the outcome (or response) variable than a model that does not include that variable? To answer this question we compare two models, the reduced model (without the variable) versus the full model (with the variable). The Wald test or the likelihood ratio test (LRT) is used to perform these comparisons.

  1. For the Wald test, we proceed as follows.

  2. For the likelihood ratio test (LRT), we proceed as follows.
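Before going through each test in detail, here is a compact sketch of how both comparisons can be carried out in R (again assuming glow.model was fitted with family = binomial; the choice of PREMENO as the variable to drop is purely for illustration):

    ## Wald tests: estimates, standard errors, test statistics and p-values
    summary(glow.model)$coefficients
    
    ## Likelihood ratio test: compare the full model with a reduced model
    ## obtained by dropping PREMENO
    reduced.model <- update(glow.model, . ~ . - data$PREMENO)
    anova(reduced.model, glow.model, test = "Chisq")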