The normal distribution plays a central role when making inference for Ordinary linear regression models.These models assume that the response variable y follows the normal distribution.
In some practical situations however,the response variable may be a discrete variable, such as count of events.Another possibility is the binary response variable where the response variable is either success or failure (i.e., 0 or 1,yes or No).
GLMs are often better models than ordinary linear regression models.They allow us to fit regression models for response data whose response variable Y, follow the exponential family of distributions. The exponential family of distributions include the normal, binomial, Poisson, geometric, negative binomial, exponential, gamma, and inverse normal distributions.
The usual GLMs assumes that the observations are independent, when this assumption is violated, it is appropriate to fit Generalized estimating equations (GEEs) to account for a correlation structure between observations
Two very important members of the family of generalized linear models are logistic regression and Poisson regression. They both find extensive application in biological, biomedical, and environmental problems. I will start with logistic regression.
For illustration purposes, I shall use data from GLOW Study which is publicly available.
The Global Longitudinal study of Osteoporosis in Women (GLOW), data was collected from tens of thousands of women worldwide concerning risk factors for osteoporosis, and they were followed up to determine whether they suffered a bone fracture of any kind within the subsequent year. Sample has information from 500 women, n1 = 125 of whom had a fracture during the first year of follow up and n0 = 375 who did not have a fracture.
Objective of the study was to predict whether a participant will suffer a fracture, during follow up.
The study had several variables measured, but those thought to be closely related to fracture in the first year were: Age, Weight,Prior Fracture (PRIORFRAC), Early Menopause (PREMENO), and Self-Reported Risk of Fracture (RATERISK).
Logistic Regression Models.
What distinguishes a logistic regression model from the linear regression model is that the outcome variable in logistic regression is binary or dichotomous.In this example, we have FRACTURE in the First Year of Follow Up (yes or no) as the response/outcome variable.
The most commonly used expression for logistic regression model is:
\[\pi_i=\frac{exp(\beta_0+\beta_1x_{i1}+......+\beta_1x_{ip})}{1+exp(\beta_0+\beta_1x_{i1}+......+ \beta_1x_{ip} }\] or after some algebraic manipulations may as well be written as:
\[log\{\frac{\pi_i}{1-\pi_i}\}= \beta_0+\beta_1x_{i1}+......+\beta_1x_{ip} \]
This transformation of \(\pi_i\) is often referred to as the logit transformation. The \(i\) subscript is when referring to a particular observation.
How to estimate regression parameters for a Logistic Regression Model.
Maximum likelihood Estimation is used to estimate the regression parameters \(\beta_0,\beta_1,....,\beta_p\) of the logistic regression model.However, for now I will avoid being much technical on how to derive the log-likelihood function,but rather acknowledge that the glm() function (“glm” stands for “generalized linear model”) within R implements this parameter estimation procedure with ease.
Below is how to fit a logistic regression model using the glm() function. I have added comments on the commands to help you understand.
rm(list=ls())
install.packages("tidyverse",dependencies = T)#### tidyverse function comes with readr function for importing data
library(readr)
GLOW_Study <- read_csv("C:/Users/USER/Desktop/GLOW Study.csv")
View(GLOW_Study) #### To have a preview of the data
data<-GLOW_Study ### this is our data
## FITTING THE LOGISTIC REGRESSION MODEL
glow.model<-glm(data$FRACTURE~data$AGE+data$WEIGHT+data$PRIORFRAC+data$PREMENO+factor(data$RATERISK));glow.model
summary(glow.model)
The summary() function gives more information about the model, That is:
### output of summary()
Call:
glm(formula = data$FRACTURE ~ data$AGE + data$WEIGHT + data$PRIORFRAC +
data$PREMENO + factor(data$RATERISK))
Deviance Residuals:
Min 1Q Median 3Q Max
-0.58108 -0.25440 -0.16946 0.09002 0.98256
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5211077 0.2025130 -2.573 0.010367 *
data$AGE 0.0088605 0.0023033 3.847 0.000135 ***
data$WEIGHT 0.0006767 0.0011941 0.567 0.571168
data$PRIORFRAC 0.1411731 0.0459970 3.069 0.002265 **
data$PREMENO 0.0263517 0.0480023 0.549 0.583277
factor(data$RATERISK)2 0.0822376 0.0446741 1.841 0.066247 .
factor(data$RATERISK)3 0.1487330 0.0485318 3.065 0.002299 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.1732265)
Null deviance: 93.750 on 499 degrees of freedom
Residual deviance: 85.401 on 493 degrees of freedom
(1 observation deleted due to missingness)
AIC: 551.31
Number of Fisher Scoring iterations: 2
From the data, we see that the estimated logistic regression model is logit \(\hat{\pi}\) = -0.5211077 +0.0088605Age+0.0006767WEIGHT+0.1411731PRIORFRAC+0.0263517PREMENO+0.0822376RATERISK_EQ_2+0.1487330RATERISK_EQ_3
Once the model has been fitted, we then assess its significance.
To assess significance of the variables in the model, statistical
hypothesis are formulated and tested to determine whether the
independent variables in the model are "significantly" related to the
outcome variable.
One approach is to answer the question that; Does the model
that includes the variable in question tell us more about the outcome
(or response) variable than a model that does not include that
variable? To answer this question we compare two models:
The Reduced model vs full/saturated model. The
Wald test or Likelihood Ratio
Test(LRT) is used perform these tests.
For the Wald test we proceed as follows.
For Likelihood Ratio Test(LRT) we proceed as follows.