\(~\)

Generalized Linear Model:

We have so far created simple and multiple linear models; now I will be introducing generalized linear model. This is a generalization of ordinary linear regression that allows for response variables that have error distribution models other than a normal distribution.

GLM build a linear relationship between the response and predictors even if their relationship is not linear. There are some basic assumptions for GLMs (some are modified for LMs):

  • Data should be independent and random (each random variable has the same probability distribution)
  • The response variable y does not need to be normally distributed, but the distribution is from an exponential family (e.g. bionmia, Poisson, multinomial, normal)
  • The original response variable need not have a linear relationship with the independent variables, but the transformed response variable (through the link function) is linearly dependent on the independent variables.
  • Feature engineering on the Independent variable can be applied i.e instead of taking the original raw independent variables, variable transformation can be done, and the transformed independent variables, such as taking a log transformation, squaring the variables, reciprocal of the variables, can also be used to build the GLM model.
  • Homoscedasticity (i.e constant variance) need not be satisfied. Response variable Error variance can increase, or decrease with the independent variables.
  • Errors are independent but need not be normally distributed

\(~\)

To start we will be using a data from the ‘GLMsData’ cran r-project package to select a data set to create our generalized linear models.

\(~\)

Load Libraries

These are the libraries we will be using:

library(ggplot2)
library(tidyverse)
library(jtools)
library(hrbrthemes)
library(GLMsData)
library(gtsummary)
library(MASS)
library(performance)

There’s a couple of data sets that we can use from the GLMsData but in order to view these options on your end use the function data() and you’ll see a small box with the list of the data sets in this library.

\(~\)

Load Data

We will be using the Australian Institute of Sports (AIS) data, which has physical and blood measurements from high performance athletes at the AIS.

\(~\)

Name: Description:
Sex the sex of the athlete: F means female, and M means male
Sport the sport of the athlete; one of BBall (basketball), Field, Gym (gymnastics), Netball, Rowing, Swim (swimming), T400m, (track, further than 400m), Tennis, TPSprnt (track sprint events), WPolo (waterpolo)
LBM lean body mass, in kg
Ht height, in cm
Wt weight, in kg
BMI body mass index, in kg per metre-squared
SSF sum of skin folds
PBF percentage body fat
RBC red blood cell count, in 1012 per litre
WBC white blood cell count, in 1012 per litre
HCT hematocrit, in percent
HGB hemoglobin concentration, in grams per decilitre
Ferr plasma ferritins, in ng per decilitre

\(~\)

To load the data we the data() function and the head() function to view it:

##   Sex Sport   LBM    Ht   Wt   BMI   SSF  RBC WBC  HCT  HGB Ferr   PBF
## 1   F BBall 63.32 195.9 78.9 20.56 109.1 3.96 7.5 37.5 12.3   60 19.75
## 2   F BBall 58.55 189.7 74.4 20.67 102.8 4.41 8.3 38.2 12.7   68 21.30
## 3   F BBall 55.36 177.8 69.1 21.86 104.6 4.14 5.0 36.4 11.6   21 19.88
## 4   F BBall 57.18 185.0 74.9 21.88 126.4 4.11 5.3 37.3 12.6   69 23.66
## 5   F BBall 53.20 184.6 64.6 18.96  80.3 4.45 6.8 41.5 14.0   29 17.64
## 6   F BBall 53.77 174.0 63.7 21.04  75.2 4.10 4.4 37.4 12.5   42 15.58

\(~\)

Data Exploration

Lets get the structure of this data; Using the str() function we notice we have 202 observations with 13 variables.

## 'data.frame':    202 obs. of  13 variables:
##  $ Sex  : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Sport: Factor w/ 10 levels "BBall","Field",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LBM  : num  63.3 58.5 55.4 57.2 53.2 ...
##  $ Ht   : num  196 190 178 185 185 ...
##  $ Wt   : num  78.9 74.4 69.1 74.9 64.6 63.7 75.2 62.3 66.5 62.9 ...
##  $ BMI  : num  20.6 20.7 21.9 21.9 19 ...
##  $ SSF  : num  109.1 102.8 104.6 126.4 80.3 ...
##  $ RBC  : num  3.96 4.41 4.14 4.11 4.45 4.1 4.31 4.42 4.3 4.51 ...
##  $ WBC  : num  7.5 8.3 5 5.3 6.8 4.4 5.3 5.7 8.9 4.4 ...
##  $ HCT  : num  37.5 38.2 36.4 37.3 41.5 37.4 39.6 39.9 41.1 41.6 ...
##  $ HGB  : num  12.3 12.7 11.6 12.6 14 12.5 12.8 13.2 13.5 12.7 ...
##  $ Ferr : int  60 68 21 69 29 42 73 44 41 44 ...
##  $ PBF  : num  19.8 21.3 19.9 23.7 17.6 ...

\(~\)

Using summary() function to get the statistical structure of our data

##  Sex         Sport         LBM               Ht              Wt        
##  F:100   Rowing :37   Min.   : 34.36   Min.   :148.9   Min.   : 37.80  
##  M:102   T400m  :29   1st Qu.: 54.67   1st Qu.:174.0   1st Qu.: 66.53  
##          BBall  :25   Median : 63.03   Median :179.7   Median : 74.40  
##          Netball:23   Mean   : 64.87   Mean   :180.1   Mean   : 75.01  
##          Swim   :22   3rd Qu.: 74.75   3rd Qu.:186.2   3rd Qu.: 84.12  
##          Field  :19   Max.   :106.00   Max.   :209.4   Max.   :123.20  
##          (Other):47                                                    
##       BMI             SSF              RBC             WBC        
##  Min.   :16.75   Min.   : 28.00   Min.   :3.800   Min.   : 3.300  
##  1st Qu.:21.08   1st Qu.: 43.85   1st Qu.:4.372   1st Qu.: 5.900  
##  Median :22.72   Median : 58.60   Median :4.755   Median : 6.850  
##  Mean   :22.96   Mean   : 69.02   Mean   :4.719   Mean   : 7.109  
##  3rd Qu.:24.46   3rd Qu.: 90.35   3rd Qu.:5.030   3rd Qu.: 8.275  
##  Max.   :34.42   Max.   :200.80   Max.   :6.720   Max.   :14.300  
##                                                                   
##       HCT             HGB             Ferr             PBF        
##  Min.   :35.90   Min.   :11.60   Min.   :  8.00   Min.   : 5.630  
##  1st Qu.:40.60   1st Qu.:13.50   1st Qu.: 41.25   1st Qu.: 8.545  
##  Median :43.50   Median :14.70   Median : 65.50   Median :11.650  
##  Mean   :43.09   Mean   :14.57   Mean   : 76.88   Mean   :13.507  
##  3rd Qu.:45.58   3rd Qu.:15.57   3rd Qu.: 97.00   3rd Qu.:18.080  
##  Max.   :59.70   Max.   :19.20   Max.   :234.00   Max.   :35.520  
## 

\(~\)

And of course use of gtsummary() library to view our data’s structure differently.

Characteristic N = 2021
Sex
F 100 (50%)
M 102 (50%)
Sport
BBall 25 (12%)
Field 19 (9.4%)
Gym 4 (2.0%)
Netball 23 (11%)
Rowing 37 (18%)
Swim 22 (11%)
T400m 29 (14%)
Tennis 11 (5.4%)
TSprnt 15 (7.4%)
WPolo 17 (8.4%)
LBM 63 (55, 75)
Ht 180 (174, 186)
Wt 74 (67, 84)
BMI 22.72 (21.08, 24.46)
SSF 59 (44, 90)
RBC 4.76 (4.37, 5.03)
WBC 6.85 (5.90, 8.28)
HCT 43.5 (40.6, 45.6)
HGB 14.70 (13.50, 15.57)
Ferr 66 (41, 97)
PBF 11.7 (8.5, 18.1)
1 n (%); Median (IQR)

\(~\)

Model Building:

\(~\)

To create our models we use the following syntax:

glm (formula, family, data, weights, subset, Start=null, model=TRUE,method=””…)

\(~\)

Below we have other link functions to use depending on the family you will be using:

Family Default Link Function
binomial (link = “logit”)
gaussian (link = “identity”)
Gamma (link = “inverse”)
inverse.gaussian (link = “1/mu^2”)
poisson (link = “log”)
quasi (link = “identity”, variance = “constant”)
quasibinomial (link = “logit”)
quasipoisson (link = “log”)

I have opted to used the guassian family for the models below.

\(~\)

Model 1 will be a base model:

## 
## Call:
## glm(formula = BMI ~ Sex + Sport + LBM + Ht + Wt + SSF + PBF + 
##     WBC + HCT + HGB + Ferr, family = gaussian(link = "identity"), 
##     data = AIS)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.48202  -0.09002   0.04780   0.16179   0.88570  
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  42.5942420  0.9111177  46.749  < 2e-16 ***
## SexM          0.1922183  0.1134696   1.694  0.09197 .  
## SportField    0.3233329  0.1152391   2.806  0.00557 ** 
## SportGym     -1.2811581  0.1964370  -6.522 6.64e-10 ***
## SportNetball  0.0600713  0.0977253   0.615  0.53952    
## SportRowing   0.1005367  0.0818731   1.228  0.22105    
## SportSwim     0.2194978  0.0976737   2.247  0.02582 *  
## SportT400m    0.1421550  0.1051004   1.353  0.17787    
## SportTennis  -0.0186392  0.1195542  -0.156  0.87628    
## SportTSprnt   0.2977256  0.1206913   2.467  0.01456 *  
## SportWPolo    0.0179661  0.1046757   0.172  0.86391    
## LBM           0.0344505  0.0371041   0.928  0.35439    
## Ht           -0.2378588  0.0047731 -49.833  < 2e-16 ***
## Wt            0.2624938  0.0327839   8.007 1.35e-13 ***
## SSF           0.0033669  0.0034438   0.978  0.32953    
## PBF           0.0411612  0.0314146   1.310  0.19176    
## WBC           0.0066557  0.0129736   0.513  0.60856    
## HCT          -0.0122763  0.0193140  -0.636  0.52582    
## HGB           0.0542869  0.0542895   1.000  0.31866    
## Ferr         -0.0001934  0.0005264  -0.367  0.71370    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.08399842)
## 
##     Null deviance: 1648.624  on 201  degrees of freedom
## Residual deviance:   15.288  on 182  degrees of freedom
## AIC: 93.845
## 
## Number of Fisher Scoring iterations: 2

\(~\)

For Model2 we use Backwards elimination:

## 
## Call:
## glm(formula = BMI ~ Sport + Ht + Wt, family = gaussian(link = "identity"), 
##     data = AIS)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.46437  -0.10602   0.07047   0.15836   0.88324  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  44.694272   0.650160  68.743  < 2e-16 ***
## SportField    0.211174   0.111202   1.899   0.0591 .  
## SportGym     -1.505624   0.190622  -7.898 2.21e-13 ***
## SportNetball  0.156607   0.094649   1.655   0.0997 .  
## SportRowing   0.063430   0.081893   0.775   0.4396    
## SportSwim     0.077695   0.091998   0.845   0.3994    
## SportT400m   -0.020106   0.090648  -0.222   0.8247    
## SportTennis  -0.104362   0.116670  -0.895   0.3722    
## SportTSprnt   0.110755   0.105509   1.050   0.2952    
## SportWPolo   -0.001183   0.097734  -0.012   0.9904    
## Ht           -0.246724   0.004324 -57.063  < 2e-16 ***
## Wt            0.302235   0.003130  96.556  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.09119712)
## 
##     Null deviance: 1648.624  on 201  degrees of freedom
## Residual deviance:   17.327  on 190  degrees of freedom
## AIC: 103.14
## 
## Number of Fisher Scoring iterations: 2

Based on the summary output and plots of the two models, model2 has a higher AIC hence seeming to be a better fit model.

When working with different data you’d be able to perform more than 2 models along with checking performance and making predictions on the evaluation data. Yet this is the start to perform generalized linear models.

\(~\)

References: