In this blog post I’ll be looking to create a generalized linear model. I’m basing some work off this DataCamp GLM Tutorial as a reference

We’ll be working off the built-in cervical dataset from the GLMsData package in R, which tracks the number of deaths from cervical cancer for.

First let’s build a basic linear

lm1 <- lm(Deaths ~ Age + Country + Wyears, cervical)

summary(lm1)
## 
## Call:
## lm(formula = Deaths ~ Age + Country + Wyears, data = cervical)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -825.77 -159.95  -11.18  236.64  478.60 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.918e+02  3.183e+02  -0.603 0.563483    
## Age35to44       -4.644e+01  3.468e+02  -0.134 0.896797    
## Age45to54        6.817e+02  3.484e+02   1.956 0.086139 .  
## Age55to64        8.337e+02  3.496e+02   2.385 0.044228 *  
## CountryEngWales  2.080e+03  3.732e+02   5.573 0.000527 ***
## CountryFrance    6.581e+02  3.179e+02   2.070 0.072197 .  
## CountryItaly     4.541e+02  3.199e+02   1.420 0.193503    
## Wyears          -1.092e-02  4.279e-03  -2.553 0.034026 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 443.2 on 8 degrees of freedom
## Multiple R-squared:  0.8748, Adjusted R-squared:  0.7652 
## F-statistic: 7.982 on 7 and 8 DF,  p-value: 0.004438

Let’s create a plot of our linear model to see how well of a fit we’re getting.

plot(lm1)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

We can see from the Residuals vs Fitted graph that our residuals from a basic linear model have a general pattern. Let’s plot a histogram of our residuals in order to see if they are normally distributed, which is an assumption of linearity when modeling.

hist(lm1$residuals)

AIC(lm1)
## [1] 247.3235

We can now build a generalized linear model using the R function glm. Given we’re dealing with count data (number of deaths), we can pass the family=poisson parameter to model based off the Poisson distribution.

glm1 <- glm(Deaths ~ Age + Country + Wyears, family="poisson", data=cervical)
summary(glm1)
## 
## Call:
## glm(formula = Deaths ~ Age + Country + Wyears, family = "poisson", 
##     data = cervical)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4029  -1.6002  -0.0286   0.9339   4.9031  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      2.857e+00  9.317e-02   30.66   <2e-16 ***
## Age35to44        1.577e+00  8.752e-02   18.02   <2e-16 ***
## Age45to54        2.593e+00  8.538e-02   30.38   <2e-16 ***
## Age55to64        2.722e+00  8.559e-02   31.80   <2e-16 ***
## CountryEngWales  2.439e+00  4.472e-02   54.54   <2e-16 ***
## CountryFrance    1.503e+00  4.619e-02   32.55   <2e-16 ***
## CountryItaly     1.080e+00  4.863e-02   22.21   <2e-16 ***
## Wyears          -1.697e-07  8.077e-07   -0.21    0.834    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 13857.502  on 15  degrees of freedom
## Residual deviance:    68.105  on  8  degrees of freedom
## AIC: 206.22
## 
## Number of Fisher Scoring iterations: 4

We can use the Akaike Information Criterion as a measure of goodness of fit of our model. In general, a minimized AIC value is preferred, and our GLM is seeing better performance out of our poisson-based GLM (206) than our regular linear model (247)

Let’s take a look at our adjusted-\(R^2\) values as well. Our linear model produced an \(R^2\) value of 0.7652. By contrast, our GLM produced an adjuster \(R^2\) of

with(summary(glm1), 1 - deviance/null.deviance)
## [1] 0.9950853

Which is an improvement from the linear model!