In this blog post I’ll be looking to create a generalized linear model. I’m basing some work off this DataCamp GLM Tutorial as a reference
We’ll be working off the built-in cervical
dataset from the GLMsData package in R, which tracks the number of
deaths from cervical cancer for.
First let’s build a basic linear
lm1 <- lm(Deaths ~ Age + Country + Wyears, cervical)
summary(lm1)
##
## Call:
## lm(formula = Deaths ~ Age + Country + Wyears, data = cervical)
##
## Residuals:
## Min 1Q Median 3Q Max
## -825.77 -159.95 -11.18 236.64 478.60
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.918e+02 3.183e+02 -0.603 0.563483
## Age35to44 -4.644e+01 3.468e+02 -0.134 0.896797
## Age45to54 6.817e+02 3.484e+02 1.956 0.086139 .
## Age55to64 8.337e+02 3.496e+02 2.385 0.044228 *
## CountryEngWales 2.080e+03 3.732e+02 5.573 0.000527 ***
## CountryFrance 6.581e+02 3.179e+02 2.070 0.072197 .
## CountryItaly 4.541e+02 3.199e+02 1.420 0.193503
## Wyears -1.092e-02 4.279e-03 -2.553 0.034026 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 443.2 on 8 degrees of freedom
## Multiple R-squared: 0.8748, Adjusted R-squared: 0.7652
## F-statistic: 7.982 on 7 and 8 DF, p-value: 0.004438
Let’s create a plot of our linear model to see how well of a fit we’re getting.
plot(lm1)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
We can see from the Residuals vs Fitted graph that our residuals from a basic linear model have a general pattern. Let’s plot a histogram of our residuals in order to see if they are normally distributed, which is an assumption of linearity when modeling.
hist(lm1$residuals)
AIC(lm1)
## [1] 247.3235
We can now build a generalized linear model using the R function
glm. Given we’re dealing with count data (number of
deaths), we can pass the family=poisson parameter to model
based off the Poisson distribution.
glm1 <- glm(Deaths ~ Age + Country + Wyears, family="poisson", data=cervical)
summary(glm1)
##
## Call:
## glm(formula = Deaths ~ Age + Country + Wyears, family = "poisson",
## data = cervical)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4029 -1.6002 -0.0286 0.9339 4.9031
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.857e+00 9.317e-02 30.66 <2e-16 ***
## Age35to44 1.577e+00 8.752e-02 18.02 <2e-16 ***
## Age45to54 2.593e+00 8.538e-02 30.38 <2e-16 ***
## Age55to64 2.722e+00 8.559e-02 31.80 <2e-16 ***
## CountryEngWales 2.439e+00 4.472e-02 54.54 <2e-16 ***
## CountryFrance 1.503e+00 4.619e-02 32.55 <2e-16 ***
## CountryItaly 1.080e+00 4.863e-02 22.21 <2e-16 ***
## Wyears -1.697e-07 8.077e-07 -0.21 0.834
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 13857.502 on 15 degrees of freedom
## Residual deviance: 68.105 on 8 degrees of freedom
## AIC: 206.22
##
## Number of Fisher Scoring iterations: 4
We can use the Akaike Information Criterion as a measure of goodness of fit of our model. In general, a minimized AIC value is preferred, and our GLM is seeing better performance out of our poisson-based GLM (206) than our regular linear model (247)
Let’s take a look at our adjusted-\(R^2\) values as well. Our linear model produced an \(R^2\) value of 0.7652. By contrast, our GLM produced an adjuster \(R^2\) of
with(summary(glm1), 1 - deviance/null.deviance)
## [1] 0.9950853
Which is an improvement from the linear model!