Setting up a linear model using glm()

Introduction

The glm() function in R uses generalised linear models to perform regression on response variables where the error distribution isn’t the normal distribution. This vignette will introduce the types of models that can be created using this function, then use an example dataset to create and refine a simple model.

Types of GLMs

The first five of the six families of models that the glm() function can use are based on the Gaussian, Binomial, Poisson, Gamma and Inverse Gaussian distributions. The sixth family is known as the Quasi family of models, which allows the user to define a model based on maximum quasi-likelihood.

Example model using trees

As an example, let’s use the trees data available in base R, which provides the girth, height and Volume of 30 Black Cherry trees, and model the relationship between the girth and volume of the trees.

Step 1: Save the data as a dataframe

The first step is to save the data as a dataframe. This will allow the data to be processed faster.

tree_data <- trees
tree_data
#>    Girth Height Volume
#> 1    8.3     70   10.3
#> 2    8.6     65   10.3
#> 3    8.8     63   10.2
#> 4   10.5     72   16.4
#> 5   10.7     81   18.8
#> 6   10.8     83   19.7
#> 7   11.0     66   15.6
#> 8   11.0     75   18.2
#> 9   11.1     80   22.6
#> 10  11.2     75   19.9
#> 11  11.3     79   24.2
#> 12  11.4     76   21.0
#> 13  11.4     76   21.4
#> 14  11.7     69   21.3
#> 15  12.0     75   19.1
#> 16  12.9     74   22.2
#> 17  12.9     85   33.8
#> 18  13.3     86   27.4
#> 19  13.7     71   25.7
#> 20  13.8     64   24.9
#> 21  14.0     78   34.5
#> 22  14.2     80   31.7
#> 23  14.5     74   36.3
#> 24  16.0     72   38.3
#> 25  16.3     77   42.6
#> 26  17.3     81   55.4
#> 27  17.5     82   55.7
#> 28  17.9     80   58.3
#> 29  18.0     80   51.5
#> 30  18.0     80   51.0
#> 31  20.6     87   77.0

Step 2: Visualise the data

The next step will be to plot the data on a graph. This will allow us to decide what the best model to use will be. In this vignette, a function will be defined to be used for plotting.

treeScatterPlot <- function(...){
  plot(Volume ~ Girth, data=tree_data, bty="n", lwd=2,
       main="Tree Volume vs Girth", col="#00526D", 
       xlab="Girth (m)", 
       ylab="Volume (m3)", ...)
  axis(side = 1, col="grey")
  axis(side = 2, col="grey")
}

treeScatterPlot()

Step 3: Initial model

The data appears to be mostly linear at the centre of the plot, and we are modelling a continuous variable, so an appropriate choice for an initial model would be a gaussian model. The link function must be specified, the default and only choice for gaussian models is “identity”. The Linear Model is set up using the glm() function, then predictions are made using the model. Both the original data and predictions are plotted below.

linearModel <- glm(Volume ~ Girth, data=tree_data, 
              family=gaussian(link="identity"))

linearPredicts <- predict(linearModel)

treeScatterPlot()
lines(tree_data$Girth, linearPredicts, col="red")
legend(x="topleft", bty="n",
       legend=c("observation", "Transformed LM"),
         col=c("#00526D","red"),  pch=c(1,NA))

To test the accuracy of the graph, we’ll use the Akaike Information Criterion (AIC), which is displayed at the end of the model summary.

summary(linearModel)
#> 
#> Call:
#> glm(formula = Volume ~ Girth, family = gaussian(link = "identity"), 
#>     data = tree_data)
#> 
#> Deviance Residuals: 
#>    Min      1Q  Median      3Q     Max  
#> -8.065  -3.107   0.152   3.495   9.587  
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) -36.9435     3.3651  -10.98 7.62e-12 ***
#> Girth         5.0659     0.2474   20.48  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 18.0794)
#> 
#>     Null deviance: 8106.1  on 30  degrees of freedom
#> Residual deviance:  524.3  on 29  degrees of freedom
#> AIC: 181.64
#> 
#> Number of Fisher Scoring iterations: 2

Step 4: Refining the model

The value of 181.64 given for the AIC is very low, but the model is inaccurate for lower values of the girth. The volume of a tree is proportional to the square of the girth; therefore, an exponential or polynomial curve is likely to be more appropriate. To model an exponential relationship between Volume and Girth, we take the logarithm of the volume values and set up a linear model between log(Volume) and Girth.

logLinearModel <- glm(log(Volume) ~ Girth, data=tree_data, 
              family=gaussian(link="identity"))

To make accurate predictions using this model, we need to reverse the transformation used on the data. As we used the logarithm function before, we now need to take the exponential of the log-linear model values before getting predictions.

linearPredicts <- exp(predict(logLinearModel))

Once the predictions have been made, we can plot the curve.

treeScatterPlot()
lines(tree_data$Girth, linearPredicts, col="red")
legend(x="topleft", bty="n",
       legend=c("observation", "Log-transformed LM"),
         col=c("#00526D","red"),  pch=c(1,NA))

This model fits the data much better than the previous model given in Step 3, which is reflected by a lower AIC value as shown below. The negative sign doesn’t change the interpretation of the value, the model accuracy is determined be the absolute value of the AIC.

summary(logLinearModel)
#> 
#> Call:
#> glm(formula = log(Volume) ~ Girth, family = gaussian(link = "identity"), 
#>     data = tree_data)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -0.22719  -0.11468   0.02889   0.07930   0.30436  
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept) 1.118997   0.104021   10.76 1.23e-11 ***
#> Girth       0.162566   0.007647   21.26  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 0.01727492)
#> 
#>     Null deviance: 8.30869  on 30  degrees of freedom
#> Residual deviance: 0.50097  on 29  degrees of freedom
#> AIC: -33.907
#> 
#> Number of Fisher Scoring iterations: 2

Conclusion

In conclusion, the choice of model is based on the type of data to be modelled, and it is up to the data scientist to choose the most appropriate for the given scenario. Determining the correct choice is based on an understanding of the differences between models and the scenario to be modelled. This vignette simply offers an example of the simplest model.

References

[1] W. H. Greene, “Akaike information criterion,” Research, 10 September 2012. [Online]. Available: https://www.researchgate.net/post/Akaike_information_criterion. [Accessed 4 April 2020].

[2] Y. Kida, “Generalized linear models,” Towards Data Science, 23 Spetember 2019. [Online]. Available: https://towardsdatascience.com/generalized-linear-models-9cbf848bb8ab. [Accessed 4 April 2020].

[3] G. Rodriguez, “Generalized Linear Models,” Princeton University, January 2020. [Online]. Available: https://data.princeton.edu/R/GLMs. [Accessed 4 April 2020].

[4] M. Gesmann, “Generalised Linear Models in R,” mages’ blog, 4 August 2015. [Online]. Available: https://magesblog.com/post/2015-08-04-generalised-linear-models-in-r/. [Accessed 4 April 2020].

[5] RDocumentation, “glm function,” RDocumentation, [Online]. Available: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/glm. [Accessed 4 April 2020].