Setting up a linear model using glm()

Sachintha Mohotti

2020-04-07

Introduction

The glm() function in R fits generalised linear models, which extend linear regression to response variables whose error distribution need not be normal. This vignette will introduce the types of models that can be created using this function, then use an example dataset to create and refine a simple model.

Types of GLMs

The glm() function supports six main families of models. The first five are based on the Gaussian, Binomial, Poisson, Gamma and Inverse Gaussian distributions. The sixth is the Quasi family, which allows the user to define a model through its link and variance functions and fit it by maximum quasi-likelihood, without specifying a full distribution. R also provides the quasibinomial and quasipoisson variants of this family.
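As a rough illustration, the family objects in R's stats package can be inspected directly. The sketch below lists the default link function of each of the six families named above. The variable names are illustrative and are not taken from the original vignette.

```r
# Family constructors from the stats package (attached by default in R)
families <- list(
  gaussian         = gaussian(),
  binomial         = binomial(),
  poisson          = poisson(),
  Gamma            = Gamma(),
  inverse.gaussian = inverse.gaussian(),
  quasi            = quasi()
)

# Extract the name of the default link function of each family
default_links <- vapply(families, function(f) f$link, character(1))
print(default_links)
```

Each family object also carries its variance function and the link's inverse, which is how glm() knows how to fit the model.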

Example model using trees

As an example, let’s use the trees data available in base R, which provides the Girth, Height and Volume of 31 felled black cherry trees, and model the relationship between the girth and volume of the trees.

Step 1: Save the data as a dataframe

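The first step is to save the data as a dataframe. The trees dataset is already a data frame, so this simply creates a working copy under a name of our choosing, which can be modified without affecting the original.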
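A minimal sketch of this step is shown here. The name trees_df is an assumption made for illustration and may differ from the name used in the original vignette.

```r
# trees ships with base R (datasets package) and is already a data frame;
# as.data.frame() returns a working copy under our own name.
trees_df <- as.data.frame(trees)

str(trees_df)    # column names and types
head(trees_df)   # first few rows
```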

Step 2: Visualise the data

The next step is to plot the data on a graph. This helps us decide which model is likely to be appropriate. In this vignette, a function will be defined to be used for plotting.
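One possible shape for such a plotting function is sketched below using base graphics. The function name plot_trees and its arguments are hypothetical. The original vignette's function is not shown in the text, so this is an assumption about how it might look.

```r
# Hypothetical helper: scatter plot of Volume against Girth, with an
# optional fitted curve drawn on top of the points.
plot_trees <- function(data, fitted = NULL, main = "Black cherry trees") {
  plot(data$Girth, data$Volume,
       xlab = "Girth (inches)", ylab = "Volume (cubic feet)",
       main = main, pch = 19)
  if (!is.null(fitted)) {
    ord <- order(data$Girth)   # sort so the line is drawn left to right
    lines(data$Girth[ord], fitted[ord], col = "red", lwd = 2)
  }
  invisible(NULL)
}

# Plot the raw data only
plot_trees(trees)
```

Passing a vector of fitted values as the second argument overlays a model curve on the same axes.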

Step 3: Initial model

The data appears to be mostly linear at the centre of the plot, and we are modelling a continuous variable, so an appropriate choice for an initial model would be a gaussian model. The link function does not need to be specified, because the default for gaussian models is “identity”, which is the link we want here; the gaussian family also accepts the “log” and “inverse” links. The linear model is set up using the glm() function, then predictions are made using the model. The original data and the predictions are then plotted together.
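The fitting and prediction steps could look roughly like this. The object names linear_fit and linear_pred are illustrative assumptions.

```r
# Gaussian GLM with the identity link: equivalent to ordinary linear regression
linear_fit <- glm(Volume ~ Girth,
                  family = gaussian(link = "identity"),
                  data = trees)

# Predicted volume for each tree in the data set
linear_pred <- predict(linear_fit, type = "response")

# Original data as points, model predictions as a line
ord <- order(trees$Girth)
plot(trees$Girth, trees$Volume, xlab = "Girth", ylab = "Volume", pch = 19)
lines(trees$Girth[ord], linear_pred[ord], col = "red", lwd = 2)

summary(linear_fit)
```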

To assess the fit of the model, we’ll use the Akaike Information Criterion (AIC), which is displayed at the end of the model summary.
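Besides reading it from the summary, the AIC can be extracted directly with the AIC() function. The sketch below refits the initial model so that it stands alone. The object names are illustrative.

```r
# Refit the initial gaussian model (identity link is the default)
linear_fit <- glm(Volume ~ Girth, family = gaussian, data = trees)

# AIC() returns the same value that summary() prints for this model
aic_linear <- AIC(linear_fit)
print(round(aic_linear, 2))
```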

Step 4: Refining the model

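The value of 181.64 given for the AIC has no meaning in isolation; it is useful only for comparison with other models fitted to the same data, where a lower value indicates a better trade-off between fit and complexity. The plot shows that the model is inaccurate for lower values of the girth. For a given height, the volume of a tree is roughly proportional to the square of the girth; therefore, an exponential or polynomial curve is likely to be more appropriate. To model an exponential relationship between Volume and Girth, we take the logarithm of the volume values and set up a linear model between log(Volume) and Girth.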
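A minimal sketch of the log-linear model follows, assuming the transformation is applied inside the model formula. The name log_fit is illustrative.

```r
# Linear model for log(Volume): corresponds to Volume = exp(a + b * Girth)
log_fit <- glm(log(Volume) ~ Girth, family = gaussian, data = trees)

# Intercept and slope on the log scale
print(coef(log_fit))
```

Applying log() in the formula, instead of adding a transformed column to the data, keeps the transformation visible in the model object itself.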

To make predictions on the original scale using this model, we need to reverse the transformation used on the data. Because we modelled the logarithm of the volume, we apply the exponential function to the model’s predictions to return them to the scale of Volume.

Once the predictions have been made, we can plot the curve.
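The back-transformation and plot can be sketched as follows. The grid of girth values and all object names are assumptions introduced for illustration.

```r
# Log-linear model from the previous step
log_fit <- glm(log(Volume) ~ Girth, family = gaussian, data = trees)

# Evenly spaced girth values so the fitted curve is drawn smoothly
girth_grid <- data.frame(
  Girth = seq(min(trees$Girth), max(trees$Girth), length.out = 100)
)

# Predictions are on the log scale; exp() reverses the transformation
log_pred    <- predict(log_fit, newdata = girth_grid)
volume_pred <- exp(log_pred)

# Original data as points, back-transformed curve as a line
plot(trees$Girth, trees$Volume, xlab = "Girth", ylab = "Volume", pch = 19)
lines(girth_grid$Girth, volume_pred, col = "blue", lwd = 2)
```

Note that exponentiating a prediction of the mean of log(Volume) estimates the median of Volume, not its mean, if the errors are normal on the log scale.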

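This model appears to fit the data much better than the model from Step 3, which is reflected in a lower AIC value. A negative AIC is not a problem: AIC values are compared with their sign, so the lower (more negative) value indicates the preferred model, not the value closest to zero. Note, however, that the two models have different response variables, Volume and log(Volume), so their AIC values are not directly comparable unless the AIC of the log model is adjusted for the transformation.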
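The comparison can be sketched as below. Because the second model is fitted to log(Volume), a common correction is to add twice the sum of log(Volume) to its AIC, which expresses it on the scale of the original response. This adjustment is an addition to the vignette's approach, not part of it, and the object names are illustrative.

```r
# Both models, refitted so this example stands alone
linear_fit <- glm(Volume ~ Girth,      family = gaussian, data = trees)
log_fit    <- glm(log(Volume) ~ Girth, family = gaussian, data = trees)

aic_linear <- AIC(linear_fit)   # response: Volume
aic_log    <- AIC(log_fit)      # response: log(Volume)

# Jacobian adjustment: puts the log-model AIC on the scale of Volume,
# so that it can be compared with aic_linear
aic_log_adjusted <- aic_log + 2 * sum(log(trees$Volume))

print(c(linear       = aic_linear,
        log          = aic_log,
        log_adjusted = aic_log_adjusted))
```

Of the three values, only the first and the third are like-for-like comparisons of the two models.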

Conclusion

In conclusion, the choice of model is based on the type of data to be modelled, and it is up to the data scientist to choose the most appropriate for the given scenario. Determining the correct choice is based on an understanding of the differences between models and the scenario to be modelled. This vignette simply offers an example of the simplest model.
