Linear Regression - Basic Tutorial

Purpose of a Linear Model

Linear models are constructed to understand the relationship between a dependent variable and one or more independent variables, under the assumption that the dependent variable changes at a constant rate as each independent variable changes.

In R, linear models are created using the lm() function:

lm(y ~ x, data = dataset)

Where y is the dependent variable, x is the independent variable, and dataset is the data frame containing these variables.
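As a minimal sketch of this call, the example below fits lm() to simulated data rather than real data. The true intercept (3), slope (2) and noise level are assumptions chosen purely for illustration, so the estimated coefficients should land close to them.

```r
# Simulate data with a known linear relationship (hypothetical values)
set.seed(42)
dataset <- data.frame(x = runif(200, min = 0, max = 10))
dataset$y <- 3 + 2 * dataset$x + rnorm(200, sd = 1)

# Fit the model and inspect the estimated intercept and slope
fit <- lm(y ~ x, data = dataset)
coef(fit)
```

Because the data were generated from a straight line plus noise, the estimates returned by coef() should sit near the values used in the simulation.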

Linear Regression in Action

To create a linear model, we first need access to a set of data. For the purpose of this example, we will use AFL data from the fitzRoy package. If fitzRoy is not installed, you can install it with the first command below. The remaining commands install the tidyverse, sjPlot, patchwork and performance packages, which are used later in this tutorial.

install.packages("fitzRoy")
install.packages("tidyverse")
install.packages("sjPlot")
install.packages("patchwork")
install.packages("performance")

library(fitzRoy)
library(tidyverse)
library(sjPlot)
library(patchwork)
library(performance)

We can now load player statistics for a season using the fetch_player_stats_afltables() function. For this example, we will use data from the 2024 season.

# Create data frame
afl_data <- fetch_player_stats_afltables(season = 2024)

Simple Linear Regression

To begin modelling, we can first examine the relationship between two variables. In this case, we will look at how the number of Contested Possessions a player wins relates to the number of Disposals they have.

# Create linear model - call it lm1
lm1 <- lm(Disposals ~ Contested.Possessions, data = afl_data)

# Summary of the linear model
lm1_summary <- tab_model(lm1)

lm1_summary

	Disposals
Predictors	Estimates	CI	p
(Intercept)	8.02	7.82 – 8.22	<0.001
Contested Possessions	1.30	1.27 – 1.33	<0.001
Observations	9936
R² / R² adjusted	0.418 / 0.418

# Visualise the linear model
lm1_plot <- ggplot(afl_data, aes(x = Contested.Possessions, y = Disposals)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  labs(title = "Relationship between Disposals and Contested Possessions",
       subtitle = "AFL 2024 Season - Player Statistics",
       x = "Number of Contested Possessions",
       y = "Number of Disposals") +
  theme_minimal()

# Display plot
print(lm1_plot)

From the plot above, we can see a positive relationship between Contested Possessions and Disposals. As the number of Contested Possessions increases, the number of Disposals tends to increase as well. As shown in the summary table, with the intercept at 8.02, a player who has recorded 0 Contested Possessions could be expected to have about 8 Disposals on average. For every additional Contested Possession, a player can be expected to have 1.3 additional Disposals on average.
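To make the interpretation concrete, the sketch below applies the rounded coefficients from the summary table to a hypothetical player with 10 Contested Possessions. The player is an assumption for illustration only, and the coefficients are typed in by hand rather than extracted from lm1.

```r
# Coefficients copied from the summary table above (rounded)
intercept <- 8.02
slope_cp  <- 1.30

# Hypothetical player with 10 Contested Possessions
contested_possessions <- 10
predicted_disposals <- intercept + slope_cp * contested_possessions
predicted_disposals  # 8.02 + 1.30 * 10 = 21.02

# With the fitted model available, the unrounded equivalent would be:
# predict(lm1, newdata = data.frame(Contested.Possessions = 10))
```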

Multiple Linear Regression

A simple model can be extended by adding more independent variables, giving multiple predictors of the dependent variable. In this case, we will add Marks as an additional predictor of Disposals.

# Create multiple linear model - call it lm2
lm2 <- lm(Disposals ~ Contested.Possessions + Marks, data = afl_data)

# Summary of the multiple linear model
lm2_summary <- tab_model(lm2)

lm2_summary

	Disposals
Predictors	Estimates	CI	p
(Intercept)	3.27	3.06 – 3.49	<0.001
Contested Possessions	1.29	1.27 – 1.32	<0.001
Marks	1.19	1.16 – 1.23	<0.001
Observations	9936
R² / R² adjusted	0.606 / 0.606

# Visualise the multiple linear model
lm2_cp <- plot_model(lm2, type = "pred", terms = "Contested.Possessions")
lm2_marks <- plot_model(lm2, type = "pred", terms = "Marks")

# Display plot
lm2_plot <- lm2_cp / lm2_marks

print(lm2_plot)

Based on the coefficients, a player with 0 Contested Possessions and 0 Marks can be expected to have approximately 3.27 Disposals. When Marks are held constant, each additional Contested Possession is associated with 1.29 more Disposals on average, and when Contested Possessions are held constant, each additional Mark is associated with 1.19 more Disposals on average.
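The idea of holding one predictor constant can be sketched with the rounded coefficients from the table. The helper predict_disposals() and the two players below are hypothetical, introduced only to show that changing Marks by one while fixing Contested Possessions shifts the prediction by exactly the Marks coefficient.

```r
# Coefficients copied from the multiple regression table above (rounded)
predict_disposals <- function(contested_possessions, marks) {
  3.27 + 1.29 * contested_possessions + 1.19 * marks
}

# Two hypothetical players with the same Contested Possessions,
# differing by exactly one Mark
player_a <- predict_disposals(contested_possessions = 10, marks = 4)
player_b <- predict_disposals(contested_possessions = 10, marks = 5)

# The gap between the two predictions equals the Marks coefficient
marks_effect <- player_b - player_a
marks_effect
```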

Compare Models

# Compare both models in one table
model_comparison <- tab_model(lm1, lm2,
          dv.labels = c("Model 1 - Simple", "Model 2 - Multiple"))

# Display Table
model_comparison

	Model 1 - Simple			Model 2 - Multiple
Predictors	Estimates	CI	p	Estimates	CI	p
(Intercept)	8.02	7.82 – 8.22	<0.001	3.27	3.06 – 3.49	<0.001
Contested Possessions	1.30	1.27 – 1.33	<0.001	1.29	1.27 – 1.32	<0.001
Marks				1.19	1.16 – 1.23	<0.001
Observations	9936			9936
R² / R² adjusted	0.418 / 0.418			0.606 / 0.606

As can be seen in the table above, the R² values show that Contested Possessions alone explain 41.8% of the variation in Disposals. When Marks is added, the proportion of variation explained increases to 60.6%, improving the explanatory power of the model.
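For readers who want to see where R² comes from, the sketch below computes it by hand as one minus the ratio of residual to total sum of squares, and compares it with the value reported by summary(). It uses simulated data with assumed parameters, not the AFL data.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(123)
sim <- data.frame(x = rnorm(100))
sim$y <- 5 + 1.5 * sim$x + rnorm(100, sd = 2)
fit <- lm(y ~ x, data = sim)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((sim$y - mean(sim$y))^2)
manual_r2 <- 1 - ss_res / ss_tot

# Should agree with the value reported by summary()
all.equal(manual_r2, summary(fit)$r.squared)
```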

Before drawing a conclusion as to which model is the better predictor of a player’s Disposals, it is also important to consider the Akaike Information Criterion (AIC). The AIC penalises each additional parameter, so it assesses whether the added complexity of a model is justified by the improvement in fit, which helps guard against overfitting. When comparing models fitted to the same data, a smaller AIC value indicates the preferred model.
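The trade-off between fit and complexity can be sketched by computing AIC by hand as 2k minus twice the log-likelihood, where k counts the estimated parameters. The example below uses simulated data with assumed coefficients, and the names small and large are hypothetical stand-ins for a simple and a multiple model.

```r
# Simulated data where both predictors genuinely matter (hypothetical values)
set.seed(1)
n <- 300
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 + 1.5 * sim$x2 + rnorm(n)

small <- lm(y ~ x1, data = sim)
large <- lm(y ~ x1 + x2, data = sim)

# AIC = 2k - 2 * log-likelihood, where k counts estimated parameters
# (regression coefficients plus the residual variance)
ll <- logLik(large)
k <- attr(ll, "df")
manual_aic <- 2 * k - 2 * as.numeric(ll)

# Should agree with the built-in AIC() function
all.equal(manual_aic, AIC(large))

# Side-by-side comparison; the lower AIC is preferred
AIC(small, large)
```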

# Display AIC values
compare_performance(lm1, lm2)

## # Comparison of Model Performance Indices
## 
## Name | Model |   AIC (weights) |  AICc (weights) |   BIC (weights) |    R2
## --------------------------------------------------------------------------
## lm1  |    lm | 62227.7 (<.001) | 62227.7 (<.001) | 62249.3 (<.001) | 0.418
## lm2  |    lm | 58363.9 (>.999) | 58363.9 (>.999) | 58392.7 (>.999) | 0.606
## 
## Name | R2 (adj.) |  RMSE | Sigma
## --------------------------------
## lm1  |     0.418 | 5.541 | 5.542
## lm2  |     0.606 | 4.561 | 4.562

From the output above, we can see that Model 2 has a lower AIC value (58363.9) compared to Model 1 (62227.7), indicating that the added complexity of including Marks as an additional predictor is justified by the improvement in fit. Combined with its higher R² value and lower RMSE (4.561 vs 5.541), Model 2 is the preferred model for predicting Disposals.

Model Diagnostics

Before we can finally tick off the use of Model 2, we can perform model diagnostics to ensure that the assumptions of linear regression have been met. This can be done using check_model().

check_model(lm2)

From the diagnostic plots above, we can see several key findings:

Posterior Predictive Check (top left): The model-predicted distribution closely matches the observed data, indicating good overall model fit.

Linearity (top right): The residuals show a relatively flat pattern around zero, though there is a slight downward trend at higher fitted values, suggesting minor non-linearity. This is acceptable for practical purposes.

Homogeneity of Variance (middle left): The spread of residuals remains relatively constant across fitted values, confirming the assumption of homoscedasticity is reasonably met.

Influential Observations (middle right): All data points fall well within the Cook’s distance contour lines, indicating no individual observations are unduly influencing the model results.

Collinearity (bottom left): Both predictors show VIF values well below 5, confirming no multicollinearity issues between Contested Possessions and Marks.

Normality of Residuals (bottom right): The Q-Q plot shows points largely following the diagonal line, with some deviation at the extremes indicating slightly heavy-tailed residuals. This is common with count data and does not invalidate the model.

Overall, the diagnostics suggest that the assumptions of linear regression have been adequately met for Model 2, supporting the validity of our conclusions.
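Two of the checks above, collinearity and influential observations, can also be inspected numerically. The sketch below is one possible way to do this in base R on simulated data with assumed parameters. It is not what check_model() runs internally: it computes the VIF for one predictor from its definition and extracts Cook's distance with cooks.distance().

```r
# Simulated data with mildly correlated predictors (hypothetical values)
set.seed(7)
n <- 200
sim <- data.frame(x1 = rnorm(n))
sim$x2 <- 0.5 * sim$x1 + rnorm(n)
sim$y  <- 1 + 2 * sim$x1 + sim$x2 + rnorm(n)
fit <- lm(y ~ x1 + x2, data = sim)

# VIF for x1: regress x1 on the other predictor, then 1 / (1 - R^2)
r2_x1 <- summary(lm(x1 ~ x2, data = sim))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1

# Cook's distance for every observation; large values flag influential points
max_cook <- max(cooks.distance(fit))
max_cook
```

A VIF close to 1 suggests the predictor shares little information with the others, which is the same reading applied to the collinearity panel above.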

Conclusion

This analysis demonstrates the key steps in linear modelling:

1. Starting with a simple model to understand basic relationships
2. Extending to multiple predictors to improve explanatory power
3. Comparing models using multiple criteria (R², AIC, RMSE)
4. Validating assumptions through diagnostic checks

In summary, with its added complexity justified by a lower AIC value, higher R² value, and acceptable model diagnostics, Model 2 can be considered the preferred model for predicting Disposals based on Contested Possessions and Marks.