Scenario

Our company, a global industrial manufacturing firm with over 10,000 employees, recently faced scrutiny after a workplace accident in South America, highlighting variability in safety practices across regions.The CEO has requested an analysis of workplace injury data to recommend a safety regime that should be adopted as the company-wide standard for injury prevention.

Load Packages

library(tidyverse)
library(MASS)
library(ggpubr)
library(DHARMa)
library(AER)
library(ggplot2)
library(reshape2)

Data

The dataset, injury.csv, covers the last 12 months of operation and includes the following variables:

The data is aggregated by experience level and the safety regime implemented at each factory.

# Load the CSV file
injury_data <- read.csv("C:/Users/navee/Downloads/injury.csv")

# Examine the data structure
str(injury_data)
## 'data.frame':    72 obs. of  8 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Injuries  : int  6 8 7 19 39 57 60 40 40 1 ...
##  $ Safety    : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ Experience: int  4 4 4 3 3 2 2 1 1 1 ...
##  $ Hours     : int  231437 126655 87847 222970 376438 316462 235670 127361 96667 51156 ...
##  $ bonus     : num  0.27 0.37 0.57 0.91 0.2 0.9 0.94 0.66 0.63 0.06 ...
##  $ training  : num  0.35 0.33 0.48 0.89 0.86 0.39 0.78 0.96 0.43 0.71 ...
##  $ university: num  0.73 0.45 0.18 0.75 0.1 0.86 0.61 0.56 0.33 0.45 ...
summary(injury_data)
##        X            Injuries          Safety       Experience 
##  Min.   : 1.00   Min.   :  0.00   Min.   :1.00   Min.   :1.0  
##  1st Qu.:18.75   1st Qu.: 25.00   1st Qu.:1.75   1st Qu.:1.0  
##  Median :36.50   Median : 57.50   Median :2.50   Median :2.5  
##  Mean   :36.50   Mean   : 87.46   Mean   :2.50   Mean   :2.5  
##  3rd Qu.:54.25   3rd Qu.: 96.50   3rd Qu.:3.25   3rd Qu.:4.0  
##  Max.   :72.00   Max.   :491.00   Max.   :4.00   Max.   :4.0  
##      Hours             bonus           training        university    
##  Min.   :  34574   Min.   :0.0100   Min.   :0.0100   Min.   :0.0600  
##  1st Qu.: 130272   1st Qu.:0.3125   1st Qu.:0.2675   1st Qu.:0.2800  
##  Median : 302879   Median :0.4950   Median :0.5050   Median :0.5200  
##  Mean   : 549996   Mean   :0.5140   Mean   :0.5096   Mean   :0.5189  
##  3rd Qu.: 813381   3rd Qu.:0.7400   3rd Qu.:0.7125   3rd Qu.:0.7600  
##  Max.   :2135146   Max.   :0.9900   Max.   :0.9900   Max.   :0.9600

Research Questions

Data Preprocessing

Handling Categorical Variables Some variables in the dataset, such as Safety Regime and Experience, were categorical. These variables were converted to factors to be correctly interpreted by the model. In this data, no missing data were identified. If there had been missing values, appropriate strategies such as imputation or removal of rows would have been applied.

# Converting safety and experience to factors
injury_data$Safety <- as.factor(injury_data$Safety)
injury_data$Experience <- as.factor(injury_data$Experience)
str(injury_data)
## 'data.frame':    72 obs. of  8 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Injuries  : int  6 8 7 19 39 57 60 40 40 1 ...
##  $ Safety    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Experience: Factor w/ 4 levels "1","2","3","4": 4 4 4 3 3 2 2 1 1 1 ...
##  $ Hours     : int  231437 126655 87847 222970 376438 316462 235670 127361 96667 51156 ...
##  $ bonus     : num  0.27 0.37 0.57 0.91 0.2 0.9 0.94 0.66 0.63 0.06 ...
##  $ training  : num  0.35 0.33 0.48 0.89 0.86 0.39 0.78 0.96 0.43 0.71 ...
##  $ university: num  0.73 0.45 0.18 0.75 0.1 0.86 0.61 0.56 0.33 0.45 ...
colSums(is.na(injury_data))
##          X   Injuries     Safety Experience      Hours      bonus   training 
##          0          0          0          0          0          0          0 
## university 
##          0
# Check first few rows
head(injury_data)
##   X Injuries Safety Experience  Hours bonus training university
## 1 1        6      1          4 231437  0.27     0.35       0.73
## 2 2        8      1          4 126655  0.37     0.33       0.45
## 3 3        7      1          4  87847  0.57     0.48       0.18
## 4 4       19      1          3 222970  0.91     0.89       0.75
## 5 5       39      1          3 376438  0.20     0.86       0.10
## 6 6       57      1          2 316462  0.90     0.39       0.86

Exploratory Data Analysis

Exploratory Data Analysis (EDA) helps us gain initial insights into the patterns, relationships, and distributions within our dataset before diving into detailed modeling. By using techniques such as box plots, scatter plots, and heatmaps, we can visually explore how different variables relate to injury rates. For instance

## `geom_smooth()` using formula = 'y ~ x'

Based on the box plot showing injury prevention performance by safety regime, Safety Regime 1 should again be recommended as the international standard for workplace safety. It has the lowest median number of injuries and the smallest spread (interquartile range), indicating that it consistently performs better in injury prevention compared to the other safety regimes. Safety Regime 3 shows the highest median and widest range, making it the least effective in terms of injury prevention.

From the heatmap, we can observe the following patterns regarding the interaction between safety regime and experience level in relation to mean injuries: The heatmap shows that Safety Regime 3 and Experience Level 3 have the highest injury rates, suggesting poor effectiveness for that combination. In contrast, Safety Regime 4 and higher experience levels (especially Level 4) are associated with fewer injuries, indicating that this regime works better, particularly for more experienced workers. Overall, experience plays a key role in reducing injury rates, especially with the more effective safety regimes.

The scatter plot with a smoothing line shows the relationship between Injuries and Bonus Proportion. From the plot, we can observe that there is no clear pattern. The red smoothing line remains relatively flat across the different bonus proportions, indicating that there is no strong relationship between the proportion of bonuses received and the number of injuries.

Fit Models

Poisson Regression Model Performing model selection using backward and forward selection approaches on Poisson regression models. The purpose is to find the best-fitting model to predict the number of injuries based on various predictor variables, while minimizing the model’s complexity.

# Full model with all possible interactions for backwards selection
full_interaction_model <- glm(
  data = injury_data,
  formula = Injuries ~ Safety + Experience + bonus + Hours + training + university,
  family = poisson(link = "log")
)

# Null model for forwards selection
null_model <- glm(
  data = injury_data,
  formula = Injuries ~ 1,
  family = poisson(link = "log")
)

# Perform backward selection
backward_sel_model <- stepAIC(
  full_interaction_model,
  direction = "backward",
  trace = 0
)

# Perform forward selection
forward_sel_model <- stepAIC(
  null_model,
  scope = formula(full_interaction_model),
  direction = "forward",
  trace = 0
)

# Summary of the selected models
summary(backward_sel_model)
## 
## Call:
## glm(formula = Injuries ~ Safety + Experience + Hours + training + 
##     university, family = poisson(link = "log"), data = injury_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.604e+00  5.998e-02  60.075  < 2e-16 ***
## Safety2     -6.565e-02  5.750e-02  -1.142  0.25355    
## Safety3      5.740e-01  5.986e-02   9.590  < 2e-16 ***
## Safety4      6.813e-01  5.083e-02  13.403  < 2e-16 ***
## Experience2  1.492e-01  3.520e-02   4.238 2.25e-05 ***
## Experience3 -2.524e-01  4.553e-02  -5.544 2.95e-08 ***
## Experience4 -1.135e+00  4.867e-02 -23.328  < 2e-16 ***
## Hours        9.023e-07  3.618e-08  24.940  < 2e-16 ***
## training     1.577e-01  5.316e-02   2.967  0.00301 ** 
## university  -1.432e-01  5.505e-02  -2.601  0.00929 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6213.8  on 71  degrees of freedom
## Residual deviance: 1331.4  on 62  degrees of freedom
## AIC: 1753.1
## 
## Number of Fisher Scoring iterations: 5
summary(forward_sel_model)
## 
## Call:
## glm(formula = Injuries ~ Hours + Experience + Safety + training + 
##     university, family = poisson(link = "log"), data = injury_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.604e+00  5.998e-02  60.075  < 2e-16 ***
## Hours        9.023e-07  3.618e-08  24.940  < 2e-16 ***
## Experience2  1.492e-01  3.520e-02   4.238 2.25e-05 ***
## Experience3 -2.524e-01  4.553e-02  -5.544 2.95e-08 ***
## Experience4 -1.135e+00  4.867e-02 -23.328  < 2e-16 ***
## Safety2     -6.565e-02  5.750e-02  -1.142  0.25355    
## Safety3      5.740e-01  5.986e-02   9.590  < 2e-16 ***
## Safety4      6.813e-01  5.083e-02  13.403  < 2e-16 ***
## training     1.577e-01  5.316e-02   2.967  0.00301 ** 
## university  -1.432e-01  5.505e-02  -2.601  0.00929 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6213.8  on 71  degrees of freedom
## Residual deviance: 1331.4  on 62  degrees of freedom
## AIC: 1753.1
## 
## Number of Fisher Scoring iterations: 5
#Inspect models:
backward_sel_model$formula
## Injuries ~ Safety + Experience + Hours + training + university
forward_sel_model$formula
## Injuries ~ Hours + Experience + Safety + training + university

Both backward and forward selection are optimization techniques that aim to find the best-fitting model by minimizing the AIC.In this case, both approaches led to the exact same final set of predictors, which means the data strongly supports this model as the optimal one. The exclusion of the bonus variable indicates that it does not significantly contribute to predicting the number of injuries. ## Checking Validity of the Poisson Regression model

#Simulate residuals from the model:
poisson_residuals = simulateResiduals(backward_sel_model)

#Plot observed quantile versus expected quantile to assess distribution fit, and predicted value versus standardised residuals for unmodelled pattern in the residuals.
plot(poisson_residuals)

Interpretation of Diagnostic Results:

Checking Dispersion

disp_result <- dispersiontest(backward_sel_model)
print(disp_result)
## 
##  Overdispersion test
## 
## data:  backward_sel_model
## z = 4.5188, p-value = 3.109e-06
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   16.47555

Overdispersion Test: The dispersion parameter is 16.47555, far greater than 1, confirming significant overdispersion. The p-value of 3.109e-06 supports the conclusion that the Poisson model is inappropriate. ## Conclusion: The Poisson model exhibits significant overdispersion and poor fit, as shown by the residuals and dispersion test.

Negative Binomial Model

Switching from the Poisson model to a Negative Binomial model to account for the overdispersion and improve the model’s fit.

# Full model with all predictors for backwards selection
full_nb_model <- glm.nb(
  formula = Injuries ~ Safety + Experience + bonus + Hours + training + university,
  data = injury_data
)

# Null model for forward selection (only intercept)
null_nb_model <- glm.nb(
  formula = Injuries ~ 1,
  data = injury_data
)

# Perform backward selection with Negative Binomial model
backward_nb_model <- stepAIC(
  full_nb_model,
  direction = "backward",
  trace = 0
)

# Perform forward selection with Negative Binomial model
forward_nb_model <- stepAIC(
  null_nb_model,
  scope = formula(full_nb_model),
  direction = "forward",
  trace = 0
)

# Summary of the selected models
summary(backward_nb_model)
## 
## Call:
## glm.nb(formula = Injuries ~ Safety + Experience + Hours, data = injury_data, 
##     init.theta = 3.063965228, link = log)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.642e+00  1.883e-01  19.335  < 2e-16 ***
## Safety2     -2.211e-01  2.057e-01  -1.075  0.28229    
## Safety3      4.629e-01  2.822e-01   1.641  0.10090    
## Safety4      6.255e-01  2.300e-01   2.719  0.00655 ** 
## Experience2  1.744e-02  2.067e-01   0.084  0.93279    
## Experience3 -5.995e-01  2.445e-01  -2.452  0.01419 *  
## Experience4 -1.573e+00  2.028e-01  -7.759 8.56e-15 ***
## Hours        1.273e-06  2.180e-07   5.837 5.32e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.064) family taken to be 1)
## 
##     Null deviance: 282.794  on 71  degrees of freedom
## Residual deviance:  81.732  on 64  degrees of freedom
## AIC: 707.44
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.064 
##           Std. Err.:  0.576 
## 
##  2 x log-likelihood:  -689.442
summary(forward_nb_model)
## 
## Call:
## glm.nb(formula = Injuries ~ Hours + Experience + Safety, data = injury_data, 
##     init.theta = 3.063965225, link = log)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.642e+00  1.883e-01  19.335  < 2e-16 ***
## Hours        1.273e-06  2.180e-07   5.837 5.32e-09 ***
## Experience2  1.744e-02  2.067e-01   0.084  0.93279    
## Experience3 -5.995e-01  2.445e-01  -2.452  0.01419 *  
## Experience4 -1.573e+00  2.028e-01  -7.759 8.56e-15 ***
## Safety2     -2.211e-01  2.057e-01  -1.075  0.28229    
## Safety3      4.629e-01  2.822e-01   1.641  0.10090    
## Safety4      6.255e-01  2.300e-01   2.719  0.00655 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.064) family taken to be 1)
## 
##     Null deviance: 282.794  on 71  degrees of freedom
## Residual deviance:  81.732  on 64  degrees of freedom
## AIC: 707.44
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.064 
##           Std. Err.:  0.576 
## 
##  2 x log-likelihood:  -689.442

Both forward and backward model selection processes arrived at the same model with an AIC of 707.44, confirming that Safety, Experience, and Hours worked are the key factors influencing injury rates. The exclusion of bonus, training, and university suggests these variables are not crucial in predicting injuries in this context.

Checking Validity of the Negative Binomial model

#Simulate residuals from the model:
poisson_residuals = simulateResiduals(backward_nb_model)

#Plot observed quantile versus expected quantile to assess distribution fit, and predicted value versus standardised residuals for unmodelled pattern in the residuals.
plot(poisson_residuals)

Interpretation of Diagnostic Results

Conclusion:

The Negative Binomial model is a good fit, with no significant residual, dispersion, or outlier issues. Minor quantile deviations exist but do not seriously impact model performance.

Negative binomial with interaction

Here we perform model selection using forward and backward selection on a Negative Binomial regression model with interactions between Safety and Experience. Interactions were based on the EDA findings, which indicated a relationship between these two variables.

# Full model with all predictors for backwards selection
full_nb_model <- glm.nb(
  formula = Injuries ~ Safety*Experience + bonus + Hours + training + university,
  data = injury_data
)

# Null model for forward selection (only intercept)
null_nb_model <- glm.nb(
  formula = Injuries ~ 1,
  data = injury_data
)

# Perform backward selection with Negative Binomial model
backward_nb_model <- stepAIC(
  full_nb_model,
  direction = "backward",
  trace = 0
)

# Perform forward selection with Negative Binomial model
forward_nb_model <- stepAIC(
  null_nb_model,
  scope = formula(full_nb_model),
  direction = "forward",
  trace = 0
)

# Summary of the selected models
summary(backward_nb_model)
## 
## Call:
## glm.nb(formula = Injuries ~ Safety + Experience + Hours + training + 
##     Safety:Experience, data = injury_data, init.theta = 5.503787367, 
##     link = log)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          3.902e+00  2.603e-01  14.993  < 2e-16 ***
## Safety2              1.229e-01  2.923e-01   0.420 0.674202    
## Safety3              6.840e-01  3.159e-01   2.165 0.030360 *  
## Safety4             -3.214e-01  3.275e-01  -0.981 0.326467    
## Experience2          1.035e-01  3.201e-01   0.323 0.746450    
## Experience3         -1.553e-01  3.269e-01  -0.475 0.634788    
## Experience4         -1.907e+00  3.279e-01  -5.816 6.04e-09 ***
## Hours                7.979e-07  2.200e-07   3.627 0.000287 ***
## training            -3.977e-01  2.343e-01  -1.698 0.089594 .  
## Safety2:Experience2 -2.537e-01  4.297e-01  -0.590 0.554968    
## Safety3:Experience2 -4.487e-02  4.265e-01  -0.105 0.916217    
## Safety4:Experience2  1.019e+00  4.647e-01   2.192 0.028380 *  
## Safety2:Experience3 -9.156e-01  4.433e-01  -2.065 0.038877 *  
## Safety3:Experience3 -1.834e-02  4.797e-01  -0.038 0.969498    
## Safety4:Experience3  1.178e+00  5.028e-01   2.343 0.019120 *  
## Safety2:Experience4 -1.967e+00  6.614e-01  -2.975 0.002934 ** 
## Safety3:Experience4  7.644e-01  4.442e-01   1.721 0.085284 .  
## Safety4:Experience4  1.958e+00  4.737e-01   4.134 3.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(5.5038) family taken to be 1)
## 
##     Null deviance: 471.645  on 71  degrees of freedom
## Residual deviance:  76.773  on 54  degrees of freedom
## AIC: 685.32
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  5.50 
##           Std. Err.:  1.09 
## 
##  2 x log-likelihood:  -647.324
summary(forward_nb_model)
## 
## Call:
## glm.nb(formula = Injuries ~ Hours + Experience + Safety + training + 
##     Experience:Safety, data = injury_data, init.theta = 5.503787363, 
##     link = log)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          3.902e+00  2.603e-01  14.993  < 2e-16 ***
## Hours                7.979e-07  2.200e-07   3.627 0.000287 ***
## Experience2          1.035e-01  3.201e-01   0.323 0.746450    
## Experience3         -1.553e-01  3.269e-01  -0.475 0.634788    
## Experience4         -1.907e+00  3.279e-01  -5.816 6.04e-09 ***
## Safety2              1.229e-01  2.923e-01   0.420 0.674202    
## Safety3              6.840e-01  3.159e-01   2.165 0.030360 *  
## Safety4             -3.214e-01  3.275e-01  -0.981 0.326467    
## training            -3.977e-01  2.343e-01  -1.698 0.089594 .  
## Experience2:Safety2 -2.537e-01  4.297e-01  -0.590 0.554968    
## Experience3:Safety2 -9.156e-01  4.433e-01  -2.065 0.038877 *  
## Experience4:Safety2 -1.967e+00  6.614e-01  -2.975 0.002934 ** 
## Experience2:Safety3 -4.487e-02  4.265e-01  -0.105 0.916217    
## Experience3:Safety3 -1.834e-02  4.797e-01  -0.038 0.969498    
## Experience4:Safety3  7.644e-01  4.442e-01   1.721 0.085284 .  
## Experience2:Safety4  1.019e+00  4.647e-01   2.192 0.028380 *  
## Experience3:Safety4  1.178e+00  5.028e-01   2.343 0.019120 *  
## Experience4:Safety4  1.958e+00  4.737e-01   4.134 3.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(5.5038) family taken to be 1)
## 
##     Null deviance: 471.645  on 71  degrees of freedom
## Residual deviance:  76.773  on 54  degrees of freedom
## AIC: 685.32
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  5.50 
##           Std. Err.:  1.09 
## 
##  2 x log-likelihood:  -647.324

Checking Validity of the Negative Binomial model

#Simulate residuals from the model:
poisson_residuals = simulateResiduals(backward_nb_model)

#Plot observed quantile versus expected quantile to assess distribution fit, and predicted value versus standardised residuals for unmodelled pattern in the residuals.
plot(poisson_residuals)

Interpretation of the Results

Conclusion:

With an AIC of 685.32 and favorable results from the residual diagnostics, this Negative Binomial model with Safety:Experience interaction appears to be a strong fit for the data. There are no significant issues with overdispersion, outliers, or residuals, and the model captures the relationships between the predictors and injury counts effectively. This model is a robust option for predicting injuries based on Safety, Experience, Hours worked, training, and their interactions. ## Summary stats of the selected models

# Load necessary libraries
library(MASS)
library(broom)

# Fit the Poisson Regression Model
poisson_model <- glm(
  formula = Injuries ~ Safety + Experience + Hours + training + university,
  family = poisson(link = "log"),
  data = injury_data
)

# Fit the Negative Binomial Regression Model
nb_model <- glm.nb(
  formula = Injuries ~ Safety + Experience + Hours,
  data = injury_data
)

# Fit the Negative Binomial Model with Interaction Terms
nb_interaction_model <- glm.nb(
  formula = Injuries ~ Safety*Experience + Hours + training ,
  data = injury_data
)

# Function to extract model statistics
model_stats <- function(model, model_name) {
  data.frame(
    Model = model_name,
    AIC = AIC(model),
    BIC = BIC(model),
    logLik = as.numeric(logLik(model)),
    deviance = deviance(model),
    df_residual = df.residual(model)
  )
}

# Get statistics for each model and store them in a dataframe
poisson_stats <- model_stats(poisson_model, "Poisson Regression Model")
nb_stats <- model_stats(nb_model, "Negative Binomial Model")
nb_interaction_stats <- model_stats(nb_interaction_model, "Negative Binomial with Safety and Experience Interaction")

# Combine the statistics into a single dataframe
all_model_stats_df <- rbind(poisson_stats, nb_stats, nb_interaction_stats)

# Display the dataframe
print(all_model_stats_df)
##                                                      Model       AIC       BIC
## 1                                 Poisson Regression Model 1753.1357 1775.9024
## 2                                  Negative Binomial Model  707.4422  727.9322
## 3 Negative Binomial with Safety and Experience Interaction  685.3239  728.5806
##      logLik   deviance df_residual
## 1 -866.5679 1331.39222          62
## 2 -344.7211   81.73201          64
## 3 -323.6620   76.77264          54
# Sample data: AIC, BIC, log-likelihood for three models
plot_data <- data.frame(
  model = c("Poisson Regression Model", "Negative Binomial Model", "Negative Binomial with Interaction"),
  AIC = c(1753.10, 707.44, 685.32),
  BIC = c(1783.45, 728.50, 722.45),
  logL = c(-868.55, -353.72, -341.66)
)

# Melt the data into long form for ggplot:
long_plot_data <- melt(data = plot_data, id = "model", variable.name = "measure")

# Plot together for comparison
ggplot(
  data = long_plot_data,
  mapping = aes(
    x = model,
    y = value,
    group = measure,
    colour = measure
  )
) +
  geom_point() +
  scale_colour_discrete(
    breaks = c("AIC", "BIC", "logL"),
    labels = c("AIC", "BIC", "Log-Likelihood")
  ) +
  labs(x = "Model", y = "Value", colour = "Measure") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Model Comparison

Research Question answers and recommendations

Appendix

# Calculate phi_hat for Poisson model
residual_df <- backward_sel_model$df.residual 
phi_hat <- deviance(backward_sel_model) / residual_df 

# Plot mean-variance relationship for Poisson and Negative Binomial models
# Use predictions from the Negative Binomial model
predicted_values <- predict(backward_nb_model)
grouped_values <- cut(predicted_values, breaks = quantile(predicted_values, seq(0, 100, 10)/100))

# Mean and variance for each group
mean_values <- tapply(injury_data$Injuries, grouped_values, mean)
variance_values <- tapply(injury_data$Injuries, grouped_values, var)

# Plot the mean-variance relationship
plot(mean_values, variance_values, xlab = "Mean", ylab = "Variance",
     main = "Mean-Variance Relationship")

# Add lines for Poisson and Negative Binomial models
x_seq <- seq(min(mean_values), max(mean_values), length.out = 100)
lines(x_seq, x_seq * phi_hat, lty = "dashed")  # Quasi-Poisson line
lines(x_seq, x_seq * (1 + x_seq / backward_nb_model$theta))  # Negative Binomial line

# Add legend to the plot
legend("topleft", lty = c("dashed", "solid"),
       legend = c("Quasi-Poisson", "Negative Binomial"), inset = 0.05)

The Negative Binomial model fits the data better, especially at higher mean values, where the variance is larger. This supports the use of a Negative Binomial model over a Poisson or Quasi-Poisson model when overdispersion is present, as it better captures the increasing variance in the data.