MXN600 Minor Data Analysis Project

Scenario

Our company, a global industrial manufacturing firm with over 10,000 employees, recently faced scrutiny after a workplace accident in South America, highlighting variability in safety practices across regions.The CEO has requested an analysis of workplace injury data to recommend a safety regime that should be adopted as the company-wide standard for injury prevention.

Load Packages

library(tidyverse)
library(MASS)
library(ggpubr)
library(DHARMa)
library(AER)
library(ggplot2)
library(reshape2)

Data

The dataset, injury.csv, covers the last 12 months of operation and includes the following variables:

Injuries: Number of injuries in each group.
Safety: The safety regime in place for each group.
Hours: Total hours worked by each group.
Experience: The experience level of workers in each group.
Bonus: Proportion of the group that received an annual bonus last year.
Training: Proportion of the group that completed external safety training.
University: Proportion of the group with at least one university degree.

The data is aggregated by experience level and the safety regime implemented at each factory.

# Load the CSV file
injury_data <- read.csv("C:/Users/navee/Downloads/injury.csv")

# Examine the data structure
str(injury_data)

## 'data.frame':    72 obs. of  8 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Injuries  : int  6 8 7 19 39 57 60 40 40 1 ...
##  $ Safety    : int  1 1 1 1 1 1 1 1 1 2 ...
##  $ Experience: int  4 4 4 3 3 2 2 1 1 1 ...
##  $ Hours     : int  231437 126655 87847 222970 376438 316462 235670 127361 96667 51156 ...
##  $ bonus     : num  0.27 0.37 0.57 0.91 0.2 0.9 0.94 0.66 0.63 0.06 ...
##  $ training  : num  0.35 0.33 0.48 0.89 0.86 0.39 0.78 0.96 0.43 0.71 ...
##  $ university: num  0.73 0.45 0.18 0.75 0.1 0.86 0.61 0.56 0.33 0.45 ...

summary(injury_data)

##        X            Injuries          Safety       Experience 
##  Min.   : 1.00   Min.   :  0.00   Min.   :1.00   Min.   :1.0  
##  1st Qu.:18.75   1st Qu.: 25.00   1st Qu.:1.75   1st Qu.:1.0  
##  Median :36.50   Median : 57.50   Median :2.50   Median :2.5  
##  Mean   :36.50   Mean   : 87.46   Mean   :2.50   Mean   :2.5  
##  3rd Qu.:54.25   3rd Qu.: 96.50   3rd Qu.:3.25   3rd Qu.:4.0  
##  Max.   :72.00   Max.   :491.00   Max.   :4.00   Max.   :4.0  
##      Hours             bonus           training        university    
##  Min.   :  34574   Min.   :0.0100   Min.   :0.0100   Min.   :0.0600  
##  1st Qu.: 130272   1st Qu.:0.3125   1st Qu.:0.2675   1st Qu.:0.2800  
##  Median : 302879   Median :0.4950   Median :0.5050   Median :0.5200  
##  Mean   : 549996   Mean   :0.5140   Mean   :0.5096   Mean   :0.5189  
##  3rd Qu.: 813381   3rd Qu.:0.7400   3rd Qu.:0.7125   3rd Qu.:0.7600  
##  Max.   :2135146   Max.   :0.9900   Max.   :0.9900   Max.   :0.9600

Research Questions

Which safety regime should be the international standard based solely on injury prevention performance?
Is industry experience more important than the safety regime in preventing injuries, and can policies aimed at reducing employee turnover (via experience) lower injury rates?
Is there a relationship between:
- Injuries and the annual bonuses employees receive?
- Injuries and whether employees have received formal external qualifications (e.g., external safety training or a university degree)?

Data Preprocessing

Handling Categorical Variables Some variables in the dataset, such as Safety Regime and Experience, were categorical. These variables were converted to factors to be correctly interpreted by the model. In this data, no missing data were identified. If there had been missing values, appropriate strategies such as imputation or removal of rows would have been applied.

# Converting safety and experience to factors
injury_data$Safety <- as.factor(injury_data$Safety)
injury_data$Experience <- as.factor(injury_data$Experience)
str(injury_data)

## 'data.frame':    72 obs. of  8 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Injuries  : int  6 8 7 19 39 57 60 40 40 1 ...
##  $ Safety    : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 2 ...
##  $ Experience: Factor w/ 4 levels "1","2","3","4": 4 4 4 3 3 2 2 1 1 1 ...
##  $ Hours     : int  231437 126655 87847 222970 376438 316462 235670 127361 96667 51156 ...
##  $ bonus     : num  0.27 0.37 0.57 0.91 0.2 0.9 0.94 0.66 0.63 0.06 ...
##  $ training  : num  0.35 0.33 0.48 0.89 0.86 0.39 0.78 0.96 0.43 0.71 ...
##  $ university: num  0.73 0.45 0.18 0.75 0.1 0.86 0.61 0.56 0.33 0.45 ...

colSums(is.na(injury_data))

##          X   Injuries     Safety Experience      Hours      bonus   training 
##          0          0          0          0          0          0          0 
## university 
##          0

# Check first few rows
head(injury_data)

##   X Injuries Safety Experience  Hours bonus training university
## 1 1        6      1          4 231437  0.27     0.35       0.73
## 2 2        8      1          4 126655  0.37     0.33       0.45
## 3 3        7      1          4  87847  0.57     0.48       0.18
## 4 4       19      1          3 222970  0.91     0.89       0.75
## 5 5       39      1          3 376438  0.20     0.86       0.10
## 6 6       57      1          2 316462  0.90     0.39       0.86

Exploratory Data Analysis

Exploratory Data Analysis (EDA) helps us gain initial insights into the patterns, relationships, and distributions within our dataset before diving into detailed modeling. By using techniques such as box plots, scatter plots, and heatmaps, we can visually explore how different variables relate to injury rates. For instance

The box plot allows us to compare the injury prevention performance across different safety regimes, highlighting differences in injury distribution.
The scatter plot helps us understand if there is any relationship between the bonus proportion employees receive and the number of injuries.
The heatmap visually displays the interaction between experience level and safety regime, showing how their combination affects the number of injuries.

## `geom_smooth()` using formula = 'y ~ x'

Based on the box plot showing injury prevention performance by safety regime, Safety Regime 1 should again be recommended as the international standard for workplace safety. It has the lowest median number of injuries and the smallest spread (interquartile range), indicating that it consistently performs better in injury prevention compared to the other safety regimes. Safety Regime 3 shows the highest median and widest range, making it the least effective in terms of injury prevention.

From the heatmap, we can observe the following patterns regarding the interaction between safety regime and experience level in relation to mean injuries: The heatmap shows that Safety Regime 3 and Experience Level 3 have the highest injury rates, suggesting poor effectiveness for that combination. In contrast, Safety Regime 4 and higher experience levels (especially Level 4) are associated with fewer injuries, indicating that this regime works better, particularly for more experienced workers. Overall, experience plays a key role in reducing injury rates, especially with the more effective safety regimes.

The scatter plot with a smoothing line shows the relationship between Injuries and Bonus Proportion. From the plot, we can observe that there is no clear pattern. The red smoothing line remains relatively flat across the different bonus proportions, indicating that there is no strong relationship between the proportion of bonuses received and the number of injuries.

Fit Models

Poisson Regression Model Performing model selection using backward and forward selection approaches on Poisson regression models. The purpose is to find the best-fitting model to predict the number of injuries based on various predictor variables, while minimizing the model’s complexity.

# Full model with all possible interactions for backwards selection
full_interaction_model <- glm(
  data = injury_data,
  formula = Injuries ~ Safety + Experience + bonus + Hours + training + university,
  family = poisson(link = "log")
)

# Null model for forwards selection
null_model <- glm(
  data = injury_data,
  formula = Injuries ~ 1,
  family = poisson(link = "log")
)

# Perform backward selection
backward_sel_model <- stepAIC(
  full_interaction_model,
  direction = "backward",
  trace = 0
)

# Perform forward selection
forward_sel_model <- stepAIC(
  null_model,
  scope = formula(full_interaction_model),
  direction = "forward",
  trace = 0
)

# Summary of the selected models
summary(backward_sel_model)

## 
## Call:
## glm(formula = Injuries ~ Safety + Experience + Hours + training + 
##     university, family = poisson(link = "log"), data = injury_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.604e+00  5.998e-02  60.075  < 2e-16 ***
## Safety2     -6.565e-02  5.750e-02  -1.142  0.25355    
## Safety3      5.740e-01  5.986e-02   9.590  < 2e-16 ***
## Safety4      6.813e-01  5.083e-02  13.403  < 2e-16 ***
## Experience2  1.492e-01  3.520e-02   4.238 2.25e-05 ***
## Experience3 -2.524e-01  4.553e-02  -5.544 2.95e-08 ***
## Experience4 -1.135e+00  4.867e-02 -23.328  < 2e-16 ***
## Hours        9.023e-07  3.618e-08  24.940  < 2e-16 ***
## training     1.577e-01  5.316e-02   2.967  0.00301 ** 
## university  -1.432e-01  5.505e-02  -2.601  0.00929 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6213.8  on 71  degrees of freedom
## Residual deviance: 1331.4  on 62  degrees of freedom
## AIC: 1753.1
## 
## Number of Fisher Scoring iterations: 5

summary(forward_sel_model)

## 
## Call:
## glm(formula = Injuries ~ Hours + Experience + Safety + training + 
##     university, family = poisson(link = "log"), data = injury_data)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.604e+00  5.998e-02  60.075  < 2e-16 ***
## Hours        9.023e-07  3.618e-08  24.940  < 2e-16 ***
## Experience2  1.492e-01  3.520e-02   4.238 2.25e-05 ***
## Experience3 -2.524e-01  4.553e-02  -5.544 2.95e-08 ***
## Experience4 -1.135e+00  4.867e-02 -23.328  < 2e-16 ***
## Safety2     -6.565e-02  5.750e-02  -1.142  0.25355    
## Safety3      5.740e-01  5.986e-02   9.590  < 2e-16 ***
## Safety4      6.813e-01  5.083e-02  13.403  < 2e-16 ***
## training     1.577e-01  5.316e-02   2.967  0.00301 ** 
## university  -1.432e-01  5.505e-02  -2.601  0.00929 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6213.8  on 71  degrees of freedom
## Residual deviance: 1331.4  on 62  degrees of freedom
## AIC: 1753.1
## 
## Number of Fisher Scoring iterations: 5

#Inspect models:
backward_sel_model$formula

## Injuries ~ Safety + Experience + Hours + training + university

forward_sel_model$formula

## Injuries ~ Hours + Experience + Safety + training + university

Both backward and forward selection are optimization techniques that aim to find the best-fitting model by minimizing the AIC.In this case, both approaches led to the exact same final set of predictors, which means the data strongly supports this model as the optimal one. The exclusion of the bonus variable indicates that it does not significantly contribute to predicting the number of injuries. ## Checking Validity of the Poisson Regression model

#Simulate residuals from the model:
poisson_residuals = simulateResiduals(backward_sel_model)

#Plot observed quantile versus expected quantile to assess distribution fit, and predicted value versus standardised residuals for unmodelled pattern in the residuals.
plot(poisson_residuals)

Interpretation of Diagnostic Results:

Residual Diagnostics (DHARMa Residual Plots):
The Q-Q plot and KS test (p-value = 0) indicate that the residuals deviate significantly from the expected distribution, suggesting that the Poisson model does not fit well.
The Dispersion test (p-value = 0) shows significant overdispersion, meaning the variance is much larger than the mean, violating Poisson assumptions.
The Residual vs. Predicted plot shows clear quantile deviations, indicating that the model fails to capture patterns in the data.

Checking Dispersion

disp_result <- dispersiontest(backward_sel_model)
print(disp_result)

## 
##  Overdispersion test
## 
## data:  backward_sel_model
## z = 4.5188, p-value = 3.109e-06
## alternative hypothesis: true dispersion is greater than 1
## sample estimates:
## dispersion 
##   16.47555

Overdispersion Test: The dispersion parameter is 16.47555, far greater than 1, confirming significant overdispersion. The p-value of 3.109e-06 supports the conclusion that the Poisson model is inappropriate. ## Conclusion: The Poisson model exhibits significant overdispersion and poor fit, as shown by the residuals and dispersion test.

Negative Binomial Model

Switching from the Poisson model to a Negative Binomial model to account for the overdispersion and improve the model’s fit.

# Full model with all predictors for backwards selection
full_nb_model <- glm.nb(
  formula = Injuries ~ Safety + Experience + bonus + Hours + training + university,
  data = injury_data
)

# Null model for forward selection (only intercept)
null_nb_model <- glm.nb(
  formula = Injuries ~ 1,
  data = injury_data
)

# Perform backward selection with Negative Binomial model
backward_nb_model <- stepAIC(
  full_nb_model,
  direction = "backward",
  trace = 0
)

# Perform forward selection with Negative Binomial model
forward_nb_model <- stepAIC(
  null_nb_model,
  scope = formula(full_nb_model),
  direction = "forward",
  trace = 0
)

# Summary of the selected models
summary(backward_nb_model)

## 
## Call:
## glm.nb(formula = Injuries ~ Safety + Experience + Hours, data = injury_data, 
##     init.theta = 3.063965228, link = log)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.642e+00  1.883e-01  19.335  < 2e-16 ***
## Safety2     -2.211e-01  2.057e-01  -1.075  0.28229    
## Safety3      4.629e-01  2.822e-01   1.641  0.10090    
## Safety4      6.255e-01  2.300e-01   2.719  0.00655 ** 
## Experience2  1.744e-02  2.067e-01   0.084  0.93279    
## Experience3 -5.995e-01  2.445e-01  -2.452  0.01419 *  
## Experience4 -1.573e+00  2.028e-01  -7.759 8.56e-15 ***
## Hours        1.273e-06  2.180e-07   5.837 5.32e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.064) family taken to be 1)
## 
##     Null deviance: 282.794  on 71  degrees of freedom
## Residual deviance:  81.732  on 64  degrees of freedom
## AIC: 707.44
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.064 
##           Std. Err.:  0.576 
## 
##  2 x log-likelihood:  -689.442

summary(forward_nb_model)

## 
## Call:
## glm.nb(formula = Injuries ~ Hours + Experience + Safety, data = injury_data, 
##     init.theta = 3.063965225, link = log)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  3.642e+00  1.883e-01  19.335  < 2e-16 ***
## Hours        1.273e-06  2.180e-07   5.837 5.32e-09 ***
## Experience2  1.744e-02  2.067e-01   0.084  0.93279    
## Experience3 -5.995e-01  2.445e-01  -2.452  0.01419 *  
## Experience4 -1.573e+00  2.028e-01  -7.759 8.56e-15 ***
## Safety2     -2.211e-01  2.057e-01  -1.075  0.28229    
## Safety3      4.629e-01  2.822e-01   1.641  0.10090    
## Safety4      6.255e-01  2.300e-01   2.719  0.00655 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(3.064) family taken to be 1)
## 
##     Null deviance: 282.794  on 71  degrees of freedom
## Residual deviance:  81.732  on 64  degrees of freedom
## AIC: 707.44
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  3.064 
##           Std. Err.:  0.576 
## 
##  2 x log-likelihood:  -689.442

Both forward and backward model selection processes arrived at the same model with an AIC of 707.44, confirming that Safety, Experience, and Hours worked are the key factors influencing injury rates. The exclusion of bonus, training, and university suggests these variables are not crucial in predicting injuries in this context.

Checking Validity of the Negative Binomial model

#Simulate residuals from the model:
poisson_residuals = simulateResiduals(backward_nb_model)

#Plot observed quantile versus expected quantile to assess distribution fit, and predicted value versus standardised residuals for unmodelled pattern in the residuals.
plot(poisson_residuals)

Interpretation of Diagnostic Results

Residual Diagnostics (DHARMa Residual Plots)
- Q-Q Plot:
  
  The KS test (p = 0.52) and dispersion test (p = 0.35) show no significant deviations, meaning the model fits the residuals well and there is no overdispersion.
- Outlier test (p = 1) indicates no significant outliers.
Residual vs. Predicted Plot: Some quantile deviations are detected (red curves) with the model’s fit.

Conclusion:

The Negative Binomial model is a good fit, with no significant residual, dispersion, or outlier issues. Minor quantile deviations exist but do not seriously impact model performance.

Negative binomial with interaction

Here we perform model selection using forward and backward selection on a Negative Binomial regression model with interactions between Safety and Experience. Interactions were based on the EDA findings, which indicated a relationship between these two variables.

# Full model with all predictors for backwards selection
full_nb_model <- glm.nb(
  formula = Injuries ~ Safety*Experience + bonus + Hours + training + university,
  data = injury_data
)

# Null model for forward selection (only intercept)
null_nb_model <- glm.nb(
  formula = Injuries ~ 1,
  data = injury_data
)

# Perform backward selection with Negative Binomial model
backward_nb_model <- stepAIC(
  full_nb_model,
  direction = "backward",
  trace = 0
)

# Perform forward selection with Negative Binomial model
forward_nb_model <- stepAIC(
  null_nb_model,
  scope = formula(full_nb_model),
  direction = "forward",
  trace = 0
)

# Summary of the selected models
summary(backward_nb_model)

## 
## Call:
## glm.nb(formula = Injuries ~ Safety + Experience + Hours + training + 
##     Safety:Experience, data = injury_data, init.theta = 5.503787367, 
##     link = log)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          3.902e+00  2.603e-01  14.993  < 2e-16 ***
## Safety2              1.229e-01  2.923e-01   0.420 0.674202    
## Safety3              6.840e-01  3.159e-01   2.165 0.030360 *  
## Safety4             -3.214e-01  3.275e-01  -0.981 0.326467    
## Experience2          1.035e-01  3.201e-01   0.323 0.746450    
## Experience3         -1.553e-01  3.269e-01  -0.475 0.634788    
## Experience4         -1.907e+00  3.279e-01  -5.816 6.04e-09 ***
## Hours                7.979e-07  2.200e-07   3.627 0.000287 ***
## training            -3.977e-01  2.343e-01  -1.698 0.089594 .  
## Safety2:Experience2 -2.537e-01  4.297e-01  -0.590 0.554968    
## Safety3:Experience2 -4.487e-02  4.265e-01  -0.105 0.916217    
## Safety4:Experience2  1.019e+00  4.647e-01   2.192 0.028380 *  
## Safety2:Experience3 -9.156e-01  4.433e-01  -2.065 0.038877 *  
## Safety3:Experience3 -1.834e-02  4.797e-01  -0.038 0.969498    
## Safety4:Experience3  1.178e+00  5.028e-01   2.343 0.019120 *  
## Safety2:Experience4 -1.967e+00  6.614e-01  -2.975 0.002934 ** 
## Safety3:Experience4  7.644e-01  4.442e-01   1.721 0.085284 .  
## Safety4:Experience4  1.958e+00  4.737e-01   4.134 3.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(5.5038) family taken to be 1)
## 
##     Null deviance: 471.645  on 71  degrees of freedom
## Residual deviance:  76.773  on 54  degrees of freedom
## AIC: 685.32
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  5.50 
##           Std. Err.:  1.09 
## 
##  2 x log-likelihood:  -647.324

summary(forward_nb_model)

## 
## Call:
## glm.nb(formula = Injuries ~ Hours + Experience + Safety + training + 
##     Experience:Safety, data = injury_data, init.theta = 5.503787363, 
##     link = log)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          3.902e+00  2.603e-01  14.993  < 2e-16 ***
## Hours                7.979e-07  2.200e-07   3.627 0.000287 ***
## Experience2          1.035e-01  3.201e-01   0.323 0.746450    
## Experience3         -1.553e-01  3.269e-01  -0.475 0.634788    
## Experience4         -1.907e+00  3.279e-01  -5.816 6.04e-09 ***
## Safety2              1.229e-01  2.923e-01   0.420 0.674202    
## Safety3              6.840e-01  3.159e-01   2.165 0.030360 *  
## Safety4             -3.214e-01  3.275e-01  -0.981 0.326467    
## training            -3.977e-01  2.343e-01  -1.698 0.089594 .  
## Experience2:Safety2 -2.537e-01  4.297e-01  -0.590 0.554968    
## Experience3:Safety2 -9.156e-01  4.433e-01  -2.065 0.038877 *  
## Experience4:Safety2 -1.967e+00  6.614e-01  -2.975 0.002934 ** 
## Experience2:Safety3 -4.487e-02  4.265e-01  -0.105 0.916217    
## Experience3:Safety3 -1.834e-02  4.797e-01  -0.038 0.969498    
## Experience4:Safety3  7.644e-01  4.442e-01   1.721 0.085284 .  
## Experience2:Safety4  1.019e+00  4.647e-01   2.192 0.028380 *  
## Experience3:Safety4  1.178e+00  5.028e-01   2.343 0.019120 *  
## Experience4:Safety4  1.958e+00  4.737e-01   4.134 3.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(5.5038) family taken to be 1)
## 
##     Null deviance: 471.645  on 71  degrees of freedom
## Residual deviance:  76.773  on 54  degrees of freedom
## AIC: 685.32
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  5.50 
##           Std. Err.:  1.09 
## 
##  2 x log-likelihood:  -647.324

Checking Validity of the Negative Binomial model

#Simulate residuals from the model:
poisson_residuals = simulateResiduals(backward_nb_model)

#Plot observed quantile versus expected quantile to assess distribution fit, and predicted value versus standardised residuals for unmodelled pattern in the residuals.
plot(poisson_residuals)

Interpretation of the Results

AIC Value: The AIC of 685.32 indicates a better-fitting model compared to previous models, as a lower AIC suggests improved model performance while balancing complexity and fit. Residual Diagnostics (DHARMa Plots):
Q-Q Plot: The KS test (p = 0.11484) shows no significant deviation from the expected distribution, indicating that the residuals are well-behaved.
Dispersion Test (p = 0.992): There is no overdispersion, meaning the Negative Binomial model has successfully accounted for the variance in the data.
Outlier Test (p = 0.44): There are no significant outliers, suggesting that the model handles the data without being skewed by extreme values.
Residual vs. Predicted Plot: No significant problems were detected. The residuals are evenly distributed, indicating a good fit with no systematic errors.

Conclusion:

With an AIC of 685.32 and favorable results from the residual diagnostics, this Negative Binomial model with Safety:Experience interaction appears to be a strong fit for the data. There are no significant issues with overdispersion, outliers, or residuals, and the model captures the relationships between the predictors and injury counts effectively. This model is a robust option for predicting injuries based on Safety, Experience, Hours worked, training, and their interactions. ## Summary stats of the selected models

# Load necessary libraries
library(MASS)
library(broom)

# Fit the Poisson Regression Model
poisson_model <- glm(
  formula = Injuries ~ Safety + Experience + Hours + training + university,
  family = poisson(link = "log"),
  data = injury_data
)

# Fit the Negative Binomial Regression Model
nb_model <- glm.nb(
  formula = Injuries ~ Safety + Experience + Hours,
  data = injury_data
)

# Fit the Negative Binomial Model with Interaction Terms
nb_interaction_model <- glm.nb(
  formula = Injuries ~ Safety*Experience + Hours + training ,
  data = injury_data
)

# Function to extract model statistics
model_stats <- function(model, model_name) {
  data.frame(
    Model = model_name,
    AIC = AIC(model),
    BIC = BIC(model),
    logLik = as.numeric(logLik(model)),
    deviance = deviance(model),
    df_residual = df.residual(model)
  )
}

# Get statistics for each model and store them in a dataframe
poisson_stats <- model_stats(poisson_model, "Poisson Regression Model")
nb_stats <- model_stats(nb_model, "Negative Binomial Model")
nb_interaction_stats <- model_stats(nb_interaction_model, "Negative Binomial with Safety and Experience Interaction")

# Combine the statistics into a single dataframe
all_model_stats_df <- rbind(poisson_stats, nb_stats, nb_interaction_stats)

# Display the dataframe
print(all_model_stats_df)

##                                                      Model       AIC       BIC
## 1                                 Poisson Regression Model 1753.1357 1775.9024
## 2                                  Negative Binomial Model  707.4422  727.9322
## 3 Negative Binomial with Safety and Experience Interaction  685.3239  728.5806
##      logLik   deviance df_residual
## 1 -866.5679 1331.39222          62
## 2 -344.7211   81.73201          64
## 3 -323.6620   76.77264          54

# Sample data: AIC, BIC, log-likelihood for three models
plot_data <- data.frame(
  model = c("Poisson Regression Model", "Negative Binomial Model", "Negative Binomial with Interaction"),
  AIC = c(1753.10, 707.44, 685.32),
  BIC = c(1783.45, 728.50, 722.45),
  logL = c(-868.55, -353.72, -341.66)
)

# Melt the data into long form for ggplot:
long_plot_data <- melt(data = plot_data, id = "model", variable.name = "measure")

# Plot together for comparison
ggplot(
  data = long_plot_data,
  mapping = aes(
    x = model,
    y = value,
    group = measure,
    colour = measure
  )
) +
  geom_point() +
  scale_colour_discrete(
    breaks = c("AIC", "BIC", "logL"),
    labels = c("AIC", "BIC", "Log-Likelihood")
  ) +
  labs(x = "Model", y = "Value", colour = "Measure") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Model Comparison

AIC (Akaike Information Criterion): AIC is a measure of model quality that balances goodness-of-fit with model complexity. Lower AIC values are preferred, as they indicate a better trade-off between complexity and fit. In this case, the Negative Binomial with Interaction model has the lowest AIC of 685.32, which suggests it provides the best fit among the three models.
BIC (Bayesian Information Criterion): Similar to AIC, BIC also balances model fit with complexity, but it penalizes model complexity more heavily than AIC. Again, the Negative Binomial with Interaction model has the lowest BIC of 722.45, reinforcing the idea that this model strikes the best balance between fit and complexity.
Log-Likelihood: Log-likelihood reflects the fit of the model, where higher values (closer to zero) indicate a better fit. The Negative Binomial with Interaction model has the highest log-likelihood (-341.66) among the three, meaning it fits the data best.
Deviance: Deviance is another measure of model fit, with lower values indicating a better fit. The Negative Binomial with Interaction model has the lowest deviance of 76.77, which further confirms that it provides the best fit to the data.
Degrees of Freedom: The Negative Binomial with Interaction model has fewer degrees of freedom (54) compared to the other models, which means it’s a more complex model but justified by the better fit.

Research Question answers and recommendations

From the exploratory data analysis , it is clear that Safety Regime 1 should again be recommended as the international standard for workplace safety. It has the lowest median number of injuries and the smallest spread (interquartile range), indicating that it consistently performs better in injury prevention compared to the other safety regimes.
For Query 2: Industry Experience vs. Safety Regime
- Recommendation:
  - Experience Level 4 significantly reduces injuries (p < 0.001).
  - Safety3 and Safety4 also significantly reduce injuries (p = 0.003 and p = 0.006 respectively).
  - Interactions like Safety4 (p < 0.001) show that the effect of safety regimes depends on experience levels.
- Conclusion:
  - Both experience and safety regimes are important. The best strategy is to focus on Safety3 and Safety4 and retain experienced workers (Level 4) to reduce injury rates.
Query 3: Relationship Between Injuries, Bonuses, and Qualifications
- Bonuses and university degrees have no significant impact on injuries. The focus should remain on safety regimes and experience for injury prevention.

Appendix

# Calculate phi_hat for Poisson model
residual_df <- backward_sel_model$df.residual 
phi_hat <- deviance(backward_sel_model) / residual_df 

# Plot mean-variance relationship for Poisson and Negative Binomial models
# Use predictions from the Negative Binomial model
predicted_values <- predict(backward_nb_model)
grouped_values <- cut(predicted_values, breaks = quantile(predicted_values, seq(0, 100, 10)/100))

# Mean and variance for each group
mean_values <- tapply(injury_data$Injuries, grouped_values, mean)
variance_values <- tapply(injury_data$Injuries, grouped_values, var)

# Plot the mean-variance relationship
plot(mean_values, variance_values, xlab = "Mean", ylab = "Variance",
     main = "Mean-Variance Relationship")

# Add lines for Poisson and Negative Binomial models
x_seq <- seq(min(mean_values), max(mean_values), length.out = 100)
lines(x_seq, x_seq * phi_hat, lty = "dashed")  # Quasi-Poisson line
lines(x_seq, x_seq * (1 + x_seq / backward_nb_model$theta))  # Negative Binomial line

# Add legend to the plot
legend("topleft", lty = c("dashed", "solid"),
       legend = c("Quasi-Poisson", "Negative Binomial"), inset = 0.05)

The Negative Binomial model fits the data better, especially at higher mean values, where the variance is larger. This supports the use of a Negative Binomial model over a Poisson or Quasi-Poisson model when overdispersion is present, as it better captures the increasing variance in the data.

MXN600 Minor Data Analysis Project

Naveen Chacko Alex_n11536039

2024-09-14

Scenario

Load Packages

Data

Research Questions

Data Preprocessing

Exploratory Data Analysis

Fit Models

Checking Dispersion

Negative Binomial Model

Checking Validity of the Negative Binomial model

Conclusion:

Negative binomial with interaction

Checking Validity of the Negative Binomial model

Conclusion:

Research Question answers and recommendations

Appendix