In this notebook, we will build a generalized linear model with overall rating as the response variable and stamina, strength, dribbling and acceleration as the explanatory variables. We will then diagnose potential issues and interpret the model coefficients.
# Fit the GLM model
model <- glm(overall_rating ~ stamina + strength + dribbling + acceleration,
data = player_data,
family = gaussian()) # Use gaussian for a continuous response variable
summary(model)
##
## Call:
## glm(formula = overall_rating ~ stamina + strength + dribbling +
## acceleration, family = gaussian(), data = player_data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.735062 0.334006 139.923 <2e-16 ***
## stamina 0.010254 0.004468 2.295 0.0217 *
## strength 0.194667 0.004045 48.127 <2e-16 ***
## dribbling 0.158976 0.003976 39.988 <2e-16 ***
## acceleration -0.040414 0.004867 -8.303 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 34.81231)
##
## Null deviance: 870604 on 17953 degrees of freedom
## Residual deviance: 624846 on 17949 degrees of freedom
## AIC: 114694
##
## Number of Fisher Scoring iterations: 2
Intercept: The intercept is 46.735, which is the expected overall_rating when all explanatory variables are 0.
Stamina, Strength, Dribbling, Acceleration: The coefficients are 0.010, 0.195, 0.159 and -0.040 respectively. Stamina, Strength and Dribbling have a positive relationship with the response: holding the other variables constant, a one-point increase in each raises the expected overall rating by 0.010, 0.195 and 0.159 respectively. Acceleration has a negative relationship: a one-point increase lowers the expected overall rating by 0.040.
The p-values show that strength, dribbling and acceleration are highly significant for the overall rating (p < 2e-16), whereas stamina is also significant at the 5% level (p = 0.0217), but not by a big margin.
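To make the "per point" reading of the coefficients concrete, here is a small worked sketch. The coefficients are copied from the summary output above; the player attribute values (all set to 70) are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(model) output above
coefs <- c(intercept = 46.735062, stamina = 0.010254, strength = 0.194667,
           dribbling = 0.158976, acceleration = -0.040414)

# Hypothetical player: attribute values chosen only for illustration
player <- c(stamina = 70, strength = 70, dribbling = 70, acceleration = 70)

# Linear predictor: intercept + sum of (coefficient * attribute value)
predict_rating <- function(attrs) {
  unname(coefs["intercept"] + sum(coefs[names(attrs)] * attrs))
}

base <- predict_rating(player)

# Raise strength by one point while holding the other attributes fixed
stronger <- player
stronger["strength"] <- stronger["strength"] + 1
gain <- predict_rating(stronger) - base  # equals the strength coefficient

print(c(base = base, gain = gain))
```

The difference between the two predictions is the strength coefficient itself, which is what "holding the other variables constant" means in practice.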
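One way to see how narrow the margin for stamina is: build approximate 95% confidence intervals from the estimates and standard errors reported above. This is a sketch that re-types the numbers from the summary table; inside the notebook, `confint.default(model)` should give very similar Wald intervals directly.

```r
# Estimates and standard errors copied from the summary(model) output
est <- c(stamina = 0.010254, strength = 0.194667,
         dribbling = 0.158976, acceleration = -0.040414)
se  <- c(stamina = 0.004468, strength = 0.004045,
         dribbling = 0.003976, acceleration = 0.004867)

# Residual degrees of freedom reported in the summary
df_resid <- 17949

# Two-sided 95% critical value from the t distribution
t_crit <- qt(0.975, df = df_resid)

# Interval: estimate +/- critical value * standard error
ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
print(round(ci, 4))
```

An interval that excludes zero matches a p-value below 0.05. The stamina interval sits only slightly above zero, while the other three are far from it.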
# Calculate fitted values and residuals
fitted_values <- fitted(model) # Extractor functions are preferred over model$...
residuals <- resid(model) # For a gaussian GLM these are the response residuals
# Plot Residuals vs Fitted Values
plot(fitted_values, residuals,
xlab = expression("Fitted Values (" * hat(y) * ")"),
ylab = "Residuals",
main = "Residuals vs. Fitted Values")
abline(h = 0, col = "red") # Horizontal line at 0
The spread of the residuals is more consistent at higher fitted values than at lower ones, which suggests the model fits better at higher fitted values and worse at lower ones.
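A visual impression of unequal spread can be backed by a simple numeric check. The sketch below regresses the squared residuals on the fitted values, which is similar in spirit to a Breusch-Pagan test. The helper name `variance_trend` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: does the residual variance change with the fitted values?
variance_trend <- function(fit) {
  sq_resid    <- resid(fit)^2   # squared residuals as a proxy for variance
  fitted_vals <- fitted(fit)
  # Auxiliary regression of squared residuals on fitted values
  aux <- lm(sq_resid ~ fitted_vals)
  # Return estimate, standard error, t value and p-value of the slope
  summary(aux)$coefficients["fitted_vals", ]
}

# In this notebook the call would be: variance_trend(model)
```

A slope that is clearly different from zero suggests the error variance is not constant; its sign shows whether the spread grows or shrinks as the fitted values increase.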
par(mfrow = c(2, 2)) # Arranges plots in a 2x2 grid
plot(model$model$stamina, resid(model),
xlab = "Stamina", ylab = "Residuals", main = "Residuals vs Stamina")
plot(model$model$strength, resid(model),
xlab = "Strength", ylab = "Residuals", main = "Residuals vs Strength")
plot(model$model$dribbling, resid(model),
xlab = "Dribbling", ylab = "Residuals", main = "Residuals vs Dribbling")
plot(model$model$acceleration, resid(model),
xlab = "Acceleration", ylab = "Residuals", main = "Residuals vs Acceleration")
We can see that Strength and Acceleration have the most linear relationship with overall rating, whereas Stamina and Dribbling contribute more to the non-linearity.
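Curvature in a residual-versus-predictor plot can also be checked numerically. As a rough sketch, the residuals can be regressed on the predictor and its square; a significant quadratic term hints at a non-linear relationship. The helper name `residual_curvature` is hypothetical.

```r
# Hypothetical helper: test for curvature of the residuals in one predictor
residual_curvature <- function(res, x) {
  # Auxiliary regression: residuals on the predictor and its square
  aux <- lm(res ~ x + I(x^2))
  # Return estimate, standard error, t value and p-value of the squared term
  summary(aux)$coefficients["I(x^2)", ]
}

# In this notebook, for example:
# residual_curvature(resid(model), model$model$dribbling)
```

Running it once per predictor gives a like-for-like comparison of the four panels.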
# QQ-Plot
qqnorm(residuals)
qqline(residuals, col = "red")
We can see that the points deviate from the reference line at both the lower and the upper tail, indicating that the residuals are not normally distributed.
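The tail behaviour seen in the Q-Q plot can be summarised with two numbers, skewness and excess kurtosis, both of which are close to 0 for normal data. This is a minimal base-R sketch with a hypothetical helper name `residual_shape`. Note that `shapiro.test()` only accepts between 3 and 5000 observations, so it cannot be applied directly to the full set of residuals here.

```r
# Hypothetical helper: moment-based summary of the residual distribution
residual_shape <- function(res) {
  z <- (res - mean(res)) / sd(res)    # standardise the residuals
  c(skewness        = mean(z^3),      # asymmetry; about 0 for normal data
    excess_kurtosis = mean(z^4) - 3)  # tail weight; about 0 for normal data
}

# In this notebook the call would be: residual_shape(resid(model))
```

Positive excess kurtosis points to heavier tails than the normal distribution, and a non-zero skewness points to asymmetry between the two tails.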
plot(cooks.distance(model), type = "h", main = "Cook's Distance by Observation",
ylab = "Cook's Distance", xlab = "Observation Index")
abline(h = 4 / length(cooks.distance(model)), col = "red", lty = 2)
We can see that many observations, especially towards the end of the index range, exceed the 4/n threshold (dashed red line) and therefore have a strong influence on the model fit.
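To turn "a lot of values" into a count, the observations above the 4/n rule of thumb used in the plot can be tallied. The helper `count_influential` below is a hypothetical sketch, not part of the original analysis.

```r
# Hypothetical helper: summarise observations above the 4/n Cook's distance cut-off
count_influential <- function(fit, n_top = 5) {
  d <- cooks.distance(fit)
  threshold <- 4 / length(d)          # same rule of thumb as the red line
  list(threshold = threshold,
       n_flagged = sum(d > threshold),                 # how many exceed it
       top = head(sort(d, decreasing = TRUE), n_top))  # largest distances
}

# In this notebook the call would be: count_influential(model)
```

The names of the `top` element are the row labels of the most influential observations, which makes it easier to inspect those players in `player_data`.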
The residuals vs fitted values plot shows patterns, suggesting that the relationships between the predictors and the response are not entirely linear.
There is heteroscedasticity: the spread of the residuals is not constant across the fitted values. This affects the reliability of the standard errors, and therefore of the p-values.
The Q-Q plot departs from the reference line at both tails, so the residuals are not normally distributed.
Cook's distance shows observations with high values towards the end of the index range, indicating the presence of influential observations (potential outliers).
The coefficient for strength is 0.1947, suggesting that for each one-unit increase in a player's strength, the expected value of the overall rating increases by approximately 0.195 units, assuming the other variables remain constant. This relationship is statistically significant, with a p-value less than 0.001, indicating strong evidence that strength is positively associated with overall rating.