Data Dive 11

Linear Regression

The model is built with the explanatory variables w_p1_tot_attacks, w_p1_tot_kills, and w_p1_tot_errors, and the response variable w_p1_tot_hitpct.

volley_data <- read.csv("C:\\Users\\brian\\Downloads\\bvb_matches_2022.csv")

model <- lm(w_p1_tot_hitpct ~ w_p1_tot_attacks + w_p1_tot_kills + w_p1_tot_errors, data = volley_data)
summary(model)

## 
## Call:
## lm(formula = w_p1_tot_hitpct ~ w_p1_tot_attacks + w_p1_tot_kills + 
##     w_p1_tot_errors, data = volley_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.24277 -0.02243 -0.01057  0.01435  0.37660 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       0.4674398  0.0076610   61.02   <2e-16 ***
## w_p1_tot_attacks -0.0159347  0.0005819  -27.38   <2e-16 ***
## w_p1_tot_kills    0.0354294  0.0008959   39.55   <2e-16 ***
## w_p1_tot_errors  -0.0356809  0.0014839  -24.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05322 on 493 degrees of freedom
##   (3709 observations deleted due to missingness)
## Multiple R-squared:   0.87,  Adjusted R-squared:  0.8693 
## F-statistic:  1100 on 3 and 493 DF,  p-value: < 2.2e-16

Interpreting Coefficients

The intercept (0.4674398) is the predicted hitting percentage of winning player 1 when attacks, kills, and errors are all zero, so it serves as the model's baseline rather than a realistic stat line. The three coefficients below it show how much each variable changes the predicted hitting percentage when the other two are held constant. For every additional attack, a player's predicted hitting percentage decreases by 0.0159347. For every additional kill, it increases by 0.0354294. Finally, for every additional error, it decreases by 0.0356809.
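To make the interpretation concrete, the sketch below plugs the fitted coefficients into the regression equation by hand. The stat line (20 attacks, 10 kills, 3 errors) is a hypothetical example and is not taken from the dataset.

```r
# Coefficients copied from the summary(model) output above.
coefs <- c(intercept =  0.4674398,
           attacks   = -0.0159347,
           kills     =  0.0354294,
           errors    = -0.0356809)

# Hypothetical stat line (illustrative values, not from the data).
player <- c(attacks = 20, kills = 10, errors = 3)

# Prediction = intercept + sum of (coefficient * value).
predicted <- unname(coefs["intercept"] + sum(coefs[names(player)] * player))
predicted

# One more kill with attacks and errors held fixed.
player_plus_kill <- player
player_plus_kill["kills"] <- player["kills"] + 1
predicted_plus <- unname(coefs["intercept"] +
                           sum(coefs[names(player)] * player_plus_kill))

# The difference equals the kills coefficient.
predicted_plus - predicted
```

In a real match an extra kill usually comes with an extra attack, so "holding the other variables constant" describes the model's arithmetic more than a typical game situation.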

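The R-squared value of 0.87 means that approximately 87% of the variability in hitting percentage is explained by the explanatory variables in the model. Note that the model was fit on only 497 rows (493 residual degrees of freedom plus 4 estimated coefficients), because 3709 observations were dropped for missing values.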
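As a rough check on the summary output, the adjusted R-squared and the F-statistic can be recomputed from the printed R-squared and degrees of freedom. This is only a sketch: it uses the rounded value 0.87, so the results match the printed ones only approximately.

```r
r2       <- 0.87   # Multiple R-squared as printed (rounded)
p        <- 3      # number of explanatory variables
df_resid <- 493    # residual degrees of freedom from the summary
dropped  <- 3709   # rows deleted due to missingness

n <- df_resid + p + 1           # rows actually used in the fit
share_used <- n / (n + dropped) # fraction of the file that was used

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic for the regression.
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

c(n = n, share_used = share_used, adj_r2 = adj_r2, f_stat = f_stat)
```

Both recomputed values land close to the printed 0.8693 and 1100, and the share of rows used is roughly 12%, which is worth keeping in mind when generalizing from this model.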

plot(model$residuals, main = 'Residuals Plot', ylab = 'Residuals', xlab = 'Index')
abline(h = 0, col = 'red')

Model Diagnosis

Residuals vs Fitted

This plot shows that the variance of the residuals is not necessarily constant. Ideally, if the relationship is linear and the residuals have equal variance, the points should be scattered randomly around the horizontal zero line with no visible pattern.

plot(model, which = 1)

This plot also shows that the relationship may not be perfectly linear, which limits how accurate a linear regression model can be.
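One possible source of curvature is the response itself. Under the usual volleyball definition, hitting percentage is (kills - errors) / attacks, a ratio rather than a linear combination of the three counts. Whether w_p1_tot_hitpct was computed this way is an assumption, not something the data file confirms. The sketch below uses simulated counts (not the match data) and a RESET-style check: it adds the squared fitted values to the linear model and tests whether they help.

```r
set.seed(4)
n <- 500

# Simulated counts (hypothetical, for illustration only).
attacks <- rpois(n, 20) + 5                  # at least 5 attacks per player
kills   <- rbinom(n, attacks, 0.45)
errors  <- rbinom(n, attacks - kills, 0.25)

# Assumed definition of hitting percentage.
hitpct <- (kills - errors) / attacks

# Same model form as in this report.
linear_fit <- lm(hitpct ~ attacks + kills + errors)
r2_linear  <- summary(linear_fit)$r.squared

# RESET-style check: do squared fitted values add explanatory power?
fit_sq     <- fitted(linear_fit)^2
reset_fit  <- lm(hitpct ~ attacks + kills + errors + fit_sq)
comparison <- anova(linear_fit, reset_fit)
reset_p    <- comparison[["Pr(>F)"]][2]

c(r2_linear = r2_linear, reset_p = reset_p)
```

A small p-value here would indicate that a purely linear model misses structure in the response, which is the same pattern a curved Residuals vs Fitted plot points to.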

Normal Q-Q Plot

For the most part, the residuals lie on the straight line, which is a good sign and suggests that they are approximately normally distributed through the middle of their range.

plot(model, which = 2)

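This plot shows that the residuals are not normally distributed in the tails. The tails curving away from the straight line indicate some extreme residuals, which is consistent with the minimum (-0.24277) and maximum (0.37660) residuals in the summary being far larger in magnitude than the quartiles. It is the residuals, rather than the data themselves, that need to be approximately normal, so this departure means the p-values and confidence intervals from the linear regression should be treated with some caution.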
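A numeric check can complement the visual reading of the Q-Q plot. The sketch below runs a Shapiro-Wilk test on standardized residuals and counts how many exceed 3 in absolute value. It uses simulated data with heavy-tailed errors as a stand-in, so the object names (sim_model, std_res) are illustrative and the results say nothing about the match data.

```r
set.seed(1)
n <- 300

# Simulated stand-in: linear signal plus heavy-tailed (t, 3 df) noise.
x <- rnorm(n)
y <- 1 + 2 * x + 0.5 * rt(n, df = 3)
sim_model <- lm(y ~ x)

# Standardized residuals put all points on a comparable scale.
std_res <- rstandard(sim_model)

# Shapiro-Wilk test: a small p-value is evidence against normality.
sw <- shapiro.test(std_res)
sw$p.value

# Count of residuals more than 3 standard deviations from zero.
n_extreme <- sum(abs(std_res) > 3)
n_extreme
```

On the real model the same two calls would be rstandard(model) and shapiro.test(), applied to the 497 rows that were actually fit.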

Scale-Location Plot

Ideally the spread of the residuals would be constant across all fitted values, but in this case the points seem to be concentrated between 0.4 and 0.6 and are not very consistent otherwise.

plot(model, which = 3)

The uneven spread suggests that the residual variance is not constant (heteroscedasticity). It may also reflect nonlinearity in our model, or it may mean that the model is being affected by outliers.
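The visual impression of uneven spread can be backed by a simple auxiliary regression in the style of a Breusch-Pagan test: regress the squared residuals on the fitted values and see whether the fit explains them. The sketch below is a hand-rolled base R approximation on simulated data, not a packaged test, and all object names are illustrative.

```r
set.seed(2)
n <- 300

# Simulated stand-in: the error spread grows with x.
x <- runif(n, 1, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.2 * x)
sim_model <- lm(y ~ x)

# Auxiliary regression of squared residuals on fitted values.
sq_res  <- resid(sim_model)^2
fit_val <- fitted(sim_model)
aux     <- lm(sq_res ~ fit_val)

# n * R-squared is compared to a chi-square with 1 degree of freedom.
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

c(lm_stat = lm_stat, p_value = p_value)
```

A small p-value indicates that the residual variance changes with the fitted values, which is what a sloped or fanning Scale-Location plot shows graphically.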

Residuals vs Leverage Plot

This plot helps show us points that could be considered influential to our model. We see a few that stray far from the bulk of the data, toward the Cook’s distance lines.

plot(model, which = 5)

For the most part, the data points cluster around the horizontal ‘zero’ line, which means that they are not problematic or influential. We do see a potentially influential point labeled 51. We should investigate the cause of this point; if it turns out to be a data error, removing it may give a more accurate fit.
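One way to evaluate a flagged point is to compute Cook's distance directly and then refit the model without that point to see how much the coefficients move. The sketch below does this on simulated data with an outlier deliberately planted at row 51 to mirror the label in the plot. The 4/n cutoff is a common rule of thumb, not a strict test, and none of the values come from the match data.

```r
set.seed(3)
n <- 100

# Simulated stand-in with one planted high-leverage outlier at row 51.
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
x[51] <- 6
y[51] <- -10
sim_data <- data.frame(x = x, y = y)

full_fit <- lm(y ~ x, data = sim_data)

# Cook's distance for every observation; flag those above 4/n.
cd      <- cooks.distance(full_fit)
flagged <- which(cd > 4 / n)
flagged

# Refit without the planted point and compare coefficients.
refit <- lm(y ~ x, data = sim_data[-51, ])
rbind(full = coef(full_fit), without_51 = coef(refit))
```

If the coefficients change substantially after removal, the point is influential in practice. Whether it should be dropped still depends on whether it is a recording error or a genuine match result.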