This analysis uses the 2025 WNBA box score dataset, in which each observation represents one team’s performance in a single game. After removing All-Star Game entries, the dataset contains regular-season and playoff records for all 12 franchises. The Connecticut Sun is the focal team. The objective is to build and evaluate a multiple linear regression model that identifies which game-level statistics (field goal percentage, total rebounds, three-point percentage, and steals) are significant predictors of points scored.
# This table shows the mean and standard deviation of each statistic for all 12 WNBA teams
summary_score %>%
  kbl(caption = "Table 1. Per-Game Averages and Standard Deviations by Team",
      digits = 2,
      align = "c",
      col.names = c("Team",
                    "Mean Score", "SD Score",
                    "Mean Rebounds", "SD Rebounds",
                    "Mean FG%", "SD FG%",
                    "Mean Reb", "SD Reb",
                    "Mean 3P%", "SD 3P%",
                    "Mean Steals", "SD Steals")) %>%
  # Style Table 1
  kable_styling(full_width = FALSE,
                html_font = "Times New Roman",
                bootstrap_options = c("striped", "hover", "condensed", "bordered"),
                position = "center",
                font_size = 13) %>%
  row_spec(0, bold = TRUE,
           color = "white",
           background = "#4D4D4D") %>%
  column_spec(1, bold = TRUE, color = "#4D4D4D")
| Team | Mean Score | SD Score | Mean Rebounds | SD Rebounds | Mean FG% | SD FG% | Mean Reb | SD Reb | Mean 3P% | SD 3P% | Mean Steals | SD Steals |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Aces | 85.52 | 9.56 | 33.78 | 5.88 | 45.27 | 5.82 | 33.78 | 5.88 | 35.27 | 7.08 | 6.80 | 2.67 |
| Dream | 76.93 | 10.59 | 35.95 | 4.41 | 41.28 | 6.78 | 35.95 | 4.41 | 30.83 | 9.32 | 7.14 | 2.82 |
| Fever | 84.50 | 10.17 | 35.10 | 5.49 | 45.56 | 5.38 | 35.10 | 5.49 | 35.00 | 8.99 | 5.88 | 2.29 |
| Liberty | 84.98 | 9.92 | 36.90 | 5.77 | 44.53 | 5.61 | 36.90 | 5.77 | 35.38 | 10.06 | 7.75 | 2.19 |
| Lynx | 82.36 | 11.39 | 33.15 | 5.06 | 45.21 | 6.34 | 33.15 | 5.06 | 37.80 | 9.43 | 8.36 | 3.17 |
| Mercury | 81.93 | 12.60 | 32.26 | 5.39 | 44.28 | 7.34 | 32.26 | 5.39 | 32.97 | 10.34 | 6.55 | 2.12 |
| Mystics | 79.30 | 8.69 | 31.85 | 4.66 | 43.36 | 4.82 | 31.85 | 4.66 | 36.64 | 8.69 | 7.28 | 2.24 |
| Sky | 77.40 | 9.62 | 36.60 | 5.57 | 42.44 | 5.22 | 36.60 | 5.57 | 31.74 | 11.62 | 7.00 | 3.30 |
| Sparks | 78.40 | 10.57 | 32.67 | 5.52 | 42.63 | 6.15 | 32.67 | 5.52 | 32.09 | 11.00 | 7.30 | 2.78 |
| Storm | 82.67 | 9.65 | 34.67 | 6.02 | 43.43 | 5.39 | 34.67 | 6.02 | 28.35 | 9.03 | 9.24 | 3.27 |
| Sun | 80.36 | 9.89 | 33.43 | 4.62 | 44.30 | 5.28 | 33.43 | 4.62 | 32.84 | 11.67 | 7.89 | 3.29 |
| Wings | 84.20 | 11.47 | 34.75 | 4.65 | 44.47 | 5.24 | 34.75 | 4.65 | 32.06 | 11.75 | 7.12 | 2.95 |
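The report does not show how `summary_score` was built, so the following is only a minimal sketch of one way to compute a per-team mean and standard deviation. The data frame `box_scores`, the column `team`, and the helper `summarise_stat()` are hypothetical names, and the values are toy numbers, not the WNBA data. Base R is used so the sketch runs without extra packages.

```r
# Toy data standing in for the box score file (hypothetical values and names;
# only `team_score` and `steals` follow the variable names used in the report).
box_scores <- data.frame(
  team       = c("Sun", "Sun", "Sun", "Aces", "Aces", "Aces"),
  team_score = c(80, 90, 70, 85, 95, 75),
  steals     = c(7, 9, 8, 6, 5, 10)
)

# Mean and standard deviation of one statistic, split by team.
summarise_stat <- function(data, stat) {
  means <- tapply(data[[stat]], data$team, mean)
  sds   <- tapply(data[[stat]], data$team, sd)
  data.frame(team = names(means),
             mean = as.numeric(means),
             sd   = as.numeric(sds))
}

score_summary <- summarise_stat(box_scores, "team_score")
print(score_summary)
```

Calling `summarise_stat(box_scores, "steals")` gives the same layout for another statistic; the per-statistic results could then be joined by team to form one wide table.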
# This table shows whether the Sun scored more on average in wins than in losses
my_team_by_result %>%
  kbl(caption = "Table 2. Connecticut Sun Mean Points Scored by Game Result",
      digits = 2,
      align = "c",
      col.names = c("Result",
                    "Mean Score", "SD Score")) %>%
  # Style Table 2
  kable_styling(full_width = FALSE,
                html_font = "Times New Roman",
                bootstrap_options = c("striped", "hover", "condensed", "bordered"),
                position = "center",
                font_size = 13) %>%
  row_spec(0, bold = TRUE,
           color = "white",
           background = "#4D4D4D") %>%
  column_spec(1, bold = TRUE, color = "#4D4D4D")
| Result | Mean Score | SD Score |
|---|---|---|
| loss | 72.13 | 5.62 |
| win | 84.22 | 9.10 |
The boxplot shows that the Sun tend to score more in games they win than in games they lose, consistent with the group means in Table 2 (84.22 points in wins versus 72.13 in losses). There is also one visible outlier in the win group, indicating an unusually high-scoring game.
# Boxplot of points scored by game result
boxplot(team_score ~ result, data = df_hist,
col = c("lightblue", "lightgreen"),
main = "Points Scored by Game Result — Connecticut Sun (2025)",
xlab = "Result",
ylab = "Points Scored")
The overall F-test gives F(4, 42) = 11.06 with a p-value of 3.225e-06, which is less than α = 0.15, so we reject the null hypothesis that all four slope coefficients are zero. The model as a whole is statistically significant. The adjusted R-squared is 0.4666, meaning about 46.66% of the variation in points scored is explained by the four predictors together, after adjusting for the number of predictors.
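As a quick check on the overall F-test, the p-value can be recomputed from the reported statistic with the upper tail of the F distribution. This is a small sketch using only the numbers quoted above; the variable names are illustrative, and because the F value is rounded to 11.06 the result agrees with the reported p-value only approximately.

```r
# Reported overall F-test for Model 1: F(4, 42) = 11.06
f_value  <- 11.06
df_model <- 4    # number of predictors
df_resid <- 42   # 47 observations - 4 predictors - 1 intercept

# Upper-tail probability of the F distribution is the p-value
p_value <- pf(f_value, df_model, df_resid, lower.tail = FALSE)

alpha <- 0.15
reject_null <- p_value < alpha

print(p_value)
print(reject_null)
```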
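Each slope is interpreted holding the other three predictors constant:

- Field goal percentage: for each 1-percentage-point increase, the Sun’s predicted score increases by 1.11 points. The p-value is 6.82e-05, which is significant at α = 0.15.
- Total rebounds: for each additional rebound, the Sun’s predicted score increases by 0.04 points. The p-value is 0.888, which is not significant.
- Three-point field goal percentage: for each 1-percentage-point increase, the Sun’s predicted score increases by 0.16 points. The p-value is 0.167, which is not significant.
- Steals: for each additional steal, the Sun’s predicted score increases by 0.25 points. The p-value is 0.486, which is not significant.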
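The screening against the 15% level can be made explicit with the four p-values reported above. This is only an illustrative sketch: the p-values are typed in by hand from the text, whereas with a fitted `lm` object they would normally be read from `coef(summary(model))[, "Pr(>|t|)"]`.

```r
alpha <- 0.15

# p-values for the Model 1 slopes, copied from the interpretation above
p_values <- c(
  field_goal_pct             = 6.82e-05,
  total_rebounds             = 0.888,
  three_point_field_goal_pct = 0.167,
  steals                     = 0.486
)

# Flag each predictor whose p-value is below the chosen level
significant <- p_values < alpha
print(data.frame(p_value = p_values, significant = significant))

# Names of the predictors that pass the screen
significant_terms <- names(p_values)[significant]
print(significant_terms)
```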
In Model 1, only field goal percentage was statistically significant at the 15% significance level (p = 6.82e-05); total rebounds (p = 0.888), three-point field goal percentage (p = 0.167), and steals (p = 0.486) all had p-values above α = 0.15. Model 2 nevertheless retains the same four predictors as Model 1, so that every main effect is present when the interaction terms are added, and it serves as the foundation for building the full interaction model in Model 3.
All VIF values are below 5, indicating no multicollinearity problem among the predictors: field_goal_pct = 1.56, total_rebounds = 1.33, three_point_field_goal_pct = 1.47, and steals = 1.25.
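For reference, a variance inflation factor is 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The sketch below computes it by hand in base R on simulated predictors; the report does not show which function produced its VIF values, and both the data and the helper `vif_by_hand()` are assumptions made for illustration.

```r
set.seed(1)
n <- 47

# Simulated predictors (not the Sun's data); three-point percentage is made
# mildly correlated with field goal percentage so the VIFs are not all 1.
predictors <- data.frame(
  field_goal_pct = rnorm(n, mean = 44, sd = 5),
  total_rebounds = rnorm(n, mean = 33, sd = 5),
  steals         = rnorm(n, mean = 8,  sd = 3)
)
predictors$three_point_field_goal_pct <-
  0.8 * predictors$field_goal_pct + rnorm(n, mean = 0, sd = 10)

# VIF for each predictor: regress it on the remaining predictors and
# convert that regression's R-squared into 1 / (1 - R^2).
vif_by_hand <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_by_hand(predictors)
print(round(vifs, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others; values near or above 5 are the usual warning sign referred to in the text.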
Adding interactions improved the adjusted R-squared. Model 2 had an adjusted R-squared of 0.4666, meaning 46.66% of the variation in points scored is explained by the four predictors alone. Model 3 increased that to 0.5652, meaning 56.52% of the variation is explained once interactions are included. The following interaction terms have p-values below α = 0.15, so for each one we reject the null hypothesis that its coefficient is zero:

- field_goal_pct × three_point_field_goal_pct (p = 0.000904)
- total_rebounds × three_point_field_goal_pct (p = 0.059952)
- three_point_field_goal_pct × steals (p = 0.113752)
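The comparison between the main-effects model and the interaction model can be sketched as follows. This assumes that the "full interaction model" means all six two-way interactions; the report does not show the formula it used, and the data here are simulated, so the printed numbers will not match the report.

```r
set.seed(2)
n <- 47

# Simulated games (not the Sun's data), using the report's variable names
games <- data.frame(
  field_goal_pct             = rnorm(n, mean = 44, sd = 5),
  total_rebounds             = rnorm(n, mean = 33, sd = 5),
  three_point_field_goal_pct = rnorm(n, mean = 33, sd = 11),
  steals                     = rnorm(n, mean = 8,  sd = 3)
)
games$team_score <- 30 + 1.1 * games$field_goal_pct + rnorm(n, sd = 7)

# Main effects only
model2 <- lm(team_score ~ field_goal_pct + total_rebounds +
               three_point_field_goal_pct + steals, data = games)

# (a + b + c + d)^2 expands to the 4 main effects plus all 6 two-way interactions
model3 <- lm(team_score ~ (field_goal_pct + total_rebounds +
               three_point_field_goal_pct + steals)^2, data = games)

# Adjusted R-squared of each model, side by side
adj_r2 <- c(model2 = summary(model2)$adj.r.squared,
            model3 = summary(model3)$adj.r.squared)
print(round(adj_r2, 4))

# Interaction terms whose p-value is below the 15% level
coefs <- coef(summary(model3))
interaction_rows <- grepl(":", rownames(coefs))
kept <- coefs[interaction_rows & coefs[, "Pr(>|t|)"] < 0.15, , drop = FALSE]
print(kept)
```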
For Model 4, I kept three interactions from Model 3: field_goal_pct with three_point_field_goal_pct, total_rebounds with three_point_field_goal_pct, and three_point_field_goal_pct with steals, since all had p-values below α = 0.15. I dropped all other interactions because their p-values exceeded α = 0.15 and they were therefore not significant. Model 4 is my final choice because it retains only the significant interaction terms and is simpler and easier to interpret, while the adjusted R-squared remained nearly unchanged, going from 0.5652 in Model 3 to 0.5645 in Model 4.
My final model is:

team_score = 253.8462714
- 1.4997568 * field_goal_pct
- 2.9637352 * total_rebounds
- 6.4401192 * three_point_field_goal_pct
- 1.9465849 * steals
+ 0.0723389 * field_goal_pct * three_point_field_goal_pct
+ 0.0866758 * total_rebounds * three_point_field_goal_pct
+ 0.0635665 * three_point_field_goal_pct * steals
This model significantly predicts team score, F(7, 39) = 9.52, p < .0001, adjusted R^2 = 0.5645.
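The degrees of freedom for the overall F-test follow from the model's size: 7 slope terms in the numerator and 47 - 8 = 39 in the denominator, which matches the df = 7; 39 shown in the regression table. The sketch below fits a model with the same seven terms to simulated data (not the Sun's games) and reads the degrees of freedom from `summary()`; the data frame `games` is a stand-in.

```r
set.seed(5)
n <- 47  # same number of observations as reported for Model 4

# Simulated games, using the report's variable names
games <- data.frame(
  field_goal_pct             = rnorm(n, mean = 44, sd = 5),
  total_rebounds             = rnorm(n, mean = 33, sd = 5),
  three_point_field_goal_pct = rnorm(n, mean = 33, sd = 11),
  steals                     = rnorm(n, mean = 8,  sd = 3)
)
games$team_score <- 30 + 1.1 * games$field_goal_pct + rnorm(n, sd = 7)

# Four main effects plus the three retained interactions
model4 <- lm(
  team_score ~ field_goal_pct + total_rebounds +
    three_point_field_goal_pct + steals +
    field_goal_pct:three_point_field_goal_pct +
    total_rebounds:three_point_field_goal_pct +
    three_point_field_goal_pct:steals,
  data = games
)

# fstatistic holds the F value, numerator df, and denominator df
f_stat <- summary(model4)$fstatistic
p_value <- pf(f_stat[["value"]], f_stat[["numdf"]], f_stat[["dendf"]],
              lower.tail = FALSE)

cat("F(", f_stat[["numdf"]], ", ", f_stat[["dendf"]], ") = ",
    round(f_stat[["value"]], 2), "\n", sep = "")
```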
The residual histogram looks like a bell curve and is centered close to zero. There is a small tail on the right side, but it is not a big concern. Based on this, the normality assumption is satisfied.
The residuals vs. fitted values plot shows no clear pattern or funnel shape. The points are scattered randomly around zero across all fitted values. Based on this, the equal variance assumption is satisfied.
All studentized residuals fall within the thresholds of -3 and 3, so no observations are flagged as outliers on this plot. The model fits the data well across all games.
The Cook’s distance chart identifies 4 games that have a disproportionate effect on the model’s coefficients. These correspond to the games played on August 16, 2024 against the Wings, July 7, 2024 against the Mercury, June 23, 2024 against the Storm, and May 5, 2024 against the Mystics. The leverage plot further confirms 4 influential points on the right side and flags 1 outlier on the left side, corresponding to the game on June 27, 2024 against the Mystics. Despite these influential observations, the model remains reasonably reliable, as the residual assumptions are largely satisfied.
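The influence diagnostics can also be inspected numerically with base R. The sketch below uses simulated data with one deliberately planted influential point, and it assumes the common 4/n rule of thumb for Cook's distance; the report does not state which cutoff its chart used.

```r
set.seed(3)
n <- 47

# Simulated predictor and response (not the Sun's data)
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

# Plant one game far from the rest in x with a response well off the trend
x[1] <- 6
y[1] <- 0

fit <- lm(y ~ x)

cooks     <- cooks.distance(fit)  # influence of each observation on the fit
leverage  <- hatvalues(fit)       # how unusual each observation's x value is
stud_res  <- rstudent(fit)        # externally studentized residuals

threshold   <- 4 / n              # rule-of-thumb cutoff (an assumption)
influential <- which(cooks > threshold)

print(influential)
print(round(cbind(cooks, leverage, stud_res)[influential, , drop = FALSE], 3))
```

Looking at Cook's distance, leverage, and the studentized residual together shows whether a flagged game is influential because of unusual predictor values, an unusual score, or both.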
# Residual checks for Model 4
# olsrr draws each plot with its own title and axis labels
library(olsrr)

# (1) Residual histogram
ols_plot_resid_hist(model4)

# (2) Residuals vs. fitted values
ols_plot_resid_fit(model4)

# (3) Studentized residuals by observation
ols_plot_resid_stud(model4)

# (4) Cook's distance by observation
ols_plot_cooksd_chart(model4)

# (5) Studentized residuals vs. leverage
ols_plot_resid_lev(model4)

# Close any open graphics devices
graphics.off()
| | Dependent variable: team_score |
|---|---|
| field_goal_pct | -1.500* (0.823) |
| total_rebounds | -2.964*** (1.057) |
| three_point_field_goal_pct | -6.440*** (1.895) |
| steals | -1.947* (1.125) |
| field_goal_pct:three_point_field_goal_pct | 0.072*** (0.022) |
| total_rebounds:three_point_field_goal_pct | 0.087*** (0.029) |
| three_point_field_goal_pct:steals | 0.064* (0.037) |
| Constant | 253.846*** (68.088) |
| Observations | 47 |
| R2 | 0.631 |
| Adjusted R2 | 0.564 |
| Residual Std. Error | 6.529 (df = 39) |
| F Statistic | 9.518*** (df = 7; 39) |
| Note: | *p<0.1; **p<0.05; ***p<0.01 |
I used the model to predict my team’s points for a game in which they achieve the median value of each variable. Using the median values for all predictors (field goal percentage of 43.90, total rebounds of 34, three-point field goal percentage of 33.30, and steals of 7.39), Model 4 predicts the Sun would score 79.86 points. The 95% confidence interval is [77.41, 82.32]: we are 95% confident that the Sun’s mean score in games with these predictor values lies between 77.41 and 82.32 points. Because this is a confidence interval for the mean response rather than a prediction interval, the score in any single game can fall outside it.
The observed ranges in the data are: field goal percentage (35.10 to 57.60), total rebounds (26 to 45), three-point field goal percentage (10.00 to 60.90), and steals (1 to 14). Since all median values fall within these ranges, the prediction is an interpolation rather than an extrapolation and can be trusted.
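The prediction step and the range check can be sketched with base R's `predict()`. The example uses simulated data and a smaller two-predictor model purely for illustration, so its output will not match the values above; it also shows a prediction interval next to the confidence interval, since the two answer different questions (mean score versus a single game's score).

```r
set.seed(4)
n <- 47

# Simulated games (not the Sun's data), using two of the report's variables
games <- data.frame(
  field_goal_pct = runif(n, min = 35, max = 58),
  steals         = sample(1:14, n, replace = TRUE)
)
games$team_score <- 30 + 1.1 * games$field_goal_pct +
  0.2 * games$steals + rnorm(n, sd = 6)

fit <- lm(team_score ~ field_goal_pct + steals, data = games)

# A new game at the median of each predictor
new_game <- data.frame(
  field_goal_pct = median(games$field_goal_pct),
  steals         = median(games$steals)
)

# Extrapolation check: is each new value inside the observed range?
in_range <- sapply(names(new_game), function(v) {
  new_game[[v]] >= min(games[[v]]) && new_game[[v]] <= max(games[[v]])
})
print(in_range)

# Confidence interval for the mean score at these predictor values
conf_int <- predict(fit, newdata = new_game, interval = "confidence", level = 0.95)
# Prediction interval for the score in one individual game
pred_int <- predict(fit, newdata = new_game, interval = "prediction", level = 0.95)

print(round(conf_int, 2))
print(round(pred_int, 2))
```

The prediction interval is always wider than the confidence interval at the same point, because it adds game-to-game variability to the uncertainty in the estimated mean.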