Data
This data set is all about all 2025 WNBA games which includes all
star games. \(~\)
The filtered data set takes out the all star games and only looks at
the teams within the WNBA. \(~\)
After filtering out the all star games, we want to predict
specifically the team score of the median Seattle Storms game using a
statistical model.
# Keep in case, below creates a simple table
#kable(summaryC)
# "Fancy Table" for league summary
summaryC %>%
kbl(digits = 2, caption = "2025 League Stats") %>%
kable_classic_2(full_width = F)
2025 League Stats
|
team_name
|
mean_score
|
sd_score
|
mean_rebounds
|
sd_rebounds
|
mean_fgp
|
sd_fgp
|
mean_3pp
|
sd_3pp
|
mean_steals
|
sd_steals
|
|
Aces
|
85.52
|
9.56
|
33.78
|
5.88
|
45.27
|
5.82
|
35.27
|
7.08
|
6.80
|
2.67
|
|
Dream
|
76.93
|
10.59
|
35.95
|
4.41
|
41.28
|
6.78
|
30.83
|
9.32
|
7.14
|
2.82
|
|
Fever
|
84.50
|
10.17
|
35.10
|
5.49
|
45.56
|
5.38
|
35.00
|
8.99
|
5.88
|
2.29
|
|
Liberty
|
84.98
|
9.92
|
36.90
|
5.77
|
44.53
|
5.61
|
35.38
|
10.06
|
7.75
|
2.19
|
|
Lynx
|
82.36
|
11.39
|
33.15
|
5.06
|
45.21
|
6.34
|
37.80
|
9.43
|
8.36
|
3.17
|
|
Mercury
|
81.93
|
12.60
|
32.26
|
5.39
|
44.28
|
7.34
|
32.97
|
10.34
|
6.55
|
2.12
|
|
Mystics
|
79.30
|
8.69
|
31.85
|
4.66
|
43.36
|
4.82
|
36.64
|
8.69
|
7.28
|
2.24
|
|
Sky
|
77.40
|
9.62
|
36.60
|
5.57
|
42.44
|
5.22
|
31.74
|
11.62
|
7.00
|
3.30
|
|
Sparks
|
78.40
|
10.57
|
32.67
|
5.52
|
42.63
|
6.15
|
32.09
|
11.00
|
7.30
|
2.78
|
|
Storm
|
82.67
|
9.65
|
34.67
|
6.02
|
43.43
|
5.39
|
28.35
|
9.03
|
9.24
|
3.27
|
|
Sun
|
80.36
|
9.89
|
33.43
|
4.62
|
44.30
|
5.28
|
32.84
|
11.67
|
7.89
|
3.29
|
|
Wings
|
84.20
|
11.47
|
34.75
|
4.65
|
44.47
|
5.24
|
32.06
|
11.75
|
7.12
|
2.95
|
# Keep in case, below creates a simple table
#kable(my_team_s)
# "Fancy Table" for win and loss median/variance
my_team_s %>%
kbl(digits = 2, caption = "2025 Seattle Storms Win/Loss Means and Standard Deviations") %>%
kable_classic(full_width = F)
2025 Seattle Storms Win/Loss Means and Standard Deviations
|
team_winner
|
mean
|
sd
|
|
FALSE
|
76.35
|
8.36
|
|
TRUE
|
86.96
|
8.08
|
Histogram and
Boxplot
###Side by side histograms
p = df_hist %>%
ggplot( aes(x=team_score, fill=result)) +
geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
scale_fill_manual(values=c("#2C5234", "#FBE122"))
p

boxplot(team_score ~ result, data = df_hist,
col = c("#2C5234", "#FBE122"),
main = "Points scored by game result",
xlab = "Result", ylab = "Points")

These graphs show us that the average team score for a win is higher
than a loss. This can be seen easier in the box plot since the median is
higher in the box plot representing the winning games team score.
\(~\)
Models
Our first order model consists of a dependent variable team_score and
independent variables field_goal_pct (fg%), total_rebounds (tr),
three_point_field_goal_pct (3-pt %), and steals.
The equation for this first order model is team_score = -7.7862548 +
1.4083407 * fg% + 0.53063 * tr + 0.1411404 * 3-pt % + 0.746605 *
steals.
\(~\)
After using the vif command, we found no multicollinearity issues
within our variables as all values were between 1 and 2.
\(~\)
Looking at the correlation matrix, we found that no variables were
highly correlated with each other. The highest correlation coefficient
seen was 0.52 but that is under the threshold for high correlation.
\(~\)
Interaction Model
The full interaction model for predicting team score for the Seattle
Storms is team_score = 28.937801 + 0.6626619 * fg% + -0.0405748 * tr +
-0.3439152 * 3-pt % + -0.9288016 * steals + 0.0039546 * fg% * tr +
-0.0024727 * fg% * 3-pt % + 0.0839728 * fg% * steals + 0.0240574 * tr *
3-pt% + -0.0249502 * tr * steals + -0.0293786 * 3-pt% * steals.
\(~\)
For a cleaner look for the full interaction model, look below.
|
|
|
|
Dependent variable:
|
|
|
|
|
|
team_score
|
|
|
|
field_goal_pct
|
0.663
|
|
|
(2.900)
|
|
|
|
|
total_rebounds
|
-0.041
|
|
|
(2.714)
|
|
|
|
|
three_point_field_goal_pct
|
-0.344
|
|
|
(2.074)
|
|
|
|
|
steals
|
-0.929
|
|
|
(6.695)
|
|
|
|
|
field_goal_pct:total_rebounds
|
0.004
|
|
|
(0.053)
|
|
|
|
|
field_goal_pct:three_point_field_goal_pct
|
-0.002
|
|
|
(0.024)
|
|
|
|
|
field_goal_pct:steals
|
0.084
|
|
|
(0.122)
|
|
|
|
|
total_rebounds:three_point_field_goal_pct
|
0.024
|
|
|
(0.029)
|
|
|
|
|
total_rebounds:steals
|
-0.025
|
|
|
(0.071)
|
|
|
|
|
three_point_field_goal_pct:steals
|
-0.029
|
|
|
(0.055)
|
|
|
|
|
Constant
|
28.938
|
|
|
(147.927)
|
|
|
|
|
|
|
Observations
|
42
|
|
R2
|
0.701
|
|
Adjusted R2
|
0.605
|
|
Residual Std. Error
|
6.069 (df = 31)
|
|
F Statistic
|
7.274*** (df = 10; 31)
|
|
|
|
Note:
|
p<0.1; p<0.05;
p<0.01
|
\(~\)
When reducing this model, we were given the limit of \(\alpha\) = 0.15. While reducing, I kept
note of the adjusted R2 value just to see if an interaction
term was not significant but predicted team score better.
\(~\)
While reducing, I ended up reducing all interaction terms except for
total rebounds (tr) and 3-pt % and took note that while the interaction
term was not significant at \(\alpha\)
= 0.15, adjusted R2 was 0.6429 which was higher than the base
model.
\(~\)
My final model is team_score = -7.7862548 + 1.4083407 * fg% + 0.53063
* tr + 0.1411404 * 3-pt % + 0.746605 * steals.
This model significantly predicts team score, f(4,37) = 18.9870959,
p, adjusted R2 = 0.6370023
See table below for a clearer look for the final model.
|
|
|
|
Dependent variable:
|
|
|
|
|
|
team_score
|
|
|
|
field_goal_pct
|
1.408***
|
|
|
(0.223)
|
|
|
|
|
total_rebounds
|
0.531***
|
|
|
(0.173)
|
|
|
|
|
three_point_field_goal_pct
|
0.141
|
|
|
(0.124)
|
|
|
|
|
steals
|
0.747**
|
|
|
(0.287)
|
|
|
|
|
Constant
|
-7.786
|
|
|
(12.729)
|
|
|
|
|
|
|
Observations
|
42
|
|
R2
|
0.672
|
|
Adjusted R2
|
0.637
|
|
Residual Std. Error
|
5.817 (df = 37)
|
|
F Statistic
|
18.987*** (df = 4; 37)
|
|
|
|
Note:
|
p<0.1; p<0.05;
p<0.01
|
\(~\)
Residual Analysis
# Gives us the versus fits plot to asses constant variance
ols_plot_resid_fit(model4)

# Gives us the histogram of the residuals to asses normality of the residuals
ols_plot_resid_hist(model4)

# Gives us the view for any outliers in the data
ols_plot_resid_stud(model4)

# Gives us both leverage and outliers in the data
ols_plot_resid_lev(model4, threshold = 3)

# Gives us the cooks d chart for significant values
ols_plot_cooksd_chart(model4)

\(~\)
For our residual analysis, from the histogram, we can tell that the
distribution of the residuals is approximately normal with some skewness
from the left. From our versus fits plot, we cannot see a pattern within
the variance so we can say that there is constant variance for our
residuals as well. Even with a slightly skewed histogram, I think that
we can use this regression line as a solid predictor for the Seattle
Storms team score.
We have influential values at games 10, 12, 37, and 41. \(~\)
We also have leverage points at games 12, 37, and 41. \(~\)
Game 10 was against the Connecticut Suns. \(~\)
Game 12 was against the New York Liberties. \(~\)
Game 37 was against the Washington Mystics. \(~\)
Game 41 was against the Minnesota Lynx.
\(~\)
Prediction
I built the model to predict my team’s points for a game in which
they achieve the median value for each variable. \(~\)
The predicted team score is 82.0419787, with 95% confidence interval
(80.1458374, 83.93812).
---
title: "2025 WNBA Seattle Storms Analysis"
author: "Marius David Firulovici"
date: "`r Sys.Date()`"
output:
  html_document: 
    toc: yes
    toc_depth: 4
    toc_float: yes
    number_sections: yes
    toc_collapsed: yes
    code_folding: hide
    code_download: yes
    smooth_scroll: yes
    theme: lumen
  pdf_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    number_sections: yes
    fig_width: 3
    fig_height: 3
  word_document: 
    toc: yes
    toc_depth: 4
    fig_caption: yes
    keep_md: yes
editor_options: 
  chunk_output_type: inline
---

```{css, echo = FALSE}
#TOC::before {
  content: "Table of Contents";
  font-weight: bold;
  font-size: 1.2em;
  display: block;
  color: navy;
  margin-bottom: 10px;
}


div#TOC li {     /* table of content  */
    list-style:upper-roman;
    background-image:none;
    background-repeat:none;
    background-position:0;
}

h1.title {    /* level 1 header of title  */
  font-size: 22px;
  font-weight: bold;
  color: DarkRed;
  text-align: center;
  font-family: "Gill Sans", sans-serif;
}

h4.author { /* Header 4 - and the author and data headers use this too  */
  font-size: 15px;
  font-weight: bold;
  font-family: system-ui;
  color: navy;
  text-align: center;
}

h4.date { /* Header 4 - and the author and data headers use this too  */
  font-size: 18px;
  font-weight: bold;
  font-family: "Gill Sans", sans-serif;
  color: DarkBlue;
  text-align: center;
}

h1 { /* Header 1 - and the author and data headers use this too  */
    font-size: 20px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

h2 { /* Header 2 - and the author and data headers use this too  */
    font-size: 18px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h3 { /* Header 3 - and the author and data headers use this too  */
    font-size: 16px;
    font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: navy;
    text-align: left;
}

h4 { /* Header 4 - and the author and data headers use this too  */
    font-size: 14px;
  font-weight: bold;
    font-family: "Times New Roman", Times, serif;
    color: darkred;
    text-align: left;
}

/* Add dots after numbered headers */
.header-section-number::after {
  content: ".";

body { background-color:white; }

.highlightme { background-color:yellow; }

p { background-color:white; }

}
```


``` {r setup, include=F}
# Setup Chunk

# Sets the working directory (wd) as the working folder
setwd("~/STA 319 Spring 2026")

# Import the WNBA data set to get information about the Seattle Storms
data = read.csv("WNBA_2025_box-scores.csv", header = T)

# Libraries we will be using throughout this project
# 
library(ggplot2)
# 
library(dplyr)
# VIF in car package
library(car)
# 
library(olsrr)

# Package for nicer markdown tables
library(kableExtra)

# Nicer Tables for Regression and stuff
library(stargazer)


```

# Data

``` {r wrangling, include = F}
# Data Wrangling Chunk

summary(data)

# Filtering out All Star Games (Primary use here)
df = data %>%
  filter(team_id < 21)

# Alternate way of filtering out the All Star Games
df2 = data %>%
  filter((team_name != "Team WNBA" & team_name != "Team USA"))

# Group data by team, finds mean/sd of all variables

summaryC = df %>%
  group_by(team_name) %>%
  summarise(mean_score = mean(team_score), sd_score = sd(team_score), 
            mean_rebounds=mean(total_rebounds), sd_rebounds=sd(total_rebounds),
            mean_fgp=mean(field_goal_pct), sd_fgp=sd(field_goal_pct),
            mean_3pp=mean(three_point_field_goal_pct), sd_3pp=sd(three_point_field_goal_pct),
            mean_steals=mean(steals), sd_steals=sd(steals))

# Gives a summary of the variables
summaryC


# New Data Frame only using team_score, field_goal_percent, rebounds, 3-Point
# Percentage, and Steals
# Filters out every team but the Seattle Storms "Storm"

my_team = df %>%
  select(team_score, field_goal_pct, total_rebounds, 
         three_point_field_goal_pct, steals, team_name, team_winner) %>%
  filter(team_name == "Storm")

# Groups cases by win vs. loss
# Summarizes all variables with mean and sd

my_team_s = my_team %>%
  group_by(team_winner) %>%
  summarise(mean = mean(team_score), sd = sd(team_score))

my_team_s


#Mutate and case_when

df_hist = df %>%
  filter(team_name == "Storm") %>%
  mutate(result = case_when(team_winner == TRUE ~ "Win",
                           team_winner == FALSE ~ "Loss"))



```


This data set is all about all 2025 WNBA games which includes all star games.
$~$

The filtered data set takes out the all star games and only looks at the teams within the WNBA.
$~$

After filtering out the all star games, we want to predict specifically the team score of the
median Seattle Storms game using a statistical model.

```{r tables, include=TRUE, comment=NA}

# Keep in case, below creates a simple table
#kable(summaryC)

# "Fancy Table" for league summary
summaryC %>%
  kbl(digits = 2, caption = "2025 League Stats") %>%
  kable_classic_2(full_width = F)

# Keep in case, below creates a simple table
#kable(my_team_s)

# "Fancy Table" for win and loss median/variance
my_team_s %>%
  kbl(digits = 2, caption = "2025 Seattle Storms Win/Loss Means and Standard Deviations") %>%
  kable_classic(full_width = F)


```

# Histogram and Boxplot

```{r graphs, include = T, fig.width = 4, fig.height = 4, message = F}
###Side by side histograms

p = df_hist %>%
  ggplot( aes(x=team_score, fill=result)) +
  geom_histogram( color="#e9ecef", alpha=0.6, position = 'identity') +
  scale_fill_manual(values=c("#2C5234", "#FBE122"))
p

boxplot(team_score ~ result, data = df_hist,
        col = c("#2C5234", "#FBE122"),
        main = "Points scored by game result",
        xlab = "Result", ylab = "Points")

```

These graphs show us that the average team score for a win is higher than a loss. 
This can be seen easier in the box plot since the median is higher in the box plot representing the winning games team score.

$~$

# Models


``` {r first order model, echo = F, include = F}

# First Order Model
model1 = lm(team_score ~ field_goal_pct + total_rebounds+ 
         three_point_field_goal_pct+ steals,
         data = my_team)

summary(model1)

# Gives us the VIF for our 1st model to find multicollinearity
vif(model1)

# Gives us only independent variables
cor_data = my_team %>%
  select(field_goal_pct, total_rebounds, three_point_field_goal_pct,
         steals)

# Gives us a correlation matrix between our independent variables
cor(cor_data)


# No need for model 2 as there were no highly correlated pairs (|r|> 0.8)
# VIF values also look fine, nothing above 4
# Shortened total rebounds to tr for writing purposes in narrative

```

Our first order model consists of a dependent variable team_score and independent
variables field_goal_pct (fg%), total_rebounds (tr), three_point_field_goal_pct (3-pt %),
and steals.

The equation for this first order model is team_score = `r model1$coefficients[1]` + `r model1$coefficients[2]` * fg% +
`r model1$coefficients[3]` * tr + `r model1$coefficients[4]` * 3-pt % +
`r model1$coefficients[5]` * steals.

$~$

After using the vif command, we found no multicollinearity issues within our variables
as all values were between 1 and 2.

$~$

Looking at the correlation matrix, we found that no variables were highly correlated
with each other. The highest correlation coefficient seen was 0.52 but that is under
the threshold for high correlation.

$~$

# Interaction Model

``` {r interaction model, results = 'asis', echo = F, include = F, comment=NA}

# Full interaction model (Considered Model 3 due to assignment)
model3 = lm(team_score ~ field_goal_pct + total_rebounds 
         + three_point_field_goal_pct+ steals
         + field_goal_pct * total_rebounds 
         + field_goal_pct * three_point_field_goal_pct
         + field_goal_pct * steals
         + total_rebounds * three_point_field_goal_pct
         + total_rebounds * steals
         + three_point_field_goal_pct * steals 
         , data = my_team)

# Summarizes model 3
summary(model3)



# Interaction Model Adjusted for when alpha = 0.15 (Aka P > 0.15)
# No interaction term was significant at alpha = 0.15
# Closest one was total_rebounds * three_point_field_goal_pct at p = 0.21243

model4 = lm(team_score ~ field_goal_pct + total_rebounds 
         + three_point_field_goal_pct+ steals, data = my_team)

# Summarizes model 4 for visualization
model4sum = summary(model4)

```

```{r experiment, include = F, echo = F}
# Experimental Model, not for actual grading
# Wanted to see if there were any terms that had a better
# adj. R^2 than the base model
modelE = lm(team_score ~ field_goal_pct + total_rebounds 
         + three_point_field_goal_pct+ steals
         + total_rebounds * three_point_field_goal_pct
         , data = my_team)

summary(modelE)

summary(model4)

```

``` {r inline code, include=F}
# The stuff in red is taken from the global environment, formatting is
# a little weird tho, make sure to check any formatting errors
# Super scripts are done by ^funnything^
# $~$ creates a blank space
# <br> creates a <br>eak, useful for the graph later on
```

The full interaction model for predicting team score for the Seattle Storms is
team_score = `r model3$coefficients[1]` +  `r model3$coefficients[2]` * fg% +
 `r model3$coefficients[3]` * tr +  `r model3$coefficients[4]` * 3-pt % +
 `r model3$coefficients[5]` * steals +  `r model3$coefficients[6]` * fg% * tr +
 `r model3$coefficients[7]` * fg% * 3-pt % + `r model3$coefficients[8]` * fg% * steals +
 `r model3$coefficients[9]` * tr * 3-pt% +
 `r model3$coefficients[10]` * tr * steals +
 `r model3$coefficients[11]` * 3-pt% * steals.
 
$~$

For a cleaner look for the full interaction model, look below.

``` {r fullinteractiongraph, include =  T, echo = F, results = 'asis'}
# message = F gets rid of a future message, I didn't have that problem here
library(stargazer)

# Makes a nice looking table for our final model
stargazer(model3, type = "html")
```

$~$

When reducing this model, we were given the limit of $\alpha$ = 0.15.
While reducing, I kept note of the adjusted R^2^ value just to see if an interaction
term was not significant but predicted team score better.

$~$

While reducing, I ended up reducing all interaction terms except for total rebounds (tr)
and 3-pt % and took note that while the interaction term was not significant at
$\alpha$ = 0.15, adjusted R^2^ was 0.6429 which was higher than the base model.

$~$

My final model is team_score = `r model4$coefficients[1]` + `r model4$coefficients[2]` * fg% +
`r model4$coefficients[3]` * tr + `r model4$coefficients[4]` * 3-pt % +
`r model4$coefficients[5]` * steals.


This model significantly predicts team score, f(`r model4sum$fstatistic[2]`,`r model4sum$fstatistic[3]`) = `r model4sum$fstatistic[1]`, 
p, adjusted R^2^ = `r model4sum$adj.r.squared`

See table below for a clearer look for the final model.

``` {r, include = T, echo = F, results = 'asis'}
# message = F gets rid of a future message, I didn't have that problem though here
library(stargazer)

# Makes a nice looking table for our final model
stargazer(model4, type = "html")
```

$~$

# Residual Analysis

``` {r residual analysis, include = T, fig.width = 4, fig.height = 4}

# Gives us the versus fits plot to asses constant variance
ols_plot_resid_fit(model4)

# Gives us the histogram of the residuals to asses normality of the residuals
ols_plot_resid_hist(model4)

# Gives us the view for any outliers in the data
ols_plot_resid_stud(model4)

# Gives us both leverage and outliers in the data
ols_plot_resid_lev(model4, threshold = 3)

# Gives us the cooks d chart for significant values
ols_plot_cooksd_chart(model4)




```

<br>
$~$

For our residual analysis, from the histogram, we can tell that the distribution of the residuals is approximately normal with some skewness from the left. From our versus fits plot, we cannot see a pattern within the variance so we can say that there is constant variance for our residuals as well. Even with a slightly skewed histogram, I think that we can use this regression line as a solid predictor for the Seattle Storms team score.

We have influential values at games 10, 12, 37, and 41.
$~$

We also have leverage points at games 12, 37, and 41.
$~$

Game 10 was against the Connecticut Suns.
$~$

Game 12 was against the New York Liberties.
$~$

Game 37 was against the Washington Mystics.
$~$

Game 41 was against the Minnesota Lynx.

$~$

# Prediction

``` {r prediction, include = F}

# Summarizes our variables to grab the experimental region and
# the median for our variables
summary(cor_data)

# New data frame for the median games
newdata = data.frame(field_goal_pct = 43.00, total_rebounds = 34.00,
                     three_point_field_goal_pct = 29.30, steals = 9.5)

# Predicting the median game with a 95% confidence interval
median_game = predict(model4, newdata, interval = "confidence", level = .95)

```

I built the model to predict my team's points for a game in which they achieve the median value for each variable. 
$~$

The predicted team score is `r median_game[1]`, with 
95% confidence interval (`r median_game[2]`, `r median_game[3]`).





