Week 10 Data Dive

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes)
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.3
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.3.3
library(broom)
library(lindia)
library(car)
## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some
options(scipen = 6)
theme_set(theme_minimal())
df <- read_delim("/Users/matthewjobe/Downloads/quasi_winshares.csv", delim = ",")
## Rows: 98796 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): name_common, player_ID, team_ID, lg_ID, def_pos, franch_id, prev_fr...
## dbl (8): age, year_ID, pct_PT, WAR162, quasi_ws, stint_ID, year_acq, year_left
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Creating a new column, efficiency
df <- df %>%
  mutate(efficiency = WAR162 / pct_PT)
head(df)
## # A tibble: 6 × 17
##   name_common        age player_ID year_ID team_ID lg_ID pct_PT WAR162 def_pos  
##   <chr>            <dbl> <chr>       <dbl> <chr>   <chr>  <dbl>  <dbl> <chr>    
## 1 Ketel Marte         25 marteke01    2019 ARI     NL      6.19   7.16 CF, 2B, …
## 2 Zack Greinke        35 greinza01    2019 ARI     NL      4.11   5.02 P        
## 3 Eduardo Escobar     30 escobed01    2019 ARI     NL      6.76   4.03 3B, 2B   
## 4 Nick Ahmed          29 ahmedni01    2019 ARI     NL      6.04   3.75 SS       
## 5 Christian Walker    28 walkech02    2019 ARI     NL      5.83   2.19 1B       
## 6 Carson Kelly        24 kellyca02    2019 ARI     NL      3.56   1.90 C, 3B    
## # ℹ 8 more variables: quasi_ws <dbl>, stint_ID <dbl>, franch_id <chr>,
## #   prev_franch <chr>, year_acq <dbl>, year_left <dbl>, next_franch <chr>,
## #   efficiency <dbl>

Above, I created a new column called “efficiency”, which captures how much value (WAR162) a player provides relative to their share of playing time. However, since division by zero is possible, the column may contain non-finite values (NaN or Inf), which I must remove. WAR162 stands for “Wins Above Replacement per 162-game season,” where “replacement” refers to a replacement-level player.
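
Before dropping anything, a quick sanity check like the sketch below (using the df built above) shows how many rows are affected by non-finite efficiency values:

# Count non-finite efficiency values: 0/0 gives NaN, x/0 gives Inf
df %>%
  summarise(n_nan = sum(is.nan(efficiency)),
            n_inf = sum(is.infinite(efficiency)),
            n_na  = sum(is.na(efficiency)))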

library(dplyr)   # already attached via tidyverse
library(stringr) # already attached via tidyverse

# Flag players listed at multiple positions as utility ("UTL") players:
# a comma in def_pos indicates two or more positions
df <- df %>%
  mutate(utility = ifelse(grepl(",", def_pos), 1, 0))

# Print the updated dataframe
print(df)
## # A tibble: 98,796 × 18
##    name_common        age player_ID year_ID team_ID lg_ID pct_PT WAR162 def_pos 
##    <chr>            <dbl> <chr>       <dbl> <chr>   <chr>  <dbl>  <dbl> <chr>   
##  1 Ketel Marte         25 marteke01    2019 ARI     NL      6.19   7.16 CF, 2B,…
##  2 Zack Greinke        35 greinza01    2019 ARI     NL      4.11   5.02 P       
##  3 Eduardo Escobar     30 escobed01    2019 ARI     NL      6.76   4.03 3B, 2B  
##  4 Nick Ahmed          29 ahmedni01    2019 ARI     NL      6.04   3.75 SS      
##  5 Christian Walker    28 walkech02    2019 ARI     NL      5.83   2.19 1B      
##  6 Carson Kelly        24 kellyca02    2019 ARI     NL      3.56   1.90 C, 3B   
##  7 David Peralta       31 peralda01    2019 ARI     NL      4.17   1.79 LF      
##  8 Robbie Ray          27 rayro02      2019 ARI     NL      4.69   1.59 P       
##  9 Merrill Kelly       30 kellyme01    2019 ARI     NL      5.13   1.21 P       
## 10 Jarrod Dyson        34 dysonja01    2019 ARI     NL      4.39   1.21 CF, RF,…
## # ℹ 98,786 more rows
## # ℹ 9 more variables: quasi_ws <dbl>, stint_ID <dbl>, franch_id <chr>,
## #   prev_franch <chr>, year_acq <dbl>, year_left <dbl>, next_franch <chr>,
## #   efficiency <dbl>, utility <dbl>
head(df)
## # A tibble: 6 × 18
##   name_common        age player_ID year_ID team_ID lg_ID pct_PT WAR162 def_pos  
##   <chr>            <dbl> <chr>       <dbl> <chr>   <chr>  <dbl>  <dbl> <chr>    
## 1 Ketel Marte         25 marteke01    2019 ARI     NL      6.19   7.16 CF, 2B, …
## 2 Zack Greinke        35 greinza01    2019 ARI     NL      4.11   5.02 P        
## 3 Eduardo Escobar     30 escobed01    2019 ARI     NL      6.76   4.03 3B, 2B   
## 4 Nick Ahmed          29 ahmedni01    2019 ARI     NL      6.04   3.75 SS       
## 5 Christian Walker    28 walkech02    2019 ARI     NL      5.83   2.19 1B       
## 6 Carson Kelly        24 kellyca02    2019 ARI     NL      3.56   1.90 C, 3B    
## # ℹ 9 more variables: quasi_ws <dbl>, stint_ID <dbl>, franch_id <chr>,
## #   prev_franch <chr>, year_acq <dbl>, year_left <dbl>, next_franch <chr>,
## #   efficiency <dbl>, utility <dbl>

In the code above, I made a new column called “utility”. In baseball, when a player plays two or more positions, they are referred to as a utility player. The new column contains a 1 if the player is a utility player and a 0 if not.
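
As a quick check on the new indicator, a one-line tally (a sketch using the column created above) shows how the player-seasons split:

# Tally utility (1) vs. single-position (0) player-seasons
df %>% count(utility)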

# Remove rows with NA values
df_clean <- na.omit(df)

# Keep seasons after 2004 with finite efficiency values
filtered_data <- df_clean |> filter(year_ID > 2004 & is.finite(efficiency))
filtered_data
## # A tibble: 21,256 × 18
##    name_common        age player_ID year_ID team_ID lg_ID pct_PT WAR162 def_pos 
##    <chr>            <dbl> <chr>       <dbl> <chr>   <chr>  <dbl>  <dbl> <chr>   
##  1 Ketel Marte         25 marteke01    2019 ARI     NL      6.19   7.16 CF, 2B,…
##  2 Zack Greinke        35 greinza01    2019 ARI     NL      4.11   5.02 P       
##  3 Eduardo Escobar     30 escobed01    2019 ARI     NL      6.76   4.03 3B, 2B  
##  4 Nick Ahmed          29 ahmedni01    2019 ARI     NL      6.04   3.75 SS      
##  5 Christian Walker    28 walkech02    2019 ARI     NL      5.83   2.19 1B      
##  6 Carson Kelly        24 kellyca02    2019 ARI     NL      3.56   1.90 C, 3B   
##  7 David Peralta       31 peralda01    2019 ARI     NL      4.17   1.79 LF      
##  8 Robbie Ray          27 rayro02      2019 ARI     NL      4.69   1.59 P       
##  9 Merrill Kelly       30 kellyme01    2019 ARI     NL      5.13   1.21 P       
## 10 Jarrod Dyson        34 dysonja01    2019 ARI     NL      4.39   1.21 CF, RF,…
## # ℹ 21,246 more rows
## # ℹ 9 more variables: quasi_ws <dbl>, stint_ID <dbl>, franch_id <chr>,
## #   prev_franch <chr>, year_acq <dbl>, year_left <dbl>, next_franch <chr>,
## #   efficiency <dbl>, utility <dbl>

Above, I removed all NA values, as they would hinder the analysis. Luckily, only 4 of the 10,000+ rows in my new data frame were affected.
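
To verify how much data each cleaning step removed, a row-count comparison like this sketch works:

# Row counts at each stage of cleaning
c(raw = nrow(df), after_na_omit = nrow(df_clean), final = nrow(filtered_data))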

# Filter for non-utility players (utility == 0)
players_nonutility <- filter(filtered_data, utility == 0)

# Fit a linear model: Quasi Win Share ~ Playing Time
model1 <- lm(quasi_ws ~ pct_PT, players_nonutility)

# Get R-squared from model summary
rsquared <- summary(model1)$r.squared

filtered_data |>
  filter(utility == 0) |>
  ggplot(mapping = aes(x = pct_PT, y = quasi_ws)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed', se = FALSE) +
  labs(
    title = "Quasi Win Share vs. Playing Time (Non-Utility Players)",
    subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3)),
    x = "Playing Time",
    y = "Quasi Win Share"
  ) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'

Based on this visual of playing time versus Quasi Win Share for players who play only one position, the relationship appears to be more quadratic than strictly linear. Since there is a slight upward curve, I will apply a quadratic transformation to Playing Time.

Quasi Win Share is a derived statistic, not a count of observed events. It is based on three times the total wins created per 162 games: WAR162 is added to the wins below replacement (determined by playing time), and the result is rounded to the nearest whole number. This makes it inappropriate to treat Quasi Win Share as count data and use Poisson regression, which assumes the response variable represents counts of independent events occurring over fixed intervals; that is not the case here.
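
As a rough illustration, here is one possible reading of that description as code (a sketch only, not the exact published formula; wins_below_repl is a hypothetical playing-time-based input, not a column in this data set):

# Hypothetical sketch of the quasi win share construction described above;
# the real statistic may round or scale at a different step
quasi_ws_sketch <- function(WAR162, wins_below_repl) {
  round(3 * (WAR162 + wins_below_repl))
}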

model2 <- lm(quasi_ws ~ I(pct_PT ^ 2) + pct_PT,
            players_nonutility)

rsquared <- summary(model2)$r.squared

players_nonutility |>
  ggplot(mapping = aes(x = pct_PT ^ 2, 
                       y = quasi_ws)) +
  geom_point() +
  geom_smooth(method = 'lm', color = 'gray', linetype = 'dashed',
              se = FALSE) +
  geom_smooth(se = FALSE) +
  labs(title = "Quasi Win Share vs. (Playing Time) ^ 2",
       subtitle = paste("Linear Fit R-Squared =", round(rsquared, 3))) +
  theme_classic()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

summary(model2)
## 
## Call:
## lm(formula = quasi_ws ~ I(pct_PT^2) + pct_PT, data = players_nonutility)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6719  -0.9304  -0.0932   0.9015  19.5591 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.073521   0.042048  -1.749   0.0804 .  
## I(pct_PT^2)  0.247049   0.006552  37.706   <2e-16 ***
## pct_PT       1.591237   0.040352  39.434   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.74 on 15111 degrees of freedom
## Multiple R-squared:  0.8284, Adjusted R-squared:  0.8284 
## F-statistic: 3.647e+04 on 2 and 15111 DF,  p-value: < 2.2e-16

The new model performs better than the original: it does a better job of capturing the non-linear relationship and has a higher R-squared value. Approximately 83% of the variation in Quasi Win Share is explained by playing time.
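
To make the comparison concrete, the two R-squared values can be placed side by side (a quick sketch using the models fit above):

# Compare fit quality of the linear and quadratic models
c(linear    = summary(model1)$r.squared,
  quadratic = summary(model2)$r.squared)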

plots <- gg_diagnose(model2, plot.all = FALSE)
plot_all(plots[c('res_fitted', 'qqplot')], 
         max.per.page = 1)

From the visuals above we can see two things:

  1. The lower residuals are clustered more tightly around zero, while the larger residuals show more spread.

  2. The tails in the Q-Q plot deviate from the straight line, indicating that the residuals do not follow a normal distribution.

This is a problem because it violates the constant-variance and normality assumptions for the residuals. It tells us that the response variable is not normally distributed, but this can be addressed with a power transformation.

pT <- powerTransform(model2, family = "bcnPower")
pT$lambda
## [1] 0.3237929
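
One way to act on this estimate is to refit the quadratic model on a power-transformed response. A plain power cannot handle the non-positive quasi_ws values (which is why bcnPower was used above), so this sketch filters to positive values first; that restriction is a limitation of the sketch, not a recommendation:

# Sketch: refit with the suggested power (lambda ~ 0.324) on positive quasi_ws
pos <- filter(players_nonutility, quasi_ws > 0)
model2_bc <- lm(quasi_ws^0.324 ~ I(pct_PT^2) + pct_PT, data = pos)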

Below, I will try a few different transformations to see which one fits my data best.

p1 <- ggplot(data = filtered_data) +
  geom_histogram(aes(x = quasi_ws), color='white') +
  labs(x = "y")

p2 <- ggplot(data = filtered_data) +
  geom_histogram(aes(x = sqrt(quasi_ws)), color='white') +
  labs(x = "square root(y)")

p1 + p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p3 <- ggplot(data = filtered_data) + 
  geom_histogram(aes(x = quasi_ws), color = 'white') + 
  labs(x = "y")

p4 <- ggplot(data = filtered_data) + 
  geom_histogram(aes(x = (quasi_ws)^0.324), color = 'white') + 
  labs(x = "Box-Cox transformed y (λ = 0.324)")

p3 + p4
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

p1 <- ggplot(data = filtered_data) +
  geom_histogram(aes(x = quasi_ws), color='white') +
  labs(x = "y")

p2 <- ggplot(data = filtered_data) +
  geom_histogram(aes(x = log(quasi_ws)), color='white') +
  labs(x = "log(y)")

p1 + p2
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 6224 rows containing non-finite outside the scale range
## (`stat_bin()`).

This is not strictly needed for this data dive; I just wanted to see how my data compared under each transformation.

Logistic Regression

My goal is to predict whether a player is a utility player. I will use utility, which is a binary response variable, and WAR162 as my explanatory variable.

filtered_data |>
  ggplot(mapping = aes(x = WAR162, y = utility)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(title = "Modeling a Binary Response with OLS") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

model <- glm(utility ~ WAR162, data = filtered_data, family = binomial)
summary(model)
## 
## Call:
## glm(formula = utility ~ WAR162, family = binomial, data = filtered_data)
## 
## Coefficients:
##             Estimate Std. Error z value       Pr(>|z|)    
## (Intercept) -0.85434    0.01670 -51.160        < 2e-16 ***
## WAR162      -0.06853    0.01097  -6.248 0.000000000416 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25559  on 21255  degrees of freedom
## Residual deviance: 25518  on 21254  degrees of freedom
## AIC: 25522
## 
## Number of Fisher Scoring iterations: 4
model <- glm(utility ~ WAR162, data = filtered_data,
             family = binomial(link = 'logit'))

model$coefficients
## (Intercept)      WAR162 
## -0.85433720 -0.06852524

From this, we can determine that when WAR162 is 0, the probability of being a utility player is about 29.8%. Additionally, each 1-unit increase in WAR162 reduces the odds that a player is a utility player by about 6.6%.
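
These numbers can be recovered directly from the fitted coefficients (a quick numerical check on the model fit above):

# Verify the interpretation numerically
b <- coef(model)
plogis(b["(Intercept)"])  # P(utility) when WAR162 = 0, about 0.298
exp(b["WAR162"]) - 1      # multiplicative change in odds per unit WAR162, about -6.6%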

# these coefficients come from the model
sigmoid <- \(x) 1 / (1 + exp(-(-0.854 - 0.0685 * x)))

filtered_data |>
  ggplot(mapping = aes(x = WAR162, y = utility)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Modeling a Binary Response with Sigmoid") +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  theme_minimal()

Here we can see that this sigmoid function gives us insight into the likelihood that a player is a utility player, given a WAR162 value. Unfortunately, it does not tell us a whole lot.

Next, I will use utility as the binary response variable and pct_PT (playing time) as the explanatory variable to predict whether a player is a utility player.

filtered_data |>
  ggplot(mapping = aes(x = pct_PT, y = utility)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_smooth(method = 'lm', se = FALSE) +
  labs(title = "Modeling a Binary Response with OLS") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

model <- glm(utility ~ pct_PT, data = filtered_data, family = binomial)
summary(model)
## 
## Call:
## glm(formula = utility ~ pct_PT, family = binomial, data = filtered_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.012453   0.022673  -44.65  < 2e-16 ***
## pct_PT       0.052013   0.007694    6.76 1.38e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25559  on 21255  degrees of freedom
## Residual deviance: 25513  on 21254  degrees of freedom
## AIC: 25517
## 
## Number of Fisher Scoring iterations: 4
model <- glm(utility ~ pct_PT, data = filtered_data,
             family = binomial(link = 'logit'))

model$coefficients
## (Intercept)      pct_PT 
## -1.01245331  0.05201334

From this, we can determine that for every 1-percentage-point increase in playing time, the odds of a player being a utility player increase by about 5.3%. When playing time is 0, we predict that there is a 26.7% probability that the player is a utility player.
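
To see what this implies on the probability scale, we can compute predicted probabilities at a few playing-time values (a sketch; the chosen values are arbitrary):

# Predicted P(utility) at a few playing-time values
predict(model, newdata = data.frame(pct_PT = c(0, 5, 10)), type = "response")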

# these coefficients come from the model
sigmoid <- \(x) 1 / (1 + exp(-(-1.012 + 0.052 * x)))

filtered_data |>
  ggplot(mapping = aes(x = pct_PT, y = utility)) +
  geom_jitter(width = 0, height = 0.1, shape = 'O', size = 3) +
  geom_function(fun = sigmoid, color = 'blue', linewidth = 1) +
  labs(title = "Modeling a Binary Response with Sigmoid") +
  scale_y_continuous(breaks = c(0, 0.5, 1)) +
  theme_minimal()

Based on this logistic regression plot, we can in fact see that playing time has a weak but positive relationship with the probability of being a utility player.

Confidence Interval

coef_estimate <- coef(summary(model))["pct_PT", "Estimate"]
std_error <- coef(summary(model))["pct_PT", "Std. Error"]
lower_bound <- coef_estimate - 1.96 * std_error
upper_bound <- coef_estimate + 1.96 * std_error

cat("95% CI for pct_PT coefficient: [", round(lower_bound, 3), ",", round(upper_bound, 3), "]\n")
## 95% CI for pct_PT coefficient: [ 0.037 , 0.067 ]
exp(0.037)
## [1] 1.037693
exp(0.067)
## [1] 1.069295

Based on the confidence interval above, we are 95% confident that each 1-percentage-point increase in playing time increases the log-odds of being a utility player by between 0.037 and 0.067, which corresponds to an increase in the odds of between 3.8% and 6.9%.
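
The same Wald interval, exponentiated onto the odds-ratio scale, can be reproduced with confint.default() (a sketch that should match the bounds computed above):

# Wald CI for pct_PT on the odds-ratio scale
exp(confint.default(model, "pct_PT"))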

Further Questions

  1. If you can play multiple positions, then you should have more opportunities to get playing time. So why is that not reflected more strongly here?

  2. Would it be appropriate to perform evaluations only for players whose WAR162 is above a certain value? This could have made my transformation much better.