Week9_Data

Week 9 Data Dive

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)
library(tidyverse)
library(ggthemes)
library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.3.3

library(effsize)
library(pwrss)

## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

library(tidyverse)
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)

# remove scientific notation
options(scipen = 6)

# default theme, unless otherwise noted
theme_set(theme_minimal())

df<- read_delim("/Users/matthewjobe/Downloads/quasi_winshares.csv", delim = ",")

## Rows: 98796 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): name_common, player_ID, team_ID, lg_ID, def_pos, franch_id, prev_fr...
## dbl (8): age, year_ID, pct_PT, WAR162, quasi_ws, stint_ID, year_acq, year_left
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df |>
  filter(year_ID>2004 & year_ID<2019)|>
  mutate(def_pos = ifelse(str_detect(def_pos, ","), "UTL", def_pos))|>
  group_by(def_pos)|> #Grouping based on defensive position
  summarize(count1=n(),#count of player by year
  mean(WAR162))

## # A tibble: 11 × 3
##    def_pos count1 `mean(WAR162)`
##    <chr>    <int>          <dbl>
##  1 1B         636          1.14 
##  2 2B         366          1.63 
##  3 3B         340          1.86 
##  4 C         1125          0.437
##  5 CF         371          2.08 
##  6 LF         297          1.08 
##  7 P        10230          0.572
##  8 RF         264          1.52 
##  9 SS         431          1.85 
## 10 UTL       5676          0.597
## 11 <NA>       114          0.136

df <- df |> 
  mutate(def_pos = ifelse(str_detect(def_pos, ","), "UTL", def_pos))
df <- df |> drop_na() #getting ride of rows with missing values

# Check if changes were applied
df |> filter(def_pos == "UTL") |> head()

## # A tibble: 6 × 16
##   name_common       age player_ID year_ID team_ID lg_ID pct_PT WAR162 def_pos
##   <chr>           <dbl> <chr>       <dbl> <chr>   <chr>  <dbl>  <dbl> <chr>  
## 1 Ketel Marte        25 marteke01    2019 ARI     NL      6.19  7.16  UTL    
## 2 Eduardo Escobar    30 escobed01    2019 ARI     NL      6.76  4.03  UTL    
## 3 Carson Kelly       24 kellyca02    2019 ARI     NL      3.56  1.90  UTL    
## 4 Jarrod Dyson       34 dysonja01    2019 ARI     NL      4.39  1.21  UTL    
## 5 Adam Jones         33 jonesad01    2019 ARI     NL      5.16 -0.107 UTL    
## 6 Alex Avila         32 avilaal01    2019 ARI     NL      1.94  1.21  UTL    
## # ℹ 7 more variables: quasi_ws <dbl>, stint_ID <dbl>, franch_id <chr>,
## #   prev_franch <chr>, year_acq <dbl>, year_left <dbl>, next_franch <chr>

df_new <- df |> filter(year_ID > 2004)

df_new|>  
  ggplot(mapping = aes(x = pct_PT, y = WAR162)) +  
  geom_point(size = 2) +  
  geom_smooth(method = "lm", se = FALSE, color = 'darkblue') +  
  labs(title = "Linear Relationship Between Playing Time and WAR162",  
       x = "Percent of Playing Time",  
       y = "WAR162") +  
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

model <- lm(WAR162 ~ pct_PT, data = df_new)
summary(model)

## 
## Call:
## lm(formula = WAR162 ~ pct_PT, data = df_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2702 -0.4408  0.0741  0.4078  7.9911 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.443377   0.010095  -43.92   <2e-16 ***
## pct_PT       0.543718   0.003525  154.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9963 on 21285 degrees of freedom
## Multiple R-squared:  0.5279, Adjusted R-squared:  0.5278 
## F-statistic: 2.38e+04 on 1 and 21285 DF,  p-value: < 2.2e-16

In last weeks data dive I built a linear regression model WAR162 (wins above a replacement level player for the season) and share of playing time. There did appear to be a linear association between these two variables. There is an estimated slope of 0.544, which means that for every 1 percent point in increase in pct_PT, WAR162 increases by 0.544. The small p-value also showed me that pct_PT (playing time) has a statistically significant relationship with WAR162. Additionally, we can see that Multiple R-Squared value is about 0.53. This means that approximately 53% of the variance in WAR162 is explained by pct_PT.

More Variables

model1 <- lm(WAR162 ~ pct_PT + age, data = df_new)
summary(model1)

## 
## Call:
## lm(formula = WAR162 ~ pct_PT + age, data = df_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1937 -0.4440  0.0713  0.4046  7.9020 
## 
## Coefficients:
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  0.206685   0.047476   4.353 0.0000135 ***
## pct_PT       0.547605   0.003519 155.594   < 2e-16 ***
## age         -0.023318   0.001664 -14.010   < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9918 on 21284 degrees of freedom
## Multiple R-squared:  0.5322, Adjusted R-squared:  0.5321 
## F-statistic: 1.211e+04 on 2 and 21284 DF,  p-value: < 2.2e-16

tidy(model1, conf.int = TRUE)

## # A tibble: 3 × 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)   0.207    0.0475       4.35 1.35e- 5   0.114     0.300 
## 2 pct_PT        0.548    0.00352    156.   0          0.541     0.555 
## 3 age          -0.0233   0.00166    -14.0  2.13e-44  -0.0266   -0.0201

Above, I added the age variable into my regression model. Age is an appropriate variable to include because as players get older their performance tends to decline

By adding the age variable my Adjusted went from 0.5278 up to 0.5321. I also get a very large F-Statistic (12,110) and very small p-value which would lead me to reject the null hypothesis that share of playing time and age have no effect on WAR162.

model2 <- lm(WAR162 ~ pct_PT + age + 
               quasi_ws + pct_PT:quasi_ws
             , data = df_new)
summary(model2)

## 
## Call:
## lm(formula = WAR162 ~ pct_PT + age + quasi_ws + pct_PT:quasi_ws, 
##     data = df_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52015 -0.09413  0.01126  0.11382  0.78169 
## 
## Coefficients:
##                   Estimate Std. Error  t value       Pr(>|t|)    
## (Intercept)     -0.0119491  0.0086264   -1.385          0.166    
## pct_PT          -0.4484316  0.0014134 -317.280        < 2e-16 ***
## age             -0.0018563  0.0003031   -6.125 0.000000000922 ***
## quasi_ws         0.3386023  0.0007877  429.857        < 2e-16 ***
## pct_PT:quasi_ws -0.0012449  0.0001197  -10.400        < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1797 on 21282 degrees of freedom
## Multiple R-squared:  0.9846, Adjusted R-squared:  0.9846 
## F-statistic: 3.411e+05 on 4 and 21282 DF,  p-value: < 2.2e-16

tidy(model2, conf.int = TRUE)

## # A tibble: 5 × 7
##   term            estimate std.error statistic  p.value conf.low conf.high
##   <chr>              <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)     -0.0119   0.00863      -1.39 1.66e- 1 -0.0289    0.00496
## 2 pct_PT          -0.448    0.00141    -317.   0        -0.451    -0.446  
## 3 age             -0.00186  0.000303     -6.13 9.22e-10 -0.00245  -0.00126
## 4 quasi_ws         0.339    0.000788    430.   0         0.337     0.340  
## 5 pct_PT:quasi_ws -0.00124  0.000120    -10.4  2.85e-25 -0.00148  -0.00101

Above, I added the quasi_ws (Quasi Win Shares) column, which is a measure of player value based upon their contribution to winning. This makes sense because it captures a player’s impact beyond share of playing time. However, there may be a problem with multicollinearity because WAR162 and Quasi Win Share both measure a players value.

The coefficient for pct_PT is negative, which is unexpected. This could be due to multicollinearity, as pct_PT and quasi_win_share are likely correlated, leading to an unstable estimate. It tells us that for each 1 unit increase in playing time, WAR162 decreases by -0.4484.

Age also has a small, but negative and significant effect on WAR162 which was expected. Out interaction term was share of playing time: Quasi Win Share. Rather than assuming Quasi Win Share has a fixed effect on WAR162, the interaction term allows the effect to change based upon the share of playing time. Surprisingly, we got a coefficient of -0.0012 with a statistically significant p-value. This means that as playing time increases the effect of quasi win share on WAR162 slightly decreases. For each 1 unit increase in age, WAR162 decreases by 0.0019.

Quasi Win Share has a strong and significant effect on WAR162, and for every 1 increment increase in Quasi Win Share, WAR162 increases by .34 which is a lot. The problem is that it could be driven by multicollinearity rather than a truly independent relationship.

I got an Adjusted R Squared Value of .9846 which means that 98.46% of the variance in WAR162 is explained by this model. The Residual Standard Error was 0.1797, indicating that the actual difference between the anticipated and actual WAR162 is about 0.18 (pretty low).

#vif(model2)
#looking for multicollinearity

Above we can see the VIF for Quasi Win Share and the interaction term (playing time and Quasi Win Share) are very high, which shows there is multicollinearity. This suggests Quasi Win Share is highly correlated with one or more other variables, and the interaction term is strongly correlated with other predictors. This will likely throw off my regression by overfitting, increasing standard error, and causing unstable coefficients.

Evaluating the Model

gg_resfitted(model2) +
  geom_smooth(se=FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

This plot actually looks pretty good. There is no obvious signs of fanning, but there may be some outliers. This indicates that the assumptions are reasonably met.

Each data point must be independent of every other data point.
The distribution of within each group is roughly normal
The variance of data within groups remains consistent for every group

gg_reshist(model2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The histogram of residuals above appears to be fairly normally distributed. This means that it meets the assumption of normality is reasonably met.

# the normal QQ plot
gg_qqplot(model2)

The QQ-Plot above checks if the assumption of normally distributed residuals. We can see some deviation on both ends, which indicates that there might be some mild skewness and the distribution could be non-normal.

ggplot(data = df_new) +
  geom_point(mapping = aes(x=pct_PT, y=WAR162))

The visual above is checking to see if there is a linear relationship between WAR162 and playing time. Based on what I can see, there does appear to be a linear relationship. This means that as share of playing time increases, wins above replacement also increases. The visual tells me that a linear model is appropriate.

# take a look at the docs for `threshold`
gg_cooksd(model2, threshold = 'matlab')

Above, we created a Cook’s distance plot which is meant to show observations that have high influence on the regression model. Many of the points above may look very infleuential, but the highest Cook’s distnance is about 0.008 which is very low. While these points may have some influence, I do not believe it is very strong due to how large the data frame is.

Week9_Data_Dive

2025-03-26

Week 9 Data Dive

More Variables

Evaluating the Model