library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(tidyverse)
library(ggthemes)
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.3.3
library(effsize)
library(pwrss)
##
## Attaching package: 'pwrss'
##
## The following object is masked from 'package:stats':
##
## power.t.test
library(tidyverse)
library(ggthemes)
library(ggrepel)
library(AmesHousing)
library(boot)
library(broom)
library(lindia)
# remove scientific notation
options(scipen = 6)
# default theme, unless otherwise noted
theme_set(theme_minimal())
df<- read_delim("/Users/matthewjobe/Downloads/quasi_winshares.csv", delim = ",")
## Rows: 98796 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): name_common, player_ID, team_ID, lg_ID, def_pos, franch_id, prev_fr...
## dbl (8): age, year_ID, pct_PT, WAR162, quasi_ws, stint_ID, year_acq, year_left
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
df |>
filter(year_ID>2004 & year_ID<2019)|>
mutate(def_pos = ifelse(str_detect(def_pos, ","), "UTL", def_pos))|>
group_by(def_pos)|> #Grouping based on defensive position
summarize(count1=n(),#count of player by year
mean(WAR162))
## # A tibble: 11 × 3
## def_pos count1 `mean(WAR162)`
## <chr> <int> <dbl>
## 1 1B 636 1.14
## 2 2B 366 1.63
## 3 3B 340 1.86
## 4 C 1125 0.437
## 5 CF 371 2.08
## 6 LF 297 1.08
## 7 P 10230 0.572
## 8 RF 264 1.52
## 9 SS 431 1.85
## 10 UTL 5676 0.597
## 11 <NA> 114 0.136
df <- df |>
mutate(def_pos = ifelse(str_detect(def_pos, ","), "UTL", def_pos))
df <- df |> drop_na() #getting ride of rows with missing values
# Check if changes were applied
df |> filter(def_pos == "UTL") |> head()
## # A tibble: 6 × 16
## name_common age player_ID year_ID team_ID lg_ID pct_PT WAR162 def_pos
## <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 Ketel Marte 25 marteke01 2019 ARI NL 6.19 7.16 UTL
## 2 Eduardo Escobar 30 escobed01 2019 ARI NL 6.76 4.03 UTL
## 3 Carson Kelly 24 kellyca02 2019 ARI NL 3.56 1.90 UTL
## 4 Jarrod Dyson 34 dysonja01 2019 ARI NL 4.39 1.21 UTL
## 5 Adam Jones 33 jonesad01 2019 ARI NL 5.16 -0.107 UTL
## 6 Alex Avila 32 avilaal01 2019 ARI NL 1.94 1.21 UTL
## # ℹ 7 more variables: quasi_ws <dbl>, stint_ID <dbl>, franch_id <chr>,
## # prev_franch <chr>, year_acq <dbl>, year_left <dbl>, next_franch <chr>
df_new <- df |> filter(year_ID > 2004)
df_new|>
ggplot(mapping = aes(x = pct_PT, y = WAR162)) +
geom_point(size = 2) +
geom_smooth(method = "lm", se = FALSE, color = 'darkblue') +
labs(title = "Linear Relationship Between Playing Time and WAR162",
x = "Percent of Playing Time",
y = "WAR162") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
model <- lm(WAR162 ~ pct_PT, data = df_new)
summary(model)
##
## Call:
## lm(formula = WAR162 ~ pct_PT, data = df_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2702 -0.4408 0.0741 0.4078 7.9911
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.443377 0.010095 -43.92 <2e-16 ***
## pct_PT 0.543718 0.003525 154.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9963 on 21285 degrees of freedom
## Multiple R-squared: 0.5279, Adjusted R-squared: 0.5278
## F-statistic: 2.38e+04 on 1 and 21285 DF, p-value: < 2.2e-16
In last weeks data dive I built a linear regression model WAR162 (wins above a replacement level player for the season) and share of playing time. There did appear to be a linear association between these two variables. There is an estimated slope of 0.544, which means that for every 1 percent point in increase in pct_PT, WAR162 increases by 0.544. The small p-value also showed me that pct_PT (playing time) has a statistically significant relationship with WAR162. Additionally, we can see that Multiple R-Squared value is about 0.53. This means that approximately 53% of the variance in WAR162 is explained by pct_PT.
model1 <- lm(WAR162 ~ pct_PT + age, data = df_new)
summary(model1)
##
## Call:
## lm(formula = WAR162 ~ pct_PT + age, data = df_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1937 -0.4440 0.0713 0.4046 7.9020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.206685 0.047476 4.353 0.0000135 ***
## pct_PT 0.547605 0.003519 155.594 < 2e-16 ***
## age -0.023318 0.001664 -14.010 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9918 on 21284 degrees of freedom
## Multiple R-squared: 0.5322, Adjusted R-squared: 0.5321
## F-statistic: 1.211e+04 on 2 and 21284 DF, p-value: < 2.2e-16
tidy(model1, conf.int = TRUE)
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.207 0.0475 4.35 1.35e- 5 0.114 0.300
## 2 pct_PT 0.548 0.00352 156. 0 0.541 0.555
## 3 age -0.0233 0.00166 -14.0 2.13e-44 -0.0266 -0.0201
Above, I added the age variable into my regression model. Age is an appropriate variable to include because as players get older their performance tends to decline
By adding the age variable my Adjusted went from 0.5278 up to 0.5321. I also get a very large F-Statistic (12,110) and very small p-value which would lead me to reject the null hypothesis that share of playing time and age have no effect on WAR162.
model2 <- lm(WAR162 ~ pct_PT + age +
quasi_ws + pct_PT:quasi_ws
, data = df_new)
summary(model2)
##
## Call:
## lm(formula = WAR162 ~ pct_PT + age + quasi_ws + pct_PT:quasi_ws,
## data = df_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.52015 -0.09413 0.01126 0.11382 0.78169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0119491 0.0086264 -1.385 0.166
## pct_PT -0.4484316 0.0014134 -317.280 < 2e-16 ***
## age -0.0018563 0.0003031 -6.125 0.000000000922 ***
## quasi_ws 0.3386023 0.0007877 429.857 < 2e-16 ***
## pct_PT:quasi_ws -0.0012449 0.0001197 -10.400 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1797 on 21282 degrees of freedom
## Multiple R-squared: 0.9846, Adjusted R-squared: 0.9846
## F-statistic: 3.411e+05 on 4 and 21282 DF, p-value: < 2.2e-16
tidy(model2, conf.int = TRUE)
## # A tibble: 5 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -0.0119 0.00863 -1.39 1.66e- 1 -0.0289 0.00496
## 2 pct_PT -0.448 0.00141 -317. 0 -0.451 -0.446
## 3 age -0.00186 0.000303 -6.13 9.22e-10 -0.00245 -0.00126
## 4 quasi_ws 0.339 0.000788 430. 0 0.337 0.340
## 5 pct_PT:quasi_ws -0.00124 0.000120 -10.4 2.85e-25 -0.00148 -0.00101
Above, I added the quasi_ws (Quasi Win Shares) column, which is a measure of player value based upon their contribution to winning. This makes sense because it captures a player’s impact beyond share of playing time. However, there may be a problem with multicollinearity because WAR162 and Quasi Win Share both measure a players value.
The coefficient for pct_PT is negative, which is unexpected. This could be due to multicollinearity, as pct_PT and quasi_win_share are likely correlated, leading to an unstable estimate. It tells us that for each 1 unit increase in playing time, WAR162 decreases by -0.4484.
Age also has a small, but negative and significant effect on WAR162 which was expected. Out interaction term was share of playing time: Quasi Win Share. Rather than assuming Quasi Win Share has a fixed effect on WAR162, the interaction term allows the effect to change based upon the share of playing time. Surprisingly, we got a coefficient of -0.0012 with a statistically significant p-value. This means that as playing time increases the effect of quasi win share on WAR162 slightly decreases. For each 1 unit increase in age, WAR162 decreases by 0.0019.
Quasi Win Share has a strong and significant effect on WAR162, and for every 1 increment increase in Quasi Win Share, WAR162 increases by .34 which is a lot. The problem is that it could be driven by multicollinearity rather than a truly independent relationship.
I got an Adjusted R Squared Value of .9846 which means that 98.46% of the variance in WAR162 is explained by this model. The Residual Standard Error was 0.1797, indicating that the actual difference between the anticipated and actual WAR162 is about 0.18 (pretty low).
#vif(model2)
#looking for multicollinearity
Above we can see the VIF for Quasi Win Share and the interaction term (playing time and Quasi Win Share) are very high, which shows there is multicollinearity. This suggests Quasi Win Share is highly correlated with one or more other variables, and the interaction term is strongly correlated with other predictors. This will likely throw off my regression by overfitting, increasing standard error, and causing unstable coefficients.
gg_resfitted(model2) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
This plot actually looks pretty good. There is no obvious signs of fanning, but there may be some outliers. This indicates that the assumptions are reasonably met.
gg_reshist(model2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The histogram of residuals above appears to be fairly normally distributed. This means that it meets the assumption of normality is reasonably met.
# the normal QQ plot
gg_qqplot(model2)
The QQ-Plot above checks if the assumption of normally distributed residuals. We can see some deviation on both ends, which indicates that there might be some mild skewness and the distribution could be non-normal.
ggplot(data = df_new) +
geom_point(mapping = aes(x=pct_PT, y=WAR162))
The visual above is checking to see if there is a linear relationship between WAR162 and playing time. Based on what I can see, there does appear to be a linear relationship. This means that as share of playing time increases, wins above replacement also increases. The visual tells me that a linear model is appropriate.
# take a look at the docs for `threshold`
gg_cooksd(model2, threshold = 'matlab')
Above, we created a Cook’s distance plot which is meant to show observations that have high influence on the regression model. Many of the points above may look very infleuential, but the highest Cook’s distnance is about 0.008 which is very low. While these points may have some influence, I do not believe it is very strong due to how large the data frame is.