Week7_DataDive

Week 7 Data Dive

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)
library(tidyverse)
library(ggthemes)
library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.3.3

library(effsize)
library(pwrss)

## 
## Attaching package: 'pwrss'
## 
## The following object is masked from 'package:stats':
## 
##     power.t.test

df<- read_delim("/Users/matthewjobe/Downloads/quasi_winshares.csv", delim = ",")

## Rows: 98796 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): name_common, player_ID, team_ID, lg_ID, def_pos, franch_id, prev_fr...
## dbl (8): age, year_ID, pct_PT, WAR162, quasi_ws, stint_ID, year_acq, year_left
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#Creaing new column effiency
df <- df %>%
  mutate(efficiency= WAR162/pct_PT
  )
head(df)

## # A tibble: 6 × 17
##   name_common        age player_ID year_ID team_ID lg_ID pct_PT WAR162 def_pos  
##   <chr>            <dbl> <chr>       <dbl> <chr>   <chr>  <dbl>  <dbl> <chr>    
## 1 Ketel Marte         25 marteke01    2019 ARI     NL      6.19   7.16 CF, 2B, …
## 2 Zack Greinke        35 greinza01    2019 ARI     NL      4.11   5.02 P        
## 3 Eduardo Escobar     30 escobed01    2019 ARI     NL      6.76   4.03 3B, 2B   
## 4 Nick Ahmed          29 ahmedni01    2019 ARI     NL      6.04   3.75 SS       
## 5 Christian Walker    28 walkech02    2019 ARI     NL      5.83   2.19 1B       
## 6 Carson Kelly        24 kellyca02    2019 ARI     NL      3.56   1.90 C, 3B    
## # ℹ 8 more variables: quasi_ws <dbl>, stint_ID <dbl>, franch_id <chr>,
## #   prev_franch <chr>, year_acq <dbl>, year_left <dbl>, next_franch <chr>,
## #   efficiency <dbl>

Above I created a new column called “efficiency”, which captures how much value (WAR162) a player provides relative to their share of playing time. However since there can be zero division we may have NaN values, which I must remove.

library(dplyr) #Changing all instances where a player has multiple positions, to "UTL"
library(stringr) #UTL stands for utility player



df <- df %>%
  mutate(utility = ifelse(grepl(",", def_pos), 1, 0))

# Print the updated dataframe
print(df)

## # A tibble: 98,796 × 18
##    name_common        age player_ID year_ID team_ID lg_ID pct_PT WAR162 def_pos 
##    <chr>            <dbl> <chr>       <dbl> <chr>   <chr>  <dbl>  <dbl> <chr>   
##  1 Ketel Marte         25 marteke01    2019 ARI     NL      6.19   7.16 CF, 2B,…
##  2 Zack Greinke        35 greinza01    2019 ARI     NL      4.11   5.02 P       
##  3 Eduardo Escobar     30 escobed01    2019 ARI     NL      6.76   4.03 3B, 2B  
##  4 Nick Ahmed          29 ahmedni01    2019 ARI     NL      6.04   3.75 SS      
##  5 Christian Walker    28 walkech02    2019 ARI     NL      5.83   2.19 1B      
##  6 Carson Kelly        24 kellyca02    2019 ARI     NL      3.56   1.90 C, 3B   
##  7 David Peralta       31 peralda01    2019 ARI     NL      4.17   1.79 LF      
##  8 Robbie Ray          27 rayro02      2019 ARI     NL      4.69   1.59 P       
##  9 Merrill Kelly       30 kellyme01    2019 ARI     NL      5.13   1.21 P       
## 10 Jarrod Dyson        34 dysonja01    2019 ARI     NL      4.39   1.21 CF, RF,…
## # ℹ 98,786 more rows
## # ℹ 9 more variables: quasi_ws <dbl>, stint_ID <dbl>, franch_id <chr>,
## #   prev_franch <chr>, year_acq <dbl>, year_left <dbl>, next_franch <chr>,
## #   efficiency <dbl>, utility <dbl>

head(df)

## # A tibble: 6 × 18
##   name_common        age player_ID year_ID team_ID lg_ID pct_PT WAR162 def_pos  
##   <chr>            <dbl> <chr>       <dbl> <chr>   <chr>  <dbl>  <dbl> <chr>    
## 1 Ketel Marte         25 marteke01    2019 ARI     NL      6.19   7.16 CF, 2B, …
## 2 Zack Greinke        35 greinza01    2019 ARI     NL      4.11   5.02 P        
## 3 Eduardo Escobar     30 escobed01    2019 ARI     NL      6.76   4.03 3B, 2B   
## 4 Nick Ahmed          29 ahmedni01    2019 ARI     NL      6.04   3.75 SS       
## 5 Christian Walker    28 walkech02    2019 ARI     NL      5.83   2.19 1B       
## 6 Carson Kelly        24 kellyca02    2019 ARI     NL      3.56   1.90 C, 3B    
## # ℹ 9 more variables: quasi_ws <dbl>, stint_ID <dbl>, franch_id <chr>,
## #   prev_franch <chr>, year_acq <dbl>, year_left <dbl>, next_franch <chr>,
## #   efficiency <dbl>, utility <dbl>

In the code above, I made a new column called “UTL”. In baseball when a player has two or more positions, they are referred to as a utility player. If they are a utility player there will be a 1 in the new column, if not there will be a 0.

# Remove rows with NA values
df_clean <- na.omit(df)

# Filter the dataframe for years greater than 2004 and are finite
filtered_data <- df_clean |> filter(year_ID > 2004 & is.finite(efficiency))
filtered_data

## # A tibble: 21,256 × 18
##    name_common        age player_ID year_ID team_ID lg_ID pct_PT WAR162 def_pos 
##    <chr>            <dbl> <chr>       <dbl> <chr>   <chr>  <dbl>  <dbl> <chr>   
##  1 Ketel Marte         25 marteke01    2019 ARI     NL      6.19   7.16 CF, 2B,…
##  2 Zack Greinke        35 greinza01    2019 ARI     NL      4.11   5.02 P       
##  3 Eduardo Escobar     30 escobed01    2019 ARI     NL      6.76   4.03 3B, 2B  
##  4 Nick Ahmed          29 ahmedni01    2019 ARI     NL      6.04   3.75 SS      
##  5 Christian Walker    28 walkech02    2019 ARI     NL      5.83   2.19 1B      
##  6 Carson Kelly        24 kellyca02    2019 ARI     NL      3.56   1.90 C, 3B   
##  7 David Peralta       31 peralda01    2019 ARI     NL      4.17   1.79 LF      
##  8 Robbie Ray          27 rayro02      2019 ARI     NL      4.69   1.59 P       
##  9 Merrill Kelly       30 kellyme01    2019 ARI     NL      5.13   1.21 P       
## 10 Jarrod Dyson        34 dysonja01    2019 ARI     NL      4.39   1.21 CF, RF,…
## # ℹ 21,246 more rows
## # ℹ 9 more variables: quasi_ws <dbl>, stint_ID <dbl>, franch_id <chr>,
## #   prev_franch <chr>, year_acq <dbl>, year_left <dbl>, next_franch <chr>,
## #   efficiency <dbl>, utility <dbl>

Above, I removed all NULL values as it would hinder analysis, luckily there were only 4 out the 10,000+ rows in my new data frame.

Hypotheses

The average WAR162 of Utility players (players with multiple defensive positions) and single position players is the same.

The average Efficiency of players in the American League and National League is the same.

bootstrap <- function (x, func=mean, n_iter=10^4) {
  # empty vector to be filled with values from each iteration
  func_values <- c(NULL)

  # we simulate sampling `n_iter` times
  for (i in 1:n_iter) {
    # pull the sample (a vector)
    x_sample <- sample(x, size = length(x), replace = TRUE)

    # add on this iteration's value to the collection
    func_values <- c(func_values, func(x_sample))
  }

  return(func_values)
}

#Calcuating avgs for sample and finding difference
avgs_utl <- filtered_data |> 
  filter(utility == 1) |>
  pluck("WAR162") |>
  bootstrap(n_iter = 100)

avgs_po <- filtered_data |>
  filter(utility == 0) |>
  pluck("WAR162") |>
  bootstrap(n_iter = 100)

diffs_in_avg <- avgs_utl - avgs_po

#Finding the average WAR162 for each group, from data set
avg_war162s <- filtered_data|>
  group_by(utility) |>
  summarize(avg_war162 = mean(WAR162)) |>
  arrange(utility)

avg_war162s

## # A tibble: 2 × 2
##   utility avg_war162
##     <dbl>      <dbl>
## 1       0      0.744
## 2       1      0.607

#Findiing the observed difference in mean WAR162
observed_difference <- (avg_war162s$avg_war162[2] -
                  avg_war162s$avg_war162[1])
paste("Observed Difference: ", observed_difference)

## [1] "Observed Difference:  -0.13751077088804"

Above we found that the observed difference between the WAR162 of utility players and single-position players is -.138 which does indicate a difference.

ggplot() +
  geom_function(xlim = c(-1, 1), 
                fun = function(x) dnorm(x, mean = 0, 
                                        sd = sd(diffs_in_avg
                                                
                                                ))) +
  geom_vline(mapping = aes(xintercept = observed_difference,
                           color = paste("observed: ",
                                         (observed_difference)))) +
  labs(title = "Bootstrapped Sampling Distribution of WAR162 Differences",
       x = "Difference in WAR162",
       y = "Probability Density",
       color = "") +
  scale_x_continuous(breaks = seq(-1, 1, .25)) +
  theme_minimal()

Above, we can the see the observed difference in WAR162 means between utility and single position players is -0.138. Based upon the visualization, it seems like this difference is going to be significant as it out sample is far outside the expected difference.

cohen.d(d = filter(filtered_data, utility == 0) |> pluck("WAR162"),
        f = filter(filtered_data, utility == 1) |> pluck("WAR162"))

## 
## Cohen's d
## 
## d estimate: 0.09487229 (negligible)
## 95 percent confidence interval:
##      lower      upper 
## 0.06519873 0.12454585

Above we calculate Cohen’s d, which is a measure of effect size that quantifies the difference between two groups in terms of standard deviations. We can see that the is a negligible affect, which warrants hypothesis testing.

filtered_data |>
  group_by(utility) |>
  summarize(sd = sd(WAR162),
            mean = mean(WAR162))

## # A tibble: 2 × 3
##   utility    sd  mean
##     <dbl> <dbl> <dbl>
## 1       0  1.46 0.744
## 2       1  1.42 0.607

Above I found the mean WAR162 for each group, and their standard deviations. The standard deviation shows how spread the values in the data set are from the mean. Since the standard deviations are similar we can use the whole data set to approximate the overall value.

#Finding the mean WAR162 for the data set
my_mean=mean(filtered_data$WAR162, na.rm = TRUE)
my_mean

## [1] 0.7045695

test1 <- pwrss.t.2means(mu1 = .70457, 
                       sd1 = sd(pluck(filtered_data, "WAR162")),
                       kappa = 1,
                       power = .95, alpha = 0.08, 
                       alternative = "not equal")

##  Difference between Two means 
##  (Independent Samples t Test) 
##  H0: mu1 = mu2 
##  HA: mu1 != mu2 
##  ------------------------------ 
##   Statistical power = 0.95 
##   n1 = 99 
##   n2 = 99 
##  ------------------------------ 
##  Alternative = "not equal" 
##  Degrees of freedom = 196 
##  Non-centrality parameter = 3.417 
##  Type I error rate = 0.08 
##  Type II error rate = 0.05

plot(test1)

## Warning in qt(1 - prob.extreme, df = df, ncp = ncp, lower.tail = TRUE): full
## precision may not have been achieved in 'pnt{final}'

sample_size <- filtered_data %>%
  group_by(utility) %>%
  summarise(n = n())

print(sample_size)

## # A tibble: 2 × 2
##   utility     n
##     <dbl> <int>
## 1       0 15114
## 2       1  6142

Here we can see we need at least data on 99 players for each group for our hypothesis test to maintain the strength defined. We have 15,114 single position players, and 6142 utility players so we have sufficient enough data to perform hypothesis testing.

My alternative hypothesis is that the mean WAR162 of utility players is different than that of single position players. I am choosing an alpha (false positive rate) value of .08 because I am willing to take an 8% of making the mistake of thinking utility players have significantly different WAR162 than single-position players when they do not. Utility players tend to be less expensive, and if they do not produce better results they can become a backup for many different positions. This test will also have a (\(\beta\)) value of .05, which gives me a 5% chance of failing to detect a real difference in WAR162 between utility and single-position players when there is one. It could save teams money to pay a utility player rather than a single position player, which is worth the risk for less wealthy teams. If there is a .5 difference between utility and single position players, teams we will want to either stay away from or go after utility players \(\Delta = 0.05\).

f_sampling <- function(x) dnorm(x, mean = 0, 
                                sd = sd(diffs_in_avg))

ggplot() +
  stat_function(mapping = aes(fill = 'more extreme samples'),
                fun = f_sampling, 
                xlim = c(observed_difference, .25),
                geom = "area") +
  stat_function(mapping = aes(fill = 'more extreme samples'),
                fun = f_sampling, 
                xlim = c(-.25, -observed_difference),
                geom = "area") +
  geom_function(xlim = c(-.25, .25), 
                fun = f_sampling) +
  geom_vline(mapping = aes(xintercept = observed_difference,
                           color = paste("observed: ",
                                         (observed_difference)))) +
  labs(title = "Bootstrapped Sampling Distribution of WAR162",
       x = "Difference in WAR162 Calculated",
       y = "Probability Density",
       color = "",
       fill = "") +
  scale_x_continuous(breaks = seq(-.25, .25, .05)) +
  scale_fill_manual(values = 'lightblue') +
  theme_minimal()

# "demean" the bootstrapped samples to simulate mu = 0
diffs_in_avgs_de <- diffs_in_avg - mean(diffs_in_avg)

# proportion of times the difference is more extreme
paste("p-value ", 
      sum(abs(observed_difference) < abs(diffs_in_avgs_de)) /
        length(diffs_in_avgs_de))

## [1] "p-value  0"

In the visualization, we can see that the observed difference is different than what expected by the null hypothesis. We expected no difference but calculated a difference of -0.138

We also calculated a p-value of 0. This means that these results are very statistically significant, and the probability of getting my observed difference is essentially zero. For this reason, I reject the null hypothesis that the average WAR162 of utility players and single position players is the same because the observed difference in WAR162 between utility players and single-position players is statistically significant. Since the p-value is so small, we conclude that there is enough evidence to suggest that utility players tend to have worse WAR162 stats than single-position players.

One question that I have is if pitchers are causing the p-value to be so low. Pitchers tend to have higher WAR162’s, and are classified as single position players in this data set which might contribute to the observed difference in WAR162 that we saw.

Fisher’s Test for Hypothesis 2

Null Hypothesis: The average Efficiency of players in the American League and National League are the same.

Alternative Hypothesis: The average Efficiency of players in the American League and National League are different.

I will be using a significance level of .08, meaning that I am willing to reject the null hypothesis if the probability of obtaining my observed results (or more extreme results) by random chance is less than 8%. I chose the .08 significance level because it allows for a slightly higher tolerance for Type I errors (false positives), which may be appropriate in exploratory research or when seeking to detect potentially smaller effects.

avgs_al <- filtered_data |>
  filter(lg_ID == 'NL') |>
  pluck("WAR162") |>
  bootstrap(n_iter = 100)

avgs_nl <- filtered_data |>
  filter(lg_ID == 'AL') |>
  pluck("WAR162") |>
  bootstrap(n_iter = 100)

diffs_in_avg2 <- avgs_al - avgs_nl

avg_effs <- filtered_data |>
  group_by(lg_ID) |>
  summarize(avg_eff = mean(efficiency)) |>
  arrange(lg_ID)

avg_effs

## # A tibble: 2 × 2
##   lg_ID  avg_eff
##   <chr>    <dbl>
## 1 AL     0.00624
## 2 NL    -0.0250

observed_difference2 <- (avg_effs$avg_eff[2] -
                  avg_effs$avg_eff[1])
paste("Observed Difference: ", observed_difference2)

## [1] "Observed Difference:  -0.0312531060094961"

The code above gives me the observed difference of the mean efficiency of the American League and National League.

# "demean" the bootstrapped samples to simulate mu = 0
diffs_in_avgs_d2 <- diffs_in_avg2 - mean(diffs_in_avg2)

# proportion of times the difference is more extreme
paste("p-value ", 
      sum(abs(observed_difference2) < abs(diffs_in_avgs_d2)) /
        length(diffs_in_avgs_d2))

## [1] "p-value  0.12"

With the code above, I have calculated a p-value of 0.11. This means that there is 11% chance of observing the difference (or more extreme differences) in the data purely due to random variation, assuming the null hypothesis is correct. With this p-value, I fail to reject the null hypothesis that the average Efficiency of players in the American League and National League are the same. There is not enough statistically significant evidence to support a difference in average efficiency of AL and NL players.

Visual

ggplot() +
  geom_function(xlim = c(-1, 1), 
                fun = function(x) dnorm(x, mean = 0, 
                                        sd = sd(diffs_in_avg2
                                                
                                                ))) +
  geom_vline(mapping = aes(xintercept = observed_difference2,
                           color = paste("observed: ",
                                         (observed_difference2)))) +
  labs(title = "Bootstrapped Sampling Distribution of Efficiency Differences",
       x = "Difference in Efficiency",
       y = "Probability Density",
       color = "") +
  scale_x_continuous(breaks = seq(-1, 1, .25)) +
  theme_minimal()

The visual above shows our observed difference is compared to what I would expect from the null hypothesis. The observed value does not fall very far from the expected value of zero, which further shows why the difference is not statistically significant.

Week7_DataDive

2025-02-25

Week 7 Data Dive

Hypotheses

Fisher’s Test for Hypothesis 2

Visual