Week6_DataDive

Week 6 Data Dive

Choose two numeric variables, and pair each one with a column you built (i.e., calculated based on others)
- So, you should have two pairs of columns (1 original column, and 1 created/“mutated” column)
- All variables for this data dive should be either continuous (i.e., numeric) or ordered (e.g., ['small', 'medium', 'large'] is okay, but ["apples", "oranges", "bananas"] is not)
- At least one pair should be a response variable and an explanatory variable

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)


df<- read_delim("/Users/matthewjobe/Downloads/quasi_winshares.csv", delim = ",")

## Rows: 98796 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): name_common, player_ID, team_ID, lg_ID, def_pos, franch_id, prev_fr...
## dbl (8): age, year_ID, pct_PT, WAR162, quasi_ws, stint_ID, year_acq, year_left
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

df <- df %>%
  mutate(efficiency = WAR162 / pct_PT)
head(df)

## # A tibble: 6 × 17
##   name_common        age player_ID year_ID team_ID lg_ID pct_PT WAR162 def_pos  
##   <chr>            <dbl> <chr>       <dbl> <chr>   <chr>  <dbl>  <dbl> <chr>    
## 1 Ketel Marte         25 marteke01    2019 ARI     NL      6.19   7.16 CF, 2B, …
## 2 Zack Greinke        35 greinza01    2019 ARI     NL      4.11   5.02 P        
## 3 Eduardo Escobar     30 escobed01    2019 ARI     NL      6.76   4.03 3B, 2B   
## 4 Nick Ahmed          29 ahmedni01    2019 ARI     NL      6.04   3.75 SS       
## 5 Christian Walker    28 walkech02    2019 ARI     NL      5.83   2.19 1B       
## 6 Carson Kelly        24 kellyca02    2019 ARI     NL      3.56   1.90 C, 3B    
## # ℹ 8 more variables: quasi_ws <dbl>, stint_ID <dbl>, franch_id <chr>,
## #   prev_franch <chr>, year_acq <dbl>, year_left <dbl>, next_franch <chr>,
## #   efficiency <dbl>

In the code above, I created a new column called “efficiency” by dividing WAR162 by pct_PT. pct_PT is a player's share of total team playing time (measured by plate appearances and leverage-weighted innings), and WAR162 is a player's wins above a replacement-level player, scaled to a 162-game season. The new column therefore measures how productive a player is per unit of playing time, which makes it useful for highlighting high-impact players even when their playing time was low.
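As a quick arithmetic check of this definition, the sketch below recomputes efficiency for the first three rows printed by head(df). It uses the rounded values shown in that output rather than the full-precision data, and the data frame name `toy` is hypothetical, so the results are approximate. Base R is used so the snippet runs without the tidyverse.

```r
# Rounded values copied from the head(df) output above (not full precision).
toy <- data.frame(
  name_common = c("Ketel Marte", "Zack Greinke", "Eduardo Escobar"),
  pct_PT      = c(6.19, 4.11, 6.76),
  WAR162      = c(7.16, 5.02, 4.03)
)

# Efficiency = wins above replacement per unit of playing-time share.
toy$efficiency <- toy$WAR162 / toy$pct_PT

print(round(toy$efficiency, 2))
## [1] 1.16 1.22 0.60
```

In this small sample, Zack Greinke has a lower WAR162 than Ketel Marte but a slightly higher efficiency, because he reached that WAR162 with a smaller share of playing time.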

Plotting Visualizations

For the first visualization, I look for a relationship between efficiency and Quasi Winshare (three times total wins created per 162 games; generated by adding WAR162 to wins below replacement (determined by playing time) and rounding to the nearest whole number). Because the question is how a player's efficiency affects their Quasi Winshare, efficiency is the explanatory variable and Quasi Winshare is the response variable.

df|>
  filter(year_ID>=2004)|> # filter from 2004 because WAR162 was not used until this year
  ggplot()+
  geom_point(mapping=aes(x=efficiency, y= quasi_ws), color= 'darkred')+
  labs(title="Efficiency vs. Quasi Winshare",
       x= 'Efficiency', y= 'Quasi Winshare')+
  theme_minimal()

## Warning: Removed 35 rows containing missing values or values outside the scale range
## (`geom_point()`).

df |>
  filter(year_ID >= 2004 ) |> 
  ggplot()+ 
  geom_boxplot(mapping=aes(x=efficiency,y=""))+
  labs(title="Efficiency",
       x="Efficiency", y="") +
  theme_classic()

## Warning: Removed 42 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The first visualization above is quite interesting. Players with very low efficiency have much lower Quasi Winshare, there is a large spike for efficiency values from about 0 to 3, and Quasi Winshare falls again for efficiency above 3. I was expecting much more of a positive correlation, where higher efficiency would go with higher Quasi Winshare, meaning that more efficient players contribute more wins to their team.

The visualizations also show several outliers in the efficiency column. The boxplot shows that, apart from those outliers, there is very little variability in efficiency: most values are clustered tightly around the median. One thing we can tentatively conclude is that there seems to be a sweet spot for efficiency: players with an efficiency between about 0 and 3 tend to have much higher Quasi Winshare values.
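To make the notion of an outlier concrete: by default, geom_boxplot() draws as individual points any values that lie more than 1.5 times the interquartile range beyond the box. The sketch below applies that rule by hand to a small made-up vector (`eff` is hypothetical, not the baseball data). It uses quantile() for the quartiles, which can differ slightly from the hinges ggplot2 computes.

```r
# Hypothetical efficiency values: a tight cluster plus two extreme points.
eff <- c(-0.4, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.5, 6.0, -8.0)

# Quartiles and interquartile range.
q1  <- quantile(eff, 0.25, names = FALSE)
q3  <- quantile(eff, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are the ones a boxplot would flag.
outliers <- eff[eff < lower_fence | eff > upper_fence]
print(outliers)
```

Here only the two extreme values fall outside the fences, which mirrors the pattern in the boxplot: a narrow box with isolated points far from it.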

df|>
  filter(year_ID>=2004)|> # filter from 2004 because WAR162 was not used until this year
  ggplot()+
  geom_point(mapping=aes(x=year_ID, y=efficiency ), color= 'darkred')+
  labs(title="Efficiency vs. Year",
       x= 'Year', y= 'Efficiency')+
  theme_minimal()

## Warning: Removed 35 rows containing missing values or values outside the scale range
## (`geom_point()`).

In the plot above, I am examining how efficiency changes from year to year. In this case, year serves as the explanatory variable, while efficiency is the response variable, as year is independent of efficiency, and we are analyzing how efficiency fluctuates over time. Based on this visualization, efficiency changes only slightly from year to year, with values generally dispersed between approximately -8 and 5. Additionally, each year appears to contain outliers, representing instances of exceptionally high or low efficiency.
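One way to check the "changes only slightly from year to year" reading numerically is to summarise efficiency within each season. The sketch below does this on a tiny made-up data frame (`toy` and its values are hypothetical). It uses base R's tapply(), though the same summary could be written with group_by() and summarise().

```r
# Hypothetical toy data standing in for the filtered data frame.
toy <- data.frame(
  year_ID    = c(2004, 2004, 2004, 2005, 2005, 2005),
  efficiency = c(-0.5, 0.1, 0.6, -0.3, 0.2, 0.4)
)

# Median and interquartile range of efficiency within each season.
yearly_median <- tapply(toy$efficiency, toy$year_ID, median)
yearly_iqr    <- tapply(toy$efficiency, toy$year_ID, IQR)

print(yearly_median)
print(yearly_iqr)
```

If the per-year medians and IQRs stay close to one another across seasons, that supports the visual impression from the scatterplot.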

Correlation Coefficient

Division by zero produces non-finite values in R (Inf, -Inf, or NaN), and missing inputs produce NA, so I needed to remove these values from the efficiency column before calculating the correlation coefficient. There were four such values that had to be excluded.
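The following minimal sketch shows how these values arise and how is.finite() screens them out. The vector `eff` is made up for illustration.

```r
# How non-finite values arise from division.
print(1 / 0)    # Inf
print(-1 / 0)   # -Inf
print(0 / 0)    # NaN
print(NA / 2)   # NA

# Hypothetical efficiency vector containing each kind of problem value.
eff <- c(1.2, Inf, -Inf, NaN, NA, -0.3)

# is.finite() is FALSE for Inf, -Inf, NaN, and NA alike.
keep <- is.finite(eff)
kept <- eff[keep]
print(kept)
```

Because is.finite() rejects all four kinds of value at once, a single filter on it is enough before calling cor().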

After cleaning the data, the correlation coefficient between efficiency and Quasi Winshare was 0.37, indicating a moderate positive correlation. Meanwhile, the correlation coefficient between year and efficiency rounded to 0, suggesting no linear relationship.

This means that higher efficiency tends to be associated with moderately higher Quasi Winshare, but the relationship is not particularly strong. On the other hand, efficiency shows no linear trend over time, as there is no meaningful correlation with year.

Based on the visualizations, these results align with expectations. However, I was somewhat surprised that the correlation between Quasi Winshare and efficiency was as high as 0.37, as the scatterplot did not appear to show much of a relationship.

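# Drop rows with an NA in any column (this checks every column, not only the ones used below)
df_clean <- na.omit(df)

# Keep 2004 onward and drop non-finite efficiency values (Inf, -Inf, NaN)
filtered_data <- df_clean |> filter(year_ID >= 2004 & is.finite(efficiency))

# Pearson correlations, rounded to two decimals
quasi_ws_correlation <- round(cor(filtered_data$quasi_ws, filtered_data$efficiency), 2)
year_id_correlation <- round(cor(filtered_data$year_ID, filtered_data$efficiency), 2)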

print(quasi_ws_correlation)

## [1] 0.37

print(year_id_correlation)

## [1] 0
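One possible reason a correlation coefficient and a scatterplot can seem to disagree is that cor() computes Pearson's r by default, which only captures the linear part of a relationship. The sketch below uses synthetic data (not the baseball data; the seed, sample size, and variable names are arbitrary assumptions) with a strong hump-shaped relationship, for which Pearson's r is close to zero.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Synthetic hump-shaped relationship: y peaks near x = 0 and falls off on both sides.
x <- runif(500, min = -3, max = 3)
y <- -x^2 + rnorm(500, sd = 0.5)

# Pearson's r between x and y: near zero despite the clear pattern.
pearson_xy <- cor(x, y)

# Correlating y with x^2 instead recovers the strong (negative) relationship.
pearson_x2y <- cor(x^2, y)

print(round(pearson_xy, 2))
print(round(pearson_x2y, 2))
```

A near-zero or moderate r therefore does not rule out a strong non-linear pattern, so it is worth reading the coefficient alongside the plot rather than in place of it.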

Confidence Intervals

With the code below, I found a 95% bootstrap confidence interval for mean efficiency, which is a range of plausible values for the true population mean. Based on the output shown, I am 95% confident that the true average efficiency is between -0.0266 and 0.0035 (the bounds shift slightly between runs because the bootstrap resamples at random). The interval is narrow and includes zero. This is an interesting insight because, as someone who follows baseball, I would have originally thought this range would be higher.

These results pose two questions:

- Is there truly a lack of strong variation in efficiency, or are players simply that similar?
- Would breaking this dataset down by league, position, or team reveal different trends?

efficiency_2004 <-
  filtered_data |>
    pluck("efficiency")

ggplot() +
  geom_histogram(mapping = aes(x = efficiency_2004),
                 colour='white') +
  theme_minimal()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

library(boot)

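# Percentile bootstrap confidence interval for a statistic `func` of the vector `v`
boot_ci <- function (v, func = median, conf = 0.95, n_iter = 1000) {
  
  # boot() passes the data and a vector of resampled indices to the statistic
  boot_func <- \(x, i) func(x[i], na.rm=TRUE)
  
  # draw n_iter bootstrap replicates
  b <- boot(v, boot_func, R = n_iter)
  
  # percentile interval at the requested confidence level
  boot.ci(b, conf = conf, type = "perc")
}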

boot_ci(efficiency_2004, mean, 0.95)

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = b, conf = conf, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   (-0.0266,  0.0035 )  
## Calculations and Intervals on Original Scale
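To show what the percentile bootstrap is doing, and why its bounds change from run to run, here is a simplified base-R sketch that does not use the boot package. The function name `percentile_ci`, the seeds, and the simulated vector `v` are assumptions for illustration. boot.ci() interpolates its percentiles slightly differently, so the numbers would not match it exactly.

```r
# Simplified percentile bootstrap for the mean (illustrative, not boot.ci()).
percentile_ci <- function(v, n_iter = 1000, conf = 0.95) {
  # Resample with replacement and record the mean of each resample.
  boot_means <- replicate(n_iter, mean(sample(v, size = length(v), replace = TRUE)))
  # Keep the middle `conf` share of the bootstrap means.
  alpha <- (1 - conf) / 2
  quantile(boot_means, probs = c(alpha, 1 - alpha), names = FALSE)
}

set.seed(1)                          # arbitrary seed for the stand-in data
v <- rnorm(200, mean = 0, sd = 1)    # hypothetical stand-in for efficiency_2004

set.seed(123)                        # fixing the seed makes the interval repeatable
ci_a <- percentile_ci(v)
set.seed(123)
ci_b <- percentile_ci(v)

print(ci_a)
print(identical(ci_a, ci_b))
```

Calling set.seed() immediately before the bootstrap is one way to keep the interval quoted in the text in step with the interval printed in the output.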

2025-02-25
