Week5_Data

Week 5 Data Dive

3 Columns/Values that are Unclear

quasi_ws column

The quasi_ws column was very unclear to me until I read that documentation, so as someone who has been around baseball statistics I was eager to learn more. The quasi_ws column represents a new statistic called Quasi-Win Share. Quasi-Win Share is meant to measure how many total wins a player contributes to their team over the course of the season. It is found by three taking times total wins created per 162 games; generated by adding WAR162 to wins BELOW replacement (determined by playing time) and rounding to nearest whole number
WAR162

WAR is a statistic that quantifies a player’s total value in terms of wins above a replacement level player for a full season. WAR162 measures the same thing as WAR, but accounts for players who may have missed time due to circumstances such as injury or being cut from the team. WAR162 scales WAR to account for that player would have contributed over the 162 game season.
Values of 0 in the WAR162 Column

In the WAR162 column, there are values of 0 which do not make sense. Some players have a measure playing time and a measure for their Quasi-Win Share but their WAR162 shows a 0. After reading through the documentation and seeing how they were calculating this statistic, it made sense that someone could have performed exactly like “replacement level player”

Elements that are Still Unclear

One element that is unclear is that there some players have two entries for a single year. It is not directly explained by the documentation, but it would be possible that if a player is traded, or cut and picked up by another team, that they could have two entries. One entry would be for their performance with the first team, and the other show performance with the next.

This could be an interesting phenomena that could show how players perform before and after being traded or picked up by another team. However, we cannot know if they were traded, or cut and recruited by another team.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)


df<- read_delim("/Users/matthewjobe/Downloads/quasi_winshares.csv", delim = ",")

## Rows: 98796 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): name_common, player_ID, team_ID, lg_ID, def_pos, franch_id, prev_fr...
## dbl (8): age, year_ID, pct_PT, WAR162, quasi_ws, stint_ID, year_acq, year_left
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

duplicates <- df|>
  filter(year_ID > 2004 & year_ID < 2019) |> 
  group_by(name_common, year_ID) |>
  tally() |>
  filter(n > 1)

# View the results
print(duplicates)

## # A tibble: 1,618 × 3
## # Groups:   name_common [1,203]
##    name_common     year_ID     n
##    <chr>             <dbl> <int>
##  1 A.J. Cole          2018     2
##  2 A.J. Ellis         2016     2
##  3 A.J. Pierzynski    2014     2
##  4 AJ Ramos           2017     2
##  5 Aaron Brooks       2015     2
##  6 Aaron Guiel        2006     2
##  7 Aaron Harang       2013     2
##  8 Aaron Hill         2011     2
##  9 Aaron Hill         2016     2
## 10 Aaron Laffey       2011     2
## # ℹ 1,608 more rows

# Add a new column for duplicate status in the original dataset
df<-df|>
  mutate(duplicate_status = ifelse(paste(name_common, year_ID) %in% paste(duplicates$name_common, duplicates$year_ID), "Duplicate", "Non-Duplicate"))
status_counts <- table(df$duplicate_status)

print(status_counts)

## 
##     Duplicate Non-Duplicate 
##          3318         95478

Below is pie chart to give us a better picture of how many duplicates there are in this data set.

#Chart of Duplicates vs Non Duplicates
status_counts_df <- as.data.frame(status_counts)
ggplot(status_counts_df, aes(x = "", y = Freq, fill = Var1)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Duplicates vs Non-Duplicates", fill = "Status") +
  theme_minimal() +
  theme(axis.text.x = element_blank())

Another element that is unclear is the amount of “NULL” values in the next_franch column.The next_franch column in this data set represents the “Next Franchise” that a player played for. When dealing with this data set, I struggled with the “Null” values in this column. The “NULL” values are not explained at all in the documentation, which worries me much more than the duplicate values metioned earlier. After reading the documentation, this phenomena was not explained. I concluded that a Null value in this column would represent a player who retired, did not move teams, or has yet to sign with another team.

To demonstrate this, I built a histogram that shows the counts of players whose next franchise played for was in the American Leauge or National League, or if it “Null”. First I had to build new column called next_franch_null which represents if the next franchise that they played for was is Null. This column will be called “next_franch_null”. Below, we can see there are 64839 instances where players went to another team and 33957 instances where they either stayed, retired, or were cut.

#Creating a new column
df <-   df |>     mutate(next_franch_null = next_franch == "NULL") 
df |>   count(next_franch_null)

## # A tibble: 2 × 2
##   next_franch_null     n
##   <lgl>            <int>
## 1 FALSE            64839
## 2 TRUE             33957

p1 <- df |> 
  filter(year_ID > 2004 )|>
                                                                     
  group_by(next_franch_null) |>  #group by new column
  summarize(count = n()) |>  # Count NULLS and players who went to new team
  ggplot(mapping = aes(x = next_franch_null, y = count, fill = next_franch_null)) + 
  geom_bar(stat = "identity") +  
  theme_minimal() +  # Minimal theme
  ##scale_fill_viridis(discrete = TRUE) +  
  ggtitle("Count of Players Who Retired, Stayed with Team, or Were Cut") +  # Title
  xlab("Next Franchise is NULL") +  # X-axis label
  ylab("Count of Playes") +  
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  

p1

The problem with the graph above, is that we do not know why a player’s next_franch would be “NULL”. There are three possibilities but no way of knowing which is correct. Additionally, this makes it difficult to get an understanding of which players a team is keeping because we cannot see if they stayed or left.

The next graph visualizes if a players who are in the NL or AL tend to stay in that league, go to a different league, or have the “NULL” value based on the next_franch column. First I must create a new column called “next_leaugue” which represents the next league that players went to.

al_teams <- c("NYY", "BOS", "HOU", "OAK", "SEA", "MIN", "CHW", "CLE", "DET", "KCR", "ANA", "LAA", "TEX", "TOR", "TBD","TBR", "BAL")  
nl_teams <- c("LAD", "SFG", "SDP", "COL", "ARI", "CHC", "STL", "MIL", "CIN", "PIT", "ATL", "NYM", "PHI", "WSN", "MIA", "FLA")  

# Create a new column labeling as AL, NL, or NULL
df <- df %>%
  mutate(next_league = case_when(
    next_franch %in% al_teams ~ "AL",
    next_franch %in% nl_teams ~ "NL",
    is.na(next_franch) | next_franch == "NULL" ~ "NULL",
    TRUE ~ "Unknown" # In case there are other values
  ))
head(df)

## # A tibble: 6 × 19
##   name_common        age player_ID year_ID team_ID lg_ID pct_PT WAR162 def_pos  
##   <chr>            <dbl> <chr>       <dbl> <chr>   <chr>  <dbl>  <dbl> <chr>    
## 1 Ketel Marte         25 marteke01    2019 ARI     NL      6.19   7.16 CF, 2B, …
## 2 Zack Greinke        35 greinza01    2019 ARI     NL      4.11   5.02 P        
## 3 Eduardo Escobar     30 escobed01    2019 ARI     NL      6.76   4.03 3B, 2B   
## 4 Nick Ahmed          29 ahmedni01    2019 ARI     NL      6.04   3.75 SS       
## 5 Christian Walker    28 walkech02    2019 ARI     NL      5.83   2.19 1B       
## 6 Carson Kelly        24 kellyca02    2019 ARI     NL      3.56   1.90 C, 3B    
## # ℹ 10 more variables: quasi_ws <dbl>, stint_ID <dbl>, franch_id <chr>,
## #   prev_franch <chr>, year_acq <dbl>, year_left <dbl>, next_franch <chr>,
## #   duplicate_status <chr>, next_franch_null <lgl>, next_league <chr>

next_lg_graph <- df |> 
  filter(year_ID > 2004 )|>
  ggplot(aes(x = lg_ID, fill = next_league)) +
  geom_bar(position = "fill") +  
  labs(
    title = "Player Movement AL vs NL",
    x = "Current League",
    y = "Proportion of Players",
    fill = "Next League"
  ) +
  scale_y_continuous(labels = scales::percent) +  # Show percentages
  theme_minimal()
print(next_lg_graph)

Based on this graph we can see the percent of players who go to a different league, stay with the team, or fall into the null category are very similar when comparing AL and NL. Much more could be understood if the documentation that explained why exactly these players next_franch falls into the null category. Perhaps one league could have more players retire, one has more get cut, etc.

Missing Rows and Empty Groups

Finding Missing Rows

missing_values <- sum(is.na(df$def_pos))

# Print the number of missing values
print(paste("Number of missing values in 'def_pos':", missing_values))

## [1] "Number of missing values in 'def_pos': 9589"

missing_values2 <- sum(is.na(df$next_franch))

# Print the number of missing values
print(paste("Number of missing values in 'next_franch':", missing_values2))

## [1] "Number of missing values in 'next_franch': 0"

missing_values3 <- sum(is.na(df$prev_franch))

# Print the number of missing values
print(paste("Number of missing values in 'prev_franch':", missing_values3))

## [1] "Number of missing values in 'prev_franch': 0"

missing_values4 <- sum(is.na(df$lg_ID))

# Print the number of missing values
print(paste("Number of missing values in 'lg_Id':", missing_values4))

## [1] "Number of missing values in 'lg_Id': 0"

missing_values5<- sum(is.na(df$team_ID))

# Print the number of missing values
print(paste("Number of missing values in 'team_ID':", missing_values5))

## [1] "Number of missing values in 'team_ID': 0"

The only categorical column that has missing rows in this data set is the def_pos column, which shows the players defensive position. These are explicitly missing rows. This problem could affect my ability to compare statistics based on the position of players, which is a great way evaluate players. It is very helpful to compare players of the same position or domain. However when they are very different, it can lead to false inference. One example is that pitchers, who tend to have higher a WAR162 statistic than players in other positions, so it would not be fair to compare them to catchers.

Luckily, from 2004 and on there do not seem to be very many missing values for def_pos.

#Looking at how many def_pos are missing by year
missing_by_year <- df %>%
  filter(is.na(def_pos)) %>%
  group_by(year_ID) %>%
  tally()

# View the result
print(missing_by_year)

## # A tibble: 119 × 2
##    year_ID     n
##      <dbl> <int>
##  1    1901     3
##  2    1902     5
##  3    1903     5
##  4    1904     3
##  5    1905     5
##  6    1906     5
##  7    1907     9
##  8    1908    13
##  9    1909     7
## 10    1910    22
## # ℹ 109 more rows

Defining Outlier in WAR162 Column

Usually when dealing with this data set, I have it filtered from 2004-2019 because the WAR162 statistic was not in use until the 2004 season. From 2004 and on, it has had a major effect on whether teams traded, cut, or kept a player.

To define outliers in this data set, I would use the IQR method. This means that any data point outside 1.5 times the IQR would be considered an outlier. While these rows are outliers, it is still interesting to see them because they represent either really bad performances from players or exceptional performances from players.

The IQR test to find outliers could be used to for all continuous columns (WAR162, pct_PT which represenets share of playing time, and quasi_ws).

#IQR for WAR162
df |>
  filter(year_ID > 2004 & year_ID < 2019) |> 
  ggplot()+ 
  geom_boxplot(mapping=aes(x=WAR162,y=""))+
  labs(title="WAR162",
       x="WAR162", y="") +
  theme_classic()

Week5_Data_Dive

2025-02-18

Week 5 Data Dive

3 Columns/Values that are Unclear

Elements that are Still Unclear

Missing Rows and Empty Groups

Defining Outlier in WAR162 Column