quasi_ws column
The quasi_ws column was very unclear to me until I read that documentation, so as someone who has been around baseball statistics I was eager to learn more. The quasi_ws column represents a new statistic called Quasi-Win Share. Quasi-Win Share is meant to measure how many total wins a player contributes to their team over the course of the season. It is found by three taking times total wins created per 162 games; generated by adding WAR162 to wins BELOW replacement (determined by playing time) and rounding to nearest whole number
WAR162
WAR is a statistic that quantifies a player’s total value in terms of wins above a replacement level player for a full season. WAR162 measures the same thing as WAR, but accounts for players who may have missed time due to circumstances such as injury or being cut from the team. WAR162 scales WAR to account for that player would have contributed over the 162 game season.
Values of 0 in the WAR162 Column
In the WAR162 column, there are values of 0 which do not make sense. Some players have a measure playing time and a measure for their Quasi-Win Share but their WAR162 shows a 0. After reading through the documentation and seeing how they were calculating this statistic, it made sense that someone could have performed exactly like “replacement level player”
One element that is unclear is that there some players have two entries for a single year. It is not directly explained by the documentation, but it would be possible that if a player is traded, or cut and picked up by another team, that they could have two entries. One entry would be for their performance with the first team, and the other show performance with the next.
This could be an interesting phenomena that could show how players perform before and after being traded or picked up by another team. However, we cannot know if they were traded, or cut and recruited by another team.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
df<- read_delim("/Users/matthewjobe/Downloads/quasi_winshares.csv", delim = ",")
## Rows: 98796 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): name_common, player_ID, team_ID, lg_ID, def_pos, franch_id, prev_fr...
## dbl (8): age, year_ID, pct_PT, WAR162, quasi_ws, stint_ID, year_acq, year_left
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
duplicates <- df|>
filter(year_ID > 2004 & year_ID < 2019) |>
group_by(name_common, year_ID) |>
tally() |>
filter(n > 1)
# View the results
print(duplicates)
## # A tibble: 1,618 × 3
## # Groups: name_common [1,203]
## name_common year_ID n
## <chr> <dbl> <int>
## 1 A.J. Cole 2018 2
## 2 A.J. Ellis 2016 2
## 3 A.J. Pierzynski 2014 2
## 4 AJ Ramos 2017 2
## 5 Aaron Brooks 2015 2
## 6 Aaron Guiel 2006 2
## 7 Aaron Harang 2013 2
## 8 Aaron Hill 2011 2
## 9 Aaron Hill 2016 2
## 10 Aaron Laffey 2011 2
## # ℹ 1,608 more rows
# Add a new column for duplicate status in the original dataset
df<-df|>
mutate(duplicate_status = ifelse(paste(name_common, year_ID) %in% paste(duplicates$name_common, duplicates$year_ID), "Duplicate", "Non-Duplicate"))
status_counts <- table(df$duplicate_status)
print(status_counts)
##
## Duplicate Non-Duplicate
## 3318 95478
Below is pie chart to give us a better picture of how many duplicates there are in this data set.
#Chart of Duplicates vs Non Duplicates
status_counts_df <- as.data.frame(status_counts)
ggplot(status_counts_df, aes(x = "", y = Freq, fill = Var1)) +
geom_bar(stat = "identity", width = 1) +
coord_polar(theta = "y") +
labs(title = "Duplicates vs Non-Duplicates", fill = "Status") +
theme_minimal() +
theme(axis.text.x = element_blank())
Another element that is unclear is the amount of “NULL” values in the next_franch column.The next_franch column in this data set represents the “Next Franchise” that a player played for. When dealing with this data set, I struggled with the “Null” values in this column. The “NULL” values are not explained at all in the documentation, which worries me much more than the duplicate values metioned earlier. After reading the documentation, this phenomena was not explained. I concluded that a Null value in this column would represent a player who retired, did not move teams, or has yet to sign with another team.
To demonstrate this, I built a histogram that shows the counts of players whose next franchise played for was in the American Leauge or National League, or if it “Null”. First I had to build new column called next_franch_null which represents if the next franchise that they played for was is Null. This column will be called “next_franch_null”. Below, we can see there are 64839 instances where players went to another team and 33957 instances where they either stayed, retired, or were cut.
#Creating a new column
df <- df |> mutate(next_franch_null = next_franch == "NULL")
df |> count(next_franch_null)
## # A tibble: 2 × 2
## next_franch_null n
## <lgl> <int>
## 1 FALSE 64839
## 2 TRUE 33957
p1 <- df |>
filter(year_ID > 2004 )|>
group_by(next_franch_null) |> #group by new column
summarize(count = n()) |> # Count NULLS and players who went to new team
ggplot(mapping = aes(x = next_franch_null, y = count, fill = next_franch_null)) +
geom_bar(stat = "identity") +
theme_minimal() + # Minimal theme
##scale_fill_viridis(discrete = TRUE) +
ggtitle("Count of Players Who Retired, Stayed with Team, or Were Cut") + # Title
xlab("Next Franchise is NULL") + # X-axis label
ylab("Count of Playes") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p1
The problem with the graph above, is that we do not know why a player’s next_franch would be “NULL”. There are three possibilities but no way of knowing which is correct. Additionally, this makes it difficult to get an understanding of which players a team is keeping because we cannot see if they stayed or left.
The next graph visualizes if a players who are in the NL or AL tend to stay in that league, go to a different league, or have the “NULL” value based on the next_franch column. First I must create a new column called “next_leaugue” which represents the next league that players went to.
al_teams <- c("NYY", "BOS", "HOU", "OAK", "SEA", "MIN", "CHW", "CLE", "DET", "KCR", "ANA", "LAA", "TEX", "TOR", "TBD","TBR", "BAL")
nl_teams <- c("LAD", "SFG", "SDP", "COL", "ARI", "CHC", "STL", "MIL", "CIN", "PIT", "ATL", "NYM", "PHI", "WSN", "MIA", "FLA")
# Create a new column labeling as AL, NL, or NULL
df <- df %>%
mutate(next_league = case_when(
next_franch %in% al_teams ~ "AL",
next_franch %in% nl_teams ~ "NL",
is.na(next_franch) | next_franch == "NULL" ~ "NULL",
TRUE ~ "Unknown" # In case there are other values
))
head(df)
## # A tibble: 6 × 19
## name_common age player_ID year_ID team_ID lg_ID pct_PT WAR162 def_pos
## <chr> <dbl> <chr> <dbl> <chr> <chr> <dbl> <dbl> <chr>
## 1 Ketel Marte 25 marteke01 2019 ARI NL 6.19 7.16 CF, 2B, …
## 2 Zack Greinke 35 greinza01 2019 ARI NL 4.11 5.02 P
## 3 Eduardo Escobar 30 escobed01 2019 ARI NL 6.76 4.03 3B, 2B
## 4 Nick Ahmed 29 ahmedni01 2019 ARI NL 6.04 3.75 SS
## 5 Christian Walker 28 walkech02 2019 ARI NL 5.83 2.19 1B
## 6 Carson Kelly 24 kellyca02 2019 ARI NL 3.56 1.90 C, 3B
## # ℹ 10 more variables: quasi_ws <dbl>, stint_ID <dbl>, franch_id <chr>,
## # prev_franch <chr>, year_acq <dbl>, year_left <dbl>, next_franch <chr>,
## # duplicate_status <chr>, next_franch_null <lgl>, next_league <chr>
next_lg_graph <- df |>
filter(year_ID > 2004 )|>
ggplot(aes(x = lg_ID, fill = next_league)) +
geom_bar(position = "fill") +
labs(
title = "Player Movement AL vs NL",
x = "Current League",
y = "Proportion of Players",
fill = "Next League"
) +
scale_y_continuous(labels = scales::percent) + # Show percentages
theme_minimal()
print(next_lg_graph)
Based on this graph we can see the percent of players who go to a different league, stay with the team, or fall into the null category are very similar when comparing AL and NL. Much more could be understood if the documentation that explained why exactly these players next_franch falls into the null category. Perhaps one league could have more players retire, one has more get cut, etc.
Finding Missing Rows
missing_values <- sum(is.na(df$def_pos))
# Print the number of missing values
print(paste("Number of missing values in 'def_pos':", missing_values))
## [1] "Number of missing values in 'def_pos': 9589"
missing_values2 <- sum(is.na(df$next_franch))
# Print the number of missing values
print(paste("Number of missing values in 'next_franch':", missing_values2))
## [1] "Number of missing values in 'next_franch': 0"
missing_values3 <- sum(is.na(df$prev_franch))
# Print the number of missing values
print(paste("Number of missing values in 'prev_franch':", missing_values3))
## [1] "Number of missing values in 'prev_franch': 0"
missing_values4 <- sum(is.na(df$lg_ID))
# Print the number of missing values
print(paste("Number of missing values in 'lg_Id':", missing_values4))
## [1] "Number of missing values in 'lg_Id': 0"
missing_values5<- sum(is.na(df$team_ID))
# Print the number of missing values
print(paste("Number of missing values in 'team_ID':", missing_values5))
## [1] "Number of missing values in 'team_ID': 0"
The only categorical column that has missing rows in this data set is the def_pos column, which shows the players defensive position. These are explicitly missing rows. This problem could affect my ability to compare statistics based on the position of players, which is a great way evaluate players. It is very helpful to compare players of the same position or domain. However when they are very different, it can lead to false inference. One example is that pitchers, who tend to have higher a WAR162 statistic than players in other positions, so it would not be fair to compare them to catchers.
Luckily, from 2004 and on there do not seem to be very many missing values for def_pos.
#Looking at how many def_pos are missing by year
missing_by_year <- df %>%
filter(is.na(def_pos)) %>%
group_by(year_ID) %>%
tally()
# View the result
print(missing_by_year)
## # A tibble: 119 × 2
## year_ID n
## <dbl> <int>
## 1 1901 3
## 2 1902 5
## 3 1903 5
## 4 1904 3
## 5 1905 5
## 6 1906 5
## 7 1907 9
## 8 1908 13
## 9 1909 7
## 10 1910 22
## # ℹ 109 more rows
Usually when dealing with this data set, I have it filtered from 2004-2019 because the WAR162 statistic was not in use until the 2004 season. From 2004 and on, it has had a major effect on whether teams traded, cut, or kept a player.
To define outliers in this data set, I would use the IQR method. This means that any data point outside 1.5 times the IQR would be considered an outlier. While these rows are outliers, it is still interesting to see them because they represent either really bad performances from players or exceptional performances from players.
The IQR test to find outliers could be used to for all continuous columns (WAR162, pct_PT which represenets share of playing time, and quasi_ws).
#IQR for WAR162
df |>
filter(year_ID > 2004 & year_ID < 2019) |>
ggplot()+
geom_boxplot(mapping=aes(x=WAR162,y=""))+
labs(title="WAR162",
x="WAR162", y="") +
theme_classic()