This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read_delim("/Users/matthewjobe/Downloads/quasi_winshares.csv", delim = ",")
## Rows: 98796 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): name_common, player_ID, team_ID, lg_ID, def_pos, franch_id, prev_fr...
## dbl (8): age, year_ID, pct_PT, WAR162, quasi_ws, stint_ID, year_acq, year_left
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Within these groupings, we take a look at the mean WAR162. WAR162 stands for “wins above replacement” for a 162 game season. So when we look at the mean WAR162 we see how many more games, on average, that these groupings won over a replacement level player. I also filtered this data set to look at players from 2005-2018. It is filtered this way because organizations did not start utilizing the WAR162 statistic when evaluating players until the 2005 season. The 2019 season is filtered out because the players did yet have a chance to go to a different franchise, therefore the “next_franch” column is null for all players in that season.
dataset |>
filter(year_ID > 2004 & year_ID < 2019) |> # Filter for years between 2005 and 2018
group_by(next_franch) |> # Group by the team acquiring players
summarize(
count1 = n(), # Count of players acquired
mean_WAR162 = mean(WAR162, na.rm = TRUE) # Mean WAR162 of acquired players
)
## # A tibble: 31 × 3
## next_franch count1 mean_WAR162
## <chr> <int> <dbl>
## 1 ANA 452 0.927
## 2 ARI 477 0.707
## 3 ATL 528 0.931
## 4 BAL 442 0.474
## 5 BOS 482 1.09
## 6 CHC 457 0.723
## 7 CHW 361 0.660
## 8 CIN 319 0.544
## 9 CLE 401 0.714
## 10 COL 364 0.537
## # ℹ 21 more rows
In the code above, I wanted to group by the next franchise that players played for. Getting a count by these groupings would allow me to see what teams acquire more/less players from other teams. Additionally, it shows how many players either retire or get cut from the league.
I was surprised to find that 7042 players remained with their original organization, with the LA Dodgers acquiring the second-highest number of players at 559.”
The lowest probability group with the above grouping is “STL” (St. Louis) with a count of 261 players playing for then as their next franchise. This means that St. Louis is the least likely team to acquire a player from another team. As a player, you would have a better chance of going to a team other than St. Louis.
My hypothesis for why St. Louis is the least likely to acquire players from other teams is that their current roster performs well in terms of WAR162 and Quasi Winshare. As a result, players tend to stay with the organization longer due to their strong performance, which reduces the need for the team to acquire new players.
library(viridis)
## Loading required package: viridisLite
p1 <- dataset |>
filter(year_ID > 2004 & year_ID < 2019) |> # Filter for years between 2005 and 2018
group_by(next_franch) |> # Group by the franchise acquiring players
summarize(count = n()) |> # Count the number of players acquired by each franchise
ggplot(mapping = aes(x = next_franch, y = count, fill = next_franch)) +
geom_bar(stat = "identity") +
theme_minimal() + # Minimal theme
scale_fill_viridis(discrete = TRUE) +
ggtitle("Count of Players Acquired by Franchise (2005-2018)") + # Title
xlab("Franchise") + # X-axis label
ylab("Count of Players Acquired") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p1
dataset |>
filter(year_ID > 2004 & year_ID < 2019) |> # Filter years between 2005 and 2018
group_by(year_ID) |> # Group by year_ID
summarize(
count1 = n(), # Count the number of players for each year
mean_WAR162 = mean(WAR162, na.rm = TRUE) # Calculate the mean WAR162 for each year
)
## # A tibble: 14 × 3
## year_ID count1 mean_WAR162
## <dbl> <int> <dbl>
## 1 2005 1329 0.753
## 2 2006 1377 0.726
## 3 2007 1384 0.723
## 4 2008 1384 0.722
## 5 2009 1388 0.720
## 6 2010 1356 0.737
## 7 2011 1389 0.720
## 8 2012 1407 0.711
## 9 2013 1409 0.709
## 10 2014 1433 0.698
## 11 2015 1485 0.673
## 12 2016 1482 0.675
## 13 2017 1494 0.669
## 14 2018 1533 0.652
In the code above, we once again filtered the data to get players between 2005-2018. Then grouped by year the get a count of the players for each year 2005-2019. The mean WAR162 is also computed for players in the respective season.
From this grouping we can see the that in 2005, the MLB had the least amount of players appear in a game. This means that if you played in 2005, you had the worst chance of actually getting on field to play. The number of players who appeared in a game also appears to gradually increase over this 13 year span. In respect to WAR162, the statistic seems to gradually decrease over the 13 year span.
My hypothesis for why the total number of players in 2005 was so low is that replacement players had not yet began to play a significant role on MLB teams. Judging by the WAR162 statistic from 2005-2018, the average WAR162 (wins above replacement) statistic appears to decrease over the 13-year span, potentially reflecting the increasing role of replacement-level players or changes in player performance across the league.
# Creating a bar plot for count by year
p <- dataset |>
filter(year_ID > 2004 & year_ID < 2019) |> # Filter for years between 2004 and 2019
ggplot(mapping = aes(x = year_ID)) + # Define the mapping for the plot
geom_bar() + # Create the bar plot
theme_minimal() +
scale_fill_brewer(palette = 'Dark2')+
ggtitle("Count of Players by Year")
p
library(dplyr) #Changing all instances where a player has multiple positions, to "UTL"
library(stringr) #UTL stands for utility player
dataset |>
filter(year_ID>2004 & year_ID<2019)|>
mutate(def_pos = ifelse(str_detect(def_pos, ","), "UTL", def_pos))|>
group_by(def_pos)|> #Grouping based on defensive position
summarize(count1=n(),#count of player by year
mean(WAR162))
## # A tibble: 11 × 3
## def_pos count1 `mean(WAR162)`
## <chr> <int> <dbl>
## 1 1B 636 1.14
## 2 2B 366 1.63
## 3 3B 340 1.86
## 4 C 1125 0.437
## 5 CF 371 2.08
## 6 LF 297 1.08
## 7 P 10230 0.572
## 8 RF 264 1.52
## 9 SS 431 1.85
## 10 UTL 5676 0.597
## 11 <NA> 114 0.136
In the code above, a few different things are going on. Firstly, there were many players in this data set who played multiple positions, which brought many combinations. I decided to make all of these players go under a value called “UTL” which stands for utility player, and will help with grouping and comparison.
The number of from 2005-2019 pitchers came in at 10230 which was not surprising because many pitchers can play in a single game, the number of players who played multiple position came in at 5675 which seems somewhat high, and least played position was right field (RF). One observation was that 114 players who were counted as “NA” which means their position was not recorded.
Based on this grouping, players in the right field (RF) position were the least likely to exclusively play that position. In practical terms, as a player you have a higher likelihood of playing in the MLB as if you play a different position, OR if you can learn to play multiple positions and become “utility” player (in the the UTL grouping).
My hypothesis for why the RF position count is low is that many players who played RF also took on other positions, resulting in their classification as utility players (UTL) instead.
#filter and create the boxplot
ggplot(dataset |>
filter(year_ID > 2004 & year_ID < 2019), aes(x = def_pos, y = WAR162)) +
geom_boxplot() + #boxplot of WAR162 by defensive position
labs(title = "WAR162 by Position", x = "Defensive Position", y = "WAR162") + # Labels
theme_minimal()
Here I am finding the least common, most common, and missing combinations for team_ID and next_franch. The next_franch column represents that next team that a player played for. The least/most common combination would show us what specific team is least/most likely to acquire a player from a different, specific team. The missing combinations show us pairings where a player has never moved from one team to another, which can reveal interesting insights about team transactions.
filtered_data <- dataset |> #filter the dataset for the years between 2005 and 2018
filter(year_ID > 2004 & year_ID < 2019)
all_combinations <- expand.grid( #possible combinations of 'team_ID' and 'next_franch'
team_ID = unique(filtered_data$team_ID),
next_franch = unique(filtered_data$next_franch)
)
existing_combinations <- filtered_data |> #group by 'team_ID' and 'next_franch' to get the count of each combination
group_by(team_ID, next_franch) |>
count() |>
ungroup()
missing_combinations <- anti_join(all_combinations, existing_combinations, by = c("team_ID", "next_franch")) #find missing combinations
most_common <- existing_combinations |> #find the most and least common combinations
arrange(desc(n)) |>
head(1)
least_common <- existing_combinations |>
arrange(n) |>
head(1)
list(
missing_combinations = missing_combinations,
most_common = most_common,
least_common = least_common
)
## $missing_combinations
## team_ID next_franch
## 1 STL STL
## 2 TBD STL
## 3 WSN WSN
## 4 HOU HOU
## 5 DET LAD
## 6 LAD LAD
## 7 TOR TOR
## 8 CHC CHC
## 9 TEX TEX
## 10 TBD TEX
## 11 ATL ATL
## 12 CHW CHW
## 13 TBD CHW
## 14 KCR KCR
## 15 PHI PHI
## 16 SEA SEA
## 17 NYY NYY
## 18 MIA FLA
## 19 FLA FLA
## 20 CIN CIN
## 21 LAA ANA
## 22 CLE CLE
## 23 SFG SFG
## 24 BAL BAL
## 25 DET DET
## 26 SFG DET
## 27 ARI ARI
## 28 COL COL
## 29 NYM NYM
## 30 MIL MIL
## 31 BOS BOS
## 32 MIA BOS
## 33 PIT PIT
## 34 MIA OAK
## 35 OAK OAK
## 36 SDP SDP
## 37 TBR TBD
## 38 TBD TBD
## 39 MIN MIN
##
## $most_common
## # A tibble: 1 × 3
## team_ID next_franch n
## <chr> <chr> <int>
## 1 LAD NULL 290
##
## $least_common
## # A tibble: 1 × 3
## team_ID next_franch n
## <chr> <chr> <int>
## 1 CHW DET 1
Based on these results, there are only a few missing combinations. Since a team cannot acquire a player from itself, those combinations are excluded from the analysis. We can then see that LA Dodgers have never acquired a player from Detroit Tigers, Detroit has never acquired a player from the San Francisco Giants, and Boston Red Sox & Oakland A’s have never acquired a player from the Miami Marlins.
The most common combination is the LA Dodgers with next_franch Null. This means that LA Dodgers had many players stay with the team and rather than moving to another.
The least common combination of team_ID and next_franch is the Chicago White Sox and the Detroit Tigers. From 2004-2018, the Detroit Tigers have only acquired one player from the White Sox.
filtered_data <- dataset |> #filter 2005-2018
filter(year_ID > 2004 & year_ID < 2019)
null_data <- filtered_data |> #find where next_franch= "NULL"
filter(next_franch == "NULL")
team_count <- null_data |> #group by team and count nulls for each
group_by(team_ID) |>
count()
ggplot(team_count, aes(x = team_ID, y = n, fill = team_ID)) + #barplot
geom_bar(stat = "identity") +
theme_minimal() +
labs(
title = "Count Players Who Remained with Team by team_ID",
x = "Team",
y = "Count of 'NULL' in next_franch",
fill = "Team ID"
) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))