R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

Week 3 Data Dive

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dataset <- read_delim("/Users/matthewjobe/Downloads/quasi_winshares.csv", delim = ",")
## Rows: 98796 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): name_common, player_ID, team_ID, lg_ID, def_pos, franch_id, prev_fr...
## dbl (8): age, year_ID, pct_PT, WAR162, quasi_ws, stint_ID, year_acq, year_left
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

3 Group By Functions With Visuals

Within these groupings, we take a look at the mean WAR162. WAR162 stands for “wins above replacement” for a 162 game season. So when we look at the mean WAR162 we see how many more games, on average, that these groupings won over a replacement level player. I also filtered this data set to look at players from 2005-2018. It is filtered this way because organizations did not start utilizing the WAR162 statistic when evaluating players until the 2005 season. The 2019 season is filtered out because the players did yet have a chance to go to a different franchise, therefore the “next_franch” column is null for all players in that season.

dataset |>
  filter(year_ID > 2004 & year_ID < 2019) |>   # Filter for years between 2005 and 2018
  group_by(next_franch) |>                       # Group by the team acquiring players
  summarize(
    count1 = n(),                                # Count of players acquired
    mean_WAR162 = mean(WAR162, na.rm = TRUE)      # Mean WAR162 of acquired players
  )
## # A tibble: 31 × 3
##    next_franch count1 mean_WAR162
##    <chr>        <int>       <dbl>
##  1 ANA            452       0.927
##  2 ARI            477       0.707
##  3 ATL            528       0.931
##  4 BAL            442       0.474
##  5 BOS            482       1.09 
##  6 CHC            457       0.723
##  7 CHW            361       0.660
##  8 CIN            319       0.544
##  9 CLE            401       0.714
## 10 COL            364       0.537
## # ℹ 21 more rows

In the code above, I wanted to group by the next franchise that players played for. Getting a count by these groupings would allow me to see what teams acquire more/less players from other teams. Additionally, it shows how many players either retire or get cut from the league.

I was surprised to find that 7042 players remained with their original organization, with the LA Dodgers acquiring the second-highest number of players at 559.”

The lowest probability group with the above grouping is “STL” (St. Louis) with a count of 261 players playing for then as their next franchise. This means that St. Louis is the least likely team to acquire a player from another team. As a player, you would have a better chance of going to a team other than St. Louis.

My hypothesis for why St. Louis is the least likely to acquire players from other teams is that their current roster performs well in terms of WAR162 and Quasi Winshare. As a result, players tend to stay with the organization longer due to their strong performance, which reduces the need for the team to acquire new players.

library(viridis)
## Loading required package: viridisLite
p1 <- dataset |> 
  filter(year_ID > 2004 & year_ID < 2019) |>   # Filter for years between 2005 and 2018
  group_by(next_franch) |>  # Group by the franchise acquiring players
  summarize(count = n()) |>  # Count the number of players acquired by each franchise
  ggplot(mapping = aes(x = next_franch, y = count, fill = next_franch)) + 
  geom_bar(stat = "identity") +  
  theme_minimal() +  # Minimal theme
  scale_fill_viridis(discrete = TRUE) +  
  ggtitle("Count of Players Acquired by Franchise (2005-2018)") +  # Title
  xlab("Franchise") +  # X-axis label
  ylab("Count of Players Acquired") +  
  theme(axis.text.x = element_text(angle = 90, hjust = 1))  

p1

dataset |> 
  filter(year_ID > 2004 & year_ID < 2019) |>  # Filter years between 2005 and 2018
  group_by(year_ID) |>                         # Group by year_ID
  summarize(
    count1 = n(),                              # Count the number of players for each year
    mean_WAR162 = mean(WAR162, na.rm = TRUE)    # Calculate the mean WAR162 for each year
  )
## # A tibble: 14 × 3
##    year_ID count1 mean_WAR162
##      <dbl>  <int>       <dbl>
##  1    2005   1329       0.753
##  2    2006   1377       0.726
##  3    2007   1384       0.723
##  4    2008   1384       0.722
##  5    2009   1388       0.720
##  6    2010   1356       0.737
##  7    2011   1389       0.720
##  8    2012   1407       0.711
##  9    2013   1409       0.709
## 10    2014   1433       0.698
## 11    2015   1485       0.673
## 12    2016   1482       0.675
## 13    2017   1494       0.669
## 14    2018   1533       0.652

In the code above, we once again filtered the data to get players between 2005-2018. Then grouped by year the get a count of the players for each year 2005-2019. The mean WAR162 is also computed for players in the respective season.

From this grouping we can see the that in 2005, the MLB had the least amount of players appear in a game. This means that if you played in 2005, you had the worst chance of actually getting on field to play. The number of players who appeared in a game also appears to gradually increase over this 13 year span. In respect to WAR162, the statistic seems to gradually decrease over the 13 year span.

My hypothesis for why the total number of players in 2005 was so low is that replacement players had not yet began to play a significant role on MLB teams. Judging by the WAR162 statistic from 2005-2018, the average WAR162 (wins above replacement) statistic appears to decrease over the 13-year span, potentially reflecting the increasing role of replacement-level players or changes in player performance across the league.

# Creating a bar plot for count by year
p <- dataset |>
  filter(year_ID > 2004 & year_ID < 2019) |>   # Filter for years between 2004 and 2019
  ggplot(mapping = aes(x = year_ID)) +           # Define the mapping for the plot
  geom_bar() +                                  # Create the bar plot
  theme_minimal() +                             
  scale_fill_brewer(palette = 'Dark2')+
  ggtitle("Count of Players by Year")

p

library(dplyr) #Changing all instances where a player has multiple positions, to "UTL"
library(stringr) #UTL stands for utility player




dataset |>
  filter(year_ID>2004 & year_ID<2019)|>
  mutate(def_pos = ifelse(str_detect(def_pos, ","), "UTL", def_pos))|>
  group_by(def_pos)|> #Grouping based on defensive position
  summarize(count1=n(),#count of player by year
  mean(WAR162))
## # A tibble: 11 × 3
##    def_pos count1 `mean(WAR162)`
##    <chr>    <int>          <dbl>
##  1 1B         636          1.14 
##  2 2B         366          1.63 
##  3 3B         340          1.86 
##  4 C         1125          0.437
##  5 CF         371          2.08 
##  6 LF         297          1.08 
##  7 P        10230          0.572
##  8 RF         264          1.52 
##  9 SS         431          1.85 
## 10 UTL       5676          0.597
## 11 <NA>       114          0.136

In the code above, a few different things are going on. Firstly, there were many players in this data set who played multiple positions, which brought many combinations. I decided to make all of these players go under a value called “UTL” which stands for utility player, and will help with grouping and comparison.

The number of from 2005-2019 pitchers came in at 10230 which was not surprising because many pitchers can play in a single game, the number of players who played multiple position came in at 5675 which seems somewhat high, and least played position was right field (RF). One observation was that 114 players who were counted as “NA” which means their position was not recorded.

Based on this grouping, players in the right field (RF) position were the least likely to exclusively play that position. In practical terms, as a player you have a higher likelihood of playing in the MLB as if you play a different position, OR if you can learn to play multiple positions and become “utility” player (in the the UTL grouping).

My hypothesis for why the RF position count is low is that many players who played RF also took on other positions, resulting in their classification as utility players (UTL) instead.

#filter and create the boxplot
ggplot(dataset |> 
         filter(year_ID > 2004 & year_ID < 2019), aes(x = def_pos, y = WAR162)) + 
  geom_boxplot() +  #boxplot of WAR162 by defensive position
  labs(title = "WAR162 by Position", x = "Defensive Position", y = "WAR162") +  # Labels
  theme_minimal()  

Combinations of team_ID and next_franch

Here I am finding the least common, most common, and missing combinations for team_ID and next_franch. The next_franch column represents that next team that a player played for. The least/most common combination would show us what specific team is least/most likely to acquire a player from a different, specific team. The missing combinations show us pairings where a player has never moved from one team to another, which can reveal interesting insights about team transactions.

filtered_data <- dataset |> #filter the dataset for the years between 2005 and 2018
  filter(year_ID > 2004 & year_ID < 2019)


all_combinations <- expand.grid( #possible combinations of 'team_ID' and 'next_franch'
  team_ID = unique(filtered_data$team_ID), 
  next_franch = unique(filtered_data$next_franch)
)

existing_combinations <- filtered_data |> #group by 'team_ID' and 'next_franch' to get the count of each combination
  group_by(team_ID, next_franch) |> 
  count() |> 
  ungroup()


missing_combinations <- anti_join(all_combinations, existing_combinations, by = c("team_ID", "next_franch")) #find missing combinations 


most_common <- existing_combinations |> #find the most and least common combinations
  arrange(desc(n)) |> 
  head(1)  

least_common <- existing_combinations |> 
  arrange(n) |> 
  head(1) 

list(
  missing_combinations = missing_combinations,
  most_common = most_common,
  least_common = least_common
)
## $missing_combinations
##    team_ID next_franch
## 1      STL         STL
## 2      TBD         STL
## 3      WSN         WSN
## 4      HOU         HOU
## 5      DET         LAD
## 6      LAD         LAD
## 7      TOR         TOR
## 8      CHC         CHC
## 9      TEX         TEX
## 10     TBD         TEX
## 11     ATL         ATL
## 12     CHW         CHW
## 13     TBD         CHW
## 14     KCR         KCR
## 15     PHI         PHI
## 16     SEA         SEA
## 17     NYY         NYY
## 18     MIA         FLA
## 19     FLA         FLA
## 20     CIN         CIN
## 21     LAA         ANA
## 22     CLE         CLE
## 23     SFG         SFG
## 24     BAL         BAL
## 25     DET         DET
## 26     SFG         DET
## 27     ARI         ARI
## 28     COL         COL
## 29     NYM         NYM
## 30     MIL         MIL
## 31     BOS         BOS
## 32     MIA         BOS
## 33     PIT         PIT
## 34     MIA         OAK
## 35     OAK         OAK
## 36     SDP         SDP
## 37     TBR         TBD
## 38     TBD         TBD
## 39     MIN         MIN
## 
## $most_common
## # A tibble: 1 × 3
##   team_ID next_franch     n
##   <chr>   <chr>       <int>
## 1 LAD     NULL          290
## 
## $least_common
## # A tibble: 1 × 3
##   team_ID next_franch     n
##   <chr>   <chr>       <int>
## 1 CHW     DET             1

Based on these results, there are only a few missing combinations. Since a team cannot acquire a player from itself, those combinations are excluded from the analysis. We can then see that LA Dodgers have never acquired a player from Detroit Tigers, Detroit has never acquired a player from the San Francisco Giants, and Boston Red Sox & Oakland A’s have never acquired a player from the Miami Marlins.

The most common combination is the LA Dodgers with next_franch Null. This means that LA Dodgers had many players stay with the team and rather than moving to another.

The least common combination of team_ID and next_franch is the Chicago White Sox and the Detroit Tigers. From 2004-2018, the Detroit Tigers have only acquired one player from the White Sox.

filtered_data <- dataset |> #filter 2005-2018
  filter(year_ID > 2004 & year_ID < 2019) 

null_data <- filtered_data |> #find where next_franch= "NULL"
  filter(next_franch == "NULL")


team_count <- null_data |> #group by team and count nulls for each
  group_by(team_ID) |> 
  count()

ggplot(team_count, aes(x = team_ID, y = n, fill = team_ID)) + #barplot
  geom_bar(stat = "identity") +   
  theme_minimal() +               
  labs(
    title = "Count Players Who Remained with Team by team_ID",
    x = "Team",
    y = "Count of 'NULL' in next_franch",
    fill = "Team ID"
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))