Data Sources:
All Qualified Batters from 1940 through 2022
• FanGraphs.com • BaseballRefernce.com • Chadwick Player Register
R Packages Used:
• baseballr • dplyr • data.table • scales • lubridate • plotyly • ggplot2
#set root directory for project#
setwd("/Users/chriscoxen/Library/CloudStorage/OneDrive-LoyolaUniversityMaryland/DS-736 Data Visualization for Decision Making/DS 736 R Directory/R_datafiles")
library(baseballr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(ggplot2)
library(scales)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggrepel)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(maps)
#pull player statistical data from Fangraphs.com using baseballr#
fangraphs_data <- fg_batter_leaders(1940, 2022,league = "all", qual = "y", ind = 1, exc_p = TRUE)
fg_data <- data.frame(fangraphs_data)
#import player DEF and Off WAR data for 2011 thru 2020 from Fangraphs#
qBatter_WAR_2010s <- fread("FanGraphs_Off_Def_WAR.csv")
#import player biographical data from Chadwick csv file#
master_player_list <- fread("Master Player List.csv")
#create dataframe of player bio graphical data#
mpl_df <- master_player_list %>%
select(fangraphs, birthCity, birthState, birthCountry, birthYear, birthRegion) %>%
rename(playerid=fangraphs) %>%
data.frame()
mpl_df$playerid <- as.character(mpl_df$playerid)
#merge player bio data with Fangraphs stats#
fg_data_wBio <- merge(fg_data, mpl_df, by = c("playerid"), all.x=TRUE)
fg_data_wBio$Season <- as.numeric(fg_data_wBio$Season)
fg_data_wBio$birthRegion <- as.character(fg_data_wBio$birthRegion)
#group performance stats by decade#
fg_data_wBio <- fg_data_wBio %>%
mutate(decade=case_when(
Season<=1949 ~ '1940s',
Season<=1959 ~ '1950s',
Season<=1969 ~ '1960s',
Season<=1979 ~ '1970s',
Season<=1989 ~ '1980s',
Season<=1999 ~ '1990s',
Season<=2009 ~ '2000s',
Season<=2019 ~ '2010s',
Season<=2029 ~ '2020s',
TRUE ~ 'OTHER'
)) %>%
data.frame()
Total At Bats by Birth Region per Decade - 1940 through 2022
Change in totals ABs by geographic region of player origin. Data from Fangraphs, Baseball Reference, and Chadwick.
Insights: Baseball is one of the most diverse sports in the world, but it wasn’t always so. This trellis chart shows the slow increase in player ABs from regions outside the USA over time. Latin American players are the second largest cohort after American-born players. The success of Ichiro Suzuki in the early 2000’s paved the way for more Asian players join MLB.
#-------------------------------<visualize total ABs per decade by player region>---------------------------------------------#
#new dataframe abs per decade
abs_decade <- fg_data_wBio[, c("birthRegion", "decade", "AB")]
# Sum AB by birthRegion and decade
sum_by_region_decade <- aggregate(AB ~ birthRegion + decade, data = abs_decade, FUN = sum)
# Define order for birthRegion
region_order <- c("Other","Australia", "Asia", "Europe", "Canada", "Latin America", "USA")
region_colors <- c("lightgray", "lightblue", "violet", "green", "red", "orange", "blue")
# Convert birthRegion to a factor with specified order
sum_by_region_decade$birthRegion <- factor(sum_by_region_decade$birthRegion, levels = region_order)
ggplot(sum_by_region_decade, aes(x = AB, y = birthRegion, fill = birthRegion)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = comma(AB)), position = position_dodge(width = 1), hjust = -0.005) +
facet_wrap(~decade, nrow = 3, ncol = 3, strip.position = "bottom") +
scale_fill_manual(values = region_colors) +
xlab("Total ABs per Region") +
ylab(NULL) +
ggtitle("Total At Bats by Birth Region per Decade - 1940 through 2022") +
theme(axis.text.x = element_blank(),
axis.title.x = element_blank(),
strip.text.x = element_text(size = 14, face = "bold", hjust = 0.5, margin = margin(t = 12, r = 0, b = 7, l = 0)),
legend.position = "none",
panel.background = element_rect(fill = "white", colour = "gray90"),
panel.grid.major = element_line(linewidth = 0.5, color = "gray80"),
panel.grid.minor = element_line(linewidth = 0.25, color = "gray90"),
plot.background = element_rect(fill = "gray98")) +
scale_x_continuous(breaks = seq(0, max(sum_by_region_decade$AB), by = 100000), labels = comma)
#-------------------------------</visualize total ABs per decade by player region>---------------------------------------------#
Top 100 batters of the 2010’s by WAR: Def vs. Off
This plot shows the top 100 batters in terms of their collective WAR score for the 2010s season. The data is plotted on a scatter chart, with the x-axis representing the player’s Defensive WAR score and the y-axis representing the player’s Offensive WAR score. Players in the chart’s upper right quadrant had positive scores for both Defensive and Offensive WAR. Players located in the upper left quadrant had negative Defensive WAR scores and positive Offensive WAR scores. Players located in the lower right quadrant had positive Defensive WAR scores and negative Offensive WAR scores. Finally, players located in the lower left quadrant had negative scores for both Defensive and Offensive WAR.
Insights: The chart does an excellent job of demonstrating how a player can contribute to his team despite below-average defense or offense. For example, Andrew Simmons had a net 24.4 WAR (wins above replacement level player) despite being poor offensive player, thanks to his elite defense. Conversely, Carlos Santana had net 21.2 WAR despite being the worst defensive player in the top 100, thanks his offensive prowess. Baseball is not a one-dimensional game and neither are its players.
#-------------------------------<visualize relationship between Offensive WAR and DEF WAR>-------------------------------------#
# Filter the data to the top 100 players by WAR
qBatter_WAR_2010s_top100 <- qBatter_WAR_2010s %>%
arrange(desc(WAR)) %>%
head(100)
# Create the scatter plot using plotly
# Create the scatter plot using plotly
p <- plot_ly(qBatter_WAR_2010s_top100, x = ~Def, y = ~Off, color = ~playerid, colors = c("red", "blue"), text = ~paste("Name: ", Name, "<br>", "WAR: ", WAR)) %>%
add_markers() %>%
layout(xaxis = list(title = "Defensive WAR"),
yaxis = list(title = "Offensive WAR"),
legend = list(title = "Player ID"),
title = list(text = "Top 100 batters of the 2010's by WAR: Def vs. Off", x = 0.85, y = 0.95, xanchor = "right", yanchor = "bottom"))
# Add quadrant labels to the plot
p <- p %>%
layout(annotations = list(
list(
x = 0.05, y = 0.95,
xref = "paper", yref = "paper",
text = "-Def, +Off",
showarrow = FALSE,
font = list(color = "purple")
),
list(
x = 0.05, y = 0.05,
xref = "paper", yref = "paper",
text = "-Def, -Off",
showarrow = FALSE,
font = list(color = "red")
),
list(
x = 0.95, y = 0.95,
xref = "paper", yref = "paper",
text = "+Def, +Off",
showarrow = FALSE,
font = list(color = "blue")
),
list(
x = 0.95, y = 0.05,
xref = "paper", yref = "paper",
text = "+Def, -Off",
showarrow = FALSE,
font = list(color = "purple")
)
))
p
dWAR_vs_OWAR_plot <- p
#-------------------------------</visualize relationship between Offensive WAR and DEF WAR>-----------------------------------#
This plot charts the change in ABs per HR over time. We note several key historical events throughout baseball history. In summary, HRs have become more frequent since 1943.
Insights: Homerun frequency has varied over time, but there is a distinct trend towards more HRs over time. I have noted a few key historical periods for context:
1940 – WWII draft begin and many active players were enlisted. As a results, the quality of offensive play took a sharp decline. Fewer HRs were hit per at bat form 1940 to 1943.
1968 – Known as the “Year of Pitcher” when pitchers had one of most dominant season in MLB history. The league responded by lowering the height of the mound.
1998 – Mark McGwire and Sammy Sosa break the single-season HR record in the same season.
2013 – the steroid scandal forces MLB to implement PED testing, and the HR pace declines.
2017 – hitting coaches emphasize batter launch angle to increase the trajectory of batted balls. The HR pace increases.
#----------------------------------------------<visualize ABs per HR over Time>-----------------------------------------------#
ab_per_hr <- fg_data_wBio %>%
group_by(Season) %>%
summarize(total_hr = sum(HR),
total_abs = sum(AB),
avg_ab_per_hr = mean(total_abs/total_hr)) %>%
mutate(Season = as.numeric(Season))
fig <- plot_ly(ab_per_hr, x = ~Season, y = ~avg_ab_per_hr, type = 'scatter', mode = 'lines', line = list(color = 'blue', width = 1)) %>%
add_trace(text = paste("Season: ", ab_per_hr$Season, "<br>ABs per HR: ", round(ab_per_hr$avg_ab_per_hr, 2)),
hoverinfo = 'text', line = list(color = 'blue', width = 1)) %>%
add_annotations(text = "WWII Draft", x = 1940, y = ab_per_hr$avg_ab_per_hr[ab_per_hr$Season == 1940], showarrow = TRUE) %>%
add_annotations(text = "1968 'Year of the Pitcher'", x = 1968, y = ab_per_hr$avg_ab_per_hr[ab_per_hr$Season == 1968], showarrow = TRUE) %>%
add_annotations(text = "McGwire & Sosa break single-season HR record", x = 1998, y = ab_per_hr$avg_ab_per_hr[ab_per_hr$Season == 1998], showarrow = TRUE) %>%
add_annotations(text = "PED Testing", x = 2010, y = ab_per_hr$avg_ab_per_hr[ab_per_hr$Season == 2013], showarrow = TRUE) %>%
add_annotations(text = "Launch Angle Trend takes off", x = 2017, y = ab_per_hr$avg_ab_per_hr[ab_per_hr$Season == 2017], showarrow = TRUE) %>%
layout(title = list(text = "At Bats per HR by Season: 1940 - 2022", x = 0.095, y = 0.925, xanchor = "left", yanchor = "top"),
xaxis = list(title = "Season", tickmode = "linear", tick0 = 1940, dtick = 5),
yaxis = list(title = "ABs per HR", tickmode = "linear", tick0 = 10, dtick = 10),
showlegend = FALSE,
legend = list(x = 1.1, y = 0.5, bgcolor = "transparent"),
hoverlabel = list(bgcolor = "white", font = list(family = "sans-serif", size = 12, color = "black")),
margin = list(t = 80, l = 80))
fig
ab_per_hr_graph <- fig
#----------------------------------------------</visualize ABs per HR over Time>-----------------------------------------------#
Mean Weighted Runs Created by Player Region Origin
This chart plots the mean wRC for all qualified batters by their region of birth, including the USA, Latin America, Canada, Europe, Australia, and misc. countries.
Insight: MLB has a long history of discrimination and systemic racism which prevented players of color and immigrants from becoming big leaguers in the early to mid-part of the 20th century.
#----------------------------------------------<visualize mean wRC by player birth region>-----------------------------------------------#
# Define a vector of colors to use for the density curves
color_vector <- c("#9400D3", "#4B0082", "#0000FF", "#00FF00", "#FFFF00", "#FF7F00", "#FF0000")
# Filter out rows where birthRegion is NA
fg_data_wBio_filtered <- fg_data_wBio[!is.na(fg_data_wBio$birthRegion),]
# Filter the data by decade
decade_filter <- fg_data_wBio_filtered$decade == "1940s"| fg_data_wBio_filtered$decade == "1950s"| fg_data_wBio_filtered$decade == "1960s" | fg_data_wBio_filtered$decade == "1970s" | fg_data_wBio_filtered$decade == "1980s"| fg_data_wBio_filtered$decade == "1990s"| fg_data_wBio_filtered$decade == "2000s"| fg_data_wBio_filtered$decade == "2010s" | fg_data_wBio_filtered$decade == "2020s"
fg_data_wBio_decade <- fg_data_wBio_filtered[decade_filter, ]
# Create the density plot with rainbow colors
ggplot(fg_data_wBio_decade, aes(x = wRC, fill = birthRegion)) +
geom_density(alpha = 0.5) +
ggtitle("Distribution of wRC by Player Region across Decades (excluding unknow Region)") +
xlab("wRC") +
ylab(NULL) +
scale_fill_manual(values = color_vector, name = "Birth Region") +
facet_wrap(~decade) +
scale_x_continuous(breaks = seq(0, max(fg_data_wBio_decade$wRC), by = 25)) +
theme(axis.text.y=element_blank(), axis.ticks.y=element_blank())
#----------------------------------------------</visualize mean wRC by player birth region>-----------------------------------------------#
This chart shows the relationship between walks per plate appearance vs. strikeouts per plate appearance over time. In summary, batters are striking out more often and walking less frequently.
Insights: This dual-axis line plot demonstrates the overall decline in batter walk rate vs. the sharp increase in batter strikeout rate. There are several possible explanations for these trends:
#----------------------------------------<visualize walks per pa vs. strikeouts per pa over time>-----------------------------------------#
# Filter out rows where SO is NA
fg_data_filtered <- fg_data %>% filter(!is.na(SO))
# Aggregate the data by season and calculate the total BB and SO, and PA
fg_data_totals <- fg_data_filtered %>%
group_by(Season) %>%
summarize(Total_BB = sum(BB), Total_SO = sum(SO), Total_PA = sum(PA))
# Create the plotly object and add the first trace for BB per PA
plot <- plot_ly(fg_data_totals, x = ~Season, y = ~Total_BB/Total_PA, name = 'BB per PA', type = 'scatter', mode = 'lines', line = list(width = 3))
# Add the second trace for SO per PA
plot <- add_trace(plot, y = ~Total_SO/Total_PA, name = 'SO per PA', yaxis = 'y2', line = list(width = 3, color = "red"))
# Configure the layout with two y-axes and a chart title
plot <- layout(plot,
yaxis = list(title = 'BB per PA'),
yaxis2 = list(title = 'SO per PA', overlaying = 'y', side = 'right'),
xaxis = list(
tickmode = 'linear',
dtick = 5
),
title = list(text = 'BB per PA vs. SO per PA by season', x = 0.05, y = 0.95, xanchor = 'left', yanchor = 'top', pad = list(t = 10)),
margin = list(l = 50, r = 50, t = 50, b = 50)
)
# Display the plot
plot
bb_vs_so_plot <- plot
#----------------------------------------</visualize walks per pa vs. strikeouts per pa over time>-----------------------------------------#