Recently, I’ve decided to improve my skills in R generally and data visualization/communication more specifically. I do data/evaluation things for work, but not with these tools. I think that I can a) improve my work going forward and b) do more fun personal things with a sharpening of skills here. To that end, I’ve been looking for problems to apply these skills to as well as arenas to post my work and receive feedback. And no place is better than Reddit for this kind of thing.
Not long ago, I was working on a TidyTuesday project where I ended up making a bi-directional bar chart showing goal differential in the English Premier League. That was interesting (mostly because I had never made that kind of chart before), but I’m not a particularly big soccer fan. So I turned the new skill toward an area of interest: baseball.
My first attempt at making something that communicated interesting facts started with going to Baseball Reference, looking up the team stats table, and copying the info into a Google Sheet. From there, I downloaded the sheet as a .csv file that I could read into R. After some manipulation, I had a single dataset with each team, their runs scored, runs allowed, and the difference between the two. Here is that process.
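library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(here)       # project-relative file paths
# Runs scored per team
batting <- read_csv(here("data", "mlb_2023_batting.csv")) %>%
  select(Tm, R) %>%
  rename(rs = R)
# Runs allowed per team
pitching <- read_csv(here("data", "mlb_2023_pitching.csv")) %>%
  select(Tm, R) %>%
  rename(ra = R)
# Join on team, compute the differential, and flag its sign for the fill colour
run_diff <- inner_join(batting, pitching, by = "Tm") %>%
  mutate(diff = rs - ra,
         pos = case_when(diff > 0 ~ TRUE,
                         diff < 0 ~ FALSE))  # a differential of exactly 0 is left as NA
That resulted in this graph.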
ggplot(run_diff, aes(x = reorder(Tm, diff), y = diff, fill = pos)) +
  geom_bar(stat = "identity", colour = "black", linewidth = 0.25) +
  scale_fill_manual(values = c("#FF0000", "#00FF33"), guide = "none") +
  coord_flip() +
  labs(title = "MLB Run Differentials",
       subtitle = "2023 Season",
       x = "Team",
       y = "Run Differential",
       caption = "*Data courtesy of Baseball Reference \n
data as of 4/6/23") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
After I finished fiddling with all of the ggplot bits, I felt pretty proud of this one. It’s simple, but it seems to communicate the information effectively and easily. I thought that the fine people at r/baseball would also appreciate it, so I posted it there.
On the whole, it was well received, which was nice to see. I did, however, encounter my first data communication error. I had pulled this information while games were still going on that day. In my excitement to show my work, I went ahead and posted it before getting complete information for that day. So when I said “data as of 4/6/23” in the caption, it wasn’t entirely clear what point in time the data reflected. Several comments noted this, which was a totally valid issue to raise.
The second thing that I took from this first posting was a few people asking to see salary information in addition to run differential. The idea being that it would be interesting to see whether or not teams who spent more were doing better at this point in the season. So, with this information in hand, I went back to the drawing board and made a few changes.
After I had created the first script for making the datasets and plot for MLB run differentials, I was made aware of the baseballr package which would allow a quicker, more reliable way to get data. Manually pulling data into a sheet is fun and all, but this meant I could hand the code to someone else and they would be able to run it without any issue or extra .csv file.
Here’s how the data reading looked for the second attempt:
library(baseballr)  # FanGraphs team leaderboards
# Team runs scored from FanGraphs
team_batting <- fg_team_batter(2023, 2023, qual = "n", league = "all") %>%
  select(Team, R) %>%
  rename(rs = R)
# Team runs allowed from FanGraphs
team_pitching <- fg_team_pitcher(2023, 2023, qual = "n", league = "all") %>%
  select(Team, R) %>%
  rename(ra = R)
I also found a list of 2023 opening day team payrolls and added a .csv file with that info into the mix. Once I read it in, I joined it with the other two datasets into one cohesive dataset. Here’s that process:
payroll <- read_csv(here("data", "mlb_2023_payroll.csv")) %>%
  select(Team, "2023 Payroll Proj") %>%
  rename(sal = "2023 Payroll Proj")
# Join pitching and batting datasets to get a combined set with runs scored and allowed
run_diff <- inner_join(team_batting, team_pitching, by = "Team") %>%
  mutate(diff = rs - ra,
         pos = case_when(diff > 0 ~ TRUE,
                         diff < 0 ~ FALSE))
# Add payroll; sal2 drops the first and last characters of the payroll string
# before converting the rest to a number
run_diff <- inner_join(run_diff, payroll, by = "Team") %>%
  mutate(sal2 = as.numeric(str_sub(sal, 2, -2)))
I also knew that I wanted to make the plot title dynamic so that I didn’t have to update it every time I ran a new version. To do that, I set a variable for the system date and worked it into the title in the ggplot call. The other change I made to the graph was labeling the bars with the team salary info. Here are both of those things and the resulting graph:
today.date <- Sys.Date()
ggplot(run_diff, aes(x = reorder(Team, diff), y = diff, fill = pos)) +
  geom_bar(stat = "identity", colour = "black", linewidth = 0.25) +
  geom_text(aes(label = sal), size = 2.5,
            position = position_stack(vjust = 0.5), colour = "black", fontface = "bold") +
  scale_fill_manual(values = c("#FF0000", "#00FF33"), guide = "none") +
  coord_flip() +
  labs(title = paste('MLB Run Differentials before start of play on', today.date, sep = " "),
       subtitle = "2023 Season",
       x = "Team",
       y = "Run Differential",
       caption = "*Data courtesy of Fangraphs through baseballr \n
graph by MWooten34") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
This one drew far fewer comments than the first. Several issues were pointed out that I generally agreed with on closer inspection. First, several comments pointed out that adding salary information in this way does not add a lot of value and makes the chart more difficult to interpret. What was previously a graph that was interpretable at a glance was now one where the viewer needed to scan the entire chart to see the salary values and try to put them into the appropriate context. When I initially got the request to add salary information, I didn’t even really consider changing the chart type, just adding the information as an annotation on the graph I had already made. I think in the long run, this made the overall visual display and communication of information worse.
After the second post, I decided to try to create a chart with run differential information and salary information laid out in a meaningful way. Several comments on the second post suggested some sort of scatter plot for run differential with dot size based on team salary. That sounded interesting, so I got to it. I used the mlbplotR package so I could use team logos instead of dots. More fun that way! I wanted to vary the size of the logo based on salary; however, that functionality doesn’t exist in the package at this time. Thankfully, package creator Camden Kay responded to my request for help on r/Sabermetrics and helped me get around that problem by creating a scaling factor for salary and passing it into the argument for logo width.
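To make the scaling idea concrete, here is a minimal sketch with made-up payroll figures (the team names, dollar amounts, and the `payroll_demo` and `max_width` names are hypothetical, not the real 2023 data). Each payroll is divided by the largest payroll and multiplied by a maximum logo width, so the biggest spender gets the full width and every other team gets a proportional fraction of it. The 0.075 maximum matches the value used in the plot code in this post.

```r
# Hypothetical payrolls in millions of dollars (illustrative values only)
payroll_demo <- data.frame(
  team = c("AAA", "BBB", "CCC"),
  sal2 = c(300, 150, 75)
)

max_width <- 0.075  # width given to the team with the largest payroll

# Scale each payroll relative to the largest one
payroll_demo$sal_scale <- max_width * payroll_demo$sal2 / max(payroll_demo$sal2)

print(payroll_demo)
```

Because the scale is relative to the maximum, the logo sizes only show how teams compare with one another, not absolute dollar amounts.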
library(mlbplotR)  # team logo geoms and geom_mean_lines()
# Scale each payroll relative to the largest so it can be used as a logo width
run_diff_scaled <- run_diff %>%
  mutate(sal_scale = as.vector(0.075 * sal2 / max(sal2)))
run_diff_salary <- ggplot(run_diff_scaled, aes(x = ra, y = rs, h_var = rs, v_var = ra)) +
  # alpha is a constant, so it is set outside aes()
  geom_mlb_light_cap_logos(aes(team_abbr = Team, width = sal_scale), alpha = 0.85) +
  geom_mean_lines() +
  labs(title = paste('MLB Run Differentials before start of play on', today.date, sep = " "),
       subtitle = "Logo Size Indicative of Salary",
       x = "Runs Allowed",
       y = "Runs Scored",
       caption = "*Data courtesy of Fangraphs through baseballr \n
Logos courtesy of mlbplotR \n
graph by MWooten34") +
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
# Display the finished plot
print(run_diff_salary)
#ggsave(here("figs", "run_diff_salary.png"))
So there you have it. After several iterations, I’ve settled on something I’m happy with… for now. Though this was a short and relatively inconsequential project, I enjoyed the cycle of taking feedback, incorporating it, and re-sharing the work. Those are skills that are and will continue to be beneficial in projects with more direct impact that I might do in my “real” job. I’d be happy to iterate further if you have suggestions for improving this graphic even more!