Data & Packages
library(tidyverse)
library(ggrepel)
library(viridisLite)
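# Load team batting statistics and team standings
mlb_batting <- read_csv("baseballref.csv")
mlb_standings <- read_csv("mlbstandings22.csv")
# Join the standings onto the batting statistics by team name
mlb <- mlb_batting %>%
  left_join(mlb_standings, by = "Tm")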
Plot 1
- Show the correlation between home runs, batting average, and wins.
- Does the number of home runs or the batting average have a greater
impact on team wins?
mlb %>%
  ggplot(aes(x = BA, y = HR)) +
  # Filled bubbles: size encodes wins, color encodes team
  geom_point(aes(size = W, color = Tm), alpha = .8) +
  # Faint black outline around each bubble
  geom_point(aes(size = W), color = "black", shape = 21, alpha = .15) +
  # Team labels at a fixed text size
  geom_text_repel(aes(label = Tm), size = 3) +
  scale_size(guide = "none") +
  scale_y_continuous(breaks = seq(100, 260, by = 20)) +
  labs(x = "Batting Average", y = "Home Runs", title = "MLB Batting Average vs Home Runs", color = "Team") +
  guides(color = "none") +
  scale_color_viridis_d(option = "magma")

- To answer this question, I created a bubble plot with team batting
average on the x-axis and total home runs on the y-axis. The size of
each bubble reflects team wins. Some teams, like the Yankees, were near
the middle of the league in batting average but led the league in home
runs. Others, like the Guardians, were near the top of the league in
batting average but near the bottom in home runs. Although these teams
were on opposite ends of the spectrum in hitting style, both won a lot
of games. The upper-right quadrant of the plot contains a large number
of teams with a lot of wins. These are the teams that were near the top
of the league in both batting average and home runs.
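
The bubble plot gives a visual impression only. A rough numeric check of the same question could compare how strongly each statistic correlates with wins. This is a minimal sketch, not part of the original analysis: `win_correlations` is a hypothetical helper, and it assumes a data frame with one row per team and numeric columns `BA`, `HR`, and `W`, as in `mlb`.

```r
library(dplyr)

# Hypothetical helper (not in the original analysis): Pearson correlation of
# batting average and of home runs with wins, computed across teams.
win_correlations <- function(df) {
  df %>%
    summarise(
      cor_BA_W = cor(BA, W, use = "complete.obs"),
      cor_HR_W = cor(HR, W, use = "complete.obs")
    )
}

# Usage with the joined data frame built earlier:
# win_correlations(mlb)
```

Whichever coefficient is larger points to the statistic that tracks wins more closely in this season, although with only one row per team the comparison is suggestive rather than conclusive.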
Edit Data
mlb_H_prop <- mlb %>%
  rename(Double = `2B`, Triple = `3B`) %>%
  # Singles are the hits that are not doubles, triples, or home runs
  mutate(Single = H - Double - Triple - HR) %>%
  # Share of each hit type among all hits
  mutate(Singles_prop = Single / H,
         Doubles_prop = Double / H,
         Triples_prop = Triple / H,
         HR_prop = HR / H) %>%
  select(Tm, Singles_prop, Doubles_prop, Triples_prop, HR_prop, Division)
# Reshape to one row per team and hit type for plotting
mlb_long2 <- mlb_H_prop %>%
  pivot_longer(cols = c(Singles_prop, Doubles_prop, Triples_prop, HR_prop),
               names_to = "stat", values_to = "value")
Plot 2
- What are the percentages of each hit type for all the teams in the
AL Central? Does any team have a higher percentage of a specific hit
type than the rest?
mlb_long2 %>%
  filter(Division == "AL Central") %>%
  ggplot(aes(x = Tm, y = value, fill = stat)) +
  geom_col() +
  # Polar coordinates turn each team's stacked bar into a ring
  coord_polar(theta = "y") +
  # The blank leading limits reserve empty space in the middle of the donut
  scale_x_discrete(limits = c(" "," "," ", "Cleveland Guardians", "Chicago White Sox", "Detroit Tigers","Kansas City Royals","Minnesota Twins")) +
  theme_void() +
  theme(legend.position = "right") +
  # Percentage label centered in each segment
  geom_text(aes(label = paste0(round(value * 100), "%")),
            position = position_stack(vjust = .5), size = 3.5, color = "darkgrey", fontface = "bold") +
  # Team name, drawn once per ring from the first row of each team
  geom_text(data = mlb_long2 %>%
              filter(Division == "AL Central" & !duplicated(Tm)),
            aes(label = Tm), position = position_stack(vjust = 0.75), size = 3, color = "black", fontface = "bold") +
  labs(title = "Proportions of Singles, Doubles, Triples, and Home Runs for the AL Central", fill = "Hit Type") +
  # A single fill scale; levels sort alphabetically (Doubles_prop, HR_prop, Singles_prop, Triples_prop)
  scale_fill_manual(values = c('#000004','#fcfdbf', '#b73779','#51127c'),
                    labels = c("Doubles", "Home Runs", "Singles", "Triples"))

- After creating the variables needed to answer this question, I
created a donut plot using geom_col() and coord_polar(). The data shows
that the AL Central teams all hit similar percentages of each hit type.
It is interesting to see that about 13% of the Minnesota Twins' hits
were home runs, while the other teams were at either 9% or 10%. This
makes sense because we can also see that 66% of the Twins' hits were
singles, whereas singles made up 68-70% of the other teams' hits.
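
Because singles are derived by subtraction, the four proportions for each team should add up to 1. A quick sanity check could confirm this before reading percentages off the plot. This is a minimal sketch under stated assumptions: `hit_prop_totals` is a hypothetical helper, and it expects a long data frame with a team column `Tm` and a proportion column `value`, as in `mlb_long2`.

```r
library(dplyr)

# Hypothetical helper (not in the original analysis): total of the hit-type
# proportions for each team; every total should be (numerically) equal to 1.
hit_prop_totals <- function(long_df) {
  long_df %>%
    group_by(Tm) %>%
    summarise(total = sum(value), .groups = "drop")
}

# Usage with the long data frame built earlier:
# hit_prop_totals(mlb_long2)
```

The percentage labels on the plot are rounded to whole numbers, so the labels on a given ring may sum to 99% or 101% even when the underlying totals are exactly 1.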