Final Project

Data & Packages

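library(tidyverse)
library(ggrepel)
library(viridisLite)
# Team batting totals and final standings
mlb_batting <- read_csv("baseballref.csv")
mlb_standings <- read_csv("mlbstandings22.csv")
# Attach each team's standings to its batting line, matched on team name
mlb <- mlb_batting %>% 
  left_join(mlb_standings, by = "Tm")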

Plot 1

Show the Correlation between Home Runs, Batting Average and Wins
Does the number of home runs or batting average have a greater impact on team wins?

mlb %>% 
  ggplot(aes(x = BA, y = HR)) +
  # Filled bubbles: one color per team, sized by wins
  geom_point(aes(size = W, color = Tm), alpha = .8) +
  # Faint black outline around each bubble
  geom_point(aes(size = W), color = "black", shape = 21, alpha = .15) +
  # Team labels at a fixed text size
  geom_text_repel(aes(label = Tm), size = 3) +
  scale_size(guide = "none") +
  scale_y_continuous(breaks = seq(100, 260, by = 20)) +
  labs(x = "Batting Average", y = "Home Runs", title = "MLB Batting Average vs Home Runs") +
  guides(color = "none") +
  scale_color_viridis_d(option = "magma")

To answer this question, I created a bubble plot with team batting average on the x-axis and total home runs on the y-axis. The size of each bubble reflects the team's wins. Some teams, like the Yankees, were near the middle of the league in batting average but led the league in home runs. Others, like the Guardians, were near the top of the league in batting average but near the bottom in home runs. Although these teams sat at opposite ends of the spectrum in hitting style, both were successful at winning games. The upper-right quadrant of the plot contains a large number of high-win teams: those near the top of the league in both batting average and home runs.
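The bubble plot shows the pattern visually; a rough numeric companion is to correlate each statistic with wins. The sketch below is not part of the original analysis: `wins_correlation()` is a hypothetical helper, and it assumes the data frame has numeric columns named `BA`, `HR`, and `W`, as in the joined `mlb` table.

```r
# Hypothetical helper (not from the original analysis): Pearson correlation of
# batting average and home runs with wins. Assumes numeric columns BA, HR, W.
wins_correlation <- function(df) {
  c(
    # use = "complete.obs" skips rows where either value is missing
    BA = cor(df$BA, df$W, use = "complete.obs"),
    HR = cor(df$HR, df$W, use = "complete.obs")
  )
}

# Intended usage with the joined table built above:
# wins_correlation(mlb)
```

A larger absolute value indicates a stronger linear association with wins, though with a single season of team totals this is only suggestive and says nothing about causation.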

Edit Data

mlb_H_prop <- mlb %>%
  rename(Double = `2B`, Triple = `3B`) %>% 
  # Singles are the hits that are not doubles, triples, or home runs
  mutate(Single = H - Double - Triple - HR) %>% 
  # Share of total hits for each hit type
  mutate(Singles_prop = Single / H,
         Doubles_prop = Double / H,
         Triples_prop = Triple / H,
         HR_prop = HR / H) %>% 
  select(Tm, Singles_prop, Doubles_prop, Triples_prop, HR_prop, Division)

# One row per team and hit type, for stacking in the plot
mlb_long2 <- mlb_H_prop %>%
  pivot_longer(cols = c(Singles_prop, Doubles_prop, Triples_prop, HR_prop), names_to = "stat", values_to = "value")
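Because singles are derived by subtraction, the four shares for each team should add up to 1. The following is a small sanity check under that assumption, not something the original report ran: `check_hit_shares()` is a hypothetical helper that expects the long table's `Tm` and `value` columns.

```r
# Hypothetical helper: sum each team's hit-type shares and stop if any team's
# total is missing or differs from 1 by more than a small tolerance.
check_hit_shares <- function(long_df, tol = 1e-8) {
  totals <- tapply(long_df$value, long_df$Tm, sum)
  bad <- names(totals)[is.na(totals) | abs(totals - 1) > tol]
  if (length(bad) > 0) {
    stop("Shares do not sum to 1 for: ", paste(bad, collapse = ", "))
  }
  # Return the per-team totals invisibly so the call stays quiet on success
  invisible(totals)
}

# Intended usage with the reshaped table:
# check_hit_shares(mlb_long2)
```

A failure here would most likely point to a missing value in one of the hit columns rather than to the reshaping step.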

Plot 2

What are the percentages of each hit type for all the teams in the AL Central? Does any team have a higher percentage of a specific hit type than the rest?

mlb_long2 %>% 
  select(Tm, stat, value, Division) %>%
  filter(Division == "AL Central") %>% 
  ggplot(aes(x = Tm, y = value, fill = stat)) +
  geom_col() +
  coord_polar(theta = "y") +
  # Blank leading entries leave empty inner rings, forming the donut hole
  scale_x_discrete(limits = c(" "," "," ", "Cleveland Guardians", "Chicago White Sox", "Detroit Tigers","Kansas City Royals","Minnesota Twins")) +
  theme_void() +
  theme(legend.position = "right") +
  # Percentage label centered in each segment
  geom_text(aes(label = paste0(round(value * 100), "%")),
            position = position_stack(vjust = .5), size = 3.5, color = "darkgrey", fontface = 'bold') + 
  # Team name drawn once per ring
  geom_text(data = mlb_long2 %>% 
              select(Tm, stat, value, Division) %>%
              filter(Division == "AL Central" & !duplicated(Tm)),
            aes(label = Tm), position = position_stack(vjust = 0.75), size = 3, color = "black", fontface = 'bold') +
  labs(title = "Proportions of Singles, Doubles, Triples, and Home Runs for the AL Central", fill = "Hit Type") +
  # Single fill scale; values and labels follow the alphabetical order of stat
  scale_fill_manual(values = c('#000004','#fcfdbf', '#b73779','#51127c'),
                    labels = c("Doubles", "Home Runs", "Singles", "Triples"))

After creating the needed variables to answer this question, I created a donut plot using geom_col() and coord_polar(). The data show that the AL Central teams all had similar percentages of each hit type. Notably, about 13% of the Minnesota Twins' hits were home runs, while the other teams were at 9% or 10%. This is consistent with the Twins' share of singles: 66% of their hits were singles, whereas singles made up 68-70% of the other teams' hits.
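The comparison read off the donut plot can also be made directly from the long table. As a minimal sketch, assuming columns named `Tm`, `stat`, and `value` as in `mlb_long2`, the hypothetical helper `top_team_by_stat()` below returns the team with the largest share for each hit type.

```r
# Hypothetical helper: for each hit type, keep the row with the largest share.
# Ties go to the first row encountered, because which.max() returns the first
# maximum.
top_team_by_stat <- function(long_df) {
  by_stat <- split(long_df, long_df$stat)
  top <- lapply(by_stat, function(d) {
    d[which.max(d$value), c("Tm", "stat", "value")]
  })
  do.call(rbind, top)
}

# Intended usage for the division plotted above:
# top_team_by_stat(subset(mlb_long2, Division == "AL Central"))
```

This only reports the leader for each hit type; how far that team sits from the rest still has to be judged from the plot or the underlying values.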

Final Project

Michael Rigo

2023-04-12
