Overview

This is the first part of my Project 2 assignment for DATA607 in the Fall 2023 Term at CUNY SPS. In this assignment I import a wide data set, tidy it, and then analyze it. I created this first data set, which contains the seasonal home run totals for the starting player from each position for each team in the American League Eastern Division of Major League Baseball from 2018 to 2023 (excluding the pandemic-shortened 2020 season).

Tidying Data

In this code block, I load the necessary libraries, import the data from my GitHub repository, and rename the columns.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.4     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(knitr)

hr_data <- read.csv("https://raw.githubusercontent.com/Marley-Myrianthopoulos/Data607Project2/main/HR_Data_607.csv")

colnames(hr_data) <- c("Team", "Position", "2018", "2019", "2021", "2022", "2023")

kable(hr_data, format = "pipe", caption = "Initial Homerun Data", align = "lcccccc")
Initial Homerun Data
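| Team | Position | 2018 | 2019 | 2021 | 2022 | 2023 |
|:-----|:--------:|:----:|:----:|:----:|:----:|:----:|
| BAL | C | 3 | 13 | 11 | 13 | 20 |
|     | 1B | 16 | 12 | 33 | 22 | 18 |
|     | 2B | 17 | 24 | 5 | 13 | 13 |
|     | 3B | 24 | 6 | 9 | 13 | 7 |
|     | SS | 7 | 12 | 11 | 16 | 4 |
|     | LF | 24 | 13 | 22 | 16 | 16 |
|     | CF | 15 | 10 | 30 | 16 | 15 |
|     | RF | 8 | 35 | 18 | 33 | 28 |
|     | DH | 17 | 31 | 21 | 10 | 14 |
| BOS | C | 5 | 23 | 6 | 8 | 9 |
|     | 1B | 15 | 19 | 25 | 12 | 24 |
|     | 2B | 10 | 3 | 6 | 16 | 3 |
|     | 3B | 23 | 33 | 23 | 15 | 6 |
|     | SS | 21 | 32 | 38 | 27 | 33 |
|     | LF | 16 | 13 | 13 | 11 | 15 |
|     | CF | 13 | 21 | 20 | 6 | 8 |
|     | RF | 32 | 29 | 31 | 3 | 13 |
|     | DH | 43 | 36 | 28 | 16 | 23 |
| NYY | C | 18 | 34 | 23 | 11 | 10 |
|     | 1B | 11 | 21 | 8 | 32 | 12 |
|     | 2B | 24 | 26 | 10 | 24 | 25 |
|     | 3B | 27 | 16 | 9 | 4 | 21 |
|     | SS | 27 | 21 | 14 | 15 | 15 |
|     | LF | 12 | 13 | 13 | 8 | 5 |
|     | CF | 27 | 28 | 10 | 62 | 7 |
|     | RF | 27 | 27 | 39 | 12 | 37 |
|     | DH | 38 | 13 | 35 | 31 | 24 |
| TBR | C | 14 | 9 | 33 | 6 | 11 |
|     | 1B | 11 | 19 | 13 | 11 | 22 |
|     | 2B | 7 | 17 | 39 | 8 | 21 |
|     | 3B | 10 | 20 | 7 | 8 | 17 |
|     | SS | 4 | 14 | 11 | 9 | 31 |
|     | LF | 7 | 21 | 27 | 20 | 23 |
|     | CF | 7 | 14 | 4 | 7 | 25 |
|     | RF | 9 | 20 | 10 | 4 | 20 |
|     | DH | 30 | 33 | 13 | 6 | 12 |
| TOR | C | 10 | 13 | 1 | 14 | 8 |
|     | 1B | 25 | 22 | 48 | 32 | 26 |
|     | 2B | 11 | 16 | 45 | 7 | 11 |
|     | 3B | 18 | 18 | 29 | 24 | 20 |
|     | SS | 17 | 15 | 2 | 27 | 17 |
|     | LF | 22 | 20 | 21 | 5 | 20 |
|     | CF | 15 | 26 | 22 | 25 | 8 |
|     | RF | 25 | 31 | 32 | 25 | 21 |
|     | DH | 21 | 21 | 22 | 4 | 19 |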
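One likely reason the renaming step is needed, assuming the CSV header uses bare years such as 2018 (the file itself is not reproduced here): by default, read.csv() converts column names into syntactically valid R names, so a header of 2018 would be imported as X2018. The sketch below uses a made-up two-row CSV string in place of the real file.

```r
# Made-up CSV text standing in for the real file; the header layout is an assumption.
csv_text <- "Team,Position,2018,2019\nBAL,C,3,13\n,1B,16,12"

# Default behaviour: check.names = TRUE rewrites names that are not valid R names.
default_names <- names(read.csv(text = csv_text))

# With check.names = FALSE the header is kept exactly as written.
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(raw_names)
```

Passing check.names = FALSE would be an alternative to renaming the columns by hand, although explicit renaming also documents the intended column names.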

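In this code block, I tidy the data. Because each team’s abbreviation appears only on the first row for that team, I first convert the blank Team cells to NA and fill them downward. I then use pivot_longer() to convert the data into a format that includes a variable for the season, rather than having each season be a separate column.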

tidy_hr_data <- hr_data %>%
  mutate(Team = na_if(Team, "")) %>%
  fill(Team) %>%
  pivot_longer(
    cols = -c("Team", "Position"),
    names_to = "Season",
    values_to = "HRs")

kable(tidy_hr_data, format = "pipe", caption = "Tidy Homerun Data", align = "lccc")
Tidy Homerun Data
Team Position Season HRs
BAL C 2018 3
BAL C 2019 13
BAL C 2021 11
BAL C 2022 13
BAL C 2023 20
BAL 1B 2018 16
BAL 1B 2019 12
BAL 1B 2021 33
BAL 1B 2022 22
BAL 1B 2023 18
BAL 2B 2018 17
BAL 2B 2019 24
BAL 2B 2021 5
BAL 2B 2022 13
BAL 2B 2023 13
BAL 3B 2018 24
BAL 3B 2019 6
BAL 3B 2021 9
BAL 3B 2022 13
BAL 3B 2023 7
BAL SS 2018 7
BAL SS 2019 12
BAL SS 2021 11
BAL SS 2022 16
BAL SS 2023 4
BAL LF 2018 24
BAL LF 2019 13
BAL LF 2021 22
BAL LF 2022 16
BAL LF 2023 16
BAL CF 2018 15
BAL CF 2019 10
BAL CF 2021 30
BAL CF 2022 16
BAL CF 2023 15
BAL RF 2018 8
BAL RF 2019 35
BAL RF 2021 18
BAL RF 2022 33
BAL RF 2023 28
BAL DH 2018 17
BAL DH 2019 31
BAL DH 2021 21
BAL DH 2022 10
BAL DH 2023 14
BOS C 2018 5
BOS C 2019 23
BOS C 2021 6
BOS C 2022 8
BOS C 2023 9
BOS 1B 2018 15
BOS 1B 2019 19
BOS 1B 2021 25
BOS 1B 2022 12
BOS 1B 2023 24
BOS 2B 2018 10
BOS 2B 2019 3
BOS 2B 2021 6
BOS 2B 2022 16
BOS 2B 2023 3
BOS 3B 2018 23
BOS 3B 2019 33
BOS 3B 2021 23
BOS 3B 2022 15
BOS 3B 2023 6
BOS SS 2018 21
BOS SS 2019 32
BOS SS 2021 38
BOS SS 2022 27
BOS SS 2023 33
BOS LF 2018 16
BOS LF 2019 13
BOS LF 2021 13
BOS LF 2022 11
BOS LF 2023 15
BOS CF 2018 13
BOS CF 2019 21
BOS CF 2021 20
BOS CF 2022 6
BOS CF 2023 8
BOS RF 2018 32
BOS RF 2019 29
BOS RF 2021 31
BOS RF 2022 3
BOS RF 2023 13
BOS DH 2018 43
BOS DH 2019 36
BOS DH 2021 28
BOS DH 2022 16
BOS DH 2023 23
NYY C 2018 18
NYY C 2019 34
NYY C 2021 23
NYY C 2022 11
NYY C 2023 10
NYY 1B 2018 11
NYY 1B 2019 21
NYY 1B 2021 8
NYY 1B 2022 32
NYY 1B 2023 12
NYY 2B 2018 24
NYY 2B 2019 26
NYY 2B 2021 10
NYY 2B 2022 24
NYY 2B 2023 25
NYY 3B 2018 27
NYY 3B 2019 16
NYY 3B 2021 9
NYY 3B 2022 4
NYY 3B 2023 21
NYY SS 2018 27
NYY SS 2019 21
NYY SS 2021 14
NYY SS 2022 15
NYY SS 2023 15
NYY LF 2018 12
NYY LF 2019 13
NYY LF 2021 13
NYY LF 2022 8
NYY LF 2023 5
NYY CF 2018 27
NYY CF 2019 28
NYY CF 2021 10
NYY CF 2022 62
NYY CF 2023 7
NYY RF 2018 27
NYY RF 2019 27
NYY RF 2021 39
NYY RF 2022 12
NYY RF 2023 37
NYY DH 2018 38
NYY DH 2019 13
NYY DH 2021 35
NYY DH 2022 31
NYY DH 2023 24
TBR C 2018 14
TBR C 2019 9
TBR C 2021 33
TBR C 2022 6
TBR C 2023 11
TBR 1B 2018 11
TBR 1B 2019 19
TBR 1B 2021 13
TBR 1B 2022 11
TBR 1B 2023 22
TBR 2B 2018 7
TBR 2B 2019 17
TBR 2B 2021 39
TBR 2B 2022 8
TBR 2B 2023 21
TBR 3B 2018 10
TBR 3B 2019 20
TBR 3B 2021 7
TBR 3B 2022 8
TBR 3B 2023 17
TBR SS 2018 4
TBR SS 2019 14
TBR SS 2021 11
TBR SS 2022 9
TBR SS 2023 31
TBR LF 2018 7
TBR LF 2019 21
TBR LF 2021 27
TBR LF 2022 20
TBR LF 2023 23
TBR CF 2018 7
TBR CF 2019 14
TBR CF 2021 4
TBR CF 2022 7
TBR CF 2023 25
TBR RF 2018 9
TBR RF 2019 20
TBR RF 2021 10
TBR RF 2022 4
TBR RF 2023 20
TBR DH 2018 30
TBR DH 2019 33
TBR DH 2021 13
TBR DH 2022 6
TBR DH 2023 12
TOR C 2018 10
TOR C 2019 13
TOR C 2021 1
TOR C 2022 14
TOR C 2023 8
TOR 1B 2018 25
TOR 1B 2019 22
TOR 1B 2021 48
TOR 1B 2022 32
TOR 1B 2023 26
TOR 2B 2018 11
TOR 2B 2019 16
TOR 2B 2021 45
TOR 2B 2022 7
TOR 2B 2023 11
TOR 3B 2018 18
TOR 3B 2019 18
TOR 3B 2021 29
TOR 3B 2022 24
TOR 3B 2023 20
TOR SS 2018 17
TOR SS 2019 15
TOR SS 2021 2
TOR SS 2022 27
TOR SS 2023 17
TOR LF 2018 22
TOR LF 2019 20
TOR LF 2021 21
TOR LF 2022 5
TOR LF 2023 20
TOR CF 2018 15
TOR CF 2019 26
TOR CF 2021 22
TOR CF 2022 25
TOR CF 2023 8
TOR RF 2018 25
TOR RF 2019 31
TOR RF 2021 32
TOR RF 2022 25
TOR RF 2023 21
TOR DH 2018 21
TOR DH 2019 21
TOR DH 2021 22
TOR DH 2022 4
TOR DH 2023 19
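To make the tidying steps easier to follow, here is a minimal sketch that applies the same three calls to a three-row excerpt of the wide table. The object names toy_wide and toy_long are illustrative only and are not part of the analysis.

```r
library(tidyr)
library(dplyr)

# Three-row excerpt in the same wide layout as hr_data; the team label
# appears only on the first row for each team.
toy_wide <- data.frame(
  Team = c("BAL", "", "BOS"),
  Position = c("C", "1B", "C"),
  `2018` = c(3, 16, 5),
  `2019` = c(13, 12, 23),
  check.names = FALSE
)

toy_long <- toy_wide %>%
  mutate(Team = na_if(Team, "")) %>%  # blank strings become NA
  fill(Team) %>%                       # carry the last team label downward
  pivot_longer(
    cols = -c("Team", "Position"),     # every column except the identifiers
    names_to = "Season",               # old column names become Season values
    values_to = "HRs"                  # old cell values become HRs
  )

print(toy_long)
```

Each wide row turns into one long row per season, so three rows with two season columns become six rows. The Season column holds character values because it is built from column names.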

I wanted to replace the names of the fielding positions with the numbers that represent them on a baseball scorecard, and I have been meaning to practice joins. In this code block, I create a new data frame with the scorecard position number for each position, use a join to add this information to the tidy data frame, reorder the columns (dropping the old position column in the process), and sort the rows by position, season, and team. The designated hitter has no scorecard number, so it keeps the label “DH”. With that, the data is tidy and ready to analyze.

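# Lookup table mapping each position to its scorecard number.
# Pos is stored as character because "DH" has no scorecard number.
Position <- c("C", "1B", "2B", "3B", "SS", "LF", "CF", "RF", "DH")
Pos <- c("2", "3", "4", "5", "6", "7", "8", "9", "DH")
positions_data <- data.frame(Position, Pos)

# Join the lookup table onto the tidy data by position.
full_hr_data <- full_join(tidy_hr_data, positions_data, by = join_by(Position))

# Keep Team, Season, Pos, and HRs, dropping the original Position column.
full_hr_data <- full_hr_data[, c("Team", "Season", "Pos", "HRs")]

# Sort by position, then season, then team.
full_hr_data <- full_hr_data[order(full_hr_data$Pos, full_hr_data$Season, full_hr_data$Team), ]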

kable(full_hr_data, format = "pipe", caption = "Tidy Homerun Data with Scorecard Position Numbers", align = "lccc")
Tidy Homerun Data with Scorecard Position Numbers
Team Season Pos HRs
BAL 2018 2 3
BOS 2018 2 5
NYY 2018 2 18
TBR 2018 2 14
TOR 2018 2 10
BAL 2019 2 13
BOS 2019 2 23
NYY 2019 2 34
TBR 2019 2 9
TOR 2019 2 13
BAL 2021 2 11
BOS 2021 2 6
NYY 2021 2 23
TBR 2021 2 33
TOR 2021 2 1
BAL 2022 2 13
BOS 2022 2 8
NYY 2022 2 11
TBR 2022 2 6
TOR 2022 2 14
BAL 2023 2 20
BOS 2023 2 9
NYY 2023 2 10
TBR 2023 2 11
TOR 2023 2 8
BAL 2018 3 16
BOS 2018 3 15
NYY 2018 3 11
TBR 2018 3 11
TOR 2018 3 25
BAL 2019 3 12
BOS 2019 3 19
NYY 2019 3 21
TBR 2019 3 19
TOR 2019 3 22
BAL 2021 3 33
BOS 2021 3 25
NYY 2021 3 8
TBR 2021 3 13
TOR 2021 3 48
BAL 2022 3 22
BOS 2022 3 12
NYY 2022 3 32
TBR 2022 3 11
TOR 2022 3 32
BAL 2023 3 18
BOS 2023 3 24
NYY 2023 3 12
TBR 2023 3 22
TOR 2023 3 26
BAL 2018 4 17
BOS 2018 4 10
NYY 2018 4 24
TBR 2018 4 7
TOR 2018 4 11
BAL 2019 4 24
BOS 2019 4 3
NYY 2019 4 26
TBR 2019 4 17
TOR 2019 4 16
BAL 2021 4 5
BOS 2021 4 6
NYY 2021 4 10
TBR 2021 4 39
TOR 2021 4 45
BAL 2022 4 13
BOS 2022 4 16
NYY 2022 4 24
TBR 2022 4 8
TOR 2022 4 7
BAL 2023 4 13
BOS 2023 4 3
NYY 2023 4 25
TBR 2023 4 21
TOR 2023 4 11
BAL 2018 5 24
BOS 2018 5 23
NYY 2018 5 27
TBR 2018 5 10
TOR 2018 5 18
BAL 2019 5 6
BOS 2019 5 33
NYY 2019 5 16
TBR 2019 5 20
TOR 2019 5 18
BAL 2021 5 9
BOS 2021 5 23
NYY 2021 5 9
TBR 2021 5 7
TOR 2021 5 29
BAL 2022 5 13
BOS 2022 5 15
NYY 2022 5 4
TBR 2022 5 8
TOR 2022 5 24
BAL 2023 5 7
BOS 2023 5 6
NYY 2023 5 21
TBR 2023 5 17
TOR 2023 5 20
BAL 2018 6 7
BOS 2018 6 21
NYY 2018 6 27
TBR 2018 6 4
TOR 2018 6 17
BAL 2019 6 12
BOS 2019 6 32
NYY 2019 6 21
TBR 2019 6 14
TOR 2019 6 15
BAL 2021 6 11
BOS 2021 6 38
NYY 2021 6 14
TBR 2021 6 11
TOR 2021 6 2
BAL 2022 6 16
BOS 2022 6 27
NYY 2022 6 15
TBR 2022 6 9
TOR 2022 6 27
BAL 2023 6 4
BOS 2023 6 33
NYY 2023 6 15
TBR 2023 6 31
TOR 2023 6 17
BAL 2018 7 24
BOS 2018 7 16
NYY 2018 7 12
TBR 2018 7 7
TOR 2018 7 22
BAL 2019 7 13
BOS 2019 7 13
NYY 2019 7 13
TBR 2019 7 21
TOR 2019 7 20
BAL 2021 7 22
BOS 2021 7 13
NYY 2021 7 13
TBR 2021 7 27
TOR 2021 7 21
BAL 2022 7 16
BOS 2022 7 11
NYY 2022 7 8
TBR 2022 7 20
TOR 2022 7 5
BAL 2023 7 16
BOS 2023 7 15
NYY 2023 7 5
TBR 2023 7 23
TOR 2023 7 20
BAL 2018 8 15
BOS 2018 8 13
NYY 2018 8 27
TBR 2018 8 7
TOR 2018 8 15
BAL 2019 8 10
BOS 2019 8 21
NYY 2019 8 28
TBR 2019 8 14
TOR 2019 8 26
BAL 2021 8 30
BOS 2021 8 20
NYY 2021 8 10
TBR 2021 8 4
TOR 2021 8 22
BAL 2022 8 16
BOS 2022 8 6
NYY 2022 8 62
TBR 2022 8 7
TOR 2022 8 25
BAL 2023 8 15
BOS 2023 8 8
NYY 2023 8 7
TBR 2023 8 25
TOR 2023 8 8
BAL 2018 9 8
BOS 2018 9 32
NYY 2018 9 27
TBR 2018 9 9
TOR 2018 9 25
BAL 2019 9 35
BOS 2019 9 29
NYY 2019 9 27
TBR 2019 9 20
TOR 2019 9 31
BAL 2021 9 18
BOS 2021 9 31
NYY 2021 9 39
TBR 2021 9 10
TOR 2021 9 32
BAL 2022 9 33
BOS 2022 9 3
NYY 2022 9 12
TBR 2022 9 4
TOR 2022 9 25
BAL 2023 9 28
BOS 2023 9 13
NYY 2023 9 37
TBR 2023 9 20
TOR 2023 9 21
BAL 2018 DH 17
BOS 2018 DH 43
NYY 2018 DH 38
TBR 2018 DH 30
TOR 2018 DH 21
BAL 2019 DH 31
BOS 2019 DH 36
NYY 2019 DH 13
TBR 2019 DH 33
TOR 2019 DH 21
BAL 2021 DH 21
BOS 2021 DH 28
NYY 2021 DH 35
TBR 2021 DH 13
TOR 2021 DH 22
BAL 2022 DH 10
BOS 2022 DH 16
NYY 2022 DH 31
TBR 2022 DH 6
TOR 2022 DH 4
BAL 2023 DH 14
BOS 2023 DH 23
NYY 2023 DH 24
TBR 2023 DH 12
TOR 2023 DH 19
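A full join keeps unmatched rows from both tables, so a misspelled position would show up as extra rows or missing values instead of being dropped silently. The following is a minimal sketch of how that could be checked on a small excerpt; toy_tidy and toy_lookup are illustrative names, not objects from the analysis.

```r
library(dplyr)

# Excerpt of the tidy data (2018 Orioles) and of the position lookup table.
toy_tidy <- data.frame(
  Team = c("BAL", "BAL", "BAL"),
  Position = c("C", "1B", "DH"),
  HRs = c(3, 16, 17)
)
toy_lookup <- data.frame(
  Position = c("C", "1B", "DH"),
  Pos = c("2", "3", "DH")
)

joined <- full_join(toy_tidy, toy_lookup, by = join_by(Position))

# With a clean lookup table the row count is unchanged and nothing is missing.
stopifnot(nrow(joined) == nrow(toy_tidy))
stopifnot(!anyNA(joined))

# anti_join() lists any rows whose position has no entry in the lookup table.
unmatched <- anti_join(toy_tidy, toy_lookup, by = join_by(Position))
print(nrow(unmatched))
```

Because every position in the data has a match here, a left join would give the same result; the full join mainly helps by making mismatches visible.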

Data Analysis

To analyze this data, I want to see which team got the most (and least) “plus” value from one of their players in terms of home runs. To determine this, I will divide each player’s home run total for each year by the average number of home runs for their position that season. It’s important to group by position because some positions in baseball are more technically difficult defensively, so players in those positions are not expected to produce as much offensive output as players in less difficult defensive positions are. To make sure I’m considering multiple ways of interpreting the data, I will compare each player’s home runs to the mean and the median number of home runs from their position for the season.

In this code block, I add columns with the mean and median home runs for that season and position, and then columns showing each player’s home run total for the season divided by the mean and median for their position that season.

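expanded_hr_data <- full_hr_data %>%
  group_by(Season, Pos) %>%
  mutate(
    yr_pos_mean = mean(HRs),                      # mean HRs for this season and position
    yr_pos_median = median(HRs),                  # median HRs for this season and position
    mean_adj = round(HRs / yr_pos_mean, 2),       # player's HRs relative to the group mean
    median_adj = round(HRs / yr_pos_median, 2)    # player's HRs relative to the group median
  ) %>%
  ungroup()  # drop the grouping so later steps operate on the whole table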

kable(expanded_hr_data, format = "pipe", caption = "Tidy Homerun Data with Grouped Home Run Information", align = "lccccccc")
Tidy Homerun Data with Grouped Home Run Information
Team Season Pos HRs yr_pos_mean yr_pos_median mean_adj median_adj
BAL 2018 2 3 10.0 10 0.30 0.30
BOS 2018 2 5 10.0 10 0.50 0.50
NYY 2018 2 18 10.0 10 1.80 1.80
TBR 2018 2 14 10.0 10 1.40 1.40
TOR 2018 2 10 10.0 10 1.00 1.00
BAL 2019 2 13 18.4 13 0.71 1.00
BOS 2019 2 23 18.4 13 1.25 1.77
NYY 2019 2 34 18.4 13 1.85 2.62
TBR 2019 2 9 18.4 13 0.49 0.69
TOR 2019 2 13 18.4 13 0.71 1.00
BAL 2021 2 11 14.8 11 0.74 1.00
BOS 2021 2 6 14.8 11 0.41 0.55
NYY 2021 2 23 14.8 11 1.55 2.09
TBR 2021 2 33 14.8 11 2.23 3.00
TOR 2021 2 1 14.8 11 0.07 0.09
BAL 2022 2 13 10.4 11 1.25 1.18
BOS 2022 2 8 10.4 11 0.77 0.73
NYY 2022 2 11 10.4 11 1.06 1.00
TBR 2022 2 6 10.4 11 0.58 0.55
TOR 2022 2 14 10.4 11 1.35 1.27
BAL 2023 2 20 11.6 10 1.72 2.00
BOS 2023 2 9 11.6 10 0.78 0.90
NYY 2023 2 10 11.6 10 0.86 1.00
TBR 2023 2 11 11.6 10 0.95 1.10
TOR 2023 2 8 11.6 10 0.69 0.80
BAL 2018 3 16 15.6 15 1.03 1.07
BOS 2018 3 15 15.6 15 0.96 1.00
NYY 2018 3 11 15.6 15 0.71 0.73
TBR 2018 3 11 15.6 15 0.71 0.73
TOR 2018 3 25 15.6 15 1.60 1.67
BAL 2019 3 12 18.6 19 0.65 0.63
BOS 2019 3 19 18.6 19 1.02 1.00
NYY 2019 3 21 18.6 19 1.13 1.11
TBR 2019 3 19 18.6 19 1.02 1.00
TOR 2019 3 22 18.6 19 1.18 1.16
BAL 2021 3 33 25.4 25 1.30 1.32
BOS 2021 3 25 25.4 25 0.98 1.00
NYY 2021 3 8 25.4 25 0.31 0.32
TBR 2021 3 13 25.4 25 0.51 0.52
TOR 2021 3 48 25.4 25 1.89 1.92
BAL 2022 3 22 21.8 22 1.01 1.00
BOS 2022 3 12 21.8 22 0.55 0.55
NYY 2022 3 32 21.8 22 1.47 1.45
TBR 2022 3 11 21.8 22 0.50 0.50
TOR 2022 3 32 21.8 22 1.47 1.45
BAL 2023 3 18 20.4 22 0.88 0.82
BOS 2023 3 24 20.4 22 1.18 1.09
NYY 2023 3 12 20.4 22 0.59 0.55
TBR 2023 3 22 20.4 22 1.08 1.00
TOR 2023 3 26 20.4 22 1.27 1.18
BAL 2018 4 17 13.8 11 1.23 1.55
BOS 2018 4 10 13.8 11 0.72 0.91
NYY 2018 4 24 13.8 11 1.74 2.18
TBR 2018 4 7 13.8 11 0.51 0.64
TOR 2018 4 11 13.8 11 0.80 1.00
BAL 2019 4 24 17.2 17 1.40 1.41
BOS 2019 4 3 17.2 17 0.17 0.18
NYY 2019 4 26 17.2 17 1.51 1.53
TBR 2019 4 17 17.2 17 0.99 1.00
TOR 2019 4 16 17.2 17 0.93 0.94
BAL 2021 4 5 21.0 10 0.24 0.50
BOS 2021 4 6 21.0 10 0.29 0.60
NYY 2021 4 10 21.0 10 0.48 1.00
TBR 2021 4 39 21.0 10 1.86 3.90
TOR 2021 4 45 21.0 10 2.14 4.50
BAL 2022 4 13 13.6 13 0.96 1.00
BOS 2022 4 16 13.6 13 1.18 1.23
NYY 2022 4 24 13.6 13 1.76 1.85
TBR 2022 4 8 13.6 13 0.59 0.62
TOR 2022 4 7 13.6 13 0.51 0.54
BAL 2023 4 13 14.6 13 0.89 1.00
BOS 2023 4 3 14.6 13 0.21 0.23
NYY 2023 4 25 14.6 13 1.71 1.92
TBR 2023 4 21 14.6 13 1.44 1.62
TOR 2023 4 11 14.6 13 0.75 0.85
BAL 2018 5 24 20.4 23 1.18 1.04
BOS 2018 5 23 20.4 23 1.13 1.00
NYY 2018 5 27 20.4 23 1.32 1.17
TBR 2018 5 10 20.4 23 0.49 0.43
TOR 2018 5 18 20.4 23 0.88 0.78
BAL 2019 5 6 18.6 18 0.32 0.33
BOS 2019 5 33 18.6 18 1.77 1.83
NYY 2019 5 16 18.6 18 0.86 0.89
TBR 2019 5 20 18.6 18 1.08 1.11
TOR 2019 5 18 18.6 18 0.97 1.00
BAL 2021 5 9 15.4 9 0.58 1.00
BOS 2021 5 23 15.4 9 1.49 2.56
NYY 2021 5 9 15.4 9 0.58 1.00
TBR 2021 5 7 15.4 9 0.45 0.78
TOR 2021 5 29 15.4 9 1.88 3.22
BAL 2022 5 13 12.8 13 1.02 1.00
BOS 2022 5 15 12.8 13 1.17 1.15
NYY 2022 5 4 12.8 13 0.31 0.31
TBR 2022 5 8 12.8 13 0.62 0.62
TOR 2022 5 24 12.8 13 1.88 1.85
BAL 2023 5 7 14.2 17 0.49 0.41
BOS 2023 5 6 14.2 17 0.42 0.35
NYY 2023 5 21 14.2 17 1.48 1.24
TBR 2023 5 17 14.2 17 1.20 1.00
TOR 2023 5 20 14.2 17 1.41 1.18
BAL 2018 6 7 15.2 17 0.46 0.41
BOS 2018 6 21 15.2 17 1.38 1.24
NYY 2018 6 27 15.2 17 1.78 1.59
TBR 2018 6 4 15.2 17 0.26 0.24
TOR 2018 6 17 15.2 17 1.12 1.00
BAL 2019 6 12 18.8 15 0.64 0.80
BOS 2019 6 32 18.8 15 1.70 2.13
NYY 2019 6 21 18.8 15 1.12 1.40
TBR 2019 6 14 18.8 15 0.74 0.93
TOR 2019 6 15 18.8 15 0.80 1.00
BAL 2021 6 11 15.2 11 0.72 1.00
BOS 2021 6 38 15.2 11 2.50 3.45
NYY 2021 6 14 15.2 11 0.92 1.27
TBR 2021 6 11 15.2 11 0.72 1.00
TOR 2021 6 2 15.2 11 0.13 0.18
BAL 2022 6 16 18.8 16 0.85 1.00
BOS 2022 6 27 18.8 16 1.44 1.69
NYY 2022 6 15 18.8 16 0.80 0.94
TBR 2022 6 9 18.8 16 0.48 0.56
TOR 2022 6 27 18.8 16 1.44 1.69
BAL 2023 6 4 20.0 17 0.20 0.24
BOS 2023 6 33 20.0 17 1.65 1.94
NYY 2023 6 15 20.0 17 0.75 0.88
TBR 2023 6 31 20.0 17 1.55 1.82
TOR 2023 6 17 20.0 17 0.85 1.00
BAL 2018 7 24 16.2 16 1.48 1.50
BOS 2018 7 16 16.2 16 0.99 1.00
NYY 2018 7 12 16.2 16 0.74 0.75
TBR 2018 7 7 16.2 16 0.43 0.44
TOR 2018 7 22 16.2 16 1.36 1.38
BAL 2019 7 13 16.0 13 0.81 1.00
BOS 2019 7 13 16.0 13 0.81 1.00
NYY 2019 7 13 16.0 13 0.81 1.00
TBR 2019 7 21 16.0 13 1.31 1.62
TOR 2019 7 20 16.0 13 1.25 1.54
BAL 2021 7 22 19.2 21 1.15 1.05
BOS 2021 7 13 19.2 21 0.68 0.62
NYY 2021 7 13 19.2 21 0.68 0.62
TBR 2021 7 27 19.2 21 1.41 1.29
TOR 2021 7 21 19.2 21 1.09 1.00
BAL 2022 7 16 12.0 11 1.33 1.45
BOS 2022 7 11 12.0 11 0.92 1.00
NYY 2022 7 8 12.0 11 0.67 0.73
TBR 2022 7 20 12.0 11 1.67 1.82
TOR 2022 7 5 12.0 11 0.42 0.45
BAL 2023 7 16 15.8 16 1.01 1.00
BOS 2023 7 15 15.8 16 0.95 0.94
NYY 2023 7 5 15.8 16 0.32 0.31
TBR 2023 7 23 15.8 16 1.46 1.44
TOR 2023 7 20 15.8 16 1.27 1.25
BAL 2018 8 15 15.4 15 0.97 1.00
BOS 2018 8 13 15.4 15 0.84 0.87
NYY 2018 8 27 15.4 15 1.75 1.80
TBR 2018 8 7 15.4 15 0.45 0.47
TOR 2018 8 15 15.4 15 0.97 1.00
BAL 2019 8 10 19.8 21 0.51 0.48
BOS 2019 8 21 19.8 21 1.06 1.00
NYY 2019 8 28 19.8 21 1.41 1.33
TBR 2019 8 14 19.8 21 0.71 0.67
TOR 2019 8 26 19.8 21 1.31 1.24
BAL 2021 8 30 17.2 20 1.74 1.50
BOS 2021 8 20 17.2 20 1.16 1.00
NYY 2021 8 10 17.2 20 0.58 0.50
TBR 2021 8 4 17.2 20 0.23 0.20
TOR 2021 8 22 17.2 20 1.28 1.10
BAL 2022 8 16 23.2 16 0.69 1.00
BOS 2022 8 6 23.2 16 0.26 0.38
NYY 2022 8 62 23.2 16 2.67 3.88
TBR 2022 8 7 23.2 16 0.30 0.44
TOR 2022 8 25 23.2 16 1.08 1.56
BAL 2023 8 15 12.6 8 1.19 1.88
BOS 2023 8 8 12.6 8 0.63 1.00
NYY 2023 8 7 12.6 8 0.56 0.88
TBR 2023 8 25 12.6 8 1.98 3.12
TOR 2023 8 8 12.6 8 0.63 1.00
BAL 2018 9 8 20.2 25 0.40 0.32
BOS 2018 9 32 20.2 25 1.58 1.28
NYY 2018 9 27 20.2 25 1.34 1.08
TBR 2018 9 9 20.2 25 0.45 0.36
TOR 2018 9 25 20.2 25 1.24 1.00
BAL 2019 9 35 28.4 29 1.23 1.21
BOS 2019 9 29 28.4 29 1.02 1.00
NYY 2019 9 27 28.4 29 0.95 0.93
TBR 2019 9 20 28.4 29 0.70 0.69
TOR 2019 9 31 28.4 29 1.09 1.07
BAL 2021 9 18 26.0 31 0.69 0.58
BOS 2021 9 31 26.0 31 1.19 1.00
NYY 2021 9 39 26.0 31 1.50 1.26
TBR 2021 9 10 26.0 31 0.38 0.32
TOR 2021 9 32 26.0 31 1.23 1.03
BAL 2022 9 33 15.4 12 2.14 2.75
BOS 2022 9 3 15.4 12 0.19 0.25
NYY 2022 9 12 15.4 12 0.78 1.00
TBR 2022 9 4 15.4 12 0.26 0.33
TOR 2022 9 25 15.4 12 1.62 2.08
BAL 2023 9 28 23.8 21 1.18 1.33
BOS 2023 9 13 23.8 21 0.55 0.62
NYY 2023 9 37 23.8 21 1.55 1.76
TBR 2023 9 20 23.8 21 0.84 0.95
TOR 2023 9 21 23.8 21 0.88 1.00
BAL 2018 DH 17 29.8 30 0.57 0.57
BOS 2018 DH 43 29.8 30 1.44 1.43
NYY 2018 DH 38 29.8 30 1.28 1.27
TBR 2018 DH 30 29.8 30 1.01 1.00
TOR 2018 DH 21 29.8 30 0.70 0.70
BAL 2019 DH 31 26.8 31 1.16 1.00
BOS 2019 DH 36 26.8 31 1.34 1.16
NYY 2019 DH 13 26.8 31 0.49 0.42
TBR 2019 DH 33 26.8 31 1.23 1.06
TOR 2019 DH 21 26.8 31 0.78 0.68
BAL 2021 DH 21 23.8 22 0.88 0.95
BOS 2021 DH 28 23.8 22 1.18 1.27
NYY 2021 DH 35 23.8 22 1.47 1.59
TBR 2021 DH 13 23.8 22 0.55 0.59
TOR 2021 DH 22 23.8 22 0.92 1.00
BAL 2022 DH 10 13.4 10 0.75 1.00
BOS 2022 DH 16 13.4 10 1.19 1.60
NYY 2022 DH 31 13.4 10 2.31 3.10
TBR 2022 DH 6 13.4 10 0.45 0.60
TOR 2022 DH 4 13.4 10 0.30 0.40
BAL 2023 DH 14 18.4 19 0.76 0.74
BOS 2023 DH 23 18.4 19 1.25 1.21
NYY 2023 DH 24 18.4 19 1.30 1.26
TBR 2023 DH 12 18.4 19 0.65 0.63
TOR 2023 DH 19 18.4 19 1.03 1.00
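As a worked example of the adjustment, the sketch below recomputes the values for one group by hand, using the 2018 catcher totals from the table above. The variable names are illustrative.

```r
# 2018 catcher home run totals, copied from the table above.
hrs <- c(BAL = 3, BOS = 5, NYY = 18, TBR = 14, TOR = 10)

group_mean <- mean(hrs)      # (3 + 5 + 18 + 14 + 10) / 5
group_median <- median(hrs)  # middle value of 3, 5, 10, 14, 18

# Each player's total divided by the group mean and by the group median.
mean_adj <- round(hrs / group_mean, 2)
median_adj <- round(hrs / group_median, 2)

print(mean_adj)
print(median_adj)
```

A value of 1 means the player exactly matched the group figure, values above 1 indicate more home runs than is typical for that position and season, and values below 1 indicate fewer. In this group the mean and the median are both 10, so the two adjusted columns agree.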

In this code block, I determine which player ranked best relative to the mean number of home runs for their position in a given season and display this information along with the results for their counterparts in that same season. The top spot goes to Aaron Judge, who set the American League record for home runs in a season in 2022 with 62 home runs. No other American League East center fielder hit more than 25 that year, and the division average for center fielders was 23.2.

expanded_hr_data_highmean <- expanded_hr_data[order(expanded_hr_data$mean_adj,decreasing = TRUE),]
expanded_hr_data_highmean_df <- subset(expanded_hr_data_highmean, Season == expanded_hr_data_highmean$Season[1] & Pos == expanded_hr_data_highmean$Pos[1])

kable(expanded_hr_data_highmean_df, format = "pipe", caption = "Position and Season Data for Best Home Run Value (Mean)", align = "lccccccc")
Position and Season Data for Best Home Run Value (Mean)
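| Team | Season | Pos | HRs | yr_pos_mean | yr_pos_median | mean_adj | median_adj |
|:-----|:------:|:---:|:---:|:-----------:|:-------------:|:--------:|:----------:|
| NYY | 2022 | 8 | 62 | 23.2 | 16 | 2.67 | 3.88 |
| TOR | 2022 | 8 | 25 | 23.2 | 16 | 1.08 | 1.56 |
| BAL | 2022 | 8 | 16 | 23.2 | 16 | 0.69 | 1.00 |
| TBR | 2022 | 8 | 7 | 23.2 | 16 | 0.30 | 0.44 |
| BOS | 2022 | 8 | 6 | 23.2 | 16 | 0.26 | 0.38 |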
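The order-then-subset pattern used here is repeated for each of the four rankings. One way it could be wrapped up is sketched below; top_group() is a hypothetical helper that is not part of the original analysis, and toy is a six-row excerpt of expanded_hr_data.

```r
library(dplyr)

# Hypothetical helper: returns every row that shares a season and position
# with the row holding the largest value of the chosen column, sorted from
# highest to lowest on that column.
top_group <- function(df, column) {
  best <- df[which.max(df[[column]]), ]
  df %>%
    filter(Season == best$Season, Pos == best$Pos) %>%
    arrange(desc(.data[[column]]))
}

# Excerpt: 2022 center fielders (Pos 8) and 2021 second basemen (Pos 4).
toy <- data.frame(
  Team = c("NYY", "TOR", "BAL", "TOR", "TBR", "NYY"),
  Season = c("2022", "2022", "2022", "2021", "2021", "2021"),
  Pos = c("8", "8", "8", "4", "4", "4"),
  HRs = c(62, 25, 16, 45, 39, 10),
  mean_adj = c(2.67, 1.08, 0.69, 2.14, 1.86, 0.48),
  median_adj = c(3.88, 1.56, 1.00, 4.50, 3.90, 1.00)
)

by_mean <- top_group(toy, "mean_adj")      # selects the 2022 center fielders
by_median <- top_group(toy, "median_adj")  # selects the 2021 second basemen
print(by_mean)
print(by_median)
```

Swapping which.max() for which.min() and dropping desc() would give the matching helper for the lowest-ranked player. Note that which.max() returns only the first row when there is a tie.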

In this code block, I determine which player ranked best relative to the median number of home runs for their position in a given season and display this information along with the results for their counterparts in that same season. The top spot goes to Marcus Semien, who hit 45 home runs in 2021 as a second baseman for the Toronto Blue Jays. Second base is a “defense-first” position, and the median number of home runs for American League Eastern Division second basemen in 2021 was only 10.

expanded_hr_data_highmedian <- expanded_hr_data[order(expanded_hr_data$median_adj,decreasing = TRUE),]
expanded_hr_data_highmedian_df <- subset(expanded_hr_data_highmedian, Season == expanded_hr_data_highmedian$Season[1] & Pos == expanded_hr_data_highmedian$Pos[1])

kable(expanded_hr_data_highmedian_df, format = "pipe", caption = "Position and Season Data for Best Home Run Value (Median)", align = "lccccccc")
Position and Season Data for Best Home Run Value (Median)
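| Team | Season | Pos | HRs | yr_pos_mean | yr_pos_median | mean_adj | median_adj |
|:-----|:------:|:---:|:---:|:-----------:|:-------------:|:--------:|:----------:|
| TOR | 2021 | 4 | 45 | 21 | 10 | 2.14 | 4.5 |
| TBR | 2021 | 4 | 39 | 21 | 10 | 1.86 | 3.9 |
| NYY | 2021 | 4 | 10 | 21 | 10 | 0.48 | 1.0 |
| BOS | 2021 | 4 | 6 | 21 | 10 | 0.29 | 0.6 |
| BAL | 2021 | 4 | 5 | 21 | 10 | 0.24 | 0.5 |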
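The mean and median rankings pick different players because the two groups are shaped differently. The sketch below recomputes both ratios for the top player in each group from the raw totals in the tables above; top_ratios() is an illustrative helper, not part of the analysis.

```r
# Home run totals copied from the tables above.
cf_2022 <- c(BAL = 16, BOS = 6, NYY = 62, TBR = 7, TOR = 25)  # center fielders, 2022
sb_2021 <- c(BAL = 5, BOS = 6, NYY = 10, TBR = 39, TOR = 45)  # second basemen, 2021

# Ratio of the group's highest total to the group mean and to the group median.
top_ratios <- function(hrs) {
  c(vs_mean = max(hrs) / mean(hrs), vs_median = max(hrs) / median(hrs))
}

cf_ratios <- top_ratios(cf_2022)
sb_ratios <- top_ratios(sb_2021)

print(round(cf_ratios, 3))
print(round(sb_ratios, 3))
```

Among the 2021 second basemen, two large totals (39 and 45) pull the mean up to 21 while the median stays at 10, so the top total looks far more extreme against the median than against the mean. Among the 2022 center fielders, only the 62 sits far above the rest, which leaves that group with the larger ratio to the mean.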

In this code block, I determine which player ranked worst relative to the mean number of home runs for their position in a given season and display this information along with the results for their counterparts in that same season. The dubious honor goes to Reese McGuire, the Blue Jays catcher in 2021. He hit a single home run in a season when the mean number of home runs for an AL East catcher was 14.8, which gives him the lowest ratio of home runs to position mean (0.07) of any position for any team in any of the seasons considered.

expanded_hr_data_lowmean <- expanded_hr_data[order(expanded_hr_data$mean_adj,decreasing = FALSE),]
expanded_hr_data_lowmean_df <- subset(expanded_hr_data_lowmean, Season == expanded_hr_data_lowmean$Season[1] & Pos == expanded_hr_data_lowmean$Pos[1])

kable(expanded_hr_data_lowmean_df, format = "pipe", caption = "Position and Season Data for Worst Home Run Value (Mean)", align = "lccccccc")
Position and Season Data for Worst Home Run Value (Mean)
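| Team | Season | Pos | HRs | yr_pos_mean | yr_pos_median | mean_adj | median_adj |
|:-----|:------:|:---:|:---:|:-----------:|:-------------:|:--------:|:----------:|
| TOR | 2021 | 2 | 1 | 14.8 | 11 | 0.07 | 0.09 |
| BOS | 2021 | 2 | 6 | 14.8 | 11 | 0.41 | 0.55 |
| BAL | 2021 | 2 | 11 | 14.8 | 11 | 0.74 | 1.00 |
| NYY | 2021 | 2 | 23 | 14.8 | 11 | 1.55 | 2.09 |
| TBR | 2021 | 2 | 33 | 14.8 | 11 | 2.23 | 3.00 |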

In this code block, I determine which player ranked worst relative to the median number of home runs for their position in a given season and display this information along with the results for their counterparts in that same season. Once again, Reese McGuire comes last: his single home run in 2021 is just 0.09 times the median of 11 for AL East catchers that season.

expanded_hr_data_lowmedian <- expanded_hr_data[order(expanded_hr_data$median_adj,decreasing = FALSE),]
expanded_hr_data_lowmedian_df <- subset(expanded_hr_data_lowmedian, Season == expanded_hr_data_lowmedian$Season[1] & Pos == expanded_hr_data_lowmedian$Pos[1])

kable(expanded_hr_data_lowmedian_df, format = "pipe", caption = "Position and Season Data for Worst Home Run Value (Median)", align = "lccccccc")
Position and Season Data for Worst Home Run Value (Median)
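| Team | Season | Pos | HRs | yr_pos_mean | yr_pos_median | mean_adj | median_adj |
|:-----|:------:|:---:|:---:|:-----------:|:-------------:|:--------:|:----------:|
| TOR | 2021 | 2 | 1 | 14.8 | 11 | 0.07 | 0.09 |
| BOS | 2021 | 2 | 6 | 14.8 | 11 | 0.41 | 0.55 |
| BAL | 2021 | 2 | 11 | 14.8 | 11 | 0.74 | 1.00 |
| NYY | 2021 | 2 | 23 | 14.8 | 11 | 1.55 | 2.09 |
| TBR | 2021 | 2 | 33 | 14.8 | 11 | 2.23 | 3.00 |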

Findings and Recommendations

I got to experiment more with the tidyverse in this analysis and had a lot of fun. I’d definitely be interested in taking it further to look at all Major League Baseball teams rather than one division. For this analysis, the number of home runs given for each position is the number of home runs hit by the player who started the most games for the team at that position during the year. It might be interesting to redo the analysis with the home runs by position based on who started each game at that position, rather than on the season total of the player who started there most often.