The dataset used in this project is part of the Lahman package, which provides tables from the Sean Lahman Baseball Database as a set of R data frames. The data frames contain information collected from 1871 through 2022. The purpose of the data set is to let researchers analyze the available information and build a more thorough understanding of baseball. There are 36 separate data sets in the Lahman package, but this project focuses on the “Appearances” data set, and all analysis was conducted on the 5,000 most recent entries. All processing and analysis were conducted using the R programming language in the RStudio IDE.
Analysis in this project is focused on the following questions:
Analysis began by importing the “Appearances” data set and conducting
a basic exploratory inspection. First, the data were subset to the last
5,000 observations, and then an initial inspection was conducted by
isolating a single row. Each row records a player’s utilization across
several variables: the year, the player’s team, the team’s league, the
player ID, the total number of games in which the player participated,
the number of games the player started, the number of games in which
the player participated on offense, the number of games in which the
player participated on defense, and the number of games at each
offensive and defensive position. Next, additional steps were taken to
understand the structure of the data set and the basic summary
statistics of each variable.
# load the Lahman package and import data
library(Lahman)
data("Appearances")
# subset data to the 5,000 most recent rows
data <- tail(Appearances, 5000)
A close inspection found qualities in the data set that must be
addressed to ensure accurate analysis. The data set is primarily a
tally of the games in which a player participated. The G_all variable
gives the total number of games in which the player appeared, but
players clearly move among offensive and defensive positions during
games. For example, in the row shown below, G_batting shows that the
player batted in 81 of the 81 total games. However, G_defense shows
that the player took a defensive position in only 55 of those 81 games.
Therefore, there were 26 games in which the player’s only contribution
was through hitting.
# show a single row
data[5,]
## yearID teamID lgID playerID G_all GS G_batting G_defense G_p G_c G_1b
## 107111 2019 SFN NL solando01 81 42 81 55 0 0 0
## G_2b G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr
## 107111 36 2 19 0 0 0 0 1 33 0
The next issue arose when examining the defensive contributions of the player. The G_defense variable is the total number of games in which the player participated on defense, but the sum of the player’s appearances across the individual defensive positions is greater than the G_defense value. For the 5th row in the data set, the player was used at second base 36 times, third base two times, and shortstop 19 times, for a total of 57 appearances across the three positions, but their G_defense value was 55. From this data set, it is impossible to reconstruct those defensive movements within a game. It can only be deduced that the player changed positions. Analysis cannot determine whether the player moved from second base to third base, second base to shortstop, third base to shortstop, or any other combination of the three positions. However, the two extra positional appearances show that the player changed defensive positions in at least one and at most two games.
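The arithmetic behind this comparison can be reproduced directly. The sketch below is only illustrative: it rebuilds the relevant values of the row shown above by hand (the object names `row5` and `position_cols` are assumptions, not part of the original analysis) and compares the sum of the positional columns with G_defense.

```r
# Values copied from the single row shown above (solando01, 2019, SFN)
row5 <- data.frame(
  G_all = 81, G_defense = 55,
  G_p = 0, G_c = 0, G_1b = 0, G_2b = 36, G_3b = 2,
  G_ss = 19, G_lf = 0, G_cf = 0, G_rf = 0
)

# The nine individual defensive position columns
position_cols <- c("G_p", "G_c", "G_1b", "G_2b", "G_3b",
                   "G_ss", "G_lf", "G_cf", "G_rf")

# Sum of appearances across the individual positions
position_total <- sum(unlist(row5[position_cols]))

# Positional appearances beyond the number of games played on defense
extra_appearances <- position_total - row5$G_defense

position_total
extra_appearances
```

The difference counts the extra positional appearances; it does not reveal which positions were involved in a change.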
At the project’s onset, it was understood that players often shift between positions for several reasons, such as injuries, suspensions, trades, retirements, and strategy intended to gain a competitive advantage. In terms of a player’s value to an organization and the player’s longevity, the data set provides an opportunity to delve deeper into how players are utilized and to answer the research questions. However, without very specific questions and an understanding of the limitations of the data, the analysis could become very inaccurate.
Data arranged this way is problematic for standard analysis methods. Typically, to understand the distribution of a variable, a histogram uses every row of the data set, because every row carries a meaningful value for that variable. That does not hold here, because appearances are recorded horizontally, one column per position, and most players have a zero in most position columns. A histogram of a single position column would show the distribution of appearances at that position across all rows, which would be highly skewed and misleading for position-specific questions. Most rows would contribute a zero and misrepresent how the position is used. Instead of a histogram of appearances at catcher among players who catch, for example, it would produce a histogram of every player’s appearances at catcher, regardless of how those players are actually used.
A more damaging example would be an attempt to determine the typical number of appearances per position from summary statistics or a box plot without filtering the data. If left unfiltered, the mean would not represent the average number of games played at a position by the players used there; it would be pulled toward zero by the many zero values, and a box plot’s median and quartiles would collapse toward zero for the same reason. For example, an unfiltered summary for first base would include the appearances of all first basemen, but it would also include a zero for every pitcher, catcher, or center fielder who never played first base.
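A small numeric sketch makes the effect of the zero values concrete. The appearance counts below are hypothetical and are not taken from the Appearances data; they only illustrate how unfiltered rows pull summary statistics toward zero.

```r
# Hypothetical first-base appearances for eight player rows;
# five rows belong to players who never played first base
g_1b <- c(150, 120, 30, 0, 0, 0, 0, 0)

mean_all    <- mean(g_1b)             # zeros included
mean_played <- mean(g_1b[g_1b > 0])   # only rows with at least one appearance
median_all  <- median(g_1b)           # zeros included

mean_all
mean_played
median_all
```

With the zeros included, the mean falls well below the filtered mean and the median drops to zero, which mirrors the zero medians reported by summary() for the position columns.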
Further analysis will explain the distribution of appearances focused
on positional alignment and exclude the remainder of the rows not
meeting the positional requirements. Identification of issues such as
these could only be achieved by a thorough inspection of the data set
and required additional steps before proceeding into the next phase of
analysis.
summary(data)
## yearID teamID lgID playerID G_all
## Min. :2019 CHN : 187 AA: 0 Length:5000 Min. : 1.00
## 1st Qu.:2020 PIT : 187 AL:2464 Class :character 1st Qu.: 8.00
## Median :2021 LAA : 186 FL: 0 Mode :character Median : 22.00
## Mean :2021 MIA : 184 NA: 0 Mean : 35.87
## 3rd Qu.:2022 SEA : 182 NL:2536 3rd Qu.: 50.00
## Max. :2022 BAL : 179 PL: 0 Max. :162.00
## (Other):3895 UA: 0
## GS G_batting G_defense G_p
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 7.00 1st Qu.: 0.00
## Median : 6.00 Median : 5.0 Median : 21.00 Median : 2.00
## Mean : 23.79 Mean : 27.8 Mean : 32.41 Mean :10.73
## 3rd Qu.: 31.00 3rd Qu.: 42.0 3rd Qu.: 45.00 3rd Qu.:16.00
## Max. :162.00 Max. :162.0 Max. :162.00 Max. :81.00
##
## G_c G_1b G_2b G_3b
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.000 Median : 0.000 Median : 0.000
## Mean : 2.762 Mean : 2.804 Mean : 2.841 Mean : 2.755
## 3rd Qu.: 0.000 3rd Qu.: 0.000 3rd Qu.: 0.000 3rd Qu.: 0.000
## Max. :133.000 Max. :162.000 Max. :156.000 Max. :159.000
##
## G_ss G_lf G_cf G_rf
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.00 Median : 0.000 Median : 0.000
## Mean : 2.717 Mean : 3.08 Mean : 2.851 Mean : 2.907
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.000
## Max. :161.000 Max. :150.00 Max. :153.000 Max. :147.000
##
## G_of G_dh G_ph G_pr
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.000 Median : 0.000 Median : 0.000 Median : 0.000
## Mean : 8.411 Mean : 2.132 Mean : 2.238 Mean : 0.415
## 3rd Qu.: 0.000 3rd Qu.: 0.000 3rd Qu.: 2.000 3rd Qu.: 0.000
## Max. :153.000 Max. :153.000 Max. :77.000 Max. :28.000
##
str(data)
## 'data.frame': 5000 obs. of 21 variables:
## $ yearID : int 2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
## $ teamID : Factor w/ 149 levels "ALT","ANA","ARI",..: 4 130 134 131 117 117 66 96 4 125 ...
## $ lgID : Factor w/ 7 levels "AA","AL","FL",..: 5 2 2 2 5 5 2 2 5 5 ...
## $ playerID : chr "sobotch01" "sogarer01" "sogarer01" "solakni01" ...
## $ G_all : int 32 37 73 33 81 28 162 71 29 8 ...
## $ GS : int 0 23 68 32 42 16 161 1 29 1 ...
## $ G_batting: int 30 37 73 33 81 28 162 5 27 8 ...
## $ G_defense: int 32 31 55 16 55 22 56 71 29 4 ...
## $ G_p : int 32 0 0 0 0 0 0 71 29 0 ...
## $ G_c : int 0 0 0 0 0 0 0 0 0 0 ...
## $ G_1b : int 0 0 0 0 0 0 0 0 0 0 ...
## $ G_2b : int 0 31 43 5 36 10 0 0 0 4 ...
## $ G_3b : int 0 0 6 11 2 2 0 0 0 0 ...
## $ G_ss : int 0 0 4 0 19 4 0 0 0 0 ...
## $ G_lf : int 0 0 1 0 0 9 0 0 0 0 ...
## $ G_cf : int 0 0 0 0 0 0 0 0 0 0 ...
## $ G_rf : int 0 0 6 0 0 0 56 0 0 0 ...
## $ G_of : int 0 0 7 0 0 9 56 0 0 0 ...
## $ G_dh : int 0 1 15 17 1 0 107 0 0 0 ...
## $ G_ph : int 0 11 2 1 33 9 1 0 0 6 ...
## $ G_pr : int 0 0 1 0 0 0 0 0 0 1 ...
tail(data)
## yearID teamID lgID playerID G_all GS G_batting G_defense G_p G_c G_1b
## 112101 2022 KCA AL zerpaan01 3 2 0 3 3 0 0
## 112102 2022 CIN NL zeuchtj01 3 3 0 3 3 0 0
## 112103 2022 PHI NL zimmebr01 9 5 9 9 0 0 0
## 112104 2022 TOR AL zimmebr01 100 23 100 90 0 0 0
## 112105 2022 BAL AL zimmebr02 15 13 0 15 15 0 0
## 112106 2022 TBA AL zuninmi01 36 34 36 35 0 35 0
## G_2b G_3b G_ss G_lf G_cf G_rf G_of G_dh G_ph G_pr
## 112101 0 0 0 0 0 0 0 0 0 0
## 112102 0 0 0 0 0 0 0 0 0 0
## 112103 0 0 0 0 9 0 9 0 0 1
## 112104 0 0 0 0 88 2 90 2 8 17
## 112105 0 0 0 0 0 0 0 0 0 0
## 112106 0 0 0 0 0 0 0 0 2 0
After the initial inspection of the dataset, exploratory data analysis continued by assessing data quality. It was found that there were no instances of missing information, and the dataset contained no occurrences of duplicate rows.
# identify missing values (vis_miss() comes from the visdat package)
vis_miss(data)
# identify duplicates
data[duplicated(data), ]
## [1] yearID teamID lgID playerID G_all GS G_batting
## [8] G_defense G_p G_c G_1b G_2b G_3b G_ss
## [15] G_lf G_cf G_rf G_of G_dh G_ph G_pr
## <0 rows> (or 0-length row.names)
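The same checks can also be expressed numerically in base R, which is convenient when a plot is not needed. The following is a minimal sketch on a hypothetical toy data frame (`toy` and its values are invented for illustration), not the project’s data.

```r
# Hypothetical data with one missing value and one duplicated row
toy <- data.frame(
  playerID = c("a01", "b01", "b01", "c01"),
  G_all    = c(10, 20, 20, NA)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))

# Number of rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(toy))

missing_per_column
n_duplicates
```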
Several steps were undertaken during the cleaning and manipulation phase of this project. First, the column names were changed from their abbreviated form into a more easily understood format. Next, a new variable was created to give the percentage of each player’s total appearances that were starts. The start percent variable gives further insight into how players are used throughout a game and can illuminate strategy trends that could not be easily identified from the total appearances and total starts variables alone.
# change names of variables
names(data)[1] <- "year"
names(data)[2] <- "team"
names(data)[3] <- "league"
names(data)[4] <- "player"
names(data)[5] <- "total appearances"
names(data)[6] <- "total starts"
names(data)[7] <- "total offense"
names(data)[8] <- "total defense"
names(data)[9] <- "pitcher"
names(data)[10] <- "catcher"
names(data)[11] <- "first base"
names(data)[12] <- "second base"
names(data)[13] <- "third base"
names(data)[14] <- "short stop"
names(data)[15] <- "left field"
names(data)[16] <- "center field"
names(data)[17] <- "right field"
names(data)[18] <- "total outfield"
names(data)[19] <- "designated hitter"
names(data)[20] <- "pinch hitter"
names(data)[21] <- "pinch runner"
# create starting percentage column
data$`start percent` <- data$`total starts`/data$`total appearances` * 100
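As an alternative to renaming by column index, names can be mapped by their original labels, which does not depend on column order. The sketch below assumes a hypothetical lookup vector (`lookup`) and uses values copied from one row of the tail() output above (zimmebr01 with Toronto in 2022) to show the start percent calculation.

```r
# One row with values copied from the tail() output (zimmebr01, TOR, 2022)
toy <- data.frame(yearID = 2022, teamID = "TOR", G_all = 100, GS = 23)

# Hypothetical lookup: original column name -> descriptive name
lookup <- c(yearID = "year", teamID = "team",
            G_all = "total appearances", GS = "total starts")

# Rename by matching on the existing names rather than on position
names(toy) <- unname(lookup[names(toy)])

# Percentage of total appearances that were starts
toy$`start percent` <- toy$`total starts` / toy$`total appearances` * 100

toy
```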
Another crucial variable needed to be created to answer the research questions for this project. Because some players play only one position while others shift around the field to suit their organization’s requirements, a categorical variable must be created to describe their utilization. The classification of players occurs in two steps. First, R was used to evaluate each row, determine the defensive position in which the player appeared most frequently, and return the name of that position in a new variable. Next, each row was checked for values greater than zero in more than one position variable. If the player appeared in multiple positions, they were classified as “utility” players because of their versatility. If they appeared in only one position, they were classified as “single-use” players because of their niche capabilities.
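The two classification steps can be traced on a few rows. The sketch below is a base R illustration on a hypothetical four-row data frame with only three position columns; the values are invented. It also shows an edge case that follows from the rules as described: a row with no appearances at any listed position has no single non-zero column, so it is labeled “utility”, and its most frequent position resolves to the first column. Whether such rows occur in the 5,000-row subset is not shown in the text.

```r
positions <- c("pitcher", "catcher", "first base")

# Hypothetical rows: a pure pitcher, a catcher/first baseman,
# a pure first baseman, and a row with no defensive appearances
toy <- data.frame(
  pitcher      = c(30, 0, 0, 0),
  catcher      = c(0, 10, 0, 0),
  `first base` = c(0, 12, 5, 0),
  check.names  = FALSE
)

# Step 1: position with the most appearances (ties go to the first column)
idx <- max.col(as.matrix(toy[positions]), ties.method = "first")
toy$`defensive position` <- positions[idx]

# Step 2: count the positions with at least one appearance
n_positions <- rowSums(toy[positions] > 0)
toy$role <- ifelse(n_positions == 1, "single use", "utility")

toy[c("defensive position", "role")]
```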
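# create defensive position category
column_names <- c("pitcher", "catcher", "first base", "second base", "third base",
                  "short stop", "left field", "center field", "right field")
# index of the position column with the most appearances in each row;
# ties, including rows with no defensive appearances, resolve to the first column
max_column_index <- max.col(data[column_names], ties.method = "first")
max_column_name <- column_names[max_column_index]
data$`defensive position` <- max_column_name
# create role category: "single use" when exactly one position column is
# non-zero, otherwise "utility" (%>%, mutate() and select() come from dplyr)
library(dplyr)
data <- data %>%
  mutate(role = ifelse(rowSums(select(., pitcher:`right field`) != 0) == 1,
                       "single use", "utility"))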
Following the creation of variables needed for research questions, the columns were rearranged in a more logical order, and the format of the variables was modified to allow further analysis.
data <- data[, c(1,2,3,4,5,6,22,7,8,23,24,9,10,11,12,13,14,15,16,17,18,19,20,21)]
data$year <- as.numeric(data$year)
data$team <- as.character(data$team)
data$league <- as.character(data$league)
data$`total appearances` <- as.numeric(data$`total appearances`)
data$`total starts` <- as.numeric(data$`total starts`)
data$`total offense` <- as.numeric(data$`total offense`)
data$`total defense`<- as.numeric(data$`total defense`)
data$role <- as.character(data$role)
data$pitcher <- as.numeric(data$pitcher)
data$catcher <- as.numeric(data$catcher)
data$`first base` <- as.numeric(data$`first base`)
data$`second base`<- as.numeric(data$`second base`)
data$`third base` <- as.numeric(data$`third base`)
data$`short stop` <- as.numeric(data$`short stop`)
data$`left field` <- as.numeric(data$`left field`)
data$`right field` <- as.numeric(data$`right field`)
data$`center field` <- as.numeric(data$`center field`)
data$`total outfield` <- as.numeric(data$`total outfield`)
data$`designated hitter` <- as.numeric(data$`designated hitter`)
data$`pinch hitter` <- as.numeric(data$`pinch hitter`)
data$`pinch runner` <- as.numeric(data$`pinch runner`)
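The column-by-column conversions above could also be applied by type. The sketch below is one possible approach on a hypothetical three-column data frame; it is not the code used in the project.

```r
# Hypothetical data frame with integer and factor columns
toy <- data.frame(year = 2022L, team = factor("TOR"), pitcher = 0L)

# Convert every integer column to numeric (double)
int_cols <- vapply(toy, is.integer, logical(1))
toy[int_cols] <- lapply(toy[int_cols], as.numeric)

# Convert every factor column to character
fct_cols <- vapply(toy, is.factor, logical(1))
toy[fct_cols] <- lapply(toy[fct_cols], as.character)

str(toy)
```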
Before extensive inspection of each of the defensive positions to answer research questions, visualizations were created from several variables to gain an additional understanding of the data set.
# Create bar chart of year (the plotting functions come from ggplot2)
library(ggplot2)
data %>% group_by(year) %>% count(year) %>%
ggplot(aes(x = year, y = n)) +
geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
labs(title = "Year Count",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman Package",
x = "Year",
y = "Count") +
theme_minimal()
| year | n |
|---|---|
| 2019 | 251 |
| 2020 | 1360 |
| 2021 | 1706 |
| 2022 | 1683 |
A simple bar chart of the number of rows per year captures a significant amount of information. Limiting the analysis to the last 5,000 entries means that it only has complete information for 2020 through 2022; 2019 contributes just 251 rows. Additionally, the effects of baseball’s response to the COVID-19 pandemic are apparent in the difference between 2020 and the 2021 and 2022 seasons: the 2020 season was shortened to 60 games per team. While season length affects the total number of rows in the data set, other factors also influence the height of each bar.
When a player is traded from the National League to the American League, or otherwise switches teams mid-season, a new row is created for that player. There is not enough information in this data set to estimate the number of trades per year, nor is that the focus of this analysis. Future efforts could examine the number of appearances per year, per team, per player type, and per player to extract meaningful insight into opposing teams’ strategies.
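The partial 2019 season is a consequence of selecting rows by count rather than by year. The sketch below uses a hypothetical six-row data frame, assumed to be sorted by year as the Appearances output above appears to be, to contrast the two ways of subsetting.

```r
# Hypothetical data sorted by year, two rows per season
toy <- data.frame(
  yearID   = c(2019, 2019, 2020, 2020, 2021, 2021),
  playerID = c("a", "b", "c", "d", "e", "f")
)

# Selecting by row count cuts into the earliest season
by_rows <- tail(toy, 5)

# Selecting by year keeps only complete seasons
by_year <- subset(toy, yearID >= 2020)

table(by_rows$yearID)
table(by_year$yearID)
```

Subsetting by year would change the number of rows analyzed, so it is shown here only to explain the short 2019 bar, not as a replacement for the 5,000-row subset.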
# Create bar chart per league
data %>% group_by(league) %>% count(league) %>%
ggplot(aes(x = league, y = n)) +
geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
labs(title = "League Representation",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman Package",
x = "League",
y = "Count") +
theme_minimal()
| league | n |
|---|---|
| AL | 2464 |
| NL | 2536 |
The league count bar chart shows slightly more entries in the National League (2,536) than in the American League (2,464). Unfortunately, it is unknown whether that is due to more trades, more players called up from the minor leagues, or more injuries in one league than in the other.
Working with the information gained previously in the project, it was understood that the data set contained players who fulfilled multiple roles for their team. Because many of the focused research questions for this project inquire about the utilization of single-use and utility-type players, it was necessary to create a bar plot gathering the total number of rows in the data set that contained single-use and utility players.
# Create bar chart of count of single use vs utility player
data %>% group_by(role) %>% count(role) %>%
ggplot(aes(x = role, y = n)) +
geom_bar(stat = "identity", color = "steelblue4",fill = "lightblue2") +
labs(title = "Occurrences of 'single use' or 'utility'",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman Package",
x = "Role",
y = "Count") +
theme_minimal()
| role | n |
|---|---|
| single use | 3580 |
| utility | 1420 |
It is essential to understand that because of the structure of the data set and the scope of this project, no additional focus is given to individual players to understand their utilization over time. If a player spent one year in the National League as a single-use pitcher, then moved leagues and was used as a utility pitcher, that information is captured in the aggregate positional utilization, but it is not specifically attributed to that individual player.
# Create bar chart of positions for all players
data %>% group_by(`defensive position`) %>% count(`defensive position`, sort = TRUE) %>%
ggplot(aes(x = reorder(`defensive position`, -n), y = n)) +
geom_bar(stat = "identity", color = "steelblue4", fill = "lightblue2") +
labs(title = "Occurrences by Position for all Roles",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman Package",
x = "Position",
y = "Count") +
theme_minimal()
| defensive position | n |
|---|---|
| pitcher | 2797 |
| catcher | 376 |
| left field | 315 |
| second base | 288 |
| right field | 269 |
| first base | 263 |
| center field | 254 |
| third base | 246 |
| short stop | 192 |
The bar chart counting the number of rows per position points to two possible explanations, which are not mutually exclusive: either significantly more pitchers play baseball than players at any other position, or pitchers move between teams and leagues at a higher rate than players at other positions.
# Create bar chart of positions comparing occurrences of single use vs utility.
data %>% group_by(`defensive position`) %>% count(role) %>%
ggplot(aes(fill = role, x = reorder(`defensive position`, -n), y = n)) +
geom_bar(position = "dodge", stat = "identity") +
labs(title = "Count of Occurrences by Role and Position",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman Package",
x = "Defensive Position",
y = "Occurrences") +
theme_minimal() +
scale_fill_brewer() +
theme(axis.text.x =element_text(angle = 45, hjust = 1))
Counting the rows of single-use and utility players illuminates how each position is employed and can reveal trends in specialization per position. For example, the above chart shows that pitchers not only account for the most rows in the data set but are also the most specialized of all defensive positions. This contrasts sharply with a position such as shortstop, where the numbers of single-use and utility rows are almost equal.
However, that does not mean that the shortstop position is not highly specialized or essential to a team’s defensive strategy. The smaller number of rows could instead indicate that teams hold onto players at that position for longer because of their skill and the difficulty of replacing them.
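# Sum appearances per position by role, reshape to long format, and plot
# (pivot_longer() comes from tidyr)
library(tidyr)
data %>%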
group_by(role) %>%
summarize(pitcher = sum(pitcher),
catcher = sum(catcher),
first = sum(`first base`),
second = sum(`second base`),
third = sum(`third base`),
short = sum(`short stop`),
left = sum(`left field`),
right = sum(`right field`),
center = sum(`center field`)) %>%
pivot_longer(cols = -role, names_to = "Position",
values_to = "Sum") %>%
mutate(Position = factor(Position,
levels = c("pitcher", "catcher", "first",
"second", "third",
"short", "left", "right",
"center"))) %>%
ggplot(aes(x = Position, y = Sum, fill = role)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Total Appearances by Role and Position",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman Package",
x = "Defensive Position",
y = "Total Appearances") +
theme_minimal() +
scale_fill_brewer() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
| role | pitcher | catcher | first | second | third | short | left | right | center |
|---|---|---|---|---|---|---|---|---|---|
| single use | 53210 | 7674 | 6708 | 2314 | 3941 | 7227 | 2949 | 2705 | 4885 |
| utility | 460 | 6138 | 7310 | 11890 | 9832 | 6358 | 12453 | 11828 | 9372 |
Several observations can be drawn from aggregating the total number of appearances per position for both the single-use and utility roles. Starting with the pitcher position, it is apparent that there are more changes at pitcher during a single game than at any other position. If no substitutions occurred, every position would be filled exactly once per team per game, so the single-use and utility bars might differ in size within a position, but their combined total would be equal across positions. For every game in which a team used several pitchers but did not change any other position, the pitcher total grows larger than the other positions’ totals by the number of substitutions. Other positions likely experience the same kind of in-game defensive changes, as seen in the number of appearances at left, right, and center field. Following the trend seen in the row count of defensive positions, pitcher, catcher, and short stop continue to appear as the most specialized of all defensive positions.
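One way to put a number on specialization is the share of each position’s total appearances that comes from single-use rows. The sketch below recomputes that share from the totals in the table above; the object names are illustrative assumptions.

```r
# Totals copied from the table of appearances by role and position
totals <- matrix(
  c(53210, 7674, 6708,  2314, 3941, 7227,  2949,  2705, 4885,
      460, 6138, 7310, 11890, 9832, 6358, 12453, 11828, 9372),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("single use", "utility"),
                  c("pitcher", "catcher", "first", "second", "third",
                    "short", "left", "right", "center"))
)

# Share of each position's appearances contributed by single-use rows
single_use_share <- totals["single use", ] / colSums(totals)

# Positions ordered from most to least specialized by this measure
round(sort(single_use_share, decreasing = TRUE), 2)
```

Sorted this way, pitcher, catcher, and short stop have the three highest single-use shares, which agrees with the reading above.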
Most importantly, it is apparent that player role and position do inform the number of games in which a player appears on defense. Utilization is unequal across positions, and how a player or a specific position is used appears to depend on the player’s role within the organization. Depending on the position they play, players who can be used at multiple positions may appear in more games than their specialized teammates.
# Create histogram and box plot of total number of appearances for all positions and roles
# (grid.arrange() comes from gridExtra)
library(gridExtra)
hist_appearances <- data %>%
ggplot(aes(x = `total appearances`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Total Games Played",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Appearances",
y = "Frequency") +
theme_minimal()
box_appearances <- data %>%
ggplot(aes(x = "Category", y = `total appearances`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Total Games Played",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "All Positions",
y = "Number of Games Played") +
theme_minimal()
combined_appearances <- grid.arrange(hist_appearances, box_appearances, ncol=2)
# Creates histogram of appearances for single use pitchers
hist_pitcher <- data %>% filter(data$role=="single use", data$pitcher > 0) %>%
ggplot(aes(x = pitcher)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Single Use Pitcher",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Pitcher",
y = "Frequency") +
theme_minimal()
# box plot of appearances for single use pitchers
box_pitcher <- data %>% filter(data$role=="single use", data$pitcher > 0) %>%
ggplot(aes(x = "Category", y = pitcher)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Single Use Pitcher",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Pitcher",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_pitcher <- grid.arrange(hist_pitcher, box_pitcher, ncol=2)
# Creates histogram of appearances for utility pitchers
hist_pitcher_u <- data %>% filter(data$role=="utility", data$pitcher > 0) %>%
ggplot(aes(x = pitcher)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Utility Pitcher",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Pitcher",
y = "Frequency") +
theme_minimal()
# box plot of appearances for utility pitchers
box_pitcher_u <- data %>% filter(data$role=="utility", data$pitcher > 0) %>%
ggplot(aes(x = "Category", y = pitcher)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Utility Pitcher",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Pitcher",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_pitcher_u <- grid.arrange(hist_pitcher_u, box_pitcher_u, ncol=2)
As expected, comparing the number of games played by single-use and utility pitchers reveals stark differences in utilization. The distributions of appearances for both groups are skewed right, indicating that most rows record few appearances, but utility pitchers are employed very rarely. It is highly likely that a player who filled the utility pitcher role moved from their primary position for a specific purpose and that pitching is not their primary role for the team.
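Because right-skewed distributions pull the mean above the median, it can help to report both when comparing roles. The sketch below uses hypothetical appearance counts (not values from the data set) and keeps only rows with at least one game at pitcher, mirroring the filter used for the plots.

```r
# Hypothetical rows: games at pitcher by role
toy <- data.frame(
  role    = c("single use", "single use", "single use", "utility", "utility"),
  pitcher = c(70, 25, 10, 1, 2)
)

# Keep rows with at least one appearance at pitcher
pitched <- toy[toy$pitcher > 0, ]

# Mean and median games at pitcher for each role
mean_by_role   <- tapply(pitched$pitcher, pitched$role, mean)
median_by_role <- tapply(pitched$pitcher, pitched$role, median)

mean_by_role
median_by_role
```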
# Creates histogram of appearances for single use catchers
hist_catcher <- data %>% filter(data$role=="single use", data$catcher > 0) %>%
ggplot(aes(x = catcher)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Single Use Catcher",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Catcher",
y = "Frequency") +
theme_minimal()
# box plot of appearances for single use catchers
box_catcher <- data %>% filter(data$role=="single use", data$catcher > 0) %>%
ggplot(aes(x = "Category", y = catcher)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Single Use Catcher",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Catcher",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_catcher <- grid.arrange(hist_catcher, box_catcher, ncol=2)
# Creates histogram of appearances for utility catchers
hist_catcher_u <- data %>% filter(data$role=="utility", data$catcher > 0) %>%
ggplot(aes(x = catcher)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Utility Catcher",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Catcher",
y = "Frequency") +
theme_minimal()
# box plot of appearances for utility catchers
box_catcher_u <- data %>% filter(data$role=="utility", data$catcher > 0) %>%
ggplot(aes(x = "Category", y = catcher)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Utility Catcher",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Catcher",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_catcher_u <- grid.arrange(hist_catcher_u, box_catcher_u, ncol=2)
While the total number of games for single-use and utility catchers (single use: 7,674; utility: 6,138) suggested that single-use catchers would show a higher mean number of games, the opposite is true, and there are several possible explanations. First, the right-skewed distribution of single-use catchers could indicate high turnover between teams or leagues, or players leaving the roster. Some single-use catchers are used very heavily, playing upwards of 120 games, but they are outliers in the data set. More often, a player who appears only at catcher plays fewer games. Next, there may be an offensive or defensive strategy that necessitates shifting personnel between positions because of a player’s strengths and weaknesses or those of the opposition.
# Creates histogram of appearances for single use first base
hist_first <- data %>% filter(data$role=="single use", data$`first base` > 0) %>%
ggplot(aes(x = `first base`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Single Use First Base",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at First Base",
y = "Frequency") +
theme_minimal()
# box plot of appearances for single use first base
box_first <- data %>% filter(data$role=="single use", data$`first base` > 0) %>%
ggplot(aes(x = "Category", y = `first base`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Single Use First Base",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "First Base",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_first <- grid.arrange(hist_first, box_first, ncol=2)
# Creates histogram of appearances for utility first base
hist_first_u <- data %>% filter(data$role=="utility", data$`first base` > 0) %>%
ggplot(aes(x = `first base`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Utility First Base",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at First Base",
y = "Frequency") +
theme_minimal()
# box plot of appearances for utility first base
box_first_u <- data %>% filter(data$role=="utility", data$`first base` > 0) %>%
ggplot(aes(x = "Category", y = `first base`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Utility First Base",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "First Base",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_first_u <- grid.arrange(hist_first_u, box_first_u, ncol=2)
The pattern seen in the comparison between single-use and utility pitchers appears again at first base. The more specialized a player is at first base, the more games they play at that position.
# Creates histogram of appearances for single use second base
hist_second <- data %>% filter(data$role=="single use", data$`second base` > 0) %>%
ggplot(aes(x = `second base`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Single Use Second Base",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Second Base",
y = "Frequency") +
theme_minimal()
# box plot of appearances for single use second base
box_second <- data %>% filter(data$role=="single use", data$`second base` > 0) %>%
ggplot(aes(x = "Category", y = `second base`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Single Use Second Base",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Second Base",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_second <- grid.arrange(hist_second, box_second, ncol=2)
# Creates histogram of appearances for utility second base
hist_second_u <- data %>% filter(data$role=="utility", data$`second base` > 0) %>%
ggplot(aes(x = `second base`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Utility Second Base",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Second Base",
y = "Frequency") +
theme_minimal()
# box plot of appearances for utility second base
box_second_u <- data %>% filter(data$role=="utility", data$`second base` > 0) %>%
ggplot(aes(x = "Category", y = `second base`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Utility Second Base",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Second Base",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_second_u <- grid.arrange(hist_second_u, box_second_u, ncol=2)
Second base utilization for single-use and utility players reinforces the trend seen at other positions: the more specialized the player, the more games they play. Although utility players give their organization roster flexibility, they have a lower mean number of appearances in the data set.
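The three quantities compared in these position sections (number of player rows, total appearances, and mean appearances per role) can be computed directly rather than read off the plots. The following is a minimal base R sketch, assuming a data frame shaped like the project's `data` object. The `toy` data frame and its values are hypothetical and are not taken from the Lahman data.

```r
# Hypothetical toy rows standing in for the project's `data` object
toy <- data.frame(
  role = c("single use", "single use", "utility", "utility", "utility"),
  `second base` = c(120, 80, 30, 10, 0),
  check.names = FALSE
)

# Keep only rows where the player appeared at second base at least once
played <- toy[toy$`second base` > 0, ]

# Split the appearance counts by role, then summarize each group
by_role <- split(played$`second base`, played$role)
second_summary <- data.frame(
  role        = names(by_role),
  occurrences = vapply(by_role, length, integer(1)),
  total       = vapply(by_role, sum, numeric(1)),
  mean        = vapply(by_role, mean, numeric(1)),
  row.names   = NULL
)
print(second_summary)
```

Replacing `toy` with the real data frame and changing the column name would give the same three-way comparison for any other position.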
# Creates histogram of appearances for single use third base
hist_third <- data %>% filter(data$role=="single use", data$`third base` > 0) %>%
ggplot(aes(x = `third base`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Single Use Third Base",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Third Base",
y = "Frequency") +
theme_minimal()
# box plot of appearances for single use third base
box_third <- data %>% filter(data$role=="single use", data$`third base` > 0) %>%
ggplot(aes(x = "Category", y = `third base`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Single Use Third Base",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Third Base",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_third <- grid.arrange(hist_third, box_third, ncol=2)
# Creates histogram of appearances for utility third base
hist_third_u <- data %>% filter(data$role=="utility", data$`third base` > 0) %>%
ggplot(aes(x = `third base`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Utility Third Base",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Third Base",
y = "Frequency") +
theme_minimal()
# box plot of appearances for utility third base
box_third_u <- data %>% filter(data$role=="utility", data$`third base` > 0) %>%
ggplot(aes(x = "Category", y = `third base`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Utility Third Base",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Third Base",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_third_u <- grid.arrange(hist_third_u, box_third_u, ncol=2)
The third base position echoes the situation seen at shortstop. Utility third base players occur more often in the data set and account for a higher total number of appearances at third base. Still, the mean number of games is greater for single-use players.
# Creates histogram of appearances for single use short stops
hist_short <- data %>% filter(data$role=="single use", data$`short stop` > 0) %>%
ggplot(aes(x = `short stop`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Single Use Shortstop",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Shortstop",
y = "Frequency") +
theme_minimal()
# box plot of appearances for single use short stop
box_short <- data %>% filter(data$role=="single use", data$`short stop` > 0) %>%
ggplot(aes(x = "Category", y = `short stop`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Single Use Short Stop",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Short Stop",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_short <- grid.arrange(hist_short, box_short, ncol=2)
# Creates histogram of appearances for utility short stops
hist_short_u <- data %>% filter(data$role=="utility", data$`short stop` > 0) %>%
ggplot(aes(x = `short stop`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Utility Shortstop",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Shortstop",
y = "Frequency") +
theme_minimal()
# box plot of appearances for utility short stops
box_short_u <- data %>% filter(data$role=="utility", data$`short stop` > 0) %>%
ggplot(aes(x = "Category", y = `short stop`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Utility Short Stop",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Short Stop",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_short_u <- grid.arrange(hist_short_u, box_short_u, ncol=2)
The shortstop position further reinforces trends seen in all other infield positions. Aside from catcher, every infield position has a greater mean number of appearances among single-use players. Additionally, the histograms of single-use players show a critical difference at shortstop. Every other single-use infield position follows a right-skewed distribution, whereas the single-use shortstop is the closest to a uniform distribution of appearances in the entire data set.
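The shape comparisons made from the histograms (right-skewed versus close to uniform) can be backed by a number. The sketch below computes a moment-based sample skewness in base R: values near zero indicate a symmetric or evenly spread distribution, and positive values indicate a right skew. The function name `sample_skewness` and both example vectors are hypothetical illustrations, not results from the Appearances data.

```r
# Moment-based sample skewness: third central moment / (second central moment)^(3/2)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)
  m3 <- mean(centered^3)
  m3 / m2^(3 / 2)
}

# Hypothetical appearance counts, not taken from the Lahman data
right_skewed <- c(1, 2, 2, 3, 4, 5, 8, 20, 60, 150)
roughly_even <- seq(10, 150, by = 10)

skew_right <- sample_skewness(right_skewed)  # positive: long tail toward high game counts
skew_even  <- sample_skewness(roughly_even)  # zero: values spread evenly around the mean
```

Applied to a real column, for example the single-use shortstop appearances after filtering out zeros, a value much closer to zero than the other positions would support the observation above.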
# Creates histogram of appearances for single use left field
hist_left <- data %>% filter(data$role=="single use", data$`left field` > 0) %>%
ggplot(aes(x = `left field`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Single Use Left Field",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Left Field",
y = "Frequency") +
theme_minimal()
# box plot of appearances for single use left field
box_left <- data %>% filter(data$role=="single use", data$`left field` > 0) %>%
ggplot(aes(x = "Category", y = `left field`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Single Use Left Field",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Left Field",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_left <- grid.arrange(hist_left, box_left, ncol=2)
# Creates histogram of appearances for utility left field
hist_left_u <- data %>% filter(data$role=="utility", data$`left field` > 0) %>%
ggplot(aes(x = `left field`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Utility Left Field",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Left Field",
y = "Frequency") +
theme_minimal()
# box plot of appearances for utility left field
box_left_u <- data %>% filter(data$role=="utility", data$`left field` > 0) %>%
ggplot(aes(x = "Category", y = `left field`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Utility Left Field",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Left Field",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_left_u <- grid.arrange(hist_left_u, box_left_u, ncol=2)
The left field position follows the same trend as every infield position other than catcher. Utility players occur more often in the data set at left field than single-use players do, and they account for more total left field appearances. However, the mean number of appearances is greater for single-use left fielders.
# Creates histogram of appearances for single use center field
hist_center <- data %>% filter(data$role=="single use", data$`center field` > 0) %>%
ggplot(aes(x = `center field`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Single Use Center Field",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Center Field",
y = "Frequency") +
theme_minimal()
# box plot of appearances for single use center field
box_center <- data %>% filter(data$role=="single use", data$`center field` > 0) %>%
ggplot(aes(x = "Category", y = `center field`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Single Use Center Field",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Center Field",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_center <- grid.arrange(hist_center, box_center, ncol=2)
# Creates histogram of appearances for utility center field
hist_center_u <- data %>% filter(data$role=="utility", data$`center field` > 0) %>%
ggplot(aes(x = `center field`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Utility Center Field",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Center Field",
y = "Frequency") +
theme_minimal()
# box plot of appearances for utility center field
box_center_u <- data %>% filter(data$role=="utility", data$`center field` > 0) %>%
ggplot(aes(x = "Category", y = `center field`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Utility Center Field",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Center Field",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_center_u <- grid.arrange(hist_center_u, box_center_u, ncol=2)
Although the center field position continues the previously identified trends, another pattern becomes apparent when the histogram of single-use center fielders is compared with the distributions of other single-use players. Only the shortstop position shows a similarly uniform distribution. There is a slight right skew in the appearances of single-use center fielders, but it is far less extreme than at every other position, and especially less extreme than among utility players.
# Creates histogram of appearances for single use right field
hist_right <- data %>% filter(data$role=="single use", data$`right field` > 0) %>%
ggplot(aes(x = `right field`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Single Use Right Field",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Right Field",
y = "Frequency") +
theme_minimal()
# box plot of appearances for single use right field
box_right <- data %>% filter(data$role=="single use", data$`right field` > 0) %>%
ggplot(aes(x = "Category", y = `right field`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Single Use Right Field",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Right Field",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_right <- grid.arrange(hist_right, box_right, ncol=2)
# Creates histogram of appearances for utility right field
hist_right_u <- data %>% filter(data$role=="utility", data$`right field` > 0) %>%
ggplot(aes(x = `right field`)) +
geom_histogram(color = "steelblue4", fill = "lightblue2") +
labs(title = "Histogram of Utility Right Field",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Number of Games at Right Field",
y = "Frequency") +
theme_minimal()
# box plot of appearances for utility right field
box_right_u <- data %>% filter(data$role=="utility", data$`right field` > 0) %>%
ggplot(aes(x = "Category", y = `right field`)) +
geom_boxplot(color = "steelblue4", outlier.colour = "red") +
labs(title = "Boxplot of Utility Right Field",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman package",
x = "Right Field",
y = "Number of Games Played") +
theme_minimal()
# Create combined chart
combined_right_u <- grid.arrange(hist_right_u, box_right_u, ncol=2)
Unsurprisingly, the right field position followed trends experienced in other positions when evaluating the appearances of single-use and utility players.
# split data into two distinct dataframes, one for single use players and one for utility players
single_use <- subset(data, role == "single use")
utility_use <- subset(data, role == "utility")
# Extract number of appearances for single use pitchers
pitcher_df <- data.frame(pitcher = numeric(0))
single_pitcher <- subset(single_use, pitcher > 0)$pitcher
pitcher_df <- rbind(pitcher_df, data.frame(pitcher = single_pitcher))
# Extract number of appearances for single use catchers
catcher_df <- data.frame(catcher = numeric(0))
single_catcher <- subset(single_use, catcher > 0)$catcher
catcher_df <- rbind(catcher_df, data.frame(catcher = single_catcher))
# Extract number of appearances for single use first base
first_df <- data.frame(first = numeric(0))
single_first <- subset(single_use, `first base` > 0)$`first base`
first_df <- rbind(first_df, data.frame(first = single_first))
# Extract number of appearances for single use second base
second_df <- data.frame(second = numeric(0))
single_second <- subset(single_use, `second base` > 0)$`second base`
second_df <- rbind(second_df, data.frame(second = single_second))
# Extract number of appearances for single use third base
third_df <- data.frame(third = numeric(0))
single_third <- subset(single_use, `third base` > 0)$`third base`
third_df <- rbind(third_df, data.frame(third = single_third))
# Extract number of appearances for single use short stop
short_df <- data.frame(short = numeric(0))
single_short <- subset(single_use, `short stop` > 0)$`short stop`
short_df <- rbind(short_df, data.frame(short = single_short))
# Extract number of appearances for single use left field
left_df <- data.frame(left = numeric(0))
single_left <- subset(single_use, `left field` > 0)$`left field`
left_df <- rbind(left_df, data.frame(left = single_left))
# Extract number of appearances for single use center field
center_df <- data.frame(center = numeric(0))
single_center <- subset(single_use, `center field` > 0)$`center field`
center_df <- rbind(center_df, data.frame(center = single_center))
# Extract number of appearances for single use right field
right_df <- data.frame(right = numeric(0))
single_right <- subset(single_use, `right field` > 0)$`right field`
right_df <- rbind(right_df, data.frame(right = single_right))
# Create one data frame with all extracted information
single_all_pos <- data.frame(pitcher = rep(NA, 2751))
single_all_pos[1:nrow(pitcher_df), "pitcher"] <- pitcher_df$pitcher
single_all_pos[1:nrow(catcher_df), "catcher"] <- catcher_df$catcher
single_all_pos[1:nrow(first_df), "first"] <- first_df$first
single_all_pos[1:nrow(second_df), "second"] <- second_df$second
single_all_pos[1:nrow(third_df), "third"] <- third_df$third
single_all_pos[1:nrow(short_df), "short"] <- short_df$short
single_all_pos[1:nrow(left_df), "left"] <- left_df$left
single_all_pos[1:nrow(center_df), "center"] <- center_df$center
single_all_pos[1:nrow(right_df), "right"] <- right_df$right
# Create box plot of appearances for all single use positions
ggplot(melt(single_all_pos), aes(x = variable, y = value, fill = variable)) +
geom_boxplot() +
labs(title = "Boxplot of Appearances For Single Use Players",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman Package",
x = "Positions",
y = "Number of Appearances") +
theme_minimal() +
theme(legend.position = "none")
## No id variables; using all as measure variables
The box plot of all single-use positions highlights key differences in the mean number of appearances and in the distribution of the data. Although pitchers are the most numerous, their data is the most tightly bunched. This suggests that although pitchers are, on average, used the least often, they fulfill a precise role and are highly specialized. The third base and shortstop positions have the widest distribution of observed data and the highest mean number of appearances.
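The padded data frame built above relies on a hard-coded row count and NA padding so that vectors of different lengths fit side by side. One alternative, sketched here under the assumption that each position's appearances are available as a plain numeric vector, is to keep them in a named list and let base R's `stack()` produce the long format that a grouped box plot needs. The list `single_positions` and its values are hypothetical. Because the line inside each box marks the median rather than the mean, the sketch computes both.

```r
# Hypothetical per-position appearance vectors of unequal length
single_positions <- list(
  pitcher = c(30, 35, 60, 28),
  catcher = c(90, 110),
  short   = c(20, 75, 150)
)

# stack() converts the named list to long format: one row per observation
single_long <- stack(single_positions)
names(single_long) <- c("value", "variable")

# Median (the line drawn inside each box) and mean per position
position_medians <- tapply(single_long$value, single_long$variable, median)
position_means   <- tapply(single_long$value, single_long$variable, mean)

# A base R box plot of the same long-format data
boxplot(value ~ variable, data = single_long,
        xlab = "Positions", ylab = "Number of Appearances")
```

The resulting `single_long` data frame has the same `variable` and `value` columns that `melt()` produced, so it could be passed to the existing `ggplot()` call unchanged.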
Mean appearances were not equal across positions. Generally, the more specialized a player is in their position, the more games they play; the exception is catcher, where utility players had the greater mean number of appearances. Single-use shortstops had the highest mean number of appearances, followed by third base and center field. Because the shortstop and center fielder serve as defensive leaders of the infield and outfield, this result offers some insight into player longevity and importance to the team.
The most interesting insight gained from utilization rates came from comparing the catcher and pitcher positions. In the data set, pitchers had more single-use occurrences, more single-use appearances, and a greater single-use mean number of appearances. All indicators suggest that pitchers are used for a specific purpose and fulfill a particular need for a team. In contrast to pitchers, catchers had more single-use occurrences and single-use appearances, but their mean number of games was greater among utility players. The reason the catcher position experiences such a high amount of changeover cannot be determined from this data set. It may result from injuries, trades, or cuts, or catchers may commonly fulfill a crucial offensive role that is not obvious from an analysis of defensive positions.
The next part of the analysis examined whether the number of appearances a player made is connected to the percentage of games in which they were in the starting lineup. This information further illuminates utilization trends.
data %>%
ggplot(aes(x = `total appearances`, y = `start percent`)) +
geom_point() +
labs(title = "Start Percent and Appearances",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman Package",
x = "Number of Appearances",
y = "Starting Percentage") +
theme_minimal()
The scatter plot of appearances and starting percentages yields several interesting observations. As expected, there is a slight positive relationship between a player's start percentage and the total number of games in which they appeared. Highly skilled players have earned a spot in the starting lineup and typically sign contracts for long periods with the same team. Barring injuries, they experience long and successful careers. However, the far left of the chart contains puzzling distributions. The lower left portion of the scatter plot shows an exponential-logarithmic distribution. That type of distribution is usually associated with plotting an increase in failure rates. In this data set, however, it expresses the inverse: as the starting percentage decreases, the number of games increases.
Expanding on the analysis of starting percentage and total appearances, the analysis sought to explain these puzzling distributions. Perhaps the role of the player would illuminate the reasons behind the previous observations. The following scatter plot shows the relationship between starting percentage and total number of appearances, colored by player role.
data %>%
ggplot(aes(x = `total appearances`, y = `start percent`, color = role)) +
geom_point() +
labs(title = "Start Percent and Appearances by Role",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman Package",
x = "Number of Appearances",
y = "Starting Percentage") +
theme_minimal()
With the points colored by player role, the source of the interesting distributions becomes more evident. The utility players follow the predicted positive relationship: as the starting percentage increases, the number of games increases. The exponential-logarithmic distribution, by contrast, is created by single-use players.
To further isolate the factor behind the puzzling distribution, the next step in the analysis was to create a scatter plot of starting percentage against number of appearances, colored by position.
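The direction of the relationship within each role can also be summarized numerically. The sketch below computes a Pearson correlation between appearances and starting percentage separately for each role using base R. The `toy` data frame, its values, and the underscore column names `total_appearances` and `start_percent` are hypothetical stand-ins for the report's `total appearances` and `start percent` columns. A single correlation coefficient only captures linear association, so it complements the scatter plot rather than replacing it.

```r
# Hypothetical rows; values are illustrative only
toy <- data.frame(
  role = c(rep("utility", 5), rep("single use", 5)),
  total_appearances = c(20, 45, 80, 110, 150, 30, 32, 60, 70, 75),
  start_percent     = c(10, 30, 55, 70, 95, 100, 100, 0, 0, 0)
)

# Pearson correlation between appearances and start percent within each role
cor_by_role <- sapply(split(toy, toy$role), function(d) {
  cor(d$total_appearances, d$start_percent)
})
print(cor_by_role)
```

In this toy example the utility group has a positive coefficient and the single-use group a negative one, which is the kind of contrast the colored scatter plot suggests.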
data %>%
ggplot(aes(x = `total appearances`, y = `start percent`, color = `defensive position`)) +
geom_point() +
labs(title = "Start Percent and Appearances by Position",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman Package",
x = "Number of Appearances",
y = "Starting Percentage") +
theme_minimal()
From the scatter plot above, it is apparent that the pitchers are responsible for the observed exponential-logarithmic distribution. This occurrence is understandable based on how pitchers are used in a game. It was previously identified that there was not only a larger number of pitchers within the data set, but also that all indications were that they fulfilled particular roles for a team. The lower left-hand corner of the chart contains a subgroup of pitchers called relief pitchers. These players are highly skilled in their position and are trusted to finish games after the starting pitcher has to be removed because of fatigue. Additionally, these pitchers come in for a short duration, depending on what is occurring during the game, to suit defensive strategy.
The previous visualizations showed the relationship between starting percentage and number of appearances for all positions. A separate chart restricted to pitchers was still required to understand the pitcher observations by role.
data %>% filter(data$`defensive position` == "pitcher") %>%
ggplot(aes(x = `total appearances`, y = `start percent`, color = role)) +
geom_point() +
labs(title = "Start Percent and Appearances of All Pitchers",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman Package",
x = "Number of Appearances",
y = "Starting Percentage") +
theme_minimal()
From the above scatter plot, two results were immediately apparent: relief pitchers were single-use players, and in some cases they played in more games than their starting counterparts. In every position other than pitcher, the player in the starting lineup was expected to play in more games than their substitute teammate.
data %>% filter(data$`defensive position` == "pitcher", role == "single use") %>%
ggplot(aes(x = `total appearances`, y = `start percent`)) +
geom_point() +
labs(title = "Start Percent and Appearances of Single Use Pitchers",
subtitle = "Appearances Dataset",
caption = "Data Source: Lahman Package",
x = "Number of Appearances",
y = "Starting Percentage") +
theme_minimal()
The chart above shows that some players start in every game they play. Across the upper left of the chart there is a horizontal line of points. Each point is a player who started 100% of their games, and that pattern ends at fewer than 40 games. While it may appear as though those players are less desirable to a team because of their low utilization rates, there could be other factors in play. From the previous analysis, it is known that pitchers are chosen to fulfill specific roles to support strategy. Still, the observations seen in the chart could also hint at the recovery time needed between games for starting pitchers.
The final question of this project was to understand how utility players were used. Would a utility player be expected to play in every defensive position, or are there utilization trends? For a player seeking a longer career but lacking the high skill needed to warrant a starting position, understanding which positions are most commonly shared would allow them to offer more value to their organization. From the team’s perspective, awareness of common position pairings would improve their understanding of the strengths and limitations of their roster.
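The point at which the horizontal line of always-starting pitchers ends can be read from the data instead of the chart. The following base R sketch filters for a starting percentage of 100 and takes the largest appearance count among those rows. The `toy_pitchers` data frame, its values, and its underscore column names are hypothetical; it also assumes the starting percentage is stored on a 0 to 100 scale, which should be checked against the real column before reuse.

```r
# Hypothetical single-use pitcher rows; values are illustrative only
toy_pitchers <- data.frame(
  total_appearances = c(12, 31, 33, 65, 70),
  start_percent     = c(100, 100, 100, 0, 5)
)

# Pitchers who started every game in which they appeared
always_start <- toy_pitchers[toy_pitchers$start_percent == 100, ]

# Largest number of games among pitchers who always started
max_games_always_start <- max(always_start$total_appearances)
```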
# subset the data to only contain utility players
utility <- subset(data, role == "utility")
# compute sums of appearances per position for utility players
utility_total <- utility %>%
group_by(`defensive position`) %>%
summarize(catcher = sum(catcher),
center = sum(`center field`),
first = sum(`first base`),
left = sum(`left field`),
pitcher = sum(pitcher),
right = sum(`right field`),
second = sum(`second base`),
short = sum(`short stop`),
third = sum(`third base`))
numeric_utility <- utility_total[, -1]
chord_data <- as.matrix(numeric_utility)
chord_data[lower.tri(chord_data, diag = TRUE)] <- 0
position_names <- c("catcher", "center", "first",
"left", "pitcher", "right",
"second", "short", "third")
rownames(chord_data) <- position_names
chordDiagram(chord_data, transparency = .5)
title("Utility Player Position Combinations")
| defensive position | catcher | center | first | left | pitcher | right | second | short | third |
|---|---|---|---|---|---|---|---|---|---|
| catcher | 6014 | 30 | 268 | 116 | 52 | 30 | 45 | 0 | 25 |
| center field | 10 | 6524 | 37 | 1316 | 12 | 1446 | 315 | 162 | 81 |
| first base | 50 | 90 | 5118 | 389 | 37 | 387 | 324 | 30 | 525 |
| left field | 19 | 1276 | 329 | 8155 | 15 | 2060 | 269 | 84 | 200 |
| pitcher | 0 | 4 | 2 | 4 | 194 | 10 | 0 | 0 | 0 |
| right field | 32 | 1192 | 333 | 1398 | 16 | 7113 | 128 | 58 | 132 |
| second base | 2 | 130 | 483 | 680 | 64 | 482 | 8881 | 1330 | 1197 |
| short stop | 1 | 96 | 89 | 112 | 22 | 111 | 877 | 4049 | 419 |
| third base | 10 | 30 | 651 | 283 | 48 | 189 | 1051 | 645 | 7253 |
The chord diagram in this project is a visual representation of the information contained in the table above. The first column lists each primary defensive position, and the remaining columns are the defensive positions in the same order as the first column. Each cell shows the total number of appearances in the data set for that pairing of positions. For example, in the first row of the table, utility catchers played in the catcher position 6014 times, center field 30 times, first base 268 times, left field 116 times, and so on.
When reading the chart, it is important to understand that each player made the highest number of appearances in their primary position, and the chart displays positional movements. The movements can occur during a game or across different games. The outside ring of the chart is the total number of observations for each position, that is, the row sum from the above table excluding the diagonal where the positions match. The width of each bar represents the number of times the player moved to another position. For example, the total number of movements for the first base position is 1832, and the contribution of the catcher to those movements is 50. A Sankey diagram could have provided more precise results if it were known from which position to which position a player was moving. Instead, the chord diagram only presents the frequency of pairings.
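The row-sum calculation described above can be reproduced from the table. The sketch below rebuilds the table as a base R matrix, with values copied from the table in this report, and subtracts the diagonal from each row sum to obtain the movements away from the primary position. The object names `utility_matrix` and `movements` are introduced here for illustration only.

```r
positions <- c("catcher", "center", "first", "left", "pitcher",
               "right", "second", "short", "third")

# Values copied from the table above (rows: primary position)
utility_matrix <- matrix(c(
  6014,   30,  268,  116,  52,   30,   45,    0,   25,
    10, 6524,   37, 1316,  12, 1446,  315,  162,   81,
    50,   90, 5118,  389,  37,  387,  324,   30,  525,
    19, 1276,  329, 8155,  15, 2060,  269,   84,  200,
     0,    4,    2,    4, 194,   10,    0,    0,    0,
    32, 1192,  333, 1398,  16, 7113,  128,   58,  132,
     2,  130,  483,  680,  64,  482, 8881, 1330, 1197,
     1,   96,   89,  112,  22,  111,  877, 4049,  419,
    10,   30,  651,  283,  48,  189, 1051,  645, 7253
), nrow = 9, byrow = TRUE, dimnames = list(positions, positions))

# Movements away from the primary position: row sum minus the diagonal
movements <- rowSums(utility_matrix) - diag(utility_matrix)
movements[["first"]]  # 1832, matching the first base example above
```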
Some conclusions can be drawn from the chord diagram and corresponding table. Outfielders move to each of the other outfield positions more often than they move to any single infield position. This result indicates the specialization of those players in the outfield. It suggests that some athletic traits specific to outfield and infield players make them better suited to those positions than to the others. Among the infield players, movement most frequently occurs among three particular positions: third base, shortstop, and second base. This observation is interesting because those positions are played next to each other in one section of the field, which hints at a defensive strategy or defensive management.
Based on the analysis in this project, the more skilled a player becomes in their primary position, the more likely they are to start and, therefore, to play more games. Pitchers are a crucial component of every organization's defensive roster and fulfill niche requirements. Finally, a player who cannot become skilled enough in any single position can still extend their career by adding flexibility to their organization: outfielders should increase their ability to play the other outfield positions, and infielders should become capable at the other infield positions, as those are the most frequently observed positional pairings.
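The outfield observation can be checked against the three outfield rows of the table. For each primary outfield position, the sketch below compares the smallest count of moves to another outfield position with the largest count of moves to any single non-outfield position. The values are copied from the table in this report; the object names are introduced here for illustration, and `non_outfield` groups pitcher and catcher with the infield positions as a simplifying assumption.

```r
cols <- c("catcher", "center", "first", "left", "pitcher",
          "right", "second", "short", "third")
outfield <- c("center", "left", "right")

# Outfield rows copied from the table above (rows: primary position)
outfield_rows <- matrix(c(
  10, 6524,  37, 1316, 12, 1446, 315, 162,  81,
  19, 1276, 329, 8155, 15, 2060, 269,  84, 200,
  32, 1192, 333, 1398, 16, 7113, 128,  58, 132
), nrow = 3, byrow = TRUE, dimnames = list(outfield, cols))

non_outfield <- setdiff(cols, outfield)

# Per primary outfield position: fewest moves to another outfield spot
# versus most moves to any single non-outfield spot
check <- t(sapply(outfield, function(p) {
  others <- setdiff(outfield, p)
  c(min_to_outfield    = min(outfield_rows[p, others]),
    max_to_nonoutfield = max(outfield_rows[p, non_outfield]))
}))
print(check)
```

In every row the first column exceeds the second, which is consistent with the statement that outfielders move within the outfield more than to any single infield position.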