This initial dataset is one that I have examined before. It gives stats about the overall best players from year to year. The dataset covers only hitting stats, but it gives insight into what really makes the best hitters. Below is the data dictionary showing the variables used and discussed throughout this document:
| **Column Name** | **Description** |
|-----------------|-----------------|
| `player_id` | Unique identifier for the player. |
| `player_full_name` | Full name of the player. |
| `team_id` | Abbreviation for the MLB team the player belongs to. |
| `team_full_name` | Full name of the MLB team. |
| `position` | Player's field position (e.g., 1B, OF, P). |
| `avg` | Batting average (hits divided by at-bats). |
| `obp` | On-base percentage (times reached base / plate appearances). |
| `slg` | Slugging percentage (total bases / at-bats). |
| `ops` | On-base plus slugging (OBP + SLG). |
| `hr` | Home runs. |
| `rbi` | Runs batted in. |
| `runs` | Total number of runs scored by the player. |
| `sb` | Stolen bases. |
| `bb` | Walks (base on balls). |
| `so` | Strikeouts. |
| `pa` | Plate appearances (number of times a player comes to bat). |
| `ab` | At-bats. |
| `h` | Hits. |
| `doubles` | Number of doubles hit. |
| `triples` | Number of triples hit. |
| `total_bases` | Total number of bases from hits. |
| `war` | Wins Above Replacement: overall player value over a replacement-level player. |
| `league` | League abbreviation (e.g., AL or NL). |
| `season` | MLB season year (e.g., 2024). |
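To make the rate statistics in the dictionary concrete, here is a minimal worked sketch in base R. The season line is hypothetical (it is not taken from the dataset), and the OBP line uses the standard formula with hit-by-pitch and sacrifice flies, which is slightly more detailed than the simplified description in the dictionary. The variable names `hbp` and `sf` are assumptions for this example.

```r
# Hypothetical season line for one batter (illustrative numbers only).
h <- 150; doubles <- 30; triples <- 3; hr <- 25
bb <- 60; hbp <- 5; sf <- 5; ab <- 550

# Total bases: singles count 1, doubles 2, triples 3, home runs 4.
singles <- h - doubles - triples - hr
total_bases <- singles + 2 * doubles + 3 * triples + 4 * hr

avg <- h / ab                                  # batting average
obp <- (h + bb + hbp) / (ab + bb + hbp + sf)   # on-base percentage (standard formula)
slg <- total_bases / ab                        # slugging percentage
ops <- obp + slg                               # on-base plus slugging

print(round(c(avg = avg, obp = obp, slg = slg, ops = ops), 3))
```

Note that walks enter OBP but not AVG or SLG, which is why a player can have a modest batting average and still post a high OBP.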
“Does a higher number of hits contribute to a higher OBP?”
I want to look at this because many baseball followers focus mainly on hits as the measure of getting on base. Walks, both intentional and unintentional, can also contribute heavily to a player's OBP, yet batting average often gets more attention, so this should be an interesting visualization.

To build it, I will compare each batter's number of hits with his OBP for that season in a scatter plot. The x-axis will be hits and the y-axis will be OBP. Because there will be so many points on the graph, I will use alpha = .5 so that areas where points are more or less concentrated are visible.
The plot above shows the relationship between hits and OBP. Although I expected a stronger relationship, there is still a weak positive correlation between hits and OBP. In terms of the question I posed, hits don’t contribute as much to OBP as some people think. The graph suggests that, most likely, walks contribute more to a high OBP (on-base percentage).
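One way to check the walks explanation more directly would be to compare how strongly OBP correlates with hits and with walks. The sketch below does this on a small hypothetical data frame, so its numbers say nothing about the real dataset; the column names `H`, `BB`, `PA`, and `OBP` mirror the ones used in this report, and the simplified OBP formula is an assumption for the example.

```r
# Hypothetical toy data (not from the real dataset).
toy <- data.frame(
  H  = c(180, 150, 165, 120, 140, 100),
  BB = c(30, 90, 45, 80, 25, 60),
  PA = c(680, 650, 660, 560, 600, 480)
)

# Simplified OBP: hits plus walks over plate appearances (assumption for this sketch).
toy$OBP <- (toy$H + toy$BB) / toy$PA

# Pearson correlation of OBP with each counting stat.
cors <- c(
  H  = cor(toy$H,  toy$OBP),
  BB = cor(toy$BB, toy$OBP)
)
print(round(cors, 2))
```

On the real data, the same comparison would be `cor(mydata$H, mydata$OBP, use = "complete.obs")` against `cor(mydata$BB, mydata$OBP, use = "complete.obs")`.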
“Do players that have more hits have a higher number of RBIs?”
I think looking into this spread could be very interesting. One thing I could learn from this data is where hitters bat in the lineup. There could be many players with large numbers of hits but very few RBIs. To answer this question, I am going to use mutate to make sure that both variables are numeric. A scatter plot is the best way to see the relationship between the two. Because of the large number of players in this dataset, limiting the scale on the axes is going to be important to show the relationship clearly.
Observing the results, there is a clear upward trend between the number of hits a player has and the number of RBIs he has. However, a few players stand out with large numbers of hits and comparatively few RBIs. These players could be leadoff hitters, who often bat with nobody on base. They are the exceptions to the question I asked. Most of the time, more hits means more RBIs.
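The players who stand out could be identified rather than eyeballed. A minimal sketch, assuming a linear fit is an adequate baseline: fit RBI on hits and look for the largest negative residual, meaning far fewer RBIs than the hit total would predict. The data frame and player labels below are hypothetical.

```r
# Hypothetical toy data; player "F" has many hits but few RBIs.
toy <- data.frame(
  player = c("A", "B", "C", "D", "E", "F"),
  H      = c(200, 180, 160, 150, 120, 190),
  RBI    = c(110, 95, 85, 80, 60, 45)
)

# Linear baseline: expected RBIs given a hit total.
fit <- lm(RBI ~ H, data = toy)
toy$resid <- residuals(fit)

# The most negative residual marks the player furthest below the trend line.
flagged <- toy[which.min(toy$resid), ]
print(flagged)
```

Whether a flagged player really is a leadoff hitter cannot be confirmed from these columns alone; that would need batting-order data, which this dataset does not include.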
“Which league tends to hit more home runs?”
This question matters for competitions like the Home Run Derby as well as for the pride of baseball fans. Knowing this statistic could be important for predictions. One caveat, however, is that the National League only recently adopted the designated hitter, which could contribute greatly to higher home run totals. To answer this question, I am going to use a boxplot, which shows not only the median for each league but also the interquartile range. Additionally, the boxplot will show outliers, such as the player with the most home runs in the past 20 years. Converting HR to a numeric value with as.numeric is important so that all values are included in the graph.
The results from this graph do not surprise me. Although the results are similar, the American League has a larger range of values. Additionally, the American League has the player with the most home runs. This may be a result of the National League lacking the designated hitter role until its recent introduction.
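The quantities a boxplot draws can also be printed as numbers, which makes the league comparison easier to state precisely. This is a minimal base R sketch on hypothetical values; the column names `League` and `HR` follow the ones used in this report.

```r
# Hypothetical toy data (not from the real dataset).
toy <- data.frame(
  League = c("AL", "AL", "AL", "AL", "NL", "NL", "NL", "NL"),
  HR     = c(10, 20, 30, 62, 12, 22, 28, 40)
)

# Per-league median, quartiles, and maximum: the values a boxplot visualizes.
summary_by_league <- do.call(rbind, lapply(split(toy$HR, toy$League), function(x) {
  c(
    median = median(x),
    q1     = quantile(x, 0.25, names = FALSE),
    q3     = quantile(x, 0.75, names = FALSE),
    max    = max(x)
  )
}))
print(summary_by_league)
```

In this toy example the two medians are equal while the maximums differ, which is the same kind of pattern described above: similar centers, different ranges.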
Which variables are most closely correlated?
To answer this question, I created a correlation matrix. Ideally, this shows which variables are most closely related to each other across all of the players and the statistics they put up each year.
    Var1  Var2       Freq   abs_cor
1   Rbat+ OPS+  0.9867259 0.9867259
2   OPS+  Rbat+ 0.9867259 0.9867259
3   AB    PA    0.9863059 0.9863059
4   PA    AB    0.9863059 0.9863059
5   rOBA  OPS   0.9701480 0.9701480
6   OPS   rOBA  0.9701480 0.9701480
7   OPS   SLG   0.9557740 0.9557740
8   SLG   OPS   0.9557740 0.9557740
9   PA    G     0.9549452 0.9549452
10  G     PA    0.9549452 0.9549452
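Each pair appears twice in the output above (for example Rbat+/OPS+ and OPS+/Rbat+) because a correlation matrix is symmetric, so the ten rows contain only five distinct pairs. One possible way to list each pair once is to keep only the upper triangle of the matrix. The sketch below shows the idea on hypothetical data; it is an alternative to the approach used in this report, not the code that produced the table.

```r
# Hypothetical toy data (not from the real dataset).
toy <- data.frame(
  PA = c(600, 550, 400, 650, 500),
  AB = c(540, 500, 360, 570, 455),
  HR = c(30, 12, 8, 25, 20)
)

cor_matrix <- cor(toy)

# Keep only entries above the diagonal, so each pair is listed exactly once.
keep <- upper.tri(cor_matrix)
pairs <- data.frame(
  var1 = rownames(cor_matrix)[row(cor_matrix)[keep]],
  var2 = colnames(cor_matrix)[col(cor_matrix)[keep]],
  cor  = cor_matrix[keep]
)

# Sort by absolute correlation, strongest first.
pairs <- pairs[order(-abs(pairs$cor)), ]
print(pairs)
```

With three variables this yields three rows instead of six, and the top ten rows on the full data would then be ten distinct pairs.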
Is WAR directly impacted by the number of home runs a player hits?
To examine this question, I will compare the two variables in a scatter plot. This will let me see whether hitting more home runs goes along with a higher WAR.
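The trend line in such a scatter plot can be summarized with a simple linear fit. This is a minimal sketch on hypothetical values, assuming a straight-line relationship is a reasonable first approximation; the slope estimates how much WAR changes per additional home run, and R-squared shows how much of the variation in WAR the home run total accounts for.

```r
# Hypothetical toy data (not from the real dataset).
toy <- data.frame(
  HR  = c(5, 12, 20, 28, 35, 45),
  WAR = c(0.5, 1.8, 2.0, 4.1, 3.5, 6.2)
)

fit <- lm(WAR ~ HR, data = toy)

slope     <- coef(fit)[["HR"]]       # change in WAR per extra home run
r_squared <- summary(fit)$r.squared  # share of WAR variance explained by HR

print(c(slope = slope, r_squared = r_squared))
```

A positive slope alone would not show that home runs drive WAR, since WAR also reflects defense, baserunning, and other hitting, so R-squared is the more informative number here.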
Secondary Data
Those familiar with the baseball community may recognize this next dataset. Baseball Reference is a website that records player statistics and gives players their own pages. The direct link to the page is: https://www.baseball-reference.com/leagues/majors/2024-standard-batting.shtml. To compare this with the earlier graph, I am going to revisit the question “Which league hits more home runs?” using this new data.
As shown in the graph, the median number of home runs for both leagues is around 5; however, this includes players who aren’t the best hitters. The graph also shows that the NL has a larger range of home runs than the AL.
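Because the scraped table includes every player who came to the plate, one option for a fairer comparison with the first dataset is to keep only regular hitters before summarizing. The sketch below uses hypothetical data with the lower-case column names that `clean_names()` would produce (`lg`, `pa`, `hr`); the cutoff `min_pa` is an arbitrary assumption chosen for illustration, not a value used in this report.

```r
# Hypothetical toy data (not scraped values).
toy <- data.frame(
  lg = c("AL", "AL", "AL", "NL", "NL", "NL"),
  pa = c(650, 40, 500, 620, 15, 300),
  hr = c(30, 0, 18, 25, 0, 10)
)

min_pa <- 200  # assumed cutoff for a "regular" hitter

# Keep only players with enough plate appearances.
regulars <- toy[toy$pa >= min_pa, ]

# Median home runs per league, before and after filtering.
med_all <- tapply(toy$hr, toy$lg, median)
med_reg <- tapply(regulars$hr, regulars$lg, median)

print(rbind(all_players = med_all, regulars = med_reg))
```

Filtering raises both medians in this toy example because the players with very few plate appearances and no home runs drop out.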
Source Code
---
title: "462 final project"
editor: visual
toc: true                   # Generates an automatic table of contents.
format:                     # Options related to formatting.
  html:                     # Options related to HTML output.
    code-tools: true        # Show the code tools menu in the output.
    embed-resources: true   # Embed all components into a single HTML file.
execute:                    # Options related to the execution of code chunks.
  warning: false            # Code chunk warnings are hidden by default.
  message: false            # Code chunk messages are hidden by default.
  echo: false               # Code is hidden in the output; set to true to show it.
---

## MLB data

```{r}
library(tidyverse)
library(httr)
library(lubridate)
library(dplyr)
library(tidyr)
library(corrplot)
library(jsonlite)  # Converting JSON data into data frames
library(magrittr)  # Extracting items from list objects using piping grammar
library(rvest)
library(janitor)
```

```{r}
mydata <- read.csv(
  "https://myxavier-my.sharepoint.com/:x:/g/personal/janishefskib_xavier_edu/ETjz1TyRf5xNjPd7ooW4qLUBJ6TsJMh43uRJT8ecuqgwvQ?download=1",
  stringsAsFactors = FALSE
)

# The first data row holds the real column names; promote it to the header.
colnames(mydata) <- as.character(unlist(mydata[1, ]))
mydata <- mydata[-1, ]

# Fix the misspelled award column.
mydata <- mydata %>%
  rename("Gold Glove" = "Gold Globe")

# Everything was read as text, so convert the stat columns to numeric.
numeric_columns <- c(
  "H", "OBP", "Year", "Age", "WAR", "G", "PA", "AB", "R",
  "HR", "RBI", "AVG", "SLG", "OPS", "BB", "SO", "SB", "CS", "2B", "3B",
  "OPS+", "rOBA", "Rbat+", "TB", "GIDP", "HBP", "SH", "SF", "IBB",
  "All Star Game", "MVP", "Silver Slugger", "Gold Glove", "ROY"
)
numeric_columns <- unique(numeric_columns)
mydata[numeric_columns] <- lapply(mydata[numeric_columns], as.numeric)
```

## "Does a higher number of hits contribute to a higher OBP?"

```{r}
# Hits vs. OBP; alpha = .5 makes dense regions of points visible.
mydata %>%
  ggplot(aes(x = H, y = OBP)) +
  geom_point(color = "black", alpha = .5) +
  labs(x = "Number of Hits", y = "On Base Percentage") +
  scale_x_continuous(breaks = seq(0, 300, by = 25)) +
  scale_y_continuous(breaks = seq(.2, .65, by = 0.05), limits = c(.2, .65)) +
  geom_smooth(method = "lm")
```

## "Do players that have more hits have a higher number of RBIs?"

```{r}
mydata <- mydata %>%
  mutate(
    RBI = as.numeric(RBI),
    H = as.numeric(H)
  )

# RBI on the x-axis, hits on the y-axis.
mydata %>%
  ggplot(aes(x = RBI, y = H)) +
  geom_point(color = "darkgreen", alpha = .5, size = .75) +
  labs(x = "Runs Batted In (RBI)", y = "Hits") +
  scale_x_continuous(breaks = seq(0, 200, by = 25)) +
  geom_smooth(method = "lm")
```

## "Which league tends to hit more home runs?"

```{r}
mydata <- mydata %>%
  mutate(HR = as.numeric(HR))

# One box per league: median, quartiles, and outliers.
mydata %>%
  ggplot(aes(x = factor(League), y = HR)) +
  geom_boxplot() +
  scale_y_continuous(breaks = seq(0, 65, by = 5)) +
  labs(x = "League", y = "Number of Home Runs")
```

## Which variables are most closely correlated?

```{r}
num_data <- mydata[sapply(mydata, is.numeric)]
cor_matrix <- cor(num_data, use = "complete.obs")

# Flatten the matrix to one row per variable pair and keep the 10 strongest.
# Each pair appears twice (A-B and B-A); distinct() only drops identical rows.
cor_flat <- as.data.frame(as.table(cor_matrix)) %>%
  filter(Var1 != Var2) %>%
  mutate(abs_cor = abs(Freq)) %>%
  arrange(desc(abs_cor)) %>%
  distinct() %>%
  slice_head(n = 10)

print(cor_flat)
```

## Is WAR directly impacted by the number of home runs a player hits?

```{r}
mydata %>%
  ggplot(aes(x = HR, y = WAR)) +
  geom_point(alpha = 0.5, color = "grey") +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    x = "Home Runs (HR)",
    y = "Wins Above Replacement (WAR)",
    title = "Relationship Between Home Runs and WAR"
  ) +
  theme_minimal()
```

## Secondary Data

```{r}
set_config(user_agent("Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)"))

url <- "https://www.baseball-reference.com/leagues/majors/2024-standard-batting.shtml"
page <- read_html(url)

batting_table <- page %>%
  html_element("table#players_standard_batting") %>%
  html_table()

# Drop the header rows repeated inside the table body, then standardize names.
batting_table <- batting_table %>%
  filter(Player != "Player") %>%
  clean_names()
```

```{r}
batting_table <- batting_table %>%
  mutate(
    hr = as.numeric(hr),
    lg = as.factor(lg)
  ) %>%
  filter(!is.na(hr), !is.na(lg))

ggplot(batting_table, aes(x = lg, y = hr)) +
  geom_boxplot(fill = "lightgreen") +
  labs(
    title = "Home Runs by League (2024)",
    x = "League",
    y = "Home Runs"
  ) +
  theme_minimal()
```