462 final project

MLB data

This initial dataset is one that I have examined before. It gives stats about the overall best players from year to year. This dataset only observes hitting stats, but it gives an insight into what really makes the best hitters. Below is the data dictionary to show the variables being used and discussed throughout this document:

Column Name       Description
player_id         Unique identifier for the player.
player_full_name  Full name of the player.
team_id           Abbreviation for the MLB team the player belongs to.
team_full_name    Full name of the MLB team.
position          Player’s field position (e.g., 1B, OF, P).
avg               Batting average (hits divided by at-bats).
obp               On-base percentage (times reached base / plate appearances).
slg               Slugging percentage (total bases / at-bats).
ops               On-base plus slugging (OBP + SLG).
hr                Home runs.
rbi               Runs batted in.
runs              Total number of runs scored by the player.
sb                Stolen bases.
bb                Walks (base on balls).
so                Strikeouts.
pa                Plate appearances (number of times a player comes to bat).
ab                At-bats.
h                 Hits.
doubles           Number of doubles hit.
triples           Number of triples hit.
total_bases       Total number of bases from hits.
war               Wins Above Replacement – overall player value over a replacement-level player.
league            League abbreviation (e.g., AL or NL).
season            MLB season year (e.g., 2024).

“Does a higher number of hits contribute to a higher OBP?”

I want to examine this because many baseball followers focus mainly on hits as the measure of getting on base. Walks, both intentional and unintentional, can also contribute heavily to a player’s OBP, yet batting average often gets more attention, so this should be a very interesting visualization. To display it, I will compare each batter’s number of hits with their OBP for that season in a scatter plot. The x-axis will be hits and the y-axis will be OBP. Because there will be so many points on the graph, I will use alpha = 0.5 so that areas of higher and lower concentration are visible.

The analysis above shows the relationship between hits and OBP. Although I expected a stronger relationship, there is still a small correlation between hits and OBP. In terms of the question I posed, hits don’t contribute as much to OBP as some people think. This graph suggests that, most likely, walks contribute more to a high OBP (on-base percentage).
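
To make the point about walks concrete, here is a minimal Python sketch of the standard OBP formula, (H + BB + HBP) / (AB + BB + HBP + SF). The two season lines are hypothetical numbers made up for illustration, and hit-by-pitch (hbp) and sacrifice flies (sf) are assumed inputs that are not columns in the data dictionary above.

```python
def obp(h, bb, hbp, ab, sf):
    """On-base percentage: (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denom = ab + bb + hbp + sf
    return (h + bb + hbp) / denom if denom else 0.0


# Hypothetical season lines (made-up numbers, NOT taken from the dataset).
contact_hitter = dict(h=180, bb=30, hbp=2, ab=600, sf=5)
patient_hitter = dict(h=150, bb=100, hbp=5, ab=520, sf=5)

contact_obp = obp(**contact_hitter)
patient_obp = obp(**patient_hitter)

print(f"contact hitter: {contact_hitter['h']} hits, OBP {contact_obp:.3f}")
print(f"patient hitter: {patient_hitter['h']} hits, OBP {patient_obp:.3f}")
```

In this made-up case the second player has 30 fewer hits but the higher OBP, because walks count in the numerator just as hits do.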

“Do players who have more hits have more RBIs?”

I feel that looking into this spread could be very interesting. One thing that I could find out from this data is where the hitters are in the batting lineup. There could be a lot of players with large numbers of hits but very few RBIs. To answer this question, I’m going to use mutate to make sure that both variables are numeric. A scatter plot is the best way to see the relationship between the two. Because of the large number of players in this dataset, limiting the scale on the axes is going to be important to best show the relationship.
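
The numeric-conversion step can be sketched as follows. This is a Python illustration of the idea rather than the mutate call itself: the helper to_number and the sample rows are hypothetical, and dropping rows with missing values is an assumption about how they should be handled.

```python
def to_number(value):
    """Return value as a float, or None when it is blank or not numeric."""
    try:
        return float(str(value).strip())
    except ValueError:
        return None


# Hypothetical raw rows (made-up values) where stats arrive as text.
rows = [
    {"player": "A", "h": "150", "rbi": "80"},
    {"player": "B", "h": "", "rbi": "45"},      # missing hits
    {"player": "C", "h": "172", "rbi": "n/a"},  # non-numeric RBI
]

# Keep only rows where both variables converted cleanly.
clean = []
for row in rows:
    h, rbi = to_number(row["h"]), to_number(row["rbi"])
    if h is not None and rbi is not None:
        clean.append({"player": row["player"], "h": h, "rbi": rbi})

print(clean)
```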

Observing the results, there is a clear upward trend between the number of hits a player has and the number of RBIs they have. However, a few players stand out with many hits and relatively few RBIs. These players could be leadoff hitters, who often bat with fewer runners on base. They are the exceptions to the question I asked. Most of the time, more hits means more RBIs.
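
One way to pick out the high-hit, low-RBI players is to compare RBIs per hit against a cutoff. The Python sketch below uses hypothetical players and hypothetical thresholds (MIN_HITS and MAX_RBI_PER_HIT are made up for illustration and would need tuning against the real data).

```python
# Hypothetical (name, hits, rbi) rows; made-up numbers, not from the dataset.
players = [
    ("Leadoff type", 190, 55),
    ("Cleanup type", 160, 110),
    ("Bench bat", 60, 25),
    ("Everyday bat", 175, 85),
]

MIN_HITS = 150          # assumed cutoff for "a lot of hits"
MAX_RBI_PER_HIT = 0.35  # assumed cutoff for "few RBIs relative to hits"

# Flag players with many hits who drive in few runs per hit.
flagged = [
    name
    for name, hits, rbi in players
    if hits >= MIN_HITS and rbi / hits <= MAX_RBI_PER_HIT
]

print(flagged)
```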

“Which league tends to hit more home runs?”

This question matters for competitions like the Home Run Derby as well as for the pride of baseball fans. Knowing this statistic could be important for predictions. One caveat, however, is that the National League only recently adopted the designated hitter, which could contribute greatly to higher home run totals. To answer this question, I am going to use a boxplot, which shows not only the median for each league but also the interquartile range. Additionally, this boxplot will show outliers such as the player with the most home runs in the past 20 years. Converting HR to a numeric value with as.numeric is important so that all values are included in the graph.
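
The numbers a boxplot draws can be sketched directly. The Python example below computes quartiles and flags points beyond the usual 1.5 × IQR fences. The home run totals are hypothetical, and the "inclusive" quartile method is an assumption: plotting tools use slightly different quartile conventions, so values may not match a drawn boxplot exactly.

```python
from statistics import quantiles


def box_summary(values):
    """Quartiles plus points beyond the 1.5 * IQR fences."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Hypothetical home run totals per league (made-up, NOT from the dataset).
hr_by_league = {
    "AL": [5, 12, 18, 22, 25, 31, 58],
    "NL": [4, 10, 17, 21, 24, 29, 40],
}

summary = {league: box_summary(hrs) for league, hrs in hr_by_league.items()}

for league, stats in summary.items():
    print(league, stats)
```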

The results from this graph do not surprise me. Although the results are similar, the American League has a larger range of values. Additionally, the American League has the player with the most HRs. This could be a result of the National League lacking the designated hitter role in the lineup until it was recently introduced.

Which variables are most closely correlated?

To answer this question, I created a correlation matrix. Ideally, this will show which variables are most closely related to each other across all of the players and the statistics they put up each year.

    Var1  Var2      Freq   abs_cor
1  Rbat+  OPS+ 0.9867259 0.9867259
2   OPS+ Rbat+ 0.9867259 0.9867259
3     AB    PA 0.9863059 0.9863059
4     PA    AB 0.9863059 0.9863059
5   rOBA   OPS 0.9701480 0.9701480
6    OPS  rOBA 0.9701480 0.9701480
7    OPS   SLG 0.9557740 0.9557740
8    SLG   OPS 0.9557740 0.9557740
9     PA     G 0.9549452 0.9549452
10     G    PA 0.9549452 0.9549452
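
Each pair in the output above appears twice because a correlation matrix is symmetric. As a minimal Python sketch, the rows can be reduced to one entry per unordered pair. The rows are copied from the output above, and the variable names rows and unique are illustrative.

```python
# (Var1, Var2, correlation) rows copied from the output above.
rows = [
    ("Rbat+", "OPS+", 0.9867259),
    ("OPS+", "Rbat+", 0.9867259),
    ("AB", "PA", 0.9863059),
    ("PA", "AB", 0.9863059),
    ("rOBA", "OPS", 0.9701480),
    ("OPS", "rOBA", 0.9701480),
    ("OPS", "SLG", 0.9557740),
    ("SLG", "OPS", 0.9557740),
    ("PA", "G", 0.9549452),
    ("G", "PA", 0.9549452),
]

# Keep the first occurrence of each unordered pair.
seen = set()
unique = []
for var1, var2, cor in rows:
    key = frozenset((var1, var2))
    if key not in seen:
        seen.add(key)
        unique.append((var1, var2, cor))

for var1, var2, cor in unique:
    print(f"{var1:>6} ~ {var2:<6} {cor:.4f}")
```

Several of these pairs are related by definition (OPS is OBP + SLG, and every at-bat is also a plate appearance), so their high correlations are expected.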

Is WAR directly impacted by the number of home runs a player hits?

To examine this question, I will compare the two variables in a scatter plot. Doing this will allow me to see whether the number of home runs a player hits directly and positively affects that player’s WAR.
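
Alongside the scatter plot, a single number can summarize how strongly the two variables move together. Below is a minimal Python sketch of the Pearson correlation coefficient. The HR and WAR values are hypothetical numbers made up for illustration, and a correlation on its own would not show that home runs cause a higher WAR.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical season totals (made-up numbers, NOT from the dataset).
hr = [5, 12, 20, 28, 35, 41]
war = [0.4, 1.9, 1.5, 3.8, 3.1, 5.6]

r = pearson(hr, war)
print(f"r = {r:.3f}")
```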

Secondary Data

Those familiar with the baseball community may recognize the next dataset I used. Baseball Reference is a website where player stats are recorded, and players even have their own pages. The direct link to the page is: https://www.baseball-reference.com/leagues/majors/2024-standard-batting.shtml. To compare this to the earlier graph, I am going to revisit the question “Which league tends to hit more home runs?” using this new data.

As shown in the graph, the average number of home runs for both leagues is around 5; however, this includes players who aren’t the best hitters. The graph also shows that the NL has a larger range of home runs than the AL.