MLB Data
Introduction
Line of Inquiry
In this project, I explore which Major League Baseball hitters, teams, and positions perform best on traditional offensive metrics such as home runs (HR), runs batted in (RBI), and batting average (AVG) during the 2025 season, and I also look at some of the greats from baseball history. I find this topic interesting because it can help predict potential MVP candidates and All-Star selections and give insight into player value for fantasy baseball drafts. I have always enjoyed watching baseball, and this project lets me follow the entire league without having to watch every single game.
One specific question I want to focus on: which hitters are the most efficient at getting on base in the 2025 MLB season, and how does on-base efficiency relate to player position and/or age? I’m interested in the relationship between a player’s age and position (infielder, outfielder, catcher, and so on) and their on-base efficiency (reaching base through hits, walks, etc.). I chose this topic because while home runs and RBIs are flashy, getting on base consistently is critical for a team’s offensive success. I want to find out whether certain positions, ages, or teams tend to have better on-base players.
How I Will Answer the Question
To answer this question, I pulled a data set from Kaggle named “MLB Hitting and Pitching Stats Through All Time” to view some of the best players ever to play the game. I also scraped player statistics for the current 2025 season from the official MLB website: https://www.mlb.com/stats/. This site provides detailed player performance across multiple hitting categories and is updated throughout the season, so the data stays current. Using both completed seasons or careers and a season in progress gives different perspectives and lets me answer different questions. Scraping is the better choice for the current season because the data is refreshed every time the code is run, as opposed to a CSV that would need to be re-downloaded every day.
The Kaggle data set includes columns for:
Player name
Position
Games
AB (at-bats)
Runs
Hits
Double (2B)
Triple (3B)
HR (home runs)
RBI (runs batted in)
BB (walks)
SO (strikeouts)
SB (stolen bases)
CS (caught stealing)
AVG (batting average)
OBP (on-base percentage)
SLG (slugging percentage)
OPS (on-base + slugging percentage)
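As a rough illustration of how the rate columns relate to the counting columns above, here is a minimal R sketch. The function name `rate_stats` and the sample season line are hypothetical. The standard OBP formula also uses hit-by-pitch (HBP) and sacrifice flies (SF), which are not among the listed columns, so they default to 0 here and the resulting OBP is only an approximation.

```r
# Hypothetical helper: recompute rate stats from counting stats.
# HBP and SF are not in the column list above, so they default to 0,
# which makes the OBP below an approximation of the official figure.
rate_stats <- function(h, ab, bb, doubles, triples, hr, hbp = 0, sf = 0) {
  singles <- h - doubles - triples - hr
  avg <- h / ab                                   # batting average
  obp <- (h + bb + hbp) / (ab + bb + hbp + sf)    # on-base percentage
  slg <- (singles + 2 * doubles + 3 * triples + 4 * hr) / ab  # slugging
  data.frame(AVG = avg, OBP = obp, SLG = slg, OPS = obp + slg)
}

# Made-up season line for illustration: 150 hits in 500 at-bats, 50 walks.
example <- rate_stats(h = 150, ab = 500, bb = 50,
                      doubles = 30, triples = 5, hr = 20)
print(round(example, 3))
```

Recomputing the rates this way can also serve as a sanity check on the AVG, SLG, and OPS values that ship with a data set.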
Data Analysis
Player Amounts per Position
The bar graph displaying player count by position provides helpful context for interpreting the rest of the analysis. It shows the distribution of players across different positions, giving insight into which roles are most and least represented in the data set. Positions like outfielders and infielders tend to have higher counts, which is expected since there are multiple spots (e.g., left, center, and right field) that fall under those categories. In contrast, positions like designated hitters (DHs) and especially pitchers and catchers have fewer entries, which may partly explain the greater variability or lower averages seen in their batting statistics.
This visualization reinforces the importance of considering sample size when comparing performance metrics by position. The player count affects how much weight we should give to trends observed in positions with fewer representatives.
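The counts behind a bar graph like this can be tabulated in a few lines of base R. This is a sketch on invented toy data, not the code used for the report; the column names `name` and `position` and the cutoff of 2 players are assumptions made for illustration.

```r
# Toy data: names and positions are invented for illustration.
players <- data.frame(
  name = paste("Player", 1:8),
  position = c("OF", "OF", "OF", "1B", "SS", "C", "DH", "OF")
)

# Count players per position, largest group first.
position_counts <- sort(table(players$position), decreasing = TRUE)
print(position_counts)

# Flag positions with few players; the cutoff of 2 is arbitrary.
small_groups <- names(position_counts[position_counts < 2])

# Draw the bar graph only in an interactive session.
if (interactive()) barplot(position_counts, xlab = "Position", ylab = "Players")
```

Keeping a list such as `small_groups` next to later position-level summaries makes it easier to remember which averages rest on only a few players.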
Batting Averages
The histogram of batting averages reveals a left-skewed distribution, meaning that most players tend to have batting averages clustered toward the higher end of the range, with fewer players having very low averages. The bulk of the data appears between approximately 0.230 and 0.290, and the mean seems to center around 0.260.
This skew indicates that while many players perform reasonably well at the plate, there are a few outliers with particularly low averages that pull the distribution slightly left. This is consistent with what we’d expect in professional baseball, where players must meet a certain performance threshold, so extremely low averages are rare. The shape of the distribution highlights the competitive consistency of batting performance across most players in the data set.
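The direction of the skew can also be checked numerically rather than read off the histogram. Base R does not ship a skewness function, so the sketch below defines a simple moment-based one; the function and the sample batting averages are made up for illustration.

```r
# Moment-based sample skewness: negative values indicate a longer left tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up batting averages with a few low values in the left tail.
avg <- c(0.180, 0.205, 0.245, 0.255, 0.260, 0.265, 0.270, 0.275, 0.280, 0.290)

skew_value <- skewness(avg)
print(skew_value)

# In a left-skewed sample the mean usually sits below the median.
print(mean(avg) < median(avg))
```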
On Base Percentage
The histogram of on-base percentage (OBP) shows a distribution that is approximately normal (bell-shaped), centered around a mean of about 0.340, with the exception of a few outliers. Most players’ OBPs fall within the range of 0.310 to 0.370, with fewer players at the extreme high and low ends. This suggests that on-base performance is relatively consistent among players, with most clustering around the league average.
Unlike batting average, which showed slight left skewness, the OBP distribution appears more symmetrical. This roughly normal shape suggests that on-base percentage is a stable measure of offensive performance in this data set, with most players falling within a relatively narrow and balanced range.
Home Runs by Position
The boxplot comparing home runs by player position reveals significant differences in power-hitting performance across roles. Designated hitters (DHs) stand out as having the highest home run totals by a wide margin, which aligns with their role being primarily focused on offense rather than fielding.
On the other end of the spectrum, pitchers have the lowest home run totals, with most recording very few or none at all. This is expected, as pitchers typically focus on their primary role and rarely contribute significantly on offense.
Other positions fall between these two extremes, with outfielders and corner infielders generally showing moderate home run production. First base, a position known for power, follows close behind the designated hitters. This boxplot emphasizes how a player’s position strongly correlates with their offensive expectations, especially in terms of hitting for power.
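The numbers a boxplot summarizes (median and spread per group) can be computed directly, which is handy when comparing positions side by side. The following base R sketch uses invented home run totals and a hypothetical `position` column.

```r
# Invented career home run totals for three positions.
hitters <- data.frame(
  position = c("DH", "DH", "DH", "1B", "1B", "1B", "P", "P", "P"),
  HR = c(30, 38, 45, 25, 31, 36, 0, 0, 1)
)

# Median and interquartile range of HR within each position.
hr_median <- tapply(hitters$HR, hitters$position, median)
hr_iqr <- tapply(hitters$HR, hitters$position, IQR)
print(hr_median)
print(hr_iqr)

# Draw the boxplot only in an interactive session.
if (interactive()) boxplot(HR ~ position, data = hitters)
```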
Batting Average by Position
The boxplot of batting average by player position shows that most positions have fairly similar average performance, with relatively small differences in central tendency and spread. This suggests that batting average is generally consistent across the field, regardless of a player’s defensive role.
However, two positions stand out with noticeably lower averages: catchers and pitchers. Pitchers have the lowest batting averages overall, which is expected since their primary role is defensive, and they often receive limited hitting opportunities. Catchers also tend to have slightly lower batting averages than other position players, possibly due to the physical demands of the position, which may impact offensive output. Overall, the boxplot highlights that while batting ability is distributed relatively evenly among most positions, some roles naturally come with lower offensive expectations.
Home Runs Vs. Batting Average Correlation
The scatterplot comparing home runs to batting average reveals a positive correlation between the two variables. In general, players with higher batting averages also tend to hit more home runs. While the relationship is not perfectly linear, the upward trend suggests that better overall hitters are also more likely to demonstrate power at the plate.
This relationship may reflect the fact that strong contact hitters often have the mechanics and consistency needed to generate more power. However, the plot also shows some variability, indicating that some players specialize in power hitting with lower averages, and others maintain high averages without hitting many home runs. Overall, the positive correlation supports the idea that while batting average and home run totals measure different aspects of hitting, they are often linked in well-rounded offensive players.
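A scatterplot trend like this can be backed by a correlation coefficient and a simple linear fit. The sketch below uses made-up values; `cor` and `lm` are base R functions, while the data frame and its contents are assumptions for illustration.

```r
# Made-up home run totals and batting averages for six hitters.
hitters <- data.frame(
  HR = c(5, 12, 18, 25, 30, 8),
  AVG = c(0.230, 0.255, 0.270, 0.285, 0.300, 0.262)
)

# Pearson measures linear association; Spearman uses ranks and is
# less sensitive to a few extreme home run totals.
r_pearson <- cor(hitters$HR, hitters$AVG)
r_spearman <- cor(hitters$HR, hitters$AVG, method = "spearman")

# Slope of a simple linear fit: change in AVG per additional home run.
fit <- lm(AVG ~ HR, data = hitters)
slope <- coef(fit)[["HR"]]

print(c(pearson = r_pearson, spearman = r_spearman, slope = slope))
```

A positive coefficient only describes association in the sample; it does not show that contact skill causes power or the reverse.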
Data Analysis (Scraping)
Using R, I:
- Programmatically scraped 20 pages of hitter data.
- Extracted player names, teams, and key statistics.
- Cleaned and organized the data into a structured format.
- Visualized and analyzed the top players across different statistics.
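The scraping step above could look roughly like the following sketch, which is not the code used in this project. It assumes the rvest package is installed, that each results page can be reached through a `page` query parameter, and that the statistics sit in the first HTML table on the page. The function names are hypothetical.

```r
# Assumption: results pages are addressed with a `page` query parameter.
build_page_urls <- function(n_pages, base_url = "https://www.mlb.com/stats/") {
  paste0(base_url, "?page=", seq_len(n_pages))
}

# Read the first HTML table on one page (requires the rvest package).
scrape_page <- function(url) {
  page <- rvest::read_html(url)
  rvest::html_table(rvest::html_element(page, "table"))
}

# Scrape several pages, pausing between requests, and stack the results.
scrape_hitters <- function(n_pages = 20, pause = 1) {
  pages <- lapply(build_page_urls(n_pages), function(u) {
    Sys.sleep(pause)  # be polite to the server
    scrape_page(u)
  })
  do.call(rbind, pages)
}

urls <- build_page_urls(20)
print(head(urls, 3))

# Uncomment to run against the live site:
# hitters_raw <- scrape_hitters(20)
```

Separating URL construction from the network call keeps the part that depends on the site layout in one small function, which is the piece most likely to need changes if the page structure changes.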
The scraped data set includes columns for:
PLAYER
TEAM
G (games)
AB (at-bats)
R (runs)
H (hits)
2B (doubles)
3B (triples)
HR (home runs)
RBI (runs batted in)
SB (stolen bases)
AVG (batting average)
OBP (on-base percentage)
SLG (slugging percentage)
OPS (on-base + slugging percentage)
Data Wrangling / Transformation
Data Cleaning
During cleaning, I:
Formatted player names consistently (“First Last”).
Separated the position from the name column so that position can be used as a grouping attribute.
Renamed the columns, whose scraped names were duplicated (for example, “PLAYERPLAYER” became “PLAYER”), to make them simpler to use.
Kept only players with more than 20 at-bats.
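The cleaning steps can be sketched in base R as follows. The raw layout is an assumption: the example supposes every scraped column name is doubled and that the position code trails the player name after a space. The real scraped strings may look different, and the sample rows are illustrative.

```r
# Assumed raw layout: doubled column names, position code after the name.
raw <- data.frame(
  PLAYERPLAYER = c("Aaron Judge RF", "Pete Alonso 1B", "Bench Player X"),
  TEAMTEAM = c("NYY", "NYM", "SEA"),
  ABAB = c(140L, 137L, 12L),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Collapse names made of two identical halves ("PLAYERPLAYER" -> "PLAYER").
# Caveat: a name that is legitimately two repeated halves would also shrink.
names(raw) <- sub("^(.+)\\1$", "\\1", names(raw), perl = TRUE)

# Split the trailing position code from the player name.
raw$Position <- sub("^.*\\s+(\\S+)$", "\\1", raw$PLAYER)
raw$PLAYER <- trimws(sub("\\s+\\S+$", "", raw$PLAYER))

# Keep only players with more than 20 at-bats.
clean <- raw[raw$AB > 20, ]
print(clean)
```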
Analysis and Results
OBP by Position
# A tibble: 10 × 2
   Position avg_obp
   <chr>      <dbl>
 1 DH         0.324
 2 RF         0.317
 3 C          0.314
 4 SS         0.313
 5 CF         0.313
 6 1B         0.312
 7 2B         0.309
 8 LF         0.300
 9 3B         0.300
10 X          0.257
Designated hitters (DHs) have the highest on-base percentages among all positions, which aligns with their specialized role in the lineup. Since DHs are not responsible for fielding, their sole focus is on offensive production, particularly getting on base and driving in runs. This singular responsibility allows them to develop and maintain stronger hitting skills, making them some of the most effective offensive players on the team. Their higher OBP reflects both their skill and the strategic nature of the position.
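A per-position summary shaped like the tibble above can be produced with dplyr. The sketch below runs on a tiny made-up data set; the `n` column is an addition, included so that each average can be read alongside the number of players behind it.

```r
library(dplyr)

# Made-up OBP values for a handful of hitters.
hitters <- tibble(
  Position = c("DH", "DH", "RF", "RF", "C"),
  OBP = c(0.340, 0.310, 0.330, 0.300, 0.314)
)

# Average OBP per position, highest first, with the group size alongside.
obp_by_position <- hitters %>%
  group_by(Position) %>%
  summarise(avg_obp = mean(OBP), n = n()) %>%
  arrange(desc(avg_obp))

print(obp_by_position)
```

When the gaps between positions are only a few thousandths, as in the table above, the group sizes matter for deciding how much weight to give the ordering.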
Top Ten Hitters by OBP
# A tibble: 10 × 6
   PLAYER          Position TEAM    OBP   AVG     H
   <chr>           <chr>    <chr> <dbl> <dbl> <int>
 1 Aaron Judge     RF       NYY   0.491 0.4      56
 2 Carson Kelly    C        CHC   0.488 0.348    23
 3 Jonny DeLuca    CF       TB    0.48  0.435    10
 4 Leo Rivas       2B       SEA   0.471 0.341    14
 5 Iván Herrera    C        STL   0.458 0.381     8
 6 Alex Call       RF       WSH   0.455 0.348    24
 7 Pete Alonso     1B       NYM   0.45  0.328    45
 8 Austin Wynns    C        CIN   0.45  0.405    15
 9 Jackson Merrill CF       SD    0.44  0.4      18
10 Freddie Freeman 1B       LAD   0.436 0.358    34
This table provides a current snapshot of the players leading in on-base percentage (OBP). It can be regenerated throughout the season to track which players are consistently performing at a high level. Alongside OBP, the table shows batting average (AVG) and hits (H), which helps separate players whose OBP is driven mainly by a high batting average from those who also reach base in other ways, such as walks. Several entries rest on very few hits this early in the season (Iván Herrera has 8 and Jonny DeLuca has 10), so the ranking is likely to change as at-bats accumulate.
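A leaderboard like this comes down to a filter, a sort, and a cut. The dplyr sketch below uses a hypothetical function `top_obp`. Three rows borrow OBP, AVG, and H from the table above and the fourth is invented. AB is estimated as H divided by AVG because at-bats are not shown in the table.

```r
library(dplyr)

# Hypothetical helper: top n hitters by OBP among those above an AB cutoff.
top_obp <- function(data, n = 10, min_ab = 20) {
  data %>%
    filter(AB > min_ab) %>%
    arrange(desc(OBP)) %>%
    head(n)
}

# Three rows from the table above plus one invented small-sample hitter.
hitters <- tibble(
  PLAYER = c("Aaron Judge", "Carson Kelly", "Jonny DeLuca", "Hypothetical Player"),
  OBP = c(0.491, 0.488, 0.480, 0.600),
  AVG = c(0.400, 0.348, 0.435, 0.500),
  H = c(56, 23, 10, 5)
) %>%
  mutate(AB = round(H / AVG))  # estimate only; AB is not in the table

leaders <- top_obp(hitters, n = 2)
print(leaders)
```

Raising `min_ab` as the season goes on is one way to keep small-sample hitters from dominating the list.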
Home Run Leaders by Team
This visualization displays the number of home runs hit by each team as the season progresses, offering a clear view of which teams are currently leading in power hitting. Tracking team-level home run totals over time provides valuable insight into offensive strength and momentum, helping identify which teams rely heavily on power to drive their scoring.
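Team totals of this kind come from summing player home runs within each team. Here is a minimal base R sketch on made-up numbers; the column names `TEAM` and `HR` follow the scraped data set, while the values are invented.

```r
# Made-up player home run totals.
hitters <- data.frame(
  TEAM = c("NYY", "NYY", "LAD", "LAD", "SEA"),
  HR = c(10, 4, 7, 6, 3)
)

# Sum home runs per team and sort from most to fewest.
team_hr <- aggregate(HR ~ TEAM, data = hitters, FUN = sum)
team_hr <- team_hr[order(-team_hr$HR), ]
print(team_hr)

# Draw the bar chart only in an interactive session.
if (interactive()) barplot(team_hr$HR, names.arg = team_hr$TEAM, ylab = "Home runs")
```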
Home Run Leaders
This visualization tracks the number of home runs hit by each player over the course of the season, making it a valuable tool for evaluating player performance and trends. It’s especially useful for fantasy baseball decisions or making informed bets, as it highlights which players are consistently delivering power at the plate and gaining momentum. Identifying hot streaks or steady production can give viewers a strategic edge.
Home Runs Vs. Batting Average Correlation
When comparing the relationship between home runs and batting average in the original data set and the new current-season data set, several patterns emerge. As expected, the overall number of home runs is significantly lower in the current data set, given that it only reflects the early part of the season rather than a player’s entire career. This lower volume results in a tighter clustering of points and fewer extreme values in home run totals.
Despite this difference in scale, both data sets display a similar trend: a positive correlation between home runs and batting average. In each case, players with higher batting averages tend to hit more home runs. This suggests that players who make consistent, quality contact not only get on base more often but are also more likely to hit for power.
The similarity in the trend across both data sets reinforces the idea that strong offensive performance in one area, such as maintaining a high batting average, often complements power hitting. It also shows that even in a small or early-season sample, foundational hitting trends remain consistent, offering insight into long-term player performance.
Conclusion
Analyzing MLB hitters’ 2025 season data reveals several important trends. Players’ on-base percentages (OBP) vary across fielding positions: designated hitters average the highest OBP (0.324), while left fielders and third basemen have the lowest among the named positions (0.300 each). Meanwhile, the distribution of home run totals by team and by individual player reveals which teams and players are generating the most power at the plate.
From a fantasy baseball manager’s perspective, these insights are highly valuable. Targeting players with high OBP, especially from positions that traditionally lag in offensive production, can provide a competitive edge in leagues that reward on-base skills alongside traditional stats like home runs and RBIs. Identifying breakout hitters based on OBP distribution can also help managers find undervalued assets.
For a sports bettor, understanding which teams consistently produce high OBP and home runs can sharpen strategies when placing bets on totals (over/unders), anytime home runs, or team performance. Teams filled with high-OBP players are more likely to sustain offensive rallies, affecting game outcomes beyond just isolated home run power.
From the perspective of a baseball fan, this analysis deepens appreciation for different player archetypes and the diversity of offensive contributions across the diamond. It’s not only the sluggers who drive success; players who consistently reach base create the foundation for high-scoring innings and exciting games.
Overall, these findings reinforce that both efficiency (OBP) and power (HR) are crucial elements in evaluating player value, predicting team success, and enriching the baseball experience from every viewpoint. The same data can be used beyond what I demonstrated here to examine any hitting statistic of interest.
The potential uses are broad. Team front offices could use deeper OBP and HR analysis to guide player development, scouting, and trade decisions. Broadcasters and writers can use these trends to tell richer stories during games, highlighting underappreciated contributors. Data scientists can build predictive models for player performance, team win totals, or injury risk by expanding on variables like OBP, HR, plate appearances, and position. Fans curious about game strategy can use this type of analysis to better understand lineup construction and in-game decision-making.
In short, this project only scratches the surface of what this kind of data can reveal. With more detailed modeling, historical comparisons, or predictive analytics, this data set could become a powerful tool for anyone passionate about understanding the deeper layers of baseball.