library(readr)
library(dplyr)
library(tidyverse) # All the tidy things
library(jsonlite) # Converting json data into data frames
library(magrittr) # Extracting items from list objects using piping grammar
library(httr) # Interacting with HTTP verbs
mlb_data <- read_csv("mlb_2025_hitters.csv")
# Filter players with more than 20 at-bats
mlb_data_clean <- mlb_data %>%
  filter(!is.na(OBP), AB > 20)
MLB Scraped Data Season Stats 2025
Introduction
Line of Inquiry
In this project, I seek to explore which Major League Baseball hitters, teams, or positions perform the best based on traditional offensive metrics such as Home Runs (HR), Runs Batted In (RBI), and Batting Average (AVG) during the 2025 season.
I find this topic interesting because it can help predict potential MVP candidates, All-Star selections, and give insight into player value for fantasy baseball drafts. I have always enjoyed watching baseball and this allows me to follow the entire league without watching every single game.
One specific question I want to focus on is: “Which hitters are the most efficient at getting on base in the 2025 MLB season, and how does on-base efficiency relate to player position?” I’m interested in understanding the relationship between player positions (like infielders, outfielders, catchers) and their on-base efficiency (getting hits, walks, etc.). I chose this topic because while home runs and RBIs are flashy, getting on base consistently is critical for a team’s offensive success. I want to find out if certain positions tend to have better on-base players.
How I Will Answer the Question
To answer this question, I scraped player statistics from the official MLB website: https://www.mlb.com/stats/. This site provides detailed player performance across multiple hitting categories and is updated throughout the season, so the data stays current.
Using R, I:
- Programmatically scraped 20 pages of hitters data.
- Extracted player names, teams, and key statistics.
- Cleaned and organized the data into a structured format.
- Saved the dataset as a .csv and uploaded it for clean importing.
- Visualized and analyzed the top players across different statistics.
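The scraping code itself is not shown in this report, so the following is only a minimal sketch of the page loop and the save-once pattern described in the steps above. The function fetch_page() is a hypothetical stand-in that returns invented rows so the example runs offline, and the file name is likewise an assumption.

```r
library(dplyr)

# Hypothetical stand-in for the real request-and-parse step, which this report
# does not show. It returns a tiny invented table so the sketch runs offline.
fetch_page <- function(page) {
  tibble(
    PLAYER = paste0("Player ", page, c("A", "B")),
    TEAM   = c("AAA", "BBB"),
    AB     = c(30, 10),
    page   = page
  )
}

# Assumed output path; a temp file keeps the sketch away from real data.
csv_path <- file.path(tempdir(), "hitters_sketch.csv")

if (!file.exists(csv_path)) {
  pages <- lapply(1:20, function(p) {
    # A pause between real requests (e.g. Sys.sleep) would go here.
    fetch_page(p)
  })
  hitters <- bind_rows(pages)                 # stack the 20 page tables
  write.csv(hitters, csv_path, row.names = FALSE)
}

# Later runs read the saved file instead of scraping again.
hitters <- read.csv(csv_path, stringsAsFactors = FALSE)
```

Checking for the saved file first means the site is only hit once, and every later analysis run starts from the same CSV.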
The scraped dataset includes columns for PLAYER, TEAM, G, AB, R, H, 2B (doubles), 3B (triples), HR, RBI, BB, SO, SB, CS, AVG, OBP, SLG, and OPS. After cleaning, Position is split out into its own column.
Data Wrangling / Transformation
Data Cleaning
During cleaning, I:
- Made sure player names are properly formatted (“First Last”).
- Separated the position from the name column so that position is a usable attribute to group by.
- Renamed the columns, since the scraped headers were doubled, to make them simpler to use (for example, “PLAYERPLAYER” became “PLAYER”).
- Saved the data as a CSV for easier use and to avoid putting repeated load on the scraped website.
- Filtered out players who have no OBP and players who do not have more than 20 at-bats.
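The cleaning steps above could be sketched roughly as follows. The raw layout is an assumption: doubled headers such as “PLAYERPLAYER” and a position code glued to the end of each name. The rows, the pos_pattern helper, and the list of position codes are illustrative and not taken from the actual scrape.

```r
library(dplyr)

# Invented rows in the assumed raw layout: doubled headers, position glued to the name.
raw <- tibble(
  PLAYERPLAYER = c("Alex ExampleRF", "Sam Sample2B", "Pat PlaceholderC"),
  TEAMTEAM     = c("AAA", "BBB", "CCC"),
  ABAB         = c(110, 26, 15),
  OBPOBP       = c(0.400, 0.350, NA)
)

# Assumed set of position codes, matching the positions reported in the results.
pos_pattern <- "(1B|2B|3B|SS|LF|CF|RF|DH|C|X)$"

clean <- raw %>%
  # "PLAYERPLAYER" -> "PLAYER": collapse a header that is the same text twice.
  rename_with(~ sub("^(.+)\\1$", "\\1", .x, perl = TRUE)) %>%
  mutate(
    # Pull the trailing position code out first, then strip it from the name.
    Position = sub(paste0("^.*?", pos_pattern), "\\1", PLAYER, perl = TRUE),
    PLAYER   = trimws(sub(pos_pattern, "", PLAYER)),
    .after   = PLAYER
  ) %>%
  # Same filter as the report: drop missing OBP and players with 20 or fewer at-bats.
  filter(!is.na(OBP), AB > 20)
```

On these toy rows the third player is dropped (missing OBP and only 15 at-bats), leaving two rows with separate PLAYER and Position columns.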
Analysis and Results
OBP by Position
obp_by_position <- mlb_data_clean %>%
  group_by(Position) %>%
  summarize(avg_obp = mean(OBP, na.rm = TRUE)) %>%
  arrange(desc(avg_obp))
obp_by_position
# A tibble: 10 × 2
   Position avg_obp
   <chr>      <dbl>
 1 DH         0.325
 2 RF         0.320
 3 CF         0.314
 4 2B         0.311
 5 LF         0.308
 6 3B         0.305
 7 C          0.304
 8 SS         0.304
 9 1B         0.303
10 X          0.286
# Boxplot of OBP by Position
ggplot(mlb_data_clean, aes(x = Position, y = OBP)) +
  geom_boxplot(fill = "skyblue", color = "darkblue") +
  labs(
    title = "Distribution of On-Base Percentage (OBP) by Position",
    x = "Position",
    y = "On-Base Percentage (OBP)"
  ) +
  theme_minimal()
Designated hitters have the highest average OBP (0.325), which makes sense since their only job is to hit. They do not need to focus on the fielding side of the game, so they would be expected to be among the best hitters.
Top Ten Hitters by OBP
top_hitters <- mlb_data_clean %>%
  arrange(desc(OBP)) %>%
  slice_head(n = 10)
top_hitters
# A tibble: 10 × 19
   PLAYER   Position TEAM      G    AB     R     H  `2B`  `3B`    HR   RBI    BB
   <chr>    <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Carson … C        CHC      15    41    12    14     1     1     6    18    15
 2 Aaron J… RF       NYY      29   110    25    45     7     1     8    28    18
 3 Leo Riv… 2B       SEA      11    23     7     7     0     0     0     0     9
 4 Austin … C        CIN      11    30     6    13     1     0     3    11     3
 5 Marcell… DH       ATL      25    80    14    25     3     0     5    11    26
 6 Jonny D… CF       TB        9    23     4    10     0     1     0     1     2
 7 Ketel M… 2B       AZ        8    26     6     9     3     0     0     1     6
 8 Tyler H… C        TOR      10    29     6    13     3     0     1     5     1
 9 Pavin S… DH       AZ       26    69    15    23     9     0     4     9    17
10 Edgar Q… C        CWS      11    32     3    11     2     0     0     5     6
# ℹ 7 more variables: SO <dbl>, SB <dbl>, CS <dbl>, AVG <dbl>, OBP <dbl>,
#   SLG <dbl>, OPS <dbl>
This table can be regenerated throughout the season to see which players are dominating when it comes to OBP. It also lets the viewer see all of the other stats for these players, which can be helpful in determining what is driving that success.
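To illustrate how the other columns can explain a high OBP, here is a small sketch using the standard OBP formula, (H + BB + HBP) / (AB + BB + HBP + SF). The H, BB, and AB values come from rows 2 and 5 of the table above. Hit-by-pitch (HBP) and sacrifice flies (SF) do not appear in the table, so they are set to 0 here as an assumption, which means the results are approximations and may not match the OBP column exactly.

```r
# Standard OBP formula: times on base divided by the plate appearances that count.
obp <- function(h, bb, ab, hbp = 0, sf = 0) {
  (h + bb + hbp) / (ab + bb + hbp + sf)
}

# Row 2 of the table: H = 45, BB = 18, AB = 110 (HBP and SF assumed to be 0).
row2_obp <- obp(h = 45, bb = 18, ab = 110)
# Row 5 of the table: H = 25, BB = 26, AB = 80 (HBP and SF assumed to be 0).
row5_obp <- obp(h = 25, bb = 26, ab = 80)

# Share of times on base that came from walks rather than hits.
row2_walk_share <- 18 / (45 + 18)
row5_walk_share <- 26 / (25 + 26)

round(c(row2_obp, row5_obp), 3)
round(c(row2_walk_share, row5_walk_share), 3)
```

Under these assumptions both players reach base at a similar rate (roughly 0.49 and 0.48), but the row 2 hitter gets there mostly through hits while the row 5 hitter draws more walks than hits.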
Home Run Leaders by Team
hr_by_team <- mlb_data_clean %>%
  group_by(TEAM) %>%
  summarize(total_hr = sum(HR, na.rm = TRUE)) %>%
  arrange(desc(total_hr))
# Bar plot of HR totals by team
ggplot(hr_by_team, aes(x = reorder(TEAM, total_hr), y = total_hr)) +
  geom_col(fill = "darkblue") +
  coord_flip() +
  labs(title = "Total Home Runs by Team (2025)", x = "Team", y = "Home Runs")
This shows the number of home runs hit by each team as the season progresses, which is a helpful way to see which teams currently have the most power.
Home Run Leaders
top_hr_hitters <- mlb_data_clean %>%
  arrange(desc(HR)) %>%
  slice_head(n = 10)
# Plot
ggplot(top_hr_hitters, aes(x = reorder(PLAYER, -HR), y = HR)) +
  geom_col(fill = "pink", color = "black") +
  labs(
    title = "Top 10 Home Run Hitters (2025)",
    x = "Player",
    y = "Home Runs"
  ) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
This shows the number of home runs hit by each player as the season progresses, which is a helpful tool when choosing who to start in your fantasy lineup or when placing a home run bet on a specific player.
Conclusion
Through analyzing MLB hitters’ 2025 season data, several important trends emerge. Players’ on-base percentages (OBP) vary noticeably across fielding positions, with some positions (such as designated hitters and right fielders) tending to have higher OBPs than others, like catchers, shortstops, and first basemen. Meanwhile, the distribution of home run totals by team and by individual player reveals which teams and players are generating the most power at the plate.
From a fantasy baseball manager’s perspective, these insights are highly valuable. Targeting players with high OBP, especially from positions that traditionally lag in offensive production, can provide a competitive edge in leagues that reward on-base skills alongside traditional stats like home runs and RBIs. Identifying breakout hitters based on OBP distribution can also help managers find undervalued assets.
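One way a fantasy manager might act on this is to compare each hitter’s OBP with the average at his own position. The sketch below uses invented players and values; only the column names mirror the cleaned dataset, and obp_vs_pos is a hypothetical helper column, not something computed in this report.

```r
library(dplyr)

# Invented players; only the column names mirror the cleaned dataset.
toy <- tibble(
  PLAYER   = c("Player A", "Player B", "Player C", "Player D"),
  Position = c("DH", "DH", "C", "C"),
  OBP      = c(0.360, 0.330, 0.350, 0.290)
)

vs_position <- toy %>%
  group_by(Position) %>%
  mutate(
    pos_avg_obp = mean(OBP, na.rm = TRUE),  # average OBP at this position
    obp_vs_pos  = OBP - pos_avg_obp         # how far above or below that average
  ) %>%
  ungroup() %>%
  arrange(desc(obp_vs_pos))

vs_position
```

In this toy data the catcher with a .350 OBP ranks first, ahead of a designated hitter with a higher raw OBP, because he stands out more relative to his position.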
For a sports bettor, understanding which teams consistently produce high OBP and home runs can sharpen strategies when placing bets on totals (over/unders), anytime home runs, or team performance. Teams filled with high-OBP players are more likely to sustain offensive rallies, affecting game outcomes beyond just isolated home run power.
From the perspective of a baseball fan, this analysis deepens appreciation for different player archetypes and the diversity of offensive contributions across the diamond. It’s not only the sluggers who drive success; players who consistently reach base create the foundation for high-scoring innings and exciting games.
Overall, these findings reinforce that both efficiency (OBP) and power (HR) are crucial elements in evaluating player value, predicting team success, and enriching the baseball experience from every viewpoint. This data can be used beyond what I demonstrated today, for any hitting statistic desired.
The potential uses are wide-ranging. Team front offices could use deeper OBP and HR analysis to guide player development, scouting, and trade decisions. Broadcasters and writers can use these trends to tell richer stories during games, highlighting underappreciated contributors. Data scientists can build predictive models for player performance, team win totals, or injury risk by expanding on variables like OBP, HR, plate appearances, and position. Fans curious about game strategy can use this type of analysis to better understand lineup construction and in-game decision-making.
In short, this project only scratches the surface of what this kind of data can reveal. With more detailed modeling, historical comparisons, or predictive analytics, this data set could become a powerful tool for anyone passionate about understanding the deeper layers of baseball.