MLB Scraped Data Season Stats 2025

Author

Ben

Introduction

Line of Inquiry

In this project, I seek to explore which Major League Baseball hitters, teams, or positions perform the best based on traditional offensive metrics such as Home Runs (HR), Runs Batted In (RBI), and Batting Average (AVG) during the 2025 season.
I find this topic interesting because it can help predict potential MVP candidates and All-Star selections, and it gives insight into player value for fantasy baseball drafts. I have always enjoyed watching baseball, and this project lets me follow the entire league without watching every single game.

One specific question I want to focus on is: “Which hitters are the most efficient at getting on base in the 2025 MLB season, and how does on-base efficiency relate to player position?” I’m interested in understanding the relationship between player positions (like infielders, outfielders, catchers) and their on-base efficiency (getting hits, walks, etc.). I chose this topic because while home runs and RBIs are flashy, getting on base consistently is critical for a team’s offensive success. I want to find out whether certain positions tend to have better on-base players.
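To make "on-base efficiency" concrete, here is a minimal sketch of the standard on-base percentage formula, (H + BB + HBP) / (AB + BB + HBP + SF). The function name `obp` is hypothetical, and the defaults of zero for hit-by-pitch and sacrifice flies are an assumption made only because the dataset used here has no HBP or SF columns, so the result is an approximation of the published OBP.

```r
# On-base percentage from counting stats.
# hbp and sf default to 0 (assumption): the dataset in this project has no
# HBP or SF columns, so this only approximates the official figure.
obp <- function(h, bb, ab, hbp = 0, sf = 0) {
  (h + bb + hbp) / (ab + bb + hbp + sf)
}

# Example inputs: 45 hits, 18 walks, 110 at-bats
obp(45, 18, 110)  # 63 / 128 = 0.4921875
```

The analysis itself uses the OBP column reported by the site rather than recomputing it.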

How I Will Answer the Question

To answer this question, I scraped player statistics from the official MLB website: https://www.mlb.com/stats/. This site provides detailed player performance data across multiple hitting categories and is updated throughout the season, so the scrape can be rerun to keep the statistics current.

Using R, I:

Programmatically scraped 20 pages of hitters data.
Extracted player names, teams, and key statistics.
Cleaned and organized the data into a structured format.
Saved the dataset as a .csv and uploaded it for clean importing.
Visualized and analyzed the top players across different statistics.

The resulting dataset includes columns for PLAYER, Position, TEAM, G, AB, R, H, 2B (doubles), 3B (triples), HR, RBI, BB, SO, SB, CS, AVG, OBP, SLG, and OPS.
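The scraping steps listed above could look roughly like the sketch below. This is not the exact code used for this project. It assumes that the leaderboard is paginated with a `page` query parameter and that each page serves its statistics as a plain HTML table. The helper names `stats_page_url`, `scrape_stats_page`, and `scrape_all_pages` are hypothetical.

```r
library(rvest)
library(dplyr)

# Hypothetical helper: URL for one page of the hitting leaderboard.
# The `page` query parameter is an assumption about the site's URL scheme.
stats_page_url <- function(page) {
  paste0("https://www.mlb.com/stats/?page=", page)
}

# Hypothetical helper: read the first HTML table found on one page.
# Assumes the table is present in the HTML returned by the server.
scrape_stats_page <- function(page) {
  read_html(stats_page_url(page)) %>%
    html_element("table") %>%
    html_table()
}

# Scrape a range of pages, pausing between requests, and stack the results.
scrape_all_pages <- function(pages = 1:20, pause = 1) {
  lapply(pages, function(p) {
    Sys.sleep(pause)  # be polite to the server
    scrape_stats_page(p)
  }) %>%
    bind_rows()
}

# Usage (left commented out so nothing is downloaded when the file is sourced):
# hitters_raw <- scrape_all_pages()
# readr::write_csv(hitters_raw, "mlb_2025_hitters.csv")
```

Writing the result to a CSV once means later analysis runs read the local file instead of requesting all 20 pages again.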

Data Wrangling / Transformation

library(tidyverse) # All the tidy things (attaches readr, dplyr, and ggplot2)
library(jsonlite)  # Converting json data into data frames
library(magrittr)  # Extracting items from list objects using piping grammar
library(httr)      # Interacting with HTTP verbs

# Import the saved hitter statistics
mlb_data <- read_csv("mlb_2025_hitters.csv")

# Keep players with a recorded OBP and more than 20 at-bats
mlb_data_clean <- mlb_data %>%
  filter(!is.na(OBP), AB > 20)

Data Cleaning

During cleaning, I:

Reformatted player names as “First Last”.
Separated the position from the name column so that position can be used as a grouping variable.
Renamed the columns whose headers were repeated in the scrape, for example “PLAYERPLAYER” to “PLAYER”.
Saved the data as a CSV so the analysis does not have to re-scrape the website on every run.
Filtered out players with no OBP and players with 20 or fewer at-bats.
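The renaming and position-splitting steps can be sketched as follows. The rows are made-up stand-ins, and the layout of the raw name column (name followed by a space and a position code) is an assumption. The real scrape may need a different pattern.

```r
library(dplyr)
library(stringr)

# Made-up rows standing in for the raw scrape (names are fictional;
# the "Name Position" layout is an assumption).
raw <- tibble(
  PLAYERPLAYER = c("Alex Doe RF", "Jamie Roe 2B", "Pat Poe DH"),
  TEAMTEAM = c("NYY", "AZ", "ATL")
)

cleaned <- raw %>%
  # Collapse doubled headers such as "PLAYERPLAYER" to "PLAYER"
  rename_with(~ str_replace(.x, "^(.+)\\1$", "\\1")) %>%
  mutate(
    # The last whitespace-delimited token is treated as the position code
    Position = str_extract(PLAYER, "\\S+$"),
    # Drop that trailing token, leaving "First Last"
    PLAYER = str_remove(PLAYER, "\\s+\\S+$")
  )

cleaned
```

The backreference pattern only renames headers that are an exact doubling of themselves, so already clean column names pass through unchanged.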

Analysis and Results

OBP by Position

obp_by_position <- mlb_data_clean %>%
  group_by(Position) %>%
  summarize(avg_obp = mean(OBP, na.rm = TRUE)) %>%
  arrange(desc(avg_obp))

obp_by_position

# A tibble: 10 × 2
   Position avg_obp
   <chr>      <dbl>
 1 DH         0.325
 2 RF         0.320
 3 CF         0.314
 4 2B         0.311
 5 LF         0.308
 6 3B         0.305
 7 C          0.304
 8 SS         0.304
 9 1B         0.303
10 X          0.286

# Boxplot of OBP by Position
ggplot(mlb_data_clean, aes(x = Position, y = OBP)) +
  geom_boxplot(fill = "skyblue", color = "darkblue") +
  labs(
    title = "Distribution of On-Base Percentage (OBP) by Position",
    x = "Position",
    y = "On-Base Percentage (OBP)"
  ) +
  theme_minimal()

Designated hitters have the highest average OBP (.325), which makes sense because their only job is to hit. They do not field, so the role tends to go to strong hitters. The gap is modest, though: every position except the X group falls between .303 and .325, and the averages include players with as few as 21 at-bats.
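A plain mean treats a player with 21 at-bats the same as an everyday starter, so it can help to report group sizes and a weighted average alongside it. The sketch below uses made-up rows, not values from the scraped data. Weighting by AB is only an approximation, since the true OBP denominator is closer to plate appearances.

```r
library(dplyr)

# Made-up rows for illustration only; not values from the scraped data.
toy <- tibble(
  Position = c("DH", "DH", "C", "C", "C"),
  AB = c(80, 69, 41, 30, 29),
  OBP = c(0.400, 0.300, 0.350, 0.250, 0.300)
)

position_summary <- toy %>%
  group_by(Position) %>%
  summarize(
    n_players = n(),                              # how many players feed each average
    avg_obp = mean(OBP),                          # unweighted mean, as in the table above
    median_obp = median(OBP),                     # less sensitive to hot or cold streaks
    ab_weighted_obp = weighted.mean(OBP, w = AB)  # approximation: weights by at-bats
  ) %>%
  arrange(desc(avg_obp))

position_summary
```

Applied to `mlb_data_clean`, the same summary would show whether a position's ranking depends on a handful of low-AB players.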

Top Ten Hitters by OBP

top_hitters <- mlb_data_clean %>%
  arrange(desc(OBP)) %>%
  slice_head(n = 10)

top_hitters

# A tibble: 10 × 19
   PLAYER   Position TEAM      G    AB     R     H  `2B`  `3B`    HR   RBI    BB
   <chr>    <chr>    <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Carson … C        CHC      15    41    12    14     1     1     6    18    15
 2 Aaron J… RF       NYY      29   110    25    45     7     1     8    28    18
 3 Leo Riv… 2B       SEA      11    23     7     7     0     0     0     0     9
 4 Austin … C        CIN      11    30     6    13     1     0     3    11     3
 5 Marcell… DH       ATL      25    80    14    25     3     0     5    11    26
 6 Jonny D… CF       TB        9    23     4    10     0     1     0     1     2
 7 Ketel M… 2B       AZ        8    26     6     9     3     0     0     1     6
 8 Tyler H… C        TOR      10    29     6    13     3     0     1     5     1
 9 Pavin S… DH       AZ       26    69    15    23     9     0     4     9    17
10 Edgar Q… C        CWS      11    32     3    11     2     0     0     5     6
# ℹ 7 more variables: SO <dbl>, SB <dbl>, CS <dbl>, AVG <dbl>, OBP <dbl>,
#   SLG <dbl>, OPS <dbl>

This table can be regenerated throughout the season to see which players are leading the league in OBP. It also shows each player's other stats, which can help explain what is driving that success. Because the cutoff is only 20 at-bats, small samples dominate the list: seven of the ten players shown have 41 or fewer at-bats.
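Since the leaderboard is sensitive to the at-bat cutoff, one option is to make the threshold a parameter. The helper name `top_obp` and the default of 50 at-bats are hypothetical choices, and the rows are made-up for illustration.

```r
library(dplyr)

# Hypothetical helper: top `k` hitters by OBP among players with at least
# `min_ab` at-bats. The default threshold of 50 is an arbitrary assumption.
top_obp <- function(data, min_ab = 50, k = 10) {
  data %>%
    filter(!is.na(OBP), AB >= min_ab) %>%
    arrange(desc(OBP)) %>%
    slice_head(n = k)
}

# Made-up rows for illustration only
toy <- tibble(
  PLAYER = c("A", "B", "C", "D"),
  AB = c(23, 110, 80, 41),
  OBP = c(0.500, 0.480, 0.450, 0.520)
)

top_obp(toy, min_ab = 50, k = 2)  # keeps B and C; A and D fall below the cutoff
```

Raising `min_ab` as the season goes on keeps the list focused on players with enough at-bats for their OBP to be meaningful.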

Home Run Leaders by Team

hr_by_team <- mlb_data_clean %>%
  group_by(TEAM) %>%
  summarize(total_hr = sum(HR, na.rm = TRUE)) %>%
  arrange(desc(total_hr))

# Bar plot of HR totals by team
ggplot(hr_by_team, aes(x = reorder(TEAM, total_hr), y = total_hr)) +
  geom_col(fill = "darkblue") +
  coord_flip() +
  labs(title = "Total Home Runs by Team (2025)", x = "Team", y = "Home Runs")

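This shows the number of home runs hit by each team so far this season, which is a helpful way to see which teams currently have the most power. Because the totals are summed from mlb_data_clean, home runs hit by players with 20 or fewer at-bats are not counted.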
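Because `mlb_data_clean` drops players with 20 or fewer at-bats, team totals built from it can undercount. A minimal sketch of how to check the size of that effect, using made-up rows (the column name `hr_dropped` is hypothetical):

```r
library(dplyr)

# Made-up rows for illustration only; not values from the scraped data.
toy <- tibble(
  TEAM = c("NYY", "NYY", "ATL", "ATL"),
  AB = c(110, 12, 80, 18),
  HR = c(8, 1, 5, 2)
)

hr_compare <- toy %>%
  group_by(TEAM) %>%
  summarize(
    total_hr_all = sum(HR, na.rm = TRUE),               # every scraped player
    total_hr_filtered = sum(HR[AB > 20], na.rm = TRUE)  # same cutoff as mlb_data_clean
  ) %>%
  mutate(hr_dropped = total_hr_all - total_hr_filtered) %>%
  arrange(desc(total_hr_all))

hr_compare
```

Running the same comparison on `mlb_data` would show whether the filter changes the team ranking or only trims a few home runs from each total.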

Home Run Leaders

top_hr_hitters <- mlb_data_clean %>%
  arrange(desc(HR)) %>%
  slice_head(n = 10)

# Plot
ggplot(top_hr_hitters, aes(x = reorder(PLAYER, -HR), y = HR)) +
  geom_col(fill = "pink", color = "black") +
  labs(
    title = "Top 10 Home Run Hitters (2025)",
    x = "Player",
    y = "Home Runs"
  ) +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This shows the number of home runs hit by each player so far this season, which is a helpful tool when choosing who to start in your fantasy lineup or when placing a home run bet on a specific player.

Conclusion

Through analyzing MLB hitters’ 2025 season data, several important trends emerge. Players’ on-base percentages (OBP) vary across fielding positions, with designated hitters and right fielders averaging higher OBPs than first basemen, shortstops, and catchers. Meanwhile, the distribution of home run totals by team and by individual player reveals which teams and players are generating the most power at the plate.

From a fantasy baseball manager’s perspective, these insights are highly valuable. Targeting players with high OBP, especially from positions that traditionally lag in offensive production, can provide a competitive edge in leagues that reward on-base skills alongside traditional stats like home runs and RBIs. Identifying breakout hitters based on OBP distribution can also help managers find undervalued assets.

For a sports bettor, understanding which teams consistently produce high OBP and home runs can sharpen strategies when placing bets on totals (over/unders), anytime home runs, or team performance. Teams filled with high-OBP players are more likely to sustain offensive rallies, affecting game outcomes beyond just isolated home run power.

From the perspective of a baseball fan, this analysis deepens appreciation for different player archetypes and the diversity of offensive contributions across the diamond. It’s not only the sluggers who drive success; players who consistently reach base create the foundation for high-scoring innings and exciting games.

Overall, these findings reinforce that both efficiency (OBP) and power (HR) are crucial elements in evaluating player value, predicting team success, and enriching the baseball experience from every viewpoint. The same data can be used beyond what I demonstrated here to examine any hitting statistic in the dataset.

The potential uses are broad. Team front offices could use deeper OBP and HR analysis to guide player development, scouting, and trade decisions. Broadcasters and writers can use these trends to tell richer stories during games, highlighting underappreciated contributors. Data scientists can build predictive models for player performance, team win totals, or injury risk by expanding on variables like OBP, HR, plate appearances, and position. Fans curious about game strategy can use this type of analysis to better understand lineup construction and in-game decision-making.

In short, this project only scratches the surface of what this kind of data can reveal. With more detailed modeling, historical comparisons, or predictive analytics, this data set could become a powerful tool for anyone passionate about understanding the deeper layers of baseball.
