library(tidyverse)
library(rvest)
library(xml2)
library(janitor)
library(hexbin)
# Load the primary dataset
mlb_data_kaggle <- read_csv("baseball_hitting.csv")
BAIS 462 Final Project
Part I: Identify Area of Interest
Introduction
Hello, I am Jacob Stamm, a junior Business Analytics major at Xavier University from Indianapolis, IN. My free time consists primarily of playing and watching sports, which is why I am particularly interested in the analytics behind sports.
I played baseball for a large portion of my life, and since moving to Cincinnati I have grown into a Cincinnati Reds supporter. Because baseball is one of the most data-rich sports in its day-to-day analysis, I thought it would be interesting to examine this data myself and look for trends from the last 10 seasons.
In recent years, Major League Baseball has undergone a seismic shift in philosophy. The “small ball” era of bunts and stolen bases has been largely replaced by a strategy focused on Three True Outcomes: Home Runs, Walks, and Strikeouts. I am interested in exploring how this shift has impacted team performance and whether the “power surge”—specifically the record-breaking 2019 season—was a permanent change or a statistical anomaly.
Research Question
How has the relationship between team strikeout rates and home run production evolved between 2016 and 2025, and does a higher strikeout rate still guarantee higher run production in the current era?
The Data
I found this data set on the website: Kaggle
Here is the link to the Kaggle file that I used.
Data Dictionary
Player Name: The name of the player.
position: The player’s position in the field, such as outfielder, catcher, etc.
Games: The number of games in which the player participated.
At-bat: The number of times the player was at-bat.
Runs: The number of runs scored by the player.
Hits: The number of hits achieved by the player.
Double (2B): The number of doubles hit by the player.
third basemen: The number of triples hit by the player (despite the column’s name, it counts triples, not a fielding position).
home run: The number of home runs hit by the player.
run batted in: The number of runs batted in by the player.
a walk: The total number of times a batter has been walked.
Strikeouts: The total number of times a batter strikes out.
stolen base: The total number of stolen bases a hitter has.
Caught stealing: How many total times a runner has been caught stealing.
AVG: The hitter’s career batting average.
On-base Percentage: The hitter’s career on-base percentage, roughly the share of plate appearances in which the hitter reaches base safely.
Slugging Percentage: The hitter’s career slugging percentage, which measures power as total bases per at-bat.
On-base Plus Slugging: The sum of a batter’s career on-base percentage and slugging percentage.
Part 2: Descriptive Analysis
Loading the Data set
Data Wrangling + Cleaning
# Clean and transform the Kaggle dataset
mlb_kaggle_clean <- mlb_data_kaggle %>%
  clean_names() %>%
  mutate(
    # Core numeric conversion
    caught_stealing = as.numeric(caught_stealing),
    strikeouts = as.numeric(strikeouts)
  ) %>%
  mutate(
    so_rate = strikeouts / at_bat,
    hr_per_hit = home_run / hits,
    bb_k_ratio = a_walk / strikeouts
  )
Visualization 1: The Strikeout-Power Correlation Trend
ggplot(mlb_kaggle_clean, aes(x = so_rate, y = home_run)) +
  geom_point(alpha = 0.3, color = "grey") +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "The Power-Strikeout Trade-off (2016-2025)",
       subtitle = "Does a higher SO Rate consistently lead to more Home Runs?",
       x = "Strikeout Rate (SO / AB)",
       y = "Home Runs")
This visualization establishes the baseline of the “Three True Outcomes” era. The linear model (blue line) shows that while there is a positive correlation between strikeout rates and home run production, the high volume of “grey” points scattered far from the line indicates that strikeouts alone do not automatically result in power.
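The strength of the relationship described above can also be put into numbers. The following is a minimal sketch, assuming a data frame with `so_rate` and `home_run` columns like `mlb_kaggle_clean`; the `toy` rows are made up for illustration and are not values from the Kaggle file.

```r
# Illustrative data only: these six rows are made up, not taken from the Kaggle file
toy <- data.frame(
  so_rate  = c(0.10, 0.15, 0.20, 0.25, 0.30, 0.35),
  home_run = c(40, 150, 90, 300, 120, 260)
)

# Pearson correlation between strikeout rate and home runs;
# use = "complete.obs" drops rows where either value is NA
r <- cor(toy$so_rate, toy$home_run, use = "complete.obs")

# The same kind of linear model that geom_smooth(method = "lm") fits
fit <- lm(home_run ~ so_rate, data = toy)
r_squared <- summary(fit)$r.squared

# For a one-predictor linear model, R-squared equals the squared correlation
c(correlation = r, r_squared = r_squared)
```

A low R-squared alongside a positive correlation would match what the scatter suggests: the trend exists, but strikeout rate explains only part of the variation in home runs.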
Visualization 2: Total Run Value of High-K Players
ggplot(mlb_kaggle_clean, aes(x = so_rate, y = runs)) +
  geom_hex(bins = 30) +
  scale_fill_viridis_c() +
  labs(title = "Career Run Production and Strikeout Frequency",
       subtitle = "Concentration of total runs scored relative to SO rate",
       x = "Career Strikeout Rate",
       y = "Total Career Runs")
This hexbin plot directly addresses if strikeouts “guarantee” runs. By observing the vertical density, we can see that the highest career run totals are not necessarily found at the highest strikeout rates. This suggests that “swinging through” the ball has a point of diminishing returns for total run production.
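One way to check the diminishing-returns reading is to bin strikeout rate and compare typical run totals per bin. This is a rough sketch under assumptions: the `toy` rows are invented, and the 0.10 bin width is an arbitrary choice rather than something taken from the analysis.

```r
# Illustrative data only: made-up career lines, not values from the Kaggle file
toy <- data.frame(
  so_rate = c(0.08, 0.12, 0.14, 0.18, 0.22, 0.24, 0.31, 0.36),
  runs    = c(300, 900, 700, 1200, 1000, 600, 400, 200)
)

# Cut strikeout rate into fixed-width bins (the 0.10 width is an arbitrary choice)
toy$so_bin <- cut(toy$so_rate, breaks = c(0, 0.10, 0.20, 0.30, 0.40))

# Median career runs within each strikeout-rate bin
median_runs <- tapply(toy$runs, toy$so_bin, median)
median_runs
```

If the medians rise and then fall across bins, as they do in this toy data, that pattern would support the idea of a middle range of strikeout rates where run totals peak.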
Visualization 3: Home Run Efficiency per Strikeout
ggplot(mlb_kaggle_clean, aes(x = so_rate, y = hr_per_hit, color = runs)) +
  geom_point(alpha = 0.5) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(title = "Home Run Efficiency vs. Strikeout Rate",
       subtitle = "Color indicates total runs scored",
       x = "Strikeout Rate",
       y = "HR per Hit (Efficiency)")
This plot uses the hr_per_hit metric to see how “efficiently” players are hitting for power relative to their strikeout frequency. It highlights that the most successful run-scorers (red points) are often those who maintain power without pushing their strikeout rates to the extreme right of the x-axis.
Visualization 4: Plate Discipline and Run Scoring
ggplot(mlb_kaggle_clean, aes(x = bb_k_ratio, y = runs, size = home_run)) +
  geom_point(alpha = 0.3, color = "darkgreen") +
  # Use coord_cartesian to zoom in without removing data points from calculations
  coord_cartesian(xlim = c(0, 1.5), ylim = c(0, 1500)) +
  labs(title = "The Impact of Discipline on Run Production",
       subtitle = "Zoomed in on the league majority (outliers fall outside the view)",
       x = "Walk-to-K Ratio (BB/K)",
       y = "Runs Scored",
       size = "Home Runs")
This addresses the “current era” aspect of the research question by focusing on discipline. Zooming in on the league majority, we see that the players who score the most runs (highest on the y-axis) typically have a better bb_k_ratio. This suggests that in the modern era, strikeouts must be offset by walks for a hitter to remain productive.
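One caveat for the walk-to-strikeout ratio: a player with zero recorded strikeouts produces an infinite or undefined value when walks are divided by strikeouts. Below is a minimal sketch of a guarded ratio; `safe_ratio` is a hypothetical helper introduced here for illustration, not part of the original analysis, and the input vectors are made up.

```r
# Hypothetical helper (not part of the original analysis): returns NA instead of
# Inf or NaN when the denominator is zero or missing
safe_ratio <- function(num, den) {
  ifelse(is.na(den) | den == 0, NA_real_, num / den)
}

# Made-up inputs covering the normal case, zero strikeouts, and a missing value
walks      <- c(50, 10, 0, 30)
strikeouts <- c(100, 0, 0, NA)

ratio <- safe_ratio(walks, strikeouts)
ratio
# 0.5 NA NA NA
```

Used inside `mutate()`, a helper like this would keep infinite values out of any summary computed on the ratio, while the plot itself would simply skip the missing rows.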
Visualization 5: Power Potential by Position
ggplot(mlb_kaggle_clean, aes(x = reorder(position, so_rate, FUN = median), y = so_rate, fill = position)) +
  geom_boxplot(show.legend = FALSE) +
  coord_flip() +
  labs(title = "Strikeout Rate Expectations by Position",
       subtitle = "Who is 'allowed' to strike out more in exchange for power?",
       x = "Position",
       y = "Strikeout Rate")
This boxplot shows that a high strikeout rate is a “functional cost” that is more acceptable for certain positions than others. It is consistent with the idea that there is no “universal guarantee” that more strikeouts lead to more runs; rather, the value of that trade-off depends heavily on the player’s defensive role and expected output.
Part 3: Secondary Data Source
Load in CSV and Library Packages
library(tidyverse)
library(rvest)
library(xml2)
mlb_data <- read_csv("mlb_historical_batting.csv")
Data Wrangling
# Clean the data: Remove the "League Average" rows and empty columns
mlb_clean <- mlb_data %>%
  filter(!Tm %in% c("League Average", "Avg per 162 G")) %>%
  select(-matches("Var|Rank")) %>%
  mutate(
    across(-c(Tm, Season), as.numeric),
    Season = as.character(Season)
  )
# Create a season-level summary
mlb_summary <- mlb_clean %>%
  group_by(Season) %>%
  summarize(
    Avg_HR = mean(HR, na.rm = TRUE),
    Avg_SO = mean(SO, na.rm = TRUE),
    Avg_BB = mean(BB, na.rm = TRUE),
    Avg_OBP = mean(OBP, na.rm = TRUE),
    Avg_Runs = mean(R, na.rm = TRUE),
    Avg_SB = mean(SB, na.rm = TRUE),
    # na.rm = TRUE keeps this ratio consistent with the averages above
    HR_per_SO = sum(HR, na.rm = TRUE) / sum(SO, na.rm = TRUE)
  )
Visualization 1: Power vs. Run Production
ggplot(mlb_summary, aes(x = Season, group = 1)) +
  geom_line(aes(y = Avg_HR, color = "Home Runs")) +
  geom_line(aes(y = Avg_Runs / 10, color = "Runs (Scaled 1/10)")) +
  scale_y_continuous(sec.axis = sec_axis(~.*10, name = "Average Runs")) +
  labs(title = "Power vs. Overall Scoring (2016-2025)",
       y = "Mean Home Runs",
       color = "Metric")
In 2019, teams hit more home runs on average than in the other seasons shown. Runs, however, did not rise by a comparable amount from year to year. So although there were more home runs, total scoring did not increase as sharply as one might expect.
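The comparison above, home runs jumping while runs move less, can be made explicit with year-over-year percent changes. This is a small sketch under assumptions: `pct_change` is a hypothetical helper, and the `toy` season averages are invented for illustration rather than read from the CSV.

```r
# Hypothetical helper: percent change from the previous element (first value is NA)
pct_change <- function(x) {
  c(NA, diff(x) / head(x, -1) * 100)
}

# Illustrative season averages only: made-up numbers, not values from the CSV
toy <- data.frame(
  Season   = c("2017", "2018", "2019"),
  Avg_HR   = c(200, 180, 225),
  Avg_Runs = c(750, 720, 780)
)

# Percent change in each metric relative to the prior season
toy$HR_change   <- pct_change(toy$Avg_HR)
toy$Runs_change <- pct_change(toy$Avg_Runs)
toy
```

Applied to a summary table like `mlb_summary` (sorted by season), a larger percent change in home runs than in runs for 2019 would back up the reading of the plot.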
Visualization 2: Trend Analysis with Confidence Intervals
ggplot(mlb_summary, aes(x = as.numeric(Season), y = Avg_HR)) +
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_line(color = "steelblue", linewidth = 1.2, alpha = 0.8) +
  geom_point(color = "darkblue", size = 3) +
  # Smoothed trend with its confidence band, styled to match the subtitle
  geom_smooth(method = "loess", formula = y ~ x, color = "red", linetype = "dashed") +
  labs(title = "Average Team Home Runs (2016-2025)",
       subtitle = "Blue line shows annual average; Red dashed line shows the smoothed trend",
       y = "Mean HR per Team",
       x = "Season")
This graph tells a similar story to the previous one: 2019 had a higher average number of home runs per team than the other years. After 2019, average home runs dipped for a few seasons, and they are starting to climb again in 2025.
Visualization 3: Strikeout Rates
ggplot(mlb_summary, aes(x = Season, y = Avg_SO)) +
  geom_col() +
  geom_text(aes(label = round(Avg_SO, 0)), vjust = -0.5, size = 3.5) +
  coord_cartesian(ylim = c(min(mlb_summary$Avg_SO) - 100, max(mlb_summary$Avg_SO) + 100)) +
  labs(title = "The Strikeout Era: Average Team SO",
       subtitle = "Year-over-year volume of strikeouts per team",
       y = "Mean SO per Team",
       x = "Season")
This graph shows the average number of team strikeouts by season. Strikeouts rose sharply from 2016 to 2019 and have since been fairly steady, at a level below the 2019 numbers.
Visualization 4: OPS Distribution
ggplot(mlb_clean, aes(x = as.factor(Season), y = OPS, fill = as.factor(Season))) +
  geom_boxplot(width = 0.2, color = "black", outlier.shape = 1) +
  labs(title = "League-Wide OPS Density by Season",
       subtitle = "Showing the spread and concentration of team offensive performance",
       x = "Season",
       y = "OPS")
This graph compares the distribution of team OPS across seasons. OPS increased on average from 2016 to 2019, likely driven in large part by the increasing number of home runs. As above, the years after 2019 show lower average OPS values, possibly because fewer hitters are swinging purely for power and more are focusing on on-base percentage.
Visualization 5: HR vs. SO Relationship
ggplot(mlb_clean, aes(x = SO, y = HR)) +
  geom_point(color = "darkgrey", alpha = 0.5) +
  geom_smooth(method = "lm", color = "darkblue", se = TRUE, fill = "lightblue") +
  facet_wrap(~Season, ncol = 5) +
  labs(title = "The Correlation Between Strikeouts and Power",
       subtitle = "Faceted by Season: How the HR/SO trade-off evolves",
       x = "Strikeouts (SO)",
       y = "Home Runs (HR)")
This graph shows how home runs and strikeouts relate to each other within each season. Teams that hit more home runs tend to strike out more: as the number of strikeouts goes up, the number of home runs also increases. In 2019, teams hit more home runs but also struck out more, which supports the idea that swinging for power comes with more strikeouts.
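The faceted trend lines can be summarized as one correlation per season, which makes it easier to see whether the relationship strengthens or weakens over time. The following is a sketch assuming team-level rows with `Season`, `SO`, and `HR` columns like `mlb_clean`; the `toy` rows are made up for illustration.

```r
# Illustrative team rows only: made-up numbers, not values from the CSV
toy <- data.frame(
  Season = rep(c("2018", "2019"), each = 4),
  SO     = c(1300, 1350, 1400, 1450, 1350, 1400, 1450, 1500),
  HR     = c(170, 190, 185, 210, 230, 220, 250, 260)
)

# Split by season, then compute the HR-SO Pearson correlation within each group
season_cor <- sapply(
  split(toy, toy$Season),
  function(d) cor(d$SO, d$HR, use = "complete.obs")
)
season_cor
```

A table of these values by season would show directly how the HR/SO trade-off evolves, which is the comparison the facets are meant to support.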
Conclusion
The transition from the “small ball” era to a strategy defined by the Three True Outcomes is more than just a passing trend; it is the culmination of a decade-long evolution in how value is measured on the baseball field. By analyzing the relationship between team strikeout rates and home run production from 2016 to 2025, we can draw several key conclusions about the state of the modern game.
The Trade-off is Real, but Not Guaranteed
Our analysis of the career-level data in Part 2 confirms that while a positive correlation exists between striking out and hitting for power, it is far from a universal guarantee of success. The visualization of career run production demonstrates that the highest-scoring players are not necessarily those with the highest strikeout rates. Instead, the data reveals a “sweet spot”: a threshold where players can maximize their power output without becoming so inefficient that they cease to create runs.
Lessons from the 2019 Surge
The historical trends analyzed in Part 3 highlight the 2019 season as an anomaly. While that year saw a massive spike in home runs, total run production did not rise at a proportional rate. This suggests that the extra home runs of the “power surge” did not translate one-for-one into scoring, and that the league has since recalibrated. As of 2025, strikeout totals appear to have stabilized, which may indicate that teams have reached a point of diminishing returns with the “swing-and-miss” approach.
The “Discipline” Modifier
Perhaps the most significant finding for any business analytics student or front-office executive is the role of plate discipline. Our findings suggest that in the current era, a high strikeout rate is sustainable mainly when it is paired with a high walk-to-strikeout ratio. Players who purely swing for the fences without the ability to reach base safely through walks are increasingly becoming liabilities in a data-driven lineup.
Final Thoughts
As I look back on the last ten seasons, it is clear that while the “small ball” of the past may be gone, the fundamental goal of the game—scoring runs—remains unchanged. The modern player must navigate a landscape where strikeouts are an accepted cost of doing business, but only if they are the byproduct of a disciplined, high-power approach. For the Cincinnati Reds and the rest of the league, the challenge of the next decade will be finding the next “evolution” that can counter the high-velocity, high-strikeout environment of 2026 and beyond.