Assignment 7: Ethical Web Scraping

Data Introduction

This project analyzes Major League Baseball (MLB) team batting statistics over a ten-year period (2016–2025). The dataset includes key performance metrics such as batting average, home runs, strikeouts, and OPS (On-base Plus Slugging). By aggregating this data, we can observe how offensive strategies, such as the “Three True Outcomes” (HR, BB, K) or the recent emphasis on batting average, have shifted across the league.

Why this Data is Suitable for Scraping

Baseball-Reference is well suited to scraping because it maintains a highly consistent URL structure and table ID system across seasons. Although the site wraps some tables in HTML comments, so they are missing from the page as initially parsed, a programmatic approach in R lets us “unhide” these tables and turn fragmented web pages into a structured, longitudinal dataset that would be impractical to compile manually.
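
The “unhide” step can be sketched as follows. This is a minimal illustration, not the exact code used for this project: the inline HTML is a made-up stand-in for a downloaded page, the team names and numbers are invented, and the table ID `teams_standard_batting` is an assumption about the site's markup.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for a downloaded page: the table sits inside an
# HTML comment, so a normal html_table() call on the page would miss it.
page_html <- '<html><body>
<div id="all_teams_standard_batting">
<!--
<table id="teams_standard_batting">
<thead><tr><th>Tm</th><th>HR</th><th>SO</th></tr></thead>
<tbody>
<tr><td>AAA</td><td>200</td><td>1400</td></tr>
<tr><td>BBB</td><td>180</td><td>1350</td></tr>
</tbody>
</table>
-->
</div></body></html>'

page <- read_html(page_html)

# Pull the text of every comment node, then re-parse that text as HTML
comment_text <- xml_text(xml_find_all(page, "//comment()"))
hidden <- read_html(paste(comment_text, collapse = ""))

# The table ID below is an assumption, not confirmed by this report
batting <- hidden %>%
  html_element("#teams_standard_batting") %>%
  html_table()

print(batting)
```

Re-parsing the comment text is what makes the table visible to `html_table()`; nothing about the table itself changes.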

Question

How has the relationship between league-wide strikeout rates and home run production evolved between 2016 and 2025, and are we seeing a “correction” in offensive trends in the most recent seasons?

How I Will Attempt to Answer This Question

I will use the rvest package to scrape team-level data for each season. After cleaning the data to remove aggregate league rows (which appear as extra rows in the tables), I will calculate league-wide averages per year. I will then use ggplot2 to create a series of visualizations, including line charts and scatter plots, to identify relationships between power hitting and plate discipline over the last decade. For readability, I have removed the 2020 season, which was shortened to 60 games by the COVID-19 pandemic and so is not comparable on season totals.
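
One way the season-by-season loop might look is sketched below. The URL pattern in `season_url()`, the helper names, and the five-second delay are all assumptions made for illustration. The `fetch` argument stands in for whatever function downloads and parses one page, and here it is replaced by a fake so the sketch runs offline.

```r
library(dplyr)

# Seasons in scope, with 2020 excluded as described above
seasons <- setdiff(2016:2025, 2020)

# Hypothetical URL pattern for a season's league page (an assumption)
season_url <- function(year) {
  sprintf("https://www.baseball-reference.com/leagues/majors/%d.shtml", year)
}

# Visit one season at a time, pausing between requests to stay polite.
# `fetch` is any function that takes a URL and returns a data frame.
scrape_seasons <- function(years, fetch, delay = 5) {
  pages <- lapply(years, function(yr) {
    Sys.sleep(delay)
    fetch(season_url(yr)) %>% mutate(Season = yr)
  })
  bind_rows(pages)
}

# Offline demo with a fake fetcher and invented numbers
fake_fetch <- function(url) {
  tibble::tibble(Tm = c("AAA", "BBB"), HR = c(200, 180))
}
demo <- scrape_seasons(seasons[1:2], fake_fetch, delay = 0)
print(demo)
```

Passing the fetcher in as an argument keeps the throttling logic separate from the parsing logic, so the loop can be tested without touching the site.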


Load in CSV and Library Packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rvest)

Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding
library(xml2)

mlb_data <- read_csv("mlb_historical_batting.csv")
Rows: 330 Columns: 30
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (29): Tm, #Bat, BatAge, R/G, G, PA, AB, R, H, 2B, 3B, HR, RBI, SB, CS, B...
dbl  (1): Season

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Wrangling

# Clean the data: Remove the "League Average" rows and empty columns
mlb_clean <- mlb_data %>%
  filter(!Tm %in% c("League Average", "Avg per 162 G")) %>%
  select(-matches("Var|Rank")) %>%
  mutate(
    across(-c(Tm, Season), as.numeric),
    Season = as.character(Season))
Warning: There were 28 warnings in `mutate()`.
The first warning was:
ℹ In argument: `across(-c(Tm, Season), as.numeric)`.
Caused by warning:
! NAs introduced by coercion
ℹ Run `dplyr::last_dplyr_warnings()` to see the 27 remaining warnings.
# Create a season-level summary
mlb_summary <- mlb_clean %>%
  group_by(Season) %>%
  summarize(
    Avg_HR = mean(HR, na.rm = TRUE),
    Avg_SO = mean(SO, na.rm = TRUE),
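    Avg_BB = mean(BB, na.rm = TRUE),
    Avg_OBP = mean(OBP, na.rm = TRUE),
    Avg_Runs = mean(R, na.rm = TRUE),
    Avg_SB = mean(SB, na.rm = TRUE),
    # na.rm is needed here too; otherwise a single NA row makes the ratio NA
    HR_per_SO = sum(HR, na.rm = TRUE) / sum(SO, na.rm = TRUE)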
  )
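
The coercion warnings above suggest that some rows contain text where numbers are expected. A minimal sketch of one way to handle this is to drop such rows by name before converting, assuming the cause is a repeated header row. The toy data, the `"Tm"` header label, and all numbers here are invented for illustration.

```r
library(dplyr)

# Toy data: two teams, one repeated header row, one league-average row
toy <- tibble::tibble(
  Tm     = c("AAA", "BBB", "Tm", "League Average"),
  Season = 2019,
  HR     = c("200", "180", "HR", "190"),
  SO     = c("1400", "1350", "SO", "1375")
)

# Filter out non-team rows first, so as.numeric() has nothing to coerce to NA
toy_clean <- toy %>%
  filter(!Tm %in% c("League Average", "Tm")) %>%
  mutate(across(c(HR, SO), as.numeric))

toy_summary <- toy_clean %>%
  group_by(Season) %>%
  summarize(
    Avg_HR    = mean(HR),
    HR_per_SO = sum(HR) / sum(SO)
  )

print(toy_summary)
```

With the non-team rows removed up front, the summary no longer depends on `na.rm` to hide rows that were never teams.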

Visualization 1: Power vs. Run Production

ggplot(mlb_summary, aes(x = Season, group = 1)) +
  geom_line(aes(y = Avg_HR, color = "Home Runs")) +
  geom_line(aes(y = Avg_Runs / 10, color = "Runs (Scaled 1/10)")) +
  scale_y_continuous(sec.axis = sec_axis(~.*10, name = "Average Runs")) +
  labs(title = "Power vs. Overall Scoring (2016-2025)",
       y = "Mean Home Runs",
       color = "Metric")

The 2019 season stands out with more home runs than any other season in the sample. Runs, however, did not rise by a comparable amount from year to year. More home runs did not translate into the sharp increase in total scoring one might expect.
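
The comparison can be made concrete by computing year-over-year percentage changes for both metrics. The sketch below uses invented season averages, not values from the dataset, and only the column names follow `mlb_summary`.

```r
library(dplyr)

# Hypothetical season averages, invented purely for illustration
toy_summary <- tibble::tibble(
  Season   = c("2018", "2019"),
  Avg_HR   = c(180, 225),
  Avg_Runs = c(720, 780)
)

# Percentage change relative to the previous season
yoy <- toy_summary %>%
  arrange(Season) %>%
  mutate(
    HR_pct_change   = (Avg_HR / dplyr::lag(Avg_HR) - 1) * 100,
    Runs_pct_change = (Avg_Runs / dplyr::lag(Avg_Runs) - 1) * 100
  )

print(yoy)
```

If the home run change is consistently larger than the run change, that would support the reading of the chart given above.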

Visualization 2: Trend Analysis with Confidence Intervals

ggplot(mlb_summary, aes(x = as.numeric(Season), y = Avg_HR)) +
  # linewidth replaces the deprecated size argument for lines
  geom_line(color = "steelblue", linewidth = 1.2, alpha = 0.8) +
  geom_point(color = "darkblue", size = 3) +
  # Styled to match the subtitle's description of the smoothed trend
  geom_smooth(color = "red", linetype = "dashed") +
  labs(title = "Average Team Home Runs (2016-2025)",
       subtitle = "Blue line shows annual average; Red dashed line shows the smoothed trend",
       y = "Mean HR per Team",
       x = "Season")
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

This graph tells a similar story to the previous one: 2019 had a higher average number of home runs per team than any other year. After 2019, average home runs dipped for a few seasons, and they begin to climb again in 2025.

Visualization 3: Strikeout Rates

ggplot(mlb_summary, aes(x = Season, y = Avg_SO)) +
  geom_col() +
  geom_text(aes(label = round(Avg_SO, 0)), vjust = -0.5, size = 3.5) +
  coord_cartesian(ylim = c(min(mlb_summary$Avg_SO) - 100, max(mlb_summary$Avg_SO) + 100)) +
  labs(title = "The Strikeout Era: Average Team SO",
       subtitle = "Year-over-year volume of strikeouts per team",
       y = "Mean SO per Team",
       x = "Season")

This graph shows the average number of team strikeouts by season. Strikeouts rose sharply from 2015 to 2019 and have since leveled off below the 2019 peak.

Visualization 4: OPS Distribution

ggplot(mlb_clean, aes(x = as.factor(Season), y = OPS, fill = as.factor(Season))) +
  geom_boxplot(width = 0.2, color = "black", outlier.shape = 1) +
  labs(title = "League-Wide OPS Distribution by Season",
       subtitle = "Showing the spread and concentration of team offensive performance",
       x = "Season",
       y = "OPS")
Warning: Removed 10 rows containing non-finite outside the scale range
(`stat_boxplot()`).

This graph compares the distribution of team OPS across seasons. Average OPS increased from 2015 to 2019, which is consistent with the rising number of home runs hit. As in the charts above, the years after 2019 show lower average OPS values, possibly because fewer hitters are swinging purely for power and more are focusing on on-base percentage.

Visualization 5: HR vs. SO Relationship

ggplot(mlb_clean, aes(x = SO, y = HR)) +
  geom_point(color = "darkgrey", alpha = 0.5) +
  geom_smooth(method = "lm", color = "darkblue", se = TRUE, fill = "lightblue") +
  facet_wrap(~Season, ncol = 5) + 
  labs(title = "The Correlation Between Strikeouts and Power",
       subtitle = "Faceted by Season: How the HR/SO trade-off evolves",
       x = "Strikeouts (SO)",
       y = "Home Runs (HR)") 
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 10 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 10 rows containing missing values or values outside the scale range
(`geom_point()`).

This graph shows how team home runs and strikeouts relate to each other within each season. Where the fitted line slopes upward, teams that hit more home runs also tend to strike out more. In 2019, teams hit more home runs but also struck out more, which suggests that swinging for power comes with more strikeouts, although a correlation at the team level does not by itself show that one causes the other.
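
To put a number on each panel, the per-season correlation and regression slope can be computed directly. The sketch below uses a small invented dataset with the same column names as `mlb_clean`, and the values are not real team totals.

```r
library(dplyr)

# Toy team-level data for two seasons, invented for illustration
toy <- tibble::tibble(
  Season = rep(c("2018", "2019"), each = 4),
  SO     = c(1300, 1350, 1400, 1450, 1400, 1450, 1500, 1550),
  HR     = c(170, 180, 175, 200, 220, 210, 240, 250)
)

# Pearson correlation and linear slope of HR on SO, one row per season
relationship <- toy %>%
  group_by(Season) %>%
  summarize(
    r     = cor(SO, HR, use = "complete.obs"),
    slope = coef(lm(HR ~ SO))[["SO"]]
  )

print(relationship)
```

A slope that shrinks toward zero in recent seasons would be one sign of the “correction” the research question asks about.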