MLB Analysis
with an introduction to the dataset
Introduction to the MLB Data
This dataset contains information about Major League Baseball players from the 2016 season, including team, batting stats, and performance metrics such as hits, home runs, and RBIs. The data was collected by scraping https://www.mlb.com/ and supports visual exploration of player performance.
For example, you can analyze:
- Who were the top power hitters?
- How do hits relate to runs scored?
- Which teams had the most well-rounded hitters?
Setting Up
We start by loading the libraries needed for scraping, cleaning, and visualization.
library(tidyverse)
library(httr)
library(rvest)
library(polite)
library(lubridate)
library(magrittr)
library(ggplot2)
library(stringr)
library(xml2)
library(scales)
Collecting the Data
MLB.com shows batting stats across multiple pages, so we define a function to build all 33 page URLs.
mlb_url <- "https://www.mlb.com/stats/2016"
# Build one URL per results page by appending the page number as a query parameter.
mlb_urls <- function(mlb_url, pages) {
  paste0(mlb_url, "?page=", seq_len(pages))
}
complete_mlb_urls <- mlb_urls(mlb_url, 33)
Next, we create functions to scrape each table and combine all pages into one dataset.
# Read one page and return its first <table> as a data frame.
scrape_mlb_page <- function(url) {
  url %>%
    read_html() %>%
    html_element("table") %>%
    html_table(fill = TRUE)
}
# Scrape every page in turn, pausing between requests, and stack the results.
scrape_mlb <- function(urls) {
  all_mlb_stats <- data.frame()
  for (i in seq_along(urls)) {
    message("Collecting page ", i, " of ", length(urls))
    Sys.sleep(runif(1, 2, 4))  # random 2-4 second pause between requests
    page_data <- scrape_mlb_page(urls[i])
    all_mlb_stats <- bind_rows(all_mlb_stats, page_data)
  }
  return(all_mlb_stats)
}
player_mlb_table <- scrape_mlb(complete_mlb_urls)
Cleaning the Data
The scraped table comes back with doubled column headers (for example, PLAYERPLAYER and GG), so we rename them to be readable.
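The doubled headers follow a regular pattern (PLAYERPLAYER, GG, 2B2B), so they could also be cleaned programmatically. The sketch below is one possible approach in base R, not the method used in this analysis: `clean_header()` is a hypothetical helper that strips the `caret-up` and `caret-down` sort-icon text and then collapses any name made of the same string repeated twice. It assumes every header is doubled; a header that is legitimately two repeated characters and arrives undoubled (a bare `BB`, say) would be wrongly shortened.

```r
# Hypothetical helper: collapse headers such as "PLAYERPLAYER" to "PLAYER".
clean_header <- function(x) {
  # Drop the sort-icon labels that leak into the HR header.
  x <- gsub("caret-up|caret-down", "", x)
  # If the whole name is one string repeated twice, keep a single copy.
  sub("^(.+)\\1$", "\\1", x, perl = TRUE)
}

raw_names <- c(
  "PLAYERPLAYER", "GG", "BBBB", "2B2B",
  "caret-upcaret-downHRcaret-upcaret-downHR"
)
cleaned_names <- clean_header(raw_names)
print(cleaned_names)
# "PLAYER" "G" "BB" "2B" "HR"
```

Applied to the scraped table, this would look like `names(player_mlb_table) <- clean_header(names(player_mlb_table))`. The analysis itself uses the explicit rename that follows, which keeps each mapping visible.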
player_mlb_table_clean <- player_mlb_table %>%
  rename(
    PLAYER = PLAYERPLAYER,
    TEAM = TEAMTEAM,
    G = GG,
    AB = ABAB,
    R = RR,
    H = HH,
    `2B` = `2B2B`,
    `3B` = `3B3B`,
    HR = `caret-upcaret-downHRcaret-upcaret-downHR`,
    RBI = RBIRBI,
    BB = BBBB,
    SO = SOSO,
    SB = SBSB,
    CS = CSCS,
    AVG = AVGAVG,
    OBP = OBPOBP,
    SLG = SLGSLG,
    OPS = OPSOPS
  )
Visualization 1: Batting Average vs OPS by Team
player_mlb_table_clean %>%
  filter(AB >= 100) %>%
  ggplot(aes(x = AVG, y = OPS, color = TEAM)) +
  geom_point(alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = FALSE, aes(group = TEAM, color = TEAM), linetype = "dashed") +
  scale_x_continuous(labels = scales::number_format(accuracy = 0.001)) +
  scale_color_viridis_d(option = "C") +
  labs(
    title = "Batting Average vs OPS by Team",
    subtitle = "Players with 100+ At Bats",
    x = "Batting Average (AVG)",
    y = "On-Base Plus Slugging (OPS)",
    color = "Team"
  ) +
  theme_minimal()
Explanation:
This scatterplot explores the relationship between a player’s batting average (AVG) and their on-base plus slugging (OPS), which is a key measure of offensive productivity. By grouping and coloring points by team, we can visually compare how different teams’ players perform. The dashed trendlines help show whether players on each team tend to produce higher or lower OPS given their batting average.
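One way to put numbers on the dashed trendlines is to fit the same linear model separately for each team and compare the slopes. The sketch below does this in base R on a small made-up data frame; the team codes `AAA` and `BBB` and all values are hypothetical, not 2016 stats.

```r
# Toy data: values are invented for illustration only.
toy <- data.frame(
  TEAM = rep(c("AAA", "BBB"), each = 4),
  AVG  = c(0.240, 0.260, 0.280, 0.300, 0.240, 0.260, 0.280, 0.300),
  OPS  = c(0.700, 0.750, 0.800, 0.850, 0.680, 0.700, 0.720, 0.740)
)

# Fit OPS ~ AVG within each team; each column holds one team's coefficients.
team_fits <- sapply(
  split(toy, toy$TEAM),
  function(d) coef(lm(OPS ~ AVG, data = d))
)

# The "AVG" row is the slope: change in OPS per unit change in AVG.
team_slopes <- team_fits["AVG", ]
print(round(team_slopes, 2))
```

A steeper slope means a team's OPS rises faster with batting average in this sample. With the real table, the same idea would apply to `player_mlb_table_clean` after the `AB >= 100` filter, bearing in mind that some teams may have few qualifying players and therefore unstable fits.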
Visualization 2: Home Runs vs RBIs
player_mlb_table_clean %>%
  ggplot(aes(x = HR, y = RBI)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Home Runs vs Runs Batted In",
    x = "Home Runs",
    y = "RBIs"
  ) +
  theme_minimal()
Explanation:
This visualization highlights the positive correlation between home runs (HR) and runs batted in (RBI). Players who hit more home runs generally drive in more runs, which is expected because every home run credits the batter with at least one RBI. The regression line emphasizes the linear relationship and helps identify outliers.
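The correlation and the outliers mentioned above can also be checked numerically. The following is a minimal sketch in base R on made-up data (the player labels and values are hypothetical): it computes the Pearson correlation, fits the same kind of linear model that `geom_smooth(method = "lm")` draws, and flags the point farthest from the line.

```r
# Toy data: invented values, not real player stats.
toy <- data.frame(
  PLAYER = paste("Player", LETTERS[1:6]),
  HR     = c(5, 10, 15, 20, 25, 30),
  RBI    = c(30, 45, 60, 110, 85, 100)
)

# Pearson correlation between home runs and RBIs.
r <- cor(toy$HR, toy$RBI)

# Linear fit of RBI on HR, matching the plotted regression line.
fit <- lm(RBI ~ HR, data = toy)

# Residual = observed RBI minus the RBI predicted from HR.
toy$residual <- resid(fit)

# The largest absolute residual marks the strongest outlier.
outlier <- toy[which.max(abs(toy$residual)), ]
print(round(r, 2))
print(outlier$PLAYER)
```

A large positive residual marks a player with more RBIs than their home run total alone would predict, and a large negative one marks the opposite.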
Visualization 3: Top Reds Batting Averages
player_mlb_table_clean %>%
  filter(TEAM == "CIN", AB >= 100) %>%
  arrange(desc(AVG)) %>%
  ggplot(aes(x = reorder(PLAYER, AVG), y = AVG)) +
  geom_col(fill = "red") +
  geom_text(aes(label = scales::number(AVG, accuracy = 0.001)), hjust = -0.2, size = 4) +
  coord_flip() +
  scale_y_continuous(
    labels = scales::number_format(accuracy = 0.001),
    expand = expansion(mult = c(0, 0.1))
  ) +
  labs(
    title = "Top Batting Averages - Cincinnati Reds",
    subtitle = "Players with 100+ AB",
    x = "Player",
    y = "Batting Average"
  ) +
  theme_minimal()
Explanation:
This bar chart focuses on Cincinnati Reds players who had at least 100 at-bats in the 2016 season. It ranks them by batting average, showing who were the most consistent hitters on the team. Each bar is labeled with the player’s actual average, making it easy to compare performance among teammates.
Visualization 4: Hits vs Runs, Colored by Batting Average
player_mlb_table_clean %>%
  filter(AB >= 100) %>%
  ggplot(aes(x = H, y = R, color = AVG)) +
  geom_point(size = 3, alpha = 0.8) +
  geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed") +
  scale_color_viridis_c(option = "D") +
  labs(
    title = "Hits vs Runs Scored",
    subtitle = "Colored by Batting Average",
    x = "Hits",
    y = "Runs",
    color = "AVG"
  ) +
  theme_minimal()
Explanation:
This scatter plot shows the relationship between total hits (H) and runs scored (R). In general, players who collect more hits are also the ones who score more runs. By coloring each point according to batting average, we can identify players who not only hit often but also hit effectively. The trendline confirms the upward relationship between these two stats.
Visualization 5: Top 10 Home Run Hitters in MLB
player_mlb_table_clean %>%
  filter(AB >= 100) %>%
  arrange(desc(HR)) %>%
  slice_head(n = 10) %>%
  ggplot(aes(x = reorder(PLAYER, HR), y = HR)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = HR), hjust = -0.2, size = 4) +
  coord_flip() +
  labs(
    title = "Top 10 Home Run Hitters - 2016 Season",
    subtitle = "Players with 100+ At Bats",
    x = "Player",
    y = "Home Runs"
  ) +
  theme_minimal()
Explanation:
This bar chart shows the top 10 home run hitters across all Major League Baseball teams in 2016. Only players with at least 100 at-bats are considered. The chart is sorted in descending order, and each bar is labeled with the exact number of home runs, making it easy to identify the league’s most powerful hitters.
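One detail worth knowing about a "top N" cut: `arrange()` followed by `slice_head()` keeps exactly N rows, so players tied at the cutoff may be dropped arbitrarily. The sketch below contrasts that with dplyr's `slice_max()`, which keeps ties by default. The data is made up and the player labels are hypothetical; a top 3 stands in for the top 10 to keep the example small.

```r
library(dplyr)

# Toy data: two players are tied for third place with 35 home runs.
toy <- tibble(
  PLAYER = paste("Player", LETTERS[1:5]),
  HR     = c(40, 38, 35, 35, 30)
)

# Approach used in the chart: sort, then keep exactly the first n rows.
top3_strict <- toy %>%
  arrange(desc(HR)) %>%
  slice_head(n = 3)

# Alternative: keep every row tied with the n-th largest value.
top3_ties <- toy %>%
  slice_max(HR, n = 3, with_ties = TRUE)

print(nrow(top3_strict))
print(nrow(top3_ties))
```

Which behavior is preferable depends on the chart: a strict cut keeps the bar count fixed, while keeping ties avoids leaving out a player with the same home run total as one who is shown.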