When trying to understand relationships between two variables, there is no doubt great appeal to the simple two-dimensional graph. After all, one of the most fundamental lessons in statistics is seeing trends between two variables on a graph. Moving up the X-Axis variable increases the Y-Axis variable in a strong, linear fashion. Or perhaps the Y-Axis variable increases, then decreases, like a parabola. These trends are easily seen and can help us make inferences on variable relationships without flexing much statistical muscles.
But if only life were so simple. In reality, a response variable is often influenced by numerous, even countless variables. Humans are perceptive, but our fancy graphs and trend-lines only work up to three dimensions. How do we understand a relationship between 10 variables?
Principal Component Analysis (PCA) is a potential solution. As a dimension-reducing method, it can allow us to view the relationships of many variables in ways more attuned to our perception, while still explaining much of the variation of a data set.
To explore this concept, we will analyze Major League Baseball players and their performances from the 2025 season. We will aim to mathematically categorize hitters into different archetypes as to see where their strengths are. But with dozens of statistics available for every batter, the relationships will no doubt be quite abstract! Principal Component Analysis will allow us to reduce this dimensionality, while still explaining their variations in a meaningful way.
All data derived from the Baseballr package created by Bill Petti and Saiem Gilani. From Bill Petti’s GitHub repository:
“baseballr is a package written for R focused on
baseball analysis. It includes functions for scraping various data from
websites, such as FanGraphs.com, Baseball-Reference.com,
and baseballsavant.mlb.com.
It also includes functions for calculating metrics, such as wOBA, FIP,
and team-level consistency over custom time frames.
You can read more about some of the functions and how to use them at its official site as well as this Hardball Times article.
Copyright (c) 2020 baseballr authors
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.”
2025 batting statistics, acquired by the baseballr package, gives us 61 variables. We retain batters with at least 50 plate appearances for more complete data. Some variables are performance statistics, others are unique identifiers and demographics. For our purposes, we retained the following variables:
hr.pct: Home runs divided by total at bats
bb.pct: Walks divided by total at bats
sb: Total stolen bases
avg: Year long batting average
rbi: Total runners batted in
player_full_name: The player’s name
team_name: The player’s team name as of the trade deadline (July 31st, 2025)
league_name: The player’s team league (AL or NL)
There were no missing values in any of the retained variables.
vars <- dplyr::select(stats, team_name, player_full_name, at_bats, avg, rbi, hits, home_runs, base_on_balls, stolen_bases, league_name)
pca.vars <- vars %>%
mutate(hr.pct = home_runs / at_bats,
bb.pct = base_on_balls / at_bats,
sb = stolen_bases,
avg = as.numeric(avg)) %>%
dplyr::select(player_full_name, team_name, avg, rbi, hr.pct, bb.pct, sb, league_name)
clean <- na.omit(pca.vars)
data <- clean %>%
dplyr::select(-player_full_name, -league_name, -team_name)PCA was performed on the selected variables. Due to the different measurement metrics, the variation was scaled. The first two principal components explain 64.46% of overall variance. Because they are orthogonal, we can view them as a new, rotated axis to plot our data set. This creates a heat map where the origin houses the most average observations. The red, more extreme observations are players whose stats make them better or worse than the average player. The next plot presents the underlying variable’s impact on a player’s position on the new axis. We can see, for example, that a player with many home runs and walks will be moved toward the upper right. A player with many stolen bases will be moved downward to the right. The opposite is true for players with low numbers in these metrics respectively.
| PC1 | PC2 | PC3 | PC4 | PC5 | |
|---|---|---|---|---|---|
| avg | -0.4618 | 0.359 | 0.1646 | 0.7425 | -0.2819 |
| rbi | -0.6137 | 0.02173 | 0.1325 | -0.1304 | 0.767 |
| hr.pct | -0.4667 | -0.4257 | 0.4537 | -0.3762 | -0.5036 |
| bb.pct | -0.227 | -0.6864 | -0.6096 | 0.3252 | -0.001607 |
| sb | -0.3752 | 0.4672 | -0.6147 | -0.4294 | -0.2803 |
| PC1 | PC2 | PC3 | PC4 | PC5 | |
|---|---|---|---|---|---|
| Standard deviation | 1.467 | 1.034 | 0.9229 | 0.8058 | 0.5253 |
| Proportion of Variance | 0.4307 | 0.2139 | 0.1703 | 0.1298 | 0.05519 |
| Cumulative Proportion | 0.4307 | 0.6446 | 0.815 | 0.9448 | 1 |
pca$x[,1] <- -pca$x[,1] # Flip PC1
pca$x[,2] <- -pca$x[,2] # Flip PC2
# Flip loading (arrows)
pca$rotation[,1] <- -pca$rotation[,1]
pca$rotation[,2] <- -pca$rotation[,2]
coords <- data.frame(pca$x)
coords <- coords %>%
dplyr::select(PC1, PC2) %>%
mutate(league = clean$league_name)
fviz_pca_ind(pca, col.ind = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))fviz_pca_var(pca, col.var = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)We next group our reoriented data using hierarchical clustering. This will mathematically ensure that we are putting similar observations together. We will choose a three-cluster cutoff.
d <- dist(coords, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
k=3
cut_height <- sort(hc$height, decreasing = TRUE)[k - 1]
plot(hc, main = "Heirarchical Clustering of MLB Hitters")
abline(h = cut_height, col = "red", lty = 2, lwd = 2)clusters <- cutree(hc, k = 3)
clustered <- coords %>%
mutate(cluster = as.numeric(clusters),
name = clean$player_full_name,
team = clean$team_name)We can now introduce our clusters to our biplot and make general conclusions about what kind of batters were grouped together. A simple categorization of these players is to label them as “Power” hitters, “Contact/Speed” hitters, or neither. It comes as no surprise that superstars with booming bats make up the power cluster, players like Aaron Judge, Kyle Schwarber, Shohei Ohtani, and Cal Raleigh. On the other hand, players like Chandler Simpson, Trea Turner, Elly De la Cruz, and Pete Crow-Armstrong naturally fall into the Speed and Contact cluster for their formidable base running.
# Relabel cluster levels
clustered$cluster <- factor(clustered$cluster,
levels = c(1, 2, 3),
labels = c("Neither", "Contact/Speed", "Power"))
# Plot PCA with variable arrows
fviz_pca_biplot(pca,
label = "var", # show variable arrows only
habillage = clustered$cluster, # color by cluster
addEllipses = TRUE, ellipse.level = 0.95,
repel = TRUE) +
ggtitle("MLB PCA Biplot with Clusters")players_to_label <- c(
"Aaron Judge", "Kyle Schwarber", "Shohei Ohtani", "Cal Raleigh",
"Chandler Simpson", "Trea Turner", "Elly De La Cruz", "Pete Crow-Armstrong"
)
label_data <- clustered %>%
dplyr::filter(name %in% players_to_label)
ggplot(clustered, aes(x = PC1, y = PC2, color = factor(cluster))) +
geom_point(size = 3) +
geom_text(
data = label_data,
aes(label = name),
vjust = -0.5,
size = 3
) +
labs(
title = "MLB Batter Clusters (Hierarchical)",
color = "Cluster"
) +
theme_minimal()To simply analyze if our clustering has any practical impact, we can sort through the teams that players from each cluster play for, and see if these players had an impact on the season outcomes, namely a ticket to the playoffs. Interestingly enough, a large portion of the teams topping each list made the playoffs. This is rather intuitive, as it follows that teams with good players play better and make it to playoffs. It is interesting to see this begin to be mathematically supported as well.
team_power_counts <- clustered %>%
count(team, cluster) %>%
filter(cluster == "Power") %>%
arrange(desc(n))
team_speed_counts <- clustered %>%
count(team, cluster) %>%
filter(cluster == "Contact/Speed") %>%
arrange(desc(n))
playoff_teams_2025 <- c("Toronto Blue Jays",
'New York Yankees',
'Seattle Mariners',
'Boston Red Sox',
'Detroit Tigers',
'Cleveland Guardians',
'Milwuakee Brewers',
'Philadelphia Phillies',
'Chicago Cubs',
'Los Angeles Dodgers',
'San Diego Padres',
'Cincinnati Reds'
)
df <- left_join(team_power_counts, team_speed_counts,
by = "team")
df <- df %>%
rename('Power Hitters' = n.x,
'Contact/Speed Hitters' = n.y) %>%
select('team', 'Power Hitters', 'Contact/Speed Hitters') %>%
mutate('Made Playoffs' = if_else(team %in% playoff_teams_2025, 'Yes', 'No'))
pander(df)| team | Power Hitters | Contact/Speed Hitters | Made Playoffs |
|---|---|---|---|
| Atlanta Braves | 5 | 4 | No |
| Chicago Cubs | 5 | 5 | Yes |
| New York Yankees | 5 | 6 | Yes |
| Boston Red Sox | 4 | 5 | Yes |
| Milwaukee Brewers | 4 | 7 | No |
| Chicago White Sox | 3 | 2 | No |
| Detroit Tigers | 3 | 4 | Yes |
| Los Angeles Angels | 3 | 3 | No |
| Los Angeles Dodgers | 3 | 4 | Yes |
| New York Mets | 3 | 4 | No |
| Philadelphia Phillies | 3 | 5 | Yes |
| San Francisco Giants | 3 | 3 | No |
| Seattle Mariners | 3 | 6 | Yes |
| Arizona Diamondbacks | 2 | 4 | No |
| Athletics | 2 | 6 | No |
| Baltimore Orioles | 2 | 3 | No |
| Cincinnati Reds | 2 | 7 | Yes |
| Cleveland Guardians | 2 | 2 | Yes |
| Houston Astros | 2 | 4 | No |
| Kansas City Royals | 2 | 4 | No |
| Miami Marlins | 2 | 4 | No |
| Minnesota Twins | 2 | 3 | No |
| Toronto Blue Jays | 2 | 7 | Yes |
| Washington Nationals | 2 | 3 | No |
| Colorado Rockies | 1 | 4 | No |
| Pittsburgh Pirates | 1 | 2 | No |
| San Diego Padres | 1 | 8 | Yes |
| St. Louis Cardinals | 1 | 4 | No |
| Texas Rangers | 1 | 3 | No |
Having categorized the 2025 hitters and counting their presence on each team, we can experiment with these player’s impact on a roster. We’ve partitioned each team into whether they did or did not make the 2025 playoffs. We can then use logistic regression to determine if a team is more or less likely to make the playoffs based on the status of their players.
df <- df %>%
mutate('Made Playoffs' = if_else(team %in% playoff_teams_2025, 1, 0))
library(caret)
fitControl <- trainControl(method = "cv", number = 5, classProbs = TRUE, savePredictions = "all")
set.seed(1)
model.logit <- train(data = df,
`Made Playoffs` ~ `Power Hitters` + `Contact/Speed Hitters`,
method = "glm",
family = "binomial",
trControl = fitControl)
pander(summary(model.logit))| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -5.125 | 1.885 | -2.718 | 0.006558 |
\Power Hitters`` |
0.4963 | 0.404 | 1.228 | 0.2193 |
\Contact/Speed
Hitters`` |
0.7358 | 0.3175 | 2.317 | 0.02049 |
(Dispersion parameter for binomial family taken to be 1 )
| Null deviance: | 38.50 on 28 degrees of freedom |
| Residual deviance: | 28.96 on 26 degrees of freedom |
| (Intercept) | \Power Hitters`` |
\Contact/Speed Hitters`` |
|---|---|---|
| 0.005946 | 1.643 | 2.087 |
We get an interesting result in that Contact/Speed hitters have a significant impact on playoff appearance, while power hitters don’t. In fact, each Contact/Speed hitter doubles the odds that a team made the playoffs.
This contained analysis leads to interesting results, although their effectiveness in practice is less than perfect. Remember, our dimension reduction left us explaining only 64% of variation, building all further analysis on data that may not capture the true nature of the players involved. Baseball is a game, after all, played by people influenced by millions of factors, measurable or not. There is no doubt variation in the statistics that even an omniscient model cannot capture.
At the very least, we have seen small indications of potential trends that may help teams be successful in regular season play. One might expect a home-run threat player to make the biggest impact on a team (and their salary would certainly reflect that), but its possible that more consistent players who run the bases hard and often are contributing more to a season’s success. Further research could investigate this pattern to a further extent and provide more meaningful results.
All over, we were able to reduce the dimensions of a gargantuan data set into something far more approachable, while still containing a level of variation to make analyses upon.