1 Introduction

When trying to understand relationships between two variables, there is no doubt great appeal to the simple two-dimensional graph. After all, one of the most fundamental lessons in statistics is seeing trends between two variables on a graph. Moving up the X-Axis variable increases the Y-Axis variable in a strong, linear fashion. Or perhaps the Y-Axis variable increases, then decreases, like a parabola. These trends are easily seen and can help us make inferences on variable relationships without flexing much statistical muscles.

But if only life were so simple. In reality, a response variable is often influenced by numerous, even countless variables. Humans are perceptive, but our fancy graphs and trend-lines only work up to three dimensions. How do we understand a relationship between 10 variables?

Principal Component Analysis (PCA) is a potential solution. As a dimension-reducing method, it can allow us to view the relationships of many variables in ways more attuned to our perception, while still explaining much of the variation of a data set.

To explore this concept, we will analyze Major League Baseball players and their performances from the 2025 season. We will aim to mathematically categorize hitters into different archetypes as to see where their strengths are. But with dozens of statistics available for every batter, the relationships will no doubt be quite abstract! Principal Component Analysis will allow us to reduce this dimensionality, while still explaining their variations in a meaningful way.

2 Dimension Reduction by PCA

2.1 Data Loading

All data derived from the Baseballr package created by Bill Petti and Saiem Gilani. From Bill Petti’s GitHub repository:

baseballr is a package written for R focused on baseball analysis. It includes functions for scraping various data from websites, such as FanGraphs.com, Baseball-Reference.com, and baseballsavant.mlb.com. It also includes functions for calculating metrics, such as wOBA, FIP, and team-level consistency over custom time frames.

You can read more about some of the functions and how to use them at its official site as well as this Hardball Times article.

Copyright (c) 2020 baseballr authors

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.”

stats <- mlb_stats(
  stat_type = 'season',
  player_pool = "All",
  stat_group = 'hitting',
  season = 2025,
) 
stats <- subset(stats, at_bats > 50)

2.2 Variable Selection

2025 batting statistics, acquired by the baseballr package, gives us 61 variables. We retain batters with at least 50 plate appearances for more complete data. Some variables are performance statistics, others are unique identifiers and demographics. For our purposes, we retained the following variables:

  • hr.pct: Home runs divided by total at bats

  • bb.pct: Walks divided by total at bats

  • sb: Total stolen bases

  • avg: Year long batting average

  • rbi: Total runners batted in

  • player_full_name: The player’s name

  • team_name: The player’s team name as of the trade deadline (July 31st, 2025)

  • league_name: The player’s team league (AL or NL)

There were no missing values in any of the retained variables.

vars <- dplyr::select(stats, team_name, player_full_name, at_bats, avg, rbi, hits, home_runs, base_on_balls, stolen_bases, league_name)

pca.vars <- vars %>%
  mutate(hr.pct = home_runs / at_bats,
         bb.pct = base_on_balls / at_bats,
         sb = stolen_bases,
         avg = as.numeric(avg)) %>%
  dplyr::select(player_full_name, team_name, avg, rbi, hr.pct, bb.pct, sb, league_name)

clean <- na.omit(pca.vars)
data <- clean %>%
  dplyr::select(-player_full_name, -league_name, -team_name)

2.3 PCA

PCA was performed on the selected variables. Due to the different measurement metrics, the variation was scaled. The first two principal components explain 64.46% of overall variance. Because they are orthogonal, we can view them as a new, rotated axis to plot our data set. This creates a heat map where the origin houses the most average observations. The red, more extreme observations are players whose stats make them better or worse than the average player. The next plot presents the underlying variable’s impact on a player’s position on the new axis. We can see, for example, that a player with many home runs and walks will be moved toward the upper right. A player with many stolen bases will be moved downward to the right. The opposite is true for players with low numbers in these metrics respectively.

pca <- prcomp(data, scale = TRUE)
pander(summary(pca))
Principal Components Analysis
  PC1 PC2 PC3 PC4 PC5
avg -0.4618 0.359 0.1646 0.7425 -0.2819
rbi -0.6137 0.02173 0.1325 -0.1304 0.767
hr.pct -0.4667 -0.4257 0.4537 -0.3762 -0.5036
bb.pct -0.227 -0.6864 -0.6096 0.3252 -0.001607
sb -0.3752 0.4672 -0.6147 -0.4294 -0.2803
  PC1 PC2 PC3 PC4 PC5
Standard deviation 1.467 1.034 0.9229 0.8058 0.5253
Proportion of Variance 0.4307 0.2139 0.1703 0.1298 0.05519
Cumulative Proportion 0.4307 0.6446 0.815 0.9448 1
pca$x[,1] <- -pca$x[,1]  # Flip PC1
pca$x[,2] <- -pca$x[,2]  # Flip PC2

# Flip loading (arrows)
pca$rotation[,1] <- -pca$rotation[,1]
pca$rotation[,2] <- -pca$rotation[,2]

coords <- data.frame(pca$x)
coords <- coords %>%
  dplyr::select(PC1, PC2) %>%
  mutate(league = clean$league_name)

fviz_pca_ind(pca, col.ind = "cos2", 
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))

fviz_pca_var(pca, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE)

2.4 Hierarchical Clustering

We next group our reoriented data using hierarchical clustering. This will mathematically ensure that we are putting similar observations together. We will choose a three-cluster cutoff.

d <- dist(coords, method = "euclidean")
hc <- hclust(d, method = "ward.D2")
k=3
cut_height <- sort(hc$height, decreasing = TRUE)[k - 1]


plot(hc, main = "Heirarchical Clustering of MLB Hitters")
abline(h = cut_height, col = "red", lty = 2, lwd = 2)

clusters <- cutree(hc, k = 3)
clustered <- coords %>%
  mutate(cluster = as.numeric(clusters),
         name = clean$player_full_name,
         team = clean$team_name)

We can now introduce our clusters to our biplot and make general conclusions about what kind of batters were grouped together. A simple categorization of these players is to label them as “Power” hitters, “Contact/Speed” hitters, or neither. It comes as no surprise that superstars with booming bats make up the power cluster, players like Aaron Judge, Kyle Schwarber, Shohei Ohtani, and Cal Raleigh. On the other hand, players like Chandler Simpson, Trea Turner, Elly De la Cruz, and Pete Crow-Armstrong naturally fall into the Speed and Contact cluster for their formidable base running.

# Relabel cluster levels
clustered$cluster <- factor(clustered$cluster,
                            levels = c(1, 2, 3),
                            labels = c("Neither", "Contact/Speed", "Power"))




# Plot PCA with variable arrows
fviz_pca_biplot(pca, 
                label = "var",  # show variable arrows only
                habillage = clustered$cluster,  # color by cluster
                addEllipses = TRUE, ellipse.level = 0.95,
                repel = TRUE) +
  ggtitle("MLB PCA Biplot with Clusters")

players_to_label <- c(
  "Aaron Judge", "Kyle Schwarber", "Shohei Ohtani", "Cal Raleigh",
  "Chandler Simpson", "Trea Turner", "Elly De La Cruz", "Pete Crow-Armstrong"
)
label_data <- clustered %>% 
  dplyr::filter(name %in% players_to_label)

ggplot(clustered, aes(x = PC1, y = PC2, color = factor(cluster))) +
  geom_point(size = 3) +
  geom_text(
    data = label_data,
    aes(label = name),
    vjust = -0.5,
    size = 3
  ) +
  labs(
    title = "MLB Batter Clusters (Hierarchical)",
    color = "Cluster"
  ) +
  theme_minimal()

3 Logistic Regression of Clusters

3.1 Proportion of Clusters by Teams

To simply analyze if our clustering has any practical impact, we can sort through the teams that players from each cluster play for, and see if these players had an impact on the season outcomes, namely a ticket to the playoffs. Interestingly enough, a large portion of the teams topping each list made the playoffs. This is rather intuitive, as it follows that teams with good players play better and make it to playoffs. It is interesting to see this begin to be mathematically supported as well.

team_power_counts <- clustered %>%
  count(team, cluster) %>%
  filter(cluster == "Power") %>%
  arrange(desc(n))

team_speed_counts <- clustered %>%
  count(team, cluster) %>%
  filter(cluster == "Contact/Speed") %>%
  arrange(desc(n))

playoff_teams_2025 <- c("Toronto Blue Jays",
                        'New York Yankees',
                        'Seattle Mariners',
                        'Boston Red Sox',
                        'Detroit Tigers',
                        'Cleveland Guardians',
                        'Milwuakee Brewers',
                        'Philadelphia Phillies',
                        'Chicago Cubs',
                        'Los Angeles Dodgers',
                        'San Diego Padres',
                        'Cincinnati Reds'
)

df <- left_join(team_power_counts, team_speed_counts,
                by = "team")
df <- df %>%
  rename('Power Hitters' = n.x,
         'Contact/Speed Hitters' = n.y) %>%
  select('team', 'Power Hitters', 'Contact/Speed Hitters') %>%
  mutate('Made Playoffs' = if_else(team %in% playoff_teams_2025, 'Yes', 'No'))
pander(df)
team Power Hitters Contact/Speed Hitters Made Playoffs
Atlanta Braves 5 4 No
Chicago Cubs 5 5 Yes
New York Yankees 5 6 Yes
Boston Red Sox 4 5 Yes
Milwaukee Brewers 4 7 No
Chicago White Sox 3 2 No
Detroit Tigers 3 4 Yes
Los Angeles Angels 3 3 No
Los Angeles Dodgers 3 4 Yes
New York Mets 3 4 No
Philadelphia Phillies 3 5 Yes
San Francisco Giants 3 3 No
Seattle Mariners 3 6 Yes
Arizona Diamondbacks 2 4 No
Athletics 2 6 No
Baltimore Orioles 2 3 No
Cincinnati Reds 2 7 Yes
Cleveland Guardians 2 2 Yes
Houston Astros 2 4 No
Kansas City Royals 2 4 No
Miami Marlins 2 4 No
Minnesota Twins 2 3 No
Toronto Blue Jays 2 7 Yes
Washington Nationals 2 3 No
Colorado Rockies 1 4 No
Pittsburgh Pirates 1 2 No
San Diego Padres 1 8 Yes
St. Louis Cardinals 1 4 No
Texas Rangers 1 3 No

3.2 Logistic Regression

Having categorized the 2025 hitters and counting their presence on each team, we can experiment with these player’s impact on a roster. We’ve partitioned each team into whether they did or did not make the 2025 playoffs. We can then use logistic regression to determine if a team is more or less likely to make the playoffs based on the status of their players.

df <- df %>%
  mutate('Made Playoffs' = if_else(team %in% playoff_teams_2025, 1, 0))
library(caret)
fitControl <- trainControl(method = "cv", number = 5, classProbs = TRUE, savePredictions = "all")

set.seed(1)                           
model.logit <- train(data = df,
                   `Made Playoffs` ~ `Power Hitters` + `Contact/Speed Hitters`, 
                   method = "glm",
                   family = "binomial",
                   trControl = fitControl) 


pander(summary(model.logit))
  Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.125 1.885 -2.718 0.006558
\Power Hitters`` 0.4963 0.404 1.228 0.2193
\Contact/Speed Hitters`` 0.7358 0.3175 2.317 0.02049

(Dispersion parameter for binomial family taken to be 1 )

Null deviance: 38.50 on 28 degrees of freedom
Residual deviance: 28.96 on 26 degrees of freedom
par(mfrow = c(2, 2))
plot(model.logit$finalModel)

pander(exp(coef(model.logit$finalModel)))
(Intercept) \Power Hitters`` \Contact/Speed Hitters``
0.005946 1.643 2.087

We get an interesting result in that Contact/Speed hitters have a significant impact on playoff appearance, while power hitters don’t. In fact, each Contact/Speed hitter doubles the odds that a team made the playoffs.

4 Conclusions

This contained analysis leads to interesting results, although their effectiveness in practice is less than perfect. Remember, our dimension reduction left us explaining only 64% of variation, building all further analysis on data that may not capture the true nature of the players involved. Baseball is a game, after all, played by people influenced by millions of factors, measurable or not. There is no doubt variation in the statistics that even an omniscient model cannot capture.

At the very least, we have seen small indications of potential trends that may help teams be successful in regular season play. One might expect a home-run threat player to make the biggest impact on a team (and their salary would certainly reflect that), but its possible that more consistent players who run the bases hard and often are contributing more to a season’s success. Further research could investigate this pattern to a further extent and provide more meaningful results.

All over, we were able to reduce the dimensions of a gargantuan data set into something far more approachable, while still containing a level of variation to make analyses upon.