MLB Umpire Performance Analysis 2015-2022

Introduction

Baseball is a sport deeply rooted in precision, strategy, and fair play, and the umpire is critical to maintaining the integrity of the game. Umpires are responsible for making split-second decisions and judgments that can affect the outcome of a game. While Major League Baseball (MLB) employs only the best umpires, perfection is impossible on these split-second calls. Advancements in technology have introduced instant replay into the game, reducing the impact of human error in a majority of situations. Despite this, the majority of umpire errors are made by the home plate umpire, who is responsible for calling each pitch a ball or a strike. As of now, these ball-strike calls have not been touched by these technological changes, so umpire error remains prevalent.

Because of this, this report analyzes home plate umpires’ ball-strike calls from every game in the MLB seasons from 2015 to 2022. The data originally comes from https://umpscorecards.us/ and was then cleaned, prepared, and made available on https://www.kaggle.com/datasets/mattop/mlb-baseball-umpire-scorecards-2015-2022 under the public domain. This data set relies on three key metrics to analyze umpire performance: accuracy (including expected stats), consistency, and favor. The expected and favor metrics are calculated using a fine-tuned, in-house algorithm by umpscorecards.us. The consistency metric is calculated based on the umpire’s established strike zone throughout the game.

By analyzing this data, this report aims to uncover trends in umpire performance, evaluate for potential bias, and assess the implications of human error in umpiring. This analysis hopes to offer valuable insights regarding the influence of umpires in the MLB.

Data Summary

To gain an initial understanding of the data set, we compute summary statistics for the numeric variables in the data set. The summary statistics provide insights into the central tendencies and variability of umpire performance across different seasons.

We start by removing the “NDs” (No Data) from the data set, as enough data remains for analysis without them. Once the NDs are removed, we convert the columns into the numeric class and generate descriptive statistics of the numeric variables:

# Filtering NDs and converting variables into numeric class
df <- df%>%
  filter(pitches_called != "ND") %>%
  data.frame()
df[,6:ncol(df)] <- lapply(df[,6:ncol(df)], function(x) as.numeric(gsub(",", "", x)))  # Stack overflow


# I think this method looks nicer than a simple summary() function.

# Columns 6 through the second-to-last; the same range as 7:ncol(df)-1, written explicitly
numeric_columns <- df[, 6:(ncol(df) - 1)]
numeric_summary <- data.frame(
  Min = sapply(numeric_columns, min, na.rm = TRUE),
  Mean = round(sapply(numeric_columns, mean, na.rm = TRUE),2),
  Median = sapply(numeric_columns, median, na.rm = TRUE),
  Max = sapply(numeric_columns, max, na.rm = TRUE)
)
numeric_summary

##                                 Min   Mean Median    Max
## home_team_runs                 0.00   4.56   4.00  29.00
## away_team_runs                 0.00   4.43   4.00  28.00
## pitches_called                68.00 154.56 153.00 375.00
## incorrect_calls                0.00  11.70  11.00  45.00
## expected_incorrect_calls       3.10  11.92  11.60  43.90
## correct_calls                 63.00 142.86 141.00 331.00
## expected_correct_calls        63.00 142.65 141.10 331.10
## correct_calls_above_expected -24.50   0.21   0.40  16.10
## accuracy                      78.40  92.42  92.70 100.00
## expected_accuracy             85.00  92.28  92.50  97.40
## accuracy_above_expected      -11.70   0.14   0.20   9.40
## consistency                   81.40  93.17  93.30 100.00
## favor_home                    -3.45   0.03   0.03   3.40
## total_run_impact               0.00   1.53   1.41   7.14

The mean and median are nearly equal for each variable, suggesting that the distributions are roughly symmetric around the mean. Additionally, the minimum and maximum values fall within plausible ranges for each variable, which supports the data set’s validity.
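One detail worth keeping in mind when selecting a range of columns by position: in R, the `:` operator is evaluated before binary minus, so an expression like `7:n - 1` shifts the whole sequence down by one rather than shortening it. The short sketch below illustrates this with a small hypothetical `n`; it is only meant to show the precedence rule, not to reproduce the data set.

```r
# Hypothetical column count, used only to illustrate operator precedence
n <- 10

shifted  <- 7:n - 1      # parsed as (7:n) - 1, giving 6, 7, 8, 9
explicit <- 6:(n - 1)    # the same range, written with explicit parentheses
narrower <- 7:(n - 1)    # a different range: 7, 8, 9

print(shifted)
print(explicit)
print(narrower)

# The first two select the same positions; the third drops position 6
same_range <- all(shifted == explicit)
print(same_range)
```

Writing the parentheses explicitly makes the intended range easier to read and avoids accidentally dropping or including a column.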

HOME & AWAY TEAM RUNS

Both home and away teams have similar scoring distributions, with home team runs ranging from 0 to 29 and away team runs ranging from 0 to 28. The mean and median values are nearly identical for both, suggesting that home and away teams score similarly on average per game. This indicates a general scoring balance between teams, with no significant advantage for either side in terms of total runs.

PITCHES CALLED

The number of pitches called by the home plate umpire ranged from 68 to 375, with the average game consisting of around 155 pitches. This indicates that umpires are responsible for making a significant number of ball-strike decisions each game.

INCORRECT CALLS & EXPECTED INCORRECT CALLS

On average, umpires miss about 12 calls per game, with the number of missed calls ranging widely from 0 to 45. The expected number of incorrect calls also has a mean of approximately 12, indicating that umpires generally perform in line with statistical expectations.

CORRECT CALLS & EXPECTED CORRECT CALLS

Umpires made between 63 and 331 correct calls per game, with a mean of about 143. This closely aligns with the expected number of correct calls, which has a mean of about 143 as well. This consistency suggests that umpires maintain predictable accuracy patterns across games.

CORRECT CALLS ABOVE EXPECTED

This metric ranges from -24.5 to 16.1, with an average close to zero. The fractional values arise because the platform’s models produce non-integer expected values. The metric is found by subtracting the expected correct calls from the actual number of correct calls. A negative value indicates an umpire made fewer correct calls than expected, whereas a positive value suggests above-expected performance.
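The subtraction described above can be sketched in a few lines of R. The game values below are hypothetical and chosen only for illustration; the column names follow the data set.

```r
# Hypothetical single-game values (not taken from the data set)
game <- data.frame(
  correct_calls = 143,
  expected_correct_calls = 141.1
)

# Actual minus expected: positive means the umpire beat the model's expectation
game$correct_calls_above_expected <-
  round(game$correct_calls - game$expected_correct_calls, 1)

print(game)
```

Because the expected value is a model output rather than a count, the result is generally not a whole number.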

ACCURACY, EXPECTED ACCURACY, & ACCURACY ABOVE EXPECTED

Umpire accuracy ranges from 78.4% to 100%, with an average of 92.4%, indicating consistently high accuracy across umpires. The expected accuracy follows a similar distribution, reinforcing the idea that umpires generally perform within statistically expected boundaries. Accuracy above expected is calculated by subtracting expected accuracy from actual accuracy. It ranges from -11.7 to 9.4, with a small mean of 0.14, indicating that most umpires perform very close to their expected accuracy.
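As a rough illustration of how these three figures relate, the sketch below assumes that accuracy is the percentage of called pitches judged correctly. That assumption is consistent with the summary table (mean correct calls divided by mean pitches called gives about 92.4%), but the exact definition belongs to umpscorecards.us. The per-game numbers are hypothetical.

```r
# Hypothetical single-game values (not taken from the data set)
pitches_called    <- 150
correct_calls     <- 141
expected_accuracy <- 92.5   # percent, as a model output would supply it

# Assumed definition: share of called pitches that were correct, in percent
accuracy <- 100 * correct_calls / pitches_called

# Difference between actual and expected accuracy, in percentage points
accuracy_above_expected <- accuracy - expected_accuracy

print(accuracy)
print(accuracy_above_expected)
```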

CONSISTENCY

This metric represents the proportion of taken pitches that align with the umpire’s established strike zone in a given game. It ranges from 81.4 to 100, with a mean of 93.17, indicating that most umpires maintain a high level of consistency in their ball-strike calls.

FAVOR HOME

This metric is calculated by taking the expected number of runs scored given the situation before the umpire’s call and subtracting it from the total number of runs the home team scored. It ranges from -3.45 to 3.4, with an average close to zero, suggesting no significant systemic bias toward home or away teams across all umpires.
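Because the metric is signed, a positive value can be read as favoring the home team and a negative value as favoring the away team, which is how the team-level analysis later in this report treats it. A minimal base R sketch of that split, using hypothetical values, might look like this:

```r
# Hypothetical favor_home values for three games (not taken from the data set)
favor_home <- c(0.45, -1.20, 0)

# Positive part goes to the home team, negative part (as a magnitude) to the away team
home_favor <- pmax(favor_home, 0)
away_favor <- pmax(-favor_home, 0)

# A game with zero in both columns favored neither team
no_favored <- home_favor == 0 & away_favor == 0

print(data.frame(favor_home, home_favor, away_favor, no_favored))
```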

TOTAL RUN IMPACT

This ranges from 0 to 7.14 runs, with a mean of 1.53 runs per game, meaning that umpire decisions have a measurable but moderate impact on scoring outcomes in a typical game.

Plots

Graphs and plots provide a clearer view of umpire performance, highlighting trends in accuracy, consistency, incorrect calls, and total run impact. These visualizations help identify patterns, compare umpires, and assess the influence of umpiring on game outcomes.

Total Number of Incorrect Calls by Year

Umpire performance can vary across seasons due to numerous factors, such as rule changes. This pie chart displays the total number of incorrect calls made each year, from 2015 to 2022.

incorrectcalls <- df %>%
  group_by(year) %>%
  summarise(n=sum(as.numeric(incorrect_calls)),.groups='keep') %>%
  arrange(year)


plot_ly(incorrectcalls, labels = ~year, values = ~n, sort = FALSE,
        text = ~paste("Year:", year, "<br>Incorrect Calls:", scales::comma(n)), 
        hoverinfo = "text") %>%
          add_pie(hole=0.5) %>%
          layout(title = "Incorrect Calls by Year (2015-2022)",
                 showlegend = FALSE) %>%
  layout(annotations = list(
    text = paste0("Total Number of Incorrect Calls: \n", scales::comma(sum(incorrectcalls$n))),
    showarrow = FALSE,
    font = list(size = 20, color = "black")  # Increase font size and set color
  ))

Overall, a total of 211,778 incorrect calls were recorded across all seasons. The largest shares of incorrect calls occurred in 2015 (16.7%) and 2016 (16.3%), indicating that errors were more frequent in the earlier seasons. The 2020 season appears to be an anomaly in the data. This is due to the shortened MLB season caused by the COVID-19 pandemic, which resulted in fewer games and thus fewer opportunities for incorrect calls. From 2015 to 2022, excluding the 2020 season, there was a gradual decline in the number of incorrect calls, with the lowest number being in 2022 (10.9%). This decline suggests that umpires have been performing better each year, as fewer incorrect calls are being made.

Mean Accuracy Above Expected by Year

The previous chart depicted the declining trend in the number of incorrect calls made over time, suggesting an improvement in umpire performance and accuracy. To examine this further, we look at the mean accuracy above expected for MLB umpires each year, showing how umpires’ actual accuracy compares to their expected accuracy.

# Create mean accuracy above expected df by year
yearacc <- df %>%
  select(accuracy_above_expected, year) %>%
  group_by(year) %>%
  summarise(ave_acc= mean(accuracy_above_expected)) %>%
  data.frame()
 

x_labels = c(2015,2016,2017,2018,2019,2020,2021,2022)
ggplot(yearacc, aes(x=year, y=ave_acc)) +
  geom_line(color= 'black',linewidth=2) +
  geom_point(shape=21,size=3, color = 'red', fill = 'navyblue') +
  labs(x='Year', 
       y='Mean Accuracy Above Expected',
       title='Mean Accuracy Above Expected for MLB Umpires Years 2015-2022') +
  theme_gdocs() +
  theme(plot.title=element_text(hjust=0.5)) +
  scale_x_continuous(breaks = x_labels, labels = as.character(x_labels)) +
  geom_line(aes(x=year,y=0), linewidth=2,linetype='dashed', color = 'red') + 
  geom_label_repel(aes(label = scales::percent(ave_acc/100)), box.padding=2, max.time=10,
                   position = position_nudge_keep(x = 0.5))

Overall, umpire accuracy above expected remained relatively stable from 2015 to 2022, spanning a range of only 0.439 percentage points. However, a notable drop occurred in 2019, when the average accuracy above expected fell below zero, to -0.026%. This indicates that umpires performed slightly worse than expected on average throughout that season. Before 2019, the values barely fluctuated, showing that umpires were mostly performing as expected, with slight overperformance on average. After 2019, accuracy above expected steadily increased, reaching its maximum in 2022 at 0.413%. This improvement suggests that umpires have become more accurate than statistical expectations, reinforcing the idea that umpire performance as a whole has been improving over time.

Umpires Ranked Based on Total Correct Calls Above Expected

The previous graphs have shown a steady improvement in overall umpire performance since 2015. To complement that trend, it is worth looking at individual umpire performance. This graph ranks the top 10 best and worst umpires based on their correct calls above expected. This provides insight into which umpires exceed statistical expectations and which ones fall short in terms of correct calls.

Note the top 10 are marked in green and the bottom 10 are marked in red.

#### Plot of top ten best and worst umpires based on Total Correct calls above expected ####

# Get umpire correct calls by umpire
umpire_calls <- df %>%
  group_by(umpire) %>%
  summarise(calls = sum(as.numeric(correct_calls_above_expected), na.rm=TRUE)) %>%
  arrange(desc(calls))


# Create a dataframe of just the top ten best and worst umpires
best_umpires_calls <- umpire_calls %>%
  arrange(desc(calls)) %>%
  head(10)  # Top 10 best umpires

worst_umpires_calls <- umpire_calls %>%
  arrange(desc(calls)) %>%
  tail(10)  # Bottom 10 (worst) umpires


# Merge into one dataframe
top_worst_umpires_calls <- bind_rows(best_umpires_calls,worst_umpires_calls)


ump_calls_order = rev(top_worst_umpires_calls$umpire)

ggplot(top_worst_umpires_calls, aes(x=factor(umpire,ump_calls_order), y = calls))+
  geom_bar(stat='identity',color='black', 
           fill= ifelse(top_worst_umpires_calls$calls>0,'green','red')) +
  coord_flip()+
  theme_light() + 
  labs(title = "Top 10 Best and Worst Umpires Ranked on Correct Calls Above Expected",
       x = "Umpire", y = "Correct Calls Above Expected",
       caption = "Shows only top 10 and bottom 10 of the 124 total umpires") +
  theme(plot.title = element_text(hjust=0.5),plot.subtitle = element_text(hjust=0.5)) +
  scale_y_continuous(breaks = seq(-450, 650, by=50), 
                    labels = seq(-450, 650, by=50)) +
  geom_text(aes(x=umpire,y=calls,
                label = round(calls), 
                hjust = ifelse(calls > 0, -0.1, 1.05)),size=4)

Will Little ranks as the best umpire, with 622 more correct calls than expected during the 2015-2022 seasons. At the bottom of the list is Joe West, with 419 fewer correct calls than statistically expected. This disparity of about 1,000 calls between the best and worst umpire highlights the huge variation in individual umpire performance within the MLB. The large gap between top-performing and under-performing umpires shows the importance of individual umpire evaluation and accountability in the MLB, as some umpires consistently outperform expectations, while others struggle to meet the statistical baseline level of accuracy.

The previous graphs showed that, overall, umpires perform near their statistically expected level. However, this graph shows that, on the individual level, this is not the case. Individually, some umpires seem to struggle more and some seem to excel. These struggling and excelling groups seem to counteract each other, causing overall umpire accuracy to even out. This suggests that while MLB umpiring as a whole remains relatively accurate, significant disparities exist among individual umpires.

Umpires Ranked Based on Average Consistency

The previous graph highlighted the best and worst umpires based on correct calls above expected, revealing substantial differences in overall accuracy. This graph shifts the focus to umpire consistency, measuring how reliably umpires call pitches based on their own established strike zones. Consistency is a crucial aspect of umpiring, as even an umpire with lower accuracy can still be fair if they apply their strike zone consistently throughout a game.

Note that again the top 10 are marked in green and the bottom 10 are marked in red.

# Get umpire consistency by umpire
umpire_consistency <- df %>%
  group_by(umpire) %>%
  summarise(avg_con = mean(as.numeric(consistency), na.rm=TRUE)) %>%
  arrange(desc(avg_con))


# Create a dataframe of just the top ten best and worst umpires
best_umpires <- umpire_consistency %>%
  arrange(desc(avg_con)) %>%
  head(10)  # Top 10 best umpires

worst_umpires <- umpire_consistency %>%
  arrange(desc(avg_con)) %>%
  tail(10)  # Bottom 10 (worst) umpires

# Adds empty row to include a break in the graph in between top and worst umpires
empty <- data.frame(
  umpire = "",
  avg_con = 0)

# Merge into one dataframe
top_worst_umpires <- bind_rows(best_umpires, empty, worst_umpires)


ump_order = rev(top_worst_umpires$umpire)

ggplot(top_worst_umpires, aes(x = (factor(umpire, ump_order)), y = avg_con-90))+
  geom_bar(stat = 'identity', color = 'black', 
           fill = ifelse(top_worst_umpires$avg_con>93,'green','red')) +
  coord_flip() +
  theme_gdocs() +
  labs(title = "Top 10 Best and Worst Umpires Ranked on Average Consistency",
       x = "Umpire", y = 'Average Consistency',
       caption = "Shows only top 10 and bottom 10 of the 124 total umpires") +
  theme(plot.title = element_text(hjust=0.5),plot.subtitle = element_text(hjust=0.5))+
  scale_y_continuous(limits = c(0,6), 
                     breaks = seq(0,6, by =0.5),
                     labels = seq(90,96,by=0.5)) +
  geom_text(aes(x = umpire , y = avg_con-90, 
                label = scales::percent(round(avg_con,2)/100)),
            hjust = -0.1, size = 5)

Brock Ballou leads all MLB umpires with an average consistency of 95.20%, while Marcus Pattillo ranks the lowest with an average of 91.36%. The top ten umpires have an average consistency of 93.82% or higher, and the bottom ten all fall below 92.53% average consistency. The gap between the best and worst umpire, about 4 percentage points, looks small next to the spread in correct calls above expected, but it can be quite significant: it amounts to roughly one additional inconsistent call every 25 pitches on average.
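The size of that gap can be checked directly from the two averages quoted above. The sketch below uses only those two figures; the unrounded gap is 3.84 percentage points, which works out to slightly more than 25 pitches per additional inconsistent call.

```r
# Average consistency of the best and worst umpire, as reported above (percent)
best_consistency  <- 95.20
worst_consistency <- 91.36

# Gap in percentage points
gap <- best_consistency - worst_consistency

# Number of pitches, on average, per one additional inconsistent call
pitches_per_extra_miss <- 100 / gap

print(round(gap, 2))
print(round(pitches_per_extra_miss, 1))
```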

Pat Hoberg, Jansen Visconti, and John Libka all appear in the top 10 for both correct calls above expected and average consistency. These three umpires maintain both high accuracy and high consistency, making them some of the best umpires in the MLB. Ted Barrett and Kerwin Danley appear in the bottom 10 for both correct calls above expected and consistency, making them two of the least reliable umpires in the MLB. Their low accuracy and poor consistency are not a good combination, as their strike calling is both unpredictable and below the statistically expected line.

Runs in Favor by Team by Year

While previous graphs focused on individual umpire performance and accuracy, this graph takes a team-level approach. Every call an umpire makes impacts the game, usually in favor of one team. The runs in favor metric tracks this impact for each team. By examining the total runs in favor for each team, we can see which team received the most help from the umpires. Also on the graph is the total number of wins each team had in the 2015-2022 seasons, depicted by the black line. This graph helps to determine whether teams that have the most runs in favor tend to win more games, providing potential insight into the influence of umpiring on team success.

#### Stacked Bar Plot of Runs in Favor by Team by Year ####

df$favor_home <- as.numeric(df$favor_home)

team_df <- df %>%
  select(home, away,favor_home, year) %>%
  mutate( away_favor = if_else(favor_home < 0, abs(favor_home), 0),
          home_favor = if_else(favor_home < 0, 0, favor_home)) %>%
  group_by(year) %>%
  data.frame()


nofavored <- team_df %>%
  select(home, away, year, home_favor, away_favor) %>%
  mutate(no_favored = if_else(away_favor == 0 & home_favor == 0, TRUE, FALSE))

# 131 games no team was favored, which will not be included on graph but is very interesting to know

team_favor <- nofavored %>%
  filter(!no_favored) %>%
  mutate(team_favored = ifelse(home_favor!=0,home,away),
         favorby = ifelse(home_favor!=0,home_favor,away_favor))


favor <- team_favor %>%
  select(team_favored, favorby, year) %>%
  group_by(team_favored,year) %>%
  summarise(total = sum(favorby), .groups='keep')

agg_tot <- favor %>%
  select(team_favored, total) %>%
  group_by(team_favored) %>%
  summarise(tot = sum(total), .groups='keep')

total_wins <- df%>%
  mutate(winning_team = if_else(home_team_runs > away_team_runs, home, away)) %>%
  group_by(winning_team) %>%
  summarise(total_wins = n())

ylab = seq(0,max(total_wins$total_wins)-200)
mylabels = seq(0,800,50)

ggplot(favor,aes(x=reorder(team_favored,total,sum),y=total, fill= as.factor(year)))+
  geom_bar(stat='identity',position = position_stack(reverse=TRUE)) +
  coord_flip() +
  labs(title='Runs in Favor by Team and Total Wins for Years 2015-2022',
       y='Total Runs in Favor',x= 'Team',fill='Year') +
  theme_light() +
  theme(plot.title=element_text(hjust=0.5)) +
  scale_fill_brewer(palette="Paired", guide=guide_legend(reverse = TRUE),
                    labels = as.factor(x_labels)) +
  geom_label(data=agg_tot, aes(x=team_favored,y=tot, label = round(tot), fill=NULL),
            hjust=-0.1,size=3,show.legend = FALSE) +
  # Single scale_y_continuous() call: a second call would replace the first and drop its breaks
  scale_y_continuous(breaks = seq(0, 350, 50),
                     sec.axis = sec_axis(~ . * 1.5, name = 'Number of Wins',
                                         labels = mylabels, breaks = mylabels)) +
  geom_line(inherit.aes = FALSE, data = total_wins, aes(x = winning_team, y=total_wins/1.5,group=1, colour = "Total Wins"), linewidth=1) +
  scale_color_manual(NULL, values='black') +
  geom_point(inherit.aes=FALSE, data=total_wins, aes(x=winning_team, y=total_wins/1.5,group=1),size=2,shape=21, fill = 'white',color='red')+
  theme(legend.background=element_rect(fill='transparent'),legend.box.background = element_rect(fill='transparent',color=NA))

The Los Angeles Dodgers (LAD) received the most total runs in favor and also led the league in total wins over the 2015-2022 seasons. Other teams near the top include the Toronto Blue Jays and the Chicago Cubs. This suggests that teams receiving more favorable calls may have an advantage.

The Kansas City Royals (KC) received the fewest total runs in favor with 215. Other teams near the bottom include the Cincinnati Reds and the Chicago White Sox. All these teams had lower win totals relative to other teams.

While there are some notable exceptions, such as the Houston Astros, each team’s total wins generally follow a pattern similar to total runs in favor. This correlation is not absolute, however, as numerous factors not considered here, such as overall team performance and roster strength, have a massive impact on overall team success.
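One way to put a number on this visual pattern would be a rank correlation between each team’s total runs in favor and its total wins. The sketch below is only an outline of that idea: the five teams and their values are hypothetical placeholders, not figures from the data set, and in practice the two vectors would come from the per-team totals computed for the plot.

```r
# Hypothetical per-team totals, for illustration only (not from the data set)
teams <- data.frame(
  team           = c("Team A", "Team B", "Team C", "Team D", "Team E"),
  runs_in_favor  = c(215, 240, 260, 280, 300),
  total_wins     = c(520, 560, 540, 600, 650)
)

# Spearman's rank correlation: compares orderings, so it is less sensitive
# to a few teams with unusually high or low totals
rho <- cor(teams$runs_in_favor, teams$total_wins, method = "spearman")

print(rho)
```

A value near 1 would indicate that the two rankings largely agree, but even a strong correlation here would not show that favorable calls cause wins, since better teams may simply play in more situations where calls matter.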

Conclusion

This report analyzed MLB umpire performance and its potential impact on game outcomes using data from the 2015-2022 seasons. By examining umpire accuracy, consistency, and correct calls above expected, we identified significant disparities between individual umpires, highlighting both top-performing and under-performing umpires. The data revealed that while league-wide umpire accuracy has improved over time, individual umpires still had massive inconsistencies, reinforcing the importance of umpire evaluation and accountability.

Additionally, the analysis explored team-level effects, looking at the relationship between total runs in favor and total wins. While there is some correlation between teams receiving more favorable calls and higher win totals, the data suggests that other factors, such as roster strength and overall team performance, play a more significant role in determining success. However, teams that consistently receive fewer favorable calls may face additional challenges that influence their ability to win games.

The findings in this report emphasize the growing role of technology in baseball officiating, as trends indicate a decline in incorrect calls and an increase in accuracy above expected in recent seasons. These improvements may be attributed to enhanced training, increased scrutiny, and the potential future implementation of automated strike zones. During this year’s spring training (2025), the MLB is experimenting with a system that lets teams challenge ball-strike calls to counteract these umpire errors. As the MLB continues to evolve, balancing the human element of umpiring with technological advancements will be crucial in maintaining the integrity and fairness of the game.

Ultimately, this analysis provides valuable insights into how umpires impact the game and underscores the need for continued evaluation, training, and potential rule adaptations to ensure a more consistent and accurate officiating standard in Major League Baseball.