In this project, I used MLB pitch-by-pitch data from the 2025 season to examine whether a team’s on-base performance against a specific opposing starting pitcher can predict how many total runs that team scores in a game. The data were collected using the baseballr package, which pulls publicly available information from the MLB Stats API. Only regular-season games were included, so All-Star Game data were intentionally excluded. I chose this topic because runs are the most important outcome in baseball, and understanding how much a pitcher’s ability to limit baserunners impacts scoring can help measure pitcher effectiveness beyond traditional statistics.
After obtaining the pitch-level data, I reduced each plate appearance to only its final pitch in order to avoid double-counting outcomes such as balls in play, strikeouts, or walks. From these last pitches, I reconstructed the final score of each game by identifying the maximum recorded home and away run totals. These game-level totals were then merged into a team–pitcher match-up data set, allowing me to create a response variable (runs_scored) that records the batting team’s final run total in the game started by the opposing pitcher. Because it is a full-game total, it includes any runs scored after the starter left the game. If the batting team was the home team, runs_scored reflects the final home score; if the batting team was away, it reflects the away score. This provides a single, team-specific measure of offensive performance for each pitching match-up.
After preprocessing, the final data set contained 4,604 pitcher-team match-ups. The analysis focused on two explanatory variables: a team’s on-base percentage against the opposing pitcher (OBP_vs_pitcher) and whether the batting team played at home or away (home_away). Plate appearances (PA) were used only to filter out unreliable match-ups: a match-up was kept only if the team had at least six plate appearances against the pitcher.
On-base percentage (OBP_vs_pitcher) is a continuous variable with many unique values, so a summary table built on raw OBP would need one row per distinct value and would not show how run scoring changes across levels of on-base performance. By grouping OBP into meaningful tiers, we can summarize how scoring differs for low, average, and high on-base match-ups. This binning makes the relationship between OBP and runs easier to interpret and allows us to visually identify a clear trend.
| OBP_bin | n | mean_runs | sd_runs |
|---|---|---|---|
| ≤ 0.250 | 1322 | 3.498 | 2.828 |
| 0.251–0.300 | 512 | 4.168 | 2.920 |
| 0.301–0.350 | 729 | 4.953 | 3.079 |
| > 0.350 | 2041 | 7.007 | 3.736 |
Table 1 shows a strong positive relationship between on-base performance against a pitcher and the number of runs scored. Match-ups where teams had an OBP of .250 or lower resulted in an average of only 3.50 runs, while match-ups with OBP greater than .350 led to an average of 7.01 runs—approximately double the scoring output. The steady increase across the tiers suggests that as teams reach base more often against a pitcher, they tend to generate substantially more runs. This supports using OBP_vs_pitcher as the primary predictor in the supervised learning models.
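The tiering behind Table 1 can be sketched as follows. This is an illustrative Python version, not the R code used for the analysis: the function names and the toy rows are made up, and it assumes each bin includes its upper edge, as the table labels suggest.

```python
from statistics import mean, stdev

def obp_bin(obp):
    """Assign an OBP value to one of the four tiers used in Table 1."""
    if obp <= 0.250:
        return "<= 0.250"
    if obp <= 0.300:
        return "0.251-0.300"
    if obp <= 0.350:
        return "0.301-0.350"
    return "> 0.350"

def summarize_by_bin(matchups):
    """matchups: iterable of (obp, runs) pairs -> {bin: (n, mean_runs, sd_runs)}."""
    groups = {}
    for obp, runs in matchups:
        groups.setdefault(obp_bin(obp), []).append(runs)
    summary = {}
    for label, runs in groups.items():
        # The sample standard deviation needs at least two observations.
        sd = stdev(runs) if len(runs) > 1 else float("nan")
        summary[label] = (len(runs), mean(runs), sd)
    return summary

# Made-up rows for illustration only (not the 2025 data).
toy = [(0.200, 2), (0.250, 4), (0.280, 3), (0.333, 5), (0.400, 8), (0.500, 6)]
summary = summarize_by_bin(toy)
for label, (n, mean_runs, sd_runs) in summary.items():
    print(label, n, mean_runs, sd_runs)
```

Fixed edges keep the tiers comparable across data sets; quantile-based edges would balance the group sizes instead, at the cost of less memorable cut points.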
Now, let’s look at Table 2:
| home_away | n | mean_runs | sd_runs | mean_OBP | sd_OBP | mean_PA | sd_PA |
|---|---|---|---|---|---|---|---|
| Away | 2346 | 5.485 | 3.750 | 0.349 | 0.159 | 15.685 | 7.866 |
| Home | 2258 | 5.228 | 3.536 | 0.354 | 0.156 | 15.736 | 7.836 |
Table 2 shows only small differences between home and away scoring, with away teams averaging 5.49 runs and home teams averaging 5.23. OBP values are nearly identical, suggesting weak location effects at the match-up level. Because these differences are present but modest, home_away is included as a contextual predictor in the second model to test whether game location adds predictive value beyond OBP.
In addition to summary tables, graphical exploration helps visualize the distribution of runs scored and the relationship between OBP and offensive performance.
The distribution of runs scored is right-skewed, with most match-ups resulting in fewer than 7 runs but a long tail of higher-scoring outcomes. Since the data consist of non-negative integer counts with noticeable skew, a Poisson regression model is appropriate for supervised learning.
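One quick way to probe the Poisson choice is to compare variance with mean, since a Poisson count has the two equal. The sketch below (Python, for illustration; not part of the original R analysis) plugs in the group summaries from Table 2. It is only a marginal check: the model's assumption concerns the variance after conditioning on the predictors, so a ratio above 1 here is a prompt to inspect residual dispersion, not evidence against the model on its own.

```python
# Group summaries copied from Table 2: (n, mean_runs, sd_runs).
groups = {"Away": (2346, 5.485, 3.750), "Home": (2258, 5.228, 3.536)}

ratios = {}
for label, (n, mean_runs, sd_runs) in groups.items():
    # Variance divided by mean; this equals 1 for an exact Poisson count.
    ratios[label] = sd_runs ** 2 / mean_runs
    print(f"{label}: variance/mean = {ratios[label]:.2f}")
```

Both ratios come out above 2. If the fitted model's residuals show the same pattern, a quasi-Poisson or negative binomial model is a common alternative that leaves the coefficient interpretation unchanged.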
Now, let’s take a look at a scatterplot of OBP_vs_pitcher against runs scored.
The scatterplot shows a clear upward trend, where higher OBP values generally correspond to more runs scored. The fitted curve reinforces the positive relationship observed in the summary tables.
Before fitting predictive models, several adjustments were made to ensure that the data accurately reflected team performance against starting pitchers. First, only the final pitch of each plate appearance was retained. This step prevents multiple pitches from being counted as separate events and avoids overstating outcomes such as balls in play, strikeouts, or walks. By using one record per plate appearance, batting results contribute equally regardless of how many pitches were thrown during the plate appearance.
Run totals were not available directly at the match-up level, so they were recovered from the full game data and merged back into the dataset. This provided a clean outcome variable representing how many runs a batting team scored against a specific starting pitcher across the entire game.
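The two steps above (keeping the last pitch of each plate appearance, then recovering final scores as running maxima) can be sketched in a few lines. This is an illustrative Python version with hypothetical field names (`game`, `pa`, `pitch_no`, `home_score`, `away_score`); the columns returned by baseballr are named differently, and the original work was done in R.

```python
# Hypothetical pitch rows for one game; scores are running totals.
pitches = [
    {"game": 1, "pa": 1, "pitch_no": 1, "home_score": 0, "away_score": 0},
    {"game": 1, "pa": 1, "pitch_no": 2, "home_score": 0, "away_score": 0},
    {"game": 1, "pa": 1, "pitch_no": 3, "home_score": 0, "away_score": 0},
    {"game": 1, "pa": 2, "pitch_no": 1, "home_score": 0, "away_score": 0},
    {"game": 1, "pa": 2, "pitch_no": 2, "home_score": 0, "away_score": 1},
    {"game": 1, "pa": 3, "pitch_no": 1, "home_score": 2, "away_score": 1},
]

def last_pitch_per_pa(rows):
    """Keep only the highest-numbered pitch of each (game, pa) pair."""
    last = {}
    for row in rows:
        key = (row["game"], row["pa"])
        if key not in last or row["pitch_no"] > last[key]["pitch_no"]:
            last[key] = row
    return list(last.values())

def final_scores(rows):
    """Final score per game, taken as the maximum running totals observed."""
    scores = {}
    for row in rows:
        home, away = scores.get(row["game"], (0, 0))
        scores[row["game"]] = (max(home, row["home_score"]),
                               max(away, row["away_score"]))
    return scores

pa_rows = last_pitch_per_pa(pitches)
print(len(pa_rows), final_scores(pa_rows))
```

Taking the maximum works because run totals never decrease within a game, so the largest recorded value is the final one as long as the last scoring play is present in the data.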
To reduce noise from extremely small samples, match-ups were filtered to include only those with at least six plate appearances (PA ≥ 6). Very short outings—such as injury removals, bullpen openers, or inning-ending substitutions—can create unstable on-base percentages near 0 or 1. Applying a minimum PA threshold promotes more reliable measures of a team’s on-base performance against a pitcher.
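A minimal sketch of the match-up OBP and the PA filter, in Python for illustration. The source does not spell out how OBP_vs_pitcher was computed, so this assumes the simplified form "times on base divided by plate appearances" (the official OBP formula uses a slightly different denominator). The event labels and field names are hypothetical, and the rows are assumed to be already restricted to starting pitchers.

```python
# Hypothetical labels for plate-appearance outcomes that put the batter on base.
ON_BASE = {"single", "double", "triple", "home_run", "walk", "hit_by_pitch"}
MIN_PA = 6  # threshold stated in the text

def obp_by_matchup(pa_rows, min_pa=MIN_PA):
    """One row per plate appearance in, one OBP per (game, team, pitcher) out."""
    counts = {}
    for row in pa_rows:
        key = (row["game"], row["batting_team"], row["pitcher"])
        pa, on_base = counts.get(key, (0, 0))
        counts[key] = (pa + 1, on_base + (row["event"] in ON_BASE))
    # Drop match-ups below the PA threshold, then compute OBP for the rest.
    return {
        key: {"PA": pa, "OBP_vs_pitcher": on_base / pa}
        for key, (pa, on_base) in counts.items()
        if pa >= min_pa
    }

events_a = ["single", "strikeout", "walk", "field_out", "strikeout", "field_out"]
events_b = ["walk", "single", "home_run"]  # only 3 PA, so this match-up is dropped
pa_rows = (
    [{"game": "g1", "batting_team": "TEAM_A", "pitcher": "p1", "event": e} for e in events_a]
    + [{"game": "g1", "batting_team": "TEAM_B", "pitcher": "p2", "event": e} for e in events_b]
)
matchups = obp_by_matchup(pa_rows)
print(matchups)
```

The dropped match-up shows why the filter matters: three plate appearances with three times on base would otherwise enter the data as an OBP of 1.000.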
Finally, a home/away indicator was engineered using inning information to capture game-location context. Although the exploratory analysis suggested that location differences are small, the variable was retained to evaluate whether it provides additional predictive value beyond on-base performance alone.
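The mapping from inning information to location rests on one rule: the visiting team bats in the top half of each inning and the home team in the bottom half. A small Python sketch, assuming the half-inning is stored as the string "top" or "bottom" (a hypothetical encoding; the actual column may differ):

```python
def home_away(half_inning):
    """Map the half-inning in which a team bats to its game location."""
    half = half_inning.strip().lower()
    if half == "top":
        return "Away"   # visitors bat first in every inning
    if half == "bottom":
        return "Home"
    raise ValueError(f"unexpected half-inning label: {half_inning!r}")

print(home_away("top"), home_away("bottom"))
```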
To complement the supervised models, I used unsupervised learning to identify natural groupings of pitchers based on how opposing teams performed against them. Rather than using PCA, which is most helpful for high-dimensional data, clustering was selected because the goal was to categorize pitchers based on performance rather than reduce the number of variables. Since the dataset contains only a few meaningful features—on-base percentage allowed and runs scored—PCA would not provide additional insight or dimensionality reduction.
I applied K-means clustering using two performance variables: the average OBP_vs_pitcher and average runs_scored against each pitcher (restricted to match-ups with PA ≥ 6 to avoid unstable samples). These metrics together reflect how effectively pitchers limit baserunners and run production, allowing pitchers to be grouped into tiers based on how difficult they are to hit against.
The K-means clustering identifies three general tiers of pitcher performance based on the average OBP and runs they allow. The green cluster represents pitchers who are difficult to score against: they allow low on-base percentages and consistently keep run totals low. The red cluster reflects more average pitchers who allow moderate OBP and produce mid-range run outcomes. The blue cluster contains pitchers who tend to allow high on-base percentages and high scoring, suggesting greater vulnerability.
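For readers who want to see the mechanics, here is a small standard-library Python sketch of K-means with three clusters on two features. It is not the original R code. Two choices in it are assumptions, since the source does not say how they were handled: the features are standardized first so that OBP and runs contribute on the same scale, and the starting centers are picked by a deterministic farthest-point rule instead of at random. The pitcher averages are made up.

```python
from statistics import mean, pstdev

def standardize(points):
    """Rescale each feature to mean 0 and standard deviation 1."""
    cols = list(zip(*points))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against a constant column
    return [tuple((v - m) / s for v, m, s in zip(p, mus, sds)) for p in points]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Start from the first point, then repeatedly add the point farthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans(points, k=3, n_iter=100):
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(mean(c) for c in zip(*members))
    return labels, centers

# Made-up per-pitcher averages: (mean OBP allowed, mean runs allowed).
pitchers = [(0.26, 3.1), (0.27, 3.4), (0.25, 3.0),
            (0.33, 5.0), (0.34, 5.2), (0.32, 4.9),
            (0.41, 7.4), (0.42, 7.8), (0.40, 7.1)]
labels, centers = kmeans(standardize(pitchers), k=3)
print(labels)
```

Cluster numbers are arbitrary, so tiers should be named by inspecting each cluster's average OBP and runs, as the color-coded description above does.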
To predict how many runs a batting team scores against a starting pitcher, I fit two Poisson regression models using runs_scored as the response variable. Poisson regression is appropriate because run totals are count data, take non-negative integer values, and display right-skewed variability, as shown in the EDA histogram.
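Written out, and assuming the default log link, the two models fitted below share one form. The symbols \(\mu_i\), \(\beta_0\), \(\beta_1\), and \(\beta_2\) are notation introduced here, and \(\text{Home}_i\) is assumed to equal 1 when the batting team is at home and 0 otherwise:

\[
\text{runs\_scored}_i \sim \operatorname{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \beta_1\,\text{OBP\_vs\_pitcher}_i + \beta_2\,\text{Home}_i
\]

The first model sets \(\beta_2 = 0\). Because the model is linear on the log scale, coefficients act multiplicatively: raising OBP by 0.050 multiplies the expected run total by \(e^{0.05\,\beta_1}\), and \(e^{\beta_2}\) is the ratio of expected runs for home versus away teams at the same OBP.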
The first model evaluates whether a team’s on-base percentage against the opposing pitcher explains run scoring on its own:
runs_scored ~ OBP_vs_pitcher
The estimated coefficient for OBP_vs_pitcher was positive and statistically significant, indicating that higher OBP against a pitcher is associated with more runs scored. Ten-fold cross-validation produced an RMSE of 3.2453 and an \(R^2\) of 0.2106, meaning that on-base success against the pitcher explains roughly 21% of the variation in scoring, with a typical prediction error of a little over three runs.
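To make the fitting step concrete, here is a standard-library Python sketch of a one-predictor Poisson regression with a log link, solved by Newton-Raphson (which coincides with the iteratively reweighted least squares used by typical GLM software). It is an illustration with made-up data, not the original R fit, and the function name `fit_poisson` is hypothetical.

```python
import math

def fit_poisson(x, y, n_iter=50, tol=1e-10):
    """Fit log(mu) = b0 + b1 * x by Newton-Raphson on the Poisson log-likelihood."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only fit
    for _ in range(n_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the log-likelihood.
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information matrix [[i00, i01], [i01, i11]].
        i00 = sum(mu)
        i01 = sum(mi * xi for xi, mi in zip(x, mu))
        i11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: information^{-1} @ score, using the 2x2 inverse.
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Made-up match-ups for illustration only.
obp = [0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50]
runs = [2, 3, 3, 4, 5, 6, 8, 9]
b0, b1 = fit_poisson(obp, runs)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
print(f"expected runs at OBP .300: {math.exp(b0 + b1 * 0.300):.2f}")
```

A positive `b1` corresponds to the positive OBP coefficient reported above; predictions are obtained by exponentiating the linear predictor.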
Because EDA suggested small but noticeable differences between home and away run scoring, a second model added a location indicator:
runs_scored ~ OBP_vs_pitcher + home_away
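The home_away coefficient was positive, which I read as home teams scoring slightly more runs even when controlling for OBP. This reading depends on which level is treated as the reference category, and it should be checked against Table 2, where away teams have the higher raw average (5.49 vs. 5.23 runs) at a nearly identical OBP. The improvement in cross-validated performance was modest (RMSE = 3.2421, \(R^2\) = 0.2141; the RMSE fell by about 0.003 runs), suggesting that game location adds some contextual value, although its contribution is small compared to the effect of on-base performance.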
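The cross-validated RMSE and \(R^2\) quoted for both models come from a ten-fold procedure, which can be sketched as below. This is a Python illustration with several assumptions: \(R^2\) is computed as one minus the ratio of residual to total sum of squares within each fold (some tools report the squared correlation instead), fold scores are averaged, and a least-squares line stands in for the Poisson model so the block stays short. All names and data are hypothetical.

```python
import math
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def r_squared(obs, pred):
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - sse / sst

def cross_validate(x, y, fit, predict, k=10, seed=0):
    """Average held-out RMSE and R^2 over k folds."""
    scores = []
    for held_out in k_fold_indices(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        model = fit([x[i] for i in train], [y[i] for i in train])
        pred = [predict(model, x[i]) for i in held_out]
        obs = [y[i] for i in held_out]
        scores.append((rmse(obs, pred), r_squared(obs, pred)))
    return (sum(s[0] for s in scores) / k, sum(s[1] for s in scores) / k)

# Stand-in model: an ordinary least-squares line (not the Poisson fit).
def fit_line(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope

def predict_line(model, xi):
    intercept, slope = model
    return intercept + slope * xi

# Made-up data: a linear trend with a small deterministic wiggle.
toy_x = [0.10 + 0.01 * i for i in range(40)]
toy_y = [1 + 12 * v + (i % 3 - 1) for i, v in enumerate(toy_x)]
cv_rmse, cv_r2 = cross_validate(toy_x, toy_y, fit_line, predict_line, k=10)
print(f"CV RMSE = {cv_rmse:.3f}, CV R^2 = {cv_r2:.3f}")
```

Passing `fit` and `predict` as arguments means the same loop can score both models; only the model functions change.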
Both models confirm the central finding: on-base performance against a pitcher is the most influential predictor of run scoring. Game location adds a minor improvement, but increasing OBP remains the dominant driver of offensive output.
This analysis examined whether a team’s on-base performance against an opposing starting pitcher predicts how many runs that team scores in a game. Using pitcher-specific on-base percentage as the main predictor and run totals as the outcome, Poisson regression identified a clear positive relationship: teams that reached base more frequently against a pitcher produced substantially higher run totals. Home-field context added a small positive effect, but most of the predictive power came from on-base performance itself. This highlights how pitcher effectiveness can be evaluated not only by ERA or strikeouts, but also by how well a pitcher limits baserunners against a particular lineup.
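Even with meaningful patterns, the models left most of the variation unexplained: a cross-validated \(R^2\) near 0.21 and an RMSE of about 3.24 runs mean that roughly four-fifths of the game-to-game variation in scoring is not captured by OBP and game location, which reflects the unpredictable nature of baseball.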
Future work could expand the model by incorporating additional context, such as bullpen usage, exit velocity, starting pitcher handedness, or batter–pitcher handedness match-ups. I also hope to pool multiple seasons to obtain a larger sample and then re-train the model.