The 10-run rule is a baseball shortcut that is simple enough to remember and useful enough to test: roughly 10 additional runs are worth one additional win. This report tests that idea using Sports Reference standings data from 2022 through 2026, a multi-season sample chosen to show how the rule holds up across different run environments.
The first part of the report focuses on MLB runs and wins. I total runs scored, convert that total into runs per MLB game, and then regress wins on runs scored to see how close the data comes to the 10-run rule. Since the sample includes an incomplete 2026 season, I put both wins and runs scored on a 162-game pace so each team-season is compared on the same scale.
The second part shifts from the 10-run rule to model choice. I explain how k-means clustering can group teams by style, then I explain why logistic regression is the better choice when the outcome is binary, such as a free throw being made or missed. The point is to connect each method to the kind of sports question it is built to answer.
Before testing the rule, I need the run total. Since this report uses the full 2022–2026 sample, I add up runs by season and then across the full sample.
| Total Runs Scored by MLB Season | ||
| All MLB teams, 2022–2026 | ||
| Season | MLB Games | Total Runs |
|---|---|---|
| 2022 | 2,430 | 20,817 |
| 2023 | 2,430 | 22,453 |
| 2024 | 2,429 | 21,294 |
| 2025 | 2,430 | 21,627 |
| 2026 | 1,238 | 11,125 |
| 2022–2026 Overall | 10,957 | 97,317 |
Across the full 2022–2026 sample, MLB teams scored 97,317 total runs. That is the run total used for the rest of the 10-run rule test.
The next step is turning the run total into a game-level scoring rate. Because every MLB game has two teams, runs per MLB game means combined scoring by both teams. I also include runs per team game because that is the cleaner number for thinking about what the average team scores.
| Runs Per MLB Game | ||||
| Combined scoring and team scoring, 2022–2026 | ||||
| Season | Total Runs | MLB Games | Runs Per MLB Game | Runs Per Team Game |
|---|---|---|---|---|
| 2022 | 20,817 | 2,430 | 8.57 | 4.28 |
| 2023 | 22,453 | 2,430 | 9.24 | 4.62 |
| 2024 | 21,294 | 2,429 | 8.77 | 4.38 |
| 2025 | 21,627 | 2,430 | 8.90 | 4.45 |
| 2026 | 11,125 | 1,238 | 8.99 | 4.49 |
| 2022–2026 Overall | 97,317 | 10,957 | 8.88 | 4.44 |
For the full 2022–2026 sample, MLB teams combined for 8.88 runs per MLB game. Since each game has two teams, that works out to 4.44 runs per team game.
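As a rough illustration, the two scoring rates can be reproduced with simple division. The report's analysis was run in R, so this Python sketch is not the code behind the table; the names `SEASONS`, `runs_per_mlb_game`, and `runs_per_team_game` are hypothetical, and the figures are copied from the table above.

```python
# Season totals copied from the table: season -> (total runs, MLB games).
SEASONS = {
    2022: (20817, 2430),
    2023: (22453, 2430),
    2024: (21294, 2429),
    2025: (21627, 2430),
    2026: (11125, 1238),  # partial season
}

# Overall row taken as reported in the table. The five season rows as printed
# sum to 97,316 runs, one fewer than this figure; the rounded rates are the
# same either way.
OVERALL_RUNS, OVERALL_GAMES = 97317, 10957


def runs_per_mlb_game(total_runs, mlb_games):
    """Combined scoring by both teams in one MLB game."""
    return total_runs / mlb_games


def runs_per_team_game(total_runs, mlb_games):
    """Each MLB game is two team games, so the denominator doubles."""
    return total_runs / (2 * mlb_games)


for season, (runs, games) in SEASONS.items():
    print(season,
          round(runs_per_mlb_game(runs, games), 2),
          round(runs_per_team_game(runs, games), 2))

print("overall",
      round(runs_per_mlb_game(OVERALL_RUNS, OVERALL_GAMES), 2),
      round(runs_per_team_game(OVERALL_RUNS, OVERALL_GAMES), 2))
```

Keeping the two functions separate makes the denominator explicit: MLB games for combined scoring, team games (twice as many) for the average team's scoring.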
The 10-run rule is a shortcut: about 10 additional runs are worth about one additional win. For the purpose of this assignment, I also apply the rule literally to an average 162-game team. An average team wins about half its games, or 81 wins. If 10 runs are attached to each win, that implies 810 runs over a full season, or 5.0 runs per team game. In actual sabermetric use, the cleaner interpretation is marginal: adding about 10 runs of value is worth about one extra win.
| Implied Scoring Level of the 10-Run Rule | ||
| What the rule suggests for an average MLB team | ||
| Measure | Value | Explanation |
|---|---|---|
| Runs per win | 10 | The rule of thumb |
| Average wins per team | 81 | Half of a 162-game season |
| Implied runs per team season | 810 | 10 runs per win multiplied by 81 wins |
| Implied runs per team game | 5.0 | Team season runs divided by 162 games |
| Implied combined runs per MLB game | 10.0 | Two teams play in every MLB game |
The direct answer is that the 10-run rule implies an average MLB team scores about 5.0 runs per team game. Since every MLB game has two teams, that equals about 10.0 combined runs per MLB game.
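The literal reading of the rule is a three-step calculation, sketched here in Python under the same assumptions the table uses (a 162-game season and an average team at 81 wins). The variable names are illustrative only.

```python
RUNS_PER_WIN = 10            # the rule of thumb
GAMES_PER_SEASON = 162
AVERAGE_WINS = GAMES_PER_SEASON / 2   # a .500 team wins 81 games

# Attach 10 runs to each of the 81 wins.
implied_season_runs = RUNS_PER_WIN * AVERAGE_WINS

# Spread those runs over the team's 162 games.
implied_runs_per_team_game = implied_season_runs / GAMES_PER_SEASON

# Two teams score in every MLB game.
implied_runs_per_mlb_game = 2 * implied_runs_per_team_game

print(implied_season_runs, implied_runs_per_team_game, implied_runs_per_mlb_game)
```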
The next step is testing whether teams that score more runs also win more games. Since the sample includes a partial 2026 season, I use 162-game pace for both wins and runs scored. I also weight the regression by games played, so full seasons carry more weight than partial seasons.
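The pace adjustment and the games-played weighting can be sketched as follows. This is a minimal Python illustration of weighted least squares, not the R code that produced the results below, and the five team-seasons in `TEAM_SEASONS` are invented purely to make the example runnable. The function names are hypothetical.

```python
def to_162_pace(value, games_played):
    """Scale a counting stat (wins or runs) to a 162-game season."""
    return value * 162 / games_played


def weighted_linear_fit(x, y, w):
    """Weighted least squares for y = intercept + slope * x.

    Minimizes sum(w_i * residual_i ** 2), so rows with larger weights
    (more games played) pull the line harder.
    """
    total_w = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / total_w
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / total_w
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# HYPOTHETICAL team-seasons (games played, wins, runs scored).
# These numbers are made up for illustration and are not from the data set.
TEAM_SEASONS = [
    (162, 95, 800),
    (162, 81, 720),
    (162, 68, 640),
    (81, 45, 390),   # partial season
    (81, 36, 340),   # partial season
]

games = [g for g, _, _ in TEAM_SEASONS]
wins_pace = [to_162_pace(wins, g) for g, wins, _ in TEAM_SEASONS]
runs_pace = [to_162_pace(runs, g) for g, _, runs in TEAM_SEASONS]

slope, intercept = weighted_linear_fit(runs_pace, wins_pace, games)
print(f"slope: {slope:.3f} wins per run, about {1 / slope:.2f} runs per win")
```

Weighting by games played keeps a half-season record, which is noisier once it is stretched to a 162-game pace, from counting as much as a full season.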
| Regression Results: Runs Scored and Wins | ||
| Wins and runs scored are both shown on a 162-game pace | ||
| Measure | Value | Meaning |
|---|---|---|
| Runs scored coefficient | 0.124 | Expected wins gained for each additional run scored over a 162-game pace |
| Implied runs per win | 8.06 | How many additional runs are associated with one additional win |
| R-squared | 0.547 | Share of win variation explained by runs scored alone |
| P-value | <0.001 | Statistical evidence that runs scored and wins are related |
The coefficient for runs scored is 0.124. In plain English, each additional run scored over a 162-game pace is associated with about 0.124 additional wins. Taking the reciprocal of the coefficient (1 / 0.124) gives an implied rate of about 8.06 runs per win, which is the number I compare to the 10-run rule.
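Written out as a worked equation, the conversion between the coefficient and the runs-per-win rate looks like this. The symbol $\hat{\beta}_{\text{runs}}$ is notation introduced here for the runs scored coefficient; it does not appear in the regression output itself.

```latex
\[
\text{runs per win} \;=\; \frac{1}{\hat{\beta}_{\text{runs}}} \;=\; \frac{1}{0.124} \;\approx\; 8.06,
\qquad
10 \ \text{runs} \times 0.124 \ \tfrac{\text{wins}}{\text{run}} \;=\; 1.24 \ \text{wins}
\]
```

The second expression is the same result read the other way: in this sample, 10 additional runs are associated with about 1.24 additional wins rather than exactly one.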
The 10-run rule gives two different things to check. First, it implies a certain scoring environment: about 5 runs per team game, or 10 combined runs per MLB game. Second, it implies a win conversion rate: about 10 extra runs for one extra win. The scoring totals test the first idea, and the regression tests the second.
| Testing the 10-Run Rule | ||||
| Actual scoring and regression results compared to the rule of thumb | ||||
| Test | Benchmark | 2022–2026 Result | Difference | Interpretation |
|---|---|---|---|---|
| Combined runs per MLB game | 10.00 | 8.88 | -1.12 | The 10-run scoring benchmark is high |
| Runs per team game | 5.00 | 4.44 | -0.56 | The 5-run team scoring benchmark is high |
| Regression-implied runs per win | 10.00 | 8.06 | -1.94 | The 10-run rule is high |
| R-squared | N/A | 0.547 | N/A | Runs explain a meaningful share of wins, but not everything |
Based on the scoring totals and the regression, I would characterize the 10-run rule as high for this 2022–2026 sample. The rule says about 10.00 runs should equal one win, while the regression implies about 8.06 runs per win.
In plain English, calling the rule high means the rule overstates the number of runs associated with one additional win. Calling it low would mean the opposite: the rule understates how many runs are associated with one additional win. In this sample, the rule is high because the regression-implied runs per win is below 10.
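The high-or-low call can be expressed as a small comparison, sketched here in Python. The function name `compare_to_benchmark` and the `CHECKS` dictionary are hypothetical; the benchmark and result values are the ones reported in the table above.

```python
def compare_to_benchmark(benchmark, result):
    """Return (difference, label), where the label describes the benchmark.

    A negative difference means the observed value sits below the benchmark,
    so the benchmark (the rule) is "high".
    """
    difference = round(result - benchmark, 2)
    if difference < 0:
        label = "high"
    elif difference > 0:
        label = "low"
    else:
        label = "matches"
    return difference, label


# (benchmark, 2022-2026 result) pairs taken from the report.
CHECKS = {
    "Combined runs per MLB game": (10.00, 8.88),
    "Runs per team game": (5.00, 4.44),
    "Regression-implied runs per win": (10.00, round(1 / 0.124, 2)),
}

for name, (benchmark, result) in CHECKS.items():
    difference, label = compare_to_benchmark(benchmark, result)
    print(f"{name}: {difference:+.2f}, so the rule is {label}")
```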
The rule is still useful as a shortcut, but this sample shows where the shortcut starts to bend.
The R-squared value is 0.547, which means runs scored alone explains about 54.7% of the variation in team wins. Run production is therefore a meaningful predictor of wins, but it is not a complete explanation.
Teams also win because of run prevention, pitching, defense, bullpen quality, sequencing, and performance in close games. Runs scored matter a lot, but offense alone does not fully explain MLB wins.
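For readers who want to see what R-squared measures, here is a minimal Python sketch of one common definition for a weighted regression with an intercept: one minus the ratio of weighted residual variation to weighted total variation. The function name and the example values are hypothetical and are not drawn from the standings data.

```python
def weighted_r_squared(actual, predicted, weights):
    """1 - (weighted residual sum of squares / weighted total sum of squares)."""
    total_w = sum(weights)
    mean_actual = sum(w * a for w, a in zip(weights, actual)) / total_w
    ss_res = sum(w * (a - p) ** 2 for w, a, p in zip(weights, actual, predicted))
    ss_tot = sum(w * (a - mean_actual) ** 2 for w, a in zip(weights, actual))
    return 1 - ss_res / ss_tot


# HYPOTHETICAL wins (162-game pace), model predictions, and games played.
actual_wins = [95.0, 81.0, 68.0, 90.0, 72.0]
predicted_wins = [90.0, 82.0, 73.0, 86.0, 77.0]
games_played = [162, 162, 162, 81, 81]

print(round(weighted_r_squared(actual_wins, predicted_wins, games_played), 3))
```

A value of 1 would mean the predictions match every team exactly, and a value of 0 would mean the model does no better than predicting the average win total for everyone.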
In k-means clustering, k is the number of groups the model creates. One common way to choose k is to test several different values and look at how much the within-cluster variation, or within-cluster sum of squares, drops each time another cluster is added. The elbow is the point where adding more clusters stops making a big difference, so it gives a reasonable starting point. The final choice still has to make sense in context, because useful clusters should describe recognizable styles of teams or players, not just split the data for the sake of splitting it.
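The elbow check can be sketched with a small, self-contained k-means routine in Python. This is an illustration only: the points in `POINTS` are invented, the farthest-point initialization is a choice made for this sketch so the result is deterministic, and in practice a library routine would do the work (for example, R's `kmeans()` reports the same quantity as `tot.withinss`).

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans_wcss(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    # Deterministic farthest-point initialization (an assumption of this sketch).
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(farthest)

    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids

    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# HYPOTHETICAL two-variable data with three visible groups.
POINTS = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]

wcss_by_k = {k: kmeans_wcss(POINTS, k) for k in range(1, 5)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 3))
```

With data like this, the within-cluster sum of squares falls sharply up to k = 3 and barely moves afterward, which is the elbow pattern described above.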
One useful k-means application would be clustering MLB teams by offensive style. Instead of only asking which teams score the most runs, this would help separate how teams create offense. I would use variables like runs per game, home runs, walk rate, strikeout rate, OBP, slugging percentage, and stolen bases. K-means would be useful because it could group teams with similar offensive profiles without forcing me to decide the groups ahead of time. The result could separate power teams, patient teams, contact teams, speed teams, and balanced teams, which would make team offensive identities easier to compare.
| Variables for an MLB Offensive Style Clustering Model | ||
| A possible k-means setup | ||
| Variable | What It Captures | Why It Matters |
|---|---|---|
| Runs per game | Overall scoring production | Shows how productive the offense is overall |
| Home runs | Power and instant run creation | Separates power-heavy teams from lower-power teams |
| Walk rate | Plate discipline and patience | Identifies teams that create offense by controlling the zone |
| Strikeout rate | Contact ability and swing-and-miss | Helps separate contact teams from swing-and-miss teams |
| On-base percentage | Ability to get runners on base | Shows whether teams consistently create traffic on base |
| Slugging percentage | Extra-base damage | Measures how much damage teams do when they hit the ball |
| Stolen bases | Speed and baserunning pressure | Captures an offensive style that does not rely only on power |
| Possible Offensive Style Clusters | ||
| The type of groups a k-means model could reveal | ||
| Possible Cluster | Offensive Identity | Likely Drivers |
|---|---|---|
| Power Offense | Creates runs through home runs and extra-base hits | Home runs, slugging percentage |
| Patient Offense | Creates runs by reaching base and extending at-bats | Walk rate, on-base percentage |
| Contact Offense | Creates runs by putting the ball in play | Low strikeout rate, contact ability |
| Speed Offense | Creates runs through stolen bases and pressure | Stolen bases, baserunning value |
| Balanced Offense | Does several things well without one extreme trait | Runs, OBP, SLG, contact, power |
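One practical detail for a setup like this: k-means groups teams by Euclidean distance, so variables on very different scales (home run counts in the hundreds, walk rates below 1) are usually standardized first so that no single variable dominates. The report does not describe this step, so treat the Python sketch below as an assumption about preprocessing. The team profiles are invented and the function name is hypothetical.

```python
from statistics import mean, pstdev


def standardize_columns(rows):
    """Convert each column to z-scores (mean 0, standard deviation 1).

    Assumes every column varies; a constant column would divide by zero.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [
        tuple((value - m) / sd for value, m, sd in zip(row, means, sds))
        for row in rows
    ]


# HYPOTHETICAL team profiles:
# (home runs, walk rate, strikeout rate, stolen bases)
TEAM_PROFILES = {
    "Team A": (240, 0.095, 0.250, 60),
    "Team B": (150, 0.080, 0.190, 140),
    "Team C": (190, 0.110, 0.220, 90),
    "Team D": (210, 0.070, 0.240, 75),
}

scaled_rows = standardize_columns(list(TEAM_PROFILES.values()))
scaled = dict(zip(TEAM_PROFILES, scaled_rows))

for team, profile in scaled.items():
    print(team, [round(z, 2) for z in profile])
```

After scaling, a team that is one standard deviation above average in stolen bases counts the same as a team that is one standard deviation above average in home runs.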
The NBA example is really just the same idea in a different sport. In basketball, k-means can group teams, players, or areas of the court based on things like shot location, shot frequency, and shooting efficiency. The point is not just to rank teams from best to worst. It is to separate different styles of play. My MLB example uses that same logic, except instead of looking at shot selection, I would be looking at how teams create runs through power, walks, contact, speed, or balance.
Logistic regression makes sense when the left-hand-side variable is a binary outcome. The outcome has two possible results: made or missed, win or loss, yes or no, success or failure. That is different from a linear regression outcome, where the left-hand-side variable is continuous and can take on many numeric values. In sports terms, linear regression fits outcomes like runs scored, wins, points per game, or salary, while logistic regression fits outcomes like whether a free throw is made. The key difference is that logistic regression puts a yes-or-no outcome on the left side of the model, while linear regression puts a numeric amount on the left side.
| Linear Regression vs. Logistic Regression | |||
| The model depends on the type of outcome variable | |||
| Model | Outcome Type | Sports Example | What It Estimates |
|---|---|---|---|
| Linear Regression | Continuous numeric outcome | Runs scored, wins, points per game, salary | Expected numeric value |
| Logistic Regression | Binary outcome | Free throw made or missed, win or loss, yes or no | Probability that the event happens |
A free throw is either made or missed, so I would use logistic regression. The made free throw can be coded as 1 and the missed free throw can be coded as 0, which is why logistic regression fits the setup. The explanatory variables can still be player, time left in the game, and score margin, but the outcome is not continuous. The model is trying to estimate the probability that the free throw is made.
| Model Choice for a Free Throw Outcome | |||
| The dependent variable determines the model | |||
| Model Choice | Fits This Problem? | Reason | Example Output |
|---|---|---|---|
| Linear Regression | No | A made or missed free throw is not a continuous numeric outcome | Predicted numeric value |
| Logistic Regression | Yes | The outcome is binary, so the model should estimate the probability of a make | Predicted probability of made free throw |
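To make the free throw setup concrete, here is a minimal Python sketch of logistic regression fitted by gradient ascent. It is a teaching illustration under loud assumptions: the ten free throws in `FREE_THROWS` are invented, the player is reduced to a single hypothetical indicator, and time left and score margin are pre-scaled to small numbers so the simple fitting loop stays stable. A statistics package (for example, `glm` with `family = binomial` in R) would normally do the fitting.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))


def predict_make_probability(weights, x):
    """weights[0] is the intercept; the rest pair up with the features in x."""
    return sigmoid(weights[0] + sum(w * xi for w, xi in zip(weights[1:], x)))


def fit_logistic(features, outcomes, learning_rate=0.1, epochs=5000):
    """Fit intercept and coefficients by gradient ascent on the log-likelihood."""
    weights = [0.0] * (len(features[0]) + 1)
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for x, y in zip(features, outcomes):
            error = y - predict_make_probability(weights, x)
            gradient[0] += error
            for j, xi in enumerate(x, start=1):
                gradient[j] += error * xi
        weights = [
            w + learning_rate * g / len(outcomes)
            for w, g in zip(weights, gradient)
        ]
    return weights


# HYPOTHETICAL free throws:
# ((is_player_a, share of game remaining, score margin / 10), made = 1 or missed = 0)
FREE_THROWS = [
    ((1, 0.9, 0.2), 1), ((1, 0.5, -0.3), 1), ((1, 0.1, 0.0), 0),
    ((1, 0.8, 0.5), 1), ((0, 0.9, 0.1), 1), ((0, 0.4, -0.2), 0),
    ((0, 0.1, -0.1), 0), ((0, 0.6, 0.4), 1), ((0, 0.2, 0.3), 0),
    ((1, 0.3, -0.4), 1),
]

X = [x for x, _ in FREE_THROWS]
y = [made for _, made in FREE_THROWS]
fitted = fit_logistic(X, y)

# Estimated probability of a make for Player A, late in a tied game.
print(round(predict_make_probability(fitted, (1, 0.05, 0.0)), 3))
```

Whatever the inputs, the sigmoid keeps the prediction strictly between 0 and 1, which is why the output can be read as a probability of a make. A linear regression on the same 0/1 outcome has no such guarantee.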
The table below is the checklist version of the report. The sections above show the calculations and explanations in more detail.
| Assignment Answer Map | ||
| Question | Topic | Answer |
|---|---|---|
| Q1 | 10-run rule implication | The 10-run rule is best understood as a marginal rule: about 10 additional runs are worth about one additional win. For the purpose of the assignment’s literal implication, an average 81-win team would be tied to about 810 runs, or 5.0 runs per team game. Since every MLB game has two teams, that also means about 10.0 combined runs per MLB game. |
| Q2 | CSV and standings data | I used Sports Reference standings data, saved it as MLB_Standings.csv, loaded it into R, cleaned the copied standings fields, and filtered the report to the 2022–2026 seasons. I used the full multi-season sample so the 10-run rule could be tested across more than one run environment. |
| Q3 | Total runs scored | Across the full 2022–2026 sample, adding runs scored for all MLB teams by season gives 97,317 total runs. That is the run total used as the base for the scoring-rate part of the 10-run rule test. |
| Q4 | Runs per MLB game | The sample averaged 8.88 combined runs per MLB game. Since each game includes two teams, that works out to 4.44 runs per team game. |
| Q5 | Wins regressed on runs scored | I regressed wins on runs scored using 162-game pace for both variables, with the regression weighted by games played because 2026 is incomplete. The coefficient for runs scored is 0.124, meaning each additional run scored over a 162-game pace is associated with about 0.124 additional wins. |
| Q6 | Accuracy of the 10-run rule | The 10-run rule is high in this sample. The rule says about 10.00 runs should equal one win, while the regression implies about 8.06 runs per win. Since the regression-implied number is below 10, calling the rule high means the rule overstates how many runs are associated with one additional win. |
| Q7 | R-squared interpretation | The R-squared is 0.547, which means runs scored alone explains about 54.7% of the variation in team wins. Based on that R-squared, run production is clearly important, but I would not treat it as fully explanatory by itself because teams also win with pitching, defense, run prevention, bullpens, sequencing, and close-game performance. |
| Q8 | Choosing k in k-means | In k-means clustering, k is the number of groups the model creates. A common way to choose k is to test several values and look for the elbow, where adding another cluster no longer reduces within-cluster variation by much. The final choice should also make sense in context, because the clusters should describe recognizable styles instead of just splitting the data for no clear reason. |
| Q9 | Potential k-means application | A possible k-means application would be clustering MLB teams by offensive style. I would use variables like runs per game, home runs, walk rate, strikeout rate, OBP, slugging percentage, and stolen bases. The value of k-means is that it could group teams with similar offensive profiles without me deciding ahead of time who is a power team, contact team, speed team, or balanced team. |
| Q10 | NBA clustering example | The NBA k-means example is the same idea in a different sport. In basketball, k-means can group teams, players, or areas of the court based on shot location, shot frequency, and shooting efficiency. My MLB example uses that same logic, except instead of looking at shot selection, it looks at how teams create runs through power, walks, contact, speed, or balance. |
| Q11 | Logistic regression outcome | Logistic regression is appropriate when the left-hand-side variable is binary. That means the outcome has two possible results, like made or missed, win or loss, yes or no, or success or failure. The key difference is that logistic regression puts a yes-or-no outcome on the left side of the model, while linear regression puts a continuous numeric amount on the left side. |
| Q12 | Free throw model choice | For the free throw example, I would use logistic regression. The dependent variable is whether the free throw is made or missed, so the outcome is binary. A made free throw can be coded as 1 and a missed free throw can be coded as 0. The model should estimate the probability of a make based on the player, time left, and score margin. |
This table is meant to be the quick-reference version of the report. The full sections above show the calculations, regression output, clustering explanation, and model choice discussion in more detail.
The 10-run rule still works as a useful baseball shortcut, but it is not a perfect rule in this 2022–2026 sample. MLB teams averaged 4.44 runs per team game, while the rule implies about 5.0 runs per team game. The regression points in the same direction: it implies about 8.06 runs per additional win, so the rule is high for this sample.
Runs scored clearly matter, which is the whole reason the 10-run rule exists in the first place. Still, the R-squared value of 0.547 shows that offense alone does not explain everything about winning. Teams also win with pitching, defense, run prevention, bullpen performance, sequencing, and close-game results.
The broader modeling lesson is that the method has to match the question. Linear regression helps test the relationship between runs and wins. K-means clustering helps group similar teams or players by style. Logistic regression is the better choice when the outcome is binary, like a free throw being made or missed.