Question 1: Estimating Expected Win Percentage

The first step is estimating each team’s expected winning percentage with the basic Pythagorean expectation formula. The idea behind the model is simple. If a team scores more runs than it allows, it should usually win more games. If it allows more runs than it scores, it should usually lose more games.

\[ Expected\ Win\% = \frac{Runs\ Scored^2}{Runs\ Scored^2 + Runs\ Allowed^2} \]
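As a rough illustration, the formula can be written as a small Python function. The function and variable names are my own placeholders, not part of the assignment, and the inputs are the Dodgers' row from the table that follows.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# Dodgers: 38-21 (59 games) with 314 runs scored and 185 allowed.
wins, games = 38, 59
expected_pct = pythagorean_win_pct(314, 185)
expected_wins = expected_pct * games  # expected wins over the games played so far
win_diff = wins - expected_wins       # negative: fewer wins than the model expects

print(f"{expected_pct:.1%}  {expected_wins:.1f}  {win_diff:+.1f}")
```

Run on that row, this should line up with the 74.2% expected winning percentage, 43.8 expected wins, and −5.8 win difference shown in the table.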

Team W L G RS RA Actual Win% Expected Win% Expected Wins Win% Diff Win Diff
Los Angeles Dodgers 38 21 59 314 185 64.4% 74.2% 43.8 −9.8% −5.8
Atlanta Braves 40 20 60 316 207 66.7% 70.0% 42.0 −3.3% −2.0
New York Yankees 36 23 59 305 207 61.0% 68.5% 40.4 −7.4% −4.4
Milwaukee Brewers 35 21 56 268 194 62.5% 65.6% 36.7 −3.1% −1.7
Seattle Mariners 31 29 60 255 225 51.7% 56.2% 33.7 −4.6% −2.7
Pittsburgh Pirates 32 28 60 301 270 53.3% 55.4% 33.2 −2.1% −1.2
Tampa Bay Rays 36 20 56 262 243 64.3% 53.8% 30.1 10.5% 5.9
Chicago Cubs 32 28 60 284 265 53.3% 53.5% 32.1 −0.1% −0.1
Arizona Diamondbacks 31 27 58 264 256 53.4% 51.5% 29.9 1.9% 1.1
Texas Rangers 28 31 59 237 230 47.5% 51.5% 30.4 −4.0% −2.4
Chicago White Sox 32 27 59 275 267 54.2% 51.5% 30.4 2.8% 1.6
Cleveland Guardians 34 27 61 249 248 55.7% 50.2% 30.6 5.5% 3.4
Washington Nationals 31 29 60 324 323 51.7% 50.2% 30.1 1.5% 0.9
Boston Red Sox 25 33 58 231 236 43.1% 48.9% 28.4 −5.8% −3.4
San Diego Padres 32 26 58 227 233 55.2% 48.7% 28.2 6.5% 3.8
Toronto Blue Jays 29 31 60 244 251 48.3% 48.6% 29.2 −0.3% −0.2
St. Louis Cardinals 31 26 57 247 257 54.4% 48.0% 27.4 6.4% 3.6
New York Mets 26 33 59 239 252 44.1% 47.4% 27.9 −3.3% −1.9
Minnesota Twins 27 33 60 275 296 45.0% 46.3% 27.8 −1.3% −0.8
Philadelphia Phillies 30 29 59 230 256 50.8% 44.7% 26.4 6.2% 3.6
Miami Marlins 26 34 60 250 279 43.3% 44.5% 26.7 −1.2% −0.7
Houston Astros 27 34 61 271 304 44.3% 44.3% 27.0 −0.0% 0.0
Cincinnati Reds 30 28 58 256 288 51.7% 44.1% 25.6 7.6% 4.4
Athletics 28 31 59 252 286 47.5% 43.7% 25.8 3.8% 2.2
Baltimore Orioles 28 32 60 275 313 46.7% 43.6% 26.1 3.1% 1.9
Detroit Tigers 22 38 60 223 262 36.7% 42.0% 25.2 −5.3% −3.2
Los Angeles Angels 23 37 60 255 306 38.3% 41.0% 24.6 −2.7% −1.6
San Francisco Giants 23 36 59 232 280 39.0% 40.7% 24.0 −1.7% −1.0
Kansas City Royals 22 37 59 221 280 37.3% 38.4% 22.6 −1.1% −0.6
Colorado Rockies 22 38 60 251 334 36.7% 36.1% 21.7 0.6% 0.3

This table gives a better view of the standings than wins and losses alone. The actual record tells us what happened, while expected wins tell us whether that record matched the team’s run profile. A positive win difference means the team won more games than the model expected. A negative win difference means the team won fewer games than the model expected, so its run differential was better than its actual record.

Also, as a Cubs fan who has watched a lot of their games this year, I think it’s interesting that their actual wins are almost exactly in line with their expected wins. They have been a very streaky team, with two separate 10-game winning streaks along with a 10-game losing streak, yet they are still right where the model expects them to be.

Question 2: Calculating Absolute Deviation

Next, I calculated the absolute deviation for each team. This measures how far the model was from each team’s actual winning percentage.

\[ Absolute\ Deviation = |Win\% - Expected\ Win\%| \]

The important part here is that the direction does not matter. A team can win more games than expected or fewer games than expected, but absolute deviation only measures the size of the gap.
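A minimal sketch of that calculation for three teams from the Question 1 table is below. The helper name and the dictionary layout are assumptions made for illustration only.

```python
def pythagorean_win_pct(rs, ra, exponent=2.0):
    return rs**exponent / (rs**exponent + ra**exponent)


# (team, wins, losses, runs scored, runs allowed) from the Question 1 table
teams = [
    ("Tampa Bay Rays", 36, 20, 262, 243),
    ("Los Angeles Dodgers", 38, 21, 314, 185),
    ("Chicago Cubs", 32, 28, 284, 265),
]

abs_dev = {}
for name, wins, losses, rs, ra in teams:
    actual = wins / (wins + losses)
    expected = pythagorean_win_pct(rs, ra)
    abs_dev[name] = abs(actual - expected)  # the sign of the miss is discarded

for name, dev in abs_dev.items():
    print(f"{name}: {dev:.1%}")
```

The Rays (who beat the model) and the Dodgers (who trail it) both come out as large positive numbers, which is the point of taking the absolute value.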

Team Actual Win% Expected Win% Abs. Dev.
Tampa Bay Rays 64.3% 53.8% 10.5%
Los Angeles Dodgers 64.4% 74.2% 9.8%
Cincinnati Reds 51.7% 44.1% 7.6%
New York Yankees 61.0% 68.5% 7.4%
San Diego Padres 55.2% 48.7% 6.5%
St. Louis Cardinals 54.4% 48.0% 6.4%
Philadelphia Phillies 50.8% 44.7% 6.2%
Boston Red Sox 43.1% 48.9% 5.8%
Cleveland Guardians 55.7% 50.2% 5.5%
Detroit Tigers 36.7% 42.0% 5.3%
Seattle Mariners 51.7% 56.2% 4.6%
Texas Rangers 47.5% 51.5% 4.0%
Athletics 47.5% 43.7% 3.8%
Atlanta Braves 66.7% 70.0% 3.3%
New York Mets 44.1% 47.4% 3.3%
Milwaukee Brewers 62.5% 65.6% 3.1%
Baltimore Orioles 46.7% 43.6% 3.1%
Chicago White Sox 54.2% 51.5% 2.8%
Los Angeles Angels 38.3% 41.0% 2.7%
Pittsburgh Pirates 53.3% 55.4% 2.1%
Arizona Diamondbacks 53.4% 51.5% 1.9%
San Francisco Giants 39.0% 40.7% 1.7%
Washington Nationals 51.7% 50.2% 1.5%
Minnesota Twins 45.0% 46.3% 1.3%
Miami Marlins 43.3% 44.5% 1.2%
Kansas City Royals 37.3% 38.4% 1.1%
Colorado Rockies 36.7% 36.1% 0.6%
Toronto Blue Jays 48.3% 48.6% 0.3%
Chicago Cubs 53.3% 53.5% 0.1%
Houston Astros 44.3% 44.3% 0.0%

The teams at the top of this table are the biggest misses for the basic model. That does not automatically mean they were lucky or unlucky, but it does mean their record did not line up as cleanly with their run differential. Those are usually the teams worth looking at more closely because the gap could come from a number of other things that determine wins and losses in baseball. That’s why this model is a good start, but it’s still relatively basic in the grand scheme of things.

Question 3: Mean Absolute Deviation

Mean absolute deviation takes the team-level misses from Question 2 and turns them into one overall model-fit number.

The exponent 2 model has a mean absolute deviation of 3.78%. In plain terms, the model was off by about 3.8 percentage points in winning percentage for the average team. That is a pretty close fit considering the model only uses runs scored and runs allowed. It is not trying to account for every detail of a season, but it gets surprisingly close with just the basic relationship between run production and run prevention.
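Written out in the same notation as the earlier formulas, the calculation looks roughly like this. The symbol N (the number of teams, 30 here) and the index i are my own labels rather than anything from the assignment.

\[ Mean\ Absolute\ Deviation = \frac{1}{N}\sum_{i=1}^{N} |Win\%_i - Expected\ Win\%_i| \]

As a quick check using the rounded values in the Question 2 table, the 30 absolute deviations add up to about 113.5 percentage points, and 113.5 / 30 ≈ 3.78.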

Question 4: Estimating Expected Win Percentage Using the 1.87 Exponent

The second version of the model uses an exponent of 1.87 instead of 2. The structure of the formula is the same, but the smaller exponent slightly reduces the effect of larger run differentials.

\[ Expected\ Win\% = \frac{Runs\ Scored^{1.87}}{Runs\ Scored^{1.87} + Runs\ Allowed^{1.87}} \]

Team W L G RS RA Actual Win% Expected Win% Expected Wins Win% Diff Win Diff
Los Angeles Dodgers 38 21 59 314 185 64.4% 72.9% 43.0 −8.5% −5.0
Atlanta Braves 40 20 60 316 207 66.7% 68.8% 41.3 −2.1% −1.3
New York Yankees 36 23 59 305 207 61.0% 67.4% 39.7 −6.3% −3.7
Milwaukee Brewers 35 21 56 268 194 62.5% 64.7% 36.2 −2.2% −1.2
Seattle Mariners 31 29 60 255 225 51.7% 55.8% 33.5 −4.2% −2.5
Pittsburgh Pirates 32 28 60 301 270 53.3% 55.1% 33.0 −1.7% −1.0
Tampa Bay Rays 36 20 56 262 243 64.3% 53.5% 30.0 10.8% 6.0
Chicago Cubs 32 28 60 284 265 53.3% 53.2% 31.9 0.1% 0.1
Arizona Diamondbacks 31 27 58 264 256 53.4% 51.4% 29.8 2.0% 1.2
Texas Rangers 28 31 59 237 230 47.5% 51.4% 30.3 −3.9% −2.3
Chicago White Sox 32 27 59 275 267 54.2% 51.4% 30.3 2.9% 1.7
Cleveland Guardians 34 27 61 249 248 55.7% 50.2% 30.6 5.5% 3.4
Washington Nationals 31 29 60 324 323 51.7% 50.1% 30.1 1.5% 0.9
Boston Red Sox 25 33 58 231 236 43.1% 49.0% 28.4 −5.9% −3.4
San Diego Padres 32 26 58 227 233 55.2% 48.8% 28.3 6.4% 3.7
Toronto Blue Jays 29 31 60 244 251 48.3% 48.7% 29.2 −0.3% −0.2
St. Louis Cardinals 31 26 57 247 257 54.4% 48.1% 27.4 6.2% 3.6
New York Mets 26 33 59 239 252 44.1% 47.5% 28.0 −3.5% −2.0
Minnesota Twins 27 33 60 275 296 45.0% 46.6% 27.9 −1.6% −0.9
Philadelphia Phillies 30 29 59 230 256 50.8% 45.0% 26.6 5.8% 3.4
Miami Marlins 26 34 60 250 279 43.3% 44.9% 26.9 −1.6% −0.9
Houston Astros 27 34 61 271 304 44.3% 44.6% 27.2 −0.4% −0.2
Cincinnati Reds 30 28 58 256 288 51.7% 44.5% 25.8 7.2% 4.2
Athletics 28 31 59 252 286 47.5% 44.1% 26.0 3.3% 2.0
Baltimore Orioles 28 32 60 275 313 46.7% 44.0% 26.4 2.7% 1.6
Detroit Tigers 22 38 60 223 262 36.7% 42.5% 25.5 −5.9% −3.5
Los Angeles Angels 23 37 60 255 306 38.3% 41.6% 24.9 −3.2% −1.9
San Francisco Giants 23 36 59 232 280 39.0% 41.3% 24.4 −2.3% −1.4
Kansas City Royals 22 37 59 221 280 37.3% 39.1% 23.1 −1.8% −1.1
Colorado Rockies 22 38 60 251 334 36.7% 37.0% 22.2 −0.3% −0.2

This table is asking the same basic question as Question 1, but with a slightly different assumption about the exponent. The 1.87 model still rewards teams for outscoring opponents, but it is a little less aggressive than squaring runs scored and runs allowed.
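To see what "less aggressive" means in practice, here is a small sketch comparing the two exponents for one lopsided team and one nearly even team from the table. The function name and the `results` dictionary are illustrative assumptions.

```python
def pythagorean_win_pct(rs, ra, exponent):
    return rs**exponent / (rs**exponent + ra**exponent)


# Runs scored and runs allowed from the table above
examples = {
    "Los Angeles Dodgers": (314, 185),   # large run differential
    "Washington Nationals": (324, 323),  # nearly even
}

results = {}
for team, (rs, ra) in examples.items():
    p2 = pythagorean_win_pct(rs, ra, 2.0)
    p187 = pythagorean_win_pct(rs, ra, 1.87)
    results[team] = (p2, p187)
    print(f"{team}: {p2:.1%} with exponent 2, {p187:.1%} with exponent 1.87")
```

The smaller exponent pulls every team toward .500, but the pull is much larger for the Dodgers than for the Nationals, whose runs scored and allowed are almost equal.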

Question 5: Absolute Deviation for the 1.87 Model

\[ Absolute\ Deviation = |Win\% - Expected\ Win\%| \]

This gives the same type of error measurement as before, but now it is based on the 1.87 exponent model.

Team Actual Win% Expected Win% Abs. Dev.
Tampa Bay Rays 64.3% 53.5% 10.8%
Los Angeles Dodgers 64.4% 72.9% 8.5%
Cincinnati Reds 51.7% 44.5% 7.2%
San Diego Padres 55.2% 48.8% 6.4%
New York Yankees 61.0% 67.4% 6.3%
St. Louis Cardinals 54.4% 48.1% 6.2%
Boston Red Sox 43.1% 49.0% 5.9%
Detroit Tigers 36.7% 42.5% 5.9%
Philadelphia Phillies 50.8% 45.0% 5.8%
Cleveland Guardians 55.7% 50.2% 5.5%
Seattle Mariners 51.7% 55.8% 4.2%
Texas Rangers 47.5% 51.4% 3.9%
New York Mets 44.1% 47.5% 3.5%
Athletics 47.5% 44.1% 3.3%
Los Angeles Angels 38.3% 41.6% 3.2%
Chicago White Sox 54.2% 51.4% 2.9%
Baltimore Orioles 46.7% 44.0% 2.7%
San Francisco Giants 39.0% 41.3% 2.3%
Milwaukee Brewers 62.5% 64.7% 2.2%
Atlanta Braves 66.7% 68.8% 2.1%
Arizona Diamondbacks 53.4% 51.4% 2.0%
Kansas City Royals 37.3% 39.1% 1.8%
Pittsburgh Pirates 53.3% 55.1% 1.7%
Minnesota Twins 45.0% 46.6% 1.6%
Miami Marlins 43.3% 44.9% 1.6%
Washington Nationals 51.7% 50.1% 1.5%
Houston Astros 44.3% 44.6% 0.4%
Toronto Blue Jays 48.3% 48.7% 0.3%
Colorado Rockies 36.7% 37.0% 0.3%
Chicago Cubs 53.3% 53.2% 0.1%

Question 6: Mean Absolute Deviation for the 1.87 Model

The adjusted model has a mean absolute deviation of 3.67%, compared to 3.78% for the exponent 2 model. So the 1.87 model is a little closer on average. The difference is small, about 0.11 percentage points, but it is a like-for-like comparison because both models use the same inputs and differ only in the exponent. In terms of average model fit, the 1.87 exponent does a slightly better job matching actual winning percentage in this dataset.
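The comparison can be reproduced from the W, L, RS, and RA columns alone. The sketch below is one way to do it in plain Python; the names `TEAMS` and `mean_abs_dev` are assumptions for illustration, and the rows are listed without team names in the same order as the Question 1 table.

```python
# (W, L, RS, RA) for all 30 teams, in the row order of the Question 1 table
TEAMS = [
    (38, 21, 314, 185), (40, 20, 316, 207), (36, 23, 305, 207),
    (35, 21, 268, 194), (31, 29, 255, 225), (32, 28, 301, 270),
    (36, 20, 262, 243), (32, 28, 284, 265), (31, 27, 264, 256),
    (28, 31, 237, 230), (32, 27, 275, 267), (34, 27, 249, 248),
    (31, 29, 324, 323), (25, 33, 231, 236), (32, 26, 227, 233),
    (29, 31, 244, 251), (31, 26, 247, 257), (26, 33, 239, 252),
    (27, 33, 275, 296), (30, 29, 230, 256), (26, 34, 250, 279),
    (27, 34, 271, 304), (30, 28, 256, 288), (28, 31, 252, 286),
    (28, 32, 275, 313), (22, 38, 223, 262), (23, 37, 255, 306),
    (23, 36, 232, 280), (22, 37, 221, 280), (22, 38, 251, 334),
]


def pythagorean_win_pct(rs, ra, exponent):
    return rs**exponent / (rs**exponent + ra**exponent)


def mean_abs_dev(teams, exponent):
    """Average |actual - expected| winning percentage, in percentage points."""
    devs = [
        abs(wins / (wins + losses) - pythagorean_win_pct(rs, ra, exponent))
        for wins, losses, rs, ra in teams
    ]
    return 100 * sum(devs) / len(devs)


mad_2 = mean_abs_dev(TEAMS, 2.0)
mad_187 = mean_abs_dev(TEAMS, 1.87)
print(f"exponent 2: {mad_2:.2f}   exponent 1.87: {mad_187:.2f}")
```

If the data are entered correctly, the two printed values should land close to the 3.78 and 3.67 reported above.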

[Figure: Actual Win Percentage vs Expected Win Percentage]

Question 7: Creating the Transformed Wins Variable

\[ transformed\_wins = \ln\left(\frac{1}{Win\%} - 1\right) \]

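This transformation puts winning percentage onto a log-odds scale: because 1/Win% − 1 equals losses divided by wins, the value is the natural log of a team’s loss-to-win ratio. That matters because the next part of the assignment estimates the exponent through regression instead of just assuming the exponent should be 2 or 1.87.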

Team Actual Win% Transformed Wins
Atlanta Braves 66.7% −0.693
Los Angeles Dodgers 64.4% −0.593
Tampa Bay Rays 64.3% −0.588
Milwaukee Brewers 62.5% −0.511
New York Yankees 61.0% −0.448
Cleveland Guardians 55.7% −0.231
San Diego Padres 55.2% −0.208
St. Louis Cardinals 54.4% −0.176
Chicago White Sox 54.2% −0.170
Arizona Diamondbacks 53.4% −0.138
Pittsburgh Pirates 53.3% −0.134
Chicago Cubs 53.3% −0.134
Cincinnati Reds 51.7% −0.069
Seattle Mariners 51.7% −0.067
Washington Nationals 51.7% −0.067
Philadelphia Phillies 50.8% −0.034
Toronto Blue Jays 48.3% 0.067
Athletics 47.5% 0.102
Texas Rangers 47.5% 0.102
Baltimore Orioles 46.7% 0.134
Minnesota Twins 45.0% 0.201
Houston Astros 44.3% 0.231
New York Mets 44.1% 0.238
Miami Marlins 43.3% 0.268
Boston Red Sox 43.1% 0.278
San Francisco Giants 39.0% 0.448
Los Angeles Angels 38.3% 0.475
Kansas City Royals 37.3% 0.520
Detroit Tigers 36.7% 0.547
Colorado Rockies 36.7% 0.547

The transformed values look a little strange at first because stronger teams have lower values. That is just because of how the formula is written. A team above .500 always ends up with a negative transformed value, a team below .500 always ends up positive, and a team at exactly .500 gets zero. The main point is not that the number feels intuitive on its own, but that it creates the right setup for the regression.
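A short sketch of the transformation, with a function name of my own choosing, shows the sign pattern for a team above .500, a team below .500, and a hypothetical team at exactly .500.

```python
import math


def transformed_wins(wins, losses):
    """ln(1/Win% - 1), which is the same as ln(losses / wins)."""
    win_pct = wins / (wins + losses)
    return math.log(1 / win_pct - 1)


braves = transformed_wins(40, 20)   # above .500
rockies = transformed_wins(22, 38)  # below .500
even = transformed_wins(30, 30)     # hypothetical .500 team, not in the table

print(round(braves, 3), round(rockies, 3), round(even, 3))
```

The Braves and Rockies values should agree with the −0.693 and 0.547 in the table above.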

Question 8: Creating the Transformed Run Ratio Variable

The transformed run ratio uses runs allowed divided by runs scored, then takes the natural log.

\[ transformed\_run\_ratio = \ln\left(\frac{Runs\ Allowed}{Runs\ Scored}\right) \]

This gives the run-based variable that lines up with the transformed wins variable.

Team RS RA Transformed Wins Transformed Run Ratio
Los Angeles Dodgers 314 185 −0.593 −0.529
Atlanta Braves 316 207 −0.693 −0.423
New York Yankees 305 207 −0.448 −0.388
Milwaukee Brewers 268 194 −0.511 −0.323
Seattle Mariners 255 225 −0.067 −0.125
Pittsburgh Pirates 301 270 −0.134 −0.109
Tampa Bay Rays 262 243 −0.588 −0.075
Chicago Cubs 284 265 −0.134 −0.069
Arizona Diamondbacks 264 256 −0.138 −0.031
Texas Rangers 237 230 0.102 −0.030
Chicago White Sox 275 267 −0.170 −0.030
Cleveland Guardians 249 248 −0.231 −0.004
Washington Nationals 324 323 −0.067 −0.003
Boston Red Sox 231 236 0.278 0.021
San Diego Padres 227 233 −0.208 0.026
Toronto Blue Jays 244 251 0.067 0.028
St. Louis Cardinals 247 257 −0.176 0.040
New York Mets 239 252 0.238 0.053
Minnesota Twins 275 296 0.201 0.074
Philadelphia Phillies 230 256 −0.034 0.107
Miami Marlins 250 279 0.268 0.110
Houston Astros 271 304 0.231 0.115
Cincinnati Reds 256 288 −0.069 0.118
Athletics 252 286 0.102 0.127
Baltimore Orioles 275 313 0.134 0.129
Detroit Tigers 223 262 0.547 0.161
Los Angeles Angels 255 306 0.475 0.182
San Francisco Giants 232 280 0.448 0.188
Kansas City Royals 221 280 0.520 0.237
Colorado Rockies 251 334 0.547 0.286

This variable is basically a log version of run prevention compared to run production. If a team allows fewer runs than it scores, the value is negative. If it allows more runs than it scores, the value is positive. That makes it a compact way to measure the same idea behind run differential.
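For completeness, here is the same idea as a short Python sketch, using the first and last rows of the table. The function name is a placeholder.

```python
import math


def transformed_run_ratio(runs_scored, runs_allowed):
    """Natural log of runs allowed divided by runs scored."""
    return math.log(runs_allowed / runs_scored)


dodgers = transformed_run_ratio(314, 185)  # allows far fewer runs than it scores
rockies = transformed_run_ratio(251, 334)  # allows far more runs than it scores

print(round(dodgers, 3), round(rockies, 3))
```

Because the ratio is runs allowed over runs scored, good teams land on the negative side, which matches the sign convention of the transformed wins variable.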

Question 9: Estimating the Best Fit Exponent

The exponent is estimated directly from the data by regressing transformed wins on transformed run ratio calculated above.

\[ transformed\_wins = \beta_0 + \beta_1(transformed\_run\_ratio) \]

This works because the Pythagorean formula can be rearranged into a log relationship between winning percentage and the ratio of runs allowed to runs scored. Once both sides are transformed, the slope of the regression becomes the estimated exponent. So instead of choosing 2 or 1.87 ahead of time, this approach lets the data estimate the exponent that best fits these teams.
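The rearrangement can be sketched in a few steps. Here k stands for the unknown exponent, and RS and RA are shorthand for runs scored and runs allowed; those symbols are my own labels. Start from the general formula and divide the top and bottom by RS^k:

\[ Win\% = \frac{RS^{k}}{RS^{k} + RA^{k}} = \frac{1}{1 + \left(\frac{RA}{RS}\right)^{k}} \]

Inverting both sides and subtracting 1 isolates the run ratio:

\[ \frac{1}{Win\%} - 1 = \left(\frac{RA}{RS}\right)^{k} \]

Taking the natural log of both sides brings the exponent down as a multiplier:

\[ \ln\left(\frac{1}{Win\%} - 1\right) = k \cdot \ln\left(\frac{RA}{RS}\right) \]

The left side is transformed_wins and the log term on the right is transformed_run_ratio, so if the formula held exactly the regression would have a slope of k and an intercept of zero.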

Term Estimate Std. Error t-stat p-value
(Intercept) 0.004 0.033 0.108 0.915
transformed_run_ratio 1.540 0.176 8.743 <0.001

The slope on transformed_run_ratio is the key number, because it is the estimated exponent. It measures how strongly the run ratio translates into winning percentage. A higher exponent would mean the run ratio has a stronger effect on expected winning percentage. A lower exponent means the model is a little softer and does not push teams as far toward the extremes. The intercept of 0.004 is not statistically different from zero (p = 0.915), which fits the Pythagorean formula, since a team with an even run ratio should be a .500 team. For this dataset, the regression gives a data-driven way to compare against the assumed exponents from the earlier questions.
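The coefficients can be checked without any statistics package by using the closed-form formulas for a one-variable least-squares fit. This is only a sketch under that assumption, not necessarily how the table above was produced, and `TEAMS` and the other names are illustrative.

```python
import math

# (W, L, RS, RA) for all 30 teams, in the row order of the Question 1 table
TEAMS = [
    (38, 21, 314, 185), (40, 20, 316, 207), (36, 23, 305, 207),
    (35, 21, 268, 194), (31, 29, 255, 225), (32, 28, 301, 270),
    (36, 20, 262, 243), (32, 28, 284, 265), (31, 27, 264, 256),
    (28, 31, 237, 230), (32, 27, 275, 267), (34, 27, 249, 248),
    (31, 29, 324, 323), (25, 33, 231, 236), (32, 26, 227, 233),
    (29, 31, 244, 251), (31, 26, 247, 257), (26, 33, 239, 252),
    (27, 33, 275, 296), (30, 29, 230, 256), (26, 34, 250, 279),
    (27, 34, 271, 304), (30, 28, 256, 288), (28, 31, 252, 286),
    (28, 32, 275, 313), (22, 38, 223, 262), (23, 37, 255, 306),
    (23, 36, 232, 280), (22, 37, 221, 280), (22, 38, 251, 334),
]

# ln(1/Win% - 1) simplifies to ln(losses / wins)
y = [math.log(losses / wins) for wins, losses, rs, ra in TEAMS]  # transformed wins
x = [math.log(ra / rs) for wins, losses, rs, ra in TEAMS]        # transformed run ratio

n = len(TEAMS)
x_bar, y_bar = sum(x) / n, sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

slope = s_xy / s_xx                      # estimated exponent
intercept = y_bar - slope * x_bar
r_squared = s_xy ** 2 / (s_xx * s_yy)    # squared correlation for a one-variable fit

print(f"exponent={slope:.3f}  intercept={intercept:.3f}  R^2={r_squared:.3f}")
```

With the data entered as above, the slope and intercept should come out close to the 1.540 and 0.004 in the regression table.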

[Figure: Best Fit Exponent Regression]

The positive relationship in the chart makes sense: as a team allows more runs relative to what it scores, its transformed wins value rises, which corresponds to a lower winning percentage. The regression line summarizes that relationship with one estimated exponent.

Question 10: Reporting the Exponent and R-Squared

Estimated Exponent R-Squared
1.540 0.732

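The estimated exponent is 1.540, and the R-squared is 0.732. The estimated exponent is lower than both 2 and 1.87, which means this particular dataset prefers a softer exponent than either of the two formulas tested earlier. With a standard error of 0.176 and only about 60 games per team, the estimate is not very precise: it sits about 1.9 standard errors below 1.87 and about 2.6 standard errors below 2.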

The R-squared means the transformed run ratio explains about 73.2% of the variation in transformed winning percentage. That is a fairly strong result for a relatively simple model. Runs scored and runs allowed explain a lot of team performance, which is why the Pythagorean expectation works as well as it does. At the same time, it still leaves room for the parts of baseball that do not fit perfectly into one basic formula.