Econometrics II Exam 1

Question 1: Estimating Expected Win Percentage

The first step is estimating each team’s expected winning percentage with the basic Pythagorean expectation formula. The idea behind the model is simple. If a team scores more runs than it allows, it should usually win more games. If it allows more runs than it scores, it should usually lose more games.

\[ Expected\ Win\% = \frac{Runs\ Scored^2}{Runs\ Scored^2 + Runs\ Allowed^2} \]

Team	W	L	G	RS	RA	Actual Win%	Expected Win%	Expected Wins	Win% Diff	Win Diff
Los Angeles Dodgers	38	21	59	314	185	64.4%	74.2%	43.8	−9.8%	−5.8
Atlanta Braves	40	20	60	316	207	66.7%	70.0%	42.0	−3.3%	−2.0
New York Yankees	36	23	59	305	207	61.0%	68.5%	40.4	−7.4%	−4.4
Milwaukee Brewers	35	21	56	268	194	62.5%	65.6%	36.7	−3.1%	−1.7
Seattle Mariners	31	29	60	255	225	51.7%	56.2%	33.7	−4.6%	−2.7
Pittsburgh Pirates	32	28	60	301	270	53.3%	55.4%	33.2	−2.1%	−1.2
Tampa Bay Rays	36	20	56	262	243	64.3%	53.8%	30.1	10.5%	5.9
Chicago Cubs	32	28	60	284	265	53.3%	53.5%	32.1	−0.1%	−0.1
Arizona Diamondbacks	31	27	58	264	256	53.4%	51.5%	29.9	1.9%	1.1
Texas Rangers	28	31	59	237	230	47.5%	51.5%	30.4	−4.0%	−2.4
Chicago White Sox	32	27	59	275	267	54.2%	51.5%	30.4	2.8%	1.6
Cleveland Guardians	34	27	61	249	248	55.7%	50.2%	30.6	5.5%	3.4
Washington Nationals	31	29	60	324	323	51.7%	50.2%	30.1	1.5%	0.9
Boston Red Sox	25	33	58	231	236	43.1%	48.9%	28.4	−5.8%	−3.4
San Diego Padres	32	26	58	227	233	55.2%	48.7%	28.2	6.5%	3.8
Toronto Blue Jays	29	31	60	244	251	48.3%	48.6%	29.2	−0.3%	−0.2
St. Louis Cardinals	31	26	57	247	257	54.4%	48.0%	27.4	6.4%	3.6
New York Mets	26	33	59	239	252	44.1%	47.4%	27.9	−3.3%	−1.9
Minnesota Twins	27	33	60	275	296	45.0%	46.3%	27.8	−1.3%	−0.8
Philadelphia Phillies	30	29	59	230	256	50.8%	44.7%	26.4	6.2%	3.6
Miami Marlins	26	34	60	250	279	43.3%	44.5%	26.7	−1.2%	−0.7
Houston Astros	27	34	61	271	304	44.3%	44.3%	27.0	0.0%	0.0
Cincinnati Reds	30	28	58	256	288	51.7%	44.1%	25.6	7.6%	4.4
Athletics	28	31	59	252	286	47.5%	43.7%	25.8	3.8%	2.2
Baltimore Orioles	28	32	60	275	313	46.7%	43.6%	26.1	3.1%	1.9
Detroit Tigers	22	38	60	223	262	36.7%	42.0%	25.2	−5.3%	−3.2
Los Angeles Angels	23	37	60	255	306	38.3%	41.0%	24.6	−2.7%	−1.6
San Francisco Giants	23	36	59	232	280	39.0%	40.7%	24.0	−1.7%	−1.0
Kansas City Royals	22	37	59	221	280	37.3%	38.4%	22.6	−1.1%	−0.6
Colorado Rockies	22	38	60	251	334	36.7%	36.1%	21.7	0.6%	0.3

This table gives a better view of the standings than wins and losses alone. The actual record tells us what happened, but expected wins tell us whether that record matched the team’s run profile. A positive win difference means the team won more games than the model expected. A negative win difference means the team won fewer games than its run differential implies.
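The columns in the table can be reproduced with a few lines of Python. This is only a minimal sketch using three rows copied from the table; the function and variable names are my own for illustration, and it is not necessarily how the table was originally built.

```python
def pythagorean_win_pct(runs_scored, runs_allowed, exponent=2.0):
    """Expected winning percentage from runs scored and runs allowed."""
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return rs / (rs + ra)


# (team, wins, losses, runs scored, runs allowed), copied from the table above
teams = [
    ("Los Angeles Dodgers", 38, 21, 314, 185),
    ("Tampa Bay Rays", 36, 20, 262, 243),
    ("Chicago Cubs", 32, 28, 284, 265),
]

results = {}
for name, wins, losses, rs, ra in teams:
    games = wins + losses
    actual = wins / games
    expected = pythagorean_win_pct(rs, ra)
    expected_wins = expected * games
    # Positive: the team won more games than the model expects.
    win_diff = wins - expected_wins
    results[name] = (actual, expected, expected_wins, win_diff)
    print(f"{name}: actual {actual:.1%}, expected {expected:.1%}, "
          f"expected wins {expected_wins:.1f}, win diff {win_diff:+.1f}")
```

Win difference here is actual wins minus expected wins, which matches the sign convention in the table.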

Also, as a Cubs fan who has watched a lot of their games this year, I think it’s interesting that their actual wins are almost exactly in line with their expected wins. They have been a very streaky team, with two separate 10-game winning streaks and a 10-game losing streak, yet they are still right where the model expects them to be.

Question 2: Calculating Absolute Deviation

Next, I calculated the absolute deviation for each team. This measures how far the model was from each team’s actual winning percentage.

\[ Absolute\ Deviation = |Win\% - Expected\ Win\%| \]

The important part here is that the direction does not matter. A team can win more games than expected or fewer games than expected, but absolute deviation only measures the size of the gap.
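As a quick illustration of why the sign drops out, here is a small sketch comparing one team that beat the model (Rays) with one that fell short of it (Dodgers), using the figures from the Question 1 table. The helper names are hypothetical and chosen only for this example.

```python
def pythagorean_win_pct(rs, ra, exponent=2.0):
    """Expected winning percentage from runs scored (rs) and allowed (ra)."""
    return rs ** exponent / (rs ** exponent + ra ** exponent)


def absolute_deviation(actual_win_pct, expected_win_pct):
    """Size of the miss, ignoring whether the team over- or underperformed."""
    return abs(actual_win_pct - expected_win_pct)


# Rays: 36-20 with 262 runs scored and 243 allowed (won more than expected)
rays_dev = absolute_deviation(36 / 56, pythagorean_win_pct(262, 243))

# Dodgers: 38-21 with 314 runs scored and 185 allowed (won fewer than expected)
dodgers_dev = absolute_deviation(38 / 59, pythagorean_win_pct(314, 185))

print(f"Rays: {rays_dev:.1%}  Dodgers: {dodgers_dev:.1%}")
```

Both values come out positive even though the two teams missed in opposite directions.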

Team	Actual Win%	Expected Win%	Abs. Dev.
Tampa Bay Rays	64.3%	53.8%	10.5%
Los Angeles Dodgers	64.4%	74.2%	9.8%
Cincinnati Reds	51.7%	44.1%	7.6%
New York Yankees	61.0%	68.5%	7.4%
San Diego Padres	55.2%	48.7%	6.5%
St. Louis Cardinals	54.4%	48.0%	6.4%
Philadelphia Phillies	50.8%	44.7%	6.2%
Boston Red Sox	43.1%	48.9%	5.8%
Cleveland Guardians	55.7%	50.2%	5.5%
Detroit Tigers	36.7%	42.0%	5.3%
Seattle Mariners	51.7%	56.2%	4.6%
Texas Rangers	47.5%	51.5%	4.0%
Athletics	47.5%	43.7%	3.8%
Atlanta Braves	66.7%	70.0%	3.3%
New York Mets	44.1%	47.4%	3.3%
Milwaukee Brewers	62.5%	65.6%	3.1%
Baltimore Orioles	46.7%	43.6%	3.1%
Chicago White Sox	54.2%	51.5%	2.8%
Los Angeles Angels	38.3%	41.0%	2.7%
Pittsburgh Pirates	53.3%	55.4%	2.1%
Arizona Diamondbacks	53.4%	51.5%	1.9%
San Francisco Giants	39.0%	40.7%	1.7%
Washington Nationals	51.7%	50.2%	1.5%
Minnesota Twins	45.0%	46.3%	1.3%
Miami Marlins	43.3%	44.5%	1.2%
Kansas City Royals	37.3%	38.4%	1.1%
Colorado Rockies	36.7%	36.1%	0.6%
Toronto Blue Jays	48.3%	48.6%	0.3%
Chicago Cubs	53.3%	53.5%	0.1%
Houston Astros	44.3%	44.3%	0.0%

The teams at the top of this table are the biggest misses for the basic model. That does not automatically mean they were lucky or unlucky, but it does mean their record did not line up as cleanly with their run differential. Those are usually the teams worth looking at more closely because the gap could come from a number of other things that determine wins and losses in baseball. That’s why this is a good start, but it’s still relatively basic in the grand scheme of things.

Question 3: Mean Absolute Deviation

Mean absolute deviation takes the team-level misses from Question 2 and turns them into one overall model-fit number.
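Written out as a formula, with n as the number of teams and i indexing each team (this notation is added here for illustration and is not from the assignment):

\[ MAD = \frac{1}{n}\sum_{i=1}^{n} |Win\%_i - Expected\ Win\%_i| \]

With this dataset, n = 30, so the result is just the average of the 30 absolute deviations from Question 2.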

The exponent 2 model has a mean absolute deviation of 3.78%. In plain terms, the model was off by about 3.8 percentage points in winning percentage for the average team. That is a pretty close fit considering the model only uses runs scored and runs allowed. It is not trying to account for every detail of a season, but it gets surprisingly close with just the basic relationship between run production and run prevention.

Question 4: Estimating Expected Win Percentage Using the 1.87 Exponent

The second version of the model uses an exponent of 1.87 instead of 2. The structure of the formula is the same, but the smaller exponent slightly reduces the effect of larger run differentials.

\[ Expected\ Win\% = \frac{Runs\ Scored^{1.87}}{Runs\ Scored^{1.87} + Runs\ Allowed^{1.87}} \]

Team	W	L	G	RS	RA	Actual Win%	Expected Win%	Expected Wins	Win% Diff	Win Diff
Los Angeles Dodgers	38	21	59	314	185	64.4%	72.9%	43.0	−8.5%	−5.0
Atlanta Braves	40	20	60	316	207	66.7%	68.8%	41.3	−2.1%	−1.3
New York Yankees	36	23	59	305	207	61.0%	67.4%	39.7	−6.3%	−3.7
Milwaukee Brewers	35	21	56	268	194	62.5%	64.7%	36.2	−2.2%	−1.2
Seattle Mariners	31	29	60	255	225	51.7%	55.8%	33.5	−4.2%	−2.5
Pittsburgh Pirates	32	28	60	301	270	53.3%	55.1%	33.0	−1.7%	−1.0
Tampa Bay Rays	36	20	56	262	243	64.3%	53.5%	30.0	10.8%	6.0
Chicago Cubs	32	28	60	284	265	53.3%	53.2%	31.9	0.1%	0.1
Arizona Diamondbacks	31	27	58	264	256	53.4%	51.4%	29.8	2.0%	1.2
Texas Rangers	28	31	59	237	230	47.5%	51.4%	30.3	−3.9%	−2.3
Chicago White Sox	32	27	59	275	267	54.2%	51.4%	30.3	2.9%	1.7
Cleveland Guardians	34	27	61	249	248	55.7%	50.2%	30.6	5.5%	3.4
Washington Nationals	31	29	60	324	323	51.7%	50.1%	30.1	1.5%	0.9
Boston Red Sox	25	33	58	231	236	43.1%	49.0%	28.4	−5.9%	−3.4
San Diego Padres	32	26	58	227	233	55.2%	48.8%	28.3	6.4%	3.7
Toronto Blue Jays	29	31	60	244	251	48.3%	48.7%	29.2	−0.3%	−0.2
St. Louis Cardinals	31	26	57	247	257	54.4%	48.1%	27.4	6.2%	3.6
New York Mets	26	33	59	239	252	44.1%	47.5%	28.0	−3.5%	−2.0
Minnesota Twins	27	33	60	275	296	45.0%	46.6%	27.9	−1.6%	−0.9
Philadelphia Phillies	30	29	59	230	256	50.8%	45.0%	26.6	5.8%	3.4
Miami Marlins	26	34	60	250	279	43.3%	44.9%	26.9	−1.6%	−0.9
Houston Astros	27	34	61	271	304	44.3%	44.6%	27.2	−0.4%	−0.2
Cincinnati Reds	30	28	58	256	288	51.7%	44.5%	25.8	7.2%	4.2
Athletics	28	31	59	252	286	47.5%	44.1%	26.0	3.3%	2.0
Baltimore Orioles	28	32	60	275	313	46.7%	44.0%	26.4	2.7%	1.6
Detroit Tigers	22	38	60	223	262	36.7%	42.5%	25.5	−5.9%	−3.5
Los Angeles Angels	23	37	60	255	306	38.3%	41.6%	24.9	−3.2%	−1.9
San Francisco Giants	23	36	59	232	280	39.0%	41.3%	24.4	−2.3%	−1.4
Kansas City Royals	22	37	59	221	280	37.3%	39.1%	23.1	−1.8%	−1.1
Colorado Rockies	22	38	60	251	334	36.7%	37.0%	22.2	−0.3%	−0.2

This table is asking the same basic question as Question 1, but with a slightly different assumption about the exponent. The 1.87 model still rewards teams for outscoring opponents, but it is a little less aggressive than squaring runs scored and runs allowed.
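To see what "less aggressive" means in practice, the sketch below evaluates both exponents for the team with the best run ratio in the table (Dodgers) and the team with the worst (Rockies). The names are illustrative only. The smaller exponent should pull both estimates slightly toward .500.

```python
def pythagorean_win_pct(rs, ra, exponent):
    """Expected winning percentage for a given exponent."""
    return rs ** exponent / (rs ** exponent + ra ** exponent)


# (team, runs scored, runs allowed), copied from the table above
examples = [
    ("Los Angeles Dodgers", 314, 185),
    ("Colorado Rockies", 251, 334),
]

shifts = {}
for name, rs, ra in examples:
    at_2 = pythagorean_win_pct(rs, ra, 2.0)
    at_187 = pythagorean_win_pct(rs, ra, 1.87)
    shifts[name] = (at_2, at_187)
    print(f"{name}: exponent 2 -> {at_2:.1%}, exponent 1.87 -> {at_187:.1%}")
```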

Question 5: Absolute Deviation for the 1.87 Model

\[ Absolute\ Deviation = |Win\% - Expected\ Win\%| \]

This gives the same type of error measurement as before, but now it is based on the 1.87 exponent model.

Team	Actual Win%	Expected Win%	Abs. Dev.
Tampa Bay Rays	64.3%	53.5%	10.8%
Los Angeles Dodgers	64.4%	72.9%	8.5%
Cincinnati Reds	51.7%	44.5%	7.2%
San Diego Padres	55.2%	48.8%	6.4%
New York Yankees	61.0%	67.4%	6.3%
St. Louis Cardinals	54.4%	48.1%	6.2%
Boston Red Sox	43.1%	49.0%	5.9%
Detroit Tigers	36.7%	42.5%	5.9%
Philadelphia Phillies	50.8%	45.0%	5.8%
Cleveland Guardians	55.7%	50.2%	5.5%
Seattle Mariners	51.7%	55.8%	4.2%
Texas Rangers	47.5%	51.4%	3.9%
New York Mets	44.1%	47.5%	3.5%
Athletics	47.5%	44.1%	3.3%
Los Angeles Angels	38.3%	41.6%	3.2%
Chicago White Sox	54.2%	51.4%	2.9%
Baltimore Orioles	46.7%	44.0%	2.7%
San Francisco Giants	39.0%	41.3%	2.3%
Milwaukee Brewers	62.5%	64.7%	2.2%
Atlanta Braves	66.7%	68.8%	2.1%
Arizona Diamondbacks	53.4%	51.4%	2.0%
Kansas City Royals	37.3%	39.1%	1.8%
Pittsburgh Pirates	53.3%	55.1%	1.7%
Minnesota Twins	45.0%	46.6%	1.6%
Miami Marlins	43.3%	44.9%	1.6%
Washington Nationals	51.7%	50.1%	1.5%
Houston Astros	44.3%	44.6%	0.4%
Toronto Blue Jays	48.3%	48.7%	0.3%
Colorado Rockies	36.7%	37.0%	0.3%
Chicago Cubs	53.3%	53.2%	0.1%

Question 6: Mean Absolute Deviation for the 1.87 Model

The adjusted model has a mean absolute deviation of 3.67%, compared to 3.78% for the exponent 2 model. So the 1.87 model is a little closer on average. The difference is not massive, but it does matter because both models are trying to do the same thing with the same inputs. In terms of average model fit, the 1.87 exponent does a slightly better job matching actual winning percentage in this dataset.
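The comparison can be checked by running both exponents over all 30 teams. The sketch below is one way to do that in plain Python, with the data copied from the Question 1 table. The names `TEAMS` and `mean_absolute_deviation` are hypothetical, and this is not necessarily the tool used for the numbers above.

```python
# (team, wins, losses, runs scored, runs allowed), from the Question 1 table
TEAMS = [
    ("Los Angeles Dodgers", 38, 21, 314, 185),
    ("Atlanta Braves", 40, 20, 316, 207),
    ("New York Yankees", 36, 23, 305, 207),
    ("Milwaukee Brewers", 35, 21, 268, 194),
    ("Seattle Mariners", 31, 29, 255, 225),
    ("Pittsburgh Pirates", 32, 28, 301, 270),
    ("Tampa Bay Rays", 36, 20, 262, 243),
    ("Chicago Cubs", 32, 28, 284, 265),
    ("Arizona Diamondbacks", 31, 27, 264, 256),
    ("Texas Rangers", 28, 31, 237, 230),
    ("Chicago White Sox", 32, 27, 275, 267),
    ("Cleveland Guardians", 34, 27, 249, 248),
    ("Washington Nationals", 31, 29, 324, 323),
    ("Boston Red Sox", 25, 33, 231, 236),
    ("San Diego Padres", 32, 26, 227, 233),
    ("Toronto Blue Jays", 29, 31, 244, 251),
    ("St. Louis Cardinals", 31, 26, 247, 257),
    ("New York Mets", 26, 33, 239, 252),
    ("Minnesota Twins", 27, 33, 275, 296),
    ("Philadelphia Phillies", 30, 29, 230, 256),
    ("Miami Marlins", 26, 34, 250, 279),
    ("Houston Astros", 27, 34, 271, 304),
    ("Cincinnati Reds", 30, 28, 256, 288),
    ("Athletics", 28, 31, 252, 286),
    ("Baltimore Orioles", 28, 32, 275, 313),
    ("Detroit Tigers", 22, 38, 223, 262),
    ("Los Angeles Angels", 23, 37, 255, 306),
    ("San Francisco Giants", 23, 36, 232, 280),
    ("Kansas City Royals", 22, 37, 221, 280),
    ("Colorado Rockies", 22, 38, 251, 334),
]


def pythagorean_win_pct(rs, ra, exponent):
    """Expected winning percentage for a given exponent."""
    return rs ** exponent / (rs ** exponent + ra ** exponent)


def mean_absolute_deviation(teams, exponent):
    """Average absolute gap between actual and expected winning percentage."""
    deviations = [
        abs(wins / (wins + losses) - pythagorean_win_pct(rs, ra, exponent))
        for _, wins, losses, rs, ra in teams
    ]
    return sum(deviations) / len(deviations)


mad_2 = mean_absolute_deviation(TEAMS, 2.0)
mad_187 = mean_absolute_deviation(TEAMS, 1.87)
print(f"MAD (exponent 2):    {mad_2:.2%}")
print(f"MAD (exponent 1.87): {mad_187:.2%}")
```

Because the exponent is just a parameter, the same function can be reused to try any other value.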

[Figure: Actual Win Percentage vs Expected Win Percentage]

Question 7: Creating the Transformed Wins Variable

\[ transformed\_wins = \ln\left(\frac{1}{Win\%} - 1\right) \]

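This transformation puts winning percentage onto a log-odds scale. Since 1/Win% − 1 equals (1 − Win%)/Win%, the transformed value is the natural log of the ratio of losses to wins. That matters because the next part of the assignment estimates the exponent through regression instead of just assuming the exponent should be 2 or 1.87.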

Team	Actual Win%	Transformed Wins
Atlanta Braves	66.7%	−0.693
Los Angeles Dodgers	64.4%	−0.593
Tampa Bay Rays	64.3%	−0.588
Milwaukee Brewers	62.5%	−0.511
New York Yankees	61.0%	−0.448
Cleveland Guardians	55.7%	−0.231
San Diego Padres	55.2%	−0.208
St. Louis Cardinals	54.4%	−0.176
Chicago White Sox	54.2%	−0.170
Arizona Diamondbacks	53.4%	−0.138
Pittsburgh Pirates	53.3%	−0.134
Chicago Cubs	53.3%	−0.134
Cincinnati Reds	51.7%	−0.069
Seattle Mariners	51.7%	−0.067
Washington Nationals	51.7%	−0.067
Philadelphia Phillies	50.8%	−0.034
Toronto Blue Jays	48.3%	0.067
Athletics	47.5%	0.102
Texas Rangers	47.5%	0.102
Baltimore Orioles	46.7%	0.134
Minnesota Twins	45.0%	0.201
Houston Astros	44.3%	0.231
New York Mets	44.1%	0.238
Miami Marlins	43.3%	0.268
Boston Red Sox	43.1%	0.278
San Francisco Giants	39.0%	0.448
Los Angeles Angels	38.3%	0.475
Kansas City Royals	37.3%	0.520
Detroit Tigers	36.7%	0.547
Colorado Rockies	36.7%	0.547

The transformed values look a little strange at first because stronger teams have lower values. That is just because of how the formula is written: it is the log of losses relative to wins. A team above .500 always ends up with a negative transformed value, a team below .500 always ends up positive, and a team at exactly .500 gets zero. The main point is not that the number feels intuitive on its own, but that it creates the right setup for the regression.
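A short sketch of the transformation, using records from the table above (the function name is my own for illustration). It also shows the sign behavior directly: a winning record gives a negative value, a losing record gives a positive value, and a .500 record gives zero.

```python
import math


def transformed_wins(win_pct):
    """ln(1/Win% - 1), which equals ln(losses / wins)."""
    return math.log(1 / win_pct - 1)


braves = transformed_wins(40 / 60)   # best record in the table
rockies = transformed_wins(22 / 60)  # tied for the worst record in the table
even = transformed_wins(0.5)         # hypothetical .500 team

print(f"Braves: {braves:.3f}, Rockies: {rockies:.3f}, .500 team: {even:.3f}")
```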

Question 8: Creating the Transformed Run Ratio Variable

The transformed run ratio uses runs allowed divided by runs scored, then takes the natural log.

\[ transformed\_run\_ratio = \ln\left(\frac{Runs\ Allowed}{Runs\ Scored}\right) \]

This gives the run-based variable that lines up with the transformed wins variable.

Team	RS	RA	Transformed Wins	Transformed Run Ratio
Los Angeles Dodgers	314	185	−0.593	−0.529
Atlanta Braves	316	207	−0.693	−0.423
New York Yankees	305	207	−0.448	−0.388
Milwaukee Brewers	268	194	−0.511	−0.323
Seattle Mariners	255	225	−0.067	−0.125
Pittsburgh Pirates	301	270	−0.134	−0.109
Tampa Bay Rays	262	243	−0.588	−0.075
Chicago Cubs	284	265	−0.134	−0.069
Arizona Diamondbacks	264	256	−0.138	−0.031
Texas Rangers	237	230	0.102	−0.030
Chicago White Sox	275	267	−0.170	−0.030
Cleveland Guardians	249	248	−0.231	−0.004
Washington Nationals	324	323	−0.067	−0.003
Boston Red Sox	231	236	0.278	0.021
San Diego Padres	227	233	−0.208	0.026
Toronto Blue Jays	244	251	0.067	0.028
St. Louis Cardinals	247	257	−0.176	0.040
New York Mets	239	252	0.238	0.053
Minnesota Twins	275	296	0.201	0.074
Philadelphia Phillies	230	256	−0.034	0.107
Miami Marlins	250	279	0.268	0.110
Houston Astros	271	304	0.231	0.115
Cincinnati Reds	256	288	−0.069	0.118
Athletics	252	286	0.102	0.127
Baltimore Orioles	275	313	0.134	0.129
Detroit Tigers	223	262	0.547	0.161
Los Angeles Angels	255	306	0.475	0.182
San Francisco Giants	232	280	0.448	0.188
Kansas City Royals	221	280	0.520	0.237
Colorado Rockies	251	334	0.547	0.286

This variable is basically a log version of run prevention compared to run production. If a team allows fewer runs than it scores, the value is negative. If it allows more runs than it scores, the value is positive. That makes it a compact way to measure the same idea behind run differential.
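The same calculation as a minimal sketch, using the two extreme rows of the table (the function name is hypothetical):

```python
import math


def transformed_run_ratio(runs_scored, runs_allowed):
    """Natural log of runs allowed divided by runs scored."""
    return math.log(runs_allowed / runs_scored)


dodgers = transformed_run_ratio(314, 185)  # allows far fewer than it scores
rockies = transformed_run_ratio(251, 334)  # allows far more than it scores

print(f"Dodgers: {dodgers:.3f}, Rockies: {rockies:.3f}")
```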

Question 9: Estimating the Best Fit Exponent

The exponent is estimated directly from the data by regressing transformed wins on the transformed run ratio calculated above.

\[ transformed\_wins = \beta_0 + \beta_1(transformed\_run\_ratio) \]

This works because the Pythagorean formula can be rearranged into a log relationship between winning percentage and the ratio of runs allowed to runs scored. Once both sides are transformed, the slope of the regression becomes the estimated exponent. So instead of choosing 2 or 1.87 ahead of time, this approach lets the data estimate the exponent that best fits these teams.
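The rearrangement can be written out step by step. Here RS is runs scored, RA is runs allowed, and k is the unknown exponent (k is notation introduced here for illustration):

\[ \begin{aligned}
Win\% &= \frac{RS^k}{RS^k + RA^k} \\
\frac{1}{Win\%} &= 1 + \left(\frac{RA}{RS}\right)^k \\
\frac{1}{Win\%} - 1 &= \left(\frac{RA}{RS}\right)^k \\
\ln\left(\frac{1}{Win\%} - 1\right) &= k \cdot \ln\left(\frac{RA}{RS}\right)
\end{aligned} \]

The left side is transformed wins and the right side is k times the transformed run ratio, so the regression slope plays the role of k. If the formula held exactly, the intercept would be zero.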

Term	Estimate	Std. Error	t-stat	p-value
(Intercept)	0.004	0.033	0.108	0.915
transformed_run_ratio	1.540	0.176	8.743	<0.001

The slope on transformed_run_ratio is the key number because it is the estimated exponent. It measures how strongly the run ratio translates into winning percentage. A higher exponent would mean run differential has a stronger effect on expected winning percentage. A lower exponent means the model is a little softer and does not push teams as far toward the extremes. For this dataset, the regression gives a data-driven way to compare against the assumed exponents from the earlier questions.
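For anyone who wants to check the estimate, below is a minimal sketch of the regression using the closed-form simple OLS formulas in plain Python. The data is copied from the Question 1 table. The names are illustrative, and this is not necessarily the software used to produce the table above.

```python
import math

# (team, wins, losses, runs scored, runs allowed), from the Question 1 table
TEAMS = [
    ("Los Angeles Dodgers", 38, 21, 314, 185),
    ("Atlanta Braves", 40, 20, 316, 207),
    ("New York Yankees", 36, 23, 305, 207),
    ("Milwaukee Brewers", 35, 21, 268, 194),
    ("Seattle Mariners", 31, 29, 255, 225),
    ("Pittsburgh Pirates", 32, 28, 301, 270),
    ("Tampa Bay Rays", 36, 20, 262, 243),
    ("Chicago Cubs", 32, 28, 284, 265),
    ("Arizona Diamondbacks", 31, 27, 264, 256),
    ("Texas Rangers", 28, 31, 237, 230),
    ("Chicago White Sox", 32, 27, 275, 267),
    ("Cleveland Guardians", 34, 27, 249, 248),
    ("Washington Nationals", 31, 29, 324, 323),
    ("Boston Red Sox", 25, 33, 231, 236),
    ("San Diego Padres", 32, 26, 227, 233),
    ("Toronto Blue Jays", 29, 31, 244, 251),
    ("St. Louis Cardinals", 31, 26, 247, 257),
    ("New York Mets", 26, 33, 239, 252),
    ("Minnesota Twins", 27, 33, 275, 296),
    ("Philadelphia Phillies", 30, 29, 230, 256),
    ("Miami Marlins", 26, 34, 250, 279),
    ("Houston Astros", 27, 34, 271, 304),
    ("Cincinnati Reds", 30, 28, 256, 288),
    ("Athletics", 28, 31, 252, 286),
    ("Baltimore Orioles", 28, 32, 275, 313),
    ("Detroit Tigers", 22, 38, 223, 262),
    ("Los Angeles Angels", 23, 37, 255, 306),
    ("San Francisco Giants", 23, 36, 232, 280),
    ("Kansas City Royals", 22, 37, 221, 280),
    ("Colorado Rockies", 22, 38, 251, 334),
]

# x: transformed run ratio, y: transformed wins
x = [math.log(ra / rs) for _, _, _, rs, ra in TEAMS]
y = [math.log(1 / (w / (w + lo)) - 1) for _, w, lo, _, _ in TEAMS]

n = len(TEAMS)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Simple OLS: slope = Sxy / Sxx, intercept = mean_y - slope * mean_x
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# R-squared = 1 - residual sum of squares / total sum of squares
ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

print(f"intercept = {intercept:.3f}, exponent = {slope:.3f}, R^2 = {r_squared:.3f}")
```

Standard errors and p-values are left out to keep the sketch short; a statistics package would report those alongside the coefficients.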

[Figure: Best Fit Exponent Regression]

The positive relationship in the chart makes sense because as a team allows more runs relative to what it scores, its transformed wins value gets worse. The regression line turns that relationship into one estimated exponent.

Question 10: Reporting the Exponent and R-Squared

Estimated Exponent	R-Squared
1.540	0.732

The estimated exponent is 1.540, and the R-squared is 0.732. The estimated exponent is lower than both 2 and 1.87, which means this particular dataset prefers a softer exponent than either of the two formulas tested earlier.

The R-squared means the transformed run ratio explains about 73.2% of the variation in transformed winning percentage. That is a fairly strong result for a relatively simple model. Runs scored and runs allowed explain a lot of team performance, which is why the Pythagorean expectation works as well as it does. At the same time, it still leaves room for the parts of baseball that do not fit perfectly into one basic formula.