Overview
The use of historical statistics to predict future outcomes,
particularly wins and losses, and identify opportunities for improving
team or individual performance, has gained significant attention in
professional sports. The aim of this analysis is to develop several
models that can predict a baseball team’s wins over a season based on
team stats such as homeruns, strikeouts, base hits, and more. We will
begin by examining the data for any issues, such as missing data, or
outliers, and take the necessary measures to clean the data. We will
subsequently create and evaluate three different linear models that
forecast seasonal wins using the dataset, which includes both training
and evaluation data. We will train the models using the main training
data and then evaluate their performance against the evaluation data to
determine their effectiveness. Finally, we will choose the best model
that balances accuracy and simplicity for predicting seasonal wins.
1. Date Exploration
The baseball training dataset contains 2,276 observations of 17
variables detailing various teams’ performances per year from 1871 to
2006. The description of the columns is shown below. Due to the
relatively long period, we expect to see outliers and missing data as
the league modified official game rules; these rule changes undoubtedly
caused teams and players to change their tactics in response.
Additionally, the number of single base hits is noticeably missing from
the columns. However, we will derive this value as the number of other
types of hits (doubles, triples, home runs) can be subtracted from total
hits. Lastly, other columns representing game number (out of 162),
inning number (1-9), and matching opponent columns would have been
vastly useful for predictions. One last noticeable omission from the
original dataset is of the number of single base hits. However, this
value can possibly be calculated as a difference between other types of
hits (doubles, triples, home runs) and total hits.
1.1 Summary Statistics
The table below shows us some valuable descriptive statistics for the
training data. The data set contains all integers. We can see that many
of the variables have a minimum of 0 but not all. The means and medians
of each variable are all relatively close in value for each individual
variable. This tells us that most data is free from extreme outliers as
they tend to skew the mean relative to the median.
One interesting piece of information is the min/max of the
TARGET_WINS variable. The minimum is 0 meaning there are
teams that did not win a single game. The maximum is 146 which indicates
no team in the training dataset had a perfect season, as we know from
the data a season consists of 162 games.
Also of note is the number of missing values from certain variables.
Most notably the TEAM_BATTING_HBP (batters hit by pitch
variable). With 91% of the data missing we will remove this variable
from our dataset because there simply is not enough information to
impute a sensible value. The column TEAM_BASERUN_CS (caught
stealing) had 34% of the missing data, we may consider removing it
later. The missing data for these two columns may be due to a change
official rules or tactics before the modern era of baseball.
1.2 Distribution and Box Plots
Next, we’ll visually check for normal distributions and box plots in
both the dependent and independent variables. The density plot below
shows normalcy in most features except for extremely right skewed
features such as hits allowed (PITCHING_H) or errors (FIELDING_E).
Homeruns by batters (BATTING_HR) and strikeouts by batters (BATTING_SO)
variables seem bimodal. It implies the existence of two distinct
clusters within the baseball season data, where teams tended to score
more in one of the clusters.
Box plots for these further show a high number of outliers exist outside
of the interquartile ranges so their effects should be carefully
considered and we may deal with non-unimodal distributions.


Lastly, the function featurePlot() will show the relationship between
independent variables and the target variable TARGET_WINS.
In general, while our graphs display certain intriguing connections
among the variables, they also expose noteworthy problems with the data.
For example, the dataset contains a team that has not won any games,
which appears improbable. By checking the web data, we found that it
actually happened 2 times: 1872 and 1873. Or that the pitching data
contains numerous instances of 0’s, several teams have 0 strikeouts by
their pitchers over the season, which is highly improbable. Also, there
is as a team achieving 20,000 strikeouts. There will be further steps to
work with outliers and 0’s.
1.3 Correlation Matrix
Plotting the correlations between TARGET_WINS and the
variables (excluding INDEX and
TEAM_BATTING_HBP) we can see that very few variables are
strongly correlated with the target variable. Columns with correlations
close to zero are unlikely to offer significant insights into the
factors that contribute to a team’s victories.
To avoid multicolinearity, we should not include features that have
strong correlation. Comparing offensive (any column starting with
BATTING or BASERUN) to defensive stats
unexpectedly shows some correlation, pointing to potential problems.
Qualitatively, the matrix implies some teams or players are exceptional
both at hitting (offensive) and fielding (defensive). Furthermore, a
typical team’s number of batted home runs and allowed home runs has a
correlation of nearly 1.0. This is an unexpected correlation but can be
explained by noticing most games are decided by a difference of one or
two runs (whether the games are high scoring or not). Any final models
should include one of these two home run variables. Alternatively, the
correlation between a team’s hits (BATTING_H) and hits
allowed (PITCHING_H) is around 0.3 which is seems
reasonable.
There are some other strong correlations that are less obvious such
as Errors (TEAM_FIELDING_E) being strongly negatively
correlated with walks by batters (TEAM_BATTING_BB), strike
outs (TEAM_BATTING_SO). All combined together, teams that
get a lot of hits do not generally make fielding errors.
Digging a little deeper we can see there is a Pearson correlation
coefficient of -0.6559708 for errors and walks by batters which
indicates a strong negative correlation between the two variables.
Looking at errors compared with team pitching hits allowed we see a
correlation of 0.667759 which indicates a strong positive
correlation.

|
|
Coefficient
|
|
TARGET_WINS
|
1.0000000
|
|
TEAM_BATTING_H
|
0.4699467
|
|
TEAM_BATTING_2B
|
0.3129840
|
|
TEAM_BATTING_3B
|
-0.1243459
|
|
TEAM_BATTING_HR
|
0.4224168
|
|
TEAM_BATTING_BB
|
0.4686879
|
|
TEAM_BATTING_SO
|
-0.2288927
|
|
TEAM_BASERUN_SB
|
0.0148364
|
|
TEAM_BASERUN_CS
|
-0.1787560
|
|
TEAM_BATTING_HBP
|
0.0735042
|
|
TEAM_PITCHING_H
|
0.4712343
|
|
TEAM_PITCHING_HR
|
0.4224668
|
|
TEAM_PITCHING_BB
|
0.4683988
|
|
TEAM_PITCHING_SO
|
-0.2293648
|
|
TEAM_FIELDING_E
|
-0.3866880
|
|
TEAM_FIELDING_DP
|
-0.1958660
|
Lastly, lets take a closer look at the missing data. We’ve already
determined that the batter hit by pitch (TEAM_BATTING_HBP)
variable is missing 91% of its data but what of the other variables. We
will just drop the column from the further analysis.
Using the plot below we can visualize the missingness of the
remaining variables. There are 5 variables that contain varying degrees
of missing data. We will use the information to fill in the missing
values in our data preparation step.
TEAM_BASERUN_CS appears to be missing the second most
amount of values but at only 772 missing values out of 2276 this is much
less of a concern than the HBP variable we identified earlier. The
remaining variables that are missing data have less than 25% of their
data missing so should be safe to impute.

2. Data Preparation
2.1 Missing Data
Notes / Questions
As discussed above, we will drop the INDEX and
TEAM_BATTING_HBP variables as the
TEAM_BATTING_HBP variable is missing 91% of its data and
the INDEX variable is just an identification Variable.
We’ll also derive a new column for single base hits derived from
subtracting double, triples and home runs from the total number of
hits.
For further work with NA’s, we create flags to suggest if a variable
was missing.
To to impute the missing values in the trainDf data, the
mice library is used. To utilize MICE, one must make the
assumption that the missing values are missing at random, indicating
that the missingness can be explained by variables that have complete
information. The MICE algorithm then performs several iterations over
the data, as suggested by its name, and generates data to complete the
missing values. We check what impute method we use for each column.
pmm is predictive mean matching, replacing missing data
with column means.
Let’s also take a look at the density plots pre and post-imputation
to make sure densities look similar. Unfortunately, for
TEAM_BASERUN_SB, TEAM_BATTING_SO, and
TEAM_FIELDING DP they do not. But in the case of
TEAM_BATTING_SO our distribution becomes roughly more
normal, so it may be beneficial. For the TEAM_BASERUN_SB
and TEAM_BASERUN_CS and TEAM_FIELDING_DP we
may need to consider alternative methods.

2.2 Drop Outliers
To identify outliers, we used historical data from the Lahman’s
Baseball Database. In baseball’s early history, few games were played in
a season (less than 20); thus, the calculation used to normalize the
statistics to a 162-game season tends to create outliers. We used the
min and max from the modern era (post-1900) as a filter to alleviate the
issues created by outlier values.
ADD MORE TEXT WITH EXPLANATIONS ABOUT FUNCTIONS USED TO WORK
WITH OUTLIERS


3. Building models
Notes / Questions
At this juncture, with a thorough comprehension of our dataset and
having completed the data cleaning process, we can initiate the
construction of our multiple linear regression models. We will build
four separate linear models and compare their performance. To build the
models, we will use 4 data frames: original training dataset (trainDf),
dataset without missing values (cleanDf), dataset with no outliers and
no NA’s (droppedDf), dataset with transformations, without NA’s
(transformedDf). First, we decided to split our datasets into a training
and testing sets (80% training, 20% testing). Then, using our training
dataset, we will run our models and check them using testing data.
3.1 Model 1: Baseline
For the first multiple linear regression model, we will select all
the variables from the original un-cleaned dataset except for
INDEX (just a row index) and TEAM_BATTING_HBP
(91% of missing data). We may use this model as a base model to compare
with. The results: F-statistic is 82.1, adjusted R-squared 0.433, out of
the 15 variables, 6 have statistically significant p-values. The
adjusted R2 indicates that only 43% of the variance in the response
variable can be explained by the predictor variables. The F-statistic is
low and the model’s p-value are not statistically significant. We should
look at other models.
|
|
TARGET_WINS
|
|
Predictors
|
Estimates
|
std. Error
|
Statistic
|
p
|
|
(Intercept)
|
57.91
|
6.64
|
8.72
|
<0.0001
|
|
TEAM BATTING H
|
0.02
|
0.02
|
0.79
|
0.4318
|
|
TEAM BATTING 2B
|
-0.07
|
0.01
|
-7.52
|
<0.0001
|
|
TEAM BATTING 3B
|
0.16
|
0.02
|
7.28
|
<0.0001
|
|
TEAM BATTING HR
|
0.07
|
0.09
|
0.87
|
0.3866
|
|
TEAM BATTING BB
|
0.04
|
0.05
|
0.94
|
0.3463
|
|
TEAM BATTING SO
|
0.02
|
0.02
|
0.78
|
0.4368
|
|
TEAM BASERUN SB
|
0.04
|
0.01
|
4.13
|
<0.0001
|
|
TEAM BASERUN CS
|
0.05
|
0.02
|
2.86
|
0.0043
|
|
TEAM PITCHING H
|
0.02
|
0.02
|
1.04
|
0.3003
|
|
TEAM PITCHING HR
|
0.02
|
0.08
|
0.28
|
0.7794
|
|
TEAM PITCHING BB
|
-0.00
|
0.04
|
-0.09
|
0.9255
|
|
TEAM PITCHING SO
|
-0.04
|
0.02
|
-1.70
|
0.0892
|
|
TEAM FIELDING E
|
-0.16
|
0.01
|
-15.67
|
<0.0001
|
|
TEAM FIELDING DP
|
-0.11
|
0.01
|
-8.59
|
<0.0001
|
|
Observations
|
1486
|
|
R2 / R2 adjusted
|
0.439 / 0.433
|
The linear modeling assumption are evaluated using a diagnostic plot,
the Breusch–Pagan Test for Heteroscedasticity and Variance Inflation
Factor (VIF) to assess colinearity.
The initial review of the diagnostic plots for this model shows some
deviation from linear modeling assumptions. In the Q-Q plot, we see some
deviations from the normal distribution at the ends. The
residuals-fitted and standardzied residuals-fitted plots show a curve in
the main cluster, which indicates that we do not have constant
variance.

3.2 Model 2: Removed N/A Values
The second model will be based on the cleaned dataset without missing
values. We chose variables TEAM_PITCHING_BB,
TEAM_BATTING_2B, TEAM_BATTING_3B,
TEAM_FIELDING_E, TEAM_PITCHING_H,
TEAM_BATTING_HR, TEAM_BATTING_H based on the
intuition and the understanding of the data. The results: F-statistic is
119.9, adjusted R-squared 0.268, out of the 7 variables, 7 have
statistically significant p-values. The adjusted R2 indicates that only
25% of the variance in the response variable can be explained by the
predictor variables, which is less than in the original model. The
F-statistic is high and the p-value of the model is close to zero, and
the model’s diagnostic checks are satisfactory, this suggests that we
would dismiss the null hypothesis, which assumes that there is no
correlation between the explanatory and response variables.
|
|
TARGET_WINS
|
|
Predictors
|
Estimates
|
std. Error
|
Statistic
|
p
|
|
(Intercept)
|
7.27
|
3.28
|
2.22
|
0.0266
|
|
TEAM PITCHING BB
|
0.01
|
0.00
|
5.29
|
<0.0001
|
|
TEAM BATTING 2B
|
-0.03
|
0.01
|
-2.85
|
0.0044
|
|
TEAM BATTING 3B
|
0.10
|
0.02
|
6.07
|
<0.0001
|
|
TEAM FIELDING E
|
-0.02
|
0.00
|
-7.36
|
<0.0001
|
|
TEAM PITCHING H
|
-0.00
|
0.00
|
-4.03
|
0.0001
|
|
TEAM BATTING HR
|
0.04
|
0.01
|
4.85
|
<0.0001
|
|
TEAM BATTING H
|
0.05
|
0.00
|
15.22
|
<0.0001
|
|
Observations
|
2276
|
|
R2 / R2 adjusted
|
0.270 / 0.268
|
The initial review of the diagnostic plots for this model shows some
deviation from linear modeling assumptions. The Q-Q plot shows the
normal distribution but the residuals-fitted and standardzied
residuals-fitted plots don’t have contant variance for the fitted values
below 60 and above 95.

3.3 Model 3: Removed Outliers
The third model will use the cleaned dataset with outliers omitted.
Stepwise variable selection based on the AIC score is used to filter the
optimal set of features. The resulting model has a significant p-values
for the model and all predictor variables. The results: F-statistic is
117.3, adjusted R-squared 0.417. The adjusted R2 indicates that 41% of
the variance in the response variable can be explained by the predictor
variables. The F-statistic is high and the p-value of the model is close
to zero, so we would dismiss the null hypothesis, which assumes that
there is no correlation between the explanatory and response variables.
DAVID, WHY TEAM_BATTING_H WAS REMOVED? COULD YOU PLEASE ADD A
SENTENCE EXPLAINING THE CHOICE?
|
|
TARGET_WINS
|
|
Predictors
|
Estimates
|
std. Error
|
Statistic
|
p
|
|
(Intercept)
|
60.35
|
5.64
|
10.70
|
<0.0001
|
|
TEAM BATTING 2B
|
-0.05
|
0.01
|
-5.23
|
<0.0001
|
|
TEAM BATTING 3B
|
0.18
|
0.02
|
10.03
|
<0.0001
|
|
TEAM BATTING HR
|
0.09
|
0.01
|
9.82
|
<0.0001
|
|
TEAM BATTING BB
|
0.17
|
0.02
|
8.92
|
<0.0001
|
|
TEAM BATTING SO
|
-0.05
|
0.01
|
-6.38
|
<0.0001
|
|
TEAM BASERUN SB
|
0.09
|
0.01
|
15.76
|
<0.0001
|
|
TEAM BASERUN CS
|
-0.06
|
0.01
|
-5.80
|
<0.0001
|
|
TEAM PITCHING H
|
0.03
|
0.00
|
7.45
|
<0.0001
|
|
TEAM PITCHING BB
|
-0.13
|
0.02
|
-7.40
|
<0.0001
|
|
TEAM PITCHING SO
|
0.03
|
0.01
|
3.66
|
0.0003
|
|
TEAM FIELDING E
|
-0.12
|
0.01
|
-20.41
|
<0.0001
|
|
TEAM FIELDING DP
|
-0.12
|
0.01
|
-9.70
|
<0.0001
|
|
Observations
|
1950
|
|
R2 / R2 adjusted
|
0.421 / 0.417
|
|
AIC
|
14694.612
|
**WAS BEFORE FROM DAVID*
The initial review of the diagnostic plots for this model shows some
deviation from linear modeling assumptions at the boundaries of the
prediction range, but the overall QQ plot is almost a flat line; the
residuals are flat and exhibit homoscedasticity of variance in the 65 -
95 fitted value range.

The Breusch–Pagan Test for Heteroscedasticity assumes the following
Null and Alternate hypothesis.
- H0 - Residuals are distributed with equal variance (i.e.,
homoscedasticity)
- H1 - Residuals are distributed with unequal variance (i.e.,
heteroscedasticity)
For this model iteration, we reject the null hypothesis and conclude
that this model violates the homoscedasticity function.
studentized Breusch-Pagan test
data: lm3step BP = 70.258, df = 12, p-value = 2.865e-10
The Variance Inflation Factor (VIF) calculation detects collinearity
with the TEAM_PITCHING_BB and TEAM_BATTING_BB
variables. Both variables have a VIF score over 5 and are 0.93
correlated. Also, variables TEAM_PITCHING_SO, TEAM_BATTING_SO,
TEAM_PITCHING_H, TEAM_BATTING_2B have VIF over 5.


For the refined model we will drop TEAM_PITCHING_BB,
TEAM_BATTING_BB, and TEAM_PITCHING_SO from the
model. This model has a lower adjusted R2 of 0.37 but together with high
F-statistic of 163.9, statistically significant p-values should align
better with the linear regression assumptions.
|
|
TARGET_WINS
|
|
Predictors
|
Estimates
|
std. Error
|
Statistic
|
p
|
|
(Intercept)
|
107.56
|
3.26
|
33.03
|
<0.0001
|
|
TEAM BATTING 3B
|
0.22
|
0.02
|
12.62
|
<0.0001
|
|
TEAM BATTING HR
|
0.14
|
0.01
|
19.23
|
<0.0001
|
|
TEAM BATTING SO
|
-0.03
|
0.00
|
-16.35
|
<0.0001
|
|
TEAM BASERUN SB
|
0.10
|
0.01
|
17.90
|
<0.0001
|
|
TEAM BASERUN CS
|
-0.07
|
0.01
|
-5.73
|
<0.0001
|
|
TEAM FIELDING E
|
-0.13
|
0.01
|
-21.38
|
<0.0001
|
|
TEAM FIELDING DP
|
-0.10
|
0.01
|
-7.77
|
<0.0001
|
|
Observations
|
1950
|
|
R2 / R2 adjusted
|
0.371 / 0.369
|
|
AIC
|
14844.599
|
The diagnostic plots for the new model appear to be closer aligned
with the linear regression assumptions. For this model iteration, we
will again reject the null hypothesis and conclude that this model
violates the homoscedasticity function. And finally the Variance
Inflation Factor (VIF) calculation does not detects collinearity across
the remaining variables. Overall we will reject model 3 because the
residuals are not homoscedastic.

studentized Breusch-Pagan test
data: lm3step.final BP = 46.935, df = 7, p-value = 5.748e-08 The model
with stepAIC seems the nest out of 3 as all p-values are significant,
high F-statistic of 94.48, adjusted R-squared is 0.4185.
|
|
model 3
|
model 3 (StepAIC)
|
model 3 (Final)
|
|
Predictors
|
Estimates
|
std. Error
|
Statistic
|
p
|
Estimates
|
std. Error
|
Statistic
|
p
|
Estimates
|
std. Error
|
Statistic
|
p
|
|
(Intercept)
|
60.63
|
5.87
|
10.33
|
<0.0001
|
60.35
|
5.64
|
10.70
|
<0.0001
|
107.56
|
3.26
|
33.03
|
<0.0001
|
|
TEAM BATTING 2B
|
-0.05
|
0.02
|
-2.71
|
0.0067
|
-0.05
|
0.01
|
-5.23
|
<0.0001
|
|
|
|
|
|
TEAM BATTING 3B
|
0.18
|
0.02
|
7.37
|
<0.0001
|
0.18
|
0.02
|
10.03
|
<0.0001
|
0.22
|
0.02
|
12.62
|
<0.0001
|
|
TEAM BATTING HR
|
0.14
|
0.09
|
1.60
|
0.1106
|
0.09
|
0.01
|
9.82
|
<0.0001
|
0.14
|
0.01
|
19.23
|
<0.0001
|
|
TEAM BATTING BB
|
0.17
|
0.05
|
3.44
|
0.0006
|
0.17
|
0.02
|
8.92
|
<0.0001
|
|
|
|
|
|
TEAM BATTING SO
|
-0.05
|
0.01
|
-6.26
|
<0.0001
|
-0.05
|
0.01
|
-6.38
|
<0.0001
|
-0.03
|
0.00
|
-16.35
|
<0.0001
|
|
TEAM BASERUN SB
|
0.09
|
0.01
|
15.36
|
<0.0001
|
0.09
|
0.01
|
15.76
|
<0.0001
|
0.10
|
0.01
|
17.90
|
<0.0001
|
|
TEAM BASERUN CS
|
-0.06
|
0.01
|
-5.57
|
<0.0001
|
-0.06
|
0.01
|
-5.80
|
<0.0001
|
-0.07
|
0.01
|
-5.73
|
<0.0001
|
|
TEAM PITCHING H
|
0.03
|
0.02
|
1.99
|
0.0472
|
0.03
|
0.00
|
7.45
|
<0.0001
|
|
|
|
|
|
TEAM PITCHING HR
|
-0.05
|
0.08
|
-0.64
|
0.5234
|
|
|
|
|
|
|
|
|
|
TEAM PITCHING BB
|
-0.13
|
0.05
|
-2.78
|
0.0055
|
-0.13
|
0.02
|
-7.40
|
<0.0001
|
|
|
|
|
|
TEAM PITCHING SO
|
0.03
|
0.01
|
3.64
|
0.0003
|
0.03
|
0.01
|
3.66
|
0.0003
|
|
|
|
|
|
TEAM FIELDING E
|
-0.12
|
0.01
|
-20.10
|
<0.0001
|
-0.12
|
0.01
|
-20.41
|
<0.0001
|
-0.13
|
0.01
|
-21.38
|
<0.0001
|
|
TEAM FIELDING DP
|
-0.12
|
0.01
|
-9.60
|
<0.0001
|
-0.12
|
0.01
|
-9.70
|
<0.0001
|
-0.10
|
0.01
|
-7.77
|
<0.0001
|
|
TEAM BATTING 1B
|
-0.00
|
0.02
|
-0.21
|
0.8360
|
|
|
|
|
|
|
|
|
|
Observations
|
1950
|
1950
|
1950
|
|
R2 / R2 adjusted
|
0.421 / 0.417
|
0.421 / 0.417
|
0.371 / 0.369
|
|
AIC
|
14698.128
|
14694.612
|
14844.599
|
4. Selecting Models
Notes / Questions
- there are NA’s in the predsDf tables with the predictions for every
model. Don’t know why
The table presents an analysis of the performance metrics for each
model on the training dataset. The results suggest that the model’s
performance slightly improved after incorporating transformations and
selecting significant parameters. Below are the results \(R^2\), residual standard error, and
F-statistics of each model. It happened that the non-cleaned,
non-imputed raw training data had the best fitting statistics. Though
model 1 doesn’t meet assumption requirements. Moel 3 looks like the best
fit. The F-statistic for models 2-3 was large, they all had significant
p-values. During the exploratory data analysis, it was noted that some
of the predictor variables had correlations with each other, which is
expected in baseball. However, this high correlation between variables
could result in multicollinearity, causing unstable regression fits. To
mitigate this issue, we employed the Variable Inflation Factor (VIF)
method in the previous section to identify better models by selecting
variables with VIF values less than 5. The residuals plots from the part
3 showed that models 1-3 almost follow the normal distribution and
almost constant variance.
ADD WORDS ABOUT MSE

Money Ball Dataset
|
r
|
mse
|
adjusted.r
|
|
0.44
|
9.56
|
0.43
|
|
0.27
|
13.48
|
0.27
|
|
0.42
|
10.43
|
0.42
|
|
0.37
|
12.54
|
0.37
|
Next, we apply the models to the evaluation dataset to make
predictions. However, to ensure that the models 2-4 work properly, we
will fill in the missing values in the evaluation data set using the
same imputation method - mice, drop columns INDEX,
TEAM_BATTING_HBP, for model 3 outliers were removed, for
model 4 we will also make necessary transformations.
The table presents the predicted values of
TARGET_WINS for
each model. First thing to notice is that the first model predicts
negative number of wins, which is unrealistic. Though we won’t
concentrate on this model. The third and fourth AIC-generated models (no
outliers and transformed values) have similar outcomes, and both are
performing well. The second model also produces similar results but
there was an issue with model assumptions.
Money Ball Dataset
|
lm1
|
lm2
|
aic3
|
aic4
|
|
62.57
|
68.58
|
60.27
|
61.53
|
|
68.91
|
70.21
|
67.10
|
64.44
|
|
73.68
|
77.35
|
72.04
|
73.42
|
|
81.89
|
83.61
|
84.21
|
87.58
|
|
20.39
|
66.44
|
81.88
|
61.30
|
|
32.64
|
67.44
|
66.65
|
77.89
|
|
48.67
|
74.02
|
67.43
|
82.08
|
|
58.69
|
72.52
|
59.21
|
73.28
|
|
67.81
|
72.08
|
71.48
|
69.37
|
|
69.81
|
75.86
|
69.16
|
72.44
|
|
61.11
|
76.14
|
63.12
|
69.64
|
|
82.33
|
85.66
|
83.03
|
83.31
|
|
85.70
|
84.26
|
85.42
|
82.29
|
|
82.93
|
82.11
|
83.05
|
84.05
|
|
89.82
|
79.28
|
87.03
|
83.92
|
We can also see when plotting the predictions that there doesn’t seem
to be much obvious difference between the models aside from the clearly
outrageous outliers generated by the first model, and the 4th model
showing a slightly tighter range in wins.
TEXT IS NEEDED HERE

5. Conclusion
In general, we don’t much faith in any of the models we produced when
it comes to their predictive power. The model 1 that had the highest
\(R^2\) value was compromised by
missing data, leading to highly inaccurate negative predictions, though
model 3 has also high \(R^2\) and
doesn’t produce negative results. The second and fourth models
demonstrated a weak overall fit, as evidenced by their lower \(R^2\) scores. Furthermore, none of the
models displayed satisfactory performance when assessed for linearity or
homogeneity of variance. According to the analysis, Models 1 and 3
performed slightly better than Models 1 and 4 and could be considered as
the preferred linear models based on the adjusted R2. However, Model 3
is more statistically significant compared to the others and uses fewer
unnecessary variables for making predictions while still maintaining a
similar level of adjusted R2. Considering these factors, and its better
handling of multicollinearity (with fewer instances of sign-flipping in
coefficients), Model 3 would be the recommended choice of model.
References
Appendix A: R code
Appendix B: Lahman’s Baseball Database
Despite significant efforts to compensate for poor data quality, the
resulting models are poor predictors of win totals. Moreover, the poor
data quality is inconsistent with the overall state of baseball
statistics. When it comes to the major sports, baseball has the most
mature statistics available. Therefore finding better data is the best
course of action for developing a better predictive model.
We were able to locate a cleaner version of the same data set
provided in the class. The Lahman’s Baseball Database includes the same
variables as the sample database with fewer errors and additional
reference data that would allow us to connect the database to other
sources.
https://www.seanlahman.com/baseball-archive/statistics/
A significant advantage of Lahman’s data set over the data set
provided in class is that it includes information about the year and the
team. This data is valuable when considering how baseball has changed
over the years. The modern era in baseball is often delineated by the
turn of the century. However, when looking at the past 120 years of
baseball history, it is easy to pinpoint rule changes, evolutions in
playing strategy, and league structure that have fundamentally impacted
the game.
When comparing statistics across time, it is common to use many of
the breakdowns below to add context to the analysis:
- The Dead Ball Era (1901 - 1920)
- World War 2 (1941 - 1945)
- Segregation Era (1901 - 1947ish)
- Post-War Era/Yankees Era (1945 - late 50s/early 60s)
- Westward Expansion (1953 - 1961)
- Dead Ball 2 (The Sixties, roughly)
- Designated Hitter Era (1973 - current, AL only)
- Free Agency/Arbitration Era (1975 - current)
- Steroid Era (unknown, but late 80s - 2005 seems likely)
- Wild Card Era (1994 - current)
Surveying these periods would suggest that a more granular model has
the potential to perform better.
Although we could have chosen any number of the time periods above,
exploring the statistical outliers highlights that many of these values
correspond to the pre-1969 period. This delineation has some historical
support. As Jayson Stark of ESPN argues in this article (https://www.espn.com/mlb/columns/story?columnist=stark_jayson&id=2471349)
In 1969 the MLB underwent several rule changes and changes to the league
structure that impacted win totals and team statistics. 1969 was the
first year of division play and the expanded postseason. The Pitcher’s
Mound was lowered five inches. The Strike zone shrinks. Five-person
rotations kicking in. The save was invented. And more expansion to the
unbalanced schedules.
Thus using 1900 as the beginning of the modern era and 1969 as an
additional breakpoint, the dataset can be divided into three segments.
The density profiles for the predictor variables approach a normal
distribution when grouped by the three segments we identified. To
support data exploration, we added a era_cat field to the data set.
In general the Lahman’s data set contains fewer data gaps and the
variables are more consistently distributed. There are some missing
values in the data set including the Caught Stealing
(CS/TEAM_BASERUN_CS) variable is missing 27.9%; the Batters Hit by Pitch
(HBP/TEAM_BATTING_HBP) variable is missing 38.8%; and the Sacrifice
Flies (SF) variable is missing 51.6% of the values.
Most of the variables in the data set show some level of skewness,
with the following variables having a Kurtosis measure of greater than
3, TEAM_BATTING_H, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_PITCHING_H,
and TEAM_FIELDING_E
The Lahman data set contains several variables with bimodal
distributions, including, TEAM_BATTING_HR, TEAM_BATTING_SO,
TEAM_BATTING_HR, and TEAM_PITCHING_SO.
When we group the statistics by the three categories presented earliers,
we see a much cleaner density plot across all variables. There are few
signs of bimodal distributions, and the skewness of individual variables
is reduced greatly.

The training data set exibits the following characteristics.



DATA PREPARATION
The use of the era_cat variable allows us to group the data set into
3 categories that variables that approach normal distribution.
Since segmenting the data set using the era_cat variable creates 3
categories of variables that approach the normal distribution, we will
focus our data preparation step on missing values. The documentation for
MICE package recommends that a 5% threshold should be observed for
safely imputing missing values. Based on this rule of thumb we will drop
the ‘TEAM_BASERUN_CS’ and the ‘TEAM_BATTING_HBP’ variables.
The remaining variables with missing data
(‘TEAM_BATTING_BB’,‘TEAM_BATTING_SO’ and ‘TEAM_BASERUN_SB’) will be
impuned using the MICE package. Several methods can be used but for
simplicity we selected m=5 the default method.
yearID era_cat TARGET_WINS TEAM_BATTING_H
"" "" "" ""
TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB “” “”
“” “pmm” TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
TEAM_PITCHING_HR “pmm” “pmm” “” “” TEAM_PITCHING_BB TEAM_PITCHING_SO
TEAM_FIELDING_E TEAM_FIELDING_DP “” “” “” “”

Build Model
The first model includes all the variables from the original data set
with the exception of the yearID. It could be argued that year might
have an impact on the numbers, however we are accounting for year with
the era_cat variable. The initial model has an \(AdjR^2 = 0.8078\), and the model selected
by stepAIC includes the same variables and has the same \(AdjR^2=8078\).
Call: lm(formula = TARGET_WINS ~ . - yearID, data =
trainingTeam_df)
Residuals: Min 1Q Median 3Q Max -39.966 -3.999 -0.100 4.093
74.716
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 81.215198 3.483759 23.313 < 2e-16
era_cat1900-1969 1.148737 0.876534 1.311 0.190138
era_cat1969+ 0.222574 0.989999 0.225 0.822137
TEAM_BATTING_H 0.052241 0.001955 26.716 < 2e-16
TEAM_BATTING_2B 0.014369 0.004802 2.992 0.002800 ** TEAM_BATTING_3B
0.013312 0.009570 1.391 0.164341
TEAM_BATTING_HR 0.111726 0.005713 19.558 < 2e-16
TEAM_BATTING_BB 0.044477 0.001846 24.092 < 2e-16
TEAM_BATTING_SO -0.012752 0.001448 -8.804 < 2e-16
TEAM_BASERUN_SB 0.025445 0.002484 10.245 < 2e-16
TEAM_PITCHING_H -0.056968 0.001476 -38.590 < 2e-16
TEAM_PITCHING_HR -0.091531 0.006363 -14.385 < 2e-16
TEAM_PITCHING_BB -0.057768 0.001808 -31.946 < 2e-16
TEAM_PITCHING_SO 0.009398 0.001403 6.696 2.67e-11
TEAM_FIELDING_E -0.005929 0.001745 -3.399 0.000688
TEAM_FIELDING_DP 0.050537 0.007475 6.761 1.72e-11 —
Signif. codes: 0 ‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’
0.1 ’ ’ 1
Residual standard error: 6.746 on 2373 degrees of freedom Multiple
R-squared: 0.8119, Adjusted R-squared: 0.8107 F-statistic: 682.7 on 15
and 2373 DF, p-value: < 2.2e-16
Call: lm(formula = TARGET_WINS ~ era_cat + TEAM_BATTING_H +
TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB
+ TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, data =
trainingTeam_df)
Residuals: Min 1Q Median 3Q Max -39.104 -4.014 -0.081 4.121
74.351
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 80.912590 3.477645 23.266 < 2e-16
era_cat1900-1969 1.062034 0.874487 1.214 0.224691
era_cat1969+ -0.095034 0.963500 -0.099 0.921437
TEAM_BATTING_H 0.053236 0.001820 29.245 < 2e-16
TEAM_BATTING_2B 0.014153 0.004801 2.948 0.003228 ** TEAM_BATTING_HR
0.110525 0.005648 19.568 < 2e-16 TEAM_BATTING_BB
0.044469 0.001846 24.083 < 2e-16 TEAM_BATTING_SO
-0.012490 0.001436 -8.696 < 2e-16 TEAM_BASERUN_SB
0.026440 0.002379 11.115 < 2e-16 TEAM_PITCHING_H
-0.057022 0.001476 -38.631 < 2e-16 TEAM_PITCHING_HR
-0.092645 0.006314 -14.673 < 2e-16 TEAM_PITCHING_BB
-0.057577 0.001803 -31.927 < 2e-16 TEAM_PITCHING_SO
0.009179 0.001395 6.580 5.77e-11 TEAM_FIELDING_E
-0.006142 0.001738 -3.534 0.000418 TEAM_FIELDING_DP
0.049873 0.007462 6.684 2.89e-11 — Signif. codes: 0
‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1
Residual standard error: 6.748 on 2374 degrees of freedom Multiple
R-squared: 0.8117, Adjusted R-squared: 0.8106 F-statistic: 731 on 14 and
2374 DF, p-value: < 2.2e-16
The TEAM_BATTING_3B variable was dropped due to a low
p-values, giving us the final model with an \(AdjR^2=0.8078\).
Call: lm(formula = TARGET_WINS ~ era_cat + TEAM_BATTING_H +
TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB
+ TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, data =
trainingTeam_df)
Residuals: Min 1Q Median 3Q Max -39.104 -4.014 -0.081 4.121
74.351
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 80.912590 3.477645 23.266 < 2e-16
era_cat1900-1969 1.062034 0.874487 1.214 0.224691
era_cat1969+ -0.095034 0.963500 -0.099 0.921437
TEAM_BATTING_H 0.053236 0.001820 29.245 < 2e-16
TEAM_BATTING_2B 0.014153 0.004801 2.948 0.003228 ** TEAM_BATTING_HR
0.110525 0.005648 19.568 < 2e-16 TEAM_BATTING_BB
0.044469 0.001846 24.083 < 2e-16 TEAM_BATTING_SO
-0.012490 0.001436 -8.696 < 2e-16 TEAM_BASERUN_SB
0.026440 0.002379 11.115 < 2e-16 TEAM_PITCHING_H
-0.057022 0.001476 -38.631 < 2e-16 TEAM_PITCHING_HR
-0.092645 0.006314 -14.673 < 2e-16 TEAM_PITCHING_BB
-0.057577 0.001803 -31.927 < 2e-16 TEAM_PITCHING_SO
0.009179 0.001395 6.580 5.77e-11 TEAM_FIELDING_E
-0.006142 0.001738 -3.534 0.000418 TEAM_FIELDING_DP
0.049873 0.007462 6.684 2.89e-11 — Signif. codes: 0
‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1
Residual standard error: 6.748 on 2374 degrees of freedom Multiple
R-squared: 0.8117, Adjusted R-squared: 0.8106 F-statistic: 731 on 14 and
2374 DF, p-value: < 2.2e-16
There are a number of issues with the model diagnostics. The
Linearity, and QQ plots show deviation from the straight line at the
boundaries. In addition, the reference line for the Homogeneity of
Variance graph is not flat. This indicates that residual variance is not
consistent across the model values. Further exploration could be
conducted to see if there are smaller date ranges and prediction ranges
that generate a model that better aligns with the linear regression
assumptions.

PREDICT
The next step is to use the selected model to predict the testing
data set. The Yardstick package was used to calculate model performance,
including the \(R^2\). With an \(R^2=0.8318\) the performance on the testing
data set is consistent with the model summary. In the Residual
vs. TARGET_WINS graph, there appears to be wins range between 60 and 110
that generates more accurate forecasts. However, it should be noted that
the residuals appear to drift downwards. TARGET_WINS less than the mean
are overestimated and TARGET_WINS greater than the mean is over
estimated. yearID era_cat TARGET_WINS TEAM_BATTING_H “” “” “” “”
TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB “” “” “”
“” TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H TEAM_PITCHING_HR
“pmm” “pmm” “” “” TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
TEAM_FIELDING_DP “” “” “” “”

|
|
.metric
|
.estimator
|
.estimate
|
|
1
|
mape
|
standard
|
7.0714
|
|
2
|
smape
|
standard
|
6.9789
|
|
3
|
mase
|
standard
|
0.3186
|
|
4
|
mpe
|
standard
|
-0.4950
|
|
5
|
rmse
|
standard
|
6.6625
|
|
6
|
rsq
|
standard
|
0.7980
|

Appendix c: Pythagorean Model
Bill James developed the Pythagorean winning percentage. The concept
strives to calculate the number of games a team should win based on the
total offense and the number of runs allowed. Since this model includes
total runs scored vs. total runs allowed the expectation is that this
model will be a good predictor of a team’s wins in a given season.
\(Win Percentage = (Runs Scored)^2 / [
(Runs Scored)^2 + (Runs Allowed)^2]\)
As we expected, the Pythagorean model performs well \(Adj R^2\) = 0.9112. However, this linear
model differs slightly from the formal definition since it includes an
intercept of 5.21 and a coefficient for the Pythagorean factor of
151.39. The formal definition of the model would have an intercept of 0
and a coefficient of 162.
Call: lm(formula = TARGET_WINS ~ pythPercent, data =
trainingTeam_df)
Residuals: Min 1Q Median 3Q Max -38.694 -3.030 0.027 3.074 18.453
Coefficients: Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.8046 0.4995 9.619 <2e-16 pythPercent
152.3365 0.9836 154.876 <2e-16 — Signif. codes: 0
‘’ 0.001 ’’ 0.01 ’’ 0.05 ‘.’ 0.1 ’ ’ 1
Residual standard error: 4.61 on 2387 degrees of freedom Multiple
R-squared: 0.9095, Adjusted R-squared: 0.9095 F-statistic: 2.399e+04 on
1 and 2387 DF, p-value: < 2.2e-16
The diagnostic plots show residuals that are normally distributed
with a linear relationship to the fitted values. The homogeneity of
variance is curved, indicating that the variance of residuals is not
consistent.

The next step is to use the Pythagorean Model to predict the testing
data set. The Yardstick package was used to calculate model performance,
including the \(R^2\). With an \(R^2=0.9131\) the performance on the testing
data set is consistent with the model summary. In the Residual
vs. TARGET_WINS graph there appears to be a wins range between 50 and
130 that generates more accurate forecasts.

|
|
.metric
|
.estimator
|
.estimate
|
|
1
|
mape
|
standard
|
4.9670
|
|
2
|
smape
|
standard
|
4.6989
|
|
3
|
mase
|
standard
|
0.2031
|
|
4
|
mpe
|
standard
|
-1.4262
|
|
5
|
rmse
|
standard
|
4.4704
|
|
6
|
rsq
|
standard
|
0.9187
|

Unsurprising the Pythagorean Model performs well when it comes to win
projections. There is clear relationship between the total yearly run
differential and the number of games won. It is a simple model but
efficient.
Wikipedia contributors. 2022.
“Slugging Percentage —
Wikipedia, the Free Encyclopedia.” https://en.wikipedia.org/w/index.php?title=Slugging_percentage&oldid=1123388426.