Notes / Questions
there are a couple of variable with high correlations that we might want to explore
Batting_hr - Pitching_hr = .97 - this is VERY, VERY SUSPICIOUS. Do most baseball games end with similar runs for both teams? Aren’t there many one-sided wins?
Batting_hr - Batting_so = .73 - this might be plausible. Some teams swing for the fences instead of trying to get on base; I’m looking at you, Chicago Cubs!
what is the cutoff in correlation that will require adjustment - good question. Above 0.9? Most baseball stats measure distinct actions and occurances. 0.9 correlations show a serious data problem or something else is fishy.
Review the distributions plots for each attribute focus on non normally distributed attributes - please read the initial attempt below.
Box plots show outliers discuss valid data based on https://www.baseball-reference.com/leagues/majors/bat.shtml
Given the number of games in a year I don’t believe that 0 is a valid measurement for most of the attributes
use MICE to fill missing values vs Median or Mean
add feature TEAM_1B? By Cliff
define range for this stats. By David
The use of historical statistics to predict future outcomes, particularly wins and losses, and identify opportunities for improving team or individual performance, has gained significant attention in professional sports. The aim of this analysis is to develop several models that can predict a baseball team’s wins over a season based on team stats such as homeruns, strikeouts, base hits, and more. We will begin by examining the data for any issues, such as missing data, or outliers, and take the necessary measures to clean the data. We will subsequently create and evaluate three different linear models that forecast seasonal wins using the dataset, which includes both training and evaluation data. We will train the models using the main training data and then evaluate their performance against the evaluation data to determine their effectiveness. Finally, we will choose the best model that balances accuracy and simplicity for predicting seasonal wins.
The table below shows us some valuable descriptive statistics for the training data. The data set contains all integers. We can see that many of the variables have a minimum of 0 but not all. The means and medians of each variable are all relatively close in value for each individual variable. This tells us that most data is free from extreme outliers as they tend to skew the mean relative to the median.
One interesting piece of information is the min/max of the
TARGET_WINS variable. The minimum is 0 meaning there are
teams that did not win a single game. The maximum is 146 which indicates
no team in the training dataset had a perfect season, as we know from
the data a season consists of 162 games.
Also of note is the number of missing values from certain variables.
Most notably the TEAM_BATTING_HBP (batters hit by pitch
variable). With 91% of the data missing we will remove this variable
from our dataset because there simply is not enough information to
impute a sensible value. The column TEAM_BASERUN_CS (caught
stealing) had 34% of the missing data, we may consider removing it
later. The missing data for these two columns may be due to a change
official rules or tactics before the modern era of baseball.
Next, we’ll visually check for normal distributions and box plots in
both the dependent and independent variables. The density plot below
shows normalcy in most features except for extremely right skewed
features such as hits allowed (PITCHING_H) or errors (FIELDING_E).
Homeruns by batters (BATTING_HR) and strikeouts by batters (BATTING_SO)
variables seem bimodal. It implies the existence of two distinct
clusters within the baseball season data, where teams tended to score
more in one of the clusters.
Box plots for these further show a high number of outliers exist outside
of the interquartile ranges so their effects should be carefully
considered and we may deal with non-unimodal distributions.
Lastly, the function featurePlot() will show the relationship between
independent variables and the target variable TARGET_WINS.
In general, while our graphs display certain intriguing connections
among the variables, they also expose noteworthy problems with the data.
For example, the dataset contains a team that has not won any games,
which appears improbable. By checking the web data, we found that it
actually happened 2 times: 1872 and 1873. Or that the pitching data
contains numerous instances of 0’s, several teams have 0 strikeouts by
their pitchers over the season, which is highly improbable. Also, there
is as a team achieving 20,000 strikeouts. There will be further steps to
work with outliers and 0’s.
Plotting the correlations between TARGET_WINS and the
variables (excluding INDEX and
TEAM_BATTING_HBP) we can see that very few variables are
strongly correlated with the target variable. Columns with correlations
close to zero are unlikely to offer significant insights into the
factors that contribute to a team’s victories.
To avoid multicolinearity, we should not include features that have
strong correlation. Comparing offensive (any column starting with
BATTING or BASERUN) to defensive stats
unexpectedly shows some correlation, pointing to potential problems.
Qualitatively, the matrix implies some teams or players are exceptional
both at hitting (offensive) and fielding (defensive). Furthermore, a
typical team’s number of batted home runs and allowed home runs has a
correlation of nearly 1.0. This is an unexpected correlation but can be
explained by noticing most games are decided by a difference of one or
two runs (whether the games are high scoring or not). Any final models
should include one of these two home run variables. Alternatively, the
correlation between a team’s hits (BATTING_H) and hits
allowed (PITCHING_H) is around 0.3 which is seems
reasonable.
There are some other strong correlations that are less obvious such
as Errors (TEAM_FIELDING_E) being strongly negatively
correlated with walks by batters (TEAM_BATTING_BB), strike
outs (TEAM_BATTING_SO). All combined together, teams that
get a lot of hits do not generally make fielding errors.
Digging a little deeper we can see there is a Pearson correlation coefficient of -0.6559708 for errors and walks by batters which indicates a strong negative correlation between the two variables. Looking at errors compared with team pitching hits allowed we see a correlation of 0.667759 which indicates a strong positive correlation.
| Coefficient | |
|---|---|
| TARGET_WINS | 1.0000000 |
| TEAM_BATTING_H | 0.4699467 |
| TEAM_BATTING_2B | 0.3129840 |
| TEAM_BATTING_3B | -0.1243459 |
| TEAM_BATTING_HR | 0.4224168 |
| TEAM_BATTING_BB | 0.4686879 |
| TEAM_BATTING_SO | -0.2288927 |
| TEAM_BASERUN_SB | 0.0148364 |
| TEAM_BASERUN_CS | -0.1787560 |
| TEAM_BATTING_HBP | 0.0735042 |
| TEAM_PITCHING_H | 0.4712343 |
| TEAM_PITCHING_HR | 0.4224668 |
| TEAM_PITCHING_BB | 0.4683988 |
| TEAM_PITCHING_SO | -0.2293648 |
| TEAM_FIELDING_E | -0.3866880 |
| TEAM_FIELDING_DP | -0.1958660 |
Lastly, lets take a closer look at the missing data. We’ve already
determined that the batter hit by pitch (TEAM_BATTING_HBP)
variable is missing 91% of its data but what of the other variables. We
will just drop the column from the further analysis.
Using the plot below we can visualize the missingness of the remaining variables. There are 5 variables that contain varying degrees of missing data. We will use the information to fill in the missing values in our data preparation step.
TEAM_BASERUN_CS appears to be missing the second most
amount of values but at only 772 missing values out of 2276 this is much
less of a concern than the HBP variable we identified earlier. The
remaining variables that are missing data have less than 25% of their
data missing so should be safe to impute.
Notes / Questions
As discussed above, we will drop the INDEX and
TEAM_BATTING_HBP variables as the
TEAM_BATTING_HBP variable is missing 91% of its data and
the INDEX variable is just an identification Variable.
We’ll also derive a new column for single base hits derived from subtracting double, triples and home runs from the total number of hits.
For further work with NA’s, we create flags to suggest if a variable was missing.
To to impute the missing values in the trainDf data, the
mice library is used. To utilize MICE, one must make the
assumption that the missing values are missing at random, indicating
that the missingness can be explained by variables that have complete
information. The MICE algorithm then performs several iterations over
the data, as suggested by its name, and generates data to complete the
missing values. We check what impute method we use for each column.
“pmm” is predictive mean matching, replacing missing data with column
means.
Let’s also take a look at the density plots pre and post-imputation
to make sure densities look similar. Unfortunately, for
TEAM_BASERUN_SB, TEAM_BATTING_SO, and
TEAM_FIELDING DP they do not. But in the case of
TEAM_BATTING_SO our distribution becomes roughly more
normal, so it may be beneficial. For the TEAM_BASERUN_SB
and TEAM_BASERUN_CS and TEAM_FIELDING_DP we
may need to consider alternative methods.
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## "" "" "" ""
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## "" "" "pmm" "pmm"
## TEAM_BASERUN_CS TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## "pmm" "" "" ""
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP TEAM_BATTING_1B
## "pmm" "" "pmm" ""
## na_flag
## ""
We should also try to transform some variables so that they may fit a
more normal distribution, particularly TEAM_BATTING_HR,
TEAM_BATTING_SO, TEAM_PITCHING_HR, but also
TEAM_PITCHING_H, TEAM_PITCHING_BB,
TEAM_PITCHING_SO, and TEAM_FIELDING_E. Square
rooting these variables helps, as does removing the skewness by reducing
the low density ranges. The plots below compare the original
distributions of non-normal variables and transformed ones. This is just
one of the ways to handle the data that doesn’t follow normal
distribution. The other way is to use Box-Cox transformations, we may
try it after fitting the model.
Our current dataset is devoid of any missing data values, and we have
excluded the irrelevant INDEX and the
TEAM_BATTING_HBP variables, which had 91% missing values.
As shown in the table below, no missing data values exist anymore, and
we can analyze how the summary statistics may have altered with the
imputed data.
Notes / Questions
At this juncture, with a thorough comprehension of our dataset and having completed the data cleaning process, we can initiate the construction of our multiple linear regression models. We will build four separate linear models and compare their performance.
For the first model, we will select all the variables from the original un-cleaned dataset. We may use this model as a base model to compare with.
The second model will be based on the cleaned dataset without missing values and non-normal distributions. We chose variables TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_FIELDING_E based on the intuition and the understanding of the data.
The third model will contain the variables from the cleaned dataset.
For the fourth model we can run the imputed + features that were engineered (i.e. sqrt), and pick the best model.
Below are the results \(R^2\), residual standard error, and F-statistics of each model. Surprisingly the non-cleaned, non-imputed raw training data had the best fitting statistics.
| r | rsse | adjusted.r |
|---|---|---|
| 0.55 | 8.49 | 0.51 |
| 0.19 | 14.16 | 0.19 |
| 0.37 | 12.50 | 0.37 |
Next, we apply the models to the evaluation dataset to make predictions. However, to ensure that the models 2-4 work properly, we will fill in the missing values in the evaluation data set using the same imputation method - mice, for model 4 we will also make necessary transformations.
CHANGE SINCE THE MODELS WILL BE DIFFERENT The table
below contains the predicted TARGET_WINS for each model.
Upon initial examination, it is noticeable that the first model is
generating predictions with negative values. This is not a realistic
outcome as it is impossible to have a negative number of wins.
Therefore, this model is not particularly valuable. However, the second
and third models, which utilize cleaned and imputed data, do not
encounter this problem of producing large negative predictions.
Generally, both the AIC-generated model and the second model are
yielding comparable results.
| lm1 | lm2 | aic | aic4 |
|---|---|---|---|
| 49.79 | 71.67 | 63.16 | 63.19 |
| 60.34 | 70.29 | 66.20 | 67.17 |
| 65.66 | 72.05 | 75.63 | 72.70 |
| 79.89 | 83.42 | 87.95 | 86.49 |
| -3970.00 | 69.26 | 70.58 | 90.81 |
| -1633.38 | 69.45 | 72.93 | 73.72 |
| 25.95 | 72.11 | 85.77 | 81.24 |
| -134.29 | 67.43 | 76.36 | 69.14 |
| 26.47 | 77.89 | 70.35 | 72.53 |
| 33.99 | 74.41 | 74.35 | 71.93 |
| 42.44 | 75.16 | 70.27 | 69.09 |
| 68.81 | 85.61 | 84.34 | 83.81 |
| 94.21 | 86.82 | 83.12 | 85.07 |
| 91.46 | 78.95 | 84.99 | 84.35 |
| 85.54 | 74.37 | 87.32 | 84.82 |
We can also see when plotting the predictions that there doesn’t seem to be much obvious difference between the models aside from the clearly outrageous outliers generated by the first model, and the 4th model showing a slightly tighter range in wins.
We can the graphs below to check the validity of our models. All models suffer from a lack of linearity which indicates that a linear regression model may not be the greatest technique for predicting values from this data with the given variables. The models that included all the most variables (model 1, model 3, model 4), suffer from co-linearity issues.
Model 1
Model 2
Model 3
Model 4
Notes / Questions
Overall none of the models that I was able to generate instill much confidence in their ability to predict. The model with the best fit according to the \(R^2\) statistic was filled with missing data that caused clearly incorrect negative predictions.
The second and third models both had significantly lower \(R^2\) scores which indicated a poor fit overall. In addition, none of the models performed well when checked for linearity or homogeneity of variance. While the second model did not suffer from colinearity issues the other two models did.
Despite significant efforts to compensate for poor data quality, the resulting models are poor predictors of win totals. Moreover, the poor data quality is inconsistent with the overall state of baseball statistics. When it comes to the major sports, baseball has the most mature statistics available. Therefore finding better data is the best course of action for developing a better predictive model.
We were able to locate a cleaner version of the same data set provided in the class. The Lahman’s Baseball Database includes the same variables as the sample database with fewer errors and additional reference data that would allow us to connect the database to other sources.
https://www.seanlahman.com/baseball-archive/statistics/
teamDF <- evalDf <- read.csv("https://raw.githubusercontent.com/cliftonleesps/data621_group1/main/hw1/Teams.csv")
teamDF <- teamDF %>% dplyr::select('yearID','lgID','teamID','franchID','divID','Rank','G','W','L',
'R','AB','H','X2B','X3B','HR','BB','SO','SB','CS','HBP','SF','RA','ER','ERA',
'CG','SHO','SV','IPouts','HA','HRA','BBA','SOA','E','DP','FP','name')
# normalize data 162 games
teamDF$gFactor <- 162 / (teamDF$W + teamDF$L)
teamDF$TARGET_WINS <- teamDF$gFactor * teamDF$W
teamDF$TEAM_BATTING_H <- teamDF$gFactor * teamDF$H
teamDF$TEAM_BATTING_2B <- teamDF$gFactor * teamDF$X2B
teamDF$TEAM_BATTING_3B <- teamDF$gFactor * teamDF$X3B
teamDF$TEAM_BATTING_HR <- teamDF$gFactor * teamDF$HR
teamDF$TEAM_BATTING_BB <- teamDF$gFactor * teamDF$BB
teamDF$TEAM_BATTING_SO <- teamDF$gFactor * teamDF$SO
teamDF$TEAM_BASERUN_SB <- teamDF$gFactor * teamDF$SB
teamDF$TEAM_BASERUN_CS <- teamDF$gFactor * teamDF$CS
teamDF$TEAM_BATTING_HBP <- teamDF$gFactor * teamDF$HBP
teamDF$TEAM_PITCHING_H <- teamDF$gFactor * teamDF$HA
teamDF$TEAM_PITCHING_HR <- teamDF$gFactor * teamDF$HRA
teamDF$TEAM_PITCHING_BB <- teamDF$gFactor * teamDF$BBA
teamDF$TEAM_PITCHING_SO <- teamDF$gFactor * teamDF$SOA
teamDF$TEAM_FIELDING_E <- teamDF$gFactor * teamDF$E
teamDF$TEAM_FIELDING_DP <- teamDF$gFactor * teamDF$DP
# Pythagorean winning percentage
# (runs scored ^ 2) / [(runs scored ^ 2) + (runs allowed ^ 2)]
teamDF$pythPercent <- (teamDF$R^2) / ((teamDF$R^2) + (teamDF$RA^2))
A significant advantage of Lahman’s data set over the data set provided in class is that it includes information about the year and the team. This data is valuable when considering how baseball has changed over the years. The modern era in baseball is often delineated by the turn of the century. However, when looking at the past 120 years of baseball history, it is easy to pinpoint rule changes, evolutions in playing strategy, and league structure that have fundamentally impacted the game.
When comparing statistics across time, it is common to use many of the breakdowns below to add context to the analysis:
Surveying these periods would suggest that a more granular model has the potential to perform better.
Although we could have chosen any number of the time periods above, exploring the statistical outliers highlights that many of these values correspond to the pre-1969 period. This delineation has some historical support. As Jayson Stark of ESPN argues in this article (https://www.espn.com/mlb/columns/story?columnist=stark_jayson&id=2471349) In 1969 the MLB underwent several rule changes and changes to the league structure that impacted win totals and team statistics. 1969 was the first year of division play and the expanded postseason. The Pitcher’s Mound was lowered five inches. The Strike zone shrinks. Five-person rotations kicking in. The save was invented. And more expansion to the unbalanced schedules.
Thus using 1900 as the beginning of the modern era and 1969 as an additional breakpoint, the dataset can be divided into three segments. The density profiles for the predictor variables approach a normal distribution when grouped by the three segments we identified. To support data exploration, we added a era_cat field to the data set.
teamDF <- teamDF %>% mutate(era_cat = case_when(yearID >= 1969 ~ '1969+',
yearID >= 1900 & yearID < 1969 ~ '1900-1969',
yearID < 1900 ~ '1900-'))
teamAdjDF <- teamDF %>% dplyr::select("yearID","era_cat","TARGET_WINS","TEAM_BATTING_H","TEAM_BATTING_2B",
"TEAM_BATTING_3B",
"TEAM_BATTING_HR","TEAM_BATTING_BB","TEAM_BATTING_SO","TEAM_BASERUN_SB",
"TEAM_BASERUN_CS","TEAM_BATTING_HBP","TEAM_PITCHING_H","TEAM_PITCHING_HR",
"TEAM_PITCHING_BB", "TEAM_PITCHING_SO","TEAM_FIELDING_E","TEAM_FIELDING_DP")
random_sample <- createDataPartition(teamAdjDF$TARGET_WINS,p = 0.8, list = FALSE)
trainingTeam_df <- teamAdjDF[random_sample, ]
testingTeam_df <- teamAdjDF[-random_sample, ]
In general the Lahman’s data set contains fewer data gaps and the variables are more consistently distributed. There are some missing values in the data set including the Caught Stealing (CS/TEAM_BASERUN_CS) variable is missing 27.9%; the Batters Hit by Pitch (HBP/TEAM_BATTING_HBP) variable is missing 38.8%; and the Sacrifice Flies (SF) variable is missing 51.6% of the values.
Most of the variables in the data set show some level of skewness, with the following variables having a Kurtosis measure of greater than 3, TEAM_BATTING_H, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_PITCHING_H, and TEAM_FIELDING_E
The Lahman data set contains several variables with bimodal distributions, including, TEAM_BATTING_HR, TEAM_BATTING_SO, TEAM_BATTING_HR, and TEAM_PITCHING_SO.
r_num <- nrow(teamDF)
stats <- teamDF %>% describe()
displayDf <- as.data.frame(stats) %>%
dplyr::mutate(missing = r_num - n,
missing_per = (r_num - n) / r_num *100
) %>%
dplyr::select(-c(vars, n, trimmed, mad, range))
kable(displayDf, digits = 2, format = "html", row.names = T) %>%
kable_styling(
bootstrap_options = c("striped", "hover"),
full_width = F,
position = "center"
)
| mean | sd | median | min | max | skew | kurtosis | se | missing | missing_per | |
|---|---|---|---|---|---|---|---|---|---|---|
| yearID | 1958.86 | 42.99 | 1967.00 | 1871.00 | 2021.00 | -0.36 | -1.11 | 0.79 | 0 | 0.00 |
| lgID* | 3.04 | 1.06 | 4.00 | 1.00 | 6.00 | -0.12 | -1.56 | 0.02 | 50 | 1.68 |
| teamID* | 71.83 | 40.16 | 73.00 | 1.00 | 149.00 | -0.01 | -1.25 | 0.74 | 0 | 0.00 |
| franchID* | 55.61 | 33.32 | 57.00 | 1.00 | 120.00 | 0.00 | -1.21 | 0.61 | 0 | 0.00 |
| divID* | 2.08 | 1.21 | 1.00 | 1.00 | 4.00 | 0.50 | -1.41 | 0.02 | 0 | 0.00 |
| Rank | 4.04 | 2.29 | 4.00 | 1.00 | 13.00 | 0.61 | -0.04 | 0.04 | 0 | 0.00 |
| G | 150.01 | 24.43 | 159.00 | 6.00 | 165.00 | -3.14 | 10.33 | 0.45 | 0 | 0.00 |
| W | 74.61 | 18.00 | 77.00 | 0.00 | 116.00 | -1.09 | 1.80 | 0.33 | 0 | 0.00 |
| L | 74.61 | 17.75 | 76.00 | 4.00 | 134.00 | -0.82 | 1.32 | 0.32 | 0 | 0.00 |
| R | 681.03 | 139.53 | 691.00 | 24.00 | 1220.00 | -0.79 | 2.31 | 2.55 | 0 | 0.00 |
| AB | 5128.99 | 798.21 | 5402.00 | 211.00 | 5781.00 | -3.16 | 10.87 | 14.61 | 0 | 0.00 |
| H | 1339.42 | 230.87 | 1390.00 | 33.00 | 1783.00 | -2.37 | 7.24 | 4.23 | 0 | 0.00 |
| X2B | 228.67 | 59.83 | 234.00 | 1.00 | 376.00 | -0.65 | 0.64 | 1.10 | 0 | 0.00 |
| X3B | 45.67 | 22.49 | 40.00 | 0.00 | 150.00 | 0.89 | 0.47 | 0.41 | 0 | 0.00 |
| HR | 105.87 | 63.98 | 110.00 | 0.00 | 307.00 | 0.10 | -0.93 | 1.17 | 0 | 0.00 |
| BB | 473.56 | 132.32 | 494.00 | 1.00 | 835.00 | -1.28 | 2.38 | 2.42 | 1 | 0.03 |
| SO | 762.13 | 319.35 | 761.00 | 3.00 | 1596.00 | 0.10 | -0.53 | 5.86 | 16 | 0.54 |
| SB | 109.41 | 69.72 | 93.00 | 1.00 | 581.00 | 1.77 | 4.81 | 1.30 | 126 | 4.22 |
| CS | 46.55 | 21.91 | 44.00 | 3.00 | 191.00 | 1.70 | 6.28 | 0.47 | 832 | 27.87 |
| HBP | 45.82 | 18.13 | 43.00 | 7.00 | 160.00 | 0.75 | 1.13 | 0.42 | 1158 | 38.79 |
| SF | 44.11 | 10.21 | 44.00 | 7.00 | 77.00 | -0.20 | 1.06 | 0.27 | 1541 | 51.62 |
| RA | 681.03 | 139.21 | 689.00 | 34.00 | 1252.00 | -0.51 | 1.69 | 2.55 | 0 | 0.00 |
| ER | 573.41 | 149.89 | 594.00 | 23.00 | 1023.00 | -0.86 | 1.09 | 2.74 | 0 | 0.00 |
| ERA | 3.84 | 0.77 | 3.84 | 1.22 | 8.00 | 0.00 | 0.44 | 0.01 | 0 | 0.00 |
| CG | 47.55 | 39.32 | 41.00 | 0.00 | 148.00 | 0.56 | -0.78 | 0.72 | 0 | 0.00 |
| SHO | 9.59 | 5.08 | 9.00 | 0.00 | 32.00 | 0.65 | 0.71 | 0.09 | 0 | 0.00 |
| SV | 24.42 | 16.32 | 25.00 | 0.00 | 68.00 | 0.03 | -1.23 | 0.30 | 0 | 0.00 |
| IPouts | 4013.24 | 663.32 | 4252.00 | 162.00 | 4518.00 | -3.04 | 9.76 | 12.14 | 0 | 0.00 |
| HA | 1339.20 | 231.01 | 1389.00 | 49.00 | 1993.00 | -2.21 | 6.28 | 4.23 | 0 | 0.00 |
| HRA | 105.87 | 60.88 | 113.00 | 0.00 | 305.00 | -0.03 | -0.98 | 1.11 | 0 | 0.00 |
| BBA | 473.75 | 131.74 | 495.00 | 1.00 | 827.00 | -1.36 | 2.51 | 2.41 | 0 | 0.00 |
| SOA | 761.58 | 320.50 | 762.00 | 0.00 | 1687.00 | 0.10 | -0.52 | 5.87 | 0 | 0.00 |
| E | 180.78 | 108.37 | 141.00 | 20.00 | 639.00 | 1.71 | 2.51 | 1.98 | 0 | 0.00 |
| DP | 132.62 | 35.90 | 140.00 | 0.00 | 217.00 | -1.04 | 1.14 | 0.66 | 0 | 0.00 |
| FP | 0.97 | 0.03 | 0.98 | 0.76 | 0.99 | -3.01 | 10.72 | 0.00 | 0 | 0.00 |
| name* | 72.89 | 37.85 | 77.00 | 1.00 | 139.00 | -0.20 | -1.09 | 0.69 | 0 | 0.00 |
| gFactor | 1.19 | 0.96 | 1.04 | 0.98 | 27.00 | 14.78 | 293.77 | 0.02 | 0 | 0.00 |
| TARGET_WINS | 80.74 | 15.35 | 81.73 | 0.00 | 145.59 | -0.47 | 1.34 | 0.28 | 0 | 0.00 |
| TEAM_BATTING_H | 1459.40 | 140.36 | 1445.04 | 819.00 | 2561.80 | 1.48 | 7.25 | 2.57 | 0 | 0.00 |
| TEAM_BATTING_2B | 246.90 | 47.00 | 247.00 | 27.00 | 457.90 | -0.02 | 0.04 | 0.86 | 0 | 0.00 |
| TEAM_BATTING_3B | 51.19 | 27.65 | 42.26 | 0.00 | 223.45 | 1.21 | 1.71 | 0.51 | 0 | 0.00 |
| TEAM_BATTING_HR | 110.71 | 63.72 | 115.41 | 0.00 | 318.60 | 0.07 | -0.90 | 1.17 | 0 | 0.00 |
| TEAM_BATTING_BB | 503.72 | 115.43 | 512.44 | 12.46 | 878.38 | -1.03 | 2.57 | 2.11 | 1 | 0.03 |
| TEAM_BATTING_SO | 808.66 | 296.70 | 812.01 | 44.18 | 1641.60 | 0.17 | -0.43 | 5.45 | 16 | 0.54 |
| TEAM_BASERUN_SB | 119.49 | 83.04 | 97.00 | 3.68 | 697.20 | 2.15 | 6.54 | 1.55 | 126 | 4.22 |
| TEAM_BASERUN_CS | 49.03 | 22.60 | 45.23 | 8.10 | 200.92 | 1.94 | 7.18 | 0.49 | 832 | 27.87 |
| TEAM_BATTING_HBP | 49.02 | 19.88 | 47.00 | 9.00 | 173.96 | 0.96 | 2.15 | 0.47 | 1158 | 38.79 |
| TEAM_PITCHING_H | 1463.01 | 156.72 | 1447.27 | 661.50 | 3888.00 | 2.84 | 27.29 | 2.87 | 0 | 0.00 |
| TEAM_PITCHING_HR | 110.81 | 60.17 | 117.82 | 0.00 | 305.00 | -0.05 | -0.96 | 1.10 | 0 | 0.00 |
| TEAM_PITCHING_BB | 504.19 | 114.81 | 515.00 | 21.73 | 881.41 | -1.11 | 2.77 | 2.10 | 0 | 0.00 |
| TEAM_PITCHING_SO | 807.68 | 299.75 | 805.98 | 0.00 | 1876.50 | 0.16 | -0.35 | 5.49 | 0 | 0.00 |
| TEAM_FIELDING_E | 225.53 | 222.14 | 144.89 | 54.00 | 1998.00 | 3.30 | 13.49 | 4.07 | 0 | 0.00 |
| TEAM_FIELDING_DP | 141.88 | 27.58 | 145.00 | 0.00 | 228.27 | -0.46 | 0.50 | 0.50 | 0 | 0.00 |
| pythPercent | 0.50 | 0.10 | 0.50 | 0.03 | 0.85 | -0.36 | 0.97 | 0.00 | 0 | 0.00 |
| era_cat* | 2.37 | 0.70 | 2.00 | 1.00 | 3.00 | -0.64 | -0.76 | 0.01 | 0 | 0.00 |
m_df <- teamDF[random_sample, ] %>%
dplyr::select("TARGET_WINS","TEAM_BATTING_H","TEAM_BATTING_2B","TEAM_BATTING_3B",
"TEAM_BATTING_HR","TEAM_BATTING_BB","TEAM_BATTING_SO","TEAM_BASERUN_SB",
"TEAM_BASERUN_CS","TEAM_BATTING_HBP","TEAM_PITCHING_H","TEAM_PITCHING_HR",
"TEAM_PITCHING_BB", "TEAM_PITCHING_SO","TEAM_FIELDING_E","TEAM_FIELDING_DP") %>% melt()
m_df %>% ggplot(aes(x= value)) +
geom_density(color='#023020', fill='gray') +
facet_wrap(~variable, scales = 'free', ncol = 4) +
theme_bw() +
labs(
title = 'Variable Density Plots'
)
When we group the statistics by the three categories presented earliers, we see a much cleaner density plot across all variables. There are few signs of bimodal distributions, and the skewness of individual variables is reduced greatly.
df0 <- teamDF[random_sample, ] %>% filter(era_cat == '1969+')
m_df <- df0 %>%
dplyr::select("TARGET_WINS","TEAM_BATTING_H","TEAM_BATTING_2B","TEAM_BATTING_3B",
"TEAM_BATTING_HR","TEAM_BATTING_BB","TEAM_BATTING_SO","TEAM_BASERUN_SB",
"TEAM_BASERUN_CS","TEAM_BATTING_HBP","TEAM_PITCHING_H","TEAM_PITCHING_HR",
"TEAM_PITCHING_BB", "TEAM_PITCHING_SO","TEAM_FIELDING_E","TEAM_FIELDING_DP") %>% melt()
m_df %>% ggplot(aes(x= value)) +
geom_density(color='#023020', fill='gray') +
facet_wrap(~variable, scales = 'free', ncol = 4) +
theme_bw() +
labs(
title = 'Variable Density Plots (1969+)'
)
The training data set exibits the following characteristics.
r_num <- nrow(trainingTeam_df)
stats <- trainingTeam_df %>% describe()
displayDf <- as.data.frame(stats) %>%
dplyr::mutate(missing = r_num - n,
missing_per = (r_num - n) / r_num *100
) %>%
dplyr::select(-c(vars, n, trimmed, mad, range))
kable(displayDf, digits = 2, format = "html", row.names = T) %>%
kable_styling(
bootstrap_options = c("striped", "hover"),
full_width = F,
position = "center"
)
| mean | sd | median | min | max | skew | kurtosis | se | missing | missing_per | |
|---|---|---|---|---|---|---|---|---|---|---|
| yearID | 1958.78 | 43.20 | 1967.00 | 1871.00 | 2021.00 | -0.35 | -1.12 | 0.88 | 0 | 0.00 |
| era_cat* | 2.36 | 0.70 | 2.00 | 1.00 | 3.00 | -0.63 | -0.77 | 0.01 | 0 | 0.00 |
| TARGET_WINS | 80.84 | 15.32 | 81.73 | 0.00 | 145.59 | -0.43 | 1.33 | 0.31 | 0 | 0.00 |
| TEAM_BATTING_H | 1459.09 | 139.53 | 1445.00 | 819.00 | 2561.80 | 1.44 | 7.22 | 2.85 | 0 | 0.00 |
| TEAM_BATTING_2B | 246.70 | 46.81 | 246.20 | 27.00 | 457.90 | -0.01 | 0.13 | 0.96 | 0 | 0.00 |
| TEAM_BATTING_3B | 51.27 | 27.72 | 43.00 | 0.00 | 223.45 | 1.23 | 1.90 | 0.57 | 0 | 0.00 |
| TEAM_BATTING_HR | 110.21 | 63.60 | 114.66 | 0.00 | 318.60 | 0.08 | -0.89 | 1.30 | 0 | 0.00 |
| TEAM_BATTING_BB | 504.74 | 115.94 | 514.51 | 12.46 | 878.38 | -1.02 | 2.51 | 2.37 | 1 | 0.04 |
| TEAM_BATTING_SO | 807.76 | 298.09 | 810.00 | 44.18 | 1641.60 | 0.19 | -0.43 | 6.11 | 12 | 0.50 |
| TEAM_BASERUN_SB | 120.12 | 83.87 | 96.00 | 16.83 | 697.20 | 2.18 | 6.72 | 1.75 | 102 | 4.27 |
| TEAM_BASERUN_CS | 49.21 | 22.95 | 45.00 | 8.42 | 200.92 | 1.99 | 7.20 | 0.55 | 671 | 28.09 |
| TEAM_BATTING_HBP | 48.96 | 19.88 | 47.27 | 10.60 | 173.96 | 1.04 | 2.55 | 0.52 | 927 | 38.80 |
| TEAM_PITCHING_H | 1461.96 | 155.49 | 1447.27 | 661.50 | 3888.00 | 2.89 | 31.05 | 3.18 | 0 | 0.00 |
| TEAM_PITCHING_HR | 110.58 | 60.08 | 117.24 | 0.00 | 305.00 | -0.05 | -0.96 | 1.23 | 0 | 0.00 |
| TEAM_PITCHING_BB | 504.27 | 115.55 | 514.40 | 21.73 | 854.18 | -1.12 | 2.79 | 2.36 | 0 | 0.00 |
| TEAM_PITCHING_SO | 808.28 | 302.32 | 804.74 | 0.00 | 1876.50 | 0.19 | -0.35 | 6.19 | 0 | 0.00 |
| TEAM_FIELDING_E | 225.36 | 220.11 | 144.68 | 54.00 | 1998.00 | 3.27 | 13.38 | 4.50 | 0 | 0.00 |
| TEAM_FIELDING_DP | 141.64 | 27.80 | 145.00 | 0.00 | 228.27 | -0.44 | 0.42 | 0.57 | 0 | 0.00 |
m_df <- trainingTeam_df %>% melt()
m_df %>% ggplot(aes(x= value)) +
geom_density(color='#023020', fill='gray') + facet_wrap(~variable, scales = 'free', ncol = 4) + theme_bw()
m_df %>% ggplot(aes(x = value)) +
geom_boxplot(outlier.color = 'red', outlier.shape = 1) +
facet_wrap(vars(variable),scales = "free", ncol = 4)
rcore <- rcorr(as.matrix(trainingTeam_df %>% dplyr::select(where(is.numeric))))
coeff <- rcore$r
corrplot(coeff, tl.cex = .7 , method = 'pie')
The use of the era_cat variable allows us to group the data set into 3 categories that variables that approach normal distribution.
Since segmenting the data set using the era_cat variable creates 3 categories of variables that approach the normal distribution, we will focus our data preparation step on missing values. The documentation for MICE package recommends that a 5% threshold should be observed for safely imputing missing values. Based on this rule of thumb we will drop the ‘TEAM_BASERUN_CS’ and the ‘TEAM_BATTING_HBP’ variables.
The remaining variables with missing data (‘TEAM_BATTING_BB’,‘TEAM_BATTING_SO’ and ‘TEAM_BASERUN_SB’) will be impuned using the MICE package. Several methods can be used but for simplicity we selected m=5 the default method.
trainingTeam_df <- trainingTeam_df %>% dplyr::select(-c(TEAM_BASERUN_CS,TEAM_BATTING_HBP))
testingTeam_df <- testingTeam_df %>% dplyr::select(-c(TEAM_BASERUN_CS,TEAM_BATTING_HBP))
imputeDf <- mice(trainingTeam_df, m = 5, maxit = 50, seed = 123, printFlag = F)
imputeDf$meth
## yearID era_cat TARGET_WINS TEAM_BATTING_H
## "" "" "" ""
## TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## "" "" "" "pmm"
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H TEAM_PITCHING_HR
## "pmm" "pmm" "" ""
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## "" "" "" ""
trainingTeam_df <- complete(imputeDf)
m_imputed <- melt(trainingTeam_df)
densityplot(imputeDf)
The first model includes all the variables from the original data set with the exception of the yearID. It could be argued that year might have an impact on the numbers, however we are accounting for year with the era_cat variable. The initial model has an \(AdjR^2 = 0.8078\), and the model selected by stepAIC includes the same variables and has the same \(AdjR^2=8078\).
lmA1 <- lm(TARGET_WINS ~ . -yearID, data = trainingTeam_df, by=era_cat)
summary(lmA1)
##
## Call:
## lm(formula = TARGET_WINS ~ . - yearID, data = trainingTeam_df,
## by = era_cat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.123 -4.070 0.068 4.063 84.686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.884646 3.475663 22.409 < 2e-16 ***
## era_cat1900-1969 0.735167 0.875849 0.839 0.401343
## era_cat1969+ -0.086667 0.986172 -0.088 0.929977
## TEAM_BATTING_H 0.055077 0.001993 27.640 < 2e-16 ***
## TEAM_BATTING_2B 0.012188 0.004808 2.535 0.011314 *
## TEAM_BATTING_3B 0.010312 0.009523 1.083 0.278963
## TEAM_BATTING_HR 0.103610 0.005762 17.981 < 2e-16 ***
## TEAM_BATTING_BB 0.045028 0.001858 24.236 < 2e-16 ***
## TEAM_BATTING_SO -0.010571 0.001448 -7.301 3.89e-13 ***
## TEAM_BASERUN_SB 0.022586 0.002541 8.889 < 2e-16 ***
## TEAM_PITCHING_H -0.057635 0.001482 -38.880 < 2e-16 ***
## TEAM_PITCHING_HR -0.095161 0.006422 -14.819 < 2e-16 ***
## TEAM_PITCHING_BB -0.055623 0.001838 -30.259 < 2e-16 ***
## TEAM_PITCHING_SO 0.009169 0.001390 6.595 5.22e-11 ***
## TEAM_FIELDING_E -0.005840 0.001756 -3.327 0.000892 ***
## TEAM_FIELDING_DP 0.050092 0.007413 6.757 1.77e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.722 on 2373 degrees of freedom
## Multiple R-squared: 0.8088, Adjusted R-squared: 0.8076
## F-statistic: 669.2 on 15 and 2373 DF, p-value: < 2.2e-16
lmA1.step <- stepAIC(lmA1, trace = FALSE, by=era_cat)
summary(lmA1.step)
##
## Call:
## lm(formula = TARGET_WINS ~ era_cat + TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = trainingTeam_df,
## by = era_cat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.461 -4.102 0.001 4.056 84.478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.533608 3.460638 22.404 < 2e-16 ***
## era_cat1900-1969 0.656406 0.872856 0.752 0.45211
## era_cat1969+ -0.342597 0.957468 -0.358 0.72051
## TEAM_BATTING_H 0.055827 0.001868 29.880 < 2e-16 ***
## TEAM_BATTING_2B 0.011999 0.004805 2.497 0.01259 *
## TEAM_BATTING_HR 0.102651 0.005694 18.028 < 2e-16 ***
## TEAM_BATTING_BB 0.045068 0.001858 24.261 < 2e-16 ***
## TEAM_BATTING_SO -0.010395 0.001439 -7.224 6.76e-13 ***
## TEAM_BASERUN_SB 0.023346 0.002442 9.560 < 2e-16 ***
## TEAM_PITCHING_H -0.057591 0.001482 -38.864 < 2e-16 ***
## TEAM_PITCHING_HR -0.096177 0.006353 -15.139 < 2e-16 ***
## TEAM_PITCHING_BB -0.055483 0.001834 -30.257 < 2e-16 ***
## TEAM_PITCHING_SO 0.009061 0.001387 6.534 7.80e-11 ***
## TEAM_FIELDING_E -0.006021 0.001748 -3.445 0.00058 ***
## TEAM_FIELDING_DP 0.049660 0.007403 6.708 2.46e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.722 on 2374 degrees of freedom
## Multiple R-squared: 0.8087, Adjusted R-squared: 0.8076
## F-statistic: 716.8 on 14 and 2374 DF, p-value: < 2.2e-16
The TEAM_BATTING_3B variable was dropped due to a low p-values, giving us the final model with an \(AdjR^2=0.8078\).
lmA1.final <- update(lmA1.step, . ~ . -TEAM_BATTING_3B)
summary(lmA1.final)
##
## Call:
## lm(formula = TARGET_WINS ~ era_cat + TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = trainingTeam_df,
## by = era_cat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.461 -4.102 0.001 4.056 84.478
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.533608 3.460638 22.404 < 2e-16 ***
## era_cat1900-1969 0.656406 0.872856 0.752 0.45211
## era_cat1969+ -0.342597 0.957468 -0.358 0.72051
## TEAM_BATTING_H 0.055827 0.001868 29.880 < 2e-16 ***
## TEAM_BATTING_2B 0.011999 0.004805 2.497 0.01259 *
## TEAM_BATTING_HR 0.102651 0.005694 18.028 < 2e-16 ***
## TEAM_BATTING_BB 0.045068 0.001858 24.261 < 2e-16 ***
## TEAM_BATTING_SO -0.010395 0.001439 -7.224 6.76e-13 ***
## TEAM_BASERUN_SB 0.023346 0.002442 9.560 < 2e-16 ***
## TEAM_PITCHING_H -0.057591 0.001482 -38.864 < 2e-16 ***
## TEAM_PITCHING_HR -0.096177 0.006353 -15.139 < 2e-16 ***
## TEAM_PITCHING_BB -0.055483 0.001834 -30.257 < 2e-16 ***
## TEAM_PITCHING_SO 0.009061 0.001387 6.534 7.80e-11 ***
## TEAM_FIELDING_E -0.006021 0.001748 -3.445 0.00058 ***
## TEAM_FIELDING_DP 0.049660 0.007403 6.708 2.46e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.722 on 2374 degrees of freedom
## Multiple R-squared: 0.8087, Adjusted R-squared: 0.8076
## F-statistic: 716.8 on 14 and 2374 DF, p-value: < 2.2e-16
There are a number of issues with the model diagnostics. The Linearity, and QQ plots show deviation from the straight line at the boundaries. In addition, the reference line for the Homogeneity of Variance graph is not flat. This indicates that residual variance is not consistent across the model values. Further exploration could be conducted to see if there are smaller date ranges and prediction ranges that generate a model that better aligns with the linear regression assumptions.
check_model(lmA1.final, check=c('ncv','qq','homogeneity','outliers'))
The next step is to use the selected model to predict the testing data set. The Yardstick package was used to calculate model performance, including the \(R^2\). With an \(R^2=0.8318\) the performance on the testing data set is consistent with the model summary. In the Residual vs. TARGET_WINS graph, there appears to be wins range between 60 and 110 that generates more accurate forecasts. However, it should be noted that the residuals appear to drift downwards. TARGET_WINS less than the mean are overestimated and TARGET_WINS greater than the mean is over estimated.
imputeDf <- mice(testingTeam_df, m = 5, maxit = 50, seed = 123, printFlag = F)
imputeDf$meth
## yearID era_cat TARGET_WINS TEAM_BATTING_H
## "" "" "" ""
## TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## "" "" "" ""
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H TEAM_PITCHING_HR
## "pmm" "pmm" "" ""
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## "" "" "" ""
testingTeam_df <- complete(imputeDf)
m_imputed <- melt(testingTeam_df)
densityplot(imputeDf)
lmA1Pred.final <- lmA1.final %>% predict(testingTeam_df)
multi_metric <- metric_set(mape, smape, mase, mpe, rmse, rsq)
m <- testingTeam_df %>% multi_metric(truth=testingTeam_df$TARGET_WINS, estimate=lmA1Pred.final)
kable(m, digits = 4, format = "html", row.names = T) %>%
kable_styling(
bootstrap_options = c("striped", "hover"),
full_width = F,
position = "center"
)
| .metric | .estimator | .estimate | |
|---|---|---|---|
| 1 | mape | standard | 7.4735 |
| 2 | smape | standard | 7.4594 |
| 3 | mase | standard | 0.2987 |
| 4 | mpe | standard | -0.8452 |
| 5 | rmse | standard | 6.7297 |
| 6 | rsq | standard | 0.8107 |
n <- nrow(testingTeam_df)
x <- testingTeam_df$TARGET_WINS
e <- lmA1Pred.final - testingTeam_df$TARGET_WINS
plot(x, e,
xlab = "wins",
ylab = "residuals",
bg = "steelblue",
col = "darkgray", cex = 1.5, pch = 21, frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1 : n)
lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 1)
Bill James developed the Pythagorean winning percentage. The concept strives to calculate the number of games a team should win based on the total offense and the number of runs allowed. Since this model includes total runs scored vs. total runs allowed the expectation is that this model will be a good predictor of a team’s wins in a given season.
\(Win Percentage = (Runs Scored)^2 / [ (Runs Scored)^2 + (Runs Allowed)^2]\)
random_sample <- createDataPartition(teamDF$TARGET_WINS, p = 0.8, list = FALSE)
trainingTeam_df <- teamDF[random_sample, ]
testingTeam_df <- teamDF[-random_sample, ]
As we expected, the Pythagorean model performs well \(Adj R^2\) = 0.9112. However, this linear model differs slightly from the formal definition since it includes an intercept of 5.21 and a coefficient for the Pythagorean factor of 151.39. The formal definition of the model would have an intercept of 0 and a coefficient of 162.
lmp <- lm(TARGET_WINS ~ pythPercent, data = trainingTeam_df)
summary(lmp)
##
## Call:
## lm(formula = TARGET_WINS ~ pythPercent, data = trainingTeam_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.669 -2.953 0.136 3.025 18.239
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7105 0.4886 9.64 <2e-16 ***
## pythPercent 152.3072 0.9603 158.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.561 on 2387 degrees of freedom
## Multiple R-squared: 0.9133, Adjusted R-squared: 0.9133
## F-statistic: 2.515e+04 on 1 and 2387 DF, p-value: < 2.2e-16
The diagnostic plots show residuals that are normally distributed with a linear relationship to the fitted values. The homogeneity of variance is curved, indicating that the variance of residuals is not consistent.
check_model(lmp, check=c('ncv','qq','homogeneity','outliers'))
The next step is to use the Pythagorean Model to predict the testing data set. The Yardstick package was used to calculate model performance, including the \(R^2\). With an \(R^2=0.9131\) the performance on the testing data set is consistent with the model summary. In the Residual vs. TARGET_WINS graph there appears to be a wins range between 50 and 130 that generates more accurate forecasts.
lmpPred <- lmp %>% predict(testingTeam_df)
hist(lmpPred)
m1 <- testingTeam_df %>% multi_metric(truth=testingTeam_df$TARGET_WINS, estimate=lmpPred)
kable(m1, digits = 4, format = "html", row.names = T) %>%
kable_styling(
bootstrap_options = c("striped", "hover"),
full_width = F,
position = "center"
)
| .metric | .estimator | .estimate | |
|---|---|---|---|
| 1 | mape | standard | Inf |
| 2 | smape | standard | 4.9015 |
| 3 | mase | standard | 0.2105 |
| 4 | mpe | standard | -Inf |
| 5 | rmse | standard | 4.5939 |
| 6 | rsq | standard | 0.9034 |
n <- nrow(testingTeam_df)
x <- testingTeam_df$TARGET_WINS
e <- lmpPred - testingTeam_df$TARGET_WINS
plot(x, e,
xlab = "wins",
ylab = "residuals",
bg = "steelblue",
col = "darkgray", cex = 1.5, pch = 21, frame = FALSE)
abline(h = 0, lwd = 2)
for (i in 1 : n)
lines(c(x[i], x[i]), c(e[i], 0), col = "red" , lwd = 1)
Unsurprising the Pythagorean Model performs well when it comes to win projections. There is clear relationship between the total yearly run differential and the number of games won. It is a simple model but efficient.