Baseball is a numbers game. Or at least it has been since Billy Beane adopted the then extremely unpopular strategy of exploiting statistics to improve the Oakland A’s performance. In this report, I attempt to do the same by analyzing a cross-sectional dataset, building an econometric model, and then predicting teams’ wins with that model.
This paper uses two datasets: a training dataset, which is much larger and is used for most of the data exploration and analysis, and a much smaller evaluation dataset. The evaluation dataset also differs from the training dataset in that it lacks the wins variable. The reason for having both is that once we clean the training dataset and build an econometric model on it, we can use that model to predict the win values for the evaluation dataset. The variables included in the datasets are listed in the table below. The primary variable of interest is “Wins,” the number of wins by a given team during a single season. The remaining variables are quantitative variables relating to some aspect of the game, specifically batting, pitching, fielding, and base running.
| Variable | Description |
|---|---|
| INDEX | Identification Variable |
| Wins | Number of wins |
| Base_Hits | Base Hits by batters (1B,2B,3B,HR) |
| DoublesB | Doubles by batters (2B) |
| TriplesB | Triples by batters (3B) |
| HomerunsB | Homeruns by batters (4B) |
| WalksB | Walks by batters |
| Hit_by_a_PitchB | Batters hit by pitch (get a free base) |
| StrikeoutsB | Strikeouts by batters |
| Stolen_BasesR | Stolen bases |
| Caught_StealingR | Caught stealing |
| ErrorsD | Errors |
| Double_PlaysD | Double Plays |
| Walks_AllowedP | Walks allowed |
| Hits_AllowedP | Hits allowed |
| Homeruns_AllowedP | Homeruns allowed |
| StrikeoutsP | Strikeouts by pitchers |
The first step in our data evaluation is to check whether the dataset has any missing observations. There are several ways to do this; the “vis_miss” plot seems the most efficient.
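A minimal sketch of how such a plot can be generated, assuming the training data is loaded as a data frame named training (the vis_miss() function comes from the visdat package):

```r
# Visualize the share of missing values per variable
library(visdat)
vis_miss(training)
```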
As we can see, six variables have some missing observations. The graph orders the variables from most to least missing, showing that the “Hit by a Pitch” variable is missing in 92% of all observations and the “Caught Stealing” variable in 34%. Missing observations are not always a cause for concern, but one must know why the data is missing. If the observations are missing by chance, there is no issue, but if they are omitted for a specific reason, that could drive endogeneity in our analysis later. For example, if the “Hit by a Pitch” variable has so many missing observations because MLB deleted the data to protect the league’s image, that would skew the data and generate unreliable results. Therefore, since we do not know why the observations are missing, it is safer to remove the variable. The “Caught Stealing” variable has fewer missing observations than “Hit by a Pitch,” but it is still missing in a third of all observations, so for the sake of clarity, we should omit it as well.
As for the remaining variables with missing observations, the best course of action is to fill in the missing values with either the mean or the median. The choice between the two depends on the distribution of the given variable. If the variable has a long tail on either side, the tail pulls the mean up or down, so imputing with the mean could skew the results; in that case, the median is preferable. If the variable is roughly normally distributed, the mean is the natural choice, since imputing with it should not materially alter the distribution.
The next necessary step is to analyze the distributions of the four variables with missing observations that we keep in the final dataset: “Double Plays,” “Stolen Bases,” “Strikeouts by a Batter,” and “Strikeouts by a Pitcher.”
The first step of the variable analysis is reviewing their summary statistics, which I generate with the summarytools package. Out of curiosity, I also include the variables we will likely omit (hit by a pitch and caught stealing), so that we can see whether anything suspicious is happening with them beyond the high number of missing observations.
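A table like the one below can be produced with the package’s dfSummary() function; a minimal sketch, assuming the same training data frame:

```r
# Summary statistics, frequencies, and inline distribution graphs
library(summarytools)
dfSummary(training)
```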
*Summary statistics table (summarytools dfSummary output) for the numeric variables: Base Hits, Caught Stealing, Double Plays, Doubles, Errors, Hit by a Pitch, Hits Allowed, Homeruns, Homeruns Allowed, Stolen Bases, Strikeouts (B), Strikeouts (P), Triples, Walks, Walks Allowed, and Wins. For each variable, the table reports its stats/values, frequencies (% of valid), and a distribution graph.*

Generated by summarytools 1.0.1 (R version 4.2.3), 2023-10-05.
Starting with the double plays variable, we see that its mean is 146.39 while its median is 149. The standard deviation of the variable is also small, so we can use the mean to fill in its missing values.
The stolen bases variable has a mean of 124.76, a median of 101, and an SD of 87.79; the mean sits above the median, indicating some right skew, but the gap is small relative to the spread, so using the mean is still acceptable.
Both strikeout variables can likewise be filled in with their respective means, since the difference between their medians and means is small.
Of the omitted variables, hit by a pitch is surprisingly close to being normally distributed. This could mean that the data from seasons with many hit batters was removed, so we cannot keep the variable regardless. The caught stealing variable exhibits the same right skew as almost all other variables with missing observations. Unfortunately, the endogeneity concerns remain, so it cannot be included in the final model.
The summary statistics help picture the distributions of the dataset’s variables, but it is always better to see the complete picture and plot the actual distributions of the individual variables. These graphs will be particularly useful when we decide which variables to log-transform.
The first set of graphs contains all variables, after which they are grouped into four subsections: Batting, Base running, Fielding, and Pitching.
```r
# Graph: Batting
library(ggplot2)
library(gridExtra)

Batting <- data_with_labels[, c(3:8, 11)]
plot_list2 <- list()
for (variable in names(Batting)) {
  variable_data <- Batting[[variable]]                    # one column at a time
  variable_data <- variable_data[!is.na(variable_data)]   # drop missing values
  plot_list2[[variable]] <- ggplot(data.frame(var = variable_data), aes(x = var)) +
    geom_density(fill = "green", color = "black") +
    labs(title = variable, x = variable, y = "Density")
}
grid.arrange(grobs = plot_list2, ncol = 3)
```
```r
# Graph: Base running
Baserun <- data_with_labels[9:10]
plot_list3 <- list()
for (variable in names(Baserun)) {
  variable_data <- Baserun[[variable]]
  variable_data <- variable_data[!is.na(variable_data)]
  plot_list3[[variable]] <- ggplot(data.frame(var = variable_data), aes(x = var)) +
    geom_density(fill = "red", color = "black") +
    labs(title = variable, x = variable, y = "Density")
}
grid.arrange(grobs = plot_list3, ncol = 2, nrow = 1)
```
```r
# Graph: Fielding
Fielding <- data_with_labels[16:17]
plot_list4 <- list()
for (variable in names(Fielding)) {
  variable_data <- Fielding[[variable]]
  variable_data <- variable_data[!is.na(variable_data)]
  plot_list4[[variable]] <- ggplot(data.frame(var = variable_data), aes(x = var)) +
    geom_density(fill = "yellow", color = "black") +
    labs(title = variable, x = variable, y = "Density")
}
grid.arrange(grobs = plot_list4, ncol = 2, nrow = 1)
```
```r
# Graph: Pitching
Pitching <- data_with_labels[12:15]
plot_list5 <- list()
for (variable in names(Pitching)) {
  variable_data <- Pitching[[variable]]
  variable_data <- variable_data[!is.na(variable_data)]
  plot_list5[[variable]] <- ggplot(data.frame(var = variable_data), aes(x = var)) +
    geom_density(fill = "purple", color = "black") +
    labs(title = variable, x = variable, y = "Density")
}
grid.arrange(grobs = plot_list5, ncol = 2, nrow = 2)
```
Thanks to the preceding dataset analysis, the data preparation process is simple. We calculate the mean of each variable with missing observations and replace each missing observation with that mean. I have also created binary variables that take the value one if the given vector had a missing observation at that position and zero otherwise.
```r
# Data cleaning: filling in missing observations
# Mean of the variable, excluding the missing observations
meanTeamBattingSO <- mean(training$StrikeoutsB, na.rm = TRUE)
# Binary variable indicating whether there is a missing observation
training$MissingBattingSO <- ifelse(is.na(training$StrikeoutsB), 1, 0)
# Replace the missing observations with the mean of the variable
training$StrikeoutsB <- ifelse(is.na(training$StrikeoutsB),
                               meanTeamBattingSO, training$StrikeoutsB)
```
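The same flag-and-impute pattern generalizes to every remaining variable with missing values. A hedged sketch (the indicator names it generates differ slightly from the ones used in this report):

```r
# Apply the flag-and-impute steps above to every column that still has NAs
impute_vars <- names(training)[colSums(is.na(training)) > 0]
for (v in impute_vars) {
  training[[paste0("Missing", v)]] <- as.integer(is.na(training[[v]]))  # missing flag
  v_mean <- mean(training[[v]], na.rm = TRUE)                           # mean excluding NAs
  training[[v]][is.na(training[[v]])] <- v_mean                         # mean imputation
}
```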
It is at this point that we drop the double plays and caught stealing variables, along with the unused missing-value indicators.
```r
# Dropping the double plays and caught stealing variables, along with
# the unused missing-value indicators
training <- training %>%
  select(-Double_PlaysD, -Caught_StealingR,
         -MissingBattingHBP, -MissingBaserunCS)
```
I ran a vis_miss graph to confirm that all of the missing
observations have been filled in, and as we can see, they have.
With the data cleaned, we can start the variable selection process. Besides choosing variables through intuition, I have also created a correlation plot, which shows the strength and direction of the relationship between each pair of variables in the dataset. Green indicates a positive correlation and red a negative one. The color’s vibrancy shows the correlation’s strength: white means no correlation, while dark green or dark red means a strong correlation, close to 1 or -1.
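A hedged sketch of how such a plot can be drawn with the corrplot package (the report’s exact plotting code may differ):

```r
# Correlation plot: green = positive, red = negative, white = none
library(corrplot)
corr_matrix <- cor(training[sapply(training, is.numeric)])
corrplot(corr_matrix, method = "color",
         col = colorRampPalette(c("darkred", "white", "darkgreen"))(200),
         tl.col = "black")
```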
As we can see from the plot, the wins variable is positively, though faintly, correlated with all other variables. Since some variables should correlate negatively with wins, this is largely counterintuitive and will be worth investigating in the regression analysis section.
There are also two large green clusters in the corners of the graph, which show high positive correlations among home runs, home runs allowed, walks, and stolen bases. Home runs’ correlation with these variables is strange because it means that teams that hit many home runs also give up many home runs and are not very successful defensively. I hypothesize that teams whose players swing for the fences are worse off due to the high likelihood of striking out. Home runs and strikeouts are indeed strongly positively correlated, suggesting that risk-taking behavior in baseball is likely not beneficial.
Conversely, it makes complete intuitive sense that the errors variable is strongly positively correlated with home runs allowed, walks, and stolen bases. Teams that commit many errors on the defensive side of the ball will likely also give up many home runs and base hits, and allow many opposing players to walk and steal bases.
The question of whether risk-taking behavior in baseball pays off, raised by the positive correlations among home runs, strikeouts, and the defensive variables, is interesting, and I intend to examine it in my regressions.
Since our dependent variable “Wins” correlates positively with all other variables, variable selection is relatively simple. I include all thirteen variables in the model to see which aspects of the game affect wins the most. Additionally, since the players on defense and offense in baseball are often the same, omitting some of the variables would likely cause omitted variable bias: teams with many batting specialists might perform worse on defense, which could bias the remaining coefficients if some variables were left out. The ultimate goal of this section is to develop a model that effectively predicts teams’ wins and indicates the importance of specific aspects of the game. I will also focus on the coefficients of variables such as home runs, triples, and doubles to see whether my hypothesis about risk-taking behavior in baseball has statistical support.
The first model I chose is a simple MLR model, where the
dependent variable is wins, and the independent variables are the
thirteen remaining variables in the dataset.
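A hedged sketch of this specification, with variable names assumed from the variable table (INDEX is excluded as an identifier):

```r
# Model 1.1: simple MLR of wins on the thirteen remaining variables
model1 <- lm(Wins ~ Base_Hits + DoublesB + TriplesB + HomerunsB + WalksB +
               Hit_by_a_PitchB + StrikeoutsB + ErrorsD + Walks_AllowedP +
               Stolen_BasesR + Hits_AllowedP + Homeruns_AllowedP + StrikeoutsP,
             data = training)
summary(model1)
```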
| Wins | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 5.08 | -8.09 – 18.25 | 0.449 |
| Base Hits | 0.05 | 0.04 – 0.05 | <0.001 |
| DoublesB | -0.02 | -0.04 – -0.00 | 0.017 |
| TriplesB | 0.07 | 0.04 – 0.10 | <0.001 |
| HomerunsB | 0.04 | -0.02 – 0.09 | 0.162 |
| WalksB | 0.00 | -0.01 – 0.02 | 0.413 |
| Hit by a PitchB | 0.08 | -0.06 – 0.23 | 0.254 |
| StrikeoutsB | -0.01 | -0.01 – -0.00 | 0.014 |
| ErrorsD | -0.02 | -0.03 – -0.02 | <0.001 |
| Walks AllowedP | -0.00 | -0.01 – 0.01 | 0.899 |
| Stolen BasesR | 0.04 | 0.03 – 0.04 | <0.001 |
| Hits AllowedP | -0.00 | -0.00 – 0.00 | 0.058 |
| Homeruns AllowedP | 0.01 | -0.04 – 0.06 | 0.599 |
| StrikeoutsP | 0.00 | 0.00 – 0.00 | 0.004 |
| Observations | 2276 | ||
| R2 / R2 adjusted | 0.293 / 0.289 | ||
The first regression shows that most batting and pitching variables are statistically significant, which could indicate that these two aspects of the game are the most important. The most impactful variables are base hits and triples. The regression indicates that, for each additional base hit, the team on average wins 0.047 more games, ceteris paribus. The triples variable suggests that, for each additional triple, the team on average wins 0.07 more games, ceteris paribus. While the coefficient on base hits is smaller, base hits are far more frequent than triples, which suggests that, at least offensively, base hits are the most critical aspect of the game.
Defensively, hits allowed, errors, and pitcher strikeouts produce statistically significant results, which suggests that the pitching game is the most essential defensive aspect. The errors variable has the largest coefficient, but errors are not as frequent as strikeouts, so strikeouts remain the most crucial part of the defensive game. The coefficient on strikeouts by a pitcher indicates that, on average, each additional pitcher strikeout brings the team 0.0027 more wins, ceteris paribus.
Since home runs do not produce statistically significant
results, I can make no assertions regarding my risk-taking
hypothesis.
The residual analysis of the model provides some promising results. The residuals vs. fitted plot shows a roughly flat line close to zero, indicating a reasonable fit, and the scale-location plot is fairly even in height, which suggests that heteroscedasticity is likely not a concern. The Q-Q plot, which compares the distribution of the regression residuals to a normal distribution, also does not reveal any problems with the normality assumption. However, the residuals vs. leverage plot shows severe outliers that skew the results.
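The four plots discussed here are the standard lm diagnostics, producible as follows (assuming model1 from the sketch above):

```r
# Residuals vs fitted, Q-Q, scale-location, and residuals vs leverage
par(mfrow = c(2, 2))
plot(model1)
par(mfrow = c(1, 1))
```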
Removing the outliers can improve the model, and log-transforming the variables should improve the model’s fit, especially since the wins variable is not normally distributed.
As for the Gauss-Markov assumptions, the OLS estimator seems to be BLUE. There is no reason to suspect that the data were not collected randomly or that the model is not linear in parameters. At first glance, the two strikeout variables might raise multicollinearity concerns, but they measure opposite sides of the same team’s play (batting versus pitching), so the issue does not arise. The zero conditional mean assumption is difficult to confirm with an OLS estimator; however, these concerns should be alleviated by including enough controls. Finally, homoscedasticity seems to hold, as per the scale-location plot in Residual Analysis 1.1.
| log1p(Wins) | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | -2.07 | -2.98 – -1.16 | <0.001 |
| Base Hits [log1p] | 14.64 | 8.94 – 20.33 | <0.001 |
| DoublesB [log1p] | -0.10 | -0.16 – -0.04 | 0.001 |
| TriplesB [log1p] | 0.07 | 0.05 – 0.10 | <0.001 |
| HomerunsB [log1p] | -0.28 | -0.36 – -0.20 | <0.001 |
| WalksB [log1p] | -13.46 | -19.22 – -7.71 | <0.001 |
| Hit by a PitchB | 0.00 | -0.00 – 0.00 | 0.114 |
| StrikeoutsB | 0.00 | -0.00 – 0.00 | 0.257 |
| ErrorsD [log1p] | -0.17 | -0.20 – -0.14 | <0.001 |
| Walks AllowedP [log1p] | 13.50 | 7.76 – 19.25 | <0.001 |
| Stolen BasesR [log1p] | 0.07 | 0.05 – 0.08 | <0.001 |
| Hits AllowedP [log1p] | -13.66 | -19.36 – -7.95 | <0.001 |
| Homeruns AllowedP [log1p] | 0.29 | 0.21 – 0.36 | <0.001 |
| StrikeoutsP [log1p] | -0.03 | -0.05 – -0.02 | <0.001 |
| Observations | 2271 | ||
| R2 / R2 adjusted | 0.347 / 0.343 | ||
The second set of regression results drastically improves many key variables’ fit and statistical significance. I opted to log-transform all variables except hit by a pitch and batter strikeouts, since those variables seem normally distributed. The adjusted R-squared has increased from 0.289 to 0.343, which means the new model explains over five percentage points more of the variation in the wins variable than the previous one. We also see some gains in statistical significance and changes in coefficient sizes. The base hits variable is now much more important: on average, a one percent increase in base hits is associated with a 14.64 percent increase in wins, ceteris paribus.
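A hedged sketch of this second specification, matching the table above (all variables log1p-transformed except hit by a pitch and batter strikeouts):

```r
# Model 1.2: log1p-transformed specification
model2 <- lm(log1p(Wins) ~ log1p(Base_Hits) + log1p(DoublesB) + log1p(TriplesB) +
               log1p(HomerunsB) + log1p(WalksB) + Hit_by_a_PitchB + StrikeoutsB +
               log1p(ErrorsD) + log1p(Walks_AllowedP) + log1p(Stolen_BasesR) +
               log1p(Hits_AllowedP) + log1p(Homeruns_AllowedP) + log1p(StrikeoutsP),
             data = training)
summary(model2)
```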
Home runs have also become statistically significant, indicating that, on average, a one percent increase in home runs is associated with a 0.28 percent decrease in wins, ceteris paribus. This change supports my risk-taking hypothesis, and the coefficient on walks allowed makes the theory more plausible. If the opposing lineup has many heavy hitters, a pitcher will often choose to walk them rather than pitch to them, improving the pitching team’s chance of winning, and this is reflected in the walks allowed coefficient: on average, a one percent increase in walks allowed is associated with a 13.50 percent increase in the team’s wins, ceteris paribus.
The one seemingly counterintuitive part of the model is the negative coefficient on strikeouts by a pitcher. On average, a one percent increase in pitcher strikeouts is associated with a 0.03 percent decrease in wins, ceteris paribus. I suspect the reason is that a team recording many strikeouts is facing opponents who bat a lot, which suggests it plays in more competitive games, and those games are harder to win.
The residual analysis has also improved. The residuals vs. fitted plot looks cleaner, with the fitted line closer to zero; the scale-location plot remains reasonably evenly distributed; and the earlier issues with outliers in the residuals vs. leverage plot have been corrected.
Another possible way of improving the model is to include
interaction terms of offensive and defensive variables. While these
might be hard to interpret, they could control for much of the
endogeneity that comes from omitted offensive and defensive variables.
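A hedged sketch of one way to specify such interactions in R; the exact interaction structure behind the table below may differ:

```r
# Model 2.1: interactions among batting terms and among pitching terms
model3 <- lm(log1p(Wins) ~
               log1p(Base_Hits) * log1p(DoublesB) * log1p(TriplesB) * log1p(HomerunsB) +
               log1p(Walks_AllowedP) * log1p(Hits_AllowedP) * log1p(Homeruns_AllowedP) +
               Hit_by_a_PitchB * log1p(ErrorsD) +
               log1p(WalksB) + StrikeoutsB + log1p(Stolen_BasesR) + log1p(StrikeoutsP),
             data = training)
summary(model3)
```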
| log1p(Wins) | |||
|---|---|---|---|
| Predictors | Estimates | CI | p |
| (Intercept) | 1.95 | -10.15 – 14.06 | 0.751 |
| Base Hits [log1p] | 0.23 | -1.71 – 2.16 | 0.818 |
| DoublesB [log1p] | 0.26 | -0.21 – 0.73 | 0.281 |
| TriplesB [log1p] | -0.07 | -0.16 – 0.01 | 0.088 |
| HomerunsB [log1p] | -0.53 | -0.67 – -0.40 | <0.001 |
| WalksB [log1p] | -0.10 | -0.44 – 0.23 | 0.541 |
| Hit by a PitchB | -0.01 | -0.02 – 0.01 | 0.347 |
| StrikeoutsB | -0.00 | -0.00 – 0.00 | 0.711 |
| ErrorsD [log1p] | -0.13 | -0.18 – -0.07 | <0.001 |
| Walks AllowedP [log1p] | 0.21 | -0.04 – 0.45 | 0.095 |
| Stolen BasesR [log1p] | 0.08 | 0.06 – 0.10 | <0.001 |
| Hits AllowedP [log1p] | -0.25 | -0.67 – 0.18 | 0.252 |
| Homeruns AllowedP [log1p] | 0.42 | 0.31 – 0.53 | <0.001 |
| StrikeoutsP [log1p] | -0.02 | -0.04 – -0.00 | 0.021 |
| Base Hits | 0.00 | -0.00 – 0.00 | 0.156 |
| DoublesB | 0.00 | -0.01 – 0.01 | 0.663 |
| TriplesB | 0.02 | 0.00 – 0.04 | 0.014 |
| HomerunsB | 0.02 | 0.00 – 0.04 | 0.017 |
| ErrorsD | -0.00 | -0.01 – 0.00 | 0.233 |
| Walks AllowedP | -0.00 | -0.00 – -0.00 | 0.001 |
| Hits AllowedP | -0.00 | -0.00 – -0.00 | <0.001 |
| Homeruns AllowedP | -0.01 | -0.01 – -0.00 | <0.001 |
| Base Hits × DoublesB | -0.00 | -0.00 – 0.00 | 0.544 |
| Base Hits × TriplesB | -0.00 | -0.00 – -0.00 | 0.027 |
| DoublesB × TriplesB | -0.00 | -0.00 – 0.00 | 0.258 |
| Base Hits × HomerunsB | -0.00 | -0.00 – 0.00 | 0.084 |
| DoublesB × HomerunsB | -0.00 | -0.00 – 0.00 | 0.136 |
| TriplesB × HomerunsB | -0.00 | -0.00 – 0.00 | 0.987 |
| Hit by a PitchB × ErrorsD | 0.00 | -0.00 – 0.00 | 0.257 |
| Walks AllowedP × Hits AllowedP | 0.00 | 0.00 – 0.00 | <0.001 |
| Walks AllowedP × Homeruns AllowedP | 0.00 | 0.00 – 0.00 | <0.001 |
| Hits AllowedP × Homeruns AllowedP | 0.00 | 0.00 – 0.00 | <0.001 |
| (Base Hits × DoublesB) × TriplesB | 0.00 | -0.00 – 0.00 | 0.204 |
| (Base Hits × DoublesB) × HomerunsB | 0.00 | -0.00 – 0.00 | 0.201 |
| (Base Hits × TriplesB) × HomerunsB | 0.00 | -0.00 – 0.00 | 0.761 |
| (DoublesB × TriplesB) × HomerunsB | -0.00 | -0.00 – 0.00 | 0.859 |
| (Walks AllowedP × Hits AllowedP) × Homeruns AllowedP | -0.00 | -0.00 – -0.00 | <0.001 |
| (Base Hits × DoublesB × TriplesB) × HomerunsB | -0.00 | -0.00 – 0.00 | 0.936 |
| Observations | 2276 | ||
| R2 / R2 adjusted | 0.489 / 0.480 | ||
The new regression model is chaotic. While the adjusted R-squared has increased to 0.480, an increase of almost 14 percentage points in explained variation of wins, the number of interaction terms makes the model hard to interpret. Many statistically significant variables also have close to no effect on the dependent variable because their coefficients are tiny. While the model perhaps controls for some additional endogeneity, it does not tell a concise story, and many of the coefficients are nonsensical.
The residual analysis shows some concerning features. The residuals vs. fitted plot shows a pattern, indicating issues with the model’s fit, and the scale-location plot has some outliers that could pose heteroscedasticity issues. I suspect that including all the interaction terms has created a zero conditional mean violation: while the model now explains more of the variation in wins, it comes at the cost of endogeneity and omitted variable bias.
While the third model explains the most variation in the wins variable, it faces some issues. The R-squared of Model 2.1 is the highest, but the interaction terms, which combine parts of the game such as batting and pitching, make the model too complicated to interpret. Additionally, the residual analysis of Model 2.1 revealed issues with fit and possible heteroscedasticity concerns. If I cannot explain why something works, I should not use it, and Model 2.1 is just an econometric version of a Rubik’s cube.
The decision, therefore, comes down to models 1.1 and 1.2, and it is an easy one. Model 1.2 has a higher R-squared, more statistically significant coefficients, and performs better in the residual analysis. I will therefore use model 1.2 to predict the wins variable in the evaluation dataset.
The evaluation data faces the same issues as the training dataset, missing roughly the same percentages of observations in the same variables. For that reason, I go through the same process, filling in the missing values with the variable means and dropping the double plays and caught stealing variables from the dataset.
I use the vis_miss graph again to show the variables with missing observations. As we can see, hit by a pitch and caught stealing do, in fact, still have the most missing observations and should, therefore, be dropped.
I then rerun the vis_miss graph to ensure there are no more missing observations and that the appropriate variables have been dropped. The graph shows that our data is in order and ready to be analyzed.
I used the predict function to generate the predicted win values for the evaluation dataset. Since printing these values out seemed impractical, I opted to compute summary statistics for the predictions and compare them to the summary statistics of the wins variable in the training dataset.
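A minimal sketch of the prediction step, assuming the cleaned evaluation data is named evaluation and model2 is the chosen log1p specification (model 1.2):

```r
# Predict log1p(Wins) for the evaluation dataset
predicted_log_wins <- predict(model2, newdata = evaluation)
summary(predicted_log_wins)
```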
| Metric | Predicted log1p(Wins), evaluation | Actual log1p(Wins), training |
|---|---|---|
| Mean | 4.3787217 | 4.3815011 |
| SD | 0.1707904 | 0.2353917 |
| Minimum | 2.5238309 | 0.0000000 |
| Maximum | 4.7012641 | 4.9904326 |
| Median | 4.3969855 | 4.4188406 |
The figures may seem strange, since both columns are on the log1p scale of wins; this is not a problem, as we are simply comparing like with like. To recover actual win counts, we invert the transformation (strictly with “expm1,” since “log1p” was used, although “exp,” applied below, differs from it by only one win).
## [1] 79.95797
## [1] 79.73604
Comparing the means of the predicted and actual values, they are almost identical, differing by less than one-tenth of a percent. The distribution of the predicted values is much more concentrated, as the standard deviations and ranges show. The most significant difference is in the minimums, which are 0 in the training dataset and 2.524 for the evaluation predictions. Outside of this, however, the two variables are very similar, and our model has done an excellent job of predicting wins.
To complete the analysis, I generated two probability density functions, one for the predicted wins and one for the actual wins, and overlaid them on a single plot to compare their distributions.
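A hedged sketch of the overlaid densities, reusing predicted_log_wins from the prediction step above:

```r
# Overlay predicted and actual log1p(Wins) densities on one plot
library(ggplot2)
df_plot <- rbind(
  data.frame(log_wins = log1p(training$Wins), source = "Actual (training)"),
  data.frame(log_wins = predicted_log_wins, source = "Predicted (evaluation)")
)
ggplot(df_plot, aes(x = log_wins, fill = source)) +
  geom_density(alpha = 0.5) +
  labs(x = "log1p(Wins)", y = "Density", fill = NULL)
```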
As we can see, the two distributions have a similar spread, though the predicted values are much more concentrated than the actual ones. To account for this in the future, including team-identifying variables would let us use additional econometric methods, such as fixed effects, and control for more endogeneity. Overall, however, the model seems to have predicted wins effectively and provided reliable results.