
1 Introduction

      Baseball is a numbers game. Or at least it has been since Billy Beane adopted the then extremely unpopular strategy of using statistics to improve the Oakland A’s performance. In this report, I attempt to do the same by analyzing a cross-sectional dataset, building an econometric model, and then predicting a team’s wins with that model.

2 Data Exploration

      This paper uses two datasets: a larger training dataset, which is used for most of the data exploration and analysis, and a much smaller evaluation dataset, which lacks the win variable. Having both allows us to clean the training dataset, fit an econometric model on it, and then use that model to predict the win values for the evaluation dataset. The variables included in the datasets are listed in the table below. The primary variable of interest is “Wins,” the number of wins by a given team during a single season. The remaining variables are quantitative measures of some aspect of the game, specifically batting, pitching, fielding, and base running.

2.1 Variable Table

Variable Descriptions

Variable            Description
INDEX               Identification Variable
Wins                Number of wins
Base_Hits           Base Hits by batters (1B,2B,3B,HR)
DoublesB            Doubles by batters (2B)
TriplesB            Triples by batters (3B)
HomerunsB           Homeruns by batters (4B)
WalksB              Walks by batters
Hit_by_a_PitchB     Batters hit by pitch (get a free base)
StrikeoutsB         Strikeouts by batters
Stolen_BasesR       Stolen bases
Caught_StealingR    Caught stealing
ErrorsD             Errors
Double_PlaysD       Double Plays
Walks_AllowedP      Walks allowed
Hits_AllowedP       Hits allowed
Homeruns_AllowedP   Homeruns allowed
StrikeoutsP         Strikeouts by pitchers

2.2 Missing Observations Graph

     The first step in our data evaluation is to check whether the dataset has any missing observations. There are a few ways to do this; the “vis_miss” plot seems the most efficient.


      As we can see, six variables have some missing observations. The graph orders the variables from most to least missing, showing that the “Hit by a Pitch” variable is missing in 92% of all observations and the “Caught Stealing” variable in 34%. Missing observations are not always a cause for concern, but one must know why the data is missing. If the observations are missing by chance, this need not be an issue, but if they were omitted for a specific reason, that could introduce endogeneity into our analysis later. For example, if the “Hit by a Pitch” variable has so many missing observations because the MLB deleted the data to protect the league’s image, that would skew the data and generate unreliable results. Since we do not know why the observations are missing, removing the variable is the safer choice. The “Caught Stealing” variable has fewer missing observations than “Hit by a Pitch,” but it is still missing in a third of all observations, so to be safe, we should omit that variable as well.
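The missingness shares read off the graph can also be computed directly. The following is a minimal sketch on a made-up toy data frame (the values are invented for illustration and are not the training data); the `vis_miss()` call is guarded because it assumes the naniar package is installed.

```r
# Toy data frame standing in for the training data (values are made up)
toy <- data.frame(
  Wins             = c(70, 81, 95, 60, 88),
  Hit_by_a_PitchB  = c(NA, NA, 45, NA, NA),
  Caught_StealingR = c(50, NA, 62, 48, NA)
)

# Share of missing observations per variable, from most to least missing
missing_share <- sort(colMeans(is.na(toy)), decreasing = TRUE)
print(round(100 * missing_share, 1))

# If naniar is available, vis_miss() draws the same information as a plot
if (requireNamespace("naniar", quietly = TRUE)) {
  print(naniar::vis_miss(toy))
}
```

Sorting the shares makes it easy to apply a cutoff for dropping variables; the cutoff itself remains a judgment call.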

2.3 Summary Statistics

     As for the remaining variables with missing observations, the best course of action is to fill in the missing values with either the mean or the median. The choice between the two depends on the distribution of the given variable. If the variable has a long tail on either side of its distribution, the tail pulls the mean upwards or downwards, and imputing with the mean could therefore skew the results. In that case, using the median to fill in the missing values is preferable. If the variable is approximately normally distributed, the mean is the natural choice since it should not noticeably alter the distribution.
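The mean-versus-median rule described above can be sketched as a small helper. This is an illustration only: `skewness()` and `impute_center()` are hypothetical helpers written for this example, and the cutoff of 1 is an assumed rule of thumb, not a value taken from the analysis.

```r
# Sample skewness (third standardized moment); base R has no built-in for this
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Hypothetical helper: fill NAs with the median if the variable is strongly
# skewed, otherwise with the mean. The cutoff of 1 is an assumed rule of thumb.
impute_center <- function(x, cutoff = 1) {
  use_median <- abs(skewness(x)) > cutoff
  fill <- if (use_median) median(x, na.rm = TRUE) else mean(x, na.rm = TRUE)
  x[is.na(x)] <- fill
  x
}

symmetric <- c(98, 100, 102, NA, 100)    # roughly symmetric: mean is used
long_tail <- c(10, 12, 11, 13, 400, NA)  # long right tail: median is used

impute_center(symmetric)
impute_center(long_tail)
```

In the second vector the single large value drags the mean far above the typical observation, which is exactly the situation in which the median is the safer fill value.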


      The next step is to analyze the distributions of the four variables with missing observations that we keep in the final dataset. These variables are “Double Plays,” “Stolen Bases,” “Strikeouts by a Batter,” and “Strikeouts by a Pitcher.”


      The first step of the variable analysis is reviewing the summary statistics, which I generate with the summarytools package. Out of curiosity, I also include the variables we will likely omit (Hit by a Pitch and Caught Stealing) to see whether anything suspicious is happening with them beyond the high number of missing observations.

Data Frame Summary

SummaryStats

Dimensions: 15 x 16
Duplicates: 0
Variable                     Mean (sd)          min ≤ med ≤ max         IQR (CV)       Freqs (% of Valid)
Base Hits [numeric]          805.8 (901.4)      0.1 ≤ 154.2 ≤ 2554      1408 (1.1)     15 distinct values
Caught Stealing [numeric]    136.5 (381.7)      0 ≤ 24 ≤ 1504           52.6 (2.8)     15 distinct values
Double Plays [numeric]       202.1 (500)        -0.4 ≤ 52 ≤ 1990        135.7 (2.5)    15 distinct values
Doubles [numeric]            268.2 (571.2)      0 ≤ 69 ≤ 2276           216.1 (2.1)    15 distinct values
Errors [numeric]             369.9 (706.1)      0.1 ≤ 122.2 ≤ 2276      200.5 (1.9)    15 distinct values
Hit by a Pitch [numeric]     40 (51.3)          -0.1 ≤ 16.5 ≤ 191       54.3 (1.3)     15 distinct values
Hits Allowed [numeric]       2802.8 (7601.7)    0.1 ≤ 1137 ≤ 30132      1479.6 (2.7)   15 distinct values
Homeruns [numeric]           218.3 (573.8)      -1 ≤ 78.6 ≤ 2276        103.1 (2.6)    15 distinct values
Homeruns Allowed [numeric]   224.5 (574.4)      -0.6 ≤ 74.1 ≤ 2276      105.9 (2.6)    14 distinct values
Stolen Bases [numeric]       242.1 (553.6)      0 ≤ 87.8 ≤ 2145         109.1 (2.3)    15 distinct values
Strikeouts (B) [numeric]     503.1 (624.1)      -0.3 ≤ 284.7 ≤ 2174     742.6 (1.2)    15 distinct values
Strikeouts (P) [numeric]     1774.6 (4875.1)    0 ≤ 553.1 ≤ 19278       756.8 (2.7)    15 distinct values
Triples [numeric]            193.3 (579)        0 ≤ 34 ≤ 2276           62.3 (3)       15 distinct values
Walks [numeric]              376.4 (593.4)      -1 ≤ 122.7 ≤ 2276       505.6 (1.6)    15 distinct values
Walks Allowed [numeric]      580.1 (1024.2)     0 ≤ 135 ≤ 3645          492.9 (1.8)    15 distinct values
Wins [numeric]               193.3 (578.1)      -0.4 ≤ 21 ≤ 2276        86.4 (3)       15 distinct values

Generated by summarytools 1.0.1 (R version 4.2.3)
2023-10-05
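For reference, per-variable statistics of the kind discussed in the following paragraphs (mean, SD, median) can be computed from the raw columns with base R. This is a minimal sketch on made-up values; `summarise_var()` is a hypothetical helper written for this example, not part of summarytools.

```r
# Toy columns standing in for two raw training variables (values are made up)
toy <- data.frame(
  Double_PlaysD = c(140, 150, NA, 149, 155),
  Stolen_BasesR = c(60, 101, 95, NA, 320)
)

# One set of summary statistics per variable, ignoring missing values
summarise_var <- function(x) {
  c(mean   = mean(x, na.rm = TRUE),
    sd     = sd(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    iqr    = IQR(x, na.rm = TRUE),
    n_miss = sum(is.na(x)))
}

# sapply() returns one column per variable; transpose to one row per variable
stats <- t(sapply(toy, summarise_var))
print(round(stats, 2))
```

Comparing the `mean` and `median` columns row by row gives a quick first indication of which variables have a long tail.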


      Starting with the double plays variable, we see that the variable’s mean is 146.39, while the median is 149. The standard deviation of the variable is also small, so we can use the mean to fill in the missing values of this variable.


      The stolen bases variable is similar, with a mean of 124.76, a median of 101, and an SD of 87.79, so using the mean is fine.


     Both strikeout variables can also be filled in with the variable’s respective means since the difference between their median and mean is small.


      Of the variables to be omitted, Hit by a Pitch is surprisingly close to being normally distributed. This could mean that data from seasons in which many players were hit by pitches was removed, so we cannot keep the variable regardless. The caught stealing variable exhibits the same right skew as almost all other variables with missing observations. Unfortunately, the endogeneity concerns remain, so it cannot be included in the final model.

2.4 Variable Distributions by type


      The summary statistics help picture the distributions of the dataset’s variables, but it is always better to see the complete picture and graph the actual distributions of the individual variables. The graphs will be particularly useful when we decide which variables to log transform.
      The first set of graphs contains all variables, after which they are grouped into four subsections: Batting, Base runs, Fielding, and Pitching.

2.4.1 Batting

# Graph Batting
library(ggplot2)    # ggplot(), geom_density(), labs()
library(gridExtra)  # grid.arrange()
Batting <- data_with_labels[, c(3:8, 11)]

plot_list2 <- list()

variable_names <- names(Batting)
for (variable in variable_names) {
  variable_data <- Batting[[variable]]
  variable_data <- variable_data[!is.na(variable_data)]

  plot2 <- ggplot(data = data.frame(var = variable_data), aes(x = var)) +
    geom_density(fill = "green", color = "black") +
    labs(title = paste(variable), x = variable, y = "Density")
  plot_list2[[variable]] <- plot2
}

grid.arrange(grobs = plot_list2,ncol=3)

2.4.2 Base runs

#Graph Baserun
Baserun <- data_with_labels[9:10]

plot_list3 <- list()

variable_names <- names(Baserun)
for (variable in variable_names) {
  variable_data <- Baserun[[variable]]
  variable_data <- variable_data[!is.na(variable_data)]

  plot3 <- ggplot(data = data.frame(var = variable_data), aes(x = var)) +
    geom_density(fill = "red", color = "black") +
    labs(title = paste(variable), x = variable, y = "Density")
  plot_list3[[variable]] <- plot3
}

grid.arrange(grobs = plot_list3, ncol = 2, nrow=1)

2.4.3 Fielding

#Graph Fielding
Fielding <- data_with_labels[16:17]

plot_list4 <- list()

variable_names <- names(Fielding)
for (variable in variable_names) {
  variable_data <- Fielding[[variable]]
  variable_data <- variable_data[!is.na(variable_data)]

  plot4 <- ggplot(data = data.frame(var = variable_data), aes(x = var)) +
    geom_density(fill = "yellow", color = "black") +
    labs(title = paste(variable), x = variable, y = "Density")
  plot_list4[[variable]] <- plot4
}

grid.arrange(grobs = plot_list4, ncol = 2, nrow=1)

2.4.4 Pitching

# Graph Pitching
Pitching <- data_with_labels[12:15]

plot_list5 <- list()

variable_names <- names(Pitching)
for (variable in variable_names) {
  # Drop missing values before estimating the density
  variable_data <- Pitching[[variable]]
  variable_data <- variable_data[!is.na(variable_data)]

  plot5 <- ggplot(data = data.frame(var = variable_data), aes(x = var)) +
    geom_density(fill = "purple", color = "black") +
    labs(title = paste(variable), x = variable, y = "Density")
  plot_list5[[variable]] <- plot5
}

grid.arrange(grobs = plot_list5, ncol = 2, nrow = 2)

3 Data Preparation


      The data preparation process becomes simple thanks to our previous dataset analysis. We calculate the mean of each variable with missing observations and then replace the missing observations with that mean. I have also created binary indicator variables that equal one where the original variable had a missing observation and zero otherwise. The code below shows the process for batter strikeouts.

# Data cleaning: filling in missing observations (shown for StrikeoutsB)
# Mean of the variable, not including the missing observations
meanTeamBattingSO <- mean(training$StrikeoutsB, na.rm = TRUE)
# Binary variable indicating whether the observation was missing
training$MissingBattingSO <- ifelse(is.na(training$StrikeoutsB), 1, 0)
# Replace the missing observations with the mean of the variable, in place
training$StrikeoutsB <- ifelse(is.na(training$StrikeoutsB),
                               meanTeamBattingSO, training$StrikeoutsB)
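The same fill can be applied to several variables in one loop. The sketch below runs on a made-up toy data frame; the indicator names of the form `Missing_<variable>` are hypothetical and chosen only for this example.

```r
# Toy data frame with missing values in two variables (values are made up)
toy <- data.frame(
  StrikeoutsB   = c(900, NA, 1100),
  Stolen_BasesR = c(NA, 100, 140)
)

vars_to_fill <- c("StrikeoutsB", "Stolen_BasesR")
for (v in vars_to_fill) {
  # Record where the value was missing before it is overwritten
  toy[[paste0("Missing_", v)]] <- as.integer(is.na(toy[[v]]))
  # Replace missing values with the mean of the observed values
  toy[[v]][is.na(toy[[v]])] <- mean(toy[[v]], na.rm = TRUE)
}

print(toy)
```

Creating the indicator before the fill matters: once the missing values are replaced, the information about where they were is lost.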


      At this point, we drop the double plays and caught stealing variables.

# Drop Double_PlaysD and Caught_StealingR, along with the missing-value
# indicators for hit by a pitch (HBP) and caught stealing (CS)
library(dplyr)
training <- training %>%
  select(-Double_PlaysD, -Caught_StealingR,
         -MissingBattingHBP, -MissingBaserunCS)


      I ran a vis_miss graph to confirm that all of the missing observations have been filled in, and as we can see, they have.

4 Correlation Plot


      With the data cleaned, we can start the variable selection process. In addition to choosing variables through intuition, I have created a correlation plot, which shows the strength and direction of the relationship between each pair of variables in the dataset. Green indicates a positive correlation, and red indicates a negative correlation. The color’s vibrancy shows the correlation’s strength: white means no correlation, and dark green or red means a strong correlation, closer to 1 or -1.
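The numbers behind such a plot are an ordinary correlation matrix. The following is a minimal base-R sketch on simulated data (the variable names mirror the dataset, but the values and the relationship between them are invented for illustration).

```r
# Simulated stand-in data: wins depend positively on base hits (made-up values)
set.seed(1)
n <- 200
hits <- rnorm(n, mean = 1450, sd = 100)
toy <- data.frame(
  Base_Hits = hits,
  Wins      = 0.05 * hits + rnorm(n, sd = 10),
  ErrorsD   = rnorm(n, mean = 150, sd = 30)
)

# Pairwise correlations; "pairwise.complete.obs" tolerates missing values
corr <- cor(toy, use = "pairwise.complete.obs")
print(round(corr, 2))

# Correlation of every other variable with Wins, strongest first
wins_corr <- sort(corr["Wins", colnames(corr) != "Wins"], decreasing = TRUE)
print(round(wins_corr, 2))
```

Reading the `Wins` row of the matrix in this way is a quick check on what the colors in the plot suggest.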


      As we can see from the plot, the wins variable is positively correlated with all other variables, though faintly. Since some variables should correlate negatively with wins, this is counterintuitive and worth investigating in the regression analysis section.


     There are also two large green centers in the corners of the graph, which show high levels of positive correlation between home runs, home runs allowed, walks, and stolen bases. Home runs’ correlation with these variables is strange because it means that teams with many home runs also give up a lot of home runs and are not very successful defensively. I hypothesize that teams with players who try to swing for the fences are worse off due to the high likelihood of striking out. We see that home runs and strikeouts are also strongly positively correlated, which suggests that risk-taking behavior in baseball may not be beneficial.


     Conversely, the errors variable being strongly positively correlated with home runs allowed, walks, and stolen bases makes complete intuitive sense. Teams that make too many errors on the defensive side of the ball will also likely suffer from a lot of home runs and base hits, as well as a lot of the opposing players being walked and stealing bases.


      The question of whether risk-taking behavior in baseball pays off, raised by the positive correlation between home runs, strikeouts, and the defensive variables, is interesting, and I intend to look into it in my regressions.

5 Econometric Models


      Since our dependent variable “Wins” positively correlates with all other variables, our variable selection is relatively simple. I included all thirteen variables in the model to see which aspects of the game affect wins the most. Additionally, since the players on defense and offense in baseball are often the same, omitting some of the variables would likely cause omitted variable bias. For example, teams with many batting specialists might perform worse on defense. The ultimate goal of this section is to develop a model that will effectively predict teams’ wins and indicate the importance of specific aspects of the game. I will also focus on the coefficients of variables such as home runs, triples, and doubles to see whether my hypothesis regarding risk-taking behavior in baseball has statistical evidence to support it.

5.1 Regression 1.1


      The first model I chose is a simple multiple linear regression (MLR) model, where the dependent variable is wins and the independent variables are the thirteen remaining variables in the dataset.

  Wins
Predictors Estimates CI p
(Intercept) 5.08 -8.09 – 18.25 0.449
Base Hits 0.05 0.04 – 0.05 <0.001
DoublesB -0.02 -0.04 – -0.00 0.017
TriplesB 0.07 0.04 – 0.10 <0.001
HomerunsB 0.04 -0.02 – 0.09 0.162
WalksB 0.00 -0.01 – 0.02 0.413
Hit by a PitchB 0.08 -0.06 – 0.23 0.254
StrikeoutsB -0.01 -0.01 – -0.00 0.014
ErrorsD -0.02 -0.03 – -0.02 <0.001
Walks AllowedP -0.00 -0.01 – 0.01 0.899
Stolen BasesR 0.04 0.03 – 0.04 <0.001
Hits AllowedP -0.00 -0.00 – 0.00 0.058
Homeruns AllowedP 0.01 -0.04 – 0.06 0.599
StrikeoutsP 0.00 0.00 – 0.00 0.004
Observations 2276
R2 / R2 adjusted 0.293 / 0.289
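A table of this kind comes from an ordinary `lm()` fit. The sketch below uses simulated data with only two predictors, so the variable names match the dataset but the values and true coefficients are invented; it is not the model reported above.

```r
# Simulated stand-in data with known coefficients (made-up values)
set.seed(42)
n <- 500
toy <- data.frame(
  Base_Hits = rnorm(n, mean = 1450, sd = 120),
  ErrorsD   = rnorm(n, mean = 200,  sd = 80)
)
toy$Wins <- 5 + 0.05 * toy$Base_Hits - 0.02 * toy$ErrorsD + rnorm(n, sd = 8)

# Multiple linear regression of wins on the predictors
fit <- lm(Wins ~ Base_Hits + ErrorsD, data = toy)

# Estimates with 95% confidence intervals, and the adjusted R-squared
print(round(cbind(Estimate = coef(fit), confint(fit)), 3))
print(summary(fit)$adj.r.squared)

# Reading a coefficient: expected change in wins for 100 additional base hits
print(100 * coef(fit)[["Base_Hits"]])
```

Scaling a coefficient by a realistic change in the variable, as in the last line, often makes small per-unit estimates easier to judge.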


      The first regression shows that most batting and pitching variables generate statistically significant results, which could indicate that these two aspects of the game are the most important. The most impactful variables are Base Hits and Triples. The regression indicates that for each additional base hit, the team, on average, wins 0.047 more games, ceteris paribus. The triples coefficient suggests that for each additional triple, the team, on average, wins 0.07 more games, ceteris paribus. While the coefficient on base hits is smaller, there are also many more base hits than triples, which indicates that, at least offensively, base hits are the most critical aspect of the game.


      Defensively, errors and pitcher strikeouts generate statistically significant results, and hits allowed is marginally significant (p = 0.058), which indicates that, defensively, pitching is the most essential aspect of the game. The errors variable has the largest coefficient in absolute value; however, errors are not as frequent as strikeouts, so strikeouts are the most crucial part of the defensive game. The coefficient on strikeouts by a pitcher indicates that, on average, for each additional strikeout by a pitcher, the team wins 0.0027 more games, ceteris paribus.


      Since home runs do not produce statistically significant results, I can make no assertions regarding my risk-taking hypothesis.

5.2 Residual Analysis 1.1


      The residual analysis of the model provides some promising results. The residuals vs. fitted graph shows a straight line that is reasonably close to zero, indicating that the model is a good fit, and the scale-location graph shows a fairly even spread, which means that heteroscedasticity is likely not a concern. The Q-Q graph, which compares the distribution of the regression residuals to a normal distribution, also does not reveal any problems with the normality assumption. However, the residuals vs. leverage graph shows severe outliers that skew the results.
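The four graphs discussed above are the default diagnostic plots that base R draws for an `lm` object. The sketch below reproduces them on simulated data with one planted outlier; the Cook's distance cutoff of 4/n used to flag points is a common rule of thumb assumed here, not a value taken from the analysis.

```r
# Simulated data with one severe, high-leverage outlier (made-up values)
set.seed(11)
x <- 1:50
y <- 2 * x + rnorm(50, sd = 3)
y[50] <- 300
toy <- data.frame(x = x, y = y)

fit <- lm(y ~ x, data = toy)

# Default diagnostics: residuals vs. fitted, Q-Q, scale-location,
# and residuals vs. leverage, arranged on one page
op <- par(mfrow = c(2, 2))
plot(fit)
par(op)

# Flag influential observations with Cook's distance (assumed cutoff: 4 / n)
cd <- cooks.distance(fit)
keep <- cd <= 4 / nrow(toy)
print(which(!keep))

# Refit without the flagged observations and compare the slope
refit <- lm(y ~ x, data = toy[keep, ])
print(c(before = coef(fit)[["x"]], after = coef(refit)[["x"]]))
```

Selecting rows with a logical vector avoids the pitfall of negative indexing with an empty set, which would drop every row if nothing were flagged.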


      Removing the outliers can improve the model, and log transforming the variables should improve the model’s fit, especially since the win variable is not normally distributed.


      As to the Gauss-Markov assumptions, the OLS estimator appears to be BLUE. There is no reason to suspect that the data was not collected randomly or that the model is not linear in parameters. At first, one might think that the two strikeout variables could be cause for concern regarding multicollinearity, but they measure different things for the same team (strikeouts by its batters and strikeouts by its pitchers), so this is not an issue. The zero conditional mean assumption is difficult to confirm, especially when using an OLS estimator; however, including enough controls should alleviate these concerns. Finally, homoscedasticity seems to be in order, as per the scale-location graph in Residual Analysis 1.1.

5.3 Regression 1.2

  log1p(Wins)
Predictors Estimates CI p
(Intercept) -2.07 -2.98 – -1.16 <0.001
Base Hits [log1p] 14.64 8.94 – 20.33 <0.001
DoublesB [log1p] -0.10 -0.16 – -0.04 0.001
TriplesB [log1p] 0.07 0.05 – 0.10 <0.001
HomerunsB [log1p] -0.28 -0.36 – -0.20 <0.001
WalksB [log1p] -13.46 -19.22 – -7.71 <0.001
Hit by a PitchB 0.00 -0.00 – 0.00 0.114
StrikeoutsB 0.00 -0.00 – 0.00 0.257
ErrorsD [log1p] -0.17 -0.20 – -0.14 <0.001
Walks AllowedP [log1p] 13.50 7.76 – 19.25 <0.001
Stolen BasesR [log1p] 0.07 0.05 – 0.08 <0.001
Hits AllowedP [log1p] -13.66 -19.36 – -7.95 <0.001
Homeruns AllowedP [log1p] 0.29 0.21 – 0.36 <0.001
StrikeoutsP [log1p] -0.03 -0.05 – -0.02 <0.001
Observations 2271
R2 / R2 adjusted 0.347 / 0.343


      The second set of regression results drastically improves the model’s fit and the statistical significance of many key variables. I opted to log transform (using log1p) all variables except hit by a pitch and batter strikeouts, since those variables seem normally distributed. The adjusted R-squared has increased from 0.289 to 0.343, which means that the new model explains over five percentage points more of the variation in the Wins variable than the previous one. We can also see some increases in statistical significance and changes in coefficient size. The base hits variable is now much more important: on average, a one percent increase in base hits is associated with a 14.64 percent increase in the number of wins, ceteris paribus.
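The percent interpretation used here follows from the log-log form of the model. As a short sketch, write W for wins, H for base hits, and β_H for the base hits coefficient (symbols introduced only for this illustration):

```latex
\begin{align*}
\log(1 + W) &= \beta_0 + \beta_H \log(1 + H) + \dots + \varepsilon \\
\frac{\partial \log(1 + W)}{\partial \log(1 + H)} &= \beta_H \\
\%\Delta (1 + W) &\approx \beta_H \cdot \%\Delta (1 + H)
\end{align*}
```

Because wins and hits are large relative to one, a percentage change in 1 + W is close to a percentage change in W. The approximation holds for small changes; for variables left in levels (hit by a pitch and batter strikeouts), the coefficient instead approximates the proportional change in wins per one-unit change.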


      Home runs have also now become statistically significant and indicate that, on average, a one percent increase in home runs is associated with a 0.28 percent decrease in the number of wins, ceteris paribus. This coefficient change supports my hypothesis about risk-taking behavior, and if we look at the coefficient on Walks Allowed, the theory becomes more plausible. If a team has a lot of heavy hitters, the pitcher will probably choose to walk them rather than pitch to them, increasing the chance of the pitcher’s team winning, which is reflected in the walks allowed coefficient. On average, a one percent increase in walks allowed is associated with a 13.50 percent increase in the number of wins, ceteris paribus.


      The one seemingly counterintuitive part of the model is the negative coefficient on strikeouts by a pitcher. On average, a one percent increase in pitcher strikeouts is associated with a 0.03 percent decrease in the number of wins, ceteris paribus. I suspect that the reason is that a team with a lot of strikeouts is playing more competitive games, in which its opponent bats a lot, and these games are harder to win.

5.4 Residual Analysis 1.2


      The residual analysis has also improved. The residuals vs. fitted graph looks cleaner, with the fitted line being closer to the zero line, the scale-location graph remains reasonably evenly distributed, and the previous issues with the residuals vs. leverage graph and the presence of outliers have been corrected.


      Another possible way of improving the model is to include interaction terms of offensive and defensive variables. While these might be hard to interpret, they could control for much of the endogeneity that comes from omitted offensive and defensive variables.
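In R's formula syntax, interaction terms of this kind are written with `*`, which expands to all main effects and all interactions among the crossed variables. The sketch below only illustrates that expansion; the formula is an assumption made for this example, since the report does not show the formula that was actually used.

```r
# Term labels generated by fully crossing four batting variables
batting <- terms(~ Base_Hits * DoublesB * TriplesB * HomerunsB)
labels_batting <- attr(batting, "term.labels")
print(labels_batting)

# 4 main effects + 6 two-way + 4 three-way + 1 four-way interaction
print(length(labels_batting))

# On simulated data, a * b is shorthand for a + b + a:b (made-up values)
set.seed(5)
toy <- data.frame(a = rnorm(100), b = rnorm(100))
toy$y <- 1 + toy$a + toy$b + 0.5 * toy$a * toy$b + rnorm(100, sd = 0.1)
fit_short <- lm(y ~ a * b, data = toy)
fit_long  <- lm(y ~ a + b + a:b, data = toy)
print(all.equal(coef(fit_short), coef(fit_long)))
```

The term count grows quickly with every crossed variable, which is worth keeping in mind when weighing interpretability against fit.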

5.5 Model 2

5.6 Regression 2.1

  log1p(Wins)
Predictors Estimates CI p
(Intercept) 1.95 -10.15 – 14.06 0.751
Base Hits [log1p] 0.23 -1.71 – 2.16 0.818
DoublesB [log1p] 0.26 -0.21 – 0.73 0.281
TriplesB [log1p] -0.07 -0.16 – 0.01 0.088
HomerunsB [log1p] -0.53 -0.67 – -0.40 <0.001
WalksB [log1p] -0.10 -0.44 – 0.23 0.541
Hit by a PitchB -0.01 -0.02 – 0.01 0.347
StrikeoutsB -0.00 -0.00 – 0.00 0.711
ErrorsD [log1p] -0.13 -0.18 – -0.07 <0.001
Walks AllowedP [log1p] 0.21 -0.04 – 0.45 0.095
Stolen BasesR [log1p] 0.08 0.06 – 0.10 <0.001
Hits AllowedP [log1p] -0.25 -0.67 – 0.18 0.252
Homeruns AllowedP [log1p] 0.42 0.31 – 0.53 <0.001
StrikeoutsP [log1p] -0.02 -0.04 – -0.00 0.021
Base Hits 0.00 -0.00 – 0.00 0.156
DoublesB 0.00 -0.01 – 0.01 0.663
TriplesB 0.02 0.00 – 0.04 0.014
HomerunsB 0.02 0.00 – 0.04 0.017
ErrorsD -0.00 -0.01 – 0.00 0.233
Walks AllowedP -0.00 -0.00 – -0.00 0.001
Hits AllowedP -0.00 -0.00 – -0.00 <0.001
Homeruns AllowedP -0.01 -0.01 – -0.00 <0.001
Base Hits × DoublesB -0.00 -0.00 – 0.00 0.544
Base Hits × TriplesB -0.00 -0.00 – -0.00 0.027
DoublesB × TriplesB -0.00 -0.00 – 0.00 0.258
Base Hits × HomerunsB -0.00 -0.00 – 0.00 0.084
DoublesB × HomerunsB -0.00 -0.00 – 0.00 0.136
TriplesB × HomerunsB -0.00 -0.00 – 0.00 0.987
Hit by a PitchB × ErrorsD 0.00 -0.00 – 0.00 0.257
Walks AllowedP × Hits AllowedP 0.00 0.00 – 0.00 <0.001
Walks AllowedP × Homeruns AllowedP 0.00 0.00 – 0.00 <0.001
Hits AllowedP × Homeruns AllowedP 0.00 0.00 – 0.00 <0.001
(Base Hits × DoublesB) × TriplesB 0.00 -0.00 – 0.00 0.204
(Base Hits × DoublesB) × HomerunsB 0.00 -0.00 – 0.00 0.201
(Base Hits × TriplesB) × HomerunsB 0.00 -0.00 – 0.00 0.761
(DoublesB × TriplesB) × HomerunsB -0.00 -0.00 – 0.00 0.859
(Walks AllowedP × Hits AllowedP) × Homeruns AllowedP -0.00 -0.00 – -0.00 <0.001
(Base Hits × DoublesB × TriplesB) × HomerunsB -0.00 -0.00 – 0.00 0.936
Observations 2276
R2 / R2 adjusted 0.489 / 0.480


      The new regression model is chaotic. While the adjusted R-squared has increased to 0.480, a gain of nearly 14 percentage points in explained variation of wins, the number of interaction terms in the model makes it hard to interpret. Many statistically significant variables also have close to no effect on the dependent variable because their coefficients are so small. While the model perhaps controls for some more endogeneity, it does not tell a concise story, and many of the coefficients are nonsensical.

5.7 Residual Analysis 2.1



      The residual analysis shows some concerning features. The residuals vs. fitted graph shows a pattern, indicating issues with the model’s fit, and the scale-location graph has some outliers that could pose heteroscedasticity issues. I think that including all the interaction terms has created a zero conditional mean violation. While the model now explains more of the variation in wins, it comes at the cost of endogeneity and omitted variable bias.

6 Evaluation Dataset

6.1 Model Selection


      While Model 2.1 explains the most variation in the wins variable, it faces some issues. Its R-squared is the highest, but the interaction terms, which combine parts of the game such as batting and pitching, make the model too complicated to interpret. Additionally, the residual analysis of Model 2.1 has shown issues with fit and possible heteroscedasticity concerns. If I cannot explain why something is working, I should not use it, and Model 2.1 is just an econometric version of a Rubik’s cube.


      The decision is, therefore, between Models 1.1 and 1.2, which is easy. Model 1.2 has a higher R-squared, has more statistically significant coefficients, and performs better in the residual analysis. I will, therefore, use Model 1.2 to predict the win variable in the evaluation dataset.

6.2 Data Cleaning


      The evaluation data faces the same issues as the training dataset, meaning it is missing the same percentages of observations in the same variables. For that reason, I go through the same process, filling in the missing values with the variable means and dropping the double plays and caught stealing variables from the dataset.


      I use the vis_miss graph again to show the variables with missing observations. As we can see, hit by a pitch and caught stealing are, in fact, still missing the most observations and should, therefore, be dropped.


      I now rerun the vis_miss graph to ensure that there are no more missing observations and that the appropriate variables have been dropped.


      The graph shows that our data is in order and is ready to be analyzed.

6.3 Prediction


      I have used the predict function to generate the predicted values of wins for the evaluation dataset. Since printing these values out seemed impractical, I opted to create summary statistics for the variable and compare those to the summary statistics of the wins variable in the training dataset.
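The prediction and summary step can be sketched as follows. The data are simulated and the model has a single predictor, so this only illustrates the mechanics of `predict()` with a log1p model; `describe()` is a hypothetical helper written for this example.

```r
# Simulated training and evaluation data (made-up values)
set.seed(7)
n <- 400
train <- data.frame(Base_Hits = round(rnorm(n, mean = 1450, sd = 120)))
train$Wins <- round(pmax(0, -40 + 0.08 * train$Base_Hits + rnorm(n, sd = 12)))
eval_toy <- data.frame(Base_Hits = round(rnorm(50, mean = 1450, sd = 120)))

# Fit on the training data; transformations inside the formula are
# re-applied automatically to newdata by predict()
fit_log <- lm(log1p(Wins) ~ log1p(Base_Hits), data = train)
pred_log <- predict(fit_log, newdata = eval_toy)

# Hypothetical helper producing a small Metric / Value table
describe <- function(x) {
  data.frame(Metric = c("Mean", "SD", "Minimum", "Maximum", "Median"),
             Value  = c(mean(x), sd(x), min(x), max(x), median(x)))
}

print(describe(pred_log))             # predicted log1p(Wins), evaluation data
print(describe(log1p(train$Wins)))    # actual log1p(Wins), training data
```

Fitted values from a regression vary less than the outcome they are fitted to, so a smaller standard deviation for the predictions is expected rather than alarming.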

6.3.1 Comparison

Predicted log1p(Wins), evaluation dataset
##    Metric     Value
## 1    Mean 4.3787217
## 2      SD 0.1707904
## 3 Minimum 2.5238309
## 4 Maximum 4.7012641
## 5  Median 4.3969855

Actual log1p(Wins), training dataset
##    Metric     Value
## 1    Mean 4.3815011
## 2      SD 0.2353917
## 3 Minimum 0.0000000
## 4 Maximum 4.9904326
## 5  Median 4.4188406


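      The figures may seem strange since both sets of values are on the log scale (the model uses log1p of wins); however, this is not a problem since we are simply comparing them. If we wanted the actual values, all we would have to do is exponentiate them, as done below for the training and predicted means, respectively. Strictly speaking, the exact inverse of log1p is expm1, which gives values one lower than exp.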

## [1] 79.95797
## [1] 79.73604
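The difference between the two back-transformations is easy to check. A minimal sketch with made-up win totals:

```r
# Made-up win totals, transformed the same way as in the model
wins <- c(0, 79, 95)
log_wins <- log1p(wins)

# expm1() is the exact inverse of log1p()
print(expm1(log_wins))

# exp() overshoots every value by exactly one
print(exp(log_wins) - wins)
```

For win totals near 80 the discrepancy is small in relative terms, but it is systematic, so `expm1()` is the safer choice when reporting predictions in levels.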


      Comparing the means of the predicted and actual values, we see that they are almost identical, differing by less than one-tenth of a percent on the log scale. The predicted values are much more concentrated, as the standard deviation and the range show. The most significant difference is in the minimum, which is 0 in the training dataset and 2.524 in the evaluation dataset. Outside of this, however, the two variables are very similar, and the model appears to have done a good job of predicting wins.

6.3.2 Graphs


      To complete the analysis, I have generated two probability density functions, one for the predicted wins and one for the actual wins, and combined them on the plot to see how their distributions differ.
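An overlay of this kind can be drawn with base R's `density()`. The sketch below uses simulated values whose means and standard deviations are loosely modeled on the comparison tables; it is not the plot from the report, which may have been produced with a different package.

```r
# Simulated stand-ins for actual and predicted log1p(Wins) (made-up values)
set.seed(3)
actual    <- rnorm(500, mean = 4.38, sd = 0.24)
predicted <- rnorm(200, mean = 4.38, sd = 0.17)

# Kernel density estimates for both sets of values
d_actual <- density(actual)
d_pred   <- density(predicted)

# Draw both curves on shared axes so the spreads can be compared directly
plot(d_pred, col = "blue", main = "Predicted vs. actual log1p(Wins)",
     xlab = "log1p(Wins)", xlim = range(d_actual$x, d_pred$x))
lines(d_actual, col = "red")
legend("topleft", legend = c("Predicted", "Actual"),
       col = c("blue", "red"), lty = 1)
```

Plotting the narrower distribution first keeps its taller peak inside the plotting area.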


     As we can see, the two distributions are spread out similarly, though the predicted values are much more concentrated than the actual values. To account for this in the future, including team-identifying variables could allow us to use different econometric methods, such as fixed effects, and control for more endogeneity. Overall, however, the model seems to have predicted wins effectively and provided us with reliable results.