DATA 621 Homework #1
From Work by Critical Thinking Group 3
Introduction
We have been given a dataset with 2,276 records summarizing major league baseball teams' seasons. The records span 1871 to 2006 inclusive. All statistics have been adjusted to match the performance of a 162-game season. The objective is to build a linear regression model to predict the number of wins for a team.
Working Theory
We are working on the premise that there are "good" teams and there are "bad" teams: the good teams win more than the bad teams. We assume that some of the predictors will be higher for the good teams than for the bad teams, and consequently that we can use these variables to predict how many games a team will win in a season.
Notes About the Data
There are some difficulties with this dataset. First, it covers a very wide time period. We know there are different "eras" of baseball, and this data spans several of them. Have the fundamental relationships between winning and these predictors changed over time? We think they have. If so, this will be a challenge.
Data Exploration
First Look at the Data
We will first look at the data to get a sense of what we have.
library(tidyverse)   # dplyr, tidyr, and ggplot2
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

training %>%
  gather(variable, value, TARGET_WINS:TEAM_FIELDING_DP) %>%
  ggplot(., aes(value)) +
  geom_density(fill = "indianred4", color = "indianred4") +
  facet_wrap(~variable, scales = "free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())
# Render summary() output as a styled table
quick_summary <- function(df){
  df %>%
    summary() %>%
    kable() %>%
    kable_styling()
}
quick_summary(training)
TARGET_WINS | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_BATTING_HBP | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. : 0.00 | Min. : 891 | Min. : 69.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. :29.00 | Min. : 1137 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 65.0 | Min. : 52.0 | |
1st Qu.: 71.00 | 1st Qu.:1383 | 1st Qu.:208.0 | 1st Qu.: 34.00 | 1st Qu.: 42.00 | 1st Qu.:451.0 | 1st Qu.: 548.0 | 1st Qu.: 66.0 | 1st Qu.: 38.0 | 1st Qu.:50.50 | 1st Qu.: 1419 | 1st Qu.: 50.0 | 1st Qu.: 476.0 | 1st Qu.: 615.0 | 1st Qu.: 127.0 | 1st Qu.:131.0 | |
Median : 82.00 | Median :1454 | Median :238.0 | Median : 47.00 | Median :102.00 | Median :512.0 | Median : 750.0 | Median :101.0 | Median : 49.0 | Median :58.00 | Median : 1518 | Median :107.0 | Median : 536.5 | Median : 813.5 | Median : 159.0 | Median :149.0 | |
Mean : 80.79 | Mean :1469 | Mean :241.2 | Mean : 55.25 | Mean : 99.61 | Mean :501.6 | Mean : 735.6 | Mean :124.8 | Mean : 52.8 | Mean :59.36 | Mean : 1779 | Mean :105.7 | Mean : 553.0 | Mean : 817.7 | Mean : 246.5 | Mean :146.4 | |
3rd Qu.: 92.00 | 3rd Qu.:1537 | 3rd Qu.:273.0 | 3rd Qu.: 72.00 | 3rd Qu.:147.00 | 3rd Qu.:580.0 | 3rd Qu.: 930.0 | 3rd Qu.:156.0 | 3rd Qu.: 62.0 | 3rd Qu.:67.00 | 3rd Qu.: 1682 | 3rd Qu.:150.0 | 3rd Qu.: 611.0 | 3rd Qu.: 968.0 | 3rd Qu.: 249.2 | 3rd Qu.:164.0 | |
Max. :146.00 | Max. :2554 | Max. :458.0 | Max. :223.00 | Max. :264.00 | Max. :878.0 | Max. :1399.0 | Max. :697.0 | Max. :201.0 | Max. :95.00 | Max. :30132 | Max. :343.0 | Max. :3645.0 | Max. :19278.0 | Max. :1898.0 | Max. :228.0 | |
NA | NA | NA | NA | NA | NA | NA’s :102 | NA’s :131 | NA’s :772 | NA’s :2085 | NA | NA | NA | NA’s :102 | NA | NA’s :286 |
Some initial observations:
- The response variable (`TARGET_WINS`) looks to be normally distributed. This supports the working theory that there are good teams and bad teams. There are also a lot of average teams.
- Quite a few variables have missing values. We will need to deal with these in order to keep the data set as large as possible for modeling.
- A couple of variables are bimodal (`TEAM_BATTING_HR`, `TEAM_BATTING_SO`, `TEAM_PITCHING_HR`). Some of these also have missing values, which will make imputation more challenging.
- Some variables are right skewed (`TEAM_BASERUN_CS`, `TEAM_BASERUN_SB`, etc.). This might support the good-team theory. It may also introduce non-normally distributed residuals in the model. We shall see.
Correlations
Let's take a look at the correlations. The following are the correlations computed from complete cases only:
library(corrplot)  # corrplot()

training %>%
  cor(., use = "complete.obs") %>%
  corrplot(., method = "color", type = "upper", tl.col = "black", diag = FALSE)

# Build a tidy, sorted table of pairwise correlations (upper triangle only)
temp <- training %>%
  cor(., use = "complete.obs")
temp[lower.tri(temp, diag = TRUE)] <- ""
temp <- temp %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(Variable, Correlation, -rowname) %>%
  filter(Variable != rowname) %>%
  filter(Correlation != "") %>%
  mutate(Correlation = as.numeric(Correlation)) %>%
  rename(` Variable` = rowname) %>%  # leading space avoids a duplicate column name
  arrange(desc(abs(Correlation)))
Correlations with Response Variable
Let’s take a look at how the predictors are correlated with the response variable:
training %>%
gather(variable, value, -TARGET_WINS) %>%
ggplot(., aes(value, TARGET_WINS)) +
geom_point(fill = "indianred4", color="indianred4") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = "Wins")
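The table below pulls the `TARGET_WINS` rows from the tidy correlation frame `temp` built above. A minimal sketch of the call, assuming that is how the table was rendered:

temp %>%
  filter(` Variable` == "TARGET_WINS") %>%
  kable() %>%
  kable_styling()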
Variable | Variable | Correlation |
---|---|---|
TARGET_WINS | TEAM_PITCHING_H | 0.4712343 |
TARGET_WINS | TEAM_BATTING_H | 0.4699467 |
TARGET_WINS | TEAM_BATTING_BB | 0.4686879 |
TARGET_WINS | TEAM_PITCHING_BB | 0.4683988 |
TARGET_WINS | TEAM_PITCHING_HR | 0.4224668 |
TARGET_WINS | TEAM_BATTING_HR | 0.4224168 |
TARGET_WINS | TEAM_FIELDING_E | -0.3866880 |
TARGET_WINS | TEAM_BATTING_2B | 0.3129840 |
TARGET_WINS | TEAM_PITCHING_SO | -0.2293648 |
TARGET_WINS | TEAM_BATTING_SO | -0.2288927 |
TARGET_WINS | TEAM_FIELDING_DP | -0.1958660 |
TARGET_WINS | TEAM_BASERUN_CS | -0.1787560 |
TARGET_WINS | TEAM_BATTING_3B | -0.1243459 |
TARGET_WINS | TEAM_BATTING_HBP | 0.0735042 |
TARGET_WINS | TEAM_BASERUN_SB | 0.0148364 |
It looks like hits, walks, home runs, and errors have the strongest correlations with wins, though none of these correlations is particularly strong. This suggests there is a lot of noise in these relationships.
It is interesting to note that allowing hits is positively correlated with wins. How strange! It is also noteworthy that pitching strikeouts are negatively correlated with winning, which does not make sense. Examining the scatter plots above, it becomes apparent that these correlations are being affected by some outliers.
Strong Correlations (Absolute Value > 0.5)
Are any predictors correlated with each other? We will look only for "strong" correlations (absolute value greater than 0.5):
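This table can be produced by filtering the same tidy frame `temp` to strong pairwise correlations (a sketch, assuming that is the mechanism used):

temp %>%
  filter(abs(Correlation) > 0.5) %>%
  kable() %>%
  kable_styling()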
Variable | Variable | Correlation |
---|---|---|
TEAM_BATTING_HR | TEAM_PITCHING_HR | 0.9999326 |
TEAM_BATTING_BB | TEAM_PITCHING_BB | 0.9998814 |
TEAM_BATTING_SO | TEAM_PITCHING_SO | 0.9997684 |
TEAM_BATTING_H | TEAM_PITCHING_H | 0.9991927 |
TEAM_BASERUN_SB | TEAM_BASERUN_CS | 0.6247378 |
TEAM_BATTING_H | TEAM_BATTING_2B | 0.5617729 |
TEAM_BATTING_2B | TEAM_PITCHING_H | 0.5604535 |
There are four pairs of variables with correlations of almost 1! We will need to be careful to avoid introducing multicollinearity into our model.
Strange Data Values
Missing Values
During our first look at the data it was noted that there were variables that are missing data. Here’s a look at what variables are missing data and how big of a problem it is:
training %>%
gather(variable, value) %>%
filter(is.na(value)) %>%
group_by(variable) %>%
tally() %>%
mutate(percent = n / nrow(training) * 100) %>%
mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
arrange(desc(n)) %>%
rename(`Variable Missing Data` = variable,
`Number of Records` = n,
`Share of Total` = percent) %>%
kable() %>%
kable_styling()
Variable Missing Data | Number of Records | Share of Total |
---|---|---|
TEAM_BATTING_HBP | 2085 | 92% |
TEAM_BASERUN_CS | 772 | 34% |
TEAM_FIELDING_DP | 286 | 13% |
TEAM_BASERUN_SB | 131 | 5.8% |
TEAM_BATTING_SO | 102 | 4.5% |
TEAM_PITCHING_SO | 102 | 4.5% |
The hit-by-pitch variable is missing over 90% of its data. We will exclude it from consideration in our model.
Caught stealing (`TEAM_BASERUN_CS`) is next on the list. It may be possible to predict it using `TEAM_BASERUN_SB`, since the two are strongly correlated, but there are 131 records where both are missing.
The strikeouts are going to be a little tricky because of their bimodal distribution. All 102 records missing batting strikeouts appear to be missing pitching strikeouts as well (each variable has exactly 102 NAs).
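A one-line check of that claim (a sketch, not in the original):

training %>%
  summarise(both_missing = sum(is.na(TEAM_BATTING_SO) & is.na(TEAM_PITCHING_SO)))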
Zero Values
There are also variables with values of zero. Let's see how big of a problem this is:
training %>%
gather(variable, value) %>%
filter(value == 0) %>%
group_by(variable) %>%
tally() %>%
mutate(percent = n / nrow(training) * 100) %>%
mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
arrange(desc(n)) %>%
rename(`Variable With Zeros` = variable,
`Number of Records` = n,
`Share of Total` = percent) %>%
kable() %>%
kable_styling()
Variable With Zeros | Number of Records | Share of Total |
---|---|---|
TEAM_BATTING_SO | 20 | 0.9% |
TEAM_PITCHING_SO | 20 | 0.9% |
TEAM_BATTING_HR | 15 | 0.7% |
TEAM_PITCHING_HR | 15 | 0.7% |
TEAM_BASERUN_SB | 2 | 0.1% |
TEAM_BATTING_3B | 2 | 0.1% |
TARGET_WINS | 1 | 0% |
TEAM_BASERUN_CS | 1 | 0% |
TEAM_BATTING_BB | 1 | 0% |
TEAM_PITCHING_BB | 1 | 0% |
This isn’t nearly as large of a problem as the missing values.
Deeper Dive into the Variables
TARGET_WINS
training %>%
ggplot(aes(TARGET_WINS)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TARGET_WINS)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TARGET_WINS)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Wins",
caption = "* Red line is the mean value and green is the median")
The range of the data looks good. There is a single zero value, which is the Washington Nationals' 1872 season.
TEAM_BATTING_3B
This field represents triples hit by the team. Triples aren't very common: the ball stays in the field of play (unlike a home run), yet the batter still has enough time to reach third base.
training %>%
ggplot(aes(TEAM_BATTING_3B)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_3B)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_3B)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Triples",
caption = "* Red line is the mean value and green is the median")
Looking at the distribution, the value of zero doesn’t look too unusual. Even if it were, the value is not likely to have a large impact.
TEAM_BATTING_HR
Although home runs are more common in modern baseball (thank you, PEDs!), there are some low values in the data, so zero doesn't seem too unusual here either.
training %>%
ggplot(aes(TEAM_BATTING_HR)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_HR)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_HR)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Homeruns",
caption = "* Red line is the mean value and green is the median")
TEAM_BATTING_BB
This variable represents when the batter is “walked” by the pitcher (also known as Base on Balls):
training %>%
ggplot(aes(TEAM_BATTING_BB)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_BB)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_BB)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Walks (Base on Balls)",
caption = "* Red line is the mean value and green is the median")
Four balls will walk a batter in modern baseball; however, that wasn't always the case. A century or more ago (within the date range of this data set) a walk required as many as nine balls[1]. Knowing this, and looking at the left tail of the values above, it is not unreasonable that there might be a season with no walks. As with triples above, leaving the one zero data point in is unlikely to adversely impact any regression, since there are valid values nearby.
TEAM_BATTING_SO
Here we saw some NA values, 102 of them to be specific, plus 20 zero values.
training %>%
ggplot(aes(TEAM_BATTING_SO)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_SO,na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_SO, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Strikeouts (Batter)",
caption = "* Red line is the mean value and green is the median")
First, the zero values seem nigh-impossible. An entire season (162 games) without a single batter striking out is highly suspect, let alone 20 such seasons in the dataset.
We will replace these values with imputed ones. The distribution looks to be bimodal, so imputing with a mean or median (which falls squarely between the two peaks) may cause issues with the model.
Instead, we will impute each missing value by looking at the record's nearest neighbors (based on the other variables) and taking a weighted average of their values (KNN imputation).
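A minimal sketch of that imputation, using DMwR's knnImputation() (the same function called in Data Preparation below; k = 10 is its default):

library(DMwR)  # knnImputation(): distance-weighted average of the k nearest complete cases
knn <- knnImputation(training, k = 10)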
TEAM_BASERUN_SB
With this variable, we have a good number of NA values, and 2 zeroes:
training %>%
ggplot(aes(TEAM_BASERUN_SB)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BASERUN_SB, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BASERUN_SB, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Stolen Bases",
caption = "* Red line is the mean value and green is the median")
The zeroes may be legitimate here so we will leave them alone. For the NAs, we will use the same KNN imputation we used above for strikeouts to impute values.
TEAM_BASERUN_CS
This variable is NA for nearly a third of the records, with a single zero value (which could be legitimate):
training %>%
ggplot(aes(TEAM_BASERUN_CS)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BASERUN_CS, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BASERUN_CS, na.rm = T)),col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Caught Stealing Bases",
caption = "* Red line is the mean value and green is the median")
Despite the high number of missing values (and a potential for increased error), we will use the KNN imputed values.
TEAM_BATTING_HBP
With this variable, we see nearly all entries are missing:
training %>%
ggplot(aes(TEAM_BATTING_HBP)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_HBP, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_HBP, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Being Hit By a Pitch",
caption = "* Red line is the mean value and green is the median")
We will drop this variable as we cannot responsibly impute values with such thin data.
TEAM_PITCHING_HR
This variable has no NA values and only a few zeros, which seem legitimate given the distribution:
training %>%
ggplot(aes(TEAM_PITCHING_HR)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_PITCHING_HR, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_PITCHING_HR, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Homeruns Pitched",
caption = "* Red line is the mean value and green is the median")
TEAM_PITCHING_BB
Here we have no NA values and a single zero:
training %>%
ggplot(aes(TEAM_PITCHING_BB)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_PITCHING_BB, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_PITCHING_BB, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Walks Pitched",
caption = "* Red line is the mean value and green is the median")
As we did with walks above, we can assume it is possible to have no walks (and therefore to pitch no walks), so we will leave the zero alone.
However, there are some very high values in the data, which strain credibility. We could flag anything beyond the usual outlier cutoff (\(1.5 \cdot \text{IQR}\) past the quartiles) and set it to NA so those records are excluded from any model built with this variable. But when you do the math, the values seem extreme yet plausible. For example, an MLB season currently has at most 162 games, so a max value of 3,645 walks pitched works out to 22.5 walks per game on average, or 2.5 walks per inning across 9 innings.
We would be surprised if a pitcher weren't removed after an inning or two of walking 2-3 batters, but we cannot rule it out as a possibility.
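The arithmetic, spelled out:

3645 / 162        # 22.5 walks per game over a 162-game season
3645 / (162 * 9)  # 2.5 walks per inning over 9 innings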
TEAM_PITCHING_SO
This variable represents strikeouts pitched. We see that there are 102 NA values and a lot of extremely high values:
training %>%
ggplot(aes(TEAM_PITCHING_SO)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_PITCHING_SO, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_PITCHING_SO, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Strikeouts Pitched",
caption = "* Red line is the mean value and green is the median")
The extreme values need to be handled. First, a typical game is 9 innings long, and in each inning a pitching staff can record at most 3 strikeouts (at which point its half of the inning is over). Those 27 potential strikeouts multiplied by 162 games put the upper limit near 4,374 per season.
Games can go beyond 9 innings, but even if every game in a season were as long as the longest MLB game ever (26 innings), a team could record only 12,636 strikeouts. So the max value of 19,278 is invalid.
We'll make a high-yet-reasonable assumption of an average of 11 innings per game, and declare anything above 5,346 strikeouts invalid by setting it to NA so it will be imputed prior to modeling.
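The same bounds as quick arithmetic:

3 * 9 * 162   # 4,374: every out a strikeout in nine-inning games
3 * 26 * 162  # 12,636: every game as long as the longest MLB game ever
3 * 11 * 162  # 5,346: our cutoff, assuming an average of 11 innings per game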
TEAM_FIELDING_DP
The values in this variable seem reasonable, however we do have some NA values.
training %>%
ggplot(aes(TEAM_FIELDING_DP)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_FIELDING_DP, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_FIELDING_DP, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Counts",
title = "Distribution of Double Plays",
caption = "* Red line is the mean value and green is the median")
Again, we will use the KNN imputation from earlier to fill in NAs with imputed values.
Data Preparation
Fixing Missing/Zero Values
First we will flag the invalid data as missing so it can be imputed: the zero batting strikeouts and the pitching strikeouts above our 5,346 cutoff become NA. We will also drop the hit-by-pitch variable from the dataset. We will then impute the missing data using KNN.
library(DMwR)  # knnImputation()

# Flag invalid values as NA and drop the hit-by-pitch variable
training <- training %>%
  mutate(TEAM_BATTING_SO = ifelse(TEAM_BATTING_SO == 0, NA, TEAM_BATTING_SO)) %>%
  mutate(TEAM_PITCHING_SO = ifelse(TEAM_PITCHING_SO > 5346, NA, TEAM_PITCHING_SO)) %>%
  select(-TEAM_BATTING_HBP)

# Impute with KNN, then copy the imputed values back one variable at a time
set.seed(42)
knn <- training %>% knnImputation()
impute_vars <- c("TEAM_BATTING_SO", "TEAM_BASERUN_SB", "TEAM_BASERUN_CS",
                 "TEAM_PITCHING_SO", "TEAM_FIELDING_DP")
for (v in impute_vars) {
  impute_me <- is.na(training[[v]])
  training[impute_me, v] <- knn[impute_me, v]
}
Feature Engineering
Batting singles are not included in the dataset, but we can back them out of total hits by subtracting doubles, triples, and home runs.
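A minimal sketch of the derivation (the column name `TEAM_BATTING_1B` matches the summary and models below):

# Singles = total hits minus extra-base hits
training <- training %>%
  mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR)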
Results
Here’s what the data look like after imputation and correction:
training %>%
gather(variable, value) %>%
ggplot(., aes(value)) +
geom_density(fill = "indianred4", color="indianred4") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = element_blank())
quick_summary(training)
TARGET_WINS | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP | TEAM_BATTING_1B | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. : 0.00 | Min. : 891 | Min. : 69.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 66.0 | Min. : 0.0 | Min. : 0.00 | Min. : 1137 | Min. : 0.0 | Min. : 0.0 | Min. : 0 | Min. : 65.0 | Min. : 52.0 | Min. : 709.0 | |
1st Qu.: 71.00 | 1st Qu.:1383 | 1st Qu.:208.0 | 1st Qu.: 34.00 | 1st Qu.: 42.00 | 1st Qu.:451.0 | 1st Qu.: 554.0 | 1st Qu.: 67.0 | 1st Qu.: 43.00 | 1st Qu.: 1419 | 1st Qu.: 50.0 | 1st Qu.: 476.0 | 1st Qu.: 619 | 1st Qu.: 127.0 | 1st Qu.:130.0 | 1st Qu.: 990.8 | |
Median : 82.00 | Median :1454 | Median :238.0 | Median : 47.00 | Median :102.00 | Median :512.0 | Median : 732.0 | Median :104.8 | Median : 58.00 | Median : 1518 | Median :107.0 | Median : 536.5 | Median : 797 | Median : 159.0 | Median :147.0 | Median :1050.0 | |
Mean : 80.79 | Mean :1469 | Mean :241.2 | Mean : 55.25 | Mean : 99.61 | Mean :501.6 | Mean : 735.2 | Mean :124.8 | Mean : 70.49 | Mean : 1779 | Mean :105.7 | Mean : 553.0 | Mean : 796 | Mean : 246.5 | Mean :145.4 | Mean :1073.2 | |
3rd Qu.: 92.00 | 3rd Qu.:1537 | 3rd Qu.:273.0 | 3rd Qu.: 72.00 | 3rd Qu.:147.00 | 3rd Qu.:580.0 | 3rd Qu.: 925.0 | 3rd Qu.:154.0 | 3rd Qu.: 91.00 | 3rd Qu.: 1682 | 3rd Qu.:150.0 | 3rd Qu.: 611.0 | 3rd Qu.: 957 | 3rd Qu.: 249.2 | 3rd Qu.:162.0 | 3rd Qu.:1129.0 | |
Max. :146.00 | Max. :2554 | Max. :458.0 | Max. :223.00 | Max. :264.00 | Max. :878.0 | Max. :1399.0 | Max. :697.0 | Max. :201.00 | Max. :30132 | Max. :343.0 | Max. :3645.0 | Max. :4224 | Max. :1898.0 | Max. :228.0 | Max. :2112.0 |
Model Building
We will divide the data into training and test sets using a 70/30 split. We will build our models on the training set and evaluate them on the test set.
library(caret)  # createDataPartition()

set.seed(42)
train_index <- createDataPartition(training$TARGET_WINS, p = .7, list = FALSE, times = 1)
moneyball_train <- training[train_index, ]
moneyball_test <- training[-train_index, ]
Kitchen Sink Model
We will begin with a “kitchen sink” model.
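The fitting chunk was not echoed; it is reconstructed here from the `Call:` line in the output below (the object name `kitchen_sink` matches its later use in Model Selection):

# Regress wins on every available predictor
kitchen_sink <- lm(TARGET_WINS ~ ., data = moneyball_train)
summary(kitchen_sink)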
Call:
lm(formula = TARGET_WINS ~ ., data = moneyball_train)
Residuals:
Min 1Q Median 3Q Max
-49.393 -8.577 -0.072 8.370 61.635
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.9460976 7.0651290 0.842 0.40013
TEAM_BATTING_H 0.0546947 0.0045070 12.135 < 2e-16 ***
TEAM_BATTING_2B -0.0237996 0.0110657 -2.151 0.03165 *
TEAM_BATTING_3B 0.0368255 0.0205168 1.795 0.07286 .
TEAM_BATTING_HR 0.0860825 0.0354739 2.427 0.01535 *
TEAM_BATTING_BB -0.0001871 0.0060299 -0.031 0.97526
TEAM_BATTING_SO -0.0016401 0.0043993 -0.373 0.70934
TEAM_BASERUN_SB -0.0011231 0.0066068 -0.170 0.86503
TEAM_BASERUN_CS 0.1097642 0.0188991 5.808 7.63e-09 ***
TEAM_PITCHING_H -0.0006330 0.0005148 -1.229 0.21908
TEAM_PITCHING_HR -0.0019716 0.0313430 -0.063 0.94985
TEAM_PITCHING_BB 0.0128312 0.0039117 3.280 0.00106 **
TEAM_PITCHING_SO -0.0020795 0.0031650 -0.657 0.51126
TEAM_FIELDING_E -0.0204896 0.0029772 -6.882 8.47e-12 ***
TEAM_FIELDING_DP -0.1077032 0.0169111 -6.369 2.49e-10 ***
TEAM_BATTING_1B NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.95 on 1580 degrees of freedom
Multiple R-squared: 0.3286, Adjusted R-squared: 0.3226
F-statistic: 55.23 on 14 and 1580 DF, p-value: < 2.2e-16
It does a fairly good job predicting, but many of the variables are not statistically significant. (Note that TEAM_BATTING_1B has an NA coefficient: it is a perfect linear combination of hits, doubles, triples, and home runs, hence the singularity warning.)
Simple Model
Let's try to create a simpler model. We will pick variables that had high correlations with wins and exclude the pitching variables, which would introduce multicollinearity issues.
simple_fit <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_FIELDING_E, data = moneyball_train)
summary(simple_fit)
Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB +
TEAM_FIELDING_E, data = moneyball_train)
Residuals:
Min 1Q Median 3Q Max
-51.833 -9.025 0.218 8.837 50.486
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.800109 4.004492 1.199 0.231
TEAM_BATTING_H 0.048614 0.002519 19.299 < 2e-16 ***
TEAM_BATTING_BB 0.015707 0.003792 4.142 3.62e-05 ***
TEAM_FIELDING_E -0.012961 0.002115 -6.129 1.11e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.8 on 1591 degrees of freedom
Multiple R-squared: 0.2314, Adjusted R-squared: 0.2299
F-statistic: 159.7 on 3 and 1591 DF, p-value: < 2.2e-16
This model doesn't fit the data as well, but all of its variables are statistically significant.
Higher Order Stepwise Regression
For the third model we will use stepwise regression with backwards elimination. We also introduce higher-order polynomial terms (squares, cubes, and fourth powers of the predictors).
full_formula <- "TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_BATTING_1B + I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_SO^2) + I(TEAM_BASERUN_SB^2) + I(TEAM_BASERUN_CS^2) + I(TEAM_PITCHING_H^2) + I(TEAM_PITCHING_HR^2) + I(TEAM_PITCHING_BB^2) + I(TEAM_PITCHING_SO^2) + I(TEAM_FIELDING_E^2) + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_1B^2) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_3B^3) + I(TEAM_BATTING_HR^3) + I(TEAM_BATTING_BB^3) + I(TEAM_BATTING_SO^3) + I(TEAM_BASERUN_SB^3) + I(TEAM_BASERUN_CS^3) + I(TEAM_PITCHING_H^3) + I(TEAM_PITCHING_HR^3) + I(TEAM_PITCHING_BB^3) + I(TEAM_PITCHING_SO^3) + I(TEAM_FIELDING_E^3) + I(TEAM_FIELDING_DP^3) + I(TEAM_BATTING_1B^3) + I(TEAM_BATTING_2B^4) + I(TEAM_BATTING_3B^4) + I(TEAM_BATTING_HR^4) + I(TEAM_BATTING_BB^4) + I(TEAM_BATTING_SO^4) + I(TEAM_BASERUN_SB^4) + I(TEAM_BASERUN_CS^4) + I(TEAM_PITCHING_H^4) + I(TEAM_PITCHING_HR^4) + I(TEAM_PITCHING_BB^4) + I(TEAM_PITCHING_SO^4) + I(TEAM_FIELDING_E^4) + I(TEAM_FIELDING_DP^4) + I(TEAM_BATTING_1B^4)"
full_model <- lm(full_formula, moneyball_train)
step_back <- MASS::stepAIC(full_model, direction="backward", trace = F)
poly_call <- summary(step_back)$call
step_back <- lm(poly_call[2], moneyball_train)
summary(step_back)
Call:
lm(formula = poly_call[2], data = moneyball_train)
Residuals:
Min 1Q Median 3Q Max
-41.197 -7.453 -0.130 7.130 57.064
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.721e+02 7.268e+01 2.368 0.018011 *
TEAM_BATTING_2B -1.264e+00 7.704e-01 -1.641 0.101053
TEAM_PITCHING_H 2.634e-02 6.173e-03 4.267 2.10e-05 ***
TEAM_PITCHING_BB 2.401e-01 4.163e-02 5.766 9.76e-09 ***
TEAM_FIELDING_E -2.257e-01 2.575e-02 -8.765 < 2e-16 ***
TEAM_FIELDING_DP -3.092e+00 1.875e+00 -1.649 0.099352 .
I(TEAM_BATTING_2B^2) 1.257e-02 4.968e-03 2.530 0.011503 *
I(TEAM_BATTING_3B^2) 1.914e-03 4.189e-04 4.569 5.29e-06 ***
I(TEAM_BATTING_BB^2) -7.454e-04 1.490e-04 -5.004 6.26e-07 ***
I(TEAM_BATTING_SO^2) 1.190e-04 2.980e-05 3.994 6.80e-05 ***
I(TEAM_BASERUN_SB^2) 4.939e-05 1.323e-05 3.733 0.000196 ***
I(TEAM_PITCHING_H^2) -5.287e-06 1.114e-06 -4.747 2.25e-06 ***
I(TEAM_PITCHING_HR^2) 1.352e-03 3.473e-04 3.894 0.000103 ***
I(TEAM_PITCHING_BB^2) -3.054e-04 5.339e-05 -5.720 1.27e-08 ***
I(TEAM_PITCHING_SO^2) -5.613e-05 9.962e-06 -5.634 2.09e-08 ***
I(TEAM_FIELDING_E^2) 3.503e-04 6.256e-05 5.600 2.53e-08 ***
I(TEAM_FIELDING_DP^2) 3.198e-02 2.113e-02 1.514 0.130285
I(TEAM_BATTING_2B^3) -4.506e-05 1.389e-05 -3.244 0.001205 **
I(TEAM_BATTING_3B^3) -6.372e-06 2.690e-06 -2.369 0.017974 *
I(TEAM_BATTING_HR^3) 8.187e-07 4.736e-07 1.729 0.084096 .
I(TEAM_BATTING_BB^3) 1.565e-06 3.265e-07 4.793 1.80e-06 ***
I(TEAM_BATTING_SO^3) -1.283e-07 4.649e-08 -2.759 0.005864 **
I(TEAM_BASERUN_CS^3) 1.147e-05 2.496e-06 4.594 4.70e-06 ***
I(TEAM_PITCHING_H^3) 3.317e-10 7.906e-11 4.195 2.88e-05 ***
I(TEAM_PITCHING_HR^3) -8.134e-06 2.985e-06 -2.725 0.006497 **
I(TEAM_PITCHING_BB^3) 1.488e-07 2.539e-08 5.860 5.64e-09 ***
I(TEAM_PITCHING_SO^3) 3.989e-08 7.580e-09 5.262 1.62e-07 ***
I(TEAM_FIELDING_E^3) -2.455e-07 5.708e-08 -4.301 1.80e-05 ***
I(TEAM_FIELDING_DP^3) -1.515e-04 1.027e-04 -1.475 0.140401
I(TEAM_BATTING_1B^3) 1.097e-08 1.086e-09 10.106 < 2e-16 ***
I(TEAM_BATTING_2B^4) 5.422e-08 1.422e-08 3.814 0.000142 ***
I(TEAM_BATTING_BB^4) -8.775e-10 2.085e-10 -4.208 2.72e-05 ***
I(TEAM_BATTING_SO^4) 3.654e-11 2.070e-11 1.765 0.077712 .
I(TEAM_BASERUN_CS^4) -4.467e-08 1.378e-08 -3.242 0.001212 **
I(TEAM_PITCHING_H^4) -6.315e-15 1.742e-15 -3.626 0.000297 ***
I(TEAM_PITCHING_HR^4) 1.345e-08 6.111e-09 2.200 0.027934 *
I(TEAM_PITCHING_BB^4) -2.198e-11 3.762e-12 -5.843 6.24e-09 ***
I(TEAM_PITCHING_SO^4) -7.847e-12 1.473e-12 -5.329 1.13e-07 ***
I(TEAM_FIELDING_E^4) 5.595e-11 1.732e-11 3.230 0.001262 **
I(TEAM_FIELDING_DP^4) 2.668e-07 1.824e-07 1.463 0.143723
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.54 on 1555 degrees of freedom
Multiple R-squared: 0.4748, Adjusted R-squared: 0.4616
F-statistic: 36.04 on 39 and 1555 DF, p-value: < 2.2e-16
This model has the highest adjusted R-squared but has some variables that are statistically insignificant.
Model Selection
To select the "best" model, we will test each one against the test set, examining the difference between predicted and actual values. Since wins come in whole numbers and the predict function generates floating-point numbers, we will round the results.
Kitchen Sink Results
moneyball_test$kitchen_sink <- round(predict(kitchen_sink, moneyball_test), 0)
moneyball_test <- moneyball_test %>%
mutate(kitchen_sink_error = TARGET_WINS - kitchen_sink)
ggplot(moneyball_test, aes(kitchen_sink_error)) +
geom_histogram(bins = 50)
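The quantiles below presumably come from base R's summary() on the new error column (a sketch; the call was not echoed):

summary(moneyball_test$kitchen_sink_error)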
Min. 1st Qu. Median Mean 3rd Qu. Max.
-47.0000 -9.0000 1.0000 -0.2819 8.0000 49.0000
Simple Model
moneyball_test$simple <- round(predict(simple_fit, moneyball_test), 0)
moneyball_test <- moneyball_test %>%
mutate(simple_error = TARGET_WINS - simple)
ggplot(moneyball_test, aes(simple_error)) +
geom_histogram(bins = 50)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-46.0000 -10.0000 0.0000 -0.3862 9.0000 47.0000
Stepwise Results
moneyball_test$step_back <- round(predict(step_back, moneyball_test), 0)
moneyball_test <- moneyball_test %>%
mutate(step_back_error = TARGET_WINS - step_back)
moneyball_test %>%
filter(step_back_error > -100) %>%
ggplot(., aes(step_back_error)) +
geom_histogram(bins = 50) +
labs(caption = "Outlier removed")
Min. 1st Qu. Median Mean 3rd Qu. Max.
-175.000 -9.000 -1.000 -0.141 7.000 435.000
This model did not perform very well on the out-of-sample data.
Final Model Selection
In deciding which model to use for our final predictions, we will look at the amount of error in the predictions:
moneyball_test %>%
select(kitchen_sink_error, simple_error, step_back_error) %>%
rename(`Kitchen Sink` = kitchen_sink_error,
`Simple` = simple_error,
`Stepwise` = step_back_error) %>%
summary() %>%
kable() %>%
kable_styling()
Kitchen Sink | Simple | Stepwise | |
---|---|---|---|
Min. :-47.0000 | Min. :-46.0000 | Min. :-175.000 | |
1st Qu.: -9.0000 | 1st Qu.:-10.0000 | 1st Qu.: -9.000 | |
Median : 1.0000 | Median : 0.0000 | Median : -1.000 | |
Mean : -0.2819 | Mean : -0.3862 | Mean : -0.141 | |
3rd Qu.: 8.0000 | 3rd Qu.: 9.0000 | 3rd Qu.: 7.000 | |
Max. : 49.0000 | Max. : 47.0000 | Max. : 435.000 |
The stepwise regression model did not perform well on out-of-sample data; its high adjusted R-squared is probably evidence of overfitting. The simple and "kitchen sink" models have similar performance. We will use the simple model, on the principle of keeping things simple. Its residuals are mostly normally distributed, and its adjusted R-squared indicates the model explains about a quarter of the variation in wins. We will complete this homework by generating predictions on the evaluation set using the simple model.
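A minimal sketch of that final step, assuming the evaluation data has been loaded into a data frame named `evaluation` (an assumed name, not shown in this report) and given the same cleaning and imputation as the training set:

# `evaluation` is an assumed name for the evaluation dataset
evaluation$TARGET_WINS <- round(predict(simple_fit, evaluation), 0)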