DATA 621 Homework #1
From Work by Critical Thinking Group 3
Introduction
We have been given a dataset with 2,276 records summarizing major league baseball teams' seasons. The records span 1871 to 2006 inclusive. All statistics have been adjusted to match the performance of a 162-game season. The objective is to build a linear regression model to predict the number of wins for a team.
Working Theory
We are working on the premise that there are "good" teams and there are "bad" teams: the good teams win more than the bad teams. We assume that some of the predictors will be higher for the good teams than for the bad teams, and consequently that we can use these variables to predict how many games a team will win in a season.
Notes About the Data
There are some difficulties with this dataset. First, it covers a very wide time period. We know there are different "eras" of baseball, and this data spans several of them. Have the fundamental relationships between winning and these predictors changed over time? We think they have. If so, this will be a challenge.
Data Exploration
First Look at the Data
We will first look at the data to get a sense of what we have.
library(tidyverse)   # dplyr, tidyr, and ggplot2
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

training %>%
  gather(variable, value, TARGET_WINS:TEAM_FIELDING_DP) %>%
  ggplot(., aes(value)) +
  geom_density(fill = "indianred4", color = "indianred4") +
  facet_wrap(~variable, scales = "free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())
# Render summary() output as a styled table
quick_summary <- function(df){
  df %>%
    summary() %>%
    kable() %>%
    kable_styling()
}
quick_summary(training)
TARGET_WINS | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_BATTING_HBP | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. : 0.00 | Min. : 891 | Min. : 69.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. :29.00 | Min. : 1137 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 65.0 | Min. : 52.0 | |
1st Qu.: 71.00 | 1st Qu.:1383 | 1st Qu.:208.0 | 1st Qu.: 34.00 | 1st Qu.: 42.00 | 1st Qu.:451.0 | 1st Qu.: 548.0 | 1st Qu.: 66.0 | 1st Qu.: 38.0 | 1st Qu.:50.50 | 1st Qu.: 1419 | 1st Qu.: 50.0 | 1st Qu.: 476.0 | 1st Qu.: 615.0 | 1st Qu.: 127.0 | 1st Qu.:131.0 | |
Median : 82.00 | Median :1454 | Median :238.0 | Median : 47.00 | Median :102.00 | Median :512.0 | Median : 750.0 | Median :101.0 | Median : 49.0 | Median :58.00 | Median : 1518 | Median :107.0 | Median : 536.5 | Median : 813.5 | Median : 159.0 | Median :149.0 | |
Mean : 80.79 | Mean :1469 | Mean :241.2 | Mean : 55.25 | Mean : 99.61 | Mean :501.6 | Mean : 735.6 | Mean :124.8 | Mean : 52.8 | Mean :59.36 | Mean : 1779 | Mean :105.7 | Mean : 553.0 | Mean : 817.7 | Mean : 246.5 | Mean :146.4 | |
3rd Qu.: 92.00 | 3rd Qu.:1537 | 3rd Qu.:273.0 | 3rd Qu.: 72.00 | 3rd Qu.:147.00 | 3rd Qu.:580.0 | 3rd Qu.: 930.0 | 3rd Qu.:156.0 | 3rd Qu.: 62.0 | 3rd Qu.:67.00 | 3rd Qu.: 1682 | 3rd Qu.:150.0 | 3rd Qu.: 611.0 | 3rd Qu.: 968.0 | 3rd Qu.: 249.2 | 3rd Qu.:164.0 | |
Max. :146.00 | Max. :2554 | Max. :458.0 | Max. :223.00 | Max. :264.00 | Max. :878.0 | Max. :1399.0 | Max. :697.0 | Max. :201.0 | Max. :95.00 | Max. :30132 | Max. :343.0 | Max. :3645.0 | Max. :19278.0 | Max. :1898.0 | Max. :228.0 | |
NA | NA | NA | NA | NA | NA | NA’s :102 | NA’s :131 | NA’s :772 | NA’s :2085 | NA | NA | NA | NA’s :102 | NA | NA’s :286 |
Some initial observations:
- The response variable (`TARGET_WINS`) looks to be normally distributed. This supports the working theory that there are good teams and bad teams. There are also a lot of average teams.
- Quite a few variables have missing values. We will need to deal with these in order to keep the data set as large as possible for modeling.
- A couple of variables are bimodal (`TEAM_BATTING_HR`, `TEAM_BATTING_SO`, `TEAM_PITCHING_HR`). Some of these also have missing values, which will make imputation more challenging.
- Some variables are right skewed (`TEAM_BASERUN_CS`, `TEAM_BASERUN_SB`, etc.). This might support the good-team theory. It may also introduce non-normally distributed residuals in the model. We shall see.
Correlations
Let's take a look at the correlations. The following are the correlations computed from complete cases only:
library(corrplot)  # corrplot()

training %>%
  cor(., use = "complete.obs") %>%
  corrplot(., method = "color", type = "upper", tl.col = "black", diag = FALSE)

# Build a tidy, sorted table of pairwise correlations (upper triangle only)
temp <- training %>%
  cor(., use = "complete.obs")
temp[lower.tri(temp, diag = TRUE)] <- ""
temp <- temp %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(Variable, Correlation, -rowname) %>%
  filter(Variable != rowname) %>%
  filter(Correlation != "") %>%
  mutate(Correlation = as.numeric(Correlation)) %>%
  rename(` Variable` = rowname) %>%  # leading space avoids a duplicate column name
  arrange(desc(abs(Correlation)))
Correlations with Response Variable
Let’s take a look at how the predictors are correlated with the response variable:
training %>%
gather(variable, value, -TARGET_WINS) %>%
ggplot(., aes(value, TARGET_WINS)) +
geom_point(fill = "indianred4", color="indianred4") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = "Wins")
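The table below pulls the `TARGET_WINS` rows from the tidy correlation frame `temp` built above. A minimal sketch of the call, assuming that is how the table was rendered:

temp %>%
  filter(` Variable` == "TARGET_WINS") %>%
  kable() %>%
  kable_styling()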
Variable | Variable | Correlation |
---|---|---|
TARGET_WINS | TEAM_PITCHING_H | 0.4712343 |
TARGET_WINS | TEAM_BATTING_H | 0.4699467 |
TARGET_WINS | TEAM_BATTING_BB | 0.4686879 |
TARGET_WINS | TEAM_PITCHING_BB | 0.4683988 |
TARGET_WINS | TEAM_PITCHING_HR | 0.4224668 |
TARGET_WINS | TEAM_BATTING_HR | 0.4224168 |
TARGET_WINS | TEAM_FIELDING_E | -0.3866880 |
TARGET_WINS | TEAM_BATTING_2B | 0.3129840 |
TARGET_WINS | TEAM_PITCHING_SO | -0.2293648 |
TARGET_WINS | TEAM_BATTING_SO | -0.2288927 |
TARGET_WINS | TEAM_FIELDING_DP | -0.1958660 |
TARGET_WINS | TEAM_BASERUN_CS | -0.1787560 |
TARGET_WINS | TEAM_BATTING_3B | -0.1243459 |
TARGET_WINS | TEAM_BATTING_HBP | 0.0735042 |
TARGET_WINS | TEAM_BASERUN_SB | 0.0148364 |
It looks like hits, walks, home runs, and errors have the strongest correlations with wins, though none of these correlations is particularly strong. This suggests there is a lot of noise in these relationships.
It is interesting to note that allowing hits is positively correlated with wins. How strange! It is also noteworthy that pitching strikeouts are negatively correlated with winning, which does not make sense. Examining the scatter plots above, it becomes apparent that these correlations are being affected by some outliers.
Strong Correlations (Absolute Value > 0.5)
Are any predictors correlated with each other? We will look only for "strong" correlations (absolute value greater than 0.5):
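This table can be produced by filtering the same tidy frame `temp` to strong pairwise correlations (a sketch, assuming that is the mechanism used):

temp %>%
  filter(abs(Correlation) > 0.5) %>%
  kable() %>%
  kable_styling()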
Variable | Variable | Correlation |
---|---|---|
TEAM_BATTING_HR | TEAM_PITCHING_HR | 0.9999326 |
TEAM_BATTING_BB | TEAM_PITCHING_BB | 0.9998814 |
TEAM_BATTING_SO | TEAM_PITCHING_SO | 0.9997684 |
TEAM_BATTING_H | TEAM_PITCHING_H | 0.9991927 |
TEAM_BASERUN_SB | TEAM_BASERUN_CS | 0.6247378 |
TEAM_BATTING_H | TEAM_BATTING_2B | 0.5617729 |
TEAM_BATTING_2B | TEAM_PITCHING_H | 0.5604535 |
There are four pairs of variables with correlations of almost 1! We will need to be careful to avoid introducing multicollinearity into our model.
Strange Data Values
Missing Values
During our first look at the data it was noted that there were variables that are missing data. Here’s a look at what variables are missing data and how big of a problem it is:
training %>%
gather(variable, value) %>%
filter(is.na(value)) %>%
group_by(variable) %>%
tally() %>%
mutate(percent = n / nrow(training) * 100) %>%
mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
arrange(desc(n)) %>%
rename(`Variable Missing Data` = variable,
`Number of Records` = n,
`Share of Total` = percent) %>%
kable() %>%
kable_styling()
Variable Missing Data | Number of Records | Share of Total |
---|---|---|
TEAM_BATTING_HBP | 2085 | 92% |
TEAM_BASERUN_CS | 772 | 34% |
TEAM_FIELDING_DP | 286 | 13% |
TEAM_BASERUN_SB | 131 | 5.8% |
TEAM_BATTING_SO | 102 | 4.5% |
TEAM_PITCHING_SO | 102 | 4.5% |
The hit-by-pitch variable is missing over 90% of its data. We will exclude it from consideration in our model.
Caught stealing (`TEAM_BASERUN_CS`) is next on the list. It may be possible to predict it using `TEAM_BASERUN_SB`, since the two are strongly correlated, but there are 131 records where both are missing.
The strikeouts are going to be a little tricky because of their bimodal distribution. All 102 records missing batting strikeouts appear to be missing pitching strikeouts as well (each variable has exactly 102 NAs).
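A one-line check of that claim (a sketch, not in the original):

training %>%
  summarise(both_missing = sum(is.na(TEAM_BATTING_SO) & is.na(TEAM_PITCHING_SO)))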
Zero Values
There are also variables with values of zero. Let's see how big of a problem this is:
training %>%
gather(variable, value) %>%
filter(value == 0) %>%
group_by(variable) %>%
tally() %>%
mutate(percent = n / nrow(training) * 100) %>%
mutate(percent = paste0(round(percent, ifelse(percent < 10, 1, 0)), "%")) %>%
arrange(desc(n)) %>%
rename(`Variable With Zeros` = variable,
`Number of Records` = n,
`Share of Total` = percent) %>%
kable() %>%
kable_styling()
Variable With Zeros | Number of Records | Share of Total |
---|---|---|
TEAM_BATTING_SO | 20 | 0.9% |
TEAM_PITCHING_SO | 20 | 0.9% |
TEAM_BATTING_HR | 15 | 0.7% |
TEAM_PITCHING_HR | 15 | 0.7% |
TEAM_BASERUN_SB | 2 | 0.1% |
TEAM_BATTING_3B | 2 | 0.1% |
TARGET_WINS | 1 | 0% |
TEAM_BASERUN_CS | 1 | 0% |
TEAM_BATTING_BB | 1 | 0% |
TEAM_PITCHING_BB | 1 | 0% |
This isn’t nearly as large of a problem as the missing values.
Deeper Dive into the Variables
TARGET_WINS
training %>%
ggplot(aes(TARGET_WINS)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TARGET_WINS)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TARGET_WINS)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Wins",
caption = "* Red line is the mean value and green is the median")
The range of the data looks good. There is a single zero value, which is the Washington Nationals' 1872 season.
TEAM_BATTING_3B
This field represents triples hit by the team. Triples aren't very common: the ball stays in the field of play (unlike a home run), yet the batter still has enough time to reach third base.
training %>%
ggplot(aes(TEAM_BATTING_3B)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_3B)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_3B)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Triples",
caption = "* Red line is the mean value and green is the median")
Looking at the distribution, the value of zero doesn’t look too unusual. Even if it were, the value is not likely to have a large impact.
TEAM_BATTING_HR
Although home runs are more common in modern baseball (thank you, PEDs!), there are some low values in the data, so zero doesn't seem too unusual here either.
training %>%
ggplot(aes(TEAM_BATTING_HR)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_HR)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_HR)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Homeruns",
caption = "* Red line is the mean value and green is the median")
TEAM_BATTING_BB
This variable represents when the batter is “walked” by the pitcher (also known as Base on Balls):
training %>%
ggplot(aes(TEAM_BATTING_BB)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_BB)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_BB)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Walks (Base on Balls)",
caption = "* Red line is the mean value and green is the median")
Four balls will walk a batter in modern baseball; however, that wasn't always the case. A century or more ago (within the date range of this data set) a walk required as many as nine balls[1]. Knowing this, and looking at the left tail of the values above, it is not unreasonable that there might be a season with no walks. As with triples above, leaving the one zero data point in is unlikely to adversely impact any regression, since there are valid values nearby.
TEAM_BATTING_SO
Here we saw some NA values, 102 of them to be specific, plus 20 zero values.
training %>%
ggplot(aes(TEAM_BATTING_SO)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_SO,na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_SO, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Strikeouts (Batter)",
caption = "* Red line is the mean value and green is the median")
First, the zero values seem nigh-impossible. An entire season (162 games) without a single batter striking out is highly suspect, let alone 20 such seasons in the dataset.
We will replace these values with imputed ones. The distribution looks to be bimodal, so imputing with a mean or median (which falls squarely between the two peaks) may cause issues with the model.
Instead, we will impute each missing value by looking at the record's nearest neighbors (based on the other variables) and taking a weighted average of their values (KNN imputation).
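A minimal sketch of that imputation, using DMwR's knnImputation() (the same function called in Data Preparation below; k = 10 is its default):

library(DMwR)  # knnImputation(): distance-weighted average of the k nearest complete cases
knn <- knnImputation(training, k = 10)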
TEAM_BASERUN_SB
With this variable, we have a good number of NA values, and 2 zeroes:
training %>%
ggplot(aes(TEAM_BASERUN_SB)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BASERUN_SB, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BASERUN_SB, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Stolen Bases",
caption = "* Red line is the mean value and green is the median")
The zeroes may be legitimate here so we will leave them alone. For the NAs, we will use the same KNN imputation we used above for strikeouts to impute values.
TEAM_BASERUN_CS
This variable is NA for nearly a third of the records, with a single zero value (which could be legitimate):
training %>%
ggplot(aes(TEAM_BASERUN_CS)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BASERUN_CS, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BASERUN_CS, na.rm = T)),col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Caught Stealing Bases",
caption = "* Red line is the mean value and green is the median")
Despite the high number of missing values (and a potential for increased error), we will use the KNN imputed values.
TEAM_BATTING_HBP
With this variable, we see nearly all entries are missing:
training %>%
ggplot(aes(TEAM_BATTING_HBP)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_BATTING_HBP, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_BATTING_HBP, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Being Hit By a Pitch",
caption = "* Red line is the mean value and green is the median")
We will drop this variable as we cannot responsibly impute values with such thin data.
TEAM_PITCHING_HR
This variable has no NA values and only a few zeros, which seem legitimate given the distribution:
training %>%
ggplot(aes(TEAM_PITCHING_HR)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_PITCHING_HR, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_PITCHING_HR, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Homeruns Pitched",
caption = "* Red line is the mean value and green is the median")
TEAM_PITCHING_BB
Here we have no NA values and a single zero:
training %>%
ggplot(aes(TEAM_PITCHING_BB)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_PITCHING_BB, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_PITCHING_BB, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Walks Pitched",
caption = "* Red line is the mean value and green is the median")
As we did with walks above, we can assume it is possible to have no walks (and therefore to pitch no walks), so we will leave the zero alone.
However, there are some very high values in the data, which strain credibility. We could flag anything beyond the usual outlier cutoff (\(1.5 \cdot \text{IQR}\) past the quartiles) and set it to NA so those records are excluded from any model built with this variable. But when you do the math, the values seem extreme yet plausible. For example, an MLB season currently has at most 162 games, so a max value of 3,645 walks pitched works out to 22.5 walks per game on average, or 2.5 walks per inning across 9 innings.
We would be surprised if a pitcher weren't removed after an inning or two of walking 2-3 batters, but we cannot rule it out as a possibility.
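The arithmetic, spelled out:

3645 / 162        # 22.5 walks per game over a 162-game season
3645 / (162 * 9)  # 2.5 walks per inning over 9 innings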
TEAM_PITCHING_SO
This variable represents strikeouts pitched. We see that there are 102 NA values and a lot of extremely high values:
training %>%
ggplot(aes(TEAM_PITCHING_SO)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_PITCHING_SO, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_PITCHING_SO, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Count",
title = "Distribution of Strikeouts Pitched",
caption = "* Red line is the mean value and green is the median")
The extreme values need to be handled. First, a typical game is 9 innings long, and in each inning a pitching staff can record at most 3 strikeouts (at which point its half of the inning is over). Those 27 potential strikeouts multiplied by 162 games put the upper limit near 4,374 per season.
Games can go beyond 9 innings, but even if every game in a season were as long as the longest MLB game ever (26 innings), a team could record only 12,636 strikeouts. So the max value of 19,278 is invalid.
We'll make a high-yet-reasonable assumption of an average of 11 innings per game, and declare anything above 5,346 strikeouts invalid by setting it to NA so it will be imputed prior to modeling.
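The same bounds as quick arithmetic:

3 * 9 * 162   # 4,374: every out a strikeout in nine-inning games
3 * 26 * 162  # 12,636: every game as long as the longest MLB game ever
3 * 11 * 162  # 5,346: our cutoff, assuming an average of 11 innings per game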
TEAM_FIELDING_DP
The values in this variable seem reasonable, however we do have some NA values.
training %>%
ggplot(aes(TEAM_FIELDING_DP)) +
geom_histogram(bins = 50) +
geom_vline(aes(xintercept = mean(TEAM_FIELDING_DP, na.rm = T)), col = "red", lty = 2) +
geom_vline(aes(xintercept = median(TEAM_FIELDING_DP, na.rm = T)), col = "green", lty = 2) +
labs(x = element_blank(),
y = "Counts",
title = "Distribution of Double Plays",
caption = "* Red line is the mean value and green is the median")
Again, we will use the KNN imputation from earlier to fill in NAs with imputed values.
Data Preparation
Fixing Missing/Zero Values
First we will flag the invalid data as missing so it can be imputed: the zero batting strikeouts and the pitching strikeouts above our 5,346 cutoff become NA. We will also drop the hit-by-pitch variable from the dataset. We will then impute the missing data using KNN.
library(DMwR)  # knnImputation()

# Flag invalid values as NA and drop the hit-by-pitch variable
training <- training %>%
  mutate(TEAM_BATTING_SO = ifelse(TEAM_BATTING_SO == 0, NA, TEAM_BATTING_SO)) %>%
  mutate(TEAM_PITCHING_SO = ifelse(TEAM_PITCHING_SO > 5346, NA, TEAM_PITCHING_SO)) %>%
  select(-TEAM_BATTING_HBP)

# Impute with KNN, then copy the imputed values back one variable at a time
set.seed(42)
knn <- training %>% knnImputation()
impute_vars <- c("TEAM_BATTING_SO", "TEAM_BASERUN_SB", "TEAM_BASERUN_CS",
                 "TEAM_PITCHING_SO", "TEAM_FIELDING_DP")
for (v in impute_vars) {
  impute_me <- is.na(training[[v]])
  training[impute_me, v] <- knn[impute_me, v]
}
Feature Engineering
Batting singles are not included in the dataset, but we can back them out of total hits by subtracting doubles, triples, and home runs.
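A minimal sketch of the derivation (the column name `TEAM_BATTING_1B` matches the summary and models below):

# Singles = total hits minus extra-base hits
training <- training %>%
  mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR)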
Results
Here’s what the data look like after imputation and correction:
training %>%
gather(variable, value) %>%
ggplot(., aes(value)) +
geom_density(fill = "indianred4", color="indianred4") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = element_blank())
quick_summary(training)
TARGET_WINS | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP | TEAM_BATTING_1B | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. : 0.00 | Min. : 891 | Min. : 69.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 66.0 | Min. : 0.0 | Min. : 0.00 | Min. : 1137 | Min. : 0.0 | Min. : 0.0 | Min. : 0 | Min. : 65.0 | Min. : 52.0 | Min. : 709.0 | |
1st Qu.: 71.00 | 1st Qu.:1383 | 1st Qu.:208.0 | 1st Qu.: 34.00 | 1st Qu.: 42.00 | 1st Qu.:451.0 | 1st Qu.: 554.0 | 1st Qu.: 67.0 | 1st Qu.: 43.00 | 1st Qu.: 1419 | 1st Qu.: 50.0 | 1st Qu.: 476.0 | 1st Qu.: 619 | 1st Qu.: 127.0 | 1st Qu.:130.0 | 1st Qu.: 990.8 | |
Median : 82.00 | Median :1454 | Median :238.0 | Median : 47.00 | Median :102.00 | Median :512.0 | Median : 732.0 | Median :104.8 | Median : 58.00 | Median : 1518 | Median :107.0 | Median : 536.5 | Median : 797 | Median : 159.0 | Median :147.0 | Median :1050.0 | |
Mean : 80.79 | Mean :1469 | Mean :241.2 | Mean : 55.25 | Mean : 99.61 | Mean :501.6 | Mean : 735.2 | Mean :124.8 | Mean : 70.49 | Mean : 1779 | Mean :105.7 | Mean : 553.0 | Mean : 796 | Mean : 246.5 | Mean :145.4 | Mean :1073.2 | |
3rd Qu.: 92.00 | 3rd Qu.:1537 | 3rd Qu.:273.0 | 3rd Qu.: 72.00 | 3rd Qu.:147.00 | 3rd Qu.:580.0 | 3rd Qu.: 925.0 | 3rd Qu.:154.0 | 3rd Qu.: 91.00 | 3rd Qu.: 1682 | 3rd Qu.:150.0 | 3rd Qu.: 611.0 | 3rd Qu.: 957 | 3rd Qu.: 249.2 | 3rd Qu.:162.0 | 3rd Qu.:1129.0 | |
Max. :146.00 | Max. :2554 | Max. :458.0 | Max. :223.00 | Max. :264.00 | Max. :878.0 | Max. :1399.0 | Max. :697.0 | Max. :201.00 | Max. :30132 | Max. :343.0 | Max. :3645.0 | Max. :4224 | Max. :1898.0 | Max. :228.0 | Max. :2112.0 |
Model Building
We will divide the data into training and test sets using a 70/30 split. We will build our models on the training set and evaluate them on the test set.
library(caret)  # createDataPartition()

set.seed(42)
train_index <- createDataPartition(training$TARGET_WINS, p = .7, list = FALSE, times = 1)
moneyball_train <- training[train_index, ]
moneyball_test <- training[-train_index, ]
Kitchen Sink Model
We will begin with a “kitchen sink” model.
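The fitting chunk was not echoed; it is reconstructed here from the `Call:` line in the output below (the object name `kitchen_sink` matches its later use in Model Selection):

# Regress wins on every available predictor
kitchen_sink <- lm(TARGET_WINS ~ ., data = moneyball_train)
summary(kitchen_sink)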
Call:
lm(formula = TARGET_WINS ~ ., data = moneyball_train)
Residuals:
Min 1Q Median 3Q Max
-49.393 -8.577 -0.072 8.370 61.635
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.9460976 7.0651290 0.842 0.40013
TEAM_BATTING_H 0.0546947 0.0045070 12.135 < 2e-16 ***
TEAM_BATTING_2B -0.0237996 0.0110657 -2.151 0.03165 *
TEAM_BATTING_3B 0.0368255 0.0205168 1.795 0.07286 .
TEAM_BATTING_HR 0.0860825 0.0354739 2.427 0.01535 *
TEAM_BATTING_BB -0.0001871 0.0060299 -0.031 0.97526
TEAM_BATTING_SO -0.0016401 0.0043993 -0.373 0.70934
TEAM_BASERUN_SB -0.0011231 0.0066068 -0.170 0.86503
TEAM_BASERUN_CS 0.1097642 0.0188991 5.808 7.63e-09 ***
TEAM_PITCHING_H -0.0006330 0.0005148 -1.229 0.21908
TEAM_PITCHING_HR -0.0019716 0.0313430 -0.063 0.94985
TEAM_PITCHING_BB 0.0128312 0.0039117 3.280 0.00106 **
TEAM_PITCHING_SO -0.0020795 0.0031650 -0.657 0.51126
TEAM_FIELDING_E -0.0204896 0.0029772 -6.882 8.47e-12 ***
TEAM_FIELDING_DP -0.1077032 0.0169111 -6.369 2.49e-10 ***
TEAM_BATTING_1B NA NA NA NA
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.95 on 1580 degrees of freedom
Multiple R-squared: 0.3286, Adjusted R-squared: 0.3226
F-statistic: 55.23 on 14 and 1580 DF, p-value: < 2.2e-16
It does a fairly good job predicting, but many of the variables are not statistically significant. (Note that TEAM_BATTING_1B has an NA coefficient: it is a perfect linear combination of hits, doubles, triples, and home runs, hence the singularity warning.)
Simple Model
Let's try to create a simpler model. We will pick variables that had high correlations with wins and exclude the pitching variables, which would introduce multicollinearity issues.
simple_fit <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_FIELDING_E, data = moneyball_train)
summary(simple_fit)
Call:
lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB +
TEAM_FIELDING_E, data = moneyball_train)
Residuals:
Min 1Q Median 3Q Max
-51.833 -9.025 0.218 8.837 50.486
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.800109 4.004492 1.199 0.231
TEAM_BATTING_H 0.048614 0.002519 19.299 < 2e-16 ***
TEAM_BATTING_BB 0.015707 0.003792 4.142 3.62e-05 ***
TEAM_FIELDING_E -0.012961 0.002115 -6.129 1.11e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 13.8 on 1591 degrees of freedom
Multiple R-squared: 0.2314, Adjusted R-squared: 0.2299
F-statistic: 159.7 on 3 and 1591 DF, p-value: < 2.2e-16
This model doesn't fit the data as well, but all of its variables are statistically significant.
Higher Order Stepwise Regression
For the third model we will use stepwise regression with backwards elimination. We also introduce higher-order polynomial terms (squares, cubes, and fourth powers of the predictors).
full_formula <- "TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP + TEAM_BATTING_1B + I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_SO^2) + I(TEAM_BASERUN_SB^2) + I(TEAM_BASERUN_CS^2) + I(TEAM_PITCHING_H^2) + I(TEAM_PITCHING_HR^2) + I(TEAM_PITCHING_BB^2) + I(TEAM_PITCHING_SO^2) + I(TEAM_FIELDING_E^2) + I(TEAM_FIELDING_DP^2) + I(TEAM_BATTING_1B^2) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_3B^3) + I(TEAM_BATTING_HR^3) + I(TEAM_BATTING_BB^3) + I(TEAM_BATTING_SO^3) + I(TEAM_BASERUN_SB^3) + I(TEAM_BASERUN_CS^3) + I(TEAM_PITCHING_H^3) + I(TEAM_PITCHING_HR^3) + I(TEAM_PITCHING_BB^3) + I(TEAM_PITCHING_SO^3) + I(TEAM_FIELDING_E^3) + I(TEAM_FIELDING_DP^3) + I(TEAM_BATTING_1B^3) + I(TEAM_BATTING_2B^4) + I(TEAM_BATTING_3B^4) + I(TEAM_BATTING_HR^4) + I(TEAM_BATTING_BB^4) + I(TEAM_BATTING_SO^4) + I(TEAM_BASERUN_SB^4) + I(TEAM_BASERUN_CS^4) + I(TEAM_PITCHING_H^4) + I(TEAM_PITCHING_HR^4) + I(TEAM_PITCHING_BB^4) + I(TEAM_PITCHING_SO^4) + I(TEAM_FIELDING_E^4) + I(TEAM_FIELDING_DP^4) + I(TEAM_BATTING_1B^4)"
full_model <- lm(full_formula, moneyball_train)
step_back <- MASS::stepAIC(full_model, direction="backward", trace = F)
poly_call <- summary(step_back)$call
step_back <- lm(poly_call[2], moneyball_train)
summary(step_back)
Call:
lm(formula = poly_call[2], data = moneyball_train)
Residuals:
Min 1Q Median 3Q Max
-41.197 -7.453 -0.130 7.130 57.064
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.721e+02 7.268e+01 2.368 0.018011 *
TEAM_BATTING_2B -1.264e+00 7.704e-01 -1.641 0.101053
TEAM_PITCHING_H 2.634e-02 6.173e-03 4.267 2.10e-05 ***
TEAM_PITCHING_BB 2.401e-01 4.163e-02 5.766 9.76e-09 ***
TEAM_FIELDING_E -2.257e-01 2.575e-02 -8.765 < 2e-16 ***
TEAM_FIELDING_DP -3.092e+00 1.875e+00 -1.649 0.099352 .
I(TEAM_BATTING_2B^2) 1.257e-02 4.968e-03 2.530 0.011503 *
I(TEAM_BATTING_3B^2) 1.914e-03 4.189e-04 4.569 5.29e-06 ***
I(TEAM_BATTING_BB^2) -7.454e-04 1.490e-04 -5.004 6.26e-07 ***
I(TEAM_BATTING_SO^2) 1.190e-04 2.980e-05 3.994 6.80e-05 ***
I(TEAM_BASERUN_SB^2) 4.939e-05 1.323e-05 3.733 0.000196 ***
I(TEAM_PITCHING_H^2) -5.287e-06 1.114e-06 -4.747 2.25e-06 ***
I(TEAM_PITCHING_HR^2) 1.352e-03 3.473e-04 3.894 0.000103 ***
I(TEAM_PITCHING_BB^2) -3.054e-04 5.339e-05 -5.720 1.27e-08 ***
I(TEAM_PITCHING_SO^2) -5.613e-05 9.962e-06 -5.634 2.09e-08 ***
I(TEAM_FIELDING_E^2) 3.503e-04 6.256e-05 5.600 2.53e-08 ***
I(TEAM_FIELDING_DP^2) 3.198e-02 2.113e-02 1.514 0.130285
I(TEAM_BATTING_2B^3) -4.506e-05 1.389e-05 -3.244 0.001205 **
I(TEAM_BATTING_3B^3) -6.372e-06 2.690e-06 -2.369 0.017974 *
I(TEAM_BATTING_HR^3) 8.187e-07 4.736e-07 1.729 0.084096 .
I(TEAM_BATTING_BB^3) 1.565e-06 3.265e-07 4.793 1.80e-06 ***
I(TEAM_BATTING_SO^3) -1.283e-07 4.649e-08 -2.759 0.005864 **
I(TEAM_BASERUN_CS^3) 1.147e-05 2.496e-06 4.594 4.70e-06 ***
I(TEAM_PITCHING_H^3) 3.317e-10 7.906e-11 4.195 2.88e-05 ***
I(TEAM_PITCHING_HR^3) -8.134e-06 2.985e-06 -2.725 0.006497 **
I(TEAM_PITCHING_BB^3) 1.488e-07 2.539e-08 5.860 5.64e-09 ***
I(TEAM_PITCHING_SO^3) 3.989e-08 7.580e-09 5.262 1.62e-07 ***
I(TEAM_FIELDING_E^3) -2.455e-07 5.708e-08 -4.301 1.80e-05 ***
I(TEAM_FIELDING_DP^3) -1.515e-04 1.027e-04 -1.475 0.140401
I(TEAM_BATTING_1B^3) 1.097e-08 1.086e-09 10.106 < 2e-16 ***
I(TEAM_BATTING_2B^4) 5.422e-08 1.422e-08 3.814 0.000142 ***
I(TEAM_BATTING_BB^4) -8.775e-10 2.085e-10 -4.208 2.72e-05 ***
I(TEAM_BATTING_SO^4) 3.654e-11 2.070e-11 1.765 0.077712 .
I(TEAM_BASERUN_CS^4) -4.467e-08 1.378e-08 -3.242 0.001212 **
I(TEAM_PITCHING_H^4) -6.315e-15 1.742e-15 -3.626 0.000297 ***
I(TEAM_PITCHING_HR^4) 1.345e-08 6.111e-09 2.200 0.027934 *
I(TEAM_PITCHING_BB^4) -2.198e-11 3.762e-12 -5.843 6.24e-09 ***
I(TEAM_PITCHING_SO^4) -7.847e-12 1.473e-12 -5.329 1.13e-07 ***
I(TEAM_FIELDING_E^4) 5.595e-11 1.732e-11 3.230 0.001262 **
I(TEAM_FIELDING_DP^4) 2.668e-07 1.824e-07 1.463 0.143723
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.54 on 1555 degrees of freedom
Multiple R-squared: 0.4748, Adjusted R-squared: 0.4616
F-statistic: 36.04 on 39 and 1555 DF, p-value: < 2.2e-16
This model has the highest adjusted R-squared but has some variables that are statistically insignificant.
Model Selection
To select the "best" model, we will test each one against the test set, examining the difference between predicted and actual values. Since wins come in whole numbers and the predict function generates floating-point numbers, we will round the results.
Kitchen Sink Results
moneyball_test$kitchen_sink <- round(predict(kitchen_sink, moneyball_test), 0)
moneyball_test <- moneyball_test %>%
mutate(kitchen_sink_error = TARGET_WINS - kitchen_sink)
ggplot(moneyball_test, aes(kitchen_sink_error)) +
geom_histogram(bins = 50)
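The quantiles below presumably come from base R's summary() on the new error column (a sketch; the call was not echoed):

summary(moneyball_test$kitchen_sink_error)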
Min. 1st Qu. Median Mean 3rd Qu. Max.
-47.0000 -9.0000 1.0000 -0.2819 8.0000 49.0000
Simple Model
moneyball_test$simple <- round(predict(simple_fit, moneyball_test), 0)
moneyball_test <- moneyball_test %>%
mutate(simple_error = TARGET_WINS - simple)
ggplot(moneyball_test, aes(simple_error)) +
geom_histogram(bins = 50)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-46.0000 -10.0000 0.0000 -0.3862 9.0000 47.0000
Stepwise Results
moneyball_test$step_back <- round(predict(step_back, moneyball_test), 0)
moneyball_test <- moneyball_test %>%
mutate(step_back_error = TARGET_WINS - step_back)
moneyball_test %>%
filter(step_back_error > -100) %>%
ggplot(., aes(step_back_error)) +
geom_histogram(bins = 50) +
labs(caption = "Outlier removed")
Min. 1st Qu. Median Mean 3rd Qu. Max.
-175.000 -9.000 -1.000 -0.141 7.000 435.000
This model did not perform very well on the out-of-sample data.
Final Model Selection
In deciding which model to use for our final predictions, we will look at the amount of error in the predictions:
moneyball_test %>%
select(kitchen_sink_error, simple_error, step_back_error) %>%
rename(`Kitchen Sink` = kitchen_sink_error,
`Simple` = simple_error,
`Stepwise` = step_back_error) %>%
summary() %>%
kable() %>%
kable_styling()
Kitchen Sink | Simple | Stepwise | |
---|---|---|---|
Min. :-47.0000 | Min. :-46.0000 | Min. :-175.000 | |
1st Qu.: -9.0000 | 1st Qu.:-10.0000 | 1st Qu.: -9.000 | |
Median : 1.0000 | Median : 0.0000 | Median : -1.000 | |
Mean : -0.2819 | Mean : -0.3862 | Mean : -0.141 | |
3rd Qu.: 8.0000 | 3rd Qu.: 9.0000 | 3rd Qu.: 7.000 | |
Max. : 49.0000 | Max. : 47.0000 | Max. : 435.000 |
The stepwise regression model did not perform well on out-of-sample data; its high adjusted R-squared is probably evidence of overfitting. The simple and "kitchen sink" models have similar performance. We will use the simple model, on the principle of keeping things simple. Its residuals are mostly normally distributed, and its adjusted R-squared indicates the model explains about a quarter of the variation in wins. We will complete this homework by generating predictions on the evaluation set using the simple model.
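A minimal sketch of that final step, assuming the evaluation data has been loaded into a data frame named `evaluation` (an assumed name, not shown in this report) and given the same cleaning and imputation as the training set:

# `evaluation` is an assumed name for the evaluation dataset
evaluation$TARGET_WINS <- round(predict(simple_fit, evaluation), 0)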