In this assignment, we will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season. Our objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. Below is a short description of the variables of interest in the data set:
Variable Name | Definition | Theoretical Effect |
---|---|---|
INDEX | Identification Variable (do not use) | None |
TARGET_WINS | Number of wins | |
TEAM_BATTING_H | Base Hits by batters (1B,2B,3B,HR) | Positive Impact on Wins |
TEAM_BATTING_2B | Doubles by batters (2B) | Positive Impact on Wins |
TEAM_BATTING_3B | Triples by batters (3B) | Positive Impact on Wins |
TEAM_BATTING_HR | Homeruns by batters (4B) | Positive Impact on Wins |
TEAM_BATTING_BB | Walks by batters | Positive Impact on Wins |
TEAM_BATTING_SO | Batters hit by pitch (get a free base) | Positive Impact on Wins |
TEAM_BASERUN_SB | Strikeouts by batters | Negative Impact on Wins |
TEAM_BASERUN_CS | Stolen bases | Positive Impact on Wins |
TEAM_BATTING_HBP | Caught stealing | Negative Impact on Wins |
TEAM_PITCHING_H | Errors | Negative Impact on Wins |
TEAM_PITCHING_HR | Double Plays | Positive Impact on Wins |
TEAM_PITCHING_BB | Walks allowed | Negative Impact on Wins |
TEAM_PITCHING_SO | Hits allowed | Negative Impact on Wins |
TEAM_FIELDING_E | Homeruns allowed | Negative Impact on Wins |
TEAM_FIELDING_DP | Strikeouts by pitchers | Positive Impact on Wins |
After renaming the column names for better readability, a quick look at the summary statistics shows that dataset consists of 16 elements, with 2276 total cases. There are 15 explanatory variables. Moreover, describe function from psych package shows us the mean, standard deviation and the number of missing values. We will look in to missing values more later.
## Observations: 2,276
## Variables: 16
## $ WINS <int> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72...
## $ BATTING_H <int> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 13...
## $ BATTING_2B <int> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, ...
## $ BATTING_3B <int> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41...
## $ BATTING_HR <int> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, ...
## $ BATTING_BB <int> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, ...
## $ BATTING_SO <int> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 82...
## $ BASERUN_SB <int> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119, ...
## $ BASERUN_CS <int> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 10...
## $ BATTING_HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ PITCHING_H <int> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 13...
## $ PITCHING_HR <int> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, ...
## $ PITCHING_BB <int> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, ...
## $ PITCHING_SO <int> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 8...
## $ FIELDING_E <int> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131,...
## $ FIELDING_DP <int> NA, 155, 153, 156, 168, 149, 186, 136, 169, 159, 1...
## vars n mean sd se
## WINS 1 2276 80.79 15.75 0.33
## BATTING_H 2 2276 1469.27 144.59 3.03
## BATTING_2B 3 2276 241.25 46.80 0.98
## BATTING_3B 4 2276 55.25 27.94 0.59
## BATTING_HR 5 2276 99.61 60.55 1.27
## BATTING_BB 6 2276 501.56 122.67 2.57
## BATTING_SO 7 2174 735.61 248.53 5.33
## BASERUN_SB 8 2145 124.76 87.79 1.90
## BASERUN_CS 9 1504 52.80 22.96 0.59
## BATTING_HBP 10 191 59.36 12.97 0.94
## PITCHING_H 11 2276 1779.21 1406.84 29.49
## PITCHING_HR 12 2276 105.70 61.30 1.28
## PITCHING_BB 13 2276 553.01 166.36 3.49
## PITCHING_SO 14 2174 817.73 553.09 11.86
## FIELDING_E 15 2276 246.48 227.77 4.77
## FIELDING_DP 16 1990 146.39 26.23 0.59
In the box plots below, we see that the variance of some of the explanatory variables greatly exceeds the variance of the response “win” variable. The dataset has many outlines with some observations that are more extreme than the 1.5 * IQR of the box plot whiskers.
In the histogram plot below, we see that the batting, pitching home-run and batting strike-out variables are bi modal. Also, half of the variables show skewness. We will create a box-cox transformation to build our model to mitigate the skewness.
In the correlation plot below, we see home runs, walks, strike outs & hits are highly correlated variables to our target variable.
Correlation | |
---|---|
WINS | 1.0000000 |
PITCHING_H | 0.4712343 |
BATTING_H | 0.4699467 |
BATTING_BB | 0.4686879 |
PITCHING_BB | 0.4683988 |
PITCHING_HR | 0.4224668 |
BATTING_HR | 0.4224168 |
BATTING_2B | 0.3129840 |
BATTING_HBP | 0.0735042 |
BASERUN_SB | 0.0148364 |
BATTING_3B | -0.1243459 |
BASERUN_CS | -0.1787560 |
FIELDING_DP | -0.1958660 |
BATTING_SO | -0.2288927 |
PITCHING_SO | -0.2293648 |
FIELDING_E | -0.3866880 |
The plot below shows the percent of missing value in a given column. BATTING_HBP has more than 90% of its value missing. We will delete this variable. Additionally, BASERUN_CS has close to 30% of its data missing. We will delete this as well. The reason for this decision is because they are missing a lot of data and the correlation between these variables and our target variable is low compared to other variable.
##
## Variables sorted by number of missings:
## Variable Count
## BATTING_HBP 0.91608084
## BASERUN_CS 0.33919156
## FIELDING_DP 0.12565905
## BASERUN_SB 0.05755712
## BATTING_SO 0.04481547
## PITCHING_SO 0.04481547
## WINS 0.00000000
## BATTING_H 0.00000000
## BATTING_2B 0.00000000
## BATTING_3B 0.00000000
## BATTING_HR 0.00000000
## BATTING_BB 0.00000000
## PITCHING_H 0.00000000
## PITCHING_HR 0.00000000
## PITCHING_BB 0.00000000
## FIELDING_E 0.00000000
For the remainder of the four variables with missing data, we will use the mice package and random forest method to impute the missing data. Mice uses multivariate imputations to estimate the missing values. Using multiple imputations helps in resolving the uncertainty for the missingness. Our target variable will be removed as a predictor variable but still will be imputed. The table shows us the summary statistics of the original dataset and the imputed dataset. We see there is little change to the mean and standard deviation of the variables.
|
|
We will build three different models to see which one yields the best performance. In the first model below, we use all the variables. Nine of the variables have statistically significant p-values at the 5% significance level. At 0.3602, the adjusted \(R^2\) is not as high as we would prefer.
##
## Call:
## lm(formula = WINS ~ ., data = mbtrain_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.522 -8.465 0.057 8.317 56.251
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.1620349 5.1576713 5.266 1.52e-07 ***
## BATTING_H 0.0457203 0.0036522 12.518 < 2e-16 ***
## BATTING_2B -0.0196581 0.0090517 -2.172 0.02998 *
## BATTING_3B 0.0526023 0.0165326 3.182 0.00148 **
## BATTING_HR 0.0627098 0.0270572 2.318 0.02056 *
## BATTING_BB 0.0094383 0.0057166 1.651 0.09887 .
## BATTING_SO -0.0120071 0.0025213 -4.762 2.04e-06 ***
## BASERUN_SB 0.0385281 0.0040901 9.420 < 2e-16 ***
## PITCHING_H -0.0002590 0.0003724 -0.695 0.48688
## PITCHING_HR 0.0128277 0.0240136 0.534 0.59327
## PITCHING_BB -0.0011329 0.0040532 -0.280 0.77988
## PITCHING_SO 0.0024249 0.0008878 2.731 0.00636 **
## FIELDING_E -0.0285383 0.0025011 -11.410 < 2e-16 ***
## FIELDING_DP -0.0987831 0.0123203 -8.018 1.71e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.94 on 2262 degrees of freedom
## Multiple R-squared: 0.3296, Adjusted R-squared: 0.3257
## F-statistic: 85.53 on 13 and 2262 DF, p-value: < 2.2e-16
For our second model below, we will use the stepwise selection from the MASS package. Our model yielded relatively same summary statistics even though it automatically removed some variables from the calculation. Unlike the previous models, we do not have any unexpected coefficient signs, which means that we have removed at least some of the multilinearity in this model.
##
## Call:
## lm(formula = WINS ~ BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR +
## BATTING_BB + BATTING_SO + BASERUN_SB + PITCHING_SO + FIELDING_E +
## FIELDING_DP, data = mbtrain_imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.298 -8.407 0.017 8.336 56.916
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.1282306 5.0776143 5.343 1.01e-07 ***
## BATTING_H 0.0454590 0.0035868 12.674 < 2e-16 ***
## BATTING_2B -0.0200168 0.0090400 -2.214 0.026910 *
## BATTING_3B 0.0559244 0.0161453 3.464 0.000542 ***
## BATTING_HR 0.0757336 0.0096057 7.884 4.88e-15 ***
## BATTING_BB 0.0080841 0.0032196 2.511 0.012111 *
## BATTING_SO -0.0114068 0.0023937 -4.765 2.00e-06 ***
## BASERUN_SB 0.0392395 0.0039399 9.960 < 2e-16 ***
## PITCHING_SO 0.0020086 0.0005841 3.439 0.000595 ***
## FIELDING_E -0.0296786 0.0020584 -14.418 < 2e-16 ***
## FIELDING_DP -0.0983683 0.0123075 -7.993 2.09e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.93 on 2265 degrees of freedom
## Multiple R-squared: 0.3293, Adjusted R-squared: 0.3263
## F-statistic: 111.2 on 10 and 2265 DF, p-value: < 2.2e-16
As mentioned previously, half of our dataset has many skewed variables. When an attribute has a normal distribution but is shifted, this is called a skew. The distribution of an attribute can be shifted to reduce the skew and make it more normal The Box Cox transform can perform this operation (assumes all values are positive).
##
## Call:
## lm(formula = WINS ~ ., data = mb_bc_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.040 -7.091 0.102 6.973 30.672
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.688e+05 6.314e+04 -5.841 6.13e-09 ***
## BATTING_H 4.821e+05 8.209e+04 5.872 5.10e-09 ***
## BATTING_2B -4.628e-01 8.194e-02 -5.648 1.88e-08 ***
## BATTING_3B 1.797e-01 1.910e-02 9.409 < 2e-16 ***
## BATTING_HR 9.363e-02 7.875e-02 1.189 0.235
## BATTING_BB 1.833e-02 2.710e-02 0.677 0.499
## BATTING_SO -5.715e-03 2.040e-02 -0.280 0.779
## BASERUN_SB 5.621e-02 5.557e-03 10.115 < 2e-16 ***
## PITCHING_H NA NA NA NA
## PITCHING_HR 8.049e-03 7.518e-02 0.107 0.915
## PITCHING_BB 1.687e-02 2.577e-02 0.655 0.513
## PITCHING_SO -1.715e-02 1.936e-02 -0.886 0.376
## FIELDING_E -1.725e+03 1.201e+02 -14.366 < 2e-16 ***
## FIELDING_DP -2.786e-03 3.673e-04 -7.585 5.25e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.36 on 1822 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.3845, Adjusted R-squared: 0.3805
## F-statistic: 94.85 on 12 and 1822 DF, p-value: < 2.2e-16
To make prediction, we will select our third model whose data was transformed using the Box Cox method. This model accounted for the skewness in our data and as a result yielded the highest \(R^2\). In the Q-Q plot, we see some deviations for the normal distribution at the ends. Some of the extreme cases have been labeled. Since we have more than 2000 observations, we can relax the normality assumption. Also the best fitted line has a curve in the main cluster, which indicates that we do not have constant variance. Looking at this plot, I am not sure if this is the best model or we could have improved it somewhow.
## fit lwr upr
## Min. :3.928e+08 Min. :261088771 Min. :5.246e+08
## 1st Qu.:6.680e+08 1st Qu.:444828401 1st Qu.:8.912e+08
## Median :7.008e+08 Median :466660442 Median :9.349e+08
## Mean :7.075e+08 Mean :471086082 Mean :9.440e+08
## 3rd Qu.:7.456e+08 3rd Qu.:496457781 3rd Qu.:9.947e+08
## Max. :1.044e+09 Max. :694355910 Max. :1.393e+09
knitr::opts_chunk$set(warning = F,
message = F,
echo = F,
fig.align = "center")
#load libraries
library(tidyverse)
library(psych)
library(corrplot)
library(stringr)
library(ggthemes)
library(RColorBrewer)
library(knitr)
library(mice)
library(VIM)
library(MASS)
library(caret)
vn <- c("INDEX", "TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B", "TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_BATTING_SO", "TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_BATTING_HBP", "TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_BB", "TEAM_PITCHING_SO", "TEAM_FIELDING_E", "TEAM_FIELDING_DP")
df <- c("Identification Variable (do not use)", "Number of wins", "Base Hits by batters (1B,2B,3B,HR)", "Doubles by batters (2B)", "Triples by batters (3B)", "Homeruns by batters (4B)", "Walks by batters", "Batters hit by pitch (get a free base)", "Strikeouts by batters", "Stolen bases", "Caught stealing", "Errors", "Double Plays", "Walks allowed", "Hits allowed", "Homeruns allowed", "Strikeouts by pitchers")
te <- c("None", "", "Positive Impact on Wins", "Positive Impact on Wins", "Positive Impact on Wins", "Positive Impact on Wins", "Positive Impact on Wins", "Positive Impact on Wins", "Negative Impact on Wins", "Positive Impact on Wins", "Negative Impact on Wins", "Negative Impact on Wins", "Positive Impact on Wins", "Negative Impact on Wins", "Negative Impact on Wins", "Negative Impact on Wins", "Positive Impact on Wins")
kable(cbind(vn, df, te), col.names = c("Variable Name", "Definition", "Theoretical Effect"))
#load data
moneyball_train <- read.csv('https://raw.githubusercontent.com/saayedalam/Data/master/moneyball-training-data.csv')
#rename column
moneyball_train <- moneyball_train %>%
dplyr::select(-INDEX) %>% #remove column index
rename_all(funs(str_replace(., "TEAM_|TARGET_", ""))) #use stringr to rename column
#summary statistic
glimpse(moneyball_train)
describe(moneyball_train, skew = F, ranges = F)
#boxplot
moneyball_train %>%
gather(na.rm = TRUE) %>%
ggplot(aes(factor(key), value)) +
geom_boxplot(outlier.size = 1) +
stat_summary(fun.y=mean, geom="point", size=2, color = "red") +
coord_cartesian(ylim = c(0, 2000)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "Variables", y = "Value")
#histogram
moneyball_train %>%
gather() %>%
ggplot(aes(value)) +
geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3))) +
facet_wrap(~key, scale="free") +
theme_minimal()
#correlation
corr_moneyball <- moneyball_train %>%
drop_na() %>%
cor()
kable(sort(corr_moneyball[,1], decreasing = T), col.names = c("Correlation"))
corrplot(corr_moneyball, method = "pie", type = "upper", order = "FPC",
col = brewer.pal(n = 7, name = "GnBu"),
tl.col = "black", cl.align = "r", cl.ratio = 0.3)
#missing value columns
aggr(moneyball_train, numbers=TRUE, sortVars=TRUE, labels=names(moneyball_train), cex.axis=.7)
#delete two columns
moneyball_trainA <- moneyball_train[, -c(9, 10)]
#imputation
init <- mice(moneyball_trainA)
meth <- init$method
predM <- init$predictorMatrix
predM[, c("WINS")] <- 0 #this code will remove the variable as a predictor but still will be imputed
mbtrain_impute <- mice(moneyball_trainA, method = 'rf', predictorMatrix=predM)
mbtrain_imputed <- complete(mbtrain_impute)
#comparing summary statistics after impuation
a <- describe(mbtrain_imputed, skew = F, ranges = F)
b <- describe(moneyball_trainA, skew = F, ranges = F)
kable(list(a, b))
#model 1 with all variables
model1 <- lm(WINS ~ ., mbtrain_imputed)
summary(model1)
#model 2 stepwise model
model2 <- stepAIC(model1, direction = "both", trace = FALSE)
summary(model2)
#boxcox transformation model
mbtrain_boxcox <- preProcess(moneyball_trainA, c("BoxCox"))
mb_bc_transformed <- predict(mbtrain_boxcox, moneyball_trainA)
model3 <- lm(WINS ~ ., mb_bc_transformed)
summary(model3)
#plot of the model 3
plot(model3)
#same data prepartion for the test dataset
moneyball_test <- read.csv('https://raw.githubusercontent.com/saayedalam/Data/master/moneyball-evaluation-data.csv')
moneyball_test <- moneyball_test %>%
dplyr::select(-c(1, 9, 10)) %>%
rename_all(funs(str_replace(., "TEAM_|TARGET_", "")))
#impute test dataset
init <- mice(moneyball_test)
meth <- init$method
predM <- init$predictorMatrix
mbttest_impute <- mice(moneyball_test, method = 'rf', predictorMatrix=predM)
mbttest_imputed <- complete(mbttest_impute)
prediction <- predict(model3, mbttest_imputed, interval = "prediction")
summary(prediction)