Introduction

In this assignment, we will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season. Our objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. Below is a short description of the variables of interest in the data set:

Variable Name Definition Theoretical Effect
INDEX Identification Variable (do not use) None
TARGET_WINS Number of wins
TEAM_BATTING_H Base Hits by batters (1B,2B,3B,HR) Positive Impact on Wins
TEAM_BATTING_2B Doubles by batters (2B) Positive Impact on Wins
TEAM_BATTING_3B Triples by batters (3B) Positive Impact on Wins
TEAM_BATTING_HR Homeruns by batters (4B) Positive Impact on Wins
TEAM_BATTING_BB Walks by batters Positive Impact on Wins
TEAM_BATTING_SO Batters hit by pitch (get a free base) Positive Impact on Wins
TEAM_BASERUN_SB Strikeouts by batters Negative Impact on Wins
TEAM_BASERUN_CS Stolen bases Positive Impact on Wins
TEAM_BATTING_HBP Caught stealing Negative Impact on Wins
TEAM_PITCHING_H Errors Negative Impact on Wins
TEAM_PITCHING_HR Double Plays Positive Impact on Wins
TEAM_PITCHING_BB Walks allowed Negative Impact on Wins
TEAM_PITCHING_SO Hits allowed Negative Impact on Wins
TEAM_FIELDING_E Homeruns allowed Negative Impact on Wins
TEAM_FIELDING_DP Strikeouts by pitchers Positive Impact on Wins

Data Exploration

After renaming the column names for better readability, a quick look at the summary statistics shows that dataset consists of 16 elements, with 2276 total cases. There are 15 explanatory variables. Moreover, describe function from psych package shows us the mean, standard deviation and the number of missing values. We will look in to missing values more later.

## Observations: 2,276
## Variables: 16
## $ WINS        <int> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72...
## $ BATTING_H   <int> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 13...
## $ BATTING_2B  <int> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, ...
## $ BATTING_3B  <int> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41...
## $ BATTING_HR  <int> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, ...
## $ BATTING_BB  <int> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, ...
## $ BATTING_SO  <int> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 82...
## $ BASERUN_SB  <int> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119, ...
## $ BASERUN_CS  <int> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 10...
## $ BATTING_HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ PITCHING_H  <int> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 13...
## $ PITCHING_HR <int> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, ...
## $ PITCHING_BB <int> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, ...
## $ PITCHING_SO <int> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 8...
## $ FIELDING_E  <int> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131,...
## $ FIELDING_DP <int> NA, 155, 153, 156, 168, 149, 186, 136, 169, 159, 1...
##             vars    n    mean      sd    se
## WINS           1 2276   80.79   15.75  0.33
## BATTING_H      2 2276 1469.27  144.59  3.03
## BATTING_2B     3 2276  241.25   46.80  0.98
## BATTING_3B     4 2276   55.25   27.94  0.59
## BATTING_HR     5 2276   99.61   60.55  1.27
## BATTING_BB     6 2276  501.56  122.67  2.57
## BATTING_SO     7 2174  735.61  248.53  5.33
## BASERUN_SB     8 2145  124.76   87.79  1.90
## BASERUN_CS     9 1504   52.80   22.96  0.59
## BATTING_HBP   10  191   59.36   12.97  0.94
## PITCHING_H    11 2276 1779.21 1406.84 29.49
## PITCHING_HR   12 2276  105.70   61.30  1.28
## PITCHING_BB   13 2276  553.01  166.36  3.49
## PITCHING_SO   14 2174  817.73  553.09 11.86
## FIELDING_E    15 2276  246.48  227.77  4.77
## FIELDING_DP   16 1990  146.39   26.23  0.59

In the box plots below, we see that the variance of some of the explanatory variables greatly exceeds the variance of the response “win” variable. The dataset has many outlines with some observations that are more extreme than the 1.5 * IQR of the box plot whiskers.

In the histogram plot below, we see that the batting, pitching home-run and batting strike-out variables are bi modal. Also, half of the variables show skewness. We will create a box-cox transformation to build our model to mitigate the skewness.

In the correlation plot below, we see home runs, walks, strike outs & hits are highly correlated variables to our target variable.

Correlation
WINS 1.0000000
PITCHING_H 0.4712343
BATTING_H 0.4699467
BATTING_BB 0.4686879
PITCHING_BB 0.4683988
PITCHING_HR 0.4224668
BATTING_HR 0.4224168
BATTING_2B 0.3129840
BATTING_HBP 0.0735042
BASERUN_SB 0.0148364
BATTING_3B -0.1243459
BASERUN_CS -0.1787560
FIELDING_DP -0.1958660
BATTING_SO -0.2288927
PITCHING_SO -0.2293648
FIELDING_E -0.3866880

Data Preparation

The plot below shows the percent of missing value in a given column. BATTING_HBP has more than 90% of its value missing. We will delete this variable. Additionally, BASERUN_CS has close to 30% of its data missing. We will delete this as well. The reason for this decision is because they are missing a lot of data and the correlation between these variables and our target variable is low compared to other variable.

## 
##  Variables sorted by number of missings: 
##     Variable      Count
##  BATTING_HBP 0.91608084
##   BASERUN_CS 0.33919156
##  FIELDING_DP 0.12565905
##   BASERUN_SB 0.05755712
##   BATTING_SO 0.04481547
##  PITCHING_SO 0.04481547
##         WINS 0.00000000
##    BATTING_H 0.00000000
##   BATTING_2B 0.00000000
##   BATTING_3B 0.00000000
##   BATTING_HR 0.00000000
##   BATTING_BB 0.00000000
##   PITCHING_H 0.00000000
##  PITCHING_HR 0.00000000
##  PITCHING_BB 0.00000000
##   FIELDING_E 0.00000000

For the remainder of the four variables with missing data, we will use the mice package and random forest method to impute the missing data. Mice uses multivariate imputations to estimate the missing values. Using multiple imputations helps in resolving the uncertainty for the missingness. Our target variable will be removed as a predictor variable but still will be imputed. The table shows us the summary statistics of the original dataset and the imputed dataset. We see there is little change to the mean and standard deviation of the variables.

vars n mean sd se
WINS 1 2276 80.79086 15.75215 0.3301823
BATTING_H 2 2276 1469.26977 144.59120 3.0307891
BATTING_2B 3 2276 241.24692 46.80141 0.9810087
BATTING_3B 4 2276 55.25000 27.93856 0.5856226
BATTING_HR 5 2276 99.61204 60.54687 1.2691285
BATTING_BB 6 2276 501.55888 122.67086 2.5713150
BATTING_SO 7 2276 730.15158 245.63430 5.1487628
BASERUN_SB 8 2276 130.46046 94.31526 1.9769507
PITCHING_H 9 2276 1779.21046 1406.84293 29.4889618
PITCHING_HR 10 2276 105.69859 61.29875 1.2848886
PITCHING_BB 11 2276 553.00791 166.35736 3.4870317
PITCHING_SO 12 2276 812.72276 546.76514 11.4607936
FIELDING_E 13 2276 246.48067 227.77097 4.7743279
FIELDING_DP 14 2276 143.29965 27.52624 0.5769800
vars n mean sd se
WINS 1 2276 80.79086 15.75215 0.3301823
BATTING_H 2 2276 1469.26977 144.59120 3.0307891
BATTING_2B 3 2276 241.24692 46.80141 0.9810087
BATTING_3B 4 2276 55.25000 27.93856 0.5856226
BATTING_HR 5 2276 99.61204 60.54687 1.2691285
BATTING_BB 6 2276 501.55888 122.67086 2.5713150
BATTING_SO 7 2174 735.60534 248.52642 5.3301912
BASERUN_SB 8 2145 124.76177 87.79117 1.8955584
PITCHING_H 9 2276 1779.21046 1406.84293 29.4889618
PITCHING_HR 10 2276 105.69859 61.29875 1.2848886
PITCHING_BB 11 2276 553.00791 166.35736 3.4870317
PITCHING_SO 12 2174 817.73045 553.08503 11.8621151
FIELDING_E 13 2276 246.48067 227.77097 4.7743279
FIELDING_DP 14 1990 146.38794 26.22639 0.5879114

Build Models

We will build three different models to see which one yields the best performance. In the first model below, we use all the variables. Nine of the variables have statistically significant p-values at the 5% significance level. At 0.3602, the adjusted \(R^2\) is not as high as we would prefer.

## 
## Call:
## lm(formula = WINS ~ ., data = mbtrain_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.522  -8.465   0.057   8.317  56.251 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.1620349  5.1576713   5.266 1.52e-07 ***
## BATTING_H    0.0457203  0.0036522  12.518  < 2e-16 ***
## BATTING_2B  -0.0196581  0.0090517  -2.172  0.02998 *  
## BATTING_3B   0.0526023  0.0165326   3.182  0.00148 ** 
## BATTING_HR   0.0627098  0.0270572   2.318  0.02056 *  
## BATTING_BB   0.0094383  0.0057166   1.651  0.09887 .  
## BATTING_SO  -0.0120071  0.0025213  -4.762 2.04e-06 ***
## BASERUN_SB   0.0385281  0.0040901   9.420  < 2e-16 ***
## PITCHING_H  -0.0002590  0.0003724  -0.695  0.48688    
## PITCHING_HR  0.0128277  0.0240136   0.534  0.59327    
## PITCHING_BB -0.0011329  0.0040532  -0.280  0.77988    
## PITCHING_SO  0.0024249  0.0008878   2.731  0.00636 ** 
## FIELDING_E  -0.0285383  0.0025011 -11.410  < 2e-16 ***
## FIELDING_DP -0.0987831  0.0123203  -8.018 1.71e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.94 on 2262 degrees of freedom
## Multiple R-squared:  0.3296, Adjusted R-squared:  0.3257 
## F-statistic: 85.53 on 13 and 2262 DF,  p-value: < 2.2e-16

For our second model below, we will use the stepwise selection from the MASS package. Our model yielded relatively same summary statistics even though it automatically removed some variables from the calculation. Unlike the previous models, we do not have any unexpected coefficient signs, which means that we have removed at least some of the multilinearity in this model.

## 
## Call:
## lm(formula = WINS ~ BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR + 
##     BATTING_BB + BATTING_SO + BASERUN_SB + PITCHING_SO + FIELDING_E + 
##     FIELDING_DP, data = mbtrain_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.298  -8.407   0.017   8.336  56.916 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.1282306  5.0776143   5.343 1.01e-07 ***
## BATTING_H    0.0454590  0.0035868  12.674  < 2e-16 ***
## BATTING_2B  -0.0200168  0.0090400  -2.214 0.026910 *  
## BATTING_3B   0.0559244  0.0161453   3.464 0.000542 ***
## BATTING_HR   0.0757336  0.0096057   7.884 4.88e-15 ***
## BATTING_BB   0.0080841  0.0032196   2.511 0.012111 *  
## BATTING_SO  -0.0114068  0.0023937  -4.765 2.00e-06 ***
## BASERUN_SB   0.0392395  0.0039399   9.960  < 2e-16 ***
## PITCHING_SO  0.0020086  0.0005841   3.439 0.000595 ***
## FIELDING_E  -0.0296786  0.0020584 -14.418  < 2e-16 ***
## FIELDING_DP -0.0983683  0.0123075  -7.993 2.09e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.93 on 2265 degrees of freedom
## Multiple R-squared:  0.3293, Adjusted R-squared:  0.3263 
## F-statistic: 111.2 on 10 and 2265 DF,  p-value: < 2.2e-16

As mentioned previously, half of our dataset has many skewed variables. When an attribute has a normal distribution but is shifted, this is called a skew. The distribution of an attribute can be shifted to reduce the skew and make it more normal The Box Cox transform can perform this operation (assumes all values are positive).

## 
## Call:
## lm(formula = WINS ~ ., data = mb_bc_transformed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.040  -7.091   0.102   6.973  30.672 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.688e+05  6.314e+04  -5.841 6.13e-09 ***
## BATTING_H    4.821e+05  8.209e+04   5.872 5.10e-09 ***
## BATTING_2B  -4.628e-01  8.194e-02  -5.648 1.88e-08 ***
## BATTING_3B   1.797e-01  1.910e-02   9.409  < 2e-16 ***
## BATTING_HR   9.363e-02  7.875e-02   1.189    0.235    
## BATTING_BB   1.833e-02  2.710e-02   0.677    0.499    
## BATTING_SO  -5.715e-03  2.040e-02  -0.280    0.779    
## BASERUN_SB   5.621e-02  5.557e-03  10.115  < 2e-16 ***
## PITCHING_H          NA         NA      NA       NA    
## PITCHING_HR  8.049e-03  7.518e-02   0.107    0.915    
## PITCHING_BB  1.687e-02  2.577e-02   0.655    0.513    
## PITCHING_SO -1.715e-02  1.936e-02  -0.886    0.376    
## FIELDING_E  -1.725e+03  1.201e+02 -14.366  < 2e-16 ***
## FIELDING_DP -2.786e-03  3.673e-04  -7.585 5.25e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.36 on 1822 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.3845, Adjusted R-squared:  0.3805 
## F-statistic: 94.85 on 12 and 1822 DF,  p-value: < 2.2e-16

Select Models

To make prediction, we will select our third model whose data was transformed using the Box Cox method. This model accounted for the skewness in our data and as a result yielded the highest \(R^2\). In the Q-Q plot, we see some deviations for the normal distribution at the ends. Some of the extreme cases have been labeled. Since we have more than 2000 observations, we can relax the normality assumption. Also the best fitted line has a curve in the main cluster, which indicates that we do not have constant variance. Looking at this plot, I am not sure if this is the best model or we could have improved it somewhow.

##       fit                 lwr                 upr           
##  Min.   :3.928e+08   Min.   :261088771   Min.   :5.246e+08  
##  1st Qu.:6.680e+08   1st Qu.:444828401   1st Qu.:8.912e+08  
##  Median :7.008e+08   Median :466660442   Median :9.349e+08  
##  Mean   :7.075e+08   Mean   :471086082   Mean   :9.440e+08  
##  3rd Qu.:7.456e+08   3rd Qu.:496457781   3rd Qu.:9.947e+08  
##  Max.   :1.044e+09   Max.   :694355910   Max.   :1.393e+09
knitr::opts_chunk$set(warning = F, 
                      message = F,
                      echo = F,
                      fig.align = "center")
#load libraries
library(tidyverse)
library(psych)
library(corrplot)
library(stringr)
library(ggthemes)
library(RColorBrewer)
library(knitr)
library(mice)
library(VIM)
library(MASS)
library(caret)
vn <- c("INDEX", "TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B", "TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_BATTING_SO", "TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_BATTING_HBP", "TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_BB", "TEAM_PITCHING_SO", "TEAM_FIELDING_E", "TEAM_FIELDING_DP")
df <- c("Identification Variable (do not use)", "Number of wins", "Base Hits by batters (1B,2B,3B,HR)", "Doubles by batters (2B)", "Triples by batters (3B)", "Homeruns by batters (4B)", "Walks by batters", "Batters hit by pitch (get a free base)", "Strikeouts by batters", "Stolen bases", "Caught stealing", "Errors", "Double Plays", "Walks allowed", "Hits allowed", "Homeruns allowed", "Strikeouts by pitchers")
te <- c("None", "", "Positive Impact on Wins", "Positive Impact on Wins", "Positive Impact on Wins", "Positive Impact on Wins", "Positive Impact on Wins", "Positive Impact on Wins", "Negative Impact on Wins", "Positive Impact on Wins", "Negative Impact on Wins", "Negative Impact on Wins", "Positive Impact on Wins", "Negative Impact on Wins", "Negative Impact on Wins", "Negative Impact on Wins", "Positive Impact on Wins")
kable(cbind(vn, df, te), col.names = c("Variable Name", "Definition", "Theoretical Effect"))
#load data
moneyball_train <- read.csv('https://raw.githubusercontent.com/saayedalam/Data/master/moneyball-training-data.csv')

#rename column
moneyball_train <- moneyball_train %>%
  dplyr::select(-INDEX) %>% #remove column index
  rename_all(funs(str_replace(., "TEAM_|TARGET_", ""))) #use stringr to rename column

#summary statistic  
glimpse(moneyball_train)
describe(moneyball_train, skew = F, ranges = F)
#boxplot
moneyball_train %>%
  gather(na.rm = TRUE) %>%
  ggplot(aes(factor(key), value)) +
  geom_boxplot(outlier.size = 1) +
  stat_summary(fun.y=mean, geom="point", size=2, color = "red") +
  coord_cartesian(ylim = c(0, 2000)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Variables", y = "Value")
#histogram
moneyball_train %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3))) +
  facet_wrap(~key, scale="free") +
  theme_minimal()
#correlation
corr_moneyball <- moneyball_train %>%
  drop_na() %>%
  cor()

kable(sort(corr_moneyball[,1], decreasing = T), col.names = c("Correlation"))

corrplot(corr_moneyball, method = "pie", type = "upper", order = "FPC",
         col = brewer.pal(n = 7, name = "GnBu"),
         tl.col = "black", cl.align = "r", cl.ratio = 0.3)
#missing value columns
aggr(moneyball_train, numbers=TRUE, sortVars=TRUE, labels=names(moneyball_train), cex.axis=.7)

#delete two columns
moneyball_trainA <- moneyball_train[, -c(9, 10)]
#imputation
init <- mice(moneyball_trainA)
meth <- init$method
predM <- init$predictorMatrix
predM[, c("WINS")] <- 0 #this code will remove the variable as a predictor but still will be imputed
mbtrain_impute <- mice(moneyball_trainA, method = 'rf', predictorMatrix=predM)
mbtrain_imputed <- complete(mbtrain_impute)
#comparing summary statistics after impuation
a <- describe(mbtrain_imputed, skew = F, ranges = F)
b <- describe(moneyball_trainA, skew = F, ranges = F)
kable(list(a, b))
#model 1 with all variables
model1 <- lm(WINS ~ ., mbtrain_imputed)
summary(model1)
#model 2 stepwise model
model2 <- stepAIC(model1, direction = "both", trace = FALSE)
summary(model2)
#boxcox transformation model
mbtrain_boxcox <- preProcess(moneyball_trainA, c("BoxCox"))
mb_bc_transformed <- predict(mbtrain_boxcox, moneyball_trainA)
model3 <- lm(WINS ~ ., mb_bc_transformed)
summary(model3)
#plot of the model 3
plot(model3)

#same data prepartion for the test dataset
moneyball_test <- read.csv('https://raw.githubusercontent.com/saayedalam/Data/master/moneyball-evaluation-data.csv')
moneyball_test <- moneyball_test %>%
  dplyr::select(-c(1, 9, 10)) %>%
  rename_all(funs(str_replace(., "TEAM_|TARGET_", "")))
#impute test dataset
init <- mice(moneyball_test)
meth <- init$method
predM <- init$predictorMatrix
mbttest_impute <- mice(moneyball_test, method = 'rf', predictorMatrix=predM)
mbttest_imputed <- complete(mbttest_impute)
prediction <- predict(model3, mbttest_imputed, interval = "prediction")
summary(prediction)