Homework 1

Introduction

In this assignment, we will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season. Our objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. Below is a short description of the variables of interest in the data set:

Variable Name	Definition	Theoretical Effect
INDEX	Identification Variable (do not use)	None
TARGET_WINS	Number of wins
TEAM_BATTING_H	Base Hits by batters (1B,2B,3B,HR)	Positive Impact on Wins
TEAM_BATTING_2B	Doubles by batters (2B)	Positive Impact on Wins
TEAM_BATTING_3B	Triples by batters (3B)	Positive Impact on Wins
TEAM_BATTING_HR	Homeruns by batters (4B)	Positive Impact on Wins
TEAM_BATTING_BB	Walks by batters	Positive Impact on Wins
TEAM_BATTING_SO	Batters hit by pitch (get a free base)	Positive Impact on Wins
TEAM_BASERUN_SB	Strikeouts by batters	Negative Impact on Wins
TEAM_BASERUN_CS	Stolen bases	Positive Impact on Wins
TEAM_BATTING_HBP	Caught stealing	Negative Impact on Wins
TEAM_PITCHING_H	Errors	Negative Impact on Wins
TEAM_PITCHING_HR	Double Plays	Positive Impact on Wins
TEAM_PITCHING_BB	Walks allowed	Negative Impact on Wins
TEAM_PITCHING_SO	Hits allowed	Negative Impact on Wins
TEAM_FIELDING_E	Homeruns allowed	Negative Impact on Wins
TEAM_FIELDING_DP	Strikeouts by pitchers	Positive Impact on Wins

Data Exploration

After renaming the column names for better readability, a quick look at the summary statistics shows that dataset consists of 16 elements, with 2276 total cases. There are 15 explanatory variables. Moreover, describe function from psych package shows us the mean, standard deviation and the number of missing values. We will look in to missing values more later.

## Observations: 2,276
## Variables: 16
## $ WINS        <int> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72...
## $ BATTING_H   <int> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 13...
## $ BATTING_2B  <int> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, ...
## $ BATTING_3B  <int> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41...
## $ BATTING_HR  <int> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, ...
## $ BATTING_BB  <int> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, ...
## $ BATTING_SO  <int> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 82...
## $ BASERUN_SB  <int> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119, ...
## $ BASERUN_CS  <int> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 10...
## $ BATTING_HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ PITCHING_H  <int> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 13...
## $ PITCHING_HR <int> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, ...
## $ PITCHING_BB <int> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, ...
## $ PITCHING_SO <int> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 8...
## $ FIELDING_E  <int> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131,...
## $ FIELDING_DP <int> NA, 155, 153, 156, 168, 149, 186, 136, 169, 159, 1...

##             vars    n    mean      sd    se
## WINS           1 2276   80.79   15.75  0.33
## BATTING_H      2 2276 1469.27  144.59  3.03
## BATTING_2B     3 2276  241.25   46.80  0.98
## BATTING_3B     4 2276   55.25   27.94  0.59
## BATTING_HR     5 2276   99.61   60.55  1.27
## BATTING_BB     6 2276  501.56  122.67  2.57
## BATTING_SO     7 2174  735.61  248.53  5.33
## BASERUN_SB     8 2145  124.76   87.79  1.90
## BASERUN_CS     9 1504   52.80   22.96  0.59
## BATTING_HBP   10  191   59.36   12.97  0.94
## PITCHING_H    11 2276 1779.21 1406.84 29.49
## PITCHING_HR   12 2276  105.70   61.30  1.28
## PITCHING_BB   13 2276  553.01  166.36  3.49
## PITCHING_SO   14 2174  817.73  553.09 11.86
## FIELDING_E    15 2276  246.48  227.77  4.77
## FIELDING_DP   16 1990  146.39   26.23  0.59

In the box plots below, we see that the variance of some of the explanatory variables greatly exceeds the variance of the response “win” variable. The dataset has many outlines with some observations that are more extreme than the 1.5 * IQR of the box plot whiskers.

In the histogram plot below, we see that the batting, pitching home-run and batting strike-out variables are bi modal. Also, half of the variables show skewness. We will create a box-cox transformation to build our model to mitigate the skewness.

In the correlation plot below, we see home runs, walks, strike outs & hits are highly correlated variables to our target variable.

	Correlation
WINS	1.0000000
PITCHING_H	0.4712343
BATTING_H	0.4699467
BATTING_BB	0.4686879
PITCHING_BB	0.4683988
PITCHING_HR	0.4224668
BATTING_HR	0.4224168
BATTING_2B	0.3129840
BATTING_HBP	0.0735042
BASERUN_SB	0.0148364
BATTING_3B	-0.1243459
BASERUN_CS	-0.1787560
FIELDING_DP	-0.1958660
BATTING_SO	-0.2288927
PITCHING_SO	-0.2293648
FIELDING_E	-0.3866880

Data Preparation

The plot below shows the percent of missing value in a given column. BATTING_HBP has more than 90% of its value missing. We will delete this variable. Additionally, BASERUN_CS has close to 30% of its data missing. We will delete this as well. The reason for this decision is because they are missing a lot of data and the correlation between these variables and our target variable is low compared to other variable.

## 
##  Variables sorted by number of missings: 
##     Variable      Count
##  BATTING_HBP 0.91608084
##   BASERUN_CS 0.33919156
##  FIELDING_DP 0.12565905
##   BASERUN_SB 0.05755712
##   BATTING_SO 0.04481547
##  PITCHING_SO 0.04481547
##         WINS 0.00000000
##    BATTING_H 0.00000000
##   BATTING_2B 0.00000000
##   BATTING_3B 0.00000000
##   BATTING_HR 0.00000000
##   BATTING_BB 0.00000000
##   PITCHING_H 0.00000000
##  PITCHING_HR 0.00000000
##  PITCHING_BB 0.00000000
##   FIELDING_E 0.00000000

For the remainder of the four variables with missing data, we will use the mice package and random forest method to impute the missing data. Mice uses multivariate imputations to estimate the missing values. Using multiple imputations helps in resolving the uncertainty for the missingness. Our target variable will be removed as a predictor variable but still will be imputed. The table shows us the summary statistics of the original dataset and the imputed dataset. We see there is little change to the mean and standard deviation of the variables.

	vars	n	mean	sd	se
WINS	1	2276	80.79086	15.75215	0.3301823
BATTING_H	2	2276	1469.26977	144.59120	3.0307891
BATTING_2B	3	2276	241.24692	46.80141	0.9810087
BATTING_3B	4	2276	55.25000	27.93856	0.5856226
BATTING_HR	5	2276	99.61204	60.54687	1.2691285
BATTING_BB	6	2276	501.55888	122.67086	2.5713150
BATTING_SO	7	2276	730.15158	245.63430	5.1487628
BASERUN_SB	8	2276	130.46046	94.31526	1.9769507
PITCHING_H	9	2276	1779.21046	1406.84293	29.4889618
PITCHING_HR	10	2276	105.69859	61.29875	1.2848886
PITCHING_BB	11	2276	553.00791	166.35736	3.4870317
PITCHING_SO	12	2276	812.72276	546.76514	11.4607936
FIELDING_E	13	2276	246.48067	227.77097	4.7743279
FIELDING_DP	14	2276	143.29965	27.52624	0.5769800

	vars	n	mean	sd	se
WINS	1	2276	80.79086	15.75215	0.3301823
BATTING_H	2	2276	1469.26977	144.59120	3.0307891
BATTING_2B	3	2276	241.24692	46.80141	0.9810087
BATTING_3B	4	2276	55.25000	27.93856	0.5856226
BATTING_HR	5	2276	99.61204	60.54687	1.2691285
BATTING_BB	6	2276	501.55888	122.67086	2.5713150
BATTING_SO	7	2174	735.60534	248.52642	5.3301912
BASERUN_SB	8	2145	124.76177	87.79117	1.8955584
PITCHING_H	9	2276	1779.21046	1406.84293	29.4889618
PITCHING_HR	10	2276	105.69859	61.29875	1.2848886
PITCHING_BB	11	2276	553.00791	166.35736	3.4870317
PITCHING_SO	12	2174	817.73045	553.08503	11.8621151
FIELDING_E	13	2276	246.48067	227.77097	4.7743279
FIELDING_DP	14	1990	146.38794	26.22639	0.5879114

Build Models

We will build three different models to see which one yields the best performance. In the first model below, we use all the variables. Nine of the variables have statistically significant p-values at the 5% significance level. At 0.3602, the adjusted \(R^2\) is not as high as we would prefer.

## 
## Call:
## lm(formula = WINS ~ ., data = mbtrain_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.522  -8.465   0.057   8.317  56.251 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.1620349  5.1576713   5.266 1.52e-07 ***
## BATTING_H    0.0457203  0.0036522  12.518  < 2e-16 ***
## BATTING_2B  -0.0196581  0.0090517  -2.172  0.02998 *  
## BATTING_3B   0.0526023  0.0165326   3.182  0.00148 ** 
## BATTING_HR   0.0627098  0.0270572   2.318  0.02056 *  
## BATTING_BB   0.0094383  0.0057166   1.651  0.09887 .  
## BATTING_SO  -0.0120071  0.0025213  -4.762 2.04e-06 ***
## BASERUN_SB   0.0385281  0.0040901   9.420  < 2e-16 ***
## PITCHING_H  -0.0002590  0.0003724  -0.695  0.48688    
## PITCHING_HR  0.0128277  0.0240136   0.534  0.59327    
## PITCHING_BB -0.0011329  0.0040532  -0.280  0.77988    
## PITCHING_SO  0.0024249  0.0008878   2.731  0.00636 ** 
## FIELDING_E  -0.0285383  0.0025011 -11.410  < 2e-16 ***
## FIELDING_DP -0.0987831  0.0123203  -8.018 1.71e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.94 on 2262 degrees of freedom
## Multiple R-squared:  0.3296, Adjusted R-squared:  0.3257 
## F-statistic: 85.53 on 13 and 2262 DF,  p-value: < 2.2e-16

For our second model below, we will use the stepwise selection from the MASS package. Our model yielded relatively same summary statistics even though it automatically removed some variables from the calculation. Unlike the previous models, we do not have any unexpected coefficient signs, which means that we have removed at least some of the multilinearity in this model.

## 
## Call:
## lm(formula = WINS ~ BATTING_H + BATTING_2B + BATTING_3B + BATTING_HR + 
##     BATTING_BB + BATTING_SO + BASERUN_SB + PITCHING_SO + FIELDING_E + 
##     FIELDING_DP, data = mbtrain_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.298  -8.407   0.017   8.336  56.916 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27.1282306  5.0776143   5.343 1.01e-07 ***
## BATTING_H    0.0454590  0.0035868  12.674  < 2e-16 ***
## BATTING_2B  -0.0200168  0.0090400  -2.214 0.026910 *  
## BATTING_3B   0.0559244  0.0161453   3.464 0.000542 ***
## BATTING_HR   0.0757336  0.0096057   7.884 4.88e-15 ***
## BATTING_BB   0.0080841  0.0032196   2.511 0.012111 *  
## BATTING_SO  -0.0114068  0.0023937  -4.765 2.00e-06 ***
## BASERUN_SB   0.0392395  0.0039399   9.960  < 2e-16 ***
## PITCHING_SO  0.0020086  0.0005841   3.439 0.000595 ***
## FIELDING_E  -0.0296786  0.0020584 -14.418  < 2e-16 ***
## FIELDING_DP -0.0983683  0.0123075  -7.993 2.09e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.93 on 2265 degrees of freedom
## Multiple R-squared:  0.3293, Adjusted R-squared:  0.3263 
## F-statistic: 111.2 on 10 and 2265 DF,  p-value: < 2.2e-16

As mentioned previously, half of our dataset has many skewed variables. When an attribute has a normal distribution but is shifted, this is called a skew. The distribution of an attribute can be shifted to reduce the skew and make it more normal The Box Cox transform can perform this operation (assumes all values are positive).

## 
## Call:
## lm(formula = WINS ~ ., data = mb_bc_transformed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.040  -7.091   0.102   6.973  30.672 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.688e+05  6.314e+04  -5.841 6.13e-09 ***
## BATTING_H    4.821e+05  8.209e+04   5.872 5.10e-09 ***
## BATTING_2B  -4.628e-01  8.194e-02  -5.648 1.88e-08 ***
## BATTING_3B   1.797e-01  1.910e-02   9.409  < 2e-16 ***
## BATTING_HR   9.363e-02  7.875e-02   1.189    0.235    
## BATTING_BB   1.833e-02  2.710e-02   0.677    0.499    
## BATTING_SO  -5.715e-03  2.040e-02  -0.280    0.779    
## BASERUN_SB   5.621e-02  5.557e-03  10.115  < 2e-16 ***
## PITCHING_H          NA         NA      NA       NA    
## PITCHING_HR  8.049e-03  7.518e-02   0.107    0.915    
## PITCHING_BB  1.687e-02  2.577e-02   0.655    0.513    
## PITCHING_SO -1.715e-02  1.936e-02  -0.886    0.376    
## FIELDING_E  -1.725e+03  1.201e+02 -14.366  < 2e-16 ***
## FIELDING_DP -2.786e-03  3.673e-04  -7.585 5.25e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.36 on 1822 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.3845, Adjusted R-squared:  0.3805 
## F-statistic: 94.85 on 12 and 1822 DF,  p-value: < 2.2e-16

Select Models

To make prediction, we will select our third model whose data was transformed using the Box Cox method. This model accounted for the skewness in our data and as a result yielded the highest \(R^2\). In the Q-Q plot, we see some deviations for the normal distribution at the ends. Some of the extreme cases have been labeled. Since we have more than 2000 observations, we can relax the normality assumption. Also the best fitted line has a curve in the main cluster, which indicates that we do not have constant variance. Looking at this plot, I am not sure if this is the best model or we could have improved it somewhow.

##       fit                 lwr                 upr           
##  Min.   :3.928e+08   Min.   :261088771   Min.   :5.246e+08  
##  1st Qu.:6.680e+08   1st Qu.:444828401   1st Qu.:8.912e+08  
##  Median :7.008e+08   Median :466660442   Median :9.349e+08  
##  Mean   :7.075e+08   Mean   :471086082   Mean   :9.440e+08  
##  3rd Qu.:7.456e+08   3rd Qu.:496457781   3rd Qu.:9.947e+08  
##  Max.   :1.044e+09   Max.   :694355910   Max.   :1.393e+09

knitr::opts_chunk$set(warning = F, 
                      message = F,
                      echo = F,
                      fig.align = "center")
#load libraries
library(tidyverse)
library(psych)
library(corrplot)
library(stringr)
library(ggthemes)
library(RColorBrewer)
library(knitr)
library(mice)
library(VIM)
library(MASS)
library(caret)
vn <- c("INDEX", "TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B", "TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_BATTING_SO", "TEAM_BASERUN_SB", "TEAM_BASERUN_CS", "TEAM_BATTING_HBP", "TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_BB", "TEAM_PITCHING_SO", "TEAM_FIELDING_E", "TEAM_FIELDING_DP")
df <- c("Identification Variable (do not use)", "Number of wins", "Base Hits by batters (1B,2B,3B,HR)", "Doubles by batters (2B)", "Triples by batters (3B)", "Homeruns by batters (4B)", "Walks by batters", "Batters hit by pitch (get a free base)", "Strikeouts by batters", "Stolen bases", "Caught stealing", "Errors", "Double Plays", "Walks allowed", "Hits allowed", "Homeruns allowed", "Strikeouts by pitchers")
te <- c("None", "", "Positive Impact on Wins", "Positive Impact on Wins", "Positive Impact on Wins", "Positive Impact on Wins", "Positive Impact on Wins", "Positive Impact on Wins", "Negative Impact on Wins", "Positive Impact on Wins", "Negative Impact on Wins", "Negative Impact on Wins", "Positive Impact on Wins", "Negative Impact on Wins", "Negative Impact on Wins", "Negative Impact on Wins", "Positive Impact on Wins")
kable(cbind(vn, df, te), col.names = c("Variable Name", "Definition", "Theoretical Effect"))
#load data
moneyball_train <- read.csv('https://raw.githubusercontent.com/saayedalam/Data/master/moneyball-training-data.csv')

#rename column
moneyball_train <- moneyball_train %>%
  dplyr::select(-INDEX) %>% #remove column index
  rename_all(funs(str_replace(., "TEAM_|TARGET_", ""))) #use stringr to rename column

#summary statistic  
glimpse(moneyball_train)
describe(moneyball_train, skew = F, ranges = F)
#boxplot
moneyball_train %>%
  gather(na.rm = TRUE) %>%
  ggplot(aes(factor(key), value)) +
  geom_boxplot(outlier.size = 1) +
  stat_summary(fun.y=mean, geom="point", size=2, color = "red") +
  coord_cartesian(ylim = c(0, 2000)) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Variables", y = "Value")
#histogram
moneyball_train %>%
  gather() %>%
  ggplot(aes(value)) +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3))) +
  facet_wrap(~key, scale="free") +
  theme_minimal()
#correlation
corr_moneyball <- moneyball_train %>%
  drop_na() %>%
  cor()

kable(sort(corr_moneyball[,1], decreasing = T), col.names = c("Correlation"))

corrplot(corr_moneyball, method = "pie", type = "upper", order = "FPC",
         col = brewer.pal(n = 7, name = "GnBu"),
         tl.col = "black", cl.align = "r", cl.ratio = 0.3)
#missing value columns
aggr(moneyball_train, numbers=TRUE, sortVars=TRUE, labels=names(moneyball_train), cex.axis=.7)

#delete two columns
moneyball_trainA <- moneyball_train[, -c(9, 10)]
#imputation
init <- mice(moneyball_trainA)
meth <- init$method
predM <- init$predictorMatrix
predM[, c("WINS")] <- 0 #this code will remove the variable as a predictor but still will be imputed
mbtrain_impute <- mice(moneyball_trainA, method = 'rf', predictorMatrix=predM)
mbtrain_imputed <- complete(mbtrain_impute)
#comparing summary statistics after impuation
a <- describe(mbtrain_imputed, skew = F, ranges = F)
b <- describe(moneyball_trainA, skew = F, ranges = F)
kable(list(a, b))
#model 1 with all variables
model1 <- lm(WINS ~ ., mbtrain_imputed)
summary(model1)
#model 2 stepwise model
model2 <- stepAIC(model1, direction = "both", trace = FALSE)
summary(model2)
#boxcox transformation model
mbtrain_boxcox <- preProcess(moneyball_trainA, c("BoxCox"))
mb_bc_transformed <- predict(mbtrain_boxcox, moneyball_trainA)
model3 <- lm(WINS ~ ., mb_bc_transformed)
summary(model3)
#plot of the model 3
plot(model3)

#same data prepartion for the test dataset
moneyball_test <- read.csv('https://raw.githubusercontent.com/saayedalam/Data/master/moneyball-evaluation-data.csv')
moneyball_test <- moneyball_test %>%
  dplyr::select(-c(1, 9, 10)) %>%
  rename_all(funs(str_replace(., "TEAM_|TARGET_", "")))
#impute test dataset
init <- mice(moneyball_test)
meth <- init$method
predM <- init$predictorMatrix
mbttest_impute <- mice(moneyball_test, method = 'rf', predictorMatrix=predM)
mbttest_imputed <- complete(mbttest_impute)
prediction <- predict(model3, mbttest_imputed, interval = "prediction")
summary(prediction)