Overview

Explore, analyze, and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive and contains the team's performance for the given year, with all statistics adjusted to the length of a 162-game season. The objective is to build a multiple linear regression model on the training data to predict the number of wins for the team, using only the variables given (or variables derived from the variables provided).

VARIABLE NAME DEFINITION THEORETICAL EFFECT
INDEX Identification Variable (do not use) None
TARGET_WINS Number of wins Outcome Variable
TEAM_BATTING_H Base Hits by batters (1B,2B,3B,HR) Positive Impact on Wins
TEAM_BATTING_2B Doubles by batters (2B) Positive Impact on Wins
TEAM_BATTING_3B Triples by batters (3B) Positive Impact on Wins
TEAM_BATTING_HR Homeruns by batters (4B) Positive Impact on Wins
TEAM_BATTING_BB Walks by batters Positive Impact on Wins
TEAM_BATTING_HBP Batters hit by pitch (get a free base) Positive Impact on Wins
TEAM_BATTING_SO Strikeouts by batters Negative Impact on Wins
TEAM_BASERUN_SB Stolen bases Positive Impact on Wins
TEAM_BASERUN_CS Caught stealing Negative Impact on Wins
TEAM_FIELDING_E Errors Negative Impact on Wins
TEAM_FIELDING_DP Double Plays Positive Impact on Wins
TEAM_PITCHING_BB Walks allowed Negative Impact on Wins
TEAM_PITCHING_H Hits allowed Negative Impact on Wins
TEAM_PITCHING_HR Homeruns allowed Negative Impact on Wins
TEAM_PITCHING_SO Strikeouts by pitchers Positive Impact on Wins

Data Exploration

Describe the size and the variables in the moneyball training data set.

training <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/",
                            "SPS/master/DATA%20621/moneyball-training-data.csv"))
Y <- training[ , 2]
X <- training[ , 3:17]

Numerical Summaries

summary(training[ ,-1])
##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##                                                                  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##                                   NA's   :102      NA's   :131    
##  TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
##  Min.   :  0.0   Min.   :29.00    Min.   : 1137   Min.   :  0.0   
##  1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419   1st Qu.: 50.0   
##  Median : 49.0   Median :58.00    Median : 1518   Median :107.0   
##  Mean   : 52.8   Mean   :59.36    Mean   : 1779   Mean   :105.7   
##  3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682   3rd Qu.:150.0   
##  Max.   :201.0   Max.   :95.00    Max.   :30132   Max.   :343.0   
##  NA's   :772     NA's   :2085                                     
##  TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 52.0   
##  1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0   1st Qu.:131.0   
##  Median : 536.5   Median :  813.5   Median : 159.0   Median :149.0   
##  Mean   : 553.0   Mean   :  817.7   Mean   : 246.5   Mean   :146.4   
##  3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2   3rd Qu.:164.0   
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.0   
##                   NA's   :102                        NA's   :286

Looking at the data summaries, we can see right from the start that there are some variables (TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_PITCHING_SO, TEAM_FIELDING_DP) with missing data. We can also see multiple skewed variables with large maximums relative to their mean and median.
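
As a quick numeric check of both observations, the sketch below (not part of the original output) counts missing values per variable and uses the informal (mean - median) / sd gap as a rough indicator of right skew; large positive values flag heavily right-skewed variables such as TEAM_PITCHING_H.

colSums(is.na(training[ , -1]))         # missing values per variable
sapply(training[ , -1], function(v) {   # rough right-skew indicator
  v <- v[!is.na(v)]
  round((mean(v) - median(v)) / sd(v), 3)
})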

Visual Exploration

par(mfrow = c(2,4))
Z <- cbind(Y, X)
colnames(Z) <- c("TARGET_WINS", colnames(X))
for (i in 1:ncol(Z)) {
  variable <- ifelse(i==1, names(Z[i]), substr(names(Z[i]), 6, nchar(names(Z[i]))))
  hist(Z[ ,i], xlab = variable, main = variable)
}

for (i in 1:ncol(Z)) {
  variable <- ifelse(i==1, names(Z[i]), substr(names(Z[i]), 6, nchar(names(Z[i]))))
  d <- density(Z[,i], na.rm = TRUE)
  plot(d, main = variable)
  polygon(d, col="red")
}

The histograms and density plots give a better understanding of how the data are distributed. The distributions of some variables could pass for a variation of a Gaussian or Student-t distribution. Others show heavy skew, high kurtosis, or multimodality.

heatmap(cor(Z, use = "complete.obs"), Colv = NA, Rowv = NA, scale = "none")

The heatmap provides a nice visual representation of the correlations in the data. Missing values are handled by the use = "complete.obs" argument to cor(), which computes the correlation matrix from complete cases only.

Data Preparation

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this.

Missing Values

nulls <- data.frame(array(NA, dim = c(ncol(X), 3)))
colnames(nulls) <- c("Variable", "Nulls", "Percent")
for (i in 1:ncol(X)) {
  nulls[i, 1] <- names(X[i])
  nulls[i, 2] <- sum(is.na(X[i]))
  nulls[i, 3] <- nulls[i, 2] / length(Y)
}; nulls
##            Variable Nulls    Percent
## 1    TEAM_BATTING_H     0 0.00000000
## 2   TEAM_BATTING_2B     0 0.00000000
## 3   TEAM_BATTING_3B     0 0.00000000
## 4   TEAM_BATTING_HR     0 0.00000000
## 5   TEAM_BATTING_BB     0 0.00000000
## 6   TEAM_BATTING_SO   102 0.04481547
## 7   TEAM_BASERUN_SB   131 0.05755712
## 8   TEAM_BASERUN_CS   772 0.33919156
## 9  TEAM_BATTING_HBP  2085 0.91608084
## 10  TEAM_PITCHING_H     0 0.00000000
## 11 TEAM_PITCHING_HR     0 0.00000000
## 12 TEAM_PITCHING_BB     0 0.00000000
## 13 TEAM_PITCHING_SO   102 0.04481547
## 14  TEAM_FIELDING_E     0 0.00000000
## 15 TEAM_FIELDING_DP   286 0.12565905
X <- X[, -c(7,8,9,15)] # Large Number of Nulls
all.equal(is.na(X[,6]), is.na(X[,10]))
## [1] TRUE

Some of the variables are missing a significant amount of data; for example, TEAM_BATTING_HBP is missing 91.61% of its values. Variables missing more than 5% of their values (TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_FIELDING_DP) will not be considered for the model. It is also a little interesting that TEAM_BATTING_SO and TEAM_PITCHING_SO have the same number of missing values. This appears to be a consequence of data collection, since both variables have nulls at the same index locations.
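
The hard-coded column indices above encode the 5% rule; as a quick cross-check (a sketch using the nulls table computed above), the same four variables can be identified directly:

keep <- nulls$Percent <= 0.05   # TRUE for variables missing at most 5% of their values
nulls$Variable[!keep]           # the four variables dropped above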

Correlations

Between \(Y\) and \(X\) Variables

corr_XY <- data.frame(array(NA, dim = c(ncol(X), 5)))
colnames(corr_XY) <- c("Y", "X", "r","p","<0.05")
for (i in 1:ncol(X)) {
  r <- cor.test(Y, X[,i]) # cor.test() drops incomplete pairs of observations automatically
  corr_XY[i, 1] <- "TARGET_WINS"
  corr_XY[i, 2] <- names(X[i])
  corr_XY[i, 3] <- r$estimate
  corr_XY[i, 4] <- r$p.value
  corr_XY[i, 5] <- corr_XY[i, 4] < 0.05
}; corr_XY
##              Y                X           r            p <0.05
## 1  TARGET_WINS   TEAM_BATTING_H  0.38876752 5.238956e-83  TRUE
## 2  TARGET_WINS  TEAM_BATTING_2B  0.28910365 4.585585e-45  TRUE
## 3  TARGET_WINS  TEAM_BATTING_3B  0.14260841 8.217437e-12  TRUE
## 4  TARGET_WINS  TEAM_BATTING_HR  0.17615320 2.551007e-17  TRUE
## 5  TARGET_WINS  TEAM_BATTING_BB  0.23255986 2.500423e-29  TRUE
## 6  TARGET_WINS  TEAM_BATTING_SO -0.03175071 1.388904e-01 FALSE
## 7  TARGET_WINS  TEAM_PITCHING_H -0.10993705 1.457270e-07  TRUE
## 8  TARGET_WINS TEAM_PITCHING_HR  0.18901373 9.502088e-20  TRUE
## 9  TARGET_WINS TEAM_PITCHING_BB  0.12417454 2.784686e-09  TRUE
## 10 TARGET_WINS TEAM_PITCHING_SO -0.07843609 2.515153e-04  TRUE
## 11 TARGET_WINS  TEAM_FIELDING_E -0.17648476 2.219881e-17  TRUE
X <- X[, -c(3,6,7,9,10)] # Insignificant/Low Correlation with Y

Most variables have a statistically significant correlation with TARGET_WINS. The exception is TEAM_BATTING_SO, which will not be considered for the model because it is insignificant at the \(\alpha=0.05\) significance level. Low-impact variables with \(|r|<0.15\) (TEAM_BATTING_3B, TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_PITCHING_SO) will not be considered for the model either. It is worth noting that cor.test() drops incomplete pairs of observations automatically, so each correlation above is computed only from the cases in which both TARGET_WINS and the given predictor are observed.
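
The column removal above (X[, -c(3,6,7,9,10)]) applies these two rules by hand; as a cross-check, the same variables can be identified from the corr_XY table (a sketch, not part of the original output):

drop <- !corr_XY$`<0.05` | abs(corr_XY$r) < 0.15   # insignificant or |r| < 0.15
corr_XY$X[drop]                                    # the five variables dropped above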

Between \(X\) Variables

corr_XX <- data.frame(array(NA, dim = c(choose(ncol(X), 2), 5)))
colnames(corr_XX) <- c("X1", "X2", "r","p","<0.05"); k = 1
for (i in 1:(ncol(X) - 1)) {
  for (j in (i+1):ncol(X)) {
    r <- cor.test(X[,i], X[,j]) # incomplete pairs are dropped automatically
    corr_XX[k, 1] <- names(X[i])
    corr_XX[k, 2] <- names(X[j])
    corr_XX[k, 3] <- r$estimate
    corr_XX[k, 4] <- r$p.value
    corr_XX[k, 5] <- corr_XX[k, 4] < 0.05
    k = k + 1
  }
}; corr_XX
##                  X1               X2            r             p <0.05
## 1    TEAM_BATTING_H  TEAM_BATTING_2B  0.562849678 2.258301e-190  TRUE
## 2    TEAM_BATTING_H  TEAM_BATTING_HR -0.006544685  7.549934e-01 FALSE
## 3    TEAM_BATTING_H  TEAM_BATTING_BB -0.072464013  5.407324e-04  TRUE
## 4    TEAM_BATTING_H TEAM_PITCHING_HR  0.072853119  5.045119e-04  TRUE
## 5    TEAM_BATTING_H  TEAM_FIELDING_E  0.264902478  7.430104e-38  TRUE
## 6   TEAM_BATTING_2B  TEAM_BATTING_HR  0.435397293 6.206460e-106  TRUE
## 7   TEAM_BATTING_2B  TEAM_BATTING_BB  0.255726103  2.610613e-35  TRUE
## 8   TEAM_BATTING_2B TEAM_PITCHING_HR  0.454550818 1.894857e-116  TRUE
## 9   TEAM_BATTING_2B  TEAM_FIELDING_E -0.235150986  5.757433e-30  TRUE
## 10  TEAM_BATTING_HR  TEAM_BATTING_BB  0.513734810 1.598288e-153  TRUE
## 11  TEAM_BATTING_HR TEAM_PITCHING_HR  0.969371396  0.000000e+00  TRUE
## 12  TEAM_BATTING_HR  TEAM_FIELDING_E -0.587339098 3.510100e-211  TRUE
## 13  TEAM_BATTING_BB TEAM_PITCHING_HR  0.459552072 2.625820e-119  TRUE
## 14  TEAM_BATTING_BB  TEAM_FIELDING_E -0.655970815 3.785270e-280  TRUE
## 15 TEAM_PITCHING_HR  TEAM_FIELDING_E -0.493144466 8.650426e-140  TRUE
X <- X[, -c(1,3)] # Significant/High Correlation

There is statistically significant correlation between nearly every pair of predictor variables at the \(\alpha=0.05\) significance level; the only exception is the pair TEAM_BATTING_H and TEAM_BATTING_HR. Because this collinearity is so pervasive, the predictor set is reduced. The variable TEAM_BATTING_H, which is significantly correlated with every other predictor except TEAM_BATTING_HR, will not be considered for the model. In addition, as a measure of prudence and for the sake of parsimony, the variable TEAM_BATTING_HR, which has the largest intra-predictor correlations (most notably \(r \approx 0.97\) with TEAM_PITCHING_HR) and one of the lowest correlations with \(Y\), will be excluded from the model as well.
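
As an informal summary of this rule (not part of the original selection), the average absolute pairwise correlation of each predictor can be computed from the corr_XX table; TEAM_BATTING_HR comes out highest:

vars <- c(corr_XX$X1, corr_XX$X2)                        # each pair counts toward both of its variables
avg_abs_r <- tapply(rep(abs(corr_XX$r), 2), vars, mean)  # mean absolute pairwise correlation
sort(avg_abs_r, decreasing = TRUE)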

Dependence

depend <- data.frame(array(NA, dim = c(ncol(X), 5)))
colnames(depend) <- c("Y", "X", "chi-stat", "p", "<0.05")
for (i in 1:ncol(X)) {
  chi <- chisq.test(table(Y, X[,i]), simulate.p.value = TRUE)
  depend[i, 1] <- "TARGET_WINS"
  depend[i, 2] <- names(X[i])
  depend[i, 3] <- chi$statistic
  depend[i, 4] <- chi$p.value
  depend[i, 5] <- depend[i, 4] < 0.05
}; depend
##             Y                X  chi-stat            p <0.05
## 1 TARGET_WINS  TEAM_BATTING_2B  42266.80 0.0004997501  TRUE
## 2 TARGET_WINS  TEAM_BATTING_BB  93652.46 0.0004997501  TRUE
## 3 TARGET_WINS TEAM_PITCHING_HR  25278.79 0.9535232384 FALSE
## 4 TARGET_WINS  TEAM_FIELDING_E 115963.15 0.0004997501  TRUE
X <- X[, -3] # Insignificant Dependence

All but one of the \(X\) variables show evidence of statistically significant dependence with \(Y\). Listed in order of least to most dependent, they are TEAM_BATTING_2B, TEAM_BATTING_BB, and TEAM_FIELDING_E. The variable TEAM_PITCHING_HR, which does not show statistically significant dependence, will not be included in the model. It is worth noting that the ranking uses the \({ \chi }^{ 2 }\) test statistic instead of the \(p\)-value, since the function simulated \(p\)-values (very small expected cell counts could otherwise produce incorrect approximations of \(p\)). The simulation led to duplicate \(p\)-values in some cases, but the \({ \chi }^{ 2 }\) test statistics remained unique.
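
The ranking can be read directly from the depend table by sorting on the test statistic:

depend[order(depend[ , "chi-stat"]), c("X", "chi-stat", "<0.05")]   # least to most dependent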

Visual Analysis

par(mfrow = c(2,4))
Z <- cbind(Y, X)
colnames(Z) <- c("TARGET_WINS", colnames(X))
pairs(Z)

for (i in 1:ncol(Z)) {
  variable <- ifelse(i==1, names(Z[i]), substr(names(Z[i]), 6, nchar(names(Z[i]))))
  hist(Z[ ,i], xlab = variable, main = variable)
}
for (i in 1:ncol(Z)) {
  variable <- ifelse(i==1, names(Z[i]), substr(names(Z[i]), 6, nchar(names(Z[i]))))
  d <- density(Z[,i], na.rm = TRUE)
  plot(d, main = variable)
  polygon(d, col="red")
}

The remaining significantly correlated and significantly dependent variables appear to have Gaussian (TEAM_BATTING_2B, TEAM_BATTING_BB) and Exponential (TEAM_FIELDING_E) distributions.

Transformation

library(MASS)
lambda <- fitdistr(X[,3], "Exponential")$estimate
par(mfrow = c(2,2))
for (i in 1:2) {
  if (i != 1) {
    X[,3] = log(X[,3], base = exp(lambda))
    colnames(X) <- c(names(X[1]), names(X[2]), 
      paste0(substr(names(X[3]), 1, 5), "log", substr(names(X[3]), 6, 15)))
  }
  variable <- substr(names(X[3]), 6, nchar(names(X[3])))
  hist(X[,3], xlab = variable, main = variable)
  d <- density(X[,3], na.rm = TRUE)
  plot(d, main = variable)
  polygon(d, col="red")    
};

Z <- cbind(Y, X)
colnames(Z) <- c("TARGET_WINS", colnames(X))

The exponentially distributed variable TEAM_FIELDING_E was transformed into TEAM_logFIELDING_E using the log() function, with base \(e^{\lambda}\) where \(\lambda\) is the rate estimated by the fitdistr() function from the MASS library.
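
Note that taking logarithms with base \(e^{\lambda}\) simply rescales the natural logarithm by the fitted rate, since \(\log_{e^{\lambda}}(x) = \ln(x)/\lambda\). A one-line check with an arbitrary positive value (not part of the original analysis):

x <- 100                                         # arbitrary positive value
c(log(x, base = exp(lambda)), log(x) / lambda)   # identical results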

Build Models

Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Indicate why variables were selected for inclusion into the model or excluded from the model. Discuss the coefficients in the models. For example, if a team hits a lot of Home Runs, it would be reasonably expected that such a team would win more games. However, if the coefficient is negative (suggesting that the team would lose more games), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why?

Model 1

\[\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 log(X_3) + \varepsilon_i \tag{1.1}\]

Model 1 was selected through all of the preceding steps. First, variables with a large amount of missing values were removed. Next, variables with statistically insignificant or a relatively low correlation with \(Y\) were removed. Then, variables with statistically significant or a relatively high correlation to other \(X\) variables were removed. Finally, variables that did not show statistically significant dependence with \(Y\) were removed.

model_1 <- lm(TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_BB + TEAM_logFIELDING_E, Z)

\[\hat{WINS_i} = 40.28 + 0.086 \cdot BATTING2B_i + 0.026 \cdot BATTINGBB_i + 0.005 \cdot logFIELDINGE_i + \varepsilon_i \tag{1.2}\]

round(cbind(summary(model_1)$coefficients, confint(model_1, level = 0.99)), 5)
##                    Estimate Std. Error  t value Pr(>|t|)    0.5 %   99.5 %
## (Intercept)        40.28013    5.09209  7.91034  0.00000 27.15276 53.40751
## TEAM_BATTING_2B     0.08576    0.00705 12.17223  0.00000  0.06759  0.10392
## TEAM_BATTING_BB     0.02555    0.00321  7.95533  0.00000  0.01727  0.03384
## TEAM_logFIELDING_E  0.00540    0.00266  2.03183  0.04229 -0.00145  0.01224

The transformed variable TEAM_logFIELDING_E turned out to have a positive coefficient, implying that more errors increase wins. This is not only highly counterintuitive, it is also not statistically significant at the \(\alpha=0.01\) level. The insignificance can also be seen in the confidence interval, which includes 0, meaning the variable may have no effect at all. The variables TEAM_BATTING_2B and TEAM_BATTING_BB have statistically significant positive relationships with wins: an increased number of doubles or walks coincides with more wins. This inference is sound.

Model 2

\[\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 log(X_2) + \varepsilon_i \tag{2.1}\]

Model 2 is a subset of Model 1 that includes only the two variables showing the most dependence with \(Y\).

model_2 <- lm(TARGET_WINS ~ TEAM_BATTING_BB + TEAM_logFIELDING_E, Z) # Most Dependence

\[\hat{WINS_i} = 68.346 + 0.029 \cdot BATTINGBB_i - 0.002 \cdot logFIELDINGE_i + \varepsilon_i \tag{2.2}\]

round(cbind(summary(model_2)$coefficients, confint(model_2, level = 0.99)), 5)
##                    Estimate Std. Error  t value Pr(>|t|)    0.5 %   99.5 %
## (Intercept)        68.34562    4.68484 14.58869  0.00000 56.26814 80.42310
## TEAM_BATTING_BB     0.02873    0.00330  8.69527  0.00000  0.02021  0.03724
## TEAM_logFIELDING_E -0.00151    0.00268 -0.56468  0.57235 -0.00841  0.00539

Model 2 has the same flaw as Model 1: the coefficient estimate for TEAM_logFIELDING_E is unsound and insignificant, and removing TEAM_BATTING_2B did not mitigate this. Removing TEAM_BATTING_2B also slightly widened the confidence interval of the TEAM_BATTING_BB coefficient estimate and substantially increased the estimate of the intercept.

Model 3

\[\hat{Y} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon_i \tag{3.1}\]

Model 3 is a subset of Model 1 that includes only the two variables most correlated to \(Y\).

model_3 <- lm(TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_BB, Z) # High Correlations

\[\hat{WINS_i} = 49.909 + 0.083 \cdot BATTING2B_i + 0.022 \cdot BATTINGBB_i + \varepsilon_i \tag{3.2}\]

round(cbind(summary(model_3)$coefficients, confint(model_3, level = 0.99)), 5)
##                 Estimate Std. Error  t value Pr(>|t|)    0.5 %   99.5 %
## (Intercept)     49.90930    1.86402 26.77511        0 45.10387 54.71473
## TEAM_BATTING_2B  0.08270    0.00689 12.00733        0  0.06494  0.10045
## TEAM_BATTING_BB  0.02179    0.00263  8.29460        0  0.01502  0.02857

Model 3 is an improvement over Model 1. Removing the variable TEAM_logFIELDING_E shrank the confidence intervals of all the \(\beta_i\) estimates. The two significantly correlated and significantly dependent variables TEAM_BATTING_2B and TEAM_BATTING_BB have statistically significant positive relationships with wins: hitting more doubles and drawing more walks coincides with winning more games, which is intuitive.

Select Models

Model Choice

The best multiple linear regression model should require little or no imputation of missing values, show statistically significant correlation between the \(X\) variables and \(Y\), show no statistically significant correlation among the \(X\) variables themselves, show statistically significant dependence between the \(X\) variables and \(Y\), and have statistically significant regression coefficients. Based on these criteria, Model 3 is the best multiple linear regression model. A model that performed worse but was more intuitive or more parsimonious could still be chosen if a customer expressed an unbending preference for it, regardless of the evidence presented.

\[\hat{WINS_i} = 49.909 + 0.083 \cdot BATTING2B_i + 0.022 \cdot BATTINGBB_i + \varepsilon_i \tag{3.2}\]
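
For reference, adjusted \(R^2\) and AIC for the three candidate models can be tabulated with the short sketch below; these metrics were not part of the selection criteria above and are shown only for completeness:

data.frame(model = c("Model 1", "Model 2", "Model 3"),
           adj.r.squared = c(summary(model_1)$adj.r.squared,
                             summary(model_2)$adj.r.squared,
                             summary(model_3)$adj.r.squared),
           AIC = AIC(model_1, model_2, model_3)$AIC)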

Evaluation

Regression Output

(model <- summary(model_3))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B + TEAM_BATTING_BB, 
##     data = Z)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.073  -9.701   0.194   9.833  68.068 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     49.909298   1.864018  26.775   <2e-16 ***
## TEAM_BATTING_2B  0.082696   0.006887  12.007   <2e-16 ***
## TEAM_BATTING_BB  0.021795   0.002628   8.295   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.86 on 2273 degrees of freedom
## Multiple R-squared:  0.1105, Adjusted R-squared:  0.1097 
## F-statistic: 141.2 on 2 and 2273 DF,  p-value: < 2.2e-16

\[\hat{WINS_i} = 49.909 + 0.083 \cdot BATTING2B_i + 0.022 \cdot BATTINGBB_i + \varepsilon_i \tag{3.2}\]

This model provides the \(b_i\) coefficient estimates for the intercept, TEAM_BATTING_2B, and TEAM_BATTING_BB. The confidence intervals of the \(b_i\) estimates of \(\beta_i\) can be used to make inferences not just about the population parameter \(\beta_i\), but also regarding intervals for predictions of \(\hat{y}\).
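
As a brief illustration (using the first few training rows rather than new data), both kinds of intervals can be obtained directly from predict(); the 0.99 level mirrors the confidence level used above:

predict(model_3, newdata = Z[1:5, ], interval = "confidence", level = 0.99)  # intervals for the mean response
predict(model_3, newdata = Z[1:5, ], interval = "prediction", level = 0.99)  # intervals for a new observation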

Autocorrelation

library(lmtest)
dwtest(model_3)
## 
##  Durbin-Watson test
## 
## data:  model_3
## DW = 0.98102, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0

The Durbin-Watson test checks for autocorrelation in the residuals; the null hypothesis is that there is no autocorrelation. Since the \(p\)-value is small, we reject the null hypothesis and conclude that the data support the existence of first-order (lag one) autocorrelation. This will not be addressed in this analysis, but it is worth noting because autocorrelated residuals can lead to underestimates of the standard errors and make predictors appear significant when they are not.

Mean Squared Error and RMSE

data.frame("MSE" = model$sigma^2, "RMSE" = model$sigma)
##       MSE     RMSE
## 1 220.905 14.86287

The Mean Squared Error is the square of the RMSE. The benefit of using the RMSE is that it is expressed in the same units as the target variable. For this model, the residual standard error (RMSE) is fairly large relative to the target variable: the standard deviation of the unexplained variation in TARGET_WINS is 14.8629, a large deviation relative to the 80.7909 average wins observed in the data. This model can likely be improved upon by adding, combining, or transforming variables that were filtered out earlier in this analysis.
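
The reported RMSE can be reproduced from the residual sum of squares and the residual degrees of freedom; a short check:

sqrt(sum(residuals(model_3)^2) / df.residual(model_3))   # reproduces model$sigma (the RMSE)
mean(training$TARGET_WINS)                               # average wins, for scale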

\(R^2\) and Adjusted \(R^2\)

data.frame("r.squared" = model$r.squared, "adj.r.squared" = model$adj.r.squared)
##   r.squared adj.r.squared
## 1 0.1105046      0.109722

\(R^2\) represents the proportion of the variation in \(Y\) explained by the predictor variables, and it is fairly low for this model. \(R^2\) alone, however, is not an adequate performance measure when a model has multiple variables. Adjusted \(R^2\) is more appropriate because it incorporates a penalty for the loss of degrees of freedom from additional variables. The penalty does not improve the evaluation here: Adjusted \(R^2\) is even lower than the already low \(R^2\). Again, it is very likely that this model can be improved upon.
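
Adjusted \(R^2\) applies the degrees-of-freedom penalty explicitly; a short check of the reported value:

n <- length(model$residuals)                        # number of observations
p <- 2                                              # number of predictors in Model 3
1 - (1 - model$r.squared) * (n - 1) / (n - p - 1)   # matches model$adj.r.squared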

\(F\)-statistic

model$fstatistic
##     value     numdf     dendf 
##  141.1907    2.0000 2273.0000

The \(F\)-test evaluates the null hypothesis that all regression coefficients are equal to zero against the alternative that at least one is not. With an \(F\)-statistic of 141.19 on 2 and 2273 degrees of freedom and a \(p\)-value below \(2.2 \times 10^{-16}\), the model is significant at \(\alpha=0.01\), which indicates that the effect of the model is not “a spurious result of oddities in the data set.”

Residuals

par(mfrow = c(2,2)); plot(model_3)

The Residuals vs Fitted plot shows the residuals spread fairly evenly around a horizontal line, which suggests no obvious non-linear pattern. The Normal Q-Q plot suggests the residuals are approximately normally distributed. The Scale-Location plot appears to show some heteroscedasticity, since the line is not horizontal and the points are not equally (randomly) spread. The Residuals vs Leverage plot does not show any extreme values outside the Cook's distance (dashed curve) that unduly influence the (solid) regression line.

shapiro.test(model$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model$residuals
## W = 0.9972, p-value = 0.000384

The null hypothesis is that the residuals are normally distributed. Since the \(p\)-value is small, we reject this hypothesis: the residuals that looked normal in the plot above are found not to be normal. This contradiction is an important finding, not just for the model, but also as a demonstration of how visual analysis alone can lead to false conclusions.

Predictions

evaluation <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/",
                              "SPS/master/DATA%20621/moneyball-evaluation-data.csv"))
predict(model_3, evaluation)
##        1        2        3        4        5        6        7        8 
## 73.70991 73.64252 76.13624 86.05468 68.76713 74.11148 80.39919 70.73423 
##        9       10       11       12       13       14       15       16 
## 74.70289 77.29213 80.79288 79.74287 79.48768 80.05758 76.60552 75.24590 
##       17       18       19       20       21       22       23       24 
## 77.48055 85.70906 71.65348 88.02332 87.99179 87.68280 85.31151 80.13255 
##       25       26       27       28       29       30       31       32 
## 82.20969 84.05652 66.85167 80.04939 90.02457 84.79507 92.02072 85.14101 
##       33       34       35       36       37       38       39       40 
## 88.98724 89.32375 83.00513 83.69483 81.57363 82.98333 83.10637 82.22500 
##       41       42       43       44       45       46       47       48 
## 89.59254 89.12326 60.47386 83.13946 77.78491 82.43259 78.98764 77.49987 
##       49       50       51       52       53       54       55       56 
## 76.30364 79.15055 81.13402 85.07640 80.54928 75.74146 80.93138 77.27358 
##       57       58       59       60       61       62       63       64 
## 79.28055 75.24018 69.30460 74.74137 81.51536 84.02161 84.06410 84.27061 
##       65       66       67       68       69       70       71       72 
## 80.31216 82.02436 74.80876 75.43447 74.83041 79.04267 81.41473 80.03903 
##       73       74       75       76       77       78       79       80 
## 81.86655 79.45429 81.40715 78.76397 82.39921 81.21920 74.63054 74.94850 
##       81       82       83       84       85       86       87       88 
## 91.56705 88.88815 92.79404 85.14240 82.63060 79.02488 77.31130 79.90626 
##       89       90       91       92       93       94       95       96 
## 82.70433 91.26053 74.53935 82.36364 73.27850 73.27989 77.56974 76.97541 
##       97       98       99      100      101      102      103      104 
## 83.54212 86.82060 88.11808 88.19181 85.45015 80.42423 83.99748 80.23952 
##      105      106      107      108      109      110      111      112 
## 83.06216 76.54167 70.07081 87.02911 87.08089 77.35442 84.53911 85.98543 
##      113      114      115      116      117      118      119      120 
## 82.97823 83.93797 80.87248 87.12587 86.11296 84.89817 83.04948 75.25811 
##      121      122      123      124      125      126      127      128 
## 77.41192 73.28298 70.74768 66.91228 74.34134 78.11599 81.91477 75.44468 
##      129      130      131      132      133      134      135      136 
## 86.27387 87.31754 87.69439 85.93660 79.51272 79.17482 80.28063 73.99540 
##      137      138      139      140      141      142      143      144 
## 79.66142 78.65824 83.53979 81.85248 71.66831 70.11301 88.83993 84.29364 
##      145      146      147      148      149      150      151      152 
## 82.80310 80.67294 83.23652 84.49165 82.54357 84.17060 82.08140 84.83665 
##      153      154      155      156      157      158      159      160 
## 53.87485 73.72861 78.61836 73.63556 88.54191 72.31319 77.72292 72.08613 
##      161      162      163      164      165      166      167      168 
## 90.54827 91.40074 87.13531 93.67975 90.72896 86.01667 80.15621 80.07103 
##      169      170      171      172      173      174      175      176 
## 76.41647 84.82893 79.12227 81.73964 85.37828 87.02834 86.48285 82.70820 
##      177      178      179      180      181      182      183      184 
## 83.99919 78.69534 80.11972 80.72287 80.18248 84.58270 83.30701 86.25455 
##      185      186      187      188      189      190      191      192 
## 79.70144 81.66869 88.14049 79.71365 75.99093 90.18362 73.63881 75.80298 
##      193      194      195      196      197      198      199      200 
## 76.46021 79.92820 85.06495 77.31454 79.76467 83.46715 81.34085 83.72559 
##      201      202      203      204      205      206      207      208 
## 77.77671 79.54858 79.89281 82.85380 81.19802 81.14145 86.57065 85.20268 
##      209      210      211      212      213      214      215      216 
## 82.38653 75.78814 84.60712 76.63380 79.88508 76.75870 78.97357 81.55508 
##      217      218      219      220      221      222      223      224 
## 79.01978 83.22956 79.05441 82.20954 83.91417 80.88531 88.66433 81.26479 
##      225      226      227      228      229      230      231      232 
## 72.25831 77.65615 80.22236 86.10199 85.23144 75.61779 75.78490 81.49804 
##      233      234      235      236      237      238      239      240 
## 82.07353 80.19284 81.06076 77.60746 80.35436 77.95432 79.08826 74.20933 
##      241      242      243      244      245      246      247      248 
## 83.36543 86.19241 83.96734 84.54760 75.91056 82.35066 77.98322 83.92189 
##      249      250      251      252      253      254      255      256 
## 77.76651 83.88665 86.50851 70.21656 82.09022 60.17879 74.02911 80.76894 
##      257      258      259 
## 81.74938 84.05004 87.09712

Conclusion

As an exercise in manual model selection and evaluation, this analysis was very beneficial. After a great deal of work, the model was found to have unremarkable performance metrics and multiple violations of the linear model assumptions regarding the residuals. Predictions from this model serve only as a heuristic example of how such an exercise should be performed; the output is in all likelihood unreliable given the residual autocorrelation and the non-normal, heteroscedastic residuals.

References

Linear Models with R, Faraway, 2005.

A Modern Approach to Regression with R, Sheather, 2009.

https://rpubs.com/josezuniga/234598

http://data.library.virginia.edu/diagnostic-plots/

http://onlinestatbook.com/2/estimation/confidence.html

http://www.statisticshowto.com/durbin-watson-test-coefficient/

http://www.theanalysisfactor.com/assessing-the-fit-of-regression-models/