Model Moneyball

library(corrplot)
library(psych)
library(ggplot2)
require(gridExtra)
library(car)
library(mice)
library(VIM)
library(caret)
library(dplyr)
library(MASS)

Read Data

Here, we read the dataset and shorten the feature names for better readibility in visualizations.

df <- read.csv("https://raw.githubusercontent.com/mkivenson/Business-Analytics-Data-Mining/master/Moneyball%20Regression/moneyball-training-data.csv")[-1]
names(df) <- sub("TEAM_", "", names(df))
names(df) <- sub("BATTING_", "bt_", names(df))
names(df) <- sub("BASERUN_", "br_", names(df))
names(df) <- sub("FIELDING_", "fd_", names(df))
names(df) <- sub("PITCHING_", "ph_", names(df))
names(df) <- sub("TARGET_", "", names(df))
head(df)

##   WINS bt_H bt_2B bt_3B bt_HR bt_BB bt_SO br_SB br_CS bt_HBP ph_H ph_HR ph_BB
## 1   39 1445   194    39    13   143   842    NA    NA     NA 9364    84   927
## 2   70 1339   219    22   190   685  1075    37    28     NA 1347   191   689
## 3   86 1377   232    35   137   602   917    46    27     NA 1377   137   602
## 4   70 1387   209    38    96   451   922    43    30     NA 1396    97   454
## 5   82 1297   186    27   102   472   920    49    39     NA 1297   102   472
## 6   75 1279   200    36    92   443   973   107    59     NA 1279    92   443
##   ph_SO fd_E fd_DP
## 1  5456 1011    NA
## 2  1082  193   155
## 3   917  175   153
## 4   928  164   156
## 5   920  138   168
## 6   973  123   149

Fill Missing Values

The following features have missing values.

bt_SO - Strikeouts by batters
br_SB - Stolen bases
br_CS - Caught stealing
bt_HBP - Batters hit by pitch (get a free base)
ph_SO - Strikeouts by pitchers
fd_DP - Double Plays

Since most values in bt_HBP are missing (90%), we will drop this feature. Features br_CS and fd_DP also have many missing values, and their correlation with WINS is also weak. They will be dropped from this model as well.

df <- subset(df, select = -c(bt_HBP, br_CS, fd_DP))

Multivariate Imputation by Chained Equations (mice)

We will use Multivariable Imputation by Chained Equations (mice) to fill the missing variables.

Model Build

Feature Engineering

We make a few adjustments to the featurs.

Replace bt_H (total base hits by batters) with BT_1B = bt_H - BT_2B - BT_3B - BT_HR (1B base hits)
Apply a log transform to all of the features that do not have a normal distribution. Many of the distributions are skewed or wide, so this will help normalize the features.
There are two features (bt_3B and ph_SO) that do not have a significant p-value when included in the linear regression model, so they will be dropped.

df2 <- df

#feature modification 
df2$bt_1B <- df$bt_H - df$bt_2B - df$bt_3B - df$bt_HR
df2 <- subset(df2, select = -c(bt_H))

#resolve skewness or wide distribution
df2$bt_3B <- log(df2$bt_3B + 1)
df2$br_SB <- log(df2$br_SB + 1)
df2$ph_SO <- log(df2$ph_SO + 1)
df2$bt_BB <- log(df2$bt_BB + 1)
df2$ph_H <- log(df2$ph_H + 1)
df2$ph_BB <- log(df2$ph_BB + 1)
df2$ph_HR <- log(df2$ph_HR + 1)
df2$fd_E <- log(df2$fd_E + 1)
df2$bt_HR <- log(df2$bt_HR + 1)
df2$bt_SO <- log(df2$bt_SO + 1)
df2$bt_2B <- log(df2$bt_2B + 1)


#drop features with low correlation
df2 <- subset(df2, select = -c(bt_3B, ph_SO))

Spatial Sign

One of the significant aspects of this model is a spacial sign tranformation. Our data has very dramatic outliers, and it is difficult to detect them. Our linear regression model is highly sensitive to outliers, and a spacial sign transformaton will address that. This tranformation will project feature values onto a unit cirle, making all of the samples will be the same distance from the center of the circle This resolves the effect outliers have on our data.

Spacial Sign Documentation

In the previous step, we normalized and centered all of the features - this is a very important step that must be done before a spacial sign tranformation. We can now apply the transformation.

df2 = as.data.frame(spatialSign(df2))

Multiple Linear Regression

# Multiple Linear Regression
fit <- lm(WINS ~  ., data = df2)
summary(fit) # show results

## 
## Call:
## lm(formula = WINS ~ ., data = df2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0199976 -0.0005742  0.0005018  0.0011246  0.0168999 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  13.15494    0.04752  276.837  < 2e-16 ***
## bt_2B         1.45361    0.33426    4.349 1.43e-05 ***
## bt_HR        -1.68837    0.50246   -3.360 0.000792 ***
## bt_BB         1.69469    0.44595    3.800 0.000148 ***
## bt_SO         0.43842    0.11689    3.751 0.000181 ***
## br_SB         0.60995    0.08805    6.927 5.58e-12 ***
## ph_H         -3.08577    0.29710  -10.386  < 2e-16 ***
## ph_HR         1.61429    0.45543    3.545 0.000401 ***
## ph_BB        -0.89009    0.42295   -2.104 0.035446 *  
## fd_E         -0.85404    0.17838   -4.788 1.80e-06 ***
## bt_1B       -13.11171    0.04753 -275.876  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.00211 on 2265 degrees of freedom
## Multiple R-squared:  0.9803, Adjusted R-squared:  0.9802 
## F-statistic: 1.126e+04 on 10 and 2265 DF,  p-value: < 2.2e-16