library(corrplot)
library(psych)
library(ggplot2)
require(gridExtra)
library(car)
library(mice)
library(VIM)
library(caret)
library(dplyr)
library(MASS)
Here, we read the dataset and shorten the feature names for better readibility in visualizations.
df <- read.csv("https://raw.githubusercontent.com/mkivenson/Business-Analytics-Data-Mining/master/Moneyball%20Regression/moneyball-training-data.csv")[-1]
names(df) <- sub("TEAM_", "", names(df))
names(df) <- sub("BATTING_", "bt_", names(df))
names(df) <- sub("BASERUN_", "br_", names(df))
names(df) <- sub("FIELDING_", "fd_", names(df))
names(df) <- sub("PITCHING_", "ph_", names(df))
names(df) <- sub("TARGET_", "", names(df))
head(df)
## WINS bt_H bt_2B bt_3B bt_HR bt_BB bt_SO br_SB br_CS bt_HBP ph_H ph_HR ph_BB
## 1 39 1445 194 39 13 143 842 NA NA NA 9364 84 927
## 2 70 1339 219 22 190 685 1075 37 28 NA 1347 191 689
## 3 86 1377 232 35 137 602 917 46 27 NA 1377 137 602
## 4 70 1387 209 38 96 451 922 43 30 NA 1396 97 454
## 5 82 1297 186 27 102 472 920 49 39 NA 1297 102 472
## 6 75 1279 200 36 92 443 973 107 59 NA 1279 92 443
## ph_SO fd_E fd_DP
## 1 5456 1011 NA
## 2 1082 193 155
## 3 917 175 153
## 4 928 164 156
## 5 920 138 168
## 6 973 123 149
The following features have missing values.
Since most values in bt_HBP are missing (90%), we will drop this feature. Features br_CS and fd_DP also have many missing values, and their correlation with WINS is also weak. They will be dropped from this model as well.
df <- subset(df, select = -c(bt_HBP, br_CS, fd_DP))
We will use Multivariable Imputation by Chained Equations (mice) to fill the missing variables.
We make a few adjustments to the featurs.
df2 <- df
#feature modification
df2$bt_1B <- df$bt_H - df$bt_2B - df$bt_3B - df$bt_HR
df2 <- subset(df2, select = -c(bt_H))
#resolve skewness or wide distribution
df2$bt_3B <- log(df2$bt_3B + 1)
df2$br_SB <- log(df2$br_SB + 1)
df2$ph_SO <- log(df2$ph_SO + 1)
df2$bt_BB <- log(df2$bt_BB + 1)
df2$ph_H <- log(df2$ph_H + 1)
df2$ph_BB <- log(df2$ph_BB + 1)
df2$ph_HR <- log(df2$ph_HR + 1)
df2$fd_E <- log(df2$fd_E + 1)
df2$bt_HR <- log(df2$bt_HR + 1)
df2$bt_SO <- log(df2$bt_SO + 1)
df2$bt_2B <- log(df2$bt_2B + 1)
#drop features with low correlation
df2 <- subset(df2, select = -c(bt_3B, ph_SO))
One of the significant aspects of this model is a spacial sign tranformation. Our data has very dramatic outliers, and it is difficult to detect them. Our linear regression model is highly sensitive to outliers, and a spacial sign transformaton will address that. This tranformation will project feature values onto a unit cirle, making all of the samples will be the same distance from the center of the circle This resolves the effect outliers have on our data.
In the previous step, we normalized and centered all of the features - this is a very important step that must be done before a spacial sign tranformation. We can now apply the transformation.
df2 = as.data.frame(spatialSign(df2))
# Multiple Linear Regression
fit <- lm(WINS ~ ., data = df2)
summary(fit) # show results
##
## Call:
## lm(formula = WINS ~ ., data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0199976 -0.0005742 0.0005018 0.0011246 0.0168999
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.15494 0.04752 276.837 < 2e-16 ***
## bt_2B 1.45361 0.33426 4.349 1.43e-05 ***
## bt_HR -1.68837 0.50246 -3.360 0.000792 ***
## bt_BB 1.69469 0.44595 3.800 0.000148 ***
## bt_SO 0.43842 0.11689 3.751 0.000181 ***
## br_SB 0.60995 0.08805 6.927 5.58e-12 ***
## ph_H -3.08577 0.29710 -10.386 < 2e-16 ***
## ph_HR 1.61429 0.45543 3.545 0.000401 ***
## ph_BB -0.89009 0.42295 -2.104 0.035446 *
## fd_E -0.85404 0.17838 -4.788 1.80e-06 ***
## bt_1B -13.11171 0.04753 -275.876 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00211 on 2265 degrees of freedom
## Multiple R-squared: 0.9803, Adjusted R-squared: 0.9802
## F-statistic: 1.126e+04 on 10 and 2265 DF, p-value: < 2.2e-16