In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season. Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).
First, we need to explore our given data set. I have published the original data sets in my github account
Here, we read the dataset and shorten the feature names for better readibility in visualizations.
df <- read.csv("https://raw.githubusercontent.com/mkivenson/Business-Analytics-Data-Mining/master/Moneyball%20Regression/moneyball-training-data.csv")[-1]
names(df) <- sub("TEAM_", "", names(df))
names(df) <- sub("BATTING_", "bt_", names(df))
names(df) <- sub("BASERUN_", "br_", names(df))
names(df) <- sub("FIELDING_", "fd_", names(df))
names(df) <- sub("PITCHING_", "ph_", names(df))
names(df) <- sub("TARGET_", "", names(df))
head(df)First, we take a look at a summary of the data. A few things of interest are revealed:
## WINS bt_H bt_2B bt_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
##
## bt_HR bt_BB bt_SO br_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0
## Median :102.00 Median :512.0 Median : 750.0 Median :101.0
## Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## NA's :102 NA's :131
## br_CS bt_HBP ph_H ph_HR
## Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0
## 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0
## Median : 49.0 Median :58.00 Median : 1518 Median :107.0
## Mean : 52.8 Mean :59.36 Mean : 1779 Mean :105.7
## 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0
## Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0
## NA's :772 NA's :2085
## ph_BB ph_SO fd_E fd_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
## Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
## Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :146.4
## 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
## NA's :102 NA's :286
Let’s see the dimensions of our moneyball training data set.
## [1] 2276 16
The training data has 17 columns and 2,276 rows.
The explanatory columns are broken down into four categories:
Below you will see a preview of the columns and the first few observations broken down into these four categories.
Next, we create histograms of each of the features and target variable.
grid.arrange(ggplot(df, aes(bt_H)) + geom_histogram(binwidth = 30),
ggplot(df, aes(bt_2B)) + geom_histogram(binwidth = 10),
ggplot(df, aes(bt_3B)) + geom_histogram(binwidth = 10),
ggplot(df, aes(bt_HR)) + geom_histogram(binwidth = 10),
ggplot(df, aes(bt_BB)) + geom_histogram(binwidth = 30),
ggplot(df, aes(bt_SO)) + geom_histogram(binwidth = 50),
ggplot(df, aes(br_SB)) + geom_histogram(binwidth = 30),
ggplot(df, aes(br_CS)) + geom_histogram(binwidth = 10),
ggplot(df, aes(bt_HBP)) + geom_histogram(binwidth = 3),
ggplot(df, aes(ph_H)) + geom_histogram(binwidth = 100),
ggplot(df, aes(ph_HR)) + geom_histogram(binwidth = 10),
ggplot(df, aes(ph_BB)) + geom_histogram(binwidth = 100),
ggplot(df, aes(ph_SO)) + geom_histogram(binwidth = 30),
ggplot(df, aes(fd_E)) + geom_histogram(binwidth = 30),
ggplot(df, aes(fd_DP)) + geom_histogram(binwidth = 10),
ggplot(df, aes(WINS)) + geom_histogram(binwidth = 5),
ncol=4)grid.arrange(ggplot(df, aes(x = "bt_H", y = bt_H))+geom_boxplot(),
ggplot(df, aes(x = "bt_2B", y = bt_2B))+geom_boxplot(),
ggplot(df, aes(x = "bt_3B", y = bt_3B))+geom_boxplot(),
ggplot(df, aes(x = "bt_HR", y = bt_HR))+geom_boxplot(),
ggplot(df, aes(x = "bt_BB", y = bt_BB))+geom_boxplot(),
ggplot(df, aes(x = "bt_SO", y = bt_SO))+geom_boxplot(),
ggplot(df, aes(x = "br_SB", y = br_SB))+geom_boxplot(),
ggplot(df, aes(x = "br_CS", y = br_CS))+geom_boxplot(),
ggplot(df, aes(x = "bt_HBP", y = bt_HBP))+geom_boxplot(),
ggplot(df, aes(x = "ph_H", y = ph_H))+geom_boxplot(),
ggplot(df, aes(x = "ph_HR", y = ph_HR))+geom_boxplot(),
ggplot(df, aes(x = "ph_BB", y = ph_BB))+geom_boxplot(),
ggplot(df, aes(x = "ph_SO", y = ph_SO))+geom_boxplot(),
ggplot(df, aes(x = "fd_E", y = fd_E))+geom_boxplot(),
ggplot(df, aes(x = "fd_DP", y = fd_DP))+geom_boxplot(),
ggplot(df, aes(x = "WINS", y = WINS))+geom_boxplot(),
ncol=4)corrplot(cor(df, use = "complete.obs"), method ="color", type="lower", addrect = 1, number.cex = 0.5, sig.level = 0.30,
addCoef.col = "black", # Add coefficient of correlation
tl.srt = 25, # Text label color and rotation
tl.cex = 0.7,
diag = TRUE)Here, we see a scatter plot of each of the feature variables with the target variable.
grid.arrange(ggplot(df, aes(bt_H, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ggplot(df, aes(bt_2B, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ggplot(df, aes(bt_3B, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ggplot(df, aes(bt_HR, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ggplot(df, aes(bt_BB, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ggplot(df, aes(bt_SO, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ggplot(df, aes(br_SB, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ggplot(df, aes(br_CS, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ggplot(df, aes(bt_HBP, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ggplot(df, aes(ph_H, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ggplot(df, aes(ph_HR, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ggplot(df, aes(ph_BB, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ggplot(df, aes(ph_SO, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ggplot(df, aes(fd_E, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ggplot(df, aes(fd_DP, WINS)) + geom_point(alpha = 1/10)+geom_smooth(method=lm),
ncol=4)While exploring the data, we noticed that the max values of ph_H, ph_BB, ph_SO, and fd_E seem abnormally high.
We see that the record for most hits in a season by team (ph_H) was set at 1,724 in 1921. However, we also know that the datapoints were normalized for 162 games in a season. To take a moderate approach, we will remove the some of the most egggregious outliers that are seen in these variables.
grid.arrange(ggplot(df, aes(x = "ph_H", y = ph_H))+geom_boxplot(),
ggplot(df, aes(x = "ph_BB", y = ph_BB))+geom_boxplot(),
ggplot(df, aes(x = "ph_SO", y = ph_SO))+geom_boxplot(),
ggplot(df, aes(x = "fd_E", y = fd_E))+geom_boxplot(),
ncol=4)We will also remove influencial outliers using Cooks distance.
mod <- lm(WINS ~ ., data=df)
cooksd <- cooks.distance(mod)
plot(cooksd, pch="*", cex=2, main="Influential Outliers by Cooks distance")
abline(h = 4*mean(cooksd, na.rm=T), col="red") # add cutoff line
text(x=1:length(cooksd)+1, y=cooksd, labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),names(cooksd),""), col="red") # add labelsWe remove the influencial outliers.
The following features have missing values.
Since most values in bt_HBP are missing (90%), we will drop this feature.
We will use Multivariable Imputation by Chained Equations (mice) to fill the missing variables.
We will begin with all independent variables and use the back elimination method to eliminate the non-significant ones.
We will start by eliminating the variables with high p-values and lowest significance from the model
Let’s take a look at the resulting model:
be_lm2 <- lm(WINS ~ bt_H + bt_BB + br_SB + br_CS, data =df)
sum_lm2<-summary(be_lm2)
par(mfrow=c(2,2))
plot(be_lm2)###Linear Model 2.
This Linear Model will be built using the variables we believe would have the highest corelation with WINs.
THe following variables will be used: - Base Hits by batters (1B,2B,3B,HR) - Walks by batters - Stolen bases - Strikeouts by batters
Let’s remove the two variables with low significance:
be_lm3 <- lm(WINS ~ bt_H + bt_2B + bt_3B + bt_HR + bt_BB + bt_SO, data =df)
sum_lm3<-summary(be_lm3)
par(mfrow=c(2,2))
plot(be_lm3)be_lm4 <- lm(WINS ~
I(bt_H + bt_BB
- ph_H - ph_BB) +
I(bt_HR - ph_HR) +
I(bt_SO - ph_SO) +
I(br_SB - br_CS) +
fd_E + fd_DP , df)
sum_lm4<-summary(be_lm4)
par(mfrow=c(2,2))
plot(be_lm4)# list of models and model summaries
models <- list(be_lm1, be_lm2,be_lm3,be_lm4)
modsums <- list(sum_lm1, sum_lm2, sum_lm3, sum_lm4)
nmod <- length(modsums)
# storage variables
nvar <- integer(nmod)
sigma <- numeric(nmod)
rsq <- numeric(nmod)
adj_rsq <- numeric(nmod)
fstat <- numeric(nmod)
fstat_p <- numeric(nmod)
mse <- numeric(nmod)
rmse <- numeric(nmod)
# loop through model summaries
for (j in 1:nmod) {
nvar[j] <- modsums[[j]]$df[1]
sigma[j] <- modsums[[j]]$sigma
rsq[j] <- modsums[[j]]$r.squared
adj_rsq[j] <- modsums[[j]]$adj.r.squared
fstat[j] <- modsums[[j]]$fstatistic[1]
fstat_p[j] <- 1 - pf(modsums[[j]]$fstatistic[1], modsums[[j]]$fstatistic[2],
modsums[[j]]$fstatistic[3])
mse[j] <- mean(modsums[[j]]$residuals^2)
rmse[j] <- sqrt(mse[j])
}
modnames <- paste0("lm", c(1:nmod))
# evaluation dataframe
eval <- data.frame(Model = modnames,
N_Vars = nvar,
Sigma = sigma,
R_Sq = rsq,
Adj_R_Sq = adj_rsq,
F_Stat = fstat,
F_P_Val = fstat_p,
MSE = mse,
RMSE = rmse)
kable(eval, digits = 3, align = 'c')| Model | N_Vars | Sigma | R_Sq | Adj_R_Sq | F_Stat | F_P_Val | MSE | RMSE |
|---|---|---|---|---|---|---|---|---|
| lm1 | 16 | 12.540 | 0.363 | 0.359 | 85.935 | 0 | 156.134 | 12.495 |
| lm2 | 5 | 13.714 | 0.243 | 0.242 | 182.565 | 0 | 187.659 | 13.699 |
| lm3 | 7 | 13.785 | 0.236 | 0.234 | 116.894 | 0 | 189.444 | 13.764 |
| lm4 | 7 | 14.774 | 0.123 | 0.120 | 52.855 | 0 | 217.602 | 14.751 |