Predicting Baseball Wins using Multiple Linear Regression
DATA621 Homework 01
Assignment
In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
We have been given a dataset with 2276 records summarizing a major league baseball team's season. The records span 1871 to 2006 inclusive. All statistics have been adjusted to match the performance of a 162 game season.
Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).
DATA EXPLORATION
Before modeling it’s important you understand your data. The following cells use exploratory data methods to understand the distribution of the data. The techniques used will explore differences feature type, summary statistics, number of missing values, and a visual representation of the density of each variable.
# Load dataframes without index column
raw_mbed <- read.csv("https://raw.githubusercontent.com/Zchen116/data-621/master/moneyball-evaluation-data.csv")
raw_mbtd <- read.csv("https://raw.githubusercontent.com/Zchen116/data-621/master/moneyball-training-data.csv")
mbed <- raw_mbed[,-1]
mbtd <- raw_mbtd[,-1]
# Check variable types
sapply(mbtd, class)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## "integer" "integer" "integer" "integer"
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## "integer" "integer" "integer" "integer"
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## "integer" "integer" "integer" "integer"
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## "integer" "integer" "integer" "integer"
# Summarize variables
summary(mbtd)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
##
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0
## Median :102.00 Median :512.0 Median : 750.0 Median :101.0
## Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## NA's :102 NA's :131
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0
## 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0
## Median : 49.0 Median :58.00 Median : 1518 Median :107.0
## Mean : 52.8 Mean :59.36 Mean : 1779 Mean :105.7
## 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0
## Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0
## NA's :772 NA's :2085
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
## Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
## Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :146.4
## 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
## NA's :102 NA's :286
# Plot distributions
par(mfrow = c(4, 4))
datasub = melt(mbtd)
## Using as id variables
ggplot(datasub, aes(x= value)) +
geom_density(fill='red') + facet_wrap(~variable, scales = 'free')
# Show missing values
vis_miss(mbtd)
Let's go piece by piece through the information we uncovered.
Variable Type: We are dealing with only numerical types.
Summary: We have zero values which doesn't seem feasible within the context of the analysis.
Distributions: We have bimodal and skewed distributions.
Missing Values: The TEAM_BATTING_HBP
and TEAM_BASERUN_CS
variables are each missing over a third of their values.
Data Preparation
We begin by splitting the data into a training and testing set using a 75/25 split. Next each zero value is set to NA because zero is not exactly feasible in this context therefore we treat the data as missing or an anomaly. Following this, the variables with over 10% missing values were removed and only the complete cases were used for the training set. Lastly we view the distributions to better understand the prepped data.
# Split train/test data
set.seed(100)
smp_size <- floor(0.75 * nrow(raw_mbtd))
train_ind <- sample(seq_len(nrow(raw_mbtd)), size = smp_size)
train <- raw_mbtd[train_ind, -1]
test <- raw_mbtd[-train_ind, -1]
dim(train)
## [1] 1707 16
dim(test)
## [1] 569 16
# Set 0 equal to NA
train[train == 0] <- NA
test[is.na(test)] <- 0
# Remove variables with excessive missing values
train <- dplyr::select(train, -TEAM_BATTING_HBP, -TEAM_BASERUN_CS)
test <- dplyr::select(test, -TEAM_BATTING_HBP, -TEAM_BASERUN_CS)
# Filter complete cases
train <- train[complete.cases(train),]
# Plot clean distributions
par(mfrow = c(4, 4))
datasub = melt(train)
## Using as id variables
ggplot(datasub, aes(x= value)) +
geom_density(fill='red') + facet_wrap(~variable, scales = 'free')
Our distributions look much more normally distributed.
Build Model
After the data has been cleaned and analyzed we are ready to create a linear regression model to predict team wins. It's important to note we are only using the training data to create our best fit line. We use the testing data set to evaluate the model on unseen data and prevent overfitting. The following cells explore three methods for creating a model.
Raw Model
The raw model is our base for evaluation. The raw model uses the base lm
package to create the best fit line. We fit the model on the training dataset using all of the features that met the data preparation criteria and evaluate the performance.
# Build stepwise model
raw_model <- lm(TARGET_WINS ~ .,
data = train)
par(mfrow = c(2, 2))
plot(raw_model)
summary(raw_model)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.2935 -7.1110 -0.0477 6.9765 28.9255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.378097 6.960717 8.099 1.21e-15 ***
## TEAM_BATTING_H -0.032756 0.017899 -1.830 0.067455 .
## TEAM_BATTING_2B -0.054468 0.010027 -5.432 6.58e-08 ***
## TEAM_BATTING_3B 0.177605 0.021405 8.297 2.52e-16 ***
## TEAM_BATTING_HR 0.074933 0.091070 0.823 0.410763
## TEAM_BATTING_BB 0.147477 0.048369 3.049 0.002340 **
## TEAM_BATTING_SO 0.024327 0.024452 0.995 0.319962
## TEAM_BASERUN_SB 0.065565 0.006364 10.302 < 2e-16 ***
## TEAM_PITCHING_H 0.062151 0.016237 3.828 0.000135 ***
## TEAM_PITCHING_HR 0.019369 0.087056 0.222 0.823964
## TEAM_PITCHING_BB -0.107511 0.045938 -2.340 0.019409 *
## TEAM_PITCHING_SO -0.044468 0.023249 -1.913 0.056002 .
## TEAM_FIELDING_E -0.124420 0.007985 -15.582 < 2e-16 ***
## TEAM_FIELDING_DP -0.109774 0.013946 -7.871 7.08e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.02 on 1371 degrees of freedom
## Multiple R-squared: 0.4228, Adjusted R-squared: 0.4173
## F-statistic: 77.24 on 13 and 1371 DF, p-value: < 2.2e-16
#Calculate RMSE and R.Squared
predictions <- predict.lm(raw_model, newdata = test[,-1])
rmse <- rmse(test[,1], predictions)
R.sq <- summary(raw_model)$adj.r.squared
raw <- cbind(rmse, R.sq)
raw
## rmse R.sq
## [1,] 56.05117 0.4172964
Stepwise Model
This model uses the raw model created above with the addition of the stepAIC
package. stepAIC
is a common package used to help with feature selection. This version of the model uses this package with no additional constraints to train and evaluate model performance.
stepwise_model <- stepAIC(raw_model, direction = c("both"), trace = FALSE)
par(mfrow = c(2, 2))
plot(stepwise_model)
summary(stepwise_model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB +
## TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E +
## TEAM_FIELDING_DP, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.1996 -7.0448 -0.1146 6.8985 28.8834
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.147096 6.914466 8.265 3.26e-16 ***
## TEAM_BATTING_H -0.027337 0.016741 -1.633 0.102718
## TEAM_BATTING_2B -0.053858 0.010007 -5.382 8.64e-08 ***
## TEAM_BATTING_3B 0.176795 0.021386 8.267 3.20e-16 ***
## TEAM_BATTING_HR 0.095977 0.010346 9.277 < 2e-16 ***
## TEAM_BATTING_BB 0.165052 0.043579 3.787 0.000159 ***
## TEAM_BASERUN_SB 0.066454 0.006303 10.544 < 2e-16 ***
## TEAM_PITCHING_H 0.056606 0.015009 3.771 0.000169 ***
## TEAM_PITCHING_BB -0.124496 0.041230 -3.020 0.002579 **
## TEAM_PITCHING_SO -0.021638 0.002440 -8.867 < 2e-16 ***
## TEAM_FIELDING_E -0.123876 0.007943 -15.595 < 2e-16 ***
## TEAM_FIELDING_DP -0.110152 0.013924 -7.911 5.23e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.02 on 1373 degrees of freedom
## Multiple R-squared: 0.4223, Adjusted R-squared: 0.4176
## F-statistic: 91.22 on 11 and 1373 DF, p-value: < 2.2e-16
#Calculate RMSE and R.Squared
predictions2 <- predict.lm(stepwise_model, newdata = test[,-1])
rmse2 <- rmse(test[,1], predictions2)
R.sq2 <- summary(stepwise_model)$adj.r.squared
stepwise <- cbind(rmse2, R.sq2)
stepwise
## rmse2 R.sq2
## [1,] 49.62804 0.4176215
Stepwise with Constraints
Here we add constraints to the stepwise model created to tune model performance. Tuning a model is a technique used in data science to direct the model to adjust the feature weights to minimize a specific loss function. In this case we use RMSE (root-mean-square deviation).
#Calculate RMSE and R.Squared
predictions2 <- set_constraints(predictions2)
rmse3 <- rmse(test[,1], predictions2)
R.sq3 <- summary(stepwise_model)$adj.r.squared
constrained_stepwise <- cbind(rmse3, R.sq3)
constrained_stepwise
## rmse3 R.sq3
## [1,] 20.72242 0.4176215
Select Model
Lastly all the models will be compared in order to select the model that will produce the most accurate and precise results. The metrics we will be focused on are RMSE and adjusted R-Squared. The models compared were the original raw model which included all variables. Next was a stepwise model which minimizes AIC in order to determine the variables which are necessary to include. The last model was a stepwise model with constraints implemented on the predictions.
kk = rbind(round(raw, 4), round(stepwise, 4), round(constrained_stepwise, 4))
k1 = as.data.frame(kk)
rownames(k1) = c("raw", "stepwise", "constrained_stepwise")
k1 %>%
kable() %>%
kable_styling(bootstrap_options = c('striped','bordered'), full_width = FALSE)
rmse | R.sq | |
---|---|---|
raw | 56.0512 | 0.4173 |
stepwise | 49.6280 | 0.4176 |
constrained_stepwise | 20.7224 | 0.4176 |
Our third model produced the best metrics. Including constraints on a stepwise model produced an RMSE and adjusted R-squared of 20.7224 and 0.4176 respectively.
preds <- cbind(head(predictions2, 10), head(test[,1], 10))
k2 = as.data.frame(preds)
colnames(k2) = c("Predictions", "Actual")
k2 %>%
kable() %>%
kable_styling(bootstrap_options = c('striped','bordered'), full_width = FALSE)
Predictions | Actual | |
---|---|---|
4 | 66.12305 | 70 |
9 | 73.67847 | 86 |
10 | 66.30510 | 76 |
18 | 78.69734 | 66 |
21 | 71.06833 | 70 |
22 | 78.57993 | 81 |
24 | 92.83437 | 92 |
27 | 78.51106 | 91 |
28 | 73.71682 | 80 |
30 | 69.34804 | 72 |
Write Predictions
mbed[is.na(mbed)]<-0
final_predictions <- round(predict.lm(stepwise_model, newdata = mbed), 3)
final_predictions <- set_constraints(final_predictions)
final_predictions <- cbind(TARGET_WINS=final_predictions, mbed)
write.csv(final_predictions, "moneyball-prediction-data.csv")