Assignment

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

We have been given a dataset with 2276 records summarizing a major league baseball team's season. The records span 1871 to 2006 inclusive. All statistics have been adjusted to match the performance of a 162 game season.

Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).


DATA EXPLORATION

Before modeling it’s important you understand your data. The following cells use exploratory data methods to understand the distribution of the data. The techniques used will explore differences feature type, summary statistics, number of missing values, and a visual representation of the density of each variable.

# Load dataframes without index column
raw_mbed <- read.csv("https://raw.githubusercontent.com/Zchen116/data-621/master/moneyball-evaluation-data.csv")
raw_mbtd <- read.csv("https://raw.githubusercontent.com/Zchen116/data-621/master/moneyball-training-data.csv")
mbed <- raw_mbed[,-1]
mbtd <- raw_mbtd[,-1]
# Check variable types
sapply(mbtd, class)
##      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
##        "integer"        "integer"        "integer"        "integer" 
##  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB 
##        "integer"        "integer"        "integer"        "integer" 
##  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H TEAM_PITCHING_HR 
##        "integer"        "integer"        "integer"        "integer" 
## TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##        "integer"        "integer"        "integer"        "integer"
# Summarize variables
summary(mbtd)
##   TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##                                                                  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##                                   NA's   :102      NA's   :131    
##  TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
##  Min.   :  0.0   Min.   :29.00    Min.   : 1137   Min.   :  0.0   
##  1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419   1st Qu.: 50.0   
##  Median : 49.0   Median :58.00    Median : 1518   Median :107.0   
##  Mean   : 52.8   Mean   :59.36    Mean   : 1779   Mean   :105.7   
##  3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682   3rd Qu.:150.0   
##  Max.   :201.0   Max.   :95.00    Max.   :30132   Max.   :343.0   
##  NA's   :772     NA's   :2085                                     
##  TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 52.0   
##  1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0   1st Qu.:131.0   
##  Median : 536.5   Median :  813.5   Median : 159.0   Median :149.0   
##  Mean   : 553.0   Mean   :  817.7   Mean   : 246.5   Mean   :146.4   
##  3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2   3rd Qu.:164.0   
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.0   
##                   NA's   :102                        NA's   :286
# Plot distributions
par(mfrow = c(4, 4))

datasub = melt(mbtd)
## Using  as id variables
ggplot(datasub, aes(x= value)) + 
    geom_density(fill='red') + facet_wrap(~variable, scales = 'free') 

# Show missing values
vis_miss(mbtd)

Let's go piece by piece through the information we uncovered.

Variable Type: We are dealing with only numerical types.
Summary: We have zero values which doesn't seem feasible within the context of the analysis.
Distributions: We have bimodal and skewed distributions.
Missing Values: The TEAM_BATTING_HBP and TEAM_BASERUN_CS variables are each missing over a third of their values.


Data Preparation

We begin by splitting the data into a training and testing set using a 75/25 split. Next each zero value is set to NA because zero is not exactly feasible in this context therefore we treat the data as missing or an anomaly. Following this, the variables with over 10% missing values were removed and only the complete cases were used for the training set. Lastly we view the distributions to better understand the prepped data.

# Split train/test data
set.seed(100) 

smp_size <- floor(0.75 * nrow(raw_mbtd))

train_ind <- sample(seq_len(nrow(raw_mbtd)), size = smp_size)

train <- raw_mbtd[train_ind, -1]
test <- raw_mbtd[-train_ind, -1]
dim(train)
## [1] 1707   16
dim(test)
## [1] 569  16
# Set 0 equal to NA
train[train == 0] <- NA
test[is.na(test)] <- 0
# Remove variables with excessive missing values
train <- dplyr::select(train, -TEAM_BATTING_HBP, -TEAM_BASERUN_CS)
test <- dplyr::select(test, -TEAM_BATTING_HBP, -TEAM_BASERUN_CS)
# Filter complete cases
train <- train[complete.cases(train),]
# Plot clean distributions
par(mfrow = c(4, 4))

datasub = melt(train)
## Using  as id variables
ggplot(datasub, aes(x= value)) + 
    geom_density(fill='red') + facet_wrap(~variable, scales = 'free') 

Our distributions look much more normally distributed.


Build Model

After the data has been cleaned and analyzed we are ready to create a linear regression model to predict team wins. It's important to note we are only using the training data to create our best fit line. We use the testing data set to evaluate the model on unseen data and prevent overfitting. The following cells explore three methods for creating a model.

Raw Model

The raw model is our base for evaluation. The raw model uses the base lm package to create the best fit line. We fit the model on the training dataset using all of the features that met the data preparation criteria and evaluate the performance.

# Build stepwise model
raw_model <- lm(TARGET_WINS ~ ., 
            data = train)
par(mfrow = c(2, 2))
plot(raw_model)

summary(raw_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -29.2935  -7.1110  -0.0477   6.9765  28.9255 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      56.378097   6.960717   8.099 1.21e-15 ***
## TEAM_BATTING_H   -0.032756   0.017899  -1.830 0.067455 .  
## TEAM_BATTING_2B  -0.054468   0.010027  -5.432 6.58e-08 ***
## TEAM_BATTING_3B   0.177605   0.021405   8.297 2.52e-16 ***
## TEAM_BATTING_HR   0.074933   0.091070   0.823 0.410763    
## TEAM_BATTING_BB   0.147477   0.048369   3.049 0.002340 ** 
## TEAM_BATTING_SO   0.024327   0.024452   0.995 0.319962    
## TEAM_BASERUN_SB   0.065565   0.006364  10.302  < 2e-16 ***
## TEAM_PITCHING_H   0.062151   0.016237   3.828 0.000135 ***
## TEAM_PITCHING_HR  0.019369   0.087056   0.222 0.823964    
## TEAM_PITCHING_BB -0.107511   0.045938  -2.340 0.019409 *  
## TEAM_PITCHING_SO -0.044468   0.023249  -1.913 0.056002 .  
## TEAM_FIELDING_E  -0.124420   0.007985 -15.582  < 2e-16 ***
## TEAM_FIELDING_DP -0.109774   0.013946  -7.871 7.08e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.02 on 1371 degrees of freedom
## Multiple R-squared:  0.4228, Adjusted R-squared:  0.4173 
## F-statistic: 77.24 on 13 and 1371 DF,  p-value: < 2.2e-16
#Calculate RMSE and R.Squared
predictions <- predict.lm(raw_model, newdata = test[,-1])

rmse <- rmse(test[,1], predictions)

R.sq <- summary(raw_model)$adj.r.squared

raw <- cbind(rmse, R.sq)
raw
##          rmse      R.sq
## [1,] 56.05117 0.4172964

Stepwise Model

This model uses the raw model created above with the addition of the stepAIC package. stepAIC is a common package used to help with feature selection. This version of the model uses this package with no additional constraints to train and evaluate model performance.

stepwise_model <- stepAIC(raw_model, direction = c("both"), trace = FALSE)
par(mfrow = c(2, 2))
plot(stepwise_model)

summary(stepwise_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -29.1996  -7.0448  -0.1146   6.8985  28.8834 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      57.147096   6.914466   8.265 3.26e-16 ***
## TEAM_BATTING_H   -0.027337   0.016741  -1.633 0.102718    
## TEAM_BATTING_2B  -0.053858   0.010007  -5.382 8.64e-08 ***
## TEAM_BATTING_3B   0.176795   0.021386   8.267 3.20e-16 ***
## TEAM_BATTING_HR   0.095977   0.010346   9.277  < 2e-16 ***
## TEAM_BATTING_BB   0.165052   0.043579   3.787 0.000159 ***
## TEAM_BASERUN_SB   0.066454   0.006303  10.544  < 2e-16 ***
## TEAM_PITCHING_H   0.056606   0.015009   3.771 0.000169 ***
## TEAM_PITCHING_BB -0.124496   0.041230  -3.020 0.002579 ** 
## TEAM_PITCHING_SO -0.021638   0.002440  -8.867  < 2e-16 ***
## TEAM_FIELDING_E  -0.123876   0.007943 -15.595  < 2e-16 ***
## TEAM_FIELDING_DP -0.110152   0.013924  -7.911 5.23e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.02 on 1373 degrees of freedom
## Multiple R-squared:  0.4223, Adjusted R-squared:  0.4176 
## F-statistic: 91.22 on 11 and 1373 DF,  p-value: < 2.2e-16
#Calculate RMSE and R.Squared
predictions2 <- predict.lm(stepwise_model, newdata = test[,-1])

rmse2 <- rmse(test[,1], predictions2)

R.sq2 <- summary(stepwise_model)$adj.r.squared

stepwise <- cbind(rmse2, R.sq2)

stepwise
##         rmse2     R.sq2
## [1,] 49.62804 0.4176215

Stepwise with Constraints

Here we add constraints to the stepwise model created to tune model performance. Tuning a model is a technique used in data science to direct the model to adjust the feature weights to minimize a specific loss function. In this case we use RMSE (root-mean-square deviation).

#Calculate RMSE and R.Squared
predictions2 <- set_constraints(predictions2)

rmse3 <- rmse(test[,1], predictions2)

R.sq3 <- summary(stepwise_model)$adj.r.squared

constrained_stepwise <- cbind(rmse3, R.sq3)

constrained_stepwise
##         rmse3     R.sq3
## [1,] 20.72242 0.4176215

Select Model

Lastly all the models will be compared in order to select the model that will produce the most accurate and precise results. The metrics we will be focused on are RMSE and adjusted R-Squared. The models compared were the original raw model which included all variables. Next was a stepwise model which minimizes AIC in order to determine the variables which are necessary to include. The last model was a stepwise model with constraints implemented on the predictions.

kk = rbind(round(raw, 4), round(stepwise, 4), round(constrained_stepwise, 4))
k1 = as.data.frame(kk)
rownames(k1) = c("raw", "stepwise", "constrained_stepwise")
k1 %>%
  kable() %>%
  kable_styling(bootstrap_options = c('striped','bordered'), full_width = FALSE) 
rmse R.sq
raw 56.0512 0.4173
stepwise 49.6280 0.4176
constrained_stepwise 20.7224 0.4176

Our third model produced the best metrics. Including constraints on a stepwise model produced an RMSE and adjusted R-squared of 20.7224 and 0.4176 respectively.

preds <- cbind(head(predictions2, 10), head(test[,1], 10))

k2 = as.data.frame(preds)
colnames(k2) = c("Predictions", "Actual")
k2 %>%
  kable() %>%
  kable_styling(bootstrap_options = c('striped','bordered'), full_width = FALSE) 
Predictions Actual
4 66.12305 70
9 73.67847 86
10 66.30510 76
18 78.69734 66
21 71.06833 70
22 78.57993 81
24 92.83437 92
27 78.51106 91
28 73.71682 80
30 69.34804 72

Write Predictions

mbed[is.na(mbed)]<-0
final_predictions <- round(predict.lm(stepwise_model, newdata = mbed), 3)
final_predictions <- set_constraints(final_predictions)
final_predictions <- cbind(TARGET_WINS=final_predictions, mbed)
write.csv(final_predictions, "moneyball-prediction-data.csv")