Overview
In this homework assignment, we explore, analyze, and model a data set containing approximately 2,200 records. Each record represents a professional baseball team from the years 1871 to 2006, inclusive, and gives the team's performance for that year, with all statistics adjusted to match the performance of a 162-game season. The analysis attempts to predict the number of wins for each team.
library(tidyverse)
library(gridExtra)
library(kableExtra)
library(MASS)
library(lindia)
library(DT)
library(corrplot)
library(psych)
library(VIM)
library(mice)
library(car)
library(caret)
library(e1071)
library(reshape2)   # for melt(), used below
Below is a preview of what the dataset contains.
moneyball_training_data <- read.csv('moneyball-training-data.csv', stringsAsFactors = F, header = T)
str(moneyball_training_data)
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
At first glance, we can see that TEAM_BATTING_HBP has a lot of missing data. Let's look at the summary to see if it reveals further information about the data.
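The call producing the summary below is not echoed in the output; presumably it was simply:
summary(moneyball_training_data)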
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
Based on the summary above we have our work cut out for us, especially when handling missing values. There are 6 variables with missing values:
TEAM_BATTING_SO: 102
TEAM_BASERUN_SB: 131
TEAM_BASERUN_CS: 772
TEAM_BATTING_HBP: 2085
TEAM_PITCHING_SO: 102
TEAM_FIELDING_DP: 286
Some variables also have a minimum of 0. Whether or not these values affect our model outcome will be interesting to find out as we move forward.
## [1] 8.39
Only about 8% of the rows in the data are complete.
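The code behind the 8.39 figure above is not echoed; a minimal sketch that reproduces it:
# Percentage of rows with no missing values
round(sum(complete.cases(moneyball_training_data)) /
        nrow(moneyball_training_data) * 100, 2)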
The next plot is a graphic representation of which variables have missing values. As mentioned earlier, the six variables with missing values can be observed here visually. One of our goals in completing the project is handling missing values effectively, in order to reduce bias and produce useful, powerful models.
# Visualize the pattern of missing values (VIM package)
aggr(moneyball_training_data[,-1],
     labels = names(moneyball_training_data[,-1]),
     col = c('navyblue','yellow'), cex.axis = .7,
     oma = c(7,4,2,2))
Here we look at more descriptive analytics on the raw dataset.
Boxplot for each variable.
ggplot(stack(moneyball_training_data[,-1]), aes(x = ind, y = values)) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 1000)) +
theme(legend.position="none") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(panel.background = element_rect(fill = 'grey'))
## Warning: Removed 3478 rows containing non-finite values (stat_boxplot).
# Pool all boxplot outliers across columns, then count how many fall in each variable
outliers <- boxplot(moneyball_training_data[,-1], plot = F)$out
out <- sapply(moneyball_training_data[,-1],
              function(column) sum(column %in% outliers))

out_df <- data.frame(Item = names(moneyball_training_data[,-1]), Count = out)

ggplot(out_df, aes(x = Item, y = Count, color = Item)) +
  geom_bar(stat = "identity", fill = "white") +
  geom_text(aes(label = Count), vjust = 1.3, color = "black", size = 3.5) +
  theme_minimal() +
  theme(text = element_text(size = 8),
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme(legend.position = "none")
Results show that TEAM_FIELDING_E has the most outliers amongst the predictors and target variable.
Density plots displaying the distribution of each variable.
moneyball_df1 <- melt(moneyball_training_data[,-1])
ggplot(moneyball_df1, aes(x = value)) +
  geom_density(fill = 'blue') +
  facet_wrap(~variable, scales = 'free')
Closer look at correlations with the dependent variable.
# To begin, let's find the variables most correlated with seasonal wins.
# Build corrplot
corrplot(cor(moneyball_training_data[-1], moneyball_training_data$TARGET_WINS,
             use = "na.or.complete"),
         type = "lower", order = "original", tl.col = "black", tl.srt = 45,
         tl.cex = 0.55, cl.pos = 'n', addgrid.col = FALSE)
From this corrplot, we can see that TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_PITCHING_H, TEAM_PITCHING_HR, and TEAM_PITCHING_BB are all positively correlated with TARGET_WINS. Not all variables we'd expect to have a positive contribution do so: for example, triples (TEAM_BATTING_3B) are negatively correlated with total wins. Why doubles contribute positively while triples do the opposite is unclear. Walks allowed is also, strangely, positively correlated with wins.
We will prepare our data in two formats: filtering with boolean vectors and missing-value imputation. How the models perform will be judged on the results of these two methods.
Objective 1: Ensure each variable can independently explain the variance in total wins. Ideally the residuals are smoothly and independently distributed around 0. Our aim is to construct a vector of booleans that filters the noise from our explanatory variables.
Objective 2: Missing values can be a problem when analyzing data. Most models exclude rows with missing values, which can limit the amount of information available to the analysis. This is why we must either remove the missing values, impute them, or model them. Here, the missing values will be imputed.
# Doubles vs Residuals
plot(
  moneyball_training_data$TEAM_BATTING_2B,
  rstandard(
    lm(TARGET_WINS ~ TEAM_BATTING_2B, data = moneyball_training_data)),
  ylab = "Residuals",
  xlab = "Doubles",
  main = "Null Model"
)
For the range from 150 to 350 doubles, the explanatory variable appears to satisfy our assumptions (independence, homoscedasticity) well. Most of our doubles data points are confined to this range, so our transformation will consist of constraining the range of the explanatory variable.
# Doubles Range (150 - 350) vs Residuals
doubles_range <- (moneyball_training_data$TEAM_BATTING_2B > 150) &
  (moneyball_training_data$TEAM_BATTING_2B < 350)

plot(
  moneyball_training_data$TEAM_BATTING_2B[doubles_range],
  rstandard(
    lm(TARGET_WINS ~ TEAM_BATTING_2B,
       data = moneyball_training_data, subset = doubles_range)),
  ylab = "Residuals",
  xlab = "Doubles",
  main = "Constrained Range Model"
)
This worked very well. The boolean vector doubles_range stored above lets us keep only this range of doubles, and the corresponding wins, for our final model.
plot(
moneyball_training_data$TEAM_BATTING_HR,
rstandard(lm(TARGET_WINS ~ TEAM_BATTING_HR,
data = moneyball_training_data)),
ylab = "Residuals",
xlab = "Home Runs",
main = "Null Model"
)
The explanatory variable Home Runs appears to satisfy our assumptions (independence, homoscedasticity) well. There does appear to be some curvature in the range below 50 home runs. We can take the square root of the explanatory variable to flatten this tail in the data.
plot(
sqrt(
moneyball_training_data$TEAM_BATTING_HR),
rstandard(lm(TARGET_WINS ~ sqrt(TEAM_BATTING_HR),
data = moneyball_training_data)),
ylab = "Residuals",
xlab = "Home Runs",
main = "Sqrt Home Runs Model"
)
We can see this transformation has flattened the residuals overall but has left a gap in the data points at the lower end of home runs. Again we can constrain our range, to observations whose square root of home runs is greater than 9 and less than 12 (matching the bounds used in the code below).
# Home Runs vs Residuals (sqrt scale, constrained to 9 < sqrt(HR) < 12)
home_runs_range <- (sqrt(moneyball_training_data$TEAM_BATTING_HR) > 9) &
  (sqrt(moneyball_training_data$TEAM_BATTING_HR) < 12)

plot(
  sqrt(moneyball_training_data$TEAM_BATTING_HR[home_runs_range]),
  rstandard(
    lm(TARGET_WINS ~ sqrt(TEAM_BATTING_HR),
       data = moneyball_training_data, subset = home_runs_range)),
  ylab = "Residuals",
  xlab = "Home Runs",
  main = "Sqrt Home Runs Model / Constrained Range")
As can be seen from the range of residuals, the preceding transformations have produced an explanatory variable that satisfies our assumptions. We can now further filter our data with the new explanatory variable and the filtered target variable, using the home_runs_range vector stored above.
plot(
moneyball_training_data$TEAM_BATTING_BB,
rstandard(
lm(TARGET_WINS ~ TEAM_BATTING_BB,
data = moneyball_training_data)),
ylab = "Residuals",
xlab = "Walks",
main = "Null Model"
)
For walks we can see a situation similar to doubles. Let's zoom into the range between 400 and 700.
# Walks Range (400 - 700) vs Residuals
walks_range <- (moneyball_training_data$TEAM_BATTING_BB > 400) &
  (moneyball_training_data$TEAM_BATTING_BB < 700)

plot(
  moneyball_training_data$TEAM_BATTING_BB[walks_range],
  rstandard(
    lm(TARGET_WINS ~ TEAM_BATTING_BB,
       data = moneyball_training_data, subset = walks_range)),
  ylab = "Residuals",
  xlab = "Walks",
  main = "Constrained Range Model")
Much better! The boolean vector for walks (walks_range) is stored above for later filtering.
# Strikeouts vs Residuals
so_complete <- complete.cases(moneyball_training_data$TEAM_BATTING_SO)

plot(
  moneyball_training_data$TEAM_BATTING_SO[so_complete],
  rstandard(
    lm(TARGET_WINS ~ TEAM_BATTING_SO,
       data = moneyball_training_data, subset = so_complete)),
  ylab = "Residuals",
  xlab = "Strikeouts",
  main = "Null Model"
)
# Strikeouts Range (400 - 1100) vs Residuals
strikeouts_range <- complete.cases(moneyball_training_data$TEAM_BATTING_SO) &
  (moneyball_training_data$TEAM_BATTING_SO > 400) &
  (moneyball_training_data$TEAM_BATTING_SO < 1100)

plot(
  moneyball_training_data$TEAM_BATTING_SO[strikeouts_range],
  rstandard(
    lm(TARGET_WINS ~ TEAM_BATTING_SO,
       data = moneyball_training_data, subset = strikeouts_range)),
  ylab = "Residuals",
  xlab = "Strikeouts",
  main = "Constrained Range Model")
# Fielding Errors vs Residuals
plot(
  moneyball_training_data$TEAM_FIELDING_E,
  rstandard(
    lm(TARGET_WINS ~ TEAM_FIELDING_E,
       data = moneyball_training_data)),
  ylab = "Residuals",
  xlab = "Fielding Errors",
  main = "Null Model")
# Fielding Errors Range (110 - 175) vs Residuals
feilding_error_range <- (moneyball_training_data$TEAM_FIELDING_E < 175) &
  (moneyball_training_data$TEAM_FIELDING_E > 110)

plot(
  moneyball_training_data$TEAM_FIELDING_E[feilding_error_range],
  rstandard(
    lm(TARGET_WINS ~ TEAM_FIELDING_E,
       data = moneyball_training_data, subset = feilding_error_range)),
  ylab = "Residuals",
  xlab = "Fielding Errors",
  main = "Constrained Range Model")
Combine Vectors and Filter Training Dataset.
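The filtering and imputation code is hidden in the knitted output. A minimal sketch, assuming the boolean vectors stored above and the same mice settings used later for the evaluation data (meta_filter and imputed_train_data are the names referenced in later chunks):
# Objective 1: combine the boolean vectors and filter the raw training data
meta_filter <- moneyball_training_data[doubles_range & home_runs_range &
                                         walks_range & strikeouts_range &
                                         feilding_error_range, ]

# Objective 2: impute missing values with predictive mean matching
imputed <- mice(moneyball_training_data[,-1], m = 5, maxit = 5,
                method = 'pmm', printFlag = F)
imputed_train_data <- complete(imputed)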
## Warning: Number of logged events: 50
Look especially at the variable TEAM_BATTING_HBP: in the original data, over 90% of its values (2085 of 2276) were missing.
Visual of complete dataset
The dataset now consists only of complete rows: each missing value has been replaced via the predictive mean matching method.
The imputed points are red and the observed are blue. The matching shape of each distribution suggests that the imputed values are plausible.
The stripplot shows where the missing values were imputed for each variable.
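These plots are not echoed in the output; a sketch using mice's lattice methods, assuming the mids object from the sketch above is named imputed:
# Observed (blue) vs imputed (red) densities for each variable
densityplot(imputed)

# Stripplot of observed vs imputed points for each variable
stripplot(imputed, pch = 20, cex = 1.2)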
After imputation we can see that every variable has a value in each row and the NAs are gone.
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 542.0 1st Qu.: 67.0
## Median :102.00 Median :512.0 Median : 734.5 Median :105.5
## Mean : 99.61 Mean :501.6 Mean : 728.0 Mean :135.2
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 925.0 3rd Qu.:170.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.00 Min. :29.00 Min. : 1137 Min. : 0.0
## 1st Qu.: 43.00 1st Qu.:51.00 1st Qu.: 1419 1st Qu.: 50.0
## Median : 57.00 Median :61.00 Median : 1518 Median :107.0
## Mean : 75.05 Mean :60.00 Mean : 1779 Mean :105.7
## 3rd Qu.: 90.00 3rd Qu.:70.25 3rd Qu.: 1682 3rd Qu.:150.0
## Max. :201.00 Max. :95.00 Max. :30132 Max. :343.0
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 476.0 1st Qu.: 607.8 1st Qu.: 127.0 1st Qu.:124.0
## Median : 536.5 Median : 802.5 Median : 159.0 Median :145.0
## Mean : 553.0 Mean : 809.2 Mean : 246.5 Mean :141.3
## 3rd Qu.: 611.0 3rd Qu.: 957.2 3rd Qu.: 249.2 3rd Qu.:162.0
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
# Box-Cox transform, then center and scale, via caret::preProcess
mb <- preProcess(imputed_train_data, c("BoxCox", "center", "scale"))
moneyball_transformed <- data.frame(mb = predict(mb, imputed_train_data))

# Density plot of each transformed variable
moneyball_transformed1 <- melt(moneyball_transformed)
ggplot(moneyball_transformed1, aes(x = value)) +
  geom_density(fill = 'blue') +
  facet_wrap(~variable, scales = 'free')
# Build Corrplot
corrplot(cor(imputed_train_data[-1], imputed_train_data$TARGET_WINS, use = "na.or.complete"),
type = "lower",
order = "original", tl.col = "black", tl.srt = 45, tl.cex = 0.55, cl.pos='n', addgrid.col = FALSE)
We have more positive correlations with the target variable than in the previous correlation plot.
For the first model, we run a regression on the raw data using all the variables in the data set. As we can observe, many of the variables do not reach a good significance level. This first attempt gives an R-squared of 0.5501 and an adjusted R-squared of 0.5116; the gap may be due to the number of non-significant variables in the regression. We also obtain a low F-statistic.
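The model's code is not echoed; per the Call in the summary below, it is simply (the name lm.train1 is ours):
lm.train1 <- lm(TARGET_WINS ~ ., data = moneyball_training_data[,-1])
summary(lm.train1)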
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = moneyball_training_data[,
## -1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8708 -5.6564 -0.0599 5.2545 22.9274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.28826 19.67842 3.064 0.00253 **
## TEAM_BATTING_H 1.91348 2.76139 0.693 0.48927
## TEAM_BATTING_2B 0.02639 0.03029 0.871 0.38484
## TEAM_BATTING_3B -0.10118 0.07751 -1.305 0.19348
## TEAM_BATTING_HR -4.84371 10.50851 -0.461 0.64542
## TEAM_BATTING_BB -4.45969 3.63624 -1.226 0.22167
## TEAM_BATTING_SO 0.34196 2.59876 0.132 0.89546
## TEAM_BASERUN_SB 0.03304 0.02867 1.152 0.25071
## TEAM_BASERUN_CS -0.01104 0.07143 -0.155 0.87730
## TEAM_BATTING_HBP 0.08247 0.04960 1.663 0.09815 .
## TEAM_PITCHING_H -1.89096 2.76095 -0.685 0.49432
## TEAM_PITCHING_HR 4.93043 10.50664 0.469 0.63946
## TEAM_PITCHING_BB 4.51089 3.63372 1.241 0.21612
## TEAM_PITCHING_SO -0.37364 2.59705 -0.144 0.88577
## TEAM_FIELDING_E -0.17204 0.04140 -4.155 5.08e-05 ***
## TEAM_FIELDING_DP -0.10819 0.03654 -2.961 0.00349 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.467 on 175 degrees of freedom
## (2085 observations deleted due to missingness)
## Multiple R-squared: 0.5501, Adjusted R-squared: 0.5116
## F-statistic: 14.27 on 15 and 175 DF, p-value: < 2.2e-16
On the whole, the model is significant. However, 2085 observations were removed due to missingness, which makes the model suspect. The residual plot seems normal, with the points distributed randomly. We should be somewhat concerned by the outliers in the QQ plot, which cause the tails to bend in opposite directions. The variation in the third plot (bottom left) appears homoscedastic.
With this model we took the approach of filtering some of the noise from our variables (doubles, home runs, walks, strikeouts, fielding errors). We did not see much improvement: R-squared went down compared to the first model, though the F-statistic went slightly higher. As stated earlier, this model was built on the filtering method, which kept 509 observations for the model to work with (476 used after a further 33 were removed due to missingness).
lm.train2 = lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB +
TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP , data = meta_filter)
summary(lm.train2)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR +
## TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP,
## data = meta_filter)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.8165 -6.6792 0.1261 6.2821 22.8485
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.5992851 13.9216579 2.270 0.02368 *
## TEAM_BATTING_H 0.0514584 0.0377933 1.362 0.17400
## TEAM_BATTING_2B -0.0565008 0.0185335 -3.049 0.00243 **
## TEAM_BATTING_3B 0.1392117 0.0423792 3.285 0.00110 **
## TEAM_BATTING_HR -0.0282346 0.2850461 -0.099 0.92114
## TEAM_BATTING_BB -0.1062370 0.0956209 -1.111 0.26714
## TEAM_BATTING_SO 0.0913935 0.0467464 1.955 0.05118 .
## TEAM_BASERUN_SB 0.0312539 0.0164747 1.897 0.05844 .
## TEAM_BASERUN_CS -0.0312946 0.0395656 -0.791 0.42938
## TEAM_PITCHING_H -0.0008172 0.0357390 -0.023 0.98177
## TEAM_PITCHING_HR 0.0952730 0.2736531 0.348 0.72788
## TEAM_PITCHING_BB 0.1375139 0.0929499 1.479 0.13971
## TEAM_PITCHING_SO -0.0953209 0.0444738 -2.143 0.03261 *
## TEAM_FIELDING_E -0.1623533 0.0297495 -5.457 7.91e-08 ***
## TEAM_FIELDING_DP -0.1016161 0.0247806 -4.101 4.87e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.688 on 461 degrees of freedom
## (33 observations deleted due to missingness)
## Multiple R-squared: 0.3219, Adjusted R-squared: 0.3014
## F-statistic: 15.63 on 14 and 461 DF, p-value: < 2.2e-16
We observed a lot of missing data in our raw dataset. To address this, and in the hope of improving our model, we perform multiple imputation on the training data and then fit a new model.
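The fitting code is not echoed; a minimal sketch, assuming the imputed_train_data frame created earlier (the name lm.train3 is ours):
lm.train3 <- lm(TARGET_WINS ~ ., data = imputed_train_data)
summary(lm.train3)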
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = imputed_train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.308 -8.429 0.262 8.022 46.776
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.4215968 5.3247937 6.089 1.33e-09 ***
## TEAM_BATTING_H 0.0429219 0.0035525 12.082 < 2e-16 ***
## TEAM_BATTING_2B -0.0217313 0.0088293 -2.461 0.013918 *
## TEAM_BATTING_3B 0.0313910 0.0163183 1.924 0.054522 .
## TEAM_BATTING_HR 0.0606282 0.0263058 2.305 0.021271 *
## TEAM_BATTING_BB 0.0193784 0.0056640 3.421 0.000634 ***
## TEAM_BATTING_SO -0.0163874 0.0024659 -6.646 3.77e-11 ***
## TEAM_BASERUN_SB 0.0599810 0.0052927 11.333 < 2e-16 ***
## TEAM_BASERUN_CS -0.0184157 0.0105683 -1.743 0.081548 .
## TEAM_BATTING_HBP 0.0632476 0.0252139 2.508 0.012196 *
## TEAM_PITCHING_H 0.0014835 0.0003808 3.896 0.000101 ***
## TEAM_PITCHING_HR 0.0168792 0.0234027 0.721 0.470829
## TEAM_PITCHING_BB -0.0088860 0.0040068 -2.218 0.026673 *
## TEAM_PITCHING_SO 0.0031946 0.0008956 3.567 0.000369 ***
## TEAM_FIELDING_E -0.0421098 0.0026422 -15.937 < 2e-16 ***
## TEAM_FIELDING_DP -0.1221000 0.0126844 -9.626 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.53 on 2260 degrees of freedom
## Multiple R-squared: 0.3716, Adjusted R-squared: 0.3674
## F-statistic: 89.1 on 15 and 2260 DF, p-value: < 2.2e-16
In this model, we definitely obtained more statistically significant variables and a good increase in the F-statistic. This is an improvement over model 2 because the missing values were imputed, giving the model more observations to work with.
This model is built by removing multicollinearity from model 3 for improved results. The rule of thumb is to remove any variable with a VIF score above 5.
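The VIF scores below can be reproduced with car::vif; a sketch using the lm.train3 name assumed above:
vif(lm.train3)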
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 3.824178 2.474938 3.012718 36.769145
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 6.997200 5.365888 3.931167 3.922233
## TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 1.140037 4.160152 29.828536 6.440010
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 3.426995 5.249712 2.051073
TEAM_BATTING_HR has a high VIF score (highly correlated), so it will be the first to be removed from the model.
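A sketch of this removal step, using update() to drop the variable and recheck the VIFs (the lm.train3b name is ours):
lm.train3b <- update(lm.train3, . ~ . - TEAM_BATTING_HR)
vif(lm.train3b)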
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_BB
## 3.812960 2.474181 2.884116 5.615529
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP
## 5.340989 3.928034 3.921867 1.136316
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 4.138367 3.802642 5.000993 3.175985
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 5.244291 2.051018
Next we remove TEAM_BATTING_BB.
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_SO
## 3.811750 2.469475 2.882555 5.135989
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## 3.927827 3.921833 1.112140 3.437098
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## 3.802252 1.846735 2.306046 4.978480
## TEAM_FIELDING_DP
## 1.980628
Finally, we remove TEAM_BATTING_SO.
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BASERUN_SB
## 3.157901 2.359989 2.815128 3.720868
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 3.908211 1.107066 3.400599 2.441104
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1.689236 1.872926 4.791321 1.966047
Multicollinearity is no longer present in this model. We can establish that TEAM_BATTING_BB, TEAM_BATTING_SO, and TEAM_BATTING_HR are all dependent on other predictor variables, and so they were removed from the model.
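The fitting code for model 4 is not echoed; per the Call below it is (the name lm.train4 is ours):
lm.train4 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
                  TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_BATTING_HBP +
                  TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB +
                  TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP,
                data = imputed_train_data)
summary(lm.train4)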
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_BATTING_HBP +
## TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = imputed_train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.212 -8.669 0.270 8.274 45.415
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.8086199 3.9013570 4.565 5.27e-06 ***
## TEAM_BATTING_H 0.0519650 0.0032698 15.892 < 2e-16 ***
## TEAM_BATTING_2B -0.0299403 0.0087330 -3.428 0.000618 ***
## TEAM_BATTING_3B 0.0355485 0.0159776 2.225 0.026187 *
## TEAM_BASERUN_SB 0.0529770 0.0052156 10.157 < 2e-16 ***
## TEAM_BASERUN_CS -0.0152922 0.0106854 -1.431 0.152531
## TEAM_BATTING_HBP 0.0511955 0.0251671 2.034 0.042046 *
## TEAM_PITCHING_H 0.0009723 0.0003487 2.788 0.005345 **
## TEAM_PITCHING_HR 0.0394807 0.0067812 5.822 6.64e-09 ***
## TEAM_PITCHING_BB 0.0041987 0.0020786 2.020 0.043504 *
## TEAM_PITCHING_SO -0.0002819 0.0006706 -0.420 0.674231
## TEAM_FIELDING_E -0.0424416 0.0025568 -16.600 < 2e-16 ***
## TEAM_FIELDING_DP -0.1042555 0.0125789 -8.288 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.69 on 2263 degrees of freedom
## Multiple R-squared: 0.3544, Adjusted R-squared: 0.351
## F-statistic: 103.5 on 12 and 2263 DF, p-value: < 2.2e-16
Even with multicollinearity removed, the adjusted R-squared did not increase. There is curvature in the scale-location plot, which indicates non-constant variance.
Using transformed data and removing collinearity
lm.train5 <- lm(mb.TARGET_WINS ~ mb.TEAM_BATTING_H + mb.TEAM_BATTING_2B + mb.TEAM_BATTING_3B +
mb.TEAM_BASERUN_SB + mb.TEAM_BASERUN_CS + mb.TEAM_BATTING_HBP + mb.TEAM_PITCHING_H +
mb.TEAM_PITCHING_HR + mb.TEAM_PITCHING_BB + mb.TEAM_PITCHING_SO + mb.TEAM_FIELDING_E +
mb.TEAM_FIELDING_DP, moneyball_transformed)
summary(lm.train5)
##
## Call:
## lm(formula = mb.TARGET_WINS ~ mb.TEAM_BATTING_H + mb.TEAM_BATTING_2B +
## mb.TEAM_BATTING_3B + mb.TEAM_BASERUN_SB + mb.TEAM_BASERUN_CS +
## mb.TEAM_BATTING_HBP + mb.TEAM_PITCHING_H + mb.TEAM_PITCHING_HR +
## mb.TEAM_PITCHING_BB + mb.TEAM_PITCHING_SO + mb.TEAM_FIELDING_E +
## mb.TEAM_FIELDING_DP, data = moneyball_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9810 -0.5259 -0.0020 0.5336 3.0816
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.728e-11 1.733e-02 0.000 1.000000
## mb.TEAM_BATTING_H 5.128e-01 3.520e-02 14.567 < 2e-16 ***
## mb.TEAM_BATTING_2B -7.042e-02 2.736e-02 -2.574 0.010115 *
## mb.TEAM_BATTING_3B 1.836e-01 2.991e-02 6.137 9.91e-10 ***
## mb.TEAM_BASERUN_SB 2.296e-01 3.203e-02 7.168 1.02e-12 ***
## mb.TEAM_BASERUN_CS 3.162e-03 3.423e-02 0.092 0.926399
## mb.TEAM_BATTING_HBP 6.625e-02 1.854e-02 3.573 0.000361 ***
## mb.TEAM_PITCHING_H -2.089e-01 3.230e-02 -6.467 1.22e-10 ***
## mb.TEAM_PITCHING_HR 6.543e-02 2.952e-02 2.216 0.026779 *
## mb.TEAM_PITCHING_BB 1.121e-01 2.132e-02 5.255 1.62e-07 ***
## mb.TEAM_PITCHING_SO -3.694e-03 2.352e-02 -0.157 0.875183
## mb.TEAM_FIELDING_E -4.421e-01 3.813e-02 -11.595 < 2e-16 ***
## mb.TEAM_FIELDING_DP -1.630e-01 2.351e-02 -6.931 5.42e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8267 on 2263 degrees of freedom
## Multiple R-squared: 0.3202, Adjusted R-squared: 0.3166
## F-statistic: 88.83 on 12 and 2263 DF, p-value: < 2.2e-16
The previous models showed a much wider spread of residuals; in this model the residuals are closer to 0. The diagnostic plots look normal. Some outliers are noted, but not to the point of serious concern, since we have more than 2000 observations to work with.
Backward selection
For this model we use backward selection. We run a backward selection on the raw data in order to find the most significant variables. We get a high R-squared of 0.5345, though the F-statistic (30.02) is lower than in the imputation-based models.
lm.train6 <- lm(TARGET_WINS ~., data = moneyball_training_data[,-1])
summary(step(lm.train6, direction = 'backward', trace = F))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HBP +
## TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = moneyball_training_data[,
## -1])
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.2248 -5.6294 -0.0212 5.0439 21.3065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.95454 19.10292 3.191 0.001670 **
## TEAM_BATTING_H 0.02541 0.01009 2.518 0.012648 *
## TEAM_BATTING_HBP 0.08712 0.04852 1.796 0.074211 .
## TEAM_PITCHING_HR 0.08945 0.02394 3.736 0.000249 ***
## TEAM_PITCHING_BB 0.05672 0.00940 6.034 8.66e-09 ***
## TEAM_PITCHING_SO -0.03136 0.00728 -4.308 2.68e-05 ***
## TEAM_FIELDING_E -0.17218 0.03970 -4.338 2.38e-05 ***
## TEAM_FIELDING_DP -0.11904 0.03516 -3.386 0.000869 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.422 on 183 degrees of freedom
## (2085 observations deleted due to missingness)
## Multiple R-squared: 0.5345, Adjusted R-squared: 0.5167
## F-statistic: 30.02 on 7 and 183 DF, p-value: < 2.2e-16
This model can be compared with model 1. The difference is that only the statistically significant predictors remain, which indeed improved the model's output.
Stepwise selection
For this model we work with stepwise selection in both directions (forward and backward). We use the imputed data and compare the result with model 6.
lm.train7 <- lm(TARGET_WINS ~., data = imputed_train_data)
lm.inter <- lm(TARGET_WINS ~ 1, data = imputed_train_data)
summary(step(lm.inter, direction = 'both', scope = formula(lm.train7), trace = F))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_FIELDING_E +
## TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_BATTING_SO + TEAM_PITCHING_H +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_HBP + TEAM_PITCHING_SO +
## TEAM_BATTING_2B + TEAM_PITCHING_BB + TEAM_BATTING_3B + TEAM_BASERUN_CS,
## data = imputed_train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.301 -8.447 0.274 8.014 46.897
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.9812180 5.2891103 6.047 1.73e-09 ***
## TEAM_BATTING_H 0.0431774 0.0035344 12.216 < 2e-16 ***
## TEAM_FIELDING_E -0.0420179 0.0026389 -15.923 < 2e-16 ***
## TEAM_BASERUN_SB 0.0599939 0.0052921 11.337 < 2e-16 ***
## TEAM_FIELDING_DP -0.1218711 0.0126791 -9.612 < 2e-16 ***
## TEAM_BATTING_SO -0.0161301 0.0024397 -6.612 4.73e-11 ***
## TEAM_PITCHING_H 0.0014640 0.0003798 3.855 0.000119 ***
## TEAM_BATTING_HR 0.0783506 0.0093914 8.343 < 2e-16 ***
## TEAM_BATTING_BB 0.0176960 0.0051608 3.429 0.000617 ***
## TEAM_BATTING_HBP 0.0645168 0.0251498 2.565 0.010373 *
## TEAM_PITCHING_SO 0.0029988 0.0008534 3.514 0.000450 ***
## TEAM_BATTING_2B -0.0219183 0.0088245 -2.484 0.013071 *
## TEAM_PITCHING_BB -0.0074438 0.0034718 -2.144 0.032135 *
## TEAM_BATTING_3B 0.0327671 0.0162047 2.022 0.043286 *
## TEAM_BASERUN_CS -0.0187900 0.0105544 -1.780 0.075161 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.53 on 2261 degrees of freedom
## Multiple R-squared: 0.3715, Adjusted R-squared: 0.3676
## F-statistic: 95.45 on 14 and 2261 DF, p-value: < 2.2e-16
This is definitely an improvement, with increases in both the adjusted R-squared and the F-statistic compared to models 3 and 6. The diagnostic plots are consistent with the model assumptions, such as normality of residuals.
After running 7 different models, we decided to use model 7. The approach taken for this model was stepwise selection, which adds a variable at each step and evaluates each candidate to determine its significance to the model. The data selected was the cleaned-up, imputed data set. Our primary criteria for selecting this model were the adjusted R-squared values and F-statistics. Looking at each model's adjusted R-squared, we noticed that models 1 and 6 had the highest values, 0.5116 and 0.5167 respectively. Additionally, looking at their F-statistics, model 1 had an F-statistic of 14.27 and model 6 had 30.02. While their adjusted R-squared values were the highest, we opted not to select these models: their F-statistics were too low, most of the observations were removed due to missingness, and they were fit on the raw data.
Being left with models 2, 3, 4, 5, and 7, we can quickly eliminate model 2, which used a filtered data set: it had the lowest adjusted R-squared, 0.3014, and a low F-statistic of 15.63. Model 3 was one of our favorites because it had a higher adjusted R-squared of 0.3674 and an F-statistic of 89.1. The method here was to apply multiple imputation to our training data in the hope of better results, which held true; but we took additional measures in model 4 to improve on model 3. We wanted to remove the multicollinearity present in model 3, so we evaluated the variance inflation factor (VIF) for each variable in the model. If the VIF was above 5, we could say that the variable was correlated with another variable. Model 4 removed 3 variables (TEAM_BATTING_BB, TEAM_BATTING_SO, and TEAM_BATTING_HR), as they were all dependent on other predictor variables. While doing this, we did not see an improvement in the adjusted R-squared, 0.351, but did see an increase in the F-statistic, now above 100 at 103.5. Model 5 used the transformed data and also removed the variables associated with multicollinearity; again, we saw a lower adjusted R-squared of 0.3166 and a lower F-statistic of 88.83.
Comparing all of our models to model 7, we see the greatest adjusted R-squared, 0.3676, and the second-highest F-statistic, 95.45. We weighted the adjusted R-squared more heavily than the F-statistic for these models, though the F-statistic remained important.
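To make the comparison concrete, the headline figures quoted above can be collected into one small table:
# Adjusted R-squared and F-statistic for models 1-7, from the summaries above
model_comparison <- data.frame(
  Model         = 1:7,
  Adj_R_Squared = c(0.5116, 0.3014, 0.3674, 0.3510, 0.3166, 0.5167, 0.3676),
  F_Statistic   = c(14.27, 15.63, 89.10, 103.50, 88.83, 30.02, 95.45)
)
model_comparison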
moneyball_evaluation_data <- read.csv('moneyball-evaluation-data.csv', header = T, stringsAsFactors = F)
moneyball_evaluation_data <- moneyball_evaluation_data[,-1]
imputed_moneyball_evaulated_df <- mice(moneyball_evaluation_data, m=5, maxit = 5, method = 'pmm', printFlag = F)
## Warning: Number of logged events: 50
Let's see the predicted wins for the teams based on the model we chose.
# predict.lm needs a data frame, so take a completed data set from the mice object
eval_complete <- complete(imputed_moneyball_evaulated_df)
eval_data <- predict(lm.train7, newdata = eval_complete, interval = "prediction")
write.csv(eval_data, "moneyball_preds.csv", row.names = F)
head(eval_data, 30)
## fit lwr upr
## 1 61.53147 36.90491 86.15802
## 2 66.05619 41.41377 90.69862
## 3 73.16340 48.54653 97.78028
## 4 86.78046 62.16465 111.39628
## 5 59.57531 34.89319 84.25743
## 6 67.46261 42.79323 92.13198
## 7 81.16960 56.50145 105.83776
## 8 77.77784 53.14352 102.41217
## 9 70.30781 45.69187 94.92376
## 10 74.50157 49.89114 99.11200
## 11 68.97133 44.34881 93.59385
## 12 82.56822 57.95149 107.18494
## 13 81.75140 57.12043 106.38236
## 14 84.28518 59.65652 108.91385
## 15 86.58895 61.95208 111.22583
## 16 78.86785 54.22822 103.50747
## 17 73.81928 49.22330 98.41526
## 18 77.88048 53.29065 102.47032
## 19 72.68869 48.05343 97.32395
## 20 90.02777 65.40021 114.65533
## 21 80.81645 56.19392 105.43897
## 22 82.56889 57.95040 107.18737
## 23 77.98142 53.36607 102.59677
## 24 71.32904 46.73034 95.92775
## 25 82.52217 57.92136 107.12298
## 26 89.92955 65.31472 114.54439
## 27 63.30306 38.16660 88.43951
## 28 74.91196 50.29321 99.53071
## 29 82.26683 57.61431 106.91935
## 30 74.85095 50.20168 99.50022