MoneyBall Assignment

Data Exploration

In this assignment, we will be analyzing approximately 2200 records from year of 1871 to 2006, Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Firstly, import the data and load required packages to perform general descriptive analyses

##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 515465 27.6    1138110 60.8         NA   669294 35.8
## Vcells 948150  7.3    8388608 64.0      16384  1839880 14.1
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.1.8
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()   masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
## corrplot 0.92 loaded
## 
## Loading required package: carData
## 
## 
## Attaching package: 'car'
## 
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## 
## The following object is masked from 'package:purrr':
## 
##     some
## 
## 
## The following object is masked from 'package:psych':
## 
##     logit
## 
## 
## Loading required package: lattice
## 
## 
## Attaching package: 'caret'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     lift

Performing General Data Analyses and boxplot

head(df1)
##   INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1     1          39           1445             194              39
## 2     2          70           1339             219              22
## 3     3          86           1377             232              35
## 4     4          70           1387             209              38
## 5     5          82           1297             186              27
## 6     6          75           1279             200              36
##   TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1              13             143             842              NA
## 2             190             685            1075              37
## 3             137             602             917              46
## 4              96             451             922              43
## 5             102             472             920              49
## 6              92             443             973             107
##   TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1              NA               NA            9364               84
## 2              28               NA            1347              191
## 3              27               NA            1377              137
## 4              30               NA            1396               97
## 5              39               NA            1297              102
## 6              59               NA            1279               92
##   TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1              927             5456            1011               NA
## 2              689             1082             193              155
## 3              602              917             175              153
## 4              454              928             164              156
## 5              472              920             138              168
## 6              443              973             123              149
summary(df1)
##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286
# From the Summary we have noticed that there are 6 variables containing N/A values, hence we need to fill them with values in order to perform the analyses

df1_med <- df1%>% 
  mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))
df2_med <- df2%>%
  mutate_all(~ifelse(is.na(.), median(., na.rm = TRUE), .))
# Data Imputation, perfomed by filling all the N/A values with the median of the column

As we can see the names are really complicated, in order to better understand the text, as well as making the graphs easier to read, it is advisable to make the names easier to read

Sorting Variable Names

train <- df1_med
test <- df2_med
train$INDEX <- NULL
test$INDEX <- NULL

clean <- function(train){
    name_list <- names(train)
    name_list <- gsub("TEAM_", "", name_list)
    names(train) <- name_list
    train
}

df1 <- clean(train) 
df2 <- clean(test)

# Now the TEAM part of the variables have been removed

General Calculation of correlation between predictory and outcome variables

Calculate correlation between TEAM_WINS and all other variables, to determine which variable is worthy of creating a barchart for further analyze

# Calculate correlation between TEAM_WINS and all other variables
correlations <- cor(df1)

# Extract correlations for TEAM_WINS
wins_cor <- correlations["TARGET_WINS", ]

# Print correlations for TEAM_WINS only
print(wins_cor)
## TARGET_WINS   BATTING_H  BATTING_2B  BATTING_3B  BATTING_HR  BATTING_BB 
##  1.00000000  0.38876752  0.28910365  0.14260841  0.17615320  0.23255986 
##  BATTING_SO  BASERUN_SB  BASERUN_CS BATTING_HBP  PITCHING_H PITCHING_HR 
## -0.03058135  0.12361087  0.01595982  0.01651641 -0.10993705  0.18901373 
## PITCHING_BB PITCHING_SO  FIELDING_E FIELDING_DP 
##  0.12417454 -0.07579967 -0.17648476 -0.03008630
# Then, we can calculate that correlations between all 16 different variables
DT::datatable(cor(drop_na(df1[,])), options = list(pagelength=5))

From the first correlation table, we can conclude that TEAM_BATTING_H(Base Hits by Batters), TEAM_BATTING_2B (Doubles by Batters), and TEAM_BATTING_BB(Walks by Batters) post a realtively strong correlation with Wins. Hence we are going to further plot a boxplot to better visualize the relationship

cor_res <- cor(df1, use = "complete.obs")
df1 %>% 
  cor(., use = "complete.obs") %>%
  corrplot(., method = "color", type = "upper", tl.col = "black", diag = FALSE)

??corrplot

From the corrplot above, we can see that there are variables that show a strong positive correlation. BATTING_HR has a relatively strong correlation relationship with BATTING_BB AND BATTING_SO, it also has a very strong relationship with PITCHING_HR. BATTING_3B has a relative strong negative correlation with BATTING_HR and BATTING_SO.

Distribution Skewness

df1 %>%    # By creating a Distribution graph, we are able to analyze the skewness of the distrubution of values
  gather(variable, value, TARGET_WINS:FIELDING_DP) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "red", color="red") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

From the distribution graphs above, we can see that most of the data tend to have a distribution skewed to the right, with only several data have a relatively normal distribution, among those, Batting_2B, Batting_BB and Target_Wins are obvious in its relative normal distributions.

Checking for Outliers

ggplot(stack(df1), aes(x = ind, y = values)) + 
  geom_boxplot() + coord_cartesian(ylim = c(0, 2500)) + 
  theme(legend.position="none") + 
  theme(axis.text.x=element_text(angle=45, hjust=1)) + 
  theme(panel.background = element_rect(fill = 'grey'))

The outliers graph matches the distribution graph, indicating there are multiple independent variables have large amount of outliers .One potential impact of outliers is that they can distort the estimated relationships between variables in a statistical model.Especially in linear regression, outliers can pull the regression line away from the true trend of the data and result in biased estimates of the coefficients. Therefore we have to consider the effect of outliers while we performing linear relationship analyses in the end.

Data Preparation

Even though we have used medians to fill up all the missing variables, however, considering the tremendous amount of missing values in BATTING_HBP(Batters hit by pitch), we can drop the variable out of the study for accuracy purposes.

HBP_removed <- function(df1){
  # Drop the hit by pitcher variable
  df1 %>% select(-BATTING_HBP)
}

df1 <- HBP_removed(df1)
df2 <- HBP_removed(df2) %>% na.omit()

Since we have already removed the data that has a tremendous amount of N/A values, and filled median to the columns with N/A values, hence it should be ready to start building linear relationship analytical models.

Building Models

Model 1: Backward selection

To initially use all the variables available, to discover the significance of all values based on our base data, and we can build other models upon the result we obtain from this model.

model1 <- lm (formula = TARGET_WINS ~ . , data = df1) 
summary(model1)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.753  -8.626   0.120   8.395  58.561 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.6421579  5.3902272   4.386 1.21e-05 ***
## BATTING_H    0.0489152  0.0036949  13.239  < 2e-16 ***
## BATTING_2B  -0.0209575  0.0091783  -2.283 0.022501 *  
## BATTING_3B   0.0644788  0.0168040   3.837 0.000128 ***
## BATTING_HR   0.0527325  0.0274915   1.918 0.055219 .  
## BATTING_BB   0.0104483  0.0058377   1.790 0.073621 .  
## BATTING_SO  -0.0084323  0.0025461  -3.312 0.000941 ***
## BASERUN_SB   0.0254236  0.0043565   5.836 6.12e-09 ***
## BASERUN_CS  -0.0110027  0.0157842  -0.697 0.485829    
## PITCHING_H  -0.0008456  0.0003674  -2.302 0.021444 *  
## PITCHING_HR  0.0129626  0.0243894   0.531 0.595135    
## PITCHING_BB  0.0007798  0.0041571   0.188 0.851231    
## PITCHING_SO  0.0028156  0.0009219   3.054 0.002284 ** 
## FIELDING_E  -0.0195325  0.0024609  -7.937 3.23e-15 ***
## FIELDING_DP -0.1217801  0.0129421  -9.410  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.07 on 2261 degrees of freedom
## Multiple R-squared:  0.3154, Adjusted R-squared:  0.3111 
## F-statistic:  74.4 on 14 and 2261 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model1)

The overall estimates predicts well, however, since there are several variables that are not statistically significant (Batting_BB, Baserun_CS, Pitching_HR, Pitching_BB) From the above residual plots:

  1. The variability of the points is approximately the same in the mid values of x with the decrese of variations towards the two end points which depicts that the plot is unbaised and homoscedastic except for few outliers.

  2. Normal QQ plot fulfills the assumptions of normality.

But since few coefficients of the model are not significant, let’s see if assumptions of other models are true.

Model 2: Simple Model

The imple

model2 <- lm(formula = TARGET_WINS ~ BATTING_H + BATTING_2B + BATTING_HR
             + BATTING_BB + BATTING_SO + BASERUN_SB + 
               PITCHING_SO + PITCHING_H
               + FIELDING_E + FIELDING_DP, 
             data = df1)
summary(model2)
## 
## Call:
## lm(formula = TARGET_WINS ~ BATTING_H + BATTING_2B + BATTING_HR + 
##     BATTING_BB + BATTING_SO + BASERUN_SB + PITCHING_SO + PITCHING_H + 
##     FIELDING_E + FIELDING_DP, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.339  -8.693   0.131   8.491  56.732 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20.5750482  5.2324529   3.932 8.67e-05 ***
## BATTING_H    0.0543597  0.0034375  15.814  < 2e-16 ***
## BATTING_2B  -0.0245164  0.0091592  -2.677 0.007489 ** 
## BATTING_HR   0.0555147  0.0091930   6.039 1.81e-09 ***
## BATTING_BB   0.0124752  0.0033780   3.693 0.000227 ***
## BATTING_SO  -0.0098879  0.0024371  -4.057 5.13e-05 ***
## BASERUN_SB   0.0290169  0.0040966   7.083 1.87e-12 ***
## PITCHING_SO  0.0031793  0.0006720   4.731 2.37e-06 ***
## PITCHING_H  -0.0010283  0.0003158  -3.256 0.001146 ** 
## FIELDING_E  -0.0177728  0.0023798  -7.468 1.15e-13 ***
## FIELDING_DP -0.1241527  0.0129589  -9.580  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.11 on 2265 degrees of freedom
## Multiple R-squared:  0.3102, Adjusted R-squared:  0.3072 
## F-statistic: 101.9 on 10 and 2265 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model2)

The P Values are indicating that all of the variables are considered statistically significant. From the above redisual plots: 1. The variability of the points is approximately the same throughout the values of x which depicts that this plot is also unbaised and homoscedastic with very few(minimum) outliers.

  1. Normal QQ plot fulfills the assumptions of normality. ### Model 3 Step Wise Regression

In this regression model, we will only keep those that are really statistically significant in Model 1

model3 <- lm(formula = TARGET_WINS ~ BATTING_2B + BATTING_3B + BASERUN_SB + 
BASERUN_CS + FIELDING_E + FIELDING_DP, data = df1)

summary(model3)
## 
## Call:
## lm(formula = TARGET_WINS ~ BATTING_2B + BATTING_3B + BASERUN_SB + 
##     BASERUN_CS + FIELDING_E + FIELDING_DP, data = df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.143  -9.668   0.286   9.289  64.924 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 61.668048   2.664735  23.142  < 2e-16 ***
## BATTING_2B   0.097707   0.006733  14.512  < 2e-16 ***
## BATTING_3B   0.137972   0.013559  10.176  < 2e-16 ***
## BASERUN_SB   0.027056   0.004166   6.495 1.02e-10 ***
## BASERUN_CS  -0.046707   0.016505  -2.830   0.0047 ** 
## FIELDING_E  -0.020846   0.001578 -13.211  < 2e-16 ***
## FIELDING_DP -0.053614   0.013035  -4.113 4.04e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.18 on 2269 degrees of freedom
## Multiple R-squared:  0.1918, Adjusted R-squared:  0.1896 
## F-statistic: 89.73 on 6 and 2269 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(model3)

From this model, we can see that the p values are all extremly small, only with the exceptance of BASERUN_CS, however, even though the p-value of this variable is relatively big, it is still well below 0.05, at 0.0047

Selecting Models

Using stargazer to compare the three models

stargazer( model1, model2, model3, type = "text")
## 
## =================================================================================================
##                                                  Dependent variable:                             
##                     -----------------------------------------------------------------------------
##                                                      TARGET_WINS                                 
##                                (1)                       (2)                       (3)           
## -------------------------------------------------------------------------------------------------
## BATTING_H                   0.049***                   0.054***                                  
##                              (0.004)                   (0.003)                                   
##                                                                                                  
## BATTING_2B                  -0.021**                  -0.025***                  0.098***        
##                              (0.009)                   (0.009)                   (0.007)         
##                                                                                                  
## BATTING_3B                  0.064***                                             0.138***        
##                              (0.017)                                             (0.014)         
##                                                                                                  
## BATTING_HR                   0.053*                    0.056***                                  
##                              (0.027)                   (0.009)                                   
##                                                                                                  
## BATTING_BB                   0.010*                    0.012***                                  
##                              (0.006)                   (0.003)                                   
##                                                                                                  
## BATTING_SO                  -0.008***                 -0.010***                                  
##                              (0.003)                   (0.002)                                   
##                                                                                                  
## BASERUN_SB                  0.025***                   0.029***                  0.027***        
##                              (0.004)                   (0.004)                   (0.004)         
##                                                                                                  
## BASERUN_CS                   -0.011                                             -0.047***        
##                              (0.016)                                             (0.017)         
##                                                                                                  
## PITCHING_H                  -0.001**                  -0.001***                                  
##                             (0.0004)                   (0.0003)                                  
##                                                                                                  
## PITCHING_HR                   0.013                                                              
##                              (0.024)                                                             
##                                                                                                  
## PITCHING_BB                   0.001                                                              
##                              (0.004)                                                             
##                                                                                                  
## PITCHING_SO                 0.003***                   0.003***                                  
##                              (0.001)                   (0.001)                                   
##                                                                                                  
## FIELDING_E                  -0.020***                 -0.018***                 -0.021***        
##                              (0.002)                   (0.002)                   (0.002)         
##                                                                                                  
## FIELDING_DP                 -0.122***                 -0.124***                 -0.054***        
##                              (0.013)                   (0.013)                   (0.013)         
##                                                                                                  
## Constant                    23.642***                 20.575***                 61.668***        
##                              (5.390)                   (5.232)                   (2.665)         
##                                                                                                  
## -------------------------------------------------------------------------------------------------
## Observations                  2,276                     2,276                     2,276          
## R2                            0.315                     0.310                     0.192          
## Adjusted R2                   0.311                     0.307                     0.190          
## Residual Std. Error    13.074 (df = 2261)         13.111 (df = 2265)        14.180 (df = 2269)   
## F Statistic         74.399*** (df = 14; 2261) 101.865*** (df = 10; 2265) 89.734*** (df = 6; 2269)
## =================================================================================================
## Note:                                                                 *p<0.1; **p<0.05; ***p<0.01

Calculating Multicolinearity

vif(model1)
##   BATTING_H  BATTING_2B  BATTING_3B  BATTING_HR  BATTING_BB  BATTING_SO 
##    3.798921    2.455952    2.933662   36.876862    6.825564    5.090936 
##  BASERUN_SB  BASERUN_CS  PITCHING_H PITCHING_HR PITCHING_BB PITCHING_SO 
##    1.842541    1.165281    3.555278   29.749612    6.365752    3.305543 
##  FIELDING_E FIELDING_DP 
##    4.181673    1.342312
vif(model2)
##   BATTING_H  BATTING_2B  BATTING_HR  BATTING_BB  BATTING_SO  BASERUN_SB 
##    3.269276    2.431685    4.099876    2.272356    4.637931    1.619950 
## PITCHING_SO  PITCHING_H  FIELDING_E FIELDING_DP 
##    1.746238    2.612103    3.888127    1.338094
vif(model3)
##  BATTING_2B  BATTING_3B  BASERUN_SB  BASERUN_CS  FIELDING_E FIELDING_DP 
##    1.123365    1.623667    1.432052    1.083042    1.461507    1.157426

From the VIF(Variance Inflation Factors) function above, we can see that initially, in model 1, BATTING_HR and PITCHING_BB has a significantly high value,meaning that they were significantly impacting the model due to the high colinear relationship, and BATTING_BB, BATTING_SO, PITCHING_H, PITCHING_BB and FEILDING_E had relative colinear relationship that were mildly impacting the model.

Moving to the second model, after eliminating variables that were less statistically significant, we had successfully lowered the colinearity between variables, with only a few variables remained noticable.

In model 3, all 6 variables were left in the acceptable range of colinearity, hence the model would not be significantly impacted.

Computing Prediction

# Make predictions for model 1
predictions1 <- model1 %>% predict(df1)
# Model performance
data.frame(
  RMSE = RMSE(predictions1, df1$TARGET_WINS,na.rm = TRUE),
  R2 = R2(predictions1,df1$TARGET_WINS,na.rm = TRUE)
)
##       RMSE        R2
## 1 13.03069 0.3153854
# Make predictions for model 2
predictions2 <- model2 %>% predict(df1)
# Model performance
data.frame(
  RMSE = RMSE(predictions2, df1$TARGET_WINS,na.rm = TRUE),
  R2 = R2(predictions2,df1$TARGET_WINS,na.rm = TRUE)
)
##       RMSE        R2
## 1 13.07977 0.3102187
# Make predictions for model 3
predictions3 <- model3 %>% predict(df1)
# Model performance
data.frame(
  RMSE = RMSE(predictions3, df1$TARGET_WINS,na.rm = TRUE),
  R2 = R2(predictions3,df1$TARGET_WINS,na.rm = TRUE)
)
##       RMSE        R2
## 1 14.15824 0.1917802

The RMSE across three models are similar, however, model 3 has a significantly lower R2 comparing with the previous models, considering that it only had 6 variables left with a lower R2 and a F-statistic, it is not advisable to use model 3 as our concluding model2.

Further exploring model 1 and model 2, considering they had similar R2 and RMSE, one deciding confounding factor that seperate them apart would be Model 1 contained variables that had high Variance Inflation Factors, meaning they post high impact on the model due to high colinearity, hence it would be most advisable to use model 2 as our concluding factor because of the high R2 and F-Statistic Value