Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Data Exploration

dim(moneyball_training_data)
## [1] 2276   17

The dataset consists of 17 elements, with 2276 total cases. There are 15 explanatory variables, which can be categorized into four groups:

vars n mean sd median trimmed mad min max range skew kurtosis se na_count
TARGET_WINS 2 191 80.92670 12.115013 82 81.11765 13.3434 43 116 73 -0.1698314 -0.2952783 0.8766116 0
TEAM_BATTING_H 3 191 1478.62827 76.147869 1477 1477.42484 74.1300 1308 1667 359 0.1302702 -0.3710350 5.5098664 0
TEAM_BATTING_2B 4 191 297.19895 26.329335 296 296.62745 25.2042 201 373 172 0.0915189 0.4778716 1.9051238 0
TEAM_BATTING_3B 5 191 30.74346 9.043878 29 30.13072 8.8956 12 61 49 0.7007420 0.7446217 0.6543921 0
TEAM_BATTING_HR 6 191 178.05236 32.413243 175 176.81046 35.5824 116 260 144 0.2980673 -0.7172373 2.3453399 0
TEAM_BATTING_BB 7 191 543.31937 74.842133 535 541.31373 74.1300 365 775 410 0.3115199 -0.1474175 5.4153867 0
TEAM_BATTING_SO 8 191 1051.02618 104.156382 1050 1046.95425 97.8516 805 1399 594 0.3985050 0.3955105 7.5364913 102
TEAM_BASERUN_SB 9 191 90.90576 29.916401 87 89.06536 29.6520 31 177 146 0.5553966 -0.1414909 2.1646748 131
TEAM_BASERUN_CS 10 191 39.94241 11.898334 38 39.49020 11.8608 12 74 62 0.3468509 0.0006392 0.8609332 772
TEAM_BATTING_HBP 11 191 59.35602 12.967123 58 58.86275 11.8608 29 95 66 0.3185754 -0.1119828 0.9382681 2085
TEAM_PITCHING_H 12 191 1479.70157 75.788625 1480 1478.50327 72.6474 1312 1667 355 0.1279056 -0.3894781 5.4838725 0
TEAM_PITCHING_HR 13 191 178.17801 32.391678 175 176.93464 35.5824 116 260 144 0.2989191 -0.7190905 2.3437795 0
TEAM_PITCHING_BB 14 191 543.71728 74.916681 537 541.74510 72.6474 367 775 408 0.3144366 -0.1338563 5.4207808 0
TEAM_PITCHING_SO 15 191 1051.81675 104.347208 1052 1047.80392 97.8516 805 1399 594 0.3945586 0.3903991 7.5502990 102
TEAM_FIELDING_E 16 191 107.05236 16.632162 106 106.58170 17.7912 65 145 80 0.1780432 -0.3567367 1.2034610 0
TEAM_FIELDING_DP 17 191 152.33508 17.611682 152 152.04575 19.2738 113 204 91 0.2164822 -0.2115741 1.2743366 286

There are multiple variables with missing (NA) values, with TEAM-BATTING_HBP being the highest.

The boxplots below help show the spread of data within the dataset, and show various outliers. As shown in the graph below, TEAM_PITCHING_H seems to have the highest spread with the most outliers.

## Warning: Removed 3478 rows containing non-finite values (stat_boxplot).

The graph below zooms into the other variables, so it becomes easier to see spread and outliers from the other variables.

In the Histograms below, the data shows multiple graphs with right skews while only a few have left-skew.

The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.

Let’s break down the correlation by wins - since that’s what we’re trying to predict.

x
TARGET_WINS 1.0000000
TEAM_BATTING_H 0.4699467
TEAM_BATTING_2B 0.3129840
TEAM_BATTING_3B -0.1243459
TEAM_BATTING_HR 0.4224168
TEAM_BATTING_BB 0.4686879
TEAM_BATTING_SO -0.2288927
TEAM_BASERUN_SB 0.0148364
TEAM_BASERUN_CS -0.1787560
TEAM_BATTING_HBP 0.0735042
TEAM_PITCHING_H 0.4712343
TEAM_PITCHING_HR 0.4224668
TEAM_PITCHING_BB 0.4683988
TEAM_PITCHING_SO -0.2293648
TEAM_FIELDING_E -0.3866880
TEAM_FIELDING_DP -0.1958660

Below is the plot correlation.

Data Preparation

Removal of Data

The variable TEAM_BATTING_HBP has over 90% missing values. That variable will be removed completely.

The variable TEAM_PITCHING_HR and TEAM_BATTING_HR are also very closely correlated with each other. This shows that there may be some collinearity involved. The TEAM_PITCHING_HR variable will be dropped from the dataset

Imputation of Missing (NA) values

The data exploration revealed multiple variables that had numerious NA values. There are multiple ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.

Imputation the mean/median/mode is an easy way to fill in the missing NA’s, however it reduces the variance in the dataset and shrinks standard errors - which can invalidate hypothesis tests.

In this case, data will be imputated via prediction using the MICE (Multivariate Imputation) library using a random forest prediction method.

Output - The below table shows the results of above data manipulation

vars n mean sd median trimmed mad min max range skew kurtosis se
TARGET_WINS 1 2276 80.79086 15.75215 82.0 81.31229 14.8260 0 146 146 -0.3987232 1.0274757 0.3301823
TEAM_BATTING_H 2 2276 1469.26977 144.59120 1454.0 1459.04116 114.1602 891 2554 1663 1.5713335 7.2785261 3.0307891
TEAM_BATTING_2B 3 2276 241.24692 46.80141 238.0 240.39627 47.4432 69 458 389 0.2151018 0.0061609 0.9810087
TEAM_BATTING_3B 4 2276 55.25000 27.93856 47.0 52.17563 23.7216 0 223 223 1.1094652 1.5032418 0.5856226
TEAM_BATTING_HR 5 2276 99.61204 60.54687 102.0 97.38529 78.5778 0 264 264 0.1860421 -0.9631189 1.2691285
TEAM_BATTING_BB 6 2276 501.55888 122.67086 512.0 512.18331 94.8864 0 878 878 -1.0257599 2.1828544 2.5713150
TEAM_BATTING_SO 7 2276 729.44508 245.63744 734.5 734.43030 279.4701 0 1399 1399 -0.2394171 -0.3199559 5.1488285
TEAM_BASERUN_SB 8 2276 131.90949 96.62529 104.0 116.04720 66.7170 0 697 697 1.8784660 4.5762445 2.0253715
TEAM_BASERUN_CS 9 2276 66.06634 40.68995 53.0 58.78814 23.7216 0 201 201 1.6897117 2.4838474 0.8529056
TEAM_PITCHING_H 10 2276 1779.21046 1406.84293 1518.0 1555.89517 174.9468 1137 30132 28995 10.3295111 141.8396985 29.4889618
TEAM_PITCHING_HR 11 2276 105.69859 61.29875 107.0 103.15697 74.1300 0 343 343 0.2877877 -0.6046311 1.2848886
TEAM_PITCHING_SO 12 2276 811.01582 542.27942 802.0 789.97530 256.4898 0 19278 19278 22.5067680 694.9733696 11.3667679
TEAM_FIELDING_E 13 2276 246.48067 227.77097 159.0 193.43798 62.2692 65 1898 1833 2.9904656 10.9702717 4.7743279
TEAM_FIELDING_DP 14 2276 142.85325 28.21508 146.0 144.04775 26.6868 52 228 176 -0.3746132 -0.0320501 0.5914189

Build Models

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.047  -8.426   0.163   8.301  54.143 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.738e+01  5.217e+00   5.248 1.68e-07 ***
## TEAM_BATTING_H    4.758e-02  3.622e-03  13.138  < 2e-16 ***
## TEAM_BATTING_2B  -2.423e-02  9.015e-03  -2.687  0.00726 ** 
## TEAM_BATTING_3B   4.881e-02  1.662e-02   2.936  0.00335 ** 
## TEAM_BATTING_HR   6.208e-02  2.378e-02   2.611  0.00909 ** 
## TEAM_BATTING_BB   9.773e-03  3.161e-03   3.092  0.00201 ** 
## TEAM_BATTING_SO  -1.247e-02  2.424e-03  -5.142 2.95e-07 ***
## TEAM_BASERUN_SB   3.871e-02  4.139e-03   9.352  < 2e-16 ***
## TEAM_BASERUN_CS   8.383e-05  9.045e-03   0.009  0.99261    
## TEAM_PITCHING_H  -2.748e-04  3.304e-04  -0.832  0.40561    
## TEAM_PITCHING_HR  1.424e-02  2.075e-02   0.686  0.49255    
## TEAM_PITCHING_SO  2.178e-03  6.663e-04   3.270  0.00109 ** 
## TEAM_FIELDING_E  -3.090e-02  2.511e-03 -12.302  < 2e-16 ***
## TEAM_FIELDING_DP -1.094e-01  1.245e-02  -8.788  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.87 on 2262 degrees of freedom
## Multiple R-squared:  0.336,  Adjusted R-squared:  0.3322 
## F-statistic: 88.06 on 13 and 2262 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.831  -8.439   0.112   8.291  54.626 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      27.026713   4.984325   5.422 6.51e-08 ***
## TEAM_BATTING_H    0.047465   0.003563  13.322  < 2e-16 ***
## TEAM_BATTING_2B  -0.024656   0.008992  -2.742 0.006155 ** 
## TEAM_BATTING_3B   0.052110   0.016242   3.208 0.001353 ** 
## TEAM_BATTING_HR   0.076492   0.009498   8.054 1.28e-15 ***
## TEAM_BATTING_BB   0.009775   0.003158   3.095 0.001990 ** 
## TEAM_BATTING_SO  -0.012085   0.002370  -5.100 3.68e-07 ***
## TEAM_BASERUN_SB   0.039431   0.003827  10.304  < 2e-16 ***
## TEAM_PITCHING_SO  0.001976   0.000586   3.372 0.000758 ***
## TEAM_FIELDING_E  -0.031771   0.002074 -15.321  < 2e-16 ***
## TEAM_FIELDING_DP -0.109107   0.012285  -8.881  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.87 on 2265 degrees of freedom
## Multiple R-squared:  0.3357, Adjusted R-squared:  0.3328 
## F-statistic: 114.5 on 10 and 2265 DF,  p-value: < 2.2e-16

Select Models

Appendex

moneyball_training_data <- read_csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/moneyball-training-data.csv")


mbd1 <- describe(moneyball_training_data, na.rm = F)

mbd1$na_count <- sapply(moneyball_training_data, function(y) sum(length(which(is.na(y)))))
mbd1 <- mbd1[-1,]


kable(mbd1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")

ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) + 
  geom_boxplot() +
  theme(legend.position="none") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) + 
  theme(panel.background = element_rect(fill = '#d0ddf2'))


ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) + 
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 800)) +
  theme(legend.position="none") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) + 
  theme(panel.background = element_rect(fill = '#d0ddf2'))


mb_hist <- moneyball_training_data
mb_hist <- mb_hist[,-1 ]

mb_hist %>%
  keep(is.numeric) %>%                     
  gather() %>%                             
  ggplot(aes(value)) +                     
    facet_wrap(~ key, scales = "free") +  
    geom_histogram(bins = 35)                 



kable(cor(drop_na(mb_hist))[,1], "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T) %>%
  scroll_box(height = "500px")

corrgram(drop_na(mb_hist), order=TRUE,
         upper.panel=panel.cor, main="Moneyball")



mbd2 <- moneyball_training_data
mbd2 <- mbd2[,-1]
mbd2 <- mbd2[,-10]
mbd2 <- mbd2[,-12]



library(mice)
init = mice(mbd2, maxit=0) 
meth = init$method
predM = init$predictorMatrix

predM[, c("TARGET_WINS")]=0

imputed = mice(mbd2, method="rf", predictorMatrix=predM, m=5)

imputed <- complete(imputed)

imputedtable <- describe(imputed)

kable(imputedtable, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")

model1 <- lm(TARGET_WINS ~., imputed)

(summary(model1))

model2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, imputed)

summary(model2)