In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
dim(moneyball_training_data)
## [1] 2276 17
The dataset consists of 17 elements, with 2276 total cases. There are 15 explanatory variables, which can be categorized into four groups:
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | na_count | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TARGET_WINS | 2 | 191 | 80.92670 | 12.115013 | 82 | 81.11765 | 13.3434 | 43 | 116 | 73 | -0.1698314 | -0.2952783 | 0.8766116 | 0 |
TEAM_BATTING_H | 3 | 191 | 1478.62827 | 76.147869 | 1477 | 1477.42484 | 74.1300 | 1308 | 1667 | 359 | 0.1302702 | -0.3710350 | 5.5098664 | 0 |
TEAM_BATTING_2B | 4 | 191 | 297.19895 | 26.329335 | 296 | 296.62745 | 25.2042 | 201 | 373 | 172 | 0.0915189 | 0.4778716 | 1.9051238 | 0 |
TEAM_BATTING_3B | 5 | 191 | 30.74346 | 9.043878 | 29 | 30.13072 | 8.8956 | 12 | 61 | 49 | 0.7007420 | 0.7446217 | 0.6543921 | 0 |
TEAM_BATTING_HR | 6 | 191 | 178.05236 | 32.413243 | 175 | 176.81046 | 35.5824 | 116 | 260 | 144 | 0.2980673 | -0.7172373 | 2.3453399 | 0 |
TEAM_BATTING_BB | 7 | 191 | 543.31937 | 74.842133 | 535 | 541.31373 | 74.1300 | 365 | 775 | 410 | 0.3115199 | -0.1474175 | 5.4153867 | 0 |
TEAM_BATTING_SO | 8 | 191 | 1051.02618 | 104.156382 | 1050 | 1046.95425 | 97.8516 | 805 | 1399 | 594 | 0.3985050 | 0.3955105 | 7.5364913 | 102 |
TEAM_BASERUN_SB | 9 | 191 | 90.90576 | 29.916401 | 87 | 89.06536 | 29.6520 | 31 | 177 | 146 | 0.5553966 | -0.1414909 | 2.1646748 | 131 |
TEAM_BASERUN_CS | 10 | 191 | 39.94241 | 11.898334 | 38 | 39.49020 | 11.8608 | 12 | 74 | 62 | 0.3468509 | 0.0006392 | 0.8609332 | 772 |
TEAM_BATTING_HBP | 11 | 191 | 59.35602 | 12.967123 | 58 | 58.86275 | 11.8608 | 29 | 95 | 66 | 0.3185754 | -0.1119828 | 0.9382681 | 2085 |
TEAM_PITCHING_H | 12 | 191 | 1479.70157 | 75.788625 | 1480 | 1478.50327 | 72.6474 | 1312 | 1667 | 355 | 0.1279056 | -0.3894781 | 5.4838725 | 0 |
TEAM_PITCHING_HR | 13 | 191 | 178.17801 | 32.391678 | 175 | 176.93464 | 35.5824 | 116 | 260 | 144 | 0.2989191 | -0.7190905 | 2.3437795 | 0 |
TEAM_PITCHING_BB | 14 | 191 | 543.71728 | 74.916681 | 537 | 541.74510 | 72.6474 | 367 | 775 | 408 | 0.3144366 | -0.1338563 | 5.4207808 | 0 |
TEAM_PITCHING_SO | 15 | 191 | 1051.81675 | 104.347208 | 1052 | 1047.80392 | 97.8516 | 805 | 1399 | 594 | 0.3945586 | 0.3903991 | 7.5502990 | 102 |
TEAM_FIELDING_E | 16 | 191 | 107.05236 | 16.632162 | 106 | 106.58170 | 17.7912 | 65 | 145 | 80 | 0.1780432 | -0.3567367 | 1.2034610 | 0 |
TEAM_FIELDING_DP | 17 | 191 | 152.33508 | 17.611682 | 152 | 152.04575 | 19.2738 | 113 | 204 | 91 | 0.2164822 | -0.2115741 | 1.2743366 | 286 |
There are multiple variables with missing (NA) values, with TEAM-BATTING_HBP being the highest.
The boxplots below help show the spread of data within the dataset, and show various outliers. As shown in the graph below, TEAM_PITCHING_H seems to have the highest spread with the most outliers.
## Warning: Removed 3478 rows containing non-finite values (stat_boxplot).
The graph below zooms into the other variables, so it becomes easier to see spread and outliers from the other variables.
In the Histograms below, the data shows multiple graphs with right skews while only a few have left-skew.
The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.
Let’s break down the correlation by wins - since that’s what we’re trying to predict.
x | |
---|---|
TARGET_WINS | 1.0000000 |
TEAM_BATTING_H | 0.4699467 |
TEAM_BATTING_2B | 0.3129840 |
TEAM_BATTING_3B | -0.1243459 |
TEAM_BATTING_HR | 0.4224168 |
TEAM_BATTING_BB | 0.4686879 |
TEAM_BATTING_SO | -0.2288927 |
TEAM_BASERUN_SB | 0.0148364 |
TEAM_BASERUN_CS | -0.1787560 |
TEAM_BATTING_HBP | 0.0735042 |
TEAM_PITCHING_H | 0.4712343 |
TEAM_PITCHING_HR | 0.4224668 |
TEAM_PITCHING_BB | 0.4683988 |
TEAM_PITCHING_SO | -0.2293648 |
TEAM_FIELDING_E | -0.3866880 |
TEAM_FIELDING_DP | -0.1958660 |
Below is the plot correlation.
The variable TEAM_BATTING_HBP has over 90% missing values. That variable will be removed completely.
The variable TEAM_PITCHING_HR and TEAM_BATTING_HR are also very closely correlated with each other. This shows that there may be some collinearity involved. The TEAM_PITCHING_HR variable will be dropped from the dataset
The data exploration revealed multiple variables that had numerious NA values. There are multiple ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.
Imputation the mean/median/mode is an easy way to fill in the missing NA’s, however it reduces the variance in the dataset and shrinks standard errors - which can invalidate hypothesis tests.
In this case, data will be imputated via prediction using the MICE (Multivariate Imputation) library using a random forest prediction method.
vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TARGET_WINS | 1 | 2276 | 80.79086 | 15.75215 | 82.0 | 81.31229 | 14.8260 | 0 | 146 | 146 | -0.3987232 | 1.0274757 | 0.3301823 |
TEAM_BATTING_H | 2 | 2276 | 1469.26977 | 144.59120 | 1454.0 | 1459.04116 | 114.1602 | 891 | 2554 | 1663 | 1.5713335 | 7.2785261 | 3.0307891 |
TEAM_BATTING_2B | 3 | 2276 | 241.24692 | 46.80141 | 238.0 | 240.39627 | 47.4432 | 69 | 458 | 389 | 0.2151018 | 0.0061609 | 0.9810087 |
TEAM_BATTING_3B | 4 | 2276 | 55.25000 | 27.93856 | 47.0 | 52.17563 | 23.7216 | 0 | 223 | 223 | 1.1094652 | 1.5032418 | 0.5856226 |
TEAM_BATTING_HR | 5 | 2276 | 99.61204 | 60.54687 | 102.0 | 97.38529 | 78.5778 | 0 | 264 | 264 | 0.1860421 | -0.9631189 | 1.2691285 |
TEAM_BATTING_BB | 6 | 2276 | 501.55888 | 122.67086 | 512.0 | 512.18331 | 94.8864 | 0 | 878 | 878 | -1.0257599 | 2.1828544 | 2.5713150 |
TEAM_BATTING_SO | 7 | 2276 | 729.44508 | 245.63744 | 734.5 | 734.43030 | 279.4701 | 0 | 1399 | 1399 | -0.2394171 | -0.3199559 | 5.1488285 |
TEAM_BASERUN_SB | 8 | 2276 | 131.90949 | 96.62529 | 104.0 | 116.04720 | 66.7170 | 0 | 697 | 697 | 1.8784660 | 4.5762445 | 2.0253715 |
TEAM_BASERUN_CS | 9 | 2276 | 66.06634 | 40.68995 | 53.0 | 58.78814 | 23.7216 | 0 | 201 | 201 | 1.6897117 | 2.4838474 | 0.8529056 |
TEAM_PITCHING_H | 10 | 2276 | 1779.21046 | 1406.84293 | 1518.0 | 1555.89517 | 174.9468 | 1137 | 30132 | 28995 | 10.3295111 | 141.8396985 | 29.4889618 |
TEAM_PITCHING_HR | 11 | 2276 | 105.69859 | 61.29875 | 107.0 | 103.15697 | 74.1300 | 0 | 343 | 343 | 0.2877877 | -0.6046311 | 1.2848886 |
TEAM_PITCHING_SO | 12 | 2276 | 811.01582 | 542.27942 | 802.0 | 789.97530 | 256.4898 | 0 | 19278 | 19278 | 22.5067680 | 694.9733696 | 11.3667679 |
TEAM_FIELDING_E | 13 | 2276 | 246.48067 | 227.77097 | 159.0 | 193.43798 | 62.2692 | 65 | 1898 | 1833 | 2.9904656 | 10.9702717 | 4.7743279 |
TEAM_FIELDING_DP | 14 | 2276 | 142.85325 | 28.21508 | 146.0 | 144.04775 | 26.6868 | 52 | 228 | 176 | -0.3746132 | -0.0320501 | 0.5914189 |
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -56.047 -8.426 0.163 8.301 54.143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.738e+01 5.217e+00 5.248 1.68e-07 ***
## TEAM_BATTING_H 4.758e-02 3.622e-03 13.138 < 2e-16 ***
## TEAM_BATTING_2B -2.423e-02 9.015e-03 -2.687 0.00726 **
## TEAM_BATTING_3B 4.881e-02 1.662e-02 2.936 0.00335 **
## TEAM_BATTING_HR 6.208e-02 2.378e-02 2.611 0.00909 **
## TEAM_BATTING_BB 9.773e-03 3.161e-03 3.092 0.00201 **
## TEAM_BATTING_SO -1.247e-02 2.424e-03 -5.142 2.95e-07 ***
## TEAM_BASERUN_SB 3.871e-02 4.139e-03 9.352 < 2e-16 ***
## TEAM_BASERUN_CS 8.383e-05 9.045e-03 0.009 0.99261
## TEAM_PITCHING_H -2.748e-04 3.304e-04 -0.832 0.40561
## TEAM_PITCHING_HR 1.424e-02 2.075e-02 0.686 0.49255
## TEAM_PITCHING_SO 2.178e-03 6.663e-04 3.270 0.00109 **
## TEAM_FIELDING_E -3.090e-02 2.511e-03 -12.302 < 2e-16 ***
## TEAM_FIELDING_DP -1.094e-01 1.245e-02 -8.788 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.87 on 2262 degrees of freedom
## Multiple R-squared: 0.336, Adjusted R-squared: 0.3322
## F-statistic: 88.06 on 13 and 2262 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP,
## data = imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.831 -8.439 0.112 8.291 54.626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.026713 4.984325 5.422 6.51e-08 ***
## TEAM_BATTING_H 0.047465 0.003563 13.322 < 2e-16 ***
## TEAM_BATTING_2B -0.024656 0.008992 -2.742 0.006155 **
## TEAM_BATTING_3B 0.052110 0.016242 3.208 0.001353 **
## TEAM_BATTING_HR 0.076492 0.009498 8.054 1.28e-15 ***
## TEAM_BATTING_BB 0.009775 0.003158 3.095 0.001990 **
## TEAM_BATTING_SO -0.012085 0.002370 -5.100 3.68e-07 ***
## TEAM_BASERUN_SB 0.039431 0.003827 10.304 < 2e-16 ***
## TEAM_PITCHING_SO 0.001976 0.000586 3.372 0.000758 ***
## TEAM_FIELDING_E -0.031771 0.002074 -15.321 < 2e-16 ***
## TEAM_FIELDING_DP -0.109107 0.012285 -8.881 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.87 on 2265 degrees of freedom
## Multiple R-squared: 0.3357, Adjusted R-squared: 0.3328
## F-statistic: 114.5 on 10 and 2265 DF, p-value: < 2.2e-16
moneyball_training_data <- read_csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/moneyball-training-data.csv")
mbd1 <- describe(moneyball_training_data, na.rm = F)
mbd1$na_count <- sapply(moneyball_training_data, function(y) sum(length(which(is.na(y)))))
mbd1 <- mbd1[-1,]
kable(mbd1, "html", escape = F) %>%
kable_styling("striped", full_width = T) %>%
column_spec(1, bold = T) %>%
scroll_box(width = "100%", height = "700px")
ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) +
geom_boxplot() +
theme(legend.position="none") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(panel.background = element_rect(fill = '#d0ddf2'))
ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 800)) +
theme(legend.position="none") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(panel.background = element_rect(fill = '#d0ddf2'))
mb_hist <- moneyball_training_data
mb_hist <- mb_hist[,-1 ]
mb_hist %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(bins = 35)
kable(cor(drop_na(mb_hist))[,1], "html", escape = F) %>%
kable_styling("striped", full_width = F) %>%
column_spec(1, bold = T) %>%
scroll_box(height = "500px")
corrgram(drop_na(mb_hist), order=TRUE,
upper.panel=panel.cor, main="Moneyball")
mbd2 <- moneyball_training_data
mbd2 <- mbd2[,-1]
mbd2 <- mbd2[,-10]
mbd2 <- mbd2[,-12]
library(mice)
init = mice(mbd2, maxit=0)
meth = init$method
predM = init$predictorMatrix
predM[, c("TARGET_WINS")]=0
imputed = mice(mbd2, method="rf", predictorMatrix=predM, m=5)
imputed <- complete(imputed)
imputedtable <- describe(imputed)
kable(imputedtable, "html", escape = F) %>%
kable_styling("striped", full_width = T) %>%
column_spec(1, bold = T) %>%
scroll_box(width = "100%", height = "700px")
model1 <- lm(TARGET_WINS ~., imputed)
(summary(model1))
model2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, imputed)
summary(model2)