1 Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

2 Data Exploration

dim(moneyball_training_data)
## [1] 2276   17

The dataset consists of 17 elements, with 2276 total cases. There are 15 explanatory variables, which can be categorized into four groups:

  • batting
  • baserun
  • fielding
  • pitching
vars n mean sd median trimmed mad min max range skew kurtosis se na_count
TARGET_WINS 2 191 80.92670 12.115013 82 81.11765 13.3434 43 116 73 -0.1698314 -0.2952783 0.8766116 0
TEAM_BATTING_H 3 191 1478.62827 76.147869 1477 1477.42484 74.1300 1308 1667 359 0.1302702 -0.3710350 5.5098664 0
TEAM_BATTING_2B 4 191 297.19895 26.329335 296 296.62745 25.2042 201 373 172 0.0915189 0.4778716 1.9051238 0
TEAM_BATTING_3B 5 191 30.74346 9.043878 29 30.13072 8.8956 12 61 49 0.7007420 0.7446217 0.6543921 0
TEAM_BATTING_HR 6 191 178.05236 32.413243 175 176.81046 35.5824 116 260 144 0.2980673 -0.7172373 2.3453399 0
TEAM_BATTING_BB 7 191 543.31937 74.842133 535 541.31373 74.1300 365 775 410 0.3115199 -0.1474175 5.4153867 0
TEAM_BATTING_SO 8 191 1051.02618 104.156382 1050 1046.95425 97.8516 805 1399 594 0.3985050 0.3955105 7.5364913 102
TEAM_BASERUN_SB 9 191 90.90576 29.916401 87 89.06536 29.6520 31 177 146 0.5553966 -0.1414909 2.1646748 131
TEAM_BASERUN_CS 10 191 39.94241 11.898334 38 39.49020 11.8608 12 74 62 0.3468509 0.0006392 0.8609332 772
TEAM_BATTING_HBP 11 191 59.35602 12.967123 58 58.86275 11.8608 29 95 66 0.3185754 -0.1119828 0.9382681 2085
TEAM_PITCHING_H 12 191 1479.70157 75.788625 1480 1478.50327 72.6474 1312 1667 355 0.1279056 -0.3894781 5.4838725 0
TEAM_PITCHING_HR 13 191 178.17801 32.391678 175 176.93464 35.5824 116 260 144 0.2989191 -0.7190905 2.3437795 0
TEAM_PITCHING_BB 14 191 543.71728 74.916681 537 541.74510 72.6474 367 775 408 0.3144366 -0.1338563 5.4207808 0
TEAM_PITCHING_SO 15 191 1051.81675 104.347208 1052 1047.80392 97.8516 805 1399 594 0.3945586 0.3903991 7.5502990 102
TEAM_FIELDING_E 16 191 107.05236 16.632162 106 106.58170 17.7912 65 145 80 0.1780432 -0.3567367 1.2034610 0
TEAM_FIELDING_DP 17 191 152.33508 17.611682 152 152.04575 19.2738 113 204 91 0.2164822 -0.2115741 1.2743366 286

Looking at the data above, there are multiple variables with missing (NA) values, with TEAM-BATTING_HBP being the highest.

The boxplots below help show the spread of data within the dataset, and show various outliers. As shown in the graph below, TEAM_PITCHING_H seems to have the highest spread with the most outliers.

## Warning: Removed 3478 rows containing non-finite values (stat_boxplot).

The graph below zooms into the other variables, so it becomes easier to see spread and outliers from the other variables.

In the Histograms below, the data shows multiple graphs with right skews while only a few have left-skew.

The above boxplots show all of the variables listed in the dataset. This visualization may assist in showing how the data is spread.

The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.

For this project, it makes sense to break down the correlation by wins - since that’s what we’re trying to predict.

x
TARGET_WINS 1.0000000
TEAM_BATTING_H 0.4699467
TEAM_BATTING_2B 0.3129840
TEAM_BATTING_3B -0.1243459
TEAM_BATTING_HR 0.4224168
TEAM_BATTING_BB 0.4686879
TEAM_BATTING_SO -0.2288927
TEAM_BASERUN_SB 0.0148364
TEAM_BASERUN_CS -0.1787560
TEAM_BATTING_HBP 0.0735042
TEAM_PITCHING_H 0.4712343
TEAM_PITCHING_HR 0.4224668
TEAM_PITCHING_BB 0.4683988
TEAM_PITCHING_SO -0.2293648
TEAM_FIELDING_E -0.3866880
TEAM_FIELDING_DP -0.1958660

Below is a visual representation of the correlation plot.

According to the coorelation graph, batting_h, batting_2b, batting_hr, batting_bb, pitching_h, pitching_hr, and pitching_bb are the most positively correlated.

3 Data Preparation

3.1 Removal of Data

The variable TEAM_BATTING_HBP is also missing over 90% of its values. That variable will be removed completely.

The variable TEAM_PITCHING_HR and TEAM_BATTING_HR are also very closely correlated with each other. This shows that there may be some collinearity involved. The TEAM_PITCHING_HR variable will be dropped from the dataset

Using the VIF and vifstep function from the USDM package, the data will first be tested for other collinearity issues. The variables determined that have collinearity issues will be discarded.

3.2 Imputation of Missing (NA) values

The data exploration revealed multiple variables that had numerious NA values. There are multiple ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.

Imputation the mean/median/mode is an easy way to fill in the missing NA’s, however it reduces the variance in the dataset and shrinks standard errors - which can invalidate hypothesis tests.

In this case, data will be imputated via prediction using the MICE (Multivariate Imputation) library using a random forest prediction method.

Variables that exceed the established threshold will be discarded to avoid collinearity issues.

vif(imputed)
##           Variables      VIF
## 1       TARGET_WINS 1.485532
## 2    TEAM_BATTING_H 3.991879
## 3   TEAM_BATTING_2B 2.449123
## 4   TEAM_BATTING_3B 2.977893
## 5   TEAM_BATTING_HR 4.876431
## 6   TEAM_BATTING_BB 5.565493
## 7   TEAM_BATTING_SO 5.178738
## 8   TEAM_BASERUN_SB 2.278903
## 9   TEAM_BASERUN_CS 1.818298
## 10  TEAM_PITCHING_H 3.661374
## 11 TEAM_PITCHING_BB 4.737911
## 12 TEAM_PITCHING_SO 2.977236
## 13  TEAM_FIELDING_E 4.625943
## 14 TEAM_FIELDING_DP 1.772751
v1 <- vifstep(imputed, th=10)

3.3 Output - The below table shows the results of above data manipulation.

The NA data has been ‘filled in’ using the MICE prediction, using the Random Forest Method. Variables with collinearity as established by the vir/virstep package have been dropped.

vars n mean sd median trimmed mad min max range skew kurtosis se
TARGET_WINS 1 2276 80.79086 15.75215 82.0 81.31229 14.8260 0 146 146 -0.3987232 1.0274757 0.3301823
TEAM_BATTING_H 2 2276 1469.26977 144.59120 1454.0 1459.04116 114.1602 891 2554 1663 1.5713335 7.2785261 3.0307891
TEAM_BATTING_2B 3 2276 241.24692 46.80141 238.0 240.39627 47.4432 69 458 389 0.2151018 0.0061609 0.9810087
TEAM_BATTING_3B 4 2276 55.25000 27.93856 47.0 52.17563 23.7216 0 223 223 1.1094652 1.5032418 0.5856226
TEAM_BATTING_HR 5 2276 99.61204 60.54687 102.0 97.38529 78.5778 0 264 264 0.1860421 -0.9631189 1.2691285
TEAM_BATTING_BB 6 2276 501.55888 122.67086 512.0 512.18331 94.8864 0 878 878 -1.0257599 2.1828544 2.5713150
TEAM_BATTING_SO 7 2276 730.25264 245.68705 737.5 735.48683 277.9875 0 1399 1399 -0.2481540 -0.3128153 5.1498685
TEAM_BASERUN_SB 8 2276 129.71309 93.95469 103.0 114.48518 64.4931 0 697 697 1.9268830 5.0404554 1.9693928
TEAM_BASERUN_CS 9 2276 65.77417 39.43856 54.0 59.02415 23.7216 0 201 201 1.7045243 2.7622052 0.8266753
TEAM_PITCHING_H 10 2276 1779.21046 1406.84293 1518.0 1555.89517 174.9468 1137 30132 28995 10.3295111 141.8396985 29.4889618
TEAM_PITCHING_BB 11 2276 553.00791 166.35736 536.5 542.62459 98.5929 0 3645 3645 6.7438995 96.9676398 3.4870317
TEAM_PITCHING_SO 12 2276 812.13445 542.18699 804.5 791.12733 254.2659 0 19278 19278 22.5128987 695.2641969 11.3648305
TEAM_FIELDING_E 13 2276 246.48067 227.77097 159.0 193.43798 62.2692 65 1898 1833 2.9904656 10.9702717 4.7743279
TEAM_FIELDING_DP 14 2276 142.23989 28.43894 146.0 143.42426 26.6868 52 228 176 -0.3652627 -0.1175944 0.5961112

4 Build Models

Using the training data provided, we will build 3 different linear regression models, to determine which will provide the best prediction for the # of wins for a baseball team. The tree approachs are: all variables, only significant variables, and backwards elimination of each variable.

4.1 Model 1: All Variables

All remaining variables after the data prep. After the data has been manipulated (imputed, etc. as stated above), all of the variables will be tested to determine the base model they provided. This will allow us to see which variables are significant in our dataset, and allow us to make other models based on that.

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.577  -8.494   0.159   8.443  61.658 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.777e+01  5.282e+00   5.258 1.60e-07 ***
## TEAM_BATTING_H    4.627e-02  3.627e-03  12.758  < 2e-16 ***
## TEAM_BATTING_2B  -1.873e-02  9.078e-03  -2.063  0.03921 *  
## TEAM_BATTING_3B   4.701e-02  1.676e-02   2.806  0.00506 ** 
## TEAM_BATTING_HR   7.316e-02  9.791e-03   7.472 1.12e-13 ***
## TEAM_BATTING_BB   8.606e-03  5.223e-03   1.648  0.09955 .  
## TEAM_BATTING_SO  -1.198e-02  2.504e-03  -4.783 1.84e-06 ***
## TEAM_BASERUN_SB   3.656e-02  4.298e-03   8.506  < 2e-16 ***
## TEAM_BASERUN_CS  -5.636e-04  9.291e-03  -0.061  0.95164    
## TEAM_PITCHING_H  -4.468e-04  3.695e-04  -1.209  0.22672    
## TEAM_PITCHING_BB -4.361e-05  3.556e-03  -0.012  0.99021    
## TEAM_PITCHING_SO  2.458e-03  8.632e-04   2.848  0.00445 ** 
## TEAM_FIELDING_E  -2.860e-02  2.495e-03 -11.466  < 2e-16 ***
## TEAM_FIELDING_DP -1.036e-01  1.253e-02  -8.264 2.37e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.96 on 2262 degrees of freedom
## Multiple R-squared:  0.3268, Adjusted R-squared:  0.323 
## F-statistic: 84.48 on 13 and 2262 DF,  p-value: < 2.2e-16

Conclusions based on model:

F-statistic is 89.25, R-squared is 0.3352 Out of the 14 variables, 9 have statistically significant p-values at the 5% level.

4.2 Model 2: Highly Significant Variables Only

Based on model one, Model 2 will focus only on the variables that are statistically significant - in order to see if only those variables allow for a better model. Variables will be choosen based on their significance level from the R output.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.317  -8.588   0.135   8.502  61.077 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.8270395  4.5668751   7.845 6.61e-15 ***
## TEAM_BATTING_H    0.0400597  0.0026789  14.954  < 2e-16 ***
## TEAM_BATTING_3B   0.0576747  0.0162673   3.545  0.00040 ***
## TEAM_BATTING_HR   0.0792951  0.0091938   8.625  < 2e-16 ***
## TEAM_BATTING_SO  -0.0131207  0.0023263  -5.640 1.91e-08 ***
## TEAM_BASERUN_SB   0.0406521  0.0037937  10.716  < 2e-16 ***
## TEAM_PITCHING_SO  0.0018183  0.0005835   3.116  0.00186 ** 
## TEAM_FIELDING_E  -0.0325020  0.0017862 -18.196  < 2e-16 ***
## TEAM_FIELDING_DP -0.0993903  0.0123477  -8.049 1.33e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.98 on 2267 degrees of freedom
## Multiple R-squared:  0.323,  Adjusted R-squared:  0.3206 
## F-statistic: 135.2 on 8 and 2267 DF,  p-value: < 2.2e-16

Conclusions based on model: F-statistic is 143, R-squared is 0.333

The F-statistic is better than the first model, however the R-squared drops slightly.

4.3 Model 3: Backwards Elimination and Significance

Variables will be removed one by one to determine best fit model. After each variable is removed, the model will be ‘ran’ again - until the most optimal output (r2, f-stat) are produced. Only the final output will be shown. This model is similar to the ‘forward selection’ variant - however I find it easier to work our way backwards and to eliminate variables rather than add them.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB + 
##     TEAM_FIELDING_E + TEAM_BATTING_HR, data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.589  -9.138   0.000   8.646  59.008 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      6.072197   2.914999   2.083 0.037355 *  
## TEAM_BATTING_H   0.049561   0.002038  24.319  < 2e-16 ***
## TEAM_BASERUN_SB  0.045561   0.003442  13.237  < 2e-16 ***
## TEAM_FIELDING_E -0.025407   0.001637 -15.520  < 2e-16 ***
## TEAM_BATTING_HR  0.022620   0.006014   3.761 0.000174 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.29 on 2271 degrees of freedom
## Multiple R-squared:  0.2899, Adjusted R-squared:  0.2887 
## F-statistic: 231.8 on 4 and 2271 DF,  p-value: < 2.2e-16

Conclusions based on model: F-statistic is 245.5, R-squared is 0.3006 The F-statistic is larger than both of the other two models, however the R-squared is slightly lower than the other two.

5 Select Models

The three models from the previous selection have been summarised below. From the three models, I decided to use model 3 for the predictions. While the first model had the highest R-squared, it had multiple variables that weren’t statistically significant, and some that had multicollinearity issues. The F-statistic in model 3 is also much higher than the other two.

A comparsion of the multiple linear regression models, based on: mean square error, R2, F-stat, and root MSE.

Model 1 Model 2 Model 3
Mean Squared Error: 166.957871565185 Mean Squared Error: 167.907711946534 Mean Squared Error: 176.111327638759
Root MSE: 12.9212178824283 Root MSE: 12.9579208188094 Root MSE: 13.2706943163784
Adjusted R-squared: 0.322971824775576 Adjusted R-squared: 0.320621862368771 Adjusted R-squared: 0.288683962428309
F-statistic: 84.4825955610935 F-statistic: 135.206323490219 F-statistic: 231.824267918398

5.1 Predictions

Similar to the train data, the evaulation data also needs some prep work. Similar to what was done for the test data, the eval data has had columns removed, and NA values imputed using the MICE - Random Forest method to predict what the NA values could be.

vars n mean sd median trimmed mad min max range skew kurtosis se
TEAM_BATTING_H 1 259 1469.3900 150.65523 1455 1463.68421 114.1602 819 2170 1351 0.5876139 3.6642947 9.361261
TEAM_BATTING_2B 2 259 241.3205 49.51612 239 242.32536 48.9258 44 376 332 -0.3273282 0.6693023 3.076782
TEAM_BATTING_3B 3 259 55.9112 27.14410 52 52.94737 26.6868 14 155 141 0.9790284 0.6987468 1.686652
TEAM_BATTING_HR 4 259 95.6332 56.33221 101 93.67943 66.7170 0 242 242 0.1712363 -0.9031262 3.500313
TEAM_BATTING_BB 5 259 498.9575 120.59215 509 505.98086 94.8864 15 792 777 -0.9209916 2.5265655 7.493232
TEAM_BATTING_SO 6 259 707.2587 238.49198 680 712.19139 259.4550 0 1268 1268 -0.2171483 -0.1898881 14.819171
TEAM_BASERUN_SB 7 259 130.7452 104.86734 95 112.77033 66.7170 0 580 580 1.8785832 4.0655539 6.516140
TEAM_BASERUN_CS 8 259 63.3668 34.42361 56 58.82775 23.7216 0 154 154 1.1797483 0.8795437 2.138979
TEAM_PITCHING_H 9 259 1813.4633 1662.91308 1515 1554.25359 173.4642 1155 22768 21613 9.2764797 102.0702914 103.328391
TEAM_PITCHING_BB 10 259 552.4170 172.95006 526 536.46411 97.8516 136 2008 1872 4.1113772 29.2127324 10.746594
TEAM_PITCHING_SO 11 259 793.1158 613.19152 734 760.06699 235.7334 0 9963 9963 12.8862500 190.1532522 38.101867
TEAM_FIELDING_E 12 259 249.7490 230.90260 163 197.36364 59.3040 73 1568 1495 3.0887263 10.8748551 14.347589
TEAM_FIELDING_DP 13 259 142.6448 27.28171 146 143.61722 25.2042 69 204 135 -0.3271267 -0.1939724 1.695203

After imputing and cleaning the data, using the predict function and Model 3, the following are the predicted values for the test set of the data, including prediction intervals:

fit lwr upr
67.13623 41.055889 93.21656
67.60660 41.527611 93.68558
76.03751 49.973209 102.10180
89.53503 63.466923 115.60314
70.16862 44.089489 96.24776
64.87399 38.777247 90.97073
81.96123 55.864890 108.05758
75.55546 49.487775 101.62314
70.08147 43.999724 96.16321
74.31463 48.242403 100.38687
75.76072 49.688450 101.83299
82.30038 56.233632 108.36714
78.58716 52.512895 104.66143
80.46603 54.395602 106.53645
78.95788 52.894835 105.02092
79.49049 53.426092 105.55488
73.20732 47.138873 99.27577
82.07808 56.012606 108.14356
68.33606 42.255490 94.41663
91.89863 65.820896 117.97637
81.80335 55.727601 107.87909
85.95945 59.888204 112.03069
77.73345 51.669044 103.79786
73.79085 47.722477 99.85922
86.03144 59.967841 112.09505
89.81946 63.751274 115.88765
69.65389 43.450462 95.85732
76.66803 50.595247 102.74082
82.25434 56.172831 108.33585
78.38870 52.306833 104.47057
86.77833 60.712068 112.84460
84.21850 58.155037 110.28197
82.23930 56.174799 108.30381
82.99247 56.924191 109.06074
79.60809 53.543362 105.67281
80.84144 54.762024 106.92086
75.45926 49.394599 101.52392
87.90480 61.821770 113.98784
86.03867 59.971227 112.10611
87.46234 61.387089 113.53758
82.10938 56.044761 108.17400
87.23623 61.169121 113.30334
31.00913 4.597776 57.42048
99.69708 73.516531 125.87763
90.28341 64.174034 116.39278
90.28543 64.195836 116.37503
96.22415 70.124249 122.32406
73.51255 47.444423 99.58067
69.52710 43.447847 95.60635
76.75404 50.684432 102.82365
79.94281 53.874642 106.01098
87.78549 61.709114 113.86186
77.36421 51.299973 103.42844
73.22777 47.159106 99.29644
78.23890 52.178886 104.29890
79.41250 53.350443 105.47456
88.62498 62.537837 114.71212
73.83970 47.754763 99.92464
62.14221 36.036322 88.24810
77.48610 51.412411 103.55980
86.59158 60.512878 112.67029
76.32423 50.247457 102.40100
86.03029 59.965814 112.09477
85.95845 59.862677 112.05422
86.31257 60.232893 112.39225
99.95558 73.813694 126.09747
75.00788 48.943641 101.07213
83.29275 57.217406 109.36810
78.57816 52.492592 104.66374
84.89222 58.807209 110.97724
84.80366 58.711411 110.89591
77.25293 51.173044 103.33282
79.51941 53.454304 105.58453
83.73802 57.647463 109.82859
85.45822 59.390518 111.52593
86.48654 60.417618 112.55547
81.40443 55.340586 107.46827
82.58200 56.514902 108.64909
71.75230 45.672786 97.83182
78.33310 52.255753 104.41044
87.23865 61.166041 113.31125
89.23463 63.158208 115.31104
96.68847 70.592683 122.78426
81.65470 55.583160 107.72624
80.59969 54.534239 106.66514
82.29907 56.236935 108.36121
79.41995 53.353186 105.48671
82.63534 56.573549 108.69713
84.55174 58.490110 110.61336
90.09904 64.019419 116.17866
79.03753 52.960202 105.11486
86.15399 59.926887 112.38109
73.04081 46.972190 99.10943
82.57737 56.500985 108.65375
84.64871 58.563206 110.73422
79.78616 53.709014 105.86331
82.81826 56.740994 108.89552
96.27710 70.168882 122.38533
86.54482 60.459170 112.63047
88.90420 62.829490 114.97891
83.07500 57.001330 109.14867
73.03540 46.966682 99.10413
83.71895 57.649976 109.78792
78.38742 52.318497 104.45634
81.24960 55.169348 107.32986
66.84296 40.743824 92.94209
47.33882 21.175164 73.50248
83.67310 57.607856 109.73834
84.05024 57.979251 110.12123
61.22342 35.130179 87.31665
82.78641 56.725193 108.84763
87.12356 61.055363 113.19175
94.45230 68.378518 120.52609
91.33655 65.269033 117.40407
83.96728 57.906992 110.02756
83.20464 57.143448 109.26583
91.14050 65.072105 117.20890
82.44282 56.379537 108.50611
79.77622 53.713885 105.83855
77.05494 50.944457 103.16542
89.45172 63.368426 115.53502
67.01697 40.927090 93.10686
66.80119 40.717442 92.88493
60.11416 34.009157 86.21917
68.86179 42.779023 94.94455
87.44880 61.367929 113.52967
88.62009 62.526380 114.71379
74.96902 48.900029 101.03802
87.69835 61.630061 113.76663
93.23197 67.146402 119.31753
86.36779 60.292782 112.44280
80.63481 54.566428 106.70320
79.57357 53.508867 105.63828
85.10774 59.041625 111.17386
85.31934 59.242136 111.39654
71.19588 45.103271 97.28849
76.55274 50.488794 102.61669
79.26542 53.200875 105.32996
90.53036 64.448302 116.61241
82.53217 56.471026 108.59331
67.31824 41.234732 93.40175
69.69733 43.613575 95.78109
91.67851 65.594246 117.76277
76.67572 50.606266 102.74518
72.92561 46.848393 99.00283
72.33785 46.267072 98.40862
78.47338 52.408821 104.53794
81.68135 55.618139 107.74456
84.72967 58.665025 110.79432
81.32244 55.260534 107.38434
82.86269 56.788004 108.93738
84.28374 58.221674 110.34580
45.26908 18.933110 71.60504
73.82185 47.755042 99.88865
77.14849 51.084993 103.21199
76.31588 50.249853 102.38191
87.41020 61.329582 113.49082
67.89901 41.811999 93.98601
87.18958 61.112421 113.26673
72.28956 46.214096 98.36503
96.22337 70.134291 122.31245
99.41116 73.320060 125.50225
86.63437 60.568076 112.70066
97.74847 71.650702 123.84624
89.59111 63.513337 115.66888
84.47829 58.409293 110.54728
82.22218 56.157901 108.28646
81.77499 55.711798 107.83818
77.13688 51.068084 103.20567
82.92497 56.860321 108.98963
88.18245 62.099406 114.26549
85.92711 59.851374 112.00285
78.30266 52.233769 104.37155
90.62441 64.539082 116.70974
81.62584 55.562611 107.68907
73.82979 47.755859 99.90373
75.01621 48.950755 101.08166
75.29618 49.231618 101.36074
74.01369 47.944997 100.08239
79.88010 53.816684 105.94351
86.36111 60.248658 112.47356
84.31503 58.233582 110.39648
85.86281 59.798009 111.92760
82.07639 55.998654 108.15413
89.18001 62.974010 115.38601
99.66933 73.467653 125.87102
87.68879 61.603940 113.77364
56.65728 30.502015 82.81255
81.45939 55.242450 107.67633
113.71519 87.537749 139.89264
70.44522 44.373418 96.51703
79.72226 53.649789 105.79472
78.84559 52.783530 104.90764
81.27961 55.207407 107.35181
84.35495 58.275895 110.43400
70.36314 44.290874 96.43541
77.17734 51.115940 103.23873
76.16384 50.098300 102.22938
75.69303 49.628593 101.75746
82.25950 56.196700 108.32229
76.27326 50.207397 102.33911
80.16288 54.100325 106.22544
73.56452 47.497053 99.63198
87.41267 61.349360 113.47598
81.59053 55.529817 107.65124
78.55895 52.494563 104.62334
81.67572 55.613796 107.73765
79.10919 53.047616 105.17076
81.32306 55.248439 107.39768
71.88460 45.805864 97.96333
100.81056 74.699625 126.92149
90.45132 64.367058 116.53558
78.85707 52.792721 104.92141
68.08431 42.009503 94.15912
70.66189 44.589865 96.73392
84.99225 58.923693 111.06082
83.97027 57.907613 110.03293
94.06135 67.977999 120.14471
79.39026 53.328870 105.45165
77.45847 51.395179 103.52175
81.55135 55.490138 107.61256
81.31675 55.254315 107.37919
85.07509 59.009008 111.14118
80.75341 54.690878 106.81594
87.18459 60.900380 113.46881
74.83597 48.770976 100.90097
81.55878 55.498117 107.61944
82.03872 55.972576 108.10486
80.93174 54.870778 106.99269
89.84760 63.645467 116.04973
74.43396 48.353576 100.51434
92.10202 66.026458 118.17758
78.06023 51.996185 104.12427
85.87531 59.798844 111.95178
77.66966 51.607442 103.73189
73.81543 47.748797 99.88206
84.34386 58.284259 110.40346
77.50491 51.439734 103.57009
85.18128 59.108813 111.25376
72.97924 46.908681 99.04980
87.23800 61.169237 113.30676
85.77969 59.707254 111.85212
84.70459 58.626982 110.78219
86.96065 60.896647 113.02465
65.96530 39.876078 92.05453
89.91658 63.848003 115.98516
81.42166 55.361418 107.48190
84.96504 58.894665 111.03541
73.95875 47.889914 100.02758
88.92177 62.842080 115.00146
83.02815 56.957246 109.09905
73.40288 47.240856 99.56490
90.75125 64.664191 116.83830
36.45602 10.154620 62.75743
69.92466 43.851514 95.99780
74.92555 48.861238 100.98986
82.60206 56.534355 108.66976
85.06042 58.987784 111.13306
80.57628 54.508445 106.64412
##       fit              lwr              upr        
##  Min.   : 31.01   Min.   : 4.598   Min.   : 57.42  
##  1st Qu.: 76.29   1st Qu.:50.227   1st Qu.:102.36  
##  Median : 81.63   Median :55.563   Median :107.69  
##  Mean   : 80.67   Mean   :54.588   Mean   :106.75  
##  3rd Qu.: 86.03   3rd Qu.:59.946   3rd Qu.:112.09  
##  Max.   :113.72   Max.   :87.538   Max.   :139.89
##        1 
## 81.09361
##        fit      lwr      upr
## 1 81.09361 55.03392 107.1533

6 Appendex

moneyball_training_data <- read_csv("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA621/moneyball-training-data.csv")


mbd1 <- describe(moneyball_training_data, na.rm = F)

mbd1$na_count <- sapply(moneyball_training_data, function(y) sum(length(which(is.na(y)))))
mbd1 <- mbd1[-1,]


kable(mbd1,"latex", booktabs = T) %>%
   kable_styling(latex_options = c("striped", "scale_down"))

# kable(mbd1, "html", escape = F) %>%
#   kable_styling("striped", full_width = T) %>%
#   column_spec(1, bold = T) %>%
#   scroll_box(width = "100%", height = "700px")

ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) +
  geom_boxplot() +
  theme(legend.position="none") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(panel.background = element_rect(fill = '#d0ddf2'))


ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 800)) +
  theme(legend.position="none") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(panel.background = element_rect(fill = '#d0ddf2'))


mb_hist <- moneyball_training_data
mb_hist <- mb_hist[,-1 ]

mb_hist %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram(bins = 35)


kable(cor(drop_na(mb_hist))[,1], "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T) %>%
  scroll_box(height = "500px")

corrgram(drop_na(mb_hist), order=TRUE,
         upper.panel=panel.cor, main="Moneyball")



mbd2 <- moneyball_training_data
mbd2 <- mbd2[,-1]
mbd2 <- mbd2[,-10]
mbd2 <- mbd2[,-12]




init = mice(mbd2, maxit=0)
meth = init$method
predM = init$predictorMatrix

predM[, c("TARGET_WINS")]=0

imputed = mice(mbd2, method="rf", predictorMatrix=predM, m=5)

imputed <- complete(imputed)

imputedtable <- describe(imputed)


kable(imputedtable, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")


mbd2 <- moneyball_training_data
mbd2 <- mbd2[,-c(1, 11, 13)]

init = mice(mbd2, maxit=0)
meth = init$method
predM = init$predictorMatrix

predM[, c("TARGET_WINS")]=0

imputed = mice(mbd2, method="rf", predictorMatrix=predM, m=5)

imputed <- complete(imputed)


vif(imputed)
v1 <- vifstep(imputed, th=10)


imputedtable <- describe(imputed)


kable(imputedtable, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")



model1 <- lm(TARGET_WINS ~., imputed)


model2 <- lm(TARGET_WINS ~ TEAM_BATTING_H  + TEAM_BATTING_3B  + TEAM_BATTING_HR  + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, imputed)

summary(model2)


model4 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB  + TEAM_FIELDING_E  + TEAM_BATTING_HR, imputed)

summary(model4)




compare_model1 <- c(m1mse, m1root, m1ar, m1fs )
compare_model2 <- c(m2mse, m2root, m2ar, m2fs )
compare_model3 <- c(m3mse, m3root, m3ar, m3fs )

compare <- data.frame(compare_model1, compare_model2, compare_model3)
colnames(compare) <- c("Model 1", "Model 2", "Model 3")

kable(compare)




mbeval <- mbeval[,-c(1, 10, 12)]


init = mice(mbeval, maxit=0)
meth = init$method
predM = init$predictorMatrix


imputed1 = mice(mbeval, method="rf", predictorMatrix=predM, m=5)

imputed1 <- complete(imputed1)

imputedtable1 <- describe(imputed1)


kable(imputedtable1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")

predict1 <- predict(model4, newdata = imputed1, interval="prediction")

kable(predict1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")


summary(predict1)