CUNY MSDS DATA 621 - HW 1

NIcholas Schettini

June 5, 2018

Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Caption for the picture.

Caption for the picture.

Data Exploration

The dataset consists of 17 elements, with 2276 total cases. Out of those 17, 15 are explanatory variables, which can be broken down into four groups:

  • batting
  • baserun
  • fielding
  • pitching
vars n mean sd median trimmed mad min max range skew kurtosis se na_count
TARGET_WINS 2 191 80.92670 12.115013 82 81.11765 13.3434 43 116 73 -0.1698314 -0.2952783 0.8766116 0
TEAM_BATTING_H 3 191 1478.62827 76.147869 1477 1477.42484 74.1300 1308 1667 359 0.1302702 -0.3710350 5.5098664 0
TEAM_BATTING_2B 4 191 297.19895 26.329335 296 296.62745 25.2042 201 373 172 0.0915189 0.4778716 1.9051238 0
TEAM_BATTING_3B 5 191 30.74346 9.043878 29 30.13072 8.8956 12 61 49 0.7007420 0.7446217 0.6543921 0
TEAM_BATTING_HR 6 191 178.05236 32.413243 175 176.81046 35.5824 116 260 144 0.2980673 -0.7172373 2.3453399 0
TEAM_BATTING_BB 7 191 543.31937 74.842133 535 541.31373 74.1300 365 775 410 0.3115199 -0.1474175 5.4153867 0
TEAM_BATTING_SO 8 191 1051.02618 104.156382 1050 1046.95425 97.8516 805 1399 594 0.3985050 0.3955105 7.5364913 102
TEAM_BASERUN_SB 9 191 90.90576 29.916401 87 89.06536 29.6520 31 177 146 0.5553966 -0.1414909 2.1646748 131
TEAM_BASERUN_CS 10 191 39.94241 11.898334 38 39.49020 11.8608 12 74 62 0.3468509 0.0006392 0.8609332 772
TEAM_BATTING_HBP 11 191 59.35602 12.967123 58 58.86275 11.8608 29 95 66 0.3185754 -0.1119828 0.9382681 2085
TEAM_PITCHING_H 12 191 1479.70157 75.788625 1480 1478.50327 72.6474 1312 1667 355 0.1279056 -0.3894781 5.4838725 0
TEAM_PITCHING_HR 13 191 178.17801 32.391678 175 176.93464 35.5824 116 260 144 0.2989191 -0.7190905 2.3437795 0
TEAM_PITCHING_BB 14 191 543.71728 74.916681 537 541.74510 72.6474 367 775 408 0.3144366 -0.1338563 5.4207808 0
TEAM_PITCHING_SO 15 191 1051.81675 104.347208 1052 1047.80392 97.8516 805 1399 594 0.3945586 0.3903991 7.5502990 102
TEAM_FIELDING_E 16 191 107.05236 16.632162 106 106.58170 17.7912 65 145 80 0.1780432 -0.3567367 1.2034610 0
TEAM_FIELDING_DP 17 191 152.33508 17.611682 152 152.04575 19.2738 113 204 91 0.2164822 -0.2115741 1.2743366 286

Looking at the data above, there are multiple variables with missing (NA) values, with TEAM-BATTING_HBP being the highest.

The boxplots below help show the spread of data within the dataset, and show various outliers. As shown in the graph below, TEAM_PITCHING_H seems to have the highest spread with the most outliers.

## Warning: Removed 3478 rows containing non-finite values (stat_boxplot).

The graph below zooms into the other variables, so it becomes easier to see spread and outliers from the other variables.

In the Histograms below, the data shows multiple graphs with right skews while only a few have left-skew.

The above boxplots show all of the variables listed in the dataset. This visualization may assist in showing how the data is spread.

The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.

For this project, it makes sense to break down the correlation by wins - since that’s what we’re trying to predict.

x
TARGET_WINS 1.0000000
TEAM_BATTING_H 0.4699467
TEAM_BATTING_2B 0.3129840
TEAM_BATTING_3B -0.1243459
TEAM_BATTING_HR 0.4224168
TEAM_BATTING_BB 0.4686879
TEAM_BATTING_SO -0.2288927
TEAM_BASERUN_SB 0.0148364
TEAM_BASERUN_CS -0.1787560
TEAM_BATTING_HBP 0.0735042
TEAM_PITCHING_H 0.4712343
TEAM_PITCHING_HR 0.4224668
TEAM_PITCHING_BB 0.4683988
TEAM_PITCHING_SO -0.2293648
TEAM_FIELDING_E -0.3866880
TEAM_FIELDING_DP -0.1958660

Below is a visual representation of the correlation plot.

According to the coorelation graph, batting_h, batting_2b, batting_hr, batting_bb, pitching_h, pitching_hr, and pitching_bb are the most positively correlated.

Data Preparation

Removal of Data

The variable TEAM_BATTING_HBP is also missing over 90% of its values. That variable will be removed completely.

The variable TEAM_PITCHING_HR and TEAM_BATTING_HR are also very closely correlated with each other. This shows that there may be some collinearity involved. The TEAM_PITCHING_HR variable will be dropped from the dataset

Using the VIF and vifstep function from the USDM package, the data will first be tested for other collinearity issues. The variables determined that have collinearity issues will be discarded.

Imputation of Missing (NA) values

The data exploration revealed multiple variables that had numerious NA values. There are multiple ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.

Imputation the mean/median/mode is an easy way to fill in the missing NA’s, however it reduces the variance in the dataset and shrinks standard errors - which can invalidate hypothesis tests.

In this case, data will be imputated via prediction using the MICE (Multivariate Imputation) library using a random forest prediction method.

Variables that exceed the established threshold will be discarded to avoid collinearity issues.

vif(imputed)
##           Variables      VIF
## 1       TARGET_WINS 1.499251
## 2    TEAM_BATTING_H 3.997932
## 3   TEAM_BATTING_2B 2.457536
## 4   TEAM_BATTING_3B 2.979868
## 5   TEAM_BATTING_HR 4.918888
## 6   TEAM_BATTING_BB 5.543429
## 7   TEAM_BATTING_SO 5.231486
## 8   TEAM_BASERUN_SB 2.344517
## 9   TEAM_BASERUN_CS 1.778219
## 10  TEAM_PITCHING_H 3.683440
## 11 TEAM_PITCHING_BB 4.781545
## 12 TEAM_PITCHING_SO 2.986074
## 13  TEAM_FIELDING_E 4.774207
## 14 TEAM_FIELDING_DP 1.750808
v1 <- vifstep(imputed, th=10)

Output - The below table shows the results of above data manipulation.

The NA data has been ‘filled in’ using the MICE prediction, using the Random Forest Method. Variables with collinearity as established by the vir/virstep package have been dropped.

vars n mean sd median trimmed mad min max range skew kurtosis se
TARGET_WINS 1 2276 80.79086 15.75215 82.0 81.31229 14.8260 0 146 146 -0.3987232 1.0274757 0.3301823
TEAM_BATTING_H 2 2276 1469.26977 144.59120 1454.0 1459.04116 114.1602 891 2554 1663 1.5713335 7.2785261 3.0307891
TEAM_BATTING_2B 3 2276 241.24692 46.80141 238.0 240.39627 47.4432 69 458 389 0.2151018 0.0061609 0.9810087
TEAM_BATTING_3B 4 2276 55.25000 27.93856 47.0 52.17563 23.7216 0 223 223 1.1094652 1.5032418 0.5856226
TEAM_BATTING_HR 5 2276 99.61204 60.54687 102.0 97.38529 78.5778 0 264 264 0.1860421 -0.9631189 1.2691285
TEAM_BATTING_BB 6 2276 501.55888 122.67086 512.0 512.18331 94.8864 0 878 878 -1.0257599 2.1828544 2.5713150
TEAM_BATTING_SO 7 2276 730.04525 245.34496 736.0 735.28266 278.7288 0 1399 1399 -0.2491809 -0.3043446 5.1426978
TEAM_BASERUN_SB 8 2276 130.57777 94.76956 103.0 115.01701 65.2344 0 697 697 1.8656479 4.4601924 1.9864734
TEAM_BASERUN_CS 9 2276 65.05756 38.45694 53.0 58.84522 22.2390 0 201 201 1.6112087 2.4228591 0.8060993
TEAM_PITCHING_H 10 2276 1779.21046 1406.84293 1518.0 1555.89517 174.9468 1137 30132 28995 10.3295111 141.8396985 29.4889618
TEAM_PITCHING_BB 11 2276 553.00791 166.35736 536.5 542.62459 98.5929 0 3645 3645 6.7438995 96.9676398 3.4870317
TEAM_PITCHING_SO 12 2276 811.27373 542.07751 802.5 790.48299 254.2659 0 19278 19278 22.5302527 695.9711670 11.3625357
TEAM_FIELDING_E 13 2276 246.48067 227.77097 159.0 193.43798 62.2692 65 1898 1833 2.9904656 10.9702717 4.7743279
TEAM_FIELDING_DP 14 2276 142.48023 28.08209 146.0 143.54281 26.6868 52 228 176 -0.3341358 -0.1834034 0.5886313

Build Models

Using the training data provided, we will build 3 different linear regression models, to determine which will provide the best prediction for the # of wins for a baseball team. The tree approachs are: all variables, only significant variables, and backwards elimination of each variable.

Model 1: All Variables

All remaining variables after the data prep. After the data has been manipulated (imputed, etc. as stated above), all of the variables will be tested to determine the base model they provided. This will allow us to see which variables are significant in our dataset, and allow us to make other models based on that.

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.523  -8.434   0.164   8.261  58.735 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      26.1424328  5.2322110   4.996 6.29e-07 ***
## TEAM_BATTING_H    0.0466562  0.0036096  12.926  < 2e-16 ***
## TEAM_BATTING_2B  -0.0192738  0.0090513  -2.129  0.03333 *  
## TEAM_BATTING_3B   0.0408129  0.0166908   2.445  0.01455 *  
## TEAM_BATTING_HR   0.0727073  0.0097896   7.427 1.57e-13 ***
## TEAM_BATTING_BB   0.0089198  0.0051882   1.719  0.08571 .  
## TEAM_BATTING_SO  -0.0121938  0.0025086  -4.861 1.25e-06 ***
## TEAM_BASERUN_SB   0.0413004  0.0042832   9.642  < 2e-16 ***
## TEAM_BASERUN_CS   0.0009375  0.0093794   0.100  0.92039    
## TEAM_PITCHING_H  -0.0001880  0.0003690  -0.510  0.61045    
## TEAM_PITCHING_BB -0.0009933  0.0035554  -0.279  0.77999    
## TEAM_PITCHING_SO  0.0025529  0.0008606   2.966  0.00304 ** 
## TEAM_FIELDING_E  -0.0297181  0.0025185 -11.800  < 2e-16 ***
## TEAM_FIELDING_DP -0.0957177  0.0125853  -7.606 4.13e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.9 on 2262 degrees of freedom
## Multiple R-squared:  0.333,  Adjusted R-squared:  0.3292 
## F-statistic: 86.87 on 13 and 2262 DF,  p-value: < 2.2e-16

Conclusions based on model:

F-statistic is 89.25, R-squared is 0.3352 Out of the 14 variables, 9 have statistically significant p-values at the 5% level.

Model 2: Highly Significant Variables Only

Based on model one, Model 2 will focus only on the variables that are statistically significant - in order to see if only those variables allow for a better model. Variables will be choosen based on their significance level from the R output.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.312  -8.551   0.171   8.267  58.000 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      34.0046383  4.5179148   7.527 7.47e-14 ***
## TEAM_BATTING_H    0.0407043  0.0026698  15.246  < 2e-16 ***
## TEAM_BATTING_3B   0.0495451  0.0162081   3.057 0.002263 ** 
## TEAM_BATTING_HR   0.0783195  0.0091365   8.572  < 2e-16 ***
## TEAM_BATTING_SO  -0.0134377  0.0023291  -5.770 9.03e-09 ***
## TEAM_BASERUN_SB   0.0446690  0.0037826  11.809  < 2e-16 ***
## TEAM_PITCHING_SO  0.0019535  0.0005818   3.358 0.000799 ***
## TEAM_FIELDING_E  -0.0325609  0.0017680 -18.417  < 2e-16 ***
## TEAM_FIELDING_DP -0.0922242  0.0123648  -7.459 1.24e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.92 on 2267 degrees of freedom
## Multiple R-squared:  0.3299, Adjusted R-squared:  0.3275 
## F-statistic: 139.5 on 8 and 2267 DF,  p-value: < 2.2e-16

Conclusions based on model: F-statistic is 143, R-squared is 0.333

The F-statistic is better than the first model, however the R-squared drops slightly.

Model 3: Backwards Elimination and Significance

Variables will be removed one by one to determine best fit model. After each variable is removed, the model will be ‘ran’ again - until the most optimal output (r2, f-stat) are produced. Only the final output will be shown. This model is similar to the ‘forward selection’ variant - however I find it easier to work our way backwards and to eliminate variables rather than add them.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB + 
##     TEAM_FIELDING_E + TEAM_BATTING_HR, data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.322  -9.029   0.006   8.508  57.962 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.314710   2.896116   1.835   0.0666 .  
## TEAM_BATTING_H   0.049751   0.002022  24.608  < 2e-16 ***
## TEAM_BASERUN_SB  0.049378   0.003391  14.560  < 2e-16 ***
## TEAM_FIELDING_E -0.026155   0.001630 -16.049  < 2e-16 ***
## TEAM_BATTING_HR  0.023860   0.005956   4.006 6.38e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.19 on 2271 degrees of freedom
## Multiple R-squared:  0.3005, Adjusted R-squared:  0.2992 
## F-statistic: 243.8 on 4 and 2271 DF,  p-value: < 2.2e-16

Conclusions based on model: F-statistic is 245.5, R-squared is 0.3006 The F-statistic is larger than both of the other two models, however the R-squared is slightly lower than the other two.

Select Models

The three models from the previous selection have been summarised below. From the three models, I decided to use model 3 for the predictions. While the first model had the highest R-squared, it had multiple variables that weren’t statistically significant, and some that had multicollinearity issues. The F-statistic in model 3 is also much higher than the other two.

A comparsion of the multiple linear regression models, based on: mean square error, R2, F-stat, and root MSE.

Model 1 Model 2 Model 3
Mean Squared Error: 165.430149959513 Mean Squared Error: 166.208918203967 Mean Squared Error: 173.503148865462
Root MSE: 12.8619652448416 Root MSE: 12.8922037760798 Root MSE: 13.1720594010755
Adjusted R-squared: 0.329166864046515 Adjusted R-squared: 0.32749541996577 Adjusted R-squared: 0.299218431818632
F-statistic: 86.8696419732824 F-statistic: 139.484573663462 F-statistic: 243.843834404051

Predictions

Similar to the train data, the evaulation data also needs some prep work. Similar to what was done for the test data, the eval data has had columns removed, and NA values imputed using the MICE - Random Forest method to predict what the NA values could be.

vars n mean sd median trimmed mad min max range skew kurtosis se
TEAM_BATTING_H 1 259 1469.38996 150.65523 1455 1463.68421 114.1602 819 2170 1351 0.5876139 3.6642947 9.361261
TEAM_BATTING_2B 2 259 241.32046 49.51612 239 242.32536 48.9258 44 376 332 -0.3273282 0.6693023 3.076782
TEAM_BATTING_3B 3 259 55.91120 27.14410 52 52.94737 26.6868 14 155 141 0.9790284 0.6987468 1.686652
TEAM_BATTING_HR 4 259 95.63320 56.33221 101 93.67943 66.7170 0 242 242 0.1712363 -0.9031262 3.500313
TEAM_BATTING_BB 5 259 498.95753 120.59215 509 505.98086 94.8864 15 792 777 -0.9209916 2.5265655 7.493232
TEAM_BATTING_SO 6 259 701.83784 243.51290 680 708.15311 259.4550 0 1268 1268 -0.2448918 -0.2154189 15.131155
TEAM_BASERUN_SB 7 259 128.12741 96.02480 95 112.79426 63.7518 0 580 580 1.6607884 3.1509919 5.966691
TEAM_BASERUN_CS 8 259 64.28571 35.48134 55 60.36364 23.7216 0 154 154 1.0136018 0.2796269 2.204703
TEAM_PITCHING_H 9 259 1813.46332 1662.91308 1515 1554.25359 173.4642 1155 22768 21613 9.2764797 102.0702914 103.328391
TEAM_PITCHING_BB 10 259 552.41699 172.95006 526 536.46411 97.8516 136 2008 1872 4.1113772 29.2127324 10.746594
TEAM_PITCHING_SO 11 259 794.33205 613.71595 736 760.66986 235.7334 0 9963 9963 12.8487348 189.3937245 38.134454
TEAM_FIELDING_E 12 259 249.74903 230.90260 163 197.36364 59.3040 73 1568 1495 3.0887263 10.8748551 14.347589
TEAM_FIELDING_DP 13 259 142.88417 27.34920 146 144.02392 25.2042 69 204 135 -0.3784993 -0.0737034 1.699397

After imputing and cleaning the data, using the predict function and Model 3, the following are the predicted values for the test set of the data, including prediction intervals:

fit lwr upr
66.84442 40.957813 92.73102
67.29649 41.411221 93.18175
75.77018 49.899595 101.64077
89.74078 63.866483 115.61507
75.48386 49.595064 101.37266
69.68768 43.804389 95.57097
82.54934 56.648014 108.45066
75.56408 49.690188 101.43797
69.99633 44.108431 95.88424
73.95660 48.078185 99.83501
75.43232 49.553884 101.31075
82.10294 56.229900 107.97598
78.34735 52.466793 104.22790
80.21514 54.338437 106.09185
78.78385 52.914487 104.65321
79.30491 53.434194 105.17562
73.06414 47.189400 98.93888
82.01046 56.138711 107.88222
68.14481 42.258018 94.03161
91.78241 65.898550 117.66626
81.56651 55.684708 107.44832
85.71854 59.841212 111.59588
77.40235 51.531677 103.27303
73.39180 47.517128 99.26648
86.01310 60.143217 111.88299
89.94805 64.073626 115.82247
58.17171 32.181737 84.16168
76.31521 50.436286 102.19414
81.98767 56.100175 107.87516
77.98918 52.101387 103.87698
86.62082 60.748320 112.49332
84.04249 58.172776 109.91221
81.98405 56.113320 107.85478
82.68952 56.815014 108.56403
79.36204 53.491014 105.23307
80.68676 54.801129 106.57239
75.28182 49.410840 101.15281
87.86324 61.974069 113.75241
85.83062 59.956980 111.70426
87.27252 61.391084 113.15395
81.87679 56.005901 107.74768
87.11062 61.237314 112.98392
34.16652 7.959504 60.37354
101.01809 75.034458 127.00173
91.10157 65.187555 117.01558
90.73397 64.838737 116.62920
96.74828 70.842770 122.65379
73.53568 47.661382 99.40997
69.34003 43.454553 95.22550
76.59684 50.720984 102.47269
79.64269 53.768362 105.51703
87.59705 61.714644 113.47947
77.19815 51.327590 103.06872
72.95083 47.075836 98.82582
78.10832 52.241980 103.97467
79.26373 53.395381 105.13209
89.17837 63.285636 115.07110
74.02463 48.133735 99.91552
62.09936 36.187601 88.01112
77.61336 51.733548 103.49317
86.68127 60.796375 112.56616
76.03359 50.150722 101.91645
86.07192 60.201184 111.94266
86.00686 60.105127 111.90859
86.53265 60.647009 112.41830
101.00517 75.058266 126.95208
74.88012 49.009567 100.75068
83.16788 57.286354 109.04941
78.99179 53.100575 104.88300
85.37388 59.483083 111.26468
84.46322 58.565243 110.36119
76.92828 51.042384 102.81417
79.30842 53.436993 105.17985
83.64109 57.744431 109.53775
85.24998 59.376117 111.12385
86.41449 60.539314 112.28967
81.54696 55.676934 107.41698
82.67868 56.805416 108.55194
71.77148 45.885857 97.65711
78.42707 52.543575 104.31056
86.99878 61.120091 112.87748
89.03482 63.152356 114.91729
96.55587 70.654185 122.45756
81.34326 55.465639 107.22087
80.37801 54.506251 106.24978
82.12541 56.256977 107.99384
79.44775 53.574779 105.32072
82.52453 56.656454 108.39260
84.69217 58.824283 110.56007
90.34284 64.457209 116.22848
79.35878 53.475732 105.24182
86.40493 60.374066 112.43580
72.88054 47.005624 98.75546
82.69688 56.814328 108.57944
85.11620 59.224855 111.00754
80.10770 54.224637 105.99075
83.13282 57.249523 109.01611
96.14589 70.231933 122.05984
86.34516 60.453575 112.23675
88.92693 63.046028 114.80783
82.82389 56.944149 108.70364
72.64541 46.770407 98.52041
83.41881 57.543722 109.29390
78.14882 52.273596 104.02405
81.24605 55.359698 107.13241
71.38093 45.495315 97.26655
49.07420 23.120501 75.02789
83.66313 57.791624 109.53463
84.06725 58.190065 109.94444
60.78095 34.881272 86.68063
82.92419 57.056753 108.79163
87.48037 61.606074 113.35467
94.74991 68.869893 120.62993
91.46150 65.587718 117.33528
83.94691 58.080318 109.81350
83.17406 57.306569 109.04155
91.27021 65.395574 117.14485
82.56630 56.696811 108.43578
79.55968 53.691045 105.42831
77.63222 51.717258 103.54719
89.90789 64.018662 115.79711
67.06688 41.171107 92.96266
66.59474 40.704784 92.48470
59.86647 33.955355 85.77757
68.79519 42.906290 94.68409
87.23263 61.345775 113.11949
88.25727 62.357883 114.15666
74.77664 48.901378 100.65190
87.55390 61.679419 113.42839
93.04601 67.154479 118.93755
86.10646 60.225416 111.98750
80.45473 54.580154 106.32931
79.69722 53.826387 105.56806
85.26277 59.390493 111.13504
85.42525 59.541952 111.30855
72.94748 47.046308 98.84865
76.50858 50.638360 102.37880
79.00200 53.131226 104.87278
91.06717 65.179088 116.95525
82.63074 56.763351 108.49814
67.17715 41.287469 93.06684
69.66325 43.773382 95.55312
91.60195 65.711651 117.49224
76.31512 50.439476 102.19076
72.57974 46.696382 98.46309
71.93374 46.056673 97.81081
78.39098 52.520139 104.26183
81.57525 55.705767 107.44473
84.55755 58.686674 110.42842
81.18704 55.318823 107.05525
82.84533 56.964462 108.72619
84.32265 58.454293 110.19100
43.70469 17.561519 69.84787
73.46726 47.594155 99.34037
77.02909 51.159289 102.89889
76.37344 50.501229 102.24565
87.66079 61.774227 113.54735
65.39449 39.500185 91.28880
87.37835 61.495124 113.26158
72.30981 46.428228 98.19140
96.20487 70.309711 122.10004
99.61895 73.721749 125.51616
86.48036 60.607852 112.35287
97.75285 71.849054 123.65664
89.48008 63.596132 115.36403
84.20031 58.325139 110.07548
82.03997 56.169395 107.91055
81.64414 55.774642 107.51364
77.17411 51.299160 103.04907
82.86275 56.991814 108.73368
88.58714 62.698063 114.47622
86.22406 60.342183 112.10595
77.99639 52.121330 103.87146
90.50307 64.611747 116.39439
81.36673 55.497221 107.23623
73.45158 47.571493 99.33166
74.73739 48.865647 100.60914
74.98058 49.109682 100.85148
73.62646 47.751491 99.50144
79.85270 53.983020 105.72239
87.19922 61.282026 113.11641
84.50598 58.618621 110.39334
85.99410 60.123090 111.86512
81.94932 56.065374 107.83326
88.96062 62.949304 114.97194
101.00683 75.002680 127.01098
88.05561 62.165108 113.94610
70.48439 44.505083 96.46370
64.78586 38.887407 90.68431
114.40413 88.421126 140.38712
70.30229 44.424225 96.18036
79.77672 53.898052 105.65538
78.67288 52.804513 104.54124
80.97365 55.095368 106.85194
84.03774 58.152729 109.92274
69.96893 44.090367 95.84750
76.92294 51.055207 102.79068
75.92875 50.056883 101.80061
75.42316 49.552385 101.29393
82.05362 56.184549 107.92270
76.01820 50.145994 101.89040
79.95618 54.087339 105.82501
73.37336 47.499601 99.24713
87.57201 61.702426 113.44160
81.64056 55.773567 107.50755
78.64678 52.776214 104.51735
81.55645 55.688237 107.42466
79.03358 53.165700 104.90146
81.64337 55.763040 107.52371
71.94618 46.061585 97.83078
101.36212 75.445411 127.27883
90.50750 64.617099 116.39789
78.89369 53.023090 104.76429
67.71789 41.836726 93.59905
70.23387 44.355541 96.11220
84.72739 58.852678 110.60209
83.81404 57.945103 109.68297
94.64876 68.759581 120.53794
79.29458 53.426880 105.16227
77.45905 51.589497 103.32860
81.50475 55.637242 107.37227
81.46051 55.591858 107.32916
85.04305 59.170698 110.91539
80.61032 54.741482 106.47916
86.29814 60.209008 112.38728
74.72789 48.856599 100.59918
81.51074 55.643768 107.37770
82.14032 56.268014 108.01262
80.90219 55.034931 106.76945
77.55036 51.651524 103.44919
74.59904 48.712648 100.48543
92.16630 66.284537 118.04806
77.80810 51.937732 103.67847
85.82773 59.945075 111.71039
77.55453 51.685984 103.42307
73.67686 47.803936 99.54978
84.36600 58.500094 110.23191
77.46094 51.589502 103.33239
85.31306 59.434543 111.19158
72.88181 47.004995 98.75862
87.08428 61.209329 112.95923
85.51560 59.637122 111.39408
84.44546 58.561874 110.32904
86.97796 61.107681 112.84824
65.87711 39.981794 91.77243
89.86501 63.990213 115.73982
81.30220 55.435638 107.16875
85.04720 59.170663 110.92374
73.59370 47.718619 99.46878
89.21338 63.327736 115.09902
82.97177 57.094636 108.84890
54.33927 28.368448 80.31009
91.22629 65.333606 117.11898
28.37968 2.269096 54.49026
69.69086 43.811424 95.57030
74.75187 48.881225 100.62251
82.89015 57.016361 108.76393
85.47483 59.596340 111.35333
80.49211 54.618015 106.36621
##       fit              lwr              upr        
##  Min.   : 28.38   Min.   : 2.269   Min.   : 54.49  
##  1st Qu.: 75.97   1st Qu.:50.101   1st Qu.:101.85  
##  Median : 81.55   Median :55.677   Median :107.42  
##  Mean   : 80.50   Mean   :54.608   Mean   :106.38  
##  3rd Qu.: 86.01   3rd Qu.:60.133   3rd Qu.:111.90  
##  Max.   :114.40   Max.   :88.421   Max.   :140.39
##        1 
## 81.01443
##        fit      lwr      upr
## 1 81.01443 55.14843 106.8804

Appendex

moneyball_training_data <- read_csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/moneyball-training-data.csv")


mbd1 <- describe(moneyball_training_data, na.rm = F)

mbd1$na_count <- sapply(moneyball_training_data, function(y) sum(length(which(is.na(y)))))
mbd1 <- mbd1[-1,]


kable(mbd1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")

ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) + 
  geom_boxplot() +
  theme(legend.position="none") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) + 
  theme(panel.background = element_rect(fill = '#d0ddf2'))


ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) + 
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 800)) +
  theme(legend.position="none") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) + 
  theme(panel.background = element_rect(fill = '#d0ddf2'))


mb_hist <- moneyball_training_data
mb_hist <- mb_hist[,-1 ]

mb_hist %>%
  keep(is.numeric) %>%                     
  gather() %>%                             
  ggplot(aes(value)) +                     
    facet_wrap(~ key, scales = "free") +  
    geom_histogram(bins = 35)                 



kable(cor(drop_na(mb_hist))[,1], "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T) %>%
  scroll_box(height = "500px")

corrgram(drop_na(mb_hist), order=TRUE,
         upper.panel=panel.cor, main="Moneyball")



mbd2 <- moneyball_training_data
mbd2 <- mbd2[,-1]
mbd2 <- mbd2[,-10]
mbd2 <- mbd2[,-12]




init = mice(mbd2, maxit=0) 
meth = init$method
predM = init$predictorMatrix

predM[, c("TARGET_WINS")]=0

imputed = mice(mbd2, method="rf", predictorMatrix=predM, m=5)

imputed <- complete(imputed)

imputedtable <- describe(imputed)

kable(imputedtable, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")





mbd2 <- moneyball_training_data
mbd2 <- mbd2[,-c(1, 11, 13)]

init = mice(mbd2, maxit=0) 
meth = init$method
predM = init$predictorMatrix

predM[, c("TARGET_WINS")]=0

imputed = mice(mbd2, method="rf", predictorMatrix=predM, m=5)

imputed <- complete(imputed)


vif(imputed)
v1 <- vifstep(imputed, th=10)


imputedtable <- describe(imputed)

kable(imputedtable, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")



model1 <- lm(TARGET_WINS ~., imputed)


model2 <- lm(TARGET_WINS ~ TEAM_BATTING_H  + TEAM_BATTING_3B  + TEAM_BATTING_HR  + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, imputed)

summary(model2)


model4 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB  + TEAM_FIELDING_E  + TEAM_BATTING_HR, imputed)

summary(model4)




compare_model1 <- c(m1mse, m1root, m1ar, m1fs )
compare_model2 <- c(m2mse, m2root, m2ar, m2fs )
compare_model3 <- c(m3mse, m3root, m3ar, m3fs )

compare <- data.frame(compare_model1, compare_model2, compare_model3)
colnames(compare) <- c("Model 1", "Model 2", "Model 3")

kable(compare)




mbeval <- mbeval[,-c(1, 10, 12)]


init = mice(mbeval, maxit=0) 
meth = init$method
predM = init$predictorMatrix


imputed1 = mice(mbeval, method="rf", predictorMatrix=predM, m=5)

imputed1 <- complete(imputed1)

imputedtable1 <- describe(imputed1)

kable(imputedtable1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")

predict1 <- predict(model4, newdata = imputed1, interval="prediction")
kable(predict1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")


summary(predict1)