Overview
In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
(Figure not reproduced: table of variable descriptions.)
Data Exploration
The dataset consists of 17 variables and 2,276 cases. Of those 17, one is an identifier column, one is the response (TARGET_WINS), and the remaining 15 are explanatory variables, which can be broken down into four groups:
- batting
- baserun
- fielding
- pitching
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se | na_count |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TARGET_WINS | 2 | 191 | 80.92670 | 12.115013 | 82 | 81.11765 | 13.3434 | 43 | 116 | 73 | -0.1698314 | -0.2952783 | 0.8766116 | 0 |
TEAM_BATTING_H | 3 | 191 | 1478.62827 | 76.147869 | 1477 | 1477.42484 | 74.1300 | 1308 | 1667 | 359 | 0.1302702 | -0.3710350 | 5.5098664 | 0 |
TEAM_BATTING_2B | 4 | 191 | 297.19895 | 26.329335 | 296 | 296.62745 | 25.2042 | 201 | 373 | 172 | 0.0915189 | 0.4778716 | 1.9051238 | 0 |
TEAM_BATTING_3B | 5 | 191 | 30.74346 | 9.043878 | 29 | 30.13072 | 8.8956 | 12 | 61 | 49 | 0.7007420 | 0.7446217 | 0.6543921 | 0 |
TEAM_BATTING_HR | 6 | 191 | 178.05236 | 32.413243 | 175 | 176.81046 | 35.5824 | 116 | 260 | 144 | 0.2980673 | -0.7172373 | 2.3453399 | 0 |
TEAM_BATTING_BB | 7 | 191 | 543.31937 | 74.842133 | 535 | 541.31373 | 74.1300 | 365 | 775 | 410 | 0.3115199 | -0.1474175 | 5.4153867 | 0 |
TEAM_BATTING_SO | 8 | 191 | 1051.02618 | 104.156382 | 1050 | 1046.95425 | 97.8516 | 805 | 1399 | 594 | 0.3985050 | 0.3955105 | 7.5364913 | 102 |
TEAM_BASERUN_SB | 9 | 191 | 90.90576 | 29.916401 | 87 | 89.06536 | 29.6520 | 31 | 177 | 146 | 0.5553966 | -0.1414909 | 2.1646748 | 131 |
TEAM_BASERUN_CS | 10 | 191 | 39.94241 | 11.898334 | 38 | 39.49020 | 11.8608 | 12 | 74 | 62 | 0.3468509 | 0.0006392 | 0.8609332 | 772 |
TEAM_BATTING_HBP | 11 | 191 | 59.35602 | 12.967123 | 58 | 58.86275 | 11.8608 | 29 | 95 | 66 | 0.3185754 | -0.1119828 | 0.9382681 | 2085 |
TEAM_PITCHING_H | 12 | 191 | 1479.70157 | 75.788625 | 1480 | 1478.50327 | 72.6474 | 1312 | 1667 | 355 | 0.1279056 | -0.3894781 | 5.4838725 | 0 |
TEAM_PITCHING_HR | 13 | 191 | 178.17801 | 32.391678 | 175 | 176.93464 | 35.5824 | 116 | 260 | 144 | 0.2989191 | -0.7190905 | 2.3437795 | 0 |
TEAM_PITCHING_BB | 14 | 191 | 543.71728 | 74.916681 | 537 | 541.74510 | 72.6474 | 367 | 775 | 408 | 0.3144366 | -0.1338563 | 5.4207808 | 0 |
TEAM_PITCHING_SO | 15 | 191 | 1051.81675 | 104.347208 | 1052 | 1047.80392 | 97.8516 | 805 | 1399 | 594 | 0.3945586 | 0.3903991 | 7.5502990 | 102 |
TEAM_FIELDING_E | 16 | 191 | 107.05236 | 16.632162 | 106 | 106.58170 | 17.7912 | 65 | 145 | 80 | 0.1780432 | -0.3567367 | 1.2034610 | 0 |
TEAM_FIELDING_DP | 17 | 191 | 152.33508 | 17.611682 | 152 | 152.04575 | 19.2738 | 113 | 204 | 91 | 0.2164822 | -0.2115741 | 1.2743366 | 286 |
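Looking at the data above, multiple variables have missing (NA) values; TEAM_BATTING_HBP has the most, with 2,085 of 2,276 records missing. Note that n is 191 in every row: the summary statistics are computed only on the 191 records with no missing values, while na_count is taken over all 2,276 records.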
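The missing-value counts in the table can be reproduced with a few lines of base R. The sketch below uses a small hypothetical data frame (not the Moneyball data, and the column names are made up); only the counting pattern matters.

```r
# Hypothetical toy data standing in for the training set
toy <- data.frame(
  wins = c(80, 75, 90, 66, 71),
  hbp  = c(NA, NA, 52, NA, NA),
  sb   = c(100, NA, 87, 120, 95)
)

# Count and share of missing values per column
na_count <- colSums(is.na(toy))
na_share <- round(na_count / nrow(toy), 2)

# Rows with no missing value in any column (what listwise deletion keeps)
n_complete <- sum(complete.cases(toy))

print(data.frame(na_count, na_share))
print(n_complete)
```

A single sparsely populated column is enough to shrink the complete-case count sharply, which is why the number of complete rows is worth checking alongside the per-column counts.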
The boxplots below help show the spread of data within the dataset, and show various outliers. As shown in the graph below, TEAM_PITCHING_H seems to have the highest spread with the most outliers.
## Warning: Removed 3478 rows containing non-finite values (stat_boxplot).
The graph below zooms into the other variables, so it becomes easier to see spread and outliers from the other variables.
In the histograms below, several variables are right-skewed, while only a few are left-skewed.
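The direction of skew can also be checked numerically. The sketch below uses a simple moment-based skewness on made-up vectors; it is one of several conventions, so the values in the summary tables may use a slightly different small-sample correction.

```r
# Moment-based sample skewness: positive = long right tail, negative = long left tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right_tailed <- c(1, 2, 2, 3, 3, 3, 4, 20)   # hypothetical values with one large outlier
left_tailed  <- -right_tailed                # mirror image of the vector above
symmetric    <- c(-2, -1, 0, 1, 2)

skewness(right_tailed)
skewness(left_tailed)
skewness(symmetric)
```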
The above boxplots show all of the variables listed in the dataset. This visualization may assist in showing how the data is spread.
The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.
For this project, it makes sense to break down the correlation by wins - since that’s what we’re trying to predict.
variable | correlation with TARGET_WINS |
---|---|
TARGET_WINS | 1.0000000 |
TEAM_BATTING_H | 0.4699467 |
TEAM_BATTING_2B | 0.3129840 |
TEAM_BATTING_3B | -0.1243459 |
TEAM_BATTING_HR | 0.4224168 |
TEAM_BATTING_BB | 0.4686879 |
TEAM_BATTING_SO | -0.2288927 |
TEAM_BASERUN_SB | 0.0148364 |
TEAM_BASERUN_CS | -0.1787560 |
TEAM_BATTING_HBP | 0.0735042 |
TEAM_PITCHING_H | 0.4712343 |
TEAM_PITCHING_HR | 0.4224668 |
TEAM_PITCHING_BB | 0.4683988 |
TEAM_PITCHING_SO | -0.2293648 |
TEAM_FIELDING_E | -0.3866880 |
TEAM_FIELDING_DP | -0.1958660 |
Below is a visual representation of the correlation plot.
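According to the correlation graph, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_PITCHING_H, TEAM_PITCHING_HR, and TEAM_PITCHING_BB are the most positively correlated with TARGET_WINS, with correlations between roughly 0.31 and 0.47.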
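The correlations with the target can be sketched in base R. The data below is simulated and the column names are hypothetical. The sketch contrasts complete-case correlations (rows with any NA dropped first, as drop_na does in the appendix) with pairwise-complete correlations, since the two can differ when many values are missing.

```r
set.seed(1)
n <- 200
hits   <- rnorm(n, 1450, 100)
errors <- rnorm(n, 150, 40)
wins   <- 0.05 * hits - 0.1 * errors + rnorm(n, 0, 5)
toy <- data.frame(wins, hits, errors)
toy$errors[sample(n, 40)] <- NA   # inject missing values into one column

# Complete-case correlations: rows with any NA are dropped first
cc <- cor(toy[complete.cases(toy), ])[, "wins"]

# Pairwise-complete correlations: each pair uses all rows where both are observed
pw <- cor(toy, use = "pairwise.complete.obs")[, "wins"]

# Predictors sorted by their correlation with the target
sort(cc[names(cc) != "wins"], decreasing = TRUE)
```

In this toy case only the errors column has gaps, so the two approaches agree for errors but not for hits, whose pairwise correlation uses all 200 rows.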
Data Preparation
Removal of Data
The variable TEAM_BATTING_HBP is missing 2,085 of its 2,276 values (about 92%). That variable will be removed completely.
The variables TEAM_PITCHING_HR and TEAM_BATTING_HR are very closely correlated with each other, which suggests collinearity. The TEAM_PITCHING_HR variable will be dropped from the dataset.
Using the vif and vifstep functions from the usdm package, the data will first be tested for other collinearity issues. Any variables found to have collinearity issues will be discarded.
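For intuition, the variance inflation factor of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. The sketch below computes this by hand in base R on simulated data. The helper name vif_manual is hypothetical, and this is not the usdm implementation.

```r
set.seed(42)
n  <- 300
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1, so strongly collinear
x3 <- rnorm(n)                  # unrelated predictor
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    f  <- reformulate(setdiff(names(df), v), response = v)
    r2 <- summary(lm(f, data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(X)
print(round(vifs, 2))
names(vifs)[vifs > 10]   # candidates for removal under a threshold of 10
```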
Imputation of Missing (NA) values
The data exploration revealed multiple variables with numerous NA values. There are several ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.
Imputation with the mean/median/mode is an easy way to fill in the missing values; however, it reduces the variance in the dataset and shrinks standard errors, which can invalidate hypothesis tests.
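The variance-shrinking effect of mean imputation is easy to demonstrate. In the sketch below (simulated data, 30% of values removed at random), every missing value is replaced by the observed mean, so the imputed points add nothing to the sum of squared deviations while still increasing the sample size.

```r
set.seed(7)
x <- rnorm(500, mean = 100, sd = 15)
x_missing <- x
x_missing[sample(500, 150)] <- NA        # remove 30% of the values at random

# Mean imputation: every NA gets the same value
x_mean_imp <- ifelse(is.na(x_missing), mean(x_missing, na.rm = TRUE), x_missing)

sd_observed <- sd(x_missing, na.rm = TRUE)   # based on the 350 observed values
sd_imputed  <- sd(x_mean_imp)                # based on all 500 values after imputation
c(observed = sd_observed, mean_imputed = sd_imputed)
```

Here the standard deviation after imputation equals the observed one multiplied by sqrt(349 / 499), so the shrinkage grows with the share of missing values.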
In this case, the data will be imputed via prediction using the mice (Multivariate Imputation by Chained Equations) package with the random forest method.
Variables whose VIF exceeds the threshold of 10 (the th argument to vifstep) will be discarded to avoid collinearity issues.
vif(imputed)
## Variables VIF
## 1 TARGET_WINS 1.499251
## 2 TEAM_BATTING_H 3.997932
## 3 TEAM_BATTING_2B 2.457536
## 4 TEAM_BATTING_3B 2.979868
## 5 TEAM_BATTING_HR 4.918888
## 6 TEAM_BATTING_BB 5.543429
## 7 TEAM_BATTING_SO 5.231486
## 8 TEAM_BASERUN_SB 2.344517
## 9 TEAM_BASERUN_CS 1.778219
## 10 TEAM_PITCHING_H 3.683440
## 11 TEAM_PITCHING_BB 4.781545
## 12 TEAM_PITCHING_SO 2.986074
## 13 TEAM_FIELDING_E 4.774207
## 14 TEAM_FIELDING_DP 1.750808
v1 <- vifstep(imputed, th=10)
Output: the table below shows the results of the data manipulation above.
The NA data has been ‘filled in’ using MICE prediction with the random forest method. None of the VIF values above exceeds the threshold of 10 (the largest is 5.54, for TEAM_BATTING_BB), so vifstep removes no further variables and all 14 remaining variables appear in the table.
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TARGET_WINS | 1 | 2276 | 80.79086 | 15.75215 | 82.0 | 81.31229 | 14.8260 | 0 | 146 | 146 | -0.3987232 | 1.0274757 | 0.3301823 |
TEAM_BATTING_H | 2 | 2276 | 1469.26977 | 144.59120 | 1454.0 | 1459.04116 | 114.1602 | 891 | 2554 | 1663 | 1.5713335 | 7.2785261 | 3.0307891 |
TEAM_BATTING_2B | 3 | 2276 | 241.24692 | 46.80141 | 238.0 | 240.39627 | 47.4432 | 69 | 458 | 389 | 0.2151018 | 0.0061609 | 0.9810087 |
TEAM_BATTING_3B | 4 | 2276 | 55.25000 | 27.93856 | 47.0 | 52.17563 | 23.7216 | 0 | 223 | 223 | 1.1094652 | 1.5032418 | 0.5856226 |
TEAM_BATTING_HR | 5 | 2276 | 99.61204 | 60.54687 | 102.0 | 97.38529 | 78.5778 | 0 | 264 | 264 | 0.1860421 | -0.9631189 | 1.2691285 |
TEAM_BATTING_BB | 6 | 2276 | 501.55888 | 122.67086 | 512.0 | 512.18331 | 94.8864 | 0 | 878 | 878 | -1.0257599 | 2.1828544 | 2.5713150 |
TEAM_BATTING_SO | 7 | 2276 | 730.04525 | 245.34496 | 736.0 | 735.28266 | 278.7288 | 0 | 1399 | 1399 | -0.2491809 | -0.3043446 | 5.1426978 |
TEAM_BASERUN_SB | 8 | 2276 | 130.57777 | 94.76956 | 103.0 | 115.01701 | 65.2344 | 0 | 697 | 697 | 1.8656479 | 4.4601924 | 1.9864734 |
TEAM_BASERUN_CS | 9 | 2276 | 65.05756 | 38.45694 | 53.0 | 58.84522 | 22.2390 | 0 | 201 | 201 | 1.6112087 | 2.4228591 | 0.8060993 |
TEAM_PITCHING_H | 10 | 2276 | 1779.21046 | 1406.84293 | 1518.0 | 1555.89517 | 174.9468 | 1137 | 30132 | 28995 | 10.3295111 | 141.8396985 | 29.4889618 |
TEAM_PITCHING_BB | 11 | 2276 | 553.00791 | 166.35736 | 536.5 | 542.62459 | 98.5929 | 0 | 3645 | 3645 | 6.7438995 | 96.9676398 | 3.4870317 |
TEAM_PITCHING_SO | 12 | 2276 | 811.27373 | 542.07751 | 802.5 | 790.48299 | 254.2659 | 0 | 19278 | 19278 | 22.5302527 | 695.9711670 | 11.3625357 |
TEAM_FIELDING_E | 13 | 2276 | 246.48067 | 227.77097 | 159.0 | 193.43798 | 62.2692 | 65 | 1898 | 1833 | 2.9904656 | 10.9702717 | 4.7743279 |
TEAM_FIELDING_DP | 14 | 2276 | 142.48023 | 28.08209 | 146.0 | 143.54281 | 26.6868 | 52 | 228 | 176 | -0.3341358 | -0.1834034 | 0.5886313 |
Build Models
Using the training data provided, we will build three different linear regression models to determine which provides the best prediction for the number of wins for a baseball team. The three approaches are: all variables, only significant variables, and backwards elimination of each variable.
Model 1: All Variables
This model uses all variables remaining after the data prep (imputation, etc., as described above). Fitting every variable gives a base model, shows which variables are significant in our dataset, and lets us build the other models from that.
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.523 -8.434 0.164 8.261 58.735
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.1424328 5.2322110 4.996 6.29e-07 ***
## TEAM_BATTING_H 0.0466562 0.0036096 12.926 < 2e-16 ***
## TEAM_BATTING_2B -0.0192738 0.0090513 -2.129 0.03333 *
## TEAM_BATTING_3B 0.0408129 0.0166908 2.445 0.01455 *
## TEAM_BATTING_HR 0.0727073 0.0097896 7.427 1.57e-13 ***
## TEAM_BATTING_BB 0.0089198 0.0051882 1.719 0.08571 .
## TEAM_BATTING_SO -0.0121938 0.0025086 -4.861 1.25e-06 ***
## TEAM_BASERUN_SB 0.0413004 0.0042832 9.642 < 2e-16 ***
## TEAM_BASERUN_CS 0.0009375 0.0093794 0.100 0.92039
## TEAM_PITCHING_H -0.0001880 0.0003690 -0.510 0.61045
## TEAM_PITCHING_BB -0.0009933 0.0035554 -0.279 0.77999
## TEAM_PITCHING_SO 0.0025529 0.0008606 2.966 0.00304 **
## TEAM_FIELDING_E -0.0297181 0.0025185 -11.800 < 2e-16 ***
## TEAM_FIELDING_DP -0.0957177 0.0125853 -7.606 4.13e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.9 on 2262 degrees of freedom
## Multiple R-squared: 0.333, Adjusted R-squared: 0.3292
## F-statistic: 86.87 on 13 and 2262 DF, p-value: < 2.2e-16
Conclusions based on model:
The F-statistic is 86.87 and R-squared is 0.333 (adjusted R-squared 0.3292). Of the 13 predictors, 9 have statistically significant p-values at the 5% level.
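One way to keep quoted statistics in step with the printed summary is to pull them from the fitted object rather than typing them by hand. The sketch below is a minimal illustration on R's built-in mtcars data (not the Moneyball data); the model formula is arbitrary.

```r
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)
s   <- summary(fit)

r2     <- s$r.squared
adj_r2 <- s$adj.r.squared
f_stat <- unname(s$fstatistic["value"])

# p-values for the predictors (intercept row dropped)
p_vals <- s$coefficients[-1, "Pr(>|t|)"]
n_sig  <- sum(p_vals < 0.05)

sprintf("F-statistic is %.2f, R-squared is %.4f; %d of %d predictors significant at the 5%% level",
        f_stat, r2, n_sig, length(p_vals))
```

This matters most when an earlier step involves random draws, since a rerun can then change the fitted numbers slightly.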
Model 2: Highly Significant Variables Only
Based on Model 1, Model 2 will focus only on the variables that are statistically significant, to see whether those variables alone produce a better model. Variables will be chosen based on their significance level in the R output.
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.312 -8.551 0.171 8.267 58.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.0046383 4.5179148 7.527 7.47e-14 ***
## TEAM_BATTING_H 0.0407043 0.0026698 15.246 < 2e-16 ***
## TEAM_BATTING_3B 0.0495451 0.0162081 3.057 0.002263 **
## TEAM_BATTING_HR 0.0783195 0.0091365 8.572 < 2e-16 ***
## TEAM_BATTING_SO -0.0134377 0.0023291 -5.770 9.03e-09 ***
## TEAM_BASERUN_SB 0.0446690 0.0037826 11.809 < 2e-16 ***
## TEAM_PITCHING_SO 0.0019535 0.0005818 3.358 0.000799 ***
## TEAM_FIELDING_E -0.0325609 0.0017680 -18.417 < 2e-16 ***
## TEAM_FIELDING_DP -0.0922242 0.0123648 -7.459 1.24e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.92 on 2267 degrees of freedom
## Multiple R-squared: 0.3299, Adjusted R-squared: 0.3275
## F-statistic: 139.5 on 8 and 2267 DF, p-value: < 2.2e-16
Conclusions based on model: the F-statistic is 139.5 and R-squared is 0.3299.
The F-statistic is higher than in the first model; however, R-squared drops slightly (from 0.333 to 0.3299).
Model 3: Backwards Elimination and Significance
Variables will be removed one by one to determine the best-fitting model. After each variable is removed, the model will be refit, until the best output (R-squared, F-statistic) is produced. Only the final output will be shown. This approach is the counterpart of ‘forward selection’; however, I find it easier to work backwards and eliminate variables rather than add them.
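The remove-and-refit loop can be automated. The sketch below is one common variant, not necessarily the procedure used here: it drops the predictor with the largest p-value until every remaining predictor is significant at a chosen level. The function name backward_eliminate and the alpha cutoff are assumptions for illustration, and the data is R's built-in mtcars. Base R's step() function performs a related search using AIC instead of p-values.

```r
# Backward elimination by p-value: repeatedly drop the least significant predictor
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit   <- lm(reformulate(predictors, response = response), data = data)
    p     <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]
    worst <- which.max(p)
    # Stop when every predictor is significant or only one predictor is left
    if (p[worst] <= alpha || length(predictors) == 1) return(fit)
    predictors <- setdiff(predictors, rownames(p)[worst])
  }
}

final <- backward_eliminate(mtcars[, c("mpg", "wt", "hp", "qsec", "drat")], "mpg")
summary(final)$coefficients
```

The sketch assumes numeric predictors, so that coefficient names match column names.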
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB +
## TEAM_FIELDING_E + TEAM_BATTING_HR, data = imputed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.322 -9.029 0.006 8.508 57.962
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.314710 2.896116 1.835 0.0666 .
## TEAM_BATTING_H 0.049751 0.002022 24.608 < 2e-16 ***
## TEAM_BASERUN_SB 0.049378 0.003391 14.560 < 2e-16 ***
## TEAM_FIELDING_E -0.026155 0.001630 -16.049 < 2e-16 ***
## TEAM_BATTING_HR 0.023860 0.005956 4.006 6.38e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.19 on 2271 degrees of freedom
## Multiple R-squared: 0.3005, Adjusted R-squared: 0.2992
## F-statistic: 243.8 on 4 and 2271 DF, p-value: < 2.2e-16
Conclusions based on model: the F-statistic is 243.8 and R-squared is 0.3005. The F-statistic is larger than in either of the other two models; however, R-squared is slightly lower.
Select Models
The three models from the previous section are summarised below. Of the three, I decided to use Model 3 for the predictions. While the first model had the highest R-squared, it included several variables that weren’t statistically significant, and some with multicollinearity issues. The F-statistic of Model 3 is also much higher than those of the other two.
A comparison of the multiple linear regression models, based on mean squared error, root MSE, adjusted R-squared, and F-statistic.
Model 1 | Model 2 | Model 3 |
---|---|---|
Mean Squared Error: 165.430149959513 | Mean Squared Error: 166.208918203967 | Mean Squared Error: 173.503148865462 |
Root MSE: 12.8619652448416 | Root MSE: 12.8922037760798 | Root MSE: 13.1720594010755 |
Adjusted R-squared: 0.329166864046515 | Adjusted R-squared: 0.32749541996577 | Adjusted R-squared: 0.299218431818632 |
F-statistic: 86.8696419732824 | F-statistic: 139.484573663462 | F-statistic: 243.843834404051 |
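The root MSE values in the table are slightly smaller than the residual standard errors printed by summary() (for Model 1, 12.86 versus 12.9). That would be consistent with the MSE being the plain mean of the squared residuals (dividing by n), while the residual standard error divides by the residual degrees of freedom: 12.9 x sqrt(2262 / 2276) is approximately 12.86. The sketch below shows the relationship on R's built-in mtcars data with an arbitrary model.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)
n   <- length(res)
k   <- length(coef(fit))             # estimated coefficients, including the intercept

mse  <- mean(res^2)                  # divides by n
rmse <- sqrt(mse)
rse  <- sqrt(sum(res^2) / (n - k))   # "Residual standard error" in summary(); divides by n - k

c(mse = mse, rmse = rmse, rse = rse, sigma = summary(fit)$sigma)
```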
Predictions
Like the training data, the evaluation data also needs some prep work. As was done for the training data, columns have been removed from the evaluation data, and NA values have been imputed using the MICE random forest method.
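Because the evaluation data has no TARGET_WINS column, its column positions differ from those of the training data. Dropping columns by name rather than by position is one way to keep the two preparations consistent. The sketch below uses two-row hypothetical stand-ins; the identifier column name INDEX and the helper prep are assumptions for illustration.

```r
drop_cols <- c("INDEX", "TEAM_BATTING_HBP", "TEAM_PITCHING_HR")

# Hypothetical stand-ins; the evaluation frame has no TARGET_WINS column
train <- data.frame(INDEX = 1:2, TARGET_WINS = c(80, 90), TEAM_BATTING_H = c(1400, 1500),
                    TEAM_BATTING_HBP = c(NA, 50), TEAM_PITCHING_HR = c(100, 120))
evaluation <- data.frame(INDEX = 3:4, TEAM_BATTING_H = c(1450, 1380),
                         TEAM_BATTING_HBP = c(NA, NA), TEAM_PITCHING_HR = c(90, 110))

# Remove the same named columns from any data frame
prep <- function(df) df[, setdiff(names(df), drop_cols), drop = FALSE]

train_prepped <- prep(train)
eval_prepped  <- prep(evaluation)

# Every predictor the model is trained on must be present in the evaluation data
stopifnot(all(setdiff(names(train_prepped), "TARGET_WINS") %in% names(eval_prepped)))
```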
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
TEAM_BATTING_H | 1 | 259 | 1469.38996 | 150.65523 | 1455 | 1463.68421 | 114.1602 | 819 | 2170 | 1351 | 0.5876139 | 3.6642947 | 9.361261 |
TEAM_BATTING_2B | 2 | 259 | 241.32046 | 49.51612 | 239 | 242.32536 | 48.9258 | 44 | 376 | 332 | -0.3273282 | 0.6693023 | 3.076782 |
TEAM_BATTING_3B | 3 | 259 | 55.91120 | 27.14410 | 52 | 52.94737 | 26.6868 | 14 | 155 | 141 | 0.9790284 | 0.6987468 | 1.686652 |
TEAM_BATTING_HR | 4 | 259 | 95.63320 | 56.33221 | 101 | 93.67943 | 66.7170 | 0 | 242 | 242 | 0.1712363 | -0.9031262 | 3.500313 |
TEAM_BATTING_BB | 5 | 259 | 498.95753 | 120.59215 | 509 | 505.98086 | 94.8864 | 15 | 792 | 777 | -0.9209916 | 2.5265655 | 7.493232 |
TEAM_BATTING_SO | 6 | 259 | 701.83784 | 243.51290 | 680 | 708.15311 | 259.4550 | 0 | 1268 | 1268 | -0.2448918 | -0.2154189 | 15.131155 |
TEAM_BASERUN_SB | 7 | 259 | 128.12741 | 96.02480 | 95 | 112.79426 | 63.7518 | 0 | 580 | 580 | 1.6607884 | 3.1509919 | 5.966691 |
TEAM_BASERUN_CS | 8 | 259 | 64.28571 | 35.48134 | 55 | 60.36364 | 23.7216 | 0 | 154 | 154 | 1.0136018 | 0.2796269 | 2.204703 |
TEAM_PITCHING_H | 9 | 259 | 1813.46332 | 1662.91308 | 1515 | 1554.25359 | 173.4642 | 1155 | 22768 | 21613 | 9.2764797 | 102.0702914 | 103.328391 |
TEAM_PITCHING_BB | 10 | 259 | 552.41699 | 172.95006 | 526 | 536.46411 | 97.8516 | 136 | 2008 | 1872 | 4.1113772 | 29.2127324 | 10.746594 |
TEAM_PITCHING_SO | 11 | 259 | 794.33205 | 613.71595 | 736 | 760.66986 | 235.7334 | 0 | 9963 | 9963 | 12.8487348 | 189.3937245 | 38.134454 |
TEAM_FIELDING_E | 12 | 259 | 249.74903 | 230.90260 | 163 | 197.36364 | 59.3040 | 73 | 1568 | 1495 | 3.0887263 | 10.8748551 | 14.347589 |
TEAM_FIELDING_DP | 13 | 259 | 142.88417 | 27.34920 | 146 | 144.02392 | 25.2042 | 69 | 204 | 135 | -0.3784993 | -0.0737034 | 1.699397 |
After imputing and cleaning the data, the predict function was applied with Model 3. The following are the predicted values for the evaluation set, including 95% prediction intervals (the default level):
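The intervals in the table extend about 25.9 wins either side of each fitted value, which is close to 1.96 times the Model 3 residual standard error of 13.19. The sketch below, on R's built-in mtcars data with made-up new observations, shows how a prediction interval is built and why it is wider than a confidence interval for the mean.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 180))   # hypothetical new observations

pred_int <- predict(fit, newdata = new_cars, interval = "prediction", level = 0.95)
conf_int <- predict(fit, newdata = new_cars, interval = "confidence", level = 0.95)

# Rebuild the prediction interval by hand: fit +/- t * sqrt(sigma^2 + se.fit^2)
p      <- predict(fit, newdata = new_cars, se.fit = TRUE)
t_crit <- qt(0.975, df = p$df)
half   <- t_crit * sqrt(p$residual.scale^2 + p$se.fit^2)
manual <- cbind(fit = p$fit, lwr = p$fit - half, upr = p$fit + half)

print(pred_int)
print(manual)
```

The sigma^2 term accounts for the noise in a single new observation, so the prediction interval stays wide even when the fitted mean is estimated precisely.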
fit | lwr | upr |
---|---|---|
66.84442 | 40.957813 | 92.73102 |
67.29649 | 41.411221 | 93.18175 |
75.77018 | 49.899595 | 101.64077 |
89.74078 | 63.866483 | 115.61507 |
75.48386 | 49.595064 | 101.37266 |
69.68768 | 43.804389 | 95.57097 |
82.54934 | 56.648014 | 108.45066 |
75.56408 | 49.690188 | 101.43797 |
69.99633 | 44.108431 | 95.88424 |
73.95660 | 48.078185 | 99.83501 |
75.43232 | 49.553884 | 101.31075 |
82.10294 | 56.229900 | 107.97598 |
78.34735 | 52.466793 | 104.22790 |
80.21514 | 54.338437 | 106.09185 |
78.78385 | 52.914487 | 104.65321 |
79.30491 | 53.434194 | 105.17562 |
73.06414 | 47.189400 | 98.93888 |
82.01046 | 56.138711 | 107.88222 |
68.14481 | 42.258018 | 94.03161 |
91.78241 | 65.898550 | 117.66626 |
81.56651 | 55.684708 | 107.44832 |
85.71854 | 59.841212 | 111.59588 |
77.40235 | 51.531677 | 103.27303 |
73.39180 | 47.517128 | 99.26648 |
86.01310 | 60.143217 | 111.88299 |
89.94805 | 64.073626 | 115.82247 |
58.17171 | 32.181737 | 84.16168 |
76.31521 | 50.436286 | 102.19414 |
81.98767 | 56.100175 | 107.87516 |
77.98918 | 52.101387 | 103.87698 |
86.62082 | 60.748320 | 112.49332 |
84.04249 | 58.172776 | 109.91221 |
81.98405 | 56.113320 | 107.85478 |
82.68952 | 56.815014 | 108.56403 |
79.36204 | 53.491014 | 105.23307 |
80.68676 | 54.801129 | 106.57239 |
75.28182 | 49.410840 | 101.15281 |
87.86324 | 61.974069 | 113.75241 |
85.83062 | 59.956980 | 111.70426 |
87.27252 | 61.391084 | 113.15395 |
81.87679 | 56.005901 | 107.74768 |
87.11062 | 61.237314 | 112.98392 |
34.16652 | 7.959504 | 60.37354 |
101.01809 | 75.034458 | 127.00173 |
91.10157 | 65.187555 | 117.01558 |
90.73397 | 64.838737 | 116.62920 |
96.74828 | 70.842770 | 122.65379 |
73.53568 | 47.661382 | 99.40997 |
69.34003 | 43.454553 | 95.22550 |
76.59684 | 50.720984 | 102.47269 |
79.64269 | 53.768362 | 105.51703 |
87.59705 | 61.714644 | 113.47947 |
77.19815 | 51.327590 | 103.06872 |
72.95083 | 47.075836 | 98.82582 |
78.10832 | 52.241980 | 103.97467 |
79.26373 | 53.395381 | 105.13209 |
89.17837 | 63.285636 | 115.07110 |
74.02463 | 48.133735 | 99.91552 |
62.09936 | 36.187601 | 88.01112 |
77.61336 | 51.733548 | 103.49317 |
86.68127 | 60.796375 | 112.56616 |
76.03359 | 50.150722 | 101.91645 |
86.07192 | 60.201184 | 111.94266 |
86.00686 | 60.105127 | 111.90859 |
86.53265 | 60.647009 | 112.41830 |
101.00517 | 75.058266 | 126.95208 |
74.88012 | 49.009567 | 100.75068 |
83.16788 | 57.286354 | 109.04941 |
78.99179 | 53.100575 | 104.88300 |
85.37388 | 59.483083 | 111.26468 |
84.46322 | 58.565243 | 110.36119 |
76.92828 | 51.042384 | 102.81417 |
79.30842 | 53.436993 | 105.17985 |
83.64109 | 57.744431 | 109.53775 |
85.24998 | 59.376117 | 111.12385 |
86.41449 | 60.539314 | 112.28967 |
81.54696 | 55.676934 | 107.41698 |
82.67868 | 56.805416 | 108.55194 |
71.77148 | 45.885857 | 97.65711 |
78.42707 | 52.543575 | 104.31056 |
86.99878 | 61.120091 | 112.87748 |
89.03482 | 63.152356 | 114.91729 |
96.55587 | 70.654185 | 122.45756 |
81.34326 | 55.465639 | 107.22087 |
80.37801 | 54.506251 | 106.24978 |
82.12541 | 56.256977 | 107.99384 |
79.44775 | 53.574779 | 105.32072 |
82.52453 | 56.656454 | 108.39260 |
84.69217 | 58.824283 | 110.56007 |
90.34284 | 64.457209 | 116.22848 |
79.35878 | 53.475732 | 105.24182 |
86.40493 | 60.374066 | 112.43580 |
72.88054 | 47.005624 | 98.75546 |
82.69688 | 56.814328 | 108.57944 |
85.11620 | 59.224855 | 111.00754 |
80.10770 | 54.224637 | 105.99075 |
83.13282 | 57.249523 | 109.01611 |
96.14589 | 70.231933 | 122.05984 |
86.34516 | 60.453575 | 112.23675 |
88.92693 | 63.046028 | 114.80783 |
82.82389 | 56.944149 | 108.70364 |
72.64541 | 46.770407 | 98.52041 |
83.41881 | 57.543722 | 109.29390 |
78.14882 | 52.273596 | 104.02405 |
81.24605 | 55.359698 | 107.13241 |
71.38093 | 45.495315 | 97.26655 |
49.07420 | 23.120501 | 75.02789 |
83.66313 | 57.791624 | 109.53463 |
84.06725 | 58.190065 | 109.94444 |
60.78095 | 34.881272 | 86.68063 |
82.92419 | 57.056753 | 108.79163 |
87.48037 | 61.606074 | 113.35467 |
94.74991 | 68.869893 | 120.62993 |
91.46150 | 65.587718 | 117.33528 |
83.94691 | 58.080318 | 109.81350 |
83.17406 | 57.306569 | 109.04155 |
91.27021 | 65.395574 | 117.14485 |
82.56630 | 56.696811 | 108.43578 |
79.55968 | 53.691045 | 105.42831 |
77.63222 | 51.717258 | 103.54719 |
89.90789 | 64.018662 | 115.79711 |
67.06688 | 41.171107 | 92.96266 |
66.59474 | 40.704784 | 92.48470 |
59.86647 | 33.955355 | 85.77757 |
68.79519 | 42.906290 | 94.68409 |
87.23263 | 61.345775 | 113.11949 |
88.25727 | 62.357883 | 114.15666 |
74.77664 | 48.901378 | 100.65190 |
87.55390 | 61.679419 | 113.42839 |
93.04601 | 67.154479 | 118.93755 |
86.10646 | 60.225416 | 111.98750 |
80.45473 | 54.580154 | 106.32931 |
79.69722 | 53.826387 | 105.56806 |
85.26277 | 59.390493 | 111.13504 |
85.42525 | 59.541952 | 111.30855 |
72.94748 | 47.046308 | 98.84865 |
76.50858 | 50.638360 | 102.37880 |
79.00200 | 53.131226 | 104.87278 |
91.06717 | 65.179088 | 116.95525 |
82.63074 | 56.763351 | 108.49814 |
67.17715 | 41.287469 | 93.06684 |
69.66325 | 43.773382 | 95.55312 |
91.60195 | 65.711651 | 117.49224 |
76.31512 | 50.439476 | 102.19076 |
72.57974 | 46.696382 | 98.46309 |
71.93374 | 46.056673 | 97.81081 |
78.39098 | 52.520139 | 104.26183 |
81.57525 | 55.705767 | 107.44473 |
84.55755 | 58.686674 | 110.42842 |
81.18704 | 55.318823 | 107.05525 |
82.84533 | 56.964462 | 108.72619 |
84.32265 | 58.454293 | 110.19100 |
43.70469 | 17.561519 | 69.84787 |
73.46726 | 47.594155 | 99.34037 |
77.02909 | 51.159289 | 102.89889 |
76.37344 | 50.501229 | 102.24565 |
87.66079 | 61.774227 | 113.54735 |
65.39449 | 39.500185 | 91.28880 |
87.37835 | 61.495124 | 113.26158 |
72.30981 | 46.428228 | 98.19140 |
96.20487 | 70.309711 | 122.10004 |
99.61895 | 73.721749 | 125.51616 |
86.48036 | 60.607852 | 112.35287 |
97.75285 | 71.849054 | 123.65664 |
89.48008 | 63.596132 | 115.36403 |
84.20031 | 58.325139 | 110.07548 |
82.03997 | 56.169395 | 107.91055 |
81.64414 | 55.774642 | 107.51364 |
77.17411 | 51.299160 | 103.04907 |
82.86275 | 56.991814 | 108.73368 |
88.58714 | 62.698063 | 114.47622 |
86.22406 | 60.342183 | 112.10595 |
77.99639 | 52.121330 | 103.87146 |
90.50307 | 64.611747 | 116.39439 |
81.36673 | 55.497221 | 107.23623 |
73.45158 | 47.571493 | 99.33166 |
74.73739 | 48.865647 | 100.60914 |
74.98058 | 49.109682 | 100.85148 |
73.62646 | 47.751491 | 99.50144 |
79.85270 | 53.983020 | 105.72239 |
87.19922 | 61.282026 | 113.11641 |
84.50598 | 58.618621 | 110.39334 |
85.99410 | 60.123090 | 111.86512 |
81.94932 | 56.065374 | 107.83326 |
88.96062 | 62.949304 | 114.97194 |
101.00683 | 75.002680 | 127.01098 |
88.05561 | 62.165108 | 113.94610 |
70.48439 | 44.505083 | 96.46370 |
64.78586 | 38.887407 | 90.68431 |
114.40413 | 88.421126 | 140.38712 |
70.30229 | 44.424225 | 96.18036 |
79.77672 | 53.898052 | 105.65538 |
78.67288 | 52.804513 | 104.54124 |
80.97365 | 55.095368 | 106.85194 |
84.03774 | 58.152729 | 109.92274 |
69.96893 | 44.090367 | 95.84750 |
76.92294 | 51.055207 | 102.79068 |
75.92875 | 50.056883 | 101.80061 |
75.42316 | 49.552385 | 101.29393 |
82.05362 | 56.184549 | 107.92270 |
76.01820 | 50.145994 | 101.89040 |
79.95618 | 54.087339 | 105.82501 |
73.37336 | 47.499601 | 99.24713 |
87.57201 | 61.702426 | 113.44160 |
81.64056 | 55.773567 | 107.50755 |
78.64678 | 52.776214 | 104.51735 |
81.55645 | 55.688237 | 107.42466 |
79.03358 | 53.165700 | 104.90146 |
81.64337 | 55.763040 | 107.52371 |
71.94618 | 46.061585 | 97.83078 |
101.36212 | 75.445411 | 127.27883 |
90.50750 | 64.617099 | 116.39789 |
78.89369 | 53.023090 | 104.76429 |
67.71789 | 41.836726 | 93.59905 |
70.23387 | 44.355541 | 96.11220 |
84.72739 | 58.852678 | 110.60209 |
83.81404 | 57.945103 | 109.68297 |
94.64876 | 68.759581 | 120.53794 |
79.29458 | 53.426880 | 105.16227 |
77.45905 | 51.589497 | 103.32860 |
81.50475 | 55.637242 | 107.37227 |
81.46051 | 55.591858 | 107.32916 |
85.04305 | 59.170698 | 110.91539 |
80.61032 | 54.741482 | 106.47916 |
86.29814 | 60.209008 | 112.38728 |
74.72789 | 48.856599 | 100.59918 |
81.51074 | 55.643768 | 107.37770 |
82.14032 | 56.268014 | 108.01262 |
80.90219 | 55.034931 | 106.76945 |
77.55036 | 51.651524 | 103.44919 |
74.59904 | 48.712648 | 100.48543 |
92.16630 | 66.284537 | 118.04806 |
77.80810 | 51.937732 | 103.67847 |
85.82773 | 59.945075 | 111.71039 |
77.55453 | 51.685984 | 103.42307 |
73.67686 | 47.803936 | 99.54978 |
84.36600 | 58.500094 | 110.23191 |
77.46094 | 51.589502 | 103.33239 |
85.31306 | 59.434543 | 111.19158 |
72.88181 | 47.004995 | 98.75862 |
87.08428 | 61.209329 | 112.95923 |
85.51560 | 59.637122 | 111.39408 |
84.44546 | 58.561874 | 110.32904 |
86.97796 | 61.107681 | 112.84824 |
65.87711 | 39.981794 | 91.77243 |
89.86501 | 63.990213 | 115.73982 |
81.30220 | 55.435638 | 107.16875 |
85.04720 | 59.170663 | 110.92374 |
73.59370 | 47.718619 | 99.46878 |
89.21338 | 63.327736 | 115.09902 |
82.97177 | 57.094636 | 108.84890 |
54.33927 | 28.368448 | 80.31009 |
91.22629 | 65.333606 | 117.11898 |
28.37968 | 2.269096 | 54.49026 |
69.69086 | 43.811424 | 95.57030 |
74.75187 | 48.881225 | 100.62251 |
82.89015 | 57.016361 | 108.76393 |
85.47483 | 59.596340 | 111.35333 |
80.49211 | 54.618015 | 106.36621 |
## fit lwr upr
## Min. : 28.38 Min. : 2.269 Min. : 54.49
## 1st Qu.: 75.97 1st Qu.:50.101 1st Qu.:101.85
## Median : 81.55 Median :55.677 Median :107.42
## Mean : 80.50 Mean :54.608 Mean :106.38
## 3rd Qu.: 86.01 3rd Qu.:60.133 3rd Qu.:111.90
## Max. :114.40 Max. :88.421 Max. :140.39
## 1
## 81.01443
## fit lwr upr
## 1 81.01443 55.14843 106.8804
Appendix
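library(tidyverse)   # read_csv, ggplot2, %>%, keep, gather, drop_na
library(psych)       # describe
library(knitr)       # kable
library(kableExtra)  # kable_styling, column_spec, scroll_box
library(corrgram)    # corrgram, panel.cor
library(mice)        # mice, complete
library(usdm)        # vif, vifstep
moneyball_training_data <- read_csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/moneyball-training-data.csv")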
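# With na.rm = FALSE, describe() uses only rows with no missing values (n = 191 here)
mbd1 <- describe(moneyball_training_data, na.rm = FALSE)
# Missing values per column, counted over all rows
mbd1$na_count <- sapply(moneyball_training_data, function(y) sum(is.na(y)))
mbd1 <- mbd1[-1, ]  # drop the row for the identifier column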
kable(mbd1, "html", escape = F) %>%
kable_styling("striped", full_width = T) %>%
column_spec(1, bold = T) %>%
scroll_box(width = "100%", height = "700px")
ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) +
geom_boxplot() +
theme(legend.position="none") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(panel.background = element_rect(fill = '#d0ddf2'))
ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) +
geom_boxplot() +
coord_cartesian(ylim = c(0, 800)) +
theme(legend.position="none") +
theme(axis.text.x=element_text(angle=45, hjust=1)) +
theme(panel.background = element_rect(fill = '#d0ddf2'))
mb_hist <- moneyball_training_data
mb_hist <- mb_hist[,-1 ]
mb_hist %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(bins = 35)
kable(cor(drop_na(mb_hist))[,1], "html", escape = F) %>%
kable_styling("striped", full_width = F) %>%
column_spec(1, bold = T) %>%
scroll_box(height = "500px")
corrgram(drop_na(mb_hist), order=TRUE,
upper.panel=panel.cor, main="Moneyball")
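mbd2 <- moneyball_training_data
# Drop the identifier (1), TEAM_BATTING_HBP (11) and TEAM_PITCHING_HR (13) in one step.
# Dropping them one at a time shifts the positions of later columns, so -1, -10, -12
# would remove TEAM_PITCHING_BB instead of TEAM_PITCHING_HR.
mbd2 <- mbd2[, -c(1, 11, 13)]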
init = mice(mbd2, maxit=0)
meth = init$method
predM = init$predictorMatrix
predM[, c("TARGET_WINS")]=0
imputed = mice(mbd2, method="rf", predictorMatrix=predM, m=5)
imputed <- complete(imputed)
imputedtable <- describe(imputed)
kable(imputedtable, "html", escape = F) %>%
kable_styling("striped", full_width = T) %>%
column_spec(1, bold = T) %>%
scroll_box(width = "100%", height = "700px")
mbd2 <- moneyball_training_data
mbd2 <- mbd2[,-c(1, 11, 13)]
init = mice(mbd2, maxit=0)
meth = init$method
predM = init$predictorMatrix
predM[, c("TARGET_WINS")]=0
imputed = mice(mbd2, method="rf", predictorMatrix=predM, m=5)
imputed <- complete(imputed)
vif(imputed)
v1 <- vifstep(imputed, th=10)
imputedtable <- describe(imputed)
kable(imputedtable, "html", escape = F) %>%
kable_styling("striped", full_width = T) %>%
column_spec(1, bold = T) %>%
scroll_box(width = "100%", height = "700px")
model1 <- lm(TARGET_WINS ~ ., imputed)
summary(model1)
model2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, imputed)
summary(model2)
# model4 is reported as Model 3 in the text
model4 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_BATTING_HR, imputed)
summary(model4)
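# m1mse, m1root, m1ar, m1fs (and the m2*/m3* equivalents) are not defined in this listing.
# The helper below is a reconstruction that matches the labels and values in the comparison table.
model_stats <- function(m) {
  s <- summary(m)
  mse <- mean(s$residuals^2)
  c(paste("Mean Squared Error:", mse), paste("Root MSE:", sqrt(mse)),
    paste("Adjusted R-squared:", s$adj.r.squared), paste("F-statistic:", s$fstatistic[1]))
}
compare <- data.frame(model_stats(model1), model_stats(model2), model_stats(model4))
colnames(compare) <- c("Model 1", "Model 2", "Model 3")
kable(compare)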
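# mbeval must already hold the evaluation data; the step that reads it is not included in this listing.
# Drop the identifier (1), TEAM_BATTING_HBP (10) and TEAM_PITCHING_HR (12); there is no TARGET_WINS column here.
mbeval <- mbeval[, -c(1, 10, 12)]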
init = mice(mbeval, maxit=0)
meth = init$method
predM = init$predictorMatrix
imputed1 = mice(mbeval, method="rf", predictorMatrix=predM, m=5)
imputed1 <- complete(imputed1)
imputedtable1 <- describe(imputed1)
kable(imputedtable1, "html", escape = F) %>%
kable_styling("striped", full_width = T) %>%
column_spec(1, bold = T) %>%
scroll_box(width = "100%", height = "700px")
predict1 <- predict(model4, newdata = imputed1, interval="prediction")
kable(predict1, "html", escape = F) %>%
kable_styling("striped", full_width = T) %>%
column_spec(1, bold = T) %>%
scroll_box(width = "100%", height = "700px")
summary(predict1)