CUNY MSDS DATA 621

Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Caption for the picture.

Data Exploration

The dataset consists of 17 elements, with 2276 total cases. Out of those 17, 15 are explanatory variables, which can be broken down into four groups:

batting
baserun
fielding
pitching

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se	na_count
TARGET_WINS	2	191	80.92670	12.115013	82	81.11765	13.3434	43	116	73	-0.1698314	-0.2952783	0.8766116	0
TEAM_BATTING_H	3	191	1478.62827	76.147869	1477	1477.42484	74.1300	1308	1667	359	0.1302702	-0.3710350	5.5098664	0
TEAM_BATTING_2B	4	191	297.19895	26.329335	296	296.62745	25.2042	201	373	172	0.0915189	0.4778716	1.9051238	0
TEAM_BATTING_3B	5	191	30.74346	9.043878	29	30.13072	8.8956	12	61	49	0.7007420	0.7446217	0.6543921	0
TEAM_BATTING_HR	6	191	178.05236	32.413243	175	176.81046	35.5824	116	260	144	0.2980673	-0.7172373	2.3453399	0
TEAM_BATTING_BB	7	191	543.31937	74.842133	535	541.31373	74.1300	365	775	410	0.3115199	-0.1474175	5.4153867	0
TEAM_BATTING_SO	8	191	1051.02618	104.156382	1050	1046.95425	97.8516	805	1399	594	0.3985050	0.3955105	7.5364913	102
TEAM_BASERUN_SB	9	191	90.90576	29.916401	87	89.06536	29.6520	31	177	146	0.5553966	-0.1414909	2.1646748	131
TEAM_BASERUN_CS	10	191	39.94241	11.898334	38	39.49020	11.8608	12	74	62	0.3468509	0.0006392	0.8609332	772
TEAM_BATTING_HBP	11	191	59.35602	12.967123	58	58.86275	11.8608	29	95	66	0.3185754	-0.1119828	0.9382681	2085
TEAM_PITCHING_H	12	191	1479.70157	75.788625	1480	1478.50327	72.6474	1312	1667	355	0.1279056	-0.3894781	5.4838725	0
TEAM_PITCHING_HR	13	191	178.17801	32.391678	175	176.93464	35.5824	116	260	144	0.2989191	-0.7190905	2.3437795	0
TEAM_PITCHING_BB	14	191	543.71728	74.916681	537	541.74510	72.6474	367	775	408	0.3144366	-0.1338563	5.4207808	0
TEAM_PITCHING_SO	15	191	1051.81675	104.347208	1052	1047.80392	97.8516	805	1399	594	0.3945586	0.3903991	7.5502990	102
TEAM_FIELDING_E	16	191	107.05236	16.632162	106	106.58170	17.7912	65	145	80	0.1780432	-0.3567367	1.2034610	0
TEAM_FIELDING_DP	17	191	152.33508	17.611682	152	152.04575	19.2738	113	204	91	0.2164822	-0.2115741	1.2743366	286

Looking at the data above, there are multiple variables with missing (NA) values, with TEAM-BATTING_HBP being the highest.

The boxplots below help show the spread of data within the dataset, and show various outliers. As shown in the graph below, TEAM_PITCHING_H seems to have the highest spread with the most outliers.

## Warning: Removed 3478 rows containing non-finite values (stat_boxplot).

The graph below zooms into the other variables, so it becomes easier to see spread and outliers from the other variables.

In the Histograms below, the data shows multiple graphs with right skews while only a few have left-skew.

The above boxplots show all of the variables listed in the dataset. This visualization may assist in showing how the data is spread.

The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.

For this project, it makes sense to break down the correlation by wins - since that’s what we’re trying to predict.

	x
TARGET_WINS	1.0000000
TEAM_BATTING_H	0.4699467
TEAM_BATTING_2B	0.3129840
TEAM_BATTING_3B	-0.1243459
TEAM_BATTING_HR	0.4224168
TEAM_BATTING_BB	0.4686879
TEAM_BATTING_SO	-0.2288927
TEAM_BASERUN_SB	0.0148364
TEAM_BASERUN_CS	-0.1787560
TEAM_BATTING_HBP	0.0735042
TEAM_PITCHING_H	0.4712343
TEAM_PITCHING_HR	0.4224668
TEAM_PITCHING_BB	0.4683988
TEAM_PITCHING_SO	-0.2293648
TEAM_FIELDING_E	-0.3866880
TEAM_FIELDING_DP	-0.1958660

Below is a visual representation of the correlation plot.

According to the coorelation graph, batting_h, batting_2b, batting_hr, batting_bb, pitching_h, pitching_hr, and pitching_bb are the most positively correlated.

Data Preparation

Removal of Data

The variable TEAM_BATTING_HBP is also missing over 90% of its values. That variable will be removed completely.

The variable TEAM_PITCHING_HR and TEAM_BATTING_HR are also very closely correlated with each other. This shows that there may be some collinearity involved. The TEAM_PITCHING_HR variable will be dropped from the dataset

Using the VIF and vifstep function from the USDM package, the data will first be tested for other collinearity issues. The variables determined that have collinearity issues will be discarded.

Imputation of Missing (NA) values

The data exploration revealed multiple variables that had numerious NA values. There are multiple ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.

Imputation the mean/median/mode is an easy way to fill in the missing NA’s, however it reduces the variance in the dataset and shrinks standard errors - which can invalidate hypothesis tests.

In this case, data will be imputated via prediction using the MICE (Multivariate Imputation) library using a random forest prediction method.

Variables that exceed the established threshold will be discarded to avoid collinearity issues.

vif(imputed)

##           Variables      VIF
## 1       TARGET_WINS 1.499251
## 2    TEAM_BATTING_H 3.997932
## 3   TEAM_BATTING_2B 2.457536
## 4   TEAM_BATTING_3B 2.979868
## 5   TEAM_BATTING_HR 4.918888
## 6   TEAM_BATTING_BB 5.543429
## 7   TEAM_BATTING_SO 5.231486
## 8   TEAM_BASERUN_SB 2.344517
## 9   TEAM_BASERUN_CS 1.778219
## 10  TEAM_PITCHING_H 3.683440
## 11 TEAM_PITCHING_BB 4.781545
## 12 TEAM_PITCHING_SO 2.986074
## 13  TEAM_FIELDING_E 4.774207
## 14 TEAM_FIELDING_DP 1.750808

v1 <- vifstep(imputed, th=10)

Output - The below table shows the results of above data manipulation.

The NA data has been ‘filled in’ using the MICE prediction, using the Random Forest Method. Variables with collinearity as established by the vir/virstep package have been dropped.

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
TARGET_WINS	1	2276	80.79086	15.75215	82.0	81.31229	14.8260	0	146	146	-0.3987232	1.0274757	0.3301823
TEAM_BATTING_H	2	2276	1469.26977	144.59120	1454.0	1459.04116	114.1602	891	2554	1663	1.5713335	7.2785261	3.0307891
TEAM_BATTING_2B	3	2276	241.24692	46.80141	238.0	240.39627	47.4432	69	458	389	0.2151018	0.0061609	0.9810087
TEAM_BATTING_3B	4	2276	55.25000	27.93856	47.0	52.17563	23.7216	0	223	223	1.1094652	1.5032418	0.5856226
TEAM_BATTING_HR	5	2276	99.61204	60.54687	102.0	97.38529	78.5778	0	264	264	0.1860421	-0.9631189	1.2691285
TEAM_BATTING_BB	6	2276	501.55888	122.67086	512.0	512.18331	94.8864	0	878	878	-1.0257599	2.1828544	2.5713150
TEAM_BATTING_SO	7	2276	730.04525	245.34496	736.0	735.28266	278.7288	0	1399	1399	-0.2491809	-0.3043446	5.1426978
TEAM_BASERUN_SB	8	2276	130.57777	94.76956	103.0	115.01701	65.2344	0	697	697	1.8656479	4.4601924	1.9864734
TEAM_BASERUN_CS	9	2276	65.05756	38.45694	53.0	58.84522	22.2390	0	201	201	1.6112087	2.4228591	0.8060993
TEAM_PITCHING_H	10	2276	1779.21046	1406.84293	1518.0	1555.89517	174.9468	1137	30132	28995	10.3295111	141.8396985	29.4889618
TEAM_PITCHING_BB	11	2276	553.00791	166.35736	536.5	542.62459	98.5929	0	3645	3645	6.7438995	96.9676398	3.4870317
TEAM_PITCHING_SO	12	2276	811.27373	542.07751	802.5	790.48299	254.2659	0	19278	19278	22.5302527	695.9711670	11.3625357
TEAM_FIELDING_E	13	2276	246.48067	227.77097	159.0	193.43798	62.2692	65	1898	1833	2.9904656	10.9702717	4.7743279
TEAM_FIELDING_DP	14	2276	142.48023	28.08209	146.0	143.54281	26.6868	52	228	176	-0.3341358	-0.1834034	0.5886313

Build Models

Using the training data provided, we will build 3 different linear regression models, to determine which will provide the best prediction for the # of wins for a baseball team. The tree approachs are: all variables, only significant variables, and backwards elimination of each variable.

Model 1: All Variables

All remaining variables after the data prep. After the data has been manipulated (imputed, etc. as stated above), all of the variables will be tested to determine the base model they provided. This will allow us to see which variables are significant in our dataset, and allow us to make other models based on that.

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.523  -8.434   0.164   8.261  58.735 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      26.1424328  5.2322110   4.996 6.29e-07 ***
## TEAM_BATTING_H    0.0466562  0.0036096  12.926  < 2e-16 ***
## TEAM_BATTING_2B  -0.0192738  0.0090513  -2.129  0.03333 *  
## TEAM_BATTING_3B   0.0408129  0.0166908   2.445  0.01455 *  
## TEAM_BATTING_HR   0.0727073  0.0097896   7.427 1.57e-13 ***
## TEAM_BATTING_BB   0.0089198  0.0051882   1.719  0.08571 .  
## TEAM_BATTING_SO  -0.0121938  0.0025086  -4.861 1.25e-06 ***
## TEAM_BASERUN_SB   0.0413004  0.0042832   9.642  < 2e-16 ***
## TEAM_BASERUN_CS   0.0009375  0.0093794   0.100  0.92039    
## TEAM_PITCHING_H  -0.0001880  0.0003690  -0.510  0.61045    
## TEAM_PITCHING_BB -0.0009933  0.0035554  -0.279  0.77999    
## TEAM_PITCHING_SO  0.0025529  0.0008606   2.966  0.00304 ** 
## TEAM_FIELDING_E  -0.0297181  0.0025185 -11.800  < 2e-16 ***
## TEAM_FIELDING_DP -0.0957177  0.0125853  -7.606 4.13e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.9 on 2262 degrees of freedom
## Multiple R-squared:  0.333,  Adjusted R-squared:  0.3292 
## F-statistic: 86.87 on 13 and 2262 DF,  p-value: < 2.2e-16

Conclusions based on model:

F-statistic is 89.25, R-squared is 0.3352 Out of the 14 variables, 9 have statistically significant p-values at the 5% level.

Model 2: Highly Significant Variables Only

Based on model one, Model 2 will focus only on the variables that are statistically significant - in order to see if only those variables allow for a better model. Variables will be choosen based on their significance level from the R output.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.312  -8.551   0.171   8.267  58.000 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      34.0046383  4.5179148   7.527 7.47e-14 ***
## TEAM_BATTING_H    0.0407043  0.0026698  15.246  < 2e-16 ***
## TEAM_BATTING_3B   0.0495451  0.0162081   3.057 0.002263 ** 
## TEAM_BATTING_HR   0.0783195  0.0091365   8.572  < 2e-16 ***
## TEAM_BATTING_SO  -0.0134377  0.0023291  -5.770 9.03e-09 ***
## TEAM_BASERUN_SB   0.0446690  0.0037826  11.809  < 2e-16 ***
## TEAM_PITCHING_SO  0.0019535  0.0005818   3.358 0.000799 ***
## TEAM_FIELDING_E  -0.0325609  0.0017680 -18.417  < 2e-16 ***
## TEAM_FIELDING_DP -0.0922242  0.0123648  -7.459 1.24e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.92 on 2267 degrees of freedom
## Multiple R-squared:  0.3299, Adjusted R-squared:  0.3275 
## F-statistic: 139.5 on 8 and 2267 DF,  p-value: < 2.2e-16

Conclusions based on model: F-statistic is 143, R-squared is 0.333

The F-statistic is better than the first model, however the R-squared drops slightly.

Model 3: Backwards Elimination and Significance

Variables will be removed one by one to determine best fit model. After each variable is removed, the model will be ‘ran’ again - until the most optimal output (r2, f-stat) are produced. Only the final output will be shown. This model is similar to the ‘forward selection’ variant - however I find it easier to work our way backwards and to eliminate variables rather than add them.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB + 
##     TEAM_FIELDING_E + TEAM_BATTING_HR, data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.322  -9.029   0.006   8.508  57.962 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.314710   2.896116   1.835   0.0666 .  
## TEAM_BATTING_H   0.049751   0.002022  24.608  < 2e-16 ***
## TEAM_BASERUN_SB  0.049378   0.003391  14.560  < 2e-16 ***
## TEAM_FIELDING_E -0.026155   0.001630 -16.049  < 2e-16 ***
## TEAM_BATTING_HR  0.023860   0.005956   4.006 6.38e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.19 on 2271 degrees of freedom
## Multiple R-squared:  0.3005, Adjusted R-squared:  0.2992 
## F-statistic: 243.8 on 4 and 2271 DF,  p-value: < 2.2e-16

Conclusions based on model: F-statistic is 245.5, R-squared is 0.3006 The F-statistic is larger than both of the other two models, however the R-squared is slightly lower than the other two.

Select Models

The three models from the previous selection have been summarised below. From the three models, I decided to use model 3 for the predictions. While the first model had the highest R-squared, it had multiple variables that weren’t statistically significant, and some that had multicollinearity issues. The F-statistic in model 3 is also much higher than the other two.

A comparsion of the multiple linear regression models, based on: mean square error, R2, F-stat, and root MSE.

Model 1	Model 2	Model 3
Mean Squared Error: 165.430149959513	Mean Squared Error: 166.208918203967	Mean Squared Error: 173.503148865462
Root MSE: 12.8619652448416	Root MSE: 12.8922037760798	Root MSE: 13.1720594010755
Adjusted R-squared: 0.329166864046515	Adjusted R-squared: 0.32749541996577	Adjusted R-squared: 0.299218431818632
F-statistic: 86.8696419732824	F-statistic: 139.484573663462	F-statistic: 243.843834404051

Predictions

Similar to the train data, the evaulation data also needs some prep work. Similar to what was done for the test data, the eval data has had columns removed, and NA values imputed using the MICE - Random Forest method to predict what the NA values could be.

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
TEAM_BATTING_H	1	259	1469.38996	150.65523	1455	1463.68421	114.1602	819	2170	1351	0.5876139	3.6642947	9.361261
TEAM_BATTING_2B	2	259	241.32046	49.51612	239	242.32536	48.9258	44	376	332	-0.3273282	0.6693023	3.076782
TEAM_BATTING_3B	3	259	55.91120	27.14410	52	52.94737	26.6868	14	155	141	0.9790284	0.6987468	1.686652
TEAM_BATTING_HR	4	259	95.63320	56.33221	101	93.67943	66.7170	0	242	242	0.1712363	-0.9031262	3.500313
TEAM_BATTING_BB	5	259	498.95753	120.59215	509	505.98086	94.8864	15	792	777	-0.9209916	2.5265655	7.493232
TEAM_BATTING_SO	6	259	701.83784	243.51290	680	708.15311	259.4550	0	1268	1268	-0.2448918	-0.2154189	15.131155
TEAM_BASERUN_SB	7	259	128.12741	96.02480	95	112.79426	63.7518	0	580	580	1.6607884	3.1509919	5.966691
TEAM_BASERUN_CS	8	259	64.28571	35.48134	55	60.36364	23.7216	0	154	154	1.0136018	0.2796269	2.204703
TEAM_PITCHING_H	9	259	1813.46332	1662.91308	1515	1554.25359	173.4642	1155	22768	21613	9.2764797	102.0702914	103.328391
TEAM_PITCHING_BB	10	259	552.41699	172.95006	526	536.46411	97.8516	136	2008	1872	4.1113772	29.2127324	10.746594
TEAM_PITCHING_SO	11	259	794.33205	613.71595	736	760.66986	235.7334	0	9963	9963	12.8487348	189.3937245	38.134454
TEAM_FIELDING_E	12	259	249.74903	230.90260	163	197.36364	59.3040	73	1568	1495	3.0887263	10.8748551	14.347589
TEAM_FIELDING_DP	13	259	142.88417	27.34920	146	144.02392	25.2042	69	204	135	-0.3784993	-0.0737034	1.699397

After imputing and cleaning the data, using the predict function and Model 3, the following are the predicted values for the test set of the data, including prediction intervals:

fit	lwr	upr
66.84442	40.957813	92.73102
67.29649	41.411221	93.18175
75.77018	49.899595	101.64077
89.74078	63.866483	115.61507
75.48386	49.595064	101.37266
69.68768	43.804389	95.57097
82.54934	56.648014	108.45066
75.56408	49.690188	101.43797
69.99633	44.108431	95.88424
73.95660	48.078185	99.83501
75.43232	49.553884	101.31075
82.10294	56.229900	107.97598
78.34735	52.466793	104.22790
80.21514	54.338437	106.09185
78.78385	52.914487	104.65321
79.30491	53.434194	105.17562
73.06414	47.189400	98.93888
82.01046	56.138711	107.88222
68.14481	42.258018	94.03161
91.78241	65.898550	117.66626
81.56651	55.684708	107.44832
85.71854	59.841212	111.59588
77.40235	51.531677	103.27303
73.39180	47.517128	99.26648
86.01310	60.143217	111.88299
89.94805	64.073626	115.82247
58.17171	32.181737	84.16168
76.31521	50.436286	102.19414
81.98767	56.100175	107.87516
77.98918	52.101387	103.87698
86.62082	60.748320	112.49332
84.04249	58.172776	109.91221
81.98405	56.113320	107.85478
82.68952	56.815014	108.56403
79.36204	53.491014	105.23307
80.68676	54.801129	106.57239
75.28182	49.410840	101.15281
87.86324	61.974069	113.75241
85.83062	59.956980	111.70426
87.27252	61.391084	113.15395
81.87679	56.005901	107.74768
87.11062	61.237314	112.98392
34.16652	7.959504	60.37354
101.01809	75.034458	127.00173
91.10157	65.187555	117.01558
90.73397	64.838737	116.62920
96.74828	70.842770	122.65379
73.53568	47.661382	99.40997
69.34003	43.454553	95.22550
76.59684	50.720984	102.47269
79.64269	53.768362	105.51703
87.59705	61.714644	113.47947
77.19815	51.327590	103.06872
72.95083	47.075836	98.82582
78.10832	52.241980	103.97467
79.26373	53.395381	105.13209
89.17837	63.285636	115.07110
74.02463	48.133735	99.91552
62.09936	36.187601	88.01112
77.61336	51.733548	103.49317
86.68127	60.796375	112.56616
76.03359	50.150722	101.91645
86.07192	60.201184	111.94266
86.00686	60.105127	111.90859
86.53265	60.647009	112.41830
101.00517	75.058266	126.95208
74.88012	49.009567	100.75068
83.16788	57.286354	109.04941
78.99179	53.100575	104.88300
85.37388	59.483083	111.26468
84.46322	58.565243	110.36119
76.92828	51.042384	102.81417
79.30842	53.436993	105.17985
83.64109	57.744431	109.53775
85.24998	59.376117	111.12385
86.41449	60.539314	112.28967
81.54696	55.676934	107.41698
82.67868	56.805416	108.55194
71.77148	45.885857	97.65711
78.42707	52.543575	104.31056
86.99878	61.120091	112.87748
89.03482	63.152356	114.91729
96.55587	70.654185	122.45756
81.34326	55.465639	107.22087
80.37801	54.506251	106.24978
82.12541	56.256977	107.99384
79.44775	53.574779	105.32072
82.52453	56.656454	108.39260
84.69217	58.824283	110.56007
90.34284	64.457209	116.22848
79.35878	53.475732	105.24182
86.40493	60.374066	112.43580
72.88054	47.005624	98.75546
82.69688	56.814328	108.57944
85.11620	59.224855	111.00754
80.10770	54.224637	105.99075
83.13282	57.249523	109.01611
96.14589	70.231933	122.05984
86.34516	60.453575	112.23675
88.92693	63.046028	114.80783
82.82389	56.944149	108.70364
72.64541	46.770407	98.52041
83.41881	57.543722	109.29390
78.14882	52.273596	104.02405
81.24605	55.359698	107.13241
71.38093	45.495315	97.26655
49.07420	23.120501	75.02789
83.66313	57.791624	109.53463
84.06725	58.190065	109.94444
60.78095	34.881272	86.68063
82.92419	57.056753	108.79163
87.48037	61.606074	113.35467
94.74991	68.869893	120.62993
91.46150	65.587718	117.33528
83.94691	58.080318	109.81350
83.17406	57.306569	109.04155
91.27021	65.395574	117.14485
82.56630	56.696811	108.43578
79.55968	53.691045	105.42831
77.63222	51.717258	103.54719
89.90789	64.018662	115.79711
67.06688	41.171107	92.96266
66.59474	40.704784	92.48470
59.86647	33.955355	85.77757
68.79519	42.906290	94.68409
87.23263	61.345775	113.11949
88.25727	62.357883	114.15666
74.77664	48.901378	100.65190
87.55390	61.679419	113.42839
93.04601	67.154479	118.93755
86.10646	60.225416	111.98750
80.45473	54.580154	106.32931
79.69722	53.826387	105.56806
85.26277	59.390493	111.13504
85.42525	59.541952	111.30855
72.94748	47.046308	98.84865
76.50858	50.638360	102.37880
79.00200	53.131226	104.87278
91.06717	65.179088	116.95525
82.63074	56.763351	108.49814
67.17715	41.287469	93.06684
69.66325	43.773382	95.55312
91.60195	65.711651	117.49224
76.31512	50.439476	102.19076
72.57974	46.696382	98.46309
71.93374	46.056673	97.81081
78.39098	52.520139	104.26183
81.57525	55.705767	107.44473
84.55755	58.686674	110.42842
81.18704	55.318823	107.05525
82.84533	56.964462	108.72619
84.32265	58.454293	110.19100
43.70469	17.561519	69.84787
73.46726	47.594155	99.34037
77.02909	51.159289	102.89889
76.37344	50.501229	102.24565
87.66079	61.774227	113.54735
65.39449	39.500185	91.28880
87.37835	61.495124	113.26158
72.30981	46.428228	98.19140
96.20487	70.309711	122.10004
99.61895	73.721749	125.51616
86.48036	60.607852	112.35287
97.75285	71.849054	123.65664
89.48008	63.596132	115.36403
84.20031	58.325139	110.07548
82.03997	56.169395	107.91055
81.64414	55.774642	107.51364
77.17411	51.299160	103.04907
82.86275	56.991814	108.73368
88.58714	62.698063	114.47622
86.22406	60.342183	112.10595
77.99639	52.121330	103.87146
90.50307	64.611747	116.39439
81.36673	55.497221	107.23623
73.45158	47.571493	99.33166
74.73739	48.865647	100.60914
74.98058	49.109682	100.85148
73.62646	47.751491	99.50144
79.85270	53.983020	105.72239
87.19922	61.282026	113.11641
84.50598	58.618621	110.39334
85.99410	60.123090	111.86512
81.94932	56.065374	107.83326
88.96062	62.949304	114.97194
101.00683	75.002680	127.01098
88.05561	62.165108	113.94610
70.48439	44.505083	96.46370
64.78586	38.887407	90.68431
114.40413	88.421126	140.38712
70.30229	44.424225	96.18036
79.77672	53.898052	105.65538
78.67288	52.804513	104.54124
80.97365	55.095368	106.85194
84.03774	58.152729	109.92274
69.96893	44.090367	95.84750
76.92294	51.055207	102.79068
75.92875	50.056883	101.80061
75.42316	49.552385	101.29393
82.05362	56.184549	107.92270
76.01820	50.145994	101.89040
79.95618	54.087339	105.82501
73.37336	47.499601	99.24713
87.57201	61.702426	113.44160
81.64056	55.773567	107.50755
78.64678	52.776214	104.51735
81.55645	55.688237	107.42466
79.03358	53.165700	104.90146
81.64337	55.763040	107.52371
71.94618	46.061585	97.83078
101.36212	75.445411	127.27883
90.50750	64.617099	116.39789
78.89369	53.023090	104.76429
67.71789	41.836726	93.59905
70.23387	44.355541	96.11220
84.72739	58.852678	110.60209
83.81404	57.945103	109.68297
94.64876	68.759581	120.53794
79.29458	53.426880	105.16227
77.45905	51.589497	103.32860
81.50475	55.637242	107.37227
81.46051	55.591858	107.32916
85.04305	59.170698	110.91539
80.61032	54.741482	106.47916
86.29814	60.209008	112.38728
74.72789	48.856599	100.59918
81.51074	55.643768	107.37770
82.14032	56.268014	108.01262
80.90219	55.034931	106.76945
77.55036	51.651524	103.44919
74.59904	48.712648	100.48543
92.16630	66.284537	118.04806
77.80810	51.937732	103.67847
85.82773	59.945075	111.71039
77.55453	51.685984	103.42307
73.67686	47.803936	99.54978
84.36600	58.500094	110.23191
77.46094	51.589502	103.33239
85.31306	59.434543	111.19158
72.88181	47.004995	98.75862
87.08428	61.209329	112.95923
85.51560	59.637122	111.39408
84.44546	58.561874	110.32904
86.97796	61.107681	112.84824
65.87711	39.981794	91.77243
89.86501	63.990213	115.73982
81.30220	55.435638	107.16875
85.04720	59.170663	110.92374
73.59370	47.718619	99.46878
89.21338	63.327736	115.09902
82.97177	57.094636	108.84890
54.33927	28.368448	80.31009
91.22629	65.333606	117.11898
28.37968	2.269096	54.49026
69.69086	43.811424	95.57030
74.75187	48.881225	100.62251
82.89015	57.016361	108.76393
85.47483	59.596340	111.35333
80.49211	54.618015	106.36621

##       fit              lwr              upr        
##  Min.   : 28.38   Min.   : 2.269   Min.   : 54.49  
##  1st Qu.: 75.97   1st Qu.:50.101   1st Qu.:101.85  
##  Median : 81.55   Median :55.677   Median :107.42  
##  Mean   : 80.50   Mean   :54.608   Mean   :106.38  
##  3rd Qu.: 86.01   3rd Qu.:60.133   3rd Qu.:111.90  
##  Max.   :114.40   Max.   :88.421   Max.   :140.39

##        1 
## 81.01443

##        fit      lwr      upr
## 1 81.01443 55.14843 106.8804

Appendex

moneyball_training_data <- read_csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/moneyball-training-data.csv")


mbd1 <- describe(moneyball_training_data, na.rm = F)

mbd1$na_count <- sapply(moneyball_training_data, function(y) sum(length(which(is.na(y)))))
mbd1 <- mbd1[-1,]


kable(mbd1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")

ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) + 
  geom_boxplot() +
  theme(legend.position="none") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) + 
  theme(panel.background = element_rect(fill = '#d0ddf2'))


ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) + 
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 800)) +
  theme(legend.position="none") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) + 
  theme(panel.background = element_rect(fill = '#d0ddf2'))


mb_hist <- moneyball_training_data
mb_hist <- mb_hist[,-1 ]

mb_hist %>%
  keep(is.numeric) %>%                     
  gather() %>%                             
  ggplot(aes(value)) +                     
    facet_wrap(~ key, scales = "free") +  
    geom_histogram(bins = 35)                 



kable(cor(drop_na(mb_hist))[,1], "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T) %>%
  scroll_box(height = "500px")

corrgram(drop_na(mb_hist), order=TRUE,
         upper.panel=panel.cor, main="Moneyball")



mbd2 <- moneyball_training_data
mbd2 <- mbd2[,-1]
mbd2 <- mbd2[,-10]
mbd2 <- mbd2[,-12]




init = mice(mbd2, maxit=0) 
meth = init$method
predM = init$predictorMatrix

predM[, c("TARGET_WINS")]=0

imputed = mice(mbd2, method="rf", predictorMatrix=predM, m=5)

imputed <- complete(imputed)

imputedtable <- describe(imputed)

kable(imputedtable, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")





mbd2 <- moneyball_training_data
mbd2 <- mbd2[,-c(1, 11, 13)]

init = mice(mbd2, maxit=0) 
meth = init$method
predM = init$predictorMatrix

predM[, c("TARGET_WINS")]=0

imputed = mice(mbd2, method="rf", predictorMatrix=predM, m=5)

imputed <- complete(imputed)


vif(imputed)
v1 <- vifstep(imputed, th=10)


imputedtable <- describe(imputed)

kable(imputedtable, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")



model1 <- lm(TARGET_WINS ~., imputed)


model2 <- lm(TARGET_WINS ~ TEAM_BATTING_H  + TEAM_BATTING_3B  + TEAM_BATTING_HR  + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, imputed)

summary(model2)


model4 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB  + TEAM_FIELDING_E  + TEAM_BATTING_HR, imputed)

summary(model4)




compare_model1 <- c(m1mse, m1root, m1ar, m1fs )
compare_model2 <- c(m2mse, m2root, m2ar, m2fs )
compare_model3 <- c(m3mse, m3root, m3ar, m3fs )

compare <- data.frame(compare_model1, compare_model2, compare_model3)
colnames(compare) <- c("Model 1", "Model 2", "Model 3")

kable(compare)




mbeval <- mbeval[,-c(1, 10, 12)]


init = mice(mbeval, maxit=0) 
meth = init$method
predM = init$predictorMatrix


imputed1 = mice(mbeval, method="rf", predictorMatrix=predM, m=5)

imputed1 <- complete(imputed1)

imputedtable1 <- describe(imputed1)

kable(imputedtable1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")

predict1 <- predict(model4, newdata = imputed1, interval="prediction")
kable(predict1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")


summary(predict1)

CUNY MSDS DATA 621 - HW 1

NIcholas Schettini

June 5, 2018