1 Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

2 Data Exploration

dim(moneyball_training_data)

## [1] 2276   17

The dataset consists of 17 elements, with 2276 total cases. There are 15 explanatory variables, which can be categorized into four groups:

batting
baserun
fielding
pitching

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se	na_count
TARGET_WINS	2	191	80.92670	12.115013	82	81.11765	13.3434	43	116	73	-0.1698314	-0.2952783	0.8766116	0
TEAM_BATTING_H	3	191	1478.62827	76.147869	1477	1477.42484	74.1300	1308	1667	359	0.1302702	-0.3710350	5.5098664	0
TEAM_BATTING_2B	4	191	297.19895	26.329335	296	296.62745	25.2042	201	373	172	0.0915189	0.4778716	1.9051238	0
TEAM_BATTING_3B	5	191	30.74346	9.043878	29	30.13072	8.8956	12	61	49	0.7007420	0.7446217	0.6543921	0
TEAM_BATTING_HR	6	191	178.05236	32.413243	175	176.81046	35.5824	116	260	144	0.2980673	-0.7172373	2.3453399	0
TEAM_BATTING_BB	7	191	543.31937	74.842133	535	541.31373	74.1300	365	775	410	0.3115199	-0.1474175	5.4153867	0
TEAM_BATTING_SO	8	191	1051.02618	104.156382	1050	1046.95425	97.8516	805	1399	594	0.3985050	0.3955105	7.5364913	102
TEAM_BASERUN_SB	9	191	90.90576	29.916401	87	89.06536	29.6520	31	177	146	0.5553966	-0.1414909	2.1646748	131
TEAM_BASERUN_CS	10	191	39.94241	11.898334	38	39.49020	11.8608	12	74	62	0.3468509	0.0006392	0.8609332	772
TEAM_BATTING_HBP	11	191	59.35602	12.967123	58	58.86275	11.8608	29	95	66	0.3185754	-0.1119828	0.9382681	2085
TEAM_PITCHING_H	12	191	1479.70157	75.788625	1480	1478.50327	72.6474	1312	1667	355	0.1279056	-0.3894781	5.4838725	0
TEAM_PITCHING_HR	13	191	178.17801	32.391678	175	176.93464	35.5824	116	260	144	0.2989191	-0.7190905	2.3437795	0
TEAM_PITCHING_BB	14	191	543.71728	74.916681	537	541.74510	72.6474	367	775	408	0.3144366	-0.1338563	5.4207808	0
TEAM_PITCHING_SO	15	191	1051.81675	104.347208	1052	1047.80392	97.8516	805	1399	594	0.3945586	0.3903991	7.5502990	102
TEAM_FIELDING_E	16	191	107.05236	16.632162	106	106.58170	17.7912	65	145	80	0.1780432	-0.3567367	1.2034610	0
TEAM_FIELDING_DP	17	191	152.33508	17.611682	152	152.04575	19.2738	113	204	91	0.2164822	-0.2115741	1.2743366	286

Looking at the data above, there are multiple variables with missing (NA) values, with TEAM-BATTING_HBP being the highest.

The boxplots below help show the spread of data within the dataset, and show various outliers. As shown in the graph below, TEAM_PITCHING_H seems to have the highest spread with the most outliers.

## Warning: Removed 3478 rows containing non-finite values (stat_boxplot).

The graph below zooms into the other variables, so it becomes easier to see spread and outliers from the other variables.

In the Histograms below, the data shows multiple graphs with right skews while only a few have left-skew.

The above boxplots show all of the variables listed in the dataset. This visualization may assist in showing how the data is spread.

The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.

For this project, it makes sense to break down the correlation by wins - since that’s what we’re trying to predict.

	x
TARGET_WINS	1.0000000
TEAM_BATTING_H	0.4699467
TEAM_BATTING_2B	0.3129840
TEAM_BATTING_3B	-0.1243459
TEAM_BATTING_HR	0.4224168
TEAM_BATTING_BB	0.4686879
TEAM_BATTING_SO	-0.2288927
TEAM_BASERUN_SB	0.0148364
TEAM_BASERUN_CS	-0.1787560
TEAM_BATTING_HBP	0.0735042
TEAM_PITCHING_H	0.4712343
TEAM_PITCHING_HR	0.4224668
TEAM_PITCHING_BB	0.4683988
TEAM_PITCHING_SO	-0.2293648
TEAM_FIELDING_E	-0.3866880
TEAM_FIELDING_DP	-0.1958660

Below is a visual representation of the correlation plot.

According to the coorelation graph, batting_h, batting_2b, batting_hr, batting_bb, pitching_h, pitching_hr, and pitching_bb are the most positively correlated.

3 Data Preparation

3.1 Removal of Data

The variable TEAM_BATTING_HBP is also missing over 90% of its values. That variable will be removed completely.

The variable TEAM_PITCHING_HR and TEAM_BATTING_HR are also very closely correlated with each other. This shows that there may be some collinearity involved. The TEAM_PITCHING_HR variable will be dropped from the dataset

Using the VIF and vifstep function from the USDM package, the data will first be tested for other collinearity issues. The variables determined that have collinearity issues will be discarded.

3.2 Imputation of Missing (NA) values

The data exploration revealed multiple variables that had numerious NA values. There are multiple ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.

Imputation the mean/median/mode is an easy way to fill in the missing NA’s, however it reduces the variance in the dataset and shrinks standard errors - which can invalidate hypothesis tests.

In this case, data will be imputated via prediction using the MICE (Multivariate Imputation) library using a random forest prediction method.

Variables that exceed the established threshold will be discarded to avoid collinearity issues.

vif(imputed)

##           Variables      VIF
## 1       TARGET_WINS 1.485532
## 2    TEAM_BATTING_H 3.991879
## 3   TEAM_BATTING_2B 2.449123
## 4   TEAM_BATTING_3B 2.977893
## 5   TEAM_BATTING_HR 4.876431
## 6   TEAM_BATTING_BB 5.565493
## 7   TEAM_BATTING_SO 5.178738
## 8   TEAM_BASERUN_SB 2.278903
## 9   TEAM_BASERUN_CS 1.818298
## 10  TEAM_PITCHING_H 3.661374
## 11 TEAM_PITCHING_BB 4.737911
## 12 TEAM_PITCHING_SO 2.977236
## 13  TEAM_FIELDING_E 4.625943
## 14 TEAM_FIELDING_DP 1.772751

v1 <- vifstep(imputed, th=10)

3.3 Output - The below table shows the results of above data manipulation.

The NA data has been ‘filled in’ using the MICE prediction, using the Random Forest Method. Variables with collinearity as established by the vir/virstep package have been dropped.

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
TARGET_WINS	1	2276	80.79086	15.75215	82.0	81.31229	14.8260	0	146	146	-0.3987232	1.0274757	0.3301823
TEAM_BATTING_H	2	2276	1469.26977	144.59120	1454.0	1459.04116	114.1602	891	2554	1663	1.5713335	7.2785261	3.0307891
TEAM_BATTING_2B	3	2276	241.24692	46.80141	238.0	240.39627	47.4432	69	458	389	0.2151018	0.0061609	0.9810087
TEAM_BATTING_3B	4	2276	55.25000	27.93856	47.0	52.17563	23.7216	0	223	223	1.1094652	1.5032418	0.5856226
TEAM_BATTING_HR	5	2276	99.61204	60.54687	102.0	97.38529	78.5778	0	264	264	0.1860421	-0.9631189	1.2691285
TEAM_BATTING_BB	6	2276	501.55888	122.67086	512.0	512.18331	94.8864	0	878	878	-1.0257599	2.1828544	2.5713150
TEAM_BATTING_SO	7	2276	730.25264	245.68705	737.5	735.48683	277.9875	0	1399	1399	-0.2481540	-0.3128153	5.1498685
TEAM_BASERUN_SB	8	2276	129.71309	93.95469	103.0	114.48518	64.4931	0	697	697	1.9268830	5.0404554	1.9693928
TEAM_BASERUN_CS	9	2276	65.77417	39.43856	54.0	59.02415	23.7216	0	201	201	1.7045243	2.7622052	0.8266753
TEAM_PITCHING_H	10	2276	1779.21046	1406.84293	1518.0	1555.89517	174.9468	1137	30132	28995	10.3295111	141.8396985	29.4889618
TEAM_PITCHING_BB	11	2276	553.00791	166.35736	536.5	542.62459	98.5929	0	3645	3645	6.7438995	96.9676398	3.4870317
TEAM_PITCHING_SO	12	2276	812.13445	542.18699	804.5	791.12733	254.2659	0	19278	19278	22.5128987	695.2641969	11.3648305
TEAM_FIELDING_E	13	2276	246.48067	227.77097	159.0	193.43798	62.2692	65	1898	1833	2.9904656	10.9702717	4.7743279
TEAM_FIELDING_DP	14	2276	142.23989	28.43894	146.0	143.42426	26.6868	52	228	176	-0.3652627	-0.1175944	0.5961112

4 Build Models

Using the training data provided, we will build 3 different linear regression models, to determine which will provide the best prediction for the # of wins for a baseball team. The tree approachs are: all variables, only significant variables, and backwards elimination of each variable.

4.1 Model 1: All Variables

All remaining variables after the data prep. After the data has been manipulated (imputed, etc. as stated above), all of the variables will be tested to determine the base model they provided. This will allow us to see which variables are significant in our dataset, and allow us to make other models based on that.

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.577  -8.494   0.159   8.443  61.658 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.777e+01  5.282e+00   5.258 1.60e-07 ***
## TEAM_BATTING_H    4.627e-02  3.627e-03  12.758  < 2e-16 ***
## TEAM_BATTING_2B  -1.873e-02  9.078e-03  -2.063  0.03921 *  
## TEAM_BATTING_3B   4.701e-02  1.676e-02   2.806  0.00506 ** 
## TEAM_BATTING_HR   7.316e-02  9.791e-03   7.472 1.12e-13 ***
## TEAM_BATTING_BB   8.606e-03  5.223e-03   1.648  0.09955 .  
## TEAM_BATTING_SO  -1.198e-02  2.504e-03  -4.783 1.84e-06 ***
## TEAM_BASERUN_SB   3.656e-02  4.298e-03   8.506  < 2e-16 ***
## TEAM_BASERUN_CS  -5.636e-04  9.291e-03  -0.061  0.95164    
## TEAM_PITCHING_H  -4.468e-04  3.695e-04  -1.209  0.22672    
## TEAM_PITCHING_BB -4.361e-05  3.556e-03  -0.012  0.99021    
## TEAM_PITCHING_SO  2.458e-03  8.632e-04   2.848  0.00445 ** 
## TEAM_FIELDING_E  -2.860e-02  2.495e-03 -11.466  < 2e-16 ***
## TEAM_FIELDING_DP -1.036e-01  1.253e-02  -8.264 2.37e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.96 on 2262 degrees of freedom
## Multiple R-squared:  0.3268, Adjusted R-squared:  0.323 
## F-statistic: 84.48 on 13 and 2262 DF,  p-value: < 2.2e-16

Conclusions based on model:

F-statistic is 89.25, R-squared is 0.3352 Out of the 14 variables, 9 have statistically significant p-values at the 5% level.

4.2 Model 2: Highly Significant Variables Only

Based on model one, Model 2 will focus only on the variables that are statistically significant - in order to see if only those variables allow for a better model. Variables will be choosen based on their significance level from the R output.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.317  -8.588   0.135   8.502  61.077 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      35.8270395  4.5668751   7.845 6.61e-15 ***
## TEAM_BATTING_H    0.0400597  0.0026789  14.954  < 2e-16 ***
## TEAM_BATTING_3B   0.0576747  0.0162673   3.545  0.00040 ***
## TEAM_BATTING_HR   0.0792951  0.0091938   8.625  < 2e-16 ***
## TEAM_BATTING_SO  -0.0131207  0.0023263  -5.640 1.91e-08 ***
## TEAM_BASERUN_SB   0.0406521  0.0037937  10.716  < 2e-16 ***
## TEAM_PITCHING_SO  0.0018183  0.0005835   3.116  0.00186 ** 
## TEAM_FIELDING_E  -0.0325020  0.0017862 -18.196  < 2e-16 ***
## TEAM_FIELDING_DP -0.0993903  0.0123477  -8.049 1.33e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.98 on 2267 degrees of freedom
## Multiple R-squared:  0.323,  Adjusted R-squared:  0.3206 
## F-statistic: 135.2 on 8 and 2267 DF,  p-value: < 2.2e-16

Conclusions based on model: F-statistic is 143, R-squared is 0.333

The F-statistic is better than the first model, however the R-squared drops slightly.

4.3 Model 3: Backwards Elimination and Significance

Variables will be removed one by one to determine best fit model. After each variable is removed, the model will be ‘ran’ again - until the most optimal output (r2, f-stat) are produced. Only the final output will be shown. This model is similar to the ‘forward selection’ variant - however I find it easier to work our way backwards and to eliminate variables rather than add them.

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB + 
##     TEAM_FIELDING_E + TEAM_BATTING_HR, data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.589  -9.138   0.000   8.646  59.008 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      6.072197   2.914999   2.083 0.037355 *  
## TEAM_BATTING_H   0.049561   0.002038  24.319  < 2e-16 ***
## TEAM_BASERUN_SB  0.045561   0.003442  13.237  < 2e-16 ***
## TEAM_FIELDING_E -0.025407   0.001637 -15.520  < 2e-16 ***
## TEAM_BATTING_HR  0.022620   0.006014   3.761 0.000174 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.29 on 2271 degrees of freedom
## Multiple R-squared:  0.2899, Adjusted R-squared:  0.2887 
## F-statistic: 231.8 on 4 and 2271 DF,  p-value: < 2.2e-16

Conclusions based on model: F-statistic is 245.5, R-squared is 0.3006 The F-statistic is larger than both of the other two models, however the R-squared is slightly lower than the other two.

5 Select Models

The three models from the previous selection have been summarised below. From the three models, I decided to use model 3 for the predictions. While the first model had the highest R-squared, it had multiple variables that weren’t statistically significant, and some that had multicollinearity issues. The F-statistic in model 3 is also much higher than the other two.

A comparsion of the multiple linear regression models, based on: mean square error, R2, F-stat, and root MSE.

Model 1	Model 2	Model 3
Mean Squared Error: 166.957871565185	Mean Squared Error: 167.907711946534	Mean Squared Error: 176.111327638759
Root MSE: 12.9212178824283	Root MSE: 12.9579208188094	Root MSE: 13.2706943163784
Adjusted R-squared: 0.322971824775576	Adjusted R-squared: 0.320621862368771	Adjusted R-squared: 0.288683962428309
F-statistic: 84.4825955610935	F-statistic: 135.206323490219	F-statistic: 231.824267918398

5.1 Predictions

Similar to the train data, the evaulation data also needs some prep work. Similar to what was done for the test data, the eval data has had columns removed, and NA values imputed using the MICE - Random Forest method to predict what the NA values could be.

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
TEAM_BATTING_H	1	259	1469.3900	150.65523	1455	1463.68421	114.1602	819	2170	1351	0.5876139	3.6642947	9.361261
TEAM_BATTING_2B	2	259	241.3205	49.51612	239	242.32536	48.9258	44	376	332	-0.3273282	0.6693023	3.076782
TEAM_BATTING_3B	3	259	55.9112	27.14410	52	52.94737	26.6868	14	155	141	0.9790284	0.6987468	1.686652
TEAM_BATTING_HR	4	259	95.6332	56.33221	101	93.67943	66.7170	0	242	242	0.1712363	-0.9031262	3.500313
TEAM_BATTING_BB	5	259	498.9575	120.59215	509	505.98086	94.8864	15	792	777	-0.9209916	2.5265655	7.493232
TEAM_BATTING_SO	6	259	707.2587	238.49198	680	712.19139	259.4550	0	1268	1268	-0.2171483	-0.1898881	14.819171
TEAM_BASERUN_SB	7	259	130.7452	104.86734	95	112.77033	66.7170	0	580	580	1.8785832	4.0655539	6.516140
TEAM_BASERUN_CS	8	259	63.3668	34.42361	56	58.82775	23.7216	0	154	154	1.1797483	0.8795437	2.138979
TEAM_PITCHING_H	9	259	1813.4633	1662.91308	1515	1554.25359	173.4642	1155	22768	21613	9.2764797	102.0702914	103.328391
TEAM_PITCHING_BB	10	259	552.4170	172.95006	526	536.46411	97.8516	136	2008	1872	4.1113772	29.2127324	10.746594
TEAM_PITCHING_SO	11	259	793.1158	613.19152	734	760.06699	235.7334	0	9963	9963	12.8862500	190.1532522	38.101867
TEAM_FIELDING_E	12	259	249.7490	230.90260	163	197.36364	59.3040	73	1568	1495	3.0887263	10.8748551	14.347589
TEAM_FIELDING_DP	13	259	142.6448	27.28171	146	143.61722	25.2042	69	204	135	-0.3271267	-0.1939724	1.695203

After imputing and cleaning the data, using the predict function and Model 3, the following are the predicted values for the test set of the data, including prediction intervals:

fit	lwr	upr
67.13623	41.055889	93.21656
67.60660	41.527611	93.68558
76.03751	49.973209	102.10180
89.53503	63.466923	115.60314
70.16862	44.089489	96.24776
64.87399	38.777247	90.97073
81.96123	55.864890	108.05758
75.55546	49.487775	101.62314
70.08147	43.999724	96.16321
74.31463	48.242403	100.38687
75.76072	49.688450	101.83299
82.30038	56.233632	108.36714
78.58716	52.512895	104.66143
80.46603	54.395602	106.53645
78.95788	52.894835	105.02092
79.49049	53.426092	105.55488
73.20732	47.138873	99.27577
82.07808	56.012606	108.14356
68.33606	42.255490	94.41663
91.89863	65.820896	117.97637
81.80335	55.727601	107.87909
85.95945	59.888204	112.03069
77.73345	51.669044	103.79786
73.79085	47.722477	99.85922
86.03144	59.967841	112.09505
89.81946	63.751274	115.88765
69.65389	43.450462	95.85732
76.66803	50.595247	102.74082
82.25434	56.172831	108.33585
78.38870	52.306833	104.47057
86.77833	60.712068	112.84460
84.21850	58.155037	110.28197
82.23930	56.174799	108.30381
82.99247	56.924191	109.06074
79.60809	53.543362	105.67281
80.84144	54.762024	106.92086
75.45926	49.394599	101.52392
87.90480	61.821770	113.98784
86.03867	59.971227	112.10611
87.46234	61.387089	113.53758
82.10938	56.044761	108.17400
87.23623	61.169121	113.30334
31.00913	4.597776	57.42048
99.69708	73.516531	125.87763
90.28341	64.174034	116.39278
90.28543	64.195836	116.37503
96.22415	70.124249	122.32406
73.51255	47.444423	99.58067
69.52710	43.447847	95.60635
76.75404	50.684432	102.82365
79.94281	53.874642	106.01098
87.78549	61.709114	113.86186
77.36421	51.299973	103.42844
73.22777	47.159106	99.29644
78.23890	52.178886	104.29890
79.41250	53.350443	105.47456
88.62498	62.537837	114.71212
73.83970	47.754763	99.92464
62.14221	36.036322	88.24810
77.48610	51.412411	103.55980
86.59158	60.512878	112.67029
76.32423	50.247457	102.40100
86.03029	59.965814	112.09477
85.95845	59.862677	112.05422
86.31257	60.232893	112.39225
99.95558	73.813694	126.09747
75.00788	48.943641	101.07213
83.29275	57.217406	109.36810
78.57816	52.492592	104.66374
84.89222	58.807209	110.97724
84.80366	58.711411	110.89591
77.25293	51.173044	103.33282
79.51941	53.454304	105.58453
83.73802	57.647463	109.82859
85.45822	59.390518	111.52593
86.48654	60.417618	112.55547
81.40443	55.340586	107.46827
82.58200	56.514902	108.64909
71.75230	45.672786	97.83182
78.33310	52.255753	104.41044
87.23865	61.166041	113.31125
89.23463	63.158208	115.31104
96.68847	70.592683	122.78426
81.65470	55.583160	107.72624
80.59969	54.534239	106.66514
82.29907	56.236935	108.36121
79.41995	53.353186	105.48671
82.63534	56.573549	108.69713
84.55174	58.490110	110.61336
90.09904	64.019419	116.17866
79.03753	52.960202	105.11486
86.15399	59.926887	112.38109
73.04081	46.972190	99.10943
82.57737	56.500985	108.65375
84.64871	58.563206	110.73422
79.78616	53.709014	105.86331
82.81826	56.740994	108.89552
96.27710	70.168882	122.38533
86.54482	60.459170	112.63047
88.90420	62.829490	114.97891
83.07500	57.001330	109.14867
73.03540	46.966682	99.10413
83.71895	57.649976	109.78792
78.38742	52.318497	104.45634
81.24960	55.169348	107.32986
66.84296	40.743824	92.94209
47.33882	21.175164	73.50248
83.67310	57.607856	109.73834
84.05024	57.979251	110.12123
61.22342	35.130179	87.31665
82.78641	56.725193	108.84763
87.12356	61.055363	113.19175
94.45230	68.378518	120.52609
91.33655	65.269033	117.40407
83.96728	57.906992	110.02756
83.20464	57.143448	109.26583
91.14050	65.072105	117.20890
82.44282	56.379537	108.50611
79.77622	53.713885	105.83855
77.05494	50.944457	103.16542
89.45172	63.368426	115.53502
67.01697	40.927090	93.10686
66.80119	40.717442	92.88493
60.11416	34.009157	86.21917
68.86179	42.779023	94.94455
87.44880	61.367929	113.52967
88.62009	62.526380	114.71379
74.96902	48.900029	101.03802
87.69835	61.630061	113.76663
93.23197	67.146402	119.31753
86.36779	60.292782	112.44280
80.63481	54.566428	106.70320
79.57357	53.508867	105.63828
85.10774	59.041625	111.17386
85.31934	59.242136	111.39654
71.19588	45.103271	97.28849
76.55274	50.488794	102.61669
79.26542	53.200875	105.32996
90.53036	64.448302	116.61241
82.53217	56.471026	108.59331
67.31824	41.234732	93.40175
69.69733	43.613575	95.78109
91.67851	65.594246	117.76277
76.67572	50.606266	102.74518
72.92561	46.848393	99.00283
72.33785	46.267072	98.40862
78.47338	52.408821	104.53794
81.68135	55.618139	107.74456
84.72967	58.665025	110.79432
81.32244	55.260534	107.38434
82.86269	56.788004	108.93738
84.28374	58.221674	110.34580
45.26908	18.933110	71.60504
73.82185	47.755042	99.88865
77.14849	51.084993	103.21199
76.31588	50.249853	102.38191
87.41020	61.329582	113.49082
67.89901	41.811999	93.98601
87.18958	61.112421	113.26673
72.28956	46.214096	98.36503
96.22337	70.134291	122.31245
99.41116	73.320060	125.50225
86.63437	60.568076	112.70066
97.74847	71.650702	123.84624
89.59111	63.513337	115.66888
84.47829	58.409293	110.54728
82.22218	56.157901	108.28646
81.77499	55.711798	107.83818
77.13688	51.068084	103.20567
82.92497	56.860321	108.98963
88.18245	62.099406	114.26549
85.92711	59.851374	112.00285
78.30266	52.233769	104.37155
90.62441	64.539082	116.70974
81.62584	55.562611	107.68907
73.82979	47.755859	99.90373
75.01621	48.950755	101.08166
75.29618	49.231618	101.36074
74.01369	47.944997	100.08239
79.88010	53.816684	105.94351
86.36111	60.248658	112.47356
84.31503	58.233582	110.39648
85.86281	59.798009	111.92760
82.07639	55.998654	108.15413
89.18001	62.974010	115.38601
99.66933	73.467653	125.87102
87.68879	61.603940	113.77364
56.65728	30.502015	82.81255
81.45939	55.242450	107.67633
113.71519	87.537749	139.89264
70.44522	44.373418	96.51703
79.72226	53.649789	105.79472
78.84559	52.783530	104.90764
81.27961	55.207407	107.35181
84.35495	58.275895	110.43400
70.36314	44.290874	96.43541
77.17734	51.115940	103.23873
76.16384	50.098300	102.22938
75.69303	49.628593	101.75746
82.25950	56.196700	108.32229
76.27326	50.207397	102.33911
80.16288	54.100325	106.22544
73.56452	47.497053	99.63198
87.41267	61.349360	113.47598
81.59053	55.529817	107.65124
78.55895	52.494563	104.62334
81.67572	55.613796	107.73765
79.10919	53.047616	105.17076
81.32306	55.248439	107.39768
71.88460	45.805864	97.96333
100.81056	74.699625	126.92149
90.45132	64.367058	116.53558
78.85707	52.792721	104.92141
68.08431	42.009503	94.15912
70.66189	44.589865	96.73392
84.99225	58.923693	111.06082
83.97027	57.907613	110.03293
94.06135	67.977999	120.14471
79.39026	53.328870	105.45165
77.45847	51.395179	103.52175
81.55135	55.490138	107.61256
81.31675	55.254315	107.37919
85.07509	59.009008	111.14118
80.75341	54.690878	106.81594
87.18459	60.900380	113.46881
74.83597	48.770976	100.90097
81.55878	55.498117	107.61944
82.03872	55.972576	108.10486
80.93174	54.870778	106.99269
89.84760	63.645467	116.04973
74.43396	48.353576	100.51434
92.10202	66.026458	118.17758
78.06023	51.996185	104.12427
85.87531	59.798844	111.95178
77.66966	51.607442	103.73189
73.81543	47.748797	99.88206
84.34386	58.284259	110.40346
77.50491	51.439734	103.57009
85.18128	59.108813	111.25376
72.97924	46.908681	99.04980
87.23800	61.169237	113.30676
85.77969	59.707254	111.85212
84.70459	58.626982	110.78219
86.96065	60.896647	113.02465
65.96530	39.876078	92.05453
89.91658	63.848003	115.98516
81.42166	55.361418	107.48190
84.96504	58.894665	111.03541
73.95875	47.889914	100.02758
88.92177	62.842080	115.00146
83.02815	56.957246	109.09905
73.40288	47.240856	99.56490
90.75125	64.664191	116.83830
36.45602	10.154620	62.75743
69.92466	43.851514	95.99780
74.92555	48.861238	100.98986
82.60206	56.534355	108.66976
85.06042	58.987784	111.13306
80.57628	54.508445	106.64412

##       fit              lwr              upr        
##  Min.   : 31.01   Min.   : 4.598   Min.   : 57.42  
##  1st Qu.: 76.29   1st Qu.:50.227   1st Qu.:102.36  
##  Median : 81.63   Median :55.563   Median :107.69  
##  Mean   : 80.67   Mean   :54.588   Mean   :106.75  
##  3rd Qu.: 86.03   3rd Qu.:59.946   3rd Qu.:112.09  
##  Max.   :113.72   Max.   :87.538   Max.   :139.89

##        1 
## 81.09361

##        fit      lwr      upr
## 1 81.09361 55.03392 107.1533

6 Appendex

moneyball_training_data <- read_csv("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA621/moneyball-training-data.csv")


mbd1 <- describe(moneyball_training_data, na.rm = F)

mbd1$na_count <- sapply(moneyball_training_data, function(y) sum(length(which(is.na(y)))))
mbd1 <- mbd1[-1,]


kable(mbd1,"latex", booktabs = T) %>%
   kable_styling(latex_options = c("striped", "scale_down"))

# kable(mbd1, "html", escape = F) %>%
#   kable_styling("striped", full_width = T) %>%
#   column_spec(1, bold = T) %>%
#   scroll_box(width = "100%", height = "700px")

ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) +
  geom_boxplot() +
  theme(legend.position="none") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(panel.background = element_rect(fill = '#d0ddf2'))


ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 800)) +
  theme(legend.position="none") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) +
  theme(panel.background = element_rect(fill = '#d0ddf2'))


mb_hist <- moneyball_training_data
mb_hist <- mb_hist[,-1 ]

mb_hist %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram(bins = 35)


kable(cor(drop_na(mb_hist))[,1], "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T) %>%
  scroll_box(height = "500px")

corrgram(drop_na(mb_hist), order=TRUE,
         upper.panel=panel.cor, main="Moneyball")



mbd2 <- moneyball_training_data
mbd2 <- mbd2[,-1]
mbd2 <- mbd2[,-10]
mbd2 <- mbd2[,-12]




init = mice(mbd2, maxit=0)
meth = init$method
predM = init$predictorMatrix

predM[, c("TARGET_WINS")]=0

imputed = mice(mbd2, method="rf", predictorMatrix=predM, m=5)

imputed <- complete(imputed)

imputedtable <- describe(imputed)


kable(imputedtable, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")


mbd2 <- moneyball_training_data
mbd2 <- mbd2[,-c(1, 11, 13)]

init = mice(mbd2, maxit=0)
meth = init$method
predM = init$predictorMatrix

predM[, c("TARGET_WINS")]=0

imputed = mice(mbd2, method="rf", predictorMatrix=predM, m=5)

imputed <- complete(imputed)


vif(imputed)
v1 <- vifstep(imputed, th=10)


imputedtable <- describe(imputed)


kable(imputedtable, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")



model1 <- lm(TARGET_WINS ~., imputed)


model2 <- lm(TARGET_WINS ~ TEAM_BATTING_H  + TEAM_BATTING_3B  + TEAM_BATTING_HR  + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, imputed)

summary(model2)


model4 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB  + TEAM_FIELDING_E  + TEAM_BATTING_HR, imputed)

summary(model4)




compare_model1 <- c(m1mse, m1root, m1ar, m1fs )
compare_model2 <- c(m2mse, m2root, m2ar, m2fs )
compare_model3 <- c(m3mse, m3root, m3ar, m3fs )

compare <- data.frame(compare_model1, compare_model2, compare_model3)
colnames(compare) <- c("Model 1", "Model 2", "Model 3")

kable(compare)




mbeval <- mbeval[,-c(1, 10, 12)]


init = mice(mbeval, maxit=0)
meth = init$method
predM = init$predictorMatrix


imputed1 = mice(mbeval, method="rf", predictorMatrix=predM, m=5)

imputed1 <- complete(imputed1)

imputedtable1 <- describe(imputed1)


kable(imputedtable1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")

predict1 <- predict(model4, newdata = imputed1, interval="prediction")

kable(predict1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")


summary(predict1)

DATA621_HW1

Niteen Kumar

June 24, 2018