DATA621

Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Data Exploration

dim(moneyball_training_data)

## [1] 2276   17

The dataset consists of 17 elements, with 2276 total cases. There are 15 explanatory variables, which can be categorized into four groups:

batting
baserun
fielding
pitching

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se	na_count
TARGET_WINS	2	191	80.92670	12.115013	82	81.11765	13.3434	43	116	73	-0.1698314	-0.2952783	0.8766116	0
TEAM_BATTING_H	3	191	1478.62827	76.147869	1477	1477.42484	74.1300	1308	1667	359	0.1302702	-0.3710350	5.5098664	0
TEAM_BATTING_2B	4	191	297.19895	26.329335	296	296.62745	25.2042	201	373	172	0.0915189	0.4778716	1.9051238	0
TEAM_BATTING_3B	5	191	30.74346	9.043878	29	30.13072	8.8956	12	61	49	0.7007420	0.7446217	0.6543921	0
TEAM_BATTING_HR	6	191	178.05236	32.413243	175	176.81046	35.5824	116	260	144	0.2980673	-0.7172373	2.3453399	0
TEAM_BATTING_BB	7	191	543.31937	74.842133	535	541.31373	74.1300	365	775	410	0.3115199	-0.1474175	5.4153867	0
TEAM_BATTING_SO	8	191	1051.02618	104.156382	1050	1046.95425	97.8516	805	1399	594	0.3985050	0.3955105	7.5364913	102
TEAM_BASERUN_SB	9	191	90.90576	29.916401	87	89.06536	29.6520	31	177	146	0.5553966	-0.1414909	2.1646748	131
TEAM_BASERUN_CS	10	191	39.94241	11.898334	38	39.49020	11.8608	12	74	62	0.3468509	0.0006392	0.8609332	772
TEAM_BATTING_HBP	11	191	59.35602	12.967123	58	58.86275	11.8608	29	95	66	0.3185754	-0.1119828	0.9382681	2085
TEAM_PITCHING_H	12	191	1479.70157	75.788625	1480	1478.50327	72.6474	1312	1667	355	0.1279056	-0.3894781	5.4838725	0
TEAM_PITCHING_HR	13	191	178.17801	32.391678	175	176.93464	35.5824	116	260	144	0.2989191	-0.7190905	2.3437795	0
TEAM_PITCHING_BB	14	191	543.71728	74.916681	537	541.74510	72.6474	367	775	408	0.3144366	-0.1338563	5.4207808	0
TEAM_PITCHING_SO	15	191	1051.81675	104.347208	1052	1047.80392	97.8516	805	1399	594	0.3945586	0.3903991	7.5502990	102
TEAM_FIELDING_E	16	191	107.05236	16.632162	106	106.58170	17.7912	65	145	80	0.1780432	-0.3567367	1.2034610	0
TEAM_FIELDING_DP	17	191	152.33508	17.611682	152	152.04575	19.2738	113	204	91	0.2164822	-0.2115741	1.2743366	286

There are multiple variables with missing (NA) values, with TEAM-BATTING_HBP being the highest.

The boxplots below help show the spread of data within the dataset, and show various outliers. As shown in the graph below, TEAM_PITCHING_H seems to have the highest spread with the most outliers.

## Warning: Removed 3478 rows containing non-finite values (stat_boxplot).

The graph below zooms into the other variables, so it becomes easier to see spread and outliers from the other variables.

In the Histograms below, the data shows multiple graphs with right skews while only a few have left-skew.

The correlation plot below shows how variables in the dataset are related to each other. Looking at the plot, we can see that certain variables are more related than others.

Let’s break down the correlation by wins - since that’s what we’re trying to predict.

	x
TARGET_WINS	1.0000000
TEAM_BATTING_H	0.4699467
TEAM_BATTING_2B	0.3129840
TEAM_BATTING_3B	-0.1243459
TEAM_BATTING_HR	0.4224168
TEAM_BATTING_BB	0.4686879
TEAM_BATTING_SO	-0.2288927
TEAM_BASERUN_SB	0.0148364
TEAM_BASERUN_CS	-0.1787560
TEAM_BATTING_HBP	0.0735042
TEAM_PITCHING_H	0.4712343
TEAM_PITCHING_HR	0.4224668
TEAM_PITCHING_BB	0.4683988
TEAM_PITCHING_SO	-0.2293648
TEAM_FIELDING_E	-0.3866880
TEAM_FIELDING_DP	-0.1958660

Below is the plot correlation.

Data Preparation

Removal of Data

The variable TEAM_BATTING_HBP has over 90% missing values. That variable will be removed completely.

The variable TEAM_PITCHING_HR and TEAM_BATTING_HR are also very closely correlated with each other. This shows that there may be some collinearity involved. The TEAM_PITCHING_HR variable will be dropped from the dataset

Imputation of Missing (NA) values

The data exploration revealed multiple variables that had numerious NA values. There are multiple ways to handle NA data: deleting the observations, deleting the variables, imputation with the mean/median/mode, or imputation with a prediction.

Imputation the mean/median/mode is an easy way to fill in the missing NA’s, however it reduces the variance in the dataset and shrinks standard errors - which can invalidate hypothesis tests.

In this case, data will be imputated via prediction using the MICE (Multivariate Imputation) library using a random forest prediction method.

Output - The below table shows the results of above data manipulation

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
TARGET_WINS	1	2276	80.79086	15.75215	82.0	81.31229	14.8260	0	146	146	-0.3987232	1.0274757	0.3301823
TEAM_BATTING_H	2	2276	1469.26977	144.59120	1454.0	1459.04116	114.1602	891	2554	1663	1.5713335	7.2785261	3.0307891
TEAM_BATTING_2B	3	2276	241.24692	46.80141	238.0	240.39627	47.4432	69	458	389	0.2151018	0.0061609	0.9810087
TEAM_BATTING_3B	4	2276	55.25000	27.93856	47.0	52.17563	23.7216	0	223	223	1.1094652	1.5032418	0.5856226
TEAM_BATTING_HR	5	2276	99.61204	60.54687	102.0	97.38529	78.5778	0	264	264	0.1860421	-0.9631189	1.2691285
TEAM_BATTING_BB	6	2276	501.55888	122.67086	512.0	512.18331	94.8864	0	878	878	-1.0257599	2.1828544	2.5713150
TEAM_BATTING_SO	7	2276	729.44508	245.63744	734.5	734.43030	279.4701	0	1399	1399	-0.2394171	-0.3199559	5.1488285
TEAM_BASERUN_SB	8	2276	131.90949	96.62529	104.0	116.04720	66.7170	0	697	697	1.8784660	4.5762445	2.0253715
TEAM_BASERUN_CS	9	2276	66.06634	40.68995	53.0	58.78814	23.7216	0	201	201	1.6897117	2.4838474	0.8529056
TEAM_PITCHING_H	10	2276	1779.21046	1406.84293	1518.0	1555.89517	174.9468	1137	30132	28995	10.3295111	141.8396985	29.4889618
TEAM_PITCHING_HR	11	2276	105.69859	61.29875	107.0	103.15697	74.1300	0	343	343	0.2877877	-0.6046311	1.2848886
TEAM_PITCHING_SO	12	2276	811.01582	542.27942	802.0	789.97530	256.4898	0	19278	19278	22.5067680	694.9733696	11.3667679
TEAM_FIELDING_E	13	2276	246.48067	227.77097	159.0	193.43798	62.2692	65	1898	1833	2.9904656	10.9702717	4.7743279
TEAM_FIELDING_DP	14	2276	142.85325	28.21508	146.0	144.04775	26.6868	52	228	176	-0.3746132	-0.0320501	0.5914189

Build Models

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -56.047  -8.426   0.163   8.301  54.143 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.738e+01  5.217e+00   5.248 1.68e-07 ***
## TEAM_BATTING_H    4.758e-02  3.622e-03  13.138  < 2e-16 ***
## TEAM_BATTING_2B  -2.423e-02  9.015e-03  -2.687  0.00726 ** 
## TEAM_BATTING_3B   4.881e-02  1.662e-02   2.936  0.00335 ** 
## TEAM_BATTING_HR   6.208e-02  2.378e-02   2.611  0.00909 ** 
## TEAM_BATTING_BB   9.773e-03  3.161e-03   3.092  0.00201 ** 
## TEAM_BATTING_SO  -1.247e-02  2.424e-03  -5.142 2.95e-07 ***
## TEAM_BASERUN_SB   3.871e-02  4.139e-03   9.352  < 2e-16 ***
## TEAM_BASERUN_CS   8.383e-05  9.045e-03   0.009  0.99261    
## TEAM_PITCHING_H  -2.748e-04  3.304e-04  -0.832  0.40561    
## TEAM_PITCHING_HR  1.424e-02  2.075e-02   0.686  0.49255    
## TEAM_PITCHING_SO  2.178e-03  6.663e-04   3.270  0.00109 ** 
## TEAM_FIELDING_E  -3.090e-02  2.511e-03 -12.302  < 2e-16 ***
## TEAM_FIELDING_DP -1.094e-01  1.245e-02  -8.788  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.87 on 2262 degrees of freedom
## Multiple R-squared:  0.336,  Adjusted R-squared:  0.3322 
## F-statistic: 88.06 on 13 and 2262 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.831  -8.439   0.112   8.291  54.626 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      27.026713   4.984325   5.422 6.51e-08 ***
## TEAM_BATTING_H    0.047465   0.003563  13.322  < 2e-16 ***
## TEAM_BATTING_2B  -0.024656   0.008992  -2.742 0.006155 ** 
## TEAM_BATTING_3B   0.052110   0.016242   3.208 0.001353 ** 
## TEAM_BATTING_HR   0.076492   0.009498   8.054 1.28e-15 ***
## TEAM_BATTING_BB   0.009775   0.003158   3.095 0.001990 ** 
## TEAM_BATTING_SO  -0.012085   0.002370  -5.100 3.68e-07 ***
## TEAM_BASERUN_SB   0.039431   0.003827  10.304  < 2e-16 ***
## TEAM_PITCHING_SO  0.001976   0.000586   3.372 0.000758 ***
## TEAM_FIELDING_E  -0.031771   0.002074 -15.321  < 2e-16 ***
## TEAM_FIELDING_DP -0.109107   0.012285  -8.881  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.87 on 2265 degrees of freedom
## Multiple R-squared:  0.3357, Adjusted R-squared:  0.3328 
## F-statistic: 114.5 on 10 and 2265 DF,  p-value: < 2.2e-16

Select Models

Appendex

moneyball_training_data <- read_csv("https://raw.githubusercontent.com/nschettini/CUNY-MSDS-DATA-621/master/moneyball-training-data.csv")


mbd1 <- describe(moneyball_training_data, na.rm = F)

mbd1$na_count <- sapply(moneyball_training_data, function(y) sum(length(which(is.na(y)))))
mbd1 <- mbd1[-1,]


kable(mbd1, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")

ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) + 
  geom_boxplot() +
  theme(legend.position="none") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) + 
  theme(panel.background = element_rect(fill = '#d0ddf2'))


ggplot(stack(moneyball_training_data), aes(x = ind, y = values)) + 
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 800)) +
  theme(legend.position="none") +
  theme(axis.text.x=element_text(angle=45, hjust=1)) + 
  theme(panel.background = element_rect(fill = '#d0ddf2'))


mb_hist <- moneyball_training_data
mb_hist <- mb_hist[,-1 ]

mb_hist %>%
  keep(is.numeric) %>%                     
  gather() %>%                             
  ggplot(aes(value)) +                     
    facet_wrap(~ key, scales = "free") +  
    geom_histogram(bins = 35)                 



kable(cor(drop_na(mb_hist))[,1], "html", escape = F) %>%
  kable_styling("striped", full_width = F) %>%
  column_spec(1, bold = T) %>%
  scroll_box(height = "500px")

corrgram(drop_na(mb_hist), order=TRUE,
         upper.panel=panel.cor, main="Moneyball")



mbd2 <- moneyball_training_data
mbd2 <- mbd2[,-1]
mbd2 <- mbd2[,-10]
mbd2 <- mbd2[,-12]



library(mice)
init = mice(mbd2, maxit=0) 
meth = init$method
predM = init$predictorMatrix

predM[, c("TARGET_WINS")]=0

imputed = mice(mbd2, method="rf", predictorMatrix=predM, m=5)

imputed <- complete(imputed)

imputedtable <- describe(imputed)

kable(imputedtable, "html", escape = F) %>%
  kable_styling("striped", full_width = T) %>%
  column_spec(1, bold = T) %>%
  scroll_box(width = "100%", height = "700px")

model1 <- lm(TARGET_WINS ~., imputed)

(summary(model1))

model2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, imputed)

summary(model2)

DATA621_HW1

Niteen Kumar

June 24, 2018