DATA 621 - Homework 1

Libraries

library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(MASS)
library(corrplot)
library(RColorBrewer)
library(GGally)
library(ggResidpanel)
library(psych)
library(mice)
library(reshape2)
library(cowplot)
library(car)
library(caTools)
library(VIM)
library(broom)

Introduction

The dataset that we have been given for this assignment is a collection of historical baseball team performances. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

The original Training data set is comprised of 17 elements and 2276 total observations. Out of those 17 elements, INDEX is simply an index value used for sorting while TARGET_WINS represents the response variable we are to use within our regression models. The remaining 15 elements are all potential predictor variables for our linear models. A summary table for the data set is provided below.

Objective

Our objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. We can only use the variables listed above (or variables that can be derived from the variables provided).

Data Load

Loaded Training and Evalutaion data sets into respective data frames.

train_df <- read.csv("https://raw.githubusercontent.com/soumya2g/CUNYDataMiningHomeWork/master/HomeWork_1/DataFiles/moneyball-training-data.csv")

eval_df <- read.csv("https://raw.githubusercontent.com/soumya2g/CUNYDataMiningHomeWork/master/HomeWork_1/DataFiles/moneyball-evaluation-data.csv")

Training Data

Excluded the ‘INDEX’ attribute from training and evaluation(test) data frames.

train_df <- train_df[-c(1)]

Sample snapshot of training data frame -

head(train_df, 20) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

TARGET_WINS	TEAM_BATTING_H	TEAM_BATTING_2B	TEAM_BATTING_3B	TEAM_BATTING_HR	TEAM_BATTING_BB	TEAM_BATTING_SO	TEAM_BASERUN_SB	TEAM_BASERUN_CS	TEAM_BATTING_HBP	TEAM_PITCHING_H	TEAM_PITCHING_HR	TEAM_PITCHING_BB	TEAM_PITCHING_SO	TEAM_FIELDING_E	TEAM_FIELDING_DP
39	1445	194	39	13	143	842	NA	NA	NA	9364	84	927	5456	1011	NA
70	1339	219	22	190	685	1075	37	28	NA	1347	191	689	1082	193	155
86	1377	232	35	137	602	917	46	27	NA	1377	137	602	917	175	153
70	1387	209	38	96	451	922	43	30	NA	1396	97	454	928	164	156
82	1297	186	27	102	472	920	49	39	NA	1297	102	472	920	138	168
75	1279	200	36	92	443	973	107	59	NA	1279	92	443	973	123	149
80	1244	179	54	122	525	1062	80	54	NA	1244	122	525	1062	136	186
85	1273	171	37	115	456	1027	40	36	NA	1281	116	459	1033	112	136
86	1391	197	40	114	447	922	69	27	NA	1391	114	447	922	127	169
76	1271	213	18	96	441	827	72	34	NA	1271	96	441	827	131	159
78	1305	179	27	82	374	888	60	39	NA	1364	86	391	928	119	141
68	1372	203	31	95	509	801	119	79	NA	1372	95	509	801	147	150
72	1332	196	41	55	597	816	221	109	NA	1340	55	601	821	185	165
76	1265	210	23	63	534	812	126	80	NA	1265	63	534	812	150	139
74	1380	233	40	131	542	880	159	89	NA	1380	131	542	880	147	137
87	1417	226	28	108	539	682	86	69	NA	1417	108	539	682	136	136
88	1563	242	43	164	589	843	100	53	NA	1563	164	589	843	135	172
66	1460	239	32	107	546	900	92	64	NA	1478	108	553	911	136	146
75	1390	197	24	143	579	841	65	49	NA	2047	211	853	1239	149	177
93	1518	268	26	186	613	760	55	53	NA	1518	186	613	760	106	171

str(train_df)

## 'data.frame':    2276 obs. of  16 variables:
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : int  842 1075 917 922 920 973 1062 1027 922 827 ...
##  $ TEAM_BASERUN_SB : int  NA 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : int  NA 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: int  5456 1082 917 928 920 973 1062 1033 922 827 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: int  NA 155 153 156 168 149 186 136 169 159 ...

PART I: Data Exploration

We wanted to start off data exploration process with high level descriptive statistical summary and missing/exception value analysis.

Descriptive Statistical Summary

Basic statistical summary of all features and dependant variable (TARGET_WINS).

stat_summary <- function(df){
  df %>%
    summary() %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
}
stat_summary(train_df)

TARGET_WINS	TEAM_BATTING_H	TEAM_BATTING_2B	TEAM_BATTING_3B	TEAM_BATTING_HR	TEAM_BATTING_BB	TEAM_BATTING_SO	TEAM_BASERUN_SB	TEAM_BASERUN_CS	TEAM_BATTING_HBP	TEAM_PITCHING_H	TEAM_PITCHING_HR	TEAM_PITCHING_BB	TEAM_PITCHING_SO	TEAM_FIELDING_E	TEAM_FIELDING_DP
Min. : 0.00	Min. : 891	Min. : 69.0	Min. : 0.00	Min. : 0.00	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. :29.00	Min. : 1137	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 65.0	Min. : 52.0
1st Qu.: 71.00	1st Qu.:1383	1st Qu.:208.0	1st Qu.: 34.00	1st Qu.: 42.00	1st Qu.:451.0	1st Qu.: 548.0	1st Qu.: 66.0	1st Qu.: 38.0	1st Qu.:50.50	1st Qu.: 1419	1st Qu.: 50.0	1st Qu.: 476.0	1st Qu.: 615.0	1st Qu.: 127.0	1st Qu.:131.0
Median : 82.00	Median :1454	Median :238.0	Median : 47.00	Median :102.00	Median :512.0	Median : 750.0	Median :101.0	Median : 49.0	Median :58.00	Median : 1518	Median :107.0	Median : 536.5	Median : 813.5	Median : 159.0	Median :149.0
Mean : 80.79	Mean :1469	Mean :241.2	Mean : 55.25	Mean : 99.61	Mean :501.6	Mean : 735.6	Mean :124.8	Mean : 52.8	Mean :59.36	Mean : 1779	Mean :105.7	Mean : 553.0	Mean : 817.7	Mean : 246.5	Mean :146.4
3rd Qu.: 92.00	3rd Qu.:1537	3rd Qu.:273.0	3rd Qu.: 72.00	3rd Qu.:147.00	3rd Qu.:580.0	3rd Qu.: 930.0	3rd Qu.:156.0	3rd Qu.: 62.0	3rd Qu.:67.00	3rd Qu.: 1682	3rd Qu.:150.0	3rd Qu.: 611.0	3rd Qu.: 968.0	3rd Qu.: 249.2	3rd Qu.:164.0
Max. :146.00	Max. :2554	Max. :458.0	Max. :223.00	Max. :264.00	Max. :878.0	Max. :1399.0	Max. :697.0	Max. :201.0	Max. :95.00	Max. :30132	Max. :343.0	Max. :3645.0	Max. :19278.0	Max. :1898.0	Max. :228.0
NA	NA	NA	NA	NA	NA	NA’s :102	NA’s :131	NA’s :772	NA’s :2085	NA	NA	NA	NA’s :102	NA	NA’s :286

We also used describe() function of ‘psych’ package to summarize additional statistical measurements like Standard Deviation, Skewness, Kurtois, Standard Error etc.

stat_desc <- function(df){
df %>% 
    describe() %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
}

stat_desc(train_df)

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
TARGET_WINS	1	2276	80.79086	15.75215	82.0	81.31229	14.8260	0	146	146	-0.3987232	1.0274757	0.3301823
TEAM_BATTING_H	2	2276	1469.26977	144.59120	1454.0	1459.04116	114.1602	891	2554	1663	1.5713335	7.2785261	3.0307891
TEAM_BATTING_2B	3	2276	241.24692	46.80141	238.0	240.39627	47.4432	69	458	389	0.2151018	0.0061609	0.9810087
TEAM_BATTING_3B	4	2276	55.25000	27.93856	47.0	52.17563	23.7216	0	223	223	1.1094652	1.5032418	0.5856226
TEAM_BATTING_HR	5	2276	99.61204	60.54687	102.0	97.38529	78.5778	0	264	264	0.1860421	-0.9631189	1.2691285
TEAM_BATTING_BB	6	2276	501.55888	122.67086	512.0	512.18331	94.8864	0	878	878	-1.0257599	2.1828544	2.5713150
TEAM_BATTING_SO	7	2174	735.60534	248.52642	750.0	742.31322	284.6592	0	1399	1399	-0.2978001	-0.3207992	5.3301912
TEAM_BASERUN_SB	8	2145	124.76177	87.79117	101.0	110.81188	60.7866	0	697	697	1.9724140	5.4896754	1.8955584
TEAM_BASERUN_CS	9	1504	52.80386	22.95634	49.0	50.35963	17.7912	0	201	201	1.9762180	7.6203818	0.5919414
TEAM_BATTING_HBP	10	191	59.35602	12.96712	58.0	58.86275	11.8608	29	95	66	0.3185754	-0.1119828	0.9382681
TEAM_PITCHING_H	11	2276	1779.21046	1406.84293	1518.0	1555.89517	174.9468	1137	30132	28995	10.3295111	141.8396985	29.4889618
TEAM_PITCHING_HR	12	2276	105.69859	61.29875	107.0	103.15697	74.1300	0	343	343	0.2877877	-0.6046311	1.2848886
TEAM_PITCHING_BB	13	2276	553.00791	166.35736	536.5	542.62459	98.5929	0	3645	3645	6.7438995	96.9676398	3.4870317
TEAM_PITCHING_SO	14	2174	817.73045	553.08503	813.5	796.93391	257.2311	0	19278	19278	22.1745535	671.1891292	11.8621151
TEAM_FIELDING_E	15	2276	246.48067	227.77097	159.0	193.43798	62.2692	65	1898	1833	2.9904656	10.9702717	4.7743279
TEAM_FIELDING_DP	16	1990	146.38794	26.22639	149.0	147.57789	23.7216	52	228	176	-0.3889390	0.1817397	0.5879114

Missing Value Analysis

Below are the features that has NA values. It is clear from the table below, TEAM_BATTING_HBP(>90%) and TEAM_BASERUN_CS (>33%) have significant NA Values -

## Counts of missing data per feature
train_na_df <- data.frame(apply(train_df, 2, function(x) length(which(is.na(x)))))
train_na_df1 <- data.frame(apply(train_df, 2,function(x) {sum(is.na(x)) / length(x) * 100}))

train_na_df <- cbind(Feature = rownames(train_na_df), train_na_df, train_na_df1)
colnames(train_na_df) <- c('Feature Name','No. of NA Recocrds','Percentage of NA Records')
rownames(train_na_df) <- NULL


train_na_df%>% filter(`No. of NA Recocrds` != 0) %>% arrange(desc(`No. of NA Recocrds`)) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")

Feature Name	No. of NA Recocrds	Percentage of NA Records
TEAM_BATTING_HBP	2085	91.608084
TEAM_BASERUN_CS	772	33.919156
TEAM_FIELDING_DP	286	12.565905
TEAM_BASERUN_SB	131	5.755712
TEAM_BATTING_SO	102	4.481547
TEAM_PITCHING_SO	102	4.481547

Descriptive Statistical Plots

Box Plots

## Box plots:
gb1 <- ggplot(data = train_df, aes(y = TARGET_WINS)) + geom_boxplot()
gb2 <- ggplot(data = train_df, aes(y = TEAM_BATTING_H)) + geom_boxplot()
gb3 <- ggplot(data = train_df, aes(y = TEAM_BATTING_2B)) + geom_boxplot()
gb4 <- ggplot(data = train_df, aes(y = TEAM_BATTING_3B)) + geom_boxplot()
gb5 <- ggplot(data = train_df, aes(y = TEAM_BATTING_HR)) + geom_boxplot()
gb6 <- ggplot(data = train_df, aes(y = TEAM_BATTING_BB)) + geom_boxplot()
gb7 <- ggplot(data = train_df, aes(y = TEAM_BATTING_HBP)) + geom_boxplot()
gb8 <- ggplot(data = train_df, aes(y = TEAM_BATTING_SO)) + geom_boxplot()
gb9 <- ggplot(data = train_df, aes(y = TEAM_BASERUN_SB)) + geom_boxplot()
gb10 <- ggplot(data = train_df, aes(y = TEAM_BASERUN_CS)) + geom_boxplot()
gb11 <- ggplot(data = train_df, aes(y = TEAM_FIELDING_E)) + geom_boxplot()
gb12 <- ggplot(data = train_df, aes(y = TEAM_FIELDING_DP)) + geom_boxplot()
gb13 <- ggplot(data = train_df, aes(y = TEAM_PITCHING_BB)) + geom_boxplot()
gb14 <- ggplot(data = train_df, aes(y = TEAM_PITCHING_H)) + geom_boxplot()
gb15 <- ggplot(data = train_df, aes(y = TEAM_PITCHING_HR)) + geom_boxplot()
gb16 <- ggplot(data = train_df, aes(y = TEAM_PITCHING_SO)) + geom_boxplot()

plot_grid(gb1, gb2, gb3, gb4, gb5, gb6, gb7, gb8, gb9, gb10,
          gb11, gb12, gb13, gb14, gb15, gb16, labels = "AUTO, scale = 10")

Density Plots

train_df %>%
  gather(variable, value, TARGET_WINS:TEAM_FIELDING_DP) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "dodgerblue4", color="dodgerblue4") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

## Warning: Removed 3478 rows containing non-finite values (stat_density).

Observations Summary

The dataset has many outliers with some observations that are more extreme than the 1.5 * IQR of the box plot whiskers. We also see that the variance of some of the explanatory variables greatly exceeds the variance of the response “win” variable.
Dependant variable TARGET_WINS is normally distributed which indicates that data set includes a balanced amount of good, bad and average team performances
TEAM_BATTING_HR,TEAM_BATTING_SO and TEAM_PITCHING_HR features are showing bi-modal distribution
Some of the features like TEAM_BASERUN_CS, TEAM_BASERUN_SB and TEAM_FIELDING_E are right skewed
Observing the results of missing value analysis, we need to remove TEAM_BATTING_HBP due to large amount of missing values. We will need to come up with an imputation mechanism for the other features with missing values.

PART II: Data Preparation

Remove the feature: TEAM_BATTING_HBP

The TEAM_BATTING_HBP variable contains 2085 missing values out of a total of 2277. Since it would be very difficult to accurately impute such a large proportion of any variable’s missing values, so we choose to exclude the variable from our further analysis.

train_df <- train_df %>% dplyr::select(-c(TEAM_BATTING_HBP))
eval_df <-  eval_df %>% dplyr::select(-c(TEAM_BATTING_HBP))

Plotting the missing data pattern after removing the max. missing feature -

aggr_plot <- aggr(train_df, col=c('dodgerblue4','red'),
                 numbers=TRUE, sortVars=TRUE,oma = c(10,5,5,3),
                 labels=names(train_df), cex.axis=.8, gap=3, axes = TRUE, Prop = TRUE,
                 ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##          Variable      Count
##   TEAM_BASERUN_CS 0.33919156
##  TEAM_FIELDING_DP 0.12565905
##   TEAM_BASERUN_SB 0.05755712
##   TEAM_BATTING_SO 0.04481547
##  TEAM_PITCHING_SO 0.04481547
##       TARGET_WINS 0.00000000
##    TEAM_BATTING_H 0.00000000
##   TEAM_BATTING_2B 0.00000000
##   TEAM_BATTING_3B 0.00000000
##   TEAM_BATTING_HR 0.00000000
##   TEAM_BATTING_BB 0.00000000
##   TEAM_PITCHING_H 0.00000000
##  TEAM_PITCHING_HR 0.00000000
##  TEAM_PITCHING_BB 0.00000000
##   TEAM_FIELDING_E 0.00000000

Imputation

Imputing the missing data for the other 4 features, which we are keeping for our analysis. We have used the ‘Predictive Mean Matching’(pmm) method included in MICE package for imputation purposes.

impute_data <- function(df){
  
  df <- mice(data = df, m = 1, method = "pmm", maxit = 5, seed = 500)
  df <- mice::complete(df, 1)
}

train_df <- impute_data(train_df)

## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP

eval_df <- impute_data(eval_df)

## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP

Feature Engineering

We created a new variable called TEAM_BATTING_1B which represents offensive single base hits. (created by subtracting out the TEAM_BATTING doubles, triples and home runs from the TEAM_BATTING_H variable). We believe that separating out singles from the other unique hit values will minimize collinearity. The TEAM_BATTING_H variable is then removed from the data set since it is simply a linear combination of its component variables.

add_batting1b_feature <- function(df){
  df <-df %>%
    mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR)
}

train_df <- add_batting1b_feature(train_df)
eval_df <- add_batting1b_feature(eval_df)

train_df <- train_df %>% dplyr::select(-c(TEAM_BATTING_H))
eval_df <-  eval_df %>% dplyr::select(-c(TEAM_BATTING_H))

Correlation Plot

corrMatrix <- round(cor(train_df),4)

corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

Based on the Correlation plot above, there is a high degree of collinearity amogst independent variables like TEAM_BATTING_HR, TEAM_PITCHING_HR,TEAM_BASERUN_SB, TEAM_BASERUN_CS etc.

Handling multicollinearity

TEAM_BASERUN_CS: This variable is strongly correlated (82%) with the TEAM_BASERUN_SB variable and is the 2nd largest source of NA’s in our data set. Hence we decided to exclude the variable from our analysis.

TEAM_PITCHING_HR: This variable is highly (97%) correlated with TEAM_BATTING_HR. The fact that these two variables are nearly perfectly correlated indicates that one of them can legitimately be removed from the data set, and we chose TEAM_PITCHING_HR to be removed from our model for further analysis.

train_df <- train_df %>% dplyr::select(-c(TEAM_BASERUN_CS,TEAM_PITCHING_HR))
eval_df <- eval_df %>% dplyr::select(-c(TEAM_BASERUN_CS,TEAM_PITCHING_HR))

Data Preparation Results

After making feature removal and imputation we computed the statistical summary of the resultant training data and created feature density plots.

stat_desc(train_df)

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
TARGET_WINS	1	2276	80.79086	15.75215	82.0	81.31229	14.8260	0	146	146	-0.3987232	1.0274757	0.3301823
TEAM_BATTING_2B	2	2276	241.24692	46.80141	238.0	240.39627	47.4432	69	458	389	0.2151018	0.0061609	0.9810087
TEAM_BATTING_3B	3	2276	55.25000	27.93856	47.0	52.17563	23.7216	0	223	223	1.1094652	1.5032418	0.5856226
TEAM_BATTING_HR	4	2276	99.61204	60.54687	102.0	97.38529	78.5778	0	264	264	0.1860421	-0.9631189	1.2691285
TEAM_BATTING_BB	5	2276	501.55888	122.67086	512.0	512.18331	94.8864	0	878	878	-1.0257599	2.1828544	2.5713150
TEAM_BATTING_SO	6	2276	727.16037	247.06360	733.0	731.75906	284.6592	0	1399	1399	-0.2248595	-0.3716978	5.1787224
TEAM_BASERUN_SB	7	2276	135.98330	100.99464	106.0	118.92481	68.1996	0	697	697	1.8483330	4.2016782	2.1169578
TEAM_PITCHING_H	8	2276	1779.21046	1406.84293	1518.0	1555.89517	174.9468	1137	30132	28995	10.3295111	141.8396985	29.4889618
TEAM_PITCHING_BB	9	2276	553.00791	166.35736	536.5	542.62459	98.5929	0	3645	3645	6.7438995	96.9676398	3.4870317
TEAM_PITCHING_SO	10	2276	808.52021	543.41454	800.5	787.86169	258.7137	0	19278	19278	22.3747281	689.5764312	11.3905614
TEAM_FIELDING_E	11	2276	246.48067	227.77097	159.0	193.43798	62.2692	65	1898	1833	2.9904656	10.9702717	4.7743279
TEAM_FIELDING_DP	12	2276	141.88664	29.21855	146.0	142.88968	28.1694	52	228	176	-0.2916907	-0.2611675	0.6124526
TEAM_BATTING_1B	13	2276	1073.16081	128.92464	1050.0	1059.11196	99.3342	709	2112	1403	2.0468187	8.5280361	2.7024011

train_df %>%
  gather(variable, value, TARGET_WINS:TEAM_BATTING_1B) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "dodgerblue4", color="dodgerblue4") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

PART III: Building Models

Test train approach

We have divided the traning dataset into training and test sets using a 80/20 split. We will build our models on the training set and evaluate it on the test set.

set.seed(123)
split <- sample.split(train_df$TARGET_WINS, SplitRatio = 0.8)
training_set <- subset(train_df, split == TRUE)
test_set <- subset(train_df, split == FALSE)

We will build four different models to see which one yields the best performance.

Model 1:

g1 <- lm(TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B +
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
           TEAM_BASERUN_SB +  TEAM_PITCHING_H +
            TEAM_PITCHING_BB + TEAM_PITCHING_SO +
           TEAM_FIELDING_E + TEAM_FIELDING_DP, data = training_set)

summary(g1)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.143  -8.443   0.218   8.144  50.190 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      34.5795058  5.4745350   6.316 3.36e-10 ***
## TEAM_BATTING_1B   0.0426304  0.0038183  11.165  < 2e-16 ***
## TEAM_BATTING_2B   0.0264099  0.0077580   3.404 0.000678 ***
## TEAM_BATTING_3B   0.0826533  0.0169375   4.880 1.15e-06 ***
## TEAM_BATTING_HR   0.1319750  0.0093123  14.172  < 2e-16 ***
## TEAM_BATTING_BB   0.0301181  0.0064215   4.690 2.93e-06 ***
## TEAM_BATTING_SO  -0.0222146  0.0036738  -6.047 1.79e-09 ***
## TEAM_BASERUN_SB   0.0569348  0.0046146  12.338  < 2e-16 ***
## TEAM_PITCHING_H   0.0019434  0.0004166   4.665 3.31e-06 ***
## TEAM_PITCHING_BB -0.0173696  0.0047283  -3.674 0.000246 ***
## TEAM_PITCHING_SO  0.0060394  0.0022425   2.693 0.007144 ** 
## TEAM_FIELDING_E  -0.0427808  0.0030067 -14.228  < 2e-16 ***
## TEAM_FIELDING_DP -0.1309038  0.0139311  -9.397  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.46 on 1813 degrees of freedom
## Multiple R-squared:  0.3996, Adjusted R-squared:  0.3956 
## F-statistic: 100.5 on 12 and 1813 DF,  p-value: < 2.2e-16

VIF Results

Multicollinearity can assessed by computing a score called the Variance Inflation Factor (or VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model. Features with VIF scores higher than 5 or 10 are still showing relatively higher degree of Collinearity. But we decided against removing any more features at this point.

vif(g1)

##  TEAM_BATTING_1B  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         2.917513         1.546951         2.655454         3.793152 
##  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_H 
##         7.508851         9.663351         2.643337         3.945886 
## TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##         5.230981         5.068137         5.649857         1.928087

Model Diagnostic Plots

resid_panel(g1, plots='default', smoother = TRUE)

RMSE Calculation

rmse_calc <- function(actual, predicted) {
  rmse_val <- sqrt(sum((actual - predicted)^2) / length(actual))
  
  return(rmse_val)
}

### RMSE of first model - training dataset
rmse_calc(training_set$TARGET_WINS, predict(g1, newdata = training_set))

## [1] 12.41448

### RMSE of first model - test dataset
rmse_calc(test_set$TARGET_WINS, predict(g1, newdata = test_set))

## [1] 12.72149

Model Summary

We have used the glance() methid from broom package to gather the model summary statistics in one data frame and appended our RMSE calculations so that we can compare these statistics in the model selection section.

model_1 <- as.data.frame(glance(g1))
model_1$Train_RMSE <- rmse_calc(training_set$TARGET_WINS, predict(g1, newdata = training_set))
model_1$Test_RMSE <- rmse_calc(test_set$TARGET_WINS, predict(g1, newdata = test_set))

Model 2:

# Removing : TEAM_PITCHING_SO - based on p-values

g2 <- lm(TARGET_WINS ~TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + 
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
           TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + 
           TEAM_FIELDING_E + TEAM_FIELDING_DP, data = training_set)


summary(g2)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.458  -8.386   0.166   8.132  49.739 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      34.529218   5.483930   6.296 3.81e-10 ***
## TEAM_BATTING_1B   0.042182   0.003821  11.039  < 2e-16 ***
## TEAM_BATTING_2B   0.027113   0.007767   3.491 0.000493 ***
## TEAM_BATTING_3B   0.081998   0.016965   4.833 1.46e-06 ***
## TEAM_BATTING_HR   0.128944   0.009260  13.925  < 2e-16 ***
## TEAM_BATTING_BB   0.020983   0.005462   3.842 0.000126 ***
## TEAM_BATTING_SO  -0.014900   0.002478  -6.013 2.20e-09 ***
## TEAM_BASERUN_SB   0.057051   0.004622  12.342  < 2e-16 ***
## TEAM_PITCHING_H   0.001734   0.000410   4.229 2.46e-05 ***
## TEAM_PITCHING_BB -0.009780   0.003803  -2.571 0.010205 *  
## TEAM_FIELDING_E  -0.041431   0.002970 -13.951  < 2e-16 ***
## TEAM_FIELDING_DP -0.126497   0.013858  -9.128  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.48 on 1814 degrees of freedom
## Multiple R-squared:  0.3972, Adjusted R-squared:  0.3935 
## F-statistic: 108.7 on 11 and 1814 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(g2, plots='default', smoother = TRUE)

#### RMSE Calculation

### RMSE of second model - training dataset
rmse_calc(training_set$TARGET_WINS, predict(g2, newdata = training_set))

## [1] 12.43929

### RMSE of second model - test dataset
rmse_calc(test_set$TARGET_WINS, predict(g2, newdata = test_set))

## [1] 12.98812

Model Summary

model_2 <- as.data.frame(glance(g2))
model_2$Train_RMSE <- rmse_calc(training_set$TARGET_WINS, predict(g2, newdata = training_set))
model_2$Test_RMSE <- rmse_calc(test_set$TARGET_WINS, predict(g2, newdata = test_set))

Model 3:

# Removing : TEAM_PITCHING_BB - based on p-value
g3 <- lm(TARGET_WINS ~TEAM_BATTING_1B + TEAM_BATTING_2B +
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
           TEAM_BASERUN_SB + TEAM_PITCHING_H + 
           TEAM_FIELDING_E + TEAM_FIELDING_DP, data = training_set)


summary(g3)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_H + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.454  -8.516   0.493   8.234  51.173 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      38.9664493  5.4295694   7.177 1.04e-12 ***
## TEAM_BATTING_1B   0.0444661  0.0038055  11.685  < 2e-16 ***
## TEAM_BATTING_2B   0.0320488  0.0077448   4.138 3.66e-05 ***
## TEAM_BATTING_HR   0.1224962  0.0091307  13.416  < 2e-16 ***
## TEAM_BATTING_BB   0.0113947  0.0033496   3.402 0.000684 ***
## TEAM_BATTING_SO  -0.0185025  0.0023936  -7.730 1.77e-14 ***
## TEAM_BASERUN_SB   0.0609333  0.0044339  13.743  < 2e-16 ***
## TEAM_PITCHING_H   0.0010385  0.0003508   2.961 0.003108 ** 
## TEAM_FIELDING_E  -0.0410182  0.0029818 -13.756  < 2e-16 ***
## TEAM_FIELDING_DP -0.1282933  0.0139448  -9.200  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.57 on 1816 degrees of freedom
## Multiple R-squared:  0.3881, Adjusted R-squared:  0.3851 
## F-statistic:   128 on 9 and 1816 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(g1, plots='default', smoother = TRUE)

#### RMSE Calculation

### RMSE of third model - training dataset
rmse_calc(training_set$TARGET_WINS, predict(g3, newdata = training_set))

## [1] 12.53268

### RMSE of third model - test dataset
rmse_calc(test_set$TARGET_WINS, predict(g3, newdata = test_set))

## [1] 12.57679

Model Summary

model_3 <- as.data.frame(glance(g3))
model_3$Train_RMSE <- rmse_calc(training_set$TARGET_WINS, predict(g3, newdata = training_set))
model_3$Test_RMSE <- rmse_calc(test_set$TARGET_WINS, predict(g3, newdata = test_set))

Model 4: Stepwise Regression Model

For the fourth model, we have used backward elimination process with all variables plus higher order polynomials for the features. We have used the stepAIC() function from MASS package, which choose the best model by AIC.

full_formula <- "TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP + I(TEAM_BATTING_1B^2) + I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_HR^2) + I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_SO^2) + I(TEAM_BASERUN_SB^2) + I(TEAM_PITCHING_H^2) +  I(TEAM_PITCHING_BB^2) + I(TEAM_PITCHING_SO^2) + I(TEAM_FIELDING_E^2) + I(TEAM_FIELDING_DP^2)"

full_model <- lm(full_formula, training_set)

step_model <- stepAIC(full_model, direction = "backward", 
                      trace = FALSE)

summary(step_model)

## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_BB + TEAM_FIELDING_E + TEAM_FIELDING_DP + 
##     I(TEAM_BATTING_1B^2) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_BB^2) + 
##     I(TEAM_BATTING_SO^2) + I(TEAM_BASERUN_SB^2) + I(TEAM_PITCHING_BB^2) + 
##     I(TEAM_FIELDING_E^2) + I(TEAM_FIELDING_DP^2), data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.245  -7.737  -0.246   7.499  58.689 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.415e+02  1.410e+01  10.036  < 2e-16 ***
## TEAM_BATTING_1B       -5.314e-02  2.010e-02  -2.643 0.008283 ** 
## TEAM_BATTING_2B        2.902e-02  7.545e-03   3.847 0.000124 ***
## TEAM_BATTING_3B        2.063e-01  4.496e-02   4.589 4.76e-06 ***
## TEAM_BATTING_HR        1.217e-01  9.350e-03  13.016  < 2e-16 ***
## TEAM_BATTING_BB       -1.672e-01  1.745e-02  -9.580  < 2e-16 ***
## TEAM_BATTING_SO        2.900e-02  8.201e-03   3.536 0.000416 ***
## TEAM_BASERUN_SB        7.621e-02  9.608e-03   7.932 3.75e-15 ***
## TEAM_PITCHING_BB       1.618e-02  7.719e-03   2.096 0.036195 *  
## TEAM_FIELDING_E       -8.866e-02  6.505e-03 -13.630  < 2e-16 ***
## TEAM_FIELDING_DP      -4.769e-01  8.126e-02  -5.870 5.19e-09 ***
## I(TEAM_BATTING_1B^2)   4.129e-05  8.048e-06   5.130 3.20e-07 ***
## I(TEAM_BATTING_3B^2)  -4.483e-04  2.686e-04  -1.669 0.095266 .  
## I(TEAM_BATTING_BB^2)   1.684e-04  1.554e-05  10.840  < 2e-16 ***
## I(TEAM_BATTING_SO^2)  -2.873e-05  5.272e-06  -5.449 5.77e-08 ***
## I(TEAM_BASERUN_SB^2)  -2.452e-05  1.687e-05  -1.454 0.146182    
## I(TEAM_PITCHING_BB^2) -7.055e-06  3.142e-06  -2.245 0.024868 *  
## I(TEAM_FIELDING_E^2)   2.228e-05  3.945e-06   5.649 1.87e-08 ***
## I(TEAM_FIELDING_DP^2)  1.172e-03  2.793e-04   4.197 2.83e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.89 on 1807 degrees of freedom
## Multiple R-squared:  0.4552, Adjusted R-squared:  0.4498 
## F-statistic:  83.9 on 18 and 1807 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(step_model, plots='default', smoother = TRUE)

#### RMSE Calculation

### RMSE of fourth model - training dataset
rmse_calc(training_set$TARGET_WINS, predict(step_model, newdata = training_set))

## [1] 11.82498

### RMSE of fourth model - test dataset
rmse_calc(test_set$TARGET_WINS, predict(step_model, newdata = test_set))

## [1] 13.20033

Model Summary

model_4 <- as.data.frame(glance(step_model))
model_4$Train_RMSE <- rmse_calc(training_set$TARGET_WINS, predict(step_model, newdata = training_set))
model_4$Test_RMSE <- rmse_calc(test_set$TARGET_WINS, predict(step_model, newdata = test_set))

PART IV: Selecting Models

Compare Key statistics

The table below summarizes the model statistics for all four of our models. The models are listed from left to right in accordance with the order in which they were described in Part III.

model_summary <- rbind(model_1,model_2,model_3,model_4)

Model_Name <- c('Model 1(All Features)','Model 2 (Selective Features)','Model 3(Selective Features)','Model 4(Stepwise - Backward Elimination)')

model_summary <- t(model_summary) 
colnames(model_summary) <- Model_Name

model_summary %>% kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")

	Model 1(All Features)	Model 2 (Selective Features)	Model 3(Selective Features)	Model 4(Stepwise - Backward Elimination)
r.squared	3.995815e-01	3.971795e-01	3.880940e-01	4.552490e-01
adj.r.squared	3.956074e-01	3.935240e-01	3.850615e-01	4.498226e-01
sigma	1.245891e+01	1.248036e+01	1.256714e+01	1.188699e+01
statistic	1.005467e+02	1.086534e+02	1.279751e+02	8.389510e+01
p.value	0.000000e+00	0.000000e+00	0.000000e+00	0.000000e+00
df	1.300000e+01	1.200000e+01	1.000000e+01	1.900000e+01
logLik	-7.190427e+03	-7.194072e+03	-7.207730e+03	-7.101594e+03
AIC	1.440885e+04	1.441414e+04	1.443746e+04	1.424319e+04
BIC	1.448599e+04	1.448577e+04	1.449807e+04	1.435338e+04
deviance	2.814219e+05	2.825477e+05	2.868062e+05	2.553300e+05
df.residual	1.813000e+03	1.814000e+03	1.816000e+03	1.807000e+03
Train_RMSE	1.241448e+01	1.243929e+01	1.253268e+01	1.182498e+01
Test_RMSE	1.272149e+01	1.298812e+01	1.257679e+01	1.320033e+01

Model Interpretation

Between the four models, we looked at the relevant diagnostic plots like residual plot, Q-Q plot, Histogram etc. We also revewed the adjusted R square and computed RMSE numbers for all 4 models.

We have decided to go with Model 4 which showed maximum Adj. R squared and minimum RMSE values. Below are the coefficients for the selected best model.

model_coefficients <- step_model$coefficients

model_coefficients

##           (Intercept)       TEAM_BATTING_1B       TEAM_BATTING_2B 
##          1.415360e+02         -5.313552e-02          2.902182e-02 
##       TEAM_BATTING_3B       TEAM_BATTING_HR       TEAM_BATTING_BB 
##          2.063066e-01          1.217027e-01         -1.671953e-01 
##       TEAM_BATTING_SO       TEAM_BASERUN_SB      TEAM_PITCHING_BB 
##          2.899771e-02          7.620586e-02          1.618197e-02 
##       TEAM_FIELDING_E      TEAM_FIELDING_DP  I(TEAM_BATTING_1B^2) 
##         -8.865551e-02         -4.769411e-01          4.129007e-05 
##  I(TEAM_BATTING_3B^2)  I(TEAM_BATTING_BB^2)  I(TEAM_BATTING_SO^2) 
##         -4.483305e-04          1.684286e-04         -2.872571e-05 
##  I(TEAM_BASERUN_SB^2) I(TEAM_PITCHING_BB^2)  I(TEAM_FIELDING_E^2) 
##         -2.451895e-05         -7.055294e-06          2.228347e-05 
## I(TEAM_FIELDING_DP^2) 
##          1.172112e-03

Model Prediction

lm_predicted <- round(predict(step_model, newdata = eval_df),0)

lm_predicted_df <- as.data.frame(cbind(eval_df$INDEX, lm_predicted))

colnames(lm_predicted_df) <- c('INDEX','TARGET_WINS')

lm_predicted_df %>% kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")

INDEX	TARGET_WINS
9	58
10	62
14	70
47	87
60	87
63	73
74	78
83	76
98	71
120	70
123	65
135	82
138	82
140	80
151	85
153	76
171	71
184	76
193	75
213	86
217	82
226	83
230	82
241	68
291	81
294	87
300	35
348	71
350	82
357	70
367	96
368	86
372	89
382	91
388	78
396	85
398	75
403	89
407	83
410	89
412	82
414	95
436	13
440	119
476	96
479	80
481	97
501	79
503	67
506	77
519	74
522	82
550	73
554	75
566	72
578	80
596	89
599	82
605	62
607	80
614	87
644	84
692	88
699	84
700	85
716	105
721	70
722	77
729	77
731	91
746	80
763	71
774	81
776	89
788	79
789	78
792	86
811	82
835	74
837	78
861	81
862	86
863	94
871	74
879	87
887	78
892	84
904	83
909	94
925	92
940	78
951	54
976	63
981	93
983	90
984	89
989	97
995	103
1000	86
1001	88
1007	78
1016	68
1027	81
1033	84
1070	69
1081	70
1084	59
1098	74
1150	88
1160	52
1169	88
1172	90
1174	96
1176	94
1178	83
1184	79
1193	87
1196	82
1199	71
1207	78
1218	93
1223	70
1226	72
1227	78
1229	69
1241	85
1244	85
1246	77
1248	91
1249	90
1253	83
1261	79
1305	77
1314	80
1323	86
1328	74
1353	76
1363	76
1371	93
1372	80
1389	69
1393	75
1421	90
1431	69
1437	73
1442	71
1450	75
1463	80
1464	80
1470	84
1471	82
1484	80
1495	32
1507	67
1514	75
1526	68
1549	92
1552	66
1556	88
1564	72
1585	107
1586	122
1590	97
1591	113
1592	106
1603	93
1612	84
1634	82
1645	72
1647	79
1673	91
1674	92
1687	78
1688	90
1700	79
1708	75
1713	83
1717	69
1721	72
1730	79
1737	95
1748	89
1749	86
1763	85
1768	68
1778	96
1780	85
1782	54
1784	67
1794	111
1803	67
1804	82
1819	78
1832	73
1833	74
1844	60
1847	74
1854	93
1855	81
1857	84
1864	69
1865	77
1869	73
1880	93
1881	81
1882	88
1894	76
1896	77
1916	77
1918	69
1921	102
1926	87
1938	82
1979	62
1982	68
1987	82
1997	80
2004	97
2011	77
2015	81
2022	73
2025	74
2027	78
2031	72
2036	81
2066	76
2073	79
2087	76
2092	83
2125	75
2148	86
2162	91
2191	83
2203	87
2218	79
2221	75
2225	78
2232	78
2267	85
2291	71
2299	86
2317	86
2318	81
2353	83
2403	59
2411	85
2415	80
2424	84
2441	71
2464	85
2465	77
2472	72
2481	92
2487	39
2500	67
2501	79
2520	86
2521	86
2525	76

Statistical summary of the Prediction Output

result_summary <- round(rbind(as.data.frame(describe(train_df %>% dplyr::select(TARGET_WINS))),
                        as.data.frame(describe(lm_predicted_df %>% dplyr::select(TARGET_WINS)))),0)

dataset_label <- data.frame(c('Training Data','Model Prediction'))

result_summary <- cbind(dataset_label,result_summary)
names(result_summary)[1] <- 'Data Set'
rownames(result_summary) <- NULL

result_summary  %>% kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")

Data Set	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
Training Data	1	2276	81	16	82	81	15	0	146	146	0	1	0
Model Prediction	1	259	80	12	80	80	9	13	122	109	-1	5	1

Conclusion

Based on the above table, mean and median of the training data set and prediction output are pretty close with prediction range 13 and 122. So even if not perfect, but the model output seems to be within acceptable levels of accuracy. But definitely there are further improvements that can be applied on this model. Probably more advanced technique and domain knowledge can be used to better handle outliers present in the training data set to improve the model quality.

Model output

# export to .csv for submission
write.csv(lm_predicted_df, file = "C:/CUNY/Semester4(Spring)/DATA 621/Assignments/Homework1/Output/moneyball_submission.csv",row.names = FALSE)

Our model prediction output can be found in the below GitHub location -

Model Output

References:

Multicollinearity & VIF - Multicollinearity essentials and VIF in R

Broom Package - Model Summary Statistics

MICE Tutorial for Imputation - MICE Package

DATA 621 - Homework 1

Deepak Mongia & Soumya Ghosh

February 25, 2020

Libraries

Introduction

Objective

Data Load

Training Data

PART I: Data Exploration

Descriptive Statistical Summary

Missing Value Analysis

Descriptive Statistical Plots

Box Plots

Density Plots

Observations Summary

PART II: Data Preparation

Remove the feature: TEAM_BATTING_HBP

Imputation

Feature Engineering

Correlation Plot

Handling multicollinearity

Data Preparation Results

PART III: Building Models

Test train approach

Model 1:

VIF Results

Model Diagnostic Plots

RMSE Calculation

Model Summary

Model 2:

Model Diagnostic Plots

Model Summary

Model 3:

Model Diagnostic Plots

Model Summary

Model 4: Stepwise Regression Model

Model Diagnostic Plots

Model Summary

PART IV: Selecting Models

Compare Key statistics

Model Interpretation

Model Prediction

Statistical summary of the Prediction Output

Conclusion

Model output

References: