Libraries

library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(plotly)
library(MASS)
library(corrplot)
library(RColorBrewer)
library(GGally)
library(ggResidpanel)

library(psych)
library(mice)
library(reshape2)
library(cowplot)
library(car)
library(caTools)
library(VIM)
library(broom)

Introduction

The data set provided for this assignment is a collection of historical baseball team performances. Each record represents a professional baseball team in a single season between 1871 and 2006, inclusive, with all statistics adjusted to the equivalent of a 162-game season.

The original training data set comprises 17 variables and 2276 observations. Of those 17 variables, INDEX is simply an index value used for sorting, while TARGET_WINS is the response variable for our regression models. The remaining 15 variables are all potential predictors for our linear models. A summary table for the data set is provided below.

Objective

Our objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. We can only use the variables listed above (or variables that can be derived from the variables provided).

Data Load

We loaded the training and evaluation (test) data sets into their respective data frames.

train_df <- read.csv("https://raw.githubusercontent.com/soumya2g/CUNYDataMiningHomeWork/master/HomeWork_1/DataFiles/moneyball-training-data.csv")

eval_df <- read.csv("https://raw.githubusercontent.com/soumya2g/CUNYDataMiningHomeWork/master/HomeWork_1/DataFiles/moneyball-evaluation-data.csv")

Training Data

We excluded the ‘INDEX’ attribute from both the training and evaluation (test) data frames.

train_df <- train_df[-c(1)]
eval_df <- eval_df[-c(1)]

A sample snapshot of the training data frame -

head(train_df, 20) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
39 1445 194 39 13 143 842 NA NA NA 9364 84 927 5456 1011 NA
70 1339 219 22 190 685 1075 37 28 NA 1347 191 689 1082 193 155
86 1377 232 35 137 602 917 46 27 NA 1377 137 602 917 175 153
70 1387 209 38 96 451 922 43 30 NA 1396 97 454 928 164 156
82 1297 186 27 102 472 920 49 39 NA 1297 102 472 920 138 168
75 1279 200 36 92 443 973 107 59 NA 1279 92 443 973 123 149
80 1244 179 54 122 525 1062 80 54 NA 1244 122 525 1062 136 186
85 1273 171 37 115 456 1027 40 36 NA 1281 116 459 1033 112 136
86 1391 197 40 114 447 922 69 27 NA 1391 114 447 922 127 169
76 1271 213 18 96 441 827 72 34 NA 1271 96 441 827 131 159
78 1305 179 27 82 374 888 60 39 NA 1364 86 391 928 119 141
68 1372 203 31 95 509 801 119 79 NA 1372 95 509 801 147 150
72 1332 196 41 55 597 816 221 109 NA 1340 55 601 821 185 165
76 1265 210 23 63 534 812 126 80 NA 1265 63 534 812 150 139
74 1380 233 40 131 542 880 159 89 NA 1380 131 542 880 147 137
87 1417 226 28 108 539 682 86 69 NA 1417 108 539 682 136 136
88 1563 242 43 164 589 843 100 53 NA 1563 164 589 843 135 172
66 1460 239 32 107 546 900 92 64 NA 1478 108 553 911 136 146
75 1390 197 24 143 579 841 65 49 NA 2047 211 853 1239 149 177
93 1518 268 26 186 613 760 55 53 NA 1518 186 613 760 106 171
str(train_df)
## 'data.frame':    2276 obs. of  16 variables:
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : int  842 1075 917 922 920 973 1062 1027 922 827 ...
##  $ TEAM_BASERUN_SB : int  NA 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : int  NA 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: int  5456 1082 917 928 920 973 1062 1033 922 827 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: int  NA 155 153 156 168 149 186 136 169 159 ...

Data Exploration

We start the data exploration process with a high-level descriptive statistical summary and an analysis of missing/exceptional values.

Descriptive Statistical Summary

Basic statistical summary of all features and the dependent variable (TARGET_WINS).

stat_summary <- function(df){
  df %>%
    summary() %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
}
stat_summary(train_df)
TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
Median : 82.00 Median :1454 Median :238.0 Median : 47.00 Median :102.00 Median :512.0 Median : 750.0 Median :101.0 Median : 49.0 Median :58.00 Median : 1518 Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779 Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :146.4
3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
NA NA NA NA NA NA NA's :102 NA's :131 NA's :772 NA's :2085 NA NA NA NA's :102 NA NA's :286

We also used the describe() function from the ‘psych’ package to summarize additional statistical measures such as standard deviation, skewness, kurtosis and standard error.

stat_desc <- function(df){
df %>% 
    describe() %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
}

stat_desc(train_df)
vars n mean sd median trimmed mad min max range skew kurtosis se
TARGET_WINS 1 2276 80.79086 15.75215 82.0 81.31229 14.8260 0 146 146 -0.3987232 1.0274757 0.3301823
TEAM_BATTING_H 2 2276 1469.26977 144.59120 1454.0 1459.04116 114.1602 891 2554 1663 1.5713335 7.2785261 3.0307891
TEAM_BATTING_2B 3 2276 241.24692 46.80141 238.0 240.39627 47.4432 69 458 389 0.2151018 0.0061609 0.9810087
TEAM_BATTING_3B 4 2276 55.25000 27.93856 47.0 52.17563 23.7216 0 223 223 1.1094652 1.5032418 0.5856226
TEAM_BATTING_HR 5 2276 99.61204 60.54687 102.0 97.38529 78.5778 0 264 264 0.1860421 -0.9631189 1.2691285
TEAM_BATTING_BB 6 2276 501.55888 122.67086 512.0 512.18331 94.8864 0 878 878 -1.0257599 2.1828544 2.5713150
TEAM_BATTING_SO 7 2174 735.60534 248.52642 750.0 742.31322 284.6592 0 1399 1399 -0.2978001 -0.3207992 5.3301912
TEAM_BASERUN_SB 8 2145 124.76177 87.79117 101.0 110.81188 60.7866 0 697 697 1.9724140 5.4896754 1.8955584
TEAM_BASERUN_CS 9 1504 52.80386 22.95634 49.0 50.35963 17.7912 0 201 201 1.9762180 7.6203818 0.5919414
TEAM_BATTING_HBP 10 191 59.35602 12.96712 58.0 58.86275 11.8608 29 95 66 0.3185754 -0.1119828 0.9382681
TEAM_PITCHING_H 11 2276 1779.21046 1406.84293 1518.0 1555.89517 174.9468 1137 30132 28995 10.3295111 141.8396985 29.4889618
TEAM_PITCHING_HR 12 2276 105.69859 61.29875 107.0 103.15697 74.1300 0 343 343 0.2877877 -0.6046311 1.2848886
TEAM_PITCHING_BB 13 2276 553.00791 166.35736 536.5 542.62459 98.5929 0 3645 3645 6.7438995 96.9676398 3.4870317
TEAM_PITCHING_SO 14 2174 817.73045 553.08503 813.5 796.93391 257.2311 0 19278 19278 22.1745535 671.1891292 11.8621151
TEAM_FIELDING_E 15 2276 246.48067 227.77097 159.0 193.43798 62.2692 65 1898 1833 2.9904656 10.9702717 4.7743279
TEAM_FIELDING_DP 16 1990 146.38794 26.22639 149.0 147.57789 23.7216 52 228 176 -0.3889390 0.1817397 0.5879114

Missing Value Analysis

Below are the features that have NA values. It is clear from the table below that TEAM_BATTING_HBP (>90%) and TEAM_BASERUN_CS (>33%) have a significant proportion of NA values -

## Counts of missing data per feature
train_na_df <- data.frame(apply(train_df, 2, function(x) length(which(is.na(x)))))
train_na_df1 <- data.frame(apply(train_df, 2,function(x) {sum(is.na(x)) / length(x) * 100}))

train_na_df <- cbind(Feature = rownames(train_na_df), train_na_df, train_na_df1)
colnames(train_na_df) <- c('Feature Name','No. of NA Records','Percentage of NA Records')
rownames(train_na_df) <- NULL


train_na_df %>% filter(`No. of NA Records` != 0) %>% arrange(desc(`No. of NA Records`)) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Feature Name No. of NA Records Percentage of NA Records
TEAM_BATTING_HBP 2085 91.608084
TEAM_BASERUN_CS 772 33.919156
TEAM_FIELDING_DP 286 12.565905
TEAM_BASERUN_SB 131 5.755712
TEAM_BATTING_SO 102 4.481547
TEAM_PITCHING_SO 102 4.481547

Descriptive Statistical Plots

Box Plots

## Box plots:
gb1 <- ggplot(data = train_df, aes(y = TARGET_WINS)) + geom_boxplot()
gb2 <- ggplot(data = train_df, aes(y = TEAM_BATTING_H)) + geom_boxplot()
gb3 <- ggplot(data = train_df, aes(y = TEAM_BATTING_2B)) + geom_boxplot()
gb4 <- ggplot(data = train_df, aes(y = TEAM_BATTING_3B)) + geom_boxplot()
gb5 <- ggplot(data = train_df, aes(y = TEAM_BATTING_HR)) + geom_boxplot()
gb6 <- ggplot(data = train_df, aes(y = TEAM_BATTING_BB)) + geom_boxplot()
gb7 <- ggplot(data = train_df, aes(y = TEAM_BATTING_HBP)) + geom_boxplot()
gb8 <- ggplot(data = train_df, aes(y = TEAM_BATTING_SO)) + geom_boxplot()
gb9 <- ggplot(data = train_df, aes(y = TEAM_BASERUN_SB)) + geom_boxplot()
gb10 <- ggplot(data = train_df, aes(y = TEAM_BASERUN_CS)) + geom_boxplot()
gb11 <- ggplot(data = train_df, aes(y = TEAM_FIELDING_E)) + geom_boxplot()
gb12 <- ggplot(data = train_df, aes(y = TEAM_FIELDING_DP)) + geom_boxplot()
gb13 <- ggplot(data = train_df, aes(y = TEAM_PITCHING_BB)) + geom_boxplot()
gb14 <- ggplot(data = train_df, aes(y = TEAM_PITCHING_H)) + geom_boxplot()
gb15 <- ggplot(data = train_df, aes(y = TEAM_PITCHING_HR)) + geom_boxplot()
gb16 <- ggplot(data = train_df, aes(y = TEAM_PITCHING_SO)) + geom_boxplot()

plot_grid(gb1, gb2, gb3, gb4, gb5, gb6, gb7, gb8, gb9, gb10,
          gb11, gb12, gb13, gb14, gb15, gb16, labels = "AUTO", label_size = 10)

Density Plots

train_df %>%
  gather(variable, value, TARGET_WINS:TEAM_FIELDING_DP) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "dodgerblue4", color="dodgerblue4") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())
## Warning: Removed 3478 rows containing non-finite values (stat_density).

Observations Summary

  • The data set contains many outliers, with some observations falling beyond the 1.5 * IQR whiskers of the box plots. We also see that the variance of several explanatory variables greatly exceeds the variance of the response variable TARGET_WINS; the sketch after this list quantifies both points.
  • The dependent variable TARGET_WINS is approximately normally distributed, which indicates that the data set includes a balanced mix of good, bad and average team performances.
  • TEAM_BATTING_HR, TEAM_BATTING_SO and TEAM_PITCHING_HR show bimodal distributions.
  • Features such as TEAM_BASERUN_CS, TEAM_BASERUN_SB and TEAM_FIELDING_E are right skewed.
  • Based on the missing value analysis, we need to remove TEAM_BATTING_HBP because of its large proportion of missing values. We will need an imputation mechanism for the other features with missing values.
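
To quantify the first observation, the short sketch below counts the values lying beyond the 1.5 * IQR whiskers for each feature and compares each feature's variance with that of TARGET_WINS. This is illustrative helper code we added; outlier_count and outlier_summary are our own names and are not part of the original assignment code.

## Count observations beyond the 1.5 * IQR box plot whiskers for one feature
outlier_count <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

## Outlier counts and variance ratios (feature variance / variance of TARGET_WINS)
outlier_summary <- data.frame(
  Feature        = names(train_df),
  Outlier_Count  = sapply(train_df, outlier_count),
  Variance_Ratio = sapply(train_df, var, na.rm = TRUE) / var(train_df$TARGET_WINS)
)
rownames(outlier_summary) <- NULL

outlier_summary %>% arrange(desc(Outlier_Count)) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))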

Data Preparation

Remove the feature: TEAM_BATTING_HBP

The TEAM_BATTING_HBP variable contains 2085 missing values out of 2276 observations. Since it would be very difficult to accurately impute such a large proportion of a variable's values, we chose to exclude it from further analysis.

train_df <- train_df %>% dplyr::select(-c(TEAM_BATTING_HBP))
eval_df <-  eval_df %>% dplyr::select(-c(TEAM_BATTING_HBP))

Plotting the missing data pattern after removing the feature with the most missing values -

aggr_plot <- aggr(train_df, col=c('dodgerblue4','red'),
                 numbers=TRUE, sortVars=TRUE,
                 labels=names(train_df), cex.axis=.7, gap=7, axes = TRUE, prop = TRUE,
                 ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##          Variable      Count
##   TEAM_BASERUN_CS 0.33919156
##  TEAM_FIELDING_DP 0.12565905
##   TEAM_BASERUN_SB 0.05755712
##   TEAM_BATTING_SO 0.04481547
##  TEAM_PITCHING_SO 0.04481547
##       TARGET_WINS 0.00000000
##    TEAM_BATTING_H 0.00000000
##   TEAM_BATTING_2B 0.00000000
##   TEAM_BATTING_3B 0.00000000
##   TEAM_BATTING_HR 0.00000000
##   TEAM_BATTING_BB 0.00000000
##   TEAM_PITCHING_H 0.00000000
##  TEAM_PITCHING_HR 0.00000000
##  TEAM_PITCHING_BB 0.00000000
##   TEAM_FIELDING_E 0.00000000

Imputation

We imputed the missing data for the remaining five features, which we are keeping for our analysis, using the ‘Predictive Mean Matching’ (pmm) method from the mice package.

impute_data <- function(df){
  
  df <- mice(data = df, m = 1, method = "pmm", maxit = 5, seed = 500)
  df <- mice::complete(df, 1)
}

#train_df <- mice(data = train_df, m = 1,method = "pmm", maxit = 5, seed = 500)
#train_df <- mice::complete(train_df, 1)

train_df <- impute_data(train_df)
## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
eval_df <- impute_data(eval_df)
## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP

Feature Engineering

We created a new variable called TEAM_BATTING_1B, which represents offensive single-base hits (computed by subtracting doubles, triples and home runs from TEAM_BATTING_H). We believe that separating singles from the other hit types will minimize collinearity. The TEAM_BATTING_H variable is then removed from the data set, since it is simply a linear combination of its component variables.

add_batting1b_feature <- function(df){
  df <-df %>%
    mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR)
}

train_df <- add_batting1b_feature(train_df)
eval_df <- add_batting1b_feature(eval_df)

train_df <- train_df %>% dplyr::select(-c(TEAM_BATTING_H))
eval_df <-  eval_df %>% dplyr::select(-c(TEAM_BATTING_H))

Correlation Plot

corrMatrix <- round(cor(train_df),4)

corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

Based on the correlation plot above, there is a high degree of collinearity among independent variables such as TEAM_BATTING_HR, TEAM_PITCHING_HR, TEAM_BASERUN_SB and TEAM_BASERUN_CS.
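
As a quick check of this statement, the sketch below (illustrative helper code that we added; high_corr_pairs is our own object name, not part of the original submission) melts corrMatrix with reshape2 and lists the most strongly correlated predictor pairs:

## List the strongest pairwise correlations among the predictors
high_corr_pairs <- corrMatrix %>%
  reshape2::melt(varnames = c("Feature_1", "Feature_2"), value.name = "Correlation") %>%
  filter(as.character(Feature_1) < as.character(Feature_2),   # keep each pair once, drop the diagonal
         as.character(Feature_1) != "TARGET_WINS",
         as.character(Feature_2) != "TARGET_WINS") %>%
  arrange(desc(abs(Correlation)))

head(high_corr_pairs, 5) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))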

Handling Collinearity

TEAM_BASERUN_CS: This variable is strongly correlated (82%) with TEAM_BASERUN_SB and is the second-largest source of NAs in our data set. Hence we decided to exclude it from our final models.

TEAM_PITCHING_HR: This variable is highly correlated (97%) with TEAM_BATTING_HR. Because the two are nearly perfectly correlated, one of them can legitimately be dropped, and we chose to drop TEAM_PITCHING_HR. Rather than removing these columns from the data frames here, we keep them and drop them during model refinement (Models 2 through 4 below).

##train_model_df <- train_df %>% dplyr::select(-c(TEAM_BASERUN_CS,TEAM_PITCHING_HR))
##eval_df <- eval_df %>% dplyr::select(-c(TEAM_BASERUN_CS,TEAM_PITCHING_HR))

Data Preparation Results

We created density plots after feature removal and imputation.

train_df %>%
  gather(variable, value, TARGET_WINS:TEAM_BATTING_1B) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "dodgerblue4", color="dodgerblue4") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

Train/Test Split Approach

We divided the training data set into training and test subsets using an 80/20 split. We will build our models on the training subset and evaluate them on the test subset.

set.seed(123)
split <- sample.split(train_df$TARGET_WINS, SplitRatio = 0.8)
training_set <- subset(train_df, split == TRUE)
test_set <- subset(train_df, split == FALSE)

Building Models

We will build four different models to see which one yields the best performance.

Model 1:

g1 <- lm(TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B +
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
           TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H +
           TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
           TEAM_FIELDING_E + TEAM_FIELDING_DP, data = training_set)

vif(g1)
##  TEAM_BATTING_1B  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         2.935719         1.574744         2.830456        45.896140 
##  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
##         8.756261         9.855289         4.159089         4.284251 
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
##         3.999482        36.763145         6.430649         5.210524 
##  TEAM_FIELDING_E TEAM_FIELDING_DP 
##         5.656579         1.953459
summary(g1) 
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + 
##     TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.086  -8.424   0.215   8.121  49.167 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      33.6635331  5.5863275   6.026 2.03e-09 ***
## TEAM_BATTING_1B   0.0424969  0.0038303  11.095  < 2e-16 ***
## TEAM_BATTING_2B   0.0274386  0.0078276   3.505 0.000467 ***
## TEAM_BATTING_3B   0.0766673  0.0174872   4.384 1.23e-05 ***
## TEAM_BATTING_HR   0.1117477  0.0323936   3.450 0.000574 ***
## TEAM_BATTING_BB   0.0316514  0.0069346   4.564 5.35e-06 ***
## TEAM_BATTING_SO  -0.0215649  0.0037102  -5.812 7.26e-09 ***
## TEAM_BASERUN_SB   0.0529440  0.0057885   9.146  < 2e-16 ***
## TEAM_BASERUN_CS   0.0136834  0.0117694   1.163 0.245138    
## TEAM_PITCHING_H   0.0020019  0.0004194   4.773 1.96e-06 ***
## TEAM_PITCHING_HR  0.0199291  0.0286937   0.695 0.487429    
## TEAM_PITCHING_BB -0.0185423  0.0052427  -3.537 0.000415 ***
## TEAM_PITCHING_SO  0.0056043  0.0022739   2.465 0.013807 *  
## TEAM_FIELDING_E  -0.0428401  0.0030086 -14.239  < 2e-16 ***
## TEAM_FIELDING_DP -0.1293847  0.0140228  -9.227  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.46 on 1811 degrees of freedom
## Multiple R-squared:  0.4002, Adjusted R-squared:  0.3956 
## F-statistic: 86.31 on 14 and 1811 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(g1, plots='default', smoother = TRUE)

rmse_calc <- function(actual, predicted) {
  rmse_val <- sqrt(sum((actual - predicted)^2) / length(actual))
  
  return(rmse_val)
}

### RMSE of first model - training dataset
rmse_calc(training_set$TARGET_WINS, predict(g1, newdata = training_set))
## [1] 12.40796
### RMSE of first model - test dataset
rmse_calc(test_set$TARGET_WINS, predict(g1, newdata = test_set))
## [1] 12.71621
model_1 <- as.data.frame(glance(g1))
model_1$Train_RMSE <- rmse_calc(training_set$TARGET_WINS, predict(g1, newdata = training_set))
model_1$Test_RMSE <- rmse_calc(test_set$TARGET_WINS, predict(g1, newdata = test_set))

Model 2:

# Removing : TEAM_BATTING_3B, TEAM_BASERUN_CS, TEAM_PITCHING_HR - based on p-values

g2 <- lm(TARGET_WINS ~TEAM_BATTING_1B + TEAM_BATTING_2B +
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
           TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
           TEAM_FIELDING_E + TEAM_FIELDING_DP, data = training_set)


summary(g2)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.143  -8.650   0.352   8.126  52.285 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      37.418235   5.477664   6.831 1.15e-11 ***
## TEAM_BATTING_1B   0.045251   0.003804  11.896  < 2e-16 ***
## TEAM_BATTING_2B   0.031660   0.007731   4.095 4.41e-05 ***
## TEAM_BATTING_HR   0.123712   0.009215  13.426  < 2e-16 ***
## TEAM_BATTING_BB   0.028916   0.006457   4.478 7.99e-06 ***
## TEAM_BATTING_SO  -0.025361   0.003639  -6.969 4.47e-12 ***
## TEAM_BASERUN_SB   0.062500   0.004500  13.890  < 2e-16 ***
## TEAM_PITCHING_H   0.001656   0.000415   3.990 6.86e-05 ***
## TEAM_PITCHING_BB -0.014903   0.004731  -3.150  0.00166 ** 
## TEAM_PITCHING_SO  0.005882   0.002256   2.607  0.00921 ** 
## TEAM_FIELDING_E  -0.041997   0.003021 -13.900  < 2e-16 ***
## TEAM_FIELDING_DP -0.133072   0.014011  -9.498  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.54 on 1814 degrees of freedom
## Multiple R-squared:  0.3917, Adjusted R-squared:  0.388 
## F-statistic: 106.2 on 11 and 1814 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(g2, plots='default', smoother = TRUE)

### RMSE of second model - training dataset
rmse_calc(training_set$TARGET_WINS, predict(g2, newdata = training_set))
## [1] 12.49575
### RMSE of second model - test dataset
rmse_calc(test_set$TARGET_WINS, predict(g2, newdata = test_set))
## [1] 12.67295
model_2 <- as.data.frame(glance(g2))
model_2$Train_RMSE <- rmse_calc(training_set$TARGET_WINS, predict(g2, newdata = training_set))
model_2$Test_RMSE <- rmse_calc(test_set$TARGET_WINS, predict(g2, newdata = test_set))

Model 3:

# Removing : TEAM_PITCHING_BB - based on p-value
g3 <- lm(TARGET_WINS ~TEAM_BATTING_1B + TEAM_BATTING_2B +
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
           TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_SO +
           TEAM_FIELDING_E + TEAM_FIELDING_DP, data = training_set)


summary(g3)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.378  -8.576   0.459   8.231  51.120 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      39.4261217  5.4538150   7.229 7.13e-13 ***
## TEAM_BATTING_1B   0.0445039  0.0038059  11.693  < 2e-16 ***
## TEAM_BATTING_2B   0.0318005  0.0077501   4.103 4.25e-05 ***
## TEAM_BATTING_HR   0.1237522  0.0092371  13.397  < 2e-16 ***
## TEAM_BATTING_BB   0.0115156  0.0033525   3.435 0.000606 ***
## TEAM_BATTING_SO  -0.0205651  0.0033138  -6.206 6.72e-10 ***
## TEAM_BASERUN_SB   0.0604691  0.0044640  13.546  < 2e-16 ***
## TEAM_PITCHING_H   0.0009817  0.0003564   2.754 0.005940 ** 
## TEAM_PITCHING_SO  0.0016319  0.0018130   0.900 0.368176    
## TEAM_FIELDING_E  -0.0414710  0.0030241 -13.714  < 2e-16 ***
## TEAM_FIELDING_DP -0.1293614  0.0139959  -9.243  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.57 on 1815 degrees of freedom
## Multiple R-squared:  0.3884, Adjusted R-squared:  0.385 
## F-statistic: 115.2 on 10 and 1815 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(g3, plots='default', smoother = TRUE)

### RMSE of third model - training dataset
rmse_calc(training_set$TARGET_WINS, predict(g3, newdata = training_set))
## [1] 12.52988
### RMSE of third model - test dataset
rmse_calc(test_set$TARGET_WINS, predict(g3, newdata = test_set))
## [1] 12.40627
model_3 <- as.data.frame(glance(g3))
model_3$Train_RMSE <- rmse_calc(training_set$TARGET_WINS, predict(g3, newdata = training_set))
model_3$Test_RMSE <- rmse_calc(test_set$TARGET_WINS, predict(g3, newdata = test_set))

Model 4: Stepwise Regression Model

We used the stepAIC() function from the MASS package, which selects the best model by AIC. We performed backward elimination by setting the direction parameter to ‘backward’.

step_model <- stepAIC(g1, direction = "backward", 
                      trace = FALSE)

summary(step_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.143  -8.443   0.218   8.144  50.190 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      34.5795058  5.4745350   6.316 3.36e-10 ***
## TEAM_BATTING_1B   0.0426304  0.0038183  11.165  < 2e-16 ***
## TEAM_BATTING_2B   0.0264099  0.0077580   3.404 0.000678 ***
## TEAM_BATTING_3B   0.0826533  0.0169375   4.880 1.15e-06 ***
## TEAM_BATTING_HR   0.1319750  0.0093123  14.172  < 2e-16 ***
## TEAM_BATTING_BB   0.0301181  0.0064215   4.690 2.93e-06 ***
## TEAM_BATTING_SO  -0.0222146  0.0036738  -6.047 1.79e-09 ***
## TEAM_BASERUN_SB   0.0569348  0.0046146  12.338  < 2e-16 ***
## TEAM_PITCHING_H   0.0019434  0.0004166   4.665 3.31e-06 ***
## TEAM_PITCHING_BB -0.0173696  0.0047283  -3.674 0.000246 ***
## TEAM_PITCHING_SO  0.0060394  0.0022425   2.693 0.007144 ** 
## TEAM_FIELDING_E  -0.0427808  0.0030067 -14.228  < 2e-16 ***
## TEAM_FIELDING_DP -0.1309038  0.0139311  -9.397  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.46 on 1813 degrees of freedom
## Multiple R-squared:  0.3996, Adjusted R-squared:  0.3956 
## F-statistic: 100.5 on 12 and 1813 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(step_model, plots='default', smoother = TRUE)

### RMSE of fourth model - training dataset
rmse_calc(training_set$TARGET_WINS, predict(step_model, newdata = training_set))
## [1] 12.41448
### RMSE of fourth model - test dataset
rmse_calc(test_set$TARGET_WINS, predict(step_model, newdata = test_set))
## [1] 12.72149
model_4 <- as.data.frame(glance(step_model))
model_4$Train_RMSE <- rmse_calc(training_set$TARGET_WINS, predict(step_model, newdata = training_set))
model_4$Test_RMSE <- rmse_calc(test_set$TARGET_WINS, predict(step_model, newdata = test_set))

Selecting Models

Step 1: Compare Key statistics

model_summary <- rbind(model_1,model_2,model_3,model_4)

Model_Name <- c('Model 1 (All Features)','Model 2 (Selective Features)','Model 3 (Selective Features)','Model 4 (Stepwise - Backward Elimination)')

model_summary <- t(model_summary) 
colnames(model_summary) <- Model_Name

model_summary %>% kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
Model 1 (All Features) Model 2 (Selective Features) Model 3 (Selective Features) Model 4 (Stepwise - Backward Elimination)
r.squared 4.002118e-01 3.916951e-01 3.883671e-01 3.995815e-01
adj.r.squared 3.955751e-01 3.880063e-01 3.849972e-01 3.956074e-01
sigma 1.245924e+01 1.253701e+01 1.256779e+01 1.245891e+01
statistic 8.631423e+01 1.061870e+02 1.152466e+02 1.005467e+02
p.value 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
df 1.500000e+01 1.200000e+01 1.100000e+01 1.300000e+01
logLik -7.189468e+03 -7.202341e+03 -7.207322e+03 -7.190427e+03
AIC 1.441094e+04 1.443068e+04 1.443864e+04 1.440885e+04
BIC 1.449909e+04 1.450231e+04 1.450476e+04 1.448599e+04
deviance 2.811265e+05 2.851183e+05 2.866782e+05 2.814219e+05
df.residual 1.811000e+03 1.814000e+03 1.815000e+03 1.813000e+03
Train_RMSE 1.240796e+01 1.249575e+01 1.252988e+01 1.241448e+01
Test_RMSE 1.271621e+01 1.267295e+01 1.240627e+01 1.272149e+01
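
As a quick programmatic aid to this comparison (illustrative helper code we added; the object names below are our own, not part of the original submission), the transposed model_summary matrix can be queried for the model with the lowest test RMSE and the lowest AIC:

## Pick out the best-performing model name by test RMSE and by AIC
best_by_test_rmse <- colnames(model_summary)[which.min(model_summary["Test_RMSE", ])]
best_by_aic       <- colnames(model_summary)[which.min(model_summary["AIC", ])]

best_by_test_rmse
best_by_aic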

Model Interpretation

Having constructed a model where all p-values are below 0.05, we looked at the residual plot, Q-Q plot and other relevant plots using the ggResidpanel package: