Libraries

library(kableExtra)
library(tidyverse)
library(ggplot2)
library(dplyr)
library(plotly)
library(MASS)
library(corrplot)
library(RColorBrewer)
library(GGally)
library(ggResidpanel)

library(psych)
library(mice)
library(reshape2)
library(cowplot)
library(car)
library(caTools)
library(VIM)
library(broom)

Introduction

The data set provided for this assignment is a collection of historical baseball team performances. Each record represents a professional baseball team in a single season between 1871 and 2006, inclusive, with all statistics adjusted to the equivalent of a 162-game season.

The original training data set comprises 17 variables and 2276 observations. Of those 17 variables, INDEX is simply an index value used for sorting, while TARGET_WINS is the response variable for our regression models. The remaining 15 variables are all potential predictors for our linear models. A summary table for the data set is provided below.

Objective

Our objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. We can only use the variables listed above (or variables that can be derived from the variables provided).

Data Load

We loaded the training and evaluation (test) data sets into their respective data frames.

train_df <- read.csv("https://raw.githubusercontent.com/soumya2g/CUNYDataMiningHomeWork/master/HomeWork_1/DataFiles/moneyball-training-data.csv")

eval_df <- read.csv("https://raw.githubusercontent.com/soumya2g/CUNYDataMiningHomeWork/master/HomeWork_1/DataFiles/moneyball-evaluation-data.csv")

Training Data

We excluded the ‘INDEX’ attribute from both the training and evaluation (test) data frames.

train_df <- train_df[-c(1)]
eval_df <- eval_df[-c(1)]

A sample snapshot of the training data frame -

head(train_df, 20) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
39 1445 194 39 13 143 842 NA NA NA 9364 84 927 5456 1011 NA
70 1339 219 22 190 685 1075 37 28 NA 1347 191 689 1082 193 155
86 1377 232 35 137 602 917 46 27 NA 1377 137 602 917 175 153
70 1387 209 38 96 451 922 43 30 NA 1396 97 454 928 164 156
82 1297 186 27 102 472 920 49 39 NA 1297 102 472 920 138 168
75 1279 200 36 92 443 973 107 59 NA 1279 92 443 973 123 149
80 1244 179 54 122 525 1062 80 54 NA 1244 122 525 1062 136 186
85 1273 171 37 115 456 1027 40 36 NA 1281 116 459 1033 112 136
86 1391 197 40 114 447 922 69 27 NA 1391 114 447 922 127 169
76 1271 213 18 96 441 827 72 34 NA 1271 96 441 827 131 159
78 1305 179 27 82 374 888 60 39 NA 1364 86 391 928 119 141
68 1372 203 31 95 509 801 119 79 NA 1372 95 509 801 147 150
72 1332 196 41 55 597 816 221 109 NA 1340 55 601 821 185 165
76 1265 210 23 63 534 812 126 80 NA 1265 63 534 812 150 139
74 1380 233 40 131 542 880 159 89 NA 1380 131 542 880 147 137
87 1417 226 28 108 539 682 86 69 NA 1417 108 539 682 136 136
88 1563 242 43 164 589 843 100 53 NA 1563 164 589 843 135 172
66 1460 239 32 107 546 900 92 64 NA 1478 108 553 911 136 146
75 1390 197 24 143 579 841 65 49 NA 2047 211 853 1239 149 177
93 1518 268 26 186 613 760 55 53 NA 1518 186 613 760 106 171
str(train_df)
## 'data.frame':    2276 obs. of  16 variables:
##  $ TARGET_WINS     : int  39 70 86 70 82 75 80 85 86 76 ...
##  $ TEAM_BATTING_H  : int  1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
##  $ TEAM_BATTING_2B : int  194 219 232 209 186 200 179 171 197 213 ...
##  $ TEAM_BATTING_3B : int  39 22 35 38 27 36 54 37 40 18 ...
##  $ TEAM_BATTING_HR : int  13 190 137 96 102 92 122 115 114 96 ...
##  $ TEAM_BATTING_BB : int  143 685 602 451 472 443 525 456 447 441 ...
##  $ TEAM_BATTING_SO : int  842 1075 917 922 920 973 1062 1027 922 827 ...
##  $ TEAM_BASERUN_SB : int  NA 37 46 43 49 107 80 40 69 72 ...
##  $ TEAM_BASERUN_CS : int  NA 28 27 30 39 59 54 36 27 34 ...
##  $ TEAM_BATTING_HBP: int  NA NA NA NA NA NA NA NA NA NA ...
##  $ TEAM_PITCHING_H : int  9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
##  $ TEAM_PITCHING_HR: int  84 191 137 97 102 92 122 116 114 96 ...
##  $ TEAM_PITCHING_BB: int  927 689 602 454 472 443 525 459 447 441 ...
##  $ TEAM_PITCHING_SO: int  5456 1082 917 928 920 973 1062 1033 922 827 ...
##  $ TEAM_FIELDING_E : int  1011 193 175 164 138 123 136 112 127 131 ...
##  $ TEAM_FIELDING_DP: int  NA 155 153 156 168 149 186 136 169 159 ...

Data Exploration

We start the data exploration process with a high-level descriptive statistical summary and an analysis of missing/exceptional values.

Descriptive Statistical Summary

Basic statistical summary of all features and the dependent variable (TARGET_WINS).

stat_summary <- function(df){
  df %>%
    summary() %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
}
stat_summary(train_df)
TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
Median : 82.00 Median :1454 Median :238.0 Median : 47.00 Median :102.00 Median :512.0 Median : 750.0 Median :101.0 Median : 49.0 Median :58.00 Median : 1518 Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779 Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :146.4
3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
NA NA NA NA NA NA NA's :102 NA's :131 NA's :772 NA's :2085 NA NA NA NA's :102 NA NA's :286

We also used the describe() function from the ‘psych’ package to summarize additional statistical measures such as standard deviation, skewness, kurtosis and standard error.

stat_desc <- function(df){
df %>% 
    describe() %>%
    kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
}

stat_desc(train_df)
vars n mean sd median trimmed mad min max range skew kurtosis se
TARGET_WINS 1 2276 80.79086 15.75215 82.0 81.31229 14.8260 0 146 146 -0.3987232 1.0274757 0.3301823
TEAM_BATTING_H 2 2276 1469.26977 144.59120 1454.0 1459.04116 114.1602 891 2554 1663 1.5713335 7.2785261 3.0307891
TEAM_BATTING_2B 3 2276 241.24692 46.80141 238.0 240.39627 47.4432 69 458 389 0.2151018 0.0061609 0.9810087
TEAM_BATTING_3B 4 2276 55.25000 27.93856 47.0 52.17563 23.7216 0 223 223 1.1094652 1.5032418 0.5856226
TEAM_BATTING_HR 5 2276 99.61204 60.54687 102.0 97.38529 78.5778 0 264 264 0.1860421 -0.9631189 1.2691285
TEAM_BATTING_BB 6 2276 501.55888 122.67086 512.0 512.18331 94.8864 0 878 878 -1.0257599 2.1828544 2.5713150
TEAM_BATTING_SO 7 2174 735.60534 248.52642 750.0 742.31322 284.6592 0 1399 1399 -0.2978001 -0.3207992 5.3301912
TEAM_BASERUN_SB 8 2145 124.76177 87.79117 101.0 110.81188 60.7866 0 697 697 1.9724140 5.4896754 1.8955584
TEAM_BASERUN_CS 9 1504 52.80386 22.95634 49.0 50.35963 17.7912 0 201 201 1.9762180 7.6203818 0.5919414
TEAM_BATTING_HBP 10 191 59.35602 12.96712 58.0 58.86275 11.8608 29 95 66 0.3185754 -0.1119828 0.9382681
TEAM_PITCHING_H 11 2276 1779.21046 1406.84293 1518.0 1555.89517 174.9468 1137 30132 28995 10.3295111 141.8396985 29.4889618
TEAM_PITCHING_HR 12 2276 105.69859 61.29875 107.0 103.15697 74.1300 0 343 343 0.2877877 -0.6046311 1.2848886
TEAM_PITCHING_BB 13 2276 553.00791 166.35736 536.5 542.62459 98.5929 0 3645 3645 6.7438995 96.9676398 3.4870317
TEAM_PITCHING_SO 14 2174 817.73045 553.08503 813.5 796.93391 257.2311 0 19278 19278 22.1745535 671.1891292 11.8621151
TEAM_FIELDING_E 15 2276 246.48067 227.77097 159.0 193.43798 62.2692 65 1898 1833 2.9904656 10.9702717 4.7743279
TEAM_FIELDING_DP 16 1990 146.38794 26.22639 149.0 147.57789 23.7216 52 228 176 -0.3889390 0.1817397 0.5879114

Missing Value Analysis

Below are the features that have NA values. It is clear from the table below that TEAM_BATTING_HBP (>90%) and TEAM_BASERUN_CS (>33%) have a significant proportion of NA values -

## Counts of missing data per feature
train_na_df <- data.frame(apply(train_df, 2, function(x) length(which(is.na(x)))))
train_na_df1 <- data.frame(apply(train_df, 2,function(x) {sum(is.na(x)) / length(x) * 100}))

train_na_df <- cbind(Feature = rownames(train_na_df), train_na_df, train_na_df1)
colnames(train_na_df) <- c('Feature Name','No. of NA Records','Percentage of NA Records')
rownames(train_na_df) <- NULL


train_na_df %>% filter(`No. of NA Records` != 0) %>% arrange(desc(`No. of NA Records`)) %>% kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="300px")
Feature Name No. of NA Records Percentage of NA Records
TEAM_BATTING_HBP 2085 91.608084
TEAM_BASERUN_CS 772 33.919156
TEAM_FIELDING_DP 286 12.565905
TEAM_BASERUN_SB 131 5.755712
TEAM_BATTING_SO 102 4.481547
TEAM_PITCHING_SO 102 4.481547

Descriptive Statistical Plots

Box Plots

## Box plots:
gb1 <- ggplot(data = train_df, aes(y = TARGET_WINS)) + geom_boxplot()
gb2 <- ggplot(data = train_df, aes(y = TEAM_BATTING_H)) + geom_boxplot()
gb3 <- ggplot(data = train_df, aes(y = TEAM_BATTING_2B)) + geom_boxplot()
gb4 <- ggplot(data = train_df, aes(y = TEAM_BATTING_3B)) + geom_boxplot()
gb5 <- ggplot(data = train_df, aes(y = TEAM_BATTING_HR)) + geom_boxplot()
gb6 <- ggplot(data = train_df, aes(y = TEAM_BATTING_BB)) + geom_boxplot()
gb7 <- ggplot(data = train_df, aes(y = TEAM_BATTING_HBP)) + geom_boxplot()
gb8 <- ggplot(data = train_df, aes(y = TEAM_BATTING_SO)) + geom_boxplot()
gb9 <- ggplot(data = train_df, aes(y = TEAM_BASERUN_SB)) + geom_boxplot()
gb10 <- ggplot(data = train_df, aes(y = TEAM_BASERUN_CS)) + geom_boxplot()
gb11 <- ggplot(data = train_df, aes(y = TEAM_FIELDING_E)) + geom_boxplot()
gb12 <- ggplot(data = train_df, aes(y = TEAM_FIELDING_DP)) + geom_boxplot()
gb13 <- ggplot(data = train_df, aes(y = TEAM_PITCHING_BB)) + geom_boxplot()
gb14 <- ggplot(data = train_df, aes(y = TEAM_PITCHING_H)) + geom_boxplot()
gb15 <- ggplot(data = train_df, aes(y = TEAM_PITCHING_HR)) + geom_boxplot()
gb16 <- ggplot(data = train_df, aes(y = TEAM_PITCHING_SO)) + geom_boxplot()

plot_grid(gb1, gb2, gb3, gb4, gb5, gb6, gb7, gb8, gb9, gb10,
          gb11, gb12, gb13, gb14, gb15, gb16, labels = "AUTO", label_size = 10)

Density Plots

train_df %>%
  gather(variable, value, TARGET_WINS:TEAM_FIELDING_DP) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "dodgerblue4", color="dodgerblue4") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())
## Warning: Removed 3478 rows containing non-finite values (stat_density).

Observations Summary

  • The data set contains many outliers, with some observations falling beyond the 1.5 * IQR whiskers of the box plots. We also see that the variance of several explanatory variables greatly exceeds the variance of the response variable TARGET_WINS; the sketch after this list quantifies both points.
  • The dependent variable TARGET_WINS is approximately normally distributed, which indicates that the data set includes a balanced mix of good, bad and average team performances.
  • TEAM_BATTING_HR, TEAM_BATTING_SO and TEAM_PITCHING_HR show bimodal distributions.
  • Features such as TEAM_BASERUN_CS, TEAM_BASERUN_SB and TEAM_FIELDING_E are right skewed.
  • Based on the missing value analysis, we need to remove TEAM_BATTING_HBP because of its large proportion of missing values. We will need an imputation mechanism for the other features with missing values.
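
To quantify the first observation, the short sketch below counts the values lying beyond the 1.5 * IQR whiskers for each feature and compares each feature's variance with that of TARGET_WINS. This is illustrative helper code we added; outlier_count and outlier_summary are our own names and are not part of the original assignment code.

## Count observations beyond the 1.5 * IQR box plot whiskers for one feature
outlier_count <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

## Outlier counts and variance ratios (feature variance / variance of TARGET_WINS)
outlier_summary <- data.frame(
  Feature        = names(train_df),
  Outlier_Count  = sapply(train_df, outlier_count),
  Variance_Ratio = sapply(train_df, var, na.rm = TRUE) / var(train_df$TARGET_WINS)
)
rownames(outlier_summary) <- NULL

outlier_summary %>% arrange(desc(Outlier_Count)) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))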

Data Preparation

Remove the feature: TEAM_BATTING_HBP

The TEAM_BATTING_HBP variable contains 2085 missing values out of 2276 observations. Since it would be very difficult to accurately impute such a large proportion of a variable's values, we chose to exclude it from further analysis.

train_df <- train_df %>% dplyr::select(-c(TEAM_BATTING_HBP))
eval_df <-  eval_df %>% dplyr::select(-c(TEAM_BATTING_HBP))

Plotting the missing data pattern after removing the feature with the most missing values -

aggr_plot <- aggr(train_df, col=c('dodgerblue4','red'),
                 numbers=TRUE, sortVars=TRUE,
                 labels=names(train_df), cex.axis=.7, gap=7, axes = TRUE, prop = TRUE,
                 ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##          Variable      Count
##   TEAM_BASERUN_CS 0.33919156
##  TEAM_FIELDING_DP 0.12565905
##   TEAM_BASERUN_SB 0.05755712
##   TEAM_BATTING_SO 0.04481547
##  TEAM_PITCHING_SO 0.04481547
##       TARGET_WINS 0.00000000
##    TEAM_BATTING_H 0.00000000
##   TEAM_BATTING_2B 0.00000000
##   TEAM_BATTING_3B 0.00000000
##   TEAM_BATTING_HR 0.00000000
##   TEAM_BATTING_BB 0.00000000
##   TEAM_PITCHING_H 0.00000000
##  TEAM_PITCHING_HR 0.00000000
##  TEAM_PITCHING_BB 0.00000000
##   TEAM_FIELDING_E 0.00000000

Imputation

We imputed the missing data for the remaining five features, which we are keeping for our analysis, using the ‘Predictive Mean Matching’ (pmm) method from the mice package.

impute_data <- function(df){
  
  df <- mice(data = df, m = 1, method = "pmm", maxit = 5, seed = 500)
  df <- mice::complete(df, 1)
}

#train_df <- mice(data = train_df, m = 1,method = "pmm", maxit = 5, seed = 500)
#train_df <- mice::complete(train_df, 1)

train_df <- impute_data(train_df)
## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
eval_df <- impute_data(eval_df)
## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP

Feature Engineering

We created a new variable called TEAM_BATTING_1B, which represents offensive single-base hits (computed by subtracting doubles, triples and home runs from TEAM_BATTING_H). We believe that separating singles from the other hit types will minimize collinearity. The TEAM_BATTING_H variable is then removed from the data set, since it is simply a linear combination of its component variables.

add_batting1b_feature <- function(df){
  df <-df %>%
    mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR)
}

train_df <- add_batting1b_feature(train_df)
eval_df <- add_batting1b_feature(eval_df)

train_df <- train_df %>% dplyr::select(-c(TEAM_BATTING_H))
eval_df <-  eval_df %>% dplyr::select(-c(TEAM_BATTING_H))

Correlation Plot

corrMatrix <- round(cor(train_df),4)

corrMatrix %>% corrplot(., method = "color", outline = T, addgrid.col = "darkgray", order="hclust", addrect = 4, rect.col = "black", rect.lwd = 5,cl.pos = "b", tl.col = "indianred4", tl.cex = 1.0, cl.cex = 1.0, addCoef.col = "white", number.digits = 2, number.cex = 0.8, col = colorRampPalette(c("darkred","white","dodgerblue4"))(100))

Based on the correlation plot above, there is a high degree of collinearity among independent variables such as TEAM_BATTING_HR, TEAM_PITCHING_HR, TEAM_BASERUN_SB and TEAM_BASERUN_CS.
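
As a quick check of this statement, the sketch below (illustrative helper code that we added; high_corr_pairs is our own object name, not part of the original submission) melts corrMatrix with reshape2 and lists the most strongly correlated predictor pairs:

## List the strongest pairwise correlations among the predictors
high_corr_pairs <- corrMatrix %>%
  reshape2::melt(varnames = c("Feature_1", "Feature_2"), value.name = "Correlation") %>%
  filter(as.character(Feature_1) < as.character(Feature_2),   # keep each pair once, drop the diagonal
         as.character(Feature_1) != "TARGET_WINS",
         as.character(Feature_2) != "TARGET_WINS") %>%
  arrange(desc(abs(Correlation)))

head(high_corr_pairs, 5) %>% kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))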

Handling Collinearity

TEAM_BASERUN_CS: This variable is strongly correlated (82%) with TEAM_BASERUN_SB and is the second-largest source of NAs in our data set. Hence we decided to exclude it from our final models.

TEAM_PITCHING_HR: This variable is highly correlated (97%) with TEAM_BATTING_HR. Because the two are nearly perfectly correlated, one of them can legitimately be dropped, and we chose to drop TEAM_PITCHING_HR. Rather than removing these columns from the data frames here, we keep them and drop them during model refinement (Models 2 through 4 below).

##train_model_df <- train_df %>% dplyr::select(-c(TEAM_BASERUN_CS,TEAM_PITCHING_HR))
##eval_df <- eval_df %>% dplyr::select(-c(TEAM_BASERUN_CS,TEAM_PITCHING_HR))

Data Preparation Results

We created density plots after feature removal and imputation.

train_df %>%
  gather(variable, value, TARGET_WINS:TEAM_BATTING_1B) %>%
  ggplot(., aes(value)) + 
  geom_density(fill = "dodgerblue4", color="dodgerblue4") + 
  facet_wrap(~variable, scales ="free", ncol = 4) +
  labs(x = element_blank(), y = element_blank())

Train/Test Split Approach

We divided the training data set into training and test subsets using an 80/20 split. We will build our models on the training subset and evaluate them on the test subset.

set.seed(123)
split <- sample.split(train_df$TARGET_WINS, SplitRatio = 0.8)
training_set <- subset(train_df, split == TRUE)
test_set <- subset(train_df, split == FALSE)

Building Models

We will build four different models to see which one yields the best performance.

Model 1:

g1 <- lm(TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B +
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
           TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H +
           TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
           TEAM_FIELDING_E + TEAM_FIELDING_DP, data = training_set)

vif(g1)
##  TEAM_BATTING_1B  TEAM_BATTING_2B  TEAM_BATTING_3B  TEAM_BATTING_HR 
##         2.935719         1.574744         2.830456        45.896140 
##  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS 
##         8.756261         9.855289         4.159089         4.284251 
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO 
##         3.999482        36.763145         6.430649         5.210524 
##  TEAM_FIELDING_E TEAM_FIELDING_DP 
##         5.656579         1.953459
summary(g1) 
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + 
##     TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.086  -8.424   0.215   8.121  49.167 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      33.6635331  5.5863275   6.026 2.03e-09 ***
## TEAM_BATTING_1B   0.0424969  0.0038303  11.095  < 2e-16 ***
## TEAM_BATTING_2B   0.0274386  0.0078276   3.505 0.000467 ***
## TEAM_BATTING_3B   0.0766673  0.0174872   4.384 1.23e-05 ***
## TEAM_BATTING_HR   0.1117477  0.0323936   3.450 0.000574 ***
## TEAM_BATTING_BB   0.0316514  0.0069346   4.564 5.35e-06 ***
## TEAM_BATTING_SO  -0.0215649  0.0037102  -5.812 7.26e-09 ***
## TEAM_BASERUN_SB   0.0529440  0.0057885   9.146  < 2e-16 ***
## TEAM_BASERUN_CS   0.0136834  0.0117694   1.163 0.245138    
## TEAM_PITCHING_H   0.0020019  0.0004194   4.773 1.96e-06 ***
## TEAM_PITCHING_HR  0.0199291  0.0286937   0.695 0.487429    
## TEAM_PITCHING_BB -0.0185423  0.0052427  -3.537 0.000415 ***
## TEAM_PITCHING_SO  0.0056043  0.0022739   2.465 0.013807 *  
## TEAM_FIELDING_E  -0.0428401  0.0030086 -14.239  < 2e-16 ***
## TEAM_FIELDING_DP -0.1293847  0.0140228  -9.227  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.46 on 1811 degrees of freedom
## Multiple R-squared:  0.4002, Adjusted R-squared:  0.3956 
## F-statistic: 86.31 on 14 and 1811 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(g1, plots='default', smoother = TRUE)

rmse_calc <- function(actual, predicted) {
  rmse_val <- sqrt(sum((actual - predicted)^2) / length(actual))
  
  return(rmse_val)
}

### RMSE of first model - training dataset
rmse_calc(training_set$TARGET_WINS, predict(g1, newdata = training_set))
## [1] 12.40796
### RMSE of first model - test dataset
rmse_calc(test_set$TARGET_WINS, predict(g1, newdata = test_set))
## [1] 12.71621
model_1 <- as.data.frame(glance(g1))
model_1$Train_RMSE <- rmse_calc(training_set$TARGET_WINS, predict(g1, newdata = training_set))
model_1$Test_RMSE <- rmse_calc(test_set$TARGET_WINS, predict(g1, newdata = test_set))

Model 2:

# Removing : TEAM_BATTING_3B, TEAM_BASERUN_CS, TEAM_PITCHING_HR - based on p-values

g2 <- lm(TARGET_WINS ~TEAM_BATTING_1B + TEAM_BATTING_2B +
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
           TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
           TEAM_FIELDING_E + TEAM_FIELDING_DP, data = training_set)


summary(g2)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
##     TEAM_FIELDING_DP, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.143  -8.650   0.352   8.126  52.285 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      37.418235   5.477664   6.831 1.15e-11 ***
## TEAM_BATTING_1B   0.045251   0.003804  11.896  < 2e-16 ***
## TEAM_BATTING_2B   0.031660   0.007731   4.095 4.41e-05 ***
## TEAM_BATTING_HR   0.123712   0.009215  13.426  < 2e-16 ***
## TEAM_BATTING_BB   0.028916   0.006457   4.478 7.99e-06 ***
## TEAM_BATTING_SO  -0.025361   0.003639  -6.969 4.47e-12 ***
## TEAM_BASERUN_SB   0.062500   0.004500  13.890  < 2e-16 ***
## TEAM_PITCHING_H   0.001656   0.000415   3.990 6.86e-05 ***
## TEAM_PITCHING_BB -0.014903   0.004731  -3.150  0.00166 ** 
## TEAM_PITCHING_SO  0.005882   0.002256   2.607  0.00921 ** 
## TEAM_FIELDING_E  -0.041997   0.003021 -13.900  < 2e-16 ***
## TEAM_FIELDING_DP -0.133072   0.014011  -9.498  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.54 on 1814 degrees of freedom
## Multiple R-squared:  0.3917, Adjusted R-squared:  0.388 
## F-statistic: 106.2 on 11 and 1814 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(g2, plots='default', smoother = TRUE)

### RMSE of second model - training dataset
rmse_calc(training_set$TARGET_WINS, predict(g2, newdata = training_set))
## [1] 12.49575
### RMSE of second model - test dataset
rmse_calc(test_set$TARGET_WINS, predict(g2, newdata = test_set))
## [1] 12.67295
model_2 <- as.data.frame(glance(g2))
model_2$Train_RMSE <- rmse_calc(training_set$TARGET_WINS, predict(g2, newdata = training_set))
model_2$Test_RMSE <- rmse_calc(test_set$TARGET_WINS, predict(g2, newdata = test_set))

Model 3:

# Removing : TEAM_PITCHING_BB - based on p-value
g3 <- lm(TARGET_WINS ~TEAM_BATTING_1B + TEAM_BATTING_2B +
           TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
           TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_SO +
           TEAM_FIELDING_E + TEAM_FIELDING_DP, data = training_set)


summary(g3)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.378  -8.576   0.459   8.231  51.120 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      39.4261217  5.4538150   7.229 7.13e-13 ***
## TEAM_BATTING_1B   0.0445039  0.0038059  11.693  < 2e-16 ***
## TEAM_BATTING_2B   0.0318005  0.0077501   4.103 4.25e-05 ***
## TEAM_BATTING_HR   0.1237522  0.0092371  13.397  < 2e-16 ***
## TEAM_BATTING_BB   0.0115156  0.0033525   3.435 0.000606 ***
## TEAM_BATTING_SO  -0.0205651  0.0033138  -6.206 6.72e-10 ***
## TEAM_BASERUN_SB   0.0604691  0.0044640  13.546  < 2e-16 ***
## TEAM_PITCHING_H   0.0009817  0.0003564   2.754 0.005940 ** 
## TEAM_PITCHING_SO  0.0016319  0.0018130   0.900 0.368176    
## TEAM_FIELDING_E  -0.0414710  0.0030241 -13.714  < 2e-16 ***
## TEAM_FIELDING_DP -0.1293614  0.0139959  -9.243  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.57 on 1815 degrees of freedom
## Multiple R-squared:  0.3884, Adjusted R-squared:  0.385 
## F-statistic: 115.2 on 10 and 1815 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(g3, plots='default', smoother = TRUE)

### RMSE of third model - training dataset
rmse_calc(training_set$TARGET_WINS, predict(g3, newdata = training_set))
## [1] 12.52988
### RMSE of third model - test dataset
rmse_calc(test_set$TARGET_WINS, predict(g3, newdata = test_set))
## [1] 12.40627
model_3 <- as.data.frame(glance(g3))
model_3$Train_RMSE <- rmse_calc(training_set$TARGET_WINS, predict(g3, newdata = training_set))
model_3$Test_RMSE <- rmse_calc(test_set$TARGET_WINS, predict(g3, newdata = test_set))

Model 4: Stepwise Regression Model

We used the stepAIC() function from the MASS package, which selects the best model by AIC. We performed backward elimination by setting the direction parameter to ‘backward’.

step_model <- stepAIC(g1, direction = "backward", 
                      trace = FALSE)

summary(step_model)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -46.143  -8.443   0.218   8.144  50.190 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      34.5795058  5.4745350   6.316 3.36e-10 ***
## TEAM_BATTING_1B   0.0426304  0.0038183  11.165  < 2e-16 ***
## TEAM_BATTING_2B   0.0264099  0.0077580   3.404 0.000678 ***
## TEAM_BATTING_3B   0.0826533  0.0169375   4.880 1.15e-06 ***
## TEAM_BATTING_HR   0.1319750  0.0093123  14.172  < 2e-16 ***
## TEAM_BATTING_BB   0.0301181  0.0064215   4.690 2.93e-06 ***
## TEAM_BATTING_SO  -0.0222146  0.0036738  -6.047 1.79e-09 ***
## TEAM_BASERUN_SB   0.0569348  0.0046146  12.338  < 2e-16 ***
## TEAM_PITCHING_H   0.0019434  0.0004166   4.665 3.31e-06 ***
## TEAM_PITCHING_BB -0.0173696  0.0047283  -3.674 0.000246 ***
## TEAM_PITCHING_SO  0.0060394  0.0022425   2.693 0.007144 ** 
## TEAM_FIELDING_E  -0.0427808  0.0030067 -14.228  < 2e-16 ***
## TEAM_FIELDING_DP -0.1309038  0.0139311  -9.397  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.46 on 1813 degrees of freedom
## Multiple R-squared:  0.3996, Adjusted R-squared:  0.3956 
## F-statistic: 100.5 on 12 and 1813 DF,  p-value: < 2.2e-16

Model Diagnostic Plots

resid_panel(step_model, plots='default', smoother = TRUE)

### RMSE of fourth model - training dataset
rmse_calc(training_set$TARGET_WINS, predict(step_model, newdata = training_set))
## [1] 12.41448
### RMSE of fourth model - test dataset
rmse_calc(test_set$TARGET_WINS, predict(step_model, newdata = test_set))
## [1] 12.72149
model_4 <- as.data.frame(glance(step_model))
model_4$Train_RMSE <- rmse_calc(training_set$TARGET_WINS, predict(step_model, newdata = training_set))
model_4$Test_RMSE <- rmse_calc(test_set$TARGET_WINS, predict(step_model, newdata = test_set))

Selecting Models

Step 1: Compare Key statistics

model_summary <- rbind(model_1,model_2,model_3,model_4)

Model_Name <- c('Model 1 (All Features)','Model 2 (Selective Features)','Model 3 (Selective Features)','Model 4 (Stepwise - Backward Elimination)')

model_summary <- t(model_summary) 
colnames(model_summary) <- Model_Name

model_summary %>% kable() %>%
    kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%  scroll_box(width="100%",height="300px")
Model 1 (All Features) Model 2 (Selective Features) Model 3 (Selective Features) Model 4 (Stepwise - Backward Elimination)
r.squared 4.002118e-01 3.916951e-01 3.883671e-01 3.995815e-01
adj.r.squared 3.955751e-01 3.880063e-01 3.849972e-01 3.956074e-01
sigma 1.245924e+01 1.253701e+01 1.256779e+01 1.245891e+01
statistic 8.631423e+01 1.061870e+02 1.152466e+02 1.005467e+02
p.value 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
df 1.500000e+01 1.200000e+01 1.100000e+01 1.300000e+01
logLik -7.189468e+03 -7.202341e+03 -7.207322e+03 -7.190427e+03
AIC 1.441094e+04 1.443068e+04 1.443864e+04 1.440885e+04
BIC 1.449909e+04 1.450231e+04 1.450476e+04 1.448599e+04
deviance 2.811265e+05 2.851183e+05 2.866782e+05 2.814219e+05
df.residual 1.811000e+03 1.814000e+03 1.815000e+03 1.813000e+03
Train_RMSE 1.240796e+01 1.249575e+01 1.252988e+01 1.241448e+01
Test_RMSE 1.271621e+01 1.267295e+01 1.240627e+01 1.272149e+01
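
As a quick programmatic aid to this comparison (illustrative helper code we added; the object names below are our own, not part of the original submission), the transposed model_summary matrix can be queried for the model with the lowest test RMSE and the lowest AIC:

## Pick out the best-performing model name by test RMSE and by AIC
best_by_test_rmse <- colnames(model_summary)[which.min(model_summary["Test_RMSE", ])]
best_by_aic       <- colnames(model_summary)[which.min(model_summary["AIC", ])]

best_by_test_rmse
best_by_aic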

Model Interpretation

Having constructed a model where all p-values are below 0.05, we looked at the residual plot, Q-Q plot and other relevant plots using the ggResidpanel package: