Introduction

The purpose of this homework assignment is to build a multiple linear regression model on the training data to predict the number of wins for each team in the data set.

Data Exploration

Describe the size and the variables in the moneyball training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.

Dataset

The data set contains approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record holds the team's performance for the given year, with all of the statistics adjusted to match the performance of a 162-game season.

VARIABLE NAME      DEFINITION                               THEORETICAL EFFECT
INDEX              Identification Variable (do not use)     None
TARGET_WINS        Number of wins
TEAM_BATTING_H     Base Hits by batters (1B,2B,3B,HR)       Positive Impact on Wins
TEAM_BATTING_2B    Doubles by batters (2B)                  Positive Impact on Wins
TEAM_BATTING_3B    Triples by batters (3B)                  Positive Impact on Wins
TEAM_BATTING_HR    Homeruns by batters (4B)                 Positive Impact on Wins
TEAM_BATTING_BB    Walks by batters                         Positive Impact on Wins
TEAM_BATTING_HBP   Batters hit by pitch (get a free base)   Positive Impact on Wins
TEAM_BATTING_SO    Strikeouts by batters                    Negative Impact on Wins
TEAM_BASERUN_SB    Stolen bases                             Positive Impact on Wins
TEAM_BASERUN_CS    Caught stealing                          Negative Impact on Wins
TEAM_FIELDING_E    Errors                                   Negative Impact on Wins
TEAM_FIELDING_DP   Double Plays                             Positive Impact on Wins
TEAM_PITCHING_BB   Walks allowed                            Negative Impact on Wins
TEAM_PITCHING_H    Hits allowed                             Negative Impact on Wins
TEAM_PITCHING_HR   Homeruns allowed                         Negative Impact on Wins
TEAM_PITCHING_SO   Strikeouts by pitchers                   Positive Impact on Wins

# https://cran.r-project.org/web/packages/pastecs/pastecs.pdf
library(pastecs)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.4     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.1     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x tidyr::extract() masks pastecs::extract()
## x dplyr::filter()  masks stats::filter()
## x dplyr::first()   masks pastecs::first()
## x dplyr::lag()     masks stats::lag()
## x dplyr::last()    masks pastecs::last()
# https://www.rdocumentation.org/packages/naniar/versions/0.6.1
library(naniar)
library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
library(ggplot2)

# Used for skewness
library(moments)

# Log Scale
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
# Correlation corrplot
# https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
library(corrplot)
## corrplot 0.90 loaded
# Correlation
library(correlation)

# MICE: for missing values
library(mice)
## 
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
# Caret: Center and Scaling
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
# To help prevent scientific notation for viewed values
options(scipen=100)

# To set the number of decimal places
options(digits=2)

# From http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots == 1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

The following is an exploration of the data set.

# Reading the data 
trainData <- read.csv('https://raw.githubusercontent.com/logicalschema/Fall-2021/main/DATA621/hw1/moneyball-training-data.csv')
evalData <- read.csv('https://raw.githubusercontent.com/logicalschema/Fall-2021/main/DATA621/hw1/moneyball-evaluation-data.csv')

# Remove the Index column
trainData <- subset(trainData, select = -INDEX)
evalData <- subset(evalData, select = -INDEX)

head(trainData)
# Summary of the training data
summary(trainData)
##   TARGET_WINS  TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
##  Min.   :  0   Min.   : 891   Min.   : 69     Min.   :  0     Min.   :  0    
##  1st Qu.: 71   1st Qu.:1383   1st Qu.:208     1st Qu.: 34     1st Qu.: 42    
##  Median : 82   Median :1454   Median :238     Median : 47     Median :102    
##  Mean   : 81   Mean   :1469   Mean   :241     Mean   : 55     Mean   :100    
##  3rd Qu.: 92   3rd Qu.:1537   3rd Qu.:273     3rd Qu.: 72     3rd Qu.:147    
##  Max.   :146   Max.   :2554   Max.   :458     Max.   :223     Max.   :264    
##                                                                              
##  TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
##  Min.   :  0     Min.   :   0    Min.   :  0     Min.   :  0    
##  1st Qu.:451     1st Qu.: 548    1st Qu.: 66     1st Qu.: 38    
##  Median :512     Median : 750    Median :101     Median : 49    
##  Mean   :502     Mean   : 736    Mean   :125     Mean   : 53    
##  3rd Qu.:580     3rd Qu.: 930    3rd Qu.:156     3rd Qu.: 62    
##  Max.   :878     Max.   :1399    Max.   :697     Max.   :201    
##                  NA's   :102     NA's   :131     NA's   :772    
##  TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
##  Min.   :29       Min.   : 1137   Min.   :  0      Min.   :   0    
##  1st Qu.:50       1st Qu.: 1419   1st Qu.: 50      1st Qu.: 476    
##  Median :58       Median : 1518   Median :107      Median : 536    
##  Mean   :59       Mean   : 1779   Mean   :106      Mean   : 553    
##  3rd Qu.:67       3rd Qu.: 1682   3rd Qu.:150      3rd Qu.: 611    
##  Max.   :95       Max.   :30132   Max.   :343      Max.   :3645    
##  NA's   :2085                                                      
##  TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
##  Min.   :    0    Min.   :  65    Min.   : 52     
##  1st Qu.:  615    1st Qu.: 127    1st Qu.:131     
##  Median :  814    Median : 159    Median :149     
##  Mean   :  818    Mean   : 246    Mean   :146     
##  3rd Qu.:  968    3rd Qu.: 249    3rd Qu.:164     
##  Max.   :19278    Max.   :1898    Max.   :228     
##  NA's   :102                      NA's   :286
# https://cran.r-project.org/web/packages/pastecs/pastecs.pdf
stat.desc(trainData, basic = FALSE)

Several points stand out in the summary of the data set:

  • TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_PITCHING_SO, and TEAM_FIELDING_DP have NA values.
  • Because the INDEX variable is an identifier with no theoretical effect, the INDEX column was removed when the data was read in.
  • All of the variables are numeric; none are categorical.
  • TARGET_WINS is the dependent variable, and the 15 remaining variables are candidate predictors.

Distributions

The following is a look at each of the variable distributions for the training data set.

par(mfrow = c(3, 3))

plotData <- melt(trainData)
## No id variables; using all as measure variables
ggplot(plotData, aes(x= value)) + 
  theme(panel.border = element_blank(), panel.background = element_blank(), 
        panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  geom_density(fill='dodgerblue') + facet_wrap(~variable, scales = 'free') 

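  • TARGET_WINS, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_BB, and TEAM_BASERUN_CS appear roughly bell-shaped, although the skewness values below show that TEAM_BATTING_H (1.57) and TEAM_BATTING_BB (-1.03) are noticeably skewed.
  • TEAM_BATTING_HR, TEAM_BATTING_SO, and TEAM_PITCHING_HR are bimodal.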

Skewness

skewness(trainData)
##      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
##            -0.40             1.57             0.22             1.11 
##  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB 
##             0.19            -1.03               NA               NA 
##  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H TEAM_PITCHING_HR 
##               NA               NA            10.34             0.29 
## TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##             6.75               NA             2.99               NA
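
The NA entries above appear because skewness() returns NA for any column that contains a missing value unless told otherwise. The following is a minimal sketch, on made-up data rather than the moneyball set, of computing the same moment-based skewness while ignoring NA values. The helper skew_na_rm and the toy data frame are hypothetical names introduced for illustration; I believe calling moments::skewness(x, na.rm = TRUE) gives the same figure.

```r
# Hypothetical helper: moment-based skewness (m3 / m2^1.5), ignoring NAs.
skew_na_rm <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  m2 <- mean((x - m)^2)  # second central moment
  m3 <- mean((x - m)^3)  # third central moment
  m3 / m2^1.5
}

# Made-up data standing in for trainData
toy <- data.frame(
  symmetric = c(1, 2, 3, 4, 5, NA),
  right_skewed = c(1, 1, 1, 2, 10, NA)
)

skews <- sapply(toy, skew_na_rm)
print(skews)
```

A positive value indicates a longer right tail and a negative value a longer left tail.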

Boxplots

The following are the boxplots of each of the variables for the training data set. Along with the distribution graphs above, these are helpful to identify outliers.

trainData %>%
        tidyr::gather(key, value) %>%
        ggplot(aes(x = key, y = value, fill = key)) +
        # Draw each boxplot once, with outliers highlighted in red
        geom_boxplot(outlier.colour = "red") +
        theme(legend.position = "none",
              panel.background = element_blank(),
              axis.title.y = element_blank()) + 
        # Log2 scale keeps the wide-ranging variables readable on one axis
        scale_y_continuous(trans = log2_trans()) +
        coord_flip()

TEAM_PITCHING_SO, TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_FIELDING_E, TEAM_BATTING_SO, TEAM_BATTING_BB, and TEAM_BASERUN_CS each show a large number of outliers.
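
To put numbers on "a large number of outliers", the points beyond the boxplot whiskers can be counted per variable. The following is a sketch on made-up data using the usual 1.5 * IQR rule from base R's boxplot.stats(); count_outliers and toy are hypothetical names. As far as I understand, ggplot2 applies the log2 scale before computing the boxplot statistics, so counts taken on the raw values may not match the red points in the figure exactly.

```r
# Count points beyond the 1.5 * IQR whiskers for every column, ignoring NAs.
count_outliers <- function(df) {
  sapply(df, function(x) length(boxplot.stats(x[!is.na(x)])$out))
}

# Made-up data: one well-behaved column and one with a single extreme value
toy <- data.frame(
  steady = c(1:10, 11),
  spiky = c(1:10, 100)
)

outlier_counts <- count_outliers(toy)
print(sort(outlier_counts, decreasing = TRUE))
```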

In the Data Preparation section, we will continue to winnow the variables to produce the multiple linear regression model.

Correlation

trainData %>%
        complete.cases() %>% 
        trainData[., ] %>%                         
        cor() %>%
        corrplot(method = "shade")

The correlation matrix shows a strong relationship between each pitching statistic and its batting counterpart: TEAM_PITCHING_H and TEAM_BATTING_H, TEAM_PITCHING_HR and TEAM_BATTING_HR, TEAM_PITCHING_BB and TEAM_BATTING_BB, and TEAM_PITCHING_SO and TEAM_BATTING_SO.

cor(trainData)
##                  TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## TARGET_WINS             1.00         0.3888           0.289          0.1426
## TEAM_BATTING_H          0.39         1.0000           0.563          0.4277
## TEAM_BATTING_2B         0.29         0.5628           1.000         -0.1073
## TEAM_BATTING_3B         0.14         0.4277          -0.107          1.0000
## TEAM_BATTING_HR         0.18        -0.0065           0.435         -0.6356
## TEAM_BATTING_BB         0.23        -0.0725           0.256         -0.2872
## TEAM_BATTING_SO           NA             NA              NA              NA
## TEAM_BASERUN_SB           NA             NA              NA              NA
## TEAM_BASERUN_CS           NA             NA              NA              NA
## TEAM_BATTING_HBP          NA             NA              NA              NA
## TEAM_PITCHING_H        -0.11         0.3027           0.024          0.1949
## TEAM_PITCHING_HR        0.19         0.0729           0.455         -0.5678
## TEAM_PITCHING_BB        0.12         0.0942           0.178         -0.0022
## TEAM_PITCHING_SO          NA             NA              NA              NA
## TEAM_FIELDING_E        -0.18         0.2649          -0.235          0.5098
## TEAM_FIELDING_DP          NA             NA              NA              NA
##                  TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## TARGET_WINS               0.1762           0.233              NA
## TEAM_BATTING_H           -0.0065          -0.072              NA
## TEAM_BATTING_2B           0.4354           0.256              NA
## TEAM_BATTING_3B          -0.6356          -0.287              NA
## TEAM_BATTING_HR           1.0000           0.514              NA
## TEAM_BATTING_BB           0.5137           1.000              NA
## TEAM_BATTING_SO               NA              NA               1
## TEAM_BASERUN_SB               NA              NA              NA
## TEAM_BASERUN_CS               NA              NA              NA
## TEAM_BATTING_HBP              NA              NA              NA
## TEAM_PITCHING_H          -0.2501          -0.450              NA
## TEAM_PITCHING_HR          0.9694           0.460              NA
## TEAM_PITCHING_BB          0.1369           0.489              NA
## TEAM_PITCHING_SO              NA              NA              NA
## TEAM_FIELDING_E          -0.5873          -0.656              NA
## TEAM_FIELDING_DP              NA              NA              NA
##                  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP
## TARGET_WINS                   NA              NA               NA
## TEAM_BATTING_H                NA              NA               NA
## TEAM_BATTING_2B               NA              NA               NA
## TEAM_BATTING_3B               NA              NA               NA
## TEAM_BATTING_HR               NA              NA               NA
## TEAM_BATTING_BB               NA              NA               NA
## TEAM_BATTING_SO               NA              NA               NA
## TEAM_BASERUN_SB                1              NA               NA
## TEAM_BASERUN_CS               NA               1               NA
## TEAM_BATTING_HBP              NA              NA                1
## TEAM_PITCHING_H               NA              NA               NA
## TEAM_PITCHING_HR              NA              NA               NA
## TEAM_PITCHING_BB              NA              NA               NA
## TEAM_PITCHING_SO              NA              NA               NA
## TEAM_FIELDING_E               NA              NA               NA
## TEAM_FIELDING_DP              NA              NA               NA
##                  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## TARGET_WINS               -0.110            0.189           0.1242
## TEAM_BATTING_H             0.303            0.073           0.0942
## TEAM_BATTING_2B            0.024            0.455           0.1781
## TEAM_BATTING_3B            0.195           -0.568          -0.0022
## TEAM_BATTING_HR           -0.250            0.969           0.1369
## TEAM_BATTING_BB           -0.450            0.460           0.4894
## TEAM_BATTING_SO               NA               NA               NA
## TEAM_BASERUN_SB               NA               NA               NA
## TEAM_BASERUN_CS               NA               NA               NA
## TEAM_BATTING_HBP              NA               NA               NA
## TEAM_PITCHING_H            1.000           -0.142           0.3207
## TEAM_PITCHING_HR          -0.142            1.000           0.2219
## TEAM_PITCHING_BB           0.321            0.222           1.0000
## TEAM_PITCHING_SO              NA               NA               NA
## TEAM_FIELDING_E            0.668           -0.493          -0.0228
## TEAM_FIELDING_DP              NA               NA               NA
##                  TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## TARGET_WINS                    NA          -0.176               NA
## TEAM_BATTING_H                 NA           0.265               NA
## TEAM_BATTING_2B                NA          -0.235               NA
## TEAM_BATTING_3B                NA           0.510               NA
## TEAM_BATTING_HR                NA          -0.587               NA
## TEAM_BATTING_BB                NA          -0.656               NA
## TEAM_BATTING_SO                NA              NA               NA
## TEAM_BASERUN_SB                NA              NA               NA
## TEAM_BASERUN_CS                NA              NA               NA
## TEAM_BATTING_HBP               NA              NA               NA
## TEAM_PITCHING_H                NA           0.668               NA
## TEAM_PITCHING_HR               NA          -0.493               NA
## TEAM_PITCHING_BB               NA          -0.023               NA
## TEAM_PITCHING_SO                1              NA               NA
## TEAM_FIELDING_E                NA           1.000               NA
## TEAM_FIELDING_DP               NA              NA                1
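
Many entries above are NA because cor() propagates missing values by default. The following is a sketch, on made-up data, of asking cor() to use every row that is complete for each pair of columns, and then listing the pairs whose absolute correlation exceeds a cutoff. The toy data frame and the 0.9 cutoff are assumptions chosen for illustration, not values taken from this analysis.

```r
set.seed(1)
toy <- data.frame(a = rnorm(50))
toy$b <- toy$a + rnorm(50, sd = 0.1)  # nearly collinear with a
toy$c <- rnorm(50)                    # unrelated noise
toy$b[1:5] <- NA                      # inject missing values

cor_default  <- cor(toy)                                 # NA wherever b is paired
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")  # rows complete per pair

# List each variable pair once (upper triangle) above a hypothetical cutoff
cutoff <- 0.9
idx <- which(abs(cor_pairwise) > cutoff & upper.tri(cor_pairwise), arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_pairwise)[idx[, "row"]],
  var2 = colnames(cor_pairwise)[idx[, "col"]],
  r = cor_pairwise[idx]
)
print(strong_pairs)
```

Pairwise deletion means each coefficient may be based on a different number of rows, which is worth keeping in mind when comparing them.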

The following runs Pearson correlation tests between each variable and the variable TARGET_WINS.

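# Note: cor.test() drops incomplete (NA) pairs before computing the correlation,
# which is why the degrees of freedom differ between variables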

cor.test(trainData$TEAM_BATTING_H, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_BATTING_H and trainData$TARGET_WINS
## t = 20, df = 2274, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.35 0.42
## sample estimates:
##  cor 
## 0.39
cor.test(trainData$TEAM_BATTING_2B, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_BATTING_2B and trainData$TARGET_WINS
## t = 14, df = 2274, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.25 0.33
## sample estimates:
##  cor 
## 0.29
cor.test(trainData$TEAM_BATTING_3B, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_BATTING_3B and trainData$TARGET_WINS
## t = 7, df = 2274, p-value = 0.000000000008
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.10 0.18
## sample estimates:
##  cor 
## 0.14
cor.test(trainData$TEAM_BATTING_HR, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_BATTING_HR and trainData$TARGET_WINS
## t = 9, df = 2274, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.14 0.22
## sample estimates:
##  cor 
## 0.18
cor.test(trainData$TEAM_BATTING_BB, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_BATTING_BB and trainData$TARGET_WINS
## t = 11, df = 2274, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.19 0.27
## sample estimates:
##  cor 
## 0.23
cor.test(trainData$TEAM_BATTING_HBP, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_BATTING_HBP and trainData$TARGET_WINS
## t = 1, df = 189, p-value = 0.3
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.069  0.213
## sample estimates:
##   cor 
## 0.074
cor.test(trainData$TEAM_BATTING_SO, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_BATTING_SO and trainData$TARGET_WINS
## t = -1, df = 2172, p-value = 0.1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.074  0.010
## sample estimates:
##    cor 
## -0.032
cor.test(trainData$TEAM_BASERUN_SB, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_BASERUN_SB and trainData$TARGET_WINS
## t = 6, df = 2143, p-value = 0.0000000003
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.093 0.176
## sample estimates:
##  cor 
## 0.14
cor.test(trainData$TEAM_BASERUN_CS, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_BASERUN_CS and trainData$TARGET_WINS
## t = 0.9, df = 1502, p-value = 0.4
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.028  0.073
## sample estimates:
##   cor 
## 0.022
cor.test(trainData$TEAM_FIELDING_E, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_FIELDING_E and trainData$TARGET_WINS
## t = -9, df = 2274, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.22 -0.14
## sample estimates:
##   cor 
## -0.18
cor.test(trainData$TEAM_FIELDING_DP, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_FIELDING_DP and trainData$TARGET_WINS
## t = -2, df = 1988, p-value = 0.1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.0787  0.0091
## sample estimates:
##    cor 
## -0.035
cor.test(trainData$TEAM_PITCHING_BB, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_PITCHING_BB and trainData$TARGET_WINS
## t = 6, df = 2274, p-value = 0.000000003
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.084 0.164
## sample estimates:
##  cor 
## 0.12
cor.test(trainData$TEAM_PITCHING_H, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_PITCHING_H and trainData$TARGET_WINS
## t = -5, df = 2274, p-value = 0.0000001
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.150 -0.069
## sample estimates:
##   cor 
## -0.11
cor.test(trainData$TEAM_PITCHING_HR, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_PITCHING_HR and trainData$TARGET_WINS
## t = 9, df = 2274, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.15 0.23
## sample estimates:
##  cor 
## 0.19
cor.test(trainData$TEAM_PITCHING_SO, trainData$TARGET_WINS)
## 
##  Pearson's product-moment correlation
## 
## data:  trainData$TEAM_PITCHING_SO and trainData$TARGET_WINS
## t = -4, df = 2172, p-value = 0.0003
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.120 -0.037
## sample estimates:
##    cor 
## -0.078
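
The individual tests above can also be gathered into one table, which makes the variables easier to compare. The following is a sketch on made-up data; the column names TARGET, x1, and x2 and the cor_table layout are hypothetical, and the same loop would apply to trainData with TARGET_WINS as the target.

```r
set.seed(2)
toy <- data.frame(TARGET = rnorm(100))
toy$x1 <- toy$TARGET + rnorm(100)  # related to the target
toy$x2 <- rnorm(100)               # unrelated noise
toy$x2[1:10] <- NA                 # cor.test() drops these incomplete pairs

predictors <- setdiff(names(toy), "TARGET")

# One row per predictor: estimate, 95% interval, degrees of freedom, p-value
cor_table <- do.call(rbind, lapply(predictors, function(v) {
  ct <- cor.test(toy[[v]], toy$TARGET)
  data.frame(
    variable = v,
    r = unname(ct$estimate),
    conf_low = ct$conf.int[1],
    conf_high = ct$conf.int[2],
    df = unname(ct$parameter),
    p_value = ct$p.value
  )
}))

# Strongest absolute correlations first
cor_table <- cor_table[order(-abs(cor_table$r)), ]
print(cor_table)
```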

Variables vs TARGET_WINS

p1 <- ggplot(trainData) + 
  aes(x = TEAM_BATTING_H, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()

p2 <- ggplot(trainData) + 
  aes(x = TEAM_BATTING_2B, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()

p3 <- ggplot(trainData) + 
  aes(x = TEAM_BATTING_3B, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()

p4 <- ggplot(trainData) + 
  aes(x = TEAM_BATTING_HR, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()

p5 <- ggplot(trainData) + 
  aes(x = TEAM_BATTING_BB, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()

p6 <- ggplot(trainData) + 
  aes(x = TEAM_BATTING_HBP, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()

p7 <- ggplot(trainData) + 
  aes(x = TEAM_BATTING_SO, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()

p8 <- ggplot(trainData) + 
  aes(x = TEAM_BASERUN_SB, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()

p9 <- ggplot(trainData) + 
  aes(x = TEAM_BASERUN_CS, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()

p10 <- ggplot(trainData) + 
  aes(x = TEAM_FIELDING_E, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()


p11 <- ggplot(trainData) + 
  aes(x = TEAM_FIELDING_DP, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()

p12 <- ggplot(trainData) + 
  aes(x = TEAM_PITCHING_BB, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()

p13 <- ggplot(trainData) + 
  aes(x = TEAM_PITCHING_H, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()

p14 <- ggplot(trainData) + 
  aes(x = TEAM_PITCHING_HR, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()

p15 <- ggplot(trainData) + 
  aes(x = TEAM_PITCHING_SO, y = TARGET_WINS) +
  geom_point(colour = "dodgerblue") +
  theme_minimal()

# Empty Plot
p_empty <- ggplot() + 
  theme_void()

multiplot(p1, p2, p3, p4, p5, p6, cols=3)

multiplot(p7, p8, p9, p10, p11, p12, cols=3)

multiplot(p13, p14, p15, p_empty, p_empty, p_empty, cols=3)
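
The fifteen near-identical plot definitions above could also be generated in a loop. The following is a sketch on made-up data, assuming a ggplot2 version that supports the .data pronoun inside aes(); scatter_vs_target and toy are hypothetical names.

```r
library(ggplot2)

# Build one scatterplot per predictor against a target column.
scatter_vs_target <- function(df, target) {
  predictors <- setdiff(names(df), target)
  plots <- lapply(predictors, function(v) {
    ggplot(df, aes(x = .data[[v]], y = .data[[target]])) +
      geom_point(colour = "dodgerblue") +
      labs(x = v, y = target) +
      theme_minimal()
  })
  names(plots) <- predictors
  plots
}

# Made-up data standing in for trainData
toy <- data.frame(TARGET = 1:10, x1 = (1:10)^2, x2 = 10:1)
plots <- scatter_vs_target(toy, "TARGET")
```

The resulting list can be handed to the multiplot() helper defined earlier through its plotlist argument, for example multiplot(plotlist = plots, cols = 3).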

Missing Values

The following gives a chart of the percentages of missing values by variable for the training data set.

# Percentage of missing values by the variable
miss_var_summary(trainData)

The missing values will be handled in the data preparation section of this report.
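
The same percentages can be cross-checked in base R without naniar. The following is a sketch on made-up data; the toy data frame and the 60% cutoff used to flag drop candidates are assumptions chosen for illustration.

```r
# Made-up data with 0%, 20%, and 90% missing values
toy <- data.frame(
  full = 1:10,
  some = c(NA, NA, 3:10),
  mostly = c(1, rep(NA, 9))
)

# Share of NA values per column, largest first
pct_missing <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
print(round(pct_missing, 1))

# Hypothetical cutoff for columns that may be better dropped than imputed
drop_cutoff <- 60
drop_candidates <- names(pct_missing)[pct_missing > drop_cutoff]
print(drop_candidates)
```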

Data Preparation

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.

Dropping a Variable

According to an article by Dan Berdikulov, dropping a variable should be considered when 60-70% of its values are missing. With 91.6% missing, the variable TEAM_BATTING_HBP will be dropped.

# Remove the TEAM_BATTING_HBP column
trainData <- subset(trainData, select = -TEAM_BATTING_HBP)

evalData <- subset(evalData, select = -TEAM_BATTING_HBP)

Imputation

Because the remaining variables with missing values are numeric, I decided to fill them using the Predictive Mean Matching (PMM) method. I used the mice (Multivariate Imputation by Chained Equations) package for this.
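
To illustrate the idea behind PMM, the following is a deliberately simplified, single-variable sketch in base R on made-up data. It is not the algorithm mice runs: mice also draws the regression coefficients at random and cycles through all incomplete variables. The function pmm_impute, its arguments, and the toy vectors are hypothetical. For each missing value, the sketch finds the k observed cases with the closest predicted values and copies the observed value of one of them, so every imputed value is one that actually occurs in the data.

```r
# Simplified single-variable predictive mean matching (illustration only).
pmm_impute <- function(y, x, k = 5) {
  obs <- !is.na(y)
  fit <- lm(y ~ x, subset = obs)                     # fit on observed cases
  pred <- predict(fit, newdata = data.frame(x = x))  # predict for every case
  y_imp <- y
  for (i in which(!obs)) {
    # k observed cases whose predicted value is closest to this case's prediction
    donors <- order(abs(pred[obs] - pred[i]))[seq_len(min(k, sum(obs)))]
    # Copy the observed value of one randomly chosen donor
    y_imp[i] <- y[obs][donors[sample.int(length(donors), 1)]]
  }
  y_imp
}

set.seed(3)
x <- rnorm(200)
y <- round(50 + 10 * x + rnorm(200, sd = 3))
y_missing <- y
y_missing[sample(200, 40)] <- NA  # remove 40 values at random

y_filled <- pmm_impute(y_missing, x)
summary(y_filled)
```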

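# Columns with missing values that will be imputed:
#   TEAM_BASERUN_CS
#   TEAM_FIELDING_DP
#   TEAM_BASERUN_SB
#   TEAM_BATTING_SO
#   TEAM_PITCHING_SO
# https://www.rdocumentation.org/packages/mice/versions/3.13.0/topics/mice.impute.pmm
set.seed(91421)
# `method` is spelled out rather than relying on partial matching of `meth`
temp <- mice(trainData, m = 5, maxit = 5, method = 'pmm')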
## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
tempData <- complete(temp, 1)

trainData$TEAM_BASERUN_CS <- tempData$TEAM_BASERUN_CS
trainData$TEAM_FIELDING_DP <- tempData$TEAM_FIELDING_DP
trainData$TEAM_BASERUN_SB <- tempData$TEAM_BASERUN_SB
trainData$TEAM_BATTING_SO <- tempData$TEAM_BATTING_SO
trainData$TEAM_PITCHING_SO <- tempData$TEAM_PITCHING_SO


temp <- mice(evalData, m = 5, maxit = 5, method = 'pmm')
## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_BASERUN_CS  TEAM_PITCHING_SO  TEAM_FIELDING_DP
tempData <- complete(temp, 1)

evalData$TEAM_BASERUN_CS <- tempData$TEAM_BASERUN_CS
evalData$TEAM_FIELDING_DP <- tempData$TEAM_FIELDING_DP
evalData$TEAM_BASERUN_SB <- tempData$TEAM_BASERUN_SB
evalData$TEAM_BATTING_SO <- tempData$TEAM_BATTING_SO
evalData$TEAM_PITCHING_SO <- tempData$TEAM_PITCHING_SO

Let’s look at a summary of the imputed data set.

summary(trainData)
##   TARGET_WINS  TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
##  Min.   :  0   Min.   : 891   Min.   : 69     Min.   :  0     Min.   :  0    
##  1st Qu.: 71   1st Qu.:1383   1st Qu.:208     1st Qu.: 34     1st Qu.: 42    
##  Median : 82   Median :1454   Median :238     Median : 47     Median :102    
##  Mean   : 81   Mean   :1469   Mean   :241     Mean   : 55     Mean   :100    
##  3rd Qu.: 92   3rd Qu.:1537   3rd Qu.:273     3rd Qu.: 72     3rd Qu.:147    
##  Max.   :146   Max.   :2554   Max.   :458     Max.   :223     Max.   :264    
##  TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
##  Min.   :  0     Min.   :   0    Min.   :  0     Min.   :  0    
##  1st Qu.:451     1st Qu.: 542    1st Qu.: 67     1st Qu.: 43    
##  Median :512     Median : 733    Median :106     Median : 57    
##  Mean   :502     Mean   : 728    Mean   :136     Mean   : 77    
##  3rd Qu.:580     3rd Qu.: 925    3rd Qu.:170     3rd Qu.: 91    
##  Max.   :878     Max.   :1399    Max.   :697     Max.   :201    
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
##  Min.   : 1137   Min.   :  0      Min.   :   0     Min.   :    0   
##  1st Qu.: 1419   1st Qu.: 50      1st Qu.: 476     1st Qu.:  613   
##  Median : 1518   Median :107      Median : 536     Median :  804   
##  Mean   : 1779   Mean   :106      Mean   : 553     Mean   :  812   
##  3rd Qu.: 1682   3rd Qu.:150      3rd Qu.: 611     3rd Qu.:  958   
##  Max.   :30132   Max.   :343      Max.   :3645     Max.   :19278   
##  TEAM_FIELDING_E TEAM_FIELDING_DP
##  Min.   :  65    Min.   : 52     
##  1st Qu.: 127    1st Qu.:126     
##  Median : 159    Median :146     
##  Mean   : 246    Mean   :142     
##  3rd Qu.: 249    3rd Qu.:162     
##  Max.   :1898    Max.   :228

Transformations

# https://topepo.github.io/caret/pre-processing.html#centering-and-scaling
trainTransformed <- trainData
preProcessValues <- preProcess(trainTransformed, method = c("BoxCox", "center", "scale"))
trainTransformed <- predict(preProcessValues, trainTransformed)

trainTransformed2 <- evalData
preProcessValues <- preProcess(trainTransformed2, method = c("BoxCox", "center", "scale"))
trainTransformed2 <- predict(preProcessValues, trainTransformed2)



trainData$TEAM_PITCHING_SO <- trainTransformed$TEAM_PITCHING_SO
trainData$TEAM_PITCHING_BB <- trainTransformed$TEAM_PITCHING_BB
trainData$TEAM_BASERUN_SB <- trainTransformed$TEAM_BASERUN_SB
trainData$TEAM_BASERUN_CS <- trainTransformed$TEAM_BASERUN_CS
trainData$TEAM_PITCHING_H <- log(trainData$TEAM_PITCHING_H)
trainData$TEAM_FIELDING_E <- trainTransformed$TEAM_FIELDING_E  


evalData$TEAM_PITCHING_SO <- trainTransformed2$TEAM_PITCHING_SO
evalData$TEAM_PITCHING_BB <- trainTransformed2$TEAM_PITCHING_BB
evalData$TEAM_BASERUN_SB <- trainTransformed2$TEAM_BASERUN_SB
evalData$TEAM_BASERUN_CS <- trainTransformed2$TEAM_BASERUN_CS
evalData$TEAM_PITCHING_H <- log(evalData$TEAM_PITCHING_H)
evalData$TEAM_FIELDING_E <- trainTransformed2$TEAM_FIELDING_E
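
The code above fits one preProcess object on the training data and a second one on the evaluation data, so the two sets are centered and scaled with different parameters. A common alternative estimates the parameters on the training set only and reuses them for the evaluation set. The base R sketch below shows that idea on toy data; the data frames train_toy and eval_toy and the column x are hypothetical, and this is an illustration rather than the pipeline used in this report.

```r
# Toy data (hypothetical): one numeric predictor in each set
train_toy <- data.frame(x = c(1, 2, 3, 4, 5))
eval_toy  <- data.frame(x = c(2, 6))

# Estimate centering and scaling parameters on the training set only
x_mean <- mean(train_toy$x)
x_sd   <- sd(train_toy$x)

# Apply the same parameters to both sets
train_toy$x_scaled <- (train_toy$x - x_mean) / x_sd
eval_toy$x_scaled  <- (eval_toy$x - x_mean) / x_sd

print(train_toy)
print(eval_toy)
```

With this approach a given raw value maps to the same transformed value in both sets, which keeps the model coefficients meaningful when predicting on new data.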



# facet_wrap() lays out the panels, so no par(mfrow = ...) call is needed for ggplot2
plotData <- melt(trainData)
## No id variables; using all as measure variables
ggplot(plotData, aes(x= value)) + 
  theme(panel.border = element_blank(), panel.background = element_blank(), 
        panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
  geom_density(fill='dodgerblue') + facet_wrap(~variable, scales = 'free') 

New Features

TEAM_BATTING_1B

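The variable TEAM_BATTING_H counts all base hits by batters (1B, 2B, 3B, and HR). Doubles, triples, and home runs already have their own variables, but singles do not, so TEAM_BATTING_1B is created by subtracting the other three hit types from TEAM_BATTING_H. TEAM_BATTING_H is then dropped, since it is the sum of the four hit-type variables.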

trainData$TEAM_BATTING_1B <- trainData$TEAM_BATTING_H  - trainData$TEAM_BATTING_2B - trainData$TEAM_BATTING_3B - trainData$TEAM_BATTING_HR 
evalData$TEAM_BATTING_1B <- evalData$TEAM_BATTING_H  - evalData$TEAM_BATTING_2B - evalData$TEAM_BATTING_3B - evalData$TEAM_BATTING_HR 


trainData <- subset(trainData, select = -TEAM_BATTING_H)
evalData <- subset(evalData, select = -TEAM_BATTING_H)

head(trainData)

Modified OPS

Billy Beane, of Moneyball fame, was known for basing player drafting on two statistics: OBP (On-base Percentage) and SLG (Slugging Percentage). He combined the two to form the OPS (On-base Plus Slugging) statistic.

Because I dropped the TEAM_BATTING_HBP statistic, I will use a modified OPS based upon the data set.

On-base Percentage = (TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB) / PA

Assuming 162 games, 9 innings per game, and 3 outs per inning: Plate Appearances (PA) = (162 * 9 * 3) + TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR - TEAM_BASERUN_CS - TEAM_FIELDING_DP

Slugging Percentage (SLG) = (TEAM_BATTING_1B + 2 * TEAM_BATTING_2B + 3 * TEAM_BATTING_3B + 4*TEAM_BATTING_HR) / (PA - TEAM_BATTING_BB)

OPS = OBP + SLG
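
To make the arithmetic concrete, the sketch below wraps the formulas above in a helper and applies it to one made-up team. The function name add_ops_features and the input numbers are hypothetical, and raw counts are assumed for every column.

```r
# Hypothetical helper: adds PA, OBP, SLG and OPS using the formulas above
add_ops_features <- function(df) {
  hits <- df$TEAM_BATTING_1B + df$TEAM_BATTING_2B +
    df$TEAM_BATTING_3B + df$TEAM_BATTING_HR
  total_bases <- df$TEAM_BATTING_1B + 2 * df$TEAM_BATTING_2B +
    3 * df$TEAM_BATTING_3B + 4 * df$TEAM_BATTING_HR

  df$PA  <- (162 * 9 * 3) + hits - df$TEAM_BASERUN_CS - df$TEAM_FIELDING_DP
  df$OBP <- (hits + df$TEAM_BATTING_BB) / df$PA
  df$SLG <- total_bases / (df$PA - df$TEAM_BATTING_BB)
  df$OPS <- df$OBP + df$SLG
  df
}

# One made-up team season (illustrative values only)
team <- data.frame(
  TEAM_BATTING_1B = 1000, TEAM_BATTING_2B = 240, TEAM_BATTING_3B = 50,
  TEAM_BATTING_HR = 100, TEAM_BATTING_BB = 500,
  TEAM_BASERUN_CS = 60, TEAM_FIELDING_DP = 140
)

team <- add_ops_features(team)
print(team[, c("PA", "OBP", "SLG", "OPS")])
```

For this team PA works out to 4374 + 1390 - 200 = 5564, so OBP = 1890 / 5564 and SLG = 2030 / 5064. A helper like this could be applied to both the training and evaluation data frames so the two stay in sync.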

trainData$PA <- (162 * 9 * 3) + trainData$TEAM_BATTING_1B + trainData$TEAM_BATTING_2B + trainData$TEAM_BATTING_3B + trainData$TEAM_BATTING_HR - trainData$TEAM_BASERUN_CS - trainData$TEAM_FIELDING_DP  

trainData$OBP <- (trainData$TEAM_BATTING_1B + trainData$TEAM_BATTING_2B + trainData$TEAM_BATTING_3B + trainData$TEAM_BATTING_HR + trainData$TEAM_BATTING_BB) / trainData$PA  

trainData$SLG <- (trainData$TEAM_BATTING_1B + 2 * trainData$TEAM_BATTING_2B + 3 * trainData$TEAM_BATTING_3B + 4 * trainData$TEAM_BATTING_HR) / (trainData$PA - trainData$TEAM_BATTING_BB)  

trainData$OPS <- trainData$OBP + trainData$SLG  


evalData$PA <- (162 * 9 * 3) + evalData$TEAM_BATTING_1B + evalData$TEAM_BATTING_2B + evalData$TEAM_BATTING_3B + evalData$TEAM_BATTING_HR - evalData$TEAM_BASERUN_CS - evalData$TEAM_FIELDING_DP  

evalData$OBP <- (evalData$TEAM_BATTING_1B + evalData$TEAM_BATTING_2B + evalData$TEAM_BATTING_3B + evalData$TEAM_BATTING_HR + evalData$TEAM_BATTING_BB) / evalData$PA  

evalData$SLG <- (evalData$TEAM_BATTING_1B + 2 * evalData$TEAM_BATTING_2B + 3 * evalData$TEAM_BATTING_3B + 4 * evalData$TEAM_BATTING_HR) / (evalData$PA - evalData$TEAM_BATTING_BB)  

evalData$OPS <- evalData$OBP + evalData$SLG

Build Models

Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Since we have not yet covered automated variable selection methods, you should select the variables manually (unless you previously learned Forward or Stepwise selection, etc.). Since you manually selected a variable for inclusion in the model or exclusion from the model, indicate why this was done.

Model 1

This model uses the variables related to batting: TEAM_BATTING_1B, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, and TEAM_BATTING_BB. These variables were selected because they had a positive correlation coefficient in the first section.

##  [1] "TARGET_WINS"      "TEAM_BATTING_2B"  "TEAM_BATTING_3B"  "TEAM_BATTING_HR" 
##  [5] "TEAM_BATTING_BB"  "TEAM_BATTING_SO"  "TEAM_BASERUN_SB"  "TEAM_BASERUN_CS" 
##  [9] "TEAM_PITCHING_H"  "TEAM_PITCHING_HR" "TEAM_PITCHING_BB" "TEAM_PITCHING_SO"
## [13] "TEAM_FIELDING_E"  "TEAM_FIELDING_DP" "TEAM_BATTING_1B"  "PA"              
## [17] "OBP"              "SLG"              "OPS"

m1 <- lm(TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB, data = trainData, na.action = na.omit)
summary(m1)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB, data = trainData, 
##     na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65.41  -8.60   0.52   9.14  55.28 
## 
## Coefficients:
##                 Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)      3.32154    3.46608    0.96                 0.34    
## TEAM_BATTING_1B  0.03746    0.00308   12.18 < 0.0000000000000002 ***
## TEAM_BATTING_2B  0.02969    0.00751    3.95              0.00008 ***
## TEAM_BATTING_3B  0.13614    0.01498    9.09 < 0.0000000000000002 ***
## TEAM_BATTING_HR  0.08644    0.00788   10.97 < 0.0000000000000002 ***
## TEAM_BATTING_BB  0.02786    0.00280    9.93 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14 on 2270 degrees of freedom
## Multiple R-squared:  0.236,  Adjusted R-squared:  0.234 
## F-statistic:  140 on 5 and 2270 DF,  p-value: <0.0000000000000002

The summary of the model yields an \(R^2\) of 0.236. This means 23.6% of the variability in wins is explained by TEAM_BATTING_1B, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, and TEAM_BATTING_BB.
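
For reference, \(R^2\) is one minus the ratio of the residual sum of squares to the total sum of squares, and adjusted \(R^2\) applies a penalty for the number of predictors. The sketch below reproduces both values reported by summary() on R's built-in mtcars data, which stands in for the baseball data here; the model mpg ~ wt + hp is purely illustrative.

```r
# Illustrative fit on a built-in data set (not the moneyball data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)                    # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
r2_manual <- 1 - rss / tss

# Adjusted R-squared penalises the number of predictors p
n <- nrow(mtcars)
p <- length(coef(fit)) - 1
adj_r2_manual <- 1 - (1 - r2_manual) * (n - 1) / (n - p - 1)

print(c(manual = r2_manual, reported = summary(fit)$r.squared))
print(c(manual = adj_r2_manual, reported = summary(fit)$adj.r.squared))
```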

Model 2

This model uses the variables related to batting and pitching statistics: TEAM_BATTING_1B, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_PITCHING_SO, TEAM_PITCHING_HR, TEAM_PITCHING_H, and TEAM_PITCHING_BB.

m2 <- lm(TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_PITCHING_SO + TEAM_PITCHING_HR + TEAM_PITCHING_H + TEAM_PITCHING_BB, data = trainData, na.action = na.omit)
summary(m2)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_PITCHING_SO + 
##     TEAM_PITCHING_HR + TEAM_PITCHING_H + TEAM_PITCHING_BB, data = trainData, 
##     na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -53.67  -8.75   0.41   9.00  52.84 
## 
## Coefficients:
##                   Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       79.88562   16.62033    4.81          0.000001637 ***
## TEAM_BATTING_1B    0.05544    0.00393   14.11 < 0.0000000000000002 ***
## TEAM_BATTING_2B    0.03418    0.00759    4.50          0.000007067 ***
## TEAM_BATTING_3B    0.14191    0.01510    9.40 < 0.0000000000000002 ***
## TEAM_BATTING_HR    0.06144    0.02854    2.15              0.03140 *  
## TEAM_BATTING_BB    0.01676    0.00635    2.64              0.00837 ** 
## TEAM_PITCHING_SO   1.53815    0.46152    3.33              0.00087 ***
## TEAM_PITCHING_HR   0.02917    0.02537    1.15              0.25022    
## TEAM_PITCHING_H  -12.46648    2.21001   -5.64          0.000000019 ***
## TEAM_PITCHING_BB   0.13646    0.71012    0.19              0.84764    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14 on 2266 degrees of freedom
## Multiple R-squared:  0.254,  Adjusted R-squared:  0.251 
## F-statistic: 85.6 on 9 and 2266 DF,  p-value: <0.0000000000000002

The summary of the model yields an \(R^2\) of 0.254. This means 25.4% of the variability in wins is explained by TEAM_BATTING_1B, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_PITCHING_SO, TEAM_PITCHING_HR, TEAM_PITCHING_H, and TEAM_PITCHING_BB. Of these, TEAM_PITCHING_HR and TEAM_PITCHING_BB are not significant at the 0.05 level.

Model 3

This model uses the features we created: PA, OBP, SLG, and OPS.

m3 <- lm(TARGET_WINS ~ PA + OBP + SLG + OPS, data = trainData, na.action = na.omit)
summary(m3)
## 
## Call:
## lm(formula = TARGET_WINS ~ PA + OBP + SLG + OPS, data = trainData, 
##     na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -61.53  -8.79   0.48   9.23  49.88 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -125.80905   11.63466  -10.81 < 0.0000000000000002 ***
## PA             0.02562    0.00228   11.25 < 0.0000000000000002 ***
## OBP          131.26455   18.59171    7.06      0.0000000000022 ***
## SLG           37.31747   10.20266    3.66              0.00026 ***
## OPS                 NA         NA      NA                   NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14 on 2272 degrees of freedom
## Multiple R-squared:  0.233,  Adjusted R-squared:  0.232 
## F-statistic:  230 on 3 and 2272 DF,  p-value: <0.0000000000000002

The summary of the model yields an \(R^2\) of 0.233. This means 23.3% of the variability in wins is explained by PA, OBP, and SLG. No coefficient is estimated for OPS because it is the exact sum of OBP and SLG.
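
The NA row and the "1 not defined because of singularities" note in the output above can be reproduced on synthetic data. In the sketch below the variables obp, slg, ops, and wins are simulated stand-ins with made-up coefficients, not values from the moneyball data.

```r
set.seed(42)
n <- 100

# Synthetic stand-ins for the derived features (hypothetical values)
obp  <- runif(n, 0.28, 0.38)
slg  <- runif(n, 0.33, 0.48)
ops  <- obp + slg                    # exact linear combination
wins <- 130 * obp + 40 * slg + rnorm(n, sd = 5)

fit <- lm(wins ~ obp + slg + ops)
print(coef(fit))                     # the ops entry is NA

# Dropping the redundant term gives the same fitted values
fit_reduced <- lm(wins ~ obp + slg)
print(all.equal(fitted(fit), fitted(fit_reduced)))
```

Because the redundant term adds no information, leaving OPS out of the formula (or using OPS alone in place of OBP and SLG) yields a model without an undefined coefficient.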

Model 4

This model combines the created features with three variables that have a theoretical positive effect on wins: TEAM_BASERUN_SB, TEAM_FIELDING_DP, and TEAM_PITCHING_SO.

m4 <- lm(TARGET_WINS ~ PA + OBP + SLG + OPS + TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_PITCHING_SO, data = trainData, na.action = na.omit)
summary(m4)
## 
## Call:
## lm(formula = TARGET_WINS ~ PA + OBP + SLG + OPS + TEAM_BASERUN_SB + 
##     TEAM_FIELDING_DP + TEAM_PITCHING_SO, data = trainData, na.action = na.omit)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -58.99  -8.46   0.44   9.01  55.33 
## 
## Coefficients: (1 not defined because of singularities)
##                   Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)      -56.71037   13.58321   -4.18 0.00003091877566307 ***
## PA                 0.01099    0.00269    4.09 0.00004440169213577 ***
## OBP              152.55524   18.77988    8.12 0.00000000000000074 ***
## SLG               86.33136   11.38447    7.58 0.00000000000004882 ***
## OPS                     NA         NA      NA                  NA    
## TEAM_BASERUN_SB    1.88715    0.37430    5.04 0.00000049772941154 ***
## TEAM_FIELDING_DP  -0.09180    0.01302   -7.05 0.00000000000231713 ***
## TEAM_PITCHING_SO  -0.03860    0.31113   -0.12                 0.9    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14 on 2269 degrees of freedom
## Multiple R-squared:  0.273,  Adjusted R-squared:  0.271 
## F-statistic:  142 on 6 and 2269 DF,  p-value: <0.0000000000000002

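The summary of the model yields an \(R^2\) of 0.273. This means 27.3% of the variability in wins is explained by PA, OBP, SLG, TEAM_BASERUN_SB, TEAM_FIELDING_DP, and TEAM_PITCHING_SO; OPS is again dropped because of the singularity. TEAM_FIELDING_DP has a negative coefficient, contrary to its theoretical positive effect, and TEAM_PITCHING_SO is not significant (p = 0.9).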

Model 5

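This model uses the original variables, with TEAM_BATTING_1B in place of TEAM_BATTING_H and without TEAM_PITCHING_H, and none of the derived features PA, OBP, SLG, or OPS.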

m5 <- lm(TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BATTING_HR + 
           TEAM_BATTING_SO + TEAM_FIELDING_DP + TEAM_BASERUN_CS + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
           TEAM_BASERUN_SB + TEAM_FIELDING_E, data = trainData)  

summary(m5)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BATTING_HR + TEAM_BATTING_SO + 
##     TEAM_FIELDING_DP + TEAM_BASERUN_CS + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E, data = trainData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50.25  -7.92   0.06   8.16  66.75 
## 
## Coefficients:
##                  Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)      24.83206    5.40367    4.60   0.0000045566070790 ***
## TEAM_BATTING_1B   0.04242    0.00359   11.81 < 0.0000000000000002 ***
## TEAM_BATTING_2B   0.01768    0.00729    2.43               0.0153 *  
## TEAM_BATTING_3B   0.13109    0.01632    8.03   0.0000000000000015 ***
## TEAM_BATTING_BB   0.03559    0.00454    7.83   0.0000000000000072 ***
## TEAM_BATTING_HR   0.06371    0.02695    2.36               0.0181 *  
## TEAM_BATTING_SO  -0.01726    0.00257   -6.71   0.0000000000249938 ***
## TEAM_FIELDING_DP -0.10657    0.01254   -8.50 < 0.0000000000000002 ***
## TEAM_BASERUN_CS   1.27336    0.54919    2.32               0.0205 *  
## TEAM_PITCHING_HR  0.02308    0.02383    0.97               0.3330    
## TEAM_PITCHING_BB -1.86879    0.57826   -3.23               0.0012 ** 
## TEAM_PITCHING_SO  2.15667    0.48431    4.45   0.0000088756767528 ***
## TEAM_BASERUN_SB   2.80448    0.50983    5.50   0.0000000420801604 ***
## TEAM_FIELDING_E  -8.40447    0.59325  -14.17 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 2262 degrees of freedom
## Multiple R-squared:  0.34,   Adjusted R-squared:  0.336 
## F-statistic: 89.7 on 13 and 2262 DF,  p-value: <0.0000000000000002

The summary of the model yields an \(R^2\) of 0.34. This means 34% of the variability in wins is explained by this model.

Model 6

This model adds OBP, SLG, and OPS to the variables used in Model 5.

m6 <- lm(TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BATTING_HR + 
           TEAM_BATTING_SO + TEAM_FIELDING_DP + TEAM_BASERUN_CS + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
           TEAM_BASERUN_SB + TEAM_FIELDING_E + OBP + SLG + OPS, data = trainData)  

summary(m6)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BATTING_HR + TEAM_BATTING_SO + 
##     TEAM_FIELDING_DP + TEAM_BASERUN_CS + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + OBP + 
##     SLG + OPS, data = trainData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.78  -7.89   0.15   8.41  60.29 
## 
## Coefficients: (1 not defined because of singularities)
##                    Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       -89.53683   21.97356   -4.07    0.000047652947848 ***
## TEAM_BATTING_1B    -0.11529    0.03037   -3.80              0.00015 ***
## TEAM_BATTING_2B     0.00359    0.04612    0.08              0.93789    
## TEAM_BATTING_3B     0.24964    0.07482    3.34              0.00086 ***
## TEAM_BATTING_BB    -0.27523    0.05041   -5.46    0.000000052708656 ***
## TEAM_BATTING_HR     0.35882    0.11875    3.02              0.00254 ** 
## TEAM_BATTING_SO    -0.02036    0.00263   -7.73    0.000000000000016 ***
## TEAM_FIELDING_DP   -0.17425    0.01987   -8.77 < 0.0000000000000002 ***
## TEAM_BASERUN_CS     1.16584    0.54506    2.14              0.03255 *  
## TEAM_PITCHING_HR    0.00465    0.02398    0.19              0.84622    
## TEAM_PITCHING_BB   -1.65484    0.57443   -2.88              0.00400 ** 
## TEAM_PITCHING_SO    2.51165    0.48335    5.20    0.000000221441374 ***
## TEAM_BASERUN_SB     3.08567    0.50971    6.05    0.000000001651868 ***
## TEAM_FIELDING_E    -7.94156    0.59338  -13.38 < 0.0000000000000002 ***
## OBP              2083.77028  324.95919    6.41    0.000000000173859 ***
## SLG              -732.98646  184.42976   -3.97    0.000072788262741 ***
## OPS                      NA         NA      NA                   NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 2260 degrees of freedom
## Multiple R-squared:  0.352,  Adjusted R-squared:  0.348 
## F-statistic: 81.8 on 15 and 2260 DF,  p-value: <0.0000000000000002

The summary of the model yields an \(R^2\) of 0.352. This means 35.2% of the variability in wins is explained by this model.
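
Compared with Model 5, the coefficient on TEAM_BATTING_1B changes sign and the OBP and SLG coefficients are very large. That pattern is consistent with multicollinearity, since OBP and SLG are computed from batting counts already in the model. One way this could be checked is with variance inflation factors. The sketch below computes them in base R on the built-in mtcars data as a stand-in; the helper vif_base is hypothetical and written only for this illustration.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all of the other predictors
vif_base <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    aux_fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(aux_fit)$r.squared)
  })
}

# Stand-in predictors from a built-in data set (not the moneyball data)
vifs <- vif_base(mtcars, c("wt", "hp", "disp"))
print(round(vifs, 2))
```

A VIF close to 1 indicates a predictor that is nearly uncorrelated with the others, while large values flag predictors whose coefficients are likely to be unstable.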

Select Models

Decide on the criteria for selecting the best multiple linear regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.
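
Because \(R^2\) cannot decrease when predictors are added to a model, adjusted \(R^2\) and AIC are common companions when comparing models of different sizes. The sketch below builds such a comparison table in base R. The three models on the built-in mtcars data are stand-ins for m1 through m6, and the formulas are arbitrary.

```r
# Candidate models on a built-in data set (stand-ins for m1 ... m6)
models <- list(
  small  = lm(mpg ~ wt, data = mtcars),
  medium = lm(mpg ~ wt + hp, data = mtcars),
  large  = lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
)

# One row per model: raw fit, size-penalised fit, and information criterion
comparison <- data.frame(
  model  = names(models),
  r2     = sapply(models, function(m) summary(m)$r.squared),
  adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
  aic    = sapply(models, AIC),
  row.names = NULL
)

print(comparison)
```

Higher adjusted \(R^2\) and lower AIC both favor a model, and each penalizes extra predictors that add little explanatory power.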

I decided to go with Model 6 because it has the highest \(R^2\) (0.352) and adjusted \(R^2\) (0.348) of the six models. Now, I will run this model on the evaluation data set.

predictions <- predict(m6, evalData)

Because the evaluation data does not have the TARGET_WINS variable, the accuracy of the model cannot be calculated on the evaluation data set. However, here are the first few predictions:

head(predictions)
##  1  2  3  4  5  6 
## 61 67 74 87 60 70
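
Although the evaluation set has no TARGET_WINS, out-of-sample error could still be estimated by holding out part of the training data. The sketch below shows the idea on the built-in mtcars data; the 70/30 split, the seed, and the model formula are arbitrary assumptions and not part of the analysis above.

```r
set.seed(123)

# Randomly assign about 70% of the rows to the fitting set
n <- nrow(mtcars)
fit_rows <- sample(seq_len(n), size = floor(0.7 * n))
fit_set  <- mtcars[fit_rows, ]
holdout  <- mtcars[-fit_rows, ]

fit <- lm(mpg ~ wt + hp, data = fit_set)

# Root mean squared error on rows the model never saw
pred <- predict(fit, newdata = holdout)
rmse <- sqrt(mean((holdout$mpg - pred)^2))
print(rmse)
```

Applied to the baseball data, the same pattern would give an error estimate in units of wins for each candidate model.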