The purpose of this homework assignment is to build a multiple linear regression model on the training data to predict the number of wins for each team in the data set.
Describe the size and the variables in the moneyball training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.
The data set contains approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
| VARIABLE NAME | DEFINITION | THEORETICAL EFFECT |
|---|---|---|
| INDEX | Identification Variable (do not use) | None |
| TARGET_WINS | Number of wins | |
| TEAM_BATTING_H | Base Hits by batters (1B,2B,3B,HR) | Positive Impact on Wins |
| TEAM_BATTING_2B | Doubles by batters (2B) | Positive Impact on Wins |
| TEAM_BATTING_3B | Triples by batters (3B) | Positive Impact on Wins |
| TEAM_BATTING_HR | Homeruns by batters (4B) | Positive Impact on Wins |
| TEAM_BATTING_BB | Walks by batters | Positive Impact on Wins |
| TEAM_BATTING_HBP | Batters hit by pitch (get a free base) | Positive Impact on Wins |
| TEAM_BATTING_SO | Strikeouts by batters | Negative Impact on Wins |
| TEAM_BASERUN_SB | Stolen bases | Positive Impact on Wins |
| TEAM_BASERUN_CS | Caught stealing | Negative Impact on Wins |
| TEAM_FIELDING_E | Errors | Negative Impact on Wins |
| TEAM_FIELDING_DP | Double Plays | Positive Impact on Wins |
| TEAM_PITCHING_BB | Walks allowed | Negative Impact on Wins |
| TEAM_PITCHING_H | Hits allowed | Negative Impact on Wins |
| TEAM_PITCHING_HR | Homeruns allowed | Negative Impact on Wins |
| TEAM_PITCHING_SO | Strikeouts by pitchers | Positive Impact on Wins |
# https://cran.r-project.org/web/packages/pastecs/pastecs.pdf
library(pastecs)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.4 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x tidyr::extract() masks pastecs::extract()
## x dplyr::filter() masks stats::filter()
## x dplyr::first() masks pastecs::first()
## x dplyr::lag() masks stats::lag()
## x dplyr::last() masks pastecs::last()
# https://www.rdocumentation.org/packages/naniar/versions/0.6.1
library(naniar)
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(ggplot2)
# Used for skewness
library(moments)
# Log Scale
library(scales)
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
# Correlation corrplot
# https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
library(corrplot)
## corrplot 0.90 loaded
# Correlation
library(correlation)
# MICE: for missing values
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## cbind, rbind
# Caret: Center and Scaling
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
# To help prevent scientific notation for viewed values
options(scipen=100)
# To set the number of decimal places
options(digits=2)
# From http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)
  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)
  numPlots = length(plots)
  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }
  if (numPlots==1) {
    print(plots[[1]])
  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
The following is an exploration of the data set.
# Reading the data
trainData <- read.csv('https://raw.githubusercontent.com/logicalschema/Fall-2021/main/DATA621/hw1/moneyball-training-data.csv')
evalData <- read.csv('https://raw.githubusercontent.com/logicalschema/Fall-2021/main/DATA621/hw1/moneyball-evaluation-data.csv')
# Remove the Index column
trainData <- subset(trainData, select = -INDEX)
evalData <- subset(evalData, select = -INDEX)
head(trainData)
# Summary of the training data
summary(trainData)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## Min. : 0 Min. : 891 Min. : 69 Min. : 0 Min. : 0
## 1st Qu.: 71 1st Qu.:1383 1st Qu.:208 1st Qu.: 34 1st Qu.: 42
## Median : 82 Median :1454 Median :238 Median : 47 Median :102
## Mean : 81 Mean :1469 Mean :241 Mean : 55 Mean :100
## 3rd Qu.: 92 3rd Qu.:1537 3rd Qu.:273 3rd Qu.: 72 3rd Qu.:147
## Max. :146 Max. :2554 Max. :458 Max. :223 Max. :264
##
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:451 1st Qu.: 548 1st Qu.: 66 1st Qu.: 38
## Median :512 Median : 750 Median :101 Median : 49
## Mean :502 Mean : 736 Mean :125 Mean : 53
## 3rd Qu.:580 3rd Qu.: 930 3rd Qu.:156 3rd Qu.: 62
## Max. :878 Max. :1399 Max. :697 Max. :201
## NA's :102 NA's :131 NA's :772
## TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## Min. :29 Min. : 1137 Min. : 0 Min. : 0
## 1st Qu.:50 1st Qu.: 1419 1st Qu.: 50 1st Qu.: 476
## Median :58 Median : 1518 Median :107 Median : 536
## Mean :59 Mean : 1779 Mean :106 Mean : 553
## 3rd Qu.:67 3rd Qu.: 1682 3rd Qu.:150 3rd Qu.: 611
## Max. :95 Max. :30132 Max. :343 Max. :3645
## NA's :2085
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 0 Min. : 65 Min. : 52
## 1st Qu.: 615 1st Qu.: 127 1st Qu.:131
## Median : 814 Median : 159 Median :149
## Mean : 818 Mean : 246 Mean :146
## 3rd Qu.: 968 3rd Qu.: 249 3rd Qu.:164
## Max. :19278 Max. :1898 Max. :228
## NA's :102 NA's :286
# https://cran.r-project.org/web/packages/pastecs/pastecs.pdf
stat.desc(trainData, basic = FALSE)
The summary of the data set shows the following:

- TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_PITCHING_SO, and TEAM_FIELDING_DP have NA values.
- Because the INDEX variable is described as having no theoretical effect, we can remove the INDEX column.
- TARGET_WINS is our dependent variable, which leaves 15 candidate predictor variables.

The following is a look at each of the variable distributions for the training data set.
par(mfrow = c(3, 3))
plotData <- melt(trainData)
## No id variables; using all as measure variables
ggplot(plotData, aes(x= value)) +
theme(panel.border = element_blank(), panel.background = element_blank(),
panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
geom_density(fill='dodgerblue') + facet_wrap(~variable, scales = 'free')
- TARGET_WINS, TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_BB, and TEAM_BASERUN_CS look to be normally distributed.
- TEAM_BATTING_HR, TEAM_BATTING_SO, and TEAM_PITCHING_HR are bimodal.

skewness(trainData)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## -0.40 1.57 0.22 1.11
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 0.19 -1.03 NA NA
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## NA NA 10.34 0.29
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 6.75 NA 2.99 NA
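The NA entries in the skewness output appear for exactly the variables that contain missing values. As a minimal sketch, assuming the moment-based definition of skewness (third central moment divided by the second central moment raised to the power 1.5), a hypothetical helper `skew_na_rm` can drop the missing values before computing. The toy data frame is illustrative only and is not the moneyball data.

```r
# Hypothetical helper: moment-based skewness that ignores missing values.
skew_na_rm <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy data (not the moneyball data) with a missing value in one column
toy <- data.frame(
  symmetric  = c(1, 2, 3, 4, 5),
  right_tail = c(1, 1, 2, NA, 20)
)

# One skewness value per column, with no NA results
skew_by_column <- sapply(toy, skew_na_rm)
print(skew_by_column)
```

The `skewness()` function from the `moments` package also takes an `na.rm` argument, so passing `na.rm = TRUE` to it should serve the same purpose on the real data.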
The following are the boxplots of each of the variables for the training data set. Along with the distribution graphs above, these are helpful to identify outliers.
trainData %>%
  tidyr::gather(key, value) %>%
  ggplot(aes(x = key, y = value, fill = key)) +
  # Draw the boxplots once, with outliers highlighted in red
  geom_boxplot(outlier.colour = "red") +
  theme(legend.position = "none",
        panel.background = element_blank(),
        axis.title.y = element_blank()) +
  # A log2 scale keeps variables with very different ranges readable
  scale_y_continuous(trans = log2_trans()) +
  coord_flip()
With the variables TEAM_PITCHING_SO, TEAM_PITCHING_H, TEAM_PITCHING_BB, TEAM_FIELDING_E, TEAM_BATTING_SO, TEAM_BATTING_BB, and TEAM_BASERUN_CS, there are a large number of outliers.
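To put a number on "a large number of outliers", the points beyond the boxplot whiskers can be counted per variable. The sketch below uses base R's `boxplot.stats()` with its default 1.5 * IQR rule; `count_outliers` is a hypothetical helper and the toy data frame stands in for `trainData`.

```r
# Hypothetical helper: count points beyond the boxplot whiskers (1.5 * IQR rule).
# boxplot.stats() omits NA values on its own.
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Toy data (not the moneyball data)
toy <- data.frame(
  steady = c(10, 11, 12, 13, 14, 15, NA),
  spiky  = c(1, 2, 3, 4, 5, 100, NA)
)

outlier_counts <- sapply(toy, count_outliers)
print(outlier_counts)
```

Because the boxplots above are drawn on a log2 scale, counts computed on the raw scale like this may not match the red points in the figure exactly.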
In the Data Preparation section, we will continue to winnow the variables to produce the multiple linear regression model.
trainData %>%
complete.cases() %>%
trainData[., ] %>%
cor() %>%
corrplot(method = "shade")
The correlation matrix shows a strong relationship between TEAM_PITCHING_H and TEAM_BATTING_H, TEAM_PITCHING_HR and TEAM_BATTING_HR, TEAM_PITCHING_BB and TEAM_BATTING_BB, TEAM_PITCHING_SO and TEAM_BATTING_SO.
cor(trainData)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## TARGET_WINS 1.00 0.3888 0.289 0.1426
## TEAM_BATTING_H 0.39 1.0000 0.563 0.4277
## TEAM_BATTING_2B 0.29 0.5628 1.000 -0.1073
## TEAM_BATTING_3B 0.14 0.4277 -0.107 1.0000
## TEAM_BATTING_HR 0.18 -0.0065 0.435 -0.6356
## TEAM_BATTING_BB 0.23 -0.0725 0.256 -0.2872
## TEAM_BATTING_SO NA NA NA NA
## TEAM_BASERUN_SB NA NA NA NA
## TEAM_BASERUN_CS NA NA NA NA
## TEAM_BATTING_HBP NA NA NA NA
## TEAM_PITCHING_H -0.11 0.3027 0.024 0.1949
## TEAM_PITCHING_HR 0.19 0.0729 0.455 -0.5678
## TEAM_PITCHING_BB 0.12 0.0942 0.178 -0.0022
## TEAM_PITCHING_SO NA NA NA NA
## TEAM_FIELDING_E -0.18 0.2649 -0.235 0.5098
## TEAM_FIELDING_DP NA NA NA NA
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## TARGET_WINS 0.1762 0.233 NA
## TEAM_BATTING_H -0.0065 -0.072 NA
## TEAM_BATTING_2B 0.4354 0.256 NA
## TEAM_BATTING_3B -0.6356 -0.287 NA
## TEAM_BATTING_HR 1.0000 0.514 NA
## TEAM_BATTING_BB 0.5137 1.000 NA
## TEAM_BATTING_SO NA NA 1
## TEAM_BASERUN_SB NA NA NA
## TEAM_BASERUN_CS NA NA NA
## TEAM_BATTING_HBP NA NA NA
## TEAM_PITCHING_H -0.2501 -0.450 NA
## TEAM_PITCHING_HR 0.9694 0.460 NA
## TEAM_PITCHING_BB 0.1369 0.489 NA
## TEAM_PITCHING_SO NA NA NA
## TEAM_FIELDING_E -0.5873 -0.656 NA
## TEAM_FIELDING_DP NA NA NA
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP
## TARGET_WINS NA NA NA
## TEAM_BATTING_H NA NA NA
## TEAM_BATTING_2B NA NA NA
## TEAM_BATTING_3B NA NA NA
## TEAM_BATTING_HR NA NA NA
## TEAM_BATTING_BB NA NA NA
## TEAM_BATTING_SO NA NA NA
## TEAM_BASERUN_SB 1 NA NA
## TEAM_BASERUN_CS NA 1 NA
## TEAM_BATTING_HBP NA NA 1
## TEAM_PITCHING_H NA NA NA
## TEAM_PITCHING_HR NA NA NA
## TEAM_PITCHING_BB NA NA NA
## TEAM_PITCHING_SO NA NA NA
## TEAM_FIELDING_E NA NA NA
## TEAM_FIELDING_DP NA NA NA
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## TARGET_WINS -0.110 0.189 0.1242
## TEAM_BATTING_H 0.303 0.073 0.0942
## TEAM_BATTING_2B 0.024 0.455 0.1781
## TEAM_BATTING_3B 0.195 -0.568 -0.0022
## TEAM_BATTING_HR -0.250 0.969 0.1369
## TEAM_BATTING_BB -0.450 0.460 0.4894
## TEAM_BATTING_SO NA NA NA
## TEAM_BASERUN_SB NA NA NA
## TEAM_BASERUN_CS NA NA NA
## TEAM_BATTING_HBP NA NA NA
## TEAM_PITCHING_H 1.000 -0.142 0.3207
## TEAM_PITCHING_HR -0.142 1.000 0.2219
## TEAM_PITCHING_BB 0.321 0.222 1.0000
## TEAM_PITCHING_SO NA NA NA
## TEAM_FIELDING_E 0.668 -0.493 -0.0228
## TEAM_FIELDING_DP NA NA NA
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## TARGET_WINS NA -0.176 NA
## TEAM_BATTING_H NA 0.265 NA
## TEAM_BATTING_2B NA -0.235 NA
## TEAM_BATTING_3B NA 0.510 NA
## TEAM_BATTING_HR NA -0.587 NA
## TEAM_BATTING_BB NA -0.656 NA
## TEAM_BATTING_SO NA NA NA
## TEAM_BASERUN_SB NA NA NA
## TEAM_BASERUN_CS NA NA NA
## TEAM_BATTING_HBP NA NA NA
## TEAM_PITCHING_H NA 0.668 NA
## TEAM_PITCHING_HR NA -0.493 NA
## TEAM_PITCHING_BB NA -0.023 NA
## TEAM_PITCHING_SO 1 NA NA
## TEAM_FIELDING_E NA 1.000 NA
## TEAM_FIELDING_DP NA NA 1
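The NA cells in the matrix above come from variables with missing values, since `cor()` by default returns NA for any pair that involves an NA. A minimal sketch of one alternative, using the `use = "pairwise.complete.obs"` option of base R's `cor()` on a toy data frame (the column names are made up for illustration):

```r
# Toy data (not the moneyball data); steals has two missing values
toy <- data.frame(
  wins   = c(70, 80, 90, 85, 60, 95),
  hits   = c(1300, 1400, 1500, 1450, 1250, 1550),
  steals = c(NA, 100, 120, NA, 80, 150)
)

# Default behaviour: any pair involving a column with NA yields NA
default_cor <- cor(toy)

# Pairwise deletion: each pair uses the rows where both values are present
pairwise_cor <- cor(toy, use = "pairwise.complete.obs")
print(round(pairwise_cor, 2))
```

With pairwise deletion each cell may be based on a different number of rows, so the cells are not strictly comparable with one another.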
The following runs Pearson correlation tests between each variable and the variable TARGET_WINS.
# Note: getOption("na.action") is na.omit
cor.test(trainData$TEAM_BATTING_H, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_BATTING_H and trainData$TARGET_WINS
## t = 20, df = 2274, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.35 0.42
## sample estimates:
## cor
## 0.39
cor.test(trainData$TEAM_BATTING_2B, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_BATTING_2B and trainData$TARGET_WINS
## t = 14, df = 2274, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.25 0.33
## sample estimates:
## cor
## 0.29
cor.test(trainData$TEAM_BATTING_3B, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_BATTING_3B and trainData$TARGET_WINS
## t = 7, df = 2274, p-value = 0.000000000008
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.10 0.18
## sample estimates:
## cor
## 0.14
cor.test(trainData$TEAM_BATTING_HR, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_BATTING_HR and trainData$TARGET_WINS
## t = 9, df = 2274, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.14 0.22
## sample estimates:
## cor
## 0.18
cor.test(trainData$TEAM_BATTING_BB, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_BATTING_BB and trainData$TARGET_WINS
## t = 11, df = 2274, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.19 0.27
## sample estimates:
## cor
## 0.23
cor.test(trainData$TEAM_BATTING_HBP, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_BATTING_HBP and trainData$TARGET_WINS
## t = 1, df = 189, p-value = 0.3
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.069 0.213
## sample estimates:
## cor
## 0.074
cor.test(trainData$TEAM_BATTING_SO, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_BATTING_SO and trainData$TARGET_WINS
## t = -1, df = 2172, p-value = 0.1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.074 0.010
## sample estimates:
## cor
## -0.032
cor.test(trainData$TEAM_BASERUN_SB, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_BASERUN_SB and trainData$TARGET_WINS
## t = 6, df = 2143, p-value = 0.0000000003
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.093 0.176
## sample estimates:
## cor
## 0.14
cor.test(trainData$TEAM_BASERUN_CS, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_BASERUN_CS and trainData$TARGET_WINS
## t = 0.9, df = 1502, p-value = 0.4
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.028 0.073
## sample estimates:
## cor
## 0.022
cor.test(trainData$TEAM_FIELDING_E, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_FIELDING_E and trainData$TARGET_WINS
## t = -9, df = 2274, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.22 -0.14
## sample estimates:
## cor
## -0.18
cor.test(trainData$TEAM_FIELDING_DP, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_FIELDING_DP and trainData$TARGET_WINS
## t = -2, df = 1988, p-value = 0.1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.0787 0.0091
## sample estimates:
## cor
## -0.035
cor.test(trainData$TEAM_PITCHING_BB, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_PITCHING_BB and trainData$TARGET_WINS
## t = 6, df = 2274, p-value = 0.000000003
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.084 0.164
## sample estimates:
## cor
## 0.12
cor.test(trainData$TEAM_PITCHING_H, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_PITCHING_H and trainData$TARGET_WINS
## t = -5, df = 2274, p-value = 0.0000001
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.150 -0.069
## sample estimates:
## cor
## -0.11
cor.test(trainData$TEAM_PITCHING_HR, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_PITCHING_HR and trainData$TARGET_WINS
## t = 9, df = 2274, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.15 0.23
## sample estimates:
## cor
## 0.19
cor.test(trainData$TEAM_PITCHING_SO, trainData$TARGET_WINS)
##
## Pearson's product-moment correlation
##
## data: trainData$TEAM_PITCHING_SO and trainData$TARGET_WINS
## t = -4, df = 2172, p-value = 0.0003
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.120 -0.037
## sample estimates:
## cor
## -0.078
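The fifteen separate `cor.test()` calls above can also be collected into one table. The following is a sketch on simulated toy data rather than the moneyball data; the column names and the `results` layout are assumptions made for illustration.

```r
set.seed(1)  # toy data only

# Simulated stand-in for trainData: one target and two predictors
toy <- data.frame(wins = rnorm(50, mean = 80, sd = 10))
toy$hits   <- 1400 + 5 * (toy$wins - 80) + rnorm(50, sd = 40)
toy$errors <- 200 - 2 * (toy$wins - 80) + rnorm(50, sd = 40)
toy$errors[1:5] <- NA  # cor.test() drops incomplete pairs

# Run a Pearson test of every predictor against the target
predictors <- setdiff(names(toy), "wins")
results <- do.call(rbind, lapply(predictors, function(v) {
  test <- cor.test(toy[[v]], toy$wins)
  data.frame(variable = v,
             cor      = unname(test$estimate),
             p_value  = test$p.value,
             df       = unname(test$parameter))
}))
print(results)
```

The `df` column makes it easy to see how many complete pairs each test used, which matters here because the variables with missing values are tested on fewer rows.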
p1 <- ggplot(trainData) +
aes(x = TEAM_BATTING_H, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
p2 <- ggplot(trainData) +
aes(x = TEAM_BATTING_2B, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
p3 <- ggplot(trainData) +
aes(x = TEAM_BATTING_3B, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
p4 <- ggplot(trainData) +
aes(x = TEAM_BATTING_HR, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
p5 <- ggplot(trainData) +
aes(x = TEAM_BATTING_BB, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
p6 <- ggplot(trainData) +
aes(x = TEAM_BATTING_HBP, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
p7 <- ggplot(trainData) +
aes(x = TEAM_BATTING_SO, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
p8 <- ggplot(trainData) +
aes(x = TEAM_BASERUN_SB, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
p9 <- ggplot(trainData) +
aes(x = TEAM_BASERUN_CS, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
p10 <- ggplot(trainData) +
aes(x = TEAM_FIELDING_E, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
p11 <- ggplot(trainData) +
aes(x = TEAM_FIELDING_DP, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
p12 <- ggplot(trainData) +
aes(x = TEAM_PITCHING_BB, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
p13 <- ggplot(trainData) +
aes(x = TEAM_PITCHING_H, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
p14 <- ggplot(trainData) +
aes(x = TEAM_PITCHING_HR, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
p15 <- ggplot(trainData) +
aes(x = TEAM_PITCHING_SO, y = TARGET_WINS) +
geom_point(colour = "dodgerblue") +
theme_minimal()
# Empty Plot
p_empty <- ggplot() +
theme_void()
multiplot(p1, p2, p3, p4, p5, p6, cols=3)
multiplot(p7, p8, p9, p10, p11, p12, cols=3)
multiplot(p13, p14, p15, p_empty, p_empty, p_empty, cols=3)
The following gives a chart of the percentages of missing values by variable for the training data set.
# Percentage of missing values by the variable
miss_var_summary(trainData)
The missing values will be handled in the data preparation section of this report.
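For readers without `naniar`, the same percentages can be computed in base R. This is a small sketch on a toy data frame (not the moneyball data):

```r
# Toy data (not the moneyball data)
toy <- data.frame(
  wins   = c(70, 80, 90, 85),
  steals = c(NA, 100, 120, NA),
  hbp    = c(NA, NA, NA, 50)
)

# is.na() gives a logical matrix; column means are the missing shares
pct_missing <- sort(colMeans(is.na(toy)) * 100, decreasing = TRUE)
print(pct_missing)
```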
Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.
According to an article by Dan Berdikulov, when 60-70% of a variable's values are missing, dropping the variable should be considered. With 91.6% of its values missing, the variable TEAM_BATTING_HBP will be dropped.
# Remove the TEAM_BATTING_HBP column
trainData <- subset(trainData, select = -TEAM_BATTING_HBP)
evalData <- subset(evalData, select = -TEAM_BATTING_HBP)
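The drop rule can also be written as a threshold check so that the same columns are removed from both data sets. The sketch below assumes a cutoff of 0.6 (the low end of the range quoted above); `drop_sparse_columns` is a hypothetical helper and the data frames are toy stand-ins.

```r
# Hypothetical helper: drop columns whose missing share in the training data
# exceeds the threshold, and drop the same columns from a second data set.
drop_sparse_columns <- function(train, other, threshold = 0.6) {
  missing_share <- colMeans(is.na(train))
  sparse <- names(missing_share)[missing_share > threshold]
  list(
    train   = train[, setdiff(names(train), sparse), drop = FALSE],
    other   = other[, setdiff(names(other), sparse), drop = FALSE],
    dropped = sparse
  )
}

# Toy data (not the moneyball data)
toy_train <- data.frame(wins = c(70, 80, 90, 85),
                        steals = c(90, 100, NA, 110),
                        hbp = c(NA, NA, NA, 50))
toy_eval  <- data.frame(steals = c(95, 105), hbp = c(40, NA))

cleaned <- drop_sparse_columns(toy_train, toy_eval)
print(cleaned$dropped)
```

Deciding which columns to drop from the training data alone keeps the evaluation data from influencing the choice.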
I filled the remaining missing values using predictive mean matching (PMM), which suits numeric variables, via the MICE (Multivariate Imputation by Chained Equations) package.
#TEAM_BASERUN_CS
#TEAM_FIELDING_DP
#TEAM_BASERUN_SB
#TEAM_BATTING_SO
#TEAM_PITCHING_SO
# https://www.rdocumentation.org/packages/mice/versions/3.13.0/topics/mice.impute.pmm
set.seed(91421)
temp <- mice(trainData, m=5, maxit=5, meth='pmm')
##
## iter imp variable
## 1 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
tempData <- complete(temp, 1)
trainData$TEAM_BASERUN_CS <- tempData$TEAM_BASERUN_CS
trainData$TEAM_FIELDING_DP <- tempData$TEAM_FIELDING_DP
trainData$TEAM_BASERUN_SB <-tempData$TEAM_BASERUN_SB
trainData$TEAM_BATTING_SO <- tempData$TEAM_BATTING_SO
trainData$TEAM_PITCHING_SO <- tempData$TEAM_PITCHING_SO
temp <- mice(evalData, m=5, maxit=5, meth='pmm')
##
## iter imp variable
## 1 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_SO TEAM_FIELDING_DP
tempData <- complete(temp, 1)
evalData$TEAM_BASERUN_CS <- tempData$TEAM_BASERUN_CS
evalData$TEAM_FIELDING_DP <- tempData$TEAM_FIELDING_DP
evalData$TEAM_BASERUN_SB <-tempData$TEAM_BASERUN_SB
evalData$TEAM_BATTING_SO <- tempData$TEAM_BATTING_SO
evalData$TEAM_PITCHING_SO <- tempData$TEAM_PITCHING_SO
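To make the idea behind predictive mean matching concrete, here is a deliberately simplified single-variable sketch in base R. It is not the `mice` algorithm: `mice` additionally draws the regression coefficients from their posterior and cycles through all incomplete variables, both of which are omitted here. The helper `pmm_impute`, its `donors` parameter, and the toy data are assumptions made for illustration.

```r
# Simplified predictive mean matching for one numeric column (illustrative only).
# For each missing y, find the observed cases whose predicted values are closest
# and borrow the observed y of one of them at random.
pmm_impute <- function(y, x, donors = 5) {
  observed <- !is.na(y)
  dat <- data.frame(y = y, x = x)
  fit <- lm(y ~ x, data = dat[observed, ])
  predicted <- predict(fit, newdata = dat)
  obs_idx <- which(observed)
  n_donors <- min(donors, length(obs_idx))
  for (i in which(!observed)) {
    distance <- abs(predicted[obs_idx] - predicted[i])
    nearest <- obs_idx[order(distance)[seq_len(n_donors)]]
    y[i] <- y[nearest[sample.int(length(nearest), 1)]]
  }
  y
}

set.seed(1)  # toy data only
x <- 1:20
y <- 2 * x + rnorm(20)
y_missing <- y
y_missing[c(3, 10, 17)] <- NA

filled <- pmm_impute(y_missing, x)
print(filled[c(3, 10, 17)])
```

Because every imputed value is copied from an observed case, the filled-in values always stay within the range of the real data, which is one reason PMM is a common choice for count-like statistics.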
Let’s look at a summary of the imputed data set.
summary(trainData)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## Min. : 0 Min. : 891 Min. : 69 Min. : 0 Min. : 0
## 1st Qu.: 71 1st Qu.:1383 1st Qu.:208 1st Qu.: 34 1st Qu.: 42
## Median : 82 Median :1454 Median :238 Median : 47 Median :102
## Mean : 81 Mean :1469 Mean :241 Mean : 55 Mean :100
## 3rd Qu.: 92 3rd Qu.:1537 3rd Qu.:273 3rd Qu.: 72 3rd Qu.:147
## Max. :146 Max. :2554 Max. :458 Max. :223 Max. :264
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.:451 1st Qu.: 542 1st Qu.: 67 1st Qu.: 43
## Median :512 Median : 733 Median :106 Median : 57
## Mean :502 Mean : 728 Mean :136 Mean : 77
## 3rd Qu.:580 3rd Qu.: 925 3rd Qu.:170 3rd Qu.: 91
## Max. :878 Max. :1399 Max. :697 Max. :201
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## Min. : 1137 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 1419 1st Qu.: 50 1st Qu.: 476 1st Qu.: 613
## Median : 1518 Median :107 Median : 536 Median : 804
## Mean : 1779 Mean :106 Mean : 553 Mean : 812
## 3rd Qu.: 1682 3rd Qu.:150 3rd Qu.: 611 3rd Qu.: 958
## Max. :30132 Max. :343 Max. :3645 Max. :19278
## TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 65 Min. : 52
## 1st Qu.: 127 1st Qu.:126
## Median : 159 Median :146
## Mean : 246 Mean :142
## 3rd Qu.: 249 3rd Qu.:162
## Max. :1898 Max. :228
# https://topepo.github.io/caret/pre-processing.html#centering-and-scaling
trainTransformed <- trainData
preProcessValues <- preProcess(trainTransformed, method = c("BoxCox", "center", "scale"))
trainTransformed <- predict(preProcessValues, trainTransformed)
trainTransformed2 <- evalData
preProcessValues <- preProcess(trainTransformed2, method = c("BoxCox", "center", "scale"))
trainTransformed2 <- predict(preProcessValues, trainTransformed2)
trainData$TEAM_PITCHING_SO <- trainTransformed$TEAM_PITCHING_SO
trainData$TEAM_PITCHING_BB <- trainTransformed$TEAM_PITCHING_BB
trainData$TEAM_BASERUN_SB <- trainTransformed$TEAM_BASERUN_SB
trainData$TEAM_BASERUN_CS <- trainTransformed$TEAM_BASERUN_CS
trainData$TEAM_PITCHING_H <- log(trainData$TEAM_PITCHING_H)
trainData$TEAM_FIELDING_E <- trainTransformed$TEAM_FIELDING_E
evalData$TEAM_PITCHING_SO <- trainTransformed2$TEAM_PITCHING_SO
evalData$TEAM_PITCHING_BB <- trainTransformed2$TEAM_PITCHING_BB
evalData$TEAM_BASERUN_SB <- trainTransformed2$TEAM_BASERUN_SB
evalData$TEAM_BASERUN_CS <- trainTransformed2$TEAM_BASERUN_CS
evalData$TEAM_PITCHING_H <- log(evalData$TEAM_PITCHING_H)
evalData$TEAM_FIELDING_E <- trainTransformed2$TEAM_FIELDING_E
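Centering and scaling subtract a mean and divide by a standard deviation. The code above estimates those parameters separately for the training and evaluation data. A common alternative, sketched here in base R purely as an illustration, reuses the training parameters on new data so that both sets share one scale. The Box-Cox step is left out of this sketch, and the vectors are made-up values.

```r
# Toy values (not the moneyball data)
train_errors <- c(100, 150, 200, 250)
new_errors   <- c(120, 300)

# Estimate the parameters on the training values only
center <- mean(train_errors)
spread <- sd(train_errors)

# Apply the same parameters to both sets
train_scaled <- (train_errors - center) / spread
new_scaled   <- (new_errors - center) / spread

print(round(train_scaled, 2))
print(round(new_scaled, 2))
```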
par(mfrow = c(3, 3))
plotData <- melt(trainData)
## No id variables; using all as measure variables
ggplot(plotData, aes(x= value)) +
theme(panel.border = element_blank(), panel.background = element_blank(),
panel.grid.major = element_blank(), panel.grid.minor = element_blank()) +
geom_density(fill='dodgerblue') + facet_wrap(~variable, scales = 'free')
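The variable TEAM_BATTING_H counts all base hits by batters (1B, 2B, 3B, and HR), but singles are not reported on their own. TEAM_BATTING_1B is therefore created by subtracting doubles, triples, and home runs from total hits, and TEAM_BATTING_H is then dropped.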
trainData$TEAM_BATTING_1B <- trainData$TEAM_BATTING_H - trainData$TEAM_BATTING_2B - trainData$TEAM_BATTING_3B - trainData$TEAM_BATTING_HR
evalData$TEAM_BATTING_1B <- evalData$TEAM_BATTING_H - evalData$TEAM_BATTING_2B - evalData$TEAM_BATTING_3B - evalData$TEAM_BATTING_HR
trainData <- subset(trainData, select = -TEAM_BATTING_H)
evalData <- subset(evalData, select = -TEAM_BATTING_H)
head(trainData)
Billy Beane of Moneyball fame was known to base the drafting of players on two statistics: OBP (On-base Percentage) and SLG (Slugging Percentage). Adding the two gives the OPS (On-base Plus Slugging) statistic.
Because I dropped the TEAM_BATTING_HBP statistic, I will use a modified OPS based on the variables available in the data set.
On-base Percentage (OBP) = (TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB) / PA
Assuming there are 162 games, 9 innings, and 3 chances at bat: Plate Appearances (PA) = (162 * 9 * 3) + TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR - TEAM_BASERUN_CS - TEAM_FIELDING_DP
Slugging Percentage (SLG) = (TEAM_BATTING_1B + 2 * TEAM_BATTING_2B + 3 * TEAM_BATTING_3B + 4 * TEAM_BATTING_HR) / (PA - TEAM_BATTING_BB)
OPS = OBP + SLG
trainData$PA <- (162 * 9 * 3) + trainData$TEAM_BATTING_1B + trainData$TEAM_BATTING_2B + trainData$TEAM_BATTING_3B + trainData$TEAM_BATTING_HR - trainData$TEAM_BASERUN_CS - trainData$TEAM_FIELDING_DP
trainData$OBP <- (trainData$TEAM_BATTING_1B + trainData$TEAM_BATTING_2B + trainData$TEAM_BATTING_3B + trainData$TEAM_BATTING_HR + trainData$TEAM_BATTING_BB) / trainData$PA
trainData$SLG <- (trainData$TEAM_BATTING_1B + 2 * trainData$TEAM_BATTING_2B + 3 * trainData$TEAM_BATTING_3B + 4 * trainData$TEAM_BATTING_HR) / (trainData$PA - trainData$TEAM_BATTING_BB)
trainData$OPS <- trainData$OBP + trainData$SLG
evalData$PA <- (162 * 9 * 3) + evalData$TEAM_BATTING_1B + evalData$TEAM_BATTING_2B + evalData$TEAM_BATTING_3B + evalData$TEAM_BATTING_HR - evalData$TEAM_BASERUN_CS - evalData$TEAM_FIELDING_DP
evalData$OBP <- (evalData$TEAM_BATTING_1B + evalData$TEAM_BATTING_2B + evalData$TEAM_BATTING_3B + evalData$TEAM_BATTING_HR + evalData$TEAM_BATTING_BB) / evalData$PA
evalData$SLG <- (evalData$TEAM_BATTING_1B + 2 * evalData$TEAM_BATTING_2B + 3 * evalData$TEAM_BATTING_3B + 4 * evalData$TEAM_BATTING_HR) / (evalData$PA - evalData$TEAM_BATTING_BB)
evalData$OPS <- evalData$OBP + evalData$SLG
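The same four formulas are applied to both trainData and evalData, so they could be wrapped in one function to keep the two data sets consistent. The following is a minimal sketch, not the code used in this report: the function name `add_batting_features` and the single-row `toy` data frame are hypothetical and exist only for illustration.

```r
# Hypothetical helper (not part of the original report): adds PA, OBP, SLG,
# and OPS to any data frame containing the columns used in the formulas above.
add_batting_features <- function(df) {
  hits <- df$TEAM_BATTING_1B + df$TEAM_BATTING_2B +
    df$TEAM_BATTING_3B + df$TEAM_BATTING_HR
  total_bases <- df$TEAM_BATTING_1B + 2 * df$TEAM_BATTING_2B +
    3 * df$TEAM_BATTING_3B + 4 * df$TEAM_BATTING_HR

  # Same modified definitions as in the text (HBP is not used).
  df$PA  <- (162 * 9 * 3) + hits - df$TEAM_BASERUN_CS - df$TEAM_FIELDING_DP
  df$OBP <- (hits + df$TEAM_BATTING_BB) / df$PA
  df$SLG <- total_bases / (df$PA - df$TEAM_BATTING_BB)
  df$OPS <- df$OBP + df$SLG
  df
}

# Made-up single-team record, for illustration only.
toy <- data.frame(
  TEAM_BATTING_1B = 1000, TEAM_BATTING_2B = 250, TEAM_BATTING_3B = 30,
  TEAM_BATTING_HR = 150, TEAM_BATTING_BB = 500,
  TEAM_BASERUN_CS = 50, TEAM_FIELDING_DP = 150
)
toy <- add_batting_features(toy)
print(toy[, c("PA", "OBP", "SLG", "OPS")])
```

With such a helper, `trainData <- add_batting_features(trainData)` and `evalData <- add_batting_features(evalData)` would replace the eight assignments above.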
Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Since we have not yet covered automated variable selection methods, you should select the variables manually (unless you previously learned Forward or Stepwise selection, etc.). Since you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.
This model uses the variables related to batting: TEAM_BATTING_1B, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, and TEAM_BATTING_BB. These variables were selected because they had a positive correlation coefficient in the first section.
## [1] "TARGET_WINS" "TEAM_BATTING_2B" "TEAM_BATTING_3B" "TEAM_BATTING_HR"
## [5] "TEAM_BATTING_BB" "TEAM_BATTING_SO" "TEAM_BASERUN_SB" "TEAM_BASERUN_CS"
## [9] "TEAM_PITCHING_H" "TEAM_PITCHING_HR" "TEAM_PITCHING_BB" "TEAM_PITCHING_SO"
## [13] "TEAM_FIELDING_E" "TEAM_FIELDING_DP" "TEAM_BATTING_1B" "PA"
## [17] "OBP" "SLG" "OPS"
m1 <- lm(TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB, data = trainData, na.action = na.omit)
summary(m1)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB, data = trainData,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.41 -8.60 0.52 9.14 55.28
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.32154 3.46608 0.96 0.34
## TEAM_BATTING_1B 0.03746 0.00308 12.18 < 0.0000000000000002 ***
## TEAM_BATTING_2B 0.02969 0.00751 3.95 0.00008 ***
## TEAM_BATTING_3B 0.13614 0.01498 9.09 < 0.0000000000000002 ***
## TEAM_BATTING_HR 0.08644 0.00788 10.97 < 0.0000000000000002 ***
## TEAM_BATTING_BB 0.02786 0.00280 9.93 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14 on 2270 degrees of freedom
## Multiple R-squared: 0.236, Adjusted R-squared: 0.234
## F-statistic: 140 on 5 and 2270 DF, p-value: <0.0000000000000002
The summary of the model yields an \(R^2\) of 0.236. This means 23.6% of the variability in wins is explained by TEAM_BATTING_1B, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, and TEAM_BATTING_BB.
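As a quick check on how the two R-squared figures in the summary relate, the adjusted value can be recomputed from the reported \(R^2\), the residual degrees of freedom, and the number of predictors. The sketch below uses only the numbers printed for m1; the second half verifies the definition \(R^2 = 1 - RSS/TSS\) on the built-in `mtcars` data, which is a stand-in chosen for illustration and has nothing to do with the moneyball data.

```r
# Values copied from the m1 summary output.
r2 <- 0.236            # Multiple R-squared
resid_df <- 2270       # residual degrees of freedom
p <- 5                 # number of predictors, excluding the intercept
n <- resid_df + p + 1  # observations used in the fit

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))

# Definition of R-squared checked against lm() on a stand-in data set.
fit <- lm(mpg ~ wt + hp, data = mtcars)
rss <- sum(residuals(fit)^2)                    # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
r2_manual <- 1 - rss / tss
print(all.equal(r2_manual, summary(fit)$r.squared))
```

Up to rounding of the inputs, the first calculation reproduces the adjusted \(R^2\) of 0.234 reported in the summary.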
This model uses the variables related to batting and pitching statistics: TEAM_BATTING_1B, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_PITCHING_SO, TEAM_PITCHING_HR, TEAM_PITCHING_H, and TEAM_PITCHING_BB.
m2 <- lm(TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_PITCHING_SO + TEAM_PITCHING_HR + TEAM_PITCHING_H + TEAM_PITCHING_BB, data = trainData, na.action = na.omit)
summary(m2)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_PITCHING_SO +
## TEAM_PITCHING_HR + TEAM_PITCHING_H + TEAM_PITCHING_BB, data = trainData,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.67 -8.75 0.41 9.00 52.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.88562 16.62033 4.81 0.000001637 ***
## TEAM_BATTING_1B 0.05544 0.00393 14.11 < 0.0000000000000002 ***
## TEAM_BATTING_2B 0.03418 0.00759 4.50 0.000007067 ***
## TEAM_BATTING_3B 0.14191 0.01510 9.40 < 0.0000000000000002 ***
## TEAM_BATTING_HR 0.06144 0.02854 2.15 0.03140 *
## TEAM_BATTING_BB 0.01676 0.00635 2.64 0.00837 **
## TEAM_PITCHING_SO 1.53815 0.46152 3.33 0.00087 ***
## TEAM_PITCHING_HR 0.02917 0.02537 1.15 0.25022
## TEAM_PITCHING_H -12.46648 2.21001 -5.64 0.000000019 ***
## TEAM_PITCHING_BB 0.13646 0.71012 0.19 0.84764
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14 on 2266 degrees of freedom
## Multiple R-squared: 0.254, Adjusted R-squared: 0.251
## F-statistic: 85.6 on 9 and 2266 DF, p-value: <0.0000000000000002
The summary of the model yields an \(R^2\) of 0.254. This means 25.4% of the variability in wins is explained by TEAM_BATTING_1B, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_PITCHING_SO, TEAM_PITCHING_HR, TEAM_PITCHING_H, and TEAM_PITCHING_BB.
This model uses the features we created: PA, OBP, SLG, and OPS.
m3 <- lm(TARGET_WINS ~ PA + OBP + SLG + OPS, data = trainData, na.action = na.omit)
summary(m3)
##
## Call:
## lm(formula = TARGET_WINS ~ PA + OBP + SLG + OPS, data = trainData,
## na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -61.53 -8.79 0.48 9.23 49.88
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -125.80905 11.63466 -10.81 < 0.0000000000000002 ***
## PA 0.02562 0.00228 11.25 < 0.0000000000000002 ***
## OBP 131.26455 18.59171 7.06 0.0000000000022 ***
## SLG 37.31747 10.20266 3.66 0.00026 ***
## OPS NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14 on 2272 degrees of freedom
## Multiple R-squared: 0.233, Adjusted R-squared: 0.232
## F-statistic: 230 on 3 and 2272 DF, p-value: <0.0000000000000002
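The summary of the model yields an \(R^2\) of 0.233. This means 23.3% of the variability in wins is explained by PA, OBP, and SLG. The OPS coefficient is reported as NA because OPS is exactly OBP + SLG, so it adds no information once both of those variables are in the model.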
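The "1 not defined because of singularities" note in the output can be reproduced on a small scale. The sketch below uses simulated data (the column values and the `WINS` response are made up) to show that `lm()` returns NA for a predictor that is an exact sum of two others, and that dropping it leaves the fit unchanged.

```r
set.seed(1)
n <- 50

# Simulated stand-in data; these are not values from the moneyball data set.
toy <- data.frame(
  OBP = runif(n, 0.28, 0.36),
  SLG = runif(n, 0.35, 0.48)
)
toy$OPS  <- toy$OBP + toy$SLG                       # exact linear combination
toy$WINS <- 200 * toy$OBP + 60 * toy$SLG + rnorm(n, sd = 5)

fit_all   <- lm(WINS ~ OBP + SLG + OPS, data = toy)  # OPS is redundant here
fit_no_ops <- lm(WINS ~ OBP + SLG, data = toy)

print(coef(fit_all))  # the OPS coefficient is NA

# Both fits explain exactly the same share of variance.
print(all.equal(summary(fit_all)$r.squared, summary(fit_no_ops)$r.squared))
```

If OPS itself is the quantity of interest, one option is to fit it without OBP and SLG in the same formula, for example `lm(WINS ~ OPS, data = toy)`.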
This model uses the created features together with variables that have a theoretical positive effect on wins: TEAM_BASERUN_SB, TEAM_FIELDING_DP, and TEAM_PITCHING_SO.
m4 <- lm(TARGET_WINS ~ PA + OBP + SLG + OPS + TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_PITCHING_SO, data = trainData, na.action = na.omit)
summary(m4)
##
## Call:
## lm(formula = TARGET_WINS ~ PA + OBP + SLG + OPS + TEAM_BASERUN_SB +
## TEAM_FIELDING_DP + TEAM_PITCHING_SO, data = trainData, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.99 -8.46 0.44 9.01 55.33
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -56.71037 13.58321 -4.18 0.00003091877566307 ***
## PA 0.01099 0.00269 4.09 0.00004440169213577 ***
## OBP 152.55524 18.77988 8.12 0.00000000000000074 ***
## SLG 86.33136 11.38447 7.58 0.00000000000004882 ***
## OPS NA NA NA NA
## TEAM_BASERUN_SB 1.88715 0.37430 5.04 0.00000049772941154 ***
## TEAM_FIELDING_DP -0.09180 0.01302 -7.05 0.00000000000231713 ***
## TEAM_PITCHING_SO -0.03860 0.31113 -0.12 0.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14 on 2269 degrees of freedom
## Multiple R-squared: 0.273, Adjusted R-squared: 0.271
## F-statistic: 142 on 6 and 2269 DF, p-value: <0.0000000000000002
The summary of the model yields an \(R^2\) of 0.273. This means 27.3% of the variability in wins is explained by PA, OBP, SLG, TEAM_BASERUN_SB, TEAM_FIELDING_DP, and TEAM_PITCHING_SO. As in the previous model, OPS is not estimated because it is exactly OBP + SLG.
This model uses all of the original variables except TEAM_PITCHING_H, and none of the new features.
m5 <- lm(TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BATTING_HR +
TEAM_BATTING_SO + TEAM_FIELDING_DP + TEAM_BASERUN_CS + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
TEAM_BASERUN_SB + TEAM_FIELDING_E, data = trainData)
summary(m5)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BATTING_HR + TEAM_BATTING_SO +
## TEAM_FIELDING_DP + TEAM_BASERUN_CS + TEAM_PITCHING_HR + TEAM_PITCHING_BB +
## TEAM_PITCHING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.25 -7.92 0.06 8.16 66.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.83206 5.40367 4.60 0.0000045566070790 ***
## TEAM_BATTING_1B 0.04242 0.00359 11.81 < 0.0000000000000002 ***
## TEAM_BATTING_2B 0.01768 0.00729 2.43 0.0153 *
## TEAM_BATTING_3B 0.13109 0.01632 8.03 0.0000000000000015 ***
## TEAM_BATTING_BB 0.03559 0.00454 7.83 0.0000000000000072 ***
## TEAM_BATTING_HR 0.06371 0.02695 2.36 0.0181 *
## TEAM_BATTING_SO -0.01726 0.00257 -6.71 0.0000000000249938 ***
## TEAM_FIELDING_DP -0.10657 0.01254 -8.50 < 0.0000000000000002 ***
## TEAM_BASERUN_CS 1.27336 0.54919 2.32 0.0205 *
## TEAM_PITCHING_HR 0.02308 0.02383 0.97 0.3330
## TEAM_PITCHING_BB -1.86879 0.57826 -3.23 0.0012 **
## TEAM_PITCHING_SO 2.15667 0.48431 4.45 0.0000088756767528 ***
## TEAM_BASERUN_SB 2.80448 0.50983 5.50 0.0000000420801604 ***
## TEAM_FIELDING_E -8.40447 0.59325 -14.17 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13 on 2262 degrees of freedom
## Multiple R-squared: 0.34, Adjusted R-squared: 0.336
## F-statistic: 89.7 on 13 and 2262 DF, p-value: <0.0000000000000002
The summary of the model yields an \(R^2\) of 0.34. This means 34% of the variability in wins is explained by this model.
This model uses the original variables and OBP, SLG, and OPS.
m6 <- lm(TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BATTING_HR +
TEAM_BATTING_SO + TEAM_FIELDING_DP + TEAM_BASERUN_CS + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
TEAM_BASERUN_SB + TEAM_FIELDING_E + OBP + SLG + OPS, data = trainData)
summary(m6)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_1B + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BATTING_HR + TEAM_BATTING_SO +
## TEAM_FIELDING_DP + TEAM_BASERUN_CS + TEAM_PITCHING_HR + TEAM_PITCHING_BB +
## TEAM_PITCHING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + OBP +
## SLG + OPS, data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.78 -7.89 0.15 8.41 60.29
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -89.53683 21.97356 -4.07 0.000047652947848 ***
## TEAM_BATTING_1B -0.11529 0.03037 -3.80 0.00015 ***
## TEAM_BATTING_2B 0.00359 0.04612 0.08 0.93789
## TEAM_BATTING_3B 0.24964 0.07482 3.34 0.00086 ***
## TEAM_BATTING_BB -0.27523 0.05041 -5.46 0.000000052708656 ***
## TEAM_BATTING_HR 0.35882 0.11875 3.02 0.00254 **
## TEAM_BATTING_SO -0.02036 0.00263 -7.73 0.000000000000016 ***
## TEAM_FIELDING_DP -0.17425 0.01987 -8.77 < 0.0000000000000002 ***
## TEAM_BASERUN_CS 1.16584 0.54506 2.14 0.03255 *
## TEAM_PITCHING_HR 0.00465 0.02398 0.19 0.84622
## TEAM_PITCHING_BB -1.65484 0.57443 -2.88 0.00400 **
## TEAM_PITCHING_SO 2.51165 0.48335 5.20 0.000000221441374 ***
## TEAM_BASERUN_SB 3.08567 0.50971 6.05 0.000000001651868 ***
## TEAM_FIELDING_E -7.94156 0.59338 -13.38 < 0.0000000000000002 ***
## OBP 2083.77028 324.95919 6.41 0.000000000173859 ***
## SLG -732.98646 184.42976 -3.97 0.000072788262741 ***
## OPS NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13 on 2260 degrees of freedom
## Multiple R-squared: 0.352, Adjusted R-squared: 0.348
## F-statistic: 81.8 on 15 and 2260 DF, p-value: <0.0000000000000002
The summary of the model yields an \(R^2\) of 0.352. This means 35.2% of the variability in wins is explained by this model.
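With several fitted models, the summary statistics can be collected into one table rather than read off each printout. The following is a sketch under stated assumptions: `compare_models` is a hypothetical helper, and the three fits use the built-in `mtcars` data purely as a stand-in. AIC is included as one possible extra criterion; it is not used elsewhere in this report.

```r
# Hypothetical helper: one row of fit statistics per model in a named list.
compare_models <- function(models) {
  data.frame(
    model         = names(models),
    n_coef        = sapply(models, function(m) sum(!is.na(coef(m)))),
    r_squared     = sapply(models, function(m) summary(m)$r.squared),
    adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
    aic           = sapply(models, AIC),
    row.names     = NULL
  )
}

# Stand-in nested models on built-in data, for illustration only.
fits <- list(
  small  = lm(mpg ~ wt, data = mtcars),
  medium = lm(mpg ~ wt + hp, data = mtcars),
  large  = lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
)

comparison <- compare_models(fits)
print(comparison)
```

For this report the call would be `compare_models(list(m1 = m1, m2 = m2, m3 = m3, m4 = m4, m5 = m5, m6 = m6))`. Adjusted \(R^2\) and AIC both penalize extra predictors, which helps when weighing fit against parsimony.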
Decide on the criteria for selecting the best multiple linear regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.
I selected the last model (m6) because it has the highest \(R^2\) (0.352) and the highest adjusted \(R^2\) (0.348) of the six models. Next, I run this model on the evaluation data set.
predictions <- predict(m6, evalData)
Because the evaluation data does not have the TARGET_WINS variable, the accuracy of the model cannot be calculated on the evaluation data set. However, the first few predictions are shown below:
head(predictions)
## 1 2 3 4 5 6
## 61 67 74 87 60 70
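Since the evaluation data has no TARGET_WINS column, one way to estimate out-of-sample error is to hold back part of the training data. The sketch below was not done in this report; `holdout_rmse` is a hypothetical helper, the 20% split is an arbitrary choice, and `mtcars` stands in for trainData.

```r
# Hypothetical helper: fit on a random subset of rows and report RMSE on the rest.
holdout_rmse <- function(formula, data, test_fraction = 0.2) {
  n <- nrow(data)
  test_idx <- sample(n, size = floor(test_fraction * n))

  fit  <- lm(formula, data = data[-test_idx, ])
  pred <- predict(fit, newdata = data[test_idx, ])

  # The first variable in the formula is the response.
  actual <- data[test_idx, all.vars(formula)[1]]
  sqrt(mean((actual - pred)^2, na.rm = TRUE))
}

set.seed(42)  # fixed seed so the split is reproducible
rmse <- holdout_rmse(mpg ~ wt + hp, mtcars)
print(rmse)
```

Applied here, the call would look like `holdout_rmse(formula(m6), trainData)`, giving an error in units of wins that can be compared across the candidate models.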