Data 621 Homework #1

Load Libraries

# loading libraries
library(dplyr)
library(DataExplorer)
library(ggplot2)
library(tidyr)
library(corrplot)
library(mice)
library(reshape)
library(jtools)  # use of summ()
library(e1071)  # check skewness
library(gtsummary)
library(caret)  #data splitting

\(~\)

Data Overview

The data set contains approximately 2276 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Below is a short description of the variables:

\(~\)

Objective

To build a multiple linear regression model on the training data to predict TARGET_WINS, which is the number of wins for the team.

\(~\)

Data Exploration

Due to the number of fields in this data, I broke the dataset into intuitive sections and explored each section individually.

# read csv files
eval_df <- read.csv("https://raw.githubusercontent.com/letisalba/Data_621/master/Homework_1/csv/moneyball-evaluation-data.csv")
train_df <- read.csv("https://raw.githubusercontent.com/letisalba/Data_621/master/Homework_1/csv/moneyball-training-data.csv")

\(~\)

print(paste0("Number of observations: ", nrow(train_df)))

## [1] "Number of observations: 2276"

print(paste0("Observations per year, 1871 - 2006: ", round(nrow(train_df)/(2006 -
    1871), 2)))

## [1] "Observations per year, 1871 - 2006: 16.86"

The assignment mentions that some of the season records were adjusted to match the performance during a 162-game season. There are 2276 seasons in the training set. Observations span 128 years, with an average of 17 teams playing per year.

\(~\)

# distribution
plot_histogram(train_df)

# against the response variable
plot_scatterplot(train_df, by = "TARGET_WINS")

## Warning: Removed 1005 rows containing missing values (geom_point).

## Warning: Removed 2473 rows containing missing values (geom_point).

corrplot(cor(train_df[, 2:17], use = "complete.obs"), tl.cex = 0.5)

Looking at the correlation plot, there appear to be several strong correlations between explanatory variables and the target. From an initial inspection, it appears the team should focus on getting players on base through hits or walks. Teams can still win if the pitchers allow homeruns, hits and walks to the other team.

\(~\)

Variables with Highest Positive Correlation with TARGET_WINS:

TEAM_BATTING_H = 0.47
TEAM_BATTING_HR = 0.42
TEAM_BATTING_BB = 0.47
TEAM_PITCHING_H = 0.47
TEAM_PITCHING_HR = 0.42
TEAM_PITCHING_BB = 0.47

To win more games it makes sense the team will need to make fewer errors.

\(~\)

Variables with Strongly Negative Correlation with TARGET_WINS:

There were several batting variables which were related.

\(~\)

Positive Correlations between variables:

TEAM_PITCHING_H and TEAM_BATTING_H = 0.99
TEAM_PITCHING_HR and TEAM_BATTING_HR = 0.99
TEAM_PITCHING_BB and TEAM_BATTING_BB = 0.99
TEAM_PITCHING_SO and TEAM_BATTING_SO = 0.99

\(~\)

Base Hits by Batter

TARGET_WINS - Number of wins
TEAM_BATTING_H - Base Hits by batters (1B,2B,3B,HR)
TEAM_BATTING_2B - Doubles by batters (2B)
TEAM_BATTING_3B - Triples by batters (3B)
TEAM_BATTING_HR - Homeruns by batters (4B)

The means and medians are very similar for the base hits variables implying little skew to the distributions.

train_df %>%
    select(c("TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_2B",
        "TEAM_BATTING_3B", "TEAM_BATTING_HR")) %>%
    gtsummary::tbl_summary(statistic = list(c("TARGET_WINS",
        "TEAM_BATTING_H", "TEAM_BATTING_2B", "TEAM_BATTING_3B",
        "TEAM_BATTING_HR") ~ "{mean} {median} {sd}"))

Characteristic	N = 2,276¹
TARGET_WINS	81 82 16
TEAM_BATTING_H	1,469 1,454 145
TEAM_BATTING_2B	241 238 47
TEAM_BATTING_3B	55 47 28
TEAM_BATTING_HR	100 102 61
¹ Mean Median SD

\(~\)

We see tight distributions except for all base hits by batters (TEAM_BATTING_H).

temp <- train_df %>%
    select(c("TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_2B",
        "TEAM_BATTING_3B", "TEAM_BATTING_HR"))
ggplot2::ggplot(stack(temp), aes(x = ind, y = values)) + geom_boxplot() +
    labs(title = "Base Hit Variables")

\(~\)

Unsurprisingly, all possible base hits (TEAM_BATTING_H) is correlated with winning. As you increase the number of bases achieved by an at bat, the correlation decreases.

Interestingly, doubles and triples are correlated with base hits while home runs are not.

train_df %>%
    select(c("TARGET_WINS", "TEAM_BATTING_H", "TEAM_BATTING_2B",
        "TEAM_BATTING_3B", "TEAM_BATTING_HR")) %>%
    GGally::ggpairs()

\(~\)

Batting

TARGET_WINS - Number of wins
TEAM_BATTING_BB - Walks by batters
TEAM_BATTING_HBP - Batters hit by pitch (get a free base)
TEAM_BATTING_SO - Strikeouts by batters
TEAM_BASERUN_SB - Stolen bases
TEAM_BASERUN_CS - Caught stealing

The measures of central tendency show us that most of these variable have slight skew to their distributions. Stolen bases has a large right skew to its distribution.

We are missing values for strikeouts, stolen bases and caught stealing.

train_df %>%
    select(c("TEAM_BATTING_BB", "TEAM_BATTING_SO", "TEAM_BASERUN_SB",
        "TEAM_BASERUN_CS")) %>%
    gtsummary::tbl_summary(statistic = list(c("TEAM_BATTING_BB",
        "TEAM_BATTING_SO", "TEAM_BASERUN_SB", "TEAM_BASERUN_CS") ~
        "{mean} {median} {sd}"))

Characteristic	N = 2,276¹
TEAM_BATTING_BB	502 512 123
TEAM_BATTING_SO	736 750 249
Unknown	102
TEAM_BASERUN_SB	125 101 88
Unknown	131
TEAM_BASERUN_CS	53 49 23
Unknown	772
¹ Mean Median SD

temp <- train_df %>%
    select(c("TEAM_BATTING_BB", "TEAM_BATTING_SO", "TEAM_BASERUN_SB",
        "TEAM_BASERUN_CS"))
ggplot2::ggplot(stack(temp), aes(x = ind, y = values)) + geom_boxplot() +
    labs(title = "Batting Variables")

\(~\)

Of all the batting variables, only walks by batter has a correlation to wins.

train_df %>%
    select(c("TARGET_WINS", "TEAM_BATTING_BB", "TEAM_BATTING_SO",
        "TEAM_BASERUN_SB", "TEAM_BASERUN_CS")) %>%
    GGally::ggpairs()

\(~\)

Fielding

TARGET_WINS - Number of wins
TEAM_FIELDING_E - Errors
TEAM_FIELDING_DP - Double Plays

The Errors variable(TEAM_FIELDING_E) has an incredibly right skewed distribution. We are missing some Double Plays values.

train_df %>%
    select(c("TEAM_FIELDING_E", "TEAM_FIELDING_DP")) %>%
    gtsummary::tbl_summary(statistic = list(c("TEAM_FIELDING_E",
        "TEAM_FIELDING_DP") ~ "{mean} {median} {sd}"))

Characteristic	N = 2,276¹
TEAM_FIELDING_E	246 159 228
TEAM_FIELDING_DP	146 149 26
Unknown	286
¹ Mean Median SD

temp <- train_df %>%
    select(c("TEAM_FIELDING_E", "TEAM_FIELDING_DP"))
ggplot2::ggplot(stack(temp), aes(x = ind, y = values)) + geom_boxplot() +
    labs(title = "Fielding Variables")

Both the Fielding variables are negatively correlated with Wins.

train_df %>%
    select(c("TARGET_WINS", "TEAM_FIELDING_E", "TEAM_FIELDING_DP")) %>%
    GGally::ggpairs()

\(~\)

Pitching

TARGET_WINS - Number of wins
TEAM_PITCHING_BB - Walks allowed
TEAM_PITCHING_H - Hits allowed
TEAM_PITCHING_HR - Homeruns allowed
TEAM_PITCHING_SO - Strikeouts by pitchers

Hits allowed (TEAM_PITCHING_H) has a right skew and we are missing some Strikeouts by pitcher (TEAM_PITCHING_SO) values.

Note: There is something off with the pitching stats; there is no way a team allowed 30k hits in a 162 game season or had 20k strikeouts.

train_df %>%
    select(c("TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_BB",
        "TEAM_PITCHING_SO")) %>%
    gtsummary::tbl_summary(statistic = list(c("TEAM_PITCHING_H",
        "TEAM_PITCHING_HR", "TEAM_PITCHING_BB", "TEAM_PITCHING_SO") ~
        "{mean} {median} {sd}"))

Characteristic	N = 2,276¹
TEAM_PITCHING_H	1,779 1,518 1,407
TEAM_PITCHING_HR	106 107 61
TEAM_PITCHING_BB	553 536 166
TEAM_PITCHING_SO	818 814 553
Unknown	102
¹ Mean Median SD

temp <- train_df %>%
    select(c("TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_BB",
        "TEAM_PITCHING_SO"))
ggplot2::ggplot(stack(temp), aes(x = ind, y = values)) + geom_boxplot() +
    labs(title = "Pitching Variables")

Hits allowed is negatively correlated with Winning.

Interestingly, Home runs allowed is positively correlated with Winning.

train_df %>%
    select(c("TARGET_WINS", "TEAM_PITCHING_H", "TEAM_PITCHING_HR",
        "TEAM_PITCHING_BB", "TEAM_PITCHING_SO")) %>%
    GGally::ggpairs()

\(~\)

Data Preparation

Missing values

# training dataset
round(100 * colSums(is.na(train_df))/nrow(train_df), 2)

##            INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
##             0.00             0.00             0.00             0.00 
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
##             0.00             0.00             0.00             4.48 
##  TEAM_BASERUN_SB  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H 
##             5.76            33.92            91.61             0.00 
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##             0.00             0.00             4.48             0.00 
## TEAM_FIELDING_DP 
##            12.57

# training dataset
round(100 * colSums(is.na(eval_df))/nrow(eval_df), 2)

##            INDEX   TEAM_BATTING_H  TEAM_BATTING_2B  TEAM_BATTING_3B 
##             0.00             0.00             0.00             0.00 
##  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO  TEAM_BASERUN_SB 
##             0.00             0.00             6.95             5.02 
##  TEAM_BASERUN_CS TEAM_BATTING_HBP  TEAM_PITCHING_H TEAM_PITCHING_HR 
##            33.59            92.66             0.00             0.00 
## TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E TEAM_FIELDING_DP 
##             0.00             6.95             0.00            11.97

In terms of missing values, there are two variables missing many observations. TEAM_BATTING_HBP is missing over 90% of its values, while TEAM_BASERUN_CS is missing just around 30%.

# New DF with Missing Removed
train_df_mv <- train_df[, !names(train_df) %in% c("TEAM_BATTING_HBP",
    "TEAM_BASERUN_CS", "TEAM_FIELDING_DP")]

# Impute NAs with Median
train_df_imputed <- mice(train_df_mv, m = 5, maxit = 5, method = "pmm")

## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO

train_df_final <- complete(train_df_imputed)


ggplot(melt(train_df_final), aes(x = value)) + geom_histogram() +
    facet_wrap(~variable, scale = "free") + labs(x = "", y = "Frequency")

## Using  as id variables

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Replace Error Maxs
train_df_final$TEAM_PITCHING_H[train_df_final$TEAM_PITCHING_H >
    3 * sd(train_df_final$TEAM_PITCHING_H)] <- median(train_df_final$TEAM_PITCHING_H)
train_df_final$TEAM_PITCHING_BB[train_df_final$TEAM_PITCHING_BB >
    3 * sd(train_df_final$TEAM_PITCHING_BB)] <- median(train_df_final$TEAM_PITCHING_BB)
train_df_final$TEAM_PITCHING_SO[train_df_final$TEAM_PITCHING_SO >
    3 * sd(train_df_final$TEAM_PITCHING_SO)] <- median(train_df_final$TEAM_PITCHING_SO)
train_df_final$TEAM_FIELDING_E[train_df_final$TEAM_FIELDING_E >
    3 * sd(train_df_final$TEAM_FIELDING_E)] <- median(train_df_final$TEAM_FIELDING_E)

summary(train_df_final)

##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 541.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 732.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 726.7  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 925.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##  TEAM_BASERUN_SB TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
##  Min.   :  0.0   Min.   :1137    Min.   :  0.0    Min.   :  0.0   
##  1st Qu.: 67.0   1st Qu.:1419    1st Qu.: 50.0    1st Qu.:476.0   
##  Median :105.0   Median :1518    Median :107.0    Median :536.5   
##  Mean   :135.8   Mean   :1605    Mean   :105.7    Mean   :500.4   
##  3rd Qu.:170.0   3rd Qu.:1660    3rd Qu.:150.0    3rd Qu.:536.5   
##  Max.   :697.0   Max.   :4134    Max.   :343.0    Max.   :536.5   
##  TEAM_PITCHING_SO TEAM_FIELDING_E
##  Min.   :   0.0   Min.   : 65.0  
##  1st Qu.: 611.0   1st Qu.:127.0  
##  Median : 802.2   Median :159.0  
##  Mean   : 788.3   Mean   :198.9  
##  3rd Qu.: 954.2   3rd Qu.:215.0  
##  Max.   :1600.0   Max.   :681.0

# New DF with Missing Removed for eval data
eval_df_mv <- eval_df[, !names(eval_df) %in% c("TEAM_BATTING_HBP",
    "TEAM_BASERUN_CS", "TEAM_FIELDING_DP")]

# Impute NAs with Median
eval_df_imputed <- mice(eval_df_mv, m = 5, maxit = 5, method = "pmm")

## 
##  iter imp variable
##   1   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   1   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   2   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   3   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   4   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   1  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   2  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   3  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   4  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO
##   5   5  TEAM_BATTING_SO  TEAM_BASERUN_SB  TEAM_PITCHING_SO

eval_df_final <- complete(eval_df_imputed)


ggplot(melt(eval_df_final), aes(x = value)) + geom_histogram() +
    facet_wrap(~variable, scale = "free") + labs(x = "", y = "Frequency")

## Using  as id variables

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Replace Error Maxs
eval_df_final$TEAM_PITCHING_H[eval_df_final$TEAM_PITCHING_H >
    3 * sd(eval_df_final$TEAM_PITCHING_H)] <- median(eval_df_final$TEAM_PITCHING_H)
eval_df_final$TEAM_PITCHING_BB[eval_df_final$TEAM_PITCHING_BB >
    3 * sd(eval_df_final$TEAM_PITCHING_BB)] <- median(eval_df_final$TEAM_PITCHING_BB)
eval_df_final$TEAM_PITCHING_SO[eval_df_final$TEAM_PITCHING_SO >
    3 * sd(eval_df_final$TEAM_PITCHING_SO)] <- median(eval_df_final$TEAM_PITCHING_SO)
eval_df_final$TEAM_FIELDING_E[eval_df_final$TEAM_FIELDING_E >
    3 * sd(eval_df_final$TEAM_FIELDING_E)] <- median(eval_df_final$TEAM_FIELDING_E)

summary(eval_df_final)

##      INDEX      TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :   9   Min.   : 819   Min.   : 44.0   Min.   : 14.00  
##  1st Qu.: 708   1st Qu.:1387   1st Qu.:210.0   1st Qu.: 35.00  
##  Median :1249   Median :1455   Median :239.0   Median : 52.00  
##  Mean   :1264   Mean   :1469   Mean   :241.3   Mean   : 55.91  
##  3rd Qu.:1832   3rd Qu.:1548   3rd Qu.:278.5   3rd Qu.: 72.00  
##  Max.   :2525   Max.   :2170   Max.   :376.0   Max.   :155.00  
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   : 15.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 44.50   1st Qu.:436.5   1st Qu.: 543.0   1st Qu.: 60.5  
##  Median :101.00   Median :509.0   Median : 677.0   Median : 96.0  
##  Mean   : 95.63   Mean   :499.0   Mean   : 700.4   Mean   :133.4  
##  3rd Qu.:135.50   3rd Qu.:565.5   3rd Qu.: 904.5   3rd Qu.:157.5  
##  Max.   :242.00   Max.   :792.0   Max.   :1268.0   Max.   :580.0  
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
##  Min.   :1155    Min.   :  0.0    Min.   :136      Min.   :   0.0  
##  1st Qu.:1426    1st Qu.: 52.0    1st Qu.:471      1st Qu.: 604.0  
##  Median :1515    Median :104.0    Median :526      Median : 745.0  
##  Mean   :1602    Mean   :102.1    Mean   :490      Mean   : 755.2  
##  3rd Qu.:1654    3rd Qu.:142.5    3rd Qu.:526      3rd Qu.: 923.5  
##  Max.   :4120    Max.   :336.0    Max.   :526      Max.   :1462.0  
##  TEAM_FIELDING_E
##  Min.   : 73.0  
##  1st Qu.:131.0  
##  Median :163.0  
##  Mean   :206.5  
##  3rd Qu.:225.0  
##  Max.   :680.0

\(~\)

Model Building

\(~\)

Model 1 - Full Model

By testing all variables in this first model we are able to see how significant are the variables in our dataset. We will then be able to use this model to base our other models.

# model 1
m1 <- lm(TARGET_WINS ~ ., data = train_df_final, na.action = na.omit)
summ(m1)

Observations	2276
Dependent variable	TARGET_WINS
Type	OLS linear regression

F(13,2262)	67.76
R²	0.28
Adj. R²	0.28

	Est.	S.E.	t val.	p
(Intercept)	9.34	5.40	1.73	0.08
INDEX	-0.00	0.00	-1.33	0.18
TEAM_BATTING_H	0.03	0.00	9.23	0.00
TEAM_BATTING_2B	-0.01	0.01	-1.17	0.24
TEAM_BATTING_3B	0.10	0.02	5.42	0.00
TEAM_BATTING_HR	0.10	0.03	3.86	0.00
TEAM_BATTING_BB	0.04	0.00	9.22	0.00
TEAM_BATTING_SO	0.00	0.00	0.18	0.86
TEAM_BASERUN_SB	0.04	0.00	9.79	0.00
TEAM_PITCHING_H	0.00	0.00	2.00	0.05
TEAM_PITCHING_HR	-0.05	0.02	-1.95	0.05
TEAM_PITCHING_BB	-0.02	0.01	-2.35	0.02
TEAM_PITCHING_SO	-0.00	0.00	-1.12	0.26
TEAM_FIELDING_E	-0.02	0.00	-6.88	0.00
Standard errors: OLS

par(mfrow = c(2, 2))
plot(m1)

Model 2: Log transformation

Use of log transformation method which distributes skewness into a more “normally” distributed shape. I applied log transformation for highly skewed variables (less than -1 or greater than 1).

Note: Model 2 was not a successful model compared to model 1. There weren’t any significant changes between the two models therefore discarding this model.

# Checking skewness of dataset
sapply(train_df_final, function(x) skewness(x))

##            INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
##      0.004214942     -0.398723203      1.571333477      0.215101802 
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
##      1.109465188      0.186042144     -1.025759890     -0.225907299 
##  TEAM_BASERUN_SB  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB 
##      1.915258582      3.349018017      0.287787667     -2.493163443 
## TEAM_PITCHING_SO  TEAM_FIELDING_E 
##     -0.033173809      2.108419924

# Doing log transformations from model 1
train_df_final_log <- train_df_final

# Applying log transformation for highly skewed variables
train_df_final_log$TEAM_BATTING_H <- log10(train_df_final_log$TEAM_BATTING_H +
    1)
train_df_final_log$TEAM_BATTING_2B <- log10(train_df_final_log$TEAM_BATTING_2B +
    1)
train_df_final_log$TEAM_PITCHING_H <- log10(train_df_final_log$TEAM_PITCHING_H +
    1)
train_df_final_log$TEAM_PITCHING_BB <- log10(train_df_final_log$TEAM_PITCHING_BB +
    1)
train_df_final_log$TEAM_FIELDING_E <- log10(train_df_final_log$TEAM_FIELDING_E +
    1)
train_df_final_log$TEAM_BASERUN_SB <- log10(train_df_final_log$TEAM_BASERUN_SB +
    1)

# Checking skewness
sapply(train_df_final_log, function(x) skewness(x))

##            INDEX      TARGET_WINS   TEAM_BATTING_H  TEAM_BATTING_2B 
##      0.004214942     -0.398723203      0.783501655     -0.404123601 
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB  TEAM_BATTING_SO 
##      1.109465188      0.186042144     -1.025759890     -0.225907299 
##  TEAM_BASERUN_SB  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB 
##     -0.269189416      2.142986416      0.287787667    -13.600156875 
## TEAM_PITCHING_SO  TEAM_FIELDING_E 
##     -0.033173809      1.087713936

# model 2 log
m2 <- lm(TARGET_WINS ~ ., data = train_df_final_log, na.action = na.omit)
summ(m2)

Observations	2276
Dependent variable	TARGET_WINS
Type	OLS linear regression

F(13,2262)	70.08
R²	0.29
Adj. R²	0.28

	Est.	S.E.	t val.	p
(Intercept)	-269.17	37.51	-7.18	0.00
INDEX	-0.00	0.00	-1.42	0.15
TEAM_BATTING_H	105.80	13.77	7.68	0.00
TEAM_BATTING_2B	-5.16	5.25	-0.98	0.33
TEAM_BATTING_3B	0.11	0.02	6.22	0.00
TEAM_BATTING_HR	0.11	0.03	4.13	0.00
TEAM_BATTING_BB	0.03	0.00	8.70	0.00
TEAM_BATTING_SO	-0.01	0.00	-1.32	0.19
TEAM_BASERUN_SB	13.72	1.22	11.24	0.00
TEAM_PITCHING_H	7.21	5.75	1.25	0.21
TEAM_PITCHING_HR	-0.05	0.02	-1.99	0.05
TEAM_PITCHING_BB	-2.29	4.27	-0.54	0.59
TEAM_PITCHING_SO	-0.00	0.00	-0.85	0.39
TEAM_FIELDING_E	-17.14	2.30	-7.44	0.00
Standard errors: OLS

par(mfrow = c(2, 2))
plot(m2)

\(~\)

Model 3: Statistically significant

Focusing on statistically significant values chosen primarily from their R output.

# model 3
m3 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_HR +
    TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_FIELDING_E, data = train_df_final)
summ(m3)

Observations	2276
Dependent variable	TARGET_WINS
Type	OLS linear regression

F(6,2269)	140.79
R²	0.27
Adj. R²	0.27

	Est.	S.E.	t val.	p
(Intercept)	2.54	3.38	0.75	0.45
TEAM_BATTING_H	0.03	0.00	14.54	0.00
TEAM_BATTING_3B	0.09	0.02	5.43	0.00
TEAM_BATTING_HR	0.05	0.01	6.03	0.00
TEAM_BATTING_BB	0.03	0.00	12.08	0.00
TEAM_BASERUN_SB	0.04	0.00	10.31	0.00
TEAM_FIELDING_E	-0.02	0.00	-6.30	0.00
Standard errors: OLS

par(mfrow = c(2, 2))
plot(m3)

\(~\)

Model 4: Backwards Elimination

Variables that are not statistically significant are removed to determine a best fit model.

# model 4
m4 <- lm(TARGET_WINS ~ TEAM_BATTING_2B + TEAM_PITCHING_H + TEAM_PITCHING_HR +
    TEAM_PITCHING_BB + TEAM_PITCHING_SO, data = train_df_final)
summ(m4)

Observations	2276
Dependent variable	TARGET_WINS
Type	OLS linear regression

F(5,2270)	81.14
R²	0.15
Adj. R²	0.15

	Est.	S.E.	t val.	p
(Intercept)	44.38	3.44	12.89	0.00
TEAM_BATTING_2B	0.06	0.01	7.22	0.00
TEAM_PITCHING_H	0.01	0.00	8.06	0.00
TEAM_PITCHING_HR	0.06	0.01	8.44	0.00
TEAM_PITCHING_BB	0.03	0.01	6.13	0.00
TEAM_PITCHING_SO	-0.02	0.00	-9.64	0.00
Standard errors: OLS

par(mfrow = c(2, 2))
plot(m4)

\(~\)

Model 5: Power

Using a power model may be more effective considering each independent variable doesn’t appear to have a truly linear relationship with wins. Here we create a model using a cubit for each independent variable.

# model 5
m5 <- lm(TARGET_WINS ~ TEAM_BATTING_H + I(TEAM_BATTING_H^2) +
    I(TEAM_BATTING_H^3) + TEAM_BATTING_2B + I(TEAM_BATTING_2B^2) +
    I(TEAM_BATTING_2B^3) + TEAM_BATTING_3B + I(TEAM_BATTING_3B^2) +
    I(TEAM_BATTING_3B^3) + TEAM_BATTING_HR + I(TEAM_BATTING_HR^2) +
    I(TEAM_BATTING_HR^3) + TEAM_BATTING_BB + I(TEAM_BATTING_BB^2) +
    I(TEAM_BATTING_BB^3) + TEAM_BATTING_SO + I(TEAM_BATTING_SO^2) +
    I(TEAM_BATTING_SO^3) + TEAM_BASERUN_SB + I(TEAM_BASERUN_SB^2) +
    I(TEAM_BASERUN_SB^3) + TEAM_PITCHING_H + I(TEAM_PITCHING_H^2) +
    I(TEAM_PITCHING_H^3) + TEAM_PITCHING_HR + I(TEAM_PITCHING_HR^2) +
    I(TEAM_PITCHING_HR^3) + TEAM_PITCHING_BB + I(TEAM_PITCHING_BB^2) +
    I(TEAM_PITCHING_BB^3) + TEAM_PITCHING_SO + I(TEAM_PITCHING_SO^2) +
    I(TEAM_PITCHING_SO^3) + TEAM_FIELDING_E + I(TEAM_FIELDING_E^2) +
    I(TEAM_FIELDING_E^3), data = train_df_final)
summ(m5)

Observations	2276
Dependent variable	TARGET_WINS
Type	OLS linear regression

F(36,2239)	35.36
R²	0.36
Adj. R²	0.35

	Est.	S.E.	t val.	p
(Intercept)	-1.32	64.34	-0.02	0.98
TEAM_BATTING_H	0.05	0.13	0.38	0.71
I(TEAM_BATTING_H^2)	-0.00	0.00	-0.08	0.94
I(TEAM_BATTING_H^3)	0.00	0.00	0.20	0.84
TEAM_BATTING_2B	1.03	0.20	5.22	0.00
I(TEAM_BATTING_2B^2)	-0.00	0.00	-5.19	0.00
I(TEAM_BATTING_2B^3)	0.00	0.00	5.01	0.00
TEAM_BATTING_3B	-0.06	0.09	-0.61	0.54
I(TEAM_BATTING_3B^2)	0.00	0.00	3.00	0.00
I(TEAM_BATTING_3B^3)	-0.00	0.00	-4.24	0.00
TEAM_BATTING_HR	0.18	0.14	1.22	0.22
I(TEAM_BATTING_HR^2)	-0.00	0.00	-1.05	0.29
I(TEAM_BATTING_HR^3)	0.00	0.00	0.76	0.45
TEAM_BATTING_BB	0.24	0.05	4.80	0.00
I(TEAM_BATTING_BB^2)	-0.00	0.00	-4.00	0.00
I(TEAM_BATTING_BB^3)	0.00	0.00	3.72	0.00
TEAM_BATTING_SO	0.10	0.03	3.09	0.00
I(TEAM_BATTING_SO^2)	-0.00	0.00	-2.74	0.01
I(TEAM_BATTING_SO^3)	0.00	0.00	1.96	0.05
TEAM_BASERUN_SB	0.08	0.02	4.31	0.00
I(TEAM_BASERUN_SB^2)	-0.00	0.00	-0.97	0.33
I(TEAM_BASERUN_SB^3)	-0.00	0.00	-0.16	0.87
TEAM_PITCHING_H	-0.13	0.03	-4.29	0.00
I(TEAM_PITCHING_H^2)	0.00	0.00	3.80	0.00
I(TEAM_PITCHING_H^3)	-0.00	0.00	-3.19	0.00
TEAM_PITCHING_HR	-0.34	0.12	-2.76	0.01
I(TEAM_PITCHING_HR^2)	0.00	0.00	3.61	0.00
I(TEAM_PITCHING_HR^3)	-0.00	0.00	-3.67	0.00
TEAM_PITCHING_BB	0.08	0.11	0.78	0.43
I(TEAM_PITCHING_BB^2)	-0.00	0.00	-1.75	0.08
I(TEAM_PITCHING_BB^3)	0.00	0.00	2.18	0.03
TEAM_PITCHING_SO	-0.07	0.02	-2.79	0.01
I(TEAM_PITCHING_SO^2)	0.00	0.00	2.68	0.01
I(TEAM_PITCHING_SO^3)	-0.00	0.00	-2.69	0.01
TEAM_FIELDING_E	-0.13	0.04	-3.49	0.00
I(TEAM_FIELDING_E^2)	0.00	0.00	1.45	0.15
I(TEAM_FIELDING_E^3)	-0.00	0.00	-0.30	0.77
Standard errors: OLS

par(mfrow = c(2, 2))
plot(m5)

\(~\)

Model 6: Power with Reverse Elimination

Used reverse elimination on model 5 to remove variables with p-values higher than .05.

# model 6
m6 <- lm(TARGET_WINS ~ TEAM_BATTING_H + 
           TEAM_BATTING_2B + I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_2B^3) + 
           I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_3B^3) + 
           
           TEAM_BATTING_BB + I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_BB^3) + 
           TEAM_BATTING_SO + I(TEAM_BATTING_SO^2) + 
           TEAM_BASERUN_SB + I(TEAM_BASERUN_SB^2) + 
           TEAM_PITCHING_H + I(TEAM_PITCHING_H^2) + I(TEAM_PITCHING_H^3) + 
           TEAM_PITCHING_HR + I(TEAM_PITCHING_HR^2) + I(TEAM_PITCHING_HR^3) + 
           I(TEAM_PITCHING_BB^2) + I(TEAM_PITCHING_BB^3) + 
           TEAM_PITCHING_SO + I(TEAM_PITCHING_SO^2) + I(TEAM_PITCHING_SO^3) + 
           TEAM_FIELDING_E + I(TEAM_FIELDING_E^2), data = train_df_final)
summ(m6)

Observations	2276
Dependent variable	TARGET_WINS
Type	OLS linear regression

F(26,2249)	48.71
R²	0.36
Adj. R²	0.35

	Est.	S.E.	t val.	p
(Intercept)	6.91	23.80	0.29	0.77
TEAM_BATTING_H	0.05	0.00	12.82	0.00
TEAM_BATTING_2B	1.01	0.19	5.27	0.00
I(TEAM_BATTING_2B^2)	-0.00	0.00	-5.23	0.00
I(TEAM_BATTING_2B^3)	0.00	0.00	5.06	0.00
I(TEAM_BATTING_3B^2)	0.00	0.00	8.48	0.00
I(TEAM_BATTING_3B^3)	-0.00	0.00	-7.96	0.00
TEAM_BATTING_BB	0.29	0.05	6.34	0.00
I(TEAM_BATTING_BB^2)	-0.00	0.00	-5.07	0.00
I(TEAM_BATTING_BB^3)	0.00	0.00	4.56	0.00
TEAM_BATTING_SO	0.05	0.01	3.59	0.00
I(TEAM_BATTING_SO^2)	-0.00	0.00	-4.95	0.00
TEAM_BASERUN_SB	0.08	0.01	8.90	0.00
I(TEAM_BASERUN_SB^2)	-0.00	0.00	-5.28	0.00
TEAM_PITCHING_H	-0.14	0.03	-5.42	0.00
I(TEAM_PITCHING_H^2)	0.00	0.00	4.84	0.00
I(TEAM_PITCHING_H^3)	-0.00	0.00	-4.16	0.00
TEAM_PITCHING_HR	-0.19	0.04	-4.57	0.00
I(TEAM_PITCHING_HR^2)	0.00	0.00	6.02	0.00
I(TEAM_PITCHING_HR^3)	-0.00	0.00	-5.46	0.00
I(TEAM_PITCHING_BB^2)	-0.00	0.00	-3.75	0.00
I(TEAM_PITCHING_BB^3)	0.00	0.00	3.44	0.00
TEAM_PITCHING_SO	-0.04	0.02	-2.30	0.02
I(TEAM_PITCHING_SO^2)	0.00	0.00	1.99	0.05
I(TEAM_PITCHING_SO^3)	-0.00	0.00	-1.89	0.06
TEAM_FIELDING_E	-0.12	0.02	-7.84	0.00
I(TEAM_FIELDING_E^2)	0.00	0.00	6.51	0.00
Standard errors: OLS

par(mfrow = c(2, 2))
plot(m6)

Model Selection

We are choosing model 6 after evaluating all the models from the section Model Building. The adjusted \(R^2\) is better than the other models.

# model 6 eval
eval_m6 <- lm(INDEX ~ TEAM_BATTING_H + TEAM_BATTING_2B + I(TEAM_BATTING_2B^2) +
    I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_3B^2) + I(TEAM_BATTING_3B^3) +
    TEAM_BATTING_BB + I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_BB^3) +
    TEAM_BATTING_SO + I(TEAM_BATTING_SO^2) + TEAM_BASERUN_SB +
    I(TEAM_BASERUN_SB^2) + TEAM_PITCHING_H + I(TEAM_PITCHING_H^2) +
    I(TEAM_PITCHING_H^3) + TEAM_PITCHING_HR + I(TEAM_PITCHING_HR^2) +
    I(TEAM_PITCHING_HR^3) + I(TEAM_PITCHING_BB^2) + I(TEAM_PITCHING_BB^3) +
    TEAM_PITCHING_SO + I(TEAM_PITCHING_SO^2) + I(TEAM_PITCHING_SO^3) +
    TEAM_FIELDING_E + I(TEAM_FIELDING_E^2), data = eval_df_final)
summ(eval_m6)

Observations	259
Dependent variable	INDEX
Type	OLS linear regression

F(26,232)	2.18
R²	0.20
Adj. R²	0.11

	Est.	S.E.	t val.	p
(Intercept)	5718.41	3898.50	1.47	0.14
TEAM_BATTING_H	-0.49	0.73	-0.66	0.51
TEAM_BATTING_2B	-10.81	23.31	-0.46	0.64
I(TEAM_BATTING_2B^2)	0.06	0.10	0.58	0.56
I(TEAM_BATTING_2B^3)	-0.00	0.00	-0.65	0.52
I(TEAM_BATTING_3B^2)	0.12	0.07	1.74	0.08
I(TEAM_BATTING_3B^3)	-0.00	0.00	-1.12	0.27
TEAM_BATTING_BB	-0.13	9.08	-0.01	0.99
I(TEAM_BATTING_BB^2)	-0.00	0.02	-0.24	0.81
I(TEAM_BATTING_BB^3)	0.00	0.00	0.32	0.75
TEAM_BATTING_SO	-0.69	2.48	-0.28	0.78
I(TEAM_BATTING_SO^2)	0.00	0.00	0.61	0.54
TEAM_BASERUN_SB	4.59	1.51	3.03	0.00
I(TEAM_BASERUN_SB^2)	-0.01	0.00	-2.98	0.00
TEAM_PITCHING_H	-3.36	4.24	-0.79	0.43
I(TEAM_PITCHING_H^2)	0.00	0.00	0.94	0.35
I(TEAM_PITCHING_H^3)	-0.00	0.00	-1.15	0.25
TEAM_PITCHING_HR	20.13	6.47	3.11	0.00
I(TEAM_PITCHING_HR^2)	-0.12	0.05	-2.62	0.01
I(TEAM_PITCHING_HR^3)	0.00	0.00	1.69	0.09
I(TEAM_PITCHING_BB^2)	-0.04	0.02	-1.79	0.08
I(TEAM_PITCHING_BB^3)	0.00	0.00	1.86	0.06
TEAM_PITCHING_SO	2.92	3.32	0.88	0.38
I(TEAM_PITCHING_SO^2)	-0.00	0.00	-0.85	0.39
I(TEAM_PITCHING_SO^3)	0.00	0.00	0.76	0.45
TEAM_FIELDING_E	-3.53	2.66	-1.33	0.19
I(TEAM_FIELDING_E^2)	0.00	0.00	1.28	0.20
Standard errors: OLS

par(mfrow = c(2, 2))
plot(eval_m6)

eval_df_final$TARGET_WINS <- round(predict(m6, eval_df_final),
    0)

# predictions are as follows
eval_pred <- eval_df_final %>%
    select(TARGET_WINS, everything())
head(eval_pred)

##   TARGET_WINS INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1          62     9           1209             170              33
## 2          64    10           1221             151              29
## 3          72    14           1395             183              29
## 4          86    47           1539             309              29
## 5          73    60           1445             203              68
## 6          65    63           1431             236              53
##   TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1              83             447            1080              62
## 2              88             516             929              54
## 3              93             509             816              59
## 4             159             486             914             148
## 5               5              95             416             365
## 6              10             215             377             319
##   TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 1            1209               83              447             1080
## 2            1221               88              516              929
## 3            1395               93              509              816
## 4            1539              159              486              914
## 5            3902               14              257             1123
## 6            2793               20              420              736
##   TEAM_FIELDING_E
## 1             140
## 2             135
## 3             156
## 4             124
## 5             616
## 6             572

\(~\)

Data 621 Homework #1

Group 2: William Aiken, Donald Butler, Michael Ippolito, Bharani Nittala, and Leticia Salazar

09-25-2022

Load Libraries

Data Overview

Objective

Data Exploration

Base Hits by Batter

Batting

Fielding

Pitching

Data Preparation

Missing values

Model Building

Model 1 - Full Model

Model 2: Log transformation

Model 3: Statistically significant

Model 4: Backwards Elimination

Model 5: Power

Model 6: Power with Reverse Elimination

Model Selection

References