Overview
In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.
Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
Data Exploration
The goal of each team is to win as many games as possible in a 162-game season. Winning earns a place in the postseason and a chance to play in the World Series, where the champion is decided.
The sample data contain 2276 observations dating from 1871 to 2006. The data consist of 16 variables in total (plus an INDEX column), of which 15 are explanatory variables and 1 is the target variable, TARGET_WINS.
We will first look at the data to get a sense of what we have.
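# Packages the code in this report relies on
library(dplyr)       # %>%
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(ggplot2)     # ggplot()
data_url <- "https://raw.githubusercontent.com/mkollontai/DATA621_GroupWork/master/HW1/moneyball-training-data.csv"
MB_train <- read.csv(data_url)
DT::datatable(head(MB_train))
MB_summary <- function(MB_train){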
MB_train %>%
summary() %>%
kable() %>%
kable_styling()
}
MB_summary(MB_train)
| INDEX | TARGET_WINS | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_BATTING_HBP | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Min. : 0.00 | Min. : 891 | Min. : 69.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. :29.00 | Min. : 1137 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 65.0 | Min. : 52.0 | |
| 1st Qu.: 630.8 | 1st Qu.: 71.00 | 1st Qu.:1383 | 1st Qu.:208.0 | 1st Qu.: 34.00 | 1st Qu.: 42.00 | 1st Qu.:451.0 | 1st Qu.: 548.0 | 1st Qu.: 66.0 | 1st Qu.: 38.0 | 1st Qu.:50.50 | 1st Qu.: 1419 | 1st Qu.: 50.0 | 1st Qu.: 476.0 | 1st Qu.: 615.0 | 1st Qu.: 127.0 | 1st Qu.:131.0 | |
| Median :1270.5 | Median : 82.00 | Median :1454 | Median :238.0 | Median : 47.00 | Median :102.00 | Median :512.0 | Median : 750.0 | Median :101.0 | Median : 49.0 | Median :58.00 | Median : 1518 | Median :107.0 | Median : 536.5 | Median : 813.5 | Median : 159.0 | Median :149.0 | |
| Mean :1268.5 | Mean : 80.79 | Mean :1469 | Mean :241.2 | Mean : 55.25 | Mean : 99.61 | Mean :501.6 | Mean : 735.6 | Mean :124.8 | Mean : 52.8 | Mean :59.36 | Mean : 1779 | Mean :105.7 | Mean : 553.0 | Mean : 817.7 | Mean : 246.5 | Mean :146.4 | |
| 3rd Qu.:1915.5 | 3rd Qu.: 92.00 | 3rd Qu.:1537 | 3rd Qu.:273.0 | 3rd Qu.: 72.00 | 3rd Qu.:147.00 | 3rd Qu.:580.0 | 3rd Qu.: 930.0 | 3rd Qu.:156.0 | 3rd Qu.: 62.0 | 3rd Qu.:67.00 | 3rd Qu.: 1682 | 3rd Qu.:150.0 | 3rd Qu.: 611.0 | 3rd Qu.: 968.0 | 3rd Qu.: 249.2 | 3rd Qu.:164.0 | |
| Max. :2535.0 | Max. :146.00 | Max. :2554 | Max. :458.0 | Max. :223.00 | Max. :264.00 | Max. :878.0 | Max. :1399.0 | Max. :697.0 | Max. :201.0 | Max. :95.00 | Max. :30132 | Max. :343.0 | Max. :3645.0 | Max. :19278.0 | Max. :1898.0 | Max. :228.0 | |
| NA | NA | NA | NA | NA | NA | NA | NA’s :102 | NA’s :131 | NA’s :772 | NA’s :2085 | NA | NA | NA | NA’s :102 | NA | NA’s :286 |
The table above shows that TEAM_BATTING_HBP and TEAM_BASERUN_CS have the most NAs (2085 and 772, respectively). We will come back to this later. The correlation table below shows the relationship between TARGET_WINS and the other variables.
# Correlations of columns 2 to 17 (all variables except INDEX)
cor <- cor(MB_train[2:17], use="complete.obs", method="pearson")
# round to two decimals
DT::datatable(round(cor, 2))
The table above shows strong correlation between the following pairs of variables:
- TEAM_BATTING_H -> TEAM_PITCHING_H
- TEAM_BATTING_HR -> TEAM_PITCHING_HR
- TEAM_BATTING_BB -> TEAM_PITCHING_BB
- TEAM_BATTING_SO -> TEAM_PITCHING_SO
- TEAM_BASERUN_CS -> TEAM_BASERUN_SB
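One caveat about the correlation call is worth illustrating. With `use = "complete.obs"`, any row with an NA in any column is dropped, and since TEAM_BATTING_HBP is missing in 2085 of the 2276 rows, at most 191 rows can contribute to the matrix. The sketch below contrasts this with `use = "pairwise.complete.obs"` on a small made-up data frame; `toy` and its columns are illustrative assumptions, not part of the assignment data.

```r
# Toy data: column c is mostly missing, similar in spirit to TEAM_BATTING_HBP
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, 6, 5),
  c = c(NA, NA, NA, 1, 3, 2)
)

# complete.obs: only rows without any NA are used for every pair of columns
cor_complete <- cor(toy, use = "complete.obs", method = "pearson")

# pairwise.complete.obs: each pair uses all rows where both columns are present
cor_pairwise <- cor(toy, use = "pairwise.complete.obs", method = "pearson")

# Number of rows that survive listwise deletion
n_complete <- sum(complete.cases(toy))

print(round(cor_complete, 2))
print(round(cor_pairwise, 2))
print(n_complete)
```

The two matrices disagree on the a/b correlation because the first is computed from half of the rows. Whether pairwise deletion is preferable for this data set is a judgment call; the sketch only shows that the choice matters.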
Let’s look at all of the metrics to evaluate the presence of outliers and the overall quality of the data. For clarity, we divide the variables into four groups:
- batting
- baserun
- fielding
- pitching
Box-plot for batting variables.
batting_df <- MB_train[,c("TEAM_BATTING_H","TEAM_BATTING_2B","TEAM_BATTING_3B","TEAM_BATTING_HR","TEAM_BATTING_BB","TEAM_BATTING_SO","TEAM_BATTING_HBP")]
ggplot(stack(batting_df), aes(x = ind, y = values)) +
geom_boxplot(col="darkblue") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
The plot above shows that TEAM_BATTING_H has more outliers than the other batting variables.
Box-plot for baserun variables.
#colnames(MB_train)
baserun_df <- MB_train[,c("TEAM_BASERUN_SB","TEAM_BASERUN_CS")]
ggplot(stack(baserun_df), aes(x = ind, y = values)) +
geom_boxplot(col="darkblue") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
Box-plot for fielding variables.
#colnames(MB_train)
fielding_df <- MB_train[,c("TEAM_FIELDING_E","TEAM_FIELDING_DP")]
ggplot(stack(fielding_df), aes(x = ind, y = values)) +
geom_boxplot(col="darkblue") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
Box-plot for pitching variables.
pitching_df <- MB_train[,c("TEAM_PITCHING_H","TEAM_PITCHING_HR","TEAM_PITCHING_BB","TEAM_PITCHING_SO")]
ggplot(stack(pitching_df), aes(x = ind, y = values)) +
geom_boxplot(col="darkblue") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
There are many NAs in the sample data. To plot the distributions, we replace the NAs with zero; the histograms below show the resulting distribution of each variable.
mb_hist <- MB_train
mb_hist$TEAM_BATTING_SO <- ifelse(is.na(mb_hist$TEAM_BATTING_SO), 0, mb_hist$TEAM_BATTING_SO)
mb_hist$TEAM_BASERUN_SB <- ifelse(is.na(mb_hist$TEAM_BASERUN_SB), 0, mb_hist$TEAM_BASERUN_SB)
mb_hist$TEAM_BASERUN_CS <- ifelse(is.na(mb_hist$TEAM_BASERUN_CS), 0, mb_hist$TEAM_BASERUN_CS)
mb_hist$TEAM_BATTING_HBP <- ifelse(is.na(mb_hist$TEAM_BATTING_HBP), 0, mb_hist$TEAM_BATTING_HBP)
mb_hist$TEAM_PITCHING_SO <- ifelse(is.na(mb_hist$TEAM_PITCHING_SO), 0, mb_hist$TEAM_PITCHING_SO)
mb_hist$TEAM_FIELDING_DP <- ifelse(is.na(mb_hist$TEAM_FIELDING_DP), 0, mb_hist$TEAM_FIELDING_DP)
# remove index col
mb_hist <- mb_hist[,-1 ]
# plot distribution
mb_hist %>%
purrr::keep(is.numeric) %>%
tidyr::gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(bins = 35, col = "darkblue", fill = "darkblue")
The histograms above show that some variables are left-skewed, some are right-skewed, and some are approximately normally distributed. Several variables also have many zeros, many of them introduced by the NA replacement above, and these need to be addressed because they significantly skew the data. The two worst offenders are TEAM_BATTING_HBP and TEAM_BASERUN_CS. The data preparation in the next section handles this issue so that the resulting models perform better without removing rows of data.
Data Preparation
For data preparation, the main focus is on addressing outliers, replacing NAs with the median or mean, applying a log transformation to skewed distributions, and creating additional variables. First we create the additional variables, then we review outliers. The 4 new variables we add to the data are TEAM_BATTING_1B, TEAM_BATTING_WALK, TEAM_BASERUN_SB_RATIO, and TEAM_BASERUN_CS_RATIO. TEAM_BATTING_H counts singles, doubles, triples and home runs, so the number of singles can be derived from it. Below is the formula:
TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR
The number of times a team's batters reach first base without a hit is the sum of TEAM_BATTING_BB and TEAM_BATTING_HBP:
TEAM_BATTING_WALK = TEAM_BATTING_BB + TEAM_BATTING_HBP
In the correlation table we saw that TEAM_BASERUN_SB and TEAM_BASERUN_CS are strongly correlated. TEAM_BASERUN_SB_RATIO and TEAM_BASERUN_CS_RATIO are the shares of stolen-base attempts (stolen bases plus caught stealing) that ended in a stolen base and in a caught stealing, respectively.
# created variables
MB_train$TEAM_BATTING_1B <- MB_train$TEAM_BATTING_H - MB_train$TEAM_BATTING_2B - MB_train$TEAM_BATTING_3B - MB_train$TEAM_BATTING_HR
MB_train$TEAM_BASERUN_SB_RATIO <- MB_train$TEAM_BASERUN_SB/(MB_train$TEAM_BASERUN_SB + MB_train$TEAM_BASERUN_CS)
MB_train$TEAM_BASERUN_CS_RATIO <- MB_train$TEAM_BASERUN_CS/(MB_train$TEAM_BASERUN_SB + MB_train$TEAM_BASERUN_CS)
MB_train$TEAM_BATTING_WALK <- MB_train$TEAM_BATTING_BB+MB_train$TEAM_BATTING_HBP
The plots below show the distributions of total hits and singles for the batting team, followed by the caught-stealing ratio and the stolen-base ratio.
par(mfrow=c(2,2), mai=c(0.5,0.5,0.5,0.2))
par(fig=c(0,0.5,0.25,1))
hist(MB_train$TEAM_BATTING_H, main="Total Hits", breaks=30, col="darkblue")
par(fig=c(0.5,1,0.25,1), new=TRUE)
hist(MB_train$TEAM_BATTING_1B, main="Singles", breaks=30, col="darkblue")
par(fig=c(0,0.5,0,0.3), new=TRUE)
boxplot(MB_train$TEAM_BATTING_H, horizontal=TRUE, width=1, col="darkblue")
par(fig=c(0.5,1,0,0.3), new=TRUE)
boxplot(MB_train$TEAM_BATTING_1B, horizontal=TRUE, width=1, col="darkblue")
par(mfrow=c(2,2), mai=c(0.5,0.5,0.5,0.2))
par(fig=c(0,0.5,0.25,1))
hist(MB_train$TEAM_BASERUN_CS_RATIO, main="Caught Stealing Ratio", breaks=30, col="firebrick")
par(fig=c(0.5,1,0.25,1), new=TRUE)
hist(MB_train$TEAM_BASERUN_SB_RATIO, main="Stolen Bases Ratio", breaks=30, col="darkblue")
par(fig=c(0,0.5,0,0.3), new=TRUE)
boxplot(MB_train$TEAM_BASERUN_CS_RATIO, horizontal=TRUE, width=1, col="firebrick")
par(fig=c(0.5,1,0,0.3), new=TRUE)
boxplot(MB_train$TEAM_BASERUN_SB_RATIO, horizontal=TRUE, width=1, col="darkblue")
| INDEX | TARGET_WINS | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_BATTING_HBP | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP | TEAM_BATTING_1B | TEAM_BASERUN_SB_RATIO | TEAM_BASERUN_CS_RATIO | TEAM_BATTING_WALK | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Min. : 0.00 | Min. : 891 | Min. : 69.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. :29.00 | Min. : 1137 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 65.0 | Min. : 52.0 | Min. : 709.0 | Min. :0.0000 | Min. :0.1597 | Min. :429.0 | |
| 1st Qu.: 630.8 | 1st Qu.: 71.00 | 1st Qu.:1383 | 1st Qu.:208.0 | 1st Qu.: 34.00 | 1st Qu.: 42.00 | 1st Qu.:451.0 | 1st Qu.: 548.0 | 1st Qu.: 66.0 | 1st Qu.: 38.0 | 1st Qu.:50.50 | 1st Qu.: 1419 | 1st Qu.: 50.0 | 1st Qu.: 476.0 | 1st Qu.: 615.0 | 1st Qu.: 127.0 | 1st Qu.:131.0 | 1st Qu.: 990.8 | 1st Qu.:0.5789 | 1st Qu.:0.3048 | 1st Qu.:550.5 | |
| Median :1270.5 | Median : 82.00 | Median :1454 | Median :238.0 | Median : 47.00 | Median :102.00 | Median :512.0 | Median : 750.0 | Median :101.0 | Median : 49.0 | Median :58.00 | Median : 1518 | Median :107.0 | Median : 536.5 | Median : 813.5 | Median : 159.0 | Median :149.0 | Median :1050.0 | Median :0.6429 | Median :0.3571 | Median :594.0 | |
| Mean :1268.5 | Mean : 80.79 | Mean :1469 | Mean :241.2 | Mean : 55.25 | Mean : 99.61 | Mean :501.6 | Mean : 735.6 | Mean :124.8 | Mean : 52.8 | Mean :59.36 | Mean : 1779 | Mean :105.7 | Mean : 553.0 | Mean : 817.7 | Mean : 246.5 | Mean :146.4 | Mean :1073.2 | Mean :0.6327 | Mean :0.3673 | Mean :602.7 | |
| 3rd Qu.:1915.5 | 3rd Qu.: 92.00 | 3rd Qu.:1537 | 3rd Qu.:273.0 | 3rd Qu.: 72.00 | 3rd Qu.:147.00 | 3rd Qu.:580.0 | 3rd Qu.: 930.0 | 3rd Qu.:156.0 | 3rd Qu.: 62.0 | 3rd Qu.:67.00 | 3rd Qu.: 1682 | 3rd Qu.:150.0 | 3rd Qu.: 611.0 | 3rd Qu.: 968.0 | 3rd Qu.: 249.2 | 3rd Qu.:164.0 | 3rd Qu.:1129.0 | 3rd Qu.:0.6952 | 3rd Qu.:0.4211 | 3rd Qu.:653.5 | |
| Max. :2535.0 | Max. :146.00 | Max. :2554 | Max. :458.0 | Max. :223.00 | Max. :264.00 | Max. :878.0 | Max. :1399.0 | Max. :697.0 | Max. :201.0 | Max. :95.00 | Max. :30132 | Max. :343.0 | Max. :3645.0 | Max. :19278.0 | Max. :1898.0 | Max. :228.0 | Max. :2112.0 | Max. :0.8403 | Max. :1.0000 | Max. :823.0 | |
| NA | NA | NA | NA | NA | NA | NA | NA’s :102 | NA’s :131 | NA’s :772 | NA’s :2085 | NA | NA | NA | NA’s :102 | NA | NA’s :286 | NA | NA’s :773 | NA’s :773 | NA’s :2085 |
The sample data contain NAs, which we replace with the mean or the median. We create two datasets, one with NAs replaced by the column mean and one with NAs replaced by the column median, and fit our models on both.
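The same replacement can also be expressed once as a small helper instead of one line per column. The sketch below is only an illustration under assumptions: `impute_na` and the `toy` data frame are hypothetical names that do not appear in the original analysis, and the helper fills every numeric column rather than a hand-picked list.

```r
# Hypothetical helper: replace NAs in each numeric column with a summary
# statistic (mean or median) computed from the non-missing values
impute_na <- function(df, stat = mean) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      missing <- is.na(df[[col]])
      df[[col]][missing] <- stat(df[[col]], na.rm = TRUE)
    }
  }
  df
}

# Made-up data with missing values in both columns
toy <- data.frame(x = c(1, NA, 3, 8), y = c(10, 20, NA, NA))

toy_mean <- impute_na(toy, mean)      # NAs replaced by column means
toy_median <- impute_na(toy, median)  # NAs replaced by column medians

print(toy_mean)
print(toy_median)
```

Passing the statistic as a function keeps the mean and median variants in sync, which is harder to guarantee when each column is written out by hand.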
MB_train_mean <- MB_train
# replace by mean
MB_train_mean$TEAM_BATTING_SO[is.na(MB_train_mean$TEAM_BATTING_SO)==TRUE] <- mean(MB_train_mean$TEAM_BATTING_SO, na.rm = TRUE)
MB_train_mean$TEAM_BASERUN_SB[is.na(MB_train_mean$TEAM_BASERUN_SB)==TRUE] <- mean(MB_train_mean$TEAM_BASERUN_SB, na.rm = TRUE)
MB_train_mean$TEAM_BASERUN_CS[is.na(MB_train_mean$TEAM_BASERUN_CS)==TRUE] <- mean(MB_train_mean$TEAM_BASERUN_CS, na.rm = TRUE)
MB_train_mean$TEAM_BATTING_HBP[is.na(MB_train_mean$TEAM_BATTING_HBP)==TRUE] <- mean(MB_train_mean$TEAM_BATTING_HBP, na.rm = TRUE)
MB_train_mean$TEAM_PITCHING_SO[is.na(MB_train_mean$TEAM_PITCHING_SO)==TRUE] <- mean(MB_train_mean$TEAM_PITCHING_SO, na.rm = TRUE)
MB_train_mean$TEAM_FIELDING_DP[is.na(MB_train_mean$TEAM_FIELDING_DP)==TRUE] <- mean(MB_train_mean$TEAM_FIELDING_DP, na.rm = TRUE)
MB_train_mean$TEAM_BASERUN_SB_RATIO[is.na(MB_train_mean$TEAM_BASERUN_SB_RATIO)==TRUE] <- mean(MB_train_mean$TEAM_BASERUN_SB_RATIO, na.rm = TRUE)
MB_train_mean$TEAM_BASERUN_CS_RATIO[is.na(MB_train_mean$TEAM_BASERUN_CS_RATIO)==TRUE] <- mean(MB_train_mean$TEAM_BASERUN_CS_RATIO, na.rm = TRUE)
MB_train_mean$TEAM_BATTING_WALK[is.na(MB_train_mean$TEAM_BATTING_WALK)==TRUE] <- mean(MB_train_mean$TEAM_BATTING_WALK, na.rm = TRUE)
# replace by median
MB_train_median <- MB_train
MB_train_median$TEAM_BATTING_SO[is.na(MB_train_median$TEAM_BATTING_SO)==TRUE] <- median(MB_train_median$TEAM_BATTING_SO, na.rm = TRUE)
MB_train_median$TEAM_BASERUN_SB[is.na(MB_train_median$TEAM_BASERUN_SB)==TRUE] <- median(MB_train_median$TEAM_BASERUN_SB, na.rm = TRUE)
MB_train_median$TEAM_BASERUN_CS[is.na(MB_train_median$TEAM_BASERUN_CS)==TRUE] <- median(MB_train_median$TEAM_BASERUN_CS, na.rm = TRUE)
MB_train_median$TEAM_BATTING_HBP[is.na(MB_train_median$TEAM_BATTING_HBP)==TRUE] <- median(MB_train_median$TEAM_BATTING_HBP, na.rm = TRUE)
MB_train_median$TEAM_PITCHING_SO[is.na(MB_train_median$TEAM_PITCHING_SO)==TRUE] <- median(MB_train_median$TEAM_PITCHING_SO, na.rm = TRUE)
MB_train_median$TEAM_FIELDING_DP[is.na(MB_train_median$TEAM_FIELDING_DP)==TRUE] <- median(MB_train_median$TEAM_FIELDING_DP, na.rm = TRUE)
MB_train_median$TEAM_BASERUN_SB_RATIO[is.na(MB_train_median$TEAM_BASERUN_SB_RATIO)==TRUE] <- median(MB_train_median$TEAM_BASERUN_SB_RATIO, na.rm = TRUE)
MB_train_median$TEAM_BASERUN_CS_RATIO[is.na(MB_train_median$TEAM_BASERUN_CS_RATIO)==TRUE] <- median(MB_train_median$TEAM_BASERUN_CS_RATIO, na.rm = TRUE)
MB_train_median$TEAM_BATTING_WALK[is.na(MB_train_median$TEAM_BATTING_WALK)==TRUE] <- median(MB_train_median$TEAM_BATTING_WALK, na.rm = TRUE)
A log transformation is used to deal with skewed data. It can decrease the variability of the data and make them conform more closely to a normal distribution. The histograms above showed many skewed distributions, so we apply a log transformation to those variables.
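One detail of log transformations worth illustrating is the handling of zeros: `log(0)` is `-Inf` in R, so zeros have to be dealt with before transforming. A minimal sketch on a made-up vector (`x` is illustrative only), replacing zeros with 1 before taking the log:

```r
# Made-up right-skewed values, including a zero
x <- c(0, 5, 50, 500, 5000)

# Taking the log directly produces -Inf for the zero entry
raw_log <- log(x)

# Replace zeros with 1 first, so that they map to log(1) = 0
x_adj <- x
x_adj[x_adj == 0] <- 1
safe_log <- log(x_adj)

print(raw_log)
print(safe_log)
```

The transformed values span a far narrower range than the raw ones, which is the variance-reducing effect described above.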
# Log transformation
MB_train_log <- MB_train
# replace by log
MB_train_log$TEAM_FIELDING_E <- log(MB_train_log$TEAM_FIELDING_E)
MB_train_log$TEAM_PITCHING_H <- log(MB_train_log$TEAM_PITCHING_H)
MB_train_log$TEAM_PITCHING_SO[MB_train_log$TEAM_PITCHING_SO==0] <- 1
MB_train_log$TEAM_PITCHING_SO <- log(MB_train_log$TEAM_PITCHING_SO)
MB_train_log$TEAM_PITCHING_BB[MB_train_log$TEAM_PITCHING_BB==0] <- 1
MB_train_log$TEAM_PITCHING_BB <- log(MB_train_log$TEAM_PITCHING_BB)
To limit the effect of outliers, we trim the data at the 95th and 99th percentiles.
ggplot(stack(MB_train), aes(x = ind, y = values)) +
geom_boxplot(col="darkblue") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
ggplot(stack(MB_train_mean), aes(x = ind, y = values)) +
geom_boxplot(col="darkblue") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
ggplot(stack(MB_train_median), aes(x = ind, y = values)) +
geom_boxplot(col="darkblue") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
ggplot(stack(MB_train_log), aes(x = ind, y = values)) +
geom_boxplot(col="darkblue") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
First we calculate the trimming percentile values for TEAM_FIELDING_E, TEAM_PITCHING_H, TEAM_PITCHING_SO, and TEAM_PITCHING_BB, because these variables have the largest number of extreme outliers.
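Trimming of this kind amounts to capping values above a chosen percentile at that percentile. A minimal sketch of the idea, assuming a hypothetical helper `cap_upper` and made-up data (neither is part of the original analysis):

```r
# Hypothetical helper: set every value above the p-th quantile to that quantile
cap_upper <- function(x, p) {
  limit <- unname(quantile(x, probs = p, na.rm = TRUE))
  x[!is.na(x) & x > limit] <- limit
  x
}

set.seed(1)
# 98 ordinary values plus two extreme outliers
values <- c(rnorm(98, mean = 100, sd = 10), 900, 2000)

capped_95 <- cap_upper(values, 0.95)  # cap at the 95th percentile
capped_99 <- cap_upper(values, 0.99)  # cap at the 99th percentile

print(max(values))
print(max(capped_95))
print(max(capped_99))
```

No rows are removed: the vector keeps its length and only the upper tail is pulled in, so the 95% version is the more aggressive of the two.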
# trimming percentile value
quant_95_TEAM_FIELDING_E <- unname(quantile(MB_train_mean$TEAM_FIELDING_E, probs=c(0.01,0.05,0.95,0.99))[3])
quant_95_TEAM_PITCHING_H <- unname(quantile(MB_train_mean$TEAM_PITCHING_H, probs=c(0.01,0.05,0.95,0.99))[3])
quant_95_TEAM_PITCHING_SO <- unname(quantile(MB_train_mean$TEAM_PITCHING_SO, probs=c(0.01,0.05,0.95,0.99))[3])
quant_95_TEAM_PITCHING_BB <- unname(quantile(MB_train_mean$TEAM_PITCHING_BB, probs=c(0.01,0.05,0.95,0.99))[3])
quant_99_TEAM_FIELDING_E <- unname(quantile(MB_train_mean$TEAM_FIELDING_E, probs=c(0.01,0.05,0.95,0.99))[4])
quant_99_TEAM_PITCHING_H <- unname(quantile(MB_train_mean$TEAM_PITCHING_H, probs=c(0.01,0.05,0.95,0.99))[4])
quant_99_TEAM_PITCHING_SO <- unname(quantile(MB_train_mean$TEAM_PITCHING_SO, probs=c(0.01,0.05,0.95,0.99))[4])
quant_99_TEAM_PITCHING_BB <- unname(quantile(MB_train_mean$TEAM_PITCHING_BB, probs=c(0.01,0.05,0.95,0.99))[4])
# Trim at the 95th percentile: values above it are set to the 95th percentile value
MB_train_95trim <- MB_train_mean
MB_train_95trim$TEAM_FIELDING_E[MB_train_95trim$TEAM_FIELDING_E > quant_95_TEAM_FIELDING_E] <- quant_95_TEAM_FIELDING_E
MB_train_95trim$TEAM_PITCHING_H[MB_train_95trim$TEAM_PITCHING_H > quant_95_TEAM_PITCHING_H] <- quant_95_TEAM_PITCHING_H
MB_train_95trim$TEAM_PITCHING_SO[MB_train_95trim$TEAM_PITCHING_SO > quant_95_TEAM_PITCHING_SO] <- quant_95_TEAM_PITCHING_SO
MB_train_95trim$TEAM_PITCHING_BB[MB_train_95trim$TEAM_PITCHING_BB > quant_95_TEAM_PITCHING_BB] <- quant_95_TEAM_PITCHING_BB
# Trim at the 99th percentile: values above it are set to the 99th percentile value
MB_train_99trim <- MB_train_mean
MB_train_99trim$TEAM_FIELDING_E[MB_train_99trim$TEAM_FIELDING_E > quant_99_TEAM_FIELDING_E] <- quant_99_TEAM_FIELDING_E
MB_train_99trim$TEAM_PITCHING_H[MB_train_99trim$TEAM_PITCHING_H > quant_99_TEAM_PITCHING_H] <- quant_99_TEAM_PITCHING_H
MB_train_99trim$TEAM_PITCHING_SO[MB_train_99trim$TEAM_PITCHING_SO > quant_99_TEAM_PITCHING_SO] <- quant_99_TEAM_PITCHING_SO
MB_train_99trim$TEAM_PITCHING_BB[MB_train_99trim$TEAM_PITCHING_BB > quant_99_TEAM_PITCHING_BB] <- quant_99_TEAM_PITCHING_BB
After the extreme outliers are trimmed, the distributions look closer to normal.
# visualize
par(mfrow=c(3,1))
hist(MB_train_log$TEAM_FIELDING_E, breaks = 20)
hist(MB_train_95trim$TEAM_FIELDING_E, breaks = 20)
hist(MB_train_99trim$TEAM_FIELDING_E, breaks = 20)
par(mfrow=c(3,1))
hist(MB_train_log$TEAM_PITCHING_H, breaks = 20)
hist(MB_train_95trim$TEAM_PITCHING_H, breaks = 20)
hist(MB_train_99trim$TEAM_PITCHING_H, breaks = 20)
par(mfrow=c(3,1))
hist(MB_train_log$TEAM_PITCHING_SO, breaks = 20)
hist(MB_train_95trim$TEAM_PITCHING_SO, breaks = 20)
hist(MB_train_99trim$TEAM_PITCHING_SO, breaks = 20)
par(mfrow=c(3,1))
hist(MB_train_log$TEAM_PITCHING_BB, breaks = 20)
hist(MB_train_95trim$TEAM_PITCHING_BB, breaks = 20)
hist(MB_train_99trim$TEAM_PITCHING_BB, breaks = 20)
Build Models
The primary metrics used for determining the accuracy of each model were \(R^2\), adjusted \(R^2\), and MSE. The stats package performs the calculations quickly and provides a table output of the summary. The combination of \(R^2\), adjusted \(R^2\), MSE and the F-statistic gives insight into which specific variables produce the best-performing model. The p-value is used to determine the statistical significance of each coefficient included in the model. A p-value of less than 0.05 indicates that the variable is statistically significant.
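As a rough illustration of how these metrics relate to a fitted model, the sketch below computes them by hand for a small regression on R's built-in `mtcars` data. The data set and the formula `mpg ~ wt + hp` are stand-ins chosen for the example, not the models fitted in this report.

```r
# Fit a small model on built-in data (a stand-in for the baseball data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

res <- residuals(fit)
n <- length(res)            # number of observations
p <- length(coef(fit)) - 1  # number of predictors, excluding the intercept

mse <- mean(res^2)                              # mean squared error (training data)
sse <- sum(res^2)                               # residual sum of squares
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
r2 <- 1 - sse / sst
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalizes extra predictors

# Coefficient table from summary(); flag p-values below the 0.05 threshold
coef_table <- summary(fit)$coefficients
significant <- coef_table[, "Pr(>|t|)"] < 0.05

print(c(MSE = mse, R2 = r2, Adj_R2 = adj_r2))
print(significant)
```

The hand-computed \(R^2\) and adjusted \(R^2\) agree with the values reported by `summary(fit)`. Adjusted \(R^2\) is the more useful of the two when comparing models with different numbers of predictors.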
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = mb_hist)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.041 -8.558 0.145 8.907 53.331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.2930631 4.5824723 6.174 7.86e-10 ***
## TEAM_BATTING_H 0.0429317 0.0034842 12.322 < 2e-16 ***
## TEAM_BATTING_2B -0.0042264 0.0097037 -0.436 0.663209
## TEAM_BATTING_3B 0.0649817 0.0170796 3.805 0.000146 ***
## TEAM_BATTING_HR 0.0703822 0.0282035 2.496 0.012648 *
## TEAM_BATTING_BB 0.0022019 0.0059003 0.373 0.709047
## TEAM_BATTING_SO -0.0101940 0.0020913 -4.875 1.17e-06 ***
## TEAM_BASERUN_SB 0.0042336 0.0042081 1.006 0.314494
## TEAM_BASERUN_CS -0.0077185 0.0109760 -0.703 0.481993
## TEAM_BATTING_HBP -0.0542327 0.0194784 -2.784 0.005410 **
## TEAM_PITCHING_H -0.0008096 0.0003800 -2.130 0.033238 *
## TEAM_PITCHING_HR -0.0025561 0.0249180 -0.103 0.918304
## TEAM_PITCHING_BB 0.0028645 0.0042178 0.679 0.497119
## TEAM_PITCHING_SO 0.0019800 0.0009422 2.101 0.035710 *
## TEAM_FIELDING_E -0.0286814 0.0030151 -9.512 < 2e-16 ***
## TEAM_FIELDING_DP -0.0657273 0.0101980 -6.445 1.41e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.23 on 2260 degrees of freedom
## Multiple R-squared: 0.299, Adjusted R-squared: 0.2944
## F-statistic: 64.27 on 15 and 2260 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.008 -8.499 0.069 8.438 59.785
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.8846190 9.7449412 -0.809 0.418544
## INDEX -0.0005938 0.0003764 -1.578 0.114789
## TEAM_BATTING_H 0.0471168 0.0036848 12.787 < 2e-16 ***
## TEAM_BATTING_2B -0.0214205 0.0091939 -2.330 0.019901 *
## TEAM_BATTING_3B 0.0685838 0.0169941 4.036 5.62e-05 ***
## TEAM_BATTING_HR 0.0547467 0.0273400 2.002 0.045357 *
## TEAM_BATTING_BB 0.0068437 0.0058913 1.162 0.245492
## TEAM_BATTING_SO -0.0098886 0.0025590 -3.864 0.000115 ***
## TEAM_BASERUN_SB 0.0272772 0.0048644 5.607 2.30e-08 ***
## TEAM_BASERUN_CS -0.0036781 0.0166524 -0.221 0.825208
## TEAM_BATTING_HBP 0.0151583 0.0745654 0.203 0.838928
## TEAM_PITCHING_H -0.0007044 0.0003668 -1.920 0.054943 .
## TEAM_PITCHING_HR 0.0127466 0.0242483 0.526 0.599169
## TEAM_PITCHING_BB 0.0007824 0.0041356 0.189 0.849972
## TEAM_PITCHING_SO 0.0026989 0.0009162 2.946 0.003253 **
## TEAM_FIELDING_E -0.0224901 0.0024992 -8.999 < 2e-16 ***
## TEAM_FIELDING_DP -0.1120703 0.0131356 -8.532 < 2e-16 ***
## TEAM_BATTING_1B NA NA NA NA
## TEAM_BASERUN_SB_RATIO 9.7118989 4.7791864 2.032 0.042258 *
## TEAM_BASERUN_CS_RATIO NA NA NA NA
## TEAM_BATTING_WALK 0.0480023 0.0130995 3.664 0.000254 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.99 on 2257 degrees of freedom
## Multiple R-squared: 0.3251, Adjusted R-squared: 0.3197
## F-statistic: 60.4 on 18 and 2257 DF, p-value: < 2.2e-16
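The "not defined because of singularities" rows in a summary like the one above are what `lm` reports when a predictor is an exact linear combination of other predictors. That is the case here by construction: TEAM_BATTING_1B is TEAM_BATTING_H minus doubles, triples and home runs, and the two baserunning ratios always sum to 1. A minimal sketch with made-up data (all variable names and numbers are illustrative):

```r
set.seed(42)
n <- 50
sb <- rpois(n, 100) + 1      # made-up stolen-base counts
cs <- rpois(n, 50) + 1       # made-up caught-stealing counts
sb_ratio <- sb / (sb + cs)
cs_ratio <- cs / (sb + cs)   # equals 1 - sb_ratio for every row
wins <- 60 + 30 * sb_ratio + rnorm(n, sd = 5)

# Both ratios plus the intercept are linearly dependent
fit <- lm(wins ~ sb_ratio + cs_ratio)

print(coef(fit))   # the cs_ratio coefficient is NA
print(all.equal(sb_ratio + cs_ratio, rep(1, n)))
```

Dropping one variable from each dependent set removes the NA rows without changing the fitted values.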
model_2 <- lm(TARGET_WINS ~ ., data=MB_train_mean[c(2,3,4,5,6,7,8,9,15,16,17,19,21)])
summary(model_2)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_mean[c(2, 3, 4,
## 5, 6, 7, 8, 9, 15, 16, 17, 19, 21)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.429 -8.554 -0.110 8.457 60.585
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.9906386 9.1768603 -0.871 0.383990
## TEAM_BATTING_H 0.0463844 0.0036211 12.810 < 2e-16 ***
## TEAM_BATTING_2B -0.0221185 0.0091639 -2.414 0.015872 *
## TEAM_BATTING_3B 0.0751391 0.0164317 4.573 5.07e-06 ***
## TEAM_BATTING_HR 0.0662129 0.0095982 6.898 6.80e-12 ***
## TEAM_BATTING_BB 0.0083310 0.0034570 2.410 0.016036 *
## TEAM_BATTING_SO -0.0090283 0.0024111 -3.744 0.000185 ***
## TEAM_BASERUN_SB 0.0278509 0.0045523 6.118 1.11e-09 ***
## TEAM_PITCHING_SO 0.0022214 0.0005933 3.744 0.000185 ***
## TEAM_FIELDING_E -0.0246881 0.0020551 -12.013 < 2e-16 ***
## TEAM_FIELDING_DP -0.1128548 0.0131165 -8.604 < 2e-16 ***
## TEAM_BASERUN_SB_RATIO 9.6022690 4.5883768 2.093 0.036484 *
## TEAM_BATTING_WALK 0.0479878 0.0127753 3.756 0.000177 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13 on 2263 degrees of freedom
## Multiple R-squared: 0.3229, Adjusted R-squared: 0.3193
## F-statistic: 89.94 on 12 and 2263 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_median)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.843 -8.528 0.002 8.473 59.910
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.2624290 9.6007797 -0.652 0.514285
## INDEX -0.0005873 0.0003775 -1.556 0.119900
## TEAM_BATTING_H 0.0478166 0.0036925 12.950 < 2e-16 ***
## TEAM_BATTING_2B -0.0238012 0.0092259 -2.580 0.009948 **
## TEAM_BATTING_3B 0.0733280 0.0170207 4.308 1.72e-05 ***
## TEAM_BATTING_HR 0.0533380 0.0274143 1.946 0.051823 .
## TEAM_BATTING_BB 0.0076295 0.0059063 1.292 0.196574
## TEAM_BATTING_SO -0.0091831 0.0025542 -3.595 0.000331 ***
## TEAM_BASERUN_SB 0.0221644 0.0047543 4.662 3.31e-06 ***
## TEAM_BASERUN_CS 0.0005946 0.0163712 0.036 0.971032
## TEAM_BATTING_HBP -0.0033628 0.0746905 -0.045 0.964093
## TEAM_PITCHING_H -0.0008018 0.0003668 -2.186 0.028910 *
## TEAM_PITCHING_HR 0.0117683 0.0243228 0.484 0.628549
## TEAM_PITCHING_BB 0.0013913 0.0041474 0.335 0.737315
## TEAM_PITCHING_SO 0.0026876 0.0009195 2.923 0.003504 **
## TEAM_FIELDING_E -0.0208981 0.0024809 -8.424 < 2e-16 ***
## TEAM_FIELDING_DP -0.1123314 0.0130813 -8.587 < 2e-16 ***
## TEAM_BATTING_1B NA NA NA NA
## TEAM_BASERUN_SB_RATIO 11.8139658 4.7615848 2.481 0.013170 *
## TEAM_BASERUN_CS_RATIO NA NA NA NA
## TEAM_BATTING_WALK 0.0430731 0.0131131 3.285 0.001036 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.03 on 2257 degrees of freedom
## Multiple R-squared: 0.3213, Adjusted R-squared: 0.3158
## F-statistic: 59.35 on 18 and 2257 DF, p-value: < 2.2e-16
model_3 <- lm(TARGET_WINS ~ ., data=MB_train_median[c(2,3,4,5,6,7,8,9,15,16,17,19,21)])
summary(model_3)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_median[c(2, 3,
## 4, 5, 6, 7, 8, 9, 15, 16, 17, 19, 21)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.838 -8.495 -0.117 8.425 60.859
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.723157 9.058955 -0.742 0.458070
## TEAM_BATTING_H 0.046994 0.003629 12.948 < 2e-16 ***
## TEAM_BATTING_2B -0.024483 0.009192 -2.663 0.007790 **
## TEAM_BATTING_3B 0.080780 0.016418 4.920 9.27e-07 ***
## TEAM_BATTING_HR 0.063018 0.009600 6.565 6.45e-11 ***
## TEAM_BATTING_BB 0.009822 0.003479 2.823 0.004802 **
## TEAM_BATTING_SO -0.008280 0.002403 -3.446 0.000580 ***
## TEAM_BASERUN_SB 0.022975 0.004482 5.126 3.20e-07 ***
## TEAM_PITCHING_SO 0.002237 0.000595 3.761 0.000174 ***
## TEAM_FIELDING_E -0.023431 0.002050 -11.431 < 2e-16 ***
## TEAM_FIELDING_DP -0.113024 0.013064 -8.651 < 2e-16 ***
## TEAM_BASERUN_SB_RATIO 11.671968 4.545969 2.568 0.010306 *
## TEAM_BATTING_WALK 0.042156 0.012780 3.299 0.000987 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.04 on 2263 degrees of freedom
## Multiple R-squared: 0.3188, Adjusted R-squared: 0.3152
## F-statistic: 88.25 on 12 and 2263 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_log)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.0211 -5.8358 -0.1011 5.2071 23.7836
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.926e+03 2.388e+03 -1.225 0.222103
## INDEX -3.559e-04 8.575e-04 -0.415 0.678605
## TEAM_BATTING_H -3.100e-01 2.528e-01 -1.226 0.221783
## TEAM_BATTING_2B 2.347e-02 3.052e-02 0.769 0.442970
## TEAM_BATTING_3B -1.088e-01 7.805e-02 -1.394 0.165253
## TEAM_BATTING_HR 1.949e+00 2.693e+00 0.724 0.470251
## TEAM_BATTING_BB -8.305e-03 9.256e-02 -0.090 0.928612
## TEAM_BATTING_SO -1.072e-03 8.531e-02 -0.013 0.989987
## TEAM_BASERUN_SB 1.290e-01 8.937e-02 1.443 0.150816
## TEAM_BASERUN_CS -2.272e-01 2.065e-01 -1.100 0.272806
## TEAM_BATTING_HBP 8.079e-02 4.969e-02 1.626 0.105843
## TEAM_PITCHING_H 4.912e+02 3.751e+02 1.310 0.192066
## TEAM_PITCHING_HR -1.856e+00 2.694e+00 -0.689 0.491833
## TEAM_PITCHING_BB 3.417e+01 5.034e+01 0.679 0.498163
## TEAM_PITCHING_SO -3.370e+01 9.022e+01 -0.374 0.709186
## TEAM_FIELDING_E -1.728e+01 4.389e+00 -3.938 0.000119 ***
## TEAM_FIELDING_DP -9.073e-02 3.739e-02 -2.427 0.016271 *
## TEAM_BATTING_1B NA NA NA NA
## TEAM_BASERUN_SB_RATIO -3.888e+01 3.651e+01 -1.065 0.288435
## TEAM_BASERUN_CS_RATIO NA NA NA NA
## TEAM_BATTING_WALK NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.475 on 173 degrees of freedom
## (2085 observations deleted due to missingness)
## Multiple R-squared: 0.5545, Adjusted R-squared: 0.5107
## F-statistic: 12.66 on 17 and 173 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_log[c(2, 3, 4,
## 5, 6, 7, 8, 9, 15, 16, 17)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.116 -7.134 0.043 7.091 29.759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 141.987953 20.142940 7.049 2.55e-12 ***
## TEAM_BATTING_H 0.028487 0.004350 6.549 7.50e-11 ***
## TEAM_BATTING_2B -0.053440 0.008934 -5.981 2.66e-09 ***
## TEAM_BATTING_3B 0.187939 0.019156 9.811 < 2e-16 ***
## TEAM_BATTING_HR 0.096363 0.009244 10.425 < 2e-16 ***
## TEAM_BATTING_BB 0.034615 0.003129 11.061 < 2e-16 ***
## TEAM_BATTING_SO -0.025636 0.004341 -5.905 4.19e-09 ***
## TEAM_BASERUN_SB 0.062197 0.005482 11.345 < 2e-16 ***
## TEAM_PITCHING_SO 1.518468 3.071460 0.494 0.621
## TEAM_FIELDING_E -21.394394 1.300870 -16.446 < 2e-16 ***
## TEAM_FIELDING_DP -0.110620 0.012215 -9.056 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.2 on 1824 degrees of freedom
## (441 observations deleted due to missingness)
## Multiple R-squared: 0.4019, Adjusted R-squared: 0.3986
## F-statistic: 122.5 on 10 and 1824 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_95trim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.763 -8.365 0.029 8.214 74.484
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.529e+01 9.863e+00 -1.550 0.12123
## INDEX -6.222e-04 3.782e-04 -1.645 0.10005
## TEAM_BATTING_H 3.679e-02 3.890e-03 9.458 < 2e-16 ***
## TEAM_BATTING_2B -2.317e-02 9.310e-03 -2.489 0.01288 *
## TEAM_BATTING_3B 8.956e-02 1.703e-02 5.258 1.59e-07 ***
## TEAM_BATTING_HR 8.221e-02 2.937e-02 2.799 0.00517 **
## TEAM_BATTING_BB 5.980e-02 8.489e-03 7.044 2.47e-12 ***
## TEAM_BATTING_SO -1.646e-02 5.755e-03 -2.860 0.00427 **
## TEAM_BASERUN_SB 3.414e-02 5.152e-03 6.627 4.27e-11 ***
## TEAM_BASERUN_CS -2.701e-03 1.697e-02 -0.159 0.87352
## TEAM_BATTING_HBP 3.004e-02 7.503e-02 0.400 0.68895
## TEAM_PITCHING_H 1.380e-02 2.543e-03 5.426 6.39e-08 ***
## TEAM_PITCHING_HR -3.021e-02 2.623e-02 -1.152 0.24945
## TEAM_PITCHING_BB -4.185e-02 7.751e-03 -5.399 7.40e-08 ***
## TEAM_PITCHING_SO 1.511e-02 4.912e-03 3.076 0.00212 **
## TEAM_FIELDING_E -3.535e-02 4.005e-03 -8.827 < 2e-16 ***
## TEAM_FIELDING_DP -1.089e-01 1.327e-02 -8.210 3.68e-16 ***
## TEAM_BATTING_1B NA NA NA NA
## TEAM_BASERUN_SB_RATIO 5.271e+00 4.856e+00 1.085 0.27784
## TEAM_BASERUN_CS_RATIO NA NA NA NA
## TEAM_BATTING_WALK 3.974e-02 1.317e-02 3.018 0.00257 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.07 on 2257 degrees of freedom
## Multiple R-squared: 0.3174, Adjusted R-squared: 0.312
## F-statistic: 58.32 on 18 and 2257 DF, p-value: < 2.2e-16
model_5 <- lm(TARGET_WINS ~ ., data=MB_train_95trim[c(2,3,4,5,6,7,8,12,13,15,16,20,21)])
summary(model_5)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_95trim[c(2, 3,
## 4, 5, 6, 7, 8, 12, 13, 15, 16, 20, 21)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.262 -8.566 0.032 8.668 60.989
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.848579 9.529302 -1.978 0.048054 *
## TEAM_BATTING_H 0.041858 0.003983 10.510 < 2e-16 ***
## TEAM_BATTING_2B -0.032256 0.009527 -3.386 0.000722 ***
## TEAM_BATTING_3B 0.140324 0.016903 8.302 < 2e-16 ***
## TEAM_BATTING_HR 0.099973 0.028040 3.565 0.000371 ***
## TEAM_BATTING_BB 0.021396 0.003217 6.650 3.65e-11 ***
## TEAM_BATTING_SO -0.002913 0.005666 -0.514 0.607271
## TEAM_PITCHING_H 0.006819 0.002463 2.768 0.005684 **
## TEAM_PITCHING_HR -0.076235 0.025402 -3.001 0.002719 **
## TEAM_PITCHING_SO 0.007638 0.004823 1.584 0.113356
## TEAM_FIELDING_E -0.026852 0.003690 -7.277 4.68e-13 ***
## TEAM_BASERUN_CS_RATIO -23.124996 4.425156 -5.226 1.89e-07 ***
## TEAM_BATTING_WALK 0.042028 0.013179 3.189 0.001447 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.52 on 2263 degrees of freedom
## Multiple R-squared: 0.2671, Adjusted R-squared: 0.2633
## F-statistic: 68.74 on 12 and 2263 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_99trim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.355 -8.518 -0.061 8.195 66.340
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.1714381 9.7943870 -0.936 0.349169
## INDEX -0.0005839 0.0003775 -1.547 0.122062
## TEAM_BATTING_H 0.0433943 0.0038969 11.136 < 2e-16 ***
## TEAM_BATTING_2B -0.0187801 0.0091791 -2.046 0.040875 *
## TEAM_BATTING_3B 0.0845590 0.0174760 4.839 1.40e-06 ***
## TEAM_BATTING_HR 0.0812832 0.0314005 2.589 0.009699 **
## TEAM_BATTING_BB 0.0434242 0.0098952 4.388 1.19e-05 ***
## TEAM_BATTING_SO -0.0266313 0.0054446 -4.891 1.07e-06 ***
## TEAM_BASERUN_SB 0.0331708 0.0050378 6.584 5.66e-11 ***
## TEAM_BASERUN_CS -0.0036342 0.0167675 -0.217 0.828430
## TEAM_BATTING_HBP 0.0228174 0.0748587 0.305 0.760541
## TEAM_PITCHING_H 0.0023023 0.0009248 2.489 0.012866 *
## TEAM_PITCHING_HR -0.0185614 0.0279482 -0.664 0.506672
## TEAM_PITCHING_BB -0.0293608 0.0081877 -3.586 0.000343 ***
## TEAM_PITCHING_SO 0.0203634 0.0044766 4.549 5.68e-06 ***
## TEAM_FIELDING_E -0.0282761 0.0031730 -8.912 < 2e-16 ***
## TEAM_FIELDING_DP -0.1050337 0.0132362 -7.935 3.28e-15 ***
## TEAM_BATTING_1B NA NA NA NA
## TEAM_BASERUN_SB_RATIO 8.5501892 4.8316905 1.770 0.076928 .
## TEAM_BASERUN_CS_RATIO NA NA NA NA
## TEAM_BATTING_WALK 0.0429526 0.0131798 3.259 0.001135 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.04 on 2257 degrees of freedom
## Multiple R-squared: 0.3201, Adjusted R-squared: 0.3146
## F-statistic: 59.02 on 18 and 2257 DF, p-value: < 2.2e-16
model_6 <- lm(TARGET_WINS ~ ., data=MB_train_99trim[c(2,3,4,5,6,7,8,9,12,14,15,16,17,21)])
summary(model_6)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_99trim[c(2, 3,
## 4, 5, 6, 7, 8, 9, 12, 14, 15, 16, 17, 21)])
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.874 -8.419 0.016 8.203 66.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.527734 8.985359 -0.504 0.614380
## TEAM_BATTING_H 0.043483 0.003886 11.189 < 2e-16 ***
## TEAM_BATTING_2B -0.016681 0.009080 -1.837 0.066326 .
## TEAM_BATTING_3B 0.077813 0.017138 4.540 5.91e-06 ***
## TEAM_BATTING_HR 0.062586 0.009633 6.497 1.00e-10 ***
## TEAM_BATTING_BB 0.045760 0.009726 4.705 2.69e-06 ***
## TEAM_BATTING_SO -0.025301 0.005180 -4.885 1.11e-06 ***
## TEAM_BASERUN_SB 0.035533 0.004454 7.978 2.34e-15 ***
## TEAM_PITCHING_H 0.002364 0.000908 2.604 0.009280 **
## TEAM_PITCHING_BB -0.031599 0.007973 -3.963 7.63e-05 ***
## TEAM_PITCHING_SO 0.019540 0.004206 4.646 3.57e-06 ***
## TEAM_FIELDING_E -0.027748 0.003020 -9.187 < 2e-16 ***
## TEAM_FIELDING_DP -0.108767 0.013142 -8.277 < 2e-16 ***
## TEAM_BATTING_WALK 0.043904 0.012851 3.416 0.000646 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.04 on 2262 degrees of freedom
## Multiple R-squared: 0.3182, Adjusted R-squared: 0.3143
## F-statistic: 81.22 on 13 and 2262 DF, p-value: < 2.2e-16
R_2 <- summary(model_1)$r.squared
Adj_R_2 <- summary(model_1)$adj.r.squared
MSE <- sum(model_1$residuals ^ 2 ) / model_1$df.residual
F_statistics <- summary(model_1)$fstatistic[1]
m2_R_2 <- summary(model_2)$r.squared
m2_Adj_R_2 <- summary(model_2)$adj.r.squared
m2_MSE <- sum(model_2$residuals ^ 2 ) / model_2$df.residual
m2_F_statistics <- summary(model_2)$fstatistic[1]
m3_R_2 <- summary(model_3)$r.squared
m3_Adj_R_2 <- summary(model_3)$adj.r.squared
m3_MSE <- sum(model_3$residuals ^ 2 ) / model_3$df.residual
m3_F_statistics <- summary(model_3)$fstatistic[1]
m4_R_2 <- summary(model_4)$r.squared
m4_Adj_R_2 <- summary(model_4)$adj.r.squared
m4_MSE <- sum(model_4$residuals ^ 2 ) / model_4$df.residual
m4_F_statistics <- summary(model_4)$fstatistic[1]
m5_R_2 <- summary(model_5)$r.squared
m5_Adj_R_2 <- summary(model_5)$adj.r.squared
m5_MSE <- sum(model_5$residuals ^ 2 ) / model_5$df.residual
m5_F_statistics <- summary(model_5)$fstatistic[1]
m6_R_2 <- summary(model_6)$r.squared
m6_Adj_R_2 <- summary(model_6)$adj.r.squared
m6_MSE <- sum(model_6$residuals ^ 2 ) / model_6$df.residual
m6_F_statistics <- summary(model_6)$fstatistic[1]
compare_model1 <- rbind(R_2, Adj_R_2, MSE, F_statistics )
compare_model2 <- rbind(m2_R_2, m2_Adj_R_2, m2_MSE, m2_F_statistics )
compare_model3 <- rbind(m3_R_2, m3_Adj_R_2, m3_MSE, m3_F_statistics )
compare_model4 <- rbind(m4_R_2, m4_Adj_R_2, m4_MSE, m4_F_statistics )
compare_model5 <- rbind(m5_R_2, m5_Adj_R_2, m5_MSE, m5_F_statistics )
compare_model6 <- rbind(m6_R_2, m6_Adj_R_2, m6_MSE, m6_F_statistics )
compare <- data.frame(compare_model1, compare_model2, compare_model3, compare_model4, compare_model5,compare_model6)
colnames(compare) <- c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6")
kable(compare)
| | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 |
|---|---|---|---|---|---|---|
| R_2 | 0.2990080 | 0.3229188 | 0.3187801 | 0.4018550 | 0.2671466 | 0.3182463 |
| Adj_R_2 | 0.2943554 | 0.3193284 | 0.3151678 | 0.3985757 | 0.2632605 | 0.3143281 |
| MSE | 175.0918065 | 168.8952541 | 169.9276276 | 104.1329280 | 182.8074100 | 170.1359684 |
| F_statistics | 64.2668440 | 89.9406020 | 88.2484656 | 122.5427589 | 68.7441504 | 81.2241260 |
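The statistics in this table are computed with four near-identical lines per model. As a minimal sketch, the same calculation can be wrapped in one function and applied to a list of models. The names `model_metrics` and `compare_sketch` are hypothetical, and the two models are stand-ins fitted on the built-in `mtcars` data rather than the models from this report.

```r
# Return the four comparison statistics for a single lm object.
model_metrics <- function(model) {
  s <- summary(model)
  c(
    R_2          = s$r.squared,
    Adj_R_2      = s$adj.r.squared,
    # Residual sum of squares divided by residual degrees of freedom
    MSE          = sum(residuals(model)^2) / df.residual(model),
    F_statistics = unname(s$fstatistic["value"])
  )
}

# Stand-in models; in the report this list would hold model_1 ... model_6.
models <- list(
  "Model A" = lm(mpg ~ wt, data = mtcars),
  "Model B" = lm(mpg ~ wt + hp, data = mtcars)
)

# One column per model, one row per statistic.
compare_sketch <- as.data.frame(sapply(models, model_metrics))
print(compare_sketch)
```

Adding a seventh model then only requires one more entry in the list.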
Model Selection
From the table above, Model 4 and Model 2 give the best fit. However, in Model 4 the p-values of the explanatory variables are higher than 0.05, so they are not statistically significant. We therefore go with Model 2, which has the highest \(R^2\), adjusted \(R^2\) and F-statistic, and the lowest MSE, among the remaining models. Lower values of RMSE indicate a better fit; the table reports MSE, and its square root gives the RMSE (for Model 2, \(\sqrt{168.9} \approx 13.0\) wins). RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion for fit if the main purpose of the model is prediction.
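Because the statistics above are all computed on the training data, one way to check predictive accuracy is to compare the RMSE on the training rows with the RMSE on rows held out from fitting. The following is a minimal sketch under stated assumptions: it uses the built-in `mtcars` data as a stand-in, an 80/20 split chosen for illustration, and a hypothetical helper named `rmse`.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

set.seed(621)

# Hold out 20% of the rows; the split ratio is an assumption.
idx     <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
train   <- mtcars[idx, ]
holdout <- mtcars[-idx, ]

fit <- lm(mpg ~ wt + hp, data = train)

train_rmse   <- rmse(train$mpg, fitted(fit))
holdout_rmse <- rmse(holdout$mpg, predict(fit, newdata = holdout))

print(c(train = train_rmse, holdout = holdout_rmse))
```

A holdout RMSE much larger than the training RMSE would suggest the model is overfitting.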
MB_test<- read.csv("https://raw.githubusercontent.com/mkollontai/DATA621_GroupWork/master/HW1/moneyball-evaluation-data.csv")
MB_test$TEAM_BATTING_1B <- MB_test$TEAM_BATTING_H - MB_test$TEAM_BATTING_2B - MB_test$TEAM_BATTING_3B - MB_test$TEAM_BATTING_HR
MB_test$TEAM_BASERUN_SB_RATIO <- MB_test$TEAM_BASERUN_SB/(MB_test$TEAM_BASERUN_SB + MB_test$TEAM_BASERUN_CS)
MB_test$TEAM_BASERUN_CS_RATIO <- MB_test$TEAM_BASERUN_CS/(MB_test$TEAM_BASERUN_SB + MB_test$TEAM_BASERUN_CS)
MB_test$TEAM_BATTING_WALK <- MB_test$TEAM_BATTING_BB + MB_test$TEAM_BATTING_HBP
# Replace missing values in each column with that column's median
MB_test$TEAM_BATTING_SO[is.na(MB_test$TEAM_BATTING_SO)] <- median(MB_test$TEAM_BATTING_SO, na.rm = TRUE)
MB_test$TEAM_BASERUN_SB[is.na(MB_test$TEAM_BASERUN_SB)] <- median(MB_test$TEAM_BASERUN_SB, na.rm = TRUE)
MB_test$TEAM_BASERUN_CS[is.na(MB_test$TEAM_BASERUN_CS)] <- median(MB_test$TEAM_BASERUN_CS, na.rm = TRUE)
MB_test$TEAM_BATTING_HBP[is.na(MB_test$TEAM_BATTING_HBP)] <- median(MB_test$TEAM_BATTING_HBP, na.rm = TRUE)
MB_test$TEAM_PITCHING_SO[is.na(MB_test$TEAM_PITCHING_SO)] <- median(MB_test$TEAM_PITCHING_SO, na.rm = TRUE)
MB_test$TEAM_FIELDING_DP[is.na(MB_test$TEAM_FIELDING_DP)] <- median(MB_test$TEAM_FIELDING_DP, na.rm = TRUE)
MB_test$TEAM_BASERUN_SB_RATIO[is.na(MB_test$TEAM_BASERUN_SB_RATIO)] <- median(MB_test$TEAM_BASERUN_SB_RATIO, na.rm = TRUE)
MB_test$TEAM_BASERUN_CS_RATIO[is.na(MB_test$TEAM_BASERUN_CS_RATIO)] <- median(MB_test$TEAM_BASERUN_CS_RATIO, na.rm = TRUE)
MB_test$TEAM_BATTING_WALK[is.na(MB_test$TEAM_BATTING_WALK)] <- median(MB_test$TEAM_BATTING_WALK, na.rm = TRUE)
prediction <- predict(model_2, MB_test, interval = "prediction")
DT::datatable(prediction)
## fit lwr upr
## Min. : 25.13 Min. :-0.8872 Min. : 51.15
## 1st Qu.: 74.08 1st Qu.:48.5580 1st Qu.: 99.60
## Median : 79.42 Median :53.8642 Median :104.97
## Mean : 79.10 Mean :53.5344 Mean :104.67
## 3rd Qu.: 84.76 3rd Qu.:59.2142 3rd Qu.:110.31
## Max. :109.47 Max. :83.8049 Max. :135.13
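The `lwr` and `upr` columns summarized above come from `predict()` with `interval = "prediction"`, which returns a 95% prediction interval for each new observation by default. The sketch below shows the shape of that result on stand-in data; the `mtcars` model and the `new_cars` data frame are assumptions for illustration only.

```r
# Stand-in model on built-in data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical new observations to predict.
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 180))

# Matrix with one row per new observation and columns fit, lwr, upr.
pred <- predict(fit, newdata = new_cars, interval = "prediction", level = 0.95)
print(pred)

# Width of each interval; wide intervals signal uncertain predictions.
print(pred[, "upr"] - pred[, "lwr"])
```

The intervals in the summary above span roughly 50 wins, which is consistent with a residual standard error of about 13.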
MB_test$PRED_WINS <- round(predict(model_2, newdata = MB_test),0)
Pred_Data <- MB_test[, c("INDEX","PRED_WINS")]
DT::datatable(Pred_Data)
The tables above show the predicted wins for the test data.
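The median imputation applied to the test data earlier in this section repeats one statement per column. A minimal sketch of the same step as a function follows. The helper `impute_median` and the toy data frames are hypothetical. The optional `reference` argument is an assumption added here: it allows the medians to be taken from the training data instead of the test data, which is a common alternative but not necessarily what this report did.

```r
# Fill NAs in the given columns of df with the median of the same
# column in `reference` (by default, df itself).
impute_median <- function(df, cols, reference = df) {
  force(reference)  # fix the reference data before df is modified
  for (col in cols) {
    fill <- median(reference[[col]], na.rm = TRUE)
    df[[col]][is.na(df[[col]])] <- fill
  }
  df
}

# Toy stand-ins for the training and evaluation sets.
toy_train <- data.frame(SO = c(900, 1000, 1100, NA), SB = c(50, NA, 150, 100))
toy_test  <- data.frame(SO = c(NA, 1200),            SB = c(80, NA))

# Medians taken from the test set itself, as in the code above.
imputed_self <- impute_median(toy_test, c("SO", "SB"))

# Medians taken from the training set.
imputed_train <- impute_median(toy_test, c("SO", "SB"), reference = toy_train)

print(imputed_self)
print(imputed_train)
```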
Conclusion
This assignment helped us understand how to perform EDA, data preparation and model building. The statistical calculations and graphical results helped us understand the distribution of the data.
Future work
Explore additional outside factors, such as the players' state of mind, the weather and the game location, and their impact on wins. We will consider these factors in future work on predicting wins.