DATA621 HW1

Overview

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162 game season.

Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

Data Exploration

The goal of each team is to win as many games as possible in a 162-game season. A strong record earns a ticket to the postseason and a chance to play in the World Series, where the champion is decided.

The sample data has 2276 observations dating from 1871 to 2006. Apart from the INDEX column, it consists of 16 variables: 15 explanatory variables and 1 target variable, TARGET_WINS.

We will first look at the data to get a sense of what we have.

data_url <- "https://raw.githubusercontent.com/mkollontai/DATA621_GroupWork/master/HW1/moneyball-training-data.csv"
MB_train <- read.csv(data_url)
DT::datatable(head(MB_train))

Let’s look at the summary statistics of the data.

MB_summary <- function(MB_train){
  MB_train %>%
    summary() %>%
    kable() %>%
    kable_styling()
}

MB_summary(MB_train)

INDEX	TARGET_WINS	TEAM_BATTING_H	TEAM_BATTING_2B	TEAM_BATTING_3B	TEAM_BATTING_HR	TEAM_BATTING_BB	TEAM_BATTING_SO	TEAM_BASERUN_SB	TEAM_BASERUN_CS	TEAM_BATTING_HBP	TEAM_PITCHING_H	TEAM_PITCHING_HR	TEAM_PITCHING_BB	TEAM_PITCHING_SO	TEAM_FIELDING_E	TEAM_FIELDING_DP
Min. : 1.0	Min. : 0.00	Min. : 891	Min. : 69.0	Min. : 0.00	Min. : 0.00	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. :29.00	Min. : 1137	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 65.0	Min. : 52.0
1st Qu.: 630.8	1st Qu.: 71.00	1st Qu.:1383	1st Qu.:208.0	1st Qu.: 34.00	1st Qu.: 42.00	1st Qu.:451.0	1st Qu.: 548.0	1st Qu.: 66.0	1st Qu.: 38.0	1st Qu.:50.50	1st Qu.: 1419	1st Qu.: 50.0	1st Qu.: 476.0	1st Qu.: 615.0	1st Qu.: 127.0	1st Qu.:131.0
Median :1270.5	Median : 82.00	Median :1454	Median :238.0	Median : 47.00	Median :102.00	Median :512.0	Median : 750.0	Median :101.0	Median : 49.0	Median :58.00	Median : 1518	Median :107.0	Median : 536.5	Median : 813.5	Median : 159.0	Median :149.0
Mean :1268.5	Mean : 80.79	Mean :1469	Mean :241.2	Mean : 55.25	Mean : 99.61	Mean :501.6	Mean : 735.6	Mean :124.8	Mean : 52.8	Mean :59.36	Mean : 1779	Mean :105.7	Mean : 553.0	Mean : 817.7	Mean : 246.5	Mean :146.4
3rd Qu.:1915.5	3rd Qu.: 92.00	3rd Qu.:1537	3rd Qu.:273.0	3rd Qu.: 72.00	3rd Qu.:147.00	3rd Qu.:580.0	3rd Qu.: 930.0	3rd Qu.:156.0	3rd Qu.: 62.0	3rd Qu.:67.00	3rd Qu.: 1682	3rd Qu.:150.0	3rd Qu.: 611.0	3rd Qu.: 968.0	3rd Qu.: 249.2	3rd Qu.:164.0
Max. :2535.0	Max. :146.00	Max. :2554	Max. :458.0	Max. :223.00	Max. :264.00	Max. :878.0	Max. :1399.0	Max. :697.0	Max. :201.0	Max. :95.00	Max. :30132	Max. :343.0	Max. :3645.0	Max. :19278.0	Max. :1898.0	Max. :228.0
NA	NA	NA	NA	NA	NA	NA	NA’s :102	NA’s :131	NA’s :772	NA’s :2085	NA	NA	NA	NA’s :102	NA	NA’s :286
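The NA counts in the last row of the summary can also be tabulated directly, which is easier to scan than a wide table. A minimal sketch on a small made-up data frame follows; `toy` and its values are hypothetical stand-ins for MB_train, not the assignment data.

```r
# Hypothetical toy data frame standing in for MB_train (values are made up)
toy <- data.frame(
  TARGET_WINS      = c(70, 82, 91, 65),
  TEAM_BASERUN_CS  = c(40, NA, 55, NA),
  TEAM_BATTING_HBP = c(NA, NA, NA, 58)
)

# Number and share of missing values per column, largest first
na_count <- sort(colSums(is.na(toy)), decreasing = TRUE)
na_share <- round(na_count / nrow(toy), 2)
print(data.frame(na_count, na_share))
```

The same two lines should work unchanged on the full data frame by replacing `toy` with `MB_train`.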

The summary table above shows that TEAM_BATTING_HBP (2085 NAs) and TEAM_BASERUN_CS (772 NAs) have the most missing values. We will come back to this later. The correlation plot shows the relationship between TARGET_WINS and the other variables.

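# Pearson correlations among columns 2 to 17 (all variables except INDEX)
# Note: use="complete.obs" keeps only rows with no NA in any column;
# with 2085 NAs in TEAM_BATTING_HBP that leaves at most 191 rows
cor <- cor(MB_train[2:17], use="complete.obs", method="pearson")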

#round to two decimals
DT::datatable(round(cor, 2))

# visualisation of correlation
corrplot(cor, method="square")

The correlation plot shows strong correlation between the following pairs of variables.

TEAM_BATTING_H -> TEAM_PITCHING_H
TEAM_BATTING_HR -> TEAM_PITCHING_HR
TEAM_BATTING_BB -> TEAM_PITCHING_BB
TEAM_BATTING_SO -> TEAM_PITCHING_SO
TEAM_BASERUN_CS -> TEAM_BASERUN_SB
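Pairs like these can also be pulled out of a correlation matrix programmatically rather than read off the plot. The sketch below is one way to do it under stated assumptions: the data frame `toy`, the name `strong_pairs`, and the 0.9 cutoff are illustrative choices, not part of the original analysis.

```r
# Hypothetical data: b is a near-copy of a, c is unrelated
toy <- data.frame(
  a = 1:10,
  b = (1:10) + c(0.1, -0.1, 0.2, -0.2, 0.1, -0.1, 0.2, -0.2, 0.1, -0.1),
  c = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

cm <- cor(toy)
# Keep each pair once by blanking the lower triangle and the diagonal
cm[lower.tri(cm, diag = TRUE)] <- NA

# 0.9 is an arbitrary cutoff chosen for illustration
idx <- which(abs(cm) > 0.9, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
print(strong_pairs)
```

Only the pair (a, b) passes the cutoff here. A lower cutoff would return more pairs, so the threshold is a judgment call.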

Let’s look at all of the metrics to evaluate the presence of outliers and the overall quality of the data. For clarity, we divide the variables into 4 groups:

batting
baserun
fielding
pitching

Box-plot for batting variables.

batting_df <- MB_train[,c("TEAM_BATTING_H","TEAM_BATTING_2B","TEAM_BATTING_3B","TEAM_BATTING_HR","TEAM_BATTING_BB","TEAM_BATTING_SO","TEAM_BATTING_HBP")]

ggplot(stack(batting_df), aes(x = ind, y = values)) +
  geom_boxplot(col="darkblue") + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

The plot above shows that TEAM_BATTING_H has more outliers than the other batting variables.

Box-plot for baserun variables.

#colnames(MB_train)
baserun_df <- MB_train[,c("TEAM_BASERUN_SB","TEAM_BASERUN_CS")]

ggplot(stack(baserun_df), aes(x = ind, y = values)) +
  geom_boxplot(col="darkblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Box-plot for fielding variables.

#colnames(MB_train)
fielding_df <- MB_train[,c("TEAM_FIELDING_E","TEAM_FIELDING_DP")]

ggplot(stack(fielding_df), aes(x = ind, y = values)) +
  geom_boxplot(col="darkblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Box-plot for pitching variables.

pitching_df <- MB_train[,c("TEAM_PITCHING_H","TEAM_PITCHING_HR","TEAM_PITCHING_BB","TEAM_PITCHING_SO")]

ggplot(stack(pitching_df), aes(x = ind, y = values)) +
  geom_boxplot(col="darkblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

There are many NAs in the sample data. For the plots below we replace the NAs with zero and show the distribution of each variable.

mb_hist <- MB_train

mb_hist$TEAM_BATTING_SO <- ifelse(is.na(mb_hist$TEAM_BATTING_SO), 0, mb_hist$TEAM_BATTING_SO)
mb_hist$TEAM_BASERUN_SB <- ifelse(is.na(mb_hist$TEAM_BASERUN_SB), 0, mb_hist$TEAM_BASERUN_SB)
mb_hist$TEAM_BASERUN_CS <- ifelse(is.na(mb_hist$TEAM_BASERUN_CS), 0, mb_hist$TEAM_BASERUN_CS)
mb_hist$TEAM_BATTING_HBP <- ifelse(is.na(mb_hist$TEAM_BATTING_HBP), 0, mb_hist$TEAM_BATTING_HBP)
mb_hist$TEAM_PITCHING_SO <- ifelse(is.na(mb_hist$TEAM_PITCHING_SO), 0, mb_hist$TEAM_PITCHING_SO)
mb_hist$TEAM_FIELDING_DP <- ifelse(is.na(mb_hist$TEAM_FIELDING_DP), 0, mb_hist$TEAM_FIELDING_DP)

# remove index col
mb_hist <- mb_hist[,-1 ]
# plot distribution
mb_hist %>%
  purrr::keep(is.numeric) %>%                  
  tidyr::gather() %>%                             
  ggplot(aes(value)) +                    
    facet_wrap(~ key, scales = "free") +  
    geom_histogram(bins = 35, col = "darkblue", fill = "darkblue")

The distributions above include left-skewed, right-skewed, and approximately normal variables. Some variables also have many zeros, largely the zeros substituted for NAs above, and these significantly skew the data and need to be addressed. The two worst offenders are TEAM_BATTING_HBP and TEAM_BASERUN_CS. The data preparation in the next section handles this issue so that the resulting models perform better without removing rows of data.
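To see why zero placeholders distort a distribution, consider a small made-up example. The vector `hbp` below is hypothetical and only illustrates the effect; it is not taken from the data set.

```r
# Made-up hit-by-pitch counts with half of the values missing
hbp <- c(52, 58, NA, 61, NA, NA, 55, NA)

zero_filled   <- ifelse(is.na(hbp), 0, hbp)
median_filled <- ifelse(is.na(hbp), median(hbp, na.rm = TRUE), hbp)

mean(hbp, na.rm = TRUE)  # 56.5, mean of the observed values only
mean(zero_filled)        # 28.25, pulled toward zero by the placeholders
mean(median_filled)      # 56.5, the centre is preserved
```

Zero filling halves the mean in this example, while median filling leaves the centre unchanged (at the cost of understating the spread).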

Data Preparation

Data preparation focuses on creating additional variables, replacing NAs with the mean or median, log-transforming skewed variables, and addressing outliers. First, we create the additional variables and then review outliers. We add 4 new variables to the data: TEAM_BATTING_1B, TEAM_BATTING_WALK, TEAM_BASERUN_SB_RATIO, and TEAM_BASERUN_CS_RATIO.

TEAM_BATTING_H counts singles, doubles, triples and home runs, so the number of singles can be derived from it:
TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR

Walks (TEAM_BATTING_BB) and hit-by-pitch (TEAM_BATTING_HBP) both put the batter on base without a hit, so we combine them:
TEAM_BATTING_WALK = TEAM_BATTING_BB + TEAM_BATTING_HBP

In the correlation plot, TEAM_BASERUN_SB and TEAM_BASERUN_CS are strongly correlated. TEAM_BASERUN_SB_RATIO and TEAM_BASERUN_CS_RATIO are the shares of stolen bases and of caught stealing, respectively, in their sum:
TEAM_BASERUN_SB_RATIO = TEAM_BASERUN_SB / (TEAM_BASERUN_SB + TEAM_BASERUN_CS)
TEAM_BASERUN_CS_RATIO = TEAM_BASERUN_CS / (TEAM_BASERUN_SB + TEAM_BASERUN_CS)

# created variables
MB_train$TEAM_BATTING_1B <- MB_train$TEAM_BATTING_H - MB_train$TEAM_BATTING_2B - MB_train$TEAM_BATTING_3B - MB_train$TEAM_BATTING_HR
MB_train$TEAM_BASERUN_SB_RATIO <- MB_train$TEAM_BASERUN_SB/(MB_train$TEAM_BASERUN_SB + MB_train$TEAM_BASERUN_CS)
MB_train$TEAM_BASERUN_CS_RATIO <- MB_train$TEAM_BASERUN_CS/(MB_train$TEAM_BASERUN_SB + MB_train$TEAM_BASERUN_CS)
MB_train$TEAM_BATTING_WALK <- MB_train$TEAM_BATTING_BB+MB_train$TEAM_BATTING_HBP
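As a quick check of the derived variables, here is a worked example for a single hypothetical team season. The numbers are made up for illustration; only the column names come from the data set.

```r
# One hypothetical team season (numbers are made up)
team <- data.frame(
  TEAM_BATTING_H  = 1500, TEAM_BATTING_2B = 250, TEAM_BATTING_3B = 50,
  TEAM_BATTING_HR = 100,  TEAM_BASERUN_SB = 120, TEAM_BASERUN_CS = 40
)

singles  <- with(team, TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR)
attempts <- with(team, TEAM_BASERUN_SB + TEAM_BASERUN_CS)
sb_ratio <- team$TEAM_BASERUN_SB / attempts
cs_ratio <- team$TEAM_BASERUN_CS / attempts

singles              # 1100
sb_ratio             # 0.75
cs_ratio             # 0.25
sb_ratio + cs_ratio  # 1
```

Note that the two ratios always sum to 1, and that singles are an exact linear combination of existing columns. This is worth keeping in mind when all of these variables are entered into the same regression.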

The plots below show the distributions of total hits and singles, and of the caught stealing ratio and the stolen base ratio.

par(mfrow=c(2,2), mai=c(0.5,0.5,0.5,0.2))
par(fig=c(0,0.5,0.25,1))
hist(MB_train$TEAM_BATTING_H, main="Total Hits", breaks=30, col="darkblue")
par(fig=c(0.5,1,0.25,1), new=TRUE)
hist(MB_train$TEAM_BATTING_1B, main="Singles", breaks=30, col="darkblue")
par(fig=c(0,0.5,0,0.3), new=TRUE)
boxplot(MB_train$TEAM_BATTING_H, horizontal=TRUE, width=1, col="darkblue")
par(fig=c(0.5,1,0,0.3), new=TRUE)
boxplot(MB_train$TEAM_BATTING_1B, horizontal=TRUE, width=1, col="darkblue")

par(mfrow=c(1,1))

par(mfrow=c(2,2), mai=c(0.5,0.5,0.5,0.2))
par(fig=c(0,0.5,0.25,1))
hist(MB_train$TEAM_BASERUN_CS_RATIO, main="Caught Stealing Ratio", breaks=30, col="firebrick")
par(fig=c(0.5,1,0.25,1), new=TRUE)
hist(MB_train$TEAM_BASERUN_SB_RATIO, main="Stolen Bases Ratio", breaks=30, col="darkblue")
par(fig=c(0,0.5,0,0.3), new=TRUE)
boxplot(MB_train$TEAM_BASERUN_CS_RATIO, horizontal=TRUE, width=1, col="firebrick")
par(fig=c(0.5,1,0,0.3), new=TRUE)
boxplot(MB_train$TEAM_BASERUN_SB_RATIO, horizontal=TRUE, width=1, col="darkblue")

par(mfrow=c(1,1))

MB_summary(MB_train)

INDEX	TARGET_WINS	TEAM_BATTING_H	TEAM_BATTING_2B	TEAM_BATTING_3B	TEAM_BATTING_HR	TEAM_BATTING_BB	TEAM_BATTING_SO	TEAM_BASERUN_SB	TEAM_BASERUN_CS	TEAM_BATTING_HBP	TEAM_PITCHING_H	TEAM_PITCHING_HR	TEAM_PITCHING_BB	TEAM_PITCHING_SO	TEAM_FIELDING_E	TEAM_FIELDING_DP	TEAM_BATTING_1B	TEAM_BASERUN_SB_RATIO	TEAM_BASERUN_CS_RATIO	TEAM_BATTING_WALK
Min. : 1.0	Min. : 0.00	Min. : 891	Min. : 69.0	Min. : 0.00	Min. : 0.00	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. :29.00	Min. : 1137	Min. : 0.0	Min. : 0.0	Min. : 0.0	Min. : 65.0	Min. : 52.0	Min. : 709.0	Min. :0.0000	Min. :0.1597	Min. :429.0
1st Qu.: 630.8	1st Qu.: 71.00	1st Qu.:1383	1st Qu.:208.0	1st Qu.: 34.00	1st Qu.: 42.00	1st Qu.:451.0	1st Qu.: 548.0	1st Qu.: 66.0	1st Qu.: 38.0	1st Qu.:50.50	1st Qu.: 1419	1st Qu.: 50.0	1st Qu.: 476.0	1st Qu.: 615.0	1st Qu.: 127.0	1st Qu.:131.0	1st Qu.: 990.8	1st Qu.:0.5789	1st Qu.:0.3048	1st Qu.:550.5
Median :1270.5	Median : 82.00	Median :1454	Median :238.0	Median : 47.00	Median :102.00	Median :512.0	Median : 750.0	Median :101.0	Median : 49.0	Median :58.00	Median : 1518	Median :107.0	Median : 536.5	Median : 813.5	Median : 159.0	Median :149.0	Median :1050.0	Median :0.6429	Median :0.3571	Median :594.0
Mean :1268.5	Mean : 80.79	Mean :1469	Mean :241.2	Mean : 55.25	Mean : 99.61	Mean :501.6	Mean : 735.6	Mean :124.8	Mean : 52.8	Mean :59.36	Mean : 1779	Mean :105.7	Mean : 553.0	Mean : 817.7	Mean : 246.5	Mean :146.4	Mean :1073.2	Mean :0.6327	Mean :0.3673	Mean :602.7
3rd Qu.:1915.5	3rd Qu.: 92.00	3rd Qu.:1537	3rd Qu.:273.0	3rd Qu.: 72.00	3rd Qu.:147.00	3rd Qu.:580.0	3rd Qu.: 930.0	3rd Qu.:156.0	3rd Qu.: 62.0	3rd Qu.:67.00	3rd Qu.: 1682	3rd Qu.:150.0	3rd Qu.: 611.0	3rd Qu.: 968.0	3rd Qu.: 249.2	3rd Qu.:164.0	3rd Qu.:1129.0	3rd Qu.:0.6952	3rd Qu.:0.4211	3rd Qu.:653.5
Max. :2535.0	Max. :146.00	Max. :2554	Max. :458.0	Max. :223.00	Max. :264.00	Max. :878.0	Max. :1399.0	Max. :697.0	Max. :201.0	Max. :95.00	Max. :30132	Max. :343.0	Max. :3645.0	Max. :19278.0	Max. :1898.0	Max. :228.0	Max. :2112.0	Max. :0.8403	Max. :1.0000	Max. :823.0
NA	NA	NA	NA	NA	NA	NA	NA’s :102	NA’s :131	NA’s :772	NA’s :2085	NA	NA	NA	NA’s :102	NA	NA’s :286	NA	NA’s :773	NA’s :773	NA’s :2085

The sample data contains NAs, so we replace them with the mean or median. We create 2 datasets, one replacing NAs with the column mean and another replacing them with the column median, and fit our models on both.

MB_train_mean <- MB_train
# replace by mean
MB_train_mean$TEAM_BATTING_SO[is.na(MB_train_mean$TEAM_BATTING_SO)==TRUE] <- mean(MB_train_mean$TEAM_BATTING_SO, na.rm = TRUE)
MB_train_mean$TEAM_BASERUN_SB[is.na(MB_train_mean$TEAM_BASERUN_SB)==TRUE] <- mean(MB_train_mean$TEAM_BASERUN_SB, na.rm = TRUE)
MB_train_mean$TEAM_BASERUN_CS[is.na(MB_train_mean$TEAM_BASERUN_CS)==TRUE] <- mean(MB_train_mean$TEAM_BASERUN_CS, na.rm = TRUE)
MB_train_mean$TEAM_BATTING_HBP[is.na(MB_train_mean$TEAM_BATTING_HBP)==TRUE] <- mean(MB_train_mean$TEAM_BATTING_HBP, na.rm = TRUE)
MB_train_mean$TEAM_PITCHING_SO[is.na(MB_train_mean$TEAM_PITCHING_SO)==TRUE] <- mean(MB_train_mean$TEAM_PITCHING_SO, na.rm = TRUE)
MB_train_mean$TEAM_FIELDING_DP[is.na(MB_train_mean$TEAM_FIELDING_DP)==TRUE] <- mean(MB_train_mean$TEAM_FIELDING_DP, na.rm = TRUE)
MB_train_mean$TEAM_BASERUN_SB_RATIO[is.na(MB_train_mean$TEAM_BASERUN_SB_RATIO)==TRUE] <- mean(MB_train_mean$TEAM_BASERUN_SB_RATIO, na.rm = TRUE)
MB_train_mean$TEAM_BASERUN_CS_RATIO[is.na(MB_train_mean$TEAM_BASERUN_CS_RATIO)==TRUE] <- mean(MB_train_mean$TEAM_BASERUN_CS_RATIO, na.rm = TRUE)
MB_train_mean$TEAM_BATTING_WALK[is.na(MB_train_mean$TEAM_BATTING_WALK)==TRUE] <- mean(MB_train_mean$TEAM_BATTING_WALK, na.rm = TRUE)
# replace by median
MB_train_median <- MB_train
MB_train_median$TEAM_BATTING_SO[is.na(MB_train_median$TEAM_BATTING_SO)==TRUE] <- median(MB_train_median$TEAM_BATTING_SO, na.rm = TRUE)
MB_train_median$TEAM_BASERUN_SB[is.na(MB_train_median$TEAM_BASERUN_SB)==TRUE] <- median(MB_train_median$TEAM_BASERUN_SB, na.rm = TRUE)
MB_train_median$TEAM_BASERUN_CS[is.na(MB_train_median$TEAM_BASERUN_CS)==TRUE] <- median(MB_train_median$TEAM_BASERUN_CS, na.rm = TRUE)
MB_train_median$TEAM_BATTING_HBP[is.na(MB_train_median$TEAM_BATTING_HBP)==TRUE] <- median(MB_train_median$TEAM_BATTING_HBP, na.rm = TRUE)
MB_train_median$TEAM_PITCHING_SO[is.na(MB_train_median$TEAM_PITCHING_SO)==TRUE] <- median(MB_train_median$TEAM_PITCHING_SO, na.rm = TRUE)
MB_train_median$TEAM_FIELDING_DP[is.na(MB_train_median$TEAM_FIELDING_DP)==TRUE] <- median(MB_train_median$TEAM_FIELDING_DP, na.rm = TRUE)
MB_train_median$TEAM_BASERUN_SB_RATIO[is.na(MB_train_median$TEAM_BASERUN_SB_RATIO)==TRUE] <- median(MB_train_median$TEAM_BASERUN_SB_RATIO, na.rm = TRUE)
MB_train_median$TEAM_BASERUN_CS_RATIO[is.na(MB_train_median$TEAM_BASERUN_CS_RATIO)==TRUE] <- median(MB_train_median$TEAM_BASERUN_CS_RATIO, na.rm = TRUE)
MB_train_median$TEAM_BATTING_WALK[is.na(MB_train_median$TEAM_BATTING_WALK)==TRUE] <- median(MB_train_median$TEAM_BATTING_WALK, na.rm = TRUE)
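The column-by-column assignments above can be expressed more compactly with a small helper. The function below is a sketch, not the code used for the results in this report: `impute_na` and the data frame `toy` are hypothetical names, and the helper fills every numeric column that has NAs rather than a hand-picked list.

```r
# Replace NAs in every numeric column with a summary statistic of that column.
# `fun` is expected to accept na.rm = TRUE (mean and median both do).
impute_na <- function(df, fun = mean) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      fill <- fun(df[[col]], na.rm = TRUE)
      df[[col]][is.na(df[[col]])] <- fill
    }
  }
  df
}

# Made-up example data
toy <- data.frame(so = c(700, NA, 900, 1400), dp = c(150, 160, NA, NA))

toy_mean   <- impute_na(toy, mean)    # so[2] becomes 1000
toy_median <- impute_na(toy, median)  # so[2] becomes 900
```

With a right-skewed column such as `so`, the mean and the median give noticeably different fill values, which is one reason to compare both versions.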

A log transformation is used to deal with skewed data. It can decrease the variability of the data and make it conform more closely to the normal distribution. The histograms above showed many skewed distributions, so we apply a log transformation to those variables.

# Log transformation
MB_train_log <- MB_train
# replace by log
MB_train_log$TEAM_FIELDING_E <- log(MB_train_log$TEAM_FIELDING_E)
MB_train_log$TEAM_PITCHING_H <- log(MB_train_log$TEAM_PITCHING_H)
MB_train_log$TEAM_PITCHING_SO[MB_train_log$TEAM_PITCHING_SO==0] <- 1
MB_train_log$TEAM_PITCHING_SO <- log(MB_train_log$TEAM_PITCHING_SO)
MB_train_log$TEAM_PITCHING_BB[MB_train_log$TEAM_PITCHING_BB==0] <- 1
MB_train_log$TEAM_PITCHING_BB <- log(MB_train_log$TEAM_PITCHING_BB)
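The effect of the log transformation on skewness can be checked numerically. The sketch below uses a hand-written moment-based skewness function and a made-up, long-tailed vector; `skewness` and `errors` are illustrative names, not objects from the analysis above.

```r
# Moment-based sample skewness (third central moment over sd cubed)
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Made-up values with a long right tail, loosely resembling an error count
errors <- c(70, 90, 110, 130, 150, 180, 250, 400, 900, 1800)

skew_raw <- skewness(errors)
skew_log <- skewness(log(errors))

print(c(raw = skew_raw, log = skew_log))
```

The log-transformed values are less skewed than the raw ones. Because log(0) is -Inf, zero counts have to be handled first, which is why the code above replaces zeros with 1 before taking logs.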

To limit the influence of outliers, we trim the data by capping extreme values at the 95th and 99th percentiles.

ggplot(stack(MB_train), aes(x = ind, y = values)) +
  geom_boxplot(col="darkblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

ggplot(stack(MB_train_mean), aes(x = ind, y = values)) +
  geom_boxplot(col="darkblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

ggplot(stack(MB_train_median), aes(x = ind, y = values)) +
  geom_boxplot(col="darkblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

ggplot(stack(MB_train_log), aes(x = ind, y = values)) +
  geom_boxplot(col="darkblue") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

First we need to calculate the trimming percentile values for TEAM_FIELDING_E, TEAM_PITCHING_H, TEAM_PITCHING_SO, and TEAM_PITCHING_BB, because these variables have the highest number of extreme outliers.

# trimming percentile value
quant_95_TEAM_FIELDING_E <- unname(quantile(MB_train_mean$TEAM_FIELDING_E, probs=c(0.01,0.05,0.95,0.99))[3])
quant_95_TEAM_PITCHING_H <- unname(quantile(MB_train_mean$TEAM_PITCHING_H, probs=c(0.01,0.05,0.95,0.99))[3])
quant_95_TEAM_PITCHING_SO <- unname(quantile(MB_train_mean$TEAM_PITCHING_SO, probs=c(0.01,0.05,0.95,0.99))[3])
quant_95_TEAM_PITCHING_BB <- unname(quantile(MB_train_mean$TEAM_PITCHING_BB, probs=c(0.01,0.05,0.95,0.99))[3])

quant_99_TEAM_FIELDING_E <- unname(quantile(MB_train_mean$TEAM_FIELDING_E, probs=c(0.01,0.05,0.95,0.99))[4])
quant_99_TEAM_PITCHING_H <- unname(quantile(MB_train_mean$TEAM_PITCHING_H, probs=c(0.01,0.05,0.95,0.99))[4])
quant_99_TEAM_PITCHING_SO <- unname(quantile(MB_train_mean$TEAM_PITCHING_SO, probs=c(0.01,0.05,0.95,0.99))[4])
quant_99_TEAM_PITCHING_BB <- unname(quantile(MB_train_mean$TEAM_PITCHING_BB, probs=c(0.01,0.05,0.95,0.99))[4])

# Cap values above the 95th percentile at the 95th percentile
MB_train_95trim <- MB_train_mean
MB_train_95trim$TEAM_FIELDING_E[MB_train_95trim$TEAM_FIELDING_E > quant_95_TEAM_FIELDING_E] <- quant_95_TEAM_FIELDING_E
MB_train_95trim$TEAM_PITCHING_H[MB_train_95trim$TEAM_PITCHING_H > quant_95_TEAM_PITCHING_H] <- quant_95_TEAM_PITCHING_H
MB_train_95trim$TEAM_PITCHING_SO[MB_train_95trim$TEAM_PITCHING_SO > quant_95_TEAM_PITCHING_SO] <- quant_95_TEAM_PITCHING_SO
MB_train_95trim$TEAM_PITCHING_BB[MB_train_95trim$TEAM_PITCHING_BB > quant_95_TEAM_PITCHING_BB] <- quant_95_TEAM_PITCHING_BB

# Cap values above the 99th percentile at the 99th percentile
MB_train_99trim <- MB_train_mean
MB_train_99trim$TEAM_FIELDING_E[MB_train_99trim$TEAM_FIELDING_E > quant_99_TEAM_FIELDING_E] <- quant_99_TEAM_FIELDING_E
MB_train_99trim$TEAM_PITCHING_H[MB_train_99trim$TEAM_PITCHING_H > quant_99_TEAM_PITCHING_H] <- quant_99_TEAM_PITCHING_H
MB_train_99trim$TEAM_PITCHING_SO[MB_train_99trim$TEAM_PITCHING_SO > quant_99_TEAM_PITCHING_SO] <- quant_99_TEAM_PITCHING_SO
MB_train_99trim$TEAM_PITCHING_BB[MB_train_99trim$TEAM_PITCHING_BB > quant_99_TEAM_PITCHING_BB] <- quant_99_TEAM_PITCHING_BB
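The capping step can be wrapped in a small function. This is a sketch under the assumption that only the upper tail is capped, as in the code above; `cap_upper` and the vector `errors` are hypothetical names introduced for illustration.

```r
# Cap values above the p-th percentile at that percentile (upper tail only).
# NAs are ignored when computing the cutoff and are left as NA in the result.
cap_upper <- function(x, p = 0.95) {
  limit <- unname(quantile(x, probs = p, na.rm = TRUE))
  pmin(x, limit)
}

# Made-up values with one extreme observation
errors <- c(100, 120, 130, 150, 160, 170, 200, 240, 300, 1900)
capped <- cap_upper(errors, 0.95)

length(capped) == length(errors)  # TRUE: no rows are dropped
max(capped)                       # equals the 95th percentile of errors
```

Because values are capped rather than removed, the number of rows stays the same. This differs from trimming in the strict sense, which would discard the extreme observations.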

After capping the extreme outliers, the distributions look closer to normal.

# visualize
par(mfrow=c(3,1))
hist(MB_train_log$TEAM_FIELDING_E, breaks = 20)
hist(MB_train_95trim$TEAM_FIELDING_E, breaks = 20)
hist(MB_train_99trim$TEAM_FIELDING_E, breaks = 20)

par(mfrow=c(1,1))

par(mfrow=c(3,1))
hist(MB_train_log$TEAM_PITCHING_H, breaks = 20)
hist(MB_train_95trim$TEAM_PITCHING_H, breaks = 20)
hist(MB_train_99trim$TEAM_PITCHING_H, breaks = 20)

par(mfrow=c(1,1))

par(mfrow=c(3,1))
hist(MB_train_log$TEAM_PITCHING_SO, breaks = 20)
hist(MB_train_95trim$TEAM_PITCHING_SO, breaks = 20)
hist(MB_train_99trim$TEAM_PITCHING_SO, breaks = 20)

par(mfrow=c(1,1))

par(mfrow=c(3,1))
hist(MB_train_log$TEAM_PITCHING_BB, breaks = 20)
hist(MB_train_95trim$TEAM_PITCHING_BB, breaks = 20)
hist(MB_train_99trim$TEAM_PITCHING_BB, breaks = 20)

par(mfrow=c(1,1))

Build Models

The primary metrics used to assess the models were $R^2$, adjusted $R^2$, and MSE. The stats package performs the calculations quickly and provides a table output of the summary. The combination of $R^2$, adjusted $R^2$, MSE and the F-statistic gives insight into which set of variables yields the best-performing model. We use the p-value to determine the statistical significance of each coefficient included in the model. A p-value of less than 0.05 indicates that the variable is statistically significant.
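For reference, the sketch below shows how these metrics relate to one another by recomputing them by hand from a fitted model. It uses R's built-in `mtcars` data rather than the baseball data, and the variable names (`fit`, `r2`, `adj_r2`, `mse`, `f_stat`) are illustrative.

```r
# Built-in example data, not the baseball set
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

n   <- nobs(fit)
p   <- length(coef(fit)) - 1                     # number of predictors
rss <- sum(residuals(fit)^2)                     # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)    # total sum of squares

r2     <- 1 - rss / tss
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
mse    <- rss / n                                # mean squared residual (training data)
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Compare with the values reported by summary()
all.equal(r2, s$r.squared)
all.equal(adj_r2, s$adj.r.squared)
all.equal(f_stat, unname(s$fstatistic["value"]))
```

Adjusted $R^2$ penalizes additional predictors, so it is the more appropriate of the two when comparing models with different numbers of variables. Note that the MSE here divides by n; the residual standard error printed by summary() divides by the residual degrees of freedom instead.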

# model 1
model_1 <- lm(TARGET_WINS ~ . , data = mb_hist)
summary(model_1)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = mb_hist)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.041  -8.558   0.145   8.907  53.331 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      28.2930631  4.5824723   6.174 7.86e-10 ***
## TEAM_BATTING_H    0.0429317  0.0034842  12.322  < 2e-16 ***
## TEAM_BATTING_2B  -0.0042264  0.0097037  -0.436 0.663209    
## TEAM_BATTING_3B   0.0649817  0.0170796   3.805 0.000146 ***
## TEAM_BATTING_HR   0.0703822  0.0282035   2.496 0.012648 *  
## TEAM_BATTING_BB   0.0022019  0.0059003   0.373 0.709047    
## TEAM_BATTING_SO  -0.0101940  0.0020913  -4.875 1.17e-06 ***
## TEAM_BASERUN_SB   0.0042336  0.0042081   1.006 0.314494    
## TEAM_BASERUN_CS  -0.0077185  0.0109760  -0.703 0.481993    
## TEAM_BATTING_HBP -0.0542327  0.0194784  -2.784 0.005410 ** 
## TEAM_PITCHING_H  -0.0008096  0.0003800  -2.130 0.033238 *  
## TEAM_PITCHING_HR -0.0025561  0.0249180  -0.103 0.918304    
## TEAM_PITCHING_BB  0.0028645  0.0042178   0.679 0.497119    
## TEAM_PITCHING_SO  0.0019800  0.0009422   2.101 0.035710 *  
## TEAM_FIELDING_E  -0.0286814  0.0030151  -9.512  < 2e-16 ***
## TEAM_FIELDING_DP -0.0657273  0.0101980  -6.445 1.41e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.23 on 2260 degrees of freedom
## Multiple R-squared:  0.299,  Adjusted R-squared:  0.2944 
## F-statistic: 64.27 on 15 and 2260 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model_1)

# model 2
model_2 <- lm(TARGET_WINS ~ ., data=MB_train_mean)
summary(model_2)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_mean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.008  -8.499   0.069   8.438  59.785 
## 
## Coefficients: (2 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -7.8846190  9.7449412  -0.809 0.418544    
## INDEX                 -0.0005938  0.0003764  -1.578 0.114789    
## TEAM_BATTING_H         0.0471168  0.0036848  12.787  < 2e-16 ***
## TEAM_BATTING_2B       -0.0214205  0.0091939  -2.330 0.019901 *  
## TEAM_BATTING_3B        0.0685838  0.0169941   4.036 5.62e-05 ***
## TEAM_BATTING_HR        0.0547467  0.0273400   2.002 0.045357 *  
## TEAM_BATTING_BB        0.0068437  0.0058913   1.162 0.245492    
## TEAM_BATTING_SO       -0.0098886  0.0025590  -3.864 0.000115 ***
## TEAM_BASERUN_SB        0.0272772  0.0048644   5.607 2.30e-08 ***
## TEAM_BASERUN_CS       -0.0036781  0.0166524  -0.221 0.825208    
## TEAM_BATTING_HBP       0.0151583  0.0745654   0.203 0.838928    
## TEAM_PITCHING_H       -0.0007044  0.0003668  -1.920 0.054943 .  
## TEAM_PITCHING_HR       0.0127466  0.0242483   0.526 0.599169    
## TEAM_PITCHING_BB       0.0007824  0.0041356   0.189 0.849972    
## TEAM_PITCHING_SO       0.0026989  0.0009162   2.946 0.003253 ** 
## TEAM_FIELDING_E       -0.0224901  0.0024992  -8.999  < 2e-16 ***
## TEAM_FIELDING_DP      -0.1120703  0.0131356  -8.532  < 2e-16 ***
## TEAM_BATTING_1B               NA         NA      NA       NA    
## TEAM_BASERUN_SB_RATIO  9.7118989  4.7791864   2.032 0.042258 *  
## TEAM_BASERUN_CS_RATIO         NA         NA      NA       NA    
## TEAM_BATTING_WALK      0.0480023  0.0130995   3.664 0.000254 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.99 on 2257 degrees of freedom
## Multiple R-squared:  0.3251, Adjusted R-squared:  0.3197 
## F-statistic:  60.4 on 18 and 2257 DF,  p-value: < 2.2e-16
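The note "(2 not defined because of singularities)" and the NA rows for TEAM_BATTING_1B and TEAM_BASERUN_CS_RATIO arise because each of these variables is an exact linear combination of other predictors: singles equal hits minus doubles, triples and home runs, and the caught stealing ratio equals 1 minus the stolen base ratio. lm() cannot estimate a separate coefficient in that case. A minimal sketch on made-up data (`toy` and its columns are hypothetical) reproduces the behaviour:

```r
# Made-up data; hits is constructed as an exact sum of the other three columns
toy <- data.frame(
  wins    = c(60, 72, 81, 77, 90, 68, 85, 74),
  doubles = c(200, 230, 260, 240, 280, 210, 250, 235),
  homers  = c(80, 120, 150, 100, 170, 90, 160, 110),
  singles = c(1000, 1050, 1020, 1100, 1080, 990, 1010, 1060)
)
toy$hits <- toy$singles + toy$doubles + toy$homers

# singles = hits - doubles - homers, so it carries no new information
fit <- lm(wins ~ hits + doubles + homers + singles, data = toy)
coef(fit)  # the coefficient for singles is NA
```

Dropping one variable from each redundant group removes the NA rows without changing the fitted values.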

model_2 <- lm(TARGET_WINS ~ ., data=MB_train_mean[c(2,3,4,5,6,7,8,9,15,16,17,19,21)])
summary(model_2)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_mean[c(2, 3, 4, 
##     5, 6, 7, 8, 9, 15, 16, 17, 19, 21)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.429  -8.554  -0.110   8.457  60.585 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -7.9906386  9.1768603  -0.871 0.383990    
## TEAM_BATTING_H         0.0463844  0.0036211  12.810  < 2e-16 ***
## TEAM_BATTING_2B       -0.0221185  0.0091639  -2.414 0.015872 *  
## TEAM_BATTING_3B        0.0751391  0.0164317   4.573 5.07e-06 ***
## TEAM_BATTING_HR        0.0662129  0.0095982   6.898 6.80e-12 ***
## TEAM_BATTING_BB        0.0083310  0.0034570   2.410 0.016036 *  
## TEAM_BATTING_SO       -0.0090283  0.0024111  -3.744 0.000185 ***
## TEAM_BASERUN_SB        0.0278509  0.0045523   6.118 1.11e-09 ***
## TEAM_PITCHING_SO       0.0022214  0.0005933   3.744 0.000185 ***
## TEAM_FIELDING_E       -0.0246881  0.0020551 -12.013  < 2e-16 ***
## TEAM_FIELDING_DP      -0.1128548  0.0131165  -8.604  < 2e-16 ***
## TEAM_BASERUN_SB_RATIO  9.6022690  4.5883768   2.093 0.036484 *  
## TEAM_BATTING_WALK      0.0479878  0.0127753   3.756 0.000177 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 2263 degrees of freedom
## Multiple R-squared:  0.3229, Adjusted R-squared:  0.3193 
## F-statistic: 89.94 on 12 and 2263 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model_2)

# Model 3
model_3 <- lm(TARGET_WINS ~ ., data=MB_train_median)
summary(model_3)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_median)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.843  -8.528   0.002   8.473  59.910 
## 
## Coefficients: (2 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -6.2624290  9.6007797  -0.652 0.514285    
## INDEX                 -0.0005873  0.0003775  -1.556 0.119900    
## TEAM_BATTING_H         0.0478166  0.0036925  12.950  < 2e-16 ***
## TEAM_BATTING_2B       -0.0238012  0.0092259  -2.580 0.009948 ** 
## TEAM_BATTING_3B        0.0733280  0.0170207   4.308 1.72e-05 ***
## TEAM_BATTING_HR        0.0533380  0.0274143   1.946 0.051823 .  
## TEAM_BATTING_BB        0.0076295  0.0059063   1.292 0.196574    
## TEAM_BATTING_SO       -0.0091831  0.0025542  -3.595 0.000331 ***
## TEAM_BASERUN_SB        0.0221644  0.0047543   4.662 3.31e-06 ***
## TEAM_BASERUN_CS        0.0005946  0.0163712   0.036 0.971032    
## TEAM_BATTING_HBP      -0.0033628  0.0746905  -0.045 0.964093    
## TEAM_PITCHING_H       -0.0008018  0.0003668  -2.186 0.028910 *  
## TEAM_PITCHING_HR       0.0117683  0.0243228   0.484 0.628549    
## TEAM_PITCHING_BB       0.0013913  0.0041474   0.335 0.737315    
## TEAM_PITCHING_SO       0.0026876  0.0009195   2.923 0.003504 ** 
## TEAM_FIELDING_E       -0.0208981  0.0024809  -8.424  < 2e-16 ***
## TEAM_FIELDING_DP      -0.1123314  0.0130813  -8.587  < 2e-16 ***
## TEAM_BATTING_1B               NA         NA      NA       NA    
## TEAM_BASERUN_SB_RATIO 11.8139658  4.7615848   2.481 0.013170 *  
## TEAM_BASERUN_CS_RATIO         NA         NA      NA       NA    
## TEAM_BATTING_WALK      0.0430731  0.0131131   3.285 0.001036 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.03 on 2257 degrees of freedom
## Multiple R-squared:  0.3213, Adjusted R-squared:  0.3158 
## F-statistic: 59.35 on 18 and 2257 DF,  p-value: < 2.2e-16

model_3 <- lm(TARGET_WINS ~ ., data=MB_train_median[c(2,3,4,5,6,7,8,9,15,16,17,19,21)])
summary(model_3)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_median[c(2, 3, 
##     4, 5, 6, 7, 8, 9, 15, 16, 17, 19, 21)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.838  -8.495  -0.117   8.425  60.859 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -6.723157   9.058955  -0.742 0.458070    
## TEAM_BATTING_H         0.046994   0.003629  12.948  < 2e-16 ***
## TEAM_BATTING_2B       -0.024483   0.009192  -2.663 0.007790 ** 
## TEAM_BATTING_3B        0.080780   0.016418   4.920 9.27e-07 ***
## TEAM_BATTING_HR        0.063018   0.009600   6.565 6.45e-11 ***
## TEAM_BATTING_BB        0.009822   0.003479   2.823 0.004802 ** 
## TEAM_BATTING_SO       -0.008280   0.002403  -3.446 0.000580 ***
## TEAM_BASERUN_SB        0.022975   0.004482   5.126 3.20e-07 ***
## TEAM_PITCHING_SO       0.002237   0.000595   3.761 0.000174 ***
## TEAM_FIELDING_E       -0.023431   0.002050 -11.431  < 2e-16 ***
## TEAM_FIELDING_DP      -0.113024   0.013064  -8.651  < 2e-16 ***
## TEAM_BASERUN_SB_RATIO 11.671968   4.545969   2.568 0.010306 *  
## TEAM_BATTING_WALK      0.042156   0.012780   3.299 0.000987 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.04 on 2263 degrees of freedom
## Multiple R-squared:  0.3188, Adjusted R-squared:  0.3152 
## F-statistic: 88.25 on 12 and 2263 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model_3)

# model 4
model_4 <- lm(TARGET_WINS ~ ., data=MB_train_log)
summary(model_4)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_log)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.0211  -5.8358  -0.1011   5.2071  23.7836 
## 
## Coefficients: (3 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.926e+03  2.388e+03  -1.225 0.222103    
## INDEX                 -3.559e-04  8.575e-04  -0.415 0.678605    
## TEAM_BATTING_H        -3.100e-01  2.528e-01  -1.226 0.221783    
## TEAM_BATTING_2B        2.347e-02  3.052e-02   0.769 0.442970    
## TEAM_BATTING_3B       -1.088e-01  7.805e-02  -1.394 0.165253    
## TEAM_BATTING_HR        1.949e+00  2.693e+00   0.724 0.470251    
## TEAM_BATTING_BB       -8.305e-03  9.256e-02  -0.090 0.928612    
## TEAM_BATTING_SO       -1.072e-03  8.531e-02  -0.013 0.989987    
## TEAM_BASERUN_SB        1.290e-01  8.937e-02   1.443 0.150816    
## TEAM_BASERUN_CS       -2.272e-01  2.065e-01  -1.100 0.272806    
## TEAM_BATTING_HBP       8.079e-02  4.969e-02   1.626 0.105843    
## TEAM_PITCHING_H        4.912e+02  3.751e+02   1.310 0.192066    
## TEAM_PITCHING_HR      -1.856e+00  2.694e+00  -0.689 0.491833    
## TEAM_PITCHING_BB       3.417e+01  5.034e+01   0.679 0.498163    
## TEAM_PITCHING_SO      -3.370e+01  9.022e+01  -0.374 0.709186    
## TEAM_FIELDING_E       -1.728e+01  4.389e+00  -3.938 0.000119 ***
## TEAM_FIELDING_DP      -9.073e-02  3.739e-02  -2.427 0.016271 *  
## TEAM_BATTING_1B               NA         NA      NA       NA    
## TEAM_BASERUN_SB_RATIO -3.888e+01  3.651e+01  -1.065 0.288435    
## TEAM_BASERUN_CS_RATIO         NA         NA      NA       NA    
## TEAM_BATTING_WALK             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.475 on 173 degrees of freedom
##   (2085 observations deleted due to missingness)
## Multiple R-squared:  0.5545, Adjusted R-squared:  0.5107 
## F-statistic: 12.66 on 17 and 173 DF,  p-value: < 2.2e-16
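The line "(2085 observations deleted due to missingness)" reflects how lm() treats NAs under R's default na.action (na.omit): any row with an NA in a variable used by the formula is dropped before fitting. Since the log-transformed data was not imputed, a single sparse column is enough to shrink the sample. The sketch below illustrates this on made-up data; `toy`, `fit_all` and `fit_sub` are hypothetical names.

```r
# Made-up data in which one predictor is mostly missing
toy <- data.frame(
  wins = c(60, 72, 81, 77, 90, 68, 85, 74),
  hits = c(1280, 1400, 1430, 1440, 1530, 1290, 1420, 1405),
  hbp  = c(NA, NA, 45, NA, 60, NA, 52, 48)
)

fit_all <- lm(wins ~ ., data = toy)     # uses hits and hbp
fit_sub <- lm(wins ~ hits, data = toy)  # leaves out the sparse column

nobs(fit_all)             # 4 rows used
nobs(fit_sub)             # 8 rows used
sum(complete.cases(toy))  # 4, matches nobs(fit_all)
```

Fit statistics from models estimated on different subsets of rows are not directly comparable, so a higher $R^2$ on the smaller sample should be read with caution.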

model_4 <- lm(TARGET_WINS ~ ., data=MB_train_log[c(2,3,4,5,6,7,8,9,15,16,17)])
summary(model_4)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_log[c(2, 3, 4, 
##     5, 6, 7, 8, 9, 15, 16, 17)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.116  -7.134   0.043   7.091  29.759 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      141.987953  20.142940   7.049 2.55e-12 ***
## TEAM_BATTING_H     0.028487   0.004350   6.549 7.50e-11 ***
## TEAM_BATTING_2B   -0.053440   0.008934  -5.981 2.66e-09 ***
## TEAM_BATTING_3B    0.187939   0.019156   9.811  < 2e-16 ***
## TEAM_BATTING_HR    0.096363   0.009244  10.425  < 2e-16 ***
## TEAM_BATTING_BB    0.034615   0.003129  11.061  < 2e-16 ***
## TEAM_BATTING_SO   -0.025636   0.004341  -5.905 4.19e-09 ***
## TEAM_BASERUN_SB    0.062197   0.005482  11.345  < 2e-16 ***
## TEAM_PITCHING_SO   1.518468   3.071460   0.494    0.621    
## TEAM_FIELDING_E  -21.394394   1.300870 -16.446  < 2e-16 ***
## TEAM_FIELDING_DP  -0.110620   0.012215  -9.056  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.2 on 1824 degrees of freedom
##   (441 observations deleted due to missingness)
## Multiple R-squared:  0.4019, Adjusted R-squared:  0.3986 
## F-statistic: 122.5 on 10 and 1824 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model_4)

# model 5 
model_5 <- lm(TARGET_WINS ~ ., data=MB_train_95trim)
summary(model_5)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_95trim)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.763  -8.365   0.029   8.214  74.484 
## 
## Coefficients: (2 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.529e+01  9.863e+00  -1.550  0.12123    
## INDEX                 -6.222e-04  3.782e-04  -1.645  0.10005    
## TEAM_BATTING_H         3.679e-02  3.890e-03   9.458  < 2e-16 ***
## TEAM_BATTING_2B       -2.317e-02  9.310e-03  -2.489  0.01288 *  
## TEAM_BATTING_3B        8.956e-02  1.703e-02   5.258 1.59e-07 ***
## TEAM_BATTING_HR        8.221e-02  2.937e-02   2.799  0.00517 ** 
## TEAM_BATTING_BB        5.980e-02  8.489e-03   7.044 2.47e-12 ***
## TEAM_BATTING_SO       -1.646e-02  5.755e-03  -2.860  0.00427 ** 
## TEAM_BASERUN_SB        3.414e-02  5.152e-03   6.627 4.27e-11 ***
## TEAM_BASERUN_CS       -2.701e-03  1.697e-02  -0.159  0.87352    
## TEAM_BATTING_HBP       3.004e-02  7.503e-02   0.400  0.68895    
## TEAM_PITCHING_H        1.380e-02  2.543e-03   5.426 6.39e-08 ***
## TEAM_PITCHING_HR      -3.021e-02  2.623e-02  -1.152  0.24945    
## TEAM_PITCHING_BB      -4.185e-02  7.751e-03  -5.399 7.40e-08 ***
## TEAM_PITCHING_SO       1.511e-02  4.912e-03   3.076  0.00212 ** 
## TEAM_FIELDING_E       -3.535e-02  4.005e-03  -8.827  < 2e-16 ***
## TEAM_FIELDING_DP      -1.089e-01  1.327e-02  -8.210 3.68e-16 ***
## TEAM_BATTING_1B               NA         NA      NA       NA    
## TEAM_BASERUN_SB_RATIO  5.271e+00  4.856e+00   1.085  0.27784    
## TEAM_BASERUN_CS_RATIO         NA         NA      NA       NA    
## TEAM_BATTING_WALK      3.974e-02  1.317e-02   3.018  0.00257 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.07 on 2257 degrees of freedom
## Multiple R-squared:  0.3174, Adjusted R-squared:  0.312 
## F-statistic: 58.32 on 18 and 2257 DF,  p-value: < 2.2e-16
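
The "not defined because of singularities" note and the NA rows appear when a predictor is an exact linear combination of other predictors. That would be the case here if the training columns were derived the same way as the test columns later in this report: TEAM_BATTING_1B is hits minus doubles, triples and home runs, and the SB and CS ratios sum to 1. The sketch below reproduces the effect on simulated data; the data frame `toy` and its column names are assumptions made for illustration only.

```r
set.seed(1)

# Simulated stand-in data (hypothetical, not the moneyball set)
toy <- data.frame(
  H   = rpois(50, 1400),
  X2B = rpois(50, 240),
  HR  = rpois(50, 100)
)
toy$X1B  <- toy$H - toy$X2B - toy$HR          # exact linear combination
toy$wins <- 0.04 * toy$H + rnorm(50, sd = 5)  # arbitrary response

fit <- lm(wins ~ H + X2B + HR + X1B, data = toy)

# The redundant predictor gets an NA coefficient
print(coef(fit))

# alias() reports which linear dependency caused it
print(alias(fit)$Complete)
```

Dropping one column from each dependent group, as the reduced models in this report do, removes the NA rows without changing the fitted values.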

model_5 <- lm(TARGET_WINS ~ ., data=MB_train_95trim[c(2,3,4,5,6,7,8,12,13,15,16,20,21)])
summary(model_5)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_95trim[c(2, 3, 
##     4, 5, 6, 7, 8, 12, 13, 15, 16, 20, 21)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.262  -8.566   0.032   8.668  60.989 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -18.848579   9.529302  -1.978 0.048054 *  
## TEAM_BATTING_H          0.041858   0.003983  10.510  < 2e-16 ***
## TEAM_BATTING_2B        -0.032256   0.009527  -3.386 0.000722 ***
## TEAM_BATTING_3B         0.140324   0.016903   8.302  < 2e-16 ***
## TEAM_BATTING_HR         0.099973   0.028040   3.565 0.000371 ***
## TEAM_BATTING_BB         0.021396   0.003217   6.650 3.65e-11 ***
## TEAM_BATTING_SO        -0.002913   0.005666  -0.514 0.607271    
## TEAM_PITCHING_H         0.006819   0.002463   2.768 0.005684 ** 
## TEAM_PITCHING_HR       -0.076235   0.025402  -3.001 0.002719 ** 
## TEAM_PITCHING_SO        0.007638   0.004823   1.584 0.113356    
## TEAM_FIELDING_E        -0.026852   0.003690  -7.277 4.68e-13 ***
## TEAM_BASERUN_CS_RATIO -23.124996   4.425156  -5.226 1.89e-07 ***
## TEAM_BATTING_WALK       0.042028   0.013179   3.189 0.001447 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.52 on 2263 degrees of freedom
## Multiple R-squared:  0.2671, Adjusted R-squared:  0.2633 
## F-statistic: 68.74 on 12 and 2263 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model_5)

# model 6 
model_6 <- lm(TARGET_WINS ~ ., data=MB_train_99trim)
summary(model_6)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_99trim)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.355  -8.518  -0.061   8.195  66.340 
## 
## Coefficients: (2 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -9.1714381  9.7943870  -0.936 0.349169    
## INDEX                 -0.0005839  0.0003775  -1.547 0.122062    
## TEAM_BATTING_H         0.0433943  0.0038969  11.136  < 2e-16 ***
## TEAM_BATTING_2B       -0.0187801  0.0091791  -2.046 0.040875 *  
## TEAM_BATTING_3B        0.0845590  0.0174760   4.839 1.40e-06 ***
## TEAM_BATTING_HR        0.0812832  0.0314005   2.589 0.009699 ** 
## TEAM_BATTING_BB        0.0434242  0.0098952   4.388 1.19e-05 ***
## TEAM_BATTING_SO       -0.0266313  0.0054446  -4.891 1.07e-06 ***
## TEAM_BASERUN_SB        0.0331708  0.0050378   6.584 5.66e-11 ***
## TEAM_BASERUN_CS       -0.0036342  0.0167675  -0.217 0.828430    
## TEAM_BATTING_HBP       0.0228174  0.0748587   0.305 0.760541    
## TEAM_PITCHING_H        0.0023023  0.0009248   2.489 0.012866 *  
## TEAM_PITCHING_HR      -0.0185614  0.0279482  -0.664 0.506672    
## TEAM_PITCHING_BB      -0.0293608  0.0081877  -3.586 0.000343 ***
## TEAM_PITCHING_SO       0.0203634  0.0044766   4.549 5.68e-06 ***
## TEAM_FIELDING_E       -0.0282761  0.0031730  -8.912  < 2e-16 ***
## TEAM_FIELDING_DP      -0.1050337  0.0132362  -7.935 3.28e-15 ***
## TEAM_BATTING_1B               NA         NA      NA       NA    
## TEAM_BASERUN_SB_RATIO  8.5501892  4.8316905   1.770 0.076928 .  
## TEAM_BASERUN_CS_RATIO         NA         NA      NA       NA    
## TEAM_BATTING_WALK      0.0429526  0.0131798   3.259 0.001135 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.04 on 2257 degrees of freedom
## Multiple R-squared:  0.3201, Adjusted R-squared:  0.3146 
## F-statistic: 59.02 on 18 and 2257 DF,  p-value: < 2.2e-16

model_6 <- lm(TARGET_WINS ~ ., data=MB_train_99trim[c(2,3,4,5,6,7,8,9,12,14,15,16,17,21)])
summary(model_6)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = MB_train_99trim[c(2, 3, 
##     4, 5, 6, 7, 8, 9, 12, 14, 15, 16, 17, 21)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -62.874  -8.419   0.016   8.203  66.822 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -4.527734   8.985359  -0.504 0.614380    
## TEAM_BATTING_H     0.043483   0.003886  11.189  < 2e-16 ***
## TEAM_BATTING_2B   -0.016681   0.009080  -1.837 0.066326 .  
## TEAM_BATTING_3B    0.077813   0.017138   4.540 5.91e-06 ***
## TEAM_BATTING_HR    0.062586   0.009633   6.497 1.00e-10 ***
## TEAM_BATTING_BB    0.045760   0.009726   4.705 2.69e-06 ***
## TEAM_BATTING_SO   -0.025301   0.005180  -4.885 1.11e-06 ***
## TEAM_BASERUN_SB    0.035533   0.004454   7.978 2.34e-15 ***
## TEAM_PITCHING_H    0.002364   0.000908   2.604 0.009280 ** 
## TEAM_PITCHING_BB  -0.031599   0.007973  -3.963 7.63e-05 ***
## TEAM_PITCHING_SO   0.019540   0.004206   4.646 3.57e-06 ***
## TEAM_FIELDING_E   -0.027748   0.003020  -9.187  < 2e-16 ***
## TEAM_FIELDING_DP  -0.108767   0.013142  -8.277  < 2e-16 ***
## TEAM_BATTING_WALK  0.043904   0.012851   3.416 0.000646 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.04 on 2262 degrees of freedom
## Multiple R-squared:  0.3182, Adjusted R-squared:  0.3143 
## F-statistic: 81.22 on 13 and 2262 DF,  p-value: < 2.2e-16

par(mfrow=c(2,2))
plot(model_6)

R_2 <- summary(model_1)$r.squared
Adj_R_2 <-  summary(model_1)$adj.r.squared
MSE <- sum(model_1$residuals ^ 2 ) / model_1$df.residual
F_statistics <- summary(model_1)$fstatistic[1]

m2_R_2 <- summary(model_2)$r.squared
m2_Adj_R_2 <-  summary(model_2)$adj.r.squared
m2_MSE <- sum(model_2$residuals ^ 2 ) / model_2$df.residual
m2_F_statistics <- summary(model_2)$fstatistic[1]

m3_R_2 <- summary(model_3)$r.squared
m3_Adj_R_2 <-  summary(model_3)$adj.r.squared
m3_MSE <- sum(model_3$residuals ^ 2 ) / model_3$df.residual
m3_F_statistics <- summary(model_3)$fstatistic[1]

m4_R_2 <- summary(model_4)$r.squared
m4_Adj_R_2 <-  summary(model_4)$adj.r.squared
m4_MSE <- sum(model_4$residuals ^ 2 ) / model_4$df.residual
m4_F_statistics <- summary(model_4)$fstatistic[1]

m5_R_2 <- summary(model_5)$r.squared
m5_Adj_R_2 <-  summary(model_5)$adj.r.squared
m5_MSE <- sum(model_5$residuals ^ 2 ) / model_5$df.residual
m5_F_statistics <- summary(model_5)$fstatistic[1]

m6_R_2 <- summary(model_6)$r.squared
m6_Adj_R_2 <-  summary(model_6)$adj.r.squared
m6_MSE <- sum(model_6$residuals ^ 2 ) / model_6$df.residual
m6_F_statistics <- summary(model_6)$fstatistic[1]



compare_model1 <- rbind(R_2, Adj_R_2, MSE, F_statistics)
compare_model2 <- rbind(m2_R_2, m2_Adj_R_2, m2_MSE, m2_F_statistics)
compare_model3 <- rbind(m3_R_2, m3_Adj_R_2, m3_MSE, m3_F_statistics)
compare_model4 <- rbind(m4_R_2, m4_Adj_R_2, m4_MSE, m4_F_statistics)
compare_model5 <- rbind(m5_R_2, m5_Adj_R_2, m5_MSE, m5_F_statistics)
compare_model6 <- rbind(m6_R_2, m6_Adj_R_2, m6_MSE, m6_F_statistics)

compare <- data.frame(compare_model1, compare_model2, compare_model3, compare_model4, compare_model5, compare_model6)
colnames(compare) <- c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5", "Model 6")

kable(compare)

	Model 1	Model 2	Model 3	Model 4	Model 5	Model 6
R_2	0.2990080	0.3229188	0.3187801	0.4018550	0.2671466	0.3182463
Adj_R_2	0.2943554	0.3193284	0.3151678	0.3985757	0.2632605	0.3143281
MSE	175.0918065	168.8952541	169.9276276	104.1329280	182.8074100	170.1359684
F_statistics	64.2668440	89.9406020	88.2484656	122.5427589	68.7441504	81.2241260
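
The same four statistics are extracted for every model, so the table could also be built with one small helper applied to a list of models. This is only a sketch: `fit_stats` and `models` are hypothetical names, and the two models are fitted on R's built-in `mtcars` data as stand-ins for `model_1` to `model_6`. As in the code above, MSE here means the residual sum of squares divided by the residual degrees of freedom.

```r
# Hypothetical helper: collect the four fit statistics from an lm object
fit_stats <- function(model) {
  s <- summary(model)
  c(
    R_2          = s$r.squared,
    Adj_R_2      = s$adj.r.squared,
    MSE          = sum(residuals(model)^2) / df.residual(model),
    F_statistics = unname(s$fstatistic[1])
  )
}

# Stand-in models on built-in data (assumption: replace with model_1 ... model_6)
models <- list(
  "Model A" = lm(mpg ~ wt, data = mtcars),
  "Model B" = lm(mpg ~ wt + hp, data = mtcars)
)

# One column per model, one row per statistic
compare_demo <- sapply(models, fit_stats)
print(round(compare_demo, 4))
```

Because `sapply` returns a matrix with the statistic names as row names, the result can be passed to `kable` in the same way as `compare`.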

Model Selection

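From the table above, Model 4 and Model 2 appear to give the best fit. Model 4 has the highest $R^2$ (0.40) and the lowest MSE, but its TEAM_PITCHING_SO coefficient has a p-value of 0.621, well above 0.05 and therefore not statistically significant. It was also fit on only 1,835 observations, because 441 rows were deleted due to missingness, so its statistics are not directly comparable with those of the other models. We will therefore go with Model 2, which has the highest $R^2$, adjusted $R^2$ and F-statistic among the remaining models, along with the lowest MSE (168.9). A lower RMSE (the square root of the MSE) indicates a better fit. RMSE measures how accurately the model predicts the response, and it is the most important criterion for fit if the main purpose of the model is prediction.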
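
The comparison table reports MSE rather than RMSE, so the RMSE values can be obtained by taking square roots. A minimal sketch, using the MSE values copied from the table above (the vector names `mse` and `rmse` are introduced here for illustration):

```r
# MSE values copied from the model comparison table
mse <- c(
  "Model 1" = 175.0918065,
  "Model 2" = 168.8952541,
  "Model 3" = 169.9276276,
  "Model 4" = 104.1329280,
  "Model 5" = 182.8074100,
  "Model 6" = 170.1359684
)

# RMSE is expressed in the units of TARGET_WINS (wins per season)
rmse <- sqrt(mse)
print(round(rmse, 2))

# Lowest RMSE among the models other than Model 4
others <- rmse[names(rmse) != "Model 4"]
print(names(which.min(others)))
```

Since the MSE was computed with the residual degrees of freedom in the denominator, these values coincide with the residual standard errors printed in the model summaries.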

MB_test<- read.csv("https://raw.githubusercontent.com/mkollontai/DATA621_GroupWork/master/HW1/moneyball-evaluation-data.csv")
MB_test$TEAM_BATTING_1B <- MB_test$TEAM_BATTING_H - MB_test$TEAM_BATTING_2B - MB_test$TEAM_BATTING_3B - MB_test$TEAM_BATTING_HR
MB_test$TEAM_BASERUN_SB_RATIO <- MB_test$TEAM_BASERUN_SB/(MB_test$TEAM_BASERUN_SB + MB_test$TEAM_BASERUN_CS)
MB_test$TEAM_BASERUN_CS_RATIO <- MB_test$TEAM_BASERUN_CS/(MB_test$TEAM_BASERUN_SB + MB_test$TEAM_BASERUN_CS)
MB_test$TEAM_BATTING_WALK <- MB_test$TEAM_BATTING_BB + MB_test$TEAM_BATTING_HBP

# Replace missing values in each column with that column's median
impute_cols <- c("TEAM_BATTING_SO", "TEAM_BASERUN_SB", "TEAM_BASERUN_CS",
                 "TEAM_BATTING_HBP", "TEAM_PITCHING_SO", "TEAM_FIELDING_DP",
                 "TEAM_BASERUN_SB_RATIO", "TEAM_BASERUN_CS_RATIO", "TEAM_BATTING_WALK")
for (col in impute_cols) {
  missing <- is.na(MB_test[[col]])
  MB_test[[col]][missing] <- median(MB_test[[col]], na.rm = TRUE)
}

prediction <- predict(model_2, MB_test, interval = "prediction")

DT::datatable(prediction)

summary(prediction)

##       fit              lwr               upr        
##  Min.   : 25.13   Min.   :-0.8872   Min.   : 51.15  
##  1st Qu.: 74.08   1st Qu.:48.5580   1st Qu.: 99.60  
##  Median : 79.42   Median :53.8642   Median :104.97  
##  Mean   : 79.10   Mean   :53.5344   Mean   :104.67  
##  3rd Qu.: 84.76   3rd Qu.:59.2142   3rd Qu.:110.31  
##  Max.   :109.47   Max.   :83.8049   Max.   :135.13
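
The three columns in this summary come from `interval = "prediction"`: `fit` is the point prediction, and `lwr` and `upr` are the bounds of the prediction interval for a single new observation (95% by default). The sketch below shows the same call on R's built-in `mtcars` data, used purely as a stand-in; `new_cars` is a hypothetical data frame.

```r
# Stand-in model on built-in data (not the moneyball model)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical new observations to predict
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))

# Returns a matrix with columns fit, lwr and upr
pred <- predict(fit, newdata = new_cars, interval = "prediction", level = 0.95)
print(pred)

# Width of each prediction interval
print(pred[, "upr"] - pred[, "lwr"])
```

A prediction interval is wider than a confidence interval for the mean response because it also includes the residual variance of an individual observation.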

MB_test$PRED_WINS <- round(predict(model_2,  newdata = MB_test),0)

Pred_Data <- MB_test[, c("INDEX","PRED_WINS")]

DT::datatable(Pred_Data)

The table above shows the predicted wins for the test data.

Conclusion

This assignment helped us understand how to perform EDA, data preparation and model building. The statistical calculations and graphical results helped us understand the distribution of the data.

Future work

Future work could explore additional outside factors, such as the players' state of mind, the weather and the game location, and their impact on wins. We will consider these factors in future models for predicting wins.