Data Description

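Excluding the INDEX identifier, the data has 16 variables: the target, TARGET_WINS, and 15 predictors that can be categorized into batting, baserunning, pitching, and fielding. Each predictor has a theoretical positive or negative effect on wins.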

loc_train = "https://raw.githubusercontent.com/chrisestevez/DataAnalytics/master/Data/moneyball-training-data.csv"
loc_test = "https://raw.githubusercontent.com/chrisestevez/DataAnalytics/master/Data/moneyball-evaluation-data.csv"

train_df = read.csv(loc_train, stringsAsFactors = FALSE)
test_df = read.csv(loc_test, stringsAsFactors = FALSE)

VARIABLE NAME    | DEFINITION                             | THEORETICAL EFFECT
INDEX            | Identification Variable (do not use)   | None
TARGET_WINS      | Number of wins                         |
TEAM_BATTING_H   | Base Hits by batters (1B,2B,3B,HR)     | Positive Impact on Wins
TEAM_BATTING_2B  | Doubles by batters (2B)                | Positive Impact on Wins
TEAM_BATTING_3B  | Triples by batters (3B)                | Positive Impact on Wins
TEAM_BATTING_HR  | Homeruns by batters (4B)               | Positive Impact on Wins
TEAM_BATTING_BB  | Walks by batters                       | Positive Impact on Wins
TEAM_BATTING_HBP | Batters hit by pitch (get a free base) | Positive Impact on Wins
TEAM_BATTING_SO  | Strikeouts by batters                  | Negative Impact on Wins
TEAM_BASERUN_SB  | Stolen bases                           | Positive Impact on Wins
TEAM_BASERUN_CS  | Caught stealing                        | Negative Impact on Wins
TEAM_FIELDING_E  | Errors                                 | Negative Impact on Wins
TEAM_FIELDING_DP | Double Plays                           | Positive Impact on Wins
TEAM_PITCHING_BB | Walks allowed                          | Negative Impact on Wins
TEAM_PITCHING_H  | Hits allowed                           | Negative Impact on Wins
TEAM_PITCHING_HR | Homeruns allowed                       | Negative Impact on Wins
TEAM_PITCHING_SO | Strikeouts by pitchers                 | Positive Impact on Wins

DATA EXPLORATION

Describe the size and the variables in the moneyball training data set. Consider that too much detail will cause a manager to lose interest while too little detail will make the manager consider that you aren’t doing your job. Some suggestions are given below. Please do NOT treat this as a check list of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.

Summary

I started by summarizing each variable and identifying NAs.

  1. Mean / Standard Deviation / Median
summary(train_df)
##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286

Descriptive Plots

To get a feel for the data, I created various descriptive plots. I specifically chose the histogram, box plot, and scatter plot. The histogram shows the distribution, the box plot gives a good sense of possible outliers, and the scatter plot shows how each variable relates to TARGET_WINS.

  1. Bar Chart or Box Plot of the data
library(dplyr)
box = train_df %>% select(-INDEX)

# For each pair of variables, draw a histogram, a box plot, and a scatter
# plot against TARGET_WINS with a fitted regression line (one page per pair).
for (first in seq(1, 15, by = 2)) {
      par(mfrow=c(3,3))
      for (i in first:(first + 1)) {
            hist(box[,i], main=names(box[i]), breaks=51)
            boxplot(box[,i], main=names(box[i]), type="l", horizontal = TRUE)

            plot(box[,i], box$TARGET_WINS, main = names(box[i]))
            abline(lm(box$TARGET_WINS ~ box[,i], data = box), col = "blue")
      }
}

Correlation

  1. Is the data correlated to the target variable (or to other variables?)

I computed a correlation matrix on pairwise-complete observations, so missing values were excluded pair by pair. No variable shows a substantial correlation with the TARGET_WINS variable. TEAM_PITCHING_HR and TEAM_BATTING_HR are strongly correlated with each other, which might indicate collinearity.

library("corrplot")
cor_mx = cor(box ,use="pairwise.complete.obs", method = "pearson")
corrplot(cor_mx, method = "color", 
         type = "upper", order = "original", number.cex = .7,
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "black", tl.srt = 90, # Text label color and rotation
                  # show the correlation coefficient on the principal diagonal
         diag = TRUE)
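
The strongest relationships can also be read off the matrix directly rather than from the plot. The following is a minimal sketch on a small synthetic data frame; the helper name `top_correlations` and the 0.8 cutoff are illustrative assumptions and are not part of the analysis above.

```r
# Hypothetical helper: list variable pairs whose absolute Pearson
# correlation exceeds a cutoff (the 0.8 default is an arbitrary choice).
top_correlations <- function(df, cutoff = 0.8) {
  cm <- cor(df, use = "pairwise.complete.obs", method = "pearson")
  # Look only at the upper triangle so each pair is reported once
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Synthetic example: b is nearly a copy of a, c is unrelated noise
set.seed(1)
toy <- data.frame(a = rnorm(200))
toy$b <- toy$a + rnorm(200, sd = 0.1)
toy$c <- rnorm(200)
toy$a[1:5] <- NA  # missing values are handled pair by pair

pairs_found <- top_correlations(toy)
print(pairs_found)
```

Applied to `box`, a call such as `top_correlations(box)` would be expected to surface pairs like TEAM_BATTING_HR and TEAM_PITCHING_HR, though the cutoff may need adjusting.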

Missing Values

  1. Are any of the variables missing and need to be imputed “fixed”?

Six variables contain all the NAs within the data set. These variables will be imputed or transformed.

The variables that contain NAs are:

Variable         | NAs
TEAM_BATTING_SO  | 102
TEAM_BASERUN_SB  | 131
TEAM_BASERUN_CS  | 772
TEAM_BATTING_HBP | 2085
TEAM_PITCHING_SO | 102
TEAM_FIELDING_DP | 286
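
A table like the one above can be produced programmatically instead of being read off the `summary()` output. This is a small sketch on a made-up data frame; the column names and values in `toy` are placeholders, not the moneyball data.

```r
# Toy data frame standing in for train_df (values are made up)
toy <- data.frame(
  wins = c(80, 75, 90, 60),
  so   = c(700, NA, 650, NA),
  hbp  = c(NA, NA, NA, 55)
)

# Count NAs per column and express them as a share of all rows
na_counts <- colSums(is.na(toy))
na_table <- data.frame(
  Variable   = names(na_counts),
  NAs        = as.integer(na_counts),
  PctMissing = round(100 * as.numeric(na_counts) / nrow(toy), 1)
)

# Keep only the columns that actually have missing values
na_table <- na_table[na_table$NAs > 0, ]
print(na_table)
```

The percentage column helps separate variables with a few gaps from ones that are mostly empty, which matters when choosing an imputation strategy.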

DATA PREPARATION

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.

Missing Values Train Data

  1. Fix missing values (maybe with a Mean or Median value)

I will replace missing TEAM_BATTING_HBP values with zero and all other missing values with the median of the corresponding variable.

Data = train_df

Data$TEAM_BATTING_SO[is.na(Data$TEAM_BATTING_SO)]= median(Data$TEAM_BATTING_SO,na.rm = TRUE)

Data$TEAM_BASERUN_SB[is.na(Data$TEAM_BASERUN_SB)]= median(Data$TEAM_BASERUN_SB,na.rm = TRUE)

Data$TEAM_BASERUN_CS[is.na(Data$TEAM_BASERUN_CS)]= median(Data$TEAM_BASERUN_CS,na.rm = TRUE)

Data$TEAM_BATTING_HBP[is.na(Data$TEAM_BATTING_HBP)]= 0

Data$TEAM_PITCHING_SO[is.na(Data$TEAM_PITCHING_SO)]= median(Data$TEAM_PITCHING_SO,na.rm = TRUE)

Data$TEAM_FIELDING_DP[is.na(Data$TEAM_FIELDING_DP)]= median(Data$TEAM_FIELDING_DP,na.rm = TRUE)

Missing Values Test Data

I will replicate the prior step in the test data.

test_df$TEAM_BATTING_SO[is.na(test_df$TEAM_BATTING_SO)]= median(test_df$TEAM_BATTING_SO,na.rm = TRUE)

test_df$TEAM_BASERUN_SB[is.na(test_df$TEAM_BASERUN_SB)]= median(test_df$TEAM_BASERUN_SB,na.rm = TRUE)

test_df$TEAM_BASERUN_CS[is.na(test_df$TEAM_BASERUN_CS)]= median(test_df$TEAM_BASERUN_CS,na.rm = TRUE)

test_df$TEAM_BATTING_HBP[is.na(test_df$TEAM_BATTING_HBP)]= 0

test_df$TEAM_PITCHING_SO[is.na(test_df$TEAM_PITCHING_SO)]= median(test_df$TEAM_PITCHING_SO,na.rm = TRUE)

test_df$TEAM_FIELDING_DP[is.na(test_df$TEAM_FIELDING_DP)]= median(test_df$TEAM_FIELDING_DP,na.rm = TRUE)
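
The test-data step above takes each median from the test set itself. A common alternative, shown here only as a sketch and not as what this report does, is to compute the medians once on the training data and reuse them for the test data so that both sets are filled with the same values. The helper names `fit_medians` and `apply_medians` and the toy data frames are hypothetical.

```r
# Hypothetical helpers (not from the report): learn medians on one data
# frame and reuse them to fill NAs in another.
fit_medians <- function(df, cols) {
  sapply(df[cols], median, na.rm = TRUE)
}

apply_medians <- function(df, medians) {
  for (col in names(medians)) {
    df[[col]][is.na(df[[col]])] <- medians[[col]]
  }
  df
}

# Toy stand-ins for train_df and test_df
train_toy <- data.frame(sb = c(10, 20, NA, 40), dp = c(100, NA, 140, 160))
test_toy  <- data.frame(sb = c(NA, 500),        dp = c(150, NA))

med <- fit_medians(train_toy, c("sb", "dp"))
test_filled <- apply_medians(test_toy, med)
print(test_filled)
```
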
  1. Create flags to suggest if a variable was missing
  2. Transform data by putting it into buckets
  3. Mathematical transforms such as log or square root (or use Box-Cox)

For the first model, I will use the data as is, with only the NAs replaced. The second model will apply a log transformation to two variables that are highly skewed. Finally, for the third model I will remove outliers.
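
To illustrate why a log transformation helps with a long right tail, the sketch below measures skewness before and after taking logs. It uses a synthetic log-normal variable rather than the moneyball data, and the `skewness` helper is written out by hand because base R does not provide one.

```r
# Moment-based sample skewness (base R has no built-in skewness function)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Synthetic right-skewed variable standing in for a long-tailed column
set.seed(42)
x <- rlnorm(1000, meanlog = 7, sdlog = 0.5)

skew_raw <- skewness(x)
skew_log <- skewness(log(x))
cat("skewness before log:", round(skew_raw, 2), "\n")
cat("skewness after log: ", round(skew_log, 2), "\n")
```

The log is only defined for positive values. According to the summary above, TEAM_PITCHING_H and TEAM_FIELDING_E have minimums of 1137 and 65 in the training data, so the transformation is safe there.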

  1. Combine variables (such as ratios or adding or multiplying) to create new variables

I will create two variables, one offensive and one defensive. The offense variable will contain the variables that encourage scoring, while the defense variable will include pitching strikeouts and double plays. The same variables will be created in the test data frame.

Data$offense = (
Data$TEAM_BATTING_H+
Data$TEAM_BATTING_2B+
Data$TEAM_BATTING_3B+
Data$TEAM_BATTING_HR+
Data$TEAM_BATTING_BB+
Data$TEAM_BATTING_HBP+
Data$TEAM_BASERUN_SB)

Data$defense = round(
Data$TEAM_FIELDING_DP+
Data$TEAM_PITCHING_SO
)

# Descriptive charts for the new variables.
par(mfrow=c(2,2))
hist(Data$offense, main="offense", breaks=51)
boxplot(Data$offense, main="offense", type="l", horizontal = TRUE)

hist(Data$defense, main="defense", breaks=51)
boxplot(Data$defense, main="defense", type="l", horizontal = TRUE)

# Create the same variables in the test data frame.

test_df$offense = (
test_df$TEAM_BATTING_H+
test_df$TEAM_BATTING_2B+
test_df$TEAM_BATTING_3B+
test_df$TEAM_BATTING_HR+
test_df$TEAM_BATTING_BB+
test_df$TEAM_BATTING_HBP+
test_df$TEAM_BASERUN_SB)

test_df$defense = round(
test_df$TEAM_FIELDING_DP+
test_df$TEAM_PITCHING_SO
)

#converted log variables
test_df2 = test_df

test_df2$TEAM_PITCHING_H = log(test_df2$TEAM_PITCHING_H)
test_df2$TEAM_FIELDING_E = log(test_df2$TEAM_FIELDING_E)

New correlation

I created a new correlation matrix to visualize the new variables in the data. Of the two new variables, offense seems to be the more highly correlated with wins.

cor_mx = cor(Data ,use="pairwise.complete.obs", method = "pearson")
corrplot(cor_mx, method = "color", 
         type = "upper", order = "original", number.cex = .7,
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "black", tl.srt = 90, # Text label color and rotation
                  # show the correlation coefficient on the principal diagonal
         diag = TRUE)

BUILD MODELS

Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Since we have not yet covered automated variable selection methods, you should select the variables manually (unless you previously learned Forward or Stepwise selection, etc.). Since you manually selected a variable for inclusion into the model or exclusion into the model, indicate why this was done.

Discuss the coefficients in the models, do they make sense? For example, if a team hits a lot of Home Runs, it would be reasonably expected that such a team would win more games. However, if the coefficient is negative (suggesting that the team would lose more games), then that needs to be discussed. Are you keeping the model even though it is counter intuitive? Why? The boss needs to know.

Model 1

In the first model, I included the newly created variables to see how they perform under backward selection. I selected the best model based on p-values and dropped variables that were not significant. After running the model, I ran a VIF test on the predictors and found that TEAM_PITCHING_SO and defense each had a score of about 650, so defense was removed.

All coefficients are positive with the exception of TEAM_BATTING_SO, TEAM_BATTING_HBP, TEAM_PITCHING_H, and TEAM_FIELDING_E, which indicates a negative association with wins. TEAM_BATTING_SO is a good example: a team that strikes out more is predicted to win less. The negative TEAM_BATTING_HBP coefficient runs against its theoretical positive effect and should be read with caution, since most of its values were missing and imputed as zero.

\[TARGET.WINS = 11.537+0.043TEAM.BATTING.H+0.077TEAM.BATTING.3B+0.055TEAM.BATTING.HR\] \[-0.005TEAM.BATTING.SO+0.033TEAM.BASERUN.SB-0.061TEAM.BATTING.HBP\] \[-0.001TEAM.PITCHING.H+0.002TEAM.PITCHING.SO-0.020TEAM.FIELDING.E\]
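
The p-value-based backward selection described above can be sketched as a loop. This is an illustrative version on synthetic data, not the exact sequence of `update()` calls used in this report: it drops one predictor at a time, assumes all predictors are numeric (so coefficient names match term names), and uses a hypothetical helper name, `backward_by_pvalue`, with an assumed 0.05 threshold.

```r
# Hypothetical helper: repeatedly drop the predictor with the largest
# p-value until every remaining predictor is below alpha.
backward_by_pvalue <- function(model, alpha = 0.05) {
  repeat {
    coefs <- summary(model)$coefficients
    coefs <- coefs[rownames(coefs) != "(Intercept)", , drop = FALSE]
    if (nrow(coefs) == 0) break
    worst <- which.max(coefs[, "Pr(>|t|)"])
    if (coefs[worst, "Pr(>|t|)"] < alpha) break
    drop_term <- rownames(coefs)[worst]
    model <- update(model, as.formula(paste(". ~ . -", drop_term)))
  }
  model
}

# Synthetic data: y depends on x1 and x2, "noise" is unrelated
set.seed(7)
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 5 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + noise, data = toy)
reduced <- backward_by_pvalue(full)
print(formula(reduced))
```

Coefficients that `lm()` reports as NA because of singularities do not appear in the coefficient table, so a term like offense would still need to be removed by hand before or after running such a loop.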

library("car")

Model1 =lm(TARGET_WINS~ ., data = Data)
summary(Model1)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.847  -8.779   0.108   8.337  57.590 
## 
## Coefficients: (1 not defined because of singularities)
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      23.0688238  5.4095384   4.264 2.09e-05 ***
## INDEX            -0.0004367  0.0003759  -1.162  0.24549    
## TEAM_BATTING_H    0.0482745  0.0036925  13.074  < 2e-16 ***
## TEAM_BATTING_2B  -0.0126595  0.0095144  -1.331  0.18347    
## TEAM_BATTING_3B   0.0666001  0.0168026   3.964 7.61e-05 ***
## TEAM_BATTING_HR   0.0679279  0.0278189   2.442  0.01469 *  
## TEAM_BATTING_BB   0.0083350  0.0058596   1.422  0.15503    
## TEAM_BATTING_SO  -0.0067459  0.0025893  -2.605  0.00924 ** 
## TEAM_BASERUN_SB   0.0266949  0.0043680   6.112 1.16e-09 ***
## TEAM_BASERUN_CS  -0.0183749  0.0159185  -1.154  0.24850    
## TEAM_BATTING_HBP -0.0621057  0.0194199  -3.198  0.00140 ** 
## TEAM_PITCHING_H  -0.0007974  0.0003672  -2.172  0.02999 *  
## TEAM_PITCHING_HR  0.0012236  0.0246097   0.050  0.96035    
## TEAM_PITCHING_BB  0.0013535  0.0041553   0.326  0.74466    
## TEAM_PITCHING_SO  1.1908191  2.5966764   0.459  0.64657    
## TEAM_FIELDING_E  -0.0192913  0.0024600  -7.842 6.78e-15 ***
## TEAM_FIELDING_DP  1.0649562  2.5959469   0.410  0.68167    
## offense                  NA         NA      NA       NA    
## defense          -1.1882583  2.5966761  -0.458  0.64728    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.05 on 2258 degrees of freedom
## Multiple R-squared:  0.3189, Adjusted R-squared:  0.3138 
## F-statistic:  62.2 on 17 and 2258 DF,  p-value: < 2.2e-16
Model1 = update(Model1, .~.-offense-TEAM_FIELDING_DP-INDEX, data = Data)
summary(Model1)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_BATTING_HBP + TEAM_PITCHING_H + 
##     TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + defense, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.089  -8.661   0.102   8.459  57.142 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      22.5270660  5.3907673   4.179 3.04e-05 ***
## TEAM_BATTING_H    0.0483802  0.0036910  13.108  < 2e-16 ***
## TEAM_BATTING_2B  -0.0127219  0.0095119  -1.337  0.18120    
## TEAM_BATTING_3B   0.0654106  0.0167719   3.900 9.90e-05 ***
## TEAM_BATTING_HR   0.0673451  0.0278107   2.422  0.01553 *  
## TEAM_BATTING_BB   0.0085297  0.0058564   1.456  0.14540    
## TEAM_BATTING_SO  -0.0068695  0.0025871  -2.655  0.00798 ** 
## TEAM_BASERUN_SB   0.0263122  0.0043562   6.040 1.79e-09 ***
## TEAM_BASERUN_CS  -0.0182555  0.0159125  -1.147  0.25140    
## TEAM_BATTING_HBP -0.0622805  0.0194149  -3.208  0.00136 ** 
## TEAM_PITCHING_H  -0.0008189  0.0003667  -2.233  0.02563 *  
## TEAM_PITCHING_HR  0.0013609  0.0246066   0.055  0.95590    
## TEAM_PITCHING_BB  0.0014564  0.0041540   0.351  0.72592    
## TEAM_PITCHING_SO  0.1261750  0.0129778   9.722  < 2e-16 ***
## TEAM_FIELDING_E  -0.0191850  0.0024582  -7.804 9.06e-15 ***
## defense          -0.1236247  0.0129311  -9.560  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.05 on 2260 degrees of freedom
## Multiple R-squared:  0.3185, Adjusted R-squared:  0.314 
## F-statistic: 70.41 on 15 and 2260 DF,  p-value: < 2.2e-16
Model1 = update(Model1, .~.-TEAM_PITCHING_HR, data = Data)
summary(Model1)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_BATTING_HBP + TEAM_PITCHING_H + 
##     TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + defense, 
##     data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.097  -8.663   0.098   8.461  57.145 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      22.4911454  5.3503171   4.204 2.73e-05 ***
## TEAM_BATTING_H    0.0484012  0.0036707  13.186  < 2e-16 ***
## TEAM_BATTING_2B  -0.0127173  0.0095094  -1.337  0.18125    
## TEAM_BATTING_3B   0.0655242  0.0166420   3.937 8.49e-05 ***
## TEAM_BATTING_HR   0.0687847  0.0097894   7.026 2.79e-12 ***
## TEAM_BATTING_BB   0.0083929  0.0053070   1.581  0.11391    
## TEAM_BATTING_SO  -0.0068443  0.0025458  -2.688  0.00723 ** 
## TEAM_BASERUN_SB   0.0262958  0.0043450   6.052 1.67e-09 ***
## TEAM_BASERUN_CS  -0.0182423  0.0159072  -1.147  0.25159    
## TEAM_BATTING_HBP -0.0624384  0.0191997  -3.252  0.00116 ** 
## TEAM_PITCHING_H  -0.0008203  0.0003658  -2.242  0.02505 *  
## TEAM_PITCHING_BB  0.0015727  0.0035813   0.439  0.66059    
## TEAM_PITCHING_SO  0.1261590  0.0129717   9.726  < 2e-16 ***
## TEAM_FIELDING_E  -0.0191775  0.0024539  -7.815 8.34e-15 ***
## defense          -0.1236252  0.0129283  -9.562  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.04 on 2261 degrees of freedom
## Multiple R-squared:  0.3185, Adjusted R-squared:  0.3143 
## F-statistic: 75.48 on 14 and 2261 DF,  p-value: < 2.2e-16
Model1 = update(Model1, .~.-TEAM_PITCHING_BB, data = Data)
summary(Model1)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_BATTING_HBP + TEAM_PITCHING_H + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + defense, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.199  -8.665   0.062   8.430  57.120 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      22.2188779  5.3133273   4.182 3.00e-05 ***
## TEAM_BATTING_H    0.0484033  0.0036700  13.189  < 2e-16 ***
## TEAM_BATTING_2B  -0.0126966  0.0095076  -1.335  0.18187    
## TEAM_BATTING_3B   0.0660236  0.0166002   3.977 7.19e-05 ***
## TEAM_BATTING_HR   0.0686792  0.0097847   7.019 2.94e-12 ***
## TEAM_BATTING_BB   0.0101817  0.0034013   2.994  0.00279 ** 
## TEAM_BATTING_SO  -0.0070820  0.0024871  -2.848  0.00445 ** 
## TEAM_BASERUN_SB   0.0265476  0.0043062   6.165 8.33e-10 ***
## TEAM_BASERUN_CS  -0.0184165  0.0158994  -1.158  0.24686    
## TEAM_BATTING_HBP -0.0626710  0.0191890  -3.266  0.00111 ** 
## TEAM_PITCHING_H  -0.0007430  0.0003208  -2.316  0.02062 *  
## TEAM_PITCHING_SO  0.1262829  0.0129664   9.739  < 2e-16 ***
## TEAM_FIELDING_E  -0.0191028  0.0024476  -7.805 9.03e-15 ***
## defense          -0.1235041  0.0129230  -9.557  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.04 on 2262 degrees of freedom
## Multiple R-squared:  0.3184, Adjusted R-squared:  0.3145 
## F-statistic:  81.3 on 13 and 2262 DF,  p-value: < 2.2e-16
Model1 = update(Model1, .~.-TEAM_BASERUN_CS, data = Data)
summary(Model1)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + 
##     TEAM_BASERUN_SB + TEAM_BATTING_HBP + TEAM_PITCHING_H + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + defense, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.242  -8.631   0.019   8.423  57.162 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      21.1787267  5.2372901   4.044 5.43e-05 ***
## TEAM_BATTING_H    0.0483642  0.0036701  13.178  < 2e-16 ***
## TEAM_BATTING_2B  -0.0135385  0.0094805  -1.428  0.15342    
## TEAM_BATTING_3B   0.0667144  0.0165907   4.021 5.98e-05 ***
## TEAM_BATTING_HR   0.0704704  0.0096624   7.293 4.16e-13 ***
## TEAM_BATTING_BB   0.0106081  0.0033815   3.137  0.00173 ** 
## TEAM_BATTING_SO  -0.0071615  0.0024863  -2.880  0.00401 ** 
## TEAM_BASERUN_SB   0.0254686  0.0042046   6.057 1.62e-09 ***
## TEAM_BATTING_HBP -0.0594479  0.0189876  -3.131  0.00177 ** 
## TEAM_PITCHING_H  -0.0007623  0.0003204  -2.379  0.01742 *  
## TEAM_PITCHING_SO  0.1265462  0.0129654   9.760  < 2e-16 ***
## TEAM_FIELDING_E  -0.0185051  0.0023927  -7.734 1.56e-14 ***
## defense          -0.1237438  0.0129223  -9.576  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.04 on 2263 degrees of freedom
## Multiple R-squared:  0.318,  Adjusted R-squared:  0.3144 
## F-statistic: 87.95 on 12 and 2263 DF,  p-value: < 2.2e-16
Model1 = update(Model1, .~.-TEAM_BATTING_2B, data = Data)
summary(Model1)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_BATTING_HBP + TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
##     defense, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -50.026  -8.621   0.159   8.289  56.168 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      23.3785334  5.0067700   4.669 3.20e-06 ***
## TEAM_BATTING_H    0.0449880  0.0028078  16.022  < 2e-16 ***
## TEAM_BATTING_3B   0.0686868  0.0165369   4.154 3.40e-05 ***
## TEAM_BATTING_HR   0.0709784  0.0096581   7.349 2.77e-13 ***
## TEAM_BATTING_BB   0.0100918  0.0033629   3.001 0.002721 ** 
## TEAM_BATTING_SO  -0.0076681  0.0024615  -3.115 0.001861 ** 
## TEAM_BASERUN_SB   0.0262441  0.0041703   6.293 3.73e-10 ***
## TEAM_BATTING_HBP -0.0666003  0.0183193  -3.636 0.000284 ***
## TEAM_PITCHING_H  -0.0007715  0.0003204  -2.408 0.016106 *  
## TEAM_PITCHING_SO  0.1270159  0.0129642   9.797  < 2e-16 ***
## TEAM_FIELDING_E  -0.0179991  0.0023669  -7.605 4.16e-14 ***
## defense          -0.1243508  0.0129183  -9.626  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.05 on 2264 degrees of freedom
## Multiple R-squared:  0.3174, Adjusted R-squared:  0.3141 
## F-statistic: 95.71 on 11 and 2264 DF,  p-value: < 2.2e-16
# Due to high multicollinearity, defense will be dropped
vif(Model1)
##   TEAM_BATTING_H  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB 
##         2.203274         2.853386         4.571000         2.274894 
##  TEAM_BATTING_SO  TEAM_BASERUN_SB TEAM_BATTING_HBP  TEAM_PITCHING_H 
##         4.778815         1.695716         1.278557         2.715353 
## TEAM_PITCHING_SO  TEAM_FIELDING_E          defense 
##       656.440582         3.885045       653.713029
Model1 = update(Model1, .~.-defense, data = Data)
summary(Model1)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_BATTING_HBP + TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E, 
##     data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.915  -9.088   0.164   8.950  54.437 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.5592027  4.8926192   1.954 0.050848 .  
## TEAM_BATTING_H    0.0435273  0.0028599  15.220  < 2e-16 ***
## TEAM_BATTING_3B   0.0763835  0.0168485   4.534 6.10e-06 ***
## TEAM_BATTING_HR   0.0521103  0.0096466   5.402 7.28e-08 ***
## TEAM_BATTING_BB   0.0033267  0.0033546   0.992 0.321444    
## TEAM_BATTING_SO  -0.0044930  0.0024881  -1.806 0.071087 .  
## TEAM_BASERUN_SB   0.0320559  0.0042090   7.616 3.82e-14 ***
## TEAM_BATTING_HBP -0.0596125  0.0186716  -3.193 0.001429 ** 
## TEAM_PITCHING_H  -0.0007751  0.0003268  -2.372 0.017773 *  
## TEAM_PITCHING_SO  0.0023882  0.0006786   3.519 0.000441 ***
## TEAM_FIELDING_E  -0.0191701  0.0024111  -7.951 2.90e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.31 on 2265 degrees of freedom
## Multiple R-squared:  0.2895, Adjusted R-squared:  0.2863 
## F-statistic: 92.28 on 10 and 2265 DF,  p-value: < 2.2e-16
Model1 = update(Model1, .~.-TEAM_BATTING_BB, data = Data)
summary(Model1)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BATTING_HBP + 
##     TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -51.709  -9.153   0.079   8.912  54.503 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      11.5374038  4.4675057   2.583  0.00987 ** 
## TEAM_BATTING_H    0.0433478  0.0028541  15.188  < 2e-16 ***
## TEAM_BATTING_3B   0.0772422  0.0168261   4.591 4.66e-06 ***
## TEAM_BATTING_HR   0.0554103  0.0090545   6.120 1.10e-09 ***
## TEAM_BATTING_SO  -0.0049071  0.0024528  -2.001  0.04556 *  
## TEAM_BASERUN_SB   0.0333797  0.0039917   8.362  < 2e-16 ***
## TEAM_BATTING_HBP -0.0609642  0.0186217  -3.274  0.00108 ** 
## TEAM_PITCHING_H  -0.0007815  0.0003267  -2.392  0.01684 *  
## TEAM_PITCHING_SO  0.0023969  0.0006785   3.533  0.00042 ***
## TEAM_FIELDING_E  -0.0202643  0.0021439  -9.452  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.31 on 2266 degrees of freedom
## Multiple R-squared:  0.2892, Adjusted R-squared:  0.2864 
## F-statistic: 102.4 on 9 and 2266 DF,  p-value: < 2.2e-16
print("multi-collinearity test")
## [1] "multi-collinearity test"
vif(Model1)
##   TEAM_BATTING_H  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_SO 
##         2.188040         2.839196         3.861238         4.560829 
##  TEAM_BASERUN_SB TEAM_BATTING_HBP  TEAM_PITCHING_H TEAM_PITCHING_SO 
##         1.493179         1.269747         2.714305         1.728273 
##  TEAM_FIELDING_E 
##         3.063431
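
To show where VIF scores in the hundreds come from, the sketch below computes the variance inflation factor by hand as 1 / (1 - R²), where R² comes from regressing each predictor on the others. The data are synthetic and the helper name `manual_vif` is hypothetical. The `total` column is built as a strikeout-like variable plus a much smaller one, which mimics how defense tracks TEAM_PITCHING_SO once TEAM_FIELDING_DP is no longer in the model.

```r
# VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all the other predictors.
manual_vif <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

# Synthetic predictors: "total" is "so" plus a smaller, separate component
set.seed(3)
n <- 500
preds <- data.frame(so = rnorm(n, 800, 200), e = rnorm(n, 200, 50))
dp <- rnorm(n, 150, 25)          # deliberately left out of preds
preds$total <- preds$so + dp

vifs <- manual_vif(preds)
print(round(vifs, 1))
```

For a model with only numeric main effects, these values should agree with what `vif()` from the car package reports. Dropping either `so` or `total` brings the remaining scores back down.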

Model 1 Results

hist(Model1$residuals,25)

Final_Result_Model1=data.frame('INDEX'=test_df$INDEX,'TARGET_WINS'=predict(Model1, test_df))
boxplot(Final_Result_Model1$TARGET_WINS,xlab="Win Predicted" ,horizontal = TRUE)

par(mfrow=c(2,2))
plot(Model1)

Model 2

In Model 2, I log-transform two variables that are highly skewed in the data, TEAM_PITCHING_H and TEAM_FIELDING_E. I applied the same transformation to the test data so that the final predictions are made on the same scale.

\[Model 2\] \[TARGET.WINS = 75.985+0.042TEAM.BATTING.H+0.111TEAM.BATTING.3B+0.054TEAM.BATTING.HR\] \[+0.012TEAM.BATTING.BB-0.007TEAM.BATTING.SO+0.031TEAM.BASERUN.SB-0.082TEAM.BATTING.HBP\] \[+0.002TEAM.PITCHING.SO-10.314\log(TEAM.FIELDING.E)-0.134TEAM.FIELDING.DP\]

options(scipen=999)
DataM2 = Data

DataM2$TEAM_PITCHING_H = log(DataM2$TEAM_PITCHING_H)
DataM2$TEAM_FIELDING_E = log(DataM2$TEAM_FIELDING_E)


Model2 =lm(TARGET_WINS~ ., data = DataM2)
summary(Model2)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = DataM2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.164  -8.383  -0.079   8.083  62.144 
## 
## Coefficients: (1 not defined because of singularities)
##                     Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)       77.8188058  16.3352152   4.764    0.000002019762402 ***
## INDEX             -0.0005219   0.0003741  -1.395              0.16313    
## TEAM_BATTING_H     0.0469792   0.0039952  11.759 < 0.0000000000000002 ***
## TEAM_BATTING_2B   -0.0168678   0.0095017  -1.775              0.07599 .  
## TEAM_BATTING_3B    0.1077869   0.0175072   6.157    0.000000000876463 ***
## TEAM_BATTING_HR    0.0422044   0.0279690   1.509              0.13145    
## TEAM_BATTING_BB    0.0180675   0.0061007   2.962              0.00309 ** 
## TEAM_BATTING_SO   -0.0082754   0.0026280  -3.149              0.00166 ** 
## TEAM_BASERUN_SB    0.0334352   0.0044466   7.519    0.000000000000079 ***
## TEAM_BASERUN_CS   -0.0298590   0.0159212  -1.875              0.06086 .  
## TEAM_BATTING_HBP  -0.0772024   0.0193678  -3.986    0.000069294528709 ***
## TEAM_PITCHING_H   -0.2014121   2.4968418  -0.081              0.93571    
## TEAM_PITCHING_HR   0.0086468   0.0250504   0.345              0.72999    
## TEAM_PITCHING_BB  -0.0061709   0.0042063  -1.467              0.14250    
## TEAM_PITCHING_SO   1.0901508   2.5842150   0.422              0.67317    
## TEAM_FIELDING_E  -10.6610682   1.1010832  -9.682 < 0.0000000000000002 ***
## TEAM_FIELDING_DP   0.9543414   2.5834837   0.369              0.71186    
## offense                   NA          NA      NA                   NA    
## defense           -1.0865944   2.5842146  -0.420              0.67418    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.99 on 2258 degrees of freedom
## Multiple R-squared:  0.3255, Adjusted R-squared:  0.3204 
## F-statistic: 64.09 on 17 and 2258 DF,  p-value: < 0.00000000000000022
Model2 = update(Model2, .~.-offense-defense-INDEX-TEAM_PITCHING_H-TEAM_PITCHING_HR-TEAM_BATTING_2B-TEAM_PITCHING_BB-TEAM_BASERUN_CS, data = DataM2)
summary(Model2)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_BATTING_HBP + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, 
##     data = DataM2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.391  -8.542  -0.083   8.088  60.887 
## 
## Coefficients:
##                     Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)       75.9849762   7.4952864  10.138 < 0.0000000000000002 ***
## TEAM_BATTING_H     0.0418572   0.0027354  15.302 < 0.0000000000000002 ***
## TEAM_BATTING_3B    0.1114884   0.0165633   6.731    0.000000000021274 ***
## TEAM_BATTING_HR    0.0542575   0.0096871   5.601    0.000000023896248 ***
## TEAM_BATTING_BB    0.0120261   0.0031937   3.766             0.000170 ***
## TEAM_BATTING_SO   -0.0073427   0.0023991  -3.061             0.002235 ** 
## TEAM_BASERUN_SB    0.0313730   0.0042143   7.444    0.000000000000137 ***
## TEAM_BATTING_HBP  -0.0824430   0.0182398  -4.520    0.000006504028293 ***
## TEAM_PITCHING_SO   0.0020527   0.0005868   3.498             0.000477 ***
## TEAM_FIELDING_E  -10.3138761   0.9006844 -11.451 < 0.0000000000000002 ***
## TEAM_FIELDING_DP  -0.1341880   0.0128480 -10.444 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13 on 2265 degrees of freedom
## Multiple R-squared:  0.3217, Adjusted R-squared:  0.3187 
## F-statistic: 107.4 on 10 and 2265 DF,  p-value: < 0.00000000000000022
print("multi-collinearity test")
## [1] "multi-collinearity test"
vif(Model2)
##   TEAM_BATTING_H  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB 
##         2.105043         2.881638         4.629231         2.065391 
##  TEAM_BATTING_SO  TEAM_BASERUN_SB TEAM_BATTING_HBP TEAM_PITCHING_SO 
##         4.570141         1.743215         1.275957         1.353781 
##  TEAM_FIELDING_E TEAM_FIELDING_DP 
##         4.119114         1.337448
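
The VIF values above can be reproduced by hand: regress each predictor on the remaining predictors and compute 1 / (1 - R-squared). The sketch below does this in base R on the built-in mtcars data (a stand-in, not the moneyball data). `manual_vif` is a hypothetical helper written for illustration, and it assumes a model with an intercept and only numeric predictors.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all the other predictors in the model.
manual_vif <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # drop the intercept column
  sapply(colnames(X), function(name) {
    others <- X[, colnames(X) != name, drop = FALSE]
    r2 <- summary(lm(X[, name] ~ others))$r.squared
    1 / (1 - r2)
  })
}

# Stand-in example on a built-in data set (not the moneyball data).
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
vifs <- manual_vif(fit)
print(round(vifs, 2))

# Names of predictors above the cutoff of 5 used in this report.
print(names(vifs)[vifs > 5])
```

A predictor that the others explain well has an R-squared close to 1, which is what drives its VIF upward.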

Model 2 Results

hist(Model2$residuals,25)

Final_Result_Model2=data.frame('INDEX'=test_df2$INDEX,'TARGET_WINS'=predict(Model2, test_df2))

boxplot(Final_Result_Model2$TARGET_WINS,xlab="Win Predicted" ,horizontal = TRUE)

par(mfrow=c(2,2))
plot(Model2)

Model 3

Model 3 starts from the initial model and removes outliers from the data. After running VIF, I removed TEAM_BATTING_SO because its score was 5.3.

\[Model 3\] \[TARGET.WINS = 13.900+0.047TEAM.BATTING.H+0.087TEAM.BATTING.3B+0.055TEAM.BATTING.HR\] \[+0.012TEAM.BATTING.BB+0.023TEAM.BASERUN.SB-0.079TEAM.BATTING.HBP\] \[+0.002TEAM.PITCHING.SO-0.019TEAM.FIELDING.E-0.118TEAM.FIELDING.DP\]

library(outliers)
DataM3 = Data

# Locate and remove outliers using the outliers package.
# Each pass drops the row(s) holding the value farthest from the mean of one
# column. TEAM_PITCHING_H is listed twice because it is trimmed in two passes.
outlier_cols = c("TEAM_PITCHING_H", "TEAM_BASERUN_SB", "TEAM_PITCHING_HR",
                 "TEAM_PITCHING_SO", "TEAM_BATTING_2B", "TEAM_BATTING_3B",
                 "TEAM_BASERUN_CS", "TEAM_PITCHING_H", "TEAM_BATTING_BB",
                 "TEAM_FIELDING_E")

for (col in outlier_cols) {
  outlier_tf = outlier(DataM3[[col]], logical = TRUE)
  find_outlier = which(outlier_tf == TRUE, arr.ind = TRUE)
  DataM3 = DataM3[-find_outlier, ]
}


par(mfrow=c(2,1))
boxplot(Data)
boxplot(DataM3)
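
Each `outlier()` call above flags only the value farthest from the column mean, so one pass removes one extreme row (or several rows if that value is tied). The base-R sketch below mimics that behaviour on a toy data frame. `farthest_from_mean` and the `toy` data are hypothetical stand-ins written for illustration, not the package's code or the moneyball data.

```r
# Hypothetical stand-in for outlier(x, logical = TRUE): mark the value
# (minimum or maximum) that lies farthest from the mean.
farthest_from_mean <- function(x) {
  m  <- mean(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  lo <- min(x, na.rm = TRUE)
  extreme <- if ((hi - m) < (m - lo)) lo else hi
  x == extreme
}

# Toy data with one obviously extreme value in the last row.
toy <- data.frame(id = 1:6, hits = c(10, 12, 11, 13, 12, 95))

flag    <- farthest_from_mean(toy$hits)
trimmed <- toy[-which(flag), ]

print(toy[flag, ])   # the row that is removed
print(nrow(trimmed)) # rows left after one pass
```

Because every pass changes the mean, the value that counts as "farthest" can change from one pass to the next, which is why the order of the passes matters.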

Model3 =lm(TARGET_WINS~ ., data = DataM3)
summary(Model3)
## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = DataM3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -49.094  -8.604  -0.009   8.285  59.190 
## 
## Coefficients: (1 not defined because of singularities)
##                     Estimate  Std. Error t value             Pr(>|t|)    
## (Intercept)      25.26466176  5.60615741   4.507        0.00000692558 ***
## INDEX            -0.00041488  0.00037553  -1.105              0.26937    
## TEAM_BATTING_H    0.04652882  0.00415318  11.203 < 0.0000000000000002 ***
## TEAM_BATTING_2B  -0.01717854  0.00975531  -1.761              0.07838 .  
## TEAM_BATTING_3B   0.07582740  0.01739391   4.359        0.00001362828 ***
## TEAM_BATTING_HR   0.05951238  0.02916924   2.040              0.04144 *  
## TEAM_BATTING_BB   0.01814164  0.00682743   2.657              0.00794 ** 
## TEAM_BATTING_SO  -0.01241878  0.00303728  -4.089        0.00004488803 ***
## TEAM_BASERUN_SB   0.02861679  0.00451499   6.338        0.00000000028 ***
## TEAM_BASERUN_CS  -0.01781480  0.01615112  -1.103              0.27014    
## TEAM_BATTING_HBP -0.05459037  0.01946714  -2.804              0.00509 ** 
## TEAM_PITCHING_H  -0.00006791  0.00064800  -0.105              0.91655    
## TEAM_PITCHING_HR  0.01462381  0.02619493   0.558              0.57672    
## TEAM_PITCHING_BB -0.00753137  0.00516234  -1.459              0.14473    
## TEAM_PITCHING_SO  1.16147109  2.58819338   0.449              0.65365    
## TEAM_FIELDING_E  -0.02150757  0.00257946  -8.338 < 0.0000000000000002 ***
## TEAM_FIELDING_DP  1.03228575  2.58747211   0.399              0.68996    
## offense                   NA          NA      NA                   NA    
## defense          -1.15475944  2.58819407  -0.446              0.65552    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.01 on 2248 degrees of freedom
## Multiple R-squared:  0.3011, Adjusted R-squared:  0.2958 
## F-statistic: 56.97 on 17 and 2248 DF,  p-value: < 0.00000000000000022
Model3 = update(Model3, .~.-offense-defense-INDEX-TEAM_PITCHING_H-TEAM_PITCHING_HR-TEAM_BASERUN_CS-TEAM_PITCHING_BB-TEAM_BATTING_2B-TEAM_BATTING_SO, data = DataM3)
summary(Model3)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BATTING_HBP + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = DataM3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.151  -8.677   0.131   8.531  54.415 
## 
## Coefficients:
##                    Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)      13.9003834  3.7774726   3.680             0.000239 ***
## TEAM_BATTING_H    0.0466253  0.0025076  18.593 < 0.0000000000000002 ***
## TEAM_BATTING_3B   0.0873478  0.0164563   5.308      0.0000001217709 ***
## TEAM_BATTING_HR   0.0545486  0.0081515   6.692      0.0000000000277 ***
## TEAM_BATTING_BB   0.0115304  0.0033576   3.434             0.000605 ***
## TEAM_BASERUN_SB   0.0226275  0.0040617   5.571      0.0000000283651 ***
## TEAM_BATTING_HBP -0.0792846  0.0179048  -4.428      0.0000099587706 ***
## TEAM_PITCHING_SO  0.0022622  0.0008321   2.719             0.006602 ** 
## TEAM_FIELDING_E  -0.0186940  0.0020260  -9.227 < 0.0000000000000002 ***
## TEAM_FIELDING_DP -0.1176248  0.0128412  -9.160 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.06 on 2256 degrees of freedom
## Multiple R-squared:  0.2932, Adjusted R-squared:  0.2904 
## F-statistic:   104 on 9 and 2256 DF,  p-value: < 0.00000000000000022
print("multi-collinearity test")
## [1] "multi-collinearity test"
vif(Model3)
##   TEAM_BATTING_H  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB 
##         1.708759         2.761793         3.216432         2.184953 
##  TEAM_BASERUN_SB TEAM_BATTING_HBP TEAM_PITCHING_SO  TEAM_FIELDING_E 
##         1.568815         1.219056         1.304564         2.600147 
## TEAM_FIELDING_DP 
##         1.322275
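
The reduction step used for the models in this report relies on `update()` with a `. ~ . - term` formula, which refits the model without the named terms. A small sketch of that pattern on the built-in mtcars data (a stand-in, not the moneyball data) follows. The choice of dropped terms here is arbitrary and only for illustration.

```r
# Stand-in data: the built-in mtcars set, not the moneyball data.
full <- lm(mpg ~ ., data = mtcars)

# ". ~ ." keeps the response and all current terms; "- term" removes one.
reduced <- update(full, . ~ . - cyl - gear - carb)

# Terms present in the full model but absent from the reduced one.
dropped <- setdiff(names(coef(full)), names(coef(reduced)))
print(dropped)

# Compare adjusted R-squared before and after the reduction.
print(c(full    = summary(full)$adj.r.squared,
        reduced = summary(reduced)$adj.r.squared))
```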

Model 3 Results

hist(Model3$residuals,25)

Final_Result_Model3=data.frame('INDEX'=test_df$INDEX,'TARGET_WINS'=predict(Model3, test_df))

boxplot(Final_Result_Model3$TARGET_WINS,xlab="Win Predicted" ,horizontal = TRUE)

par(mfrow=c(2,2))
plot(Model3)
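
The residual histogram and the diagnostic plots give a visual check of normality. As a complement, a numeric check could look like the sketch below, which applies base R's `shapiro.test` to a stand-in model fitted on the built-in mtcars data. It illustrates the idea and is not part of the original analysis.

```r
# Stand-in model on a built-in data set (not the moneyball data).
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
sw <- shapiro.test(res)
print(sw)

# Q-Q plot: points close to the line are consistent with normal residuals.
qqnorm(res)
qqline(res)
```

With a few thousand rows, as in the training data, the test can reject normality for deviations too small to matter, so it is best read alongside the plots.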

SELECT MODELS

Decide on the criteria for selecting the best multiple linear regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.

For the multiple linear regression model, will you use a metric such as Adjusted R2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multi-collinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R2, (c) F-statistic, and (d) residual plots. Make predictions using the evaluation data set.

If I had to select a model, it would be Model 2, which has the highest adjusted R-squared (0.3187) and the lowest RMSE (12.97) and MAE (10.07). Otherwise, all three models predict about the same, with only slight variation. The diagnostic plots for each model can be seen in the model results sections, and the histograms of the residuals look approximately normal. Multi-collinearity came up in Models 2 and 3, so I used the vif function from the car package and removed any variable with a score higher than 5.

library(ggplot2)
# Function that returns Root Mean Squared Error
rmse <- function(error)
{
    sqrt(mean(error^2))
}
 
# Function that returns Mean Absolute Error
mae <- function(error)
{
    mean(abs(error))
}

#rmse(Model3$residuals)
#mae(Model1$residuals)

# par(mfrow) only affects base graphics, so each ggplot density is drawn on its own.
ggplot(Final_Result_Model1, aes(x = TARGET_WINS)) + geom_density() + ggtitle("Model1")

ggplot(Final_Result_Model2, aes(x = TARGET_WINS)) + geom_density() + ggtitle("Model2")

ggplot(Final_Result_Model3, aes(x = TARGET_WINS)) + geom_density() + ggtitle("Model3")

par(mfrow=c(1,3))
boxplot(Final_Result_Model1$TARGET_WINS,xlab="Win Predicted M1" ,horizontal = FALSE)
boxplot(Final_Result_Model2$TARGET_WINS,xlab="Win Predicted M2" ,horizontal = FALSE)
boxplot(Final_Result_Model3$TARGET_WINS,xlab="Win Predicted M3" ,horizontal = FALSE)

Model    Adjusted R-squared   F-statistic   RMSE    MAE
Model1   0.2864               102.4         13.28   10.45
Model2   0.3187               107.4         12.97   10.07
Model3   0.2904               104           13.03   10.23
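
A comparison table like the one above could be assembled directly from the fitted models. The sketch below shows one way, using two stand-in models on the built-in mtcars data and a hypothetical helper `model_metrics`. With the report's objects, the list would instead hold Model1, Model2 and Model3.

```r
# Same error metrics as defined earlier in this report.
rmse <- function(error) sqrt(mean(error^2))
mae  <- function(error) mean(abs(error))

# Hypothetical helper: one row of comparison metrics per fitted lm object.
model_metrics <- function(model) {
  s <- summary(model)
  data.frame(
    adj_r2 = s$adj.r.squared,
    f_stat = unname(s$fstatistic["value"]),
    rmse   = rmse(residuals(model)),
    mae    = mae(residuals(model))
  )
}

# Stand-in models on a built-in data set (not the moneyball data).
models <- list(
  small = lm(mpg ~ wt, data = mtcars),
  large = lm(mpg ~ wt + hp + qsec, data = mtcars)
)

# Stack the one-row data frames into a single comparison table.
comparison <- do.call(rbind, lapply(models, model_metrics))
print(round(comparison, 3))
```

RMSE and MAE computed this way are training-set errors, so a model with more predictors will not look worse on RMSE. Adjusted R-squared is the column that penalizes the extra terms.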

Source