In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006, inclusive. Each record contains the team's performance for the given year, with all statistics adjusted to a 162-game season. Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You may only use the variables given to you (or variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
Deliverables:
A write-up submitted in PDF format. Your write-up should have four sections; each one is described below. You may assume you are addressing me as a fellow data scientist, so you need not shy away from technical details.
Assigned predictions (the number of wins for the team) for the evaluation data set.
Include your R statistical programming code in an Appendix.
Write Up:
Describe the size and the variables in the moneyball training data set. Keep in mind that too much detail will cause a manager to lose interest, while too little detail will make the manager suspect you aren't doing your job. Some suggestions are given below. Please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss. These are just ideas.
Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.
Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Since we have not yet covered automated variable selection methods, you should select the variables manually (unless you previously learned Forward or Stepwise selection, etc.). Since you manually selected variables for inclusion in or exclusion from the model, indicate why this was done.
Discuss the coefficients in the models: do they make sense? For example, if a team hits a lot of Home Runs, it would be reasonable to expect that such a team would win more games. However, if the coefficient is negative (suggesting that the team would lose more games), then that needs to be discussed. Are you keeping the model even though it is counterintuitive? Why? The boss needs to know.
Decide on the criteria for selecting the best multiple linear regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.
For the multiple linear regression model, will you use a metric such as adjusted R^2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multicollinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R^2, (c) F-statistic, and (d) residual plots. Make predictions using the evaluation data set.
Model lmod3 is found to be a decent model for predicting TARGET_WINS, as outlined in this report.
Per guidance in the problem, I've included below both a numerical summary (mean, median, standard deviation) of the variables in the training dataset and histogram plots, since all the relevant variables are numeric. Furthermore, I've identified the variables that have NA's; these are left as-is rather than imputed, since NA is R's native representation for missing values and the affected records can be handled explicitly at modeling time. Additionally, although a few variables have minimum values of 0, these appear to be legitimate values and not ones entered in error. Finally, a correlation plot shows the pairwise correlations (-1 to +1) of the variables in the dataset.
In terms of the size of the training data set, we notice that it has 2,276 records and 17 variables. We start with some numerical summaries. A close look at the minimum and maximum values of each variable is worthwhile. The following variables have minimum values of 0: TARGET_WINS (Number of wins), TEAM_BATTING_3B (Triples by batters (3B)), TEAM_BATTING_HR (Homeruns by batters (4B)), TEAM_BATTING_BB (Walks by batters), TEAM_BATTING_SO (Strikeouts by batters), TEAM_BASERUN_SB (Stolen bases), TEAM_BASERUN_CS (Caught stealing), TEAM_PITCHING_HR (Homeruns allowed), TEAM_PITCHING_BB (Walks allowed) and TEAM_PITCHING_SO (Strikeouts by pitchers). It is not unusual for any of these variables to have 0 values. Similarly, looking at the maximum values doesn't raise immediate alarm.
However, we do notice from the summary that some of the variables have NA's: TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_PITCHING_SO and TEAM_FIELDING_DP. It is not immediately clear from the dataset why these NA's exist (whether they are data-omission errors or true missing values).
Visual plots of the variables indicate that most have a skewed distribution. The approximately normally distributed variables are TARGET_WINS, TEAM_BATTING_2B, TEAM_BATTING_HBP and TEAM_FIELDING_DP.
From the correlation matrix plot, we identify a few variable pairs that are perfectly positively correlated: TEAM_BATTING_H & TEAM_PITCHING_H, TEAM_BATTING_HR & TEAM_PITCHING_HR, TEAM_BATTING_BB & TEAM_PITCHING_BB, and TEAM_BATTING_SO & TEAM_PITCHING_SO. This knowledge helps us shrink the predictive model by including fewer variables. For example, we decide to retain the batting variables and remove the corresponding perfectly correlated pitching variables.
library(corrplot)
## corrplot 0.92 loaded
#Reading in the training and evaluation data files
training <- read.csv("/Users/tponnada/Downloads/moneyball-training-data.csv")
eval <- read.csv("/Users/tponnada/Downloads/moneyball-evaluation-data.csv")
#Checking the first 6 rows of the training data set, the dimensions of the data set and the usual univariate summary information.
head(training)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1 1 39 1445 194 39
## 2 2 70 1339 219 22
## 3 3 86 1377 232 35
## 4 4 70 1387 209 38
## 5 5 82 1297 186 27
## 6 6 75 1279 200 36
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1 13 143 842 NA
## 2 190 685 1075 37
## 3 137 602 917 46
## 4 96 451 922 43
## 5 102 472 920 49
## 6 92 443 973 107
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1 NA NA 9364 84
## 2 28 NA 1347 191
## 3 27 NA 1377 137
## 4 30 NA 1396 97
## 5 39 NA 1297 102
## 6 59 NA 1279 92
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1 927 5456 1011 NA
## 2 689 1082 193 155
## 3 602 917 175 153
## 4 454 928 164 156
## 5 472 920 138 168
## 6 443 973 123 149
dim(training)
## [1] 2276 17
summary(training)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
#Standard deviations of variables
sapply(training[,2:7], sd)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 15.75215 144.59120 46.80141 27.93856 60.54687
## TEAM_BATTING_BB
## 122.67086
sapply(training[,12:14], sd)
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 1406.84293 61.29875 166.35736
#Univariate plots using histograms, kernel density estimates and sorted data plotted against its index for the 17 variables.
#par(mfrow = c(1, 3))
#Number of wins
hist(training$TARGET_WINS, xlab = "Number of Wins", main = "")
plot(density(training$TARGET_WINS, na.rm = TRUE), main = "")
plot(sort(training$TARGET_WINS), ylab = "Sorted Wins")
#Base Hits by batters (1B,2B,3B,HR)
hist(training$TEAM_BATTING_H, xlab = "Base Hits by batters", main = "")
plot(density(training$TEAM_BATTING_H, na.rm = TRUE), main = "")
plot(sort(training$TEAM_BATTING_H), ylab = "Sorted Base Hits")
#Doubles by batters (2B)
hist(training$TEAM_BATTING_2B, xlab = "Doubles by batters (2B)", main = "")
plot(density(training$TEAM_BATTING_2B, na.rm = TRUE), main = "")
plot(sort(training$TEAM_BATTING_2B), ylab = "Sorted Doubles")
#Triples by batters (3B)
hist(training$TEAM_BATTING_3B, xlab = "Triples by batters (3B)", main = "")
plot(density(training$TEAM_BATTING_3B, na.rm = TRUE), main = "")
plot(sort(training$TEAM_BATTING_3B), ylab = "Sorted Triples")
#Homeruns by batters (4B)
hist(training$TEAM_BATTING_HR, xlab = "Homeruns by batters (4B)", main = "")
plot(density(training$TEAM_BATTING_HR, na.rm = TRUE), main = "")
plot(sort(training$TEAM_BATTING_HR), ylab = "Sorted Homeruns")
#Walks by batters
hist(training$TEAM_BATTING_BB, xlab = "Walks by batters", main = "")
plot(density(training$TEAM_BATTING_BB, na.rm = TRUE), main = "")
plot(sort(training$TEAM_BATTING_BB), ylab = "Sorted Walks")
#Strikeouts by batters
hist(training$TEAM_BATTING_SO, xlab = "Strikeouts by batters", main = "")
plot(density(training$TEAM_BATTING_SO, na.rm = TRUE), main = "")
plot(sort(training$TEAM_BATTING_SO), ylab = "Sorted Strikeouts")
#Stolen bases
hist(training$TEAM_BASERUN_SB, xlab = "Stolen bases", main = "")
plot(density(training$TEAM_BASERUN_SB, na.rm = TRUE), main = "")
plot(sort(training$TEAM_BASERUN_SB), ylab = "Sorted Stolen Bases")
#Caught stealing
hist(training$TEAM_BASERUN_CS, xlab = "Caught stealing", main = "")
plot(density(training$TEAM_BASERUN_CS, na.rm = TRUE), main = "")
plot(sort(training$TEAM_BASERUN_CS), ylab = "Sorted Caught Stealing")
#Batters hit by pitch (get a free base)
hist(training$TEAM_BATTING_HBP, xlab = "Batters hit by pitch (get a free base)", main = "")
plot(density(training$TEAM_BATTING_HBP, na.rm = TRUE), main = "")
plot(sort(training$TEAM_BATTING_HBP), ylab = "Sorted Hit-by-Pitch")
#Hits allowed
hist(training$TEAM_PITCHING_H, xlab = "Hits allowed", main = "")
plot(density(training$TEAM_PITCHING_H, na.rm = TRUE), main = "")
plot(sort(training$TEAM_PITCHING_H), ylab = "Sorted Hits Allowed")
#Homeruns allowed
hist(training$TEAM_PITCHING_HR, xlab = "Homeruns allowed", main = "")
plot(density(training$TEAM_PITCHING_HR, na.rm = TRUE), main = "")
plot(sort(training$TEAM_PITCHING_HR), ylab = "Sorted Homeruns Allowed")
#Walks allowed
hist(training$TEAM_PITCHING_BB, xlab = "Walks allowed", main = "")
plot(density(training$TEAM_PITCHING_BB, na.rm = TRUE), main = "")
plot(sort(training$TEAM_PITCHING_BB), ylab = "Sorted Walks Allowed")
#Strikeouts by pitchers
hist(training$TEAM_PITCHING_SO, xlab = "Strikeouts by pitchers", main = "")
plot(density(training$TEAM_PITCHING_SO, na.rm = TRUE), main = "")
plot(sort(training$TEAM_PITCHING_SO), ylab = "Sorted Pitcher Strikeouts")
#Errors
hist(training$TEAM_FIELDING_E, xlab = "Errors", main = "")
plot(density(training$TEAM_FIELDING_E, na.rm = TRUE), main = "")
plot(sort(training$TEAM_FIELDING_E), ylab = "Sorted Errors")
#Double Plays
hist(training$TEAM_FIELDING_DP, xlab = "Double Plays", main = "")
plot(density(training$TEAM_FIELDING_DP, na.rm = TRUE), main = "")
plot(sort(training$TEAM_FIELDING_DP), ylab = "Sorted Double Plays")
#Instead of using scatterplots for each of the 17 variables against each other, I used the correlation matrix.
M = cor(training, use = "na.or.complete")
corrplot(M, method = 'number', type = 'lower', diag = FALSE, number.cex = 0.5, tl.cex = 0.5, cl.cex = 0.5)
As identified earlier, some variables in the training dataset have missing (NA) values. Since we decided to eliminate the perfectly correlated pitching variables, the list of variables with missing values reduces to the one below; the number of NA's reported for each (from the summary command in the data exploration section above) is given in parentheses. A quick look indicates TEAM_BASERUN_CS and TEAM_BATTING_HBP have the highest counts of NA values and hence the highest proportion of missing values. Since we know the training dataset totals 2,276 observations, we can calculate the proportion of records that are not populated. From the problem, TEAM_BATTING_SO and TEAM_BASERUN_CS are expected to have a negative impact on wins, while the other variables below are expected to have a positive impact on wins.
TEAM_BATTING_SO (102): 4.48% of observations not populated
TEAM_BASERUN_SB (131): 5.8% of observations not populated
TEAM_BASERUN_CS (772): 33.9% of observations not populated
TEAM_BATTING_HBP (2085): 91.6% of observations not populated
TEAM_FIELDING_DP (286): 12.6% of observations not populated
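These counts and percentages can be reproduced directly from the training data; a minimal sketch:

#Count of NA's per variable, expressed as a percentage of the 2,276 records
na_counts <- colSums(is.na(training))
round(100 * na_counts[na_counts > 0] / nrow(training), 1)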
Also, from the correlation plot in the exploration section above, we use an absolute correlation threshold of 0.3 (i.e., r >= 0.3 or r <= -0.3) with TARGET_WINS to identify the variables worth including in the model. This screen removes the need to transform variables that have minimal predictive value. We also identified in the data exploration section that some variables are perfectly correlated, and hence redundant to include, and decided to eliminate the following:
TEAM_PITCHING_H, TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO
The list of explanatory variables thus reduces to the following:
TEAM_BATTING_H (Correlation: 0.47)
TEAM_BATTING_2B (Correlation: 0.31)
TEAM_BATTING_HR (Correlation: 0.42)
TEAM_BATTING_BB (Correlation: 0.47)
TEAM_FIELDING_E (Correlation: -0.39)
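This screen can be reproduced from the correlation matrix M computed in the exploration section; a short sketch (M was built with use = "na.or.complete", so the values match the plot above):

#Correlation of each variable with TARGET_WINS, keeping those with |r| >= 0.3
r <- M["TARGET_WINS", ]
r[abs(r) >= 0.3 & names(r) != "TARGET_WINS"]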
None of these remaining predictors have NA's, but we use scatterplots to check for outliers or other problems that would necessitate a transformation. We see from the plots that TEAM_BATTING_HR, TEAM_BATTING_BB and TEAM_FIELDING_E have wide scales and outliers.
We use the boxcox function from the MASS package to check first whether the response (TARGET_WINS) needs a transformation. The confidence interval for lambda does not include 1 (which would indicate the need for a transformation), and the suggested power is about 1.31. Since rounding to the nearest interpretable value gives lambda = 1, a transformation of the response variable might not be worthwhile in this case.
For the independent variables, we can generalize by adding polynomial terms and determining the order of each variable that can be used in a potential predictive model.
require(MASS)
## Loading required package: MASS
plot(TARGET_WINS ~ TEAM_BATTING_H, training)
plot(TARGET_WINS ~ TEAM_BATTING_2B, training)
plot(TARGET_WINS ~ TEAM_BATTING_HR, training)
plot(TARGET_WINS ~ TEAM_BATTING_BB, training)
plot(TARGET_WINS ~ TEAM_FIELDING_E, training)
#The Box-Cox transformation only works for positive values of the response variable. One record has TARGET_WINS equal to 0, so we set it to NA and then apply the Box-Cox method. The estimated lambda is about 1.31 and the confidence interval excludes 1, but rounding to the nearest interpretable power (lambda = 1) suggests keeping the response on its original scale.
training$TARGET_WINS[training$TARGET_WINS == 0] <- NA
lmod <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, training)
boxcox(lmod, plotit = T)
boxcox(lmod, plotit = T, lambda = seq(0, 3.0, by = 0.1))
bc <- boxcox(lmod)
str(bc)
## List of 2
## $ x: num [1:100] -2 -1.96 -1.92 -1.88 -1.84 ...
## $ y: num [1:100] -7764 -7657 -7553 -7450 -7349 ...
bc.power <- bc$x[which.max(bc$y)]; bc.power
## [1] 1.313131
#Determining the order of the polynomial for TEAM_BATTING_H. The p-value of TEAM_BATTING_H is significant, so we move on to a quadratic and then a cubic term. The p-value of TEAM_BATTING_H^4 is not significant, so we stick with the cubic term.
summary(lm(TARGET_WINS ~ TEAM_BATTING_H, training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.386 -8.805 0.862 9.779 45.945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.600519 3.109218 6.304 3.47e-10 ***
## TEAM_BATTING_H 0.041664 0.002106 19.786 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.47 on 2273 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1469, Adjusted R-squared: 0.1466
## F-statistic: 391.5 on 1 and 2273 DF, p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_H + I(TEAM_BATTING_H^2), training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + I(TEAM_BATTING_H^2),
## data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.547 -8.620 0.714 9.617 45.924
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.005e+01 1.428e+01 -2.804 0.00509 **
## TEAM_BATTING_H 1.179e-01 1.795e-02 6.570 6.24e-11 ***
## I(TEAM_BATTING_H^2) -2.405e-05 5.622e-06 -4.278 1.96e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.42 on 2272 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1537, Adjusted R-squared: 0.153
## F-statistic: 206.4 on 2 and 2272 DF, p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_H + I(TEAM_BATTING_H^2) + I(TEAM_BATTING_H^3), training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + I(TEAM_BATTING_H^2) +
## I(TEAM_BATTING_H^3), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.553 -8.551 0.700 9.708 45.945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.793e+02 7.023e+01 -5.401 7.31e-08 ***
## TEAM_BATTING_H 7.422e-01 1.278e-01 5.807 7.24e-09 ***
## I(TEAM_BATTING_H^2) -3.989e-04 7.619e-05 -5.235 1.80e-07 ***
## I(TEAM_BATTING_H^3) 7.315e-08 1.483e-08 4.933 8.69e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.34 on 2271 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1627, Adjusted R-squared: 0.1616
## F-statistic: 147.1 on 3 and 2271 DF, p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_H + I(TEAM_BATTING_H^2) + I(TEAM_BATTING_H^3) + I(TEAM_BATTING_H^4) , training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + I(TEAM_BATTING_H^2) +
## I(TEAM_BATTING_H^3) + I(TEAM_BATTING_H^4), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.644 -8.608 0.771 9.720 46.681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.902e+02 2.882e+02 -2.742 0.00616 **
## TEAM_BATTING_H 1.753e+00 6.995e-01 2.506 0.01228 *
## I(TEAM_BATTING_H^2) -1.316e-03 6.284e-04 -2.094 0.03639 *
## I(TEAM_BATTING_H^3) 4.359e-07 2.472e-07 1.763 0.07802 .
## I(TEAM_BATTING_H^4) -5.271e-11 3.586e-11 -1.470 0.14174
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.34 on 2270 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1635, Adjusted R-squared: 0.162
## F-statistic: 110.9 on 4 and 2270 DF, p-value: < 2.2e-16
#Determining the order of the polynomial for TEAM_BATTING_2B. The p-value of TEAM_BATTING_2B is significant, so we move on to a quadratic and then a cubic term. The quartic term is only marginally significant (p = 0.046) and barely improves the fit (R^2 0.0992 to 0.1007), so we stick with the cubic term.
summary(lm(TARGET_WINS ~ TEAM_BATTING_2B, training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.013 -9.571 0.585 10.093 57.442
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.710776 1.654890 34.87 <2e-16 ***
## TEAM_BATTING_2B 0.095799 0.006733 14.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.01 on 2273 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.08178, Adjusted R-squared: 0.08137
## F-statistic: 202.4 on 1 and 2273 DF, p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_2B + I(TEAM_BATTING_2B^2), training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B + I(TEAM_BATTING_2B^2),
## data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.245 -9.670 0.499 10.095 58.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.8756779 6.2279788 6.242 5.14e-10 ***
## TEAM_BATTING_2B 0.2544659 0.0510304 4.987 6.61e-07 ***
## I(TEAM_BATTING_2B^2) -0.0003220 0.0001027 -3.137 0.00173 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.98 on 2272 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.08573, Adjusted R-squared: 0.08493
## F-statistic: 106.5 on 2 and 2272 DF, p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_2B + I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_2B^3), training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B + I(TEAM_BATTING_2B^2) +
## I(TEAM_BATTING_2B^3), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.943 -9.657 0.743 9.950 59.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.419e+01 1.715e+01 -3.159 0.0016 **
## TEAM_BATTING_2B 1.444e+00 2.107e-01 6.854 9.22e-12 ***
## I(TEAM_BATTING_2B^2) -5.203e-03 8.453e-04 -6.155 8.83e-10 ***
## I(TEAM_BATTING_2B^3) 6.446e-06 1.108e-06 5.817 6.84e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.88 on 2271 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.09916, Adjusted R-squared: 0.09797
## F-statistic: 83.32 on 3 and 2271 DF, p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_2B + I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_2B^4) , training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B + I(TEAM_BATTING_2B^2) +
## I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_2B^4), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.83 -9.47 0.76 10.01 58.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.235e+02 3.871e+01 -3.191 0.001438 **
## TEAM_BATTING_2B 2.649e+00 6.388e-01 4.146 3.5e-05 ***
## I(TEAM_BATTING_2B^2) -1.272e-02 3.856e-03 -3.298 0.000987 ***
## I(TEAM_BATTING_2B^3) 2.644e-05 1.007e-05 2.625 0.008713 **
## I(TEAM_BATTING_2B^4) -1.919e-08 9.609e-09 -1.997 0.045896 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.87 on 2270 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.1007, Adjusted R-squared: 0.09915
## F-statistic: 63.57 on 4 and 2270 DF, p-value: < 2.2e-16
#Determining the order of the polynomial for TEAM_BATTING_HR. The p-value of TEAM_BATTING_HR is significant, so we move on to a quadratic term. The p-value of TEAM_BATTING_HR^2 is not significant, so we stick with the linear term.
summary(lm(TARGET_WINS ~ TEAM_BATTING_HR, training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_HR, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.350 -9.938 0.544 10.163 68.347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.350142 0.623296 122.494 <2e-16 ***
## TEAM_BATTING_HR 0.044917 0.005346 8.402 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.43 on 2273 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.03012, Adjusted R-squared: 0.02969
## F-statistic: 70.59 on 1 and 2273 DF, p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_HR + I(TEAM_BATTING_HR^2), training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_HR + I(TEAM_BATTING_HR^2),
## data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -65.204 -9.891 0.601 10.144 68.102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.720e+01 9.017e-01 85.619 <2e-16 ***
## TEAM_BATTING_HR 2.060e-02 1.932e-02 1.067 0.286
## I(TEAM_BATTING_HR^2) 1.155e-04 8.815e-05 1.310 0.190
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.43 on 2272 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.03085, Adjusted R-squared: 0.03
## F-statistic: 36.16 on 2 and 2272 DF, p-value: 3.46e-16
#Determining the order of the polynomial for TEAM_BATTING_BB. The p-value of TEAM_BATTING_BB is significant, so we move on to a quadratic term, which is also significant. In the cubic fit the cubic term is nominally significant (p = 0.013) but the lower-order terms lose significance and the fit barely improves, so we stick with the quadratic term.
summary(lm(TARGET_WINS ~ TEAM_BATTING_BB, training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_BB, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.676 -9.745 0.541 9.798 77.822
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.329357 1.352298 49.05 <2e-16 ***
## TEAM_BATTING_BB 0.028891 0.002618 11.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.26 on 2273 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.05084, Adjusted R-squared: 0.05042
## F-statistic: 121.7 on 1 and 2273 DF, p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_BB + I(TEAM_BATTING_BB^2), training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_BB + I(TEAM_BATTING_BB^2),
## data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.399 -9.480 0.547 9.802 71.317
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.561e+01 2.504e+00 30.191 < 2e-16 ***
## TEAM_BATTING_BB -1.781e-02 1.094e-02 -1.628 0.104
## I(TEAM_BATTING_BB^2) 5.309e-05 1.208e-05 4.394 1.16e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.2 on 2272 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.05883, Adjusted R-squared: 0.05801
## F-statistic: 71.01 on 2 and 2272 DF, p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_BB + I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_BB^3), training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_BB + I(TEAM_BATTING_BB^2) +
## I(TEAM_BATTING_BB^3), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.177 -9.414 0.478 9.982 74.510
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.855e+01 3.782e+00 18.122 <2e-16 ***
## TEAM_BATTING_BB 5.419e-02 3.094e-02 1.752 0.0799 .
## I(TEAM_BATTING_BB^2) -1.376e-04 7.759e-05 -1.774 0.0762 .
## I(TEAM_BATTING_BB^3) 1.483e-07 5.960e-08 2.488 0.0129 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.19 on 2271 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.06139, Adjusted R-squared: 0.06015
## F-statistic: 49.51 on 3 and 2271 DF, p-value: < 2.2e-16
#Determining the order of the polynomial for TEAM_FIELDING_E. The p-value of TEAM_FIELDING_E is significant, so we move on to a quadratic term. The p-value of TEAM_FIELDING_E^2 is also significant, so we move on to higher-order terms and find that the p-value for the 5th-order term is not significant. Hence we stop at the 4th-order term.
summary(lm(TARGET_WINS ~ TEAM_FIELDING_E, training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_E, data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -62.01 -10.06 0.70 10.33 73.17
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.613131 0.479768 174.278 < 2e-16 ***
## TEAM_FIELDING_E -0.011339 0.001439 -7.878 5.13e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.46 on 2273 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.02658, Adjusted R-squared: 0.02615
## F-statistic: 62.06 on 1 and 2273 DF, p-value: 5.126e-15
summary(lm(TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2), training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2),
## data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -63.584 -9.738 0.599 10.501 72.815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.048e+01 7.357e-01 109.399 < 2e-16 ***
## TEAM_FIELDING_E 9.616e-03 4.015e-03 2.395 0.0167 *
## I(TEAM_FIELDING_E^2) -1.818e-05 3.255e-06 -5.586 2.61e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.36 on 2272 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.03976, Adjusted R-squared: 0.03892
## F-statistic: 47.04 on 2 and 2272 DF, p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2) + I(TEAM_FIELDING_E^3), training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2) +
## I(TEAM_FIELDING_E^3), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.853 -9.714 0.657 10.246 66.964
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.636e+01 1.197e+00 72.173 < 2e-16 ***
## TEAM_FIELDING_E -4.272e-02 9.330e-03 -4.578 4.95e-06 ***
## I(TEAM_FIELDING_E^2) 7.833e-05 1.589e-05 4.929 8.88e-07 ***
## I(TEAM_FIELDING_E^3) -4.366e-08 7.040e-09 -6.202 6.61e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.23 on 2271 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.05575, Adjusted R-squared: 0.05451
## F-statistic: 44.7 on 3 and 2271 DF, p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2) + I(TEAM_FIELDING_E^3) + I(TEAM_FIELDING_E^4), training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2) +
## I(TEAM_FIELDING_E^3) + I(TEAM_FIELDING_E^4), data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.698 -9.766 0.503 10.115 66.298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.173e+01 1.834e+00 50.016 < 2e-16 ***
## TEAM_FIELDING_E -1.020e-01 1.797e-02 -5.674 1.57e-08 ***
## I(TEAM_FIELDING_E^2) 2.511e-04 4.756e-05 5.281 1.41e-07 ***
## I(TEAM_FIELDING_E^3) -2.162e-07 4.533e-08 -4.770 1.96e-06 ***
## I(TEAM_FIELDING_E^4) 5.353e-11 1.389e-11 3.854 0.00012 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.19 on 2270 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.06189, Adjusted R-squared: 0.06024
## F-statistic: 37.44 on 4 and 2270 DF, p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2) + I(TEAM_FIELDING_E^3) + I(TEAM_FIELDING_E^4) + I(TEAM_FIELDING_E^5), training))
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2) +
## I(TEAM_FIELDING_E^3) + I(TEAM_FIELDING_E^4) + I(TEAM_FIELDING_E^5),
## data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.076 -9.813 0.444 10.136 67.429
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.414e+01 2.783e+00 33.831 < 2e-16 ***
## TEAM_FIELDING_E -1.345e-01 3.349e-02 -4.014 6.15e-05 ***
## I(TEAM_FIELDING_E^2) 3.850e-04 1.258e-04 3.061 0.00223 **
## I(TEAM_FIELDING_E^3) -4.331e-07 1.940e-07 -2.233 0.02568 *
## I(TEAM_FIELDING_E^4) 1.996e-10 1.278e-10 1.562 0.11843
## I(TEAM_FIELDING_E^5) -3.422e-14 2.976e-14 -1.150 0.25034
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.18 on 2269 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.06244, Adjusted R-squared: 0.06037
## F-statistic: 30.22 on 5 and 2269 DF, p-value: < 2.2e-16
As a baseline, I start by building a multiple linear regression model that includes all of the variables in the dataset except the index variable. The summary shows that this baseline model, lmodraw, has only two significant variables with small p-values (TEAM_FIELDING_E, TEAM_FIELDING_DP). Furthermore, the adjusted R^2 indicates that the model explains only 51% of the variation in the response. The ANOVA test rejects the null hypothesis that no predictor in the full model is significant; at least one of the predictors matters.
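The code that produced this baseline fit is not shown alongside the output below; a minimal sketch of what is assumed, with datanona taken to be the complete-case rows of the training data (191 of the 2,276 records, driven mostly by the largely missing TEAM_BATTING_HBP):

#Complete-case training data (assumed construction)
datanona <- na.omit(training)
#Model 1: baseline with every predictor except the row index
lmodraw <- lm(TARGET_WINS ~ . - INDEX, data = datanona)

Written out, the baseline fit is: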
TARGET_WINS = 60.28826 + (1.91348 * TEAM_BATTING_H) + (0.02639 * TEAM_BATTING_2B) + (-0.10118 * TEAM_BATTING_3B) + (-4.84371 * TEAM_BATTING_HR) + (-4.45969 * TEAM_BATTING_BB) + (0.34196 * TEAM_BATTING_SO) + (0.03304 * TEAM_BASERUN_SB) + (-0.01104 * TEAM_BASERUN_CS) + (0.08247 * TEAM_BATTING_HBP) + (-1.89096 * TEAM_PITCHING_H) + (4.93043 * TEAM_PITCHING_HR) + (4.51089 * TEAM_PITCHING_BB) + (-0.37364 * TEAM_PITCHING_SO) + (-0.17204 * TEAM_FIELDING_E) + (-0.10819 * TEAM_FIELDING_DP)
A quick look at the negative coefficients for some of the variables shows they do not make intuitive sense: for example, one would expect triples by batters (TEAM_BATTING_3B) and homeruns by batters (TEAM_BATTING_HR) to increase the response (TARGET_WINS), not decrease it. This is yet another indication that the full model is not the right model for predicting the number of wins.
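These sign flips are symptomatic of the near-perfect collinearity between the batting and pitching counterparts flagged in the exploration section. Variance inflation factors make this concrete; a small sketch using the faraway package (also loaded for the diagnostics later) and the lmodraw fit above:

#VIFs for the full model; the perfectly correlated batting/pitching pairs
#should show up with extremely large values
require(faraway)
vif(model.matrix(lmodraw)[, -1])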
We need to cull the full model. To select individual predictors I could again use ANOVA tests, as for Model 1, but since we did some initial analysis using scatterplots in the data preparation stage and decided to eliminate the duplicated variables identified in the correlation plot of the data exploration stage, we move forward with the 5 explanatory variables previously identified. The summary of lmod2 shows that the p-values for all but one of the variables (TEAM_BATTING_2B) are very small. The adjusted R^2, however, is lower than that of the full model: only 45% of the variation in the response is explained. The ANOVA test again rejects the null hypothesis that no predictor in the reduced model is significant.
The model can be written in mathematical form as:
TARGET_WINS = 3.4790727 + (0.0412796 * TEAM_BATTING_H) + (0.0009869 * TEAM_BATTING_2B) + (0.0680351 * TEAM_BATTING_HR) + (0.0502670 * TEAM_BATTING_BB) + (-0.2177225 * TEAM_FIELDING_E)
A quick look at the coefficients in the reduced model makes intuitive sense: only errors detract from total wins, while base hits, doubles by batters, homeruns and walks by batters all contribute positively to target wins.
For model 3, I decided to eliminate the TEAM_BATTING_2B variable, which has a large p-value in model 2 above. The adjusted R-squared for this new model (0.449) is higher than for model 2 (0.446) but still less than for the baseline model that included all variables (0.5116). The p-values for all of the variables are small, indicating significance, and the F-statistic has increased from 31.6 (model 2) to 39.7 (model 3).
We could also use one of the transformed explanatory variables to explain TARGET_WINS; for example, a polynomial model using terms up to the third order of TEAM_BATTING_H.
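The reduced models, and the illustrative polynomial variant just mentioned, can be fit along the same lines as the baseline (a sketch, not the verbatim original code; the name lmodpoly is hypothetical):

#Model 2: the five predictors retained after the correlation screen
lmod2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_HR +
              TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona)
#Model 3: model 2 without the insignificant TEAM_BATTING_2B term
lmod3 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
              TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona)
#Illustrative polynomial variant: cubic in TEAM_BATTING_H, per the order
#search in the data preparation section
lmodpoly <- lm(TARGET_WINS ~ poly(TEAM_BATTING_H, 3), data = datanona)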
##
## Call:
## lm(formula = TARGET_WINS ~ . - INDEX, data = datanona)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8708 -5.6564 -0.0599 5.2545 22.9274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.28826 19.67842 3.064 0.00253 **
## TEAM_BATTING_H 1.91348 2.76139 0.693 0.48927
## TEAM_BATTING_2B 0.02639 0.03029 0.871 0.38484
## TEAM_BATTING_3B -0.10118 0.07751 -1.305 0.19348
## TEAM_BATTING_HR -4.84371 10.50851 -0.461 0.64542
## TEAM_BATTING_BB -4.45969 3.63624 -1.226 0.22167
## TEAM_BATTING_SO 0.34196 2.59876 0.132 0.89546
## TEAM_BASERUN_SB 0.03304 0.02867 1.152 0.25071
## TEAM_BASERUN_CS -0.01104 0.07143 -0.155 0.87730
## TEAM_BATTING_HBP 0.08247 0.04960 1.663 0.09815 .
## TEAM_PITCHING_H -1.89096 2.76095 -0.685 0.49432
## TEAM_PITCHING_HR 4.93043 10.50664 0.469 0.63946
## TEAM_PITCHING_BB 4.51089 3.63372 1.241 0.21612
## TEAM_PITCHING_SO -0.37364 2.59705 -0.144 0.88577
## TEAM_FIELDING_E -0.17204 0.04140 -4.155 5.08e-05 ***
## TEAM_FIELDING_DP -0.10819 0.03654 -2.961 0.00349 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.467 on 175 degrees of freedom
## Multiple R-squared: 0.5501, Adjusted R-squared: 0.5116
## F-statistic: 14.27 on 15 and 175 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: TARGET_WINS ~ 1
## Model 2: TARGET_WINS ~ (INDEX + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_BASERUN_CS + TEAM_BATTING_HBP + TEAM_PITCHING_H + TEAM_PITCHING_HR +
## TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP) -
## INDEX
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 190 27887
## 2 175 12546 15 15341 14.266 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.922 -6.452 -0.105 5.984 23.459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4790727 15.6213041 0.223 0.824004
## TEAM_BATTING_H 0.0412796 0.0112249 3.678 0.000309 ***
## TEAM_BATTING_2B 0.0009869 0.0302629 0.033 0.974019
## TEAM_BATTING_HR 0.0680351 0.0245324 2.773 0.006118 **
## TEAM_BATTING_BB 0.0502670 0.0099137 5.070 9.59e-07 ***
## TEAM_FIELDING_E -0.2177225 0.0412773 -5.275 3.69e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.017 on 185 degrees of freedom
## Multiple R-squared: 0.4606, Adjusted R-squared: 0.446
## F-statistic: 31.6 on 5 and 185 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: TARGET_WINS ~ 1
## Model 2: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_HR +
## TEAM_BATTING_BB + TEAM_FIELDING_E
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 190 27887
## 2 185 15042 5 12845 31.597 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
## TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.9006 -6.4318 -0.1027 5.9876 23.4547
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.496754 15.569914 0.225 0.82255
## TEAM_BATTING_H 0.041461 0.009730 4.261 3.23e-05 ***
## TEAM_BATTING_HR 0.068036 0.024466 2.781 0.00598 **
## TEAM_BATTING_BB 0.050298 0.009843 5.110 7.95e-07 ***
## TEAM_FIELDING_E -0.217805 0.041089 -5.301 3.24e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.993 on 186 degrees of freedom
## Multiple R-squared: 0.4606, Adjusted R-squared: 0.449
## F-statistic: 39.71 on 4 and 186 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Model 1: TARGET_WINS ~ 1
## Model 2: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB +
## TEAM_FIELDING_E
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 190 27887
## 2 186 15042 4 12845 39.709 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We run a few diagnostics to select among the three models designed above.
The residual vs fitted plots for the baseline full model and the two reduced models (models 2 & 3) show no obvious pattern, so the errors appear homoscedastic in all three cases.
Next, we test the residuals for normality using Q-Q plots. The residuals for the full as well as the reduced models follow the line approximately, so they look normal; the slight skew in the Q-Q plot of the full model disappears in the reduced models. The formal Shapiro-Wilk test for all three models also gives large p-values, so we cannot reject the null hypothesis that the residuals are normal.
Two index values 116 and 122 are identified as leverage points in the full model. It is interesting to note that the leverage points change between the full and reduced models.
We identify influential points in both the full as well as in the reduced models but removal of the largest of the influential points in each model doesn’t change the summary stats meaningfully.
Based on the diagnostic plots and the summary analysis, I decide to use model 3 refit without its largest influential point (lmodr2 below) as my predictive model. It has an F-statistic of 42.8 and an adjusted R-squared of about 0.47, i.e., it explains roughly 47% of the variance in the response, and its residuals are normally distributed, uncorrelated and of equal variance (homoscedastic). The coefficients are quite similar in both versions of model 3.
This model can be written as:
TARGET_WINS = -1.607690 + (0.043919 * TEAM_BATTING_H) + (0.070051 * TEAM_BATTING_HR) + (0.052495 * TEAM_BATTING_BB) + (-0.219777 * TEAM_FIELDING_E)
The original model 3, with the influential point included, is:
TARGET_WINS = 3.496754 + (0.041461 * TEAM_BATTING_H) + (0.068036 * TEAM_BATTING_HR) + (0.050298 * TEAM_BATTING_BB) + (-0.217805 * TEAM_FIELDING_E)
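For reference, the headline fit statistics of the three candidates can be pulled side by side in one call; a small sketch using the model objects defined above:

#Adjusted R-squared and residual standard error for each candidate model
sapply(list(full = lmodraw, model2 = lmod2, model3 = lmod3),
       function(m) c(adj.r.sq = summary(m)$adj.r.squared,
                     sigma = summary(m)$sigma))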
require(faraway)
## Loading required package: faraway
require(tidyverse)
## Loading required package: tidyverse
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x dplyr::select() masks MASS::select()
# Model 1:
par(mfrow = c(2, 2))
plot(lmodraw)
#Formal test for normality
shapiro.test(residuals(lmodraw))
##
## Shapiro-Wilk normality test
##
## data: residuals(lmodraw)
## W = 0.99435, p-value = 0.6864
#Plot of successive pairs of residuals to check for serial correlation
n1 <- length(residuals(lmodraw))
plot(tail(residuals(lmodraw), n1-1) ~ head(residuals(lmodraw), n1-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))
# Model 2:
par(mfrow = c(2, 2))
plot(lmod2)
#Formal test for normality
shapiro.test(residuals(lmod2))
##
## Shapiro-Wilk normality test
##
## data: residuals(lmod2)
## W = 0.99638, p-value = 0.9342
#Plot of successive pairs of residuals to check for serial correlation
n2 <- length(residuals(lmod2))
plot(tail(residuals(lmod2), n2-1) ~ head(residuals(lmod2), n2-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))
# Model 3:
par(mfrow = c(2, 2))
plot(lmod3)
#Formal test for normality
shapiro.test(residuals(lmod3))
##
## Shapiro-Wilk normality test
##
## data: residuals(lmod3)
## W = 0.9964, p-value = 0.9356
#Plot of successive pairs of residuals to check for serial correlation
n3 <- length(residuals(lmod3))
plot(tail(residuals(lmod3), n3-1) ~ head(residuals(lmod3), n3-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))
par(mfrow = c(2, 2))
#Check for leverage points using half-normal plots
hatv <- hatvalues(lmodraw)
sum(hatv)
## [1] 16
index <- row.names(training)
halfnorm(hatv, labs = index, ylab = "Leverages")
hatvr1 <- hatvalues(lmod2)
sum(hatvr1)
## [1] 6
index <- row.names(training)
halfnorm(hatvr1, labs = index, ylab = "Leverages")
hatvr2 <- hatvalues(lmod3)
sum(hatvr2)
## [1] 5
index <- row.names(training)
halfnorm(hatvr2, labs = index, ylab = "Leverages")
## Identify influential points using Cook's distance
cookf <- cooks.distance(lmodraw)
halfnorm(cookf, 3, labs = index, ylab = "Cook's distances")
## Eliminating the largest influential point and re-running the full model
lmodfi <- lm(TARGET_WINS ~ . - INDEX, datanona, subset = (cookf < max(cookf)))
summary(lmodfi)
##
## Call:
## lm(formula = TARGET_WINS ~ . - INDEX, data = datanona, subset = (cookf <
## max(cookf)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8755 -5.9492 0.0424 4.9818 23.2286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.39607 20.30034 3.271 0.00129 **
## TEAM_BATTING_H -1.68340 4.07041 -0.414 0.67970
## TEAM_BATTING_2B 0.02531 0.03026 0.836 0.40413
## TEAM_BATTING_3B -0.09145 0.07783 -1.175 0.24163
## TEAM_BATTING_HR 42.03680 40.40576 1.040 0.29961
## TEAM_BATTING_BB -3.54049 3.71135 -0.954 0.34143
## TEAM_BATTING_SO -2.33806 3.42228 -0.683 0.49540
## TEAM_BASERUN_SB 0.03221 0.02865 1.124 0.26239
## TEAM_BASERUN_CS -0.01912 0.07166 -0.267 0.78995
## TEAM_BATTING_HBP 0.08482 0.04958 1.711 0.08887 .
## TEAM_PITCHING_H 1.70262 4.06809 0.419 0.67608
## TEAM_PITCHING_HR -41.95066 40.40574 -1.038 0.30060
## TEAM_PITCHING_BB 3.59222 3.70880 0.969 0.33410
## TEAM_PITCHING_SO 2.30442 3.41993 0.674 0.50132
## TEAM_FIELDING_E -0.16932 0.04141 -4.089 6.62e-05 ***
## TEAM_FIELDING_DP -0.10378 0.03668 -2.829 0.00521 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.456 on 174 degrees of freedom
## Multiple R-squared: 0.5501, Adjusted R-squared: 0.5114
## F-statistic: 14.19 on 15 and 174 DF, p-value: < 2.2e-16
summary(lmodraw)
##
## Call:
## lm(formula = TARGET_WINS ~ . - INDEX, data = datanona)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.8708 -5.6564 -0.0599 5.2545 22.9274
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.28826 19.67842 3.064 0.00253 **
## TEAM_BATTING_H 1.91348 2.76139 0.693 0.48927
## TEAM_BATTING_2B 0.02639 0.03029 0.871 0.38484
## TEAM_BATTING_3B -0.10118 0.07751 -1.305 0.19348
## TEAM_BATTING_HR -4.84371 10.50851 -0.461 0.64542
## TEAM_BATTING_BB -4.45969 3.63624 -1.226 0.22167
## TEAM_BATTING_SO 0.34196 2.59876 0.132 0.89546
## TEAM_BASERUN_SB 0.03304 0.02867 1.152 0.25071
## TEAM_BASERUN_CS -0.01104 0.07143 -0.155 0.87730
## TEAM_BATTING_HBP 0.08247 0.04960 1.663 0.09815 .
## TEAM_PITCHING_H -1.89096 2.76095 -0.685 0.49432
## TEAM_PITCHING_HR 4.93043 10.50664 0.469 0.63946
## TEAM_PITCHING_BB 4.51089 3.63372 1.241 0.21612
## TEAM_PITCHING_SO -0.37364 2.59705 -0.144 0.88577
## TEAM_FIELDING_E -0.17204 0.04140 -4.155 5.08e-05 ***
## TEAM_FIELDING_DP -0.10819 0.03654 -2.961 0.00349 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.467 on 175 degrees of freedom
## Multiple R-squared: 0.5501, Adjusted R-squared: 0.5116
## F-statistic: 14.27 on 15 and 175 DF, p-value: < 2.2e-16
## Identify influential points using Cook's distance
cookr1 <- cooks.distance(lmod2)
halfnorm(cookr1, 3, labs = index, ylab = "Cook's distances")
## Eliminating the largest influential point and re-running model 2
lmodr1 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, datanona, subset = (cookr1 < max(cookr1)))
summary(lmodr1)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona,
## subset = (cookr1 < max(cookr1)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.4671 -6.2940 -0.0933 6.0532 22.7230
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.040463 15.586858 0.452 0.652023
## TEAM_BATTING_H 0.041258 0.011130 3.707 0.000277 ***
## TEAM_BATTING_2B -0.011901 0.030663 -0.388 0.698378
## TEAM_BATTING_HR 0.069322 0.024333 2.849 0.004887 **
## TEAM_BATTING_BB 0.049232 0.009843 5.002 1.32e-06 ***
## TEAM_FIELDING_E -0.210872 0.041065 -5.135 7.14e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.941 on 184 degrees of freedom
## Multiple R-squared: 0.4437, Adjusted R-squared: 0.4286
## F-statistic: 29.36 on 5 and 184 DF, p-value: < 2.2e-16
summary(lmod2)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.922 -6.452 -0.105 5.984 23.459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.4790727 15.6213041 0.223 0.824004
## TEAM_BATTING_H 0.0412796 0.0112249 3.678 0.000309 ***
## TEAM_BATTING_2B 0.0009869 0.0302629 0.033 0.974019
## TEAM_BATTING_HR 0.0680351 0.0245324 2.773 0.006118 **
## TEAM_BATTING_BB 0.0502670 0.0099137 5.070 9.59e-07 ***
## TEAM_FIELDING_E -0.2177225 0.0412773 -5.275 3.69e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.017 on 185 degrees of freedom
## Multiple R-squared: 0.4606, Adjusted R-squared: 0.446
## F-statistic: 31.6 on 5 and 185 DF, p-value: < 2.2e-16
## Identify influential points using Cook's distance
cookr2 <- cooks.distance(lmod3)
halfnorm(cookr2, 3, labs = index, ylab = "Cook's distances")
## Eliminating the largest influential point and re-running model 3
lmodr2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, datanona, subset = (cookr2 < max(cookr2)))
summary(lmodr2)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
## TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona, subset = (cookr2 <
## max(cookr2)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.040 -6.082 -0.182 6.013 19.883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.607690 15.428167 -0.104 0.91712
## TEAM_BATTING_H 0.043919 0.009612 4.569 8.93e-06 ***
## TEAM_BATTING_HR 0.070051 0.024073 2.910 0.00406 **
## TEAM_BATTING_BB 0.052495 0.009714 5.404 1.99e-07 ***
## TEAM_FIELDING_E -0.219777 0.040416 -5.438 1.69e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.844 on 185 degrees of freedom
## Multiple R-squared: 0.4808, Adjusted R-squared: 0.4696
## F-statistic: 42.83 on 4 and 185 DF, p-value: < 2.2e-16
summary(lmod3)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
## TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.9006 -6.4318 -0.1027 5.9876 23.4547
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.496754 15.569914 0.225 0.82255
## TEAM_BATTING_H 0.041461 0.009730 4.261 3.23e-05 ***
## TEAM_BATTING_HR 0.068036 0.024466 2.781 0.00598 **
## TEAM_BATTING_BB 0.050298 0.009843 5.110 7.95e-07 ***
## TEAM_FIELDING_E -0.217805 0.041089 -5.301 3.24e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.993 on 186 degrees of freedom
## Multiple R-squared: 0.4606, Adjusted R-squared: 0.449
## F-statistic: 39.71 on 4 and 186 DF, p-value: < 2.2e-16
plot(lmodr2)
#Formal test for normality
shapiro.test(residuals(lmodr2))
##
## Shapiro-Wilk normality test
##
## data: residuals(lmodr2)
## W = 0.99413, p-value = 0.658
#Plot of successive pairs of residuals to check for serial correlation
n1 <- length(residuals(lmodr2))
plot(tail(residuals(lmodr2), n1-1) ~ head(residuals(lmodr2), n1-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))
Quickly checking the evaluation dataset shows that all the variables used in model 3 are populated (TEAM_BATTING_H, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_FIELDING_E). Hence, there is no need to omit records or transform variables before predicting. As a sanity check we compute the RMSE of the fitted model on the training data; note that the residuals must be squared before averaging, which should give an RMSE comparable to the residual standard error of about 8.8 wins reported in the model summary. (Averaging first merely recovers the absolute mean residual, which is essentially zero, e.g. 3.766439e-17, for any least-squares fit, and is therefore uninformative.) Also note that the evaluation file carries no TARGET_WINS column, so an out-of-sample RMSE cannot be computed here.
#datanona1 <- na.omit(eval)
#pred <- predict(lmod3, datanona1); pred
# Prediction model
pred <- predict(lmodr2, eval); pred
## 1 2 3 4 5 6
## 50.0012944 55.5995995 58.6090268 75.3822188 -68.1895516 -52.4846260
## 7 8 9 10 11 12
## -14.0848461 8.1335131 30.0908206 47.0991426 46.5971615 66.8698795
## 13 14 15 16 17 18
## 70.5024735 73.7760966 70.4251387 62.7855826 56.3914682 77.9131603
## 19 20 21 22 23 24
## 9.5768320 58.5162293 65.0658055 74.3501110 64.5013321 59.9944363
## 25 26 27 28 29 30
## 75.1702661 80.6842483 -200.2121409 35.2047466 54.4304572 46.0331530
## 31 32 33 34 35 36
## 76.1649859 70.3660146 78.8960120 77.7327666 71.6671899 73.1922212
## 37 38 39 40 41 42
## 65.4306200 80.5855075 76.8088287 76.0737405 77.1701470 88.2528747
## 43 44 45 46 47 48
## -294.6391629 11.1048782 12.0066051 -12.8521109 -2.3672910 19.6658677
## 49 50 51 52 53 54
## 17.6237585 34.8150175 52.1467066 68.6593008 60.2323821 61.0602726
## 55 56 57 58 59 60
## 54.3416655 64.8786676 8.6007146 21.2934681 -2.4783820 17.2490192
## 61 62 63 64 65 66
## 50.6513429 71.0035483 78.1042931 84.6061872 78.2615553 27.1260515
## 67 68 69 70 71 72
## -6.7069429 6.2156531 13.9019825 27.1724245 54.8528578 61.1367587
## 73 74 75 76 77 78
## 71.6079541 82.5817910 63.1503120 66.7571847 76.2721051 70.3473779
## 79 80 81 82 83 84
## 16.8813628 23.6029705 67.0694804 63.5527617 73.0814687 61.2214749
## 85 86 87 88 89 90
## 72.7397807 68.4595986 70.8656203 69.9289478 77.2036329 88.3550414
## 91 92 93 94 95 96
## -7.4427898 -174.3823568 -3.1453118 20.0601221 22.7814502 21.5529948
## 97 98 99 100 101 102
## 41.4757476 61.3094890 60.0887524 57.5370569 56.0828195 48.1188596
## 103 104 105 106 107 108
## 72.3144470 75.8012991 73.7328219 -62.7822319 -93.6578705 76.2222675
## 109 110 111 112 113 114
## 84.3004984 -57.9268768 66.4657338 62.9505211 73.8470747 73.3325803
## 115 116 117 118 119 120
## 64.5428558 69.4717227 79.5975946 69.2814105 65.4676854 -46.0305921
## 121 122 123 124 125 126
## 21.3339584 -4.3322312 10.5831821 10.6720457 24.3789261 45.6454948
## 127 128 129 130 131 132
## 54.0799144 30.3023757 60.5339258 71.2203674 64.1490592 67.2850212
## 133 134 135 136 137 138
## 53.5524823 69.4232631 88.3889860 -93.4394626 63.7663130 61.8155283
## 139 140 141 142 143 144
## 76.2698687 70.1213211 -3.6050432 10.5002843 67.0636591 46.3793442
## 145 146 147 148 149 150
## 63.2644371 64.8522601 61.4916768 68.4050162 72.2114528 72.0632739
## 151 152 153 154 155 156
## 75.9956895 76.4142336 -256.6521122 47.9124455 69.3066756 53.6493004
## 157 158 159 160 161 162
## 82.7776659 -71.0045105 -11.1416971 3.3110633 74.7453283 90.0734493
## 163 164 165 166 167 168
## 74.5853010 90.7459880 83.0603790 83.6941638 77.0033410 72.9885112
## 169 170 171 172 173 174
## 61.4743937 77.5426980 33.1096604 46.9329008 56.0503903 62.3174032
## 175 176 177 178 179 180
## 51.9129863 63.8603396 70.6622460 55.9306416 65.6285557 72.5266031
## 181 182 183 184 185 186
## 64.0781709 75.1458501 81.4275786 85.9876818 -154.4116946 -46.6456458
## 187 188 189 190 191 192
## -16.6047742 -158.7449733 -69.5589125 23.8932377 -13.2632746 21.6895666
## 193 194 195 196 197 198
## 29.2247249 48.7040804 49.0294725 32.2652183 44.0074669 78.7281775
## 199 200 201 202 203 204
## 67.0392095 75.0546241 60.9186607 72.1208539 65.5074181 73.2866397
## 205 206 207 208 209 210
## 68.4409120 66.7180424 75.4606135 71.7341535 -9.6641850 -37.0305835
## 211 212 213 214 215 216
## 13.8747596 11.3599488 39.7520378 45.7244758 50.9659084 73.3707848
## 217 218 219 220 221 222
## 67.5422444 68.7987645 62.1325261 66.9005906 68.5972593 57.0167730
## 223 224 225 226 227 228
## 76.8997182 70.6075351 -167.0335413 60.3273146 66.9883537 72.0614822
## 229 230 231 232 233 234
## 77.2297320 -76.8029545 31.9416447 59.7577104 73.6512577 75.7162202
## 235 236 237 238 239 240
## 59.6114544 54.2542939 62.6478951 68.3299795 -12.3262127 20.6038644
## 241 242 243 244 245 246
## 56.9315210 79.4529702 69.3265103 70.0093504 42.2973096 71.3893426
## 247 248 249 250 251 252
## 62.6907717 72.5046031 60.2053941 80.8021376 79.3682767 -150.4846510
## 253 254 255 256 257 258
## -0.2079262 -159.6938910 55.2685513 66.8802759 63.2427082 62.3722840
## 259
## 66.1951585
write.csv(pred, file = '/Users/tponnada/Downloads/predict.csv')
#RMSE helper: square the residuals before averaging, then take the square root
rmse <- function(x, y) sqrt(mean((x - y)^2))
#lmodr2 was fit on a subset (largest influential point removed), so compare its
#fitted values against the response rows actually used in the fit
rmse(fitted(lmodr2), model.frame(lmodr2)$TARGET_WINS)
#The evaluation file has no TARGET_WINS column, so an out-of-sample RMSE
#cannot be computed: rmse(predict(lmodr2, eval), eval$TARGET_WINS) returns NaN