DATA 621 - Homework # 1

In this homework assignment, you will explore, analyze and model a data set containing approximately 2200 records. Each record represents a professional baseball team from the years 1871 to 2006 inclusive. Each record has the performance of the team for the given year, with all of the statistics adjusted to match the performance of a 162-game season. Your objective is to build a multiple linear regression model on the training data to predict the number of wins for the team. You can only use the variables given to you (or variables that you derive from the variables provided).

Deliverables:

A write-up submitted in PDF format. Your write-up should have four sections; each is described below. You may assume you are addressing me as a fellow data scientist, so you do not need to shy away from technical details.

Assigned predictions (the number of wins for the team) for the evaluation data set.

Include your R statistical programming code in an Appendix.

Write Up:

  1. DATA EXPLORATION (25 Points)

Describe the size and the variables in the moneyball training data set. Consider that too much detail will cause a manager to lose interest, while too little detail will make the manager think you aren't doing your job. Some suggestions are given below. Please do NOT treat this as a checklist of things to do to complete the assignment. You should have your own thoughts on what to tell the boss; these are just ideas.

  1. Mean / Standard Deviation / Median
  2. Bar Chart or Box Plot of the data
  3. Is the data correlated to the target variable (or to other variables?)
  4. Are any of the variables missing and need to be imputed ("fixed")?
  2. DATA PREPARATION (25 Points)

Describe how you have transformed the data by changing the original variables or creating new variables. If you did transform the data or create new variables, discuss why you did this. Here are some possible transformations.

  1. Fix missing values (maybe with a Mean or Median value)
  2. Create flags to suggest if a variable was missing
  3. Transform data by putting it into buckets
  4. Mathematical transforms such as log or square root (or use Box-Cox)
  5. Combine variables (such as ratios or adding or multiplying) to create new variables
  3. BUILD MODELS (25 Points)

Using the training data set, build at least three different multiple linear regression models, using different variables (or the same variables with different transformations). Since we have not yet covered automated variable selection methods, you should select the variables manually (unless you previously learned Forward or Stepwise selection, etc.). Since you manually selected each variable for inclusion in or exclusion from the model, indicate why this was done.

Discuss the coefficients in the models: do they make sense? For example, if a team hits a lot of Home Runs, it would be reasonable to expect that such a team would win more games. However, if the coefficient is negative (suggesting that the team would lose more games), then that needs to be discussed. Are you keeping the model even though it is counterintuitive? Why? The boss needs to know.

  4. SELECT MODELS (25 Points)

Decide on the criteria for selecting the best multiple linear regression model. Will you select a model with slightly worse performance if it makes more sense or is more parsimonious? Discuss why you selected your model.

For the multiple linear regression model, will you use a metric such as adjusted R^2, RMSE, etc.? Be sure to explain how you can make inferences from the model, discuss multicollinearity issues (if any), and discuss other relevant model output. Using the training data set, evaluate the multiple linear regression model based on (a) mean squared error, (b) R^2, (c) F-statistic, and (d) residual plots. Make predictions using the evaluation data set.

DATA EXPLORATION

SUMMARY

Model lmod3, refit after removing its largest influential point, is found to be a decent model for predicting TARGET_WINS, as outlined in this report.

DETAIL

Per the guidance in the problem, I've included below both a numerical summary (mean, median, standard deviation) of the variables in the training dataset and histogram plots, since all the relevant variables are numeric. Furthermore, I've identified the variables that contain NA's; since missing entries are already encoded as NA, no immediate recoding is required, and imputation decisions are deferred to the data preparation section. Additionally, although a few variables have minimum values of 0, these appear to be legitimate values and not ones entered in error. Finally, a correlation plot shows the pairwise correlation values (-1 to +1) of the variables in the dataset.

In terms of size, the training data set has 2,276 records and 17 variables. We start with some numerical summaries; a close look at the minimum and maximum values of each variable is worthwhile. The following variables have minimum values of 0: TARGET_WINS (number of wins), TEAM_BATTING_3B (triples by batters), TEAM_BATTING_HR (homeruns by batters), TEAM_BATTING_BB (walks by batters), TEAM_BATTING_SO (strikeouts by batters), TEAM_BASERUN_SB (stolen bases), TEAM_BASERUN_CS (caught stealing), TEAM_PITCHING_HR (homeruns allowed), TEAM_PITCHING_BB (walks allowed), TEAM_PITCHING_SO (strikeouts by pitchers). It is not unusual for any of these variables to have 0 values. Similarly, most of the maximum values don't cause alarm, although a few pitching maxima (e.g., TEAM_PITCHING_H at 30,132) are extreme; we revisit scale and outliers in the data preparation section.
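A compact way to scan these extremes from the training data frame loaded in the code section below (a quick sketch):

#Minimum and maximum of each variable, ignoring NA's (INDEX dropped)
t(sapply(training[-1], range, na.rm = TRUE))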

However, we do notice from the summary that some of the variables have NA's: TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_BASERUN_CS, TEAM_BATTING_HBP, TEAM_PITCHING_SO and TEAM_FIELDING_DP. The dataset doesn't make immediately clear why these NA's exist (whether they are data omission errors or truly missing values).

A visual plot of the variables indicates that most have a skewed distribution. The approximately normally distributed variables are TARGET_WINS, TEAM_BATTING_2B, TEAM_BATTING_HBP and TEAM_FIELDING_DP.
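The skew can also be quantified directly; a minimal base-R sketch computing the sample skewness (third standardized moment) of each variable:

#Sample skewness of each variable (NA's removed, INDEX dropped)
round(sapply(training[-1], function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}), 2)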

From the correlation matrix plot, we identify a few pairs of variables that are perfectly positively correlated: TEAM_BATTING_H & TEAM_PITCHING_H, TEAM_BATTING_HR & TEAM_PITCHING_HR, TEAM_BATTING_BB & TEAM_PITCHING_BB, TEAM_BATTING_SO & TEAM_PITCHING_SO. This knowledge helps us shrink the predictive model by including fewer variables. For example, we decide to retain the batting variables and remove the corresponding perfectly correlated pitching variables.
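Those pairwise correlations can be verified directly (a sketch; as with the correlation plot, this uses complete cases only):

#Correlation of each batting variable with its pitching counterpart
bat <- training[c("TEAM_BATTING_H", "TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_BATTING_SO")]
pit <- training[c("TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_BB", "TEAM_PITCHING_SO")]
round(diag(cor(bat, pit, use = "na.or.complete")), 3)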

library(corrplot)
## corrplot 0.92 loaded
#Reading in the training and evaluation data files

training <- read.csv("/Users/tponnada/Downloads/moneyball-training-data.csv")
eval <- read.csv("/Users/tponnada/Downloads/moneyball-evaluation-data.csv")

#Checking the first 6 rows of the training data set, the dimensions of the data set and the usual univariate summary information.

head(training)
##   INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 1     1          39           1445             194              39
## 2     2          70           1339             219              22
## 3     3          86           1377             232              35
## 4     4          70           1387             209              38
## 5     5          82           1297             186              27
## 6     6          75           1279             200              36
##   TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 1              13             143             842              NA
## 2             190             685            1075              37
## 3             137             602             917              46
## 4              96             451             922              43
## 5             102             472             920              49
## 6              92             443             973             107
##   TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## 1              NA               NA            9364               84
## 2              28               NA            1347              191
## 3              27               NA            1377              137
## 4              30               NA            1396               97
## 5              39               NA            1297              102
## 6              59               NA            1279               92
##   TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 1              927             5456            1011               NA
## 2              689             1082             193              155
## 3              602              917             175              153
## 4              454              928             164              156
## 5              472              920             138              168
## 6              443              973             123              149
dim(training)
## [1] 2276   17
summary(training)
##      INDEX         TARGET_WINS     TEAM_BATTING_H TEAM_BATTING_2B
##  Min.   :   1.0   Min.   :  0.00   Min.   : 891   Min.   : 69.0  
##  1st Qu.: 630.8   1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0  
##  Median :1270.5   Median : 82.00   Median :1454   Median :238.0  
##  Mean   :1268.5   Mean   : 80.79   Mean   :1469   Mean   :241.2  
##  3rd Qu.:1915.5   3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0  
##  Max.   :2535.0   Max.   :146.00   Max.   :2554   Max.   :458.0  
##                                                                  
##  TEAM_BATTING_3B  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.0   Min.   :   0.0  
##  1st Qu.: 34.00   1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0  
##  Median : 47.00   Median :102.00   Median :512.0   Median : 750.0  
##  Mean   : 55.25   Mean   : 99.61   Mean   :501.6   Mean   : 735.6  
##  3rd Qu.: 72.00   3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0  
##  Max.   :223.00   Max.   :264.00   Max.   :878.0   Max.   :1399.0  
##                                                    NA's   :102     
##  TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
##  Min.   :  0.0   Min.   :  0.0   Min.   :29.00    Min.   : 1137  
##  1st Qu.: 66.0   1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419  
##  Median :101.0   Median : 49.0   Median :58.00    Median : 1518  
##  Mean   :124.8   Mean   : 52.8   Mean   :59.36    Mean   : 1779  
##  3rd Qu.:156.0   3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682  
##  Max.   :697.0   Max.   :201.0   Max.   :95.00    Max.   :30132  
##  NA's   :131     NA's   :772     NA's   :2085                    
##  TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO  TEAM_FIELDING_E 
##  Min.   :  0.0    Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0    1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0  
##  Median :107.0    Median : 536.5   Median :  813.5   Median : 159.0  
##  Mean   :105.7    Mean   : 553.0   Mean   :  817.7   Mean   : 246.5  
##  3rd Qu.:150.0    3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2  
##  Max.   :343.0    Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##                                    NA's   :102                       
##  TEAM_FIELDING_DP
##  Min.   : 52.0   
##  1st Qu.:131.0   
##  Median :149.0   
##  Mean   :146.4   
##  3rd Qu.:164.0   
##  Max.   :228.0   
##  NA's   :286
#Standard deviations of the variables that have no missing values (columns with NA's would need na.rm = TRUE)

sapply(training[,2:7], sd)
##     TARGET_WINS  TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR 
##        15.75215       144.59120        46.80141        27.93856        60.54687 
## TEAM_BATTING_BB 
##       122.67086
sapply(training[,12:14], sd)
##  TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB 
##       1406.84293         61.29875        166.35736
#Univariate plots using histograms, kernel density estimates and sorted data plotted against its index, for each of the 16 analysis variables (INDEX excluded).

#par(mfrow = c(3, 3))  #optional: arrange the plots in a grid

#Number of wins

hist(training$TARGET_WINS, xlab = "Number of Wins", main = "")

plot(density(training$TARGET_WINS, na.rm = TRUE), main = "")

plot(sort(training$TARGET_WINS), ylab = "Sorted Number of Wins")

#Base Hits by batters (1B,2B,3B,HR)

hist(training$TEAM_BATTING_H, xlab = "Base Hits by batters", main = "")

plot(density(training$TEAM_BATTING_H, na.rm = TRUE), main = "")

plot(sort(training$TEAM_BATTING_H), ylab = "Sorted Base Hits")

#Doubles by batters (2B)

hist(training$TEAM_BATTING_2B, xlab = "Doubles by batters (2B)", main = "")

plot(density(training$TEAM_BATTING_2B, na.rm = TRUE), main = "")

plot(sort(training$TEAM_BATTING_2B), ylab = "Sorted Doubles")

#Triples by batters (3B)

hist(training$TEAM_BATTING_3B, xlab = "Triples by batters (3B)", main = "")

plot(density(training$TEAM_BATTING_3B, na.rm = TRUE), main = "")

plot(sort(training$TEAM_BATTING_3B), ylab = "Sorted Triples")

#Homeruns by batters (4B)

hist(training$TEAM_BATTING_HR, xlab = "Homeruns by batters (4B)", main = "")

plot(density(training$TEAM_BATTING_HR, na.rm = TRUE), main = "")

plot(sort(training$TEAM_BATTING_HR), ylab = "Sorted Homeruns")

#Walks by batters

hist(training$TEAM_BATTING_BB, xlab = "Walks by batters", main = "")

plot(density(training$TEAM_BATTING_BB, na.rm = TRUE), main = "")

plot(sort(training$TEAM_BATTING_BB), ylab = "Sorted Walks")

#Strikeouts by batters

hist(training$TEAM_BATTING_SO, xlab = "Strikeouts by batters", main = "")

plot(density(training$TEAM_BATTING_SO, na.rm = TRUE), main = "")

plot(sort(training$TEAM_BATTING_SO), ylab = "Sorted Strikeouts (Batters)")

#Stolen bases

hist(training$TEAM_BASERUN_SB, xlab = "Stolen bases", main = "")

plot(density(training$TEAM_BASERUN_SB, na.rm = TRUE), main = "")

plot(sort(training$TEAM_BASERUN_SB), ylab = "Sorted Stolen Bases")

#Caught stealing

hist(training$TEAM_BASERUN_CS, xlab = "Caught stealing", main = "")

plot(density(training$TEAM_BASERUN_CS, na.rm = TRUE), main = "")

plot(sort(training$TEAM_BASERUN_CS), ylab = "Sorted Caught Stealing")

#Batters hit by pitch (get a free base)

hist(training$TEAM_BATTING_HBP, xlab = "Batters hit by pitch (get a free base)", main = "")

plot(density(training$TEAM_BATTING_HBP, na.rm = TRUE), main = "")

plot(sort(training$TEAM_BATTING_HBP), ylab = "Sorted Hit by Pitch")

#Hits allowed

hist(training$TEAM_PITCHING_H, xlab = "Hits allowed", main = "")

plot(density(training$TEAM_PITCHING_H, na.rm = TRUE), main = "")

plot(sort(training$TEAM_PITCHING_H), ylab = "Sorted Hits Allowed")

#Homeruns allowed

hist(training$TEAM_PITCHING_HR, xlab = "Homeruns allowed", main = "")

plot(density(training$TEAM_PITCHING_HR, na.rm = TRUE), main = "")

plot(sort(training$TEAM_PITCHING_HR), ylab = "Sorted Homeruns Allowed")

#Walks allowed

hist(training$TEAM_PITCHING_BB, xlab = "Walks allowed", main = "")

plot(density(training$TEAM_PITCHING_BB, na.rm = TRUE), main = "")

plot(sort(training$TEAM_PITCHING_BB), ylab = "Sorted Walks Allowed")

#Strikeouts by pitchers

hist(training$TEAM_PITCHING_SO, xlab = "Strikeouts by pitchers", main = "")

plot(density(training$TEAM_PITCHING_SO, na.rm = TRUE), main = "")

plot(sort(training$TEAM_PITCHING_SO), ylab = "Sorted Strikeouts (Pitchers)")

#Errors

hist(training$TEAM_FIELDING_E, xlab = "Errors", main = "")

plot(density(training$TEAM_FIELDING_E, na.rm = TRUE), main = "")

plot(sort(training$TEAM_FIELDING_E), ylab = "Sorted Errors")

#Double Plays

hist(training$TEAM_FIELDING_DP, xlab = "Double Plays", main = "")

plot(density(training$TEAM_FIELDING_DP, na.rm = TRUE), main = "")

plot(sort(training$TEAM_FIELDING_DP), ylab = "Sorted Double Plays")

#Instead of drawing scatterplots for every pair of the 17 variables, I use the correlation matrix.

M = cor(training, use = "na.or.complete")
corrplot(M, method = 'number', type = 'lower', diag = FALSE, number.cex = 0.5, tl.cex = 0.5, cl.cex = 0.5)

DATA PREPARATION

As identified earlier, some variables in the training dataset have missing (NA) values. Since we decided to eliminate the perfectly correlated pitching variables, the list of variables with missing values reduces to the one below. In brackets is the number of NA's reported for each (derived from the summary command in the data exploration section above). A quick look indicates that TEAM_BASERUN_CS and TEAM_BATTING_HBP have the highest numbers of NA values and hence the highest proportions of missing values. Since we know the training dataset totals 2,276 observations, we can calculate the proportion of records that are not populated. From the problem statement, TEAM_BATTING_SO and TEAM_BASERUN_CS are expected to have a negative impact on wins, while the other variables below are expected to have a positive impact.

TEAM_BATTING_SO (102): 4.48% of observations not populated
TEAM_BASERUN_SB (131): 5.8% of observations not populated
TEAM_BASERUN_CS (772): 33.9% of observations not populated
TEAM_BATTING_HBP (2085): 91.6% of observations not populated
TEAM_FIELDING_DP (286): 12.6% of observations not populated
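These counts and proportions can be reproduced directly from the training data frame; a quick sketch:

#Counts and percentages of NA's per variable
na_counts <- colSums(is.na(training))
data.frame(NAs = na_counts, Percent = round(100 * na_counts / nrow(training), 1))[na_counts > 0, ]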

Also, from the correlation plot in the exploration section above, we use a threshold of |correlation| >= 0.3 to identify the other variables that should be included in the model. By performing this screen, we avoid transforming variables that have minimal predictive value. We also identified in the data exploration section that some variables are perfectly correlated and hence redundant to include, and decided to eliminate the following:

TEAM_PITCHING_H, TEAM_PITCHING_HR, TEAM_PITCHING_BB, TEAM_PITCHING_SO

The list of explanatory variables thus reduces to the following:

TEAM_BATTING_H (Correlation: 0.47)
TEAM_BATTING_2B (Correlation: 0.31)
TEAM_BATTING_HR (Correlation: 0.42)
TEAM_BATTING_BB (Correlation: 0.47)
TEAM_FIELDING_E (Correlation: -0.39)
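These correlations can be read off the matrix M computed in the exploration section; the screening step, as a sketch:

#Correlations with TARGET_WINS, screened at |r| >= 0.3
r_wins <- M[, "TARGET_WINS"]
round(r_wins[abs(r_wins) >= 0.3 & names(r_wins) != "TARGET_WINS"], 2)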

We see that none of the retained predictors have NA's, but we use scatterplots to check for outliers or other problems that would necessitate a transformation. The plots show that TEAM_BATTING_HR, TEAM_BATTING_BB and TEAM_FIELDING_E have wide scales and outliers.

We use the boxcox function from the MASS package to first check whether the response (TARGET_WINS) needs a transformation. The confidence interval for lambda does not include 1 (formally indicating a transformation), and the suggested power is about 1.3. However, rounding to the nearest simple value of lambda (1.0) suggests that a transformation of the response variable is probably not worthwhile in this case.

For the independent variables, we can generalize the model by adding polynomial terms and determine the order of each variable that can be used in a potential predictive model.

require(MASS)
## Loading required package: MASS
plot(TARGET_WINS ~ TEAM_BATTING_H, training)

plot(TARGET_WINS ~ TEAM_BATTING_2B, training)

plot(TARGET_WINS ~ TEAM_BATTING_HR, training)

plot(TARGET_WINS ~ TEAM_BATTING_BB, training)

plot(TARGET_WINS ~ TEAM_FIELDING_E, training)

#The Box-Cox transformation only works for positive values of the response variable. There is one record where the response TARGET_WINS is 0; we set it to NA and then apply the Box-Cox procedure. The estimated power is about 1.3; since this is close to 1, we leave the response untransformed, consistent with the discussion above.

training$TARGET_WINS[training$TARGET_WINS == 0] <- NA

lmod <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, training)
boxcox(lmod, plotit = T)

boxcox(lmod, plotit = T, lambda = seq(0, 3.0, by = 0.1))

bc <- boxcox(lmod)

str(bc)
## List of 2
##  $ x: num [1:100] -2 -1.96 -1.92 -1.88 -1.84 ...
##  $ y: num [1:100] -7764 -7657 -7553 -7450 -7349 ...
bc.power <- bc$x[which.max(bc$y)]; bc.power
## [1] 1.313131
#Determining the order of the polynomial for TEAM_BATTING_H. The p-value of TEAM_BATTING_H is significant, so we move on to a quadratic and then a cubic term. The p-value of TEAM_BATTING_H^4 is not significant, so we stick with the cubic term.

summary(lm(TARGET_WINS ~ TEAM_BATTING_H, training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.386  -8.805   0.862   9.779  45.945 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    19.600519   3.109218   6.304 3.47e-10 ***
## TEAM_BATTING_H  0.041664   0.002106  19.786  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.47 on 2273 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1469, Adjusted R-squared:  0.1466 
## F-statistic: 391.5 on 1 and 2273 DF,  p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_H + I(TEAM_BATTING_H^2), training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + I(TEAM_BATTING_H^2), 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.547  -8.620   0.714   9.617  45.924 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.005e+01  1.428e+01  -2.804  0.00509 ** 
## TEAM_BATTING_H       1.179e-01  1.795e-02   6.570 6.24e-11 ***
## I(TEAM_BATTING_H^2) -2.405e-05  5.622e-06  -4.278 1.96e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.42 on 2272 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1537, Adjusted R-squared:  0.153 
## F-statistic: 206.4 on 2 and 2272 DF,  p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_H + I(TEAM_BATTING_H^2) + I(TEAM_BATTING_H^3), training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + I(TEAM_BATTING_H^2) + 
##     I(TEAM_BATTING_H^3), data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.553  -8.551   0.700   9.708  45.945 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -3.793e+02  7.023e+01  -5.401 7.31e-08 ***
## TEAM_BATTING_H       7.422e-01  1.278e-01   5.807 7.24e-09 ***
## I(TEAM_BATTING_H^2) -3.989e-04  7.619e-05  -5.235 1.80e-07 ***
## I(TEAM_BATTING_H^3)  7.315e-08  1.483e-08   4.933 8.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.34 on 2271 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1627, Adjusted R-squared:  0.1616 
## F-statistic: 147.1 on 3 and 2271 DF,  p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_H + I(TEAM_BATTING_H^2) + I(TEAM_BATTING_H^3) + I(TEAM_BATTING_H^4) , training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + I(TEAM_BATTING_H^2) + 
##     I(TEAM_BATTING_H^3) + I(TEAM_BATTING_H^4), data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.644  -8.608   0.771   9.720  46.681 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)   
## (Intercept)         -7.902e+02  2.882e+02  -2.742  0.00616 **
## TEAM_BATTING_H       1.753e+00  6.995e-01   2.506  0.01228 * 
## I(TEAM_BATTING_H^2) -1.316e-03  6.284e-04  -2.094  0.03639 * 
## I(TEAM_BATTING_H^3)  4.359e-07  2.472e-07   1.763  0.07802 . 
## I(TEAM_BATTING_H^4) -5.271e-11  3.586e-11  -1.470  0.14174   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.34 on 2270 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1635, Adjusted R-squared:  0.162 
## F-statistic: 110.9 on 4 and 2270 DF,  p-value: < 2.2e-16
#Determining the order of the polynomial for TEAM_BATTING_2B. The p-value of TEAM_BATTING_2B is significant, so we move on to a quadratic and then a cubic term. The fourth-order term is only marginally significant (p ~ 0.046) and adds little explanatory power, so we stick with the cubic term.

summary(lm(TARGET_WINS ~ TEAM_BATTING_2B, training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.013  -9.571   0.585  10.093  57.442 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     57.710776   1.654890   34.87   <2e-16 ***
## TEAM_BATTING_2B  0.095799   0.006733   14.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.01 on 2273 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.08178,    Adjusted R-squared:  0.08137 
## F-statistic: 202.4 on 1 and 2273 DF,  p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_2B + I(TEAM_BATTING_2B^2), training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B + I(TEAM_BATTING_2B^2), 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.245  -9.670   0.499  10.095  58.571 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          38.8756779  6.2279788   6.242 5.14e-10 ***
## TEAM_BATTING_2B       0.2544659  0.0510304   4.987 6.61e-07 ***
## I(TEAM_BATTING_2B^2) -0.0003220  0.0001027  -3.137  0.00173 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.98 on 2272 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.08573,    Adjusted R-squared:  0.08493 
## F-statistic: 106.5 on 2 and 2272 DF,  p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_2B + I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_2B^3), training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B + I(TEAM_BATTING_2B^2) + 
##     I(TEAM_BATTING_2B^3), data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.943  -9.657   0.743   9.950  59.500 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5.419e+01  1.715e+01  -3.159   0.0016 ** 
## TEAM_BATTING_2B       1.444e+00  2.107e-01   6.854 9.22e-12 ***
## I(TEAM_BATTING_2B^2) -5.203e-03  8.453e-04  -6.155 8.83e-10 ***
## I(TEAM_BATTING_2B^3)  6.446e-06  1.108e-06   5.817 6.84e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.88 on 2271 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.09916,    Adjusted R-squared:  0.09797 
## F-statistic: 83.32 on 3 and 2271 DF,  p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_2B + I(TEAM_BATTING_2B^2) + I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_2B^4) , training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_2B + I(TEAM_BATTING_2B^2) + 
##     I(TEAM_BATTING_2B^3) + I(TEAM_BATTING_2B^4), data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -57.83  -9.47   0.76  10.01  58.83 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.235e+02  3.871e+01  -3.191 0.001438 ** 
## TEAM_BATTING_2B       2.649e+00  6.388e-01   4.146  3.5e-05 ***
## I(TEAM_BATTING_2B^2) -1.272e-02  3.856e-03  -3.298 0.000987 ***
## I(TEAM_BATTING_2B^3)  2.644e-05  1.007e-05   2.625 0.008713 ** 
## I(TEAM_BATTING_2B^4) -1.919e-08  9.609e-09  -1.997 0.045896 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.87 on 2270 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1007, Adjusted R-squared:  0.09915 
## F-statistic: 63.57 on 4 and 2270 DF,  p-value: < 2.2e-16
#Determining the order of the polynomial for TEAM_BATTING_HR. The p-value of TEAM_BATTING_HR is significant, so we move on to a quadratic term. The p-value of TEAM_BATTING_HR^2 is not significant, so we stick with the first-order (linear) term.

summary(lm(TARGET_WINS ~ TEAM_BATTING_HR, training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_HR, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.350  -9.938   0.544  10.163  68.347 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     76.350142   0.623296 122.494   <2e-16 ***
## TEAM_BATTING_HR  0.044917   0.005346   8.402   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.43 on 2273 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.03012,    Adjusted R-squared:  0.02969 
## F-statistic: 70.59 on 1 and 2273 DF,  p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_HR + I(TEAM_BATTING_HR^2), training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_HR + I(TEAM_BATTING_HR^2), 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -65.204  -9.891   0.601  10.144  68.102 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.720e+01  9.017e-01  85.619   <2e-16 ***
## TEAM_BATTING_HR      2.060e-02  1.932e-02   1.067    0.286    
## I(TEAM_BATTING_HR^2) 1.155e-04  8.815e-05   1.310    0.190    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.43 on 2272 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.03085,    Adjusted R-squared:   0.03 
## F-statistic: 36.16 on 2 and 2272 DF,  p-value: 3.46e-16
#Determining the order of the polynomial for TEAM_BATTING_BB. The p-value of TEAM_BATTING_BB is significant, so we move on to a quadratic term, which is also significant. In the cubic model the cubic term is only marginally significant and the lower-order terms lose significance, so we stick with the quadratic term.

summary(lm(TARGET_WINS ~ TEAM_BATTING_BB, training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_BB, data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.676  -9.745   0.541   9.798  77.822 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     66.329357   1.352298   49.05   <2e-16 ***
## TEAM_BATTING_BB  0.028891   0.002618   11.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.26 on 2273 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.05084,    Adjusted R-squared:  0.05042 
## F-statistic: 121.7 on 1 and 2273 DF,  p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_BB + I(TEAM_BATTING_BB^2), training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_BB + I(TEAM_BATTING_BB^2), 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.399  -9.480   0.547   9.802  71.317 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7.561e+01  2.504e+00  30.191  < 2e-16 ***
## TEAM_BATTING_BB      -1.781e-02  1.094e-02  -1.628    0.104    
## I(TEAM_BATTING_BB^2)  5.309e-05  1.208e-05   4.394 1.16e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.2 on 2272 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.05883,    Adjusted R-squared:  0.05801 
## F-statistic: 71.01 on 2 and 2272 DF,  p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_BATTING_BB + I(TEAM_BATTING_BB^2) + I(TEAM_BATTING_BB^3), training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_BB + I(TEAM_BATTING_BB^2) + 
##     I(TEAM_BATTING_BB^3), data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -57.177  -9.414   0.478   9.982  74.510 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           6.855e+01  3.782e+00  18.122   <2e-16 ***
## TEAM_BATTING_BB       5.419e-02  3.094e-02   1.752   0.0799 .  
## I(TEAM_BATTING_BB^2) -1.376e-04  7.759e-05  -1.774   0.0762 .  
## I(TEAM_BATTING_BB^3)  1.483e-07  5.960e-08   2.488   0.0129 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.19 on 2271 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.06139,    Adjusted R-squared:  0.06015 
## F-statistic: 49.51 on 3 and 2271 DF,  p-value: < 2.2e-16
#Determining the order of the polynomial for TEAM_FIELDING_E. The p-value of TEAM_FIELDING_E is significant, so we move on to a quadratic term. The quadratic term is also significant, so we continue to higher orders and find that the p-value for the fifth-order term is not significant. Hence we stop at the fourth-order term.

summary(lm(TARGET_WINS ~ TEAM_FIELDING_E, training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_E, data = training)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -62.01 -10.06   0.70  10.33  73.17 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     83.613131   0.479768 174.278  < 2e-16 ***
## TEAM_FIELDING_E -0.011339   0.001439  -7.878 5.13e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.46 on 2273 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.02658,    Adjusted R-squared:  0.02615 
## F-statistic: 62.06 on 1 and 2273 DF,  p-value: 5.126e-15
summary(lm(TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2), training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2), 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -63.584  -9.738   0.599  10.501  72.815 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           8.048e+01  7.357e-01 109.399  < 2e-16 ***
## TEAM_FIELDING_E       9.616e-03  4.015e-03   2.395   0.0167 *  
## I(TEAM_FIELDING_E^2) -1.818e-05  3.255e-06  -5.586 2.61e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.36 on 2272 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.03976,    Adjusted R-squared:  0.03892 
## F-statistic: 47.04 on 2 and 2272 DF,  p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2) + I(TEAM_FIELDING_E^3), training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2) + 
##     I(TEAM_FIELDING_E^3), data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -67.853  -9.714   0.657  10.246  66.964 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           8.636e+01  1.197e+00  72.173  < 2e-16 ***
## TEAM_FIELDING_E      -4.272e-02  9.330e-03  -4.578 4.95e-06 ***
## I(TEAM_FIELDING_E^2)  7.833e-05  1.589e-05   4.929 8.88e-07 ***
## I(TEAM_FIELDING_E^3) -4.366e-08  7.040e-09  -6.202 6.61e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.23 on 2271 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.05575,    Adjusted R-squared:  0.05451 
## F-statistic:  44.7 on 3 and 2271 DF,  p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2) + I(TEAM_FIELDING_E^3) + I(TEAM_FIELDING_E^4), training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2) + 
##     I(TEAM_FIELDING_E^3) + I(TEAM_FIELDING_E^4), data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.698  -9.766   0.503  10.115  66.298 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           9.173e+01  1.834e+00  50.016  < 2e-16 ***
## TEAM_FIELDING_E      -1.020e-01  1.797e-02  -5.674 1.57e-08 ***
## I(TEAM_FIELDING_E^2)  2.511e-04  4.756e-05   5.281 1.41e-07 ***
## I(TEAM_FIELDING_E^3) -2.162e-07  4.533e-08  -4.770 1.96e-06 ***
## I(TEAM_FIELDING_E^4)  5.353e-11  1.389e-11   3.854  0.00012 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.19 on 2270 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.06189,    Adjusted R-squared:  0.06024 
## F-statistic: 37.44 on 4 and 2270 DF,  p-value: < 2.2e-16
summary(lm(TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2) + I(TEAM_FIELDING_E^3) + I(TEAM_FIELDING_E^4) + I(TEAM_FIELDING_E^5), training))
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_FIELDING_E + I(TEAM_FIELDING_E^2) + 
##     I(TEAM_FIELDING_E^3) + I(TEAM_FIELDING_E^4) + I(TEAM_FIELDING_E^5), 
##     data = training)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.076  -9.813   0.444  10.136  67.429 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           9.414e+01  2.783e+00  33.831  < 2e-16 ***
## TEAM_FIELDING_E      -1.345e-01  3.349e-02  -4.014 6.15e-05 ***
## I(TEAM_FIELDING_E^2)  3.850e-04  1.258e-04   3.061  0.00223 ** 
## I(TEAM_FIELDING_E^3) -4.331e-07  1.940e-07  -2.233  0.02568 *  
## I(TEAM_FIELDING_E^4)  1.996e-10  1.278e-10   1.562  0.11843    
## I(TEAM_FIELDING_E^5) -3.422e-14  2.976e-14  -1.150  0.25034    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.18 on 2269 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.06244,    Adjusted R-squared:  0.06037 
## F-statistic: 30.22 on 5 and 2269 DF,  p-value: < 2.2e-16

BUILD MODELS

MODEL 1:

As a baseline, I start by building a multiple linear regression model that includes all of the variables in the dataset except the index variable. Because several variables contain NA's, this model is fit on the complete cases only (191 records). The summary shows that this baseline model, lmodraw, has only two significant variables with small p-values (TEAM_FIELDING_E, TEAM_FIELDING_DP). Furthermore, the adjusted R^2 value indicates that the model explains only about 51% of the variation in the response. The ANOVA test indicates that the null hypothesis of no predictor being significant in the full model can be rejected; at least one of the predictors is significant.

TARGET_WINS = 60.28826 + (1.91348 * TEAM_BATTING_H) + (0.02639 * TEAM_BATTING_2B) + (-0.10118 * TEAM_BATTING_3B) + (-4.84371 * TEAM_BATTING_HR) + (-4.45969 * TEAM_BATTING_BB) + (0.34196 * TEAM_BATTING_SO) + (0.03304 * TEAM_BASERUN_SB) + (-0.01104 * TEAM_BASERUN_CS) + (0.08247 * TEAM_BATTING_HBP) + (-1.89096 * TEAM_PITCHING_H) + (4.93043 * TEAM_PITCHING_HR) + (4.51089 * TEAM_PITCHING_BB) + (-0.37364 * TEAM_PITCHING_SO) + (-0.17204 * TEAM_FIELDING_E) + (-0.10819 * TEAM_FIELDING_DP)

A quick look at the negative coefficients for some of the variables doesn't make intuitive sense; e.g., one would expect triples by batters (TEAM_BATTING_3B) and homeruns by batters (TEAM_BATTING_HR) to increase the response variable (TARGET_WINS), not decrease it. These sign flips are likely a symptom of the near-perfect correlation between the batting and pitching counterparts noted earlier, and are yet another indication that the full model is not the right model for predicting the number of wins.
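For reference, the fitting call behind the summary reproduced further below is sketched here. I assume datanona holds the complete cases of the training set; the 191 rows this leaves match the residual degrees of freedom shown in the output.

#Complete cases only; TEAM_BATTING_HBP alone leaves just 191 fully populated rows
datanona <- na.omit(training)

#Model 1: baseline regression on every predictor except the row index
lmodraw <- lm(TARGET_WINS ~ . - INDEX, data = datanona)
summary(lmodraw)
anova(lm(TARGET_WINS ~ 1, datanona), lmodraw)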

MODEL 2:

We need to cull the full model. To select individual predictors I could use ANOVA tests as in Model 1, but since we already did the initial analysis with scatterplots in the data preparation stage and decided to eliminate the redundant variables identified in the correlation plot of the data exploration stage, we move forward with the five explanatory variables previously identified. The summary(lmod2) output shows that the p-values for all but one of the variables (TEAM_BATTING_2B) are very small. The adjusted R^2 value, however, is lower than that of the full model: this model explains only about 45% of the variation in the response. Again, the ANOVA test indicates that the null hypothesis of no predictor being significant in the reduced model can be rejected.

The model can be written in mathematical form as:

TARGET_WINS = 3.4790727 + (0.0412796 * TEAM_BATTING_H) + (0.0009869 * TEAM_BATTING_2B) + (0.0680351 * TEAM_BATTING_HR) + (0.0502670 * TEAM_BATTING_BB) + (-0.2177225 * TEAM_FIELDING_E)

A quick look at the coefficients in the reduced model makes intuitive sense: only errors detract from total wins, while base hits, doubles, homeruns and walks by batters all contribute positively to target wins.
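The corresponding call, matching the summary and ANOVA output reproduced below, is sketched as:

#Model 2: the five predictors retained from the correlation screen
lmod2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona)
summary(lmod2)
anova(lm(TARGET_WINS ~ 1, datanona), lmod2)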

MODEL 3:

For Model 3, I decided to eliminate the TEAM_BATTING_2B variable, which has a large p-value in Model 2 above. The adjusted R-squared for this new model (0.449) is higher than for Model 2 (0.446) but still less than for the baseline model that included all variables (0.5116). The p-values for all of the variables are small, indicating significance, and the F-statistic has increased from 31.6 (Model 2) to 39.7 (Model 3).
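The corresponding call (again a sketch matching the output below):

#Model 3: Model 2 with TEAM_BATTING_2B dropped
lmod3 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona)
summary(lmod3)
anova(lm(TARGET_WINS ~ 1, datanona), lmod3)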

MODEL 4:

We could also use transformed explanatory variables to explain TARGET_WINS. For example, a polynomial model using terms up to third order in TEAM_BATTING_H, as suggested by the order analysis in the data preparation section, would be one candidate.
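A sketch of what such a call could look like (hypothetical; this model's output is not reproduced in this report):

#Model 4 (sketch): cubic polynomial in TEAM_BATTING_H, per the order analysis above
lmod4 <- lm(TARGET_WINS ~ TEAM_BATTING_H + I(TEAM_BATTING_H^2) + I(TEAM_BATTING_H^3), data = datanona)

The summaries and ANOVA comparisons that follow correspond to Models 1 through 3.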

## 
## Call:
## lm(formula = TARGET_WINS ~ . - INDEX, data = datanona)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8708  -5.6564  -0.0599   5.2545  22.9274 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      60.28826   19.67842   3.064  0.00253 ** 
## TEAM_BATTING_H    1.91348    2.76139   0.693  0.48927    
## TEAM_BATTING_2B   0.02639    0.03029   0.871  0.38484    
## TEAM_BATTING_3B  -0.10118    0.07751  -1.305  0.19348    
## TEAM_BATTING_HR  -4.84371   10.50851  -0.461  0.64542    
## TEAM_BATTING_BB  -4.45969    3.63624  -1.226  0.22167    
## TEAM_BATTING_SO   0.34196    2.59876   0.132  0.89546    
## TEAM_BASERUN_SB   0.03304    0.02867   1.152  0.25071    
## TEAM_BASERUN_CS  -0.01104    0.07143  -0.155  0.87730    
## TEAM_BATTING_HBP  0.08247    0.04960   1.663  0.09815 .  
## TEAM_PITCHING_H  -1.89096    2.76095  -0.685  0.49432    
## TEAM_PITCHING_HR  4.93043   10.50664   0.469  0.63946    
## TEAM_PITCHING_BB  4.51089    3.63372   1.241  0.21612    
## TEAM_PITCHING_SO -0.37364    2.59705  -0.144  0.88577    
## TEAM_FIELDING_E  -0.17204    0.04140  -4.155 5.08e-05 ***
## TEAM_FIELDING_DP -0.10819    0.03654  -2.961  0.00349 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.467 on 175 degrees of freedom
## Multiple R-squared:  0.5501, Adjusted R-squared:  0.5116 
## F-statistic: 14.27 on 15 and 175 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: TARGET_WINS ~ 1
## Model 2: TARGET_WINS ~ (INDEX + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_BASERUN_CS + TEAM_BATTING_HBP + TEAM_PITCHING_H + TEAM_PITCHING_HR + 
##     TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP) - 
##     INDEX
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    190 27887                                  
## 2    175 12546 15     15341 14.266 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.922  -6.452  -0.105   5.984  23.459 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.4790727 15.6213041   0.223 0.824004    
## TEAM_BATTING_H   0.0412796  0.0112249   3.678 0.000309 ***
## TEAM_BATTING_2B  0.0009869  0.0302629   0.033 0.974019    
## TEAM_BATTING_HR  0.0680351  0.0245324   2.773 0.006118 ** 
## TEAM_BATTING_BB  0.0502670  0.0099137   5.070 9.59e-07 ***
## TEAM_FIELDING_E -0.2177225  0.0412773  -5.275 3.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.017 on 185 degrees of freedom
## Multiple R-squared:  0.4606, Adjusted R-squared:  0.446 
## F-statistic:  31.6 on 5 and 185 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: TARGET_WINS ~ 1
## Model 2: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_FIELDING_E
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    190 27887                                  
## 2    185 15042  5     12845 31.597 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.9006  -6.4318  -0.1027   5.9876  23.4547 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.496754  15.569914   0.225  0.82255    
## TEAM_BATTING_H   0.041461   0.009730   4.261 3.23e-05 ***
## TEAM_BATTING_HR  0.068036   0.024466   2.781  0.00598 ** 
## TEAM_BATTING_BB  0.050298   0.009843   5.110 7.95e-07 ***
## TEAM_FIELDING_E -0.217805   0.041089  -5.301 3.24e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.993 on 186 degrees of freedom
## Multiple R-squared:  0.4606, Adjusted R-squared:  0.449 
## F-statistic: 39.71 on 4 and 186 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Model 1: TARGET_WINS ~ 1
## Model 2: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB + 
##     TEAM_FIELDING_E
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    190 27887                                  
## 2    186 15042  4     12845 39.709 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

SELECT MODELS

We run a set of diagnostics to select a model from among the three built above.

Residual vs Fitted Plots

The residual vs. fitted plots for the baseline full model and for the two reduced models (Models 2 and 3) show no systematic pattern, so the errors appear homoscedastic.

Normality using Q-Q plots

Next, we test the residuals for normality using Q-Q plots. The residuals for the full model as well as the reduced models follow the reference line approximately, so they look normal. The slight skew in the Q-Q plot of the full model disappears in the reduced models. The formal Shapiro-Wilk test for all three models likewise yields large p-values, so we cannot reject the null hypothesis that the residuals are normal.

Checking for correlated errors using plot of successive pairs of residuals

We also check if the errors are uncorrelated in a plot of successive pairs of residuals for each model and observe a random scatter of points above and below the epsilon = 0 line.

Identify leverage points using half-normal plots

Two index values 116 and 122 are identified as leverage points in the full model. It is interesting to note that the leverage points change between the full and reduced models.

Identify influential points using Cook’s distance

We identify influential points in both the full as well as in the reduced models but removal of the largest of the influential points in each model doesn’t change the summary stats meaningfully.

Based on the diagnostic plots and the summary analysis, I decide to use Model 3, refit after removing the largest influential point (lmodr2), as my predictive model. It has an F-statistic of 42.8 and an adjusted R-squared of about 0.47, i.e., it explains roughly 47% of the variance in the response, and its residuals are normally distributed, uncorrelated and homoscedastic. The coefficients are quite similar in both versions of the model.

This model can be written as:

TARGET_WINS = -1.607690 + (0.043919 * TEAM_BATTING_H) + (0.070051 * TEAM_BATTING_HR) + (0.052495 * TEAM_BATTING_BB) + (-0.219777 * TEAM_FIELDING_E)

The original Model 3, with the influential points included, is:

TARGET_WINS = 3.496754 + (0.041461 * TEAM_BATTING_H) + (0.068036 * TEAM_BATTING_HR) + (0.050298 * TEAM_BATTING_BB) + (-0.217805 * TEAM_FIELDING_E)

require(faraway)
## Loading required package: faraway
require(tidyverse)
## Loading required package: tidyverse
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x dplyr::select() masks MASS::select()
# Model 1:
par(mfrow = c(2, 2))
plot(lmodraw)

#Formal test for normality
shapiro.test(residuals(lmodraw))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lmodraw)
## W = 0.99435, p-value = 0.6864
#Plot of successive pairs of residuals to check for serial correlation
n1 <- length(residuals(lmodraw))
plot(tail(residuals(lmodraw), n1-1) ~ head(residuals(lmodraw), n1-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))
  
# Model 2:
par(mfrow = c(2, 2))

plot(lmod2)

#Formal test for normality
shapiro.test(residuals(lmod2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lmod2)
## W = 0.99638, p-value = 0.9342
#Plot of successive pairs of residuals to check for serial correlation
n2 <- length(residuals(lmod2))
plot(tail(residuals(lmod2), n2-1) ~ head(residuals(lmod2), n2-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))

# Model 3:
par(mfrow = c(2, 2))

plot(lmod3)

#Formal test for normality
shapiro.test(residuals(lmod3))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lmod3)
## W = 0.9964, p-value = 0.9356
#Plot of successive pairs of residuals to check for serial correlation
n3 <- length(residuals(lmod3))
plot(tail(residuals(lmod3), n3-1) ~ head(residuals(lmod3), n3-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))

par(mfrow = c(2, 2))

#Check for leverage points using half-normal plots
hatv <- hatvalues(lmodraw)
sum(hatv)
## [1] 16
index <- row.names(datanona)  #labels must align with the rows the model was fit on
halfnorm(hatv, labs = index, ylab = "Leverages")

hatvr1 <- hatvalues(lmod2)
sum(hatvr1)
## [1] 6
index <- row.names(datanona)
halfnorm(hatvr1, labs = index, ylab = "Leverages")

hatvr2 <- hatvalues(lmod3)
sum(hatvr2)
## [1] 5
index <- row.names(datanona)
halfnorm(hatvr2, labs = index, ylab = "Leverages")

## Identify influential points using Cook's distance
cookf <- cooks.distance(lmodraw)
halfnorm(cookf, 3, labs = index, ylab = "Cook's distances")

## Removing the largest influential point and re-fitting the full model
lmodfi <- lm(TARGET_WINS ~ . - INDEX, datanona, subset = (cookf < max(cookf)))
summary(lmodfi)
## 
## Call:
## lm(formula = TARGET_WINS ~ . - INDEX, data = datanona, subset = (cookf < 
##     max(cookf)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8755  -5.9492   0.0424   4.9818  23.2286 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       66.39607   20.30034   3.271  0.00129 ** 
## TEAM_BATTING_H    -1.68340    4.07041  -0.414  0.67970    
## TEAM_BATTING_2B    0.02531    0.03026   0.836  0.40413    
## TEAM_BATTING_3B   -0.09145    0.07783  -1.175  0.24163    
## TEAM_BATTING_HR   42.03680   40.40576   1.040  0.29961    
## TEAM_BATTING_BB   -3.54049    3.71135  -0.954  0.34143    
## TEAM_BATTING_SO   -2.33806    3.42228  -0.683  0.49540    
## TEAM_BASERUN_SB    0.03221    0.02865   1.124  0.26239    
## TEAM_BASERUN_CS   -0.01912    0.07166  -0.267  0.78995    
## TEAM_BATTING_HBP   0.08482    0.04958   1.711  0.08887 .  
## TEAM_PITCHING_H    1.70262    4.06809   0.419  0.67608    
## TEAM_PITCHING_HR -41.95066   40.40574  -1.038  0.30060    
## TEAM_PITCHING_BB   3.59222    3.70880   0.969  0.33410    
## TEAM_PITCHING_SO   2.30442    3.41993   0.674  0.50132    
## TEAM_FIELDING_E   -0.16932    0.04141  -4.089 6.62e-05 ***
## TEAM_FIELDING_DP  -0.10378    0.03668  -2.829  0.00521 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.456 on 174 degrees of freedom
## Multiple R-squared:  0.5501, Adjusted R-squared:  0.5114 
## F-statistic: 14.19 on 15 and 174 DF,  p-value: < 2.2e-16
summary(lmodraw)
## 
## Call:
## lm(formula = TARGET_WINS ~ . - INDEX, data = datanona)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8708  -5.6564  -0.0599   5.2545  22.9274 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      60.28826   19.67842   3.064  0.00253 ** 
## TEAM_BATTING_H    1.91348    2.76139   0.693  0.48927    
## TEAM_BATTING_2B   0.02639    0.03029   0.871  0.38484    
## TEAM_BATTING_3B  -0.10118    0.07751  -1.305  0.19348    
## TEAM_BATTING_HR  -4.84371   10.50851  -0.461  0.64542    
## TEAM_BATTING_BB  -4.45969    3.63624  -1.226  0.22167    
## TEAM_BATTING_SO   0.34196    2.59876   0.132  0.89546    
## TEAM_BASERUN_SB   0.03304    0.02867   1.152  0.25071    
## TEAM_BASERUN_CS  -0.01104    0.07143  -0.155  0.87730    
## TEAM_BATTING_HBP  0.08247    0.04960   1.663  0.09815 .  
## TEAM_PITCHING_H  -1.89096    2.76095  -0.685  0.49432    
## TEAM_PITCHING_HR  4.93043   10.50664   0.469  0.63946    
## TEAM_PITCHING_BB  4.51089    3.63372   1.241  0.21612    
## TEAM_PITCHING_SO -0.37364    2.59705  -0.144  0.88577    
## TEAM_FIELDING_E  -0.17204    0.04140  -4.155 5.08e-05 ***
## TEAM_FIELDING_DP -0.10819    0.03654  -2.961  0.00349 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.467 on 175 degrees of freedom
## Multiple R-squared:  0.5501, Adjusted R-squared:  0.5116 
## F-statistic: 14.27 on 15 and 175 DF,  p-value: < 2.2e-16
## Identify influential points using Cook's distance
cookr1 <- cooks.distance(lmod2)
halfnorm(cookr1, 3, labs = index, ylab = "Cook's distances")

## Removing the largest influential point and re-fitting Model 2
lmodr1 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, datanona, subset = (cookr1 < max(cookr1)))
summary(lmodr1)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona, 
##     subset = (cookr1 < max(cookr1)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.4671  -6.2940  -0.0933   6.0532  22.7230 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.040463  15.586858   0.452 0.652023    
## TEAM_BATTING_H   0.041258   0.011130   3.707 0.000277 ***
## TEAM_BATTING_2B -0.011901   0.030663  -0.388 0.698378    
## TEAM_BATTING_HR  0.069322   0.024333   2.849 0.004887 ** 
## TEAM_BATTING_BB  0.049232   0.009843   5.002 1.32e-06 ***
## TEAM_FIELDING_E -0.210872   0.041065  -5.135 7.14e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.941 on 184 degrees of freedom
## Multiple R-squared:  0.4437, Adjusted R-squared:  0.4286 
## F-statistic: 29.36 on 5 and 184 DF,  p-value: < 2.2e-16
summary(lmod2)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.922  -6.452  -0.105   5.984  23.459 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.4790727 15.6213041   0.223 0.824004    
## TEAM_BATTING_H   0.0412796  0.0112249   3.678 0.000309 ***
## TEAM_BATTING_2B  0.0009869  0.0302629   0.033 0.974019    
## TEAM_BATTING_HR  0.0680351  0.0245324   2.773 0.006118 ** 
## TEAM_BATTING_BB  0.0502670  0.0099137   5.070 9.59e-07 ***
## TEAM_FIELDING_E -0.2177225  0.0412773  -5.275 3.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.017 on 185 degrees of freedom
## Multiple R-squared:  0.4606, Adjusted R-squared:  0.446 
## F-statistic:  31.6 on 5 and 185 DF,  p-value: < 2.2e-16
## Identify influential points using Cook's distance
cookr2 <- cooks.distance(lmod3)
halfnorm(cookr2, 3, labs = index, ylab = "Cook's distances")

## Removing the largest influential point and re-fitting Model 3
lmodr2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_FIELDING_E, datanona, subset = (cookr2 < max(cookr2)))
summary(lmodr2)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona, subset = (cookr2 < 
##     max(cookr2)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.040  -6.082  -0.182   6.013  19.883 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -1.607690  15.428167  -0.104  0.91712    
## TEAM_BATTING_H   0.043919   0.009612   4.569 8.93e-06 ***
## TEAM_BATTING_HR  0.070051   0.024073   2.910  0.00406 ** 
## TEAM_BATTING_BB  0.052495   0.009714   5.404 1.99e-07 ***
## TEAM_FIELDING_E -0.219777   0.040416  -5.438 1.69e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.844 on 185 degrees of freedom
## Multiple R-squared:  0.4808, Adjusted R-squared:  0.4696 
## F-statistic: 42.83 on 4 and 185 DF,  p-value: < 2.2e-16
summary(lmod3)
## 
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + 
##     TEAM_BATTING_BB + TEAM_FIELDING_E, data = datanona)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.9006  -6.4318  -0.1027   5.9876  23.4547 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.496754  15.569914   0.225  0.82255    
## TEAM_BATTING_H   0.041461   0.009730   4.261 3.23e-05 ***
## TEAM_BATTING_HR  0.068036   0.024466   2.781  0.00598 ** 
## TEAM_BATTING_BB  0.050298   0.009843   5.110 7.95e-07 ***
## TEAM_FIELDING_E -0.217805   0.041089  -5.301 3.24e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.993 on 186 degrees of freedom
## Multiple R-squared:  0.4606, Adjusted R-squared:  0.449 
## F-statistic: 39.71 on 4 and 186 DF,  p-value: < 2.2e-16
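# Standard lm diagnostic plots: residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage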
plot(lmodr2)

# Formal test for normality of the residuals (Shapiro-Wilk)
shapiro.test(residuals(lmodr2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lmodr2)
## W = 0.99413, p-value = 0.658
# Plot successive pairs of residuals to check for serial correlation
n1 <- length(residuals(lmodr2))
plot(tail(residuals(lmodr2), n1-1) ~ head(residuals(lmodr2), n1-1), xlab = expression(hat(epsilon)[i]), ylab = expression(hat(epsilon)[i+1]))
abline(h = 0, v = 0, col = grey(0.75))
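
As a formal complement to this lag plot, a Durbin-Watson test could be run. A minimal sketch, assuming the lmtest package is installed (this test was not part of the original analysis):

library(lmtest)
# Durbin-Watson test for first-order serial correlation in the residuals
dwtest(lmodr2)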

Predicting using the evaluation dataset

Quickly checking the evaluation dataset shows that all four variables used in model 3 (TEAM_BATTING_H, TEAM_BATTING_HR, TEAM_BATTING_BB, TEAM_FIELDING_E) are fully populated, so no records need to be omitted and no values imputed before predicting. One caution on the in-sample error: an earlier run reported a training "RMSE" of 3.766439e-17, but a value that close to zero is a symptom of a miscomputed metric rather than a near-perfect fit. As corrected at the end of this appendix, the original rmse helper misplaced a parenthesis and returned the absolute mean residual, which is essentially zero by construction for any least-squares fit; a properly computed training RMSE should be close to the model's residual standard error of roughly 8.8 wins. Model suitability is therefore judged on the residual standard error, adjusted R-squared, and the diagnostics above rather than on that figure.
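
A minimal sketch of that completeness check, assuming eval is the loaded evaluation data frame:

# Count missing values in each model-3 predictor within the evaluation set
mod3_vars <- c("TEAM_BATTING_H", "TEAM_BATTING_HR", "TEAM_BATTING_BB", "TEAM_FIELDING_E")
colSums(is.na(eval[, mod3_vars]))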

# Alternative considered but not used: complete-case prediction with model 3
#datanona1 <- na.omit(eval)
#pred <- predict(lmod3, datanona1); pred

# Predict wins for the evaluation data set with the final model, lmodr2
pred <- predict(lmodr2, eval); pred
##            1            2            3            4            5            6 
##   50.0012944   55.5995995   58.6090268   75.3822188  -68.1895516  -52.4846260 
##            7            8            9           10           11           12 
##  -14.0848461    8.1335131   30.0908206   47.0991426   46.5971615   66.8698795 
##           13           14           15           16           17           18 
##   70.5024735   73.7760966   70.4251387   62.7855826   56.3914682   77.9131603 
##           19           20           21           22           23           24 
##    9.5768320   58.5162293   65.0658055   74.3501110   64.5013321   59.9944363 
##           25           26           27           28           29           30 
##   75.1702661   80.6842483 -200.2121409   35.2047466   54.4304572   46.0331530 
##           31           32           33           34           35           36 
##   76.1649859   70.3660146   78.8960120   77.7327666   71.6671899   73.1922212 
##           37           38           39           40           41           42 
##   65.4306200   80.5855075   76.8088287   76.0737405   77.1701470   88.2528747 
##           43           44           45           46           47           48 
## -294.6391629   11.1048782   12.0066051  -12.8521109   -2.3672910   19.6658677 
##           49           50           51           52           53           54 
##   17.6237585   34.8150175   52.1467066   68.6593008   60.2323821   61.0602726 
##           55           56           57           58           59           60 
##   54.3416655   64.8786676    8.6007146   21.2934681   -2.4783820   17.2490192 
##           61           62           63           64           65           66 
##   50.6513429   71.0035483   78.1042931   84.6061872   78.2615553   27.1260515 
##           67           68           69           70           71           72 
##   -6.7069429    6.2156531   13.9019825   27.1724245   54.8528578   61.1367587 
##           73           74           75           76           77           78 
##   71.6079541   82.5817910   63.1503120   66.7571847   76.2721051   70.3473779 
##           79           80           81           82           83           84 
##   16.8813628   23.6029705   67.0694804   63.5527617   73.0814687   61.2214749 
##           85           86           87           88           89           90 
##   72.7397807   68.4595986   70.8656203   69.9289478   77.2036329   88.3550414 
##           91           92           93           94           95           96 
##   -7.4427898 -174.3823568   -3.1453118   20.0601221   22.7814502   21.5529948 
##           97           98           99          100          101          102 
##   41.4757476   61.3094890   60.0887524   57.5370569   56.0828195   48.1188596 
##          103          104          105          106          107          108 
##   72.3144470   75.8012991   73.7328219  -62.7822319  -93.6578705   76.2222675 
##          109          110          111          112          113          114 
##   84.3004984  -57.9268768   66.4657338   62.9505211   73.8470747   73.3325803 
##          115          116          117          118          119          120 
##   64.5428558   69.4717227   79.5975946   69.2814105   65.4676854  -46.0305921 
##          121          122          123          124          125          126 
##   21.3339584   -4.3322312   10.5831821   10.6720457   24.3789261   45.6454948 
##          127          128          129          130          131          132 
##   54.0799144   30.3023757   60.5339258   71.2203674   64.1490592   67.2850212 
##          133          134          135          136          137          138 
##   53.5524823   69.4232631   88.3889860  -93.4394626   63.7663130   61.8155283 
##          139          140          141          142          143          144 
##   76.2698687   70.1213211   -3.6050432   10.5002843   67.0636591   46.3793442 
##          145          146          147          148          149          150 
##   63.2644371   64.8522601   61.4916768   68.4050162   72.2114528   72.0632739 
##          151          152          153          154          155          156 
##   75.9956895   76.4142336 -256.6521122   47.9124455   69.3066756   53.6493004 
##          157          158          159          160          161          162 
##   82.7776659  -71.0045105  -11.1416971    3.3110633   74.7453283   90.0734493 
##          163          164          165          166          167          168 
##   74.5853010   90.7459880   83.0603790   83.6941638   77.0033410   72.9885112 
##          169          170          171          172          173          174 
##   61.4743937   77.5426980   33.1096604   46.9329008   56.0503903   62.3174032 
##          175          176          177          178          179          180 
##   51.9129863   63.8603396   70.6622460   55.9306416   65.6285557   72.5266031 
##          181          182          183          184          185          186 
##   64.0781709   75.1458501   81.4275786   85.9876818 -154.4116946  -46.6456458 
##          187          188          189          190          191          192 
##  -16.6047742 -158.7449733  -69.5589125   23.8932377  -13.2632746   21.6895666 
##          193          194          195          196          197          198 
##   29.2247249   48.7040804   49.0294725   32.2652183   44.0074669   78.7281775 
##          199          200          201          202          203          204 
##   67.0392095   75.0546241   60.9186607   72.1208539   65.5074181   73.2866397 
##          205          206          207          208          209          210 
##   68.4409120   66.7180424   75.4606135   71.7341535   -9.6641850  -37.0305835 
##          211          212          213          214          215          216 
##   13.8747596   11.3599488   39.7520378   45.7244758   50.9659084   73.3707848 
##          217          218          219          220          221          222 
##   67.5422444   68.7987645   62.1325261   66.9005906   68.5972593   57.0167730 
##          223          224          225          226          227          228 
##   76.8997182   70.6075351 -167.0335413   60.3273146   66.9883537   72.0614822 
##          229          230          231          232          233          234 
##   77.2297320  -76.8029545   31.9416447   59.7577104   73.6512577   75.7162202 
##          235          236          237          238          239          240 
##   59.6114544   54.2542939   62.6478951   68.3299795  -12.3262127   20.6038644 
##          241          242          243          244          245          246 
##   56.9315210   79.4529702   69.3265103   70.0093504   42.2973096   71.3893426 
##          247          248          249          250          251          252 
##   62.6907717   72.5046031   60.2053941   80.8021376   79.3682767 -150.4846510 
##          253          254          255          256          257          258 
##   -0.2079262 -159.6938910   55.2685513   66.8802759   63.2427082   62.3722840 
##          259 
##   66.1951585
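
Several of the predictions above are negative, which is impossible for a win total in a 162-game season and suggests the linear model extrapolates poorly for some evaluation records. A quick sanity check, using pred as computed above (the truncation step is an optional illustration, not part of the original submission):

# Count predictions outside the feasible range of 0 to 162 wins
sum(pred < 0 | pred > 162)
# Optionally truncate to the feasible range before writing out
pred_clipped <- pmin(pmax(pred, 0), 162)
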
# Write predictions to CSV (file path is machine-specific)
write.csv(pred, file = '/Users/tponnada/Downloads/predict.csv')

# RMSE helper. The original definition, sqrt(mean(x - y)^2), squared the mean
# rather than the individual differences, so it returned |mean(x - y)| -- the
# absolute mean residual, which is essentially zero for any OLS fit. The
# parentheses below compute the true root mean squared error.
rmse <- function(x, y) sqrt(mean((x - y)^2))

# Training RMSE. lmodr2 was fit with the largest-Cook's-distance observation
# removed, so its fitted values are compared against the response from its own
# model frame; comparing against all of datanona$TARGET_WINS triggered a
# length-mismatch warning in the original run.
rmse(fitted(lmodr2), model.response(model.frame(lmodr2)))

# The evaluation file has no TARGET_WINS column, so an out-of-sample RMSE
# cannot be computed from it (the original call returned NaN).
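
Since the evaluation file lacks the target, out-of-sample error can instead be approximated with a holdout split of the training data. A minimal sketch, assuming datanona and the corrected rmse helper above; the 80/20 split and the seed are arbitrary choices, not part of the original analysis:

# Hold out 20% of the training rows and estimate RMSE on them
set.seed(123)
train_idx <- sample(nrow(datanona), floor(0.8 * nrow(datanona)))
holdout_fit <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
                    TEAM_BATTING_BB + TEAM_FIELDING_E, datanona[train_idx, ])
rmse(predict(holdout_fit, datanona[-train_idx, ]),
     datanona$TARGET_WINS[-train_idx])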