Introduction

The purpose of this assignment is to develop a linear model that can “reliably” predict a team’s number of wins (TARGET_WINS) from variables covering BATTING, BASERUN(ning), FIELDING, and PITCHING. Some of the variables have a positive impact on TARGET_WINS, while others have a negative impact.

Data Exploration

## Observations: 2,276
## Variables: 17
## $ INDEX            <int> 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 15, 16, 1...
## $ TARGET_WINS      <int> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 6...
## $ TEAM_BATTING_H   <int> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 127...
## $ TEAM_BATTING_2B  <int> 194, 219, 232, 209, 186, 200, 179, 171, 197, ...
## $ TEAM_BATTING_3B  <int> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 3...
## $ TEAM_BATTING_HR  <int> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96,...
## $ TEAM_BATTING_BB  <int> 143, 685, 602, 451, 472, 443, 525, 456, 447, ...
## $ TEAM_BATTING_SO  <int> 842, 1075, 917, 922, 920, 973, 1062, 1027, 92...
## $ TEAM_BASERUN_SB  <int> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, ...
## $ TEAM_BASERUN_CS  <int> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 7...
## $ TEAM_BATTING_HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ TEAM_PITCHING_H  <int> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 128...
## $ TEAM_PITCHING_HR <int> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96,...
## $ TEAM_PITCHING_BB <int> 927, 689, 602, 454, 472, 443, 525, 459, 447, ...
## $ TEAM_PITCHING_SO <int> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 9...
## $ TEAM_FIELDING_E  <int> 1011, 193, 175, 164, 138, 123, 136, 112, 127,...
## $ TEAM_FIELDING_DP <int> NA, 155, 153, 156, 168, 149, 186, 136, 169, 1...

Using glimpse, we can see that there are 2,276 observations and 17 variables in our training dataset. Of the 17 variables, INDEX provides no value other than labelling each observation, so it will be removed in the Data Preparation section. Also of note, there is no variable for singles, so the variable TEAM_BATTING_1B will be created in the Data Preparation section.

Non-Visual Inspection

Variables Breakdown

  • Response Variable: TARGET_WINS
  • Explanatory Variables:
    • 7 Batting variables
    • 4 Pitching variables
    • 2 Baserunning variables
    • 2 Fielding variables

Basic Stats

  • basicStats shows that TEAM_BATTING_HBP has the largest number of missing values
  • There is very strong skew in the TEAM_PITCHING_SO variable
  • TEAM_PITCHING_H has the highest variance amongst the variables

Missing Values

The 6 fields shown in the table above have missing values.
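Missing-value counts of this kind can also be produced with base R alone. The sketch below is only an illustration: `toy` is a made-up data frame whose column names mimic the real ones, not the moneyball data itself.

```r
# Made-up data frame standing in for the training data (values are illustrative)
toy <- data.frame(
  TEAM_BATTING_H   = c(1445, 1339, 1377, 1387),
  TEAM_BASERUN_SB  = c(NA, 37, 46, 43),
  TEAM_BATTING_HBP = c(NA, NA, NA, 52)
)

# Count NAs per column, keep only columns with at least one NA,
# and sort from most to fewest missing values
na_counts <- colSums(is.na(toy))
na_counts <- sort(na_counts[na_counts > 0], decreasing = TRUE)
print(na_counts)

# Share of rows missing per column, as a percentage
na_pct <- round(100 * na_counts / nrow(toy), 1)
print(na_pct)
```

Expressing the counts as a percentage of rows makes it easier to decide which fields are too sparse to impute.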

Skew

The 4 fields shown in the table above have higher-than-average skew values, which is evidence of outliers strongly affecting the mean of those fields.
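To make the link between outliers, the mean, and skew concrete, here is a minimal sketch in base R. The `skew()` helper is a hypothetical name and uses one common moment-based definition, which may differ slightly from the estimator fBasics reports; the numbers are illustrative values in the style of TEAM_PITCHING_H.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5
skew <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Illustrative values; the second vector adds one extreme outlier
without_outlier <- c(1347, 1377, 1396, 1297, 1279, 1244)
with_outlier    <- c(without_outlier, 9364)

summarise_vec <- function(x) c(mean = mean(x), median = median(x), skew = skew(x))
print(round(summarise_vec(without_outlier), 2))
print(round(summarise_vec(with_outlier), 2))
```

A single extreme value moves the mean far above the median and produces a large positive skew, while the median barely changes.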

Correlation

  • The correlations tell us that hits have the strongest positive association with winning games
  • There is some collinearity between some of the variables, especially the BATTING variables
  • FIELDING_E has the strongest negative association with TARGET_WINS
  • TEAM_BATTING_3B has an anomalous negative correlation with TARGET_WINS
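The two checks described above (ranking predictors by correlation with the response, and looking for collinear predictor pairs) can be sketched as follows. The data are simulated, the variable names are hypothetical, and the 0.7 cutoff is an arbitrary assumption chosen for illustration.

```r
set.seed(1)
n <- 200
hits    <- rnorm(n, 1450, 100)
doubles <- 0.15 * hits + rnorm(n, 0, 10)   # constructed to be collinear with hits
errors  <- rnorm(n, 150, 30)
wins    <- 0.05 * hits - 0.1 * errors + rnorm(n, 0, 5)
toy <- data.frame(wins, hits, doubles, errors)

# Correlation of every predictor with the response, strongest positive first
r <- cor(toy, use = "complete.obs")
target_cor <- sort(r[rownames(r) != "wins", "wins"], decreasing = TRUE)
print(round(target_cor, 2))

# Predictor pairs with |r| above the cutoff hint at collinearity
pred <- r[rownames(r) != "wins", colnames(r) != "wins"]
high <- which(abs(pred) > 0.7 & upper.tri(pred), arr.ind = TRUE)
collinear_pairs <- data.frame(
  a = rownames(pred)[high[, "row"]],
  b = colnames(pred)[high[, "col"]]
)
print(collinear_pairs)
```

Correlation measures association only, so a strong value here does not by itself show that a variable causes wins or losses.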

Visual Inspection

Density Plots

## No id variables; using all as measure variables

  • The density plots show various issues with skew and non-normality

Box Plots

Box plots can provide a visual representation of the variance of the data.

  • The box plots reveal that a great majority of the explanatory variables have high variances
  • Some of the variables contain extreme outliers that this graph does not show because I had to reduce the limits on the graph to get clear box plots
  • Many of the medians and means are also not aligned, which demonstrates the outliers’ effects

Histograms

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  • The histograms reveal that very few of the variables are normally distributed
  • A few variables are multi-modal
  • Some of the variables exhibit a lot of skew (e.g. BASERUN_SB)

Data Preparation

New Variable for Singles

  • The variable TEAM_BATTING_1B was created and added to the dataset
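As a worked example of the calculation, singles are total hits minus doubles, triples, and home runs. The sketch below applies it in base R to the first three rows shown in the glimpse output; the data frame name `toy` is an assumption.

```r
# First three rows of the relevant columns, copied from the glimpse output
toy <- data.frame(
  TEAM_BATTING_H  = c(1445, 1339, 1377),
  TEAM_BATTING_2B = c(194, 219, 232),
  TEAM_BATTING_3B = c(39, 22, 35),
  TEAM_BATTING_HR = c(13, 190, 137)
)

# Singles = hits - doubles - triples - home runs
toy$TEAM_BATTING_1B <- with(
  toy,
  TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR
)
print(toy$TEAM_BATTING_1B)  # 1199 908 973
```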

Variable Removal

  • Our dataset will be reduced by removing the fields with a large number of NA values (BATTING_HBP, BASERUN_CS)
  • BATTING_H will be removed to reduce collinearity and replaced by BATTING_1B, which is calculated as BATTING_H - BATTING_2B - BATTING_3B - BATTING_HR
  • INDEX is removed because it has no meaning in the dataset

Missing Values Handling

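missForest will be used to impute all missing data. It fits a random forest to the observed values of each variable and uses it to predict that variable’s missing values, repeating until the imputed values stabilize. The NRMSE printed after the summary is missForest’s out-of-bag estimate of the imputation error.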

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
##   TARGET_WINS       BATTING_2B      BATTING_3B       BATTING_HR    
##  Min.   :  0.00   Min.   : 69.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:208.0   1st Qu.: 34.00   1st Qu.: 42.00  
##  Median : 82.00   Median :238.0   Median : 47.00   Median :102.00  
##  Mean   : 80.79   Mean   :241.2   Mean   : 55.25   Mean   : 99.61  
##  3rd Qu.: 92.00   3rd Qu.:273.0   3rd Qu.: 72.00   3rd Qu.:147.00  
##  Max.   :146.00   Max.   :458.0   Max.   :223.00   Max.   :264.00  
##    BATTING_BB      BATTING_SO       BASERUN_SB      PITCHING_H   
##  Min.   :  0.0   Min.   :   0.0   Min.   :  0.0   Min.   : 1137  
##  1st Qu.:451.0   1st Qu.: 553.8   1st Qu.: 67.0   1st Qu.: 1419  
##  Median :512.0   Median : 733.6   Median :105.0   Median : 1518  
##  Mean   :501.6   Mean   : 731.1   Mean   :131.4   Mean   : 1779  
##  3rd Qu.:580.0   3rd Qu.: 925.0   3rd Qu.:168.0   3rd Qu.: 1682  
##  Max.   :878.0   Max.   :1399.0   Max.   :697.0   Max.   :30132  
##   PITCHING_HR     PITCHING_BB      PITCHING_SO        FIELDING_E    
##  Min.   :  0.0   Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0   1st Qu.: 476.0   1st Qu.:  622.8   1st Qu.: 127.0  
##  Median :107.0   Median : 536.5   Median :  800.0   Median : 159.0  
##  Mean   :105.7   Mean   : 553.0   Mean   :  812.5   Mean   : 246.5  
##  3rd Qu.:150.0   3rd Qu.: 611.0   3rd Qu.:  957.2   3rd Qu.: 249.2  
##  Max.   :343.0   Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##   FIELDING_DP      BATTING_1B    
##  Min.   : 52.0   Min.   : 709.0  
##  1st Qu.:124.0   1st Qu.: 990.8  
##  Median :145.0   Median :1050.0  
##  Mean   :142.5   Mean   :1073.2  
##  3rd Qu.:161.2   3rd Qu.:1129.0  
##  Max.   :228.0   Max.   :2112.0
##     NRMSE 
## 0.1690248
  • Through the summary() function, we can see that none of the fields have missing values any longer

Build Models

Model 1: All Variables

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = training.imp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.228  -8.368   0.245   8.169  56.129 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.1338541  5.2295323   6.718 2.32e-11 ***
## BATTING_2B   0.0269828  0.0071176   3.791 0.000154 ***
## BATTING_3B   0.0703391  0.0157969   4.453 8.89e-06 ***
## BATTING_HR   0.1093807  0.0266965   4.097 4.33e-05 ***
## BATTING_BB   0.0122318  0.0056539   2.163 0.030612 *  
## BATTING_SO  -0.0163691  0.0025608  -6.392 1.98e-10 ***
## BASERUN_SB   0.0513281  0.0045031  11.398  < 2e-16 ***
## PITCHING_H   0.0001662  0.0003695   0.450 0.652948    
## PITCHING_HR  0.0202085  0.0236584   0.854 0.393097    
## PITCHING_BB -0.0038747  0.0040465  -0.958 0.338407    
## PITCHING_SO  0.0025331  0.0008926   2.838 0.004579 ** 
## FIELDING_E  -0.0343813  0.0025215 -13.635  < 2e-16 ***
## FIELDING_DP -0.1270072  0.0135170  -9.396  < 2e-16 ***
## BATTING_1B   0.0444243  0.0036077  12.314  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.71 on 2262 degrees of freedom
## Multiple R-squared:  0.353,  Adjusted R-squared:  0.3493 
## F-statistic: 94.94 on 13 and 2262 DF,  p-value: < 2.2e-16
  • Model 1 has 10/13 statistically significant variables at the 5% significance level (all except PITCHING_H, PITCHING_HR, and PITCHING_BB)
  • Interestingly, FIELDING_DP has a negative impact on wins. This may be because it would mean the opposing team is getting hits
  • The rest of the variables align as one would expect to win contribution
    • BATTING_SO, PITCHING_BB, FIELDING_E have negative impacts on TARGET_WINS as expected
    • PITCHING_HR surprisingly has a positive impact on TARGET_WINS. It is possible there is a confounding variable affecting this coefficient
    • BATTING_HR has the largest positive coefficient, as one would expect
  • This model also shows that giving up home runs is not as big a detriment as one may think
  • An R^2 of .353 indicates that there may be room to improve the model
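Counts of significant predictors and the R^2 values can be read directly from the fitted model object rather than from the printed summary. The sketch below shows the idea on R's built-in `mtcars` data, which stands in for the imputed training set purely for illustration.

```r
# Fit a model with all predictors (mtcars is a stand-in dataset)
fit <- lm(mpg ~ ., data = mtcars)

# Coefficient matrix: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients
p <- coefs[-1, "Pr(>|t|)"]          # drop the intercept row

significant <- names(p)[p < 0.05]
cat(length(significant), "of", length(p),
    "predictors are significant at the 5% level\n")

# R^2 and adjusted R^2 taken from the same summary object
fit_stats <- c(r2 = summary(fit)$r.squared,
               adj_r2 = summary(fit)$adj.r.squared)
print(round(fit_stats, 4))
```

Reading the values programmatically avoids transcription slips between the printed output and the written commentary.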

Model 2: Only Significant Variables

## 
## Call:
## lm(formula = TARGET_WINS ~ . - BATTING_BB - PITCHING_H - PITCHING_HR - 
##     PITCHING_BB - PITCHING_SO, data = training.imp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.973  -8.425   0.217   8.357  58.993 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.325318   4.873133   8.070 1.13e-15 ***
## BATTING_2B   0.031075   0.007020   4.427 1.00e-05 ***
## BATTING_3B   0.070845   0.015513   4.567 5.22e-06 ***
## BATTING_HR   0.132679   0.008206  16.168  < 2e-16 ***
## BATTING_SO  -0.014136   0.002286  -6.183 7.42e-10 ***
## BASERUN_SB   0.053578   0.004156  12.892  < 2e-16 ***
## FIELDING_E  -0.034946   0.001749 -19.985  < 2e-16 ***
## FIELDING_DP -0.119045   0.013309  -8.944  < 2e-16 ***
## BATTING_1B   0.042590   0.003516  12.112  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.74 on 2267 degrees of freedom
## Multiple R-squared:  0.3477, Adjusted R-squared:  0.3454 
## F-statistic: 151.1 on 8 and 2267 DF,  p-value: < 2.2e-16
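  • Model 2 removes PITCHING_H, PITCHING_HR, and PITCHING_BB, which are not significant in Model 1, along with BATTING_BB and PITCHING_SO
  • The R^2 is lower than Model 1’s at .3477 (adjusted .3454); however, this may be acceptable given that five predictors were removed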
  • FIELDING_DP still has a negative impact on wins
  • The rest of the variables align as one would expect to win contribution
    • BATTING_SO, FIELDING_E have negative impacts on TARGET_WINS as expected
    • BATTING_HR has the highest impact on wins as one would expect
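Whether dropping predictors costs a meaningful amount of fit can be checked with a partial F-test between the reduced and the full model. This is a sketch on the built-in `mtcars` data, not the author's procedure; the choice of predictors is arbitrary.

```r
# Full model and a reduced model nested inside it (mtcars is a stand-in)
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Partial F-test: does the full model explain significantly more variance?
cmp <- anova(reduced, full)
print(cmp)

# A large p-value suggests the dropped predictors add little
p_value <- cmp[["Pr(>F)"]][2]
print(p_value)

# Adjusted R^2 penalises extra predictors, so it is the fairer comparison
adj <- c(reduced = summary(reduced)$adj.r.squared,
         full    = summary(full)$adj.r.squared)
print(round(adj, 4))
```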

Model 3: Highly Correlated Variables

## 
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_H + BATTING_BB + PITCHING_BB + 
##     PITCHING_HR + BATTING_HR + BATTING_2B + BATTING_1B, data = training.imp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.559  -9.075   0.210   9.073  47.821 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0022694  3.5289286  -0.001 0.999487    
## PITCHING_H  -0.0027828  0.0003394  -8.198 4.04e-16 ***
## BATTING_BB   0.0087439  0.0047829   1.828 0.067656 .  
## PITCHING_BB  0.0120497  0.0031610   3.812 0.000142 ***
## PITCHING_HR  0.0033868  0.0240691   0.141 0.888110    
## BATTING_HR   0.0587833  0.0265321   2.216 0.026821 *  
## BATTING_2B   0.0373918  0.0075320   4.964 7.41e-07 ***
## BATTING_1B   0.0554074  0.0030795  17.992  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.83 on 2268 degrees of freedom
## Multiple R-squared:  0.232,  Adjusted R-squared:  0.2296 
## F-statistic: 97.86 on 7 and 2268 DF,  p-value: < 2.2e-16
  • Of the three models, Model 3 has the lowest R^2 at .232
  • Model 3 also reintroduces variables that are not statistically significant
  • BATTING_3B is excluded from this model because it had a negative correlation with TARGET_WINS
  • PITCHING_H is the only variable that has a negative impact on TARGET_WINS
  • The rest of the variables align as one would expect to win contribution
    • BATTING_HR has the highest impact on wins as one would expect
    • BATTING_1B has the second highest impact on wins, a lot more than BATTING_2B which is unexpected
    • PITCHING_HR has a very small positive impact on wins. This is counterintuitive, as giving up home runs should increase the chance of losing (decrease the chance of winning)

Model 4: Log Transformation of All Variables

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = log.training.imp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19757 -0.10250  0.00983  0.11102  1.13168 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.968571   0.495831   3.970 7.40e-05 ***
## BATTING_2B   0.210777   0.027546   7.652 2.91e-14 ***
## BATTING_3B   0.084437   0.011491   7.348 2.80e-13 ***
## BATTING_HR   0.061957   0.040872   1.516  0.12969    
## BATTING_BB  -0.226663   0.075563  -3.000  0.00273 ** 
## BATTING_SO  -0.448545   0.082954  -5.407 7.08e-08 ***
## BASERUN_SB   0.094801   0.007841  12.090  < 2e-16 ***
## PITCHING_H  -0.626786   0.065329  -9.594  < 2e-16 ***
## PITCHING_HR  0.039215   0.036149   1.085  0.27811    
## PITCHING_BB  0.357020   0.070431   5.069 4.32e-07 ***
## PITCHING_SO  0.348896   0.069819   4.997 6.26e-07 ***
## FIELDING_E  -0.213173   0.014975 -14.235  < 2e-16 ***
## FIELDING_DP -0.289188   0.027421 -10.546  < 2e-16 ***
## BATTING_1B   1.003225   0.066876  15.001  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1792 on 2262 degrees of freedom
## Multiple R-squared:  0.4393, Adjusted R-squared:  0.4361 
## F-statistic: 136.4 on 13 and 2262 DF,  p-value: < 2.2e-16
  • Model 4 has the highest R^2 at .4393 (adjusted .4361)
  • Negative impact on wins
    • BATTING_BB, BATTING_SO, PITCHING_H, FIELDING_E, FIELDING_DP
    • The anomalies are BATTING_BB and FIELDING_DP, for which one would expect a positive coefficient
  • Positive impact on wins
    • BATTING_1B has the highest impact on TARGET_WINS
    • PITCHING_BB has an anomalously high impact on TARGET_WINS. Giving up bases should have a negative impact on wins. It is possible this is done to hitters that are particularly dangerous when being pitched straight up to
  • The home run variables are not statistically significant
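Several fields have a minimum of 0, and log(0) is -Inf, which is why the appendix clamps negative log values to 0 with remove_negInf(). The sketch below reproduces that rule on made-up counts and contrasts it with `log1p()`, which is shown only as an alternative and is not what the report uses.

```r
x <- c(0, 1, 13, 190)   # made-up counts, including a zero

raw <- log(x)
print(raw)              # the first element is -Inf

# Same rule as remove_negInf() in the appendix: negative values become 0
clamped <- ifelse(raw < 0, 0, raw)
print(clamped)

# Alternative (not used in the report): log(1 + x) is finite at zero
shifted <- log1p(x)
print(round(shifted, 3))
```

One side effect of clamping is that counts of 0 and 1 both map to 0 after the transform, whereas `log1p()` keeps them distinct.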

Model 5: Square Root Transform

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = sq.training.imp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9077 -0.4649  0.0193  0.4798  2.9646 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.997850   0.616711   8.104 8.61e-16 ***
## BATTING_2B   0.046920   0.013088   3.585 0.000344 ***
## BATTING_3B   0.085895   0.013982   6.143 9.52e-10 ***
## BATTING_HR   0.122499   0.037217   3.292 0.001012 ** 
## BATTING_BB  -0.005277   0.024159  -0.218 0.827124    
## BATTING_SO  -0.085678   0.012879  -6.653 3.60e-11 ***
## BASERUN_SB   0.084028   0.006548  12.832  < 2e-16 ***
## PITCHING_H  -0.016636   0.004786  -3.476 0.000519 ***
## PITCHING_HR  0.018839   0.033135   0.569 0.569711    
## PITCHING_BB  0.018206   0.020442   0.891 0.373235    
## PITCHING_SO  0.035931   0.008745   4.109 4.12e-05 ***
## FIELDING_E  -0.102975   0.006904 -14.915  < 2e-16 ***
## FIELDING_DP -0.201195   0.018958 -10.613  < 2e-16 ***
## BATTING_1B   0.180457   0.014793  12.199  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7304 on 2262 degrees of freedom
## Multiple R-squared:  0.3881, Adjusted R-squared:  0.3846 
## F-statistic: 110.4 on 13 and 2262 DF,  p-value: < 2.2e-16
  • Model 5 has an R^2 of .3881 (adjusted .3846)
  • Negative impact on wins
    • BATTING_BB, BATTING_SO, PITCHING_H, FIELDING_E, FIELDING_DP
    • The anomalies are BATTING_BB and FIELDING_DP, for which one would expect a positive coefficient
  • Positive impact on wins
    • BATTING_1B has the highest impact on TARGET_WINS
  • PITCHING_HR, PITCHING_BB, and BATTING_BB are not statistically significant

Select Models

Based on R^2, Model 4 is the best predictor of TARGET_WINS, with an R^2 of .4393. Model 3’s R^2 (.232) was far too low, and it reintroduced statistically insignificant variables. Model 1 provides a good benchmark (.353) that Model 2 comes close to matching (.3477). Model 5 achieved an R^2 of only .3881. Note that Models 4 and 5 are fit to a transformed TARGET_WINS, so their R^2 values are not on the same scale as those of Models 1 to 3.
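Because a model fit to log-transformed wins reports R^2 on the log scale, one way to compare it with an untransformed model is to back-transform its fitted values and score both on the original scale. This is a minimal sketch on the built-in `mtcars` data; `rmse()` is a hypothetical helper, and RMSE is an assumed choice of metric.

```r
# Untransformed and log-log models of the same response (mtcars is a stand-in)
lin  <- lm(mpg ~ wt, data = mtcars)
logm <- lm(log(mpg) ~ log(wt), data = mtcars)

# Root mean squared error on whatever scale the inputs are in
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_lin <- rmse(mtcars$mpg, fitted(lin))
rmse_log <- rmse(mtcars$mpg, exp(fitted(logm)))  # back-transform before scoring

print(round(c(linear = rmse_lin, log_log = rmse_log), 3))
```

Both errors are now in the units of the response, so the comparison is like for like.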

Evaluation

  • The QQ plot shows slight deviation from normality at the extremes; however, this can be excused given the large number of observations
  • The residual plot indicates that the variance is not constant
  • The histogram shows that the residuals are approximately normally distributed
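Plot-based diagnostics can be complemented with simple numeric checks. The sketch below, on the built-in `mtcars` data, uses a Shapiro-Wilk test for normality and a rank correlation between absolute residuals and fitted values as a rough, informal indicator of non-constant variance; neither check is part of the original analysis.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # stand-in model
res <- residuals(fit)

# Normality: a small p-value suggests the residuals are not normal
sw <- shapiro.test(res)

# Rough constant-variance check: if |residual| trends with the fitted
# value, the spread of the residuals is probably not constant
het <- cor.test(abs(res), fitted(fit), method = "spearman", exact = FALSE)

checks <- c(shapiro_p    = sw$p.value,
            spearman_rho = unname(het$estimate),
            spearman_p   = het$p.value)
print(round(checks, 4))
```

With thousands of observations, tests like these flag even tiny departures, so they are best read alongside the plots rather than instead of them.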

Test Model

##      INDEX      TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :   9   Min.   : 819   Min.   : 44.0   Min.   : 14.00  
##  1st Qu.: 708   1st Qu.:1387   1st Qu.:210.0   1st Qu.: 35.00  
##  Median :1249   Median :1455   Median :239.0   Median : 52.00  
##  Mean   :1264   Mean   :1469   Mean   :241.3   Mean   : 55.91  
##  3rd Qu.:1832   3rd Qu.:1548   3rd Qu.:278.5   3rd Qu.: 72.00  
##  Max.   :2525   Max.   :2170   Max.   :376.0   Max.   :155.00  
##                                                                
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   : 15.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 44.50   1st Qu.:436.5   1st Qu.: 545.0   1st Qu.: 59.0  
##  Median :101.00   Median :509.0   Median : 686.0   Median : 92.0  
##  Mean   : 95.63   Mean   :499.0   Mean   : 709.3   Mean   :123.7  
##  3rd Qu.:135.50   3rd Qu.:565.5   3rd Qu.: 912.0   3rd Qu.:151.8  
##  Max.   :242.00   Max.   :792.0   Max.   :1268.0   Max.   :580.0  
##                                   NA's   :18       NA's   :13     
##  TEAM_BASERUN_CS  TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
##  Min.   :  0.00   Min.   :42.00    Min.   : 1155   Min.   :  0.0   
##  1st Qu.: 38.00   1st Qu.:53.50    1st Qu.: 1426   1st Qu.: 52.0   
##  Median : 49.50   Median :62.00    Median : 1515   Median :104.0   
##  Mean   : 52.32   Mean   :62.37    Mean   : 1813   Mean   :102.1   
##  3rd Qu.: 63.00   3rd Qu.:67.50    3rd Qu.: 1681   3rd Qu.:142.5   
##  Max.   :154.00   Max.   :96.00    Max.   :22768   Max.   :336.0   
##  NA's   :87       NA's   :240                                      
##  TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   : 136.0   Min.   :   0.0   Min.   :  73.0   Min.   : 69.0   
##  1st Qu.: 471.0   1st Qu.: 613.0   1st Qu.: 131.0   1st Qu.:131.0   
##  Median : 526.0   Median : 745.0   Median : 163.0   Median :148.0   
##  Mean   : 552.4   Mean   : 799.7   Mean   : 249.7   Mean   :146.1   
##  3rd Qu.: 606.5   3rd Qu.: 938.0   3rd Qu.: 252.0   3rd Qu.:164.0   
##  Max.   :2008.0   Max.   :9963.0   Max.   :1568.0   Max.   :204.0   
##                   NA's   :18                        NA's   :31
  • INDEX can be removed from the data
  • The NA values will be handled with missForest similar to our training set
  • BATTING_1B will be added and BATTING_H removed
  • TEAM_BATTING_HBP, TEAM_BASERUN_CS will be removed
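Since the test set must go through exactly the same steps as the training set, one option is to wrap the steps in a single function and apply it to both. The sketch below uses base R and a hypothetical helper named `prep()`; the two-row data frames are illustrative.

```r
# Hypothetical helper: applies the preparation steps listed above
prep <- function(df) {
  # singles = hits minus doubles, triples, and home runs
  df$TEAM_BATTING_1B <- df$TEAM_BATTING_H - df$TEAM_BATTING_2B -
    df$TEAM_BATTING_3B - df$TEAM_BATTING_HR
  # drop the label, total hits, and the two sparse fields
  drop <- c("INDEX", "TEAM_BATTING_H", "TEAM_BATTING_HBP", "TEAM_BASERUN_CS")
  df <- df[, !(names(df) %in% drop), drop = FALSE]
  # strip the 'TEAM_' prefix
  names(df) <- sub("^TEAM_", "", names(df))
  df
}

train <- data.frame(
  INDEX = 1:2, TARGET_WINS = c(39, 70),
  TEAM_BATTING_H = c(1445, 1339), TEAM_BATTING_2B = c(194, 219),
  TEAM_BATTING_3B = c(39, 22), TEAM_BATTING_HR = c(13, 190),
  TEAM_BATTING_HBP = c(NA, NA), TEAM_BASERUN_CS = c(NA, 28)
)
test <- train[, names(train) != "TARGET_WINS"]   # test data has no response

train_p <- prep(train)
test_p  <- prep(test)

# The only column the two should differ by is the response
print(setdiff(names(train_p), names(test_p)))
```

Keeping the steps in one place prevents the training and test columns from drifting apart, which would otherwise surface as an error or silent mismatch at prediction time.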

Transform Test Data

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
##   missForest iteration 6 in progress...done!

Predict

Code Appendix

knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(tidy = TRUE)
knitr::opts_chunk$set(warning = FALSE)

libs <- c("tidyverse", "magrittr", "knitr", "kableExtra", "fBasics", "reshape2", 
    "missForest")

loadPkg <- function(x) {
    if (!require(x, character.only = T)) 
        install.packages(x, dependencies = T)
    require(x, character.only = T)
}

lapply(libs, loadPkg)
# load data
trainingdata <- read_csv("https://raw.githubusercontent.com/baroncurtin2/data621/master/week1/moneyball-training-data.csv", 
    col_names = T, col_types = NULL, na = c("", "NA"))
testdata <- read_csv("https://raw.githubusercontent.com/baroncurtin2/data621/master/week1/moneyball-evaluation-data.csv", 
    col_names = T, col_types = NULL, na = c("", "NA"))
glimpse(trainingdata)
data_frame(variables = names(trainingdata)) %>% mutate(variables = str_replace(variables, 
    "^([[:alnum:]]+?_{1})([[:alnum:]]+?)(_{1}[[:alnum:]]+?)$", "\\2")) %>% group_by(variables) %>% 
    summarise(count = n()) %>% arrange(desc(count))
# rownames_to_column() runs before as_tibble() because recent tibble versions
# drop row names on conversion, which would lose the statistic names
trainingStats <- basicStats(trainingdata)[c("nobs", "NAs", "Minimum", "Maximum", 
    "1. Quartile", "3. Quartile", "Mean", "Median", "Variance", "Stdev", "Skewness", 
    "Kurtosis"), ] %>% rownames_to_column() %>% as_tibble() %>% gather(var, 
    value, -rowname) %>% spread(rowname, value) %>% rename_all(str_to_lower) %>% 
    rename_all(str_trim) %>% rename(variables = "var", q1 = `1. quartile`, q3 = `3. quartile`, 
    max = maximum, min = minimum, na_vals = nas, n = nobs, sd = stdev, var = variance) %>% 
    mutate(obs = n - na_vals, range = max - min, iqr = q3 - q1) %>% select(variables, 
    n, na_vals, obs, mean, min, q1, median, q3, max, sd, var, range, iqr, skewness, 
    kurtosis) %>% as_tibble()

trainingStats
trainingStats %>% dplyr::filter(na_vals > 0) %>% select(variables, na_vals, 
    obs) %>% arrange(desc(na_vals))
trainingStats %>% mutate(mean = mean(skewness)) %>% dplyr::filter(skewness > 
    mean(skewness)) %>% select(variables, skewness, mean)
trainingdata %>% mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - 
    TEAM_BATTING_3B - TEAM_BATTING_HR) %>% cor(use = "na.or.complete") %>% as.data.frame() %>% 
    rownames_to_column(var = "predictor") %>% as_data_frame() %>% select(predictor, 
    TARGET_WINS) %>% dplyr::filter(!predictor %in% c("INDEX", "TARGET_WINS")) %>% 
    arrange(desc(TARGET_WINS))
# data frame for visuals
vis <- melt(trainingdata) %>% dplyr::filter(variable != "INDEX") %>% mutate(variable = str_replace(variable, 
    "TEAM_", ""))

ggplot(vis, aes(value)) + geom_density(fill = "skyblue") + facet_wrap(~variable, 
    scales = "free")
ggplot(vis, aes(x = variable, y = value)) + geom_boxplot(show.legend = T) + 
    stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) + 
    coord_flip() + ylim(0, 2200)
ggplot(vis, aes(value)) + geom_histogram() + facet_wrap(~variable, scales = "free")
remove_string <- function(x, remove) {
    str_replace(x, remove, "")
}

training <- trainingdata %>%
    # singles = hits minus doubles, triples, and home runs
    mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - 
        TEAM_BATTING_HR) %>%
    # remove the 'TEAM_' prefix
    rename_all(remove_string, remove = "TEAM_")
training %<>%
    # remove fields with a large number of NAs
    select(-c("BATTING_HBP", "BASERUN_CS")) %>%
    # remove total hits to reduce collinearity
    select(-BATTING_H) %>%
    # remove INDEX
    select(-INDEX)
training.forest <- training %>% as.data.frame() %>% missForest()

training.imp <- training.forest$ximp
# imputed values
summary(training.imp)

# imputation error
training.forest$OOBerror
m1 <- lm(TARGET_WINS ~ ., data = training.imp)
summary(m1)
m2 <- lm(TARGET_WINS ~ . - BATTING_BB - PITCHING_H - PITCHING_HR - PITCHING_BB - 
    PITCHING_SO, data = training.imp)
summary(m2)
trainingdata %>% mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - 
    TEAM_BATTING_3B - TEAM_BATTING_HR) %>% cor(use = "na.or.complete") %>% as.data.frame() %>% 
    rownames_to_column(var = "predictor") %>% as_data_frame() %>% select(predictor, 
    TARGET_WINS) %>% dplyr::filter(!predictor %in% c("INDEX", "TARGET_WINS")) %>% 
    dplyr::filter(TARGET_WINS > mean(TARGET_WINS)) %>% arrange(desc(TARGET_WINS))

m3 <- lm(TARGET_WINS ~ PITCHING_H + BATTING_BB + PITCHING_BB + PITCHING_HR + 
    BATTING_HR + BATTING_2B + BATTING_1B, data = training.imp)
summary(m3)
remove_negInf <- function(x) {
    if_else(x < 0, 0, x)
}

log.training.imp <- training.imp %>%
    # log transform every column
    mutate_all(log) %>%
    # log(0) is -Inf; remove_negInf() sets it (and any other negative log) to 0
    mutate_all(remove_negInf)

m4 <- lm(TARGET_WINS ~ ., data = log.training.imp)
summary(m4)
sq.training.imp <- training.imp %>%
    # square-root transform every column
    mutate_all(sqrt)

m5 <- lm(TARGET_WINS ~ ., data = sq.training.imp)
summary(m5)
par(mfrow = c(2, 2))

plot(m4)
hist(m4$residuals)
qqnorm(m4$residuals)
qqline(m4$residuals)
summary(testdata)
testdata %<>%
    # drop the variables that were removed from the training data
    select(-c("INDEX", "TEAM_BATTING_HBP", "TEAM_BASERUN_CS")) %>%
    # add BATTING_1B
    mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - 
        TEAM_BATTING_HR) %>%
    # remove total hits, as in the training data
    select(-TEAM_BATTING_H) %>%
    # remove the 'TEAM_' prefix
    rename_all(remove_string, remove = "TEAM_")
test.forest <- testdata %>% as.data.frame() %>% missForest()

test.imp <- test.forest$ximp
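# m4 was fit on log-transformed data, so transform the test predictors the
# same way and back-transform the predictions to the win scale
log.test.imp <- test.imp %>% mutate_all(log) %>% mutate_all(remove_negInf)
test_results <- exp(predict(m4, newdata = log.test.imp))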

bind_cols(data.frame(TARGET_WINS = test_results), test.imp)