url_training<-"https://raw.githubusercontent.com/hovig/MSDS_CUNY/master/DATA621/moneyball-training-data.csv"
dfrow<-read.csv(url_training)
createSummaryTable <- function(trainingdata1) {
  
  # Mean and Medians Table
  mean_median_df = data.frame(matrix(0, nrow = ncol(trainingdata1), ncol = 2))
  
  mean2 <- function(x)   {mean(x, na.rm = TRUE)}
  median2 <- function(x) {median(x, na.rm = TRUE)}
  
  means = as.data.frame(lapply(trainingdata1, mean2))
  medians = as.data.frame(lapply(trainingdata1, median2))
  mean_median_df[, 1] = names(means)
  mean_median_df[, 2] = t(means[1, ])
  mean_median_df[, 3] = t(medians[1, ])
  names(mean_median_df) = c("VAR_NAME", "MEAN", "MEDIAN")

  # Correlations to Wins
  cor_df = data.frame(matrix(0, nrow = ncol(trainingdata1) - 2, ncol = 2))
  cors = as.data.frame(cor(trainingdata1$TARGET_WINS, trainingdata1[, 3:ncol(trainingdata1)], use = "pairwise.complete.obs"))
  cor_df[, 1] = names(cors)
  cor_df[, 2] = t(cors[1, ])
  names(cor_df) = c("VAR_NAME", "CORRELATION TO WINS (r)")
 
#Missing Values per variable
  mv_df = data.frame(matrix(0, nrow = ncol(trainingdata1), ncol = 2))
  num_missing <- function(x)   {sum(is.na(x))}
  missingvalues = as.data.frame(lapply(trainingdata1, num_missing))
  mv_df[, 1] = names(missingvalues)
  mv_df[, 2] = t(missingvalues[1, ])
  names(mv_df) = c("VAR_NAME", "NUM_MISSING")
  
  data_exp = merge(mean_median_df, cor_df, by.x = "VAR_NAME", by.y = "VAR_NAME")
  data_exp = merge(data_exp, mv_df, by.x = "VAR_NAME", by.y = "VAR_NAME")
  # prepend the response row, selected by name rather than position
  temp = as.data.frame(cbind(mean_median_df[mean_median_df$VAR_NAME == "TARGET_WINS", ], NA, NA))
  names(temp) = names(data_exp)
  data_exp = rbind(temp, data_exp)
  return(data_exp)
}

trainingdata_bk = dfrow

Data Exploration

The data analyzed in this report comprises 2,276 professional baseball team seasons for the years 1871-2006. In total, 16 variables are present in the data provided. Included below is a summary of descriptive statistics, correlations to wins, and the number of missing values for each variable in the provided data set:

Table 1

| VAR_NAME | MEAN | MEDIAN | CORRELATION TO WINS (r) | NUM_MISSING |
|---|---|---|---|---|
| TARGET_WINS | 80.79086 | 82.0 | NA | NA |
| TEAM_BASERUN_CS | 52.80386 | 49.0 | 0.0224041 | 772 |
| TEAM_BASERUN_SB | 124.76177 | 101.0 | 0.1351389 | 131 |
| TEAM_BATTING_2B | 241.24692 | 238.0 | 0.2891036 | 0 |
| TEAM_BATTING_3B | 55.25000 | 47.0 | 0.1426084 | 0 |
| TEAM_BATTING_BB | 501.55888 | 512.0 | 0.2325599 | 0 |
| TEAM_BATTING_H | 1469.26977 | 1454.0 | 0.3887675 | 0 |
| TEAM_BATTING_HBP | 59.35602 | 58.0 | 0.0735042 | 2085 |
| TEAM_BATTING_HR | 99.61204 | 102.0 | 0.1761532 | 0 |
| TEAM_BATTING_SO | 735.60534 | 750.0 | -0.0317507 | 102 |
| TEAM_FIELDING_DP | 146.38794 | 149.0 | -0.0348506 | 286 |
| TEAM_FIELDING_E | 246.48067 | 159.0 | -0.1764848 | 0 |
| TEAM_PITCHING_BB | 553.00791 | 536.5 | 0.1241745 | 0 |
| TEAM_PITCHING_H | 1779.21046 | 1518.0 | -0.1099371 | 0 |
| TEAM_PITCHING_HR | 105.69859 | 107.0 | 0.1890137 | 0 |
| TEAM_PITCHING_SO | 817.73045 | 813.5 | -0.0784361 | 102 |

It can be seen that 6 of the variables in the data set have missing values, ranging from approximately 5% to 92% of the observations for the affected variable. In all but two cases (TEAM_BASERUN_CS at 33.9% and TEAM_BATTING_HBP at 91.6%), however, the missing data account for less than 13% of the observations for that variable.
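These percentages can be verified directly from the raw training data; the following is a minimal sketch (not part of the original code) using the dfrow data frame loaded earlier:

# Percent of observations missing, per variable (only nonzero values shown)
pct_missing <- round(100 * colMeans(is.na(dfrow)), 1)
sort(pct_missing[pct_missing > 0], decreasing = TRUE)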

Below are graphs that show the relationship to TARGET_WINS for the three variables with the highest correlation coefficients (TEAM_BATTING_H, TEAM_BATTING_2B, and TEAM_BATTING_BB):

As can be seen from Table 1, few variables have a particularly strong correlation with TARGET_WINS. The full array of scatterplots representing correlations between TARGET_WINS and the other variables may be found in Appendix A.

The distribution of values and outliers is also of significant importance in understanding the baseball data set. Faceted boxplots of the variables (produced by the code in Appendix C) show that many variables have a skewed distribution.


In summary, the baseball data set provided includes many variables with skewed distributions, few variables that correlate well with TARGET_WINS, and several variables with missing data that either require imputation or should be excluded. The following sections review these issues and go on to build a working regression model that can predict TARGET_WINS.

Data Preparation

The dataset contains 17 columns: an index column (INDEX), a response column (TARGET_WINS), and 15 predictor columns. There are 2,276 observations, but many of the predictors have missing values.

Two predictors in particular stand out:

| Predictor Name | Description | Impact | % Missing | r with Response | p-Value |
|---|---|---|---|---|---|
| TEAM_BATTING_HBP | Batters hit by pitch (free base) | Positive | 91.6% | 0.07 | 0.31 |
| TEAM_BASERUN_CS | Runners caught stealing | Negative | 33.9% | 0.02 | 0.39 |

Including these predictors in our dataset would mean either forgoing a significant portion of our data (34% or 92% of observations, respectively) or imputing a large number of data points. Their correlation coefficients with the response are below 0.07 in absolute value, and the p-values of simple one-variable linear regressions of the response on each of them indicate models of no statistical significance (i.e. p > 0.05). It therefore seems safe to exclude these predictors from our models; in this way, we avoid the twin pitfalls of mass exclusion and mass imputation.
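The quoted p-values come from single-predictor regressions; a sketch of that check (not in the original code; output formatting is illustrative):

# One-variable regressions for the two flagged predictors; the slope
# p-value is read from the coefficient table of each fit
for (v in c("TEAM_BATTING_HBP", "TEAM_BASERUN_CS")) {
    f <- as.formula(paste("TARGET_WINS ~", v))
    p <- coef(summary(lm(f, data = dfrow)))[2, "Pr(>|t|)"]
    cat(v, "p =", round(p, 2), "\n")
}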

It was determined that the Hits By Pitch variable had too many missing values to be useful for regression, so it was excluded from the model-building process. As shown in Table 1 above, several other variables also have missing values. The solution attempted for this problem was imputation using the median of each variable, as sketched below. A summary of the imputed data follows, for inspection and confirmation of similarity between the old and new data sets:
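The imputation loop, equivalent to the Appendix C code:

# Drop INDEX, then replace each variable's NAs with that variable's median
df_new <- dfrow[, -1]
for (i in seq_along(df_new)) {
    df_new[, i][is.na(df_new[, i])] <- median(df_new[, i], na.rm = TRUE)
}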

Missing Values Imputed With Median

| VAR_NAME | MEAN | MEDIAN | CORRELATION TO WINS (r) | NUM_MISSING |
|---|---|---|---|---|
| TEAM_BATTING_H | 1469.26977 | 1454.0 | NA | NA |
| TEAM_BASERUN_CS | 51.51362 | 49.0 | 0.0159598 | 0 |
| TEAM_BASERUN_SB | 123.39411 | 101.0 | 0.1236109 | 0 |
| TEAM_BATTING_2B | 241.24692 | 238.0 | 0.2891036 | 0 |
| TEAM_BATTING_3B | 55.25000 | 47.0 | 0.1426084 | 0 |
| TEAM_BATTING_BB | 501.55888 | 512.0 | 0.2325599 | 0 |
| TEAM_BATTING_HBP | 58.11380 | 58.0 | 0.0165164 | 0 |
| TEAM_BATTING_HR | 99.61204 | 102.0 | 0.1761532 | 0 |
| TEAM_BATTING_SO | 736.25044 | 750.0 | -0.0305814 | 0 |
| TEAM_FIELDING_DP | 146.71617 | 149.0 | -0.0300863 | 0 |
| TEAM_FIELDING_E | 246.48067 | 159.0 | -0.1764848 | 0 |
| TEAM_PITCHING_BB | 553.00791 | 536.5 | 0.1241745 | 0 |
| TEAM_PITCHING_H | 1779.21046 | 1518.0 | -0.1099371 | 0 |
| TEAM_PITCHING_HR | 105.69859 | 107.0 | 0.1890137 | 0 |
| TEAM_PITCHING_SO | 817.54086 | 813.5 | -0.0757997 | 0 |

Further exclusions to the data were made:

| Exclusion | Explanation |
|---|---|
| INDEX == 1347 | This row had a suspicious set of zero entries |
| TEAM_BATTING_BB == 0 | Anomalously low walk count (the expected occurrence of a zero value for this predictor is zero) |
| TEAM_BATTING_SO | Values outside of recognized MLB records (citations in code) |
| TEAM_BATTING_HR | Values outside of recognized MLB records (citations in code) |

It should be noted that the records excluded by the first two criteria in the table above overlap substantially, which suggests that statistics such as strikeouts were not recorded for those seasons but were instead entered as zero. Relatedly, TEAM_BATTING_SO and TEAM_PITCHING_SO share the same number of NA values (102), suggesting that their recording methods were linked.

Many of the more advanced baseball statistics are simply combinations of other statistics (e.g. slugging percentage is total bases divided by at-bats). Using the predictors given in the data set, we wanted to see whether combining predictors and/or calculating new values would add significance to a model predicting wins. Total bases and extra bases are both statistics that can easily be calculated from the given data. Assuming that the counts of doubles, triples, and home runs are included in the TEAM_BATTING_H variable, we can subtract these out to obtain the number of singles. Adding the singles to the doubles, triples, and home runs, each multiplied by the number of bases it is worth (1, 2, 3, and 4 respectively), gives total bases; doing the same while excluding singles gives extra bases. These two statistics were used in the model, along with many of the other predictors, as sketched below.
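The derived statistics are computed as follows (mirroring the Appendix C code; the singles count relies on the assumption about TEAM_BATTING_H stated above):

# Singles = base hits minus extra-base hits
df_new$TEAM_BATTING_1B <- with(df_new, TEAM_BATTING_H - TEAM_BATTING_2B -
    TEAM_BATTING_3B - TEAM_BATTING_HR)
# Total bases weights each hit type by the number of bases it is worth
df_new$TEAM_BATTING_TOT_BASES <- with(df_new, TEAM_BATTING_1B +
    2 * TEAM_BATTING_2B + 3 * TEAM_BATTING_3B + 4 * TEAM_BATTING_HR)
# Extra bases excludes singles
df_new$TEAM_BATTING_XTRA_BASE <- with(df_new, TEAM_BATTING_2B +
    TEAM_BATTING_3B + TEAM_BATTING_HR)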

Since the fielding errors predictor was drastically right-skewed (see Appendix A), a log transformation was applied to this predictor, resulting in a better model fit.
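A quick visual comparison motivates the transformation; this sketch is not in the original code and assumes all error counts are positive, which should hold for team-season totals:

# Compare raw vs. log-transformed distributions of fielding errors
par(mfrow = c(1, 2))
hist(df_new$TEAM_FIELDING_E, main = "Raw", xlab = "TEAM_FIELDING_E")
hist(log(df_new$TEAM_FIELDING_E), main = "Log", xlab = "log(TEAM_FIELDING_E)")
par(mfrow = c(1, 1))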

Model Creation

Model Summary Table

| Model # | # of Predictors | Adj. \(R^2\) | F-Statistic | P-Value | Residual Standard Error | Degrees of Freedom |
|---|---|---|---|---|---|---|
| 1 | 11 | 0.31 | 95 | < 2.2e-16 | 13.07 | 2264 |
| 2 | 8 | 0.22 | 82 | < 2.2e-16 | 13.82 | 2266 |
| 3 | 7 | 0.35 | 157 | < 2.2e-16 | 11.07 | 1993 |
| 4 | 4 | 0.28 | 225 | < 2.2e-16 | 11.78 | 2271 |
| 5 | 8 | 0.31 | 125 | < 2.2e-16 | 13.15 | 2267 |

Model 1: Simple Full Linear Regression, With Removal of Non-Significant Predictors

Description: Missing values were replaced with the median of the associated predictor in order to retain all data points for the regression; a linear regression was then fit to all predictors, and non-significant predictors (p > .05) were removed sequentially, as reproduced below from Appendix C.
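The elimination sequence (drop the weakest predictor, refit, repeat):

# Fit the full model, then drop non-significant predictors one at a time
fit <- lm(TARGET_WINS ~ ., df_new)
fit <- update(fit, . ~ . - TEAM_PITCHING_BB)
fit <- update(fit, . ~ . - TEAM_BATTING_HBP)
fit <- update(fit, . ~ . - TEAM_BASERUN_CS)
fit <- update(fit, . ~ . - TEAM_PITCHING_HR)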

The final iteration of this regression model is shown here:

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 22.34 | 5.234 | 4.269 | 2.043e-05 |
| TEAM_BATTING_H | 0.04909 | 0.00367 | 13.38 | 2.469e-39 |
| TEAM_BATTING_2B | -0.02137 | 0.009163 | -2.333 | 0.01975 |
| TEAM_BATTING_3B | 0.06658 | 0.01662 | 4.005 | 6.4e-05 |
| TEAM_BATTING_HR | 0.0674 | 0.009632 | 6.998 | 3.399e-12 |
| TEAM_BATTING_BB | 0.01155 | 0.003375 | 3.421 | 0.0006342 |
| TEAM_BATTING_SO | -0.008521 | 0.002453 | -3.474 | 0.0005227 |
| TEAM_BASERUN_SB | 0.02492 | 0.004209 | 5.92 | 3.699e-09 |
| TEAM_PITCHING_H | -0.000777 | 0.0003209 | -2.421 | 0.01555 |
| TEAM_PITCHING_SO | 0.002966 | 0.0006719 | 4.415 | 1.059e-05 |
| TEAM_FIELDING_E | -0.01901 | 0.002392 | -7.948 | 2.972e-15 |
| TEAM_FIELDING_DP | -0.1218 | 0.01293 | -9.419 | 1.079e-20 |

Fitting linear model: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 2276 | 13.07 | 0.3151 | 0.3117 |

Model 2: SLR Data

A linear regression was fitted to all predictors; predictors were then removed in order of significance to obtain a model with a higher F-statistic.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 65.52 | 16.52 | 3.966 | 7.54e-05 |
| TEAM_BATTING_SO | -0.006241 | 0.005146 | -1.213 | 0.2254 |
| TEAM_BASERUN_SB | 0.2837 | 0.1528 | 1.857 | 0.06348 |
| TEAM_PITCHING_HR | 0.04736 | 0.006355 | 7.451 | 1.305e-13 |
| TEAM_FIELDING_E | -0.007392 | 0.001797 | -4.113 | 4.04e-05 |
| TEAM_FIELDING_DP | -0.081 | 0.01421 | -5.702 | 1.336e-08 |

Fitting linear model: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 2275 | 15.23 | 0.05692 | 0.05484 |

Model 3: Data Bounded by 1880-2015 Records

Three predictors were removed: TEAM_BATTING_HBP and TEAM_BASERUN_CS, due to their high NA counts, and TEAM_FIELDING_DP, due to lack of relevance.

Bounds for predictors were set by minimum and maximum all-time MLB records, with citations shown in the code. Where our records fell outside the bounds of these external records, the values were replaced with NA, as sketched below. Predictors were then removed sequentially, in order of significance. Multiple engineered predictors were created and tested in the hopes of improving fit statistics, and one proved useful: the number of first-base hits (singles), which was then added to the model.
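The bounding step takes the following form (one predictor shown; the bounds for each variable are listed in Appendix C):

# Values outside the all-time MLB record range become NA
df$TEAM_BATTING_3B <- ifelse(df$TEAM_BATTING_3B < 11 | df$TEAM_BATTING_3B > 153,
    NA, df$TEAM_BATTING_3B)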

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 36.71 | 5.549 | 6.616 | 4.739e-11 |
| TEAM_BATTING_3B | 0.1804 | 0.01727 | 10.45 | 6.657e-25 |
| TEAM_BATTING_HR | 0.1149 | 0.007506 | 15.31 | 4.252e-50 |
| TEAM_BATTING_BB | 0.02858 | 0.003096 | 9.231 | 6.64e-20 |
| TEAM_BATTING_SO | -0.017 | 0.002323 | -7.32 | 3.592e-13 |
| TEAM_BASERUN_SB | 0.07938 | 0.004722 | 16.81 | 2.048e-59 |
| TEAM_FIELDING_E | -0.0736 | 0.003888 | -18.93 | 1.226e-73 |
| TEAM_BATTING_1B | 0.02327 | 0.004018 | 5.79 | 8.143e-09 |

Fitting linear model: TARGET_WINS ~ TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_FIELDING_E + TEAM_BATTING_1B

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 2001 | 11.07 | 0.3557 | 0.3535 |

Model 4: Using only significant predictors from model using all variables

When the model using all variables to predict TARGET_WINS was considered, it was decided to create a new model using only the most significant predictors from that regression. This model uses just 4 variables: base hits by batters, stolen bases, double plays, and errors. It produced an F-statistic of 225 and \(R^2\) = 0.283, and reduces to the single lm() call shown below.
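The fit, as in Appendix C:

# Model 4: only the four most significant predictors from the full model
model4 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB +
    TEAM_FIELDING_DP + TEAM_FIELDING_E, df_new)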

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 17.93 | 3.198 | 5.605 | 2.329e-08 |
| TEAM_BATTING_H | 0.05359 | 0.002051 | 26.13 | 8.068e-132 |
| TEAM_BASERUN_SB | 0.02968 | 0.003554 | 8.353 | 1.144e-16 |
| TEAM_FIELDING_DP | -0.08779 | 0.01224 | -7.17 | 1.008e-12 |
| TEAM_FIELDING_E | -0.027 | 0.001367 | -19.74 | 3.322e-80 |

Fitting linear model: TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB + TEAM_FIELDING_DP + TEAM_FIELDING_E

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 2276 | 13.34 | 0.2841 | 0.2829 |

Here the residual and Q-Q plots can be examined. These plots provide sufficient information to uphold the assumptions made when creating the regression model.


Model 5: Feature engineering and new variables

Using a step-wise methodology, less significant predictors such as caught stealing and pitching walks and hits were left out. To avoid including related predictors, the base hits/singles and the individual multi-base-hit predictors were also left out. Most of the predictors have coefficients that behave the way we would expect, given each predictor's effect on the game. The extra bases predictor, however, ended up with a negative coefficient where a positive one would be expected. This is likely because the same underlying statistics make up both engineered predictors, producing collinearity between total bases and extra bases; a quick check is sketched below. Removing extra bases from the model results in a lower coefficient for total bases and an overall lower adjusted \(R^2\) value for the model.
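The suspected collinearity is easy to check; this sketch (not in the original code) uses the engineered predictors created earlier:

# Correlation between total bases and extra bases; a value near 1
# would account for the unexpected sign on the extra bases coefficient
with(df_new, cor(TEAM_BATTING_TOT_BASES, TEAM_BATTING_XTRA_BASE))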

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 64.92 | 7.546 | 8.604 | 1.413e-17 |
| TEAM_BATTING_TOT_BASES | 0.04703 | 0.003591 | 13.1 | 7.868e-38 |
| TEAM_BATTING_XTRA_BASE | -0.05788 | 0.01165 | -4.969 | 7.221e-07 |
| TEAM_BATTING_BB | 0.01479 | 0.003173 | 4.661 | 3.328e-06 |
| TEAM_BATTING_SO | -0.007654 | 0.002161 | -3.543 | 0.0004043 |
| TEAM_BASERUN_SB | 0.03415 | 0.004144 | 8.24 | 2.869e-16 |
| TEAM_PITCHING_HR | -0.03628 | 0.01042 | -3.483 | 0.0005049 |
| log(TEAM_FIELDING_E) | -8.255 | 0.853 | -9.677 | 9.833e-22 |
| TEAM_FIELDING_DP | -0.1334 | 0.01294 | -10.31 | 2.104e-24 |

Fitting linear model: TARGET_WINS ~ TEAM_BATTING_TOT_BASES + TEAM_BATTING_XTRA_BASE + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_HR + log(TEAM_FIELDING_E) + TEAM_FIELDING_DP

| Observations | Residual Std. Error | \(R^2\) | Adjusted \(R^2\) |
|---|---|---|---|
| 2276 | 13.15 | 0.3054 | 0.303 |

Model Selection and Prediction

The model utilizing only the significant predictors (Model 4) is selected as the best model for predicting team wins in a 162-game baseball season. While the \(R^2\) value of this model is the second-lowest of the five models tested, its high F-statistic indicates that it is the most statistically significant. Additionally, it is the most parsimonious model tested, and its simplicity makes it easier for other users to understand.

This model has a root-mean-square error of 13.3394, an \(R^2\) of 0.28, and an F-statistic of 225.3, with a corresponding p-value of < 2.2e-16.

The residual plots for this model are presented below:
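They are generated from the selected model as in Appendix C:

# Residual diagnostics for the selected model
par(mfrow = c(1, 3))
plot(model4$residuals, main = NULL)   # residuals vs. observation order
hist(model4$residuals, main = NULL)   # residual distribution
qqnorm(model4$residuals, main = NULL)
qqline(model4$residuals)
par(mfrow = c(1, 1))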

There does not appear to be any pattern in the residuals in the scatterplot, so the condition of linearity is accepted. The histogram and Q-Q plot indicate that the residuals are roughly normally distributed, albeit with short tails. Finally, the scatterplot and Q-Q plot show near-constant variability in the residuals. Because these conditions are met, the use of a linear model is accepted as valid.

The linear model is applied to an evaluation dataset containing predictor values for 259 cases. A histogram of the predicted team wins is presented below.

The predicted wins appear roughly normally distributed, with a slight right-skewness. As expected, the distribution is centered near 82, which represents a 0.500 season. Further investigation shows that the median is indeed roughly 82 wins, with the mean slightly lower at roughly 81 wins.

Due to missing values, however, there are 89 missing predictions, representing roughly 34% of the evaluation dataset. In order to allow predictions for the full 259 cases, missing values are filled with the median of each affected variable, as sketched below. The linear model is then applied to the evaluation dataset with imputed median values. A histogram of this modified set of predicted team wins is presented below.
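The imputation-and-prediction step mirrors the Appendix C code:

# Replace evaluation-set NAs with per-variable medians, then re-predict
evaluation_data_imputed <- evaluation_data
for (i in 2:ncol(evaluation_data_imputed)) {
    evaluation_data_imputed[, i][is.na(evaluation_data_imputed[, i])] <-
        median(evaluation_data_imputed[, i], na.rm = TRUE)
}
predicted_wins_imputed <- predict(model4, evaluation_data_imputed)
hist(predicted_wins_imputed)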

In contrast to the predictions for the raw evaluation dataset, this set of predictions is left-skewed. The median and mean for this set of predictions are both roughly 81. The shape of the distribution, which is seemingly condensed towards the median, suggests that imputation using the median may have introduced a bias towards the center of the distribution (which, again, corresponds to a 0.500 season with 82 wins). Although it is outside the scope of this investigation, more advanced imputation that avoids introducing bias may be advisable for cases with missing predictors.

A comparison of the full sets of predictions for the evaluation dataset is available in Appendix B.

Appendix A – TARGET_WINS Correlations

Appendix B – Indexed Results of Predictive Model

Index Predicted Value Index Predicted Value
9 67.08 1253 88.46
10 66.92 1261 81.49
14 76.79 1305 80.91
47 87.93 1314 84.94
60 NA 1323 83.96
63 NA 1328 NA
74 NA 1353 74.44
83 79.65 1363 79.96
98 72.16 1371 89.46
120 76.64 1372 81.91
123 74.83 1389 71.76
135 79.84 1393 74.76
138 76.58 1421 91.63
140 79.55 1431 76.85
151 82.18 1437 73.96
153 76.72 1442 71.85
171 74.71 1450 76.95
184 79.5 1463 80.79
193 74.99 1464 81.19
213 90.96 1470 82.37
217 83.31 1471 79.89
226 85.65 1484 84.8
230 78.71 1495 NA
241 72.71 1507 74.52
291 84.58 1514 78.84
294 88.45 1526 73.44
300 NA 1549 86.67
348 79.17 1552 NA
350 86.58 1556 92.06
357 82.53 1564 76.47
367 87.91 1585 97.51
368 83.83 1586 99.86
372 81.34 1590 90.02
382 81.51 1591 97.03
388 79.6 1592 90.23
396 81.4 1603 83.87
398 75.58 1612 78.62
403 85.64 1634 78.59
407 85.99 1645 74.5
410 89.59 1647 81.08
412 82.95 1673 90.1
414 90.63 1674 88.93
436 NA 1687 80.95
440 NA 1688 94.57
476 NA 1700 83.64
479 NA 1708 74.1
481 NA 1713 75.47
501 77.39 1717 71.86
503 72.46 1721 74.5
506 80.86 1730 78
519 79.94 1737 83.81
522 88.59 1748 84.28
550 77.27 1749 84.65
554 73.53 1763 81.6
566 78.52 1768 NA
578 80.76 1778 NA
596 89.47 1780 NA
599 77.73 1782 NA
605 65.44 1784 NA
607 81.23 1794 NA
614 89.2 1803 75.12
644 74.99 1804 84.22
692 86.87 1819 80.44
699 82.25 1832 82.05
700 82.59 1833 86.72
716 NA 1844 71.13
721 78.01 1847 79.48
722 86.08 1854 78.67
729 78.97 1855 76.35
731 86.76 1857 83.58
746 86.44 1864 75.94
763 75.01 1865 81.23
774 74.75 1869 75.2
776 81.76 1880 88.4
788 87.44 1881 82.28
789 85.36 1882 80.25
792 80.21 1894 81.85
811 82.77 1896 80.55
835 76.19 1916 NA
837 81.56 1918 NA
861 86.94 1921 NA
862 90.53 1926 NA
863 97.27 1938 80.07
871 79.21 1979 66.36
879 80.84 1982 68.9
887 80.66 1987 84.53
892 80.19 1997 80.63
904 83.76 2004 92.76
909 85.9 2011 80.52
925 88.37 2015 78.7
940 NA 2022 80.69
951 NA 2025 79.49
976 75.16 2027 82.78
981 88.98 2031 77.32
983 88.54 2036 NA
984 84.65 2066 74.46
989 85.56 2073 82.55
995 102.2 2087 80.12
1000 88.33 2092 81.86
1001 87.29 2125 NA
1007 82.95 2148 80.19
1016 74.59 2162 91.87
1027 84.75 2191 74.87
1033 79.25 2203 84.81
1070 78.8 2218 78.45
1081 NA 2221 76.34
1084 NA 2225 85.67
1098 80.81 2232 76.65
1150 84.03 2267 NA
1160 NA 2291 73.84
1169 81.77 2299 88.77
1172 85.58 2317 85.91
1174 93.95 2318 86.37
1176 92.06 2353 85.41
1178 83.49 2403 65.12
1184 82.58 2411 88.87
1193 87.06 2415 82.7
1196 80.97 2424 83.92
1199 79.25 2441 74.03
1207 NA 2464 83.58
1218 NA 2465 81.27
1223 71.28 2472 NA
1226 71.61 2481 NA
1227 67.38 2487 NA
1229 70.69 2500 70.5
1241 90.14 2501 74.19
1244 92.75 2520 84.48
1246 78.45 2521 83.32
1248 91.52 2525 79.7
1249 92.58 NA NA

Appendix C – R Code

dfrow4 <- read.csv(url(url_training))
createSummaryTable <- function(trainingdata1) {
    
    # Mean and Medians Table
    mean_median_df = data.frame(matrix(0, nrow = ncol(trainingdata1), 
        ncol = 2))
    mean2 <- function(x) {
        mean(x, na.rm = TRUE)
    }
    median2 <- function(x) {
        median(x, na.rm = TRUE)
    }
    means = as.data.frame(lapply(trainingdata1, mean2))
    medians = as.data.frame(lapply(trainingdata1, median2))
    mean_median_df[, 1] = names(means)
    mean_median_df[, 2] = t(means[1, ])
    mean_median_df[, 3] = t(medians[1, ])
    names(mean_median_df) = c("VAR_NAME", "MEAN", "MEDIAN")
    
    # Correlations to Wins
    cor_df = data.frame(matrix(0, nrow = ncol(trainingdata1) - 
        2, ncol = 2))
    
    cors = as.data.frame(cor(trainingdata1$TARGET_WINS, trainingdata1[, 
        3:ncol(trainingdata1)], use = "pairwise.complete.obs"))
    cor_df[, 1] = names(cors)
    cor_df[, 2] = t(cors[1, ])
    names(cor_df) = c("VAR_NAME", "CORRELATION TO WINS (r)")
    
    # Missing Values per variable
    mv_df = data.frame(matrix(0, nrow = ncol(trainingdata1), 
        ncol = 2))
    num_missing <- function(x) {
        sum(is.na(x))
    }
    missingvalues = as.data.frame(lapply(trainingdata1, num_missing))
    mv_df[, 1] = names(missingvalues)
    mv_df[, 2] = t(missingvalues[1, ])
    names(mv_df) = c("VAR_NAME", "NUM_MISSING")
    
    data_exp = merge(mean_median_df, cor_df, by.x = "VAR_NAME", 
        by.y = "VAR_NAME")
    data_exp = merge(data_exp, mv_df, by.x = "VAR_NAME", by.y = "VAR_NAME")
    # prepend the response row, selected by name rather than position
    temp = as.data.frame(cbind(mean_median_df[mean_median_df$VAR_NAME == 
        "TARGET_WINS", ], NA, NA))
    names(temp) = names(data_exp)
    data_exp = rbind(temp, data_exp)
    return(data_exp)
}

trainingdata_bk = dfrow4
data_exp = createSummaryTable(dfrow4)
kable(data_exp)
par(mfrow = c(1, 3), pin = c(3/2, 3/2))
# row positions in data_exp of the correlations for data columns 3, 4 and 7
# (the other entries are unused placeholders)
top3correlations = c(0, 0, 7, 4, 0, 0, 6)

for (plot_count in c(3, 4, 7)) {
    plot(x = dfrow4[, plot_count], y = dfrow4$TARGET_WINS, xlab = names(dfrow4)[plot_count], 
        ylab = "Target Wins", main = paste0(names(dfrow4)[plot_count], 
            " r = ", round(data_exp[top3correlations[plot_count], 
                4], 3)))
}

trainingDataRaw = dfrow4
data_no_index <- trainingDataRaw[, c(2:17)]  # delete index

for (i in 1:16) {
    data_no_index[, i][is.na(data_no_index[, i])] = median(data_no_index[, 
        i], na.rm = TRUE)
}

df_new = data_no_index
par(mfrow = c(6, 3))
m <- melt(df_new)
p <- ggplot(m, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scale = "free")
imp_data = createSummaryTable(df_new)
kable(imp_data)
trainingdata = df_new
a = c("TEAM_BATTING_HBP", "Batters hit by pitch (free base)", 
    "Positive", "91.6%", "0.07", "0.31")
b = c("TEAM_BASERUN_CS", "Strikeouts by batters", "Negative", 
    "33.9%", "0.02", "0.39")
names(a) = c("Predictor Name", "Description", "Impact", "% Missing", 
    "r with Response", "p-Value")
names(b) = names(a)
c = as.data.frame(rbind(a, b))
kable(c)

# Remove INDEX column
data_no_index <- dfrow4[, c(2:17)]

for (i in 1:16) {
    data_no_index[, i][is.na(data_no_index[, i])] <- median(data_no_index[, 
        i], na.rm = TRUE)
}
df_new <- data_no_index

fit <- lm(TARGET_WINS ~ ., df_new)
fit <- update(fit, . ~ . - TEAM_PITCHING_BB)
fit <- update(fit, . ~ . - TEAM_BATTING_HBP)
fit <- update(fit, . ~ . - TEAM_BASERUN_CS)
fit <- update(fit, . ~ . - TEAM_PITCHING_HR)
pander(summary(fit))
model1 = fit
dfremove <- subset(dfrow4, INDEX == 1347)$INDEX
df <- subset(dfrow4, !(INDEX %in% dfremove))

# Remove INDEX column
data_no_index <- df[, c(2:17)]

for (i in 1:16) {
    data_no_index[, i][is.na(data_no_index[, i])] <- median(data_no_index[, 
        i], na.rm = TRUE)
}
df <- data_no_index

# Replace anomalous zero counts with the variable median; a zero here
# indicates an unrecorded value rather than a true zero
df$TEAM_BATTING_BB <- ifelse(df$TEAM_BATTING_BB == 0, median(df$TEAM_BATTING_BB), 
    df$TEAM_BATTING_BB)
df$TEAM_BATTING_H <- ifelse(df$TEAM_BATTING_H == 0, median(df$TEAM_BATTING_H), 
    df$TEAM_BATTING_H)
df$TEAM_BATTING_2B <- ifelse(df$TEAM_BATTING_2B == 0, median(df$TEAM_BATTING_2B), 
    df$TEAM_BATTING_2B)
df$TEAM_BATTING_3B <- ifelse(df$TEAM_BATTING_3B == 0, median(df$TEAM_BATTING_3B), 
    df$TEAM_BATTING_3B)
df$TEAM_BATTING_HR <- ifelse(df$TEAM_BATTING_HR == 0, median(df$TEAM_BATTING_HR), 
    df$TEAM_BATTING_HR)
df$TEAM_BATTING_SO <- ifelse(df$TEAM_BATTING_SO == 0, median(df$TEAM_BATTING_SO), 
    df$TEAM_BATTING_SO)
df$TEAM_PITCHING_SO <- ifelse(df$TEAM_PITCHING_SO == 0, median(df$TEAM_PITCHING_SO), 
    df$TEAM_PITCHING_SO)
df$TEAM_BASERUN_SB <- ifelse(df$TEAM_BASERUN_SB == 0, median(df$TEAM_BASERUN_SB), 
    df$TEAM_BASERUN_SB)

fit <- lm(TARGET_WINS ~ ., df)
fit <- update(fit, . ~ . - TEAM_PITCHING_SO)
fit <- update(fit, . ~ . - TEAM_BATTING_3B)
fit <- update(fit, . ~ . - TEAM_BATTING_HBP)
fit <- update(fit, . ~ . - TEAM_PITCHING_BB)
fit <- update(fit, . ~ . - TEAM_BASERUN_CS)
fit <- update(fit, . ~ . - TEAM_PITCHING_H)
fit <- update(fit, . ~ . - TEAM_BATTING_HR)
pander(summary(fit))
model2 = fit

# Model 3 using 1880- Data
dfremove <- subset(dfrow4, INDEX == 1347)$INDEX
df <- subset(dfrow4, !(INDEX %in% dfremove))

# Remove caught stealing and hit by pitcher variables and
# fielding double play, due to too many NAs (CS and HBP) and
# lack of relevance (DP)
df <- df[, -c(1, 10, 11, 17)]
df$TEAM_BASERUN_SB <- ifelse(df$TEAM_BASERUN_SB < 13 | df$TEAM_BASERUN_SB > 
    638, NA, df$TEAM_BASERUN_SB)
df$TEAM_BATTING_3B <- ifelse(df$TEAM_BATTING_3B < 11 | df$TEAM_BATTING_3B > 
    153, NA, df$TEAM_BATTING_3B)
df$TEAM_BATTING_HR <- ifelse(df$TEAM_BATTING_HR < 3 | df$TEAM_BATTING_HR > 
    264, NA, df$TEAM_BATTING_HR)
df$TEAM_BATTING_SO <- ifelse(df$TEAM_BATTING_SO < 308 | df$TEAM_BATTING_SO > 
    1535, NA, df$TEAM_BATTING_SO)
df$TEAM_PITCHING_SO <- ifelse(df$TEAM_PITCHING_SO < 333 | df$TEAM_PITCHING_SO > 
    1450, NA, df$TEAM_PITCHING_SO)

fit <- lm(TARGET_WINS ~ ., df)
fit1 <- update(fit, . ~ . - TEAM_BATTING_H)
fit2 <- update(fit1, . ~ . - TEAM_PITCHING_HR)
fit3 <- update(fit2, . ~ . - TEAM_PITCHING_SO)
fit4 <- update(fit3, . ~ . - TEAM_BATTING_2B)
fit5 <- update(fit4, . ~ . - TEAM_PITCHING_BB)
fit6 <- update(fit5, . ~ . - TEAM_PITCHING_H)

df$TEAM_BATTING_1B <- df$TEAM_BATTING_H - df$TEAM_BATTING_2B - 
    df$TEAM_BATTING_3B - df$TEAM_BATTING_HR  #Singles - 1st Base Hits
fit7 <- update(fit6, . ~ . + TEAM_BATTING_1B)
pander(summary(fit7))

# These are variables that I tried but didn't turn out to be
# valuable
df$TEAM_BATTING_HRP <- df$TEAM_BATTING_HR/df$TEAM_BATTING_H  #Home runs as a percentage of base hits
df$TEAM_BATTING_HSO <- df$TEAM_BATTING_H/df$TEAM_BATTING_SO  #Ratio of hits to strikeouts
model3 = fit7
model4 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BASERUN_SB + 
    TEAM_FIELDING_DP + TEAM_FIELDING_E, df_new)
pander(summary(model4))
par(mfrow = c(2, 2))
plot(model4)
par(mfrow = c(1, 1))

# Get Singles from Base Hits
df_new$TEAM_BATTING_1B <- with(df_new, TEAM_BATTING_H - TEAM_BATTING_2B - 
    TEAM_BATTING_3B - TEAM_BATTING_HR)

# Create Extra Bases and Total Bases predictors
df_new$TEAM_BATTING_XTRA_BASE <- with(df_new, TEAM_BATTING_2B + 
    TEAM_BATTING_3B + TEAM_BATTING_HR)
df_new$TEAM_BATTING_TOT_BASES <- with(df_new, TEAM_BATTING_1B + 
    (2 * TEAM_BATTING_2B) + (3 * TEAM_BATTING_3B) + (4 * TEAM_BATTING_HR))
lt_model3 <- lm(TARGET_WINS ~ TEAM_BATTING_TOT_BASES + TEAM_BATTING_XTRA_BASE + 
    TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_PITCHING_HR + 
    log(TEAM_FIELDING_E) + TEAM_FIELDING_DP, df_new)

pander(summary(lt_model3))
stepReduced = model4  # the selected model (Model 4) used for prediction

par(mfrow = c(1, 3))
plot(stepReduced$residuals, main = NULL)
hist(stepReduced$residuals, main = NULL)
qqnorm(stepReduced$residuals, main = NULL)
qqline(stepReduced$residuals)
par(mfrow = c(1, 1))
# url_eval is assumed to point to the moneyball evaluation CSV
# (defined analogously to url_training; its definition is not shown)
evaluation_data <- read.csv(url_eval)
par(mfrow = c(1, 1), pin = c(2, 2))
predicted_wins <- predict(stepReduced, evaluation_data)
hist(predicted_wins)
training_wins = trainingdata_bk$TARGET_WINS
hist(training_wins, breaks = 8)
evaluation_data_imputed <- evaluation_data

for (i in 2:ncol(evaluation_data_imputed)) {
    evaluation_data_imputed[, i][is.na(evaluation_data_imputed[, 
        i])] <- median(evaluation_data_imputed[, i], na.rm = TRUE)
}

par(mfrow = c(1, 1), pin = c(3, 2))
predicted_wins_imputed <- predict(model4, evaluation_data_imputed)
hist(predicted_wins_imputed)
par(pin = c(2, 2))

for (var_count in 3:17) {
    scatter.smooth(x = trainingdata_bk[, var_count], y = dfrow4$TARGET_WINS, 
        xlab = names(dfrow4)[var_count], ylab = "Target Wins", 
        col = "#999999")
}

appendixB = data.frame(matrix(NA, nrow = 130, ncol = 4))
appendixB[, 1] = evaluation_data$INDEX[1:130]
appendixB[, 2] = predicted_wins[1:130]
appendixB[, 3] = c(evaluation_data$INDEX[131:259], NA)
appendixB[, 4] = c(predicted_wins[131:259], NA)
names(appendixB) = c("Index", "Predicted Value", "Index", "Predicted Value")
pander(appendixB)