Introduction
The purpose of this assignment is to develop a linear model that can “reliably” predict a team’s number of wins (TARGET_WINS) from variables covering batting (BATTING), baserunning (BASERUN), fielding (FIELDING), and pitching (PITCHING). Some of the variables have a positive impact on TARGET_WINS, while others have a negative impact.
Data Exploration
## Observations: 2,276
## Variables: 17
## $ INDEX <int> 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 15, 16, 1...
## $ TARGET_WINS <int> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 6...
## $ TEAM_BATTING_H <int> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 127...
## $ TEAM_BATTING_2B <int> 194, 219, 232, 209, 186, 200, 179, 171, 197, ...
## $ TEAM_BATTING_3B <int> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 3...
## $ TEAM_BATTING_HR <int> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96,...
## $ TEAM_BATTING_BB <int> 143, 685, 602, 451, 472, 443, 525, 456, 447, ...
## $ TEAM_BATTING_SO <int> 842, 1075, 917, 922, 920, 973, 1062, 1027, 92...
## $ TEAM_BASERUN_SB <int> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, ...
## $ TEAM_BASERUN_CS <int> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 7...
## $ TEAM_BATTING_HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ TEAM_PITCHING_H <int> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 128...
## $ TEAM_PITCHING_HR <int> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96,...
## $ TEAM_PITCHING_BB <int> 927, 689, 602, 454, 472, 443, 525, 459, 447, ...
## $ TEAM_PITCHING_SO <int> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 9...
## $ TEAM_FIELDING_E <int> 1011, 193, 175, 164, 138, 123, 136, 112, 127,...
## $ TEAM_FIELDING_DP <int> NA, 155, 153, 156, 168, 149, 186, 136, 169, 1...
Using glimpse, we can see that there are 2,276 observations and 17 variables in our training dataset. Of the 17 variables, INDEX provides no value beyond sorting and labelling each observation, so it will be removed in the Data Preparation section. Also of note, there is no variable for singles, so the variable TEAM_BATTING_1B will be created in the Data Preparation section.
Non-Visual Inspection
Variables Breakdown
- Response Variable: TARGET_WINS
- Explanatory Variables:
- 7 Batting variables
- 4 Pitching variables
- 2 Baserunning variables
- 2 Fielding variables
Basic Stats
- basicStats shows that TEAM_BATTING_HBP has the most missing values
- There is very strong skew in the TEAM_PITCHING_SO variable
- TEAM_PITCHING_H has the highest variance amongst the variables
Missing Values
The 6 fields shown in the table above have missing values.
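As a minimal sketch of how missing values can be counted per field, the snippet below uses base R on the first five rows of three columns copied from the glimpse output above. The name `sample_rows` is illustrative only, and the report itself takes these counts from basicStats.

```r
# First five rows of three columns, copied from the glimpse() output above
sample_rows <- data.frame(
  TARGET_WINS      = c(39, 70, 86, 70, 82),
  TEAM_BASERUN_SB  = c(NA, 37, 46, 43, 49),
  TEAM_BATTING_HBP = c(NA, NA, NA, NA, NA)
)

# Count missing values per column, then keep only the columns that have any
na_counts <- colSums(is.na(sample_rows))
na_counts[na_counts > 0]
```

Run against the full training data, the same two lines would list every field with at least one NA.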
Skew
The 4 fields shown in the table above have higher-than-average skewness, which is evidence that outliers are strongly affecting the means of those fields.
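To illustrate how a single outlier drives both skewness and the mean, here is a small sketch using the first eight TEAM_PITCHING_SO values from the glimpse output. The helper `sample_skewness` is a hand-rolled moment-based estimator written for this example; it is an assumption that it is close to, but not necessarily identical to, the estimator basicStats reports.

```r
# Moment-based sample skewness (illustrative helper, not the fBasics function)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# First eight TEAM_PITCHING_SO values from the glimpse() output
pitching_so <- c(5456, 1082, 917, 928, 920, 973, 1062, 1033)

skew_value   <- sample_skewness(pitching_so)  # positive: long right tail
mean_value   <- mean(pitching_so)             # pulled upward by 5456
median_value <- median(pitching_so)           # barely affected by 5456

c(skewness = skew_value, mean = mean_value, median = median_value)
```

The gap between the mean and the median is the same effect the skewness table points to.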
Correlation
- The correlations tell us that hits (TEAM_BATTING_H) have the strongest positive correlation with winning games
- There is some collinearity among the variables, especially the BATTING variables
- FIELDING_E has the strongest negative correlation with wins
- TEAM_BATTING_3B has an anomalous negative correlation with TARGET_WINS
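The correlations above come from `cor()` with `use = "na.or.complete"`, which drops every row containing an NA before computing the matrix. The sketch below shows that call on the first seven rows of four columns from the glimpse output; `sample_rows` is an illustrative name, and seven rows are far too few to say anything about the real relationships.

```r
# First seven rows of four columns, copied from the glimpse() output
sample_rows <- data.frame(
  TARGET_WINS     = c(39, 70, 86, 70, 82, 75, 80),
  TEAM_BATTING_H  = c(1445, 1339, 1377, 1387, 1297, 1279, 1244),
  TEAM_FIELDING_E = c(1011, 193, 175, 164, 138, 123, 136),
  TEAM_BASERUN_SB = c(NA, 37, 46, 43, 49, 107, 80)
)

# Rows with any NA (here the first row) are removed before correlating
cor_matrix    <- cor(sample_rows, use = "na.or.complete")
cor_with_wins <- cor_matrix[, "TARGET_WINS"]

# Rank the predictors by their correlation with wins
sort(cor_with_wins[names(cor_with_wins) != "TARGET_WINS"], decreasing = TRUE)
```

Because whole rows are discarded, fields with many missing values (such as TEAM_BATTING_HBP) sharply reduce the number of observations that contribute to every correlation in the matrix.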
Visual Inspection
Density Plots
## No id variables; using all as measure variables
* The density plots show various issues with skew and non-normality
Box Plots
Box plots can provide a visual representation of the variance of the data.
* The box plots reveal that a great majority of the explanatory variables have high variances
* Some of the variables contain extreme outliers that this graph does not show because I had to reduce the limits on the graph to get clear box plots
* Many of the medians and means are also not aligned, which demonstrates the outliers’ effects
Histograms
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
* The histograms reveal that very few of the variables are normally distributed
* A few variables are multi-modal
* Some of the variables exhibit a lot of skew (e.g. BASERUN_SB)
Data Preparation
New Variable for Singles
- The variable TEAM_BATTING_1B was created and added to the dataset
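As a worked example of the singles calculation, the sketch below applies it to the first three rows of the glimpse output: singles are total hits minus doubles, triples, and home runs. The data frame name `sample_rows` is illustrative only.

```r
# First three rows of the hit columns, copied from the glimpse() output
sample_rows <- data.frame(
  TEAM_BATTING_H  = c(1445, 1339, 1377),
  TEAM_BATTING_2B = c(194, 219, 232),
  TEAM_BATTING_3B = c(39, 22, 35),
  TEAM_BATTING_HR = c(13, 190, 137)
)

# Singles = all hits - doubles - triples - home runs
sample_rows$TEAM_BATTING_1B <- with(
  sample_rows,
  TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - TEAM_BATTING_HR
)

sample_rows$TEAM_BATTING_1B
# 1199  908  973
```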
Variable Removal
- Our dataset will be reduced by removing the fields with a large number of NA values (BATTING_HBP, BASERUN_CS)
- BATTING_H will be removed to reduce collinearity and replaced by BATTING_1B, a variable calculated from BATTING_H, BATTING_2B, BATTING_3B, and BATTING_HR
- INDEX is removed because it has no meaning in the dataset
Missing Values Handling
missForest will be used to handle all missing data. It fits a random forest on the observed values and replaces each missing value with the prediction of that forest (the “forest” value)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
## TARGET_WINS BATTING_2B BATTING_3B BATTING_HR
## Min. : 0.00 Min. : 69.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:208.0 1st Qu.: 34.00 1st Qu.: 42.00
## Median : 82.00 Median :238.0 Median : 47.00 Median :102.00
## Mean : 80.79 Mean :241.2 Mean : 55.25 Mean : 99.61
## 3rd Qu.: 92.00 3rd Qu.:273.0 3rd Qu.: 72.00 3rd Qu.:147.00
## Max. :146.00 Max. :458.0 Max. :223.00 Max. :264.00
## BATTING_BB BATTING_SO BASERUN_SB PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 1137
## 1st Qu.:451.0 1st Qu.: 553.8 1st Qu.: 67.0 1st Qu.: 1419
## Median :512.0 Median : 733.6 Median :105.0 Median : 1518
## Mean :501.6 Mean : 731.1 Mean :131.4 Mean : 1779
## 3rd Qu.:580.0 3rd Qu.: 925.0 3rd Qu.:168.0 3rd Qu.: 1682
## Max. :878.0 Max. :1399.0 Max. :697.0 Max. :30132
## PITCHING_HR PITCHING_BB PITCHING_SO FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 622.8 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 800.0 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 812.5 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 957.2 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## FIELDING_DP BATTING_1B
## Min. : 52.0 Min. : 709.0
## 1st Qu.:124.0 1st Qu.: 990.8
## Median :145.0 Median :1050.0
## Mean :142.5 Mean :1073.2
## 3rd Qu.:161.2 3rd Qu.:1129.0
## Max. :228.0 Max. :2112.0
## NRMSE
## 0.1690248
- Through the summary() function, we can see that none of the fields have missing values any longer
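The NRMSE value printed above is the imputation error estimate returned by missForest. As a rough sketch of what a normalized root mean squared error measures, the function below follows the definition as I understand it from the missForest documentation (squared error divided by the variance of the true values, then square-rooted). The vectors `truth` and `imputed` are hypothetical numbers invented for illustration; in the report the true values are unknown, which is why missForest reports an out-of-bag estimate instead.

```r
# Normalized RMSE: 0 means perfect imputation, values near 1 mean the
# imputation is about as good as guessing the mean
nrmse <- function(truth, imputed) {
  sqrt(mean((truth - imputed)^2) / var(truth))
}

# Hypothetical true and imputed values (not taken from the dataset)
truth   <- c(37, 46, 43, 49, 107)
imputed <- c(40, 44, 45, 50, 95)

error_value <- nrmse(truth, imputed)
error_value
```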
Build Models
Model 1: All Variables
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = training.imp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.228 -8.368 0.245 8.169 56.129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.1338541 5.2295323 6.718 2.32e-11 ***
## BATTING_2B 0.0269828 0.0071176 3.791 0.000154 ***
## BATTING_3B 0.0703391 0.0157969 4.453 8.89e-06 ***
## BATTING_HR 0.1093807 0.0266965 4.097 4.33e-05 ***
## BATTING_BB 0.0122318 0.0056539 2.163 0.030612 *
## BATTING_SO -0.0163691 0.0025608 -6.392 1.98e-10 ***
## BASERUN_SB 0.0513281 0.0045031 11.398 < 2e-16 ***
## PITCHING_H 0.0001662 0.0003695 0.450 0.652948
## PITCHING_HR 0.0202085 0.0236584 0.854 0.393097
## PITCHING_BB -0.0038747 0.0040465 -0.958 0.338407
## PITCHING_SO 0.0025331 0.0008926 2.838 0.004579 **
## FIELDING_E -0.0343813 0.0025215 -13.635 < 2e-16 ***
## FIELDING_DP -0.1270072 0.0135170 -9.396 < 2e-16 ***
## BATTING_1B 0.0444243 0.0036077 12.314 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.71 on 2262 degrees of freedom
## Multiple R-squared: 0.353, Adjusted R-squared: 0.3493
## F-statistic: 94.94 on 13 and 2262 DF, p-value: < 2.2e-16
- Model 1 has 10/13 statistically significant variables at the 5% significance level; PITCHING_H, PITCHING_HR, and PITCHING_BB are not significant
- Interestingly, FIELDING_DP has a negative coefficient. This may be because a team can only turn many double plays when the opposing team is frequently getting runners on base
- Most of the remaining variables align as one would expect with win contribution
- BATTING_SO, PITCHING_BB, and FIELDING_E have negative coefficients as expected, although PITCHING_BB is not statistically significant
- PITCHING_HR surprisingly has a positive coefficient. It is not statistically significant (p = 0.39), and it is possible there is a confounding variable affecting this coefficient
- BATTING_HR has the largest positive coefficient, as one would expect
- This model also suggests that giving up home runs is not as big a detriment as one may think
- An R^2 of 0.353 (adjusted 0.3493) indicates that there may be room to improve the model
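The significance count can be checked directly against the p-values printed in the Model 1 summary. The sketch below copies those p-values into a named vector; entries printed as `< 2e-16` are entered as `2e-16`, which is an upper bound rather than the exact value and does not affect the 5% comparison.

```r
# Pr(>|t|) column from the Model 1 summary (intercept omitted)
p_values <- c(
  BATTING_2B  = 0.000154, BATTING_3B  = 8.89e-06, BATTING_HR  = 4.33e-05,
  BATTING_BB  = 0.030612, BATTING_SO  = 1.98e-10, BASERUN_SB  = 2e-16,
  PITCHING_H  = 0.652948, PITCHING_HR = 0.393097, PITCHING_BB = 0.338407,
  PITCHING_SO = 0.004579, FIELDING_E  = 2e-16,    FIELDING_DP = 2e-16,
  BATTING_1B  = 2e-16
)

significant     <- p_values < 0.05
n_significant   <- sum(significant)
not_significant <- names(p_values)[!significant]

n_significant     # 10
not_significant   # "PITCHING_H" "PITCHING_HR" "PITCHING_BB"
```

With a fitted model object, the same column is available as `coef(summary(m1))[, "Pr(>|t|)"]`.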
Model 2: Only Significant Variables
##
## Call:
## lm(formula = TARGET_WINS ~ . - BATTING_BB - PITCHING_H - PITCHING_HR -
## PITCHING_BB - PITCHING_SO, data = training.imp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.973 -8.425 0.217 8.357 58.993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.325318 4.873133 8.070 1.13e-15 ***
## BATTING_2B 0.031075 0.007020 4.427 1.00e-05 ***
## BATTING_3B 0.070845 0.015513 4.567 5.22e-06 ***
## BATTING_HR 0.132679 0.008206 16.168 < 2e-16 ***
## BATTING_SO -0.014136 0.002286 -6.183 7.42e-10 ***
## BASERUN_SB 0.053578 0.004156 12.892 < 2e-16 ***
## FIELDING_E -0.034946 0.001749 -19.985 < 2e-16 ***
## FIELDING_DP -0.119045 0.013309 -8.944 < 2e-16 ***
## BATTING_1B 0.042590 0.003516 12.112 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.74 on 2267 degrees of freedom
## Multiple R-squared: 0.3477, Adjusted R-squared: 0.3454
## F-statistic: 151.1 on 8 and 2267 DF, p-value: < 2.2e-16
- Model 2 drops BATTING_BB, PITCHING_H, PITCHING_HR, PITCHING_BB, and PITCHING_SO. In the Model 1 output above, PITCHING_H, PITCHING_HR, and PITCHING_BB are not significant at the 5% level, while BATTING_BB and PITCHING_SO are
- The R^2 is lower than Model 1 at 0.3477 (adjusted 0.3454); however, this may be acceptable given the simpler model with five fewer variables
- FIELDING_DP still has a negative coefficient
- The rest of the variables align as one would expect with win contribution
- BATTING_SO and FIELDING_E have negative coefficients as expected
- BATTING_HR has the largest positive coefficient, as one would expect
Model 4: Log Transformation of All Variables
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = log.training.imp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.19757 -0.10250 0.00983 0.11102 1.13168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.968571 0.495831 3.970 7.40e-05 ***
## BATTING_2B 0.210777 0.027546 7.652 2.91e-14 ***
## BATTING_3B 0.084437 0.011491 7.348 2.80e-13 ***
## BATTING_HR 0.061957 0.040872 1.516 0.12969
## BATTING_BB -0.226663 0.075563 -3.000 0.00273 **
## BATTING_SO -0.448545 0.082954 -5.407 7.08e-08 ***
## BASERUN_SB 0.094801 0.007841 12.090 < 2e-16 ***
## PITCHING_H -0.626786 0.065329 -9.594 < 2e-16 ***
## PITCHING_HR 0.039215 0.036149 1.085 0.27811
## PITCHING_BB 0.357020 0.070431 5.069 4.32e-07 ***
## PITCHING_SO 0.348896 0.069819 4.997 6.26e-07 ***
## FIELDING_E -0.213173 0.014975 -14.235 < 2e-16 ***
## FIELDING_DP -0.289188 0.027421 -10.546 < 2e-16 ***
## BATTING_1B 1.003225 0.066876 15.001 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1792 on 2262 degrees of freedom
## Multiple R-squared: 0.4393, Adjusted R-squared: 0.4361
## F-statistic: 136.4 on 13 and 2262 DF, p-value: < 2.2e-16
- Model 4 has the highest R^2 at 0.4393 (adjusted 0.4361)
- Negative impact on wins
  - BATTING_BB, BATTING_SO, PITCHING_H, FIELDING_E, FIELDING_DP
  - The anomalies are BATTING_BB and FIELDING_DP, for which one would expect positive coefficients
- Positive impact on wins
  - BATTING_1B has the highest impact on TARGET_WINS
  - PITCHING_BB has an anomalously high impact on TARGET_WINS. Giving up bases should have a negative impact on wins. It is possible that walks are issued deliberately to hitters who are particularly dangerous when pitched to directly
- The home run variables (BATTING_HR, PITCHING_HR) are not statistically significant
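Because both the response and the predictors are log-transformed in Model 4, each coefficient can be read approximately as an elasticity: the percentage change in wins associated with a 1% change in the predictor, holding the others fixed. The sketch below works this out for two coefficients from the summary above. The helper `pct_change_in_wins` is introduced here for illustration, and the interpretation is only approximate because zeros were replaced with 0 after the log transform.

```r
# Coefficients copied from the Model 4 summary
beta_batting_1b <- 1.003225
beta_fielding_e <- -0.213173

# Percent change in predicted wins for a given percent increase in a predictor
pct_change_in_wins <- function(beta, pct_increase = 1) {
  ((1 + pct_increase / 100)^beta - 1) * 100
}

effect_1b <- pct_change_in_wins(beta_batting_1b)  # about +1.0% wins per +1% singles
effect_e  <- pct_change_in_wins(beta_fielding_e)  # about -0.2% wins per +1% errors

c(BATTING_1B = effect_1b, FIELDING_E = effect_e)
```

This is also why the coefficient magnitudes in Model 4 are not comparable with those of Models 1 and 2, which are expressed in wins per unit of each statistic.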
Model 5: Square Root Transform
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = sq.training.imp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.9077 -0.4649 0.0193 0.4798 2.9646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.997850 0.616711 8.104 8.61e-16 ***
## BATTING_2B 0.046920 0.013088 3.585 0.000344 ***
## BATTING_3B 0.085895 0.013982 6.143 9.52e-10 ***
## BATTING_HR 0.122499 0.037217 3.292 0.001012 **
## BATTING_BB -0.005277 0.024159 -0.218 0.827124
## BATTING_SO -0.085678 0.012879 -6.653 3.60e-11 ***
## BASERUN_SB 0.084028 0.006548 12.832 < 2e-16 ***
## PITCHING_H -0.016636 0.004786 -3.476 0.000519 ***
## PITCHING_HR 0.018839 0.033135 0.569 0.569711
## PITCHING_BB 0.018206 0.020442 0.891 0.373235
## PITCHING_SO 0.035931 0.008745 4.109 4.12e-05 ***
## FIELDING_E -0.102975 0.006904 -14.915 < 2e-16 ***
## FIELDING_DP -0.201195 0.018958 -10.613 < 2e-16 ***
## BATTING_1B 0.180457 0.014793 12.199 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7304 on 2262 degrees of freedom
## Multiple R-squared: 0.3881, Adjusted R-squared: 0.3846
## F-statistic: 110.4 on 13 and 2262 DF, p-value: < 2.2e-16
- Model 5 has an R^2 of 0.3881 (adjusted 0.3846)
- Negative impact on wins
  - BATTING_BB, BATTING_SO, PITCHING_H, FIELDING_E, FIELDING_DP
  - The anomalies are BATTING_BB and FIELDING_DP, for which one would expect positive coefficients
- Positive impact on wins
  - BATTING_1B has the largest positive coefficient
- PITCHING_HR, PITCHING_BB, and BATTING_BB are not statistically significant
Select Models
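Based on the R^2, Model 4 is the preferred model and the best predictor for TARGET_WINS among the models considered. Its R^2 was 0.4393. Because Models 4 and 5 are fit to transformed values of TARGET_WINS, their R^2 values are measured on a different scale from those of Models 1 and 2, so the comparison should be read as approximate. Model 3’s R^2 was simply far too low and reintroduced statistically insignificant variables. Model 1 provides a benchmark R^2 (0.353) that Model 2 comes close to achieving (0.3477). Model 5 was only able to achieve an R^2 of 0.3881.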
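Since the models use different numbers of predictors, the adjusted R^2 is the fairer figure to compare. As a check, the sketch below recomputes it from the Multiple R-squared values printed in the four summaries, using n = 2276 observations and the predictor counts implied by the degrees of freedom. The function name `adjusted_r2` is illustrative.

```r
# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n  <- 2276
r2 <- c(model1 = 0.353, model2 = 0.3477, model4 = 0.4393, model5 = 0.3881)
p  <- c(model1 = 13,    model2 = 8,      model4 = 13,     model5 = 13)

adj <- round(adjusted_r2(r2, n, p), 4)
adj
# model1 model2 model4 model5
# 0.3493 0.3454 0.4361 0.3846
```

The recomputed values match the Adjusted R-squared lines in the summaries, and the ranking of the models is the same under either measure.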
Evaluation
* The QQ plot shows slight deviation from normality at the extremes; however, this can be excused given the large number of observations
* The residual plot indicates that the variance of the residuals is not constant
* The histogram shows that the residuals are approximately normally distributed
Test Model
## INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 9 Min. : 819 Min. : 44.0 Min. : 14.00
## 1st Qu.: 708 1st Qu.:1387 1st Qu.:210.0 1st Qu.: 35.00
## Median :1249 Median :1455 Median :239.0 Median : 52.00
## Mean :1264 Mean :1469 Mean :241.3 Mean : 55.91
## 3rd Qu.:1832 3rd Qu.:1548 3rd Qu.:278.5 3rd Qu.: 72.00
## Max. :2525 Max. :2170 Max. :376.0 Max. :155.00
##
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 15.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 44.50 1st Qu.:436.5 1st Qu.: 545.0 1st Qu.: 59.0
## Median :101.00 Median :509.0 Median : 686.0 Median : 92.0
## Mean : 95.63 Mean :499.0 Mean : 709.3 Mean :123.7
## 3rd Qu.:135.50 3rd Qu.:565.5 3rd Qu.: 912.0 3rd Qu.:151.8
## Max. :242.00 Max. :792.0 Max. :1268.0 Max. :580.0
## NA's :18 NA's :13
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.00 Min. :42.00 Min. : 1155 Min. : 0.0
## 1st Qu.: 38.00 1st Qu.:53.50 1st Qu.: 1426 1st Qu.: 52.0
## Median : 49.50 Median :62.00 Median : 1515 Median :104.0
## Mean : 52.32 Mean :62.37 Mean : 1813 Mean :102.1
## 3rd Qu.: 63.00 3rd Qu.:67.50 3rd Qu.: 1681 3rd Qu.:142.5
## Max. :154.00 Max. :96.00 Max. :22768 Max. :336.0
## NA's :87 NA's :240
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 136.0 Min. : 0.0 Min. : 73.0 Min. : 69.0
## 1st Qu.: 471.0 1st Qu.: 613.0 1st Qu.: 131.0 1st Qu.:131.0
## Median : 526.0 Median : 745.0 Median : 163.0 Median :148.0
## Mean : 552.4 Mean : 799.7 Mean : 249.7 Mean :146.1
## 3rd Qu.: 606.5 3rd Qu.: 938.0 3rd Qu.: 252.0 3rd Qu.:164.0
## Max. :2008.0 Max. :9963.0 Max. :1568.0 Max. :204.0
## NA's :18 NA's :31
- INDEX can be removed from the data
- The NA values will be handled with missForest similar to our training set
- BATTING_1B will be added and BATTING_H removed
- TEAM_BATTING_HBP, TEAM_BASERUN_CS will be removed
Transform Test Data
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
## missForest iteration 6 in progress...done!
Predict
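Because Model 4 was fit on log-transformed variables, new data has to be log-transformed before calling `predict()`, and the predictions come back on the log scale. The sketch below shows that round trip on simulated data; the data frame `toy`, its columns, and the coefficients used to simulate it are all made up for illustration and are not the report's data.

```r
set.seed(1)

# Simulated data: wins depend on hits through a log-log relationship
toy <- data.frame(hits = runif(50, 900, 1500))
toy$wins <- exp(0.5 + 0.55 * log(toy$hits) + rnorm(50, sd = 0.05))

# Fit on the log scale, as Model 4 does
log_toy <- log(toy)
fit <- lm(wins ~ hits, data = log_toy)

# New observations arrive on the raw scale
new_raw <- data.frame(hits = c(1000, 1200))

# 1. apply the same transform to the predictors
pred_log <- predict(fit, newdata = log(new_raw))
# 2. undo the transform on the predictions to get wins
pred_wins <- exp(pred_log)

pred_wins
```

Exponentiating a log-scale prediction gives an estimate closer to the median than the mean of wins, so a bias correction may be worth considering if mean wins are the target.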
Code Appendix
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(tidy = TRUE)
knitr::opts_chunk$set(warning = FALSE)
libs <- c("tidyverse", "magrittr", "knitr", "kableExtra", "fBasics", "reshape2",
    "missForest")
# install a package if it is missing, then attach it
loadPkg <- function(x) {
    if (!require(x, character.only = TRUE))
        install.packages(x, dependencies = TRUE)
    require(x, character.only = TRUE)
}
lapply(libs, loadPkg)
# load data
trainingdata <- read_csv("https://raw.githubusercontent.com/baroncurtin2/data621/master/week1/moneyball-training-data.csv",
col_names = T, col_types = NULL, na = c("", "NA"))
testdata <- read_csv("https://raw.githubusercontent.com/baroncurtin2/data621/master/week1/moneyball-evaluation-data.csv",
col_names = T, col_types = NULL, na = c("", "NA"))
glimpse(trainingdata)
# count variables by category (BATTING, PITCHING, ...) using the middle token of each name
tibble(variables = names(trainingdata)) %>%
    mutate(variables = str_replace(variables,
        "^([[:alnum:]]+?_{1})([[:alnum:]]+?)(_{1}[[:alnum:]]+?)$", "\\2")) %>%
    group_by(variables) %>%
    summarise(count = n()) %>%
    arrange(desc(count))
trainingStats <- basicStats(trainingdata)[c("nobs", "NAs", "Minimum", "Maximum",
    "1. Quartile", "3. Quartile", "Mean", "Median", "Variance", "Stdev", "Skewness",
    "Kurtosis"), ] %>%
    # move the statistic names into a column before as_tibble(), which drops
    # row names in current tibble versions
    rownames_to_column() %>%
    as_tibble() %>%
    # reshape to one row per variable, one column per statistic
    gather(var, value, -rowname) %>%
    spread(rowname, value) %>%
    rename_all(str_to_lower) %>%
    rename_all(str_trim) %>%
    rename(variables = "var", q1 = `1. quartile`, q3 = `3. quartile`,
        max = maximum, min = minimum, na_vals = nas, n = nobs, sd = stdev, var = variance) %>%
    mutate(obs = n - na_vals, range = max - min, iqr = q3 - q1) %>%
    select(variables, n, na_vals, obs, mean, min, q1, median, q3, max, sd, var,
        range, iqr, skewness, kurtosis)
trainingStats
trainingStats %>% dplyr::filter(na_vals > 0) %>% select(variables, na_vals,
obs) %>% arrange(desc(na_vals))
trainingStats %>% mutate(mean = mean(skewness)) %>% dplyr::filter(skewness >
mean(skewness)) %>% select(variables, skewness, mean)
# correlation of each predictor with TARGET_WINS, strongest positive first
trainingdata %>%
    mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B -
        TEAM_BATTING_3B - TEAM_BATTING_HR) %>%
    cor(use = "na.or.complete") %>%
    as.data.frame() %>%
    rownames_to_column(var = "predictor") %>%
    as_tibble() %>%
    select(predictor, TARGET_WINS) %>%
    dplyr::filter(!predictor %in% c("INDEX", "TARGET_WINS")) %>%
    arrange(desc(TARGET_WINS))
# data frame for visuals
vis <- melt(trainingdata) %>% dplyr::filter(variable != "INDEX") %>% mutate(variable = str_replace(variable,
"TEAM_", ""))
ggplot(vis, aes(value)) + geom_density(fill = "skyblue") + facet_wrap(~variable,
scales = "free")
# `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
ggplot(vis, aes(x = variable, y = value)) + geom_boxplot(show.legend = T) +
    stat_summary(fun = mean, color = "red", geom = "point", shape = 18, size = 3) +
    coord_flip() + ylim(0, 2200)
ggplot(vis, aes(value)) + geom_histogram() + facet_wrap(~variable, scales = "free")
remove_string <- function(x, remove) {
str_replace(x, remove, "")
}
training <- trainingdata %>%
    # singles
    mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B -
        TEAM_BATTING_HR) %>%
    # remove 'TEAM_'
    rename_all(remove_string, remove = "TEAM_")
training %<>%
    # remove fields with large amount of NAs
    select(-c("BATTING_HBP", "BASERUN_CS")) %>%
    # remove all hits to reduce collinearity
    select(-BATTING_H) %>%
    # remove INDEX
    select(-INDEX)
training.forest <- training %>% as.data.frame() %>% missForest()
training.imp <- training.forest$ximp
# imputed values
summary(training.imp)
# imputation error
training.forest$OOBerror
m1 <- lm(TARGET_WINS ~ ., data = training.imp)
summary(m1)
m2 <- lm(TARGET_WINS ~ . - BATTING_BB - PITCHING_H - PITCHING_HR - PITCHING_BB -
PITCHING_SO, data = training.imp)
summary(m2)
# predictors whose correlation with TARGET_WINS is above the average correlation
trainingdata %>%
    mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B -
        TEAM_BATTING_3B - TEAM_BATTING_HR) %>%
    cor(use = "na.or.complete") %>%
    as.data.frame() %>%
    rownames_to_column(var = "predictor") %>%
    as_tibble() %>%
    select(predictor, TARGET_WINS) %>%
    dplyr::filter(!predictor %in% c("INDEX", "TARGET_WINS")) %>%
    dplyr::filter(TARGET_WINS > mean(TARGET_WINS)) %>%
    arrange(desc(TARGET_WINS))
m3 <- lm(TARGET_WINS ~ PITCHING_H + BATTING_BB + PITCHING_BB + PITCHING_HR +
BATTING_HR + BATTING_2B + BATTING_1B, data = training.imp)
summary(m3)
# log(0) is -Inf; replace it (and any other negative log value) with 0
remove_negInf <- function(x) {
    if_else(x < 0, 0, x)
}
log.training.imp <- training.imp %>%
    # log transform
    mutate_all(log) %>%
    # replace -Inf with 0
    mutate_all(remove_negInf)
m4 <- lm(TARGET_WINS ~ ., data = log.training.imp)
summary(m4)
sq.training.imp <- training.imp %>%
    # sqrt transform
    mutate_all(sqrt)
m5 <- lm(TARGET_WINS ~ ., data = sq.training.imp)
summary(m5)
par(mfrow = c(2, 2))
plot(m4)
hist(m4$residuals)
qqnorm(m4$residuals)
qqline(m4$residuals)
summary(testdata)
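testdata %<>%
    # drop unused variables
    select(-c("INDEX", "TEAM_BATTING_HBP", "TEAM_BASERUN_CS")) %>%
    # add BATTING_1B
    mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B -
        TEAM_BATTING_HR) %>%
    # remove BATTING_H, as was done for the training set
    select(-TEAM_BATTING_H) %>%
    # remove 'TEAM_'
    rename_all(remove_string, remove = "TEAM_")
test.forest <- testdata %>% as.data.frame() %>% missForest()
test.imp <- test.forest$ximp
# m4 was fit on log-transformed data, so apply the same transform to the test set
log.test.imp <- test.imp %>% mutate_all(log) %>% mutate_all(remove_negInf)
# predictions are log(wins); exponentiate to return to the original scale
test_results <- exp(predict(m4, newdata = log.test.imp))
bind_cols(data.frame(TARGET_WINS = test_results), test.imp)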