Data Exploration
## Observations: 2,276
## Variables: 17
## $ INDEX <int> 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 15, 16, 1...
## $ TARGET_WINS <int> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 6...
## $ TEAM_BATTING_H <int> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 127...
## $ TEAM_BATTING_2B <int> 194, 219, 232, 209, 186, 200, 179, 171, 197, ...
## $ TEAM_BATTING_3B <int> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 3...
## $ TEAM_BATTING_HR <int> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96,...
## $ TEAM_BATTING_BB <int> 143, 685, 602, 451, 472, 443, 525, 456, 447, ...
## $ TEAM_BATTING_SO <int> 842, 1075, 917, 922, 920, 973, 1062, 1027, 92...
## $ TEAM_BASERUN_SB <int> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, ...
## $ TEAM_BASERUN_CS <int> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 7...
## $ TEAM_BATTING_HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ TEAM_PITCHING_H <int> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 128...
## $ TEAM_PITCHING_HR <int> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96,...
## $ TEAM_PITCHING_BB <int> 927, 689, 602, 454, 472, 443, 525, 459, 447, ...
## $ TEAM_PITCHING_SO <int> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 9...
## $ TEAM_FIELDING_E <int> 1011, 193, 175, 164, 138, 123, 136, 112, 127,...
## $ TEAM_FIELDING_DP <int> NA, 155, 153, 156, 168, 149, 186, 136, 169, 1...
Using glimpse, we can see that there are 2276 observations and 17 variables in our training dataset. Of the 17 variables, it seems INDEX provides no additional value other than being a sorting/labelling mechanism for each observation. INDEX will be removed in the Data Preparation section.
Non-Visual Inspection
Variables Breakdown
- Response Variable: TARGET_WINS
- Explanatory Variables:
- 7 Batting variables
- 4 Pitching variables
- 2 Baserunning variables
- 2 Fielding variables
Basic Stats
Missing Values
The 6 fields shown in the table above having missing values.
Skew
The 4 fields shown in the table above have higher than average skew values, which provides evidence of outliers greatly effecting the mean of those fields
Correlation
- The correlations tell us that HITS have the highest impact on winning games
- There is some collinearity between some of the variables, especially the hitting variables
Visual Inspection
Density Plots
## No id variables; using all as measure variables
Box Plots
Box plots can provide a visual representation of the variance of the data * The box plots reveal that a great majority of the explanatory variables have high variances * Some of the variables contain extreme outliers that this graph does not show because i had to reduce the limits on the graph to get clear box plots * Many of the medians and means are also not aligned which demonstrates the outliers’ effects
Histograms
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
* The histograms reveal that very few of the variables are normally distributed * A few variables are multi-model * Some of the variable exhibit a lot of skew (e.g. BASERUN_SB)
Data Preparation
New Variable for Singles
Variable Removal
- Our dataset will be augmented by removing the fields with a large amount of NA values
- BATTING_H will be removed to reduce collinearity and replaced by BATTING_1B which is a calculated variable based on BATTING_H, BATTING_2B, BATTING_3B, BATTING_HR
- INDEX is removed because it has no meaning in the dataset
Missing Values Handling
missForest will be used to handle all missing data by using a random forest algorithm to replace the missing values with “forest” values
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
## TARGET_WINS BATTING_2B BATTING_3B BATTING_HR
## Min. : 0.00 Min. : 69.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:208.0 1st Qu.: 34.00 1st Qu.: 42.00
## Median : 82.00 Median :238.0 Median : 47.00 Median :102.00
## Mean : 80.79 Mean :241.2 Mean : 55.25 Mean : 99.61
## 3rd Qu.: 92.00 3rd Qu.:273.0 3rd Qu.: 72.00 3rd Qu.:147.00
## Max. :146.00 Max. :458.0 Max. :223.00 Max. :264.00
## BATTING_BB BATTING_SO BASERUN_SB PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 1137
## 1st Qu.:451.0 1st Qu.: 552.8 1st Qu.: 67.0 1st Qu.: 1419
## Median :512.0 Median : 732.0 Median :105.0 Median : 1518
## Mean :501.6 Mean : 730.7 Mean :130.9 Mean : 1779
## 3rd Qu.:580.0 3rd Qu.: 925.0 3rd Qu.:167.3 3rd Qu.: 1682
## Max. :878.0 Max. :1399.0 Max. :697.0 Max. :30132
## PITCHING_HR PITCHING_BB PITCHING_SO FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 623.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 798.0 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 811.9 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 957.2 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## FIELDING_DP BATTING_1B
## Min. : 52.0 Min. : 709.0
## 1st Qu.:123.9 1st Qu.: 990.8
## Median :145.0 Median :1050.0
## Mean :142.5 Mean :1073.2
## 3rd Qu.:161.2 3rd Qu.:1129.0
## Max. :228.0 Max. :2112.0
## NRMSE
## 0.1620713
- Through the summary() function, we can see that none of the fields have missing values any longer
Build Models
Model 1: All Variables
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = training.imp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.717 -8.363 0.250 8.157 55.762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.0126856 5.2286817 6.696 2.69e-11 ***
## BATTING_2B 0.0272885 0.0071304 3.827 0.000133 ***
## BATTING_3B 0.0725585 0.0158102 4.589 4.69e-06 ***
## BATTING_HR 0.1089497 0.0267497 4.073 4.80e-05 ***
## BATTING_BB 0.0117643 0.0056607 2.078 0.037799 *
## BATTING_SO -0.0159992 0.0025642 -6.239 5.22e-10 ***
## BASERUN_SB 0.0494434 0.0045167 10.947 < 2e-16 ***
## PITCHING_H 0.0001014 0.0003694 0.275 0.783720
## PITCHING_HR 0.0201248 0.0237022 0.849 0.395934
## PITCHING_BB -0.0036261 0.0040563 -0.894 0.371444
## PITCHING_SO 0.0025649 0.0008940 2.869 0.004157 **
## FIELDING_E -0.0335008 0.0025113 -13.340 < 2e-16 ***
## FIELDING_DP -0.1270950 0.0134964 -9.417 < 2e-16 ***
## BATTING_1B 0.0443855 0.0036090 12.298 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.73 on 2262 degrees of freedom
## Multiple R-squared: 0.3507, Adjusted R-squared: 0.347
## F-statistic: 93.99 on 13 and 2262 DF, p-value: < 2.2e-16
- Model 1 has 9/13 statistically significant variables at the 5% significance level
- Interestingly, FIELDING_DP has a negative impact on wins. This may be because it would mean the opposing team is getting hits
- The rest of the variables align as one would expect to win contribution
- BATTING_SO, PITCHING_BB, FIELDING_E have negative impacts on TARGET_WINS as expected
- PITCHING_HR suprisingly has a positive impact on TARGET_WINS. It is possible there is a confounding variable affecting this coefficient
- BATTING_HR has the highest impact on wins as one would expect
- This model also shows that giving up home-runs is not as big a detriment as one may think
- An R^2 of .3512 indicates that there may be room to improve the model
Model 2: Only Significant Variables
##
## Call:
## lm(formula = TARGET_WINS ~ . - BATTING_BB - PITCHING_H - PITCHING_HR -
## PITCHING_BB - PITCHING_SO, data = training.imp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.371 -8.381 0.149 8.350 57.961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.074093 4.873844 8.017 1.72e-15 ***
## BATTING_2B 0.031294 0.007033 4.450 9.02e-06 ***
## BATTING_3B 0.073225 0.015519 4.718 2.52e-06 ***
## BATTING_HR 0.131861 0.008220 16.042 < 2e-16 ***
## BATTING_SO -0.013695 0.002285 -5.993 2.39e-09 ***
## BASERUN_SB 0.051956 0.004162 12.483 < 2e-16 ***
## FIELDING_E -0.034223 0.001737 -19.708 < 2e-16 ***
## FIELDING_DP -0.119183 0.013281 -8.974 < 2e-16 ***
## BATTING_1B 0.042500 0.003519 12.077 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.77 on 2267 degrees of freedom
## Multiple R-squared: 0.3455, Adjusted R-squared: 0.3432
## F-statistic: 149.6 on 8 and 2267 DF, p-value: < 2.2e-16
- Model 2 removes the non-significant variables
- The R^2 is lower than Model 1 at .3459 however this may be acceptable due to removing the confounding variables
- FIELDING_DP still has a negative impact on wins
- The rest of the variables align as one would expect to win contribution
- BATTING_SO, FIELDING_E have negative impacts on TARGET_WINS as expected
- BATTING_HR has the highest impact on wins as one would expect
Select Models
Based on the R^2 and the removal of statistically insignificant variables, Model 2 is the ideal model to use. Model 3’s R^2 was simply far too low and reintroduced statistically insignificant variables. Model 1 provides a great benchmark for R^2 that Model 2 comes close to achieving
Evaluation
* The QQ plot shows slight deviation from normal towards the extremities however this can be excused due to the sheer amount of observations * The residual plot indicates that there is no constant variance * It is likely that further transformations could be made to the data to better meet regression appropriateness requirements
Test Model
## INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 9 Min. : 819 Min. : 44.0 Min. : 14.00
## 1st Qu.: 708 1st Qu.:1387 1st Qu.:210.0 1st Qu.: 35.00
## Median :1249 Median :1455 Median :239.0 Median : 52.00
## Mean :1264 Mean :1469 Mean :241.3 Mean : 55.91
## 3rd Qu.:1832 3rd Qu.:1548 3rd Qu.:278.5 3rd Qu.: 72.00
## Max. :2525 Max. :2170 Max. :376.0 Max. :155.00
##
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 15.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 44.50 1st Qu.:436.5 1st Qu.: 545.0 1st Qu.: 59.0
## Median :101.00 Median :509.0 Median : 686.0 Median : 92.0
## Mean : 95.63 Mean :499.0 Mean : 709.3 Mean :123.7
## 3rd Qu.:135.50 3rd Qu.:565.5 3rd Qu.: 912.0 3rd Qu.:151.8
## Max. :242.00 Max. :792.0 Max. :1268.0 Max. :580.0
## NA's :18 NA's :13
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.00 Min. :42.00 Min. : 1155 Min. : 0.0
## 1st Qu.: 38.00 1st Qu.:53.50 1st Qu.: 1426 1st Qu.: 52.0
## Median : 49.50 Median :62.00 Median : 1515 Median :104.0
## Mean : 52.32 Mean :62.37 Mean : 1813 Mean :102.1
## 3rd Qu.: 63.00 3rd Qu.:67.50 3rd Qu.: 1681 3rd Qu.:142.5
## Max. :154.00 Max. :96.00 Max. :22768 Max. :336.0
## NA's :87 NA's :240
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 136.0 Min. : 0.0 Min. : 73.0 Min. : 69.0
## 1st Qu.: 471.0 1st Qu.: 613.0 1st Qu.: 131.0 1st Qu.:131.0
## Median : 526.0 Median : 745.0 Median : 163.0 Median :148.0
## Mean : 552.4 Mean : 799.7 Mean : 249.7 Mean :146.1
## 3rd Qu.: 606.5 3rd Qu.: 938.0 3rd Qu.: 252.0 3rd Qu.:164.0
## Max. :2008.0 Max. :9963.0 Max. :1568.0 Max. :204.0
## NA's :18 NA's :31
- INDEX can be removed from the data
- The NA values will be handled with missForest similar to our training set
- BATTING_1B will be added and BATTING_H removed
- TEAM_BATTING_HBP, TEAM_BASERUN_CS will be removed
Transform Test Data
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
Predict
Code Appendix
knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(tidy = TRUE)
knitr::opts_chunk$set(warning = FALSE)
libs <- c("tidyverse", "magrittr", "knitr", "kableExtra", "fBasics", "reshape2",
"missForest")
loadPkg <- function(x) {
if (!require(x, character.only = T))
install.packages(x, dependencies = T)
require(x, character.only = T)
}
lapply(libs, loadPkg)
# load data
trainingdata <- read_csv("https://raw.githubusercontent.com/baroncurtin2/data621/master/week1/moneyball-training-data.csv",
col_names = T, col_types = NULL, na = c("", "NA"))
testdata <- read_csv("https://raw.githubusercontent.com/baroncurtin2/data621/master/week1/moneyball-evaluation-data.csv",
col_names = T, col_types = NULL, na = c("", "NA"))
glimpse(trainingdata)
data_frame(variables = names(trainingdata)) %>% mutate(variables = str_replace(variables,
"^([[:alnum:]]+?_{1})([[:alnum:]]+?)(_{1}[[:alnum:]]+?)$", "\\2")) %>% group_by(variables) %>%
summarise(count = n()) %>% arrange(desc(count))
trainingStats <- basicStats(trainingdata)[c("nobs", "NAs", "Minimum", "Maximum",
"1. Quartile", "3. Quartile", "Mean", "Median", "Variance", "Stdev", "Skewness",
"Kurtosis"), ] %>% as.tibble() %>% rownames_to_column() %>% gather(var,
value, -rowname) %>% spread(rowname, value) %>% rename_all(str_to_lower) %>%
rename_all(str_trim) %>% rename(variables = "var", q1 = `1. quartile`, q3 = `3. quartile`,
max = maximum, min = minimum, na_vals = nas, n = nobs, sd = stdev, var = variance) %>%
mutate(obs = n - na_vals, range = max - min, iqr = q3 - q1) %>% select(variables,
n, na_vals, obs, mean, min, q1, median, q3, max, sd, var, range, iqr, skewness,
kurtosis) %>% as.tibble()
trainingStats
trainingStats %>% dplyr::filter(na_vals > 0) %>% select(variables, na_vals,
obs)
trainingStats %>% mutate(mean = mean(skewness)) %>% dplyr::filter(skewness >
mean(skewness)) %>% select(variables, skewness, mean)
trainingdata %>% mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B -
TEAM_BATTING_3B - TEAM_BATTING_HR) %>% cor(use = "na.or.complete") %>% as.data.frame() %>%
rownames_to_column(var = "predictor") %>% as_data_frame() %>% select(predictor,
TARGET_WINS) %>% dplyr::filter(!predictor %in% c("INDEX", "TARGET_WINS")) %>%
arrange(desc(TARGET_WINS))
# data frame for visuals
vis <- melt(trainingdata) %>% dplyr::filter(variable != "INDEX") %>% mutate(variable = str_replace(variable,
"TEAM_", ""))
ggplot(vis, aes(value)) + geom_density(fill = "skyblue") + facet_wrap(~variable,
scales = "free")
ggplot(vis, aes(x = variable, y = value)) + geom_boxplot(show.legend = T) +
stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) +
coord_flip() + ylim(0, 2200)
ggplot(vis, aes(value)) + geom_histogram() + facet_wrap(~variable, scales = "free")
remove_string <- function(x, remove) {
str_replace(x, remove, "")
}
training <- trainingdata %>% # singles
mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B -
TEAM_BATTING_HR) %>% # remove 'TEAM_'
rename_all(remove_string, remove = "TEAM_")
training %<>% # remove fields with large amount of NAs
select(-c("BATTING_HBP", "BASERUN_CS")) %>% # remove all hits to reduce collinearity
select(-BATTING_H) %>% # remove INDEX
select(-INDEX)
training.forest <- training %>% as.data.frame() %>% missForest()
training.imp <- training.forest$ximp
# imputed values
summary(training.imp)
# imputation error
training.forest$OOBerror
m1 <- lm(TARGET_WINS ~ ., data = training.imp)
summary(m1)
m2 <- lm(TARGET_WINS ~ . - BATTING_BB - PITCHING_H - PITCHING_HR - PITCHING_BB -
PITCHING_SO, data = training.imp)
summary(m2)
trainingdata %>% mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B -
TEAM_BATTING_3B - TEAM_BATTING_HR) %>% cor(use = "na.or.complete") %>% as.data.frame() %>%
rownames_to_column(var = "predictor") %>% as_data_frame() %>% select(predictor,
TARGET_WINS) %>% dplyr::filter(!predictor %in% c("INDEX", "TARGET_WINS")) %>%
dplyr::filter(TARGET_WINS > mean(TARGET_WINS)) %>% arrange(desc(TARGET_WINS))
m3 <- lm(TARGET_WINS ~ PITCHING_H + BATTING_BB + PITCHING_BB + PITCHING_HR +
BATTING_HR + BATTING_2B + BATTING_1B, data = training.imp)
summary(m3)
par(mfrow = c(2, 2))
plot(m2)
summary(testdata)
testdata %<>% # drop useless variables
select(-c("INDEX", "TEAM_BATTING_HBP", "TEAM_BASERUN_CS")) %>% # add BATTING_1B
mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B -
TEAM_BATTING_HR) %>% # remove 'TEAM_'
rename_all(remove_string, remove = "TEAM_")
test.forest <- testdata %>% as.data.frame() %>% missForest()
test.imp <- test.forest$ximp
test_results <- predict(m2, newdata = test.imp)
bind_cols(data.frame(TARGET_WINS = test_results), test.imp)