Data 621 - Week1

Data Exploration

## Observations: 2,276
## Variables: 17
## $ INDEX            <int> 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 15, 16, 1...
## $ TARGET_WINS      <int> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 6...
## $ TEAM_BATTING_H   <int> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 127...
## $ TEAM_BATTING_2B  <int> 194, 219, 232, 209, 186, 200, 179, 171, 197, ...
## $ TEAM_BATTING_3B  <int> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 3...
## $ TEAM_BATTING_HR  <int> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96,...
## $ TEAM_BATTING_BB  <int> 143, 685, 602, 451, 472, 443, 525, 456, 447, ...
## $ TEAM_BATTING_SO  <int> 842, 1075, 917, 922, 920, 973, 1062, 1027, 92...
## $ TEAM_BASERUN_SB  <int> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, ...
## $ TEAM_BASERUN_CS  <int> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 7...
## $ TEAM_BATTING_HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ TEAM_PITCHING_H  <int> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 128...
## $ TEAM_PITCHING_HR <int> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96,...
## $ TEAM_PITCHING_BB <int> 927, 689, 602, 454, 472, 443, 525, 459, 447, ...
## $ TEAM_PITCHING_SO <int> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 9...
## $ TEAM_FIELDING_E  <int> 1011, 193, 175, 164, 138, 123, 136, 112, 127,...
## $ TEAM_FIELDING_DP <int> NA, 155, 153, 156, 168, 149, 186, 136, 169, 1...

Using glimpse, we can see that there are 2276 observations and 17 variables in our training dataset. Of the 17 variables, it seems INDEX provides no additional value other than being a sorting/labelling mechanism for each observation. INDEX will be removed in the Data Preparation section.

Non-Visual Inspection

Variables Breakdown

Response Variable: TARGET_WINS
Explanatory Variables:
- 7 Batting variables
- 4 Pitching variables
- 2 Baserunning variables
- 2 Fielding variables

Basic Stats

Missing Values

The 6 fields shown in the table above having missing values.

Skew

The 4 fields shown in the table above have higher than average skew values, which provides evidence of outliers greatly effecting the mean of those fields

Correlation

The correlations tell us that HITS have the highest impact on winning games
There is some collinearity between some of the variables, especially the hitting variables

Visual Inspection

Density Plots

## No id variables; using all as measure variables

Box Plots

Box plots can provide a visual representation of the variance of the data * The box plots reveal that a great majority of the explanatory variables have high variances * Some of the variables contain extreme outliers that this graph does not show because i had to reduce the limits on the graph to get clear box plots * Many of the medians and means are also not aligned which demonstrates the outliers’ effects

Histograms

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

* The histograms reveal that very few of the variables are normally distributed * A few variables are multi-model * Some of the variable exhibit a lot of skew (e.g. BASERUN_SB)

Data Preparation

New Variable for Singles

Variable Removal

Our dataset will be augmented by removing the fields with a large amount of NA values
BATTING_H will be removed to reduce collinearity and replaced by BATTING_1B which is a calculated variable based on BATTING_H, BATTING_2B, BATTING_3B, BATTING_HR
INDEX is removed because it has no meaning in the dataset

Missing Values Handling

missForest will be used to handle all missing data by using a random forest algorithm to replace the missing values with “forest” values

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!

##   TARGET_WINS       BATTING_2B      BATTING_3B       BATTING_HR    
##  Min.   :  0.00   Min.   : 69.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:208.0   1st Qu.: 34.00   1st Qu.: 42.00  
##  Median : 82.00   Median :238.0   Median : 47.00   Median :102.00  
##  Mean   : 80.79   Mean   :241.2   Mean   : 55.25   Mean   : 99.61  
##  3rd Qu.: 92.00   3rd Qu.:273.0   3rd Qu.: 72.00   3rd Qu.:147.00  
##  Max.   :146.00   Max.   :458.0   Max.   :223.00   Max.   :264.00  
##    BATTING_BB      BATTING_SO       BASERUN_SB      PITCHING_H   
##  Min.   :  0.0   Min.   :   0.0   Min.   :  0.0   Min.   : 1137  
##  1st Qu.:451.0   1st Qu.: 552.8   1st Qu.: 67.0   1st Qu.: 1419  
##  Median :512.0   Median : 732.0   Median :105.0   Median : 1518  
##  Mean   :501.6   Mean   : 730.7   Mean   :130.9   Mean   : 1779  
##  3rd Qu.:580.0   3rd Qu.: 925.0   3rd Qu.:167.3   3rd Qu.: 1682  
##  Max.   :878.0   Max.   :1399.0   Max.   :697.0   Max.   :30132  
##   PITCHING_HR     PITCHING_BB      PITCHING_SO        FIELDING_E    
##  Min.   :  0.0   Min.   :   0.0   Min.   :    0.0   Min.   :  65.0  
##  1st Qu.: 50.0   1st Qu.: 476.0   1st Qu.:  623.0   1st Qu.: 127.0  
##  Median :107.0   Median : 536.5   Median :  798.0   Median : 159.0  
##  Mean   :105.7   Mean   : 553.0   Mean   :  811.9   Mean   : 246.5  
##  3rd Qu.:150.0   3rd Qu.: 611.0   3rd Qu.:  957.2   3rd Qu.: 249.2  
##  Max.   :343.0   Max.   :3645.0   Max.   :19278.0   Max.   :1898.0  
##   FIELDING_DP      BATTING_1B    
##  Min.   : 52.0   Min.   : 709.0  
##  1st Qu.:123.9   1st Qu.: 990.8  
##  Median :145.0   Median :1050.0  
##  Mean   :142.5   Mean   :1073.2  
##  3rd Qu.:161.2   3rd Qu.:1129.0  
##  Max.   :228.0   Max.   :2112.0

##     NRMSE 
## 0.1620713

Through the summary() function, we can see that none of the fields have missing values any longer

Build Models

Model 1: All Variables

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = training.imp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.717  -8.363   0.250   8.157  55.762 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 35.0126856  5.2286817   6.696 2.69e-11 ***
## BATTING_2B   0.0272885  0.0071304   3.827 0.000133 ***
## BATTING_3B   0.0725585  0.0158102   4.589 4.69e-06 ***
## BATTING_HR   0.1089497  0.0267497   4.073 4.80e-05 ***
## BATTING_BB   0.0117643  0.0056607   2.078 0.037799 *  
## BATTING_SO  -0.0159992  0.0025642  -6.239 5.22e-10 ***
## BASERUN_SB   0.0494434  0.0045167  10.947  < 2e-16 ***
## PITCHING_H   0.0001014  0.0003694   0.275 0.783720    
## PITCHING_HR  0.0201248  0.0237022   0.849 0.395934    
## PITCHING_BB -0.0036261  0.0040563  -0.894 0.371444    
## PITCHING_SO  0.0025649  0.0008940   2.869 0.004157 ** 
## FIELDING_E  -0.0335008  0.0025113 -13.340  < 2e-16 ***
## FIELDING_DP -0.1270950  0.0134964  -9.417  < 2e-16 ***
## BATTING_1B   0.0443855  0.0036090  12.298  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.73 on 2262 degrees of freedom
## Multiple R-squared:  0.3507, Adjusted R-squared:  0.347 
## F-statistic: 93.99 on 13 and 2262 DF,  p-value: < 2.2e-16

Model 1 has 9/13 statistically significant variables at the 5% significance level
Interestingly, FIELDING_DP has a negative impact on wins. This may be because it would mean the opposing team is getting hits
The rest of the variables align as one would expect to win contribution
- BATTING_SO, PITCHING_BB, FIELDING_E have negative impacts on TARGET_WINS as expected
- PITCHING_HR suprisingly has a positive impact on TARGET_WINS. It is possible there is a confounding variable affecting this coefficient
- BATTING_HR has the highest impact on wins as one would expect
This model also shows that giving up home-runs is not as big a detriment as one may think
An R^2 of .3512 indicates that there may be room to improve the model

Model 2: Only Significant Variables

## 
## Call:
## lm(formula = TARGET_WINS ~ . - BATTING_BB - PITCHING_H - PITCHING_HR - 
##     PITCHING_BB - PITCHING_SO, data = training.imp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.371  -8.381   0.149   8.350  57.961 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.074093   4.873844   8.017 1.72e-15 ***
## BATTING_2B   0.031294   0.007033   4.450 9.02e-06 ***
## BATTING_3B   0.073225   0.015519   4.718 2.52e-06 ***
## BATTING_HR   0.131861   0.008220  16.042  < 2e-16 ***
## BATTING_SO  -0.013695   0.002285  -5.993 2.39e-09 ***
## BASERUN_SB   0.051956   0.004162  12.483  < 2e-16 ***
## FIELDING_E  -0.034223   0.001737 -19.708  < 2e-16 ***
## FIELDING_DP -0.119183   0.013281  -8.974  < 2e-16 ***
## BATTING_1B   0.042500   0.003519  12.077  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.77 on 2267 degrees of freedom
## Multiple R-squared:  0.3455, Adjusted R-squared:  0.3432 
## F-statistic: 149.6 on 8 and 2267 DF,  p-value: < 2.2e-16

Model 2 removes the non-significant variables
The R^2 is lower than Model 1 at .3459 however this may be acceptable due to removing the confounding variables
FIELDING_DP still has a negative impact on wins
The rest of the variables align as one would expect to win contribution
- BATTING_SO, FIELDING_E have negative impacts on TARGET_WINS as expected
- BATTING_HR has the highest impact on wins as one would expect

Model 3: Highly Correlated Variables

## 
## Call:
## lm(formula = TARGET_WINS ~ PITCHING_H + BATTING_BB + PITCHING_BB + 
##     PITCHING_HR + BATTING_HR + BATTING_2B + BATTING_1B, data = training.imp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.559  -9.075   0.210   9.073  47.821 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.0022694  3.5289286  -0.001 0.999487    
## PITCHING_H  -0.0027828  0.0003394  -8.198 4.04e-16 ***
## BATTING_BB   0.0087439  0.0047829   1.828 0.067656 .  
## PITCHING_BB  0.0120497  0.0031610   3.812 0.000142 ***
## PITCHING_HR  0.0033868  0.0240691   0.141 0.888110    
## BATTING_HR   0.0587833  0.0265321   2.216 0.026821 *  
## BATTING_2B   0.0373918  0.0075320   4.964 7.41e-07 ***
## BATTING_1B   0.0554074  0.0030795  17.992  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.83 on 2268 degrees of freedom
## Multiple R-squared:  0.232,  Adjusted R-squared:  0.2296 
## F-statistic: 97.86 on 7 and 2268 DF,  p-value: < 2.2e-16

Of the three models, Model 3 has the lowest R^2 at .232
Model 3 also reintroduces variables that are not statistically significant
BATTING_3B is not introduced into this model as it did have a negative correlation
PITCHING_H is the only variable that has a negative impact on TARGET_WINS
The rest of the variables align as one would expect to win contribution
- BATTING_HR has the highest impact on wins as one would expect
- BATTING_1B has the second highest impact on wins, a lot more than BATTING_2B which is unexpected
- PITCHING_HR has a very small positive impact on wins. This is counterintuitive as giving up runs should increase the chance of losing (decrease the chance of winning)

Select Models

Based on the R^2 and the removal of statistically insignificant variables, Model 2 is the ideal model to use. Model 3’s R^2 was simply far too low and reintroduced statistically insignificant variables. Model 1 provides a great benchmark for R^2 that Model 2 comes close to achieving

Evaluation

* The QQ plot shows slight deviation from normal towards the extremities however this can be excused due to the sheer amount of observations * The residual plot indicates that there is no constant variance * It is likely that further transformations could be made to the data to better meet regression appropriateness requirements

Test Model

##      INDEX      TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B 
##  Min.   :   9   Min.   : 819   Min.   : 44.0   Min.   : 14.00  
##  1st Qu.: 708   1st Qu.:1387   1st Qu.:210.0   1st Qu.: 35.00  
##  Median :1249   Median :1455   Median :239.0   Median : 52.00  
##  Mean   :1264   Mean   :1469   Mean   :241.3   Mean   : 55.91  
##  3rd Qu.:1832   3rd Qu.:1548   3rd Qu.:278.5   3rd Qu.: 72.00  
##  Max.   :2525   Max.   :2170   Max.   :376.0   Max.   :155.00  
##                                                                
##  TEAM_BATTING_HR  TEAM_BATTING_BB TEAM_BATTING_SO  TEAM_BASERUN_SB
##  Min.   :  0.00   Min.   : 15.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 44.50   1st Qu.:436.5   1st Qu.: 545.0   1st Qu.: 59.0  
##  Median :101.00   Median :509.0   Median : 686.0   Median : 92.0  
##  Mean   : 95.63   Mean   :499.0   Mean   : 709.3   Mean   :123.7  
##  3rd Qu.:135.50   3rd Qu.:565.5   3rd Qu.: 912.0   3rd Qu.:151.8  
##  Max.   :242.00   Max.   :792.0   Max.   :1268.0   Max.   :580.0  
##                                   NA's   :18       NA's   :13     
##  TEAM_BASERUN_CS  TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
##  Min.   :  0.00   Min.   :42.00    Min.   : 1155   Min.   :  0.0   
##  1st Qu.: 38.00   1st Qu.:53.50    1st Qu.: 1426   1st Qu.: 52.0   
##  Median : 49.50   Median :62.00    Median : 1515   Median :104.0   
##  Mean   : 52.32   Mean   :62.37    Mean   : 1813   Mean   :102.1   
##  3rd Qu.: 63.00   3rd Qu.:67.50    3rd Qu.: 1681   3rd Qu.:142.5   
##  Max.   :154.00   Max.   :96.00    Max.   :22768   Max.   :336.0   
##  NA's   :87       NA's   :240                                      
##  TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E  TEAM_FIELDING_DP
##  Min.   : 136.0   Min.   :   0.0   Min.   :  73.0   Min.   : 69.0   
##  1st Qu.: 471.0   1st Qu.: 613.0   1st Qu.: 131.0   1st Qu.:131.0   
##  Median : 526.0   Median : 745.0   Median : 163.0   Median :148.0   
##  Mean   : 552.4   Mean   : 799.7   Mean   : 249.7   Mean   :146.1   
##  3rd Qu.: 606.5   3rd Qu.: 938.0   3rd Qu.: 252.0   3rd Qu.:164.0   
##  Max.   :2008.0   Max.   :9963.0   Max.   :1568.0   Max.   :204.0   
##                   NA's   :18                        NA's   :31

INDEX can be removed from the data
The NA values will be handled with missForest similar to our training set
BATTING_1B will be added and BATTING_H removed
TEAM_BATTING_HBP, TEAM_BASERUN_CS will be removed

Transform Test Data

##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!

Predict

Code Appendix

knitr::opts_chunk$set(echo = FALSE)
knitr::opts_chunk$set(tidy = TRUE)
knitr::opts_chunk$set(warning = FALSE)

libs <- c("tidyverse", "magrittr", "knitr", "kableExtra", "fBasics", "reshape2", 
    "missForest")

loadPkg <- function(x) {
    if (!require(x, character.only = T)) 
        install.packages(x, dependencies = T)
    require(x, character.only = T)
}

lapply(libs, loadPkg)
# load data
trainingdata <- read_csv("https://raw.githubusercontent.com/baroncurtin2/data621/master/week1/moneyball-training-data.csv", 
    col_names = T, col_types = NULL, na = c("", "NA"))
testdata <- read_csv("https://raw.githubusercontent.com/baroncurtin2/data621/master/week1/moneyball-evaluation-data.csv", 
    col_names = T, col_types = NULL, na = c("", "NA"))
glimpse(trainingdata)
data_frame(variables = names(trainingdata)) %>% mutate(variables = str_replace(variables, 
    "^([[:alnum:]]+?_{1})([[:alnum:]]+?)(_{1}[[:alnum:]]+?)$", "\\2")) %>% group_by(variables) %>% 
    summarise(count = n()) %>% arrange(desc(count))
trainingStats <- basicStats(trainingdata)[c("nobs", "NAs", "Minimum", "Maximum", 
    "1. Quartile", "3. Quartile", "Mean", "Median", "Variance", "Stdev", "Skewness", 
    "Kurtosis"), ] %>% as.tibble() %>% rownames_to_column() %>% gather(var, 
    value, -rowname) %>% spread(rowname, value) %>% rename_all(str_to_lower) %>% 
    rename_all(str_trim) %>% rename(variables = "var", q1 = `1. quartile`, q3 = `3. quartile`, 
    max = maximum, min = minimum, na_vals = nas, n = nobs, sd = stdev, var = variance) %>% 
    mutate(obs = n - na_vals, range = max - min, iqr = q3 - q1) %>% select(variables, 
    n, na_vals, obs, mean, min, q1, median, q3, max, sd, var, range, iqr, skewness, 
    kurtosis) %>% as.tibble()

trainingStats
trainingStats %>% dplyr::filter(na_vals > 0) %>% select(variables, na_vals, 
    obs)
trainingStats %>% mutate(mean = mean(skewness)) %>% dplyr::filter(skewness > 
    mean(skewness)) %>% select(variables, skewness, mean)
trainingdata %>% mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - 
    TEAM_BATTING_3B - TEAM_BATTING_HR) %>% cor(use = "na.or.complete") %>% as.data.frame() %>% 
    rownames_to_column(var = "predictor") %>% as_data_frame() %>% select(predictor, 
    TARGET_WINS) %>% dplyr::filter(!predictor %in% c("INDEX", "TARGET_WINS")) %>% 
    arrange(desc(TARGET_WINS))
# data frame for visuals
vis <- melt(trainingdata) %>% dplyr::filter(variable != "INDEX") %>% mutate(variable = str_replace(variable, 
    "TEAM_", ""))

ggplot(vis, aes(value)) + geom_density(fill = "skyblue") + facet_wrap(~variable, 
    scales = "free")
ggplot(vis, aes(x = variable, y = value)) + geom_boxplot(show.legend = T) + 
    stat_summary(fun.y = mean, color = "red", geom = "point", shape = 18, size = 3) + 
    coord_flip() + ylim(0, 2200)
ggplot(vis, aes(value)) + geom_histogram() + facet_wrap(~variable, scales = "free")
remove_string <- function(x, remove) {
    str_replace(x, remove, "")
}

training <- trainingdata %>% # singles
mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - 
    TEAM_BATTING_HR) %>% # remove 'TEAM_'
rename_all(remove_string, remove = "TEAM_")
training %<>% # remove fields with large amount of NAs
select(-c("BATTING_HBP", "BASERUN_CS")) %>% # remove all hits to reduce collinearity
select(-BATTING_H) %>% # remove INDEX
select(-INDEX)
training.forest <- training %>% as.data.frame() %>% missForest()

training.imp <- training.forest$ximp
# imputed values
summary(training.imp)

# imputation error
training.forest$OOBerror
m1 <- lm(TARGET_WINS ~ ., data = training.imp)
summary(m1)
m2 <- lm(TARGET_WINS ~ . - BATTING_BB - PITCHING_H - PITCHING_HR - PITCHING_BB - 
    PITCHING_SO, data = training.imp)
summary(m2)
trainingdata %>% mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - 
    TEAM_BATTING_3B - TEAM_BATTING_HR) %>% cor(use = "na.or.complete") %>% as.data.frame() %>% 
    rownames_to_column(var = "predictor") %>% as_data_frame() %>% select(predictor, 
    TARGET_WINS) %>% dplyr::filter(!predictor %in% c("INDEX", "TARGET_WINS")) %>% 
    dplyr::filter(TARGET_WINS > mean(TARGET_WINS)) %>% arrange(desc(TARGET_WINS))

m3 <- lm(TARGET_WINS ~ PITCHING_H + BATTING_BB + PITCHING_BB + PITCHING_HR + 
    BATTING_HR + BATTING_2B + BATTING_1B, data = training.imp)
summary(m3)
par(mfrow = c(2, 2))
plot(m2)
summary(testdata)
testdata %<>% # drop useless variables
select(-c("INDEX", "TEAM_BATTING_HBP", "TEAM_BASERUN_CS")) %>% # add BATTING_1B
mutate(TEAM_BATTING_1B = TEAM_BATTING_H - TEAM_BATTING_2B - TEAM_BATTING_3B - 
    TEAM_BATTING_HR) %>% # remove 'TEAM_'
rename_all(remove_string, remove = "TEAM_")
test.forest <- testdata %>% as.data.frame() %>% missForest()

test.imp <- test.forest$ximp
test_results <- predict(m2, newdata = test.imp)

bind_cols(data.frame(TARGET_WINS = test_results), test.imp)