Read in the Moneyball training data
df <- read.csv('/Users/dianaplunkett/CUNY/621 Business Analytics and Data Mining/Homework 1/moneyball-training-data.csv')
Confirming exact number of rows and columns
dim(df)
## [1] 2276 17
Getting the full summary, will refer back to this as I go
summary(df)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
Small thing to note here: the max of index is 2535, which is > than the # of rows, so the index is not purely sequential (some values are skipped).
Since some exploration and prep are some what iterative, we will both explore and address the NAs here, and then do some more exploration.
Though they are in the Summary above, I find it useful to look at how many values are not NA, and then only show those columns where that amount is < 2276 (the number of rows).
df_na <- colSums(! is.na(df))
(da_na <- df_na[df_na<2276])
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP
## 2174 2145 1504 191
## TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2174 1990
Diving into TEAM_BATTING_HBP (Batters hit by pitch (get a free base)) since that has the most NAs.
(191/2276)*100
## [1] 8.391916
8.4% of the rows have a value in this field.
Researching how common it is for a batter to get hit, 32,001 rows of pitchers by season of between 1876 and 2023 had at least 1 HB (Hit Batter) out of a total of 47,206 (~68%). Of the remaining rows, some where 0 and some where null. The range of years here is not exactly the same, but overlaps significantly. (Source: https://www.mlb.com/stats/pitching/hit-batsmen/all-time-by-season)
This rule dates from 1887. (Source: https://en.wikipedia.org/wiki/Hit_by_pitch) We might expect that there were no stats kept prior to that, each row of this data is meant to be the stats for a given year from 1871-2006.
Given the 2 facts above, we would expect to have significantly more than 8.4% of the rows populated. There is no way I can think of to input this data. Therefore, we will exclude it from our analysis. See section below with columns to exclude.
Next, we will look at TEAM_BASERUN_CS, (Caught Stealing).
(1504/2276)*100
## [1] 66.08084
It has a non NA value in about 2/3 of the rows (eg missing 1/3 rows)
Not sure if the moneyball data is real, max all time by season per MLB site is 42, and that is lower than the max in this data of 201, but also lower that the median 49 and mean 52. For simplicity, I am going to exclude this field too.
exclude_col = c("TEAM_BATTING_HBP", "TEAM_BASERUN_CS")
df<-df[, !(names(df) %in% exclude_col)]
dim(df)
## [1] 2276 15
TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_PITCHING_SO, TEAM_FIELDING_DP These fields are mostly populated (95.5%, 94.2%. 95.5%, 87.4%, respectively). To address the NAs, we will replace the NAs with the Mean.
#for fields TEAM_BATTING_SO, TEAM_BASERUN_SB, TEAM_PITCHING_SO, TEAM_FIELDING_DP
df$TEAM_BATTING_SO[is.na(df$TEAM_BATTING_SO)]<-mean(df$TEAM_BATTING_SO,na.rm=TRUE)
df$TEAM_BASERUN_SB[is.na(df$TEAM_BASERUN_SB)]<-mean(df$TEAM_BASERUN_SB,na.rm=TRUE)
df$TEAM_PITCHING_SO[is.na(df$TEAM_PITCHING_SO)]<-mean(df$TEAM_PITCHING_SO,na.rm=TRUE)
df$TEAM_FIELDING_DP[is.na(df$TEAM_FIELDING_DP)]<-mean(df$TEAM_FIELDING_DP,na.rm=TRUE)
plot_scatterplot(df[,-1], by ="TARGET_WINS")
Nothing super strong here, but TEAM_BASERUN_SB, TEAM_BATTING_2B, TEAM_BATTING_H look like they have positive correlations. And TEAM_FIELDING_E looks like it has a negative correlation. Oddly, TEAM_PITCHING_HR, which I would expect to have negative correlation, seems to have a positive on. Also, want to take a closer look at TEAM_BATTING_HR.
Let’s plot these 6 with a linear trend to look further
p1<- ggplot(df, aes(x=TEAM_BASERUN_SB, y=TARGET_WINS))+
geom_point()+
geom_smooth(method=lm , color="red", se=FALSE)
p2<- ggplot(df, aes(x=TEAM_BATTING_2B, y=TARGET_WINS))+
geom_point()+
geom_smooth(method=lm , color="red", se=FALSE)
p3<- ggplot(df, aes(x=TEAM_BATTING_H, y=TARGET_WINS))+
geom_point()+
geom_smooth(method=lm , color="red", se=FALSE)
p4<- ggplot(df, aes(x=TEAM_FIELDING_E, y=TARGET_WINS))+
geom_point()+
geom_smooth(method=lm , color="red", se=FALSE)
p5<- ggplot(df, aes(x=TEAM_PITCHING_HR, y=TARGET_WINS))+
geom_point()+
geom_smooth(method=lm , color="red", se=FALSE)
p6<- ggplot(df, aes(x=TEAM_BATTING_HR, y=TARGET_WINS))+
geom_point()+
geom_smooth(method=lm , color="red", se=FALSE)
#from library(gridExtra)
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 3)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Observations: TEAM_BATTING_H seems to have the strongest corr, followed by TEAM_FIELDING_E (a neg corr) and then TEAM_BATTING_2B.
#df[,-1] so we do not include the index
gather_df <- df[,-1] %>%
gather(key = 'variable', value = 'value')
ggplot(gather_df) +
geom_histogram(aes(x=value, y = ..density..), bins=30) +
geom_density(aes(x=value), color='red') +
facet_wrap(. ~variable, scales='free', ncol=5) +
theme(strip.text.x = element_text(size=5))
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Observations:
1. Many of the fields are skewed.
2. There are also some bimodal fields (with 2 peaks).
Typically in either of these situations (skewed or bimodal), we would look for a way to transform. See Side Note below for experiment with taking a log of the one of the skewed fields. It did not have an impact on the correlation, so we did not pursue this. For the bimodal, we do not have any data that might help us understand what might cause an observation to be in the low or high peak. Therefore we will keep this in mind as we progress, but not take any action on the data.
Took one of the skewed fields to see if taking the log would a. help the skewness b. incease the correlation
Histo of log(TEAM_FIELDING_E)
ggplot(df) +
geom_histogram(aes(x=log(TEAM_FIELDING_E)), bins=30)
Testing to see the impact on skewness
#type 2 uses moment based formula
skewness(df$TEAM_FIELDING_E, type=2)
## [1] 2.994411
#type 2 uses moment based formula
skewness(log(df$TEAM_FIELDING_E), type=2)
## [1] 1.25156
The skewness is halfed, but still skewed.
Testing to see if using the less skewed log has greater correlation.
df$log_FIELDING_E <- log(df$TEAM_FIELDING_E)
ggpairs(df[,c('TARGET_WINS','log_FIELDING_E', 'TEAM_FIELDING_E' )] )
corrplot(cor(df[,c('TARGET_WINS','log_FIELDING_E', 'TEAM_FIELDING_E' )]), type = 'lower', tl.col = 'black',
cl.ratio = 0.2, tl.srt = 45,tl.cex = 0.5)
It does not, so taking it out. Moving on.
df <- df[,-18]
In looking at this, it seems that it depends on having a Target that is binary. I could find no explanation for what to do in the case where the Target is numeric. Therefore, will create a field called More_W, that is True when the row has a higher number of wins than the mean. And then see how the Correlation Funnel works on that.
#create the MORE_Ws field
df$MORE_Ws <- df$TARGET_WINS > mean(df$TARGET_WINS)
#build the binary table for all the data except INDEX (the ID) and TARGET_WINS, now replaced as the target by MORE_Ws
df_bi_tbl <- df %>% select (-c("INDEX","TARGET_WINS") ) %>% binarize(n_bins = 4, thresh_infreq = 0.01)
#build corr table from the binary table
df_corr_tbl <- df_bi_tbl %>% correlate(target = MORE_Ws__1)
#plot the result
df_corr_tbl %>% plot_correlation_funnel(interactive = FALSE, limits = c(-0.2, 0.2))
## Warning: Removed 4 rows containing missing values (`geom_point()`).
## Warning: Removed 4 rows containing missing values (`geom_text_repel()`).
Observations:
- TEAM_BATTING_BB (walks by batters) has the strongest positive corr
when it is 580 or higher. It also has somewhat of a negative corr when
it is 451 or lower.
- TEAM_PITCHING_H (Hits allowed) has the strongest negative corr when it
is 1419 or lower. And a more positive corr when it is 1682.5 or higher.
This is counter intuitive, as a pitcher allowing more hits would seem
like it should have a neg corr with wins.
- TEAM_BATTING_2B (Doubles by batters) has a positive corr when it is
273 or higher. It also has somewhat of a negative corr when it is 208 or
lower.
- TEAM_PITCHING_BB (walks allowed) has a negative corr when it is 476 or
lower. And a more positive corr when it is 611 or higher. As with the
Hits Allowed, this is counter intuitive, as a pitcher allowing more
walks would seem like it should have a neg corr with wins.
- TEAM_FIELDING_E (Errors) has a positive corr when it low, 127 or
lower.
Before we make further changes to the data, let’s discuss how we intend to model.
Model 1: Use the most detailed level of fields. For example, TEAM_BATTING_H is the sum of TEAM_BATTING_2B/3B/HR + 1B (for which there is no field, but we could calc) - so here we will not use TEAM_BATTING_H, but will use TEAM_BATTING_2B/3B/HR + the calculated 1B.
Model 2: Use more summary fields. In addition to TEAM_BATTING_H, discussed above, also create the following: - the ratio of double plays to errors would be a summary of fielding stats. - the ratio of basehits to strikeouts for batters as a summary of batting - similarly, the ratio of strikeouts to hits could be a summary of pitching stats since, per my baseball SME, HR are included in H (Hits allowed).
Model 3: Since the three fields with the strongest corr (TEAM_BATTING_H, TEAM_FIELDING_E, TEAM_BATTING_2B.) also showed up in Correlation Funnel, let’s create binary fields around the highs and lows for those fields, plus the other 2 in the Corr Funnel Observations.
## For Model 1:
## Calc Singles by batters by subtracting doubles, triples and hr from the total base hits.
df$TEAM_BATTING_1B <- df$TEAM_BATTING_H - df$TEAM_BATTING_2B - df$TEAM_BATTING_3B - df$TEAM_BATTING_HR
## For Model 2:
## Create a ratio of double plays to errors
df$FIELDING <- df$TEAM_FIELDING_DP/df$TEAM_FIELDING_E
## Create a ratio of strikes to hits
df$BATTING <- df$TEAM_BATTING_SO/df$TEAM_BATTING_H
## Create a ratio of strikeouts to hits
df$PITCHING <- df$TEAM_PITCHING_SO/df$TEAM_PITCHING_H
## For Model 3:
df$BAT_BB_G <- with(df, ifelse(TEAM_BATTING_BB>=580,"G", "NG")) #Good / Not
df$BAT_BB_B <- with(df, ifelse(TEAM_BATTING_BB<=451, "B", "NB")) #Bad / Not
df$PITCH_H_G <- with(df, ifelse(TEAM_PITCHING_H>=1682.5, "G", "NG"))
df$PITCH_H_B <- with(df, ifelse(TEAM_PITCHING_H<=1419, "B", "NB"))
df$BAT_2B_G <- with(df, ifelse(TEAM_BATTING_2B>=273, "G", "NG"))
df$BAT_2B_B <- with(df, ifelse(TEAM_BATTING_2B<=208, "B", "NB"))
df$PITCH_BB_G <- with(df, ifelse(TEAM_PITCHING_BB>=611, "G", "NG"))
df$PITCH_BB_B <- with(df, ifelse(TEAM_PITCHING_BB<=476, "B", "NB"))
df$FIELD_E_G <- with(df, ifelse(TEAM_FIELDING_E<=127, "G", "NG"))
#note, only the G, positive corr with fielding errors.
# All will have TARGET_WINS, and will NOT have INDEX (a dumb number, not related to wins).
# df1 will have all the "detail" columns, so remove all the summary ones and the binaries. Also have to take out the MORE_Ws field we used in the Corr Funnel, and the log_FIELDING_E (from the side note)
df1<-df %>% select(-c(INDEX,TEAM_BATTING_H, FIELDING, BATTING, PITCHING, BAT_BB_G , BAT_BB_B , PITCH_H_G , PITCH_H_B, BAT_2B_G, BAT_2B_B, PITCH_BB_G, PITCH_BB_B, FIELD_E_G, MORE_Ws, log_FIELDING_E ))
# df2 will have only "summary" columns, that were removed above
df2<-df %>% select(c(TARGET_WINS,TEAM_BATTING_H, FIELDING, BATTING, PITCHING))
#df3 will have only the binaries
df3<-df %>% select(c(TARGET_WINS, BAT_BB_G , BAT_BB_B , PITCH_H_G , PITCH_H_B, BAT_2B_G, BAT_2B_B, PITCH_BB_G, PITCH_BB_B, FIELD_E_G))
lm1<-lm(TARGET_WINS ~ .,data=df1)
summary(lm1)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.906 -8.582 0.121 8.411 58.597
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.2467875 5.2929578 4.581 4.88e-06 ***
## TEAM_BATTING_2B 0.0278792 0.0073013 3.818 0.000138 ***
## TEAM_BATTING_3B 0.1090625 0.0158847 6.866 8.51e-12 ***
## TEAM_BATTING_HR 0.1032024 0.0273964 3.767 0.000169 ***
## TEAM_BATTING_BB 0.0105498 0.0058145 1.814 0.069749 .
## TEAM_BATTING_SO -0.0093113 0.0025500 -3.652 0.000267 ***
## TEAM_BASERUN_SB 0.0287308 0.0043400 6.620 4.47e-11 ***
## TEAM_PITCHING_H -0.0007465 0.0003672 -2.033 0.042189 *
## TEAM_PITCHING_HR 0.0141572 0.0243055 0.582 0.560309
## TEAM_PITCHING_BB 0.0001870 0.0041429 0.045 0.964010
## TEAM_PITCHING_SO 0.0028358 0.0009186 3.087 0.002046 **
## TEAM_FIELDING_E -0.0207258 0.0024210 -8.561 < 2e-16 ***
## TEAM_FIELDING_DP -0.1211711 0.0130165 -9.309 < 2e-16 ***
## TEAM_BATTING_1B 0.0482145 0.0036862 13.080 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.04 on 2262 degrees of freedom
## Multiple R-squared: 0.3188, Adjusted R-squared: 0.3149
## F-statistic: 81.42 on 13 and 2262 DF, p-value: < 2.2e-16
lm1v2<-lm(TARGET_WINS ~ . - TEAM_PITCHING_HR -TEAM_PITCHING_BB - TEAM_FIELDING_E, data=df1)
summary(lm1v2)
##
## Call:
## lm(formula = TARGET_WINS ~ . - TEAM_PITCHING_HR - TEAM_PITCHING_BB -
## TEAM_FIELDING_E, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.985 -8.323 0.234 8.706 50.713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.0123752 5.2034565 2.885 0.00395 **
## TEAM_BATTING_2B 0.0369817 0.0073343 5.042 4.96e-07 ***
## TEAM_BATTING_3B 0.0931079 0.0157565 5.909 3.96e-09 ***
## TEAM_BATTING_HR 0.1200267 0.0089252 13.448 < 2e-16 ***
## TEAM_BATTING_BB 0.0235079 0.0030441 7.722 1.70e-14 ***
## TEAM_BATTING_SO -0.0072639 0.0024840 -2.924 0.00349 **
## TEAM_BASERUN_SB 0.0174572 0.0041447 4.212 2.63e-05 ***
## TEAM_PITCHING_H -0.0021647 0.0002751 -7.868 5.54e-15 ***
## TEAM_PITCHING_SO 0.0032834 0.0006796 4.831 1.45e-06 ***
## TEAM_FIELDING_DP -0.1258597 0.0132005 -9.534 < 2e-16 ***
## TEAM_BATTING_1B 0.0472641 0.0037171 12.715 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.24 on 2265 degrees of freedom
## Multiple R-squared: 0.2967, Adjusted R-squared: 0.2936
## F-statistic: 95.55 on 10 and 2265 DF, p-value: < 2.2e-16
lm2<-lm(TARGET_WINS ~ .,data=df2)
summary(lm2)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.428 -9.122 0.424 9.635 47.402
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.535e+00 4.646e+00 -0.761 0.447
## TEAM_BATTING_H 5.156e-02 2.695e-03 19.134 < 2e-16 ***
## FIELDING 5.513e+00 8.658e-01 6.368 2.32e-10 ***
## BATTING -3.232e+02 5.937e+01 -5.444 5.76e-08 ***
## PITCHING 3.297e+02 5.907e+01 5.582 2.66e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.15 on 2271 degrees of freedom
## Multiple R-squared: 0.1943, Adjusted R-squared: 0.1929
## F-statistic: 136.9 on 4 and 2271 DF, p-value: < 2.2e-16
lm3<-lm(TARGET_WINS ~ .,data=df3)
summary(lm3)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = df3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.615 -8.946 0.320 9.721 63.346
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.1965 1.7276 46.998 < 2e-16 ***
## BAT_BB_GNG -6.1933 1.1216 -5.522 3.74e-08 ***
## BAT_BB_BNB 2.6608 1.2605 2.111 0.0349 *
## PITCH_H_GNG -4.2791 0.8656 -4.943 8.24e-07 ***
## PITCH_H_BNB 3.8016 0.8088 4.700 2.75e-06 ***
## BAT_2B_GNG -1.3012 0.8132 -1.600 0.1097
## BAT_2B_BNB 4.7377 0.8028 5.902 4.14e-09 ***
## PITCH_BB_GNG 2.2520 1.1483 1.961 0.0500 *
## PITCH_BB_BNB 0.9454 1.2031 0.786 0.4321
## FIELD_E_GNG -3.1409 0.7961 -3.946 8.20e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.82 on 2266 degrees of freedom
## Multiple R-squared: 0.118, Adjusted R-squared: 0.1145
## F-statistic: 33.68 on 9 and 2266 DF, p-value: < 2.2e-16
Could try to further refine
Trying one more thing with just the 3 fields that had good linear trend
df4<-df %>% select(c(TARGET_WINS,TEAM_BATTING_H,TEAM_FIELDING_E, TEAM_BATTING_2B))
lm4<-lm(TARGET_WINS ~ .,data=df4)
summary(lm4)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = df4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.855 -9.069 0.265 9.163 57.061
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.271243 3.022538 3.067 0.002185 **
## TEAM_BATTING_H 0.057341 0.002775 20.662 < 2e-16 ***
## TEAM_FIELDING_E -0.023250 0.001498 -15.519 < 2e-16 ***
## TEAM_BATTING_2B -0.029013 0.008506 -3.411 0.000659 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.75 on 2272 degrees of freedom
## Multiple R-squared: 0.239, Adjusted R-squared: 0.238
## F-statistic: 237.9 on 3 and 2272 DF, p-value: < 2.2e-16
(Including removing the least signif of the fields in model 1)