MSDS 410 EDA

#1 EDA

The most important outcome of a baseball season is the number of wins a team finishes with, so teams should use stats to maximize their win total. In this exercise, I will comb through the different stats to find the strongest predictor of wins. Since we are mostly concerned with wins, let’s take a look at wins first to see what we are dealing with.

     Mean Median Standard Deviation Max Min
[1,]   81     82              15.75 146   0

Kurtosis

[1] 4.031017

Skewness

[1] -0.3989861

The distribution of wins is relatively normal. The mean is 81 wins, which is a 50% winning percentage over a 162-game schedule. The kurtosis is around 4. If this is the conventional (non-excess) kurtosis, for which a normal distribution scores 3, that suggests a distribution somewhat more peaked and heavier-tailed than normal. The skewness is -0.4, suggesting a slight left skew, which makes sense considering our mean is slightly lower than the median.
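As a minimal sketch of how summary numbers like these can be computed, here is a base-R version. The report does not show which functions produced the figures above, so the moment-based `skewness()` and `kurtosis()` helpers and the `wins` vector are assumptions for illustration only.

```r
# Hypothetical win totals for illustration; not the assignment data.
wins <- c(60, 72, 75, 81, 82, 85, 90, 95, 100)

# Moment-based skewness and (non-excess) kurtosis.
# A normal distribution scores 0 for skewness and 3 for kurtosis.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2
}

# Same five summary figures as the table above.
wins_summary <- c(mean = mean(wins), median = median(wins), sd = sd(wins),
                  max = max(wins), min = min(wins))
round(wins_summary, 2)
skewness(wins)
kurtosis(wins)
```

Some packages report excess kurtosis (kurtosis minus 3) instead, so it is worth checking which convention a function uses before interpreting the number.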

There are outliers in the wins column. The maximum is 146 wins, which cannot be true because the record for wins in a season is 116, by the 2001 Seattle Mariners. The minimum is 0, which almost certainly has never happened either. We will take care of these outliers in the next section.

Now let’s take a look to see how many NAs are in our dataset. We will tackle the NAs in our next section.

$...1
[1] 0

$INDEX
[1] 0

$TARGET_WINS
[1] 0

$TEAM_BATTING_H
[1] 0

$TEAM_BATTING_2B
[1] 0

$TEAM_BATTING_3B
[1] 0

$TEAM_BATTING_HR
[1] 0

$TEAM_BATTING_BB
[1] 0

$TEAM_BATTING_SO
[1] 102

$TEAM_BASERUN_SB
[1] 131

$TEAM_BASERUN_CS
[1] 772

$TEAM_BATTING_HBP
[1] 2085

$TEAM_PITCHING_H
[1] 0

$TEAM_PITCHING_HR
[1] 0

$TEAM_PITCHING_BB
[1] 0

$TEAM_PITCHING_SO
[1] 102

$TEAM_FIELDING_E
[1] 0

$TEAM_FIELDING_DP
[1] 286
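A per-column NA count like the one above can be sketched as follows. The `toy` data frame is hypothetical (only the column names are borrowed from the report), and `lapply()` is an assumption that happens to match the `$NAME` / `[1] n` print format.

```r
# Hypothetical miniature data frame; column names borrowed from the report.
toy <- data.frame(
  TARGET_WINS     = c(81, 90, 70, 85),
  TEAM_BATTING_SO = c(900, NA, 1100, NA),
  TEAM_BASERUN_CS = c(NA, NA, NA, 40)
)

# One NA count per column, returned as a named list.
na_counts <- lapply(toy, function(x) sum(is.na(x)))
na_counts

# Share of rows missing per column, useful for judging whether a column is usable.
na_share <- round(colMeans(is.na(toy)), 2)
na_share
```

Looking at the share rather than the raw count makes it easier to see when a column, such as hit by pitch here, is missing for most rows.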

Let’s look at some hitting statistics

Warning: Removed 102 rows containing non-finite values (stat_boxplot).

Let’s see how the hitting statistics correlate with wins.

     hits  HRs doubles triples  BBs SOs
[1,] 0.39 0.18    0.29    0.14 0.23  NA

The strikeout correlation is NA because of the missing strikeout data noted in the previous section. As I mentioned earlier, we will address that issue in the next section. The total number of hits is most correlated with wins (0.39), followed by doubles (0.29) and then bases on balls, or walks (0.23). Somewhat surprisingly, home runs are not as closely correlated with wins as doubles are. Not surprisingly, the least common batting outcome, the triple, is least correlated with wins. We are missing a lot of hit-by-pitch data (2,085 values), and a hit by pitch is a very infrequent occurrence, making that column rather useless for our analysis.
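The NA in the strikeout column is what `cor()` returns by default when either input contains missing values. A small sketch with hypothetical numbers shows the behavior and one way around it (the `toy` data frame and its values are made up for illustration):

```r
# Hypothetical data with one missing strikeout total.
toy <- data.frame(
  wins = c(70, 75, 81, 88, 95),
  hits = c(1300, 1350, 1400, 1480, 1550),
  SOs  = c(1100, NA, 950, 900, 850)
)

# Default is use = "everything", so a single NA makes the result NA.
cor_default <- cor(toy$wins, toy$SOs)

# Dropping incomplete pairs first gives a usable estimate.
cor_complete <- cor(toy$wins, toy$SOs, use = "complete.obs")

cor_default
cor_complete
```

Using `use = "complete.obs"` gives a correlation before any imputation is done, which can be a helpful baseline to compare against once the NAs are replaced.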

Let’s take a quick look at stolen bases and caught stealing.

Warning: Removed 131 rows containing missing values (geom_point).

Warning: Removed 772 rows containing missing values (geom_point).

There is no obvious correlation between stolen bases and wins, and no obvious negative correlation between caught stealing and wins.

Next up on the list let’s take a look at pitching stats.

Warning: Removed 102 rows containing non-finite values (stat_boxplot).

Let’s see how the pitching statistics correlate with wins.

      hits  HRs  BBs SOs
[1,] -0.11 0.19 0.12  NA

It looks like there are a number of outliers in the pitching stats. Hits, BBs, and SOs all need to be addressed in the next section.

#2 Data Preparation

There are a few areas where we need to address outliers. Let’s start with wins. The record for wins in a season is 116 and the record for losses is 120. So first we need to add a losses column, and then any value over the record for wins or losses will be reassigned to the median.
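A minimal sketch of that step, under stated assumptions: losses are taken to be 162 minus wins (the report does not show how the `loss` column was built), and the `toy` data frame is hypothetical.

```r
# Hypothetical win totals, including the two impossible extremes from the EDA.
toy <- data.frame(TARGET_WINS = c(0, 55, 81, 82, 95, 146))

season_games <- 162   # assumed schedule length
max_wins     <- 116   # record quoted in the text
max_losses   <- 120   # record quoted in the text

# Assumed definition of the losses column.
toy$loss <- season_games - toy$TARGET_WINS

# Flag seasons beyond either record, then overwrite them with the median win total.
impossible <- toy$TARGET_WINS > max_wins | toy$loss > max_losses
toy$TARGET_WINS[impossible] <- median(toy$TARGET_WINS)

# Keep the losses column consistent with the corrected wins.
toy$loss <- season_games - toy$TARGET_WINS
toy
```

Note that the median here is computed before the outliers are replaced, so the extreme values still influence it slightly; with a large dataset the difference is usually negligible.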

Now for the rest of the stats, let’s replace all missing values (NAs) with the median of the stat column.

$...1
[1] 0

$INDEX
[1] 0

$TARGET_WINS
[1] 0

$TEAM_BATTING_H
[1] 0

$TEAM_BATTING_2B
[1] 0

$TEAM_BATTING_3B
[1] 0

$TEAM_BATTING_HR
[1] 0

$TEAM_BATTING_BB
[1] 0

$TEAM_BATTING_SO
[1] 0

$TEAM_BASERUN_SB
[1] 0

$TEAM_BASERUN_CS
[1] 0

$TEAM_BATTING_HBP
[1] 0

$TEAM_PITCHING_H
[1] 0

$TEAM_PITCHING_HR
[1] 0

$TEAM_PITCHING_BB
[1] 0

$TEAM_PITCHING_SO
[1] 0

$TEAM_FIELDING_E
[1] 0

$TEAM_FIELDING_DP
[1] 0

$loss
[1] 0

Now for the rest of the data, let’s take all data points above the 95th percentile or below the 5th percentile and reset them to those respective percentile values.

As we can see, our boxplots are not as spread out, and our NAs and outliers are cleaned up.

This dataset has no at-bat counts, so calculating a batting average, on-base percentage, or slugging percentage is not possible. However, we can calculate the percentage of hits that are extra-base hits, so let’s add a column for that. Let’s also build a quasi on-base stat by adding together the hits and walks columns to get the gross times on base for a year. I am not including hit by pitches because getting hit does not take any particular skill, or at least not a skill a team would explicitly select for, whereas getting hits and taking walks are skills worth selecting for. On the pitching side, let’s also create an on-base-given-up column.

#3 Build Model

To build my linear regression model, I will fit it against three different variables, one at a time. Let’s first take a look at the scatterplot of each of the three variables against wins.

`geom_smooth()` using formula 'y ~ x'

[1] 0.1737319

`geom_smooth()` using formula 'y ~ x'

[1] 0.365578

`geom_smooth()` using formula 'y ~ x'

[1] 0.2568476

From the initial glance at the scatterplots and the correlations we might expect on_base to be the best predictor. Let’s run the regression models for all three and see what we get.


Call:
lm(formula = moneyball_train$TARGET_WINS ~ moneyball_train$on_base_x, 
    data = moneyball_train)

Coefficients:
              (Intercept)  moneyball_train$on_base_x  
                   43.869                      0.018


Call:
lm(formula = moneyball_train$TARGET_WINS ~ moneyball_train$on_base, 
    data = moneyball_train)

Coefficients:
            (Intercept)  moneyball_train$on_base  
               22.56341                  0.02986


Call:
lm(formula = moneyball_train$TARGET_WINS ~ moneyball_train$percent_hits_extrabase, 
    data = moneyball_train)

Coefficients:
                           (Intercept)  moneyball_train$percent_hits_extrabase  
                                 66.64                                   52.70


Call:
lm(formula = moneyball_train$TARGET_WINS ~ moneyball_train$on_base_x, 
    data = moneyball_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-46.722  -8.869   0.801   9.771  36.189 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)               43.86875    2.95370   14.85   <2e-16 ***
moneyball_train$on_base_x  0.01800    0.00142   12.67   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.54 on 2274 degrees of freedom
Multiple R-squared:  0.06597,   Adjusted R-squared:  0.06556 
F-statistic: 160.6 on 1 and 2274 DF,  p-value: < 2.2e-16


Call:
lm(formula = moneyball_train$TARGET_WINS ~ moneyball_train$on_base, 
    data = moneyball_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.165  -8.678   0.434   9.094  41.901 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             22.563415   3.138818   7.189 8.85e-13 ***
moneyball_train$on_base  0.029860   0.001594  18.730  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.04 on 2274 degrees of freedom
Multiple R-squared:  0.1336,    Adjusted R-squared:  0.1333 
F-statistic: 350.8 on 1 and 2274 DF,  p-value: < 2.2e-16


Call:
lm(formula = moneyball_train$TARGET_WINS ~ moneyball_train$percent_hits_extrabase, 
    data = moneyball_train)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.422  -9.292   0.597   9.576  37.959 

Coefficients:
                                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)                              66.637      1.747  38.149   <2e-16 ***
moneyball_train$percent_hits_extrabase   52.697      6.264   8.413   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.8 on 2274 degrees of freedom
Multiple R-squared:  0.03018,   Adjusted R-squared:  0.02976 
F-statistic: 70.77 on 1 and 2274 DF,  p-value: < 2.2e-16

                                2.5 %      97.5 %
(Intercept)               38.07651388 49.66098726
moneyball_train$on_base_x  0.01521197  0.02078141

                              2.5 %      97.5 %
(Intercept)             16.40816809 28.71866112
moneyball_train$on_base  0.02673347  0.03298618

                                          2.5 %   97.5 %
(Intercept)                            63.21166 70.06239
moneyball_train$percent_hits_extrabase 40.41275 64.98026
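As a quick consistency check, in a one-predictor regression the R-squared equals the squared correlation between the predictor and the response. The sketch below uses the numbers printed in this section. Matching each unlabeled correlation to a variable is my inference from that identity, not something the output states.

```r
# Correlations printed under the scatterplots (variable labels are inferred).
r <- c(percent_hits_extrabase = 0.1737319,
       on_base                = 0.365578,
       on_base_x              = 0.2568476)

# Multiple R-squared values copied from the three model summaries.
r_squared <- c(percent_hits_extrabase = 0.03018,
               on_base                = 0.1336,
               on_base_x              = 0.06597)

# Squared correlations should reproduce the R-squared values up to rounding.
round(r^2, 5)
matches <- all(abs(r^2 - r_squared) < 1e-4)
matches

# The predictor with the highest correlation explains the most variance.
best <- names(which.max(r))
best
```

Even the best of the three, on_base, explains only about 13% of the variance in wins, so a single-variable model leaves most of the variation unexplained.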

The coefficients for the on_base regression model are a y-intercept of 22.56341, which is the baseline number of wins the model predicts for a team that never reaches base, and a slope of 0.02986, so each additional time on base is worth 0.02986 wins.
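To make the interpretation concrete, here is a small worked example using the fitted coefficients. The `predict_wins()` helper and the on-base totals of 2000 and 2100 are hypothetical, chosen only for illustration.

```r
# Coefficients from the on_base model summary.
intercept <- 22.56341
slope     <- 0.02986

# Hypothetical helper: predicted wins for a given gross on-base total.
predict_wins <- function(on_base) intercept + slope * on_base

# A team that reaches base 2000 times (hits plus walks).
wins_2000 <- predict_wins(2000)
wins_2000

# Value of 100 extra times on base: 100 * slope.
gain_100 <- predict_wins(2100) - predict_wins(2000)
gain_100
```

Under this model a team reaching base 2000 times is predicted to win about 82 games, and 100 additional times on base are worth about 3 wins.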

#4 WRITE OUT YOUR MODEL EQUATION

Please see the code below for all data transformations made to my test data.

library(dplyr)    # %>% and mutate()
library(ggplot2)  # boxplots

# Replace NAs with each column's median (computed on the test data itself).
na_cols <- c("TEAM_BATTING_HBP", "TEAM_BATTING_SO", "TEAM_BASERUN_SB",
             "TEAM_BASERUN_CS", "TEAM_PITCHING_SO", "TEAM_FIELDING_DP")
for (col in na_cols) {
  missing <- is.na(moneyball_test[[col]])
  moneyball_test[[col]][missing] <- median(moneyball_test[[col]], na.rm = TRUE)
}

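# Cap values below the 5th or above the 95th percentile at those percentiles.
fun <- function(x){
     quantiles <- quantile( x, c(.05, .95 ),na.rm = TRUE )
     x[ x < quantiles[1] ] <- quantiles[1]
     x[ x > quantiles[2] ] <- quantiles[2]
     x
}

# Apply the cap one stat column at a time, rather than passing the whole data
# frame to fun(), which does not give each column its own percentiles.
stat_cols <- grep("^TEAM_", names(moneyball_test), value = TRUE)
moneyball_test[stat_cols] <- lapply(moneyball_test[stat_cols], fun)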

# Pitching boxplots.
hits <- ggplot(moneyball_test, aes(x = TEAM_PITCHING_H)) + geom_boxplot(color = "red", notch = TRUE) + labs(x = "hits")
hrs <- ggplot(moneyball_test, aes(x = TEAM_PITCHING_HR)) + geom_boxplot(color = "red", notch = TRUE) + labs(x = "HRs")
BBs <- ggplot(moneyball_test, aes(x = TEAM_PITCHING_BB)) + geom_boxplot(color = "red", notch = TRUE) + labs(x = "BBs")
SOs <- ggplot(moneyball_test, aes(x = TEAM_PITCHING_SO)) + geom_boxplot(color = "red", notch = TRUE) + labs(x = "SOs")

# Batting boxplots (these reuse the names hits, hrs, BBs, and SOs, replacing the pitching plots above).
hits <- ggplot(moneyball_test, aes(x = TEAM_BATTING_H)) + geom_boxplot(color = "red", notch = TRUE) + labs(x = "hits")
hrs <- ggplot(moneyball_test, aes(x = TEAM_BATTING_HR)) + geom_boxplot(color = "red", notch = TRUE) + labs(x = "HRs")
doubles <- ggplot(moneyball_test, aes(x = TEAM_BATTING_2B)) + geom_boxplot(color = "red", notch = TRUE) + labs(x = "doubles")
triples <- ggplot(moneyball_test, aes(x = TEAM_BATTING_3B)) + geom_boxplot(color = "red", notch = TRUE) + labs(x = "triples")
BBs <- ggplot(moneyball_test, aes(x = TEAM_BATTING_BB)) + geom_boxplot(color = "red", notch = TRUE) + labs(x = "BBs")
SOs <- ggplot(moneyball_test, aes(x = TEAM_BATTING_SO)) + geom_boxplot(color = "red", notch = TRUE) + labs(x = "SOs")

# Derived features: share of hits that go for extra bases, gross times on base,
# and times on base given up by the pitching staff.
moneyball_test <- moneyball_test %>%
  mutate(
    percent_hits_extrabase = (TEAM_BATTING_2B + TEAM_BATTING_3B + TEAM_BATTING_HR) / TEAM_BATTING_H,
    on_base   = TEAM_BATTING_H + TEAM_BATTING_BB,
    on_base_x = TEAM_PITCHING_H + TEAM_PITCHING_BB
  )

Below is the linear regression formula I used to create the vector P_Target_Wins. It is the fitted regression of wins on the on_base statistic.

P_Target_Wins <- round(22.56341 + 0.02986 * moneyball_test$on_base, 0)

Using Kaggle, the RMSE was 13.26844.
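For reference, RMSE is the square root of the mean squared difference between actual and predicted wins. A minimal sketch with hypothetical numbers follows (the `rmse()` helper and both vectors are made up; the Kaggle score itself was computed on held-out data not shown here):

```r
# Root mean squared error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical actual and predicted win totals.
actual    <- c(70, 81, 92, 100)
predicted <- c(75, 80, 88, 95)

example_rmse <- rmse(actual, predicted)
example_rmse
```

An RMSE of about 13.27 is close to the residual standard error of 13.04 reported for the on_base model, so the test-set error is in line with the fit on the training data.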

Bryan Collins

6/21/2021