Objective

The goal of this notebook is to import the data and explore it. More specifically, we want to describe the size of the moneyball training data set and the variables in it.

Data

Let’s read in the training data and start the EDA process.

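# packages assumed from the calls used in this notebook
library(tidyverse)    # read_csv, %>%, ggplot2
library(janitor)      # clean_names
library(DataExplorer) # introduce, plot_*

# read in data
df <- read_csv("./data/moneyball-training-data.csv") %>% clean_names()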

# preview
head(df)
## # A tibble: 6 × 17
##   index target…¹ team_…² team_…³ team_…⁴ team_…⁵ team_…⁶ team_…⁷ team_…⁸ team_…⁹
##   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1     1       39    1445     194      39      13     143     842      NA      NA
## 2     2       70    1339     219      22     190     685    1075      37      28
## 3     3       86    1377     232      35     137     602     917      46      27
## 4     4       70    1387     209      38      96     451     922      43      30
## 5     5       82    1297     186      27     102     472     920      49      39
## 6     6       75    1279     200      36      92     443     973     107      59
## # … with 7 more variables: team_batting_hbp <dbl>, team_pitching_h <dbl>,
## #   team_pitching_hr <dbl>, team_pitching_bb <dbl>, team_pitching_so <dbl>,
## #   team_fielding_e <dbl>, team_fielding_dp <dbl>, and abbreviated variable
## #   names ¹​target_wins, ²​team_batting_h, ³​team_batting_2b, ⁴​team_batting_3b,
## #   ⁵​team_batting_hr, ⁶​team_batting_bb, ⁷​team_batting_so, ⁸​team_baserun_sb,
## #   ⁹​team_baserun_cs

OK, it looks like they are all numeric variables, and I already see some NAs in the preview. Let’s look at a summary:

# summarize no index
summary(df[,-1])
##   target_wins     team_batting_h team_batting_2b team_batting_3b 
##  Min.   :  0.00   Min.   : 891   Min.   : 69.0   Min.   :  0.00  
##  1st Qu.: 71.00   1st Qu.:1383   1st Qu.:208.0   1st Qu.: 34.00  
##  Median : 82.00   Median :1454   Median :238.0   Median : 47.00  
##  Mean   : 80.79   Mean   :1469   Mean   :241.2   Mean   : 55.25  
##  3rd Qu.: 92.00   3rd Qu.:1537   3rd Qu.:273.0   3rd Qu.: 72.00  
##  Max.   :146.00   Max.   :2554   Max.   :458.0   Max.   :223.00  
##                                                                  
##  team_batting_hr  team_batting_bb team_batting_so  team_baserun_sb
##  Min.   :  0.00   Min.   :  0.0   Min.   :   0.0   Min.   :  0.0  
##  1st Qu.: 42.00   1st Qu.:451.0   1st Qu.: 548.0   1st Qu.: 66.0  
##  Median :102.00   Median :512.0   Median : 750.0   Median :101.0  
##  Mean   : 99.61   Mean   :501.6   Mean   : 735.6   Mean   :124.8  
##  3rd Qu.:147.00   3rd Qu.:580.0   3rd Qu.: 930.0   3rd Qu.:156.0  
##  Max.   :264.00   Max.   :878.0   Max.   :1399.0   Max.   :697.0  
##                                   NA's   :102      NA's   :131    
##  team_baserun_cs team_batting_hbp team_pitching_h team_pitching_hr
##  Min.   :  0.0   Min.   :29.00    Min.   : 1137   Min.   :  0.0   
##  1st Qu.: 38.0   1st Qu.:50.50    1st Qu.: 1419   1st Qu.: 50.0   
##  Median : 49.0   Median :58.00    Median : 1518   Median :107.0   
##  Mean   : 52.8   Mean   :59.36    Mean   : 1779   Mean   :105.7   
##  3rd Qu.: 62.0   3rd Qu.:67.00    3rd Qu.: 1682   3rd Qu.:150.0   
##  Max.   :201.0   Max.   :95.00    Max.   :30132   Max.   :343.0   
##  NA's   :772     NA's   :2085                                     
##  team_pitching_bb team_pitching_so  team_fielding_e  team_fielding_dp
##  Min.   :   0.0   Min.   :    0.0   Min.   :  65.0   Min.   : 52.0   
##  1st Qu.: 476.0   1st Qu.:  615.0   1st Qu.: 127.0   1st Qu.:131.0   
##  Median : 536.5   Median :  813.5   Median : 159.0   Median :149.0   
##  Mean   : 553.0   Mean   :  817.7   Mean   : 246.5   Mean   :146.4   
##  3rd Qu.: 611.0   3rd Qu.:  968.0   3rd Qu.: 249.2   3rd Qu.:164.0   
##  Max.   :3645.0   Max.   :19278.0   Max.   :1898.0   Max.   :228.0   
##                   NA's   :102                        NA's   :286

Ten of the columns have a minimum value of 0:

  • target_wins
  • team_batting_3b
  • team_batting_hr
  • team_batting_bb
  • team_batting_so
  • team_baserun_sb
  • team_baserun_cs
  • team_pitching_hr
  • team_pitching_bb
  • team_pitching_so

Are these actual zeros, or are they missing values recorded as 0? Let’s investigate further.
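One way to start that check is to count the exact zeros in each column. Below is a minimal sketch in base R. It uses a small made-up data frame (`toy`) rather than the real data, so the values are illustrative only:

```r
# Toy data standing in for df; values are made up for illustration
toy <- data.frame(
  target_wins     = c(0, 70, 86, 75),
  team_batting_so = c(0, 0, NA, 930),
  team_batting_h  = c(1445, 1339, 1377, 1279)
)

# Count exact zeros per column, ignoring NAs
zero_counts <- sapply(toy, function(x) sum(x == 0, na.rm = TRUE))

# Keep only the columns that contain at least one zero
zero_counts[zero_counts > 0]
```

On the real data the same `sapply()` call could be run on `df`. A season total of zero for a counting stat such as strikeouts looks implausible, so those rows may be better treated as missing, but that is a judgment call rather than something the data confirms.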

Exploratory Data Analysis (EDA)

Let’s use the DataExplorer package to perform some EDA.

Introductory

introduce(df)
## # A tibble: 1 × 9
##    rows columns discrete_columns conti…¹ all_m…² total…³ compl…⁴ total…⁵ memor…⁶
##   <int>   <int>            <int>   <int>   <int>   <int>   <int>   <int>   <dbl>
## 1  2276      17                0      17       0    3478     191   38692  314048
## # … with abbreviated variable names ¹​continuous_columns, ²​all_missing_columns,
## #   ³​total_missing_values, ⁴​complete_rows, ⁵​total_observations, ⁶​memory_usage

We have 2,276 rows and 17 columns, all continuous numeric. There are 3,478 missing values in total, and only 191 of the 2,276 records are complete.
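These headline numbers can also be reproduced by hand, which is a useful sanity check. Here is a rough sketch in base R on a made-up data frame (`toy` is an assumption, not the real data):

```r
# Toy data with a few missing cells; values are made up
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 6, 7),
  c = c(8, 9, 10, 11)
)

total_missing <- sum(is.na(toy))          # number of missing cells
complete_rows <- sum(complete.cases(toy)) # rows with no NA at all
total_obs     <- nrow(toy) * ncol(toy)    # total number of cells
```

For the real data, the last line corresponds to 2276 * 17 = 38692, which matches the total_observations value reported above.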

We will have to figure out how to deal with these missing values.

Let’s visualize this:

plot_intro(df)

Let’s look at missing values per column.

Missing values

plot_missing(df)

So we can see that most of the batting variables are 100% complete; the exceptions are team_batting_so and team_batting_hbp. Most of the missing values come from the team_batting_hbp variable; in fact, the DataExplorer package is telling us to drop this column!

The other 5 columns with missing values:

  • team_pitching_so at 4.48%
  • team_batting_so at 4.48%
  • team_baserun_sb at 5.76%
  • team_fielding_dp at 12.57%
  • team_baserun_cs at 33.92%

So only 6 of the 17 columns have missing values, and one of them, team_batting_hbp, is 91.61% missing and should be dropped.
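The percentages above are just the share of NA cells per column (for example, 102 / 2276 ≈ 4.48%). A minimal base R sketch of that calculation follows. It uses a made-up data frame, and the `drop_cutoff` threshold is a hypothetical value chosen for illustration:

```r
# Toy data; values are made up
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(NA, 2, 3, 4, 5),
  z = c(NA, NA, NA, NA, 5)
)

# Share of missing values per column, in percent
pct_missing <- colMeans(is.na(toy)) * 100

# Hypothetical cutoff: flag columns that are more than half missing
drop_cutoff  <- 50
cols_to_drop <- names(pct_missing)[pct_missing > drop_cutoff]
```

Any cutoff like this is a modelling choice. The point of the sketch is only to make the rule explicit and repeatable.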

Let’s look at some histograms

Histograms

# plot no index or hbp
df %>% 
  select(-c(index,team_batting_hbp)) %>% 
  plot_histogram(nrow = 2L, ncol = 2L)

So we have a bunch of different distribution types here.

Normal

  • target_wins - seems pretty normal, most around 80
  • team_batting_2b - pretty normal, most around 225
  • team_batting_h - a few high outliers but pretty normal
  • team_batting_bb - pretty normal, although with a bunch of low outliers; this column didn’t have any missing values.
  • team_baserun_cs - somewhat normal, although as you can see
  • team_pitching_bb - very normal, most ~500
  • team_pitching_h - pretty normal, with a few high outliers, so slightly right-skewed.
  • team_fielding_dp - pretty normal, most around 150; however, some missing values
  • team_pitching_so - on closer look is normal, most ~1,000

Right-Skewed

  • team_batting_3b - right-skewed, seems fewer teams have greater numbers in this.
  • team_baserun_sb - very right-skewed, there are some missing values in this as well.
  • team_fielding_e - most about 125, many higher outliers

Bi-modal

  • team_batting_hr - odd distribution seems a bunch of teams have about 25 hrs, whilst the other bunch ~125.
  • team_batting_so - seems the majority either has ~500 or ~1000
  • team_pitching_hr - majority ~25 or ~125
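The grouping above is based on eyeballing the histograms. A numeric check such as sample skewness can back it up. Below is a minimal sketch in base R. The helper name `skew_moment` is made up for this example, and the vectors are toy data, not columns from the data set:

```r
# Moment-based sample skewness: third central moment divided by sd^3
# (population sd). Roughly 0 for symmetric data, positive for a right tail.
skew_moment <- function(x) {
  x <- x[!is.na(x)]           # ignore missing values
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))  # population standard deviation
  mean((x - m)^3) / s^3
}

symmetric  <- c(1, 2, 3, 4, 5)
right_tail <- c(1, 1, 2, 2, 3, 20)

skew_moment(symmetric)   # zero for this symmetric vector
skew_moment(right_tail)  # positive: long right tail
```

Skewness will not detect bi-modality, so the bi-modal columns still need a visual check.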

Let’s take a closer look at a few:

# histogram helper: plots one column of df
# extra arguments (...) are passed on to geom_histogram()
hist_func <- function(df,col,xlab,ylab="Count",
                      title=paste(xlab,"Distribution"),...) {
  df %>% 
    ggplot(aes(!!sym(col))) + 
    geom_histogram(...) +
    labs(title=title,x=xlab,y=ylab) +
    theme(plot.title = element_text(hjust = 0.5))
}

# plot
hist_func(df,"team_baserun_cs","Caught Stealing")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

OK, so team_baserun_cs is pretty much normal, but has a number of high outliers.

Let’s look at team_pitching_h:

hist_func(df,'team_pitching_h',"Hits Allowed",binwidth=400,bins=300)

OK, pretty much normal with a few high outliers.

Let’s take a closer look at team_pitching_so.

# binwidth takes precedence over bins when both are supplied
hist_func(df,"team_pitching_so",xlab = "Strikeouts by Pitchers",
          binwidth=500, bins=300)

Let’s look at some box-plots

Box Plots

Let’s write a function for boxplots

# helper: boxplot for a single column (also usable on its own)
# xlab labels the axis; extra arguments (...) go to geom_boxplot()
box_func <- function(df,col,xlab=col,...) {
  df %>% 
    ggplot(aes(!!sym(col))) + 
    geom_boxplot(...) +
    labs(x = xlab)
}

# build one boxplot per column of df and return them as a named list
create_boxplots <- function(df) {
  col_names <- colnames(df)
  boxplot_list <- list()

  for (col in col_names) {
    p <- box_func(df, col, xlab = col)
    boxplot_list[[col]] <- p
  }

  return(boxplot_list)
}

Let’s try out the function.

# create list
boxplot_list <- create_boxplots(df %>% select(-c(index,team_batting_hbp)))

# loop through
for (col in names(boxplot_list)) {
  print(boxplot_list[[col]])
}

Here we can see that the columns with right-skewed histograms have many outliers above the 3rd quartile. Conversely, most of the normal distributions have a few outliers on either side. Additionally, the bi-modal distributions seem to have a wider interquartile range and little to no outliers.
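To put numbers on these observations, the points beyond the whiskers can be counted per column. Here is a minimal sketch using base R’s `boxplot.stats()`, which applies the usual 1.5 × IQR rule. The helper `count_outliers` and the toy data are made up for this example, and the counts may differ slightly from what ggplot2 draws because the quartiles can be computed differently:

```r
# Count points that fall outside the boxplot whiskers (1.5 * IQR rule)
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Toy data; values are made up
toy <- data.frame(
  roughly_symmetric = c(10, 11, 12, 13, 14, 15, 16),
  right_skewed      = c(1, 2, 2, 3, 3, 4, 50)
)

outlier_counts <- sapply(toy, count_outliers)
outlier_counts
```

On the real data, sorting these counts would give a quick ranking of which columns are most affected.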

Let’s look at some scatterplots against target_wins

Scatterplots

# plot scatter, drop index and hbp
plot_scatterplot(df[,-c(1,11)],by="target_wins", nrow = 1L, ncol = 2L)

Positive

  • team_batting_2b, team_batting_h have very positive relationships with the target
  • team_batting_3b, team_batting_hr, team_batting_bb, team_baserun_sb have a moderately positive relationship with the target

Flat

  • team_batting_so, team_baserun_cs, team_pitching_h, team_pitching_hr, team_pitching_bb, team_pitching_so, team_fielding_dp seem to have a flat relationship with the target

Negative

  • team_fielding_e seems to have the only negative relationship to the target.

This suggests that batting has the largest positive association with the target, whereas fielding errors have the most negative.

Let’s look at some correlations!

Correlation Matrix

plot_correlation(df[,-c(1,11)])

Looks like team_pitching_hr and team_batting_hr are highly correlated at 0.97, almost 1. This tells us that teams with more home runs also tend to give up more home runs.

The strongest negative correlation is between team_batting_bb and team_fielding_e at -0.66. This tells us that teams with more bases on balls tend to have fewer fielding errors.
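Reading the strongest pairs off a heatmap gets harder as the number of columns grows, so it can help to list them in a table. Below is a rough sketch in base R on made-up data. Using `use = "pairwise.complete.obs"` to handle NAs is my assumption here, not necessarily what the plot above uses:

```r
# Toy data; values are made up
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 4, 6, 8, 10, NA),   # moves with a
  c = c(6, 5, 4, 3, 2, 1.5)    # moves against a
)

# Pairwise correlations, using all rows where both columns are present
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Blank out the diagonal and lower triangle so each pair appears once
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA

# Reshape to one row per pair and sort by absolute correlation
pairs <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
pairs <- pairs[!is.na(pairs$Freq), ]
names(pairs) <- c("var1", "var2", "r")
pairs <- pairs[order(-abs(pairs$r)), ]
pairs
```

Sorting by `abs(r)` puts strong negative and strong positive pairs side by side, which is what matters for spotting collinearity.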

Ok, so let’s sum up our EDA findings.

Summary

General

There are 2276 rows and 17 columns in this dataset. All columns are continuous numerical data type.

Missing Values

6 columns have missing values:

  • team_pitching_so at 4.48%
  • team_batting_so at 4.48%
  • team_baserun_sb at 5.76%
  • team_fielding_dp at 12.57%
  • team_baserun_cs at 33.92%
  • team_batting_hbp at 91.61%

Drop:

Recommend dropping team_batting_hbp because so much of it is missing.

Impute:

Recommend imputing the other columns’ missing values.
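The imputation method is still to be decided. As one simple baseline, here is a sketch of median imputation in base R. The choice of the median, the helper name `impute_median`, and the toy values are all assumptions for illustration:

```r
# Replace NAs in a numeric vector with that vector's median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy data; values are made up
toy <- data.frame(
  team_baserun_sb  = c(37, 46, NA, 49, 107),
  team_fielding_dp = c(NA, 150, 140, 160, NA)
)

# Apply the helper column by column
toy_imputed <- as.data.frame(lapply(toy, impute_median))
toy_imputed
```

The median is less sensitive to outliers than the mean, which seems relevant given the skewed columns, but a model-based approach may well do better for team_baserun_cs, where about a third of the values are missing.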

Distributions

Most distributions were relatively normal, however there were a few right-skewed or bi-modal variables.

Those were as follows:

Right-Skewed

  • team_batting_3b
  • team_baserun_sb
  • team_fielding_e

Recommend using a log transformation or Box-Cox here.
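One practical wrinkle: the summary showed that team_batting_3b and team_baserun_sb have a minimum of 0, and a plain log is undefined there. A minimal sketch of one workaround, `log1p()`, on a made-up vector (the values are not from the data set):

```r
# Made-up right-skewed counts that include a zero
x <- c(0, 1, 2, 2, 3, 4, 5, 8, 20, 60)

log(x)[1]          # -Inf: a plain log cannot handle the zero
x_log <- log1p(x)  # log(1 + x) stays finite at zero

# The gap between the largest value and the median shrinks after the transform
max(x) - median(x)
max(x_log) - median(x_log)
```

Box-Cox also expects strictly positive values, so the zero-valued columns would need a shift (or a decision on whether those zeros are really missing) before it can be applied.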

Bi-modal

  • team_batting_hr
  • team_batting_so
  • team_pitching_hr

Could try a log or square root transformation here, or look into data segmentation. Could analyze or visualize subgroups separately, for instance the commonalities within groups at each peak.

Polynomial Transformations

We can also try some polynomial transformations here, such as \(n^2\) or \(n^3\). Sometimes these can normalize the distributions as well.

Box Plots

Right-skewed distributions tend to have many outliers above the 3rd quartile.

Bi-modal distributions have little to no outliers and a larger IQR.

Normal distributions have a few outliers on either side.

Imputing Outliers

I recommend capping outliers: for each statistic, any value more than 6 standard deviations from the mean gets pulled back to the mean ±6 sd. This will help us get a more balanced data set that is less influenced by outliers.
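A minimal sketch of that capping rule in base R follows. The helper name `cap_at_sd` and the toy vector are made up for this example. Note that the mean and sd are computed before capping, so an extreme value inflates its own limit:

```r
# Pull values beyond mean +/- k standard deviations back to that limit
cap_at_sd <- function(x, k = 6) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  pmin(pmax(x, m - k * s), m + k * s)  # NAs are left as NA
}

# Toy data: 99 ordinary values plus one extreme value
x <- c(rep(10, 50), rep(12, 49), 1000)

capped <- cap_at_sd(x)

max(x)       # 1000 before capping
max(capped)  # now equal to mean(x) + 6 * sd(x)
```

Only the extreme value changes here; the other 99 values are returned untouched.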

Scatterplots

We had positive, flat, and negative relationships to target_wins.

Positive

  • team_batting_2b
  • team_batting_h
  • team_batting_3b
  • team_batting_hr
  • team_batting_bb
  • team_baserun_sb

This leads us to believe that team batting statistics have a greater influence on winning.

Flat

  • team_batting_so
  • team_baserun_cs
  • team_pitching_h
  • team_pitching_hr
  • team_pitching_bb
  • team_pitching_so
  • team_fielding_dp

Negative

The only negative relationship was with team_fielding_e: teams that make more fielding errors tend to have fewer wins.

Multicollinearity

The most highly correlated variables were team_pitching_hr and team_batting_hr at 0.97. Drop team_pitching_hr since batting is more associated with the target.

The strongest negative correlation was between team_batting_bb and team_fielding_e at -0.66. We don’t need to drop either one here, but we can experiment with leaving one out of the model, or adding them sequentially.