The goal of this notebook is to import the data and explore it; more specifically, to describe the size and the variables of the moneyball training data set.
Let’s read in the training data and start the EDA process.
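# load the packages used throughout this notebook
library(tidyverse)    # read_csv(), %>%, ggplot2
library(janitor)      # clean_names()
library(DataExplorer) # introduce(), plot_missing(), plot_histogram(), ...
# read in data
df <- read_csv("./data/moneyball-training-data.csv") %>% clean_names()
# preview
head(df)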
## # A tibble: 6 × 17
## index target…¹ team_…² team_…³ team_…⁴ team_…⁵ team_…⁶ team_…⁷ team_…⁸ team_…⁹
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 39 1445 194 39 13 143 842 NA NA
## 2 2 70 1339 219 22 190 685 1075 37 28
## 3 3 86 1377 232 35 137 602 917 46 27
## 4 4 70 1387 209 38 96 451 922 43 30
## 5 5 82 1297 186 27 102 472 920 49 39
## 6 6 75 1279 200 36 92 443 973 107 59
## # … with 7 more variables: team_batting_hbp <dbl>, team_pitching_h <dbl>,
## # team_pitching_hr <dbl>, team_pitching_bb <dbl>, team_pitching_so <dbl>,
## # team_fielding_e <dbl>, team_fielding_dp <dbl>, and abbreviated variable
## # names ¹target_wins, ²team_batting_h, ³team_batting_2b, ⁴team_batting_3b,
## # ⁵team_batting_hr, ⁶team_batting_bb, ⁷team_batting_so, ⁸team_baserun_sb,
## # ⁹team_baserun_cs
OK, it looks like they are all numeric variables, and I already see some NAs in the preview. Let’s look at a summary:
# summarize no index
summary(df[,-1])
## target_wins team_batting_h team_batting_2b team_batting_3b
## Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00
## 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00
## Median : 82.00 Median :1454 Median :238.0 Median : 47.00
## Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25
## 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00
## Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00
##
## team_batting_hr team_batting_bb team_batting_so team_baserun_sb
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0 1st Qu.: 66.0
## Median :102.00 Median :512.0 Median : 750.0 Median :101.0
## Mean : 99.61 Mean :501.6 Mean : 735.6 Mean :124.8
## 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0 3rd Qu.:156.0
## Max. :264.00 Max. :878.0 Max. :1399.0 Max. :697.0
## NA's :102 NA's :131
## team_baserun_cs team_batting_hbp team_pitching_h team_pitching_hr
## Min. : 0.0 Min. :29.00 Min. : 1137 Min. : 0.0
## 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419 1st Qu.: 50.0
## Median : 49.0 Median :58.00 Median : 1518 Median :107.0
## Mean : 52.8 Mean :59.36 Mean : 1779 Mean :105.7
## 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682 3rd Qu.:150.0
## Max. :201.0 Max. :95.00 Max. :30132 Max. :343.0
## NA's :772 NA's :2085
## team_pitching_bb team_pitching_so team_fielding_e team_fielding_dp
## Min. : 0.0 Min. : 0.0 Min. : 65.0 Min. : 52.0
## 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0 1st Qu.:131.0
## Median : 536.5 Median : 813.5 Median : 159.0 Median :149.0
## Mean : 553.0 Mean : 817.7 Mean : 246.5 Mean :146.4
## 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2 3rd Qu.:164.0
## Max. :3645.0 Max. :19278.0 Max. :1898.0 Max. :228.0
## NA's :102 NA's :286
Ten of the columns have min values of 0. These include:
target_wins
team_batting_3b
team_batting_hr
team_batting_bb
team_batting_so
team_baserun_sb
team_baserun_cs
team_pitching_hr
team_pitching_bb
team_pitching_so
Are these actually zero, or are they missing values recorded as 0? Let’s investigate further.
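One way to start that investigation is to count how many exact zeros each column holds and then pull up the rows behind them. The snippet below is only a sketch in base R: `toy` and its values are made up so that it runs on its own, and the same calls could be pointed at `df` instead.

```r
# Toy stand-in for the notebook's `df` (values are invented for illustration)
toy <- data.frame(
  target_wins     = c(0, 70, 86, 82),
  team_batting_so = c(0, NA, 602, 451),
  team_batting_h  = c(1445, 1339, 1377, 1387)
)

# Count exact zeros per column, ignoring NAs
zero_counts <- sapply(toy, function(x) sum(x == 0, na.rm = TRUE))

# Keep only the columns that contain at least one zero
zero_counts[zero_counts > 0]

# Inspect the full records where a given column is zero
toy[which(toy$team_batting_so == 0), ]
```

If a zero sits in a row where several other statistics are also zero or missing, that hints at a recording problem rather than a real season total.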
Let’s use the DataExplorer package to perform some EDA.
introduce(df)
## # A tibble: 1 × 9
## rows columns discrete_columns conti…¹ all_m…² total…³ compl…⁴ total…⁵ memor…⁶
## <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
## 1 2276 17 0 17 0 3478 191 38692 314048
## # … with abbreviated variable names ¹continuous_columns, ²all_missing_columns,
## # ³total_missing_values, ⁴complete_rows, ⁵total_observations, ⁶memory_usage
We have 2,276 rows and 17 columns, all continuous numeric. There are a total of 3,478 missing values, and only 191 of the 2,276 records are complete.
We will have to figure out how to deal with these missing values.
Let’s visualize this:
plot_intro(df)
Let’s look at missing values per column.
plot_missing(df)
So we can see that most of the batting variables are 100% complete; only team_batting_so and team_batting_hbp have missing values. Most of the missing values come from the team_batting_hbp variable; in fact, the DataExplorer package is telling us to drop this column!
Other 5 columns with missing values:
team_pitching_so at 4.48%
team_batting_so at 4.48%
team_baserun_sb at 5.76%
team_fielding_dp at 12.57%
team_baserun_cs at 33.92%
So we can see that only 6 of the 17 columns have missing values, and one column, team_batting_hbp, has 91.61% missing and should be dropped.
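The percentages in the plot can be reproduced by hand: each one is the column's NA count divided by the number of rows (for example, 2085 / 2276 ≈ 91.61% for team_batting_hbp). Below is a small base R sketch of that calculation; `toy` is a made-up data frame used only so the snippet is self-contained.

```r
# Toy data frame with some missing values (invented for illustration)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, 1),
  c = 1:4
)

# Share of missing values per column, as a percentage
pct_missing <- round(colMeans(is.na(toy)) * 100, 2)

# Only the columns with something missing, smallest share first
sort(pct_missing[pct_missing > 0])

# Number of rows with no missing values at all
complete_rows <- sum(complete.cases(toy))
complete_rows
```

Running the same two lines on `df` should line up with the `total_missing_values` and `complete_rows` figures from `introduce(df)`.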
Let’s look at some histograms
# plot no index or hbp
df %>%
  select(-c(index, team_batting_hbp)) %>%
  plot_histogram(nrow = 2L, ncol = 2L)
So we have a bunch of different distribution types here.
Normal
target_wins - seems pretty normal, most around 80
team_batting_2b - pretty normal, most around 225
team_batting_h - a few high outliers, but pretty normal
team_batting_bb - pretty normal, although with a bunch of lower outliers; this column didn’t have any missing values
team_baserun_cs - somewhat normal, although with some higher outliers
team_pitching_bb - very normal, ~500
team_pitching_h - pretty normal, a few higher outliers, so slightly right-skewed
team_fielding_dp - pretty normal, most around 150, however some missing values
team_pitching_so - on closer look is normal, most ~1,000
Right-Skewed
team_batting_3b - right-skewed, seems fewer teams have greater numbers in this
team_baserun_sb - very right-skewed, and there are some missing values in this as well
team_fielding_e - most about 125, many higher outliers
Bi-modal
team_batting_hr - odd distribution; it seems a bunch of teams have about 25 hrs, whilst the other bunch have ~125
team_batting_so - seems the majority either has ~500 or ~1000
team_pitching_hr - majority ~25 or ~125
Let’s take a closer look at a few:
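# histogram helper: `col` is the column name as a string,
# extra arguments (...) are passed on to geom_histogram()
hist_func <- function(df, col, xlab, ylab = "Count",
                      title = paste(xlab, "Distribution"), ...) {
  df %>%
    ggplot(aes(!!sym(col))) +
    geom_histogram(...) +
    labs(title = title, x = xlab, y = ylab) +
    theme(plot.title = element_text(hjust = 0.5))
}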
# plot
hist_func(df,"team_baserun_cs","Caught Stealing")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
OK, so team_baserun_cs is pretty much normal, but has a bunch of outliers at the high end.
Let’s look at team_pitching_h:
# binwidth overrides bins in geom_histogram(), so only binwidth is passed
hist_func(df, "team_pitching_h", "Hits Allowed", binwidth = 400)
OK, pretty much normal with a few higher outliers.
Let’s take a closer look at team_pitching_so.
# the misspelled `binwdith` argument was ignored, so bins = 300 is what took effect
hist_func(df, "team_pitching_so", xlab = "Strikeouts by Pitchers",
          bins = 300)
Let’s look at some box-plots
Let’s write a function for boxplots
# helper function; also works on its own for a single plot
box_func <- function(df, col, xlab = col, ...) {
  df %>%
    ggplot(aes(!!sym(col))) +
    geom_boxplot(...) +
    labs(x = xlab) + # xlab is an axis label, not a geom_boxplot() argument
    theme(plot.title = element_text(hjust = 0.5))
}
# build one boxplot per column of df
create_boxplots <- function(df) {
  boxplot_list <- list()
  for (col in colnames(df)) {
    boxplot_list[[col]] <- box_func(df, col, xlab = col)
  }
  boxplot_list
}
Let’s try out the function.
# create list
boxplot_list <- create_boxplots(df %>% select(-c(index, team_batting_hbp)))
# loop through
for (col in names(boxplot_list)) {
  print(boxplot_list[[col]])
}
Here we can see that the columns that had Right-Skewed histogram distributions have many outliers above the 3rd quartile. Conversely, most of the Normal distributions have a few outliers on either side. Additionally, the Bi-modal distributions seem to have a wider interquartile range and little to no outliers.
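To put numbers on what the boxplots show, we could count the points that fall outside the usual 1.5 × IQR fences on each side. This is a rough sketch in base R; `count_outliers` is a hypothetical helper (not part of the notebook or of DataExplorer), and the vector `x` is invented.

```r
# Hypothetical helper: count values beyond the 1.5 * IQR fences
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  lower <- q[[1]] - 1.5 * iqr
  upper <- q[[2]] + 1.5 * iqr
  c(below = sum(x < lower, na.rm = TRUE),
    above = sum(x > upper, na.rm = TRUE))
}

# Invented example: a tight cluster plus one large value
x <- c(10, 12, 11, 13, 12, 11, 50)
res <- count_outliers(x)
res
```

Applied column by column, for instance with `sapply(df, count_outliers)`, this would give a small table of low and high outlier counts to set beside the plots.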
Let’s look at some scatterplots against target_wins
# plot scatter, drop index and hbp
plot_scatterplot(df[,-c(1,11)],by="target_wins", nrow = 1L, ncol = 2L)
Positive
team_batting_2b, team_batting_h have very positive relationships with the target
team_batting_3b, team_batting_hr, team_batting_bb, team_baserun_sb have a moderately positive relationship with the target
Flat
team_batting_so, team_baserun_cs, team_pitching_h, team_pitching_hr, team_pitching_bb, team_pitching_so, team_fielding_dp seem to have a flat relationship with the target
Negative
team_fielding_e seems to have the only negative relationship to the target.
This kind of tells us batting seems to have the largest positive influence on the target, whereas fielding has the most negative.
Let’s look at some correlations!
plot_correlation(df[,-c(1,11)])
Looks like team_pitching_hr and team_batting_hr are highly correlated at 0.97, almost 1. This tells us that teams with more home runs also tend to give up more home runs.
The highest negative correlation is between team_batting_bb and team_fielding_e at -0.66. This tells us that teams that draw more walks (bases on balls) tend to commit fewer fielding errors.
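Reading the strongest pairs off a heatmap can be fiddly, so here is a sketch of how the same information could be listed as a ranked table with base R. The data frame `toy` is invented, and using `use = "pairwise.complete.obs"` to handle the NAs is an assumption of this sketch, not necessarily what the plot above does.

```r
# Toy data with one missing value (invented for illustration)
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(2, 4, 6, 8, 11),
  c = c(5, 4, NA, 2, 1)
)

# Pairwise correlations, using whichever rows are complete for each pair
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# Blank out the diagonal and lower triangle so each pair appears once
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA

# Reshape to long form (columns Var1, Var2, Freq) and drop the blanks
pairs <- as.data.frame(as.table(cor_mat))
pairs <- pairs[!is.na(pairs$Freq), ]

# Strongest relationships first, regardless of sign
ranked <- pairs[order(-abs(pairs$Freq)), ]
ranked
```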
OK, so let’s sum up our EDA findings.
There are 2,276 rows and 17 columns in this dataset. All columns are continuous numeric.
6 columns have missing values; they are:
team_pitching_so at 4.48%
team_batting_so at 4.48%
team_baserun_sb at 5.76%
team_fielding_dp at 12.57%
team_baserun_cs at 33.92%
team_batting_hbp at 91.61%
Drop:
Recommend dropping team_batting_hbp because so much of it is missing.
Impute:
Recommend imputing the other columns’ missing values.
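The notebook does not settle on an imputation method, so the following is just one simple option: replacing each column's NAs with that column's median. `impute_median` is a hypothetical helper and `toy` is invented data; a model-based approach could well be a better fit for the columns with a lot missing.

```r
# Toy data with missing values (invented for illustration)
toy <- data.frame(
  team_baserun_sb  = c(37, NA, 46, 43),
  team_fielding_dp = c(NA, 150, 140, 160)
)

# Hypothetical helper: fill NAs with the column median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply to every column and rebuild the data frame
toy_imputed <- as.data.frame(lapply(toy, impute_median))
toy_imputed
```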
Most distributions were relatively normal, however there were a few right-skewed or bi-modal variables.
Those were as follows:
Right-Skewed
team_batting_3b
team_baserun_sb
team_fielding_e
Recommend using a log transformation or Box-Cox here.
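One thing to watch: some of these columns contain zeros, and `log(0)` is `-Inf` (Box-Cox likewise needs strictly positive values). A possible workaround, shown here as a sketch, is `log1p()`, which computes log(1 + x). The vector `x` is invented, and `skew` is a hypothetical helper for a simple moment-based skewness.

```r
# Invented right-skewed values, including a zero
x <- c(0, 5, 20, 47, 72, 223)

# log(1 + x) stays finite at zero, unlike log(x)
log_x <- log1p(x)

# Hypothetical helper: moment-based sample skewness
skew <- function(v) {
  v <- v[!is.na(v)]
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}

# Compare skewness before and after the transformation
skew(x)
skew(log_x)
```

A skewness closer to 0 after the transformation suggests it helped; if it overshoots into negative skew, a square root may be a gentler choice.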
Bi-modal
team_batting_hr
team_batting_so
team_pitching_hr
Could try a log or square root transformation here, or look into data segmentation. Could analyze or visualize subgroups separately, for instance the commonalities within groups at each peak.
Polynomial Transformations
We can also try some polynomial transformations here, such as \(n^2\) or \(n^3\). Sometimes these can normalize the distributions as well.
Right-skewed distributions tend to have many outliers above the 3rd quartile.
Bi-modal distributions have little to no outliers and a larger IQR.
Normal distributions have a few outliers on either side.
Imputing Outliers
I recommend imputing all outliers for each statistic that lie more than ±6 standard deviations from the mean, pulling them in to the ±6 sd boundary. This will help us get a more balanced data set that is less influenced by outliers.
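A minimal sketch of that capping step in base R is below. `cap_outliers` is a hypothetical helper, the vector `x` is invented, and the mean and sd are computed from the uncapped data (outliers included), which is an assumption of this sketch.

```r
# Hypothetical helper: pull values beyond mean +/- k*sd back to that boundary
cap_outliers <- function(x, k = 6) {
  mu <- mean(x, na.rm = TRUE)
  s  <- sd(x, na.rm = TRUE)
  # pmax() raises low values to the lower bound, pmin() lowers high ones;
  # NAs pass through unchanged
  pmin(pmax(x, mu - k * s), mu + k * s)
}

# Invented example: 99 typical values and one extreme one
x <- c(rep(100, 99), 10000)
capped <- cap_outliers(x)

# The extreme value is pulled in, the rest are untouched
range(capped)
```

Since the outliers themselves inflate the standard deviation, ±6 sd is a fairly loose fence and only the most extreme values end up being moved.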
We had positive, flat, and negative relationships to target_wins.
Positive
team_batting_2b
team_batting_h
team_batting_3b
team_batting_hr
team_batting_bb
team_baserun_sb
This leads us to believe that team batting statistics have a greater influence on winning.
Flat
team_batting_so
team_baserun_cs
team_pitching_h
team_pitching_hr
team_pitching_bb
team_pitching_so
team_fielding_dp
Negative
The only negative was team_fielding_e; this means the worse a team is at fielding, the fewer wins it tends to have.
The most highly correlated variables were team_pitching_hr and team_batting_hr at 0.97. Drop team_pitching_hr since batting is more associated with the target.
The highest negative correlation was between team_batting_bb and team_fielding_e at -0.66. We don’t need to drop either here, but we can experiment with taking one out of the model, or adding them sequentially.