DATA 606 Data Project

Abstract

This analysis examined the correlation between how often players attempt to steal bases and how often they ground into double plays. We first extracted the relevant variables and built a data frame with stolen base and double play columns. Exploratory data analysis showed that the stolen base attempt rate was right-skewed, so we filtered to qualified hitters and applied a log transformation to bring it closer to a normal distribution. Using Pearson’s correlation, we found a moderately weak negative correlation (-0.387) between the log stolen base attempt rate and the double play rate. Because the plots suggested a roughly linear relationship with some remaining skew, we fit a GLM with a Gamma distribution. The resulting deviance-based R-squared of 0.111 suggests that the model has some predictive ability, but it is weak, likely because double plays are influenced by many other factors. We conclude that there is a weak negative correlation between stolen base attempt rate and double play rate, and that further analysis of other factors is needed to explain why.

Part 1 - Introduction

In Major League Baseball (MLB) today, players are evaluated by their metrics, even to the point of rejecting the proverbial “eye test”. As these analyses become more complex, we start to investigate correlations that were never considered before. For example, do players with a high number of stolen base attempts ground into double plays (GIDP or GDP) less often?

The initial batter_df data frame contains 480 players from 2024 who had at least one plate appearance and attempted at least one stolen base (successful or not).

This question sets aside more obvious explanations, such as players simply being fast, stealing many bases, or having higher stolen base rates. We will check some of these correlations to help validate our results. However, what we really want to test is whether aggressive base runners are better at outrunning double plays. If so, it would suggest that the age-old adage to hustle holds true even in a more analytically guided world.

Part 2 - Data

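# Load baseballr for the FanGraphs download and tidyverse for wrangling and plots.
library(baseballr)
library(tidyverse)
# Pull 2024 batting data from fangraphs.com.
# Originally wanted to obtain the same data from baseball-reference.com.
# batter_df <- bref_daily_batter('2024-03-20', '2024-09-30')
batter_df <- fg_bat_leaders(startseason = '2024', endseason = '2024')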
# Get the columns that we want to use for analysis. 
# Keep both name and id in case two players have the same name. 
batter_df <- batter_df %>% 
  select(playerid, PlayerName, G, PA, GDP, SB, CS)
# Rename the columns to be clearer.
batter_df <- batter_df %>%
  rename(
    id = playerid,
    name = PlayerName,
    games_played = G,
    plate_appearances = PA,
    grounded_into_double_plays = GDP,
    stolen_bases = SB,
    caught_stealing = CS
  )
# Remove players with 0 plate appearances to avoid dividing by zero.
# Express stolen base attempts and double plays as rates per plate appearance.
# Then remove any player with no stolen base attempts.
batter_df <- batter_df %>%
  filter(plate_appearances > 0) %>%
  mutate(
    stolen_base_attempt_rate = (stolen_bases + caught_stealing) / plate_appearances,
    double_play_rate = grounded_into_double_plays / plate_appearances
  ) %>%
  filter(stolen_base_attempt_rate > 0)

Here is the modified data frame, sorted by stolen base attempt rate.

head(batter_df %>% arrange(desc(stolen_base_attempt_rate)))

## ── MLB Player Batting Leaders data from FanGraphs.com ─────── baseballr 1.6.0 ──

## ℹ Data updated: 2025-05-05 20:39:46 EDT

## # A tibble: 6 × 9
##      id name  games_played plate_appearances grounded_into_double…¹ stolen_bases
##   <int> <chr>        <int>             <int>                  <int>        <int>
## 1 27543 Duke…           11                 5                      0            5
## 2 17620 Myle…            7                 4                      0            2
## 3 19779 Dair…           88               132                      2           31
## 4 22261 Bubb…           17                18                      1            5
## 5 20450 Brew…            3                 5                      0            1
## 6 16496 Forr…           16                35                      0            4
## # ℹ abbreviated name: ¹grounded_into_double_plays
## # ℹ 3 more variables: caught_stealing <int>, stolen_base_attempt_rate <dbl>,
## #   double_play_rate <dbl>
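
The rate construction above can be checked by hand on a tiny example. The following is a minimal sketch in base R using three made-up players; the names and counts are hypothetical and are not taken from FanGraphs. It applies the same three steps: drop zero plate appearances, compute both rates, then drop players with no attempts.

```r
# Hypothetical counts for three made-up players (not real FanGraphs data).
toy <- data.frame(
  name = c("A", "B", "C"),
  plate_appearances = c(600, 40, 0),
  grounded_into_double_plays = c(12, 1, 0),
  stolen_bases = c(30, 0, 0),
  caught_stealing = c(6, 0, 0)
)

# Drop players with no plate appearances so the division is defined.
toy <- toy[toy$plate_appearances > 0, ]

# Attempts are successful steals plus times caught stealing.
toy$stolen_base_attempt_rate <-
  (toy$stolen_bases + toy$caught_stealing) / toy$plate_appearances
toy$double_play_rate <-
  toy$grounded_into_double_plays / toy$plate_appearances

# Keep only players with at least one stolen base attempt.
toy <- toy[toy$stolen_base_attempt_rate > 0, ]
toy
```

Only player A survives both filters, with an attempt rate of 36 / 600 = 0.06 and a double play rate of 12 / 600 = 0.02.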

Part 3 - Exploratory data analysis

We start the exploratory data analysis with histograms of the two variables of interest and a scatter plot of one against the other.

batter_df %>%
  ggplot(aes(stolen_base_attempt_rate)) + geom_histogram(binwidth = 0.02)

batter_df %>%
  ggplot(aes(double_play_rate)) + geom_histogram(binwidth = 0.01)

batter_df %>%
  ggplot(aes(x = stolen_base_attempt_rate, y = double_play_rate)) + geom_point()

The initial histograms show that the stolen base attempt rate is typically very low.

The stolen base attempt rate has a number of outliers that stray far beyond the rest of the players, who usually attempt a steal in under 25% of their plate appearances. We will restrict the data frame to players who qualify for end-of-season leaderboard statistics to remove those outliers. The qualifying threshold is 3.1 plate appearances per team game times 162 games, or about 502 plate appearances; the filter below keeps players with more than 502. Removing players with few plate appearances should drop skewed cases such as dedicated pinch runners, whose role is to run often. We then replot to see whether removing the outliers improves the distributions. If not, we may need to perform a log transformation.
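
The threshold arithmetic is easy to verify. The sketch below is only an illustration: the variable names and the three plate appearance totals are hypothetical. It also shows how a strict comparison treats a player sitting exactly on 502.

```r
# Qualification threshold: 3.1 plate appearances per team game over 162 games.
pa_per_team_game <- 3.1
team_games <- 162
qualifying_pa <- pa_per_team_game * team_games
qualifying_pa          # 502.2
floor(qualifying_pa)   # 502

# Hypothetical plate appearance totals around the threshold.
example_pa <- c(501, 502, 503)
example_pa > 502       # FALSE FALSE TRUE
example_pa >= 502      # FALSE TRUE TRUE
```

A strict `>` comparison excludes a player with exactly 502 plate appearances, so the choice between `>` and `>=` only matters if such a player exists in the data.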

batter_qual_df <- batter_df %>%
  filter(plate_appearances > 502)

head(batter_qual_df %>% arrange(desc(stolen_base_attempt_rate)))

## ── MLB Player Batting Leaders data from FanGraphs.com ─────── baseballr 1.6.0 ──

## ℹ Data updated: 2025-05-05 20:39:46 EDT

## # A tibble: 6 × 9
##      id name  games_played plate_appearances grounded_into_double…¹ stolen_bases
##   <int> <chr>        <int>             <int>                  <int>        <int>
## 1 26668 Elly…          160               696                     12           67
## 2 22186 Bric…          155               619                      8           50
## 3 16939 Lane…          130               528                      5           32
## 4 19755 Shoh…          159               731                      7           59
## 5 29931 Jaco…          150               521                     11           33
## 6 20454 Jazz…          147               621                      6           40
## # ℹ abbreviated name: ¹grounded_into_double_plays
## # ℹ 3 more variables: caught_stealing <int>, stolen_base_attempt_rate <dbl>,
## #   double_play_rate <dbl>

batter_qual_df %>%
  ggplot(aes(stolen_base_attempt_rate)) + 
  geom_histogram(binwidth = 0.01) +
  geom_density()

batter_qual_df %>%
  ggplot(aes(double_play_rate)) + 
  geom_histogram(binwidth = 0.005) +
  geom_density()

batter_qual_df %>%
  ggplot(aes(x = stolen_base_attempt_rate, y = double_play_rate)) + 
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The stolen_base_attempt_rate still has a right skew, but double_play_rate looks approximately normal. We will apply a log transformation to stolen_base_attempt_rate to improve its distribution.

# Note: from here on, stolen_base_attempt_rate holds the natural log of the rate.
batter_qual_df <- batter_df %>%
  filter(plate_appearances > 502) %>%
  mutate(stolen_base_attempt_rate = log(stolen_base_attempt_rate))

head(batter_qual_df %>% arrange(desc(stolen_base_attempt_rate)))

## ── MLB Player Batting Leaders data from FanGraphs.com ─────── baseballr 1.6.0 ──

## ℹ Data updated: 2025-05-05 20:39:46 EDT

## # A tibble: 6 × 9
##      id name  games_played plate_appearances grounded_into_double…¹ stolen_bases
##   <int> <chr>        <int>             <int>                  <int>        <int>
## 1 26668 Elly…          160               696                     12           67
## 2 22186 Bric…          155               619                      8           50
## 3 16939 Lane…          130               528                      5           32
## 4 19755 Shoh…          159               731                      7           59
## 5 29931 Jaco…          150               521                     11           33
## 6 20454 Jazz…          147               621                      6           40
## # ℹ abbreviated name: ¹grounded_into_double_plays
## # ℹ 3 more variables: caught_stealing <int>, stolen_base_attempt_rate <dbl>,
## #   double_play_rate <dbl>

batter_qual_df %>%
  ggplot(aes(stolen_base_attempt_rate)) + 
  geom_histogram(binwidth = 0.65) +
  geom_density()

batter_qual_df %>%
  ggplot(aes(double_play_rate)) + 
  geom_histogram(binwidth = 0.005) +
  geom_density()

batter_qual_df %>%
  ggplot(aes(x = stolen_base_attempt_rate, y = double_play_rate)) + 
  geom_point() +
  geom_smooth()

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

The stolen_base_attempt_rate still shows some skew, but far less than before the log transformation.
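
The effect of the log transformation on skew can also be quantified rather than judged by eye. The sketch below is an illustration on simulated data, not on the FanGraphs data: `simulated_rate` is drawn from a lognormal distribution chosen only because it is right-skewed, and `sample_skewness` is a helper defined here, not a function from any package.

```r
# Moment-based sample skewness: third central moment over the
# second central moment raised to the power 1.5.
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Simulated right-skewed "rates" (an assumption for illustration only).
set.seed(606)
simulated_rate <- rlnorm(500, meanlog = -3.5, sdlog = 0.8)

# Skewness before and after the log transformation.
sample_skewness(simulated_rate)
sample_skewness(log(simulated_rate))
```

A value near 0 indicates a roughly symmetric distribution, so the second number should be much closer to 0 than the first. The same helper could be applied to the rate column before and after the `mutate()` step.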

Part 4 - Inference

The null hypothesis is that there is no correlation between a player's stolen base attempt rate and their double play rate. The alternative hypothesis is that there is a correlation between these values.

The plots above suggest a weak negative relationship between stolen base attempt rate and double play rate. This raises the question of which correlation test to use. We are looking for a linear relationship between two rate statistics, both of which are continuous variables, so a Pearson correlation test is suitable. Note that stolen_base_attempt_rate here is the log-transformed rate.

cor.test(batter_qual_df$stolen_base_attempt_rate, batter_qual_df$double_play_rate)

## 
##  Pearson's product-moment correlation
## 
## data:  batter_qual_df$stolen_base_attempt_rate and batter_qual_df$double_play_rate
## t = -4.6687, df = 124, p-value = 7.746e-06
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5259866 -0.2271026
## sample estimates:
##        cor 
## -0.3866511

The correlation coefficient of -0.387 has the expected negative sign, but the relationship is not very strong. The p-value (7.746e-06) is well below 0.05, so we reject the null hypothesis of no correlation. The 95% confidence interval for the correlation coefficient runs from -0.526 to -0.227 and does not include 0. This is a plausible result, as the two variables were expected to be correlated, but not very strongly.
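
The numbers in the cor.test() output are tied together by standard formulas, which gives a quick consistency check. The sketch below uses only the reported correlation and degrees of freedom; the variable names are ours. It recomputes the t statistic as r * sqrt(df) / sqrt(1 - r^2) and the confidence interval from the Fisher z-transformation.

```r
r <- -0.3866511   # sample correlation reported by cor.test()
df <- 124         # degrees of freedom reported by cor.test()
n <- df + 2       # number of players in the test

# t statistic and two-sided p-value for H0: correlation = 0.
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z-transformation.
z <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci <- tanh(c(z - half_width, z + half_width))

round(t_stat, 4)
signif(p_value, 4)
round(ci, 4)
```

These values should agree with the output above up to rounding, which confirms that the test was run on 126 qualified players.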

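We will fit a Generalized Linear Model (GLM) with a Gamma distribution. We chose it because the relationship appears roughly linear, our dependent variable double_play_rate is positive, and the data shows some skew. Note that the Gamma family in R uses the inverse link by default, so the model is linear in 1 / double_play_rate rather than in double_play_rate itself. A positive coefficient on stolen_base_attempt_rate therefore corresponds to a lower predicted double play rate.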

model <- glm(double_play_rate ~ stolen_base_attempt_rate, family = Gamma, data = batter_qual_df)
summary(model)

## 
## Call:
## glm(formula = double_play_rate ~ stolen_base_attempt_rate, family = Gamma, 
##     data = batter_qual_df)
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                92.295      8.479  10.885  < 2e-16 ***
## stolen_base_attempt_rate    8.350      1.838   4.542  1.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.17663)
## 
##     Null deviance: 31.042  on 125  degrees of freedom
## Residual deviance: 27.592  on 124  degrees of freedom
## AIC: -871.33
## 
## Number of Fisher Scoring iterations: 5
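
Because family = Gamma defaults to the inverse link, the fitted relationship is 1 / double_play_rate = 92.295 + 8.350 * log(attempt rate). The sketch below turns the reported coefficients into predicted double play rates. The three attempt rates are hypothetical values chosen for illustration, and `predict_dp_rate` is a helper defined here, not part of the original analysis.

```r
# Coefficients copied from the summary(model) output.
intercept <- 92.295
slope <- 8.350

# Inverse link: the linear predictor equals 1 / expected double play rate.
predict_dp_rate <- function(attempt_rate) {
  linear_predictor <- intercept + slope * log(attempt_rate)
  1 / linear_predictor
}

# Hypothetical stolen base attempt rates (per plate appearance).
example_rates <- c(0.01, 0.05, 0.10)
data.frame(
  attempt_rate = example_rates,
  predicted_double_play_rate = predict_dp_rate(example_rates)
)
```

The predicted double play rate falls as the attempt rate rises, from roughly 1.9% at an attempt rate of 0.01 to roughly 1.4% at 0.10. This matches the negative Pearson correlation even though the coefficient in the table is positive.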

plot(model, which = 1)

plot(model, which = 2)

The residuals vs. fitted plot shows no clear pattern, with the residuals hovering around 0. The Q-Q plot shows that the residuals track the normal line until about 1.5 standard deviations, after which there is a heavy right tail: there are far more extreme values on the right side of the distribution than a normal distribution would produce.

We can calculate a deviance-based (pseudo) R-squared from the deviance values of the fitted model.

rsquared <- 1 - (model$deviance / model$null.deviance)
rsquared

## [1] 0.1111449

How an R-squared value is interpreted depends on context. High values such as 0.8 indicate that a model fits well. Our value of 0.111 is low: the model accounts for only about 11% of the deviance, so its predictive ability is weak.
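
For comparison, the squared Pearson correlation equals the R-squared of a simple least-squares line fitted to the same two variables. The sketch below uses only numbers reported above (the rounded deviances from the model summary and the correlation from cor.test()) to put the two measures side by side. They measure different things, deviance explained versus variance explained, so this is a rough comparison rather than a formal test.

```r
# Rounded deviances copied from the summary(model) output.
null_deviance <- 31.042
residual_deviance <- 27.592
deviance_r2 <- 1 - residual_deviance / null_deviance

# Squared Pearson correlation from the cor.test() output.
r <- -0.3866511
pearson_r2 <- r^2

round(c(deviance_r2 = deviance_r2, pearson_r2 = pearson_r2), 3)
```

Both figures are low, about 0.11 and 0.15, which supports the same reading: stolen base attempt rate alone explains only a small share of the variation in double play rate.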

Part 5 - Conclusion

There appears to be a moderately weak negative correlation between stolen base attempt rate and double play rate. However, the fitted model is not a strong fit: it reduces the deviance by only about 11% relative to the null model. There are likely other factors that predict double plays better than the stolen base attempt rate.

We did not establish strong evidence that aggressive base runners are more likely to hustle out of double plays. The analysis was limited by the variables chosen. A follow-up analysis could include more variables, such as infield hit rates and speed ratings.

References

“Acquiring and Analyzing Baseball Data.” Github.io, 2015, billpetti.github.io/baseballr. Accessed 5 May 2025.

“Rate Stats Qualifiers | Glossary | MLB.com.” MLB.com, n.d., https://www.mlb.com/glossary/standard-stats/rate-stats-qualifiers.