Introduction

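In this analysis, we explore a baseball pitcher dataset and use count data regression to answer two questions: which factors are associated with a pitcher's number of wins, and which are associated with a pitcher's number of strikeouts. We fit Poisson regression models and use the clarify package to simulate from the estimated regression parameters, which is the starting point for extracting quantities of interest.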

The Story of the Dataset

The dataset contains 2,577 rows of baseball pitcher statistics. It includes performance metrics such as wins, losses, games played, and strikeouts. The magnitudes (for example, a maximum of 511 wins) suggest career totals rather than single-season figures. These metrics can help us understand which factors are associated with a pitcher’s success on the field.

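The dataset includes the following columns: Player name, Position, Win, Loss, Earned run Average, Games played, Games Started, Complete Game, Shutout, Save, Save Opportunity, Innings pitched, hit, run, earned run, home run, Hit Batsmen, base on balls, Strikeouts, WHIP, and AVG.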

Load Libraries

# Load MASS before tidyverse so that MASS::select() does not mask dplyr::select()
library(MASS)
library(tidyverse)
library(clarify)

Load Data

# Load the data
url <- "https://raw.githubusercontent.com/angelbayron/AngelB/main/baseball_pitcher.csv"
data <- read_csv(url)

Data Exploration

summary(data)
##  Player name          Position              Win              Loss       
##  Length:2577        Length:2577        Min.   :  3.00   Min.   :  1.00  
##  Class :character   Class :character   1st Qu.: 22.00   1st Qu.: 25.00  
##  Mode  :character   Mode  :character   Median : 41.00   Median : 44.00  
##                                        Mean   : 62.62   Mean   : 59.64  
##                                        3rd Qu.: 83.00   3rd Qu.: 84.00  
##                                        Max.   :511.00   Max.   :316.00  
##                                        NA's   :2        NA's   :2       
##  Earned run Average  Games played    Games Started   Complete Game  
##  Min.   :1.890      Min.   :  23.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:3.460      1st Qu.: 156.0   1st Qu.: 25.5   1st Qu.:  0.0  
##  Median :3.910      Median : 252.0   Median : 81.0   Median :  7.0  
##  Mean   :3.926      Mean   : 295.1   Mean   :125.7   Mean   : 39.5  
##  3rd Qu.:4.370      3rd Qu.: 388.5   3rd Qu.:192.0   3rd Qu.: 50.0  
##  Max.   :6.560      Max.   :1252.0   Max.   :815.0   Max.   :749.0  
##  NA's   :2          NA's   :2        NA's   :2       NA's   :2      
##     Shutout            Save        Save Opportunity   Innings pitched 
##  Min.   :  0.00   Min.   :  0.00   Length:2577        Min.   : 106.2  
##  1st Qu.:  0.00   1st Qu.:  1.00   Class :character   1st Qu.: 438.2  
##  Median :  2.00   Median :  5.00   Mode  :character   Median : 751.2  
##  Mean   :  6.01   Mean   : 20.59                      Mean   :1079.5  
##  3rd Qu.:  8.00   3rd Qu.: 16.00                      3rd Qu.:1452.7  
##  Max.   :110.00   Max.   :652.00                      Max.   :7356.0  
##  NA's   :2        NA's   :2                           NA's   :2       
##       hit            run           earned run        home run     
##  Min.   :  76   Min.   :  43.0   Min.   :  35.0   Min.   :  1.00  
##  1st Qu.: 432   1st Qu.: 220.0   1st Qu.: 195.5   1st Qu.: 35.00  
##  Median : 744   Median : 382.0   Median : 329.0   Median : 60.00  
##  Mean   :1063   Mean   : 527.6   Mean   : 448.7   Mean   : 85.04  
##  3rd Qu.:1446   3rd Qu.: 736.5   3rd Qu.: 626.5   3rd Qu.:112.00  
##  Max.   :7092   Max.   :3167.0   Max.   :2147.0   Max.   :522.00  
##  NA's   :2      NA's   :2        NA's   :2        NA's   :2       
##   Hit Batsmen     base on balls      Strikeouts          WHIP      
##  Min.   :  0.00   Min.   :  29.0   Min.   : 152.0   Min.   :0.890  
##  1st Qu.: 12.00   1st Qu.: 168.0   1st Qu.: 259.0   1st Qu.:1.270  
##  Median : 21.00   Median : 284.0   Median : 457.0   Median :1.350  
##  Mean   : 30.26   Mean   : 370.6   Mean   : 634.8   Mean   :1.354  
##  3rd Qu.: 39.00   3rd Qu.: 495.0   3rd Qu.: 823.0   3rd Qu.:1.440  
##  Max.   :277.00   Max.   :2795.0   Max.   :5714.0   Max.   :1.840  
##  NA's   :2        NA's   :12       NA's   :2        NA's   :2      
##      AVG           
##  Length:2577       
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 
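
The summary shows that `Save Opportunity` and `AVG` were read as character columns, which usually means some entries are not plain numbers. The source does not show what those entries look like, so the following is only a minimal sketch on made-up values (the `"--"` placeholder is an assumption) of how such a column could be coerced and the failures counted:

```r
# Hypothetical raw values; the actual non-numeric entries in the CSV are not shown in the source
raw_avg <- c(".250", ".198", "--", NA)

# as.numeric() returns NA (with a warning) for strings it cannot convert
avg_numeric <- suppressWarnings(as.numeric(raw_avg))

# Count entries that were present but could not be converted
n_failed <- sum(is.na(avg_numeric) & !is.na(raw_avg))
n_failed
```

Counting the failed conversions before overwriting a column makes it easier to see how much information the coercion would discard.
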

Question 1: Factors Associated with the Number of Wins

# Fit a Poisson regression model
poisson_model_wins <- glm(Win ~ `Games played` + `Games Started` + `Complete Game` + `Shutout` + `Strikeouts`, 
                          family = poisson(), data = data)
summary(poisson_model_wins)
## 
## Call:
## glm(formula = Win ~ `Games played` + `Games Started` + `Complete Game` + 
##     Shutout + Strikeouts, family = poisson(), data = data)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      3.200e+00  5.410e-03  591.51   <2e-16 ***
## `Games played`   9.159e-04  1.678e-05   54.56   <2e-16 ***
## `Games Started`  4.207e-03  3.592e-05  117.13   <2e-16 ***
## `Complete Game`  8.114e-04  4.153e-05   19.54   <2e-16 ***
## Shutout         -3.985e-03  3.474e-04  -11.47   <2e-16 ***
## Strikeouts      -1.433e-04  7.432e-06  -19.28   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 115694  on 2574  degrees of freedom
## Residual deviance:  20335  on 2569  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 34755
## 
## Number of Fisher Scoring iterations: 4
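
Because the Poisson model uses a log link, the coefficients above are on the log scale, and exponentiating a coefficient gives the multiplicative change in expected wins per one-unit increase in that predictor. As a worked example using the `Games Started` estimate from the output (the symbols $\beta_0, \dots, \beta_5$ are notation introduced here, not taken from the source):

```latex
\begin{align*}
\log \mathbb{E}[\text{Win}] &= \beta_0 + \beta_1\,\text{Games played} + \beta_2\,\text{Games Started}
  + \beta_3\,\text{Complete Game} + \beta_4\,\text{Shutout} + \beta_5\,\text{Strikeouts} \\[4pt]
\exp(\hat\beta_2) &= \exp(0.004207) \approx 1.0042
  && \text{(one more start: about } 0.42\% \text{ more expected wins)} \\
\exp(100\,\hat\beta_2) &= \exp(0.4207) \approx 1.52
  && \text{(100 more starts: about } 52\% \text{ more expected wins)}
\end{align*}
```

These ratios hold the other predictors fixed. The negative coefficients for `Shutout` and `Strikeouts` (for example, $\exp(-0.003985) \approx 0.996$) are likewise conditional associations and should not be read as causal effects.
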
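# Draw 1,000 simulated coefficient vectors from the model's estimated sampling distribution
clarified_poisson_wins <- clarify::sim(poisson_model_wins, n = 1000)
# summary() on the sim() object itself only lists its components
summary(clarified_poisson_wins)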
##           Length Class  Mode   
## sim.coefs 6000   -none- numeric
## coefs        6   -none- numeric
## vcov        36   -none- numeric
## fit         31   glm    list
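
The object returned by `sim()` is a container of simulated coefficients, which is why its summary only lists components. To obtain quantities of interest, the usual clarify workflow passes that object to a function such as `sim_setx()`, `sim_ame()`, or `sim_apply()` and summarizes the result. The sketch below illustrates this on simulated toy data; the data, the column names `games_started` and `wins`, and the chosen values 50 and 200 are all assumptions made for illustration, not part of the original analysis:

```r
library(clarify)

set.seed(123)

# Toy data standing in for the pitcher table; names and parameters are hypothetical
toy <- data.frame(games_started = rpois(300, 100))
toy$wins <- rpois(300, exp(3 + 0.004 * toy$games_started))

fit <- glm(wins ~ games_started, family = poisson(), data = toy)

# Step 1: draw coefficient vectors from their estimated sampling distribution
s <- sim(fit, n = 500)

# Step 2: compute a quantity of interest for every draw,
# here the expected number of wins at 50 and at 200 games started
est <- sim_setx(s, x = list(games_started = c(50, 200)), verbose = FALSE)

# Step 3: summarize the simulated quantities (estimates with 95% intervals)
summary(est)
```

The toy data use syntactic column names to sidestep the backtick-quoted names in the real dataset; whether the same call works unchanged with names such as `Games Started` has not been checked here.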

Question 2: Factors Associated with the Number of Strikeouts

# Fit a Poisson regression model
poisson_model_strikeouts <- glm(Strikeouts ~ `Games played` + `Games Started` + `Complete Game` + `Shutout` + `Win`, 
                                family = poisson(), data = data)
summary(poisson_model_strikeouts)
## 
## Call:
## glm(formula = Strikeouts ~ `Games played` + `Games Started` + 
##     `Complete Game` + Shutout + Win, family = poisson(), data = data)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      5.528e+00  1.740e-03 3176.84   <2e-16 ***
## `Games played`   1.138e-03  5.883e-06  193.45   <2e-16 ***
## `Games Started`  3.271e-03  1.778e-05  184.00   <2e-16 ***
## `Complete Game` -1.559e-03  1.686e-05  -92.44   <2e-16 ***
## Shutout         -2.994e-03  1.209e-04  -24.76   <2e-16 ***
## Win              1.238e-03  5.129e-05   24.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 940454  on 2574  degrees of freedom
## Residual deviance: 172114  on 2569  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 192756
## 
## Number of Fisher Scoring iterations: 4
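
The residual deviance here is 172114 on 2569 degrees of freedom, a ratio of about 67, and the wins model shows a ratio of about 7.9 (20335 / 2569). Under a well-fitting Poisson model this ratio should be close to 1, so both fits suggest overdispersion, which makes the reported standard errors too small. One common alternative is a negative binomial model via `MASS::glm.nb()`. The following is a minimal sketch on simulated data (all names and parameter values are hypothetical) of how the dispersion can be checked and the two models compared:

```r
library(MASS)

set.seed(42)

# Toy overdispersed counts; names and parameter values are hypothetical
n <- 500
toy <- data.frame(games_started = rpois(n, 100))
mu <- exp(5 + 0.003 * toy$games_started)
toy$strikeouts <- rnbinom(n, size = 2, mu = mu)

# Poisson fit and its Pearson dispersion statistic (close to 1 if Poisson holds)
pois_fit <- glm(strikeouts ~ games_started, family = poisson(), data = toy)
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)
dispersion

# The negative binomial fit estimates an extra dispersion parameter (theta)
nb_fit <- glm.nb(strikeouts ~ games_started, data = toy)

# Compare the two fits; the lower AIC indicates the better trade-off
AIC(pois_fit, nb_fit)
```

Applied to the real data, the same check would use the fitted `poisson_model_strikeouts` object in place of `pois_fit`.
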
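# Draw 1,000 simulated coefficient vectors from the model's estimated sampling distribution
clarified_poisson_strikeouts <- clarify::sim(poisson_model_strikeouts, n = 1000)
# summary() on the sim() object itself only lists its components
summary(clarified_poisson_strikeouts)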
##           Length Class  Mode   
## sim.coefs 6000   -none- numeric
## coefs        6   -none- numeric
## vcov        36   -none- numeric
## fit         31   glm    list

Conclusion

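In this analysis, we explored factors associated with the number of wins and the number of strikeouts for baseball pitchers using Poisson regression. In both models every predictor was statistically significant at the 0.001 level: Games played and Games Started had positive coefficients, while Shutout had a negative coefficient in each model. We then used the sim() function from the clarify package to draw 1,000 simulated coefficient vectors from each model, which provides the basis for extracting quantities of interest from the estimated regression parameters.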