In this analysis, we explore a dataset of baseball pitchers using count data regression, asking which workload and performance statistics are associated with a pitcher's totals of wins and strikeouts. We use the clarify package to extract quantities of interest from the estimated regression parameters.
The dataset we are working with contains detailed statistics of baseball pitchers over the years. It includes various performance metrics such as wins, losses, games played, strikeouts, and more. This dataset is rich with information that can help us understand the factors that contribute to a pitcher’s success and performance on the field.
The dataset includes the following columns:
Player name: The name of the baseball pitcher.
Position: The position of the player.
Win: The number of wins by the pitcher.
Loss: The number of losses by the pitcher.
Earned run Average (ERA): The average number of earned runs allowed by the pitcher per nine innings.
Games played: The total number of games played by the pitcher.
Games Started: The number of games started by the pitcher.
Complete Game: The number of games where the pitcher pitched the entire game.
Shutout: The number of complete games in which the pitcher allowed no runs.
Save: The number of games saved by the pitcher.
Save Opportunity: The number of save opportunities for the pitcher.
Innings pitched: The total number of innings pitched by the pitcher.
hit: The number of hits allowed by the pitcher.
run: The number of runs allowed by the pitcher.
earned run: The number of earned runs allowed by the pitcher.
home run: The number of home runs allowed by the pitcher.
Hit Batsmen: The number of batters hit by the pitcher.
base on balls: The number of walks given by the pitcher.
Strikeouts: The number of strikeouts by the pitcher.

library(tidyverse)
library(MASS)
library(clarify)
# Load the data
url <- "https://raw.githubusercontent.com/angelbayron/AngelB/main/baseball_pitcher.csv"
data <- read_csv(url)
summary(data)
##  Player name        Position           Win              Loss
##  Length:2577        Length:2577        Min.   :  3.00   Min.   :  1.00
##  Class :character   Class :character   1st Qu.: 22.00   1st Qu.: 25.00
##  Mode  :character   Mode  :character   Median : 41.00   Median : 44.00
##                                        Mean   : 62.62   Mean   : 59.64
##                                        3rd Qu.: 83.00   3rd Qu.: 84.00
##                                        Max.   :511.00   Max.   :316.00
##                                        NA's   :2        NA's   :2
##  Earned run Average  Games played      Games Started    Complete Game
##  Min.   :1.890       Min.   :  23.0    Min.   :  0.0    Min.   :  0.0
##  1st Qu.:3.460       1st Qu.: 156.0    1st Qu.: 25.5    1st Qu.:  0.0
##  Median :3.910       Median : 252.0    Median : 81.0    Median :  7.0
##  Mean   :3.926       Mean   : 295.1    Mean   :125.7    Mean   : 39.5
##  3rd Qu.:4.370       3rd Qu.: 388.5    3rd Qu.:192.0    3rd Qu.: 50.0
##  Max.   :6.560       Max.   :1252.0    Max.   :815.0    Max.   :749.0
##  NA's   :2           NA's   :2         NA's   :2        NA's   :2
##  Shutout          Save             Save Opportunity   Innings pitched
##  Min.   :  0.00   Min.   :  0.00   Length:2577        Min.   : 106.2
##  1st Qu.:  0.00   1st Qu.:  1.00   Class :character   1st Qu.: 438.2
##  Median :  2.00   Median :  5.00   Mode  :character   Median : 751.2
##  Mean   :  6.01   Mean   : 20.59                      Mean   :1079.5
##  3rd Qu.:  8.00   3rd Qu.: 16.00                      3rd Qu.:1452.7
##  Max.   :110.00   Max.   :652.00                      Max.   :7356.0
##  NA's   :2        NA's   :2                           NA's   :2
##  hit            run              earned run       home run
##  Min.   :  76   Min.   :  43.0   Min.   :  35.0   Min.   :  1.00
##  1st Qu.: 432   1st Qu.: 220.0   1st Qu.: 195.5   1st Qu.: 35.00
##  Median : 744   Median : 382.0   Median : 329.0   Median : 60.00
##  Mean   :1063   Mean   : 527.6   Mean   : 448.7   Mean   : 85.04
##  3rd Qu.:1446   3rd Qu.: 736.5   3rd Qu.: 626.5   3rd Qu.:112.00
##  Max.   :7092   Max.   :3167.0   Max.   :2147.0   Max.   :522.00
##  NA's   :2      NA's   :2        NA's   :2        NA's   :2
##  Hit Batsmen      base on balls    Strikeouts       WHIP
##  Min.   :  0.00   Min.   :  29.0   Min.   : 152.0   Min.   :0.890
##  1st Qu.: 12.00   1st Qu.: 168.0   1st Qu.: 259.0   1st Qu.:1.270
##  Median : 21.00   Median : 284.0   Median : 457.0   Median :1.350
##  Mean   : 30.26   Mean   : 370.6   Mean   : 634.8   Mean   :1.354
##  3rd Qu.: 39.00   3rd Qu.: 495.0   3rd Qu.: 823.0   3rd Qu.:1.440
##  Max.   :277.00   Max.   :2795.0   Max.   :5714.0   Max.   :1.840
##  NA's   :2        NA's   :12       NA's   :2        NA's   :2
##  AVG
##  Length:2577
##  Class :character
##  Mode  :character
##
##
##
##
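The summary points to two things worth handling before modeling: Save Opportunity and AVG were read as character columns, and most columns have 2 missing values (base on balls has 12). The sketch below shows one way to tidy such a table. It runs on a tiny made-up data frame, and the helper name `clean_pitchers`, the `"--"` placeholder, and the idea that the shared missing values come from fully empty rows are all assumptions made for illustration, not facts about the real file.

```r
# Illustrative stand-in for the real data: similar problems, made-up values.
toy <- data.frame(
  `Player name`      = c("A", "B", "C", NA),
  Win                = c(10, 25, 7, NA),
  `Save Opportunity` = c("3", "--", "12", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: syntactic names, numeric conversion, drop empty rows.
clean_pitchers <- function(df, numeric_cols) {
  # "Player name" -> "player.name", so formulas no longer need backticks
  names(df) <- make.names(tolower(names(df)))
  numeric_cols <- make.names(tolower(numeric_cols))
  for (col in numeric_cols) {
    # Non-numeric strings such as "--" become NA (warning suppressed on purpose)
    df[[col]] <- suppressWarnings(as.numeric(df[[col]]))
  }
  # Keep only rows that have at least one non-missing value
  df[rowSums(!is.na(df)) > 0, , drop = FALSE]
}

clean <- clean_pitchers(toy, numeric_cols = "Save Opportunity")
str(clean)
```

Renaming is optional, since the models work with backticked names, but syntactic names tend to be easier to pass to downstream helpers.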
# Fit a Poisson regression model for pitcher wins
poisson_model_wins <- glm(
  Win ~ `Games played` + `Games Started` + `Complete Game` + `Shutout` + `Strikeouts`,
  family = poisson(), data = data
)
summary(poisson_model_wins)
##
## Call:
## glm(formula = Win ~ `Games played` + `Games Started` + `Complete Game` +
##     Shutout + Strikeouts, family = poisson(), data = data)
##
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)      3.200e+00  5.410e-03  591.51   <2e-16 ***
## `Games played`   9.159e-04  1.678e-05   54.56   <2e-16 ***
## `Games Started`  4.207e-03  3.592e-05  117.13   <2e-16 ***
## `Complete Game`  8.114e-04  4.153e-05   19.54   <2e-16 ***
## Shutout         -3.985e-03  3.474e-04  -11.47   <2e-16 ***
## Strikeouts      -1.433e-04  7.432e-06  -19.28   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 115694 on 2574 degrees of freedom
## Residual deviance: 20335 on 2569 degrees of freedom
## (2 observations deleted due to missingness)
## AIC: 34755
##
## Number of Fisher Scoring iterations: 4
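Poisson coefficients are on the log scale, so they are easier to read after exponentiating: exp(coefficient) is the multiplicative change in expected wins for a one-unit increase in the predictor, holding the others fixed. The short calculation below uses only the numbers printed in the summary above (the vector names are our own labels). It also computes the ratio of residual deviance to residual degrees of freedom, a rough check that should be near 1 if the Poisson variance assumption holds.

```r
# Coefficients copied from the summary output above (log scale)
coefs <- c(
  games_played  = 9.159e-04,
  games_started = 4.207e-03,
  complete_game = 8.114e-04,
  shutout       = -3.985e-03,
  strikeouts    = -1.433e-04
)

# Incidence rate ratios: multiplicative change in expected wins per unit increase
irr <- exp(coefs)
print(round(irr, 4))

# Percent change in expected wins for 10 additional games started
pct_10_starts <- (exp(10 * coefs["games_started"]) - 1) * 100
print(round(pct_10_starts, 2))

# Rough overdispersion check: residual deviance / residual degrees of freedom
dispersion_ratio <- 20335 / 2569
print(round(dispersion_ratio, 2))
```

Ten additional starts correspond to roughly 4.3% more expected wins. The deviance ratio of about 7.9 is well above 1, which suggests overdispersion, so the very small standard errors in the table are probably optimistic.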
# Clarify the results
clarified_poisson_wins <- clarify::sim(poisson_model_wins, n = 1000)
summary(clarified_poisson_wins)
## Length Class Mode
## sim.coefs 6000 -none- numeric
## coefs 6 -none- numeric
## vcov 36 -none- numeric
## fit 31 glm list
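The output above only lists the components of the simulation object: 1,000 draws of 6 coefficients (6000 values), the point estimates, their covariance matrix, and the fitted model. A quantity of interest still has to be computed from those draws. To show the idea without relying on any particular helper, here is a minimal hand-rolled sketch on simulated data. The data, the variable names `starts` and `wins`, and the values 20 and 40 are all made up for illustration, and drawing from a multivariate normal is our assumption about the simulation step.

```r
library(MASS)  # for mvrnorm()

set.seed(1)
# Made-up data standing in for the pitcher table
n <- 500
toy <- data.frame(starts = rpois(n, 30))
toy$wins <- rpois(n, exp(1 + 0.02 * toy$starts))

fit <- glm(wins ~ starts, family = poisson(), data = toy)

# Step 1: draw coefficient vectors from their approximate sampling distribution
draws <- mvrnorm(1000, mu = coef(fit), Sigma = vcov(fit))

# Step 2: for each draw, compute expected wins at a chosen number of starts
expected_wins <- function(beta, starts) exp(beta[1] + beta[2] * starts)
at_20 <- apply(draws, 1, expected_wins, starts = 20)
at_40 <- apply(draws, 1, expected_wins, starts = 40)

# Step 3: summarize each simulated quantity and their difference
qi <- rbind(
  starts_20  = quantile(at_20, c(0.025, 0.5, 0.975)),
  starts_40  = quantile(at_40, c(0.025, 0.5, 0.975)),
  difference = quantile(at_40 - at_20, c(0.025, 0.5, 0.975))
)
print(round(qi, 2))
```

Each row gives a median and a 95% simulation interval on the scale of expected wins, which is usually easier to communicate than a log-scale coefficient.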
# Fit a Poisson regression model for pitcher strikeouts
poisson_model_strikeouts <- glm(
  Strikeouts ~ `Games played` + `Games Started` + `Complete Game` + `Shutout` + `Win`,
  family = poisson(), data = data
)
summary(poisson_model_strikeouts)
##
## Call:
## glm(formula = Strikeouts ~ `Games played` + `Games Started` +
##     `Complete Game` + Shutout + Win, family = poisson(), data = data)
##
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)      5.528e+00  1.740e-03 3176.84   <2e-16 ***
## `Games played`   1.138e-03  5.883e-06  193.45   <2e-16 ***
## `Games Started`  3.271e-03  1.778e-05  184.00   <2e-16 ***
## `Complete Game` -1.559e-03  1.686e-05  -92.44   <2e-16 ***
## Shutout         -2.994e-03  1.209e-04  -24.76   <2e-16 ***
## Win              1.238e-03  5.129e-05   24.15   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 940454 on 2574 degrees of freedom
## Residual deviance: 172114 on 2569 degrees of freedom
## (2 observations deleted due to missingness)
## AIC: 192756
##
## Number of Fisher Scoring iterations: 4
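For this model the residual deviance is 172114 on 2569 degrees of freedom, a ratio of about 67, so the Poisson assumption that the variance equals the mean looks badly violated. A common alternative for overdispersed counts is the negative binomial model, available as glm.nb() in the MASS package that is already loaded. The sketch below compares the two on simulated data. The data-generating values, the variable name `games`, and the helper `pearson_dispersion` are assumptions for illustration only, and we have not refit the real pitcher data.

```r
library(MASS)  # for glm.nb()

set.seed(42)
n <- 400
toy <- data.frame(games = runif(n, 20, 400))
# Negative binomial counts: variance mu + mu^2 / size exceeds the Poisson variance mu
mu <- exp(4 + 0.004 * toy$games)
toy$strikeouts <- rnbinom(n, size = 2, mu = mu)

pois_fit <- glm(strikeouts ~ games, family = poisson(), data = toy)
nb_fit   <- glm.nb(strikeouts ~ games, data = toy)

# Pearson dispersion statistic: close to 1 when the variance model fits
pearson_dispersion <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}
disp_pois <- pearson_dispersion(pois_fit)
disp_nb   <- pearson_dispersion(nb_fit)

# Standard error of the slope under each model
se_pois <- unname(sqrt(diag(vcov(pois_fit)))["games"])
se_nb   <- unname(sqrt(diag(vcov(nb_fit)))["games"])

print(c(disp_pois = disp_pois, disp_nb = disp_nb))
print(c(se_pois = se_pois, se_nb = se_nb))
print(AIC(pois_fit, nb_fit))
```

On data like these, the Poisson fit reports a much smaller standard error than the negative binomial fit for a similar slope, which is the pattern to watch for when z values look implausibly large.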
# Clarify the results
clarified_poisson_strikeouts <- clarify::sim(poisson_model_strikeouts, n = 1000)
summary(clarified_poisson_strikeouts)
## Length Class Mode
## sim.coefs 6000 -none- numeric
## coefs 6 -none- numeric
## vcov 36 -none- numeric
## fit 31 glm list
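Calling summary() on the output of sim() only lists the stored components. In clarify, quantities of interest are normally produced by passing the simulation object to a function such as sim_setx(), sim_ame(), or sim_apply(), and then summarizing that result. Here is a minimal sketch on simulated data. The data, the variable names, and the chosen values of `starts` are made up, and the calls are guarded so the code still runs where clarify is not installed. We have not checked how these helpers behave with backticked column names such as `Games played`; renaming columns to syntactic names first sidesteps the question.

```r
set.seed(7)
# Made-up data standing in for the pitcher table
n <- 500
toy <- data.frame(starts = rpois(n, 30))
toy$wins <- rpois(n, exp(1 + 0.02 * toy$starts))
fit <- glm(wins ~ starts, family = poisson(), data = toy)

if (requireNamespace("clarify", quietly = TRUE)) {
  # Simulate coefficient draws, as in the analysis above
  s <- clarify::sim(fit, n = 1000)

  # Expected wins at two chosen values of the predictor
  est_setx <- clarify::sim_setx(s, x = list(starts = c(20, 40)), verbose = FALSE)
  print(summary(est_setx))

  # Average marginal effect of one additional start on expected wins
  est_ame <- clarify::sim_ame(s, var = "starts", verbose = FALSE)
  print(summary(est_ame))
} else {
  message("Package 'clarify' is not installed; skipping the clarify calls.")
}
```

The summaries report an estimate and a simulation-based interval for each quantity, which is the kind of output the coefficient tables alone do not provide.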
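In this analysis, we explored factors associated with the number of wins and the number of strikeouts for baseball pitchers using Poisson regression. In the wins model, games played, games started, and complete games have positive coefficients, while shutouts and strikeouts have negative ones. In the strikeouts model, games played, games started, and wins are positive, while complete games and shutouts are negative. Because all of these predictors reflect how much a pitcher has pitched, the signs are best read as conditional associations, not causal effects. Both models leave residual deviance far above the residual degrees of freedom (20335 and 172114 on 2569), which points to overdispersion, so the very small standard errors should be treated with caution. We used the clarify package to simulate 1,000 coefficient draws from each fitted model, the starting point for extracting quantities of interest from the estimated regression parameters.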