This Data Dive explores the IPL Player Performance Dataset by:
Building a generalized linear model
Diagnosing the model
Interpreting coefficients
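library(tidyverse)  # readr, dplyr, ggplot2; lubridate (for year()) is attached from tidyverse 2.0.0 onward
ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")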
## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): player, team, match_outcome, opposition_team, venue
## dbl (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note (data preparation): The data set includes only 5 matches from 2025, so that season is incomplete and would distort the calculations. To avoid this, all rows from 2025 are filtered out, and the analysis below uses a clean data set containing complete seasons only.
IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025, overs_bowled > 0)
Rows for players who bowled zero overs are also filtered out, because modeling wickets for non-bowlers (overs_bowled = 0, wickets = 0) would artificially inflate the number of zero-wicket observations and distort the relationship between bowling workload and wickets.
Do bowlers who bowl more overs and maintain tighter control (low economy, more maidens) tend to take more wickets in an IPL match?
Response: \(wickets\), which captures bowling impact as the number of opposition batters dismissed.
Explanatory:
\(overs\_bowled\): measures a bowler's workload in the innings, i.e. how many overs they delivered.
\(runs\_conceded\): the total number of runs conceded by the bowler during their spell.
\(economy\): reflects bowling efficiency as the number of runs conceded per over.
\(maiden\): the number of maiden overs the bowler delivered, a maiden being an over in which no runs are conceded.
I modeled the number of \(wickets\) taken by a bowler in a match using a Poisson GLM with a log link. The predictors were \(overs\_bowled\), \(runs\_conceded\), \(economy\), and \(maiden\).
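Written out, the model described above has the following form. The symbols \(\mu\) and \(\beta_0, \dots, \beta_4\) are notation introduced here for illustration and do not appear in the original analysis.

\[
\text{wickets} \sim \text{Poisson}(\mu), \qquad
\log(\mu) = \beta_0 + \beta_1\, overs\_bowled + \beta_2\, runs\_conceded + \beta_3\, economy + \beta_4\, maiden
\]

The log link keeps the expected count \(\mu\) positive for any combination of predictor values.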
glm_wkts <- glm(
  wickets ~ overs_bowled + runs_conceded + economy + maiden,
  data = IPL,
  family = poisson(link = "log")
)
summary(glm_wkts)
##
## Call:
## glm(formula = wickets ~ overs_bowled + runs_conceded + economy +
## maiden, family = poisson(link = "log"), data = IPL)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.443222 0.109120 -4.062 4.87e-05 ***
## overs_bowled 0.309604 0.029499 10.496 < 2e-16 ***
## runs_conceded 0.025735 0.003699 6.957 3.48e-12 ***
## economy -0.197196 0.013906 -14.181 < 2e-16 ***
## maiden 0.263607 0.036524 7.217 5.30e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 15832 on 12920 degrees of freedom
## Residual deviance: 12399 on 12916 degrees of freedom
## AIC: 29154
##
## Number of Fisher Scoring iterations: 5
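As a rough illustration of how the fitted coefficients combine, the sketch below computes the expected wickets for a hypothetical spell (4 overs, 24 runs conceded, economy 6, no maidens). The spell is an assumption chosen for illustration, not a row from the data; the coefficients are copied from the summary above, and the names `beta`, `spell`, `eta`, and `mu` are introduced here.

```r
# Coefficients copied from the summary(glm_wkts) output above
beta <- c(intercept     = -0.443222,
          overs_bowled  =  0.309604,
          runs_conceded =  0.025735,
          economy       = -0.197196,
          maiden        =  0.263607)

# Hypothetical spell (an assumption, not an observed row):
# 4 overs, 24 runs conceded, economy 6, no maidens; the leading 1 multiplies the intercept
spell <- c(intercept = 1, overs_bowled = 4, runs_conceded = 24, economy = 6, maiden = 0)

eta <- sum(beta * spell)  # linear predictor on the log scale
mu  <- exp(eta)           # expected wickets after inverting the log link
mu
```

With the fitted object in memory, `predict(glm_wkts, newdata = ..., type = "response")` should give the same kind of value directly.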
Evaluating the Poisson regression wickets model using diagnostic plots and tools
The overdispersion test evaluates whether the variance in wickets exceeds the mean, which would violate the Poisson assumption.
check_overdispersion(glm_wkts)
## # Overdispersion test
##
## dispersion ratio = 0.911
## Pearson's Chi-Squared = 11762.884
## p-value = 1
## No overdispersion detected.
glm_wkts$deviance / glm_wkts$df.residual
## [1] 0.9599599
The Pearson-based dispersion ratio is \(0.911\) and the deviance-based ratio is \(0.96\), both slightly \(<1\). This indicates mild underdispersion rather than overdispersion. The overdispersion test returns a p-value of \(1\), so there is no evidence that the variance exceeds the mean, and the Poisson assumption of Var = Mean is approximately satisfied. A standard Poisson GLM is therefore a reasonable choice; if anything, the mild underdispersion makes the reported standard errors slightly conservative.
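The Pearson ratio reported above is the Pearson chi-squared statistic divided by the residual degrees of freedom (here \(11762.884 / 12916 \approx 0.911\)). The sketch below shows how both ratios can be computed by hand in base R. It runs on simulated stand-in data, so `toy_fit` and its variables are assumptions for illustration and not the IPL model.

```r
# Simulated stand-in data (an assumption; not the IPL data set)
set.seed(42)
n <- 500
overs <- sample(1:4, n, replace = TRUE)
wkts  <- rpois(n, lambda = exp(-1 + 0.3 * overs))
toy_fit <- glm(wkts ~ overs, family = poisson(link = "log"))

# Pearson dispersion ratio: sum of squared Pearson residuals / residual df
pearson_ratio <- sum(residuals(toy_fit, type = "pearson")^2) / df.residual(toy_fit)

# Deviance-based ratio: residual deviance / residual df
deviance_ratio <- deviance(toy_fit) / df.residual(toy_fit)

c(pearson = pearson_ratio, deviance = deviance_ratio)
```

Values near 1 are consistent with the Poisson assumption; values well above 1 would point to overdispersion.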
influencePlot(glm_wkts)
## StudRes Hat CookD
## 133 3.79995482 0.0002953020 0.00252642868
## 948 -0.06192977 0.0149470229 0.00001150594
## 1791 3.74121302 0.0004621659 0.00690388578
## 2655 -0.72647258 0.0153032050 0.00142241420
## 9232 -2.16679008 0.0114405688 0.00546474533
IPL[c(133,948,1791,2655, 9232),]
## # A tibble: 5 × 23
## match_id player team runs balls_faced fours sixes wickets overs_bowled
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1254062 AD Russell Kolk… 9 15 1 0 5 2
## 2 1216494 Mohammed Si… Roya… 0 0 0 0 3 4
## 3 1178424 S Gopal Raja… 0 0 0 0 3 1
## 4 501233 S Sreesanth Koch… 0 2 0 0 2 4
## 5 829789 M de Lange Mumb… 0 0 0 0 0 4
## # ℹ 14 more variables: balls_bowled <dbl>, runs_conceded <dbl>, catches <dbl>,
## # run_outs <dbl>, maiden <dbl>, stumps <dbl>, match_outcome <chr>,
## # opposition_team <chr>, strike_rate <dbl>, economy <dbl>,
## # fantasy_points <dbl>, venue <chr>, date <date>, season <dbl>
The influence plot for the Poisson GLM (with Cook's Distance shown as bubble size) appears visually dense because of the large sample size, but only a handful of observations stand out. It flags rows 133, 948, 1791, 2655, and 9232 as the most notable innings. Inspecting these rows shows that they correspond to genuine extreme bowling performances. For example, row 133 is a remarkable 5-wicket haul taken in just 2 overs, row 948 is an exceptionally economical spell (economy of 2.0, with 2 maiden overs and 3 wickets), and row 1791 records 3 wickets in a single over.
Rows 133 and 1791 have large studentized residuals (about 3.8 and 3.7), while rows 948 and 2655 stand out for their high leverage (hat values) rather than for poor fit. Because these innings represent legitimate cricketing extremes rather than data errors, they are influential but not problematic. The largest Cook's Distance among them is about 0.007, which indicates low-severity influence, so the model remains stable for interpretation.
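The three quantities in the influence table (studentized residual, hat value, Cook's Distance) can also be pulled out with base R and ranked directly. The sketch below does this on simulated stand-in data; `toy`, `toy_fit`, and `influence_tbl` are hypothetical names, and the data are an assumption, not the IPL innings.

```r
# Simulated stand-in data (an assumption; not the IPL data set)
set.seed(1)
n <- 300
toy <- data.frame(overs = sample(1:4, n, replace = TRUE))
toy$wkts <- rpois(n, lambda = exp(-1 + 0.3 * toy$overs))
toy_fit <- glm(wkts ~ overs, family = poisson(link = "log"), data = toy)

cd  <- cooks.distance(toy_fit)  # overall influence of each observation
lev <- hatvalues(toy_fit)       # leverage (hat values)
rs  <- rstudent(toy_fit)        # studentized residuals

# Five observations with the largest Cook's Distance
top <- order(cd, decreasing = TRUE)[1:5]
influence_tbl <- data.frame(row = top, StudRes = rs[top], Hat = lev[top], CookD = cd[top])
influence_tbl
```

Sorting by Cook's Distance makes it easier to see whether a flagged row is driven by a large residual, by high leverage, or by both.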
ggplot(data = slice(IPL, c(133, 948, 1791, 2655))) +
  geom_point(data = IPL,
             mapping = aes(x = overs_bowled, y = wickets),
             alpha = 0.3) +
  geom_point(mapping = aes(x = overs_bowled, y = wickets),
             color = 'darkred', size = 3) +
  geom_text_repel(mapping = aes(x = overs_bowled,
                                y = wickets,
                                label = player),
                  color = 'darkred') +
  labs(title = "Investigating High Influence Points",
       subtitle = "Label = Player Name",
       x = "Overs Bowled",
       y = "Wickets Taken")
The plot shows influential observations identified by the Cook's Distance and influence diagnostics. The red points correspond to four of the five flagged rows (133, 948, 1791, and 2655). These innings stand out because they represent extreme bowling performances, such as a 5-wicket haul in 2 overs, 3 wickets in a single over, or an exceptionally economical spell with two maidens.
Null deviance: \(15832\) (df: \(12920\))
Residual deviance: \(12399\) (df: \(12916\))
Reduction:
\(15832 - 12399 = 3433\)
This is a substantial drop, about 22% of the null deviance, achieved with four predictors. The model therefore explains a meaningful portion of the variation in wickets: overs_bowled, runs_conceded, economy, and maiden overs collectively reduce the deviance by \(3433\) units.
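The deviance drop can be turned into a proportion and a likelihood-ratio (chi-squared) test against the null model. The sketch below uses only the numbers printed in the model summary; the variable names are introduced here for illustration.

```r
# Values copied from the summary(glm_wkts) output
null_dev  <- 15832   # null deviance, 12920 df
resid_dev <- 12399   # residual deviance, 12916 df

dev_drop <- null_dev - resid_dev   # reduction in deviance
df_diff  <- 12920 - 12916          # number of predictors added

# Proportion of null deviance accounted for by the predictors
prop_explained <- dev_drop / null_dev

# Likelihood-ratio test of the full model against the intercept-only model
p_value <- pchisq(dev_drop, df = df_diff, lower.tail = FALSE)

c(drop = dev_drop, df = df_diff, proportion = round(prop_explained, 3))
```

A drop of this size on 4 degrees of freedom gives a p-value that is effectively zero, so the predictors jointly improve on the intercept-only model.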
glm_wkts_ovr <- glm(wickets ~ overs_bowled,
                    family = poisson, data = IPL)
glm_wkts_runs <- glm(wickets ~ runs_conceded,
                     family = poisson, data = IPL)
glm_wkts_eco <- glm(wickets ~ economy,
                    family = poisson, data = IPL)
glm_wkts_maid <- glm(wickets ~ maiden,
                     family = poisson, data = IPL)
AIC(glm_wkts_ovr,
    glm_wkts_runs,
    glm_wkts_eco,
    glm_wkts_maid)
## df AIC
## glm_wkts_ovr 2 30050.98
## glm_wkts_runs 2 32565.47
## glm_wkts_eco 2 30999.11
## glm_wkts_maid 2 32310.05
From the summary of the main Poisson model \(glm\_wkts\), we have \(AIC = 29154\).
The AIC comparison shows that \(overs\_bowled\) is the strongest single predictor of wickets, with the lowest AIC (\(30050.98\)) among the single-predictor models. The full model \(glm\_wkts\) has the lowest AIC overall (\(29154\)), even after AIC's penalty for its three additional parameters, indicating that combining all predictors provides the best fit among the models compared.
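One way to make the comparison concrete is to express each model's AIC as a difference from the best one. The sketch below uses the AIC values reported above (the full-model value is the rounded figure from the summary); `aic` and `delta_aic` are names introduced here.

```r
# AIC values copied from the outputs above
aic <- c(full          = 29154,
         overs_bowled  = 30050.98,
         runs_conceded = 32565.47,
         economy       = 30999.11,
         maiden        = 32310.05)

# Difference from the best (lowest) AIC; 0 marks the preferred model
delta_aic <- aic - min(aic)
sort(delta_aic)
```

The gap between the full model and the best single-predictor model is several hundred AIC units, far larger than the few units usually treated as a meaningful difference.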
The model is \(\text{wickets} \sim \text{overs_bowled} + \text{runs_conceded} + \text{economy} + \text{maiden}\)
Because this is a Poisson regression with a log link, each exponentiated coefficient represents a multiplicative effect on expected wickets.
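A brief derivation of why exponentiating works, using \(\mu(x_j)\) for the expected wickets at predictor value \(x_j\) and \(\beta_j\) for its coefficient (notation assumed here, with all other predictors held fixed):

\[
\frac{\mu(x_j + 1)}{\mu(x_j)}
= \frac{\exp\big(\beta_0 + \dots + \beta_j (x_j + 1) + \dots\big)}{\exp\big(\beta_0 + \dots + \beta_j x_j + \dots\big)}
= e^{\beta_j}
\]

So a one-unit increase in \(x_j\) multiplies expected wickets by \(e^{\beta_j}\), which corresponds to a percentage change of \(100\,(e^{\beta_j} - 1)\).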
From the model summary:
1. overs_bowled (Estimate = \(0.3096\))
exp(0.3096)
## [1] 1.36288
Each additional over bowled increases expected wickets by about 36%, holding all other variables constant.
2. runs_conceded (Estimate = \(0.0257\))
exp(0.0257)
## [1] 1.026033
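Each additional run conceded increases expected wickets by about 2.6%, holding all other variables constant.
This is unexpected, since bowlers want to concede fewer runs while taking more wickets. One cricketing reading is that bowlers who attack more concede more runs but also create more wicket-taking chances. The effect is small but statistically strong. It should be read with caution, however: economy is by definition runs conceded per over, so runs_conceded, overs_bowled, and economy are closely related, and it is hard to vary one while holding the other two fixed.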
3. economy (Estimate = \(-0.1972\))
exp(-0.1972)
## [1] 0.8210264
The negative coefficient indicates that each 1-unit increase in economy rate decreases expected wickets by about 18%, holding other variables constant. Higher economy rates (more expensive bowling) are associated with fewer wickets, so efficient bowlers (low economy) tend to take more wickets.
4. maiden (Estimate = \(0.2636\))
exp(0.2636)
## [1] 1.301607
Each additional maiden over increases expected wickets by about 30%, holding other variables constant. Maiden overs build pressure, forcing batters into riskier shots, which is consistent with the positive effect.
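The point estimates above can be complemented with approximate 95% Wald intervals on the rate-ratio scale, computed as \(\exp(\text{estimate} \pm z \cdot \text{SE})\). The sketch below uses the estimates and standard errors printed in the model summary; `est`, `se`, and `rate_ratios` are names introduced here.

```r
# Estimates and standard errors copied from summary(glm_wkts)
est <- c(overs_bowled = 0.309604, runs_conceded = 0.025735,
         economy = -0.197196, maiden = 0.263607)
se  <- c(overs_bowled = 0.029499, runs_conceded = 0.003699,
         economy = 0.013906, maiden = 0.036524)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

# Rate ratios with Wald interval bounds, all on the multiplicative scale
rate_ratios <- data.frame(
  ratio = exp(est),
  lower = exp(est - z * se),
  upper = exp(est + z * se)
)
round(rate_ratios, 3)
```

With the fitted object in memory, `exp(confint.default(glm_wkts))` should produce the same kind of Wald intervals. An interval that excludes 1 matches the small p-values in the summary.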
These results align with cricketing logic while quantifying the relative importance of different bowling metrics. The large positive effect of maiden overs underscores the importance of control and pressure. The negative effect of economy rate highlights the challenges faced by expensive bowlers.
Further question: Do \(opposition\_team\) and \(venue\) affect the wicket-taking ability of bowlers?