Introduction

This Data Dive explores the IPL Player Performance Dataset by:

  • Building a generalized linear model

  • Diagnosing the model

  • Interpreting coefficients

library(tidyverse)  # provides read_csv(), the dplyr verbs, and ggplot2

ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")
## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): player, team, match_outcome, opposition_team, venue
## dbl  (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note on data preparation: The dataset includes only 5 matches from 2025, an incomplete season that would distort the calculations. To avoid this, all 2025 rows were filtered out so that the analysis uses complete seasons only.

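library(lubridate)  # provides year()

IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  # keep complete seasons only, and only players who actually bowled
  filter(season < 2025, overs_bowled > 0)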

Players who bowled zero overs were also filtered out, because modeling wickets for non-bowlers (overs_bowled = 0, wickets = 0) would artificially inflate the number of zero-wicket observations and distort the relationship between bowling workload and wickets.

Do bowlers who bowl more overs and maintain tighter control (low economy, more maidens) tend to take more wickets in an IPL match?

Variable Selection

Response: \(wickets\), which captures bowling impact through dismissals of opposition batters.

Explanatory:

  • \(overs\_bowled\): Measures a bowler’s workload in the innings, indicating how many overs they delivered.

  • \(runs\_conceded\): Captures the total number of runs given away by the bowler during their spell.

  • \(economy\): Reflects bowling efficiency by indicating how many runs a bowler concedes per over.

  • \(maiden\): Counts the maiden overs the bowler delivered, a maiden being an over in which no runs are conceded.

Poisson Regression (GLM)

I modeled the number of \(wickets\) taken by a bowler using a Poisson GLM with a log link. The predictors were \(overs\_bowled\), \(runs\_conceded\), \(economy\), and \(maiden\).
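Written out, the model described above takes the following form. The symbols \(\mu_i\) (expected wickets for bowling spell \(i\)) and \(\beta_0, \dots, \beta_4\) are notation introduced here for illustration and do not appear in the original analysis.

\[
wickets_i \sim \text{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \beta_1\, overs\_bowled_i + \beta_2\, runs\_conceded_i + \beta_3\, economy_i + \beta_4\, maiden_i
\]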

glm_wkts <- glm(
  wickets ~ overs_bowled + runs_conceded + economy + maiden,
  data = IPL,
  family = poisson(link = "log")
)
summary(glm_wkts)
## 
## Call:
## glm(formula = wickets ~ overs_bowled + runs_conceded + economy + 
##     maiden, family = poisson(link = "log"), data = IPL)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.443222   0.109120  -4.062 4.87e-05 ***
## overs_bowled   0.309604   0.029499  10.496  < 2e-16 ***
## runs_conceded  0.025735   0.003699   6.957 3.48e-12 ***
## economy       -0.197196   0.013906 -14.181  < 2e-16 ***
## maiden         0.263607   0.036524   7.217 5.30e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 15832  on 12920  degrees of freedom
## Residual deviance: 12399  on 12916  degrees of freedom
## AIC: 29154
## 
## Number of Fisher Scoring iterations: 5
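As a rough check on how the fitted coefficients turn into a prediction, the sketch below computes expected wickets by hand for one spell. The coefficients are copied from the summary above; the spell itself (4 overs, 28 runs, economy 7, no maidens) is a hypothetical example chosen for illustration, not a row from the data.

```r
# Coefficients copied from the summary(glm_wkts) output above
coefs <- c(intercept = -0.443222, overs_bowled = 0.309604,
           runs_conceded = 0.025735, economy = -0.197196,
           maiden = 0.263607)

# Hypothetical spell (illustration only): 4 overs, 28 runs, economy 7, no maidens
spell <- c(intercept = 1, overs_bowled = 4, runs_conceded = 28,
           economy = 7, maiden = 0)

# Linear predictor is on the log scale because of the log link
linear_predictor <- sum(coefs * spell[names(coefs)])

# Exponentiate to get back to expected wickets
expected_wickets <- exp(linear_predictor)
round(expected_wickets, 2)
```

For this assumed spell the model expects a little over one wicket. With the fitted object available, `predict(glm_wkts, newdata = ..., type = "response")` should give the same value up to rounding of the coefficients.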

Model Diagnostics

This section evaluates the Poisson wickets model using diagnostic tests and plots.

Overdispersion

The overdispersion test evaluates whether the variance in wickets exceeds the mean, which would violate the Poisson assumption.

library(performance)  # provides check_overdispersion()
check_overdispersion(glm_wkts)
## # Overdispersion test
## 
##        dispersion ratio =     0.911
##   Pearson's Chi-Squared = 11762.884
##                 p-value =         1
## No overdispersion detected.
glm_wkts$deviance / glm_wkts$df.residual
## [1] 0.9599599

The Pearson-based dispersion ratio is \(0.911\) and the deviance-based ratio is \(0.96\), both slightly \(<1\). This indicates mild underdispersion rather than overdispersion. The p-value from the overdispersion test is \(1\), so there is no evidence that the variance exceeds the mean. The Poisson assumption of Var = Mean is therefore approximately met, and a standard Poisson GLM is reasonable; with mild underdispersion, its standard errors are if anything slightly conservative.
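The two dispersion ratios can also be computed by hand, which makes clear what each one measures. The sketch below does this on simulated Poisson counts; `toy_fit`, `x`, and `y` are hypothetical stand-ins rather than the IPL data, so the printed values will differ from those reported above.

```r
set.seed(42)                                  # reproducible toy data, not the IPL data
n <- 2000
x <- runif(n, 1, 4)                           # stand-in predictor
y <- rpois(n, lambda = exp(-0.5 + 0.3 * x))   # counts that really are Poisson

toy_fit <- glm(y ~ x, family = poisson(link = "log"))

# Pearson ratio: sum of squared Pearson residuals over residual df
pearson_ratio <- sum(residuals(toy_fit, type = "pearson")^2) /
  df.residual(toy_fit)

# Deviance ratio: residual deviance over residual df
deviance_ratio <- deviance(toy_fit) / df.residual(toy_fit)

c(pearson = pearson_ratio, deviance = deviance_ratio)
```

Ratios near 1 are consistent with the Poisson variance assumption; values well above 1 would point to overdispersion and values well below 1 to underdispersion.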

Influence / Cook’s Distance

library(car)  # provides influencePlot()
influencePlot(glm_wkts)

##          StudRes          Hat         CookD
## 133   3.79995482 0.0002953020 0.00252642868
## 948  -0.06192977 0.0149470229 0.00001150594
## 1791  3.74121302 0.0004621659 0.00690388578
## 2655 -0.72647258 0.0153032050 0.00142241420
## 9232 -2.16679008 0.0114405688 0.00546474533
IPL[c(133,948,1791,2655, 9232),]
## # A tibble: 5 × 23
##   match_id player       team   runs balls_faced fours sixes wickets overs_bowled
##      <dbl> <chr>        <chr> <dbl>       <dbl> <dbl> <dbl>   <dbl>        <dbl>
## 1  1254062 AD Russell   Kolk…     9          15     1     0       5            2
## 2  1216494 Mohammed Si… Roya…     0           0     0     0       3            4
## 3  1178424 S Gopal      Raja…     0           0     0     0       3            1
## 4   501233 S Sreesanth  Koch…     0           2     0     0       2            4
## 5   829789 M de Lange   Mumb…     0           0     0     0       0            4
## # ℹ 14 more variables: balls_bowled <dbl>, runs_conceded <dbl>, catches <dbl>,
## #   run_outs <dbl>, maiden <dbl>, stumps <dbl>, match_outcome <chr>,
## #   opposition_team <chr>, strike_rate <dbl>, economy <dbl>,
## #   fantasy_points <dbl>, venue <chr>, date <date>, season <dbl>

The influence plot for the Poisson GLM appears visually dense because of the large sample size, but only a handful of observations stand out. It flags rows 133, 948, 1791, 2655, and 9232 as the most noteworthy innings. Inspecting these rows shows that they correspond to genuine extreme bowling performances. For example, row 133 is a remarkable 5-wicket haul taken in just 2 overs, row 948 is an exceptionally economical spell (economy of 2.0, 2 maiden overs, and 3 wickets), and row 1791 records 3 wickets in a single over.

These rare spells are flagged for two different reasons: rows 133 and 1791 have large studentized residuals (about 3.8 and 3.7), while rows 948, 2655, and 9232 have comparatively high leverage (hat values above 0.011). Because these innings represent legitimate cricketing extremes rather than data errors, they are unusual but not problematic. The largest Cook’s distance among them is about 0.007, far below the common rule-of-thumb cutoffs of 0.5 or 1, so influence is low in severity and the model remains stable for interpretation.
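The three quantities reported by the influence plot can be extracted directly with base R, which is handy for ranking observations without reading them off a dense plot. The sketch below uses simulated data; `toy_fit`, `overs`, and `wkts` are hypothetical names for illustration, not objects from the analysis above.

```r
set.seed(1)                                   # reproducible toy data, not the IPL data
n <- 500
overs <- sample(1:4, n, replace = TRUE)       # stand-in for overs_bowled
wkts  <- rpois(n, lambda = exp(-1 + 0.3 * overs))

toy_fit <- glm(wkts ~ overs, family = poisson)

cook_d   <- cooks.distance(toy_fit)   # overall influence on the fit
leverage <- hatvalues(toy_fit)        # how unusual the predictor values are
stud_res <- rstudent(toy_fit)         # how badly the response is fitted

# Five most influential rows, with the two ingredients side by side
top <- order(cook_d, decreasing = TRUE)[1:5]
data.frame(row = top,
           StudRes = stud_res[top],
           Hat     = leverage[top],
           CookD   = cook_d[top])
```

An observation needs both a sizeable residual and non-trivial leverage to have a large Cook's distance, which is why a row can be flagged on one criterion while remaining harmless overall.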

Plotting high influence points

library(ggrepel)  # provides geom_text_repel()

ggplot(data = slice(IPL, c(133, 948, 1791, 2655))) +
  geom_point(data = IPL,
             mapping = aes(x = overs_bowled, y = wickets),
             alpha = 0.3) +
  geom_point(mapping = aes(x = overs_bowled, y = wickets),
             color = 'darkred', size = 3) +
  geom_text_repel(mapping = aes(x = overs_bowled,
                                y = wickets,
                                label = player),
                  color = 'darkred') +
  labs(title = "Investigating High Influence Points",
       subtitle = "Label = Player Name",
       x = "Overs Bowled",
       y = "Wickets Taken")

The plot highlights four of the five observations flagged by the influence diagnostics (rows 133, 948, 1791, and 2655, shown in red); row 9232 is not included in the slice. These innings stand out because they represent extreme bowling performances, such as a 5-wicket haul in 2 overs, 3 wickets in a single over, or an exceptionally economical spell with two maidens.

Deviance (Null vs Residual)

  • Null deviance: \(15832\)

  • Residual deviance: \(12399\)

  • Residual df: \(12916\)

Reduction:

\(15832 - 12399 = 3433\)

This is a substantial drop: the predictors collectively reduce deviance by about \(3,400\) units, or roughly 22% of the null deviance, using only 4 degrees of freedom. This indicates that overs_bowled, runs_conceded, economy, and maiden together explain a meaningful portion of the variation in wickets.
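The deviance reduction can be turned into a share of the null deviance and a likelihood-ratio test against the intercept-only model. The sketch below recomputes these from the rounded figures printed in the model summary, so results are approximate; the variable names are introduced here for illustration.

```r
# Figures copied from the summary(glm_wkts) output (rounded there)
null_dev  <- 15832
resid_dev <- 12399
df_used   <- 12920 - 12916          # null df minus residual df = 4 predictors

reduction      <- null_dev - resid_dev       # drop in deviance
prop_explained <- reduction / null_dev       # share of null deviance removed

# Likelihood-ratio test: the drop is compared to a chi-squared with df_used df
p_value <- pchisq(reduction, df = df_used, lower.tail = FALSE)

c(reduction = reduction,
  prop_explained = round(prop_explained, 3),
  p_value = p_value)
```

With the fitted model in hand, `anova(glm_wkts, test = "Chisq")` reports the same kind of test term by term.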

AIC

glm_wkts_ovr <- glm(wickets ~ overs_bowled,
                    family = poisson, data = IPL)

glm_wkts_runs <- glm(wickets ~ runs_conceded,
                     family = poisson, data = IPL)

glm_wkts_eco <- glm(wickets ~ economy,
                    family = poisson, data = IPL)

glm_wkts_maid <- glm(wickets ~ maiden,
                     family = poisson, data = IPL)

AIC(glm_wkts_ovr,
    glm_wkts_runs,
    glm_wkts_eco,
    glm_wkts_maid)
##               df      AIC
## glm_wkts_ovr   2 30050.98
## glm_wkts_runs  2 32565.47
## glm_wkts_eco   2 30999.11
## glm_wkts_maid  2 32310.05

From the summary of the main Poisson model \(glm\_wkts\), we have \(AIC = 29154\).

The AIC comparison shows that \(overs\_bowled\) is the strongest single predictor of wickets, with the lowest AIC (\(30050.98\)) among the single-predictor models. The full model \(glm\_wkts\) has the lowest AIC overall (\(29154\)), indicating that combining all predictors provides the best fit among the models compared.
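Differences in AIC are easier to read than the raw values. The sketch below ranks the five models by their gap to the best one, using the AIC values reported above; the vector names `aic` and `delta_aic` are introduced here for illustration.

```r
# AIC values copied from the outputs above
aic <- c(glm_wkts      = 29154,
         glm_wkts_ovr  = 30050.98,
         glm_wkts_runs = 32565.47,
         glm_wkts_eco  = 30999.11,
         glm_wkts_maid = 32310.05)

# Gap to the best (lowest-AIC) model, sorted from best to worst
delta_aic <- sort(aic - min(aic))
delta_aic
```

Even the best single-predictor model, the one using overs bowled, sits roughly 900 AIC units behind the full model, a gap far larger than the few units usually treated as meaningful.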

Coefficient Interpretation

Model is \(\text{wickets} \sim \text{overs_bowled} + \text{runs_conceded} + \text{economy} + \text{maiden}\)

Because this is a Poisson regression with a log link, exponentiating each coefficient gives a multiplicative effect on expected wickets.
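To see why exponentiating works, compare expected wickets before and after a one-unit increase in a single predictor \(x_j\), with the other predictors held fixed. The symbols \(\mu\), \(x_j\), and \(\beta_j\) are notation assumed here for illustration.

\[
\frac{\mu(x_j + 1)}{\mu(x_j)}
= \frac{\exp\left(\beta_0 + \dots + \beta_j (x_j + 1) + \dots\right)}{\exp\left(\beta_0 + \dots + \beta_j x_j + \dots\right)}
= \exp(\beta_j)
\]

A coefficient \(\beta_j\) therefore corresponds to a change of \(100\,(\exp(\beta_j) - 1)\) percent in expected wickets per unit of \(x_j\).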

From the model summary:

1. overs_bowled (Estimate = \(0.3096\))

exp(0.3096)
## [1] 1.36288

Each additional over bowled increases expected wickets by about 36%, holding all other variables constant.

2. runs_conceded (Estimate = \(0.0257\))

exp(0.0257)
## [1] 1.026033

Each additional run conceded increases expected wickets by about 2.6%, holding all other variables constant.

This is unexpected, since bowlers aim to concede fewer runs while taking more wickets. One cricketing reading is that bowlers who concede more runs are also attacking more and creating more wicket-taking chances. The effect is small but statistically strong. Because economy is itself runs conceded per over, this coefficient is estimated alongside two closely related predictors and should be interpreted with caution.

3. economy (Estimate = \(-0.1972\))

exp(-0.1972)
## [1] 0.8210264

The negative coefficient indicates that each 1-unit increase in economy rate is associated with an 18% decrease in expected wickets, holding other variables constant. Higher economy rates (more expensive bowling) are associated with fewer wickets, so efficient bowlers (low economy) tend to take more wickets.

4. maiden (Estimate = \(0.2636\))

exp(0.2636)
## [1] 1.301607

Each additional maiden over is associated with about 30% more expected wickets, holding other variables constant. Maiden overs build pressure, forcing batters into riskier shots, which is consistent with the positive effect.

These results largely align with cricketing logic while quantifying the association between each bowling metric and wickets. The large positive effect of maiden overs underscores the importance of control and pressure, and the negative effect of economy rate highlights the challenges faced by expensive bowlers.
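The four interpretations above can be collected in one table, together with approximate 95% Wald intervals built from the reported standard errors. This is a sketch using the rounded estimates printed in the summary; the object names `est`, `se`, and `effects` are introduced here for illustration.

```r
# Estimates and standard errors copied from the summary(glm_wkts) output
est <- c(overs_bowled = 0.309604, runs_conceded = 0.025735,
         economy = -0.197196, maiden = 0.263607)
se  <- c(overs_bowled = 0.029499, runs_conceded = 0.003699,
         economy = 0.013906, maiden = 0.036524)

z <- qnorm(0.975)   # about 1.96, for a 95% Wald interval

effects <- data.frame(
  rate_ratio = exp(est),             # multiplicative effect per unit
  lower_95   = exp(est - z * se),    # interval built on the log scale,
  upper_95   = exp(est + z * se),    #   then exponentiated
  pct_change = 100 * (exp(est) - 1)  # same effect as a percentage
)
round(effects, 3)
```

None of the intervals includes a rate ratio of 1, which matches the small p-values in the summary. With the fitted model available, `exp(confint.default(glm_wkts))` should give comparable Wald intervals.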

Further question: Do \(opposition\_team\) and \(venue\) affect the wicket-taking ability of bowlers?