Introduction

This Data Dive explores the IPL Player Performance Dataset by building a generalized linear model. It covers:

  • Selecting a binary variable to serve as the response for a generalized linear model

  • Constructing a Logistic Regression Model using 1–4 explanatory variables

  • Interpreting Coefficients

  • Building a confidence interval for one coefficient using its standard error

library(tidyverse)  # attaches readr, dplyr, and lubridate (tidyverse >= 2.0.0)
ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")
## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): player, team, match_outcome, opposition_team, venue
## dbl  (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Preparation: The dataset includes only 5 matches from 2025. Because that season is incomplete, it would distort the calculations, so all rows from 2025 are filtered out and the rest of the analysis uses complete seasons only.

IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025)

Binary Variable Selection

Binary column: \(match\_outcome\), with values:

  • win

  • loss

This column records whether the player’s team won the match. From it, I create a binary variable \(won\_match\), coded 1 for a win and 0 otherwise:

IPL <- IPL |>
  mutate(
    won_match = if_else(match_outcome == "win", 1, 0)
  )
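The recode above can be sanity-checked with a quick cross-tabulation. The sketch below uses a small made-up data frame (the "no result" value and the missing value are hypothetical, added only to show how `if_else()` treats anything other than "win"); it is not taken from the IPL data.

```r
library(dplyr)

# Hypothetical toy data standing in for the IPL table (not real matches)
toy <- data.frame(match_outcome = c("win", "loss", "win", "no result", NA))

# Same recode as in the report: 1 for "win", 0 for any other non-missing value
toy <- toy |>
  mutate(won_match = if_else(match_outcome == "win", 1, 0))

# Cross-tabulate original and recoded values; missing outcomes stay missing
toy_counts <- count(toy, match_outcome, won_match)
toy_counts
```

If a real dataset contained outcomes other than "win" and "loss", this kind of check would show them being counted as losses, which may or may not be what is intended.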

How do individual player performance metrics influence the probability that their team wins an IPL match?

Explanatory Variables

To build the logistic regression model, I selected explanatory variables that capture key aspects of player performance and are plausible predictors of the probability of winning an IPL match.

  1. \(runs\): Measures the player’s batting contribution.

  2. \(wickets\): Captures bowling impact through dismissing opposition batters.

  3. \(catches\): Represents fielding effectiveness through contributions to batters’ dismissals.

  4. \(economy\): Reflects bowling efficiency by indicating how many runs a bowler concedes per over.

IPL_GLM_model<- IPL |>
  select(
    won_match,
    runs,
    wickets,
    catches,
    economy
  )

Logistic Regression Model

This step fits the logistic regression model using the selected predictors to estimate the probability that a player’s team wins an IPL match.

The logistic regression model uses:

  • Response variable: \(won\_match\) (1 = win, 0 = loss)

  • Predictors: \(runs\), \(wickets\), \(catches\), \(economy\)

The model estimates how each performance metric influences the log‑odds of winning.

\(\text{logit}(\Pr(\text{won_match} = 1)) = \beta_0 + \beta_1(\text{runs}) + \beta_2(\text{wickets}) + \beta_3(\text{catches}) + \beta_4(\text{economy})\)

Where:

  • The logit function transforms probabilities into log‑odds

  • Each coefficient \(\beta_i\) represents the change in the log‑odds of winning for a one‑unit increase in that predictor, holding the other predictors constant
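As a small illustration of the logit transformation described above, the sketch below converts a few arbitrary example probabilities to log‑odds and back using base R’s `qlogis()` and `plogis()`. The probabilities are chosen for illustration only.

```r
# Arbitrary example probabilities (not estimates from the model)
p <- c(0.25, 0.50, 0.75)

# Logit: probability -> log-odds. qlogis(p) is log(p / (1 - p)).
log_odds <- qlogis(p)
log_odds_manual <- log(p / (1 - p))

# Inverse logit: log-odds -> probability. plogis(x) is 1 / (1 + exp(-x)).
p_back <- plogis(log_odds)

all.equal(log_odds, log_odds_manual)
all.equal(p_back, p)
```

A probability of 0.5 corresponds to log‑odds of 0, so positive log‑odds mean a win is more likely than a loss.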

logit_model <- glm(
  won_match ~ runs + wickets + catches + economy,
  data = IPL_GLM_model,
  family = binomial(link = "logit")
)

summary(logit_model)
## 
## Call:
## glm(formula = won_match ~ runs + wickets + catches + economy, 
##     family = binomial(link = "logit"), data = IPL_GLM_model)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.3091690  0.0255914  -12.08  < 2e-16 ***
## runs         0.0049669  0.0007126    6.97 3.16e-12 ***
## wickets      0.4427470  0.0177700   24.91  < 2e-16 ***
## catches      0.4096976  0.0228836   17.90  < 2e-16 ***
## economy     -0.0377891  0.0032921  -11.48  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 33149  on 23924  degrees of freedom
## Residual deviance: 32102  on 23920  degrees of freedom
## AIC: 32112
## 
## Number of Fisher Scoring iterations: 4

The fitted model shows that runs, wickets, and catches are each associated with a higher probability of winning, while a higher economy rate is associated with a lower one. All four coefficients are statistically significant (p < 0.001), indicating that batting, bowling, and fielding contributions each play a meaningful role in match outcomes.

  • \(runs\): Each additional run slightly increases the probability of winning. The effect is small but positive.

  • \(wickets\): This is the largest positive coefficient in the model. Taking wickets substantially increases the team’s chances of winning.

  • \(catches\): A positive coefficient shows that fielding contributions matter: “Catches Win Matches.”

  • \(economy\): The negative coefficient indicates that conceding more runs per over reduces the probability of winning. Lower economy (more efficient bowling) is associated with better match outcomes.
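To see what these coefficients imply on the probability scale, the following sketch plugs the reported estimates into the model equation by hand. The two stat lines are hypothetical and chosen only for illustration, and setting economy to 0 for a player who does not bowl is an assumption about how the dataset encodes that case.

```r
# Coefficient estimates copied from the summary() output above
beta <- c(intercept = -0.3091690, runs = 0.0049669, wickets = 0.4427470,
          catches = 0.4096976, economy = -0.0377891)

# Hypothetical stat lines (illustrative only, not rows from the dataset).
# The leading 1 multiplies the intercept.
players <- rbind(
  batter = c(1, runs = 50, wickets = 0, catches = 1, economy = 0),
  bowler = c(1, runs = 5,  wickets = 3, catches = 0, economy = 7)
)

# Linear predictor (log-odds), then inverse logit to get probabilities
log_odds <- drop(players %*% beta)
prob_win <- plogis(log_odds)
round(prob_win, 3)
```

With a fitted model object, `predict(logit_model, newdata = ..., type = "response")` performs the same calculation.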

Interpreting Logistic Regression Coefficients

Logistic regression coefficients are expressed in log‑odds. Converting them into odds ratios makes the results easier to understand.

\(\Large \text{Odds Ratio} = e^{\beta}\)

exp(0.0049669)
## [1] 1.004979
exp(0.4427470)
## [1] 1.556978
exp(0.4096976)
## [1] 1.506362
exp(-0.0377891)
## [1] 0.962916
  • Runs

    • Estimate: \(\beta = 0.0049669\)

    • Odds ratio: \(e^{0.0049669} \approx 1.00498\)

    • Each additional run increases the odds of winning by about 0.5%.

  • Wickets

    • Estimate: \(\beta = 0.4427470\)

    • Odds ratio: \(e^{0.4427470} \approx 1.557\)

    • Each wicket increases the odds of winning by about 56%.

  • Catches

    • Estimate: \(\beta = 0.4096976\)

    • Odds ratio: \(e^{0.4096976} \approx 1.506\)

    • Each catch increases the odds of winning by about 51%.

  • Economy

    • Estimate: \(\beta = -0.0377891\)

    • Odds ratio: \(e^{-0.0377891} \approx 0.963\)

    • Each extra run conceded per over reduces the odds of winning by about 3.7%.
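The four conversions above can also be done in one step. This sketch copies the reported estimates into a named vector (the vector name `beta` is just a convenience here) and derives the odds ratios and percentage changes together; the last line is an added illustration of how the effect scales over several units.

```r
# Estimates copied from the summary() output above (intercept omitted)
beta <- c(runs = 0.0049669, wickets = 0.4427470,
          catches = 0.4096976, economy = -0.0377891)

odds_ratio <- exp(beta)                # multiplicative change in odds per unit
pct_change <- (odds_ratio - 1) * 100   # the same effect as a percentage

round(cbind(odds_ratio, pct_change), 3)

# Effects compound multiplicatively: a 10-run increase multiplies the odds
# by exp(10 * beta), which is not the same as 10 times the per-run change
exp(10 * beta["runs"])
```

With the fitted model available, `exp(coef(logit_model))` gives the same odds ratios without retyping the estimates.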

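Per unit, \(wickets\) and \(catches\) have the largest positive effects on the odds of winning, \(runs\) contributes modestly per run, and higher economy rates reduce the odds. Because the predictors are measured in different units, a one‑unit change does not represent the same amount of contribution for each variable.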

Confidence Interval

To construct a 95% confidence interval for one coefficient, I used the estimate and standard error from the logistic regression output. I selected \(wickets\) because it has the largest coefficient in the model.

  • Estimate: \(\hat{\beta}_{\text{wickets}} = 0.4427470\)

  • Standard Error: \(SE(\hat{\beta}_{\text{wickets}}) = 0.0177700\)

The 95% confidence interval on the log-odds scale is:

\(\Large \hat{\beta}_{\text{wickets}} \pm 1.96 \times SE(\hat{\beta}_{\text{wickets}})\)

coef_wickets <- coef(logit_model)["wickets"]
se_wickets   <- summary(logit_model)$coefficients["wickets", "Std. Error"]

ci_lower_wickets <- coef_wickets - 1.96 * se_wickets
ci_upper_wickets <- coef_wickets + 1.96 * se_wickets

c(ci_lower_wickets, ci_upper_wickets)
##   wickets   wickets 
## 0.4079178 0.4775762

Both bounds are positive, so the interval does not include zero. Even at the lowest plausible value, each additional wicket still raises the log‑odds of winning, which is strong statistical evidence that wickets have a positive effect on match outcomes.

exp(0.4079178)
## [1] 1.503684
exp(0.4775762)
## [1] 1.612162

On the odds‑ratio scale, each additional wicket increases the odds of winning by between roughly 50% and 61%. Because the entire interval is above 1, the effect is both statistically significant at the 5% level and practically meaningful.
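The same estimate ± z × SE construction can be applied to every coefficient at once. The sketch below is a self-contained illustration on simulated data, since the IPL file is not available here: the data frame `sim`, the sample size, and the "true" coefficients are arbitrary assumptions. It checks that the manual calculation agrees with `confint.default()`, which computes Wald intervals.

```r
set.seed(42)

# Simulated stand-in data; the coefficients used to generate it are arbitrary
n <- 2000
sim <- data.frame(wickets = rpois(n, 1), runs = rpois(n, 15))
sim$won_match <- rbinom(n, 1, plogis(-0.3 + 0.4 * sim$wickets + 0.005 * sim$runs))

fit <- glm(won_match ~ wickets + runs, data = sim, family = binomial)

# Manual Wald interval: estimate +/- z * SE, with z = qnorm(0.975) (about 1.96)
est <- coef(fit)
se  <- sqrt(diag(vcov(fit)))
z   <- qnorm(0.975)
wald_manual <- cbind(est - z * se, est + z * se)

# confint.default() computes the same Wald intervals for all coefficients
wald_builtin <- confint.default(fit, level = 0.95)

# Exponentiate to report the intervals as odds ratios
exp(wald_builtin)
```

Note that calling `confint()` on a `glm` object uses profile likelihood rather than the Wald formula, so its intervals can differ slightly from the ones computed by hand.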

Further question: Would the model change if additional predictors were included?