This Data Dive explores the IPL Player Performance Dataset by building a generalized linear model. It covers:
Selecting a binary variable to serve as the response for a generalized linear model
Constructing a Logistic Regression Model using 1–4 explanatory variables
Interpreting Coefficients
Building a Confidence Interval for one coefficient using its standard error
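library(tidyverse)  # provides read_csv(), mutate(), filter(); attaches lubridate (for year()) in tidyverse >= 2.0.0
ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")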
## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): player, team, match_outcome, opposition_team, venue
## dbl (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note on data preparation: The dataset includes only 5 matches from 2025. Because that season is incomplete and would distort the calculations, I filtered out all 2025 rows and used only complete seasons for the rest of the analysis.
IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025)
Binary column: \(match\_outcome\), with values:
win
loss
This column records whether the player’s team won the match. From it, I create a binary variable \(won\_match\).
IPL <- IPL |>
  mutate(
    won_match = if_else(match_outcome == "win", 1, 0)
  )
How do individual player performance metrics influence the probability that their team wins an IPL match?
To build the logistic regression model, I selected explanatory variables that capture key aspects of player performance and are plausible predictors of the probability of winning an IPL match.
\(runs\): Measures the player’s batting contribution.
\(wickets\): Captures bowling impact through dismissing opposition batters.
\(catches\): Represents fielding effectiveness through contributions to batters’ dismissals.
\(economy\): Reflects bowling efficiency as the number of runs a bowler concedes per over.
IPL_GLM_model <- IPL |>
  select(
    won_match,
    runs,
    wickets,
    catches,
    economy
  )
This step fits the logistic regression model using the selected predictors to estimate the probability that a player’s team wins an IPL match.
The logistic regression model uses:
Response variable: \(won\_match\) (1 = win, 0 = loss)
Predictors: \(runs\), \(wickets\), \(catches\), \(economy\)
The model estimates how each performance metric influences the log‑odds of winning.
\(\text{logit}(\Pr(\text{won_match} = 1)) = \beta_0 + \beta_1(\text{runs}) + \beta_2(\text{wickets}) + \beta_3(\text{catches}) + \beta_4(\text{economy})\)
Where:
The logit function transforms probabilities into log‑odds
Each coefficient \(\beta_i\) represents the change in the log‑odds of winning for a one‑unit increase in that predictor, holding the other predictors constant
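To make the logit transformation concrete, here is a minimal base R sketch that converts a few probabilities to log‑odds and back. The probabilities are illustrative values chosen for this example, not figures from the IPL data.

```r
# Illustrative probabilities (assumed values, not taken from the dataset)
p <- c(0.25, 0.50, 0.75)

# logit: log of the odds p / (1 - p); qlogis() is the base R equivalent
log_odds <- log(p / (1 - p))

# inverse logit: maps log-odds back to a probability; plogis() is the base R equivalent
p_back <- 1 / (1 + exp(-log_odds))

# Side-by-side view of the hand-written formulas and the built-in functions
data.frame(p, log_odds, qlogis_p = qlogis(p), p_back, plogis_x = plogis(log_odds))
```

A probability of 0.5 corresponds to log‑odds of 0, so positive log‑odds mean a win is more likely than a loss and negative log‑odds mean the opposite.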
logit_model <- glm(won_match ~ runs + wickets + catches + economy,
                   data = IPL_GLM_model,
                   family = binomial(link = "logit"))
summary(logit_model)
##
## Call:
## glm(formula = won_match ~ runs + wickets + catches + economy,
## family = binomial(link = "logit"), data = IPL_GLM_model)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3091690  0.0255914  -12.08  < 2e-16 ***
## runs         0.0049669  0.0007126    6.97 3.16e-12 ***
## wickets      0.4427470  0.0177700   24.91  < 2e-16 ***
## catches      0.4096976  0.0228836   17.90  < 2e-16 ***
## economy     -0.0377891  0.0032921  -11.48  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 33149 on 23924 degrees of freedom
## Residual deviance: 32102 on 23920 degrees of freedom
## AIC: 32112
##
## Number of Fisher Scoring iterations: 4
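The z value and Pr(>|z|) columns in the summary above are Wald statistics: each estimate divided by its standard error, compared against a standard normal distribution. The sketch below reproduces them from the printed estimates and standard errors. The numbers are copied by hand from the output, so small rounding differences are expected.

```r
# Estimates and standard errors copied from the summary() output above
est <- c(intercept = -0.3091690, runs = 0.0049669, wickets = 0.4427470,
         catches = 0.4096976, economy = -0.0377891)
se  <- c(intercept = 0.0255914, runs = 0.0007126, wickets = 0.0177700,
         catches = 0.0228836, economy = 0.0032921)

# Wald z statistic: estimate divided by its standard error
z <- est / se

# Two-sided p-value under the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

# Compare with the "z value" and "Pr(>|z|)" columns
print(cbind(z = round(z, 2), p_value = signif(p_value, 3)))
```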
The logistic regression model shows that runs, wickets, and catches all increase the probability of winning, while a higher economy rate decreases it, indicating that batting, bowling, and fielding contributions each play a meaningful role in match outcomes.
\(runs\): Each additional run slightly increases the probability of winning. The effect is small but positive.
\(wickets\): This is one of the strongest positive coefficients. Taking wickets significantly increases the team’s chances of winning.
\(catches\): A positive coefficient shows that fielding contributions matter, in line with the saying “Catches win matches”.
\(economy\): The negative coefficient indicates that conceding more runs per over reduces the probability of winning. Lower economy (more efficient bowling) is associated with better match outcomes.
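To see how the coefficients combine into a single prediction, the sketch below computes the fitted win probability for one hypothetical player performance. The performance values (30 runs, 2 wickets, 1 catch, economy of 7) are assumptions made up for illustration, and the coefficients are copied by hand from the summary output.

```r
# Coefficients copied from the summary() output
beta <- c(intercept = -0.3091690, runs = 0.0049669, wickets = 0.4427470,
          catches = 0.4096976, economy = -0.0377891)

# Hypothetical player-match line (illustrative values, not a row from the dataset)
x <- c(intercept = 1, runs = 30, wickets = 2, catches = 1, economy = 7)

# Linear predictor: the log-odds of winning for this performance
eta <- sum(beta * x)

# Inverse logit converts the log-odds into a probability
p_win <- plogis(eta)

c(log_odds = eta, probability = p_win)
```

With the fitted model object available, `predict(logit_model, newdata = ..., type = "response")` should return the same kind of probability without copying coefficients by hand.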
Logistic regression coefficients are expressed in log‑odds. Converting them into odds ratios makes the results easier to interpret.
\(\Large \text{Odds Ratio} = e^{\beta}\)
exp(0.0049669)
## [1] 1.004979
exp(0.4427470)
## [1] 1.556978
exp(0.4096976)
## [1] 1.506362
exp(-0.0377891)
## [1] 0.962916
Runs
Estimate: \(\beta = 0.0049669\)
Odds ratio: \(e^{0.0049669} \approx 1.00498\)
Each additional run increases the odds of winning by about 0.5%.
Wickets
Estimate: \(\beta = 0.4427470\)
Odds ratio: \(e^{0.4427470} \approx 1.557\)
Each wicket increases the odds of winning by about 56%.
Catches
Estimate: \(\beta = 0.4096976\)
Odds ratio: \(e^{0.4096976} \approx 1.506\)
Each catch increases the odds of winning by about 51%.
Economy
Estimate: \(\beta = -0.0377891\)
Odds ratio: \(e^{-0.0377891} \approx 0.963\)
Each extra run conceded per over reduces the odds of winning by about 3.7%.
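Per one-unit increase, \(wickets\) and \(catches\) have the largest positive effects on the odds of winning, \(runs\) contributes modestly per run, and a higher economy rate reduces the odds of winning the match. Because the predictors are measured on different scales (a single run is a much smaller unit than a wicket or a catch), these per-unit effects are not directly comparable in size.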
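Since an odds ratio describes a one-unit change, it can help to rescale it to a change that is more natural for the variable. The sketch below uses the identity \(e^{k\beta}\) for a \(k\)-unit increase. The choice of 10 and 30 runs is an illustrative assumption, and the coefficients are copied by hand from the summary output.

```r
# Coefficients copied from the summary() output
beta <- c(runs = 0.0049669, wickets = 0.4427470,
          catches = 0.4096976, economy = -0.0377891)

# Odds ratio for a one-unit increase in each predictor
odds_ratio <- exp(beta)

# Same information expressed as a percentage change in the odds
pct_change <- (odds_ratio - 1) * 100

# Odds ratio for a k-unit increase is exp(k * beta); k = 10 and k = 30 are illustrative
or_10_runs <- exp(10 * beta["runs"])
or_30_runs <- exp(30 * beta["runs"])

round(cbind(odds_ratio, pct_change), 3)
round(c(ten_runs = unname(or_10_runs), thirty_runs = unname(or_30_runs)), 3)
```

Under this model, an innings of 10 extra runs corresponds to roughly a 5% increase in the odds of winning, which puts the small per-run coefficient in perspective.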
To construct a 95% confidence interval for one coefficient, I used the estimate and standard error from the logistic regression output. I selected \(wickets\) because it has the largest coefficient in the model.
Estimate: \(\hat{\beta}_{\text{wickets}} = 0.4427470\)
Standard Error: \(SE(\hat{\beta}_{\text{wickets}}) = 0.0177700\)
The 95% confidence interval on the log-odds scale is:
\(\Large \hat{\beta}_{\text{wickets}} \pm 1.96 \times SE(\hat{\beta}_{\text{wickets}})\)
coef_wickets <- coef(logit_model)["wickets"]
se_wickets <- summary(logit_model)$coefficients["wickets", "Std. Error"]
ci_lower_wickets <- coef_wickets - 1.96 * se_wickets
ci_upper_wickets <- coef_wickets + 1.96 * se_wickets
c(ci_lower_wickets, ci_upper_wickets)
## wickets wickets
## 0.4079178 0.4775762
Both endpoints are positive, so the interval excludes zero. Even at the lower end of the plausible range, taking wickets still increases the log‑odds of winning, which is strong statistical evidence of a positive association between wickets and match outcomes.
exp(0.4079178)
## [1] 1.503684
exp(0.4775762)
## [1] 1.612162
On the odds-ratio scale, the interval runs from about 1.50 to 1.61, so each additional wicket increases the odds of winning by between roughly 50% and 61%. Because the entire interval is above 1, the effect is both statistically significant and practically meaningful.
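The manual estimate ± critical value × SE calculation is a Wald interval, which base R also provides through `confint.default()`. The sketch below checks that the two agree, using simulated data so that it runs on its own. The data, the sample size, and the true coefficients are all assumptions for illustration and are unrelated to the IPL results.

```r
# Simulated stand-in data (assumption: NOT the IPL dataset)
set.seed(42)
n <- 2000
wickets <- rpois(n, lambda = 1)
won_match <- rbinom(n, size = 1, prob = plogis(-0.3 + 0.4 * wickets))

fit <- glm(won_match ~ wickets, family = binomial(link = "logit"))

# Manual Wald interval, using the exact normal quantile instead of the rounded 1.96
est <- coef(fit)["wickets"]
se <- summary(fit)$coefficients["wickets", "Std. Error"]
z_crit <- qnorm(0.975)
manual_ci <- c(est - z_crit * se, est + z_crit * se)

# Built-in Wald interval on the log-odds scale
wald_ci <- confint.default(fit, parm = "wickets", level = 0.95)

# Exponentiate to move the interval to the odds-ratio scale
print(manual_ci)
print(wald_ci)
print(exp(wald_ci))
```

Note that calling `confint()` on a `glm` object uses profile likelihood rather than the Wald formula, so its endpoints can differ slightly from the manual calculation.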
Further question: Would the model change if additional predictors were included?