This Data Dive explores the IPL Player Performance Dataset by interpreting regression models and applying regression diagnostics.
Extending Last Week’s Regression Model
Building on the simple linear regression of runs ~ boundaries
by adding new predictors.
Testing additional Variable Types
Exploring continuous, binary, or interaction terms and identifying
whether each should be included.
Multicollinearity
Evaluating correlations and VIF values to determine whether added
predictors introduce redundancy.
Fitting Nested Regression Models
Compare models with increasing complexity to understand how each
variable changes model performance.
Evaluating Model Diagnostics
Using five diagnostic plots to assess linearity, variance, normality,
and influential observations.
ipl_raw<-read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")
## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): player, team, match_outcome, opposition_team, venue
## dbl (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note: Data Preparation: The data set includes only 5 matches of year 2025 that is not complete and this would distort all calculations, so to avoid this, filtered out all rows from 2025 and used a clean dataset for further analysis including complete seasons only.
IPL <- ipl_raw |>
mutate(
date = as.Date(date),
season = year(date)
) |>
filter(season < 2025)
# create boundaries and match outcome as binary
IPL <- IPL |>
mutate(
boundaries = fours + sixes,
match_outcome_binary = if_else(match_outcome == "won", 1, 0)
)
This week’s analysis extends the simple linear regression model developed in Week 8, where runs were modeled as a function of boundaries. In this datadive , this model is extended by adding additional predictors, testing an interaction term, incorporating a binary variable.
boundaries (continuous)
boundaries remain the core predictor from Week 8 and form the foundation for all nested models.
Model 1 — Baseline Model (Week 8)
\(\text{runs} \sim \text{boundaries}\)
Simple linear regression using boundaries as the sole predictor
M1 <- lm(runs ~ boundaries, data = IPL)
summary(M1)
##
## Call:
## lm(formula = runs ~ boundaries, data = IPL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.916 -1.833 -1.833 1.482 35.167
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.83324 0.03988 45.97 <2e-16 ***
## boundaries 6.67126 0.01171 569.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.244 on 23923 degrees of freedom
## Multiple R-squared: 0.9313, Adjusted R-squared: 0.9313
## F-statistic: 3.244e+05 on 1 and 23923 DF, p-value: < 2.2e-16Interpretation
Boundaries are a very strong positive predictor of runs \(\hat{\beta}_1 = 6.67,\quad p < 2\times 10^{-16}\)
For each additional boundary, expected runs increase by about \(6.67\) runs.
The intercept is \(\hat{\beta}_0 = 1.83\) representing expected runs when boundaries equal to \(0\).
The model explains \(\text{Adjusted } R^2 = 0.9313\) or 93.1% of the variation in runs.
Residuals range from -38.9 to 35.1, showing room for improvement.
balls_faced (continuous)
Balls faced is added as a second continuous predictor to capture innings duration and scoring opportunity.
Model 2 — Add balls_faced (Main Effects Model)
\(\text{runs} \sim \text{boundaries} + \text{balls_faced}\)
Adding Balls faced as a second continuous predictor
M2 <- lm(runs ~ boundaries + balls_faced, data = IPL)
summary(M2)
##
## Call:
## lm(formula = runs ~ boundaries + balls_faced, data = IPL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.5891 -1.0004 0.4183 0.5256 26.3616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.418308 0.024570 -17.02 <2e-16 ***
## boundaries 4.089826 0.013161 310.76 <2e-16 ***
## balls_faced 0.630900 0.002783 226.72 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.955 on 23922 degrees of freedom
## Multiple R-squared: 0.9782, Adjusted R-squared: 0.9782
## F-statistic: 5.365e+05 on 2 and 23922 DF, p-value: < 2.2e-16Interpretation
Adding balls_faced dramatically improves model fit: \(0.9313 \rightarrow 0.9782\)
Both predictors are highly significant \(p < 2\times 10^{-16}\)
boundaries coefficient drops from \(6.67 \rightarrow 4.09\) because \(balls\_faced\) now absorbs part of the scoring explanation.
\(balls\_faced\ coefficient:\) \(\hat{\beta}_2 = 0.63\) meaning each additional ball adds ~0.63 runs on average.
This model captures both scoring volume \(boundaries\) and scoring opportunity \(balls faced\)
Residuals shrink from \(\pm 38\) to \(\pm 18\), showing much higher accuracy.
This is a major improvement over Model 1.
Interaction Term: boundaries × balls_faced
This is the primary new variable and tests whether the effect of boundaries depends on balls faced
Model 3 — Add Interaction Term
\(\text{runs} \sim \text{boundaries} * \text{balls_faced}\)
Tests whether the effect of boundaries depends on balls faced.
M3 <- lm(runs ~ boundaries * balls_faced, data = IPL)
summary(M3)
##
## Call:
## lm(formula = runs ~ boundaries * balls_faced, data = IPL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4510 -0.9628 0.2826 0.6079 25.5441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2826013 0.0263345 -10.73 <2e-16 ***
## boundaries 3.9036654 0.0187064 208.68 <2e-16 ***
## balls_faced 0.6227037 0.0028331 219.79 <2e-16 ***
## boundaries:balls_faced 0.0053404 0.0003829 13.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.943 on 23921 degrees of freedom
## Multiple R-squared: 0.9784, Adjusted R-squared: 0.9784
## F-statistic: 3.606e+05 on 3 and 23921 DF, p-value: < 2.2e-16Interpretation
Interaction term is significant: \(\hat{\beta}_3 = 0.00534,\quad p < 2\times 10^{-16}\)
Model fit improves slightly: \(\text{Adjusted } R^2 = 0.9784\)
Residuals shrink further and become more symmetric.
The marginal effect of boundaries becomes:\(\frac{\partial \text{runs}}{\partial \text{boundaries}} = \beta_1 + \beta_3 \cdot \text{balls_faced}\)
Players who face more balls get a slightly higher marginal benefit from boundaries.
A binary variable indicating whether the player’s team won (1) or lost (0).
Model 4 — Add Binary Variable (match_outcome)
\(\text{runs} \sim \text{boundaries} * \text{balls_faced} + \text{match_outcome_binary}\)
Adding a binary variable indicating whether the player’s team won.
M4 <- lm(runs ~ boundaries * balls_faced + match_outcome_binary, data = IPL)
summary(M4)
##
## Call:
## lm(formula = runs ~ boundaries * balls_faced + match_outcome_binary,
## data = IPL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.4510 -0.9628 0.2826 0.6079 25.5441
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2826013 0.0263345 -10.73 <2e-16 ***
## boundaries 3.9036654 0.0187064 208.68 <2e-16 ***
## balls_faced 0.6227037 0.0028331 219.79 <2e-16 ***
## match_outcome_binary NA NA NA NA
## boundaries:balls_faced 0.0053404 0.0003829 13.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.943 on 23921 degrees of freedom
## Multiple R-squared: 0.9784, Adjusted R-squared: 0.9784
## F-statistic: 3.606e+05 on 3 and 23921 DF, p-value: < 2.2e-16
Interpretation
\(match\_outcome\_binary\) is dropped due to singularity (NA coefficients).
Model fit remains identical to Model 3: \(\text{Adjusted } R^2 = 0.9784\) , no improvement in residuals.
Once boundaries and balls_faced are known, match outcome adds no additional predictive power.
This is expected: match outcome is a team‑level variable, not a batting‑mechanics variable
Model 4 collapses back to Model 3.
Multicollinearity occurs when predictors in a regression model are highly correlated with each other. This can inflate standard errors, destabilize coefficient estimates, and make interpretation unreliable. Since there are multiple continuous predictors and an interaction term, it is important to evaluate multicollinearity before selecting the final model.
# VIF for Model 2
vif(M2)
## boundaries balls_faced
## 3.975612 3.975612
# VIF for Model 3
vif(M3)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## boundaries balls_faced boundaries:balls_faced
## 8.096852 4.154305 6.784790
Variance Inflation Factors (VIF) are computed for Models 2 and 3, as these models contain multiple predictors.
\(\Large \text{VIF}\approx 3.98<5\)
indicating no harmful multicollinearity.
\(\Large \text{VIF}<10\), with
\(\Large \text{VIF}_{boundaries}=8.10,\quad \text{VIF}_{balls\_faced}=4.15,\quad \text{VIF}_{interaction}=6.78.\)
These values indicate that the interaction model does not suffer from severe multicollinearity. Overall, the predictors in the final model \(Model 3\) are stable, and multicollinearity is not a concern.
ggcorr(
select(IPL,
runs,
boundaries,
balls_faced),
label = TRUE
) +
labs(title = "Correlation Heatmap for Model Predictors")
The predictors $boundaries $ and \(balls\_faced\) show a very strong positive correlation \(≈ 0.9\), which is expected because players who face more balls tend to hit more boundaries.
runs is perfectly correlated with \(boundaries\) \(1.0\) and strongly correlated with \(balls\_faced\) \(≈0.9\) which makes sense because runs are directly produced from these actions
the predictors represent different cricket actions and the model includes an interaction term that captures their combined effect so not problematic for the model
To identify the most appropriate regression model for explaining variation in runs using batting‑level predictors.
Based on statistical significance, \(R^2\), interpretability, multicollinearity diagnostics, and cricket‑specific reasoning Model 3 is the best Model
The interaction term provides meaningful cricket insight:
\(\Large \frac{\partial \text{runs}}{\partial \text{boundaries}}= \beta_1 + \beta_3 \cdot \text{balls_faced}\)
showing that boundaries become slightly more valuable as a batter faces more balls.
Model 3 achieves the highest adjusted \(R^2\) without overfitting.
VIF values for Model 3 are below 10, confirming no severe multicollinearity.
Conclusion
Model 3 is selected as the final model because it provides the best balance of statistical performance, interpretability, and cricket‑specific reasoning. It captures both scoring volume (boundaries), scoring opportunity (balls faced), and the interaction between them, without introducing multicollinearity or redundant predictors.
Evaluating the final model using the five key diagnostic plots :
Residuals vs Fitted
Residuals vs Each Predictor
Residual Histogram
Normal QQ Plot
Cook’s Distance
gg_resfitted(M3) +
geom_smooth(se = FALSE)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the lindia package.
## Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Interpretation
The residuals are centered around the horizontal zero line, indicating that the model does not systematically over‑ or under‑predict.
The smooth blue line is nearly flat, suggesting that the relationship between predictors and runs is reasonably linear.
The spread of residuals is mostly consistent across fitted values, with only slight widening at higher fitted values.
No U‑shape, wave pattern, or curvature is visible, meaning the model is not missing a nonlinear term.
Significance
Linearity: The flat smooth line indicates high confidence that the linearity assumption is met.
Homoscedasticity: There is mild heteroscedasticity at high fitted values, but it is low severity and unlikely to affect inference.
Model adequacy: No structural pattern suggests the model form is appropriate.
Confidence level: High
confidence that linearity is satisfied.
Moderate‑to‑high confidence that variance is
acceptable.
plots <- gg_resX(M3, plot.all = FALSE)
## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the lindia package.
## Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plots$boundaries +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Residuals are centered around zero with a nearly flat smooth line and no visible curvature.
This indicates that the relationship between \(boundaries\) and \(runs\) is modeled appropriately.
Variance is mostly consistent, any heteroscedasticity is mild and low‑severity.
High confidence that the model form is appropriate for this predictor
plots$balls_faced +
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
Residuals remain centered around zero across the full range of \(balls\_faced\), and the smooth line stays nearly flat, indicating no meaningful curvature or nonlinear pattern.
The spread of residuals is consistent, with no strong funnel shape, any widening at high \(balls\_faced\) values is mild and expected due to fewer observations.
No extreme outliers appear, suggesting the model is not missing a key term related to \(balls\_faced\)
gg_reshist(M3)
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
The histogram is centered around zero, with most residuals clustered tightly near the middle, indicating that the model does not systematically over‑ or under‑predict.
The shape is roughly symmetric, with only mild deviations from perfect normality — no heavy skew or extreme tail behavior.
The distribution shows a single clear peak and smooth tapering on both sides, which is typical for a well‑fitted regression model.
Any imperfections in normality appear low‑severity and are unlikely to meaningfully affect inference (p‑values, confidence intervals).
gg_qqplot(M3)
Most points fall close to the reference line in the middle of the plot, indicating that the central portion of the residual distribution aligns well with normality.
The deviations at both tails are noticeable , the upper tail bends above the line and the lower tail dips below it indicating mild non‑normality in the extremes.
These tail deviations are common in real cricket performance data, where extreme innings (very high or very low runs) are less predictable.
The histogram shows residuals centered around zero with a roughly symmetric shape, indicating that the bulk of the residuals behave close to normally.
The QQ plot confirms this in the middle region, where points closely follow the reference line, but it also highlights mild deviations in the tails.
Taken together, the two plots suggest approximate normality: the central distribution is well‑behaved, and the tail deviations are low‑severity and unlikely to affect inference.
gg_cooksd(M3, threshold ='matlab')
IPL[c(13, 183, 254, 218),]
## # A tibble: 4 × 25
## match_id player team runs balls_faced fours sixes wickets overs_bowled
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1359516 YBK Jaiswal Rajas… 124 65 16 8 0 0
## 2 501223 SE Marsh Kings… 95 47 9 6 0 0
## 3 1312197 JC Buttler Rajas… 89 60 12 2 0 0
## 4 419137 NV Ojha Rajas… 94 57 8 6 0 0
## # ℹ 16 more variables: balls_bowled <dbl>, runs_conceded <dbl>, catches <dbl>,
## # run_outs <dbl>, maiden <dbl>, stumps <dbl>, match_outcome <chr>,
## # opposition_team <chr>, strike_rate <dbl>, economy <dbl>,
## # fantasy_points <dbl>, venue <chr>, date <date>, season <dbl>,
## # boundaries <dbl>, match_outcome_binary <dbl>
The Cook’s Distance plot appears visually dense because the dataset contains ~24,000 observations. Only a few observations (e.g., 13, 183, 254, 218) stand out, and these correspond to extreme batting performances (e.g., 124 off 65, 95 off 47). These points are influential but not problematic, as they represent genuine cricket outcomes rather than data errors. Overall, Cook’s Distance indicates low‑severity influence issues and the model remains stable.
ggplot(data = slice(IPL, c(13, 183, 254, 218))) +
geom_point(data = IPL,
mapping = aes(x = balls_faced, y = runs),
alpha = 0.3) +
geom_point(mapping = aes(x = balls_faced, y = runs),
color = 'darkred', size = 3) +
geom_text_repel(mapping = aes(x = balls_faced,
y = runs,
label = player),
color = 'darkred') +
labs(title = "Investigating High Influence Points",
subtitle = "Label = Player Name",
x = "Balls Faced",
y = "Runs Scored")
The plot shows the specific innings flagged by Cook’s Distance in the original data space, confirming that their influence comes from unusually large fitted values and residuals rather than unusual predictor values. Even though many other innings appear higher in the scatterplot, those points alone do not determine influence, influence depends on how much an observation distorts the regression model.
Across all diagnostic plots, the model shows no major violations. Residual patterns remain stable and only a few observations show elevated influence. Overall, the diagnostics indicate that the model is reliable, well‑behaved, and suitable for interpretation.
Further question:-How would the model’s conclusions change if the influential innings identified by Cook’s Distance were removed, and what does that reveal about the model’s sensitivity to extreme performances?