Introduction

This Data Dive explores the IPL Player Performance Dataset by interpreting regression models and applying regression diagnostics.

Extending Last Week’s Regression Model
Building on the simple linear regression of runs ~ boundaries by adding new predictors.
Testing additional Variable Types
Exploring continuous, binary, or interaction terms and identifying whether each should be included.
Multicollinearity
Evaluating correlations and VIF values to determine whether added predictors introduce redundancy.
Fitting Nested Regression Models
Compare models with increasing complexity to understand how each variable changes model performance.
Evaluating Model Diagnostics
Using five diagnostic plots to assess linearity, variance, normality, and influential observations.

ipl_raw<-read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")

## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): player, team, match_outcome, opposition_team, venue
## dbl  (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note: Data Preparation: The data set includes only 5 matches of year 2025 that is not complete and this would distort all calculations, so to avoid this, filtered out all rows from 2025 and used a clean dataset for further analysis including complete seasons only.

IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025)

# create boundaries and match outcome as binary
IPL <- IPL |>
  mutate(
    boundaries = fours + sixes,
    match_outcome_binary = if_else(match_outcome == "won", 1, 0)
)

Variable Selection & Nested Model

This week’s analysis extends the simple linear regression model developed in Week 8, where runs were modeled as a function of boundaries. In this datadive , this model is extended by adding additional predictors, testing an interaction term, incorporating a binary variable.

boundaries (continuous)

boundaries remain the core predictor from Week 8 and form the foundation for all nested models.

Model 1 — Baseline Model (Week 8)

$\text{runs} \sim \text{boundaries}$

Simple linear regression using boundaries as the sole predictor

M1 <- lm(runs ~ boundaries, data = IPL)
summary(M1)

## 
## Call:
## lm(formula = runs ~ boundaries, data = IPL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.916  -1.833  -1.833   1.482  35.167 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.83324    0.03988   45.97   <2e-16 ***
## boundaries   6.67126    0.01171  569.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.244 on 23923 degrees of freedom
## Multiple R-squared:  0.9313, Adjusted R-squared:  0.9313 
## F-statistic: 3.244e+05 on 1 and 23923 DF,  p-value: < 2.2e-16

Interpretation

Boundaries are a very strong positive predictor of runs $\hat{\beta}_1 = 6.67,\quad p < 2\times 10^{-16}$
For each additional boundary, expected runs increase by about $6.67$ runs.
The intercept is $\hat{\beta}_0 = 1.83$ representing expected runs when boundaries equal to $0$.
The model explains $\text{Adjusted } R^2 = 0.9313$ or 93.1% of the variation in runs.
Residuals range from -38.9 to 35.1, showing room for improvement.

balls_faced (continuous)

Balls faced is added as a second continuous predictor to capture innings duration and scoring opportunity.

Model 2 — Add balls_faced (Main Effects Model)

$\text{runs} \sim \text{boundaries} + \text{balls_faced}$

Adding Balls faced as a second continuous predictor

M2 <- lm(runs ~ boundaries + balls_faced, data = IPL)
summary(M2)

## 
## Call:
## lm(formula = runs ~ boundaries + balls_faced, data = IPL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.5891  -1.0004   0.4183   0.5256  26.3616 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.418308   0.024570  -17.02   <2e-16 ***
## boundaries   4.089826   0.013161  310.76   <2e-16 ***
## balls_faced  0.630900   0.002783  226.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.955 on 23922 degrees of freedom
## Multiple R-squared:  0.9782, Adjusted R-squared:  0.9782 
## F-statistic: 5.365e+05 on 2 and 23922 DF,  p-value: < 2.2e-16

Interpretation

Adding balls_faced dramatically improves model fit: $0.9313 \rightarrow 0.9782$
Both predictors are highly significant $p < 2\times 10^{-16}$
boundaries coefficient drops from $6.67 \rightarrow 4.09$ because $balls\_faced$ now absorbs part of the scoring explanation.
$balls\_faced\ coefficient:$ $\hat{\beta}_2 = 0.63$ meaning each additional ball adds ~0.63 runs on average.
This model captures both scoring volume $boundaries$ and scoring opportunity $balls faced$
Residuals shrink from $\pm 38$ to $\pm 18$, showing much higher accuracy.
This is a major improvement over Model 1.

Interaction Term: boundaries × balls_faced

This is the primary new variable and tests whether the effect of boundaries depends on balls faced

Model 3 — Add Interaction Term

$\text{runs} \sim \text{boundaries} * \text{balls_faced}$

Tests whether the effect of boundaries depends on balls faced.

M3 <- lm(runs ~ boundaries * balls_faced, data = IPL)
summary(M3)

## 
## Call:
## lm(formula = runs ~ boundaries * balls_faced, data = IPL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4510  -0.9628   0.2826   0.6079  25.5441 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -0.2826013  0.0263345  -10.73   <2e-16 ***
## boundaries              3.9036654  0.0187064  208.68   <2e-16 ***
## balls_faced             0.6227037  0.0028331  219.79   <2e-16 ***
## boundaries:balls_faced  0.0053404  0.0003829   13.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.943 on 23921 degrees of freedom
## Multiple R-squared:  0.9784, Adjusted R-squared:  0.9784 
## F-statistic: 3.606e+05 on 3 and 23921 DF,  p-value: < 2.2e-16

Interpretation

Interaction term is significant: $\hat{\beta}_3 = 0.00534,\quad p < 2\times 10^{-16}$
Model fit improves slightly: $\text{Adjusted } R^2 = 0.9784$
Residuals shrink further and become more symmetric.
The marginal effect of boundaries becomes:$\frac{\partial \text{runs}}{\partial \text{boundaries}} = \beta_1 + \beta_3 \cdot \text{balls_faced}$
Players who face more balls get a slightly higher marginal benefit from boundaries.

match_outcome (binary)

A binary variable indicating whether the player’s team won (1) or lost (0).

Model 4 — Add Binary Variable (match_outcome)

$\text{runs} \sim \text{boundaries} * \text{balls_faced} + \text{match_outcome_binary}$

Adding a binary variable indicating whether the player’s team won.

M4 <- lm(runs ~ boundaries * balls_faced + match_outcome_binary, data = IPL)
summary(M4)

## 
## Call:
## lm(formula = runs ~ boundaries * balls_faced + match_outcome_binary, 
##     data = IPL)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.4510  -0.9628   0.2826   0.6079  25.5441 
## 
## Coefficients: (1 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -0.2826013  0.0263345  -10.73   <2e-16 ***
## boundaries              3.9036654  0.0187064  208.68   <2e-16 ***
## balls_faced             0.6227037  0.0028331  219.79   <2e-16 ***
## match_outcome_binary           NA         NA      NA       NA    
## boundaries:balls_faced  0.0053404  0.0003829   13.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.943 on 23921 degrees of freedom
## Multiple R-squared:  0.9784, Adjusted R-squared:  0.9784 
## F-statistic: 3.606e+05 on 3 and 23921 DF,  p-value: < 2.2e-16

Interpretation

$match\_outcome\_binary$ is dropped due to singularity (NA coefficients).
Model fit remains identical to Model 3: $\text{Adjusted } R^2 = 0.9784$ , no improvement in residuals.
Once boundaries and balls_faced are known, match outcome adds no additional predictive power.
This is expected: match outcome is a team‑level variable, not a batting‑mechanics variable
Model 4 collapses back to Model 3.

Multicollinearity

Multicollinearity occurs when predictors in a regression model are highly correlated with each other. This can inflate standard errors, destabilize coefficient estimates, and make interpretation unreliable. Since there are multiple continuous predictors and an interaction term, it is important to evaluate multicollinearity before selecting the final model.

# VIF for Model 2
vif(M2)

##  boundaries balls_faced 
##    3.975612    3.975612

# VIF for Model 3
vif(M3)

## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif

##             boundaries            balls_faced boundaries:balls_faced 
##               8.096852               4.154305               6.784790

Variance Inflation Factors (VIF) are computed for Models 2 and 3, as these models contain multiple predictors.

For Model 2, both predictors had VIF values of approximately:

$\Large \text{VIF}\approx 3.98<5$

indicating no harmful multicollinearity.

For Model 3, the interaction term increased VIF values as expected, but all remained below the conventional threshold of

$\Large \text{VIF}<10$, with

$\Large \text{VIF}_{boundaries}=8.10,\quad \text{VIF}_{balls\_faced}=4.15,\quad \text{VIF}_{interaction}=6.78.$

These values indicate that the interaction model does not suffer from severe multicollinearity. Overall, the predictors in the final model $Model 3$ are stable, and multicollinearity is not a concern.

Correlation Heatmap

ggcorr(
  select(IPL,
         runs,
         boundaries,
         balls_faced),
  label = TRUE
) +
  labs(title = "Correlation Heatmap for Model Predictors")

The predictors $boundaries $ and $balls\_faced$ show a very strong positive correlation $≈ 0.9$, which is expected because players who face more balls tend to hit more boundaries.
runs is perfectly correlated with $boundaries$ $1.0$ and strongly correlated with $balls\_faced$ $≈0.9$ which makes sense because runs are directly produced from these actions
the predictors represent different cricket actions and the model includes an interaction term that captures their combined effect so not problematic for the model

Model Selection

To identify the most appropriate regression model for explaining variation in runs using batting‑level predictors.

Based on statistical significance, $R^2$, interpretability, multicollinearity diagnostics, and cricket‑specific reasoning Model 3 is the best Model

All coefficients in Model 3 are highly significant.

The interaction term provides meaningful cricket insight:

$\Large \frac{\partial \text{runs}}{\partial \text{boundaries}}= \beta_1 + \beta_3 \cdot \text{balls_faced}$

showing that boundaries become slightly more valuable as a batter faces more balls.

Model 3 achieves the highest adjusted $R^2$ without overfitting.
VIF values for Model 3 are below 10, confirming no severe multicollinearity.

Conclusion

Model 3 is selected as the final model because it provides the best balance of statistical performance, interpretability, and cricket‑specific reasoning. It captures both scoring volume (boundaries), scoring opportunity (balls faced), and the interaction between them, without introducing multicollinearity or redundant predictors.

Model Diagnostics

Evaluating the final model using the five key diagnostic plots :

Residuals vs Fitted
Residuals vs Each Predictor
Residual Histogram
Normal QQ Plot
Cook’s Distance

Residuals vs Fitted

gg_resfitted(M3) +
  geom_smooth(se = FALSE)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the lindia package.
##   Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Interpretation

The residuals are centered around the horizontal zero line, indicating that the model does not systematically over‑ or under‑predict.
The smooth blue line is nearly flat, suggesting that the relationship between predictors and runs is reasonably linear.
The spread of residuals is mostly consistent across fitted values, with only slight widening at higher fitted values.
No U‑shape, wave pattern, or curvature is visible, meaning the model is not missing a nonlinear term.

Significance

Linearity: The flat smooth line indicates high confidence that the linearity assumption is met.
Homoscedasticity: There is mild heteroscedasticity at high fitted values, but it is low severity and unlikely to affect inference.
Model adequacy: No structural pattern suggests the model form is appropriate.
Confidence level: High confidence that linearity is satisfied.
Moderate‑to‑high confidence that variance is acceptable.

Residuals vs Each Predictor

plots <- gg_resX(M3, plot.all = FALSE)

## Warning: `fortify(<lm>)` was deprecated in ggplot2 4.0.0.
## ℹ Please use `broom::augment(<lm>)` instead.
## ℹ The deprecated feature was likely used in the lindia package.
##   Please report the issue at <https://github.com/yeukyul/lindia/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

plots$boundaries +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Residuals are centered around zero with a nearly flat smooth line and no visible curvature.
This indicates that the relationship between $boundaries$ and $runs$ is modeled appropriately.
Variance is mostly consistent, any heteroscedasticity is mild and low‑severity.
High confidence that the model form is appropriate for this predictor

plots$balls_faced +
  geom_smooth(se = FALSE)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

Residuals remain centered around zero across the full range of $balls\_faced$, and the smooth line stays nearly flat, indicating no meaningful curvature or nonlinear pattern.
The spread of residuals is consistent, with no strong funnel shape, any widening at high $balls\_faced$ values is mild and expected due to fewer observations.
No extreme outliers appear, suggesting the model is not missing a key term related to $balls\_faced$

Residual Histogram

gg_reshist(M3)

## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

The histogram is centered around zero, with most residuals clustered tightly near the middle, indicating that the model does not systematically over‑ or under‑predict.
The shape is roughly symmetric, with only mild deviations from perfect normality — no heavy skew or extreme tail behavior.
The distribution shows a single clear peak and smooth tapering on both sides, which is typical for a well‑fitted regression model.
Any imperfections in normality appear low‑severity and are unlikely to meaningfully affect inference (p‑values, confidence intervals).

QQ-Plot

gg_qqplot(M3)

Most points fall close to the reference line in the middle of the plot, indicating that the central portion of the residual distribution aligns well with normality.
The deviations at both tails are noticeable , the upper tail bends above the line and the lower tail dips below it indicating mild non‑normality in the extremes.
These tail deviations are common in real cricket performance data, where extreme innings (very high or very low runs) are less predictable.

Histogram + QQ Plot

The histogram shows residuals centered around zero with a roughly symmetric shape, indicating that the bulk of the residuals behave close to normally.
The QQ plot confirms this in the middle region, where points closely follow the reference line, but it also highlights mild deviations in the tails.
Taken together, the two plots suggest approximate normality: the central distribution is well‑behaved, and the tail deviations are low‑severity and unlikely to affect inference.

Cook’s D

gg_cooksd(M3, threshold ='matlab')

IPL[c(13, 183, 254, 218),]

## # A tibble: 4 × 25
##   match_id player      team    runs balls_faced fours sixes wickets overs_bowled
##      <dbl> <chr>       <chr>  <dbl>       <dbl> <dbl> <dbl>   <dbl>        <dbl>
## 1  1359516 YBK Jaiswal Rajas…   124          65    16     8       0            0
## 2   501223 SE Marsh    Kings…    95          47     9     6       0            0
## 3  1312197 JC Buttler  Rajas…    89          60    12     2       0            0
## 4   419137 NV Ojha     Rajas…    94          57     8     6       0            0
## # ℹ 16 more variables: balls_bowled <dbl>, runs_conceded <dbl>, catches <dbl>,
## #   run_outs <dbl>, maiden <dbl>, stumps <dbl>, match_outcome <chr>,
## #   opposition_team <chr>, strike_rate <dbl>, economy <dbl>,
## #   fantasy_points <dbl>, venue <chr>, date <date>, season <dbl>,
## #   boundaries <dbl>, match_outcome_binary <dbl>

The Cook’s Distance plot appears visually dense because the dataset contains ~24,000 observations. Only a few observations (e.g., 13, 183, 254, 218) stand out, and these correspond to extreme batting performances (e.g., 124 off 65, 95 off 47). These points are influential but not problematic, as they represent genuine cricket outcomes rather than data errors. Overall, Cook’s Distance indicates low‑severity influence issues and the model remains stable.

Plotting high influence points

ggplot(data = slice(IPL, c(13, 183, 254, 218))) +
  geom_point(data = IPL,
             mapping = aes(x = balls_faced, y = runs),
             alpha = 0.3) +
  geom_point(mapping = aes(x = balls_faced, y = runs),
             color = 'darkred', size = 3) +
  geom_text_repel(mapping = aes(x = balls_faced,
                                y = runs,
                                label = player),
                  color = 'darkred') +
  labs(title = "Investigating High Influence Points",
       subtitle = "Label = Player Name",
       x = "Balls Faced",
       y = "Runs Scored")

The plot shows the specific innings flagged by Cook’s Distance in the original data space, confirming that their influence comes from unusually large fitted values and residuals rather than unusual predictor values. Even though many other innings appear higher in the scatterplot, those points alone do not determine influence, influence depends on how much an observation distorts the regression model.

Across all diagnostic plots, the model shows no major violations. Residual patterns remain stable and only a few observations show elevated influence. Overall, the diagnostics indicate that the model is reliable, well‑behaved, and suitable for interpretation.

Further question:-How would the model’s conclusions change if the influential innings identified by Cook’s Distance were removed, and what does that reveal about the model’s sensitivity to extreme performances?

Week8_Datadive

Mayank Gupta

2026-03-23

Introduction

Variable Selection & Nested Model

Multicollinearity

Correlation Heatmap

Model Selection

Model Diagnostics

Residuals vs Fitted

Residuals vs Each Predictor

Residual Histogram

QQ-Plot

Histogram + QQ Plot

Cook’s D

Plotting high influence points