Introduction

This Data Dive explores the IPL Player Performance Dataset through the following steps:

Selecting the time‑based column (date) and a response variable to analyze over time
Creating a tsibble object
Plotting the time series across different windows of time
Fitting a linear regression to detect upward or downward trends
Smoothing with rolling averages and LOESS
Examining autocorrelation with the ACF and PACF

library(tidyverse); library(lubridate); library(tsibble); library(slider)  # packages used below
ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")

## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): player, team, match_outcome, opposition_team, venue
## dbl  (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data preparation: The dataset includes only 5 matches from 2025, so that season is incomplete and would distort the calculations. All 2025 rows were therefore filtered out, and the rest of the analysis uses a clean dataset containing complete seasons only.

IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025)
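One way to spot an incomplete season before filtering is to count distinct matches per season. The sketch below uses a small synthetic data frame (the rows and dates are made up for illustration); only the column names `match_id` and `date` are taken from the dataset.

```r
library(dplyr)

# Synthetic stand-in for the player-level data: one row per player per match.
toy <- data.frame(
  match_id = c(1, 1, 2, 2, 3, 3, 4, 4, 5),
  date = as.Date(c("2023-04-01", "2023-04-01", "2023-04-02", "2023-04-02",
                   "2024-03-22", "2024-03-22", "2024-03-23", "2024-03-23",
                   "2025-03-22"))
)

# Count distinct matches per season; a season with far fewer matches
# than the others is a candidate for exclusion.
matches_per_season <- toy |>
  mutate(season = as.integer(format(date, "%Y"))) |>
  group_by(season) |>
  summarise(n_matches = n_distinct(match_id), .groups = "drop")

print(matches_per_season)
```

Applied to the real data, the same summary would presumably show 2025 with only a handful of matches next to the full seasons.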

Selecting response variable and creating tsibble

In this analysis, the IPL’s calendar‑date structure was transformed into a season‑compressed time series to correctly study scoring patterns. Since the IPL is played only during a fixed window each year, the raw data contains long off‑season gaps with no matches, which could distort trend and seasonality detection. To remove this artifact, all off‑season days were excluded and the match dates were re‑indexed into a continuous match‑day sequence, where the final match of one season directly precedes the first match of the next. This produces a clean, uninterrupted time series that accurately reflects within‑season scoring dynamics, enabling meaningful analysis of trends, peaks, and autocorrelation patterns.

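The response variable for the time‑series analysis is \(total\_runs\), created by aggregating the player‑level \(runs\) column into daily totals. Because the aggregation is by date, a day with more than one match contributes the combined runs of all matches played that day.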

ipl_daily <- IPL |>
  group_by(date) |>
  summarise(total_runs = sum(runs), .groups = "drop") |>
  arrange(date) |>
  mutate(match_day = row_number()) |>
  as_tsibble(index = match_day)
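The effect of the re‑indexing can be seen on a tiny example. This is a sketch on synthetic values (the dates and totals are made up); the `days_since_prev` column is a hypothetical helper added only to show the calendar gap that the `match_day` index removes.

```r
library(dplyr)

# Synthetic daily totals spanning two seasons with an off-season gap.
toy_daily <- data.frame(
  date = as.Date(c("2023-05-27", "2023-05-28", "2024-03-22", "2024-03-23")),
  total_runs = c(340, 385, 349, 360)
)

toy_compressed <- toy_daily |>
  arrange(date) |>
  mutate(
    match_day = row_number(),                       # consecutive index with no gaps
    days_since_prev = as.integer(date - lag(date))  # calendar days since the previous row
  )

print(toy_compressed)
```

The last match day of one season and the first of the next receive adjacent `match_day` values even though they are roughly ten months apart on the calendar.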

Plotting response variable over time

ipl_daily |>
  ggplot(aes(match_day, total_runs)) +
  geom_line(color = "steelblue") +
  labs(
    title = "IPL Total Runs Over Compressed Match Days (2008–2024)",
    x = "Match Day (Season-Compressed)",
    y = "Total Runs"
  )

The season‑compressed plot of daily total runs reveals a much clearer picture of IPL scoring dynamics. With off‑season gaps removed, the time series forms a continuous sequence of match days, allowing the underlying structure to emerge cleanly. Distinct within‑season peaks become sharply visible, reflecting periods of intensified scoring during match‑dense stretches. These peaks are now far more interpretable without the long zero‑run intervals that previously obscured the pattern. Notably, the height of these peaks increases in later portions of the series, indicating that modern IPL seasons tend to be substantially more high‑scoring than earlier ones. Overall, the compressed timeline highlights strong within‑season variability and a gradual rise in scoring intensity across the league’s history.

Different windows of time

To compare different windows of time, the early IPL seasons (2008–2013) and the modern era (2018–2024) are plotted separately.

Early IPL seasons (2008–2013)

ipl_daily |>
  filter(year(date) <= 2013) |>   # seasons 2008–2013, selected by calendar year
  ggplot(aes(match_day, total_runs)) +
  geom_line(color = "navy") +
  labs(title = "Early IPL Scoring Patterns (Compressed)")

Modern IPL era (2018–2024)

ipl_daily |>
  filter(year(date) >= 2018) |>   # seasons 2018–2024, selected by calendar year
  ggplot(aes(match_day, total_runs)) +
  geom_line(color = "firebrick") +
  labs(title = "Modern IPL Scoring Patterns (Compressed)")

When comparing the two compressed windows of time, a clear evolution in IPL scoring behavior becomes evident. In the early seasons, total‑run peaks generally fall within the 400–450 run range, with only occasional surges approaching 500 runs. The fluctuations are present but relatively contained, reflecting a more conservative scoring environment with fewer explosive match clusters. This pattern suggests that early IPL seasons were characterized by steadier scoring rhythms and less extreme variability from one match day to the next.

In contrast, the modern era displays much sharper and more frequent peaks, with several match‑day totals rising into the 500–550 run range. These elevated spikes, combined with greater volatility across consecutive match days, highlight the increasingly aggressive batting strategies and higher scoring intensity that define recent IPL seasons. Together, the two windows reveal a pronounced upward shift in scoring amplitude over time, underscoring how the league has evolved toward more dynamic, high‑impact run production.
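The visual comparison could be backed by per‑season summary statistics. The following is a minimal sketch on synthetic daily totals (all values are made up); `season_summary` and its column names are illustrative, not part of the original analysis.

```r
library(dplyr)

# Synthetic daily totals standing in for ipl_daily.
toy_daily <- data.frame(
  date = as.Date(c("2008-04-18", "2008-04-19", "2008-04-20",
                   "2024-03-22", "2024-03-23", "2024-03-24")),
  total_runs = c(300, 420, 330, 350, 540, 400)
)

# Mean, peak, and spread of daily totals within each season.
season_summary <- toy_daily |>
  mutate(season = as.integer(format(date, "%Y"))) |>
  group_by(season) |>
  summarise(
    mean_runs = mean(total_runs),
    peak_runs = max(total_runs),
    sd_runs = sd(total_runs),
    .groups = "drop"
  )

print(season_summary)
```

Comparing `peak_runs` and `sd_runs` between early and recent seasons would put numbers on the shift in peak height and volatility described above.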

Linear regression to detect upward or downward trends

trend_model <- lm(total_runs ~ match_day, data = ipl_daily)
summary(trend_model)

## 
## Call:
## lm(formula = total_runs ~ match_day, data = ipl_daily)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -299.8 -112.6  -60.0  138.2  461.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 411.93622   10.76988  38.249   <2e-16 ***
## match_day    -0.02642    0.02265  -1.167    0.244    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 154.3 on 821 degrees of freedom
## Multiple R-squared:  0.001656,   Adjusted R-squared:  0.0004396 
## F-statistic: 1.362 on 1 and 821 DF,  p-value: 0.2436

The linear regression of \(total\_runs\) on the season‑compressed match‑day index shows no statistically significant upward or downward trend across the full 2008–2024 period. The estimated slope is very small (\(-0.026\) runs per match day), and its p‑value (\(0.244\)) is well above conventional significance thresholds. The R‑squared value is extremely low (\(0.0017\)), meaning the linear model explains almost none of the variation in daily total runs. This suggests that a simple straight‑line trend is not an appropriate summary of the scoring dynamics, even after compressing seasons into a continuous timeline.

Overall, the global linear trend is weak, which suggests that meaningful trends emerge only when the data are examined in separate structural windows rather than as a single continuous series.
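To put the slope in scale, here is a rough worked calculation from the reported coefficients. It assumes the match‑day index runs from 1 to 823, which follows from the 821 residual degrees of freedom plus the two estimated parameters.

\[
\widehat{total\_runs} = 411.94 - 0.0264 \times match\_day,
\qquad
-0.02642 \times (823 - 1) \approx -21.7 \text{ runs}
\]

Under that assumption, the fitted line falls by only about 22 runs from the first match day to the last, which is small next to the residual standard error of 154.3.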

Rolling averages and LOESS

ipl_daily <- ipl_daily |>
  mutate(roll5 = slide_dbl(total_runs, mean, .before = 4))

ipl_daily |>
  ggplot(aes(match_day)) +
  geom_line(aes(y = total_runs), alpha = 0.3) +
  geom_line(aes(y = roll5), color = "orange") +
  geom_smooth(aes(y = total_runs), method = "loess", span = 0.2, color = "red") +
  labs(title = "Smoothed IPL Scoring Patterns (Compressed)")

## `geom_smooth()` using formula = 'y ~ x'

The smoothed scoring plot reveals the underlying structure of IPL run production once off‑season gaps are removed. The faint raw line shows substantial day‑to‑day volatility, but the trailing 5‑match‑day rolling average (orange) highlights short‑term scoring cycles that rise and fall within each season. The LOESS curve (red) provides an even clearer long‑term pattern: scoring levels oscillate consistently across the compressed timeline, forming repeated within‑season waves rather than drifting steadily upward or downward. While the overall trend remains relatively stable, the peaks of these smoothed cycles gradually become higher in the later portions of the series, reflecting the increased scoring intensity seen in modern IPL seasons. Together, the rolling mean and LOESS smoothing make the seasonal rhythm of IPL scoring much more interpretable than the raw series alone.
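To make the rolling window concrete, the sketch below applies the same `slide_dbl()` call shape to a short made‑up vector. The `.complete = TRUE` variant is shown only for contrast and is not used in the analysis above.

```r
library(slider)

x <- c(10, 20, 30, 40, 50, 60)  # made-up values, not IPL data

# Same call shape as in the analysis: the current value plus the 4 before it.
# Early positions average over however many values are available.
partial <- slide_dbl(x, mean, .before = 4)

# Variant that returns NA until a full 5-value window exists.
complete <- slide_dbl(x, mean, .before = 4, .complete = TRUE)

print(partial)
print(complete)
```

With the default setting, the first four values of `roll5` are averages over fewer than five match days, so the very start of the orange line is less smoothed than the rest.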

ACF and PACF

acf(ipl_daily$total_runs)

The ACF plot for the season‑compressed series shows strong positive autocorrelation at the first few lags, indicating that total runs on a given match day are closely related to scoring on nearby match days. This reflects the natural clustering of high‑ and low‑scoring matches within each IPL season. The autocorrelation then declines gradually rather than dropping sharply, suggesting the presence of longer‑range dependence in scoring patterns. Several smaller bumps appear at intermediate lags, which correspond to repeated within‑season scoring cycles rather than off‑season gaps, an effect that becomes visible only after compressing the timeline. Overall, the ACF indicates that IPL scoring exhibits meaningful short‑term dependence and recurring seasonal structure, even though the long‑term linear trend remains weak.

The repeated bumps at intermediate lags are consistent with clear within‑season seasonality in IPL scoring.
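For reference, the quantity plotted by `acf()` can be reproduced by hand. The sketch below uses a simulated series rather than the IPL data, and `manual_acf` is a hypothetical helper written for illustration. The last line computes the approximate 95% bounds drawn on the default plot, assuming 823 observations as implied by the regression output.

```r
set.seed(1)
x <- as.numeric(arima.sim(list(ar = 0.5), n = 200))  # simulated, not IPL data

# Sample autocorrelation at lag k: lagged cross-products of the demeaned
# series divided by the total sum of squares.
manual_acf <- function(x, k) {
  n <- length(x)
  xc <- x - mean(x)
  sum(xc[1:(n - k)] * xc[(1 + k):n]) / sum(xc^2)
}

r_manual <- sapply(1:5, function(k) manual_acf(x, k))
r_builtin <- acf(x, lag.max = 5, plot = FALSE)$acf[2:6]  # drop lag 0

print(rbind(r_manual, r_builtin))

# Approximate 95% bounds for a series of 823 observations.
bound_823 <- qnorm(0.975) / sqrt(823)
print(bound_823)
```

Bars that stay inside roughly plus or minus 0.07 would be indistinguishable from white noise at that sample size, which is the yardstick for judging the bumps at intermediate lags.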

pacf(ipl_daily$total_runs)

The PACF plot shows a strong and statistically significant spike at lag 1, indicating that the previous match day has a direct and meaningful influence on the current day’s total runs. Beyond lag 1, the partial autocorrelations drop sharply and remain within the confidence bounds, suggesting that higher‑order lags do not contribute substantial additional predictive power once the immediate prior day is accounted for. This pattern is characteristic of an AR(1)‑type dependence structure, where short‑term scoring momentum carries over from one match day to the next but does not persist far into the future. The absence of large spikes at higher lags also confirms that the season‑compressed series does not exhibit long‑range partial autocorrelation, reinforcing the idea that IPL scoring dynamics are driven primarily by short‑term match clusters rather than extended multi‑day dependencies.
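The AR(1) signature described here can be checked against a simulated series. This is a sketch under stated assumptions: the coefficient `phi = 0.6` and the series length are arbitrary choices for illustration, not estimates from the IPL data.

```r
set.seed(42)
phi <- 0.6  # assumed AR(1) coefficient, chosen for illustration only
x <- arima.sim(model = list(ar = phi), n = 5000)

acf_vals <- acf(x, lag.max = 5, plot = FALSE)$acf[2:6]    # lags 1..5
pacf_vals <- pacf(x, lag.max = 5, plot = FALSE)$acf[1:5]  # lags 1..5

# For an AR(1) process the ACF decays gradually (roughly phi^k),
# while the PACF is near zero after lag 1.
print(round(acf_vals, 2))
print(round(pacf_vals, 2))
```

A gradually decaying ACF combined with a single PACF spike at lag 1 is the same qualitative pattern reported for the season‑compressed series.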

Further question: Given the strong seasonality and weak global trend, would a seasonal ARIMA model capture IPL scoring patterns more effectively than a simple linear trend model?

Week12_Datadive

Mayank Gupta

2026-04-08