Introduction

This Data Dive explores the IPL Player Performance Dataset by:

  • Selecting the time‑based column (date) and a response variable to analyze over time

  • Creating a tsibble object

  • Plotting the time series across different windows of time

  • Using linear regression to detect upward or downward trends

  • Smoothing with rolling averages and LOESS

  • Examining the ACF and PACF

ipl_raw<-read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")
## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): player, team, match_outcome, opposition_team, venue
## dbl  (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Note (data preparation): The dataset includes only 5 matches from 2025, so that season is incomplete and would distort season-level calculations. To avoid this, all 2025 rows are filtered out, and the rest of the analysis uses complete seasons only.

IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025)
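One way to confirm that a season is incomplete before dropping it is to count distinct matches per season. The sketch below uses a small synthetic table in base R; the column names `match_id`, `date`, and `runs` follow the dataset, but the values and the `toy` object are made up for illustration.

```r
# Synthetic stand-in for the player-level table (one row per player per match).
# Values are invented; only the column names mirror the real dataset.
toy <- data.frame(
  match_id = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
  date = as.Date(c("2023-04-01", "2023-04-01", "2023-04-02", "2023-04-02",
                   "2024-04-01", "2024-04-01", "2024-04-02", "2024-04-02",
                   "2025-03-22", "2025-03-22")),
  runs = c(30, 45, 12, 60, 51, 8, 77, 20, 33, 41)
)
toy$season <- as.integer(format(toy$date, "%Y"))

# Distinct matches per season; a season with far fewer matches than the
# others is a candidate for exclusion.
matches_per_season <- tapply(toy$match_id, toy$season,
                             function(id) length(unique(id)))
print(matches_per_season)

# Keep complete seasons only, mirroring filter(season < 2025).
toy_complete <- toy[toy$season < 2025, ]
```

In this toy table, 2025 has a single match while the other seasons have two each, which is the kind of imbalance that motivates the filter.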

Selecting the response variable and creating a tsibble

The response variable for the time‑series analysis is \(total\_runs\) which is created by aggregating the player‑level \(runs\) column into daily totals.

ipl_daily_runs <- IPL |>
  group_by(date) |>
  summarise(total_runs = sum(runs), .groups = "drop") |>
  as_tsibble(index = date)

Daily \(total\_runs\) values provide a meaningful time‑series representation of match‑day scoring intensity. The aggregation reduces 24,000+ player‑level rows to a clean date‑indexed series.
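The aggregation step can be illustrated on a handful of rows. The following is a minimal base R sketch on synthetic data (the `toy` table and its values are assumptions, not taken from the dataset). It also shows that only match days appear in the result, so consecutive rows can be far apart in calendar time.

```r
# Synthetic player-level rows: two players on each of three match days.
toy <- data.frame(
  date = as.Date(c("2023-05-26", "2023-05-26", "2023-05-28", "2023-05-28",
                   "2024-03-22", "2024-03-22")),
  runs = c(60, 45, 30, 72, 51, 18)
)

# Sum player-level runs into one total per date (same idea as
# group_by(date) |> summarise(total_runs = sum(runs))).
daily <- aggregate(runs ~ date, data = toy, FUN = sum)
names(daily)[names(daily) == "runs"] <- "total_runs"
print(daily)

# Calendar days between consecutive observations; large values mark the
# off-season, where the series simply has no rows.
gap_days <- as.numeric(diff(daily$date))
print(gap_days)
```

Because dates without matches never appear as rows, any method that works on row order (rather than calendar time) treats the last match of one season and the first match of the next as neighbours.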

Plotting response variable over time

ipl_daily_runs |>
  ggplot(aes(date, total_runs)) +
  geom_line(color = "steelblue") +
  labs(
    title = "Daily IPL Runs (2008–2024)",
    y = "Total Runs"
  )

The plot shows the response variable \(total\_runs\) over time. Each point represents the total number of runs scored across all IPL matches played on that date. There are clear annual clusters (the IPL seasons) and a visible upward scoring trend. The IPL’s compact annual window (typically Mar–May) creates strong seasonality.

Different windows of time

To compare different windows of time, the early IPL seasons (2008–2013) and the modern era (2018–2024) are plotted separately.

Early IPL seasons (2008–2013)

ipl_daily_runs |>
  filter(date < as.Date("2014-01-01")) |>
  ggplot(aes(date, total_runs)) +
  geom_line(color = "navy") +
  labs(title = "IPL Runs (2008–2013)")

Modern IPL era (2018–2024)

ipl_daily_runs |>
  filter(date >= as.Date("2018-01-01")) |>
  ggplot(aes(date, total_runs)) +
  geom_line(color = "firebrick") +
  labs(title = "IPL Runs (2018–2024)")

When comparing the two windows of time, a clear evolution in IPL scoring patterns emerges. In the early seasons (2008–2013), daily \(total\_runs\) values are noticeably lower and more stable, with peaks rarely exceeding 500–600 runs and relatively modest volatility. This reflects a more conservative style of play and fewer extreme high‑scoring matches. In contrast, the modern era (2018–2024) shows dramatically higher peaks, frequent spikes above 600–700 runs, and much greater variability within each season.

The 2023–2024 seasons, in particular, display the highest scoring levels in the series, likely influenced by more aggressive batting strategies and the introduction of the Impact Player rule. Together, these windows highlight a strong upward shift in scoring intensity over time, reinforcing the long‑term trend observed in the full time‑series plot.
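A linear regression of the daily totals on time is one simple way to put a number on an upward or downward trend. The sketch below is illustrative only: it fits `lm()` to a synthetic series with a known slope, and the `toy` data, the `day_index` column, and the chosen slope and noise level are all assumptions rather than results from the IPL data.

```r
set.seed(42)

# Synthetic series: weekly observations with a built-in trend of
# 0.05 runs per day plus random noise (values are invented).
dates <- seq(as.Date("2008-04-18"), by = "week", length.out = 200)
toy <- data.frame(date = dates,
                  day_index = as.numeric(dates - min(dates)))
toy$total_runs <- 300 + 0.05 * toy$day_index + rnorm(nrow(toy), sd = 20)

# Regress the response on elapsed days; the slope is the average
# change in total runs per calendar day.
fit <- lm(total_runs ~ day_index, data = toy)
slope_per_day <- unname(coef(fit)["day_index"])
slope_per_year <- slope_per_day * 365.25
p_value <- summary(fit)$coefficients["day_index", "Pr(>|t|)"]

cat("Slope per day: ", slope_per_day, "\n")
cat("Slope per year:", slope_per_year, "\n")
cat("p-value:       ", p_value, "\n")
```

A positive, statistically significant slope would support an upward trend, though a single straight line cannot describe seasonal structure or a change in slope between eras.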

Rolling averages and LOESS

ipl_daily_runs <- ipl_daily_runs |>
  mutate(roll3 = slide_dbl(total_runs, mean, .before = 2))

ipl_daily_runs |>
  ggplot(aes(date)) +
  geom_line(aes(y = total_runs), alpha = 0.3) +
  geom_line(aes(y = roll3), color = "orange") +
  geom_smooth(aes(y = total_runs), method = "loess", span = 0.3, color = "red") +
  labs(title = "Smoothed IPL Daily Runs")
## `geom_smooth()` using formula = 'y ~ x'

Applying smoothing techniques reveals a clear seasonal pattern in the IPL scoring data. The raw daily totals fluctuate sharply within each year, but the rolling mean (a trailing average over three consecutive match days) and the LOESS curve make the structure much easier to see. Each IPL season shows a distinct peak period where total runs rise sharply during the match‑heavy months, followed by a long off‑season. Because the series contains rows only for match days, the off‑season has no observations; the plotted lines bridge these gaps rather than recording true zero totals. This repeating pattern confirms the presence of strong annual seasonality in the data.

The smoothed LOESS line also shows that while the overall long‑term trend is relatively stable, the amplitude of the seasonal peaks increases over time, especially in the most recent seasons. This suggests that IPL scoring intensity has grown, with modern seasons producing higher run totals during peak months than earlier years.
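To make the rolling mean concrete, here is a small base R sketch of a trailing three-observation average. It is meant to mimic what `slide_dbl(total_runs, mean, .before = 2)` computes with its default settings (partial windows at the start of the series); the helper `trailing_mean()` and the input values are hypothetical.

```r
# Trailing mean over the current value and up to `before` earlier values.
# At the start of the series the window is shorter (partial window).
trailing_mean <- function(x, before = 2) {
  sapply(seq_along(x), function(i) mean(x[max(1, i - before):i]))
}

# Invented daily totals for four consecutive match days.
x <- c(300, 330, 360, 600)
roll3 <- trailing_mean(x)
print(roll3)
# 300 315 330 430
```

The window is defined by row position, not by calendar distance, so the first match days of a season are averaged together with the last match days of the previous one.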

ACF and PACF

acf(ipl_daily_runs$total_runs)

The ACF plot shows strong positive autocorrelation at short lags, meaning that daily IPL run totals are highly dependent on the values from recent match days. Because acf() is applied to the plain vector of totals, each lag counts observed match days rather than calendar days. The autocorrelation decays slowly rather than dropping sharply, which suggests that the series is non‑stationary and contains long‑term structure. In addition, the ACF exhibits periodic bumps at multiple lags, reflecting the repeating pattern of IPL match clusters within each year. Together, these features point to strong seasonality and temporal dependence in the data.
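For readers who want to see what a single bar in the ACF plot represents, the lag-1 autocorrelation can be computed by hand and compared with `acf()`. This is a minimal sketch on an invented vector; the approximate 95% bound shown is the usual white-noise reference line, \(\pm 1.96/\sqrt{n}\).

```r
# Invented match-day totals (not from the dataset).
x <- c(310, 335, 290, 360, 410, 380, 300, 280, 350, 420, 445, 390)
n <- length(x)
m <- mean(x)

# Lag-1 autocorrelation: covariance between each value and its
# predecessor, divided by the overall sum of squared deviations.
r1_manual <- sum((x[-1] - m) * (x[-n] - m)) / sum((x - m)^2)

# acf() stores lag 0 in the first position, so lag 1 is the second.
r1_acf <- acf(x, plot = FALSE)$acf[2]

# Approximate 95% significance bound under a white-noise assumption.
bound <- qnorm(0.975) / sqrt(n)

cat("Manual lag-1:", r1_manual, "\n")
cat("acf() lag-1: ", r1_acf, "\n")
cat("95% bound:   ", bound, "\n")
```

Bars that extend beyond the bound are the ones usually read as significantly different from zero.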

pacf(ipl_daily_runs$total_runs)

The PACF plot shows a strong and statistically significant spike at lag 1, indicating that daily IPL run totals are directly related to the previous match day’s total. Beyond lag 1, most partial autocorrelations fall within the confidence bands, suggesting that higher‑order lags contribute little additional predictive information. This pattern is consistent with an AR(1)‑type structure, where short‑term dependence is strong but quickly diminishes. The PACF also shows small bumps at larger lags, which align with the seasonal clustering of IPL matches, but these effects are weaker than the dominant lag‑1 relationship.
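The AR(1) reading of the PACF can be checked against a series whose structure is known. The sketch below simulates an AR(1) process with `arima.sim()` and inspects its PACF; the coefficient 0.6, the series length, and the seed are arbitrary assumptions chosen for illustration, not estimates from the IPL data.

```r
set.seed(1)

# Simulated AR(1) series: each value depends on the previous one
# with coefficient 0.6 (an arbitrary illustrative choice).
x <- arima.sim(model = list(ar = 0.6), n = 500)

p <- pacf(x, plot = FALSE)  # first element is lag 1
a <- acf(x, plot = FALSE)   # first element is lag 0

pacf_lag1 <- p$acf[1]
acf_lag1 <- a$acf[2]

# Approximate 95% significance bound.
bound <- qnorm(0.975) / sqrt(length(x))

# Number of higher-order partial autocorrelations outside the bound.
n_sig_higher <- sum(abs(p$acf[-1]) > bound)

cat("PACF lag 1:", pacf_lag1, "\n")
cat("ACF lag 1: ", acf_lag1, "\n")
cat("Higher lags outside bound:", n_sig_higher, "of", length(p$acf) - 1, "\n")
```

At lag 1 the partial autocorrelation equals the ordinary autocorrelation, since there are no intermediate lags to adjust for. For a true AR(1) process, a clear lag‑1 spike with few significant higher lags is the expected signature.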

Further question: Given the strong seasonality and weak global trend, would a seasonal ARIMA model capture IPL scoring patterns more effectively than a simple linear trend model?