This Data Dive explores the IPL Player Performance Dataset by:
Selecting the time‑based column (date) and a response variable to analyze over time
Creating a tsibble object
Plotting the time series across different windows of time
Fitting a linear regression to detect upward or downward trends
Smoothing the series with rolling averages and LOESS
Examining the ACF and PACF
library(tidyverse)  # assumed setup: attaches readr, dplyr and ggplot2, used throughout
ipl_raw <- read_csv("C:/mayangup/SP26/ipl-data_Dataset 1.csv")
## Rows: 24044 Columns: 22
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): player, team, match_outcome, opposition_team, venue
## dbl (16): match_id, runs, balls_faced, fours, sixes, wickets, overs_bowled,...
## date (1): date
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Note on data preparation: The dataset includes only 5 matches from 2025. Because that season is incomplete, it would distort the calculations, so all 2025 rows are filtered out and the rest of the analysis uses complete seasons only.
library(lubridate)  # provides year()
IPL <- ipl_raw |>
  mutate(
    date = as.Date(date),
    season = year(date)
  ) |>
  filter(season < 2025)  # drop the incomplete 2025 season
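A check of this kind can confirm which seasons are incomplete before filtering. The sketch below runs on hypothetical toy rows, not the real file; the column names match the dataset, and the values are made up. It collapses player‑level rows to one row per match and counts matches per season.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy rows: one row per player per match, as in the real data
toy <- data.frame(
  match_id = c(1, 1, 2, 2, 3, 3),
  date = as.Date(c("2024-03-22", "2024-03-22", "2024-03-23",
                   "2024-03-23", "2025-03-22", "2025-03-22")),
  runs = c(45, 12, 60, 3, 28, 51)
)

matches_per_season <- toy |>
  mutate(season = year(date)) |>
  distinct(season, match_id) |>      # one row per match
  count(season, name = "n_matches")  # matches played in each season

matches_per_season
```

On the real data, replacing `toy` with `ipl_raw` should show a 2025 count far below the other seasons.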
The response variable for the time‑series analysis is \(total\_runs\), which is created by aggregating the player‑level \(runs\) column into daily totals.
library(tsibble)  # provides as_tsibble()
ipl_daily_runs <- IPL |>
  group_by(date) |>
  summarise(total_runs = sum(runs), .groups = "drop") |>
  as_tsibble(index = date)
Daily \(total\_runs\) provides a meaningful time‑series representation of match‑day scoring intensity. The aggregation reduces roughly 24,000 player‑level rows to a clean date‑indexed series with one observation per match day.
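Because the series only has rows for dates on which matches were played, the tsibble has implicit gaps. The sketch below uses a hypothetical three‑row toy series to show one way to detect those gaps with `has_gaps()` and, if a regular daily series is wanted, to make them explicit with `fill_gaps()`. Filling with zero is an assumption made here for illustration; the analysis above does not do this.

```r
library(tibble)
library(tsibble)

# Hypothetical toy series: match days only, with 24 and 25 March missing
toy <- tibble(
  date = as.Date(c("2024-03-22", "2024-03-23", "2024-03-26")),
  total_runs = c(349, 702, 341)
) |>
  as_tsibble(index = date)

has_gaps(toy)  # reports whether any dates are missing from the index

# Optional: insert the missing dates explicitly, treating no-match days as 0 runs
toy_filled <- fill_gaps(toy, total_runs = 0)
toy_filled
```

Whether to fill gaps matters later: plots, rolling windows and autocorrelation lags all behave differently on a match‑day series than on a calendar‑day series.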
ipl_daily_runs |>
  ggplot(aes(date, total_runs)) +
  geom_line(color = "steelblue") +
  labs(
    title = "Daily IPL Runs (2008–2024)",
    y = "Total Runs"
  )
The plot above shows the response variable \(total\_runs\) over time. Each point represents the total number of runs scored across all IPL matches played on that date. The plot shows clear annual clusters (IPL seasons) and peaks that appear higher in recent seasons. The IPL’s usual seasonal window (roughly March–May) creates strong seasonality.
To consider different windows of time, the early IPL seasons (2008–2013) and the modern era (2018–2024) are plotted separately.
# Early seasons (2008–2013)
ipl_daily_runs |>
  filter(date < as.Date("2014-01-01")) |>
  ggplot(aes(date, total_runs)) +
  geom_line(color = "navy") +
  labs(title = "IPL Runs (2008–2013)")
# Modern era (2018–2024)
ipl_daily_runs |>
  filter(date >= as.Date("2018-01-01")) |>
  ggplot(aes(date, total_runs)) +
  geom_line(color = "firebrick") +
  labs(title = "IPL Runs (2018–2024)")
When comparing the two windows of time, a clear evolution in IPL scoring patterns emerges. In the early seasons (2008–2013), daily \(total\_runs\) values are noticeably lower and more stable, with peaks rarely exceeding \(500–600\) runs and relatively modest volatility. This reflects a more conservative style of play and fewer extreme high‑scoring matches. In contrast, the modern era (2018–2024) shows dramatically higher peaks, frequent spikes above \(600–700\) runs, and much greater variability within each season.
The 2023–2024 seasons, in particular, display the highest scoring levels in IPL history, likely influenced by more aggressive batting strategies and the introduction of the Impact Player rule. Together, these windows highlight a strong upward shift in scoring intensity over time, reinforcing the pattern observed in the full time‑series plot.
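The visual comparison of the two windows can be backed by summary statistics per era. The sketch below is a minimal illustration on hypothetical toy values; the era labels and the `era_summary` name are introduced here and are not part of the original analysis.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy daily totals spread across three eras
toy <- data.frame(
  date = as.Date(c("2010-04-01", "2010-04-02", "2016-04-10",
                   "2023-04-01", "2023-04-02")),
  total_runs = c(300, 320, 340, 380, 720)
)

era_summary <- toy |>
  mutate(era = case_when(
    year(date) <= 2013 ~ "early (2008-2013)",
    year(date) >= 2018 ~ "modern (2018-2024)",
    TRUE               ~ "middle (2014-2017)"
  )) |>
  group_by(era) |>
  summarise(
    n_days    = n(),               # number of match days in the era
    mean_runs = mean(total_runs),  # typical daily total
    sd_runs   = sd(total_runs),    # volatility (NA when only one day)
    max_runs  = max(total_runs),   # highest daily total
    .groups = "drop"
  )

era_summary
```

Applied to `ipl_daily_runs`, the mean and standard deviation columns would put numbers on the "lower and more stable" versus "higher and more variable" description.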
trend_model <- lm(total_runs ~ as.numeric(date), data = ipl_daily_runs)
summary(trend_model)
##
## Call:
## lm(formula = total_runs ~ as.numeric(date), data = ipl_daily_runs)
##
## Residuals:
## Min 1Q Median 3Q Max
## -299.86 -112.80 -59.88 138.55 460.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 460.955297 50.868312 9.062 <2e-16 ***
## as.numeric(date) -0.003502 0.002957 -1.184 0.237
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 154.3 on 821 degrees of freedom
## Multiple R-squared: 0.001705, Adjusted R-squared: 0.0004895
## F-statistic: 1.403 on 1 and 821 DF, p-value: 0.2366
The linear regression of total_runs on time does not show a statistically significant upward or downward trend across the full 2008–2024 period. The slope estimate for as.numeric(date) is very small (\(-0.0035\) runs per day, or roughly \(-1.3\) runs per year), and the p‑value of \(0.237\) indicates that it is not statistically distinguishable from zero. The R‑squared value is extremely low (\(0.0017\)), meaning the linear model explains almost none of the variation in daily total runs. This suggests that a single straight‑line trend is not an appropriate summary of the full time series.
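Since as.numeric(date) counts days, the slope is in runs per day, which is hard to read at this scale. The sketch below uses a hypothetical toy series built with an exact slope of -0.0035 runs per day to show how the coefficient can be extracted and rescaled to runs per year.

```r
# Hypothetical toy series with a known slope of -0.0035 runs per day
dates <- seq(as.Date("2008-04-18"), as.Date("2024-05-26"), by = "30 days")
toy <- data.frame(
  date = dates,
  total_runs = 460 - 0.0035 * as.numeric(dates)
)

fit <- lm(total_runs ~ as.numeric(date), data = toy)

slope_per_day  <- coef(fit)[["as.numeric(date)"]]
slope_per_year <- slope_per_day * 365.25  # average days per year

slope_per_year
```

Applying the same rescaling to the reported estimate of -0.003502 gives about -1.28 runs per year, which is negligible next to a residual standard error of 154.3.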
However, the earlier window (2008–2013) and the modern window (2018–2024) clearly show different scoring behaviors, with the modern era having much higher peaks and greater volatility. This indicates that the IPL likely contains multiple structural eras, and a single global trend line hides these differences. Therefore, subsetting the data into meaningful windows (early era vs. modern era) is necessary to detect changes in scoring patterns over time.
Overall, the global trend is weak, but the windowed plots reveal clear era‑specific differences: early seasons show lower, more stable scoring, while modern seasons show noticeably higher scoring intensity. This suggests that IPL scoring evolution is non‑linear and better captured by analyzing separate time windows than by a single linear trend across all years.
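One way to follow up on this is to fit the same linear model within each window and compare slopes. The sketch below is an assumption about how that could be done, using hypothetical noise‑free toy data in which the early era is flat and the modern era rises by 0.5 runs per day; it is not a result from the IPL data.

```r
# Hypothetical toy data: flat early era, rising modern era
early_dates  <- seq(as.Date("2010-04-01"), by = "day", length.out = 40)
modern_dates <- seq(as.Date("2022-04-01"), by = "day", length.out = 40)

toy <- data.frame(
  date = c(early_dates, modern_dates),
  era = rep(c("early", "modern"), each = 40),
  total_runs = c(rep(320, 40), 360 + 0.5 * (0:39))
)

# Fit one trend line per era and keep only the slope (runs per day)
era_slopes <- sapply(
  split(toy, toy$era),
  function(d) {
    coef(lm(total_runs ~ as.numeric(date), data = d))[["as.numeric(date)"]]
  }
)

era_slopes
```

A pooled fit across both eras would blend these two behaviours into one slope, which is the masking effect described above.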
library(slider)  # provides slide_dbl()
ipl_daily_runs <- ipl_daily_runs |>
  # 3-observation rolling mean: the current match day plus the two before it
  mutate(roll3 = slide_dbl(total_runs, mean, .before = 2))
ipl_daily_runs |>
  ggplot(aes(date)) +
  geom_line(aes(y = total_runs), alpha = 0.3) +
  geom_line(aes(y = roll3), color = "orange") +
  geom_smooth(aes(y = total_runs), method = "loess", span = 0.3, color = "red") +
  labs(title = "Smoothed IPL Daily Runs")
## `geom_smooth()` using formula = 'y ~ x'
Applying smoothing techniques reveals a clear seasonal pattern in the IPL scoring data. The raw daily totals fluctuate sharply within each year, but the 3‑observation rolling mean and the LOESS curve make the structure much easier to see. Each IPL season appears as a distinct block of match days, separated by long off‑season stretches with no matches. Because the series contains only match days, these stretches are gaps that the plotted lines bridge, not days with a total of zero. This repeating pattern confirms the presence of strong annual seasonality in the data. The LOESS line also shows that while the overall long‑term trend is relatively stable, the amplitude of the seasonal peaks increases over time, especially in the most recent seasons. This suggests that IPL scoring intensity has grown, with modern seasons producing higher run totals during peak months than earlier years.
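The rolling mean above is computed over observations, so a window can span an off‑season gap. The sketch below, on hypothetical toy values, contrasts that observation‑based window with a calendar‑based one from `slide_index_dbl()`. The calendar version is shown as an alternative under the assumption that a "last 3 days" window is what is wanted; it is not what the analysis above uses.

```r
library(slider)
library(lubridate)

# Hypothetical toy match days with a two-day gap after 23 March
dates <- as.Date(c("2024-03-22", "2024-03-23", "2024-03-26", "2024-03-27"))
runs  <- c(300, 600, 300, 900)

# Observation-based: current value plus the two previous rows
roll_obs <- slide_dbl(runs, mean, .before = 2)

# Same window, but NA until three rows are available
roll_obs_complete <- slide_dbl(runs, mean, .before = 2, .complete = TRUE)

# Calendar-based: current date plus the two previous calendar days
roll_cal <- slide_index_dbl(runs, dates, mean, .before = days(2))

data.frame(dates, runs, roll_obs, roll_obs_complete, roll_cal)
```

On 26 March the observation‑based mean still includes both matches from before the gap, while the calendar‑based mean uses only that day's total.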
acf(ipl_daily_runs$total_runs)
The ACF plot shows strong positive autocorrelation at short lags, meaning that daily IPL run totals are highly dependent on the values from recent match days. Because acf() is applied to the raw vector of observations, each lag is one match day rather than one calendar day. The autocorrelation decays slowly rather than dropping sharply, which suggests that the series is non‑stationary and contains long‑term structure. In addition, the ACF exhibits periodic bumps at multiple lags, reflecting the repeating pattern of IPL match clusters within each year. Together, these features point to strong seasonality and temporal dependence in the data.
pacf(ipl_daily_runs$total_runs)
The PACF plot shows a strong and statistically significant spike at lag 1, indicating that daily IPL run totals are directly related to the previous match day’s total. Beyond lag 1, most partial autocorrelations fall within the confidence bands, suggesting that higher‑order lags contribute little additional predictive information. This pattern is consistent with an AR(1)‑type structure, where short‑term dependence is strong but quickly diminishes. The PACF also shows small bumps at larger lags, which align with the seasonal clustering of IPL matches, but these effects are weaker than the dominant lag‑1 relationship.
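The readings above come from the plots; the same values can be inspected numerically. The sketch below uses a simulated AR(1) series as a hypothetical stand‑in for total_runs (the coefficient 0.7 is an arbitrary choice) and compares each partial autocorrelation with the approximate 95% band that the plots draw.

```r
set.seed(1)

# Hypothetical stand-in series: simulated AR(1) with coefficient 0.7
x <- arima.sim(model = list(ar = 0.7), n = 500)

# acf() includes lag 0 as its first value; pacf() starts at lag 1
acf_vals  <- as.numeric(acf(x, lag.max = 10, plot = FALSE)$acf)[-1]
pacf_vals <- as.numeric(pacf(x, lag.max = 10, plot = FALSE)$acf)

# Approximate 95% band under the null of no autocorrelation
band <- qnorm(0.975) / sqrt(length(x))

data.frame(
  lag = 1:10,
  acf = round(acf_vals, 3),
  pacf = round(pacf_vals, 3),
  pacf_outside_band = abs(pacf_vals) > band
)
```

For an AR(1)‑type series the ACF column shrinks gradually while the PACF column is large only at lag 1, which is the signature described above.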
Further question: Given the strong seasonality and weak global trend, would a seasonal ARIMA model capture IPL scoring patterns more effectively than a simple linear trend model?
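A first step toward answering this could be to fit both models to the same series and compare an information criterion. The sketch below only illustrates the mechanics with base R's `arima()` on simulated data. The seasonal period of 7 and the simulated coefficients are arbitrary assumptions, and the real match‑day series would first need to be put on a regular time grid with a sensible seasonal period before a comparison like this is meaningful.

```r
set.seed(1)

# Hypothetical series: seasonal dependence at lag 7, centred near 350 runs
sim <- arima.sim(model = list(ar = c(rep(0, 6), 0.5)), n = 400)
x <- 350 + 40 * as.numeric(sim)
t_index <- seq_along(x)

# Candidate 1: simple linear trend
trend_fit <- lm(x ~ t_index)

# Candidate 2: seasonal AR(1) with period 7, fitted by maximum likelihood
sarima_fit <- arima(
  x,
  order = c(0, 0, 0),
  seasonal = list(order = c(1, 0, 0), period = 7),
  method = "ML"
)

# Lower AIC indicates the better trade-off between fit and complexity
c(trend = AIC(trend_fit), sarima = AIC(sarima_fit))
```

On data with genuine seasonal dependence the seasonal model should come out with the lower AIC; whether that holds for IPL scoring is the open question.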