Aim. This tutorial teaches step-by-step how to
perform time series analysis and short-term forecasting on weekly sports
team scores using R.
It is written so a complete novice can reproduce the analysis: all code,
explanation and interpretation are included.
Learning outcomes (what the reader will learn): - How to prepare and visualise weekly time series data. - How to decompose the series (trend / seasonal / remainder). - How to test for stationarity and apply differencing. - How to fit and compare ARIMA and ETS models. - How to forecast and evaluate model accuracy on a holdout set.
1.Setup — packages and environment
Explain: Install packages only if needed. Load libraries and set a random seed for reproducibility.
# Install packages if missing (uncomment to run)
# install.packages(c("tidyverse","forecast","lubridate","tseries","tsibble","feasts","scales"))
library(tidyverse)
library(forecast)
library(lubridate)
library(tseries)
library(tsibble)
library(feasts)
library(scales)
set.seed(42) # reproducible simulation
options(digits = 4)
2. Example dataset (self-contained)
# Simulate 4 years of weekly data (~208 weeks)
set.seed(42)
weeks <- seq.Date(from = as.Date("2019-01-01"), by = "week", length.out = 4 * 52)
n <- length(weeks)
# Simulate weekly team points:
# baseline 50 + slow upward trend + annual seasonality + random noise
points <- 50 + 0.03 * (1:n) + 6 * sin(2 * pi * (1:n) / 52) + rnorm(n, sd = 3)
# Combine into a tibble (data frame)
df <- tibble(
date = weeks,
team_points = round(points, 1)
)
# View first few rows
head(df)
## # A tibble: 6 × 2
## date team_points
## <date> <dbl>
## 1 2019-01-01 54.9
## 2 2019-01-08 49.8
## 3 2019-01-15 53.3
## 4 2019-01-22 54.8
## 5 2019-01-29 54.8
## 6 2019-02-05 53.8
# Optional: write to CSV if you want to save the simulated dataset
# write_csv(df, "simulated_weekly_team_points.csv")
# Quick visual check
ggplot(df, aes(x = date, y = team_points)) +
geom_line(color = "steelblue") +
geom_point(size = 0.8) +
labs(
title = "Simulated Weekly Team Points (2019–2022)",
x = "Date",
y = "Team Points"
) +
theme_minimal()
3.Using your own CSV (optional)
# Suppose you have a CSV file named "my_sports_data.csv"
# It should contain at least two columns:
# - date (in YYYY-MM-DD format)
# - team_points (numeric performance metric)
# Example of how your CSV should look:
# date,team_points
# 2019-01-01,48
# 2019-01-08,52
# 2019-01-15,55
# ...
# Step 1: Set working directory to where your CSV is located
# Step 2: Read your CSV file
library(readr)
library(tibble)
library(ggplot2)
write_csv(df, "my_sports_data.csv")
# Step 3: Check the first few rows
head(df)
## # A tibble: 6 × 2
## date team_points
## <date> <dbl>
## 1 2019-01-01 54.9
## 2 2019-01-08 49.8
## 3 2019-01-15 53.3
## 4 2019-01-22 54.8
## 5 2019-01-29 54.8
## 6 2019-02-05 53.8
# Step 4: Check column types
str(df)
## tibble [208 × 2] (S3: tbl_df/tbl/data.frame)
## $ date : Date[1:208], format: "2019-01-01" "2019-01-08" ...
## $ team_points: num [1:208] 54.9 49.8 53.3 54.8 54.8 53.8 59.2 54.9 61.6 55.7 ...
# Step 5: Convert the date column to proper Date format (if not already)
df$date <- as.Date(df$date)
# Step 6: Visualize your data
ggplot(df, aes(x = date, y = team_points)) +
geom_line(color = "darkorange") +
geom_point(size = 0.8) +
labs(
title = "Weekly Team Performance from CSV Data",
x = "Date",
y = "Team Points"
) +
theme_minimal()
4.Exploratory data analysis
Plot the raw series and show summary statistics.
glimpse(df)
## Rows: 208
## Columns: 2
## $ date <date> 2019-01-01, 2019-01-08, 2019-01-15, 2019-01-22, 2019-01-2…
## $ team_points <dbl> 54.9, 49.8, 53.3, 54.8, 54.8, 53.8, 59.2, 54.9, 61.6, 55.7…
summary(df$team_points)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.9 49.1 53.4 53.0 56.3 67.6
# Time plot
ggplot(df, aes(x = date, y = team_points)) +
geom_line() + geom_point(size = 0.8) +
labs(title = "Weekly Team Points", x = "Date", y = "Points") +
theme_minimal()
5.Convert to a time series object
For regular weekly data we can use ts() with frequency 52. If irregular, use tsibble.
start_year <- year(min(df$date))
start_week <- isoweek(min(df$date)) # ISO week as integer
# Approximate start as (year, week) for ts()
ts_data <- ts(df$team_points, start = c(start_year, start_week), frequency = 52)
# Simple plot with forecast::autoplot
autoplot(ts_data) + labs(title = "Team Points (ts object)", x = "Year", y = "Points")
Note: frequency=52 approximates weekly seasonality. If your competition uses irregular weeks, prefer tsibble.
df_tsibble <- df %>%
mutate(week = yearweek(date)) %>%
as_tsibble(index = week)
df_tsibble %>% autoplot(team_points) + labs(title = "Team Points (tsibble)")
6.Decomposition (STL)
Use STL to inspect trend and seasonality.
stl_fit <- stl(ts_data, s.window = "periodic", robust = TRUE)
autoplot(stl_fit) + labs(title = "STL Decomposition")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the forecast package.
## Please report the issue at <https://github.com/robjhyndman/forecast/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
7.Stationarity tests and differencing
Use ACF/PACF and Augmented Dickey-Fuller (ADF). If p-value > 0.05, series likely non-stationary.
par(mfrow = c(2,1))
Acf(ts_data, main = "ACF - original")
Pacf(ts_data, main = "PACF - original")
par(mfrow = c(1,1))
adf <- adf.test(ts_data, alternative = "stationary")
adf
##
## Augmented Dickey-Fuller Test
##
## data: ts_data
## Dickey-Fuller = -2.5, Lag order = 5, p-value = 0.4
## alternative hypothesis: stationary
nd <- ndiffs(ts_data)
nd
## [1] 0
8.Model building: ARIMA and ETS
Fit two competitive models — automatic ARIMA and ETS.
# ARIMA (auto)
fit_arima <- auto.arima(ts_data, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)
fit_arima
## Series: ts_data
## ARIMA(0,0,0)(1,1,0)[52] with drift
##
## Coefficients:
## sar1 drift
## -0.405 0.027
## s.e. 0.087 0.005
##
## sigma^2 = 14.8: log likelihood = -435
## AIC=876 AICc=876.1 BIC=885.1
# ETS
fit_ets <- ets(ts_data)
## Warning in ets(ts_data): I can't handle data with frequency greater than 24.
## Seasonality will be ignored. Try stlf() if you need seasonal forecasts.
fit_ets
## ETS(A,Ad,N)
##
## Call:
## ets(y = ts_data)
##
## Smoothing parameters:
## alpha = 0.193
## beta = 0.0872
## phi = 0.8432
##
## Initial states:
## l = 49.8304
## b = 1.8244
##
## sigma: 3.49
##
## AIC AICc BIC
## 1637 1638 1657
ARIMA Model Summary Model: ARIMA(0,0,0)(1,1,0)[52] with drift Key output: sar1 = -0.405, drift = 0.027, AIC = 876, σ² = 14.8
Interpretation: This ARIMA model includes one seasonal autoregressive term and one seasonal difference with a 52-week cycle, capturing yearly performance patterns. The negative SAR(1) coefficient (–0.405) suggests that strong performance in one season tends to be followed by slightly weaker performance the next. The positive drift (0.027) indicates a gradual upward trend in team performance over time. The relatively low AIC (876) shows a good overall fit. This model effectively captures both trend and seasonality, making it suitable for forecasting future performance.
ETS Model Model: ETS(A,Ad,N) Key output: α = 0.193, β = 0.087, φ = 0.843, AIC = 1637, σ = 3.49
Interpretation: The ETS model uses additive errors and a damped trend, meaning performance increases over time but at a slowing rate. Because ETS cannot handle a 52-week seasonal period, seasonality was ignored, leading to a higher AIC (1637) and less accurate fit than ARIMA. This model captures the general trend but misses annual cycles, making it less effective for weekly sports data.
9. Forecasting (12 week horizon) and plot
Forecast short-term; choose horizon relevant to your sport (12 weeks here).
h <- 12
fc_arima <- forecast(fit_arima, h = h, level = c(80,95))
fc_ets <- forecast(fit_ets, h = h, level = c(80,95))
autoplot(ts_data) +
autolayer(fc_arima, series = "ARIMA", PI = TRUE) +
autolayer(fc_ets, series = "ETS", PI = TRUE) +
labs(title = paste("Forecasts for next", h, "weeks"), y = "Points") +
guides(colour = guide_legend(title = "Model"))
The ARIMA forecast shows a gradual upward trend with seasonal
fluctuations, and its prediction intervals widen over time, reflecting
growing uncertainty. The ETS forecast is smoother with no seasonal
pattern and slightly narrower intervals, but it underestimates
variability. Overall, ARIMA gives a more realistic forecast, capturing
both trend and seasonality, while ETS provides a simpler trend-only
outlook.
10. Train/test evaluation (backtesting)
Use last 12 weeks as a test set. Fit models on the training portion and compare accuracy metrics (MAE, RMSE, MAPE).
test_length <- 12
n_total <- length(ts_data)
train_end <- n_total - test_length
train_ts <- window(ts_data, end = time(ts_data)[train_end])
test_ts <- window(ts_data, start = time(ts_data)[train_end + 1])
# Fit on train
arima_train <- auto.arima(train_ts, seasonal = TRUE)
ets_train <- ets(train_ts)
## Warning in ets(train_ts): I can't handle data with frequency greater than 24.
## Seasonality will be ignored. Try stlf() if you need seasonal forecasts.
fc_arima_train <- forecast(arima_train, h = test_length)
fc_ets_train <- forecast(ets_train, h = test_length)
# Accuracy
acc_arima <- accuracy(fc_arima_train, test_ts)
acc_ets <- accuracy(fc_ets_train, test_ts)
acc_arima
## ME RMSE MAE MPE MAPE MASE ACF1 Theil's U
## Training set 0.0599 3.004 2.044 -0.04262 3.834 0.5937 -0.0001621 NA
## Test set 0.5677 4.282 3.679 0.66602 6.948 1.0682 0.0523524 0.9232
acc_ets
## ME RMSE MAE MPE MAPE MASE ACF1 Theil's U
## Training set -0.08096 3.403 2.669 -0.4242 5.120 0.7752 0.01576 NA
## Test set 3.85976 5.113 4.418 6.9354 8.163 1.2830 0.06848 1.099
Between the two models, the ARIMA model shows lower RMSE (4.28) and MAE (3.68) on the test set compared to the ETS model (RMSE 5.11, MAE 4.42), indicating better predictive accuracy. The MAPE for ARIMA (6.9%) is also lower than ETS (8.2%), showing smaller average percentage errors. Overall, the ARIMA model performs better on all key metrics and is therefore the preferred forecasting model.
11. Residual diagnostics
# use fit_arima (or whichever performed better)
checkresiduals(fit_arima)
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0)(1,1,0)[52] with drift
## Q* = 41, df = 41, p-value = 0.5
##
## Model df: 1. Total lags used: 42
The Ljung–Box test gives a p-value of 0.5, which is well above 0.05. This means we fail to reject the null hypothesis that residuals are white noise. Therefore, the residuals show no significant autocorrelation, indicating that the ARIMA model fits the data well and captures the main structure of the time series.
12. Save forecasts and reproducibility
Save forecast table and include code to export results.
fc_df <- tibble(
date = seq.Date(from = as.Date(df$date[nrow(df)]) + 7, by = "week", length.out = h),
mean = as.numeric(fc_arima$mean),
lo80 = as.numeric(fc_arima$lower[,1]),
hi80 = as.numeric(fc_arima$upper[,1]),
lo95 = as.numeric(fc_arima$lower[,2]),
hi95 = as.numeric(fc_arima$upper[,2])
)
fc_df
## # A tibble: 12 × 6
## date mean lo80 hi80 lo95 hi95
## <date> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2022-12-27 56.0 51.1 60.9 48.5 63.5
## 2 2023-01-03 56.2 51.3 61.2 48.7 63.8
## 3 2023-01-10 56.8 51.9 61.7 49.3 64.3
## 4 2023-01-17 61.2 56.2 66.1 53.6 68.7
## 5 2023-01-24 59.5 54.6 64.4 52.0 67.0
## 6 2023-01-31 58.4 53.5 63.3 50.9 65.9
## 7 2023-02-07 61.0 56.1 65.9 53.5 68.5
## 8 2023-02-14 60.7 55.8 65.6 53.2 68.3
## 9 2023-02-21 62.1 57.1 67.0 54.5 69.6
## 10 2023-02-28 63.9 59.0 68.8 56.4 71.4
## 11 2023-03-07 58.4 53.5 63.4 50.9 66.0
## 12 2023-03-14 62.7 57.8 67.7 55.2 70.3
# write_csv(fc_df, "forecast_arima_next12weeks.csv")
13. Interpretation, limitations and suggestion
The ARIMA model provided the most accurate forecasts, effectively capturing the trend and seasonal patterns in team performance. However, it assumes past patterns continue unchanged, which may not hold if unexpected events or changes in strategy occur. The ETS model was simpler but ignored seasonality, reducing accuracy.
Future improvements could include adding external variables (e.g., player stats, injuries, opponent strength), testing hybrid models (ARIMA + regression or machine learning), and using longer data periods to enhance reliability.