Introduction

Aim. This tutorial teaches step-by-step how to perform time series analysis and short-term forecasting on weekly sports team scores using R.
It is written so a complete novice can reproduce the analysis: all code, explanation and interpretation are included.

Learning outcomes (what the reader will learn): - How to prepare and visualise weekly time series data. - How to decompose the series (trend / seasonal / remainder). - How to test for stationarity and apply differencing. - How to fit and compare ARIMA and ETS models. - How to forecast and evaluate model accuracy on a holdout set.

1.Setup — packages and environment

Explain: Install packages only if needed. Load libraries and set a random seed for reproducibility.

# Install packages if missing (uncomment to run)
# install.packages(c("tidyverse","forecast","lubridate","tseries","tsibble","feasts","scales"))

library(tidyverse)
library(forecast)
library(lubridate)
library(tseries)
library(tsibble)
library(feasts)
library(scales)

set.seed(42)  # reproducible simulation
options(digits = 4)

2. Example dataset (self-contained)

# Simulate 4 years of weekly data (~208 weeks)
set.seed(42)
weeks <- seq.Date(from = as.Date("2019-01-01"), by = "week", length.out = 4 * 52)
n <- length(weeks)

# Simulate weekly team points:
# baseline 50 + slow upward trend + annual seasonality + random noise
points <- 50 + 0.03 * (1:n) + 6 * sin(2 * pi * (1:n) / 52) + rnorm(n, sd = 3)

# Combine into a tibble (data frame)
df <- tibble(
  date = weeks,
  team_points = round(points, 1)
)

# View first few rows
head(df)

## # A tibble: 6 × 2
##   date       team_points
##   <date>           <dbl>
## 1 2019-01-01        54.9
## 2 2019-01-08        49.8
## 3 2019-01-15        53.3
## 4 2019-01-22        54.8
## 5 2019-01-29        54.8
## 6 2019-02-05        53.8

# Optional: write to CSV if you want to save the simulated dataset
# write_csv(df, "simulated_weekly_team_points.csv")

# Quick visual check
ggplot(df, aes(x = date, y = team_points)) +
  geom_line(color = "steelblue") +
  geom_point(size = 0.8) +
  labs(
    title = "Simulated Weekly Team Points (2019–2022)",
    x = "Date",
    y = "Team Points"
  ) +
  theme_minimal()

3.Using your own CSV (optional)

# Suppose you have a CSV file named "my_sports_data.csv"
# It should contain at least two columns:
#  - date (in YYYY-MM-DD format)
#  - team_points (numeric performance metric)

# Example of how your CSV should look:
# date,team_points
# 2019-01-01,48
# 2019-01-08,52
# 2019-01-15,55
# ...

# Step 1: Set working directory to where your CSV is located

# Step 2: Read your CSV file
library(readr)
library(tibble)
library(ggplot2)

write_csv(df, "my_sports_data.csv")

# Step 3: Check the first few rows
head(df)

## # A tibble: 6 × 2
##   date       team_points
##   <date>           <dbl>
## 1 2019-01-01        54.9
## 2 2019-01-08        49.8
## 3 2019-01-15        53.3
## 4 2019-01-22        54.8
## 5 2019-01-29        54.8
## 6 2019-02-05        53.8

# Step 4: Check column types
str(df)

## tibble [208 × 2] (S3: tbl_df/tbl/data.frame)
##  $ date       : Date[1:208], format: "2019-01-01" "2019-01-08" ...
##  $ team_points: num [1:208] 54.9 49.8 53.3 54.8 54.8 53.8 59.2 54.9 61.6 55.7 ...

# Step 5: Convert the date column to proper Date format (if not already)
df$date <- as.Date(df$date)

# Step 6: Visualize your data
ggplot(df, aes(x = date, y = team_points)) +
  geom_line(color = "darkorange") +
  geom_point(size = 0.8) +
  labs(
    title = "Weekly Team Performance from CSV Data",
    x = "Date",
    y = "Team Points"
  ) +
  theme_minimal()

4.Exploratory data analysis

Plot the raw series and show summary statistics.

glimpse(df)

## Rows: 208
## Columns: 2
## $ date        <date> 2019-01-01, 2019-01-08, 2019-01-15, 2019-01-22, 2019-01-2…
## $ team_points <dbl> 54.9, 49.8, 53.3, 54.8, 54.8, 53.8, 59.2, 54.9, 61.6, 55.7…

summary(df$team_points)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    37.9    49.1    53.4    53.0    56.3    67.6

# Time plot

ggplot(df, aes(x = date, y = team_points)) +
geom_line() + geom_point(size = 0.8) +
labs(title = "Weekly Team Points", x = "Date", y = "Points") +
theme_minimal()

5.Convert to a time series object

For regular weekly data we can use ts() with frequency 52. If irregular, use tsibble.

start_year <- year(min(df$date))
start_week <- isoweek(min(df$date))  # ISO week as integer

# Approximate start as (year, week) for ts()

ts_data <- ts(df$team_points, start = c(start_year, start_week), frequency = 52)

# Simple plot with forecast::autoplot

autoplot(ts_data) + labs(title = "Team Points (ts object)", x = "Year", y = "Points")

Note: frequency=52 approximates weekly seasonality. If your competition uses irregular weeks, prefer tsibble.

df_tsibble <- df %>%
mutate(week = yearweek(date)) %>%
as_tsibble(index = week)

df_tsibble %>% autoplot(team_points) + labs(title = "Team Points (tsibble)")

6.Decomposition (STL)

Use STL to inspect trend and seasonality.

stl_fit <- stl(ts_data, s.window = "periodic", robust = TRUE)
autoplot(stl_fit) + labs(title = "STL Decomposition")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the forecast package.
##   Please report the issue at <https://github.com/robjhyndman/forecast/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

7.Stationarity tests and differencing

Use ACF/PACF and Augmented Dickey-Fuller (ADF). If p-value > 0.05, series likely non-stationary.

par(mfrow = c(2,1))
Acf(ts_data, main = "ACF - original")
Pacf(ts_data, main = "PACF - original")

par(mfrow = c(1,1))

adf <- adf.test(ts_data, alternative = "stationary")
adf

## 
##  Augmented Dickey-Fuller Test
## 
## data:  ts_data
## Dickey-Fuller = -2.5, Lag order = 5, p-value = 0.4
## alternative hypothesis: stationary

nd <- ndiffs(ts_data)
nd

## [1] 0

8.Model building: ARIMA and ETS

Fit two competitive models — automatic ARIMA and ETS.

# ARIMA (auto)

fit_arima <- auto.arima(ts_data, seasonal = TRUE, stepwise = FALSE, approximation = FALSE)
fit_arima

## Series: ts_data 
## ARIMA(0,0,0)(1,1,0)[52] with drift 
## 
## Coefficients:
##         sar1  drift
##       -0.405  0.027
## s.e.   0.087  0.005
## 
## sigma^2 = 14.8:  log likelihood = -435
## AIC=876   AICc=876.1   BIC=885.1

# ETS

fit_ets <- ets(ts_data)

## Warning in ets(ts_data): I can't handle data with frequency greater than 24.
## Seasonality will be ignored. Try stlf() if you need seasonal forecasts.

fit_ets

## ETS(A,Ad,N) 
## 
## Call:
## ets(y = ts_data)
## 
##   Smoothing parameters:
##     alpha = 0.193 
##     beta  = 0.0872 
##     phi   = 0.8432 
## 
##   Initial states:
##     l = 49.8304 
##     b = 1.8244 
## 
##   sigma:  3.49
## 
##  AIC AICc  BIC 
## 1637 1638 1657

ARIMA Model Summary Model: ARIMA(0,0,0)(1,1,0)[52] with drift Key output: sar1 = -0.405, drift = 0.027, AIC = 876, σ² = 14.8

Interpretation: This ARIMA model includes one seasonal autoregressive term and one seasonal difference with a 52-week cycle, capturing yearly performance patterns. The negative SAR(1) coefficient (–0.405) suggests that strong performance in one season tends to be followed by slightly weaker performance the next. The positive drift (0.027) indicates a gradual upward trend in team performance over time. The relatively low AIC (876) shows a good overall fit. This model effectively captures both trend and seasonality, making it suitable for forecasting future performance.

ETS Model Model: ETS(A,Ad,N) Key output: α = 0.193, β = 0.087, φ = 0.843, AIC = 1637, σ = 3.49

Interpretation: The ETS model uses additive errors and a damped trend, meaning performance increases over time but at a slowing rate. Because ETS cannot handle a 52-week seasonal period, seasonality was ignored, leading to a higher AIC (1637) and less accurate fit than ARIMA. This model captures the general trend but misses annual cycles, making it less effective for weekly sports data.

9. Forecasting (12 week horizon) and plot

Forecast short-term; choose horizon relevant to your sport (12 weeks here).

h <- 12
fc_arima <- forecast(fit_arima, h = h, level = c(80,95))
fc_ets   <- forecast(fit_ets, h = h, level = c(80,95))

autoplot(ts_data) +
autolayer(fc_arima, series = "ARIMA", PI = TRUE) +
autolayer(fc_ets, series = "ETS", PI = TRUE) +
labs(title = paste("Forecasts for next", h, "weeks"), y = "Points") +
guides(colour = guide_legend(title = "Model"))

The ARIMA forecast shows a gradual upward trend with seasonal fluctuations, and its prediction intervals widen over time, reflecting growing uncertainty. The ETS forecast is smoother with no seasonal pattern and slightly narrower intervals, but it underestimates variability. Overall, ARIMA gives a more realistic forecast, capturing both trend and seasonality, while ETS provides a simpler trend-only outlook.

10. Train/test evaluation (backtesting)

Use last 12 weeks as a test set. Fit models on the training portion and compare accuracy metrics (MAE, RMSE, MAPE).

test_length <- 12
n_total <- length(ts_data)
train_end <- n_total - test_length

train_ts <- window(ts_data, end = time(ts_data)[train_end])
test_ts  <- window(ts_data, start = time(ts_data)[train_end + 1])

# Fit on train

arima_train <- auto.arima(train_ts, seasonal = TRUE)
ets_train   <- ets(train_ts)

## Warning in ets(train_ts): I can't handle data with frequency greater than 24.
## Seasonality will be ignored. Try stlf() if you need seasonal forecasts.

fc_arima_train <- forecast(arima_train, h = test_length)
fc_ets_train   <- forecast(ets_train, h = test_length)

# Accuracy

acc_arima <- accuracy(fc_arima_train, test_ts)
acc_ets   <- accuracy(fc_ets_train, test_ts)

acc_arima

##                  ME  RMSE   MAE      MPE  MAPE   MASE       ACF1 Theil's U
## Training set 0.0599 3.004 2.044 -0.04262 3.834 0.5937 -0.0001621        NA
## Test set     0.5677 4.282 3.679  0.66602 6.948 1.0682  0.0523524    0.9232

acc_ets

##                    ME  RMSE   MAE     MPE  MAPE   MASE    ACF1 Theil's U
## Training set -0.08096 3.403 2.669 -0.4242 5.120 0.7752 0.01576        NA
## Test set      3.85976 5.113 4.418  6.9354 8.163 1.2830 0.06848     1.099

Between the two models, the ARIMA model shows lower RMSE (4.28) and MAE (3.68) on the test set compared to the ETS model (RMSE 5.11, MAE 4.42), indicating better predictive accuracy. The MAPE for ARIMA (6.9%) is also lower than ETS (8.2%), showing smaller average percentage errors. Overall, the ARIMA model performs better on all key metrics and is therefore the preferred forecasting model.

11. Residual diagnostics

# use fit_arima (or whichever performed better)

checkresiduals(fit_arima)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(1,1,0)[52] with drift
## Q* = 41, df = 41, p-value = 0.5
## 
## Model df: 1.   Total lags used: 42

The Ljung–Box test gives a p-value of 0.5, which is well above 0.05. This means we fail to reject the null hypothesis that residuals are white noise. Therefore, the residuals show no significant autocorrelation, indicating that the ARIMA model fits the data well and captures the main structure of the time series.

12. Save forecasts and reproducibility

Save forecast table and include code to export results.

fc_df <- tibble(
date = seq.Date(from = as.Date(df$date[nrow(df)]) + 7, by = "week", length.out = h),
mean = as.numeric(fc_arima$mean),
lo80 = as.numeric(fc_arima$lower[,1]),
hi80 = as.numeric(fc_arima$upper[,1]),
lo95 = as.numeric(fc_arima$lower[,2]),
hi95 = as.numeric(fc_arima$upper[,2])
)

fc_df

## # A tibble: 12 × 6
##    date        mean  lo80  hi80  lo95  hi95
##    <date>     <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 2022-12-27  56.0  51.1  60.9  48.5  63.5
##  2 2023-01-03  56.2  51.3  61.2  48.7  63.8
##  3 2023-01-10  56.8  51.9  61.7  49.3  64.3
##  4 2023-01-17  61.2  56.2  66.1  53.6  68.7
##  5 2023-01-24  59.5  54.6  64.4  52.0  67.0
##  6 2023-01-31  58.4  53.5  63.3  50.9  65.9
##  7 2023-02-07  61.0  56.1  65.9  53.5  68.5
##  8 2023-02-14  60.7  55.8  65.6  53.2  68.3
##  9 2023-02-21  62.1  57.1  67.0  54.5  69.6
## 10 2023-02-28  63.9  59.0  68.8  56.4  71.4
## 11 2023-03-07  58.4  53.5  63.4  50.9  66.0
## 12 2023-03-14  62.7  57.8  67.7  55.2  70.3

# write_csv(fc_df, "forecast_arima_next12weeks.csv")

13. Interpretation, limitations and suggestion

The ARIMA model provided the most accurate forecasts, effectively capturing the trend and seasonal patterns in team performance. However, it assumes past patterns continue unchanged, which may not hold if unexpected events or changes in strategy occur. The ETS model was simpler but ignored seasonality, reducing accuracy.

Future improvements could include adding external variables (e.g., player stats, injuries, opponent strength), testing hybrid models (ARIMA + regression or machine learning), and using longer data periods to enhance reliability.

Time series

THEDINI JAYAWARDANA

2025-10-29

Introduction