suppressPackageStartupMessages({
library(fpp3) # Main time series package
library(patchwork) # For combining plots
library(broom) # For model summaries
library(stargazer) # For regression tables
})
?AirPassengers
# Load and prepare the AirPassengers data
AP_data <- AirPassengers %>%
as_tsibble() %>%
mutate(
t = row_number(), # Time trend variable
t_squared = t^2, # Quadratic trend
log_value = log(value) # Log transformation
)
# Display first few rows
head(AP_data)Time Series Linear Regression (TSLM) with fpp3
Introduction
Time series linear regression is a fundamental technique for modeling time series data with deterministic patterns like trends and seasonality. Unlike simple linear regression, time series regression must account for the temporal dependencies in the data.
In this analysis, we’ll examine the classic AirPassengers dataset - monthly international airline passenger numbers from 1949-1960. We’ll build three increasingly sophisticated models:
- Model 1: Quadratic trend only
- Model 2: Quadratic trend + seasonal dummies
- Model 3: Log-linear with trend + seasonality
Setup and Data Preparation
The as_tsibble() function converts the classic R time series object into the modern fpp3 format. We create additional variables for modeling:
-
t: Linear time trend (1, 2, 3, …) -
t_squared: Quadratic trend for curved relationships -
log_value: Log transformation for multiplicative patterns
Exploratory Data Analysis
??ggplot2
??ggplot2::autoplot
# Create side-by-side plots
p1 <- AP_data %>%
autoplot(value) +
labs(title = "Original AirPassengers Data",
y = "Passengers (thousands)",
x = "Year")
p2 <- AP_data %>%
autoplot(log_value) +
labs(title = "Log-Transformed Data",
y = "Log(Passengers)",
x = "Year")
# Display plots together
p1 / p2Key Observations:
- Original data shows clear upward trend and seasonal pattern
- Seasonality appears multiplicative - seasonal fluctuations increase over time
- Log transformation converts multiplicative seasonality to additive
- Both show strong annual cycles with summer peaks
ACF/PACF Analysis for Model Identification
# Analyze autocorrelation patterns
AP_data %>%
gg_tsdisplay(value, plot_type = "partial", lag_max = 36) +
labs(title = "ACF/PACF Analysis: Original Series")Warning: `gg_tsdisplay()` was deprecated in feasts 0.4.2.
ℹ Please use `ggtime::gg_tsdisplay()` instead.
ACF/PACF Interpretation:
- ACF (top): Slow decay + seasonal spikes at lags 12, 24, 36 → non-stationary with trend and seasonality
- PACF (bottom): Strong spike at lag 1, then smaller oscillations → suggests autoregressive structure
- Seasonal pattern: Clear 12-month cycle confirms annual seasonality
- Conclusion: Series needs detrending and deseasoning before ARIMA modeling
# ACF/PACF of log-transformed data
AP_data %>%
gg_tsdisplay(log_value, plot_type = "partial", lag_max = 36) +
labs(title = "ACF/PACF Analysis: Log-Transformed Series")Log-Transformed Analysis:
- Log transformation stabilizes variance but doesn’t remove trend/seasonality
- ACF pattern similar to original, confirming need for regression approach (explicit modeling of these components i.e. regression with trend + seasonal terms)
- This motivates our Model 3 (log-linear regression)
Model Development
Model 1: Quadratic Trend Only
The first model captures only the long-term growth pattern: \[AP_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \epsilon_t\]
Series: value
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-100.353 -27.339 -7.442 21.603 146.116
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.124e+02 1.138e+01 9.872 < 2e-16 ***
trend() 1.641e+00 3.625e-01 4.527 1.26e-05 ***
I(trend()^2) 7.008e-03 2.421e-03 2.894 0.00441 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 44.91 on 141 degrees of freedom
Multiple R-squared: 0.8618, Adjusted R-squared: 0.8599
F-statistic: 439.8 on 2 and 141 DF, p-value: < 2.22e-16
Model 1 Interpretation:
- Intercept: Starting level of passengers
- trend(): Linear growth rate per month
- I(trend()^2): Quadratic term captures accelerating growth
- R-squared: Proportion of variance explained by trend alone
# Extract fitted values and residuals
aug1 <- augment(model1)
# Plot fitted vs actual
autoplot(AP_data, value) +
autolayer(aug1, .fitted, color = "red", size = 1) +
labs(title = "Model 1: Quadratic Trend Fit",
y = "Passengers (thousands)",
subtitle = "Red line shows fitted values")Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
ℹ The deprecated feature was likely used in the fabletools package.
Please report the issue at <https://github.com/tidyverts/fabletools/issues>.
Model 1 Assessment:
- Captures overall upward trend reasonably well
- Missing seasonal patterns - systematic over/under-prediction
- Residuals will show strong seasonal autocorrelation
Model 2: Trend + Seasonality
Model 2 adds seasonal dummy variables to capture monthly effects: \[AP_t = \beta_0 + \beta_1 t + \beta_2 t^2 + \gamma_1 dm1_t + ... + \gamma_{11} dm11_t + \epsilon_t\]
# Fit Model 2: Trend + seasonal dummies
model2 <- AP_data %>%
model(Model2 = TSLM(value ~ trend() + I(trend()^2) + season()
)
)
# Display model summary
report(model2)Series: value
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-46.1215 -13.3935 0.8246 12.7327 75.7730
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 88.557838 8.799762 10.064 < 2e-16 ***
trend() 1.625504 0.191722 8.478 4.34e-14 ***
I(trend()^2) 0.007137 0.001281 5.573 1.38e-07 ***
season()year2 -9.338962 9.694487 -0.963 0.337172
season()year3 23.224469 9.694859 2.396 0.018021 *
season()year4 17.523627 9.695469 1.807 0.073012 .
season()year5 19.641845 9.696310 2.026 0.044842 *
season()year6 56.829122 9.697379 5.860 3.59e-08 ***
season()year7 93.835460 9.698673 9.675 < 2e-16 ***
season()year8 90.910857 9.700192 9.372 2.91e-16 ***
season()year9 39.555314 9.701939 4.077 7.90e-05 ***
season()year10 1.018831 9.703917 0.105 0.916544
season()year11 -35.448592 9.706131 -3.652 0.000376 ***
season()year12 -9.180288 9.708591 -0.946 0.346115
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 23.75 on 130 degrees of freedom
Multiple R-squared: 0.9644, Adjusted R-squared: 0.9608
F-statistic: 270.8 on 13 and 130 DF, p-value: < 2.22e-16
Model 2 Features:
- season() automatically creates 11 dummy variables (December = reference)
- Seasonal coefficients show relative difference vs December
- Much higher R-squared indicates better fit
# Extract fitted values and residuals
aug2 <- augment(model2)
# Plot fitted vs actual
autoplot(AP_data, value) +
autolayer(aug2, .fitted, color = "blue", size = 1) +
labs(title = "Model 2: Trend + Seasonal Fit",
y = "Passengers (thousands)",
subtitle = "Blue line shows fitted values")Model 2 Assessment: - Excellent fit - captures both trend and seasonality - Residuals should show much less autocorrelation - Still assumes additive seasonality
Model 3: Log-Linear Regression
Model 3 uses log transformation to handle multiplicative seasonality: \[\log(AP_t) = \beta_0 + \beta_1 t + \beta_2 t^2 + \gamma_1 dm1_t + ... + \gamma_{11} dm11_t + \epsilon_t\]
Series: value
Model: TSLM
Transformation: log(value)
Residuals:
Min 1Q Median 3Q Max
-0.12748 -0.03709 0.00418 0.03197 0.11529
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.651e+00 1.786e-02 260.410 < 2e-16 ***
trend() 1.318e-02 3.892e-04 33.877 < 2e-16 ***
I(trend()^2) -2.148e-05 2.599e-06 -8.265 1.41e-13 ***
season()year2 -2.227e-02 1.968e-02 -1.132 0.259839
season()year3 1.078e-01 1.968e-02 5.477 2.15e-07 ***
season()year4 7.639e-02 1.968e-02 3.882 0.000164 ***
season()year5 7.393e-02 1.968e-02 3.756 0.000259 ***
season()year6 1.960e-01 1.968e-02 9.959 < 2e-16 ***
season()year7 3.000e-01 1.969e-02 15.238 < 2e-16 ***
season()year8 2.907e-01 1.969e-02 14.765 < 2e-16 ***
season()year9 1.462e-01 1.969e-02 7.423 1.33e-11 ***
season()year10 8.145e-03 1.970e-02 0.414 0.679912
season()year11 -1.354e-01 1.970e-02 -6.873 2.36e-10 ***
season()year12 -2.132e-02 1.971e-02 -1.082 0.281286
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.0482 on 130 degrees of freedom
Multiple R-squared: 0.9892, Adjusted R-squared: 0.9881
F-statistic: 912.7 on 13 and 130 DF, p-value: < 2.22e-16
Model 3 Advantages:
- Multiplicative seasonality: Seasonal effects grow with trend
- Percentage interpretation: Coefficients represent % changes
- Variance stabilization: More homoscedastic residuals
# Extract fitted values with back-transformation
#aug3 <- model3 %>%
# augment() %>%
# mutate(fitted_original = exp(.fitted)) # Back-transform to original scale
# Since .fitted is already on original scale, use it directly:
aug3 <- model3 %>%
augment()
# Plot without back-transformation
autoplot(AP_data, value) +
autolayer(aug3, .fitted, color = "green", size = 1) + # Use .fitted directly
labs(title = "Model 3: Log-Linear Fit",
y = "Passengers (thousands)",
subtitle = "Fitted values already on original scale")Model 3 Assessment:
- Best visual fit for multiplicative pattern
- Handles increasing seasonal variation appropriately
- Residuals on log scale should be most well-behaved
The stargazer function isn’t working because fpp3 model objects have a different structure than standard R regression objects. str(model1) output shows a complex nested list structure that stargazer doesn’t recognize.
So use alternate command to extract and display coefficients.
Residual Diagnostics
Residual analysis is crucial for validating model assumptions and identifying remaining patterns.
# Model 1 residuals
aug1 %>%
gg_tsdisplay(.resid, plot_type = "partial", lag_max = 24) +
labs(title = "Model 1 Residuals: Strong Seasonal Pattern Remains")# Model 2 residuals
aug2 %>%
gg_tsdisplay(.resid, plot_type = "partial", lag_max = 24) +
labs(title = "Model 2 Residuals: Much Improved")# Model 3 residuals (on log scale)
aug3 %>%
gg_tsdisplay(.resid, plot_type = "partial", lag_max = 24) +
labs(title = "Model 3 Residuals: Cleanest Pattern")Residual Interpretation:
-
Model 1: Strong seasonal spikes in ACF at lags 12, 24 → seasonality not captured - Model 2: Seasonal spikes largely removed → successful deseasonalization
- Model 3: Cleanest ACF/PACF pattern → best model for residual behavior
Model Comparison
# Calculate accuracy metrics for all models
acc <- bind_rows(
accuracy(model1) %>% mutate(Model = "Model 1: Trend Only"),
accuracy(model2) %>% mutate(Model = "Model 2: Trend + Seasonal"),
accuracy(model3) %>% mutate(Model = "Model 3: Log-Linear")
)
# Display key accuracy measures
acc %>%
select(Model, RMSE, MAE, MAPE, MASE) %>%
knitr::kable(digits = 3,
caption = "Model Accuracy Comparison")| Model | RMSE | MAE | MAPE | MASE |
|---|---|---|---|---|
| Model 1: Trend Only | 44.435 | 32.842 | 11.264 | 1.025 |
| Model 2: Trend + Seasonal | 22.562 | 17.579 | 7.273 | 0.549 |
| Model 3: Log-Linear | 13.119 | 10.001 | 3.696 | 0.312 |
Accuracy Metrics Explained:
- RMSE: Root Mean Square Error (lower = better)
- MAE: Mean Absolute Error (lower = better)
- MAPE: Mean Absolute Percentage Error (lower = better)
- MASE: Mean Absolute Scaled Error (< 1 = better than naive forecast)
# Custom R-squared function for time series models / or pull model metric from glance command
rsq <- function(mdl) {
aug <- augment(mdl)
1 - sum(aug$.resid^2) / sum((aug$value - mean(aug$value))^2)
}
# Calculate pseudo R-squared for each model
r2_comparison <- tibble(
Model = c("Model 1: Trend Only",
"Model 2: Trend + Seasonal",
"Model 3: Log-Linear"),
R_squared = c(rsq(model1), rsq(model2), rsq(model3))
)
r2_comparison %>%
knitr::kable(digits = 4,
caption = "Model R-squared Comparison")| Model | R_squared |
|---|---|
| Model 1: Trend Only | 0.8618 |
| Model 2: Trend + Seasonal | 0.9644 |
| Model 3: Log-Linear | 0.9880 |
R-squared Results:
- Model 1: Moderate fit from trend alone
- Model 2: Dramatic improvement with seasonality
- Model 3: Comparable to Model 2, with additional benefits
Forecasting
# Generate 12-month ahead forecasts
fc <- bind_rows(
model1 %>% forecast(h = 12) %>% mutate(Model = "Model 1"),
model2 %>% forecast(h = 12) %>% mutate(Model = "Model 2"),
model3 %>% forecast(h = 12) %>% mutate(Model = "Model 3")
)
# Plot forecasts with prediction intervals
AP_data %>%
autoplot(value) + # original series (black line)
autolayer(fc, .mean, alpha = 0.8, size = 1) + # model forecasts (colored lines)
autolayer(fc, level = 80, alpha = 0.3) + # 80% PI (shaded bands)
facet_wrap(~Model, ncol = 1) + # separate panels for each model
labs(title = "12-Month Ahead Forecasts with 80% Prediction Intervals",
y = "Passengers (thousands)",
x = "Year") +
theme_minimal() Scale for fill_ramp is already present.
Adding another scale for fill_ramp, which will replace the existing scale.
Forecast Comparison:
- Model 1: Smooth trend, misses seasonal peaks/troughs
- Model 2: Proper seasonal pattern, constant seasonal amplitude
- Model 3: Seasonal pattern with increasing amplitude (most realistic)
Key Findings and Recommendations
Statistical Results:
- Model 1 captures 86% of variance with trend alone
- Model 2 improves to 96% by adding seasonality
- Model 3 achieves similar fit with better residual properties
Practical Insights:
- Summer months (June-August) show highest passenger demand
- Winter months (November-February) show lowest demand
- Growth rate: Approximately 1% per month with acceleration
- Seasonal variation increases over time (multiplicative pattern)
Model Selection:
Recommended: Model 3 (Log-Linear)
- Handles multiplicative seasonality appropriately
- Most realistic forecast behavior
- Best residual diagnostics
- Interpretable percentage effects
Future Extensions:
- ARIMA modeling for residual autocorrelation
- External regressors (economic factors, fuel prices)
- Structural breaks for regime changes
- Exponential smoothing for comparison
Conclusion
This analysis demonstrates the evolution from simple trend models to sophisticated time series regression. The log-linear model (Model 3) provides the best balance of fit, interpretability, and realistic forecasting behavior for the AirPassengers data.
Key Learning Points:
- ACF/PACF analysis guides model specification
- Seasonal dummy variables effectively capture monthly patterns
- Log transformation handles multiplicative seasonality
- Residual diagnostics are essential for model validation
-
fpp3framework provides comprehensive tools for time series regression