Author

Griffin Lessinger

4. Analysis of souvenirs series

Code
souv <- souvenirs
souv |> autoplot(Sales) +
  labs(title = "Time plot of souvenirs sales figures", y = "Sales ($AUD?)")

Above is the time plot of the sales (by month) from souvenirs, a series tracking monthly sales of a beach resort souvenir/gift shop in Queensland, Australia. The yearly spikes in sales during December makes sense then. Being in the southern hemisphere, Australia will see many northern-hemisphere tourists during these later months.

Something to note is that the seasonal component of the series seems to scale multiplicatively with increasing monthly sales! As such, attempting to fit a linear model will fail, as the multiplicative scaling is a nonlinear pattern. Taking a log transform of the data will be necessary to turn that multiplicative scaling into a transformed additive scaling.

Code
souv$Sales.log <- log(souv$Sales)
souv$Holiday <- ifelse(month(souv$Month) %in% c(11, 12), 1, 0)
souv$Jan <- ifelse(month(souv$Month) == 1, 1, 0)
souv$Surf <- ifelse(month(souv$Month) == 3, 1, 0)

souv.lm <- souv |>
  model(TSLM(Sales.log ~ trend()  + Holiday + Surf))

augment(souv.lm) |>
  ggplot(aes(x = Month)) +
  geom_line(aes(y = Sales.log, color = "Data")) +
  geom_line(aes(y = .fitted, color = "Fitted")) +
  labs(title = "Souvenir sales by month", x = "Month", y = "Sales (logged)") +
  scale_color_manual(values = c(Data = "black", Fitted = "red2")) +
  guides(color = guide_legend())

All features (trend() and intercept, along with Holiday, and Surf dummy variables) Are significant predictors at the 5% significance level. also, High \(R\) and \(R^2\) values.

Code
report(souv.lm)
Series: Sales.log 
Model: TSLM 

Residuals:
      Min        1Q    Median        3Q       Max 
-0.682035 -0.196370  0.002756  0.243221  0.554971 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.025572   0.068991 116.327   <2e-16 ***
trend()     0.023232   0.001374  16.910   <2e-16 ***
Holiday     1.121011   0.090123  12.439   <2e-16 ***
Surf        0.242773   0.121118   2.004   0.0484 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3039 on 80 degrees of freedom
Multiple R-squared: 0.8574, Adjusted R-squared: 0.8521
F-statistic: 160.4 on 3 and 80 DF, p-value: < 2.22e-16
Code
souv.lm |> gg_tsresiduals() + labs(title = "Residuals over time")

Code
augment(souv.lm) |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  xlim(7, 12) +
  ylim(-1, 1) +
  labs(title = "Residuals vs. Fitted", x = "Fitted", y = "Residuals")

The random scattering around 0 (of the residuals) suggests homoscedasticity. Additionally, the residuals over time plot shows what looks to be “white noise”, and the residual histogram is centered at 0 with some slight left skew. Overall, the model seems pretty good. The only major concern would be some high autocorrelation of the residuals, especially at 2 and 12 months lag.

Code
souv.residuals <- residuals(souv.lm)

boxplot(
  data = souv.residuals,
  .resid ~ month(Month),
  main = "Boxplots of residuals by month",
  xlab = "Month",
  ylab = "Residuals"
)

The model seems to be somewhat overestimating for January and November (negative residuals), underestimating for October and December (positive residuals). The model accuracy may be greatly improved if we were to include an additional dummy variable that signifies January, where sales dip greatly (coming right after December, a peak in sales).

Code
report(souv.lm)
Series: Sales.log 
Model: TSLM 

Residuals:
      Min        1Q    Median        3Q       Max 
-0.682035 -0.196370  0.002756  0.243221  0.554971 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.025572   0.068991 116.327   <2e-16 ***
trend()     0.023232   0.001374  16.910   <2e-16 ***
Holiday     1.121011   0.090123  12.439   <2e-16 ***
Surf        0.242773   0.121118   2.004   0.0484 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3039 on 80 degrees of freedom
Multiple R-squared: 0.8574, Adjusted R-squared: 0.8521
F-statistic: 160.4 on 3 and 80 DF, p-value: < 2.22e-16

Going back to the model report, the intercept already explains much of the logged sales (as most values are between \(7\) and \(11\), and the intercept is about \(8\)). trend() has relatively little influence, though it is probably one of the most statistically significant predictors. Holiday is greatly important, adding \(1.121\) in logged sales to November and December fitted values. Surf less-so, but still adding \(0.243\) to March fitted values.

Code
augment(souv.lm) |> features(.innov, ljung_box, lag = 8)
# A tibble: 1 × 3
  .model                                     lb_stat lb_pvalue
  <chr>                                        <dbl>     <dbl>
1 TSLM(Sales.log ~ trend() + Holiday + Surf)    16.6    0.0350

The p-value is relatively small, so the autocorrelations are distinguishable from a white-noise series. This matches what was seen earlier when looking at residual plots, the ACF plot showed several large spikes in autocorrelation.

Code
souv.new <- new_data(souv, n = 36) |>
  mutate(
    Holiday = ifelse(month(Month) %in% c(11, 12), 1, 0),
    Jan = ifelse(month(Month) == 1, 1, 0),
    Surf = ifelse(month(Month) == 3, 1, 0)
  )

souv.preds <- forecast(souv.lm, new_data = souv.new)

souv |>
  autoplot(Sales.log) +
  autolayer(souv.preds) +
  labs(title = "Forecasted sales (logged)", x = "Month", y = "Sales (logged)")

Code
souv.predsexp <- tsibble(souv.preds[, c(2, 4)], index = Month) |>
  select(.mean) |>
  mutate(restore = exp(.mean)) |>
  select(Month, restore)

souv |>
  autoplot(Sales) +
  autolayer(souv.predsexp, col = "blue", alpha = 0.6) +
  labs(title = "Forecasted sales", x = "Month", y = "Sales ($AUD?)")

The model predicts a very consistent linear increase in logged sales. As was said earlier, the model’s accuracy could likely be further increased by adding a dummy variable for January. But, too many dummy variables should be avoided, as the model will shift away from a proper linear regression and towards an “average monthly sales (logged) + trend” giver.

5. Analysis of us_gasoline series

Code
usg <- us_gasoline |>
  filter(year(Week) <= 2004)

usg.fit <- usg |>
  model(TSLM(Barrels ~ trend() + fourier(K = 4)))

augment(usg.fit) |> 
  filter(year(Week) > 2000) |>
  ggplot(aes(x = Week)) +
  geom_line(aes(y = Barrels, color = "Data")) +
  geom_line(aes(y = .fitted, color = "Fitted")) +
  labs(
    title = "Supplies of US gasoline product",
    subtitle = "Fitted harmonic regression, 4 Fourier terms",
    y = "Barrels (millions)") +
  scale_color_manual(values = c(Data = "black", Fitted = "red2")) +
  guides(color = guide_legend(title = ""))

The harmonic regression with 4 Fourier terms shows (roughly) the cyclic shape of the data. Realistically, up to 26 Fourier terms could be used, because the seasonal period of the data is 52 intervals.

Code
report(usg.fit)
Series: Barrels 
Model: TSLM 

Residuals:
       Min         1Q     Median         3Q        Max 
-0.8560091 -0.1733642 -0.0003261  0.1910431  0.9111163 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          7.0945289  0.0205501 345.231  < 2e-16 ***
trend()              0.0027996  0.0000490  57.132  < 2e-16 ***
fourier(K = 4)C1_52 -0.1240988  0.0145021  -8.557  < 2e-16 ***
fourier(K = 4)S1_52 -0.2388993  0.0145064 -16.469  < 2e-16 ***
fourier(K = 4)C2_52  0.0451222  0.0144806   3.116  0.00191 ** 
fourier(K = 4)S2_52  0.0098840  0.0145108   0.681  0.49600    
fourier(K = 4)C3_52  0.0957303  0.0145060   6.599 8.04e-11 ***
fourier(K = 4)S3_52  0.0002916  0.0144822   0.020  0.98394    
fourier(K = 4)C4_52  0.0284342  0.0145035   1.961  0.05032 .  
fourier(K = 4)S4_52  0.0285358  0.0144836   1.970  0.04920 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2761 on 716 degrees of freedom
Multiple R-squared: 0.8386, Adjusted R-squared: 0.8366
F-statistic: 413.4 on 9 and 716 DF, p-value: < 2.22e-16
Code
kterms <- c()
for (k in 1:26) {
  usg.fittemp <- usg |>
    model(TSLM(Barrels ~ trend() + fourier(K = k)))
  
  kterms <- append(
    kterms,
    glance(usg.fittemp) |> select(AICc)
  )
}

kterms <- unlist(kterms)
best <- match(kterms[kterms == min(kterms)], kterms)

print(paste0("Optimal number of Fourier terms (to minimize AICc): ", best))
[1] "Optimal number of Fourier terms (to minimize AICc): 7"
Code
usg.hm <- usg |>
  model(TSLM(Barrels ~ trend() + fourier(K = 7)))

usg.hm |> gg_tsresiduals() + labs(title = "Residuals over time")

Code
augment(usg.hm) |> features(.innov, ljung_box, lag = 104)
# A tibble: 1 × 3
  .model                                   lb_stat lb_pvalue
  <chr>                                      <dbl>     <dbl>
1 TSLM(Barrels ~ trend() + fourier(K = 7))    144.   0.00561

This is a bit odd. The residual histogram shows no major skew, so no excessive bias towards over or under-prediction by the model. Additionally, the residual plot looks like white noise. The Ljung-Box test, however (with its tiny p-value), suggests that residual autocorrelation is not indistinguishable from white noise. If we look at the ACF plot, we can indeed see systemic biasing of autocorrelation in the positive direction, even if it is minor.

Code
usg.preds <- usg.hm |>
  forecast(h = 52)

usg <- us_gasoline |>
  filter(year(Week) <= 2005)

usg.preds |>
  autoplot(usg |> filter(year(Week) >= 2003), alpha = 2) +
  labs(
    title = "Supplies of US gasoline product",
    subtitle = "Fitted harmonic regression, 7 Fourier terms",
    y = "Barrels (millions)"
  )

Model seems decently accurate! The x-axis was shortened to make it easier to see the year of forecasted values against the actual values. For most intervals, the actual value is within the 95% confidence interval of the harmonic model, if the predicted values aren’t literally on top of the actual values anyway.