Chapter 7 Homework

Author

Griffin Lessinger

4. Analysis of `souvenirs` series

(a + b). Produce a time plot of the data and describe the patterns in the graph

Code

souv <- souvenirs
souv |> autoplot(Sales) +
  labs(title = "Time plot of souvenirs sales figures", y = "Sales ($AUD?)")

Above is the time plot of the sales (by month) from souvenirs, a series tracking monthly sales of a beach resort souvenir/gift shop in Queensland, Australia. The yearly spikes in sales during December makes sense then. Being in the southern hemisphere, Australia will see many northern-hemisphere tourists during these later months.

Something to note is that the seasonal component of the series seems to scale multiplicatively with increasing monthly sales! As such, attempting to fit a linear model will fail, as the multiplicative scaling is a nonlinear pattern. Taking a log transform of the data will be necessary to turn that multiplicative scaling into a transformed additive scaling.

(c). Fit a regression model to the logarithms of these sales data with a linear trend

Code

souv$Sales.log <- log(souv$Sales)
souv$Holiday <- ifelse(month(souv$Month) %in% c(11, 12), 1, 0)
souv$Jan <- ifelse(month(souv$Month) == 1, 1, 0)
souv$Surf <- ifelse(month(souv$Month) == 3, 1, 0)

souv.lm <- souv |>
  model(TSLM(Sales.log ~ trend()  + Holiday + Surf))

augment(souv.lm) |>
  ggplot(aes(x = Month)) +
  geom_line(aes(y = Sales.log, color = "Data")) +
  geom_line(aes(y = .fitted, color = "Fitted")) +
  labs(title = "Souvenir sales by month", x = "Month", y = "Sales (logged)") +
  scale_color_manual(values = c(Data = "black", Fitted = "red2")) +
  guides(color = guide_legend())

All features (trend() and intercept, along with Holiday, and Surf dummy variables) Are significant predictors at the 5% significance level. also, High \(R\) and \(R^2\) values.

Linear model report

Code

report(souv.lm)

Series: Sales.log 
Model: TSLM 

Residuals:
      Min        1Q    Median        3Q       Max 
-0.682035 -0.196370  0.002756  0.243221  0.554971 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.025572   0.068991 116.327   <2e-16 ***
trend()     0.023232   0.001374  16.910   <2e-16 ***
Holiday     1.121011   0.090123  12.439   <2e-16 ***
Surf        0.242773   0.121118   2.004   0.0484 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3039 on 80 degrees of freedom
Multiple R-squared: 0.8574, Adjusted R-squared: 0.8521
F-statistic: 160.4 on 3 and 80 DF, p-value: < 2.22e-16

(d). Plot the residuals against time and against the fitted values

Code

souv.lm |> gg_tsresiduals() + labs(title = "Residuals over time")

Code

augment(souv.lm) |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  xlim(7, 12) +
  ylim(-1, 1) +
  labs(title = "Residuals vs. Fitted", x = "Fitted", y = "Residuals")

The random scattering around 0 (of the residuals) suggests homoscedasticity. Additionally, the residuals over time plot shows what looks to be “white noise”, and the residual histogram is centered at 0 with some slight left skew. Overall, the model seems pretty good. The only major concern would be some high autocorrelation of the residuals, especially at 2 and 12 months lag.

(e). Do boxplots of the residuals for each month

Code

souv.residuals <- residuals(souv.lm)

boxplot(
  data = souv.residuals,
  .resid ~ month(Month),
  main = "Boxplots of residuals by month",
  xlab = "Month",
  ylab = "Residuals"
)

The model seems to be somewhat overestimating for January and November (negative residuals), underestimating for October and December (positive residuals). The model accuracy may be greatly improved if we were to include an additional dummy variable that signifies January, where sales dip greatly (coming right after December, a peak in sales).

(f). What do the values of the coefficients tell you about each variable?

Code

report(souv.lm)

Series: Sales.log 
Model: TSLM 

Residuals:
      Min        1Q    Median        3Q       Max 
-0.682035 -0.196370  0.002756  0.243221  0.554971 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.025572   0.068991 116.327   <2e-16 ***
trend()     0.023232   0.001374  16.910   <2e-16 ***
Holiday     1.121011   0.090123  12.439   <2e-16 ***
Surf        0.242773   0.121118   2.004   0.0484 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3039 on 80 degrees of freedom
Multiple R-squared: 0.8574, Adjusted R-squared: 0.8521
F-statistic: 160.4 on 3 and 80 DF, p-value: < 2.22e-16

Going back to the model report, the intercept already explains much of the logged sales (as most values are between \(7\) and \(11\), and the intercept is about \(8\)). trend() has relatively little influence, though it is probably one of the most statistically significant predictors. Holiday is greatly important, adding \(1.121\) in logged sales to November and December fitted values. Surf less-so, but still adding \(0.243\) to March fitted values.

(g). What does the Ljung-Box test tell you about your model?

Code

augment(souv.lm) |> features(.innov, ljung_box, lag = 8)

# A tibble: 1 × 3
  .model                                     lb_stat lb_pvalue
  <chr>                                        <dbl>     <dbl>
1 TSLM(Sales.log ~ trend() + Holiday + Surf)    16.6    0.0350

The p-value is relatively small, so the autocorrelations are distinguishable from a white-noise series. This matches what was seen earlier when looking at residual plots, the ACF plot showed several large spikes in autocorrelation.

(h + i). Use your regression model to predict the monthly sales for 1994, 1995, and 1996

Code

souv.new <- new_data(souv, n = 36) |>
  mutate(
    Holiday = ifelse(month(Month) %in% c(11, 12), 1, 0),
    Jan = ifelse(month(Month) == 1, 1, 0),
    Surf = ifelse(month(Month) == 3, 1, 0)
  )

souv.preds <- forecast(souv.lm, new_data = souv.new)

souv |>
  autoplot(Sales.log) +
  autolayer(souv.preds) +
  labs(title = "Forecasted sales (logged)", x = "Month", y = "Sales (logged)")

Plot of un-transformed forecasts

Code

souv.predsexp <- tsibble(souv.preds[, c(2, 4)], index = Month) |>
  select(.mean) |>
  mutate(restore = exp(.mean)) |>
  select(Month, restore)

souv |>
  autoplot(Sales) +
  autolayer(souv.predsexp, col = "blue", alpha = 0.6) +
  labs(title = "Forecasted sales", x = "Month", y = "Sales ($AUD?)")

The model predicts a very consistent linear increase in logged sales. As was said earlier, the model’s accuracy could likely be further increased by adding a dummy variable for January. But, too many dummy variables should be avoided, as the model will shift away from a proper linear regression and towards an “average monthly sales (logged) + trend” giver.

5. Analysis of `us_gasoline` series

(a). Fit a harmonic regression with trend to the data

Code

usg <- us_gasoline |>
  filter(year(Week) <= 2004)

usg.fit <- usg |>
  model(TSLM(Barrels ~ trend() + fourier(K = 4)))

augment(usg.fit) |> 
  filter(year(Week) > 2000) |>
  ggplot(aes(x = Week)) +
  geom_line(aes(y = Barrels, color = "Data")) +
  geom_line(aes(y = .fitted, color = "Fitted")) +
  labs(
    title = "Supplies of US gasoline product",
    subtitle = "Fitted harmonic regression, 4 Fourier terms",
    y = "Barrels (millions)") +
  scale_color_manual(values = c(Data = "black", Fitted = "red2")) +
  guides(color = guide_legend(title = ""))

The harmonic regression with 4 Fourier terms shows (roughly) the cyclic shape of the data. Realistically, up to 26 Fourier terms could be used, because the seasonal period of the data is 52 intervals.

Harmonic model report

Code

report(usg.fit)

Series: Barrels 
Model: TSLM 

Residuals:
       Min         1Q     Median         3Q        Max 
-0.8560091 -0.1733642 -0.0003261  0.1910431  0.9111163 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          7.0945289  0.0205501 345.231  < 2e-16 ***
trend()              0.0027996  0.0000490  57.132  < 2e-16 ***
fourier(K = 4)C1_52 -0.1240988  0.0145021  -8.557  < 2e-16 ***
fourier(K = 4)S1_52 -0.2388993  0.0145064 -16.469  < 2e-16 ***
fourier(K = 4)C2_52  0.0451222  0.0144806   3.116  0.00191 ** 
fourier(K = 4)S2_52  0.0098840  0.0145108   0.681  0.49600    
fourier(K = 4)C3_52  0.0957303  0.0145060   6.599 8.04e-11 ***
fourier(K = 4)S3_52  0.0002916  0.0144822   0.020  0.98394    
fourier(K = 4)C4_52  0.0284342  0.0145035   1.961  0.05032 .  
fourier(K = 4)S4_52  0.0285358  0.0144836   1.970  0.04920 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2761 on 716 degrees of freedom
Multiple R-squared: 0.8386, Adjusted R-squared: 0.8366
F-statistic: 413.4 on 9 and 716 DF, p-value: < 2.22e-16

(b). Select the appropriate number of Fourier terms to include

Code

kterms <- c()
for (k in 1:26) {
  usg.fittemp <- usg |>
    model(TSLM(Barrels ~ trend() + fourier(K = k)))
  
  kterms <- append(
    kterms,
    glance(usg.fittemp) |> select(AICc)
  )
}

kterms <- unlist(kterms)
best <- match(kterms[kterms == min(kterms)], kterms)

print(paste0("Optimal number of Fourier terms (to minimize AICc): ", best))

[1] "Optimal number of Fourier terms (to minimize AICc): 7"

(c). Plot the residuals, test residual autocorrelation with Ljung-Box test

Code

usg.hm <- usg |>
  model(TSLM(Barrels ~ trend() + fourier(K = 7)))

usg.hm |> gg_tsresiduals() + labs(title = "Residuals over time")

Code

augment(usg.hm) |> features(.innov, ljung_box, lag = 104)

# A tibble: 1 × 3
  .model                                   lb_stat lb_pvalue
  <chr>                                      <dbl>     <dbl>
1 TSLM(Barrels ~ trend() + fourier(K = 7))    144.   0.00561

This is a bit odd. The residual histogram shows no major skew, so no excessive bias towards over or under-prediction by the model. Additionally, the residual plot looks like white noise. The Ljung-Box test, however (with its tiny p-value), suggests that residual autocorrelation is not indistinguishable from white noise. If we look at the ACF plot, we can indeed see systemic biasing of autocorrelation in the positive direction, even if it is minor.

(d). Generate forecasts for the next year of data and plot them along the actual data

Code

usg.preds <- usg.hm |>
  forecast(h = 52)

usg <- us_gasoline |>
  filter(year(Week) <= 2005)

usg.preds |>
  autoplot(usg |> filter(year(Week) >= 2003), alpha = 2) +
  labs(
    title = "Supplies of US gasoline product",
    subtitle = "Fitted harmonic regression, 7 Fourier terms",
    y = "Barrels (millions)"
  )

Model seems decently accurate! The x-axis was shortened to make it easier to see the year of forecasted values against the actual values. For most intervals, the actual value is within the 95% confidence interval of the harmonic model, if the predicted values aren’t literally on top of the actual values anyway.

4. Analysis of souvenirs series

5. Analysis of us_gasoline series

4. Analysis of `souvenirs` series

5. Analysis of `us_gasoline` series