---
title: "Chapter 7 Homework"
author: "Griffin Lessinger"
format:
html:
code-fold: true
code-tools: true
toc: true
editor: visual
---
```{r, echo = FALSE, message = FALSE, warning = FALSE}
library(fpp3)
```
### 4. Analysis of `souvenirs` series
::: {.callout collapse=true}
#### (a + b). Produce a time plot of the data and describe the patterns in the graph
```{r}
souv <- souvenirs
souv |> autoplot(Sales) +
  labs(title = "Time plot of souvenirs sales figures", y = "Sales ($AUD)")
```
Above is the time plot of monthly sales from `souvenirs`, a series tracking the monthly sales of a beach-resort souvenir/gift shop in Queensland, Australia. The yearly spikes in sales each December make sense: being in the southern hemisphere, Australia sees many northern-hemisphere tourists during its summer months.
Something to note is that the seasonal component of the series scales multiplicatively with the rising level of monthly sales! A linear model fit to the raw data will therefore fit poorly, as multiplicative scaling is a nonlinear pattern. Taking a log transform of the data turns that multiplicative scaling into an additive pattern on the log scale.
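As a quick visual check (a minimal sketch on the same data), the December spikes should have roughly constant height on the log scale if the seasonality really is multiplicative:

```{r}
# If seasonality is multiplicative on the original scale, the seasonal swings
# should have roughly constant amplitude after a log transform.
souv |> autoplot(log(Sales)) +
  labs(title = "Log-transformed souvenir sales", y = "log(Sales)")
```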
:::
::: {.callout collapse=true}
#### (c). Fit a regression model to the logarithms of these sales data with a linear trend
```{r}
souv <- souv |>
  mutate(
    Sales.log = log(Sales),
    Holiday = ifelse(month(Month) %in% c(11, 12), 1, 0), # Nov/Dec holiday season
    Jan = ifelse(month(Month) == 1, 1, 0),               # post-holiday dip
    Surf = ifelse(month(Month) == 3, 1, 0)               # March surfing festival
  )
souv.lm <- souv |>
  model(TSLM(Sales.log ~ trend() + Holiday + Surf))
augment(souv.lm) |>
  ggplot(aes(x = Month)) +
  geom_line(aes(y = Sales.log, color = "Data")) +
  geom_line(aes(y = .fitted, color = "Fitted")) +
  labs(title = "Souvenir sales by month", x = "Month", y = "Sales (logged)") +
  scale_color_manual(values = c(Data = "black", Fitted = "red2")) +
  guides(color = guide_legend())
```
All terms (the intercept and `trend()`, along with the `Holiday` and `Surf` dummy variables) are significant predictors at the 5% significance level, and the multiple and adjusted $R^2$ values are high ($0.857$ and $0.852$).
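Concretely, reading the coefficients off the report below, the fitted model is approximately

$$\widehat{\log(\text{Sales}_t)} = 8.026 + 0.0232\,t + 1.121\,\text{Holiday}_t + 0.243\,\text{Surf}_t.$$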
::: {.callout collapse=true}
##### Linear model report
```{r}
report(souv.lm)
```
:::
:::
::: {.callout collapse=true}
#### (d). Plot the residuals against time and against the fitted values
```{r, warning = FALSE}
souv.lm |> gg_tsresiduals() + labs(title = "Residuals over time")
augment(souv.lm) |>
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  xlim(7, 12) +
  ylim(-1, 1) +
  labs(title = "Residuals vs. Fitted", x = "Fitted", y = "Residuals")
```
The random scatter of the residuals around 0 in the residuals-vs-fitted plot suggests homoscedasticity. Additionally, the residuals-over-time plot looks like white noise, and the residual histogram is centered at 0 with some slight left skew. Overall, the model seems reasonable. The only major concern is the high autocorrelation of the residuals, especially at lags of 2 and 12 months.
:::
::: {.callout collapse=true}
#### (e). Do boxplots of the residuals for each month
```{r}
souv.residuals <- residuals(souv.lm)
boxplot(
  .resid ~ month(Month),
  data = souv.residuals,
  main = "Boxplots of residuals by month",
  xlab = "Month",
  ylab = "Residuals"
)
```
The model somewhat overestimates January and November (negative residuals) and underestimates October and December (positive residuals). Model accuracy might improve considerably if we included an additional dummy variable for January, where sales dip sharply right after the December peak.
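As a sketch of that suggestion (left unevaluated, and not part of the fitted model above), the `Jan` dummy created earlier could be added directly; `souv.lm.jan` is just an illustrative name:

```{r, eval = FALSE}
# Sketch only: refit with the January dummy already present in souv.
souv.lm.jan <- souv |>
  model(TSLM(Sales.log ~ trend() + Holiday + Jan + Surf))
report(souv.lm.jan)
```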
:::
::: {.callout collapse=true}
#### (f). What do the values of the coefficients tell you about each variable?
```{r}
report(souv.lm)
```
Going back to the model report, the intercept already accounts for much of the logged sales (most values lie between $7$ and $11$, and the intercept is about $8$). `trend()` contributes relatively little per month (about $0.023$), though it is among the most statistically significant predictors and its effect compounds over the series. `Holiday` is greatly important, adding $1.121$ to the logged fitted values for November and December; `Surf` less so, but it still adds $0.243$ to March fitted values.
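Because the model is on the log scale, these dummy effects are multiplicative on the original scale; a quick check:

```{r}
# exp() of a log-scale coefficient is the multiplicative effect on sales:
# roughly 3.1x for November/December and 1.3x for March.
exp(c(Holiday = 1.121011, Surf = 0.242773))
```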
:::
::: {.callout collapse=true}
#### (g). What does the Ljung-Box test tell you about your model?
```{r}
augment(souv.lm) |> features(.innov, ljung_box, lag = 8)
```
The p-value is relatively small, so the residual autocorrelations *are* distinguishable from those of a white-noise series. This matches what was seen earlier in the residual plots: the ACF plot showed several large autocorrelation spikes.
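For reference, the Ljung-Box statistic being computed here is

$$Q^* = T(T+2)\sum_{k=1}^{\ell}\frac{r_k^2}{T-k},$$

where $T$ is the number of observations, $\ell$ the number of lags tested (here $8$), and $r_k$ the lag-$k$ autocorrelation of the residuals; a large $Q^*$ (small p-value) means the autocorrelations are too large to have come from white noise.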
:::
::: {.callout collapse=true}
#### (h + i). Use your regression model to predict the monthly sales for 1994, 1995, and 1996
```{r}
souv.new <- new_data(souv, n = 36) |>
  mutate(
    Holiday = ifelse(month(Month) %in% c(11, 12), 1, 0),
    Jan = ifelse(month(Month) == 1, 1, 0),
    Surf = ifelse(month(Month) == 3, 1, 0)
  )
souv.preds <- forecast(souv.lm, new_data = souv.new)
souv |>
  autoplot(Sales.log) +
  autolayer(souv.preds) +
  labs(title = "Forecasted sales (logged)", x = "Month", y = "Sales (logged)")
```
::: {.callout collapse=true}
##### Plot of un-transformed forecasts
```{r, message = FALSE, warning = FALSE}
# Back-transform the point forecasts with exp(), selecting columns by name
# rather than by position for robustness.
souv.predsexp <- souv.preds |>
  as_tibble() |>
  transmute(Month, restore = exp(.mean)) |>
  as_tsibble(index = Month)
souv |>
  autoplot(Sales) +
  autolayer(souv.predsexp, col = "blue", alpha = 0.6) +
  labs(title = "Forecasted sales", x = "Month", y = "Sales ($AUD)")
```
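One caveat worth noting about this back-transform (a side note, not part of the original answer): exponentiating the log-scale point forecast gives approximately the forecast *median* on the original scale. The bias-adjusted mean for a log transformation is approximately

$$\hat{y}_t = e^{\hat{w}_t}\left[1 + \frac{\sigma_h^2}{2}\right],$$

where $\hat{w}_t$ is the log-scale forecast and $\sigma_h^2$ its forecast variance. `fable` applies this adjustment automatically when the transformation is written inside the model formula (e.g. `TSLM(log(Sales) ~ ...)`), which would avoid the manual back-transform entirely.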
:::
The model predicts a very consistent linear increase in logged sales. As noted earlier, the model's accuracy could likely be improved further by adding a dummy variable for January. But too many dummy variables should be avoided: with one dummy per month, the regression degenerates into a table of average monthly (logged) sales plus a trend.
:::
### 5. Analysis of `us_gasoline` series
::: {.callout collapse=true}
#### (a). Fit a harmonic regression with trend to the data
```{r}
usg <- us_gasoline |>
  filter(year(Week) <= 2004)
usg.fit <- usg |>
  model(TSLM(Barrels ~ trend() + fourier(K = 4)))
augment(usg.fit) |>
  filter(year(Week) > 2000) |>
  ggplot(aes(x = Week)) +
  geom_line(aes(y = Barrels, color = "Data")) +
  geom_line(aes(y = .fitted, color = "Fitted")) +
  labs(
    title = "Supplies of US gasoline product",
    subtitle = "Fitted harmonic regression, 4 Fourier terms",
    y = "Barrels (millions)"
  ) +
  scale_color_manual(values = c(Data = "black", Fitted = "red2")) +
  guides(color = guide_legend(title = ""))
```
The harmonic regression with 4 Fourier terms captures (roughly) the seasonal shape of the data. Up to $K = 26$ Fourier terms could be used, because the seasonal period of the weekly data is about 52 intervals and $K$ cannot exceed half the period.
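For reference, the model being fit has the form

$$\hat{y}_t = \beta_0 + \beta_1 t + \sum_{k=1}^{K}\left[\alpha_k \sin\!\left(\frac{2\pi k t}{52}\right) + \gamma_k \cos\!\left(\frac{2\pi k t}{52}\right)\right],$$

so each increase in $K$ adds one sine/cosine pair; at $K = 26$ the sine term is identically zero for integer $t$, which is why the number of useful harmonics is capped at half the period.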
::: {.callout collapse=true}
##### Harmonic model report
```{r}
report(usg.fit)
```
:::
:::
::: {.callout collapse=true}
#### (b). Select the appropriate number of Fourier terms to include
```{r, cache = TRUE}
# Fit a harmonic regression for each K = 1..26 and record its AICc.
kterms <- numeric(26)
for (k in 1:26) {
  usg.fittemp <- usg |>
    model(TSLM(Barrels ~ trend() + fourier(K = k)))
  kterms[k] <- glance(usg.fittemp) |> pull(AICc)
}
# which.min() returns the index (here, the value of K) of the smallest AICc.
best <- which.min(kterms)
print(paste0("Optimal number of Fourier terms (to minimize AICc): ", best))
```
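A quick optional visual (assuming the `kterms` vector from the chunk above) makes the minimum at $K = 7$ easy to see:

```{r}
# AICc as a function of K; the curve bottoms out at K = 7.
plot(1:26, kterms, type = "b",
     xlab = "K (number of Fourier terms)", ylab = "AICc",
     main = "AICc by number of Fourier terms")
```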
:::
::: {.callout collapse=true}
#### (c). Plot the residuals, test residual autocorrelation with Ljung-Box test
```{r}
usg.hm <- usg |>
  model(TSLM(Barrels ~ trend() + fourier(K = 7)))
usg.hm |> gg_tsresiduals() + labs(title = "Residuals over time")
augment(usg.hm) |> features(.innov, ljung_box, lag = 104)
```
This is a bit odd. The residual histogram shows no major skew, so there is no excessive bias towards over- or under-prediction by the model. Additionally, the residual plot *looks* like white noise. The Ljung-Box test, however (with its tiny p-value), says the residual autocorrelations *are* distinguishable from white noise. Looking at the ACF plot, we can indeed see a systematic positive bias in the autocorrelations, even if it is minor.
:::
::: {.callout collapse=true}
#### (d). Generate forecasts for the next year of data and plot them along the actual data
```{r}
usg.preds <- usg.hm |>
  forecast(h = 52)
usg <- us_gasoline |>
  filter(year(Week) <= 2005)
usg.preds |>
  autoplot(usg |> filter(year(Week) >= 2003)) +
  labs(
    title = "Supplies of US gasoline product",
    subtitle = "Fitted harmonic regression, 7 Fourier terms",
    y = "Barrels (millions)"
  )
```
The model seems decently accurate! The x-axis was shortened to make it easier to compare the forecasted year against the actual values. For most weeks, the actual value falls within the 95% prediction interval of the harmonic model, when the point forecasts aren't essentially on top of the actual values anyway.
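As a numeric complement to the visual check (a small sketch using `fabletools::accuracy()`):

```{r}
# Compare the 52 forecasted weeks against the actuals in us_gasoline.
accuracy(usg.preds, us_gasoline)
```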
:::