Chapter 7

Exercise 4

A

autoplot(souvenirs)
## Plot variable not specified, automatically selected `.vars = Sales`

The monthly souvenir sales show clear seasonality. Given the nature and location of the store, it makes sense that the peaks reflect summer-month, tourism-related boosts in sales. There are also substantial increases in 1992 and 1993, with sales double or triple what they were from 1988-1991.

B

Log transformations will allow us to understand the underlying trend and reduce the effect of outlying data points on our model. Stabilising the variance over time also helps avoid heteroskedasticity in the residuals. If we were to model this series with an ARIMA model, a log transformation would be one step toward stationarity, though differencing would still be needed to remove the trend.
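
As a quick visual check (a sketch, assuming the fpp3 packages loaded above), the logged series should show seasonal swings of roughly constant size:

autoplot(souvenirs, log(Sales)) +
  labs(title = "Log-transformed souvenir sales", y = "log(Sales)")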

C

souvenirs <- souvenirs |>
  mutate(
    Year = year(Month),
    Month_num = month(Month),          # Numeric month (1-12): a linear month effect, not monthly dummies
    Trend = as.numeric(row_number()),  # Linear trend
    SurfingFestival = ifelse(Month_num == 3 & Year >= 1988, 1, 0),  # Festival held every March since 1988
    Season = if_else(Month_num %in% c(12, 1, 2), 1, 0)  # Dec-Feb (Australian summer) dummy
  )

# Fit the regression model with log transformation
sales_fit <- lm(log(Sales) ~ Trend + Month_num + SurfingFestival + Season, data = souvenirs)
summary(sales_fit)
## 
## Call:
## lm(formula = log(Sales) ~ Trend + Month_num + SurfingFestival + 
##     Season, data = souvenirs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58855 -0.15111  0.02698  0.16496  0.40899 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.175810   0.076001  94.418  < 2e-16 ***
## Trend           0.021809   0.001056  20.657  < 2e-16 ***
## Month_num       0.145459   0.008127  17.898  < 2e-16 ***
## SurfingFestival 0.784553   0.105840   7.413 1.20e-10 ***
## Season          0.463080   0.062242   7.440 1.07e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2316 on 79 degrees of freedom
## Multiple R-squared:  0.9182, Adjusted R-squared:  0.9141 
## F-statistic: 221.8 on 4 and 79 DF,  p-value: < 2.2e-16

Judging by the diagnostics, the model performs fairly well, though some observations, including numbers 1, 5, and 26, have substantial influence on the results.
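
The influential points can be identified with base R's regression diagnostics; a minimal sketch of that check:

# Cook's distance flags observations with outsized influence on the fit
cooks <- cooks.distance(sales_fit)
head(sort(cooks, decreasing = TRUE), 5)  # the largest values name the rows above
plot(sales_fit, which = 4)               # Cook's distance plot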

D

souvenirs <- souvenirs |>
  mutate(
    Residuals = residuals(sales_fit),
    Fitted = fitted(sales_fit)
  )

ggplot(souvenirs, aes(x = Trend, y = Residuals)) +
  geom_line() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residuals Over Time", x = "Trend", y = "Residuals")

ggplot(souvenirs, aes(x = Fitted, y = Residuals)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residuals vs. Fitted Values", x = "Fitted Values", y = "Residuals")

The residuals appear roughly centred on zero, with no obvious pattern over time or across the fitted values, consistent with approximate normality.
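
For a check that goes beyond eyeballing the plots (a sketch using base R):

# Normal Q-Q plot and Shapiro-Wilk test of the residuals
qqnorm(residuals(sales_fit)); qqline(residuals(sales_fit))
shapiro.test(residuals(sales_fit))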

E

ggplot(souvenirs, aes(x = factor(Month_num), y = Residuals)) +
  geom_boxplot() +
  labs(title = "Residuals by Month", x = "Month", y = "Residuals")

F

summary(sales_fit)
## 
## Call:
## lm(formula = log(Sales) ~ Trend + Month_num + SurfingFestival + 
##     Season, data = souvenirs)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.58855 -0.15111  0.02698  0.16496  0.40899 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.175810   0.076001  94.418  < 2e-16 ***
## Trend           0.021809   0.001056  20.657  < 2e-16 ***
## Month_num       0.145459   0.008127  17.898  < 2e-16 ***
## SurfingFestival 0.784553   0.105840   7.413 1.20e-10 ***
## Season          0.463080   0.062242   7.440 1.07e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2316 on 79 degrees of freedom
## Multiple R-squared:  0.9182, Adjusted R-squared:  0.9141 
## F-statistic: 221.8 on 4 and 79 DF,  p-value: < 2.2e-16

Since the response is log(Sales), each coefficient β corresponds to roughly a 100 × (exp(β) − 1)% change in sales, holding the other variables fixed. The surfing festival dummy (0.785) implies that sales in festival months (March, from 1988 on) are about 119% higher than they would otherwise be. The trend coefficient (0.0218) implies sales grow by roughly 2.2% per month. The Month_num coefficient (0.145) corresponds to about a 15.7% rise per step later in the calendar year, consistent with the warmer months lifting sales.
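
These percentage effects follow directly from the fitted coefficients (a sketch):

# Convert log-scale coefficients to percentage changes in Sales
round(100 * (exp(coef(sales_fit)) - 1), 1)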

G

lb_test <- Box.test(sales_fit$residuals, lag = 12, type = "Ljung-Box")

print(lb_test)
## 
##  Box-Ljung test
## 
## data:  sales_fit$residuals
## X-squared = 52.382, df = 12, p-value = 5.304e-07

The Ljung-Box test rejects the null hypothesis of independence (p ≈ 5.3e-07), suggesting that autocorrelation remains in the residuals.
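
The autocorrelation the test detects can be seen directly (a sketch):

# ACF of the regression residuals; spikes outside the bands mirror the test result
acf(residuals(sales_fit), lag.max = 24, main = "ACF of regression residuals")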

H

new_months <- tibble(
  Month = yearmonth(paste0(rep(1994:1996, each = 12), "-", rep(sprintf("%02d", 1:12), 3)))
)

# Creating the same predictor variables for the forecast period
new_data <- new_months %>%
  mutate(
    Year = year(Month),
    Month_num = month(Month),               # Numeric month
    Trend = as.numeric(row_number() + 84),  # Continue the trend from observation 84
    SurfingFestival = ifelse(Month_num == 3 & Year >= 1988, 1, 0),  # Festival dummy (every March)
    Season = if_else(Month_num %in% c(12, 1, 2), 1, 0)  # Dec-Feb (Australian summer) dummy
  )

sales_fcst <- sales_fit |>
  forecast(newdata = new_data)  # horizon implied by the 36 rows of new_data


forecast_df <- tibble(
  Mean = sales_fcst$mean,
  Time = new_data$Month,
  Lower_80 = sales_fcst$lower[,1],
  Upper_80 = sales_fcst$upper[,1],
  Lower_95 = sales_fcst$lower[,2],
  Upper_95 = sales_fcst$upper[,2]
)


ggplot(forecast_df, aes(x = Time)) +
  geom_line(aes(y = Mean)) +  # Forecast line (log scale)
  geom_ribbon(aes(ymin = Lower_80, ymax = Upper_80), alpha = 0.2, fill = "blue") +  # 80% prediction interval
  geom_ribbon(aes(ymin = Lower_95, ymax = Upper_95), alpha = 0.2, fill = "red") +   # 95% prediction interval
  labs(title = "Forecast with 80% and 95% Prediction Intervals",
       x = "Time",
       y = "Forecasted log(Sales)") +
  theme_minimal()
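
Because the model was fit to log(Sales), the point forecasts and intervals above are in log units. A sketch of back-transforming to dollar sales (forecast_dollars is a name introduced here; exp(sigma^2/2) is the usual bias adjustment for the mean):

sigma2 <- summary(sales_fit)$sigma^2
forecast_dollars <- forecast_df |>
  mutate(
    Mean_Sales     = exp(Mean + sigma2 / 2),  # bias-adjusted point forecast
    Lower_80_Sales = exp(Lower_80),           # interval bounds back-transform directly
    Upper_80_Sales = exp(Upper_80)
  )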

Exercise 6

A

AFG_Pop <- global_economy |>
  dplyr::select(Country, Year, Population) |>
  filter(Country == "Afghanistan")
  
AFG_Pop |>
  autoplot()
## Plot variable not specified, automatically selected `.vars = Population`

Sadly, the effects of the war on the population are apparent in the data: the population declines from the late 1970s into the late 1980s, then grows noticeably in the period after the war.

B

linear_model <- lm(Population ~ Year, data = AFG_Pop)

# Create knot variables
AFG_Pop <- AFG_Pop %>%
  mutate(
    knot_1980 = pmax(0, Year - 1980),
    knot_1989 = pmax(0, Year - 1989)
  )

# Fit the piecewise linear model
piecewise_model <- lm(Population ~ Year + knot_1980 + knot_1989, data = AFG_Pop)

# Compare model summaries
summary(linear_model)
## 
## Call:
## lm(formula = Population ~ Year, data = AFG_Pop)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -5794518 -2582559   744761  2259222  6036280 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -829292529   49730866  -16.68   <2e-16 ***
## Year            425774      25008   17.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3188000 on 56 degrees of freedom
## Multiple R-squared:  0.8381, Adjusted R-squared:  0.8352 
## F-statistic: 289.9 on 1 and 56 DF,  p-value: < 2.2e-16
summary(piecewise_model)
## 
## Call:
## lm(formula = Population ~ Year + knot_1980 + knot_1989, data = AFG_Pop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -577590 -174198  -16784  187226  679947 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -430846433   18967116  -22.71   <2e-16 ***
## Year            224372       9623   23.32   <2e-16 ***
## knot_1980      -456804      24498  -18.65   <2e-16 ***
## knot_1989      1082782      21418   50.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 300900 on 54 degrees of freedom
## Multiple R-squared:  0.9986, Adjusted R-squared:  0.9985 
## F-statistic: 1.293e+04 on 3 and 54 DF,  p-value: < 2.2e-16
# Compare AIC values for model selection
AIC(linear_model, piecewise_model)
##                 df      AIC
## linear_model     3 1905.665
## piecewise_model  5 1633.724
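
The piecewise model is clearly preferred (AIC 1633.7 vs 1905.7; adjusted R-squared 0.9985 vs 0.8352). As a check on interpretation, the implied annual population change in each regime is the sum of the active coefficients (a sketch):

b <- coef(piecewise_model)
c(
  pre_1980  = unname(b["Year"]),                                   # ~ +224k/yr
  war_years = unname(b["Year"] + b["knot_1980"]),                  # ~ -232k/yr
  post_1989 = unname(b["Year"] + b["knot_1980"] + b["knot_1989"])  # ~ +850k/yr
)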

C

# Create a new dataframe for forecast years
forecast_years <- data.frame(
  Year = 2018:2022,
  knot_1980 = pmax(0, 2018:2022 - 1980),
  knot_1989 = pmax(0, 2018:2022 - 1989)
)

# Forecast with the linear model
forecast_linear <- predict(linear_model, newdata = forecast_years)

# Forecast with the piecewise model
forecast_piecewise <- predict(piecewise_model, newdata = forecast_years)


# Combine historical data and forecasts for plotting
forecast_data <- forecast_years %>%
  mutate(
    Linear = forecast_linear,
    Piecewise = forecast_piecewise
  ) %>%
  pivot_longer(cols = c(Linear, Piecewise), names_to = "Model", values_to = "Population") %>%
      dplyr::select(Year, Model, Population)
  


# Plot historical data and forecasts
ggplot(forecast_data, aes(x = Year, y = Population, color = Model)) +
  geom_line() +
  geom_line(data = AFG_Pop, aes(x = Year, y = Population), color = "grey") +
  labs(
    title = "Afghani Population Trends and Forecasts",
    x = "Year",
    y = "Population",
    color = "Model"
  ) +
  theme_minimal()

Chapter 8

Exercise 7

A

gas_data <- aus_production %>% filter(!is.na(Gas))

# Fit an ETS model with multiplicative errors and an additive trend;
# the seasonal component is left for ETS() to select (it chooses multiplicative)
gas_ets <- gas_data %>%
  model(ETS(Gas ~ error("M") + trend("A")))

# Summarize the fitted model
report(gas_ets)
## Series: Gas 
## Model: ETS(M,A,M) 
##   Smoothing parameters:
##     alpha = 0.6528545 
##     beta  = 0.1441675 
##     gamma = 0.09784922 
## 
##   Initial states:
##      l[0]       b[0]      s[0]    s[-1]    s[-2]     s[-3]
##  5.945592 0.07062881 0.9309236 1.177883 1.074851 0.8163427
## 
##   sigma^2:  0.0032
## 
##      AIC     AICc      BIC 
## 1680.929 1681.794 1711.389
# Generate forecasts
gas_forecast <- gas_ets %>% forecast(h = "6 years")

# Plot the forecasts
gas_forecast %>%
  autoplot(gas_data) +
  labs(title = "ETS Forecast for Gas Production",
       y = "Gas Production",
       x = "Year")

Multiplicative seasonality is necessary when seasonal fluctuations increase with the trend. In the Gas data, the seasonal variation becomes more pronounced as the production level rises, indicating a multiplicative relationship.
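
One way to confirm this is to fit an additive-seasonality counterpart and compare information criteria (a sketch; gas_ets_add is a name introduced here, and since all ETS models are estimated on the same data their AICc values are comparable):

gas_ets_add <- gas_data %>%
  model(ETS(Gas ~ error("A") + trend("A") + season("A")))
report(gas_ets_add)  # compare its AICc against the ETS(M,A,M) fit above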

# Fit an ETS model with a damped trend
gas_ets_damped <- gas_data %>%
  model(ETS(Gas ~ error("M") + trend("Ad") + season("M")))

# Summarize the damped model
report(gas_ets_damped)
## Series: Gas 
## Model: ETS(M,Ad,M) 
##   Smoothing parameters:
##     alpha = 0.6489044 
##     beta  = 0.1551275 
##     gamma = 0.09369372 
##     phi   = 0.98 
## 
##   Initial states:
##      l[0]       b[0]      s[0]    s[-1]   s[-2]     s[-3]
##  5.858941 0.09944006 0.9281912 1.177903 1.07678 0.8171255
## 
##   sigma^2:  0.0033
## 
##      AIC     AICc      BIC 
## 1684.028 1685.091 1717.873
# Forecast with the damped model
gas_forecast_damped <- gas_ets_damped %>% forecast(h = "6 years")

# Plot the damped model forecasts
gas_forecast_damped %>%
  autoplot(gas_data) +
  labs(title = "ETS Forecast for Gas Production (Damped Trend)",
       y = "Gas Production",
       x = "Year")

gas_forecast %>%
  autoplot(gas_data) +
  labs(title = "ETS Forecast for Gas Production - no Damped trend",
       y = "Gas Production",
       x = "Year")

accuracy(gas_ets)
## # A tibble: 1 × 10
##   .model                .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>                 <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 "ETS(Gas ~ error(\"M… Trai… -0.115  4.60  3.02 0.199  4.08 0.542 0.606 -0.0131
accuracy(gas_ets_damped)
## # A tibble: 1 × 10
##   .model              .type       ME  RMSE   MAE   MPE  MAPE  MASE RMSSE    ACF1
##   <chr>               <chr>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1 "ETS(Gas ~ error(\… Trai… -0.00439  4.59  3.03 0.326  4.10 0.544 0.606 -0.0217

Damping the trend makes little difference to in-sample accuracy (RMSE 4.59 vs 4.60), and the AICc slightly favours the undamped model (1681.8 vs 1685.1). The damped model's forecasts also show wider prediction intervals, indicating more uncertainty in the estimates.
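
A side-by-side view of the information criteria makes the comparison concrete (a sketch):

bind_rows(glance(gas_ets), glance(gas_ets_damped)) %>%
  dplyr::select(.model, AIC, AICc, BIC)  # AICc: 1681.8 (undamped) vs 1685.1 (damped)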

Chapter 9

Exercise 10

A

Total_private <- us_employment |>
  filter(Title == "Total Private") |>
  dplyr::select(Month, Employed) |>
  model(STL(Employed ~ trend(window = 13) + season(window = 11)))

Total_private |>
  components() |>
  autoplot()

Looking at the raw data and the seasonal component extracted by the STL decomposition, the data show a clear seasonal pattern. A seasonal adjustment will be necessary.

B

stl_result <- us_employment |> 
  filter(Title == "Total Private") |> 
  dplyr::select(Month, Employed) |> 
  pull(Employed) |> 
  ts(frequency = 12) |> 
  stl(t.window = 13, s.window = 11)

# Extract the seasonal component
seasonal_component <- stl_result$time.series[, "seasonal"]

# Adjust the series for seasonality
seasonally_adjusted <- stl_result$time.series[, "trend"] + stl_result$time.series[, "remainder"]

# Create a new data frame with the adjusted series
adjusted_data <- us_employment |> 
  filter(Title == "Total Private") |> 
  mutate(seasonally_adjusted = as.numeric(seasonally_adjusted)) |>  # strip the ts class so autoplot works cleanly
  dplyr::select(-Employed)

adjusted_data |>
  autoplot()
## Plot variable not specified, automatically selected `.vars = seasonally_adjusted`

The data appear to be non-stationary, as the ACF remains elevated across all lags. I'll difference the series to address this.

C

adjusted_data |> 
  ACF(seasonally_adjusted) |> 
  autoplot() +
  labs(title = "ACF of Seasonally Adjusted Series")

adjusted_data |> 
  mutate(diff = difference(seasonally_adjusted, lag = 1)) |> 
  ACF(diff) |> 
  autoplot() +
  labs(title = "ACF of First-Differenced Series")

adjusted_data |> 
  mutate(diff2 = difference(seasonally_adjusted, differences = 2)) |>  # second-order difference, not a lag-2 difference
  ACF(diff2) |> 
  autoplot() +
  labs(title = "ACF of Twice-Differenced Series")

Taking the first difference makes the series look much more like white noise in the ACF plot: the autocorrelations drop off after a few lags, suggesting the differenced data are stationary.
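
A formal unit-root check agrees (a sketch using the feasts unit-root features):

adjusted_data |>
  features(seasonally_adjusted, unitroot_ndiffs)  # number of first differences needed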

D

Identify a couple of ARIMA models that might be useful in describing the time series. Which of your models is the best according to their AIC values?

fit_employment1 <- adjusted_data |> 
  model(arima1 = ARIMA(seasonally_adjusted ~ pdq(1,1,0)))

report(fit_employment1)
## Series: seasonally_adjusted 
## Model: ARIMA(1,1,0)(2,0,0)[12] w/ drift 
## 
## Coefficients:
##          ar1     sar1     sar2  constant
##       0.5914  -0.0935  -0.1854   55.3858
## s.e.  0.0268   0.0329   0.0321    5.6147
## 
## sigma^2 estimated as 30458:  log likelihood=-6369.06
## AIC=12748.12   AICc=12748.18   BIC=12772.49
fit_employment2 <- adjusted_data |> 
  model(arima2 = ARIMA(seasonally_adjusted ~ pdq(2,1,1)))

report(fit_employment2)
## Series: seasonally_adjusted 
## Model: ARIMA(2,1,1)(0,0,1)[12] w/ drift 
## 
## Coefficients:
##          ar1     ar2      ma1     sma1  constant
##       0.7426  0.1466  -0.4279  -0.2018   11.7751
## s.e.  0.0692  0.0534   0.0649   0.0542    2.3725
## 
## sigma^2 estimated as 26567:  log likelihood=-6302.32
## AIC=12616.64   AICc=12616.73   BIC=12645.9
fit_employment3 <- adjusted_data |> 
  model(arima3 = ARIMA(seasonally_adjusted ~ pdq(3,1,0)))

report(fit_employment3)
## Series: seasonally_adjusted 
## Model: ARIMA(3,1,0)(0,0,1)[12] w/ drift 
## 
## Coefficients:
##          ar1     ar2     ar3     sma1  constant
##       0.3208  0.2901  0.1755  -0.1930   22.7465
## s.e.  0.0319  0.0319  0.0318   0.0527    4.2027
## 
## sigma^2 estimated as 26596:  log likelihood=-6302.84
## AIC=12617.68   AICc=12617.77   BIC=12646.93

The ARIMA(2,1,1) model yielded the lowest AIC (12616.6), narrowly beating the ARIMA(3,1,0) (12617.7), so arima2 is the model carried forward.
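
Fitting the candidates in a single mable gives the comparison in one table (a sketch):

adjusted_data |>
  model(
    arima1 = ARIMA(seasonally_adjusted ~ pdq(1, 1, 0)),
    arima2 = ARIMA(seasonally_adjusted ~ pdq(2, 1, 1)),
    arima3 = ARIMA(seasonally_adjusted ~ pdq(3, 1, 0))
  ) |>
  glance() |>
  dplyr::select(.model, AIC, AICc, BIC) |>
  arrange(AICc)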

E

Estimate the parameters of your best model and do diagnostic testing on the residuals. Do the residuals resemble white noise?

fit_employment2 |>
  dplyr::select(arima2) |>
  gg_tsresiduals()

The residuals appear to be white noise based on the results of the ACF.
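
A formal Ljung-Box test on the innovation residuals backs this up (a sketch; dof = 5 matches the five estimated coefficients):

fit_employment2 |>
  augment() |>
  features(.innov, ljung_box, lag = 24, dof = 5)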

F

Forecast of three years (h = 36 months).

# Generate the forecast
forecast_employment <- fit_employment2 |> 
  dplyr::select(arima2) |> 
  forecast(h = 36)

forecast_employment |>
  autoplot()

In the first few months of the forecast, the predictions track the data well. The forecast horizon overlaps the COVID-19 recession, whose effects appear in the employment data beginning in March 2020: actual employment falls from about 129 million in February to roughly 109 million within two months. This is of course not anticipated by the forecast and demonstrates the limitations of this type of model in predicting the effects of unlikely events. Around the middle of 2022, after the US economy's relatively quick recovery, the actual series comes back in line with the forecast.

G

I think three years is right around the limit of a forecast's usefulness, considering the widening prediction intervals. A better question might be: how sure do you need to be about your forecast in 1, 2, or 3 years' time? Are you making decisions based on this longer-run forecast that could affect you or your business adversely if there are major deviations from the estimates? If so, the value of a forecast several years out could be low when the prediction intervals are wide.