Introduction

These are exercises from Chapter 7 of “Forecasting Principles and Practice”.

7.1

Half-hourly electricity demand for Victoria, Australia is contained in vic_elec. Extract the January 2014 electricity demand, and aggregate this data to daily with daily total demands and maximum temperatures.

  1. Plot the data and find the regression model for Demand with temperature as a predictor variable. Why is there a positive relationship?

  2. Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?

  3. Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15°C, and compare it with the forecast if the maximum temperature was 35°C. Do you believe these forecasts?

7.1.a

Answer

It seems intuitive that as temperature rises, so does electricity demand, for example because air-conditioning units run longer and possibly at higher intensity. From the fitted regression we can see that, on average, each additional degree of maximum temperature is associated with roughly 6,150 MW of extra daily demand.

#a. 

jan14_vic_elec <- vic_elec %>%
  filter(yearmonth(Time) == yearmonth("2014 Jan")) %>%
  index_by(Date = as_date(Time)) %>%
  summarise(
    Demand = sum(Demand),
    Temperature = max(Temperature)
  )

jan14_vic_elec %>%
  ggplot(aes(y=Demand, x=Temperature)) +
  theme_ipsum_ps() +
  labs(y = "Electricity Demand in MW"
       , x = "Temperature"
       , title = "Demand Electricity in Victoria"
       , subtitle = "January 2014, Temperature") +
  geom_point() +
  geom_smooth(formula = y ~ x, method="lm")

fit <- jan14_vic_elec %>%
  model(TSLM(Demand ~ Temperature))

report(fit)
## Series: Demand 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -49978.2 -10218.9   -121.3  18533.2  35440.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59083.9    17424.8   3.391  0.00203 ** 
## Temperature   6154.3      601.3  10.235 3.89e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared: 0.7832,  Adjusted R-squared: 0.7757
## F-statistic: 104.7 on 1 and 29 DF, p-value: 3.8897e-11
regmodel <-lm(Demand ~ Temperature, data = jan14_vic_elec)

summary(regmodel)
## 
## Call:
## lm(formula = Demand ~ Temperature, data = jan14_vic_elec)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49978 -10219   -121  18533  35441 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59083.9    17424.8   3.391  0.00203 ** 
## Temperature   6154.3      601.3  10.235 3.89e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared:  0.7832, Adjusted R-squared:  0.7757 
## F-statistic: 104.7 on 1 and 29 DF,  p-value: 3.89e-11

7.1.b

Answer

By the look of the residuals, there does not appear to be a trend or autocorrelation in the data. The model therefore seems reasonably adequate, although a few outliers are present in this simple linear regression, which could skew the model into over-predicting demand beyond a certain temperature threshold.

layout(matrix(1:4, byrow = F, ncol = 2))

fit %>% 
  gg_tsresiduals() +
  labs(title = "Residuals")

plot(regmodel)
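For a more explicit check of influential observations than the diagnostic plots above, a quick sketch using Cook's distance on the regmodel fit from part a (the 4/n cutoff is only a common rule of thumb, not a hard threshold):

#influence check (sketch)
cooks_d <- cooks.distance(regmodel)
which(cooks_d > 4 / length(cooks_d)) #observations with outsized influence on the fit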

7.1.c

Answer

The forecast predicts that with a maximum temperature of about 35 degrees Celsius, the expected daily electricity demand would be around 275 thousand MW, while with a maximum of around 15 degrees Celsius it would be around 150 thousand MW.

future_scenarios <- scenarios(
  Temp15 = new_data(jan14_vic_elec) %>%
    mutate(Temperature=15),
  Temp35 = new_data(jan14_vic_elec) %>%
    mutate(Temperature=35),
  names_to = "Scenario")

fc <- forecast(fit, new_data = future_scenarios)
jan14_vic_elec %>%
  autoplot(Demand) +
  autolayer(fc) +
  labs(title = "Electricity Demand"
       , y = "Demand in MW"
       , x = "Date") +
  theme_ipsum_es()

7.1.d

Answer

The intervals are below:

int<-forecast(regmodel, newdata=data.frame(Temperature=c(15,35)))

int %>%
  kbl()%>%
  kable_material(c("striped","hover", full_width = T))
Temperature   Point Forecast      Lo 80      Hi 80      Lo 95      Hi 95
15°C                151398.4   117127.2   185669.5    97951.2   204845.5
35°C                274484.2   241333.0   307635.5   222783.7   326184.8
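The same intervals can also be read directly off the fable scenario forecasts from part c; a sketch assuming the fc object created above (hilo() comes with fabletools):

#prediction intervals from the fable forecast (sketch)
fc %>%
  hilo(level = c(80, 95))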

7.1.e

Answer

It looks like a J-shaped relationship: ignoring time and pooling all observations (rather than just January as above), electricity demand does not really start to increase with temperature until maximum temperatures pass roughly the 25 degrees Celsius mark.

ggplot(data = vic_elec,
       aes(x = Temperature,
           y = Demand)) +
  geom_point() +
  labs(x = "Temperature"
       , y = "Electricity Demand"
       , title = "All Temperature and Electricity Demand"
       , subtitle = "All months") +
  geom_smooth() +
  theme_ipsum_es()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

7.2

Data set olympic_running contains the winning times (in seconds) in each Olympic Games sprint, middle-distance and long-distance track events from 1896 to 2016.

  1. Plot the winning time against the year for each event. Describe the main features of the plot. Fit a regression line to the data for each event. Obviously the winning times have been decreasing, but at what average rate per year?

  2. Plot the residuals against the year. What does this indicate about the suitability of the fitted lines?

  3. Predict the winning time for each race in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?

7.2.a

Answer

We can clearly see that winning times have trended downward over time for both sexes. Women's times, however, have fallen faster, possibly because of social factors rather than purely physical ones. Times for the shorter distances continued to trend lower into the 2010s; times for the longer distances, by contrast, have plateaued and are now increasing. Only the 400 meter races have moved in opposite directions for the two sexes over the most recent observations. The women's 5000 meter series contains so few data points that a fair comparison is difficult; I would not rely on it.

#one by one

head(olympic_running)
## # A tsibble: 6 x 4 [4Y]
## # Key:       Length, Sex [1]
##    Year Length Sex    Time
##   <int>  <int> <chr> <dbl>
## 1  1896    100 men    12  
## 2  1900    100 men    11  
## 3  1904    100 men    11  
## 4  1908    100 men    10.8
## 5  1912    100 men    10.8
## 6  1916    100 men    NA
unique(olympic_running$Length)
## [1]   100   200   400   800  1500  5000 10000
olympic_running %>%
  filter(Length==100) %>%
  autoplot(Time) +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "100 m, By race") +
  geom_smooth(method = "lm", formula = y ~ x) + 
  theme_ipsum_es()

olympic_running %>%
  filter(Length==200) %>%
  autoplot(Time) +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "200 m, By race") +
  geom_smooth(method = "lm", formula = y ~ x) + 
  theme_ipsum_es()

olympic_running %>%
  filter(Length==400) %>%
  autoplot(Time) +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "400 m, By race") +
  geom_smooth(method = "lm", formula = y ~ x) + 
  theme_ipsum_es()

olympic_running %>%
  filter(Length==800) %>%
  autoplot(Time) +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "800 m, By race") +
  geom_smooth(method = "lm", formula = y ~ x) + 
  theme_ipsum_es()

olympic_running %>%
  filter(Length==1500) %>%
  autoplot(Time) +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "1500 m, By race") +
  geom_smooth(method = "lm", formula = y ~ x) + 
  theme_ipsum_es()

olympic_running %>%
  filter(Length==5000) %>%
  autoplot(Time) +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "5000 m, By race") +
  geom_smooth(method = "lm", formula = y ~ x) + 
  theme_ipsum_es()

olympic_running %>%
  filter(Length==10000) %>%
  autoplot(Time) +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "10000 m, By race") +
  geom_smooth(method = "lm", formula = y ~ x) + 
  theme_ipsum_es()
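Rather than one chunk per distance, all events can also be drawn in a single faceted plot; a sketch (free y-axis scales because times differ by an order of magnitude across events):

olympic_running %>%
  ggplot(aes(x = Year, y = Time, colour = Sex)) +
  geom_line() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ Length, scales = "free_y") +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "All distances, men and women") +
  theme_ipsum_es()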

7.2.b

Below are the regression coefficients by race distance and by sex. The average decrease in winning time per year differs across events and between men and women.

#inspect data -- credit to Tim White for the idea of looping over the vector of race lengths to generate all charts, instead of building each chart manually by distance.

head(olympic_running)
## # A tsibble: 6 x 4 [4Y]
## # Key:       Length, Sex [1]
##    Year Length Sex    Time
##   <int>  <int> <chr> <dbl>
## 1  1896    100 men    12  
## 2  1900    100 men    11  
## 3  1904    100 men    11  
## 4  1908    100 men    10.8
## 5  1912    100 men    10.8
## 6  1916    100 men    NA
unique(olympic_running$Length)
## [1]   100   200   400   800  1500  5000 10000
olympic_running %>%
  filter(Length==100) %>%
  autoplot(Time) +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "100 m, By race") +
  geom_smooth(method = "loess", formula = y ~ x) + 
  theme_ipsum_es()

olympic_running %>%
  filter(Length==200) %>%
  autoplot(Time) +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "200 m, By race") +
  geom_smooth(method = "loess", formula = y ~ x) + 
  theme_ipsum_es()

olympic_running %>%
  filter(Length==400) %>%
  autoplot(Time) +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "400 m, By race") +
  geom_smooth(method = "loess", formula = y ~ x) + 
  theme_ipsum_es()

olympic_running %>%
  filter(Length==800) %>%
  autoplot(Time) +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "800 m, By race") +
  geom_smooth(method = "loess", formula = y ~ x) + 
  theme_ipsum_es()

olympic_running %>%
  filter(Length==1500) %>%
  autoplot(Time) +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "1500 m, By race") +
  geom_smooth(method = "loess", formula = y ~ x) + 
  theme_ipsum_es()

olympic_running %>%
  filter(Length==5000) %>%
  autoplot(Time) +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "5000 m, By race") +
  geom_smooth(method = "loess", formula = y ~ x) + 
  theme_ipsum_es()

olympic_running %>%
  filter(Length==10000) %>%
  autoplot(Time) +
  labs(x = "Year"
       , y = "Time in Seconds"
       , title = "Running Times"
       , subtitle = "10000 m, By race") +
  geom_smooth(method = "loess", formula = y ~ x) + 
  theme_ipsum_es()

## Linear regression and coefficients

olympic_men <-
  olympic_running %>%
  filter(Sex=="men")

olympic_women <-
  olympic_running %>%
  filter(Sex=="women")

model_men <- 
  olympic_men %>%
  group_by(Length) %>%
  #model(TSLM(Time~trend()))
  do(model=lm(Time~Year, data=.))
  
model_women <- 
  olympic_women %>%
  group_by(Length) %>%
  #model(TSLM(Time~trend()))
  do(model=lm(Time~Year, data=.))


model_men$model
## [[1]]
## 
## Call:
## lm(formula = Time ~ Year, data = .)
## 
## Coefficients:
## (Intercept)         Year  
##    35.01158     -0.01261  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = Time ~ Year, data = .)
## 
## Coefficients:
## (Intercept)         Year  
##    69.37650     -0.02488  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = Time ~ Year, data = .)
## 
## Coefficients:
## (Intercept)         Year  
##   172.48148     -0.06457  
## 
## 
## [[4]]
## 
## Call:
## lm(formula = Time ~ Year, data = .)
## 
## Coefficients:
## (Intercept)         Year  
##    405.8457      -0.1518  
## 
## 
## [[5]]
## 
## Call:
## lm(formula = Time ~ Year, data = .)
## 
## Coefficients:
## (Intercept)         Year  
##    843.4367      -0.3151  
## 
## 
## [[6]]
## 
## Call:
## lm(formula = Time ~ Year, data = .)
## 
## Coefficients:
## (Intercept)         Year  
##     2853.20        -1.03  
## 
## 
## [[7]]
## 
## Call:
## lm(formula = Time ~ Year, data = .)
## 
## Coefficients:
## (Intercept)         Year  
##    6965.459       -2.666
model_women$model
## [[1]]
## 
## Call:
## lm(formula = Time ~ Year, data = .)
## 
## Coefficients:
## (Intercept)         Year  
##    39.18816     -0.01419  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = Time ~ Year, data = .)
## 
## Coefficients:
## (Intercept)         Year  
##    89.13761     -0.03363  
## 
## 
## [[3]]
## 
## Call:
## lm(formula = Time ~ Year, data = .)
## 
## Coefficients:
## (Intercept)         Year  
##   129.39165     -0.04008  
## 
## 
## [[4]]
## 
## Call:
## lm(formula = Time ~ Year, data = .)
## 
## Coefficients:
## (Intercept)         Year  
##     511.369       -0.198  
## 
## 
## [[5]]
## 
## Call:
## lm(formula = Time ~ Year, data = .)
## 
## Coefficients:
## (Intercept)         Year  
##    -50.9926       0.1468  
## 
## 
## [[6]]
## 
## Call:
## lm(formula = Time ~ Year, data = .)
## 
## Coefficients:
## (Intercept)         Year  
##    1504.175       -0.303  
## 
## 
## [[7]]
## 
## Call:
## lm(formula = Time ~ Year, data = .)
## 
## Coefficients:
## (Intercept)         Year  
##    8825.260       -3.496
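A tidier alternative to the do() calls above, as a sketch (do() is soft-deprecated in recent dplyr; nest()/map() returns the per-year slope for every event and sex in one table):

library(tidyverse)

olympic_running %>%
  as_tibble() %>%
  filter(!is.na(Time)) %>%
  nest(data = -c(Length, Sex)) %>%
  mutate(slope_per_year = map_dbl(data, ~ coef(lm(Time ~ Year, data = .x))[["Year"]])) %>%
  select(Length, Sex, slope_per_year)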

7.2.c - 7.2.d

Checking Residuals and Forecast

Had some trouble, but was able to get a prediction for one example, the men's 400 meter, and forecast the next two Olympics.

Based on the residuals, they look roughly normally distributed and centred close to zero. I would probably remove the outliers at the very beginning and start the series later than its original starting point.

or_clean <-
  
  olympic_running %>%
  filter(Year != 1916,
         Year != 1940,
         Year != 1944)

mens400 <-
  or_clean %>%
  filter(Length==400)%>%
  filter(Sex=="men")

mens400_model <-
  mens400 %>%
  model(model = TSLM(Time ~ trend()))

fc_mens_400 <-forecast(mens400_model)

fc_mens_400 %>%
  autoplot(mens400) + 
  labs(
    title = "400 Meter Mens Forecast"
    ,y = "seconds"
  )

mens400_linear <-
  mens400 %>%
  lm(formula = Time~Year)

checkresiduals(mens400_linear)

## 
##  Breusch-Godfrey test for serial correlation of order up to 6
## 
## data:  Residuals
## LM test = 3.6082, df = 6, p-value = 0.7295
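Extending the same idea to every event at once, as a sketch that mirrors the filtering above (it simply drops the war-year gaps and fits a linear trend per event and sex, which also spells out the main assumption: that winning times keep falling linearly):

all_events_tslm <-
  olympic_running %>%
  filter(!Year %in% c(1916, 1940, 1944)) %>%
  model(TSLM(Time ~ trend()))

all_events_tslm %>%
  forecast(h = 1) %>% #next Games, i.e. 2020
  hilo(level = 95)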

7.3

An elasticity coefficient is the ratio of the percentage change in the forecast variable (y) to the percentage change in the predictor variable (x). Mathematically, the elasticity is defined as (dy/dx) × (x/y). Consider the log-log model, log y = β0 + β1 log x + ε. Express y as a function of x and show that the coefficient β1 is the elasticity coefficient.

Answer

1.- Start from the log-log model: log(y) = β0 + β1·log(x) + e

2.- Exponentiate both sides to express y as a function of x: y = e^(β0 + e) · x^β1

3.- Differentiate with respect to x: dy/dx = β1 · e^(β0 + e) · x^(β1 - 1) = β1 · (y / x)

4.- Therefore the elasticity is (dy/dx) × (x/y) = β1 · (y/x) · (x/y) = β1

This shows that the coefficient β1 is the elasticity coefficient: for every 1 (one) percent change in x (the independent variable), the dependent variable y changes by approximately β1 percent.

7.4

The data set souvenirs concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

  1. Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.

  2. Explain why it is necessary to take logarithms of these data before fitting a model.

  3. Fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.

  4. Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?

  5. Do boxplots of the residuals for each month. Does this reveal any problems with the model?

  6. What do the values of the coefficients tell you about each variable?

  7. What does the Ljung-Box test tell you about your model?

  8. Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.

  9. How could you improve these predictions by modifying the model?

7.4.a

Answer

The series shows seasonality, increasing variance (non-stationary behaviour), and a rising trend. Therefore, a forecast could reasonably continue the trend seen in the latest observations.

#a.

head(souvenirs)

autoplot(souvenirs)+
  labs(x = "Date"
       , y = "Sales"
       , title = "Souvenir Sales") +
  geom_smooth(method = "loess", formula = y ~ x, color="red") + 
  theme_ipsum_es() +
  scale_y_continuous(labels = scales::dollar)
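A seasonal plot makes the December peak and the March bump explicit; a sketch using gg_season() from the feasts package (log sales so the later years do not dominate):

souvenirs %>%
  gg_season(log(Sales), labels = "right") +
  labs(x = "Month"
       , y = "log Sales"
       , title = "Seasonal Plot of Souvenir Sales")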

7.4.b and c.

Answer

Because of its increasing variance and rising trend, a logarithmic transformation is useful to bring the series onto a more manageable scale and to stabilise the increasing variance. Further, after decomposing the data with an ETS model, we can see that it automatically identifies a seasonal period of 12, meaning this monthly data has a strong seasonal component, plus an upward trend.

We first have to assign a 1 to every March (month number 3) from 1988 onward and zeros to the remaining months. Initially I don't know how much this will help since, seasonally, sales peak during December. From the regression output the dummy variable does not appear to help much and, if anything, could be inflating the variance, since March and the dummy variable appear to be the least statistically significant terms.

#b.

souvenirs %>%
  model(ETS(Sales)) %>%
  components() %>%
  autoplot() +
  #labs(subtitle = "ETS Decomposition of Souvenir Sales") +
  scale_y_continuous(labels = scales::number)

#c.

# borrowed from online rpubs (fancy)

# log_souv <-log(ts(souvenirs, frequency =12)
# 
# dummy_surf = rep(0, length(souvenirs))
# dummy_surf[seq_along(dummy) %% 12 == 3] <- 1
# dummy_surf[3]<-0
# dummy_surf <-ts(dummy.surf, frequency = 12, start=c(1987,1))
# 
# log.souv.df <-data.frame(log_souv,dummy_surf)
##log.souv.df
# 
# model_fit <-tslm(log_souv ~ trend + season + dummy_surf, data = log.souv.df)
# 
# fc_surf <- data.frame(dummy_surf = rep(0,12))
# 
# fc_surf[12,] <- 1 
# fc_souv<-data.frame(forecast(model_fit, newdata = fc_surf))
# 
##fc_souv
# 
##log.souv.df
# 
# fc_souv$Point.Forecast %>%
# autoplot(log.souv.df)

### Borrowed from Tim White rpubs, again.
# his code is neat and makes sense. Finally understood how a for loop works.

date_loop <- as.Date(souvenirs$Month)

surf_loop <- c()

for(i in 1:length(date_loop)) {
  
  if(year(date_loop[i])>=1988 & month(date_loop[i]) == 3) {
    
    surf_loop[i] <- 1
    
  } else {
    
    surf_loop[i] <- 0
    
  }
  
}
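The same dummy can be built without a loop; a vectorized sketch using the date_loop vector created above:

#vectorized equivalent: 1 for every March from 1988 onward, 0 otherwise
surf_vec <- as.numeric(year(date_loop) >= 1988 & month(date_loop) == 3)
identical(surf_vec, surf_loop) #should be TRUE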

souv_model <- 
  souvenirs %>%
  add_column(surf_dummy = surf_loop) %>%
  add_column(log_sales = log(souvenirs$Sales)) %>%
  add_column(season_month = as.factor(month(souvenirs$Month)))

souvenirs_tslm <-
  souv_model %>%
  model(tslm = TSLM(log_sales~trend()+as.factor(season_month)+surf_dummy))

report(souvenirs_tslm)
## Series: log_sales 
## Model: TSLM 
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.336727 -0.127571  0.002568  0.109106  0.376714 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               7.6196670  0.0742471 102.626  < 2e-16 ***
## trend()                   0.0220198  0.0008268  26.634  < 2e-16 ***
## as.factor(season_month)2  0.2514168  0.0956790   2.628 0.010555 *  
## as.factor(season_month)3  0.2660828  0.1934044   1.376 0.173275    
## as.factor(season_month)4  0.3840535  0.0957075   4.013 0.000148 ***
## as.factor(season_month)5  0.4094870  0.0957325   4.277 5.88e-05 ***
## as.factor(season_month)6  0.4488283  0.0957647   4.687 1.33e-05 ***
## as.factor(season_month)7  0.6104545  0.0958039   6.372 1.71e-08 ***
## as.factor(season_month)8  0.5879644  0.0958503   6.134 4.53e-08 ***
## as.factor(season_month)9  0.6693299  0.0959037   6.979 1.36e-09 ***
## as.factor(season_month)10 0.7473919  0.0959643   7.788 4.48e-11 ***
## as.factor(season_month)11 1.2067479  0.0960319  12.566  < 2e-16 ***
## as.factor(season_month)12 1.9622412  0.0961066  20.417  < 2e-16 ***
## surf_dummy                0.5015151  0.1964273   2.553 0.012856 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared: 0.9567,  Adjusted R-squared: 0.9487
## F-statistic:   119 on 13 and 70 DF, p-value: < 2.22e-16
######## -borrowed from Lucas Kaplan, rpubs

# note: end = c(1993, 6) truncates the series at June 1993 (78 of the 84 months)
sts <- ts(souvenirs$Sales, start=c(1987,1), end=c(1993,6), frequency=12)
logsales = log(sts)

surfing_festival <- c()
for(i in 1:length(sts)){
  # note: sts[i] here is a sales value, not a date; time(sts)[i] would give the
  # decimal-year index, so this dummy does not actually line up with calendar March
  month <- round(12*(sts[i] - floor(sts[i]))) +1
  year <- floor(sts[i])
  if(year >= 1988 & month == 3){
    surfing_festival[i] <- 1
  } else {
    surfing_festival[i] <- 0
  }
}

surfmodel = tslm(logsales ~ trend + season + surfing_festival)

summary(surfmodel)
## 
## Call:
## tslm(formula = logsales ~ trend + season + surfing_festival)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45174 -0.11922 -0.00074  0.10175  0.35933 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.6464607  0.0773528  98.852  < 2e-16 ***
## trend            0.0212957  0.0009292  22.919  < 2e-16 ***
## season2          0.2521410  0.0979991   2.573 0.012414 *  
## season3          0.6932048  0.0990785   6.997 1.89e-09 ***
## season4          0.3820296  0.0991135   3.854 0.000271 ***
## season5          0.4123836  0.0980652   4.205 8.27e-05 ***
## season6          0.4524491  0.0981048   4.612 1.96e-05 ***
## season7          0.5767690  0.1019961   5.655 3.91e-07 ***
## season8          0.5352140  0.1033731   5.177 2.43e-06 ***
## season9          0.6237063  0.1034004   6.032 8.93e-08 ***
## season10         0.7223515  0.1020342   7.080 1.35e-09 ***
## season11         1.1936391  0.1020638  11.695  < 2e-16 ***
## season12         1.9447112  0.1021019  19.047  < 2e-16 ***
## surfing_festival 0.0293746  0.1001947   0.293 0.770337    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1833 on 64 degrees of freedom
## Multiple R-squared:  0.9444, Adjusted R-squared:  0.9331 
## F-statistic: 83.67 on 13 and 64 DF,  p-value: < 2.2e-16

7.4.d

Answer

When overlaying a loess (local polynomial) smooth, we can see a clear relationship between time and the residuals; this makes sense since there is very strong seasonality and non-linearity. The residuals plotted against the fitted values show a wide dispersion with no clear heteroscedasticity, so the model appears acceptable; however, there is something of a funnel shape in the most recent residuals that could call the model into question.

#d.


ggp_restime <- 
  ggplot(data = souv_model,
         aes(x = Month,
             y = residuals(souvenirs_tslm)$".resid")) +
  theme_ipsum_es() +
  geom_smooth(method="loess", formula = y ~x, colour="navyblue") +
  geom_smooth(method="lm", formula = y ~x, colour="red")+
  geom_point() +
  labs(x = "Month",
       y = "Regr Residuals",
       title = "Residuals and Time")

ggp_restime

ggp_resfitted <-
  ggplot(data = souv_model,
         aes(x = fitted(souvenirs_tslm)$".fitted",
             y = residuals(souvenirs_tslm)$".resid")) +
  theme_ipsum_es() +
  geom_point() +
  geom_smooth(method="lm", formula = y~ x, color = "red")+
  geom_smooth(method="loess", formula = y ~x, colour="navyblue") +
  labs(x = "Regression Fitted",
       y = "Regression Residuals",
       title = "Residuals vs. Fitted")

ggp_resfitted

7.4.e and f.

Answer

  1. The box plot of the residuals shows that March, as the model already suggested, is likely to be the least accurately predicted month. The mid-year months, particularly June and July (not August), have the smallest residuals and therefore the best chance of accurate predictions. However, given the overall variability in the other months, this model may not do well where variability is greater, especially from August through the end of the year.

  2. The summary tables below show the coefficients per month and for the dummy variable (matching the models fitted above). Because the response is log sales, each monthly coefficient is relative to January and converts to a multiplicative effect via exp(): December's coefficient of roughly 1.94-1.96, for example, implies sales around e^1.95 ≈ 7 times January's level rather than a simple 195% increase. March sits in between, at about 0.27-0.69 depending on the model. The festival dummy behaves very differently in the two fits: roughly +3% for surfmodel versus about +65% (e^0.50 − 1) for souvenirs_tslm, most likely because surfmodel's dummy was built from the sales values rather than the dates. (See the exp() conversion sketch after the model summaries below.)

boxplot(residuals(surfmodel) ~ cycle(residuals(surfmodel))
        , xlab = "Month Number"
        , ylab = "Residuals"
        , main = "Residuals Box Plot"
        , col = brewer.pal(4, "Set2")
        ) 

knitr::kable(data.frame(surfmodel$coefficients),"html")
surfmodel.coefficients
(Intercept) 7.6464607
trend 0.0212957
season2 0.2521410
season3 0.6932048
season4 0.3820296
season5 0.4123836
season6 0.4524491
season7 0.5767690
season8 0.5352140
season9 0.6237063
season10 0.7223515
season11 1.1936391
season12 1.9447112
surfing_festival 0.0293746
report(souvenirs_tslm)
## Series: log_sales 
## Model: TSLM 
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.336727 -0.127571  0.002568  0.109106  0.376714 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               7.6196670  0.0742471 102.626  < 2e-16 ***
## trend()                   0.0220198  0.0008268  26.634  < 2e-16 ***
## as.factor(season_month)2  0.2514168  0.0956790   2.628 0.010555 *  
## as.factor(season_month)3  0.2660828  0.1934044   1.376 0.173275    
## as.factor(season_month)4  0.3840535  0.0957075   4.013 0.000148 ***
## as.factor(season_month)5  0.4094870  0.0957325   4.277 5.88e-05 ***
## as.factor(season_month)6  0.4488283  0.0957647   4.687 1.33e-05 ***
## as.factor(season_month)7  0.6104545  0.0958039   6.372 1.71e-08 ***
## as.factor(season_month)8  0.5879644  0.0958503   6.134 4.53e-08 ***
## as.factor(season_month)9  0.6693299  0.0959037   6.979 1.36e-09 ***
## as.factor(season_month)10 0.7473919  0.0959643   7.788 4.48e-11 ***
## as.factor(season_month)11 1.2067479  0.0960319  12.566  < 2e-16 ***
## as.factor(season_month)12 1.9622412  0.0961066  20.417  < 2e-16 ***
## surf_dummy                0.5015151  0.1964273   2.553 0.012856 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared: 0.9567,  Adjusted R-squared: 0.9487
## F-statistic:   119 on 13 and 70 DF, p-value: < 2.22e-16
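Because the response is log sales, the coefficients convert to approximate percentage effects with exp(); a quick sketch using the two fits above:

#multiplicative effect of each term, as a percent change relative to the baseline
round(100 * (exp(coef(surfmodel)) - 1), 1)

souvenirs_tslm %>%
  tidy() %>%
  mutate(pct_effect = 100 * (exp(estimate) - 1)) %>%
  select(term, estimate, pct_effect)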

7.4.g

There seems to be autocorrelation at nearby lags (neighbouring months) in the residuals; both the Breusch-Godfrey and Ljung-Box tests below reject the hypothesis of no remaining autocorrelation.

checkresiduals(surfmodel)

## 
##  Breusch-Godfrey test for serial correlation of order up to 17
## 
## data:  Residuals from Linear regression model
## LM test = 32.739, df = 17, p-value = 0.01216
# borrowed from Tim too, since I also used his logic on the model.

augment(souvenirs_tslm) %>%
  features(.innov, ljung_box, lag = 20, dof = 14)
## # A tibble: 1 x 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 tslm      88.8         0
ACF_res <-
  augment(souvenirs_tslm) %>%
  ACF(.innov) %>%
  autoplot() +
  labs(x = "lag",
       y = "ACF",
       title = "ACF Plot",
       subtitle = "Sales Model Residuals") +
  theme_ipsum_es()

ACF_res

7.4.h and i.

Answer

The forecast below is in log sales. Having worked with highly seasonal data in the past, mostly with ARIMA models, this TSLM model would actually work reasonably well even without adding more variables. I would remove the dummy variable, since it did not appear to add much beyond the seasonality itself, and let the historical data carry the information for that particular month. However, I could consider other variables, such as weather (rainy days versus dry days, temperature), and possibly test disposable income against souvenir sales to see whether it explains some of the variation, say, people spending less on souvenirs when times are tough and more when times are good.

fc_loop<- rep(0, 36)
for(i in 1:36){
  if(i %% 12 == 3){
    fc_loop[i] <- 1
  }
}
souv_fc<-ts(data=fc_loop,start=1994,
                end=c(1996,12),
                frequency=12)

souv_fc_final<-forecast(surfmodel, newdata=souv_fc)

autoplot(souv_fc_final) +
  labs(x = "Date",
       y = "Log Forecast",
       title = "Final Forecast with Model",
       subtitle = "Despite not having excellent results") +
  theme_ipsum_es()
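A fable-native version of the same forecast, with explicit prediction intervals on the log scale; a sketch that assumes the souvenirs_tslm fit above and builds the three years of future regressors by hand:

#36 months of future data (1994-1996) with the regressors the model expects
future_souv <- new_data(souv_model, n = 36) %>%
  mutate(season_month = as.factor(month(as.Date(Month)))
         , surf_dummy = as.numeric(month(as.Date(Month)) == 3))

souvenirs_tslm %>%
  forecast(new_data = future_souv) %>%
  hilo(level = c(80, 95))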

7.5

The us_gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “million barrels per day”. Consider only the data to the end of 2004.

  1. Fit a harmonic regression with trend to the data. Experiment with changing the number Fourier terms. Plot the observed gasoline and fitted values and comment on what you see.

  2. Select the appropriate number of Fourier terms to include by minimizing the AICc or CV value.

  3. Plot the residuals of the final model using the gg_tsresiduals() function and comment on these. Use a Ljung-Box test to check for residual autocorrelation.

  4. Generate forecasts for the next year of data and plot these along with the actual data for 2005. Comment on the forecasts.

7.5.a.b.c

Answer

I only tried three values for the number of Fourier terms: K = 4, 7, and 10. From these initial results I would use K = 7, which has a good adjusted R2 and the best BIC, at the cost of a marginally worse AICc and CV than K = 10. The residuals show some autocorrelation, but not necessarily a strong one.

#filter data to before 2005
us_gas_2004<-
us_gasoline %>%
  filter(year(Week)<2005)%>%
  mutate("MA" = slider::slide_dbl(Barrels, mean,.before = 4, .after = 4, .complete = TRUE))

#plot data initially
us_gas_2004 %>%
autoplot(Barrels) +
  geom_line(col = "lightblue")+
  geom_line(aes(y=MA), color="#D55E00")+
  labs(x = "Date",
       y = "Barrels",
       title = "Gasoline in the U.S.",
       subtitle = "Million Barrels per day") +
  theme_ipsum_es()

# uses the weekly `gasoline` ts object from the fpp2/forecast ecosystem rather than the us_gasoline tsibble
gas <- window(gasoline, end=2005)

gas_K4 = tslm(gas ~ trend + fourier(gas, K=4))
gas_K7 = tslm(gas ~ trend + fourier(gas, K=7))
gas_K10= tslm(gas ~ trend + fourier(gas, K=10))

autoplot(gas) + 
  labs(x = "Date"
    , y = "Million Barrels"
    , title = "Harmonic Regression"
  ) +
  theme_ipsum_es() +
  autolayer(fitted(gas_K4)) + 
  autolayer(fitted(gas_K7)) + 
  autolayer(fitted(gas_K10))

knitr::kable(as.data.frame(CV(gas_K4)),"html")
CV(gas_K4)
CV 0.0764861
AIC -1864.1120233
AICc -1863.7422754
BIC -1813.6489732
AdjR2 0.8382404
knitr::kable(as.data.frame(CV(gas_K7)),"html")
CV(gas_K7)
CV 0.0714734
AIC -1913.5362459
AICc -1912.6718391
BIC -1835.5478956
AdjR2 0.8501073
knitr::kable(as.data.frame(CV(gas_K10)),"html")
CV(gas_K10)
CV 0.0713575
AIC -1915.0136582
AICc -1913.4410086
BIC -1809.5000079
AdjR2 0.8516102
checkresiduals(gas_K7$residuals)
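The same comparison can be done natively in fable on the us_gasoline tsibble; a sketch where fourier() is the TSLM special and glance() returns the AICc and CV used for the selection in part b:

gas_fit <- us_gasoline %>%
  filter(year(Week) < 2005) %>%
  model(
    K4  = TSLM(Barrels ~ trend() + fourier(K = 4)),
    K7  = TSLM(Barrels ~ trend() + fourier(K = 7)),
    K10 = TSLM(Barrels ~ trend() + fourier(K = 10))
  )

glance(gas_fit) %>%
  select(.model, adj_r_squared, CV, AIC, AICc, BIC)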

7.5.d

Answer

The forecast does not look terrible in terms of trend, though the small week-to-week ups and downs are not always captured. Furthermore, when supply dropped sharply in late 2005, the model was not able to absorb the full downturn. Still, it does not appear to be a model that should be discarded, but rather one to improve.

#run the model and forecast
fcst = forecast(gas_K7, 
               newdata = data.frame(fourier(gas, 7, 52)))


fcst %>%
  kbl()%>%
  kable_material(c("striped","hover", full_width = T))
Time      Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
2005.014 8.812756 8.469266 9.156246 8.287026 9.338486
2005.033 8.666945 8.323371 9.010518 8.141087 9.192803
2005.052 8.599822 8.256209 8.943435 8.073903 9.125740
2005.071 8.620059 8.276458 8.963660 8.094159 9.145959
2005.090 8.701192 8.357636 9.044748 8.175361 9.227023
2005.110 8.801076 8.457564 9.144588 8.275312 9.326840
2005.129 8.885137 8.541645 9.228630 8.359403 9.410872
2005.148 8.940023 8.596528 9.283517 8.414286 9.465760
2005.167 8.972165 8.628667 9.315664 8.446422 9.497908
2005.186 8.995783 8.652288 9.339277 8.470046 9.521519
2005.205 9.020393 8.676906 9.363880 8.494667 9.546119
2005.225 9.046109 8.702625 9.389592 8.520389 9.571829
2005.244 9.067947 8.724464 9.411430 8.542227 9.593666
2005.263 9.083648 8.740163 9.427132 8.557927 9.609369
2005.282 9.097734 8.754247 9.441220 8.572009 9.623458
2005.301 9.118445 8.774957 9.461934 8.592718 9.644173
2005.320 9.150463 8.806975 9.493951 8.624736 9.676190
2005.340 9.189916 8.846430 9.533401 8.664192 9.715639
2005.359 9.226458 8.882974 9.569941 8.700738 9.752178
2005.378 9.251462 8.907979 9.594945 8.725742 9.777181
2005.397 9.266112 8.922628 9.609595 8.740391 9.791832
2005.416 9.282567 8.939081 9.626052 8.756844 9.808290
2005.435 9.316150 8.972663 9.659637 8.790424 9.841875
2005.455 9.373402 9.029915 9.716889 8.847676 9.899127
2005.474 9.444701 9.101216 9.788186 8.918978 9.970424
2005.493 9.507761 9.164277 9.851245 8.982041 10.033481
2005.512 9.540941 9.197458 9.884424 9.015222 10.066660
2005.531 9.537887 9.194403 9.881370 9.012166 10.063607
2005.550 9.513043 9.169558 9.856528 8.987321 10.038766
2005.570 9.492990 9.149503 9.836476 8.967265 10.018715
2005.589 9.498121 9.154635 9.841608 8.972396 10.023847
2005.608 9.526608 9.183122 9.870093 9.000885 10.052331
2005.627 9.552267 9.208783 9.895751 9.026546 10.077988
2005.646 9.539617 9.196134 9.883100 9.013898 10.065337
2005.665 9.467863 9.124380 9.811347 8.942143 9.993583
2005.685 9.348603 9.005118 9.692088 8.822881 9.874326
2005.704 9.224634 8.881147 9.568121 8.698908 9.750359
2005.723 9.148483 8.804995 9.491971 8.622756 9.674209
2005.742 9.152341 8.808855 9.495828 8.626618 9.678065
2005.761 9.227562 8.884078 9.571046 8.701841 9.753283
2005.780 9.327163 8.983680 9.670647 8.801444 9.852883
2005.800 9.391261 9.047777 9.734744 8.865541 9.916980
2005.819 9.381140 9.037654 9.724625 8.855417 9.906863
2005.838 9.301798 8.958308 9.645287 8.776069 9.827526
2005.857 9.199081 8.855590 9.542572 8.673349 9.724812
2005.876 9.132561 8.789072 9.476049 8.606832 9.658289
2005.895 9.139896 8.796410 9.483382 8.614173 9.665619
2005.915 9.213689 8.870205 9.557174 8.687968 9.739411
2005.934 9.304509 8.961025 9.647993 8.778788 9.830230
2005.953 9.348279 9.004789 9.691768 8.822550 9.874008
2005.972 9.302031 8.958513 9.645549 8.776259 9.827804
2005.991 9.167552 8.823967 9.511136 8.641677 9.693427
#prepare test
gas_test = window(gasoline, start = 2005, end = 2006)

#plot results
#head(fcst)

fcst_plot <- fcst

# use a name that doesn't mask forecast::forecast()
fcst_mean <- fcst_plot$mean

autoplot(gas_test) + 
  theme_ipsum()+
  labs(x = "Test Dates 52 weeks in 2005"
       , y = "Million Barrels per Day"
       , title = "US Gasoline Supply Forecast")+
  autolayer(fcst_mean)

7.6

The annual population of Afghanistan is available in the global_economy data set.

  1. Plot the data and comment on its features. Can you observe the effect of the Soviet-Afghan war?

  2. Fit a linear trend model and compare this to a piecewise linear trend model with knots at 1980 and 1989. Generate forecasts from these two models for the five years after the end of the data, and comment on the results.

7.6.a

Answer

Yes, from about 1980 through about 1987 we can see the population declining. This could be for several reasons during the Soviet-Afghan war, particularly people moving out of Afghanistan.

af_pop <-
  global_economy %>%
  filter(Code == "AFG") %>%
  mutate(Population = Population/1000000) 

af_pop %>%
  autoplot(Population) +
  labs(x = "Date"
       ,y = "Millions of People"
       ,title = "Population in Afghanistan") +
  theme_ipsum_pub() +
  geom_line(col="blue")

7.6.b and c

Answer

The linear model simply fits a single trend to the whole period, so the war-era dip skews its results. The piecewise model looks more accurate: it follows the trend of the last few years while taking into account the dip seen from 1980 to 1989. The piecewise model is therefore clearly the preferred of the two, with a higher R-squared. I would still question whether this model might be overfitting and thus give misleading forecasts.

af_pop_trend <-
  af_pop %>%
  model(linear = TSLM(Population ~ trend()),
        piecewise = TSLM(Population ~ trend(knots = c(1980, 1989)))
        )

#report(af_pop_trend)

af_pop_trend %>%
  kbl()%>%
  kable_material(c("striped","hover", full_width = T))
Country linear piecewise
Afghanistan <TSLM> <TSLM>
af_pop_fcst <- af_pop_trend %>%
  forecast(h = 5)

af_pop %>%
  autoplot(Population) + 
  geom_line(col = "navyblue")+
  labs(y = "Million People"
       , x = "Date"
       , title = "Population in Afghanistan") +
  geom_line(data = fitted(af_pop_trend),
            aes(y = .fitted, color = .model)) +
  autolayer(af_pop_fcst, alpha = 0.5, level = 95) +
  theme_ipsum_ps()
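To back up the comparison, the fit statistics can be pulled from the mable with glance(); a quick sketch using the af_pop_trend object above:

glance(af_pop_trend) %>%
  select(.model, adj_r_squared, AICc, CV)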