HW#4 Answers

Question 1

Half-hourly electricity demand for Victoria, Australia is contained in vic_elec. Extract the January 2014 electricity demand, and aggregate this data to daily with daily total demands and maximum temperatures.

Plot the data, plot the scatter plot bettwen demand and temperature, and find the regression model for Demand with temperature as a predictor variable. Why is there a positive relationship?

jan_vic_elec <- vic_elec %>%
     filter(yearmonth(Time) == yearmonth("2014 Jan")) %>%
     index_by(Date = as_date(Time)) %>%
     summarise(Demand = sum(Demand), Temperature = max(Temperature))

jan_vic_elec %>%
  pivot_longer(2:3, names_to="key", values_to="value")%>%
  autoplot(.vars = value) +
  facet_grid(vars(key), scales = "free_y")

jan_vic_elec %>%
  ggplot(aes(x = Temperature, y = Demand)) +
  geom_point()

fit <- jan_vic_elec %>%
  model(TSLM(Demand ~ Temperature))
fit %>% report()

## Series: Demand 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -49978.2 -10218.9   -121.3  18533.2  35440.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  59083.9    17424.8   3.391  0.00203 ** 
## Temperature   6154.3      601.3  10.235 3.89e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared: 0.7832,  Adjusted R-squared: 0.7757
## F-statistic: 104.7 on 1 and 29 DF, p-value: 3.8897e-11

Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?

fit %>% gg_tsresiduals()

The residuals from this section of the data suggest that the model is lacking a trend (although looking at a larger window of the data, this is clearly not a real pattern). Some large variability suggests that there are some outliers in this data.

Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was \(15^\circ\text{C}\) and compare it with the forecast if the with maximum temperature was \(35^\circ\text{C}\). Do you believe these forecasts?

next_day <- scenarios(
  `Cold day` = new_data(jan_vic_elec, 1) %>% mutate(Temperature = 15),
  `Hot day` =  new_data(jan_vic_elec, 1) %>% mutate(Temperature = 35))

fc <- fit %>%
  forecast(new_data = next_day)
  autoplot(jan_vic_elec, Demand) +
  autolayer(fc)

The forecasts seem reasonable. However we should be aware that there is not much data to support the forecasts at these temperature extremes, especially in that no daily maximum below \(20^\circ\text{C}\) is observed during January (a summer month in Victoria).

Give prediction intervals for your forecasts.

fc %>%
  hilo() %>%
  select(-.model)

## # A tsibble: 2 x 7 [1D]
## # Key:       .scenario [2]
##   .scenario Date                   Demand   .mean Temperature
##   <chr>     <date>                 <dist>   <dbl>       <dbl>
## 1 Cold day  2014-02-01 N(151398, 6.8e+08) 151398.          15
## 2 Hot day   2014-02-01 N(274484, 6.4e+08) 274484.          35
## # … with 2 more variables: `80%` <hilo>, `95%` <hilo>

Plot Demand vs Temperature for all of the available data in vic_elec aggregated to daily total demand and maximum temperature. What does this say about your model?

vic_elec %>%
  index_by(Date = as_date(Time)) %>%
  summarise(Demand = sum(Demand), Temperature = max(Temperature)) %>%
  ggplot(aes(x = Temperature, y = Demand)) +
  geom_point()

The “U” shaped scatterplot suggests that there is a non-linear relationship between electricity demand and daily maximum temperature. The model assumption that electricity demand is linearly predicted by temperature is incorrect.

Question 2

Data set olympic_running contains the winning times (in seconds) in each Olympic Games sprint, middle-distance and long-distance track events from 1896 to 2016.

Plot the winning time against the year by sex and lenght. Describe the main features of the plot.

olympic_running %>%
  ggplot(aes(x = Year, y = Time, colour = Sex)) +
  geom_line() +
  geom_point(size = 1) +
  facet_wrap(~Length, scales = "free_y", nrow = 2) +
  theme_minimal() +
  scale_color_brewer(palette = "Dark2") +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  labs(y = "Running time (seconds)")

## Warning: Removed 31 rows containing missing values (geom_point).

The running times are generally decreasing as time progresses (although the rate of this decline is slowing down in recent olympics). There are some missing values in the data corresponding to the World Wars (in which the Olympic Games were not held).

Fit a regression line to the data. Obviously the winning times have been decreasing, but at what average rate per year?

fit <- olympic_running %>%
  model(TSLM(Time ~ trend()))
tidy(fit)

## # A tibble: 28 × 8
##    Length Sex   .model               term  estimate std.error statistic  p.value
##     <int> <chr> <chr>                <chr>    <dbl>     <dbl>     <dbl>    <dbl>
##  1    100 men   TSLM(Time ~ trend()) (Int…  11.1      0.0909     123.   1.86e-37
##  2    100 men   TSLM(Time ~ trend()) tren…  -0.0504   0.00479    -10.5  7.24e-11
##  3    100 women TSLM(Time ~ trend()) (Int…  12.4      0.163       76.0  4.58e-25
##  4    100 women TSLM(Time ~ trend()) tren…  -0.0567   0.00749     -7.58 3.72e- 7
##  5    200 men   TSLM(Time ~ trend()) (Int…  22.3      0.140      159.   4.06e-39
##  6    200 men   TSLM(Time ~ trend()) tren…  -0.0995   0.00725    -13.7  3.80e-13
##  7    200 women TSLM(Time ~ trend()) (Int…  25.5      0.525       48.6  8.34e-19
##  8    200 women TSLM(Time ~ trend()) tren…  -0.135    0.0227      -5.92 2.17e- 5
##  9    400 men   TSLM(Time ~ trend()) (Int…  50.3      0.445      113.   1.53e-36
## 10    400 men   TSLM(Time ~ trend()) tren…  -0.258    0.0235     -11.0  2.75e-11
## # … with 18 more rows

The men’s 100 running time has been decreasing by an average of 0.013 seconds each year.
or 0.05 every 4 years The women’s 100 running time has been decreasing by an average of 0.014 seconds each year.
or 0.057 every 4 years The men’s 200 running time has been decreasing by an average of 0.025 seconds each year.
or 0.1 every 4 years The women’s 200 running time has been decreasing by an average of 0.034 seconds each year.
or 0.135 every 4 years The men’s 400 running time has been decreasing by an average of 0.065 seconds each year.
or 0.258 every 4 years The women’s 400 running time has been decreasing by an average of 0.04 seconds each year.
or 0.16 every 4 years The men’s 800 running time has been decreasing by an average of 0.152 seconds each year.
or 0.607 every 4 years The women’s 800 running time has been decreasing by an average of 0.198 seconds each year.
or 0.792 every 4 years The men’s 1500 running time has been decreasing by an average of 0.315 seconds each year.
or 1.26 every 4 years The women’s 1500 running time has been increasing by an average of 0.147 seconds each year.
or 0.587 every 4 years The men’s 5000 running time has been decreasing by an average of 1.03 seconds each year.
or 4.12 every 4 years The women’s 5000 running time has been decreasing by an average of 0.303 seconds each year.
or 1.212 every 4 years The men’s 10000 running time has been decreasing by an average of 2.666 seconds each year.
or 10.664 every 4 years The women’s 10000 running time has been decreasing by an average of 3.496 seconds each year.
or 13.985 every 4 years

Plot the residuals against the year. What does this indicate about the suitability of the fitted line?

augment(fit) %>%
  ggplot(aes(x = Year, y = .innov, colour = Sex)) +
  geom_line() +
  geom_point(size = 1) +
  facet_wrap(~Length, scales = "free_y", nrow = 2) +
  theme_minimal() +
  scale_color_brewer(palette = "Dark2") +
  theme(legend.position = "bottom", legend.title = element_blank())

## Warning: Removed 31 rows containing missing values (geom_point).

It doesn’t seem that the linear trend is appropriate for this data.

Predict the winning time for each race in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?

fit %>%
  forecast(h = 1) %>%
  mutate(PI = hilo(Time, 95)) %>%
  select(-.model)

## # A fable: 14 x 6 [4Y]
## # Key:     Length, Sex [14]
##    Length Sex    Year          Time   .mean                          PI
##     <int> <chr> <dbl>        <dist>   <dbl>                      <hilo>
##  1    100 men    2020 N(9.5, 0.061)    9.53 [   9.049446,   10.01957]95
##  2    100 women  2020  N(11, 0.059)   10.5  [  10.056992,   11.01183]95
##  3    200 men    2020   N(19, 0.13)   19.1  [  18.423087,   19.81179]95
##  4    200 women  2020   N(21, 0.31)   21.2  [  20.106551,   22.29293]95
##  5    400 men    2020    N(42, 1.5)   42.0  [  39.668596,   44.41603]95
##  6    400 women  2020    N(48, 1.4)   48.4  [  46.123295,   50.74923]95
##  7    800 men    2020     N(99, 13)   99.2  [  92.137378,  106.33802]95
##  8    800 women  2020    N(111, 14)  111.   [ 104.105124,  118.86174]95
##  9   1500 men    2020    N(207, 74)  207.   [ 190.091940,  223.90462]95
## 10   1500 women  2020    N(245, 35)  245.   [ 233.786710,  257.12632]95
## 11   5000 men    2020   N(773, 285)  773.   [ 739.735488,  805.91363]95
## 12   5000 women  2020  N(892, 1562)  892.   [ 814.664722,  969.56461]95
## 13  10000 men    2020  N(1580, 970) 1580.   [1519.300480, 1641.40196]95
## 14  10000 women  2020  N(1763, 530) 1763.   [1717.895285, 1808.13543]95

Using a linear trend we assume that winning times will decrease in a linear fashion which is unrealistic for running times. As we saw from the residual plots above there are mainly large positive residuals for the last few years, indicating that the decreases in the winning times are not linear. We also assume that the residuals are normally distributed.

Question 3

The data set souvenirs concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.

souvenirs %>% autoplot(Sales)

Features of the data:

Seasonal data – similar scaled pattern repeats every year
A spike every March (except for year 1987) is the influence of the surfing festival
The size of the pattern increases proportionally to the level of sales

Explain why it is necessary to take logarithms of these data before fitting a model.

The last feature above suggests taking logs to make the pattern (and variance) more stable

# Taking logarithms of the data
souvenirs %>% autoplot(log(Sales))

After taking logs, the trend looks more linear and the seasonal variation is roughly constant.

Fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable. Plot the fitted values and the actual sales in one graph.

fit <- souvenirs %>%
  mutate(festival = month(Month) == 3 & year(Month) != 1987) %>%
  model(reg = TSLM(log(Sales) ~ trend() + season() + festival))

  souvenirs %>%
  autoplot(Sales, col = "gray") +
  geom_line(data = augment(fit), aes(y = .fitted), col = "blue")

Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?

fit %>% gg_tsresiduals()

The residuals are serially correlated. This is both visible from the time plot but also from the ACF. The residuals reveal nonlinearity in the trend.

augment(fit) %>%
  ggplot(aes(x = .fitted, y = .innov)) +
  geom_point() +
  scale_x_log10()

The plot of residuals against fitted values looks fine - no notable patterns emerge. We take logarithms of fitted values because we took logs in the model.

Do boxplots of the residuals for each month. Does this reveal any problems with the model?

augment(fit) %>%
  mutate(month = month(Month, label = TRUE)) %>%
  ggplot(aes(x = month, y = .innov)) +
  geom_boxplot()

The boxplots show differences in variation across the months revealing some potential heteroscedasticity.

What do the values of the coefficients tell you about each variable?

tidy(fit) %>% mutate(pceffect = (exp(estimate) - 1) * 100)

## # A tibble: 14 × 7
##    .model term           estimate std.error statistic  p.value  pceffect
##    <chr>  <chr>             <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
##  1 reg    (Intercept)      7.62    0.0742      103.   4.67e-78 203688.  
##  2 reg    trend()          0.0220  0.000827     26.6  2.32e-38      2.23
##  3 reg    season()year2    0.251   0.0957        2.63 1.06e- 2     28.6 
##  4 reg    season()year3    0.266   0.193         1.38 1.73e- 1     30.5 
##  5 reg    season()year4    0.384   0.0957        4.01 1.48e- 4     46.8 
##  6 reg    season()year5    0.409   0.0957        4.28 5.88e- 5     50.6 
##  7 reg    season()year6    0.449   0.0958        4.69 1.33e- 5     56.6 
##  8 reg    season()year7    0.610   0.0958        6.37 1.71e- 8     84.1 
##  9 reg    season()year8    0.588   0.0959        6.13 4.53e- 8     80.0 
## 10 reg    season()year9    0.669   0.0959        6.98 1.36e- 9     95.3 
## 11 reg    season()year10   0.747   0.0960        7.79 4.48e-11    111.  
## 12 reg    season()year11   1.21    0.0960       12.6  1.29e-19    234.  
## 13 reg    season()year12   1.96    0.0961       20.4  3.39e-31    612.  
## 14 reg    festivalTRUE     0.502   0.196         2.55 1.29e- 2     65.1

(Intercept) is not interpretable.
trend coefficient shows that with every month sales increases on average by 2.2%.
season2 coefficient shows that February sales are greater than January on average by 28.6%, after allowing for the trend.
…
season12 coefficient shows that December sales are greater than January on average by 611.5%, after allowing for the trend.
festivalTRUE coefficient shows that for months that include the surfing festival, sales increases on average by 65.1% compared to months without the festival, after allowing for the trend and seasonality.

What does the Ljung-Box test tell you about your model?

augment(fit) %>%
  features(.innov, ljung_box, dof = 14, lag = 24)

## # A tibble: 1 × 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 reg       112.         0

The serial correlation in the residuals is significant.

Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.

future_souvenirs <- new_data(souvenirs, n = 36) %>%
  mutate(festival = month(Month) == 3)

fit %>%
  forecast(new_data = future_souvenirs) %>%
  autoplot(souvenirs)

How could you improve these predictions by modifying the model?

The model can be improved by taking into account nonlinearity of the trend.

Question 4

The us_gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “thousand barrels per day”. Consider only the data to the end of 2004.

Fit a harmonic regression with trend to the data. Experiment with changing the number Fourier terms. Plot the observed gasoline and fitted values and comment on what you see.

gas <- us_gasoline %>% filter(year(Week) <= 2004)
gas %>% autoplot(Barrels)

fit <- gas %>%
  model(
    fourier1 = TSLM(Barrels ~ trend() + fourier(K = 1)),
    fourier2 = TSLM(Barrels ~ trend() + fourier(K = 2)),
    fourier3 = TSLM(Barrels ~ trend() + fourier(K = 3)),
    fourier4 = TSLM(Barrels ~ trend() + fourier(K = 4)),
    fourier5 = TSLM(Barrels ~ trend() + fourier(K = 5)),
    fourier6 = TSLM(Barrels ~ trend() + fourier(K = 6)),
    fourier7 = TSLM(Barrels ~ trend() + fourier(K = 7)),
    fourier8 = TSLM(Barrels ~ trend() + fourier(K = 8)),
    fourier9 = TSLM(Barrels ~ trend() + fourier(K = 9)),
    fourier10 = TSLM(Barrels ~ trend() + fourier(K = 10)),
    fourier11 = TSLM(Barrels ~ trend() + fourier(K = 11)),
    fourier12 = TSLM(Barrels ~ trend() + fourier(K = 12)),
    fourier13 = TSLM(Barrels ~ trend() + fourier(K = 13)),
    fourier14 = TSLM(Barrels ~ trend() + fourier(K = 14)),
    fourier15 = TSLM(Barrels ~ trend() + fourier(K = 15))
  )


glance(fit) %>%
  arrange(AICc) %>%
  select(.model, AICc, CV)

## # A tibble: 15 × 3
##    .model      AICc     CV
##    <chr>      <dbl>  <dbl>
##  1 fourier7  -1887. 0.0740
##  2 fourier11 -1885. 0.0742
##  3 fourier8  -1885. 0.0743
##  4 fourier12 -1884. 0.0742
##  5 fourier6  -1884. 0.0744
##  6 fourier9  -1882. 0.0745
##  7 fourier10 -1881. 0.0746
##  8 fourier13 -1881. 0.0745
##  9 fourier14 -1879. 0.0748
## 10 fourier15 -1876. 0.0750
## 11 fourier5  -1862. 0.0766
## 12 fourier4  -1856. 0.0773
## 13 fourier3  -1853. 0.0777
## 14 fourier2  -1814. 0.0819
## 15 fourier1  -1809. 0.0825

# Best model has order 7
gas %>%
  autoplot(Barrels, col = "gray") +
  geom_line(
    data = augment(fit) %>% filter(.model == "fourier7"),
    aes(y = .fitted), col = "blue"
  )

This seems to have captured the annual seasonality quite well.

Select the appropriate number of Fourier terms to include by minimising the AICc or CV value.

As above.

Check the residuals of the final model using the gg_tsresiduals() function. Use a Ljung-Box test to check for residual autocorrelation.

fit %>%
  select(fourier7) %>%
  gg_tsresiduals()

There is some small remaining autocorrelation at lag 1.

augment(fit) %>%
  filter(.model == "fourier7") %>%
  features(.innov, ljung_box, dof = 16, lag = 52)

## # A tibble: 1 × 3
##   .model   lb_stat lb_pvalue
##   <chr>      <dbl>     <dbl>
## 1 fourier7    56.9    0.0148

The Ljung-Box test is significant, showing there is a (very) small amount of autocorrelation left. Given the size of the autocorrelations, this is unlikely to be of any consequence.

Even though the residuals fail the correlation tests, the results are probably not severe enough to make much difference to the forecasts and prediction intervals. (Note that the correlations are relatively small, even though they are significant.)

Generate forecasts for the next year of data. Plot these along with the actual data for 2005. Comment on the forecasts.

fit %>%
  select("fourier7") %>%
  forecast(h = "1 year") %>%
  autoplot(filter_index(us_gasoline, "2000" ~ "2006"))

The forecasts fit pretty well for the first six months, but not so well after that.

Question 5

The annual population of Afghanistan is available in the global_economy data set.

Plot the data and comment on its features. Can you observe the effect of the Soviet-Afghan war?

global_economy %>%
  filter(Country == "Afghanistan") %>%
  autoplot(Population / 1e6) +
  labs(y = "Population (millions)") +
  geom_ribbon(aes(xmin = 1979.98, xmax = 1989.13), fill = "pink", alpha = 0.4) +
  annotate("text", x = 1984.5, y = 10, label = "Soviet-Afghan war", col = "red", size = 3)

The population increases slowly from 1960 to 1980, then decreases during the Soviet-Afghan war (24 Dec 1979 – 15 Feb 1989), and has increased relatively rapidly since then. The last 30 years has shown an almost linear increase in population.

Fit a linear trend model and compare this to a piecewise linear trend model with knots at 1980 and 1989.

fit <- global_economy %>%
  filter(Country == "Afghanistan") %>%
  model(
    linear = TSLM(Population ~ trend()),
    piecewise = TSLM(Population ~ trend(knots = c(1980, 1989))))

tidy(fit)

## # A tibble: 6 × 7
##   Country     .model    term               estimate std.error statistic  p.value
##   <fct>       <chr>     <chr>                 <dbl>     <dbl>     <dbl>    <dbl>
## 1 Afghanistan linear    (Intercept)        4798904.   848259.      5.66 5.45e- 7
## 2 Afghanistan linear    trend()             425774.    25008.     17.0  8.37e-24
## 3 Afghanistan piecewise (Intercept)        8697573.   131122.     66.3  1.97e-53
## 4 Afghanistan piecewise trend(knots = c(1…  224372.     9623.     23.3  7.32e-30
## 5 Afghanistan piecewise trend(knots = c(1… -456803.    24498.    -18.6  3.41e-25
## 6 Afghanistan piecewise trend(knots = c(1… 1082782.    21418.     50.6  3.67e-47

glance(fit)

## # A tibble: 2 × 16
##   Country     .model    r_squared adj_r_squared  sigma2 statistic  p_value    df
##   <fct>       <chr>         <dbl>         <dbl>   <dbl>     <dbl>    <dbl> <int>
## 1 Afghanistan linear        0.838         0.835 1.02e13      290. 8.37e-24     2
## 2 Afghanistan piecewise     0.999         0.999 9.05e10    12929. 4.34e-77     4
## # … with 8 more variables: log_lik <dbl>, AIC <dbl>, AICc <dbl>, BIC <dbl>,
## #   CV <dbl>, deviance <dbl>, df.residual <int>, rank <int>

augment(fit) %>%
  autoplot(.fitted) +
  geom_line(aes(y = Population), colour = "black")

The fitted values show that the piecewise linear model has tracked the data very closely, while the linear model is inaccurate.

Generate forecasts from these two models for the five years after the end of the data, and comment on the results.

fc <- fit %>% forecast(h = "5 years")
autoplot(fc) +
  autolayer(filter(global_economy %>% filter(Country == "Afghanistan")), Population)

The linear model is clearly incorrect with prediction intervals too wide, and the point forecasts too low.

The piecewise linear model looks good, but the prediction intervals are probably too narrow. This model assumes that the trend since the last knot will continue unchanged, which is unrealistic.

Question 6

List and explain each assumption of the OLS
What is the difference between unbiased and consistent estimator?
What is the different between \(R^2\) and adjusted \(R^2\)?