Half-hourly electricity demand for Victoria, Australia is contained
in vic_elec. Extract the January 2014 electricity demand,
and aggregate this data to daily with daily total demands and maximum
temperatures.
jan_vic_elec <- vic_elec %>%
filter(yearmonth(Time) == yearmonth("2014 Jan")) %>%
index_by(Date = as_date(Time)) %>%
summarise(Demand = sum(Demand), Temperature = max(Temperature))
jan_vic_elec %>%
pivot_longer(2:3, names_to="key", values_to="value")%>%
autoplot(.vars = value) +
facet_grid(vars(key), scales = "free_y")
jan_vic_elec %>%
ggplot(aes(x = Temperature, y = Demand)) +
geom_point()
fit <- jan_vic_elec %>%
model(TSLM(Demand ~ Temperature))
fit %>% report()
## Series: Demand
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -49978.2 -10218.9 -121.3 18533.2 35440.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59083.9 17424.8 3.391 0.00203 **
## Temperature 6154.3 601.3 10.235 3.89e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared: 0.7832, Adjusted R-squared: 0.7757
## F-statistic: 104.7 on 1 and 29 DF, p-value: 3.8897e-11
fit %>% gg_tsresiduals()
The residuals from this section of the data suggest that the model is lacking a trend (although looking at a larger window of the data, this is clearly not a real pattern). Some large variability suggests that there are some outliers in this data.
next_day <- scenarios(
`Cold day` = new_data(jan_vic_elec, 1) %>% mutate(Temperature = 15),
`Hot day` = new_data(jan_vic_elec, 1) %>% mutate(Temperature = 35))
fc <- fit %>%
forecast(new_data = next_day)
autoplot(jan_vic_elec, Demand) +
autolayer(fc)
The forecasts seem reasonable. However we should be aware that there is not much data to support the forecasts at these temperature extremes, especially in that no daily maximum below \(20^\circ\text{C}\) is observed during January (a summer month in Victoria).
fc %>%
hilo() %>%
select(-.model)
## # A tsibble: 2 x 7 [1D]
## # Key: .scenario [2]
## .scenario Date Demand .mean Temperature
## <chr> <date> <dist> <dbl> <dbl>
## 1 Cold day 2014-02-01 N(151398, 6.8e+08) 151398. 15
## 2 Hot day 2014-02-01 N(274484, 6.4e+08) 274484. 35
## # … with 2 more variables: `80%` <hilo>, `95%` <hilo>
vic_elec aggregated to daily total demand and maximum
temperature. What does this say about your model?vic_elec %>%
index_by(Date = as_date(Time)) %>%
summarise(Demand = sum(Demand), Temperature = max(Temperature)) %>%
ggplot(aes(x = Temperature, y = Demand)) +
geom_point()
The “U” shaped scatterplot suggests that there is a non-linear relationship between electricity demand and daily maximum temperature. The model assumption that electricity demand is linearly predicted by temperature is incorrect.
Data set olympic_running contains the winning times (in
seconds) in each Olympic Games sprint, middle-distance and long-distance
track events from 1896 to 2016.
olympic_running %>%
ggplot(aes(x = Year, y = Time, colour = Sex)) +
geom_line() +
geom_point(size = 1) +
facet_wrap(~Length, scales = "free_y", nrow = 2) +
theme_minimal() +
scale_color_brewer(palette = "Dark2") +
theme(legend.position = "bottom", legend.title = element_blank()) +
labs(y = "Running time (seconds)")
## Warning: Removed 31 rows containing missing values (geom_point).
The running times are generally decreasing as time progresses (although the rate of this decline is slowing down in recent olympics). There are some missing values in the data corresponding to the World Wars (in which the Olympic Games were not held).
fit <- olympic_running %>%
model(TSLM(Time ~ trend()))
tidy(fit)
## # A tibble: 28 × 8
## Length Sex .model term estimate std.error statistic p.value
## <int> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 100 men TSLM(Time ~ trend()) (Int… 11.1 0.0909 123. 1.86e-37
## 2 100 men TSLM(Time ~ trend()) tren… -0.0504 0.00479 -10.5 7.24e-11
## 3 100 women TSLM(Time ~ trend()) (Int… 12.4 0.163 76.0 4.58e-25
## 4 100 women TSLM(Time ~ trend()) tren… -0.0567 0.00749 -7.58 3.72e- 7
## 5 200 men TSLM(Time ~ trend()) (Int… 22.3 0.140 159. 4.06e-39
## 6 200 men TSLM(Time ~ trend()) tren… -0.0995 0.00725 -13.7 3.80e-13
## 7 200 women TSLM(Time ~ trend()) (Int… 25.5 0.525 48.6 8.34e-19
## 8 200 women TSLM(Time ~ trend()) tren… -0.135 0.0227 -5.92 2.17e- 5
## 9 400 men TSLM(Time ~ trend()) (Int… 50.3 0.445 113. 1.53e-36
## 10 400 men TSLM(Time ~ trend()) tren… -0.258 0.0235 -11.0 2.75e-11
## # … with 18 more rows
The men’s 100 running time has been decreasing by an average of 0.013
seconds each year.
or 0.05 every 4 years The women’s 100 running
time has been decreasing by an average of 0.014 seconds each year.
or 0.057 every 4 years The men’s 200 running time has been decreasing by
an average of 0.025 seconds each year.
or 0.1 every 4 years The
women’s 200 running time has been decreasing by an average of 0.034
seconds each year.
or 0.135 every 4 years The men’s 400 running time
has been decreasing by an average of 0.065 seconds each year.
or
0.258 every 4 years The women’s 400 running time has been decreasing by
an average of 0.04 seconds each year.
or 0.16 every 4 years The
men’s 800 running time has been decreasing by an average of 0.152
seconds each year.
or 0.607 every 4 years The women’s 800 running
time has been decreasing by an average of 0.198 seconds each year.
or 0.792 every 4 years The men’s 1500 running time has been decreasing
by an average of 0.315 seconds each year.
or 1.26 every 4 years The
women’s 1500 running time has been increasing by an average of 0.147
seconds each year.
or 0.587 every 4 years The men’s 5000 running
time has been decreasing by an average of 1.03 seconds each year.
or
4.12 every 4 years The women’s 5000 running time has been decreasing by
an average of 0.303 seconds each year.
or 1.212 every 4 years The
men’s 10000 running time has been decreasing by an average of 2.666
seconds each year.
or 10.664 every 4 years The women’s 10000 running
time has been decreasing by an average of 3.496 seconds each year.
or 13.985 every 4 years
augment(fit) %>%
ggplot(aes(x = Year, y = .innov, colour = Sex)) +
geom_line() +
geom_point(size = 1) +
facet_wrap(~Length, scales = "free_y", nrow = 2) +
theme_minimal() +
scale_color_brewer(palette = "Dark2") +
theme(legend.position = "bottom", legend.title = element_blank())
## Warning: Removed 31 rows containing missing values (geom_point).
It doesn’t seem that the linear trend is appropriate for this data.
fit %>%
forecast(h = 1) %>%
mutate(PI = hilo(Time, 95)) %>%
select(-.model)
## # A fable: 14 x 6 [4Y]
## # Key: Length, Sex [14]
## Length Sex Year Time .mean PI
## <int> <chr> <dbl> <dist> <dbl> <hilo>
## 1 100 men 2020 N(9.5, 0.061) 9.53 [ 9.049446, 10.01957]95
## 2 100 women 2020 N(11, 0.059) 10.5 [ 10.056992, 11.01183]95
## 3 200 men 2020 N(19, 0.13) 19.1 [ 18.423087, 19.81179]95
## 4 200 women 2020 N(21, 0.31) 21.2 [ 20.106551, 22.29293]95
## 5 400 men 2020 N(42, 1.5) 42.0 [ 39.668596, 44.41603]95
## 6 400 women 2020 N(48, 1.4) 48.4 [ 46.123295, 50.74923]95
## 7 800 men 2020 N(99, 13) 99.2 [ 92.137378, 106.33802]95
## 8 800 women 2020 N(111, 14) 111. [ 104.105124, 118.86174]95
## 9 1500 men 2020 N(207, 74) 207. [ 190.091940, 223.90462]95
## 10 1500 women 2020 N(245, 35) 245. [ 233.786710, 257.12632]95
## 11 5000 men 2020 N(773, 285) 773. [ 739.735488, 805.91363]95
## 12 5000 women 2020 N(892, 1562) 892. [ 814.664722, 969.56461]95
## 13 10000 men 2020 N(1580, 970) 1580. [1519.300480, 1641.40196]95
## 14 10000 women 2020 N(1763, 530) 1763. [1717.895285, 1808.13543]95
Using a linear trend we assume that winning times will decrease in a linear fashion which is unrealistic for running times. As we saw from the residual plots above there are mainly large positive residuals for the last few years, indicating that the decreases in the winning times are not linear. We also assume that the residuals are normally distributed.
The data set souvenirs concerns the monthly sales
figures of a shop which opened in January 1987 and sells gifts,
souvenirs, and novelties. The shop is situated on the wharf at a beach
resort town in Queensland, Australia. The sales volume varies with the
seasonal population of tourists. There is a large influx of visitors to
the town at Christmas and for the local surfing festival, held every
March since 1988. Over time, the shop has expanded its premises, range
of products, and staff.
souvenirs %>% autoplot(Sales)
Features of the data:
The last feature above suggests taking logs to make the pattern (and variance) more stable
# Taking logarithms of the data
souvenirs %>% autoplot(log(Sales))
After taking logs, the trend looks more linear and the seasonal variation is roughly constant.
fit <- souvenirs %>%
mutate(festival = month(Month) == 3 & year(Month) != 1987) %>%
model(reg = TSLM(log(Sales) ~ trend() + season() + festival))
souvenirs %>%
autoplot(Sales, col = "gray") +
geom_line(data = augment(fit), aes(y = .fitted), col = "blue")
fit %>% gg_tsresiduals()
The residuals are serially correlated. This is both visible from the time plot but also from the ACF. The residuals reveal nonlinearity in the trend.
augment(fit) %>%
ggplot(aes(x = .fitted, y = .innov)) +
geom_point() +
scale_x_log10()
The plot of residuals against fitted values looks fine - no notable patterns emerge. We take logarithms of fitted values because we took logs in the model.
augment(fit) %>%
mutate(month = month(Month, label = TRUE)) %>%
ggplot(aes(x = month, y = .innov)) +
geom_boxplot()
The boxplots show differences in variation across the months revealing some potential heteroscedasticity.
tidy(fit) %>% mutate(pceffect = (exp(estimate) - 1) * 100)
## # A tibble: 14 × 7
## .model term estimate std.error statistic p.value pceffect
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 reg (Intercept) 7.62 0.0742 103. 4.67e-78 203688.
## 2 reg trend() 0.0220 0.000827 26.6 2.32e-38 2.23
## 3 reg season()year2 0.251 0.0957 2.63 1.06e- 2 28.6
## 4 reg season()year3 0.266 0.193 1.38 1.73e- 1 30.5
## 5 reg season()year4 0.384 0.0957 4.01 1.48e- 4 46.8
## 6 reg season()year5 0.409 0.0957 4.28 5.88e- 5 50.6
## 7 reg season()year6 0.449 0.0958 4.69 1.33e- 5 56.6
## 8 reg season()year7 0.610 0.0958 6.37 1.71e- 8 84.1
## 9 reg season()year8 0.588 0.0959 6.13 4.53e- 8 80.0
## 10 reg season()year9 0.669 0.0959 6.98 1.36e- 9 95.3
## 11 reg season()year10 0.747 0.0960 7.79 4.48e-11 111.
## 12 reg season()year11 1.21 0.0960 12.6 1.29e-19 234.
## 13 reg season()year12 1.96 0.0961 20.4 3.39e-31 612.
## 14 reg festivalTRUE 0.502 0.196 2.55 1.29e- 2 65.1
(Intercept) is not interpretable.trend coefficient shows that with every month sales
increases on average by 2.2%.season2 coefficient shows that February sales are
greater than January on average by 28.6%, after allowing for the
trend.season12 coefficient shows that December sales are
greater than January on average by 611.5%, after allowing for the
trend.festivalTRUE coefficient shows that for months that
include the surfing festival, sales increases on average by 65.1%
compared to months without the festival, after allowing for the trend
and seasonality.augment(fit) %>%
features(.innov, ljung_box, dof = 14, lag = 24)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 reg 112. 0
The serial correlation in the residuals is significant.
future_souvenirs <- new_data(souvenirs, n = 36) %>%
mutate(festival = month(Month) == 3)
fit %>%
forecast(new_data = future_souvenirs) %>%
autoplot(souvenirs)
The model can be improved by taking into account nonlinearity of the trend.
The us_gasoline series consists of weekly data for
supplies of US finished motor gasoline product, from 2 February 1991 to
20 January 2017. The units are in “thousand barrels per day”. Consider
only the data to the end of 2004.
gas <- us_gasoline %>% filter(year(Week) <= 2004)
gas %>% autoplot(Barrels)
fit <- gas %>%
model(
fourier1 = TSLM(Barrels ~ trend() + fourier(K = 1)),
fourier2 = TSLM(Barrels ~ trend() + fourier(K = 2)),
fourier3 = TSLM(Barrels ~ trend() + fourier(K = 3)),
fourier4 = TSLM(Barrels ~ trend() + fourier(K = 4)),
fourier5 = TSLM(Barrels ~ trend() + fourier(K = 5)),
fourier6 = TSLM(Barrels ~ trend() + fourier(K = 6)),
fourier7 = TSLM(Barrels ~ trend() + fourier(K = 7)),
fourier8 = TSLM(Barrels ~ trend() + fourier(K = 8)),
fourier9 = TSLM(Barrels ~ trend() + fourier(K = 9)),
fourier10 = TSLM(Barrels ~ trend() + fourier(K = 10)),
fourier11 = TSLM(Barrels ~ trend() + fourier(K = 11)),
fourier12 = TSLM(Barrels ~ trend() + fourier(K = 12)),
fourier13 = TSLM(Barrels ~ trend() + fourier(K = 13)),
fourier14 = TSLM(Barrels ~ trend() + fourier(K = 14)),
fourier15 = TSLM(Barrels ~ trend() + fourier(K = 15))
)
glance(fit) %>%
arrange(AICc) %>%
select(.model, AICc, CV)
## # A tibble: 15 × 3
## .model AICc CV
## <chr> <dbl> <dbl>
## 1 fourier7 -1887. 0.0740
## 2 fourier11 -1885. 0.0742
## 3 fourier8 -1885. 0.0743
## 4 fourier12 -1884. 0.0742
## 5 fourier6 -1884. 0.0744
## 6 fourier9 -1882. 0.0745
## 7 fourier10 -1881. 0.0746
## 8 fourier13 -1881. 0.0745
## 9 fourier14 -1879. 0.0748
## 10 fourier15 -1876. 0.0750
## 11 fourier5 -1862. 0.0766
## 12 fourier4 -1856. 0.0773
## 13 fourier3 -1853. 0.0777
## 14 fourier2 -1814. 0.0819
## 15 fourier1 -1809. 0.0825
# Best model has order 7
gas %>%
autoplot(Barrels, col = "gray") +
geom_line(
data = augment(fit) %>% filter(.model == "fourier7"),
aes(y = .fitted), col = "blue"
)
This seems to have captured the annual seasonality quite well.
As above.
gg_tsresiduals() function. Use a Ljung-Box test to check
for residual autocorrelation.fit %>%
select(fourier7) %>%
gg_tsresiduals()
There is some small remaining autocorrelation at lag 1.
augment(fit) %>%
filter(.model == "fourier7") %>%
features(.innov, ljung_box, dof = 16, lag = 52)
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 fourier7 56.9 0.0148
The Ljung-Box test is significant, showing there is a (very) small amount of autocorrelation left. Given the size of the autocorrelations, this is unlikely to be of any consequence.
Even though the residuals fail the correlation tests, the results are probably not severe enough to make much difference to the forecasts and prediction intervals. (Note that the correlations are relatively small, even though they are significant.)
fit %>%
select("fourier7") %>%
forecast(h = "1 year") %>%
autoplot(filter_index(us_gasoline, "2000" ~ "2006"))
The forecasts fit pretty well for the first six months, but not so well after that.
The annual population of Afghanistan is available in the
global_economy data set.
global_economy %>%
filter(Country == "Afghanistan") %>%
autoplot(Population / 1e6) +
labs(y = "Population (millions)") +
geom_ribbon(aes(xmin = 1979.98, xmax = 1989.13), fill = "pink", alpha = 0.4) +
annotate("text", x = 1984.5, y = 10, label = "Soviet-Afghan war", col = "red", size = 3)
The population increases slowly from 1960 to 1980, then decreases during the Soviet-Afghan war (24 Dec 1979 – 15 Feb 1989), and has increased relatively rapidly since then. The last 30 years has shown an almost linear increase in population.
fit <- global_economy %>%
filter(Country == "Afghanistan") %>%
model(
linear = TSLM(Population ~ trend()),
piecewise = TSLM(Population ~ trend(knots = c(1980, 1989))))
tidy(fit)
## # A tibble: 6 × 7
## Country .model term estimate std.error statistic p.value
## <fct> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan linear (Intercept) 4798904. 848259. 5.66 5.45e- 7
## 2 Afghanistan linear trend() 425774. 25008. 17.0 8.37e-24
## 3 Afghanistan piecewise (Intercept) 8697573. 131122. 66.3 1.97e-53
## 4 Afghanistan piecewise trend(knots = c(1… 224372. 9623. 23.3 7.32e-30
## 5 Afghanistan piecewise trend(knots = c(1… -456803. 24498. -18.6 3.41e-25
## 6 Afghanistan piecewise trend(knots = c(1… 1082782. 21418. 50.6 3.67e-47
glance(fit)
## # A tibble: 2 × 16
## Country .model r_squared adj_r_squared sigma2 statistic p_value df
## <fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 Afghanistan linear 0.838 0.835 1.02e13 290. 8.37e-24 2
## 2 Afghanistan piecewise 0.999 0.999 9.05e10 12929. 4.34e-77 4
## # … with 8 more variables: log_lik <dbl>, AIC <dbl>, AICc <dbl>, BIC <dbl>,
## # CV <dbl>, deviance <dbl>, df.residual <int>, rank <int>
augment(fit) %>%
autoplot(.fitted) +
geom_line(aes(y = Population), colour = "black")
The fitted values show that the piecewise linear model has tracked the data very closely, while the linear model is inaccurate.
fc <- fit %>% forecast(h = "5 years")
autoplot(fc) +
autolayer(filter(global_economy %>% filter(Country == "Afghanistan")), Population)
The linear model is clearly incorrect with prediction intervals too wide, and the point forecasts too low.
The piecewise linear model looks good, but the prediction intervals are probably too narrow. This model assumes that the trend since the last knot will continue unchanged, which is unrealistic.
List and explain each assumption of the OLS
What is the difference between unbiased and consistent estimator?
What is the different between \(R^2\) and adjusted \(R^2\)?