These are exercises from Chapter 7 of “Forecasting: Principles and Practice”
Plot the data and find the regression model for Demand with temperature as a predictor variable. Why is there a positive relationship?
Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15°C, and compare it with the forecast if the maximum temperature was 35°C. Do you believe these forecasts?
Answer
It seems intuitive that as temperature rises, so does electricity demand, say, to run AC units longer and possibly at higher intensity. From the linear regression we can see that, on average, every one-degree increase in maximum temperature is associated with an increase in daily electricity demand of about 6,154 megawatts (MW).
#a.
jan14_vic_elec <- vic_elec %>%
filter(yearmonth(Time) == yearmonth("2014 Jan")) %>%
index_by(Date = as_date(Time)) %>%
summarise(
Demand = sum(Demand),
Temperature = max(Temperature)
)
jan14_vic_elec %>%
ggplot(aes(y=Demand, x=Temperature)) +
theme_ipsum_ps() +
labs(y = "Electricity Demand in MW"
, x = "Temperature"
, title = "Demand Electricity in Victoria"
, subtitle = "January 2014, Temperature") +
geom_point() +
geom_smooth(formula = y ~ x, method="lm")
fit <- jan14_vic_elec %>%
model(TSLM(Demand ~ Temperature))
report(fit)
## Series: Demand
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -49978.2 -10218.9 -121.3 18533.2 35440.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59083.9 17424.8 3.391 0.00203 **
## Temperature 6154.3 601.3 10.235 3.89e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared: 0.7832, Adjusted R-squared: 0.7757
## F-statistic: 104.7 on 1 and 29 DF, p-value: 3.8897e-11
regmodel <-lm(Demand ~ Temperature, data = jan14_vic_elec)
summary(regmodel)
##
## Call:
## lm(formula = Demand ~ Temperature, data = jan14_vic_elec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49978 -10219 -121 18533 35441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 59083.9 17424.8 3.391 0.00203 **
## Temperature 6154.3 601.3 10.235 3.89e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24540 on 29 degrees of freedom
## Multiple R-squared: 0.7832, Adjusted R-squared: 0.7757
## F-statistic: 104.7 on 1 and 29 DF, p-value: 3.89e-11
Answer
Judging by the residuals, there does not appear to be a trend or autocorrelation in the data, so the model could be considered reasonably adequate. However, a few outliers are present in this simple linear regression, which could skew the model into over-predicting electricity demand beyond a certain threshold.
layout(matrix(1:4, byrow = F, ncol = 2))
fit %>%
gg_tsresiduals() +
labs(title = "Residuals")
plot(regmodel)
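To put rough numbers on the possible outliers and influential observations, Cook's distance and leverage from the base-R fit can be inspected; a minimal sketch using the regmodel object above (the 4/n cut-off is just a common rule of thumb, not a hard rule):
#flag observations with a large Cook's distance (rule of thumb: > 4/n)
cooks <- cooks.distance(regmodel)
influential <- which(cooks > 4 / length(cooks))
jan14_vic_elec[influential, ]
#corresponding leverage (hat) values
hatvalues(regmodel)[influential]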
Answer
The model forecasts that at a maximum temperature of 35 degrees Celsius the expected daily electricity demand would be around 274,000 MW, and at around 15 degrees Celsius it would be around 151,000 MW.
future_scenarios <- scenarios(
Temp15 = new_data(jan14_vic_elec) %>%
mutate(Temperature=15),
Temp35 = new_data(jan14_vic_elec) %>%
mutate(Temperature=35),
names_to = "Scenario")
fc <- forecast(fit, new_data = future_scenarios)
jan14_vic_elec %>%
autoplot(Demand) +
autolayer(fc) +
labs(title = "Electricity Demand"
, y = "Demand in MW"
, x = "Date") +
theme_ipsum_es()
Answer
The intervals are below:
int<-forecast(regmodel, newdata=data.frame(Temperature=c(15,35)))
int %>%
kbl()%>%
kable_material(c("striped", "hover"), full_width = TRUE)
|  | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
|---|---|---|---|---|---|
| Temperature = 15°C | 151398.4 | 117127.2 | 185669.5 | 97951.22 | 204845.5 |
| Temperature = 35°C | 274484.2 | 241333.0 | 307635.5 | 222783.69 | 326184.8 |
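The same intervals can also be pulled from the fable forecast object fc created earlier; a sketch using hilo() (the column names assume current fabletools conventions):
fc %>%
hilo(level = c(80, 95)) %>%
as_tibble() %>%
select(Scenario, Date, .mean, `80%`, `95%`)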
Answer
It looks like a J-shaped relationship, where, ignoring the time dimension, electricity demand does not really start to increase until temperatures pass roughly the 25 degrees Celsius mark.
ggplot(data = vic_elec,
aes(x = Temperature,
y = Demand)) +
geom_point() +
labs(x = "Temperature"
, y = "Electricity Demand"
, title = "All Temperature and Electricity Demand"
, subtitle = "All months") +
geom_smooth() +
theme_ipsum_es()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
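Given the J-shape, a single straight line across all temperatures would fit poorly. As a hedged sketch of one way to capture the curvature, a quadratic temperature term could be added to a TSLM on daily aggregates of the full series (the squared term is just an illustration, not necessarily the best functional form):
#daily totals for the whole series with a quadratic temperature term
vic_elec %>%
index_by(Date = as_date(Time)) %>%
summarise(Demand = sum(Demand), Temperature = max(Temperature)) %>%
model(TSLM(Demand ~ Temperature + I(Temperature^2))) %>%
report()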
Data set olympic_running contains the winning times (in seconds) in each Olympic Games sprint, middle-distance and long-distance track events from 1896 to 2016.
Plot the winning time against the year for each event. Describe the main features of the plot. Fit a regression line to the data for each event. Obviously the winning times have been decreasing, but at what average rate per year?
Plot the residuals against the year. What does this indicate about the suitability of the fitted lines?
Predict the winning time for each race in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?
Answer
We can clearly see that winning times have been decreasing over time for both sexes. Women, however, have lowered their times at a faster rate, possibly because of factors beyond physical capability, such as social ones. Times for the shorter distances continue to trend lower into the 2010s; times for the longer-distance races, however, have plateaued and are now increasing. Only the 400 meter races have moved in opposite directions for men and women over the most recent observations. The 1500 meter event contains too few data points for women to make a fair comparison, so I would not rely on that series.
#one by one
head(olympic_running)
## # A tsibble: 6 x 4 [4Y]
## # Key: Length, Sex [1]
## Year Length Sex Time
## <int> <int> <chr> <dbl>
## 1 1896 100 men 12
## 2 1900 100 men 11
## 3 1904 100 men 11
## 4 1908 100 men 10.8
## 5 1912 100 men 10.8
## 6 1916 100 men NA
unique(olympic_running$Length)
## [1] 100 200 400 800 1500 5000 10000
olympic_running %>%
filter(Length==100) %>%
autoplot(Time) +
labs(x = "Year"
, y = "Time in Seconds"
, title = "Running Times"
, subtitle = "100 m, By race") +
geom_smooth(method = "lm", formula = y ~ x) +
theme_ipsum_es()
olympic_running %>%
filter(Length==200) %>%
autoplot(Time) +
labs(x = "Year"
, y = "Time in Seconds"
, title = "Running Times"
, subtitle = "200 m, By race") +
geom_smooth(method = "lm", formula = y ~ x) +
theme_ipsum_es()
olympic_running %>%
filter(Length==400) %>%
autoplot(Time) +
labs(x = "Year"
, y = "Time in Seconds"
, title = "Running Times"
, subtitle = "400 m, By race") +
geom_smooth(method = "lm", formula = y ~ x) +
theme_ipsum_es()
olympic_running %>%
filter(Length==800) %>%
autoplot(Time) +
labs(x = "Year"
, y = "Time in Seconds"
, title = "Running Times"
, subtitle = "800 m, By race") +
geom_smooth(method = "lm", formula = y ~ x) +
theme_ipsum_es()
olympic_running %>%
filter(Length==1500) %>%
autoplot(Time) +
labs(x = "Year"
, y = "Time in Seconds"
, title = "Running Times"
, subtitle = "1500 m, By race") +
geom_smooth(method = "lm", formula = y ~ x) +
theme_ipsum_es()
olympic_running %>%
filter(Length==5000) %>%
autoplot(Time) +
labs(x = "Year"
, y = "Time in Seconds"
, title = "Running Times"
, subtitle = "5000 m, By race") +
geom_smooth(method = "lm", formula = y ~ x) +
theme_ipsum_es()
olympic_running %>%
filter(Length==10000) %>%
autoplot(Time) +
labs(x = "Year"
, y = "Time in Seconds"
, title = "Running Times"
, subtitle = "10000 m, By race") +
geom_smooth(method = "lm", formula = y ~ x) +
theme_ipsum_es()
Below are the coefficients by race distance and by gender. We can see different decreases over time per race and per sex.
#inspect data -- idea borrowed from Tim White: loop over the vector of race lengths to generate all charts instead of building each chart manually by distance (see the loop sketch after these charts).
head(olympic_running)
## # A tsibble: 6 x 4 [4Y]
## # Key: Length, Sex [1]
## Year Length Sex Time
## <int> <int> <chr> <dbl>
## 1 1896 100 men 12
## 2 1900 100 men 11
## 3 1904 100 men 11
## 4 1908 100 men 10.8
## 5 1912 100 men 10.8
## 6 1916 100 men NA
unique(olympic_running$Length)
## [1] 100 200 400 800 1500 5000 10000
olympic_running %>%
filter(Length==100) %>%
autoplot(Time) +
labs(x = "Year"
, y = "Time in Seconds"
, title = "Running Times"
, subtitle = "100 m, By race") +
geom_smooth(method = "loess", formula = y ~ x) +
theme_ipsum_es()
olympic_running %>%
filter(Length==200) %>%
autoplot(Time) +
labs(x = "Year"
, y = "Time in Seconds"
, title = "Running Times"
, subtitle = "200 m, By race") +
geom_smooth(method = "loess", formula = y ~ x) +
theme_ipsum_es()
olympic_running %>%
filter(Length==400) %>%
autoplot(Time) +
labs(x = "Year"
, y = "Time in Seconds"
, title = "Running Times"
, subtitle = "400 m, By race") +
geom_smooth(method = "loess", formula = y ~ x) +
theme_ipsum_es()
olympic_running %>%
filter(Length==800) %>%
autoplot(Time) +
labs(x = "Year"
, y = "Time in Seconds"
, title = "Running Times"
, subtitle = "800 m, By race") +
geom_smooth(method = "loess", formula = y ~ x) +
theme_ipsum_es()
olympic_running %>%
filter(Length==1500) %>%
autoplot(Time) +
labs(x = "Year"
, y = "Time in Seconds"
, title = "Running Times"
, subtitle = "1500 m, By race") +
geom_smooth(method = "loess", formula = y ~ x) +
theme_ipsum_es()
olympic_running %>%
filter(Length==5000) %>%
autoplot(Time) +
labs(x = "Year"
, y = "Time in Seconds"
, title = "Running Times"
, subtitle = "5000 m, By race") +
geom_smooth(method = "loess", formula = y ~ x) +
theme_ipsum_es()
olympic_running %>%
filter(Length==10000) %>%
autoplot(Time) +
labs(x = "Year"
, y = "Time in Seconds"
, title = "Running Times"
, subtitle = "10000 m, By race") +
geom_smooth(method = "loess", formula = y ~ x) +
theme_ipsum_es()
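The charts above are still generated one by one; the looping idea credited in the comment could look like the sketch below, which iterates over the distinct race lengths and prints one chart per distance (same plotting code as above, just parameterised by length):
#loop over race lengths instead of copying the plotting code per distance
for (len in unique(olympic_running$Length)) {
p <- olympic_running %>%
filter(Length == len) %>%
autoplot(Time) +
geom_smooth(method = "lm", formula = y ~ x) +
labs(x = "Year", y = "Time in Seconds", title = "Running Times", subtitle = paste(len, "m, By race"))
print(p)
}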
## Linear regression and coefficients
olympic_men <-
olympic_running %>%
filter(Sex=="men")
olympic_women <-
olympic_running %>%
filter(Sex=="women")
model_men <-
olympic_men %>%
group_by(Length) %>%
#model(TSLM(Time~trend()))
do(model=lm(Time~Year, data=.))
model_women <-
olympic_women %>%
group_by(Length) %>%
#model(TSLM(Time~trend()))
do(model=lm(Time~Year, data=.))
model_men$model
## [[1]]
##
## Call:
## lm(formula = Time ~ Year, data = .)
##
## Coefficients:
## (Intercept) Year
## 35.01158 -0.01261
##
##
## [[2]]
##
## Call:
## lm(formula = Time ~ Year, data = .)
##
## Coefficients:
## (Intercept) Year
## 69.37650 -0.02488
##
##
## [[3]]
##
## Call:
## lm(formula = Time ~ Year, data = .)
##
## Coefficients:
## (Intercept) Year
## 172.48148 -0.06457
##
##
## [[4]]
##
## Call:
## lm(formula = Time ~ Year, data = .)
##
## Coefficients:
## (Intercept) Year
## 405.8457 -0.1518
##
##
## [[5]]
##
## Call:
## lm(formula = Time ~ Year, data = .)
##
## Coefficients:
## (Intercept) Year
## 843.4367 -0.3151
##
##
## [[6]]
##
## Call:
## lm(formula = Time ~ Year, data = .)
##
## Coefficients:
## (Intercept) Year
## 2853.20 -1.03
##
##
## [[7]]
##
## Call:
## lm(formula = Time ~ Year, data = .)
##
## Coefficients:
## (Intercept) Year
## 6965.459 -2.666
model_women$model
## [[1]]
##
## Call:
## lm(formula = Time ~ Year, data = .)
##
## Coefficients:
## (Intercept) Year
## 39.18816 -0.01419
##
##
## [[2]]
##
## Call:
## lm(formula = Time ~ Year, data = .)
##
## Coefficients:
## (Intercept) Year
## 89.13761 -0.03363
##
##
## [[3]]
##
## Call:
## lm(formula = Time ~ Year, data = .)
##
## Coefficients:
## (Intercept) Year
## 129.39165 -0.04008
##
##
## [[4]]
##
## Call:
## lm(formula = Time ~ Year, data = .)
##
## Coefficients:
## (Intercept) Year
## 511.369 -0.198
##
##
## [[5]]
##
## Call:
## lm(formula = Time ~ Year, data = .)
##
## Coefficients:
## (Intercept) Year
## -50.9926 0.1468
##
##
## [[6]]
##
## Call:
## lm(formula = Time ~ Year, data = .)
##
## Coefficients:
## (Intercept) Year
## 1504.175 -0.303
##
##
## [[7]]
##
## Call:
## lm(formula = Time ~ Year, data = .)
##
## Coefficients:
## (Intercept) Year
## 8825.260 -3.496
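Instead of reading the slope out of each lm() object, the average rate of decrease per year for every event and sex can be collected in one table with a keyed TSLM; a sketch (this assumes the index Year can be used directly as a regressor, which keeps the slope in seconds per year):
#one TSLM per (Length, Sex) series; the Year coefficient is the average change in seconds per year
olympic_running %>%
model(lm = TSLM(Time ~ Year)) %>%
tidy() %>%
filter(term == "Year") %>%
select(Length, Sex, estimate)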
Checking Residuals and Forecast
I had some trouble but was able to get a prediction for one example, the men's 400 meter, and forecast the next two Olympics.
Based on the residuals, they look roughly normally distributed and centred close to zero. I would probably remove the outliers at the very beginning of the series, effectively starting from a later point than the original start.
or_clean <-
olympic_running %>%
filter(Year != 1916,
Year != 1940,
Year != 1944)
mens400 <-
or_clean %>%
filter(Length==400)%>%
filter(Sex=="men")
mens400_model <-
mens400 %>%
model(model = TSLM(Time ~ trend()))
fc_mens_400 <-forecast(mens400_model)
fc_mens_400 %>%
autoplot(mens400) +
labs(
title = "400 Meter Mens Forecast"
,y = "seconds"
)
mens400_linear <-
mens400 %>%
lm(formula = Time~Year)
checkresiduals(mens400_linear)
##
## Breusch-Godfrey test for serial correlation of order up to 6
##
## data: Residuals
## LM test = 3.6082, df = 6, p-value = 0.7295
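The 400 m example above can be extended to every event at once; a sketch using the cleaned data, one linear trend per series, and 95% prediction intervals from hilo() (this assumes the linear trend continues and that 2020 behaves like a normal Olympiad):
#fit one trend model per event/sex and forecast the next Games (2020)
or_clean %>%
model(TSLM(Time ~ trend())) %>%
forecast(h = 1) %>%
hilo(level = 95) %>%
as_tibble() %>%
select(Length, Sex, Year, .mean, `95%`)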
An elasticity coefficient is the ratio of the percentage change in the forecast variable (y) to the percentage change in the predictor variable (x). Mathematically, the elasticity is defined as (dy/dx) × (x/y). Consider the log-log model, log y = β0 + β1 log x + ε. Express y as a function of x and show that the coefficient β1 is the elasticity coefficient.
Answer
1.- log(y) = β0 + β1 log(x) + ε
2.- y = exp(β0 + ε) × x^β1
3.- dy/dx = β1 × exp(β0 + ε) × x^(β1 - 1) = β1 × (y / x)
4.- (dy/dx) × (x/y) = β1
So β1 is exactly the elasticity coefficient: for every 1 (one) percent change in x (the independent variable), the dependent variable y changes by approximately β1 percent.
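A quick numerical sanity check of this result (a sketch with made-up numbers: simulate data with a known elasticity and confirm that the log-log regression recovers it):
#simulate y = exp(2) * x^0.8 * noise, so the true elasticity is 0.8 (hypothetical values)
set.seed(1)
x <- runif(500, 1, 100)
y <- exp(2) * x^0.8 * exp(rnorm(500, sd = 0.1))
coef(lm(log(y) ~ log(x)))["log(x)"]
#the fitted coefficient on log(x) should be close to 0.8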
The data set souvenirs concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
Explain why it is necessary to take logarithms of these data before fitting a model.
Fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.
Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
Do boxplots of the residuals for each month. Does this reveal any problems with the model?
What do the values of the coefficients tell you about each variable?
What does the Ljung-Box test tell you about your model?
Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
How could you improve these predictions by modifying the model?
Answer
The series shows seasonality, with increasing variance (non-stationary) and a rising trend. Therefore, a forecast could be expected to continue the trend seen in the latest observations.
#a.
head(souvenirs)
autoplot(souvenirs)+
labs(x = "Date"
, y = "Sales"
, title = "Souvenir Sales") +
geom_smooth(method = "loess", formula = y ~ x, color="red") +
theme_ipsum_es() +
scale_y_continuous(labels = scales::dollar)
Answer
Because of the increasing variance and rising trend, a logarithmic transformation is useful to stabilise the variance and put the series on a more manageable scale. Further, after decomposing the data with an ETS model, we can see that it automatically picks up the period-12 seasonality of this monthly data, plus an upward trend.
We first have to assign a 1 to every March from 1988 onwards (month number = 3) and zeros to the other months. Initially I don't know how this will pan out, since seasonally sales peak in December. From the regression model it doesn't appear that the dummy variable helped much; if anything, it could be inflating the variance, since March and the dummy variable appear to be the least statistically significant variables.
#b.
souvenirs %>%
model(ETS(Sales)) %>%
components() %>%
autoplot() +
#labs(subtitle = "ETS Decomposition of Souvenir Sales") +
scale_y_continuous(labels = scales::number)
#c.
# borrowed from online rpubs (fancy)
# log_souv <-log(ts(souvenirs, frequency =12)
#
# dummy_surf = rep(0, length(souvenirs))
# dummy_surf[seq_along(dummy) %% 12 == 3] <- 1
# dummy_surf[3]<-0
# dummy_surf <-ts(dummy.surf, frequency = 12, start=c(1987,1))
#
# log.souv.df <-data.frame(log_souv,dummy_surf)
##log.souv.df
#
# model_fit <-tslm(log_souv ~ trend + season + dummy_surf, data = log.souv.df)
#
# fc_surf <- data.frame(dummy_surf = rep(0,12))
#
# fc_surf[12,] <- 1
# fc_souv<-data.frame(forecast(model_fit, newdata = fc_surf))
#
##fc_souv
#
##log.souv.df
#
# fc_souv$Point.Forecast %>%
# autoplot(log.souv.df)
### Borrowed from Tim White rpubs, again.
# his code is neat and makes sense. Finally understood how a for loop works.
date_loop <- as.Date(souvenirs$Month)
surf_loop <- c()
for(i in 1:length(date_loop)) {
if(year(date_loop[i])>=1988 & month(date_loop[i]) == 3) {
surf_loop[i] <- 1
} else {
surf_loop[i] <- 0
}
}
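For what it is worth, the same dummy can also be built without a loop; a vectorized sketch that should reproduce surf_loop exactly:
#vectorized alternative: 1 for every March from 1988 onwards, 0 otherwise
surf_vec <- as.integer(month(as.Date(souvenirs$Month)) == 3 & year(as.Date(souvenirs$Month)) >= 1988)
all(surf_vec == surf_loop)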
souv_model <-
souvenirs %>%
add_column(surf_dummy = surf_loop) %>%
add_column(log_sales = log(souvenirs$Sales)) %>%
add_column(season_month = as.factor(month(souvenirs$Month)))
souvenirs_tslm <-
souv_model %>%
model(tslm = TSLM(log_sales~trend()+as.factor(season_month)+surf_dummy))
report(souvenirs_tslm)
## Series: log_sales
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.336727 -0.127571 0.002568 0.109106 0.376714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6196670 0.0742471 102.626 < 2e-16 ***
## trend() 0.0220198 0.0008268 26.634 < 2e-16 ***
## as.factor(season_month)2 0.2514168 0.0956790 2.628 0.010555 *
## as.factor(season_month)3 0.2660828 0.1934044 1.376 0.173275
## as.factor(season_month)4 0.3840535 0.0957075 4.013 0.000148 ***
## as.factor(season_month)5 0.4094870 0.0957325 4.277 5.88e-05 ***
## as.factor(season_month)6 0.4488283 0.0957647 4.687 1.33e-05 ***
## as.factor(season_month)7 0.6104545 0.0958039 6.372 1.71e-08 ***
## as.factor(season_month)8 0.5879644 0.0958503 6.134 4.53e-08 ***
## as.factor(season_month)9 0.6693299 0.0959037 6.979 1.36e-09 ***
## as.factor(season_month)10 0.7473919 0.0959643 7.788 4.48e-11 ***
## as.factor(season_month)11 1.2067479 0.0960319 12.566 < 2e-16 ***
## as.factor(season_month)12 1.9622412 0.0961066 20.417 < 2e-16 ***
## surf_dummy 0.5015151 0.1964273 2.553 0.012856 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared: 0.9567, Adjusted R-squared: 0.9487
## F-statistic: 119 on 13 and 70 DF, p-value: < 2.22e-16
######## -borrowed from Lucas Kaplan, rpubs
sts<-ts(souvenirs$Sales, start=c(1987,1), end=c(1993,6), frequency=12)
logsales = log(sts)
surfing_festival <- c()
#note: this loop derives "month" and "year" from the sales values in sts rather than from time(sts),
#so the dummy does not actually track March; this likely explains why surfing_festival is insignificant below
for(i in 1:length(sts)){
month <- round(12*(sts[i] - floor(sts[i]))) +1
year <- floor(sts[i])
if(year >= 1988 & month == 3){
surfing_festival[i] <- 1
} else {
surfing_festival[i] <- 0
}
}
surfmodel = tslm(logsales ~ trend + season + surfing_festival)
summary(surfmodel)
##
## Call:
## tslm(formula = logsales ~ trend + season + surfing_festival)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45174 -0.11922 -0.00074 0.10175 0.35933
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6464607 0.0773528 98.852 < 2e-16 ***
## trend 0.0212957 0.0009292 22.919 < 2e-16 ***
## season2 0.2521410 0.0979991 2.573 0.012414 *
## season3 0.6932048 0.0990785 6.997 1.89e-09 ***
## season4 0.3820296 0.0991135 3.854 0.000271 ***
## season5 0.4123836 0.0980652 4.205 8.27e-05 ***
## season6 0.4524491 0.0981048 4.612 1.96e-05 ***
## season7 0.5767690 0.1019961 5.655 3.91e-07 ***
## season8 0.5352140 0.1033731 5.177 2.43e-06 ***
## season9 0.6237063 0.1034004 6.032 8.93e-08 ***
## season10 0.7223515 0.1020342 7.080 1.35e-09 ***
## season11 1.1936391 0.1020638 11.695 < 2e-16 ***
## season12 1.9447112 0.1021019 19.047 < 2e-16 ***
## surfing_festival 0.0293746 0.1001947 0.293 0.770337
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1833 on 64 degrees of freedom
## Multiple R-squared: 0.9444, Adjusted R-squared: 0.9331
## F-statistic: 83.67 on 13 and 64 DF, p-value: < 2.2e-16
Answer
Plotting a loess (local polynomial) smoother of the residuals against time shows a close relationship between time and the residuals; this makes sense given the very strong seasonality and non-linearity. The residuals plotted against the fitted values are widely dispersed with no clear heteroscedasticity, so the model appears acceptable; however, there is a hint of a funnel shape in the most recent residuals that could call the model into question.
#d.
ggp_restime <-
ggplot(data = souv_model,
aes(x = Month,
y = residuals(souvenirs_tslm)$".resid")) +
theme_ipsum_es() +
geom_smooth(method="loess", formula = y ~x, colour="navyblue") +
geom_smooth(method="lm", formula = y ~x, colour="red")+
geom_point() +
labs(x = "Month",
y = "Regr Residuals",
title = "Residuals and Time")
ggp_restime
ggp_resfitted <-
ggplot(data = souv_model,
aes(x = fitted(souvenirs_tslm)$".fitted",
y = residuals(souvenirs_tslm)$".resid")) +
theme_ipsum_es() +
geom_point() +
geom_smooth(method="lm", formula = y~ x, color = "red")+
geom_smooth(method="loess", formula = y ~x, colour="navyblue") +
labs(x = "Regression Fitted",
y = "Regression Residuals",
title = "Residuals vs. Fitted")
ggp_resfitted
Answer
The box plot of the residuals shows that March, as the model suggested, is likely to be the month predicted least accurately. The mid-year months, particularly June and July (not August), have the smallest residuals, or rather, the best chance of predicting sales accurately. However, given the overall variability in the other months, this model might not be great for months where variability is higher, especially from August through the end of the year.
The summary tables below show the coefficients per month and for the dummy variable (similar to the models shown above). Keep in mind these coefficients are on the log scale: the December coefficient of roughly 1.94-1.96 implies sales around exp(1.94) ≈ 7 times the January (baseline) level, with March the anomaly in between. The dummy variable behaves differently in the two models: one suggests an effect of only about 3% (exp(0.029) ≈ 1.03), the other about 65% (exp(0.50) ≈ 1.65), in March, when the festival takes place.
boxplot(residuals(surfmodel) ~ cycle(residuals(surfmodel))
, xlab = "Month Number"
, ylab = "Residuals"
, main = "Residuals Box Plot"
, col = brewer.pal(4, "Set2")
)
knitr::kable(data.frame(surfmodel$coefficients),"html")
|  | surfmodel.coefficients |
|---|---|
| (Intercept) | 7.6464607 |
| trend | 0.0212957 |
| season2 | 0.2521410 |
| season3 | 0.6932048 |
| season4 | 0.3820296 |
| season5 | 0.4123836 |
| season6 | 0.4524491 |
| season7 | 0.5767690 |
| season8 | 0.5352140 |
| season9 | 0.6237063 |
| season10 | 0.7223515 |
| season11 | 1.1936391 |
| season12 | 1.9447112 |
| surfing_festival | 0.0293746 |
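Because the response is log sales, the dummy coefficients above are easier to read after exponentiating them into multiplicative effects relative to January (the baseline month); a sketch:
#convert log-scale seasonal and festival coefficients into multiplicative effects
exp(coef(surfmodel))[grep("season|surfing", names(coef(surfmodel)))]
#e.g. exp(1.94) is roughly 7, i.e. December sales about 7x the January level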
report(souvenirs_tslm)
## Series: log_sales
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.336727 -0.127571 0.002568 0.109106 0.376714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6196670 0.0742471 102.626 < 2e-16 ***
## trend() 0.0220198 0.0008268 26.634 < 2e-16 ***
## as.factor(season_month)2 0.2514168 0.0956790 2.628 0.010555 *
## as.factor(season_month)3 0.2660828 0.1934044 1.376 0.173275
## as.factor(season_month)4 0.3840535 0.0957075 4.013 0.000148 ***
## as.factor(season_month)5 0.4094870 0.0957325 4.277 5.88e-05 ***
## as.factor(season_month)6 0.4488283 0.0957647 4.687 1.33e-05 ***
## as.factor(season_month)7 0.6104545 0.0958039 6.372 1.71e-08 ***
## as.factor(season_month)8 0.5879644 0.0958503 6.134 4.53e-08 ***
## as.factor(season_month)9 0.6693299 0.0959037 6.979 1.36e-09 ***
## as.factor(season_month)10 0.7473919 0.0959643 7.788 4.48e-11 ***
## as.factor(season_month)11 1.2067479 0.0960319 12.566 < 2e-16 ***
## as.factor(season_month)12 1.9622412 0.0961066 20.417 < 2e-16 ***
## surf_dummy 0.5015151 0.1964273 2.553 0.012856 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared: 0.9567, Adjusted R-squared: 0.9487
## F-statistic: 119 on 13 and 70 DF, p-value: < 2.22e-16
There seems to be some autocorrelation at the nearer lags (adjacent months) in the residuals.
checkresiduals(surfmodel)
##
## Breusch-Godfrey test for serial correlation of order up to 17
##
## data: Residuals from Linear regression model
## LM test = 32.739, df = 17, p-value = 0.01216
# borrowed from Tim too, since I also used his logic on the model.
augment(souvenirs_tslm) %>%
features(.innov, ljung_box, lag = 20, dof = 14)
## # A tibble: 1 x 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 tslm 88.8 0
ACF_res <-
augment(souvenirs_tslm) %>%
ACF(.innov) %>%
autoplot() +
labs(x = "lag",
y = "ACF",
title = "ACF Plot",
subtitle = "Sales Model Residuals") +
theme_ipsum_es()
ACF_res
Answer
The forecast below is in log sales. Having worked with highly seasonal data in the past, mostly with ARIMA models, it appears that this TSLM model would actually work reasonably well even without adding more variables. I would remove the dummy variable, since it didn't appear to add much beyond the seasonality itself, and let the historical data provide the information for that particular month. However, I could consider other variables such as weather (rainy versus dry days, temperature), or test whether disposable income explains some of the variation in souvenir sales; say, people spending less on souvenirs when times are tough and more when times are good.
fc_loop<- rep(0, 36)
for(i in 1:36){
if(i %% 12 == 3){
fc_loop[i] <- 1
}
}
souv_fc<-ts(data=fc_loop,start=1994,
end=c(1996,12),
frequency=12)
souv_fc_final<-forecast(surfmodel, newdata=souv_fc)
autoplot(souv_fc_final) +
labs(x = "Date",
y = "Log Forecast",
title = "Final Forecast with Model",
subtitle = "Despite not having excellent results") +
theme_ipsum_es()
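As a cross-check, the fable model fitted earlier (souvenirs_tslm) can produce the same three years of forecasts with proper prediction intervals; a sketch in which new_data() builds 36 future months and the dummy is set for every March (the forecasts remain on the log-sales scale, and the column construction mirrors the souv_model columns above):
#36 months of future regressors for the fable model
future_souv <- new_data(souv_model, 36) %>%
mutate(season_month = as.factor(month(as.Date(Month))),
surf_dummy = as.integer(month(as.Date(Month)) == 3))
souvenirs_tslm %>%
forecast(new_data = future_souv) %>%
hilo(level = 95)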
The us_gasoline series consists of weekly data for supplies of US finished motor gasoline product, from 2 February 1991 to 20 January 2017. The units are in “million barrels per day”. Consider only the data to the end of 2004.
Fit a harmonic regression with trend to the data. Experiment with changing the number Fourier terms. Plot the observed gasoline and fitted values and comment on what you see.
Select the appropriate number of Fourier terms to include by minimizing the AICc or CV value.
Plot the residuals of the final model using the gg_tsresiduals() function and comment on these. Use a Ljung-Box test to check for residual autocorrelation.
Generate forecasts for the next year of data and plot these along with the actual data for 2005. Comment on the forecasts.
Answer
I only tried three values for the number of Fourier terms: K = 4, 7, and 10. From the initial results I would use K = 7, which has a good adjusted R-squared and only slightly less desirable AICc and BIC. The residuals show some autocorrelation, but not necessarily strong.
#filter data to before 2005
us_gas_2004<-
us_gasoline %>%
filter(year(Week)<2005)%>%
mutate("MA" = slider::slide_dbl(Barrels, mean,.before = 4, .after = 4, .complete = TRUE))
#plot data initially
us_gas_2004 %>%
autoplot(Barrels) +
geom_line(col = "lightblue")+
geom_line(aes(y=MA), color="#D55E00")+
labs(x = "Date",
y = "Barrels",
title = "Gasoline in the U.S.",
subtitle = "Million Barrels per day") +
theme_ipsum_es()
gas <-window(gasoline, end=2005)
gas_K4 = tslm(gas ~ trend + fourier(gas, K=4))
gas_K7 = tslm(gas ~ trend + fourier(gas, K=7))
gas_K10= tslm(gas ~ trend + fourier(gas, K=10))
autoplot(gas) +
labs(x = "Date"
, y = "Million Barrels"
, title = "Harmonic Regression"
) +
theme_ipsum_es() +
autolayer(fitted(gas_K4)) +
autolayer(fitted(gas_K7)) +
autolayer(fitted(gas_K10))
knitr::kable(as.data.frame(CV(gas_K4)),"html")
|  | CV(gas_K4) |
|---|---|
| CV | 0.0764861 |
| AIC | -1864.1120233 |
| AICc | -1863.7422754 |
| BIC | -1813.6489732 |
| AdjR2 | 0.8382404 |
knitr::kable(as.data.frame(CV(gas_K7)),"html")
|  | CV(gas_K7) |
|---|---|
| CV | 0.0714734 |
| AIC | -1913.5362459 |
| AICc | -1912.6718391 |
| BIC | -1835.5478956 |
| AdjR2 | 0.8501073 |
knitr::kable(as.data.frame(CV(gas_K10)),"html")
|  | CV(gas_K10) |
|---|---|
| CV | 0.0713575 |
| AIC | -1915.0136582 |
| AICc | -1913.4410086 |
| BIC | -1809.5000079 |
| AdjR2 | 0.8516102 |
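Rather than trying a handful of K values by hand, the CV() summaries can be collected in a loop and the order with the smallest AICc (or CV) picked; a sketch (the upper limit of 12 is arbitrary):
#collect CV, AIC, AICc, BIC and adjusted R-squared for a range of Fourier orders
cv_results <- sapply(1:12, function(k) CV(tslm(gas ~ trend + fourier(gas, K = k))))
colnames(cv_results) <- paste0("K=", 1:12)
which.min(cv_results["AICc", ])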
checkresiduals(gas_K7$residuals)
Answer
The forecast does not seem terrible in terms of trend, though the small ups and downs are not always captured. Furthermore, when supply dipped in 2005, the model was not able to capture the full downturn. Still, it doesn't appear to be a model that should be discarded, but rather one to be improved.
#run the model and forecast
fcst = forecast(gas_K7,
newdata = data.frame(fourier(gas, 7, 52)))
fcst %>%
kbl()%>%
kable_material(c("striped", "hover"), full_width = TRUE)
|  | Point Forecast | Lo 80 | Hi 80 | Lo 95 | Hi 95 |
|---|---|---|---|---|---|
| 2005.014 | 8.812756 | 8.469266 | 9.156246 | 8.287026 | 9.338486 |
| 2005.033 | 8.666945 | 8.323371 | 9.010518 | 8.141087 | 9.192803 |
| 2005.052 | 8.599822 | 8.256209 | 8.943435 | 8.073903 | 9.125740 |
| 2005.071 | 8.620059 | 8.276458 | 8.963660 | 8.094159 | 9.145959 |
| 2005.090 | 8.701192 | 8.357636 | 9.044748 | 8.175361 | 9.227023 |
| 2005.110 | 8.801076 | 8.457564 | 9.144588 | 8.275312 | 9.326840 |
| 2005.129 | 8.885137 | 8.541645 | 9.228630 | 8.359403 | 9.410872 |
| 2005.148 | 8.940023 | 8.596528 | 9.283517 | 8.414286 | 9.465760 |
| 2005.167 | 8.972165 | 8.628667 | 9.315664 | 8.446422 | 9.497908 |
| 2005.186 | 8.995783 | 8.652288 | 9.339277 | 8.470046 | 9.521519 |
| 2005.205 | 9.020393 | 8.676906 | 9.363880 | 8.494667 | 9.546119 |
| 2005.225 | 9.046109 | 8.702625 | 9.389592 | 8.520389 | 9.571829 |
| 2005.244 | 9.067947 | 8.724464 | 9.411430 | 8.542227 | 9.593666 |
| 2005.263 | 9.083648 | 8.740163 | 9.427132 | 8.557927 | 9.609369 |
| 2005.282 | 9.097734 | 8.754247 | 9.441220 | 8.572009 | 9.623458 |
| 2005.301 | 9.118445 | 8.774957 | 9.461934 | 8.592718 | 9.644173 |
| 2005.320 | 9.150463 | 8.806975 | 9.493951 | 8.624736 | 9.676190 |
| 2005.340 | 9.189916 | 8.846430 | 9.533401 | 8.664192 | 9.715639 |
| 2005.359 | 9.226458 | 8.882974 | 9.569941 | 8.700738 | 9.752178 |
| 2005.378 | 9.251462 | 8.907979 | 9.594945 | 8.725742 | 9.777181 |
| 2005.397 | 9.266112 | 8.922628 | 9.609595 | 8.740391 | 9.791832 |
| 2005.416 | 9.282567 | 8.939081 | 9.626052 | 8.756844 | 9.808290 |
| 2005.435 | 9.316150 | 8.972663 | 9.659637 | 8.790424 | 9.841875 |
| 2005.455 | 9.373402 | 9.029915 | 9.716889 | 8.847676 | 9.899127 |
| 2005.474 | 9.444701 | 9.101216 | 9.788186 | 8.918978 | 9.970424 |
| 2005.493 | 9.507761 | 9.164277 | 9.851245 | 8.982041 | 10.033481 |
| 2005.512 | 9.540941 | 9.197458 | 9.884424 | 9.015222 | 10.066660 |
| 2005.531 | 9.537887 | 9.194403 | 9.881370 | 9.012166 | 10.063607 |
| 2005.550 | 9.513043 | 9.169558 | 9.856528 | 8.987321 | 10.038766 |
| 2005.570 | 9.492990 | 9.149503 | 9.836476 | 8.967265 | 10.018715 |
| 2005.589 | 9.498121 | 9.154635 | 9.841608 | 8.972396 | 10.023847 |
| 2005.608 | 9.526608 | 9.183122 | 9.870093 | 9.000885 | 10.052331 |
| 2005.627 | 9.552267 | 9.208783 | 9.895751 | 9.026546 | 10.077988 |
| 2005.646 | 9.539617 | 9.196134 | 9.883100 | 9.013898 | 10.065337 |
| 2005.665 | 9.467863 | 9.124380 | 9.811347 | 8.942143 | 9.993583 |
| 2005.685 | 9.348603 | 9.005118 | 9.692088 | 8.822881 | 9.874326 |
| 2005.704 | 9.224634 | 8.881147 | 9.568121 | 8.698908 | 9.750359 |
| 2005.723 | 9.148483 | 8.804995 | 9.491971 | 8.622756 | 9.674209 |
| 2005.742 | 9.152341 | 8.808855 | 9.495828 | 8.626618 | 9.678065 |
| 2005.761 | 9.227562 | 8.884078 | 9.571046 | 8.701841 | 9.753283 |
| 2005.780 | 9.327163 | 8.983680 | 9.670647 | 8.801444 | 9.852883 |
| 2005.800 | 9.391261 | 9.047777 | 9.734744 | 8.865541 | 9.916980 |
| 2005.819 | 9.381140 | 9.037654 | 9.724625 | 8.855417 | 9.906863 |
| 2005.838 | 9.301798 | 8.958308 | 9.645287 | 8.776069 | 9.827526 |
| 2005.857 | 9.199081 | 8.855590 | 9.542572 | 8.673349 | 9.724812 |
| 2005.876 | 9.132561 | 8.789072 | 9.476049 | 8.606832 | 9.658289 |
| 2005.895 | 9.139896 | 8.796410 | 9.483382 | 8.614173 | 9.665619 |
| 2005.915 | 9.213689 | 8.870205 | 9.557174 | 8.687968 | 9.739411 |
| 2005.934 | 9.304509 | 8.961025 | 9.647993 | 8.778788 | 9.830230 |
| 2005.953 | 9.348279 | 9.004789 | 9.691768 | 8.822550 | 9.874008 |
| 2005.972 | 9.302031 | 8.958513 | 9.645549 | 8.776259 | 9.827804 |
| 2005.991 | 9.167552 | 8.823967 | 9.511136 | 8.641677 | 9.693427 |
#prepare test
gas_test = window(gasoline, start = 2005, end = 2006)
#plot results
#head(fcst)
fcst_plot <-fcst
forecast <- fcst_plot$mean
autoplot(gas_test) +
theme_ipsum()+
labs(x = "Test Dates 52 weeks in 2005"
, y = "Million Barrels per Day"
, title = "US Gasoline Supply Forecast")+
autolayer(forecast)
The annual population of Afghanistan is available in the global_economy data set.
Plot the data and comment on its features. Can you observe the effect of the Soviet-Afghan war?
Fit a linear trend model and compare this to a piecewise linear trend model with knots at 1980 and 1989. Generate forecasts from these two models for the five years after the end of the data, and comment on the results.
Answer
Yes, from about 1980 through about 1987 we can see the population declining. This could be for several reasons, most notably people fleeing Afghanistan (as well as casualties) during the Soviet-Afghan war.
af_pop <-
global_economy %>%
filter(Code == "AFG") %>%
mutate(Population = Population/1000000)
af_pop %>%
autoplot(Population) +
labs(x = "Date"
,y = "Millions of People"
,title = "Population in Afghanistan") +
theme_ipsum_pub() +
geom_line(col="blue")
Answer
The linear model simply fits a single trend across the whole period, which can be skewed by the wartime dip. The piecewise model looks more accurate: it follows the trend of the last few years while accounting for the dip seen from 1980 to 1989. Therefore, the piecewise model is clearly the preferred of the two, with a higher R-squared. I would still question whether this model might be overfitting and thus give misleading forecasts.
af_pop_trend <-
af_pop %>%
model(linear = TSLM(Population ~ trend()),
piecewise = TSLM(Population ~ trend(knots = c(1980, 1989)))
)
#report(af_pop_trend)
af_pop_trend %>%
kbl()%>%
kable_material(c("striped", "hover"), full_width = TRUE)
| Country | linear | piecewise |
|---|---|---|
| Afghanistan | <TSLM> | <TSLM> |
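To back up the comparison between the two fits, glance() on the mable lays the fit statistics side by side; a sketch (the column names follow fabletools conventions for TSLM, so treat them as assumptions):
#goodness-of-fit measures for the linear vs. piecewise trend models
af_pop_trend %>%
glance() %>%
select(.model, r_squared, adj_r_squared, AICc, CV)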
af_pop_fcst <- af_pop_trend %>%
forecast(h = 5)
af_pop %>%
autoplot(Population) +
geom_line(col = "navyblue")+
labs(y = "Million People"
, x = "Date"
, title = "Population in Afghanistan") +
geom_line(data = fitted(af_pop_trend),
aes(y = .fitted, color = .model)) +
autolayer(af_pop_fcst, alpha = 0.5, level = 95) +
theme_ipsum_ps()