knitr::opts_chunk$set(echo = TRUE)
library(fpp2)
## Warning: package 'fpp2' was built under R version 4.0.5
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## -- Attaching packages ---------------------------------------------- fpp2 2.4 --
## v ggplot2 3.3.3 v fma 2.4
## v forecast 8.14 v expsmooth 2.3
## Warning: package 'fma' was built under R version 4.0.5
## Warning: package 'expsmooth' was built under R version 4.0.5
##
Daily electricity demand for Victoria, Australia, during 2014 is contained in elecdaily. The data for the first 20 days can be obtained as follows.
daily20 <- head(elecdaily,20)
daily20 <- ts(daily20, frequency = 1)
Plot the data and find the regression model for Demand with temperature as an explanatory variable. Why is there a positive relationship?
autoplot(daily20, main="Daily Electricity Demand for Victoria, Australia \n First 20 Days of 2014",
xlab = "Days", ylab = "Demand (MW), Temperature (C)") +
geom_point() +
xlim(1, 20) +
theme(plot.title = element_text(hjust = 0.5))
lmq1a <- lm(Demand~Temperature, data=daily20)
summary(lmq1a)
##
## Call:
## lm(formula = Demand ~ Temperature, data = daily20)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.060 -7.117 -1.437 17.484 27.102
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.2117 17.9915 2.179 0.0428 *
## Temperature 6.7572 0.6114 11.052 1.88e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22 on 18 degrees of freedom
## Multiple R-squared: 0.8716, Adjusted R-squared: 0.8644
## F-statistic: 122.1 on 1 and 18 DF, p-value: 1.876e-09
cor(daily20)
## Demand WorkDay Temperature
## Demand 1.0000000 0.5380072 0.9335758
## WorkDay 0.5380072 1.0000000 0.4537745
## Temperature 0.9335758 0.4537745 1.0000000
It’s important to recognize that, as Australia is in the southern hemisphere, Victoria would be experiencing summer temperatures in January (i.e. the first 20 days of 2014). This helps explain the strong positive relationship between electricity demand and temperature (indicated by the positive r value of 0.933), because as temperature increases, use of cooling appliances increases, which is especially common during summer months. If this analysis was conducted on winter data, the inverse would likely be true due to the need for electricity to power heating devices.
Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
plot(lmq1a)
checkresiduals(lmq1a)
##
## Breusch-Godfrey test for serial correlation of order up to 5
##
## data: Residuals
## LM test = 3.8079, df = 5, p-value = 0.5774
The model does a decent job in terms of adequacy. When viewing these residual plots, generally, the normality assumption holds. No obvious pattern exists within the residuals (there is a slight left skew in terms of the residual distribution, and an uptick in autocorrelation at lag 4, though not extreme). However, observations 1, 5, and 9 appear to be possible outliers. Given the nature of the data (i.e. summer temperatures in Australia and electricity demand), we would expect some outliers, possibly reflecting the irregular nature of summer temperatures/weather. Some days might be more humid than others, some days might have more wind, etc.
Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15 C and compare it with the forecast if the with maximum temperature was 35 C. Do you believe these forecasts?
temp.forecasts <- data.frame(Temperature = c(15,35))
lmq1.forecast <- forecast(lmq1a, newdata = temp.forecasts)
lmq1.forecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1 140.5701 108.6810 172.4591 90.21166 190.9285
## 2 275.7146 245.2278 306.2014 227.57056 323.8586
autoplot(lmq1.forecast)
I generally believe these forecasts. The high R^2 of the model and strong positive correlation between temperature and electricity demand tend to suggest that the explantory power of the linear model is pretty good. The forecasts fall in line with that model, obviously, and fit well within the scope of the other data.
Give prediction intervals for your forecasts. The following R code will get you started:
lmq1.forecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1 140.5701 108.6810 172.4591 90.21166 190.9285
## 2 275.7146 245.2278 306.2014 227.57056 323.8586
autoplot(lmq1.forecast)
Plot Demand vs Temperature for all of the available data in elecdaily. What does this say about your model?
plot(Demand ~ Temperature, data=elecdaily, main="Electricity Demand vs. Temperature for Victoria in 2014")
I think this plot of all the 2014 data supports some of the previous discussion above, along with part of our model. As discussed before, electricity demand tends to increase in warm months to cool things, and based off this plot, it also increases when temperatures are cooler, which makes sense (to heat things). Moderate temperatures tend to reflect the lowest electricity demand. My model is pretty good, but it’s only useful for summer months generally, as it was only “trained” on summer-month data. This explains why our 15 C forecast for demand is low, when in reality it may be higher because 15 C might be cool enough to justify heating. My model does not account for that “electricity demand to heat” effect. A model reflecting seasonality or a quadratic, instead of linear, model may be better to encapsulate the entire dataset, rather than just the first 20 days.
Data set mens400 contains the winning times (in seconds) for the men’s 400 meters final in each Olympic Games from 1896 to 2016.
Plot the winning time against the year. Describe the main features of the plot.
autoplot(mens400, main="Winning times for Men's 400 meter Olympic Games Final \n 1896-2016", ylab="Time (seconds)", xlab="Year") +
theme(plot.title = element_text(hjust = 0.5))
As shown by the plot, men’s 400m winning times are trending down, though it seems at a decreasing rate, which makes sense as technology/human potential approaches its own asymtote. Moreover, there’s periods of missing data, reflecting the years the olympics were not held due to WWI and WWII.
Fit a regression line to the data. Obviously the winning times have been decreasing, but at what average rate per year?
autoplot(mens400, main="Winning times for Men's 400 meter Olympic Games Final \n 1896-2016", ylab="Time (seconds)", xlab="Year") + geom_smooth(method="lm", se=FALSE) + theme(plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 3 rows containing non-finite values (stat_smooth).
time.year <- time(mens400)
lmq2b <- tslm(mens400 ~ time.year, data=mens400)
summary(lmq2b)
##
## Call:
## tslm(formula = mens400 ~ time.year, data = mens400)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6002 -0.5747 -0.2858 0.5751 4.1505
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 172.481477 11.487522 15.02 2.52e-14 ***
## time.year -0.064574 0.005865 -11.01 2.75e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.136 on 26 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.8234, Adjusted R-squared: 0.8166
## F-statistic: 121.2 on 1 and 26 DF, p-value: 2.752e-11
Our basic linear model suggests that as the time increases by 1 year, the winning time for the men’s 400m decreases by about 0.06 seconds on average. Every 4 years (i.e. every Olympic Games), the winning time decreases by about 0.25 seconds.
Plot the residuals against the year. What does this indicate about the suitability of the fitted line?
cbind(Time = time.year, Residuals = lmq2b$residuals) %>%
as.data.frame() %>%
ggplot(aes(x = Time, y = Residuals)) +
geom_point() +
xlab("Year") +
ggtitle("Residuals vs Year") +
theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 3 rows containing missing values (geom_point).
The residual v. year plot generally shows no pattern in the residuals, though there is a recent uptrend in the residuals starting in 1980. It’s clear that 1896 is an outlier, and the uptrend in the residuals starting in 1980-present may indicate that the linear model is becoming less accurate with newer observations.
Predict the winning time for the men’s 400 meters final in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?
mens400.forecast <- forecast(lmq2b, newdata = data.frame(time.year = 2020))
autoplot(mens400) + autolayer(mens400.forecast) + geom_point() +
xlab("Year") + ylab("Time (seconds)") +
ggtitle("Winning times for Men's 400 meter Olympic Games Final \n 1896-2016, with 2020 Prediction") +
theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 3 rows containing missing values (geom_point).
mens400.forecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2020 42.04231 40.44975 43.63487 39.55286 44.53176
In terms of assumptions made to make these calculations, I have made all the assumptions underlying OLS modeling. For building the prediction intervals specifically, the assumption of normality is essential for the construction of the 80% and 95% with the standard errors. If this assumption does not hold, then the interval is essentially meaningless.
Type easter(ausbeer) and interpret what you see.
easter(ausbeer)
## Qtr1 Qtr2 Qtr3 Qtr4
## 1956 0.67 0.33 0.00 0.00
## 1957 0.00 1.00 0.00 0.00
## 1958 0.00 1.00 0.00 0.00
## 1959 1.00 0.00 0.00 0.00
## 1960 0.00 1.00 0.00 0.00
## 1961 0.33 0.67 0.00 0.00
## 1962 0.00 1.00 0.00 0.00
## 1963 0.00 1.00 0.00 0.00
## 1964 1.00 0.00 0.00 0.00
## 1965 0.00 1.00 0.00 0.00
## 1966 0.00 1.00 0.00 0.00
## 1967 1.00 0.00 0.00 0.00
## 1968 0.00 1.00 0.00 0.00
## 1969 0.00 1.00 0.00 0.00
## 1970 1.00 0.00 0.00 0.00
## 1971 0.00 1.00 0.00 0.00
## 1972 0.33 0.67 0.00 0.00
## 1973 0.00 1.00 0.00 0.00
## 1974 0.00 1.00 0.00 0.00
## 1975 1.00 0.00 0.00 0.00
## 1976 0.00 1.00 0.00 0.00
## 1977 0.00 1.00 0.00 0.00
## 1978 1.00 0.00 0.00 0.00
## 1979 0.00 1.00 0.00 0.00
## 1980 0.00 1.00 0.00 0.00
## 1981 0.00 1.00 0.00 0.00
## 1982 0.00 1.00 0.00 0.00
## 1983 0.00 1.00 0.00 0.00
## 1984 0.00 1.00 0.00 0.00
## 1985 0.00 1.00 0.00 0.00
## 1986 1.00 0.00 0.00 0.00
## 1987 0.00 1.00 0.00 0.00
## 1988 0.00 1.00 0.00 0.00
## 1989 1.00 0.00 0.00 0.00
## 1990 0.00 1.00 0.00 0.00
## 1991 1.00 0.00 0.00 0.00
## 1992 0.00 1.00 0.00 0.00
## 1993 0.00 1.00 0.00 0.00
## 1994 0.00 1.00 0.00 0.00
## 1995 0.00 1.00 0.00 0.00
## 1996 0.00 1.00 0.00 0.00
## 1997 1.00 0.00 0.00 0.00
## 1998 0.00 1.00 0.00 0.00
## 1999 0.00 1.00 0.00 0.00
## 2000 0.00 1.00 0.00 0.00
## 2001 0.00 1.00 0.00 0.00
## 2002 1.00 0.00 0.00 0.00
## 2003 0.00 1.00 0.00 0.00
## 2004 0.00 1.00 0.00 0.00
## 2005 1.00 0.00 0.00 0.00
## 2006 0.00 1.00 0.00 0.00
## 2007 0.00 1.00 0.00 0.00
## 2008 1.00 0.00 0.00 0.00
## 2009 0.00 1.00 0.00 0.00
## 2010 0.00 1.00
The resulting dataset shows the occurence of something in each year in each quarter from 1956 to 2010 for most likely something to do with australian beer. I would say it’s a dummy variable, but it sometypes takes fractional values, so it’s not a binary dummy variable. I am assuming the fractional cases involve a situation where the occurence/event spans both quarters, which tends to suggest a multi-day event, like a three-day holiday weekend (indicated by the .33 and .66 entries rather than .5 entries which would imply a 2-day event).
Express y as a function of x and show that the coefficient β1 is the elasticity coefficient.
Start: log(y) = β0 + β1log(x) + e 1) The integral of 1/x equals ln(x).
β1=(dy/dx)*(x/y) –> β1(dx/x)=(dy/y).
Using 1, β1(dx/x)=(dy/y) –> β1(log(x))=log(y).
β1=log(y)/log(x) i.e. ~ the percent change in y over the percent change in x (the elasticity)
The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
autoplot(fancy, main="Sales Volume 1987-1994", ylab="Sales Volume ($)", xlab = "Year") +
theme(plot.title = element_text(hjust = 0.5))
There are not any obvious “unusal” flucuations in the time series data. As you can see, it is very seasonal, and the peaks coincide with Christmas and the March festival. I think, in this case, the trend is more interesting, as while the seasonality has stayed relatively consistent, the “strength”/size of the peaks has increased/trended up with sales volume in general. It would seem Queensland became significantly more popular from 1992 onwards.
Explain why it is necessary to take logarithms of these data before fitting a model.
It’s necessary to do log transformations of the data before fitting a model in this case because of the highly seasonal nature of the data. The seasonality could lead to heteroskedastic residuals/patterns in the residuals (instead of the OLS assumed homoskedastic). Log transformations could solve this to uphold the necessary assumptions.
Use R to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.
time.fancy <- time(fancy)
surfFest <- c()
for(i in 1:length(time.fancy)){
month <- round(12*(time.fancy[i] - floor(time.fancy[i]))) + 1
year <- floor(time.fancy[i])
if(year >= 1988 & month == 3)
{
surfFest[i] <- 1
} else {
surfFest[i] <- 0
}
}
lmq5c <- tslm(log(fancy) ~ trend + season + surfFest)
summary(lmq5c)
##
## Call:
## tslm(formula = log(fancy) ~ trend + season + surfFest)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33673 -0.12757 0.00257 0.10911 0.37671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6196670 0.0742471 102.626 < 2e-16 ***
## trend 0.0220198 0.0008268 26.634 < 2e-16 ***
## season2 0.2514168 0.0956790 2.628 0.010555 *
## season3 0.2660828 0.1934044 1.376 0.173275
## season4 0.3840535 0.0957075 4.013 0.000148 ***
## season5 0.4094870 0.0957325 4.277 5.88e-05 ***
## season6 0.4488283 0.0957647 4.687 1.33e-05 ***
## season7 0.6104545 0.0958039 6.372 1.71e-08 ***
## season8 0.5879644 0.0958503 6.134 4.53e-08 ***
## season9 0.6693299 0.0959037 6.979 1.36e-09 ***
## season10 0.7473919 0.0959643 7.788 4.48e-11 ***
## season11 1.2067479 0.0960319 12.566 < 2e-16 ***
## season12 1.9622412 0.0961066 20.417 < 2e-16 ***
## surfFest 0.5015151 0.1964273 2.553 0.012856 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared: 0.9567, Adjusted R-squared: 0.9487
## F-statistic: 119 on 13 and 70 DF, p-value: < 2.2e-16
Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
autoplot(lmq5c$residuals)
cbind(Residuals = lmq5c$residuals,
FittedValues = lmq5c$fitted.values)%>%
as.data.frame%>%
ggplot(aes(x=FittedValues, y=Residuals))+geom_point()
From the plots, there doesn’t appear to be any obvious pattern in the residuals after accounting for trend, seasonality, and the March festival.
Do boxplots of the residuals for each month. Does this reveal any problems with the model?
cbind.data.frame(
Month = factor(
month.abb[round(12*(time.fancy - floor(time.fancy)) + 1)],
labels = month.abb,
ordered = TRUE
),
Residuals = lmq5c$residuals
) %>%
ggplot(aes(x = Month,
y = Residuals)) +
geom_boxplot()
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.
Some of the months (like Feb, Nov, Dec) have a larger spread/range of residuals, which may warrant further investigation. It’s not necessarily a cause for concern, but it does indicate that the residuals are not truly homoskedastic, as they have increased variance in certain months, and less variance in others.
What do the values of the coefficients tell you about each variable?
lmq5c$coefficients
## (Intercept) trend season2 season3 season4 season5
## 7.61966701 0.02201983 0.25141682 0.26608280 0.38405351 0.40948697
## season6 season7 season8 season9 season10 season11
## 0.44882828 0.61045453 0.58796443 0.66932985 0.74739195 1.20674790
## season12 surfFest
## 1.96224123 0.50151509
SurfFest has a value of 0.5015, showing an increase in sales during the festival/March compared to the baseline null case. We see the general positive uptrend, shown by the trend coefficient. The coefficients on the seasonality dummies show the increase in sales during the Christmas holiday, while the march seasonality variable (season3) doesn’t see that uptick because it’s embodied in surfFest.
What does the Breusch-Godfrey test tell you about your model?
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bgtest(lmq5c)
##
## Breusch-Godfrey test for serial correlation of order up to 1
##
## data: lmq5c
## LM test = 25.031, df = 1, p-value = 5.642e-07
checkresiduals(lmq5c)
##
## Breusch-Godfrey test for serial correlation of order up to 17
##
## data: Residuals from Linear regression model
## LM test = 37.954, df = 17, p-value = 0.002494
The Breusch-Godfrey test tells us if there is autocorrelation in our regression model. The low p-value means we can reject the null (that there is no autocorrelation), meaning there is autocorrelation/heteroskedasticity in our residuals. This is backed up by the ACF plot above. In the small lags (1 and 2), there appears to be autocorrelation, as well as in some of the more distant lags.
Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
threeyr <- ts(data=time.fancy, start=1994, end=c(1996,12), frequency=12)
threeyr.forecast <- forecast(lmq5c, threeyr)
## Warning in forecast.lm(lmq5c, threeyr): newdata column names not specified,
## defaulting to first variable required.
threeyr.forecast
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1994 1006.002 501.0539 1510.950 227.5849 1784.419
## Feb 1994 1006.317 501.3479 1511.286 227.8675 1784.767
## Mar 1994 1006.396 501.6231 1511.168 228.2491 1784.542
## Apr 1994 1006.577 501.5658 1511.589 228.0624 1785.092
## May 1994 1006.667 501.6339 1511.699 228.1190 1785.214
## Jun 1994 1006.770 501.7159 1511.824 228.1896 1785.350
## Jul 1994 1006.995 501.9201 1512.070 228.3823 1785.608
## Aug 1994 1007.036 501.9403 1512.133 228.3910 1785.682
## Sep 1994 1007.182 502.0643 1512.299 228.5036 1785.860
## Oct 1994 1007.324 502.1850 1512.462 228.6128 1786.034
## Nov 1994 1007.847 502.6870 1513.006 229.1033 1786.590
## Dec 1994 1008.666 503.4851 1513.847 229.8900 1787.442
## Jan 1995 1006.768 501.5678 1511.967 227.9624 1785.573
## Feb 1995 1007.083 501.8618 1512.304 228.2450 1785.921
## Mar 1995 1007.161 502.1369 1512.186 228.6266 1785.696
## Apr 1995 1007.343 502.0797 1512.606 228.4399 1786.246
## May 1995 1007.432 502.1478 1512.717 228.4965 1786.368
## Jun 1995 1007.535 502.2298 1512.841 228.5670 1786.504
## Jul 1995 1007.761 502.4340 1513.088 228.7598 1786.762
## Aug 1995 1007.802 502.4542 1513.150 228.7685 1786.836
## Sep 1995 1007.947 502.5782 1513.317 228.8810 1787.014
## Oct 1995 1008.089 502.6989 1513.480 228.9903 1787.188
## Nov 1995 1008.612 503.2009 1514.024 229.4808 1787.744
## Dec 1995 1009.432 503.9990 1514.865 230.2674 1788.596
## Jan 1996 1007.533 502.0817 1512.985 228.3399 1786.727
## Feb 1996 1007.849 502.3757 1513.321 228.6225 1787.075
## Mar 1996 1007.927 502.6508 1513.203 229.0041 1786.850
## Apr 1996 1008.109 502.5936 1513.624 228.8174 1787.400
## May 1996 1008.198 502.6617 1513.734 228.8740 1787.522
## Jun 1996 1008.301 502.7437 1513.859 228.9445 1787.658
## Jul 1996 1008.527 502.9479 1514.105 229.1373 1787.916
## Aug 1996 1008.568 502.9681 1514.168 229.1460 1787.990
## Sep 1996 1008.713 503.0921 1514.334 229.2585 1788.168
## Oct 1996 1008.855 503.2128 1514.497 229.3678 1788.342
## Nov 1996 1009.378 503.7148 1515.042 229.8583 1788.898
## Dec 1996 1010.198 504.5129 1515.882 230.6449 1789.750
Transform your predictions and intervals to obtain predictions and intervals for the raw data.
How could you improve these predictions by modifying the model?