Data set mens400 (in fpp package) contains the winning times (in seconds) for the men’s 400 meters final in each Olympic Games from 1896 to 2016.
autoplot(mens400) + ggtitle({"Winning Times in Olympic Men's 400m Track Final (1896-2016)"}) + xlab("Time") + ylab("Seconds")
The time plot reveals the following features:
There are three missing values, which occur in 1916, 1940 and 1944 due to the World Wars.
A clear downward trend is observed.
There is no seasonal pattern in the time series data.
#Interpolate missing values (No transformation is required)
df <- na.interp(mens400, lambda = NULL)
t <- time(df)
fit.mens400 <- tslm(df~t, data = df)
autoplot(mens400, series = "Data") +
autolayer(fitted(fit.mens400), series = "Fitted") + ggtitle({"Winning Times in Olympic Men's 400m Track Final (1896-2016)"}) + xlab("Time") + ylab("Seconds") + scale_color_manual(values = c("black", "red"))
summary(fit.mens400)
##
## Call:
## tslm(formula = df ~ t, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5622 -0.5877 -0.2707 0.5497 4.2157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 171.697133 10.724592 16.01 6.19e-16 ***
## t -0.064195 0.005482 -11.71 1.64e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.092 on 29 degrees of freedom
## Multiple R-squared: 0.8254, Adjusted R-squared: 0.8194
## F-statistic: 137.1 on 1 and 29 DF, p-value: 1.638e-12
According to the linear regression model, the winning times have been decreasing by 0.064 seconds in average per year.
checkresiduals(fit.mens400)
##
## Breusch-Godfrey test for serial correlation of order up to 6
##
## data: Residuals from Linear regression model
## LM test = 4.3582, df = 6, p-value = 0.6283
According to the residual plots above, the residual of the fitted linear model has the following features:
The time plot shows some variation around 0 over time. Further, the histogram shows that more residuals have fall smaller than 0. However, the fitted normal distribution has a zero-mean.
The correlogram shows no significant autocorrelation in the residuals. The result of Breusch-Godfrey has confirmed this observation, with a p-value larger than 0.05.
In addition to the residuals, the linear model has large values of both R and R-adjusted. Therefore, we can conclude that this linear regression model is a good fit for our data.
t <- seq(2020,2020, 4)
forecast(fit.mens400, newdata = data.frame(t))
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 2020 42.02413 40.4981 43.55016 39.64423 44.40402
The winning time for the men’s 400 meters final is 42.02 seconds in the 2020 Olympics. The prediction interval is between 40.50 seconds and 43.55 seconds with 80% confidence.
The assumptions that we made for this forecasting include:
The relationship between the winning time and the year satisfies the linear model.
The forecasting errors (residuals) meet the requirements of zero mean and no autocorrelation. And the error should be independent from the winning time.
Data set huron gives the water level of Lake Huron in feet from 1875 to 1972.
p1 <- autoplot(huron, main = NULL) + xlab("Year") + ylab("feet")
p2 <- ggAcf(window(huron), main = NULL)
grid.arrange(p1, p2, ncol = 2, top = "Level of Lake Huron in feet (reduced by 570 feet) (1875-1972)")
The time plot reveals the following features:
A clear downward trend is observed.
There is no seasonal pattern in the time series data, which is confirmed by the correlogram.
t <- time(huron)
fit.huron <- tslm(huron~t)
summary(fit.huron)
##
## Call:
## tslm(formula = huron ~ t)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.50997 -0.72726 0.00083 0.74402 2.53565
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.554918 7.764293 7.155 1.66e-10 ***
## t -0.024201 0.004036 -5.996 3.55e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.13 on 96 degrees of freedom
## Multiple R-squared: 0.2725, Adjusted R-squared: 0.2649
## F-statistic: 35.95 on 1 and 96 DF, p-value: 3.545e-08
autoplot(huron, series = "Data") +
autolayer(fitted(fit.huron), series = "Fitted") +
ggtitle({"Level of Lake Huron in feet (reduced by 570 feet)(1875-1972)"}) +
xlab("Year") + ylab("feet") + scale_color_manual(values = c("black", "red"))
A linear regression model is fitted with results shown above. In average, the level of Lake Huron decreases 0.024 feet every year.However, since both R and R-adjusted are small, this linear model is not a good representation of the actual data.
checkresiduals(fit.huron)
##
## Breusch-Godfrey test for serial correlation of order up to 10
##
## data: Residuals from Linear regression model
## LM test = 63.979, df = 10, p-value = 6.353e-10
The correlogram reveals the autocorrelation in the residuals. In other words, there is information left unexplained. The small p-value of the Breusch-Godfrey test shows the same result.
t <- seq(1973,1980)
fhuron <- forecast(fit.huron, newdata = data.frame(t))
fhuron
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1973 7.806127 6.317648 9.294605 5.516501 10.095752
## 1974 7.781926 6.292536 9.271315 5.490899 10.072952
## 1975 7.757724 6.267406 9.248043 5.465269 10.050179
## 1976 7.733523 6.242259 9.224788 5.439613 10.027434
## 1977 7.709322 6.217094 9.201550 5.413929 10.004715
## 1978 7.685121 6.191912 9.178331 5.388219 9.982024
## 1979 7.660920 6.166712 9.155128 5.362481 9.959359
## 1980 7.636719 6.141494 9.131943 5.336717 9.936721
autoplot(fhuron, series = "Prediction") + ggtitle({"Level of Lake Huron in feet (reduced by 570 feet)"}) +
xlab("Year") + ylab("feet")
The forecasting results are shown above. According to the plot, we can conclude the linear regression model fails to fit the actual data well. With large variation over time, a linear model is not adequate.
Daily electricity demand for Victoria, Australia, during 2014 is contained in elecdaily.
elecdaily %>%
as.data.frame() %>%
ggplot(aes(x=Temperature, y=Demand)) +
ggtitle("Electricity Demand for Victoria, Australia (2014)") +
ylab("Demand") +
xlab("Temperature (Celsius)") +
geom_point() +
geom_smooth(method="lm", se=FALSE)
fit.elecdaily <- tslm(Demand~Temperature, data = elecdaily)
summary(fit.elecdaily)
##
## Call:
## tslm(formula = Demand ~ Temperature, data = elecdaily)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.474 -15.719 -0.336 15.767 117.184
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 212.3856 5.0080 42.409 <2e-16 ***
## Temperature 0.4182 0.2263 1.848 0.0654 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.55 on 363 degrees of freedom
## Multiple R-squared: 0.009322, Adjusted R-squared: 0.006592
## F-statistic: 3.416 on 1 and 363 DF, p-value: 0.0654
A linear model is fitted, as shown in the plot. With the coefficient equal to 0.42, there is a positive relationship between demand and temperature.
However, in the scatter plot, we can see that the relationship between demand and temperature is not linear. Further, smaller values of R and R-adjusted also reveal that a linear model is not adequate for this dataset.
checkresiduals(fit.elecdaily)
##
## Breusch-Godfrey test for serial correlation of order up to 14
##
## data: Residuals from Linear regression model
## LM test = 271.66, df = 14, p-value < 2.2e-16
According to the time plot and histogram, the residuals variate around 0. The correlogram shows a clear seasonality pattern remained in the residuals. Therefore, the linear model is not adequate.
According to the scatter plot, there are no outliers. Even though there are some points whose temperatures are larger than 40 and demand larger than 300. However, these values are reasonable and are not extreme enough to be outliers. Since these points are locating on one side of the fitted line, they definitely influnce the coefficent of the forecasting model.
forecast(fit.elecdaily, newdata = data.frame(Temperature = c(15, 35)))
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 53.14286 218.6592 184.4779 252.8406 166.3039 271.0146
## 53.28571 227.0242 192.6585 261.3898 174.3866 279.6617
The forecasted demand is 218.66 when the maximum temperature is 15. And the prediction interval is between 184.48 and 252.84 with 80% confidence. According to the scatter plot in the first section, this prediction interval covers most of the data points when the temperature is 15. Therefore, we can believe this forecast.
When the maximum temperature is 35, the forecasted demand is 227.02, with the prediction interval of [192.66, 261.39]. According to the scatter plot, the demand dramatically increases when the temperature is over 35. Therefore, this forecast may underestimate the demand.
The prediction intervals are [184.48, 252.84] and [192.66, 261.39] with 80% confidence level, for the maximum temperature of 15 and 35, respectively.
elecdaily %>%
as.data.frame() %>%
ggplot(aes(x=Temperature, y=Demand, colour=as.factor(WorkDay))) +
geom_point() +
labs(title = "Electricity Demand for Victoria, Australia (2014)", x="Demand", y="Temperature (Celsius)", colour = NULL) +
scale_color_manual(labels=c("Weekend", "Workday"), values=c("blue", "red"))
The red dots represent the data on workdays, while the blue dots represent the data on weekends. According to the plot above, the demands on workdays are higher than those on weekends. Therefore, in order to improve the forecasting model, we should include the WorkDay factor in the prediction model.
The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
p1 <- autoplot(fancy, main=NULL) + xlab("Month") + ylab("Sales")
p2 <- ggsubseriesplot(fancy, main=NULL)
grid.arrange(p1, p2, ncol = 2, top = "Monthly Sales for a Souvenir Shop (Jan 1987 - Dec 1993)")
According to the time plot and subseries plot, the time series data has the following features:
There is a clearly upward trend over time.
There is a strong seasonal pattern with frequency of 12 month.
The sales increases dramatically every December, because of Christmas. The variation increases over time.
Since 1988, the montly sales in March reaches a small peak, because of the surfing festival.
p1 <- autoplot(fancy, main="Original Data") + xlab("Month") + ylab("Sales")
p2 <- autoplot(BoxCox(fancy, 0), main="After Log Transformation") + xlab("Month") + ylab("Log(Sales)")
grid.arrange(p1, p2, ncol = 2)
In the original data, the variation increases with the level of the series, which will make it difficult to fit a forecasting model. After the logarithms transformation, the size of the seasonal variation is about the same over time. The transformed data will make the forecasting model simpler.
fit.fancy <- tslm(BoxCox(fancy, 0)~trend+season)
#summary(fit.fancy)
autoplot(BoxCox(fancy, 0), series = "Data" ) +
autolayer(fitted(fit.fancy), series = "Fitted") +
scale_color_manual(values = c("black", "red")) +
ggtitle("Monthly Sales for a Souvenir Shop (Jan 1987 - Dec 1993)") +
xlab("Month") + ylab("Log(Sales)")
p1 <- autoplot(fit.fancy$residuals)
cbind(Fitted = fitted(fit.fancy),
Residuals = residuals(fit.fancy)) %>%
as.data.frame() %>%
ggplot(aes(x = Fitted, y = Residuals)) + geom_point() -> p2
grid.arrange(p1, p2, ncol = 2)
The residuals show non-linear patterns over time, which means there is correlation between residuals and time. The variation of residuals against the fitted values are not constant. Therefore, we can conclude that the current model is not adequate for this time series.
summary(fit.fancy)
##
## Call:
## tslm(formula = BoxCox(fancy, 0) ~ trend + season)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41644 -0.12619 0.00608 0.11389 0.38567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6058604 0.0768740 98.939 < 2e-16 ***
## trend 0.0223930 0.0008448 26.508 < 2e-16 ***
## season2 0.2510437 0.0993278 2.527 0.013718 *
## season3 0.6952066 0.0993386 6.998 1.18e-09 ***
## season4 0.3829341 0.0993565 3.854 0.000252 ***
## season5 0.4079944 0.0993817 4.105 0.000106 ***
## season6 0.4469625 0.0994140 4.496 2.63e-05 ***
## season7 0.6082156 0.0994534 6.116 4.69e-08 ***
## season8 0.5853524 0.0995001 5.883 1.21e-07 ***
## season9 0.6663446 0.0995538 6.693 4.27e-09 ***
## season10 0.7440336 0.0996148 7.469 1.61e-10 ***
## season11 1.2030164 0.0996828 12.068 < 2e-16 ***
## season12 1.9581366 0.0997579 19.629 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1858 on 71 degrees of freedom
## Multiple R-squared: 0.9527, Adjusted R-squared: 0.9447
## F-statistic: 119.1 on 12 and 71 DF, p-value: < 2.2e-16
With positive coefficients, the sales amount increases as time goes on. December has the largest coefficient, which means a dramatically increasing in the sales every December. In addition, the coefficient of March is larger than those of the first six months. This observation verifies the sales increase in March, because of the surfing festival.
pred.fancy <- forecast(fit.fancy, h = 12*3)
autoplot(pred.fancy, series = "Prediction") + ggtitle({"Monthly Sales for a Souvenir Shop - Prediction (Jan 1994 - Dec 1996)"}) +
xlab("Month") + ylab("Log(Sales)")
pred.fancy <- forecast(fit.fancy, h = 12*3)
pred_raw <- exp(data.frame(pred.fancy))
print(pred_raw)
## Point.Forecast Lo.80 Hi.80 Lo.95 Hi.95
## Jan 1994 13484.06 10373.35 17527.60 9000.202 20201.76
## Feb 1994 17724.45 13635.50 23039.57 11830.533 26554.69
## Mar 1994 28261.51 21741.71 36736.44 18863.703 42341.27
## Apr 1994 21149.61 16270.49 27491.85 14116.722 31686.25
## May 1994 22177.42 17061.19 28827.88 14802.755 33226.11
## Jun 1994 23580.87 18140.87 30652.19 15739.516 35328.75
## Jul 1994 28334.55 21797.90 36831.38 18912.452 42450.69
## Aug 1994 28321.23 21787.65 36814.06 18903.560 42430.74
## Sep 1994 31405.93 24160.73 40823.80 20962.508 47052.23
## Oct 1994 34711.77 26703.92 45120.97 23169.053 52005.01
## Nov 1994 56174.03 43214.94 73019.23 37494.462 84159.68
## Dec 1994 122237.73 94038.05 158893.80 81589.975 183136.01
## Jan 1995 17640.97 13531.52 22998.44 11721.681 26549.43
## Feb 1995 23188.60 17786.83 30230.86 15407.845 34898.53
## Mar 1995 36974.06 28360.98 48202.90 24567.703 55645.47
## Apr 1995 27669.68 21224.05 36072.82 18385.332 41642.49
## May 1995 29014.35 22255.47 37825.85 19278.808 43666.20
## Jun 1995 30850.46 23663.86 40219.58 20498.826 46429.53
## Jul 1995 37069.62 28434.27 48327.47 24631.193 55789.28
## Aug 1995 37052.19 28420.91 48304.75 24619.613 55763.05
## Sep 1995 41087.86 31516.47 53566.03 27301.145 61836.67
## Oct 1995 45412.83 34833.94 59204.47 30174.903 68345.69
## Nov 1995 73491.54 56371.74 95810.55 48832.025 110603.79
## Dec 1995 159921.57 122667.95 208488.92 106261.125 240679.82
## Jan 1996 23079.39 17640.46 30195.26 15251.766 34924.36
## Feb 1996 30337.26 23187.92 39690.89 20048.051 45907.16
## Mar 1996 48372.55 36972.98 63286.85 31966.480 73198.66
## Apr 1996 36199.78 27668.87 47360.95 23922.233 54778.49
## May 1996 37958.98 29013.50 49662.56 25084.787 57440.57
## Jun 1996 40361.14 30849.55 52805.34 26672.224 61075.57
## Jul 1996 48497.56 37068.53 63450.40 32049.090 73387.82
## Aug 1996 48474.75 37051.10 63420.57 32034.022 73353.32
## Sep 1996 53754.55 41086.65 70328.24 35523.121 81342.86
## Oct 1996 59412.84 45411.50 77731.09 39262.336 89905.12
## Nov 1996 96147.75 73489.39 125792.17 63538.212 145493.40
## Dec 1996 209222.70 159916.89 273730.57 138262.581 316601.49
As shown above, the logarithms transformation did help simplify the forecasting model. However, the residuals still had correlation with time. In order to further improve the forecast, I will consider to perform the time series decomposition. The decomposition results may give an insight of what is missing in our current model.In addition, applying advanced forecasting models, such as ARIM, may also improve the forecast.