DSCI 609 Assignment #3 Time Series Regression Models

Question 1

Data set mens400 (in fpp package) contains the winning times (in seconds) for the men’s 400 meters final in each Olympic Games from 1896 to 2016.

Plot the winning time against the year. Describe the main features of the plot.

autoplot(mens400) + ggtitle({"Winning Times in Olympic Men's 400m Track Final (1896-2016)"}) + xlab("Time") + ylab("Seconds")

The time plot reveals the following features:

There are three missing values, which occur in 1916, 1940 and 1944 due to the World Wars.
A clear downward trend is observed.
There is no seasonal pattern in the time series data.

Fit a regression line to the data. Obviously, the winning times have been decreasing, but at what average rate per year?

#Interpolate missing values (No transformation is required)
df <- na.interp(mens400, lambda = NULL)
t <- time(df)

fit.mens400 <- tslm(df~t, data = df)

autoplot(mens400, series = "Data") + 
  autolayer(fitted(fit.mens400), series = "Fitted") + ggtitle({"Winning Times in Olympic Men's 400m Track Final (1896-2016)"}) + xlab("Time") + ylab("Seconds") + scale_color_manual(values = c("black", "red"))

summary(fit.mens400)

## 
## Call:
## tslm(formula = df ~ t, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5622 -0.5877 -0.2707  0.5497  4.2157 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 171.697133  10.724592   16.01 6.19e-16 ***
## t            -0.064195   0.005482  -11.71 1.64e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.092 on 29 degrees of freedom
## Multiple R-squared:  0.8254, Adjusted R-squared:  0.8194 
## F-statistic: 137.1 on 1 and 29 DF,  p-value: 1.638e-12

According to the linear regression model, the winning times have been decreasing by 0.064 seconds in average per year.

Plot the residuals against the year. What does this indicate about the suitability of the fitted line?

checkresiduals(fit.mens400)

## 
##  Breusch-Godfrey test for serial correlation of order up to 6
## 
## data:  Residuals from Linear regression model
## LM test = 4.3582, df = 6, p-value = 0.6283

According to the residual plots above, the residual of the fitted linear model has the following features:

The time plot shows some variation around 0 over time. Further, the histogram shows that more residuals have fall smaller than 0. However, the fitted normal distribution has a zero-mean.
The correlogram shows no significant autocorrelation in the residuals. The result of Breusch-Godfrey has confirmed this observation, with a p-value larger than 0.05.

In addition to the residuals, the linear model has large values of both R and R-adjusted. Therefore, we can conclude that this linear regression model is a good fit for our data.

Predict the winning time for the men’s 400 meters final in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?

t <- seq(2020,2020, 4)
forecast(fit.mens400, newdata = data.frame(t))

##      Point Forecast   Lo 80    Hi 80    Lo 95    Hi 95
## 2020       42.02413 40.4981 43.55016 39.64423 44.40402

The winning time for the men’s 400 meters final is 42.02 seconds in the 2020 Olympics. The prediction interval is between 40.50 seconds and 43.55 seconds with 80% confidence.

The assumptions that we made for this forecasting include:

The relationship between the winning time and the year satisfies the linear model.
The forecasting errors (residuals) meet the requirements of zero mean and no autocorrelation. And the error should be independent from the winning time.

Question 2

Data set huron gives the water level of Lake Huron in feet from 1875 to 1972.

Plot the data and comment on its features.

p1 <- autoplot(huron, main = NULL) + xlab("Year") + ylab("feet")

p2 <- ggAcf(window(huron), main = NULL)

grid.arrange(p1, p2, ncol = 2, top = "Level of Lake Huron in feet (reduced by 570 feet) (1875-1972)")

The time plot reveals the following features:

A clear downward trend is observed.
There is no seasonal pattern in the time series data, which is confirmed by the correlogram.

Fit a linear regression.

t <- time(huron)
fit.huron <- tslm(huron~t)
summary(fit.huron)

## 
## Call:
## tslm(formula = huron ~ t)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.50997 -0.72726  0.00083  0.74402  2.53565 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 55.554918   7.764293   7.155 1.66e-10 ***
## t           -0.024201   0.004036  -5.996 3.55e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.13 on 96 degrees of freedom
## Multiple R-squared:  0.2725, Adjusted R-squared:  0.2649 
## F-statistic: 35.95 on 1 and 96 DF,  p-value: 3.545e-08

autoplot(huron, series = "Data") + 
  autolayer(fitted(fit.huron), series = "Fitted") + 
  ggtitle({"Level of Lake Huron in feet (reduced by 570 feet)(1875-1972)"}) + 
  xlab("Year") + ylab("feet") + scale_color_manual(values = c("black", "red"))

A linear regression model is fitted with results shown above. In average, the level of Lake Huron decreases 0.024 feet every year.However, since both R and R-adjusted are small, this linear model is not a good representation of the actual data.

checkresiduals(fit.huron)

## 
##  Breusch-Godfrey test for serial correlation of order up to 10
## 
## data:  Residuals from Linear regression model
## LM test = 63.979, df = 10, p-value = 6.353e-10

The correlogram reveals the autocorrelation in the residuals. In other words, there is information left unexplained. The small p-value of the Breusch-Godfrey test shows the same result.

Generate forecasts from the models for the period up to 1980 and comment on the results.

t <- seq(1973,1980)
fhuron <- forecast(fit.huron, newdata = data.frame(t))
fhuron

##      Point Forecast    Lo 80    Hi 80    Lo 95     Hi 95
## 1973       7.806127 6.317648 9.294605 5.516501 10.095752
## 1974       7.781926 6.292536 9.271315 5.490899 10.072952
## 1975       7.757724 6.267406 9.248043 5.465269 10.050179
## 1976       7.733523 6.242259 9.224788 5.439613 10.027434
## 1977       7.709322 6.217094 9.201550 5.413929 10.004715
## 1978       7.685121 6.191912 9.178331 5.388219  9.982024
## 1979       7.660920 6.166712 9.155128 5.362481  9.959359
## 1980       7.636719 6.141494 9.131943 5.336717  9.936721

autoplot(fhuron, series = "Prediction") + ggtitle({"Level of Lake Huron in feet (reduced by 570 feet)"}) + 
  xlab("Year") + ylab("feet")

The forecasting results are shown above. According to the plot, we can conclude the linear regression model fails to fit the actual data well. With large variation over time, a linear model is not adequate.

Question 3

Daily electricity demand for Victoria, Australia, during 2014 is contained in elecdaily.

Plot the data and find the regression model for Demand with temperature as an explanatory variable. Why is there a positive relationship?

elecdaily %>%
  as.data.frame() %>%
  ggplot(aes(x=Temperature, y=Demand)) + 
  ggtitle("Electricity Demand for Victoria, Australia (2014)") +
  ylab("Demand") + 
  xlab("Temperature (Celsius)") + 
  geom_point() + 
  geom_smooth(method="lm", se=FALSE)

fit.elecdaily <- tslm(Demand~Temperature, data = elecdaily)

summary(fit.elecdaily)

## 
## Call:
## tslm(formula = Demand ~ Temperature, data = elecdaily)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.474 -15.719  -0.336  15.767 117.184 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 212.3856     5.0080  42.409   <2e-16 ***
## Temperature   0.4182     0.2263   1.848   0.0654 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.55 on 363 degrees of freedom
## Multiple R-squared:  0.009322,   Adjusted R-squared:  0.006592 
## F-statistic: 3.416 on 1 and 363 DF,  p-value: 0.0654

A linear model is fitted, as shown in the plot. With the coefficient equal to 0.42, there is a positive relationship between demand and temperature.

However, in the scatter plot, we can see that the relationship between demand and temperature is not linear. Further, smaller values of R and R-adjusted also reveal that a linear model is not adequate for this dataset.

Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?

checkresiduals(fit.elecdaily)

## 
##  Breusch-Godfrey test for serial correlation of order up to 14
## 
## data:  Residuals from Linear regression model
## LM test = 271.66, df = 14, p-value < 2.2e-16

According to the time plot and histogram, the residuals variate around 0. The correlogram shows a clear seasonality pattern remained in the residuals. Therefore, the linear model is not adequate.

According to the scatter plot, there are no outliers. Even though there are some points whose temperatures are larger than 40 and demand larger than 300. However, these values are reasonable and are not extreme enough to be outliers. Since these points are locating on one side of the fitted line, they definitely influnce the coefficent of the forecasting model.

Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15 and compare it with the forecast if the with maximum temperature was 35. Do you believe these forecasts?

forecast(fit.elecdaily, newdata = data.frame(Temperature = c(15, 35)))

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 53.14286       218.6592 184.4779 252.8406 166.3039 271.0146
## 53.28571       227.0242 192.6585 261.3898 174.3866 279.6617

The forecasted demand is 218.66 when the maximum temperature is 15. And the prediction interval is between 184.48 and 252.84 with 80% confidence. According to the scatter plot in the first section, this prediction interval covers most of the data points when the temperature is 15. Therefore, we can believe this forecast.

When the maximum temperature is 35, the forecasted demand is 227.02, with the prediction interval of [192.66, 261.39]. According to the scatter plot, the demand dramatically increases when the temperature is over 35. Therefore, this forecast may underestimate the demand.

Give prediction intervals for your forecasts.

The prediction intervals are [184.48, 252.84] and [192.66, 261.39] with 80% confidence level, for the maximum temperature of 15 and 35, respectively.

Plot Demand vs Temperature for all of the available data in elecdaily. What does this say about your model?

elecdaily %>%
  as.data.frame() %>%
  ggplot(aes(x=Temperature, y=Demand, colour=as.factor(WorkDay))) + 
  geom_point() +
  labs(title = "Electricity Demand for Victoria, Australia (2014)", x="Demand", y="Temperature (Celsius)", colour = NULL) +
  scale_color_manual(labels=c("Weekend", "Workday"), values=c("blue", "red"))

The red dots represent the data on workdays, while the blue dots represent the data on weekends. According to the plot above, the demands on workdays are higher than those on weekends. Therefore, in order to improve the forecasting model, we should include the WorkDay factor in the prediction model.

Question 4

The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.

p1 <- autoplot(fancy, main=NULL) + xlab("Month") + ylab("Sales")

p2 <- ggsubseriesplot(fancy, main=NULL)

grid.arrange(p1, p2, ncol = 2, top = "Monthly Sales for a Souvenir Shop (Jan 1987 - Dec 1993)")

According to the time plot and subseries plot, the time series data has the following features:

There is a clearly upward trend over time.
There is a strong seasonal pattern with frequency of 12 month.
The sales increases dramatically every December, because of Christmas. The variation increases over time.
Since 1988, the montly sales in March reaches a small peak, because of the surfing festival.

Explain why it is necessary to take logarithms of these data before fitting a model.

p1 <- autoplot(fancy, main="Original Data") + xlab("Month") + ylab("Sales")

p2 <- autoplot(BoxCox(fancy, 0), main="After Log Transformation") + xlab("Month") + ylab("Log(Sales)")

grid.arrange(p1, p2, ncol = 2)

In the original data, the variation increases with the level of the series, which will make it difficult to fit a forecasting model. After the logarithms transformation, the size of the seasonal variation is about the same over time. The transformed data will make the forecasting model simpler.

Fit a regression model to the logarithms of these sales data with a linear trend and seasonal variables.

fit.fancy <- tslm(BoxCox(fancy, 0)~trend+season)

#summary(fit.fancy)

autoplot(BoxCox(fancy, 0), series = "Data" ) + 
  autolayer(fitted(fit.fancy), series = "Fitted") +
  scale_color_manual(values = c("black", "red")) +
  ggtitle("Monthly Sales for a Souvenir Shop (Jan 1987 - Dec 1993)") +
  xlab("Month") + ylab("Log(Sales)")

Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?

p1 <- autoplot(fit.fancy$residuals)

cbind(Fitted = fitted(fit.fancy),
      Residuals = residuals(fit.fancy)) %>%
  as.data.frame() %>%
  ggplot(aes(x = Fitted, y = Residuals)) + geom_point() -> p2

grid.arrange(p1, p2, ncol = 2)

The residuals show non-linear patterns over time, which means there is correlation between residuals and time. The variation of residuals against the fitted values are not constant. Therefore, we can conclude that the current model is not adequate for this time series.

What do the values of the coefficients tell you about each variable?

summary(fit.fancy)

## 
## Call:
## tslm(formula = BoxCox(fancy, 0) ~ trend + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41644 -0.12619  0.00608  0.11389  0.38567 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.6058604  0.0768740  98.939  < 2e-16 ***
## trend       0.0223930  0.0008448  26.508  < 2e-16 ***
## season2     0.2510437  0.0993278   2.527 0.013718 *  
## season3     0.6952066  0.0993386   6.998 1.18e-09 ***
## season4     0.3829341  0.0993565   3.854 0.000252 ***
## season5     0.4079944  0.0993817   4.105 0.000106 ***
## season6     0.4469625  0.0994140   4.496 2.63e-05 ***
## season7     0.6082156  0.0994534   6.116 4.69e-08 ***
## season8     0.5853524  0.0995001   5.883 1.21e-07 ***
## season9     0.6663446  0.0995538   6.693 4.27e-09 ***
## season10    0.7440336  0.0996148   7.469 1.61e-10 ***
## season11    1.2030164  0.0996828  12.068  < 2e-16 ***
## season12    1.9581366  0.0997579  19.629  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1858 on 71 degrees of freedom
## Multiple R-squared:  0.9527, Adjusted R-squared:  0.9447 
## F-statistic: 119.1 on 12 and 71 DF,  p-value: < 2.2e-16

With positive coefficients, the sales amount increases as time goes on. December has the largest coefficient, which means a dramatically increasing in the sales every December. In addition, the coefficient of March is larger than those of the first six months. This observation verifies the sales increase in March, because of the surfing festival.

Use your regression model to predict the monthly sales for 1994, 1995, and 1996.

pred.fancy <- forecast(fit.fancy, h = 12*3)

autoplot(pred.fancy, series = "Prediction") + ggtitle({"Monthly Sales for a Souvenir Shop - Prediction (Jan 1994 - Dec 1996)"}) + 
  xlab("Month") + ylab("Log(Sales)")

Transform your predictions and intervals to obtain predictions and intervals for the raw data.

pred.fancy <- forecast(fit.fancy, h = 12*3)

pred_raw <- exp(data.frame(pred.fancy))

print(pred_raw)

##          Point.Forecast     Lo.80     Hi.80      Lo.95     Hi.95
## Jan 1994       13484.06  10373.35  17527.60   9000.202  20201.76
## Feb 1994       17724.45  13635.50  23039.57  11830.533  26554.69
## Mar 1994       28261.51  21741.71  36736.44  18863.703  42341.27
## Apr 1994       21149.61  16270.49  27491.85  14116.722  31686.25
## May 1994       22177.42  17061.19  28827.88  14802.755  33226.11
## Jun 1994       23580.87  18140.87  30652.19  15739.516  35328.75
## Jul 1994       28334.55  21797.90  36831.38  18912.452  42450.69
## Aug 1994       28321.23  21787.65  36814.06  18903.560  42430.74
## Sep 1994       31405.93  24160.73  40823.80  20962.508  47052.23
## Oct 1994       34711.77  26703.92  45120.97  23169.053  52005.01
## Nov 1994       56174.03  43214.94  73019.23  37494.462  84159.68
## Dec 1994      122237.73  94038.05 158893.80  81589.975 183136.01
## Jan 1995       17640.97  13531.52  22998.44  11721.681  26549.43
## Feb 1995       23188.60  17786.83  30230.86  15407.845  34898.53
## Mar 1995       36974.06  28360.98  48202.90  24567.703  55645.47
## Apr 1995       27669.68  21224.05  36072.82  18385.332  41642.49
## May 1995       29014.35  22255.47  37825.85  19278.808  43666.20
## Jun 1995       30850.46  23663.86  40219.58  20498.826  46429.53
## Jul 1995       37069.62  28434.27  48327.47  24631.193  55789.28
## Aug 1995       37052.19  28420.91  48304.75  24619.613  55763.05
## Sep 1995       41087.86  31516.47  53566.03  27301.145  61836.67
## Oct 1995       45412.83  34833.94  59204.47  30174.903  68345.69
## Nov 1995       73491.54  56371.74  95810.55  48832.025 110603.79
## Dec 1995      159921.57 122667.95 208488.92 106261.125 240679.82
## Jan 1996       23079.39  17640.46  30195.26  15251.766  34924.36
## Feb 1996       30337.26  23187.92  39690.89  20048.051  45907.16
## Mar 1996       48372.55  36972.98  63286.85  31966.480  73198.66
## Apr 1996       36199.78  27668.87  47360.95  23922.233  54778.49
## May 1996       37958.98  29013.50  49662.56  25084.787  57440.57
## Jun 1996       40361.14  30849.55  52805.34  26672.224  61075.57
## Jul 1996       48497.56  37068.53  63450.40  32049.090  73387.82
## Aug 1996       48474.75  37051.10  63420.57  32034.022  73353.32
## Sep 1996       53754.55  41086.65  70328.24  35523.121  81342.86
## Oct 1996       59412.84  45411.50  77731.09  39262.336  89905.12
## Nov 1996       96147.75  73489.39 125792.17  63538.212 145493.40
## Dec 1996      209222.70 159916.89 273730.57 138262.581 316601.49

What are your recommendations for improving these forecasts by modifying the model?

As shown above, the logarithms transformation did help simplify the forecasting model. However, the residuals still had correlation with time. In order to further improve the forecast, I will consider to perform the time series decomposition. The decomposition results may give an insight of what is missing in our current model.In addition, applying advanced forecasting models, such as ARIM, may also improve the forecast.

DSCI 609 Assignment #3 Time Series Regression Models

Yiling He

February 22, 2019

Question 1

Question 2

Question 3

Question 4