DSCI-FXP-SP19- Assignment #3 Time Series Regression Models Luyun Liang

library(fpp2)

## Loading required package: ggplot2

## Loading required package: forecast

## Loading required package: fma

## Loading required package: expsmooth

Data set mens400 (in fpp package) contains the winning times (in seconds) for the men’s 400 meters final in each Olympic Games from 1896 to 2016.

#Plot the winning time against the year. Describe the main features of the plot.
autoplot(mens400)

#feature1: overall, the winning time for the men's 400m final in Olympic Games is decreasing from 1896 to 2016.
#feature2: there are missing values.

#Fit a regression line to the data. Obviously, the winning times have been decreasing, but at what average rate per year?
time_mens400 <- time(mens400)
tslm_mens400<- tslm(mens400 ~time_mens400,data = mens400)
autoplot(mens400)+geom_abline(slope = tslm_mens400$coefficients[2],
                              intercept = tslm_mens400$coefficients[1],
                              colour="blue")

#get the winning time decreasing rate
tslm_mens400$coefficients[2]

## time_mens400 
##  -0.06457385

The winning time decreasing rate is 0.06457385.

#Plot the residuals against the year. What does this indicate about the suitability of the fitted line?
cbind(Time = time_mens400, 
      Residuals = tslm_mens400$residuals) %>%
  as.data.frame() %>%
  ggplot(aes(x = Time, y = Residuals)) +
    geom_point() +
    ylab("Residuals of Regression Line(Unit:s)")

## Warning: Removed 3 rows containing missing values (geom_point).

The residual plot indicates that the regression model generally fitted the data with one obvious outlier.

checkresiduals(tslm_mens400)

## 
##  Breusch-Godfrey test for serial correlation of order up to 6
## 
## data:  Residuals from Linear regression model
## LM test = 3.6082, df = 6, p-value = 0.7295

#Predict the winning time for the men’s 400 meters final in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?
#I use linear model in forecast function to make prediction interval with na.action argument as na.exclude to exclude missing value, because forecast function can't calculate prediction interval with missing value.
#I assume the model's residuals were normally distributed.
lm_mens400 <- lm(
  mens400 ~ time_mens400, 
  data = mens400,
  na.action = na.exclude
  )
fc_mens400 <- forecast(
  lm_mens400, 
  newdata = data.frame(time_mens400 = 2020)
  )
autoplot(mens400) +
  autolayer(fc_mens400, PI = TRUE)

# Get 80% and 95% prediction intervals
fc_mens400$upper

##          [,1]     [,2]
## [1,] 43.63487 44.53176

fc_mens400$lower

##          [,1]     [,2]
## [1,] 40.44975 39.55286

80% interval is from 40.45 to 43.63 95% interval is from 39.55 to 44.53

checkresiduals(lm_mens400)

## 
##  Breusch-Godfrey test for serial correlation of order up to 6
## 
## data:  Residuals
## LM test = 3.6082, df = 6, p-value = 0.7295

# the residuals were not not normally distributed.

Data set huron gives the water level of Lake Huron in feet from 1875 to 1972.

#Plot the data and comment on its features.
autoplot(huron)

# there are some cyclic patterns and there is a overall dreasing trend from 1880 to 1930.After 1930, the trend disappears.

#Fit a linear regression.

#simple linear regression
tslm_huron<- tslm(huron ~ trend)
fc_tslm_huron<- forecast(tslm_huron)

#linear regression with log transformation
tslm_log_huron <- tslm(huron ~ trend, 
                          lambda = 0)
fc_tslm_log_huron <- forecast(tslm_log_huron)

#plot the results
autoplot(huron)+
  autolayer(fitted(tslm_huron),series = "Linear")+
  autolayer(fitted(tslm_log_huron),series = "Logarithm")

#Generate forecasts from the models for the period up to 1980 and comment on the results
autoplot(huron) +
  autolayer(fitted(tslm_huron), series = "Linear") +
  autolayer(fc_tslm_huron, series = "Linear") +
  xlab("Year") +  ylab("Water level") +
  ggtitle("Lake Huron water level change",
          subtitle = "Linear Regression model") +
  guides(colour=guide_legend(title=" ")) +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

autoplot(huron) +
  autolayer(fitted(tslm_log_huron), series = "Logarithm") +
  autolayer(fc_tslm_log_huron, series="Logarithm") +
  xlab("Year") +  ylab("Water level") +
  ggtitle("Lake Huron water level change",
          subtitle = "Logarithm linear model") +
  guides(colour=guide_legend(title=" ")) +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

# both Linear regression model and log transformed linear model indicate couldn't relfect the trend change around the year 1915. The upper and the lower bounds of prediction intervals decrease as time goes on.

Daily electricity demand for Victoria, Australia, during 2014 is contained in elecdaily

#Plot the data and find the regression model for Demand with temperature as an explanatory variable. Why is there a positive relationship?
autoplot(elecdaily)

#regression model
tslm_dem_temp <- tslm(Demand ~ Temperature, data = elecdaily)
tslm_dem_temp

## 
## Call:
## tslm(formula = Demand ~ Temperature, data = elecdaily)
## 
## Coefficients:
## (Intercept)  Temperature  
##    212.3856       0.4182

# There is a positive relationship between demand and temperature because as temperature increases, people might use air conditioner to cool off, thus increasing the electricity demand.

#Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
checkresiduals(tslm_dem_temp$residuals)

## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

# The model looks adeqate and the residuals are normally distributed. There is an outlier.

#Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15 and compare it with the forecast if the with maximum temperature was 35. Do you believe these forecasts?
fc_dem_temp <- forecast(tslm_dem_temp, 
  newdata=data.frame(Temperature=c(15,35)))
fc_dem_temp

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 53.14286       218.6592 184.4779 252.8406 166.3039 271.0146
## 53.28571       227.0242 192.6585 261.3898 174.3866 279.6617

# The forecasts look good becasue the forecast temperature is close to the range of temperature of the data.

#Give prediction intervals for your forecasts
# 80% intervals
fc_dem_temp$upper[, 1]

## Time Series:
## Start = c(53, 2) 
## End = c(53, 3) 
## Frequency = 7 
## [1] 252.8406 261.3898

fc_dem_temp$lower[, 1]

## Time Series:
## Start = c(53, 2) 
## End = c(53, 3) 
## Frequency = 7 
## [1] 184.4779 192.6585

# 95% intervals
fc_dem_temp$upper[, 2]

## Time Series:
## Start = c(53, 2) 
## End = c(53, 3) 
## Frequency = 7 
## [1] 271.0146 279.6617

fc_dem_temp$lower[, 2]

## Time Series:
## Start = c(53, 2) 
## End = c(53, 3) 
## Frequency = 7 
## [1] 166.3039 174.3866

#Plot Demand vs Temperature for all of the available data in elecdaily. What does this say about your model?
elecdaily %>%
  as.data.frame() %>%
  ggplot(aes(x=Temperature, y=Demand)) +
    ylab("Electricity Demand") +
    xlab("Temperature") +
    geom_point() +
    geom_smooth(method="lm", se=FALSE)

# the plot shows that the model was not fit for all the data. It only works for part of the data and couldn't explain the relationship very well.

The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

#Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series
autoplot(fancy)

# There is a seasonal pattern from 1987 to 1994. # There is an overall increasing trend from 1987 to 1994. # Sales generally increased from January to December. # There is a dramatic increase every year in December. # There is also an increase in March each year due to local surfing festival. # However, in 1991, there is an unexpected decrease in sales.

# Explain why it is necessary to take logarithms of these data before fitting a model.

# The log transformation can be used to make high skewed distribution less skewed. In this case, the fancy data increased exponentially over years and the size of the seasonal variations are almost same across the whole series. Therefore, it is necessary to take logarithms of the data.

#Fit a regression model to the logarithms of these sales data with a linear trend and seasonal variables.
log_fancy <- log(fancy)
dummy_festival = rep(0, length(fancy))
dummy_festival[seq_along(dummy_festival)%%12 == 3] <- 1
dummy_festival[3] <- 0
dummy_festival <- ts(dummy_festival, freq = 12, start=c(1987,1))
my_data <- data.frame(
  log_fancy,
  dummy_festival
)

tslm_log_fancy <- tslm(log_fancy ~ trend + season + dummy_festival, data=my_data)

future_data <- data.frame(
  dummy_festival = rep(0, 12)
)
future_data[3,] <- 1
forecast(tslm_log_fancy, newdata=future_data)

##          Point Forecast     Lo 80     Hi 80     Lo 95    Hi 95
## Jan 1994       9.491352  9.238522  9.744183  9.101594  9.88111
## Feb 1994       9.764789  9.511959 10.017620  9.375031 10.15455
## Mar 1994      10.302990 10.048860 10.557120  9.911228 10.69475
## Apr 1994       9.941465  9.688635 10.194296  9.551707 10.33122
## May 1994       9.988919  9.736088 10.241749  9.599161 10.37868
## Jun 1994      10.050280  9.797449 10.303110  9.660522 10.44004
## Jul 1994      10.233926  9.981095 10.486756  9.844168 10.62368
## Aug 1994      10.233456  9.980625 10.486286  9.843698 10.62321
## Sep 1994      10.336841 10.084010 10.589671  9.947083 10.72660
## Oct 1994      10.436923 10.184092 10.689753 10.047165 10.82668
## Nov 1994      10.918299 10.665468 11.171129 10.528541 11.30806
## Dec 1994      11.695812 11.442981 11.948642 11.306054 12.08557

#Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?

plot(residuals(tslm_log_fancy), type='p')
abline(h = 0, col="grey")

#The residuals have pattern over time, indicating correlation between residuals and time. #The residuals plotted against time vary from -0.03 to 0.03.
#There seems to be a trend for the residuals to increase from 1991 to 1994, which might include some missing relationship in the model.

#What do the values of the coefficients tell you about each variable?
summary(tslm_log_fancy)

## 
## Call:
## tslm(formula = log_fancy ~ trend + season + dummy_festival, data = my_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33673 -0.12757  0.00257  0.10911  0.37671 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.6196670  0.0742471 102.626  < 2e-16 ***
## trend          0.0220198  0.0008268  26.634  < 2e-16 ***
## season2        0.2514168  0.0956790   2.628 0.010555 *  
## season3        0.2660828  0.1934044   1.376 0.173275    
## season4        0.3840535  0.0957075   4.013 0.000148 ***
## season5        0.4094870  0.0957325   4.277 5.88e-05 ***
## season6        0.4488283  0.0957647   4.687 1.33e-05 ***
## season7        0.6104545  0.0958039   6.372 1.71e-08 ***
## season8        0.5879644  0.0958503   6.134 4.53e-08 ***
## season9        0.6693299  0.0959037   6.979 1.36e-09 ***
## season10       0.7473919  0.0959643   7.788 4.48e-11 ***
## season11       1.2067479  0.0960319  12.566  < 2e-16 ***
## season12       1.9622412  0.0961066  20.417  < 2e-16 ***
## dummy_festival 0.5015151  0.1964273   2.553 0.012856 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared:  0.9567, Adjusted R-squared:  0.9487 
## F-statistic:   119 on 13 and 70 DF,  p-value: < 2.2e-16

# The model has positive trend, indicating the sales increase as time goes by.
# The value of the coefficients show how much each month contributes to the conditional mean of the model.
#All months have positive coefficients and are statistically significant except for March.

#Use your regression model to predict the monthly sales for 1994, 1995, and 1996. 
future_data <- data.frame(
  dummy_festival = rep(0, 36)
)
fc_tslm_log_fancy <- forecast(tslm_log_fancy, newdata=future_data)
#plot the forecast
autoplot(fc_tslm_log_fancy)

# show prediction interval
fc_tslm_log_fancy$upper

##               [,1]     [,2]
## Jan 1994  9.744183  9.88111
## Feb 1994 10.017620 10.15455
## Mar 1994 10.141071 10.32499
## Apr 1994 10.194296 10.33122
## May 1994 10.241749 10.37868
## Jun 1994 10.303110 10.44004
## Jul 1994 10.486756 10.62368
## Aug 1994 10.486286 10.62321
## Sep 1994 10.589671 10.72660
## Oct 1994 10.689753 10.82668
## Nov 1994 11.171129 11.30806
## Dec 1994 11.948642 12.08557
## Jan 1995 10.011336 10.14984
## Feb 1995 10.284773 10.42328
## Mar 1995 10.408928 10.59481
## Apr 1995 10.461449 10.59996
## May 1995 10.508903 10.64741
## Jun 1995 10.570264 10.70877
## Jul 1995 10.753910 10.89242
## Aug 1995 10.753440 10.89195
## Sep 1995 10.856825 10.99533
## Oct 1995 10.956907 11.09541
## Nov 1995 11.438282 11.57679
## Dec 1995 12.215796 12.35430
## Jan 1996 10.279093 10.41951
## Feb 1996 10.552530 10.69294
## Mar 1996 10.677222 10.86530
## Apr 1996 10.729206 10.86962
## May 1996 10.776659 10.91707
## Jun 1996 10.838021 10.97843
## Jul 1996 11.021667 11.16208
## Aug 1996 11.021196 11.16161
## Sep 1996 11.124582 11.26499
## Oct 1996 11.224664 11.36508
## Nov 1996 11.706039 11.84645
## Dec 1996 12.483552 12.62396

fc_tslm_log_fancy$lower

##               [,1]      [,2]
## Jan 1994  9.238522  9.101594
## Feb 1994  9.511959  9.375031
## Mar 1994  9.461879  9.277961
## Apr 1994  9.688635  9.551707
## May 1994  9.736088  9.599161
## Jun 1994  9.797449  9.660522
## Jul 1994  9.981095  9.844168
## Aug 1994  9.980625  9.843698
## Sep 1994 10.084010  9.947083
## Oct 1994 10.184092 10.047165
## Nov 1994 10.665468 10.528541
## Dec 1994 11.442981 11.306054
## Jan 1995  9.499844  9.361338
## Feb 1995  9.773281  9.634775
## Mar 1995  9.722498  9.536620
## Apr 1995  9.949957  9.811451
## May 1995  9.997411  9.858904
## Jun 1995 10.058772  9.920265
## Jul 1995 10.242418 10.103911
## Aug 1995 10.241948 10.103441
## Sep 1995 10.345333 10.206826
## Oct 1995 10.445415 10.306908
## Nov 1995 10.926791 10.788284
## Dec 1995 11.704304 11.565797
## Jan 1996  9.760564  9.620151
## Feb 1996 10.034000  9.893588
## Mar 1996  9.982679  9.794605
## Apr 1996 10.210677 10.070264
## May 1996 10.258130 10.117718
## Jun 1996 10.319491 10.179079
## Jul 1996 10.503137 10.362725
## Aug 1996 10.502667 10.362254
## Sep 1996 10.606052 10.465640
## Oct 1996 10.706134 10.565722
## Nov 1996 11.187510 11.047097
## Dec 1996 11.965023 11.824611

# Transform your predictions and intervals to obtain predictions and intervals for the raw data.
df<- as.data.frame(fc_tslm_log_fancy)
df<- exp(df)

#What are your recommendations for improving these forecasts by modifying the model?
#as we found earlier, there are some autocorrelation in the residuals, we might consider using a dynamic regression model that can better fit the data with less bias

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)

##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

DSCI-FXP-SP19- Assignment #3 Time Series Regression Models “Luyun Liang”

Luyun Liang

2/20/2019

R Markdown

Including Plots