library(fpp2)

## Warning: package 'fpp2' was built under R version 3.5.2

## Loading required package: ggplot2

## Loading required package: forecast

## Warning: package 'forecast' was built under R version 3.5.2

## Loading required package: fma

## Warning: package 'fma' was built under R version 3.5.2

## Loading required package: expsmooth

## Warning: package 'expsmooth' was built under R version 3.5.2

Question 1

Data set mens400 (in fpp package) contains the winning times (in seconds) for the men’s 400 meters final in each Olympic Games from 1896 to 2016.

#- Plot the winning time against the year. Describe the main features of the plot.
autoplot(mens400)

#Observations: 
#There are missing values ( Total 3 : Missing values occur in 1916, 1940 and 1944 due to the World Wars.)
# With every year the winning time is keep on decreasing. and it stress a decreasing trend of winning time over a period of time.

#- Fit a regression line to the data. Obviously, the winning times have been decreasing, but at what average rate per year?
#First create a independent variable that will have all the years which we have used
mens400_df<- na.interp(mens400, lambda = NULL)
year <- time(mens400_df)

fit <- tslm(mens400_df ~ year )
summary(fit)

## 
## Call:
## tslm(formula = mens400_df ~ year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5622 -0.5877 -0.2707  0.5497  4.2157 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 171.697133  10.724592   16.01 6.19e-16 ***
## year         -0.064195   0.005482  -11.71 1.64e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.092 on 29 degrees of freedom
## Multiple R-squared:  0.8254, Adjusted R-squared:  0.8194 
## F-statistic: 137.1 on 1 and 29 DF,  p-value: 1.638e-12

intercept_coeff<- fit$coefficients[1]
year_coeff<-fit$coefficients[2]
autoplot(mens400) +
  geom_abline(slope=year_coeff,intercept=intercept_coeff)

print(year_coeff)

##        year 
## -0.06419456

#Here year coefficient is negative , that means winning time is reducing w.r.to year. and we got a value of -0.0645 , that gives the average decrease of winning time of 0.0645 sec for every additional years

#- Plot the residuals against the year. What does this indicate about the suitability of the fitted line?
#Plotting 
cbind(year,
      Residuals = fit$residuals) %>%
  as.data.frame() %>%
  ggplot(aes(x = year, y = Residuals)) +
    geom_point() +
    ylab("Residuals of Regression Line(Unit:s)")

checkresiduals(fit)

## 
##  Breusch-Godfrey test for serial correlation of order up to 6
## 
## data:  Residuals from Linear regression model
## LM test = 4.3582, df = 6, p-value = 0.6283

#In the plot majority of points are below zero and it looks like residual mean might not be zero. So in this case our forecasting method needs to be improved. There is no autocorrelaion within the residuals as majority of points are with in the boundary(inside the blue line). By seeing the distribution graph, it seems to be skewed to left as majority of points are negative.

#Predict the winning time for the men's 400 meters final in the 2020 Olympics. Give a prediction interval for your forecasts. What assumptions have you made in these calculations?


fst_mens400 <- forecast(
  fit, 
  newdata = data.frame(year = 2020)
  )
print(fst_mens400)

##      Point Forecast   Lo 80    Hi 80    Lo 95    Hi 95
## 2020       42.02413 40.4981 43.55016 39.64423 44.40402

autoplot(mens400) +
  autolayer(fst_mens400, PI = TRUE)

checkresiduals(fst_mens400)

## 
##  Ljung-Box test
## 
## data:  Residuals from Linear regression model
## Q* = 4.978, df = 4.2, p-value = 0.3148
## 
## Model df: 2.   Total lags used: 6.2

summary(fit)

## 
## Call:
## tslm(formula = mens400_df ~ year)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5622 -0.5877 -0.2707  0.5497  4.2157 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 171.697133  10.724592   16.01 6.19e-16 ***
## year         -0.064195   0.005482  -11.71 1.64e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.092 on 29 degrees of freedom
## Multiple R-squared:  0.8254, Adjusted R-squared:  0.8194 
## F-statistic: 137.1 on 1 and 29 DF,  p-value: 1.638e-12

#Prediction interval for 2020 year with 80 percent confidence interval is 40.4981(low) and 43.55016(high)
#It looks like residuals of the fitted model seems to be around zero.
#Autocorrelation of the residuals is zero, also p value in Ljung-Box test is greater than zero
#Apart from that adjusted r2 and r2 is showing a good value that is greater than 80percent. and these assumption concludes a good linear model to fit our data.

Question 2

Data set huron gives the water level of Lake Huron in feet from 1875 to 1972.

#  Plot the data and comment on its features.

autoplot(huron)

#There is a some negative trend till 1935 and then there is slight positive trend till it vanishes to overall low in 1962

#- Fit a linear regression.
fit_huron <- tslm(huron ~ trend)
summary(fit_huron)

## 
## Call:
## tslm(formula = huron ~ trend)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.50997 -0.72726  0.00083  0.74402  2.53565 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.202037   0.230111  44.335  < 2e-16 ***
## trend       -0.024201   0.004036  -5.996 3.55e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.13 on 96 degrees of freedom
## Multiple R-squared:  0.2725, Adjusted R-squared:  0.2649 
## F-statistic: 35.95 on 1 and 96 DF,  p-value: 3.545e-08

#Here trend coefficient is negative , that level of lake  is reducing w.r.to year. and we got a value of -0.024 , that gives the average decrease of lake level of 0.0645 sec for every additional years

#- Generate forecasts from the models for the period up to 1980 and comment on the results.
fst.huron <- forecast(fit_huron, h = 8)
fst.huron

##      Point Forecast    Lo 80    Hi 80    Lo 95     Hi 95
## 1973       7.806127 6.317648 9.294605 5.516501 10.095752
## 1974       7.781926 6.292536 9.271315 5.490899 10.072952
## 1975       7.757724 6.267406 9.248043 5.465269 10.050179
## 1976       7.733523 6.242259 9.224788 5.439613 10.027434
## 1977       7.709322 6.217094 9.201550 5.413929 10.004715
## 1978       7.685121 6.191912 9.178331 5.388219  9.982024
## 1979       7.660920 6.166712 9.155128 5.362481  9.959359
## 1980       7.636719 6.141494 9.131943 5.336717  9.936721

autoplot(huron) +
  autolayer(fitted(fit_huron), series = "Fitted") +
  autolayer(fst.huron, series = "Prediction") +
  xlab("Year") +  ylab("Water level")

#It looks like the linear regression model fails to fit the actual data well.

Question 3

Daily electricity demand for Victoria, Australia, during 2014 is contained in elecdaily.

# Plot the data and find the regression model for Demand with temperature as an explanatory variable. Why is there a positive relationship?

autoplot(elecdaily)

head(elecdaily)

## Time Series:
## Start = c(1, 1) 
## End = c(1, 6) 
## Frequency = 7 
##            Demand WorkDay Temperature
## 1.000000 174.8963       0        26.0
## 1.142857 188.5909       1        23.0
## 1.285714 188.9169       1        22.2
## 1.428571 173.8142       0        20.3
## 1.571429 169.5152       0        26.1
## 1.714286 195.7288       1        19.6

summary(elecdaily)

##      Demand         WorkDay        Temperature   
##  Min.   :166.7   Min.   :0.0000   Min.   : 9.80  
##  1st Qu.:205.4   1st Qu.:0.0000   1st Qu.:16.60  
##  Median :221.0   Median :1.0000   Median :20.30  
##  Mean   :221.3   Mean   :0.6877   Mean   :21.26  
##  3rd Qu.:236.7   3rd Qu.:1.0000   3rd Qu.:24.30  
##  Max.   :347.6   Max.   :1.0000   Max.   :43.20

#Plot
elecdaily %>%
  as.data.frame() %>%
  ggplot(aes(x=Temperature, y=Demand)) + 
  ggtitle("Half-hourly and daily electricity demand for Victoria, Australia, in 2014") +
  ylab("Demand") + 
  xlab("Temperature") + 
  geom_point() + 
  geom_smooth(method="lm", se=FALSE)

#regression model
fit.elecdaily <- tslm(Demand ~ Temperature, data = elecdaily)
fit.elecdaily

## 
## Call:
## tslm(formula = Demand ~ Temperature, data = elecdaily)
## 
## Coefficients:
## (Intercept)  Temperature  
##    212.3856       0.4182

summary(fit.elecdaily)

## 
## Call:
## tslm(formula = Demand ~ Temperature, data = elecdaily)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -55.474 -15.719  -0.336  15.767 117.184 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 212.3856     5.0080  42.409   <2e-16 ***
## Temperature   0.4182     0.2263   1.848   0.0654 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 26.55 on 363 degrees of freedom
## Multiple R-squared:  0.009322,   Adjusted R-squared:  0.006592 
## F-statistic: 3.416 on 1 and 363 DF,  p-value: 0.0654

cbind(Data = elecdaily[,"Demand"],
Fitted = fitted(fit.elecdaily)) %>%
as.data.frame() %>%
ggplot(aes(x=Data, y=Fitted)) +
geom_point() +
ylab("Fitted (predicted values)") +
xlab("Data (actual values)") +
ggtitle("Percent change in US consumption expenditure") +
  geom_abline(intercept=0, slope=1)

checkresiduals(fit.elecdaily)

## 
##  Breusch-Godfrey test for serial correlation of order up to 14
## 
## data:  Residuals from Linear regression model
## LM test = 271.66, df = 14, p-value < 2.2e-16

#Here coeff is positive and its value is 0.4182. This means for every unit increase of temperature there could be an increase of 0.4182 in demand.
# R2 and Adjusted R2 is pretty low showing a bad model
# Pot shows that demand and temperature is not linear.

#- Produce a residual plot. Is the model adequate? Are there any outliers or influential observations?
checkresiduals(fit.elecdaily)

## 
##  Breusch-Godfrey test for serial correlation of order up to 14
## 
## data:  Residuals from Linear regression model
## LM test = 271.66, df = 14, p-value < 2.2e-16

# It looks like residuals fluctuate at a base zero with a seasonal pattern.
#The residuals are normally distributed
# There is no outliers as per scatterplot

#- Use the model to forecast the electricity demand that you would expect for the next day if the maximum temperature was 15 and compare it with the forecast if the with maximum temperature was 35. Do you believe these forecasts?
#- Give prediction intervals for your forecasts.

forecast(fit.elecdaily, newdata = data.frame(Temperature =  c(15,35) ))

##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 53.14286       218.6592 184.4779 252.8406 166.3039 271.0146
## 53.28571       227.0242 192.6585 261.3898 174.3866 279.6617

#- Give prediction intervals for your forecasts.
#Prediction intervals is between 184.4779 & 252.8406 for a confidence level of 80 for the temperature 15
#Prediction intervals is between 192.6585 & 261.3898 for a confidence level of 80 for the temperature 35

#- Plot Demand vs Temperature for all of the available data in elecdaily. What does this say about your model?
elecdaily_df <- as.data.frame(elecdaily)
names(elecdaily_df)

## [1] "Demand"      "WorkDay"     "Temperature"

elecdaily_df$WorkDay <- as.factor(elecdaily_df$WorkDay)


elecdaily_df %>%
ggplot(aes(x=Temperature, y=Demand)) +
 geom_point(aes(col=WorkDay),size=2)+
    labs(title=" Demand vs Temperature", y="Demand", x="Temperature")

#It looks like demand is more on weekdays(1) compared to weekends(0). Before we include only one independent variable. The above plot suggest us to use workday as well in our analysis

Question4

The data set fancy concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

#- Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
autoplot(fancy)

seasonplot(fancy)

ggsubseriesplot(fancy, main=NULL)

#From the first plot, there is a positive trend 
#From the second plot, there is seasonality as well. The sales is more at December month compared to other month through out every year.

# Explain why it is necessary to take logarithms of these data before fitting a model.
autoplot(fancy)

autoplot(BoxCox(fancy, 0))

#The variance in sale is increasing exponentially over time. Inorder to make the forecasting simpler, we have used logarithmic transformation.

#Fit a regression model to the logarithms of these sales data with a linear trend and seasonal variables.

fit.fancy <- tslm(BoxCox(fancy, 0)~trend+season)
summary(fit.fancy)

## 
## Call:
## tslm(formula = BoxCox(fancy, 0) ~ trend + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41644 -0.12619  0.00608  0.11389  0.38567 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.6058604  0.0768740  98.939  < 2e-16 ***
## trend       0.0223930  0.0008448  26.508  < 2e-16 ***
## season2     0.2510437  0.0993278   2.527 0.013718 *  
## season3     0.6952066  0.0993386   6.998 1.18e-09 ***
## season4     0.3829341  0.0993565   3.854 0.000252 ***
## season5     0.4079944  0.0993817   4.105 0.000106 ***
## season6     0.4469625  0.0994140   4.496 2.63e-05 ***
## season7     0.6082156  0.0994534   6.116 4.69e-08 ***
## season8     0.5853524  0.0995001   5.883 1.21e-07 ***
## season9     0.6663446  0.0995538   6.693 4.27e-09 ***
## season10    0.7440336  0.0996148   7.469 1.61e-10 ***
## season11    1.2030164  0.0996828  12.068  < 2e-16 ***
## season12    1.9581366  0.0997579  19.629  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1858 on 71 degrees of freedom
## Multiple R-squared:  0.9527, Adjusted R-squared:  0.9447 
## F-statistic: 119.1 on 12 and 71 DF,  p-value: < 2.2e-16

autoplot(BoxCox(fancy, 0), series = "Data" ) + 
  autolayer(fitted(fit.fancy), series = "Fitted") +
  ggtitle("Sales for a souvenir shop") +
  xlab("Month") + ylab("Log(Sales)")

#Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
checkresiduals(fit.fancy)

## 
##  Breusch-Godfrey test for serial correlation of order up to 16
## 
## data:  Residuals from Linear regression model
## LM test = 37.152, df = 16, p-value = 0.001996

autoplot(fit.fancy$residuals)

cbind(Fitted = fitted(fit.fancy),
      Residuals = residuals(fit.fancy)) %>%
  as.data.frame() %>%
  ggplot(aes(x = Fitted, y = Residuals)) + 
  geom_point()

#The residual plot shows a pattern over time.
#The acf plot has also concluded  collinearity . And this two reason stress that this data maynot be good for time series.Or else we have to improve or use different forecating methods to improve those properties.

#What do the values of the coefficients tell you about each variable?
summary(fit.fancy)

## 
## Call:
## tslm(formula = BoxCox(fancy, 0) ~ trend + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41644 -0.12619  0.00608  0.11389  0.38567 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.6058604  0.0768740  98.939  < 2e-16 ***
## trend       0.0223930  0.0008448  26.508  < 2e-16 ***
## season2     0.2510437  0.0993278   2.527 0.013718 *  
## season3     0.6952066  0.0993386   6.998 1.18e-09 ***
## season4     0.3829341  0.0993565   3.854 0.000252 ***
## season5     0.4079944  0.0993817   4.105 0.000106 ***
## season6     0.4469625  0.0994140   4.496 2.63e-05 ***
## season7     0.6082156  0.0994534   6.116 4.69e-08 ***
## season8     0.5853524  0.0995001   5.883 1.21e-07 ***
## season9     0.6663446  0.0995538   6.693 4.27e-09 ***
## season10    0.7440336  0.0996148   7.469 1.61e-10 ***
## season11    1.2030164  0.0996828  12.068  < 2e-16 ***
## season12    1.9581366  0.0997579  19.629  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1858 on 71 degrees of freedom
## Multiple R-squared:  0.9527, Adjusted R-squared:  0.9447 
## F-statistic: 119.1 on 12 and 71 DF,  p-value: < 2.2e-16

#All of the seasonal coefficients are positive, it has a maximum values in Nov and Dec indicating more sales in those months. In the first two quarters, Sale happened in March is high.

#Use your regression model to predict the monthly sales for 1994, 1995, and 1996
prediction.fancy <- forecast(fit.fancy, h = 36)

autoplot(prediction.fancy, series = "Prediction") + ggtitle({"Sales for a souvenir shop"}) + 
  xlab("Month") + ylab("Log(Sales)")

#Transform your predictions and intervals to obtain predictions and intervals for the raw data.
forecast.raw <- forecast(fit.fancy, h = 36)
print(forecast.raw)

##          Point Forecast     Lo 80     Hi 80     Lo 95     Hi 95
## Jan 1994       9.509264  9.246995  9.771532  9.105002  9.913525
## Feb 1994       9.782700  9.520432 10.044969  9.378439 10.186962
## Mar 1994      10.249256  9.986988 10.511525  9.844995 10.653518
## Apr 1994       9.959377  9.697108 10.221645  9.555115 10.363638
## May 1994      10.006830  9.744562 10.269098  9.602569 10.411091
## Jun 1994      10.068191  9.805923 10.330459  9.663930 10.472452
## Jul 1994      10.251837  9.989569 10.514105  9.847576 10.656099
## Aug 1994      10.251367  9.989099 10.513635  9.847106 10.655628
## Sep 1994      10.354752 10.092484 10.617020  9.950491 10.759014
## Oct 1994      10.454834 10.192566 10.717102 10.050573 10.859095
## Nov 1994      10.936210 10.673942 11.198478 10.531949 11.340471
## Dec 1994      11.713723 11.451455 11.975991 11.309462 12.117984
## Jan 1995       9.777979  9.512777 10.043182  9.369195 10.186763
## Feb 1995      10.051416  9.786214 10.316618  9.642632 10.460200
## Mar 1995      10.517972 10.252770 10.783174 10.109188 10.926756
## Apr 1995      10.228092  9.962890 10.493295  9.819308 10.636876
## May 1995      10.275546 10.010343 10.540748  9.866762 10.684330
## Jun 1995      10.336907 10.071704 10.602109  9.928123 10.745691
## Jul 1995      10.520553 10.255351 10.785755 10.111769 10.929337
## Aug 1995      10.520083 10.254880 10.785285 10.111299 10.928867
## Sep 1995      10.623468 10.358266 10.888670 10.214684 11.032252
## Oct 1995      10.723550 10.458347 10.988752 10.314766 11.132334
## Nov 1995      11.204926 10.939723 11.470128 10.796142 11.613710
## Dec 1995      11.982439 11.717236 12.247641 11.573655 12.391223
## Jan 1996      10.046695  9.777950 10.315440  9.632451 10.460940
## Feb 1996      10.320132 10.051387 10.588877  9.905887 10.734376
## Mar 1996      10.786688 10.517943 11.055433 10.372443 11.200932
## Apr 1996      10.496808 10.228063 10.765553 10.082564 10.911053
## May 1996      10.544261 10.275516 10.813007 10.130017 10.958506
## Jun 1996      10.605623 10.336878 10.874368 10.191378 11.019867
## Jul 1996      10.789269 10.520524 11.058014 10.375024 11.203513
## Aug 1996      10.788798 10.520053 11.057543 10.374554 11.203043
## Sep 1996      10.892184 10.623439 11.160929 10.477939 11.306428
## Oct 1996      10.992266 10.723521 11.261011 10.578021 11.406510
## Nov 1996      11.473641 11.204896 11.742386 11.059397 11.887886
## Dec 1996      12.251155 11.982409 12.519900 11.836910 12.665399

#What are your recommendations for improving these forecasts by modifying the model?
#To improve the forecast, we can use different models, or use new independent variables to get new insights to the data.

DSCI-FXP-SP19- Assignment #3 Time Series Regression Models Nidhin

Nidhin

20 February 2019

Question 1

Question 2

Question 3

Question4