Chapter 5 - Multiple Regression

Question 1
The data below (data set fancy) concern the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

library(fma)
## Loading required package: forecast
library(fpp)
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: tseries
data(fancy)
dim(fancy)
## NULL
head(fancy)
##          Jan     Feb     Mar     Apr     May     Jun
## 1987 1664.81 2397.53 2840.71 3547.29 3752.96 3714.74

(a) Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.

plot(fancy)

We see a prominent seasonal pattern in the data. There is a peculiar spike in sales in the peak Christmas season (infact Nov and Dec followed by a dip in Jan) and another smaller bump in sales during March (as mentioned in the question during Surfing festival barring 1987). These fluctuations seem to increase with the level of the time series indicating a positive upward trend. Since the magnitude of pattern increases with the level, it is imperative to consider log transformation of this variable to make the pattern and its variance stable.

(b) Explain why it is necessary to take logarithms of these data before fitting a model.
Answer -
The increasing seasonal fluctuations in the data make it is necessary to take logarithms in order to get an additive model and to stabilize the variance so it does not increase with the level of the series.

plot(log(fancy))

The relationship now appears more linear. This means that fitting a linear model to the transformed data might lead to better predicting accuracy.

(c) Use R to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.

log.fancy <- log(fancy)
dummy.surf = rep(0, length(fancy))
dummy.surf[seq_along(dummy.surf)%%12 == 3] <- 1
dummy.surf[3] <- 0
dummy.surf <- ts(dummy.surf, freq = 12, start=c(1987,1))
log.fancy.df <- data.frame(log.fancy,  dummy.surf)

fit <- tslm(log.fancy ~ trend + season + dummy.surf, data=log.fancy.df)

future.surf <- data.frame(dummy.surf = rep(0, 12))
future.surf[3,] <- 1
forecast(fit, newdata=future.surf)
##          Point Forecast     Lo 80     Hi 80     Lo 95    Hi 95
## Jan 1994       9.491352  9.238522  9.744183  9.101594  9.88111
## Feb 1994       9.764789  9.511959 10.017620  9.375031 10.15455
## Mar 1994      10.302990 10.048860 10.557120  9.911228 10.69475
## Apr 1994       9.941465  9.688635 10.194296  9.551707 10.33122
## May 1994       9.988919  9.736088 10.241749  9.599161 10.37868
## Jun 1994      10.050280  9.797449 10.303110  9.660522 10.44004
## Jul 1994      10.233926  9.981095 10.486756  9.844168 10.62368
## Aug 1994      10.233456  9.980625 10.486286  9.843698 10.62321
## Sep 1994      10.336841 10.084010 10.589671  9.947083 10.72660
## Oct 1994      10.436923 10.184092 10.689753 10.047165 10.82668
## Nov 1994      10.918299 10.665468 11.171129 10.528541 11.30806
## Dec 1994      11.695812 11.442981 11.948642 11.306054 12.08557

(d) Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?

# Plot of the residuals against time
plot(residuals(fit), type='p')
abline(h = 0, col="grey")

The residuals seem to be serially correlated indicating nonlinearity in the trend. The residuals plotted against time vary from -0.03 to 0.03. Particularly, there seems to be a trend for the residuals to increase from 1991 to 1994, which may indicate missing some relationship in the model. The residuals appear random prior to this.

# Plot of the residuals against the fitted values
plot(as.numeric(fitted(fit)), residuals(fit), type='p')

The residuals plotted against the fitted values show no pattern and vary from -0.03 to 0.03. Such a plot shows unbiased and homoscedastic residuals.

(e) Do boxplots of the residuals for each month. Does this reveal any problems with the model?

boxplot(resid(fit) ~ cycle(resid(fit)))

The residuals for the months of Fall i.e. August, September and October show greater variance than the other months, which suggests that our model may not be capturing some information relevant to this time period. This may also mean heteroscedascity.

(f) What do the values of the coefficients tell you about each variable?

summary(fit)
## 
## Call:
## tslm(formula = log.fancy ~ trend + season + dummy.surf, data = log.fancy.df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33673 -0.12757  0.00257  0.10911  0.37671 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.6196670  0.0742471 102.626  < 2e-16 ***
## trend       0.0220198  0.0008268  26.634  < 2e-16 ***
## season2     0.2514168  0.0956790   2.628 0.010555 *  
## season3     0.2660828  0.1934044   1.376 0.173275    
## season4     0.3840535  0.0957075   4.013 0.000148 ***
## season5     0.4094870  0.0957325   4.277 5.88e-05 ***
## season6     0.4488283  0.0957647   4.687 1.33e-05 ***
## season7     0.6104545  0.0958039   6.372 1.71e-08 ***
## season8     0.5879644  0.0958503   6.134 4.53e-08 ***
## season9     0.6693299  0.0959037   6.979 1.36e-09 ***
## season10    0.7473919  0.0959643   7.788 4.48e-11 ***
## season11    1.2067479  0.0960319  12.566  < 2e-16 ***
## season12    1.9622412  0.0961066  20.417  < 2e-16 ***
## dummy.surf  0.5015151  0.1964273   2.553 0.012856 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared:  0.9567, Adjusted R-squared:  0.9487 
## F-statistic:   119 on 13 and 70 DF,  p-value: < 2.2e-16

The value of the coefficients show how much the:

(g) What does the Durbin-Watson statistic tell you about your model?

library(lmtest)
dwtest(fit, alt="two.sided")
## 
##  Durbin-Watson test
## 
## data:  fit
## DW = 0.88889, p-value = 1.956e-07
## alternative hypothesis: true autocorrelation is not 0

For DW Test,
Ho: The error terms are not autocorrelated or aut-correlation=0.
Ha: The error terms are positively autocorrelated.

The Durbin-Watson test with a small p-value and dw statistic = 0.89 (<2) shows that there is some autocorrelation remaining in the residuals. This means there is some information remaining in the residuals that can be exploited to obtain better forecasts.

(h) Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.

future.surf <- data.frame(dummy.surf = rep(0, 36))
preds <- forecast(fit, newdata=future.surf)

(i) Transform your predictions and intervals to obtain predictions and intervals for the raw data.

preds.tr <- exp(data.frame(preds))

(j) How could you improve these predictions by modifying the model?

par(mfrow=c(1,2))
plot(log.fancy,xlab="Time",ylab="Log Sales", main="Fancy Store History")
plot(preds$mean,xlab="Time",ylab="Log Sales", main="Fancy Store Forecast")

frcst<-ts(preds.tr$Point.Forecast,freq=12,start=c(1994,1))

par(mfrow=c(1,2))
plot(fancy,xlab="Time",ylab="Sales", main="Fancy Store History")
plot(frcst,xlab="Time",ylab="Sales", main="Fancy Store Forecast")

We may seem to be overfitting here. Seasonality seems to be covered well by the forecasts, however trend seems bit too high as we see the very high total sales number(y axis of the two graphs above). We may need to look at adding more variables and find a better fitting model. Or fixing the non-linearity of trend may also help.


Question 2
The data below (data set texasgas) shows the demand for natural gas and the price of natural gas for 20 towns in Texas in 1969.

library(fma)
library(fpp)
library(segmented)
## Warning: package 'segmented' was built under R version 3.4.3
tg <- (texasgas)
dim(tg)
## [1] 20  2
head(tg)
##   price consumption
## 1    30         134
## 2    31         112
## 3    37         136
## 4    42         109
## 5    43         105
## 6    45          87

(a) Do a scatterplot of consumption against price (3 possible nonlinear models for the data are given in the question). The second model divides the data into two sections, depending on whether the price is above or below 60 cents per 1,000 cubic feet.

plot(tg$price,tg$consumption,  ylab="Consumption",xlab="Price")

The relationship in consumption and price is not exactly linear.

(b) Can you explain why the slope of the fitted line should change with P?
Answer -
As we see in the plot above, there is always inverse relationship between consumption and price however the price sensitivity (given by price elasticity or the slope of the straight line going through these points) is less sharp at prices above 60 cents per 1000 cu.f.
At prices below 60 cents per 1000 cu.f. the slope/ price elasticity has sharp slope. Hence we may conclude that this datset has a piece wise linear relationship.

(c) Fit the three models and find the coefficients, and residual variance in each case. For the second model, the parameters a1, a2, b1, b2 can be estimated by simply fitting a regression with four variations.

# First model - Simple linear regression
model1 <- lm(consumption ~ exp(price), tg)

# Second model - Piecewise linear regression
lin.mod <- lm(consumption ~ price, tg)
model2 <- segmented(lin.mod, seg.Z = ~price, psi=60)
# slope(model2)   gives the piece wise slopes.

# Third model - polynomial regression
model3 <- lm(consumption ~ poly(price, 2), tg)

(d) For each model, find the value of R2 and AIC, and produce a residual plot. Comment on the adequacy of the three models.

rsq<-c(summary(model1)$r.squared,summary(model2)$r.squared,summary(model3)$r.squared)
AIC<-c(AIC(model1), AIC(model2), AIC(model3))
model<-c("model1","model2","model3")

results<-cbind(model=model,rsq=as.numeric(rsq),AIC=as.numeric(AIC))
results<-cbind(model,rsq,AIC)
results
##      model    rsq                 AIC               
## [1,] "model1" "0.048639097397276" "200.736330488091"
## [2,] "model2" "0.871119231334424" "164.756214777845"
## [3,] "model3" "0.831511206313188" "168.115845542328"
# Plots
par(mfrow=c(1,3))
plot(model1$fitted.values, residuals(model1), ylab='residuals', xlab='fitted values',
     main='linear regression')
abline(0,0)

plot(model2$fitted.values, residuals(model2), ylab='residuals', xlab='fitted values',
     main='piecewise linear regression')
abline(0,0)

plot(model3$fitted.values, residuals(model3), ylab='residuals', xlab='fitted values', 
     main='polynomial linear regression')
abline(0,0)

Polynomial model of order 2 performs much better than the simple linear regression. But a piecewise linear regression is further more accurate both in terms of better R2 and lower AIC.

Also, the residual plots indicate presence of heteroskedasticity.

(e) For prices 40, 60, 80, 100, and 120 cents per 1,000 cubic feet, compute the forecasted per capita demand using the best model of the three above.

df<-data.frame(price=c(40,60,80,100,120))
predict(model2,newdata=df)
##         1         2         3         4         5 
## 104.53618  53.34514  47.19593  41.04673  34.89752

(f) Compute 95% prediction intervals. Make a graph of these prediction intervals and discuss their interpretation.

newx <- seq(min(df), max(df), length.out=5)
intervals <- predict(model2, df, interval="predict") 

plot(consumption ~ price, data = tg, type = 'n')
#plot(consumption ~ price, data = tg, type = 'l')
polygon(c(rev(newx), newx), c(rev(intervals[ ,3]), intervals[ ,2]), col = 'lightgrey', border = NA)

# plot(tg$price, tg$consumption)
# lines(model2, col = 'black', lwd = 2)
# lines(intervals[,2], col = 'blue', lty = 2, lwd = 2)
# lines(intervals[,3], col = 'blue', lty = 2, lwd = 2)

library(ggplot2)
ggplot(tg, aes(x=price, y=consumption)) + 
  geom_point(color='#2980B9', size = 4) + 
  geom_smooth(method=lm, color='#2C3E50')

We can see here that the prediction intervals are fairly wide meaning we only have a limited idea of how much energy will be demanded given the price information for future. We may in future use more predictor variables to improve this prediction model.

(g) What is the correlation between P and P2? Does this suggest any general problem to be considered in dealing with polynomial regressions—especially of higher orders?

plot(tg$price,(tg$price)^2)


Chapter 6 - Multiple Regression

Question 1
Show that a 3×5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.
Answer -
A 3X5 MA is when a 3-MA follows a moving of a 5-MA series. It is also called a “centered moving average of order 5”. This is because the results are now symmetric.

In detailed steps, 1. If we have 7 terms in time series. 2. 5-MA : _ , _ , (Y1+Y2+Y3+Y4+Y5)/5, (Y2+Y3+Y4+Y5+Y6)/5, (Y3+Y4+Y5+Y6+Y7)/5, _ , _ 3. 3-MA : _ , _ , _ ,((Y1+Y2+Y3+Y4+Y5)/5 + (Y2+Y3+Y4+Y5+Y6)/5 + (Y3+Y4+Y5+Y6+Y7)/5)/3 , _ , _ , _

= \(1/15 (Y1)\) \(2/15 (Y2)\) \(3/15 (Y3)\) \(3/15 (Y4)\) \(3/15 (Y5)\) \(2/15 (Y6)\) \(1/15 (Y7)\)

= \(0.067\) \(0.133\) \(0.2\) \(0.2\) \(0.2\) \(0.133\) \(0.067\)


Question 2
The data below represent the monthly sales (in thousands) of product A for a plastics manufacturer for years 1 through 5 (data set plastics).

(a) Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend?

pl<-(plastics)
plot(pl,xlab="Yr",ylab="Sales", main="Product A Sales")

We can see presence of seasonality as well as trend in the sales of plastic A. Every summer observes sales peak followed by troughs in winters and vice versa. The level of sales increases yearly as well.

(b) Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.

model<-decompose(pl, type="multiplicative")

tr<-model$trend
seas<- model$seasonal

(c) Do the results support the graphical interpretation from part (a)?
Answer -
The model fit results also concur with the plot above such that there are seasonal peaks in the summer and some fall months.

(d) Compute and plot the seasonally adjusted data.

plot(seasadj(model))

(e) Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

pl[60]=pl[60]+500
model<-decompose(pl, type="multiplicative")

tr<-model$trend
seas<- model$seasonal

plot(seasadj(model))

(f) Does it make any difference if the outlier is near the end rather than in the middle of the time series?

pl[30]=pl[30]+500
model<-decompose(pl, type="multiplicative")

tr<-model$trend
seas<- model$seasonal

plot(seasadj(model))

We can conclude that an outlier causes a spike in the month it is present by increasing seasonality index of that month and hence the deseasonalized data for that month observes a spike.

(g) Use a random walk with drift to produce forecasts of the seasonally adjusted data.

rwf<-rwf(seasadj(model), drift=TRUE)

(h) Reseasonalize the results to give forecasts on the original scale.

m <- stl(pl, t.window=15, s.window="periodic", robust=TRUE)
frcst <- forecast(m, method="naive")

Question 3
Figure 6.13 shows the result of decomposing the number of persons in the civilian labor force in Australia each month from February 1978 to August 1995.

(a) Write about 3–5 sentences describing the results of the seasonal adjustment. Pay particular attention to the scales of the graphs in making your interpretation.
Answer -
Isolating the trend and seasonal component from the overall timeseries shows that the trend has increased throught the majority of the time frame, with a few stationary periods occuring in the early 90s. The monthly breakdown of the seasonal component shows that a few months esp March and December show greater peaks in the labor force than other months.

(b) Is the recession of 1991/1992 visible in the estimated components?
Answer -
Unusual peaks in labor force availability throughout the 1991-92 time period in the residuals or remainder graph is indicative of recession.