5.1 The data below (data set fancy) concern the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.
## Jan Feb Mar Apr May Jun Jul
## 1987 1664.81 2397.53 2840.71 3547.29 3752.96 3714.74 4349.61
## 1988 2499.81 5198.24 7225.14 4806.03 5900.88 4951.34 6179.12
## 1989 4717.02 5702.63 9957.58 5304.78 6492.43 6630.80 7349.62
## 1990 5921.10 5814.58 12421.25 6369.77 7609.12 7224.75 8121.22
## 1991 4826.64 6470.23 9638.77 8821.17 8722.37 10209.48 11276.55
## 1992 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## 1993 10243.24 11266.88 21826.84 17357.33 15997.79 18601.53 26155.15
## Aug Sep Oct Nov Dec
## 1987 3566.34 5021.82 6423.48 7600.60 19756.21
## 1988 4752.15 5496.43 5835.10 12600.08 28541.72
## 1989 8176.62 8573.17 9690.50 15151.84 34061.01
## 1990 7979.25 8093.06 8476.70 17914.66 30114.41
## 1991 12552.22 11637.39 13606.89 21822.11 45060.69
## 1992 19888.61 23933.38 25391.35 36024.80 80721.71
## 1993 28586.52 30505.41 30821.33 46634.38 104660.67
5.1a Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
plot(fancy)
As identified from the above information, there is a surge in monthly sales every March and December. After a visual inspection of the plot, there seems to be a gradual growth in the spike in sales during these peak months with the exception of 1991. Overall, excluding the peeks, there is a growth trend. It should also be noted that Australia in December is their spring season with January, February and March being their summer months. The 1991 fluctuation is interesting - from the plot, there doesn’t seem to be any March surge. Was the surfing festival held in March of 1991?
5.1b Explain why it is necessary to take logarithms of these data before fitting a model.
fancy.numeric <- as.numeric(fancy)
hist(fancy.numeric)
Because of the surges, the data is very right skewed. Taking the logarithm of this data can push the data to a more normal distribution and in turn allow for a better interpenetration of the bulk of the data. A normal distribution also helps in the interpretation of residuals once the model is created. The below histogram shows the distribution of the data after the log transformation. Note: Performing this transformation requires that all observations in the dataset are greater than zero. If there are observations of zero, the data would need to be scaled up by one.
fancy.log <- log(fancy.numeric)
hist(fancy.log)
5.1c Use R to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.
# Log Transform Data
fancy.log <- log(fancy)
# Create Festival Dummy Variable - Note: Skip first year.
d_festival <- rep(0, length(fancy))
d_festival[seq_along(d_festival)%%12 == 3] <- 1
d_festival[3] <- 0
# Make Dummy Variable a Timeseries
d_festival <- ts(d_festival, freq = 12, start=c(1987,1))
fancydata <- data.frame(fancy.log, d_festival)
# Create Model
fancyfit <- tslm(fancy.log ~ trend + season + d_festival, data = fancydata)
5.1d Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?
res <- residuals(fancyfit)
fit_vector <- as.vector(fitted(fancyfit))
fit_residuals <- as.vector(residuals(fancyfit))
plot(res, xlab = "Year", ylab = "Residuals")
plot(fit_vector, fit_residuals, xlab = "Predicted scores", ylab = "Residuals")
par(mfrow=c(2,2))
plot(fancyfit)
The residuals against time show an interesting phenomenon in 1991 - I think more investigation is needed for that year in that specific location. The residuals against the fitted values show no underlying pattern and look to be at random.
5.1e Do box plots of the residuals for each month. Does this reveal any problems with the model?
res.month <- data.frame(res, month=rep(1:12,7))
boxplot(res ~ month, data=res.month)
April, June and July seem to be the months with the least volatility. The one problem that does show through in this plot is the potential for the residuals to have some sort of cyclical pattern to them - which means we have missing variables or data that could explain the system.
5.1f What do the values of the coefficients tell you about each variable?
summary(fancyfit)
##
## Call:
## tslm(formula = fancy.log ~ trend + season + d_festival, data = fancydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33673 -0.12757 0.00257 0.10911 0.37671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6196670 0.0742471 102.626 < 0.0000000000000002 ***
## trend 0.0220198 0.0008268 26.634 < 0.0000000000000002 ***
## season2 0.2514168 0.0956790 2.628 0.010555 *
## season3 0.2660828 0.1934044 1.376 0.173275
## season4 0.3840535 0.0957075 4.013 0.000148 ***
## season5 0.4094870 0.0957325 4.277 0.0000588067270 ***
## season6 0.4488283 0.0957647 4.687 0.0000132668325 ***
## season7 0.6104545 0.0958039 6.372 0.0000000170771 ***
## season8 0.5879644 0.0958503 6.134 0.0000000453365 ***
## season9 0.6693299 0.0959037 6.979 0.0000000013630 ***
## season10 0.7473919 0.0959643 7.788 0.0000000000448 ***
## season11 1.2067479 0.0960319 12.566 < 0.0000000000000002 ***
## season12 1.9622412 0.0961066 20.417 < 0.0000000000000002 ***
## d_festival 0.5015151 0.1964273 2.553 0.012856 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared: 0.9567, Adjusted R-squared: 0.9487
## F-statistic: 119 on 13 and 70 DF, p-value: < 0.00000000000000022
Season2, Season3 and the dummy festival variable are the weakest in terms of p-value. Since we transformed the response variable by taking its natural log, the coefficients cannot be interpreted in the normal sense of a simple product of the coefficient and the feature. However, we can take into consideration direction and magnitude. All coefficients are positive, which makes sense given the growth trend shown in the plot. Season12 (December) is showing the highest magnitude and strongest p-value, which also makes sense given the peak influx in December.
5.1g What does the Durbin-Watson statistic tell you about your model?
library(lmtest)
dwtest(fancyfit, alt="two.sided")
##
## Durbin-Watson test
##
## data: fancyfit
## DW = 0.88889, p-value = 0.0000001956
## alternative hypothesis: true autocorrelation is not 0
The Durbin-Watson test is showing that there is auto correlation in the model’s residuals. This means that the model can still be improved. More investigation of year 1991 should be conducted, because there could be potential for outliers in this year.
5.1h Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.
# Assuming the March festival continues
d_festival = rep(0, 36)
d_festival[seq_along(d_festival)%%12 == 3] <- 1
future_fancydata <- data.frame(d_festival)
fcst <- forecast(fancyfit, newdata=future_fancydata)
fcst
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1994 9.491352 9.238522 9.744183 9.101594 9.88111
## Feb 1994 9.764789 9.511959 10.017620 9.375031 10.15455
## Mar 1994 10.302990 10.048860 10.557120 9.911228 10.69475
## Apr 1994 9.941465 9.688635 10.194296 9.551707 10.33122
## May 1994 9.988919 9.736088 10.241749 9.599161 10.37868
## Jun 1994 10.050280 9.797449 10.303110 9.660522 10.44004
## Jul 1994 10.233926 9.981095 10.486756 9.844168 10.62368
## Aug 1994 10.233456 9.980625 10.486286 9.843698 10.62321
## Sep 1994 10.336841 10.084010 10.589671 9.947083 10.72660
## Oct 1994 10.436923 10.184092 10.689753 10.047165 10.82668
## Nov 1994 10.918299 10.665468 11.171129 10.528541 11.30806
## Dec 1994 11.695812 11.442981 11.948642 11.306054 12.08557
## Jan 1995 9.755590 9.499844 10.011336 9.361338 10.14984
## Feb 1995 10.029027 9.773281 10.284773 9.634775 10.42328
## Mar 1995 10.567228 10.310518 10.823938 10.171489 10.96297
## Apr 1995 10.205703 9.949957 10.461449 9.811451 10.59996
## May 1995 10.253157 9.997411 10.508903 9.858904 10.64741
## Jun 1995 10.314518 10.058772 10.570264 9.920265 10.70877
## Jul 1995 10.498164 10.242418 10.753910 10.103911 10.89242
## Aug 1995 10.497694 10.241948 10.753440 10.103441 10.89195
## Sep 1995 10.601079 10.345333 10.856825 10.206826 10.99533
## Oct 1995 10.701161 10.445415 10.956907 10.306908 11.09541
## Nov 1995 11.182537 10.926791 11.438282 10.788284 11.57679
## Dec 1995 11.960050 11.704304 12.215796 11.565797 12.35430
## Jan 1996 10.019828 9.760564 10.279093 9.620151 10.41951
## Feb 1996 10.293265 10.034000 10.552530 9.893588 10.69294
## Mar 1996 10.831466 10.571566 11.091365 10.430810 11.23212
## Apr 1996 10.469941 10.210677 10.729206 10.070264 10.86962
## May 1996 10.517395 10.258130 10.776659 10.117718 10.91707
## Jun 1996 10.578756 10.319491 10.838021 10.179079 10.97843
## Jul 1996 10.762402 10.503137 11.021667 10.362725 11.16208
## Aug 1996 10.761932 10.502667 11.021196 10.362254 11.16161
## Sep 1996 10.865317 10.606052 11.124582 10.465640 11.26499
## Oct 1996 10.965399 10.706134 11.224664 10.565722 11.36508
## Nov 1996 11.446774 11.187510 11.706039 11.047097 11.84645
## Dec 1996 12.224288 11.965023 12.483552 11.824611 12.62396
5.1i Transform your predictions and intervals to obtain predictions and intervals for the raw data.
fcst$mean <- exp(fcst$mean)
fcst$upper <- exp(fcst$upper)
fcst$lower <- exp(fcst$lower)
fcst$x <- exp(fcst$x)
fcst
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1994 13244.70 10285.82 17054.73 8969.583 19557.43
## Feb 1994 17409.81 13520.45 22418.00 11790.284 25707.73
## Mar 1994 29821.65 23129.40 38450.24 20155.412 44123.68
## Apr 1994 20774.16 16133.21 26750.16 14068.696 30675.62
## May 1994 21783.73 16917.24 28050.15 14752.395 32166.37
## Jun 1994 23162.27 17987.81 29825.24 15685.969 34201.95
## Jul 1994 27831.56 21613.98 35837.72 18848.111 41096.73
## Aug 1994 27818.48 21603.82 35820.87 18839.249 41077.41
## Sep 1994 30848.42 23956.87 39722.43 20891.193 45551.50
## Oct 1994 34095.57 26478.61 43903.67 23090.230 50346.32
## Nov 1994 55176.84 42850.31 71049.28 37366.903 81475.41
## Dec 1994 120067.79 93244.59 154607.08 81312.400 177294.90
## Jan 1995 17250.40 13357.65 22277.59 11629.938 25587.08
## Feb 1995 22675.20 17558.28 29283.31 15287.252 33633.55
## Mar 1995 38840.85 30046.98 50208.44 26146.972 57697.39
## Apr 1995 27057.06 20951.33 34942.16 18241.435 40133.06
## May 1995 28371.96 21969.51 36640.25 19127.918 42083.42
## Jun 1995 30167.42 23359.80 38958.95 20338.387 44746.58
## Jul 1995 36248.88 28068.91 46812.70 24438.412 53767.06
## Aug 1995 36231.84 28055.72 46790.69 24426.922 53741.78
## Sep 1995 40178.16 31111.50 51887.06 27087.467 59595.26
## Oct 1995 44407.37 34386.35 57348.77 29938.733 65868.34
## Nov 1995 71864.42 55647.40 92807.48 48449.831 106594.69
## Dec 1995 156380.86 121091.75 201954.08 105429.448 231955.81
## Jan 1996 22467.57 17336.40 29117.46 15065.329 33506.86
## Feb 1996 29533.04 22788.25 38274.14 19802.984 44043.89
## Mar 1996 50587.81 39009.73 65602.25 33887.802 75517.62
## Apr 1996 35240.15 27191.96 45670.42 23629.808 52555.15
## May 1996 36952.72 28513.41 47889.88 24778.151 55109.18
## Jun 1996 39291.20 30317.82 50920.48 26346.183 58596.65
## Jul 1996 47211.93 36429.60 61185.57 31657.322 70409.18
## Aug 1996 47189.73 36412.48 61156.80 31642.439 70376.07
## Sep 1996 52329.57 40378.47 67817.91 35088.887 78041.33
## Oct 1996 57837.85 44628.77 74956.52 38782.394 86256.08
## Nov 1996 93598.96 72222.70 121302.09 62761.521 139588.15
## Dec 1996 203676.38 157160.50 263959.89 136572.460 303751.35
plot(fcst)
5.1j How could you improve these predictions by modifying the model?
I think I could improve this model by adding data to it - specifically average temperature by month. Given the location is in close proximity to the beach, temperature could drastically explain the volatility in certain months.
5.2 The data below (data set texasgas) shows the demand for natural gas and the price of natural gas for 20 towns in Texas in 1969.
## price consumption
## 1 30 134
## 2 31 112
## 3 37 136
## 4 42 109
## 5 43 105
## 6 45 87
## 7 50 56
## 8 54 43
## 9 54 77
## 10 57 35
## 11 58 65
## 12 58 56
## 13 60 58
## 14 73 55
## 15 88 49
## 16 89 39
## 17 92 36
## 18 97 46
## 19 100 40
## 20 102 42
5.2a Do a scatter plot of consumption against price. The data are clearly not linear. Three possible nonlinear models for the data are given below. The second model divides the data into two sections, depending on whether the price is above or below 60 cents per 1,000 cubic feet.
plot(texasgas$price, texasgas$consumption, xlab = "Price", ylab = "Consumption")
5.2b Can you explain why the slope of the fitted line should change with P?
The first model is taking the log of the response variable to train the model - this should make the slope the product of beta times the response. The second model is using a knot point to change the slope at specific point - in this case the knot is at 60. The third model is using a peicewise cubics rather than peicewise lines (like the second model) to smooth the slope of the regression.
5.2c Fit the three models and find the coefficients, and residual variance in each case.
model1 <- lm(I(log(consumption)) ~ price, data = texasgas)
pricep <- pmax(texasgas$price-60,0)
model2 <- lm(consumption ~ price + pricep, data = texasgas)
model3 <- lm(consumption ~ price + I(price^2), data = texasgas)
5.2d For each model, find the value of R2 and AIC, and produce a residual plot. Comment on the adequacy of the three models.
CV(model1); plot(fitted(model1), residuals(model1), xlab = "Predicted scores", ylab = "Residuals")
## CV AIC AICc BIC AdjR2
## 0.07882759 -48.30827728 -46.80827728 -45.32108046 0.63413071
CV(model2); plot(fitted(model2), residuals(model2), xlab = "Predicted scores", ylab = "Residuals")
## CV AIC AICc BIC AdjR2
## 204.3115937 108.0556267 110.7222933 112.0385558 0.8403537
CV(model3); plot(fitted(model3), residuals(model3), xlab = "Predicted scores", ylab = "Residuals")
## CV AIC AICc BIC AdjR2
## 238.565929 111.358304 114.024971 115.341233 0.811689
Model 1 would be considered the best model under the AIC and BIC metrics. Model 2 shows the highest adjusted R2, however, there is a drastic difference in AIC and BIC between Model 1 and Model 2. Model 2 and Model 3 are very similar across all metrics.
5.2e For prices 40, 60, 80, 100, and 120 cents per 1,000 cubic feet, compute the forecasted per capita demand using the best model of the three above.
price <- c(40, 60, 80, 100, 120)
newprices <- data.frame(price)
fcst <- forecast(model1, newdata = newprices); fcst
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1 4.486114 4.107821 4.864407 3.888723 5.083506
## 2 4.180255 3.810442 4.550068 3.596255 4.764255
## 3 3.874396 3.499993 4.248799 3.283147 4.465644
## 4 3.568536 3.176933 3.960140 2.950126 4.186947
## 5 3.262677 2.842810 3.682544 2.599633 3.925721
# Transform Response to Original Form
fcst$mean <- exp(fcst$mean)
fcst$upper <- exp(fcst$upper)
fcst$lower <- exp(fcst$lower)
fcst
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 1 88.77582 60.81408 129.59410 48.84846 161.33866
## 2 65.38252 45.17039 94.63886 36.46142 117.24378
## 3 48.15359 33.11521 70.02125 26.65954 86.97705
## 4 35.46465 23.97312 52.46466 19.10835 65.82155
## 5 26.11937 17.16393 39.74739 13.45880 50.68963
5.2f Compute 95% prediction intervals. Make a graph of these prediction intervals and discuss their interpretation.
library(plotrix)
plotCI(newprices$price, fcst$mean, ui=fcst$upper[,2], li=fcst$lower[,2])
It is very clear from the plot that as price increases, the prediction interval shrinks. This is telling us that as price increases the model is more confident in predicting consumption.
5.2g What is the correlation between P and P2? Does this suggest any general problem to be considered in dealing with polynomial regressions - especially of higher orders?
cor(texasgas$price, texasgas$price^2)
## [1] 0.9904481
The correlation between predictors price and price squared is very high - suggesting that there could be a general problem of multicollinearity amongst features with higher orders/powers
6.1 Show that a 3×5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.
3x5 MA is taking the average for the first five terms and then getting the moving average of each three observation set. Since a 3X5 MA will lead to a 1/15(y + 2y + 3y + 3y + 3y + 2y + y), it follows the weighted average of 1/15, 2/15, 3/15, 3/15 … etc.
6.2 The data below represent the monthly sales (in thousands) of product A for a plastics manufacturer for years 1 through 5 (data set plastics).
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1 742 697 776 898 1030 1107 1165 1216 1208 1131 971 783
## 2 741 700 774 932 1099 1223 1290 1349 1341 1296 1066 901
## 3 896 793 885 1055 1204 1326 1303 1436 1473 1453 1170 1023
## 4 951 861 938 1109 1274 1422 1486 1555 1604 1600 1403 1209
## 5 1030 1032 1126 1285 1468 1637 1611 1608 1528 1420 1119 1013
6.2a Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend?
plot(plastics)
From the plot above, there are seasonal fluctuations with a general upward growth trend. The seasons peak in roughly the middle of the year - potentially the summer months.
6.2b Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.
decompose_mult <- decompose(plastics, "multiplicative")
plot(decompose_mult)
6.2c Do the results support the graphical interpretation from part (a)?
Yes, they do.
6.2d Compute and plot the seasonally adjusted data.
plastic.adj <- decompose_mult$x / decompose_mult$seasonal
plot(plastic.adj)
6.2e Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?
plastics.outlier <- plastics
plastics.outlier[28] <- plastics.outlier[28] + 500
decompose_mult_outlier <- decompose(plastics.outlier, "multiplicative")
plastics.outlier.adj <- decompose_mult_outlier$x / decompose_mult_outlier$seasonal
plot(plastics.outlier.adj)
The effect of the outlier is significant. The outlier is apparent in the obvious peak in the time series.
6.2f Does it make any difference if the outlier is near the end rather than in the middle of the time series?
The placement of the outlier should not have any difference in effect. The basic intuition for the adjusted seasonal plot is to divide out the seasonal components of the time series. The seasonal components, however, do not take into consideration randomness or the remainder - in this situation the outlier would not fall into the seasonal component.
6.2g Use a random walk with drift to produce forecasts of the seasonally adjusted data.
fcst <- rwf(plastic.adj, h = 10, drift = TRUE)
plot(fcst)
6.2h Reseasonalize the results to give forecasts on the original scale.
fcast <- forecast(fcst, method="naive")
plot(fcast, ylab="New orders index")
6.3 Figure 6.13 shows the result of decomposing the number of persons in the civilian labor force in Australia each month from February 1978 to August 1995. (Plots Omitted)
6.3a Write about 3-5 sentences describing the results of the seasonal adjustment. Pay particular attention to the scales of the graphs in making your interpretation.
The decomposition definetivley shows an upward trend. However, since the seasonality is so close together, it is hard to notice in the non-adjusted time series - the seasonality could be mistaken for noise. With the decomposition, the remainder scale is of interest - it does show some outliers.
6.3b Is the recession of 1991/1992 visible in the estimated components?
Yes, in the remainder plot. The normal time series does show a slight hiccup at 1991, but the remainder plot does make it more obvious.