ADEC7460 Homework #1

Jordan Harrop


Chapter Five Excersises

5.1 The data below (data set fancy) concern the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff.

##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 1987   1664.81   2397.53   2840.71   3547.29   3752.96   3714.74   4349.61
## 1988   2499.81   5198.24   7225.14   4806.03   5900.88   4951.34   6179.12
## 1989   4717.02   5702.63   9957.58   5304.78   6492.43   6630.80   7349.62
## 1990   5921.10   5814.58  12421.25   6369.77   7609.12   7224.75   8121.22
## 1991   4826.64   6470.23   9638.77   8821.17   8722.37  10209.48  11276.55
## 1992   7615.03   9849.69  14558.40  11587.33   9332.56  13082.09  16732.78
## 1993  10243.24  11266.88  21826.84  17357.33  15997.79  18601.53  26155.15
##            Aug       Sep       Oct       Nov       Dec
## 1987   3566.34   5021.82   6423.48   7600.60  19756.21
## 1988   4752.15   5496.43   5835.10  12600.08  28541.72
## 1989   8176.62   8573.17   9690.50  15151.84  34061.01
## 1990   7979.25   8093.06   8476.70  17914.66  30114.41
## 1991  12552.22  11637.39  13606.89  21822.11  45060.69
## 1992  19888.61  23933.38  25391.35  36024.80  80721.71
## 1993  28586.52  30505.41  30821.33  46634.38 104660.67

5.1a Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.

plot(fancy)

As identified from the above information, there is a surge in monthly sales every March and December. After a visual inspection of the plot, there seems to be a gradual growth in the spike in sales during these peak months with the exception of 1991. Overall, excluding the peeks, there is a growth trend. It should also be noted that Australia in December is their spring season with January, February and March being their summer months. The 1991 fluctuation is interesting - from the plot, there doesn’t seem to be any March surge. Was the surfing festival held in March of 1991?

5.1b Explain why it is necessary to take logarithms of these data before fitting a model.

fancy.numeric <- as.numeric(fancy)
hist(fancy.numeric)

Because of the surges, the data is very right skewed. Taking the logarithm of this data can push the data to a more normal distribution and in turn allow for a better interpenetration of the bulk of the data. A normal distribution also helps in the interpretation of residuals once the model is created. The below histogram shows the distribution of the data after the log transformation. Note: Performing this transformation requires that all observations in the dataset are greater than zero. If there are observations of zero, the data would need to be scaled up by one.

fancy.log <- log(fancy.numeric)
hist(fancy.log)

5.1c Use R to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.

# Log Transform Data
fancy.log <- log(fancy)

# Create Festival Dummy Variable - Note: Skip first year.
d_festival <- rep(0, length(fancy))
d_festival[seq_along(d_festival)%%12 == 3] <- 1
d_festival[3] <- 0 

# Make Dummy Variable a Timeseries
d_festival <- ts(d_festival, freq = 12, start=c(1987,1))

fancydata <- data.frame(fancy.log, d_festival)

# Create Model
fancyfit <- tslm(fancy.log ~ trend + season + d_festival, data = fancydata)

5.1d Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model?

res <- residuals(fancyfit)

fit_vector <- as.vector(fitted(fancyfit))
fit_residuals <- as.vector(residuals(fancyfit))

plot(res, xlab = "Year", ylab = "Residuals")

plot(fit_vector, fit_residuals, xlab = "Predicted scores", ylab = "Residuals") 

par(mfrow=c(2,2))
plot(fancyfit)

The residuals against time show an interesting phenomenon in 1991 - I think more investigation is needed for that year in that specific location. The residuals against the fitted values show no underlying pattern and look to be at random.

5.1e Do box plots of the residuals for each month. Does this reveal any problems with the model?

res.month <- data.frame(res, month=rep(1:12,7))
boxplot(res ~ month, data=res.month)

April, June and July seem to be the months with the least volatility. The one problem that does show through in this plot is the potential for the residuals to have some sort of cyclical pattern to them - which means we have missing variables or data that could explain the system.

5.1f What do the values of the coefficients tell you about each variable?

summary(fancyfit)
## 
## Call:
## tslm(formula = fancy.log ~ trend + season + d_festival, data = fancydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33673 -0.12757  0.00257  0.10911  0.37671 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) 7.6196670  0.0742471 102.626 < 0.0000000000000002 ***
## trend       0.0220198  0.0008268  26.634 < 0.0000000000000002 ***
## season2     0.2514168  0.0956790   2.628             0.010555 *  
## season3     0.2660828  0.1934044   1.376             0.173275    
## season4     0.3840535  0.0957075   4.013             0.000148 ***
## season5     0.4094870  0.0957325   4.277      0.0000588067270 ***
## season6     0.4488283  0.0957647   4.687      0.0000132668325 ***
## season7     0.6104545  0.0958039   6.372      0.0000000170771 ***
## season8     0.5879644  0.0958503   6.134      0.0000000453365 ***
## season9     0.6693299  0.0959037   6.979      0.0000000013630 ***
## season10    0.7473919  0.0959643   7.788      0.0000000000448 ***
## season11    1.2067479  0.0960319  12.566 < 0.0000000000000002 ***
## season12    1.9622412  0.0961066  20.417 < 0.0000000000000002 ***
## d_festival  0.5015151  0.1964273   2.553             0.012856 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared:  0.9567, Adjusted R-squared:  0.9487 
## F-statistic:   119 on 13 and 70 DF,  p-value: < 0.00000000000000022

Season2, Season3 and the dummy festival variable are the weakest in terms of p-value. Since we transformed the response variable by taking its natural log, the coefficients cannot be interpreted in the normal sense of a simple product of the coefficient and the feature. However, we can take into consideration direction and magnitude. All coefficients are positive, which makes sense given the growth trend shown in the plot. Season12 (December) is showing the highest magnitude and strongest p-value, which also makes sense given the peak influx in December.

5.1g What does the Durbin-Watson statistic tell you about your model?

library(lmtest)
dwtest(fancyfit, alt="two.sided")
## 
##  Durbin-Watson test
## 
## data:  fancyfit
## DW = 0.88889, p-value = 0.0000001956
## alternative hypothesis: true autocorrelation is not 0

The Durbin-Watson test is showing that there is auto correlation in the model’s residuals. This means that the model can still be improved. More investigation of year 1991 should be conducted, because there could be potential for outliers in this year.

5.1h Regardless of your answers to the above questions, use your regression model to predict the monthly sales for 1994, 1995, and 1996. Produce prediction intervals for each of your forecasts.

# Assuming the March festival continues
d_festival = rep(0, 36)
d_festival[seq_along(d_festival)%%12 == 3] <- 1

future_fancydata <- data.frame(d_festival)

fcst <- forecast(fancyfit, newdata=future_fancydata)
fcst
##          Point Forecast     Lo 80     Hi 80     Lo 95    Hi 95
## Jan 1994       9.491352  9.238522  9.744183  9.101594  9.88111
## Feb 1994       9.764789  9.511959 10.017620  9.375031 10.15455
## Mar 1994      10.302990 10.048860 10.557120  9.911228 10.69475
## Apr 1994       9.941465  9.688635 10.194296  9.551707 10.33122
## May 1994       9.988919  9.736088 10.241749  9.599161 10.37868
## Jun 1994      10.050280  9.797449 10.303110  9.660522 10.44004
## Jul 1994      10.233926  9.981095 10.486756  9.844168 10.62368
## Aug 1994      10.233456  9.980625 10.486286  9.843698 10.62321
## Sep 1994      10.336841 10.084010 10.589671  9.947083 10.72660
## Oct 1994      10.436923 10.184092 10.689753 10.047165 10.82668
## Nov 1994      10.918299 10.665468 11.171129 10.528541 11.30806
## Dec 1994      11.695812 11.442981 11.948642 11.306054 12.08557
## Jan 1995       9.755590  9.499844 10.011336  9.361338 10.14984
## Feb 1995      10.029027  9.773281 10.284773  9.634775 10.42328
## Mar 1995      10.567228 10.310518 10.823938 10.171489 10.96297
## Apr 1995      10.205703  9.949957 10.461449  9.811451 10.59996
## May 1995      10.253157  9.997411 10.508903  9.858904 10.64741
## Jun 1995      10.314518 10.058772 10.570264  9.920265 10.70877
## Jul 1995      10.498164 10.242418 10.753910 10.103911 10.89242
## Aug 1995      10.497694 10.241948 10.753440 10.103441 10.89195
## Sep 1995      10.601079 10.345333 10.856825 10.206826 10.99533
## Oct 1995      10.701161 10.445415 10.956907 10.306908 11.09541
## Nov 1995      11.182537 10.926791 11.438282 10.788284 11.57679
## Dec 1995      11.960050 11.704304 12.215796 11.565797 12.35430
## Jan 1996      10.019828  9.760564 10.279093  9.620151 10.41951
## Feb 1996      10.293265 10.034000 10.552530  9.893588 10.69294
## Mar 1996      10.831466 10.571566 11.091365 10.430810 11.23212
## Apr 1996      10.469941 10.210677 10.729206 10.070264 10.86962
## May 1996      10.517395 10.258130 10.776659 10.117718 10.91707
## Jun 1996      10.578756 10.319491 10.838021 10.179079 10.97843
## Jul 1996      10.762402 10.503137 11.021667 10.362725 11.16208
## Aug 1996      10.761932 10.502667 11.021196 10.362254 11.16161
## Sep 1996      10.865317 10.606052 11.124582 10.465640 11.26499
## Oct 1996      10.965399 10.706134 11.224664 10.565722 11.36508
## Nov 1996      11.446774 11.187510 11.706039 11.047097 11.84645
## Dec 1996      12.224288 11.965023 12.483552 11.824611 12.62396

5.1i Transform your predictions and intervals to obtain predictions and intervals for the raw data.

fcst$mean <- exp(fcst$mean)
fcst$upper <- exp(fcst$upper)
fcst$lower <- exp(fcst$lower)
fcst$x <- exp(fcst$x)
fcst
##          Point Forecast     Lo 80     Hi 80      Lo 95     Hi 95
## Jan 1994       13244.70  10285.82  17054.73   8969.583  19557.43
## Feb 1994       17409.81  13520.45  22418.00  11790.284  25707.73
## Mar 1994       29821.65  23129.40  38450.24  20155.412  44123.68
## Apr 1994       20774.16  16133.21  26750.16  14068.696  30675.62
## May 1994       21783.73  16917.24  28050.15  14752.395  32166.37
## Jun 1994       23162.27  17987.81  29825.24  15685.969  34201.95
## Jul 1994       27831.56  21613.98  35837.72  18848.111  41096.73
## Aug 1994       27818.48  21603.82  35820.87  18839.249  41077.41
## Sep 1994       30848.42  23956.87  39722.43  20891.193  45551.50
## Oct 1994       34095.57  26478.61  43903.67  23090.230  50346.32
## Nov 1994       55176.84  42850.31  71049.28  37366.903  81475.41
## Dec 1994      120067.79  93244.59 154607.08  81312.400 177294.90
## Jan 1995       17250.40  13357.65  22277.59  11629.938  25587.08
## Feb 1995       22675.20  17558.28  29283.31  15287.252  33633.55
## Mar 1995       38840.85  30046.98  50208.44  26146.972  57697.39
## Apr 1995       27057.06  20951.33  34942.16  18241.435  40133.06
## May 1995       28371.96  21969.51  36640.25  19127.918  42083.42
## Jun 1995       30167.42  23359.80  38958.95  20338.387  44746.58
## Jul 1995       36248.88  28068.91  46812.70  24438.412  53767.06
## Aug 1995       36231.84  28055.72  46790.69  24426.922  53741.78
## Sep 1995       40178.16  31111.50  51887.06  27087.467  59595.26
## Oct 1995       44407.37  34386.35  57348.77  29938.733  65868.34
## Nov 1995       71864.42  55647.40  92807.48  48449.831 106594.69
## Dec 1995      156380.86 121091.75 201954.08 105429.448 231955.81
## Jan 1996       22467.57  17336.40  29117.46  15065.329  33506.86
## Feb 1996       29533.04  22788.25  38274.14  19802.984  44043.89
## Mar 1996       50587.81  39009.73  65602.25  33887.802  75517.62
## Apr 1996       35240.15  27191.96  45670.42  23629.808  52555.15
## May 1996       36952.72  28513.41  47889.88  24778.151  55109.18
## Jun 1996       39291.20  30317.82  50920.48  26346.183  58596.65
## Jul 1996       47211.93  36429.60  61185.57  31657.322  70409.18
## Aug 1996       47189.73  36412.48  61156.80  31642.439  70376.07
## Sep 1996       52329.57  40378.47  67817.91  35088.887  78041.33
## Oct 1996       57837.85  44628.77  74956.52  38782.394  86256.08
## Nov 1996       93598.96  72222.70 121302.09  62761.521 139588.15
## Dec 1996      203676.38 157160.50 263959.89 136572.460 303751.35
plot(fcst)

5.1j How could you improve these predictions by modifying the model?

I think I could improve this model by adding data to it - specifically average temperature by month. Given the location is in close proximity to the beach, temperature could drastically explain the volatility in certain months.

5.2 The data below (data set texasgas) shows the demand for natural gas and the price of natural gas for 20 towns in Texas in 1969.

##    price consumption
## 1     30         134
## 2     31         112
## 3     37         136
## 4     42         109
## 5     43         105
## 6     45          87
## 7     50          56
## 8     54          43
## 9     54          77
## 10    57          35
## 11    58          65
## 12    58          56
## 13    60          58
## 14    73          55
## 15    88          49
## 16    89          39
## 17    92          36
## 18    97          46
## 19   100          40
## 20   102          42

5.2a Do a scatter plot of consumption against price. The data are clearly not linear. Three possible nonlinear models for the data are given below. The second model divides the data into two sections, depending on whether the price is above or below 60 cents per 1,000 cubic feet.

plot(texasgas$price, texasgas$consumption, xlab = "Price", ylab = "Consumption")

5.2b Can you explain why the slope of the fitted line should change with P?

The first model is taking the log of the response variable to train the model - this should make the slope the product of beta times the response. The second model is using a knot point to change the slope at specific point - in this case the knot is at 60. The third model is using a peicewise cubics rather than peicewise lines (like the second model) to smooth the slope of the regression.

5.2c Fit the three models and find the coefficients, and residual variance in each case.

model1 <- lm(I(log(consumption)) ~ price, data = texasgas)
pricep <- pmax(texasgas$price-60,0)
model2 <- lm(consumption ~ price + pricep, data = texasgas)
model3 <- lm(consumption ~ price + I(price^2), data = texasgas)

5.2d For each model, find the value of R2 and AIC, and produce a residual plot. Comment on the adequacy of the three models.

CV(model1); plot(fitted(model1), residuals(model1), xlab = "Predicted scores", ylab = "Residuals")
##           CV          AIC         AICc          BIC        AdjR2 
##   0.07882759 -48.30827728 -46.80827728 -45.32108046   0.63413071

CV(model2); plot(fitted(model2), residuals(model2), xlab = "Predicted scores", ylab = "Residuals")
##          CV         AIC        AICc         BIC       AdjR2 
## 204.3115937 108.0556267 110.7222933 112.0385558   0.8403537

CV(model3); plot(fitted(model3), residuals(model3), xlab = "Predicted scores", ylab = "Residuals")
##         CV        AIC       AICc        BIC      AdjR2 
## 238.565929 111.358304 114.024971 115.341233   0.811689

Model 1 would be considered the best model under the AIC and BIC metrics. Model 2 shows the highest adjusted R2, however, there is a drastic difference in AIC and BIC between Model 1 and Model 2. Model 2 and Model 3 are very similar across all metrics.

5.2e For prices 40, 60, 80, 100, and 120 cents per 1,000 cubic feet, compute the forecasted per capita demand using the best model of the three above.

price <- c(40, 60, 80, 100, 120)
newprices <- data.frame(price)
fcst <- forecast(model1, newdata = newprices); fcst
##   Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## 1       4.486114 4.107821 4.864407 3.888723 5.083506
## 2       4.180255 3.810442 4.550068 3.596255 4.764255
## 3       3.874396 3.499993 4.248799 3.283147 4.465644
## 4       3.568536 3.176933 3.960140 2.950126 4.186947
## 5       3.262677 2.842810 3.682544 2.599633 3.925721
# Transform Response to Original Form
fcst$mean <- exp(fcst$mean)
fcst$upper <- exp(fcst$upper)
fcst$lower <- exp(fcst$lower)

fcst
##   Point Forecast    Lo 80     Hi 80    Lo 95     Hi 95
## 1       88.77582 60.81408 129.59410 48.84846 161.33866
## 2       65.38252 45.17039  94.63886 36.46142 117.24378
## 3       48.15359 33.11521  70.02125 26.65954  86.97705
## 4       35.46465 23.97312  52.46466 19.10835  65.82155
## 5       26.11937 17.16393  39.74739 13.45880  50.68963

5.2f Compute 95% prediction intervals. Make a graph of these prediction intervals and discuss their interpretation.

library(plotrix)
plotCI(newprices$price, fcst$mean, ui=fcst$upper[,2], li=fcst$lower[,2])

It is very clear from the plot that as price increases, the prediction interval shrinks. This is telling us that as price increases the model is more confident in predicting consumption.

5.2g What is the correlation between P and P2? Does this suggest any general problem to be considered in dealing with polynomial regressions - especially of higher orders?

cor(texasgas$price, texasgas$price^2)
## [1] 0.9904481

The correlation between predictors price and price squared is very high - suggesting that there could be a general problem of multicollinearity amongst features with higher orders/powers

Chapter Six Excersises

6.1 Show that a 3×5 MA is equivalent to a 7-term weighted moving average with weights of 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, and 0.067.

3x5 MA is taking the average for the first five terms and then getting the moving average of each three observation set. Since a 3X5 MA will lead to a 1/15(y + 2y + 3y + 3y + 3y + 2y + y), it follows the weighted average of 1/15, 2/15, 3/15, 3/15 … etc.

6.2 The data below represent the monthly sales (in thousands) of product A for a plastics manufacturer for years 1 through 5 (data set plastics).

##    Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1  742  697  776  898 1030 1107 1165 1216 1208 1131  971  783
## 2  741  700  774  932 1099 1223 1290 1349 1341 1296 1066  901
## 3  896  793  885 1055 1204 1326 1303 1436 1473 1453 1170 1023
## 4  951  861  938 1109 1274 1422 1486 1555 1604 1600 1403 1209
## 5 1030 1032 1126 1285 1468 1637 1611 1608 1528 1420 1119 1013

6.2a Plot the time series of sales of product A. Can you identify seasonal fluctuations and/or a trend?

plot(plastics)

From the plot above, there are seasonal fluctuations with a general upward growth trend. The seasons peak in roughly the middle of the year - potentially the summer months.

6.2b Use a classical multiplicative decomposition to calculate the trend-cycle and seasonal indices.

decompose_mult <- decompose(plastics, "multiplicative")
plot(decompose_mult)

6.2c Do the results support the graphical interpretation from part (a)?

Yes, they do.

6.2d Compute and plot the seasonally adjusted data.

plastic.adj <- decompose_mult$x / decompose_mult$seasonal
plot(plastic.adj)

6.2e Change one observation to be an outlier (e.g., add 500 to one observation), and recompute the seasonally adjusted data. What is the effect of the outlier?

plastics.outlier <- plastics
plastics.outlier[28] <- plastics.outlier[28] + 500
decompose_mult_outlier <- decompose(plastics.outlier, "multiplicative")
plastics.outlier.adj <- decompose_mult_outlier$x / decompose_mult_outlier$seasonal
plot(plastics.outlier.adj)

The effect of the outlier is significant. The outlier is apparent in the obvious peak in the time series.

6.2f Does it make any difference if the outlier is near the end rather than in the middle of the time series?

The placement of the outlier should not have any difference in effect. The basic intuition for the adjusted seasonal plot is to divide out the seasonal components of the time series. The seasonal components, however, do not take into consideration randomness or the remainder - in this situation the outlier would not fall into the seasonal component.

6.2g Use a random walk with drift to produce forecasts of the seasonally adjusted data.

fcst <- rwf(plastic.adj, h = 10, drift = TRUE)
plot(fcst)

6.2h Reseasonalize the results to give forecasts on the original scale.

fcast <- forecast(fcst, method="naive")
plot(fcast, ylab="New orders index")

6.3 Figure 6.13 shows the result of decomposing the number of persons in the civilian labor force in Australia each month from February 1978 to August 1995. (Plots Omitted)

6.3a Write about 3-5 sentences describing the results of the seasonal adjustment. Pay particular attention to the scales of the graphs in making your interpretation.

The decomposition definetivley shows an upward trend. However, since the seasonality is so close together, it is hard to notice in the non-adjusted time series - the seasonality could be mistaken for noise. With the decomposition, the remainder scale is of interest - it does show some outliers.

6.3b Is the recession of 1991/1992 visible in the estimated components?

Yes, in the remainder plot. The normal time series does show a slight hiccup at 1991, but the remainder plot does make it more obvious.