library(forecast)
## Warning: package 'forecast' was built under R version 3.4.2
library(fpp)
## Loading required package: fma
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: tseries
library(car)
## Warning: package 'car' was built under R version 3.4.3
library(MASS)
##
## Attaching package: 'MASS'
## The following objects are masked from 'package:fma':
##
## cement, housing, petrol
library(tseries)
1a.
Looking at the history of an Australian souvenir shop’s sales, we can see an interesting pattern. Within the course of a year there is a clear seasonal pattern: a huge spike in sales before the end of the year and another, smaller one early in the year. The big spike could be due to the influx of tourists who come to the town around Christmas, while the smaller spike could be due to the surfing festival. Although these patterns recur, the Christmas-season increase from 1991 to 1994 is unusually steep. It also looks like in 1992 and 1993 the months between March and December saw significantly larger increases in sales than in prior years, where the series was flatter.
fancy
## Jan Feb Mar Apr May Jun Jul
## 1987 1664.81 2397.53 2840.71 3547.29 3752.96 3714.74 4349.61
## 1988 2499.81 5198.24 7225.14 4806.03 5900.88 4951.34 6179.12
## 1989 4717.02 5702.63 9957.58 5304.78 6492.43 6630.80 7349.62
## 1990 5921.10 5814.58 12421.25 6369.77 7609.12 7224.75 8121.22
## 1991 4826.64 6470.23 9638.77 8821.17 8722.37 10209.48 11276.55
## 1992 7615.03 9849.69 14558.40 11587.33 9332.56 13082.09 16732.78
## 1993 10243.24 11266.88 21826.84 17357.33 15997.79 18601.53 26155.15
## Aug Sep Oct Nov Dec
## 1987 3566.34 5021.82 6423.48 7600.60 19756.21
## 1988 4752.15 5496.43 5835.10 12600.08 28541.72
## 1989 8176.62 8573.17 9690.50 15151.84 34061.01
## 1990 7979.25 8093.06 8476.70 17914.66 30114.41
## 1991 12552.22 11637.39 13606.89 21822.11 45060.69
## 1992 19888.61 23933.38 25391.35 36024.80 80721.71
## 1993 28586.52 30505.41 30821.33 46634.38 104660.67
plot(fancy, ylab="Sales", main="History of Sales for Australian Souvenir Shop")
seasonplot(fancy, ylab="Sales", main="Seasonal plot of Sales for Australian Souvenir Shop")
1b.
It is necessary to take logarithms of the sales data because the seasonal fluctuations grow with the level of the series, so a linear model cannot fit the raw data well. The histogram of the sales data also shows that the data are heavily positively skewed with a few outliers; taking logarithms helps with both problems.
hist(fancy)
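The effect of the transformation can be checked directly; the log series shows seasonal swings of roughly constant size and a much less skewed histogram:
# Log-transformed series: seasonal variation is roughly constant in size
plot(log(fancy), ylab = "log(Sales)", main = "Log of Sales")
hist(log(fancy), xlab = "log(Sales)", main = "Histogram of log Sales")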
1c.
log.fancy <- log(fancy)
# Dummy variable: 1 in March of each year from 1988 onward (surfing festival)
surf <- data.frame(surf = as.numeric(cycle(fancy) == 3 & time(fancy) >= 1988))
fit <- tslm(log.fancy~ trend + season + surf)
fit
##
## Call:
## tslm(formula = log.fancy ~ trend + season + surf)
##
## Coefficients:
## (Intercept) trend season2 season3 season4
## 7.61967 0.02202 0.25142 0.26608 0.38405
## season5 season6 season7 season8 season9
## 0.40949 0.44883 0.61045 0.58796 0.66933
## season10 season11 season12 surf
## 0.74739 1.20675 1.96224 0.50152
1d.
The residuals vs. time plot and the residuals vs. fitted plot show no obvious pattern: the residuals look random. This suggests there is no sign of heteroscedasticity in the errors.
plot(as.numeric(residuals(fit)), xlab= "Time", ylab="Residuals", main = "Residuals vs. Time")
plot(as.numeric(fitted(fit)), as.numeric(residuals(fit)), xlab= "Fitted Values", ylab= "Residuals", main= "Residual vs Fitted")
plot(residuals(fit), ylab="Residuals", xlab="Year")
1e.
r <- residuals(fit)
# Boxplots of the residuals by month
boxplot(split(r, cycle(fancy)), names = month.abb, ylab = "Residuals", main = "Residuals by Month")
1f. The coefficients in the model are all positive, and the seasonal dummies generally grow in magnitude through the year, matching the growing effect on sales. Season 3 is the only variable that is not statistically significant, likely because the surf festival dummy (which falls in March) absorbs much of the March effect. Almost all of the other variables are significant at well below the 1% level; season 2 and the surf festival dummy are significant at the 5% level.
summary(fit)
##
## Call:
## tslm(formula = log.fancy ~ trend + season + surf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.33673 -0.12757 0.00257 0.10911 0.37671
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.6196670 0.0742471 102.626 < 2e-16 ***
## trend 0.0220198 0.0008268 26.634 < 2e-16 ***
## season2 0.2514168 0.0956790 2.628 0.010555 *
## season3 0.2660828 0.1934044 1.376 0.173275
## season4 0.3840535 0.0957075 4.013 0.000148 ***
## season5 0.4094870 0.0957325 4.277 5.88e-05 ***
## season6 0.4488283 0.0957647 4.687 1.33e-05 ***
## season7 0.6104545 0.0958039 6.372 1.71e-08 ***
## season8 0.5879644 0.0958503 6.134 4.53e-08 ***
## season9 0.6693299 0.0959037 6.979 1.36e-09 ***
## season10 0.7473919 0.0959643 7.788 4.48e-11 ***
## season11 1.2067479 0.0960319 12.566 < 2e-16 ***
## season12 1.9622412 0.0961066 20.417 < 2e-16 ***
## surf 0.5015151 0.1964273 2.553 0.012856 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared: 0.9567, Adjusted R-squared: 0.9487
## F-statistic: 119 on 13 and 70 DF, p-value: < 2.2e-16
1g. According to the Durbin-Watson test, the model shows evidence of lag-one autocorrelation in the residuals: the DW statistic (0.89) is well below 2 and the p-value is extremely small, so we reject the null hypothesis of no autocorrelation.
dwtest(fit,alt="two.sided")
##
## Durbin-Watson test
##
## data: fit
## DW = 0.88889, p-value = 1.956e-07
## alternative hypothesis: true autocorrelation is not 0
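A residual ACF plot gives a quick visual check of the same autocorrelation:
# Significant spikes at low lags corroborate the Durbin-Watson result
Acf(residuals(fit), main = "ACF of Regression Residuals")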
1h.
# surf dummy for Jan 1994 - Dec 1996: 1 in each March
forprediction <- data.frame(surf = rep(c(0, 0, 1, rep(0, 9)), 3))
forecast1 <- forecast(fit, newdata = forprediction)
forecast1
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 1994 9.491352 9.238522 9.744183 9.101594 9.88111
## Feb 1994 9.764789 9.511959 10.017620 9.375031 10.15455
## Mar 1994 10.302990 10.048860 10.557120 9.911228 10.69475
## Apr 1994 9.941465 9.688635 10.194296 9.551707 10.33122
## May 1994 9.988919 9.736088 10.241749 9.599161 10.37868
## Jun 1994 10.050280 9.797449 10.303110 9.660522 10.44004
## Jul 1994 10.233926 9.981095 10.486756 9.844168 10.62368
## Aug 1994 10.233456 9.980625 10.486286 9.843698 10.62321
## Sep 1994 10.336841 10.084010 10.589671 9.947083 10.72660
## Oct 1994 10.436923 10.184092 10.689753 10.047165 10.82668
## Nov 1994 10.918299 10.665468 11.171129 10.528541 11.30806
## Dec 1994 11.695812 11.442981 11.948642 11.306054 12.08557
## Jan 1995 9.755590 9.499844 10.011336 9.361338 10.14984
## Feb 1995 10.029027 9.773281 10.284773 9.634775 10.42328
## Mar 1995 10.567228 10.310518 10.823938 10.171489 10.96297
## Apr 1995 10.205703 9.949957 10.461449 9.811451 10.59996
## May 1995 10.253157 9.997411 10.508903 9.858904 10.64741
## Jun 1995 10.314518 10.058772 10.570264 9.920265 10.70877
## Jul 1995 10.498164 10.242418 10.753910 10.103911 10.89242
## Aug 1995 10.497694 10.241948 10.753440 10.103441 10.89195
## Sep 1995 10.601079 10.345333 10.856825 10.206826 10.99533
## Oct 1995 10.701161 10.445415 10.956907 10.306908 11.09541
## Nov 1995 11.182537 10.926791 11.438282 10.788284 11.57679
## Dec 1995 11.960050 11.704304 12.215796 11.565797 12.35430
## Jan 1996 10.019828 9.760564 10.279093 9.620151 10.41951
## Feb 1996 10.293265 10.034000 10.552530 9.893588 10.69294
## Mar 1996 10.831466 10.571566 11.091365 10.430810 11.23212
## Apr 1996 10.469941 10.210677 10.729206 10.070264 10.86962
## May 1996 10.517395 10.258130 10.776659 10.117718 10.91707
## Jun 1996 10.578756 10.319491 10.838021 10.179079 10.97843
## Jul 1996 10.762402 10.503137 11.021667 10.362725 11.16208
## Aug 1996 10.761932 10.502667 11.021196 10.362254 11.16161
## Sep 1996 10.865317 10.606052 11.124582 10.465640 11.26499
## Oct 1996 10.965399 10.706134 11.224664 10.565722 11.36508
## Nov 1996 11.446774 11.187510 11.706039 11.047097 11.84645
## Dec 1996 12.224288 11.965023 12.483552 11.824611 12.62396
1i.
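No answer was recorded above. Since the model was fit on log sales, the predictions and intervals in 1h are on the log scale; a minimal sketch of the back-transformation to the original sales scale:
# Exponentiate the log-scale point forecasts and interval bounds
exp(forecast1$mean)    # point forecasts in dollars
exp(forecast1$lower)   # 80% and 95% lower bounds
exp(forecast1$upper)   # 80% and 95% upper bounds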
1j. We could fit other non-linear regression models, since the relationship is non-linear. This could include piecewise models and regression splines; regression splines could give a better fit to the data through careful knot selection, as sketched below.
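A sketch of one such alternative, using a natural cubic spline on the time trend; fit.spline and the choice of df = 4 are arbitrary illustrative choices:
library(splines)
# Illustrative regression-spline model: spline trend plus monthly dummies
t <- seq_along(fancy)
fit.spline <- lm(log(fancy) ~ ns(t, df = 4) + factor(cycle(fancy)))
summary(fit.spline)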
2a.
plot(consumption ~ price, data=texasgas, xlab = "Price", ylab = "Consumption",
main="Price and per capita Consumption of Natural Gas")
2b.
The relationship between price and consumption is clearly nonlinear. Because of that, forecasting consumption with a simple linear regression would be worthless: the model would not represent the data correctly.
2c.
##model 1
fit1<- lm(consumption ~ exp(price), data = texasgas)
summary(fit1)
##
## Call:
## lm(formula = consumption ~ exp(price), data = texasgas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.86 -25.09 -13.86 20.64 65.14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.086e+01 7.670e+00 9.238 2.98e-08 ***
## exp(price) -1.642e-43 1.711e-43 -0.959 0.35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.19 on 18 degrees of freedom
## Multiple R-squared: 0.04864, Adjusted R-squared: -0.004214
## F-statistic: 0.9203 on 1 and 18 DF, p-value: 0.3501
#residual variance = 1101.359
(summary(fit1)$sigma)**2
## [1] 1101.359
##model 2
fit2<-lm(consumption ~ (price<=60)*price + (price>60)*price, data = texasgas)
summary(fit2)
##
## Call:
## lm(formula = consumption ~ (price <= 60) * price + (price > 60) *
## price, data = texasgas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.987 -6.421 2.823 9.324 22.617
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.7861 51.8428 1.635 0.12147
## price <= 60TRUE 136.1068 54.9412 2.477 0.02478 *
## price -0.4470 0.5634 -0.793 0.43919
## price > 60TRUE NA NA NA NA
## price <= 60TRUE:price -2.4587 0.6761 -3.636 0.00222 **
## price:price > 60TRUE NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.49 on 16 degrees of freedom
## Multiple R-squared: 0.8602, Adjusted R-squared: 0.834
## F-statistic: 32.81 on 3 and 16 DF, p-value: 4.565e-07
#residual variance = 182.1078
(summary(fit2)$sigma)**2
## [1] 182.1078
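The formula above creates redundant columns, which is why two coefficients are NA and why predictions from fit2 later trigger rank-deficiency warnings. An equivalent full-rank way to write the same segmented model (fit2b is an illustrative name) might be:
# Separate intercept and slope above 60 cents, with no redundant columns
fit2b <- lm(consumption ~ price * I(price > 60), data = texasgas)
summary(fit2b)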
##model 3
fit3 <- lm(consumption ~ price + I(price^2), data = texasgas)
summary(fit3)
##
## Call:
## lm(formula = consumption ~ price + I(price^2), data = texasgas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.5601 -5.4693 0.7502 11.0252 25.6619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 273.930628 31.031614 8.827 9.32e-08 ***
## price -5.675863 1.009086 -5.625 3.03e-05 ***
## I(price^2) 0.033904 0.007412 4.574 0.000269 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.37 on 17 degrees of freedom
## Multiple R-squared: 0.8315, Adjusted R-squared: 0.8117
## F-statistic: 41.95 on 2 and 17 DF, p-value: 2.666e-07
#residual variance = 206.5276
(summary(fit3)$sigma)**2
## [1] 206.5276
2d. Model 1 does not capture the data well. It has very low R-squared and adjusted R-squared values, its fitted values all cluster around 70 so the residuals are large, and its AIC is the highest among the three. Models 2 and 3 are very similar: Model 2 has the highest R-squared and adjusted R-squared and the lowest AIC, while Model 3 has slightly lower R-squared and adjusted R-squared values and a slightly higher AIC. Their residuals are very similar.
##model 1
#Multiple R-squared: 0.04864, Adjusted R-squared: -0.004214
AIC(fit1) #200.7363
## [1] 200.7363
plot(fitted(fit1), residuals(fit1),ylab='Residuals', xlab='Predicted Values', main= "Model 1")
hist(residuals(fit1),xlab='Residuals', main= "Histogram of Model 1 Residuals")
##model 2
#Multiple R-squared: 0.8602, Adjusted R-squared: 0.834
AIC(fit2) #166.3866
## [1] 166.3866
plot(fitted(fit2), residuals(fit2),ylab='Residuals', xlab='Predicted Values', main= "Model 2")
hist(residuals(fit2),xlab='Residuals', main= "Histogram of Model 2 Residuals")
##model 3
#Multiple R-squared: 0.8315, Adjusted R-squared: 0.8117
AIC(fit3) #168.1158
## [1] 168.1158
plot(fitted(fit3), residuals(fit3),ylab='Residuals', xlab='Predicted Values', main= "Model 3")
hist(residuals(fit3),xlab='Residuals', main= "Histogram of Model 3 Residuals")
2e. Model 2, the segmented one, is the best model: it exhibits the lowest AIC and the highest R-squared and adjusted R-squared.
forprediction <- data.frame(price = c(40, 60, 80, 100, 120))
predict(fit2, forprediction)
## Warning in predict.lm(fit2, forprediction): prediction from a rank-
## deficient fit may be misleading
## 1 2 3 4 5
## 104.66623 46.55289 49.02913 40.08989 31.15065
2f. I was unable to get the prediction intervals onto the plot, though a sketch of one approach appears after the plot below. The predicted values show that consumption is high when the price is low and low when the price is high, which matches the original plot.
# Grid of 20 prices from 40 to 97 cents
pred <- seq(40, 97, by = 3)
pred_interval <- predict(fit2,newdata=data.frame(price=pred),interval="prediction", level = 0.95)
## Warning in predict.lm(fit2, newdata = data.frame(price = pred), interval =
## "prediction", : prediction from a rank-deficient fit may be misleading
pred_interval
## fit lwr upr
## 1 104.66623 74.371543 134.96092
## 2 95.94923 66.037298 125.86117
## 3 87.23223 57.517181 116.94728
## 4 78.51523 48.807497 108.22296
## 5 69.79823 39.908107 99.68835
## 6 61.08123 30.822440 91.34002
## 7 52.36423 21.557183 83.17127
## 8 57.52141 9.892524 105.15030
## 9 56.18053 11.239542 101.12151
## 10 54.83964 12.453913 97.22537
## 11 53.49875 13.510202 93.48731
## 12 52.15787 14.378306 89.93743
## 13 50.81698 15.023365 86.61060
## 14 49.47610 15.406353 83.54584
## 15 48.13521 15.485733 80.78469
## 16 46.79432 15.220506 78.36814
## 17 45.45344 14.574639 76.33223
## 18 44.11255 13.522175 74.70292
## 19 42.77166 12.051661 73.49167
## 20 41.43078 10.168295 72.69326
plot(texasgas$price, texasgas$consumption, xlab = "Price", ylab = "Consumption",
main="Price and per capita Consumption of Natural Gas")
2g. Price and price^2 are extremely highly correlated (about 0.99), which is very problematic: such collinearity makes the coefficient estimates unstable and inflates their standard errors. When dealing with polynomial regressions, one should carefully choose which orders to include so that the regressors are not highly correlated; including both terms here creates problems for the model.
texasgas$price^2
##  [1]   900   961  1369  1764  1849  2025  2500  2916  2916  3249  3364
## [12]  3364  3600  5329  7744  7921  8464  9409 10000 10404
mydata <- data.frame(price = texasgas$price, price2 = texasgas$price^2,
consumption = texasgas$consumption)
cor(mydata)
##                  price     price2 consumption
## price        1.0000000  0.9904481  -0.7900215
## price2       0.9904481  1.0000000  -0.7196840
## consumption -0.7900215 -0.7196840   1.0000000
# Weights of the 7-term weighted moving average
x <- c(0.067, 0.133, 0.200, 0.200, 0.200, 0.133, 0.067)
x
## [1] 0.067 0.133 0.200 0.200 0.200 0.133 0.067
ma(x, order = 7)  # 7-term MA of the weights; only the centre value (their mean) is defined
## Time Series:
## Start = 1
## End = 7
## Frequency = 1
## [1] NA NA NA 0.1428571 NA NA NA
# A 3x5 MA: a 5-term MA followed by a 3-term MA
ma3x5 <- ma(x, 5, centre = FALSE)
ma(ma3x5, 3, centre = FALSE)
## Time Series:
## Start = 1
## End = 7
## Frequency = 1
## [1] NA NA NA 0.1644 NA NA NA
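As a direct check of the equivalence, a 3x5 MA of an arbitrary series can be compared with a 7-term weighted MA whose weights (1, 2, 3, 3, 3, 2, 1)/15 round to the values in x; z, m3x5, and wma are illustrative names:
set.seed(1)
z <- ts(rnorm(20))
m3x5 <- ma(ma(z, 5, centre = FALSE), 3, centre = FALSE)         # 3x5 MA
wma <- stats::filter(z, c(1, 2, 3, 3, 3, 2, 1) / 15, sides = 2) # weighted 7-term MA
cbind(m3x5, wma)  # the two columns agree wherever both are defined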
2a.
Overall, the time series shows a positive trend: as time progresses, sales generally increase. The plot also shows a consistent spike in sales in the middle of each year.
y=ts(plastics, frequency = 12)
plot(y,xlab = "Years", ylab = "Sales (in thousands)")
2b.
fit <- decompose(y, type="multiplicative")
plot(fit)
plot(fit$seasonal)
2c. The multiplicative decomposition supports my original observation that the summer months see more sales than the colder ones; this shows up in more detail in the seasonal panel. The trend panel also indicates a positive trend, which I commented on before as well.
2d.
fit<- stl(y, s.window = 12)
seaadj <- seasadj(fit)
seaadj
## Jan Feb Mar Apr May Jun Jul
## 1 1005.8181 1023.6627 1024.9698 996.6815 975.8481 932.9908 964.2438
## 2 1006.8188 1028.0481 1024.0271 1030.9316 1043.9567 1045.9986 1087.5240
## 3 1163.8533 1122.4595 1136.1029 1154.1944 1148.0723 1146.0083 1098.8008
## 4 1220.7041 1191.8971 1190.1830 1208.4212 1217.2137 1239.0893 1280.3851
## 5 1301.5689 1364.3486 1379.2770 1384.6594 1410.3644 1451.1723 1403.9642
## Aug Sep Oct Nov Dec
## 1 953.4735 947.6618 922.9786 997.5470 977.0753
## 2 1085.5988 1080.5174 1087.9172 1094.3410 1095.2376
## 3 1171.7209 1212.3700 1244.8548 1200.1363 1217.4031
## 4 1289.8705 1343.1063 1391.5670 1434.9288 1403.5264
## 5 1342.0048 1266.8174 1211.2432 1152.6746 1207.5914
plot(seaadj)
2e. I chose the 12th observation and added 500 to it. We can see the effect of the outlier on the seasonally adjusted data as a clear spike in the plot.
pl0 <- plastics
pl0[12] <- pl0[12]+500
y=ts(pl0, frequency = 12)
fitlast <- stl(y, s.window = 12)
fitlastadj <- seasadj(fitlast)
plot(fitlastadj)
2f.
Here I made two data sets, adding an additional 500 in sales to the 30th and 57th observations respectively, to illustrate the effect an outlier's position may have on the seasonally adjusted data. In each case you can clearly see a large spike in the seasonally adjusted plot.
#make outlier for 57th observation by adding 500
pl1 <- plastics
pl1[57] <- pl1[57]+500
y=ts(pl1, frequency = 12)
fitlast <- stl(y, s.window = 12)
fitlastadj <- seasadj(fitlast)
plot(fitlastadj)
#make outlier for 30th observation by adding 500
pl2 <- plastics
pl2[30] <- pl2[30]+500
y=ts(pl2, frequency = 12)
fitlast <- stl(y, s.window = 12)
fitlastadj <- seasadj(fitlast)
plot(fitlastadj)
Looking at the seasonally adjusted plots for both of these data sets, we can see that the outliers produce the same kind of spike in their respective locations. Other than that, the graphs are very similar to the original seasonally adjusted plot of the sales data.
2g.
fits <- rwf(seaadj, drift = TRUE)
plot(fits)  # View() opens an interactive viewer and shows nothing in a knitted document
2h. Here the sales of product A are forecast by applying a naive forecast to the seasonally adjusted data; the forecasts are then “reseasonalized” by adding back the seasonal component.
fit <- stl(plastics, t.window = 15, s.window = "periodic", robust = TRUE)
eeadj <- seasadj(fit)
plot(naive(eeadj), xlab = "Years", ylab = "Sales (in thousands)")
fcast <- forecast(fit, method= "naive")
plot(fcast, xlab = "Years", ylab = "Sales (in thousands)")
3a. Looking at the decomposition of the data, we can conclude that there is an overall increasing trend in the Australian labour force. There are also clear signs of seasonality, with a recurring within-year pattern. The seasonal sub-series plot helps distinguish this: the labour force tends to drop in January and from April through August, and to increase in March and December.
3b. Yes, you can see it in the time series plot: there is a noticeable drop in the 1991/1992 period. You are also able to see a minor dip in the trend panel, indicating something happened during that time, and the remainder values are also very large during this period.
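No code accompanies question 3. A sketch of the plots both answers refer to, assuming the monthly Australian labour force data are loaded as a ts object named labour (the object name is an assumption):
# Assumed: labour is a monthly ts of the Australian civilian labour force
fit.labour <- stl(labour, s.window = "periodic", robust = TRUE)
plot(fit.labour)                          # trend, seasonal, and remainder panels
monthplot(labour, ylab = "Labour force")  # seasonal sub-series plot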