Load Libraries

library(forecast)
## Warning: package 'forecast' was built under R version 3.4.2
library(fpp)
## Loading required package: fma
## Loading required package: expsmooth
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Loading required package: tseries
library(car)
## Warning: package 'car' was built under R version 3.4.3
library(MASS)
## 
## Attaching package: 'MASS'
## The following objects are masked from 'package:fma':
## 
##     cement, housing, petrol
library(tseries)

5.8 Multiple Regression Exercises

1a.

Looking at the history of an Australian souvenir shop’s sales, we can see an interesting pattern. Within the course of a year there is a seasonal pattern: a huge spike in sales before the end of the year and another, smaller one once the new year begins. The big spike is likely due to the influx of tourists who come to the town around Christmas; the smaller spike could be due to the surfing festival. Although these patterns recur every year, it is notable how heavily the Christmas-season sales grew from 1991 through 1993. It also looks like in 1992 and 1993 the months between March and December saw a significantly larger increase in sales than in prior years, when sales were flatter.

fancy
##            Jan       Feb       Mar       Apr       May       Jun       Jul
## 1987   1664.81   2397.53   2840.71   3547.29   3752.96   3714.74   4349.61
## 1988   2499.81   5198.24   7225.14   4806.03   5900.88   4951.34   6179.12
## 1989   4717.02   5702.63   9957.58   5304.78   6492.43   6630.80   7349.62
## 1990   5921.10   5814.58  12421.25   6369.77   7609.12   7224.75   8121.22
## 1991   4826.64   6470.23   9638.77   8821.17   8722.37  10209.48  11276.55
## 1992   7615.03   9849.69  14558.40  11587.33   9332.56  13082.09  16732.78
## 1993  10243.24  11266.88  21826.84  17357.33  15997.79  18601.53  26155.15
##            Aug       Sep       Oct       Nov       Dec
## 1987   3566.34   5021.82   6423.48   7600.60  19756.21
## 1988   4752.15   5496.43   5835.10  12600.08  28541.72
## 1989   8176.62   8573.17   9690.50  15151.84  34061.01
## 1990   7979.25   8093.06   8476.70  17914.66  30114.41
## 1991  12552.22  11637.39  13606.89  21822.11  45060.69
## 1992  19888.61  23933.38  25391.35  36024.80  80721.71
## 1993  28586.52  30505.41  30821.33  46634.38 104660.67
plot(fancy, ylab="Sales", main="History of Sales for Australian Souvenir Shop")

seasonplot(fancy, ylab="Sales", main="Seasonal plot of Sales for Australian Souvenir Shop")

1b.

It is necessary to take logarithms of the sales data before fitting a linear trend because the seasonal fluctuations grow with the level of the series; logging stabilizes that variance so the seasonal pattern is similar across years. The histogram of the sales data also shows that it is heavily positively skewed, with a few large values, and taking logarithms helps with this as well.

hist(fancy)
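As a quick sanity check (a sketch), the histogram of the logged series should look much more symmetric than the one above:

hist(log(fancy), xlab = "log(Sales)", main = "Histogram of log Sales")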

1c.

log.fancy <- log(fancy)
# Dummy variable: 1 in March of each year from 1988 onward (the surfing festival)
surf <- rep(0, 84)
surf[seq(15, 84, by = 12)] <- 1
surf <- data.frame(surf = surf)
fit <- tslm(log.fancy~ trend + season + surf)
fit
## 
## Call:
## tslm(formula = log.fancy ~ trend + season + surf)
## 
## Coefficients:
## (Intercept)        trend      season2      season3      season4  
##     7.61967      0.02202      0.25142      0.26608      0.38405  
##     season5      season6      season7      season8      season9  
##     0.40949      0.44883      0.61045      0.58796      0.66933  
##    season10     season11     season12         surf  
##     0.74739      1.20675      1.96224      0.50152

1d.

The residuals vs. fitted plot and the residuals vs. time plot show no systematic pattern in the spread of the residuals, so there is no sign of heteroscedasticity among the errors (although, as part 1g shows, the residuals are autocorrelated, so they are not entirely random).

plot(as.numeric(residuals(fit)), xlab= "Time", ylab="Residuals", main = "Residuals vs. Time")

plot(as.numeric(fitted(fit)), as.numeric(residuals(fit)), xlab= "Fitted Values", ylab= "Residuals", main= "Residual vs Fitted")

plot(residuals(fit), ylab="Residuals", xlab="Year")

1e.

r <- residuals(fit)
# boxplots of the residuals for each month
boxplot(r ~ cycle(r), xlab = "Month", ylab = "Residuals")

1f. All of the coefficients in the model are positive. The seasonal coefficients grow in magnitude over the course of the year, peaking in November and December, reflecting their increasing effect on sales. Season 3 (March) is the only variable that is not statistically significant. Almost all of the other variables are significant at well below the 1% level; season 2 and the surfing-festival dummy are significant at the 5% level.

summary(fit)
## 
## Call:
## tslm(formula = log.fancy ~ trend + season + surf)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33673 -0.12757  0.00257  0.10911  0.37671 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.6196670  0.0742471 102.626  < 2e-16 ***
## trend       0.0220198  0.0008268  26.634  < 2e-16 ***
## season2     0.2514168  0.0956790   2.628 0.010555 *  
## season3     0.2660828  0.1934044   1.376 0.173275    
## season4     0.3840535  0.0957075   4.013 0.000148 ***
## season5     0.4094870  0.0957325   4.277 5.88e-05 ***
## season6     0.4488283  0.0957647   4.687 1.33e-05 ***
## season7     0.6104545  0.0958039   6.372 1.71e-08 ***
## season8     0.5879644  0.0958503   6.134 4.53e-08 ***
## season9     0.6693299  0.0959037   6.979 1.36e-09 ***
## season10    0.7473919  0.0959643   7.788 4.48e-11 ***
## season11    1.2067479  0.0960319  12.566  < 2e-16 ***
## season12    1.9622412  0.0961066  20.417  < 2e-16 ***
## surf        0.5015151  0.1964273   2.553 0.012856 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared:  0.9567, Adjusted R-squared:  0.9487 
## F-statistic:   119 on 13 and 70 DF,  p-value: < 2.2e-16

1g. According to the Durbin-Watson test, there is evidence of lag-one autocorrelation in the model’s errors: the statistic (DW = 0.89) is well below 2 and the p-value (1.956e-07) is extremely small, so we reject the null hypothesis of no autocorrelation.

dwtest(fit,alt="two.sided")
## 
##  Durbin-Watson test
## 
## data:  fit
## DW = 0.88889, p-value = 1.956e-07
## alternative hypothesis: true autocorrelation is not 0
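A complementary check (a sketch) is the ACF of the residuals, which should show a significant positive spike at lag one:

Acf(residuals(fit), main = "ACF of Residuals")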

1h.

# festival dummy for the 36 forecast months: 1 in March of each year
forprediction <- data.frame(surf = rep(c(0, 0, 1, rep(0, 9)), 3))
forecast1 <- forecast(fit, newdata = forprediction)
forecast1
##          Point Forecast     Lo 80     Hi 80     Lo 95    Hi 95
## Jan 1994       9.491352  9.238522  9.744183  9.101594  9.88111
## Feb 1994       9.764789  9.511959 10.017620  9.375031 10.15455
## Mar 1994      10.302990 10.048860 10.557120  9.911228 10.69475
## Apr 1994       9.941465  9.688635 10.194296  9.551707 10.33122
## May 1994       9.988919  9.736088 10.241749  9.599161 10.37868
## Jun 1994      10.050280  9.797449 10.303110  9.660522 10.44004
## Jul 1994      10.233926  9.981095 10.486756  9.844168 10.62368
## Aug 1994      10.233456  9.980625 10.486286  9.843698 10.62321
## Sep 1994      10.336841 10.084010 10.589671  9.947083 10.72660
## Oct 1994      10.436923 10.184092 10.689753 10.047165 10.82668
## Nov 1994      10.918299 10.665468 11.171129 10.528541 11.30806
## Dec 1994      11.695812 11.442981 11.948642 11.306054 12.08557
## Jan 1995       9.755590  9.499844 10.011336  9.361338 10.14984
## Feb 1995      10.029027  9.773281 10.284773  9.634775 10.42328
## Mar 1995      10.567228 10.310518 10.823938 10.171489 10.96297
## Apr 1995      10.205703  9.949957 10.461449  9.811451 10.59996
## May 1995      10.253157  9.997411 10.508903  9.858904 10.64741
## Jun 1995      10.314518 10.058772 10.570264  9.920265 10.70877
## Jul 1995      10.498164 10.242418 10.753910 10.103911 10.89242
## Aug 1995      10.497694 10.241948 10.753440 10.103441 10.89195
## Sep 1995      10.601079 10.345333 10.856825 10.206826 10.99533
## Oct 1995      10.701161 10.445415 10.956907 10.306908 11.09541
## Nov 1995      11.182537 10.926791 11.438282 10.788284 11.57679
## Dec 1995      11.960050 11.704304 12.215796 11.565797 12.35430
## Jan 1996      10.019828  9.760564 10.279093  9.620151 10.41951
## Feb 1996      10.293265 10.034000 10.552530  9.893588 10.69294
## Mar 1996      10.831466 10.571566 11.091365 10.430810 11.23212
## Apr 1996      10.469941 10.210677 10.729206 10.070264 10.86962
## May 1996      10.517395 10.258130 10.776659 10.117718 10.91707
## Jun 1996      10.578756 10.319491 10.838021 10.179079 10.97843
## Jul 1996      10.762402 10.503137 11.021667 10.362725 11.16208
## Aug 1996      10.761932 10.502667 11.021196 10.362254 11.16161
## Sep 1996      10.865317 10.606052 11.124582 10.465640 11.26499
## Oct 1996      10.965399 10.706134 11.224664 10.565722 11.36508
## Nov 1996      11.446774 11.187510 11.706039 11.047097 11.84645
## Dec 1996      12.224288 11.965023 12.483552 11.824611 12.62396

1i.
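Since the model was fitted to the logged series, the point forecasts and intervals above are on the log scale. One way to express them on the original sales scale (a minimal sketch; raw is an illustrative name) is to exponentiate the point forecasts and the interval bounds:

# back-transform the log-scale forecasts and interval bounds to dollars
raw <- data.frame(point = exp(forecast1$mean),
                  lo80  = exp(forecast1$lower[, "80%"]),
                  hi80  = exp(forecast1$upper[, "80%"]),
                  lo95  = exp(forecast1$lower[, "95%"]),
                  hi95  = exp(forecast1$upper[, "95%"]))
head(raw)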

1j. Since the data is non-linear, we could use other non-linear regression models, such as piecewise models or regression splines. Regression splines, through careful knot selection, could give a better fit to the data.

2a.

plot(consumption ~ price, data=texasgas, xlab = "Price", ylab = "Consumption",
     main="Price and per capita Consumption of Natural Gas")

2b.

The data is clearly nonlinear. Consequently, if we tried to forecast consumption with a simple linear regression, the model would misrepresent the data and the forecasts would be worthless.

2c.

##model 1
fit1<- lm(consumption ~ exp(price), data = texasgas)
summary(fit1)
## 
## Call:
## lm(formula = consumption ~ exp(price), data = texasgas)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -35.86 -25.09 -13.86  20.64  65.14 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.086e+01  7.670e+00   9.238 2.98e-08 ***
## exp(price)  -1.642e-43  1.711e-43  -0.959     0.35    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.19 on 18 degrees of freedom
## Multiple R-squared:  0.04864,    Adjusted R-squared:  -0.004214 
## F-statistic: 0.9203 on 1 and 18 DF,  p-value: 0.3501
#residual variance = 1101.359
(summary(fit1)$sigma)**2
## [1] 1101.359
##model 2
fit2<-lm(consumption ~ (price<=60)*price + (price>60)*price, data = texasgas)
summary(fit2)
## 
## Call:
## lm(formula = consumption ~ (price <= 60) * price + (price > 60) * 
##     price, data = texasgas)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.987  -6.421   2.823   9.324  22.617 
## 
## Coefficients: (2 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)   
## (Intercept)            84.7861    51.8428   1.635  0.12147   
## price <= 60TRUE       136.1068    54.9412   2.477  0.02478 * 
## price                  -0.4470     0.5634  -0.793  0.43919   
## price > 60TRUE              NA         NA      NA       NA   
## price <= 60TRUE:price  -2.4587     0.6761  -3.636  0.00222 **
## price:price > 60TRUE        NA         NA      NA       NA   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.49 on 16 degrees of freedom
## Multiple R-squared:  0.8602, Adjusted R-squared:  0.834 
## F-statistic: 32.81 on 3 and 16 DF,  p-value: 4.565e-07
#residual variance = 182.1078
(summary(fit2)$sigma)**2
## [1] 182.1078
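Note that model 2’s parameterization is rank deficient (two coefficients come back NA), which is why predict() warns below that predictions from it may be misleading. A full-rank version of the same two-segment model can be built explicitly (a sketch; tg, low, p_lo, p_hi, and fit2b are illustrative names):

# separate intercepts and slopes for the two price segments, without redundant terms
tg <- transform(texasgas, low = as.numeric(price <= 60))
tg <- transform(tg,
                p_lo = low * price,        # price slope below the knot at 60
                p_hi = (1 - low) * price)  # price slope above the knot
fit2b <- lm(consumption ~ low + p_lo + p_hi, data = tg)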
##model 3
fit3 <- lm(consumption ~ price + I(price^2), data = texasgas)
summary(fit3)
## 
## Call:
## lm(formula = consumption ~ price + I(price^2), data = texasgas)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5601  -5.4693   0.7502  11.0252  25.6619 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 273.930628  31.031614   8.827 9.32e-08 ***
## price        -5.675863   1.009086  -5.625 3.03e-05 ***
## I(price^2)    0.033904   0.007412   4.574 0.000269 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.37 on 17 degrees of freedom
## Multiple R-squared:  0.8315, Adjusted R-squared:  0.8117 
## F-statistic: 41.95 on 2 and 17 DF,  p-value: 2.666e-07
#residual variance = 206.5276
(summary(fit3)$sigma)**2
## [1] 206.5276

2d. Model 1 does not capture the data well: it has very low R-squared and adjusted R-squared values, its AIC is the highest of the three, and its fitted values all sit near 70 (essentially the intercept, since the exp(price) term contributes almost nothing). Models 2 and 3 are very similar. Model 2 has the highest R-squared and adjusted R-squared and the lowest AIC; Model 3’s R-squared and adjusted R-squared are slightly lower and its AIC slightly higher. Their residuals look very similar.

##model 1
#Multiple R-squared:  0.04864,  Adjusted R-squared:  -0.004214 
AIC(fit1) #200.7363
## [1] 200.7363
plot(fitted(fit1), residuals(fit1),ylab='Residuals', xlab='Predicted Values', main= "Model 1")

hist(residuals(fit1),xlab='Residuals', main= "Histogram of Model 1 Residuals")

##model 2
#Multiple R-squared:  0.8602,   Adjusted R-squared:  0.834
AIC(fit2) #166.3866
## [1] 166.3866
plot(fitted(fit2), residuals(fit2),ylab='Residuals', xlab='Predicted Values', main= "Model 2")

hist(residuals(fit2),xlab='Residuals', main= "Histogram of Model 2 Residuals")

##model 3
#Multiple R-squared:  0.8315,   Adjusted R-squared:  0.8117 
AIC(fit3) #168.1158
## [1] 168.1158
plot(fitted(fit3), residuals(fit3),ylab='Residuals', xlab='Predicted Values', main= "Model 3")

hist(residuals(fit3),xlab='Residuals', main= "Histogram of Model 3 Residuals")

2e. Model 2, the segmented one, is the best model: it has the lowest AIC and the highest R-squared and adjusted R-squared.

forprediction <- data.frame(price = c(40, 60, 80, 100, 120))
predict(fit2, forprediction)
## Warning in predict.lm(fit2, forprediction): prediction from a rank-
## deficient fit may be misleading
##         1         2         3         4         5 
## 104.66623  46.55289  49.02913  40.08989  31.15065

2f. I was unable to figure out how to plot the prediction intervals (one possible approach is sketched after the scatterplot below). The predicted values show that when the price is low, consumption is high, and when the price is high, consumption is low, which matches the pattern in the original plot.

# 20 price points from 40 to 97 (the same grid as seq(40, 120, 3)[1:20])
pred <- seq(40, 97, by = 3)
pred_interval <- predict(fit2,newdata=data.frame(price=pred),interval="prediction", level = 0.95)
## Warning in predict.lm(fit2, newdata = data.frame(price = pred), interval =
## "prediction", : prediction from a rank-deficient fit may be misleading
pred_interval
##          fit       lwr       upr
## 1  104.66623 74.371543 134.96092
## 2   95.94923 66.037298 125.86117
## 3   87.23223 57.517181 116.94728
## 4   78.51523 48.807497 108.22296
## 5   69.79823 39.908107  99.68835
## 6   61.08123 30.822440  91.34002
## 7   52.36423 21.557183  83.17127
## 8   57.52141  9.892524 105.15030
## 9   56.18053 11.239542 101.12151
## 10  54.83964 12.453913  97.22537
## 11  53.49875 13.510202  93.48731
## 12  52.15787 14.378306  89.93743
## 13  50.81698 15.023365  86.61060
## 14  49.47610 15.406353  83.54584
## 15  48.13521 15.485733  80.78469
## 16  46.79432 15.220506  78.36814
## 17  45.45344 14.574639  76.33223
## 18  44.11255 13.522175  74.70292
## 19  42.77166 12.051661  73.49167
## 20  41.43078 10.168295  72.69326
plot(texasgas$price, texasgas$consumption, xlab = "Price", ylab = "Consumption",
     main="Price and per capita Consumption of Natural Gas")
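One possible way to overlay the fitted values and 95% prediction intervals on this scatterplot (a sketch, reusing the pred grid and pred_interval matrix computed above):

# columns of pred_interval are fit, lwr, upr
matlines(pred, pred_interval, lty = c(1, 2, 2), col = c("blue", "red", "red"))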

2g. P and P^2 have an extremely high correlation of 0.99, which is very problematic. When fitting polynomial regressions, one should choose the included orders carefully so that the regressors are not highly correlated; including both raw terms creates problems for the model.

I(texasgas$price)^2
##  [1]   900   961  1369  1764  1849  2025  2500  2916  2916  3249  3364
## [12]  3364  3600  5329  7744  7921  8464  9409 10000 10404
mydata <- data.frame(I(texasgas$price)^2,texasgas$consumption, texasgas$price ) 
cor(mydata)
##                      I.texasgas.price..2 texasgas.consumption
## I.texasgas.price..2            1.0000000           -0.7196840
## texasgas.consumption          -0.7196840            1.0000000
## texasgas.price                 0.9904481           -0.7900215
##                      texasgas.price
## I.texasgas.price..2       0.9904481
## texasgas.consumption     -0.7900215
## texasgas.price            1.0000000
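One standard remedy (a sketch; fit3c is an illustrative name) is to use orthogonal polynomials via poly(), whose columns are constructed to be uncorrelated, so the quadratic model can be fitted without the collinearity between the raw price and price^2 terms:

# the same quadratic fit as model 3, but with uncorrelated regressors
fit3c <- lm(consumption ~ poly(price, 2), data = texasgas)
round(cor(poly(texasgas$price, 2)), 10)   # off-diagonal entries are essentially zero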

Chapter 6

  1. A 3×5 MA means we take a 5-term moving average first and then a 3-term moving average of the result. Applying a simple 7-term MA and a 3×5 MA to the weight vector itself gives different answers, which is expected: a simple 7-MA weights every observation equally at 1/7 ≈ 0.143, whereas the 3×5 MA applies the unequal weights 0.067, 0.133, 0.200, 0.200, 0.200, 0.133, 0.067 that the exercise describes.
x <- c(0.067, 0.133, 0.200, 0.200, 0.200, 0.133, 0.067)
x
## [1] 0.067 0.133 0.200 0.200 0.200 0.133 0.067
ma(x,order= 7)
## Time Series:
## Start = 1 
## End = 7 
## Frequency = 1 
## [1]        NA        NA        NA 0.1428571        NA        NA        NA
ma5 <- ma(x, order = 5, centre = FALSE)   # 5-term MA first
ma(ma5, order = 3, centre = FALSE)        # then a 3-term MA of the result
## Time Series:
## Start = 1 
## End = 7 
## Frequency = 1 
## [1]     NA     NA     NA 0.1644     NA     NA     NA
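To verify the equivalence the exercise is actually after (a minimal sketch on an arbitrary series; y and w are illustrative names), we can compare a 3×5 MA with a 7-term weighted MA built from the exact weights (1, 2, 3, 3, 3, 2, 1)/15, of which the weights above are rounded versions:

set.seed(1)
y <- ts(rnorm(12))
w <- c(1, 2, 3, 3, 3, 2, 1) / 15          # exact 3x5 MA weights
ma3x5 <- ma(ma(y, order = 5), order = 3)  # 3x5 moving average
wma7 <- stats::filter(y, w, sides = 2)    # 7-term weighted moving average
cbind(ma3x5, wma7)                        # the interior values agree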

2a.

Overall, the time series shows a positive trend: as time progresses, sales generally increase. The plot also shows a consistent seasonal spike in sales in the middle of each year.

y=ts(plastics, frequency = 12)
plot(y,xlab = "Years", ylab = "Sales (in thousands)")

2b.

fit <- decompose(y, type="multiplicative")
plot(fit)

plot(fit$seasonal)

2c. The multiplicative decomposition supports my original comment that the summer months see more sales than the colder ones; this can be seen in more detail in the seasonal plot. The trend panel also shows the positive trend I commented on before.

2d.

fit<- stl(y, s.window = 12)
seaadj <- seasadj(fit)
seaadj
##         Jan       Feb       Mar       Apr       May       Jun       Jul
## 1 1005.8181 1023.6627 1024.9698  996.6815  975.8481  932.9908  964.2438
## 2 1006.8188 1028.0481 1024.0271 1030.9316 1043.9567 1045.9986 1087.5240
## 3 1163.8533 1122.4595 1136.1029 1154.1944 1148.0723 1146.0083 1098.8008
## 4 1220.7041 1191.8971 1190.1830 1208.4212 1217.2137 1239.0893 1280.3851
## 5 1301.5689 1364.3486 1379.2770 1384.6594 1410.3644 1451.1723 1403.9642
##         Aug       Sep       Oct       Nov       Dec
## 1  953.4735  947.6618  922.9786  997.5470  977.0753
## 2 1085.5988 1080.5174 1087.9172 1094.3410 1095.2376
## 3 1171.7209 1212.3700 1244.8548 1200.1363 1217.4031
## 4 1289.8705 1343.1063 1391.5670 1434.9288 1403.5264
## 5 1342.0048 1266.8174 1211.2432 1152.6746 1207.5914
plot(seaadj)
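Since STL is an additive decomposition, the seasonally adjusted series is just the original data minus the seasonal component, i.e. the trend plus the remainder (a quick sketch check):

# for additive STL, seasadj(fit) equals trendcycle(fit) + remainder(fit)
all.equal(seasadj(fit), trendcycle(fit) + remainder(fit))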

2e. I chose the 12th observation and added 500 to it. We can see the effect of the outlier on sales as a spike in the seasonally adjusted plot.

pl0 <- plastics
pl0[12] <- pl0[12]+500


y=ts(pl0, frequency = 12)
fitlast <- stl(y, s.window = 12)
fitlastadj <- seasadj(fitlast)
plot(fitlastadj)

2f.

Here I made two data sets, adding an additional 500 in sales to the 30th and 57th observations, to illustrate the effect an outlier can have on the seasonally adjusted data. In each case you can clearly see a large spike in the seasonally adjusted plot.

#make outlier for 57th observation by adding 500 
pl1 <- plastics
pl1[57] <- pl1[57]+500

y=ts(pl1, frequency = 12)
fitlast <- stl(y, s.window = 12)
fitlastadj <- seasadj(fitlast)
plot(fitlastadj)

#make outlier for 30th observation by adding 500
pl2 <- plastics
pl2[30] <- pl2[30]+500

y=ts(pl2, frequency = 12)
fitlast <- stl(y, s.window = 12)
fitlastadj <- seasadj(fitlast)
plot(fitlastadj)

Looking at the seasonally adjusted plots for both of these data sets, we can see that the outliers have the same effect in their respective locations. Otherwise, the plots are very similar to the original seasonally adjusted plot of the sales data.

2g.

fits <- rwf(seaadj, drift = TRUE)
# plot the drift forecasts; View() only works interactively, not in a knitted document
plot(fits)

2h. Here the sales for product A are forecast by applying a naive method to the seasonally adjusted series; the forecasts are then “reseasonalized” by adding the seasonal component back in.

fit <- stl(plastics, t.window = 15, s.window = "periodic", robust = TRUE)
eeadj <- seasadj(fit)
plot(naive(eeadj), xlab = "Years", ylab = "Sales (in thousands)")

fcast <- forecast(fit, method= "naive")
plot(fcast, xlab = "Years", ylab = "Sales (in thousands)")
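Roughly what forecast(fit, method = "naive") does here (a sketch; fc_adj and seas are illustrative names): forecast the seasonally adjusted series naively, then add back the seasonal component, which repeats with period 12:

fc_adj <- naive(eeadj, h = 12)      # naive forecast of the adjusted series
seas <- tail(seasonal(fit), 12)     # last year of the (periodic) seasonal component
reseasonalized <- fc_adj$mean + as.numeric(seas)
reseasonalized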

3

3a. Looking at the decomposition of the data, we can conclude that there is an overall increasing trend in the Australian labor force. There are also clear signs of seasonality, with a recurring pattern each year. The seasonal sub-series plot helps distinguish this: the labor force tends to drop in January and from April through August, and it increases in March and December.

3b. Yes, you can see it in the time series plot: there is a noticeable drop in the 1991/1992 period. A smaller dip is also visible in the trend panel, indicating something happened during that time, and the remainder values are very large then as well.