The data set fancy (in the fma package) concerns the monthly sales figures of a shop which opened in January 1987 and sells gifts, souvenirs, and novelties. The shop is situated on the wharf at a beach resort town in Queensland, Australia. The sales volume varies with the seasonal population of tourists. There is a large influx of visitors to the town at Christmas and for the local surfing festival, held every March since 1988. Over time, the shop has expanded its premises, range of products, and staff. Study the data set carefully.

a. Produce a time plot of the data and describe the patterns in the graph. Identify any unusual or unexpected fluctuations in the time series.
plot(fancy,
     xlab="Year",
     ylab="Monthly Sales",
     main="Monthly Sales of a Shop in the Fancy Data Set")

The time plot above shows a seasonality with both spikes at every last month of the year and at every march. December having the biggest spike while march with a smaller spike. This is supported by large influx of visitors to the town at Christmas and for the local surfing festival held every March. Also, there is an increase of sales volume overtime as the shop was expanding.

b. Explain why it is necessary to take logarithms of these data before fitting a model.

It is necesssary to take logarithms of these data to visualize the seasonality while setting aside the expanding changes in the data over time.

c. Use R to fit a regression model to the logarithms of these sales data with a linear trend, seasonal dummies and a “surfing festival” dummy variable.
logFancy <- log(fancy)
festDummy <- rep(0, length(fancy))
festDummy[seq_along(festDummy)%%12 == 3] <- 1
festDummy[3] <- 0
festDummy <- ts(festDummy, frequency = 12, start = c(1987,1))
fancyData <- data.frame(logFancy, festDummy)
fittedData <- tslm(logFancy~trend + season + festDummy, data=fancy)

summary(fittedData)
## 
## Call:
## tslm(formula = logFancy ~ trend + season + festDummy, data = fancy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33673 -0.12757  0.00257  0.10911  0.37671 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.6196670  0.0742471 102.626  < 2e-16 ***
## trend       0.0220198  0.0008268  26.634  < 2e-16 ***
## season2     0.2514168  0.0956790   2.628 0.010555 *  
## season3     0.2660828  0.1934044   1.376 0.173275    
## season4     0.3840535  0.0957075   4.013 0.000148 ***
## season5     0.4094870  0.0957325   4.277 5.88e-05 ***
## season6     0.4488283  0.0957647   4.687 1.33e-05 ***
## season7     0.6104545  0.0958039   6.372 1.71e-08 ***
## season8     0.5879644  0.0958503   6.134 4.53e-08 ***
## season9     0.6693299  0.0959037   6.979 1.36e-09 ***
## season10    0.7473919  0.0959643   7.788 4.48e-11 ***
## season11    1.2067479  0.0960319  12.566  < 2e-16 ***
## season12    1.9622412  0.0961066  20.417  < 2e-16 ***
## festDummy   0.5015151  0.1964273   2.553 0.012856 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared:  0.9567, Adjusted R-squared:  0.9487 
## F-statistic:   119 on 13 and 70 DF,  p-value: < 2.2e-16
d. Plot the residuals against time and against the fitted values. Do these plots reveal any problems with the model? Explain.
plot(residuals(fittedData), type="p", ylab="Residuals")

plot(as.numeric(fitted(fittedData)), residuals(fittedData), 
xlab = "Predicted Scores", ylab = "Residuals")

The plots above does not reveal any problems with the model hence both of them have shown residuals to be random.

e. Generate boxplots of the residuals for each month. Does this reveal any problems with the model? Explain.
boxplot(residuals(fittedData)~cycle(residuals(fittedData)))

The boxplots above show wider variance towards August and September which will mean that the model somehow is missing out seasonality at this period.

f. Interpret the estimated coefficients in the model.
summary(fittedData)
## 
## Call:
## tslm(formula = logFancy ~ trend + season + festDummy, data = fancy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.33673 -0.12757  0.00257  0.10911  0.37671 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.6196670  0.0742471 102.626  < 2e-16 ***
## trend       0.0220198  0.0008268  26.634  < 2e-16 ***
## season2     0.2514168  0.0956790   2.628 0.010555 *  
## season3     0.2660828  0.1934044   1.376 0.173275    
## season4     0.3840535  0.0957075   4.013 0.000148 ***
## season5     0.4094870  0.0957325   4.277 5.88e-05 ***
## season6     0.4488283  0.0957647   4.687 1.33e-05 ***
## season7     0.6104545  0.0958039   6.372 1.71e-08 ***
## season8     0.5879644  0.0958503   6.134 4.53e-08 ***
## season9     0.6693299  0.0959037   6.979 1.36e-09 ***
## season10    0.7473919  0.0959643   7.788 4.48e-11 ***
## season11    1.2067479  0.0960319  12.566  < 2e-16 ***
## season12    1.9622412  0.0961066  20.417  < 2e-16 ***
## festDummy   0.5015151  0.1964273   2.553 0.012856 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.179 on 70 degrees of freedom
## Multiple R-squared:  0.9567, Adjusted R-squared:  0.9487 
## F-statistic:   119 on 13 and 70 DF,  p-value: < 2.2e-16
summary(fittedData$coefficients)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02202 0.39041 0.54474 1.12051 0.72788 7.61967

From the table above, all coefficients are seen to be significant except for season 3. This is because we represented the surfing festival using the dummy variable making it non significant for the model.

g. What does the Breusch-Godfrey test tell you about your model?
checkresiduals(fittedData)

## 
##  Breusch-Godfrey test for serial correlation of order up to 17
## 
## data:  Residuals from Linear regression model
## LM test = 37.954, df = 17, p-value = 0.002494

Testing the model using Breusch-Godfrey test, we have:
Null hypothesis: There is no autocorrelation among residuals.
Alternative hypothesis: There exists autocorrelation among residuals.
Significance level: 0.005
Test Statistic: LM test = 37.95, p-value = 0.002
Decision Rule: If the p-value is less than 0.05 then we can reject the null hypothesis and conclude that autocorrelation exists among the residuals.
Decision: Since p-value = 0.002 < 0.05 = significance level, then we reject the null hypothesis.
Conclusion: Autocorrelation still exists among residuals at the 5% level of significance.