load packages

library(fpp2)
library(knitr)

Loading dataset from January 1990 untill December 2013 and creating a time series function

sugar_production <- read.csv("C:/ECON 4210/as2_data-1.csv")
df = sugar_production
df= ts(df, start = c(1990, 1), end = c(2013, 12), frequency = 12)
y = df[,"SUGAR"]

Part A

fit.sugar <- tslm(y ~ trend + season)
summary(fit.sugar)
## 
## Call:
## tslm(formula = y ~ trend + season)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.1715  -6.7118  -0.6345   7.3188  17.5630 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 113.064618   1.928144  58.639  < 2e-16 ***
## trend        -0.019789   0.006034  -3.280  0.00117 ** 
## season2      -1.928269   2.455349  -0.785  0.43293    
## season3      -7.460826   2.455372  -3.039  0.00261 ** 
## season4     -12.237162   2.455409  -4.984 1.10e-06 ***
## season5     -13.112056   2.455461  -5.340 1.95e-07 ***
## season6     -10.823654   2.455527  -4.408 1.50e-05 ***
## season7     -11.202361   2.455609  -4.562 7.64e-06 ***
## season8      -3.861118   2.455705  -1.572  0.11703    
## season9       2.522326   2.455816   1.027  0.30528    
## season10     14.672915   2.455942   5.974 7.13e-09 ***
## season11     16.705254   2.456083   6.802 6.44e-11 ***
## season12     16.403918   2.456239   6.678 1.33e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.506 on 275 degrees of freedom
## Multiple R-squared:  0.6314, Adjusted R-squared:  0.6153 
## F-statistic: 39.26 on 12 and 275 DF,  p-value: < 2.2e-16
autoplot(y, series="Data") +
  autolayer(fitted(fit.sugar), series="Fitted") +
  xlab("Year") + ylab("Sugar Production") +
  ggtitle("Monthly Sugar Production")

The graph above is the fitted regression line and the graph below is the scatterplot

 cbind(Data=y, Fitted=fitted(fit.sugar)) %>%
  as.data.frame() %>%
  ggplot(aes(x = Data, y = Fitted)) +
  geom_point() +
  ylab("Fitted") + xlab("Actual values") +
  ggtitle("Monthly Sugar production") +
  scale_colour_brewer(palette="Dark2", name="Month") +
  geom_abline(intercept=0, slope=1)

It is evident that there is alot of seasonality in the dataset. As a result, we can expect that the linear trend will fit well and that the model will be effective in forecasting the production. Based on the time series graph, the linear model appears to fit extremely well and there is a moderate correlation with the actual data. The fitted regression line tends to follow a trend without any signficant changes. It is important to note that there was a large drop in sugar production in 2008 and 2009. This is likely due to the economic ression that occured. Although this was the largest outlier, there are several deviations from the fitted line throughout the years. The scatterplot also indicates that there is a relationship in the data as the data points are grouped closely together. The summary determined that the R^2 is around 0.63 which further proves the moderate correlation.

Part B

checkresiduals(fitted(fit.sugar))
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

This is a very useful fuction when analyzing the residuals of the data. The first residual graph provides the same insights as the previous fitted regression graph. It shows that there is great seasonality in the data. Even in the outlier year (2008), the residuals still had a seasonal pattern. The ACF graph shows that there is certain degree of correlation between the data points and the previous data observations. It appears that the data is lagged by 12 periods (1 year). Overall, there is a lot of seasonality and autocorrelation in the data. The last graph shows that the data is not exactly normally distributed and is more spaced out. Overall, it does not seem that there is a strong linear relationship between the data due to the heavy autocorrelation. However, the data is very predictable with trend analysis.

Part C

fourier.sugar = tslm(y ~ trend + fourier(y, K = 6))
  
  (autoplot(y, series = "Data") +
          autolayer(fitted(fourier.sugar), series = "Fitted Data") +
          ylab("Sugar Production (Million Metric Tons)") + 
          ggtitle("Monthly Sugar Production")) +
    theme(plot.title = element_text(size=15, face = "bold",hjust = 0.5),
          axis.text.x = element_text(size=10),
          axis.text.y = element_text(size=10))

CV(fourier.sugar)
##           CV          AIC         AICc          BIC        AdjR2 
##   75.6729341 1247.7518389 1249.2903004 1299.0332856    0.6153339

Fourier.sugar, K =1

CV AIC AICc BIC 81.9233498 1271.0368293 1271.2495953 1289.3516317 AdjR2 0.5701126

Fourier.sugar, K =2

CV AIC AICc BIC 76.3083742 1250.5680808 1250.9680808 1276.2088042 AdjR2 0.6023272

Fourier.sugar, K =3

CV AIC AICc BIC 76.2618404 1250.3262378 1250.9737198 1283.2928821 AdjR2 0.6053425

Fourier.sugar, K =4

CV AIC AICc BIC 76.3203403 1250.4282919 1251.3848137 1290.7208572 AdjR2 0.6078469

Fourier.sugar, K =5

CV AIC AICc BIC 76.7772621 1252.0029962 1253.3314633 1299.6214824 AdjR2 0.6083176

Fourier.sugar, K =6

CV AIC AICc BIC 75.6729341 1247.7518389 1249.2903004 1299.0332856 AdjR2 0.6153339

K=6 has the lowest CV and AIC. The means that when K=6, the model is more complex and has more explanatory power than the other models. As you decrease the parameter K, the model gets simpler and the explanatory power decreases.

Part D

fourier.sugar6 <- tslm(y ~ trend + fourier(y, K = 6))

cbind(Data=y, Fitted=fitted(fourier.sugar6)) %>%
  as.data.frame() %>%
  ggplot(aes(x = Data, y = Fitted)) +
  geom_point() +
  ylab("Fitted") + xlab("Actual values") +
  ggtitle("Monthly Sugar production") +
  scale_colour_brewer(palette="Dark2", name="Month") +
  geom_abline(intercept=0, slope=1)

The harmonic model is practically identical to the linear model discussed previously. The R^2 value is the same in both regressions and it indicates a moderate correlation.

Part E

checkresiduals(fitted(fourier.sugar6))
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

The residual graphs are very similar to the residual graphs in the linear regression. For instance, the ACF still shows high seasonality and autocorrelation and there is still little normality in the data. I believe that the strong seasonality in the data caused the linear regression and Fourier trend to be the same. If there was not as much seasonality, the trends would likely differ.

Part F

#linear trend#
forecast1 <- forecast(fit.sugar, h=68)
autoplot(forecast1) +
  ggtitle("Forecast of Monthly Sugar Production") +
  xlab("Year") + ylab("Million Metric Tons")

#Harmonic regression#
forecast2 <- forecast(fourier.sugar6, newdata=data.frame(fourier(y,6,68)))
autoplot(forecast2) +
  ggtitle("Forecast of Monthly Sugar Production") +
  xlab("Year") + ylab("Million Metric Tons")

Part G

variable = ts(sugar_production, start = c(1990, 1), end = c(2013,12))
data = variable[, "SUGAR"]
test = window(data, start = c(2012, 1), end = c(2013, 12))

accuracy(forecast1, test)
##                         ME     RMSE      MAE        MPE     MAPE      MASE
## Training set  8.428279e-16 8.311372 7.129457 -0.6198073 6.693746 1.2400496
## Test set     -3.469276e+00 3.574239 3.469276 -3.6486927 3.648693 0.6034224
##                    ACF1 Theil's U
## Training set  0.8934813        NA
## Test set     -0.4113133 0.9143451
accuracy(forecast2, test)
##                         ME     RMSE      MAE        MPE     MAPE      MASE
## Training set -1.377618e-15 8.311372 7.129457 -0.6198073 6.693746 1.2400496
## Test set     -3.469276e+00 3.574239 3.469276 -3.6486927 3.648693 0.6034224
##                    ACF1 Theil's U
## Training set  0.8934813        NA
## Test set     -0.4113133 0.9143451

The linear and harmonic regression are the exact same and the statistics indicate that the forecast accuracy is the same. We can see that the MAPE for the test set is less than 4% which indicates that the models are likely a good fit to the data. The MASE is also very low which is a positive sign.

Part H

autoplot(y, series = "Data") +  
  autolayer(forecast1, series="Linear", PI=FALSE) + 
  autolayer(forecast2, series="Harmonic", PI=FALSE) +
  labs(x = "Year", y = "SUGAR (Million Metric Tons)") +
  theme(plot.title = element_text(hjust = 0.5))+
  theme(plot.subtitle = element_text(hjust = 0.5))+
  ggtitle("Monthly Sugar Production (Forecast)") +
  guides(colour=guide_legend(title="Forecast"))

The harmonic and linear forecasts overlap since they are identical to each other. Both forecasts are capable of accurately forecasting the data since the data has a lot of seasonality. The forecasts will be able to predict the future trend but will not be able to catch outliers such as the 2008 recession as well as other large trends in the economy.