1 a)
dat <- read.csv("C:/Users/irisd/Desktop/MA611.R/export.csv", header = T)
dat.ts <- ts(dat[,1], start = c(2010,1), end = c(2018,2), frequency = 12)
plot(dat.ts, main="Time Series Plot", ylab="Quantity")

b)
plot(aggregate(dat.ts), ylab="Quantity", main="Annual Series")

c)
boxplot(dat.ts~cycle(dat.ts))

d)
plot(dat.ts[1:98],dat.ts[2:99], main = "Lag 1 Scatter Plot")

plot(dat.ts[1:98],dat.ts[13:110], main = "Lag 12 Scatter Plot")

e)
1.The monthly time series plot clearly shows a upward trend and does not seem to have a constant mean, so is non-stationary. The seasonal effect tends to increase as the trend increases, so for further steps, we may consider a multiplicative decomposition.

2.The aggregated annual series particularly shows the tendency of quantities of products exported. As we can see, the quantity exported has a dropin 2011 and a sharp increase in 2012. The trend could be a important factor in this sereis.

3.The box plot is drawn by aggregating each month from 2010 to 2018. There are variations exist, with high median quantity exported in frist three months, from January to March, and in last three months, from October to Deccember. The seasonal effect exists. Median values from each box form a curved pattern.

4.The lag 1 scatter plot shows a positive relation. The lag 12 scatter plot also shows a positive relation. Both plot indicate an autocorrelation.
2 a)
mult <- decompose(dat.ts, type = "mult")
plot(mult)

mult.res <- dat.ts - window(mult$trend*mult$seasonal, start=c(2010,7), end=c(2017,8))
mult.rmse <- sqrt(sum(mult.res^2)/length(mult.res))
print(paste("The RMSE for Multiplicative Model is:",mult.rmse))
## [1] "The RMSE for Multiplicative Model is: 28.38449708952"
addit <- decompose(dat.ts, type = "additive")
plot(addit)

addit.res <- dat.ts - window(addit$seasonal+addit$trend, start=c(2010,7), end=c(2017,8))
addit.rmse <- sqrt(sum(addit.res^2)/length(addit.res))
print(paste("The RMSE for Additive Model is:",addit.rmse))
## [1] "The RMSE for Additive Model is: 28.4849314445247"

From the time series plot, since seasonal tends to increse as the trend increase, as I said in Part 1, a multiplicative model be appropriate. After checking the RMSE of both multiplicative model and additive model, multiplicative model has lower RMSE. Therefore, the multiplicative model is better.

b)
acf(mult.res, main = "Residual ACF Plot on Multiplicative Model")

Residual ACF plot on multiplicative model does not perform well and forms a damped sinusoid shape. Lag 1, 3, 4, 5, 6, 8, 9, 10, 12, 13, 17 and 18 are all beyond confidence limits and significant. We expect all lags are within confidence limits and residuals behave like white noise, but residuals from multiplicative model does not meet what we expected.

acf(addit.res, main = "Residual ACF Plot on Additive Model")

Residual ACF plot on additive model also forms a damped sinusoid shape. Lag 1, 3, 4, 5, 6, 9, 10, 13, 17 and 18 are all significant. We expect no lags are significant except lag 0 but apparently, this plot does not satisfy the requirement.

3 a)
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
tim <- time(dat.ts)
cyc <- cycle(dat.ts)
d <- coredata(dat.ts)
dat.mod <- lm(d~0+tim+factor(cyc))
summary(dat.mod)
## 
## Call:
## lm(formula = d ~ 0 + tim + factor(cyc))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -84.348 -24.220  -1.786  24.237  79.559 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## tim               18.745      1.605   11.68   <2e-16 ***
## factor(cyc)1  -37513.485   3233.205  -11.60   <2e-16 ***
## factor(cyc)2  -37504.817   3233.339  -11.60   <2e-16 ***
## factor(cyc)3  -37486.052   3232.673  -11.60   <2e-16 ***
## factor(cyc)4  -37549.970   3232.807  -11.62   <2e-16 ***
## factor(cyc)5  -37536.762   3232.941  -11.61   <2e-16 ***
## factor(cyc)6  -37555.855   3233.075  -11.62   <2e-16 ***
## factor(cyc)7  -37559.366   3233.208  -11.62   <2e-16 ***
## factor(cyc)8  -37529.217   3233.342  -11.61   <2e-16 ***
## factor(cyc)9  -37533.457   3233.476  -11.61   <2e-16 ***
## factor(cyc)10 -37505.561   3233.610  -11.60   <2e-16 ***
## factor(cyc)11 -37510.002   3233.744  -11.60   <2e-16 ***
## factor(cyc)12 -37511.558   3233.877  -11.60   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.31 on 85 degrees of freedom
## Multiple R-squared:  0.9784, Adjusted R-squared:  0.9751 
## F-statistic: 296.8 on 13 and 85 DF,  p-value: < 2.2e-16
rmse <- sqrt(sum((dat.ts - dat.mod$fitted.values)^2)/length((dat.ts - dat.mod$fitted.values)))
print(paste("The RMSE is:", rmse))
## [1] "The RMSE is: 34.7427292568714"
plot(dat.mod)

  1. The Residuals vs Fitted plot checks two assumptions: wether sum of residuals is zero and variance of residuals is sigma^2 or constant. However, from the plot, scatter points are not forming a constant band around zero and red line tends to curve around 200, which violate two assumptions.

  2. The normla Q-Q plot checks the normality. The scatter points are not all lying on the straight dashed line with some points away from the line at head part, but overall, the violation of assumption is not apparent.

hist(resid(dat.mod), main = "Histogram of Residuals", xlab = "Residuals")

  1. This step is to check if residuals form a normal distribution with zero mean and sigma^2 variance. The histogram is skewed to left, indicating that residuals are not normally distributed. We expect normal residuals are symmetric around zero but here, it fails to form what we expected.
acf(resid(dat.mod))

  1. From the ACF plot, lag 1, 2, 8, 9 and 10 are significant which violate the assumption that residuals are identically independent distributed. Residuals are autocorrelated.
b)
log.dat <- log(dat.ts)
log.mod <- lm(coredata(log.dat)~0+time(log.dat)+factor(cycle(log.dat)))
summary(log.mod)
## 
## Call:
## lm(formula = coredata(log.dat) ~ 0 + time(log.dat) + factor(cycle(log.dat)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.43173 -0.09773 -0.00558  0.12136  0.38541 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## time(log.dat)             8.470e-02  7.368e-03   11.50   <2e-16 ***
## factor(cycle(log.dat))1  -1.652e+02  1.484e+01  -11.13   <2e-16 ***
## factor(cycle(log.dat))2  -1.651e+02  1.484e+01  -11.13   <2e-16 ***
## factor(cycle(log.dat))3  -1.650e+02  1.484e+01  -11.12   <2e-16 ***
## factor(cycle(log.dat))4  -1.653e+02  1.484e+01  -11.14   <2e-16 ***
## factor(cycle(log.dat))5  -1.653e+02  1.484e+01  -11.14   <2e-16 ***
## factor(cycle(log.dat))6  -1.654e+02  1.484e+01  -11.14   <2e-16 ***
## factor(cycle(log.dat))7  -1.653e+02  1.484e+01  -11.14   <2e-16 ***
## factor(cycle(log.dat))8  -1.652e+02  1.484e+01  -11.13   <2e-16 ***
## factor(cycle(log.dat))9  -1.652e+02  1.484e+01  -11.13   <2e-16 ***
## factor(cycle(log.dat))10 -1.651e+02  1.484e+01  -11.12   <2e-16 ***
## factor(cycle(log.dat))11 -1.651e+02  1.484e+01  -11.12   <2e-16 ***
## factor(cycle(log.dat))12 -1.652e+02  1.484e+01  -11.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1712 on 85 degrees of freedom
## Multiple R-squared:  0.9991, Adjusted R-squared:  0.999 
## F-statistic:  7499 on 13 and 85 DF,  p-value: < 2.2e-16
rmse2 <- sqrt(sum((dat.ts - exp(log.mod$fitted.values))^2)/length((dat.ts - exp(log.mod$fitted.values))))
print(paste("The RMSE is:", rmse2))
## [1] "The RMSE is: 35.0328984578512"
plot(log.mod)

  1. The Residuals vs Fitted plot checks two assumptoins: wether sum of residuals is zero and variance of residuals is sigma^2 or constant. However, from the plot, scatter points are not forming a constant band around zero and the red line is curved between 5.4 and 5.6, which violate two assumptions.

  2. The normla Q-Q plot checks the normality. The scatter points are not all lying on the straight dashed line but tend to curve at tail part, but overall, the fitness is acceptable. Compared to the origional scale, it seems to get improved.

hist(resid(log.mod), main = "Histogram of Log Residuals", xlab = "Residuals")

  1. This step is to check if residuals form a normal distribution with zero mean and sigma^2 variance. The histogram is not normally distributed, since normally distributed residuals shoule be symmetric around zero. However, compared with original scale, this histogram performs better.
acf(resid(log.mod))

  1. From the ACF plot, lag 1, 2, 3, 8, 9 and 10 significant which violate the assumption that residuals are identically independent distributed. Residuals are autocorrelated. This plot is pretty similar to the one in original scale, no apparent improvements shown.
c)
pred.time <- seq(2018+3/12,2019+2/12,by = 1/12)
pred.cycle <- seq(1:12)
pred.data <- predict(dat.mod, data.frame(tim = pred.time, cyc = pred.cycle), se.fit = T)
pred.data$fit
##        1        2        3        4        5        6        7        8 
## 318.4737 328.7045 349.0311 286.6754 301.4453 283.9140 281.9650 313.6765 
##        9       10       11       12 
## 310.9982 340.4571 337.5776 337.5842

Since original scale has smaller RMSE, I would choose model from part a).

4 a)
mult.hw <- HoltWinters(dat.ts,seasonal = "multi")
print(paste("The RMSE for multiplicative model is:", sqrt(mult.hw$SSE/(length(dat.ts)-12))))
## [1] "The RMSE for multiplicative model is: 36.511492553582"
plot(mult.hw, main = "Multiplicative Model")

addit.hw <- HoltWinters(dat.ts,seasonal = "additive")
print(paste("The RMSE for additive model is:", sqrt(addit.hw$SSE/(length(dat.ts)-12))))
## [1] "The RMSE for additive model is: 36.6306017740178"
plot(addit.hw, main = "Additive Model")

Overall fitness for both models is similar, that red line does not seem to attach to original series closely, from two plots. The RMSE for multiplicative model is smaller, so I would say a multiplicative model may be more appropriate.

b)
predict(mult.hw, n.ahead = 12)
##           Jan      Feb      Mar      Apr      May      Jun      Jul
## 2018                   357.3755 274.9577 303.4226 267.3189 240.8673
## 2019 337.9437 330.3254                                             
##           Aug      Sep      Oct      Nov      Dec
## 2018 291.8674 289.8015 336.4556 344.8394 356.4497
## 2019
4 a)
plot(dat.ts)

acf(dat.ts, lag = 60)

pacf(dat.ts)

The time series plot does not show a constant mean and therefore is not stationary.

By checking the ACF plot, all lags are significant. We confirm that differencing is necessary. Also notice that lag 12 and 24 are also significant and the whole pattern waves cyclic, which indicating that seasonal part plays a role and differencing in seasonal part is necessary as well.

PACF plot shows that lag 5 and lag 13 are significant.

dz1 <- diff(dat.ts)
plot(dz1)

acf(dz1, lag = 60)

pacf(dz1, lag = 60)

The first order difference series tends to have a constant mean and the overall performance is acceptable.

The ACF plot shows that, lag 1, 4, 12, 13, 24 and 30 are significant. For seasonal part, lag 12 and 24 are sigificant, a MA(2) may help. For non-seasonal part, it is more like an exponential decay. Let’s see PACF for further analysis.

The PACF plot shows that lag 1, 4 and 8 are significant. For seasonal part, no lags are significant. For non-seasonal part, pattern is not clear or a possibly damped sinusoid, and lag 1, 4 and 8 are significant.

Therefore, for seasonal part, I would try order (0,1,2). For non-seasonal part, may have many possible combinations.

Since both ACF and PACF for non-seasonal part are not clearly cut-off pattern, I would say try (1,1,1). So, together, try SARIMA(1,1,1)(0,1,2)s=12.

model1 <- arima(dat.ts, order = c(1,1,1), seasonal = list(order = c(0,1,2),12))
model1
## 
## Call:
## arima(x = dat.ts, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 2), 12))
## 
## Coefficients:
##          ar1      ma1     sma1     sma2
##       0.5093  -0.8842  -0.9684  -0.0314
## s.e.  0.1344   0.0847   0.4634   0.1289
## 
## sigma^2 estimated as 928.7:  log likelihood = -423.77,  aic = 857.55
acf(resid(model1), main = "ACF of Residuals")

tsdiag(model1)

AIC for this model is 857.55.

Checking ACF of residual plot, the overall is good except lag 4 which is significant, but we can treat it as a false alarm. Therefore, we can conclude that residuals behave like white noise and autocorrelation does not exist here.

Checking standardized residuals plot, it almost forms a constand band around zero and overall performance is acceptable.

The ACF for residuals is acceptable and behave like a white noise series.

The last plot shows all P-values are above blue dashed line, which is what we expected.

Therefore, SARIMA model (1,1,1)(0,1,2)s=12 is adequate to explain the data.

Now, try auto.arima if can get a better result. The reason we use 10 for max.q is its last significant lag is 30, but I don’t intend to go that far so I use 10 as a threshold.

library(forecast)
## Warning: package 'forecast' was built under R version 3.4.4
auto.arima(dat.ts, max.p = 8, max.q = 10, d=1, D=1,max.P = 0, max.Q = 2, stationary = F, trace = F, ic="aic", stepwise = F)
## Series: dat.ts 
## ARIMA(4,1,0)(0,1,1)[12] 
## 
## Coefficients:
##           ar1      ar2      ar3      ar4     sma1
##       -0.2742  -0.1895  -0.1400  -0.3064  -0.8035
## s.e.   0.1026   0.1061   0.1079   0.1052   0.1828
## 
## sigma^2 estimated as 1147:  log likelihood=-423.76
## AIC=859.52   AICc=860.59   BIC=874.17
arima(dat.ts, order = c(4,1,0), seas=list(order = c(0,1,1),12))
## 
## Call:
## arima(x = dat.ts, order = c(4, 1, 0), seasonal = list(order = c(0, 1, 1), 12))
## 
## Coefficients:
##           ar1      ar2      ar3      ar4     sma1
##       -0.2742  -0.1895  -0.1400  -0.3064  -0.8035
## s.e.   0.1026   0.1061   0.1079   0.1052   0.1828
## 
## sigma^2 estimated as 1080:  log likelihood = -423.76,  aic = 859.52

The AIC is greater than SARIMA(1,1,1)(0,1,2). Therefore, the best model is SARIMA(1,1,1)(0,1,2)s=12