dat <- read.csv("C:/Users/irisd/Desktop/MA611.R/export.csv", header = T)
dat.ts <- ts(dat[,1], start = c(2010,1), end = c(2018,2), frequency = 12)
plot(dat.ts, main="Time Series Plot", ylab="Quantity")
plot(aggregate(dat.ts), ylab="Quantity", main="Annual Series")
boxplot(dat.ts~cycle(dat.ts))
plot(dat.ts[1:98],dat.ts[2:99], main = "Lag 1 Scatter Plot")
plot(dat.ts[1:98],dat.ts[13:110], main = "Lag 12 Scatter Plot")
1.The monthly time series plot clearly shows a upward trend and does not seem to have a constant mean, so is non-stationary. The seasonal effect tends to increase as the trend increases, so for further steps, we may consider a multiplicative decomposition.
2.The aggregated annual series particularly shows the tendency of quantities of products exported. As we can see, the quantity exported has a dropin 2011 and a sharp increase in 2012. The trend could be a important factor in this sereis.
3.The box plot is drawn by aggregating each month from 2010 to 2018. There are variations exist, with high median quantity exported in frist three months, from January to March, and in last three months, from October to Deccember. The seasonal effect exists. Median values from each box form a curved pattern.
4.The lag 1 scatter plot shows a positive relation. The lag 12 scatter plot also shows a positive relation. Both plot indicate an autocorrelation.
mult <- decompose(dat.ts, type = "mult")
plot(mult)
mult.res <- dat.ts - window(mult$trend*mult$seasonal, start=c(2010,7), end=c(2017,8))
mult.rmse <- sqrt(sum(mult.res^2)/length(mult.res))
print(paste("The RMSE for Multiplicative Model is:",mult.rmse))
## [1] "The RMSE for Multiplicative Model is: 28.38449708952"
addit <- decompose(dat.ts, type = "additive")
plot(addit)
addit.res <- dat.ts - window(addit$seasonal+addit$trend, start=c(2010,7), end=c(2017,8))
addit.rmse <- sqrt(sum(addit.res^2)/length(addit.res))
print(paste("The RMSE for Additive Model is:",addit.rmse))
## [1] "The RMSE for Additive Model is: 28.4849314445247"
From the time series plot, since seasonal tends to increse as the trend increase, as I said in Part 1, a multiplicative model be appropriate. After checking the RMSE of both multiplicative model and additive model, multiplicative model has lower RMSE. Therefore, the multiplicative model is better.
acf(mult.res, main = "Residual ACF Plot on Multiplicative Model")
Residual ACF plot on multiplicative model does not perform well and forms a damped sinusoid shape. Lag 1, 3, 4, 5, 6, 8, 9, 10, 12, 13, 17 and 18 are all beyond confidence limits and significant. We expect all lags are within confidence limits and residuals behave like white noise, but residuals from multiplicative model does not meet what we expected.
acf(addit.res, main = "Residual ACF Plot on Additive Model")
Residual ACF plot on additive model also forms a damped sinusoid shape. Lag 1, 3, 4, 5, 6, 9, 10, 13, 17 and 18 are all significant. We expect no lags are significant except lag 0 but apparently, this plot does not satisfy the requirement.
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
tim <- time(dat.ts)
cyc <- cycle(dat.ts)
d <- coredata(dat.ts)
dat.mod <- lm(d~0+tim+factor(cyc))
summary(dat.mod)
##
## Call:
## lm(formula = d ~ 0 + tim + factor(cyc))
##
## Residuals:
## Min 1Q Median 3Q Max
## -84.348 -24.220 -1.786 24.237 79.559
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## tim 18.745 1.605 11.68 <2e-16 ***
## factor(cyc)1 -37513.485 3233.205 -11.60 <2e-16 ***
## factor(cyc)2 -37504.817 3233.339 -11.60 <2e-16 ***
## factor(cyc)3 -37486.052 3232.673 -11.60 <2e-16 ***
## factor(cyc)4 -37549.970 3232.807 -11.62 <2e-16 ***
## factor(cyc)5 -37536.762 3232.941 -11.61 <2e-16 ***
## factor(cyc)6 -37555.855 3233.075 -11.62 <2e-16 ***
## factor(cyc)7 -37559.366 3233.208 -11.62 <2e-16 ***
## factor(cyc)8 -37529.217 3233.342 -11.61 <2e-16 ***
## factor(cyc)9 -37533.457 3233.476 -11.61 <2e-16 ***
## factor(cyc)10 -37505.561 3233.610 -11.60 <2e-16 ***
## factor(cyc)11 -37510.002 3233.744 -11.60 <2e-16 ***
## factor(cyc)12 -37511.558 3233.877 -11.60 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37.31 on 85 degrees of freedom
## Multiple R-squared: 0.9784, Adjusted R-squared: 0.9751
## F-statistic: 296.8 on 13 and 85 DF, p-value: < 2.2e-16
rmse <- sqrt(sum((dat.ts - dat.mod$fitted.values)^2)/length((dat.ts - dat.mod$fitted.values)))
print(paste("The RMSE is:", rmse))
## [1] "The RMSE is: 34.7427292568714"
plot(dat.mod)
The Residuals vs Fitted plot checks two assumptions: wether sum of residuals is zero and variance of residuals is sigma^2 or constant. However, from the plot, scatter points are not forming a constant band around zero and red line tends to curve around 200, which violate two assumptions.
The normla Q-Q plot checks the normality. The scatter points are not all lying on the straight dashed line with some points away from the line at head part, but overall, the violation of assumption is not apparent.
hist(resid(dat.mod), main = "Histogram of Residuals", xlab = "Residuals")
acf(resid(dat.mod))
log.dat <- log(dat.ts)
log.mod <- lm(coredata(log.dat)~0+time(log.dat)+factor(cycle(log.dat)))
summary(log.mod)
##
## Call:
## lm(formula = coredata(log.dat) ~ 0 + time(log.dat) + factor(cycle(log.dat)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.43173 -0.09773 -0.00558 0.12136 0.38541
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## time(log.dat) 8.470e-02 7.368e-03 11.50 <2e-16 ***
## factor(cycle(log.dat))1 -1.652e+02 1.484e+01 -11.13 <2e-16 ***
## factor(cycle(log.dat))2 -1.651e+02 1.484e+01 -11.13 <2e-16 ***
## factor(cycle(log.dat))3 -1.650e+02 1.484e+01 -11.12 <2e-16 ***
## factor(cycle(log.dat))4 -1.653e+02 1.484e+01 -11.14 <2e-16 ***
## factor(cycle(log.dat))5 -1.653e+02 1.484e+01 -11.14 <2e-16 ***
## factor(cycle(log.dat))6 -1.654e+02 1.484e+01 -11.14 <2e-16 ***
## factor(cycle(log.dat))7 -1.653e+02 1.484e+01 -11.14 <2e-16 ***
## factor(cycle(log.dat))8 -1.652e+02 1.484e+01 -11.13 <2e-16 ***
## factor(cycle(log.dat))9 -1.652e+02 1.484e+01 -11.13 <2e-16 ***
## factor(cycle(log.dat))10 -1.651e+02 1.484e+01 -11.12 <2e-16 ***
## factor(cycle(log.dat))11 -1.651e+02 1.484e+01 -11.12 <2e-16 ***
## factor(cycle(log.dat))12 -1.652e+02 1.484e+01 -11.13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1712 on 85 degrees of freedom
## Multiple R-squared: 0.9991, Adjusted R-squared: 0.999
## F-statistic: 7499 on 13 and 85 DF, p-value: < 2.2e-16
rmse2 <- sqrt(sum((dat.ts - exp(log.mod$fitted.values))^2)/length((dat.ts - exp(log.mod$fitted.values))))
print(paste("The RMSE is:", rmse2))
## [1] "The RMSE is: 35.0328984578512"
plot(log.mod)
The Residuals vs Fitted plot checks two assumptoins: wether sum of residuals is zero and variance of residuals is sigma^2 or constant. However, from the plot, scatter points are not forming a constant band around zero and the red line is curved between 5.4 and 5.6, which violate two assumptions.
The normla Q-Q plot checks the normality. The scatter points are not all lying on the straight dashed line but tend to curve at tail part, but overall, the fitness is acceptable. Compared to the origional scale, it seems to get improved.
hist(resid(log.mod), main = "Histogram of Log Residuals", xlab = "Residuals")
acf(resid(log.mod))
pred.time <- seq(2018+3/12,2019+2/12,by = 1/12)
pred.cycle <- seq(1:12)
pred.data <- predict(dat.mod, data.frame(tim = pred.time, cyc = pred.cycle), se.fit = T)
pred.data$fit
## 1 2 3 4 5 6 7 8
## 318.4737 328.7045 349.0311 286.6754 301.4453 283.9140 281.9650 313.6765
## 9 10 11 12
## 310.9982 340.4571 337.5776 337.5842
Since original scale has smaller RMSE, I would choose model from part a).
mult.hw <- HoltWinters(dat.ts,seasonal = "multi")
print(paste("The RMSE for multiplicative model is:", sqrt(mult.hw$SSE/(length(dat.ts)-12))))
## [1] "The RMSE for multiplicative model is: 36.511492553582"
plot(mult.hw, main = "Multiplicative Model")
addit.hw <- HoltWinters(dat.ts,seasonal = "additive")
print(paste("The RMSE for additive model is:", sqrt(addit.hw$SSE/(length(dat.ts)-12))))
## [1] "The RMSE for additive model is: 36.6306017740178"
plot(addit.hw, main = "Additive Model")
Overall fitness for both models is similar, that red line does not seem to attach to original series closely, from two plots. The RMSE for multiplicative model is smaller, so I would say a multiplicative model may be more appropriate.
predict(mult.hw, n.ahead = 12)
## Jan Feb Mar Apr May Jun Jul
## 2018 357.3755 274.9577 303.4226 267.3189 240.8673
## 2019 337.9437 330.3254
## Aug Sep Oct Nov Dec
## 2018 291.8674 289.8015 336.4556 344.8394 356.4497
## 2019
plot(dat.ts)
acf(dat.ts, lag = 60)
pacf(dat.ts)
The time series plot does not show a constant mean and therefore is not stationary.
By checking the ACF plot, all lags are significant. We confirm that differencing is necessary. Also notice that lag 12 and 24 are also significant and the whole pattern waves cyclic, which indicating that seasonal part plays a role and differencing in seasonal part is necessary as well.
PACF plot shows that lag 5 and lag 13 are significant.
dz1 <- diff(dat.ts)
plot(dz1)
acf(dz1, lag = 60)
pacf(dz1, lag = 60)
The first order difference series tends to have a constant mean and the overall performance is acceptable.
The ACF plot shows that, lag 1, 4, 12, 13, 24 and 30 are significant. For seasonal part, lag 12 and 24 are sigificant, a MA(2) may help. For non-seasonal part, it is more like an exponential decay. Let’s see PACF for further analysis.
The PACF plot shows that lag 1, 4 and 8 are significant. For seasonal part, no lags are significant. For non-seasonal part, pattern is not clear or a possibly damped sinusoid, and lag 1, 4 and 8 are significant.
Therefore, for seasonal part, I would try order (0,1,2). For non-seasonal part, may have many possible combinations.
Since both ACF and PACF for non-seasonal part are not clearly cut-off pattern, I would say try (1,1,1). So, together, try SARIMA(1,1,1)(0,1,2)s=12.
model1 <- arima(dat.ts, order = c(1,1,1), seasonal = list(order = c(0,1,2),12))
model1
##
## Call:
## arima(x = dat.ts, order = c(1, 1, 1), seasonal = list(order = c(0, 1, 2), 12))
##
## Coefficients:
## ar1 ma1 sma1 sma2
## 0.5093 -0.8842 -0.9684 -0.0314
## s.e. 0.1344 0.0847 0.4634 0.1289
##
## sigma^2 estimated as 928.7: log likelihood = -423.77, aic = 857.55
acf(resid(model1), main = "ACF of Residuals")
tsdiag(model1)
AIC for this model is 857.55.
Checking ACF of residual plot, the overall is good except lag 4 which is significant, but we can treat it as a false alarm. Therefore, we can conclude that residuals behave like white noise and autocorrelation does not exist here.
Checking standardized residuals plot, it almost forms a constand band around zero and overall performance is acceptable.
The ACF for residuals is acceptable and behave like a white noise series.
The last plot shows all P-values are above blue dashed line, which is what we expected.
Therefore, SARIMA model (1,1,1)(0,1,2)s=12 is adequate to explain the data.
Now, try auto.arima if can get a better result. The reason we use 10 for max.q is its last significant lag is 30, but I don’t intend to go that far so I use 10 as a threshold.
library(forecast)
## Warning: package 'forecast' was built under R version 3.4.4
auto.arima(dat.ts, max.p = 8, max.q = 10, d=1, D=1,max.P = 0, max.Q = 2, stationary = F, trace = F, ic="aic", stepwise = F)
## Series: dat.ts
## ARIMA(4,1,0)(0,1,1)[12]
##
## Coefficients:
## ar1 ar2 ar3 ar4 sma1
## -0.2742 -0.1895 -0.1400 -0.3064 -0.8035
## s.e. 0.1026 0.1061 0.1079 0.1052 0.1828
##
## sigma^2 estimated as 1147: log likelihood=-423.76
## AIC=859.52 AICc=860.59 BIC=874.17
arima(dat.ts, order = c(4,1,0), seas=list(order = c(0,1,1),12))
##
## Call:
## arima(x = dat.ts, order = c(4, 1, 0), seasonal = list(order = c(0, 1, 1), 12))
##
## Coefficients:
## ar1 ar2 ar3 ar4 sma1
## -0.2742 -0.1895 -0.1400 -0.3064 -0.8035
## s.e. 0.1026 0.1061 0.1079 0.1052 0.1828
##
## sigma^2 estimated as 1080: log likelihood = -423.76, aic = 859.52
The AIC is greater than SARIMA(1,1,1)(0,1,2). Therefore, the best model is SARIMA(1,1,1)(0,1,2)s=12