Question 2a: Exploratory Data Analysis

Plot the Time Series and ACF plots. Comment on the main features, and identify what (if any) assumptions of stationarity are violated.

The seems to be two distinct trends. First the EUR seems to be strengthening from the year 2000 to around 2008. The Euro then seems to weaken from 2008 to 2015. The rate also seems somewhat volatile within these major trends. The second plot that looks at yearly averages makes these trends even clearer.

par(pin=c(3, 2))
par(mfrow=c(2,1))
plot(data,type="l")
plot(aggregate(data,FUN=mean))

Stationarity assumes that the mean , variance and autocorrelation do not change over time. The ACF plot show that the FX rate is highly correlated to its prior level. This violates an assumption of stationarity.

##52 weeks
acf(data, lag=52)

Box plot across months will give us a sense on seasonal effect.The mean looks pretty consistent during the year. (No violation) The variance seems to be a little higher in the middle of the year. However, this is inconclusive.The seasonal plot shows no large swings in the mean or variance due to seasonal effects.

par(mfrow=c(1,1))
boxplot(data~cycle(data),xlab="Week",ylab="Rate")

ggseasonplot(data, polar=TRUE) +
  ylab("Rate") +
  ggtitle("Eur$ Seasonal Plot")

Thus, a graphical interpretation of the data leads us to conclude that the data in non stationary. Let’s run a few test to confirm our visual analysis. The Augmented Dickey–Fuller Test p-value of .7253 tells us to accept the null hypotheses that the data is non stationary. The Box-Pierce test p-value near 0 indicates a time series has significant autocorrleations. Thus, both of these statistical tests agree with our assessment that the data is not stationary.

tseries::adf.test(data)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  data
## Dickey-Fuller = -1.6528, Lag order = 10, p-value = 0.7253
## alternative hypothesis: stationary
Box.test(data)
## 
##  Box-Pierce test
## 
## data:  data
## X-squared = 1009.6, df = 1, p-value < 2.2e-16

Using the differenced rate data (‘rate.dif’), plot both the Time Series and ACF plots. Comment on the main features, and identify what (if any) assumptions of stationarity are violated. Additionally comment if you believe the differenced data is more appropriate for use in analysis. Support your position with your graphical analysis.

The diff.data is much tighter. The means seems centered around 0, the variance except for a high volatility near 2008-22009 (global financial crisis) seems relatively constant. The yearly data shows no clear trends.

par(pin=c(3, 2))
par(mfrow=c(2,1))
plot(data.dif,type="l")
plot(aggregate(data.dif,FUN=mean))

Stationarity assumes that the mean , variance and autocorrelation do not change over time. The ACF plot shows that besides the global financial crisis volatility, there is really no significant autocorrelation present. This data seems better to use for time series analysis.

##52 weeks
acf(data.dif, lag=52)

Box plot across months will give us a sense on seasonal effect.The diff.data box plot data is much tighter. There are a few outliers but the variance is greatly reduced. The mean looks pretty consistent during the year. (No violation) The seasonal plot shows no large swings in the mean or variance due to seasonal effects.

par(mfrow=c(1,1))
boxplot(data.dif~cycle(data.dif),xlab="Week",ylab="Rate")

ggseasonplot(data.dif, polar=TRUE) +
  ylab("Rate") +
  ggtitle("Eur$ Seasonal Plot")

Thus, a graphical interpretation of the data leads us to conclude that the diff.data is perhaps stationary. Let’s run a few test to confirm our visual analysis. The Augmented Dickey–Fuller Test p-value of .01 means we reject the null hypotheses and that the data is stationary. The Box-Pierce test’s large p-value indicates that there are no significant autocorrleations. Thus, both of these statistical tests agree with our assessment that the data is stationary.

tseries::adf.test(data.dif)
## Warning in tseries::adf.test(data.dif): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  data.dif
## Dickey-Fuller = -9.2813, Lag order = 10, p-value = 0.01
## alternative hypothesis: stationary
Box.test(data.dif)
## 
##  Box-Pierce test
## 
## data:  data.dif
## X-squared = 50.377, df = 1, p-value = 1.268e-12

Question 2b: Trend-Seasonality Estimation

Using the original time series data, fit the following models to estimate both trend and seasonality:

Parametric Polynomial Regression & Non-parametric model

Overlay the fitted values on the original time series. Construct and plot the residuals with respect to time and ACF of residuals. Comment on how the two models fit and on the appropriateness of the stationarity assumption of the residuals. For sake of simplicity, only use Categorical Regression (ANOVA) seasonality modelling.

Just to get a sense of the data I plotted both parametric and non-parametric trends. The combined trend and seasonal model to follow.

time.pts=c(1:length(data))
time.pts=c(time.pts-min(time.pts))/max(time.pts)

x1=time.pts
x2=time.pts^2

lm.fit=lm(data~x1+x2)
data.fit.lm= ts(fitted(lm.fit),start=2000,frequency=52)

##non-parametric model
## Local Polynomial Trend Estimation
loc.fit= loess(data~time.pts)
data.fit.loc= ts(fitted(loc.fit),start=2000,frequency=52)

## Splines Trend Estimation
library(mgcv)
gam.fit= gam(data~s(time.pts))
data.fit.gam= ts(fitted(gam.fit),start=2000,frequency=52)

## Is there a trend? 
ts.plot(data,ylab="Rate")
lines(data.fit.loc,lwd=2,col="brown")
lines(data.fit.gam,lwd=2,col="red")
lines(data.fit.lm,lwd=2,col="blue")

Combined Seasonality & Trend Estimation

Overlay the fitted values on the original time series.

The non-parametric with ANOVA seasonality model (red) model seems to capture the movement of the EUR better than the parametric poly regression with ANOVA model. (green)

##seasonal anova and trend parametric poly 
lm.fit = dynlm(data~x1+x2+season(data)-1)
dd= ts(fitted(lm.fit),start=2000,frequency=52)

## seasonal anova and non-parametric model
gam.fit<- gam(data~s(time.pts)+season(data)-1)
ee<-ts(fitted(gam.fit),start=2000,frequency=52)

## plots
ts.plot(data,ylab="Rate")
lines(dd,lwd=2,col="green")
lines(ee,lwd=2,col="red")

Construct and plot the residuals with respect to time and ACF of residuals.

The residuals of both models shows persistence and remain outside of the confidence bands. It is interesting how the non-parametric model correlations go negative. The residuals on both models appear non-stationary. Neither are great choices for time series analysis.

#the residuals
resid.1<-data-dd
resid.2<-data-ee

#plot time series
par(mfrow=c(2,2))
plot(resid.1,main="Seasonal ANOVA and Trend Parametric Poly Residuals",ylab="Residuals",col="green")
acf(resid.1, lag=52)
plot(resid.2,main="Seasonal ANOVA and Trend Non-Parametric Residuals",ylab="Residuals",col="red")
acf(resid.2, lag=52)

Seasonal ANOVA and Trend Parametric Poly Residuals Summary

Residuals help us determine if our model adequately captures the information in the data. A good model will: 1) yield uncorrelated residuals and 2) the residuals will have zero mean. Correlation of residuals indicate there is information left in the residuals that should be used in or model. A non zero residual mean represents bias. It is also helpful for the residuals to have a constant variance and to be normally distributed.

The parametric model’s histogram is not normally distributed. Additionally, both the Augmented Dickey-Fuller Test and Box Peirce Test suggest the model is non stationary.

r.lm.fit<-resid(lm.fit)
gghistogram(r.lm.fit) + ggtitle("Histogram of residuals Parametric Model")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
## Don't know how to automatically pick scale for object of type ts. Defaulting to continuous.

tseries::adf.test(r.lm.fit)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  r.lm.fit
## Dickey-Fuller = -3.2255, Lag order = 10, p-value = 0.08352
## alternative hypothesis: stationary
Box.test(r.lm.fit)
## 
##  Box-Pierce test
## 
## data:  r.lm.fit
## X-squared = 987.17, df = 1, p-value < 2.2e-16

Seasonal ANOVA and Trend Non-Parametric Residuals Summary

The non parametric model’s histogram is not normally distributed. Additionally, both the Augmented Dickey-Fuller Test and Box Peirce Test suggest the model is non-stationary.

#r.gam.fit<-resid(gam.fit)
#gghistogram(r.gam.fit) + ggtitle("Histogram of residuals Non-Parametric Model")
checkresiduals(resid.2)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

tseries::adf.test(resid.2)
## Warning in tseries::adf.test(resid.2): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  resid.2
## Dickey-Fuller = -4.8482, Lag order = 10, p-value = 0.01
## alternative hypothesis: stationary
Box.test(resid.2)
## 
##  Box-Pierce test
## 
## data:  resid.2
## X-squared = 961.71, df = 1, p-value < 2.2e-16
library(tsbox)
## Warning: package 'tsbox' was built under R version 4.0.3
yyyy<- ts_xts(resid.2)
yyyy <- ts_df(yyyy)
ggqqplot(yyyy$value)

shapiro.test(yyyy$value)  ###less than .05 = not normal
## 
##  Shapiro-Wilk normality test
## 
## data:  yyyy$value
## W = 0.98524, p-value = 1.28e-08

Question 2c: Trend-Seasonality Estimation with Differenced Data

Now using the differenced time series data, construct the same type of models as you did above. Overlay the fitted values on the original time series.Construct and plot the residuals with respect to time and ACF of residuals. Comment on the two models fit and on the appropriateness of the stationarity assumption of the residuals. Additionally, comment if models built with original or differenced data appear to differ in quality of fit; which (if any) is better?

time.pts=c(1:length(data.dif))
time.pts=c(time.pts-min(time.pts))/max(time.pts)


x1=time.pts
x2=time.pts^2


lm.fit.dif = dynlm(data.dif~x1+x2+season(data.dif)-1)
dd= ts(fitted(lm.fit.dif),start=2000,frequency=52)

plot(dd)

gam.fit.dif<- gam(data.dif~s(time.pts)+season(data.dif)-1)
ee= ts(fitted(gam.fit.dif),start=2000,frequency=52)

plot(ee)

ts.plot(data.dif,ylab="Diff Data")
lines(dd,lwd=2,col="green") #parametric
lines(ee,lwd=1,col="red") ##non parametric

Construct and plot the residuals with respect to time and ACF of residuals.

#data.diff residuals
resid.1d<-data.dif-dd
resid.2d<-data.dif-ee

#plot time series
par(pin=c(5, 4))
par(mfrow=c(1,1))
plot(resid.1d,main="Seasonal ANOVA and Trend Parametric Poly Residuals",ylab="Residuals")

plot(resid.2d,main="Seasonal ANOVA and Trend Non-Parametric Residuals",ylab="Residuals")

Seasonal ANOVA and Trend Parametric Poly Residuals Summary

The histogram and the qqplot are much improved. The ACF plot is alos better. The number of observations out of the confidence bands is significantly lower and the trend is much less persistent. The test data is mixed. The Augmented Dickey-Fuller Test shows stationary data while the Boxtest is not convinced.

checkresiduals(resid.1d)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

library(tsbox)
library(ggpubr)
r1d<- ts_xts(resid.1d)
r1d <- ts_df(r1d)
library(ggpubr)
ggqqplot(r1d$value)

tseries::adf.test(r1d$value)
## Warning in tseries::adf.test(r1d$value): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  r1d$value
## Dickey-Fuller = -9.2961, Lag order = 10, p-value = 0.01
## alternative hypothesis: stationary
Box.test(r1d$value)
## 
##  Box-Pierce test
## 
## data:  r1d$value
## X-squared = 29.068, df = 1, p-value = 6.988e-08

Seasonal ANOVA and Trend Non-Parametric Residuals Summary

The residuals

checkresiduals(resid.2d)
## Warning in modeldf.default(object): Could not find appropriate degrees of
## freedom for this model.

r2d<- ts_xts(resid.2d)
r2d <- ts_df(r2d)
library(ggpubr)
ggqqplot(r2d$value)

tseries::adf.test(r2d$value)
## Warning in tseries::adf.test(r2d$value): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  r2d$value
## Dickey-Fuller = -9.3911, Lag order = 10, p-value = 0.01
## alternative hypothesis: stationary
Box.test(r2d$value)
## 
##  Box-Pierce test
## 
## data:  r2d$value
## X-squared = 28.305, df = 1, p-value = 1.036e-07
rb1<-resid(lm.fit.dif)
rb2<-resid(gam.fit.dif)
acf(rb1, lag=52)

acf(rb2, lag=52)