Time Series Analysis: Lab 1

Student 1: Allan Gholmi

Student 2: Ngo Trong Trung

1. Visualization, detrending and residual analysis.

Load Rhine.csv from the course web page:

rhine <- read.csv("https://www.ida.liu.se/~732A34/info/Rhine.csv", header=T, sep=";", dec=",")
attach(rhine)
head(rhine)
##   Year Month     Time TotN_conc
## 1 1989     1 1989.042  6.041350
## 2 1989     2 1989.125  7.644673
## 3 1989     3 1989.208  6.242788
## 4 1989     4 1989.292  6.130887
## 5 1989     5 1989.375  5.443721
## 6 1989     6 1989.458  5.834021

b) Fit a linear model to the data. Is there a significant time trend? …

Fit a linear model to the data:

# Fit Time series
rhineTS <- ts(rhine$TotN_conc, start=c(rhine$Year[1], rhine$Month[1]), freq=12)
plot(rhineTS, ylab="Total N concentration", main="Total nitrogen in Rhine 1989-2002")

m1 <- lm(rhineTS ~ time(rhineTS))
lines(ts(fitted(m1),start=c(rhine$Year[1],rhine$Month[1]),freq=12),col=2, lwd=2)             
legend("topright", col=c("black", "red"), c("Original", "Trend"), lty=1, lwd=c(1,2))

summary(m1)
## 
## Call:
## lm(formula = rhineTS ~ time(rhineTS))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.75325 -0.65296  0.06071  0.52453  2.01276 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   430.69835   31.26505   13.78   <2e-16 ***
## time(rhineTS)  -0.21355    0.01566  -13.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8205 on 166 degrees of freedom
## Multiple R-squared:  0.5282, Adjusted R-squared:  0.5254 
## F-statistic: 185.9 on 1 and 166 DF,  p-value: < 2.2e-16

As stated in assignment 1a), there is a decreasing trend.

The summary shows that both coefficients, the intercept and the time variable of rhineTS, are significant since their p-values are below 0.05.

Next, we examine the residuals of the model:

# Plot (standardized) residuals and their ACF
res = rstudent(m1)
plot(res,type="l", xlab="Time", main="Residual plot of model 1", ylab="Residuals")
abline(h=0,col="red", lty=2)

Ideally the residuals should show a random pattern, which is clearly not the case here: a seasonal pattern is still visible.

acf(res, lag.max=100, main="ACF of residuals for model 1")

The sample ACF of the residuals follows a cosine-like pattern, going up and down, which indicates seasonality in the data. Take for example lags 6 and 18: interpreted as months they are exactly one year apart, so they correspond to the same month of the year and behave similarly. Several spikes also fall outside the confidence bands.
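The seasonality can also be checked in the frequency domain. A minimal sketch using base R's spectrum() (our own addition, not part of the lab template); with monthly data, an annual cycle should show up as a peak near frequency 1/12 in cycles per observation:

# Sketch: raw periodogram of the model 1 residuals; a seasonal component of
# period 12 appears as a peak near frequency 1/12.
spectrum(res, main="Periodogram of model 1 residuals")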

(c) Now, include also seasonal dummy variables to obtain a seasonal means model …

library(TSA)  # provides season() and, later, zlag() and the robot/days datasets
month = season(rhineTS)
m2 = lm(rhineTS ~ month + time(rhineTS))

plot(rhineTS, ylab="Total Nitrogen", main="With seasonal dummy variables, model 2", lwd=1)
lines(ts(fitted(m2),start=c(rhine$Year[1],rhine$Month[1]),freq=12),col=2, lwd=2)
legend("topright", col=c("black", "red"), c("Original", "Fitted"), lty=1, lwd=c(1,2))

Here we have fitted the seasonal means model with a linear trend and plotted the fitted values together with the original data; the fit follows the original series closely.

summary(m2)
## 
## Call:
## lm(formula = rhineTS ~ month + time(rhineTS))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6101 -0.3119  0.0082  0.3115  1.3646 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    420.81879   20.17161  20.862  < 2e-16 ***
## monthFebruary    0.27659    0.19962   1.386  0.16788    
## monthMarch       0.04006    0.19963   0.201  0.84121    
## monthApril      -0.34643    0.19964  -1.735  0.08468 .  
## monthMay        -0.86165    0.19965  -4.316 2.82e-05 ***
## monthJune       -1.26114    0.19967  -6.316 2.70e-09 ***
## monthJuly       -1.60808    0.19969  -8.053 2.00e-13 ***
## monthAugust     -1.71242    0.19971  -8.575 9.56e-15 ***
## monthSeptember  -1.23669    0.19974  -6.192 5.11e-09 ***
## monthOctober    -0.87446    0.19977  -4.377 2.20e-05 ***
## monthNovember   -0.75127    0.19980  -3.760  0.00024 ***
## monthDecember   -0.17745    0.19984  -0.888  0.37594    
## time(rhineTS)   -0.20824    0.01011 -20.601  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5282 on 155 degrees of freedom
## Multiple R-squared:  0.8175, Adjusted R-squared:  0.8034 
## F-statistic: 57.85 on 12 and 155 DF,  p-value: < 2.2e-16

Above is the summary of the linear model with seasonal dummy variables. Many variables are significant (p-value below 0.05) and the R-squared is high, meaning that the model explains 81.75 % of the total variance of the data.
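Since m1 is nested in m2, the joint contribution of the seasonal dummies can also be tested with an F-test. A minimal sketch using base R's anova() (our own addition):

# Sketch: F-test of the seasonal dummies, comparing the pure trend model m1
# against the seasonal means model m2 (m1 is nested in m2).
anova(m1, m2)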

# Plot (standardized) residuals, ACF, QQ-plot
res = rstudent(m2)
par(mfrow=c(1,2))
plot(res, type="l", xlab="Time", ylab="Residuals", main="Residuals of model 2")
acf(res, lag.max=100, main="ACF of residuals for model 2")

With R's default number of lags (about 25), no seasonality could be seen; the ACF just looked random. By increasing lag.max, however, a weak seasonal pattern can be distinguished.

qqnorm(res, main="QQ plot of residuals for model 2")
qqline(res, col="red", lwd=2)
legend("topleft", col=c("black", "red"), c("Theoretical", "Actual"), lwd=2)

In the Q-Q plot of the residuals we want the points to lie as close to the red line as possible, which they largely do. The residuals could therefore be modeled as Gaussian, since they follow the red line quite closely.
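The Gaussian impression can also be checked numerically; a short sketch using the Shapiro-Wilk test from base R (our addition; a large p-value is consistent with normality):

# Sketch: Shapiro-Wilk normality test of the studentized residuals.
shapiro.test(res)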

(d) Pick a different time series data set from the TSA-package …

We chose the dataset robot, a time series of the distance of a robot from a desired position. It records the final position in the x direction of an industrial robot put through the same series of planned exercises many times.

data(robot)
plot(robot, xlab="Time", ylab="Distance", main="Distance of a robot", pch=20)

It is difficult to see a decisive linear trend in distance over time, since the series goes up and down around a relatively constant level. Some seasonal-looking behaviour can, however, be seen.

# Plot Yt-1
plot(y=robot,x=zlag(robot),ylab=expression(Y[t]),
     xlab=expression(Y[t-1]),type='p', main="Distance of a robot Yt-1", pch=20)

In the lag-1 plot a faint, almost linear band of points can be seen, but there are too few observations to tell the big picture. Apart from that distinct band the observations look scattered, though they follow a weak increasing linear trend.

# Plot Yt-6
plot(y=robot,x=zlag(robot,6),ylab=expression(Y[t]),
     xlab=expression(Y[t-6]),type='p', main="Distance of a robot Yt-6", pch=20)

Here too a weak increasing trend can be seen, although the observations are more scattered than in the previous plot. The correlation appears to be relatively low.

# Plot Yt-12
plot(y=robot,x=zlag(robot,12),ylab=expression(Y[t]),
     xlab=expression(Y[t-12]),type='p', main="Distance of a robot Yt-12", pch=20)

At lag 12 the correlation is very low and no trend can be seen; the observations look scattered.
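To put numbers on the visual impressions from the three lag plots, the lag correlations can be computed directly; a sketch using TSA's zlag() and complete-case correlations:

# Sketch: numeric correlations behind the lag plots at lags 1, 6 and 12.
sapply(c(1, 6, 12), function(k) cor(robot, zlag(robot, k), use="complete.obs"))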

# Extract largest distance
robot[which.max(robot)]
## [1] 0.0083
# Fit Time series
robotTS <- ts(robot, start=1, freq=1)
plot(robotTS, ylab="Distance", main="Distance of a robot")

m1 <- lm(robotTS ~ time(robotTS))
lines(ts(fitted(m1),start=1,freq=1),col=2, lwd=2)             
legend("topright", col=c("black", "red"), c("Original", "Trend"), lty=1, lwd=c(1,2))

The fitted linear model reveals a slight decreasing trend over time.

summary(m1)
## 
## Call:
## lm(formula = robotTS ~ time(robotTS))
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0075230 -0.0017577 -0.0000073  0.0016260  0.0063351 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.205e-03  2.769e-04  11.576  < 2e-16 ***
## time(robotTS) -1.079e-05  1.477e-06  -7.308 2.15e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002486 on 322 degrees of freedom
## Multiple R-squared:  0.1423, Adjusted R-squared:  0.1396 
## F-statistic: 53.41 on 1 and 322 DF,  p-value: 2.149e-12

Both coefficients, the intercept and the time variable, are significant since their p-values are below 0.05. However, the multiple R-squared of 0.1423 means that the model only explains about 14 % of the total variation in the data, so a linear model is not sufficient for this kind of data.

# Plot (standardized) residuals, ACF, QQ-plot
res = rstudent(m1)
plot(res,type="l", xlab="Time", main="Residual plot of model 1", ylab="Residuals")
abline(h=0,col="red", lty=2)

There seems to be a recurring up-and-down pattern, which means the residuals are not random: the linear model leaves structure in the data unexplained.

acf(res, lag.max=100, main="ACF of residuals for model 1")

A few spikes fall outside the confidence bands here and there, but there are no clear signs of seasonality.

qqnorm(res, main="QQ plot of residuals for model 1")
qqline(res, col="red", lwd=2)
legend("topleft", col=c("black", "red"), c("Theoretical", "Actual"), lwd=2)

The residuals follow the theoretical (red) line very closely, meaning that they can be modeled as Gaussian.

2. Simulation of ARMA-processes

a) Determine whether the models are stationary by solving the AR characteristic equation for each model.

Initialize the parameters:

M1 = c(0.8, 0)
M2 = c(-0.5, 0.5)
M3 = c(0, -0.64)

The stationarity conditions that follow from the AR(2) characteristic equation, \(\phi_1 + \phi_2 < 1\), \(\phi_2 - \phi_1 < 1\) and \(|\phi_2| < 1\), are implemented in the following function:

isAR2Stationary = function(phi1, phi2) {
  # phi2 < 1 is implied by the first two conditions, so checking phi2 > -1 suffices
  return ((phi1 + phi2) < 1 && phi2 - phi1 < 1 && phi2 > -1)
}

The results:

cat('M1 is stationary: ',isAR2Stationary(M1[1],M1[2]))
## M1 is stationary:  TRUE
cat('M2 is stationary: ',isAR2Stationary(M2[1],M2[2]))
## M2 is stationary:  FALSE
cat('M3 is stationary: ',isAR2Stationary(M3[1],M3[2]))
## M3 is stationary:  TRUE
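Equivalently, the characteristic equation \(1 - \phi_1 x - \phi_2 x^2 = 0\) can be solved directly: the process is stationary exactly when all roots lie outside the unit circle. A minimal sketch using base R's polyroot() (our own cross-check, not required by the lab):

# Sketch: root-based stationarity check. polyroot() takes coefficients in
# increasing powers, so 1 - phi1*x - phi2*x^2 becomes c(1, -phi1, -phi2);
# trailing zero coefficients are dropped so the degree is well defined.
isAR2StationaryRoots = function(phi1, phi2) {
  coefs = c(1, -phi1, -phi2)
  coefs = coefs[1:max(which(coefs != 0))]  # e.g. when phi2 = 0
  return (all(Mod(polyroot(coefs)) > 1))
}
isAR2StationaryRoots(M2[1], M2[2])  # FALSE: a root lies on the unit circle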

b) Compute the theoretical autocorrelation function for the models M1−M3 …

Using the Yule-Walker equations we have \[ \rho_k = \phi_1 \rho_{k-1} + \phi_2 \rho_{k-2}, \qquad k \geq 2, \] starting from \(\rho_0 = 1\); the \(k = 1\) equation reads \(\rho_1 = \phi_1 + \phi_2 \rho_1\), so \(\rho_1 = \phi_1/(1-\phi_2)\). An R implementation of the theoretical ACF of an AR(2):

acfAR2 = function(T,phi1,phi2){
  r = rep(1,T)
  r[2] = phi1/(1-phi2)  # rho_1 = phi1/(1-phi2) from the k = 1 Yule-Walker equation
  for (i in 3:T) {
    r[i] = phi1*r[i-1] + phi2*r[i-2]
  }
  return (r)
}

For models M1, M2 and M3 we have:

cat("Theoritical ACF of M1: ",acfAR2(4, M1[1], M1[2]))
## Theoritical ACF of M1:  1 0.8 0.64 0.512
cat("Theoritical ACF of M2: ",acfAR2(4, M2[1], M2[2]))
## Theoritical ACF of M2:  1 -0.5 0.75 -0.625
cat("Theoritical ACF of M3: ",acfAR2(4, M3[1], M3[2]))
## Theoritical ACF of M3:  1 0 -0.64 0
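For the stationary models the recursion can be cross-checked against R's built-in ARMAacf() from the stats package (a sketch; M2 is excluded since ARMAacf() assumes a stationary AR part):

# Sketch: stats::ARMAacf should reproduce acfAR2 for the stationary models.
ARMAacf(ar = M1, lag.max = 3)  # expect 1, 0.8, 0.64, 0.512
ARMAacf(ar = M3, lag.max = 3)  # expect 1, 0, -0.64, 0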

For an AR(2) process the theoretical marginal variance is \[ \sigma^2 = \frac{(1-\phi_2)\,\sigma_e^2}{(1+\phi_2)(1-\phi_1-\phi_2)(1+\phi_1-\phi_2)}, \qquad \sigma_e^2 = 1. \] Implementation in R:

varAR2 = function(phi1,phi2){
  return ((1-phi2)/((1+phi2)*(1-phi1-phi2)*(1+phi1-phi2)))
}
cat('Model M1 has var=',varAR2(M1[1],M1[2]))
## Model M1 has var= 2.777778
cat('Model M3 has var=',varAR2(M3[1],M3[2]))
## Model M3 has var= 1.693767

M1 has a higher variance than M3, which is clearly illustrated in (c), where the M1 realization fluctuates more widely than M3.
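The theoretical variances can also be sanity-checked by simulation; a sketch using arima.sim() with a long realization (for M1 we pass only \(\phi_1\), since \(\phi_2 = 0\)):

# Sketch: empirical marginal variances; expect roughly 2.78 and 1.69.
set.seed(1)
var(arima.sim(n = 1e5, model = list(ar = M1[1])))
var(arima.sim(n = 1e5, model = list(ar = M3)))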

c) Write your own AR(2)-simulator function in R:

The implementation of the AR2sim function:

AR2sim = function(T, phi1, phi2) {
  e = rnorm(T)                     # standard normal noise, sigma_e = 1
  ar2 = rep(0, T)
  ar2[1] = e[1]                    # initialize the recursion from the noise
  ar2[2] = e[2] + phi1 * e[1]
  for (t in 3:T) {
    ar2[t] = phi1 * ar2[t - 1] + phi2 * ar2[t - 2] + e[t]
  }
  return (ar2)
}
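Note that the recursion is started from the noise itself, so the first few values are not draws from the stationary distribution. A common remedy is a burn-in period; a minimal sketch of such a variant (our own addition, only meaningful for stationary models):

# Sketch: simulate T + burn values and discard the first 'burn' ones, so the
# returned series is closer to the stationary distribution.
AR2simBurn = function(T, phi1, phi2, burn = 100) {
  x = AR2sim(T + burn, phi1, phi2)
  return (x[(burn + 1):(burn + T)])
}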

A first simulation with T = 100:

T = 100
m1 = AR2sim(T, M1[1], M1[2])
m2 = AR2sim(T, M2[1], M2[2])
m3 = AR2sim(T, M3[1], M3[2])

par(mfrow=c(3,2))
plot(m1,pch='.',main='Model M1'); acf(m1,lag.max=T,main='ACF of M1')
plot(m2,pch='.',main='Model M2'); acf(m2,lag.max=T,main='ACF of M2')
plot(m3,pch='.',main='Model M3'); acf(m3,lag.max=T,main='ACF of M3')

These plots are consistent with the stationarity conclusions from (a).

The M2 realization tends to spread wider over time and, as a result, its sample ACF changes over time, whereas M1 and M3 have stable sample ACFs that die out.

d) Test producing a few realizations with your function …

We selected three new AR(2) models for presentation:

M4 = c(0.1, -0.4) # stationary
M5 = c(-0.4, -1.) # not stationary
M6 = c(-0.4, -2.) # not stationary

First we set T = 100:

T = 100
m4 = AR2sim(T, M4[1], M4[2])
m5 = AR2sim(T, M5[1], M5[2])
m6 = AR2sim(T, M6[1], M6[2])

par(mfrow=c(3,2))
plot(m4,pch='.',main='Model M4'); acf(m4,lag.max=T,main='ACF of M4')
plot(m5,pch='.',main='Model M5'); acf(m5,lag.max=T,main='ACF of M5')
plot(m6,pch='.',main='Model M6'); acf(m6,lag.max=T,main='ACF of M6')

As M4 is stationary, the points it generates spread homoscedastically around zero.

The M5 realization, on the other hand, becomes wider as the process continues; hence its variance changes over time.

M6 is the most extreme case: the process explodes so quickly that almost all visible variation is concentrated at the end, illustrating a very large change of variance over time.

Then we set T = 500:

T = 500
m4 = AR2sim(T, M4[1], M4[2])
m5 = AR2sim(T, M5[1], M5[2])
m6 = AR2sim(T, M6[1], M6[2])

par(mfrow=c(3,2))
plot(m4,pch='.',main='Model M4'); acf(m4,lag.max=T,main='ACF of M4')
plot(m5,pch='.',main='Model M5'); acf(m5,lag.max=T,main='ACF of M5')
plot(m6,pch='.',main='Model M6'); acf(m6,lag.max=T,main='ACF of M6')

When we increase the number of samples, the character of each process is revealed more clearly. As M5 and M6 are not stationary, their sample ACFs change over time, whereas M4's sample ACF is very stable and stays within the 95% confidence bands.

e) Simulate realizations also from MA(1), MA(2) and ARMA(1,1)

First, for the MA(1) and MA(2) models we select the following sets of parameters:

ma1_params = c(0.2, 1, 10)
ma2_params = list(c(0.2,0.2), c(1,0.5), c(0.2,10))
T = 300

Simulation for MA(1):

par(mfcol=c(2,3))
for (i in ma1_params) {
  tmp = arima.sim(n=T,model=list(ma=c(i))) 
  plot(tmp,main=paste('Model MA(',i,')'))
  acf(tmp ,lag.max = T,main=paste('ACF of MA(',i,')'))
}

We can see that the sample ACF of every MA(1) process cuts off after lag 1.

Simulation for MA(2):

par(mfcol=c(2,3))
for (i in ma2_params) {
  tmp1 = arima.sim(n=T,model=list(ma=i)) 
  plot(tmp1,main=paste('Model MA(',i[1],',',i[2],')'))
  acf(tmp1 ,lag.max = T,main=paste('ACF of MA(',i[1],',',i[2],')'))
}

We can see that the sample ACFs of the MA(2) processes cut off after lag 2. In general, the ACF of an MA(q) process is zero beyond lag q: \[ \rho_k = 0 \quad \text{for } k > q. \]
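The cutoff can be verified from the closed-form MA ACF. A minimal sketch with our own helper (using R's sign convention \(Y_t = e_t + \theta_1 e_{t-1} + \dots + \theta_q e_{t-q}\), the same one arima.sim() uses):

# Sketch: theoretical ACF of an MA(q), rho_k = sum_i theta_i*theta_{i+k} /
# sum_i theta_i^2 with theta_0 = 1; zero for all k > q.
maAcf = function(theta, lag.max) {
  th = c(1, theta)
  sapply(0:lag.max, function(k) {
    if (k >= length(th)) return(0)
    sum(th[1:(length(th) - k)] * th[(1 + k):length(th)]) / sum(th^2)
  })
}
maAcf(c(0.2, 0.2), 4)  # cuts off after lag 2, as claimed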

Simulation for ARMA(1,1): the parameters below were chosen to illustrate the respective roles of the AR and MA parts in an ARMA process.

arma11_params = list(c(0.5,0.5),c(0.1,10),c(0.99,0.1))
par(mfcol=c(2,3))
for (i in arma11_params) {
  tmp1 = arima.sim(n=T,model=list(ar=c(i[1]),ma=c(i[2]))) 
  plot(tmp1, main=paste('Model ARMA(',i[1],',',i[2],')'))
  acf(tmp1 ,lag.max = T,main=paste('ACF of ARMA(',i[1],',',i[2],')'))
}

For the balanced parameters (0.5, 0.5) the process shows characteristics of both the AR and the MA part: the sample ACF dies out after a couple of lags and then keeps fluctuating within the 95% confidence bands.

If the MA parameter is much larger than the AR parameter (\(\theta \gg \phi\)), the process looks more like a pure moving-average process.

On the other hand, if the AR parameter dominates (\(\phi \gg \theta\), here \(\phi = 0.99\)), the realization wanders far from its mean and looks non-stationary, even though \(\phi < 1\) means the process is technically still stationary; it behaves almost like a random walk.
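The slow decay in the last case can be made precise with the theoretical ACF; a sketch using stats::ARMAacf():

# Sketch: with phi = 0.99 the theoretical ACF decays roughly like 0.99^k,
# i.e. extremely slowly, which is why the realization looks non-stationary.
round(ARMAacf(ar = 0.99, ma = 0.1, lag.max = 5), 3)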

3. From non-stationary to stationary

Load the silver dataset from the course web page:

# Assignment 3a
silver <- read.csv("https://www.ida.liu.se/~732A34/info/silver.csv", header=T, sep=";", dec=",")
attach(silver)
head(silver)
##         Date    EURO
## 1 2005-01-02 4.99633
## 2 2005-01-09 4.88838
## 3 2005-01-16 5.05991
## 4 2005-01-23 5.07324
## 5 2005-01-30 5.19371
## 6 2005-02-06 5.13503

a) Interpret the time series in terms of stationarity in the mean and variance (covariance).

ts_silver <- ts(silver$EURO, start = 1, frequency = 1)
par(mfrow = c(1, 2))
plot(ts_silver, main="Weekly prices of silver \nper troy ounce in EUR", ylab="EUR")
acf(ts_silver, lag.max = nrow(silver), main="Autocorrelation plot")

In order for a time series to be stationary, its mean, variance and autocorrelation structure must not change over time. The plot above shows that both the mean and the variance change: the mean, because the level drifts up and down; the variance, because the amplitude of the fluctuations varies in intensity. The ACF plot confirms this, showing a clear, slowly decaying pattern.
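The visual impression can be complemented with a formal unit-root test; a sketch assuming the tseries package is available (it is not used elsewhere in this lab):

# Sketch: augmented Dickey-Fuller test; a large p-value means the unit-root
# (non-stationary) null hypothesis cannot be rejected.
library(tseries)
adf.test(ts_silver, alternative = "stationary")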

b) Display the differences of logarithms of the time series …

ts_log_silver = ts(log(silver$EURO), start = 1, frequency = 1)
par(mfrow = c(2, 2))
plot(ts_silver, main="Weekly prices of silver", ylab="EUR")
acf(ts_silver, lag.max = nrow(silver), main="Autocorrelation plot")
plot(ts_log_silver, ylab="Log EUR", main="Log weekly prices of silver")
acf(ts_log_silver, lag.max = nrow(silver), main="Autocorrelation plot for log")

Here we have only taken the logarithm of the response variable EURO. The trend looks a bit smoother and the scale is smaller, but the autocorrelation plots are similar; the series is still not stationary.

ex3_b = function(l){
  ts_silver = ts(diff(log(silver$EURO), lag=l), start = 1, frequency = 1)

  m = lm(ts_silver ~ time(ts_silver))
  res = rstudent(m)
  ts_res = ts(res, start = 1, frequency = 1)
  layout(matrix(c(1,1,2,3,4,5), 3, 2, byrow = TRUE))
  plot(ts_silver, main=paste("Differenced log prices of silver, lag",l), ylab="Log EUR"); abline(m,col='red')
  plot(res, ylab="Residuals", main=paste("Residual plot for lag",l)); abline(h=mean(res),col='red')
  hist(res, main="Histogram of residuals", xlab="Residuals")
  acf(res,lag.max = length(res), main="Autocorrelation plot of log residuals")
  qqnorm(res, main="QQ plot of residuals", pch='.')
  qqline(res, col="red", lwd=2)
  legend("topleft", col=c("black", "red"), c("Actual", "Theoretical"), lwd=2, bty="n")
}
ex3_b(6)

We chose to display two different plots of the differences of the logarithms of the series: one for a more stationary-appearing model and one for a model with more normally distributed residuals. The one above looks stationary: both the mean and the variance appear constant over time in the first plot, apart from a couple of spikes here and there, and the ACF plot supports this. Although we chose this as the more stationary-appearing model, its residuals can also be modeled reasonably well as Gaussian: the histogram resembles a normal distribution, if not a perfect fit, and in the QQ-plot the residuals follow the theoretical line well.

ex3_b(8)

This one looks somewhat worse in terms of stationarity, but it is still a suitable option; the ACF plot shows some patterns and several spikes. The histogram of the residuals looks a little better than in the previous plot, and the QQ-plot is almost exemplary for modeling the residuals as Gaussian.

c) Test applying different transforms …

ts_silver = ts(diff(sqrt(silver$EURO), lag=1), start = 1, frequency = 1)

m = lm(ts_silver ~ time(ts_silver))
res = rstudent(m)
ts_res = ts(res, start = 1, frequency = 1)
layout(matrix(c(1,1,2,3,4,5), 3, 2, byrow = TRUE))
plot(ts_silver, main="Sqrt weekly prices of silver, lag-1 differences", ylab="Diff of sqrt(EUR)"); abline(m,col='red')
plot(res, ylab="Residuals", main="Residual plot for lag 1"); abline(h=mean(res),col='red')
hist(res, main="Histogram of residuals", xlab="Residuals")
acf(res,lag.max = length(res), main="Autocorrelation plot of residuals")
qqnorm(res, main="QQ plot of residuals",pch='.')
qqline(res, col="red", lwd=2)
legend("topleft", col=c("black", "red"), c("Actual", "Theoretical"), lwd=2, bty="n")

We chose the square-root transformation combined with lag-1 differencing to investigate how this affects stationarity in the mean and variance; we tried several transforms and this one gave the best results. In the first plot the mean looks constant and the variance shows regular fluctuations, with a few exceptions here and there. A plausible explanation for the big spikes is that these are weekly silver prices covering the financial crisis that hit around 2008, so some spikes are to be expected. The ACF supports the impression of stationarity. Judging by the histogram and the QQ-plot, the residuals can be modeled as Gaussian, even though the fit is not perfect.
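Whether the transformed series is close to white noise can also be tested formally; a short sketch using the Ljung-Box test from base R (our addition):

# Sketch: Ljung-Box test on the residuals; a small p-value would indicate
# remaining autocorrelation, i.e. the series is not white noise yet.
Box.test(res, lag = 20, type = "Ljung-Box")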

d) Pick a different non-stationary appearing time series …

data(days)
ts_days = ts(diff(sqrt(days), lag=10), start=1, frequency = 1)

m = lm(ts_days ~ time(ts_days))
res = rstudent(m)
ts_res = ts(res, start = 1, frequency = 1)
layout(matrix(c(1,1,2,3,4,5), 3, 2, byrow = TRUE))
plot(ts_days, ylab="Days", main="Days until a Winegard distributor pays its account"); abline(m,col='red')
plot(res, main="Residual plot", ylab="Residuals"); abline(h=mean(res),col='red')
hist(res, main="Histogram of residuals", xlab="Residuals")
acf(res,lag.max = length(res), main="Autocorrelation plot")
qqnorm(res, main="QQ plot of residuals",pch='.')
qqline(res, col="red", lwd=2)
legend("topleft", col=c("black", "red"), c("Actual", "Theoretical"), lwd=2, bty="n")

We chose the days dataset from the TSA package, which contains the number of days until a distributor of Winegard Company products pays its account. We used the same approach as in 3c), since it seemed to work well for obtaining stationarity. The first plot suggests stationarity: both the mean and the variance are roughly constant (with a few exceptions, of course, that we have to take into account). The ACF looks very good, with only one big spike; the rest appear completely random, which is what we strive for. The residuals also look good, as shown by the histogram and the QQ-plot.