Load the following libraries

library(tseries)
library(forecast)
library(gridExtra)

Load the dataset

data <- read.csv("data.csv")
attach(data)
head(data,5)
          y1         y2        y3         y4       y5
1  0.9209901  2.2157946 0.6819222 -0.0510231 1.181397
2 -0.7553562  1.4160652 1.6202822 -0.2182747 2.080426
3 -1.8629512 -0.1971389 0.8598497  0.2574356 1.440632
4 -1.2798546 -0.1321779 2.5016393 -0.4767962 2.452831
5 -1.8911987 -1.5814481 2.6356750  0.9732679 3.845812

Structure of the dataset

str(data)
'data.frame':   404 obs. of  5 variables:
 $ y1: num  0.921 -0.755 -1.863 -1.28 -1.891 ...
 $ y2: num  2.216 1.416 -0.197 -0.132 -1.581 ...
 $ y3: num  0.682 1.62 0.86 2.502 2.636 ...
 $ y4: num  -0.051 -0.218 0.257 -0.477 0.973 ...
 $ y5: num  1.18 2.08 1.44 2.45 3.85 ...

Split the data into an in-sample period and an out-of-sample period. There are 404 observations; save the last four to make predictions on. Given that you know their true values, you can calculate the forecast error measures.

y_simdata_insample <- data[1:400,1:5]

Pick out the separate series y1, y2, y3, y4 and y5

y1 <- y_simdata_insample[, 1]
y2 <- y_simdata_insample[, 2]
y3 <- y_simdata_insample[, 3]
y4 <- y_simdata_insample[, 4]
y5 <- y_simdata_insample[, 5]

We will fit the models using the in-sample data y1, y2, y3, y4 and y5. Also save the true values of observations 401 to 404 for each series, to compare with the forecasts later

y1_True_401_404 <- data[401:404,1]
y2_True_401_404 <- data[401:404,2]
y3_True_401_404 <- data[401:404,3]
y4_True_401_404 <- data[401:404,4]
y5_True_401_404 <- data[401:404,5]

It is convenient to set here which series to use. This way, you do not have to switch between y1, y2, etc. in every place in the code when you analyze a different series.

y_data <- y1

Process Identification

layout(matrix(c(1, 1, 1, 1,
                2, 2, 3, 3), nrow=2, byrow=TRUE))
ts.plot(y_data);  acf(y_data); pacf(y_data)

par(mfrow = c(1,1))

Ljung-Box test for autocorrelation/serial dependency. This test is done on the original data, where no parameters have been estimated, thus fitdf = 0

LBQ_obj <- Box.test(y_data, lag = 1, type = "Ljung-Box", fitdf = 0)
LBQ_obj

    Box-Ljung test

data:  y_data
X-squared = 141.99, df = 1, p-value < 2.2e-16

The Box-Ljung test is a statistical test used to assess whether a given set of data exhibits autocorrelation. Autocorrelation refers to the relationship between values in a time series dataset, where the value of a data point is dependent on the values of previous data points. In this case, the test was performed on the dataset labeled as “y_data.” The test statistic, denoted as X-squared, has a value of 141.99. The degrees of freedom (df) for this test are 1, indicating that only one lag was considered in the autocorrelation analysis.

The reported p-value, which is less than 2.2e-16 (a very small value), is far below the significance level of 0.05 and suggests strong evidence against the null hypothesis of no autocorrelation. In other words, the data in “y_data” is likely to exhibit significant autocorrelation. It’s important to note that the Box-Ljung test is commonly used in time series analysis to evaluate the presence of autocorrelation, but it does not provide information about the specific nature or pattern of that autocorrelation.
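
The test above uses only the first lag. It can be informative to repeat it over a range of lags; a minimal sketch (the grid of lags 1 to 10 is an arbitrary choice, and lb_pvals is an illustrative name):

# Ljung-Box p-values at lags 1 through 10
lb_pvals <- sapply(1:10, function(k)
  Box.test(y_data, lag = k, type = "Ljung-Box", fitdf = 0)$p.value)
round(lb_pvals, 4)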

Unit Root Test

adf_obj <- adf.test(y_data)
adf_obj

    Augmented Dickey-Fuller Test

data:  y_data
Dickey-Fuller = -4.8936, Lag order = 7, p-value = 0.01
alternative hypothesis: stationary

The reported p-value from the results above is 0.01, which is less than the conventional significance level of 0.05. This indicates that there is sufficient evidence to reject the null hypothesis of non-stationarity in favor of the alternative hypothesis of stationarity. Therefore, based on the ADF test, the series is likely to be stationary.
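
As a cross-check on this conclusion, the tseries package also provides the KPSS test, whose null hypothesis is stationarity, i.e. the reverse of the ADF null; a minimal sketch:

# KPSS test: the null hypothesis is (level) stationarity, so a large
# p-value here would be consistent with the ADF rejection of a unit root
kpss.test(y_data)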

Model Estimation

m1 <- arima(y_data, order = c(0, 0, 0), seasonal = list(order = c(0, 0, 0), period = NA),
            include.mean = FALSE)
summary(m1)

Call:
arima(x = y_data, order = c(0, 0, 0), seasonal = list(order = c(0, 0, 0), period = NA), 
    include.mean = FALSE)


sigma^2 estimated as 1.69:  log likelihood = -672.47,  aic = 1346.93

Training set error measures:
                    ME     RMSE      MAE MPE MAPE    MASE      ACF1
Training set 0.1680615 1.299822 1.051932 100  100 1.13203 0.5935726

Model Evaluation and Diagnostic Tests

Analyze and Plot the Residuals

Resids_m1 <- m1$residuals
layout(matrix(c(1, 1, 1, 1,
                2, 2, 3, 3), nrow=2, byrow=TRUE))
ts.plot(Resids_m1); acf(Resids_m1); pacf(Resids_m1)

par(mfrow = c(1,1))

Summary of model residuals

summary(m1$residuals)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-2.8946 -0.7719  0.1301  0.1681  1.0246  3.8327 

Test if the residuals are uncorrelated

plot(m1$residuals)
abline(h = 0.1681)  # horizontal reference line at the residual mean

From the plot above, the residuals do not appear to be evenly scattered around their mean. In other words, there is some systematic variation left in the model's residuals.

Next, we apply the Ljung-Box test for autocorrelation/serial dependency to the residuals. This test is done on the residuals from an estimated model, so fitdf should equal the number of parameters estimated: we lose one degree of freedom for every estimated parameter. (Here fitdf = 3 is used, although the ARIMA(0,0,0) model without a mean estimates no parameters, so fitdf = 0 would arguably be more appropriate.)

LBQ_obj <- Box.test(Resids_m1, lag = 10, type = "Ljung-Box", fitdf = 3)
LBQ_obj

    Box-Ljung test

data:  Resids_m1
X-squared = 674.35, df = 7, p-value < 2.2e-16

The reported p-value, less than 2.2e-16 (a very small value), is below the significance level of 0.05, suggesting strong evidence against the null hypothesis of no autocorrelation. This indicates that there is significant autocorrelation present in the residuals. Besides this, the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots indicate that the residuals are autocorrelated.
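
Since the ARIMA(0,0,0) residuals are clearly autocorrelated, the chosen order is too low. One possible next step, sketched here under the assumption that an automatic order search is acceptable, is to let auto.arima from the already loaded forecast package propose a richer model and re-run the diagnostics; m1_auto is an illustrative name.

# Let auto.arima search over candidate (p, d, q) orders
m1_auto <- auto.arima(y_data)
summary(m1_auto)
# Residual diagnostics; the Ljung-Box degrees of freedom are handled automatically
checkresiduals(m1_auto)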

Forecasting

The prediction horizon is set to four. In other words, we are getting predictions for y_401, y_402, y_403 and y_404

y_pred <- predict(m1, n.ahead = 4)
y_pred
$pred
Time Series:
Start = 401 
End = 404 
Frequency = 1 
[1] 0 0 0 0

$se
Time Series:
Start = 401 
End = 404 
Frequency = 1 
[1] 1.299822 1.299822 1.299822 1.299822
y_pred$pred
Time Series:
Start = 401 
End = 404 
Frequency = 1 
[1] 0 0 0 0
ts.plot(y_pred$pred)
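
Because the last four observations were held back, the out-of-sample forecast error measures mentioned at the start can now be computed directly. A minimal sketch (ME, RMSE and MAE only; fc_err is an illustrative name):

# Out-of-sample forecast errors for y_401, ..., y_404
fc_err <- y1_True_401_404 - as.numeric(y_pred$pred)
c(ME   = mean(fc_err),
  RMSE = sqrt(mean(fc_err^2)),
  MAE  = mean(abs(fc_err)))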

Task Two

Load the dataset

data_1 <- read.csv("C:\\Users\\user\\Downloads\\Crude oil data.csv")
attach(data_1)
head(data_1,5)
        Date Price
1 21/08/2019 55.68
2 20/08/2019 56.34
3 19/08/2019 56.21
4 16/08/2019 54.87
5 15/08/2019 54.47

Structure of the dataset

str(data_1)
'data.frame':   5000 obs. of  2 variables:
 $ Date : chr  "21/08/2019" "20/08/2019" "19/08/2019" "16/08/2019" ...
 $ Price: num  55.7 56.3 56.2 54.9 54.5 ...

Split data

x_simdata_insample <- data_1[1:4900,1:2]

Pick out the separate series x1 and x2

x1 <- x_simdata_insample[, 1]
x2 <- x_simdata_insample[, 2]

We will fit the model using the in-sample data x1 and x2. Also save the true values of observations 4901 to 5000, to compare with the forecasts later

x1_True_4901_5000 <- data_1[4901:5000,1]
x2_True_4901_5000 <- data_1[4901:5000,2]

It is convenient to set here which series to use. This way, you do not have to switch between x1, x2, etc. in every place in the code when you analyze a different series.

x_data <- x2
head(x_data,5)
[1] 55.68 56.34 56.21 54.87 54.47
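
Note that the prices above appear in reverse chronological order (most recent date first), while time-series functions assume observations run forward in time. Assuming that ordering holds for the whole file, a chronologically ordered copy could be built as sketched below; data_1_chron is an illustrative name, and the analysis that follows continues on the data as originally read.

# Parse the dd/mm/yyyy dates and sort ascending, so the series runs
# forward in time; kept in a separate object so the output below is unchanged
data_1_chron <- data_1
data_1_chron$Date <- as.Date(data_1_chron$Date, format = "%d/%m/%Y")
data_1_chron <- data_1_chron[order(data_1_chron$Date), ]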

Process Identification

layout(matrix(c(1, 1, 1, 1,
                2, 2, 3, 3), nrow=2, byrow=TRUE))
ts.plot(x_data);  acf(x_data); pacf(x_data)

par(mfrow = c(1,1))

Ljung-Box test for autocorrelation/serial dependency. This test is done on the original data, where no parameters have been estimated, thus fitdf = 0

LBQ_obj <- Box.test(x_data, lag = 1, type = "Ljung-Box", fitdf = 0)
LBQ_obj

    Box-Ljung test

data:  x_data
X-squared = 4887.2, df = 1, p-value < 2.2e-16

The reported p-value, again less than 2.2e-16 and far below the significance level of 0.05, suggests strong evidence against the null hypothesis of no autocorrelation. In other words, the data in “x_data” is likely to exhibit significant autocorrelation, just as y_data did above.

Unit Root Test

adf_obj <- adf.test(x_data)
adf_obj

    Augmented Dickey-Fuller Test

data:  x_data
Dickey-Fuller = -2.1417, Lag order = 16, p-value = 0.5183
alternative hypothesis: stationary

The reported p-value from the results above is 0.5183, which is greater than the conventional significance level of 0.05. This indicates that there is insufficient evidence to reject the null hypothesis of non-stationarity in favor of the alternative hypothesis of stationarity. Therefore, based on the ADF test, the series is likely not stationary.

adf_obj <- adf.test(diff(x_data))
adf_obj

    Augmented Dickey-Fuller Test

data:  diff(x_data)
Dickey-Fuller = -16.107, Lag order = 16, p-value = 0.01
alternative hypothesis: stationary

After taking the first difference, the series is found to be stationary.
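
The forecast package also offers ndiffs as a quick cross-check on how many regular differences are needed; a value of 1 would agree with the two ADF results above. A one-line sketch:

# Estimated number of regular differences required for stationarity
ndiffs(x_data)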

Model Estimation

m2 <- arima(x_data, order = c(0, 1, 0), seasonal = list(order = c(0, 1, 0), period = NA),
            include.mean = FALSE)
summary(m2)

Call:
arima(x = x_data, order = c(0, 1, 0), seasonal = list(order = c(0, 1, 0), period = NA), 
    include.mean = FALSE)


sigma^2 estimated as 4.149:  log likelihood = -10434.7,  aic = 20871.39

Training set error measures:
                        ME     RMSE      MAE          MPE     MAPE    MASE
Training set -0.0002159848 2.036549 1.434585 -0.002821044 2.468067 1.45523
                   ACF1
Training set -0.5107874

Model Evaluation and Diagnostic Tests

Analyze and Plot the Residuals

Resids_m2 <- m2$residuals
layout(matrix(c(1, 1, 1, 1,
                2, 2, 3, 3), nrow=2, byrow=TRUE))
ts.plot(Resids_m2); acf(Resids_m2); pacf(Resids_m2)

par(mfrow = c(1,1))

Summary of model residuals

summary(m2$residuals)
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-30.680000  -1.060000  -0.030000  -0.000216   1.040000  14.790000 

Test if the residuals are uncorrelated

plot(m2$residuals)
abline(h = -0.000216)  # horizontal reference line at the residual mean

From the plot above, the residuals do not appear to be evenly scattered around their mean. In other words, there is some systematic variation left in the model's residuals.

Next, the Ljung-Box test for autocorrelation/serial dependency is applied to the residuals. Since these come from an estimated model, fitdf should equal the number of estimated parameters; we lose one degree of freedom per estimated parameter.

LBQ_obj <- Box.test(Resids_m2, lag = 10, type = "Ljung-Box", fitdf = 3)
LBQ_obj

    Box-Ljung test

data:  Resids_m2
X-squared = 1312.3, df = 7, p-value < 2.2e-16

The reported p-value, less than 2.2e-16 (a very small value), is below the significance level of 0.05, suggesting strong evidence against the null hypothesis of no autocorrelation. This indicates that there is significant autocorrelation present in the residuals. Besides this, the autocorrelation function (ACF) and partial autocorrelation function (PACF) plots indicate that the residuals are autocorrelated.
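
The strongly negative lag-one residual autocorrelation (ACF1 of about -0.51 in the training measures) is a common symptom of over-differencing: with period = NA on a plain numeric vector, the seasonal part of m2 uses period 1, so the model effectively differences the series twice, although the ADF results above suggested a single difference suffices. A hedged sketch of a single-difference alternative for comparison (m2_d1 is an illustrative name):

# Single regular difference only; compare the AIC and residual ACF
# of this fit with those of m2
m2_d1 <- arima(x_data, order = c(0, 1, 0))
summary(m2_d1)
Box.test(residuals(m2_d1), lag = 10, type = "Ljung-Box", fitdf = 0)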

Forecasting

The prediction horizon is set to 100. In other words, we are getting predictions for x_4901 through x_5000

x_pred <- predict(m2, n.ahead = 100)
x_pred
$pred
Time Series:
Start = 4901 
End = 5000 
Frequency = 1 
  [1] 29.65 29.30 28.95 28.60 28.25 27.90 27.55 27.20 26.85 26.50 26.15 25.80
 [13] 25.45 25.10 24.75 24.40 24.05 23.70 23.35 23.00 22.65 22.30 21.95 21.60
 [25] 21.25 20.90 20.55 20.20 19.85 19.50 19.15 18.80 18.45 18.10 17.75 17.40
 [37] 17.05 16.70 16.35 16.00 15.65 15.30 14.95 14.60 14.25 13.90 13.55 13.20
 [49] 12.85 12.50 12.15 11.80 11.45 11.10 10.75 10.40 10.05  9.70  9.35  9.00
 [61]  8.65  8.30  7.95  7.60  7.25  6.90  6.55  6.20  5.85  5.50  5.15  4.80
 [73]  4.45  4.10  3.75  3.40  3.05  2.70  2.35  2.00  1.65  1.30  0.95  0.60
 [85]  0.25 -0.10 -0.45 -0.80 -1.15 -1.50 -1.85 -2.20 -2.55 -2.90 -3.25 -3.60
 [97] -3.95 -4.30 -4.65 -5.00

$se
Time Series:
Start = 4901 
End = 5000 
Frequency = 1 
  [1]    2.036965    4.554791    7.621624   11.156915   15.106534   19.431404
  [7]   24.101691   29.093674   34.387921   39.968132   45.820386   51.932612
 [13]   58.294213   64.895788   71.728927   78.786043   86.060249   93.545258
 [19]  101.235298  109.125045  117.209572  125.484296  133.944946  142.587525
 [25]  151.408282  160.403690  169.570424  178.905341  188.405465  198.067972
 [31]  207.890180  217.869532  228.003595  238.290040  248.726645  259.311279
 [37]  270.041901  280.916552  291.933349  303.090481  314.386207  325.818845
 [43]  337.386776  349.088436  360.922314  372.886949  384.980927  397.202881
 [49]  409.551484  422.025450  434.623531  447.344519  460.187235  473.150537
 [55]  486.233315  499.434486  512.752999  526.187828  539.737976  553.402468
 [61]  567.180358  581.070717  595.072645  609.185257  623.407694  637.739113
 [67]  652.178693  666.725628  681.379133  696.138438  711.002791  725.971453
 [73]  741.043703  756.218835  771.496154  786.874983  802.354656  817.934520
 [79]  833.613934  849.392271  865.268915  881.243260  897.314713  913.482690
 [85]  929.746619  946.105936  962.560090  979.108536  995.750739 1012.486176
 [91] 1029.314329 1046.234691 1063.246761 1080.350048 1097.544068 1114.828345
 [97] 1132.202409 1149.665800 1167.218062 1184.858749
x_pred$pred
Time Series:
Start = 4901 
End = 5000 
Frequency = 1 
  [1] 29.65 29.30 28.95 28.60 28.25 27.90 27.55 27.20 26.85 26.50 26.15 25.80
 [13] 25.45 25.10 24.75 24.40 24.05 23.70 23.35 23.00 22.65 22.30 21.95 21.60
 [25] 21.25 20.90 20.55 20.20 19.85 19.50 19.15 18.80 18.45 18.10 17.75 17.40
 [37] 17.05 16.70 16.35 16.00 15.65 15.30 14.95 14.60 14.25 13.90 13.55 13.20
 [49] 12.85 12.50 12.15 11.80 11.45 11.10 10.75 10.40 10.05  9.70  9.35  9.00
 [61]  8.65  8.30  7.95  7.60  7.25  6.90  6.55  6.20  5.85  5.50  5.15  4.80
 [73]  4.45  4.10  3.75  3.40  3.05  2.70  2.35  2.00  1.65  1.30  0.95  0.60
 [85]  0.25 -0.10 -0.45 -0.80 -1.15 -1.50 -1.85 -2.20 -2.55 -2.90 -3.25 -3.60
 [97] -3.95 -4.30 -4.65 -5.00
ts.plot(x_pred$pred)
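
As in Task One, the held-back prices allow the out-of-sample error measures to be computed. A minimal sketch (fc_err2 is an illustrative name):

# Out-of-sample forecast errors for x_4901, ..., x_5000
fc_err2 <- x2_True_4901_5000 - as.numeric(x_pred$pred)
c(ME   = mean(fc_err2),
  RMSE = sqrt(mean(fc_err2^2)),
  MAE  = mean(abs(fc_err2)))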