
Time Series Modelling on Unemployment Rate and Seasonal Temperature Change Datasets

Seasonal Dataset - Collected from Kaggle for Temperature Change. Non-Seasonal Dataset - Collected from the FRED website for Unemployment Rate.

Author: Deepshika Reddy AG

Email: dreddyag@stevens.edu

Project Supervisor: Dr. Hadi Safari Katesari

ABSTRACT

The main objective of this paper is to model two kinds of datasets, one seasonal and one non-seasonal, using time series techniques. The goal is to arrive at the best time series model for each dataset through the techniques discussed below. Time series analysis is a branch of data science that deals with univariate data indexed by date and time, and it is particularly useful for data that are serially correlated. Handling time series data is challenging because the trend is often difficult to understand: for some datasets it can be completely random, while for others it can be seasonal or cyclic in nature. ARMA-type models assume that the series is stationary, so a non-stationary series must first be transformed. Throughout the paper, we first preprocess the dataset, make it univariate, and set the date-time as the index. The next steps involve plotting the time series along with the ACF, PACF and EACF graphs, which help in identifying the model to be used. We also check whether the dataset is stationary using the Dickey-Fuller test. If it is stationary, we can directly analyze the bars in the ACF and EACF and come up with an AR or MA model. If it is not stationary, we can apply techniques such as differencing, transforming and detrending to make it stationary. We then fit ARIMA models and select among them using AIC, BIC and so on. Next, residual analysis is performed using the ACF plot, histogram, Q-Q plot, Shapiro-Wilk test and Ljung-Box test. Finally, forecasting is carried out on the original dataset to predict future values and to see how the time series model performs.

Keywords: Dickey-Fuller test, AIC, BIC, ACF, PACF, EACF, Forecasting, Time series modelling.

INTRODUCTION

First, let us start by implementing univariate time series analysis on the seasonal and non-seasonal datasets.

Seasonal dataset identification: the seasonal dataset was obtained from Kaggle and is a temperature change dataset for different months. Kaggle is a good source for datasets, as the fields are clearly described and the data is readily available as a CSV file.

Non-seasonal dataset identification: the non-seasonal dataset was collected from the official FRED website, which has a large collection of time series datasets across various categories. The univariate dataset is readily available for time series analysis with a clear description. The site also shows a time series plot for each dataset, so we can choose a dataset with a certain trend. The FRED website, which hosts both financial and non-financial datasets, is an easy and instructive source of data.

The techniques followed in the project are based on the Box-Jenkins approach. Every time series project follows a sequence of steps to reach the desired model. The initial step is a check for stationarity, followed by finding the best parameters for the model using the ACF, PACF and EACF plots. Candidate models are then fitted and the one with the lowest error (AIC) is chosen. Finally, forecasting is performed with the chosen model to find future values.

Dataset Description
Seasonal Dataset

The FAOSTAT temperature change dataset contains the mean temperature change by country, with annual updates. The dataset covers 1961 to 2019 and has statistics for monthly, seasonal and annual mean temperatures. For the analysis, we converted the year columns into a single column and kept only the temperature change values, ignoring the global warming and climate change data. By the nature of the problem, temperature change varies from month to month within each year, which forms the seasonal component and can also be seen clearly in the time series plot.

Non-Seasonal Dataset

The dataset is an unemployment rate dataset spanning over 20 years. It was collected through a household population survey and is provided as a .csv file with the date and the unemployment rate in percent. The data comes from the US Bureau of Labor Statistics and is published on the FRED website. The dataset describes the employment situation in the USA; it is monthly and seasonally adjusted.

METHODOLOGY

Steps involved in Box-Jenkins approach


Step 1: Stationarity Check. Every time series dataset needs to be checked for stationarity before fitting a model. In R, this can be done with the Dickey-Fuller test, based on the p-value. If the p-value is greater than 0.05, we cannot reject the null hypothesis of a unit root and treat the data as non-stationary. In that case, techniques such as differencing, detrending or transformation need to be applied and the test re-run. The number of times the series is differenced before the desired p-value is reached forms the d parameter of the model: if the series is differenced once, d = 1, and so on.

Step 2: Model Selection. Model selection is based on the significant spikes in the ACF and PACF plots, that is, those outside the confidence bounds. The parameters p and q are read from these plots. If the plots are not clear, the EACF can be checked for clearer values.

Step 3: Parameter Estimation. Candidate models with different parameters are compared on AIC, BIC and log-likelihood, and the best parameters are chosen for the time series model.

Step 4: Residual Analysis. The chosen model must be verified for correctness and accuracy, and this is where residual analysis plays a major role. Several techniques are available; the main ones are the ACF plot of the residuals, the Q-Q plot, the histogram and the Shapiro-Wilk test. To verify whether the residuals are white noise, we can perform the Ljung-Box test.

Step 5: Forecasting. The best model chosen in the steps above can be used to forecast future values. Forecasting is done not on the differenced dataset but on the actual (raw) dataset. This is an important step for good forecasting.
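The rule in Step 1 for choosing d can be sketched as a small loop. This is a minimal sketch, assuming the tseries package (whose adf.test() output appears later in the report) is installed; choose_d, alpha and max_d are hypothetical names introduced only for illustration, and the random walk is simulated, not one of the report's datasets.

```r
library(tseries)  # provides adf.test()

# Difference the series until the ADF test rejects a unit root (p < alpha)
# or until max_d differences have been taken. Returns the number of
# differences applied, i.e. a candidate value for d.
choose_d <- function(x, alpha = 0.05, max_d = 2) {
  d <- 0
  while (d < max_d && suppressWarnings(adf.test(x)$p.value) >= alpha) {
    x <- diff(x)   # one more round of differencing
    d <- d + 1
  }
  d
}

set.seed(1)
random_walk <- ts(cumsum(rnorm(200)))  # simulated non-stationary series
choose_d(random_walk)
```

The cap max_d guards against over-differencing; in practice the time series plot and the ACF should be inspected alongside the test rather than relying on the p-value alone.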

IMPLEMENTATION

1.Seasonal Dataset:

      date      y
1 1/1/1961  0.777
2 2/1/1961 -1.743
3 3/1/1961  0.516
4 4/1/1961 -1.709
5 5/1/1961  1.412
6 6/1/1961 -0.058

Discuss: The dataset sample above has 2 columns: the date and the y value, a decimal giving the temperature change. To get a better understanding of the data we need to do time series analysis; the details are discussed throughout the paper.

Details of the time series dataset:

Start value of Time series dataset

[1] 1975    1

End value of Time series dataset

[1] 2018   12

Freq value of Time series dataset

[1] 12

Statistics of Time Series Sample.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-7.7220 -0.5355  0.2785  0.2335  1.1030  4.1030

Discuss: Time series plot for the given dataset. The data appears stationary, as there is not much variation and no upward or downward trend. However, only the DF test can confirm stationarity. If we analyze a sample of the data, we can see the seasonal trends.

Discuss: The sample data from Jan 2018 to Dec 2018 is where the seasonality can be seen. The temperature change increases slightly from January to February and then drops somewhat more sharply into March. In the middle of the year the temperature change is almost constant, with not much difference between months. In 1983 there is a sudden drop in temperature in January compared to previous years.

Discuss: The ACF plot shows 4 points outside the confidence interval, and the ACF curve shows the seasonal pattern. Correlation is found at multiple points beyond the confidence interval, with strong correlation at lags 1, 3, 6, 19 and so on. However, the DF test confirms that the time series is stationary, as the p-value is below 0.05. The spikes in the graph indicate MA(2), while the spikes at lags 3 and 6 indicate the seasonal part, also MA(2).

Stationary or Not-Stationary check for dataset
Dickey-Fuller test

Warning in adf.test(ts): p-value smaller than printed p-value


    Augmented Dickey-Fuller Test

data:  ts
Dickey-Fuller = -7.2214, Lag order = 8, p-value = 0.01
alternative hypothesis: stationary

Discuss: The Dickey-Fuller test gives a p-value below 0.05, confirming stationarity. We can further investigate the PACF plot.

Discuss: Only the spikes at lags 1, 3 and 6 are significant, so we can use AR(3) for the non-seasonal part; the spike at lag 6 can be used for the seasonal part, which is AR(1).
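Reading significant spikes off the ACF and PACF plots can also be done numerically. The following is a minimal sketch, assuming the usual approximate 95% bounds of ±1.96/sqrt(n) drawn on these plots; significant_lags is a hypothetical helper and the noisy 12-period cycle is simulated for illustration, not the temperature data.

```r
# Return the lags whose sample (partial) autocorrelation lies outside
# the approximate 95% bounds +/- qnorm(0.975) / sqrt(n).
significant_lags <- function(x, lag.max = 24, partial = FALSE) {
  r <- if (partial) {
    pacf(x, lag.max = lag.max, plot = FALSE)
  } else {
    acf(x, lag.max = lag.max, plot = FALSE)
  }
  vals <- as.numeric(r$acf)
  if (!partial) vals <- vals[-1]        # acf() also returns lag 0; drop it
  bound <- qnorm(0.975) / sqrt(r$n.used)
  which(abs(vals) > bound)              # position i corresponds to lag i
}

# Illustration on a simulated monthly cycle with noise
set.seed(5)
x <- ts(sin(2 * pi * (1:240) / 12) + rnorm(240, sd = 0.5), frequency = 12)
significant_lags(x)                  # candidate MA and seasonal MA lags
significant_lags(x, partial = TRUE)  # candidate AR and seasonal AR lags
```

With monthly data, spikes at multiples of 12 point to the seasonal orders, while the early lags point to the non-seasonal orders.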

Modelling

SARIMA and GARCH

Based on the above analysis we can form the SARMA model as SARMA(p,0,q)X(P,0,Q). As no differencing has been done, the middle parameter is set to zero. The first factor is the non-seasonal part, with the first parameter read from the PACF and the second from the ACF. The seasonal part of the SARMA model follows the same format. From the ACF and PACF analysis above we can formulate the models below:
1.SARMA(2,0,3)X(1,0,2)
2.SARMA(3,0,1)X(1,0,1)

This model cannot be used, because it has a higher AIC value than the previous SARMA model. Also, as the seasonality pattern is not certain, we can fit a GARCH model, together with an ARMA model that ignores the seasonal part, and test whether the AIC value is better. GARCH stands for Generalized Autoregressive Conditional Heteroskedasticity. GARCH models are usually used to estimate the volatility of returns for stocks and similar series, where the trend is not known. We use one here to test whether it gives a better AIC value in our use case. We are going to apply the seasonal ARMA-GARCH model using rugarch.

Steps to perform ARMA-GARCH:

1.Check whether the dataset is stationary, as for any other model.
2.Identify the p and q orders using ARIMA.
3.Add Fourier terms to incorporate seasonality.
4.Check the AIC values, residuals, LB test and so on.

As we already know that the data is stationary, we can find the p and q values from the ACF and PACF plots or use auto.arima() in R.
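The idea of carrying seasonality through Fourier terms can be sketched with base R alone. This is an illustration under stated assumptions, not the code behind the output in this report: fourier_terms is a hypothetical helper, K = 2 harmonics and the ARMA(1,1) error order are arbitrary choices, and the series is simulated.

```r
# Build K pairs of sine/cosine regressors for a given seasonal period.
# Keep K below period / 2: at K = period / 2 the sine column is all zeros.
fourier_terms <- function(n, period = 12, K = 2) {
  t <- seq_len(n)
  cols <- lapply(seq_len(K), function(k) {
    cbind(sin(2 * pi * k * t / period), cos(2 * pi * k * t / period))
  })
  X <- do.call(cbind, cols)
  colnames(X) <- paste0(rep(c("S", "C"), K),
                        rep(seq_len(K), each = 2), "-", period)
  X
}

# Simulated monthly series: a seasonal cycle plus AR(1) noise
set.seed(42)
n <- 240
y <- ts(0.5 * sin(2 * pi * seq_len(n) / 12) +
          arima.sim(list(ar = 0.4), n = n), frequency = 12)

# Regression on the Fourier terms with ARMA(1,1) errors
X <- fourier_terms(n, period = 12, K = 2)
fit <- arima(y, order = c(1, 0, 1), xreg = X)
fit
```

The seasonal pattern is absorbed by the regression coefficients, so the ARMA part only has to describe the short-range dependence. The same regressors could then be passed to a GARCH mean model as external regressors.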

Series: ts 
ARIMA(0,1,2)(1,0,0)[12] 

Coefficients:
          ma1      ma2     sar1
      -0.8092  -0.1599  -0.0556
s.e.   0.0436   0.0435   0.0445

sigma^2 = 1.664:  log likelihood = -881.88
AIC=1771.76   AICc=1771.84   BIC=1788.83


Call:
arima(x = ts, order = c(2, 0, 3), seasonal = list(order = c(1, 0, 2), period = 12))

Coefficients:
         ar1     ar2     ma1      ma2      ma3    sar1     sma1    sma2
      0.1394  0.8450  0.0518  -0.8178  -0.1618  0.5569  -0.6262  0.1005
s.e.  0.1102  0.1087  0.1180   0.0950   0.0490  0.3267   0.3226  0.0488
      intercept
         0.3112
s.e.     0.2496

sigma^2 estimated as 1.64:  log likelihood = -880.12,  aic = 1778.24


Call:
arima(x = ts, order = c(3, 1, 1), seasonal = list(order = c(1, 0, 1), period = 12))

Coefficients:
         ar1      ar2      ar3      ma1     sar1    sma1
      0.1570  -0.0531  -0.0191  -0.9696  -0.0856  0.0259
s.e.  0.0453   0.0454   0.0460   0.0134   0.3651  0.3650

sigma^2 estimated as 1.652:  log likelihood = -881.51,  aic = 1775.03


 Fitting models using approximations to speed things up...

 Regression with ARIMA(2,1,2)            errors : Inf
 Regression with ARIMA(0,1,0)            errors : 2055.81
 Regression with ARIMA(1,1,0)            errors : 1968.435
 Regression with ARIMA(0,1,1)            errors : 1794.578
 Regression with ARIMA(0,1,0)            errors : 2053.707
 Regression with ARIMA(1,1,1)            errors : 1798.11
 Regression with ARIMA(0,1,2)            errors : 1784.067
 Regression with ARIMA(1,1,2)            errors : 1787.08
 Regression with ARIMA(0,1,3)            errors : 1785.998
 Regression with ARIMA(1,1,3)            errors : Inf
 Regression with ARIMA(0,1,2)            errors : 1782.493
 Regression with ARIMA(0,1,1)            errors : 1792.865
 Regression with ARIMA(1,1,2)            errors : 1787.4
 Regression with ARIMA(0,1,3)            errors : 1784.395
 Regression with ARIMA(1,1,1)            errors : 1797.301
 Regression with ARIMA(1,1,3)            errors : 1779.717
 Regression with ARIMA(2,1,3)            errors : 1776.862
 Regression with ARIMA(2,1,2)            errors : 1780.76
 Regression with ARIMA(3,1,3)            errors : 1778.824
 Regression with ARIMA(2,1,4)            errors : 1777.967
 Regression with ARIMA(1,1,4)            errors : 1779.827
 Regression with ARIMA(3,1,2)            errors : 1784.154
 Regression with ARIMA(3,1,4)            errors : 1780.328
 Regression with ARIMA(2,1,3)            errors : Inf

 Now re-fitting the best model(s) without approximations...

 Regression with ARIMA(2,1,3)            errors : 1784.776

 Best model: Regression with ARIMA(2,1,3)            errors

Series: ts 
Regression with ARIMA(2,1,3) errors 

Coefficients:
          ar1      ar2     ma1      ma2      ma3    S1-12    C1-12    S2-12
      -1.3561  -0.5037  0.5377  -0.8046  -0.6450  -0.0977  -0.0351  -0.0936
s.e.   0.1939   0.2005  0.1709   0.0625   0.1698   0.0854   0.0854   0.0845
       C2-12   S3-12   C3-12    S4-12    C4-12    S5-12    C5-12    C6-12
      0.1538  0.0183  0.0493  -0.0421  -0.0576  -0.0186  -0.1154  -0.0442
s.e.  0.0845  0.0825  0.0825   0.0766   0.0767   0.0553   0.0554   0.0566

sigma^2 = 1.661:  log likelihood = -874.79
AIC=1783.57   AICc=1784.78   BIC=1856.12

Discuss: As we are not differencing the series, we can consider ARMA(2,0,3) as the best model. These are also the p and q values found from the ACF and PACF plots.

GARCH


*---------------------------------*
*       GARCH Model Spec          *
*---------------------------------*

Conditional Variance Dynamics   
------------------------------------
GARCH Model     : sGARCH(2,2)
Variance Targeting  : FALSE 

Conditional Mean Dynamics
------------------------------------
Mean Model      : ARFIMA(2,0,3)
Include Mean        : TRUE 
GARCH-in-Mean       : FALSE 

Conditional Distribution
------------------------------------
Distribution    :  norm 
Includes Skew   :  FALSE 
Includes Shape  :  FALSE 
Includes Lambda :  FALSE

Warning in arima(data, order = c(modelinc[2], 0, modelinc[3]), include.mean =
modelinc[1], : possible convergence problem: optim gave code = 1


*---------------------------------*
*          GARCH Model Fit        *
*---------------------------------*

Conditional Variance Dynamics   
-----------------------------------
GARCH Model : sGARCH(2,2)
Mean Model  : ARFIMA(2,0,3)
Distribution    : norm 

Optimal Parameters
------------------------------------
        Estimate  Std. Error     t value Pr(>|t|)
mu      0.265301    0.544387    0.487339 0.626018
ar1     0.151484    0.070530    2.147800 0.031730
ar2     0.840833    0.069109   12.166792 0.000000
ma1     0.048533    0.081843    0.593002 0.553180
ma2    -0.821368    0.059487  -13.807438 0.000000
ma3    -0.157778    0.050057   -3.151964 0.001622
omega   0.000001    0.001029    0.000492 0.999608
alpha1  0.011519    0.006336    1.817906 0.069079
alpha2  0.000000    0.006488    0.000001 1.000000
beta1   0.000000    0.005294    0.000008 0.999994
beta2   0.986903    0.000647 1525.609951 0.000000

Robust Standard Errors:
        Estimate  Std. Error    t value Pr(>|t|)
mu      0.265301    1.566807   0.169326 0.865541
ar1     0.151484    0.065784   2.302762 0.021292
ar2     0.840833    0.048539  17.322906 0.000000
ma1     0.048533    0.077755   0.624179 0.532510
ma2    -0.821368    0.070008 -11.732560 0.000000
ma3    -0.157778    0.072942  -2.163071 0.030536
omega   0.000001    0.000503   0.001006 0.999197
alpha1  0.011519    0.007967   1.445786 0.148237
alpha2  0.000000    0.009078   0.000000 1.000000
beta1   0.000000    0.004614   0.000009 0.999993
beta2   0.986903    0.003461 285.133139 0.000000

LogLikelihood : -878.6808 

Information Criteria
------------------------------------
                   
Akaike       3.3700
Bayes        3.4589
Shibata      3.3692
Hannan-Quinn 3.4048

Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
                         statistic p-value
Lag[1]                     0.07735  0.7809
Lag[2*(p+q)+(p+q)-1][14]   3.76517  1.0000
Lag[4*(p+q)+(p+q)-1][24]   7.63807  0.9810
d.o.f=5
H0 : No serial correlation

Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
                         statistic  p-value
Lag[1]                       1.681 0.194794
Lag[2*(p+q)+(p+q)-1][11]    11.484 0.048284
Lag[4*(p+q)+(p+q)-1][19]    22.744 0.003613
d.o.f=4

Weighted ARCH LM Tests
------------------------------------
            Statistic Shape Scale P-Value
ARCH Lag[5]     2.776 0.500 2.000 0.09570
ARCH Lag[7]     7.613 1.473 1.746 0.03178
ARCH Lag[9]     9.587 2.402 1.619 0.03265

Nyblom stability test
------------------------------------
Joint Statistic:  2.4284
Individual Statistics:              
mu     0.27098
ar1    0.36042
ar2    0.21359
ma1    0.48745
ma2    0.12553
ma3    0.80301
omega  0.12151
alpha1 0.13134
alpha2 0.09875
beta1  0.11948
beta2  0.12760

Asymptotic Critical Values (10% 5% 1%)
Joint Statistic:         2.49 2.75 3.27
Individual Statistic:    0.35 0.47 0.75

Sign Bias Test
------------------------------------
                   t-value    prob sig
Sign Bias            1.955 0.05109   *
Negative Sign Bias   1.007 0.31458    
Positive Sign Bias   1.711 0.08776   *
Joint Effect         9.005 0.02923  **


Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
  group statistic p-value(g-1)
1    20     34.42     0.016369
2    30     56.20     0.001788
3    40     60.18     0.016282
4    50     73.70     0.012773


Elapsed time : 0.677094

Result of GARCH Model with Specifications:
The output above shows the GARCH model fitted to the dataset, which we can compare with the SARMA model. It lists the optimal parameters with their estimates and standard errors. In the weighted Ljung-Box test on the standardized residuals, none of the p-values is below 0.05, so we fail to reject the null hypothesis of no serial correlation: the mean model has captured the autocorrelation. However, the Ljung-Box test on the standardized squared residuals gives p-values of 0.048 and 0.0036 at the two larger lags, and the ARCH LM tests give p-values near 0.03 at lags 7 and 9, so some conditional heteroskedasticity remains. The log-likelihood of the SARMA model (-880.12) is slightly smaller than that of the GARCH model (-878.68). The higher the log-likelihood, the better the model, so GARCH is ahead on this measure, but the difference is small. For AIC the reverse holds, since the smaller AIC indicates the better model. rugarch reports information criteria divided by the number of observations, so the Akaike value of 3.3700 over the 528 monthly observations corresponds to about 1779.4, against 1778.24 for the SARMA model; this difference is also small.
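The Akaike value printed by rugarch is on a per-observation scale, so it has to be rescaled before it is compared with the aic printed by arima. The sketch below reproduces that arithmetic from the numbers in the output above; the variable names are introduced here for illustration only.

```r
n_obs        <- 528        # Jan 1975 to Dec 2018, monthly (44 years x 12)
loglik_garch <- -878.6808  # LogLikelihood from the GARCH fit
k_garch      <- 11         # mu, ar1-2, ma1-3, omega, alpha1-2, beta1-2

aic_total   <- -2 * loglik_garch + 2 * k_garch  # same scale as arima's aic
aic_per_obs <- aic_total / n_obs                # scale printed by rugarch

round(aic_total, 2)
round(aic_per_obs, 4)
```

The per-observation value matches the 3.3700 in the Information Criteria table, which confirms the scaling. Note that different fitting functions may count the innovation variance differently in the penalty term, so gaps of one or two AIC units between packages should not be over-interpreted.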

Residual Analysis

It has been noted that the fit2 model has the least AIC value, and it is considered for residual analysis.

Discuss: The residual plot for the GARCH model. The residuals show variation and seasonality.

Discuss: The residual plot for SARIMA. There is no significant change in the residuals after adopting the GARCH model; the pattern looks almost constant. Further analysis can be done with the SARIMA model.

Histogram

Discuss: The plot shows the histogram of the residuals. The shape is almost bell-shaped and close to a normal distribution.

Discuss: In the Q-Q plot, there are outliers towards the ends that deviate slightly from the reference line relating the theoretical and sample quantiles. Although outliers are present, they are not far from the reference line.

Shapiro-test


    Shapiro-Wilk normality test

data:  residuals(fit1)
W = 0.97474, p-value = 6.67e-08

Discuss: The Shapiro-Wilk test yields a W value of 0.97474 and a p-value of 6.67e-08, which is much smaller than 0.05 and very close to zero. Thus the residuals are non-normal.

Box-test


    Box-Ljung test

data:  resid(fit1)
X-squared = 0.001268, df = 0, p-value < 2.2e-16

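Discuss: The LB test checks whether there is autocorrelation left in the residuals. Its null hypothesis is that the residuals are uncorrelated, so a small p-value would indicate remaining dependence, not that the dependence has been captured. Here, however, the test was run with df = 0, for which the p-value is zero for any positive statistic and is therefore not informative. The test needs a lag larger than the number of fitted ARMA coefficients to give a usable result.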
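A Ljung-Box test with usable degrees of freedom can be sketched with Box.test() from base R. This is a minimal sketch on a simulated AR(1) series, not the report's residuals; lag = 12 is an assumed choice, and fitdf is set to the number of fitted ARMA coefficients so that df = lag - fitdf stays positive.

```r
set.seed(7)
x   <- arima.sim(list(ar = 0.6), n = 300)  # simulated series
fit <- arima(x, order = c(1, 0, 0))

n_coef <- 1  # one AR coefficient was estimated
lb <- Box.test(residuals(fit), lag = 12, type = "Ljung-Box", fitdf = n_coef)
lb            # H0: residuals are uncorrelated up to lag 12
lb$parameter  # degrees of freedom = lag - fitdf

# Why df = 0 is uninformative: the chi-squared distribution with 0 degrees
# of freedom puts all its mass at zero, so any positive statistic gets an
# upper-tail probability of zero.
pchisq(0.001268, df = 0, lower.tail = FALSE)
```

A p-value above 0.05 from this test means that there is no evidence of autocorrelation left in the residuals, which is the outcome a well-fitting model should produce.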

Forecasting


Call:
arima(x = ts, order = c(2, 0, 3), seasonal = list(order = c(1, 0, 2), period = 12))

Coefficients:
         ar1     ar2     ma1      ma2      ma3    sar1     sma1    sma2
      0.1394  0.8450  0.0518  -0.8178  -0.1618  0.5569  -0.6262  0.1005
s.e.  0.1102  0.1087  0.1180   0.0950   0.0490  0.3267   0.3226  0.0488
      intercept
         0.3112
s.e.     0.2496

sigma^2 estimated as 1.64:  log likelihood = -880.12,  aic = 1778.24

Warning in plot.window(xlim, ylim, log, ...): "n.ahead" is not a graphical
parameter

Warning in title(main = main, xlab = xlab, ylab = ylab, ...): "n.ahead" is not a
graphical parameter

Warning in axis(1, ...): "n.ahead" is not a graphical parameter

Warning in axis(2, ...): "n.ahead" is not a graphical parameter

Warning in box(...): "n.ahead" is not a graphical parameter

Discuss: The plot shows the forecast for the time series, produced from the model fitted on the original dataset. The forecast appears to track the series well.
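The warnings above suggest that n.ahead reached a base graphics call, where it is not a recognised parameter. One way to keep the horizon explicit is to forecast with predict() first and plot afterwards. This is a minimal base R sketch on a simulated series; the AR(1) order, the start date and the 20-step horizon are assumptions for illustration, not the report's settings.

```r
set.seed(3)
y <- ts(arima.sim(list(ar = 0.5), n = 120), start = c(2009, 1), frequency = 12)
fit <- arima(y, order = c(1, 0, 0))

# Forecast 20 steps ahead from the model fitted on the undifferenced series
fc <- predict(fit, n.ahead = 20)

# Approximate 95% prediction limits from the forecast standard errors
upper <- fc$pred + 1.96 * fc$se
lower <- fc$pred - 1.96 * fc$se

# Observed series, point forecast, and limits on one set of axes
ts.plot(y, fc$pred, upper, lower,
        lty = c(1, 1, 2, 2),
        col = c("black", "blue", "blue", "blue"))
```

Because the differencing order is part of the order argument, the model is fitted to the raw series and the forecasts come back on the original scale, in line with Step 5 of the methodology.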

2.Non-Seasonal Dataset:

        DATE LNS14000024
1 1948-01-01         3.0
2 1948-02-01         3.3
3 1948-03-01         3.5
4 1948-04-01         3.5
5 1948-05-01         3.3
6 1948-06-01         3.2

Discuss: This shows the CSV data read into R. The dataset starts in Jan 1948 and stores the unemployment rate in the LNS14000024 variable.

Details of the time series dataset:

Start value of Time series dataset

[1] 2016    1

End value of Time series dataset

[1] 2020    1

Freq value of Time series dataset

[1] 12

Statistics of Time Series Sample.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   27.0    31.0    36.0    40.9    50.0    73.0

Discuss: The time series is sampled between 2016 and 2020 so that recent data can be plotted and analyzed.

Time-series analysis

Warning in data(ts): data set 'ts' not found

Discuss: The time series plot between 2016 and 2020 looks random in nature and, just by looking at it, is not stationary.

Comparing various yearly trends of different year values:

Warning in data(ts): data set 'ts' not found

Warning in data(ts): data set 'ts' not found

Warning in data(ts): data set 'ts' not found

Discuss: From the above time series plots we can conclude that the within-year trends for 1960, 2016 and 2020 are similar. At the start of the year, in January, the unemployment rate increases; it stays constant during February and March and then decreases sharply after April. In the middle of the year it increases to a certain level and remains constant until late in the year. There is clearly a pattern when we plot the time series within a single year. It can be concluded that the unemployment rate is higher during the winter months and decreases after April, going into the summer season. Thus the seasonal aspect can be clearly understood.

ACF Plot analysis for sample between 2016 and 2020:

Discuss: This is the initial ACF plot. Almost all lags before lag 25 are significant, so the series needs to be differenced before performing any analysis. The seasonality is clearly visible even in the ACF plot.

PACF plot

Dickey-Fuller Test and Plot

Warning in adf.test(ts_diff): p-value smaller than printed p-value


    Augmented Dickey-Fuller Test

data:  ts_diff
Dickey-Fuller = -5.5333, Lag order = 3, p-value = 0.01
alternative hypothesis: stationary

Discuss: The DF test confirms that the series is stationary, as the p-value is < 0.05, and thus it can be used for further analysis. This is after differencing twice.

Modeling and Parameter estimation

1.ARIMA(1,2,1)
2.ARIMA(1,2,4)
3.ARIMA(1,2,5)
4.ARIMA(2,2,1)
5.ARIMA(2,2,4)
6.ARIMA(2,2,5)
Here the ARIMA orders have the format (PACF order, number of differences, ACF order). Coefficients for the various models:

(fit <- arima(ts_diff, order = c(1,2,1)))


Call:
arima(x = ts_diff, order = c(1, 2, 1))

Coefficients:
          ar1      ma1
      -0.7142  -1.0000
s.e.   0.0986   0.0557

sigma^2 estimated as 51.4:  log likelihood = -155.29,  aic = 314.59

(fit2 <- arima(ts_diff, order = c(1,2,4)))

Warning in log(s2): NaNs produced


Call:
arima(x = ts_diff, order = c(1, 2, 4))

Coefficients:
          ar1      ma1     ma2      ma3     ma4
      -0.0051  -3.0375  3.5460  -1.9735  0.4668
s.e.   0.3018   0.3067  0.7774   0.6970  0.2240

sigma^2 estimated as 13.9:  log likelihood = -131.28,  aic = 272.56

(fit3 <- arima(ts_diff, order = c(1,2,5)))


Call:
arima(x = ts_diff, order = c(1, 2, 5))

Coefficients:
          ar1      ma1     ma2      ma3     ma4     ma5
      -0.0353  -3.0022  3.4399  -1.8474  0.3926  0.0190
s.e.   1.6299   1.7001  5.1929   6.1762  3.5431  0.8653

sigma^2 estimated as 13.94:  log likelihood = -131.28,  aic = 274.56

(fit4 <- arima(ts_diff, order = c(2,2,1)))


Call:
arima(x = ts_diff, order = c(2, 2, 1))

Coefficients:
          ar1      ar2      ma1
      -1.2384  -0.6923  -1.0000
s.e.   0.1009   0.0985   0.0601

sigma^2 estimated as 24.54:  log likelihood = -139.87,  aic = 285.74

(fit5 <- arima(ts_diff, order = c(2,2,4)))


Call:
arima(x = ts_diff, order = c(2, 2, 4))

Coefficients:
          ar1      ar2      ma1     ma2     ma3      ma4
      -1.2361  -0.5367  -1.7490  0.1636  0.9268  -0.3396
s.e.   0.2311   0.1418   0.3913  0.7066  0.6598   0.2716

sigma^2 estimated as 12.94:  log likelihood = -130.39,  aic = 272.79

(fit6 <- arima(ts_diff, order = c(2,2,5)))


Call:
arima(x = ts_diff, order = c(2, 2, 5))

Coefficients:
          ar1      ar2      ma1     ma2     ma3      ma4     ma5
      -1.0150  -0.2931  -1.9884  0.6099  1.0188  -0.9012  0.2641
s.e.   0.4505   0.4386   0.4843  0.9346  0.4924   0.8726  0.4099

sigma^2 estimated as 13.1:  log likelihood = -130.23,  aic = 274.47

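Discuss: Based on the fitted models, ARIMA(1,2,4) has the lowest AIC (272.56) and ARIMA(2,2,4) the lowest sigma^2 (12.94), while ARIMA(2,2,5) has the highest log-likelihood (-130.23) with an AIC of 274.47, within two units of the minimum. ARIMA(2,2,5) is carried forward as the working model for the given time series. The time series plot of its residuals follows.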
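Comparing several candidate orders is less error-prone when the criteria are collected in one table. The following is a minimal sketch in base R; compare_arima is a hypothetical helper, the series is a simulated twice-integrated random walk and not the unemployment data, and fits that fail to converge are recorded as NA.

```r
# Fit each candidate order and collect log-likelihood, AIC and sigma^2.
compare_arima <- function(y, orders) {
  rows <- lapply(orders, function(ord) {
    fit <- tryCatch(
      suppressWarnings(arima(y, order = ord, method = "ML")),
      error = function(e) NULL   # keep going if one order fails
    )
    data.frame(
      model  = sprintf("ARIMA(%d,%d,%d)", ord[1], ord[2], ord[3]),
      loglik = if (is.null(fit)) NA_real_ else fit$loglik,
      aic    = if (is.null(fit)) NA_real_ else fit$aic,
      sigma2 = if (is.null(fit)) NA_real_ else fit$sigma2
    )
  })
  out <- do.call(rbind, rows)
  out[order(out$aic), ]          # smallest AIC first, failed fits last
}

set.seed(11)
y <- ts(cumsum(cumsum(rnorm(150))))  # simulated series needing d = 2
orders <- list(c(1, 2, 1), c(1, 2, 4), c(1, 2, 5),
               c(2, 2, 1), c(2, 2, 4), c(2, 2, 5))
compare_arima(y, orders)
```

Sorting by AIC puts the preferred model in the first row, and having the log-likelihood and sigma^2 in the same table makes it easy to see when the criteria disagree.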

Residual Analysis

Residual Plot

Discuss: The residual plot shows what is left after fitting the model. Most points are close to the line except in the middle of the plot. Now let us plot the ACF of the residuals to further understand their behavior.

ACF Residual Plot

Discuss: From the ACF plot of the residuals, we can clearly see that there is no statistically significant correlation in the data; every point is within the confidence interval.

Discuss: From the histogram we can see that the residuals roughly follow a normal distribution if we ignore the outliers, but the plot is slightly right-skewed. For a better understanding we need a quantile-quantile plot.

Discuss: From the Q-Q plot of the residuals we can say that most of the points lie on the reference line; however, there are a few points towards the tails that deviate slightly. The Q-Q plot gives a better visual of how the sample quantiles of the residuals relate to the theoretical quantiles. A few tests can be performed to check the normality of the residuals; one of them is the Shapiro-Wilk test.

Shapiro Test


    Shapiro-Wilk normality test

data:  residuals(fit6)
W = 0.94064, p-value = 0.01885

Discuss: The output above shows the results of the Shapiro-Wilk test for the residuals of the model. If the p-value is less than or equal to 0.05, the hypothesis of normality is rejected by the Shapiro-Wilk test. Here the p-value is 0.01885, which is less than 0.05, so normality is rejected at the 5% level; this agrees with the slight right skew seen in the histogram.

Ljung-Box


    Box-Ljung test

data:  resid(fit6)
X-squared = 0.062981, df = 0, p-value < 2.2e-16

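Discuss: The Ljung-Box test is performed next to test the randomness of the residuals across lags. The null hypothesis of the LB test is that the residuals are independently distributed; it is rejected if the p-value is less than 0.05. Here the test was run with df = 0, for which the p-value is zero for any positive statistic, so this output cannot confirm independence. The ACF plot of the residuals, where every point lies within the confidence interval, is the evidence that the model has captured the dependence.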

Time-series Forecasting


Call:
arima(x = ts, order = c(2, 2, 5))

Coefficients:
         ar1     ar2      ma1      ma2     ma3      ma4     ma5
      0.1194  0.5561  -1.2597  -0.0785  0.7934  -0.5259  0.0857
s.e.  0.4024  0.2824   0.4718   0.5194  0.3006   0.2513  0.1822

sigma^2 estimated as 11.57:  log likelihood = -125.45,  aic = 264.91

Warning in plot.window(xlim, ylim, log, ...): "n.ahead" is not a graphical
parameter

Warning in title(main = main, xlab = xlab, ylab = ylab, ...): "n.ahead" is not a
graphical parameter

Warning in axis(1, ...): "n.ahead" is not a graphical parameter

Warning in axis(2, ...): "n.ahead" is not a graphical parameter

Warning in box(...): "n.ahead" is not a graphical parameter

Discuss: The plot shows the forecast for the next 20 values, indicated by the blue region.

Executive Summary:

The report walks through the step-by-step procedure for performing time series analysis and forecasting using various methodologies. It covers 2 datasets, one seasonal and the other non-seasonal; the data is loaded in R and analyzed using libraries such as TSA, tseries, forecast and rugarch. A major part of the report discusses how important the ACF and PACF plots are in deducing the parameters of the ARIMA model. A stationarity check is mandatory before any time series analysis. Finding seasonality and similar features requires a clear understanding of 3 kinds of plots, namely the ACF, PACF and time series plots. By fitting different models with varying p, q and differencing values we obtain the sigma^2, log-likelihood and AIC values, which need to be analyzed. In finding the best model, the AIC value should be the smallest and the log-likelihood should be higher. Various residual techniques are then applied to the best model to check normality and autocorrelation, with histogram analysis to see whether the residuals form a normal distribution or are skewed. The Q-Q plot was quite useful for understanding the behavior of outliers. In the last section we forecast on the original time series to see how the model predicts further values over a given period, say 20 months.

Challenges faced and conclusion:

I would like to thank the Professor for providing me with a clear roadmap for analyzing any time series dataset. His step-by-step procedure and algorithm really helped me with both of these datasets. One challenge was preprocessing the seasonal dataset, which was in a different format, and converting it to time series format, because there were different ways to do so: whether to use monthly or yearly data, which category to choose from, and so on. Finding seasonal trends was another challenge, as the spikes in the ACF plot were not very prominent; I therefore sliced the time series into single years to see the monthly patterns for multiple years, and understood the seasonality that way rather than by analyzing the entire dataset at once. Model selection and fitting was another challenge, but I was able to do it based on a clear understanding of the ACF and PACF plots. Understanding the various residual methodologies required a clear understanding of the statistical methods, so I had to read about these techniques in depth and apply them to the dataset.

