The main objective of this paper is to analyze two kinds of datasets, one seasonal and one non-seasonal, using time series modelling. The goal is to arrive at the best time series model for each through the techniques discussed below. Time series analysis is the branch of data science that deals with data indexed by date and time; here we work with univariate series. It is particularly useful for serially correlated data. Handling time series data is challenging because the trend can be hard to identify: for some datasets it is essentially random, while for others it is seasonal or cyclic. The ARMA-type models used here assume a stationary series. Throughout the paper, we first preprocess the dataset, reduce it to a univariate series and use the date-time as the index. The next steps involve plotting the time series together with its ACF, PACF and EACF graphs, which help identify the model to be used. We also check whether the series is stationary using the Dickey-Fuller test. If it is stationary, we can read the significant bars in the ACF and PACF plots, together with the EACF, and propose an AR or MA model. If it is not stationary, we apply techniques such as differencing, transformation and detrending to make it stationary. We then fit ARIMA models and choose among them using AIC, BIC and related criteria. Residual analysis follows, using the ACF plot, histogram, Q-Q plot, Shapiro-Wilk test and Ljung-Box test. Finally, we forecast future values on the original dataset to see how the chosen model performs.
Keywords: Dickey-Fuller test, AIC, BIC, ACF, PACF, EACF, forecasting, time series modelling.
Seasonal dataset identification: The seasonal dataset was obtained from Kaggle and records temperature change for different months. Kaggle is a convenient source because the fields are clearly described and the data are readily available as a CSV file.
Non-seasonal dataset identification: The non-seasonal dataset was collected from the official FRED website, which offers a large collection of time series datasets across many categories. The univariate series is ready for time series analysis and comes with a clear description. FRED also shows a time series plot for each series, so a dataset with a particular trend can be chosen. I find FRED an easy and instructive source, as it covers both financial and non-financial datasets.
The techniques followed in the project are based on the Box-Jenkins approach, which proceeds through a fixed sequence of steps. The first step is to check for stationarity, followed by finding candidate model parameters from the ACF, PACF and EACF plots. Candidate models are then fitted and the one with the smallest AIC is selected. Finally, forecasting is performed with the selected model to obtain future values.
| Dataset | Description |
|---|---|
| Seasonal dataset | The FAOSTAT temperature change dataset contains the mean temperature change by country, with annual updates. It covers 1961-2019 and provides monthly, seasonal and annual mean temperature statistics. For the analysis we reshaped the year columns into a single column and kept only the temperature change records, ignoring the global warming and climate change fields. From the problem statement, temperature change varies by month within each year, which gives the seasonal component; this can also be seen in the time series plot. |
| Non-seasonal dataset | The dataset is an unemployment rate series covering more than 20 years. It comes from a household population survey and is stored as a .csv file with the date and the unemployment rate in percent. The source is the US Bureau of Labor Statistics, accessed through the FRED website. The series describes the employment situation in the USA; it is monthly and seasonally adjusted. |
Step 1: Stationarity check. Every time series must be checked for stationarity before a model is fitted. In R this can be done with the Dickey-Fuller test, judged by its p-value. If the p-value is greater than 0.05, we conclude that the data are not stationary. In that case techniques such as differencing, detrending or transformation are applied and the test is re-run. The number of times the series is differenced before the desired p-value is reached gives the d parameter of the model: one difference means d=1, and so on.
Step 2: Model selection. Model selection is based on the significant lines in the ACF and PACF plots, that is, those extending beyond the confidence band. The parameters p and q are read from these plots. If the plots are not clear, the EACF can be checked for clearer values.
Step 3: Parameter estimation. Candidate models and parameter values are compared using AIC, BIC and the log-likelihood, and the best parameters are chosen for the time series model.
Step 4: Residual analysis. The chosen model must be verified for adequacy. This is where residual analysis plays a major role. Several techniques are available, chiefly the ACF plot of the residuals, the Q-Q plot, the histogram and the Shapiro-Wilk test. To verify whether the residuals are white noise, we can perform the Ljung-Box test.
Step 5: Forecasting. The best model chosen in the steps above is used to forecast future values. The forecast is made on the original (raw) series, not on the differenced series. This is an important step for good forecasting.
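The stationarity check in the first step can be sketched as a small loop. This is an illustration only: it assumes the tseries package is installed, it uses a simulated random walk in place of the report's data, and the helper name `find_d` and the cap `max_d` are hypothetical.

```r
library(tseries)  # provides adf.test(); assumed to be installed

# Difference a series until the ADF test rejects a unit root (p < alpha)
# and return the number of differences used, i.e. the d parameter.
# find_d and max_d are illustrative names, not part of any package.
find_d <- function(x, alpha = 0.05, max_d = 2) {
  d <- 0
  p <- suppressWarnings(adf.test(x)$p.value)
  while (p >= alpha && d < max_d) {
    x <- diff(x)                                  # take one more difference
    d <- d + 1
    p <- suppressWarnings(adf.test(x)$p.value)    # re-run the test
  }
  list(d = d, p_value = p)
}

set.seed(1)
random_walk <- cumsum(rnorm(300))   # simulated non-stationary series
res <- find_d(random_walk)
print(res)
```

The cap on the number of differences is a design choice in this sketch: it guards against over-differencing when the test keeps failing for reasons other than a unit root.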
date y
1 1/1/1961 0.777
2 2/1/1961 -1.743
3 3/1/1961 0.516
4 4/1/1961 -1.709
5 5/1/1961 1.412
6 6/1/1961 -0.058
Discuss: The sample above shows that the dataset has two columns: the date and the value y, a decimal giving the temperature change. To understand the data better we carry out a time series analysis, detailed in the rest of the paper.
Start value of Time series dataset
[1] 1975 1
End value of Time series dataset
[1] 2018 12
Freq value of Time series dataset
[1] 12
Statistics of Time Series Sample.
Min. 1st Qu. Median Mean 3rd Qu. Max.
-7.7220 -0.5355 0.2785 0.2335 1.1030 4.1030
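The start, end and frequency values above are what a monthly `ts` object reports. A minimal sketch of building such an object follows; the vector `y` holds simulated values standing in for the temperature-change column, and 528 is simply 44 years times 12 months.

```r
# Stand-in for the temperature-change column: 528 simulated monthly values.
set.seed(42)
y <- rnorm(528)

# Monthly series starting in January 1975; frequency = 12 marks months.
x <- ts(y, start = c(1975, 1), frequency = 12)

start(x)      # first (year, month)
end(x)        # last (year, month)
frequency(x)  # observations per year
summary(x)    # quartiles, mean, minimum and maximum

# A short window, here two years, makes a repeating pattern easier to see.
sub <- window(x, start = c(1975, 1), end = c(1976, 12))
length(sub)
```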
Discuss: Time series plot for the given dataset. The series appears stationary, as there is no marked upward or downward trend, but only the DF test can confirm stationarity. Looking at a shorter sample of the data reveals the seasonal pattern.
Discuss: The ACF plot shows four points outside the confidence band and a seasonal pattern. Strong correlation appears at lags 1, 3, 6, 19 and so on. Nevertheless, the DF test confirms that the time series is stationary, as its p-value is below 0.05. The significant pins suggest an MA(2) component, and lags 3 and 6 suggest a seasonal MA(2) part.
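Reading significant lags off the ACF can also be done numerically. The sketch below uses a simulated monthly series with a 12-month cycle, not the report's data, and the usual approximate 95% band of about 1.96 divided by the square root of the sample size.

```r
# Simulated monthly series with a 12-month cycle; stand-in for the report's data.
set.seed(7)
n <- 528
x <- ts(sin(2 * pi * (1:n) / 12) + rnorm(n), frequency = 12)

a <- acf(x, lag.max = 36, plot = FALSE)
bound <- qnorm(0.975) / sqrt(n)      # approximate 95% limit for white noise

r <- as.numeric(a$acf)[-1]           # drop lag 0, which is always 1
lags <- seq_along(r)                 # lag index counted in months
significant <- lags[abs(r) > bound]  # lags whose autocorrelation leaves the band
significant
```

The same approach works with `pacf()`, whose result has no lag-0 entry, so the first element does not need to be dropped.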
Stationarity check for the dataset: Dickey-Fuller test
Warning in adf.test(ts): p-value smaller than printed p-value
Augmented Dickey-Fuller Test
data: ts
Dickey-Fuller = -7.2214, Lag order = 8, p-value = 0.01
alternative hypothesis: stationary
Discuss: The Dickey-Fuller test gives a p-value below 0.05, so the series is treated as stationary. We next investigate the PACF.
Discuss: Only the pins at lags 1, 3 and 6 are significant, so we can use an AR(3) part for the non-seasonal component, while lag 6 can be used for the seasonal part, which is AR(1).
Based on the above analysis we can form the SARMA model as SARMA(p,0,q)X(P,0,Q). As no differencing has been done, the middle parameter is zero. The first factor is the non-seasonal part, with the first parameter read from the PACF and the last from the ACF; the seasonal part follows the same format. From the ACF and PACF analysis we formulate the following models:
1. SARMA(2,0,3)X(1,0,2)
2. SARMA(3,0,1)X(1,0,1)
This model cannot be used, because it has a higher AIC value than the previous SARMA model. Also, as the seasonality pattern is not certain, we can try a GARCH model, as well as an ARMA model that ignores the seasonal part, and check whether the AIC improves. GARCH stands for Generalized Autoregressive Conditional Heteroskedasticity. GARCH models are usually used to model returns of stocks and similar series whose trend is not known; here we apply one only to see whether it yields a better AIC. We are going to apply the seasonal ARMA-GARCH model using rugarch.
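A rugarch specification of this kind can be sketched as follows. The orders are taken from the model spec printed later in this report (sGARCH(2,2) variance, ARFIMA(2,0,3) mean, normal distribution); everything else is an assumption: the rugarch package must be installed, the series is simulated, and the solver choice is ours. Convergence is not guaranteed for a model this heavily parameterized.

```r
library(rugarch)  # assumed to be installed

# Specification mirroring the printed spec: sGARCH(2,2) variance,
# ARMA(2,3) mean with an intercept, normal innovations.
spec <- ugarchspec(
  variance.model = list(model = "sGARCH", garchOrder = c(2, 2)),
  mean.model = list(armaOrder = c(2, 3), include.mean = TRUE),
  distribution.model = "norm"
)

# Simulated stand-in series; the report fits its temperature-change data instead.
set.seed(123)
x <- as.numeric(arima.sim(list(ar = 0.5), n = 528))

# "hybrid" tries several optimizers in turn (our choice, not the report's).
fit <- ugarchfit(spec = spec, data = x, solver = "hybrid")

# Only read results if the optimizer reports success (code 0).
if (convergence(fit) == 0) {
  print(infocriteria(fit))   # information criteria, scaled per observation
  print(likelihood(fit))     # total log-likelihood
}
```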
As we already know that the data are stationary, we can find the p and q values from the ACF and PACF plots or use auto.arima() in R.
Series: ts
ARIMA(0,1,2)(1,0,0)[12]
Coefficients:
ma1 ma2 sar1
-0.8092 -0.1599 -0.0556
s.e. 0.0436 0.0435 0.0445
sigma^2 = 1.664: log likelihood = -881.88
AIC=1771.76 AICc=1771.84 BIC=1788.83
Call:
arima(x = ts, order = c(2, 0, 3), seasonal = list(order = c(1, 0, 2), period = 12))
Coefficients:
ar1 ar2 ma1 ma2 ma3 sar1 sma1 sma2
0.1394 0.8450 0.0518 -0.8178 -0.1618 0.5569 -0.6262 0.1005
s.e. 0.1102 0.1087 0.1180 0.0950 0.0490 0.3267 0.3226 0.0488
intercept
0.3112
s.e. 0.2496
sigma^2 estimated as 1.64: log likelihood = -880.12, aic = 1778.24
Call:
arima(x = ts, order = c(3, 1, 1), seasonal = list(order = c(1, 0, 1), period = 12))
Coefficients:
ar1 ar2 ar3 ma1 sar1 sma1
0.1570 -0.0531 -0.0191 -0.9696 -0.0856 0.0259
s.e. 0.0453 0.0454 0.0460 0.0134 0.3651 0.3650
sigma^2 estimated as 1.652: log likelihood = -881.51, aic = 1775.03
Fitting models using approximations to speed things up...
Regression with ARIMA(2,1,2) errors : Inf
Regression with ARIMA(0,1,0) errors : 2055.81
Regression with ARIMA(1,1,0) errors : 1968.435
Regression with ARIMA(0,1,1) errors : 1794.578
Regression with ARIMA(0,1,0) errors : 2053.707
Regression with ARIMA(1,1,1) errors : 1798.11
Regression with ARIMA(0,1,2) errors : 1784.067
Regression with ARIMA(1,1,2) errors : 1787.08
Regression with ARIMA(0,1,3) errors : 1785.998
Regression with ARIMA(1,1,3) errors : Inf
Regression with ARIMA(0,1,2) errors : 1782.493
Regression with ARIMA(0,1,1) errors : 1792.865
Regression with ARIMA(1,1,2) errors : 1787.4
Regression with ARIMA(0,1,3) errors : 1784.395
Regression with ARIMA(1,1,1) errors : 1797.301
Regression with ARIMA(1,1,3) errors : 1779.717
Regression with ARIMA(2,1,3) errors : 1776.862
Regression with ARIMA(2,1,2) errors : 1780.76
Regression with ARIMA(3,1,3) errors : 1778.824
Regression with ARIMA(2,1,4) errors : 1777.967
Regression with ARIMA(1,1,4) errors : 1779.827
Regression with ARIMA(3,1,2) errors : 1784.154
Regression with ARIMA(3,1,4) errors : 1780.328
Regression with ARIMA(2,1,3) errors : Inf
Now re-fitting the best model(s) without approximations...
Regression with ARIMA(2,1,3) errors : 1784.776
Best model: Regression with ARIMA(2,1,3) errors
Series: ts
Regression with ARIMA(2,1,3) errors
Coefficients:
ar1 ar2 ma1 ma2 ma3 S1-12 C1-12 S2-12
-1.3561 -0.5037 0.5377 -0.8046 -0.6450 -0.0977 -0.0351 -0.0936
s.e. 0.1939 0.2005 0.1709 0.0625 0.1698 0.0854 0.0854 0.0845
C2-12 S3-12 C3-12 S4-12 C4-12 S5-12 C5-12 C6-12
0.1538 0.0183 0.0493 -0.0421 -0.0576 -0.0186 -0.1154 -0.0442
s.e. 0.0845 0.0825 0.0825 0.0766 0.0767 0.0553 0.0554 0.0566
sigma^2 = 1.661: log likelihood = -874.79
AIC=1783.57 AICc=1784.78 BIC=1856.12
Discuss: As we are not differencing the series, we consider ARMA(2,0,3) the best model. Its p and q values agree with those found from the ACF and PACF plots.
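Choosing among candidate orders by AIC can be sketched in a few lines of base R. The series below is simulated and the candidate list is illustrative, not the set of models fitted in this report; `fit_aic` is a hypothetical helper name.

```r
# Simulated ARMA(1,1) series as a stand-in for the report's data.
set.seed(2024)
x <- arima.sim(list(ar = 0.6, ma = -0.3), n = 400)

# Candidate (p, d, q) orders; illustrative only.
candidates <- list(c(1, 0, 0), c(0, 0, 1), c(1, 0, 1), c(2, 0, 1))

# Fit one candidate by maximum likelihood; return NA if the fit fails.
fit_aic <- function(ord) {
  tryCatch(AIC(stats::arima(x, order = ord, method = "ML")),
           error = function(e) NA_real_)
}

results <- data.frame(
  model = sapply(candidates,
                 function(o) sprintf("ARIMA(%d,%d,%d)", o[1], o[2], o[3])),
  aic = sapply(candidates, fit_aic)
)
results[order(results$aic), ]   # smallest AIC first
```

`stats::arima` is called explicitly because other packages, TSA among them, provide their own `arima` function, and the printed AIC can differ slightly between implementations.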
GARCH
*---------------------------------*
* GARCH Model Spec *
*---------------------------------*
Conditional Variance Dynamics
------------------------------------
GARCH Model : sGARCH(2,2)
Variance Targeting : FALSE
Conditional Mean Dynamics
------------------------------------
Mean Model : ARFIMA(2,0,3)
Include Mean : TRUE
GARCH-in-Mean : FALSE
Conditional Distribution
------------------------------------
Distribution : norm
Includes Skew : FALSE
Includes Shape : FALSE
Includes Lambda : FALSE
Warning in arima(data, order = c(modelinc[2], 0, modelinc[3]), include.mean =
modelinc[1], : possible convergence problem: optim gave code = 1
*---------------------------------*
* GARCH Model Fit *
*---------------------------------*
Conditional Variance Dynamics
-----------------------------------
GARCH Model : sGARCH(2,2)
Mean Model : ARFIMA(2,0,3)
Distribution : norm
Optimal Parameters
------------------------------------
Estimate Std. Error t value Pr(>|t|)
mu 0.265301 0.544387 0.487339 0.626018
ar1 0.151484 0.070530 2.147800 0.031730
ar2 0.840833 0.069109 12.166792 0.000000
ma1 0.048533 0.081843 0.593002 0.553180
ma2 -0.821368 0.059487 -13.807438 0.000000
ma3 -0.157778 0.050057 -3.151964 0.001622
omega 0.000001 0.001029 0.000492 0.999608
alpha1 0.011519 0.006336 1.817906 0.069079
alpha2 0.000000 0.006488 0.000001 1.000000
beta1 0.000000 0.005294 0.000008 0.999994
beta2 0.986903 0.000647 1525.609951 0.000000
Robust Standard Errors:
Estimate Std. Error t value Pr(>|t|)
mu 0.265301 1.566807 0.169326 0.865541
ar1 0.151484 0.065784 2.302762 0.021292
ar2 0.840833 0.048539 17.322906 0.000000
ma1 0.048533 0.077755 0.624179 0.532510
ma2 -0.821368 0.070008 -11.732560 0.000000
ma3 -0.157778 0.072942 -2.163071 0.030536
omega 0.000001 0.000503 0.001006 0.999197
alpha1 0.011519 0.007967 1.445786 0.148237
alpha2 0.000000 0.009078 0.000000 1.000000
beta1 0.000000 0.004614 0.000009 0.999993
beta2 0.986903 0.003461 285.133139 0.000000
LogLikelihood : -878.6808
Information Criteria
------------------------------------
Akaike 3.3700
Bayes 3.4589
Shibata 3.3692
Hannan-Quinn 3.4048
Weighted Ljung-Box Test on Standardized Residuals
------------------------------------
statistic p-value
Lag[1] 0.07735 0.7809
Lag[2*(p+q)+(p+q)-1][14] 3.76517 1.0000
Lag[4*(p+q)+(p+q)-1][24] 7.63807 0.9810
d.o.f=5
H0 : No serial correlation
Weighted Ljung-Box Test on Standardized Squared Residuals
------------------------------------
statistic p-value
Lag[1] 1.681 0.194794
Lag[2*(p+q)+(p+q)-1][11] 11.484 0.048284
Lag[4*(p+q)+(p+q)-1][19] 22.744 0.003613
d.o.f=4
Weighted ARCH LM Tests
------------------------------------
Statistic Shape Scale P-Value
ARCH Lag[5] 2.776 0.500 2.000 0.09570
ARCH Lag[7] 7.613 1.473 1.746 0.03178
ARCH Lag[9] 9.587 2.402 1.619 0.03265
Nyblom stability test
------------------------------------
Joint Statistic: 2.4284
Individual Statistics:
mu 0.27098
ar1 0.36042
ar2 0.21359
ma1 0.48745
ma2 0.12553
ma3 0.80301
omega 0.12151
alpha1 0.13134
alpha2 0.09875
beta1 0.11948
beta2 0.12760
Asymptotic Critical Values (10% 5% 1%)
Joint Statistic: 2.49 2.75 3.27
Individual Statistic: 0.35 0.47 0.75
Sign Bias Test
------------------------------------
t-value prob sig
Sign Bias 1.955 0.05109 *
Negative Sign Bias 1.007 0.31458
Positive Sign Bias 1.711 0.08776 *
Joint Effect 9.005 0.02923 **
Adjusted Pearson Goodness-of-Fit Test:
------------------------------------
group statistic p-value(g-1)
1 20 34.42 0.016369
2 30 56.20 0.001788
3 40 60.18 0.016282
4 50 73.70 0.012773
Elapsed time : 0.677094
Result of GARCH model with specifications: The output above is from the GARCH model fitted to the dataset, and we compare some of its results with the SARMA model. The output lists the optimal parameters with their estimates and standard errors. In the weighted Ljung-Box test on the standardized residuals, no p-value is below 0.05, so the null hypothesis of no serial correlation is not rejected. However, the Ljung-Box test on the standardized squared residuals gives p-values below 0.05 at the higher lags (0.048 and 0.0036), and the ARCH LM tests have p-values of about 0.03 at lags 7 and 9, which points to remaining ARCH effects. The log-likelihood of the SARMA model (-880.12) is slightly lower than that of the GARCH model (-878.68). A higher log-likelihood indicates a better fit, so GARCH is marginally ahead on that measure, though the difference is small. For AIC the rule is the opposite: the smaller the AIC, the better the model. Note that rugarch reports its information criteria divided by the number of observations, so the Akaike value of 3.37 is not directly comparable with the AIC of 1778.24 printed by arima.
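The GARCH and SARMA AIC values are printed on different scales, so a rough conversion helps before comparing them. The sketch below uses only numbers printed in this report. Two inputs are inferred rather than printed: the observation count of 528 (monthly data from January 1975 to December 2018) and the parameter count of 11 (the rows of the Optimal Parameters table).

```r
n <- 528                   # monthly observations, Jan 1975 to Dec 2018 (inferred)
loglik_garch <- -878.6808  # LogLikelihood line of the GARCH fit
k_garch <- 11              # mu, ar1-ar2, ma1-ma3, omega, alpha1-2, beta1-2

# Total-scale AIC: -2 * log-likelihood + 2 * number of parameters.
aic_total <- -2 * loglik_garch + 2 * k_garch

# Per-observation AIC, the scale on which the GARCH output prints "Akaike".
aic_per_obs <- aic_total / n

round(aic_per_obs, 4)      # reproduces the printed Akaike value of 3.3700
aic_total                  # same scale as the aic printed by arima()
```

On the total scale the GARCH fit comes out near 1779.36, against 1778.24 printed for the SARMA(2,0,3)X(1,0,2) fit. The two models are therefore very close, and the parameter-counting conventions of the two packages may differ by a small constant.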
The model fit2 has the smallest AIC value and is therefore taken forward for residual analysis.
Discuss: Residual plot for the GARCH model. The residuals show variation and a seasonal pattern.
Discuss: Residual plot for the SARIMA model. Adopting the GARCH model brings no significant change in the residuals, and the pattern looks almost constant. Further analysis is therefore carried out with the SARIMA model.
Discuss: The plot shows the histogram of the residuals. The shape is roughly bell-shaped, resembling a normal distribution.
Shapiro-test
Shapiro-Wilk normality test
data: residuals(fit1)
W = 0.97474, p-value = 6.67e-08
Discuss: The Shapiro-Wilk test yields W = 0.97474 with a p-value of 6.67e-08, far below 0.05. Normality of the residuals is therefore rejected, even though the histogram looks roughly bell-shaped.
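Normality checks of this kind can be sketched with base R alone. The two residual vectors below are simulated stand-ins, one normal and one heavy-tailed, chosen to show how the test and the plots are read; they are not the residuals of the report's model.

```r
set.seed(10)
res_normal <- rnorm(500)        # stand-in for well-behaved residuals
res_heavy  <- rt(500, df = 3)   # stand-in for heavy-tailed residuals

sw_normal <- shapiro.test(res_normal)
sw_heavy  <- shapiro.test(res_heavy)

# A small p-value is evidence against normality.
c(W = unname(sw_normal$statistic), p = sw_normal$p.value)
c(W = unname(sw_heavy$statistic),  p = sw_heavy$p.value)

# Visual checks used alongside the test.
hist(res_heavy, breaks = 30, main = "Histogram of residuals")
qqnorm(res_heavy)
qqline(res_heavy)               # points far from this line indicate heavy tails
```

With several hundred observations the Shapiro-Wilk test is sensitive to modest departures from normality, which is one way a histogram can look bell-shaped while the test still rejects.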
Box-test
Box-Ljung test
data: resid(fit1)
X-squared = 0.001268, df = 0, p-value < 2.2e-16
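Discuss: The Ljung-Box test checks for autocorrelation in the residuals. The test statistic here is very small (0.001268), but the output reports df = 0, and with zero degrees of freedom the printed p-value is not informative. The test should be re-run with a lag larger than the number of fitted ARMA coefficients before drawing a conclusion about residual dependence.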
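The Box-Ljung output above reports df = 0, which happens when the lag equals the `fitdf` argument. A sketch of a call with positive degrees of freedom is given below; the series, the ARMA(1,1) order and the lag of 20 are assumptions chosen for illustration.

```r
set.seed(5)
x <- arima.sim(list(ar = 0.6, ma = 0.3), n = 400)   # simulated stand-in series
fit <- stats::arima(x, order = c(1, 0, 1), method = "ML")

# lag must exceed fitdf (the number of fitted ARMA coefficients),
# otherwise the test has zero degrees of freedom.
lb <- Box.test(residuals(fit), lag = 20, type = "Ljung-Box", fitdf = 2)

lb$parameter   # degrees of freedom = lag - fitdf
lb$p.value     # a value above 0.05 means no evidence of leftover autocorrelation
```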
Forecasting
Call:
arima(x = ts, order = c(2, 0, 3), seasonal = list(order = c(1, 0, 2), period = 12))
Coefficients:
ar1 ar2 ma1 ma2 ma3 sar1 sma1 sma2
0.1394 0.8450 0.0518 -0.8178 -0.1618 0.5569 -0.6262 0.1005
s.e. 0.1102 0.1087 0.1180 0.0950 0.0490 0.3267 0.3226 0.0488
intercept
0.3112
s.e. 0.2496
sigma^2 estimated as 1.64: log likelihood = -880.12, aic = 1778.24
Warning in plot.window(xlim, ylim, log, ...): "n.ahead" is not a graphical
parameter
Warning in title(main = main, xlab = xlab, ylab = ylab, ...): "n.ahead" is not a
graphical parameter
Warning in axis(1, ...): "n.ahead" is not a graphical parameter
Warning in axis(2, ...): "n.ahead" is not a graphical parameter
Warning in box(...): "n.ahead" is not a graphical parameter
DATE LNS14000024
1 1948-01-01 3.0
2 1948-02-01 3.3
3 1948-03-01 3.5
4 1948-04-01 3.5
5 1948-05-01 3.3
6 1948-06-01 3.2
Start value of Time series dataset
[1] 2016 1
End value of Time series dataset
[1] 2020 1
Freq value of Time series dataset
[1] 12
Statistics of Time Series Sample.
Min. 1st Qu. Median Mean 3rd Qu. Max.
27.0 31.0 36.0 40.9 50.0 73.0
Warning in data(ts): data set 'ts' not found
Warning in adf.test(ts_diff): p-value smaller than printed p-value
Augmented Dickey-Fuller Test
data: ts_diff
Dickey-Fuller = -5.5333, Lag order = 3, p-value = 0.01
alternative hypothesis: stationary
Candidate models:
1. ARIMA(1,2,1)
2. ARIMA(1,2,4)
3. ARIMA(1,2,5)
4. ARIMA(2,2,1)
5. ARIMA(2,2,4)
6. ARIMA(2,2,5)
The parameters follow the format ARIMA(p from PACF, number of differences, q from ACF). Coefficients for the various models:
(fit <- arima(ts_diff, order = c(1,2,1)))
Call:
arima(x = ts_diff, order = c(1, 2, 1))
Coefficients:
ar1 ma1
-0.7142 -1.0000
s.e. 0.0986 0.0557
sigma^2 estimated as 51.4: log likelihood = -155.29, aic = 314.59
(fit2 <- arima(ts_diff, order = c(1,2,4)))
Warning in log(s2): NaNs produced
Call:
arima(x = ts_diff, order = c(1, 2, 4))
Coefficients:
ar1 ma1 ma2 ma3 ma4
-0.0051 -3.0375 3.5460 -1.9735 0.4668
s.e. 0.3018 0.3067 0.7774 0.6970 0.2240
sigma^2 estimated as 13.9: log likelihood = -131.28, aic = 272.56
(fit3 <- arima(ts_diff, order = c(1,2,5)))
Call:
arima(x = ts_diff, order = c(1, 2, 5))
Coefficients:
ar1 ma1 ma2 ma3 ma4 ma5
-0.0353 -3.0022 3.4399 -1.8474 0.3926 0.0190
s.e. 1.6299 1.7001 5.1929 6.1762 3.5431 0.8653
sigma^2 estimated as 13.94: log likelihood = -131.28, aic = 274.56
(fit4 <- arima(ts_diff, order = c(2,2,1)))
Call:
arima(x = ts_diff, order = c(2, 2, 1))
Coefficients:
ar1 ar2 ma1
-1.2384 -0.6923 -1.0000
s.e. 0.1009 0.0985 0.0601
sigma^2 estimated as 24.54: log likelihood = -139.87, aic = 285.74
(fit5 <- arima(ts_diff, order = c(2,2,4)))
Call:
arima(x = ts_diff, order = c(2, 2, 4))
Coefficients:
ar1 ar2 ma1 ma2 ma3 ma4
-1.2361 -0.5367 -1.7490 0.1636 0.9268 -0.3396
s.e. 0.2311 0.1418 0.3913 0.7066 0.6598 0.2716
sigma^2 estimated as 12.94: log likelihood = -130.39, aic = 272.79
(fit6 <- arima(ts_diff, order = c(2,2,5)))
Call:
arima(x = ts_diff, order = c(2, 2, 5))
Coefficients:
ar1 ar2 ma1 ma2 ma3 ma4 ma5
-1.0150 -0.2931 -1.9884 0.6099 1.0188 -0.9012 0.2641
s.e. 0.4505 0.4386 0.4843 0.9346 0.4924 0.8726 0.4099
sigma^2 estimated as 13.1: log likelihood = -130.23, aic = 274.47
Shapiro-Wilk normality test
data: residuals(fit6)
W = 0.94064, p-value = 0.01885
Box-Ljung test
data: resid(fit6)
X-squared = 0.062981, df = 0, p-value < 2.2e-16
Discuss: The Ljung-Box test is next performed on the model to test whether the residuals are random across lags. The null hypothesis of the LB test is that the residuals are independently distributed, and it is rejected if the p-value is less than 0.05. Note that the output reports df = 0, in which case the printed p-value is not informative, so the conclusion about independence for this model should be confirmed with a larger lag.
Time-series Forecasting
Call:
arima(x = ts, order = c(2, 2, 5))
Coefficients:
ar1 ar2 ma1 ma2 ma3 ma4 ma5
0.1194 0.5561 -1.2597 -0.0785 0.7934 -0.5259 0.0857
s.e. 0.4024 0.2824 0.4718 0.5194 0.3006 0.2513 0.1822
sigma^2 estimated as 11.57: log likelihood = -125.45, aic = 264.91
Warning in plot.window(xlim, ylim, log, ...): "n.ahead" is not a graphical
parameter
Warning in title(main = main, xlab = xlab, ylab = ylab, ...): "n.ahead" is not a
graphical parameter
Warning in axis(1, ...): "n.ahead" is not a graphical parameter
Warning in axis(2, ...): "n.ahead" is not a graphical parameter
Warning in box(...): "n.ahead" is not a graphical parameter
Discuss: The plot shows the forecast for the next 20 values, indicated by the blue region.
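The warnings in the output above say that `n.ahead` was treated as a graphical parameter. In base R the forecast horizon is an argument of `predict()`, so a forecast plot can be sketched as below. The series and the ARIMA(1,1,1) order are simulated stand-ins, and the 1.96 multiplier gives an approximate 95% band.

```r
set.seed(99)
x <- ts(cumsum(rnorm(120)), start = c(2010, 1), frequency = 12)  # simulated series
fit <- stats::arima(x, order = c(1, 1, 1), method = "ML")

h <- 20
fc <- predict(fit, n.ahead = h)    # n.ahead belongs to predict(), not plot()
upper <- fc$pred + 1.96 * fc$se    # approximate 95% upper limit
lower <- fc$pred - 1.96 * fc$se    # approximate 95% lower limit

# Observed series, point forecast and interval limits on one plot.
ts.plot(x, fc$pred, upper, lower,
        gpars = list(col = c("black", "blue", "grey", "grey"),
                     lty = c(1, 1, 2, 2)))
```

Because the model is fitted to the original series with d = 1, the forecasts come back on the original scale and need no manual undoing of the differencing.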
The report describes the step-by-step procedure for time series analysis and forecasting using various methodologies. It covers two datasets, one seasonal and one non-seasonal, which are loaded in R and analyzed with the libraries TSA, tseries, forecast and rugarch. A major part of the report discusses how important the ACF and PACF plots are in deducing the parameters of the ARIMA model. A stationarity check is mandatory before fitting these models. Finding seasonality and related features requires a clear understanding of three kinds of plots: the ACF, the PACF and the time series plot. By fitting different models with varying p, q and differencing values we obtain sigma^2, the log-likelihood and the AIC, which are then compared. The best model is the one with the smallest AIC and the highest log-likelihood. Residual diagnostics are then run on the best model to check normality and autocorrelation, with a histogram used to see whether the residuals are normally distributed or skewed. The Q-Q plot was especially useful for understanding the behaviour of outliers. In the last section we forecast on the original time series to see how the model predicts future values over a chosen horizon, for example 20 months.
I would like to thank the Professor for providing a clear roadmap for analyzing any time series dataset. His step-by-step procedure and algorithm really helped me work through both of these datasets. One challenge was preprocessing the seasonal dataset, which came in a different layout and had to be converted to time series format. There were several ways to do this: whether to use monthly or yearly data, which category to choose, and so on. Finding seasonal trends was another challenge, as the pins in the ACF plot were not very prominent; I therefore resampled the series into yearly segments to see the monthly pattern across multiple years, rather than analyzing the entire dataset at once. Model selection and fitting was a further challenge, which I handled through a clear understanding of the ACF and PACF plots. Understanding the residual diagnostics required a solid grasp of the statistical methods, so I read about these techniques in depth before applying them to the datasets.