1 Dependence- It refers to the association of two observations of the same variable at prior time periods.
2 Stationarity- It shows the mean value of the series that remains constant over the time period. If past effects accumulate and the values increase towards infinity then stationarity is not met.
3 Differencing- Differencing is used to make the series stationary and to control the auto-correlations. There may be some cases in time series analyses where we do not require differencing and over-differenced series can produce wrong estimates.
4 Specification - It may involve the testing of the linear or non-linear relationships of dependent variables by using time series models such as ARIMA models.
5 Exponential Smoothing - Exponential smoothing in time series analysis predicts the one next period value based on the past and current value. It involves averaging of data such that the non-systematic components of each individual case or observation cancel out each other. The exponential smoothing method is used to predict the short term prediction.
6 ARIMA - ARIMA stands for Auto Regressive Integrated Moving Average.
some useful packages to do our analysis and forcasting
library(tidyverse)
library(fpp3)
library(lubridate)
loading data set…..
amzn1=read.csv("D:\\wallpapers and photos\\csv\\time series\\Amazon.csv")
This data set is about amazon stock market details from 1997-06-13 to 2001-05-02.
Converting the dataset into tsibble will help us to do further time series analysis.
amzn=amzn1 %>%
mutate(Date=ymd(Date)) %>%
as_tsibble(index = Date) %>%
fill_gaps(.full=TRUE) %>%
fill(everything(),.direction = "updown")
amzn
## # A tsibble: 8,479 x 7 [1D]
## Date Open High Low Close Adj.Close Volume
## <date> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1997-05-15 2.44 2.5 1.93 1.96 1.96 72156000
## 2 1997-05-16 1.97 1.98 1.71 1.73 1.73 14700000
## 3 1997-05-17 1.76 1.77 1.62 1.71 1.71 6106800
## 4 1997-05-18 1.76 1.77 1.62 1.71 1.71 6106800
## 5 1997-05-19 1.76 1.77 1.62 1.71 1.71 6106800
## 6 1997-05-20 1.73 1.75 1.64 1.64 1.64 5467200
## 7 1997-05-21 1.64 1.65 1.38 1.43 1.43 18853200
## 8 1997-05-22 1.44 1.45 1.31 1.40 1.40 11776800
## 9 1997-05-23 1.41 1.52 1.33 1.5 1.5 15937200
## 10 1997-05-24 1.51 1.65 1.46 1.58 1.58 8697600
## # … with 8,469 more rows
summary of the data
summary(amzn)
## Date Open High Low
## Min. :1997-05-15 Min. : 1.406 Min. : 1.448 Min. : 1.312
## 1st Qu.:2003-03-04 1st Qu.: 38.040 1st Qu.: 38.530 1st Qu.: 37.265
## Median :2008-12-22 Median : 83.160 Median : 84.800 Median : 81.250
## Mean :2008-12-22 Mean : 371.984 Mean : 376.422 Mean : 367.441
## 3rd Qu.:2014-10-11 3rd Qu.: 359.890 3rd Qu.: 362.810 3rd Qu.: 356.175
## Max. :2020-07-31 Max. :3251.060 Max. :3344.290 Max. :3151.000
## Close Adj.Close Volume
## Min. : 1.396 Min. : 1.396 Min. : 487200
## 1st Qu.: 37.950 1st Qu.: 37.950 1st Qu.: 3653450
## Median : 83.050 Median : 83.050 Median : 5625700
## Mean : 372.297 Mean : 372.297 Mean : 7348637
## 3rd Qu.: 359.780 3rd Qu.: 359.780 3rd Qu.: 8407300
## Max. :3200.000 Max. :3200.000 Max. :104329200
library(patchwork)
p1=ggplot(amzn,aes(x=Open))+
geom_density(fill="violet")
p2=ggplot(amzn,aes(x=High))+
geom_density(fill="purple")
p3=ggplot(amzn,aes(x=Low))+
geom_density(fill="yellow")
p4=ggplot(amzn,aes(x=Close))+
geom_density(fill="lightblue")
(p1+p2)/(p3+p4)
which shows the distribution of closing,opening,hightest stock, lowest stock in a day is pretty much same . Perhaps there is no outlier
library(plotly)
plot1=ggplot(amzn,aes(x=Date,y=Close))+
geom_line() + theme(panel.grid.major = element_line(colour = "darkslategray2",
linetype = "dashed"), panel.grid.minor = element_line(colour = "darkslategray2",
linetype = "dashed"), panel.background = element_rect(colour = "white"),
plot.background = element_rect(fill = "aliceblue",
colour = "aliceblue")) +labs(title = "Date vs Closing of the day plot")
ggplotly(plot1)
amzn %>%
gg_season(Close,period = "month",labels = "both")
above plot shows closing stock of the day is increasing throughout every year.
Hour of day
Day of month
Weekly
Monthly
Yearly
However, if we want a more definitive inspection of the seasonality, use the Autocorrelation Function (ACF) plot. There is a strong seasonal pattern, the ACF plot usually reveals definitive repeated spikes at the multiples of the seasonal window
acf_plot=amzn %>%
ACF(Close,lag_max = 100) %>%
autoplot()
ggplotly(acf_plot)
Stl_graph=amzn %>%
model(STL(Close)) %>%
components() %>%
autoplot()
ggplotly(Stl_graph)
amzn %>%
model(STL(Close)) %>%
components()
## # A dable: 8,479 x 8 [1D]
## # Key: .model [1]
## # : Close = trend + season_year + season_week + remainder
## .model Date Close trend season_week season_year remainder season_…¹
## <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 STL(Close) 1997-05-15 1.96 0.719 0.0350 0.223 0.981 1.70
## 2 STL(Close) 1997-05-16 1.73 0.741 0.00770 0.225 0.755 1.50
## 3 STL(Close) 1997-05-17 1.71 0.762 0.0359 0.407 0.503 1.27
## 4 STL(Close) 1997-05-18 1.71 0.784 0.0226 0.471 0.431 1.21
## 5 STL(Close) 1997-05-19 1.71 0.805 0.0210 0.0317 0.850 1.66
## 6 STL(Close) 1997-05-20 1.64 0.827 -0.0414 -0.223 1.07 1.90
## 7 STL(Close) 1997-05-21 1.43 0.848 -0.0786 -0.483 1.14 1.99
## 8 STL(Close) 1997-05-22 1.40 0.870 0.0364 -1.07 1.56 2.43
## 9 STL(Close) 1997-05-23 1.5 0.891 0.0111 -1.08 1.68 2.57
## 10 STL(Close) 1997-05-24 1.58 0.912 0.0303 -1.72 2.36 3.28
## # … with 8,469 more rows, and abbreviated variable name ¹season_adjust
that means there is good trend in the data after 2015 and in augest it is picking highest seasonality
splitting the data into train data for modeling
train=amzn %>%
filter(Date <= "1999-09-17" )
doing some simple model some times can be useful for some sanity check
train %>%
model(mean=MEAN(Close),
snaive=SNAIVE(Close),
rw=RW(Close~drift())) %>%
forecast(h="12 month") %>%
autoplot(train)
#SNAIVE
fit_snaive=train %>%
model(SNAIVE(Close)) %>%
forecast()
train %>%
model(SNAIVE(Close)) %>%
report()
## Series: Close
## Model: SNAIVE
##
## sigma^2: 38.2706
accuracy(fit_snaive,amzn)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SNAIVE(Close) Test 2.44 6.84 3.98 2.84 5.31 1.19 1.10 0.626
train %>%
model(SNAIVE(Close)) %>%
gg_tsresiduals()
as we can see visually those models are a bad fit, except SNAIVE
# STL DECOMPOSITION MODEL WITH SEASON ADJUST
fit_dcmp=train %>%
model(stlf=decomposition_model(STL(Close),NAIVE(season_adjust)))%>%
forecast(h=" 3 month")
fit_dcmp %>%
autoplot(train)
train %>%
model(stlf=decomposition_model(STL(Close),NAIVE(season_adjust))) %>%
report()
## Series: Close
## Model: STL decomposition model
## Combination: season_adjust + season_year + season_week
##
## ======================================================
##
## Series: season_adjust + season_year
## Model: COMBINATION
## Combination: season_adjust + season_year
##
## ========================================
##
## Series: season_adjust
## Model: NAIVE
##
## sigma^2: 2.0537
##
## Series: season_year
## Model: SNAIVE
##
## sigma^2: 0.0065
##
##
## Series: season_week
## Model: SNAIVE
##
## sigma^2: 0.0216
fit_dcmp %>%
accuracy(amzn)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 stlf Test 14.9 17.6 15.1 17.4 17.8 4.51 2.84 0.896
#TSLM
fit_tslm=train %>%
model(tslm=TSLM(Close~trend()+season() )) %>%
forecast(h="1 year")
fit_tslm %>%
autoplot(train)
train %>%
model(tslm=TSLM(Close~trend()+season() )) %>%
report()
## Series: Close
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -21.362 -9.465 -2.723 8.456 50.872
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.021641 1.469354 -9.543 <2e-16 ***
## trend() 0.095587 0.001884 50.730 <2e-16 ***
## season()week2 0.440766 1.737036 0.254 0.800
## season()week3 0.230994 1.740593 0.133 0.894
## season()week4 0.135407 1.740591 0.078 0.938
## season()week5 0.039821 1.740591 0.023 0.982
## season()week6 -0.024484 1.740593 -0.014 0.989
## season()week7 0.153314 1.740598 0.088 0.930
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.62 on 848 degrees of freedom
## Multiple R-squared: 0.7522, Adjusted R-squared: 0.7501
## F-statistic: 367.7 on 7 and 848 DF, p-value: < 2.22e-16
fit_tslm %>%
accuracy(amzn)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 tslm Test -24.0 36.2 29.2 -58.0 63.9 8.73 5.83 0.990
visually judging this is also a bad fit, because it picking seasonality very badly
ETS computes a weighted average over all observations in the input time series dataset as its prediction. The weights are exponentially decreasing over time, rather than the constant weights in simple moving average methods. The weights are dependent on a constant parameter, which is known as the smoothing parameter
#ETS
fit_ets=train %>%
model(ets=ETS(Close) ) %>%
forecast(h="6 month")
train %>%
model(ets=ETS(Close) ) %>%
report()
## Series: Close
## Model: ETS(M,A,N)
## Smoothing parameters:
## alpha = 0.9999
## beta = 0.004733842
##
## Initial states:
## l[0] b[0]
## 1.940126 0.01257808
##
## sigma^2: 0.0027
##
## AIC AICc BIC
## 5190.770 5190.841 5214.532
fit_ets %>% autoplot(train)
accuracy(fit_ets,amzn)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ets Test 5.69 13.6 11.0 5.52 13.8 3.30 2.19 0.948
#arima
amzn %>%
features(Close,unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 39.6 0.01
amzn %>%
features(Close %>% difference() %>% difference() ,unitroot_ndiffs)
## # A tibble: 1 × 1
## ndiffs
## <int>
## 1 0
amzn %>%
gg_tsdisplay(Close %>% difference() %>% difference() ,plot_type = "partial")
amzn %>%
features(Close %>% difference() %>% difference() ,unitroot_kpss)
## # A tibble: 1 × 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.0238 0.1
thus p value>0.05 meaning the data is stationary.
#arima
fit_arima=train %>%
model(arima=ARIMA(Close,stepwise = F,approximation = F)) %>%
forecast(h="6 month")
fit_arima %>%
autoplot(train)
train %>%
model(arima=ARIMA(Close,stepwise = F,approximation = F)) %>%
report()
## Series: Close
## Model: ARIMA(2,1,2)
##
## Coefficients:
## ar1 ar2 ma1 ma2
## 1.0299 -0.7904 -1.0300 0.8676
## s.e. 0.0773 0.0827 0.0651 0.0658
##
## sigma^2 estimated as 4.554: log likelihood=-1859.36
## AIC=3728.72 AICc=3728.79 BIC=3752.47
accuracy(fit_arima,amzn)
## # A tibble: 1 × 10
## .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 arima Test 12.1 16.3 12.3 14.4 14.7 3.67 2.62 0.934
Difference between ETS and ARIMA time seriestrain %>%
model(SNAIVE(Close)) %>%
gg_tsresiduals()
train %>%
model(arima=ARIMA(Close,stepwise = F,approximation = F)) %>%
gg_tsresiduals()
The Ljung-Box Q-test is a “portmanteau” test that assesses the null hypothesis that a series of residuals exhibits no autocorrelation for a fixed number of lags L (see Lags ), against the alternative that some autocorrelation coefficient ρ(k), k = 1, ..., L is nonzero
The Ljung–Box test is a type of statistical test of whether any of a group of autocorrelations of a time series are different from zero.
H0=model is fine
H1=model shows lack of fit
res1=train %>%
model(snaive=SNAIVE(Close)) %>%
augment() %>%
features(.innov,ljung_box,lag=10)
res2=train %>%
model(arima=ARIMA(Close,stepwise = F,approximation = F)) %>%
augment() %>%
features(.innov,ljung_box,lag=10)
res3=train %>%
model(ets=ETS(Close)) %>%
augment() %>%
features(.innov,ljung_box,lag=10)
res1
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 snaive 1615. 0
res2
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 arima 13.2 0.215
res3
## # A tibble: 1 × 3
## .model lb_stat lb_pvalue
## <chr> <dbl> <dbl>
## 1 ets 11.7 0.308
meaning SNAIVE shows lack of fit. ETS and ARIMA passed the “portmanteau” test