In this project, we started with an empirical analysis of five key time-series variables from the WDI package to evaluate whether people’s quality of life has actually improved in India and around the world. Next, we focused on South Asia (especially India), with GDP per capita as our development indicator of interest. We covered identification of the data source, summary statistics of the data, and visualizations that describe the behavior and distribution of the time series.
For the second section, we focused on building an ARIMA model for our time-series dataset. We assessed whether the data was variance/mean stationary by analyzing the variance over time and using statistical tests (ADF and KPSS). Next, we applied transformations to induce variance/mean stationarity (log/Box-Cox and differencing). We also examined the ACF/PACF plots for autocorrelation and seasonality. An appropriate ARIMA model was selected based on AIC/BIC. Finally, we analyzed the model residuals, assessed in-sample model performance and produced a forecast at a meaningful time horizon.
For the third section, we focused on building a Prophet model for our time-series dataset. It involved decomposition of the elements of the time series (trend, seasonality, etc.), assessment of whether the model should take into account a saturating minimum/maximum point, and identification and assessment of seasonality (daily, weekly, yearly, as well as additive/multiplicative). We also selected a “best” model, assessed in-sample model performance and produced a forecast at a meaningful time horizon.
For the final section, we focused on assessing and comparing the performance of the selected ARIMA and Prophet models. It involved comparing the models using a single train/test split or a rolling-window cross-validation approach, with RMSE, MAE and/or MAPE as appropriate; identifying the model (ARIMA or Prophet) that performs best out of sample; and a final assessment of the forecast from the selected model.
Data Source Description: The dataset selected for this project is the “World Development Indicators and Other World Bank Data”, loaded through the WDI package. The WDI package provides convenient access to over 40 databases hosted by the World Bank, including the World Development Indicators (WDI), International Debt Statistics, Doing Business, Human Capital Index, and Sub-national Poverty indicators. The database contains about 17,330 topics for various countries over several years. The complete dataset has a total of 16,532 observations.
I decided to analyze the WDI dataset based on our classroom discussion on this package which piqued my curiosity regarding this dataset and the wealth of information captured by this single dataset.
Summary Statistics: This code chunk will call the WDI API and fetch the years 1960 through 2018, as available. We will find that only a few variables have data for 2020.
This dataset has a total of 15 variables.
## Rows: 16,532
## Columns: 15
## $ iso2c <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ~
## $ country <chr> "Global Partnership for Education", "Global Partnership ~
## $ year <int> 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 19~
## $ birth <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ lifeexp <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ inf_mort <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ gdpPercap <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ yrs_schooling <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ iso3c <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ region <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ capital <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ longitude <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ latitude <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ income <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ lending <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
Summary of five key time-series variables from the WDI Package.
## region year birthrate
## South Asia :101 Min. :1960 Min. :10.47
## East Asia & Pacific : 70 1st Qu.:1975 1st Qu.:20.38
## Aggregates : 66 Median :1990 Median :29.26
## Sub-Saharan Africa : 66 Mean :1990 Mean :28.96
## Middle East & North Africa: 63 3rd Qu.:2005 3rd Qu.:37.73
## Europe & Central Asia : 61 Max. :2020 Max. :47.13
## (Other) :117
## life_exp infant gdpPercap yrs_schooling
## Min. :40.99 Min. : 4.90 Min. : 368.4 Min. : 1.894
## 1st Qu.:57.16 1st Qu.: 23.84 1st Qu.: 2424.1 1st Qu.: 2.739
## Median :65.48 Median : 41.96 Median : 6631.3 Median : 5.546
## Mean :63.98 Mean : 51.30 Mean :12912.0 Mean : 5.965
## 3rd Qu.:71.68 3rd Qu.: 70.89 3rd Qu.:14584.0 3rd Qu.: 7.904
## Max. :80.69 Max. :171.27 Max. :70255.2 Max. :12.750
##
| | vars | n | mean | sd | min | max | range | se |
|---|---|---|---|---|---|---|---|---|
| birthrate | 1 | 544 | 28.96167 | 10.320162 | 10.46667 | 47.13421 | 36.66755 | 0.4424733 |
| life_exp | 2 | 544 | 63.97716 | 9.401340 | 40.98700 | 80.68935 | 39.70234 | 0.4030791 |
| infant | 3 | 544 | 51.29561 | 36.900487 | 4.90000 | 171.26667 | 166.36667 | 1.5820953 |
| gdpPercap | 4 | 544 | 12911.94946 | 16526.007464 | 368.38498 | 70255.24633 | 69886.86135 | 708.5467245 |
| yrs_schooling | 5 | 544 | 5.96473 | 2.880443 | 1.89375 | 12.75000 | 10.85625 | 0.1234980 |
The graphs here show the change in 5 key development indicators (birth rate, life expectancy, infant mortality rate, GDP per capita and years of schooling) over the past 60 years. The plots are faceted into 7 regions. Line charts efficiently show how each indicator changes across multiple decades. In each individual chart, we can see clear downward trends in birth rate and infant mortality rate and upward trends in GDP per capita, years of schooling and life expectancy. Comparing across regions, Sub-Saharan Africa has the highest birth rate and infant mortality rate, while North America has the highest GDP per capita, as expected.
Note the obvious downward trends in birth rate and infant mortality rate and the general upward trends in GDP per capita, years of schooling and life expectancy, especially in South Asia and Sub-Saharan Africa. The impact of the Covid-19 recession is also visible for multiple development indicators across most regions of the world.
We now develop a number of plots that describe the behavior of our key time-series development variables from the WDI dataset:
Note the improvement in health and child-growth indicators across the world. We now focus on South Asia (especially India), with GDP per capita as our development variable of interest.
The following plots describe the behavior of our development variable of interest:
This code chunk will call the WDI API and fetch the GDP Per capita data for years 1960 through 2021, as available.
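library(WDI)    # World Bank API client
library(dplyr)  # provides %>% and as_tibble()

# Fetch GDP per capita (NY.GDP.PCAP.KD) for all countries, 1960-2021
wdi_data = WDI(indicator = c('gdpPercap' = "NY.GDP.PCAP.KD"),
               start = 1960, end = 2021,
               extra = TRUE) %>%
  as_tibble()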
This code chunk then filters wdi_data to the rows where country == 'India'.
# Filter the data for Country == 'India' and sort by year
india_gdpPercap_data = wdi_data %>%
  filter(country == 'India') %>%
  arrange(year)
Assessment of variance/mean stationarity: To assess the behavior of the data-generating process of our time series, we start by looking at plots of the data used in the analysis. As the plots show, the series is variance non-stationary, but it may not be mean non-stationary.
Next, we do a rolling-window calculation, sliding a 5-year window across the series and transforming each value into a rolling mean and a rolling standard deviation. Both the rolling mean and the rolling standard deviation increase over time, which suggests both mean and variance non-stationarity.
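The rolling-window calculation described above can be sketched in base R. This is a minimal illustration on a synthetic series, not the WDI data; the helper name `roll_stat` and the right-aligned window are assumptions made for the example.

```r
# Synthetic stand-in for an annual series whose level and spread both grow
# (hypothetical data, not the WDI values).
set.seed(1)
years <- 1960:2020
y <- 300 * exp(0.03 * (years - 1960)) * exp(rnorm(length(years), sd = 0.02))

# Right-aligned rolling statistic over a window of k observations.
# The first k - 1 positions have no full window and stay NA.
roll_stat <- function(x, k, f) {
  out <- rep(NA_real_, length(x))
  for (i in k:length(x)) {
    out[i] <- f(x[(i - k + 1):i])
  }
  out
}

roll_mean <- roll_stat(y, 5, mean)  # 5-year rolling mean
roll_sd   <- roll_stat(y, 5, sd)    # 5-year rolling standard deviation

# A rising rolling mean points to mean non-stationarity;
# a rising rolling sd points to variance non-stationarity.
head(data.frame(year = years, roll_mean, roll_sd), 8)
```

Plotting `roll_mean` and `roll_sd` against `years` gives the kind of picture the paragraph describes.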
Note the obvious drop in GDP per capita around 2020, due to the impact of the Covid-19 recession across all sectors of the Indian economy.
Transformations of the data to induce variance/mean stationarity (log/Box-Cox and differencing): As the data appear to be both mean and variance non-stationary, we transform the time series using a natural log or Box-Cox transformation. Because variance non-stationarity must be addressed before mean non-stationarity, we start by making the series variance stationary.
Applying the log and Box-Cox transformations to India's GDP per capita gives very similar results. We use the Box-Cox transformation, as it seems to do a slightly better job of stabilizing the variance.
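For reference, the Box-Cox transformation and its inverse can be written in a few lines of base R. This is a sketch only: the function names `box_cox` and `inv_box_cox` are hypothetical, and the value of lambda used below is an arbitrary illustration, not the one estimated in this analysis.

```r
# Box-Cox transform: (x^lambda - 1) / lambda, with log(x) as the lambda -> 0 limit.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))  # only defined for positive values
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Inverse transform, needed to bring fitted values and forecasts
# back to the original scale.
inv_box_cox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

x <- c(300, 310, 325, 340)    # hypothetical positive values
z <- box_cox(x, lambda = 0.5) # illustrative lambda
inv_box_cox(z, lambda = 0.5)  # recovers x
```

With lambda = 0 the transform is exactly the natural log, which is why the two transformations often look so similar in practice.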
First, we test for non-stationarity using the Augmented Dickey-Fuller (ADF) test. Its null hypothesis is a unit root (non-stationarity), so a p-value > 0.05 means we cannot reject non-stationarity.
Next, we test using the KPSS test (preferred approach). Its null hypothesis is level stationarity, so a p-value < 0.05 indicates non-stationarity.
As the results of the ADF test (p = 0.5124) and the KPSS test (p = 0.01) agree, we conclude that the series is mean non-stationary.
##
## Augmented Dickey-Fuller Test
##
## data: gdpPercap_var_transform$gdpPercap_box
## Dickey-Fuller = -2.1544, Lag order = 3, p-value = 0.5124
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: gdpPercap_var_transform$gdpPercap_box
## KPSS Level = 1.5962, Truncation lag parameter = 3, p-value = 0.01
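Because the two tests have opposite null hypotheses, it is easy to misread them together. The sketch below encodes the decision rule used in this report as a small base-R helper; the function name `stationarity_verdict` is hypothetical, and the p-values passed in are the ones printed in the test output.

```r
# Combine ADF and KPSS p-values into a single verdict.
#   ADF  H0: unit root (non-stationary) -> small p suggests stationarity
#   KPSS H0: level stationary           -> small p suggests non-stationarity
stationarity_verdict <- function(adf_p, kpss_p, alpha = 0.05) {
  adf_stationary  <- adf_p < alpha
  kpss_stationary <- kpss_p >= alpha
  if (adf_stationary && kpss_stationary) {
    "stationary"
  } else if (!adf_stationary && !kpss_stationary) {
    "non-stationary"
  } else {
    "tests disagree"
  }
}

# Box-Cox-transformed series, before differencing
stationarity_verdict(adf_p = 0.5124, kpss_p = 0.01)
```

When the helper returns "tests disagree", this report takes the conservative route and treats the series as still mean non-stationary.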
Examination of ACF/PACF plots for autocorrelation and seasonality: Next, we look for the order of integration, which is the number of differences required to make the series mean stationary. We start with one difference (I(1)) and check the results before trying a second difference:
##
## Augmented Dickey-Fuller Test
##
## data: gdpPercap_mean_transform$gdpPercap_box_diff[!is.na(gdpPercap_mean_transform$gdpPercap_box_diff)]
## Dickey-Fuller = -0.81635, Lag order = 2, p-value = 0.9476
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: gdpPercap_mean_transform$gdpPercap_box_diff[!is.na(gdpPercap_mean_transform$gdpPercap_box_diff)]
## KPSS Level = 0.31214, Truncation lag parameter = 2, p-value = 0.1
Examining the ACF/PACF to see what type of model may be appropriate, we find that they do not give us much information about the AR/MA process here. The ACF shows no significant, gradually dampening autocorrelation. The PACF lags do not indicate the order of an AR process, nor do they show the oscillating pattern that would suggest an MA process.
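As a rough illustration of the differencing and ACF/PACF steps, the following base-R sketch uses a synthetic random walk with drift in place of the transformed GDP series. The variable names and the approximate white-noise band are assumptions made for the example.

```r
set.seed(2)
z <- cumsum(rnorm(60, mean = 0.5))  # synthetic random walk with drift

d1 <- diff(z, differences = 1)  # first difference, I(1)
d2 <- diff(z, differences = 2)  # second difference, I(2)

# Sample ACF/PACF of the first difference (no plotting)
a <- acf(d1, lag.max = 10, plot = FALSE)
p <- pacf(d1, lag.max = 10, plot = FALSE)

# Approximate 95% band for white noise: +/- 1.96 / sqrt(n)
bound <- qnorm(0.975) / sqrt(length(d1))

# Lags (1..10) whose sample autocorrelation falls outside the band.
# a$acf includes lag 0 as its first element, so it is dropped here.
which(abs(a$acf[-1]) > bound)
which(abs(p$acf) > bound)
```

If few or no lags fall outside the band, the differenced series looks close to white noise, which is consistent with the ACF/PACF being uninformative about AR or MA orders.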
Next, we try a second difference:
Again, we test for non-stationarity using the Augmented Dickey-Fuller (ADF) test: a p-value > 0.05 indicates non-stationarity.
We then test using the KPSS test (preferred approach): a p-value > 0.05 indicates stationarity.
As the results of the ADF and KPSS tests do not match, we assume mean non-stationarity.
##
## Augmented Dickey-Fuller Test
##
## data: gdpPercap_mean_transform1$gdpPercap_box_diff[!is.na(gdpPercap_mean_transform1$gdpPercap_box_diff)]
## Dickey-Fuller = -0.81635, Lag order = 2, p-value = 0.9476
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: gdpPercap_mean_transform1$gdpPercap_box_diff[!is.na(gdpPercap_mean_transform1$gdpPercap_box_diff)]
## KPSS Level = 0.31214, Truncation lag parameter = 2, p-value = 0.1
Selection of an appropriate ARIMA model based on AIC/BIC: Since the ACF/PACF do not give us much information about the AR/MA process, the series may follow a random walk, ARIMA(0,1,0). We next fit several ARIMA models to our gdpPercap time series and determine which model is “best” according to AIC and BIC.
Note that the models are fit to the Box-Cox-transformed series.
##
## Call:
## arima(x = gdpPercap_mean_transform$gdpPercap_box, order = c(0, 1, 1))
##
## Coefficients:
## ma1
## 0.6932
## s.e. 0.1166
##
## sigma^2 estimated as 3.005e-06: log likelihood = 123.14, aic = -242.28
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.001162192 0.001750371 0.001501162 0.053958 0.0695625 0.6688944
## ACF1
## Training set -0.1906989
##
## Call:
## arima(x = gdpPercap_mean_transform$gdpPercap_box, order = c(0, 1, 0))
##
##
## sigma^2 estimated as 5.67e-06: log likelihood = 115.53, aic = -229.06
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.002006089 0.002371993 0.002239897 0.09302686 0.1037458 0.998063
## ACF1
## Training set 0.1324183
##
## Call:
## arima(x = gdpPercap_mean_transform$gdpPercap_box, order = c(1, 1, 0))
##
## Coefficients:
## ar1
## 0.8468
## s.e. 0.1055
##
## sigma^2 estimated as 1.783e-06: log likelihood = 129.36, aic = -254.71
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0002215809 0.0013746 0.001056106 0.01041494 0.04899651 0.4705842
## ACF1
## Training set -0.09277688
##
## Call:
## arima(x = gdpPercap_mean_transform$gdpPercap_box, order = c(0, 2, 1))
##
## Coefficients:
## ma1
## -0.4406
## s.e. 0.3712
##
## sigma^2 estimated as 1.779e-06: log likelihood = 124.71, aic = -245.42
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.0003442775 0.001410643 0.001005344 -0.0158962 0.04661231
## MASE ACF1
## Training set 0.4479655 0.04467136
##
## Call:
## arima(x = gdpPercap_mean_transform$gdpPercap_box, order = c(0, 2, 0))
##
##
## sigma^2 estimated as 1.91e-06: log likelihood = 123.97, aic = -245.93
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.0003095607 0.001452905 0.001009851 -0.0142933 0.04685949
## MASE ACF1
## Training set 0.449974 -0.1012526
##
## Call:
## arima(x = gdpPercap_mean_transform$gdpPercap_box, order = c(1, 2, 0))
##
## Coefficients:
## ar1
## -0.2337
## s.e. 0.2581
##
## sigma^2 estimated as 1.843e-06: log likelihood = 124.37, aic = -244.73
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -0.0003261831 0.001431483 0.001008458 -0.01506122 0.04676878
## MASE ACF1
## Training set 0.4493532 -0.01220372
## df AIC
## summary(mod1) 2 -242.2816
## summary(mod2) 1 -229.0623
## summary(mod3) 2 -254.7148
## summary(mod4) 2 -245.4236
## summary(mod5) 1 -245.9328
## summary(mod6) 2 -244.7329
## df BIC
## summary(mod1) 2 -239.8439
## summary(mod2) 1 -227.8435
## summary(mod3) 2 -252.2770
## summary(mod4) 2 -243.0675
## summary(mod5) 1 -244.7548
## summary(mod6) 2 -242.3768
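The comparison above can be reproduced in outline with `stats::arima`, `AIC` and `BIC`. The sketch below fits the same six orders to a synthetic series, so the numbers it prints are not the ones in this report; `method = "ML"` is an assumption chosen here for numerical robustness.

```r
set.seed(3)
# Synthetic I(1) series with AR(1) increments (not the WDI data)
y <- as.numeric(cumsum(arima.sim(list(ar = 0.6), n = 60)))

# The six candidate orders considered in this section
orders <- list(c(0, 1, 1), c(0, 1, 0), c(1, 1, 0),
               c(0, 2, 1), c(0, 2, 0), c(1, 2, 0))

fits <- lapply(orders, function(o) arima(y, order = o, method = "ML"))

tab <- data.frame(
  model = vapply(orders,
                 function(o) sprintf("ARIMA(%d,%d,%d)", o[1], o[2], o[3]),
                 character(1)),
  AIC = vapply(fits, AIC, numeric(1)),
  BIC = vapply(fits, BIC, numeric(1))
)

tab[order(tab$AIC), ]  # lowest AIC first
```

One caveat worth keeping in mind: information criteria are computed on the differenced data, so comparisons between models with different orders of differencing (d = 1 versus d = 2) are generally treated as indicative rather than strict.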
Analysis of model residuals: Of the six candidates, ARIMA(1,1,0) (mod3) has the lowest AIC (-254.71) and BIC (-252.28), so we examine its residuals. A residual is the difference between what we observed and what the model predicted. There appears to be no unmodeled variation left, so we may not need to try many more models.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(1,1,0)
## Q* = 2.0158, df = 4, p-value = 0.7329
##
## Model df: 1. Total lags used: 5
The ACF plot above shows no first-, second- or third-order autocorrelation in the residuals, and the residuals appear fairly normal. We seem to have done a good job of modeling the short-term behavior of this time series.
Next, we run the Ljung-Box test for residual autocorrelation at several lags.
A p-value < 0.05 indicates residual autocorrelation somewhere up to that lag.
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.25065, df = 1, p-value = 0.6166
##
## Box-Ljung test
##
## data: resid
## X-squared = 2.0158, df = 5, p-value = 0.847
##
## Box-Ljung test
##
## data: resid
## X-squared = 2.781, df = 10, p-value = 0.9861
##
## Box-Ljung test
##
## data: resid
## X-squared = 10.747, df = 20, p-value = 0.9525
##
## Box-Ljung test
##
## data: resid
## X-squared = 47.224, df = 25, p-value = 0.004613
The Ljung-Box results indicate residual autocorrelation only when lags up to 25 are included (p = 0.0046).
The small remaining autocorrelation (at lag 25) suggests that we may have the “best” model. The lower-order lags are not statistically significant, which suggests that we have been successful in picking up most of the fluctuation in the data-generating process.
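The sequence of Ljung-Box tests above can be run in a loop with `stats::Box.test`. The sketch uses a synthetic series and a hypothetical ARIMA(1,1,0) fit, so its p-values will differ from those printed in this report.

```r
set.seed(4)
y <- as.numeric(cumsum(arima.sim(list(ar = 0.6), n = 60)))  # synthetic series
fit <- arima(y, order = c(1, 1, 0), method = "ML")
res <- residuals(fit)

lags <- c(1, 5, 10, 20, 25)

# Each test is a joint test of the autocorrelations at lags 1..h.
pvals <- vapply(
  lags,
  function(h) Box.test(res, lag = h, type = "Ljung-Box")$p.value,
  numeric(1)
)

data.frame(lag = lags, p_value = round(pvals, 4))
```

Two design notes, offered as general guidance rather than as a description of what was run here: `Box.test` accepts a `fitdf` argument to subtract the number of estimated ARMA parameters from the degrees of freedom, and with only about 60 observations a test at lag 25 uses a large share of the sample, so its result is usually read with more caution than the low-lag tests.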
## Series: gdpPercap_mean_transform$gdpPercap_box
## ARIMA(0,1,0) with drift
##
## Coefficients:
## drift
## 2e-03
## s.e. 3e-04
##
## sigma^2 = 1.924e-06: log likelihood = 130.84
## AIC=-257.69 AICc=-257.14 BIC=-255.25
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set 8.189298e-05 0.001332554 0.0009385031 0.003959539 0.0434831
## MASE ACF1
## Training set 0.4181823 0.1595336
Assessment of in-sample model performance: In-sample performance is the model's performance on the data used to estimate it. It can be analyzed using a familiar metric, RMSE (root mean square error), which is the square root of the mean squared residual (prediction error). When the residuals have zero mean, RMSE equals their standard deviation.
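To make the link between RMSE and the spread of the residuals concrete, the sketch below computes RMSE on hypothetical numbers and checks that its square splits into squared bias plus residual variance. All values and names here are illustrative assumptions.

```r
rmse <- function(actual, fitted) sqrt(mean((actual - fitted)^2))

actual <- c(300, 310, 320, 335, 350, 340)     # hypothetical observed values
fitted <- actual - c(4, -3, 5, -6, 2, 7)      # hypothetical fitted values

e      <- actual - fitted                     # residuals
bias   <- mean(e)                             # mean error
spread <- sqrt(mean((e - bias)^2))            # population sd of the residuals

# RMSE^2 = bias^2 + spread^2, so RMSE equals the residual sd only when bias is 0
c(rmse = rmse(actual, fitted), bias = bias, spread = spread)
```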
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,0) with drift
## Q* = 1.3384, df = 4, p-value = 0.8548
##
## Model df: 1. Total lags used: 5
##
## Box-Ljung test
##
## data: mod3_auto$residuals
## X-squared = 0.74113, df = 1, p-value = 0.3893
##
## Box-Ljung test
##
## data: mod3_auto$residuals
## X-squared = 0.74113, df = 1, p-value = 0.3893
The RMSE represents the typical size of the model's in-sample error. An RMSE of 58.14 suggests that the model's fitted GDP per capita values are within about 58 units of the observed values on average, in-sample.
## [1] 58.14078
Production of a forecast at a meaningful time horizon: Now we create a future forecast for 2020:2024.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 2.3696, df = 4, p-value = 0.6681
##
## Model df: 1. Total lags used: 5
##
## Box-Ljung test
##
## data: mod1$residuals
## X-squared = 1.059, df = 1, p-value = 0.3034
##
## Box-Ljung test
##
## data: mod1$residuals
## X-squared = 1.059, df = 1, p-value = 0.3034
## [1] 51.27804
To assess the behavior of the data-generating process of our time-series, we start by looking at the plots of the time-series data that we are using for our analysis.
We start by fitting a basic Prophet model, which draws on time-series decomposition and breaks the series down into:
Seasonal components: daily, weekly, monthly, yearly, etc.
Holidays: for daily data
Trend: estimated along the data, with distinct slopes identified using changepoint detection
## Rows: 61
## Columns: 2
## $ year <date> 1960-01-01, 1961-01-01, 1962-01-01, 1963-01-01, 1964-01-01,~
## $ gdpPercap <dbl> 302.6718, 307.7279, 310.3767, 322.2841, 339.2037, 323.4641, ~
Next, we will be plotting Forecast model output and Interactive Charts as below:
Decomposition of the elements of the time series: Prophet's decomposition into trend, seasonality, etc. is very similar to general time-series decomposition.
As seen in the graphs above, we do not see any seasonality in the growth of GDP per capita, but we do observe an increasing trend.
To plot change-points, the algorithm by default considers 25 equally spaced potential change-points as candidates for a change in the trend. It examines the actual rate of change at each candidate and removes those with low rates of change. We are left with 5 potential change-points.
Note: By default, the algorithm will not identify change-points in the final 20% of the data. This prevents the final forecast from trailing off in a different direction.
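The placement of candidate change-points described above (25 equally spaced candidates, none in the final 20% of the data) can be sketched in base R. This illustrates the documented default behavior and is not Prophet's own code; the function name `candidate_changepoints` and its arguments are hypothetical.

```r
# Place up to n_changepoints equally spaced candidates within the first
# changepoint_range share of the (sorted) history.
candidate_changepoints <- function(dates, n_changepoints = 25,
                                   changepoint_range = 0.8) {
  dates  <- sort(dates)
  n_hist <- floor(length(dates) * changepoint_range)  # points eligible for a change
  n_cp   <- min(n_changepoints, n_hist - 1)           # cannot exceed available points
  idx    <- round(seq(1, n_hist, length.out = n_cp + 1))[-1]
  dates[idx]
}

years <- 1960:2020                  # 61 annual observations
cp <- candidate_changepoints(years)
range(cp)                           # no candidate falls in the last 20% of years
```

The pruning step, in which candidates with a negligible change in slope are dropped, happens during model fitting and is not reproduced here.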
Assessment of saturating minimum/maximum point: We can manually specify the number of potential change-points (10 here) to try to improve performance. The algorithm examines the actual rate of change at each potential change-point and again removes those with low rates of change. We are again left with 5 potential change-points.
The change-points identified for the “trend” part of the time series are shown below:
Selection of a “best” model: The forecast is finally made for the period beyond 2020; it fails to account for the Covid-related drop in GDP per capita.
We need to set a floor value for this data, as GDP per capita cannot go below 0. Let us make a two-year forecast to identify the need for saturation points.
We also assess whether the model should take into account a saturating minimum/maximum point. For this model, we set the cap (ceiling) at 2500 and the floor (lower limit) at 0.
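A sketch of how the cap and floor might be supplied is shown below. It assumes the Prophet R package's convention of `cap` and `floor` columns together with `growth = "logistic"`; the series is a synthetic placeholder, and the Prophet calls only run if the package is installed.

```r
# Synthetic placeholder history in Prophet's ds / y layout (not the WDI data)
df <- data.frame(
  ds = as.Date(sprintf("%d-01-01", 1960:2020)),
  y  = 300 * exp(0.03 * (0:60))
)

# Saturation bounds used in this report: ceiling 2500, floor 0
df$cap   <- 2500
df$floor <- 0

if (requireNamespace("prophet", quietly = TRUE)) {
  # Logistic growth is what makes the cap and floor take effect
  m <- prophet::prophet(df, growth = "logistic")

  # Two-year forecast; the future frame needs the same bounds
  future <- prophet::make_future_dataframe(m, periods = 2, freq = "year")
  future$cap   <- 2500
  future$floor <- 0

  fc <- predict(m, future)
  tail(fc[, c("ds", "yhat", "yhat_lower", "yhat_upper")], 2)
}
```

A floor on its own is not enough in this setup: the logistic trend needs a cap as well, which is why both columns are set on the history and on the future frame.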
Prophet models can handle multiple types of seasonality simultaneously. Here we assess seasonality (daily, weekly, yearly, as well as additive/multiplicative). As this is an annual dataset, we do not observe daily, weekly or yearly seasonality.
Additive seasonality has a constant amplitude over time.
Multiplicative seasonality has an amplitude that increases or decreases with the level of the series.
From the above plots, we can see that neither additive nor multiplicative seasonality plays a significant role. As the data is annual, the impact of holidays is not applicable to annual GDP per capita.
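The difference between the two seasonality modes is easiest to see on a synthetic monthly series, since the annual GDP data has no seasonal component to show. Everything below is an illustrative assumption, including the trend, the seasonal shape and the scaling factors.

```r
t      <- 1:120                   # ten years of hypothetical monthly data
trend  <- 100 + 2 * t
season <- sin(2 * pi * t / 12)    # one cycle per year

additive       <- trend + 10 * season           # seasonal swing is a fixed size
multiplicative <- trend * (1 + 0.1 * season)    # seasonal swing scales with the trend

amp <- function(x) diff(range(x))  # peak-to-trough size

amps <- c(
  add_first_year  = amp((additive - trend)[1:12]),
  add_last_year   = amp((additive - trend)[109:120]),
  mult_first_year = amp((multiplicative - trend)[1:12]),
  mult_last_year  = amp((multiplicative - trend)[109:120])
)
round(amps, 2)
```

The additive swing is the same in the first and last year, while the multiplicative swing grows as the trend rises.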
Assessment of in-sample model performance: Here we estimate the in-sample performance of the model based on RMSE, MAE and MAPE.
## [1] "RMSE: 189.26"
## [1] "MAE: 157.07"
## [1] "MAPE: 0.08"
The observed performance metrics are RMSE = 189.26, MAE = 157.07 and MAPE = 0.08. RMSE is always at least as large as MAE; the gap here is primarily due to the outlier values during the Covid period.
We use rolling-window cross-validation to assess the performance of the model at meaningful horizons, depending on the data (e.g. 30 days out if daily, 1 year out if monthly, 5 years out if yearly, etc.). We train on the past 6 years of data to predict the next 5 years of GDP per capita.
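The relationship between RMSE and MAE can be checked on a small example. The numbers below are hypothetical: two sets of forecasts with the same MAE, one with evenly sized errors and one with a single large miss.

```r
rmse <- function(a, f) sqrt(mean((a - f)^2))
mae  <- function(a, f) mean(abs(a - f))
mape <- function(a, f) mean(abs((a - f) / a))  # a fraction, e.g. 0.08 means 8%

actual <- c(1000, 1100, 1200, 1300, 1400)
even   <- actual - c(50, 50, 50, 50, 50)   # every error is 50
spiky  <- actual - c(10, 10, 10, 10, 210)  # one large miss, same average error

rbind(
  even  = c(RMSE = rmse(actual, even),  MAE = mae(actual, even),  MAPE = mape(actual, even)),
  spiky = c(RMSE = rmse(actual, spiky), MAE = mae(actual, spiky), MAPE = mape(actual, spiky))
)
```

Both rows have an MAE of 50, but squaring the errors makes RMSE much larger in the row with the single large miss, which is the pattern an outlier period would produce.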
## # A tibble: 6 x 6
## y ds yhat yhat_lower yhat_upper cutoff
## <dbl> <dttm> <dbl> <dbl> <dbl> <dttm>
## 1 338. 1968-01-01 00:00:00 337. 326. 347. 1967-01-14 00:00:00
## 2 353. 1969-01-01 00:00:00 341. 330. 351. 1967-01-14 00:00:00
## 3 363. 1970-01-01 00:00:00 345. 334. 354. 1967-01-14 00:00:00
## 4 361. 1971-01-01 00:00:00 348. 337. 359. 1967-01-14 00:00:00
## 5 351. 1972-01-01 00:00:00 352. 341. 363. 1967-01-14 00:00:00
## 6 363. 1970-01-01 00:00:00 350. 340. 360. 1969-07-14 12:00:00
## [1] "1967-01-14 00:00:00 GMT" "1969-07-14 12:00:00 GMT"
## [3] "1972-01-13 00:00:00 GMT" "1974-07-13 12:00:00 GMT"
## [5] "1977-01-11 00:00:00 GMT" "1979-07-12 12:00:00 GMT"
## [7] "1982-01-10 00:00:00 GMT" "1984-07-10 12:00:00 GMT"
## [9] "1987-01-09 00:00:00 GMT" "1989-07-09 12:00:00 GMT"
## [11] "1992-01-08 00:00:00 GMT" "1994-07-08 12:00:00 GMT"
## [13] "1997-01-06 00:00:00 GMT" "1999-07-07 12:00:00 GMT"
## [15] "2002-01-05 00:00:00 GMT" "2004-07-05 12:00:00 GMT"
## [17] "2007-01-04 00:00:00 GMT" "2009-07-04 12:00:00 GMT"
## [19] "2012-01-03 00:00:00 GMT"
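The cutoff dates listed above follow a simple rule: start one horizon before the end of the history and step back by a fixed period until less than the initial training window remains. The base-R sketch below illustrates that rule; the function name `make_cutoffs` is hypothetical, and the inputs (a history ending 2017-01-01, a 6-year initial window, a 5-year horizon and a period of half the horizon) are assumptions chosen for the example.

```r
# Rolling-origin cutoffs, working backwards from the end of the history.
# initial, period and horizon are all given in days.
make_cutoffs <- function(dates, initial, period, horizon) {
  cutoff <- max(dates) - horizon
  out <- as.Date(character(0))
  while (cutoff >= min(dates) + initial) {
    out <- c(cutoff, out)        # prepend so the result is in time order
    cutoff <- cutoff - period
  }
  out
}

dates <- as.Date(sprintf("%d-01-01", 1960:2017))  # assumed annual history

cutoffs <- make_cutoffs(
  dates,
  initial = 6 * 365,     # 6 years
  period  = 2.5 * 365,   # half the horizon
  horizon = 5 * 365      # 5 years
)

length(cutoffs)
format(range(cutoffs))
```

Each cutoff defines one train/test split: the model is fit on data up to the cutoff and scored on the observations that fall within the horizon after it.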
Production of a forecast at a meaningful time horizon: The black dots here are the actual values and the colored dots are the predicted values.
Predictions at nearer horizons are closer to the actual values, while those at farther horizons deviate more.
We can generate a table of performance metrics for different time horizons.
## horizon mse rmse mae mape mdape smape
## 1 180.5 days 1084.2005 32.92720 25.87522 0.03661061 0.04154279 0.03740498
## 2 352.0 days 1065.0720 32.63544 24.57795 0.03307085 0.04154279 0.03379142
## 3 354.0 days 1076.4921 32.80994 25.20475 0.03494414 0.04154279 0.03558665
## 4 355.0 days 1112.5325 33.35465 26.80256 0.03888570 0.04465719 0.03965478
## 5 356.0 days 1109.2229 33.30500 26.71980 0.03891580 0.04465719 0.03968632
## 6 357.0 days 1164.2851 34.12162 27.74856 0.04131984 0.04601947 0.04222143
## 7 359.0 days 1079.9222 32.86217 24.88208 0.03671820 0.04601947 0.03750706
## 8 360.0 days 1127.4439 33.57743 25.94557 0.03848221 0.04601947 0.03932728
## 9 361.0 days 853.7364 29.21877 21.88554 0.03445494 0.03904813 0.03514902
## 10 362.0 days 436.7263 20.89800 17.82740 0.03160301 0.03372260 0.03216986
## 11 364.0 days 599.3676 24.48198 21.91131 0.03410403 0.03372260 0.03471092
## 12 535.5 days 588.0958 24.25069 21.29621 0.03232679 0.03029782 0.03301342
## 13 536.5 days 550.1992 23.45633 19.25161 0.02723353 0.02704426 0.02779974
## 14 538.5 days 510.7077 22.59884 17.67602 0.02361531 0.01826322 0.02407085
## 15 539.5 days 477.1791 21.84443 17.09330 0.02297716 0.01826322 0.02339239
## 16 540.5 days 478.9708 21.88540 17.38119 0.02354521 0.01826322 0.02396376
## 17 541.5 days 649.2640 25.48066 19.65913 0.02731632 0.01826322 0.02795564
## 18 543.5 days 655.6821 25.60629 19.85573 0.02773630 0.02204300 0.02838421
## 19 544.5 days 1324.6807 36.39616 25.55008 0.03360883 0.02204300 0.03460573
## 20 545.5 days 1801.0035 42.43823 29.72327 0.03712271 0.02204300 0.03827543
## 21 718.0 days 1812.7719 42.57666 30.35902 0.03896803 0.03433533 0.04016978
## 22 719.0 days 1860.5266 43.13382 32.65511 0.04553462 0.05743571 0.04654676
## 23 720.0 days 1883.8194 43.40299 33.76906 0.04865268 0.05743571 0.04956028
## 24 721.0 days 1859.1052 43.11734 33.25009 0.04779831 0.04974634 0.04865821
## 25 723.0 days 2047.2743 45.24682 37.35903 0.05579226 0.05866917 0.05702149
## 26 724.0 days 1816.9481 42.62567 33.62097 0.05033471 0.04974634 0.05128880
## 27 725.0 days 2147.9984 46.34650 38.06823 0.05644815 0.05866917 0.05771811
## 28 726.0 days 1549.2565 39.36060 33.31440 0.05235639 0.04974634 0.05334762
## 29 728.0 days 1391.4235 37.30179 32.20032 0.05212883 0.04974634 0.05310635
## 30 729.0 days 1997.3375 44.69158 39.16887 0.05384439 0.04977537 0.05489643
## 31 900.5 days 1957.2441 44.24075 37.78804 0.04988977 0.04974634 0.05110060
## 32 902.5 days 1952.0694 44.18223 37.61172 0.04923646 0.04974634 0.05061381
## 33 903.5 days 1904.2254 43.63743 35.84025 0.04522362 0.04632482 0.04647037
## 34 904.5 days 1805.7860 42.49454 34.43688 0.04320175 0.04632482 0.04429642
## 35 905.5 days 1794.7363 42.36433 33.97250 0.04254756 0.04632482 0.04362832
## 36 907.5 days 1678.9708 40.97525 32.86463 0.04186472 0.04632482 0.04289200
## 37 908.5 days 1498.1868 38.70642 29.36678 0.03812630 0.03573808 0.03904053
## 38 909.5 days 2539.4304 50.39276 35.06021 0.04392341 0.03573808 0.04534975
## 39 910.5 days 2767.0422 52.60268 36.46272 0.04560860 0.03573808 0.04713594
## 40 1083.0 days 2796.8581 52.88533 37.58069 0.04859835 0.05059554 0.05030241
## 41 1084.0 days 2776.0048 52.68780 36.28681 0.04529361 0.05059554 0.04692343
## 42 1085.0 days 2774.1887 52.67057 36.09173 0.04486161 0.05059554 0.04647579
## 43 1087.0 days 2779.7571 52.72340 36.18712 0.04536873 0.05059554 0.04701721
## 44 1088.0 days 3049.0456 55.21816 40.67489 0.05383985 0.06494207 0.05598780
## 45 1089.0 days 2920.0967 54.03792 39.06342 0.05184618 0.05297531 0.05386460
## 46 1090.0 days 3237.8485 56.90210 43.98795 0.05841838 0.06494207 0.06072512
## 47 1092.0 days 2279.1484 47.74043 38.89674 0.05463126 0.06494207 0.05656472
## 48 1093.0 days 2800.0736 52.91572 41.52069 0.05733319 0.06666782 0.05948779
## 49 1094.0 days 4525.6953 67.27329 53.47648 0.06042869 0.07182892 0.06279330
## 50 1266.5 days 4535.4574 67.34580 54.29873 0.06278897 0.07182892 0.06511491
## 51 1267.5 days 4591.9661 67.76405 56.41973 0.06803726 0.07182892 0.07055410
## 52 1268.5 days 4542.7028 67.39958 55.44311 0.06606849 0.07182892 0.06846634
## 53 1269.5 days 4628.0642 68.02988 56.23688 0.06842249 0.07182892 0.07108800
## 54 1271.5 days 4524.9675 67.26788 53.93610 0.06492676 0.07182892 0.06745811
## 55 1272.5 days 4614.6692 67.93136 54.71062 0.06676457 0.07471102 0.06945250
## 56 1273.5 days 4190.1939 64.73171 50.73129 0.06312036 0.05697737 0.06558585
## 57 1274.5 days 4055.5381 63.68311 50.10946 0.06392416 0.05697737 0.06646984
## 58 1276.5 days 3742.6635 61.17731 48.80697 0.06417266 0.05697737 0.06673937
## 59 1448.0 days 3748.9749 61.22887 49.09556 0.06491453 0.05697737 0.06758690
## 60 1449.0 days 3700.7313 60.83364 47.60108 0.06140711 0.04894886 0.06385842
## 61 1451.0 days 3656.2357 60.46681 45.92800 0.05756483 0.04191311 0.05989124
## 62 1452.0 days 3436.7780 58.62404 43.58275 0.05358564 0.04191311 0.05549334
## 63 1453.0 days 3568.0466 59.73313 46.29075 0.05886709 0.06904678 0.06102313
## 64 1454.0 days 3465.2003 58.86595 45.39424 0.05812776 0.06904678 0.06021667
## 65 1456.0 days 3718.9903 60.98352 48.06797 0.06193876 0.07621215 0.06426346
## 66 1457.0 days 3733.5274 61.10260 48.13668 0.06283545 0.07621215 0.06525757
## 67 1458.0 days 3961.8124 62.94293 49.09928 0.06443308 0.07621215 0.06700546
## 68 1459.0 days 7898.7004 88.87463 68.69899 0.07285507 0.07959574 0.07606997
## 69 1631.5 days 7925.8009 89.02697 69.67907 0.07582543 0.07959574 0.07892854
## 70 1632.5 days 7934.8337 89.07768 70.22742 0.07742895 0.07959574 0.08047508
## 71 1633.5 days 7846.4861 88.58039 68.72973 0.07444162 0.07621215 0.07727972
## 72 1635.5 days 8253.9211 90.85109 72.53439 0.08200360 0.08171522 0.08568854
## 73 1636.5 days 8009.4858 89.49573 69.42823 0.07773465 0.07621215 0.08113926
## 74 1637.5 days 8621.8333 92.85383 73.38844 0.08349980 0.09507027 0.08754296
## 75 1638.5 days 7834.2677 88.51140 68.77338 0.07987276 0.07192039 0.08357307
## 76 1640.5 days 8529.0129 92.35266 71.35552 0.08327817 0.07192039 0.08738826
## 77 1641.5 days 7405.4061 86.05467 68.12882 0.08288250 0.07192039 0.08694616
## 78 1813.0 days 7368.8082 85.84176 66.30052 0.07765332 0.07192039 0.08186276
## 79 1815.0 days 7357.6046 85.77648 65.52307 0.07551374 0.07192039 0.07977369
## 80 1816.0 days 7303.9064 85.46289 63.64742 0.07126516 0.07192039 0.07537831
## 81 1817.0 days 6924.0505 83.21088 60.20512 0.06555370 0.07192039 0.06897198
## 82 1818.0 days 7085.3455 84.17449 62.47232 0.07006843 0.08392725 0.07378873
## 83 1820.0 days 6397.8859 79.98679 57.84741 0.06475678 0.08029372 0.06787623
## 84 1821.0 days 6305.8926 79.40965 57.09551 0.06482511 0.08029372 0.06794977
## 85 1822.0 days 6149.5225 78.41889 56.55143 0.06544248 0.08029372 0.06865483
## 86 1823.0 days 5501.9283 74.17498 54.39088 0.06521095 0.08029372 0.06839690
## 87 1825.0 days 11630.0893 107.84289 80.28790 0.07901005 0.08392725 0.08319001
## coverage
## 1 0.2222222
## 2 0.3333333
## 3 0.2222222
## 4 0.1111111
## 5 0.1111111
## 6 0.1111111
## 7 0.2222222
## 8 0.2222222
## 9 0.2222222
## 10 0.2222222
## 11 0.1111111
## 12 0.2222222
## 13 0.3333333
## 14 0.4444444
## 15 0.4444444
## 16 0.4444444
## 17 0.4444444
## 18 0.4444444
## 19 0.4444444
## 20 0.4444444
## 21 0.3333333
## 22 0.2222222
## 23 0.1111111
## 24 0.1111111
## 25 0.0000000
## 26 0.0000000
## 27 0.0000000
## 28 0.0000000
## 29 0.0000000
## 30 0.0000000
## 31 0.1111111
## 32 0.1111111
## 33 0.2222222
## 34 0.2222222
## 35 0.3333333
## 36 0.3333333
## 37 0.4444444
## 38 0.4444444
## 39 0.4444444
## 40 0.3333333
## 41 0.4444444
## 42 0.4444444
## 43 0.4444444
## 44 0.3333333
## 45 0.3333333
## 46 0.2222222
## 47 0.2222222
## 48 0.2222222
## 49 0.2222222
## 50 0.2222222
## 51 0.1111111
## 52 0.1111111
## 53 0.1111111
## 54 0.2222222
## 55 0.2222222
## 56 0.2222222
## 57 0.2222222
## 58 0.2222222
## 59 0.1111111
## 60 0.2222222
## 61 0.3333333
## 62 0.3333333
## 63 0.2222222
## 64 0.2222222
## 65 0.2222222
## 66 0.2222222
## 67 0.2222222
## 68 0.2222222
## 69 0.1111111
## 70 0.1111111
## 71 0.1111111
## 72 0.1111111
## 73 0.1111111
## 74 0.1111111
## 75 0.1111111
## 76 0.1111111
## 77 0.1111111
## 78 0.2222222
## 79 0.2222222
## 80 0.3333333
## 81 0.3333333
## 82 0.3333333
## 83 0.3333333
## 84 0.3333333
## 85 0.3333333
## 86 0.3333333
## 87 0.2222222
Now, we can plot visualizations for metrics like RMSE, MAE, MAPE, MDAPE, SMAPE.
Comparison of the models: ARIMA Model for RMSE comparison:
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,2,1)
## Q* = 6.3575, df = 9, p-value = 0.7037
##
## Model df: 1. Total lags used: 10
Auto ARIMA selects ARIMA(0,2,1) on the training dataset.
Identification of the model (ARIMA or Prophet) that performs the best out of sample: Calculating Out of Sample RMSE for ARIMA:
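A minimal sketch of an out-of-sample RMSE calculation with a single train/test split is given below. It uses a synthetic series, a log transform in place of Box-Cox and a hypothetical ARIMA(1,1,0) order, so it shows the mechanics rather than the model used in this report.

```r
set.seed(5)
# Synthetic positive series with steady growth (not the WDI data)
y <- 300 * exp(cumsum(rnorm(60, mean = 0.03, sd = 0.02)))

h <- 5                                  # hold out the last 5 observations
n <- length(y)
train <- y[1:(n - h)]
test  <- y[(n - h + 1):n]

# Fit on the transformed scale, forecast h steps ahead
fit    <- arima(log(train), order = c(1, 1, 0), method = "ML")
fc_log <- as.numeric(predict(fit, n.ahead = h)$pred)

# Back-transform before scoring, so errors are in the original units
fc <- exp(fc_log)
oos_rmse <- sqrt(mean((test - fc)^2))

# For contrast: scoring transformed forecasts against raw values
# mixes two scales and inflates the error
wrong_scale_rmse <- sqrt(mean((test - fc_log)^2))

c(oos_rmse = oos_rmse, wrong_scale_rmse = wrong_scale_rmse)
```

When a model is fit to a transformed series, its forecasts need to be back-transformed before the RMSE is compared with a model scored on the original scale; otherwise the two numbers are not measuring the same thing.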
Prophet Model for RMSE comparison:
Final assessment of the forecast from the selected model: Out of Sample RMSE comparison between best models built on ARIMA and Prophet:
## # A tibble: 1 x 2
## best_ARIMA best_Prophet
## <dbl> <dbl>
## 1 2143. 137.
We have a clear winner: the Prophet model (out-of-sample RMSE of about 137) outperforms the best ARIMA model (about 2143).
Building an ARIMA model, which means manually decomposing the series and studying its AR characteristics, gives a better understanding of the data-generating process, but it is also advisable to fit Prophet and compare forecasting performance. In this analysis Prophet performed better, yet building the ARIMA model still helped in understanding the time series.