Executive Summary

In this project, we started with an empirical analysis of five main data series from the WDI package to evaluate, through measurable indicators, whether people’s quality of life has actually improved in India and the world. Next, we focused on South Asia (especially India) and on GDP per capita as our development indicator of interest. We identified the data source, computed summary statistics, and produced visualizations that describe the behavior and distribution of the time series.

For the second section, we built an ARIMA model for our time series. We assessed whether the data were variance and mean stationary by examining how the mean and variance of the series change over time and by running statistical tests (ADF and KPSS). We then transformed the data to induce variance and mean stationarity (log/Box-Cox and differencing), examined the ACF/PACF plots for autocorrelation and seasonality, and selected an ARIMA model based on AIC/BIC. Finally, we analyzed the model residuals, assessed in-sample performance, and produced a forecast at a meaningful time horizon.

For the third section, we built a Prophet model for our time series. This involved decomposing the time series into its elements (trend, seasonality, etc.), assessing whether the model should take into account a saturating minimum/maximum point, and identifying and assessing seasonality (daily, weekly, yearly, as well as additive/multiplicative). We also selected a “best” model, assessed its in-sample performance, and produced a forecast at a meaningful time horizon.

For the final section, we assessed and compared the performance of the selected ARIMA and Prophet models. This involved comparing the models with a single train/test split or a rolling-window cross-validation approach, using RMSE, MAE, and/or MAPE as appropriate; identifying the model (ARIMA or Prophet) that performs best out of sample; and making a final assessment of the forecast from the selected model.

Section 1: Exploratory Data Analysis

1: Introduction

Data Source Description: The dataset selected for this project is the “World Development Indicators and Other World Bank Data”, loaded as a package. The WDI package provides convenient access to over 40 databases hosted by the World Bank, including the World Development Indicators (WDI), International Debt Statistics, Doing Business, Human Capital Index, and Sub-national poverty indicators. The database contains about 17,330 topics covering various countries over several years. The complete dataset has a total of 16,532 observations.

I chose the WDI dataset after our classroom discussion of this package, which piqued my curiosity about the wealth of information captured in a single dataset.

2: Summary Statistics of the WDI time-series dataset

Summary Statistics: This code chunk will call the WDI API and fetch the years 1960 through 2018, as available. We will find that only a few variables have data for 2020.

This dataset has a total of 15 variables.

## Rows: 16,532
## Columns: 15
## $ iso2c         <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ~
## $ country       <chr> "Global Partnership for Education", "Global Partnership ~
## $ year          <int> 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 19~
## $ birth         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ lifeexp       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ inf_mort      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ gdpPercap     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ yrs_schooling <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ iso3c         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ region        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ capital       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ longitude     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ latitude      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ income        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ lending       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~

Summary of five key time-series variables from the WDI Package.

##                         region         year        birthrate    
##  South Asia                :101   Min.   :1960   Min.   :10.47  
##  East Asia & Pacific       : 70   1st Qu.:1975   1st Qu.:20.38  
##  Aggregates                : 66   Median :1990   Median :29.26  
##  Sub-Saharan Africa        : 66   Mean   :1990   Mean   :28.96  
##  Middle East & North Africa: 63   3rd Qu.:2005   3rd Qu.:37.73  
##  Europe & Central Asia     : 61   Max.   :2020   Max.   :47.13  
##  (Other)                   :117                                 
##     life_exp         infant         gdpPercap       yrs_schooling   
##  Min.   :40.99   Min.   :  4.90   Min.   :  368.4   Min.   : 1.894  
##  1st Qu.:57.16   1st Qu.: 23.84   1st Qu.: 2424.1   1st Qu.: 2.739  
##  Median :65.48   Median : 41.96   Median : 6631.3   Median : 5.546  
##  Mean   :63.98   Mean   : 51.30   Mean   :12912.0   Mean   : 5.965  
##  3rd Qu.:71.68   3rd Qu.: 70.89   3rd Qu.:14584.0   3rd Qu.: 7.904  
##  Max.   :80.69   Max.   :171.27   Max.   :70255.2   Max.   :12.750  
##

	vars	n	mean	sd	min	max	range	se
birthrate	1	544	28.96167	10.320162	10.46667	47.13421	36.66755	0.4424733
life_exp	2	544	63.97716	9.401340	40.98700	80.68935	39.70234	0.4030791
infant	3	544	51.29561	36.900487	4.90000	171.26667	166.36667	1.5820953
gdpPercap	4	544	12911.94946	16526.007464	368.38498	70255.24633	69886.86135	708.5467245
yrs_schooling	5	544	5.96473	2.880443	1.89375	12.75000	10.85625	0.1234980

3. Visualization Plots

The graphs here show the change in five key development indicators (birth rate, life expectancy, infant mortality rate, years of schooling, and GDP per capita) over the past 60 years. The plot is faceted into 7 regions. Line charts efficiently convey the tendency of change across multiple decades. In each individual chart, we can see clear downward trends in birth rate and infant mortality rate and upward trends in GDP per capita, years of schooling, and life expectancy. Comparing across regions, Sub-Saharan Africa has the highest birth rate and infant mortality rate, while North America has the highest GDP per capita, as expected.

Note the obvious downward trends in birth rate and infant mortality rate and the general upward trends in GDP per capita, years of schooling, and life expectancy, especially in South Asia and Sub-Saharan Africa. The impact of the Covid-19 recession is also visible for multiple development indicators across most regions of the world.

We now develop a number of plots that describe the behavior of our key time-series development variables from the WDI dataset:

Note the improvement in health indicators and child growth indicators across the world. Now we want to focus on South Asia (especially India) and GDP per capita as our development variable of interest.

Looking at a number of plots that describe the behavior of our development variable of interest:

Section 2: ARIMA Modeling

4. India’s GDP Per capita time-series dataset selection

This code chunk will call the WDI API and fetch the GDP Per capita data for years 1960 through 2021, as available.

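# Load the WDI client and dplyr (for %>%, as_tibble, filter, arrange)
library(WDI)
library(dplyr)
# Fetch GDP per capita (NY.GDP.PCAP.KD) for all countries, 1960-2021
wdi_data = WDI(indicator = c('gdpPercap'="NY.GDP.PCAP.KD"),
               start = 1960, end = 2021,
               extra=TRUE ) %>% 
  as_tibble()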

This code chunk further filters wdi_data for country == 'India'.

# Filter the data for Country == 'India'
india_gdpPercap_data = wdi_data %>%
  filter(country=='India') %>%
  arrange(year)

5. Behavior of the data-generating process

Assessment of variance/mean stationarity: To assess the behavior of the data-generating process of our time series, we start by looking at plots of the data used in our analysis. The plots show that the series is variance non-stationary; whether it is also mean non-stationary is less clear from the plots alone.

Next, we compute rolling-window statistics, sliding a 5-year window across the time series and transforming each value into a rolling mean and a rolling standard deviation. We observe that both the rolling mean and the rolling standard deviation increase over time, which suggests both mean and variance non-stationarity.
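The rolling-window calculation described above can be sketched in base R as follows. This is a minimal illustration on a synthetic, exponentially growing series (not the WDI data); the names `gdp`, `roll_mean`, and `roll_sd` are introduced here for illustration only.

```r
# Synthetic, exponentially growing series standing in for GDP per capita
# (illustrative values only, not the WDI data).
set.seed(1)
years <- 1960:2020
gdp   <- 300 * exp(0.03 * (years - 1960)) * exp(rnorm(length(years), sd = 0.02))

window <- 5  # 5-year rolling window, as in the text

# embed() returns one row per window, so apply() yields rolling statistics.
windows   <- embed(gdp, window)
roll_mean <- apply(windows, 1, mean)
roll_sd   <- apply(windows, 1, sd)

# Compare the first and last windows: both statistics grow over time.
c(first_mean = roll_mean[1], last_mean = tail(roll_mean, 1))
c(first_sd   = roll_sd[1],   last_sd   = tail(roll_sd, 1))
```

A rolling mean that rises points to mean non-stationarity, and a rolling standard deviation that rises with it points to variance non-stationarity.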

Note the obvious drop in GDP per capita around 2020 due to the impact of the Covid-19 recession across all sectors of the Indian economy.

6. Time-Series Transformation for variance stationarity

Transformations of the data to induce variance/mean stationarity (log/Box-Cox and differencing): Because the data appear to be both mean and variance non-stationary, we transform the time series. Variance stationarity must be addressed before mean stationarity, so we start with a natural log or Box-Cox transformation to make the series variance stationary.

The log and Box-Cox transformations of India's GDP per capita give very similar results. We use the Box-Cox transformation, as it seems to do a slightly better job of stabilizing the variance.
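For reference, the two transformations can be written out by hand. The sketch below is an assumption-laden illustration: `box_cox` and `inv_box_cox` are hypothetical helper names, and `lambda = -0.3` is an arbitrary value chosen for the example, not the value estimated in this report. The six input values are the first GDP per capita observations printed in the Prophet section.

```r
# Box-Cox transform: (x^lambda - 1) / lambda, with log(x) as the lambda = 0 limit.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Inverse transform, needed to report fitted values on the original scale.
inv_box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) exp(y) else (lambda * y + 1)^(1 / lambda)
}

gdp    <- c(302.6718, 307.7279, 310.3767, 322.2841, 339.2037, 323.4641)
lambda <- -0.3  # hypothetical value, for illustration only

gdp_log <- box_cox(gdp, 0)       # natural log
gdp_box <- box_cox(gdp, lambda)  # Box-Cox with the assumed lambda

# Round trip back to the original scale
inv_box_cox(gdp_box, lambda)
```

Because the log is the special case lambda = 0, the two transformations behave similarly whenever the estimated lambda is close to zero.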

7. Testing for Mean Non-Stationarity

First, we test for non-stationarity using the Augmented Dickey-Fuller (ADF) test:

A p-value > 0.05 indicates non-stationarity.

Next, we test using the KPSS test (preferred approach):

A p-value < 0.05 indicates non-stationarity.

Because the results of the ADF and KPSS tests agree, we assume mean non-stationarity.
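The two tests have opposite null hypotheses (ADF: unit root; KPSS: stationarity), which is why the p-values are read in opposite directions. The decision rule can be made explicit with a small helper; `classify_stationarity` is a hypothetical name introduced here, and the p-values passed to it are the ones reported in this section's test output.

```r
# Combine ADF and KPSS p-values into a single verdict.
classify_stationarity <- function(adf_p, kpss_p, alpha = 0.05) {
  adf_stationary  <- adf_p < alpha    # ADF null: unit root (non-stationary)
  kpss_stationary <- kpss_p >= alpha  # KPSS null: stationary
  if (adf_stationary && kpss_stationary) {
    "stationary"
  } else if (!adf_stationary && !kpss_stationary) {
    "non-stationary"
  } else {
    "tests disagree"
  }
}

# Box-Cox-transformed levels: ADF p = 0.5124, KPSS p = 0.01
classify_stationarity(0.5124, 0.01)

# After differencing: ADF p = 0.9476, KPSS p = 0.1
classify_stationarity(0.9476, 0.1)
```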

## 
##  Augmented Dickey-Fuller Test
## 
## data:  gdpPercap_var_transform$gdpPercap_box
## Dickey-Fuller = -2.1544, Lag order = 3, p-value = 0.5124
## alternative hypothesis: stationary

## 
##  KPSS Test for Level Stationarity
## 
## data:  gdpPercap_var_transform$gdpPercap_box
## KPSS Level = 1.5962, Truncation lag parameter = 3, p-value = 0.01

Examination of ACF/PACF plots for autocorrelation and seasonality: Next, we determine the order of integration, that is, the number of differences required to make the series mean stationary. We start with one difference (I(1)) and assess the result before trying a second difference:

## 
##  Augmented Dickey-Fuller Test
## 
## data:  gdpPercap_mean_transform$gdpPercap_box_diff[!is.na(gdpPercap_mean_transform$gdpPercap_box_diff)]
## Dickey-Fuller = -0.81635, Lag order = 2, p-value = 0.9476
## alternative hypothesis: stationary

## 
##  KPSS Test for Level Stationarity
## 
## data:  gdpPercap_mean_transform$gdpPercap_box_diff[!is.na(gdpPercap_mean_transform$gdpPercap_box_diff)]
## KPSS Level = 0.31214, Truncation lag parameter = 2, p-value = 0.1

Examining the ACF/PACF to see what type of model may be appropriate, we find that they do not give us much information about the AR/MA process. The ACF shows no significant “dampening” autocorrelation. The PACF lags do not indicate the order of an AR process, nor do they show the oscillating pattern of an MA process.
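A minimal base R sketch of differencing and of reading the ACF/PACF numerically is shown below. It uses a synthetic random walk with drift rather than the report's data, and the 1.96/sqrt(n) threshold is the usual approximate 95% bound drawn on ACF plots.

```r
set.seed(42)
n <- 60
x <- cumsum(0.02 + rnorm(n, sd = 0.01))  # synthetic random walk with drift

d1 <- diff(x)                   # first difference, I(1)
d2 <- diff(x, differences = 2)  # second difference, I(2)

# Autocorrelations of the first-differenced series, without plotting
acf_d1  <- acf(d1,  lag.max = 10, plot = FALSE)
pacf_d1 <- pacf(d1, lag.max = 10, plot = FALSE)

bound <- 1.96 / sqrt(length(d1))      # approximate 95% significance bound
which(abs(acf_d1$acf[-1]) > bound)    # ACF lags beyond the bound (lag 0 dropped)
which(abs(pacf_d1$acf) > bound)       # PACF lags beyond the bound
```

If few or no lags exceed the bound, as with a random walk, the differenced series carries little AR or MA structure.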

Next we try a second difference:

Again, we first test for non-stationarity using the Augmented Dickey-Fuller (ADF) test:

A p-value > 0.05 indicates non-stationarity.

Next, we test using the KPSS test (preferred approach):

A p-value > 0.05 indicates stationarity.

Because the results of the ADF and KPSS tests do not match, the evidence is inconclusive, and we conservatively assume mean non-stationarity.

## 
##  Augmented Dickey-Fuller Test
## 
## data:  gdpPercap_mean_transform1$gdpPercap_box_diff[!is.na(gdpPercap_mean_transform1$gdpPercap_box_diff)]
## Dickey-Fuller = -0.81635, Lag order = 2, p-value = 0.9476
## alternative hypothesis: stationary

## 
##  KPSS Test for Level Stationarity
## 
## data:  gdpPercap_mean_transform1$gdpPercap_box_diff[!is.na(gdpPercap_mean_transform1$gdpPercap_box_diff)]
## KPSS Level = 0.31214, Truncation lag parameter = 2, p-value = 0.1

8. ARIMA Modeling

Selection of an appropriate ARIMA model based on AIC/BIC: The ACF/PACF plots do not give us much information about the AR/MA process, so the series may follow a random walk, ARIMA(0,1,0). We therefore fit several ARIMA models to our gdpPercap time series and determine which model is “best” according to AIC and BIC.

Note that the models are fit to the Box-Cox-transformed series.

## 
## Call:
## arima(x = gdpPercap_mean_transform$gdpPercap_box, order = c(0, 1, 1))
## 
## Coefficients:
##          ma1
##       0.6932
## s.e.  0.1166
## 
## sigma^2 estimated as 3.005e-06:  log likelihood = 123.14,  aic = -242.28
## 
## Training set error measures:
##                       ME        RMSE         MAE      MPE      MAPE      MASE
## Training set 0.001162192 0.001750371 0.001501162 0.053958 0.0695625 0.6688944
##                    ACF1
## Training set -0.1906989

## 
## Call:
## arima(x = gdpPercap_mean_transform$gdpPercap_box, order = c(0, 1, 0))
## 
## 
## sigma^2 estimated as 5.67e-06:  log likelihood = 115.53,  aic = -229.06
## 
## Training set error measures:
##                       ME        RMSE         MAE        MPE      MAPE     MASE
## Training set 0.002006089 0.002371993 0.002239897 0.09302686 0.1037458 0.998063
##                   ACF1
## Training set 0.1324183

## 
## Call:
## arima(x = gdpPercap_mean_transform$gdpPercap_box, order = c(1, 1, 0))
## 
## Coefficients:
##          ar1
##       0.8468
## s.e.  0.1055
## 
## sigma^2 estimated as 1.783e-06:  log likelihood = 129.36,  aic = -254.71
## 
## Training set error measures:
##                        ME      RMSE         MAE        MPE       MAPE      MASE
## Training set 0.0002215809 0.0013746 0.001056106 0.01041494 0.04899651 0.4705842
##                     ACF1
## Training set -0.09277688

## 
## Call:
## arima(x = gdpPercap_mean_transform$gdpPercap_box, order = c(0, 2, 1))
## 
## Coefficients:
##           ma1
##       -0.4406
## s.e.   0.3712
## 
## sigma^2 estimated as 1.779e-06:  log likelihood = 124.71,  aic = -245.42
## 
## Training set error measures:
##                         ME        RMSE         MAE        MPE       MAPE
## Training set -0.0003442775 0.001410643 0.001005344 -0.0158962 0.04661231
##                   MASE       ACF1
## Training set 0.4479655 0.04467136

## 
## Call:
## arima(x = gdpPercap_mean_transform$gdpPercap_box, order = c(0, 2, 0))
## 
## 
## sigma^2 estimated as 1.91e-06:  log likelihood = 123.97,  aic = -245.93
## 
## Training set error measures:
##                         ME        RMSE         MAE        MPE       MAPE
## Training set -0.0003095607 0.001452905 0.001009851 -0.0142933 0.04685949
##                  MASE       ACF1
## Training set 0.449974 -0.1012526

## 
## Call:
## arima(x = gdpPercap_mean_transform$gdpPercap_box, order = c(1, 2, 0))
## 
## Coefficients:
##           ar1
##       -0.2337
## s.e.   0.2581
## 
## sigma^2 estimated as 1.843e-06:  log likelihood = 124.37,  aic = -244.73
## 
## Training set error measures:
##                         ME        RMSE         MAE         MPE       MAPE
## Training set -0.0003261831 0.001431483 0.001008458 -0.01506122 0.04676878
##                   MASE        ACF1
## Training set 0.4493532 -0.01220372

##               df       AIC
## summary(mod1)  2 -242.2816
## summary(mod2)  1 -229.0623
## summary(mod3)  2 -254.7148
## summary(mod4)  2 -245.4236
## summary(mod5)  1 -245.9328
## summary(mod6)  2 -244.7329

##               df       BIC
## summary(mod1)  2 -239.8439
## summary(mod2)  1 -227.8435
## summary(mod3)  2 -252.2770
## summary(mod4)  2 -243.0675
## summary(mod5)  1 -244.7548
## summary(mod6)  2 -242.3768
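A comparison of this kind can be sketched with base R's `arima()`, `AIC()`, and `BIC()`. The example below runs on a synthetic series (not the Box-Cox-transformed GDP data) and fits only three candidate orders; the object names are illustrative.

```r
set.seed(7)
# Synthetic series whose first difference follows an AR(1) process.
x <- as.numeric(cumsum(arima.sim(model = list(ar = 0.6), n = 60, sd = 0.01)))

# Candidate (p, d, q) orders, all with d = 1 so the criteria are comparable.
orders <- list(c(0, 1, 1), c(0, 1, 0), c(1, 1, 0))
fits   <- lapply(orders, function(o) arima(x, order = o))

comparison <- data.frame(
  model = sapply(orders, function(o) sprintf("ARIMA(%d,%d,%d)", o[1], o[2], o[3])),
  AIC   = sapply(fits, AIC),
  BIC   = sapply(fits, BIC)
)

# Lowest AIC first
comparison[order(comparison$AIC), ]
```

Lower values are better for both criteria; BIC penalizes additional parameters more heavily than AIC once the sample has more than about eight observations.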

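In the AIC and BIC tables above, both criteria are lowest for mod3, the ARIMA(1,1,0) model (AIC = -254.71, BIC = -252.28), so we examine its residuals. Strictly, AIC/BIC values are only comparable among models with the same order of differencing.

Analysis of model residuals: A residual is the difference between what we observed and what the model predicted. It looks like there is no unmodeled variation left, so we may not need to try many more models.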

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,0)
## Q* = 2.0158, df = 4, p-value = 0.7329
## 
## Model df: 1.   Total lags used: 5

The ACF plot above shows no first-, second-, or third-order autocorrelation in the residuals. The residuals appear to be fairly normal. We seem to have done a good job of modeling the short-term behavior of this time series.

9. Ljung-Box Test Results

Next, we apply the Ljung-Box test for residual autocorrelation at several lags.

A p-value < 0.05 indicates residual autocorrelation somewhere up to that lag.

## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 0.25065, df = 1, p-value = 0.6166

## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 2.0158, df = 5, p-value = 0.847

## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 2.781, df = 10, p-value = 0.9861

## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 10.747, df = 20, p-value = 0.9525

## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 47.224, df = 25, p-value = 0.004613

The Ljung-Box results indicate residual autocorrelation only when lags up to 25 are included.

The small remaining autocorrelation (at lag = 25) in the residuals suggests that we may have the “best” model. The lower-order lags are not statistically significant, which suggests that we have been successful in picking up most of the fluctuation in the data-generating process.
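The sequence of tests above can be reproduced in spirit with base R's `Box.test()`. The sketch below uses a synthetic series and an ARIMA(1,1,0) fit purely for illustration; `fitdf = 0` matches the degrees of freedom shown in the output above (df equal to the lag).

```r
set.seed(11)
x   <- as.numeric(cumsum(arima.sim(model = list(ar = 0.6), n = 80, sd = 0.01)))
fit <- arima(x, order = c(1, 1, 0))
res <- residuals(fit)

lags <- c(1, 5, 10, 20, 25)  # the lags reported above

# Each test is joint: it asks whether autocorrelations at lags 1..h are all zero.
p_values <- sapply(lags, function(h) {
  Box.test(res, lag = h, type = "Ljung-Box", fitdf = 0)$p.value
})

data.frame(lag = lags, p_value = round(p_values, 4))
```

With short annual series, tests at long lags rest on very few observations per autocorrelation, so a single small p-value at the longest lag deserves less weight than the short-lag results.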

10. Auto ARIMA Modeling

## Series: gdpPercap_mean_transform$gdpPercap_box 
## ARIMA(0,1,0) with drift 
## 
## Coefficients:
##       drift
##       2e-03
## s.e.  3e-04
## 
## sigma^2 = 1.924e-06:  log likelihood = 130.84
## AIC=-257.69   AICc=-257.14   BIC=-255.25
## 
## Training set error measures:
##                        ME        RMSE          MAE         MPE      MAPE
## Training set 8.189298e-05 0.001332554 0.0009385031 0.003959539 0.0434831
##                   MASE      ACF1
## Training set 0.4181823 0.1595336

11. Diagnostics from the Model

Assessment of in-sample model performance: In-sample performance is the performance of the model on the data used to estimate it. It can be analyzed using a familiar metric, RMSE (root mean square error). RMSE is the square root of the mean squared residual (prediction error); when the residuals average close to zero, it is approximately their standard deviation.

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,0) with drift
## Q* = 1.3384, df = 4, p-value = 0.8548
## 
## Model df: 1.   Total lags used: 5

## 
##  Box-Ljung test
## 
## data:  mod3_auto$residuals
## X-squared = 0.74113, df = 1, p-value = 0.3893

## 
##  Box-Ljung test
## 
## data:  mod3_auto$residuals
## X-squared = 0.74113, df = 1, p-value = 0.3893

12. Diagnostics from the Model

We now look at the RMSE, which summarizes the typical size of the model's in-sample error. An RMSE of 58.14 suggests that, in-sample, the model's fitted values of GDP per capita deviate from the observed values by roughly 58 units on average.

## [1] 58.14078

13. Forecast from the ARIMA Model

Production of a forecast at a meaningful time horizon: Now we create a future forecast for 2020:2024.

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 2.3696, df = 4, p-value = 0.6681
## 
## Model df: 1.   Total lags used: 5

## 
##  Box-Ljung test
## 
## data:  mod1$residuals
## X-squared = 1.059, df = 1, p-value = 0.3034

## 
##  Box-Ljung test
## 
## data:  mod1$residuals
## X-squared = 1.059, df = 1, p-value = 0.3034

## [1] 51.27804
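A five-step-ahead forecast like the one described in this section can be sketched with base R's `predict()` on an `arima()` fit. The example assumes a synthetic log-scale series ending in 2019 and a plain log transform (rather than Box-Cox) so that `exp()` maps the forecast back to the original scale; none of the numbers correspond to the report's results.

```r
set.seed(3)
years <- 1960:2019
# Synthetic log-scale series with drift, standing in for the transformed data.
y <- as.numeric(cumsum(0.03 + arima.sim(model = list(ar = 0.5),
                                        n = length(years), sd = 0.01)))

fit <- arima(y, order = c(1, 1, 0))
fc  <- predict(fit, n.ahead = 5)  # point forecasts and standard errors

pred <- as.numeric(fc$pred)
se   <- as.numeric(fc$se)

# Back-transform the forecast and an approximate 95% interval.
forecast_tbl <- data.frame(
  year  = 2020:2024,
  point = exp(pred),
  lower = exp(pred - 1.96 * se),
  upper = exp(pred + 1.96 * se)
)
forecast_tbl
```

The standard errors grow with the horizon, so the interval widens for each additional year forecast.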

Section 3: Facebook Prophet Model

14. Behavior of the data-generating process for Facebook Prophet Model

To assess the behavior of the data-generating process of our time-series, we start by looking at the plots of the time-series data that we are using for our analysis.

15. Fit and assess a Facebook Prophet model

We start by fitting a basic Prophet model, which draws on time-series decomposition and breaks the time series down into:

Seasonal components: Daily, weekly, monthly, yearly, etc.
Holidays: For daily data
Trend: Estimated along the data, with unique slopes identified using changepoint detection

## Rows: 61
## Columns: 2
## $ year      <date> 1960-01-01, 1961-01-01, 1962-01-01, 1963-01-01, 1964-01-01,~
## $ gdpPercap <dbl> 302.6718, 307.7279, 310.3767, 322.2841, 339.2037, 323.4641, ~

16. Plotting Forecast and Interactive Charts

Next, we will be plotting Forecast model output and Interactive Charts as below:

17. Time Series Decomposition

Decomposition of the elements of the time series: Prophet's decomposition into components (trend, seasonality, etc.) is very similar to general time series decomposition.

As seen in the graphs above, we do not see any seasonality in GDP per capita, but we do observe an increasing trend.

18. Plotting Changepoints

To plot change-points, the algorithm considers 25 equally spaced potential changepoints by default as candidates for a change in the trend. It examines the actual rate of change at each candidate and removes those with low rates of change. We are left with 5 potential change-points.

Note: The algorithm does not place change-points in the final 20% of the data. This prevents the final forecast from trailing off in a different direction.
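To make the candidate grid concrete, the base R sketch below places 25 equally spaced candidates in the first 80% of a 61-point annual history. It illustrates the idea only and is not Prophet's own code; the variable names are introduced here, and the exact placement inside the library may differ.

```r
# Annual time stamps matching a 61-row history starting in 1960.
ds <- seq(as.Date("1960-01-01"), by = "year", length.out = 61)

n_changepoints    <- 25   # default number of candidates described above
changepoint_range <- 0.8  # candidates restricted to the first 80% of the data

# Index of the last observation eligible to be a changepoint.
hist_size <- floor(length(ds) * changepoint_range)

# Equally spaced indices within the eligible range, skipping the first point.
idx        <- round(seq(1, hist_size, length.out = n_changepoints + 1))[-1]
candidates <- ds[idx]

range(candidates)
```

With only 61 annual observations, 25 candidates means roughly one candidate every two years, which is one reason to experiment with a smaller number.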

19. Changepoint Hyperparameters

Assessment of saturating minimum/maximum point: We can manually specify the number of candidate change-points (10 here) to try to improve performance. The algorithm again examines the actual rate of change at each candidate and removes those with low rates of change. We are again left with 5 potential change-points.

The change-points identified for the “trend” part of the time series are shown below:

20. Forecast

Selection of a “best” model: Finally, a forecast is made for the years beyond 2020; it fails to account for the Covid-related drop in GDP per capita.

21. Identifying Saturation Points

We need to set a floor value for this data because GDP per capita cannot go below 0. Let us make a two-year forecast to assess the need for saturation points.

22. Specifying Saturation Points

We also assess whether the model should take into account a saturating minimum/maximum point. For this model we set the cap (ceiling) at 2500 and the floor (lower limit) at 0.
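For a logistic-growth Prophet model, the saturation points are supplied as `cap` and `floor` columns in the input data frame (and again in the future data frame). The sketch below only prepares such a data frame in base R, using the first six observations printed earlier in this section; the Prophet calls are left as comments because they are not run here, and `history` is an illustrative name.

```r
# Input in Prophet's expected layout: ds (date) and y (value).
history <- data.frame(
  ds = as.Date(paste0(1960:1965, "-01-01")),
  y  = c(302.6718, 307.7279, 310.3767, 322.2841, 339.2037, 323.4641)
)

# Saturation points from the text: ceiling of 2500, floor of 0.
history$cap   <- 2500
history$floor <- 0

# Not run: fitting and forecasting with the prophet package.
# m      <- prophet::prophet(history, growth = "logistic")
# future <- prophet::make_future_dataframe(m, periods = 2, freq = "year")
# future$cap   <- 2500
# future$floor <- 0
# fcst   <- predict(m, future)

history
```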

23. Assessment of Seasonality

Prophet models can handle multiple types of seasonality simultaneously. Here we assess seasonality (daily, weekly, yearly, as well as additive/multiplicative). Because the data are annual, with one observation per year, no daily, weekly, or yearly seasonality can be observed.

Additive Seasonality

With additive seasonality, the seasonal fluctuations keep the same magnitude over time, regardless of the level of the series.

Multiplicative Seasonality

With multiplicative seasonality, the seasonal fluctuations grow or shrink in proportion to the level of the series.

From the above plots, we can see that neither additive nor multiplicative seasonality plays a significant role. Because the data are annual, the impact of holidays is not applicable to the annual GDP per capita series.

24. Prophet Model Assessment

Assessment of in-sample model performance: Here we estimate the in-sample performance of the model using RMSE, MAE, and MAPE.

## [1] "RMSE: 189.26"

## [1] "MAE: 157.07"

## [1] "MAPE: 0.08"

The observed performance metrics are RMSE = 189.26, MAE = 157.07, and MAPE = 0.08 (about 8%). RMSE is never smaller than MAE; the gap between them here reflects a few large errors, primarily the outlier values during the Covid period.
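The three metrics, and the way a single large error separates RMSE from MAE, can be illustrated with a small worked example. The numbers below are made up for illustration and are unrelated to the GDP data; the function names are introduced here.

```r
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae  <- function(actual, pred) mean(abs(actual - pred))
mape <- function(actual, pred) mean(abs((actual - pred) / actual))

actual       <- c(100, 110, 120, 130, 140)
pred_even    <- c(105, 105, 125, 125, 145)  # every error has size 5
pred_outlier <- c(100, 110, 120, 130, 115)  # a single error of size 25

# Same MAE in both cases, but RMSE reacts to the outlier.
c(rmse = rmse(actual, pred_even),    mae = mae(actual, pred_even))     # 5 and 5
c(rmse = rmse(actual, pred_outlier), mae = mae(actual, pred_outlier))  # about 11.18 and 5

mape(actual, pred_even)  # returned as a fraction, not a percentage
```

Because RMSE squares the errors before averaging, a few large misses raise it well above MAE even when the average absolute error is unchanged.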

25. Cross-Validation with Prophet Model

We use rolling-window cross-validation to assess the performance of the model at horizons that are meaningful for the data (e.g., 30 days out for daily data, 1 year out for monthly data, 5 years out for yearly data). We train on the past 6 years of data to predict the next 5 years of GDP per capita.

## # A tibble: 6 x 6
##       y ds                   yhat yhat_lower yhat_upper cutoff             
##   <dbl> <dttm>              <dbl>      <dbl>      <dbl> <dttm>             
## 1  338. 1968-01-01 00:00:00  337.       326.       347. 1967-01-14 00:00:00
## 2  353. 1969-01-01 00:00:00  341.       330.       351. 1967-01-14 00:00:00
## 3  363. 1970-01-01 00:00:00  345.       334.       354. 1967-01-14 00:00:00
## 4  361. 1971-01-01 00:00:00  348.       337.       359. 1967-01-14 00:00:00
## 5  351. 1972-01-01 00:00:00  352.       341.       363. 1967-01-14 00:00:00
## 6  363. 1970-01-01 00:00:00  350.       340.       360. 1969-07-14 12:00:00

##  [1] "1967-01-14 00:00:00 GMT" "1969-07-14 12:00:00 GMT"
##  [3] "1972-01-13 00:00:00 GMT" "1974-07-13 12:00:00 GMT"
##  [5] "1977-01-11 00:00:00 GMT" "1979-07-12 12:00:00 GMT"
##  [7] "1982-01-10 00:00:00 GMT" "1984-07-10 12:00:00 GMT"
##  [9] "1987-01-09 00:00:00 GMT" "1989-07-09 12:00:00 GMT"
## [11] "1992-01-08 00:00:00 GMT" "1994-07-08 12:00:00 GMT"
## [13] "1997-01-06 00:00:00 GMT" "1999-07-07 12:00:00 GMT"
## [15] "2002-01-05 00:00:00 GMT" "2004-07-05 12:00:00 GMT"
## [17] "2007-01-04 00:00:00 GMT" "2009-07-04 12:00:00 GMT"
## [19] "2012-01-03 00:00:00 GMT"
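The logic behind these cutoffs (train up to a cutoff, forecast a fixed horizon, move the cutoff forward, repeat) can be sketched generically in base R. The example below is not Prophet's cross-validation: it swaps in a simple drift forecast on the log scale, uses a synthetic series, assumes an expanding training window, and spaces cutoffs 5 years apart, all of which are assumptions made for illustration.

```r
set.seed(5)
y <- 300 * exp(cumsum(rnorm(61, mean = 0.03, sd = 0.02)))  # synthetic annual series

initial <- 6  # observations in the first training window
horizon <- 5  # years forecast from each cutoff
step    <- 5  # years between successive cutoffs (assumed for this sketch)

cutoffs <- seq(initial, length(y) - horizon, by = step)

cv <- do.call(rbind, lapply(cutoffs, function(k) {
  train <- log(y[1:k])
  drift <- mean(diff(train))                     # average growth up to the cutoff
  yhat  <- exp(train[k] + drift * (1:horizon))   # drift forecast, back-transformed
  data.frame(cutoff = k, h = 1:horizon, y = y[k + 1:horizon], yhat = yhat)
}))

# RMSE by forecast horizon, pooled across all cutoffs
rmse_by_h <- sapply(split(cv, cv$h), function(d) sqrt(mean((d$y - d$yhat)^2)))
rmse_by_h
```

Pooling errors by horizon is what makes it possible to see how accuracy degrades as the forecast reaches further from the cutoff.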

26. Cross-Validation Actual vs Predicted

Production of a forecast at a meaningful time horizon: The black dots here are the actuals and the colored dots are the predicted data set.

Nearer horizon values are closer to the actual values and the farther values show more deviation from the actual values.

27. Table for Performance metrics

We can generate a table for various performance metrics for different time horizons.

##        horizon        mse      rmse      mae       mape      mdape      smape
## 1   180.5 days  1084.2005  32.92720 25.87522 0.03661061 0.04154279 0.03740498
## 2   352.0 days  1065.0720  32.63544 24.57795 0.03307085 0.04154279 0.03379142
## 3   354.0 days  1076.4921  32.80994 25.20475 0.03494414 0.04154279 0.03558665
## 4   355.0 days  1112.5325  33.35465 26.80256 0.03888570 0.04465719 0.03965478
## 5   356.0 days  1109.2229  33.30500 26.71980 0.03891580 0.04465719 0.03968632
## 6   357.0 days  1164.2851  34.12162 27.74856 0.04131984 0.04601947 0.04222143
## 7   359.0 days  1079.9222  32.86217 24.88208 0.03671820 0.04601947 0.03750706
## 8   360.0 days  1127.4439  33.57743 25.94557 0.03848221 0.04601947 0.03932728
## 9   361.0 days   853.7364  29.21877 21.88554 0.03445494 0.03904813 0.03514902
## 10  362.0 days   436.7263  20.89800 17.82740 0.03160301 0.03372260 0.03216986
## 11  364.0 days   599.3676  24.48198 21.91131 0.03410403 0.03372260 0.03471092
## 12  535.5 days   588.0958  24.25069 21.29621 0.03232679 0.03029782 0.03301342
## 13  536.5 days   550.1992  23.45633 19.25161 0.02723353 0.02704426 0.02779974
## 14  538.5 days   510.7077  22.59884 17.67602 0.02361531 0.01826322 0.02407085
## 15  539.5 days   477.1791  21.84443 17.09330 0.02297716 0.01826322 0.02339239
## 16  540.5 days   478.9708  21.88540 17.38119 0.02354521 0.01826322 0.02396376
## 17  541.5 days   649.2640  25.48066 19.65913 0.02731632 0.01826322 0.02795564
## 18  543.5 days   655.6821  25.60629 19.85573 0.02773630 0.02204300 0.02838421
## 19  544.5 days  1324.6807  36.39616 25.55008 0.03360883 0.02204300 0.03460573
## 20  545.5 days  1801.0035  42.43823 29.72327 0.03712271 0.02204300 0.03827543
## 21  718.0 days  1812.7719  42.57666 30.35902 0.03896803 0.03433533 0.04016978
## 22  719.0 days  1860.5266  43.13382 32.65511 0.04553462 0.05743571 0.04654676
## 23  720.0 days  1883.8194  43.40299 33.76906 0.04865268 0.05743571 0.04956028
## 24  721.0 days  1859.1052  43.11734 33.25009 0.04779831 0.04974634 0.04865821
## 25  723.0 days  2047.2743  45.24682 37.35903 0.05579226 0.05866917 0.05702149
## 26  724.0 days  1816.9481  42.62567 33.62097 0.05033471 0.04974634 0.05128880
## 27  725.0 days  2147.9984  46.34650 38.06823 0.05644815 0.05866917 0.05771811
## 28  726.0 days  1549.2565  39.36060 33.31440 0.05235639 0.04974634 0.05334762
## 29  728.0 days  1391.4235  37.30179 32.20032 0.05212883 0.04974634 0.05310635
## 30  729.0 days  1997.3375  44.69158 39.16887 0.05384439 0.04977537 0.05489643
## 31  900.5 days  1957.2441  44.24075 37.78804 0.04988977 0.04974634 0.05110060
## 32  902.5 days  1952.0694  44.18223 37.61172 0.04923646 0.04974634 0.05061381
## 33  903.5 days  1904.2254  43.63743 35.84025 0.04522362 0.04632482 0.04647037
## 34  904.5 days  1805.7860  42.49454 34.43688 0.04320175 0.04632482 0.04429642
## 35  905.5 days  1794.7363  42.36433 33.97250 0.04254756 0.04632482 0.04362832
## 36  907.5 days  1678.9708  40.97525 32.86463 0.04186472 0.04632482 0.04289200
## 37  908.5 days  1498.1868  38.70642 29.36678 0.03812630 0.03573808 0.03904053
## 38  909.5 days  2539.4304  50.39276 35.06021 0.04392341 0.03573808 0.04534975
## 39  910.5 days  2767.0422  52.60268 36.46272 0.04560860 0.03573808 0.04713594
## 40 1083.0 days  2796.8581  52.88533 37.58069 0.04859835 0.05059554 0.05030241
## 41 1084.0 days  2776.0048  52.68780 36.28681 0.04529361 0.05059554 0.04692343
## 42 1085.0 days  2774.1887  52.67057 36.09173 0.04486161 0.05059554 0.04647579
## 43 1087.0 days  2779.7571  52.72340 36.18712 0.04536873 0.05059554 0.04701721
## 44 1088.0 days  3049.0456  55.21816 40.67489 0.05383985 0.06494207 0.05598780
## 45 1089.0 days  2920.0967  54.03792 39.06342 0.05184618 0.05297531 0.05386460
## 46 1090.0 days  3237.8485  56.90210 43.98795 0.05841838 0.06494207 0.06072512
## 47 1092.0 days  2279.1484  47.74043 38.89674 0.05463126 0.06494207 0.05656472
## 48 1093.0 days  2800.0736  52.91572 41.52069 0.05733319 0.06666782 0.05948779
## 49 1094.0 days  4525.6953  67.27329 53.47648 0.06042869 0.07182892 0.06279330
## 50 1266.5 days  4535.4574  67.34580 54.29873 0.06278897 0.07182892 0.06511491
## 51 1267.5 days  4591.9661  67.76405 56.41973 0.06803726 0.07182892 0.07055410
## 52 1268.5 days  4542.7028  67.39958 55.44311 0.06606849 0.07182892 0.06846634
## 53 1269.5 days  4628.0642  68.02988 56.23688 0.06842249 0.07182892 0.07108800
## 54 1271.5 days  4524.9675  67.26788 53.93610 0.06492676 0.07182892 0.06745811
## 55 1272.5 days  4614.6692  67.93136 54.71062 0.06676457 0.07471102 0.06945250
## 56 1273.5 days  4190.1939  64.73171 50.73129 0.06312036 0.05697737 0.06558585
## 57 1274.5 days  4055.5381  63.68311 50.10946 0.06392416 0.05697737 0.06646984
## 58 1276.5 days  3742.6635  61.17731 48.80697 0.06417266 0.05697737 0.06673937
## 59 1448.0 days  3748.9749  61.22887 49.09556 0.06491453 0.05697737 0.06758690
## 60 1449.0 days  3700.7313  60.83364 47.60108 0.06140711 0.04894886 0.06385842
## 61 1451.0 days  3656.2357  60.46681 45.92800 0.05756483 0.04191311 0.05989124
## 62 1452.0 days  3436.7780  58.62404 43.58275 0.05358564 0.04191311 0.05549334
## 63 1453.0 days  3568.0466  59.73313 46.29075 0.05886709 0.06904678 0.06102313
## 64 1454.0 days  3465.2003  58.86595 45.39424 0.05812776 0.06904678 0.06021667
## 65 1456.0 days  3718.9903  60.98352 48.06797 0.06193876 0.07621215 0.06426346
## 66 1457.0 days  3733.5274  61.10260 48.13668 0.06283545 0.07621215 0.06525757
## 67 1458.0 days  3961.8124  62.94293 49.09928 0.06443308 0.07621215 0.06700546
## 68 1459.0 days  7898.7004  88.87463 68.69899 0.07285507 0.07959574 0.07606997
## 69 1631.5 days  7925.8009  89.02697 69.67907 0.07582543 0.07959574 0.07892854
## 70 1632.5 days  7934.8337  89.07768 70.22742 0.07742895 0.07959574 0.08047508
## 71 1633.5 days  7846.4861  88.58039 68.72973 0.07444162 0.07621215 0.07727972
## 72 1635.5 days  8253.9211  90.85109 72.53439 0.08200360 0.08171522 0.08568854
## 73 1636.5 days  8009.4858  89.49573 69.42823 0.07773465 0.07621215 0.08113926
## 74 1637.5 days  8621.8333  92.85383 73.38844 0.08349980 0.09507027 0.08754296
## 75 1638.5 days  7834.2677  88.51140 68.77338 0.07987276 0.07192039 0.08357307
## 76 1640.5 days  8529.0129  92.35266 71.35552 0.08327817 0.07192039 0.08738826
## 77 1641.5 days  7405.4061  86.05467 68.12882 0.08288250 0.07192039 0.08694616
## 78 1813.0 days  7368.8082  85.84176 66.30052 0.07765332 0.07192039 0.08186276
## 79 1815.0 days  7357.6046  85.77648 65.52307 0.07551374 0.07192039 0.07977369
## 80 1816.0 days  7303.9064  85.46289 63.64742 0.07126516 0.07192039 0.07537831
## 81 1817.0 days  6924.0505  83.21088 60.20512 0.06555370 0.07192039 0.06897198
## 82 1818.0 days  7085.3455  84.17449 62.47232 0.07006843 0.08392725 0.07378873
## 83 1820.0 days  6397.8859  79.98679 57.84741 0.06475678 0.08029372 0.06787623
## 84 1821.0 days  6305.8926  79.40965 57.09551 0.06482511 0.08029372 0.06794977
## 85 1822.0 days  6149.5225  78.41889 56.55143 0.06544248 0.08029372 0.06865483
## 86 1823.0 days  5501.9283  74.17498 54.39088 0.06521095 0.08029372 0.06839690
## 87 1825.0 days 11630.0893 107.84289 80.28790 0.07901005 0.08392725 0.08319001
##     coverage
## 1  0.2222222
## 2  0.3333333
## 3  0.2222222
## 4  0.1111111
## 5  0.1111111
## 6  0.1111111
## 7  0.2222222
## 8  0.2222222
## 9  0.2222222
## 10 0.2222222
## 11 0.1111111
## 12 0.2222222
## 13 0.3333333
## 14 0.4444444
## 15 0.4444444
## 16 0.4444444
## 17 0.4444444
## 18 0.4444444
## 19 0.4444444
## 20 0.4444444
## 21 0.3333333
## 22 0.2222222
## 23 0.1111111
## 24 0.1111111
## 25 0.0000000
## 26 0.0000000
## 27 0.0000000
## 28 0.0000000
## 29 0.0000000
## 30 0.0000000
## 31 0.1111111
## 32 0.1111111
## 33 0.2222222
## 34 0.2222222
## 35 0.3333333
## 36 0.3333333
## 37 0.4444444
## 38 0.4444444
## 39 0.4444444
## 40 0.3333333
## 41 0.4444444
## 42 0.4444444
## 43 0.4444444
## 44 0.3333333
## 45 0.3333333
## 46 0.2222222
## 47 0.2222222
## 48 0.2222222
## 49 0.2222222
## 50 0.2222222
## 51 0.1111111
## 52 0.1111111
## 53 0.1111111
## 54 0.2222222
## 55 0.2222222
## 56 0.2222222
## 57 0.2222222
## 58 0.2222222
## 59 0.1111111
## 60 0.2222222
## 61 0.3333333
## 62 0.3333333
## 63 0.2222222
## 64 0.2222222
## 65 0.2222222
## 66 0.2222222
## 67 0.2222222
## 68 0.2222222
## 69 0.1111111
## 70 0.1111111
## 71 0.1111111
## 72 0.1111111
## 73 0.1111111
## 74 0.1111111
## 75 0.1111111
## 76 0.1111111
## 77 0.1111111
## 78 0.2222222
## 79 0.2222222
## 80 0.3333333
## 81 0.3333333
## 82 0.3333333
## 83 0.3333333
## 84 0.3333333
## 85 0.3333333
## 86 0.3333333
## 87 0.2222222

28. Plots for various metrics

Now, we can plot visualizations for metrics like RMSE, MAE, MAPE, MDAPE, SMAPE.

Section 4: Model Comparison and Validation

29. Model Comparison between best models of ARIMA and Prophet

Comparison of the models: ARIMA Model for RMSE comparison:

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,2,1)
## Q* = 6.3575, df = 9, p-value = 0.7037
## 
## Model df: 1.   Total lags used: 10

Auto ARIMA selects ARIMA(0,2,1) on the training dataset, as shown in the residual check above.

Identification of the model (ARIMA or Prophet) that performs the best out of sample: Calculating Out of Sample RMSE for ARIMA:

Prophet Model for RMSE comparison:

Final assessment of the forecast from the selected model: Out of Sample RMSE comparison between best models built on ARIMA and Prophet:

## # A tibble: 1 x 2
##   best_ARIMA best_Prophet
##        <dbl>        <dbl>
## 1      2143.         137.

We have a clear winner: the Prophet model's out-of-sample RMSE (about 137) is far lower than that of the best ARIMA model (about 2,143).
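The single train/test split behind a comparison like this can be sketched in base R. Since Prophet is not used here, two ARIMA specifications stand in for the two competing models; the series is synthetic and on a log scale, and the hold-out of 5 years is an assumption for the example.

```r
set.seed(9)
# Synthetic log-scale series with drift.
y <- as.numeric(cumsum(0.03 + arima.sim(model = list(ar = 0.5), n = 61, sd = 0.01)))

h     <- 5
train <- head(y, -h)  # all but the last 5 observations
test  <- tail(y, h)   # held-out final 5 observations

# Two candidate models standing in for "ARIMA" and "Prophet".
candidates <- list(ar1_diff = c(1, 1, 0), ma1_diff = c(0, 1, 1))

oos_rmse <- sapply(candidates, function(o) {
  fit  <- arima(train, order = o)
  pred <- as.numeric(predict(fit, n.ahead = h)$pred)
  sqrt(mean((test - pred)^2))  # out-of-sample RMSE
})

oos_rmse
names(which.min(oos_rmse))  # model with the lowest out-of-sample RMSE
```

For the comparison to be fair, both models must be scored on the same hold-out period and on the same scale, so any transformed forecasts should be back-transformed before computing RMSE.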

30. Conclusion

Building an ARIMA model by hand, working through stationarity, differencing, and the AR/MA characteristics of the process, is the better way to understand the time series and its data-generating process. It is also advisable to fit a Prophet model and compare forecasting performance. In this analysis Prophet performed better out of sample, but building the ARIMA model remained worthwhile for the understanding of the time series it provides.

BANA 7050 Final Project

Sudipt Tewari

2/24/2022