Section 1:

1. Introduction

The WDI package provides convenient access to over 40 databases hosted by the World Bank, including the World Development Indicators (WDI), International Debt Statistics, Doing Business, the Human Capital Index, and sub-national poverty indicators.

I decided to analyze the WDI data after our classroom discussion of this package, which piqued my curiosity about the wealth of information it captures.

2. WDI time-series dataset selection

This code chunk calls the WDI API and fetches the unemployment data for the years 1992 through 2020, as available.
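
The chunk itself is not echoed, but a minimal sketch of the likely call is below, assuming the ILO-modeled unemployment indicator SL.UEM.TOTL.ZS and a dplyr rename to produce the unemployment column shown in the glimpse:

```r
library(WDI)
library(dplyr)

# Pull unemployment (% of total labor force) for all countries, 1992-2020,
# with extra = TRUE to include iso3c, region, income, lending, etc.
unemployment <- WDI(
  country   = "all",
  indicator = "SL.UEM.TOTL.ZS",
  start     = 1992,
  end       = 2020,
  extra     = TRUE
) %>%
  rename(unemployment = SL.UEM.TOTL.ZS)

glimpse(unemployment)
```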

## Rows: 7,714
## Columns: 11
## $ iso2c        <chr> "1A", "1A", "1A", "1A", "1A", "1A", "1A", "1A", "1A", "1A~
## $ country      <chr> "Arab World", "Arab World", "Arab World", "Arab World", "~
## $ unemployment <dbl> 11.481565, 12.319537, 10.564155, 10.495993, 10.302311, 9.~
## $ year         <int> 2020, 2002, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 200~
## $ iso3c        <chr> "ARB", "ARB", "ARB", "ARB", "ARB", "ARB", "ARB", "ARB", "~
## $ region       <chr> "Aggregates", "Aggregates", "Aggregates", "Aggregates", "~
## $ capital      <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "~
## $ longitude    <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "~
## $ latitude     <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "~
## $ income       <chr> "Aggregates", "Aggregates", "Aggregates", "Aggregates", "~
## $ lending      <chr> "Aggregates", "Aggregates", "Aggregates", "Aggregates", "~

3. Behavior of the data-generating process

To assess the behavior of the data-generating process, we start by plotting the time series used in our analysis. The plot suggests the series is variance non-stationary, although it may not be mean non-stationary.
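
A minimal sketch of the level plot, assuming the full pull is filtered down to India (the data frame name unemployment_india is hypothetical):

```r
library(dplyr)
library(ggplot2)

# Keep only the India rows and order them chronologically (hypothetical name)
unemployment_india <- unemployment %>%
  filter(country == "India") %>%
  arrange(year)

ggplot(unemployment_india, aes(x = year, y = unemployment)) +
  geom_line() +
  labs(x = "Year", y = "Unemployment (%)",
       title = "India unemployment rate, 1992-2020")
```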

Next, we compute rolling-window statistics, sliding a five-year window across the series and transforming each value into a rolling mean and a rolling standard deviation. We observe that the rolling mean decreases over time while the rolling standard deviation increases, which suggests both mean and variance non-stationarity.
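
A sketch of the five-year rolling statistics, using zoo as one possible rolling-window helper (an assumption; the original chunk is not shown):

```r
library(dplyr)
library(zoo)

unemployment_india <- unemployment_india %>%
  mutate(
    # Right-aligned 5-year rolling mean and standard deviation
    roll_mean = rollmean(unemployment, k = 5, fill = NA, align = "right"),
    roll_sd   = rollapply(unemployment, width = 5, FUN = sd,
                          fill = NA, align = "right")
  )
```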

Note the obvious increase in unemployment (%) around 2020, reflecting the impact of the COVID-19 recession across all sectors of the Indian economy.

We further develop a number of plots that describe the behavior of our time-series WDI dataset.

4. Time-Series Transformation for variance stationarity

As the data appear to be both mean and variance non-stationary, we transform our time series using a natural log or Box-Cox transformation. Because variance stationarity must be addressed before mean stationarity, we start by making the series variance stationary.

Taking log and Box-Cox transformations of the India unemployment data to make the series variance stationary, we find that both give very similar results.
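
A sketch of the two variance-stabilizing transformations, using forecast::BoxCox with an estimated lambda; the column name follows the unemployment_box variable that appears in the test output below:

```r
library(dplyr)
library(forecast)

# Estimate the Box-Cox lambda, then create log and Box-Cox transformed columns
lambda <- BoxCox.lambda(unemployment_india$unemployment)

unemployment_var_transform <- unemployment_india %>%
  mutate(
    unemployment_log = log(unemployment),
    unemployment_box = BoxCox(unemployment, lambda)
  )
```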

5. Testing for Mean Non-Stationarity

First, we test for non-stationarity using the Augmented Dickey-Fuller (ADF) test, whose null hypothesis is non-stationarity; a p-value > 0.05 therefore means we cannot reject non-stationarity.

Next, we test using the KPSS test (the preferred approach), whose null hypothesis is stationarity; a p-value > 0.05 therefore means we cannot reject stationarity.

Because the results of the ADF and KPSS tests do not agree, we assume mean non-stationarity.
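
The output below matches the tseries implementations of both tests; a sketch of the likely calls:

```r
library(tseries)

# ADF: null hypothesis = unit root (non-stationary)
adf.test(unemployment_var_transform$unemployment_box)

# KPSS: null hypothesis = level stationarity
kpss.test(unemployment_var_transform$unemployment_box, null = "Level")
```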

## 
##  Augmented Dickey-Fuller Test
## 
## data:  unemployment_var_transform$unemployment_box
## Dickey-Fuller = -2.1461, Lag order = 3, p-value = 0.5166
## alternative hypothesis: stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  unemployment_var_transform$unemployment_box
## KPSS Level = 0.14101, Truncation lag parameter = 2, p-value = 0.1

Next, we try to find the order of integration, i.e. the number of differences required to make our series mean stationary.

We start with one difference (I(1)) and see how we do before trying a second difference:
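
A sketch of the first difference and re-test; the column name unemployment_box_diff is kept consistent with the output below:

```r
library(dplyr)
library(tseries)

unemployment_mean_transform <- unemployment_var_transform %>%
  mutate(unemployment_box_diff = unemployment_box - dplyr::lag(unemployment_box))

# Drop the leading NA introduced by differencing before re-testing
d1 <- unemployment_mean_transform$unemployment_box_diff
adf.test(d1[!is.na(d1)])
kpss.test(d1[!is.na(d1)], null = "Level")
```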

## 
##  Augmented Dickey-Fuller Test
## 
## data:  unemployment_mean_transform$unemployment_box_diff[!is.na(unemployment_mean_transform$unemployment_box_diff)]
## Dickey-Fuller = -1.4553, Lag order = 2, p-value = 0.7799
## alternative hypothesis: stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  unemployment_mean_transform$unemployment_box_diff[!is.na(unemployment_mean_transform$unemployment_box_diff)]
## KPSS Level = 0.25067, Truncation lag parameter = 2, p-value = 0.1

Examining the ACF/PACF to see what type of model may be appropriate, we find that they do not give us much information about an AR or MA process here. The ACF shows no significant damped autocorrelation, and the PACF lags neither indicate the order of an AR process nor show the oscillating pattern of an MA process.
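
A sketch of the correlogram calls on the differenced series (base acf()/pacf(); the original may have used forecast's plotting helpers instead):

```r
d1 <- na.omit(unemployment_mean_transform$unemployment_box_diff)

acf(d1,  main = "ACF of differenced Box-Cox series")
pacf(d1, main = "PACF of differenced Box-Cox series")
```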

Next we try a second difference:

## 
##  Augmented Dickey-Fuller Test
## 
## data:  unemployment_mean_transform1$unemployment_box_diff[!is.na(unemployment_mean_transform1$unemployment_box_diff)]
## Dickey-Fuller = -1.4553, Lag order = 2, p-value = 0.7799
## alternative hypothesis: stationary
## 
##  KPSS Test for Level Stationarity
## 
## data:  unemployment_mean_transform1$unemployment_box_diff[!is.na(unemployment_mean_transform1$unemployment_box_diff)]
## KPSS Level = 0.25067, Truncation lag parameter = 2, p-value = 0.1

Section 2:

1. ARIMA Modeling

As we examine the ACF/PACF to see what type of model may be appropriate, we find that they do not give us much information about an AR or MA process; the series may simply follow a random walk, i.e. ARIMA(0,1,0). We next try fitting several ARIMA models to our unemployment series and determine which model is "best" according to the AIC and BIC.

Note that we are modeling the Box-Cox-transformed variable.
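
The individual arima() calls appear in the output below; a sketch of how the set of fits was likely produced (the loop itself is an assumption):

```r
library(forecast)

orders <- list(c(0, 1, 1), c(0, 1, 0), c(1, 1, 0),
               c(0, 2, 1), c(0, 2, 0), c(1, 2, 0))

for (ord in orders) {
  fit <- arima(unemployment_mean_transform$unemployment_box, order = ord)
  print(fit)            # coefficients, sigma^2, log likelihood, AIC
  print(accuracy(fit))  # training-set error measures (ME, RMSE, MAE, ...)
}
```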

## 
## Call:
## arima(x = unemployment_mean_transform$unemployment_box, order = c(0, 1, 1))
## 
## Coefficients:
##           ma1
##       -1.0000
## s.e.   0.2071
## 
## sigma^2 estimated as 0.0001123:  log likelihood = 76.58,  aic = -149.15
## 
## Training set error measures:
##                        ME       RMSE         MAE        MPE      MAPE    MASE
## Training set -0.001546848 0.01039308 0.005228364 -0.1895448 0.5905995 1.21511
##                     ACF1
## Training set -0.03699429
## 
## Call:
## arima(x = unemployment_mean_transform$unemployment_box, order = c(0, 1, 0))
## 
## 
## sigma^2 estimated as 0.0001468:  log likelihood = 74.86,  aic = -147.72
## 
## Training set error measures:
##                       ME       RMSE         MAE       MPE     MAPE      MASE
## Training set 0.001573842 0.01188161 0.004171181 0.1660662 0.464282 0.9694133
##                     ACF1
## Training set -0.01259795
## 
## Call:
## arima(x = unemployment_mean_transform$unemployment_box, order = c(1, 1, 0))
## 
## Coefficients:
##           ar1
##       -0.2015
## s.e.   0.6252
## 
## sigma^2 estimated as 0.000146:  log likelihood = 74.91,  aic = -145.82
## 
## Training set error measures:
##                       ME       RMSE         MAE       MPE    MAPE     MASE
## Training set 0.001428221 0.01184801 0.004439028 0.1493185 0.49516 1.031663
##                      ACF1
## Training set -0.009926859
## 
## Call:
## arima(x = unemployment_mean_transform$unemployment_box, order = c(0, 2, 1))
## 
## Coefficients:
##           ma1
##       -0.0457
## s.e.   0.6802
## 
## sigma^2 estimated as 0.0001676:  log likelihood = 70.27,  aic = -136.54
## 
## Training set error measures:
##                       ME       RMSE         MAE       MPE      MAPE      MASE
## Training set 0.002227539 0.01244118 0.004192699 0.2413288 0.4654647 0.9744141
##                       ACF1
## Training set -0.0007774564
## 
## Call:
## arima(x = unemployment_mean_transform$unemployment_box, order = c(0, 2, 0))
## 
## 
## sigma^2 estimated as 0.0001677:  log likelihood = 70.27,  aic = -138.54
## 
## Training set error measures:
##                       ME       RMSE         MAE       MPE      MAPE      MASE
## Training set 0.002232241 0.01244294 0.004212842 0.2418833 0.4677729 0.9790955
##                      ACF1
## Training set -0.003705795
## 
## Call:
## arima(x = unemployment_mean_transform$unemployment_box, order = c(1, 2, 0))
## 
## Coefficients:
##           ar1
##       -0.0413
## s.e.   0.6255
## 
## sigma^2 estimated as 0.0001676:  log likelihood = 70.27,  aic = -136.54
## 
## Training set error measures:
##                       ME       RMSE        MAE       MPE      MAPE      MASE
## Training set 0.002228235 0.01244137 0.00419662 0.2414109 0.4659133 0.9753253
##                       ACF1
## Training set -0.0009612731

It looks like there is no unmodeled variation left, so we may not want to try any further models.

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)
## Q* = 0.97603, df = 4, p-value = 0.9134
## 
## Model df: 1.   Total lags used: 5

2. Ljung-Box Test Results

Next, we run the Ljung-Box test for residual autocorrelation at several lags.

A p-value < 0.05 would indicate residual autocorrelation at that lag.
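
A sketch of the lag-by-lag tests; the lags mirror the output below, and mod_011 is a hypothetical name for the ARIMA(0,1,1) fit:

```r
resid <- residuals(mod_011)  # mod_011: hypothetical name for the ARIMA(0,1,1) fit

for (l in c(1, 5, 10, 20, 25)) {
  print(Box.test(resid, lag = l, type = "Ljung-Box"))
}
```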

## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 0.039853, df = 1, p-value = 0.8418
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 0.97603, df = 5, p-value = 0.9645
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 1.5369, df = 10, p-value = 0.9988
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 4.7774, df = 20, p-value = 0.9998
## 
##  Box-Ljung test
## 
## data:  resid
## X-squared = 6.1623, df = 25, p-value = 1

The Ljung-Box test results indicate no residual autocorrelation at any of the lags tested. The absence of remaining autocorrelation in the residuals suggests that we may have the "best" model.

3. Auto-ARIMA Modeling
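
A sketch of the automatic model selection; summary() reproduces the training-set accuracy block shown below:

```r
library(forecast)

mod1_auto <- auto.arima(unemployment_mean_transform$unemployment_box)
summary(mod1_auto)
```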

## Series: unemployment_mean_transform$unemployment_box 
## ARIMA(0,0,0) with non-zero mean 
## 
## Coefficients:
##         mean
##       0.8776
## s.e.  0.0020
## 
## sigma^2 = 0.0001123:  log likelihood = 81.84
## AIC=-159.69   AICc=-159.16   BIC=-157.17
## 
## Training set error measures:
##                         ME       RMSE         MAE         MPE      MAPE
## Training set -1.366427e-16 0.01039165 0.006008677 -0.01362929 0.6791605
##                  MASE        ACF1
## Training set 1.396461 0.008248719

4. Diagnostics from the Model
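
A sketch of the residual diagnostics: checkresiduals() prints the Ljung-Box summary for the auto-selected model, and Box.test() repeats the test on the raw residuals:

```r
library(forecast)

checkresiduals(mod1_auto)                                   # residual plots + Ljung-Box summary
Box.test(mod1_auto$residuals, lag = 1, type = "Ljung-Box")  # test at lag 1
```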

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 0.52305, df = 4, p-value = 0.9712
## 
## Model df: 1.   Total lags used: 5
## 
##  Box-Ljung test
## 
## data:  mod1_auto$residuals
## X-squared = 0.0019814, df = 1, p-value = 0.9645
## 
##  Box-Ljung test
## 
## data:  mod1_auto$residuals
## X-squared = 0.0019814, df = 1, p-value = 0.9645
## [1] 0.3236508