The WDI package provides convenient access to over 40 databases hosted by the World Bank, including the World Development Indicators (WDI), International Debt Statistics, Doing Business,Human Capital Index, and Sub-national poverty indicators.
I decided to analyze the WDI dataset based on our classroom discussion on this package which piqued my curiosity regarding this dataset and the wealth of information captured by this dataset.
This code chunk will call the WDI API and fetch the Unemployment data for years 1992 through 2020, as available.
## Rows: 7,714
## Columns: 11
## $ iso2c <chr> "1A", "1A", "1A", "1A", "1A", "1A", "1A", "1A", "1A", "1A~
## $ country <chr> "Arab World", "Arab World", "Arab World", "Arab World", "~
## $ unemployment <dbl> 11.481565, 12.319537, 10.564155, 10.495993, 10.302311, 9.~
## $ year <int> 2020, 2002, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 200~
## $ iso3c <chr> "ARB", "ARB", "ARB", "ARB", "ARB", "ARB", "ARB", "ARB", "~
## $ region <chr> "Aggregates", "Aggregates", "Aggregates", "Aggregates", "~
## $ capital <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "~
## $ longitude <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "~
## $ latitude <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "~
## $ income <chr> "Aggregates", "Aggregates", "Aggregates", "Aggregates", "~
## $ lending <chr> "Aggregates", "Aggregates", "Aggregates", "Aggregates", "~
To assess the behavior of the data-generating process of our time-series, we start by looking at the plots of the time-series data that we are using for our analysis. As we can see our time series data is variance non-stationary but may not be mean non-stationary.
Further we do a Rolling window calculation (Rolling average/standard deviation),Sliding the time series across a 5 year time window, transforming each value into a rolling mean /Standard deviation. We observe that, Rolling Mean decreases over Time while the Rolling Standard Deviation Increases over Time. This suggests both Mean and variance non-stationarity.
Note the obvious increasing trend in Unemployment(%) around 2020 due to the Impact of Covid-19 Recession across all sectors of the Indian economy.
We further develop a number of plots that describe the behavior of our time-series WDI dataset:
As the data appear to be both Mean and variance non-stationary, we transform your time-series using a natural log or Box-Cox transformation. As we must solve variance stationarity before mean stationarity, we start with solving to make our series variance stationary.
As we take Log and Box Cox Transformations of India unemployment data to make the time-series variance stationary, they both give very similar results.
Firstly we tried testing for Non-stationarity using Augmented Dickey-Fuller (ADF) Test:
As p-value > 0.05 It indicates Non-stationary
Next, we tried testing for Non-stationarity using KPSS Test (preferred approach):
As p-value > 0.05 indicates Stationary
As Results of Augmented Dickey-Fuller (ADF) Test and KPSS Test do not match, we assume Mean Non-Stationarity
##
## Augmented Dickey-Fuller Test
##
## data: unemployment_var_transform$unemployment_box
## Dickey-Fuller = -2.1461, Lag order = 3, p-value = 0.5166
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: unemployment_var_transform$unemployment_box
## KPSS Level = 0.14101, Truncation lag parameter = 2, p-value = 0.1
Next we try to find the Integrated order which is the number of differences required to make our series mean stationary.
We start with one difference (I(1)) and see how we do before trying a second difference:
##
## Augmented Dickey-Fuller Test
##
## data: unemployment_mean_transform$unemployment_box_diff[!is.na(unemployment_mean_transform$unemployment_box_diff)]
## Dickey-Fuller = -1.4553, Lag order = 2, p-value = 0.7799
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: unemployment_mean_transform$unemployment_box_diff[!is.na(unemployment_mean_transform$unemployment_box_diff)]
## KPSS Level = 0.25067, Truncation lag parameter = 2, p-value = 0.1
Examining the ACF/PACF to see what type of model may be appropriate, We find that the ACF/PACF don’t give us much info on AR/MA process here. ACF shows no significant “dampening” autocorrelation.Also PACF lags does not indicate “order” of AR process or shows any oscillating patterns for MA process.
Next we try a second difference:
##
## Augmented Dickey-Fuller Test
##
## data: unemployment_mean_transform1$unemployment_box_diff[!is.na(unemployment_mean_transform1$unemployment_box_diff)]
## Dickey-Fuller = -1.4553, Lag order = 2, p-value = 0.7799
## alternative hypothesis: stationary
##
## KPSS Test for Level Stationarity
##
## data: unemployment_mean_transform1$unemployment_box_diff[!is.na(unemployment_mean_transform1$unemployment_box_diff)]
## KPSS Level = 0.25067, Truncation lag parameter = 2, p-value = 0.1
As we examine ACF/PACF to see what type of model may be appropriate, We find that the ACF/PACF don’t give us much info on AR/MA process here. It may be random walk model with ARIMA(0,1,0).WE next try fitting several ARIMA models to our time-series unemployement variable. We will try to determine Which model is the “best” according to the AIC and BIC.
Note we’re using Box-Cox Transformed Parameter
##
## Call:
## arima(x = unemployment_mean_transform$unemployment_box, order = c(0, 1, 1))
##
## Coefficients:
## ma1
## -1.0000
## s.e. 0.2071
##
## sigma^2 estimated as 0.0001123: log likelihood = 76.58, aic = -149.15
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set -0.001546848 0.01039308 0.005228364 -0.1895448 0.5905995 1.21511
## ACF1
## Training set -0.03699429
##
## Call:
## arima(x = unemployment_mean_transform$unemployment_box, order = c(0, 1, 0))
##
##
## sigma^2 estimated as 0.0001468: log likelihood = 74.86, aic = -147.72
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.001573842 0.01188161 0.004171181 0.1660662 0.464282 0.9694133
## ACF1
## Training set -0.01259795
##
## Call:
## arima(x = unemployment_mean_transform$unemployment_box, order = c(1, 1, 0))
##
## Coefficients:
## ar1
## -0.2015
## s.e. 0.6252
##
## sigma^2 estimated as 0.000146: log likelihood = 74.91, aic = -145.82
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.001428221 0.01184801 0.004439028 0.1493185 0.49516 1.031663
## ACF1
## Training set -0.009926859
##
## Call:
## arima(x = unemployment_mean_transform$unemployment_box, order = c(0, 2, 1))
##
## Coefficients:
## ma1
## -0.0457
## s.e. 0.6802
##
## sigma^2 estimated as 0.0001676: log likelihood = 70.27, aic = -136.54
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.002227539 0.01244118 0.004192699 0.2413288 0.4654647 0.9744141
## ACF1
## Training set -0.0007774564
##
## Call:
## arima(x = unemployment_mean_transform$unemployment_box, order = c(0, 2, 0))
##
##
## sigma^2 estimated as 0.0001677: log likelihood = 70.27, aic = -138.54
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.002232241 0.01244294 0.004212842 0.2418833 0.4677729 0.9790955
## ACF1
## Training set -0.003705795
##
## Call:
## arima(x = unemployment_mean_transform$unemployment_box, order = c(1, 2, 0))
##
## Coefficients:
## ar1
## -0.0413
## s.e. 0.6255
##
## sigma^2 estimated as 0.0001676: log likelihood = 70.27, aic = -136.54
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.002228235 0.01244137 0.00419662 0.2414109 0.4659133 0.9753253
## ACF1
## Training set -0.0009612731
Looks like there no unmodeled variation left, we may not Want to try any different models now.
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,1,1)
## Q* = 0.97603, df = 4, p-value = 0.9134
##
## Model df: 1. Total lags used: 5
Next we try Ljung-Box Test for Residual Autocorrelation
p < 0.05 indicates residual autocorrelation at that lag
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.039853, df = 1, p-value = 0.8418
##
## Box-Ljung test
##
## data: resid
## X-squared = 0.97603, df = 5, p-value = 0.9645
##
## Box-Ljung test
##
## data: resid
## X-squared = 1.5369, df = 10, p-value = 0.9988
##
## Box-Ljung test
##
## data: resid
## X-squared = 4.7774, df = 20, p-value = 0.9998
##
## Box-Ljung test
##
## data: resid
## X-squared = 6.1623, df = 25, p-value = 1
Ljung-Box Test Results indicates residual no autocorrelation at different lags. Remaining autocorrelation in the residuals suggests that we may have the “best model”.
## Series: unemployment_mean_transform$unemployment_box
## ARIMA(0,0,0) with non-zero mean
##
## Coefficients:
## mean
## 0.8776
## s.e. 0.0020
##
## sigma^2 = 0.0001123: log likelihood = 81.84
## AIC=-159.69 AICc=-159.16 BIC=-157.17
##
## Training set error measures:
## ME RMSE MAE MPE MAPE
## Training set -1.366427e-16 0.01039165 0.006008677 -0.01362929 0.6791605
## MASE ACF1
## Training set 1.396461 0.008248719
##
## Ljung-Box test
##
## data: Residuals from ARIMA(0,0,0) with non-zero mean
## Q* = 0.52305, df = 4, p-value = 0.9712
##
## Model df: 1. Total lags used: 5
##
## Box-Ljung test
##
## data: mod1_auto$residuals
## X-squared = 0.0019814, df = 1, p-value = 0.9645
##
## Box-Ljung test
##
## data: mod1_auto$residuals
## X-squared = 0.0019814, df = 1, p-value = 0.9645
## [1] 0.3236508