Unemployment is an economic phenomenon of such importance that a large body of literature has been devoted to it. The term unemployment refers to the situation in which a job-seeker, for more than three months, actively offers their labour to the market but is unable to find work. The unemployment rate is an economic indicator that measures the number of unemployed people as a percentage of the labour force (those aged 16 to 65). It is normal and healthy for an economy to have some minor level of unemployment; this level is usually called the natural rate of unemployment. Studies have shown that this rate is unique to every economy and depends on various factors, but most economies have a natural unemployment rate of 4 to 5 percent.
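In formula form, the indicator is simply
\[\text{Unemployment rate} = \frac{\text{number of unemployed}}{\text{labour force}} \times 100\%,\]
so, for example, 8 million unemployed in a labour force of 160 million gives a rate of 5 percent.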
Thus, given the importance of this indicator in our lives and my interest in economics, I decided to forecast the US unemployment rate using historical data and compare my results to the actual figures. The data come from the R library astsa and are named UnempRate: the monthly U.S. unemployment rate, in percent unemployed, from January 1948 to November 2016. The purpose of this study is to evaluate to what extent it is possible to predict the unemployment rate of a nation using its historical unemployment figures and advanced statistical methods.
# Load library and data
library(astsa)
data(UnempRate)
# Plot the data
plot.ts(UnempRate, main="US Unemployment Rate (Jan, 1948 - Nov, 2016)",
xlab="Year", ylab="Unemployment Rate")
Clearly, the mean is not constant and the process is not stationary. Next, we plot the ACF and PACF.
plt = acf2(UnempRate,200)
We compute the sample ACF and PACF to see how quickly the autocorrelation \(\rho(h)\) decays. As shown above, the sample ACF decays to zero extremely slowly as \(h\) increases, meaning that differencing is needed. We will try to make the data stationary by differencing:
diff1 = diff(UnempRate)
plt = acf2(diff1,200)
# Plot first difference data
plot.ts(diff1, main="Monthly Change in US Unemployment Rate (Jan, 1948 - Nov, 2016)",
xlab="Year", ylab="Monthly Change in Unemployment Rate")
The ACF and PACF of the first difference suggest a seasonal pattern, as the ACF shows peaks at every multiple of lag 12. We will make a month plot of our first-differenced data to check whether this hypothesis holds.
Our plot of the first-differenced data shows a mean that is now roughly constant around zero, and the variance no longer drifts. Thus, further non-seasonal differencing is probably not needed, which strengthens our hypothesis of a seasonal pattern.
From economics, we know about seasonal unemployment and cyclical unemployment; hence, our observations are also supported by the literature.
# Month plot of the original data
monthplot(UnempRate)
# Month plot of the first-differenced data
monthplot(diff1)
From the month plots we see that our hypothesis is correct. We will proceed by applying a seasonal difference to the first-differenced data.
# Seasonal (lag-12) difference of the first-differenced series
diff12 = diff(diff1, 12)
# Plot lag 12 difference data
plot.ts(diff12, main="Yearly Change of Monthly Change in US Unemployment Rate",
xlab="Year", ylab="Yealy Change of Monthly Change in Unemployment Rate")
plt = acf2(diff12,50)
Our plot of the doubly differenced data resembles white noise, with a roughly constant mean around zero and not much drift in variance. Thus, we may assume our data is now stationary.
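To corroborate this visual assessment, one could also run a unit-root test; the sketch below uses adf.test from the tseries package (an extra dependency, not used elsewhere in this analysis):
# Augmented Dickey-Fuller test on the doubly differenced series
library(tseries)
adf.test(diff12, alternative = "stationary")
# A small p-value (< 0.05) rejects the unit-root null,
# supporting the stationarity assumption made above.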
The ACF of the doubly differenced series cuts off after one seasonal lag (lag 12) while the seasonal PACF tails off, indicating P = 0 and Q = 1. The non-seasonal PACF cuts off after lag 3, so we propose an \(ARIMA(3,1,0)\times(0,1,1)_{12}\) model. For the sake of variety and comparison we also propose an alternative, \(ARIMA(3,1,3)\times(0,1,1)_{12}\); the difference between the two is that the latter has a non-seasonal MA(3) component while the former has none.
model1 = sarima(UnempRate,3,1,0, 0,1,1, 12)
## initial value -1.155876
## iter 2 value -1.375366
## iter 3 value -1.398788
## iter 4 value -1.413527
## iter 5 value -1.421629
## iter 6 value -1.426445
## iter 7 value -1.427929
## iter 8 value -1.428084
## iter 9 value -1.428092
## iter 10 value -1.428092
## iter 10 value -1.428092
## final value -1.428092
## converged
## initial value -1.435718
## iter 2 value -1.435907
## iter 3 value -1.436180
## iter 4 value -1.436186
## iter 5 value -1.436187
## iter 5 value -1.436187
## iter 5 value -1.436187
## final value -1.436187
## converged
print(model1$ttable)
## Estimate SE t.value p.value
## ar1 0.1148 0.0351 3.2678 0.0011
## ar2 0.2023 0.0345 5.8651 0.0000
## ar3 0.0900 0.0350 2.5709 0.0103
## sma1 -0.7674 0.0256 -30.0169 0.0000
Inspection of the standardized residuals shows no obvious patterns, although there are at least two outliers exceeding 3 standard deviations from the mean. The ACF of the residuals shows no significant spikes, so we conclude that there are largely no apparent departures from the model's randomness assumption. The normal Q-Q plot of the residuals likewise supports the normality assumption, apart from some outliers in the tails, which show some departure from normality. Finally, most of the p-values for the Ljung-Box statistic are at or above the significance level, so we fail to reject the null hypothesis that the residuals are independent.
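As a sanity check, the Ljung-Box statistic at a particular lag can also be computed directly from the model residuals with base R's Box.test; a minimal sketch, assuming model1$fit holds the underlying arima fit as in current versions of astsa:
# Ljung-Box test on Model 1's residuals at lag 24, with degrees of
# freedom adjusted for the 4 estimated ARMA coefficients
res1 = residuals(model1$fit)
Box.test(res1, lag = 24, type = "Ljung-Box", fitdf = 4)
# A large p-value is consistent with independent residuals.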
From the table of coefficient estimates, standard errors, and p-values, we conclude that all coefficients are non-zero: each AR term has a p-value below the significance level of 0.05, and the seasonal MA term does as well. I should also highlight that this model has an AIC of -18.08 and a residual variance of 0.05582.
Our model is thus
\[(1 - 0.1148_{(0.0351)}B - 0.2023_{(0.0345)}B^2 - 0.0900_{(0.0350)}B^3)(1-B)(1-B^{12})\,x_t = (1 - 0.7674_{(0.0256)}B^{12})\,w_t.\]
In non-technical language, our model says that a future unemployment rate can be forecast from recent past rates, each multiplied by an estimated constant. The differencing factors mean that the model actually works with month-over-month and year-over-year changes rather than raw levels, and the white-noise terms represent random shocks, one of which (at the seasonal lag) also enters multiplied by an estimated constant.
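To make this explicit, write \(y_t\) for the doubly differenced series; a direct expansion of the fitted equation gives
\[y_t = x_t - x_{t-1} - x_{t-12} + x_{t-13}, \qquad y_t = 0.1148\,y_{t-1} + 0.2023\,y_{t-2} + 0.0900\,y_{t-3} + w_t - 0.7674\,w_{t-12},\]
so a one-step forecast rebuilds the level as \(x_{t-1} + x_{t-12} - x_{t-13}\) plus the predicted change \(y_t\).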
model2 = sarima(UnempRate,3,1,3, 0,1,1, 12)
## initial value -1.155876
## iter 2 value -1.279225
## iter 3 value -1.378295
## iter 4 value -1.400768
## iter 5 value -1.422581
## iter 6 value -1.426073
## iter 7 value -1.427334
## iter 8 value -1.428107
## iter 9 value -1.430380
## iter 10 value -1.430948
## iter 11 value -1.431197
## iter 12 value -1.431240
## iter 13 value -1.431315
## iter 14 value -1.431508
## iter 15 value -1.431797
## iter 16 value -1.431990
## iter 17 value -1.432045
## iter 18 value -1.432067
## iter 19 value -1.432069
## iter 20 value -1.432073
## iter 21 value -1.432125
## iter 22 value -1.432161
## iter 23 value -1.432225
## iter 24 value -1.432276
## iter 25 value -1.432352
## iter 26 value -1.432452
## iter 27 value -1.433064
## iter 28 value -1.433281
## iter 29 value -1.433964
## iter 30 value -1.435818
## iter 31 value -1.435843
## iter 32 value -1.437111
## iter 33 value -1.438140
## iter 34 value -1.440778
## iter 35 value -1.441273
## iter 36 value -1.441387
## iter 37 value -1.441470
## iter 38 value -1.441649
## iter 38 value -1.441649
## iter 39 value -1.441910
## iter 40 value -1.442398
## iter 41 value -1.444011
## iter 42 value -1.444037
## iter 42 value -1.444037
## iter 43 value -1.444197
## iter 44 value -1.444216
## iter 44 value -1.444216
## iter 45 value -1.444239
## iter 46 value -1.444241
## iter 46 value -1.444241
## iter 46 value -1.444241
## final value -1.444241
## converged
## initial value -1.433671
## iter 2 value -1.434549
## iter 3 value -1.434838
## iter 4 value -1.437369
## iter 5 value -1.438439
## iter 6 value -1.438758
## iter 7 value -1.439010
## iter 8 value -1.439100
## iter 9 value -1.439195
## iter 10 value -1.439508
## iter 11 value -1.439612
## iter 12 value -1.439646
## iter 13 value -1.439653
## iter 14 value -1.439656
## iter 15 value -1.439664
## iter 16 value -1.439715
## iter 17 value -1.439730
## iter 18 value -1.439733
## iter 19 value -1.439743
## iter 20 value -1.439746
## iter 21 value -1.439753
## iter 22 value -1.439828
## iter 23 value -1.439902
## iter 24 value -1.440015
## iter 25 value -1.440162
## iter 26 value -1.440208
## iter 27 value -1.440293
## iter 28 value -1.440504
## iter 29 value -1.440598
## iter 30 value -1.440713
## iter 31 value -1.440985
## iter 32 value -1.441137
## iter 33 value -1.441175
## iter 34 value -1.441254
## iter 35 value -1.441397
## iter 36 value -1.441432
## iter 37 value -1.441447
## iter 38 value -1.441453
## iter 39 value -1.441474
## iter 40 value -1.441486
## iter 41 value -1.441490
## iter 42 value -1.441491
## iter 43 value -1.441493
## iter 44 value -1.441498
## iter 45 value -1.441510
## iter 46 value -1.441536
## iter 47 value -1.441539
## iter 48 value -1.441540
## iter 49 value -1.441550
## iter 50 value -1.441554
## iter 51 value -1.441564
## iter 52 value -1.441564
## iter 53 value -1.441565
## iter 54 value -1.441569
## iter 55 value -1.441577
## iter 56 value -1.441595
## iter 57 value -1.441619
## iter 58 value -1.441640
## iter 59 value -1.441645
## iter 60 value -1.441646
## iter 61 value -1.441646
## iter 62 value -1.441647
## iter 62 value -1.441647
## iter 62 value -1.441647
## final value -1.441647
## converged
print(model2$ttable)
## Estimate SE t.value p.value
## ar1 -0.4783 0.0448 -10.6693 0.0000
## ar2 0.0674 0.0576 1.1696 0.2425
## ar3 0.8191 0.0446 18.3465 0.0000
## ma1 0.6412 0.0562 11.4054 0.0000
## ma2 0.1585 0.0726 2.1838 0.0293
## ma3 -0.6440 0.0599 -10.7456 0.0000
## sma1 -0.7586 0.0256 -29.6797 0.0000
Model 2 also passes the standardized-residuals check, with only one outlier exceeding 3 standard deviations from the mean. It likewise passes the residual-ACF and normal Q-Q plot checks; hence, the randomness and normality assumptions are satisfied. Model 2 also performs better than Model 1 on the Ljung-Box test.
From the table of coefficient estimates, standard errors, and p-values, we fail to reject the null hypothesis that the AR(2) coefficient is zero, as its p-value is greater than the significance level of 0.05. The AIC of this model is -20.97 and its residual variance is 0.05507.
By the AIC criterion Model 2 is the better fit, and some assumptions are better satisfied; however, one of its coefficients is not statistically significant. We therefore decide in favour of Model 1 and will not analyze Model 2 further.
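For completeness, the information criteria cited above can be read directly off the fitted objects; a minimal sketch, assuming the lists returned by sarima expose $AIC and $BIC components as in current versions of astsa:
# Side-by-side comparison of the two candidate models' information criteria
data.frame(Model = c("ARIMA(3,1,0)x(0,1,1)[12]", "ARIMA(3,1,3)x(0,1,1)[12]"),
           AIC = c(model1$AIC, model2$AIC),
           BIC = c(model1$BIC, model2$BIC))
# Lower values indicate a better fit after penalizing model complexity.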
# Forecasting for the next 10 months
# Copy the series under a readable name for the plot's axis label
Unemployment_Rate = UnempRate
pred1 = sarima.for(Unemployment_Rate, 10, 3,1,0, 0,1,1, 12, main="Forecast of US Unemployment Rate for the Next 10 Months")
# Set month index vector for the 10-month forecast horizon
month = 1:10
# Upper bound of the 95% prediction interval
upper = pred1$pred + qnorm(0.975)*pred1$se
# Lower bound of the 95% prediction interval
lower = pred1$pred - qnorm(0.975)*pred1$se
# Construct a dataframe for the intervals
(data.frame("Prediction"=pred1$pred,"PI 95% Lower Bound"=lower,"PI 95% Upper Bound"=upper))
## Prediction PI.95..Lower.Bound PI.95..Upper.Bound
## 1 4.411524 3.951547 4.871502
## 2 5.030795 4.325351 5.736239
## 3 4.882757 3.954779 5.810734
## 4 4.657417 3.520997 5.793838
## 5 4.088660 2.755775 5.421544
## 6 4.175067 2.652266 5.697868
## 7 4.574522 2.871017 6.278028
## 8 4.632169 2.755877 6.508461
## 9 4.405957 2.362275 6.449639
## 10 4.058865 1.855873 6.261858
UnempRate
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1948 4.0 4.7 4.5 4.0 3.4 3.9 3.9 3.6 3.4 2.9 3.3 3.6
## 1949 5.0 5.8 5.6 5.4 5.7 6.4 7.0 6.3 5.9 6.1 5.7 6.0
## 1950 7.6 7.9 7.1 6.0 5.3 5.6 5.3 4.1 4.0 3.3 3.8 3.9
## 1951 4.4 4.2 3.8 3.2 2.9 3.4 3.3 2.9 3.0 2.8 3.2 2.9
## 1952 3.7 3.8 3.3 3.0 2.9 3.2 3.3 3.1 2.7 2.4 2.5 2.5
## 1953 3.4 3.2 2.9 2.8 2.5 2.7 2.7 2.4 2.6 2.5 3.2 4.2
## 1954 5.7 6.3 6.4 6.1 5.7 5.7 5.7 5.4 5.3 4.6 4.9 4.8
## 1955 5.8 5.7 5.2 4.9 4.2 4.4 4.0 3.8 3.5 3.4 3.8 3.9
## 1956 4.7 4.8 4.7 4.1 4.2 4.7 4.4 3.7 3.4 3.1 3.9 4.0
## 1957 4.9 4.7 4.3 4.0 3.9 4.6 4.1 3.7 3.7 3.6 4.6 5.0
## 1958 6.8 7.7 7.7 7.5 7.1 7.6 7.4 6.7 6.0 5.5 5.6 6.0
## 1959 7.0 7.0 6.4 5.2 4.9 5.4 5.2 4.8 4.7 4.7 5.3 5.1
## 1960 6.1 5.7 6.1 5.2 4.8 5.8 5.5 5.2 4.7 5.0 5.6 6.4
## 1961 7.7 8.1 7.7 7.0 6.6 7.3 6.9 6.2 5.8 5.5 5.6 5.8
## 1962 6.7 6.5 6.2 5.5 5.1 5.9 5.3 5.3 4.9 4.5 5.3 5.3
## 1963 6.6 6.9 6.3 5.6 5.5 6.2 5.6 5.2 4.8 4.7 5.3 5.3
## 1964 6.4 6.2 5.9 5.3 4.8 5.9 4.9 4.8 4.5 4.4 4.5 4.7
## 1965 5.5 5.7 5.1 4.7 4.3 5.3 4.5 4.2 3.8 3.6 3.9 3.7
## 1966 4.4 4.2 4.0 3.6 3.7 4.6 3.9 3.6 3.3 3.2 3.4 3.5
## 1967 4.2 4.2 3.9 3.5 3.2 4.6 4.1 3.7 3.7 3.8 3.7 3.5
## 1968 4.0 4.2 3.8 3.2 2.9 4.5 4.0 3.5 3.3 3.2 3.3 3.1
## 1969 3.7 3.7 3.5 3.2 2.9 4.1 3.8 3.5 3.7 3.5 3.3 3.2
## 1970 4.2 4.7 4.6 4.3 4.1 5.6 5.3 5.0 5.2 5.1 5.5 5.6
## 1971 6.6 6.6 6.3 5.7 5.3 6.5 6.2 5.9 5.8 5.4 5.7 5.5
## 1972 6.5 6.4 6.1 5.5 5.1 6.2 5.9 5.5 5.4 5.1 4.9 4.8
## 1973 5.5 5.6 5.2 4.8 4.4 5.4 5.0 4.7 4.7 4.2 4.6 4.6
## 1974 5.7 5.8 5.3 4.8 4.6 5.8 5.7 5.3 5.7 5.5 6.2 6.7
## 1975 9.0 9.1 9.1 8.6 8.3 9.1 8.7 8.2 8.1 7.8 7.8 7.8
## 1976 8.8 8.7 8.1 7.4 6.8 8.0 7.8 7.6 7.4 7.2 7.4 7.4
## 1977 8.3 8.5 7.9 6.9 6.4 7.5 7.0 6.8 6.6 6.4 6.5 6.0
## 1978 7.1 6.9 6.6 5.8 5.5 6.2 6.3 5.9 5.8 5.4 5.6 5.7
## 1979 6.4 6.4 6.1 5.5 5.2 6.0 5.9 5.9 5.7 5.6 5.6 5.7
## 1980 6.9 6.8 6.6 6.7 7.1 7.8 7.9 7.6 7.2 7.1 7.1 6.9
## 1981 8.2 8.0 7.7 7.0 7.1 7.7 7.3 7.2 7.3 7.5 7.9 8.3
## 1982 9.4 9.6 9.5 9.2 9.1 9.8 9.8 9.6 9.7 9.9 10.4 10.5
## 1983 11.4 11.3 10.8 10.0 9.8 10.2 9.4 9.2 8.8 8.4 8.1 8.0
## 1984 8.8 8.4 8.1 7.6 7.2 7.4 7.5 7.3 7.1 7.0 6.9 7.0
## 1985 8.0 7.8 7.5 7.1 7.0 7.5 7.4 6.9 6.9 6.8 6.7 6.7
## 1986 7.3 7.8 7.5 7.0 7.0 7.3 7.0 6.7 6.8 6.6 6.6 6.3
## 1987 7.3 7.2 6.9 6.2 6.1 6.3 6.1 5.8 5.7 5.7 5.6 5.4
## 1988 6.3 6.2 5.9 5.3 5.4 5.5 5.5 5.4 5.2 5.0 5.2 5.0
## 1989 6.0 5.6 5.2 5.1 5.0 5.5 5.3 5.1 5.1 5.0 5.2 5.1
## 1990 6.0 5.9 5.5 5.3 5.2 5.4 5.6 5.5 5.6 5.5 5.9 6.0
## 1991 7.1 7.3 7.2 6.5 6.7 7.0 6.8 6.6 6.5 6.5 6.7 6.9
## 1992 8.1 8.2 7.8 7.2 7.3 8.0 7.7 7.4 7.3 6.9 7.1 7.1
## 1993 8.0 7.8 7.4 6.9 6.8 7.2 7.0 6.6 6.4 6.4 6.2 6.1
## 1994 7.3 7.1 6.8 6.2 5.9 6.2 6.2 5.9 5.6 5.4 5.3 5.1
## 1995 6.2 5.9 5.7 5.6 5.5 5.8 5.9 5.6 5.4 5.2 5.3 5.2
## 1996 6.3 6.0 5.8 5.4 5.4 5.5 5.6 5.1 5.0 4.9 5.0 5.0
## 1997 5.9 5.7 5.5 4.8 4.7 5.2 5.0 4.8 4.7 4.4 4.3 4.4
## 1998 5.2 5.0 5.0 4.1 4.2 4.7 4.7 4.5 4.4 4.2 4.1 4.0
## 1999 4.8 4.7 4.4 4.1 4.0 4.5 4.5 4.2 4.1 3.8 3.8 3.7
## 2000 4.5 4.4 4.3 3.7 3.8 4.1 4.2 4.1 3.8 3.6 3.7 3.7
## 2001 4.7 4.6 4.5 4.2 4.1 4.7 4.7 4.9 4.7 5.0 5.3 5.4
## 2002 6.3 6.1 6.1 5.7 5.5 6.0 5.9 5.7 5.4 5.3 5.6 5.7
## 2003 6.5 6.4 6.2 5.8 5.8 6.5 6.3 6.0 5.8 5.6 5.6 5.4
## 2004 6.3 6.0 6.0 5.4 5.3 5.8 5.7 5.4 5.1 5.1 5.2 5.1
## 2005 5.7 5.8 5.4 4.9 4.9 5.2 5.2 4.9 4.8 4.6 4.8 4.6
## 2006 5.1 5.1 4.8 4.5 4.4 4.8 5.0 4.6 4.4 4.1 4.3 4.3
## 2007 5.0 4.9 4.5 4.3 4.3 4.7 4.9 4.6 4.5 4.4 4.5 4.8
## 2008 5.4 5.2 5.2 4.8 5.2 5.7 6.0 6.1 6.0 6.1 6.5 7.1
## 2009 8.5 8.9 9.0 8.6 9.1 9.7 9.7 9.6 9.5 9.5 9.4 9.7
## 2010 10.6 10.4 10.2 9.5 9.3 9.6 9.7 9.5 9.2 9.0 9.3 9.1
## 2011 9.8 9.5 9.2 8.7 8.7 9.3 9.3 9.1 8.8 8.5 8.2 8.3
## 2012 8.8 8.7 8.4 7.7 7.9 8.4 8.6 8.2 7.6 7.5 7.4 7.6
## 2013 8.5 8.1 7.6 7.1 7.3 7.8 7.7 7.3 7.0 7.0 6.6 6.5
## 2014 7.0 7.0 6.8 5.9 6.1 6.3 6.5 6.3 5.7 5.5 5.5 5.4
## 2015 6.1 5.8 5.6 5.1 5.3 5.5 5.6 5.2 4.9 4.8 4.8 4.8
## 2016 5.3 5.2 5.1 4.7 4.5 5.1 5.1 5.0 4.8 4.7 4.4
We can see that after the second month, each month's forecast lies within the prediction intervals of the preceding months; more importantly, the standard error of the forecast grows quickly with the horizon, so the prediction intervals widen and the precision of our forecasts drops.
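The widening can be seen directly from the interval bounds computed above; a quick sketch plotting interval width against forecast horizon:
# Width of the 95% prediction interval at each forecast horizon
plot(1:10, upper - lower, type = "b", xlab = "Months ahead",
     ylab = "Width of 95% prediction interval")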
Now I will compare the predicted values of my model with the actual values.
Predicted vs Actual
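The actual figures are not part of the UnempRate series, so they must be supplied by hand; a minimal sketch of the comparison table, with NA placeholders standing in for the published rates:
# Hypothetical comparison table: replace the NA placeholders with the
# published unemployment rates for Dec 2016 through Sep 2017
actual = rep(NA_real_, 10)  # placeholder for the published figures
data.frame(Month = 1:10,
           Predicted = round(as.numeric(pred1$pred), 2),
           Actual = actual,
           Difference = round(as.numeric(pred1$pred) - actual, 2))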
My predicted values are extremely accurate: only 1 of the 10 differs from the actual value by more than 0.1, namely my second-to-last prediction. This shows the model was a success, as it accomplished the goal it was built for.
Unemployment closely follows the business cycle; hence, it is worthwhile to perform a spectral analysis and identify the three dominant frequencies in our data.
# Spectral analysis for UnempRate series
UnempRate.per = mvspec(UnempRate, log = "no")
P2<-UnempRate.per$details[order(UnempRate.per$details[,3],decreasing = TRUE),]
#Identify the first three dominant frequencies for UnempRate series
P2[1,];P2[2,];P2[3,]
## frequency period spectrum
## 0.0278 36.0000 14.4829
## frequency period spectrum
## 0.1111 9.0000 11.1720
## frequency period spectrum
## 0.0417 24.0000 9.9625
## 95% CIs for the three dominant frequencies of the UnempRate series
UnempRate.u1 = 2*P2[1,3]/qchisq(.025,2)
UnempRate.l1 = 2*P2[1,3]/qchisq(.975,2)
UnempRate.u2 = 2*P2[2,3]/qchisq(.025,2)
UnempRate.l2 = 2*P2[2,3]/qchisq(.975,2)
UnempRate.u3 = 2*P2[3,3]/qchisq(.025,2)
UnempRate.l3 = 2*P2[3,3]/qchisq(.975,2)
##Create a data frame for the CIs
Result <- data.frame(Series=c(rep("UnempRate",3)),
Dominant.Freq=c(P2[1,1],P2[2,1],P2[3,1]), Spec=c(P2[1,3],P2[2,3],P2[3,3]),
Lower=c(UnempRate.l1,UnempRate.l2,UnempRate.l3),
Upper=c(UnempRate.u1,UnempRate.u2,UnempRate.u3))
Result[1:3,3:5] = round(Result[1:3,3:5], 4)
Result
## Series Dominant.Freq Spec Lower Upper
## 1 UnempRate 0.0278 14.4829 3.9261 572.0440
## 2 UnempRate 0.1111 11.1720 3.0286 441.2704
## 3 UnempRate 0.0417 9.9625 2.7007 393.4977
From our spectral analysis we cannot establish the significance of any of the three dominant peaks: each periodogram ordinate (14.4829, 11.1720, and 9.9625, respectively) lies inside the 95% confidence intervals of the other two peaks, so none of them can be distinguished from the others.
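This pairwise check can also be done programmatically from the Result data frame built above; a small sketch:
# Logical matrix: entry [i, j] is TRUE when peak i's ordinate falls
# inside peak j's 95% confidence interval
overlap = sapply(1:3, function(j)
  Result$Spec >= Result$Lower[j] & Result$Spec <= Result$Upper[j])
dimnames(overlap) = list(paste("peak", 1:3), paste("CI of peak", 1:3))
overlap
# Off-diagonal TRUE entries are what prevent us from declaring
# any single peak significant.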
Conclusion: ARIMA models are an efficient way of forecasting short-term economic indicators such as the unemployment rate. However, we must acknowledge the weaknesses of this approach, chief among them the ceteris paribus assumption: we assume that everything else that may affect unemployment is held roughly constant, which is unrealistic, since events such as a war or a pandemic may heavily influence unemployment.
Regarding the model itself, its weaknesses are that it is hard to explain in non-technical terms, it has significant outliers in its error terms, it lacks precision for long-term forecasting, it assumes ceteris paribus as explained above, and it is fitted to the US economy only, so it is not universal. However, as our results show, even with these limitations the model was a success: it predicted future values accurately and precisely.
I believe more research should be done on exogenous shocks and how they can be predicted or modeled using economic indicators, as such shocks are one of the biggest weaknesses of most economic models; any breakthrough in this area would greatly benefit the fields of economics and econometrics.