Unemployment is an economic phenomenon of such importance that a large body of literature has been devoted to it. The term unemployment refers to the situation in which a job-seeker has, for more than three months, actively offered their labour to the market but has been unable to find work. The unemployment rate is an economic indicator that measures the number of unemployed people as a percentage of the labour force (those aged between 16 and 65). It is normal and healthy for an economy to have some minor level of unemployment; this level is usually called the natural rate of unemployment. Studies have shown that this rate is unique to every economy and depends on various factors, but most economies have a natural unemployment rate of 4 to 5 percent.
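
As a concrete numeric illustration of the definition (with made-up numbers, not official figures), in R:

# Hypothetical illustration of the unemployment-rate formula (made-up numbers)
unemployed = 8e6                    # number of unemployed job-seekers
labour_force = 160e6                # labour force = employed + unemployed
100 * unemployed / labour_force     # unemployment rate in percent: 5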

Thus, given the importance of this indicator in our lives and my interest in economics, I have decided to forecast the US unemployment rate using historical data and compare my results to the actual figures. The data come from the R library astsa and are named UnempRate: the monthly U.S. unemployment rate, in percent unemployed, from January 1948 to November 2016. The purpose of this study is to evaluate to what extent it is possible to predict the unemployment rate of a nation using its historical unemployment figures and advanced statistical methods.

# Load library and data
library(astsa)
data(UnempRate)
# Plot the data
plot.ts(UnempRate, main="US Unemployment Rate (Jan, 1948 - Nov, 2016)",
xlab="Year", ylab="Unemployment Rate")

Clearly, the mean is not constant and the process is not stationary. Next, we plot the ACF and PACF.

plt = acf2(UnempRate,200)

We compute the sample ACF and PACF to see how quickly \(\rho(h)\) decays. As shown above, the sample ACF decays to zero extremely slowly as \(h\) increases, indicating that differencing is needed. We will try to make the data stationary by differencing:

diff1 = diff(UnempRate)
plt = acf2(diff1,200)

# Plot first difference data
plot.ts(diff1, main="Monthly Change in US Unemployment Rate (Jan, 1948 - Nov, 2016)",
xlab="Year", ylab="Monthly Change in Unemployment Rate")

The ACF and PACF of the first difference suggest a seasonal pattern, as the ACF shows peaks at every multiple of lag 12. We will make a month plot of the first differenced data to check this hypothesis.

Our plot of the first differenced data shows a mean that is now roughly constant around zero, and the variance is no longer drifting. Thus, further differencing is probably not needed, which strengthens our hypothesis of a seasonal pattern.

From economics, we know about seasonal and cyclical unemployment, so our observations are also supported by the literature.
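
As a quick supplementary check of the claim that further nonseasonal differencing is unnecessary (a common heuristic, not part of the original diagnostics): over-differencing typically inflates the sample variance, so the variance of the second difference should not be smaller than that of the first.

# Heuristic check for over-differencing: compare sample variances
var(diff1)         # variance of the first difference
var(diff(diff1))   # variance of the second difference; if larger, one difference suffices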

# Month plot of the original data
monthplot(UnempRate)

# Month plot of the first differenced data
monthplot(diff1)

The month plots confirm our hypothesis of a seasonal pattern. We will therefore proceed by also applying a seasonal (lag 12) difference to the first differenced data.

diff12 = diff(diff1, 12)
# Plot lag 12 difference data
plot.ts(diff12, main="Yearly Change of Monthly Change in US Unemployment Rate",
xlab="Year", ylab="Yealy Change of Monthly Change in Unemployment Rate")

plt = acf2(diff12,50)

Our plot of the seasonally differenced data shows a roughly constant mean around zero and no obvious drift in the variance. Thus, we may assume the data are now stationary.
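
As an additional, more formal check (an addition on my part, assuming the tseries package is installed), an augmented Dickey-Fuller test can be run on the differenced series:

# Supplementary stationarity check: augmented Dickey-Fuller test
library(tseries)
adf.test(diff12, alternative = "stationary")
# A p-value below 0.05 rejects the unit-root null, supporting stationarity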

At the seasonal lags, the ACF cuts off after lag 1 (i.e., month 12) and the PACF tails off, indicating \(P = 0\) and \(Q = 1\). The nonseasonal PACF cuts off after lag 3, so we propose an \(ARIMA(3,1,0) \times (0,1,1)_{12}\) model. For the sake of variety and comparison, we also propose an alternative, \(ARIMA(3,1,3) \times (0,1,1)_{12}\); the only difference is that the latter adds a nonseasonal MA(3) component.

model1 = sarima(UnempRate,3,1,0, 0,1,1, 12)
## initial  value -1.155876 
## iter   2 value -1.375366
## iter   3 value -1.398788
## iter   4 value -1.413527
## iter   5 value -1.421629
## iter   6 value -1.426445
## iter   7 value -1.427929
## iter   8 value -1.428084
## iter   9 value -1.428092
## iter  10 value -1.428092
## iter  10 value -1.428092
## final  value -1.428092 
## converged
## initial  value -1.435718 
## iter   2 value -1.435907
## iter   3 value -1.436180
## iter   4 value -1.436186
## iter   5 value -1.436187
## iter   5 value -1.436187
## iter   5 value -1.436187
## final  value -1.436187 
## converged

print(model1$ttable)
##      Estimate     SE  t.value p.value
## ar1    0.1148 0.0351   3.2678  0.0011
## ar2    0.2023 0.0345   5.8651  0.0000
## ar3    0.0900 0.0350   2.5709  0.0103
## sma1  -0.7674 0.0256 -30.0169  0.0000

Inspection of the standardized residuals shows no obvious patterns, although there are at least two outliers exceeding three standard deviations from the mean. The ACF of the residuals shows no significant spikes, so we conclude that there are largely no apparent departures from the model's randomness assumption. The normal Q-Q plot of the residuals likewise supports the normality assumption, aside from some outliers in the tails. Finally, most of the p-values for the Ljung-Box statistics are at or above the significance level, so we fail to reject the null hypothesis that the residuals are independent.
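
The Ljung-Box statistics shown in the diagnostic panel can also be computed directly; here is a minimal sketch using base R's Box.test, with fitdf set to the number of estimated ARMA coefficients (3 AR + 1 seasonal MA = 4):

# Ljung-Box test on the residuals of Model 1 at lag 24
Box.test(resid(model1$fit), lag = 24, type = "Ljung-Box", fitdf = 4)
# A p-value above 0.05 means we fail to reject residual independence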

From the table of coefficient estimates, standard errors and p-values, we conclude that all coefficients are significantly non-zero: the three AR terms and the seasonal MA term all have p-values below the significance level of 0.05. I should also note that this model has an AIC of -18.08 and a residual variance of 0.05582.

So our fitted model is

\[(1 - 0.1148_{(0.0351)}B - 0.2023_{(0.0345)}B^2 - 0.0900_{(0.0350)}B^3)(1-B)(1-B^{12})x_t = (1 - 0.7674_{(0.0256)}B^{12})w_t.\]

In non-technical language, the model says that a future unemployment rate can be forecast from the recent past: we start from last month's rate, add the seasonal change observed a year ago, add weighted versions of the three most recent monthly changes, and finally subtract part of last year's seasonal forecast error (the white-noise term).
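
To spell this out, write \(y_t = (1-B)(1-B^{12})x_t\) for the doubly differenced series; rearranging the fitted model gives the forecast recursion

\[x_t = x_{t-1} + x_{t-12} - x_{t-13} + 0.1148\,y_{t-1} + 0.2023\,y_{t-2} + 0.0900\,y_{t-3} + w_t - 0.7674\,w_{t-12},\]

so next month's rate is last month's rate, plus the seasonal change from a year ago, plus weighted recent changes, minus part of last year's seasonal shock.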

model2 = sarima(UnempRate,3,1,3, 0,1,1, 12)
## initial  value -1.155876 
## iter   2 value -1.279225
## iter   3 value -1.378295
## iter   4 value -1.400768
## iter   5 value -1.422581
## iter   6 value -1.426073
## iter   7 value -1.427334
## iter   8 value -1.428107
## iter   9 value -1.430380
## iter  10 value -1.430948
## iter  11 value -1.431197
## iter  12 value -1.431240
## iter  13 value -1.431315
## iter  14 value -1.431508
## iter  15 value -1.431797
## iter  16 value -1.431990
## iter  17 value -1.432045
## iter  18 value -1.432067
## iter  19 value -1.432069
## iter  20 value -1.432073
## iter  21 value -1.432125
## iter  22 value -1.432161
## iter  23 value -1.432225
## iter  24 value -1.432276
## iter  25 value -1.432352
## iter  26 value -1.432452
## iter  27 value -1.433064
## iter  28 value -1.433281
## iter  29 value -1.433964
## iter  30 value -1.435818
## iter  31 value -1.435843
## iter  32 value -1.437111
## iter  33 value -1.438140
## iter  34 value -1.440778
## iter  35 value -1.441273
## iter  36 value -1.441387
## iter  37 value -1.441470
## iter  38 value -1.441649
## iter  38 value -1.441649
## iter  39 value -1.441910
## iter  40 value -1.442398
## iter  41 value -1.444011
## iter  42 value -1.444037
## iter  42 value -1.444037
## iter  43 value -1.444197
## iter  44 value -1.444216
## iter  44 value -1.444216
## iter  45 value -1.444239
## iter  46 value -1.444241
## iter  46 value -1.444241
## iter  46 value -1.444241
## final  value -1.444241 
## converged
## initial  value -1.433671 
## iter   2 value -1.434549
## iter   3 value -1.434838
## iter   4 value -1.437369
## iter   5 value -1.438439
## iter   6 value -1.438758
## iter   7 value -1.439010
## iter   8 value -1.439100
## iter   9 value -1.439195
## iter  10 value -1.439508
## iter  11 value -1.439612
## iter  12 value -1.439646
## iter  13 value -1.439653
## iter  14 value -1.439656
## iter  15 value -1.439664
## iter  16 value -1.439715
## iter  17 value -1.439730
## iter  18 value -1.439733
## iter  19 value -1.439743
## iter  20 value -1.439746
## iter  21 value -1.439753
## iter  22 value -1.439828
## iter  23 value -1.439902
## iter  24 value -1.440015
## iter  25 value -1.440162
## iter  26 value -1.440208
## iter  27 value -1.440293
## iter  28 value -1.440504
## iter  29 value -1.440598
## iter  30 value -1.440713
## iter  31 value -1.440985
## iter  32 value -1.441137
## iter  33 value -1.441175
## iter  34 value -1.441254
## iter  35 value -1.441397
## iter  36 value -1.441432
## iter  37 value -1.441447
## iter  38 value -1.441453
## iter  39 value -1.441474
## iter  40 value -1.441486
## iter  41 value -1.441490
## iter  42 value -1.441491
## iter  43 value -1.441493
## iter  44 value -1.441498
## iter  45 value -1.441510
## iter  46 value -1.441536
## iter  47 value -1.441539
## iter  48 value -1.441540
## iter  49 value -1.441550
## iter  50 value -1.441554
## iter  51 value -1.441564
## iter  52 value -1.441564
## iter  53 value -1.441565
## iter  54 value -1.441569
## iter  55 value -1.441577
## iter  56 value -1.441595
## iter  57 value -1.441619
## iter  58 value -1.441640
## iter  59 value -1.441645
## iter  60 value -1.441646
## iter  61 value -1.441646
## iter  62 value -1.441647
## iter  62 value -1.441647
## iter  62 value -1.441647
## final  value -1.441647 
## converged

print(model2$ttable)
##      Estimate     SE  t.value p.value
## ar1   -0.4783 0.0448 -10.6693  0.0000
## ar2    0.0674 0.0576   1.1696  0.2425
## ar3    0.8191 0.0446  18.3465  0.0000
## ma1    0.6412 0.0562  11.4054  0.0000
## ma2    0.1585 0.0726   2.1838  0.0293
## ma3   -0.6440 0.0599 -10.7456  0.0000
## sma1  -0.7586 0.0256 -29.6797  0.0000

Model 2 also passes the standardized residuals check, with only one outlier exceeding three standard deviations from the mean. It likewise satisfies the residual ACF and normal Q-Q plot checks, so the randomness and normality assumptions hold. Model 2 also performs better than Model 1 on the Ljung-Box test.

From the table of coefficient estimates, standard errors and p-values, we fail to reject the hypothesis that the AR(2) coefficient is zero, since its p-value (0.2425) is greater than the significance level of 0.05. The AIC of this model is -20.97 and its residual variance is 0.05507.

By the AIC criterion, Model 2 is the better fit, and some assumptions are better satisfied; however, one of its coefficients is not significantly different from zero. We therefore decide in favour of Model 1 and will not analyze Model 2 further.
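
A minimal sketch of this comparison in code, assuming (as in recent versions of astsa) that the fitted objects expose the information criteria as $AIC and $BIC:

# Compare information criteria of the two candidate models (lower is better)
data.frame(Model = c("ARIMA(3,1,0)x(0,1,1)[12]", "ARIMA(3,1,3)x(0,1,1)[12]"),
AIC = c(model1$AIC, model2$AIC),
BIC = c(model1$BIC, model2$BIC))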

# Forecasting for the next 10 months
Unemployment_Rate =  UnempRate
pred1 = sarima.for(Unemployment_Rate, 10, 3,1,0, 0,1,1, 12, main="Forecast of US Unemployment Rate for the Next 10 Months")

# Month index for the forecasts
month = 1:10
# Upper bound of the 95% prediction interval
upper = pred1$pred + qnorm(0.975)*pred1$se
# Lower bound of the 95% prediction interval
lower = pred1$pred - qnorm(0.975)*pred1$se
# Construct a dataframe for the intervals
(data.frame("Prediction"=pred1$pred,"PI 95% Lower Bound"=lower,"PI 95% Upper Bound"=upper))
##    Prediction PI.95..Lower.Bound PI.95..Upper.Bound
## 1    4.411524           3.951547           4.871502
## 2    5.030795           4.325351           5.736239
## 3    4.882757           3.954779           5.810734
## 4    4.657417           3.520997           5.793838
## 5    4.088660           2.755775           5.421544
## 6    4.175067           2.652266           5.697868
## 7    4.574522           2.871017           6.278028
## 8    4.632169           2.755877           6.508461
## 9    4.405957           2.362275           6.449639
## 10   4.058865           1.855873           6.261858
UnempRate
##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 1948  4.0  4.7  4.5  4.0  3.4  3.9  3.9  3.6  3.4  2.9  3.3  3.6
## 1949  5.0  5.8  5.6  5.4  5.7  6.4  7.0  6.3  5.9  6.1  5.7  6.0
## 1950  7.6  7.9  7.1  6.0  5.3  5.6  5.3  4.1  4.0  3.3  3.8  3.9
## 1951  4.4  4.2  3.8  3.2  2.9  3.4  3.3  2.9  3.0  2.8  3.2  2.9
## 1952  3.7  3.8  3.3  3.0  2.9  3.2  3.3  3.1  2.7  2.4  2.5  2.5
## 1953  3.4  3.2  2.9  2.8  2.5  2.7  2.7  2.4  2.6  2.5  3.2  4.2
## 1954  5.7  6.3  6.4  6.1  5.7  5.7  5.7  5.4  5.3  4.6  4.9  4.8
## 1955  5.8  5.7  5.2  4.9  4.2  4.4  4.0  3.8  3.5  3.4  3.8  3.9
## 1956  4.7  4.8  4.7  4.1  4.2  4.7  4.4  3.7  3.4  3.1  3.9  4.0
## 1957  4.9  4.7  4.3  4.0  3.9  4.6  4.1  3.7  3.7  3.6  4.6  5.0
## 1958  6.8  7.7  7.7  7.5  7.1  7.6  7.4  6.7  6.0  5.5  5.6  6.0
## 1959  7.0  7.0  6.4  5.2  4.9  5.4  5.2  4.8  4.7  4.7  5.3  5.1
## 1960  6.1  5.7  6.1  5.2  4.8  5.8  5.5  5.2  4.7  5.0  5.6  6.4
## 1961  7.7  8.1  7.7  7.0  6.6  7.3  6.9  6.2  5.8  5.5  5.6  5.8
## 1962  6.7  6.5  6.2  5.5  5.1  5.9  5.3  5.3  4.9  4.5  5.3  5.3
## 1963  6.6  6.9  6.3  5.6  5.5  6.2  5.6  5.2  4.8  4.7  5.3  5.3
## 1964  6.4  6.2  5.9  5.3  4.8  5.9  4.9  4.8  4.5  4.4  4.5  4.7
## 1965  5.5  5.7  5.1  4.7  4.3  5.3  4.5  4.2  3.8  3.6  3.9  3.7
## 1966  4.4  4.2  4.0  3.6  3.7  4.6  3.9  3.6  3.3  3.2  3.4  3.5
## 1967  4.2  4.2  3.9  3.5  3.2  4.6  4.1  3.7  3.7  3.8  3.7  3.5
## 1968  4.0  4.2  3.8  3.2  2.9  4.5  4.0  3.5  3.3  3.2  3.3  3.1
## 1969  3.7  3.7  3.5  3.2  2.9  4.1  3.8  3.5  3.7  3.5  3.3  3.2
## 1970  4.2  4.7  4.6  4.3  4.1  5.6  5.3  5.0  5.2  5.1  5.5  5.6
## 1971  6.6  6.6  6.3  5.7  5.3  6.5  6.2  5.9  5.8  5.4  5.7  5.5
## 1972  6.5  6.4  6.1  5.5  5.1  6.2  5.9  5.5  5.4  5.1  4.9  4.8
## 1973  5.5  5.6  5.2  4.8  4.4  5.4  5.0  4.7  4.7  4.2  4.6  4.6
## 1974  5.7  5.8  5.3  4.8  4.6  5.8  5.7  5.3  5.7  5.5  6.2  6.7
## 1975  9.0  9.1  9.1  8.6  8.3  9.1  8.7  8.2  8.1  7.8  7.8  7.8
## 1976  8.8  8.7  8.1  7.4  6.8  8.0  7.8  7.6  7.4  7.2  7.4  7.4
## 1977  8.3  8.5  7.9  6.9  6.4  7.5  7.0  6.8  6.6  6.4  6.5  6.0
## 1978  7.1  6.9  6.6  5.8  5.5  6.2  6.3  5.9  5.8  5.4  5.6  5.7
## 1979  6.4  6.4  6.1  5.5  5.2  6.0  5.9  5.9  5.7  5.6  5.6  5.7
## 1980  6.9  6.8  6.6  6.7  7.1  7.8  7.9  7.6  7.2  7.1  7.1  6.9
## 1981  8.2  8.0  7.7  7.0  7.1  7.7  7.3  7.2  7.3  7.5  7.9  8.3
## 1982  9.4  9.6  9.5  9.2  9.1  9.8  9.8  9.6  9.7  9.9 10.4 10.5
## 1983 11.4 11.3 10.8 10.0  9.8 10.2  9.4  9.2  8.8  8.4  8.1  8.0
## 1984  8.8  8.4  8.1  7.6  7.2  7.4  7.5  7.3  7.1  7.0  6.9  7.0
## 1985  8.0  7.8  7.5  7.1  7.0  7.5  7.4  6.9  6.9  6.8  6.7  6.7
## 1986  7.3  7.8  7.5  7.0  7.0  7.3  7.0  6.7  6.8  6.6  6.6  6.3
## 1987  7.3  7.2  6.9  6.2  6.1  6.3  6.1  5.8  5.7  5.7  5.6  5.4
## 1988  6.3  6.2  5.9  5.3  5.4  5.5  5.5  5.4  5.2  5.0  5.2  5.0
## 1989  6.0  5.6  5.2  5.1  5.0  5.5  5.3  5.1  5.1  5.0  5.2  5.1
## 1990  6.0  5.9  5.5  5.3  5.2  5.4  5.6  5.5  5.6  5.5  5.9  6.0
## 1991  7.1  7.3  7.2  6.5  6.7  7.0  6.8  6.6  6.5  6.5  6.7  6.9
## 1992  8.1  8.2  7.8  7.2  7.3  8.0  7.7  7.4  7.3  6.9  7.1  7.1
## 1993  8.0  7.8  7.4  6.9  6.8  7.2  7.0  6.6  6.4  6.4  6.2  6.1
## 1994  7.3  7.1  6.8  6.2  5.9  6.2  6.2  5.9  5.6  5.4  5.3  5.1
## 1995  6.2  5.9  5.7  5.6  5.5  5.8  5.9  5.6  5.4  5.2  5.3  5.2
## 1996  6.3  6.0  5.8  5.4  5.4  5.5  5.6  5.1  5.0  4.9  5.0  5.0
## 1997  5.9  5.7  5.5  4.8  4.7  5.2  5.0  4.8  4.7  4.4  4.3  4.4
## 1998  5.2  5.0  5.0  4.1  4.2  4.7  4.7  4.5  4.4  4.2  4.1  4.0
## 1999  4.8  4.7  4.4  4.1  4.0  4.5  4.5  4.2  4.1  3.8  3.8  3.7
## 2000  4.5  4.4  4.3  3.7  3.8  4.1  4.2  4.1  3.8  3.6  3.7  3.7
## 2001  4.7  4.6  4.5  4.2  4.1  4.7  4.7  4.9  4.7  5.0  5.3  5.4
## 2002  6.3  6.1  6.1  5.7  5.5  6.0  5.9  5.7  5.4  5.3  5.6  5.7
## 2003  6.5  6.4  6.2  5.8  5.8  6.5  6.3  6.0  5.8  5.6  5.6  5.4
## 2004  6.3  6.0  6.0  5.4  5.3  5.8  5.7  5.4  5.1  5.1  5.2  5.1
## 2005  5.7  5.8  5.4  4.9  4.9  5.2  5.2  4.9  4.8  4.6  4.8  4.6
## 2006  5.1  5.1  4.8  4.5  4.4  4.8  5.0  4.6  4.4  4.1  4.3  4.3
## 2007  5.0  4.9  4.5  4.3  4.3  4.7  4.9  4.6  4.5  4.4  4.5  4.8
## 2008  5.4  5.2  5.2  4.8  5.2  5.7  6.0  6.1  6.0  6.1  6.5  7.1
## 2009  8.5  8.9  9.0  8.6  9.1  9.7  9.7  9.6  9.5  9.5  9.4  9.7
## 2010 10.6 10.4 10.2  9.5  9.3  9.6  9.7  9.5  9.2  9.0  9.3  9.1
## 2011  9.8  9.5  9.2  8.7  8.7  9.3  9.3  9.1  8.8  8.5  8.2  8.3
## 2012  8.8  8.7  8.4  7.7  7.9  8.4  8.6  8.2  7.6  7.5  7.4  7.6
## 2013  8.5  8.1  7.6  7.1  7.3  7.8  7.7  7.3  7.0  7.0  6.6  6.5
## 2014  7.0  7.0  6.8  5.9  6.1  6.3  6.5  6.3  5.7  5.5  5.5  5.4
## 2015  6.1  5.8  5.6  5.1  5.3  5.5  5.6  5.2  4.9  4.8  4.8  4.8
## 2016  5.3  5.2  5.1  4.7  4.5  5.1  5.1  5.0  4.8  4.7  4.4

We can see that after the second month the forecasts lie within the prediction intervals of the earlier months; more importantly, the forecast standard error grows quickly, so the prediction intervals widen rapidly and the precision of our predictions decreases. Now I will compare the predicted values of my model with the actual values.

Predicted vs Actual

My predicted values are extremely accurate: only 1 out of 10 differs from the actual value by more than 0.1, namely the second-to-last prediction. This shows the model was a success, as it accomplished the goal it was built for.
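
The comparison table itself can be built with a small helper; this is a sketch, since the realized rates for December 2016 through September 2017 are not part of the UnempRate series and must be supplied by hand (e.g., from the BLS):

# Sketch: tabulate forecasts against realized values supplied by the reader
compare_forecasts = function(pred, actual) {
data.frame(Prediction = round(as.numeric(pred$pred), 2),
Actual = actual,
Difference = round(actual - as.numeric(pred$pred), 2))
}
# Usage: compare_forecasts(pred1, actual = c(...))  # ten realized rates, Dec 2016 - Sep 2017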

Unemployment closely follows the business cycle, so it is worthwhile to perform a spectral analysis and identify the three dominant frequencies in our data.

# Spectral analysis for UnempRate series
UnempRate.per = mvspec(UnempRate, log = "no")

P2<-UnempRate.per$details[order(UnempRate.per$details[,3],decreasing = TRUE),]
#Identify the first three dominant frequencies for UnempRate series
P2[1,];P2[2,];P2[3,]
## frequency    period  spectrum 
##    0.0278   36.0000   14.4829
## frequency    period  spectrum 
##    0.1111    9.0000   11.1720
## frequency    period  spectrum 
##    0.0417   24.0000    9.9625
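
The 95% confidence intervals computed below rely on the fact that each raw periodogram ordinate \(I(\omega)\) is approximately distributed as \(f(\omega)\chi^2_2/2\), which yields the approximate interval

\[\frac{2\,I(\omega)}{\chi^2_{0.975}(2)} \le f(\omega) \le \frac{2\,I(\omega)}{\chi^2_{0.025}(2)}\]

for the spectrum \(f(\omega)\) at each dominant frequency.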
# 95% CIs for the dominant frequencies of the UnempRate series
UnempRate.u1 = 2*P2[1,3]/qchisq(.025,2)
UnempRate.l1 = 2*P2[1,3]/qchisq(.975,2)
UnempRate.u2 = 2*P2[2,3]/qchisq(.025,2)
UnempRate.l2 = 2*P2[2,3]/qchisq(.975,2)
UnempRate.u3 = 2*P2[3,3]/qchisq(.025,2)
UnempRate.l3 = 2*P2[3,3]/qchisq(.975,2)
# Create a data frame for the CIs
Result <- data.frame(Series=c(rep("UnempRate",3)),
Dominant.Freq=c(P2[1,1],P2[2,1],P2[3,1]), Spec=c(P2[1,3],P2[2,3],P2[3,3]),
Lower=c(UnempRate.l1,UnempRate.l2,UnempRate.l3),
Upper=c(UnempRate.u1,UnempRate.u2,UnempRate.u3))
Result[,3:5] = round(Result[,3:5], 4)
Result
##      Series Dominant.Freq    Spec  Lower    Upper
## 1 UnempRate        0.0278 14.4829 3.9261 572.0440
## 2 UnempRate        0.1111 11.1720 3.0286 441.2704
## 3 UnempRate        0.0417  9.9625 2.7007 393.4977

From the spectral analysis we cannot establish the significance of any of the three peaks. The first periodogram ordinate, 14.4829, lies within the confidence intervals of the second and third peaks; the second, 11.1720, lies within the confidence intervals of the first and third peaks; and the third, 9.9625, lies within the confidence intervals of the first and second peaks.

Conclusion: ARIMA models are an efficient way of forecasting short-term economic indicators such as the unemployment rate. However, we must acknowledge the weaknesses of this approach, chief among them the assumption of ceteris paribus: we assume that everything else that may affect unemployment is held roughly constant, which is unrealistic, since events such as a war or a pandemic can heavily influence unemployment.

Regarding the model itself, its weaknesses are that it is hard to explain in non-technical terms, it has significant outliers in its error terms, it loses precision for long-term forecasting, it assumes ceteris paribus as explained above, and it applies only to the US economy rather than being universal. However, as the results show, even with these limitations the model was a success: it predicted future values accurately and precisely.

I believe more research should be done on exogenous shocks and how they can be predicted or modeled using economic indicators, as they are one of the biggest weaknesses of most economic models; any breakthrough in this area would greatly benefit economics and econometrics.