Saqib Ali
22 July, 2017
We will look at the past Santa Clara Country Unemployment Rate and try to predict the Unemployment Rate for the next 2 years. We will look at various critereons for evaluating the accuracy for the Forecast Model.
For this presentation we will be using Prof. Rob Hyndman's Forecast R Library
Assumption: We assume that you have a basic understanding of the ARIMA and ETS Forecast Models
We will use the following R Libraries for this Forecast
library("data.table")
library("forecast")
library("timeSeries")
library("ggfortify")
library("zoo")
library("tseries")
unemployment <- fread("~/datascience/personal projects/Santa-Clara-County-Unemployment-Forecast/output.csv")
tail(unemployment, 5)
Year Month State County Rate
1: 2009 November Maine Somerset County 10.5
2: 2009 November Maine Oxford County 10.5
3: 2009 November Maine Knox County 7.5
4: 2009 November Maine Piscataquis County 11.3
5: 2009 November Maine Aroostook County 9.0
Let's filter on just the Santa Clara County Data
santaclaracounty_unployment <- unemployment[unemployment$County=="Santa Clara County"]
head(santaclaracounty_unployment)
Year Month State County Rate
1: 2015 February California Santa Clara County 4.4
2: 2015 October California Santa Clara County 4.0
3: 2015 March California Santa Clara County 4.4
4: 2015 August California Santa Clara County 4.1
5: 2015 May California Santa Clara County 4.1
6: 2015 January California Santa Clara County 4.7
This is easy in R
In the Dataset we noticed that Year and Month are two seperate columns. Let's combine them into one field of the Type YEARMON.
We will name this column Date
santaclaracounty_unployment$Date <- as.yearmon(paste(santaclaracounty_unployment$Year,santaclaracounty_unployment$Month), "%Y %B")
We also sort the Dataset by Date, earliest first.
santaclaracounty_unployment <- santaclaracounty_unployment[order(Date)]
This is a important step before we can create a Time Series object. We need to identify the Months that we are missing the data for, and set them as NAs.
First create a data.table with all the intervals (month) from the Earliest Date of the Dataset and to the Lastest Date of the Dataset
ymd.full <- data.table(Date=as.yearmon(seq(as.Date(min(santaclaracounty_unployment$Date)), as.Date(max(santaclaracounty_unployment$Date)), by="mon")))
Next, merge the the data.table created in the previous step with santaclaracounty_unployment data.table
santaclaracounty_unployment <- merge(ymd.full, santaclaracounty_unployment, all.x=T)
Next we will inspect the Dataset
Months for which we had no data will be listed in the data.table with NAs.
We now have a complete dataset that can be user for creating the Time Series
tail(santaclaracounty_unployment, 20)
Date Year Month State County Rate
1: May 2015 2015 May California Santa Clara County 4.1
2: Jun 2015 2015 June California Santa Clara County 4.1
3: Jul 2015 2015 July California Santa Clara County 4.4
4: Aug 2015 2015 August California Santa Clara County 4.1
5: Sep 2015 2015 September California Santa Clara County 3.8
6: Oct 2015 2015 October California Santa Clara County 4.0
7: Nov 2015 2015 November California Santa Clara County 4.0
8: Dec 2015 2015 December California Santa Clara County 3.8
9: Jan 2016 2016 January California Santa Clara County 3.7
10: Feb 2016 2016 February California Santa Clara County 3.7
11: Mar 2016 2016 March California Santa Clara County 3.8
12: Apr 2016 2016 April California Santa Clara County 3.6
13: May 2016 NA NA NA NA NA
14: Jun 2016 2016 June California Santa Clara County 4.0
15: Jul 2016 2016 July California Santa Clara County 4.2
16: Aug 2016 2016 August California Santa Clara County 4.0
17: Sep 2016 2016 September California Santa Clara County 3.8
18: Oct 2016 2016 October California Santa Clara County 3.8
19: Nov 2016 2016 November California Santa Clara County 3.5
20: Dec 2016 2016 December California Santa Clara County 3.3
myts <- ts(santaclaracounty_unployment$Rate, start = c(1990, 1), frequency = 12)
head(myts, 108)
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1990 3.5 3.5 3.5 3.7 3.7 3.9 NA 3.9 4.2 NA NA 4.5
1991 NA NA NA NA NA NA NA NA 5.8 NA NA NA
1992 6.6 6.9 6.8 6.6 6.7 7.2 7.1 6.9 6.9 6.6 6.8 6.6
1993 7.2 7.2 7.0 6.9 6.8 7.2 7.3 7.0 6.9 6.8 6.5 6.1
1994 7.1 6.9 6.8 6.7 6.3 6.5 6.7 6.3 6.0 5.7 5.4 4.9
1995 5.8 5.4 5.3 5.5 5.1 5.1 5.4 4.8 4.6 4.3 4.0 3.6
1996 4.2 3.9 3.8 3.7 3.6 3.6 3.8 3.6 3.6 3.5 3.4 3.1
1997 3.6 3.4 3.3 3.1 3.0 3.2 3.3 3.1 3.1 2.8 2.6 2.4
1998 2.9 2.9 3.0 2.8 2.8 3.2 3.5 3.6 3.8 3.7 3.5 3.1
myts %>% autoplot()
The plot shows some
Let's explore more.
We can use the Ljung Box test to test for White-noise (Randomness) in this Time Series:
myts %>% Box.test(type="Lj")
Box-Ljung test
data: .
X-squared = 278.81, df = 1, p-value < 2.2e-16
myts %>% tsdisplay()
Auto-correlation Plot, and the Partial Auto-correlation Plot show that:
We know that our Time Series is missing some data. Let's Interpolate with seasonal Kalman filter.
myts <- myts %>% na.StructTS()
We can now use our Time Series to create ARIMA and ETS Models for Forcasting. Creating these models is easy Prof. Rob Hyndman's Forecast R Library
We can now use our Time Series to create ARIMA and ETS Models for Forcasting. Creating these models is easy with Prof. Rob Hyndman's Forecast R Library
myts %>% auto.arima()
Series: .
ARIMA(2,1,1)(2,0,0)[12]
Coefficients:
ar1 ar2 ma1 sar1 sar2
0.6246 0.1979 -0.4484 0.5065 0.3027
s.e. 0.1063 0.0721 0.1006 0.0531 0.0536
sigma^2 estimated as 0.04454: log likelihood=40.72
AIC=-69.45 AICc=-69.18 BIC=-46.78
myts %>% ets()
ETS(A,Ad,A)
Call:
ets(y = .)
Smoothing parameters:
alpha = 0.867
beta = 0.1314
gamma = 0.115
phi = 0.9485
Initial states:
l = 3.1119
b = 0.1883
s=-0.4287 -0.183 -0.0835 0.0558 0.0683 0.2718
0.2058 -0.1272 -0.1001 0.0526 0.1049 0.1633
sigma: 0.2058
AIC AICc BIC
884.5656 886.8083 952.6190
ARIMA vs. ETS
myts %>% auto.arima() %>% checkresiduals(plot=F)
Ljung-Box test
data: Residuals from ARIMA(2,1,1)(2,0,0)[12]
Q* = 21.733, df = 19, p-value = 0.2977
Model df: 5. Total lags used: 24
The high p-value i.e. >.05 in the ARIMA Model indicates suggests that the
myts %>% ets %>% checkresiduals(plot=F)
Ljung-Box test
data: Residuals from ETS(A,Ad,A)
Q* = 67.382, df = 7, p-value = 4.988e-12
Model df: 17. Total lags used: 24
The low p-value i.e. <.05 in the ETS indicates that the
myts %>% auto.arima() %>% checkresiduals()
Ljung-Box test
data: Residuals from ARIMA(2,1,1)(2,0,0)[12]
Q* = 21.733, df = 19, p-value = 0.2977
Model df: 5. Total lags used: 24
We can experiment with various Lambdas or we can use the Box Cox test to determine the optimal Lambda for our Time Series
BoxCox.lambda(myts)
[1] 0.6832323
We will use this Lambda value in our Models
myts %>% auto.arima()
Series: .
ARIMA(2,1,1)(2,0,0)[12]
Coefficients:
ar1 ar2 ma1 sar1 sar2
0.6246 0.1979 -0.4484 0.5065 0.3027
s.e. 0.1063 0.0721 0.1006 0.0531 0.0536
sigma^2 estimated as 0.04454: log likelihood=40.72
AIC=-69.45 AICc=-69.18 BIC=-46.78
myts %>% auto.arima(lambda=BoxCox.lambda(myts))
Series: .
ARIMA(2,1,1)(2,0,0)[12]
Box Cox transformation: lambda= 0.6832323
Coefficients:
ar1 ar2 ma1 sar1 sar2
0.6195 0.1868 -0.4427 0.5002 0.3117
s.e. 0.1141 0.0734 0.1092 0.0530 0.0536
sigma^2 estimated as 0.01531: log likelihood=213.16
AIC=-414.31 AICc=-414.04 BIC=-391.64
myts %>% auto.arima() %>% forecast(h=24) %>% accuracy()
ME RMSE MAE MPE MAPE MASE
Training set -0.00160113 0.2090743 0.1578683 0.0997268 3.008419 0.1292814
ACF1
Training set -0.005056303
myts %>% auto.arima(lambda=BoxCox.lambda(myts)) %>% forecast(h=24) %>% accuracy()
ME RMSE MAE MPE MAPE MASE
Training set -0.00609656 0.2089553 0.1561958 0.02736311 2.949653 0.1279118
ACF1
Training set 0.01628047
myts %>% ets(lambda=BoxCox.lambda(myts)) %>% forecast(h=24) %>% autoplot()
Based on the Forecast, we see a decreasing Trend in the Unemployment, with Seasonal peaks for the next 24 months
Thank you!
Feedback and comments are welcome.