Santa Clara County Unemployment Forecast (DRAFT)

Saqib Ali
22 July, 2017

Overview

We will look at the past Santa Clara Country Unemployment Rate and try to predict the Unemployment Rate for the next 2 years. We will look at various critereons for evaluating the accuracy for the Forecast Model.

For this presentation we will be using Prof. Rob Hyndman's Forecast R Library

Assumption: We assume that you have a basic understanding of the ARIMA and ETS Forecast Models

Required Libraries

We will use the following R Libraries for this Forecast

library("data.table")
library("forecast")
library("timeSeries")
library("ggfortify")
library("zoo")
library("tseries")

Prepare the Dataset

Read in the Dataset

unemployment <- fread("~/datascience/personal projects/Santa-Clara-County-Unemployment-Forecast/output.csv")
  • We use the fread function in the data.table Libray for this
  • The Dataset contains the Umemployment Rate by Month for each County in the US

Let's take a look at the dataset

tail(unemployment, 5)
   Year    Month State             County Rate
1: 2009 November Maine    Somerset County 10.5
2: 2009 November Maine      Oxford County 10.5
3: 2009 November Maine        Knox County  7.5
4: 2009 November Maine Piscataquis County 11.3
5: 2009 November Maine   Aroostook County  9.0
  • The Year and Month are two separate columns. We will need to combine them
  • We need to filter for Santa Clara County Data
  • Are there any months that we are missing the data for? We don't know yet.

Santa Clara County Data

Let's filter on just the Santa Clara County Data

santaclaracounty_unployment <- unemployment[unemployment$County=="Santa Clara County"]
head(santaclaracounty_unployment)
   Year    Month      State             County Rate
1: 2015 February California Santa Clara County  4.4
2: 2015  October California Santa Clara County  4.0
3: 2015    March California Santa Clara County  4.4
4: 2015   August California Santa Clara County  4.1
5: 2015      May California Santa Clara County  4.1
6: 2015  January California Santa Clara County  4.7

This is easy in R

Combine the Year and Month

In the Dataset we noticed that Year and Month are two seperate columns. Let's combine them into one field of the Type YEARMON.

We will name this column Date

santaclaracounty_unployment$Date <- as.yearmon(paste(santaclaracounty_unployment$Year,santaclaracounty_unployment$Month), "%Y %B")

We also sort the Dataset by Date, earliest first.

santaclaracounty_unployment <- santaclaracounty_unployment[order(Date)]

Handling missing data

This is a important step before we can create a Time Series object. We need to identify the Months that we are missing the data for, and set them as NAs.

First create a data.table with all the intervals (month) from the Earliest Date of the Dataset and to the Lastest Date of the Dataset

ymd.full <- data.table(Date=as.yearmon(seq(as.Date(min(santaclaracounty_unployment$Date)), as.Date(max(santaclaracounty_unployment$Date)), by="mon"))) 

Next, merge the the data.table created in the previous step with santaclaracounty_unployment data.table

santaclaracounty_unployment <- merge(ymd.full, santaclaracounty_unployment, all.x=T)

Next we will inspect the Dataset

Inspect the dataset

Months for which we had no data will be listed in the data.table with NAs.

We now have a complete dataset that can be user for creating the Time Series

tail(santaclaracounty_unployment, 20)
        Date Year     Month      State             County Rate
 1: May 2015 2015       May California Santa Clara County  4.1
 2: Jun 2015 2015      June California Santa Clara County  4.1
 3: Jul 2015 2015      July California Santa Clara County  4.4
 4: Aug 2015 2015    August California Santa Clara County  4.1
 5: Sep 2015 2015 September California Santa Clara County  3.8
 6: Oct 2015 2015   October California Santa Clara County  4.0
 7: Nov 2015 2015  November California Santa Clara County  4.0
 8: Dec 2015 2015  December California Santa Clara County  3.8
 9: Jan 2016 2016   January California Santa Clara County  3.7
10: Feb 2016 2016  February California Santa Clara County  3.7
11: Mar 2016 2016     March California Santa Clara County  3.8
12: Apr 2016 2016     April California Santa Clara County  3.6
13: May 2016   NA        NA         NA                 NA   NA
14: Jun 2016 2016      June California Santa Clara County  4.0
15: Jul 2016 2016      July California Santa Clara County  4.2
16: Aug 2016 2016    August California Santa Clara County  4.0
17: Sep 2016 2016 September California Santa Clara County  3.8
18: Oct 2016 2016   October California Santa Clara County  3.8
19: Nov 2016 2016  November California Santa Clara County  3.5
20: Dec 2016 2016  December California Santa Clara County  3.3

Let's create the Time Series

myts <- ts(santaclaracounty_unployment$Rate, start = c(1990, 1), frequency = 12)
  • We use the ts() function to create the Time Series
  • The Time Series starts on January 1990
  • Frequency of 12 i.e. Monthly data
head(myts, 108)
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1990 3.5 3.5 3.5 3.7 3.7 3.9  NA 3.9 4.2  NA  NA 4.5
1991  NA  NA  NA  NA  NA  NA  NA  NA 5.8  NA  NA  NA
1992 6.6 6.9 6.8 6.6 6.7 7.2 7.1 6.9 6.9 6.6 6.8 6.6
1993 7.2 7.2 7.0 6.9 6.8 7.2 7.3 7.0 6.9 6.8 6.5 6.1
1994 7.1 6.9 6.8 6.7 6.3 6.5 6.7 6.3 6.0 5.7 5.4 4.9
1995 5.8 5.4 5.3 5.5 5.1 5.1 5.4 4.8 4.6 4.3 4.0 3.6
1996 4.2 3.9 3.8 3.7 3.6 3.6 3.8 3.6 3.6 3.5 3.4 3.1
1997 3.6 3.4 3.3 3.1 3.0 3.2 3.3 3.1 3.1 2.8 2.6 2.4
1998 2.9 2.9 3.0 2.8 2.8 3.2 3.5 3.6 3.8 3.7 3.5 3.1

Explore the Time Series

First thing First. Plot the Time Series

myts %>% autoplot()

plot of chunk unnamed-chunk-12

The plot shows some

  • seasonality;
  • cyclicity; and
  • trend.
  • AND missing values

Let's explore more.

Check for White-noise (Randomness)

We can use the Ljung Box test to test for White-noise (Randomness) in this Time Series:

myts %>%  Box.test(type="Lj")

    Box-Ljung test

data:  .
X-squared = 278.81, df = 1, p-value < 2.2e-16
  • Low p-value i.e. <.05 indicates this Time Series is NOT a pure Random Dataset, and there is correlation present in the data
  • There are possbily Trends and Seasonality in Time Series.
  • This is Good!

Inspect the ACF Plot, and the PACF Plot

myts %>% tsdisplay()

plot of chunk unnamed-chunk-14

Auto-correlation Plot, and the Partial Auto-correlation Plot show that:

  • This Time Series is highly co-related at multiple Lags
  • There is Seasonality

Handle the NAs

We know that our Time Series is missing some data. Let's Interpolate with seasonal Kalman filter.

myts <- myts %>% na.StructTS()

Creating the Models

We can now use our Time Series to create ARIMA and ETS Models for Forcasting. Creating these models is easy Prof. Rob Hyndman's Forecast R Library

Go to slide 1

Create ARIMA and ETS Models

We can now use our Time Series to create ARIMA and ETS Models for Forcasting. Creating these models is easy with Prof. Rob Hyndman's Forecast R Library

Arima

myts %>% auto.arima()
Series: . 
ARIMA(2,1,1)(2,0,0)[12]                    

Coefficients:
         ar1     ar2      ma1    sar1    sar2
      0.6246  0.1979  -0.4484  0.5065  0.3027
s.e.  0.1063  0.0721   0.1006  0.0531  0.0536

sigma^2 estimated as 0.04454:  log likelihood=40.72
AIC=-69.45   AICc=-69.18   BIC=-46.78

ETS

myts %>% ets()
ETS(A,Ad,A) 

Call:
 ets(y = .) 

  Smoothing parameters:
    alpha = 0.867 
    beta  = 0.1314 
    gamma = 0.115 
    phi   = 0.9485 

  Initial states:
    l = 3.1119 
    b = 0.1883 
    s=-0.4287 -0.183 -0.0835 0.0558 0.0683 0.2718
           0.2058 -0.1272 -0.1001 0.0526 0.1049 0.1633

  sigma:  0.2058

     AIC     AICc      BIC 
884.5656 886.8083 952.6190 

Compare the Models

ARIMA vs. ETS

Check residuals for Randomness

ARIMA

myts %>% auto.arima() %>% checkresiduals(plot=F)

    Ljung-Box test

data:  Residuals from ARIMA(2,1,1)(2,0,0)[12]
Q* = 21.733, df = 19, p-value = 0.2977

Model df: 5.   Total lags used: 24

The high p-value i.e. >.05 in the ARIMA Model indicates suggests that the

  • Residuals White Noise i.e. they are Random; AND
  • that much of the information has been incorporated in the Model

ETS

myts %>% ets  %>% checkresiduals(plot=F)

    Ljung-Box test

data:  Residuals from ETS(A,Ad,A)
Q* = 67.382, df = 7, p-value = 4.988e-12

Model df: 17.   Total lags used: 24

The low p-value i.e. <.05 in the ETS indicates that the

  • Residuals are not purely Random; AND
  • there is information that was not incorporated into the Model

Exploring the Residuals for the ARIMA

myts %>% auto.arima() %>% checkresiduals()

plot of chunk unnamed-chunk-20


    Ljung-Box test

data:  Residuals from ARIMA(2,1,1)(2,0,0)[12]
Q* = 21.733, df = 19, p-value = 0.2977

Model df: 5.   Total lags used: 24
  • The ACF Chart for the Residuals in the ARIMA Model indicates that the residuals are uncorrelated
  • The Residuals are also normally distributed; and
  • and Random

Optimal Lamda for the ARIMA and ETS

We can experiment with various Lambdas or we can use the Box Cox test to determine the optimal Lambda for our Time Series

BoxCox.lambda(myts)
[1] 0.6832323

We will use this Lambda value in our Models

ARIMA with the optimal Lambda

ARIMA with no Lambda set

myts %>% auto.arima()
Series: . 
ARIMA(2,1,1)(2,0,0)[12]                    

Coefficients:
         ar1     ar2      ma1    sar1    sar2
      0.6246  0.1979  -0.4484  0.5065  0.3027
s.e.  0.1063  0.0721   0.1006  0.0531  0.0536

sigma^2 estimated as 0.04454:  log likelihood=40.72
AIC=-69.45   AICc=-69.18   BIC=-46.78

Arima with Lambda set to the optimal value

myts %>% auto.arima(lambda=BoxCox.lambda(myts))  
Series: . 
ARIMA(2,1,1)(2,0,0)[12]                    
Box Cox transformation: lambda= 0.6832323 

Coefficients:
         ar1     ar2      ma1    sar1    sar2
      0.6195  0.1868  -0.4427  0.5002  0.3117
s.e.  0.1141  0.0734   0.1092  0.0530  0.0536

sigma^2 estimated as 0.01531:  log likelihood=213.16
AIC=-414.31   AICc=-414.04   BIC=-391.64
  • Notice the lower AICc for the ARIMA Model with the Lamdba of .683
  • This is good!

Let's check for the RMSE

RMSE for ARIMA with no Lambda set

myts %>% auto.arima() %>% forecast(h=24) %>% accuracy()
                      ME      RMSE       MAE       MPE     MAPE      MASE
Training set -0.00160113 0.2090743 0.1578683 0.0997268 3.008419 0.1292814
                     ACF1
Training set -0.005056303

RMSE for ARIMA with Lambda set

myts %>% auto.arima(lambda=BoxCox.lambda(myts)) %>% forecast(h=24) %>% accuracy()
                      ME      RMSE       MAE        MPE     MAPE      MASE
Training set -0.00609656 0.2089553 0.1561958 0.02736311 2.949653 0.1279118
                   ACF1
Training set 0.01628047
  • Notice the lower Root Mean Squared Errors for the ARIMA model with the Lambda set
  • This is good!

Let's Forecast!

Let's predict using ARIMA

myts %>% ets(lambda=BoxCox.lambda(myts))  %>%  forecast(h=24) %>% autoplot()

plot of chunk unnamed-chunk-26

Conclusion

Based on the Forecast, we see a decreasing Trend in the Unemployment, with Seasonal peaks for the next 24 months

Thank you!

Feedback and comments are welcome.

saqibali2@yahoo.com