Capstone Project - Milestone Report

The Problem

The idea of this project is to develop an algorithm to predict the length of unemployment in the US economy based on some economic indicators. The length of unemployment is closely related to government spending on social security, default rates on mortgages and credit cards, etc. Being able to predict the length on unemployment is of high value to make government spending more efficient and to reduce risk to financial institutions. Additionally it serves as a thermometer for the economy as a whole.

The problem we are trying to solve is to develop a simple method to predict the median weeks of unemployment based on monthly indicators that are easier to obtain than an accurate depiction of the length of unemployment.

Structure of the model

The variables are [variable codes in brackets]:

Median weeks of unemployment [uempmed];
Unemployment rate [urate];
Consumer price index [cpi];
Producer price index [ppi];
Fed interest rate [ir];
Oil price (month avg.) [OilPrice];
Month [month];
Stock Index (Dow Jones) [DowJones];
Personal savings rate [psavert];
Personal consumption spending [pce];
Manufacturing capacity utlization [cap];
Manufacturing Confidence report [mconf];
Dummy variable for recession period [USREC].

The idea is to use the median weeks of uemployment as the dependent variable and the rest as explanatory variables.

The Data

The data for this project comes from a variety of sources. Median weeks of unemployment, unemployment rate (we use unemployed/total population), personal savings and consumption spending are available in the “economics” data set in the ggplot2 package. CPI and PPI are available in the BLS website. Stock index and oil prices are available on the ipeadata website. Manufacturing capacity and confidence reports are published by the OECD. Finally the dummy variable for recession periods is computed by the St. Louis FED and available using the Quantmod package.

Basic Data Wrangling

The data sets were downloaded then read into R. Some transformations were required to get the data into the “tidy” format. The code below reads the raw data and generates the final data frame.

library(ggplot2)
library(tidyr)
library(dplyr)

data(economics)

economics <- economics %>% separate(date, into = c("Year", "month", "day"), sep = "\\-")
economics$day <- NULL

economics$month <- as.numeric(economics$month)
class(economics$month)
###

ppi <- read.csv("PPI.csv", skip = 10)

ppi <- ppi %>% gather(month, ppi, Jan:Dec )


ppi <- ppi[order(ppi$Year),]

ppi$month <-match(ppi$month, month.abb)
class(ppi$month)
ppi$month <- as.numeric(ppi$month)
ppi$Year <- as.character(ppi$Year)
###
cpi <- read.csv("CPI.csv", skip = 10)
cpi$HALF1<-NULL
cpi$HALF2 <-NULL

cpi <- cpi %>% gather(month, cpi, Jan:Dec )
cpi <- cpi[order(cpi$Year),]

cpi$month <- match(cpi$month, month.abb)
class(cpi$month)
cpi$Year <- as.character(cpi$Year)
###
ir <- read.csv("Interest Rate.csv", skip = 5)

ir <- ir %>% separate(Time.Period, into =  c("Year", "month"), sep = "\\-")

colnames(ir) <- c("Year", "month", "InterestRate")

class(ir$month)
ir$month <- as.numeric(ir$month)
###
dj <- read.csv("DowJones.csv")

dj <- dj %>% separate(Data, into = c("Year", "month"), sep = "\\.")

colnames(dj ) <- c("Year", "month", "DowJones")
class(dj$month)
dj$month[dj$month=="1"] <- "10"
dj$month <- as.numeric(dj$month)
###
oil <- read.csv("Oil Price.csv")

oil <- oil %>% separate(Data, into = c("Year", "month"), sep = "\\."  )
oil$International.Financial.Statistics..FMI.IFS....IFS12_PETROLEUM12 <- NULL

colnames(oil) <- c("Year", "month", "OilPrice")
class(oil$month)
oil$month[oil$month=="1"]<-"10"
oil$month <- as.numeric(oil$month)
####
library(quantmod)
getSymbols("USREC",src="FRED")

usrec <- data.frame(date=index(USREC), coredata(USREC))
usrec <- usrec %>% separate(date, into=c("Year", "month", "day"), sep = "\\-" )
usrec$day <- NULL

newUSREC <- read.csv("newUSREC.csv")

usrec2 <- rbind(usrec, newUSREC)

class(usrec$month)
usrec$month <- as.numeric(usrec$month)
### 
#Capacity Utilization
mei <- read.csv("MEI.csv")

cap <- subset(mei, SUBJECT == "BSCURT", select = c(SUBJECT, TIME, Value ))

cap <- cap %>% separate(TIME, into = c("Year", "month"), sep = "\\-")
cap <- rename(cap, cap = Value)
cap$SUBJECT <- NULL
class(cap$month)
cap$month <- as.numeric(cap$month)
###
#Confidence Indicator
mconf <- subset(mei, SUBJECT == "BSCI", select = c(SUBJECT, TIME, Value) )
mconf <- mconf %>% separate(TIME, into = c("Year", "month"), sep = "\\-")

mconf <- rename(mconf, mconf = Value)
mconf$SUBJECT <- NULL

mconf$month <- as.numeric(mconf$month)

###############
#merging the dfs

mydata <- inner_join(economics, ppi, by = c("Year", "month"))
mydata <- inner_join(mydata, cpi,by = c("Year", "month"))
mydata <- inner_join(mydata, ir, by = c("Year", "month"))
mydata <- inner_join(mydata,dj, by = c("Year", "month")) 
mydata <- inner_join(mydata, oil, by = c("Year", "month")) 
mydata <- inner_join(mydata, usrec2, by = c("Year", "month"))
mydata <- inner_join(mydata, cap, by = c("Year", "month")) 
mydata <- inner_join(mydata, mconf, by = c("Year", "month"))

mydata <- mydata %>% mutate(urate = unemploy/pop)

Exploratory Data Analysis

library(dplyr)
library(quantmod)
library(zoo)
library(ggplot2)
options("getSymbols.warning4.0"=FALSE)

Now we can proceed to gain some insights into the data set. We begin by checking some summary statistics of the data.

##       pce               pop            psavert          uempmed     
##  Min.   :  507.4   Min.   :198712   Min.   : 1.900   Min.   : 4.00  
##  1st Qu.: 1582.2   1st Qu.:224896   1st Qu.: 5.500   1st Qu.: 6.00  
##  Median : 3953.6   Median :253060   Median : 7.700   Median : 7.50  
##  Mean   : 4843.5   Mean   :257189   Mean   : 7.937   Mean   : 8.61  
##  3rd Qu.: 7667.3   3rd Qu.:290291   3rd Qu.:10.500   3rd Qu.: 9.10  
##  Max.   :12161.5   Max.   :320887   Max.   :17.000   Max.   :25.20  
##     unemploy          ppi             cpi          InterestRate   
##  Min.   : 2685   Min.   : 35.7   Min.   : 34.70   Min.   : 0.070  
##  1st Qu.: 6284   1st Qu.: 76.9   1st Qu.: 71.55   1st Qu.: 3.062  
##  Median : 7494   Median :121.8   Median :141.40   Median : 5.395  
##  Mean   : 7772   Mean   :115.1   Mean   :135.81   Mean   : 5.607  
##  3rd Qu.: 8691   3rd Qu.:142.9   3rd Qu.:193.07   3rd Qu.: 7.935  
##  Max.   :15352   Max.   :203.0   Max.   :241.80   Max.   :19.100  
##     DowJones          OilPrice          USREC             cap       
##  Min.   :  607.9   Min.   :  1.79   Min.   :0.0000   Min.   :66.93  
##  1st Qu.:  942.2   1st Qu.: 12.73   1st Qu.:0.0000   1st Qu.:77.90  
##  Median : 2910.3   Median : 20.18   Median :0.0000   Median :80.38  
##  Mean   : 5378.0   Mean   : 32.13   Mean   :0.1446   Mean   :80.47  
##  3rd Qu.:10165.5   3rd Qu.: 35.62   3rd Qu.:0.0000   3rd Qu.:83.57  
##  Max.   :18133.0   Max.   :132.55   Max.   :1.0000   Max.   :88.85  
##      mconf             urate        
##  Min.   :-41.200   Min.   :0.01332  
##  1st Qu.: -0.950   1st Qu.:0.02404  
##  Median :  6.800   Median :0.02855  
##  Mean   :  5.125   Mean   :0.02989  
##  3rd Qu.: 12.200   3rd Qu.:0.03506  
##  Max.   : 44.200   Max.   :0.05169

Next we construct a basic correlation matrix.

##                      pce         pop      psavert     uempmed     unemploy
## pce           1.00000000  0.98727782 -0.837069022  0.72734920  0.613999689
## pop           0.98727782  1.00000000 -0.875464194  0.69594436  0.634029891
## psavert      -0.83706902 -0.87546419  1.000000000 -0.38741589 -0.354007252
## uempmed       0.72734920  0.69594436 -0.387415889  1.00000000  0.869406283
## unemploy      0.61399969  0.63402989 -0.354007252  0.86940628  1.000000000
## ppi           0.95938697  0.97969349 -0.842231572  0.71358003  0.706481232
## cpi           0.96995149  0.99397670 -0.891002043  0.66110053  0.628061280
## InterestRate -0.70716734 -0.67138554  0.542937908 -0.62034075 -0.422487025
## DowJones      0.96281343  0.93917546 -0.820448264  0.62695303  0.444135879
## OilPrice      0.83202811  0.78137630 -0.540009082  0.79811885  0.691903837
## USREC        -0.09717555 -0.10918543  0.209404807 -0.13702519  0.001427585
## cap          -0.50966422 -0.51993114  0.306091504 -0.61499498 -0.785117493
## mconf        -0.02891649 -0.05179458  0.002591511  0.07239267 -0.140306901
## urate         0.24839136  0.28223103 -0.022259937  0.66830257  0.909733572
##                      ppi         cpi InterestRate    DowJones    OilPrice
## pce           0.95938697  0.96995149   -0.7071673  0.96281343  0.83202811
## pop           0.97969349  0.99397670   -0.6713855  0.93917546  0.78137630
## psavert      -0.84223157 -0.89100204    0.5429379 -0.82044826 -0.54000908
## uempmed       0.71358003  0.66110053   -0.6203408  0.62695303  0.79811885
## unemploy      0.70648123  0.62806128   -0.4224870  0.44413588  0.69190384
## ppi           1.00000000  0.98452945   -0.5860196  0.88365255  0.81573291
## cpi           0.98452945  1.00000000   -0.6470430  0.91286187  0.74438182
## InterestRate -0.58601958 -0.64704298    1.0000000 -0.69369321 -0.47426930
## DowJones      0.88365255  0.91286187   -0.6936932  1.00000000  0.76739530
## OilPrice      0.81573291  0.74438182   -0.4742693  0.76739530  1.00000000
## USREC        -0.09648551 -0.11735902    0.2253167 -0.12351670  0.02326217
## cap          -0.53733908 -0.50629700    0.3949031 -0.40974801 -0.49228276
## mconf        -0.09850582 -0.07532791   -0.1671471  0.02203815 -0.04720614
## urate         0.39082727  0.28342342   -0.1294801  0.06695485  0.43696235
##                     USREC        cap        mconf       urate
## pce          -0.097175550 -0.5096642 -0.028916490  0.24839136
## pop          -0.109185434 -0.5199311 -0.051794584  0.28223103
## psavert       0.209404807  0.3060915  0.002591511 -0.02225994
## uempmed      -0.137025193 -0.6149950  0.072392673  0.66830257
## unemploy      0.001427585 -0.7851175 -0.140306901  0.90973357
## ppi          -0.096485509 -0.5373391 -0.098505823  0.39082727
## cpi          -0.117359018 -0.5062970 -0.075327912  0.28342342
## InterestRate  0.225316706  0.3949031 -0.167147129 -0.12948014
## DowJones     -0.123516700 -0.4097480  0.022038152  0.06695485
## OilPrice      0.023262169 -0.4922828 -0.047206137  0.43696235
## USREC         1.000000000 -0.1957393 -0.571315647  0.06275184
## cap          -0.195739287  1.0000000  0.366181908 -0.73025790
## mconf        -0.571315647  0.3661819  1.000000000 -0.18210397
## urate         0.062751837 -0.7302579 -0.182103971  1.00000000

To check the evolution of the median weeks of unemployment we can make a simple plot. As one can see from the correlation table, the median weeks of unemployment and the recession dummy are negatively correlated, we can add the recession periods to the plot to gain some insight into this relation.

##     As of 0.4-0, 'getSymbols' uses env=parent.frame() and
##  auto.assign=TRUE by default.
## 
##  This  behavior  will be  phased out in 0.5-0  when the call  will
##  default to use auto.assign=FALSE. getOption("getSymbols.env") and 
##  getOptions("getSymbols.auto.assign") are now checked for alternate defaults
## 
##  This message is shown once per session and may be disabled by setting 
##  options("getSymbols.warning4.0"=FALSE). See ?getSymbols for more details.

## [1] "USREC"

As we could expect, median weeks of unemployment remain high after the recession is over due to the fact that workers are laied off during the recession.

It’s distribution is also of interest.

Despite the correlation not being extremely high, it is to expect that there is a close relation between the length and rate of unemployment. This relation is depicted in the following figure. Average values are added in blue.

Next we can check how a linear regression performs in this data set.

## 
## Call:
## lm(formula = uempmed ~ cap + cpi + DowJones + month + pce + psavert + 
##     ppi + InterestRate + OilPrice + USREC + mconf, data = mydata2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.0037 -0.9832 -0.1076  0.8649  8.2896 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.683e+01  2.176e+00   7.735 4.83e-14 ***
## cap          -2.304e-01  2.268e-02 -10.159  < 2e-16 ***
## cpi          -8.831e-03  1.762e-02  -0.501  0.61637    
## DowJones     -3.792e-04  6.491e-05  -5.842 8.74e-09 ***
## month         8.364e-04  1.915e-02   0.044  0.96518    
## pce           7.865e-04  2.236e-04   3.518  0.00047 ***
## psavert       6.367e-01  5.843e-02  10.898  < 2e-16 ***
## ppi           3.906e-02  1.877e-02   2.080  0.03795 *  
## InterestRate -2.455e-01  3.335e-02  -7.361 6.51e-13 ***
## OilPrice      4.992e-02  9.445e-03   5.286 1.79e-07 ***
## USREC        -1.566e+00  2.537e-01  -6.174 1.27e-09 ***
## mconf         3.853e-02  6.951e-03   5.543 4.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.582 on 562 degrees of freedom
## Multiple R-squared:  0.8545, Adjusted R-squared:  0.8517 
## F-statistic:   300 on 11 and 562 DF,  p-value: < 2.2e-16

Note that some of the regressors are highly correlated, hence the coefficients are biased. But we are interested in prediction, therefore we can just check the value of R-squared to determine the quality of fit.

Limitations

An obvious limitation of this approach is that we only have the median weeks of unemployment for the economy as a whole. There’s no information on the other statistics pertaining the distribution of weeks of unemployment. Additional dispersion information could add to the analysis, not on the prediction problem but on understanding the phenomenon as a whole.

Further development

The next step in the project is to develop a few models and compare them. The data set will be split into training and testing so that we can compare out of sample errors of different models.

Additionally we may transform some variables and even create additional dummy variables to increase the predictive ability of the models.