The idea of this project is to develop an algorithm to predict the length of unemployment in the US economy based on some economic indicators. The length of unemployment is closely related to government spending on social security, default rates on mortgages and credit cards, etc. Being able to predict the length on unemployment is of high value to make government spending more efficient and to reduce risk to financial institutions. Additionally it serves as a thermometer for the economy as a whole.
The problem we are trying to solve is to develop a simple method to predict the median weeks of unemployment based on monthly indicators that are easier to obtain than an accurate depiction of the length of unemployment.
The variables are [variable codes in brackets]:
The idea is to use the median weeks of uemployment as the dependent variable and the rest as explanatory variables.
The data for this project comes from a variety of sources. Median weeks of unemployment, unemployment rate (we use unemployed/total population), personal savings and consumption spending are available in the “economics” data set in the ggplot2 package. CPI and PPI are available in the BLS website. Stock index and oil prices are available on the ipeadata website. Manufacturing capacity and confidence reports are published by the OECD. Finally the dummy variable for recession periods is computed by the St. Louis FED and available using the Quantmod package.
The data sets were downloaded then read into R. Some transformations were required to get the data into the “tidy” format. The code below reads the raw data and generates the final data frame.
library(ggplot2)
library(tidyr)
library(dplyr)
data(economics)
economics <- economics %>% separate(date, into = c("Year", "month", "day"), sep = "\\-")
economics$day <- NULL
economics$month <- as.numeric(economics$month)
class(economics$month)
###
ppi <- read.csv("PPI.csv", skip = 10)
ppi <- ppi %>% gather(month, ppi, Jan:Dec )
ppi <- ppi[order(ppi$Year),]
ppi$month <-match(ppi$month, month.abb)
class(ppi$month)
ppi$month <- as.numeric(ppi$month)
ppi$Year <- as.character(ppi$Year)
###
cpi <- read.csv("CPI.csv", skip = 10)
cpi$HALF1<-NULL
cpi$HALF2 <-NULL
cpi <- cpi %>% gather(month, cpi, Jan:Dec )
cpi <- cpi[order(cpi$Year),]
cpi$month <- match(cpi$month, month.abb)
class(cpi$month)
cpi$Year <- as.character(cpi$Year)
###
ir <- read.csv("Interest Rate.csv", skip = 5)
ir <- ir %>% separate(Time.Period, into = c("Year", "month"), sep = "\\-")
colnames(ir) <- c("Year", "month", "InterestRate")
class(ir$month)
ir$month <- as.numeric(ir$month)
###
dj <- read.csv("DowJones.csv")
dj <- dj %>% separate(Data, into = c("Year", "month"), sep = "\\.")
colnames(dj ) <- c("Year", "month", "DowJones")
class(dj$month)
dj$month[dj$month=="1"] <- "10"
dj$month <- as.numeric(dj$month)
###
oil <- read.csv("Oil Price.csv")
oil <- oil %>% separate(Data, into = c("Year", "month"), sep = "\\." )
oil$International.Financial.Statistics..FMI.IFS....IFS12_PETROLEUM12 <- NULL
colnames(oil) <- c("Year", "month", "OilPrice")
class(oil$month)
oil$month[oil$month=="1"]<-"10"
oil$month <- as.numeric(oil$month)
####
library(quantmod)
getSymbols("USREC",src="FRED")
usrec <- data.frame(date=index(USREC), coredata(USREC))
usrec <- usrec %>% separate(date, into=c("Year", "month", "day"), sep = "\\-" )
usrec$day <- NULL
newUSREC <- read.csv("newUSREC.csv")
usrec2 <- rbind(usrec, newUSREC)
class(usrec$month)
usrec$month <- as.numeric(usrec$month)
###
#Capacity Utilization
mei <- read.csv("MEI.csv")
cap <- subset(mei, SUBJECT == "BSCURT", select = c(SUBJECT, TIME, Value ))
cap <- cap %>% separate(TIME, into = c("Year", "month"), sep = "\\-")
cap <- rename(cap, cap = Value)
cap$SUBJECT <- NULL
class(cap$month)
cap$month <- as.numeric(cap$month)
###
#Confidence Indicator
mconf <- subset(mei, SUBJECT == "BSCI", select = c(SUBJECT, TIME, Value) )
mconf <- mconf %>% separate(TIME, into = c("Year", "month"), sep = "\\-")
mconf <- rename(mconf, mconf = Value)
mconf$SUBJECT <- NULL
mconf$month <- as.numeric(mconf$month)
###############
#merging the dfs
mydata <- inner_join(economics, ppi, by = c("Year", "month"))
mydata <- inner_join(mydata, cpi,by = c("Year", "month"))
mydata <- inner_join(mydata, ir, by = c("Year", "month"))
mydata <- inner_join(mydata,dj, by = c("Year", "month"))
mydata <- inner_join(mydata, oil, by = c("Year", "month"))
mydata <- inner_join(mydata, usrec2, by = c("Year", "month"))
mydata <- inner_join(mydata, cap, by = c("Year", "month"))
mydata <- inner_join(mydata, mconf, by = c("Year", "month"))
mydata <- mydata %>% mutate(urate = unemploy/pop)
library(dplyr)
library(quantmod)
library(zoo)
library(ggplot2)
options("getSymbols.warning4.0"=FALSE)
Now we can proceed to gain some insights into the data set. We begin by checking some summary statistics of the data.
## pce pop psavert uempmed
## Min. : 507.4 Min. :198712 Min. : 1.900 Min. : 4.00
## 1st Qu.: 1582.2 1st Qu.:224896 1st Qu.: 5.500 1st Qu.: 6.00
## Median : 3953.6 Median :253060 Median : 7.700 Median : 7.50
## Mean : 4843.5 Mean :257189 Mean : 7.937 Mean : 8.61
## 3rd Qu.: 7667.3 3rd Qu.:290291 3rd Qu.:10.500 3rd Qu.: 9.10
## Max. :12161.5 Max. :320887 Max. :17.000 Max. :25.20
## unemploy ppi cpi InterestRate
## Min. : 2685 Min. : 35.7 Min. : 34.70 Min. : 0.070
## 1st Qu.: 6284 1st Qu.: 76.9 1st Qu.: 71.55 1st Qu.: 3.062
## Median : 7494 Median :121.8 Median :141.40 Median : 5.395
## Mean : 7772 Mean :115.1 Mean :135.81 Mean : 5.607
## 3rd Qu.: 8691 3rd Qu.:142.9 3rd Qu.:193.07 3rd Qu.: 7.935
## Max. :15352 Max. :203.0 Max. :241.80 Max. :19.100
## DowJones OilPrice USREC cap
## Min. : 607.9 Min. : 1.79 Min. :0.0000 Min. :66.93
## 1st Qu.: 942.2 1st Qu.: 12.73 1st Qu.:0.0000 1st Qu.:77.90
## Median : 2910.3 Median : 20.18 Median :0.0000 Median :80.38
## Mean : 5378.0 Mean : 32.13 Mean :0.1446 Mean :80.47
## 3rd Qu.:10165.5 3rd Qu.: 35.62 3rd Qu.:0.0000 3rd Qu.:83.57
## Max. :18133.0 Max. :132.55 Max. :1.0000 Max. :88.85
## mconf urate
## Min. :-41.200 Min. :0.01332
## 1st Qu.: -0.950 1st Qu.:0.02404
## Median : 6.800 Median :0.02855
## Mean : 5.125 Mean :0.02989
## 3rd Qu.: 12.200 3rd Qu.:0.03506
## Max. : 44.200 Max. :0.05169
Next we construct a basic correlation matrix.
## pce pop psavert uempmed unemploy
## pce 1.00000000 0.98727782 -0.837069022 0.72734920 0.613999689
## pop 0.98727782 1.00000000 -0.875464194 0.69594436 0.634029891
## psavert -0.83706902 -0.87546419 1.000000000 -0.38741589 -0.354007252
## uempmed 0.72734920 0.69594436 -0.387415889 1.00000000 0.869406283
## unemploy 0.61399969 0.63402989 -0.354007252 0.86940628 1.000000000
## ppi 0.95938697 0.97969349 -0.842231572 0.71358003 0.706481232
## cpi 0.96995149 0.99397670 -0.891002043 0.66110053 0.628061280
## InterestRate -0.70716734 -0.67138554 0.542937908 -0.62034075 -0.422487025
## DowJones 0.96281343 0.93917546 -0.820448264 0.62695303 0.444135879
## OilPrice 0.83202811 0.78137630 -0.540009082 0.79811885 0.691903837
## USREC -0.09717555 -0.10918543 0.209404807 -0.13702519 0.001427585
## cap -0.50966422 -0.51993114 0.306091504 -0.61499498 -0.785117493
## mconf -0.02891649 -0.05179458 0.002591511 0.07239267 -0.140306901
## urate 0.24839136 0.28223103 -0.022259937 0.66830257 0.909733572
## ppi cpi InterestRate DowJones OilPrice
## pce 0.95938697 0.96995149 -0.7071673 0.96281343 0.83202811
## pop 0.97969349 0.99397670 -0.6713855 0.93917546 0.78137630
## psavert -0.84223157 -0.89100204 0.5429379 -0.82044826 -0.54000908
## uempmed 0.71358003 0.66110053 -0.6203408 0.62695303 0.79811885
## unemploy 0.70648123 0.62806128 -0.4224870 0.44413588 0.69190384
## ppi 1.00000000 0.98452945 -0.5860196 0.88365255 0.81573291
## cpi 0.98452945 1.00000000 -0.6470430 0.91286187 0.74438182
## InterestRate -0.58601958 -0.64704298 1.0000000 -0.69369321 -0.47426930
## DowJones 0.88365255 0.91286187 -0.6936932 1.00000000 0.76739530
## OilPrice 0.81573291 0.74438182 -0.4742693 0.76739530 1.00000000
## USREC -0.09648551 -0.11735902 0.2253167 -0.12351670 0.02326217
## cap -0.53733908 -0.50629700 0.3949031 -0.40974801 -0.49228276
## mconf -0.09850582 -0.07532791 -0.1671471 0.02203815 -0.04720614
## urate 0.39082727 0.28342342 -0.1294801 0.06695485 0.43696235
## USREC cap mconf urate
## pce -0.097175550 -0.5096642 -0.028916490 0.24839136
## pop -0.109185434 -0.5199311 -0.051794584 0.28223103
## psavert 0.209404807 0.3060915 0.002591511 -0.02225994
## uempmed -0.137025193 -0.6149950 0.072392673 0.66830257
## unemploy 0.001427585 -0.7851175 -0.140306901 0.90973357
## ppi -0.096485509 -0.5373391 -0.098505823 0.39082727
## cpi -0.117359018 -0.5062970 -0.075327912 0.28342342
## InterestRate 0.225316706 0.3949031 -0.167147129 -0.12948014
## DowJones -0.123516700 -0.4097480 0.022038152 0.06695485
## OilPrice 0.023262169 -0.4922828 -0.047206137 0.43696235
## USREC 1.000000000 -0.1957393 -0.571315647 0.06275184
## cap -0.195739287 1.0000000 0.366181908 -0.73025790
## mconf -0.571315647 0.3661819 1.000000000 -0.18210397
## urate 0.062751837 -0.7302579 -0.182103971 1.00000000
To check the evolution of the median weeks of unemployment we can make a simple plot. As one can see from the correlation table, the median weeks of unemployment and the recession dummy are negatively correlated, we can add the recession periods to the plot to gain some insight into this relation.
## As of 0.4-0, 'getSymbols' uses env=parent.frame() and
## auto.assign=TRUE by default.
##
## This behavior will be phased out in 0.5-0 when the call will
## default to use auto.assign=FALSE. getOption("getSymbols.env") and
## getOptions("getSymbols.auto.assign") are now checked for alternate defaults
##
## This message is shown once per session and may be disabled by setting
## options("getSymbols.warning4.0"=FALSE). See ?getSymbols for more details.
## [1] "USREC"
As we could expect, median weeks of unemployment remain high after the recession is over due to the fact that workers are laied off during the recession.
It’s distribution is also of interest.
Despite the correlation not being extremely high, it is to expect that there is a close relation between the length and rate of unemployment. This relation is depicted in the following figure. Average values are added in blue.
Next we can check how a linear regression performs in this data set.
##
## Call:
## lm(formula = uempmed ~ cap + cpi + DowJones + month + pce + psavert +
## ppi + InterestRate + OilPrice + USREC + mconf, data = mydata2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0037 -0.9832 -0.1076 0.8649 8.2896
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.683e+01 2.176e+00 7.735 4.83e-14 ***
## cap -2.304e-01 2.268e-02 -10.159 < 2e-16 ***
## cpi -8.831e-03 1.762e-02 -0.501 0.61637
## DowJones -3.792e-04 6.491e-05 -5.842 8.74e-09 ***
## month 8.364e-04 1.915e-02 0.044 0.96518
## pce 7.865e-04 2.236e-04 3.518 0.00047 ***
## psavert 6.367e-01 5.843e-02 10.898 < 2e-16 ***
## ppi 3.906e-02 1.877e-02 2.080 0.03795 *
## InterestRate -2.455e-01 3.335e-02 -7.361 6.51e-13 ***
## OilPrice 4.992e-02 9.445e-03 5.286 1.79e-07 ***
## USREC -1.566e+00 2.537e-01 -6.174 1.27e-09 ***
## mconf 3.853e-02 6.951e-03 5.543 4.59e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.582 on 562 degrees of freedom
## Multiple R-squared: 0.8545, Adjusted R-squared: 0.8517
## F-statistic: 300 on 11 and 562 DF, p-value: < 2.2e-16
Note that some of the regressors are highly correlated, hence the coefficients are biased. But we are interested in prediction, therefore we can just check the value of R-squared to determine the quality of fit.
An obvious limitation of this approach is that we only have the median weeks of unemployment for the economy as a whole. There’s no information on the other statistics pertaining the distribution of weeks of unemployment. Additional dispersion information could add to the analysis, not on the prediction problem but on understanding the phenomenon as a whole.
The next step in the project is to develop a few models and compare them. The data set will be split into training and testing so that we can compare out of sample errors of different models.
Additionally we may transform some variables and even create additional dummy variables to increase the predictive ability of the models.