library(tidyverse)
library(skimr)
library(forecast)
library(ggfortify)
econ = economics
sum(is.na(econ))
[1] 0
sum(duplicated(econ))
[1] 0
head(econ)
skim(econ) #descriptive stats
── Data Summary ────────────────────────
Values
Name econ
Number of rows 574
Number of columns 6
_______________________
Column type frequency:
Date 1
numeric 5
________________________
Group variables None
econ1 <- econ %>%
select(uempmed,date)
#plot histogram of unemployment
econ1 %>%
ggplot(aes(uempmed))+geom_histogram(col='black',bins=40)+
labs(x = "Unemployment", title = "Monthly Unemployment Rate in US between 1967 to 2015")
the distribution of histogram is heavily skewed to the right meaning there are a few observations from this variable with more extreme values. the question now is what period did these extreme observations take place
mean(econ1$uempmed)
[1] 8.608711
median(econ1$uempmed)
[1] 7.5
sd(econ1$uempmed)
[1] 4.106645
#Check Outliers via z scores
z <- econ1$uempmed - mean(econ1$uempmed)/sd(econ1$uempmed)
# outlier as > [sd * 3 + mean] , outlier as smaller than [ mean - sd * 3 ]
econ_nootl <- econ1 %>%
mutate(outlier = ifelse(uempmed > 20.9,'1','0'),
outlier2 = ifelse(uempmed < -3.7,'1','0'))
econ_nootl <- econ_nootl %>%
filter(outlier != '1') %>%
select(uempmed,outlier,outlier2)
econ_noot2 <- inner_join(econ1,econ_nootl,by='uempmed')
econ_noot2 <- econ_noot2 %>%
filter(outlier != '1')
econ_noot3 <- econ_noot2 %>%
select(-c(outlier,outlier2))
econ1
#outliers within uempmed variable removed
#plot histogram of unemployment
econ1 %>%
ggplot(aes(uempmed))+geom_histogram(col='black',bins=45)+xlim(0,12)+
labs(x = "Unemployment", title = "Monthly Unemployment Rate in US between 1967 to 2015")
#Line Graph of Unemployment
econ1 %>%
ggplot(aes(y=log(uempmed),x=date)) +geom_line()+
geom_smooth()+ylab('unemployment')
Using this graph, we can identify that in the most recent period, in the last observations from 2010, there is a tendency for an increase in unemployment, surpassing the history observed in previous decades. Another important point, especially in econometric modeling contexts, is the stationarity of the series; that is, are the mean and variance constant over time?
When these assumptions are not true in a variable, we say that the series has a unit root (non-stationary) so the shocks that the variable suffers generate a permanent effect.
It seems to have been the case for the variable in question, the duration of unemployment. We have seen that the fluctuations of the variable have changed considerably, which has strong implications related to economic theories that deal with cycles. But, departing from theory, how do we practically check whether the variable is stationary?
The forecast package has an excellent function allowing to apply tests, such as ADF, KPSS, and others, which already return the number of differences necessary for the series to be stationary:
adf.test(econ1$uempmed)
Augmented Dickey-Fuller Test
data: econ1$uempmed
Dickey-Fuller = -2.9516, Lag order = 8, p-value =
0.1755
alternative hypothesis: stationary
we can conclude that there is no stationarity as our adf test came out being greater than .05
checkresiduals(econ1$uempmed)
Ljung-Box test
data: Residuals
Q* = 5230.2, df = 10, p-value < 0.00000000000000022
Model df: 0. Total lags used: 10
pacf(econ1$uempmed)
Now, I am going to repeat the above steps using the data with the removed outliers
#Line Graph of Unemployment after outliers removed
econ_noot3 %>%
ggplot(aes(y=uempmed,x=date)) +geom_line()+
geom_smooth()
Using this graph, we can identify that in the most recent period: in the last observations from 2010, there is a tendency for an increase in unemployment, surpassing the history observed in previous decades. Another important point, especially in econometric modeling contexts, is the stationarity of the series; that is, are the mean and variance constant over time?
Removing the outliers has helped to better point out the stationarity of the unemployment variable
hypothesis test H_0: there is stationary or in other words mean and variance are consistent over time H_1: There is no stationarity or in other words the mean and variance are not constant over time
adf.test(econ_noot3$uempmed)
Warning: p-value smaller than printed p-value
Augmented Dickey-Fuller Test
data: econ_noot3$uempmed
Dickey-Fuller = -4.0358, Lag order = 16, p-value =
0.01
alternative hypothesis: stationary
After removal of outliers within uempmed variable we are able to see from our hypothesis test that there are now signs of stationarity within the time series plot as a result of the p-value being less than .05
checkresiduals(econ_noot3$uempmed)
Ljung-Box test
data: Residuals
Q* = 45632, df = 10, p-value < 0.00000000000000022
Model df: 0. Total lags used: 10
pacf(econ_noot3$uempmed)
nearly normal distribution
forecast(econ_noot3$uempmed)
NA
joblack <- econ_noot3 %>%
filter(date > '2010-01-05') %>%
select(date) %>%
arrange(date)
jobm <- ts(joblack,start = decimal_date(ymd("2010-01-05")),
frequency = 365.25/7) #frequency = weekly
jobm
Time Series:
Start = 2010.01095890411
End = 2011.46749552286
Frequency = 52.1785714285714
date
[1,] 14641
[2,] 14641
[3,] 14641
[4,] 14669
[5,] 14669
[6,] 14853
[7,] 15065
[8,] 15248
[9,] 15279
[10,] 15279
[11,] 15309
[12,] 15340
[13,] 15340
[14,] 15371
[15,] 15400
[16,] 15431
[17,] 15461
[18,] 15461
[19,] 15461
[20,] 15492
[21,] 15492
[22,] 15522
[23,] 15553
[24,] 15584
[25,] 15614
[26,] 15614
[27,] 15614
[28,] 15645
[29,] 15675
[30,] 15706
[31,] 15706
[32,] 15737
[33,] 15737
[34,] 15765
[35,] 15796
[36,] 15796
[37,] 15796
[38,] 15826
[39,] 15826
[40,] 15826
[41,] 15857
[42,] 15887
[43,] 15918
[44,] 15918
[45,] 15949
[46,] 15949
[47,] 15979
[48,] 15979
[49,] 16010
[50,] 16010
[51,] 16010
[52,] 16040
[53,] 16071
[54,] 16102
[55,] 16130
[56,] 16130
[57,] 16161
[58,] 16191
[59,] 16222
[60,] 16252
[61,] 16252
[62,] 16283
[63,] 16283
[64,] 16283
[65,] 16314
[66,] 16344
[67,] 16375
[68,] 16405
[69,] 16405
[70,] 16405
[71,] 16436
[72,] 16467
[73,] 16467
[74,] 16467
[75,] 16495
[76,] 16526
[77,] 16526
jobmarket <- (auto.arima(jobm))
forecast(jobmarket,1000)
plot(forecast(jobmarket,1000), xlab ="Time", ylab="Unemployment",
main='Unemployment Projection',
col.main='darkred')
From here, we can see that going into the future and past the current
year of 2024 the unemployment forecast clearly illustrates just how much
unemployment will continue to increase
For by the year 2025, we see unemployment reaching 32000, 4000 by 2030,
and 4100 by 2032.