Intro:
The data we used is known as Total Nonfarm Payroll, which measures the number of U.S. workers in the economy excluding proprietors, private household employees, unpaid volunteers, farm employees, and the unincorporated self-employed. This series accounts for about 80% of the workers who contribute to U.S. GDP and provides a read on the current economy, because it represents the number of jobs added or lost. An increase in employment suggests businesses are hiring more employees.
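For reproducibility, here is a minimal sketch of the setup, assuming the series was exported from FRED as a CSV named "PAYNSA.csv" (the filename is an assumption) and that the packages used throughout this report are installed:
# Loading the packages used in this report
library(forecast)   # autoplot, ggAcf/ggPacf, tslm, forecast
library(ggplot2)    # plotting
library(gridExtra)  # grid.arrange
library(magrittr)   # the %>% pipe
# Reading the FRED export; expects columns DATE and PAYNSA
paynsa <- read.csv("PAYNSA.csv", stringsAsFactors = FALSE)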
II.
II.1.a
# Generating new columns as time series
paynsa$employees <- paynsa$PAYNSA %>% ts(start = 1939, frequency = 12)
paynsa$date <- paynsa$DATE %>% ts(start = 1939, frequency = 12)
# Plotting the time series
autoplot(paynsa$employees) +
ggtitle("Total Number of Nonfarm Employees") +
labs(x = "Thousands of Persons",
y = "Date")II.1.b
The data appear non-stationary: the plot does not fluctuate around a constant mean, suggesting the random variables do not all share the same mean and that the linear association between observations depends on dynamics other than how many periods k apart they are.
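As a quick formal check of this visual impression, one could run an augmented Dickey-Fuller test; this is a hedged sketch assuming the tseries package, which is not used elsewhere in this report:
# Augmented Dickey-Fuller test; H0: the series has a unit root
library(tseries)
adf.test(paynsa$employees)
A large p-value would be consistent with the non-stationarity seen in the plot.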
II.1.c
# Plotting ACF plot
c_acf <- paynsa$employees %>% ggAcf() +
ggtitle("Paynsa ACF")
# Plotting PACF plot
c_pacf <- paynsa$employees %>% ggPacf() +
ggtitle("Paynsa PACF")
# Arranging the plots to be laid out horizontally
grid.arrange(c_acf, c_pacf, ncol = 2)
The ACF and PACF plots point toward an AR model, with one significant spike in the PACF and a very slow decay to zero in the ACF. That slow decay implies the series has strong persistence.
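Given the single significant PACF spike, a minimal sketch of fitting an AR(1) with the forecast package follows; the order is an assumption read off the plots, not a tuned choice:
# Fitting an AR(1) to the employment series
ar1_fit <- Arima(paynsa$employees, order = c(1, 0, 0))
summary(ar1_fit)  # an AR coefficient near 1 would confirm the strong persistence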
II.1.d
# Creating a linear model
linear_model <- lm(employees ~ date, data = paynsa)
# Creating fitted values using the linear model
lm_fit <- predict(linear_model)
# Creating a third order polynomial model
nonlinear_model <- lm(employees ~ date + I(date^2) + I(date^3), data = paynsa)
# Creating fitted values using the nonlinear model
nlm_fit <- predict(nonlinear_model)
# Plotting the two models on the observed data
autoplot(paynsa$employees) +
geom_line(aes(y = lm_fit, col = 'cyan'), alpha = 0.7) +
geom_line(aes(y = nlm_fit, col = 'red'), alpha = 0.7) +
scale_color_manual(name = "Model type",
values = c('cyan' = 'cyan','red' = 'red'),
labels = c('Linear','Third-Order Polynomial')) +
ggtitle("Comparison of Nonlinear and Linear") +
labs(x = "Year",
y = "Thousands of Persons")II.1.e
# Generating a residuals vs fitted plot for the linear model
ggplot(linear_model,aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.7) +
geom_hline(yintercept = 0, col = 'blue', linetype = 'dashed') +
ggtitle("Residual Plot of Linear Model") +
labs(x = "Fitted Values",
y = "Resdiuals")# Generating a residuals vs fitted plot for the nonlinear model
ggplot(nonlinear_model,aes(x = .fitted, y = .resid)) +
geom_point(alpha = 0.7) +
geom_hline(yintercept = 0, col = 'blue', linetype = 'dashed') +
ggtitle("Residual Plot of Third-Order Polynomial Model") +
labs(x = "Fitted Values",
y = "Resdiuals")Residuals for both model stay relatively close to zero with no values that signficantly stand out, but the residuals show a pattern. Variation of the residuals stay the same across and no siginficant correlation.
II.1.f
# Generating a histogram of residuals for the linear model
ggplot(linear_model,aes(x = .resid)) +
geom_histogram(aes(y = ..density..), alpha = 0.7) +
geom_density() +
ggtitle("Histogram of Linear Model Residuals") +
labs(x = "Residuals",
y = "Frequency")# Generating a histogram of residuals for the third order polynomial model
ggplot(nonlinear_model,aes(x = .resid)) +
geom_histogram(aes(y = ..density..), alpha = 0.7) +
geom_density() +
ggtitle("Histogram of Third-Order Polynomial Model Residuals") +
labs(x = "Residuals",
y = "Frequency")Histogram of residuals for linear model show a distribution similar to a normal distribution while the histogram of residuals for polynomial model show it is symmetric but the tails are skewed left and does not match a normal distribution.
II.1.g
##
## Call:
## lm(formula = employees ~ date, data = paynsa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9020.9 -4212.3 -283.9 3701.3 11786.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.444e+04 1.652e+02 450.6 <2e-16 ***
## date 4.231e+00 1.793e-02 235.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4775 on 970 degrees of freedom
## Multiple R-squared: 0.9829, Adjusted R-squared: 0.9829
## F-statistic: 5.565e+04 on 1 and 970 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = employees ~ date + I(date^2) + I(date^3), data = paynsa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10176.5 -1907.9 304.7 2271.3 8450.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.046e+04 1.735e+02 406.05 <2e-16 ***
## date 4.709e+00 2.628e-02 179.23 <2e-16 ***
## I(date^2) 8.892e-05 2.736e-06 32.50 <2e-16 ***
## I(date^3) -6.541e-09 2.139e-10 -30.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3264 on 968 degrees of freedom
## Multiple R-squared: 0.992, Adjusted R-squared: 0.992
## F-statistic: 4.008e+04 on 3 and 968 DF, p-value: < 2.2e-16
The summary of the linear model shows a strong fit, with R-squared and adjusted R-squared of 0.9829. Small p-values for both the t-statistics and the F-statistic indicate the linear model is significant. The summary of the polynomial model shows an even better fit, with R-squared and adjusted R-squared of 0.992; again, small p-values for both the t-statistics and the F-statistic indicate the non-linear model is significant.
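Because the linear model is nested in the cubic one, the two can also be compared directly with an F-test; a short sketch using base R:
# F-test of the nested models; a small p-value favors the cubic terms
anova(linear_model, nonlinear_model)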
II.1.h
## df AIC
## linear_model 3 19230.41
## nonlinear_model 5 18492.48
## df BIC
## linear_model 3 19245.05
## nonlinear_model 5 18516.88
The AIC is 19230.41 for the linear model and 18492.48 for the non-linear model; the BIC is 19245.05 for the linear model and 18516.88 for the non-linear model. The non-linear model, with the smaller AIC and BIC, is therefore the better trend model.
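For reference, the comparison tables above come from base R's AIC() and BIC() applied to both fitted models:
AIC(linear_model, nonlinear_model)
BIC(linear_model, nonlinear_model)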
II.1.i
# Subsetting the ts variables to a new data frame
t1 <- paynsa[,c(3,4)]
# Creating a data frame that contains future time values to forecast on
t <- time(paynsa$employees)
t.new <- (t[length(t)] + seq(36)) * 12
newdata <- cbind(date=t.new) %>%
as.data.frame()
# Fitting the third order polynomial model
fit_tri <- tslm(employees ~ date + I(date^2) + I(date^3),data = t1)
# Forecasting the third order polynomial model
fcast_tri <- forecast(fit_tri, h = 36, newdata = newdata)
# Plotting the forecast with the observed data
autoplot(paynsa$employees) +
autolayer(fitted(fit_tri), series = "Cubic") +
autolayer(fcast_tri, series = "Cubic") +
ggtitle("Third Order Polynomial Prediction Interval") +
labs(x = "Date",
y = "Thousands of Persons")II.2.a
dummies <- seasonaldummy(paynsa$employees) # Creating the seasonal dummy matrix
dec_paynsa <- decompose(paynsa$employees) #Decomposing a time series to observe detrended data
seasonal_model <- lm(dec_paynsa$seasonal ~ dummies) #Constructing a model with detrended time series data
summary(seasonal_model) # Note: the intercept corresponds to the December coefficient
##
## Call:
## lm(formula = dec_paynsa$seasonal ~ dummies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.747e-10 -4.080e-13 -3.800e-14 7.300e-13 1.030e-10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.159e+02 9.539e-13 8.554e+14 <2e-16 ***
## dummiesJan -2.140e+03 1.349e-12 -1.587e+15 <2e-16 ***
## dummiesFeb -1.975e+03 1.349e-12 -1.464e+15 <2e-16 ***
## dummiesMar -1.550e+03 1.349e-12 -1.149e+15 <2e-16 ***
## dummiesApr -1.009e+03 1.349e-12 -7.483e+14 <2e-16 ***
## dummiesMay -5.551e+02 1.349e-12 -4.115e+14 <2e-16 ***
## dummiesJun -1.199e+02 1.349e-12 -8.891e+13 <2e-16 ***
## dummiesJul -9.679e+02 1.349e-12 -7.175e+14 <2e-16 ***
## dummiesAug -8.524e+02 1.349e-12 -6.319e+14 <2e-16 ***
## dummiesSep -4.240e+02 1.349e-12 -3.143e+14 <2e-16 ***
## dummiesOct -1.372e+02 1.349e-12 -1.017e+14 <2e-16 ***
## dummiesNov -6.078e+01 1.349e-12 -4.505e+13 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.585e-12 on 960 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 6.094e+29 on 11 and 960 DF, p-value: < 2.2e-16
II.2.b
plot(c(seasonal_model$coefficients[-1], seasonal_model$coefficients[1]),
type = "b", pch = 20, main = "Plot of Estimated Seasonal Factors",
xlab = "Month", ylab = "Coefficient")
# Since the intercept, the first coefficient of the model, stands for the
# remaining month, December, it is reordered to the end of the sequence.
The plot shows many companies hiring additional employees beginning in the summer, with a decline into fall before hiring rises to its maximum in December. We expect many companies to hire recent graduates and other students as schools break for summer. As the holiday season approaches, companies again hire many seasonal employees to accommodate the surge in sales, until hiring peaks in December. Offloading employees begins in January of the subsequent year, when layoffs are at their maximum.
Seasonality is nearly constant between years, but bunching and an overall decline occur around recession years, most noticeably in 2007-2011 during the global financial crisis. This points toward the data possessing a cyclical component as well.
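One way to see this year-over-year stability directly is a season plot from the forecast package; a hedged sketch, where restricting the window to start in 2000 is an arbitrary choice to keep the plot readable:
# Season plot: one line per year, with months on the x-axis
paynsa$employees %>%
window(start = 2000) %>%
ggseasonplot() +
ggtitle("Season Plot of Nonfarm Employees")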
II.2.c
full_model3 <- tslm(paynsa$employees ~ season + date + I(date^2) + I(date^3), data=paynsa) #Creating a full regression model including seasonality and trend at the same time
plot(full_model3$fit, full_model3$res, pch=20, col="grey", lwd=0.8, xlab="Fitted Values",ylab="Residuals")
abline(h = 0, lwd = 2, col = "salmon") # Generating a plot of fitted values versus residuals
The residuals still show a pattern, likely due to underlying cycles not captured by the model. Persistence is also visible: values remain above or below zero for extended periods. Applying an AR model could account for this persistence in the residuals.
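A minimal sketch of that suggested fix, fitting an AR(1) to the residuals of the full model; the order is an assumption rather than a tuned choice:
# Modeling the leftover persistence in the residuals
res_ar <- Arima(residuals(full_model3), order = c(1, 0, 0))
checkresiduals(res_ar)  # diagnostics: residual plot, ACF, and Ljung-Box test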
II.2.d
##
## Call:
## tslm(formula = paynsa$employees ~ season + date + I(date^2) +
## I(date^3), data = paynsa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8764.7 -1984.0 374.7 2229.3 7829.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.905e+04 3.802e+02 181.624 < 2e-16 ***
## season2 1.760e+02 5.017e+02 0.351 0.725733
## season3 6.218e+02 5.017e+02 1.240 0.215450
## season4 1.166e+03 5.017e+02 2.324 0.020320 *
## season5 1.634e+03 5.017e+02 3.258 0.001161 **
## season6 2.081e+03 5.017e+02 4.148 3.66e-05 ***
## season7 1.256e+03 5.017e+02 2.503 0.012472 *
## season8 1.385e+03 5.017e+02 2.760 0.005885 **
## season9 1.823e+03 5.017e+02 3.633 0.000294 ***
## season10 2.133e+03 5.017e+02 4.251 2.34e-05 ***
## season11 2.227e+03 5.017e+02 4.439 1.01e-05 ***
## season12 2.299e+03 5.017e+02 4.582 5.21e-06 ***
## date 4.711e+00 2.571e-02 183.263 < 2e-16 ***
## I(date^2) 8.919e-05 2.677e-06 33.316 < 2e-16 ***
## I(date^3) -6.567e-09 2.093e-10 -31.376 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3193 on 957 degrees of freedom
## Multiple R-squared: 0.9924, Adjusted R-squared: 0.9923
## F-statistic: 8980 on 14 and 957 DF, p-value: < 2.2e-16
R-squared and adjusted R-squared are both very high, and most variables show statistical significance. Relative to the January baseline (the intercept), the coefficients on the monthly dummies are all positive and generally grow toward December, consistent with hiring accumulating over the calendar year. The standard errors on the early months are large relative to their coefficients, however, which is why season2 and season3 fall short of statistical significance.
II.2.e
fvalue3 <- forecast(full_model3$fitted.values, h=30) #Creating a forecasting object with 30 steps ahead
autoplot(fvalue3, main = "Forecast Using Full Model") # Overall view at a glance
autoplot(fvalue3, xlim = c(2018, 2023), ylim = c(135000, 155000),
main = "Forecast Using Full Model") # Zoomed version
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## Jan 2020 144831.8 144829.0 144834.6 144827.5 144836.1
## Feb 2020 145053.6 145050.1 145057.0 145048.3 145058.9
## Mar 2020 145535.3 145531.0 145539.6 145528.7 145541.9
## Apr 2020 146125.2 146119.9 146130.6 146117.1 146133.4
## May 2020 146635.9 146629.4 146642.4 146626.0 146645.8
## Jun 2020 147128.3 147120.6 147136.1 147116.5 147140.2
## Jul 2020 146345.6 146336.5 146354.7 146331.7 146359.6
## Aug 2020 146520.6 146510.0 146531.2 146504.4 146536.8
## Sep 2020 147004.5 146992.4 147016.7 146985.9 147023.1
## Oct 2020 147356.7 147342.9 147370.5 147335.6 147377.7
## Nov 2020 147497.1 147481.6 147512.5 147473.4 147520.7
## Dec 2020 147611.0 147593.8 147628.3 147584.6 147637.4
## Jan 2021 145357.8 145338.7 145376.9 145328.6 145387.0
## Feb 2021 145579.6 145558.5 145600.6 145547.4 145611.7
## Mar 2021 146061.3 146038.3 146084.3 146026.1 146096.4
## Apr 2021 146651.2 146626.2 146676.2 146613.0 146689.5
## May 2021 147161.9 147134.8 147189.0 147120.4 147203.3
## Jun 2021 147654.3 147625.1 147683.6 147609.6 147699.1
## Jul 2021 146871.6 146840.2 146903.0 146823.5 146919.7
## Aug 2021 147046.6 147012.9 147080.3 146995.1 147098.1
## Sep 2021 147530.5 147494.5 147566.5 147475.5 147585.6
## Oct 2021 147882.7 147844.3 147921.0 147824.0 147941.3
## Nov 2021 148023.0 147982.3 148063.8 147960.7 148085.4
## Dec 2021 148137.0 148093.8 148180.2 148070.9 148203.1
## Jan 2022 145883.8 145838.0 145929.5 145813.8 145953.7
## Feb 2022 146105.5 146057.3 146153.8 146031.7 146179.4
## Mar 2022 146587.3 146536.4 146638.1 146509.5 146665.1
## Apr 2022 147177.2 147123.7 147230.7 147095.4 147259.1
## May 2022 147687.9 147631.7 147744.1 147601.9 147773.8
## Jun 2022 148180.3 148121.4 148239.2 148090.2 148270.4
Conclusion and Future Considerations
A possible refinement of our current model is using a shorter time frame rather than the full sample back to 1939. Restricting the range of observations to just the last few decades could improve the residuals. Taking the log of the data could also reduce the influence of outliers and compress the range of the series. Accounting for the cyclical nature of employment and the persistence in the residuals would be the largest improvements to the model. Finally, examining a time series of growth rates through the difference of the log of the data could reveal additional information about employment; this transformation would also yield a far more stationary process, as it mean-reverts quite frequently.
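A minimal sketch of that growth-rate transformation, using the same packages as the rest of the report:
# Approximating the monthly growth rate with log differences
growth <- diff(log(paynsa$employees))
autoplot(growth) +
ggtitle("Monthly Growth Rate of Nonfarm Employment") +
labs(x = "Date", y = "Log Difference")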
Data Source:
U.S. Bureau of Labor Statistics, All Employees, Total Nonfarm [PAYNSA], retrieved from FRED, Federal Reserve Bank of St. Louis; https://fred.stlouisfed.org/series/PAYNSA, January 27, 2020.