knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
Unemployment Rate from 1954 to 2023
Semester: Spring 2023
Class: Data Mining (CS 5310)
Program: Master of Science in Data Analytics
School: University of Houston Downtown
This is a Data Mining project completed during the Spring 2023 semester as part of the Master of Science in Data Analytics (MSDA) program at the University of Houston Downtown (UHD).
The project predicts the unemployment rate for the coming months of 2023 using historical economic-indicator and demographic-indicator data.
A large part of the effort went into data collection, because every indicator had to be available monthly in order to predict unemployment month by month for 2023.
I chose this project because it provides an opportunity to apply the main data mining techniques and machine learning algorithms and to compare their outputs.
Through this project I worked through all the stages of the data analytics workflow:
The data pre-processing tasks employed in this project were: data merging (combining the different data sets), data cleaning (removing redundant values, handling missing values and noisy data), converting attribute data types so they work with the ML algorithm functions in R, and feature engineering using principal component analysis.
The next step is modeling: I created several models, evaluated and tuned them, and selected the best-performing one. The primary evaluation metric was the Root Mean Square Error (RMSE).
The final step is prediction: the best-performing model, chosen by RMSE, is used to predict the unemployment rate for the remaining months of 2023 (April through December).
I collected historical data from different sources: monthly economic indicators such as inflation, interest rates, the producer price index, the consumer confidence index, and GDP, as well as demographic data such as the unemployment rate of the white population and the average duration of unemployment in weeks, covering the time frame 1954-2022. Each indicator came as its own separate data set.
Unemployment rate data for each state was also collected to explore how the states have performed over time.
unempRate <- read.csv("Unemp rate 54-23.csv")
unemp_men <- read.csv("Unemp Men 54-23.csv")
unemp_women <- read.csv("Unemp women 54-23.csv")
unemp_people <- read.csv("Unemp level(people) 54-23.csv")
unemp_weeks <- read.csv("Unemp Avg weeks 54-23.csv")
unemp_over_20 <- read.csv("Unemp 20 & over 54-23.csv")
unemp_white <- read.csv("Unemp- white 54-23.csv")
inflation <- read.csv("Inflation 54-23(FRED).csv")
ppi <- read.csv("PPI 54-23(FRED).csv")
import_goods <- read.csv("Imports goods 55-23.csv")
export_goods <- read.csv("Exports goods 55 -23.csv")
nettrade <- read.csv("Net Trade goods 55-23.csv")
cli <- read.csv("CLI 55-23(USA).CSV")
labourforce <- read.csv("labour force 55-23.csv")
stockprices <- read.csv("stock prices 50-23.csv")
bci <- read.csv("Businees CI 50-23.csv")
cci <- read.csv("CCI 60-23.csv")
gdp_change <- read.csv("GDP % change 60-22.csv")
interest_rate <- read.csv("Interest rates 53-2023.csv")
Because every data set holds a single attribute and is formatted differently from the others, preprocessing took a lot of time. Only the required columns are selected from each data set, and these columns are then merged into a single data set containing the economic indicators, the demographic indicators, and the target variable (unemployment rate) from 1954 to 2022.
The final data set contains 831 monthly observations of the economic and demographic indicators together with the unemployment rate, starting in 1954.
### Data Pre-processing
library(dplyr)
#merging the datasets into a single data set
merge1 <- left_join(unempRate, unemp_men)
merge1 <- left_join(merge1,unemp_over_20)
merge1 <- left_join(merge1, unemp_people)
#naming the columns
colnames(merge1)[c(2,3,4,5)] <- c("unempRate","Unemp_men", "unemp_over_20", "unemp_people")
merge1 <- left_join(merge1,unemp_weeks)
merge1 <- left_join(merge1, unemp_white)
merge1 <- left_join(merge1, unemp_women)
colnames(merge1)[c(6,7,8)] <- c("unemp_weeks","unemp_white", "unemp_women")
merge1 <- left_join(merge1, inflation)
merge1 <- left_join(merge1, ppi)
import_goods1 <- import_goods[-c(1,2,3,4,5,8)]
colnames(import_goods1)[1] <- "DATE"
import_goods1$DATE <- paste(import_goods1$DATE,"01" ,sep = "-")
merge1 <- left_join(merge1,import_goods1)
export_goods1 <- export_goods[-c(1,2,3,4,5,8)]
colnames(export_goods1)[1] <- "DATE"
export_goods1$DATE <- paste(export_goods1$DATE,"01" ,sep = "-")
#naming the columns
colnames(merge1)[c(9,10,11)] <- c("inflation","ppi", "import_goods")
#merging remaining datasets
merge1 <- left_join(merge1, export_goods1)
nettrade1 <- nettrade[-c(1,2,3,4,5,8)]
colnames(nettrade1)[1] <- "DATE"
nettrade1$DATE <- paste(nettrade1$DATE,"01" ,sep = "-")
colnames(nettrade1)[2] <- "nettrade"
merge1 <- left_join(merge1,nettrade1)
cli1 <- cli[-c(1,2,3,4,5,8)]
colnames(cli1)[1] <- "DATE"
cli1$DATE <- paste(cli1$DATE,"01" ,sep = "-")
colnames(cli1)[2] <- "cli"
merge1 <- left_join(merge1,cli1)
labourforce1 <- labourforce[-c(1,2,3,4,5,8)]
colnames(labourforce1)[1] <- "DATE"
labourforce1$DATE <- paste(labourforce1$DATE,"01" ,sep = "-")
colnames(labourforce1)[2] <- "labourforce"
merge1 <- left_join(merge1,labourforce1)
stockprices1 <- stockprices[-c(1,2,3,4,5,8)]
colnames(stockprices1)[1] <- "DATE"
stockprices1$DATE <- paste(stockprices1$DATE,"01" ,sep = "-")
colnames(stockprices1)[2] <- "stockprices"
merge1 <- left_join(merge1,stockprices1)
bci1 <- bci[-c(1,2,3,4,5,8)]
colnames(bci1)[1] <- "DATE"
bci1$DATE <- paste(bci1$DATE,"01" ,sep = "-")
colnames(bci1)[2] <- "bci"
merge1 <- left_join(merge1,bci1)
cci1 <- cci[-c(1,2,3,4,5,8)]
colnames(cci1)[1] <- "DATE"
cci1$DATE <- paste(cci1$DATE,"01" ,sep = "-")
colnames(cci1)[2] <- "cci"
merge1 <- left_join(merge1,cci1)
merge1 <- left_join(merge1,gdp_change)
colnames(merge1)[c(12,19)] <- c("export_goods","gdp_change")
interest_rate1 <- interest_rate[-c(1,2,3,4,5,8)]
colnames(interest_rate1)[1] <- "DATE"
interest_rate1$DATE <- paste(interest_rate1$DATE,"01" ,sep = "-")
colnames(interest_rate1)[2] <- "interest_rate"
merge1 <- left_join(merge1,interest_rate1)
merge1$DATE <- as.Date(merge1$DATE,format = "%Y-%m-%d")
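The same select-rename-reformat-join pattern is repeated for every OECD-style file above. A small helper function could remove that repetition; the following is only a sketch and assumes each raw file has the month in column 6 (as "YYYY-MM") and the value in column 7, as in the files used above.
#Hypothetical helper (sketch): keep the date and value columns, rename them,
#convert "YYYY-MM" to "YYYY-MM-01", and left-join onto the running data set.
clean_and_join <- function(base, raw, value_name) {
  tmp <- raw[, c(6, 7)]                        #date and value columns (assumed layout)
  colnames(tmp) <- c("DATE", value_name)       #rename for the join
  tmp$DATE <- paste(tmp$DATE, "01", sep = "-") #"YYYY-MM" -> "YYYY-MM-01"
  left_join(base, tmp, by = "DATE")
}
#Example usage, equivalent to the cci1 block above:
#merge1 <- clean_and_join(merge1, cci, "cci")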
Missing values: the merged data contains many missing values. Imputing them with the mean or median is not appropriate here, and I did not want to delete the rows with NA values, so the missing values are imputed with the KNN algorithm using the step_impute_knn() function from the recipes package.
library(recipes)
#creating a recipe
rec <- recipe(merge1, formula = unempRate ~ .)
#specifying KNN imputation for all predictors, then estimating (prepping) the recipe
rec1 <- rec %>% step_impute_knn(all_predictors(), neighbors = 11)
rec2 <- prep(rec1, training = merge1)
#imputing the missing values.
imputed <- bake(rec2, new_data = merge1)
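A quick sanity check (a small sketch) that the imputation left no missing values behind:
#count the remaining missing values in each column after KNN imputation
colSums(is.na(imputed))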
Multiple regression analysis is done to check how much variation the independent variables explain, to detect multicollinearity, and to identify the most important variables.
method1 <- read.csv("complete dataset.csv")
str(method1)
## 'data.frame': 831 obs. of 20 variables:
## $ DATE : chr "1954-01-01" "1954-02-01" "1954-03-01" "1954-04-01" ...
## $ Unemp_20_and_over : num 4.4 4.6 5.1 5.3 5.4 5.3 5.3 5.5 5.6 5.3 ...
## $ Unemp_avg_weeks : num 8.7 9.5 10.6 10.9 11.6 12.3 12.5 12.8 12.9 13.3 ...
## $ Unemp_level_people: int 3077 3331 3607 3749 3767 3551 3659 3854 3927 3666 ...
## $ Unemp_men : num 4.4 4.9 5.3 5.6 5.7 5.3 5.6 6 6 5.7 ...
## $ Unemp_rate : num 4.9 5.2 5.7 5.9 5.9 5.6 5.8 6 6.1 5.7 ...
## $ Unemp_women : num 5.9 5.9 6.4 6.4 6.3 6.2 6.3 6.2 6.4 5.8 ...
## $ Unemp_white : num 4.5 4.9 5 5.5 5.3 5 5.3 5.6 5.9 5.1 ...
## $ PPI : num 30.5 30.4 30.3 30.5 30.5 30.3 30.5 30.5 30.4 30.3 ...
## $ Businees_CI : num 96.5 97.2 98 98.7 99.2 ...
## $ stock_prices : num 30.1 28.5 29.3 31.5 30.4 ...
## $ CCI : num 99.1 98.6 99.8 99.3 100.1 ...
## $ CLI : num 97.1 97.5 98.3 98.5 99.1 ...
## $ Export_goods : num 38.8 39.1 37.6 38.5 37.3 ...
## $ GDP : num -5.12 -4.49 -2.92 -2.19 -1.26 ...
## $ Import_goods : num 71.9 71.7 71.7 71.7 71.6 ...
## $ Inflation : num 26.9 26.9 26.9 26.8 26.9 26.9 26.9 26.9 26.8 26.8 ...
## $ Interest_rates : num 2.48 2.47 2.37 2.29 2.37 2.38 2.3 2.36 2.38 2.43 ...
## $ labour_force : num 59487 60606 60697 59264 58622 ...
## $ Net_Trade_goods : num 34.1 33.7 35.1 34.3 35.3 ...
Changing the data type of the date column so it works with the ML algorithm functions.
#Changing the data type of the DATE column from character to date
method1$DATE <- as.Date(method1$DATE, format = "%Y-%m-%d")
Regression model and summary.
c_reg_model <- lm(Unemp_rate ~., data = method1)
summary(c_reg_model)
##
## Call:
## lm(formula = Unemp_rate ~ ., data = method1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.125693 -0.027594 -0.001228 0.029541 0.105262
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.271e-01 1.837e-01 -0.692 0.48897
## DATE -1.640e-06 3.817e-06 -0.430 0.66761
## Unemp_20_and_over 2.106e-01 1.686e-02 12.487 < 2e-16 ***
## Unemp_avg_weeks 1.262e-03 6.289e-04 2.006 0.04519 *
## Unemp_level_people -2.918e-05 3.553e-06 -8.214 8.44e-16 ***
## Unemp_men 4.205e-01 1.334e-02 31.528 < 2e-16 ***
## Unemp_women 3.528e-01 9.594e-03 36.773 < 2e-16 ***
## Unemp_white 6.927e-02 1.591e-02 4.354 1.51e-05 ***
## PPI 1.066e-03 5.857e-04 1.820 0.06907 .
## Businees_CI -1.440e-03 2.040e-03 -0.706 0.48065
## stock_prices 4.579e-05 2.740e-04 0.167 0.86732
## CCI -5.991e-04 1.718e-03 -0.349 0.72743
## CLI -1.070e-03 2.532e-03 -0.423 0.67251
## Export_goods -1.422e-02 6.737e-03 -2.111 0.03506 *
## GDP 3.492e-04 2.746e-04 1.272 0.20380
## Import_goods 1.397e-02 6.738e-03 2.073 0.03850 *
## Inflation -2.117e-03 6.489e-04 -3.262 0.00115 **
## Interest_rates 2.934e-03 1.478e-03 1.985 0.04750 *
## labour_force 5.380e-06 7.997e-07 6.727 3.27e-11 ***
## Net_Trade_goods -1.401e-02 6.704e-03 -2.090 0.03692 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03884 on 811 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 7.934e+04 on 19 and 811 DF, p-value: < 2.2e-16
The summary shows that these variables explain the unemployment rate very well.
The model as a whole is significant: the F statistic has a p-value well below 0.01.
The correlation matrix shows that many of the variables are highly correlated.
library(corrplot)
corrplot(cor(sapply(method1, as.numeric)), method = "color", order = "alphabet")
library(car)
#checking the variance inflation factors for multicollinearity
vif(c_reg_model)
## DATE Unemp_20_and_over Unemp_avg_weeks Unemp_level_people
## 4.279327e+02 3.927055e+02 1.136753e+01 6.006635e+01
## Unemp_men Unemp_women Unemp_white PPI
## 3.295042e+02 1.202054e+02 3.145540e+02 7.384184e+02
## Businees_CI stock_prices CCI CLI
## 4.933750e+00 6.983628e+01 3.640785e+00 7.935599e+00
## Export_goods GDP Import_goods Inflation
## 5.907012e+04 1.315236e+00 1.455912e+05 1.545279e+03
## Interest_rates labour_force Net_Trade_goods
## 1.025997e+01 3.913009e+02 2.002035e+04
The VIF exceeds 5 for several variables, so those variables are highly correlated.
Since both the correlation map and the VIFs point to strong multicollinearity, I performed principal component analysis. Four principal components were extracted because they explain about 89% of the variance, which was expected to reduce the multicollinearity, but the results with the principal components were not convincing.
The evaluation metrics are better with the original variables than with the principal components for every model, so the PCA-based models are not included here, as they would only complicate the analysis.
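The PCA itself is omitted from the report; the following is a minimal sketch of how it can be run on this data, assuming the numeric predictors of method1 with DATE and the target Unemp_rate excluded.
#PCA sketch (not part of the reported results): scale the numeric predictors
#and inspect how much variance the leading components explain.
pca_input <- method1[, !(names(method1) %in% c("DATE", "Unemp_rate"))]
pca_fit <- prcomp(pca_input, center = TRUE, scale. = TRUE)
summary(pca_fit)                        #cumulative proportion of variance explained
pcs <- as.data.frame(pca_fit$x[, 1:4])  #first four principal components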
Because the target variable is numeric, I explored ML algorithms such as decision tree regression, model trees, random forests, and gradient boosting.
Random forests and gradient boosting are also known to handle correlated variables well, so I had high expectations for those two algorithms.
library(Metrics)
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(party)
library(glmnet)
library(xgboost)
library(neuralnet)
### Splitting the data into training and testing sets
set.seed(800)
c_rs <- sample(nrow(method1), nrow(method1)*0.25)
complete_data_train <- method1[-c_rs,]
complete_data_test <- method1[c_rs,]
complete_data_test2 <- complete_data_test[-6]
# Training the model
reg_tree <- rpart(Unemp_rate ~., data = complete_data_train, method = "anova", cp = 0.0001)
Method "anova" is used because this is a regression problem.
A complexity parameter of 0.0001 gave the best results after tuning with other values.
It seems a more complex tree works better here.
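The cp value was tuned by hand; one way to compare candidate values (a sketch, not the exact search that was used) is to prune the full tree back to each candidate and compare the hold-out RMSE:
#Sketch of a manual cp search: prune the cp = 0.0001 tree to larger cp values
#and compare the test-set RMSE for each candidate.
cp_grid <- c(0.01, 0.001, 0.0005, 0.0001)
sapply(cp_grid, function(cp_val) {
  pruned <- prune(reg_tree, cp = cp_val)
  rmse(predict(pruned, complete_data_test2), complete_data_test$Unemp_rate)
})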
#Model Evaluation
reg_tree_pred <- predict(reg_tree, complete_data_test2)
######### Evaluation #############
mae3 <- mae(reg_tree_pred, complete_data_test$Unemp_rate)
mse3 <- mse(reg_tree_pred, complete_data_test$Unemp_rate)
rmse3 <- rmse(reg_tree_pred, complete_data_test$Unemp_rate)
rsq3 <- cor(reg_tree_pred, complete_data_test$Unemp_rate)^2
print(paste("MAE-", mae3, "///MSE-", mse3, "///RMSE-", rmse3, "///RSQ-", rsq3))
## [1] "MAE- 0.110114471181898 ///MSE- 0.036890008735503 ///RMSE- 0.192067719139638 ///RSQ- 0.985370155158067"
The most important variable is the unemployment rate of the white population.
imp_var <- sort(reg_tree$variable.importance, decreasing = T)
most_imp <- names(imp_var)[1:10]
rpart.plot(reg_tree, type = 2, extra = 101, fallen.leaves = F, under = TRUE, cex = 0.6, branch.lty = 1, branch.lwd = 1,
box.col = "lightblue")
set.seed(1200)
rf_model <- randomForest(Unemp_rate~., data = complete_data_train, ntree = 100, mtry = 4, max.depth = 6)
rf_pred <- predict(rf_model, complete_data_test2)
rf_model
##
## Call:
## randomForest(formula = Unemp_rate ~ ., data = complete_data_train, ntree = 100, mtry = 4, max.depth = 6)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 0.03577444
## % Var explained: 98.73
After trying several tuning combinations, 100 trees, 4 variables tried at each split, and a maximum depth of 6 gave the best results for this model.
#model evaluation
mae4 <- mae(rf_pred, complete_data_test$Unemp_rate)
mse4 <- mse(rf_pred, complete_data_test$Unemp_rate)
rmse4 <- rmse(rf_pred, complete_data_test$Unemp_rate)
rsq4 <- cor(rf_pred, complete_data_test$Unemp_rate)^2
print(paste("MAE-", mae4, "///MSE-", mse4, "///RMSE-", rmse4, "///RSQ-", rsq4))
## [1] "MAE- 0.050504106280193 ///MSE- 0.00761183238996231 ///RMSE- 0.0872458158879973 ///RSQ- 0.997163011492964"
rf_ctrl <- trainControl(method = "cv", number = 10)
rf_grid <- expand.grid(.mtry = c(1,2,3,4,5))
set.seed(1100)
rf_cust <- train(Unemp_rate~., data = complete_data_train, method = "rf", metric= "RMSE" , trControl = rf_ctrl,
tuneGrid = rf_grid)
rf_cust
## Random Forest
##
## 624 samples
## 19 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 561, 562, 563, 561, 562, 562, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 0.2219627 0.9819266 0.11777307
## 2 0.1549901 0.9885853 0.07674027
## 3 0.1332539 0.9905698 0.06513193
## 4 0.1253030 0.9914552 0.06208328
## 5 0.1201174 0.9920139 0.06001626
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
rf_cust_pred <- predict(rf_cust, complete_data_test2)
#Evaluation
mae5 <- mae(rf_cust_pred, complete_data_test$Unemp_rate)
mse5 <- mse(rf_cust_pred, complete_data_test$Unemp_rate)
rmse5 <- rmse(rf_cust_pred, complete_data_test$Unemp_rate)
rsq5 <- cor(rf_cust_pred, complete_data_test$Unemp_rate)^2
print(paste("MAE-", mae5, "///MSE-", mse5, "///RMSE-", rmse5, "///RSQ-", rsq5))
## [1] "MAE- 0.0533278421900161 ///MSE- 0.0120321158212028 ///RMSE- 0.109691001550732 ///RSQ- 0.995468328871509"
A model tree is a variation of decision tree regression that fits a regression model in the leaves instead of predicting the mean. Here the ctree() function from the party package is used, which builds a conditional inference tree.
complete_data_train1 <- complete_data_train
complete_data_train1$DATE <- as.numeric(complete_data_train1$DATE - as.Date("1954-01-01"))
complete_data_test1 <- complete_data_test2
complete_data_test1$DATE <- as.numeric(complete_data_test2$DATE - as.Date("1954-01-01"))
model_tree_reg <- ctree(Unemp_rate ~., data = complete_data_train1)
model_tree_pred <- predict(model_tree_reg, complete_data_test1)
#Evaluation
c_mae_mt <- mae(model_tree_pred, complete_data_test$Unemp_rate)
c_mse_mt <- mse(model_tree_pred, complete_data_test$Unemp_rate)
c_rmse_mt <- rmse(model_tree_pred, complete_data_test$Unemp_rate)
c_rsq_mt <- cor(model_tree_pred, complete_data_test$Unemp_rate)^2
print(paste("MAE-", c_mae_mt, "/// MSE-", c_mse_mt, "/// RMSE-", c_rmse_mt, "/// RSQ-", c_rsq_mt))
## [1] "MAE- 0.111207463990694 /// MSE- 0.0421555392641434 /// RMSE- 0.205318141585549 /// RSQ- 0.983852208154005"
I also tried ridge regression, which handles correlated variables by shrinking their coefficients.
x <- as.matrix(complete_data_train1[-6])
y <- complete_data_train$Unemp_rate
cv_model <- cv.glmnet(x,y, alpha=0, lambda = seq(0.001,0.1,0.001), nfolds = 10)
min_l <- cv_model$lambda.min
ridge_model <- glmnet(x,y, alpha = 0, lambda = min_l)
ridge_pred <- predict(ridge_model, as.matrix(complete_data_test1))
Evaluation
mae_r <- mae(ridge_pred, complete_data_test$Unemp_rate)
mse5_r <- mse(ridge_pred, complete_data_test$Unemp_rate)
rmse5_r <- rmse(ridge_pred, complete_data_test$Unemp_rate)
rsq5_r <- cor(ridge_pred, complete_data_test$Unemp_rate)^2
print(paste("MAE-", mae_r, "///MSE-", mse5_r, "///RMSE-", rmse5_r, "///RSQ-", rsq5_r))
## [1] "MAE- 0.0328041786911582 ///MSE- 0.00152426395686021 ///RMSE- 0.039041823175413 ///RSQ- 0.999389660178682"
params <- list(
booster = "gbtree",
eta = 0.1,max_depth = 5, subsample = 0.8, colsample_bytree = 0.8, eval_metric = "rmse")
# Train the xgboost model
boost_model <- xgboost(data = as.matrix(complete_data_train1[-6]),
label = complete_data_train1$Unemp_rate,
params = params,
nrounds = 82)
## [1] train-rmse:5.067344
## [2] train-rmse:4.572092
## [3] train-rmse:4.123619
## [4] train-rmse:3.722090
## [5] train-rmse:3.358581
## [6] train-rmse:3.029320
## [7] train-rmse:2.734491
## [8] train-rmse:2.466368
## [9] train-rmse:2.226653
## [10] train-rmse:2.009577
## [11] train-rmse:1.815121
## [12] train-rmse:1.638956
## [13] train-rmse:1.483156
## [14] train-rmse:1.342854
## [15] train-rmse:1.214197
## [16] train-rmse:1.098768
## [17] train-rmse:0.994439
## [18] train-rmse:0.900680
## [19] train-rmse:0.816675
## [20] train-rmse:0.741236
## [21] train-rmse:0.673717
## [22] train-rmse:0.612564
## [23] train-rmse:0.557201
## [24] train-rmse:0.509086
## [25] train-rmse:0.464330
## [26] train-rmse:0.423868
## [27] train-rmse:0.387607
## [28] train-rmse:0.354592
## [29] train-rmse:0.327011
## [30] train-rmse:0.299887
## [31] train-rmse:0.275789
## [32] train-rmse:0.253529
## [33] train-rmse:0.233715
## [34] train-rmse:0.215969
## [35] train-rmse:0.202132
## [36] train-rmse:0.187461
## [37] train-rmse:0.176491
## [38] train-rmse:0.164223
## [39] train-rmse:0.153389
## [40] train-rmse:0.143600
## [41] train-rmse:0.134657
## [42] train-rmse:0.128615
## [43] train-rmse:0.121190
## [44] train-rmse:0.114292
## [45] train-rmse:0.110250
## [46] train-rmse:0.104723
## [47] train-rmse:0.099584
## [48] train-rmse:0.094948
## [49] train-rmse:0.092405
## [50] train-rmse:0.088332
## [51] train-rmse:0.084455
## [52] train-rmse:0.081035
## [53] train-rmse:0.079326
## [54] train-rmse:0.076351
## [55] train-rmse:0.073619
## [56] train-rmse:0.072593
## [57] train-rmse:0.070102
## [58] train-rmse:0.067923
## [59] train-rmse:0.065937
## [60] train-rmse:0.064122
## [61] train-rmse:0.062360
## [62] train-rmse:0.060876
## [63] train-rmse:0.059287
## [64] train-rmse:0.058580
## [65] train-rmse:0.057139
## [66] train-rmse:0.055823
## [67] train-rmse:0.054493
## [68] train-rmse:0.053339
## [69] train-rmse:0.052228
## [70] train-rmse:0.051188
## [71] train-rmse:0.050329
## [72] train-rmse:0.049242
## [73] train-rmse:0.048446
## [74] train-rmse:0.047587
## [75] train-rmse:0.046815
## [76] train-rmse:0.046141
## [77] train-rmse:0.045388
## [78] train-rmse:0.044774
## [79] train-rmse:0.044214
## [80] train-rmse:0.043764
## [81] train-rmse:0.043107
## [82] train-rmse:0.042563
boost_pred <- predict(boost_model, as.matrix(complete_data_test1))
Evaluation
mae_b <- mae(boost_pred, complete_data_test$Unemp_rate)
mse5_b <- mse(boost_pred, complete_data_test$Unemp_rate)
rmse5_b <- rmse(boost_pred, complete_data_test$Unemp_rate)
rsq5_b <- cor(boost_pred, complete_data_test$Unemp_rate)^2
print(paste("MAE-", mae_r, "///MSE-", mse5_r, "///RMSE-", rmse5_r, "///RSQ-", rsq5_r))
## [1] "MAE- 0.0328041786911582 ///MSE- 0.00152426395686021 ///RMSE- 0.039041823175413 ///RSQ- 0.999389660178682"
After trying a number of hyperparameter combinations for the learning rate and the hidden layer size, a learning rate of 0.006 and a single hidden layer with 4 neurons gave the best results.
No activation function is applied at the output node (linear.output = TRUE) because the target variable is continuous and its predictions should not be restricted to a bounded range.
scaled_data <- as.data.frame(sapply(complete_data_train1,
scale))
set.seed(99)
ann_model <- neuralnet(Unemp_rate ~ ., data = scaled_data, hidden = 4,
learningrate = 0.006, threshold = 0.01, linear.output = TRUE)
The training and testing data are scaled for standardization. The predictions are then rescaled back to the original units before the evaluation metrics are computed.
#evaluation
scaledtestdata <- data.frame(sapply(complete_data_test1,scale))
ann_pred <- compute(ann_model, scaledtestdata)$net.result
rescaled <- ann_pred *sd(complete_data_train1$Unemp_rate) + mean(complete_data_train1$Unemp_rate)
c_mae_an <- mae(rescaled, complete_data_test$Unemp_rate)
c_mse_an <- mse(rescaled, complete_data_test$Unemp_rate)
c_rmse_an <- rmse(rescaled, complete_data_test$Unemp_rate)
print(paste("MAE-", c_mae_an, "/// MSE-", c_mse_an, "/// RMSE-", c_rmse_an))
## [1] "MAE- 0.0813837249670667 /// MSE- 0.009626285350451 /// RMSE- 0.0981136348855296"
#plotting the ann model
plot(ann_model, rep= "best")
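One caveat about the scaling above: the test predictors are standardized with their own means and standard deviations. A sketch of an alternative that reuses the training means and standard deviations is shown below; it is not the version used for the reported metrics.
#Sketch: standardize the test predictors with the TRAINING means and SDs so
#that the two sets share the same scale, then rescale the prediction back.
train_means <- sapply(complete_data_train1, mean)
train_sds <- sapply(complete_data_train1, sd)
test_cols <- names(complete_data_test1)
scaledtest2 <- as.data.frame(scale(complete_data_test1,
                                   center = train_means[test_cols],
                                   scale = train_sds[test_cols]))
ann_pred2 <- compute(ann_model, scaledtest2)$net.result
rescaled2 <- ann_pred2 * train_sds["Unemp_rate"] + train_means["Unemp_rate"]
rmse(rescaled2, complete_data_test$Unemp_rate)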
I chose RMSE as the main evaluation metric for this project because prediction accuracy is what matters most when forecasting unemployment.
The best RMSE scores came from the XGBoost model and the ridge regression model.
I finally chose the XGBoost model because it handles correlated variables well.
library(tidyr)
library(ggplot2)
metrics_df <- data.frame(Model = c("Ridge Reg", "Decision Tree", "Model Tree", "RF- CV", "RF","XG Boost", "ANN"),
RMSE = c(0.0390, 0.1920, 0.20531, 0.109, 0.08, 0.0390,
0.0981),
MSE = c(0.0015, 0.03, 0.04, 0.01, 0.007, 0.0015,0.0096),
MAE = c(0.032, 0.036, 0.11, 0.053, 0.050, 0.0328,0.081))
df_long <- pivot_longer(metrics_df, cols = c("RMSE", "MSE", "MAE"), names_to = "metric", values_to = "value")
df_long <- df_long%>%
arrange(value)
# Plot a grouped bar chart
ggplot(df_long, aes(x = Model, y = value, fill = metric)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Model", y = "Metric Value", fill = "Metric") +
scale_fill_manual(values = c("red", "blue", "green")) +
theme_bw()
I also built the ML models using the principal components from the PCA, but the metrics were better with the original variables, so I decided to exclude the PCA and stick to the original variables.
The model with the best RMSE is deployed as the final model to predict unemployment for the remaining months of 2023.
Because predictor values for those months are not yet available, the 2022 predictor values are reused to generate the final predictions.
#Creating the 2023 test dataset with 2022 predictors, as we don't have predictors for 2023 yet. When the data for those months becomes available, we can always use that recent data to predict unemployment.
#Extracting the 2022 rows
test2022_1 <- method1[820:828,]
#creating the dates for the remaining 2023 months
test2022_1$DATE <- seq(as.Date("2023-04-01"), by = "month", length.out = 9)
#removing the target variable
test2022 <- test2022_1[-6]
test2022_date <-test2022
#converting the date to numeric format
test2022_date$DATE <- as.numeric(test2022$DATE - as.Date("1954-01-01"))
#Predictions using the XGBoost model
og_pred1 <- predict(boost_model, as.matrix(test2022_date))
pred_2023 <- data.frame(Date = test2022_1$DATE, Unemp_rate = og_pred1)
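The resulting predictions can be printed or saved (a small sketch; the file name is hypothetical):
#Display the predicted unemployment rates for April-December 2023
pred_2023
#Optionally write them out (hypothetical file name)
#write.csv(pred_2023, "unemp_predictions_2023.csv", row.names = FALSE)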
The predictions for April through December 2023 are stored in pred_2023.
When the real economic-indicator and demographic data for the coming months of 2023 become available, this XGBoost model can be used with them to predict unemployment.
library(xlsx)
states <- read.xlsx("emp-unemployment.xls", 2, startRow = 6)
#data cleaning
states <- states[-1]
states <- states[-1,]
states <- states[-c(52:59),]
colnames(states) <- gsub("X", "", colnames(states))
### Clustering
library(factoextra)
library(cluster)
#Data standardization
states_n1 <- as.data.frame(lapply(states[2:length(states)], scale))
#Finding optimal clusters.
fviz_nbclust(states_n1, kmeans, method = "silhouette", k.max = 8)
The silhouette method suggests 2 clusters, but I want to explore one more cluster, so I use 3.
set.seed(700)
states_clusters <- kmeans(states_n1,3)
states_clusters$size
## [1] 22 16 13
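The centroids referenced in the next paragraph can be inspected directly (a small sketch; only the first few years are shown to keep the output compact):
#Cluster centroids on the standardized yearly unemployment rates
round(states_clusters$centers[, 1:5], 2)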
The cluster centroids show that cluster 2 has all negative values, cluster 3 has all positive values, and cluster 1 has a mix of both.
Interpretation: the states in cluster 2 have unemployment rates below the mean unemployment rate throughout the years.
The states in cluster 3 have unemployment rates above the mean unemployment rate throughout the years.
Cluster 1 has states whose unemployment rates fluctuate above and below the mean.
fviz_cluster(states_clusters, data = states[2:length(states)], axes = c(1,2))
#assigning the clusters to the states
cluster_assign <- data.frame(state = states$Area, cluster = states_clusters$cluster)
#ordering the states by cluster assignment
cluster_assign<- cluster_assign[order(cluster_assign$cluster),]
#computing the mean unemployment rate per cluster for each year
cluster_means <- aggregate(states[2:length(states)], by = list(cluster= states_clusters$cluster), mean)
#creating a dataset with year, unemployment rate and the cluster.
cluster_trends <- gather(cluster_means, key = "year", value = "unemployment_rate", -cluster)
#Visualizing the performance of the states throughout the time
ggplot(cluster_trends, aes(x= year, y = unemployment_rate, colour = factor(cluster), group = cluster)) + geom_line(linewidth =2) +
labs(x="year", y ="unemployment rate", color = "Cluster") + scale_color_discrete(name = "Cluster") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
Taking consistency into account, states like South Dakota, North Dakota, Minnesota, Maryland, and Nebraska are performing better.
States like DC, California, Illinois, Louisiana, and Oregon are not performing as well.
Economic indicators such as the producer price index, interest rates, export goods, and inflation strongly affect unemployment, so these factors should be monitored carefully by the federal government.
The unemployment rate of the white population has the strongest influence on the overall unemployment rate, as the decision tree plot shows; it is the most important variable in this analysis.
Other demographic indicators, such as the size of the labor force, also have a major impact on unemployment in the US.
States like California, DC, Illinois, and New Mexico consistently have unemployment rates above the mean; these states should be monitored with the goal of reducing their unemployment.
According to the cluster analysis, 16 states are performing better than average and 13 states are under-performing.
When the real economic-indicator and demographic data become available, we can check the true accuracy of this model. Another area that could be improved is the ANN model.