knitr::opts_chunk$set(warning = FALSE)
knitr::opts_chunk$set(message = FALSE)
Unemployment Rate from 1954 to 2023
Semester: Spring 2023
Class: Data Mining (CS 5310)
Program: Master of Science in Data Analytics
School: University of Houston Downtown
This is a Data Mining project completed during the Spring 2023 semester as part of the Master of Science in Data Analytics (MSDA) program at the University of Houston Downtown (UHD).
The project predicts the unemployment rate for the coming months of 2023 using historical economic-indicator and demographic-indicator data.
A large part of the effort went into data collection, because every indicator had to be available monthly in order to predict unemployment month by month for 2023.
I chose this project because it provides an opportunity to apply the main data mining techniques and machine learning algorithms and to compare their outputs.
Through this project I worked through all the stages of the data analytics workflow:
The data pre-processing tasks employed in this project were: data merging (combining the different data sets), data cleaning (removing redundant values, handling missing values and noisy data), converting attribute data types so they work with the ML algorithm functions in R, and feature engineering using principal component analysis.
The next step is modeling: I created several models, evaluated and tuned them, and selected the best-performing one. The primary evaluation metric was the Root Mean Square Error (RMSE).
The final step is prediction: the best-performing model, chosen by RMSE, is used to predict the unemployment rate for the remaining months of 2023 (April through December).
I collected historical data from different sources: monthly economic indicators such as inflation, interest rates, the producer price index, the consumer confidence index, and GDP, as well as demographic data such as the unemployment rate of the white population and the average duration of unemployment in weeks, covering the time frame 1954-2022. Each indicator came as its own separate data set.
Unemployment rate data for each state was also collected to explore how the states have performed over time.
unempRate <- read.csv("Unemp rate 54-23.csv")
unemp_men <- read.csv("Unemp Men 54-23.csv")
unemp_women <- read.csv("Unemp women 54-23.csv")
unemp_people <- read.csv("Unemp level(people) 54-23.csv")
unemp_weeks <- read.csv("Unemp Avg weeks 54-23.csv")
unemp_over_20 <- read.csv("Unemp 20 & over 54-23.csv")
unemp_white <- read.csv("Unemp- white 54-23.csv")
inflation <- read.csv("Inflation 54-23(FRED).csv")
ppi <- read.csv("PPI 54-23(FRED).csv")
import_goods <- read.csv("Imports goods 55-23.csv")
export_goods <- read.csv("Exports goods 55 -23.csv")
nettrade <- read.csv("Net Trade goods 55-23.csv")
cli <- read.csv("CLI 55-23(USA).CSV")
labourforce <- read.csv("labour force 55-23.csv")
stockprices <- read.csv("stock prices 50-23.csv")
bci <- read.csv("Businees CI 50-23.csv")
cci <- read.csv("CCI 60-23.csv")
gdp_change <- read.csv("GDP % change 60-22.csv")
interest_rate <- read.csv("Interest rates 53-2023.csv")
Because every data set holds a single attribute and is formatted differently from the others, preprocessing took a lot of time. Only the required columns are selected from each data set, and these columns are then merged into a single data set containing the economic indicators, the demographic indicators, and the target variable (unemployment rate) from 1954 to 2022.
The final data set contains 831 monthly observations of the economic and demographic indicators together with the unemployment rate, starting in 1954.
### Data Pre-processing
library(dplyr)
#merging the datasets into a single data set
merge1 <- left_join(unempRate, unemp_men)
merge1 <- left_join(merge1,unemp_over_20)
merge1 <- left_join(merge1, unemp_people)
#naming the columns
colnames(merge1)[c(2,3,4,5)] <- c("unempRate","Unemp_men", "unemp_over_20", "unemp_people")
merge1 <- left_join(merge1,unemp_weeks)
merge1 <- left_join(merge1, unemp_white)
merge1 <- left_join(merge1, unemp_women)
colnames(merge1)[c(6,7,8)] <- c("unemp_weeks","unemp_white", "unemp_women")
merge1 <- left_join(merge1, inflation)
merge1 <- left_join(merge1, ppi)
import_goods1 <- import_goods[-c(1,2,3,4,5,8)]
colnames(import_goods1)[1] <- "DATE"
import_goods1$DATE <- paste(import_goods1$DATE,"01" ,sep = "-")
merge1 <- left_join(merge1,import_goods1)
export_goods1 <- export_goods[-c(1,2,3,4,5,8)]
colnames(export_goods1)[1] <- "DATE"
export_goods1$DATE <- paste(export_goods1$DATE,"01" ,sep = "-")
#naming the columns
colnames(merge1)[c(9,10,11)] <- c("inflation","ppi", "import_goods")
#merging remaining datasets
merge1 <- left_join(merge1, export_goods1)
nettrade1 <- nettrade[-c(1,2,3,4,5,8)]
colnames(nettrade1)[1] <- "DATE"
nettrade1$DATE <- paste(nettrade1$DATE,"01" ,sep = "-")
colnames(nettrade1)[2] <- "nettrade"
merge1 <- left_join(merge1,nettrade1)
cli1 <- cli[-c(1,2,3,4,5,8)]
colnames(cli1)[1] <- "DATE"
cli1$DATE <- paste(cli1$DATE,"01" ,sep = "-")
colnames(cli1)[2] <- "cli"
merge1 <- left_join(merge1,cli1)
labourforce1 <- labourforce[-c(1,2,3,4,5,8)]
colnames(labourforce1)[1] <- "DATE"
labourforce1$DATE <- paste(labourforce1$DATE,"01" ,sep = "-")
colnames(labourforce1)[2] <- "labourforce"
merge1 <- left_join(merge1,labourforce1)
stockprices1 <- stockprices[-c(1,2,3,4,5,8)]
colnames(stockprices1)[1] <- "DATE"
stockprices1$DATE <- paste(stockprices1$DATE,"01" ,sep = "-")
colnames(stockprices1)[2] <- "stockprices"
merge1 <- left_join(merge1,stockprices1)
bci1 <- bci[-c(1,2,3,4,5,8)]
colnames(bci1)[1] <- "DATE"
bci1$DATE <- paste(bci1$DATE,"01" ,sep = "-")
colnames(bci1)[2] <- "bci"
merge1 <- left_join(merge1,bci1)
cci1 <- cci[-c(1,2,3,4,5,8)]
colnames(cci1)[1] <- "DATE"
cci1$DATE <- paste(cci1$DATE,"01" ,sep = "-")
colnames(cci1)[2] <- "cci"
merge1 <- left_join(merge1,cci1)
merge1 <- left_join(merge1,gdp_change)
colnames(merge1)[c(12,19)] <- c("export_goods","gdp_change")
interest_rate1 <- interest_rate[-c(1,2,3,4,5,8)]
colnames(interest_rate1)[1] <- "DATE"
interest_rate1$DATE <- paste(interest_rate1$DATE,"01" ,sep = "-")
colnames(interest_rate1)[2] <- "interest_rate"
merge1 <- left_join(merge1,interest_rate1)
merge1$DATE <- as.Date(merge1$DATE,format = "%Y-%m-%d")
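The same select-rename-reformat-join pattern is repeated for every OECD-style file above. A small helper function could remove that repetition; the following is only a sketch and assumes each raw file has the month in column 6 (as "YYYY-MM") and the value in column 7, as in the files used above.
#Hypothetical helper (sketch): keep the date and value columns, rename them,
#convert "YYYY-MM" to "YYYY-MM-01", and left-join onto the running data set.
clean_and_join <- function(base, raw, value_name) {
  tmp <- raw[, c(6, 7)]                        #date and value columns (assumed layout)
  colnames(tmp) <- c("DATE", value_name)       #rename for the join
  tmp$DATE <- paste(tmp$DATE, "01", sep = "-") #"YYYY-MM" -> "YYYY-MM-01"
  left_join(base, tmp, by = "DATE")
}
#Example usage, equivalent to the cci1 block above:
#merge1 <- clean_and_join(merge1, cci, "cci")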
Missing values: the merged data contains many missing values. Imputing them with the mean or median is not appropriate here, and I did not want to delete the rows with NA values, so the missing values are imputed with the KNN algorithm using the step_impute_knn() function from the recipes package.
library(recipes)
#creating a recipe
rec <- recipe(merge1, formula = unempRate ~ .)
#specifying KNN imputation for all predictors, then estimating (prepping) the recipe
rec1 <- rec %>% step_impute_knn(all_predictors(), neighbors = 11)
rec2 <- prep(rec1, training = merge1)
#imputing the missing values.
imputed <- bake(rec2, new_data = merge1)
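A quick sanity check (a small sketch) that the imputation left no missing values behind:
#count the remaining missing values in each column after KNN imputation
colSums(is.na(imputed))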
Multiple regression analysis is done to check how much variation the independent variables explain, to detect multicollinearity, and to identify the most important variables.
method1 <- read.csv("complete dataset.csv")
str(method1)
## 'data.frame': 831 obs. of 20 variables:
## $ DATE : chr "1954-01-01" "1954-02-01" "1954-03-01" "1954-04-01" ...
## $ Unemp_20_and_over : num 4.4 4.6 5.1 5.3 5.4 5.3 5.3 5.5 5.6 5.3 ...
## $ Unemp_avg_weeks : num 8.7 9.5 10.6 10.9 11.6 12.3 12.5 12.8 12.9 13.3 ...
## $ Unemp_level_people: int 3077 3331 3607 3749 3767 3551 3659 3854 3927 3666 ...
## $ Unemp_men : num 4.4 4.9 5.3 5.6 5.7 5.3 5.6 6 6 5.7 ...
## $ Unemp_rate : num 4.9 5.2 5.7 5.9 5.9 5.6 5.8 6 6.1 5.7 ...
## $ Unemp_women : num 5.9 5.9 6.4 6.4 6.3 6.2 6.3 6.2 6.4 5.8 ...
## $ Unemp_white : num 4.5 4.9 5 5.5 5.3 5 5.3 5.6 5.9 5.1 ...
## $ PPI : num 30.5 30.4 30.3 30.5 30.5 30.3 30.5 30.5 30.4 30.3 ...
## $ Businees_CI : num 96.5 97.2 98 98.7 99.2 ...
## $ stock_prices : num 30.1 28.5 29.3 31.5 30.4 ...
## $ CCI : num 99.1 98.6 99.8 99.3 100.1 ...
## $ CLI : num 97.1 97.5 98.3 98.5 99.1 ...
## $ Export_goods : num 38.8 39.1 37.6 38.5 37.3 ...
## $ GDP : num -5.12 -4.49 -2.92 -2.19 -1.26 ...
## $ Import_goods : num 71.9 71.7 71.7 71.7 71.6 ...
## $ Inflation : num 26.9 26.9 26.9 26.8 26.9 26.9 26.9 26.9 26.8 26.8 ...
## $ Interest_rates : num 2.48 2.47 2.37 2.29 2.37 2.38 2.3 2.36 2.38 2.43 ...
## $ labour_force : num 59487 60606 60697 59264 58622 ...
## $ Net_Trade_goods : num 34.1 33.7 35.1 34.3 35.3 ...
Changing the data type of the date column so it works with the ML algorithm functions.
#Changing the data type of the DATE column from character to date
method1$DATE <- as.Date(method1$DATE, format = "%Y-%m-%d")
Regression model and summary.
c_reg_model <- lm(Unemp_rate ~., data = method1)
summary(c_reg_model)
##
## Call:
## lm(formula = Unemp_rate ~ ., data = method1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.125693 -0.027594 -0.001228 0.029541 0.105262
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.271e-01 1.837e-01 -0.692 0.48897
## DATE -1.640e-06 3.817e-06 -0.430 0.66761
## Unemp_20_and_over 2.106e-01 1.686e-02 12.487 < 2e-16 ***
## Unemp_avg_weeks 1.262e-03 6.289e-04 2.006 0.04519 *
## Unemp_level_people -2.918e-05 3.553e-06 -8.214 8.44e-16 ***
## Unemp_men 4.205e-01 1.334e-02 31.528 < 2e-16 ***
## Unemp_women 3.528e-01 9.594e-03 36.773 < 2e-16 ***
## Unemp_white 6.927e-02 1.591e-02 4.354 1.51e-05 ***
## PPI 1.066e-03 5.857e-04 1.820 0.06907 .
## Businees_CI -1.440e-03 2.040e-03 -0.706 0.48065
## stock_prices 4.579e-05 2.740e-04 0.167 0.86732
## CCI -5.991e-04 1.718e-03 -0.349 0.72743
## CLI -1.070e-03 2.532e-03 -0.423 0.67251
## Export_goods -1.422e-02 6.737e-03 -2.111 0.03506 *
## GDP 3.492e-04 2.746e-04 1.272 0.20380
## Import_goods 1.397e-02 6.738e-03 2.073 0.03850 *
## Inflation -2.117e-03 6.489e-04 -3.262 0.00115 **
## Interest_rates 2.934e-03 1.478e-03 1.985 0.04750 *
## labour_force 5.380e-06 7.997e-07 6.727 3.27e-11 ***
## Net_Trade_goods -1.401e-02 6.704e-03 -2.090 0.03692 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03884 on 811 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9994
## F-statistic: 7.934e+04 on 19 and 811 DF, p-value: < 2.2e-16
The summary shows that these variables explain the unemployment rate very well.
The model as a whole is significant: the F statistic has a p-value well below 0.01.
The correlation matrix shows that many of the variables are highly correlated.
library(corrplot)
corrplot(cor(sapply(method1, as.numeric)), method = "color", order = "alphabet")
library(car)
#checking the variance inflation factors for multicollinearity
vif(c_reg_model)
## DATE Unemp_20_and_over Unemp_avg_weeks Unemp_level_people
## 4.279327e+02 3.927055e+02 1.136753e+01 6.006635e+01
## Unemp_men Unemp_women Unemp_white PPI
## 3.295042e+02 1.202054e+02 3.145540e+02 7.384184e+02
## Businees_CI stock_prices CCI CLI
## 4.933750e+00 6.983628e+01 3.640785e+00 7.935599e+00
## Export_goods GDP Import_goods Inflation
## 5.907012e+04 1.315236e+00 1.455912e+05 1.545279e+03
## Interest_rates labour_force Net_Trade_goods
## 1.025997e+01 3.913009e+02 2.002035e+04
The VIF exceeds 5 for several variables, so those variables are highly correlated.
Since both the correlation map and the VIFs point to strong multicollinearity, I performed principal component analysis. Four principal components were extracted because they explain about 89% of the variance, which was expected to reduce the multicollinearity, but the results with the principal components were not convincing.
The evaluation metrics are better with the original variables than with the principal components for every model, so the PCA-based models are not included here, as they would only complicate the analysis.
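The PCA itself is omitted from the report; the following is a minimal sketch of how it can be run on this data, assuming the numeric predictors of method1 with DATE and the target Unemp_rate excluded.
#PCA sketch (not part of the reported results): scale the numeric predictors
#and inspect how much variance the leading components explain.
pca_input <- method1[, !(names(method1) %in% c("DATE", "Unemp_rate"))]
pca_fit <- prcomp(pca_input, center = TRUE, scale. = TRUE)
summary(pca_fit)                        #cumulative proportion of variance explained
pcs <- as.data.frame(pca_fit$x[, 1:4])  #first four principal components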
Because the target variable is numeric, I explored ML algorithms such as decision tree regression, model trees, random forests, and gradient boosting.
Random forests and gradient boosting are also known to handle correlated variables well, so I had high expectations for those two algorithms.
library(Metrics)
library(caret)
library(rpart)
library(rpart.plot)
library(randomForest)
library(party)
library(glmnet)
library(xgboost)
library(neuralnet)
### Splitting the data into training and testing sets
set.seed(800)
c_rs <- sample(nrow(method1), nrow(method1)*0.25)
complete_data_train <- method1[-c_rs,]
complete_data_test <- method1[c_rs,]
complete_data_test2 <- complete_data_test[-6]
# Training the model
reg_tree <- rpart(Unemp_rate ~., data = complete_data_train, method = "anova", cp = 0.0001)
Method "anova" is used because this is a regression problem.
A complexity parameter of 0.0001 gave the best results after tuning with other values.
It seems a more complex tree works better here.
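The cp value was tuned by hand; one way to compare candidate values (a sketch, not the exact search that was used) is to prune the full tree back to each candidate and compare the hold-out RMSE:
#Sketch of a manual cp search: prune the cp = 0.0001 tree to larger cp values
#and compare the test-set RMSE for each candidate.
cp_grid <- c(0.01, 0.001, 0.0005, 0.0001)
sapply(cp_grid, function(cp_val) {
  pruned <- prune(reg_tree, cp = cp_val)
  rmse(predict(pruned, complete_data_test2), complete_data_test$Unemp_rate)
})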
#Model Evaluation
reg_tree_pred <- predict(reg_tree, complete_data_test2)
######### Evaluation #############
mae3 <- mae(reg_tree_pred, complete_data_test$Unemp_rate)
mse3 <- mse(reg_tree_pred, complete_data_test$Unemp_rate)
rmse3 <- rmse(reg_tree_pred, complete_data_test$Unemp_rate)
rsq3 <- cor(reg_tree_pred, complete_data_test$Unemp_rate)^2
print(paste("MAE-", mae3, "///MSE-", mse3, "///RMSE-", rmse3, "///RSQ-", rsq3))
## [1] "MAE- 0.110114471181898 ///MSE- 0.036890008735503 ///RMSE- 0.192067719139638 ///RSQ- 0.985370155158067"
The most important variable is the unemployment rate of the white population.
imp_var <- sort(reg_tree$variable.importance, decreasing = T)
most_imp <- names(imp_var)[1:10]
rpart.plot(reg_tree, type = 2, extra = 101, fallen.leaves = F, under = TRUE, cex = 0.6, branch.lty = 1, branch.lwd = 1,
box.col = "lightblue")
set.seed(1200)
rf_model <- randomForest(Unemp_rate~., data = complete_data_train, ntree = 100, mtry = 4, max.depth = 6)
rf_pred <- predict(rf_model, complete_data_test2)
rf_model
##
## Call:
## randomForest(formula = Unemp_rate ~ ., data = complete_data_train, ntree = 100, mtry = 4, max.depth = 6)
## Type of random forest: regression
## Number of trees: 100
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 0.03577444
## % Var explained: 98.73
After trying several tuning combinations, 100 trees, 4 variables tried at each split, and a maximum depth of 6 gave the best results for this model.
#model evaluation
mae4 <- mae(rf_pred, complete_data_test$Unemp_rate)
mse4 <- mse(rf_pred, complete_data_test$Unemp_rate)
rmse4 <- rmse(rf_pred, complete_data_test$Unemp_rate)
rsq4 <- cor(rf_pred, complete_data_test$Unemp_rate)^2
print(paste("MAE-", mae4, "///MSE-", mse4, "///RMSE-", rmse4, "///RSQ-", rsq4))
## [1] "MAE- 0.050504106280193 ///MSE- 0.00761183238996231 ///RMSE- 0.0872458158879973 ///RSQ- 0.997163011492964"
rf_ctrl <- trainControl(method = "cv", number = 10)
rf_grid <- expand.grid(.mtry = c(1,2,3,4,5))
set.seed(1100)
rf_cust <- train(Unemp_rate~., data = complete_data_train, method = "rf", metric= "RMSE" , trControl = rf_ctrl,
tuneGrid = rf_grid)
rf_cust
## Random Forest
##
## 624 samples
## 19 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 561, 562, 563, 561, 562, 562, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 1 0.2219627 0.9819266 0.11777307
## 2 0.1549901 0.9885853 0.07674027
## 3 0.1332539 0.9905698 0.06513193
## 4 0.1253030 0.9914552 0.06208328
## 5 0.1201174 0.9920139 0.06001626
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
rf_cust_pred <- predict(rf_cust, complete_data_test2)
#Evaluation
mae5 <- mae(rf_cust_pred, complete_data_test$Unemp_rate)
mse5 <- mse(rf_cust_pred, complete_data_test$Unemp_rate)
rmse5 <- rmse(rf_cust_pred, complete_data_test$Unemp_rate)
rsq5 <- cor(rf_cust_pred, complete_data_test$Unemp_rate)^2
print(paste("MAE-", mae5, "///MSE-", mse5, "///RMSE-", rmse5, "///RSQ-", rsq5))
## [1] "MAE- 0.0533278421900161 ///MSE- 0.0120321158212028 ///RMSE- 0.109691001550732 ///RSQ- 0.995468328871509"
A model tree is a variation of decision tree regression that fits a regression model in the leaves instead of predicting the mean. Here the ctree() function from the party package is used, which builds a conditional inference tree.
complete_data_train1 <- complete_data_train
complete_data_train1$DATE <- as.numeric(complete_data_train1$DATE - as.Date("1954-01-01"))
complete_data_test1 <- complete_data_test2
complete_data_test1$DATE <- as.numeric(complete_data_test2$DATE - as.Date("1954-01-01"))
model_tree_reg <- ctree(Unemp_rate ~., data = complete_data_train1)
model_tree_pred <- predict(model_tree_reg, complete_data_test1)
#Evaluation
c_mae_mt <- mae(model_tree_pred, complete_data_test$Unemp_rate)
c_mse_mt <- mse(model_tree_pred, complete_data_test$Unemp_rate)
c_rmse_mt <- rmse(model_tree_pred, complete_data_test$Unemp_rate)
c_rsq_mt <- cor(model_tree_pred, complete_data_test$Unemp_rate)^2
print(paste("MAE-", c_mae_mt, "/// MSE-", c_mse_mt, "/// RMSE-", c_rmse_mt, "/// RSQ-", c_rsq_mt))
## [1] "MAE- 0.111207463990694 /// MSE- 0.0421555392641434 /// RMSE- 0.205318141585549 /// RSQ- 0.983852208154005"
I also tried ridge regression, which handles correlated variables by shrinking their coefficients.
x <- as.matrix(complete_data_train1[-6])
y <- complete_data_train$Unemp_rate
cv_model <- cv.glmnet(x,y, alpha=0, lambda = seq(0.001,0.1,0.001), nfolds = 10)
min_l <- cv_model$lambda.min
ridge_model <- glmnet(x,y, alpha = 0, lambda = min_l)
ridge_pred <- predict(ridge_model, as.matrix(complete_data_test1))
Evaluation
mae_r <- mae(ridge_pred, complete_data_test$Unemp_rate)
mse5_r <- mse(ridge_pred, complete_data_test$Unemp_rate)
rmse5_r <- rmse(ridge_pred, complete_data_test$Unemp_rate)
rsq5_r <- cor(ridge_pred, complete_data_test$Unemp_rate)^2
print(paste("MAE-", mae_r, "///MSE-", mse5_r, "///RMSE-", rmse5_r, "///RSQ-", rsq5_r))
## [1] "MAE- 0.0328041786911582 ///MSE- 0.00152426395686021 ///RMSE- 0.039041823175413 ///RSQ- 0.999389660178682"
params <- list(
booster = "gbtree",
eta = 0.1,max_depth = 5, subsample = 0.8, colsample_bytree = 0.8, eval_metric = "rmse")
# Train the xgboost model
boost_model <- xgboost(data = as.matrix(complete_data_train1[-6]),
label = complete_data_train1$Unemp_rate,
params = params,
nrounds = 82)
## [1] train-rmse:5.067344
## [2] train-rmse:4.572092
## [3] train-rmse:4.123619
## [4] train-rmse:3.722090
## [5] train-rmse:3.358581
## [6] train-rmse:3.029320
## [7] train-rmse:2.734491
## [8] train-rmse:2.466368
## [9] train-rmse:2.226653
## [10] train-rmse:2.009577
## [11] train-rmse:1.815121
## [12] train-rmse:1.638956
## [13] train-rmse:1.483156
## [14] train-rmse:1.342854
## [15] train-rmse:1.214197
## [16] train-rmse:1.098768
## [17] train-rmse:0.994439
## [18] train-rmse:0.900680
## [19] train-rmse:0.816675
## [20] train-rmse:0.741236
## [21] train-rmse:0.673717
## [22] train-rmse:0.612564
## [23] train-rmse:0.557201
## [24] train-rmse:0.509086
## [25] train-rmse:0.464330
## [26] train-rmse:0.423868
## [27] train-rmse:0.387607
## [28] train-rmse:0.354592
## [29] train-rmse:0.327011
## [30] train-rmse:0.299887
## [31] train-rmse:0.275789
## [32] train-rmse:0.253529
## [33] train-rmse:0.233715
## [34] train-rmse:0.215969
## [35] train-rmse:0.202132
## [36] train-rmse:0.187461
## [37] train-rmse:0.176491
## [38] train-rmse:0.164223
## [39] train-rmse:0.153389
## [40] train-rmse:0.143600
## [41] train-rmse:0.134657
## [42] train-rmse:0.128615
## [43] train-rmse:0.121190
## [44] train-rmse:0.114292
## [45] train-rmse:0.110250
## [46] train-rmse:0.104723
## [47] train-rmse:0.099584
## [48] train-rmse:0.094948
## [49] train-rmse:0.092405
## [50] train-rmse:0.088332
## [51] train-rmse:0.084455
## [52] train-rmse:0.081035
## [53] train-rmse:0.079326
## [54] train-rmse:0.076351
## [55] train-rmse:0.073619
## [56] train-rmse:0.072593
## [57] train-rmse:0.070102
## [58] train-rmse:0.067923
## [59] train-rmse:0.065937
## [60] train-rmse:0.064122
## [61] train-rmse:0.062360
## [62] train-rmse:0.060876
## [63] train-rmse:0.059287
## [64] train-rmse:0.058580
## [65] train-rmse:0.057139
## [66] train-rmse:0.055823
## [67] train-rmse:0.054493
## [68] train-rmse:0.053339
## [69] train-rmse:0.052228
## [70] train-rmse:0.051188
## [71] train-rmse:0.050329
## [72] train-rmse:0.049242
## [73] train-rmse:0.048446
## [74] train-rmse:0.047587
## [75] train-rmse:0.046815
## [76] train-rmse:0.046141
## [77] train-rmse:0.045388
## [78] train-rmse:0.044774
## [79] train-rmse:0.044214
## [80] train-rmse:0.043764
## [81] train-rmse:0.043107
## [82] train-rmse:0.042563
boost_pred <- predict(boost_model, as.matrix(complete_data_test1))
Evaluation
mae_b <- mae(boost_pred, complete_data_test$Unemp_rate)
mse5_b <- mse(boost_pred, complete_data_test$Unemp_rate)
rmse5_b <- rmse(boost_pred, complete_data_test$Unemp_rate)
rsq5_b <- cor(boost_pred, complete_data_test$Unemp_rate)^2
print(paste("MAE-", mae_r, "///MSE-", mse5_r, "///RMSE-", rmse5_r, "///RSQ-", rsq5_r))
## [1] "MAE- 0.0328041786911582 ///MSE- 0.00152426395686021 ///RMSE- 0.039041823175413 ///RSQ- 0.999389660178682"
After trying a number of hyperparameter combinations for the learning rate and the hidden layer size, a learning rate of 0.006 and a single hidden layer with 4 neurons gave the best results.
No activation function is applied at the output node (linear.output = TRUE) because the target variable is continuous and its predictions should not be restricted to a bounded range.
scaled_data <- as.data.frame(sapply(complete_data_train1,
scale))
set.seed(99)
ann_model <- neuralnet(Unemp_rate ~ ., data = scaled_data, hidden = 4,
learningrate = 0.006, threshold = 0.01, linear.output = TRUE)
The training and testing data are scaled for standardization. The predictions are then rescaled back to the original units before the evaluation metrics are computed.
#evaluation
scaledtestdata <- data.frame(sapply(complete_data_test1,scale))
ann_pred <- compute(ann_model, scaledtestdata)$net.result
rescaled <- ann_pred *sd(complete_data_train1$Unemp_rate) + mean(complete_data_train1$Unemp_rate)
c_mae_an <- mae(rescaled, complete_data_test$Unemp_rate)
c_mse_an <- mse(rescaled, complete_data_test$Unemp_rate)
c_rmse_an <- rmse(rescaled, complete_data_test$Unemp_rate)
print(paste("MAE-", c_mae_an, "/// MSE-", c_mse_an, "/// RMSE-", c_rmse_an))
## [1] "MAE- 0.0813837249670667 /// MSE- 0.009626285350451 /// RMSE- 0.0981136348855296"
#plotting the ann model
plot(ann_model, rep= "best")
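One caveat about the scaling above: the test predictors are standardized with their own means and standard deviations. A sketch of an alternative that reuses the training means and standard deviations is shown below; it is not the version used for the reported metrics.
#Sketch: standardize the test predictors with the TRAINING means and SDs so
#that the two sets share the same scale, then rescale the prediction back.
train_means <- sapply(complete_data_train1, mean)
train_sds <- sapply(complete_data_train1, sd)
test_cols <- names(complete_data_test1)
scaledtest2 <- as.data.frame(scale(complete_data_test1,
                                   center = train_means[test_cols],
                                   scale = train_sds[test_cols]))
ann_pred2 <- compute(ann_model, scaledtest2)$net.result
rescaled2 <- ann_pred2 * train_sds["Unemp_rate"] + train_means["Unemp_rate"]
rmse(rescaled2, complete_data_test$Unemp_rate)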
I chose RMSE as the main evaluation metric for this project because prediction accuracy is what matters most when forecasting unemployment.
The best RMSE scores came from the XGBoost model and the ridge regression model.
I finally chose the XGBoost model because it handles correlated variables well.
library(tidyr)
library(ggplot2)
metrics_df <- data.frame(Model = c("Ridge Reg", "Decision Tree", "Model Tree", "RF- CV", "RF","XG Boost", "ANN"),
RMSE = c(0.0390, 0.1920, 0.20531, 0.109, 0.08, 0.0390,
0.0981),
MSE = c(0.0015, 0.03, 0.04, 0.01, 0.007, 0.0015,0.0096),
MAE = c(0.032, 0.036, 0.11, 0.053, 0.050, 0.0328,0.081))
df_long <- pivot_longer(metrics_df, cols = c("RMSE", "MSE", "MAE"), names_to = "metric", values_to = "value")
df_long <- df_long%>%
arrange(value)
# Plot a grouped bar chart
ggplot(df_long, aes(x = Model, y = value, fill = metric)) +
geom_bar(stat = "identity", position = "dodge") +
labs(x = "Model", y = "Metric Value", fill = "Metric") +
scale_fill_manual(values = c("red", "blue", "green")) +
theme_bw()
I also built the ML models using the principal components from the PCA, but the metrics were better with the original variables, so I decided to exclude the PCA and stick to the original variables.
The model with the best RMSE is deployed as the final model to predict unemployment for the remaining months of 2023.
Because predictor values for those months are not yet available, the 2022 predictor values are reused to generate the final predictions.
#Creating the 2023 test dataset with 2022 predictors, as we don't have predictors for 2023 yet. When the data for those months becomes available, we can always use that recent data to predict unemployment.
#Extracting the 2022 rows
test2022_1 <- method1[820:828,]
#creating the dates for the remaining 2023 months
test2022_1$DATE <- seq(as.Date("2023-04-01"), by = "month", length.out = 9)
#removing the target variable
test2022 <- test2022_1[-6]
test2022_date <-test2022
#converting the date to numeric format
test2022_date$DATE <- as.numeric(test2022$DATE - as.Date("1954-01-01"))
#Predictions using the XGBoost model
og_pred1 <- predict(boost_model, as.matrix(test2022_date))
pred_2023 <- data.frame(Date = test2022_1$DATE, Unemp_rate = og_pred1)
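The resulting predictions can be printed or saved (a small sketch; the file name is hypothetical):
#Display the predicted unemployment rates for April-December 2023
pred_2023
#Optionally write them out (hypothetical file name)
#write.csv(pred_2023, "unemp_predictions_2023.csv", row.names = FALSE)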
The predictions for April through December 2023 are stored in pred_2023.
When the real economic-indicator and demographic data for the coming months of 2023 become available, this XGBoost model can be used with them to predict unemployment.
library(xlsx)
states <- read.xlsx("emp-unemployment.xls", 2, startRow = 6)
#data cleaning
states <- states[-1]
states <- states[-1,]
states <- states[-c(52:59),]
colnames(states) <- gsub("X", "", colnames(states))
### Clustering
library(factoextra)
library(cluster)
#Data standardization
states_n1 <- as.data.frame(lapply(states[2:length(states)], scale))
#Finding optimal clusters.
fviz_nbclust(states_n1, kmeans, method = "silhouette", k.max = 8)
The silhouette method suggests 2 clusters, but I want to explore one more cluster, so I use 3.
set.seed(700)
states_clusters <- kmeans(states_n1,3)
states_clusters$size
## [1] 22 16 13
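The centroids referenced in the next paragraph can be inspected directly (a small sketch; only the first few years are shown to keep the output compact):
#Cluster centroids on the standardized yearly unemployment rates
round(states_clusters$centers[, 1:5], 2)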
The cluster centroids show that cluster 2 has all negative values, cluster 3 has all positive values, and cluster 1 has a mix of both.
Interpretation: the states in cluster 2 have unemployment rates below the mean unemployment rate throughout the years.
The states in cluster 3 have unemployment rates above the mean unemployment rate throughout the years.
Cluster 1 has states whose unemployment rates fluctuate above and below the mean.
fviz_cluster(states_clusters, data = states[2:length(states)], axes = c(1,2))
#assigning the clusters to the states
cluster_assign <- data.frame(state = states$Area, cluster = states_clusters$cluster)
#ordering the states by cluster assignment
cluster_assign<- cluster_assign[order(cluster_assign$cluster),]
#computing the mean unemployment rate per cluster for each year
cluster_means <- aggregate(states[2:length(states)], by = list(cluster= states_clusters$cluster), mean)
#creating a dataset with year, unemployment rate and the cluster.
cluster_trends <- gather(cluster_means, key = "year", value = "unemployment_rate", -cluster)
#Visualizing the performance of the states throughout the time
ggplot(cluster_trends, aes(x= year, y = unemployment_rate, colour = factor(cluster), group = cluster)) + geom_line(linewidth =2) +
labs(x="year", y ="unemployment rate", color = "Cluster") + scale_color_discrete(name = "Cluster") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
Taking consistency into account, states like South Dakota, North Dakota, Minnesota, Maryland, and Nebraska are performing better.
States like DC, California, Illinois, Louisiana, and Oregon are not performing as well.
Economic indicators such as the producer price index, interest rates, export goods, and inflation strongly affect unemployment, so these factors should be monitored carefully by the federal government.
The unemployment rate of the white population has the strongest influence on the overall unemployment rate, as the decision tree plot shows; it is the most important variable in this analysis.
Other demographic indicators, such as the size of the labor force, also have a major impact on unemployment in the US.
States like California, DC, Illinois, and New Mexico consistently have unemployment rates above the mean; these states should be monitored with the goal of reducing their unemployment.
According to the cluster analysis, 16 states are performing better than average and 13 states are under-performing.
When the real economic-indicator and demographic data become available, we can check the true accuracy of this model. Another area that could be improved is the ANN model.