Targeting the right customers and finding those who are most likely to buy the product a company offers is one of the most important tasks for businesses that have already gathered some broad knowledge about their clients. In this paper we explore the data gathered by Portuguese researchers, publicly available at: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing. This dataset contains over 41,000 observations with variables describing the bank's clients, data related to the previous and current campaigns held by the bank, and social and economic context attributes at the time of contact. The main goal is to build a model that predicts the outcome of the campaign held by the bank, namely whether the client subscribed to a term deposit.
First, let's summarize the data with built-in R functions.
bank <- read.csv2("bank-additional-full.csv")
summary(bank)
## age job marital
## Min. :17.00 admin. :10422 divorced: 4612
## 1st Qu.:32.00 blue-collar: 9254 married :24928
## Median :38.00 technician : 6743 single :11568
## Mean :40.02 services : 3969 unknown : 80
## 3rd Qu.:47.00 management : 2924
## Max. :98.00 retired : 1720
## (Other) : 6156
## education default housing loan
## university.degree :12168 no :32588 no :18622 no :33950
## high.school : 9515 unknown: 8597 unknown: 990 unknown: 990
## basic.9y : 6045 yes : 3 yes :21576 yes : 6248
## professional.course: 5243
## basic.4y : 4176
## basic.6y : 2292
## (Other) : 1749
## contact month day_of_week duration
## cellular :26144 may :13769 fri:7827 Min. : 0.0
## telephone:15044 jul : 7174 mon:8514 1st Qu.: 102.0
## aug : 6178 thu:8623 Median : 180.0
## jun : 5318 tue:8090 Mean : 258.3
## nov : 4101 wed:8134 3rd Qu.: 319.0
## apr : 2632 Max. :4918.0
## (Other): 2016
## campaign pdays previous poutcome
## Min. : 1.000 Min. : 0.0 Min. :0.000 failure : 4252
## 1st Qu.: 1.000 1st Qu.:999.0 1st Qu.:0.000 nonexistent:35563
## Median : 2.000 Median :999.0 Median :0.000 success : 1373
## Mean : 2.568 Mean :962.5 Mean :0.173
## 3rd Qu.: 3.000 3rd Qu.:999.0 3rd Qu.:0.000
## Max. :56.000 Max. :999.0 Max. :7.000
##
## emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
## 01.kwi :16234 93.994 :7763 -36.4 :7763 4.857 : 2868 5228.1 :16234
## -1.8 : 9184 93.918 :6685 -42.7 :6685 4.962 : 2613 5099.1 : 8534
## 1.1 : 7763 92.893 :5794 -46.2 :5794 4.963 : 2487 5191 : 7763
## -0.1 : 3683 93.444 :5175 -36.1 :5175 4.961 : 1902 5195.8 : 3683
## -2.9 : 1663 94.465 :4374 -41.8 :4374 4.856 : 1210 5076.2 : 1663
## -3.4 : 1071 93.2 :3616 -42 :3616 4.964 : 1175 5017.5 : 1071
## (Other): 1590 (Other):7781 (Other):7781 (Other):28933 (Other): 2240
## y
## no :36548
## yes: 4640
##
##
##
##
##
str(bank)
## 'data.frame': 41188 obs. of 21 variables:
## $ age : int 56 57 37 40 56 45 59 41 24 25 ...
## $ job : Factor w/ 12 levels "admin.","blue-collar",..: 4 8 8 1 8 8 1 2 10 8 ...
## $ marital : Factor w/ 4 levels "divorced","married",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ education : Factor w/ 8 levels "basic.4y","basic.6y",..: 1 4 4 2 4 3 6 8 6 4 ...
## $ default : Factor w/ 3 levels "no","unknown",..: 1 2 1 1 1 2 1 2 1 1 ...
## $ housing : Factor w/ 3 levels "no","unknown",..: 1 1 3 1 1 1 1 1 3 3 ...
## $ loan : Factor w/ 3 levels "no","unknown",..: 1 1 1 1 3 1 1 1 1 1 ...
## $ contact : Factor w/ 2 levels "cellular","telephone": 2 2 2 2 2 2 2 2 2 2 ...
## $ month : Factor w/ 10 levels "apr","aug","dec",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ day_of_week : Factor w/ 5 levels "fri","mon","thu",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ duration : int 261 149 226 151 307 198 139 217 380 50 ...
## $ campaign : int 1 1 1 1 1 1 1 1 1 1 ...
## $ pdays : int 999 999 999 999 999 999 999 999 999 999 ...
## $ previous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ poutcome : Factor w/ 3 levels "failure","nonexistent",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ emp.var.rate : Factor w/ 10 levels "-0.1","-0.2",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ cons.price.idx: Factor w/ 26 levels "92.201","92.379",..: 19 19 19 19 19 19 19 19 19 19 ...
## $ cons.conf.idx : Factor w/ 26 levels "-26.9","-29.8",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ euribor3m : Factor w/ 315 levels "0.634","0.635",..: 277 277 277 277 277 277 277 277 277 277 ...
## $ nr.employed : Factor w/ 11 levels "4963.6","4991.6",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ y : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
bank$y <- as.factor(ifelse(bank$y == 'yes', 1, 0))
prop.table(table(bank$y))
##
## 0 1
## 0.8873458 0.1126542
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gmodels)
ggplot(bank, aes(x = age)) +
geom_bar() +
facet_grid(y ~ ., scales = 'free_y') +
geom_vline(xintercept = c(30, 60), col = "blue",linetype = "dashed") +
scale_x_continuous(breaks = seq(0, 100, 5))
bank = bank %>%
mutate(age = ifelse(age > 60, 'old', ifelse(age > 30, 'middle', 'young')))
bank$age <- as.factor(bank$age)
We can distinguish three main patterns of how age influences the outcome variable. Young people (below 30 years old) and those reaching or already in retirement (over 60 years old) are more willing to take deposits than middle-aged people. Therefore we divide the continuous age variable into three categories.
CrossTable(bank$job, bank$y, prop.c = F, prop.t = F, prop.chisq = F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 41188
##
##
## | bank$y
## bank$job | 0 | 1 | Row Total |
## --------------|-----------|-----------|-----------|
## admin. | 9070 | 1352 | 10422 |
## | 0.870 | 0.130 | 0.253 |
## --------------|-----------|-----------|-----------|
## blue-collar | 8616 | 638 | 9254 |
## | 0.931 | 0.069 | 0.225 |
## --------------|-----------|-----------|-----------|
## entrepreneur | 1332 | 124 | 1456 |
## | 0.915 | 0.085 | 0.035 |
## --------------|-----------|-----------|-----------|
## housemaid | 954 | 106 | 1060 |
## | 0.900 | 0.100 | 0.026 |
## --------------|-----------|-----------|-----------|
## management | 2596 | 328 | 2924 |
## | 0.888 | 0.112 | 0.071 |
## --------------|-----------|-----------|-----------|
## retired | 1286 | 434 | 1720 |
## | 0.748 | 0.252 | 0.042 |
## --------------|-----------|-----------|-----------|
## self-employed | 1272 | 149 | 1421 |
## | 0.895 | 0.105 | 0.035 |
## --------------|-----------|-----------|-----------|
## services | 3646 | 323 | 3969 |
## | 0.919 | 0.081 | 0.096 |
## --------------|-----------|-----------|-----------|
## student | 600 | 275 | 875 |
## | 0.686 | 0.314 | 0.021 |
## --------------|-----------|-----------|-----------|
## technician | 6013 | 730 | 6743 |
## | 0.892 | 0.108 | 0.164 |
## --------------|-----------|-----------|-----------|
## unemployed | 870 | 144 | 1014 |
## | 0.858 | 0.142 | 0.025 |
## --------------|-----------|-----------|-----------|
## unknown | 293 | 37 | 330 |
## | 0.888 | 0.112 | 0.008 |
## --------------|-----------|-----------|-----------|
## Column Total | 36548 | 4640 | 41188 |
## --------------|-----------|-----------|-----------|
##
##
bank = bank %>%
filter(job != "unknown")
Certain occupations have visibly greater proportions of the positive class than others. We remove the observations with unknown job values.
CrossTable(bank$marital, bank$y, prop.c = F, prop.t = F, prop.chisq = F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 40858
##
##
## | bank$y
## bank$marital | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## divorced | 4126 | 473 | 4599 |
## | 0.897 | 0.103 | 0.113 |
## -------------|-----------|-----------|-----------|
## married | 22178 | 2516 | 24694 |
## | 0.898 | 0.102 | 0.604 |
## -------------|-----------|-----------|-----------|
## single | 9889 | 1605 | 11494 |
## | 0.860 | 0.140 | 0.281 |
## -------------|-----------|-----------|-----------|
## unknown | 62 | 9 | 71 |
## | 0.873 | 0.127 | 0.002 |
## -------------|-----------|-----------|-----------|
## Column Total | 36255 | 4603 | 40858 |
## -------------|-----------|-----------|-----------|
##
##
bank = bank %>%
filter(marital != "unknown")
Single people decide to subscribe to a term deposit more often than married or divorced ones. Again, we remove the observations with unknown values.
table_educ <- table(bank$education, bank$y)
mosaicplot(table_educ, color = T)
nrow(bank[bank$education == 'illiterate',])
## [1] 18
bank = bank %>%
filter(education != 'illiterate')
Since there are only 18 observations in the 'illiterate' category, we remove the observations with this level of education.
summary(bank$default)
## no unknown yes
## 32337 8429 3
table(bank$y, bank$default)
##
## no unknown yes
## 0 28182 7994 3
## 1 4155 435 0
bank <- select(bank, -c('default'))
As there is almost no differentiation in this variable (all observations except 3 belong to the groups "no" and "unknown"), we remove it, since it does not provide any vital information.
CrossTable(bank$loan, bank$y, prop.c = F, prop.t = F, prop.chisq = F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 40769
##
##
## | bank$y
## bank$loan | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## no | 29799 | 3806 | 33605 |
## | 0.887 | 0.113 | 0.824 |
## -------------|-----------|-----------|-----------|
## unknown | 877 | 107 | 984 |
## | 0.891 | 0.109 | 0.024 |
## -------------|-----------|-----------|-----------|
## yes | 5503 | 677 | 6180 |
## | 0.890 | 0.110 | 0.152 |
## -------------|-----------|-----------|-----------|
## Column Total | 36179 | 4590 | 40769 |
## -------------|-----------|-----------|-----------|
##
##
chisq.test(bank$loan, bank$y)
##
## Pearson's Chi-squared test
##
## data: bank$loan and bank$y
## X-squared = 0.86841, df = 2, p-value = 0.6478
bank <- select(bank, -c('loan'))
The differences between groups do not appear to be statistically significant (p-value = 0.65), so we remove this variable.
CrossTable(bank$housing, bank$y, prop.c = F, prop.t = F, prop.chisq = F)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## |-------------------------|
##
##
## Total Observations in Table: 40769
##
##
## | bank$y
## bank$housing | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## no | 16416 | 2003 | 18419 |
## | 0.891 | 0.109 | 0.452 |
## -------------|-----------|-----------|-----------|
## unknown | 877 | 107 | 984 |
## | 0.891 | 0.109 | 0.024 |
## -------------|-----------|-----------|-----------|
## yes | 18886 | 2480 | 21366 |
## | 0.884 | 0.116 | 0.524 |
## -------------|-----------|-----------|-----------|
## Column Total | 36179 | 4590 | 40769 |
## -------------|-----------|-----------|-----------|
##
##
chisq.test(bank$housing, bank$y)
##
## Pearson's Chi-squared test
##
## data: bank$housing and bank$y
## X-squared = 5.4627, df = 2, p-value = 0.06513
bank <- select(bank, -c('housing'))
Here the differences among groups are also not statistically significant at the 5% level (p-value = 0.065), so we remove 'housing'.
Although the association of most explanatory variables with the outcome variable has already been checked, let's use the Information Value (IV) to see which variables we can expect to have the strongest impact on y.
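The IV summary below also lists a pdays_binary variable that is not present in the raw data. A plausible, purely hypothetical recoding (in this dataset pdays equal to 999 means the client was not contacted in a previous campaign) could look like this:
# hypothetical recoding: pdays == 999 marks clients never contacted before
bank$pdays_binary <- as.factor(ifelse(bank$pdays == 999, "not_contacted", "contacted"))
bank <- select(bank, -pdays)  # drop the original, heavily skewed pdays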
library("Information")
bank$y <- ifelse(bank$y == '1', 1, 0)
IV <- create_infotables(data = bank, y = "y", bins = 5, parallel = FALSE)
IV$Summary
## Variable IV
## 14 nr.employed 1.171818847
## 11 emp.var.rate 0.913439490
## 15 pdays_binary 0.555002409
## 10 poutcome 0.550876798
## 6 month 0.484687904
## 12 cons.price.idx 0.385989656
## 9 previous 0.342814339
## 5 contact 0.251907553
## 13 cons.conf.idx 0.230233359
## 1 age 0.193432291
## 2 job 0.190651464
## 4 education 0.047426068
## 8 campaign 0.035463010
## 3 marital 0.027303022
## 7 day_of_week 0.006175579
Most of the social and economic attributes have very high Information Values, which indicates strong predictive power. The same can be said about the variables describing the outcome of the previous campaign, namely poutcome, pdays_binary and previous. We decided to discard the day_of_week variable, since variables with an IV below 0.02 are generally regarded as having no predictive power at all.
bank <- select(bank, -c("day_of_week"))
We divide the data into training and test samples in approximately 80:20 proportions.
set.seed(103)
bank <- bank[sample(nrow(bank)), ]
bank$y <- as.factor(bank$y)
train_proportion <- 0.8
train_index <- runif(nrow(bank)) < train_proportion
train <- bank[train_index,]
test <- bank[!train_index,]
test$y <- as.factor(ifelse(test$y == 1, "Yes", "No"))
train$y <- as.factor(ifelse(train$y == 1, "Yes", "No"))
library(rpart)
library(rpart.plot)
library(ROCR)
library(caret)
## Loading required package: lattice
Now it is time to build our first model: the decision tree. We use the trainControl function from the caret package to set up 10-fold cross-validation, which checks how different values of the complexity parameter (cp) behave across resamples. The train function with method = "rpart" then builds the decision tree using the value of cp that gives the best results, and finally we plot the tree.
set.seed(123)
fitControl <- trainControl(method = 'cv',
number = 10,
classProbs = T,
summaryFunction = twoClassSummary)
tree <- train(y ~ .,
data = train,
trControl = fitControl,
metric = "ROC",
method = "rpart")
tree_fitted <- predict(tree, train, type = "prob")
tree_fitted_test <- predict(tree, test, type = "prob")
pred <- factor(ifelse(tree_fitted[, "Yes"] > 0.2, "Yes", "No"))
confusionMatrix(pred, train$y, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 26690 1964
## Yes 2163 1761
##
## Accuracy : 0.8733
## 95% CI : (0.8697, 0.8769)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1.000000
##
## Kappa : 0.3887
##
## Mcnemar's Test P-Value : 0.002055
##
## Sensitivity : 0.47275
## Specificity : 0.92503
## Pos Pred Value : 0.44878
## Neg Pred Value : 0.93146
## Prevalence : 0.11434
## Detection Rate : 0.05405
## Detection Prevalence : 0.12045
## Balanced Accuracy : 0.69889
##
## 'Positive' Class : Yes
##
pred_test <- factor(ifelse(tree_fitted_test[, "Yes"] > 0.2, "Yes", "No"))
confusionMatrix(pred_test, test$y, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6773 447
## Yes 553 418
##
## Accuracy : 0.8779
## 95% CI : (0.8706, 0.8849)
## No Information Rate : 0.8944
## P-Value [Acc > NIR] : 0.9999991
##
## Kappa : 0.3868
##
## Mcnemar's Test P-Value : 0.0008989
##
## Sensitivity : 0.48324
## Specificity : 0.92452
## Pos Pred Value : 0.43048
## Neg Pred Value : 0.93809
## Prevalence : 0.10560
## Detection Rate : 0.05103
## Detection Prevalence : 0.11854
## Balanced Accuracy : 0.70388
##
## 'Positive' Class : Yes
##
The 0.20 probability cut-off point was used to classify observations, as it offered the best trade-off between specificity and sensitivity.
In the training sample we obtained the following results:
• Sensitivity = 47.3%
• Specificity = 92.5%
• Balanced accuracy = 0.699
In the test sample:
• Sensitivity = 48.3%
• Specificity = 92.5%
• Balanced accuracy = 0.704
Surprisingly, the model did slightly better on the test sample than on the training sample.
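The cut-off points in this paper were chosen by inspecting the sensitivity/specificity trade-off; the exact criterion is not shown, so the following is only a minimal sketch of one possible approach, using the already loaded ROCR package and maximising Youden's J statistic (sensitivity + specificity - 1) on the training predictions.
# sketch: find the cut-off maximising sensitivity + specificity - 1
roc_pred <- prediction(tree_fitted[, "Yes"], train$y)
roc_perf <- performance(roc_pred, measure = "sens", x.measure = "spec")
cutoffs <- roc_perf@alpha.values[[1]]
youden <- roc_perf@y.values[[1]] + roc_perf@x.values[[1]] - 1
cutoffs[which.max(youden)]  # one candidate cut-off; 0.20 was used above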
Our next model, logistic regression, is again built with the train function. This time we use 5-fold cross-validation repeated 3 times, so that the predictions are evaluated on several different partitions of the dataset.
set.seed(123)
ctrl_cv5x3a <- trainControl(method = "repeatedcv",
number = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary,
repeats = 3)
logit_train <- train(y ~.,
data = train,
method = "glm",
family = "binomial",
trControl = ctrl_cv5x3a)
logit_train_fitted <- predict(logit_train, train, type = "prob")
logit_test_fitted <- predict(logit_train, test, type = "prob")
pred <- factor(ifelse(logit_train_fitted[, "Yes"] > 0.19, "Yes", "No"))
confusionMatrix(pred, train$y, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 25806 1581
## Yes 3047 2144
##
## Accuracy : 0.8579
## 95% CI : (0.8541, 0.8617)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4012
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.57557
## Specificity : 0.89440
## Pos Pred Value : 0.41302
## Neg Pred Value : 0.94227
## Prevalence : 0.11434
## Detection Rate : 0.06581
## Detection Prevalence : 0.15934
## Balanced Accuracy : 0.73498
##
## 'Positive' Class : Yes
##
pred_test <- factor(ifelse(logit_test_fitted[, "Yes"] > 0.19, "Yes", "No"))
confusionMatrix(pred_test, test$y, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6546 375
## Yes 780 490
##
## Accuracy : 0.859
## 95% CI : (0.8513, 0.8665)
## No Information Rate : 0.8944
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3813
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.56647
## Specificity : 0.89353
## Pos Pred Value : 0.38583
## Neg Pred Value : 0.94582
## Prevalence : 0.10560
## Detection Rate : 0.05982
## Detection Prevalence : 0.15505
## Balanced Accuracy : 0.73000
##
## 'Positive' Class : Yes
##
The 0.19 probability cut-off point was used to classify observations, as it offered the best trade-off between specificity and sensitivity. In the training sample, sensitivity equals around 57.6% and specificity 89.4%; balanced accuracy is therefore 0.735.
In the test sample the results are slightly worse, but still reasonably good. Sensitivity dropped by around 1 percentage point to 56.6%, while specificity stayed essentially the same (89.4%). Balanced accuracy was 0.73.
The final classification model is k-nearest neighbours (kNN). This method is the most computationally expensive of the ones used in this project. This time we pass additional arguments to the train function: trControl, to perform 5-fold cross-validation, and preProcess = "range", to rescale the variables, which matters for kNN because its distances are computed directly from the feature values. Applying these settings helps to achieve more accurate predictions.
Beforehand, we used the tuneGrid argument to search for the optimal value of the hyperparameter k, which turned out to be k = 86. As this search is very computationally expensive, we do not repeat it in this paper.
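A sketch of how this tuning might be run is shown below; the grid of candidate k values is hypothetical, since the original search grid is not reproduced here.
# hypothetical grid search for k with 5-fold cross-validation
cv5 <- trainControl(method = "cv", number = 5)
k_grid <- expand.grid(k = seq(6, 126, by = 20))  # hypothetical candidate values
knn_search <- train(y ~ .,
                    data = train,
                    method = "knn",
                    trControl = cv5,
                    tuneGrid = k_grid,
                    preProcess = c("range"))
knn_search$bestTune  # in our run the best value was k = 86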
set.seed(123)
ctrl_cv5 <- trainControl(method = "cv",
number = 5)
different_k <- data.frame(k = 86)
train_knn_tuned <-
train(y ~ .,
data = train,
method = "knn",
trControl = ctrl_cv5,
tuneGrid = different_k,
preProcess = c("range")
)
train_knn_tuned_forecast <- predict(train_knn_tuned, train, type = "prob")
test_knn_tuned_forecast <- predict(train_knn_tuned, test, type = "prob")
pred_knn <- factor(ifelse(train_knn_tuned_forecast[, "Yes"] > 0.15, "Yes", "No"))
confusionMatrix(pred_knn, train$y, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 26159 1691
## Yes 2694 2034
##
## Accuracy : 0.8654
## 95% CI : (0.8616, 0.8691)
## No Information Rate : 0.8857
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.4052
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.54604
## Specificity : 0.90663
## Pos Pred Value : 0.43020
## Neg Pred Value : 0.93928
## Prevalence : 0.11434
## Detection Rate : 0.06243
## Detection Prevalence : 0.14513
## Balanced Accuracy : 0.72634
##
## 'Positive' Class : Yes
##
pred_knn_test <- factor(ifelse(test_knn_tuned_forecast[, "Yes"] > 0.15, "Yes", "No"))
confusionMatrix(pred_knn_test, test$y, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 6631 414
## Yes 695 451
##
## Accuracy : 0.8646
## 95% CI : (0.857, 0.8719)
## No Information Rate : 0.8944
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3731
##
## Mcnemar's Test P-Value : <0.0000000000000002
##
## Sensitivity : 0.52139
## Specificity : 0.90513
## Pos Pred Value : 0.39354
## Neg Pred Value : 0.94123
## Prevalence : 0.10560
## Detection Rate : 0.05506
## Detection Prevalence : 0.13991
## Balanced Accuracy : 0.71326
##
## 'Positive' Class : Yes
##
The 0.15 probability cut-off point was used to classify observations.
The results of KNN in the training sample:
• Sensitivity = 54.6%
• Specificity = 90.7%
• Balanced accuracy = 72.6%
Results in the test sample:
• Sensitivity = 52.1%
• Specificity = 90.5%
• Balanced accuracy = 71.3%
Out of the three models presented in this report, logistic regression did the best job of identifying true positives and has the best balanced accuracy. The results of the kNN model are also satisfactory, although it is not as good as logistic regression at detecting true positives (its sensitivity on the test sample was more than 4 percentage points lower). Lastly, the decision tree, while having the highest specificity (the best ratio of correctly predicted negatives), did much worse at predicting the positive class (around 8 percentage points below logistic regression).
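As a compact, threshold-free comparison of the three models on the test sample, one could also compute the area under the ROC curve for each set of predicted probabilities; a brief sketch using the ROCR package loaded earlier:
# AUC on the test set for each model's predicted probability of "Yes"
auc_of <- function(probs, labels) {
  performance(prediction(probs, labels), measure = "auc")@y.values[[1]]
}
auc_of(tree_fitted_test[, "Yes"], test$y)         # decision tree
auc_of(logit_test_fitted[, "Yes"], test$y)        # logistic regression
auc_of(test_knn_tuned_forecast[, "Yes"], test$y)  # k-nearest neighbours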
All things considered, if we were to choose a model for this Portuguese bank to predict whether customers will subscribe to a term deposit, we would opt for the one obtained with logistic regression. There are other algorithms that could potentially bring even better results than those presented in this paper. For example, we tried a Support Vector Machine, which turned out to be too computationally expensive for our hardware, but it could be used by someone with a faster CPU and more RAM. Random forests, gradient boosting or the aforementioned SVM are methods worth trying for researchers aiming for the best possible predictions in the future.
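For completeness, a sketch of how such models could be plugged into the same caret workflow (method names come from caret; these models were not evaluated for this paper):
# possible extensions, reusing the repeated cross-validation setup from above
rf_model <- train(y ~ ., data = train, method = "rf",
                  trControl = ctrl_cv5x3a, metric = "ROC")
gbm_model <- train(y ~ ., data = train, method = "gbm",
                   trControl = ctrl_cv5x3a, metric = "ROC", verbose = FALSE)
svm_model <- train(y ~ ., data = train, method = "svmRadial",
                   trControl = ctrl_cv5x3a, metric = "ROC")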
Social and economic context attributes
Since some of the values were interpreted as months in Excel, we have to decode them back into numbers. Next, we convert all the variables from this group to numeric, and then we check the correlations among them.
Since euribor3m is strongly correlated (almost perfectly collinear) with emp.var.rate and nr.employed, we remove it from the further analysis.
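A sketch of these steps, assuming that "01.kwi" (1 April in a Polish locale) is the value 1.4 that Excel turned into a date:
# decode the Excel-mangled level back to a number (assumption: "01.kwi" = 1.4)
bank$emp.var.rate <- as.character(bank$emp.var.rate)
bank$emp.var.rate[bank$emp.var.rate == "01.kwi"] <- "1.4"
# convert the socio-economic variables from factor/character to numeric
econ_vars <- c("emp.var.rate", "cons.price.idx", "cons.conf.idx",
               "euribor3m", "nr.employed")
bank[econ_vars] <- lapply(bank[econ_vars], function(x) as.numeric(as.character(x)))
round(cor(bank[econ_vars]), 2)  # euribor3m moves almost in lockstep with emp.var.rate and nr.employed
bank <- select(bank, -euribor3m)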