1 Background and Justifications for Multilevel model

For Assessment Task 2 (AT2) Data Scientist for Social Good (DSSG) created a logistic regression model of Unemployment in an attempt to answer the following research questions Rivera et al. (2020)¹

What Socio-Economic, demographic and geospatial factors should be included in an unemployment insurance premium model?
- In what ways does unemployment differ between urban, regional and remote geographies?
- How do Socio-Economic variables (e.g. levels of educational attainment) relate to unemployment?
- What is the nature of the relationship between demographic factors (e.g. age and Sex) and unemployment?

Having used logistic regression in AT2, DSGS Rivera et al. (2020) assumed the following:

Independence of Observations.
Constant probability of success across trials.
Linear relationships between explanatory variables and the Log-Odds expression of the probability of unemployment.

Our original findings were that all these factors were significant and should be included in unemployment insurance premium models. However, the model did not take into account the hierarchical nature of the data, which actually violates the assumption of independence of observations. It is on this basis that I conduct my new analysis.

1.1 New analysis and research questions

Using the original data from AT2, I will investigate the extent that this nested structure influences the model by implementing the use of three model types, then reviewing the differences in output and revisiting the conclusions that were drawn from the AT2 model.

To do this I will investigate Level 1 and Level 2 variables separately then together by using the following models:

Binary Logistic Regression model to explore the level 1 variables explicitly
Binomial Logistic Regression model to explore level 2 variables explicitly
Multilevel Binary Logistic Regression Model to explore the level 1 and level 2 variables accounting for their nested structure

To answer the following research questions

What are the effects of age, sex, education and indigenous status on whether a person is unemployed if we ignore the hierarchical structure of the data?
What is the effect of Index of Education and Occupation (IEO), Remoteness and whether the area is a capital city on the proportion of unemployment if we ignore the hierarchical structure of the data?
Accounting for the hierarchical structure of the data, what are the effects of age, sex, education and indigenous status (Level 1 variables) and of IEO, Remoteness and whether the area is a capital city (Level 2 variables) on whether a person is unemployed?

2 Analysis and Models

The process which I will use in the analysis is as follows:

Load and modify data
Create Binary Logistic Regression Model
1. Explore data
2. Fit Binary Logistic Regression Model to data
3. Interpretation of Binary Logistic Model output
4. Visualisation of parameter Effects
5. Model Evaluation
  - Likelihood Ratio Test (LRT) using Analysis of Variance (ANOVA)
  - Akaike information criterion (AIC)
  - Correct Classification Rate (using Confusion Matrix)
  - Area under the Curve (AUC)
Create Binomial Logistic Regression Model
1. Explore data
2. Fit Binomial Logistic Regression Model
3. Interpretation of Binomial Logistic Model output
4. Visualisation of Effects
Create Multilevel Binary Logistic Regression Model
1. Explore data
2. Centre variables
3. Fit NULL model and step-wise selection to get to Maximal Model
4. Visualisation of Parameter Effects
5. Model Evaluation
  - LRT
  - AIC

2.1 Data

The data being used for this model will be the same as used for the previous model in AT2.

Please note that there added noise in the form of a random adjustment to labour force figures the ABS applies in small samples in order to manage privacy concerns which can make the data unreliable(O’Keefe 2008; Leaver 2009 in Lee et al., 2017).²

2.1.1 Data transformation

Starting with the original dataset from AT2, I have kept the transformations fairly consistent with what was done originally.

Modify datatypes from char to factor
Check for and remove
1. missing data
2. duplicate data
3. rows with small counts (>20) which corresponds to peoples aged 60 and over, to account of limitation of Cross-Table approach
4. NaN ratios and ratios over 100% - caused by addition of “noise” by Australian Bureau of Statistics (ABS)
Feature engineering of
1. “Capital” feature
2. “sa4id” feature (numeric), used in plotting

2.2 Binary Logistic Regression Model

To answer my first research question, I fit just the level 1 variables to a Binary Logistic Regression Model and see how it performs.

2.2.1 Explore data

Having a look at the data grouped by the different variables.

2.2.1.1 Explore by age

Younger people are more likely than older people to be unemployed with almost 20% being under 20.

use_dataset%>%
  group_by(age) %>%
  summarise(unemployment = (sum(unemployment)/sum(labour_force)))

## # A tibble: 5 x 2
##   age         unemployment
##   <fct>              <dbl>
## 1 15-19 years       0.199 
## 2 20-29 years       0.0929
## 3 30-39 years       0.0534
## 4 40-49 years       0.0498
## 5 50-59 years       0.0494

2.2.1.2 Explore by sex

It seems that unemployment differs slightly between males and females, with more males being unemployed.

## # A tibble: 2 x 2
##   sex    unemployment
##   <fct>         <dbl>
## 1 Female       0.0693
## 2 Male         0.0713

2.2.1.3 Explore by education

More people who did not finish a high school education are unemployed.

## # A tibble: 2 x 2
##   education       unemployment
##   <fct>                  <dbl>
## 1 hs_finished           0.0599
## 2 hs_not_finished       0.0921

2.2.1.4 Explore by Indigenous status

Those with an Indigenous status are more likely to be unemployed at 17.2% than those who are non-indigenous 6.83%.

## # A tibble: 2 x 2
##   ingp           unemployment
##   <fct>                 <dbl>
## 1 Indigenous           0.172 
## 2 Non-Indigenous       0.0683

2.2.2 Fit Binary Logistic Regression Model to data

Fitting age, sex, education and indigenous variables to the Binary Logistic Model.

Model_Binary <- glm(formula = (unemployment/labour_force) ~ age + sex + education + ingp ,
  family = binomial(logit),
  weight = labour_force,
  data = use_dataset)

summary(Model_Binary)

## 
## Call:
## glm(formula = (unemployment/labour_force) ~ age + sex + education + 
##     ingp, family = binomial(logit), data = use_dataset, weights = labour_force)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -19.8147   -2.9320   -0.9439    1.4037   22.4383  
## 
## Coefficients:
##                           Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)              -0.871637   0.006953 -125.358  < 2e-16 ***
## age20-29 years           -0.733454   0.003906 -187.797  < 2e-16 ***
## age30-39 years           -1.328883   0.004244 -313.141  < 2e-16 ***
## age40-49 years           -1.463136   0.004263 -343.252  < 2e-16 ***
## age50-59 years           -1.537727   0.004402 -349.322  < 2e-16 ***
## sexMale                   0.013593   0.002488    5.464 4.64e-08 ***
## educationhs_not_finished  0.443012   0.002644  167.528  < 2e-16 ***
## ingpNon-Indigenous       -0.816771   0.006125 -133.360  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 294674  on 3515  degrees of freedom
## Residual deviance:  67804  on 3508  degrees of freedom
## AIC: 86727
## 
## Number of Fisher Scoring iterations: 4

2.2.3 Interpretation of Binary Logistic Model output

From the summary output above:

Sex and education positively and significantly predicts unemployment while age and indigenous status negatively and significantly so.
Specifically, in comparison to being a female, males are more likely to be unemployed.
Having finished high schooling is less likely to result in unemployment.
Older people are less likely unemployed as are non-indigenous peoples

From this point on I will use the summ() function in the jtools package to interpret parameter estimates and calculate the exponentiation coefficient estimate.

summ(Model_Binary, exp = T) # set "exp = T" to show esponentiated estimates

Observations	3516
Dependent variable	(unemployment/labour_force)
Type	Generalized linear model
Family	binomial
Link	logit

χ²(7)	226870.33
Pseudo-R² (Cragg-Uhler)	1.00
Pseudo-R² (McFadden)	0.72
AIC	86727.43
BIC	86776.75

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	0.42	0.41	0.42	-125.36	0.00
age20-29 years	0.48	0.48	0.48	-187.80	0.00
age30-39 years	0.26	0.26	0.27	-313.14	0.00
age40-49 years	0.23	0.23	0.23	-343.25	0.00
age50-59 years	0.21	0.21	0.22	-349.32	0.00
sexMale	1.01	1.01	1.02	5.46	0.00
educationhs_not_finished	1.56	1.55	1.57	167.53	0.00
ingpNon-Indigenous	0.44	0.44	0.45	-133.36	0.00
Standard errors: MLE

The interpretation of the parameter estimates is linked to the odds rather than probabilities.

In this analysis, assuming everything else stays the constant:

Being aged 20-29 lowers the odds of being unemployed by (1-0.48) = 52% and using the same calculation;
- Being aged 30-39 by 74%
- Being aged 40-49 by 77%
- Being aged 50-59 by 79%
Being a male increases the odds of being unemployed by 1%, in comparison to being female;
Not finishing high school increases the odds of being unemployed by (1 – 1.56)% = 56%, in comparison to having finished high school.

2.2.4 Visualisation of Parameter Effects

To make the interpretation of the parameter effects easier - I create a plot of all the effects. These y-scale reports probability of being unemployed, not the log odds.

plot(allEffects(Model_Binary),cex.main=.75, cex.lab=.1, cex.axis=0.1)

Assuming all other variables in the model are constant, we can see, that being male has a higher probability (~0.1133) of being unemployed than females at (~0.1119).

Likewise, having finished a high school education has a lower probability (~0.02) of being unemployed than not having high school education (~0.13), having indigenous status has higher probability of being unemployed (~0.16), than having non-indigenous status (~0.08) and finally, being younger (under 30) has a higher probability of being unemployed than being aged older:

15-19 (~0.26)
19-29 (~0.14)
30-39 (~0.05)
40-49 (~0.03)
50-59 (~0.02)

2.2.5 Model Evaluation

2.2.5.1 LRT

Using the LRT to look at I compare a model with only the ‘age’ variable against the full model using an ANOVA test to compare. The model with IEO_Decile, remoteness and predictors provide a significantly better fit to the data than does the model with only the age variable.

#specify a model with only the `age` variable
Model_Binary_Test <- glm(formula = (unemployment/labour_force) ~ age,
                         family = binomial,
                         weight = labour_force,
                         data = use_dataset)

#use the `anova()` function to run the likelihood ratio test
anova(Model_Binary_Test, Model_Binary, test ="Chisq")

## Analysis of Deviance Table
## 
## Model 1: (unemployment/labour_force) ~ age
## Model 2: (unemployment/labour_force) ~ age + sex + education + ingp
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      3511     114709                          
## 2      3508      67804  3    46905 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.2.5.2 AIC

Next I compare AIC between the Test model previously created and the Binary model to check for parsimonious and goodness of fit of a model.

The Binary model performs better here than the test as it has the smaller AIC value.

Model_Binary_Test$aic

## [1] 133626.8

Model_Binary$aic

## [1] 86727.43

2.2.5.3 Correct Classification Rate

To check the correct classification rate I use the same confusion matrix created in AT2 and predict the responses.

We can see that the model has a precision of 87.92%, but a recall of just 0.27% and F1 of 0.55% of all the observations.

However, a closer look reveals that the model predicts quite a high number of false negatives (~7%), meaning that people are predicted not being unemployed when they actually are.

Given that the majority people are not unemployed (i.e. in the labour force) , the model does not perform better in classification than simply assigning all observations to the majority not Unemployed.

Pred <- predict(Model_Binary, type = "response")
Pred <- if_else(Pred < mean(Pred), 1, 0)# less than mean is unemployed [1] and not unemployed [0]

#Confusion matrix (constructed)
use_dataset$employed <-  use_dataset$labour_force-use_dataset$unemployment
Pred_employed <-  use_dataset$labour_force-Pred
use_dataset$true_positives <- ifelse(Pred < use_dataset$unemployment, Pred, use_dataset$unemployment)
use_dataset$true_negatives <- ifelse(Pred_employed < use_dataset$employed, Pred_employed, use_dataset$employed)   
use_dataset$false_positives <- ifelse(Pred > use_dataset$unemployment, Pred - use_dataset$unemployment, 0)
use_dataset$false_negatives <- ifelse(Pred_employed > use_dataset$employed, Pred_employed - use_dataset$employed, 0)

use_dataset$control_positives <- (Pred == use_dataset$true_positives + use_dataset$false_positives)
use_dataset$control_negatives <- (Pred_employed == use_dataset$true_negatives + use_dataset$false_negatives)


recall <- sum(use_dataset$true_positives)/(sum(use_dataset$true_positives)+sum(use_dataset$false_negatives))
precision <- sum(use_dataset$true_positives)/(sum(use_dataset$true_positives)+sum(use_dataset$false_positives))
f_1 <- (( 2 * precision * recall) / (precision + recall))

 recall

## [1] 0.00277215

 precision

## [1] 0.8792651

f_1

## [1] 0.005526875

sum(use_dataset$true_positives)

## [1] 2010

sum(use_dataset$true_negatives)

## [1] 9581734

sum(use_dataset$false_positives)

## [1] 276

sum(use_dataset$false_negatives)

## [1] 723059

sum(use_dataset$control_positives)

## [1] 3516

sum(use_dataset$control_negatives)

## [1] 3516

sum(use_dataset$unemployment)

## [1] 725069

sum(use_dataset$labour_force)

## [1] 10307079

2.2.5.4 AUC

Finally, I use the Area Under the Curve as another measure of discrimination. According to Fang and va de Schoot (2019) “a good model should have an AUC score much higher than 0.50 (preferably higher than 0.80)”³

The model has an AUC score of 0.7969, meaning it does not discriminate well enough but is close.

Prob <- predict(Model_Binary, type="response")
Prob <- if_else(Prob < mean(Prob), 1, 0)
target <- (use_dataset$unemployment/use_dataset$labour_force)
target <- if_else(target < mean(target),1,0)

PredAUC <- prediction(Prob, target)
AUC <- performance(PredAUC, measure = "auc")
AUC <- AUC@y.values[[1]]
AUC

## [1] 0.7969204

2.3 Binomial Logistic Regression Model

Next I move onto the Binomial Logistic Regression Model to answer my second research question, fitting only the level 2 variables to it an assessing how it performs.

2.3.1 Explore data

use_dataset %>% 
  ggplot(aes(x = exp(IEO_Decile)/(1+exp(IEO_Decile)), y = unemployment/labour_force))+
  geom_point()+
  geom_smooth(method = "lm")+
  labs(title = "Index of Education and Occupation",
       x = "Inverse logit IEO_Decile",
       y = "Unemployment %")

## `geom_smooth()` using formula 'y ~ x'

This shows that the proportion of unemployment is negatively related to the inverse-logit of IEO_Decile. This is because the higher the IEO decile the more affluent and therefore the lower the expected unemployment.

use_dataset %>% 
  ggplot(aes(x = exp(remoteness_index)/(1+exp(remoteness_index)), y = unemployment/labour_force))+
  geom_point()+
  geom_smooth(method = "lm")+
  labs(title = "Remoteness Index",
       x = "Inverse logit Remoteness_Index",
       y = "Unemployment %")

## `geom_smooth()` using formula 'y ~ x'

The plot shows the proportion of unemployment is slightly positively related to the inverse-logit of Remoteness_Index. This is because the higher the remoteness index the more remote the area the higher expected unemployment.

2.3.2 Fit Binomial Logistic Regression Model

Model_Prop <- glm(formula = (unemployment/labour_force) ~ IEO_Decile + remoteness_index + capital,
                  family = binomial,
                  weight = labour_force,
                  data = use_dataset)
summary(Model_Prop)

## 
## Call:
## glm(formula = (unemployment/labour_force) ~ IEO_Decile + remoteness_index + 
##     capital, family = binomial, data = use_dataset, weights = labour_force)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -30.336   -2.797    0.951    5.239   43.768  
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.059044   0.005924 -347.56   <2e-16 ***
## IEO_Decile       -0.080352   0.000678 -118.51   <2e-16 ***
## remoteness_index -0.067628   0.002259  -29.94   <2e-16 ***
## capitalT          0.066607   0.003388   19.66   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 294674  on 3515  degrees of freedom
## Residual deviance: 279589  on 3512  degrees of freedom
## AIC: 298504
## 
## Number of Fisher Scoring iterations: 5

2.3.3 Interpretation of Binomial Logistic Model output

We know from the model summary above that the IEO_Decile of an SA4 is negatively related to the odds of person being unemployed and remoteness_index is slightly positively related and all variables seem significant.

summ(Model_Prop, exp = T)

Observations	3516
Dependent variable	(unemployment/labour_force)
Type	Generalized linear model
Family	binomial
Link	logit

χ²(3)	15085.32
Pseudo-R² (Cragg-Uhler)	0.99
Pseudo-R² (McFadden)	0.05
AIC	298504.44
BIC	298529.10

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	0.13	0.13	0.13	-347.56	0.00
IEO_Decile	0.92	0.92	0.92	-118.51	0.00
remoteness_index	0.93	0.93	0.94	-29.94	0.00
capitalT	1.07	1.06	1.08	19.66	0.00
Standard errors: MLE

We can see that with a SD increase in IEO_Decile, the odds of Unemployment is lowered by 1 – 92% = 8% and as remoteness_index increases the odds of Unemployment is lowered by 7% and capital city being True raises unemployment odds by 7%.

2.3.4 Visualisation of Effects

plot(allEffects(Model_Prop))

The plot above shows the expected influence of IEO_Decile, remoteness_index and Capital city status on the probability of unemployment.

Holding everything else constant:

As IEO_Decile increases, the probability unemployment lowers (from ~0.09 to ~0.05)
As remoteness_index increases, the probability of unemployment lowers (from ~0.074 to ~0.055)
As Capital moves from False to True, the probability of unemployment raises (from ~0.0685 to ~0.0725)

2.4 Multilevel Binary Logistic Regression Model

Now to look at all the level 1 or population-level variables and the SA4 level predictors together using a Multilevel Binary Logistic Regression Model.

2.4.1 Explore data

As always we first start with exploring the data.

2.4.1.1 Proportion of Unemployment per SA4

The plot show the proportions of unemployed persons across SA4 with a regression line and clearly there are vast differences between the SA4s.

use_dataset %>%
  group_by(sa4id) %>%
  summarise(PROP = (sum(unemployment,na.rm = TRUE)/sum(labour_force, na.rm = TRUE)*100)) %>%
  ggplot(aes(x= as.numeric(sa4id), y=PROP))+
  geom_point()+
  geom_smooth(method = "lm")+
  labs(title = "Proportion of Unemployment by SA4",
       subtitle = "With regression line",
       x = "SA4 Areas",
       y = "Unemployment %")

## `geom_smooth()` using formula 'y ~ x'

2.4.1.2 Plot of Unemployment vs. Labour Force

use_dataset %>%
ggplot(aes(x     = labour_force,
           y     = unemployment/labour_force,
           col   = sa4id,
           group = sa4id))+ #to add the colours for different SA4
  geom_point(size     = 1.2,
             alpha    = .8,
             position = "jitter")+ #to add some random noise for plotting purposes
  theme_minimal()+
  theme(legend.position = "none")+
  scale_color_gradientn(colours = rainbow(88))+
  geom_smooth(method = lm,
              se     = FALSE,
              size   = .5,
              alpha  = .8)+ # to add regression line
  labs(title    = "Linear Relationship Between Labour Force and Unemployment for 88 SA4",
       subtitle = "add colours for different SA4 and regression lines",
       x = "Labour Force",
       y = "Unemployment ratio")

## `geom_smooth()` using formula 'y ~ x'

#Zooming in to get better visual
use_dataset %>%
  filter(unemployment > 0) %>%
  ggplot(aes(x     = labour_force,
           y     = unemployment/labour_force,
           col   = sa4id,
           group = sa4id))+ #to add the colours for different SA4
  geom_point(size     = 1.2,
             alpha    = .8,
             position = "jitter")+ #to add some random noise for plotting purposes
  theme_minimal()+
  theme(legend.position = "none")+
  scale_color_gradientn(colours = rainbow(88))+
  geom_smooth(method = lm,
              se     = FALSE,
              size   = .5,
              alpha  = .8)+ # to add regression line
  coord_cartesian(xlim = c(1, 5000))+
  labs(title    = "Linear Relationship Between Labour Force and Unemployment for 88 SA4",
       subtitle = "add colours for different SA4 and regression lines; limited to Labour Force <5,000",
       x = "Labour Force",
       y = "Unemployment ratio")

## `geom_smooth()` using formula 'y ~ x'

#Extreme lines all data
f1(data = as.data.frame(use_dataset),
   x    = "labour_force",
   y    = "unemployment",
   grouping = "sa4id",
   n.highest = 3,
   n.lowest = 3) %>%
  ggplot()+
  geom_point(aes(x     = labour_force,
                 y     = unemployment/labour_force,
                 fill  = sa4id,
                 group = sa4id),
             size     =  1,
             alpha    = .5,
             position = "jitter",
             shape    = 21,
             col      = "white")+
  geom_smooth(aes(x     = labour_force,
                  y     = unemployment/labour_force,
                  col   = high_and_low,
                  group = sa4id,
                  size  = as.factor(high_and_low),
                  alpha = as.factor(high_and_low)),
              method = lm,
              se     = FALSE)+
  theme_minimal()+
  theme(legend.position = "none")+
  scale_fill_gradientn(colours = rainbow(88))+
  scale_color_manual(values=c("top"      = "blue",
                              "bottom"   = "red",
                              "none"     = "grey40"))+
  scale_size_manual(values=c("top"       = 1.2,
                             "bottom"   = 1.2,
                             "none"     = .5))+
  scale_alpha_manual(values=c("top"      = 1,
                              "bottom"    = 1,
                              "none"      =.3))+
  labs(title="Linear Relationship Between Labour Force and Unemployment for 88 SA4",
       subtitle="The 6 with the most extreme relationship have been highlighted red(low) and blue(high)",
       x = "Labour Force",
       y = "Unemployment ratio")

## Joining, by = "sa4id"
## `geom_smooth()` using formula 'y ~ x'

2.4.1.3 Relationship between sex and unemployment by SA4

We can also plot the relationship between sex and unemployment by SA4, (L1 Factor) to see whether the relationship between sex and unemployment varies by SA4, which it seems to.

use_dataset %>%
  group_by(sa4) %>%
  mutate(sex = if_else(sex == "Male",1,0)) %>%
  mutate(unemployment = if_else(unemployment >0,1,0)) %>%
  ggplot(aes(x = sex, y = unemployment, colour = as.factor(sa4))) +
  geom_point(alpha = .1, position = "jitter") +
  geom_smooth(method = "glm", se = FALSE,
              method.args = list(family = "binomial")) +
  theme(legend.position = "none") +
  scale_x_continuous(breaks = c(0,1)) +
  scale_y_continuous(breaks = c(0,1))

## `geom_smooth()` using formula 'y ~ x'

2.4.1.4 Relationship between education and unemployment by SA4

Using the same plotting technique as above I also plot the relationship between education and unemployment that also looks like there is variance between the different SA4s.

## `geom_smooth()` using formula 'y ~ x'

2.4.1.5 Relationship between indigenous and unemployment by SA4

Finally, I plot the relationship between indigenous status and unemployment.

## `geom_smooth()` using formula 'y ~ x'

2.4.2 Centre variables

According to Fang and van de Schoot (2019), centering of the predictors using an appropriate centering method is necessary before fitting to Multilevel Models.

Enders and Tofighi (2007) recommends use of within-cluster centering for the first-level predictors age, sex, education and indg and grand-mean centering for the second-level predictors IEO_Decile, Remoteness_Index and Capital.⁴

centered_dataset <- use_dataset %>%
  mutate(sex = if_else(sex == "Male",1,0),
         education = if_else(education == "hs_not_finished",1,0),
         ingp = if_else(ingp == "Indigenous",1,0)) %>%
  group_by(sa4) %>%
  mutate( sex = sex - mean(sex),
          education = education - mean(education),
          ingp = ingp - mean(ingp)) %>%
  ungroup() %>%
  mutate(capital = ifelse(capital == "F",1,0)) %>%
  mutate(IEO_Decile = IEO_Decile - mean(IEO_Decile, na.rm = TRUE),
         remoteness_index = remoteness_index - mean(remoteness_index, na.rm = TRUE),
         capital = capital - mean(capital))

head(centered_dataset)

## # A tibble: 6 x 21
##   sa4   age     sex education  ingp unemployment labour_force IEO_Decile
##   <chr> <fct> <dbl>     <dbl> <dbl>        <dbl>        <dbl>      <dbl>
## 1 Capi~ 15-1~   0.5      -0.5   0.5           13           66    0.00456
## 2 Capi~ 15-1~   0.5      -0.5  -0.5          110         1039    0.00456
## 3 Capi~ 15-1~   0.5       0.5   0.5           31          138    0.00456
## 4 Capi~ 15-1~   0.5       0.5  -0.5          262         1958    0.00456
## 5 Capi~ 15-1~  -0.5      -0.5   0.5            0           48    0.00456
## 6 Capi~ 15-1~  -0.5      -0.5  -0.5           98         1267    0.00456
## # ... with 13 more variables: population <dbl>, remoteness_index <dbl>,
## #   sa4id <dbl>, ratio <dbl>, city <chr>, capital <dbl>, employed <dbl>,
## #   true_positives <dbl>, true_negatives <dbl>, false_positives <dbl>,
## #   false_negatives <dbl>, control_positives <lgl>, control_negatives <lgl>

2.4.3 Fit Models

2.4.3.1 Intercepts Only model - NULL model

To assess the impact of the SA4 grouping structure I specify a null model and calculate the intra-class correlation (ICC), which is very low in this instance.

The ICC is non-zero which implies a non-independence in the observation Park, T., Sunhee ; Lake, T., Eileen (2005) ⁵

ModelNull <- glmer(formula = (unemployment/labour_force) ~ 1 + (1|sa4),
                    data = centered_dataset,
                    family = binomial(logit),
                    weight = labour_force,
                    control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun=2e5)))

summary(ModelNull)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: (unemployment/labour_force) ~ 1 + (1 | sa4)
##    Data: centered_dataset
## Weights: labour_force
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
## 
##       AIC       BIC    logLik  deviance  df.resid 
##  280186.2  280198.6 -140091.1  280182.2      3514 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -30.804  -2.041   1.121   6.393  52.022 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  sa4    (Intercept) 0.05819  0.2412  
## Number of obs: 3516, groups:  sa4, 88
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.61277    0.02576  -101.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

performance::icc(ModelNull)

## # Intraclass Correlation Coefficient
## 
##      Adjusted ICC: 0.017
##   Conditional ICC: 0.017

2.4.3.2 Step-wise build of other Models

Next I used a step-wise approach to build the best model, in the process producing the following models:

First level predictors
First and Second level predictors
First and Second level predictors with Random Slopes
First and Second Level Predictors with Random Slopes and Cross-level Interaction
- These were created to be less than Maximal with some terms excluded due to their significance being low
Maximal Model (Best)

2.4.3.2.1 First Level Predictors Model

model1 <- glmer(formula = (unemployment/labour_force) ~ 1 + age + sex + education + ingp + (1|sa4),
               data = centered_dataset,
               family = binomial,
               weight = labour_force,
               control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun=2e5)))


summary(model1)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: (unemployment/labour_force) ~ 1 + age + sex + education + ingp +  
##     (1 | sa4)
##    Data: centered_dataset
## Weights: labour_force
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
## 
##      AIC      BIC   logLik deviance df.resid 
##  57301.0  57356.5 -28641.5  57283.0     3507 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -14.1598  -2.0645  -0.6566   1.3994  19.0065 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  sa4    (Intercept) 0.05109  0.226   
## Number of obs: 3516, groups:  sa4, 88
## 
## Fixed effects:
##                 Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)    -1.092335   0.024465  -44.648  < 2e-16 ***
## age20-29 years -0.748700   0.003927 -190.644  < 2e-16 ***
## age30-39 years -1.344089   0.004264 -315.226  < 2e-16 ***
## age40-49 years -1.467990   0.004275 -343.421  < 2e-16 ***
## age50-59 years -1.537952   0.004416 -348.282  < 2e-16 ***
## sex             0.009753   0.002493    3.913 9.12e-05 ***
## education       0.449718   0.002727  164.905  < 2e-16 ***
## ingp            0.845556   0.006366  132.822  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) a20-2y a30-3y a40-4y a50-5y sex    eductn
## age20-29yrs -0.104                                          
## age30-39yrs -0.093  0.621                                   
## age40-49yrs -0.091  0.595  0.547                            
## age50-59yrs -0.085  0.554  0.512  0.510                     
## sex          0.003 -0.048 -0.051 -0.029 -0.020              
## education   -0.020  0.206  0.175  0.086 -0.009 -0.091       
## ingp         0.112 -0.012  0.006  0.015  0.034  0.007 -0.063

summ(model1, exp = T)

Observations	3516
Dependent variable	(unemployment/labour_force)
Type	Mixed effects generalized linear model
Family	binomial
Link	logit

AIC	57301.04
BIC	57356.53
Pseudo-R² (fixed effects)	0.15
Pseudo-R² (total)	0.16

Fixed Effects
	exp(Est.)	S.E.	z val.	p
(Intercept)	0.34	0.02	-44.65	0.00
age20-29 years	0.47	0.00	-190.64	0.00
age30-39 years	0.26	0.00	-315.23	0.00
age40-49 years	0.23	0.00	-343.42	0.00
age50-59 years	0.21	0.00	-348.28	0.00
sex	1.01	0.00	3.91	0.00
education	1.57	0.00	164.90	0.00
ingp	2.33	0.01	132.82	0.00

Random Effects
Group	Parameter	Std. Dev.
sa4	(Intercept)	0.23

Grouping Variables
Group	# groups	ICC
sa4	88	0.02

2.4.3.2.2 First and Second level predictors

model2 <- glmer(formula = (unemployment/labour_force) ~ 1 + age + sex + education + ingp + IEO_Decile + remoteness_index + capital + (1|sa4),
               data = centered_dataset,
               family = binomial,
               weight = labour_force,
               control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun=2e5)))

summary(model2)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: (unemployment/labour_force) ~ 1 + age + sex + education + ingp +  
##     IEO_Decile + remoteness_index + capital + (1 | sa4)
##    Data: centered_dataset
## Weights: labour_force
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
## 
##      AIC      BIC   logLik deviance df.resid 
##  57283.6  57357.6 -28629.8  57259.6     3504 
## 
## Scaled residuals: 
##      Min       1Q   Median       3Q      Max 
## -14.1648  -2.0652  -0.6572   1.3983  19.0125 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  sa4    (Intercept) 0.0391   0.1977  
## Number of obs: 3516, groups:  sa4, 88
## 
## Fixed effects:
##                   Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)      -1.092149   0.021499  -50.801  < 2e-16 ***
## age20-29 years   -0.748682   0.003927 -190.639  < 2e-16 ***
## age30-39 years   -1.344070   0.004264 -315.222  < 2e-16 ***
## age40-49 years   -1.467977   0.004275 -343.418  < 2e-16 ***
## age50-59 years   -1.537932   0.004416 -348.278  < 2e-16 ***
## sex               0.009751   0.002493    3.912 9.16e-05 ***
## education         0.449651   0.002727  164.866  < 2e-16 ***
## ingp              0.845760   0.006367  132.842  < 2e-16 ***
## IEO_Decile       -0.060278   0.012965   -4.649 3.33e-06 ***
## remoteness_index -0.101105   0.030733   -3.290    0.001 ** 
## capital          -0.090771   0.062532   -1.452    0.147    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) a20-2y a30-3y a40-4y a50-5y sex    eductn ingp   IEO_Dc
## age20-29yrs -0.118                                                        
## age30-39yrs -0.106  0.621                                                 
## age40-49yrs -0.104  0.595  0.547                                          
## age50-59yrs -0.097  0.554  0.512  0.510                                   
## sex          0.004 -0.048 -0.051 -0.029 -0.020                            
## education   -0.023  0.206  0.174  0.086 -0.009 -0.091                     
## ingp         0.128 -0.012  0.006  0.015  0.034  0.007 -0.063              
## IEO_Decile  -0.001  0.000  0.001  0.000  0.000  0.000  0.007  0.001       
## remtnss_ndx  0.000 -0.002 -0.002 -0.001 -0.002  0.000 -0.001 -0.015  0.243
## capital      0.000  0.002  0.002  0.001  0.001  0.001 -0.003  0.002  0.497
##             rmtns_
## age20-29yrs       
## age30-39yrs       
## age40-49yrs       
## age50-59yrs       
## sex               
## education         
## ingp              
## IEO_Decile        
## remtnss_ndx       
## capital     -0.402

summ(model2, exp = T)

Observations	3516
Dependent variable	(unemployment/labour_force)
Type	Mixed effects generalized linear model
Family	binomial
Link	logit

AIC	57283.63
BIC	57357.61
Pseudo-R² (fixed effects)	0.15
Pseudo-R² (total)	0.16

Fixed Effects
	exp(Est.)	S.E.	z val.	p
(Intercept)	0.34	0.02	-50.80	0.00
age20-29 years	0.47	0.00	-190.64	0.00
age30-39 years	0.26	0.00	-315.22	0.00
age40-49 years	0.23	0.00	-343.42	0.00
age50-59 years	0.21	0.00	-348.28	0.00
sex	1.01	0.00	3.91	0.00
education	1.57	0.00	164.87	0.00
ingp	2.33	0.01	132.84	0.00
IEO_Decile	0.94	0.01	-4.65	0.00
remoteness_index	0.90	0.03	-3.29	0.00
capital	0.91	0.06	-1.45	0.15

Random Effects
Group	Parameter	Std. Dev.
sa4	(Intercept)	0.20

Grouping Variables
Group	# groups	ICC
sa4	88	0.01

2.4.3.2.3 First and Second Level Predictors with Random Slopes

First and Second Level Predictors with Random Slopes all predictor variables L1 will have random slopes

model3 <- glmer(formula = (unemployment/labour_force) ~ 1 + age + sex + education + ingp + IEO_Decile + remoteness_index + capital + (1 + age + sex + education + ingp |sa4),
               data = centered_dataset,
               family = binomial(logit),
               weight = labour_force,
               control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun=2e5)))
summary(model3)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: (unemployment/labour_force) ~ 1 + age + sex + education + ingp +  
##     IEO_Decile + remoteness_index + capital + (1 + age + sex +  
##     education + ingp | sa4)
##    Data: centered_dataset
## Weights: labour_force
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
## 
##      AIC      BIC   logLik deviance df.resid 
##  40584.4  40874.2 -20245.2  40490.4     3469 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -9.7108 -1.3993 -0.1471  1.4766 12.5805 
## 
## Random effects:
##  Groups Name           Variance Std.Dev. Corr                               
##  sa4    (Intercept)    0.08592  0.2931                                      
##         age20-29 years 0.04640  0.2154    0.01                              
##         age30-39 years 0.08615  0.2935    0.03  0.93                        
##         age40-49 years 0.05335  0.2310    0.05  0.85  0.94                  
##         age50-59 years 0.04848  0.2202   -0.01  0.74  0.81  0.94            
##         sex            0.01251  0.1119    0.36  0.26  0.17  0.15  0.13      
##         education      0.02719  0.1649    0.32  0.64  0.69  0.62  0.46  0.11
##         ingp           0.33972  0.5829    0.56  0.55  0.67  0.66  0.52  0.17
##       
##       
##       
##       
##       
##       
##       
##       
##   0.76
## Number of obs: 3516, groups:  sa4, 88
## 
## Fixed effects:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.30452    0.03190 -40.894  < 2e-16 ***
## age20-29 years   -0.70396    0.02347 -29.992  < 2e-16 ***
## age30-39 years   -1.27264    0.03173 -40.110  < 2e-16 ***
## age40-49 years   -1.42206    0.02515 -56.541  < 2e-16 ***
## age50-59 years   -1.49930    0.02406 -62.309  < 2e-16 ***
## sex               0.02705    0.01231   2.198    0.028 *  
## education         0.46765    0.01789  26.135  < 2e-16 ***
## ingp              0.58186    0.06307   9.226  < 2e-16 ***
## IEO_Decile       -0.09077    0.01482  -6.124 9.15e-10 ***
## remoteness_index -0.26011    0.04024  -6.464 1.02e-10 ***
## capital          -0.08340    0.06276  -1.329    0.184    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) a20-2y a30-3y a40-4y a50-5y sex    eductn ingp   IEO_Dc
## age20-29yrs -0.009                                                        
## age30-39yrs  0.015  0.918                                                 
## age40-49yrs  0.027  0.841  0.929                                          
## age50-59yrs -0.026  0.731  0.800  0.922                                   
## sex          0.343  0.241  0.161  0.137  0.119                            
## education    0.302  0.627  0.676  0.603  0.441  0.097                     
## ingp         0.572  0.534  0.654  0.635  0.500  0.158  0.733              
## IEO_Decile  -0.002 -0.001 -0.004 -0.001 -0.001  0.002  0.002 -0.002       
## remtnss_ndx  0.007 -0.001  0.002  0.002 -0.001 -0.009 -0.002  0.004 -0.106
## capital     -0.002  0.004  0.001  0.001  0.002 -0.002  0.002  0.000  0.387
##             rmtns_
## age20-29yrs       
## age30-39yrs       
## age40-49yrs       
## age50-59yrs       
## sex               
## education         
## ingp              
## IEO_Decile        
## remtnss_ndx       
## capital     -0.254

summ(model3)

Observations	3516
Dependent variable	(unemployment/labour_force)
Type	Mixed effects generalized linear model
Family	binomial
Link	logit

AIC	40584.42
BIC	40874.17
Pseudo-R² (fixed effects)	0.13
Pseudo-R² (total)	0.18

Fixed Effects
	Est.	S.E.	z val.	p
(Intercept)	-1.30	0.03	-40.89	0.00
age20-29 years	-0.70	0.02	-29.99	0.00
age30-39 years	-1.27	0.03	-40.11	0.00
age40-49 years	-1.42	0.03	-56.54	0.00
age50-59 years	-1.50	0.02	-62.31	0.00
sex	0.03	0.01	2.20	0.03
education	0.47	0.02	26.13	0.00
ingp	0.58	0.06	9.23	0.00
IEO_Decile	-0.09	0.01	-6.12	0.00
remoteness_index	-0.26	0.04	-6.46	0.00
capital	-0.08	0.06	-1.33	0.18

Random Effects
Group	Parameter	Std. Dev.
sa4	(Intercept)	0.29
sa4	age20-29 years	0.22
sa4	age30-39 years	0.29
sa4	age40-49 years	0.23
sa4	age50-59 years	0.22
sa4	sex	0.11
sa4	education	0.16
sa4	ingp	0.58

Grouping Variables
Group	# groups	ICC
sa4	88	0.03

2.4.3.2.4 First and Second Level Predictors with Random Slopes and Cross-level Interaction

All interaction terms are included as fixed effects.

#Minus non-significant interactions
# sex:IEO_Decile
# sex:capital
# education:IEO_Decile
# education:remoteness_index
# ingp:capital
# IEO_Decile:remoteness_index
# IEO_Decile:capital
# remoteness_index:capital


max_model1 <- glmer(formula = (unemployment/labour_force) ~ 1 + age + sex + education + ingp + IEO_Decile + remoteness_index + capital + age:sex + age:education + age:ingp + age:IEO_Decile + age:remoteness_index + age:capital + sex:education + sex:ingp + sex:remoteness_index + education:ingp + education:capital + ingp:IEO_Decile + ingp:remoteness_index + (1 + age + sex + education + ingp |sa4),
                   data = centered_dataset,
                   family = binomial,
                   weight = labour_force,
                   control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun=2e5)))

# > summary(max_model1)
print(max_model1, correlation = TRUE)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: (unemployment/labour_force) ~ 1 + age + sex + education + ingp +  
##     IEO_Decile + remoteness_index + capital + age:sex + age:education +  
##     age:ingp + age:IEO_Decile + age:remoteness_index + age:capital +  
##     sex:education + sex:ingp + sex:remoteness_index + education:ingp +  
##     education:capital + ingp:IEO_Decile + ingp:remoteness_index +  
##     (1 + age + sex + education + ingp | sa4)
##    Data: centered_dataset
## Weights: labour_force
##       AIC       BIC    logLik  deviance  df.resid 
##  31063.61  31544.49 -15453.81  30907.61      3438 
## Random effects:
##  Groups Name           Std.Dev. Corr                                     
##  sa4    (Intercept)    0.2645                                            
##         age20-29 years 0.1328   -0.26                                    
##         age30-39 years 0.1574   -0.28  0.84                              
##         age40-49 years 0.1387   -0.28  0.65  0.85                        
##         age50-59 years 0.1576   -0.29  0.51  0.68  0.91                  
##         sex            0.1079    0.28  0.22  0.15  0.09  0.01            
##         education      0.1245    0.19  0.31  0.37  0.27  0.16  0.01      
##         ingp           0.3371    0.59 -0.16  0.00  0.03  0.01  0.01  0.51
## Number of obs: 3516, groups:  sa4, 88
## Fixed Effects:
##                     (Intercept)                   age20-29 years  
##                        -1.34805                         -0.61512  
##                  age30-39 years                   age40-49 years  
##                        -1.16653                         -1.37916  
##                  age50-59 years                              sex  
##                        -1.56168                          0.27311  
##                       education                             ingp  
##                         0.40345                          0.44635  
##                      IEO_Decile                 remoteness_index  
##                        -0.07618                         -0.11078  
##                         capital               age20-29 years:sex  
##                        -0.12491                         -0.15423  
##              age30-39 years:sex               age40-49 years:sex  
##                        -0.48334                         -0.36318  
##              age50-59 years:sex         age20-29 years:education  
##                        -0.08801                          0.23960  
##        age30-39 years:education         age40-49 years:education  
##                         0.43599                          0.14942  
##        age50-59 years:education              age20-29 years:ingp  
##                        -0.10171                          0.20320  
##             age30-39 years:ingp              age40-49 years:ingp  
##                         0.17444                          0.11898  
##             age50-59 years:ingp        age20-29 years:IEO_Decile  
##                        -0.15711                         -0.01729  
##       age30-39 years:IEO_Decile        age40-49 years:IEO_Decile  
##                        -0.04049                         -0.02789  
##       age50-59 years:IEO_Decile  age20-29 years:remoteness_index  
##                        -0.02750                          0.02760  
## age30-39 years:remoteness_index  age40-49 years:remoteness_index  
##                         0.07448                          0.11061  
## age50-59 years:remoteness_index           age20-29 years:capital  
##                         0.13216                          0.15446  
##          age30-39 years:capital           age40-49 years:capital  
##                         0.08045                          0.04443  
##          age50-59 years:capital                    sex:education  
##                         0.05062                         -0.15131  
##                        sex:ingp             sex:remoteness_index  
##                         0.09205                          0.04419  
##                  education:ingp                education:capital  
##                         0.19543                          0.13585  
##                 ingp:IEO_Decile            ingp:remoteness_index  
##                        -0.05451                          0.37871

summ(max_model1)

Observations	3516
Dependent variable	(unemployment/labour_force)
Type	Mixed effects generalized linear model
Family	binomial
Link	logit

AIC	31063.61
BIC	31544.49
Pseudo-R² (fixed effects)	0.15
Pseudo-R² (total)	0.17

Fixed Effects
	Est.	S.E.	z val.	p
(Intercept)	-1.35	0.03	-45.70	0.00
age20-29 years	-0.62	0.02	-36.21	0.00
age30-39 years	-1.17	0.02	-58.71	0.00
age40-49 years	-1.38	0.02	-74.92	0.00
age50-59 years	-1.56	0.02	-73.73	0.00
sex	0.27	0.01	18.96	0.00
education	0.40	0.02	25.03	0.00
ingp	0.45	0.04	11.18	0.00
IEO_Decile	-0.08	0.02	-4.64	0.00
remoteness_index	-0.11	0.04	-2.77	0.01
capital	-0.12	0.07	-1.90	0.06
age20-29 years:sex	-0.15	0.01	-19.59	0.00
age30-39 years:sex	-0.48	0.01	-56.83	0.00
age40-49 years:sex	-0.36	0.01	-42.42	0.00
age50-59 years:sex	-0.09	0.01	-9.92	0.00
age20-29 years:education	0.24	0.01	28.82	0.00
age30-39 years:education	0.44	0.01	48.47	0.00
age40-49 years:education	0.15	0.01	16.78	0.00
age50-59 years:education	-0.10	0.01	-10.99	0.00
age20-29 years:ingp	0.20	0.02	10.63	0.00
age30-39 years:ingp	0.17	0.02	8.01	0.00
age40-49 years:ingp	0.12	0.02	5.29	0.00
age50-59 years:ingp	-0.16	0.03	-5.98	0.00
age20-29 years:IEO_Decile	-0.02	0.01	-2.04	0.04
age30-39 years:IEO_Decile	-0.04	0.01	-4.04	0.00
age40-49 years:IEO_Decile	-0.03	0.01	-2.99	0.00
age50-59 years:IEO_Decile	-0.03	0.01	-2.55	0.01
age20-29 years:remoteness_index	0.03	0.02	1.29	0.20
age30-39 years:remoteness_index	0.07	0.02	3.00	0.00
age40-49 years:remoteness_index	0.11	0.02	4.79	0.00
age50-59 years:remoteness_index	0.13	0.03	5.00	0.00
age20-29 years:capital	0.15	0.04	3.70	0.00
age30-39 years:capital	0.08	0.05	1.59	0.11
age40-49 years:capital	0.04	0.05	0.96	0.34
age50-59 years:capital	0.05	0.05	0.96	0.34
sex:education	-0.15	0.01	-27.67	0.00
sex:ingp	0.09	0.01	6.79	0.00
sex:remoteness_index	0.04	0.01	3.25	0.00
education:ingp	0.20	0.01	13.49	0.00
education:capital	0.14	0.03	5.23	0.00
ingp:IEO_Decile	-0.05	0.02	-2.95	0.00
ingp:remoteness_index	0.38	0.04	8.70	0.00

Random Effects
Group	Parameter	Std. Dev.
sa4	(Intercept)	0.26
sa4	age20-29 years	0.13
sa4	age30-39 years	0.16
sa4	age40-49 years	0.14
sa4	age50-59 years	0.16
sa4	sex	0.11
sa4	education	0.12
sa4	ingp	0.34

Grouping Variables
Group	# groups	ICC
sa4	88	0.02

2.4.3.2.5 Maximal Model

#Maximal model
max_model <- glmer(formula = (unemployment/labour_force) ~ 1 + age + sex + education + ingp + IEO_Decile + remoteness_index + capital + age:sex + age:education + age:ingp + age:IEO_Decile + age:remoteness_index + age:capital + sex:education + sex:ingp + sex:IEO_Decile + sex:remoteness_index + sex:capital + education:ingp + education:IEO_Decile + education:remoteness_index + education:capital + ingp:IEO_Decile + ingp:remoteness_index + ingp:capital + IEO_Decile:remoteness_index + IEO_Decile:capital + remoteness_index:capital + (1 + age + sex + education + ingp |sa4),
                data = centered_dataset,
                family = binomial,
                weight = labour_force,
                control = glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun=2e5)))

summary(max_model)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: (unemployment/labour_force) ~ 1 + age + sex + education + ingp +  
##     IEO_Decile + remoteness_index + capital + age:sex + age:education +  
##     age:ingp + age:IEO_Decile + age:remoteness_index + age:capital +  
##     sex:education + sex:ingp + sex:IEO_Decile + sex:remoteness_index +  
##     sex:capital + education:ingp + education:IEO_Decile + education:remoteness_index +  
##     education:capital + ingp:IEO_Decile + ingp:remoteness_index +  
##     ingp:capital + IEO_Decile:remoteness_index + IEO_Decile:capital +  
##     remoteness_index:capital + (1 + age + sex + education + ingp |      sa4)
##    Data: centered_dataset
## Weights: labour_force
## Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))
## 
##      AIC      BIC   logLik deviance df.resid 
##  31065.2  31595.4 -15446.6  30893.2     3430 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -5.1656 -1.1322 -0.1626  0.9884  9.5672 
## 
## Random effects:
##  Groups Name           Variance Std.Dev. Corr                               
##  sa4    (Intercept)    0.06894  0.2626                                      
##         age20-29 years 0.01750  0.1323   -0.32                              
##         age30-39 years 0.02466  0.1570   -0.34  0.84                        
##         age40-49 years 0.01917  0.1384   -0.32  0.65  0.85                  
##         age50-59 years 0.02479  0.1574   -0.31  0.51  0.68  0.91            
##         sex            0.01126  0.1061    0.28  0.24  0.16  0.10  0.02      
##         education      0.01444  0.1202    0.18  0.31  0.36  0.26  0.15  0.05
##         ingp           0.11211  0.3348    0.59 -0.16  0.00  0.02  0.00  0.02
##       
##       
##       
##       
##       
##       
##       
##       
##   0.50
## Number of obs: 3516, groups:  sa4, 88
## 
## Fixed effects:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -1.391059   0.041044 -33.892  < 2e-16 ***
## age20-29 years                  -0.615153   0.016947 -36.298  < 2e-16 ***
## age30-39 years                  -1.166469   0.019834 -58.813  < 2e-16 ***
## age40-49 years                  -1.379052   0.018388 -74.999  < 2e-16 ***
## age50-59 years                  -1.561554   0.021164 -73.782  < 2e-16 ***
## sex                              0.273169   0.014250  19.169  < 2e-16 ***
## education                        0.403317   0.015738  25.627  < 2e-16 ***
## ingp                             0.447214   0.039726  11.258  < 2e-16 ***
## IEO_Decile                      -0.062023   0.020401  -3.040 0.002364 ** 
## remoteness_index                -0.090782   0.054029  -1.680 0.092907 .  
## capital                         -0.036999   0.095931  -0.386 0.699730    
## age20-29 years:sex              -0.154198   0.007873 -19.586  < 2e-16 ***
## age30-39 years:sex              -0.483248   0.008505 -56.822  < 2e-16 ***
## age40-49 years:sex              -0.363113   0.008561 -42.413  < 2e-16 ***
## age50-59 years:sex              -0.087983   0.008871  -9.918  < 2e-16 ***
## age20-29 years:education         0.239256   0.008315  28.775  < 2e-16 ***
## age30-39 years:education         0.435571   0.008996  48.419  < 2e-16 ***
## age40-49 years:education         0.149005   0.008906  16.732  < 2e-16 ***
## age50-59 years:education        -0.101977   0.009257 -11.016  < 2e-16 ***
## age20-29 years:ingp              0.202492   0.019122  10.589  < 2e-16 ***
## age30-39 years:ingp              0.174034   0.021788   7.988 1.38e-15 ***
## age40-49 years:ingp              0.118545   0.022474   5.275 1.33e-07 ***
## age50-59 years:ingp             -0.156806   0.026255  -5.972 2.34e-09 ***
## age20-29 years:IEO_Decile       -0.019079   0.009012  -2.117 0.034247 *  
## age30-39 years:IEO_Decile       -0.043919   0.010630  -4.131 3.61e-05 ***
## age40-49 years:IEO_Decile       -0.030276   0.009454  -3.202 0.001363 ** 
## age50-59 years:IEO_Decile       -0.029536   0.010683  -2.765 0.005696 ** 
## age20-29 years:remoteness_index  0.035402   0.022359   1.583 0.113336    
## age30-39 years:remoteness_index  0.085290   0.026119   3.265 0.001093 ** 
## age40-49 years:remoteness_index  0.117858   0.023429   5.030 4.89e-07 ***
## age50-59 years:remoteness_index  0.137253   0.026321   5.215 1.84e-07 ***
## age20-29 years:capital           0.145727   0.043616   3.341 0.000834 ***
## age30-39 years:capital           0.062689   0.051414   1.219 0.222731    
## age40-49 years:capital           0.031674   0.045740   0.692 0.488631    
## age50-59 years:capital           0.039252   0.051650   0.760 0.447279    
## sex:education                   -0.151351   0.005473 -27.655  < 2e-16 ***
## sex:ingp                         0.092438   0.013560   6.817 9.28e-12 ***
## sex:IEO_Decile                   0.007782   0.007122   1.093 0.274581    
## sex:remoteness_index             0.036466   0.017314   2.106 0.035193 *  
## sex:capital                      0.053809   0.034460   1.561 0.118408    
## education:ingp                   0.193440   0.014522  13.320  < 2e-16 ***
## education:IEO_Decile            -0.010792   0.008068  -1.338 0.181020    
## education:remoteness_index       0.027569   0.019489   1.415 0.157177    
## education:capital                0.074389   0.038864   1.914 0.055609 .  
## ingp:IEO_Decile                 -0.066946   0.023180  -2.888 0.003875 ** 
## ingp:remoteness_index            0.410088   0.052721   7.779 7.34e-15 ***
## ingp:capital                    -0.065520   0.108296  -0.605 0.545172    
## IEO_Decile:remoteness_index      0.065031   0.026643   2.441 0.014653 *  
## IEO_Decile:capital              -0.065742   0.033461  -1.965 0.049446 *  
## remoteness_index:capital         0.243039   0.110858   2.192 0.028355 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
## Correlation matrix not shown by default, as p = 50 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it

print(max_model, correlation = TRUE)

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: (unemployment/labour_force) ~ 1 + age + sex + education + ingp +  
##     IEO_Decile + remoteness_index + capital + age:sex + age:education +  
##     age:ingp + age:IEO_Decile + age:remoteness_index + age:capital +  
##     sex:education + sex:ingp + sex:IEO_Decile + sex:remoteness_index +  
##     sex:capital + education:ingp + education:IEO_Decile + education:remoteness_index +  
##     education:capital + ingp:IEO_Decile + ingp:remoteness_index +  
##     ingp:capital + IEO_Decile:remoteness_index + IEO_Decile:capital +  
##     remoteness_index:capital + (1 + age + sex + education + ingp |      sa4)
##    Data: centered_dataset
## Weights: labour_force
##       AIC       BIC    logLik  deviance  df.resid 
##  31065.18  31595.37 -15446.59  30893.18      3430 
## Random effects:
##  Groups Name           Std.Dev. Corr                                     
##  sa4    (Intercept)    0.2626                                            
##         age20-29 years 0.1323   -0.32                                    
##         age30-39 years 0.1570   -0.34  0.84                              
##         age40-49 years 0.1384   -0.32  0.65  0.85                        
##         age50-59 years 0.1574   -0.31  0.51  0.68  0.91                  
##         sex            0.1061    0.28  0.24  0.16  0.10  0.02            
##         education      0.1202    0.18  0.31  0.36  0.26  0.15  0.05      
##         ingp           0.3348    0.59 -0.16  0.00  0.02  0.00  0.02  0.50
## Number of obs: 3516, groups:  sa4, 88
## Fixed Effects:
##                     (Intercept)                   age20-29 years  
##                       -1.391059                        -0.615153  
##                  age30-39 years                   age40-49 years  
##                       -1.166469                        -1.379052  
##                  age50-59 years                              sex  
##                       -1.561554                         0.273169  
##                       education                             ingp  
##                        0.403317                         0.447214  
##                      IEO_Decile                 remoteness_index  
##                       -0.062023                        -0.090782  
##                         capital               age20-29 years:sex  
##                       -0.036999                        -0.154198  
##              age30-39 years:sex               age40-49 years:sex  
##                       -0.483248                        -0.363113  
##              age50-59 years:sex         age20-29 years:education  
##                       -0.087983                         0.239256  
##        age30-39 years:education         age40-49 years:education  
##                        0.435571                         0.149005  
##        age50-59 years:education              age20-29 years:ingp  
##                       -0.101977                         0.202492  
##             age30-39 years:ingp              age40-49 years:ingp  
##                        0.174034                         0.118545  
##             age50-59 years:ingp        age20-29 years:IEO_Decile  
##                       -0.156806                        -0.019079  
##       age30-39 years:IEO_Decile        age40-49 years:IEO_Decile  
##                       -0.043919                        -0.030276  
##       age50-59 years:IEO_Decile  age20-29 years:remoteness_index  
##                       -0.029536                         0.035402  
## age30-39 years:remoteness_index  age40-49 years:remoteness_index  
##                        0.085290                         0.117858  
## age50-59 years:remoteness_index           age20-29 years:capital  
##                        0.137253                         0.145727  
##          age30-39 years:capital           age40-49 years:capital  
##                        0.062689                         0.031674  
##          age50-59 years:capital                    sex:education  
##                        0.039252                        -0.151351  
##                        sex:ingp                   sex:IEO_Decile  
##                        0.092438                         0.007782  
##            sex:remoteness_index                      sex:capital  
##                        0.036466                         0.053809  
##                  education:ingp             education:IEO_Decile  
##                        0.193440                        -0.010792  
##      education:remoteness_index                education:capital  
##                        0.027569                         0.074389  
##                 ingp:IEO_Decile            ingp:remoteness_index  
##                       -0.066946                         0.410088  
##                    ingp:capital      IEO_Decile:remoteness_index  
##                       -0.065520                         0.065031  
##              IEO_Decile:capital         remoteness_index:capital  
##                       -0.065742                         0.243039

summ(max_model)

Observations	3516
Dependent variable	(unemployment/labour_force)
Type	Mixed effects generalized linear model
Family	binomial
Link	logit

AIC	31065.18
BIC	31595.37
Pseudo-R² (fixed effects)	0.15
Pseudo-R² (total)	0.17

Fixed Effects
	Est.	S.E.	z val.	p
(Intercept)	-1.39	0.04	-33.89	0.00
age20-29 years	-0.62	0.02	-36.30	0.00
age30-39 years	-1.17	0.02	-58.81	0.00
age40-49 years	-1.38	0.02	-75.00	0.00
age50-59 years	-1.56	0.02	-73.78	0.00
sex	0.27	0.01	19.17	0.00
education	0.40	0.02	25.63	0.00
ingp	0.45	0.04	11.26	0.00
IEO_Decile	-0.06	0.02	-3.04	0.00
remoteness_index	-0.09	0.05	-1.68	0.09
capital	-0.04	0.10	-0.39	0.70
age20-29 years:sex	-0.15	0.01	-19.59	0.00
age30-39 years:sex	-0.48	0.01	-56.82	0.00
age40-49 years:sex	-0.36	0.01	-42.41	0.00
age50-59 years:sex	-0.09	0.01	-9.92	0.00
age20-29 years:education	0.24	0.01	28.77	0.00
age30-39 years:education	0.44	0.01	48.42	0.00
age40-49 years:education	0.15	0.01	16.73	0.00
age50-59 years:education	-0.10	0.01	-11.02	0.00
age20-29 years:ingp	0.20	0.02	10.59	0.00
age30-39 years:ingp	0.17	0.02	7.99	0.00
age40-49 years:ingp	0.12	0.02	5.27	0.00
age50-59 years:ingp	-0.16	0.03	-5.97	0.00
age20-29 years:IEO_Decile	-0.02	0.01	-2.12	0.03
age30-39 years:IEO_Decile	-0.04	0.01	-4.13	0.00
age40-49 years:IEO_Decile	-0.03	0.01	-3.20	0.00
age50-59 years:IEO_Decile	-0.03	0.01	-2.76	0.01
age20-29 years:remoteness_index	0.04	0.02	1.58	0.11
age30-39 years:remoteness_index	0.09	0.03	3.27	0.00
age40-49 years:remoteness_index	0.12	0.02	5.03	0.00
age50-59 years:remoteness_index	0.14	0.03	5.21	0.00
age20-29 years:capital	0.15	0.04	3.34	0.00
age30-39 years:capital	0.06	0.05	1.22	0.22
age40-49 years:capital	0.03	0.05	0.69	0.49
age50-59 years:capital	0.04	0.05	0.76	0.45
sex:education	-0.15	0.01	-27.65	0.00
sex:ingp	0.09	0.01	6.82	0.00
sex:IEO_Decile	0.01	0.01	1.09	0.27
sex:remoteness_index	0.04	0.02	2.11	0.04
sex:capital	0.05	0.03	1.56	0.12
education:ingp	0.19	0.01	13.32	0.00
education:IEO_Decile	-0.01	0.01	-1.34	0.18
education:remoteness_index	0.03	0.02	1.41	0.16
education:capital	0.07	0.04	1.91	0.06
ingp:IEO_Decile	-0.07	0.02	-2.89	0.00
ingp:remoteness_index	0.41	0.05	7.78	0.00
ingp:capital	-0.07	0.11	-0.61	0.55
IEO_Decile:remoteness_index	0.07	0.03	2.44	0.01
IEO_Decile:capital	-0.07	0.03	-1.96	0.05
remoteness_index:capital	0.24	0.11	2.19	0.03

Random Effects
Group	Parameter	Std. Dev.
sa4	(Intercept)	0.26
sa4	age20-29 years	0.13
sa4	age30-39 years	0.16
sa4	age40-49 years	0.14
sa4	age50-59 years	0.16
sa4	sex	0.11
sa4	education	0.12
sa4	ingp	0.33

Grouping Variables
Group	# groups	ICC
sa4	88	0.02

2.4.4 Visualisation of Parameter Effects

From the Maximal Model we plot the interclass interactions that were significant to visualise them:

age:remoteness_index
age:IEO_Decile
age:capital
sex:remoteness_index
ingp:remoteness_index
ingp:IEO_Decile

In the plots it is clear that each of the interactions has an influence on the intercept of unemployment.

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

## `geom_smooth()` using formula 'y ~ x'

2.4.5 Model Evaluation

2.4.5.1 LRT

Once again using the ANOVA test we compare the different multilevel models.

Keeping in mind that a likelihood ratio test results (Pr(>Chisq) > 0.05) is considered insignificant and means that there is no significant improvement in model fit by adding any random slope terms, however comparing the max_model with models with less or no random slopes yields significant results. So I conclude the model does benefit from having all random slopes specified.

anova(model1, model2, test="Chisq") #winner 2, discard 1

## Data: centered_dataset
## Models:
## model1: (unemployment/labour_force) ~ 1 + age + sex + education + ingp + 
## model1:     (1 | sa4)
## model2: (unemployment/labour_force) ~ 1 + age + sex + education + ingp + 
## model2:     IEO_Decile + remoteness_index + capital + (1 | sa4)
##        npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)    
## model1    9 57301 57357 -28642    57283                         
## model2   12 57284 57358 -28630    57260 23.407  3  3.322e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(model2, model3, test="Chisq") #winner 3, discard 2

## Data: centered_dataset
## Models:
## model2: (unemployment/labour_force) ~ 1 + age + sex + education + ingp + 
## model2:     IEO_Decile + remoteness_index + capital + (1 | sa4)
## model3: (unemployment/labour_force) ~ 1 + age + sex + education + ingp + 
## model3:     IEO_Decile + remoteness_index + capital + (1 + age + sex + 
## model3:     education + ingp | sa4)
##        npar   AIC   BIC logLik deviance Chisq Df Pr(>Chisq)    
## model2   12 57284 57358 -28630    57260                        
## model3   47 40584 40874 -20245    40490 16769 35  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(model3, max_model, test="Chisq") #winner max_model, discard 3,

## Data: centered_dataset
## Models:
## model3: (unemployment/labour_force) ~ 1 + age + sex + education + ingp + 
## model3:     IEO_Decile + remoteness_index + capital + (1 + age + sex + 
## model3:     education + ingp | sa4)
## max_model: (unemployment/labour_force) ~ 1 + age + sex + education + ingp + 
## max_model:     IEO_Decile + remoteness_index + capital + age:sex + age:education + 
## max_model:     age:ingp + age:IEO_Decile + age:remoteness_index + age:capital + 
## max_model:     sex:education + sex:ingp + sex:IEO_Decile + sex:remoteness_index + 
## max_model:     sex:capital + education:ingp + education:IEO_Decile + education:remoteness_index + 
## max_model:     education:capital + ingp:IEO_Decile + ingp:remoteness_index + 
## max_model:     ingp:capital + IEO_Decile:remoteness_index + IEO_Decile:capital + 
## max_model:     remoteness_index:capital + (1 + age + sex + education + ingp | 
## max_model:     sa4)
##           npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)    
## model3      47 40584 40874 -20245    40490                         
## max_model   86 31065 31595 -15447    30893 9597.2 39  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(max_model, max_model1, test="Chisq") #winner max_model, discard max_model1,

## Data: centered_dataset
## Models:
## max_model1: (unemployment/labour_force) ~ 1 + age + sex + education + ingp + 
## max_model1:     IEO_Decile + remoteness_index + capital + age:sex + age:education + 
## max_model1:     age:ingp + age:IEO_Decile + age:remoteness_index + age:capital + 
## max_model1:     sex:education + sex:ingp + sex:remoteness_index + education:ingp + 
## max_model1:     education:capital + ingp:IEO_Decile + ingp:remoteness_index + 
## max_model1:     (1 + age + sex + education + ingp | sa4)
## max_model: (unemployment/labour_force) ~ 1 + age + sex + education + ingp + 
## max_model:     IEO_Decile + remoteness_index + capital + age:sex + age:education + 
## max_model:     age:ingp + age:IEO_Decile + age:remoteness_index + age:capital + 
## max_model:     sex:education + sex:ingp + sex:IEO_Decile + sex:remoteness_index + 
## max_model:     sex:capital + education:ingp + education:IEO_Decile + education:remoteness_index + 
## max_model:     education:capital + ingp:IEO_Decile + ingp:remoteness_index + 
## max_model:     ingp:capital + IEO_Decile:remoteness_index + IEO_Decile:capital + 
## max_model:     remoteness_index:capital + (1 + age + sex + education + ingp | 
## max_model:     sa4)
##            npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)  
## max_model1   78 31064 31545 -15454    30908                       
## max_model    86 31065 31595 -15447    30893 14.433  8    0.07115 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.4.5.2 AIC

By adding in random slopes the AIC does improve quite significantly with the difference from no random slopes to random slopes all being ~26,237. From this I can conclude that the model does benefit from having the random effects terms included.

The best AIC is from max_model1 which includes all random slope terms (age, sex, education and ingp) with removal of some non-significant interactions.

AIC(logLik(model1)) #L1 fixed effects only

## [1] 57301.04

AIC(logLik(model2)) #L1 & L2 fixed effects only

## [1] 57283.63

AIC(logLik(model3)) #All fixed and all random

## [1] 40584.42

AIC(logLik(max_model1)) #Almost maximal, removed non-signficant interactions (blank)

## [1] 31063.61

AIC(logLik(max_model)) #MAXIMAL MODEL

## [1] 31065.18

3 Conclusion

My findings from the analysis are that while the Binary Logistic Model indicated that all the level 1 variables were significant to unemployment it was clear from the AUC test that the predictive power of the model was slightly less than ideal.

Secondly, the Binomial Logistic Regression model again, indicated that all level 2 variables were significant in predicting unemployment.

And finally, the Multilevel Binary Logistic Regression Model confirmed that there were indeed intra-class interactions and that the model benefited from having these modeled along with the random effects.

In conclusion, there are likely more variables that could be added to the Multilevel Binary Logistic Model to further improve it but, it is definitely a viable further option to better explore when modelling data such as this, which has a hierarchical structure.

Rivera, A.M.J., Li, D., Iamsiri, D., Tin, H.p., Sampe, I., Zhou, K., & Cunningham, N. (2020). Accessible Unemployment Insurance. STDS AT2b. p7.↩
Lee, J.Y.L., Brown, J.J. & Ryan, L.M. 2017, ‘Sufficiency Revisited: Rethinking Statistical Algorithms in the Big Data Era’, The American Statistician, vol. 71, no. 3.↩
Fang, Q., & van de Schoot, R. (2019). Intro to Frequentist (Multilevel) Generalised Linear Models (GLM) in R with glm and lme4. Retrieved from https://www.rensvandeschoot.com/tutorials/generalised-linear-models-with-glm-and-lme4/↩
Enders, C. K., & Tofighi, D. (2007). Centering predictor variables in cross-sectional multilevel models: A new look at an old issue. Psychological Methods, 12(2), 121-138. doi:10.1037/1082-989X.12.2.121 ↩
Park, T., Sunhee ; Lake, T., Eileen(2005). Multilevel Modeling of a Clustered Continuous Outcome: Nurses’ Work Hours and Burnout. Nursing Research, 2005, Vol.54(6), pp.406-413↩

AT3 - once more around the blog

Deena Iamsiri

28/06/2020