Pls do not cite without permission since the results are under review. Results are stored here for reference. Data will be available upon request once results are published. Questions to love.borjeson at gmail.com
We will in this script use data on Swedish Governmental Agencies’ (GA:s) spending on Management Consultancy Services (MCS) and, with a little help from a multilevel regression test whereas this spending is affected by a) Chief Executive (CEO/GD) tenure, b) CEO’s previous CEO-experience and c) Organizational function of the GA. We will control for CEO’s age, PhD, business education and gender, and for the GA’s income.
The fitting is boxed in individual CEO:s, nested within their respective GA, and then crossed with calendar year.
prescript
#If you, like me, run your code in RStudio, this can save you 1.5 seconds :-):
library(rstudioapi)
setwd(dirname(rstudioapi::callFun("getActiveDocumentContext")$path))
script
(Now on with it)
#clean slate...:
rm(list = ls())
#Necessary packages:
library(lme4) #The package we use, lme4. An option is to use 'nlme'.
require(MuMIn) #To get Marginal and Conditional R2.
require(lmerTest) #To get the Satterthwaite approximations of p-values. In the latest version of lme4, p-values based on Sat.Approx. are reported simply using the summary command.
require(car) #for varance tests
library(sjPlot) #Forest plot for fixed effects.
#the whole sj-suit (checkout "strengejack" at github) is higly relevant for getting stuff out of mixed models, and for additional stat test, such as inter-class correlation tests.
#Fun to have:
require(pbkrtest) #To get the Kenward-Roger approximations of p-values.
require(AICcmodavg) #To get AICc
library(bbmle) #to get delta AICs and BICs
library(strengejacke) #will load the whole sj-suite
#To plot stuff, Hilarious to have?
library(plotly)#...and this one to generate interactive plots...
library(ggplot2)#...and this one to generate interactive plots...
library(ggExtra)#to adjust plots
library(lattice) #more (density) plots
#Get the data...
#We'll use
data1_full=read.csv("DataCompleteF.csv") #"DataCompleteF.csv" has all information. COFOG (organizational function) comes in three varieties:
#1. All first level classification plus second level for universities (09_4)
#2. All first level classification
#1. Only COFOG 09 and COFOG X (i.e. not 09)
#Data will be made publically aviable.
#First, we need to get rid of all NA's:
data1<-data1_full[complete.cases(data1_full),] #because when we compare competing models, we need them to be based on data of same size.
#To get a matching file in the extra-r world...:
write.csv(data1, "completeCasesSample.csv")
All variables included, no interactions.
m1 = lmer(Spendingln ~ Incomeln + Tenure + COFOG_1Level + Former_CE_exp_exp + Gender + PhD + Business_ed + Age +
(1|Agency/GD) + (1|Year), data=data1)
summary(m1) #This used to be way more complicated.
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula:
## Spendingln ~ Incomeln + Tenure + COFOG_1Level + Former_CE_exp_exp +
## Gender + PhD + Business_ed + Age + (1 | Agency/GD) + (1 | Year)
## Data: data1
##
## REML criterion at convergence: 1726.7
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.96546 -0.48122 0.04267 0.54389 3.06989
##
## Random effects:
## Groups Name Variance Std.Dev.
## GD:Agency (Intercept) 0.4301 0.6558
## Agency (Intercept) 1.4834 1.2179
## Year (Intercept) 0.0813 0.2851
## Residual 0.9890 0.9945
## Number of obs: 519, groups: GD:Agency, 156; Agency, 72; Year, 9
##
## Fixed effects:
## Estimate Std. Error df t value
## (Intercept) 2.645e+00 2.591e+00 1.134e+02 1.021
## Incomeln 5.050e-01 1.219e-01 1.085e+02 4.141
## TenureTenure_1 -8.935e-02 1.520e-01 4.004e+02 -0.588
## TenureTenure_2 4.372e-02 1.611e-01 4.285e+02 0.271
## TenureTenure_3 4.669e-01 1.783e-01 4.464e+02 2.620
## TenureTenure_4 4.625e-01 1.917e-01 4.439e+02 2.412
## TenureTenure_5 2.029e-01 2.123e-01 4.292e+02 0.956
## TenureTenure_6pl -4.555e-02 2.217e-01 2.682e+02 -0.205
## COFOG_1LevelCOFOG_02 1.414e+00 8.114e-01 5.627e+01 1.742
## COFOG_1LevelCOFOG_03 -4.417e-01 7.925e-01 5.418e+01 -0.557
## COFOG_1LevelCOFOG_04 9.315e-01 5.379e-01 5.983e+01 1.732
## COFOG_1LevelCOFOG_05 5.274e-01 7.685e-01 5.243e+01 0.686
## COFOG_1LevelCOFOG_09 -5.321e-01 5.284e-01 8.737e+01 -1.007
## COFOG_1LevelCOFOG_10 8.699e-01 7.553e-01 5.990e+01 1.152
## Former_CE_exp_expFormer_CE_exp 3.823e-01 2.014e-01 9.944e+01 1.899
## GenderMale 5.644e-02 1.921e-01 1.021e+02 0.294
## PhDPhD 2.161e-01 2.920e-01 8.616e+01 0.740
## Business_edBusiness_ed -2.747e-01 2.549e-01 1.433e+02 -1.078
## Age -3.635e-04 1.688e-02 1.579e+02 -0.022
## Pr(>|t|)
## (Intercept) 0.30958
## Incomeln 6.85e-05 ***
## TenureTenure_1 0.55688
## TenureTenure_2 0.78620
## TenureTenure_3 0.00911 **
## TenureTenure_4 0.01625 *
## TenureTenure_5 0.33977
## TenureTenure_6pl 0.83738
## COFOG_1LevelCOFOG_02 0.08692 .
## COFOG_1LevelCOFOG_03 0.57955
## COFOG_1LevelCOFOG_04 0.08848 .
## COFOG_1LevelCOFOG_05 0.49554
## COFOG_1LevelCOFOG_09 0.31677
## COFOG_1LevelCOFOG_10 0.25403
## Former_CE_exp_expFormer_CE_exp 0.06049 .
## GenderMale 0.76954
## PhDPhD 0.46130
## Business_edBusiness_ed 0.28295
## Age 0.98284
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The writers of lme4 are skeptical (rightly so) to p-values by asymptotic t-based Wald tests. Now, we get p-values directly with the “summary” command, based on Satterthwaite-approximated degrees of freedom so it is no longer an issue for the desperate user looking for a way to convince their audience. To expand a little and get get Kenward-Roger approximated degrees of freedom as well, first save the summary as a data frame:
coefs_m1 <- data.frame(coef(summary(m1)))
#get Kenward-Roger approximated degrees of freedom
df.KRF <- get_ddf_Lb(m1, fixef(m1))
#get p-values from the t-distribution using the t-values and approximated degrees of freedom..:
coefs_m1$p.KR <- 2 * (1 - pt(abs(coefs_m1$t.value), df.KRF))
coefs_m1 #and there we go!
## Estimate Std..Error df
## (Intercept) 2.6445493268 2.59099459 113.38215
## Incomeln 0.5049931863 0.12193993 108.53455
## TenureTenure_1 -0.0893499303 0.15196048 400.40591
## TenureTenure_2 0.0437178378 0.16107524 428.49764
## TenureTenure_3 0.4669391628 0.17825399 446.35617
## TenureTenure_4 0.4625455728 0.19173289 443.85635
## TenureTenure_5 0.2028831164 0.21229267 429.19386
## TenureTenure_6pl -0.0455504542 0.22172113 268.15921
## COFOG_1LevelCOFOG_02 1.4136770093 0.81138133 56.27400
## COFOG_1LevelCOFOG_03 -0.4417469904 0.79252205 54.17989
## COFOG_1LevelCOFOG_04 0.9315211452 0.53792040 59.82645
## COFOG_1LevelCOFOG_05 0.5274316959 0.76850743 52.42573
## COFOG_1LevelCOFOG_09 -0.5320590460 0.52842207 87.37164
## COFOG_1LevelCOFOG_10 0.8699050642 0.75532996 59.89960
## Former_CE_exp_expFormer_CE_exp 0.3823233221 0.20135410 99.44401
## GenderMale 0.0564366222 0.19212384 102.06470
## PhDPhD 0.2160970499 0.29201384 86.16216
## Business_edBusiness_ed -0.2747309293 0.25490670 143.33970
## Age -0.0003635201 0.01687585 157.89958
## t.value Pr...t.. p.KR
## (Intercept) 1.02066957 0.3095839050 3.094391e-01
## Incomeln 4.14132741 0.0000685317 6.408293e-05
## TenureTenure_1 -0.58798137 0.5568762697 5.576361e-01
## TenureTenure_2 0.27141253 0.7862044760 7.865344e-01
## TenureTenure_3 2.61951595 0.0091053768 9.927820e-03
## TenureTenure_4 2.41244768 0.0162503963 1.734077e-02
## TenureTenure_5 0.95567652 0.3397736567 3.411334e-01
## TenureTenure_6pl -0.20544030 0.8373840069 8.375720e-01
## COFOG_1LevelCOFOG_02 1.74230902 0.0869175037 8.398487e-02
## COFOG_1LevelCOFOG_03 -0.55739394 0.5795539726 5.782838e-01
## COFOG_1LevelCOFOG_04 1.73170815 0.0884793932 8.586372e-02
## COFOG_1LevelCOFOG_05 0.68630656 0.4955441156 4.938275e-01
## COFOG_1LevelCOFOG_09 -1.00688271 0.3167715515 3.159923e-01
## COFOG_1LevelCOFOG_10 1.15168882 0.2540251537 2.517103e-01
## Former_CE_exp_expFormer_CE_exp 1.89876106 0.0604947403 5.996758e-02
## GenderMale 0.29375128 0.7695444786 7.694492e-01
## PhDPhD 0.74002332 0.4612975134 4.607134e-01
## Business_edBusiness_ed -1.07777054 0.2829466089 2.832714e-01
## Age -0.02154084 0.9828414186 9.828496e-01
ranova(m1)#to measure the sig. of the random effect, using likelihood ratio test.
## ANOVA-like table for random-effects: Single term deletions
##
## Model:
## Spendingln ~ Incomeln + Tenure + COFOG_1Level + Former_CE_exp_exp +
## Gender + PhD + Business_ed + Age + (1 | GD:Agency) + (1 |
## Agency) + (1 | Year)
## npar logLik AIC LRT Df Pr(>Chisq)
## <none> 23 -863.35 1772.7
## (1 | GD:Agency) 22 -874.27 1792.5 21.827 1 2.983e-06 ***
## (1 | Agency) 22 -881.27 1806.5 35.833 1 2.150e-09 ***
## (1 | Year) 22 -867.99 1780.0 9.266 1 0.002335 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#For older versions of lmerTest, "rand" is an alias for "ranova".
#To get the estimated means and the CI:s:
lsmeansLT(m1, test.effs=NULL, method.grad="Richardson", ddf="Kenward-Roger") #Not displayed
#correlation matrix
vcov(m1)
#Simple forestplot
plot_model(m1, show.values = TRUE, value.offset = .3)
#with the sj-suit you can do loads of stuff, to adjust variable labels, plot appearances, etc.... I shall elaborate below...
#m1
plot(fitted(m1),residuals(m1), family ="serif")
abline(0, 0)
lines(smooth.spline(fitted(m1), residuals(m1))) #Here we would like to have at fitted line too, because it is pretty.
# I will address colinarity and heteroscesdasticity etc further down, but it is always nice to see the residuals upfront, as it were.
plot(m1) #will also work if you can live with it.
Test interaction between tenure and former CE-experience (how is tenure-effect on spending moderated by former exp).
Only sig. control variables from M1 included (COFOG here a control variable).
Results are suppressed from here onward, but you get the picture…
m2 = lmer(Spendingln ~ Incomeln + Tenure * Former_CE_exp_exp + COFOG_1Level +
(1|Agency/GD) + (1|Year), data=data1)
summary(m2)
coefs_m2 <- data.frame(coef(summary(m2)))
df.KRF <- get_ddf_Lb(m2, fixef(m2))
coefs_m2$p.KR <- 2 * (1 - pt(abs(coefs_m2$t.value), df.KRF))
coefs_m2 #and there we go!
ranova(m2)
lsmeansLT(m2, test.effs=NULL, method.grad="Richardson", ddf="Kenward-Roger")
vcov(m2)
#Simple forestplot
plot_model(m2, show.values = TRUE, value.offset = .3)
#ok, so this is starting to look like crap. We'll do something about it further down the line.
#m2
plot(fitted(m2),residuals(m2), family ="serif")
abline(0, 0)
lines(smooth.spline(fitted(m2), residuals(m2))) #Here we would like to have at fitted line too, because it is pretty.
Test interaction between tenure and former COFOG (how is tenure-effect on spending moderated by organizational function)
#Only sig. control variables from M1 included (former CE-experience here a control variable)
m3 = lmer(Spendingln ~ Incomeln + Tenure * COFOG_1Level + Former_CE_exp_exp +
(1|Agency/GD) + (1|Year), data=data1)
summary(m3)
coefs_m3 <- data.frame(coef(summary(m3)))
df.KRF <- get_ddf_Lb(m3, fixef(m3))
coefs_m3$p.KR <- 2 * (1 - pt(abs(coefs_m3$t.value), df.KRF))
coefs_m3 #and there we go!
ranova(m3)
lsmeansLT(m3, test.effs=NULL, method.grad="Richardson", ddf="Kenward-Roger")
vcov(m3)
#Simple forestplot
plot_model(m3, show.values = TRUE, value.offset = .3)
#ok, looks like crap. We'll do something about it further down the line.
#m3
plot(fitted(m3),residuals(m3), family ="serif")
abline(0, 0)
lines(smooth.spline(fitted(m3), residuals(m3))) #Here we would like to have at fitted line too, because it is pretty.
To get proper means with interactions from both CE-experience and organizational function.
Since we now know the COFOG variable a little better, we can simplify it into COFOG 09/COFOG not 09, see data1.
#Only sig. control variables from M1 included,
m4 = lmer(Spendingln ~ Incomeln + Tenure * COFOG_simple + Tenure * Former_CE_exp_exp +
(1|Agency/GD) + (1|Year), data=data1)
summary(m4)
coefs_m4 <- data.frame(coef(summary(m4)))
df.KRF <- get_ddf_Lb(m4, fixef(m4))
coefs_m4$p.KR <- 2 * (1 - pt(abs(coefs_m4$t.value), df.KRF))
coefs_m4 #and there we go!
ranova(m4)
meansm4 <- lsmeansLT(m4, test.effs=NULL, method.grad="Richardson", ddf="Kenward-Roger")
meansm4 # if we want the means to be easily accesible to do plots etc...
vcov(m4)
#Lets look at the plots and how we can work with that.
#Simple forestplot
plot_model(m4, show.values = TRUE, value.offset = .3)
#and now the order of factors is confused too. Lets fix that... Reflecting our hypotheses
plot_model(m4, show.values = TRUE, value.offset = .3,
order.terms = c(2,3,4,5,6,7,9,8,1,16,17,18,19,20,21,14,10,11,12,13,15))
#sjplot uses automatic cleaning of variables using snakecase, which however does little for us until we get to the interaction plots.
#m4
plot(fitted(m4),residuals(m4), family ="serif")
abline(0, 0)
lines(smooth.spline(fitted(m4), residuals(m4))) #Here we would like to have at fitted line too, because it is pretty.
#We can label factors, using sjlabelled
set_label(data1$Spendingln) <- "ln(Spending)" # and henceforth. Wont do us much good, though.
#Below we will use a data frame of the coefs and rename the levels instead. But first interactions.
#Let's plot the interactions, we have them in m2-m4. Lets concentrate on m4.
plot_model(m4, type = "pred", terms = c("Tenure","COFOG_simple" ))
plot_model(m4, type = "pred", terms = c("Tenure", "Former_CE_exp_exp" ))
plot_model(m4, type = "pred", terms = c("Tenure","COFOG_simple","Former_CE_exp_exp" ))
#and the other way around
plot_model(m4, type = "pred", terms = c("Tenure","Former_CE_exp_exp","COFOG_simple" ))
#For some reason, sjplot fails to use the labels. If you know why, pls let me know.
Now, lets get serious with plotting the results, before turning to diagnostics.
We can bypass our labeling issues by extracting our coefs into a data frame.
## Create data frame of model coefficients, pvalues, df:s and standard errors
# Function to extract what we need
ce = function(model.obj) {
extract = summary(get(model.obj))$coefficients[ ,1:5]
return(data.frame(extract, varsn=row.names(extract), model=model.obj))
}
# Run function on the three models and bind into single data frame
coefs = do.call(rbind, sapply(paste0("m",4), ce, simplify=FALSE)) # here you specify the models we want to include in Faceted coefficient plots.
#Lets stick to just m4 for now.
names(coefs)[2] = "se" #name the std.E.
coefs #inspect
#Rename:
levels(coefs$varsn)[levels(coefs$varsn)== "Incomeln"] <- "ln(Income)"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_1"] <- "Tenure 1"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_2"] <- "Tenure 2"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_3"] <- "Tenure 3"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_4"] <- "Tenure 4"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_5"] <- "Tenure 5"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_6pl"] <- "Tenure 6pl"
levels(coefs$varsn)[levels(coefs$varsn)== "COFOG_simpleCOFOG_09"] <- "COFOG 9"
levels(coefs$varsn)[levels(coefs$varsn)== "Former_CE_exp_expFormer_CE_exp"] <- "Former CEO exp"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_1:COFOG_simpleCOFOG_09"] <- "Tenure 1/COFOG 9"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_2:COFOG_simpleCOFOG_09"] <- "Tenure 2/COFOG 9"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_3:COFOG_simpleCOFOG_09"] <- "Tenure 3/COFOG 9"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_4:COFOG_simpleCOFOG_09"] <- "Tenure 4/COFOG 9"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_5:COFOG_simpleCOFOG_09"] <- "Tenure 5/COFOG 9"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_6pl:COFOG_simpleCOFOG_09"] <- "Tenure 6pl/COFOG 9"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_1:Former_CE_exp_expFormer_CE_exp"] <- "Tenure 1/Former CEO exp"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_2:Former_CE_exp_expFormer_CE_exp"] <- "Tenure 2/Former CEO exp"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_3:Former_CE_exp_expFormer_CE_exp"] <- "Tenure 3/Former CEO exp"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_4:Former_CE_exp_expFormer_CE_exp"] <- "Tenure 4/Former CEO exp"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_5:Former_CE_exp_expFormer_CE_exp"] <- "Tenure 5/Former CEO exp"
levels(coefs$varsn)[levels(coefs$varsn)== "TenureTenure_6pl:Former_CE_exp_expFormer_CE_exp"] <- "Tenure 6pl/Former CEO exp"
#reorder factors
coefs$varsn <- factor(coefs$varsn, levels = c("(Intercept)", "ln(Income)", "Tenure 6pl/COFOG 9", "Tenure 5/COFOG 9","Tenure 4/COFOG 9", "Tenure 3/COFOG 9",
"Tenure 2/COFOG 9", "Tenure 1/COFOG 9", "Tenure 6pl/Former CEO exp", "Tenure 5/Former CEO exp",
"Tenure 4/Former CEO exp", "Tenure 3/Former CEO exp", "Tenure 2/Former CEO exp", "Tenure 1/Former CEO exp","COFOG 9",
"Former CEO exp", "Tenure 6pl", "Tenure 5", "Tenure 4", "Tenure 3", "Tenure 2", "Tenure 1"))
#Add asteriks...
coefs$sig <- cut(coefs$Pr...t.., breaks = c(0, 0.001, 0.01, 0.05, Inf),
labels = c("***", "**", "*", "ns")) #yes, you are correct mylady, "inf" can be replaced with "1".
#re-inspect
coefs
# Faceted coefficient plot, with estimates and pvalues in the figure. Over the top?
p<-ggplot(coefs, aes(varsn, Estimate)) +
geom_errorbar(aes(ymin=Estimate - se, ymax=Estimate + se, colour=Estimate),
lwd=1, width=0.2) +
geom_point(size=2, aes(colour=Estimate)) +
geom_text(label=paste("e",round(coefs$Estimate,3),":p",round(coefs$Pr...t..,3),"(",(coefs$sig),")", sep = ""),
nudge_x = 0.3, nudge_y = 0.0, size= 3) +
facet_grid(. ~ model) +
coord_flip() +
guides(colour=FALSE) +
labs(x="Coefficient", y="Value") +
scale_colour_gradientn(colours = rainbow(3)) +
theme_grey(base_size=8)
p
#add line
p<-p + geom_hline(yintercept=0, lty=2, lwd=1, colour="grey50") #remember, coordinates are fliped, so use hline
p
ggplotly(p) #plotly does not handle the geomline well
Lets elaborate around the p-values, and add additional tests. since p-values for mixed model are disputed to begin with, here are some references:
*0. Luke (2016), Evaluating significance in linear mixed-effects models in R. Behavior Research Methods, pp 1-9
Kenward, M. G., and J. H. Roger, 1997: Small sample inference for fixed effects from restricted maximum likelihood. Biometrics 53, 983-997.
Spilke J., Piepho, H.-P. and Hu, X. Hu (2005) A Simulation Study on Tests of Hypotheses and Confidence Intervals for Fixed Effects in Mixed Models for Blocked Experiments With Missing Data, Journal of Agricultural, Biological, and Environmental Statistics, Vol. 10,p. 374-389
Alnosaier, W. (2007) Kenward-Roger Approximate F Test for Fixed Effects in Mixed Linear Models, Dissertation, Oregon State University
All in all, KR.approximation seem to be preferred. Or some bootstrap or simulation technique, see below.
Complementing ANOVA-style tests for significance
First the alternative models (same as the models above, but with the ‘REML=FALSE’ addition, thus using ML)
Why?
Answer from CrossValidated/Robert Long: …. ML is biased for the estimation of variance components. But observe that the bias gets smaller for larger sample sizes. Hence, under what circumstances may REML be preferred over ML (or vice versa) when fitting a mixed effects model?“, for small sample sizes REML is preferred.
However,
likelihood ratio tests for REML require exactly the same fixed effects specification in both models.
So, to compare models with different fixed effects (a common scenario) with an LR test, ML must be used. [Since, by default, we we’ll held out the one fixed effect we’re interested in -my addition]
#Two sets of null models, with one model for each variable we remove (i.e. test)...
#For the pars.model
m1_ML = lmer(Spendingln ~ Incomeln + Tenure + COFOG_1Level + Former_CE_exp_exp + Gender + PhD + Business_ed + Age +
(1|Agency/GD) + (1|Year), data=data1, REML = FALSE) #As m1, but with REML set to false.
m1_ML_PhDNULL = lmer(Spendingln ~ Incomeln + Tenure + COFOG_1Level + Former_CE_exp_exp + Gender + Business_ed + Age +
(1|Agency/GD) + (1|Year), data=data1, REML = FALSE) #take away variable of interest...
#Says ANOVA below, but they are actually likelihood ratio tests
#The tests for the pars.model variables:
anova(m1_ML,m1_ML_PhDNULL) #gives us no star :-( But we now that already. Repeat for all models and variables of interest.
What we have done is basically to remove one variable (PhD) from one model and then compare this model with a model where it is still included, ceteris paribus. Significant difference between the two models? For PhD, “no”. #that is, CE PhD does NOT have a significant effect on the level of spending on MCS:s.
Then the bootstrap tests, (and now we’re talking).
Very similar to the above likelihood ratio tests in the basic design: remove one variable of interest and see if there is a significant difference in the results…using bobyqa optimization (Bound Optimization BY Quadratic Approximation).
ATTENTION! Computationally expensive, these are…
Bootstrap_m1_PhD <- PBmodcomp(m1_ML, m1_ML_PhDNULL, nsim = 20000) #20K= 1038.24 sec, 8 cores parallel processing.
#... and then simply call the lists to get the results.
Bootstrap_m1_PhD
#It will not look good until you increase the no. of evaluations.
#To get stuff we've got so far out of R..:
write.csv(coefs_m1, "coefs_m1.csv")
m1_fit <- data.frame(fitted(m1))
write.csv(m1_fit, "fitted_m1.csv")
fixedef_m1 <- data.frame(fixef(m1,add.dropped=TRUE))
write.csv(fixedef_m1, "fixedef_m1.csv") #and so on for m2-m4....
#Let's have the models compete:
AIC(m1,m2,m3,m4) #lower=better. AIC is a good criteria for getting good prediction-models
BIC(m1,m2,m3,m4) #BIC favors simple models
bbmle::AICtab(m1,m2,m3,m4) #Call the bbmle-package first. gives delta AIC
bbmle::BICtab(m1,m2,m3,m4) #Call the bbmle-package first. gives delta BIC
#Corrected AIC's, more robust for small samples
AICc(m1)
AICc(m2)
AICc(m3)
AICc(m4)
#Loglike.
logLik(m1,REML=FALSE) #Higher (i.e., closer to zero) =better
logLik(m2,REML=FALSE)
logLik(m3,REML=FALSE)
logLik(m4,REML=FALSE)
#deviance
deviance(m1,REML=FALSE) #lower=better
deviance(m2,REML=FALSE)
deviance(m3,REML=FALSE)
deviance(m4,REML=FALSE)
You could easily, or with great effort I mean, write a PhD on which information criteria is the best…
How about the overall R2? Though contested (as are p-values), they could give you a hint on how well your model performs.
#Function suggetsed by Jarrett Byrnes:
r2.corr.mer <- function(m) {
lmfit <- lm(model.response(model.frame(m)) ~ fitted(m))
summary(lmfit)$r.squared
}
#Then apply the function on the different models...
r2.corr.mer(m1) #etc..... Could be called R2 corr; i.e. the correlation between the fitted and the observed values
#Optionally, and published, we have 'Omega02':
1-var(residuals(m1))/(var(model.response(model.frame(m1)))) #etc...
#Even better is to report marginal (for fixed effect) and conditional (for the whole model) R2's, using the 'MuMin' package:
r.squaredGLMM(m1)
r.squaredGLMM(m2)
r.squaredGLMM(m3)
r.squaredGLMM(m4) #Splitting up r2 shows how relatively effective our favored models are
Lets run some diagnostics, most importantly, the homogeneity of variance assumption.
#For some reason, variance inflation factor (VIF) is not supported for lmer. Strange, but I guess I'm missing something.
#From Github "aufrank/R-hacks", we have this life-saver for diagnosing collinearity in mixed models from lme4:
vif.mer <- function (fit) {
##adapted from rms::vif
v <- vcov(fit)
nam <- names(fixef(fit))
##exclude intercepts
ns <- sum(1 * (nam == "Intercept" | nam == "(Intercept)"))
if (ns > 0) {
v <- v[-(1:ns), -(1:ns), drop = FALSE]
nam <- nam[-(1:ns)]
}
d <- diag(v)^0.5
v <- diag(solve(v/(d %o% d)))
names(v) <- nam
v
}
#Then just run the function on the model you want to test
vif.mer(m1)
vif.mer(m2)
vif.mer(m3)
vif.mer(m4) #Golden! Should be <3 (or 2.5 Or 5. Or whatever the reviewers tells you it should be)
The depth of my ignorance: VIF I’m ok with, seem pretty straight forward. Tests for homogeneity of variance, in contrast, makes me suspicious. Seem to be a way to impose a rather arbitrary cut-off value, that someone pulled out of a hat. I am strongly in favor of visual inspection of the residuals along the line suggested by Zuur et al. (2010) in A protocol for data exploration to avoid common statistical problems, Methods in Ecology and Evolution 1(1), 3-14.
Nonetheless, to please ALL kinds of reviewers:
Most common tests include Fligner-Killeen’s test and Levene’s test, both of which are supposedly robust for outliers. Levene’s test cannot handle continuous variables.
One thing that worries me are the degrees of freedom (I may be way of here): to get them right you need to collapse the IV’s into one. And then run separate tests for the quantitative variables (age, income). The whole procedure is… iffy.
#Anything above p=0.05 is ok...
leveneTest(residuals(m1) ~ Tenure*COFOG_1Level*Former_CE_exp_exp*Gender*PhD*Business_ed, data=data1) #ok.
fligner.test(residuals(m1) ~ interaction(Tenure,COFOG_1Level,Former_CE_exp_exp,Gender,PhD,Business_ed), data=data1) #oh, Nelly!
#and:
fligner.test(residuals(m1) ~ (Incomeln), data=data1) #ok
fligner.test(residuals(m1) ~ (Age), data=data1) #ok
#and then onwards, for m2-m4. remember m4 has different cofog variable...
Four, no, five, options if heteroscedasticity turnes out to be a problem:
0. Manipulate the quantitative variables (with ln or std) and see what happens.
1. Get rid of problematic outliers.
2. Weight the model (look at the code snippet under extras below or use package ‘robustlmm’).
3. Could we add Former GD as a random variable, though the categories strictly speaking exhausts the population?
4. Akin to no.3, but with brute force: divide the data set based on the COFOG_simple variable and run separate models.
I would start with 2 or 1 or 3.
#Time to plot some diagnostics.
#We already have the oh so important residuals from above
#density plot of the residuals... We don't really need to check for normality, but hey...
#if you run into a problem, density plot of the residual may give you clue as to where the problem is
hist(residuals(m1), prob=TRUE, ylim=c(0,0.6), family="serif")
lines(density(residuals(m1)))
curve(dnorm(x, mean=mean(residuals(m1)), sd=sd(residuals(m1))), lty=3, col="#FF4500", add=TRUE)
#qq plot
qqnorm(residuals(m1), family="serif")
qqline(residuals(m1))
#The original variables
#density plots
densityplot(~Spendingln, data = data1)
densityplot(~Incomeln, data = data1)
densityplot(~Age, data = data1)
#boxplots
par(family = "serif")
boxplot(Spendingln ~ Tenure, data = data1)
boxplot(Spendingln ~ Former_CE_exp_exp, data = data1)
#xy-plots
xyplot(Spendingln ~ Incomeln, data = data1, type=c("p","r","smooth"))
#conditional xy-plots
xyplot(Spendingln ~ Incomeln | Tenure, data=data1, type=c("p","r","smooth"))
xyplot(Spendingln ~ Incomeln | COFOG_1Level, data=data1, type=c("p","r","smooth"))
xyplot(Spendingln ~ Incomeln | COFOG_simple, data=data1, type=c("p","r","smooth"))
xyplot(Spendingln ~ Incomeln | Former_CE_exp_exp, data=data1, type=c("p","r","smooth"))
#then for diagnostics (again?!?)
#Extract values from lmer objects
fittedm1 <- fitted(m1) #Fitted values
residualsm1 <- residuals(m1) #residuals
theo_quantm1 <- qqnorm(residualsm1, plot.it = F)$x #Theoretical Quantiles
stdrm1 <- residuals(m1,
type = c("pearson")) #Standardized residuals
#Create dataframe
#Will be used as input to plots/plot_ly/ggplot
p_data <- data.frame(fittedm1,
stdrm1,
residualsm1,
theo_quantm1)
#scatterplot with marginal densityplots
#This I like:
scatter <- qplot(fittedm1,residualsm1, data=p_data) +
scale_x_continuous(limits=c(min(fittedm1),max(fittedm1))) +
scale_y_continuous(limits=c(min(residualsm1),max(residualsm1))) +
geom_rug(col=rgb(.5,0,0,alpha=.5))
print(scatter)
#
#below with marginal density plots
p1 <- p1 <- p3 <- ggplot(p_data, aes(fittedm1, residualsm1))+ geom_point() + theme_classic()
p1
ggExtra::ggMarginal(
p = p1,
type = 'density',
margins = 'both',
size = 5,
col = 'black'
) #This you can add using the addin 'ggplot2 Marginal Plots' (in Rstudio)
#Old school:
#create a function for marginal hist plots
scatterhist = function(x, y, xlab="", ylab=""){
zones=matrix(c(2,0,1,3), ncol=2, byrow=TRUE)
layout(zones, widths=c(4/5,1/5), heights=c(1/5,4/5))
xhist = hist(x, plot=FALSE)
yhist = hist(y, plot=FALSE)
top = max(c(xhist$counts, yhist$counts))
par(mar=c(3,3,1,1))
plot(x,y)
par(mar=c(0,3,1,1))
barplot(xhist$counts, axes=FALSE, ylim=c(0, top), space=0)
par(mar=c(3,0,1,1))
barplot(yhist$counts, axes=FALSE, xlim=c(0, top), space=0, horiz=TRUE)
par(oma=c(3,3,0,0))
mtext(xlab, side=1, line=1, outer=TRUE, adj=0,
at=.8 * (mean(x) - min(x))/(max(x)-min(x)))
mtext(ylab, side=2, line=1, outer=TRUE, adj=0,
at=(.8 * (mean(y) - min(y))/(max(y) - min(y))))
}
#run the function with our data..:
with(coefs_m1, scatterhist(fittedm1,residualsm1, xlab="fitted", ylab="residuals"))
#useful? Maybe.
#
PLOT <- xyplot(residualsm1 ~ fittedm1, data=data1, type=c("p","r","smooth"),xlab = "Fitted Values",
ylab = "Residuals")
PLOT
update(PLOT, panel=function(...){
panel.xyplot(...)
panel.abline(h=0)
} ) #This one I actually find useful
#Scatter plot
LOESS1 <- loess.smooth(fittedm1, residualsm1)
plot_ly(x = fittedm1, y = residualsm1,
type = "scatter", mode = "markers", hoverinfo = "x+y", name = "Data",
marker = list(size = 10, opacity = 0.5), showlegend = F) %>%
add_trace(x = LOESS1$x, y = LOESS1$y, type = "scatter", mode = "lines+markers", name = "Smooth",
line = list(width = 2)) %>%
layout(title = "Residuals (y) vs Fitted Values (x)", plot_bgcolor = "#e6e6e6")
#Histogram, density and fitted nd, all in the same graph
p <- ggplot(p_data, aes(residualsm1))+
geom_histogram(aes(y = ..density..), alpha = 0.7, fill = "#333333", breaks=seq(-3, 3, by = 0.4)) +
geom_density(fill = "#ff4d4d", alpha = 0.5) +
theme(panel.background = element_rect(fill = '#ffffff')) +
stat_function(fun = dnorm, n = 101, args = list(mean = 0, sd = sd(residualsm1))) + ylab("insert whatever") + xlab("you like here") +
scale_y_continuous(breaks = NULL) +
ggtitle("Density with Histogram overlay")
ggplotly(p) #This to transform the ggplot into an plotly object
#Below, we combine fq and density in the same plot, with two scales (left and right)
fit <- density(residualsm1)
plot_ly(x = residualsm1) %>%
add_histogram(nbinsx=25) %>%
add_lines(x = fit$x, y = fit$y, fill = "tozeroy", yaxis = "y2") %>%
layout(yaxis2 = list(overlaying = "y", side = "right")) #it actually works, imagine that!
GMY
## [1] "MYA"