Hierarchical models
For this homework I just want to prove that I understand the basics of the multilevel modeling. I am using data of the 2015 Intercensus Survey for one state in Mexico (Baja California) that had information of individuals and municipalities. The outcome variable is income (continuous variable) of the person who answer and we want to see if this varies across the municipalities where this individual lived five years ago. Some of the individual characteristics are age (older than 15 years of age), sex (male =1, female =0), afrodescendece (Yes =1, No =0), indigenous self-identification (Yes =1, No =0), education (no education, basic education and more than high school) and employment (if the individual is a worker, if its an employer and if its a worker employed by itself). The higher level variable is the municipalities where these individuals lived five years ago.
First, I recode the variables:
# Recode the outcome: ingtrmen
bc.sub$income<-ifelse(bc.sub$ingtrmen==999999, NA, bc.sub$ingtrmen)
bc.sub$income<-ifelse(bc.sub$income==999998, NA, bc.sub$income)
bc.sub$income<-ifelse(bc.sub$income==0, NA, bc.sub$income)
bc.sub$income<-ifelse(bc.sub$income>1000, bc.sub$income, NA)
bc.sub$income<-ifelse(bc.sub$income<100000, bc.sub$income, NA)
# Individual characteristics: age, sex, afrodescendence, indigenous, education and employment
bc.sub$male<-recode(bc.sub$sexo, recodes = "1=1; 3=0")
bc.sub$age<-ifelse(bc.sub$edad==999, NA, bc.sub$edad)
bc.sub$age<-ifelse(bc.sub$edad>15, bc.sub$edad, NA)
bc.sub$afro<-recode(bc.sub$afrodes, recodes ="1:2=1; 3:8=0; 9=NA")
bc.sub$ind<-recode(bc.sub$perte_indigena, recodes ="1:2=1; 3:8=0; 9=NA")
bc.sub$no_educ<-recode(bc.sub$nivacad, recodes = "0=1; else=0; 99=NA")
bc.sub$basic<-recode(bc.sub$nivacad, recodes = "1:9=1; else=0; 99=NA")
bc.sub$morebasic<-recode(bc.sub$nivacad, recodes = "10:14=1; else=0; 99=NA")
bc.sub$worker<-recode(bc.sub$situacion_trab, recodes = "1:3=1; else=0; 9=NA")
bc.sub$employer<-recode(bc.sub$situacion_trab, recodes = "4=1; else=0; 9=NA")
bc.sub$selfworker<-recode(bc.sub$situacion_trab, recodes = "5:6=1; else=0; 9=NA")
# Higher level characteristic: municipality of residence five years ago
bc.sub$local<-recode(bc.sub$tamloc, recodes = "5=1; else=0")
bc.sub$res10<-ifelse(bc.sub$mun_res10==999, NA, bc.sub$mun_res10)
Second, as was stated in class, we check if there is variation across groups, in this case, across municipalities. And yes, there is.
# Check for variation across localities
fit0<-lm(income~factor(res10), data=bc.sub)
# Global F test for equality of means across localities
anova(fit0)
## Analysis of Variance Table
##
## Response: income
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(res10) 157 2.1064e+10 134165980 2.6194 < 2.2e-16 ***
## Residuals 68866 3.5273e+12 51220121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null model only includes the group variable. I have a huge variance, because income goes from almost 1,000 to 98,000 Mexican pesos.
# Null model
fit1<-lmer(income~1+(1|mun_res10), data=bc.sub, REML=T, subset = complete.cases(bc.sub))
summary(fit1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: income ~ 1 + (1 | mun_res10)
## Data: bc.sub
## Subset: complete.cases(bc.sub)
##
## REML criterion at convergence: 1407916
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.0135 -0.5144 -0.3278 0.0912 12.6602
##
## Random effects:
## Groups Name Variance Std.Dev.
## mun_res10 (Intercept) 1406978 1186
## Residual 51271932 7160
## Number of obs: 68374, groups: mun_res10, 158
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 6663.8 196.6 33.9
The second model adds individual level variables. And as we see in the summary the variance is lower (from a crazy 1406978 to 352906). From the colum t-value, male and age are significant predictors of income, an indigenous persons also but not an afrodescendant, and as education is higher (more than basic education) income is greater. For employment I believe that be an employer had a significant effect in income.
fit2<-lmer(income~male+age+afro+ind+no_educ+basic+morebasic+worker+employer+selfworker+local+(1|mun_res10), data=bc.sub, REML = T, subset=complete.cases(bc.sub))
summary(fit2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: income ~ male + age + afro + ind + no_educ + basic + morebasic +
## worker + employer + selfworker + local + (1 | mun_res10)
## Data: bc.sub
## Subset: complete.cases(bc.sub)
##
## REML criterion at convergence: 1394134
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -4.9431 -0.4313 -0.1837 0.1441 14.1334
##
## Random effects:
## Groups Name Variance Std.Dev.
## mun_res10 (Intercept) 352906 594.1
## Residual 42011778 6481.6
## Number of obs: 68374, groups: mun_res10, 158
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 5242.082 780.544 6.716
## male 1862.810 51.388 36.250
## age 28.893 1.344 21.499
## afro 184.621 323.196 0.571
## ind -430.242 76.052 -5.657
## no_educ -4358.114 664.211 -6.561
## basic -2477.688 639.496 -3.874
## morebasic 4399.112 641.473 6.858
## worker 539.312 425.679 1.267
## employer 5618.070 450.427 12.473
## selfworker 788.952 429.417 1.837
## local 691.305 68.346 10.115
##
## Correlation of Fixed Effects:
## (Intr) male age afro ind no_edc basic morbsc worker
## male -0.036
## age -0.066 -0.019
## afro -0.001 0.004 -0.002
## ind -0.013 -0.011 -0.017 -0.085
## no_educ -0.786 -0.014 -0.010 0.000 -0.020
## basic -0.816 -0.011 0.014 -0.003 -0.003 0.961
## morebasic -0.813 -0.005 0.015 -0.003 0.002 0.957 0.995
## worker -0.541 0.004 0.007 0.003 -0.009 -0.002 -0.004 -0.004
## employer -0.510 -0.008 -0.019 0.003 -0.009 0.000 -0.002 -0.004 0.941
## selfworker -0.533 0.001 -0.016 0.003 -0.009 -0.004 -0.005 -0.004 0.987
## local -0.061 0.024 -0.023 0.007 0.069 0.016 0.001 -0.009 0.012
## emplyr slfwrk
## male
## age
## afro
## ind
## no_educ
## basic
## morebasic
## worker
## employer
## selfworker 0.933
## local 0.011 0.005
Then I use the ANOVA to confirm that the model with the individual level variables is a better fit than the null model. Also, confirm by the reduction of the ICC in the second model with respect to the ICC in the null model.
anova(fit1, fit2)
## refitting model(s) with ML (instead of REML)
## Data: bc.sub
## Subset: complete.cases(bc.sub)
## Models:
## fit1: income ~ 1 + (1 | mun_res10)
## fit2: income ~ male + age + afro + ind + no_educ + basic + morebasic +
## fit2: worker + employer + selfworker + local + (1 | mun_res10)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## fit1 3 1407934 1407961 -703964 1407928
## fit2 14 1394293 1394421 -697133 1394265 13663 11 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ICC
#install.packages("sjstats")
library(sjstats)
icc(fit1)
##
## Linear mixed model
##
## Family : gaussian (identity)
## Formula: income ~ 1 + (1 | mun_res10)
##
## ICC (mun_res10): 0.0267
icc(fit2)
##
## Linear mixed model
##
## Family : gaussian (identity)
## Formula: income ~ male + age + afro + ind + no_educ + basic + morebasic + worker + employer + selfworker + local + (1 | mun_res10)
##
## ICC (mun_res10): 0.0083