Hierarchical models

For this homework I just want to prove that I understand the basics of the multilevel modeling. I am using data of the 2015 Intercensus Survey for one state in Mexico (Baja California) that had information of individuals and municipalities. The outcome variable is income (continuous variable) of the person who answer and we want to see if this varies across the municipalities where this individual lived five years ago. Some of the individual characteristics are age (older than 15 years of age), sex (male =1, female =0), afrodescendece (Yes =1, No =0), indigenous self-identification (Yes =1, No =0), education (no education, basic education and more than high school) and employment (if the individual is a worker, if its an employer and if its a worker employed by itself). The higher level variable is the municipalities where these individuals lived five years ago.

First, I recode the variables:

# Recode the outcome: ingtrmen
bc.sub$income<-ifelse(bc.sub$ingtrmen==999999, NA, bc.sub$ingtrmen)
bc.sub$income<-ifelse(bc.sub$income==999998, NA, bc.sub$income)
bc.sub$income<-ifelse(bc.sub$income==0, NA, bc.sub$income)
bc.sub$income<-ifelse(bc.sub$income>1000, bc.sub$income, NA)
bc.sub$income<-ifelse(bc.sub$income<100000, bc.sub$income, NA)

# Individual characteristics: age, sex, afrodescendence, indigenous, education and employment
bc.sub$male<-recode(bc.sub$sexo, recodes = "1=1; 3=0")
bc.sub$age<-ifelse(bc.sub$edad==999, NA, bc.sub$edad)
bc.sub$age<-ifelse(bc.sub$edad>15, bc.sub$edad, NA)
bc.sub$afro<-recode(bc.sub$afrodes, recodes ="1:2=1; 3:8=0; 9=NA")
bc.sub$ind<-recode(bc.sub$perte_indigena, recodes ="1:2=1; 3:8=0; 9=NA")
bc.sub$no_educ<-recode(bc.sub$nivacad, recodes = "0=1; else=0; 99=NA")
bc.sub$basic<-recode(bc.sub$nivacad, recodes = "1:9=1; else=0; 99=NA")
bc.sub$morebasic<-recode(bc.sub$nivacad, recodes = "10:14=1; else=0; 99=NA")
bc.sub$worker<-recode(bc.sub$situacion_trab, recodes = "1:3=1; else=0; 9=NA")
bc.sub$employer<-recode(bc.sub$situacion_trab, recodes = "4=1; else=0; 9=NA")
bc.sub$selfworker<-recode(bc.sub$situacion_trab, recodes = "5:6=1; else=0; 9=NA")
# Higher level characteristic: municipality of residence five years ago
bc.sub$local<-recode(bc.sub$tamloc, recodes = "5=1; else=0")
bc.sub$res10<-ifelse(bc.sub$mun_res10==999, NA, bc.sub$mun_res10)

Second, as was stated in class, we check if there is variation across groups, in this case, across municipalities. And yes, there is.

# Check for variation across localities
fit0<-lm(income~factor(res10), data=bc.sub)

# Global F test for equality of means across localities
anova(fit0)
## Analysis of Variance Table
## 
## Response: income
##                  Df     Sum Sq   Mean Sq F value    Pr(>F)    
## factor(res10)   157 2.1064e+10 134165980  2.6194 < 2.2e-16 ***
## Residuals     68866 3.5273e+12  51220121                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null model only includes the group variable. I have a huge variance, because income goes from almost 1,000 to 98,000 Mexican pesos.

# Null model

fit1<-lmer(income~1+(1|mun_res10), data=bc.sub, REML=T, subset = complete.cases(bc.sub))
summary(fit1)
## Linear mixed model fit by REML ['lmerMod']
## Formula: income ~ 1 + (1 | mun_res10)
##    Data: bc.sub
##  Subset: complete.cases(bc.sub)
## 
## REML criterion at convergence: 1407916
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.0135 -0.5144 -0.3278  0.0912 12.6602 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  mun_res10 (Intercept)  1406978 1186    
##  Residual              51271932 7160    
## Number of obs: 68374, groups:  mun_res10, 158
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)   6663.8      196.6    33.9

The second model adds individual level variables. And as we see in the summary the variance is lower (from a crazy 1406978 to 352906). From the colum t-value, male and age are significant predictors of income, an indigenous persons also but not an afrodescendant, and as education is higher (more than basic education) income is greater. For employment I believe that be an employer had a significant effect in income.

fit2<-lmer(income~male+age+afro+ind+no_educ+basic+morebasic+worker+employer+selfworker+local+(1|mun_res10), data=bc.sub, REML = T, subset=complete.cases(bc.sub))
summary(fit2)
## Linear mixed model fit by REML ['lmerMod']
## Formula: income ~ male + age + afro + ind + no_educ + basic + morebasic +  
##     worker + employer + selfworker + local + (1 | mun_res10)
##    Data: bc.sub
##  Subset: complete.cases(bc.sub)
## 
## REML criterion at convergence: 1394134
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.9431 -0.4313 -0.1837  0.1441 14.1334 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  mun_res10 (Intercept)   352906  594.1  
##  Residual              42011778 6481.6  
## Number of obs: 68374, groups:  mun_res10, 158
## 
## Fixed effects:
##              Estimate Std. Error t value
## (Intercept)  5242.082    780.544   6.716
## male         1862.810     51.388  36.250
## age            28.893      1.344  21.499
## afro          184.621    323.196   0.571
## ind          -430.242     76.052  -5.657
## no_educ     -4358.114    664.211  -6.561
## basic       -2477.688    639.496  -3.874
## morebasic    4399.112    641.473   6.858
## worker        539.312    425.679   1.267
## employer     5618.070    450.427  12.473
## selfworker    788.952    429.417   1.837
## local         691.305     68.346  10.115
## 
## Correlation of Fixed Effects:
##            (Intr) male   age    afro   ind    no_edc basic  morbsc worker
## male       -0.036                                                        
## age        -0.066 -0.019                                                 
## afro       -0.001  0.004 -0.002                                          
## ind        -0.013 -0.011 -0.017 -0.085                                   
## no_educ    -0.786 -0.014 -0.010  0.000 -0.020                            
## basic      -0.816 -0.011  0.014 -0.003 -0.003  0.961                     
## morebasic  -0.813 -0.005  0.015 -0.003  0.002  0.957  0.995              
## worker     -0.541  0.004  0.007  0.003 -0.009 -0.002 -0.004 -0.004       
## employer   -0.510 -0.008 -0.019  0.003 -0.009  0.000 -0.002 -0.004  0.941
## selfworker -0.533  0.001 -0.016  0.003 -0.009 -0.004 -0.005 -0.004  0.987
## local      -0.061  0.024 -0.023  0.007  0.069  0.016  0.001 -0.009  0.012
##            emplyr slfwrk
## male                    
## age                     
## afro                    
## ind                     
## no_educ                 
## basic                   
## morebasic               
## worker                  
## employer                
## selfworker  0.933       
## local       0.011  0.005

Then I use the ANOVA to confirm that the model with the individual level variables is a better fit than the null model. Also, confirm by the reduction of the ICC in the second model with respect to the ICC in the null model.

anova(fit1, fit2)
## refitting model(s) with ML (instead of REML)
## Data: bc.sub
## Subset: complete.cases(bc.sub)
## Models:
## fit1: income ~ 1 + (1 | mun_res10)
## fit2: income ~ male + age + afro + ind + no_educ + basic + morebasic + 
## fit2:     worker + employer + selfworker + local + (1 | mun_res10)
##      Df     AIC     BIC  logLik deviance Chisq Chi Df Pr(>Chisq)    
## fit1  3 1407934 1407961 -703964  1407928                            
## fit2 14 1394293 1394421 -697133  1394265 13663     11  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# ICC
#install.packages("sjstats")
library(sjstats)

icc(fit1)
## 
## Linear mixed model
## 
## Family : gaussian (identity)
## Formula: income ~ 1 + (1 | mun_res10)
## 
##   ICC (mun_res10): 0.0267
icc(fit2)
## 
## Linear mixed model
## 
## Family : gaussian (identity)
## Formula: income ~ male + age + afro + ind + no_educ + basic + morebasic + worker + employer + selfworker + local + (1 | mun_res10)
## 
##   ICC (mun_res10): 0.0083