HW1

Estimate the random intercept model for your outcome. Fit two models. The first model should be the “NULL” model, with no predictors and the second should use appropriate individual predictors.

In this assignment, I’ll be using European Social Survey data. I’m intersted in the different immigrant groups (by country of origin) and how they fare in their new country. The survey data is mainly categorical in nature, there are no continuous variables, so for this assignment, I’ve chosen a dichotomous dependent variable: gender. I’ll be looking for the variation of how the experience of men and women differ, and more specifically, how they differ by the region from which they come from. I’ve divided the countries of origin into 9 different regions: South America (SA), South Western Asia (SWAS), South Eastern Asia (SEAS), America (AM)[having a variable named NA (for North America) seemed wrong], Pacific Islands (PI), European Union (EU), Africa (AF), Middle East (ME), Russian Federation and former Russian controlled countries (RU).

The other variables I’ve selected from the data are: ctzcntr: Citizen of country (1 = Y, 2 = N) livecntr How long ago did you immigrate (>= 3 yrs = 1, <3 = 0) edulvla Highest level of education (upper secondary education or better = 1, else 0) edulvlpa highest education of father, same determinant as above edulvlma highest education of mother, same determinant as above pdwrk Have you been paid for work in last seven days (1 = Y, 0 = N) wkhtot Total number hours normally worked per week. (>=40 = 1, <40 = 0) uemp3m Have you been unemployed and looking for work for three or months (1 =Y, 0 =N) hinctnt Total household income (reported in deciles, 0 = lower three deciles, else 1 ) health How is your health (fair or better = 1, else 0) age actual age (>= 25 = 1, else 0) rlgblg Do you belong to a particular religous denom. (1 = Y, 0 = N) dscrgrp Do you feel as if you’ve been discriminated against (1 = Y, 0 = N) jbspv Responsible for supervising others (1 = Y, 0 = N)

ESS1_7e01 <- read_csv("ESS Germany 2016/ESS1-7e01.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   cedition = col_double(),
##   cseqno = col_integer(),
##   essround = col_integer(),
##   edition = col_double(),
##   idno = col_integer(),
##   dweight = col_double(),
##   pspwght = col_double(),
##   pweight = col_double(),
##   tvtot = col_integer(),
##   tvpol = col_integer(),
##   rdtot = col_integer(),
##   rdpol = col_integer(),
##   nwsptot = col_integer(),
##   nwsppol = col_integer(),
##   ppltrst = col_integer(),
##   pplfair = col_integer(),
##   pplhlp = col_integer(),
##   polintr = col_integer(),
##   polcmpl = col_integer(),
##   poldcs = col_integer()
##   # ... with 254 more columns
## )

## See spec(...) for full column specifications.

sa <- c("CO", "BR", "AR")
swas <- c("AF", "IN", "LK", "PK")
seas <- c("CN", "TH", "VN")
am <- c("US",  "CA")
pi <- c("ID", "PH")
eu <- c("AU", "CH", "ES", "FI", "FR", "GB", "HU", "NL", "NO", "PL", "PT",
        "SK", "EE", "BG", "RO", "TR", "GR", "IT", "AT", "IE")

af <- c("BF", "GH", "GM", "MA", "MZ", "TN", "UG", "EG", "CG")
me <- c("IQ", "LB", "SA", "SY", "IR")
rs <- c("AL", "AM", "AZ", "BA", "BY", "CZ", "HR", "KG", "LT", "LV", "MC",
        "RU", "UA", "CS", "02", "03", "04", "MK", "KZ", "TM", "TJ",   "UZ")
hw1 <- ESS1_7e01 %>% 
  select(ctzcntr, livecntr, edulvla, edulvlpa, edulvlma, pdwrk, wkhtot,
         uemp3m, hinctnt, health, ctzshipa, cntbrtha, gndr, yrbrn, rlgblg, dscrgrp, jbspv) %>% 
  filter(livecntr <= 5, hinctnt <= 12, wkhtot <= 665, cntbrtha != c("66", "99"), edulvlpa <= 54, 
         edulvlma <= 54) %>%  # removing livecntr, hinctr and wkhtot NAs)
  mutate(
    origin = ifelse(cntbrtha %in% sa, "SA",
                  ifelse(cntbrtha %in% swas, "SWAS",  #setting regions for the countries of origin
                           ifelse(cntbrtha %in% seas, "SEAS",
                                  ifelse(cntbrtha %in% am, "AM",
                                         ifelse(cntbrtha %in% pi, "PI",
                                                ifelse(cntbrtha %in% eu, "EU",
                                                       ifelse(cntbrtha %in% af, "AF",
                                                              ifelse(cntbrtha %in% me, "ME",
                                                                     ifelse(cntbrtha %in% rs, "RS", NA))))))))),
    age = ifelse(2018 - yrbrn >=25, 1 ,0), #changing predictors to binary 
    gndr = ifelse(gndr > 1, 0, 1),
    livecntr = ifelse(livecntr >= 3, 1, 0),
    edulvla = ifelse(edulvla >= 3, 1, 0),
    edulvlpa = ifelse(edulvlpa >= 3, 1, 0),
    edulvlma = ifelse(edulvlma >= 3, 1, 0),
    wkhtot = ifelse(wkhtot >= 40, 1, 0),
    health = ifelse(health <= 3, 1, 0),
    rlgblg = ifelse(rlgblg == 1, 1, 0),
    hinctnt = ifelse(hinctnt <= 3, 0, 1)
  )%>%
  filter(age >= 0) %>% 
  select(hinctnt, origin, gndr, age, ctzcntr:pdwrk, wkhtot, uemp3m, health, rlgblg, dscrgrp, jbspv) %>% 
  arrange(origin) %>% 
  print(hw1)

## Warning: package 'bindrcpp' was built under R version 3.4.4

## # A tibble: 286 x 16
##    hinctnt origin  gndr   age ctzcntr livecntr edulvla edulvlpa edulvlma
##      <dbl> <chr>  <dbl> <dbl>   <int>    <dbl>   <dbl>    <dbl>    <dbl>
##  1       1 AF         0     1       2        0       1        1        1
##  2       1 AF         1     1       1        1       1        1        0
##  3       1 AF         1     1       2        1       1        0        0
##  4       1 AF         1     1       1        1       1        1        0
##  5       1 AF         0     1       1        1       1        1        1
##  6       1 AF         0     0       2        1       1        1        1
##  7       0 AF         1     1       1        1       1        1        0
##  8       1 AF         0     1       2        0       1        1        1
##  9       1 AM         1     1       2        1       1        1        1
## 10       1 AM         0     1       2        1       1        1        1
## # ... with 276 more rows, and 7 more variables: pdwrk <int>, wkhtot <dbl>,
## #   uemp3m <int>, health <dbl>, rlgblg <dbl>, dscrgrp <int>, jbspv <int>

rm(ESS1_7e01); gc()

##           used (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 1853199 99.0    3205452 171.2  3205452 171.2
## Vcells 2048383 15.7   40525969 309.2 46723429 356.5

a. Estimate the ANOVA model first to be sure you have a variation between your groups in the Null Model.

```r
fit_b <- glm(gndr ~ factor(origin), family = "binomial", data = hw1)
anova(fit_b, test = "Chisq")# p-value = 0.00323
```

```
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: gndr
## 
## Terms added sequentially (first to last)
## 
## 
##                Df Deviance Resid. Df Resid. Dev Pr(>Chi)   
## NULL                             282     391.89            
## factor(origin)  8   23.107       274     368.79  0.00323 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

I’m assuming you don’t want a formal write up every time we are testing something (if I’m incorrect in this, I’ll make up for it in the next assignment). Here we are testing for variation between the groups, and with a p-value of 0.00323, at the 0.05 level of significance, we would reject our null hypothesis that the variations were statistically equivalent. Therefore we may conclude that there is some statistical variation between the groups.

Report your variance components, including ICC:

I ran into some issues here. When I left all variables (that I cleaned/tidied from the begining), my fit2 variation (intercept) went to zero, causing my ICC_fit2 to go to zero, which didn’t seem correct (or a strong indication that this model is not a good candidate for multi-level modeling). Through trial and error (either variation going to zero or failing to converge) I came up with the model listed for fit2 that at least has some variation and a non-zero value for ICC.

Variation for the Null model: Intercept: 0.3402 Residual: 3.2899 ICC: 0.0937 Model With variables: Intercept: 0.3106 Residual: 3.2899 ICC: 0.0863

fit1 <- glmer(gndr ~ 1 + (1|origin), data = hw1, family = binomial, subset = complete.cases(hw1))
summary(fit1)#var 0.3402    3.2899

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: gndr ~ 1 + (1 | origin)
##    Data: hw1
##  Subset: complete.cases(hw1)
## 
##      AIC      BIC   logLik deviance df.resid 
##    392.7    400.0   -194.3    388.7      281 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.5192 -0.9212  0.6582  0.8675  1.1412 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  origin (Intercept) 0.3402   0.5833  
## Number of obs: 283, groups:  origin, 9
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.1482     0.3071   0.482     0.63

fit2 <- glmer(gndr ~ age  + ctzcntr + edulvla + health +  edulvla + edulvlpa + edulvlma + rlgblg + 
                 (1|origin), data = hw1, family = binomial, subset= complete.cases(hw1) )
summary(fit2)# variation intercept 0.3106 residual variance 3.2899

## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: gndr ~ age + ctzcntr + edulvla + health + edulvla + edulvlpa +  
##     edulvlma + rlgblg + (1 | origin)
##    Data: hw1
##  Subset: complete.cases(hw1)
## 
##      AIC      BIC   logLik deviance df.resid 
##    385.3    418.1   -183.7    367.3      274 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.9892 -0.9396  0.3921  0.8252  1.7898 
## 
## Random effects:
##  Groups Name        Variance Std.Dev.
##  origin (Intercept) 0.3106   0.5573  
## Number of obs: 283, groups:  origin, 9
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   2.8461     1.5122   1.882  0.05982 . 
## age          -1.1904     1.2314  -0.967  0.33372   
## ctzcntr      -0.5135     0.2958  -1.736  0.08254 . 
## edulvla       0.5575     0.3914   1.424  0.15433   
## health        0.1133     0.4064   0.279  0.78035   
## edulvlpa     -1.0479     0.3742  -2.800  0.00511 **
## edulvlma     -0.5500     0.2743  -2.005  0.04498 * 
## rlgblg       -0.4033     0.2870  -1.405  0.15996   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##          (Intr) age    ctzcnt edulvl health edlvlp edlvlm
## age      -0.846                                          
## ctzcntr  -0.371  0.056                                   
## edulvla  -0.234  0.056  0.065                            
## health   -0.270  0.043  0.015  0.001                     
## edulvlpa -0.186  0.011  0.204 -0.239 -0.071              
## edulvlma  0.064 -0.045 -0.115 -0.173 -0.068 -0.219       
## rlgblg   -0.219  0.073 -0.042  0.067  0.063  0.023  0.150

icc(fit1)#ICC (origin): 0.0937

## 
## Generalized linear mixed model
## 
## Family : binomial (logit)
## Formula: gndr ~ 1 + (1 | origin)
## 
##   ICC (origin): 0.0937

icc_fit1 <- (0.3402)/(0.3402 + (3.14159^2)/3)#0.09371739

icc(fit2)# ICC (origin) 0.0863

## 
## Generalized linear mixed model
## 
## Family : binomial (logit)
## Formula: gndr ~ age + ctzcntr + edulvla + health + edulvla + edulvlpa + edulvlma + rlgblg + (1 | origin)
## 
##   ICC (origin): 0.0863

icc_fit2 <- (0.3106)/(0.3106 + (3.14159^2)/3)#0.08626669

Report the results for the fixed effects from the second model. The only variables of significance were education levels of both parents, pvalues of father=0.00511 and mother pvalue 0.04498. All of the other variables were not signficant contributors. i. How did you derive your p-values.
I looked at the results of R output used the Test statisitic that R-provided, looked on the appropriate statistical table and verified that R was correctly listing the tw osided p-values, and it did.
Compare the NULL model to the model with covariates using the likelihood ratio test, report your result.

anova(fit1, fit2)# p-value 0.003299

## Data: hw1
## Subset: complete.cases(hw1)
## Models:
## fit1: gndr ~ 1 + (1 | origin)
## fit2: gndr ~ age + ctzcntr + edulvla + health + edulvla + edulvlpa + 
## fit2:     edulvlma + rlgblg + (1 | origin)
##      Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)   
## fit1  2 392.66 399.95 -194.33   388.66                            
## fit2  9 385.32 418.13 -183.66   367.32 21.339      7   0.003299 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value of the Likelihood Ratio Test (LRT) is 0.003299. This tells us that at the 0.05 level of significance, we can reject our null hypothesis that the null model is the better fit. This is further supported as the second model has a lower AIC value, lower loglikelihood absolute value and smaller variance. However, the BIC is larger and this may be explained when we analyze the ICC.

What happened to the ICC in the second model compared to the first? Why do you think this happens?

The ICC value of the second model (0.0863) is slightly lower than the ICC of the first model (0.0937). The first thing to note is the low values of ICC, this tells us that the variances withinn the different groups is high, but the variances between the groups is low. The fact that the second model had a lower ICC than the NULL model tells us that the second model has even less variation between the groups.

I think this happened because I chose a dichotomous dependent variable that has a very high amount of variation, within each group, but has almost no variation between the groups. Applying this to our data, we may conclude that the variation in the difference between males and females over all is very high, and, despite my thinking we might be able to find variation by region, there is not a significant variation in the differences on the responses of the survey between men and women from different regions. In plain language, the difference between men and women is high, and we see that the difference between men and women of different origins is not particularly different.

Additionally, the null model had a lower BIC (it is my understanding that lower AIC and BIC scores are desired, I’ve worked with AIC quite a bit in the past, but not BIC), even though the second model was better in every other category. I’m taking a guess that this might be because the null model has the higher ICC and therefore accounts for more variation between the groups and somehow the BIC accounts for this?

Also, while building my model it failed to converge several times (I’m not sure why) and also had its variance go to zero. At the time I was thinking that my ICC should increase and had a hard time wrapping my head around the decrease but after completing the assignment I see why the model was behaving that way. I chose a dependent variable that is probably not a good fit for a linear mixed model

HW1

Craig, Scott

9/15/2018