Introduction

For this assignment I am using data from the Bristol University website shown here , since there are various datasets included I will be using the data set named 2Lev-XC. This data is a study done on 3,435 children who attended 148 primary schools and 19 secondary schools in Scotland. For this assignment I will focus on two conceptual levels, the Primary school level and the student level. I want to see if the verbal reasoning score from tests pupils took when they entered secondary school is affected by the pupils gender. Additionally I want to see if their local environment (Primary school) has an important influence on the outcome of their verbal reasoning score.

library(Zelig)
library(foreign)
library(knitr)
library(stargazer)
library(dplyr)
library(DescTools)
library(lme4)
library(readstata13)
library(texreg)
library(memisc)
library(magrittr)
dataset<-read.table("C:/Users/Xiomara/Downloads/datasets/2lev-xc.txt")
names(dataset)
## [1] "V1" "V2" "V3" "V4" "V5" "V6"
scotland <- rename(dataset, V1="VRQ", V2="ATTAIN", V3="PID", V4="SEX", V5="SC", V6="SID")
str(scotland)
## 'data.frame':    3435 obs. of  6 variables:
##  $ VRQ   : num  11 0 -14 -6 -30 -17 -17 -11 -9 -19 ...
##  $ ATTAIN: num  10 3 2 3 2 2 4 6 4 2 ...
##  $ PID   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ SEX   : int  0 1 0 0 1 1 1 0 0 0 ...
##  $ SC    : num  0 0 0 20 0 0 0 0 0 0 ...
##  $ SID   : num  9 9 9 9 9 9 1 1 9 9 ...

Using Complete Pooling Model

The following model is called a complete-pooling model. I want to see if the pupils gender (SEX) has an effect on their Verbal reasoning score (VQR).

cpooling <- zelig(VRQ ~ SEX , data = scotland, model = "normal", cite=F)
stargazer(cpooling, type = "html", single.row = T)
Dependent variable:
VRQ
SEX 2.553*** (0.452)
Constant -3.456*** (0.317)
Observations 3,435
Log Likelihood -13,745.780
Akaike Inf. Crit. 27,495.560
Note: p<0.1; p<0.05; p<0.01

The above model shows that the relationship between the gender of the pupil and their verbal reasoning score when they entered secondary school is statistically significant at the 99% confidence level. This model shows that if the pupil is a girl, their verbal reasoning score increases by 2.553.

There is a problem with this model because it completely ignores the nesting structure. I want to see if there is an effect on the pupils verbal reasoning score depending on the primary school that they come from.

Using No-Pooling Model

The following model shows whether there is an effect on a pupils verbal reasoning score by the sex of the pupil and the primary school they attended measured by the variable (PID).

dcoef <- scotland %>% 
    group_by(PID) %>% 
    do(mod = lm(VRQ ~ SEX, data = .))
coef <- dcoef %>% do(data.frame(sexc = coef(.$mod)[2]))
hist(coef$sexc)

coef <- data.frame(coef)
Desc(coef)
## 
## -------------------------------------------------------------------------
## 'data.frame':    148 obs. of  1 variable:
##  1 $ sexc: num  -0.717 5.083 -27.5 -15.2 3.739 ...
## 
## ------------------------------------------------------------------------- 
## 1 - sexc (numeric)
## 
##   length      n    NAs unique     0s   mean meanSE
##      148    135     13    = n      0  2.249  0.653
## 
##      .05    .10    .25 median    .75    .90    .95
##   -8.985 -7.257 -1.340  2.535  5.906  9.760 12.590
## 
##      rng     sd  vcoef    mad    IQR   skew   kurt
##   65.750  7.582  3.371  5.057  7.246  0.291  4.571
##  
## lowest : -27.5, -15.66667, -15.2, -12.87778, -12.65476
## highest: 14.625, 15, 15, 25.5, 38.25
## 
## Shapiro-Wilks normality test  p.value : 5.8105e-06

As seen in the model above there is a lot of between Primary school variation in the relationship between pupils gender and their verbal reasoning score. The reason that the complete pooling model produces an estimate of 2.553 is because there are more children from schools where girls have higher verbal reasoning scores than boys.

Partial Pooling - Multi Level Model 1

The following model uses partial pooling which allows for between primary school variations. This model shows that we want a randomely varied primary school intercept.

m1_lme <- lmer(VRQ ~ SEX + (1|PID), data = scotland)
summary(m1_lme)
## Linear mixed model fit by REML ['lmerMod']
## Formula: VRQ ~ SEX + (1 | PID)
##    Data: scotland
## 
## REML criterion at convergence: 27318.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.7758 -0.6878  0.0160  0.6513  3.2733 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  PID      (Intercept)  17.18    4.145  
##  Residual             158.86   12.604  
## Number of obs: 3435, groups:  PID, 148
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  -3.4151     0.4852  -7.039
## SEX           2.4311     0.4356   5.581
## 
## Correlation of Fixed Effects:
##     (Intr)
## SEX -0.445

The above is a random intercept model which does not allow the gender difference in verbal reasoning scores to differ between the primary schools.

Multi-Level Model 2

This model allows for pupil gender differences in verbal reasoning scores to differ between primary schools. Since we want a gender intercept (SEX) we replace the 1 in the above model which allowed for school intercept with the variable SEX.

m2_lme <- lmer(VRQ ~ SEX + (SEX|PID), data = scotland)
summary(m2_lme)
## Linear mixed model fit by REML ['lmerMod']
## Formula: VRQ ~ SEX + (SEX | PID)
##    Data: scotland
## 
## REML criterion at convergence: 27316.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.7209 -0.6960  0.0181  0.6564  3.2969 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  PID      (Intercept)  14.3734  3.7912      
##           SEX           0.5389  0.7341  1.00
##  Residual             158.7140 12.5982      
## Number of obs: 3435, groups:  PID, 148
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  -3.4109     0.4600  -7.415
## SEX           2.4275     0.4404   5.512
## 
## Correlation of Fixed Effects:
##     (Intr)
## SEX -0.348
stargazer(cpooling, m1_lme, m2_lme, type = "html")
Dependent variable:
VRQ
normal linear
mixed-effects
(1) (2) (3)
SEX 2.553*** 2.431*** 2.427***
(0.452) (0.436) (0.440)
Constant -3.456*** -3.415*** -3.411***
(0.317) (0.485) (0.460)
Observations 3,435 3,435 3,435
Log Likelihood -13,745.780 -13,659.410 -13,658.390
Akaike Inf. Crit. 27,495.560 27,326.830 27,328.780
Bayesian Inf. Crit. 27,351.390 27,365.630
Note: p<0.1; p<0.05; p<0.01

In the above model we see that all three models are statistically significant at a 99% confidence level. To see which model to use we look aat the Akaike Inf. Crit. which shows that model 3 is the best model to use.

screenreg(list(m1_lme, m2_lme))
## 
## =======================================================
##                            Model 1        Model 2      
## -------------------------------------------------------
## (Intercept)                    -3.42 ***      -3.41 ***
##                                (0.49)         (0.46)   
## SEX                             2.43 ***       2.43 ***
##                                (0.44)         (0.44)   
## -------------------------------------------------------
## AIC                         27326.83       27328.78    
## BIC                         27351.39       27365.63    
## Log Likelihood             -13659.41      -13658.39    
## Num. obs.                    3435           3435       
## Num. groups: PID              148            148       
## Variance: PID.(Intercept)      17.18          14.37    
## Variance: Residual            158.86         158.71    
## Variance: PID.SEX                              0.54    
## =======================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

Multi-Level 3 & 4

The following model demonstrate the relationship between the pupils gender and their verbal reasoning score by primary school but also including the pupils social class scale (SC variable).

m3_lme <- lmer(VRQ ~ SEX + (SEX|PID) + SC, data = scotland)
m4_lme <- lmer(VRQ ~ SEX*PID + (SEX|PID), data = scotland)
screenreg(list(m3_lme, m4_lme))
## 
## =======================================================
##                            Model 1        Model 2      
## -------------------------------------------------------
## (Intercept)                    -4.75 ***      -4.05 ***
##                                (0.46)         (0.89)   
## SEX                             2.57 ***       3.42 ***
##                                (0.44)         (0.82)   
## SC                              0.19 ***               
##                                (0.02)                  
## PID                                            0.01    
##                                               (0.01)   
## SEX:PID                                       -0.01    
##                                               (0.01)   
## -------------------------------------------------------
## AIC                         27252.96       27345.41    
## BIC                         27295.95       27394.54    
## Log Likelihood             -13619.48      -13664.71    
## Num. obs.                    3435           3435       
## Num. groups: PID              148            148       
## Variance: PID.(Intercept)      11.85          14.27    
## Variance: PID.SEX               0.47           0.67    
## Variance: Residual            155.61         158.64    
## =======================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

The above model shows that the students social class has a statistically significant effect on the verbal reasoning score of the pupil.

Centering the independent variable

In the following model we want to center the independent variable of social class so we use the mutate function. In order to achieve the centered social class we do social class (SC) minus the mean of social class(SC)

scotland %<>% mutate(CSC = SC - mean(SC))
m4a_lme <- lmer(VRQ ~ SEX*CSC + (SEX| SC), data = scotland)
screenreg(list(m4_lme, m4a_lme))
## 
## =======================================================
##                            Model 1        Model 2      
## -------------------------------------------------------
## (Intercept)                    -4.05 ***      -0.27    
##                                (0.89)         (3.04)   
## SEX                             3.42 ***       3.23 ***
##                                (0.82)         (0.75)   
## PID                             0.01                   
##                                (0.01)                  
## SEX:PID                        -0.01                   
##                                (0.01)                  
## CSC                                            0.03    
##                                               (0.21)   
## SEX:CSC                                       -0.01    
##                                               (0.06)   
## -------------------------------------------------------
## AIC                         27345.41       27237.03    
## BIC                         27394.54       27286.17    
## Log Likelihood             -13664.71      -13610.52    
## Num. obs.                    3435           3435       
## Num. groups: PID              148                      
## Variance: PID.(Intercept)      14.27                   
## Variance: PID.SEX               0.67                   
## Variance: Residual            158.64         161.07    
## Num. groups: SC                                4       
## Variance: SC.(Intercept)                      29.77    
## Variance: SC.SEX                               1.24    
## =======================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05

Intra-Class correlation

In the following model we are trying to see if the differences in verbal reasoning scores is mainly an individual level (student) or a school level thing (Primary schools). What I am trying to do is decompose the verbal reasoning score at a student level and a school level. This model shows a multi-level regression model without an independent variable.

m0_lme <- lmer(VRQ ~ 1 + (1|PID), data = scotland)
summary(m0_lme)
## Linear mixed model fit by REML ['lmerMod']
## Formula: VRQ ~ 1 + (1 | PID)
##    Data: scotland
## 
## REML criterion at convergence: 27350
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8374 -0.6762 -0.0034  0.6627  3.2920 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  PID      (Intercept)  17.61    4.196  
##  Residual             160.19   12.657  
## Number of obs: 3435, groups:  PID, 148
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  -2.2113     0.4388  -5.039

The intra-class correlation is obtained by 17.61/(17.61-160.19)= -.12350