For this assignment I am using data from the Bristol University website shown here , since there are various datasets included I will be using the data set named 2Lev-XC. This data is a study done on 3,435 children who attended 148 primary schools and 19 secondary schools in Scotland. For this assignment I will focus on two conceptual levels, the Primary school level and the student level. I want to see if the verbal reasoning score from tests pupils took when they entered secondary school is affected by the pupils gender. Additionally I want to see if their local environment (Primary school) has an important influence on the outcome of their verbal reasoning score.
library(Zelig)
library(foreign)
library(knitr)
library(stargazer)
library(dplyr)
library(DescTools)
library(lme4)
library(readstata13)
library(texreg)
library(memisc)
library(magrittr)
dataset<-read.table("C:/Users/Xiomara/Downloads/datasets/2lev-xc.txt")
names(dataset)
## [1] "V1" "V2" "V3" "V4" "V5" "V6"
scotland <- rename(dataset, V1="VRQ", V2="ATTAIN", V3="PID", V4="SEX", V5="SC", V6="SID")
str(scotland)
## 'data.frame': 3435 obs. of 6 variables:
## $ VRQ : num 11 0 -14 -6 -30 -17 -17 -11 -9 -19 ...
## $ ATTAIN: num 10 3 2 3 2 2 4 6 4 2 ...
## $ PID : num 1 1 1 1 1 1 1 1 1 1 ...
## $ SEX : int 0 1 0 0 1 1 1 0 0 0 ...
## $ SC : num 0 0 0 20 0 0 0 0 0 0 ...
## $ SID : num 9 9 9 9 9 9 1 1 9 9 ...
The following model is called a complete-pooling model. I want to see if the pupils gender (SEX) has an effect on their Verbal reasoning score (VQR).
cpooling <- zelig(VRQ ~ SEX , data = scotland, model = "normal", cite=F)
stargazer(cpooling, type = "html", single.row = T)
| Dependent variable: | |
| VRQ | |
| SEX | 2.553*** (0.452) |
| Constant | -3.456*** (0.317) |
| Observations | 3,435 |
| Log Likelihood | -13,745.780 |
| Akaike Inf. Crit. | 27,495.560 |
| Note: | p<0.1; p<0.05; p<0.01 |
The above model shows that the relationship between the gender of the pupil and their verbal reasoning score when they entered secondary school is statistically significant at the 99% confidence level. This model shows that if the pupil is a girl, their verbal reasoning score increases by 2.553.
There is a problem with this model because it completely ignores the nesting structure. I want to see if there is an effect on the pupils verbal reasoning score depending on the primary school that they come from.
The following model shows whether there is an effect on a pupils verbal reasoning score by the sex of the pupil and the primary school they attended measured by the variable (PID).
dcoef <- scotland %>%
group_by(PID) %>%
do(mod = lm(VRQ ~ SEX, data = .))
coef <- dcoef %>% do(data.frame(sexc = coef(.$mod)[2]))
hist(coef$sexc)
coef <- data.frame(coef)
Desc(coef)
##
## -------------------------------------------------------------------------
## 'data.frame': 148 obs. of 1 variable:
## 1 $ sexc: num -0.717 5.083 -27.5 -15.2 3.739 ...
##
## -------------------------------------------------------------------------
## 1 - sexc (numeric)
##
## length n NAs unique 0s mean meanSE
## 148 135 13 = n 0 2.249 0.653
##
## .05 .10 .25 median .75 .90 .95
## -8.985 -7.257 -1.340 2.535 5.906 9.760 12.590
##
## rng sd vcoef mad IQR skew kurt
## 65.750 7.582 3.371 5.057 7.246 0.291 4.571
##
## lowest : -27.5, -15.66667, -15.2, -12.87778, -12.65476
## highest: 14.625, 15, 15, 25.5, 38.25
##
## Shapiro-Wilks normality test p.value : 5.8105e-06
As seen in the model above there is a lot of between Primary school variation in the relationship between pupils gender and their verbal reasoning score. The reason that the complete pooling model produces an estimate of 2.553 is because there are more children from schools where girls have higher verbal reasoning scores than boys.
The following model uses partial pooling which allows for between primary school variations. This model shows that we want a randomely varied primary school intercept.
m1_lme <- lmer(VRQ ~ SEX + (1|PID), data = scotland)
summary(m1_lme)
## Linear mixed model fit by REML ['lmerMod']
## Formula: VRQ ~ SEX + (1 | PID)
## Data: scotland
##
## REML criterion at convergence: 27318.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.7758 -0.6878 0.0160 0.6513 3.2733
##
## Random effects:
## Groups Name Variance Std.Dev.
## PID (Intercept) 17.18 4.145
## Residual 158.86 12.604
## Number of obs: 3435, groups: PID, 148
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -3.4151 0.4852 -7.039
## SEX 2.4311 0.4356 5.581
##
## Correlation of Fixed Effects:
## (Intr)
## SEX -0.445
The above is a random intercept model which does not allow the gender difference in verbal reasoning scores to differ between the primary schools.
This model allows for pupil gender differences in verbal reasoning scores to differ between primary schools. Since we want a gender intercept (SEX) we replace the 1 in the above model which allowed for school intercept with the variable SEX.
m2_lme <- lmer(VRQ ~ SEX + (SEX|PID), data = scotland)
summary(m2_lme)
## Linear mixed model fit by REML ['lmerMod']
## Formula: VRQ ~ SEX + (SEX | PID)
## Data: scotland
##
## REML criterion at convergence: 27316.8
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.7209 -0.6960 0.0181 0.6564 3.2969
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## PID (Intercept) 14.3734 3.7912
## SEX 0.5389 0.7341 1.00
## Residual 158.7140 12.5982
## Number of obs: 3435, groups: PID, 148
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -3.4109 0.4600 -7.415
## SEX 2.4275 0.4404 5.512
##
## Correlation of Fixed Effects:
## (Intr)
## SEX -0.348
stargazer(cpooling, m1_lme, m2_lme, type = "html")
| Dependent variable: | |||
| VRQ | |||
| normal | linear | ||
| mixed-effects | |||
| (1) | (2) | (3) | |
| SEX | 2.553*** | 2.431*** | 2.427*** |
| (0.452) | (0.436) | (0.440) | |
| Constant | -3.456*** | -3.415*** | -3.411*** |
| (0.317) | (0.485) | (0.460) | |
| Observations | 3,435 | 3,435 | 3,435 |
| Log Likelihood | -13,745.780 | -13,659.410 | -13,658.390 |
| Akaike Inf. Crit. | 27,495.560 | 27,326.830 | 27,328.780 |
| Bayesian Inf. Crit. | 27,351.390 | 27,365.630 | |
| Note: | p<0.1; p<0.05; p<0.01 | ||
In the above model we see that all three models are statistically significant at a 99% confidence level. To see which model to use we look aat the Akaike Inf. Crit. which shows that model 3 is the best model to use.
screenreg(list(m1_lme, m2_lme))
##
## =======================================================
## Model 1 Model 2
## -------------------------------------------------------
## (Intercept) -3.42 *** -3.41 ***
## (0.49) (0.46)
## SEX 2.43 *** 2.43 ***
## (0.44) (0.44)
## -------------------------------------------------------
## AIC 27326.83 27328.78
## BIC 27351.39 27365.63
## Log Likelihood -13659.41 -13658.39
## Num. obs. 3435 3435
## Num. groups: PID 148 148
## Variance: PID.(Intercept) 17.18 14.37
## Variance: Residual 158.86 158.71
## Variance: PID.SEX 0.54
## =======================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
The following model demonstrate the relationship between the pupils gender and their verbal reasoning score by primary school but also including the pupils social class scale (SC variable).
m3_lme <- lmer(VRQ ~ SEX + (SEX|PID) + SC, data = scotland)
m4_lme <- lmer(VRQ ~ SEX*PID + (SEX|PID), data = scotland)
screenreg(list(m3_lme, m4_lme))
##
## =======================================================
## Model 1 Model 2
## -------------------------------------------------------
## (Intercept) -4.75 *** -4.05 ***
## (0.46) (0.89)
## SEX 2.57 *** 3.42 ***
## (0.44) (0.82)
## SC 0.19 ***
## (0.02)
## PID 0.01
## (0.01)
## SEX:PID -0.01
## (0.01)
## -------------------------------------------------------
## AIC 27252.96 27345.41
## BIC 27295.95 27394.54
## Log Likelihood -13619.48 -13664.71
## Num. obs. 3435 3435
## Num. groups: PID 148 148
## Variance: PID.(Intercept) 11.85 14.27
## Variance: PID.SEX 0.47 0.67
## Variance: Residual 155.61 158.64
## =======================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
The above model shows that the students social class has a statistically significant effect on the verbal reasoning score of the pupil.
In the following model we want to center the independent variable of social class so we use the mutate function. In order to achieve the centered social class we do social class (SC) minus the mean of social class(SC)
scotland %<>% mutate(CSC = SC - mean(SC))
m4a_lme <- lmer(VRQ ~ SEX*CSC + (SEX| SC), data = scotland)
screenreg(list(m4_lme, m4a_lme))
##
## =======================================================
## Model 1 Model 2
## -------------------------------------------------------
## (Intercept) -4.05 *** -0.27
## (0.89) (3.04)
## SEX 3.42 *** 3.23 ***
## (0.82) (0.75)
## PID 0.01
## (0.01)
## SEX:PID -0.01
## (0.01)
## CSC 0.03
## (0.21)
## SEX:CSC -0.01
## (0.06)
## -------------------------------------------------------
## AIC 27345.41 27237.03
## BIC 27394.54 27286.17
## Log Likelihood -13664.71 -13610.52
## Num. obs. 3435 3435
## Num. groups: PID 148
## Variance: PID.(Intercept) 14.27
## Variance: PID.SEX 0.67
## Variance: Residual 158.64 161.07
## Num. groups: SC 4
## Variance: SC.(Intercept) 29.77
## Variance: SC.SEX 1.24
## =======================================================
## *** p < 0.001, ** p < 0.01, * p < 0.05
In the following model we are trying to see if the differences in verbal reasoning scores is mainly an individual level (student) or a school level thing (Primary schools). What I am trying to do is decompose the verbal reasoning score at a student level and a school level. This model shows a multi-level regression model without an independent variable.
m0_lme <- lmer(VRQ ~ 1 + (1|PID), data = scotland)
summary(m0_lme)
## Linear mixed model fit by REML ['lmerMod']
## Formula: VRQ ~ 1 + (1 | PID)
## Data: scotland
##
## REML criterion at convergence: 27350
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8374 -0.6762 -0.0034 0.6627 3.2920
##
## Random effects:
## Groups Name Variance Std.Dev.
## PID (Intercept) 17.61 4.196
## Residual 160.19 12.657
## Number of obs: 3435, groups: PID, 148
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) -2.2113 0.4388 -5.039
The intra-class correlation is obtained by 17.61/(17.61-160.19)= -.12350