Class 6 Multilevel Regression Part 2

Prepare the data set and Review the research question

These data are a sub-sample from the 1982 High School and Beyond (HS&B) Survey, and include information on 7185 students nested within 160 schools: 90 public and 70 catholic.

# Import the dataset into R
library(haven)
hsb <- read_sav("hsbdataset.sav")
# Take a glance at the first 10 lines of the data
head(hsb,10)

## # A tibble: 10 x 9
##    school student     ses meanses    cses mathach   sector  himinty  female
##     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <dbl+lb> <dbl+lb> <dbl+l>
##  1   1224       1 -1.53    -0.428 -1.10     5.88  0 [publ… 0 [less… 1 [fem…
##  2   1224       2 -0.588   -0.428 -0.160   19.7   0 [publ… 0 [less… 1 [fem…
##  3   1224       3 -0.528   -0.428 -0.1000  20.3   0 [publ… 0 [less… 0 [mal…
##  4   1224       4 -0.668   -0.428 -0.240    8.78  0 [publ… 0 [less… 0 [mal…
##  5   1224       5 -0.158   -0.428  0.270   17.9   0 [publ… 0 [less… 0 [mal…
##  6   1224       6  0.0220  -0.428  0.450    4.58  0 [publ… 0 [less… 0 [mal…
##  7   1224       7 -0.618   -0.428 -0.190   -2.83  0 [publ… 0 [less… 1 [fem…
##  8   1224       8 -0.998   -0.428 -0.570    0.523 0 [publ… 0 [less… 0 [mal…
##  9   1224       9 -0.888   -0.428 -0.460    1.53  0 [publ… 0 [less… 1 [fem…
## 10   1224      10 -0.458   -0.428 -0.0300  21.5   0 [publ… 0 [less… 0 [mal…

# Check the structure of the data set
str(hsb)

## Classes 'tbl_df', 'tbl' and 'data.frame':    7185 obs. of  9 variables:
##  $ school : num  1224 1224 1224 1224 1224 ...
##   ..- attr(*, "format.spss")= chr "F10.0"
##  $ student: num  1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "format.spss")= chr "F9.2"
##  $ ses    : num  -1.528 -0.588 -0.528 -0.668 -0.158 ...
##   ..- attr(*, "format.spss")= chr "F9.2"
##  $ meanses: num  -0.428 -0.428 -0.428 -0.428 -0.428 ...
##   ..- attr(*, "format.spss")= chr "F9.2"
##  $ cses   : num  -1.1 -0.16 -0.1 -0.24 0.27 ...
##   ..- attr(*, "format.spss")= chr "F9.2"
##  $ mathach: num  5.88 19.71 20.35 8.78 17.9 ...
##   ..- attr(*, "format.spss")= chr "F9.2"
##  $ sector : 'haven_labelled' num  0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "format.spss")= chr "F9.0"
##   ..- attr(*, "labels")= Named num  0 1
##   .. ..- attr(*, "names")= chr  "public" "catholic"
##  $ himinty: 'haven_labelled' num  0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "format.spss")= chr "F9.0"
##   ..- attr(*, "labels")= Named num  0 1
##   .. ..- attr(*, "names")= chr  "less than 40% minority enrollment" "more than 40% minority enrollment"
##  $ female : 'haven_labelled' num  1 1 0 0 0 0 1 0 1 0 ...
##   ..- attr(*, "format.spss")= chr "F9.0"
##   ..- attr(*, "labels")= Named num  0 1
##   .. ..- attr(*, "names")= chr  "male" "female"

# Check for the general summaries for each variable
summary(hsb)

##      school        student           ses               meanses         
##  Min.   :1224   Min.   : 1.00   Min.   :-3.758000   Min.   :-1.188000  
##  1st Qu.:3020   1st Qu.:12.00   1st Qu.:-0.538000   1st Qu.:-0.317000  
##  Median :5192   Median :23.00   Median : 0.002000   Median : 0.038000  
##  Mean   :5278   Mean   :24.51   Mean   : 0.000143   Mean   : 0.006138  
##  3rd Qu.:7342   3rd Qu.:36.00   3rd Qu.: 0.602000   3rd Qu.: 0.333000  
##  Max.   :9586   Max.   :67.00   Max.   : 2.692000   Max.   : 0.831000  
##       cses              mathach           sector          himinty    
##  Min.   :-3.657000   Min.   :-2.832   Min.   :0.0000   Min.   :0.00  
##  1st Qu.:-0.454000   1st Qu.: 7.275   1st Qu.:0.0000   1st Qu.:0.00  
##  Median : 0.010000   Median :13.131   Median :0.0000   Median :0.00  
##  Mean   :-0.005995   Mean   :12.748   Mean   :0.4931   Mean   :0.28  
##  3rd Qu.: 0.463000   3rd Qu.:18.317   3rd Qu.:1.0000   3rd Qu.:1.00  
##  Max.   : 2.850000   Max.   :24.993   Max.   :1.0000   Max.   :1.00  
##      female      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.5282  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

# Students nested within schools
# Three student-level variabels (Level-1): mathach, ses, gender.
# Three school-level variables (Level-2): sector, meanses, himinty.
# Load the statistical package that we needed for the multilevel regression
library(nlme)

A quick Review of the Research Questions:

1. How much do U.S. high schools vary in their mean math achievement?
1. Do schools with high MEAN SES also have high math achievement?
1. Is the strength of association between student CSES and math achievement similar across schools? Or is CSES a better predictor of student math achievement in some schools than others?
1. How do public and Catholic schools compare in terms of mean math achievement and in terms of the strength of the SES-math achievement relationship, after we control for MEAN SES?

Section 1. Centering for continuous X

# Bascially, we need two centered variable. One is the centered grand mean, the other is the centered group mean.
# Find the general ses mean value frist.
ses_mean <- mean(hsb$ses)
# Build a new variable called Grand_CSES
hsb$Grand_CSES <- hsb$ses-ses_mean
# Now, we need to get the group centered mean for each school
# We need the group_by function in "dplyr" package.
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:nlme':
## 
##     collapse

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Calculate the Group Mean within each school.
newhsb <- hsb %>%
    group_by(school) %>%
    mutate(Group_mean = mean(as.numeric(as.character(ses))))
# Add a new variable as Group_CSES
newhsb$Group_CSES <- newhsb$ses - newhsb$Group_mean

Section 2: Q1-How much do U.S. high schools vary in their mean math achievement?

# This is our unconditional model
newhsb$school <-as.factor(newhsb$school)
UnconditionalModel<- lme(mathach~1,random=~1|school,data=newhsb,method="ML")
summary(UnconditionalModel)

## Linear mixed-effects model fit by maximum likelihood
##  Data: newhsb 
##        AIC      BIC    logLik
##   47121.81 47142.45 -23557.91
## 
## Random effects:
##  Formula: ~1 | school
##         (Intercept) Residual
## StdDev:    2.924631 6.256868
## 
## Fixed effects: mathach ~ 1 
##                Value Std.Error   DF  t-value p-value
## (Intercept) 12.63707 0.2436341 7025 51.86905       0
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.06261752 -0.75364531  0.02675562  0.76070102  2.74183820 
## 
## Number of Observations: 7185
## Number of Groups: 160

VarCorr(UnconditionalModel) # A parameter estimate of τ_00 is 8.55

## school = pdLogChol(1) 
##             Variance  StdDev  
## (Intercept)  8.553464 2.924631
## Residual    39.148400 6.256868

logLik(UnconditionalModel)*-2 # -2LL is equal to 47115.81, df=3

## 'log Lik.' 47115.81 (df=3)

# Load our ICC calculation function
ICC <- function(out) {
  varests <- as.numeric(VarCorr(out)[1:nrow(VarCorr(out))])
  return(paste("ICC=",varests[1]/ (varests[1]+varests[-1])))
}
Grand_mean <- mean(newhsb$mathach) # The grand mean is 12.64.
# Calculate the ICC for unconditional model
ICC(UnconditionalModel) # ICC is 17.93%

## [1] "ICC= 0.179310896530165"

Answering Q1:

A parameter estimate of τ_00 is 8.55. The population variance among the level-2 unit means is 8.55.
The average achievement across all school averages (i.e, the grand mean) is γ00 = 12.74.
ICC is equal to 17.92%. This tells us that about 18% of the variability in math achievement difference occurs between schools, with the remaining 82% occuring within level-2. It’s good to use a multilevel model.

Section 3: Q2-Do schools with high MEAN SES also have high math achievement?

# To address this question, we need to build a fixed intercept model.
Q2_model <- lme(mathach~meanses,random=~1|school,data=newhsb,method="ML")
summary(Q2_model)

## Linear mixed-effects model fit by maximum likelihood
##  Data: newhsb 
##        AIC      BIC    logLik
##   46967.11 46994.63 -23479.55
## 
## Random effects:
##  Formula: ~1 | school
##         (Intercept) Residual
## StdDev:    1.610337 6.257581
## 
## Fixed effects: mathach ~ meanses 
##                 Value Std.Error   DF  t-value p-value
## (Intercept) 12.649741 0.1483385 7025 85.27619       0
## meanses      5.862922 0.3591754  158 16.32328       0
##  Correlation: 
##         (Intr)
## meanses -0.004
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.13467044 -0.75260775  0.02335172  0.76835668  2.78398911 
## 
## Number of Observations: 7185
## Number of Groups: 160

VarCorr(Q2_model) # A parameter estimate of τ_00 is 2.59

## school = pdLogChol(1) 
##             Variance  StdDev  
## (Intercept)  2.593186 1.610337
## Residual    39.157321 6.257581

logLik(Q2_model)*-2 # -2LL is equal to 46959.11, df=4

## 'log Lik.' 46959.11 (df=4)

# Calculate the ICC for Q2 model
ICC(Q2_model) # ICC is 6.21%

## [1] "ICC= 0.0621114852569335"

# Compare the Q2_model with unconditional model
anova(UnconditionalModel,Q2_model)

##                    Model df      AIC      BIC    logLik   Test  L.Ratio
## UnconditionalModel     1  3 47121.81 47142.45 -23557.90                
## Q2_model               2  4 46967.11 46994.63 -23479.55 1 vs 2 156.7017
##                    p-value
## UnconditionalModel        
## Q2_model            <.0001

Answering Q2:

The estimate of γ_00 is 12.65, indicating the average math achievement when (MEAN SES)_j = 0.
The Estimate of γ_01 is 5.86, indicating the average math achievement change when (MEAN SES)_j increases by one unit.
The t value of the MEANSES slope γ_01 = 16.32, with df=158, p<0.01. The MEAN SES effect is significant, which means the schools with high MEAN SES also had high math achievement at alpha= .05/.01.
Estimate of τ_00 = 2.59. The population variance among the level-2 unit means is less than the one in unconditional model (which is 8.55).Estimate of σ^2 = 39.16, which is the variance of Level-1.
The ICC is 6.21%, which measures the degree of dependence among observations within schools that are of the same MEANSES.
We can also compute the proportion of variance that can be explained by MEANSES. (See details on the class slides)

Section 4: Q3-Is the strength of association between student CSES and math achievement similar across schools? Or is CSES a better predictor of student math achievement in some schools than others?

# To address this question, we need to build a random intercept and slope model.
Q3_model <- lme(mathach~cses,random=~cses|school,data=newhsb,method="ML")
summary(Q3_model)

## Linear mixed-effects model fit by maximum likelihood
##  Data: newhsb 
##        AIC      BIC    logLik
##   46722.98 46764.26 -23355.49
## 
## Random effects:
##  Formula: ~cses | school
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept) 2.9361736 (Intr)
## cses        0.8235253 0.021 
## Residual    6.0580570       
## 
## Fixed effects: mathach ~ cses 
##                 Value Std.Error   DF  t-value p-value
## (Intercept) 12.649431 0.2437717 7024 51.89049       0
## cses         2.193148 0.1278639 7024 17.15221       0
##  Correlation: 
##      (Intr)
## cses 0.013 
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.09688177 -0.73200318  0.01795852  0.75445130  2.89904649 
## 
## Number of Observations: 7185
## Number of Groups: 160

# Estimate of γ_00 = 12.65, which is the average math achievement across the population of schools.
# Estimate of γ_10 = 2.19, which is the average SES-math slope across those schools.
VarCorr(Q3_model)

## school = pdLogChol(cses) 
##             Variance  StdDev    Corr  
## (Intercept)  8.621115 2.9361736 (Intr)
## cses         0.678194 0.8235253 0.021 
## Residual    36.700055 6.0580570

# Estimate of τ_00 is 8.62. This is the population variance among the Level-2 unit means.
# Estimate of τ_11 is 0.68. This is the population variance among the slope.
# Estimate of σ^2 is 36.70. This is the variance among the Level-1.
# In R, we can simply use the anova function to our model. It provides us the F-value and P value of the intercept which is the same with the wald Z test. 
anova(Q3_model)

##             numDF denDF   F-value p-value
## (Intercept)     1  7024 2670.8307  <.0001
## cses            1  7024  294.1983  <.0001

# From the result, we could tell the esitmated variance among the mean is τ_00 is 8.62, F=2670.83, p<.0001. 

logLik(Q3_model)*-2 # -2LL is equal to 46710.98, df=4

## 'log Lik.' 46710.98 (df=6)

# Calculate the ICC for Q3 model
ICC(Q3_model) # ICC is 19.02%

## [1] "ICC= 0.927070495237872" "ICC= 0.190222692838689"

# Compare the Q3_model with unconditional model and Q2 Model
anova(Q3_model,Q2_model,UnconditionalModel)

##                    Model df      AIC      BIC    logLik   Test  L.Ratio
## Q3_model               1  6 46722.98 46764.26 -23355.49                
## Q2_model               2  4 46967.11 46994.63 -23479.55 1 vs 2 248.1283
## UnconditionalModel     3  3 47121.81 47142.45 -23557.90 2 vs 3 156.7017
##                    p-value
## Q3_model                  
## Q2_model            <.0001
## UnconditionalModel  <.0001

Answering Q3:

The t-value of the CSES slope γ_10 is 17.15, and significant. We find that, on average, student SES is significantly related (p<.001) to math achievement within schools.
The estimated variamce among the means is τ_00 = 8.62 with a F = 2670.831. We infer that highly significant differences exist among the 160 school means (p<.001).
The estimated variamce among the means is τ_11 = 0.68 with a F = 294.198. We reject the null hypothesis, in this case that τ_11 = 0.68, and infer that the relationship between SES and math achievement within schools does indeed vary significantly across the population of schools.
THe ICC is 19.02%. We can also compute the proportion of variance that can be explained by CSES. (Check Slides for details)

Section 5: Q4-How do public and Catholic schools compare in terms of mean math achievement and in terms of the strength of the SES-math achievement relationship, after we control for MEAN SES?

# To address this question, we need to build a random intercept and slope model with two extra interaction terms. 
# The difference between Q4 model and Q3 model is, we use group mean centered SES(CSES) variable instead of "SES" for easier interpretation.
Q4_model <- lme(mathach~meanses + sector + Group_CSES + meanses:Group_CSES + sector:Group_CSES, random=~cses|school,data=newhsb,method="ML")
summary(Q4_model)

## Linear mixed-effects model fit by maximum likelihood
##  Data: newhsb 
##        AIC      BIC    logLik
##   46516.43 46585.23 -23248.21
## 
## Random effects:
##  Formula: ~cses | school
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept) 1.5227660 (Intr)
## cses        0.2552747 0.485 
## Residual    6.0598017       
## 
## Fixed effects: mathach ~ meanses + sector + Group_CSES + meanses:Group_CSES +      sector:Group_CSES 
##                        Value Std.Error   DF  t-value p-value
## (Intercept)        12.096013 0.1969220 7022 61.42541   0e+00
## meanses             5.331710 0.3656888  157 14.57991   0e+00
## sector              1.226729 0.3033774  157  4.04357   1e-04
## Group_CSES          2.939386 0.1535554 7022 19.14218   0e+00
## meanses:Group_CSES  1.042425 0.2961537 7022  3.51988   4e-04
## sector:Group_CSES  -1.643907 0.2374524 7022 -6.92310   0e+00
##  Correlation: 
##                    (Intr) meanss sector G_CSES m:G_CS
## meanses             0.245                            
## sector             -0.697 -0.356                     
## Group_CSES          0.075  0.019 -0.053              
## meanses:Group_CSES  0.018  0.074 -0.026  0.283       
## sector:Group_CSES  -0.052 -0.027  0.077 -0.694 -0.351
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.16092105 -0.72444280  0.01678312  0.75488891  2.95805927 
## 
## Number of Observations: 7185
## Number of Groups: 160

# Estimate of γ_00 = 12.11,represents the average true intercept for the public school(SECTOR = 0) has MEANSES = 0; It is significantly different from zero..
# Estimate of γ_01 = 5.33, represents the group performance difference in the intercepts for public school (or MEAN SES effects on school means after controlling for sector effect). It is significantly different from zero; We estimate γ_01 to study whether high-SES schools differ from low-SES schools in mean math achievement after controlling for SECTOR.
# Estimate of γ_02 = 1.23, represents the group performance difference in the intercepts when MEAN SES = 0. It is significantly different from zero; We estimate γ_02 to learn whether Catholic schools differ from public schools in terms of the mean achievement once MEAN SES is controlled. 
# γ_00 + γ_02 = 12.11+1.21 represents the average true intercept for the private school(SECTOR=1) has MEAN SES = 0; It is significantly different from zero.
VarCorr(Q4_model)

## school = pdLogChol(cses) 
##             Variance    StdDev    Corr  
## (Intercept)  2.31881618 1.5227660 (Intr)
## cses         0.06516518 0.2552747 0.485 
## Residual    36.72119652 6.0598017

# Estimate of τ_00 is 2.31. This is the population variance among the Level-2 unit means. The significant variation in the intercepts remains unexplained even after controlling for SECTOR and MEAN SES.
# Estimate of τ_11 is 0.07. This is the population variance among the slope. No significant variation in the slopes remains unexplained even after controlling for SECTOR and MEAN SES.
# Estimate of σ^2 is 36.72. This is the variance among the Level-1.
# In R, we can simply use the anova function to our model. It provides us the F-value and P value of the intercept which is the same with the wald Z test. 
anova(Q4_model)

##                    numDF denDF  F-value p-value
## (Intercept)            1  7022 7825.298  <.0001
## meanses                1   157  295.507  <.0001
## sector                 1   157   20.810  <.0001
## Group_CSES             1  7022  395.381  <.0001
## meanses:Group_CSES     1  7022    1.353  0.2448
## sector:Group_CSES      1  7022   47.929  <.0001

# From the result, we could tell the esitmated variance among the mean is τ_00 is 8.62, F=2670.83, p<.0001. 
logLik(Q4_model)*-2 # -2LL is equal to 46496.43, df=10

## 'log Lik.' 46496.43 (df=10)

# Calculate the ICC for Q4 model
ICC(Q4_model) # ICC is 5.94%, this tells us that about 6% of the variability in math achievement difference occurs between schools. It is okay to use a multilevel model.

## [1] "ICC= 0.972665398692547"  "ICC= 0.0593958869280798"

# Compare the Q4_model with Q3_model, unconditional model and Q2 Model
anova(Q4_model,Q3_model,Q2_model,UnconditionalModel)

##                    Model df      AIC      BIC    logLik   Test  L.Ratio
## Q4_model               1 10 46516.43 46585.23 -23248.22                
## Q3_model               2  6 46722.98 46764.26 -23355.49 1 vs 2 214.5510
## Q2_model               3  4 46967.11 46994.63 -23479.55 2 vs 3 248.1283
## UnconditionalModel     4  3 47121.81 47142.45 -23557.90 3 vs 4 156.7017
##                    p-value
## Q4_model                  
## Q3_model            <.0001
## Q2_model            <.0001
## UnconditionalModel  <.0001

Answering Q4:

MEAN SES is positively related to school mean math achievement.
Catholic schools have significantly higher mean achievement than do public schools, controlling for the effect of MEAN SES.