This week, we are going to use a larger version of the HSB dataset back from 1988. Same structure- students nested within schools. This is the same dataset that Garson uses in Chapter 3, so you can recreate what he has as well.
library(tidyverse)
library(Hmisc) # label()
library(lme4)
library(lmerTest)
glimpse to Summarize the Data Structurehsb <- read_csv("hsbmerged.csv")
Parsed with column specification:
cols(
schoolid = [32mcol_double()[39m,
minority = [32mcol_double()[39m,
female = [32mcol_double()[39m,
ses = [32mcol_double()[39m,
mathach = [32mcol_double()[39m,
size = [32mcol_double()[39m,
sector = [32mcol_double()[39m,
pracad = [32mcol_double()[39m,
disclim = [32mcol_double()[39m,
himinty = [32mcol_double()[39m,
meanses = [32mcol_double()[39m
)
glimpse(hsb)
Rows: 7,185
Columns: 11
$ schoolid [3m[38;5;246m<dbl>[39m[23m 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1224, 1…
$ minority [3m[38;5;246m<dbl>[39m[23m 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ female [3m[38;5;246m<dbl>[39m[23m 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0…
$ ses [3m[38;5;246m<dbl>[39m[23m -1.528, -0.588, -0.528, -0.668, -0.158, 0.022, -0.618, -0.998, -0.888, -0.458, -1.448, -0.658, -0.468, -0.988…
$ mathach [3m[38;5;246m<dbl>[39m[23m 5.876, 19.708, 20.349, 8.781, 17.898, 4.583, -2.832, 0.523, 1.527, 21.521, 9.475, 16.057, 21.178, 20.178, 20.…
$ size [3m[38;5;246m<dbl>[39m[23m 842, 842, 842, 842, 842, 842, 842, 842, 842, 842, 842, 842, 842, 842, 842, 842, 842, 842, 842, 842, 842, 842,…
$ sector [3m[38;5;246m<dbl>[39m[23m 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ pracad [3m[38;5;246m<dbl>[39m[23m 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35, 0…
$ disclim [3m[38;5;246m<dbl>[39m[23m 1.597, 1.597, 1.597, 1.597, 1.597, 1.597, 1.597, 1.597, 1.597, 1.597, 1.597, 1.597, 1.597, 1.597, 1.597, 1.59…
$ himinty [3m[38;5;246m<dbl>[39m[23m 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ meanses [3m[38;5;246m<dbl>[39m[23m -0.428, -0.428, -0.428, -0.428, -0.428, -0.428, -0.428, -0.428, -0.428, -0.428, -0.428, -0.428, -0.428, -0.42…
summary command from base R is a quick and easy way to explore a new dataset.summary(hsb)
schoolid minority female ses mathach size sector
Min. :1224 Min. :0.0000 Min. :0.0000 Min. :-3.758000 Min. :-2.832 Min. : 100 Min. :0.0000
1st Qu.:3020 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:-0.538000 1st Qu.: 7.275 1st Qu.: 565 1st Qu.:0.0000
Median :5192 Median :0.0000 Median :1.0000 Median : 0.002000 Median :13.131 Median :1016 Median :0.0000
Mean :5278 Mean :0.2747 Mean :0.5282 Mean : 0.000143 Mean :12.748 Mean :1057 Mean :0.4931
3rd Qu.:7342 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 0.602000 3rd Qu.:18.317 3rd Qu.:1436 3rd Qu.:1.0000
Max. :9586 Max. :1.0000 Max. :1.0000 Max. : 2.692000 Max. :24.993 Max. :2713 Max. :1.0000
pracad disclim himinty meanses
Min. :0.0000 Min. :-2.4160 Min. :0.00 Min. :-1.188000
1st Qu.:0.3200 1st Qu.:-0.8170 1st Qu.:0.00 1st Qu.:-0.317000
Median :0.5300 Median :-0.2310 Median :0.00 Median : 0.038000
Mean :0.5345 Mean :-0.1319 Mean :0.28 Mean : 0.006138
3rd Qu.:0.7000 3rd Qu.: 0.4600 3rd Qu.:1.00 3rd Qu.: 0.333000
Max. :1.0000 Max. : 2.7560 Max. :1.00 Max. : 0.831000
Summary gives you a quick, high-level summary of all the variables in a dataset. It is also really useful for your own records. You can just copy and paste into a Word file and have it for your records.
Now, data cleaning and coding! We’re going to learn how to apply labels to (1) variables themselves, and (2) the values within a given variable. Both are useful and a good practice for cleaning up your data.
str(hsb$sector)
'labelled' num [1:7185] 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, "label")= chr "School Sector"
sector and give specific values of sector a label:str(hsb.clean$sector.factor)
Factor w/ 2 levels "Public","Parochial": 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "label")= chr "School Sector"
table(hsb.clean$sector.factor)
Public Parochial
3642 3543
summary again - notice how it handles factors…summary(hsb.clean)
schoolid minority female ses mathach size sector
Min. :1224 Min. :0.0000 Min. :0.0000 Min. :-3.758000 Min. :-2.832 Min. : 100 Min. :0.0000
1st Qu.:3020 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:-0.538000 1st Qu.: 7.275 1st Qu.: 565 1st Qu.:0.0000
Median :5192 Median :0.0000 Median :1.0000 Median : 0.002000 Median :13.131 Median :1016 Median :0.0000
Mean :5278 Mean :0.2747 Mean :0.5282 Mean : 0.000143 Mean :12.748 Mean :1057 Mean :0.4931
3rd Qu.:7342 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 0.602000 3rd Qu.:18.317 3rd Qu.:1436 3rd Qu.:1.0000
Max. :9586 Max. :1.0000 Max. :1.0000 Max. : 2.692000 Max. :24.993 Max. :2713 Max. :1.0000
pracad disclim himinty meanses sector.factor female.factor minority.factor
Min. :0.0000 Min. :-2.4160 Min. :0.00 Min. :-1.188000 Public :3642 Male :3390 Non-Minority:5211
1st Qu.:0.3200 1st Qu.:-0.8170 1st Qu.:0.00 1st Qu.:-0.317000 Parochial:3543 Female:3795 Minority :1974
Median :0.5300 Median :-0.2310 Median :0.00 Median : 0.038000
Mean :0.5345 Mean :-0.1319 Mean :0.28 Mean : 0.006138
3rd Qu.:0.7000 3rd Qu.: 0.4600 3rd Qu.:1.00 3rd Qu.: 0.333000
Max. :1.0000 Max. : 2.7560 Max. :1.00 Max. : 0.831000
mathach variable using the describe function from the Hmisc package:describe(hsb.clean$mathach)
hsb.clean$mathach : Std. Math Score
n missing distinct Info Mean Gmd .05 .10 .25 .50 .75 .90 .95
7185 0 6031 1 12.75 7.91 1.210 3.301 7.275 13.131 18.317 22.016 23.123
lowest : -2.832 -2.830 -2.721 -2.658 -2.657, highest: 24.677 24.707 24.812 24.869 24.993
hist(hsb.clean$mathach,
main = "Distribution of Math Achievement Scores",
sub = "N = 7,185. Students from the High School and Beyond Survey (1988).",
xlab = "Standardized Math Achievement Score")
model.null <- lmer(mathach ~ (1|schoolid), REML = FALSE, data = hsb.clean)
summary(model.null)
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: mathach ~ (1 | schoolid)
Data: hsb.clean
AIC BIC logLik deviance df.resid
47121.8 47142.4 -23557.9 47115.8 7182
Scaled residuals:
Min 1Q Median 3Q Max
-3.06262 -0.75365 0.02676 0.76070 2.74184
Random effects:
Groups Name Variance Std.Dev.
schoolid (Intercept) 8.553 2.925
Residual 39.148 6.257
Number of obs: 7185, groups: schoolid, 160
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 12.6371 0.2436 157.6209 51.87 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
You also get AIC and BIC, two very common information criterion measures that are used to assess model fit:
These commands are run after you an any kind of MLM. They are called post-estimation commands because they require that you have the model saved in the memory (it is the last thing you ran). If you do something else and try to run them, you will likely get an error.
Here’s a quick and easy way to get our friend the ICC!
null.ICC
[1] 0.1793044
lmerTest to Evaluate Random Effects:rand(model.null)
ANOVA-like table for random-effects: Single term deletions
Model:
mathach ~ (1 | schoolid)
npar logLik AIC LRT Df Pr(>Chisq)
<none> 3 -23558 47122
(1 | schoolid) 2 -24050 48104 983.92 1 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model.regular <- lm(mathach ~ NULL, data = hsb.clean)
summary(model.regular)
Call:
lm(formula = mathach ~ NULL, data = hsb.clean)
Residuals:
Std. Math Score
Min 1Q Median 3Q Max
-15.5799 -5.4729 0.3831 5.5691 12.2451
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.74785 0.08115 157.1 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.878 on 7184 degrees of freedom