Data Assignment 1: Answer Key

Author

Ty Partridge

Theses first two blocks of code set the working directory and load all the packages you will need

setwd("C:/Work Files/8190 MLM/Fall 2024/Data Assignments/Data Assignment 1")

library(tidyverse) # All things data wrangling

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2) #Visualizing Data
library(lme4) #Conducting the MLM analyses

Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

library(lmerTest) #Provides the significance tests


Attaching package: 'lmerTest'

The following object is masked from 'package:lme4':

    lmer

The following object is masked from 'package:stats':

    step

library(psych) #Used here for generating descriptive statisitics


Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha

library(knitr) #Creating nice tables
library(kableExtra) #Ditto


Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows

library(apaTables) #Also nice tables

The next step is to load your data and it is always a good idea to use the head and str commands to make sure your data looks the way it should and that the data structure is correct (i.e., are variables tagged with the appropriate data type etc.). The head command only provides the first few rows of data (it defaults to 6) but you can expand it using the following:

head(df,10) where 10 is the number of rows you want displayed. It can be any number up to your sample size. You can also click on the data set in your global environment and it will open a tab with your data in a spreadsheet format.

Assn_1_Data <- read.csv("Data_Assn_1_Data.csv")
head(Assn_1_Data)

  X school    sex    ses   mAch   meanses sector        cses
1 1   1224 Female -1.528  5.876 -0.434383 Public -1.09361702
2 2   1224 Female -0.588 19.708 -0.434383 Public -0.15361702
3 3   1224   Male -0.528 20.349 -0.434383 Public -0.09361702
4 4   1224   Male -0.668  8.781 -0.434383 Public -0.23361702
5 5   1224   Male -0.158 17.898 -0.434383 Public  0.27638298
6 6   1224   Male  0.022  4.583 -0.434383 Public  0.45638298

str(Assn_1_Data)

'data.frame':   7185 obs. of  8 variables:
 $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
 $ school : int  1224 1224 1224 1224 1224 1224 1224 1224 1224 1224 ...
 $ sex    : chr  "Female" "Female" "Male" "Male" ...
 $ ses    : num  -1.528 -0.588 -0.528 -0.668 -0.158 ...
 $ mAch   : num  5.88 19.71 20.35 8.78 17.9 ...
 $ meanses: num  -0.434 -0.434 -0.434 -0.434 -0.434 ...
 $ sector : chr  "Public" "Public" "Public" "Public" ...
 $ cses   : num  -1.0936 -0.1536 -0.0936 -0.2336 0.2764 ...

In examining the structure of the data, the first column which is the student ID number is listed as X so we need to change the variable name to Student ID or ID or something other than X. Also, the sex variable and the sector variable are character strings rather than numerical variables.

(in the original data set I believe the sex variable is labeled sx, which I changed before loading the data)

While these two variables being character string variables is not an issue for using them in the MLM, it is always best to convert character strings to numerical values and set them to be factors. I do this in the code chunk below by first recoding the variables so that if sex = female then the value is coded as 0 otherwise it is coded as 1. Similarly, if sector = Catholic the value is coded as 0 otherwise it is coded as 1. This works here because the variable is binary. Now that the variables have been recoded into a numerical format I can use the factor() function to convert them to factor data types and label the 0,1 codes with their original character labels.

I always run tables when I have converted a variable to a factor to make sure 1) that the labels are provided in the output correctly and that the frequencies match. I have also added a step to generate tables as a proportion in addition to frequency counts. This not only checks for accuracy but these tables can be used for the descriptives section of a presentation / paper.

colnames(Assn_1_Data)[1] <- "Student_ID"

Assn_1_Data$sex <- ifelse(Assn_1_Data$sex == "Female", 0, 1)
Assn_1_Data$sector <-ifelse(Assn_1_Data$sector == "Catholic",0,1)


Assn_1_Data$sex<-factor(Assn_1_Data$sex,
                           levels = c(0,1),
                           labels = c("Female", "Male"))

Assn_1_Data$sector<-factor(Assn_1_Data$sector,
                              levels = c(0,1),
                              labels = c("Catholic","Public"))                         
                           

table(Assn_1_Data$sex)


Female   Male 
  3795   3390

table(Assn_1_Data$sector)


Catholic   Public 
    3543     3642

freq_table_sex <- table(Assn_1_Data$sex)
freq_table_sector <- table(Assn_1_Data$sector)

# Convert to proportions
prop_table_sex <- prop.table(freq_table_sex)
prop_table_sex


   Female      Male 
0.5281837 0.4718163

prop_table_sector <- prop.table(freq_table_sector)
prop_table_sector


 Catholic    Public 
0.4931106 0.5068894

# Convert to percentages
percent_table_sex <- prop.table(freq_table_sex) * 100
percent_table_sex


  Female     Male 
52.81837 47.18163

percent_table_sector <- prop.table(freq_table_sector) * 100
percent_table_sector


Catholic   Public 
49.31106 50.68894

Now that the data is structured as it should be and we are confident in the frequency counts being accurate we can generate the rest of the descriptive statistics. The apa.cor.table command from the package apaTables is a really nice way to get means, s.d.’s and correlations in one descriptive table, but it only works well with relatively small numbers of variables. I also really like the describe command from the psych package because it gives a complete set of descriptive statistics including median based estimates. The latter are quite useful for data that violates normality assumptions (see Wilcox, 2013 for an excellent and very readable discussion https://www.sciencedirect.com/book/9780123869838/introduction-to-robust-estimation-and-hypothesis-testing).

Running the describe command and connecting it to kable through knitr creates really nice APA formatted tables that are manuscript ready.

apa.cor.table(Assn_1_Data[,c(4:6,8)], filename="ex.CorTable1.doc")



Means, standard deviations, and correlations with confidence intervals
 

  Variable   M     SD   1          2          3          
  1. ses     0.00  0.78                                  
                                                         
  2. mAch    12.75 6.88 .36**                            
                        [.34, .38]                       
                                                         
  3. meanses 0.00  0.41 .53**      .34**                 
                        [.51, .55] [.32, .36]            
                                                         
  4. cses    -0.00 0.66 .85**      .21**      .00        
                        [.84, .85] [.19, .23] [-.02, .02]
                                                         

Note. M and SD are used to represent mean and standard deviation, respectively.
Values in square brackets indicate the 95% confidence interval.
The confidence interval is a plausible range of population correlations 
that could have caused the sample correlation (Cumming, 2014).
 * indicates p < .05. ** indicates p < .01.

describe(Assn_1_Data[,c(2:8)])

        vars    n    mean      sd  median trimmed     mad     min     max
school     1 7185 5277.90 2499.58 5192.00 5254.46 3192.04 1224.00 9586.00
sex*       2 7185    1.47    0.50    1.00    1.46    0.00    1.00    2.00
ses        3 7185    0.00    0.78    0.00    0.02    0.85   -3.76    2.69
mAch       4 7185   12.75    6.88   13.13   12.91    8.12   -2.83   24.99
meanses    5 7185    0.00    0.41    0.03    0.01    0.47   -1.19    0.82
sector*    6 7185    1.51    0.50    2.00    1.51    0.00    1.00    2.00
cses       7 7185    0.00    0.66    0.02    0.01    0.68   -3.65    2.86
          range  skew kurtosis    se
school  8362.00  0.11    -1.25 29.49
sex*       1.00  0.11    -1.99  0.01
ses        6.45 -0.23    -0.38  0.01
mAch      27.82 -0.18    -0.92  0.08
meanses    2.02 -0.27    -0.48  0.00
sector*    1.00 -0.03    -2.00  0.01
cses       6.51 -0.17     0.11  0.01

describe(Assn_1_Data[,c(2:8)])%>%
  knitr::kable(digits = 3, format="html", booktabs=TRUE, caption="Table 1. Sample Descriptives")%>%
  kable_classic(full_width = F, html_font = "Cambria")

Table 1. Sample Descriptives
	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
school	1	7185	5277.898	2499.578	5192.000	5254.457	3192.038	1224.000	9586.000	8362.000	0.107	-1.255	29.489
sex*	2	7185	1.472	0.499	1.000	1.465	0.000	1.000	2.000	1.000	0.113	-1.988	0.006
ses	3	7185	0.000	0.779	0.002	0.020	0.845	-3.758	2.692	6.450	-0.228	-0.380	0.009
mAch	4	7185	12.748	6.878	13.131	12.915	8.123	-2.832	24.993	27.825	-0.180	-0.922	0.081
meanses	5	7185	0.000	0.414	0.032	0.012	0.471	-1.194	0.825	2.019	-0.268	-0.479	0.005
sector*	6	7185	1.507	0.500	2.000	1.509	0.000	1.000	2.000	1.000	-0.028	-2.000	0.006
cses	7	7185	0.000	0.661	0.016	0.011	0.678	-3.651	2.856	6.507	-0.172	0.114	0.008

Now for the MLM. This is clearly nested data with students nested within schools, but it is important to calculate the ICC to determine whether or not the nested structure has led to clustering (non-independence) of the data. We can do this by first calculating an unconditional random intercepts model. It is unconditional because there are no predictors at either level 1 or level 2. In the code below we specify math achievement (mAch) as the outcome variable, the tilde indicates a regression but since there are no predictors the model will just generate the sample mean \[Y_{ij} = \beta_{0j}\] The random effects are specified by the terms inside the parentheses. the 1 indicates that random intercepts should be calculated and the ’ | school ’ segment indicates that the school id should be used as the clustering variable. Then finally the code indicates to which data set the analysis should be applied.

Once we run the model we can use the summary() command to print the model output. Since we are only running this model to get the information we need for the ICC we only need to look at the random effects portion of the output. Here we see that the level-2 variance or variance in math achievement between schools is 8.614 (in HLM notation this is denoted by a \(\tau_{0j}\) ) and the level-1 residual variance ( \[e_{ij}\] ). Remember the ICC is the ratio of between cluster variance over the total residual variance:

\[ ICC = \tau^2_{0j} / e^2_{ij} + \tau^2_{0j} \]

*p.s. I went back and fixed this in my earlier example where I accidentally reversed the numerator

The ICC tells us that clustering accounts for 18% of the variance in math achievement scores. That is the degree of non-independence or the average correlation on math achievement among students within a given classroom. An ICC of .18 is not huge, but it is large enough to bias results so we need a MLM to account for that nesting.

#unconditional model to generate ICC 
model_UC <- lmer(mAch ~  + (1 | school), data = Assn_1_Data)

# View the summary of the model
summary(model_UC)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: mAch ~ +(1 | school)
   Data: Assn_1_Data

REML criterion at convergence: 47116.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.0631 -0.7539  0.0267  0.7606  2.7426 

Random effects:
 Groups   Name        Variance Std.Dev.
 school   (Intercept)  8.614   2.935   
 Residual             39.148   6.257   
Number of obs: 7185, groups:  school, 160

Fixed effects:
            Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)  12.6370     0.2444 156.6473   51.71   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ICC <- 8.614 / (8.614 + 39.148)
ICC

[1] 0.1803526

Next we specify a conditional MLM with student sex (female = 0, male = 1) as the level-1 predictor and sector (Catholic = 0, Public = 1) as the level-2 predictor. Since these are categorical predictors the category coded as zero is the reference category. In our output then we will interpret the sex effects as the male deviation from the female value and the sector effects will be interpreted as Public school differences relative to Catholic schools. The model specified below is a random intercepts and random slopes model, this is denoted within the parenthesis with the 1 indicating random intercepts and the ’ + sex ’ portion indicating random variance in the relationship between sex and math achievement between schools. In addition to adding the level-1 and level-2 predictors we also add the sex by sector interaction. The sector main effect is an estimate of the proportion of explained variance in overall math achievement is due to the type of school and the sex by sector interaction is an estimate of the degree to which school type moderates the relationship between sex and math achievement.

model_1 <-lmer(mAch ~ sex + sector + sex:sector + (1 + sex |school),data = Assn_1_Data)

summary(model_1)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: mAch ~ sex + sector + sex:sector + (1 + sex | school)
   Data: Assn_1_Data

REML criterion at convergence: 47014.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.1820 -0.7518  0.0306  0.7652  2.6521 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 school   (Intercept)  5.8274  2.4140       
          sexMale      0.6743  0.8212   0.13
 Residual             38.7336  6.2236       
Number of obs: 7185, groups:  school, 160

Fixed effects:
                     Estimate Std. Error       df t value Pr(>|t|)    
(Intercept)           13.6151     0.3417 141.5756  39.840  < 2e-16 ***
sexMale                1.1591     0.3214 116.0708   3.607 0.000459 ***
sectorPublic          -2.9308     0.4504 145.7839  -6.507 1.15e-09 ***
sexMale:sectorPublic   0.3319     0.3935 118.8435   0.843 0.400705    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) sexMal sctrPb
sexMale     -0.370              
sectorPublc -0.759  0.281       
sxMl:sctrPb  0.302 -0.817 -0.331

There is a lot of output here, but really only the fixed effects are needed for reporting, because the MLM is just providing unbiased relationships for our two predictors and their interaction on math achievement. I would write it up something like the following:

To test our main hypotheses that male students, on average, have higher math achievement scores than female students and that Catholic schools, on average have higher math achievement scores than Public schools, we conducted a MLM with student sex as our level-1 predictor and school type (sector) as our level-2 predictor. The results indicate that for female students in catholic school [our reference categories] the average math achievement score is 13.62. Supporting our first hypothesis male students did have, on average, a significantly higher math achievement score compared to female students ( \(\gamma_{10}\) = 1.16, p<.001). Supporting our second hypothesis students in public schools did have, on average, a significantly lower math achievement score compared to students in Catholic schools ( \(\gamma_{01}\) = -2.93, p<.001). However, school type did not moderate the relationship between student sex and math achievement ( \(\gamma_{11}\) = 0.332, p<.400). see table x.

Table X.

	Fixed Effects
	Estimate	Std. Error	df	t value	p-value
(Intercept)	13.6151	0.3417	141.58	39.84	< .001 ***
Male	1.1591	0.3214	116.07	3.61	< .001 ***
Public	-2.9308	0.4504	145.78	-6.51	< .001 ***
*MalePublic**	0.3319	0.3935	118.84	0.84	.401

Random Effects	Variance	Std. Dev.	Corr
school (Intercept)	5.8274	2.4140
school sexMale	0.6743	0.8212	0.13
Residual	38.7336	6.2236

Model Statistics	Number of Observations	Number of Groups (schools)
	7185	160

Significance codes: *** p < .001, ** p < .01, * p < .05.

The findings can be further illustrated using the following ggplot figures. The first provides the model implied data for each participant (aka a spaghetti plot) and the second provides the model implied interaction.

Assn_1_Data$fitted <- fitted(model_1)
library(ggthemes)
spg_plot <-ggplot(Assn_1_Data, aes(x = sex, y = fitted, group = school, color = sector)) +
  geom_point(alpha = 0.4) +  # Observed data
  geom_line(aes(y = fitted), linewidth = .5) +  # Fitted values
  labs(title = "Random Intercept and Slope Model",
       x = "Sex",
       y = "Math Achievement",
       color = "Sector") +
  theme_fivethirtyeight() 
spg_plot

mean_plot <-ggplot(Assn_1_Data, aes(x = sex, y = fitted, group = sector, color = sector)) +
  geom_point(alpha = 0.4) +  # Observed data
  geom_line(aes(y = fitted), linewidth = .5) +  # Fitted values
  labs(title = "Random Intercept and Slope Model",
       x = "Sex",
       y = "Math Achievement",
       color = "Sector") +
  theme_fivethirtyeight() 

mean_plot

The figure below is a useful way to visualize the variance in intercepts (you could also do the same thing for random slopes). The data used in this figure are taken from the fitted model and are adjusted so that the average intercept \[\gamma_{00} \] is set to zero by centering the individual school intercepts around the average intercept. These centered scores are then plotted from lowest to highest.

# Extract random effects
random_effects <- ranef(model_1)$school

# Convert random effects to a data frame for plotting
random_effects_df <- as.data.frame(random_effects)
random_effects_df$group <- rownames(random_effects)

# Plot random intercepts for each group
ggplot(random_effects_df, aes(x = reorder(group, `(Intercept)`), y = `(Intercept)`)) +
  geom_point() +
  labs(x = "School", y = "Math Achievement Intercept", 
       title = "Random Intercepts by School from HLM") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))