Introduction

Panel data can be difficult to deal with in social science research, because many assumptions we impose on cross-sectional data modelling does not apply to panel data modelling.

However, panel data analysis is a powerful tool in social scientists’ toolbox because it allows us to draw causal conclusions on a research question, where true random experiment is not possible.

In this tutorial, we will go over:

  1. the concept of panel data
  2. the critical issues of working with panel data
  3. Common methods of working with panel data
  4. An Example of Modelling Panel data

What is Panel Data

Panel data can be seen as a combination of:

Panel data is generate by measuring multiple objects over time. We can think of it like a timeline in which we periodically observe the same individuums. Here is an example of Panel Data:


Challenges of Working with Panel Data

  1. Missing Data: It’s common in panel datasets to have missing values. Survey participants may drop out of a study over time, or data might not be recorded at certain time points. Handling these missing data points requires careful consideration to avoid bias in the analysis.

  2. Time-Invariant Variables: Some characteristics of the subjects do not change over time.

  3. Heterogeneity: There can be variation across subjects (between-subject heterogeneity) and within the same subject over time (within-subject heterogeneity).

  4. Autocorrelation: Observations for the same subject across time are likely to be correlated. This violates the assumption of independent observations that many statistical methods rely on and can lead to incorrect inferences if not properly addressed.

  5. Multilevel in Nature Panel data has a multi-level structure (data is collected on the same units across multiple time points).


Common Methdos of Panel Data Modelling

  1. Fixed and Random Effects Models: Control for individual-specific characteristics that do not change over time or assume that individual-specific effect is uncorrelated with the independent variables

  2. Difference-in-Differences (DiD): This method is used in quasi-experimental settings. It compares the before-and-after changes in outcomes for a treatment group and a control group. We’ll discuss this in the paper discussion.

  3. Multilevel Modeling (Hierarchical Linear Modeling): Treating time as a level of the dataset.

  4. Structural Equation Modeling (SEM): Incorporating the fixed and random effects models into a structural equation modeling framework. SEM allows us to:


Example: Cross-Lagged Panel SEM Model

First, let’s set up the environment for analysis. Beside the commonly known packages that consists of the R computing environment, the main workhorse package we need to implement SEM in R is lavaan.

lavaan is widely considered as the best open-source software for SEM. It is a native R package, so all you need is to install and load the package in R, no extra compilers needed. It has its own grammar for model specification, but follows general R syntax rules whenever possible. For R users, it’s an added benefit.

Lavaan also has an excellent tutorial page: https://lavaan.ugent.be/

If you use Python, STATA, or SPSS, you can check out MPLUS. MPLUS has long been recognized as the go-to choice for SEM, but it is a proprietary software, and we don’t have university-wide license.

You can check out MPLUS in here: https://www.statmodel.com/

In this demonstration we use lavaan:

# Setting up the environment
# rm(list=ls()) #Remove all existing objects
# setwd("/storage/home/szn5432/work/soda501") #Set the working directory

# Load dataset
load("/home/freeyang/research_data/mcs/merging/sem214.RData")


Case Study: Panel Data Analysis with SEM


Millennium Cohort Study

In this example, we use a panel data set drawn from a longitudinal social survey well-known in adolescent development research called the Millennium Cohort Study (MCS).

The MCS is a UK-based longitudinal survey. Starting from the year 2,000, the survey team follows the lives of around 19,000 young people born across England, Scotland, Wales and Northern Ireland in 2000-02. The study began with an original sample of 18,818 cohort members.

MCS provides multiple measures of the cohort members’ physical, socio-emotional, cognitive and behavioral development over time, as well as detailed information on their daily life, behavior and experiences. Alongside this, rich information on economic circumstances, parenting, relationships and family life is available from both resident parents.

As of now, the MCS have seven waves of survey beginning from when the cohort members were nine-month old to when they were seventeen-year-old. The eighth wave that was conducted when the cohort members were twenty-one-year-old is ongoing.

If you are interested, you can look for more information about the MCS from their official website: https://cls.ucl.ac.uk/cls-studies/millennium-cohort-study/.

MCS is huge. We combined all waves of the dataset, and produce a data frame of 19,243 rows and 18,509 columns. Panel data that has a large sample size and lots of features are extremely rare in adolescent study because they are incredibly expensive, laborious, and difficult to produce. However, they are also extremely powerful because it allows us to build complicated confirmatory models or apply machine learning models on it.

MCS is huge. We combined all waves of the dataset, and produce a data frame of 19,243 rows and 18,509 columns. Panel data that has a large sample size and lots of features are extremely rare in adolescent study because they are incredibly expensive, laborious, and difficult to produce. However, they are also extremely powerful because it allows us to build complicated confirmatory models or apply machine learning models on it.

# Check df
dim(df)
## [1] 19243 18509
head(df, 5)
## # A tibble: 5 × 18,509
##   mcsid   gpnum00 gcnum00     g_out_parquest gpcmed00 gpschl00 gpaslu00 gpasux0a
##   <chr>     <dbl> <dbl+lbl>   <dbl+lbl>      <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb>
## 1 M10002P       1 1 [1st Coh… 1 [Parent CAW… 1 [Yes]  2 [Fair… 1 [Very… -1 [Not…
## 2 M10007U       1 1 [1st Coh… 1 [Parent CAW… 1 [Yes]  2 [Fair… 1 [Very… -1 [Not…
## 3 M10015U       1 1 [1st Coh… 1 [Parent CAW… 1 [Yes]  1 [Very… 1 [Very… -1 [Not…
## 4 M10016V       1 1 [1st Coh… 1 [Parent CAW… 1 [Yes]  2 [Fair… 3 [Not …  0 [No] 
## 5 M10018X       1 1 [1st Coh… 1 [Parent CAW… 1 [Yes]  2 [Fair… 1 [Very… -1 [Not…
## # ℹ 18,501 more variables: gpasux0b <dbl+lbl>, gpasux0c <dbl+lbl>,
## #   gpasux0d <dbl+lbl>, gpasux0e <dbl+lbl>, gpasux0f <dbl+lbl>,
## #   gpasux0g <dbl+lbl>, gpasux0h <dbl+lbl>, gpasux0i <dbl+lbl>,
## #   gpasux0j <dbl+lbl>, gpasux0k <dbl+lbl>, gpasux0l1 <dbl+lbl>,
## #   gpasux0m <dbl+lbl>, gpasux0n <dbl+lbl>, gpasux0o <dbl+lbl>,
## #   gpasux0p <dbl+lbl>, gpasux0q <dbl+lbl>, gptaim00 <dbl+lbl>,
## #   gpwhet00 <dbl+lbl>, gpwhut00 <dbl+lbl>, gpschc00 <dbl+lbl>, …


The Study

In this study we uses 4 waves of measurement:

  • wave 4: respondents’ age were 7
  • wave 5: respondents’ age were 11
  • wave 6: respondents’ age were 14
  • wave 7: respondents’ age were 17


We ask two main questions (along with some others):

  1. Stability of self-control: whether individual differences in self-control are established before adolescence, and the between-person variations are stable thereafter

  2. The lagged effect of self-control on aggression in two periods of adolescence:

  • Self-control at age 11’s effect on aggression at age 14
  • Self-control at age 14’s effect on aggression at age 17


The Model.
The Model.


Data Preperation

It’s always necessary to do EDA before going to confirmatory analysis, but for demonstration purpose, we jump into data engineering, and won’t go over this part.


Data Exploration


Time-invariate Variables

We want the dataset to be relatively balanced in terms of demographic variables like gender and race:

df1 <- data.frame(gender = rep(NA, nrow(df)), race = rep(NA, nrow(df)))

df1$gender <- factor(df$gender6, levels = c(0, 1), labels = c("Female", "Male"))
df1$race <- factor(df$race, levels = c(1, 2, 3, 4), labels = c("White", "Asian", "Black", "Other"))

# Plot for gender variable
ggplot(df1, aes(x = gender)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Distribution of Gender", x = "Gender", y = "Count") +
  theme_minimal()

# Plot for race variable
ggplot(df1, aes(x = race)) +
  geom_bar(fill = "lightgreen", color = "black") +
  labs(title = "Distribution of Race", x = "Race", y = "Count") +
  theme_minimal()

Gender and race remains the same for each measurement. They are the time-invariant variables.


Time-variant Variables

Our key variables are time-variant variables. Line graphs demonstrate that they have ==trend over time==.


Generally, as individuals grow out of their childhood, they become less impulsive.

# Plot for cimp averages
ggplot(cimp_data, aes(x = time, y = mean)) +
  geom_line(color = "blue") +
  geom_point(color = "blue", size = 3) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.2, color = "blue") +
  labs(title = "Trend of 'cimp' Over Time with Standard Deviation", x = "Time", y = "Mean Value") +
  theme_minimal()


However, as one grows into adolescnece, they are more likely to take risks.

# Plotting 'crisk' with error bars
ggplot(crisk_data, aes(x = time, y = mean)) +
  geom_line(color = "red") +
  geom_point(color = "red", size = 3) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.2, color = "red") +
  labs(title = "Trend of 'crisk' Over Time with Standard Deviation", x = "Time", y = "Mean Value") +
  theme_minimal()


And we see a substantial jump in aggressive behaviors as teenagers grow from 14-year-old into 17-year-old.

# Plot for agr_fig averages
ggplot(agr_fig_data, aes(x = time, y = mean)) +
  geom_line(color = "green") +
  geom_point(color = "green", size = 3) +
  geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.2, color = "green") +
  labs(title = "Trend of 'agr_fig' Over Time with Standard Deviation", x = "Time", y = "Mean Value") +
  theme_minimal()


All of these trend we’ve seen are called within-person variations, they are sometimes called fixed effect in psychometrics.

In the three graphs above, we calculated the standard deviation for each variable at each measurement. This shows us the between-person variations. They are sometimes called random effect.


Auto Correlation

In Panel data, we usually see repeated measures being highly correlated with themselves, knows as auto-correlation. This can be detected from the correlation matrix.

# Load packages 
p_load(corrplot)

# Compute the correlation matrix. Note: rcorr(as.matrix(subset)) returns a list with the matrix at $r
subset_noid <- subset[, !(names(subset) %in% c('mcsid'))]
cor_matrix <- rcorr(as.matrix(subset_noid))$r

# Basic corrplot
corrplot(cor_matrix, method = "color", 
         type = "upper", # Only show upper half to avoid redundancy
         order = "hclust", # Order variables to cluster highly correlated ones
         tl.col = "black", tl.srt = 45) # Adjust text color and rotation for readability.

# capabilities()


In panel data analysis, we need to take care of all of these:

  1. Time-invariant effect
  2. Time-variant effect
  3. Fixed effect
  4. Random effect
  5. Auto Correlation


Missing Data

You may have been told that your data need to be ==missing completely at random (MCAR)== or at least == ar random (MAR)== to proceed. However, panel data is almost never MCAR. The main reason is that when you track people longitudinally, they drop out constantly. This is known as attrition.

Attrition does not happen at random most of the time. For example, if you track the health status of people who smoke and who don’t smoke over the next 50 years. You would expect smokers drop out more because of lower life-expectancy among smokers.

However, if you’re lucky to have a huge set of variables (not necessarily included in your model), like what we have in this case, your data may be MAR, because some features may explain the patterns of missingness in the data set.

Our data is not MCAR, as you can see below that we failed the Little’s MCAR test. However, it is always useful to plot the missing pattern to identifdy potential issues of the data instead of just relying on test statistics.

# Missing Pattern
p_load(naniar, Amelia)


# Missingness frequency by variable
n_miss <- miss_var_summary(subset)
print(n_miss, n=40)
## # A tibble: 29 × 3
##    variable   n_miss pct_miss
##    <chr>       <int>    <num>
##  1 crisk7      12103    62.9 
##  2 cimp7       11788    61.3 
##  3 crisk6       9688    50.3 
##  4 agr7_fig     9512    49.4 
##  5 agr7_srfig   9512    49.4 
##  6 cimp6        9408    48.9 
##  7 agr7_wepn    9267    48.2 
##  8 agr7_hit     9266    48.2 
##  9 agr6_fig     9118    47.4 
## 10 agr6_hps     8787    45.7 
## 11 pared        8379    43.5 
## 12 crisk5       8293    43.1 
## 13 agr6_hpc     8056    41.9 
## 14 agr6_hit     8050    41.8 
## 15 agr6_wepn    8048    41.8 
## 16 race         8026    41.7 
## 17 cpar5_dire   7879    40.9 
## 18 cimp5        7753    40.3 
## 19 cpar5_ditr   7735    40.2 
## 20 cpar5_dibn   7729    40.2 
## 21 gender6      7517    39.1 
## 22 agr5_fig     7437    38.6 
## 23 agr5_hps     7341    38.1 
## 24 agr5_hpc     6422    33.4 
## 25 cpar4_dire   6182    32.1 
## 26 cpar4_ditr   6130    31.9 
## 27 cpar4_dibn   6096    31.7 
## 28 married       711     3.69
## 29 mcsid           0     0
# missmap(subset, main = "Missing Data Pattern", 
        #col = c("yellow", "black"), legend = TRUE)

# MCAR TEST (result significant, Not completely at random)
MCAR_result <- mcar_test(subset)
print(MCAR_result)
## # A tibble: 1 × 4
##   statistic    df p.value missing.patterns
##       <dbl> <dbl>   <dbl>            <int>
## 1    33209. 24493       0             1330
# Correlation analysis
vis_miss(subset)


One final thing I want to say about the data: you may have found out that our data is in wide format. This is because wide format data is native to SEM. Also, the way the MCS dataset is structured makes it difficult to reshape it into long format.


What is Structural Equation Modelling

Structural equation modeling is a modelling framework that models simultaneous regression equations with latent variables. Models such as regression, multivariate regression, path analysis, confirmatory factor analysis, and structural regressionare all special cases of SEM. SEM can model:

  • observed to observed variables (e.g., regression)

  • latent to observed variables (e.g., confirmatory factor analysis)

  • latent to latent variables (e.g., structural regression)

SEM have two main components:

  • A measurement model: relates observed to latent variables

  • A structural models: relates latent to latent variables.


SEM with Lavaan

You can do SEM within R using a package called lavaan: https://lavaan.ugent.be/. lavaan is distributed under the GNU General Public License, so it is completely free to use for different purposes.

Other popular SEM software are proprietary:

  • MPlus has long been the industry standard for SEM, but it is a standalone and expensive piece of software.

  • SPSS AMOS is popular in the business intelligence world because of its integration with the SPSS and its point-and-click style GUI. Penn State has a university-wide license for SPSS AMOS.

  • STATA is widely popular both in Economics and Sociology. It has a built-in SEM tool. You don’t need to pay extra for the built-in tool, but its SEM capability is limited.


SEM with lavaan is quite straightforward. We have a full SEM model that consists of a measurement model and a structural model.

In lavaan: - =~ connects the latent factor with the observed indicators - ~ connects the outcome with the predictors - ~~ are used to model covariances

You can look at this page to learn the complete syntax of lavaan

# Model Specification
sem1 <- ' 
 # Measurement Model
 
 agr7 =~ agr7_fig + agr7_hit + agr7_wepn
 agr6 =~ agr6_fig + agr6_hit + agr6_wepn + agr6_hpc
 agr5 =~ agr5_fig + agr5_hpc 
 
 
 ## Self-control
 
 ## Self-control at age 11
 fsc5 =~ cimp5 + crisk5
 
 ## Self-control at age 14
 fsc6 =~ cimp6 + crisk6
   
 ## Self-control at age 17
 fsc7 =~ cimp7 + crisk7

   
 ## Parenting
 
 ## Parenting at age 7
 par4 =~ cpar4_dibn + cpar4_ditr + cpar4_dire
 
 ## Parenting at age 11
 par5 =~ cpar5_dibn + cpar5_ditr + cpar5_dire
   
   
 # Structural Model
 
 agr7 ~ fsc6 + agr6 + par5
 agr6 ~ fsc5 + agr5 + par5
 agr5 ~ par4 + gender6 + pared + race + married
   
 fsc7 ~ agr6 + fsc6 + par5
 fsc6 ~ agr5 + fsc5 + par5 
 fsc5 ~ par4 + gender6 + pared + race + married 
   
 par5 ~ par4  
 
## Covariances

 agr7 ~~ fsc7
 agr6 ~~ fsc6
 agr5 ~~ fsc5
 agr6_fig ~~ agr5_fig

 
 crisk6 ~~ crisk7
 crisk5 ~~ crisk6
 crisk5 ~~ crisk7
 
 cimp5 ~~ cimp6
 cimp6 ~~ cimp7
 cimp5 ~~ cimp7
 
 cpar4_dire ~~ cpar5_dire
 cpar4_ditr ~~ cpar5_ditr
 cpar4_dibn ~~ cpar5_dibn
'


After specifying the model, fitting the model is simple. However, you do need to understand the SEM framework to choose a suitable estimator.

Full Information Maximum Likelihood is the recommended method of handling missing data for SEM. However, it only works with the MLR estimator.

You can also impute the data first with the MICE package, and then use the multiple imputed data to run SEM. A recent study (https://psycnet.apa.org/record/2021-12018-001) found that the two method provides equivalent estimation results in most cases. However, FIML is built into the workflow of SEM, so it is a bit easier to implement.

This is not a tutorial about SEM modelling, so I will not interpret the results and model fit statistics.

# Model Fitting

## Specify the ordinal variables
### MLR estimator with FIML

fit1 <- sem(model = sem1, data = subset, estimator='MLR', missing='fiml', fixed.x = FALSE)
## Warning in lav_data_full(data = data, group = group, cluster = cluster, : lavaan WARNING: some cases are empty and will be ignored:
##   8988 9116 9168 9310 9331 9390 9442 9561 9573 9847 10160 10233 10264 10329 10391 10515 10663 10741 10787 10799 10839 10850 10907 11020 11073 11124 11280 11670 11708 11867 11890 11906 11908 11956 11985 12096 12184 12285 12351 12360 12394 12418 12434 12603 12820 12909 12916 13134 13189 13282 13311 13382 13476 13561 13646 13655 13805 13887 14206 14213 14295 14303 14350 14409 14460 14462 14515 14585 14667 14693 14775 14822 14836 15134 15152 15167 15375 15402 15406 15508 15558 15635 15659 15807 16141 16145 16209 16268 16275 16343 16429 16533 16618 16726 16735 16877 17023 17060 17069 17077 17102 17308 17311 17316 17515 17718 17909 17924 17929 18009 18073 18079 18154 18165 18197 18224 18262 18489 18567 18638 18682 18739 18823 18898 18944 18987 19120 19186
## Warning in lav_object_post_check(object): lavaan WARNING: covariance matrix of latent variables
##                 is not positive definite;
##                 use lavInspect(fit, "cov.lv") to investigate.
summary(fit1, standardized = TRUE, fit.measures=TRUE)
## lavaan 0.6.17 ended normally after 272 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                       113
## 
##                                                   Used       Total
##   Number of observations                         19115       19243
##   Number of missing patterns                      1130            
## 
## Model Test User Model:
##                                               Standard      Scaled
##   Test Statistic                              5264.736    4450.943
##   Degrees of freedom                               237         237
##   P-value (Chi-square)                           0.000       0.000
##   Scaling correction factor                                  1.183
##     Yuan-Bentler correction (Mplus variant)                       
## 
## Model Test Baseline Model:
## 
##   Test statistic                             49016.445   40364.341
##   Degrees of freedom                               294         294
##   P-value                                        0.000       0.000
##   Scaling correction factor                                  1.214
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.897       0.895
##   Tucker-Lewis Index (TLI)                       0.872       0.870
##                                                                   
##   Robust Comparative Fit Index (CFI)                         0.895
##   Robust Tucker-Lewis Index (TLI)                            0.870
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)            -210622.699 -210622.699
##   Scaling correction factor                                  2.257
##       for the MLR correction                                      
##   Loglikelihood unrestricted model (H1)    -207990.331 -207990.331
##   Scaling correction factor                                  1.530
##       for the MLR correction                                      
##                                                                   
##   Akaike (AIC)                              421471.398  421471.398
##   Bayesian (BIC)                            422359.378  422359.378
##   Sample-size adjusted Bayesian (SABIC)     422000.269  422000.269
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.033       0.030
##   90 Percent confidence interval - lower         0.033       0.030
##   90 Percent confidence interval - upper         0.034       0.031
##   P-value H_0: RMSEA <= 0.050                    1.000       1.000
##   P-value H_0: RMSEA >= 0.080                    0.000       0.000
##                                                                   
##   Robust RMSEA                                               0.048
##   90 Percent confidence interval - lower                     0.047
##   90 Percent confidence interval - upper                     0.050
##   P-value H_0: Robust RMSEA <= 0.050                         0.985
##   P-value H_0: Robust RMSEA >= 0.080                         0.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.052       0.052
## 
## Parameter Estimates:
## 
##   Standard errors                             Sandwich
##   Information bread                           Observed
##   Observed information based on                Hessian
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   agr7 =~                                                               
##     agr7_fig          1.000                               0.173    0.421
##     agr7_hit          1.530    0.088   17.321    0.000    0.265    0.616
##     agr7_wepn         0.162    0.017    9.818    0.000    0.028    0.276
##   agr6 =~                                                               
##     agr6_fig          1.000                               0.101    0.312
##     agr6_hit          2.772    0.235   11.811    0.000    0.279    0.605
##     agr6_wepn         0.242    0.028    8.666    0.000    0.024    0.253
##     agr6_hpc          4.311    0.328   13.132    0.000    0.434    0.488
##   agr5 =~                                                               
##     agr5_fig          1.000                               0.130    0.409
##     agr5_hpc          2.863    0.151   18.938    0.000    0.371    0.385
##   fsc5 =~                                                               
##     cimp5             1.000                               1.103    0.585
##     crisk5            0.696    0.025   28.191    0.000    0.767    0.729
##   fsc6 =~                                                               
##     cimp6             1.000                               1.051    0.563
##     crisk6            0.871    0.034   25.493    0.000    0.915    0.699
##   fsc7 =~                                                               
##     cimp7             1.000                               1.080    0.579
##     crisk7            0.758    0.037   20.602    0.000    0.818    0.637
##   par4 =~                                                               
##     cpar4_dibn        1.000                               0.334    0.676
##     cpar4_ditr        0.945    0.018   51.407    0.000    0.315    0.651
##     cpar4_dire        0.280    0.013   22.205    0.000    0.094    0.275
##   par5 =~                                                               
##     cpar5_dibn        1.000                               0.381    0.768
##     cpar5_ditr        1.042    0.016   65.063    0.000    0.397    0.797
##     cpar5_dire        0.562    0.012   44.977    0.000    0.214    0.488
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   agr7 ~                                                                
##     fsc6              0.024    0.008    3.233    0.001    0.148    0.148
##     agr6              1.449    0.108   13.368    0.000    0.844    0.844
##     par5              0.036    0.010    3.501    0.000    0.079    0.079
##   agr6 ~                                                                
##     fsc5              0.094    0.019    5.000    0.000    1.028    1.028
##     agr5              1.256    0.169    7.447    0.000    1.617    1.617
##     par5              0.009    0.006    1.380    0.167    0.032    0.032
##   agr5 ~                                                                
##     par4              0.149    0.010   15.505    0.000    0.383    0.383
##     gender6           0.072    0.005   14.380    0.000    0.557    0.279
##     pared            -0.013    0.001   -9.661    0.000   -0.101   -0.151
##     race             -0.003    0.005   -0.500    0.617   -0.020   -0.007
##     married          -0.044    0.004  -11.773    0.000   -0.336   -0.165
##   fsc7 ~                                                                
##     agr6              0.173    0.335    0.517    0.605    0.016    0.016
##     fsc6              0.907    0.035   26.129    0.000    0.882    0.882
##     par5             -0.070    0.052   -1.355    0.175   -0.025   -0.025
##   fsc6 ~                                                                
##     agr5              2.251    0.748    3.011    0.003    0.278    0.278
##     fsc5              1.038    0.086   12.068    0.000    1.089    1.089
##     par5             -0.131    0.043   -3.022    0.003   -0.047   -0.047
##   fsc5 ~                                                                
##     par4             -1.502    0.089  -16.950    0.000   -0.454   -0.454
##     gender6          -0.350    0.027  -13.110    0.000   -0.317   -0.159
##     pared             0.154    0.011   13.890    0.000    0.140    0.208
##     race              0.223    0.035    6.427    0.000    0.202    0.076
##     married           0.437    0.026   17.117    0.000    0.397    0.195
##   par5 ~                                                                
##     par4              0.749    0.024   31.013    0.000    0.656    0.656
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##  .agr7 ~~                                                               
##    .fsc7             -0.037    0.004   -8.893    0.000   -0.655   -0.655
##  .agr6 ~~                                                               
##    .fsc6             -0.035    0.003  -10.164    0.000   -1.031   -1.031
##  .agr5 ~~                                                               
##    .fsc5             -0.090    0.005  -19.581    0.000   -0.906   -0.906
##  .agr6_fig ~~                                                           
##    .agr5_fig          0.025    0.002   11.164    0.000    0.025    0.278
##  .crisk6 ~~                                                             
##    .crisk7            0.300    0.029   10.178    0.000    0.300    0.323
##  .crisk5 ~~                                                             
##    .crisk6            0.068    0.019    3.634    0.000    0.068    0.101
##    .crisk7            0.027    0.020    1.350    0.177    0.027    0.037
##  .cimp5 ~~                                                              
##    .cimp6             1.342    0.041   32.437    0.000    1.342    0.568
##  .cimp6 ~~                                                              
##    .cimp7             1.289    0.047   27.604    0.000    1.289    0.549
##  .cimp5 ~~                                                              
##    .cimp7             1.124    0.042   26.604    0.000    1.124    0.483
##  .cpar4_dire ~~                                                         
##    .cpar5_dire        0.027    0.002   16.770    0.000    0.027    0.213
##  .cpar4_ditr ~~                                                         
##    .cpar5_ditr       -0.001    0.002   -0.284    0.776   -0.001   -0.005
##  .cpar4_dibn ~~                                                         
##    .cpar5_dibn        0.007    0.002    3.497    0.000    0.007    0.062
##   gender6 ~~                                                            
##     pared             0.006    0.007    0.864    0.388    0.006    0.008
##     race             -0.001    0.002   -0.731    0.465   -0.001   -0.007
##     married          -0.000    0.002   -0.133    0.894   -0.000   -0.001
##   pared ~~                                                              
##     race             -0.091    0.007  -13.719    0.000   -0.091   -0.162
##     married           0.133    0.007   18.442    0.000    0.133    0.182
##   race ~~                                                               
##     married           0.027    0.002   16.474    0.000    0.027    0.149
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .agr7_fig          0.048    0.009    5.118    0.000    0.048    0.116
##    .agr7_hit          0.071    0.015    4.693    0.000    0.071    0.165
##    .agr7_wepn        -0.008    0.002   -4.386    0.000   -0.008   -0.077
##    .agr6_fig          0.009    0.005    1.638    0.102    0.009    0.027
##    .agr6_hit          0.101    0.017    5.879    0.000    0.101    0.220
##    .agr6_wepn        -0.008    0.002   -5.202    0.000   -0.008   -0.087
##    .agr6_hpc          0.113    0.024    4.713    0.000    0.113    0.127
##    .agr5_fig          0.043    0.009    4.699    0.000    0.043    0.135
##    .agr5_hpc          0.351    0.027   12.913    0.000    0.351    0.364
##    .cimp5             2.764    0.051   53.758    0.000    2.764    1.465
##    .crisk5            6.997    0.035  199.557    0.000    6.997    6.652
##    .cimp6             2.806    0.050   56.455    0.000    2.806    1.503
##    .crisk6            6.371    0.047  136.579    0.000    6.371    4.863
##    .cimp7             3.238    0.056   57.654    0.000    3.238    1.736
##    .crisk7            5.809    0.046  126.125    0.000    5.809    4.522
##    .cpar4_dibn        0.586    0.004  137.222    0.000    0.586    1.188
##    .cpar4_ditr        0.626    0.004  148.667    0.000    0.626    1.291
##    .cpar4_dire        0.865    0.003  289.376    0.000    0.865    2.545
##    .cpar5_dibn        0.430    0.005   94.830    0.000    0.430    0.867
##    .cpar5_ditr        0.450    0.005   98.562    0.000    0.450    0.903
##    .cpar5_dire        0.738    0.004  180.448    0.000    0.738    1.682
##     gender6           1.503    0.005  325.101    0.000    1.503    3.004
##     pared             2.842    0.014  198.245    0.000    2.842    1.913
##     race              0.164    0.003   47.547    0.000    0.164    0.436
##     married           0.595    0.004  165.158    0.000    0.595    1.211
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .agr7_fig          0.139    0.004   33.104    0.000    0.139    0.823
##    .agr7_hit          0.115    0.004   29.384    0.000    0.115    0.621
##    .agr7_wepn         0.010    0.001   11.186    0.000    0.010    0.924
##    .agr6_fig          0.094    0.004   23.017    0.000    0.094    0.902
##    .agr6_hit          0.135    0.004   36.347    0.000    0.135    0.634
##    .agr6_wepn         0.009    0.001   11.072    0.000    0.009    0.936
##    .agr6_hpc          0.604    0.019   31.593    0.000    0.604    0.762
##    .agr5_fig          0.084    0.003   25.698    0.000    0.084    0.832
##    .agr5_hpc          0.794    0.022   36.372    0.000    0.794    0.852
##    .cimp5             2.343    0.047   50.105    0.000    2.343    0.658
##    .crisk5            0.518    0.019   26.949    0.000    0.518    0.468
##    .cimp6             2.381    0.049   48.849    0.000    2.381    0.683
##    .crisk6            0.878    0.033   26.559    0.000    0.878    0.512
##    .cimp7             2.315    0.061   38.124    0.000    2.315    0.665
##    .crisk7            0.981    0.038   26.046    0.000    0.981    0.594
##    .cpar4_dibn        0.132    0.003   41.638    0.000    0.132    0.542
##    .cpar4_ditr        0.135    0.003   42.193    0.000    0.135    0.577
##    .cpar4_dire        0.107    0.002   53.932    0.000    0.107    0.924
##    .cpar5_dibn        0.101    0.003   34.544    0.000    0.101    0.411
##    .cpar5_ditr        0.090    0.003   29.993    0.000    0.090    0.365
##    .cpar5_dire        0.147    0.002   80.997    0.000    0.147    0.762
##    .agr7              0.012    0.002    7.912    0.000    0.413    0.413
##    .agr6              0.004    0.001    4.814    0.000    0.381    0.381
##    .agr5              0.012    0.001   10.749    0.000    0.717    0.717
##    .fsc5              0.812    0.027   30.032    0.000    0.668    0.668
##    .fsc6              0.295    0.022   13.151    0.000    0.267    0.267
##    .fsc7              0.264    0.029    9.245    0.000    0.226    0.226
##     par4              0.111    0.003   34.974    0.000    1.000    1.000
##    .par5              0.082    0.003   25.365    0.000    0.569    0.569
##     gender6           0.250    0.000 2652.816    0.000    0.250    1.000
##     pared             2.207    0.024   92.731    0.000    2.207    1.000
##     race              0.141    0.002   59.915    0.000    0.141    1.000
##     married           0.241    0.001  351.529    0.000    0.241    1.000
# Check covariance of the latent variables
lavInspect(fit1, "cov.lv")
##        agr7   agr6   agr5   fsc5   fsc6   fsc7   par4   par5
## agr7  0.030                                                 
## agr6  0.013  0.010                                          
## agr5  0.011  0.009  0.017                                   
## fsc5 -0.056 -0.052 -0.132  1.216                            
## fsc6 -0.079 -0.070 -0.100  0.982  1.104                     
## fsc7 -0.107 -0.062 -0.090  0.890  0.998  1.166              
## par4  0.008  0.006  0.017 -0.167 -0.147 -0.138  0.111       
## par5  0.010  0.005  0.012 -0.125 -0.121 -0.119  0.083  0.145


The package lavaanPlot, when used together with lavaan can do basic plotting for the model fitting result from lavaan:

p_load(lavaanPlot) # For plotting


## Label the subset
my_label <- list(
  par7 = "Parenting Age 7",
  par11 = "Parenting Age 11",
  fsc11 = "Self-control Age 11",
  fsc14 = "Self-control Age 14",
  fsc17 = "Self-control Age 17",
  cimp5 = "Impulsivity Age 11",
  crisk5 = "Risk-Taking Age 11",
  cimp6 = "Impulsivity Age 14",
  crisk6 = "Risk-Taking Age 14",
  cimp7 = "Impulsivity Age 17",
  crisk7 = "Risk-Taking Age 17",
  cpar4_dibn = "Parenting Bedroom Age 7",
  cpar4_ditr = "Parenting Treats Age 7",
  cpar4_dire = "Parenting Reason Age 7",
  cpar5_dibn = "Parenting Bedroom Age 11",
  cpar5_ditr = "Parenting Treats Age 11",
  cpar5_dire = "Parenting Reason Age 11",
  agr7 = "Aggression Age 17",
  agr6 = "Aggression Age 14",
  agr5 = "Aggression Age 11",
  gender6 = "Gender",
  pared = "Parents Education",
  race = "Ethnicity",
  married = "Parents Married Status")

# MLR Estimates
lavaanPlot(model = fit1, labels = my_label, node_options = list(shape = "box", fontname = "Helvetica"),
           edge_options = list(color = "grey"), coefs=TRUE, stand = TRUE)

SEM Model with MCMC (Optional)

Sometimes you can use Bayesian estimator when you have the following issue

  1. Model identification: Your model won’t identified

  2. Prior Knowledge: We social scientists usually have strong background in theory training. Bayesian SEM allows us to incorporate our domain-specific knowledge into the modelling process by specifying the prior

  3. Missiness Since we can define prior, we can model missing pattern if needed. This can be useful when we work with panel data, which is usually MNAR

  4. Accuracy Given the nature of Bayesian estimators, Bayesian SEM can account for measurement errors, non-normality, and heterogeneity in panel data.

It requires at least a complete course to discuss Bayesian estimators. So we will not discuss how to do it here.

There are two main tools we need to implement Bayesian SEM:

  1. STAN: STAN is a probability programming language written in C++. However, rstan offers a interface to invoke STAN from the R environment.

  2. blavaan: The R package that uses STAN to do SEM modelling

If someday you need to go Bayesian, you can uncomment the following code to set up the computing environment. Otherwise, I hope this short tutorial provides you with a very elementary introduction of the SEM framework of working with panel data.

## The following will create or edit a configuration file for the C++ toolchain
dotR <- file.path(Sys.getenv("HOME"), ".R")
if (!file.exists(dotR)) dir.create(dotR)
M <- file.path(dotR, "Makevars")
if (!file.exists(M)) file.create(M)
cat("\nCXX14FLAGS=-O3 -march=native -mtune=native -fPIC",
    "CXX14=g++", # or clang++ but you may need a version postfix
    file = M, sep = "\n", append = TRUE)

# Load relevant packages
# install.packages("rstan", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
# install.packages('blavaan')
library(rstan)
library(blavaan)

You need to set up the environment for STAN in ROAR:

# Running STAN with on ROAR requires a pakcage called cmdstanR
# install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
##library(cmdstanr)

# Check if C++ toolchain has been set up correctly
##check_cmdstan_toolchain()
# install cmdstan interface
##install_cmdstan(cores = 4)

# Check the cmdstan path and version 
##cmdstan_path()
##cmdstan_version()
# blav_fit <- bsem(sem1, n.chains=1, burnin=30, sample=50, data=subset, mcmcfile = T, bcontrol = list(cores = 4))

# summary(blav_fit)