Panel data can be difficult to deal with in social science research, because many assumptions we impose on cross-sectional data modelling does not apply to panel data modelling.
However, panel data analysis is a powerful tool in social scientists’ toolbox because it allows us to draw causal conclusions on a research question, where true random experiment is not possible.
In this tutorial, we will go over:
- the concept of panel data
- the critical issues of working with panel data
- Common methods of working with panel data
- An Example of Modelling Panel data
Panel data can be seen as a combination of:
Cross-sectional data: one observation of multiple objects at a specific point in time (a snapshot).
Time-series data: one object is measured recurrently over time.
Panel data is generate by measuring multiple objects over time. We can think of it like a timeline in which we periodically observe the same individuums. Here is an example of Panel Data:
Missing Data: It’s common in panel datasets to have missing values. Survey participants may drop out of a study over time, or data might not be recorded at certain time points. Handling these missing data points requires careful consideration to avoid bias in the analysis.
Time-Invariant Variables: Some characteristics of the subjects do not change over time.
Heterogeneity: There can be variation across subjects (between-subject heterogeneity) and within the same subject over time (within-subject heterogeneity).
Autocorrelation: Observations for the same subject across time are likely to be correlated. This violates the assumption of independent observations that many statistical methods rely on and can lead to incorrect inferences if not properly addressed.
Multilevel in Nature Panel data has a multi-level structure (data is collected on the same units across multiple time points).
Fixed and Random Effects Models: Control for individual-specific characteristics that do not change over time or assume that individual-specific effect is uncorrelated with the independent variables
Difference-in-Differences (DiD): This method is used in quasi-experimental settings. It compares the before-and-after changes in outcomes for a treatment group and a control group. We’ll discuss this in the paper discussion.
Multilevel Modeling (Hierarchical Linear Modeling): Treating time as a level of the dataset.
Structural Equation Modeling (SEM): Incorporating the fixed and random effects models into a structural equation modeling framework. SEM allows us to:
First, let’s set up the environment for analysis. Beside the commonly
known packages that consists of the R computing environment, the main
workhorse package we need to implement SEM in R is
lavaan
.
lavaan
is widely considered as the best open-source
software for SEM. It is a native R package, so all you need is to
install and load the package in R, no extra compilers needed. It has its
own grammar for model specification, but follows general R syntax rules
whenever possible. For R users, it’s an added benefit.
Lavaan
also has an excellent tutorial page: https://lavaan.ugent.be/
If you use Python, STATA, or SPSS, you can check out
MPLUS
. MPLUS
has long been recognized as the
go-to choice for SEM, but it is a proprietary software, and we don’t
have university-wide license.
You can check out MPLUS in here: https://www.statmodel.com/
In this demonstration we use lavaan
:
# Setting up the environment
# rm(list=ls()) #Remove all existing objects
# setwd("/storage/home/szn5432/work/soda501") #Set the working directory
# Load dataset
load("/home/freeyang/research_data/mcs/merging/sem214.RData")
In this example, we use a panel data set drawn from a longitudinal social survey well-known in adolescent development research called the Millennium Cohort Study (MCS).
The MCS is a UK-based longitudinal survey. Starting from the year 2,000, the survey team follows the lives of around 19,000 young people born across England, Scotland, Wales and Northern Ireland in 2000-02. The study began with an original sample of 18,818 cohort members.
MCS provides multiple measures of the cohort members’ physical, socio-emotional, cognitive and behavioral development over time, as well as detailed information on their daily life, behavior and experiences. Alongside this, rich information on economic circumstances, parenting, relationships and family life is available from both resident parents.
As of now, the MCS have seven waves of survey beginning from when the cohort members were nine-month old to when they were seventeen-year-old. The eighth wave that was conducted when the cohort members were twenty-one-year-old is ongoing.
If you are interested, you can look for more information about the MCS from their official website: https://cls.ucl.ac.uk/cls-studies/millennium-cohort-study/.
MCS is huge. We combined all waves of the dataset, and produce a data frame of 19,243 rows and 18,509 columns. Panel data that has a large sample size and lots of features are extremely rare in adolescent study because they are incredibly expensive, laborious, and difficult to produce. However, they are also extremely powerful because it allows us to build complicated confirmatory models or apply machine learning models on it.
MCS is huge. We combined all waves of the dataset, and produce a data frame of 19,243 rows and 18,509 columns. Panel data that has a large sample size and lots of features are extremely rare in adolescent study because they are incredibly expensive, laborious, and difficult to produce. However, they are also extremely powerful because it allows us to build complicated confirmatory models or apply machine learning models on it.
# Check df
dim(df)
## [1] 19243 18509
head(df, 5)
## # A tibble: 5 × 18,509
## mcsid gpnum00 gcnum00 g_out_parquest gpcmed00 gpschl00 gpaslu00 gpasux0a
## <chr> <dbl> <dbl+lbl> <dbl+lbl> <dbl+lb> <dbl+lb> <dbl+lb> <dbl+lb>
## 1 M10002P 1 1 [1st Coh… 1 [Parent CAW… 1 [Yes] 2 [Fair… 1 [Very… -1 [Not…
## 2 M10007U 1 1 [1st Coh… 1 [Parent CAW… 1 [Yes] 2 [Fair… 1 [Very… -1 [Not…
## 3 M10015U 1 1 [1st Coh… 1 [Parent CAW… 1 [Yes] 1 [Very… 1 [Very… -1 [Not…
## 4 M10016V 1 1 [1st Coh… 1 [Parent CAW… 1 [Yes] 2 [Fair… 3 [Not … 0 [No]
## 5 M10018X 1 1 [1st Coh… 1 [Parent CAW… 1 [Yes] 2 [Fair… 1 [Very… -1 [Not…
## # ℹ 18,501 more variables: gpasux0b <dbl+lbl>, gpasux0c <dbl+lbl>,
## # gpasux0d <dbl+lbl>, gpasux0e <dbl+lbl>, gpasux0f <dbl+lbl>,
## # gpasux0g <dbl+lbl>, gpasux0h <dbl+lbl>, gpasux0i <dbl+lbl>,
## # gpasux0j <dbl+lbl>, gpasux0k <dbl+lbl>, gpasux0l1 <dbl+lbl>,
## # gpasux0m <dbl+lbl>, gpasux0n <dbl+lbl>, gpasux0o <dbl+lbl>,
## # gpasux0p <dbl+lbl>, gpasux0q <dbl+lbl>, gptaim00 <dbl+lbl>,
## # gpwhet00 <dbl+lbl>, gpwhut00 <dbl+lbl>, gpschc00 <dbl+lbl>, …
In this study we uses 4 waves of measurement:
We ask two main questions (along with some others):
Stability of self-control: whether individual differences in self-control are established before adolescence, and the between-person variations are stable thereafter
The lagged effect of self-control on aggression in two periods of adolescence:
It’s always necessary to do EDA before going to confirmatory analysis, but for demonstration purpose, we jump into data engineering, and won’t go over this part.
We want the dataset to be relatively balanced in terms of demographic variables like gender and race:
df1 <- data.frame(gender = rep(NA, nrow(df)), race = rep(NA, nrow(df)))
df1$gender <- factor(df$gender6, levels = c(0, 1), labels = c("Female", "Male"))
df1$race <- factor(df$race, levels = c(1, 2, 3, 4), labels = c("White", "Asian", "Black", "Other"))
# Plot for gender variable
ggplot(df1, aes(x = gender)) +
geom_bar(fill = "skyblue", color = "black") +
labs(title = "Distribution of Gender", x = "Gender", y = "Count") +
theme_minimal()
# Plot for race variable
ggplot(df1, aes(x = race)) +
geom_bar(fill = "lightgreen", color = "black") +
labs(title = "Distribution of Race", x = "Race", y = "Count") +
theme_minimal()
Gender and race remains the same for each measurement. They are the time-invariant variables.
Our key variables are time-variant variables. Line graphs demonstrate that they have ==trend over time==.
Generally, as individuals grow out of their childhood, they become less impulsive.
# Plot for cimp averages
ggplot(cimp_data, aes(x = time, y = mean)) +
geom_line(color = "blue") +
geom_point(color = "blue", size = 3) +
geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.2, color = "blue") +
labs(title = "Trend of 'cimp' Over Time with Standard Deviation", x = "Time", y = "Mean Value") +
theme_minimal()
However, as one grows into adolescnece, they are more likely to take risks.
# Plotting 'crisk' with error bars
ggplot(crisk_data, aes(x = time, y = mean)) +
geom_line(color = "red") +
geom_point(color = "red", size = 3) +
geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.2, color = "red") +
labs(title = "Trend of 'crisk' Over Time with Standard Deviation", x = "Time", y = "Mean Value") +
theme_minimal()
And we see a substantial jump in aggressive behaviors as teenagers grow from 14-year-old into 17-year-old.
# Plot for agr_fig averages
ggplot(agr_fig_data, aes(x = time, y = mean)) +
geom_line(color = "green") +
geom_point(color = "green", size = 3) +
geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd), width = 0.2, color = "green") +
labs(title = "Trend of 'agr_fig' Over Time with Standard Deviation", x = "Time", y = "Mean Value") +
theme_minimal()
All of these trend we’ve seen are called within-person variations, they are sometimes called fixed effect in psychometrics.
In the three graphs above, we calculated the standard deviation for each variable at each measurement. This shows us the between-person variations. They are sometimes called random effect.
In Panel data, we usually see repeated measures being highly correlated with themselves, knows as auto-correlation. This can be detected from the correlation matrix.
# Load packages
p_load(corrplot)
# Compute the correlation matrix. Note: rcorr(as.matrix(subset)) returns a list with the matrix at $r
subset_noid <- subset[, !(names(subset) %in% c('mcsid'))]
cor_matrix <- rcorr(as.matrix(subset_noid))$r
# Basic corrplot
corrplot(cor_matrix, method = "color",
type = "upper", # Only show upper half to avoid redundancy
order = "hclust", # Order variables to cluster highly correlated ones
tl.col = "black", tl.srt = 45) # Adjust text color and rotation for readability.
# capabilities()
In panel data analysis, we need to take care of all of these:
You may have been told that your data need to be ==missing completely at random (MCAR)== or at least == ar random (MAR)== to proceed. However, panel data is almost never MCAR. The main reason is that when you track people longitudinally, they drop out constantly. This is known as attrition.
Attrition does not happen at random most of the time. For example, if you track the health status of people who smoke and who don’t smoke over the next 50 years. You would expect smokers drop out more because of lower life-expectancy among smokers.
However, if you’re lucky to have a huge set of variables (not necessarily included in your model), like what we have in this case, your data may be MAR, because some features may explain the patterns of missingness in the data set.
Our data is not MCAR, as you can see below that we failed the Little’s MCAR test. However, it is always useful to plot the missing pattern to identifdy potential issues of the data instead of just relying on test statistics.
# Missing Pattern
p_load(naniar, Amelia)
# Missingness frequency by variable
n_miss <- miss_var_summary(subset)
print(n_miss, n=40)
## # A tibble: 29 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 crisk7 12103 62.9
## 2 cimp7 11788 61.3
## 3 crisk6 9688 50.3
## 4 agr7_fig 9512 49.4
## 5 agr7_srfig 9512 49.4
## 6 cimp6 9408 48.9
## 7 agr7_wepn 9267 48.2
## 8 agr7_hit 9266 48.2
## 9 agr6_fig 9118 47.4
## 10 agr6_hps 8787 45.7
## 11 pared 8379 43.5
## 12 crisk5 8293 43.1
## 13 agr6_hpc 8056 41.9
## 14 agr6_hit 8050 41.8
## 15 agr6_wepn 8048 41.8
## 16 race 8026 41.7
## 17 cpar5_dire 7879 40.9
## 18 cimp5 7753 40.3
## 19 cpar5_ditr 7735 40.2
## 20 cpar5_dibn 7729 40.2
## 21 gender6 7517 39.1
## 22 agr5_fig 7437 38.6
## 23 agr5_hps 7341 38.1
## 24 agr5_hpc 6422 33.4
## 25 cpar4_dire 6182 32.1
## 26 cpar4_ditr 6130 31.9
## 27 cpar4_dibn 6096 31.7
## 28 married 711 3.69
## 29 mcsid 0 0
# missmap(subset, main = "Missing Data Pattern",
#col = c("yellow", "black"), legend = TRUE)
# MCAR TEST (result significant, Not completely at random)
MCAR_result <- mcar_test(subset)
print(MCAR_result)
## # A tibble: 1 × 4
## statistic df p.value missing.patterns
## <dbl> <dbl> <dbl> <int>
## 1 33209. 24493 0 1330
# Correlation analysis
vis_miss(subset)
One final thing I want to say about the data: you may have found out that our data is in wide format. This is because wide format data is native to SEM. Also, the way the MCS dataset is structured makes it difficult to reshape it into long format.
Structural equation modeling is a modelling framework that models simultaneous regression equations with latent variables. Models such as regression, multivariate regression, path analysis, confirmatory factor analysis, and structural regressionare all special cases of SEM. SEM can model:
observed to observed variables (e.g., regression)
latent to observed variables (e.g., confirmatory factor analysis)
latent to latent variables (e.g., structural regression)
SEM have two main components:
A measurement model: relates observed to latent variables
A structural models: relates latent to latent variables.
You can do SEM within R using a package called lavaan
:
https://lavaan.ugent.be/. lavaan
is
distributed under the GNU General Public License, so it is completely
free to use for different purposes.
Other popular SEM software are proprietary:
MPlus has long been the industry standard for SEM, but it is a standalone and expensive piece of software.
SPSS AMOS is popular in the business intelligence world because of its integration with the SPSS and its point-and-click style GUI. Penn State has a university-wide license for SPSS AMOS.
STATA is widely popular both in Economics and Sociology. It has a built-in SEM tool. You don’t need to pay extra for the built-in tool, but its SEM capability is limited.
SEM with lavaan
is quite straightforward. We have a full
SEM model that consists of a measurement model and a structural
model.
In lavaan
: - =~
connects the latent factor
with the observed indicators - ~
connects the outcome with
the predictors - ~~
are used to model covariances
You can look at this page to learn the complete syntax of
lavaan
# Model Specification
sem1 <- '
# Measurement Model
agr7 =~ agr7_fig + agr7_hit + agr7_wepn
agr6 =~ agr6_fig + agr6_hit + agr6_wepn + agr6_hpc
agr5 =~ agr5_fig + agr5_hpc
## Self-control
## Self-control at age 11
fsc5 =~ cimp5 + crisk5
## Self-control at age 14
fsc6 =~ cimp6 + crisk6
## Self-control at age 17
fsc7 =~ cimp7 + crisk7
## Parenting
## Parenting at age 7
par4 =~ cpar4_dibn + cpar4_ditr + cpar4_dire
## Parenting at age 11
par5 =~ cpar5_dibn + cpar5_ditr + cpar5_dire
# Structural Model
agr7 ~ fsc6 + agr6 + par5
agr6 ~ fsc5 + agr5 + par5
agr5 ~ par4 + gender6 + pared + race + married
fsc7 ~ agr6 + fsc6 + par5
fsc6 ~ agr5 + fsc5 + par5
fsc5 ~ par4 + gender6 + pared + race + married
par5 ~ par4
## Covariances
agr7 ~~ fsc7
agr6 ~~ fsc6
agr5 ~~ fsc5
agr6_fig ~~ agr5_fig
crisk6 ~~ crisk7
crisk5 ~~ crisk6
crisk5 ~~ crisk7
cimp5 ~~ cimp6
cimp6 ~~ cimp7
cimp5 ~~ cimp7
cpar4_dire ~~ cpar5_dire
cpar4_ditr ~~ cpar5_ditr
cpar4_dibn ~~ cpar5_dibn
'
After specifying the model, fitting the model is simple. However, you do need to understand the SEM framework to choose a suitable estimator.
Full Information Maximum Likelihood is the recommended method of handling missing data for SEM. However, it only works with the MLR estimator.
You can also impute the data first with the MICE
package, and then use the multiple imputed data to run SEM. A recent
study (https://psycnet.apa.org/record/2021-12018-001) found
that the two method provides equivalent estimation results in most
cases. However, FIML
is built into the workflow of SEM, so
it is a bit easier to implement.
This is not a tutorial about SEM modelling, so I will not interpret the results and model fit statistics.
# Model Fitting
## Specify the ordinal variables
### MLR estimator with FIML
fit1 <- sem(model = sem1, data = subset, estimator='MLR', missing='fiml', fixed.x = FALSE)
## Warning in lav_data_full(data = data, group = group, cluster = cluster, : lavaan WARNING: some cases are empty and will be ignored:
## 8988 9116 9168 9310 9331 9390 9442 9561 9573 9847 10160 10233 10264 10329 10391 10515 10663 10741 10787 10799 10839 10850 10907 11020 11073 11124 11280 11670 11708 11867 11890 11906 11908 11956 11985 12096 12184 12285 12351 12360 12394 12418 12434 12603 12820 12909 12916 13134 13189 13282 13311 13382 13476 13561 13646 13655 13805 13887 14206 14213 14295 14303 14350 14409 14460 14462 14515 14585 14667 14693 14775 14822 14836 15134 15152 15167 15375 15402 15406 15508 15558 15635 15659 15807 16141 16145 16209 16268 16275 16343 16429 16533 16618 16726 16735 16877 17023 17060 17069 17077 17102 17308 17311 17316 17515 17718 17909 17924 17929 18009 18073 18079 18154 18165 18197 18224 18262 18489 18567 18638 18682 18739 18823 18898 18944 18987 19120 19186
## Warning in lav_object_post_check(object): lavaan WARNING: covariance matrix of latent variables
## is not positive definite;
## use lavInspect(fit, "cov.lv") to investigate.
summary(fit1, standardized = TRUE, fit.measures=TRUE)
## lavaan 0.6.17 ended normally after 272 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 113
##
## Used Total
## Number of observations 19115 19243
## Number of missing patterns 1130
##
## Model Test User Model:
## Standard Scaled
## Test Statistic 5264.736 4450.943
## Degrees of freedom 237 237
## P-value (Chi-square) 0.000 0.000
## Scaling correction factor 1.183
## Yuan-Bentler correction (Mplus variant)
##
## Model Test Baseline Model:
##
## Test statistic 49016.445 40364.341
## Degrees of freedom 294 294
## P-value 0.000 0.000
## Scaling correction factor 1.214
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.897 0.895
## Tucker-Lewis Index (TLI) 0.872 0.870
##
## Robust Comparative Fit Index (CFI) 0.895
## Robust Tucker-Lewis Index (TLI) 0.870
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -210622.699 -210622.699
## Scaling correction factor 2.257
## for the MLR correction
## Loglikelihood unrestricted model (H1) -207990.331 -207990.331
## Scaling correction factor 1.530
## for the MLR correction
##
## Akaike (AIC) 421471.398 421471.398
## Bayesian (BIC) 422359.378 422359.378
## Sample-size adjusted Bayesian (SABIC) 422000.269 422000.269
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.033 0.030
## 90 Percent confidence interval - lower 0.033 0.030
## 90 Percent confidence interval - upper 0.034 0.031
## P-value H_0: RMSEA <= 0.050 1.000 1.000
## P-value H_0: RMSEA >= 0.080 0.000 0.000
##
## Robust RMSEA 0.048
## 90 Percent confidence interval - lower 0.047
## 90 Percent confidence interval - upper 0.050
## P-value H_0: Robust RMSEA <= 0.050 0.985
## P-value H_0: Robust RMSEA >= 0.080 0.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.052 0.052
##
## Parameter Estimates:
##
## Standard errors Sandwich
## Information bread Observed
## Observed information based on Hessian
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## agr7 =~
## agr7_fig 1.000 0.173 0.421
## agr7_hit 1.530 0.088 17.321 0.000 0.265 0.616
## agr7_wepn 0.162 0.017 9.818 0.000 0.028 0.276
## agr6 =~
## agr6_fig 1.000 0.101 0.312
## agr6_hit 2.772 0.235 11.811 0.000 0.279 0.605
## agr6_wepn 0.242 0.028 8.666 0.000 0.024 0.253
## agr6_hpc 4.311 0.328 13.132 0.000 0.434 0.488
## agr5 =~
## agr5_fig 1.000 0.130 0.409
## agr5_hpc 2.863 0.151 18.938 0.000 0.371 0.385
## fsc5 =~
## cimp5 1.000 1.103 0.585
## crisk5 0.696 0.025 28.191 0.000 0.767 0.729
## fsc6 =~
## cimp6 1.000 1.051 0.563
## crisk6 0.871 0.034 25.493 0.000 0.915 0.699
## fsc7 =~
## cimp7 1.000 1.080 0.579
## crisk7 0.758 0.037 20.602 0.000 0.818 0.637
## par4 =~
## cpar4_dibn 1.000 0.334 0.676
## cpar4_ditr 0.945 0.018 51.407 0.000 0.315 0.651
## cpar4_dire 0.280 0.013 22.205 0.000 0.094 0.275
## par5 =~
## cpar5_dibn 1.000 0.381 0.768
## cpar5_ditr 1.042 0.016 65.063 0.000 0.397 0.797
## cpar5_dire 0.562 0.012 44.977 0.000 0.214 0.488
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## agr7 ~
## fsc6 0.024 0.008 3.233 0.001 0.148 0.148
## agr6 1.449 0.108 13.368 0.000 0.844 0.844
## par5 0.036 0.010 3.501 0.000 0.079 0.079
## agr6 ~
## fsc5 0.094 0.019 5.000 0.000 1.028 1.028
## agr5 1.256 0.169 7.447 0.000 1.617 1.617
## par5 0.009 0.006 1.380 0.167 0.032 0.032
## agr5 ~
## par4 0.149 0.010 15.505 0.000 0.383 0.383
## gender6 0.072 0.005 14.380 0.000 0.557 0.279
## pared -0.013 0.001 -9.661 0.000 -0.101 -0.151
## race -0.003 0.005 -0.500 0.617 -0.020 -0.007
## married -0.044 0.004 -11.773 0.000 -0.336 -0.165
## fsc7 ~
## agr6 0.173 0.335 0.517 0.605 0.016 0.016
## fsc6 0.907 0.035 26.129 0.000 0.882 0.882
## par5 -0.070 0.052 -1.355 0.175 -0.025 -0.025
## fsc6 ~
## agr5 2.251 0.748 3.011 0.003 0.278 0.278
## fsc5 1.038 0.086 12.068 0.000 1.089 1.089
## par5 -0.131 0.043 -3.022 0.003 -0.047 -0.047
## fsc5 ~
## par4 -1.502 0.089 -16.950 0.000 -0.454 -0.454
## gender6 -0.350 0.027 -13.110 0.000 -0.317 -0.159
## pared 0.154 0.011 13.890 0.000 0.140 0.208
## race 0.223 0.035 6.427 0.000 0.202 0.076
## married 0.437 0.026 17.117 0.000 0.397 0.195
## par5 ~
## par4 0.749 0.024 31.013 0.000 0.656 0.656
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .agr7 ~~
## .fsc7 -0.037 0.004 -8.893 0.000 -0.655 -0.655
## .agr6 ~~
## .fsc6 -0.035 0.003 -10.164 0.000 -1.031 -1.031
## .agr5 ~~
## .fsc5 -0.090 0.005 -19.581 0.000 -0.906 -0.906
## .agr6_fig ~~
## .agr5_fig 0.025 0.002 11.164 0.000 0.025 0.278
## .crisk6 ~~
## .crisk7 0.300 0.029 10.178 0.000 0.300 0.323
## .crisk5 ~~
## .crisk6 0.068 0.019 3.634 0.000 0.068 0.101
## .crisk7 0.027 0.020 1.350 0.177 0.027 0.037
## .cimp5 ~~
## .cimp6 1.342 0.041 32.437 0.000 1.342 0.568
## .cimp6 ~~
## .cimp7 1.289 0.047 27.604 0.000 1.289 0.549
## .cimp5 ~~
## .cimp7 1.124 0.042 26.604 0.000 1.124 0.483
## .cpar4_dire ~~
## .cpar5_dire 0.027 0.002 16.770 0.000 0.027 0.213
## .cpar4_ditr ~~
## .cpar5_ditr -0.001 0.002 -0.284 0.776 -0.001 -0.005
## .cpar4_dibn ~~
## .cpar5_dibn 0.007 0.002 3.497 0.000 0.007 0.062
## gender6 ~~
## pared 0.006 0.007 0.864 0.388 0.006 0.008
## race -0.001 0.002 -0.731 0.465 -0.001 -0.007
## married -0.000 0.002 -0.133 0.894 -0.000 -0.001
## pared ~~
## race -0.091 0.007 -13.719 0.000 -0.091 -0.162
## married 0.133 0.007 18.442 0.000 0.133 0.182
## race ~~
## married 0.027 0.002 16.474 0.000 0.027 0.149
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .agr7_fig 0.048 0.009 5.118 0.000 0.048 0.116
## .agr7_hit 0.071 0.015 4.693 0.000 0.071 0.165
## .agr7_wepn -0.008 0.002 -4.386 0.000 -0.008 -0.077
## .agr6_fig 0.009 0.005 1.638 0.102 0.009 0.027
## .agr6_hit 0.101 0.017 5.879 0.000 0.101 0.220
## .agr6_wepn -0.008 0.002 -5.202 0.000 -0.008 -0.087
## .agr6_hpc 0.113 0.024 4.713 0.000 0.113 0.127
## .agr5_fig 0.043 0.009 4.699 0.000 0.043 0.135
## .agr5_hpc 0.351 0.027 12.913 0.000 0.351 0.364
## .cimp5 2.764 0.051 53.758 0.000 2.764 1.465
## .crisk5 6.997 0.035 199.557 0.000 6.997 6.652
## .cimp6 2.806 0.050 56.455 0.000 2.806 1.503
## .crisk6 6.371 0.047 136.579 0.000 6.371 4.863
## .cimp7 3.238 0.056 57.654 0.000 3.238 1.736
## .crisk7 5.809 0.046 126.125 0.000 5.809 4.522
## .cpar4_dibn 0.586 0.004 137.222 0.000 0.586 1.188
## .cpar4_ditr 0.626 0.004 148.667 0.000 0.626 1.291
## .cpar4_dire 0.865 0.003 289.376 0.000 0.865 2.545
## .cpar5_dibn 0.430 0.005 94.830 0.000 0.430 0.867
## .cpar5_ditr 0.450 0.005 98.562 0.000 0.450 0.903
## .cpar5_dire 0.738 0.004 180.448 0.000 0.738 1.682
## gender6 1.503 0.005 325.101 0.000 1.503 3.004
## pared 2.842 0.014 198.245 0.000 2.842 1.913
## race 0.164 0.003 47.547 0.000 0.164 0.436
## married 0.595 0.004 165.158 0.000 0.595 1.211
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .agr7_fig 0.139 0.004 33.104 0.000 0.139 0.823
## .agr7_hit 0.115 0.004 29.384 0.000 0.115 0.621
## .agr7_wepn 0.010 0.001 11.186 0.000 0.010 0.924
## .agr6_fig 0.094 0.004 23.017 0.000 0.094 0.902
## .agr6_hit 0.135 0.004 36.347 0.000 0.135 0.634
## .agr6_wepn 0.009 0.001 11.072 0.000 0.009 0.936
## .agr6_hpc 0.604 0.019 31.593 0.000 0.604 0.762
## .agr5_fig 0.084 0.003 25.698 0.000 0.084 0.832
## .agr5_hpc 0.794 0.022 36.372 0.000 0.794 0.852
## .cimp5 2.343 0.047 50.105 0.000 2.343 0.658
## .crisk5 0.518 0.019 26.949 0.000 0.518 0.468
## .cimp6 2.381 0.049 48.849 0.000 2.381 0.683
## .crisk6 0.878 0.033 26.559 0.000 0.878 0.512
## .cimp7 2.315 0.061 38.124 0.000 2.315 0.665
## .crisk7 0.981 0.038 26.046 0.000 0.981 0.594
## .cpar4_dibn 0.132 0.003 41.638 0.000 0.132 0.542
## .cpar4_ditr 0.135 0.003 42.193 0.000 0.135 0.577
## .cpar4_dire 0.107 0.002 53.932 0.000 0.107 0.924
## .cpar5_dibn 0.101 0.003 34.544 0.000 0.101 0.411
## .cpar5_ditr 0.090 0.003 29.993 0.000 0.090 0.365
## .cpar5_dire 0.147 0.002 80.997 0.000 0.147 0.762
## .agr7 0.012 0.002 7.912 0.000 0.413 0.413
## .agr6 0.004 0.001 4.814 0.000 0.381 0.381
## .agr5 0.012 0.001 10.749 0.000 0.717 0.717
## .fsc5 0.812 0.027 30.032 0.000 0.668 0.668
## .fsc6 0.295 0.022 13.151 0.000 0.267 0.267
## .fsc7 0.264 0.029 9.245 0.000 0.226 0.226
## par4 0.111 0.003 34.974 0.000 1.000 1.000
## .par5 0.082 0.003 25.365 0.000 0.569 0.569
## gender6 0.250 0.000 2652.816 0.000 0.250 1.000
## pared 2.207 0.024 92.731 0.000 2.207 1.000
## race 0.141 0.002 59.915 0.000 0.141 1.000
## married 0.241 0.001 351.529 0.000 0.241 1.000
# Check covariance of the latent variables
lavInspect(fit1, "cov.lv")
## agr7 agr6 agr5 fsc5 fsc6 fsc7 par4 par5
## agr7 0.030
## agr6 0.013 0.010
## agr5 0.011 0.009 0.017
## fsc5 -0.056 -0.052 -0.132 1.216
## fsc6 -0.079 -0.070 -0.100 0.982 1.104
## fsc7 -0.107 -0.062 -0.090 0.890 0.998 1.166
## par4 0.008 0.006 0.017 -0.167 -0.147 -0.138 0.111
## par5 0.010 0.005 0.012 -0.125 -0.121 -0.119 0.083 0.145
The package lavaanPlot
, when used together with
lavaan
can do basic plotting for the model fitting result
from lavaan
:
p_load(lavaanPlot) # For plotting
## Label the subset
my_label <- list(
par7 = "Parenting Age 7",
par11 = "Parenting Age 11",
fsc11 = "Self-control Age 11",
fsc14 = "Self-control Age 14",
fsc17 = "Self-control Age 17",
cimp5 = "Impulsivity Age 11",
crisk5 = "Risk-Taking Age 11",
cimp6 = "Impulsivity Age 14",
crisk6 = "Risk-Taking Age 14",
cimp7 = "Impulsivity Age 17",
crisk7 = "Risk-Taking Age 17",
cpar4_dibn = "Parenting Bedroom Age 7",
cpar4_ditr = "Parenting Treats Age 7",
cpar4_dire = "Parenting Reason Age 7",
cpar5_dibn = "Parenting Bedroom Age 11",
cpar5_ditr = "Parenting Treats Age 11",
cpar5_dire = "Parenting Reason Age 11",
agr7 = "Aggression Age 17",
agr6 = "Aggression Age 14",
agr5 = "Aggression Age 11",
gender6 = "Gender",
pared = "Parents Education",
race = "Ethnicity",
married = "Parents Married Status")
# MLR Estimates
lavaanPlot(model = fit1, labels = my_label, node_options = list(shape = "box", fontname = "Helvetica"),
edge_options = list(color = "grey"), coefs=TRUE, stand = TRUE)
Sometimes you can use Bayesian estimator when you have the following issue
Model identification: Your model won’t identified
Prior Knowledge: We social scientists usually have strong background in theory training. Bayesian SEM allows us to incorporate our domain-specific knowledge into the modelling process by specifying the prior
Missiness Since we can define prior, we can model missing pattern if needed. This can be useful when we work with panel data, which is usually MNAR
Accuracy Given the nature of Bayesian estimators, Bayesian SEM can account for measurement errors, non-normality, and heterogeneity in panel data.
It requires at least a complete course to discuss Bayesian estimators. So we will not discuss how to do it here.
There are two main tools we need to implement Bayesian SEM:
STAN
: STAN is a probability programming language
written in C++. However, rstan
offers a interface to invoke
STAN from the R environment.
blavaan
: The R package that uses STAN to do SEM
modelling
If someday you need to go Bayesian, you can uncomment the following code to set up the computing environment. Otherwise, I hope this short tutorial provides you with a very elementary introduction of the SEM framework of working with panel data.
## The following will create or edit a configuration file for the C++ toolchain
dotR <- file.path(Sys.getenv("HOME"), ".R")
if (!file.exists(dotR)) dir.create(dotR)
M <- file.path(dotR, "Makevars")
if (!file.exists(M)) file.create(M)
cat("\nCXX14FLAGS=-O3 -march=native -mtune=native -fPIC",
"CXX14=g++", # or clang++ but you may need a version postfix
file = M, sep = "\n", append = TRUE)
# Load relevant packages
# install.packages("rstan", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
# install.packages('blavaan')
library(rstan)
library(blavaan)
You need to set up the environment for STAN in ROAR:
# Running STAN with on ROAR requires a pakcage called cmdstanR
# install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
##library(cmdstanr)
# Check if C++ toolchain has been set up correctly
##check_cmdstan_toolchain()
# install cmdstan interface
##install_cmdstan(cores = 4)
# Check the cmdstan path and version
##cmdstan_path()
##cmdstan_version()
# blav_fit <- bsem(sem1, n.chains=1, burnin=30, sample=50, data=subset, mcmcfile = T, bcontrol = list(cores = 4))
# summary(blav_fit)