Structural Equation Modeling

<Multivariate Data Analysis by Joseph F. Hair, William C. Black, Barry J. Babin, & Rolph E. Anderson>

When it comes to prediction, both regression techniques (ranging from simple linear regression to regularized regression) and recent machine learning algorithms (e.g., random forest and gradient boosting) share one common limitation: each technique can examine only a single dependent (outcome) variable at a time.

All too often, however, the researcher is faced with a set of interrelated questions. For example, what variables determine the brand image of a team, athlete, or store? How does brand image combine with other variables to affect fans' purchase decisions and satisfaction? How does satisfaction with the team result in long-term loyalty to it? This series of issues has both managerial and theoretical importance.

A popular technique in behavioral science, structural equation modeling (SEM) can examine a series of dependence relationships simultaneously. It is particularly useful for explaining the relationships between measured variables and latent variables (those that cannot be measured directly), as well as the dependence relationships among latent variables. SEM examines the structure of interrelationships expressed in a series of equations, similar to a series of multiple regression equations. You may think of SEM as a combination of (often multiple) factor analyses and linear regressions. It has the following characteristics:

Estimation of multiple and interrelated dependence relationships simultaneously
An ability to represent unobserved concepts (i.e., latent variables or constructs) in these relationships and account for measurement error in the estimation process

For example, fans' satisfaction during game attendance often cannot be measured adequately by a single observable item. Instead, fan satisfaction can be treated as a latent (or even multi-dimensional) variable measured by multiple observable items, such as satisfaction with game competition, transportation, service, the stadium, etc. That is, multiple observable satisfaction items together constitute one latent variable: fan satisfaction. By convention, latent variables are drawn as circles and observable variables as squares.
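To make the idea of a latent variable concrete, here is a minimal simulation sketch. It assumes a single latent satisfaction score that drives four observed items; the item names, loadings, and error size are illustrative assumptions, not values from any real survey.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

N_FANS = 2000
ITEMS = ["game_competition", "transportation", "service", "stadium"]  # hypothetical item names
LOADINGS = [0.9, 0.6, 0.8, 0.7]  # assumed strength of each item's link to the latent variable

# The latent variable: in real data this column is never observed directly.
latent_satisfaction = [random.gauss(0, 1) for _ in range(N_FANS)]

# Each observed item = loading * latent score + item-specific measurement error.
observed = {
    name: [lam * s + random.gauss(0, 0.5) for s in latent_satisfaction]
    for name, lam in zip(ITEMS, LOADINGS)
}


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# The items correlate with one another only because they share the latent cause.
for i, a in enumerate(ITEMS):
    for b in ITEMS[i + 1:]:
        print(f"{a} ~ {b}: r = {pearson(observed[a], observed[b]):.2f}")
```

SEM works in the opposite direction: starting from the correlations among the observed items, it estimates the loadings and the latent variable that could have produced them.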

SEM consists of two parts: a measurement model and a relationship model.

The measurement model specifies the rules of correspondence between measured and latent variables (also called constructs in psychometrics), as in the fan satisfaction example. It enables the researcher to use any number of variables for a single independent or dependent construct. Once the constructs are defined, the model can be used to assess the extent of measurement error, that is, the reliability of the measures (commonly assessed with Cronbach's alpha). As shown in the graph below, latent SAT1 (satisfaction-1) has three corresponding observable measurement items: sat-1.1, sat-1.2, and sat-1.3. In other words, latent SAT1 is constructed from observable sat-1.1, sat-1.2, and sat-1.3 together. The same holds for latent SAT2 (satisfaction-2, another dimension of satisfaction) and latent Team_Con (team-related consumption). In the graph, the measurement model has three parts: SAT1, SAT2, and Team_Con. The path from SAT1 to sat-1.1 is a factor loading.
The relationship model specifies the relationships among the different concepts (either observable or latent variables). In the graph below, these are the relationships among SAT1, SAT2, and Team_Con, although the relationship model may also include observable variables. The path from SAT1 to Team_Con is a regression coefficient. The dependence relationships among the variables (latent or observable) in the model should be grounded in theory or established research.
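Cronbach's alpha, mentioned above as the usual reliability measure, can be computed by hand from the item variances and the variance of the summed score. The sketch below uses the standard formula; the three response lists are made-up 7-point answers used only for illustration, and `cronbach_alpha` is a helper name introduced here.

```python
import statistics


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items, each a list of responses.

    alpha = k / (k - 1) * (1 - sum of item variances / variance of summed score)
    """
    k = len(items)
    item_variances = [statistics.variance(col) for col in items]
    totals = [sum(row) for row in zip(*items)]  # each respondent's summed score
    return k / (k - 1) * (1 - sum(item_variances) / statistics.variance(totals))


# Made-up responses from six respondents to three satisfaction items.
sat_1_1 = [6, 5, 7, 3, 4, 6]
sat_1_2 = [6, 4, 7, 2, 5, 6]
sat_1_3 = [5, 5, 6, 3, 4, 7]

alpha = cronbach_alpha([sat_1_1, sat_1_2, sat_1_3])
print(f"Cronbach's alpha = {alpha:.2f}")
```

Alpha approaches 1 when the items move together across respondents, and drops toward 0 when each item mostly reflects its own noise.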

(Note that every relationship in the graph, whether a factor loading or a regression coefficient, has an error term, but these are not depicted in the graph.)
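Written out as equations, the error terms become explicit. The block below is a sketch for the SAT1 items and the Team_Con outcome; the symbols (λ for factor loadings, γ for regression coefficients, δ and ζ for error terms) follow a common textbook notation rather than anything defined in this notebook, and the path from SAT2 to Team_Con is an assumption about the graph.

```latex
\begin{aligned}
\text{sat-1.1} &= \lambda_{1}\,\mathrm{SAT1} + \delta_{1} \\
\text{sat-1.2} &= \lambda_{2}\,\mathrm{SAT1} + \delta_{2} \\
\text{sat-1.3} &= \lambda_{3}\,\mathrm{SAT1} + \delta_{3} \\
\mathrm{Team\_Con} &= \gamma_{1}\,\mathrm{SAT1} + \gamma_{2}\,\mathrm{SAT2} + \zeta
\end{aligned}
```

The first three lines belong to the measurement model and the last line to the relationship model.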

SEM typically employs maximum likelihood estimation. For details on maximum likelihood estimation, see https://online.stat.psu.edu/stat415/lesson/1/1.2
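In the SEM setting, maximum likelihood is usually expressed as a discrepancy between the sample covariance matrix and the covariance matrix implied by the model. The form below is the one commonly given in textbooks; whether a particular package minimizes exactly this function or a close variant is not something this notebook establishes. Here S is the sample covariance matrix, Σ(θ) the model-implied covariance matrix for parameters θ, and p the number of observed variables.

```latex
F_{\mathrm{ML}}(\theta) = \ln\lvert\Sigma(\theta)\rvert
  + \operatorname{tr}\!\left(S\,\Sigma(\theta)^{-1}\right)
  - \ln\lvert S\rvert - p
```

The function equals zero when Σ(θ) reproduces S exactly, and the model chi-square is approximately (N − 1) times its minimized value.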

Six-Stage Process for SEM

Notes:

constructs here refer to latent variables
stages 1, 2, and 3 should be carried out based on theory and/or established research

Sample Size

Minimum sample size of 100: Models containing five or fewer constructs, each with more than three items (observed variables) and with high item communalities (.6 or higher).

Minimum sample size of 150: Models with seven or fewer constructs, modest communalities (.5), and no underidentified constructs.

Minimum sample size of 300: Models with seven or fewer constructs, lower communalities (below .45), and/or multiple underidentified constructs (those with fewer than three items).

Minimum sample size of 500: Models with large numbers of constructs, some with lower communalities, and/or some having fewer than three measured items.
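The guidelines above can be read as a rough lookup. The sketch below is one possible interpretation, not part of the original guidelines: the function name is hypothetical, "large numbers of constructs" is taken to mean more than seven, and communalities between .45 and .5 (which the guidelines do not address) are treated as "lower".

```python
def suggested_min_sample_size(items_per_construct, min_communality):
    """Rough reading of the rule-of-thumb tiers; the cut-offs are interpretive.

    items_per_construct: number of observed items for each construct, e.g. [6, 4]
    min_communality: the lowest item communality in the model
    """
    n_constructs = len(items_per_construct)
    # A construct with fewer than three items counts as underidentified.
    n_underidentified = sum(1 for k in items_per_construct if k < 3)

    if n_constructs <= 5 and min(items_per_construct) > 3 and min_communality >= 0.6:
        return 100
    if n_constructs <= 7 and min_communality >= 0.5 and n_underidentified == 0:
        return 150
    if n_constructs <= 7:
        return 300
    return 500


# Two constructs with 6 and 4 items; the communality of 0.6 is a hypothetical value.
print(suggested_min_sample_size([6, 4], min_communality=0.6))
```

A helper like this is only a starting point; the final sample size decision should also weigh model complexity, missing data, and the estimation method.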

Assessing Goodness-of-fit (GOF) of Model

Commonly used indices for assessing the GOF of a model (both the measurement model alone and the measurement + relationship model) include:

Root Mean Square Error of Approximation (RMSEA). One of the most widely used measures that attempts to correct for the tendency of the χ² GOF test statistic to reject models with a large sample or a large number of observed variables is the root mean square error of approximation (RMSEA). It explicitly tries to correct for both model complexity and sample size by including each in its computation. Lower RMSEA values indicate better fit.
Standardized Root Mean Residual (SRMR). This standardized value of RMR (i.e., the average standardized residual) is useful for comparing fit across models. Although no statistical threshold level can be established, the researcher can assess the practical significance of the magnitude of the SRMR in light of the research objectives and the observed or actual covariances or correlations. Lower RMR and SRMR values represent better fit and higher values represent worse fit.
Normed Chi-Square. This GOF measure is a simple ratio of chi-square to the degrees of freedom for a model. Generally, chi-square/df ratios on the order of 3:1 or less are associated with better-fitting models, except in circumstances with larger samples (greater than 750) or other extenuating circumstances, such as a high degree of model complexity.
Comparative Fit Index (CFI). The CFI is an incremental fit index that is an improved version of the normed fit index (NFI). The CFI is normed so that values range between 0 and 1, with higher values indicating better fit. Because the CFI has many desirable properties, including its relative, but not complete, insensitivity to model complexity, it is among the most widely used indices.
Other GOF indices include NFI, TLI, GFI, AIC, and BIC. NFI, TLI, and GFI have suggested thresholds like the indices above, whereas AIC and BIC do not; those two are mainly used for model comparison.
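For reference, three of the indices above are commonly defined as follows. The subscript M denotes the fitted model, B the baseline (independence) model, and N the sample size; this notation is introduced here for illustration, and some software uses N instead of N − 1 in the RMSEA denominator.

```latex
\begin{aligned}
\text{normed } \chi^2 &= \frac{\chi^2_M}{df_M} \\
\mathrm{RMSEA} &= \sqrt{\frac{\max(\chi^2_M - df_M,\ 0)}{df_M\,(N - 1)}} \\
\mathrm{CFI} &= 1 - \frac{\max(\chi^2_M - df_M,\ 0)}{\max(\chi^2_B - df_B,\ \chi^2_M - df_M,\ 0)}
\end{aligned}
```

The RMSEA formula shows how the index includes both model complexity (through the degrees of freedom) and sample size in its computation.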

In [1]:

import semopy
import pandas as pd
import numpy as np
import graphviz

In [2]:

df = pd.read_csv("esports viewership.csv") 
df = df.apply(pd.to_numeric, errors='coerce')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 633 entries, 0 to 632
Columns: 114 entries, Gender to PLA_7
dtypes: int64(114)
memory usage: 563.9 KB

In [3]:

np.where(df.isnull())

Out[3]:

(array([], dtype=int64), array([], dtype=int64))

In [4]:

model_spec = """
  # measurement model
    SKI =~ SKI_1 + SKI_2 + SKI_3 + SKI_4 + SKI_5 + SKI_6
    EXC =~ EXC_1 + EXC_2 + EXC_3 + EXC_4
  # relationship model
    Hours_Watch ~ SKI + EXC
"""

Diagram of model specification

Latent variables are drawn as circles and observable variables as squares.
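The model specification can also be read as a parameter count, which gives the model's degrees of freedom. The sketch below parses the same specification with plain string handling (it does not use semopy). It assumes the usual defaults: the first loading of each latent variable is fixed at 1.0, every observed variable has a free residual variance, and the two latent variables have free variances and are free to covary.

```python
model_spec = """
  # measurement model
    SKI =~ SKI_1 + SKI_2 + SKI_3 + SKI_4 + SKI_5 + SKI_6
    EXC =~ EXC_1 + EXC_2 + EXC_3 + EXC_4
  # relationship model
    Hours_Watch ~ SKI + EXC
"""

loadings = {}     # latent variable -> list of its indicator names
regressions = {}  # outcome variable -> list of its predictor names
for line in model_spec.splitlines():
    line = line.split("#")[0].strip()  # drop comments and blank lines
    if not line:
        continue
    if "=~" in line:
        lhs, rhs = line.split("=~")
        loadings[lhs.strip()] = [term.strip() for term in rhs.split("+")]
    elif "~" in line:
        lhs, rhs = line.split("~")
        regressions[lhs.strip()] = [term.strip() for term in rhs.split("+")]

indicators = [item for items in loadings.values() for item in items]
observed = set(indicators) | {y for y in regressions if y not in loadings}
p = len(observed)
moments = p * (p + 1) // 2  # unique variances and covariances in the data

n_latent = len(loadings)
free_params = (
    sum(len(items) - 1 for items in loadings.values())  # loadings (first one per latent is fixed)
    + sum(len(xs) for xs in regressions.values())       # regression coefficients
    + p                                                 # one residual variance per observed variable
    + n_latent                                          # one variance per latent variable
    + n_latent * (n_latent - 1) // 2                    # covariances among the latent variables
)
dof = moments - free_params
print(p, moments, free_params, dof)
```

With 11 observed variables there are 66 unique moments, and 24 free parameters leave 42 degrees of freedom, which is the DoF value that calc_stats reports for this model further down. The counting rules are specific to a model of this shape and would need adjusting for observed predictors or latent outcomes.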

SKI refers to Skill Improvement: The extent to which esports fans watch esports to learn new skills, improve their own play, and imitate professionals. Measured by 7-point Likert Scale. It was treated as a discrete variable in behavioral science.
SKI 1 Watching my favorite esports game helps me become a better player
SKI 2 I get to learn something new from some of the best players
SKI 3 It would give me a better idea on how to win the game if I play
SKI 4 I can improve my game by looking at techniques and strategies used by the experts
SKI 5 Watching my favorite esports game gives me a deeper understanding of what’s possible when I play
SKI 6 Watching my favorite esports game improves my own play by getting ideas from professional players

EXC refers to Competition Excitement: The extent to which esports fans obtain excitement and arousal through watching esports. Measured by 7-point Likert Scale. It was treated as a discrete variable in behavioral science.
EXC 1 I like the excitement associated with watching my favorite esports game
EXC 2 I find watching my favorite esports game very exciting
EXC 3 I enjoy the thrill and excitement when I watch my favorite esports game
EXC 4 I feel hyped and excited when I watch my favorite esports game

Hours_Watch refers to hours spent watching esports games per week: On average, how many hours do you watch esports per week? It is an ordinal variable whose categories are consecutive ranges of hours, so it was treated as a discrete variable in behavioral science.

1-4
5-8
9-12
13-16
17-20
21-24
24 and above

In [5]:

model = semopy.Model(model_spec)
result = model.fit(df)
stats = semopy.calc_stats(model)
print(stats)

       DoF  DoF Baseline        chi2  chi2 p-value  chi2 Baseline      CFI  \
Value   42            55  120.680945  1.550331e-09    5458.951748  0.98544   

            GFI     AGFI       NFI       TLI     RMSEA        AIC         BIC  \
Value  0.977893  0.97105  0.977893  0.980933  0.054444  47.618702  154.429992   

         LogLik  
Value  0.190649
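Several of the reported indices can be recomputed by hand from the chi-square values and degrees of freedom, which helps clarify what they measure. The sketch below copies the numbers from the output above and takes N = 633 from the df.info() output; the formulas are the common textbook definitions, so agreement with semopy's internals is checked here rather than assumed.

```python
import math

# Values copied from the calc_stats output above; n is the row count from df.info().
chi2, dof = 120.680945, 42
chi2_base, dof_base = 5458.951748, 55
n = 633

normed_chi2 = chi2 / dof
rmsea = math.sqrt(max(chi2 - dof, 0) / (dof * (n - 1)))
cfi = 1 - max(chi2 - dof, 0) / max(chi2_base - dof_base, chi2 - dof, 0)
tli = (chi2_base / dof_base - chi2 / dof) / (chi2_base / dof_base - 1)
nfi = 1 - chi2 / chi2_base

print(f"normed chi2 = {normed_chi2:.3f}")
print(f"RMSEA = {rmsea:.4f}")
print(f"CFI   = {cfi:.4f}")
print(f"TLI   = {tli:.4f}")
print(f"NFI   = {nfi:.4f}")
```

The recomputed RMSEA, CFI, TLI, and NFI should agree with the reported columns to about four decimal places. The normed chi-square comes out just under the 3:1 guideline, even though the chi-square p-value itself is very small, which is the sample-size sensitivity that RMSEA and CFI are meant to offset.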

In [6]:

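# The model was already specified and fitted in the previous cell, so it can be plotted directly.
# semplot writes the path diagram to "pd.png" (Graphviz must be installed for this to work).
g = semopy.semplot(model, "pd.png")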

Factor loadings below are unstandardized.

Under each latent variable, the factor loading of the first observable item is fixed at 1.0 and serves as the baseline against which the other items' loadings are compared. The other factor loadings can be less than or greater than 1.0.
Alternatively, the variance of the latent variable can be fixed at 1.0, in which case the items' factor loadings typically fall between 0 and 1 when the items are standardized as well. Packages in Python don't provide such an option; see SEM with R for it.
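The two scaling choices describe the same model. The sketch below uses made-up numbers for a one-factor model to show that moving from the first-loading-fixed scaling to the unit-variance scaling only rescales the loadings and leaves the model-implied covariance matrix unchanged. All values and the helper name `implied_covariance` are assumptions for illustration.

```python
import math

# Hypothetical unstandardized solution with the first loading fixed at 1.0.
marker_loadings = [1.0, 1.2, 0.8]
latent_variance = 0.64
residual_variances = [0.36, 0.30, 0.50]


def implied_covariance(loadings, factor_variance, residuals):
    """Model-implied covariance matrix of the items for a one-factor model."""
    k = len(loadings)
    return [
        [
            loadings[i] * loadings[j] * factor_variance + (residuals[i] if i == j else 0.0)
            for j in range(k)
        ]
        for i in range(k)
    ]


# Unit-variance scaling: fix the latent variance at 1.0 and absorb its scale into the loadings.
scale = math.sqrt(latent_variance)
unit_loadings = [lam * scale for lam in marker_loadings]

cov_marker = implied_covariance(marker_loadings, latent_variance, residual_variances)
cov_unit = implied_covariance(unit_loadings, 1.0, residual_variances)

# Fully standardized loadings additionally divide by each item's standard deviation.
std_loadings = [
    unit_loadings[i] / math.sqrt(cov_unit[i][i]) for i in range(len(unit_loadings))
]

print("unit-variance loadings:", [round(lam, 3) for lam in unit_loadings])
print("standardized loadings: ", [round(lam, 3) for lam in std_loadings])
```

Because both scalings imply the same covariance matrix, they fit the data equally well; the choice affects only how the loadings are read.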

Regression coefficients below are standardized.