<Multivariate Data Analysis by Joseph F. Hair, William C. Black, Barry J. Babin, & Rolph E. Anderson>
When it comes to prediction, both regression techniques (ranging from simple linear regression to regularized regression) and recent machine learning algorithms (e.g., random forest and gradient boosting) share one common limitation: each technique can examine only a single dependent (outcome) variable at a time.
All too often, however, the researcher is faced with a set of interrelated questions. For example, what variables determine the brand image of a team, athlete, or store? How does the brand image combine with other variables to affect fans' purchase decisions and satisfaction? How does satisfaction with the team result in long-term loyalty to it? This series of issues has both managerial and theoretical importance.
As a popular technique in behavioral science, structural equation modeling (SEM) can examine a series of dependence relationships simultaneously. It is particularly useful for explaining the relationships between measured variables and latent variables (which cannot be measured directly) and the dependence relationships among latent variables. SEM examines the structure of interrelationships expressed in a series of equations, similar to a series of multiple regression equations. You may think of SEM as a combination of factor analysis (often several) and linear regression. It has the following characteristics:
For example, fans' satisfaction during game attendance often cannot be measured adequately by a single observable item. Instead, fan satisfaction can be treated as a latent (or even multi-dimensional) variable that is measured by multiple observable items, such as satisfaction with game competition, transportation, service, the stadium, etc. That is, multiple observable satisfaction items together constitute a latent variable: fan satisfaction. By convention, latent variables are drawn as circles and observable variables as squares.
SEM consists of two parts: the measurement model and the relationship model (often called the structural model).
(Note that every relationship in the graph, whether a factor loading or a regression coefficient, has an associated error term, but these are not depicted in the graph.)
SEM typically employs maximum likelihood estimation. For details on maximum likelihood estimation, see https://online.stat.psu.edu/stat415/lesson/1/1.2
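In equation form, the two parts can be sketched as follows. The notation is a common textbook convention assumed here, not taken from the source: x holds the observed indicators, ξ the explanatory latent variables, η the outcome variables, Λ_x the factor loadings, B and Γ the regression coefficients, and δ and ζ the error terms.

```latex
\begin{align}
x    &= \Lambda_x \xi + \delta        && \text{(measurement model)} \\
\eta &= B\eta + \Gamma\xi + \zeta     && \text{(relationship model)}
\end{align}
```

The first line is the factor-analysis part (indicators regressed on latent variables); the second is the regression part among the variables of interest.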
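As a sketch of what is being minimized, the discrepancy function usually given for maximum likelihood estimation in covariance-based SEM (under multivariate normality) is shown below. The symbols are assumptions for illustration: S is the sample covariance matrix, Σ(θ) the covariance matrix implied by the model parameters θ, and p the number of observed variables. Whether a particular software package minimizes exactly this function is not stated in the source.

```latex
F_{ML}(\theta) = \ln\lvert \Sigma(\theta) \rvert
               + \operatorname{tr}\!\left( S\,\Sigma(\theta)^{-1} \right)
               - \ln\lvert S \rvert - p
```

The function equals zero when the implied matrix reproduces S exactly. The model chi-square is a multiple of its minimized value, (N - 1) F_ML in many texts, while some programs use N instead.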
Six-Stage Process for SEM
Notes:
Sample Size
Minimum sample size of 150: models with seven or fewer constructs, modest communalities (.5), and no underidentified constructs.
Assessing Goodness-of-fit (GOF) of Model
Commonly used indexes for assessing the GOF of a model (both the measurement model alone and the measurement + relationship model) include:
ROOT MEAN SQUARE ERROR OF APPROXIMATION (RMSEA). One of the most widely used measures that attempts to correct for the tendency of the χ² GOF test statistic to reject models with a large sample or a large number of observed variables is the root mean square error of approximation (RMSEA). It explicitly tries to correct for both model complexity and sample size by including each in its computation. Lower RMSEA values indicate better fit.
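A minimal sketch of how RMSEA combines chi-square, degrees of freedom, and sample size, using one widely cited form of the formula (an assumption here, since the source does not state it). The function name `rmsea` is hypothetical; the inputs are the chi2, DoF, and row count reported in the outputs elsewhere in this document.

```python
import math


def rmsea(chi2: float, dof: int, n_obs: int) -> float:
    """RMSEA = sqrt(max(chi2 - dof, 0) / (dof * (n_obs - 1)))."""
    return math.sqrt(max(chi2 - dof, 0.0) / (dof * (n_obs - 1)))


# chi2 = 120.680945 and DoF = 42 come from the fit statistics; N = 633 from df.info()
value = rmsea(chi2=120.680945, dof=42, n_obs=633)
print(round(value, 4))  # 0.0544
```

With these inputs the sketch agrees with the RMSEA value in the reported fit statistics to the printed precision.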
STANDARDIZED ROOT MEAN RESIDUAL (SRMR). This standardized value of RMR (i.e., the average standardized residual) is useful for comparing fit across models. Although no statistical threshold level can be established, the researcher can assess the practical significance of the magnitude of the SRMR in light of the research objectives and the observed or actual covariances or correlations. Lower RMR and SRMR values represent better fit and higher values represent worse fit.
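To make the idea of an "average standardized residual" concrete, here is a small sketch using one common definition of SRMR: the root of the mean squared difference between observed and model-implied correlations over the lower triangle, diagonal included. The function name `srmr` and both 3x3 matrices are made up for illustration and do not come from the esports data.

```python
import math


def srmr(observed, implied):
    """Root mean squared residual over the lower triangle (diagonal included)."""
    p = len(observed)
    total = 0.0
    for i in range(p):
        for j in range(i + 1):
            total += (observed[i][j] - implied[i][j]) ** 2
    return math.sqrt(total / (p * (p + 1) / 2))


# Hypothetical correlation matrices (illustrative values only)
observed = [[1.00, 0.50, 0.40],
            [0.50, 1.00, 0.30],
            [0.40, 0.30, 1.00]]
implied = [[1.00, 0.45, 0.42],
           [0.45, 1.00, 0.36],
           [0.42, 0.36, 1.00]]

print(round(srmr(observed, implied), 4))  # 0.0329
```

A model that reproduced every observed correlation exactly would give an SRMR of 0.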
NORMED CHI-SQUARE. This GOF measure is a simple ratio of chi-square to the degrees of freedom for a model. Generally, chi-square/df ratios on the order of 3:1 or less are associated with better-fitting models, except in circumstances with larger samples (greater than 750) or other extenuating circumstances, such as a high degree of model complexity.
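As a quick worked example, the ratio can be computed from the chi2 and DoF values reported in the fit statistics elsewhere in this document. The function name `normed_chi2` is hypothetical.

```python
def normed_chi2(chi2: float, dof: int) -> float:
    """Ratio of the model chi-square to its degrees of freedom."""
    return chi2 / dof


# chi2 = 120.680945 and DoF = 42 come from the reported fit statistics
ratio = normed_chi2(120.680945, 42)
print(round(ratio, 2))  # 2.87
print(ratio <= 3)       # True: within the 3:1 guideline described above
```

The sample here (633 rows) is below the 750-observation mark at which the guideline is said to become less reliable.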
COMPARATIVE FIT INDEX (CFI). The CFI is an incremental fit index that is an improved version of the normed fit index (NFI). The CFI is normed so that values range between 0 and 1, with higher values indicating better fit. Because the CFI has many desirable properties, including its relative, but not complete, insensitivity to model complexity, it is among the most widely used indices.
Other GOF indexes include NFI, TLI, GFI, AIC, BIC, etc. Note that NFI, TLI, and GFI have suggested thresholds like the indexes above, but AIC and BIC do not (these two indexes are mainly used for model comparisons).
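The "incremental" nature of the CFI can be sketched as a comparison between the fitted model and a baseline model. The formula below is the commonly cited one (an assumption, as the source does not spell it out), and the function name `cfi` is hypothetical. The inputs are the chi2, DoF, chi2 Baseline, and DoF Baseline values reported in the fit statistics elsewhere in this document.

```python
def cfi(chi2: float, dof: int, chi2_base: float, dof_base: int) -> float:
    """CFI = 1 - max(chi2 - dof, 0) / max(chi2_base - dof_base, chi2 - dof, 0)."""
    model_misfit = max(chi2 - dof, 0.0)
    baseline_misfit = max(chi2_base - dof_base, model_misfit)
    if baseline_misfit == 0:
        return 1.0  # neither model shows misfit beyond its degrees of freedom
    return 1.0 - model_misfit / baseline_misfit


value = cfi(chi2=120.680945, dof=42, chi2_base=5458.951748, dof_base=55)
print(round(value, 4))  # 0.9854
```

With these inputs the sketch agrees with the CFI in the reported fit statistics to the printed precision.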
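# semopy fits the SEM; graphviz is used later to draw the path diagram
import semopy
import pandas as pd
import numpy as np
import graphviz

# Load the survey data and coerce every column to numeric (unparseable values become NaN)
df = pd.read_csv("esports viewership.csv")
df = df.apply(pd.to_numeric, errors='coerce')
df.info()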
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 633 entries, 0 to 632
Columns: 114 entries, Gender to PLA_7
dtypes: int64(114)
memory usage: 563.9 KB
# Locate missing values by (row, column) position; empty arrays mean none were found
np.where(df.isnull())
(array([], dtype=int64), array([], dtype=int64))
model_spec = """
# measurement model
SKI =~ SKI_1 + SKI_2 + SKI_3 + SKI_4 + SKI_5 + SKI_6
EXC =~ EXC_1 + EXC_2 + EXC_3 + EXC_4
# relationship model
Hours_Watch ~ SKI + EXC
"""
Diagram of model specification
Latent variables are drawn as circles and observable variables as squares.
SKI refers to Skill Improvement: the extent to which esports fans watch esports to learn new skills, improve their own game, and imitate professionals. Measured on a 7-point Likert scale. Although the responses are ordinal, the items are treated as continuous in the model, as is common in behavioral science.
SKI 1 Watching my favorite esports game helps me become a better player
SKI 2 I get to learn something new from some of the best players
SKI 3 It would give me a better idea on how to win the game if I play
SKI 4 I can improve my game by looking at techniques and strategies used by the experts
SKI 5 Watching my favorite esports game gives me a deeper understanding of what’s possible when I play
SKI 6 Watching my favorite esports game improves my own play by getting ideas from professional players
EXC refers to Competition Excitement: the extent to which esports fans obtain excitement and arousal through watching esports. Measured on a 7-point Likert scale. Although the responses are ordinal, the items are treated as continuous in the model, as is common in behavioral science.
EXC 1 I like the excitement associated with watching my favorite esports game
EXC 2 I find watching my favorite esports game very exciting
EXC 3 I enjoy the thrill and excitement when I watch my favorite esports game
EXC 4 I feel hyped and excited when I watch my favorite esports game
Hours_Watch (Hou_Wat) refers to hours spent watching esports games per week: On average, how many hours do you watch esports per week? It is an ordinal variable with multiple ordered categories, so it is treated as continuous in the model, as is common in behavioral science.
# Build the model from the specification, fit it to the data, and compute fit statistics
model = semopy.Model(model_spec)
result = model.fit(df)
stats = semopy.calc_stats(model)
print(stats)
       DoF  DoF Baseline        chi2  chi2 p-value  chi2 Baseline      CFI  \
Value   42            55  120.680945  1.550331e-09    5458.951748  0.98544

            GFI     AGFI       NFI       TLI     RMSEA        AIC         BIC  \
Value  0.977893  0.97105  0.977893  0.980933  0.054444  47.618702  154.429992

         LogLik
Value  0.190649
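The DoF and DoF Baseline values in this output can be reproduced by counting parameters by hand, which is a useful check that the specification is what was intended. The sketch below rests on assumptions not stated in the source: the first loading of each latent variable is fixed to 1 for identification, the covariance between SKI and EXC is freely estimated, and the baseline model estimates only the variances of the observed variables. They are chosen because they reproduce the reported values.

```python
# Number of indicators per latent variable, taken from the model specification
indicators = {"SKI": 6, "EXC": 4}

# Observed variables: all indicators plus the outcome Hours_Watch
n_observed = sum(indicators.values()) + 1

# Unique variances and covariances among the observed variables
n_moments = n_observed * (n_observed + 1) // 2

# Free parameters (assumed identification defaults, see the note above)
free_loadings = sum(k - 1 for k in indicators.values())  # first loading fixed to 1
indicator_residuals = sum(indicators.values())           # one error variance per indicator
latent_variances = len(indicators)
latent_covariances = 1                                   # SKI with EXC
regression_paths = 2                                     # Hours_Watch ~ SKI + EXC
outcome_residual = 1                                     # error variance of Hours_Watch

n_params = (free_loadings + indicator_residuals + latent_variances
            + latent_covariances + regression_paths + outcome_residual)

dof = n_moments - n_params
dof_baseline = n_moments - n_observed  # baseline estimates variances only

print(dof, dof_baseline)  # 42 55
```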
# Re-fit the model and draw its path diagram to pd.png (Graphviz must be installed to render it)
import semopy
import graphviz

model = semopy.Model(model_spec)
model.fit(df)
g = semopy.semplot(model, "pd.png")
Factor loadings below are unstandardized.
Regression coefficients below are standardized.