Path analysis is an extension of multiple regression used to evaluate causal models by examining the relationships between a dependent variable and two or more independent variables. With this method, one can estimate both the magnitude and the significance of hypothesized causal connections between variables.
There are two main requirements for path analysis:
All causal relationships between variables must go in one direction only (no pair of variables may cause each other).
The variables must have a clear time-ordering since one variable cannot be said to cause another unless it precedes it in time.
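The first requirement can be checked mechanically once the hypothesized arrows are written down. The following is a minimal base R sketch, not part of the original tutorial: the helper `is_recursive()` and the variable names `x`, `m`, and `y` are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: TRUE when no variable can be reached from itself
# by following the arrows, i.e. the model has no feedback loops.
is_recursive <- function(paths) {
  vars <- unique(as.vector(paths))
  adj <- matrix(0, length(vars), length(vars), dimnames = list(vars, vars))
  adj[paths] <- 1                              # adj[cause, effect] = 1
  reach <- adj                                 # directed walks of length 1
  for (k in seq_along(vars)) {
    if (any(diag(reach) > 0)) return(FALSE)    # a walk returns to its start
    reach <- (reach %*% adj > 0) * 1           # walks one step longer
  }
  TRUE
}

# Each row is one arrow: cause, effect (illustrative variable names).
one_way  <- rbind(c("x", "m"), c("m", "y"), c("x", "y"))
feedback <- rbind(c("x", "m"), c("m", "y"), c("y", "x"))

is_recursive(one_way)   # TRUE
is_recursive(feedback)  # FALSE
```

The check only covers the direction requirement; time-ordering is a property of how the data were collected and has to be argued from the study design.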
Path analysis is theoretically useful because, unlike other techniques, it forces us to specify relationships among all of the independent variables. This results in a model showing causal mechanisms through which independent variables produce both direct and indirect effects on a dependent variable.
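To make the distinction between direct and indirect effects concrete, here is a small sketch using ordinary least squares on R's built-in mtcars data. It assumes a deliberately simplified three-variable model (disp affects mpg directly and also through hp), which is not the full model fitted later in this tutorial.

```r
# Simplified three-variable model on R's built-in mtcars data:
#   disp -> hp -> mpg   (indirect path)
#   disp -> mpg         (direct path)
mediator_fit <- lm(hp ~ disp, data = mtcars)         # path a: disp -> hp
outcome_fit  <- lm(mpg ~ disp + hp, data = mtcars)   # paths b and direct
total_fit    <- lm(mpg ~ disp, data = mtcars)        # total effect

a        <- coef(mediator_fit)[["disp"]]
b        <- coef(outcome_fit)[["hp"]]
direct   <- coef(outcome_fit)[["disp"]]
indirect <- a * b                                    # effect routed through hp
total    <- coef(total_fit)[["disp"]]

c(direct = direct, indirect = indirect, total = total)

# For least-squares fits of this form, total = direct + indirect
all.equal(total, direct + indirect)
```

The indirect effect is the product of the coefficients along the path, and for linear least-squares models of this shape the direct and indirect effects add up to the total effect.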
Typically path analysis involves the construction of a path diagram in which the relationships between all variables and the causal direction between them are specifically laid out.
When conducting path analysis one should first construct an input path diagram, which illustrates the hypothesized relationships. After statistical analysis has been completed, an output path diagram can then be constructed, which illustrates the relationships as they actually exist, according to the analysis conducted.
While path analysis is useful for evaluating causal hypotheses, it cannot determine the direction of causality: it clarifies correlations and indicates the strength of a hypothesized causal relationship, but the direction of each path is an assumption supplied by the analyst.
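NOTE: OpenMx is required to run semPlot. On current R installations, OpenMx is also available from CRAN via install.packages("OpenMx"). To install OpenMx using the original installer script, paste the command below into your console and press enter: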
source('http://openmx.psyc.virginia.edu/getOpenMx.R')
Once OpenMx is installed, you can load the required packages:
library(lavaan)
library(semPlot)
library(OpenMx)
library(tidyverse)
library(knitr)
library(kableExtra)
library(GGally)
# Organizing package information for table
packages <- c("tidyverse", "knitr", "kableExtra", "lavaan", "semPlot", "OpenMx", "GGally")
display <- c("Package", "Title", "Maintainer", "Version", "URL")
# Start with zero rows; the names avoid masking base::table() and base::list()
pkg_table <- matrix(NA_character_, 0, length(display), dimnames = list(NULL, display))
for (pkg in packages) {
  desc <- packageDescription(pkg)
  # Use NA for any field missing from a package's DESCRIPTION file
  fields <- vapply(display, function(f) {
    if (is.null(desc[[f]])) NA_character_ else desc[[f]]
  }, character(1))
  pkg_table <- rbind(pkg_table, unname(fields))
}
# Keep the URL field up to its last comma; entries without a comma become NA
pkg_table[, "URL"] <- stringr::str_extract(pkg_table[, "URL"], ".+,")
# Table of packages
kable(pkg_table, format = "html", align = "c") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
Package | Title | Maintainer | Version | URL
---|---|---|---|---
tidyverse | Easily Install and Load ‘Tidyverse’ Packages | Hadley Wickham <hadley@rstudio.com> | 1.1.1 | http://tidyverse.org,
knitr | A General-Purpose Package for Dynamic Report Generation in R | Yihui Xie <xie@yihui.name> | 1.16 | NA
kableExtra | Construct Complex Table with ‘kable’ and Pipe Syntax | Hao Zhu <haozhu233@gmail.com> | 0.4.0 | http://haozhu233.github.io/kableExtra/,
lavaan | Latent Variable Analysis | Yves Rosseel <Yves.Rosseel@UGent.be> | 0.5-23.1097 | NA
semPlot | Path Diagrams and Visual Analysis of Various SEM Packages’ Output | Sacha Epskamp <mail@sachaepskamp.com> | 1.1 | NA
OpenMx | Extended Structural Equation Modelling | Joshua N. Pritikin <jpritikin@pobox.com> | 2.7.12 | http://openmx.ssri.psu.edu,
GGally | Extension to ‘ggplot2’ | Barret Schloerke <schloerke@gmail.com> | 1.3.2 | https://ggobi.github.io/ggally,
The four general steps to conducting a path analysis in R include:
For this tutorial, we will use the mtcars dataset to demonstrate how to conduct a path analysis. However, a covariance matrix can also be used if necessary.
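As a sketch of the covariance-matrix route, lavaan's fitting functions accept a `sample.cov` matrix together with `sample.nobs` in place of raw data. The example below assumes lavaan is installed, uses `sem()` rather than the `cfa()` call used elsewhere in this tutorial, and reuses the model string that appears later in the tutorial; the object names `fit_raw` and `fit_cov` are illustrative.

```r
library(lavaan)

model <- '
  mpg ~ hp + gear + cyl + disp + carb + am + wt
  hp  ~ cyl + disp + carb
'

# Covariance matrix of the variables used in the model
vars <- c("mpg", "hp", "gear", "cyl", "disp", "carb", "am", "wt")
cov_matrix <- cov(mtcars[vars])

# Fit once from the raw data and once from the covariance matrix alone
fit_raw <- sem(model, data = mtcars)
fit_cov <- sem(model, sample.cov = cov_matrix, sample.nobs = nrow(mtcars))

# Side-by-side comparison of the free parameter estimates
round(cbind(raw = coef(fit_raw), cov = coef(fit_cov)), 3)
```

The number of observations must be supplied because it cannot be recovered from the covariance matrix, and it is needed for standard errors and test statistics.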
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
First, we must identify the independent and dependent variables within our dataset.
In the R environment, a regression formula has the following form: y ~ x1 + x2 + x3 + x4
In this formula, the tilde sign (“~”) is the regression operator. On the left-hand side of the operator, we have the dependent variable (y), and on the right-hand side, we have the independent variables, each one separated by the “+” operator.
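As a quick illustration of the formula syntax outside of lavaan, the same notation drives base R's `lm()`. This sketch uses two predictors from mtcars chosen only for brevity; it is not the tutorial's path model.

```r
# A formula is an ordinary R object: dependent variable on the left of "~",
# independent variables on the right, separated by "+".
f <- mpg ~ wt + hp

all.vars(f)          # variable names, dependent variable first

# Fit an ordinary multiple regression using the formula
ols_fit <- lm(f, data = mtcars)
coef(ols_fit)        # intercept plus one slope per independent variable
```

In lavaan the same formulas are written inside a quoted model string, one regression per line, which is what allows several regressions to be estimated together.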
For this demonstration, we will use mpg as the dependent variable and cyl, disp, hp, gear, am, wt, and carb as the independent variables. Furthermore, we will also assume that hp is a function of cyl, disp, and carb.
model <-'
mpg ~ hp + gear + cyl + disp + carb + am + wt
hp ~ cyl + disp + carb
'
fit <- cfa(model, data = mtcars)
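The cfa() function is lavaan's wrapper for fitting confirmatory factor analysis models; it also fits path models such as this one, which contains only observed variables. (lavaan's sem() function is the more conventional choice for a path model; for a model without latent variables, the two are expected to give the same estimates.) The first argument is the user-specified model. The second argument is the dataset that contains the observed variables. Once the model has been fitted, the summary() function reports the fit measures, parameter estimates, standardized estimates, and R-square values requested in the call.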
summary(fit, fit.measures = TRUE, standardized = TRUE, rsquare = TRUE)
## lavaan (0.5-23.1097) converged normally after 62 iterations
##
## Number of observations 32
##
## Estimator ML
## Minimum Function Test Statistic 7.901
## Degrees of freedom 3
## P-value (Chi-square) 0.048
##
## Model test baseline model:
##
## Minimum Function Test Statistic 132.831
## Degrees of freedom 13
## P-value 0.000
##
## User model versus baseline model:
##
## Comparative Fit Index (CFI) 0.959
## Tucker-Lewis Index (TLI) 0.823
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -541.437
## Loglikelihood unrestricted model (H1) -537.487
##
## Number of free parameters 12
## Akaike (AIC) 1106.874
## Bayesian (BIC) 1124.463
## Sample-size adjusted Bayesian (BIC) 1087.054
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.226
## 90 Percent Confidence Interval 0.019 0.425
## P-value RMSEA <= 0.05 0.062
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.025
##
## Parameter Estimates:
##
## Information Expected
## Standard Errors Standard
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## mpg ~
## hp -0.022 0.016 -1.388 0.165 -0.022 -0.243
## gear 0.586 1.247 0.470 0.638 0.586 0.071
## cyl -0.848 0.710 -1.194 0.232 -0.848 -0.248
## disp 0.006 0.012 0.512 0.609 0.006 0.127
## carb -0.472 0.620 -0.761 0.446 -0.472 -0.125
## am 1.624 1.542 1.053 0.292 1.624 0.133
## wt -2.671 1.267 -2.109 0.035 -2.671 -0.428
## hp ~
## cyl 7.717 6.554 1.177 0.239 7.717 0.201
## disp 0.233 0.087 2.666 0.008 0.233 0.421
## carb 20.273 3.405 5.954 0.000 20.273 0.478
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .mpg 5.011 1.253 4.000 0.000 5.011 0.139
## .hp 644.737 161.184 4.000 0.000 644.737 0.142
##
## R-Square:
## Estimate
## mpg 0.861
## hp 0.858
As the summary above shows, wt is a significant predictor of mpg (p = 0.035), and both disp (p = 0.008) and carb (p < 0.001) are significant predictors of hp. However, hp itself is not a significant predictor of mpg (p = 0.165).
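Because hp sits between carb and mpg in this model, the indirect effect of carb on mpg through hp can also be estimated directly. The sketch below assumes lavaan is installed and uses its parameter labels and `:=` operator for defined parameters; the labels `a` and `b`, the name `carb_via_hp`, and the use of `sem()` are illustrative choices rather than part of the original analysis.

```r
library(lavaan)

model_labeled <- '
  mpg ~ b * hp + gear + cyl + disp + carb + am + wt
  hp  ~ cyl + disp + a * carb

  # Defined parameter: effect of carb on mpg that is routed through hp
  carb_via_hp := a * b
'

fit_labeled <- sem(model_labeled, data = mtcars)

# Regressions and the defined indirect effect as a data frame
est <- parameterEstimates(fit_labeled)
est[est$op %in% c("~", ":="),
    c("lhs", "op", "rhs", "label", "est", "se", "pvalue")]
```

The row with operator `:=` reports the product of the two labeled paths along with a standard error, which gives a test of the indirect effect alongside the direct effects.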
One of the best ways to understand a structural equation model (SEM) is to inspect it visually using a path diagram. Thanks to the semPlot package, this is easy to do.
The semPaths() function provides a quick and easy way to generate a visual representation of your model, labeling each path with the estimate that describes the relationship between the variables it connects. The path diagram produced below is that of the mtcars model we created earlier in this tutorial.
https://rdrr.io/cran/semPlot/man/semPaths.html provides a good breakdown of many additional customization options.
semPaths(fit, 'std', layout = 'circle')
Exercise 1: What other layouts can you find that might make the SEM easier to read? HINT: Google search “semPath layouts”.
semPaths(fit, "std", layout = "tree", edge.label.cex = 0.9, curvePivot = TRUE)
The “tree” layout provides a good amount of space between the variables, making the diagram easier to read. The diagram can be customized much further, but that is beyond the scope of this tutorial.
Exercise 2: What do the arrows and values between each independent variable and the dependent variable represent?
The arrows and values between each independent variable and the dependent variable (or a mediating variable, such as hp in this model) are path coefficients. Path coefficients are standardized versions of linear regression weights, which can be used to examine possible causal linkages between variables in the structural equation modeling approach. The standardization multiplies the ordinary regression coefficient by the ratio of the standard deviation of the explanatory variable to the standard deviation of the response variable; the standardized coefficients can then be compared to assess the relative effects of the variables within the fitted model.
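The standardization can be reproduced by hand with base R. This sketch assumes a small two-predictor regression on mtcars rather than the full path model, and the object names are illustrative. Note that lavaan's Std.all column may differ slightly from a hand calculation on the full model, because it is based on the variances implied by the fitted model.

```r
# Unstandardized regression weights from an ordinary regression
ols_fit <- lm(mpg ~ wt + hp, data = mtcars)
b <- coef(ols_fit)[c("wt", "hp")]

# Standardize: multiply each weight by sd(explanatory) / sd(response)
sd_x <- sapply(mtcars[c("wt", "hp")], sd)
beta <- b * sd_x / sd(mtcars$mpg)
beta

# Same result from refitting the regression on z-scored variables
z <- as.data.frame(scale(mtcars[c("mpg", "wt", "hp")]))
beta_z <- coef(lm(mpg ~ wt + hp, data = z))[c("wt", "hp")]
all.equal(beta, beta_z)
```

Because the standardized weights are unit-free, they can be compared across predictors measured on very different scales, such as weight in thousands of pounds and gross horsepower.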
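We can see from the path coefficients in our path diagram that wt has the largest standardized coefficient on mpg (-0.428), so under the hypothesized model mpg is more strongly influenced by wt than by any other variable.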
Exercise 3: What other inferences can you draw about the relationship between variables from the above SEM?
Exercise 4: What do the arrows and values between the independent variables represent?
# Drop drat, qsec, and vs (columns 5, 7, 8), which are not in the model
ggcorr(mtcars[-c(5, 7, 8)], nbreaks = 6, label = TRUE, low = "red3", high = "green3",
       label_round = 2, name = "Correlation Scale", label_alpha = TRUE, hjust = 0.75) +
  ggtitle(label = "Correlation Plot") +
  theme(plot.title = element_text(hjust = 0.6))
As we can see, the values on the arrows between the independent variables in the path diagram match the correlations shown in the correlation plot.
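For a numeric rather than visual comparison, the same correlations can be printed with base R's `cor()`. This sketch simply lists the independent variables of the model by name; the object names are illustrative.

```r
# Independent variables of the path model
predictors <- c("cyl", "disp", "hp", "wt", "am", "gear", "carb")

# Pearson correlation matrix, rounded to match the plot labels
r <- cor(mtcars[predictors])
round(r, 2)
```

Each off-diagonal entry corresponds to one double-headed arrow between a pair of independent variables in the standardized path diagram.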