Partial Least Squares Path Modeling (PLS-PM)

This tutorial is based on the Partial Least Squares Path Modeling R package (plspm) maintained by Gaston Sanchez.

Concept

From the Structural Equation Modeling (SEM) standpoint, PLS-PM offers a different approach that doesn’t impose any distributional assumptions on the data that are hard to meet in real life, especially for non-experimental data. When we use a covariance-based SEM approach we are implicitly assuming that the data is generated by some “true” theoretical model. In this scenario, the goal of covariance structure analysis (CSA) is to recover the “true” model that gave rise to the observed covariances.

Partial least squares is an alternative to the covariance structure analysis used in SEM. PLS-PM can also be regarded as a technique for analyzing a system of relationships between multiple blocks of variables or, to put it in simple terms, multiple data tables.

Theory Development

For this example our goal will be to obtain an Index of Success using data of Spanish professional football teams. The data comes from the professional Spanish Football League, La Liga, and it consists of 14 variables measured on 20 teams. I collected and manually curated data from the 2008-2009 season from different websites like LeagueDay.com, BDFutbol, ceroacero.es, statto.com and LFP. The resulting data set comes with the package plspm under the name spainfoot.

Variable	Description
GSH	total number of goals scored at home
GSA	total number of goals scored away
SSH	percentage of matches with scored goals at home
SSA	percentage of matches with scored goals away
GCH	total number of goals conceded at home
GCA	total number of goals conceded away
CSH	percentage of matches with no conceded goals at home
CSA	percentage of matches with no conceded goals away
WMH	total number of won matches at home
WMA	total number of won matches away
LWR	longest run of won matches
LRWL	longest run of matches without losing
YC	total number of yellow cards
RC	total number of red cards

Common examples of indices in PLS include an Index of Satisfaction, an Index of Motivation, an Index of Usability or, in our case, an Index of Success. The issue with these four concepts (satisfaction, motivation, usability, success) is that they cannot be measured directly.

As is typical with structural models, we rely on some kind of theory to propose a model. It can be a very complex theory or an extremely simple one. In this case we are not going to reinvent the wheel, so let’s define a simple model based on a basic yet useful theory:

The better the quality of the Attack, as well as the quality of the Defense, the more Success.

Block of Attack

If you check the variables available in the spainfoot data, you will see that the first four columns (1 to 4) have to do with scored goals, which in turn can be considered to reflect the Attack of a team. We are going to take those variables as indicators of Attack:

GSH number of goals scored at home
GSA number of goals scored away
SSH percentage of matches with scored goals at home
SSA percentage of matches with scored goals away

Block of Defense

The following four columns in the data (5 to 8) have to do with the Defense construct:

GCH number of goals conceded at home
GCA number of goals conceded away
CSH percentage of matches with no conceded goals at home
CSA percentage of matches with no conceded goals away

Block of Success

Finally, columns 9 to 12 can be grouped in a third block of variables, the block associated with Success:

WMH number of won matches at home
WMA number of won matches away
LWR longest run of won matches
LRWL longest run of matches without losing

# load plspm
library(plspm)
# load data spainfoot
data(spainfoot)
# first 6 rows of spainfoot data
head(spainfoot)

##            GSH GSA  SSH  SSA GCH GCA  CSH  CSA WMH WMA LWR LRWL  YC RC
## Barcelona   61  44 0.95 0.95  14  21 0.47 0.32  14  13  10   22  76  6
## RealMadrid  49  34 1.00 0.84  29  23 0.37 0.37  14  11  10   18 115  9
## Sevilla     28  26 0.74 0.74  20  19 0.42 0.53  11  10   4    7 100  8
## AtleMadrid  47  33 0.95 0.84  23  34 0.37 0.16  13   7   6    9 116  5
## Villarreal  33  28 0.84 0.68  25  29 0.26 0.16  12   6   5   11 102  5
## Valencia    47  21 1.00 0.68  26  28 0.26 0.26  12   6   5    8 120  6

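The inner, or structural, model represents the relationships between the latent variables. We can think of it as a flowchart representing a causal process. We can also think of it as a network, and a network can be written in matrix format. The path matrix is square, with one row and one column per latent variable, and lower triangular: a 1 in row i, column j means that the latent variable in column j affects the latent variable in row i. Here Attack and Defense both point to Success: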

Attack <-  c(0, 0, 0)
Defense <- c(0, 0, 0)
Success <- c(1, 1, 0)

foot_path <- rbind(Attack, Defense, Success)
colnames(foot_path) <- rownames(foot_path)

foot_path

##         Attack Defense Success
## Attack       0       0       0
## Defense      0       0       0
## Success      1       1       0
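
The matrix convention can be checked with a few lines of base R. The sketch below rebuilds foot_path, verifies that nothing sits on or above the diagonal (assuming, as the plspm documentation describes, that the path matrix must be lower triangular), and lists each path as "source -> target". The helper names is_lower_tri and paths are illustrative only and are not part of plspm.

```r
# Rebuild the path matrix from the tutorial (base R only)
Attack  <- c(0, 0, 0)
Defense <- c(0, 0, 0)
Success <- c(1, 1, 0)
foot_path <- rbind(Attack, Defense, Success)
colnames(foot_path) <- rownames(foot_path)

# Assumed requirement: no entries on or above the diagonal
is_lower_tri <- all(foot_path[upper.tri(foot_path, diag = TRUE)] == 0)

# A 1 in row i, column j reads as "column j affects row i"
idx <- which(foot_path == 1, arr.ind = TRUE)
paths <- paste(colnames(foot_path)[idx[, "col"]], "->",
               rownames(foot_path)[idx[, "row"]])

print(is_lower_tri)
print(paths)
```

Reading the paths back as text is a cheap way to catch a transposed matrix before running the model.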

# graph structural model
innerplot(foot_path)

# define the blocks: column indices of the indicators of each latent variable
foot_blocks <- list(1:4, 5:8, 9:12)

# vector of modes (reflective)
foot_modes <- c("A", "A", "A")

# run plspm analysis
foot_pls <- plspm(Data = spainfoot, path_matrix = foot_path, blocks = foot_blocks, modes = foot_modes)

foot_pls

## Partial Least Squares Path Modeling (PLS-PM) 
## ---------------------------------------------
##    NAME             DESCRIPTION
## 1  $outer_model     outer model
## 2  $inner_model     inner model
## 3  $path_coefs      path coefficients matrix
## 4  $scores          latent variable scores
## 5  $crossloadings   cross-loadings
## 6  $inner_summary   summary inner model
## 7  $effects         total effects
## 8  $unidim          unidimensionality
## 9  $gof             goodness-of-fit
## 10 $boot            bootstrap results
## 11 $data            data matrix
## ---------------------------------------------
## You can also use the function 'summary'

Interpretation

Unidimensionality

Unidimensionality means that the indicators in a block measure the same thing and point in the same direction. In PLS-PM, there are three main indices for checking the unidimensionality of a reflective block:

Calculate the Cronbach’s alpha
Calculate the Dillon-Goldstein’s rho
Check the first eigenvalue of the indicators’ correlation matrix

foot_pls$unidim

##         Mode MVs   C.alpha     DG.rho  eig.1st   eig.2nd
## Attack     A   4 0.8905919 0.92456079 3.017160 0.7923055
## Defense    A   4 0.0000000 0.02601677 2.393442 1.1752781
## Success    A   4 0.9165491 0.94232868 3.217294 0.5370492

The Attack block has an alpha of 0.89, Defense has an alpha of 0.00, and Success has an alpha of 0.92. As a rule of thumb, a Cronbach’s alpha greater than 0.7 is considered acceptable. By this rule Attack and Success are acceptable blocks, but Defense is not. This is a warning sign that something is potentially wrong with the manifest variables of Defense.
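
One way a block can end up with a very low alpha is when some of its indicators point in the opposite direction to the others, which may be what is happening in Defense (goals conceded versus matches without conceded goals). The following base R sketch illustrates this on a small made-up data set. The function cronbach_alpha and the data frame toy are hypothetical, and the raw-score formula used here may differ in detail from how plspm computes C.alpha.

```r
# Cronbach's alpha from raw scores:
# alpha = p / (p - 1) * (1 - sum of item variances / variance of the row sums)
cronbach_alpha <- function(x) {
  x <- as.matrix(x)
  p <- ncol(x)
  item_var  <- sum(apply(x, 2, var))
  total_var <- var(rowSums(x))
  (p / (p - 1)) * (1 - item_var / total_var)
}

# Made-up data: a and b rise together, c and d fall as a and b rise
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(2, 1, 4, 3, 6, 5),
  c = c(6, 4, 5, 3, 1, 2),
  d = c(5, 6, 3, 4, 2, 1)
)

alpha_mixed <- cronbach_alpha(toy)

# Reverse the two indicators that point the other way
toy_aligned <- transform(toy, c = -c, d = -d)
alpha_aligned <- cronbach_alpha(toy_aligned)

print(c(mixed = alpha_mixed, aligned = alpha_aligned))
```

With the indicators in mixed directions alpha is negative, and once they are aligned it rises well above 0.7.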

Another metric used to assess the unidimensionality of a reflective block is the Dillon-Goldstein’s rho, which focuses on the variance of the sum of variables in the block of interest. As a rule of thumb, a block is considered unidimensional when the Dillon-Goldstein’s rho is larger than 0.7. Here Attack (0.92) and Success (0.94) pass, while Defense (0.03) does not.

The third metric involves an eigen-analysis of the correlation matrix of each set of indicators, and rests on the importance of the first eigenvalue. If a block is unidimensional, the first eigenvalue should be much larger than 1, whereas the second eigenvalue should be smaller than 1. Attack and Success meet this rule; Defense does not, because its second eigenvalue is 1.18.
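
The three rules of thumb can be applied together. This sketch copies the values from the unidim output above into a data frame, so it runs without plspm; the object names unidim and checks are illustrative.

```r
# Values copied from the foot_pls$unidim output shown in this tutorial
unidim <- data.frame(
  C.alpha = c(0.8905919, 0.0000000, 0.9165491),
  DG.rho  = c(0.92456079, 0.02601677, 0.94232868),
  eig.1st = c(3.017160, 2.393442, 3.217294),
  eig.2nd = c(0.7923055, 1.1752781, 0.5370492),
  row.names = c("Attack", "Defense", "Success")
)

# Apply each rule of thumb to every block
checks <- data.frame(
  alpha_ok = unidim$C.alpha > 0.7,
  rho_ok   = unidim$DG.rho > 0.7,
  eigen_ok = unidim$eig.1st > 1 & unidim$eig.2nd < 1,
  row.names = row.names(unidim)
)
checks$unidimensional <- checks$alpha_ok & checks$rho_ok & checks$eigen_ok

print(checks)
```

Defense fails all three checks, so the problem does not depend on which index is used.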

plot(foot_pls, what = "loadings")

Loadings and Communalities

Loadings are the correlations between a latent variable and its indicators. Communalities are the squared loadings: they measure the part of the variance that a latent variable and its indicator have in common.

foot_pls$outer_model

##    name   block     weight    loading communality redundancy
## 1   GSH  Attack  0.3366220  0.9379455   0.8797418  0.0000000
## 2   GSA  Attack  0.2819492  0.8620997   0.7432159  0.0000000
## 3   SSH  Attack  0.2892480  0.8408295   0.7069942  0.0000000
## 4   SSA  Attack  0.2396082  0.8263084   0.6827856  0.0000000
## 5   GCH Defense -0.1087380  0.4836561   0.2339232  0.0000000
## 6   GCA Defense -0.3914625  0.8759007   0.7672021  0.0000000
## 7   CSH Defense  0.3273741 -0.7463736   0.5570735  0.0000000
## 8   CSA Defense  0.4035138 -0.8926150   0.7967615  0.0000000
## 9   WMH Success  0.2308955  0.7755070   0.6014111  0.5145455
## 10  WMA Success  0.3029553  0.8863662   0.7856450  0.6721694
## 11  LWR Success  0.2821408  0.9686187   0.9382222  0.8027089
## 12 LRWL Success  0.2957718  0.9437099   0.8905884  0.7619551
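
The relation between the two columns can be verified from the printed values. This sketch squares the loadings copied from the table above and compares them with the reported communalities, then averages the communalities per block. It uses base R only, and the object names are illustrative.

```r
# Loadings and communalities copied from the foot_pls$outer_model output above
outer <- data.frame(
  name  = c("GSH", "GSA", "SSH", "SSA", "GCH", "GCA",
            "CSH", "CSA", "WMH", "WMA", "LWR", "LRWL"),
  block = rep(c("Attack", "Defense", "Success"), each = 4),
  loading = c(0.9379455, 0.8620997, 0.8408295, 0.8263084,
              0.4836561, 0.8759007, -0.7463736, -0.8926150,
              0.7755070, 0.8863662, 0.9686187, 0.9437099),
  communality = c(0.8797418, 0.7432159, 0.7069942, 0.6827856,
                  0.2339232, 0.7672021, 0.5570735, 0.7967615,
                  0.6014111, 0.7856450, 0.9382222, 0.8905884),
  stringsAsFactors = FALSE
)

# Communality is the squared loading (up to rounding of the printed values)
max_diff <- max(abs(outer$loading^2 - outer$communality))

# Average communality per block
block_communality <- tapply(outer$communality, outer$block, mean)

print(max_diff)
print(round(block_communality, 7))
```

The per-block averages agree with the Block_Communality column that plspm reports in its inner summary.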

Crossloadings

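Besides checking the loadings of the indicators with their own latent variable, we must also check the so-called cross-loadings, that is, the loadings of an indicator with the other latent variables. An indicator should load more strongly (in absolute value) on its own latent variable than on any other.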

foot_pls$crossloadings

##    name   block     Attack    Defense    Success
## 1   GSH  Attack  0.9379455 -0.5159446  0.8977256
## 2   GSA  Attack  0.8620997 -0.3390746  0.7519204
## 3   SSH  Attack  0.8408295 -0.4139277  0.7713854
## 4   SSA  Attack  0.8263084 -0.3361551  0.6390025
## 5   GCH Defense -0.1305171  0.4836561 -0.1597543
## 6   GCA Defense -0.4621560  0.8759007 -0.5751232
## 7   CSH Defense  0.3188076 -0.7463736  0.4809671
## 8   CSA Defense  0.4214853 -0.8926150  0.5928287
## 9   WMH Success  0.7085826 -0.4226144  0.7755070
## 10  WMA Success  0.7730524 -0.7114747  0.8863662
## 11  LWR Success  0.8444012 -0.5380149  0.9686187
## 12 LRWL Success  0.8600572 -0.5891724  0.9437099
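
Scanning the table by eye is error-prone, so the comparison can be automated. The sketch below copies the cross-loadings printed above and, for each indicator, finds the latent variable with the largest absolute loading. Comparing absolute values is an assumption made here because the Defense loadings have mixed signs; the object names are illustrative.

```r
# Cross-loadings copied from the foot_pls$crossloadings output above
xl <- data.frame(
  name  = c("GSH", "GSA", "SSH", "SSA", "GCH", "GCA",
            "CSH", "CSA", "WMH", "WMA", "LWR", "LRWL"),
  block = rep(c("Attack", "Defense", "Success"), each = 4),
  Attack  = c(0.9379455, 0.8620997, 0.8408295, 0.8263084,
              -0.1305171, -0.4621560, 0.3188076, 0.4214853,
              0.7085826, 0.7730524, 0.8444012, 0.8600572),
  Defense = c(-0.5159446, -0.3390746, -0.4139277, -0.3361551,
              0.4836561, 0.8759007, -0.7463736, -0.8926150,
              -0.4226144, -0.7114747, -0.5380149, -0.5891724),
  Success = c(0.8977256, 0.7519204, 0.7713854, 0.6390025,
              -0.1597543, -0.5751232, 0.4809671, 0.5928287,
              0.7755070, 0.8863662, 0.9686187, 0.9437099),
  stringsAsFactors = FALSE
)

lv_cols <- c("Attack", "Defense", "Success")

# Latent variable with the strongest absolute loading for each indicator
abs_load  <- abs(as.matrix(xl[, lv_cols]))
strongest <- lv_cols[max.col(abs_load, ties.method = "first")]

# Indicators that load more strongly on another block than on their own
misplaced <- xl$name[strongest != xl$block]
print(misplaced)
```

An empty result means every indicator loads most strongly on its own block, although some gaps are narrow (GSH loads 0.94 on Attack and 0.90 on Success).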

Assessing Structural Model

After assessing the quality of the measurement model, the next stage is to assess the structural part. To inspect the results of each regression in the structural equations we need to display the results contained in $inner_model. Besides the results of the regression equations, the quality of the structural model is evaluated by examining three indices or quality metrics:

the coefficient of determination ($R^2$)
the redundancy index
the Goodness-of-Fit (GoF)

Coefficient of Determination

foot_pls$inner_model

## $Success
##                Estimate Std. Error       t value     Pr(>|t|)
## Intercept -1.997769e-16 0.09217513 -2.167362e-15 1.000000e+00
## Attack     7.572610e-01 0.10439994  7.253462e+00 1.348869e-06
## Defense   -2.836068e-01 0.10439994 -2.716542e+00 1.465963e-02

For each regression in the structural model we have an $R^2$ that is interpreted as in any multiple regression analysis: it indicates the amount of variance in the endogenous latent variable explained by its independent latent variables. plspm reports it in the R2 column of the inner summary. As a rough guide:

Low: $R^2$ < 0.30 (although some authors consider $R^2$ < 0.20)
Moderate: 0.30 < $R^2$ < 0.60 (you can also find 0.20 < $R^2$ < 0.50)
High: $R^2$ > 0.60 (alternatively there’s also $R^2$ > 0.50)

Redundancy

foot_pls$inner_summary

##               Type        R2 Block_Communality Mean_Redundancy       AVE
## Attack   Exogenous 0.0000000         0.7531844       0.0000000 0.7531844
## Defense  Exogenous 0.0000000         0.5887401       0.0000000 0.5887401
## Success Endogenous 0.8555637         0.8039667       0.6878447 0.8039667

Redundancy measures the percentage of the variance of the indicators in an endogenous block that is predicted from the independent latent variables associated with that endogenous latent variable. Another definition of redundancy is the amount of variance in an endogenous construct explained by its independent latent variables. For indicator j of endogenous block k:

\[Rd(LV_k, mv_{jk}) = loading^2_{jk} R^2_k\]
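
The redundancy values reported for the Success block can be reproduced by multiplying each communality (squared loading) by the $R^2$ of Success. The sketch uses the values printed in the outputs above; the object names are illustrative.

```r
# Communalities of the Success indicators and the R2 of Success,
# copied from the outer_model and inner_summary outputs in this tutorial
communality <- c(WMH = 0.6014111, WMA = 0.7856450,
                 LWR = 0.9382222, LRWL = 0.8905884)
r2_success <- 0.8555637

# Redundancy of each indicator: squared loading times R2 of its block
redundancy <- communality * r2_success

# Mean redundancy of the block
mean_redundancy <- mean(redundancy)

print(round(redundancy, 7))
print(round(mean_redundancy, 7))
```

Attack and Defense are exogenous, so their $R^2$ is 0 and every redundancy in those blocks is 0.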

Goodness-of-Fit

The GoF index is a pseudo goodness-of-fit measure that accounts for model quality in both the measurement and the structural models. GoF is calculated as the geometric mean of the average communality and the average $R^2$ value.

foot_pls$gof

## [1] 0.7822929
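
The reported GoF can be reproduced from the inner summary. This sketch assumes that the average communality is taken over the three blocks and the average $R^2$ over the endogenous latent variables only (here just Success). Because every block has four indicators, a plain mean of the block communalities equals a mean weighted by block size.

```r
# Values copied from the foot_pls$inner_summary output in this tutorial
block_communality <- c(Attack = 0.7531844, Defense = 0.5887401, Success = 0.8039667)
r2_endogenous <- c(Success = 0.8555637)

# Geometric mean of the average communality and the average R2
gof <- sqrt(mean(block_communality) * mean(r2_endogenous))

print(round(gof, 7))
```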