Quantitatively assessing similarity among CLDs

Example CLDs from real data.

Consider four examples. Each of these is a subset of one of the causal loop diagrams (CLDs) constructed by group model builders in RYK. For the purposes of a tractably small example, I’ve extracted just the nodes child health, child attendance, child focus on education, child interest in education, child inclusion in learning and child results, along with the edges that connect these nodes.

It’s easy to see that there are some similarities between these, as well as some differences. Our goal here is to work with a large number of CLDs in order to see which elements are widely agreed upon across CLDs, and to see which differences among the model builders (geographic, economic, etc.) systematically drive differences in the CLDs.

Here, to assess this quantitatively, we will borrow commonly used techniques from ecology to calculate a ‘distance’ measure between any two CLDs. We may use the resulting pairwise distances to:

  • plot multiple CLDs in a two-dimensional ‘ordination space’ with NMDS or a similar dimensional reduction technique, in which more similar models are close together.
  • assess what characteristics of CLDs (geography, asset index, etc.) correlate with distance in the ordination space.
  • look for clusters of CLDs in the ordination space (indicating CLDs that are more similar to each other than to others).
  • within a cluster (or another group of CLDs), assess which substructures of the CLDs are most widely shared.
  • within a cluster (or another group of CLDs), build a centroid or “average” CLD, representative of the common features of the group of CLDs.

These methods begin with constructing a presence-absence matrix or a weighted matrix, in which the things being compared are in rows, and the characteristics used for measurement are in the columns. In ecology, the rows might be plots in which species were sampled, and the columns species names. Cells would hold an indication of whether that species is present or absent in a plot.


Information in a CLD may be decomposed into a list of edges

In this case, the first step is to decompose the CLDs into their characteristics. One strategy for this decomposition is to make an ‘edgelist’, where each edge consists of a ‘cause’ node, an ‘effect’ node, and the polarity of the connection between them. For instance, Child health increases child attendance is an edge found in three of our four examples above.

Complete edgelists for each of the examples above look like this.

You’ll see that Child health increases child attendance is represented in the edgelist of each model that contains it, as Edge 6 in GGPS Bahudi Pur Machian, Edge 3 in GGPS 74 NP and Edge 3 in GGCMS Qaiser Chohan.


The presence of an edge in each CLD may be expressed in a presence-absence matrix

If we treat each of the edges as we would a species in ecological analysis, and each of the schools as a plot, the resulting presence-absence matrix looks like this, with 1 indicating presence and 0 indicating absence.

Columns (edges):
  1  C attendance increases C focus ed.
  2  C attendance increases C inclusion L
  3  C focus ed. increases C inclusion L
  4  C focus ed. increases C results
  5  C health increases C attendance
  6  C inclusion L increases C attendance
  7  C inclusion L increases C results
  8  C interest ed. increases C attendance
  9  C results increases C inclusion L

                         1 2 3 4 5 6 7 8 9
GGCMS QAISER CHOHAN      0 1 0 0 1 0 0 1 0
GGPS 74 NP               1 0 1 0 1 1 0 1 0
GGPS BAHUDI PUR MACHIAN  1 0 1 1 1 0 1 0 1
GPS Chack 76/NP          1 0 1 1 0 1 0 0 0

You can now see the edge Child health increases child attendance in column 5, where each CLD (row) that contains that edge is marked with a 1.
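The step from edgelists to a presence-absence matrix can be sketched as follows. This is only an illustration in Python, not the code used to produce the matrix above. The edge sets are transcribed from that matrix, and the variable and function names (clds, presence_absence, shared) are my own placeholders.

```python
# Short node labels, as used in the matrix column names above.
ATT, FOC, INC = "C attendance", "C focus ed.", "C inclusion L"
RES, HEA, INT = "C results", "C health", "C interest ed."

# Each CLD is a set of (cause, polarity, effect) edges; +1 means "increases".
clds = {
    "GGCMS QAISER CHOHAN": {(ATT, 1, INC), (HEA, 1, ATT), (INT, 1, ATT)},
    "GGPS 74 NP": {(ATT, 1, FOC), (FOC, 1, INC), (HEA, 1, ATT),
                   (INC, 1, ATT), (INT, 1, ATT)},
    "GGPS BAHUDI PUR MACHIAN": {(ATT, 1, FOC), (FOC, 1, INC), (FOC, 1, RES),
                                (HEA, 1, ATT), (INC, 1, RES), (RES, 1, INC)},
    "GPS Chack 76/NP": {(ATT, 1, FOC), (FOC, 1, INC), (FOC, 1, RES),
                        (INC, 1, ATT)},
}


def presence_absence(clds):
    """Return (sorted edge columns, {CLD name: list of 0/1, one per column})."""
    columns = sorted(set().union(*clds.values()))
    matrix = {name: [int(edge in edges) for edge in columns]
              for name, edges in clds.items()}
    return columns, matrix


columns, matrix = presence_absence(clds)
for name, row in matrix.items():
    print(f"{name:<24}", *row)

# Column sums: the number of CLDs in which each edge appears.
shared = {edge: sum(row[i] for row in matrix.values())
          for i, edge in enumerate(columns)}
```

The column sums in shared give a first rough look at which edges are most widely shared among a group of CLDs.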


Distances (dissimilarity) among CLDs may be calculated from a presence-absence matrix

We may turn this presence-absence matrix into distances with a number of different distance formulae. Here, we’ll use Euclidean distance, a very simple distance metric defined by d(j, k) = sqrt(sum_i((x[ij] - x[ik])^2)), where x[ij] and x[ik] are the values for edge (column) i in CLDs (rows) j and k. The symmetric distance matrix below gives the pairwise Euclidean distances among CLDs (in both columns and rows).

                         GGCMS QAISER CHOHAN  GGPS 74 NP  GGPS BAHUDI PUR MACHIAN  GPS Chack 76/NP
GGCMS QAISER CHOHAN                 0.000000    2.000000                 2.645751         2.645751
GGPS 74 NP                          2.000000    0.000000                 2.236068         1.732051
GGPS BAHUDI PUR MACHIAN             2.645751    2.236068                 0.000000         2.000000
GPS Chack 76/NP                     2.645751    1.732051                 2.000000         0.000000

Here you’ll see that the distance between GPS Chack 76/NP and GGPS 74 NP is the smallest (about 1.7), because these two CLDs differ in only three of the nine possible edges (Child focus on education increases child results, Child health increases child attendance and Child interest in education increases child attendance).
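The distance calculation itself is short enough to check by hand. Here is a minimal sketch in plain Python (not the R call that produced the table above), using the four rows of the presence-absence matrix. The names rows, euclidean, distances and closest are placeholders of mine.

```python
from itertools import combinations
from math import sqrt

# Rows of the presence-absence matrix (one 0/1 value per edge column).
rows = {
    "GGCMS QAISER CHOHAN":     [0, 1, 0, 0, 1, 0, 0, 1, 0],
    "GGPS 74 NP":              [1, 0, 1, 0, 1, 1, 0, 1, 0],
    "GGPS BAHUDI PUR MACHIAN": [1, 0, 1, 1, 1, 0, 1, 0, 1],
    "GPS Chack 76/NP":         [1, 0, 1, 1, 0, 1, 0, 0, 0],
}


def euclidean(a, b):
    """Square root of the summed squared differences between two rows."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# One distance per unordered pair of CLDs.
distances = {(j, k): euclidean(rows[j], rows[k])
             for j, k in combinations(rows, 2)}

for (j, k), d in distances.items():
    print(f"{j} vs {k}: {d:.6f}")

# The pair of CLDs with the smallest distance.
closest = min(distances, key=distances.get)
```

Because every cell is 0 or 1, the squared distance between two CLDs is simply the number of edges on which they disagree, so a distance of about 1.73 corresponds to three differing edges.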


Distances may be visualized through ordination

Then, if we want to visualize this pairwise distance matrix, we may do so with nonmetric multidimensional scaling (NMDS). Here we use the version implemented in the vegan function metaMDS, which tries several different random starts of the NMDS to find the dimensional reduction with the least stress relative to the original multidimensional distance matrix. In this case, our stress is essentially zero, as the model is very simple.

Distances between CLD points in the ordination space below reflect the rank order of their Euclidean distances, based on the columns (edges) we provided. You’ll see that the distance between GPS Chack 76/NP and GGPS 74 NP is the smallest, and that the other distances are larger.


Linked edges make causal chains, which may be used as columns in the presence-absence matrix

However, these distances are only as good as the presence-absence matrix we provide. Consider the edge Child attendance increases child inclusion in learning. Currently, this edge is considered “present” only in GGCMS Qaiser Chohan.

Columns (edges):
  1  C attendance increases C focus ed.
  2  C attendance increases C inclusion L
  3  C focus ed. increases C inclusion L

                         1 2 3
GGCMS QAISER CHOHAN      0 1 0
GGPS 74 NP               1 0 1
GGPS BAHUDI PUR MACHIAN  1 0 1
GPS Chack 76/NP          1 0 1

However, in each of the other three, we have both the edges Child attendance increases child focus on education and Child focus on education increases child inclusion in learning. We may consider that in a CLD, these two edges entail the causal chain Child attendance increases child inclusion in learning (via child focus on education).

Put in terms of a matrix, we would need to add new edges making the inferred link explicit.

We’ll now use numerical codes for our nodes, or things will get very crowded. We will code “increases” as 1, “decreases” as -1, and nodes with numerical codes as follows:

node                          code
Child focus on education         1
Child interest in education      2
Child inclusion in learning      3
Child results                   26
Child health                    29
Child attendance                30

When we expand the edgelist to include causal chains, the resulting edgelists are:

And then, adding these inferred links of a cause and its ultimate effect to our presence-absence matrix, the result is:

1,-1,1 1,-1,2 1,-1,26 1,-1,3 1,-1,30 1,1,1 1,1,2 1,1,26 1,1,3 1,1,30 2,-1,2 2,-1,26 2,-1,30 2,1,1 2,1,2 2,1,26 2,1,3 2,1,30 26,-1,26 26,-1,30 26,1,1 26,1,2 26,1,26 26,1,3 26,1,30 29,1,1 29,1,2 29,1,26 29,1,3 29,1,30 3,-1,1 3,-1,2 3,-1,26 3,-1,3 3,-1,30 3,1,1 3,1,2 3,1,26 3,1,3 3,1,30 30,-1,1 30,-1,2 30,-1,3 30,-1,30 30,1,1 30,1,2 30,1,26 30,1,3 30,1,30
GGCMS QAISER CHOHAN 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1 1 1 1 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1
GGPS 74 NP 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0
GGPS BAHUDI PUR MACHIAN 0 0 0 0 0 1 1 1 1 1 0 0 0 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 1 1 1 1 1
GPS Chack 76/NP 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 1 0

And the pairwise distances resulting from these presence-absence matrices based on ultimate effects are slightly different from our original analysis, which only included direct links.

In the future, we can run each of these methods, to see how our choice of method affects our conclusions.
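One way to make inferred links explicit is to follow every causal chain and record an edge from its first cause to its final effect. The sketch below is my own illustration of that idea in Python. It assumes that the polarity of a chain is the product of the polarities along it (a common CLD convention, but an assumption here), and it will not necessarily reproduce the expanded matrix above, which may have been built with a different rule or from more of each CLD. The function name expand_chains is a placeholder.

```python
def expand_chains(edges):
    """Return the direct edges plus one inferred edge per causal chain.

    Edges are (cause, polarity, effect) tuples with polarity +1 or -1.
    Assumption: a chain's polarity is the product of its edge polarities.
    """
    successors = {}
    for cause, polarity, effect in edges:
        successors.setdefault(cause, []).append((polarity, effect))

    expanded = set()
    for start in successors:
        # Walk over (node, accumulated polarity) states reachable from start.
        stack = [(start, 1)]
        seen = set()
        while stack:
            node, sign = stack.pop()
            for polarity, effect in successors.get(node, []):
                state = (effect, sign * polarity)
                if state not in seen:
                    seen.add(state)
                    stack.append(state)
        expanded.update((start, sign, effect) for effect, sign in seen)
    return expanded


# Node codes from the table above: 30 = attendance, 1 = focus, 3 = inclusion.
direct = {(30, 1, 1), (1, 1, 3)}
expanded = expand_chains(direct)
print(sorted(expanded))
```

For the two direct edges in this example, the expansion adds the inferred edge (30, 1, 3), that is, Child attendance increases child inclusion in learning. In a CLD with feedback loops the same walk also yields edges from a node back to itself.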


Expanding the dataset with ALL nodes and more CLDs.

Here we use complete edgelists (all nodes) for all RYK and Ghazni CLDs.


Identifying factors that best explain this arrangement

This function tests a number of environmental variables against the ordination to see how well they fit. As an initial pass, I’ve run it for each school’s “DISTRICT” and the average summed child responses to sections B, C, D, E, and F. You can see that, for this data, there is a significant difference by district (r-squared 0.308, p = 0.018 in the output below). The D and E variables look like they could possibly be significant once we include more data (that is, they have an OK r-squared, but p > 0.05).

You can see in the cluster plot that, for the most part, each district occupies a different space in the ordination - this is why we get a reasonably high r-squared for this variable.

## [1] 23  7
## 
## ***VECTORS
## 
##      NMDS1    NMDS2     r2 Pr(>r)  
## B  0.22978  0.97324 0.1869  0.106  
## C  0.39708  0.91779 0.1276  0.217  
## D  0.13753  0.99050 0.2274  0.072 .
## E -0.34184 -0.93976 0.2150  0.079 .
## F  0.60921  0.79301 0.1016  0.325  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
## 
## ***FACTORS:
## 
## Centroids:
##                                                          NMDS1   NMDS2
## DISTRICT_NAMEGhazni                                    -2.2696 -1.0259
## DISTRICT_NAMEMoqur                                      0.7155 -1.3023
## DISTRICT_NAMEQarabagh                                  -0.3490 -1.2626
## DISTRICT_NAMERahim Yar Khan                             0.3567  0.8625
## SCHOOL_NAMEAbo Ali Sina                                -0.4461 -2.3216
## SCHOOL_NAMEBahawodin                                    0.9377 -1.4399
## SCHOOL_NAMECMS Chah Khaji Wala                         -0.8269  3.2084
## SCHOOL_NAMEDelaram                                      0.3853 -0.3175
## SCHOOL_NAMEGBPS 108/P                                  -2.1488  0.7691
## SCHOOL_NAMEGBPS Allah Dina (JAM ABDULLAH)               0.2617  1.0931
## SCHOOL_NAMEGBPS CHAK 222 P                             -0.7438 -0.1573
## SCHOOL_NAMEGGCMS QAISER CHOHAN                         -0.5959  0.8781
## SCHOOL_NAMEGGPS 225P                                    0.0994  0.2378
## SCHOOL_NAMEGGPS 74 NP                                   0.5152  1.7916
## SCHOOL_NAMEGGPS 91/P (Basti Kot Doctor )                0.9701  0.6491
## SCHOOL_NAMEGGPS BAHUDI PUR MACHIAN                      1.5545  2.9100
## SCHOOL_NAMEGGPS Manzoor Khan Gola (NFS Gulshan Arrain)  3.8526  0.3357
## SCHOOL_NAMEGPS 48/P                                    -1.7055 -1.0137
## SCHOOL_NAMEGPS Chack 76/NP                              2.7420  1.4926
## SCHOOL_NAMEGPS Chak 82/NP                              -0.4966  0.3390
## SCHOOL_NAMEKhalid Bin Walid                             0.9252 -1.8589
## SCHOOL_NAMEMirza Khil                                   0.2895 -1.1148
## SCHOOL_NAMENawabad                                     -0.4902 -1.0846
## SCHOOL_NAMENFS Sawan Awan                               1.5155 -0.4586
## SCHOOL_NAMEnone_found                                  -6.8025  1.0196
## SCHOOL_NAMEShahid Faiz Mohammd                         -0.1106 -0.3817
## SCHOOL_NAMEShahr E Kohna Girl High School              -0.3915 -3.7799
## SCHOOL_NAMEZarkashan                                    0.7096 -0.7956
## 
## Goodness of fit:
##                   r2 Pr(>r)  
## DISTRICT_NAME 0.3081  0.018 *
## SCHOOL_NAME   1.0000  1.000  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999
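
To make the factor goodness-of-fit numbers above a little less opaque, here is a rough sketch of the kind of statistic involved. It assumes that r-squared for a factor is one minus the within-group sum of squares divided by the total sum of squares of the ordination scores, and that the p-value comes from shuffling the group labels. This is how I understand vegan's envfit to treat factors, but the code below is a plain-Python illustration and not vegan's implementation. The coordinates and labels are made-up toy values, not the NMDS scores from this analysis.

```python
import random


def factor_r2(points, labels):
    """r2 = 1 - SS_within / SS_total for 2-D scores grouped by label."""
    def centroid(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

    def sum_squares(pts, center):
        return sum((p[0] - center[0]) ** 2 + (p[1] - center[1]) ** 2
                   for p in pts)

    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    ss_total = sum_squares(points, centroid(points))
    ss_within = sum(sum_squares(pts, centroid(pts)) for pts in groups.values())
    return 1 - ss_within / ss_total


def permutation_p(points, labels, n_perm=999, seed=0):
    """Share of label shuffles whose r2 is at least the observed r2."""
    rng = random.Random(seed)
    observed = factor_r2(points, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Small tolerance so that ties are not lost to rounding error.
        if factor_r2(points, shuffled) >= observed - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy scores: two well-separated groups of three points.
toy_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
r2_grouped, p_grouped = permutation_p(toy_points, ["a", "a", "a", "b", "b", "b"])

# If every point has its own label, the within-group sum of squares is zero.
r2_unique, p_unique = permutation_p(toy_points, ["s1", "s2", "s3", "s4", "s5", "s6"])
print(r2_grouped, p_grouped, r2_unique, p_unique)
```

The second call is worth a note: when every row has its own label, r-squared is 1 for the observed labels and for every shuffle, so the permutation p-value is also 1. That would be consistent with the SCHOOL_NAME line above if each CLD comes from a different school, in which case that line carries no information.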