note, 2/5: updated aggregate graph at bottom of page

0: Load the packages we’ll need:

require(dplyr)
require(vegan)
require(reshape2)
require(ggplot2)
require(stringr)

1: Developing a distance metric among models (CLDs).

Our goal, in order to work with a large number of models, is to calculate the similarity (or difference, aka “distance”) between two models. Once we can do this, we may calculate a matrix of all pairwise distances in whatever universe of models we are considering, and use this distance matrix to visualize the models in an ordination space and to build an aggregate (centroid) model.

Calculating distance:

Load six very simple test models that we’ll use to make sure our distance calculation is working the way we’d want it to:

test_models <- read.csv("~/Dropbox/p/afghanistan/test_models_1.csv",stringsAsFactors = F)
test_models
##   cause polarity effect model
## 1     A        1      B     1
## 2     A        1      B     2
## 3     B        1      C     2
## 4     A        1      B     3
## 5     D        1      E     3
## 6     A        1      B     4
## 7     B       -1      C     4

You’ll see above that:

  • in model 1, A increases B;
  • in model 2, A increases B and B increases C;
  • in model 3, A increases B and D increases E; and
  • in model 4 A increases B and B decreases C.

From this, it seems to me that we want a distance metric where:

  • Model 1 is the same distance from models 2, 3, and 4 (because it shares with each of them its only link, and each of them also has one additional link). Let’s call this distance 1.
  • The distance between models 2 and 3 should be higher than distance 1, because there’s less shared out of the total information content. The distance between models 3 and 4 should be the same as that between 2 and 3.
  • The distance between models 2 and 4 should be even greater, because there is one outright disagreement.

Or, to sum up our desired output distances: d(1,2) = d(1,3) = d(1,4) < d(2,3) = d(3,4) < d(2,4).

In order to compare these links, we’ll construct a matrix in which models are rows, possible causal links are columns, and polarity is the cell content (using zero where a connection is not present).

test_models <- test_models %>% mutate(chain=paste0(cause,effect)) #create a new variable called chain which combines cause and effect

matrix <- dcast(test_models,model~chain,value.var="polarity")[,-1] #use dcast to transform the dataframe into a matrix, and drop the first column (names)

matrix[is.na(matrix)] <- 0 #fill NA values in the matrix with zero

matrix
##   AB BC DE
## 1  1  0  0
## 2  1  1  0
## 3  1  0  1
## 4  1 -1  0

And then we’ll use Euclidean distance, a very simple distance metric defined as d(j,k) = sqrt(sum_i (x[ij] - x[ik])^2), where x[ij] and x[ik] are the values for link i (columns) in models j and k (rows). The triangular distance matrix below gives the pairwise Euclidean distances among models.
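
As a quick sanity check on this formula (just an aside, using base R rather than vegan), we can compute one of these distances by hand:

#models 2 and 4 agree on AB (1 vs 1), disagree on BC (1 vs -1), and both lack DE (0 vs 0):
sqrt(sum((unlist(matrix[2,])-unlist(matrix[4,]))^2)) #sqrt(0^2 + 2^2 + 0^2) = 2, matching the 2-4 entry below
dist(matrix,method="euclidean") #base-R equivalent of the vegdist call below, for comparison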

distances <- vegdist(matrix,method="euclidian") #use vegan::vegdist to make a euclidian distance matrix
distances
##          1        2        3
## 2 1.000000                  
## 3 1.000000 1.414214         
## 4 1.000000 2.000000 1.414214

You’ll see that this distance metric has given us distances that match the desired output: d(1,2) = d(1,3) = d(1,4) < d(2,3) = d(3,4) < d(2,4).

If we want to visualize this pairwise distance matrix, we may do it with nonmetric multidimensional scaling (NMDS). This is the version implemented in the vegan function metaMDS, which tries several different random starts of the NMDS and keeps the dimensional reduction with the least stress relative to the original multidimensional distance matrix. (In this very simple case, our stress is 0 because these distances can be represented exactly in two dimensions.)

In the ordination below, each model is indicated with its number. Distances between the numbers in the ordination space are proportional to their Euclidean distances.

mds <- metaMDS(matrix,distance="euclidian",trace=0) #this makes the NMDS object
ordiplot(mds,type="t") # vegan::ordiplot plots it
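
As an aside, the stress value mentioned above is stored on the metaMDS object, so we can confirm it numerically rather than reading it off the plot:

mds$stress #effectively zero for this toy example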

With real data:

require(readxl)

real <- read_excel("~/Dropbox/p/afghanistan/RYK Children Final Edgelist[1].xlsx")
real1 <- real #%>% filter(V7=="./RYK- CMS 13NP- Children-revised-JFT-R-AB-BP-JFT-MK-JFT-MK.mdl"|V7=="./RYK-CMS Chah Khajiwala-Children-revised-JFT-AB-JFT-R-MK.mdl")

#do coding across all models
modellist <- data.frame(model=unique(real1$V7),modelcode=as.numeric(as.factor(unique(real1$V7))))

real1 <- data.frame("cause"=real1$V3, "polarity"=real1$V4,"effect"=real1$V6,"model"=real1$V7)

nodelist <- unique(c(real1$cause,real1$effect))
codelist <- data.frame("node"=nodelist,"code"=1:length(nodelist))

real_data <- real1 %>% 
  filter(polarity!="No Polarity") %>% #at the moment this is just a 64-42 link that also is marked as increasing
  merge(codelist,by.x="cause",by.y="node",all.x=T) %>%  
  merge(codelist,by.x="effect",by.y="node",all.x=T) %>% 
  merge(modellist,by="model",all.x=T) %>% 
  mutate(cause=code.x,effect=code.y,polarity = recode(polarity, "Increases"=1,"Decreases"=-1),model=modelcode) %>%
  select(cause,polarity,effect,model)

head(real_data)
##   cause polarity effect model
## 1    12       -1     11     1
## 2    41        1     31     1
## 3    22        1      2     1
## 4     5        1     12     1
## 5    26        1      9     1
## 6     3        1      5     1
real_data <- real_data %>% mutate(chain=paste0(cause,",",effect)) #create a new variable called chain which combines cause and effect

matrix <- dcast(real_data,model~chain,value.var="polarity")[,-1] #use dcast to transform the dataframe into a matrix, and drop the first column (names)

matrix[is.na(matrix)] <- 0 #fill NA values in the matrix with zero

matrix[1:10,1:10]
##    1,103 1,13 1,2 1,26 1,3 1,30 1,43 1,6 1,69 1,83
## 1      0   -1   1    1   1    0    0   0    0    0
## 2      0    0   0    0   0    0    0   0    0    0
## 3      0    0   0    0   1    0    1   0    0    0
## 4      0    0   1    0   1    0    0   0    0    0
## 5      0    0   0    0   1    0    1   0    0    0
## 6      0    0   0    0   0    0    0   0    0    0
## 7      0    0   0    1   1    0    0   0    0    0
## 8      0    0   0    0   0    0    0   0    0    1
## 9      0    0   0    0   1    0    1   0    1    0
## 10     0    0   0    0   1    0    0   0    1    0
distances <- vegdist(matrix,method="euclidian") #use vegan::vegdist to make a euclidian distance matrix
distances
##            1         2         3         4         5         6         7
## 2   9.797959                                                            
## 3   9.848858  8.774964                                                  
## 4  10.677078  9.273618  9.219544                                        
## 5  10.000000  8.717798  8.426150  9.273618                              
## 6   9.695360  7.874008  7.937254  8.246211  8.485281                    
## 7  10.440307  9.327379  9.055385  9.539392  9.110434  8.888194          
## 8  10.630146  9.539392  9.055385  9.848858  9.539392  8.888194  9.695360
## 9  10.246951  8.888194  8.602325  9.327379  8.660254  8.426150  9.055385
## 10 10.148892  9.219544  9.165151  9.746794  9.110434  8.660254  9.899495
## 11  9.949874  8.774964  8.831761  9.433981  8.774964  8.185353  9.273618
## 12 10.535654  9.327379  9.165151  9.746794  9.110434  8.888194  9.695360
## 13  9.899495  8.602325  8.544004  9.273618  8.717798  8.124038  9.433981
## 14 10.246951  9.110434  9.380832  9.219544  9.219544  8.888194  9.899495
## 15 10.198039  8.944272  8.544004  9.486833  8.831761  8.485281  9.539392
##            8         9        10        11        12        13        14
## 2                                                                       
## 3                                                                       
## 4                                                                       
## 5                                                                       
## 6                                                                       
## 7                                                                       
## 8                                                                       
## 9   9.380832                                                            
## 10  9.797959  8.717798                                                  
## 11  9.591663  9.055385  9.273618                                        
## 12  9.486833  9.273618  9.380832  9.273618                              
## 13  9.327379  8.660254  8.660254  8.888194  9.433981                    
## 14  9.695360  8.944272  9.055385  9.486833  9.380832  9.110434          
## 15  9.643651  9.000000  9.110434  9.433981  9.219544  8.717798  9.327379
mds <- metaMDS(matrix,distance="euclidian",trace=0) #this makes the NMDS object
ordiplot(mds,type="t") # vegan::ordiplot plots it

2: Nonpolar links

However, in some of our data there are (sometimes indirect) positive AND negative connections between the same cause and effect. When this is the case, our signed-matrix approach won’t work. So, rather than treating “A increases B” and “A decreases B” as opposite values of the same link, we can simply treat them as separate columns. Then we may fill the cells with presence or absence of the link (or, once we consider indirect connections below, with something else, such as the number of paths or sum(1/minimum_path_length) or similar).

real_data_p <- real_data %>% mutate(
  polarity_symbol=recode(polarity,"1"="+","-1"="-","0"="x"),
  chain=paste0(cause,polarity_symbol,effect)) #create a new variable called chain which combines cause and effect AND polarity

matrix_p <- dcast(real_data_p,model~chain,value.var="model")[,-1] #use dcast to transform the dataframe into a matrix, and drop the first column (names)

#make a binary matrix instead
matrix_p[is.na(matrix_p)] <- 0 #fill NA values in the matrix with zero
matrix_p[matrix_p!=0] <- 1 #fill any presence value with 1


distances_p <- vegdist(matrix_p,method="euclidian") #use vegan::vegdist to make a euclidian distance matrix
distances_p
##            1         2         3         4         5         6         7
## 2   9.797959                                                            
## 3   9.848858  8.774964                                                  
## 4  10.677078  9.273618  9.219544                                        
## 5  10.000000  8.717798  8.426150  9.273618                              
## 6   9.695360  7.874008  7.937254  8.246211  8.485281                    
## 7  10.440307  9.327379  9.055385  9.539392  9.110434  8.888194          
## 8  10.630146  9.539392  9.055385  9.848858  9.539392  8.888194  9.695360
## 9  10.246951  8.888194  8.602325  9.327379  8.660254  8.426150  9.055385
## 10 10.148892  9.219544  9.165151  9.746794  9.110434  8.660254  9.899495
## 11  9.949874  8.774964  8.831761  9.433981  8.774964  8.185353  9.273618
## 12 10.535654  9.327379  9.165151  9.746794  9.110434  8.888194  9.695360
## 13  9.899495  8.602325  8.544004  9.273618  8.717798  8.124038  9.433981
## 14 10.246951  9.110434  9.380832  9.219544  9.219544  8.888194  9.899495
## 15 10.198039  8.944272  8.544004  9.486833  8.831761  8.485281  9.539392
##            8         9        10        11        12        13        14
## 2                                                                       
## 3                                                                       
## 4                                                                       
## 5                                                                       
## 6                                                                       
## 7                                                                       
## 8                                                                       
## 9   9.380832                                                            
## 10  9.797959  8.717798                                                  
## 11  9.591663  9.055385  9.273618                                        
## 12  9.486833  9.273618  9.380832  9.273618                              
## 13  9.327379  8.660254  8.660254  8.888194  9.433981                    
## 14  9.695360  8.944272  9.055385  9.486833  9.380832  9.110434          
## 15  9.643651  9.000000  9.110434  9.433981  9.219544  8.717798  9.327379
mds_p <- metaMDS(matrix_p,distance="euclidian",trace=0) #this makes the NMDS object
ordiplot(mds_p,type="t",display="sites") # vegan::ordiplot plots it

We may also use this method to see the location in the ordination space of each of the links (note, these will be overplotted and hard to read as all links unique to one model will plot in the same place).

ordiplot(mds_p,type="t") # vegan::ordiplot plots it

3: Direct causation and causal chains:

We discussed two possible attributes by which models may be similar or different: direct causation and causal chains.

Consider two new models:

test_models <- read.csv("~/Dropbox/p/afghanistan/test_models_2.csv",stringsAsFactors = F)
test_models <- test_models %>% mutate(chain=paste0(cause,effect))
test_models
##   cause polarity effect model chain
## 1     A        1      B     1    AB
## 2     A        1      B     2    AB
## 3     B        1      C     2    BC
## 4     A        1      B     3    AB
## 5     D        1      E     3    DE
## 6     A        1      B     4    AB
## 7     B       -1      C     4    BC
## 8     A        1      C     5    AC
## 9     A        1      D     6    AD

You’ll see above that in model 5, A directly increases C, and in model 6, A directly increases D. With our current method, based only on direct causation, these two models share no links with any of the other models. Our distance metric computes model 6 as slightly closer to the models with fewer links (models 1 and 5), just because there is less to differentiate them.

matrix <- dcast(test_models,model~chain,value.var="polarity")[,-1] 
matrix[is.na(matrix)] <- 0
distances <- vegdist(matrix,method="euclidian")
distances
##          1        2        3        4        5
## 2 1.000000                                    
## 3 1.000000 1.414214                           
## 4 1.000000 2.000000 1.414214                  
## 5 1.414214 1.732051 1.732051 1.732051         
## 6 1.414214 1.732051 1.732051 1.732051 1.414214

However, we might also say that in model 2 there is an inferred link we have not specified in the edgelist: that A will also increase C (via B), and further say that, because of this, model 2 should have a greater similarity with (i.e. less distance from) model 5, which directly specifies this link.

Put in terms of the edgelist, we would need to add new rows making the inferred links explicit. In this case, we’ve actually got two inferred links: the positive AC link in model 2 and the negative AC link in model 4.

new_chains <- data.frame("cause"=c("A","A"),"polarity"=c(1,-1),effect=c("C","C"),"model"=c(2,4))
new_chains <- new_chains %>% mutate(chain=paste0(cause,effect))
test_models_withchains <- test_models %>% rbind(new_chains)
test_models_withchains
##    cause polarity effect model chain
## 1      A        1      B     1    AB
## 2      A        1      B     2    AB
## 3      B        1      C     2    BC
## 4      A        1      B     3    AB
## 5      D        1      E     3    DE
## 6      A        1      B     4    AB
## 7      B       -1      C     4    BC
## 8      A        1      C     5    AC
## 9      A        1      D     6    AD
## 10     A        1      C     2    AC
## 11     A       -1      C     4    AC

We can see that doing this brings models 2 and 5 closer together (the distance is now 1.4 instead of 1.7) and amplifies the distance between models 2 and 4 (now 2.8 instead of 2.0).

matrix <- dcast(test_models_withchains,model~chain,value.var="polarity")[,-1]
matrix[is.na(matrix)] <- 0 
distances <- vegdist(matrix,method="euclidian") 
distances
##          1        2        3        4        5
## 2 1.414214                                    
## 3 1.000000 1.732051                           
## 4 1.414214 2.828427 1.732051                  
## 5 1.414214 1.414214 1.732051 2.449490         
## 6 1.414214 2.000000 1.732051 2.000000 1.414214
mds <- metaMDS(matrix,distance="euclidian",trace=0) 
ordiplot(mds,type="t")

Automating the process:

For more complex models, the number of inferred links multiplies rapidly: in an AB, BC, CA loop, for instance, the inferred links would be AC, BA, CB, AA, BB, and CC.

This function automates the process by going through the edgelist, first tracing each effect of a two-link chain that is itself a cause, then each effect of the resulting three-link chains that is a cause, and so on. Polarity is inherited multiplicatively (++=+, +-=-, --=+).

expand_edgelist<-function(df){
  
options(stringsAsFactors = FALSE)
require(stringr)

df$chain <- paste0(df$cause,",",df$effect) #Combine cause and effect into a two-link causal "chain" as a string, define chain length as two-link
df$length <- 2
df$unique <- 2 #later on, we'll compare unique nodes against total nodes to identify loops

df2 <- df #save a version of the direct links for comparison later
df_expanded <- df #create another copy that we'll build upon

while(dim(df)[1]>0){ #as long as there are more links to check:
  
longerchains <- c() #make an empty list which we will fill with all longer chains implied by the current chains
polarities <- c() #make an empty list which we will fill with all the polarities of the longer chains

# for each row in the list of chains that's not already looping
for (i in which(df$length==df$unique)){ 
 leftchain <- df[i,] #set the current chain as the left side of our new longer chain
 leftpolarity <- df[i,'polarity'] #inherit the polarity
 rightchains <- df2[df2$cause==leftchain$effect,] #extract direct links in which the effect of the leftchain is the cause
 rightpolarities <- rightchains$polarity #inherit the polarity
 
 #if there are any right side chains
 if (dim(rightchains)[1]>0){ 
    chain <- paste0(leftchain$chain,",",rightchains$effect) #attach their effects to the left-side chain
    longerchains <- c(longerchains,chain) #and add each result to the list of longer chains
    polarities <- c(polarities,leftpolarity*rightpolarities) #and multiply the polarities to combine them (++=+, +-=-,--=+)
    } 
}

df_longer <- data.frame(cause=if(!is.null(longerchains)){longerchains %>% strsplit(",") %>% sapply(first)}else NULL, #then construct a new data frame for the longer chains, in which cause is the first node of each chain...
 polarity=polarities,
 effect=if(!is.null(longerchains)){longerchains %>% strsplit(",") %>% sapply(last)}else NULL,
 chain=longerchains, 
 length=if(!is.null(longerchains)){longerchains %>% strsplit(",") %>% sapply(length)}else NULL,
 unique=if(!is.null(longerchains)){longerchains %>% str_split(",") %>% lapply(unique) %>% sapply(length)}else NULL #calculate the number of unique nodes in each chain
 )
df_expanded <- rbind(df_expanded,df_longer) #save our longer chains into a comprehensive list
df <- df_longer #and start the process again to see if any of our longer chains can be extended
}
df_expanded #output the expanded edgelist
}

Testing this process against a few sample models:

cld_loop <- read.csv("~/Dropbox/p/afghanistan/test_models_3.csv",stringsAsFactors = F)

cld_loop
##   cause polarity effect
## 1     A        1      B
## 2     B        1      C
## 3     C        1      A
expand_edgelist(cld_loop)
##   cause polarity effect   chain length unique
## 1     A        1      B     A,B      2      2
## 2     B        1      C     B,C      2      2
## 3     C        1      A     C,A      2      2
## 4     A        1      C   A,B,C      3      3
## 5     B        1      A   B,C,A      3      3
## 6     C        1      B   C,A,B      3      3
## 7     A        1      A A,B,C,A      4      3
## 8     B        1      B B,C,A,B      4      3
## 9     C        1      C C,A,B,C      4      3
cld_loop_plus <- read.csv("~/Dropbox/p/afghanistan/test_models_5.csv",stringsAsFactors = F)

cld_loop_plus
##   cause polarity effect
## 1     A        1      B
## 2     B        1      C
## 3     C        1      A
## 4     D        1      A
expand_edgelist(cld_loop_plus)
##    cause polarity effect     chain length unique
## 1      A        1      B       A,B      2      2
## 2      B        1      C       B,C      2      2
## 3      C        1      A       C,A      2      2
## 4      D        1      A       D,A      2      2
## 5      A        1      C     A,B,C      3      3
## 6      B        1      A     B,C,A      3      3
## 7      C        1      B     C,A,B      3      3
## 8      D        1      B     D,A,B      3      3
## 9      A        1      A   A,B,C,A      4      3
## 10     B        1      B   B,C,A,B      4      3
## 11     C        1      C   C,A,B,C      4      3
## 12     D        1      C   D,A,B,C      4      4
## 13     D        1      A D,A,B,C,A      5      4
cld_complex <- read.csv("~/Dropbox/p/afghanistan/test_models_4.csv",stringsAsFactors = F)

cld_complex
##   cause polarity effect
## 1     A        1      B
## 2     A        1      C
## 3     A        1      D
## 4     B       -1      D
## 5     B        1      E
## 6     E        1      F
## 7     G        1      F
## 8     F        1      A
## 9     I        1      J
expand_edgelist(cld_complex)
##    cause polarity effect       chain length unique
## 1      A        1      B         A,B      2      2
## 2      A        1      C         A,C      2      2
## 3      A        1      D         A,D      2      2
## 4      B       -1      D         B,D      2      2
## 5      B        1      E         B,E      2      2
## 6      E        1      F         E,F      2      2
## 7      G        1      F         G,F      2      2
## 8      F        1      A         F,A      2      2
## 9      I        1      J         I,J      2      2
## 10     A       -1      D       A,B,D      3      3
## 11     A        1      E       A,B,E      3      3
## 12     B        1      F       B,E,F      3      3
## 13     E        1      A       E,F,A      3      3
## 14     G        1      A       G,F,A      3      3
## 15     F        1      B       F,A,B      3      3
## 16     F        1      C       F,A,C      3      3
## 17     F        1      D       F,A,D      3      3
## 18     A        1      F     A,B,E,F      4      4
## 19     B        1      A     B,E,F,A      4      4
## 20     E        1      B     E,F,A,B      4      4
## 21     E        1      C     E,F,A,C      4      4
## 22     E        1      D     E,F,A,D      4      4
## 23     G        1      B     G,F,A,B      4      4
## 24     G        1      C     G,F,A,C      4      4
## 25     G        1      D     G,F,A,D      4      4
## 26     F       -1      D     F,A,B,D      4      4
## 27     F        1      E     F,A,B,E      4      4
## 28     A        1      A   A,B,E,F,A      5      4
## 29     B        1      B   B,E,F,A,B      5      4
## 30     B        1      C   B,E,F,A,C      5      5
## 31     B        1      D   B,E,F,A,D      5      5
## 32     E       -1      D   E,F,A,B,D      5      5
## 33     E        1      E   E,F,A,B,E      5      4
## 34     G       -1      D   G,F,A,B,D      5      5
## 35     G        1      E   G,F,A,B,E      5      5
## 36     F        1      F   F,A,B,E,F      5      4
## 37     G        1      F G,F,A,B,E,F      6      5

Trying expansion with real data:

Expanding our real data takes a little while to run; “walking” all possible causal paths in a well-connected CLD creates many duplicate ways to get from A to B.

#I've commented this out for now because it takes so long to run

#expanded_data<-data.frame("cause"=NA,"effect"=NA,"polarity"=NA,"numpaths"=NA,"shortestpath"=NA,"sumpath"=NA,"model"=NA)
#for(i in unique(real_data$model)){
#  print(i)
#  one_model <- real_data %>% filter(model==i)
#  one_model_expanded_all <- expand_edgelist(one_model[,1:3])
#  one_model_expanded <- one_model_expanded_all %>% group_by(cause,effect,polarity) %>% summarize(numpaths=length(unique(chain)),shortestpath=min(length),sumpath=sum(1/length)) 
#  one_model_expanded$model=i
#  expanded_data<-rbind(expanded_data,data.frame(one_model_expanded))
#}

#expanded_data <- expanded_data %>% mutate(
#  inv_shortestpath=1/shortestpath,
#  polarity_symbol=recode(polarity,"1"="+","-1"="-","0"="x"),
#  chain=paste0(cause,polarity_symbol,effect))

#write.table(expanded_data,"expande_data.csv",sep=",")
exp_data<-read.table("expande_data.csv",sep=",")

expanded_data<-exp_data

head(expanded_data)
##   cause effect polarity numpaths shortestpath  sumpath model
## 1    NA     NA       NA       NA           NA       NA    NA
## 2     1      1       -1       71            3 7.443262     1
## 3     1      1        1       23            3 2.843887     1
## 4     1     10       -1       14            3 2.357937     1
## 5     1     10        1       26            6 2.407115     1
## 6     1     11       -1       56            4 6.763456     1
##   polarity_symbol  chain inv_shortestpath
## 1            <NA> NANANA               NA
## 2               -    1-1        0.3333333
## 3               +    1+1        0.3333333
## 4               -   1-10        0.3333333
## 5               +   1+10        0.1666667
## 6               -   1-11        0.2500000
matrix_length <- dcast(expanded_data[-1,],model~chain,value.var="inv_shortestpath")[,-1]
matrix_length[is.na(matrix_length)] <- 0 #fill NA values in the matrix with zero

matrix_presence <- matrix_length
matrix_presence[matrix_presence!=0] <- 1 #fill any non-NA values with 1

matrix_sumpath <- dcast(expanded_data[-1,],model~chain,value.var="sumpath")[,-1]
matrix_sumpath[is.na(matrix_sumpath)] <- 0

matrix_sumpath_polar <- expanded_data%>%group_by(model,cause,effect) %>% summarize(plus_sumpath=sum(sumpath[polarity==1]),minus_sumpath=sum(sumpath[polarity==-1]),sum_sumpath=plus_sumpath-minus_sumpath,chain=paste0(unique(cause),",",unique(effect)))
matrix_sumpath_polar <- dcast(matrix_sumpath_polar,model~chain,value.var="sum_sumpath")[,-1]
matrix_sumpath_polar[is.na(matrix_sumpath_polar)] <- 0

distances_presence <- vegdist(matrix_presence,method="euclidian") 
distances_length <- vegdist(matrix_length,method="euclidian") #this looks similar to above, actually??
distances_sumpath <- vegdist(matrix_sumpath,method="euclidian") 
distances_sumpath_polar <- vegdist(matrix_sumpath_polar,method="euclidian") 

An ordination of the presence/absence matrix of the expanded (indirect link) data looks quite similar to the ordination of the direct-link data. The distance of models 1 and 4 from the others is amplified.

mds_presence <- metaMDS(matrix_presence,distance="euclidian",trace=0) 
ordiplot(mds_presence,type="t",display="sites") 

An ordination of the matrix of the expanded data where cells are filled with 1/shortest_path is also quite similar; models 1, 10, 8, and 4 are still outliers.

mds_length <- metaMDS(matrix_length,distance="euclidian",trace=0) 
ordiplot(mds_length,type="t",display="sites") #quite similar, 1, 10, 8 and 4 still outliers

In contrast, an ordination of the expanded data matrix where cells are filled with the sum of 1/path_length (so that 2 four-link paths by which A increases B would score the same as 1 two-link path) looks quite different. The previous outliers are now the central cluster.
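
To spell out the arithmetic in that parenthetical:

sum(1/c(4,4)) #two four-link paths (length 4 each): 1/4 + 1/4 = 0.5
1/2           #one direct two-link path (length 2): also 0.5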

mds_sumpath <- metaMDS(matrix_sumpath,distance="euclidian",trace=0) 
ordiplot(mds_sumpath,type="t",display="sites") #this one is quite different; previous outliers are now central cluster

4: Building a centroid/aggregate model

We build up a centroid model by taking the most prevalent causal link, then adding the next, then the next, etc.; at each point we calculate the average pairwise distance among all models (including the growing aggregate). When this distance begins to increase, the aggregate model stops building.

real_data_p <- real_data %>% mutate(
  polarity_symbol=recode(polarity,"1"="+","-1"="-","0"="x"),
  chain=paste0(cause,polarity_symbol,effect)) 
matrix_p <- dcast(real_data_p,model~chain,value.var="model")[,-1] 
matrix_p[is.na(matrix_p)] <- 0 
matrix_p[matrix_p!=0] <- 1 

Here are the original 15 models:

mds_p <- metaMDS(matrix_p,distance="euclidian",trace=0)
ordiplot(mds_p,type="t",display="sites")

This code will build the aggregate model:

test_dists <- c()
new_distances <-c(999,999)
distances <- vegdist(matrix_p,method="euclidian")
next_new_row <- matrix_p[1,] #duplicate model 1
next_new_row[] <- 0 #and use it to make a new, blank model
for(i in 1:length(next_new_row)){ #for each of the possible links (ultimately this should consider DIRECT links only)
  prevalence<-colSums(matrix_p) #...in the order of most to least common...
  next_new_row[rev(order(prevalence))[i]] <- 1 #include them in the new model
  #print(i)
  #print(new_row)
  #print(mean(distances))
  next_new_distances <- vegdist(rbind(matrix_p,next_new_row),method="euclidian") #calculate distances of including this new row in the matrix
  test_dists<-c(test_dists,mean(next_new_distances))
  
  if(mean(next_new_distances)>mean(distances)|mean(next_new_distances)>mean(new_distances)) break #if the overall mean distance has increased, stop. Otherwise, go ahead and...
  new_row <- next_new_row #go ahead and save this and add another link
  new_distances <- next_new_distances #go ahead and save this and add another link
  
}
print(paste0("I added ",i-1," links to the aggregate chain. Mean pairwise distance is now ", new_distances %>% mean %>% round(3)," compared to an original of ", distances %>% mean %>% round(3), " and a mean pairwise distance if I added another chain of ", next_new_distances %>% mean %>% round(3)))
## [1] "I added 5 links to the aggregate chain. Mean pairwise distance is now 8.926 compared to an original of 9.234 and a mean pairwise distance if I added another chain of 8.927"
new_row[,new_row==1]
##   1+3 2+3 29+30 3+26 30+3
## 1   1   1     1    1    1
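
The links in the aggregate are still in coded form (cause code, polarity symbol, effect code). As a small convenience (not part of the workflow above), we can translate the node codes back to names using the codelist we built earlier:

agg_chains <- names(new_row)[new_row==1] #the chains selected for the aggregate (e.g. "1+3")
agg_codes <- as.numeric(unique(unlist(str_split(agg_chains,"\\+|-")))) #pull the node codes out of the chain names
codelist %>% filter(code %in% agg_codes) #look up the original node names for those codes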

And it’ll plot here as model 16.

matrix_new <- rbind(matrix_p,new_row) 
mds_new <- metaMDS(matrix_new,distance="euclidian",trace=0)
ordiplot(mds_new,type="t",display="sites")

Translating the model numbers back to school names:

row.names(matrix_new)=c(trimws(substr(modellist$model,7,14))[1:15],"CENTROID")
mds_new <- metaMDS(matrix_new,distance="euclidian",trace=0)
ordiplot(mds_new,type="t",display="sites")

Plotting the centroid model

require(igraph)

#use a custom curve function from https://stackoverflow.com/questions/16875547/using-igraph-how-to-force-curvature-when-arrows-point-in-opposite-directions
autocurve.edges2 <-function (graph, start = 0.5)
{
    cm <- count.multiple(graph)
    mut <-is.mutual(graph)  #are connections mutual?
    el <- apply(get.edgelist(graph, names = FALSE), 1, paste,
        collapse = ":")
    ord <- order(el)
    res <- numeric(length(ord))
    p <- 1
    while (p <= length(res)) {
        m <- cm[ord[p]]
        mut.obs <-mut[ord[p]] #are the connections mutual for this point?
        idx <- p:(p + m - 1)
        if (m == 1 & mut.obs==FALSE) { #no mutual conn = no curve
            r <- 0
        }
        else {
            r <- seq(-start, start, length = m)
        }
        res[ord[idx]] <- r
        p <- p + m
    }
    res
}

#to graph any model from the data: 
#real_data %>%
#  filter(model==14) %>%
#  mutate(from=cause,to=effect,delay=FALSE) %>%
#  select(from,to,polarity,delay) %>%
#  graph_from_data_frame() %>%
#  plot (edge.arrow.size=.4, vertex.label.cex=.75)

aggregate_model<- new_row[,new_row==1] %>% t() %>% as.data.frame()
aggregate_model$chain=row.names(aggregate_model)

from<-c()
to<-c()
polarity<-c()
for(i in 1:length(aggregate_model$chain)){
  from<-c(from,str_split(aggregate_model[i,"chain"],"\\+|-")[[1]][1]) #split the chain name on either polarity symbol; the first piece is the cause code
  to<-c(to,str_split(aggregate_model[i,"chain"],"\\+|-")[[1]][2]) #...and the second piece is the effect code
  polarity<-c(polarity,if(str_detect(aggregate_model[i,"chain"], pattern="\\+")) "+" else "-")
}
aggregate_model2 <- data.frame(from,to,polarity,"delay"=FALSE)

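#NOTE: linkinfo is not constructed in this section; it is assumed to be a lookup table built
#elsewhere, with one row per distinct direct link (chain, cause, polarity, effect, the original
#cause/effect labels, and models_with_direct = the number of models containing that link),
#sorted most-common first. A rough, hypothetical sketch of how such a table could be rebuilt
#(the real linkinfo may differ in columns and ordering):
linkinfo_sketch <- real_data_p %>%
  group_by(chain,cause,polarity,effect) %>%
  summarize(models_with_direct=n_distinct(model)) %>%
  merge(codelist,by.x="cause",by.y="code") %>% rename(cause_label=node) %>%
  merge(codelist,by.x="effect",by.y="code") %>% rename(effect_label=node) %>%
  arrange(desc(models_with_direct))
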
aggregate_info<-merge(aggregate_model,linkinfo,by="chain")

#abbreviate label function
word_abbreviate<-function(vector){
  vector %>% 
  str_replace_all("(C|c)hild", "C") %>% 
  str_replace_all("(T|t)eacher", "T") %>% 
  str_replace_all("(S|s)chool", "S") %>% 
  str_replace_all("(L|l)earning", "L") }

aggregate_info <- aggregate_info %>% mutate(
  cause_label=word_abbreviate(cause_label),
  effect_label=word_abbreviate(effect_label))


aggregate_graph<-graph_from_data_frame(aggregate_model2)
E(aggregate_graph)$label <- c(recode(aggregate_info$polarity,"1"="+","-1"="-"))
E(aggregate_graph)$arrow.size <- .5
V(aggregate_graph)$label <- unique(c(aggregate_info$cause_label,aggregate_info$effect_label))
plot(aggregate_graph,vertex.color="white",vertex.frame.color="grey",vertex.label.cex=.75,edge.label.cex=1.25,edge.label.color="dark green")

Here’s a more interesting centroid model. It uses the same process as above, except that as each of the direct links is added to the aggregate, the indirect links it implies are also included. I think this is cool because it highlights important loops (which are necessarily indirect links).

array_of_dataframes<-list() #will hold the expanded data (real models plus aggregate) at each aggregate size
number_of_indirects<-c() #number of indirect links implied by the aggregate at each size
list_of_distances<-c() #just for testing purposes
for(i in 1:25){ #length(linkinfo$chain)){
  aggregate_model<-linkinfo[1:i,] #it's already sorted, so this will naturally take them in order
  
  expanded_data_aggregate<-aggregate_model %>%
  select(cause,polarity,effect) %>%
  expand_edgelist() %>%
  group_by(cause,effect,polarity) %>%
   summarize(numpaths=length(unique(chain)),shortestpath=min(length),sumpath=sum(1/length)) %>%
   mutate(model=16) #note this will need to change when comparing with other data
    
  expanded_data_aggregate <- expanded_data_aggregate %>% mutate(
  polarity_symbol=recode(polarity,"1"="+","-1"="-","0"="x"),
  chain=paste0(cause,polarity_symbol,effect),
  inv_shortestpath=1/shortestpath
  ) %>% as.data.frame()

  expanded_data_with_aggregate <- rbind(expanded_data[-1,],expanded_data_aggregate) #join with real-world models

  matrix_length_with_aggregate <- dcast(expanded_data_with_aggregate[-1,],model~chain,value.var="inv_shortestpath")[,-1] 
  matrix_length_with_aggregate[is.na(matrix_length_with_aggregate)] <- 0 #fill NA values in the matrix with zero

  matrix_presence_with_aggregate <- matrix_length_with_aggregate
  matrix_presence_with_aggregate[matrix_presence_with_aggregate!=0] <- 1 #fill any non-NA values with 1

  list_of_distances<-c(list_of_distances,mean(vegdist(matrix_presence_with_aggregate,method="euclidian")))
  number_of_indirects<-c(number_of_indirects,length(expanded_data_aggregate[expanded_data_aggregate$shortestpath>2,"chain"]))
  array_of_dataframes[[i]]<-expanded_data_with_aggregate

}

Here’s a plot of the overall mean pairwise distances among the models (y axis), including implied/indirect links, as we add direct links to the aggregate model.

ggplot(data.frame('Mean.Pairwise.Distance'=list_of_distances,'Aggregate.Model.Length'=1:length(list_of_distances)),aes(Aggregate.Model.Length,Mean.Pairwise.Distance))+geom_point()+theme_bw()

It looks like the aggregate model with 17 links is the best centroid. Here it is, with arrow width indicating how many models each link shows up in.
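
(We can also check this programmatically rather than reading it off the plot:)

which.min(list_of_distances) #the number of direct links in the aggregate with the lowest mean pairwise distance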

agg_info <- array_of_dataframes[[17]] %>%
  filter(model==16&shortestpath==2) %>% merge(linkinfo[,c(1,3,6,7,8,9)],by="chain")

agg_graph <- agg_info %>% 
  mutate(from=cause,to=effect,delay=FALSE) %>%
  select(from,to,polarity,delay) %>%
  graph_from_data_frame()

agg_info <- agg_info %>% mutate(
  cause_label=word_abbreviate(cause_label),
  effect_label=word_abbreviate(effect_label))
  
E(agg_graph)$label <- c(recode(agg_info$polarity,"1"="+","-1"="-"))
E(agg_graph)$label.color <- c(recode(agg_info$polarity,"1"="dark green","-1"="red"))
E(agg_graph)$width <- c(agg_info$models_with_direct^2)/15
E(agg_graph)$arrow.size <- 1
V(agg_graph)$label <- unique(c(agg_info$cause_label,agg_info$effect_label))
curves <-autocurve.edges2(agg_graph)

plot(agg_graph, vertex.color="white", vertex.frame.color="grey", vertex.label.cex=.75, edge.label.cex=1.25, edge.curved=curves)