This document shows how to use the R package Rrunstruct to conduct multiple structure runs from R, based on the parameters chosen for any parameter set specified in the structure front-end. It goes hand in hand with the computer practical in the SISG MCMC course. Before following this document, you will want to have completed the first part of the computer practical and to have done at least one run of structure using the front-end.
The goal here is to see how to use some R tools to parse structure output so that we can quickly assess whether multiple runs of structure's MCMC algorithm give similar results. Along the way we will encounter a couple of challenges, most notably that the labels of the clusters may not be concordant across multiple structure runs. We might also take some time to look at some of the tools from Hadley Wickham's R ecosystem (dplyr, ggplot2, tidyr, stringr, etc.) that are helpful here.
Before we can begin there is some setup you might have to do.
You will need to install the package Rrunstruct. Because the package includes the structure executable file, it is not available on CRAN, but can be installed from GitHub. Doing so requires the devtools package.
devtools::install_github("eriqande/Rrunstruct")
If that doesn’t work, it is likely because you don’t yet have the devtools package. Install it and then try again:
install.packages("devtools")
devtools::install_github("eriqande/Rrunstruct")
To run structure you just need to:

1. Set the working directory to the structure project that holds the parameter set you want to use. On my computer that is ~/Desktop/scot-cats/PlainPars. On your computer, that path will have to be set accordingly.
2. Call structure_runs(), telling it where to put the output, what K values to use, how many reps to do for each, and the seed to use for the first run (the seed gets incremented by one for each successive run).

A Big Note: The mainparams file in the PlainPars directory needs to have the RANDOMIZE variable set to 0 so that the random number generator seed can be set to a different value for each run of structure. This can be achieved by running a set of runs through the scheduler using the front end, or by editing the mainparams file with a simple text editor (or from R, as sketched in the code below).
library(Rrunstruct)
setwd("~/Desktop/scot-cats/PlainPars") # you've got to change this for your own computer!
structure_runs(outdir = "struct-runs-from-r-1", K = 1:4, reps = 3, seed = 5566)
Once those runs have completed, you can take a look at what the outputs look like. All the output files should be in the directory struct-runs-from-r-1 that is inside the PlainPars directory. Let’s see what they look like:
dir(path = "struct-runs-from-r-1")
[1] "struct_results_K_1_rep_1_f" "struct_results_K_1_rep_2_f"
[3] "struct_results_K_1_rep_3_f" "struct_results_K_2_rep_1_f"
[5] "struct_results_K_2_rep_2_f" "struct_results_K_2_rep_3_f"
[7] "struct_results_K_3_rep_1_f" "struct_results_K_3_rep_2_f"
[9] "struct_results_K_3_rep_3_f" "struct_results_K_4_rep_1_f"
[11] "struct_results_K_4_rep_2_f" "struct_results_K_4_rep_3_f"
[13] "struct_stdout_K_1_rep_1" "struct_stdout_K_1_rep_2"
[15] "struct_stdout_K_1_rep_3" "struct_stdout_K_2_rep_1"
[17] "struct_stdout_K_2_rep_2" "struct_stdout_K_2_rep_3"
[19] "struct_stdout_K_3_rep_1" "struct_stdout_K_3_rep_2"
[21] "struct_stdout_K_3_rep_3" "struct_stdout_K_4_rep_1"
[23] "struct_stdout_K_4_rep_2" "struct_stdout_K_4_rep_3"
The naming convention is pretty simple: the *results* files are what structure spits out at the end of each run, and the *stdout* files store what the command-line version of structure prints as it runs. Let’s have a look at the first 50 lines of one of the stdout files, which hold the traces:
cat(readLines("struct-runs-from-r-1/struct_stdout_K_3_rep_1")[1:50], sep = "\n")
----------------------------------------------------
STRUCTURE by Pritchard, Stephens and Donnelly (2000)
and Falush, Stephens and Pritchard (2003)
Code by Pritchard, Falush and Hubisz
Version 2.3.4 (Jul 2012)
----------------------------------------------------
Reading file "mainparams".
datafile is
/Users/eriq/Desktop/scot-cats/project_data
Reading file "extraparams".
Reading file "/Users/eriq/Desktop/scot-cats/project_data".
Number of alleles per locus: min= 9; ave=12.4; max=17
--------------------------------------
Finished initialization; starting MCMC
10000 iterations + 5000 burnin
Rep#: Alpha F1 F2 F3 Ln Like
1: 1.020 0.010 0.010 0.010 --
2: 1.006 0.010 0.010 0.010 --
3: 1.006 0.010 0.010 0.010 --
4: 1.006 0.010 0.010 0.010 --
5: 1.002 0.010 0.010 0.010 --
6: 1.002 0.010 0.010 0.010 --
7: 1.020 0.010 0.010 0.010 --
8: 1.045 0.010 0.010 0.010 --
9: 1.046 0.010 0.010 0.010 --
Rep#: Alpha F1 F2 F3 Ln Like
10: 1.033 0.010 0.010 0.010 --
11: 1.060 0.010 0.010 0.010 --
12: 1.064 0.010 0.010 0.010 --
13: 1.101 0.010 0.010 0.010 --
14: 1.067 0.010 0.010 0.010 --
15: 1.062 0.010 0.010 0.010 --
16: 1.060 0.010 0.010 0.010 --
17: 1.030 0.010 0.010 0.010 --
18: 1.004 0.010 0.013 0.010 --
19: 1.004 0.010 0.013 0.010 --
Rep#: Alpha F1 F2 F3 Ln Like
20: 1.019 0.010 0.013 0.010 --
21: 0.981 0.010 0.013 0.010 --
And let’s look at a bit of one of the result files. Note that you are going to want to extract the “Inferred ancestry of individuals” section from this file:
cat(readLines("struct-runs-from-r-1/struct_results_K_3_rep_1_f")[1:100], sep = "\n")
----------------------------------------------------
STRUCTURE by Pritchard, Stephens and Donnelly (2000)
and Falush, Stephens and Pritchard (2003)
Code by Pritchard, Falush and Hubisz
Version 2.3.4 (Jul 2012)
----------------------------------------------------
Command line arguments: /Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rrunstruct/bin/structure_mac -K 3 -o struct-runs-from-r-1/struct_results_K_3_rep_1 -D 5572
Input File: /Users/eriq/Desktop/scot-cats/project_data
Run parameters:
304 individuals
9 loci
3 populations assumed
5000 Burn-in period
10000 Reps
RANDOMIZE turned off
--------------------------------------------
Proportion of membership of each pre-defined
population in each of the 3 clusters
Given Inferred Clusters Number of
Pop 1 2 3 Individuals
1: 0.053 0.075 0.872 27
2: 0.082 0.150 0.769 61
3: 0.218 0.491 0.291 91
4: 0.074 0.156 0.770 28
5: 0.113 0.256 0.631 19
6: 0.022 0.065 0.913 4
7: 0.765 0.204 0.031 10
8: 0.652 0.307 0.041 18
9: 0.744 0.223 0.033 46
--------------------------------------------
Allele-freq. divergence among pops (Net nucleotide distance),
computed using point estimates of P.
1 2 3
1 - 0.0182 0.1162
2 0.0182 - 0.0887
3 0.1162 0.0887 -
Average distances (expected heterozygosity) between individuals in same cluster:
cluster 1 : 0.7361
cluster 2 : 0.7539
cluster 3 : 0.6622
--------------------------------------------
Estimated Ln Prob of Data = -8716.5
Mean value of ln likelihood = -8506.6
Variance of ln likelihood = 419.7
Mean value of alpha = 0.1001
Mean value of Fst_1 = 0.0422
Mean value of Fst_2 = 0.0299
Mean value of Fst_3 = 0.1794
Inferred ancestry of individuals:
Label (%Miss) Pop: Inferred clusters
1 Z12 (0) 3 : 0.169 0.769 0.062
2 Z13 (0) 3 : 0.103 0.874 0.023
3 Z14 (0) 3 : 0.139 0.661 0.200
4 Z15 (0) 3 : 0.657 0.193 0.150
5 Z16 (0) 3 : 0.153 0.308 0.540
6 Z17 (0) 1 : 0.014 0.015 0.971
7 Z18 (0) 3 : 0.082 0.179 0.739
8 Z19 (0) 3 : 0.579 0.336 0.085
9 Z20 (11) 2 : 0.019 0.115 0.866
10 Z21 (0) 2 : 0.022 0.034 0.944
11 Z30 (0) 3 : 0.608 0.375 0.017
12 Z31 (0) 3 : 0.914 0.071 0.015
13 Z34 (0) 3 : 0.559 0.413 0.027
14 DB1 (0) 2 : 0.086 0.808 0.105
15 DB2 (0) 2 : 0.038 0.483 0.479
16 DB3 (0) 3 : 0.076 0.568 0.356
17 DB5 (0) 2 : 0.459 0.444 0.097
18 DB6 (0) 2 : 0.023 0.025 0.952
19 DB7 (0) 5 : 0.055 0.055 0.889
20 DB8 (11) 3 : 0.874 0.107 0.019
21 DB9 (0) 2 : 0.264 0.465 0.272
22 DB10 (0) 3 : 0.280 0.665 0.054
23 DB11 (0) 3 : 0.064 0.335 0.601
24 DB12 (0) 1 : 0.073 0.075 0.851
25 DB16 (0) 2 : 0.037 0.042 0.921
26 DB17 (0) 3 : 0.083 0.064 0.852
27 DB18 (0) 3 : 0.112 0.058 0.829
28 DB19 (0) 3 : 0.029 0.948 0.023
29 DB21 (22) 5 : 0.038 0.046 0.916
30 DB22 (0) 2 : 0.099 0.083 0.818
31 DB23 (0) 2 : 0.044 0.466 0.489
32 DB24 (0) 3 : 0.415 0.550 0.035
33 DB25 (0) 4 : 0.078 0.084 0.839
Although R’s syntax for text processing is perhaps not as compact as awk’s or sed’s, and it might be slower than Python, it has nice facilities for parsing text files. The package stringr, used in combination with the function readLines(), is particularly easy to use and helpful.
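For example, here is a minimal sketch (not part of Rrunstruct; it relies only on the “Estimated Ln Prob of Data” line that appears in the results files shown above) that uses readLines() and stringr to pull the estimated log-probability of the data out of every results file:

library(stringr)
# paths to the results files (they all end in "_f")
res_files <- dir("struct-runs-from-r-1", pattern = "_f$", full.names = TRUE)
sapply(res_files, function(f) {
  lines <- readLines(f)
  hit <- lines[str_detect(lines, "Estimated Ln Prob of Data")]  # e.g. "Estimated Ln Prob of Data = -8716.5"
  as.numeric(str_extract(hit, "-?[0-9.]+$"))                    # grab the number at the end of that line
})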
We won’t go into the details now, though. Instead we have a single function, read_and_process_structure_output(), that:

- reads all the stdout and results files and wrangles them into long-format (tidy) data frames, and
- figures out how the cluster labels of each replicate should be permuted to be as concordant as possible with replicate 1, and relabels the traces and results accordingly.

If you want the details of all that, you can check the package out on GitHub. I used the packages tidyr and dplyr extensively (but the code is still kind of ugly…).
Here we get all the output into a list of two data frames:
L <- read_and_process_structure_output("struct-runs-from-r-1")
And let’s take a look at the tops of the two data frames in that list.
First, the estimated Q values for each individual:
L$q
Source: local data frame [9,120 x 10]
K Rep Index Label Miss Pop cluster probability mp cluster_relabeled
1 1 1 1 Z12 0 3 1 1 1 1
2 1 1 2 Z13 0 3 1 1 1 1
3 1 1 3 Z14 0 3 1 1 1 1
4 1 1 4 Z15 0 3 1 1 1 1
5 1 1 5 Z16 0 3 1 1 1 1
6 1 1 6 Z17 0 1 1 1 1 1
7 1 1 7 Z18 0 3 1 1 1 1
8 1 1 8 Z19 0 3 1 1 1 1
9 1 1 9 Z20 11 2 1 1 1 1
10 1 1 10 Z21 0 2 1 1 1 1
.. . ... ... ... ... ... ... ... .. ...
The columns K and Rep give the K value assumed in the structure run and which replicate the run was. Index is the numerical index of the individual, Label is the label given to the individual sample, Miss is the percentage of loci missing data in that individual, Pop is the putative population or sampling location of the individual (as given in the data input), and cluster gives the cluster that the probability refers to. probability is the posterior mean of the Q value for the given cluster. mp is the number of the label permutation that makes the results of this replicate as congruent as possible with replicate #1. Finally, cluster_relabeled is the label of the cluster once the clusters have been relabeled to be congruent with replicate #1. We will explore that further below.
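Because everything is in one long-format data frame, dplyr summaries are straightforward. For example, here is a minimal sketch (an illustration, not part of the package) that finds, for each individual in each run, the relabeled cluster with the highest posterior mean Q, and then counts how many individuals fall into each cluster:

library(dplyr)
L$q %>%
  group_by(K, Rep, Index) %>%
  slice(which.max(probability)) %>%          # keep the highest-Q cluster for each individual
  group_by(K, Rep, cluster_relabeled) %>%
  summarise(n_individuals = n())             # how many individuals end up in each cluster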
Here we look at the traces:
L$traces
Source: local data frame [324,000 x 7]
K Rep Sweep variable value mp variables_relabeled
1 1 1 1 Alpha 0.998 1 Alpha
2 1 1 2 Alpha 0.977 1 Alpha
3 1 1 3 Alpha 1.023 1 Alpha
4 1 1 4 Alpha 1.035 1 Alpha
5 1 1 5 Alpha 1.089 1 Alpha
6 1 1 6 Alpha 1.134 1 Alpha
7 1 1 7 Alpha 1.092 1 Alpha
8 1 1 8 Alpha 1.032 1 Alpha
9 1 1 9 Alpha 1.012 1 Alpha
10 1 1 10 Alpha 1.039 1 Alpha
.. . ... ... ... ... .. ...
The columns K, Rep, and mp are as above. Sweep tells us which sweep of the MCMC the row refers to; it runs from 1 to 6000 in this case and includes sweeps from the burn-in period. variable tells us which variable we are looking at; here, those are either “Alpha” (the \(\alpha\) variable) or one of the \(F\) parameters (i.e., F1, F2, etc.). variables_relabeled gives the name of the variable after the variables have been relabeled to be (more or less) congruent with replicate 1. value is the value of that variable at that sweep.
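As with the Q values, the long format makes numerical summaries easy. For instance, here is a small sketch (assuming, per the run parameters shown above, that the first 5000 sweeps are burn-in) that computes the post-burn-in mean and standard deviation of \(\alpha\) for each run; these should be similar across reps if the chains agree:

library(dplyr)
L$traces %>%
  filter(variable == "Alpha", Sweep > 5000) %>%   # drop the burn-in sweeps
  group_by(K, Rep) %>%
  summarise(mean_alpha = mean(value),
            sd_alpha   = sd(value))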
Once structure’s outputs are in this type of long-format data frame, it is relatively easy to plot and compare between runs.
First we will plot the traces of \(\alpha\) and the \(F\) parameters and see how they compare across runs.
Here are the \(\alpha\)’s:
ggplot(L$traces %>% filter(variable == "Alpha"), aes(x = Sweep, y = value)) +
geom_line(colour = "blue") +
facet_grid(K ~ Rep)
That is interesting. Not surprisingly, when K is 1, \(\alpha\) meanders all over the place: in that case it is not really used in the model, so there is nothing constraining it.
For \(K>1\) the results look pretty similar between the reps. Let’s drop K = 1 and focus on just the \(K>1\) runs:
ggplot(L$traces %>% filter(variable == "Alpha", K != 1), aes(x = Sweep, y = value)) +
geom_line(colour = "blue") +
facet_grid(K ~ Rep)
When we look at the values of F, we see the labeling issue:
ggplot(L$traces %>% filter(str_detect(variable, "^F")), aes(x = Sweep, y = value, colour = variable)) +
geom_line() +
facet_grid(K ~ Rep)
When \(K>1\) the colors don’t match up across reps because the same cluster does not necessarily get the same label in each rep. We can rectify that by using the relabeled versions of the variables:
ggplot(L$traces %>% filter(str_detect(variables_relabeled, "^F")), aes(x = Sweep, y = value, colour = variables_relabeled)) +
geom_line() +
facet_grid(K ~ Rep)
We see that the course of the F variables during MCMC looks similar between reps for K=1 and K=2, but quite different between reps for K>2. We contend that the runs for K>2 might have to be run for much longer (and even then they may not reliably converge to the same place).
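One way to put a number on that impression (again just a sketch, not part of the package) is to compare the post-burn-in means of the relabeled \(F\) variables across reps at each K; a large spread across reps for a given variable is the numerical counterpart of the discordant traces in the plot:

library(dplyr)
library(stringr)
L$traces %>%
  filter(str_detect(variables_relabeled, "^F"), Sweep > 5000) %>%
  group_by(K, variables_relabeled, Rep) %>%
  summarise(post_mean = mean(value)) %>%                             # post-burn-in mean per rep
  summarise(spread_across_reps = max(post_mean) - min(post_mean))    # range of those means over reps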
Often what we are really interested in are the Q values for the different individuals, so it is important to assess how much those differ between runs. One way to do that is to plot the estimated Q values from each rep after the first against those from the first rep. To do that, we can join the rep-1 results (at each K) onto the data frame of results:
Qs <- L$q %>%
filter(Rep == 1) %>% # get just the first rep
group_by(K, Index, cluster_relabeled) %>% # group by these to preserve the columns
transmute(rep1_cluster = cluster_relabeled, rep1_prob = probability) %>% # rename columns
ungroup %>%
inner_join(L$q) %>% # slap those values back on the original data frame matching K and Rep
filter(Rep != 1) %>% # toss out the rows comparing Rep 1 to Rep 1
mutate(which_cluster = paste(rep1_cluster)) # name a new column with something better for the plot
Joining by: c("K", "Index", "cluster_relabeled")
Here is what that data frame looks like:
Qs
Source: local data frame [6,080 x 13]
K Index cluster_relabeled rep1_cluster rep1_prob Rep Label Miss Pop
1 1 1 1 1 1 2 Z12 0 3
2 1 1 1 1 1 3 Z12 0 3
3 1 2 1 1 1 2 Z13 0 3
4 1 2 1 1 1 3 Z13 0 3
5 1 3 1 1 1 2 Z14 0 3
6 1 3 1 1 1 3 Z14 0 3
7 1 4 1 1 1 2 Z15 0 3
8 1 4 1 1 1 3 Z15 0 3
9 1 5 1 1 1 2 Z16 0 3
10 1 5 1 1 1 3 Z16 0 3
.. . ... ... ... ... ... ... ... ...
Variables not shown: cluster (fctr), probability (dbl), mp (int),
which_cluster (chr)
And now it is easy to compare the Q values in each Rep after Rep 1 with the Q values in Rep 1:
ggplot(Qs, aes(x = rep1_prob, y = probability, colour = which_cluster)) +
geom_point() +
facet_grid(K ~ Rep)
This is interesting, as it shows that for K=2 the three runs all converged to pretty much the same Q values. The “wiggle” off the \(y=x\) line there is due to Monte Carlo variance, not to the chains converging to very different parts of the space, which is clearly what has happened for some of the replicates at K=3 and K=4.
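To put a number on that wiggle (a sketch based on the Qs data frame built above), we can compute the root-mean-square difference between each rep’s Q values and the rep-1 values; if the picture above is right, this should be small for K=2 and considerably larger for some reps at K=3 and K=4:

library(dplyr)
Qs %>%
  group_by(K, Rep) %>%
  summarise(rms_diff_from_rep1 = sqrt(mean((probability - rep1_prob)^2)))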
Let’s see if we can tighten up that Monte Carlo variance by doing longer runs. I have made a new directory, LongerRun, with the structure GUI and set it up to use 5000 sweeps of burn-in and 45000 sweeps after burn-in. Once again, you have to ensure that the RANDOMIZE setting in the mainparams file in that directory is set to 0. Let’s do 4 reps at K=2 and see how that works out.
# do the runs
setwd("~/Desktop/scot-cats/LongerRun") # this must be changed on other computers
structure_runs(outdir = "r-runs-2", K = 2, reps = 4, seed = 9876)
Then slurp up the results
setwd("~/Desktop/scot-cats/LongerRun") # this must be changed on other computers
L2 <- read_and_process_structure_output("r-runs-2")
reading files
finding label permutations for the reps
relabeling the traces and results
Then, compare rep 1 to the others:
# format the data frame
Qs2 <- L2$q %>%
filter(Rep == 1) %>% # get just the first rep
group_by(K, Index, cluster_relabeled) %>% # group by these to preserve the columns
transmute(rep1_cluster = cluster_relabeled, rep1_prob = probability) %>% # rename columns
ungroup %>%
inner_join(L2$q) %>% # slap those values back on the original data frame matching K and Rep
filter(Rep != 1) %>% # toss out the rows comparing Rep 1 to Rep 1
mutate(which_cluster = paste(rep1_cluster)) # name a new column with something better for the plot
Joining by: c("K", "Index", "cluster_relabeled")
# plot those up:
ggplot(Qs2, aes(x = rep1_prob, y = probability, colour = which_cluster)) +
geom_point() +
facet_wrap( ~ Rep, ncol = 3)
Those look pretty concordant apart from a few individuals.
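If you want to see exactly which individuals those are, here is a small sketch that pulls out the rows where a rep’s Q value differs from the rep-1 estimate by more than an (arbitrary) threshold of 0.1:

library(dplyr)
Qs2 %>%
  filter(abs(probability - rep1_prob) > 0.1) %>%                 # the "discordant" rows
  arrange(desc(abs(probability - rep1_prob))) %>%
  select(K, Rep, Label, which_cluster, rep1_prob, probability)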