This document shows how to use the R package Rrunstruct to conduct multiple structure runs from R, based on the parameters chosen for any parameter set specified in the structure front-end. It goes hand in hand with the computer practical in the SISG MCMC course. Before following this document, you will want to have completed the first part of the computer practical and to have done at least one run of structure using the front-end.
The goal here is to see how to use some R tools to parse structure output so that we can quickly assess whether multiple runs of structure's MCMC algorithm give similar results. Along the way we will encounter a couple of challenges, most notably that the labels of the clusters may not be concordant across multiple structure runs. We might also take some time to look at some of the tools from Hadley Wickham's R ecosystem (dplyr, ggplot2, tidyr, stringr, etc.) that are helpful here.
Before we can begin there is some setup you might have to do.
You will need to install the package Rrunstruct. Because the package includes the structure executable file, it is not available on CRAN, but can be installed from GitHub. Doing so requires the devtools package.
devtools::install_github("eriqande/Rrunstruct")
If that doesn’t work, it is likely because you don’t yet have the devtools package. Install it and then try again:
install.packages("devtools")
devtools::install_github("eriqande/Rrunstruct")
To run structure you just need to:

1. Set the working directory to the structure project that holds the parameter set you want to use. On my computer that is ~/Desktop/scot-cats/PlainPars. On your computer, that path will have to be set accordingly.
2. Call structure_runs(), telling it where to put the output, what K values to use, how many reps to do for each, and the seed to use for the first run (the seed gets incremented by one for each successive run).

A Big Note: The mainparams file in the PlainPars directory needs to have the RANDOMIZE variable set to 0 so that the random number generator seed can be set to a different value for each run of structure. This can be achieved by running a set of runs through the scheduler using the front end, or by editing the mainparams file with a simple text editor (or from R, as sketched in the code below).
library(Rrunstruct)
setwd("~/Desktop/scot-cats/PlainPars") # you've got to change this for your own computer!
structure_runs(outdir = "struct-runs-from-r-1", K = 1:4, reps = 3, seed = 5566)
Once those runs have completed, you can take a look at what the outputs look like. All the output files should be in the directory struct-runs-from-r-1 that is inside the PlainPars directory. Let’s see what they look like:
dir(path = "struct-runs-from-r-1")
[1] "struct_results_K_1_rep_1_f" "struct_results_K_1_rep_2_f"
[3] "struct_results_K_1_rep_3_f" "struct_results_K_2_rep_1_f"
[5] "struct_results_K_2_rep_2_f" "struct_results_K_2_rep_3_f"
[7] "struct_results_K_3_rep_1_f" "struct_results_K_3_rep_2_f"
[9] "struct_results_K_3_rep_3_f" "struct_results_K_4_rep_1_f"
[11] "struct_results_K_4_rep_2_f" "struct_results_K_4_rep_3_f"
[13] "struct_stdout_K_1_rep_1" "struct_stdout_K_1_rep_2"
[15] "struct_stdout_K_1_rep_3" "struct_stdout_K_2_rep_1"
[17] "struct_stdout_K_2_rep_2" "struct_stdout_K_2_rep_3"
[19] "struct_stdout_K_3_rep_1" "struct_stdout_K_3_rep_2"
[21] "struct_stdout_K_3_rep_3" "struct_stdout_K_4_rep_1"
[23] "struct_stdout_K_4_rep_2" "struct_stdout_K_4_rep_3"
The naming convention is pretty simple: the *results* files are what structure spits out at the end of each run, and the *stdout* files store what the command-line version of structure prints as it runs. Let’s have a look at the first 50 lines of one of the stdout files, which hold the traces:
cat(readLines("struct-runs-from-r-1/struct_stdout_K_3_rep_1")[1:50], sep = "\n")
----------------------------------------------------
STRUCTURE by Pritchard, Stephens and Donnelly (2000)
and Falush, Stephens and Pritchard (2003)
Code by Pritchard, Falush and Hubisz
Version 2.3.4 (Jul 2012)
----------------------------------------------------
Reading file "mainparams".
datafile is
/Users/eriq/Desktop/scot-cats/project_data
Reading file "extraparams".
Reading file "/Users/eriq/Desktop/scot-cats/project_data".
Number of alleles per locus: min= 9; ave=12.4; max=17
--------------------------------------
Finished initialization; starting MCMC
10000 iterations + 5000 burnin
Rep#: Alpha F1 F2 F3 Ln Like
1: 1.020 0.010 0.010 0.010 --
2: 1.006 0.010 0.010 0.010 --
3: 1.006 0.010 0.010 0.010 --
4: 1.006 0.010 0.010 0.010 --
5: 1.002 0.010 0.010 0.010 --
6: 1.002 0.010 0.010 0.010 --
7: 1.020 0.010 0.010 0.010 --
8: 1.045 0.010 0.010 0.010 --
9: 1.046 0.010 0.010 0.010 --
Rep#: Alpha F1 F2 F3 Ln Like
10: 1.033 0.010 0.010 0.010 --
11: 1.060 0.010 0.010 0.010 --
12: 1.064 0.010 0.010 0.010 --
13: 1.101 0.010 0.010 0.010 --
14: 1.067 0.010 0.010 0.010 --
15: 1.062 0.010 0.010 0.010 --
16: 1.060 0.010 0.010 0.010 --
17: 1.030 0.010 0.010 0.010 --
18: 1.004 0.010 0.013 0.010 --
19: 1.004 0.010 0.013 0.010 --
Rep#: Alpha F1 F2 F3 Ln Like
20: 1.019 0.010 0.013 0.010 --
21: 0.981 0.010 0.013 0.010 --
And let’s look at a bit of one of the result files. Note that you are going to want to extract the “Inferred ancestry of individuals” section from this file:
cat(readLines("struct-runs-from-r-1/struct_results_K_3_rep_1_f")[1:100], sep = "\n")
----------------------------------------------------
STRUCTURE by Pritchard, Stephens and Donnelly (2000)
and Falush, Stephens and Pritchard (2003)
Code by Pritchard, Falush and Hubisz
Version 2.3.4 (Jul 2012)
----------------------------------------------------
Command line arguments: /Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rrunstruct/bin/structure_mac -K 3 -o struct-runs-from-r-1/struct_results_K_3_rep_1 -D 5572
Input File: /Users/eriq/Desktop/scot-cats/project_data
Run parameters:
304 individuals
9 loci
3 populations assumed
5000 Burn-in period
10000 Reps
RANDOMIZE turned off
--------------------------------------------
Proportion of membership of each pre-defined
population in each of the 3 clusters
Given Inferred Clusters Number of
Pop 1 2 3 Individuals
1: 0.053 0.075 0.872 27
2: 0.082 0.150 0.769 61
3: 0.218 0.491 0.291 91
4: 0.074 0.156 0.770 28
5: 0.113 0.256 0.631 19
6: 0.022 0.065 0.913 4
7: 0.765 0.204 0.031 10
8: 0.652 0.307 0.041 18
9: 0.744 0.223 0.033 46
--------------------------------------------
Allele-freq. divergence among pops (Net nucleotide distance),
computed using point estimates of P.
1 2 3
1 - 0.0182 0.1162
2 0.0182 - 0.0887
3 0.1162 0.0887 -
Average distances (expected heterozygosity) between individuals in same cluster:
cluster 1 : 0.7361
cluster 2 : 0.7539
cluster 3 : 0.6622
--------------------------------------------
Estimated Ln Prob of Data = -8716.5
Mean value of ln likelihood = -8506.6
Variance of ln likelihood = 419.7
Mean value of alpha = 0.1001
Mean value of Fst_1 = 0.0422
Mean value of Fst_2 = 0.0299
Mean value of Fst_3 = 0.1794
Inferred ancestry of individuals:
Label (%Miss) Pop: Inferred clusters
1 Z12 (0) 3 : 0.169 0.769 0.062
2 Z13 (0) 3 : 0.103 0.874 0.023
3 Z14 (0) 3 : 0.139 0.661 0.200
4 Z15 (0) 3 : 0.657 0.193 0.150
5 Z16 (0) 3 : 0.153 0.308 0.540
6 Z17 (0) 1 : 0.014 0.015 0.971
7 Z18 (0) 3 : 0.082 0.179 0.739
8 Z19 (0) 3 : 0.579 0.336 0.085
9 Z20 (11) 2 : 0.019 0.115 0.866
10 Z21 (0) 2 : 0.022 0.034 0.944
11 Z30 (0) 3 : 0.608 0.375 0.017
12 Z31 (0) 3 : 0.914 0.071 0.015
13 Z34 (0) 3 : 0.559 0.413 0.027
14 DB1 (0) 2 : 0.086 0.808 0.105
15 DB2 (0) 2 : 0.038 0.483 0.479
16 DB3 (0) 3 : 0.076 0.568 0.356
17 DB5 (0) 2 : 0.459 0.444 0.097
18 DB6 (0) 2 : 0.023 0.025 0.952
19 DB7 (0) 5 : 0.055 0.055 0.889
20 DB8 (11) 3 : 0.874 0.107 0.019
21 DB9 (0) 2 : 0.264 0.465 0.272
22 DB10 (0) 3 : 0.280 0.665 0.054
23 DB11 (0) 3 : 0.064 0.335 0.601
24 DB12 (0) 1 : 0.073 0.075 0.851
25 DB16 (0) 2 : 0.037 0.042 0.921
26 DB17 (0) 3 : 0.083 0.064 0.852
27 DB18 (0) 3 : 0.112 0.058 0.829
28 DB19 (0) 3 : 0.029 0.948 0.023
29 DB21 (22) 5 : 0.038 0.046 0.916
30 DB22 (0) 2 : 0.099 0.083 0.818
31 DB23 (0) 2 : 0.044 0.466 0.489
32 DB24 (0) 3 : 0.415 0.550 0.035
33 DB25 (0) 4 : 0.078 0.084 0.839
Although R’s syntax for text processing is perhaps not as compact as awk’s or sed’s, and it might be slower than Python, it has nice facilities for parsing text files. The package stringr, used in combination with the function readLines(), is particularly easy to use and helpful.
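For example, here is a minimal sketch (not part of Rrunstruct; it relies only on the “Estimated Ln Prob of Data” line that appears in the results files shown above) that uses readLines() and stringr to pull the estimated log-probability of the data out of every results file:

library(stringr)
# paths to the results files (they all end in "_f")
res_files <- dir("struct-runs-from-r-1", pattern = "_f$", full.names = TRUE)
sapply(res_files, function(f) {
  lines <- readLines(f)
  hit <- lines[str_detect(lines, "Estimated Ln Prob of Data")]  # e.g. "Estimated Ln Prob of Data = -8716.5"
  as.numeric(str_extract(hit, "-?[0-9.]+$"))                    # grab the number at the end of that line
})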
We won’t go into the details now, though. Instead we have a single function, read_and_process_structure_output(), that:

- reads all the stdout and results files and wrangles them into long-format (tidy) data frames, and
- figures out how the cluster labels of each replicate should be permuted to be as concordant as possible with replicate 1, and relabels the traces and results accordingly.

If you want the details of all that, you can check the package out on GitHub. I used the packages tidyr and dplyr extensively (but the code is still kind of ugly…).
Here we get all the output into a list of two data frames:
L <- read_and_process_structure_output("struct-runs-from-r-1")
And let’s take a look at the tops of the two data frames in that list.
First, the estimated Q values for each individual:
L$q
Source: local data frame [9,120 x 10]
K Rep Index Label Miss Pop cluster probability mp cluster_relabeled
1 1 1 1 Z12 0 3 1 1 1 1
2 1 1 2 Z13 0 3 1 1 1 1
3 1 1 3 Z14 0 3 1 1 1 1
4 1 1 4 Z15 0 3 1 1 1 1
5 1 1 5 Z16 0 3 1 1 1 1
6 1 1 6 Z17 0 1 1 1 1 1
7 1 1 7 Z18 0 3 1 1 1 1
8 1 1 8 Z19 0 3 1 1 1 1
9 1 1 9 Z20 11 2 1 1 1 1
10 1 1 10 Z21 0 2 1 1 1 1
.. . ... ... ... ... ... ... ... .. ...
The columns K and Rep give the K value assumed in the structure run and which replicate the run was. Index is the numerical index of the individual, Label is the label given to the individual sample, Miss is the percentage of loci missing data in that individual, Pop is the putative population or sampling location of the individual (as given in the data input), and cluster gives the cluster that the probability refers to. probability is the posterior mean of the Q value for the given cluster. mp is the number of the label permutation that makes the results of this replicate as congruent as possible with replicate #1. Finally, cluster_relabeled is the label of the cluster once the clusters have been relabeled to be congruent with replicate #1. We will explore that further below.
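Because everything is in one long-format data frame, dplyr summaries are straightforward. For example, here is a minimal sketch (an illustration, not part of the package) that finds, for each individual in each run, the relabeled cluster with the highest posterior mean Q, and then counts how many individuals fall into each cluster:

library(dplyr)
L$q %>%
  group_by(K, Rep, Index) %>%
  slice(which.max(probability)) %>%          # keep the highest-Q cluster for each individual
  group_by(K, Rep, cluster_relabeled) %>%
  summarise(n_individuals = n())             # how many individuals end up in each cluster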
Here we look at the traces:
L$traces
Source: local data frame [324,000 x 7]
K Rep Sweep variable value mp variables_relabeled
1 1 1 1 Alpha 0.998 1 Alpha
2 1 1 2 Alpha 0.977 1 Alpha
3 1 1 3 Alpha 1.023 1 Alpha
4 1 1 4 Alpha 1.035 1 Alpha
5 1 1 5 Alpha 1.089 1 Alpha
6 1 1 6 Alpha 1.134 1 Alpha
7 1 1 7 Alpha 1.092 1 Alpha
8 1 1 8 Alpha 1.032 1 Alpha
9 1 1 9 Alpha 1.012 1 Alpha
10 1 1 10 Alpha 1.039 1 Alpha
.. . ... ... ... ... .. ...
The columns K, Rep, and mp are as above. Sweep tells us which sweep of the MCMC the row refers to; it runs from 1 to 6000 in this case and includes sweeps from the burn-in period. variable tells us which variable we are looking at; here, those are either “Alpha” (the \(\alpha\) variable) or one of the \(F\) parameters (i.e., F1, F2, etc.). variables_relabeled gives the name of the variable after the variables have been relabeled to be (more or less) congruent with replicate 1. value is the value of that variable at that sweep.
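As with the Q values, the long format makes numerical summaries easy. For instance, here is a small sketch (assuming, per the run parameters shown above, that the first 5000 sweeps are burn-in) that computes the post-burn-in mean and standard deviation of \(\alpha\) for each run; these should be similar across reps if the chains agree:

library(dplyr)
L$traces %>%
  filter(variable == "Alpha", Sweep > 5000) %>%   # drop the burn-in sweeps
  group_by(K, Rep) %>%
  summarise(mean_alpha = mean(value),
            sd_alpha   = sd(value))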
Once structure’s outputs are in this type of long-format data frame, it is relatively easy to plot and compare between runs.
First we will plot the traces of \(\alpha\) and the \(F\) parameters and see how they compare across runs.
Here are the \(\alpha\)’s:
ggplot(L$traces %>% filter(variable == "Alpha"), aes(x = Sweep, y = value)) +
geom_line(colour = "blue") +
facet_grid(K ~ Rep)
That is interesting. Not surprisingly, when K is 1, \(\alpha\) meanders all over the place: in that case it is not really used in the model, so there is nothing constraining it.
For \(K>1\) the results look pretty similar between the reps. Let’s drop K = 1 and focus on just the \(K>1\) runs:
ggplot(L$traces %>% filter(variable == "Alpha", K != 1), aes(x = Sweep, y = value)) +
geom_line(colour = "blue") +
facet_grid(K ~ Rep)
When we look at the values of F, we see the labeling issue:
ggplot(L$traces %>% filter(str_detect(variable, "^F")), aes(x = Sweep, y = value, colour = variable)) +
geom_line() +
facet_grid(K ~ Rep)
When \(K>1\) the colors don’t match up across reps because the same cluster does not necessarily get the same label in each rep. We can rectify that by using the relabeled versions of the variables:
ggplot(L$traces %>% filter(str_detect(variables_relabeled, "^F")), aes(x = Sweep, y = value, colour = variables_relabeled)) +
geom_line() +
facet_grid(K ~ Rep)
We see that the course of the F variables during MCMC looks similar between reps for K=1 and K=2, but quite different between reps for K>2. We contend that the runs for K>2 might have to be run for much longer (and even then they may not reliably converge to the same place).
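One way to put a number on that impression (again just a sketch, not part of the package) is to compare the post-burn-in means of the relabeled \(F\) variables across reps at each K; a large spread across reps for a given variable is the numerical counterpart of the discordant traces in the plot:

library(dplyr)
library(stringr)
L$traces %>%
  filter(str_detect(variables_relabeled, "^F"), Sweep > 5000) %>%
  group_by(K, variables_relabeled, Rep) %>%
  summarise(post_mean = mean(value)) %>%                             # post-burn-in mean per rep
  summarise(spread_across_reps = max(post_mean) - min(post_mean))    # range of those means over reps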
Often what we are really interested in are the Q values for the different individuals, so it is important to assess how much those differ between runs. One way to do that is to plot the estimated Q values from each rep after the first against those from the first rep. To do that, we can join the rep-1 results (at each K) onto the data frame of results:
Qs <- L$q %>%
filter(Rep == 1) %>% # get just the first rep
group_by(K, Index, cluster_relabeled) %>% # group by these to preserve the columns
transmute(rep1_cluster = cluster_relabeled, rep1_prob = probability) %>% # rename columns
ungroup %>%
inner_join(L$q) %>% # slap those values back on the original data frame matching K and Rep
filter(Rep != 1) %>% # toss out the rows comparing Rep 1 to Rep 1
mutate(which_cluster = paste(rep1_cluster)) # name a new column with something better for the plot
Joining by: c("K", "Index", "cluster_relabeled")
Here is what that data frame looks like:
Qs
Source: local data frame [6,080 x 13]
K Index cluster_relabeled rep1_cluster rep1_prob Rep Label Miss Pop
1 1 1 1 1 1 2 Z12 0 3
2 1 1 1 1 1 3 Z12 0 3
3 1 2 1 1 1 2 Z13 0 3
4 1 2 1 1 1 3 Z13 0 3
5 1 3 1 1 1 2 Z14 0 3
6 1 3 1 1 1 3 Z14 0 3
7 1 4 1 1 1 2 Z15 0 3
8 1 4 1 1 1 3 Z15 0 3
9 1 5 1 1 1 2 Z16 0 3
10 1 5 1 1 1 3 Z16 0 3
.. . ... ... ... ... ... ... ... ...
Variables not shown: cluster (fctr), probability (dbl), mp (int),
which_cluster (chr)
And now it is easy to compare the Q values in each Rep after Rep 1 with the Q values in Rep 1:
ggplot(Qs, aes(x = rep1_prob, y = probability, colour = which_cluster)) +
geom_point() +
facet_grid(K ~ Rep)
This is interesting, as it shows that for K=2 the three runs all converged to pretty much the same Q values. The “wiggle” off the \(y=x\) line there is due to Monte Carlo variance, not to the chains converging to very different parts of the space, which is clearly what has happened for some of the replicates at K=3 and K=4.
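To put a number on that wiggle (a sketch based on the Qs data frame built above), we can compute the root-mean-square difference between each rep’s Q values and the rep-1 values; if the picture above is right, this should be small for K=2 and considerably larger for some reps at K=3 and K=4:

library(dplyr)
Qs %>%
  group_by(K, Rep) %>%
  summarise(rms_diff_from_rep1 = sqrt(mean((probability - rep1_prob)^2)))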
Let’s see if we can tighten up that Monte Carlo variance by doing longer runs. I have made a new directory, LongerRun, with the structure GUI and set it up to use 5000 sweeps of burn-in and 45000 sweeps after burn-in. Once again, you have to ensure that the RANDOMIZE setting in the mainparams file in that directory is set to 0. Let’s do 4 reps at K=2 and see how that works out.
# do the runs
setwd("~/Desktop/scot-cats/LongerRun") # this must be changed on other computers
structure_runs(outdir = "r-runs-2", K = 2, reps = 4, seed = 9876)
Then slurp up the results
setwd("~/Desktop/scot-cats/LongerRun") # this must be changed on other computers
L2 <- read_and_process_structure_output("r-runs-2")
reading files
finding label permutations for the reps
relabeling the traces and results
Then, compare rep 1 to the others:
# format the data frame
Qs2 <- L2$q %>%
filter(Rep == 1) %>% # get just the first rep
group_by(K, Index, cluster_relabeled) %>% # group by these to preserve the columns
transmute(rep1_cluster = cluster_relabeled, rep1_prob = probability) %>% # rename columns
ungroup %>%
inner_join(L2$q) %>% # slap those values back on the original data frame matching K and Rep
filter(Rep != 1) %>% # toss out the rows comparing Rep 1 to Rep 1
mutate(which_cluster = paste(rep1_cluster)) # name a new column with something better for the plot
Joining by: c("K", "Index", "cluster_relabeled")
# plot those up:
ggplot(Qs2, aes(x = rep1_prob, y = probability, colour = which_cluster)) +
geom_point() +
facet_wrap( ~ Rep, ncol = 3)
Those look pretty concordant apart from a few individuals.
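If you want to see exactly which individuals those are, here is a small sketch that pulls out the rows where a rep’s Q value differs from the rep-1 estimate by more than an (arbitrary) threshold of 0.1:

library(dplyr)
Qs2 %>%
  filter(abs(probability - rep1_prob) > 0.1) %>%                 # the "discordant" rows
  arrange(desc(abs(probability - rep1_prob))) %>%
  select(K, Rep, Label, which_cluster, rep1_prob, probability)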