This script was written to demonstrate some of the functions I have written, as well as some I have modified from the library SOLOMON written by Mark Christie
sim.pop.gts()
gt.summary()*
m.data.counter()
six.char()
three.char()
dup.identifier()
uniq.alleles()
NEP.calc()
*denotes I used some of Mark Christie's code to make part of the given function
alf.freq()
sim.gt.data()
power.analysis()
exclusion()
no.par.bayes()
There is a figure below that uses ggplot2(), so the library needs to be loaded
library(ggplot2)
If you don't have ggplot2() installed, then delete the # and run this code. Then go back up and load the library.
#install.packages("ggplot2")
The exact location of the personalized function script will change for everyone, so make sure to change the working directory accordingly
#loading in functions
source("C:/Users/Nick.Sard/Documents/Research/R/Source scripts/share.R")
First, we need to make some fake data
I first decided to create this function so that someone trying out the scripts would some fake data to work with. Thus this function simulates microsatellite like data that are poisson distributed. I say microsatellite like data because I have not created functions within sim.pop.gts to mimic any evolutionary processes. There are several arguments for the function:
1) n - the number of individuals in each population. Default - 1000
2) loci - the number of loci desired. Default - 10
3) missing.data - the proportion of individuals (randomly assigned) at each locus that will have missing data (coded as “000”). Default - 0
4) pops - the number of populations desired. Default - 2
5) k - the mean number of alleles at each locus. Default - 20
6) k.sd - the standard deviation around the mean number of alleles at each locus - intended to add a bit of variation among loci. Default - 3
The output is a data frame - the first two columns identify the population ID and the individual ID, followed by genotypes at each locus
tmp <- sim.pop.gts(n=1000, loci=10, missing.data=0.01, pops=2, k=20, k.sd=2)
head(tmp)
## Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1 Locus_5.2
## 1 POP.1 ID.1 103 103 115 115 105 117 115 130 122 131
## 2 POP.1 ID.2 103 115 115 117 106 134 109 115 101 113
## 3 POP.1 ID.3 115 122 109 117 124 129 107 107 102 130
## 4 POP.1 ID.4 103 104 109 117 104 124 107 109 109 141
## 5 POP.1 ID.5 115 122 106 115 102 102 109 113 122 128
## 8 POP.1 ID.6 109 118 106 108 104 130 109 130 109 114
## Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2
## 1 101 115 115 130 108 119 111 130 115 120
## 2 101 118 107 109 108 111 121 127 114 128
## 3 115 140 109 136 103 119 111 125 110 122
## 4 115 121 130 130 111 123 111 127 000 000
## 5 115 118 115 115 111 114 105 121 115 122
## 8 113 118 107 115 103 119 114 125 114 120
The format of the simulated data is what is accepted for all other functions. For new R users - the output is automatically wrapped within the R console because there are many columns.
Some of this function was originally written by Mark Christie, but I have modified it and added additional code to calculate other population genetic descriptive statistics. For each locus the function can calculate number of alleles, observed and expected heterozygosity, the percent number of individuals that were genotyped, and Fis. It also counts the number of individuals in each population and the number missing genotypes. This function can calculate the statistics averaged over all loci for each population, as well as over all populations. Here are the arguments:
1) population - the data frame with the populations, sample ids, and genotypes. Default - none -
2) pop.sum - will summarize stats overall loci if TRUE. Default - FALSE
3) one.pop - option that identifies (TRUE or FALSE) if only one population is being analyzed. Default - TRUE
4) missing.data - used to identify missing data. By default it is “000”, but the user can change it to whatever they use to identify missing data.
5) round.num - identified what decimal point the output is rounded to. Default - 4
Expected heterozygosity is corrected for sample size (Nei 1977)
Fis is calculated as Fis = 1 - (Ho/He)
#overall gt.summary disregarding population IDs (by removing the first column)
x <- gt.summary(population=tmp[,-1],pop.sum=F,one.pop=T,missing.data="000")
head(x)
## Pop Locus A He Ho Fis N N.Missing P.GT
## 1 Pop Locus_1.1 27 0.94 0.91 0.03 2000 20 99
## 2 Pop Locus_2.1 30 0.94 0.88 0.06 2000 20 99
## 3 Pop Locus_3.1 29 0.94 0.89 0.05 2000 20 99
## 4 Pop Locus_4.1 30 0.94 0.89 0.05 2000 20 99
## 5 Pop Locus_5.1 30 0.93 0.91 0.02 2000 20 99
## 6 Pop Locus_6.1 28 0.94 0.91 0.03 2000 20 99
#doing the same thing but with population IDs
y <- gt.summary(population=tmp,pop.sum=F,one.pop=F,missing.data="000")
head(y)
## Pop Locus A He Ho Fis N N.Missing P.GT
## 1 POP.1 Locus_1.1 18 0.91 0.91 0.00 1000 10 99
## 2 POP.1 Locus_2.1 19 0.90 0.90 0.00 1000 10 99
## 3 POP.1 Locus_3.1 18 0.90 0.91 -0.01 1000 10 99
## 4 POP.1 Locus_4.1 18 0.90 0.88 0.02 1000 10 99
## 5 POP.1 Locus_5.1 18 0.89 0.89 0.00 1000 10 99
## 6 POP.1 Locus_6.1 18 0.91 0.91 0.00 1000 10 99
#now averaging everything accross all loci - getting standard errors too
z <- gt.summary(population=tmp,pop.sum=T,one.pop=F,missing.data="000")
z
## Pop N A He Ho Fis N.Missing P.GT A.se He.se Ho.se Fis.se N.Missing.se P.GT.se
## 1 POP.1 1000 17.6 0.897 0.897 0.000 10 99 0.3712 0.0054 0.0060 0.0026 0 0
## 2 POP.2 1000 17.5 0.902 0.906 -0.004 10 99 0.4282 0.0042 0.0056 0.0031 0 0
This function was created by Mark Christie, and it summarizes the alleles at each locus by counting them and calculating their observed frequency. I have barely changed anything in it. The only things I have done are to make it into a command line function, rounded the frequencies of each allele to 5 decimal places, and put it into a for loop so that it does the calculations for each population (if desired). There are four arguments in the function:
1) population - the data frame with the populations, sample ids, and three character genotypes. Default - none
2) one.pop - option that identifies (TRUE or FALSE) if only one population is being analyzed. Default - FALSE
3) opt - determines what type of output is desired: Default - 1
4) missing.data - used to identify missing data. By default it is “000”, but the user can change it to whatever they use to identify missing data.
Here are some examples:
#for each population ID
a <- alf.freq(population=tmp, one.pop=F, opt=1,missing.data="000")
head(a)
## Pops Locus allele count frequency
## 2 POP.1 Locus_1.1 102 119 0.06010
## 3 POP.1 Locus_1.1 103 148 0.07475
## 4 POP.1 Locus_1.1 104 141 0.07121
## 5 POP.1 Locus_1.1 106 185 0.09343
## 6 POP.1 Locus_1.1 109 227 0.11465
## 7 POP.1 Locus_1.1 115 250 0.12626
### just a quick visualization to look at the distribution of alleles
ggplot(a, aes(x=as.factor(allele), y=count))+
facet_grid(Pops~Locus)+
geom_bar(stat="identity")+
labs(y="Number of alleles", x= "Alleles")+
theme_bw()+
theme(axis.title = element_text(size = 32),
axis.text.y = element_text(size = 24, colour = "Black"),
axis.text.x = element_blank(),
strip.text = element_text(size = 18),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank())
#overall - no population IDs
b <- alf.freq(population=tmp[,-1], one.pop=T, opt=1)
head(b)
## Pops Locus allele count frequency
## 2 Pop Locus_1.1 102 288 0.07273
## 3 Pop Locus_1.1 103 310 0.07828
## 4 Pop Locus_1.1 104 310 0.07828
## 5 Pop Locus_1.1 106 407 0.10278
## 6 Pop Locus_1.1 107 234 0.05909
## 7 Pop Locus_1.1 109 227 0.05732
#same as above but with option 2 -simplifed output that can be fed right into SOLOMON to simulate genotypes
c <-alf.freq(population=tmp, one.pop=F, opt=2)
head(c)
## Locus frequency
## 2 1 0.06010
## 3 1 0.07475
## 4 1 0.07121
## 5 1 0.09343
## 6 1 0.11465
## 7 1 0.12626
d <-alf.freq(population=tmp[,-1],one.pop=T,opt=2)
head(d)
## Locus frequency
## 2 1 0.07273
## 3 1 0.07828
## 4 1 0.07828
## 5 1 0.10278
## 6 1 0.05909
## 7 1 0.05732
This function counts how many loci are missing data for each individual in the dataset. It requires data that are in three character form. Arguments include:
1) tmp - table with genotypes in three character form and sample names in column one. Default - none
2) opt - determines what type of output is desired: . Default - 1
3) missing.data - default “000”, but if it's different such as “999”, this can be changed.
4) one.pop - option that identifies (TRUE or FALSE) if only one population is being analyzed. Default - TRUE
This bit of code will result in a extra column being added to the end of the data frame. It summarizes how many loci were missing data for each individual. Using table() the user will see that there are a few individuals with one missing genotype and a couple that are missing more than that.
e <- m.data.counter(tmp=tmp[,-1],one.pop=T,missing.data="000")
head(e)
## Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1 Locus_5.2
## 1 ID.1 103 103 115 115 105 117 115 130 122 131
## 2 ID.2 103 115 115 117 106 134 109 115 101 113
## 3 ID.3 115 122 109 117 124 129 107 107 102 130
## 4 ID.4 103 104 109 117 104 124 107 109 109 141
## 5 ID.5 115 122 106 115 102 102 109 113 122 128
## 8 ID.6 109 118 106 108 104 130 109 130 109 114
## Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2 m.data
## 1 101 115 115 130 108 119 111 130 115 120 0
## 2 101 118 107 109 108 111 121 127 114 128 0
## 3 115 140 109 136 103 119 111 125 110 122 0
## 4 115 121 130 130 111 123 111 127 000 000 1
## 5 115 118 115 115 111 114 105 121 115 122 0
## 8 113 118 107 115 103 119 114 125 114 120 0
table(e$m.data)
##
## 0 1 2 3
## 1810 181 8 1
f <- m.data.counter(tmp=tmp,opt=2,one.pop=F)
head(f)
## Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1 Locus_5.2
## 1 POP.1 ID.1 103 103 115 115 105 117 115 130 122 131
## 2 POP.1 ID.2 103 115 115 117 106 134 109 115 101 113
## 3 POP.1 ID.3 115 122 109 117 124 129 107 107 102 130
## 4 POP.1 ID.4 103 104 109 117 104 124 107 109 109 141
## 5 POP.1 ID.5 115 122 106 115 102 102 109 113 122 128
## 8 POP.1 ID.6 109 118 106 108 104 130 109 130 109 114
## Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2 m.data
## 1 101 115 115 130 108 119 111 130 115 120 0
## 2 101 118 107 109 108 111 121 127 114 128 0
## 3 115 140 109 136 103 119 111 125 110 122 0
## 4 115 121 130 130 111 123 111 127 000 000 1
## 5 115 118 115 115 111 114 105 121 115 122 0
## 8 113 118 107 115 103 119 114 125 114 120 0
This function takes three character formatted genotypes and puts them into six character form. Arguments include:
1) population - the data frame with the populations, sample ids, and three character genotypes. Default - none
2) one.pop - option that identifies (TRUE or FALSE) if only one population is being analyzed. Default - TRUE
3) missing.data - default “000”, but if it's different such as “999”, this can be changed.
g <- six.char(population=tmp[,-1],one.pop=T,missing.data="000")
head(g)
## Sample.name Locus_1.1 Locus_2.1 Locus_3.1 Locus_4.1 Locus_5.1 Locus_6.1 Locus_7.1 Locus_8.1 Locus_9.1 Locus_10.1
## 1 ID.1 103103 115115 105117 115130 122131 101115 115130 108119 111130 115120
## 2 ID.2 103115 115117 106134 109115 101113 101118 107109 108111 121127 114128
## 3 ID.3 115122 109117 124129 107107 102130 115140 109136 103119 111125 110122
## 4 ID.4 103104 109117 104124 107109 109141 115121 130130 111123 111127 000000
## 5 ID.5 115122 106115 102102 109113 122128 115118 115115 111114 105121 115122
## 6 ID.6 109118 106108 104130 109130 109114 113118 107115 103119 114125 114120
g <- six.char(tmp,one.pop=F)
head(g)
## Pops Sample.name Locus_1.1 Locus_2.1 Locus_3.1 Locus_4.1 Locus_5.1 Locus_6.1 Locus_7.1 Locus_8.1 Locus_9.1
## 1 POP.1 ID.1 103103 115115 105117 115130 122131 101115 115130 108119 111130
## 2 POP.1 ID.2 103115 115117 106134 109115 101113 101118 107109 108111 121127
## 3 POP.1 ID.3 115122 109117 124129 107107 102130 115140 109136 103119 111125
## 4 POP.1 ID.4 103104 109117 104124 107109 109141 115121 130130 111123 111127
## 5 POP.1 ID.5 115122 106115 102102 109113 122128 115118 115115 111114 105121
## 6 POP.1 ID.6 109118 106108 104130 109130 109114 113118 107115 103119 114125
## Locus_10.1
## 1 115120
## 2 114128
## 3 110122
## 4 000000
## 5 115122
## 6 114120
This function takes genotypes in six character form and puts them into three character form. Arguments include:
1) tmp - the data frame with the populations, sample ids, and three character genotypes. Default - none
2) one.pop - option that identifies (TRUE or FALSE) if only one population is being analyzed. Default - TRUE
h <- three.char(tmp=g[,-1], one.pop=T)
head(h)
## Sample.name Locus_1.1.1 Locus_1.1.2 Locus_2.1.1 Locus_2.1.2 Locus_3.1.1 Locus_3.1.2 Locus_4.1.1 Locus_4.1.2
## 1 ID.1 103 103 115 115 105 117 115 130
## 2 ID.2 103 115 115 117 106 134 109 115
## 3 ID.3 115 122 109 117 124 129 107 107
## 4 ID.4 103 104 109 117 104 124 107 109
## 5 ID.5 115 122 106 115 102 102 109 113
## 6 ID.6 109 118 106 108 104 130 109 130
## Locus_5.1.1 Locus_5.1.2 Locus_6.1.1 Locus_6.1.2 Locus_7.1.1 Locus_7.1.2 Locus_8.1.1 Locus_8.1.2 Locus_9.1.1
## 1 122 131 101 115 115 130 108 119 111
## 2 101 113 101 118 107 109 108 111 121
## 3 102 130 115 140 109 136 103 119 111
## 4 109 141 115 121 130 130 111 123 111
## 5 122 128 115 118 115 115 111 114 105
## 6 109 114 113 118 107 115 103 119 114
## Locus_9.1.2 Locus_10.1.1 Locus_10.1.2
## 1 130 115 120
## 2 127 114 128
## 3 125 110 122
## 4 127 000 000
## 5 121 115 122
## 6 125 114 120
h <- three.char(g, one.pop=F)
head(h)
## Pops Sample.name Locus_1.1.1 Locus_1.1.2 Locus_2.1.1 Locus_2.1.2 Locus_3.1.1 Locus_3.1.2 Locus_4.1.1 Locus_4.1.2
## 1 POP.1 ID.1 103 103 115 115 105 117 115 130
## 2 POP.1 ID.2 103 115 115 117 106 134 109 115
## 3 POP.1 ID.3 115 122 109 117 124 129 107 107
## 4 POP.1 ID.4 103 104 109 117 104 124 107 109
## 5 POP.1 ID.5 115 122 106 115 102 102 109 113
## 6 POP.1 ID.6 109 118 106 108 104 130 109 130
## Locus_5.1.1 Locus_5.1.2 Locus_6.1.1 Locus_6.1.2 Locus_7.1.1 Locus_7.1.2 Locus_8.1.1 Locus_8.1.2 Locus_9.1.1
## 1 122 131 101 115 115 130 108 119 111
## 2 101 113 101 118 107 109 108 111 121
## 3 102 130 115 140 109 136 103 119 111
## 4 109 141 115 121 130 130 111 123 111
## 5 122 128 115 118 115 115 111 114 105
## 6 109 114 113 118 107 115 103 119 114
## Locus_9.1.2 Locus_10.1.1 Locus_10.1.2
## 1 130 115 120
## 2 127 114 128
## 3 125 110 122
## 4 127 000 000
## 5 121 115 122
## 6 125 114 120
I have intentions of building on the three.char() and six.char() functions to prepare data for genepop and structure in the near future.
This function identifies any genotypes that are identical and reports which ones are in a new column. Arguments include:
1) tmp - the data frame with the populations, sample ids, and three character genotypes. Default - none
2) one.pop - option that identifies (TRUE or FALSE) if only one population is being analyzed. Default - TRUE
Note: When no duplicated genotypes present it just returns a simple message and doesnt add a new column
p <- dup.identifer(tmp,one.pop=T)
## [1] "There were no duplicated genotypes present in data"
head(p)
## Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1 Locus_5.2
## 1 POP.1 ID.1 103 103 115 115 105 117 115 130 122 131
## 2 POP.1 ID.2 103 115 115 117 106 134 109 115 101 113
## 3 POP.1 ID.3 115 122 109 117 124 129 107 107 102 130
## 4 POP.1 ID.4 103 104 109 117 104 124 107 109 109 141
## 5 POP.1 ID.5 115 122 106 115 102 102 109 113 122 128
## 8 POP.1 ID.6 109 118 106 108 104 130 109 130 109 114
## Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2
## 1 101 115 115 130 108 119 111 130 115 120
## 2 101 118 107 109 108 111 121 127 114 128
## 3 115 140 109 136 103 119 111 125 110 122
## 4 115 121 130 130 111 123 111 127 000 000
## 5 115 118 115 115 111 114 105 121 115 122
## 8 113 118 107 115 103 119 114 125 114 120
There are no duplicated individuals, so I will make some duplicates to demonstrate it can find them
#getting five genotypes to duplicate, one twice
head(tmp)
## Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1 Locus_5.2
## 1 POP.1 ID.1 103 103 115 115 105 117 115 130 122 131
## 2 POP.1 ID.2 103 115 115 117 106 134 109 115 101 113
## 3 POP.1 ID.3 115 122 109 117 124 129 107 107 102 130
## 4 POP.1 ID.4 103 104 109 117 104 124 107 109 109 141
## 5 POP.1 ID.5 115 122 106 115 102 102 109 113 122 128
## 8 POP.1 ID.6 109 118 106 108 104 130 109 130 109 114
## Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2
## 1 101 115 115 130 108 119 111 130 115 120
## 2 101 118 107 109 108 111 121 127 114 128
## 3 115 140 109 136 103 119 111 125 110 122
## 4 115 121 130 130 111 123 111 127 000 000
## 5 115 118 115 115 111 114 105 121 115 122
## 8 113 118 107 115 103 119 114 125 114 120
x <- sample(1:nrow(tmp),size=4,replace=F)
y <- sample(x,1,F)
x <- c(x,y)
x
## [1] 130 1600 867 876 1600
#getting those genotypes and renaming sample.name column by adding a letter to the end of them
tmp1 <- tmp[x,]
tmp1$Sample.name <- paste(tmp1$Sample.name,letters[seq(from=1,to=nrow(tmp1))],sep=".")
tmp1
## Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1
## 228 POP.1 ID.130.a 120 120 107 108 102 105 109 128 114
## 1151 POP.2 ID.600.b 111 136 107 113 119 120 104 114 113
## 731 POP.1 ID.867.c 000 000 109 117 106 125 115 116 114
## 744 POP.1 ID.876.d 109 115 108 115 104 125 109 124 101
## 1151.1 POP.2 ID.600.e 111 136 107 113 119 120 104 114 113
## Locus_5.2 Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2
## 228 130 109 125 107 107 119 119 125 130 128 128
## 1151 114 109 114 105 113 114 116 116 134 106 129
## 731 127 118 128 114 115 103 108 114 125 122 134
## 744 109 106 121 105 114 111 123 111 121 113 132
## 1151.1 114 109 114 105 113 114 116 116 134 106 129
#adding genotypes to tmp
tmp <- rbind(tmp,tmp1)
head(tmp)
## Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1 Locus_5.2
## 1 POP.1 ID.1 103 103 115 115 105 117 115 130 122 131
## 2 POP.1 ID.2 103 115 115 117 106 134 109 115 101 113
## 3 POP.1 ID.3 115 122 109 117 124 129 107 107 102 130
## 4 POP.1 ID.4 103 104 109 117 104 124 107 109 109 141
## 5 POP.1 ID.5 115 122 106 115 102 102 109 113 122 128
## 8 POP.1 ID.6 109 118 106 108 104 130 109 130 109 114
## Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2
## 1 101 115 115 130 108 119 111 130 115 120
## 2 101 118 107 109 108 111 121 127 114 128
## 3 115 140 109 136 103 119 111 125 110 122
## 4 115 121 130 130 111 123 111 127 000 000
## 5 115 118 115 115 111 114 105 121 115 122
## 8 113 118 107 115 103 119 114 125 114 120
Now the duplicated function - there should be four genotypes that were duplicated - 3 were found twice and 1 was found three times
p <- dup.identifer(tmp,one.pop=F)
table(p$ids)
##
## dup.group.1 dup.group.2 dup.group.3 dup.group.4 unique
## 2 3 2 2 1996
p[p$ids != "unique",]
## Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1
## 228 POP.1 ID.130 120 120 107 108 102 105 109 128 114
## 731 POP.1 ID.867 000 000 109 117 106 125 115 116 114
## 744 POP.1 ID.876 109 115 108 115 104 125 109 124 101
## 1151 POP.2 ID.600 111 136 107 113 119 120 104 114 113
## 2282 POP.1 ID.130.a 120 120 107 108 102 105 109 128 114
## 11511 POP.2 ID.600.b 111 136 107 113 119 120 104 114 113
## 7312 POP.1 ID.867.c 000 000 109 117 106 125 115 116 114
## 7442 POP.1 ID.876.d 109 115 108 115 104 125 109 124 101
## 1151.1 POP.2 ID.600.e 111 136 107 113 119 120 104 114 113
## Locus_5.2 Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2
## 228 130 109 125 107 107 119 119 125 130 128 128
## 731 127 118 128 114 115 103 108 114 125 122 134
## 744 109 106 121 105 114 111 123 111 121 113 132
## 1151 114 109 114 105 113 114 116 116 134 106 129
## 2282 130 109 125 107 107 119 119 125 130 128 128
## 11511 114 109 114 105 113 114 116 116 134 106 129
## 7312 127 118 128 114 115 103 108 114 125 122 134
## 7442 109 106 121 105 114 111 123 111 121 113 132
## 1151.1 114 109 114 105 113 114 116 116 134 106 129
## ids
## 228 dup.group.1
## 731 dup.group.3
## 744 dup.group.4
## 1151 dup.group.2
## 2282 dup.group.1
## 11511 dup.group.2
## 7312 dup.group.3
## 7442 dup.group.4
## 1151.1 dup.group.2
head(p)
## Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1 Locus_5.2
## 1 POP.1 ID.1 103 103 115 115 105 117 115 130 122 131
## 2 POP.1 ID.2 103 115 115 117 106 134 109 115 101 113
## 3 POP.1 ID.3 115 122 109 117 124 129 107 107 102 130
## 4 POP.1 ID.4 103 104 109 117 104 124 107 109 109 141
## 5 POP.1 ID.5 115 122 106 115 102 102 109 113 122 128
## 8 POP.1 ID.6 109 118 106 108 104 130 109 130 109 114
## Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2 ids
## 1 101 115 115 130 108 119 111 130 115 120 unique
## 2 101 118 107 109 108 111 121 127 114 128 unique
## 3 115 140 109 136 103 119 111 125 110 122 unique
## 4 115 121 130 130 111 123 111 127 000 000 unique
## 5 115 118 115 115 111 114 105 121 115 122 unique
## 8 113 118 107 115 103 119 114 125 114 120 unique
This function was written by Mark Christie to simulate genotypic data for parents and offspring. I have barely changed anything in it. The only things I have done are to make it into a command line function, and add the option to return the output to the console (if desired). There are several arguments in the function:
1) afreq - allele frequencies from alf.freq(,opt=2, one.pop=T). Default - none
2) Nparents - the number of parents desired. Default - 2
3) Noffs_perpair - the number of offspring produced by each pair of female and male. Default - 1
4) error - the frequncies of random genotypic error used. Default - 0
5) Nunrelated - number of unrelated genotypes to be included. Default - 0
6) Nsibs - the number of full siblings in the dataset. Default - 0
7) write2file=F - accepts a TRUE or FALSE. FALSE returns output to console and TRUE saves files to current working directory. Default - FALSE
This function identifies alleles that are only found in the parents or the offspring. I suppose it could be used to compare two populations to find private alleles too. It only requiries output from alf.freq(,opt=1) that was run on a dataset with two “populations” - in this case parents and offspring.
1) afreq - allele frequencies from alf.freq(,opt=1, one.pop=T). Default - none
First I need some allele frequencies for parents and offspring only, so I will randomly assign parents and offspring IDs under the Pop column. I will use allele frequency data I created early in Christie's sim.gt.data().
tmp1 <- sim.gt.data(afreq=d,Nparents=10,Noffs_perpair=2,error=0,Nunrelated=0,Nsibs=0,write2file=F)
head(tmp1)
## Type IDs Locus.1 Locus.1.1 Locus.2 Locus.2.1 Locus.3 Locus.3.1 Locus.4 Locus.4.1 Locus.5 Locus.5.1 Locus.6
## 1 Parents Mom 1 105 106 105 106 100 107 112 114 100 104 104
## 2 Parents Mom 2 111 112 105 108 100 103 111 112 108 110 106
## 3 Parents Mom 3 103 106 101 104 104 118 107 115 103 103 104
## 4 Parents Mom 4 102 106 107 113 104 106 104 114 111 111 112
## 5 Parents Mom 5 101 115 107 108 100 110 108 111 107 109 102
## 6 Parents Mom 6 103 105 102 104 107 114 107 114 107 107 104
## Locus.6.1 Locus.7 Locus.7.1 Locus.8 Locus.8.1 Locus.9 Locus.9.1 Locus.10 Locus.10.1
## 1 108 107 111 110 110 110 111 100 116
## 2 111 100 109 102 112 107 110 104 117
## 3 121 111 111 109 110 105 106 102 118
## 4 115 103 104 102 113 101 108 102 105
## 5 118 107 107 101 102 101 112 107 116
## 6 111 106 110 101 110 106 112 108 119
tail(tmp1)
## Type IDs Locus.1 Locus.1.1 Locus.2 Locus.2.1 Locus.3 Locus.3.1 Locus.4 Locus.4.1 Locus.5 Locus.5.1
## 35 Offspring Offspring 8.15 100 108 100 108 103 111 107 114 105 107
## 36 Offspring Offspring 8.16 100 108 101 113 101 111 114 119 105 107
## 37 Offspring Offspring 9.17 102 107 105 106 100 102 112 114 103 113
## 38 Offspring Offspring 9.18 102 112 105 109 100 102 107 112 103 113
## 39 Offspring Offspring 10.19 105 107 102 120 102 114 114 116 101 103
## 40 Offspring Offspring 10.20 104 110 102 120 102 114 105 114 103 117
## Locus.6 Locus.6.1 Locus.7 Locus.7.1 Locus.8 Locus.8.1 Locus.9 Locus.9.1 Locus.10 Locus.10.1
## 35 102 110 101 109 102 112 101 109 116 117
## 36 102 110 102 109 110 112 101 109 103 117
## 37 106 116 107 115 109 115 100 106 103 115
## 38 104 106 106 107 114 115 103 106 102 119
## 39 111 116 102 122 108 108 103 104 105 106
## 40 111 116 102 113 108 108 103 104 105 106
I will run alf.freq() on the two “Pops” (Parents and offspring)
k <- alf.freq(tmp1,one.pop=F)
head(k)
## Pops Locus allele count frequency
## 1 Parents Locus.1 100 2 0.050
## 2 Parents Locus.1 101 3 0.075
## 3 Parents Locus.1 102 2 0.050
## 4 Parents Locus.1 103 4 0.100
## 5 Parents Locus.1 104 4 0.100
## 6 Parents Locus.1 105 4 0.100
Now for unique alleles() - should have no unique alleles in offspring, but there could be some in the parents because so few offspring were produced by each pair
m <- uniq.alleles(k)
m
## Pops Locus allele count frequency
## 40 Parents Locus.3 112 2 0.050
## 44 Parents Locus.3 118 1 0.025
## 55 Parents Locus.4 113 1 0.025
## 60 Parents Locus.5 100 1 0.025
## 79 Parents Locus.6 105 2 0.050
## 82 Parents Locus.6 109 1 0.025
## 104 Parents Locus.7 116 1 0.025
## 106 Parents Locus.8 100 1 0.025
## 111 Parents Locus.8 107 1 0.025
## 131 Parents Locus.9 108 1 0.025
## 136 Parents Locus.9 113 1 0.025
## 138 Parents Locus.9 118 1 0.025
## 148 Parents Locus.10 110 1 0.025
Now I will do the same process fo my random population data, I think that there will be at least some, likely many alleles, that are unique to one population or the other
First run alf.freq() on the two “Pops”“
k <- alf.freq(tmp,one.pop=F, opt=1)
head(k)
## Pops Locus allele count frequency
## 2 POP.1 Locus_1.1 102 119 0.05998
## 3 POP.1 Locus_1.1 103 148 0.07460
## 4 POP.1 Locus_1.1 104 141 0.07107
## 5 POP.1 Locus_1.1 106 185 0.09325
## 6 POP.1 Locus_1.1 109 228 0.11492
## 7 POP.1 Locus_1.1 115 251 0.12651
tail(k)
## Pops Locus allele count frequency
## 1491 POP.2 Locus_10.1 135 45 0.022681
## 1591 POP.2 Locus_10.1 140 12 0.006048
## 1681 POP.2 Locus_10.1 142 13 0.006552
## 1771 POP.2 Locus_10.1 152 2 0.001008
## 1841 POP.2 Locus_10.1 153 1 0.000504
## 192 POP.2 Locus_10.1 155 1 0.000504
Now for unique alleles()
m <- uniq.alleles(k)
head(m)
## Pops Locus allele count frequency
## 6 POP.1 Locus_1.1 109 228 0.11492
## 7 POP.1 Locus_1.1 115 251 0.12651
## 8 POP.1 Locus_1.1 118 254 0.12802
## 9 POP.1 Locus_1.1 120 217 0.10938
## 10 POP.1 Locus_1.1 122 208 0.10484
## 11 POP.1 Locus_1.1 131 71 0.03579
tail(m)
## Pops Locus allele count frequency
## 1491 POP.2 Locus_10.1 135 45 0.022681
## 1591 POP.2 Locus_10.1 140 12 0.006048
## 1681 POP.2 Locus_10.1 142 13 0.006552
## 1771 POP.2 Locus_10.1 152 2 0.001008
## 1841 POP.2 Locus_10.1 153 1 0.000504
## 192 POP.2 Locus_10.1 155 1 0.000504
dim(m)
## [1] 227 5
This function was written by Mark Christie to calculate the number of false parent offspring pairs within a data at a user defined number of loci. I have barely changed anything in it. The only things I have done are to make it into a command line function, and add the option to calculate the expected number of false parent offspring pairs for multiple values for loci_number at once (if desired). There are four arguments in the function:
Defaults are indicated.
1) adults - data.frame with parents ID and genotypes. Default - none
2) offspring - data.frame with offspring ID and genotypes. Default - none
3) loci_number - number of loci being evaluated - maximum is the number of loci genotypes for parents and offspring. Default - none
4) opt - determines what type of output is desired: Default 1
To do a power analysis I need to put the parents and offspring into different objects
par <- tmp1[grepl("Mom",tmp1$IDs),-1]
off <- tmp1[grepl("Offspring", tmp1$IDs),-1]
#first writes to file - default in SOLOMON
power.analysis(adults=par,offspring=off,loci_number=10,opt=1)
## Warning: appending column names to file
## Warning: appending column names to file
## [1] "Locus power.txt saved to current working directory"
## [1] "Expected Number of False Pairs.txt saved to current working directory"
#returns to console with option two
power.analysis(adults=par,offspring=off,loci_number=10,opt=2)
## loci_number Expected_Number_of_False_Pairs Pr_delta
## 1 10 0.001206 6.031e-06
#loci_number can be more than one number
n <- power.analysis(adults=par,offspring=off,loci_number=9:10,opt=2)
n
## loci_number Expected_Number_of_False_Pairs Pr_delta
## 1 9 0.004519 2.259e-05
## 2 10 0.004519 2.259e-05
This function calculates non-exclusion probabilities for a single parent, second parent, or a parent pair randomly assigning to any given offspring by chance. It also calculates the probability of identity among other randomly selected individuals in the population, as well as among siblings. The output is identical to CERVUS. Arguments include:
1) tmp - allele frequencies from alf.freq(,opt=2, one.pop=T). Default - none
2) opt - determines what type of output is desired: Default - 1
NE_P1 - Non-exclusion probability for single parent (Jamieson and Taylor 1997)
NE_P2 - Non-exclusion probability for second parent (Jamieson and Taylor 1997)
NE_P3 – Non-exclusion probability for parent pair (Jamieson and Taylor 1997)
PrId - Non-exclusion probability for identity (Waits et al. 2001)
PrIdSid - Non-exclusion probability for identity among siblings (Waits et al. 2001)
#locus specific and overall non-exclusion powers stats
NEP.calc(tmp=d,opt=1)
## Locus NE_1P NE_2P NE_PP PrId PrIdSib
## 1 1 7.743e-01 8.730e-01 9.722e-01 7.484e-03 2.831e-01
## 2 2 7.730e-01 8.721e-01 9.719e-01 7.579e-03 2.834e-01
## 3 3 7.853e-01 8.799e-01 9.749e-01 6.713e-03 2.812e-01
## 4 4 7.868e-01 8.806e-01 9.755e-01 6.606e-03 2.812e-01
## 5 5 7.606e-01 8.636e-01 9.689e-01 8.544e-03 2.861e-01
## 6 6 7.711e-01 8.707e-01 9.715e-01 7.721e-03 2.839e-01
## 7 7 7.370e-01 8.484e-01 9.619e-01 1.054e-02 2.903e-01
## 8 8 7.685e-01 8.691e-01 9.708e-01 7.910e-03 2.843e-01
## 9 9 7.375e-01 8.489e-01 9.621e-01 1.043e-02 2.900e-01
## 10 10 7.929e-01 8.843e-01 9.769e-01 6.221e-03 2.802e-01
## 11 Combined 4.255e-07 1.426e-09 4.141e-16 8.976e-22 3.455e-06
#just overall stats
NEP.calc(tmp=d,opt=2)
## Locus NE_1P NE_2P NE_PP PrId PrIdSib
## 11 Combined 4.255e-07 1.426e-09 4.141e-16 8.976e-22 3.455e-06