Function Demo

Created on March 14th, 2014

Written by: Nick Sard

ABOUT

This script was written to demonstrate some of the functions I have written, as well as some I have modified from the library SOLOMON written by Mark Christie

My functions:

sim.pop.gts()
gt.summary()*
m.data.counter()
six.char()
three.char()
dup.identifier()
uniq.alleles()
NEP.calc()

*denotes I used some of Mark Christie's code to make part of the given function

Mark Christie's functions:

alf.freq()
sim.gt.data()
power.analysis()
exclusion()
no.par.bayes()

First the library ggplot2() needs to be loaded:

There is a figure below that uses ggplot2(), so the library needs to be loaded

library(ggplot2)

If you don't have ggplot2() installed, then delete the # and run this code. Then go back up and load the library.

#install.packages("ggplot2")

The exact location of the personalized function script will change for everyone, so make sure to change the working directory accordingly

#loading in functions
source("C:/Users/Nick.Sard/Documents/Research/R/Source scripts/share.R")

First, we need to make some fake data

sim.pop.gts()

I first decided to create this function so that someone trying out the scripts would some fake data to work with. Thus this function simulates microsatellite like data that are poisson distributed. I say microsatellite like data because I have not created functions within sim.pop.gts to mimic any evolutionary processes. There are several arguments for the function:

1) n - the number of individuals in each population. Default - 1000
2) loci - the number of loci desired. Default - 10
3) missing.data - the proportion of individuals (randomly assigned) at each locus that will have missing data (coded as “000”). Default - 0
4) pops - the number of populations desired. Default - 2
5) k - the mean number of alleles at each locus. Default - 20
6) k.sd - the standard deviation around the mean number of alleles at each locus - intended to add a bit of variation among loci. Default - 3

The output is a data frame - the first two columns identify the population ID and the individual ID, followed by genotypes at each locus

tmp <- sim.pop.gts(n=1000, loci=10, missing.data=0.01, pops=2, k=20, k.sd=2)
head(tmp)
##    Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1 Locus_5.2
## 1 POP.1        ID.1       103       103       115       115       105       117       115       130       122       131
## 2 POP.1        ID.2       103       115       115       117       106       134       109       115       101       113
## 3 POP.1        ID.3       115       122       109       117       124       129       107       107       102       130
## 4 POP.1        ID.4       103       104       109       117       104       124       107       109       109       141
## 5 POP.1        ID.5       115       122       106       115       102       102       109       113       122       128
## 8 POP.1        ID.6       109       118       106       108       104       130       109       130       109       114
##   Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2
## 1       101       115       115       130       108       119       111       130        115        120
## 2       101       118       107       109       108       111       121       127        114        128
## 3       115       140       109       136       103       119       111       125        110        122
## 4       115       121       130       130       111       123       111       127        000        000
## 5       115       118       115       115       111       114       105       121        115        122
## 8       113       118       107       115       103       119       114       125        114        120

The format of the simulated data is what is accepted for all other functions. For new R users - the output is automatically wrapped within the R console because there are many columns.

gt.summary()

Some of this function was originally written by Mark Christie, but I have modified it and added additional code to calculate other population genetic descriptive statistics. For each locus the function can calculate number of alleles, observed and expected heterozygosity, the percent number of individuals that were genotyped, and Fis. It also counts the number of individuals in each population and the number missing genotypes. This function can calculate the statistics averaged over all loci for each population, as well as over all populations. Here are the arguments:

1) population - the data frame with the populations, sample ids, and genotypes. Default - none -
2) pop.sum - will summarize stats overall loci if TRUE. Default - FALSE
3) one.pop - option that identifies (TRUE or FALSE) if only one population is being analyzed. Default - TRUE
4) missing.data - used to identify missing data. By default it is “000”, but the user can change it to whatever they use to identify missing data.
5) round.num - identified what decimal point the output is rounded to. Default - 4

Note:

Expected heterozygosity is corrected for sample size (Nei 1977)
Fis is calculated as Fis = 1 - (Ho/He)

#overall gt.summary disregarding population IDs (by removing the first column)
x <- gt.summary(population=tmp[,-1],pop.sum=F,one.pop=T,missing.data="000")
head(x)
##   Pop     Locus  A   He   Ho  Fis    N N.Missing P.GT
## 1 Pop Locus_1.1 27 0.94 0.91 0.03 2000        20   99
## 2 Pop Locus_2.1 30 0.94 0.88 0.06 2000        20   99
## 3 Pop Locus_3.1 29 0.94 0.89 0.05 2000        20   99
## 4 Pop Locus_4.1 30 0.94 0.89 0.05 2000        20   99
## 5 Pop Locus_5.1 30 0.93 0.91 0.02 2000        20   99
## 6 Pop Locus_6.1 28 0.94 0.91 0.03 2000        20   99
#doing the same thing but with population IDs
y <- gt.summary(population=tmp,pop.sum=F,one.pop=F,missing.data="000")
head(y)
##     Pop     Locus  A   He   Ho   Fis    N N.Missing P.GT
## 1 POP.1 Locus_1.1 18 0.91 0.91  0.00 1000        10   99
## 2 POP.1 Locus_2.1 19 0.90 0.90  0.00 1000        10   99
## 3 POP.1 Locus_3.1 18 0.90 0.91 -0.01 1000        10   99
## 4 POP.1 Locus_4.1 18 0.90 0.88  0.02 1000        10   99
## 5 POP.1 Locus_5.1 18 0.89 0.89  0.00 1000        10   99
## 6 POP.1 Locus_6.1 18 0.91 0.91  0.00 1000        10   99
#now averaging everything accross all loci - getting standard errors too
z <- gt.summary(population=tmp,pop.sum=T,one.pop=F,missing.data="000")
z
##     Pop    N    A    He    Ho    Fis N.Missing P.GT   A.se  He.se  Ho.se Fis.se N.Missing.se P.GT.se
## 1 POP.1 1000 17.6 0.897 0.897  0.000        10   99 0.3712 0.0054 0.0060 0.0026            0       0
## 2 POP.2 1000 17.5 0.902 0.906 -0.004        10   99 0.4282 0.0042 0.0056 0.0031            0       0

alf.freq()

This function was created by Mark Christie, and it summarizes the alleles at each locus by counting them and calculating their observed frequency. I have barely changed anything in it. The only things I have done are to make it into a command line function, rounded the frequencies of each allele to 5 decimal places, and put it into a for loop so that it does the calculations for each population (if desired). There are four arguments in the function:

1) population - the data frame with the populations, sample ids, and three character genotypes. Default - none
2) one.pop - option that identifies (TRUE or FALSE) if only one population is being analyzed. Default - FALSE
3) opt - determines what type of output is desired: Default - 1

4) missing.data - used to identify missing data. By default it is “000”, but the user can change it to whatever they use to identify missing data.

Here are some examples:

#for each population ID
a <- alf.freq(population=tmp, one.pop=F, opt=1,missing.data="000")
head(a)
##    Pops     Locus allele count frequency
## 2 POP.1 Locus_1.1    102   119   0.06010
## 3 POP.1 Locus_1.1    103   148   0.07475
## 4 POP.1 Locus_1.1    104   141   0.07121
## 5 POP.1 Locus_1.1    106   185   0.09343
## 6 POP.1 Locus_1.1    109   227   0.11465
## 7 POP.1 Locus_1.1    115   250   0.12626
### just a quick visualization to look at the distribution of alleles
ggplot(a, aes(x=as.factor(allele), y=count))+
  facet_grid(Pops~Locus)+
  geom_bar(stat="identity")+
  labs(y="Number of alleles", x= "Alleles")+
  theme_bw()+
  theme(axis.title = element_text(size = 32),
        axis.text.y = element_text(size = 24, colour = "Black"),
        axis.text.x = element_blank(),
        strip.text = element_text(size = 18),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

plot of chunk unnamed-chunk-10

#overall - no population IDs
b <- alf.freq(population=tmp[,-1], one.pop=T, opt=1)
head(b)
##   Pops     Locus allele count frequency
## 2  Pop Locus_1.1    102   288   0.07273
## 3  Pop Locus_1.1    103   310   0.07828
## 4  Pop Locus_1.1    104   310   0.07828
## 5  Pop Locus_1.1    106   407   0.10278
## 6  Pop Locus_1.1    107   234   0.05909
## 7  Pop Locus_1.1    109   227   0.05732
#same as above but with option 2 -simplifed output that can be fed right into SOLOMON to simulate genotypes
c <-alf.freq(population=tmp, one.pop=F, opt=2)
head(c)
##   Locus frequency
## 2     1   0.06010
## 3     1   0.07475
## 4     1   0.07121
## 5     1   0.09343
## 6     1   0.11465
## 7     1   0.12626
d <-alf.freq(population=tmp[,-1],one.pop=T,opt=2)
head(d)
##   Locus frequency
## 2     1   0.07273
## 3     1   0.07828
## 4     1   0.07828
## 5     1   0.10278
## 6     1   0.05909
## 7     1   0.05732

m.data.counter()

This function counts how many loci are missing data for each individual in the dataset. It requires data that are in three character form. Arguments include:
1) tmp - table with genotypes in three character form and sample names in column one. Default - none
2) opt - determines what type of output is desired: . Default - 1

3) missing.data - default “000”, but if it's different such as “999”, this can be changed.
4) one.pop - option that identifies (TRUE or FALSE) if only one population is being analyzed. Default - TRUE

This bit of code will result in a extra column being added to the end of the data frame. It summarizes how many loci were missing data for each individual. Using table() the user will see that there are a few individuals with one missing genotype and a couple that are missing more than that.

e <- m.data.counter(tmp=tmp[,-1],one.pop=T,missing.data="000")
head(e)
##   Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1 Locus_5.2
## 1        ID.1       103       103       115       115       105       117       115       130       122       131
## 2        ID.2       103       115       115       117       106       134       109       115       101       113
## 3        ID.3       115       122       109       117       124       129       107       107       102       130
## 4        ID.4       103       104       109       117       104       124       107       109       109       141
## 5        ID.5       115       122       106       115       102       102       109       113       122       128
## 8        ID.6       109       118       106       108       104       130       109       130       109       114
##   Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2 m.data
## 1       101       115       115       130       108       119       111       130        115        120      0
## 2       101       118       107       109       108       111       121       127        114        128      0
## 3       115       140       109       136       103       119       111       125        110        122      0
## 4       115       121       130       130       111       123       111       127        000        000      1
## 5       115       118       115       115       111       114       105       121        115        122      0
## 8       113       118       107       115       103       119       114       125        114        120      0
table(e$m.data)
## 
##    0    1    2    3 
## 1810  181    8    1
f <- m.data.counter(tmp=tmp,opt=2,one.pop=F)
head(f)
##    Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1 Locus_5.2
## 1 POP.1        ID.1       103       103       115       115       105       117       115       130       122       131
## 2 POP.1        ID.2       103       115       115       117       106       134       109       115       101       113
## 3 POP.1        ID.3       115       122       109       117       124       129       107       107       102       130
## 4 POP.1        ID.4       103       104       109       117       104       124       107       109       109       141
## 5 POP.1        ID.5       115       122       106       115       102       102       109       113       122       128
## 8 POP.1        ID.6       109       118       106       108       104       130       109       130       109       114
##   Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2 m.data
## 1       101       115       115       130       108       119       111       130        115        120      0
## 2       101       118       107       109       108       111       121       127        114        128      0
## 3       115       140       109       136       103       119       111       125        110        122      0
## 4       115       121       130       130       111       123       111       127        000        000      1
## 5       115       118       115       115       111       114       105       121        115        122      0
## 8       113       118       107       115       103       119       114       125        114        120      0

six.char()

This function takes three character formatted genotypes and puts them into six character form. Arguments include:

1) population - the data frame with the populations, sample ids, and three character genotypes. Default - none
2) one.pop - option that identifies (TRUE or FALSE) if only one population is being analyzed. Default - TRUE
3) missing.data - default “000”, but if it's different such as “999”, this can be changed.

g <- six.char(population=tmp[,-1],one.pop=T,missing.data="000")
head(g)
##   Sample.name Locus_1.1 Locus_2.1 Locus_3.1 Locus_4.1 Locus_5.1 Locus_6.1 Locus_7.1 Locus_8.1 Locus_9.1 Locus_10.1
## 1        ID.1    103103    115115    105117    115130    122131    101115    115130    108119    111130     115120
## 2        ID.2    103115    115117    106134    109115    101113    101118    107109    108111    121127     114128
## 3        ID.3    115122    109117    124129    107107    102130    115140    109136    103119    111125     110122
## 4        ID.4    103104    109117    104124    107109    109141    115121    130130    111123    111127     000000
## 5        ID.5    115122    106115    102102    109113    122128    115118    115115    111114    105121     115122
## 6        ID.6    109118    106108    104130    109130    109114    113118    107115    103119    114125     114120
g <- six.char(tmp,one.pop=F)
head(g)
##    Pops Sample.name Locus_1.1 Locus_2.1 Locus_3.1 Locus_4.1 Locus_5.1 Locus_6.1 Locus_7.1 Locus_8.1 Locus_9.1
## 1 POP.1        ID.1    103103    115115    105117    115130    122131    101115    115130    108119    111130
## 2 POP.1        ID.2    103115    115117    106134    109115    101113    101118    107109    108111    121127
## 3 POP.1        ID.3    115122    109117    124129    107107    102130    115140    109136    103119    111125
## 4 POP.1        ID.4    103104    109117    104124    107109    109141    115121    130130    111123    111127
## 5 POP.1        ID.5    115122    106115    102102    109113    122128    115118    115115    111114    105121
## 6 POP.1        ID.6    109118    106108    104130    109130    109114    113118    107115    103119    114125
##   Locus_10.1
## 1     115120
## 2     114128
## 3     110122
## 4     000000
## 5     115122
## 6     114120

three.char()

This function takes genotypes in six character form and puts them into three character form. Arguments include:

1) tmp - the data frame with the populations, sample ids, and three character genotypes. Default - none
2) one.pop - option that identifies (TRUE or FALSE) if only one population is being analyzed. Default - TRUE

h <- three.char(tmp=g[,-1], one.pop=T)
head(h)
##   Sample.name Locus_1.1.1 Locus_1.1.2 Locus_2.1.1 Locus_2.1.2 Locus_3.1.1 Locus_3.1.2 Locus_4.1.1 Locus_4.1.2
## 1        ID.1         103         103         115         115         105         117         115         130
## 2        ID.2         103         115         115         117         106         134         109         115
## 3        ID.3         115         122         109         117         124         129         107         107
## 4        ID.4         103         104         109         117         104         124         107         109
## 5        ID.5         115         122         106         115         102         102         109         113
## 6        ID.6         109         118         106         108         104         130         109         130
##   Locus_5.1.1 Locus_5.1.2 Locus_6.1.1 Locus_6.1.2 Locus_7.1.1 Locus_7.1.2 Locus_8.1.1 Locus_8.1.2 Locus_9.1.1
## 1         122         131         101         115         115         130         108         119         111
## 2         101         113         101         118         107         109         108         111         121
## 3         102         130         115         140         109         136         103         119         111
## 4         109         141         115         121         130         130         111         123         111
## 5         122         128         115         118         115         115         111         114         105
## 6         109         114         113         118         107         115         103         119         114
##   Locus_9.1.2 Locus_10.1.1 Locus_10.1.2
## 1         130          115          120
## 2         127          114          128
## 3         125          110          122
## 4         127          000          000
## 5         121          115          122
## 6         125          114          120
h <- three.char(g, one.pop=F)
head(h)
##    Pops Sample.name Locus_1.1.1 Locus_1.1.2 Locus_2.1.1 Locus_2.1.2 Locus_3.1.1 Locus_3.1.2 Locus_4.1.1 Locus_4.1.2
## 1 POP.1        ID.1         103         103         115         115         105         117         115         130
## 2 POP.1        ID.2         103         115         115         117         106         134         109         115
## 3 POP.1        ID.3         115         122         109         117         124         129         107         107
## 4 POP.1        ID.4         103         104         109         117         104         124         107         109
## 5 POP.1        ID.5         115         122         106         115         102         102         109         113
## 6 POP.1        ID.6         109         118         106         108         104         130         109         130
##   Locus_5.1.1 Locus_5.1.2 Locus_6.1.1 Locus_6.1.2 Locus_7.1.1 Locus_7.1.2 Locus_8.1.1 Locus_8.1.2 Locus_9.1.1
## 1         122         131         101         115         115         130         108         119         111
## 2         101         113         101         118         107         109         108         111         121
## 3         102         130         115         140         109         136         103         119         111
## 4         109         141         115         121         130         130         111         123         111
## 5         122         128         115         118         115         115         111         114         105
## 6         109         114         113         118         107         115         103         119         114
##   Locus_9.1.2 Locus_10.1.1 Locus_10.1.2
## 1         130          115          120
## 2         127          114          128
## 3         125          110          122
## 4         127          000          000
## 5         121          115          122
## 6         125          114          120

I have intentions of building on the three.char() and six.char() functions to prepare data for genepop and structure in the near future.

dup.identifer()

This function identifies any genotypes that are identical and reports which ones are in a new column. Arguments include:

1) tmp - the data frame with the populations, sample ids, and three character genotypes. Default - none
2) one.pop - option that identifies (TRUE or FALSE) if only one population is being analyzed. Default - TRUE

Note: When no duplicated genotypes present it just returns a simple message and doesnt add a new column

p <- dup.identifer(tmp,one.pop=T)
## [1] "There were no duplicated genotypes present in data"
head(p)
##    Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1 Locus_5.2
## 1 POP.1        ID.1       103       103       115       115       105       117       115       130       122       131
## 2 POP.1        ID.2       103       115       115       117       106       134       109       115       101       113
## 3 POP.1        ID.3       115       122       109       117       124       129       107       107       102       130
## 4 POP.1        ID.4       103       104       109       117       104       124       107       109       109       141
## 5 POP.1        ID.5       115       122       106       115       102       102       109       113       122       128
## 8 POP.1        ID.6       109       118       106       108       104       130       109       130       109       114
##   Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2
## 1       101       115       115       130       108       119       111       130        115        120
## 2       101       118       107       109       108       111       121       127        114        128
## 3       115       140       109       136       103       119       111       125        110        122
## 4       115       121       130       130       111       123       111       127        000        000
## 5       115       118       115       115       111       114       105       121        115        122
## 8       113       118       107       115       103       119       114       125        114        120

There are no duplicated individuals, so I will make some duplicates to demonstrate it can find them

#getting five genotypes to duplicate, one twice
head(tmp)
##    Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1 Locus_5.2
## 1 POP.1        ID.1       103       103       115       115       105       117       115       130       122       131
## 2 POP.1        ID.2       103       115       115       117       106       134       109       115       101       113
## 3 POP.1        ID.3       115       122       109       117       124       129       107       107       102       130
## 4 POP.1        ID.4       103       104       109       117       104       124       107       109       109       141
## 5 POP.1        ID.5       115       122       106       115       102       102       109       113       122       128
## 8 POP.1        ID.6       109       118       106       108       104       130       109       130       109       114
##   Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2
## 1       101       115       115       130       108       119       111       130        115        120
## 2       101       118       107       109       108       111       121       127        114        128
## 3       115       140       109       136       103       119       111       125        110        122
## 4       115       121       130       130       111       123       111       127        000        000
## 5       115       118       115       115       111       114       105       121        115        122
## 8       113       118       107       115       103       119       114       125        114        120
x <- sample(1:nrow(tmp),size=4,replace=F)
y <- sample(x,1,F)
x <- c(x,y)
x
## [1]  130 1600  867  876 1600

#getting those genotypes and renaming sample.name column by adding a letter to the end of them
tmp1 <- tmp[x,]
tmp1$Sample.name <- paste(tmp1$Sample.name,letters[seq(from=1,to=nrow(tmp1))],sep=".")
tmp1
##         Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1
## 228    POP.1    ID.130.a       120       120       107       108       102       105       109       128       114
## 1151   POP.2    ID.600.b       111       136       107       113       119       120       104       114       113
## 731    POP.1    ID.867.c       000       000       109       117       106       125       115       116       114
## 744    POP.1    ID.876.d       109       115       108       115       104       125       109       124       101
## 1151.1 POP.2    ID.600.e       111       136       107       113       119       120       104       114       113
##        Locus_5.2 Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2
## 228          130       109       125       107       107       119       119       125       130        128        128
## 1151         114       109       114       105       113       114       116       116       134        106        129
## 731          127       118       128       114       115       103       108       114       125        122        134
## 744          109       106       121       105       114       111       123       111       121        113        132
## 1151.1       114       109       114       105       113       114       116       116       134        106        129

#adding genotypes to tmp
tmp <- rbind(tmp,tmp1)
head(tmp)
##    Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1 Locus_5.2
## 1 POP.1        ID.1       103       103       115       115       105       117       115       130       122       131
## 2 POP.1        ID.2       103       115       115       117       106       134       109       115       101       113
## 3 POP.1        ID.3       115       122       109       117       124       129       107       107       102       130
## 4 POP.1        ID.4       103       104       109       117       104       124       107       109       109       141
## 5 POP.1        ID.5       115       122       106       115       102       102       109       113       122       128
## 8 POP.1        ID.6       109       118       106       108       104       130       109       130       109       114
##   Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2
## 1       101       115       115       130       108       119       111       130        115        120
## 2       101       118       107       109       108       111       121       127        114        128
## 3       115       140       109       136       103       119       111       125        110        122
## 4       115       121       130       130       111       123       111       127        000        000
## 5       115       118       115       115       111       114       105       121        115        122
## 8       113       118       107       115       103       119       114       125        114        120

Now the duplicated function - there should be four genotypes that were duplicated - 3 were found twice and 1 was found three times

p <- dup.identifer(tmp,one.pop=F)
table(p$ids)
## 
## dup.group.1 dup.group.2 dup.group.3 dup.group.4      unique 
##           2           3           2           2        1996
p[p$ids != "unique",]
##         Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1
## 228    POP.1      ID.130       120       120       107       108       102       105       109       128       114
## 731    POP.1      ID.867       000       000       109       117       106       125       115       116       114
## 744    POP.1      ID.876       109       115       108       115       104       125       109       124       101
## 1151   POP.2      ID.600       111       136       107       113       119       120       104       114       113
## 2282   POP.1    ID.130.a       120       120       107       108       102       105       109       128       114
## 11511  POP.2    ID.600.b       111       136       107       113       119       120       104       114       113
## 7312   POP.1    ID.867.c       000       000       109       117       106       125       115       116       114
## 7442   POP.1    ID.876.d       109       115       108       115       104       125       109       124       101
## 1151.1 POP.2    ID.600.e       111       136       107       113       119       120       104       114       113
##        Locus_5.2 Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2
## 228          130       109       125       107       107       119       119       125       130        128        128
## 731          127       118       128       114       115       103       108       114       125        122        134
## 744          109       106       121       105       114       111       123       111       121        113        132
## 1151         114       109       114       105       113       114       116       116       134        106        129
## 2282         130       109       125       107       107       119       119       125       130        128        128
## 11511        114       109       114       105       113       114       116       116       134        106        129
## 7312         127       118       128       114       115       103       108       114       125        122        134
## 7442         109       106       121       105       114       111       123       111       121        113        132
## 1151.1       114       109       114       105       113       114       116       116       134        106        129
##                ids
## 228    dup.group.1
## 731    dup.group.3
## 744    dup.group.4
## 1151   dup.group.2
## 2282   dup.group.1
## 11511  dup.group.2
## 7312   dup.group.3
## 7442   dup.group.4
## 1151.1 dup.group.2
head(p)
##    Pops Sample.name Locus_1.1 Locus_1.2 Locus_2.1 Locus_2.2 Locus_3.1 Locus_3.2 Locus_4.1 Locus_4.2 Locus_5.1 Locus_5.2
## 1 POP.1        ID.1       103       103       115       115       105       117       115       130       122       131
## 2 POP.1        ID.2       103       115       115       117       106       134       109       115       101       113
## 3 POP.1        ID.3       115       122       109       117       124       129       107       107       102       130
## 4 POP.1        ID.4       103       104       109       117       104       124       107       109       109       141
## 5 POP.1        ID.5       115       122       106       115       102       102       109       113       122       128
## 8 POP.1        ID.6       109       118       106       108       104       130       109       130       109       114
##   Locus_6.1 Locus_6.2 Locus_7.1 Locus_7.2 Locus_8.1 Locus_8.2 Locus_9.1 Locus_9.2 Locus_10.1 Locus_10.2    ids
## 1       101       115       115       130       108       119       111       130        115        120 unique
## 2       101       118       107       109       108       111       121       127        114        128 unique
## 3       115       140       109       136       103       119       111       125        110        122 unique
## 4       115       121       130       130       111       123       111       127        000        000 unique
## 5       115       118       115       115       111       114       105       121        115        122 unique
## 8       113       118       107       115       103       119       114       125        114        120 unique

Moving to parentage specific functions

sim.gt.data()

This function was written by Mark Christie to simulate genotypic data for parents and offspring. I have barely changed anything in it. The only things I have done are to make it into a command line function, and add the option to return the output to the console (if desired). There are several arguments in the function:

1) afreq - allele frequencies from alf.freq(,opt=2, one.pop=T). Default - none
2) Nparents - the number of parents desired. Default - 2
3) Noffs_perpair - the number of offspring produced by each pair of female and male. Default - 1
4) error - the frequncies of random genotypic error used. Default - 0
5) Nunrelated - number of unrelated genotypes to be included. Default - 0
6) Nsibs - the number of full siblings in the dataset. Default - 0
7) write2file=F - accepts a TRUE or FALSE. FALSE returns output to console and TRUE saves files to current working directory. Default - FALSE

uniq.alleles()

This function identifies alleles that are only found in the parents or the offspring. I suppose it could be used to compare two populations to find private alleles too. It only requiries output from alf.freq(,opt=1) that was run on a dataset with two “populations” - in this case parents and offspring.

1) afreq - allele frequencies from alf.freq(,opt=1, one.pop=T). Default - none

First I need some allele frequencies for parents and offspring only, so I will randomly assign parents and offspring IDs under the Pop column. I will use allele frequency data I created early in Christie's sim.gt.data().

tmp1 <- sim.gt.data(afreq=d,Nparents=10,Noffs_perpair=2,error=0,Nunrelated=0,Nsibs=0,write2file=F)
head(tmp1)
##      Type   IDs Locus.1 Locus.1.1 Locus.2 Locus.2.1 Locus.3 Locus.3.1 Locus.4 Locus.4.1 Locus.5 Locus.5.1 Locus.6
## 1 Parents Mom 1     105       106     105       106     100       107     112       114     100       104     104
## 2 Parents Mom 2     111       112     105       108     100       103     111       112     108       110     106
## 3 Parents Mom 3     103       106     101       104     104       118     107       115     103       103     104
## 4 Parents Mom 4     102       106     107       113     104       106     104       114     111       111     112
## 5 Parents Mom 5     101       115     107       108     100       110     108       111     107       109     102
## 6 Parents Mom 6     103       105     102       104     107       114     107       114     107       107     104
##   Locus.6.1 Locus.7 Locus.7.1 Locus.8 Locus.8.1 Locus.9 Locus.9.1 Locus.10 Locus.10.1
## 1       108     107       111     110       110     110       111      100        116
## 2       111     100       109     102       112     107       110      104        117
## 3       121     111       111     109       110     105       106      102        118
## 4       115     103       104     102       113     101       108      102        105
## 5       118     107       107     101       102     101       112      107        116
## 6       111     106       110     101       110     106       112      108        119
tail(tmp1)
##         Type             IDs Locus.1 Locus.1.1 Locus.2 Locus.2.1 Locus.3 Locus.3.1 Locus.4 Locus.4.1 Locus.5 Locus.5.1
## 35 Offspring  Offspring 8.15     100       108     100       108     103       111     107       114     105       107
## 36 Offspring  Offspring 8.16     100       108     101       113     101       111     114       119     105       107
## 37 Offspring  Offspring 9.17     102       107     105       106     100       102     112       114     103       113
## 38 Offspring  Offspring 9.18     102       112     105       109     100       102     107       112     103       113
## 39 Offspring Offspring 10.19     105       107     102       120     102       114     114       116     101       103
## 40 Offspring Offspring 10.20     104       110     102       120     102       114     105       114     103       117
##    Locus.6 Locus.6.1 Locus.7 Locus.7.1 Locus.8 Locus.8.1 Locus.9 Locus.9.1 Locus.10 Locus.10.1
## 35     102       110     101       109     102       112     101       109      116        117
## 36     102       110     102       109     110       112     101       109      103        117
## 37     106       116     107       115     109       115     100       106      103        115
## 38     104       106     106       107     114       115     103       106      102        119
## 39     111       116     102       122     108       108     103       104      105        106
## 40     111       116     102       113     108       108     103       104      105        106

I will run alf.freq() on the two “Pops” (Parents and offspring)

k <- alf.freq(tmp1,one.pop=F)
head(k)
##      Pops   Locus allele count frequency
## 1 Parents Locus.1    100     2     0.050
## 2 Parents Locus.1    101     3     0.075
## 3 Parents Locus.1    102     2     0.050
## 4 Parents Locus.1    103     4     0.100
## 5 Parents Locus.1    104     4     0.100
## 6 Parents Locus.1    105     4     0.100

Now for unique alleles() - should have no unique alleles in offspring, but there could be some in the parents because so few offspring were produced by each pair

m <- uniq.alleles(k)
m
##        Pops    Locus allele count frequency
## 40  Parents  Locus.3    112     2     0.050
## 44  Parents  Locus.3    118     1     0.025
## 55  Parents  Locus.4    113     1     0.025
## 60  Parents  Locus.5    100     1     0.025
## 79  Parents  Locus.6    105     2     0.050
## 82  Parents  Locus.6    109     1     0.025
## 104 Parents  Locus.7    116     1     0.025
## 106 Parents  Locus.8    100     1     0.025
## 111 Parents  Locus.8    107     1     0.025
## 131 Parents  Locus.9    108     1     0.025
## 136 Parents  Locus.9    113     1     0.025
## 138 Parents  Locus.9    118     1     0.025
## 148 Parents Locus.10    110     1     0.025

Now I will do the same process fo my random population data, I think that there will be at least some, likely many alleles, that are unique to one population or the other

First run alf.freq() on the two “Pops”“

k <- alf.freq(tmp,one.pop=F, opt=1)
head(k)
##    Pops     Locus allele count frequency
## 2 POP.1 Locus_1.1    102   119   0.05998
## 3 POP.1 Locus_1.1    103   148   0.07460
## 4 POP.1 Locus_1.1    104   141   0.07107
## 5 POP.1 Locus_1.1    106   185   0.09325
## 6 POP.1 Locus_1.1    109   228   0.11492
## 7 POP.1 Locus_1.1    115   251   0.12651
tail(k)
##       Pops      Locus allele count frequency
## 1491 POP.2 Locus_10.1    135    45  0.022681
## 1591 POP.2 Locus_10.1    140    12  0.006048
## 1681 POP.2 Locus_10.1    142    13  0.006552
## 1771 POP.2 Locus_10.1    152     2  0.001008
## 1841 POP.2 Locus_10.1    153     1  0.000504
## 192  POP.2 Locus_10.1    155     1  0.000504

Now for unique alleles()

m <- uniq.alleles(k)
head(m)
##     Pops     Locus allele count frequency
## 6  POP.1 Locus_1.1    109   228   0.11492
## 7  POP.1 Locus_1.1    115   251   0.12651
## 8  POP.1 Locus_1.1    118   254   0.12802
## 9  POP.1 Locus_1.1    120   217   0.10938
## 10 POP.1 Locus_1.1    122   208   0.10484
## 11 POP.1 Locus_1.1    131    71   0.03579
tail(m)
##       Pops      Locus allele count frequency
## 1491 POP.2 Locus_10.1    135    45  0.022681
## 1591 POP.2 Locus_10.1    140    12  0.006048
## 1681 POP.2 Locus_10.1    142    13  0.006552
## 1771 POP.2 Locus_10.1    152     2  0.001008
## 1841 POP.2 Locus_10.1    153     1  0.000504
## 192  POP.2 Locus_10.1    155     1  0.000504
dim(m)
## [1] 227   5

power.analysis()

This function was written by Mark Christie to calculate the number of false parent offspring pairs within a data at a user defined number of loci. I have barely changed anything in it. The only things I have done are to make it into a command line function, and add the option to calculate the expected number of false parent offspring pairs for multiple values for loci_number at once (if desired). There are four arguments in the function:

Defaults are indicated.

1) adults - data.frame with parents ID and genotypes. Default - none
2) offspring - data.frame with offspring ID and genotypes. Default - none
3) loci_number - number of loci being evaluated - maximum is the number of loci genotypes for parents and offspring. Default - none
4) opt - determines what type of output is desired: Default 1

To do a power analysis I need to put the parents and offspring into different objects

par <- tmp1[grepl("Mom",tmp1$IDs),-1]
off <- tmp1[grepl("Offspring", tmp1$IDs),-1]

#first writes to file - default in SOLOMON
power.analysis(adults=par,offspring=off,loci_number=10,opt=1)
## Warning: appending column names to file
## Warning: appending column names to file
## [1] "Locus power.txt saved to current working directory"
## [1] "Expected Number of False Pairs.txt saved to current working directory"

#returns to console with option two
power.analysis(adults=par,offspring=off,loci_number=10,opt=2)
##   loci_number Expected_Number_of_False_Pairs  Pr_delta
## 1          10                       0.001206 6.031e-06

#loci_number can be more than one number
n <- power.analysis(adults=par,offspring=off,loci_number=9:10,opt=2)
n
##   loci_number Expected_Number_of_False_Pairs  Pr_delta
## 1           9                       0.004519 2.259e-05
## 2          10                       0.004519 2.259e-05

NEP.calc()

This function calculates non-exclusion probabilities for a single parent, second parent, or a parent pair randomly assigning to any given offspring by chance. It also calculates the probability of identity among other randomly selected individuals in the population, as well as among siblings. The output is identical to CERVUS. Arguments include:

1) tmp - allele frequencies from alf.freq(,opt=2, one.pop=T). Default - none
2) opt - determines what type of output is desired: Default - 1

NE_P1 - Non-exclusion probability for single parent (Jamieson and Taylor 1997)
NE_P2 - Non-exclusion probability for second parent (Jamieson and Taylor 1997)
NE_P3 – Non-exclusion probability for parent pair (Jamieson and Taylor 1997)
PrId - Non-exclusion probability for identity (Waits et al. 2001)
PrIdSid - Non-exclusion probability for identity among siblings (Waits et al. 2001)

#locus specific and overall non-exclusion powers stats
NEP.calc(tmp=d,opt=1)
##       Locus     NE_1P     NE_2P     NE_PP      PrId   PrIdSib
## 1         1 7.743e-01 8.730e-01 9.722e-01 7.484e-03 2.831e-01
## 2         2 7.730e-01 8.721e-01 9.719e-01 7.579e-03 2.834e-01
## 3         3 7.853e-01 8.799e-01 9.749e-01 6.713e-03 2.812e-01
## 4         4 7.868e-01 8.806e-01 9.755e-01 6.606e-03 2.812e-01
## 5         5 7.606e-01 8.636e-01 9.689e-01 8.544e-03 2.861e-01
## 6         6 7.711e-01 8.707e-01 9.715e-01 7.721e-03 2.839e-01
## 7         7 7.370e-01 8.484e-01 9.619e-01 1.054e-02 2.903e-01
## 8         8 7.685e-01 8.691e-01 9.708e-01 7.910e-03 2.843e-01
## 9         9 7.375e-01 8.489e-01 9.621e-01 1.043e-02 2.900e-01
## 10       10 7.929e-01 8.843e-01 9.769e-01 6.221e-03 2.802e-01
## 11 Combined 4.255e-07 1.426e-09 4.141e-16 8.976e-22 3.455e-06

#just overall stats
NEP.calc(tmp=d,opt=2)
##       Locus     NE_1P     NE_2P     NE_PP      PrId   PrIdSib
## 11 Combined 4.255e-07 1.426e-09 4.141e-16 8.976e-22 3.455e-06

fin!